RE: Stemmer and stopword Development

2015-09-11 Thread Imtiaz Shakil Siddique
> I've used stopwords to reduce the index size considerably to improve search performance (same with stemming, etc). For relevance I've often preferred to leave stop words in for the reaso

Re: Stemmer and stopword Development

2015-09-10 Thread Upayavira
On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote: > Hi, > > I am trying to develop a stemmer and a stopword list for the Bengali language, which is > not shipped with Solr. > > I am trying to do this with a machine learning approach, but I couldn't find > any good documents to study. It

Re: Stemmer and stopword Development

2015-09-10 Thread Imtiaz Shakil Siddique
Thanks for the reply. Currently I have 20GB of Bengali newspaper data (for corpus building). I don't have a manually stemmed corpus, but if needed I will build one. Basically I need guidance on how to do this. If there are some standard approaches of building stemmer and stopword for use with

Re: Stemmer and stopword Development

2015-09-10 Thread Upayavira
I haven't heard of any machine learning based stemmers. I'm not really sure what algorithm you would use to do stemming - what you'd be looking for is something that says, well, running stemmed to run, walking stemmed to walk, therefore hopping should stem to hop, but that'd be quite an algorithm

Re: Stemmer and stopword Development

2015-09-10 Thread Imtiaz Shakil Siddique
Hi Upayavira, Thank you for your kind assistance, Sir. If that is the requirement for stemming then I will do it. My next question is: how can I build a stopword list for the Bengali language? The options that I've thought about are 1. Calculate idf values for all the stemmed words inside 20GB crawled
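The IDF idea in option 1 can be sketched roughly as below. This is only an illustration under stated assumptions, not a method proposed in the thread: documents are assumed to be already tokenized (and stemmed, if desired), the function names and the `max_idf` cutoff are hypothetical, and the English tokens are stand-ins for Bengali ones. Terms that occur in nearly every document get an IDF close to zero, which makes them stopword candidates for manual review.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def stopword_candidates(docs, max_idf=0.5):
    """Return terms whose IDF is at or below max_idf (hypothetical cutoff).

    IDF here is the plain log(N / df) form; other variants exist.
    """
    n_docs = len(docs)
    df = document_frequencies(docs)
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return sorted(term for term, value in idf.items() if value <= max_idf)


# Toy corpus standing in for the tokenized newspaper articles.
docs = [
    ["the", "cat"],
    ["the", "dog"],
    ["the", "bird", "a"],
    ["a", "fish", "the"],
]
candidates = stopword_candidates(docs, max_idf=0.5)
```

On a real corpus the cutoff would need tuning, and the resulting list would still need a human pass, since a low IDF only shows that a term is common, not that it is safe to drop.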

Re: Stemmer and stopword Development

2015-09-10 Thread Upayavira
It depends on why you want stopwords. Stopwords were an important thing back in the day - they helped performance. Now, with a decent CPU and TF/IDF on your side, leaving those words in the index doesn't do much harm. In fact, not removing them can save the day: q=to be or not to be would not locate anything if we'd used
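The "to be or not to be" case can be illustrated with a toy filter. This is a sketch only: it is not Solr's analysis chain, and the `STOPWORDS` set and `filter_stopwords` helper are hypothetical. It shows that when every query term is on the stopword list, nothing is left to search for.

```python
# Hypothetical stopword list; a real list would be much longer.
STOPWORDS = {"to", "be", "or", "not"}


def filter_stopwords(query, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop any term in the stopword set."""
    return [term for term in query.lower().split() if term not in stopwords]


remaining = filter_stopwords("to be or not to be")
mixed = filter_stopwords("To be a stemmer")
```

Here `remaining` is an empty list, so the query carries no searchable terms, while `mixed` keeps only the terms that are not on the list.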

Re: Stemmer and stopword Development

2015-09-10 Thread Doug Turnbull
I've used stopwords to reduce the index size considerably to improve search performance (same with stemming, etc). For relevance I've often preferred to leave stop words in for the reasons Upayavira mentions. There are all kinds of confusing things that can happen with stopwords that sometimes

RE: Stemmer and stopword Development

2015-09-10 Thread Davis, Daniel (NIH/NLM) [C]
4:55 PM To: solr-user@lucene.apache.org Subject: Re: Stemmer and stopword Development I've used stopwords to reduce the index size considerably to improve search performance (same with stemming, etc). For relevance I've often preferred to leave stop words in for the reasons Upayavira mentions

Stemmer and stopword Development

2015-09-09 Thread Imtiaz Shakil Siddique
Hi, I am trying to develop a stemmer and a stopword list for the Bengali language, which is not shipped with Solr. I am trying to do this with a machine learning approach, but I couldn't find any good documents to study. It would be very helpful if you could shed some light on this matter. Thank you so