It depends on why you want stopwords. Stopwords mattered more back in the day, when dropping them helped performance. Now, with a decent CPU and TF/IDF on your side, leaving common words in the index doesn't do much harm. In fact, avoiding stopwords can save the day:
q=to be or not to be

would not locate anything if we'd used stopwords. However:

q=jack and jill

will score docs that have "jack" or "jill", or preferably both, way above docs that just have "and".

If I needed stopwords, I'd do something like you suggested, then show the results to a native speaker and see what they think.

Upayavira

On Thu, Sep 10, 2015, at 03:16 PM, Imtiaz Shakil Siddique wrote:
> Hi Upayavira,
>
> Thank you for your kind assistance Sir.
> If that is the requirement for stemming then I will do it.
>
> My next question is how can I build a stopword list for Bengali language?
> The option that I've thought about are
>
> 1. Calculate idf values for all the stemmed words inside 20GB crawled data.
> 2. Find the words that have high inverse document frequency and mark them as stopwords.
>
> If you have any better solution then please help!
> Thank you Sir,
> Imtiaz Shakil Siddique
>
> On 10 September 2015 at 17:38, Upayavira <u...@odoko.co.uk> wrote:
>
> > I haven't heard of any machine learning based stemmers. I'm not really sure what algorithm you would use to do stemming - what you'd be looking for is something that says, well, running stemmed to run, walking stemmed to walk, therefore hopping should stem to hop, but that'd be quite an algorithm to develop, I'd say.
> >
> > There are a few ways you could handle this:
> >
> > 1) locate a Bengali linguist who can help you define an algorithm
> > 2) manually stem a large number of documents and use that as a basis for stemming
> >
> > If you had a stemmed corpus, you could simply use synonyms to do it. In English, you could map:
> >
> > run,running,runs,ran,runner=>run
> > walk,walked,walking,walker=>walk
> >
> > Then all you need to do is generate a synonym file and use the SynonymFilterFactory with it, in place of a stemmer.
> >
> > Would that work?
> >
> > Upayavira
> >
> > On Thu, Sep 10, 2015, at 09:59 AM, Imtiaz Shakil Siddique wrote:
> > > Thanks for the reply.
> > >
> > > Currently I have 20GB Bengali newspaper data (for corpus building).
> > > I don't have manual stemmed corpus but if needed I will build one.
> > >
> > > Basically I need guidance regarding how to do this.
> > > If there are some standard approaches of building stemmer and stopword for use with Solr then please share it.
> > >
> > > Thank you Upayavira for your kind help.
> > >
> > > Imtiaz Shakil Siddique
> > >
> > > On 10 September 2015 at 13:23, Upayavira <u...@odoko.co.uk> wrote:
> > >
> > > > On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > > > > Hi,
> > > > >
> > > > > I am trying to develop stemmer and stopword for Bengali language which is not shipped with Solr.
> > > > >
> > > > > I am trying to make this with machine learning approach but I couldn't find any good documents to study. It would be very helpful if you could shed some light on this matter.
> > > >
> > > > How are you going to do this with machine learning? What corpus are you going to use to learn from? Do you have some documents that have been manually stemmed for which you also have the originals?
> > > >
> > > > Upayavira
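
For the frequency-based stopword idea discussed in this thread, here is a rough sketch of how a candidate list could be pulled from a corpus. It counts document frequency, meaning the number of documents each term appears in. The most common words are the ones with the highest document frequency, which is the same as the lowest IDF. The function name `stopword_candidates`, the `min_doc_ratio` and `top_n` parameters, and the whitespace tokenization are all assumptions made for illustration. Real Bengali text would need proper tokenization, and stemming first if that is the plan.

```python
from collections import Counter


def stopword_candidates(docs, min_doc_ratio=0.8, top_n=50):
    """Return (term, document_frequency) pairs for terms that occur in
    at least min_doc_ratio of the documents, most frequent first."""
    doc_freq = Counter()
    for text in docs:
        # Count each term once per document, however often it repeats.
        # Whitespace splitting is a placeholder for a real tokenizer.
        doc_freq.update(set(text.split()))
    cutoff = min_doc_ratio * len(docs)
    common = [(term, df) for term, df in doc_freq.most_common() if df >= cutoff]
    return common[:top_n]


# Tiny English stand-in for the crawled corpus.
docs = [
    "jack and jill went up the hill",
    "jack and the beanstalk",
    "the cat and the hat",
    "to be or not to be",
]
for term, df in stopword_candidates(docs, min_doc_ratio=0.75):
    print(term, df)
```

The output is only a list of candidates. As suggested above, it would still need to go in front of a native speaker before any of it ends up in a stopword file.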