Thank you all for your valuable advice. For now I'll stick with building a stemmer and testing the Solr search results.
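For reference, the synonym-file route Upayavira suggests further down the thread could be sketched roughly as below. This is only a minimal sketch under assumptions: the function name `build_synonym_lines`, the `word_to_stem` dictionary, and the sample pairs are hypothetical, and the English words are borrowed from the example in the thread rather than from any Bengali data. It groups hand-stemmed word forms by stem and emits lines in the `a,b,c=>stem` shape shown in that example.

```python
from collections import defaultdict


def build_synonym_lines(word_to_stem):
    """Group surface forms by their stem and return 'a,b,c=>stem' lines."""
    groups = defaultdict(set)
    for word, stem in word_to_stem.items():
        groups[stem].add(word)

    lines = []
    for stem in sorted(groups):
        # Include the stem itself on the left-hand side so it maps to itself.
        forms = sorted(groups[stem] | {stem})
        lines.append(",".join(forms) + "=>" + stem)
    return lines


# Hypothetical hand-stemmed pairs (surface form -> stem); a real list would
# come from a manually stemmed corpus.
pairs = {
    "running": "run", "runs": "run", "ran": "run", "runner": "run",
    "walked": "walk", "walking": "walk", "walker": "walk",
}

synonym_lines = build_synonym_lines(pairs)
print("\n".join(synonym_lines))
```

The printed lines could then be saved (as UTF-8, which matters for Bengali text) to the synonym file that SynonymFilterFactory is pointed at. The schema wiring is not shown here, and whether this scales to a 20GB corpus is untested.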
Imtiaz Shakil Siddique

On Sep 11, 2015 3:20 AM, "Davis, Daniel (NIH/NLM) [C]" <daniel.da...@nih.gov> wrote:

> Stop words for international indexing do not seem very useful to me at
> this point. To use them, you definitely have to know what language you
> are in at all times, and that doesn't happen with unstructured data
> (e.g. a bunch of PDF/Word files that happen to be linked from a bunch of
> web pages). I'm currently working on something where I do have
> structured data, but diacritics show up in fields clearly identified as
> English - structured data can be messy too.
>
> -----Original Message-----
> From: Doug Turnbull [mailto:dturnb...@opensourceconnections.com]
> Sent: Thursday, September 10, 2015 4:55 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Stemmer and stopword Development
>
> I've used stopwords to reduce the index size considerably and improve
> search performance (same with stemming, etc.). For relevance I've often
> preferred to leave stop words in for the reasons Upayavira mentions.
> There are all kinds of confusing things that can happen with stopwords,
> so sometimes they're not worth the trouble.
>
> One example of something confusing that happens when you take stopwords
> out of the index: it interacts a bit unintuitively with min should match
> (mm):
> http://opensourceconnections.com/blog/2013/04/15/querying-more-fields-more-results-stop-wording-and-solrs-mm-min-should-match-argument/
>
> Cheers
> -Doug
>
> On Thu, Sep 10, 2015 at 4:50 PM, Upayavira <u...@odoko.co.uk> wrote:
>
> > It depends on why you want stopwords. Stopwords were an important
> > thing back in the day - they helped performance. Now, with a decent
> > CPU and TF/IDF on your side, leaving those words in the index doesn't
> > do much harm. In fact, not removing them can save the day:
> >
> > q=to be or not to be
> >
> > would not locate anything if we'd used a stopword list.
> > However:
> >
> > q=jack and jill
> >
> > will score docs that have "jack" or "jill", or preferably both, way
> > above docs that just have "and".
> >
> > If I needed stopwords, I'd do something like you suggested, then show
> > the results to a native speaker and see what they think.
> >
> > Upayavira
> >
> > On Thu, Sep 10, 2015, at 03:16 PM, Imtiaz Shakil Siddique wrote:
> > > Hi Upayavira,
> > >
> > > Thank you for your kind assistance, Sir.
> > > If that is the requirement for stemming then I will do it.
> > >
> > > My next question is: how can I build a stopword list for the Bengali
> > > language? The approach I've thought about is:
> > >
> > > 1. Calculate idf values for all the stemmed words inside the 20GB
> > >    of crawled data.
> > > 2. Find the words that have high document frequency (that is, low
> > >    inverse document frequency) and mark them as stopwords.
> > >
> > > If you have any better solution then please help!
> > > Thank you Sir,
> > > Imtiaz Shakil Siddique
> > >
> > >
> > > On 10 September 2015 at 17:38, Upayavira <u...@odoko.co.uk> wrote:
> > >
> > > > I haven't heard of any machine learning based stemmers. I'm not
> > > > really sure what algorithm you would use to do stemming - what
> > > > you'd be looking for is something that says, well, running
> > > > stemmed to run, walking stemmed to walk, therefore hopping should
> > > > stem to hop, but that'd be quite an algorithm to develop, I'd say.
> > > >
> > > > There are a few ways you could handle this:
> > > >
> > > > 1) locate a Bengali linguist who can help you define an algorithm
> > > > 2) manually stem a large number of documents and use that as a
> > > >    basis for stemming
> > > >
> > > > If you had a stemmed corpus, you could simply use synonyms to do
> > > > it. In English, you could map:
> > > >
> > > > run,running,runs,ran,runner=>run
> > > > walk,walked,walking,walker=>walk
> > > >
> > > > Then all you need to do is generate a synonym file and use the
> > > > SynonymFilterFactory with it, in place of a stemmer.
> > > >
> > > > Would that work?
> > > >
> > > > Upayavira
> > > >
> > > > On Thu, Sep 10, 2015, at 09:59 AM, Imtiaz Shakil Siddique wrote:
> > > > > Thanks for the reply.
> > > > >
> > > > > Currently I have 20GB of Bengali newspaper data (for corpus
> > > > > building). I don't have a manually stemmed corpus, but if needed
> > > > > I will build one.
> > > > >
> > > > > Basically I need guidance on how to do this.
> > > > > If there are some standard approaches to building a stemmer and
> > > > > a stopword list for use with Solr, then please share them.
> > > > >
> > > > > Thank you Upayavira for your kind help.
> > > > >
> > > > > Imtiaz Shakil Siddique
> > > > >
> > > > >
> > > > > On 10 September 2015 at 13:23, Upayavira <u...@odoko.co.uk> wrote:
> > > > >
> > > > > > On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am trying to develop a stemmer and a stopword list for the
> > > > > > > Bengali language, which is not shipped with Solr.
> > > > > > >
> > > > > > > I am trying to do this with a machine learning approach but I
> > > > > > > couldn't find any good documents to study. It would be very
> > > > > > > helpful if you could shed some light on this matter.
> > > > > >
> > > > > > How are you going to do this with machine learning? What
> > > > > > corpus are you going to use to learn from? Do you have some
> > > > > > documents that have been manually stemmed for which you also
> > > > > > have the originals?
> > > > > >
> > > > > > Upayavira
>
> --
> Doug Turnbull | Search Relevance Consultant | OpenSource Connections
> <http://opensourceconnections.com>, LLC | 240.476.9983
> Author: Relevant Search <http://manning.com/turnbull>
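
The stopword idea discussed in the thread (rank words by how many documents they occur in, then have a native speaker review the result) could be sketched roughly as follows. This is a minimal sketch under assumptions, not a tested pipeline: the name `stopword_candidates`, the `min_doc_fraction` threshold, whitespace tokenization, and the tiny English sample corpus are all hypothetical stand-ins. Note that words present in most documents have a high document frequency and therefore a low IDF, so the candidates are the low-IDF terms.

```python
import math
from collections import Counter


def stopword_candidates(documents, min_doc_fraction=0.8):
    """Return (term, idf) pairs for terms found in at least
    min_doc_fraction of the documents, lowest IDF first."""
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document: document frequency,
        # not raw term frequency. Whitespace splitting is an assumption.
        doc_freq.update(set(text.split()))

    candidates = []
    for term, df in doc_freq.items():
        if df / n_docs >= min_doc_fraction:
            idf = math.log(n_docs / df)
            candidates.append((term, idf))

    # Most widespread terms (lowest IDF) first; ties broken alphabetically.
    return sorted(candidates, key=lambda pair: (pair[1], pair[0]))


# Hypothetical toy corpus standing in for the crawled newspaper data.
docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "a bird sang in the tree",
    "the sun set on the hill",
]

candidates = stopword_candidates(docs, min_doc_fraction=0.75)
print(candidates)
```

On real data the threshold would need tuning, and the output is only a candidate list: as suggested in the thread, it is worth showing it to a native speaker before using it as a stopword file.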