RE: Stemmer and stopword Development

Davis, Daniel (NIH/NLM) [C] Thu, 10 Sep 2015 14:21:20 -0700

Stop words for international indexing don't seem very useful to me at this 
point. To use them, you have to know what language you are dealing with at 
all times, and that doesn't happen with unstructured data (e.g. a bunch of 
PDF/Word files that happen to be linked from a bunch of web pages). I'm 
currently working on something where I do have structured data, but 
diacritics show up in fields clearly identified as English; structured data 
can be messy too.


-----Original Message-----
From: Doug Turnbull [mailto:dturnb...@opensourceconnections.com] 
Sent: Thursday, September 10, 2015 4:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Stemmer and stopword Development

I've used stopwords to reduce the index size considerably to improve search 
performance (same with stemming, etc). For relevance I've often preferred to 
leave stop words in for the reasons Upayavira mentions. So many confusing 
things can happen with stopwords that sometimes they're not worth the 
trouble.

One example of something confusing that happens when you take stopwords out 
of the index: it interacts a bit unintuitively with min should match (mm):
http://opensourceconnections.com/blog/2013/04/15/querying-more-fields-more-results-stop-wording-and-solrs-mm-min-should-match-argument/

Cheers
-Doug




On Thu, Sep 10, 2015 at 4:50 PM, Upayavira <u...@odoko.co.uk> wrote:

> It depends on why you want stopwords. Stopwords were an important 
> thing back in the day - they helped performance. Now, with a decent 
> CPU and TF/IDF on your side, leaving those words in the index doesn't 
> do much harm. In fact, not removing them can save the day:
>
> q=to be or not to be
>
> would not locate anything if we'd used stopwords. However:
>
> q=jack and jill
>
> will score docs that have "jack" or "jill" or preferably both way 
> above docs that just have "and".
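
A minimal sketch of the first point above, assuming a naive whitespace tokenizer and a tiny hand-made stopword list (the STOPWORDS set and remove_stopwords function are hypothetical, for illustration only, and are not Solr's own list or analysis chain). It shows how stopword removal can leave a query with no terms at all:

```python
# Hypothetical, tiny English stopword list for illustration only;
# this is an assumption, not the list that ships with Solr.
STOPWORDS = {"to", "be", "or", "not", "and", "the", "a", "of"}


def remove_stopwords(query):
    """Lowercase the query, split on whitespace, and drop stopword tokens."""
    return [tok for tok in query.lower().split() if tok not in STOPWORDS]


# Every token in the first query is a stopword, so nothing is left to search for.
print(remove_stopwords("to be or not to be"))
# In the second query only "and" is dropped; "jack" and "jill" survive.
print(remove_stopwords("jack and jill"))
```

If the stopwords are kept in the index instead, the first query still has terms to match, and TF/IDF takes care of down-weighting common words like "and" in the second.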
>
> If I needed stopwords, I'd do something like you suggested, then show 
> the results to a native speaker and see what they think.
>
> Upayavira
>
> On Thu, Sep 10, 2015, at 03:16 PM, Imtiaz Shakil Siddique wrote:
> > Hi Upayavira,
> >
> > Thank you for your kind assistance Sir.
> > If that is the requirement for stemming then I will do it.
> >
> > My next question is how can I build a stopword list for the Bengali language?
> > The approach I've thought about is:
> >
> > 1. Calculate idf values for all the stemmed words inside 20GB 
> > crawled data.
> > 2. Find the words that have low inverse document frequency (that is, 
> > words that appear in most documents) and mark them as stopwords.
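
A rough sketch of the two steps above, assuming whitespace tokenization and a toy English corpus standing in for the crawled data. Words that appear in most documents have a high document frequency and therefore a low idf, so the sketch flags terms whose idf falls at or below a threshold. The function names, the max_idf cutoff and the sample documents are all hypothetical; a real Bengali corpus would need proper tokenization and a cutoff tuned by inspecting the output.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def stopword_candidates(docs, max_idf=0.5):
    """Return terms whose idf = ln(N / df) is at or below max_idf.

    Results are sorted by idf, then alphabetically. max_idf is an
    arbitrary illustrative threshold, not a recommended value.
    """
    n = len(docs)
    df = document_frequencies(docs)
    idf = {term: math.log(n / count) for term, count in df.items()}
    candidates = [term for term, value in idf.items() if value <= max_idf]
    return sorted(candidates, key=lambda term: (idf[term], term))


# Toy corpus (an assumption for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird and the dog",
    "the fish swam",
]
print(stopword_candidates(docs))
```

The output is only a list of candidates; it would still make sense to have a native speaker review it before using it as a stopword file.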
> >
> > If you have any better solution then please help!
> > Thank you Sir,
> > Imtiaz Shakil Siddique
> >
> >
> > On 10 September 2015 at 17:38, Upayavira <u...@odoko.co.uk> wrote:
> >
> > > I haven't heard of any machine learning based stemmers. I'm not 
> > > really sure what algorithm you would use to do stemming - what 
> > > you'd be looking for is something that says, well, running stemmed 
> > > to run, walking stemmed to walk, therefore hopping should stem to 
> > > hop, but that'd be quite an algorithm to develop, I'd say.
> > >
> > > There are a few ways you could handle this:
> > >
> > > 1) locate a Bengali linguist who can help you define an algorithm
> > > 2) manually stem a large number of documents and use that as a 
> > > basis for stemming
> > >
> > > If you had a stemmed corpus, you could simply use synonyms to do 
> > > it. In English, you could map:
> > >
> > > run,running,runs,ran,runner=>run
> > > walk,walked,walking,walker=>walk
> > >
> > > Then all you need to do is generate a synonym file and use the 
> > > SynonymFilterFactory with it, in place of a stemmer.
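
A small sketch of generating such a synonym file, assuming the manually stemmed corpus has already been reduced to a mapping from each stem to its observed variants. The synonym_lines function and the stems dictionary are hypothetical names for illustration; the "a,b,c=>stem" line format is the explicit-mapping syntax shown in the example above.

```python
def synonym_lines(stem_to_variants):
    """Build one 'stem,variant1,variant2=>stem' line per stem."""
    lines = []
    for stem, variants in sorted(stem_to_variants.items()):
        # Put the stem first on the left-hand side so it also maps to itself.
        terms = [stem] + [v for v in variants if v != stem]
        lines.append(",".join(terms) + "=>" + stem)
    return lines


# Hypothetical stem-to-variants mapping, e.g. derived from a stemmed corpus.
stems = {
    "run": ["running", "runs", "ran", "runner"],
    "walk": ["walked", "walking", "walker"],
}

for line in synonym_lines(stems):
    print(line)
```

The printed lines can be written to a text file and referenced from the synonym filter in the field's analysis chain.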
> > >
> > > Would that work?
> > >
> > > Upayavira
> > >
> > > On Thu, Sep 10, 2015, at 09:59 AM, Imtiaz Shakil Siddique wrote:
> > > > Thanks for the reply.
> > > >
> > > > Currently I have 20GB of Bengali newspaper data (for corpus 
> > > > building). I don't have a manually stemmed corpus, but if needed 
> > > > I will build one.
> > > >
> > > > Basically I need guidance regarding how to do this.
> > > > If there are standard approaches to building a stemmer and a 
> > > > stopword list for use with Solr, then please share them.
> > > >
> > > > Thank you Upayavira for your kind help.
> > > >
> > > > Imtiaz Shakil Siddique
> > > >
> > > >
> > > > On 10 September 2015 at 13:23, Upayavira <u...@odoko.co.uk> wrote:
> > > >
> > > > >
> > > > >
> > > > > On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am trying to develop a stemmer and a stopword list for the 
> > > > > > Bengali language, which are not shipped with Solr.
> > > > > >
> > > > > > I am trying to do this with a machine learning approach, but 
> > > > > > I couldn't find any good documents to study. It would be very 
> > > > > > helpful if you could shed some light on this matter.
> > > > >
> > > > > How are you going to do this with machine learning? What 
> > > > > corpus are you going to use to learn from? Do you have some 
> > > > > documents that have been manually stemmed for which you also 
> > > > > have the originals?
> > > > >
> > > > > Upayavira
> > > > >
> > >
>



--
Doug Turnbull | Search Relevance Consultant | OpenSource Connections 
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of 
whether attachments are marked as such.
