RE: Stemmer and stopword Development

Imtiaz Shakil Siddique Fri, 11 Sep 2015 06:46:34 -0700

Thank you all for your valuable advice.

For now I'll just stick with building a stemmer and testing the Solr search
results.


Imtiaz Shakil Siddique
On Sep 11, 2015 3:20 AM, "Davis, Daniel (NIH/NLM) [C]" <daniel.da...@nih.gov>
wrote:

> Stop words for international indexing do not seem very useful to me at
> this point. To use them, you definitely have to know what language you are
> in at all times, and that doesn't happen with unstructured data (e.g. a
> bunch of PDF/Word files that happen to be linked from a bunch of web
> pages). I'm currently working on something where I do have structured
> data, but diacritics show up in fields clearly identified as English, so
> structured data can also be messy.
>
> -----Original Message-----
> From: Doug Turnbull [mailto:dturnb...@opensourceconnections.com]
> Sent: Thursday, September 10, 2015 4:55 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Stemmer and stopword Development
>
> I've used stopwords to reduce the index size considerably to improve
> search performance (same with stemming, etc). For relevance I've often
> preferred to leave stop words in for the reasons Upayavira mentions.
> So many confusing things can happen with stopwords that sometimes
> they're not worth the trouble.
>
> One confusing example: taking stopwords out of the index interacts a bit
> unintuitively with min should match (mm):
> http://opensourceconnections.com/blog/2013/04/15/querying-more-fields-more-results-stop-wording-and-solrs-mm-min-should-match-argument/
>
> Cheers
> -Doug
>
>
>
>
> On Thu, Sep 10, 2015 at 4:50 PM, Upayavira <u...@odoko.co.uk> wrote:
>
> > It depends on why you want stopwords. Stopwords were an important
> > thing back in the day - they helped performance. Now, with a decent
> > CPU and TF/IDF on your side, they don't do much harm. In fact,
> > avoiding stopword removal can save the day:
> >
> > q=to be or not to be
> >
> > would not locate anything if we'd removed stopwords. However:
> >
> > q=jack and jill
> >
> > will score docs that have "jack" or "jill" or preferably both way
> > above docs that just have "and".
> >
> > If I needed stopwords, I'd do something like you suggested, then show
> > the results to a native speaker and see what they think.
> >
> > Upayavira
> >
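
The two queries above can be sketched with a toy example. This is only an
illustration under stated assumptions: the five documents and the stopword
list are made up, and the scoring is a simplified tf-idf (sqrt(tf) times a
smoothed idf), not Solr's actual scoring.

```python
import math

# Hypothetical toy corpus; each string stands in for one indexed document.
docs = [
    "jack and jill went up the hill",
    "jack fell down and broke his crown",
    "salt and pepper",
    "bread and butter",
    "to be or not to be",
]
tokenized = [d.split() for d in docs]

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"to", "be", "or", "not", "and", "the"}

def remove_stopwords(terms):
    # Drop every term that appears in the stopword list.
    return [t for t in terms if t not in STOPWORDS]

def idf(term):
    # Smoothed idf: terms found in fewer documents get a larger weight.
    df = sum(1 for toks in tokenized if term in toks)
    return 1.0 + math.log(len(tokenized) / (df + 1))

def score(query_terms, toks):
    # Simplified tf-idf: sum of sqrt(tf) * idf over the query terms.
    return sum(math.sqrt(toks.count(t)) * idf(t) for t in query_terms)

query = "jack and jill".split()
ranked = sorted(docs, key=lambda d: score(query, d.split()), reverse=True)
for d in ranked:
    print(round(score(query, d.split()), 2), d)

# With stopword removal, this query has no terms left to search for.
print(remove_stopwords("to be or not to be".split()))
```

In this sketch the document containing both "jack" and "jill" ranks first,
and documents that only match "and" fall well below it, because "and"
occurs in most documents and so carries little weight.
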
> > On Thu, Sep 10, 2015, at 03:16 PM, Imtiaz Shakil Siddique wrote:
> > > Hi Upayavira,
> > >
> > > Thank you for your kind assistance Sir.
> > > If that is the requirement for stemming then I will do it.
> > >
> > > My next question is how can I build a stopword list for the Bengali
> > > language?
> > > The approach that I've thought about is:
> > >
> > > 1. Calculate idf values for all the stemmed words inside 20GB
> > > crawled data.
> > > 2. Find the words that have low inverse document frequency (that is,
> > > words that occur in most documents) and mark them as stopwords.
> > >
> > > If you have any better solution then please help!
> > > Thank you Sir,
> > > Imtiaz Shakil Siddique
> > >
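
A minimal sketch of the frequency-based idea above, assuming whitespace
tokenization and an arbitrary cut-off. Note that words occurring in most
documents have a high document frequency, which corresponds to a low
inverse document frequency. The function name, the 0.8 threshold and the
sample sentences are hypothetical; real Bengali text would need a proper
tokenizer, and the stemmed forms rather than raw words.

```python
import math
from collections import Counter

def stopword_candidates(documents, min_doc_ratio=0.8):
    """Return (term, idf) pairs for terms that occur in at least
    min_doc_ratio of the documents, lowest idf first."""
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(text.split()))
    candidates = []
    for term, df in doc_freq.items():
        if df / n_docs >= min_doc_ratio:
            candidates.append((term, math.log(n_docs / df)))
    return sorted(candidates, key=lambda pair: (pair[1], pair[0]))

# Hypothetical sample documents, for illustration only.
sample = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "a bird saw the cat",
    "the end",
]
print(stopword_candidates(sample))
```

The output is only a candidate list. The threshold is a tuning choice, and
the result would still need review by a native speaker.
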
> > >
> > > On 10 September 2015 at 17:38, Upayavira <u...@odoko.co.uk> wrote:
> > >
> > > > I haven't heard of any machine learning based stemmers. I'm not
> > > > really sure what algorithm you would use to do stemming. What
> > > > you'd be looking for is something that says: "running" stemmed to
> > > > "run", "walking" stemmed to "walk", therefore "hopping" should stem
> > > > to "hop". But that'd be quite an algorithm to develop, I'd say.
> > > >
> > > > There are a few ways you could handle this:
> > > >
> > > > 1) locate a Bengali linguist who can help you define an algorithm
> > > > 2) manually stem a large number of documents and use that as a
> > > > basis for stemming
> > > >
> > > > If you had a stemmed corpus, you could simply use synonyms to do
> > > > it. In English, for example, you could map:
> > > >
> > > > run,running,runs,ran,runner=>run
> > > > walk,walked,walking,walker=>walk
> > > >
> > > > Then all you need to do is generate a synonym file and use the
> > > > SynonymFilterFactory with it, in place of a stemmer.
> > > >
> > > > Would that work?
> > > >
> > > > Upayavira
> > > >
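
Generating such a synonym file from a manually stemmed word list could be
sketched as follows. This assumes the stemmed corpus has already been
reduced to word/stem pairs; the function name and the sample pairs are
hypothetical, and the output follows the "a,b,c=>stem" mapping form shown
above.

```python
from collections import defaultdict

def build_synonym_lines(word_to_stem):
    """Group surface forms by stem and emit one 'forms=>stem' line per stem."""
    groups = defaultdict(set)
    for word, stem in word_to_stem.items():
        groups[stem].add(word)
    lines = []
    for stem in sorted(groups):
        # Include the stem itself so it also maps to the canonical form.
        variants = sorted(groups[stem] | {stem})
        lines.append(",".join(variants) + "=>" + stem)
    return lines

# Hypothetical word/stem pairs taken from a manually stemmed corpus.
pairs = {
    "running": "run",
    "runs": "run",
    "ran": "run",
    "walked": "walk",
    "walking": "walk",
}
lines = build_synonym_lines(pairs)
print("\n".join(lines))
```

The resulting lines can be written to a text file and referenced from the
SynonymFilterFactory configuration.
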
> > > > On Thu, Sep 10, 2015, at 09:59 AM, Imtiaz Shakil Siddique wrote:
> > > > > Thanks for the reply.
> > > > >
> > > > > Currently I have 20GB of Bengali newspaper data (for corpus
> > > > > building). I don't have a manually stemmed corpus, but if needed
> > > > > I will build one.
> > > > >
> > > > > Basically I need guidance regarding how to do this.
> > > > > If there are some standard approaches to building a stemmer and
> > > > > stopword list for use with Solr, then please share them.
> > > > >
> > > > > Thank you Upayavira for your kind help.
> > > > >
> > > > > Imtiaz Shakil Siddique
> > > > >
> > > > >
> > > > > On 10 September 2015 at 13:23, Upayavira <u...@odoko.co.uk> wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am trying to develop a stemmer and stopword list for the
> > > > > > > Bengali language, which is not shipped with Solr.
> > > > > > >
> > > > > > > I am trying to do this with a machine learning approach, but I
> > > > > > > couldn't find any good documents to study. It would be very
> > > > > > > helpful if you could shed some light on this matter.
> > > > > >
> > > > > > How are you going to do this with machine learning? What
> > > > > > corpus are you going to use to learn from? Do you have some
> > > > > > documents that have been manually stemmed for which you also
> > > > > > have the originals?
> > > > > >
> > > > > > Upayavira
> > > > > >
> > > >
> >
>
>
>
> --
> Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC
> <http://opensourceconnections.com> | 240.476.9983
> Author: Relevant Search <http://manning.com/turnbull>
>
