It depends on why you want stopwords. Stopword removal was an important
thing back in the day - it helped performance. Now, with a decent CPU
and TF/IDF on your side, leaving those words in the index doesn't do
much harm. In fact, skipping stopword removal can save the day:

q=to be or not to be

would not match anything if we'd used stopwords, because every term in
it would be stripped out. Whereas, with no stopword list:

q=jack and jill

will score docs that have "jack" or "jill" or preferably both way above
docs that just have "and".
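
To make that concrete, here is a minimal sketch of tf*idf scoring in
Python. It is not Lucene's actual similarity formula, just a smoothed
idf of my own choosing, and the four-document corpus is made up for
illustration:

```python
import math

# Toy corpus (invented for this example): "and" occurs in every document.
docs = [
    "jack and jill went up the hill",
    "salt and pepper",
    "bread and butter",
    "jill and the beanstalk",
]
tokenized = [d.split() for d in docs]


def idf(term):
    # Smoothed inverse document frequency: the rarer the term, the larger
    # the weight. A term found in every document gets the minimum weight.
    df = sum(1 for toks in tokenized if term in toks)
    return math.log((len(tokenized) + 1) / (df + 1)) + 1.0


def score(query, toks):
    # Sum of (term frequency * idf) over the query terms.
    return sum(toks.count(t) * idf(t) for t in query.split())


scores = [score("jack and jill", toks) for toks in tokenized]
# Document indexes, best match first.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here the doc with both "jack" and "jill" comes first, the doc with only
"jill" second, and the docs that match only on "and" trail behind,
because "and" carries the lowest idf in the corpus.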

If I needed stopwords, I'd do something like you suggested, then show
the results to a native speaker and see what they think.
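
A rough sketch of that first pass, assuming whitespace tokenization and
one document per string. The function name, the 0.8 threshold and the
English sample texts are all placeholders of mine; for Bengali you would
plug in your own tokenizer and tune the cutoff. Note that stopword
candidates are the terms with high document frequency, which means low
idf:

```python
from collections import Counter


def stopword_candidates(documents, min_doc_ratio=0.8):
    # Count, for each term, how many documents contain it at least once.
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(text.split()))
    n = len(documents)
    # Keep the terms that appear in at least min_doc_ratio of all documents.
    return sorted(t for t, df in doc_freq.items() if df / n >= min_doc_ratio)


# Placeholder documents, invented for this example.
sample = [
    "the cat and the dog",
    "the bird and the fish",
    "the sun",
    "rain and the wind",
    "the moon and stars",
]
candidates = stopword_candidates(sample)
```

The output is only a candidate list. It still needs the native-speaker
review, since frequent words are not always safe to drop.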

Upayavira

On Thu, Sep 10, 2015, at 03:16 PM, Imtiaz Shakil Siddique wrote:
> Hi Upayavira,
> 
> Thank you for your kind assistance Sir.
> If that is the requirement for stemming then I will do it.
> 
> My next question is how can I build a stopword list for Bengali language?
> The option that I've thought about are
> 
> 1. Calculate idf values for all the stemmed words inside 20GB crawled
> data.
> 2. Find the words that have low inverse document frequency (that is,
> words that occur in most documents) and mark them as stopwords.
> 
> If you have any better solution then please help!
> Thank you Sir,
> Imtiaz Shakil Siddique
> 
> 
> On 10 September 2015 at 17:38, Upayavira <u...@odoko.co.uk> wrote:
> 
> > I haven't heard of any machine learning based stemmers. I'm not really
> > sure what algorithm you would use to do stemming - what you'd be looking
> > for is something that says, well, running stemmed to run, walking
> > stemmed to walk, therefore hopping should stem to hop, but that'd be
> > quite an algorithm to develop, I'd say.
> >
> > There are a few ways you could handle this:
> >
> > 1) locate a Bengali linguist who can help you define an algorithm
> > 2) manually stem a large number of documents and use that as a basis
> > for stemming
> >
> > If you had a stemmed corpus, you could simply use synonyms to do it, in
> > English, you could map:
> >
> > run,running,runs,ran,runner=>run
> > walk,walked,walking,walker=>walk
> >
> > Then all you need to do is generate a synonym file and use the
> > SynonymFilterFactory with it, in place of a stemmer.
> >
> > Would that work?
> >
> > Upayavira
> >
> > On Thu, Sep 10, 2015, at 09:59 AM, Imtiaz Shakil Siddique wrote:
> > > Thanks for the reply.
> > >
> > > Currently I have 20GB Bengali newspaper data ( for corpus building )
> > > I don't have manual stemmed corpus but if needed I will build one.
> > >
> > > Basically I need guidance regarding how to do this.
> > > > If there are some standard approaches to building a stemmer and a
> > > > stopword list for use with Solr, then please share them.
> > >
> > > Thank you Upayavira for your kind help.
> > >
> > > Imtiaz Shakil Siddique
> > >
> > >
> > > On 10 September 2015 at 13:23, Upayavira <u...@odoko.co.uk> wrote:
> > >
> > > >
> > > >
> > > > On Thu, Sep 10, 2015, at 04:45 AM, Imtiaz Shakil Siddique wrote:
> > > > > Hi,
> > > > >
> > > > > > I am trying to develop a stemmer and a stopword list for the
> > > > > > Bengali language, which is not shipped with Solr.
> > > > > >
> > > > > > I am trying to do this with a machine learning approach, but I
> > > > > > couldn't find any good documents to study. It would be very
> > > > > > helpful if you could shed some light on this matter.
> > > >
> > > > How are you going to do this with machine learning? What corpus are you
> > > > going to use to learn from? Do you have some documents that have been
> > > > manually stemmed for which you also have the originals?
> > > >
> > > > Upayavira
> > > >
> >
