I haven’t removed stopwords since 1996, when I joined Infoseek. What is your special case where you must remove them?
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 22, 2019, at 9:51 PM, akash jayaweera <akash.jayawe...@gmail.com> wrote:
>
> Hello Walter,
>
> Thank you for the reply.
> But for some of my use-case I need to identify stopword. So I need a better
> way to identify domain specific stopwords. I used TF-IDF to identify
> stopwords. But it has the issue I mentioned above.
>
> Regards,
> *Akash Jayaweera.*
>
> E akash.jayawe...@gmail.com
> M + 94 77 2472635
>
> On Sun, Jun 23, 2019 at 10:13 AM Walter Underwood <wun...@wunderwood.org> wrote:
>
>> Don’t remove stopwords. That was a useful hack when we were running search
>> engines on 16-bit machines. These days, it causes more problems than it
>> solves.
>>
>> wunder
>>
>>> On Jun 22, 2019, at 8:14 PM, akash jayaweera <akash.jayawe...@gmail.com> wrote:
>>>
>>> Hello All,
>>> I'm trying to identify stopwords for a non-English corpus using TF-IDF
>>> score. I calculated the score for each unique term in the corpus. But my
>>> question is how can I select stopwords using the score.
>>> For example if we have a corpus of football, term "football" get the lowest
>>> TF-IDF score. But for my requirement I don't want to identify "football" as
>>> a stopword.
>>> How can I clearly Identify stopword. Is there any other simple method to
>>> identify stopwords than TF-IDF score.
>>>
>>> Regards,
>>> *Akash Jayaweera.*
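
The effect described in the original question can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the three-document corpus is invented, whitespace splitting stands in for a real analyzer, and IDF is taken as log(N / df) with no smoothing. The helper name `idf_scores` is hypothetical and does not come from any library.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Toy corpus, invented for illustration only.
docs = [
    "the football match ended in a draw".split(),
    "the striker left the football club".split(),
    "fans of the football team sang".split(),
]

scores = idf_scores(docs)

# Terms that occur in every document get IDF = log(1) = 0, the lowest score.
lowest = sorted(term for term, score in scores.items() if score == 0.0)
print(lowest)  # ['football', 'the']
```

Both terms occur in every document, so the IDF factor is zero for each of them and their TF-IDF scores are identical. The score alone cannot tell a function word from the topic word of the corpus, which is the issue raised in the question.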