We Luceners tend to be more practically oriented! :-)
For some reason, the application of Zipf's law comes to mind: you
could look at the most commonly occurring words and mathematically
deduce which ones are "too" common, though the cutoff may still be
difficult to choose. You will always have to balance losing some
information against shrinking the index, etc.
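To make the frequency-cutoff idea concrete, here's a quick untested
sketch in plain Java (the class name and the 0.4 threshold are just my
own illustration, nothing from Lucene): flag every term whose document
frequency exceeds a chosen fraction of the corpus.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StopWordGuesser {
    // Flag any term that occurs in more than maxDocRatio of the docs.
    public static Set<String> guessStopWords(Map<String, Integer> docFreqs,
                                             int numDocs, double maxDocRatio) {
        Set<String> stop = new HashSet<String>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if ((double) e.getValue() / numDocs > maxDocRatio) {
                stop.add(e.getKey());
            }
        }
        return stop;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<String, Integer>();
        df.put("the", 980);    // appears in 980 of 1000 docs
        df.put("lucene", 120);
        df.put("index", 450);
        // 0.4 is an arbitrary cutoff -- choosing it is exactly the hard part
        System.out.println(guessStopWords(df, 1000, 0.4));
        // prints e.g. [index, the] (set order may vary)
    }
}

Note how "index" gets flagged along with "the" at that threshold --
that's the information-loss tradeoff in action.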
A Google Scholar search for "zipf's law +stopwords" yields
http://ir.dcs.gla.ac.uk/terrier/publications/rtlo_DIRpaper.pdf, which
looks promising (though I admit I didn't read beyond the abstract), as
it has references to older approaches to the same task, plus a "new"
approach.
Good luck, and if you find something that works well, we would love to
have it contributed back!
-Grant
On May 11, 2007, at 1:53 AM, Lukas Vlcek wrote:
Hi,
Thanks for your comments!
I was thinking that there could be some method based on frequency and
linguistic research. So far it seems that a manually chosen set of
words is a very common approach, but this leaves some questions open
in my mind.
I am not a native English speaker, but I think this list
(http://www.ranks.nl/tools/stopwords.html) makes sense; for my native
language, though, this one (http://www.ranks.nl/stopwords/czech.html)
can be questionable in some cases (especially in the case of a
specific corpus).
What I am searching for is some automatic method of stop word
extraction based on a given set of documents. I don't expect such a
method to be 100% exact, but I would expect it to be ~good enough~.
I will try to search CiteSeer as well (I was hoping somebody could
give me some references of this kind).
Thanks!
Lukas
On 5/11/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
There is a handy class in contrib/misc.../ that will show you the most
frequent terms in an index. Handy dandy.
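If you'd rather not dig through contrib, a rough hand-rolled
equivalent against the 2.x IndexReader API would look something like
this (untested sketch; the TopTerms name and the cutoff of 25 are
mine):

import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TopTerms {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]); // path to the index
        final int n = 25; // how many terms to keep
        // min-heap on docFreq, so the least frequent kept term is
        // always cheap to evict
        PriorityQueue<Object[]> top = new PriorityQueue<Object[]>(n,
            new Comparator<Object[]>() {
                public int compare(Object[] a, Object[] b) {
                    return ((Integer) a[1]).compareTo((Integer) b[1]);
                }
            });
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term t = terms.term();
            top.add(new Object[] { t.field() + ":" + t.text(),
                                   Integer.valueOf(terms.docFreq()) });
            if (top.size() > n) {
                top.poll(); // drop the currently least frequent term
            }
        }
        terms.close();
        reader.close();
        // prints the survivors in ascending docFreq order
        while (!top.isEmpty()) {
            Object[] e = top.poll();
            System.out.println(e[1] + "\t" + e[0]);
        }
    }
}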
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
----- Original Message ----
From: Lukas Vlcek <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, May 10, 2007 2:39:35 PM
Subject: Stop words (how to create ideal set of stop words?)
Hi,
Can anybody point me to some references on how to create an ideal set
of stop words? I know that this is more of a theoretical question, but
how do Luceners determine which words should be excluded when creating
Analyzers for a new language? And which technique was used to validate
the stop word lists in the current Analyzers?
More specifically, I am interested in situations where there is a need
to build a search engine around a specific corpus (for example, when
we need to search only a set of articles related to programming
languages). Given a specific corpus, is there any recommended
technique for deriving stop words?
Thanks,
Lukas
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]