We Luceners tend to be more practically oriented! :-)
For some reason, the application of Zipf's law comes to mind: you
could look at the most commonly occurring words and mathematically
deduce which ones are "too" common, though the cutoff may still be
difficult to choose. You will always have to balance losing some
information against shrinking the index, etc.
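To make the frequency-cutoff idea concrete, here's a quick untested
sketch in plain Java (the class name and the 0.4 threshold are just my
own illustration, nothing from Lucene): flag every term whose document
frequency exceeds a chosen fraction of the corpus.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StopWordGuesser {
    // Flag any term that occurs in more than maxDocRatio of the docs.
    public static Set<String> guessStopWords(Map<String, Integer> docFreqs,
                                             int numDocs, double maxDocRatio) {
        Set<String> stop = new HashSet<String>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if ((double) e.getValue() / numDocs > maxDocRatio) {
                stop.add(e.getKey());
            }
        }
        return stop;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<String, Integer>();
        df.put("the", 980);    // appears in 980 of 1000 docs
        df.put("lucene", 120);
        df.put("index", 450);
        // 0.4 is an arbitrary cutoff -- choosing it is exactly the hard part
        System.out.println(guessStopWords(df, 1000, 0.4));
        // prints e.g. [index, the] (set order may vary)
    }
}

Note how "index" gets flagged along with "the" at that threshold --
that's the information-loss tradeoff in action.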
A Google Scholar search for "zipf's law +stopwords" yields
http://ir.dcs.gla.ac.uk/terrier/publications/rtlo_DIRpaper.pdf, which
looks promising (though I admit I didn't read beyond the abstract), as
it has references to older approaches to the same task, plus a "new"
approach.
Good luck, and if you find something that works well, we would love to
have it contributed back!
-Grant
On May 11, 2007, at 1:53 AM, Lukas Vlcek wrote:
Hi,
Thanks for your comments!
I was thinking that there could be some method based on frequency and
linguistic research. So far it seems that a manually chosen set of
words is a very common approach, but this leaves some questions open
in my mind.
I am not a native English speaker, but I think this list
(http://www.ranks.nl/tools/stopwords.html) makes sense; for my native
language, though, this one (http://www.ranks.nl/stopwords/czech.html)
can be questionable in some cases (especially in the case of a
specific corpus).
What I am searching for is some automatic method of stop word
extraction based on a given set of documents. I don't expect such a
method to be 100% exact, but I would expect it to be ~good enough~.
I will try to search CiteSeer as well (I was hoping somebody could
give me some references of this kind).
Thanks!
Lukas
On 5/11/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
There is a handy class in contrib/misc.../ that will show you the most
frequent terms in an index. Handy dandy.
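If you'd rather not dig through contrib, a rough hand-rolled
equivalent against the 2.x IndexReader API would look something like
this (untested sketch; the TopTerms name and the cutoff of 25 are
mine):

import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TopTerms {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]); // path to the index
        final int n = 25; // how many terms to keep
        // min-heap on docFreq, so the least frequent kept term is
        // always cheap to evict
        PriorityQueue<Object[]> top = new PriorityQueue<Object[]>(n,
            new Comparator<Object[]>() {
                public int compare(Object[] a, Object[] b) {
                    return ((Integer) a[1]).compareTo((Integer) b[1]);
                }
            });
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term t = terms.term();
            top.add(new Object[] { t.field() + ":" + t.text(),
                                   Integer.valueOf(terms.docFreq()) });
            if (top.size() > n) {
                top.poll(); // drop the currently least frequent term
            }
        }
        terms.close();
        reader.close();
        // prints the survivors in ascending docFreq order
        while (!top.isEmpty()) {
            Object[] e = top.poll();
            System.out.println(e[1] + "\t" + e[0]);
        }
    }
}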
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
----- Original Message ----
From: Lukas Vlcek <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, May 10, 2007 2:39:35 PM
Subject: Stop words (how to create ideal set of stop words?)
Hi,
Can anybody point me to some references on how to create an ideal set
of stop words? I know that this is more of a theoretical question, but
how do Luceners determine which words should be excluded when creating
Analyzers for a new language? And which technique was used to validate
the stop word lists in the current Analyzers?
More specifically, I am interested in situations where there is a need
to build a search engine around a specific corpus (for example, when
we need to search only a set of articles related to programming
languages). Given a specific corpus, is there any recommended
technique for deriving stop words?
Thanks,
Lukas
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]