Hi,

Thanks for your comments!

I was thinking that there could be some method based on frequency and
linguistic research. So far it seems that a manually chosen set of words is
a very common approach, but this leaves some questions open in my mind.
I am not a native English speaker, but I think that this list (
http://www.ranks.nl/tools/stopwords.html) makes sense, whereas the one for my
native language (http://www.ranks.nl/stopwords/czech.html) can be questionable
in some cases (especially in the case of a specific corpus).

What I am searching for is some automatic method of stop word extraction
based on a given set of documents. I don't expect such a method to be 100%
exact, but I would expect it to be ~good enough~.
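
To make the idea concrete, here is a minimal sketch of one frequency-based
heuristic: treat every term that occurs in at least a given fraction of the
documents as a stop word candidate. The class name StopWordCandidates, the
method extract, and the minDocFraction parameter are hypothetical names made
up for illustration and are not part of Lucene. The tokenization is
deliberately naive, and the result is only a candidate list that would still
need manual review.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene: proposes stop word candidates
// from document frequency alone.
class StopWordCandidates {

    // Returns the terms that occur in at least minDocFraction of the documents.
    // Tokenization is naive: lowercase, then split on anything that is not a letter.
    static Set<String> extract(List<String> documents, double minDocFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : documents) {
            // Count each term at most once per document (document frequency).
            Set<String> seen = new HashSet<>();
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!token.isEmpty() && seen.add(token)) {
                    docFreq.merge(token, 1, Integer::sum);
                }
            }
        }
        // Keep the terms whose document frequency reaches the threshold, sorted.
        Set<String> candidates = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if (entry.getValue() >= minDocFraction * documents.size()) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "The compiler checks the types of a program",
            "A garbage collector frees the memory of a program",
            "The syntax of a language defines the valid programs");
        System.out.println(extract(docs, 0.9));
    }
}
```

With a lower threshold, corpus-specific terms (such as "program" in the sample
documents above) would start to show up as candidates too, which is exactly
the case where a generic list and a corpus-derived list would differ.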

I will try to search CiteSeer as well (I was hoping somebody could give me
some references of this kind).

Thanks!
Lukas

On 5/11/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

There is a handy class in contrib/misc.../ that will show you the most
frequent terms in an index. Handy dandy.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Lukas Vlcek <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, May 10, 2007 2:39:35 PM
Subject: Stop words (how to create ideal set of stop words?)

Hi,

Can anybody point me to some references on how to create an ideal set of stop
words? I know that this is more of a theoretical question, but how do
Luceners determine which words should be excluded when creating Analyzers
for new languages? And which technique was used to validate the stop
word lists in the current Analyzers?

More specifically, I am interested in situations where there is a need to
build a search engine around a specific corpus (for example, when we need to
search a set of articles related to programming languages only). Given a
specific corpus, is there any recommended technique for deriving stop words?

Thanks,
Lukas