Also, from the empirical side, open the index in Luke (after indexing without any stopwords, or with just the standard ones), look at the most common terms, and check whether they are meaningful in the context of your application.

-Grant


On May 10, 2007, at 7:41 PM, Doron Cohen wrote:

See also en.wikipedia.org/wiki/Stop_words and
www.ranks.nl/tools/stopwords.html

karl wettin <[EMAIL PROTECTED]> wrote on 10/05/2007 13:57:33:


On 10 May 2007, at 20:39, Lukas Vlcek wrote:

Can anybody point me to some references on how to create an ideal set
of stop words? I know that this is more of a theoretical question, but
how do Luceners determine which words should be excluded when creating
Analyzers for a new language?

The idea with stop words is to keep the index as small as possible
without major loss of features, thus they ought to be frequently
occurring words with little or no semantic meaning. What these words
are really depends on language, corpus, etc.

And which technique was used for validation of stop
word lists in current Analyzers?

My guess is that they are manually chosen from a corpus term
frequency vector.
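
A minimal sketch of what picking from a term frequency vector could
look like, assuming a small in-memory corpus and a naive ASCII
tokenizer. The function names, the sample documents, and the 0.5
document-ratio threshold are all made up for illustration; this is not
how the stop word lists in the existing Analyzers were actually built.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")  # naive tokenizer, an assumption


def tokenize(text):
    return TOKEN.findall(text.lower())


def stop_word_candidates(docs, min_doc_ratio=0.5, top_n=20):
    """Return (term, document ratio, total count) rows for terms that
    appear in at least min_doc_ratio of the documents."""
    doc_freq = Counter()   # number of documents containing the term
    term_freq = Counter()  # total occurrences across the corpus
    for doc in docs:
        tokens = tokenize(doc)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    n = len(docs)
    rows = [(term, df / n, term_freq[term])
            for term, df in doc_freq.items() if df / n >= min_doc_ratio]
    # Most widespread first, then most frequent, then alphabetical.
    rows.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return rows[:top_n]


# Hypothetical corpus about programming languages.
docs = [
    "The compiler for the Java language is written in Java.",
    "Python is a language with a simple syntax.",
    "The type system of Haskell is the strictest of the three.",
    "A garbage collector is part of the Java runtime.",
]

for term, ratio, count in stop_word_candidates(docs):
    print(f"{term:10s} {ratio:.2f} {count}")
```

On this toy corpus the candidates include "java" and "language" next
to "is", "the", "a" and "of", which is why the list still needs a
manual pass: frequency alone cannot tell a function word from a
domain term.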

More specifically, I am interested in situations where there is a need
to build a search engine around a specific corpus (for example, when we
need to search a set of articles related to programming languages
only). Given a specific corpus, is there any recommended technique for
deriving stop words?

If you have no knowledge of the language for which you wish to produce
stop words, then it will be fairly hard to know what to consider a
stop word. You might be able to treat it as a text classification
problem. Feature/attribute selection for classifiers is a
well-researched subject. Weka, Yale, R, etc. are all tools that might
help you. But I honestly think that no matter how you twist and turn
the data, manually choosing the stop words is the way to go.
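
One way to read the feature selection idea, sketched under the
assumption that each document carries a class label: score every term
by the information gain of its presence with respect to the label, and
treat terms that are widespread but carry almost no gain as stop word
candidates. The labels, sample texts, function names, and both
thresholds below are hypothetical, and a toolkit such as Weka would
offer more robust selectors than this hand-rolled one.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a sequence of class counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def term_scores(labeled_docs):
    """Map each term to (information gain, document ratio)."""
    n = len(labeled_docs)
    class_counts = Counter(label for label, _ in labeled_docs)
    base = entropy(class_counts.values())
    present_by_term = {}
    for label, text in labeled_docs:
        for term in set(text.lower().split()):
            present_by_term.setdefault(term, Counter())[label] += 1
    scores = {}
    for term, present in present_by_term.items():
        absent = class_counts - present  # keeps positive counts only
        n_present = sum(present.values())
        n_absent = n - n_present
        conditional = n_present / n * entropy(present.values())
        if n_absent:
            conditional += n_absent / n * entropy(absent.values())
        scores[term] = (base - conditional, n_present / n)
    return scores


def stop_word_candidates(labeled_docs, max_gain=0.05, min_doc_ratio=0.75):
    """Terms that are widespread yet say almost nothing about the class."""
    scores = term_scores(labeled_docs)
    return sorted(term for term, (gain, ratio) in scores.items()
                  if gain <= max_gain and ratio >= min_doc_ratio)


# Hypothetical labeled corpus.
labeled_docs = [
    ("java", "the jvm runs the bytecode"),
    ("java", "the compiler emits bytecode"),
    ("python", "the interpreter runs the script"),
    ("python", "the script imports a module"),
]

print(stop_word_candidates(labeled_docs))
```

Here "bytecode" separates the two classes perfectly and gets the
maximum gain, while "the" appears in every document and gets none. The
document-ratio filter matters because rare terms can also have low
gain without being stop words.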


--
karl


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ


