On 10 May 2007 at 20:39, Lukas Vlcek wrote:

Can anybody point me to some references on how to create an ideal set of stop
words? I know that this is more of a theoretical question, but how do
Luceners determine which words should be excluded when creating Analyzers
for a new language?

The idea with stop words is to keep the index as small as possible without a major loss of features, so they ought to be frequently occurring words with little or no semantic meaning. Which words those are depends on the language, the corpus, etc.

And which technique was used for validation of stop
word lists in current Analyzers?

My guess is that they are manually chosen from a corpus term frequency vector.
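
To illustrate what picking from a term frequency vector could look like, here is a minimal Python sketch. It is not how the Lucene analyzers were actually built (I do not know that); the function name, the naive \w+ tokenizer and the top_n cutoff are all assumptions made for the example. It ranks terms by document frequency so that a person can review the top of the list by hand.

```python
import re
from collections import Counter


def stop_word_candidates(documents, top_n=20):
    """Rank terms by document frequency, highest first, for manual review.

    Hypothetical helper: tokenization is a naive lowercase \\w+ split,
    which is only a rough approximation for most languages.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        terms = set(re.findall(r"\w+", text.lower()))
        doc_freq.update(terms)

    total = len(documents)
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    # Report the fraction of documents that contain each term.
    return [(term, count / total) for term, count in ranked[:top_n]]


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog ate the bone",
        "a bird sang",
    ]
    for term, ratio in stop_word_candidates(corpus, top_n=5):
        print(f"{term}\t{ratio:.2f}")
```

The output is only a list of candidates. Someone who knows the language still has to decide which of them carry no meaning.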

More specifically, I am interested in situations where there is a need to build a search engine around a specific corpus (for example, when we need to search a set of articles related to programming languages only). Given a specific
corpus, is there any recommended technique for deriving stop words?

If you have no knowledge of the language for which you wish to produce stop words, then it will be fairly hard to know what to consider a stop word. You might be able to treat it as a text classification problem. Feature/attribute selection for classifiers is a well-researched subject. Weka, Yale, R, etc. are all tools that might help you. But I honestly think that no matter how you twist and turn the data, manually choosing the stop words is the way to go.
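
As a rough sketch of the classification angle, assuming the documents carry some class label (topic, source, etc.): a chi-square score of term presence against the label is one common feature selection statistic. Terms that occur in many documents but score close to zero do not help tell the classes apart, which makes them candidates for the manual list. This is plain Python rather than Weka, Yale or R, and the function name, tokenizer and sample labels are made up for the example.

```python
import re
from collections import Counter, defaultdict


def chi_square_scores(labelled_docs):
    """Return {term: chi-square score of term presence vs. class label}.

    labelled_docs is an iterable of (label, text) pairs. Hypothetical
    helper using a naive lowercase \\w+ tokenizer.
    """
    class_sizes = Counter()
    present = defaultdict(Counter)  # term -> label -> docs containing term
    for label, text in labelled_docs:
        class_sizes[label] += 1
        for term in set(re.findall(r"\w+", text.lower())):
            present[term][label] += 1

    n = sum(class_sizes.values())
    scores = {}
    for term, per_class in present.items():
        df = sum(per_class.values())  # documents containing the term
        score = 0.0
        for label, size in class_sizes.items():
            # Two cells per class: term present, term absent.
            cells = (
                (per_class[label], df),
                (size - per_class[label], n - df),
            )
            for observed, row_total in cells:
                expected = row_total * size / n
                if expected > 0:
                    score += (observed - expected) ** 2 / expected
        scores[term] = score
    return scores


if __name__ == "__main__":
    docs = [
        ("java", "the jvm runs the bytecode"),
        ("java", "the jvm is a vm"),
        ("python", "the interpreter runs the script"),
        ("python", "the gil is a lock"),
    ]
    for term, score in sorted(chi_square_scores(docs).items(), key=lambda kv: kv[1]):
        print(f"{term}\t{score:.2f}")
```

A low score alone proves nothing about meaning, so treat the result as input for the manual pass rather than a replacement for it.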


--
karl



