Re: Stop words (how to create ideal set of stop words?)

karl wettin Thu, 10 May 2007 13:59:34 -0700


On 10 May 2007, at 20:39, Lukas Vlcek wrote:

Can anybody point me to some references on how to create an ideal set of stop
words? I know that this is more like a theoretical question, but how do
Luceners determine which words should be excluded when creating Analyzers
for new languages?

The idea with stop words is to keep the index as small as possible without major loss of features, thus they ought to be frequently occurring words with little or no semantic meaning. What these words are really depends on language, corpus, etc.

And which technique was used for validation of stop
word lists in current Analyzers?

My guess is that they are manually chosen from a corpus term frequency vector.
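
A corpus term frequency vector of that kind is easy to produce yourself. Below is a minimal sketch in Python, assuming a naive regex tokenizer and a tiny in-memory corpus. The function name term_frequencies and the sample documents are made up for illustration and are not anything from Lucene.

```python
import re
from collections import Counter

def term_frequencies(docs):
    """Count how often each lowercased token occurs across all documents."""
    counts = Counter()
    for doc in docs:
        # Naive tokenizer (assumption): runs of ASCII letters only.
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

# Hypothetical toy corpus; a real one would be read from your own files.
docs = [
    "The compiler parses the source and the linker joins the objects.",
    "The interpreter reads the source line by line.",
    "A type checker verifies the program before the compiler runs.",
]

freqs = term_frequencies(docs)
# Most frequent terms first: these are only candidates for a human to review.
for term, count in freqs.most_common(5):
    print(term, count)
```

The top of such a list is where a person would start looking. Frequent domain terms ("source", "compiler") show up there too, which is why the final choice is probably still made by hand.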

More specifically, I am interested in situations when there is a need to build a search engine around a specific corpus (for example when we need to search a set of articles related to programming languages only). Given a specific
corpus, is there any recommended technique of stop words derivation?

If you have no knowledge of the language for which you wish to produce stop words, then it will be fairly hard to know what to consider a stop word. You might be able to consider it as a text classification problem. Feature/attribute selection for classifiers is a well-researched subject. Weka, Yale, R, etc. are all tools that might help you. But I honestly think no matter how you turn and twist the data, manually choosing the stop words is the way to go.
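
To give a rough idea of the classification angle: a term that shows up in almost every document of every class cannot help a classifier tell the classes apart, so it is a stop word candidate. The sketch below is a crude stand-in for real feature selection (information gain, chi-square and the like), not what Weka or the other tools do. It assumes you have documents labelled by topic. The function name, the min_fraction parameter and the sample data are all hypothetical.

```python
import re
from collections import defaultdict

def stop_word_candidates(labelled_docs, min_fraction=0.8):
    """Return terms found in at least min_fraction of the documents of every class."""
    docs_per_class = defaultdict(int)
    # term -> class label -> number of documents in that class containing the term
    df = defaultdict(lambda: defaultdict(int))
    for label, text in labelled_docs:
        docs_per_class[label] += 1
        # set() so a term counts once per document (document frequency).
        for term in set(re.findall(r"[a-z]+", text.lower())):
            df[term][label] += 1
    return sorted(
        term for term, per_class in df.items()
        if all(per_class.get(label, 0) / n >= min_fraction
               for label, n in docs_per_class.items())
    )

# Hypothetical labelled toy corpus.
labelled = [
    ("java", "the garbage collector frees the heap in java"),
    ("java", "the java compiler emits bytecode for the virtual machine"),
    ("python", "the interpreter runs the python script"),
    ("python", "indentation defines the blocks in python"),
]

candidates = stop_word_candidates(labelled)
print(candidates)
```

Note that "java" and "python" are frequent but tied to one class, so they are not proposed. The output is still only a list to go through manually.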



--
karl




