See also en.wikipedia.org/wiki/Stop_words and www.ranks.nl/tools/stopwords.html
karl wettin <[EMAIL PROTECTED]> wrote on 10/05/2007 13:57:33:

> On 10 May 2007, at 20:39, Lukas Vlcek wrote:
>
> > Can anybody point me to some references on how to create an ideal
> > set of stop words? I know that this is more of a theoretical
> > question, but how do Luceners determine which words should be
> > excluded when creating Analyzers for new languages?
>
> The idea with stop words is to keep the index as small as possible
> without major loss of features, so they ought to be frequently
> occurring words with little or no semantic meaning. What these words
> are really depends on the language, the corpus, etc.
>
> > And which technique was used for validation of the stop word lists
> > in the current Analyzers?
>
> My guess is that they were manually chosen from a corpus term
> frequency vector.
>
> > More specifically, I am interested in situations where there is a
> > need to build a search engine around a specific corpus (for example,
> > when we need to search only a set of articles related to programming
> > languages). Given a specific corpus, is there any recommended
> > technique for deriving stop words?
>
> If you have no knowledge of the language for which you wish to
> produce stop words, then it will be fairly hard to know what to
> consider a stop word. You might be able to treat it as a text
> classification problem; feature/attribute selection for classifiers
> is a well-researched subject, and Weka, Yale, R, etc. are all tools
> that might help you. But I honestly think that no matter how you turn
> and twist the data, manually choosing the stop words is the way to go.
>
> --
> karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
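To make the "corpus term frequency vector" idea concrete, here is a minimal sketch (not Lucene code; the toy corpus, tokenizer, and `top_n` cutoff are my own assumptions) that ranks terms by document frequency so a human can review the top of the list as stop word candidates:

```python
# Sketch: produce candidate stop words from a corpus by ranking terms
# on document frequency. The corpus, regex tokenizer, and cutoff are
# illustrative assumptions; a human would still review the ranked list.
import re
from collections import Counter

def candidate_stop_words(documents, top_n=10):
    """Return the top_n terms by document frequency (ties broken
    alphabetically). Terms appearing in many documents are the usual
    stop word candidates."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term at most once per document.
        terms = set(re.findall(r"[a-z]+", doc.lower()))
        doc_freq.update(terms)
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog sleeps in the sun",
    "a fox and a dog are in the garden",
]
print(candidate_stop_words(corpus, top_n=3))  # e.g. ['dog', 'the', 'fox']
```

As karl notes, the ranked list is only a starting point: on a domain-specific corpus, very frequent terms may still carry meaning (e.g. "java" in a programming corpus), so the final selection stays manual.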