See also en.wikipedia.org/wiki/Stop_words and
www.ranks.nl/tools/stopwords.html

karl wettin <[EMAIL PROTECTED]> wrote on 10/05/2007 13:57:33:

>
> On 10 May 2007, at 20:39, Lukas Vlcek wrote:
>
> > Can anybody point me to some references on how to create an ideal
> > set of stop words? I know that this is more of a theoretical
> > question, but how do Luceners determine which words should be
> > excluded when creating Analyzers for new languages?
>
> The idea with stop words is to keep the index as small as possible
> without major loss of features, so they ought to be frequently
> occurring words with little or no semantic meaning. What these words
> are really depends on the language, the corpus, etc.
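
To make that concrete, below is a minimal sketch of how a hand-picked
list is usually handed to Lucene, written against the Lucene 2.x-era
constructors; the word list and index path are made-up placeholders,
not a recommended set.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class CustomStopWords {

        // Hypothetical hand-picked list; the entries are placeholders only.
        private static final String[] STOP_WORDS = {
            "a", "an", "and", "are", "as", "at", "be", "by", "for",
            "in", "is", "it", "of", "on", "or", "that", "the", "to"
        };

        public static void main(String[] args) throws Exception {
            // StandardAnalyzer accepts a custom stop word array, so the
            // same filtering is applied at index and query time.
            StandardAnalyzer analyzer = new StandardAnalyzer(STOP_WORDS);

            // Any term in STOP_WORDS is silently dropped while indexing.
            IndexWriter writer = new IndexWriter("/tmp/test-index",
                                                 analyzer, true);
            // ... add documents here ...
            writer.close();
        }
    }
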
>
> > And which technique was used to validate the stop word lists in the
> > current Analyzers?
>
> My guess is that they are manually chosen from a corpus term
> frequency vector.
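
If you want to eyeball that frequency vector yourself, a throwaway
counter along these lines is enough. It is only a sketch: it assumes
one plain-text corpus file per command-line argument and uses a very
crude lowercase tokenizer, and the top of its output is what you would
then prune by hand.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Prints the most frequent terms of a corpus as stop word candidates. */
    public class TermFrequencies {

        public static void main(String[] args) throws Exception {
            Map<String, Integer> freq = new HashMap<String, Integer>();

            // One corpus file per argument; crude lowercase tokenization.
            for (String file : args) {
                BufferedReader in = new BufferedReader(new FileReader(file));
                String line;
                while ((line = in.readLine()) != null) {
                    for (String token : line.toLowerCase().split("[^a-z0-9]+")) {
                        if (token.length() == 0) continue;
                        Integer n = freq.get(token);
                        freq.put(token, n == null ? 1 : n + 1);
                    }
                }
                in.close();
            }

            // Sort by descending frequency and print the top 50 for review.
            List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(freq.entrySet());
            Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
                public int compare(Map.Entry<String, Integer> a,
                                   Map.Entry<String, Integer> b) {
                    return b.getValue().compareTo(a.getValue());
                }
            });
            for (int i = 0; i < Math.min(50, entries.size()); i++) {
                System.out.println(entries.get(i).getKey() + "\t"
                                   + entries.get(i).getValue());
            }
        }
    }
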
>
> > More specifically, I am interested in situations where there is a
> > need to build a search engine around a specific corpus (for example,
> > when we need to search only a set of articles related to programming
> > languages). Given a specific corpus, is there any recommended
> > technique for deriving stop words?
>
> If you have no knowledge of the language for which you wish to
> produce stop words, then it will be fairly hard to know what to
> consider a stop word. You might be able to treat it as a text
> classification problem: feature/attribute selection for classifiers
> is a well-researched subject, and Weka, Yale, R, etc. are all tools
> that might help you. But I honestly think that no matter how you turn
> and twist the data, manually choosing the stop words is the way to go.
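
As a crude automatic starting point before that manual pass, you can at
least flag terms by document frequency, e.g. anything occurring in more
than half of the documents. The sketch below does that in plain Java
(one document per file, an arbitrary 0.5 threshold); its output is
still meant for human review, not for use as-is.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    /** Flags terms occurring in over half the documents as candidates. */
    public class DocFrequencyFilter {

        public static void main(String[] args) throws Exception {
            Map<String, Integer> docFreq = new HashMap<String, Integer>();

            // One document per file; count each term at most once per doc.
            for (String file : args) {
                Set<String> seen = new HashSet<String>();
                BufferedReader in = new BufferedReader(new FileReader(file));
                String line;
                while ((line = in.readLine()) != null) {
                    for (String token : line.toLowerCase().split("[^a-z0-9]+")) {
                        if (token.length() > 0) seen.add(token);
                    }
                }
                in.close();
                for (String term : seen) {
                    Integer n = docFreq.get(term);
                    docFreq.put(term, n == null ? 1 : n + 1);
                }
            }

            // The 0.5 cut-off is arbitrary; tune it and review by hand.
            for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
                if (e.getValue() > args.length * 0.5) {
                    System.out.println(e.getKey() + "\t"
                                       + e.getValue() + "/" + args.length);
                }
            }
        }
    }
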
>
>
> --
> karl

