Hi, Thanks for your comments!
I was thinking that there could be some method based on frequency and linguistic research. So far it seems that a manually chosen set of words is a very common approach, but this leaves some questions open in my mind.

I am not a native English speaker, but I think that this list (http://www.ranks.nl/tools/stopwords.html) makes sense. For my native language (http://www.ranks.nl/stopwords/czech.html), however, the list can be questionable in some cases, especially for a specific corpus.

What I am searching for is an automatic method of stop word extraction based on a given set of documents. I don't expect such a method to be 100% exact, but I would expect it to be ~good enough~. I will try searching CiteSeer as well (I was hoping somebody could give me some references of this kind).

Thanks!
Lukas

On 5/11/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
There is a handy class in contrib/misc.../ that will show you the most frequent terms in an index. Handy dandy.

Otis
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message ----
From: Lukas Vlcek <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, May 10, 2007 2:39:35 PM
Subject: Stop words (how to create ideal set of stop words?)

Hi,

Can anybody point me to some references on how to create an ideal set of stop words? I know that this is more of a theoretical question, but how do Luceners determine which words should be excluded when creating Analyzers for new languages? And which technique was used to validate the stop word lists in the current Analyzers?

More specifically, I am interested in situations where a search engine has to be built around a specific corpus (for example, when we need to search a set of articles related to programming languages only). Given a specific corpus, is there any recommended technique for deriving stop words?

Thanks,
Lukas
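
As a rough illustration of the corpus-based idea discussed in this thread, the sketch below ranks terms by document frequency and proposes those that appear in most documents as stop word candidates. It is only one possible heuristic, not the method used to build the stop word lists in Lucene's Analyzers. The function name `candidate_stop_words`, the `min_doc_ratio` parameter, and the tiny sample corpus are assumptions made up for this example.

```python
import re
from collections import Counter


def candidate_stop_words(documents, min_doc_ratio=0.8):
    """Return terms that occur in at least min_doc_ratio of the documents.

    Hypothetical helper for illustration; the threshold is an assumption
    and would need tuning against a real corpus.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency, not raw count).
        terms = set(re.findall(r"\w+", text.lower()))
        doc_freq.update(terms)

    threshold = min_doc_ratio * len(documents)
    # Most widespread terms first; ties broken alphabetically for stable output.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, df in ranked if df >= threshold]


# Made-up sample corpus about programming languages.
corpus = [
    "The Java compiler checks the types of a program.",
    "A Python program is run by the interpreter.",
    "The Haskell compiler infers the type of a function.",
]

print(candidate_stop_words(corpus, min_doc_ratio=1.0))
print(candidate_stop_words(corpus, min_doc_ratio=0.6))
```

With the lower ratio, domain words such as "program" and "compiler" join "a" and "the" among the candidates. That matches the concern about specific corpora: a purely frequency-based list would still need a manual review before being passed to an Analyzer.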