Hi,

Can anybody point me to some references on how to create an ideal set of stop words? I know that this is more of a theoretical question, but how do Luceners determine which words should be excluded when creating Analyzers for a new language? And which technique was used to validate the stop word lists in the current Analyzers?
More specifically, I am interested in situations where a search engine needs to be built around a specific corpus (for example, when we need to search a set of articles related to programming languages only). Given a specific corpus, is there any recommended technique for deriving stop words?

Thanks,
Lukas