I thought that since I'm updating UpLib's Lucene code, I should tackle the issue of document languages as well. Right now I'm using an off-the-shelf language identifier, textcat, to figure out which language a Web page or PDF is (mainly) written in. I'd then like to map that language to the correct Lucene analyzer and analyze the document with it, falling back to StandardAnalyzer if the installed Lucene library doesn't have an analyzer for that language.
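To make the lookup-with-fallback concrete, here is a rough sketch of the kind of thing I mean. It is not Lucene API: AnalyzerLookup, register and resolveClassName are names made up for illustration, and it only resolves a class name (checking via reflection that the class is actually on the classpath) rather than constructing an analyzer, since constructor signatures differ between releases. Reducing a tag to its primary subtag is my simplifying assumption, too.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a language tag to an analyzer class name, with a fallback. */
final class AnalyzerLookup {
    private final Map<String, String> classNames = new HashMap<>();
    private final String fallbackClassName;

    AnalyzerLookup(String fallbackClassName) {
        this.fallbackClassName = fallbackClassName;
    }

    /** Records the class name to use for a language (keyed on the primary subtag). */
    void register(String langTag, String className) {
        classNames.put(primarySubtag(langTag), className);
    }

    /** Assumption: only the primary subtag matters, so "de-AT" and "de_AT" both become "de". */
    static String primarySubtag(String langTag) {
        String tag = langTag.trim().toLowerCase(Locale.ROOT).replace('_', '-');
        int dash = tag.indexOf('-');
        return dash < 0 ? tag : tag.substring(0, dash);
    }

    /** Returns the registered class name if that class can be loaded, else the fallback. */
    String resolveClassName(String langTag) {
        if (langTag == null) {
            return fallbackClassName; // the language identifier could not decide
        }
        String candidate = classNames.get(primarySubtag(langTag));
        if (candidate != null && isLoadable(candidate)) {
            return candidate;
        }
        return fallbackClassName;
    }

    /** True if the class is present on the classpath (it is not initialized). */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, AnalyzerLookup.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

In my case the fallback would be org.apache.lucene.analysis.standard.StandardAnalyzer, and an entry such as register("de", "org.apache.lucene.analysis.de.GermanAnalyzer") would only take effect when that class is actually installed. The table itself is still hand-compiled, which is exactly the part I'd like to get rid of.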
It would be *very* handy if Analyzer had a static method

    static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);

Right now I'm consulting a hand-compiled mapping of langtag-to-Lucene-classname to figure out which Analyzer to use. That's wearisome, and it will be out of date for future releases of Lucene, which will presumably support more languages.

Secondly, if I've got an instance of a SnowballAnalyzer, there's no way to look "inside" it and see what language it's for. That's a problem on the search side. My QueryParser is a subclass of MultiFieldQueryParser, and it looks for a "special" FieldQuery on the field "_query_language", e.g., "_query_language:de", to tell the query parser to use a German analyzer on this query. What I'd like to be able to do is interrogate the current analyzer attached to the query parser instance, and throw an exception if it's not for the specified language. I can do this for non-Snowball analyzers, thanks to the brittle hand-compiled mapping mentioned above. But if it's a SnowballAnalyzer, there's no way to tell what the language inside it is. So it would be nice if SnowballAnalyzer grew a method

    String getLanguageName();

Even better would be

    String getLanguageTag();

And it would be nice if QueryParser grew a method

    void setAnalyzer(Analyzer a);

which would allow me to simply replace the current analyzer for the parsing of the rest of the query, instead of going through the rigmarole of throwing an exception, catching it, recreating the QueryParser with a different analyzer, and trying again. What would break if you changed the analyzer in midstream? Wouldn't it simply be used for analyzing the remaining terms in the query?

I see that Robert Muir has been doing a lot of good work on the Snowball code. I'd really like to see the stopword work finished, so that a SnowballAnalyzer for a particular language has a decent set of stopwords.
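For anyone wondering what the exception-and-retry rigmarole on the search side looks like, here is a stripped-down sketch. None of it is Lucene code: Parser stands in for my QueryParser subclass bound to one language's analyzer, and LanguageMismatchException, requestedLanguage and parseWithRetry are names invented for this example. I've made the exception unchecked just to keep the sketch short.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the throw, catch, rebuild-the-parser, retry workaround. */
final class LanguageRetry {
    private static final Pattern LANG_FIELD =
            Pattern.compile("_query_language:([A-Za-z0-9-]+)");

    /** Thrown by a parser when the query asks for a language other than its own. */
    static final class LanguageMismatchException extends RuntimeException {
        final String requestedTag;

        LanguageMismatchException(String requestedTag) {
            super("query requests language " + requestedTag);
            this.requestedTag = requestedTag;
        }
    }

    /** Stand-in for a query parser that is tied to a single language's analyzer. */
    interface Parser<Q> {
        Q parse(String query);
    }

    /** Returns the tag from a "_query_language:xx" clause, or null if there is none. */
    static String requestedLanguage(String query) {
        Matcher m = LANG_FIELD.matcher(query);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Parses with the default language first. If the parser reports a mismatch,
     * builds a new parser for the requested language and tries exactly once more.
     */
    static <Q> Q parseWithRetry(String query, String defaultTag,
                                Function<String, Parser<Q>> parserFor) {
        try {
            return parserFor.apply(defaultTag).parse(query);
        } catch (LanguageMismatchException e) {
            return parserFor.apply(e.requestedTag).parse(query);
        }
    }
}
```

Here parserFor is whatever factory builds a parser for a given tag. With a setAnalyzer() method on QueryParser, the catch block and the second parser would not be needed at all.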
Bill