Hi Yonik,
Thanks for your comments.
Secondly, has anyone thought that it would be a good idea to extend the
Analyzer interface (abstract class) to allow a standard way to set stop
words? There seem to be two 'families' of stop word configuration via
constructors.
That belongs at the TokenFilter level (where it currently is).
That's true, but all the existing Analyzers also allow the stop set to be
configured via their constructors, and each does it in a different way.
For example, StandardAnalyzer has:
public StandardAnalyzer(String[] stopWords)
public StandardAnalyzer(Set stopWords)
public StandardAnalyzer(File stopwords)
whereas RussianAnalyzer has:
public RussianAnalyzer(char[] charset, Hashtable stopwords)
public RussianAnalyzer(char[] charset, String[] stopwords)
So there is no common way to configure stop words without some messy code
that inspects the constructor signatures and guesses which one to call.
Perhaps the Analyzer class could have some default methods, e.g.
public void setStopWords(File stopWordFile);
public void setStopWords(Set stopWordSet);
public void setStopWords(String[] stopWords);
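As a rough sketch of the idea only: the class below is a hypothetical base class (the name StopWordConfigurable and the getStopWords() accessor are made up for illustration and are not part of Lucene). It also assumes the stop word file holds one word per line in UTF-8, which is a guess at a sensible format rather than anything the existing analyzers specify.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical base class (not a Lucene class): one uniform way to set stop words.
abstract class StopWordConfigurable {
    private Set<String> stopWords = Collections.emptySet();

    // Canonical setter: keeps a defensive, read-only copy of the given set.
    public void setStopWords(Set<String> stopWordSet) {
        this.stopWords = Collections.unmodifiableSet(new HashSet<String>(stopWordSet));
    }

    // Convenience overload: delegates to the Set version.
    public void setStopWords(String[] stopWords) {
        setStopWords(new HashSet<String>(Arrays.asList(stopWords)));
    }

    // Convenience overload. Assumption: one stop word per line, UTF-8 encoded.
    public void setStopWords(File stopWordFile) throws IOException {
        Set<String> words = new HashSet<String>();
        for (String line : Files.readAllLines(stopWordFile.toPath(), StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        setStopWords(words);
    }

    // Subclasses would pass this set to whatever stop filter they build.
    public Set<String> getStopWords() {
        return stopWords;
    }
}
```

With something like this, calling code could configure any analyzer the same way, e.g. analyzer.setStopWords(new String[] {"the", "a"}), without looking at its constructors.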
Things currently are pluggable: one makes new Analyzers by plugging
together a Tokenizer followed by several TokenFilters.
If you are talking about some sort of external configuration, take a
look at Solr.
Yes, you've done some nice stuff there with Solr. Unfortunately, I only came
across it some time after I'd already done a lot of the work for our system.
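For anyone following the thread, the "Tokenizer followed by TokenFilters" composition mentioned above can be sketched roughly as follows. This is a simplified, self-contained stand-in: MiniAnalyzer and its methods are hypothetical names for illustration and are not the Lucene Tokenizer/TokenFilter API. The point it shows is that the stop set lives in the filter, not in the analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for an analyzer built from a tokenizer plus a filter chain.
final class MiniAnalyzer {
    private final List<UnaryOperator<List<String>>> filters =
            new ArrayList<UnaryOperator<List<String>>>();

    // Appends a filter to the chain; filters run in the order they were added.
    MiniAnalyzer addFilter(UnaryOperator<List<String>> filter) {
        filters.add(filter);
        return this;
    }

    // Tokenizer step: split on runs of whitespace.
    static List<String> whitespaceTokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Filter step: lower-case every token.
    static UnaryOperator<List<String>> lowerCase() {
        return tokens -> {
            List<String> out = new ArrayList<String>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    // Filter step: drop tokens found in the stop set. The stop set belongs to the filter.
    static UnaryOperator<List<String>> stop(Set<String> stopWords) {
        return tokens -> {
            List<String> out = new ArrayList<String>();
            for (String t : tokens) {
                if (!stopWords.contains(t)) {
                    out.add(t);
                }
            }
            return out;
        };
    }

    // Runs the tokenizer, then each filter in turn.
    List<String> analyze(String text) {
        List<String> tokens = whitespaceTokenize(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }
}
```

Here the lower-case filter has to be added before the stop filter if the stop set is lower-case, which is the kind of ordering decision that plugging the chain together yourself leaves in your hands.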
Antony