Re: Analyzer thread safety; Stop words

Antony Bowesman Wed, 29 Nov 2006 13:21:26 -0800

Hi Yonik,

Thanks for your comments.

>> Secondly, has anyone thought that it would be a good idea to extend the
>> Analyzer interface (abstract class) to allow a standard way to set stop
>> words? There seem to be two 'families' of stop word configuration via
>> constructors.
>
> That belongs at the TokenFilter level (where it currently is).

That's true, but all the existing Analyzers allow the stop set to be configured via the analyzer constructors, each in a different way.


For example StandardAnalyzer has:

public StandardAnalyzer(String[] stopWords)
public StandardAnalyzer(Set stopWords)
public StandardAnalyzer(File stopwords)

whereas RussianAnalyzer has:

public RussianAnalyzer(char[] charset, Hashtable stopwords)
public RussianAnalyzer(char[] charset, String[] stopwords)

so, this does not make common stop word configuration possible without some messy code to look at constructor signatures and make some guesses.
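
To illustrate the kind of guessing involved, here is a minimal sketch of reflection-based construction. FakeStandardAnalyzer, FakeRussianAnalyzer and StopWordReflection are hypothetical stand-ins that only mimic two of the constructor shapes listed above; they are not the real Lucene classes, and the probing order is an assumption.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in mimicking the (String[]) constructor shape.
class FakeStandardAnalyzer {
    final Set<String> stopWords;

    public FakeStandardAnalyzer(String[] stopWords) {
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }
}

// Hypothetical stand-in mimicking the (char[], String[]) constructor shape.
class FakeRussianAnalyzer {
    final char[] charset;
    final Set<String> stopWords;

    public FakeRussianAnalyzer(char[] charset, String[] stopWords) {
        this.charset = charset;
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }
}

class StopWordReflection {
    // Probe each known constructor shape in turn and use the first that exists.
    static Object create(Class<?> type, String[] stopWords, char[] charset) {
        try {
            try {
                Constructor<?> simple = type.getConstructor(String[].class);
                // Cast so the array is passed as one argument, not spread as varargs.
                return simple.newInstance((Object) stopWords);
            } catch (NoSuchMethodException e) {
                // No (String[]) constructor; guess the (char[], String[]) shape next.
                Constructor<?> withCharset =
                        type.getConstructor(char[].class, String[].class);
                return withCharset.newInstance(charset, stopWords);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No known stop word constructor on " + type, e);
        }
    }
}
```

Every new constructor family needs another branch here, which is the maintenance problem described above.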


Perhaps the Analyzer class could have some default methods, e.g.

public void setStopWords(File stopWordFile);
public void setStopWords(Set stopWordSet);
public void setStopWords(String[] stopWords);
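
A minimal sketch of how such default methods might look, assuming a hypothetical base class (StopWordConfigurable is an invented name; Lucene's Analyzer does not define these methods). The Set overload is treated as the canonical one, and the file format (UTF-8, one word per line) is an assumption.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical base class; not part of Lucene.
abstract class StopWordConfigurable {
    private Set<String> stopWords = Collections.emptySet();

    // Canonical setter: stores an unmodifiable defensive copy.
    public void setStopWords(Set<String> stopWordSet) {
        this.stopWords = Collections.unmodifiableSet(new HashSet<>(stopWordSet));
    }

    // Convenience overload: converts the array and delegates.
    public void setStopWords(String[] stopWords) {
        setStopWords(new HashSet<>(Arrays.asList(stopWords)));
    }

    // Convenience overload: assumes a UTF-8 file with one stop word per line.
    public void setStopWords(File stopWordFile) {
        Set<String> words = new HashSet<>();
        try {
            for (String line : Files.readAllLines(stopWordFile.toPath(), StandardCharsets.UTF_8)) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word); // blank lines are skipped
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        setStopWords(words);
    }

    public Set<String> getStopWords() {
        return stopWords;
    }
}

// Hypothetical concrete analyzer that inherits the setters unchanged.
class DemoAnalyzer extends StopWordConfigurable { }
```

With something like this, calling code could configure any analyzer through the same setStopWords call without inspecting its constructors.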

> Things currently are pluggable: one makes new Analyzers by plugging
> together a Tokenizer followed by several TokenFilters.
>
> If you are talking about some sort of external configuration, take a
> look at Solr.
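
The plug-together idea can be sketched in miniature as follows. MiniAnalyzer and its filter helpers are hypothetical and deliberately do not use Lucene's actual Tokenizer or TokenFilter classes; the sketch only shows a tokenizing stage followed by a chain of filters, with the stop set living in the filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical miniature pipeline; not Lucene's API.
class MiniAnalyzer {
    private final List<UnaryOperator<List<String>>> filters = new ArrayList<>();

    // Plug another filter onto the end of the chain.
    MiniAnalyzer addFilter(UnaryOperator<List<String>> filter) {
        filters.add(filter);
        return this;
    }

    List<String> analyze(String text) {
        // "Tokenizer" stage: split on runs of whitespace.
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        // "TokenFilter" stages: applied in the order they were added.
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    // Filter that lower-cases every token.
    static UnaryOperator<List<String>> lowerCase() {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    // Filter that drops tokens found in the stop set.
    static UnaryOperator<List<String>> stop(Set<String> stopWords) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                if (!stopWords.contains(t)) {
                    out.add(t);
                }
            }
            return out;
        };
    }
}
```

Because the stop set is an argument to the filter, changing stop words means building a different chain rather than calling a setter on the analyzer.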

Yes, you've done some nice stuff there with Solr. Unfortunately, I only came across it some time after I'd already done a lot of the work for our system.


Antony


