Re: Analyzer thread safety; Stop words

Yonik Seeley Wed, 29 Nov 2006 20:10:56 -0800

On 11/29/06, Antony Bowesman <[EMAIL PROTECTED]> wrote:

Yonik Seeley wrote:
> On 11/29/06, Antony Bowesman <[EMAIL PROTECTED]> wrote:
>>
>> That's true, but all the existing Analyzers allow the stop set to be
>> configured
>> via the analyzer constructors, but in different ways.
>
> But you can duplicate most Analyzers (all the ones in Lucene?) with a
> chain of Tokenizers and TokenFilters (since that is how almost all of
> them are implemented).  Most Analyzers are simply shortcuts to putting
> together your own.


Something seems confused to me.  Although stop words are used by Filters, they
are currently exposed via Analyzers, which is the granularity used at the
IndexWriter/Parser levels.  This is what contributors are writing, not Filters.

There are lots of analysis contributions which deal with stop words that are
perfectly usable as is.  They shouldn't need to be duplicated to be re-used and
if that's needed, it points to a deficiency in the design.  If we all have to
put together our own, again, doesn't this argue that there should be a standard
way of doing it at the higher Analyzer level?

Sure, the Solr way of using configurable filters gives great flexibility.
Your solrconfig.xml example shows how the GreekAnalyzer can be deployed, but
it also highlights the problem: it does not seem to be possible to make use of
the stopword Hashtable available to the GreekAnalyzer constructor.


The GreekAnalyzer is just an example of how you can use existing
Analyzers (as long as they have a default constructor), but it's not
the recommended approach.

TokenFilters are preferred over Analyzers: you can plug them
together in any way you see fit to solve your analysis problem.  For
Solr, an added bonus of using chains of filters is that Solr can
"know" about the results after each filter and show you the results on
an analysis web page (very useful for debugging).

If I were to analyze Greek text, I might do something like this:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
         <filter class="solr.SnowballPorterFilterFactory" language="Greek"/>
     </analyzer>
</fieldtype>
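
To make the "plug filters together" idea concrete, here is a rough sketch in
plain Java.  It is a toy model only: none of these names are Lucene or Solr
classes, and FilterChainSketch, whitespaceTokenize, lowerCase, stop and
analyze are all made up for illustration.  The point it tries to show is that
the stop set is an argument to one small filter, and the chain decides which
filters run and in what order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Toy model of a tokenizer followed by a chain of filters.
// These are NOT Lucene or Solr classes; every name is hypothetical.
public class FilterChainSketch {

    // Stand-in for a whitespace tokenizer: split on runs of whitespace.
    static List<String> whitespaceTokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Stand-in for a lowercase filter.
    static UnaryOperator<List<String>> lowerCase() {
        return tokens -> tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Stand-in for a stop filter: the stop set belongs to the filter,
    // not to an analyzer constructor.
    static UnaryOperator<List<String>> stop(Set<String> stopWords) {
        return tokens -> tokens.stream()
                .filter(t -> !stopWords.contains(t))
                .collect(Collectors.toList());
    }

    // Tokenize, then apply each filter in the order given.
    @SafeVarargs
    static List<String> analyze(String text, UnaryOperator<List<String>>... filters) {
        List<String> tokens = whitespaceTokenize(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a"));
        // Lowercase first, then remove stop words.
        System.out.println(analyze("The quick Fox", lowerCase(), stop(stopWords)));
        // prints [quick, fox]
    }
}
```

In this toy model, swapping the order to stop(stopWords) followed by
lowerCase() leaves "the" in the output, because "The" is checked against the
stop set before it is lowercased.  Ordering choices like that are easy to
express in a chain and awkward to express as constructor arguments.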

If you try to put everything in Analyzer constructors, you get
combinatorial explosion.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

