Lucene's CachingTokenFilter in index analyzer chain

Enrico Detoma Wed, 14 Oct 2009 07:24:37 -0700

Hi all,

I'm trying to add a filter derived from CachingTokenFilter to the index
analyzer chain for field "text".
I need CachingTokenFilter because I have to look ahead in the token stream:
my filter is a "stop phrases" filter, which looks ahead in the stream to see
whether a stop phrase follows and, if so, removes it from the token stream.
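
The look-ahead idea can be sketched in plain Java, independent of Lucene. This
is only an illustration under stated assumptions, not the actual
StopPhrasesFilterFactory code: the class name StopPhraseSketch, the method
removeStopPhrases and the sample tokens are all hypothetical, and the tokens
are plain strings rather than Lucene Token objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of stop-phrase removal with look-ahead.
// It works on a fully buffered list of tokens, which is the role the
// cache plays in a CachingTokenFilter-based filter.
public class StopPhraseSketch {

    // Returns a new list with every occurrence of a stop phrase removed.
    public static List<String> removeStopPhrases(List<String> tokens,
                                                 List<List<String>> stopPhrases) {
        List<String> kept = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            // Length of the longest stop phrase that starts at position i.
            int matched = 0;
            for (List<String> phrase : stopPhrases) {
                int n = phrase.size();
                // Look ahead n tokens and compare them with the phrase.
                if (n > matched && i + n <= tokens.size()
                        && tokens.subList(i, i + n).equals(phrase)) {
                    matched = n;
                }
            }
            if (matched > 0) {
                i += matched;            // skip the whole stop phrase
            } else {
                kept.add(tokens.get(i)); // ordinary token, keep it
                i++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample data invented for this example.
        List<String> tokens = Arrays.asList(
                "camera", "in", "ottimo", "stato", "con", "custodia");
        List<List<String>> stopPhrases = Arrays.asList(
                Arrays.asList("in", "ottimo", "stato"));
        System.out.println(removeStopPhrases(tokens, stopPhrases));
        // prints [camera, con, custodia]
    }
}
```

The sketch prefers the longest matching phrase at each position and ignores
empty phrases, so the loop always advances by at least one token.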


When I test the correctness of the chain using this query:

/solr/analysis/field?analysis.fieldname=description&analysis.fieldtype=text&analysis.fieldvalue=...
everything seems ok (I see that the stop phrases are removed from the token
stream).

But when I index documents, the index is totally empty: all searches on
"text" fields give no results at all!

Here is my index chain, where StopPhrasesFilterFactory is my custom filter
which derives from CachingTokenFilter:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="org.apache.solr.analysis.StopPhrasesFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Italian"
                protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Italian"
                protected="protwords.txt"/>
      </analyzer>
    </fieldType>

Is it wrong to use CachingTokenFilter in the index chain?

Regards
Enrico
