TeeTokenFilter and SinkTokenizer

Grant Ingersoll Thu, 27 Dec 2007 06:11:01 -0800

Anyone have any thoughts on how best to integrate Lucene's new SinkTokenizer and TeeTokenFilter (https://issues.apache.org/jira/browse/LUCENE-1058 and http://www.gossamer-threads.com/lists/lucene/java-dev/55927?search_string=TeeTokenFilter;#55927) into Solr? It doesn't fit into the TokenizerFactory and TokenFilterFactory model for create, since the constructors have dependencies on things other than a Reader and a TokenStream.

I can do one-off Analyzer constructions, but that doesn't really fit with the Solr way. I think this could have some nice benefits for the copyField mechanism as well, but that is more work to get right.

My initial (half-baked?) thinking is that we need the ability to name TokenStreams (Tokenizers and TokenFilters) so that we could do something like:


<analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="my.TokenFilterFactory" name="step1"/>
        -->
        <filter class="my.next.TFF" name="step2"/>
        <filter class="my.other.TFF" name="step3"/>
</analyzer>

Thus, each of the named filters creates a TeeTokenFilter and has an associated SinkTokenizer. Then, I can declare another analyzer that looks like:

<analyzer type="index">
        <tokenizer name="step2"/>
</analyzer>

which would just use the tokens saved by step 2 in the first analysis. Similarly, we do that for step 3 with some other filters added, like:

<analyzer type="index">
        <tokenizer name="step3"/>
        <filter class="StopFilterFactory"/>
</analyzer>
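
To make the data flow concrete, here is a toy model in plain Java. It is only a sketch of the idea, not the Lucene classes from LUCENE-1058: Sink, Tee, and consume are hypothetical names, and tokens are plain Strings. It does show why the factory model is awkward, since the tee needs a sink handed to its constructor in addition to its input stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Toy model of the tee/sink idea; NOT the Lucene classes from LUCENE-1058. */
public class TeeSinkSketch {

    /** Hypothetical sink: remembers tokens so a later field can replay them. */
    static class Sink {
        private final List<String> saved = new ArrayList<>();

        void add(String token) {
            saved.add(token);
        }

        /** Returns a fresh iterator over a copy of everything saved so far. */
        Iterator<String> replay() {
            return new ArrayList<>(saved).iterator();
        }
    }

    /** Hypothetical tee: passes each token through and copies it into the sink. */
    static class Tee implements Iterator<String> {
        private final Iterator<String> input;
        private final Sink sink;

        // The extra Sink argument is the dependency a plain factory cannot supply.
        Tee(Iterator<String> input, Sink sink) {
            this.input = input;
            this.sink = sink;
        }

        @Override
        public boolean hasNext() {
            return input.hasNext();
        }

        @Override
        public String next() {
            String token = input.next();
            sink.add(token);
            return token;
        }
    }

    /** Drains a token stream into a list, as indexing a field would. */
    static List<String> consume(Iterator<String> stream) {
        List<String> out = new ArrayList<>();
        while (stream.hasNext()) {
            out.add(stream.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Sink step2 = new Sink();
        Iterator<String> whitespace =
                Arrays.asList("the quick fox".split("\\s+")).iterator();

        // First field: analyzed normally, tokens are teed into the sink.
        List<String> first = consume(new Tee(whitespace, step2));
        // Second field: starts from the saved tokens instead of re-analyzing.
        List<String> second = consume(step2.replay());

        System.out.println(first + " " + second);
    }
}
```

Note that the sink only holds tokens once the tee has been consumed, which is where the ordering concern comes from.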

Now, Solr would need to be smart about this and know that it has to index the fields using the first analyzer before those using the sinks. And there might be some concerns about what to do if multiple fields use the same "Tee" analyzer and whether that affects the sinks. The "name" attribute, of course, would be optional. There is also the issue of initialization, in that we would most likely need two-pass initialization so that the names of the token streams are known ahead of time.
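
A rough sketch of what the two passes and the ordering could look like, again in plain Java with made-up names (AnalyzerSpec and indexOrder are hypothetical, nothing here is existing Solr code). It assumes an analyzer that starts from a named sink does not itself define names that other analyzers consume, so a simple producers-then-consumers ordering is enough.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical two-pass resolution of named token streams; not Solr code. */
public class SinkOrderingSketch {

    /** What one field's analyzer declares in the schema (illustrative only). */
    static class AnalyzerSpec {
        final String field;
        final List<String> defines; // names given to filters via name="..."
        final String consumes;      // name used by <tokenizer name="..."/>, or null

        AnalyzerSpec(String field, List<String> defines, String consumes) {
            this.field = field;
            this.defines = defines;
            this.consumes = consumes;
        }
    }

    /** Returns field names in an order where tee fields precede sink fields. */
    static List<String> indexOrder(List<AnalyzerSpec> specs) {
        // Pass 1: collect every declared name so references can be checked.
        Set<String> known = new HashSet<>();
        for (AnalyzerSpec spec : specs) {
            known.addAll(spec.defines);
        }

        // Pass 2: resolve references and split producers from consumers.
        List<String> producers = new ArrayList<>();
        List<String> consumers = new ArrayList<>();
        for (AnalyzerSpec spec : specs) {
            if (spec.consumes == null) {
                producers.add(spec.field);
            } else if (known.contains(spec.consumes)) {
                consumers.add(spec.field);
            } else {
                throw new IllegalArgumentException(
                        "Unknown token stream name: " + spec.consumes);
            }
        }

        List<String> order = new ArrayList<>(producers);
        order.addAll(consumers);
        return order;
    }

    public static void main(String[] args) {
        List<AnalyzerSpec> specs = Arrays.asList(
                new AnalyzerSpec("title_stop", Collections.<String>emptyList(), "step3"),
                new AnalyzerSpec("title", Arrays.asList("step2", "step3"), null));
        System.out.println(indexOrder(specs));
    }
}
```

This leaves open the question of multiple fields sharing the same "Tee" analyzer; the sketch treats each name as globally unique.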

I know, of course, the proof is in the pudding, as they say, and a patch does wonders, but I am wondering if people have any initial thoughts on this. I think the performance upside can be significant in some common cases, especially once we work out issues w/ Lucene's clone method and in the case where the SinkTokenizer is not keeping all tokens.


-Grant
