Anyone have any thoughts on how best to integrate Lucene's new
SinkTokenizer and TeeTokenFilter (https://issues.apache.org/jira/browse/LUCENE-1058
and
http://www.gossamer-threads.com/lists/lucene/java-dev/55927?search_string=TeeTokenFilter;#55927)
into Solr? They don't fit the TokenizerFactory and
TokenFilterFactory create() model, since their constructors
depend on things other than a Reader or a TokenStream.
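
To make the mismatch concrete, here is a rough, self-contained sketch of the tee/sink idea. The types below (Stream, Sink, Tee) are stand-ins I made up for illustration, not the actual Lucene classes, and tokens are plain Strings. The point is only that the tee needs a second constructor argument (the sink), which a factory that is handed nothing but a Reader or TokenStream has no way to supply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in types for illustration only; these are NOT the Lucene classes.
public class TeeSinkSketch {

    /** Minimal stand-in for a token stream: next() returns null when exhausted. */
    interface Stream {
        String next();
    }

    /** Plays the role of a sink: buffers tokens and replays them later. */
    static class Sink implements Stream {
        private final List<String> buffer = new ArrayList<>();
        private Iterator<String> replay;

        void add(String token) {
            buffer.add(token);
        }

        @Override
        public String next() {
            // Replay starts on the first call, after the tee has been drained.
            if (replay == null) {
                replay = buffer.iterator();
            }
            return replay.hasNext() ? replay.next() : null;
        }
    }

    /** Plays the role of a tee: passes tokens through and copies each one to the sink. */
    static class Tee implements Stream {
        private final Stream input;
        private final Sink sink;

        // The sink argument is the extra dependency a create(TokenStream)-style
        // factory cannot provide on its own.
        Tee(Stream input, Sink sink) {
            this.input = input;
            this.sink = sink;
        }

        @Override
        public String next() {
            String token = input.next();
            if (token != null) {
                sink.add(token);
            }
            return token;
        }
    }

    /** Splits text on whitespace; a stand-in for a whitespace tokenizer. */
    static Stream whitespace(String text) {
        Iterator<String> it = Arrays.asList(text.trim().split("\\s+")).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Consumes a stream fully and returns its tokens. */
    static List<String> drain(Stream s) {
        List<String> out = new ArrayList<>();
        for (String t = s.next(); t != null; t = s.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Sink sink = new Sink();
        Stream source = new Tee(whitespace("the quick brown fox"), sink);
        System.out.println(drain(source)); // tokens for the first field
        System.out.println(drain(sink));   // same tokens, replayed without re-tokenizing
    }
}
```

Note that the sink is only useful after the tee has been consumed, which is where the ordering concern further down comes from.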
I can do one-off Analyzer constructions, but that doesn't really fit
with the Solr way. I think this could have some nice benefits for the
copyField mechanism, as well, but that is more work to get right.
My initial (half-baked?) thinking is that we need the ability to name
TokenStreams (Tokenizers and TokenFilters) so that we could do
something like:
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- in this example, we will only use synonyms at query time
  <filter class="my.TokenFilterFactory" name="step1"/>
  -->
  <filter class="my.next.TFF" name="step2"/>
  <filter class="my.other.TFF" name="step3"/>
</analyzer>
Thus, each of the named filters creates a TeeTokenFilter and has an
associated SinkTokenizer. Then, I can declare another analyzer that
looks like:
<analyzer type="index">
  <tokenizer name="step2"/>
</analyzer>
which would just use the tokens saved by step2 in the first
analyzer. Similarly, we could do that for step3 with some other
filters added, like:
<analyzer type="index">
  <tokenizer name="step3"/>
  <filter class="solr.StopFilterFactory"/>
</analyzer>
Now, Solr would need to be smart about this and know that it has to
index the fields using the first analyzer before those using the
sinks. There might also be some concerns about what to do if multiple
fields use the same "Tee" analyzer and whether that affects the
sinks. The "name" attribute, of course, would be optional. There
is also the issue of initialization: we would most likely need
two-pass initialization so that the names of the token streams are
known ahead of time.
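
As a rough illustration of the two-pass idea and the ordering constraint, something like the following might work. Everything here is hypothetical (AnalyzerDecl, indexingOrder, and the field names are invented for the sketch, nothing like this exists in Solr today), and it assumes an analyzer that reads from a sink does not itself publish names, so a simple "tees first, sinks second" order is enough.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Solr or Lucene.
public class TwoPassInitSketch {

    /** One analyzer declaration: the stream names it publishes and the one it reads (or null). */
    static class AnalyzerDecl {
        final String field;
        final List<String> publishes;
        final String consumes;

        AnalyzerDecl(String field, List<String> publishes, String consumes) {
            this.field = field;
            this.publishes = publishes;
            this.consumes = consumes;
        }
    }

    /** Orders fields so that fields reading a named stream come after all other fields. */
    static List<String> indexingOrder(List<AnalyzerDecl> decls) {
        // Pass 1: collect every published name before resolving any reference.
        Map<String, String> ownerByName = new HashMap<>();
        for (AnalyzerDecl d : decls) {
            if (d.consumes != null && !d.publishes.isEmpty()) {
                throw new IllegalArgumentException(
                    "chained sinks are out of scope for this sketch: " + d.field);
            }
            for (String name : d.publishes) {
                if (ownerByName.put(name, d.field) != null) {
                    throw new IllegalArgumentException("duplicate stream name: " + name);
                }
            }
        }

        // Pass 2: resolve references now that all names are known,
        // regardless of the order in which analyzers were declared.
        List<String> teeFields = new ArrayList<>();   // fields that read no sink
        List<String> sinkFields = new ArrayList<>();  // fields that read a named stream
        for (AnalyzerDecl d : decls) {
            if (d.consumes == null) {
                teeFields.add(d.field);
            } else if (ownerByName.containsKey(d.consumes)) {
                sinkFields.add(d.field);
            } else {
                throw new IllegalArgumentException("unknown stream name: " + d.consumes);
            }
        }

        List<String> order = new ArrayList<>(teeFields);
        order.addAll(sinkFields);
        return order;
    }

    public static void main(String[] args) {
        // The sink-based field is declared first on purpose.
        List<AnalyzerDecl> decls = Arrays.asList(
            new AnalyzerDecl("text_stopped", Collections.<String>emptyList(), "step3"),
            new AnalyzerDecl("text", Arrays.asList("step2", "step3"), null),
            new AnalyzerDecl("text_raw", Collections.<String>emptyList(), "step2"));
        System.out.println(indexingOrder(decls));
    }
}
```

This says nothing about the harder question of several fields sharing the same "Tee" analyzer; it only shows why the names have to be collected before any reference can be resolved.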
I know, of course, the proof is in the pudding, as they say, and a
patch does wonders, but I am wondering if people have any initial
thoughts on this. I think the performance upside can be significant
in some common cases, especially once we work out issues with Lucene's
clone method and in the case where the SinkTokenizer is not keeping
all tokens.
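
For the "not keeping all tokens" case, the idea is roughly the following. Again, this is a made-up stand-in (SelectiveSink and teeInto are not Lucene classes or methods), with Strings for tokens and an arbitrary "starts with an uppercase letter" rule chosen only for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Stand-in sketch only; not the Lucene SinkTokenizer.
public class SelectiveSinkSketch {

    /** A sink that buffers only the tokens accepted by a caller-supplied rule. */
    static class SelectiveSink {
        private final Predicate<String> keep;
        private final List<String> buffer = new ArrayList<>();

        SelectiveSink(Predicate<String> keep) {
            this.keep = keep;
        }

        void add(String token) {
            if (keep.test(token)) {
                buffer.add(token);
            }
        }

        List<String> tokens() {
            return buffer;
        }
    }

    /** Passes every token through unchanged while offering each one to the sink. */
    static List<String> teeInto(List<String> tokens, SelectiveSink sink) {
        List<String> passedThrough = new ArrayList<>();
        for (String t : tokens) {
            sink.add(t);
            passedThrough.add(t);
        }
        return passedThrough;
    }

    public static void main(String[] args) {
        // Arbitrary rule for the example: keep tokens that start with an uppercase letter.
        SelectiveSink sink = new SelectiveSink(
            t -> !t.isEmpty() && Character.isUpperCase(t.charAt(0)));
        List<String> all = teeInto(Arrays.asList("Grant", "posted", "to", "Solr", "dev"), sink);
        System.out.println(all);           // every token, for the main field
        System.out.println(sink.tokens()); // only the tokens the rule accepted
    }
}
```

The second field then only ever sees the small subset, so it skips both the re-tokenizing and the buffering of tokens it would have thrown away anyway.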
-Grant