Several times now, I’ve had to come up with work-arounds for a TokenStream
not knowing whether it’s processing the first value or a subsequent value
of a multi-valued field.  Two of these times, the use-case was ensuring
the first position of each value started at a multiple of 1000 (or some
other configurable value), and the third was encoding sentence paragraph
counters (similar to a do-it-yourself position increment).
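
To make the first use-case concrete, here is a rough sketch of the
arithmetic in plain Java.  It is not Lucene code; the class and method
names are made up for illustration, and it assumes positions are
zero-based with -1 meaning "no token emitted yet".

```java
public class ValueStartPositions {
    /**
     * Hypothetical helper: given the position of the last token emitted so
     * far (-1 if none), return the position at which the next value should
     * start so that it lands on a multiple of 'multiple'.
     */
    public static int nextValueStart(int lastPosition, int multiple) {
        // floorDiv keeps the -1 "nothing emitted yet" case mapping to 0.
        return (Math.floorDiv(lastPosition, multiple) + 1) * multiple;
    }

    public static void main(String[] args) {
        System.out.println(nextValueStart(-1, 1000));   // first value
        System.out.println(nextValueStart(4, 1000));    // second value
        System.out.println(nextValueStart(1000, 1000)); // third value
    }
}
```

The catch is that a TokenStream needs to know whether it is on the first
value before it can decide whether any of this applies.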

The work-arounds are awkward and hacky.  For example, if you’re in control
of your Tokenizer, you can prefix subsequent values with a special flag,
and then do the right thing in reset().  But then the highlighter, or
value retrieval in general, is impacted.  It’s also possible to create the
fields with the constructor that accepts a TokenStream that you’ve told
whether it’s the first or a subsequent value, but it’s awkward going that
route, and sometimes (e.g. Solr) it’s hard to know all the values you have
up-front to even do that.
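
A minimal sketch of the prefix-flag work-around, in plain Java rather than
against the real Tokenizer API.  The marker character and all names here
are hypothetical; the point is only to show where the flag ends up.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixFlagSketch {
    // Hypothetical marker; any character that cannot occur in real text.
    static final char SUBSEQUENT_MARKER = '\u0001';

    /** Index-time side: flag every value after the first one. */
    public static List<String> markValues(List<String> values) {
        List<String> marked = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            String v = values.get(i);
            marked.add(i == 0 ? v : SUBSEQUENT_MARKER + v);
        }
        return marked;
    }

    /** What a tokenizer would have to check when it is reset. */
    public static boolean isFirstValue(String raw) {
        return raw.isEmpty() || raw.charAt(0) != SUBSEQUENT_MARKER;
    }

    /** What every consumer of the raw value then has to do as well. */
    public static String stripMarker(String raw) {
        return isFirstValue(raw) ? raw : raw.substring(1);
    }
}
```

Because the marker is part of the value itself, anything else that reads
the value has to strip it too, and character offsets into a flagged value
are shifted by one.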

It would be nice if TokenStream.reset() took a boolean ‘first’ argument.
Such a change would obviously be backwards incompatible.  Simply
overloading the method to call the no-arg version is problematic because
TokenStreams are a chain, and it would likely result in the chain getting
doubly-reset.
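
Here is a toy model of the double-reset problem.  Stream and Filter are
simplified stand-ins I made up for TokenStream and TokenFilter, not the
real classes, and reset(boolean) is the hypothetical overload.  The sketch
assumes a filter has to forward the flag to its input and also run its
existing no-arg reset() so that legacy overrides still fire.

```java
public class DoubleResetSketch {
    /** Stand-in for a TokenStream. */
    static class Stream {
        int resetCount = 0;

        void reset() { resetCount++; }

        // Hypothetical overload that simply calls the no-arg version.
        void reset(boolean first) { reset(); }
    }

    /** Stand-in for a TokenFilter, which resets the stream it wraps. */
    static class Filter extends Stream {
        final Stream input;

        Filter(Stream input) { this.input = input; }

        @Override
        void reset() {
            super.reset();
            input.reset();          // existing behaviour: reset the chain
        }

        @Override
        void reset(boolean first) {
            input.reset(first);     // forward the flag down the chain
            reset();                // keep legacy overrides working,
                                    // which resets the input a second time
        }
    }

    /** Resets a one-filter chain once; returns how often the source was reset. */
    public static int resetsSeenBySource() {
        Stream source = new Stream();
        Stream chain = new Filter(source);
        chain.reset(true);
        return source.resetCount;
    }

    public static void main(String[] args) {
        System.out.println(resetsSeenBySource()); // prints 2
    }
}
```

Each extra filter in the chain makes it worse, since every level repeats
the same double call.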

Any ideas?

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley
