Several times now, I’ve had to come up with work-arounds for a TokenStream not knowing it’s processing the first value or a subsequent-value of a multi-valued field. Two of these times, the use-case was ensuring the first position of each value started at a multiple of 1000 (or some other configurable value), and the third was encoding sentence paragraph counters (similar to a do-it-yourself position increment).
The work-arounds are awkward and hacky. For example if you’re in control of your Tokenizer, you can prefix subsequent values with a special flag, and then do the right think in reset(). But then the highlighter or value retrieval in general is impacted. It’s also possible to create the fields with the constructor that accepts a TokenStream that you’ve told it’s the first or subsequent value but it’s awkward going that route, and sometimes (e.g. Solr) it’s hard to know all the values you have up-front to even do that. It would be nice if TokenStream.reset() took a boolean ‘first’ argument. Such a change would obviously be backwards incompatible. Simply overloading the method to call the no-arg version is problematic because TokenStreams are a chain, and it would likely result in the chain getting doubly-reset. Any ideas? ~ David Smiley Freelance Apache Lucene/Solr Search Consultant/Developer http://www.linkedin.com/in/davidwsmiley