inconsistency of tokenstream.end() with OffsetLimitTokenFilter and
LimitTokenCountFilter
----------------------------------------------------------------------------------------
Key: LUCENE-3088
URL: https://issues.apache.org/jira/browse/LUCENE-3088
Project: Lucene - Java
Issue Type: Bug
Reporter: Robert Muir
In LUCENE-3064, we added some state and checks to MockTokenizer to validate
that consumers are properly using the TokenStream workflow (described here:
http://lucene.apache.org/java/3_0_3/api/core/org/apache/lucene/analysis/TokenStream.html).
One inconsistency involves the following steps of that workflow:
4. The consumer calls incrementToken() until it returns false consuming the
attributes after each call.
5. The consumer calls end() so that any end-of-stream operations can be
performed.
In the case of these limiting filters, end() is called on the Tokenizer *before*
its incrementToken() has returned false. This is a little strange for a few
reasons: one is that the tokenizer might not even be "ready" for end(), e.g. it
might be coded so that end() only works correctly once it's entirely consumed.
The other problem, of course, is that the finalOffset, the general use of end(),
will most often be wrong in this case, so multi-valued field highlighting will
not work.
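
To make the problem concrete, here is a minimal self-contained sketch. It does
not use Lucene: Stream, CheckingTokenizer and LimitCountFilter are hypothetical
stand-ins that only mimic the incrementToken()/end() contract, and the
tokenizer is assumed to report its current read position as the final offset.

```java
public class EarlyEndDemo {
    /** Hypothetical stand-in for a token stream; not Lucene's actual API. */
    interface Stream {
        boolean incrementToken();
        void end();
    }

    /** Whitespace tokenizer stand-in that records whether end() arrived before exhaustion. */
    static class CheckingTokenizer implements Stream {
        private final String text;
        private int pos = 0;
        private boolean exhausted = false;
        boolean endCalledEarly = false;
        int finalOffset = -1;

        CheckingTokenizer(String text) { this.text = text; }

        @Override
        public boolean incrementToken() {
            while (pos < text.length() && text.charAt(pos) == ' ') pos++;  // skip separators
            if (pos >= text.length()) { exhausted = true; return false; }
            while (pos < text.length() && text.charAt(pos) != ' ') pos++;  // consume one word
            return true;
        }

        @Override
        public void end() {
            endCalledEarly = !exhausted;  // true if incrementToken() never returned false
            finalOffset = pos;            // assumption: final offset = how far we have read
        }
    }

    /** Limiting filter stand-in: stops after maxTokens and passes end() straight through. */
    static class LimitCountFilter implements Stream {
        private final Stream input;
        private final int maxTokens;
        private int count = 0;

        LimitCountFilter(Stream input, int maxTokens) {
            this.input = input;
            this.maxTokens = maxTokens;
        }

        @Override
        public boolean incrementToken() {
            if (count < maxTokens && input.incrementToken()) { count++; return true; }
            return false;  // may return false while the input still has tokens
        }

        @Override
        public void end() { input.end(); }
    }

    /** Steps 4 and 5 of the workflow: consume until false, then call end(). */
    static void consume(Stream s) {
        while (s.incrementToken()) { /* a real consumer would read attributes here */ }
        s.end();
    }

    public static void main(String[] args) {
        String text = "one two three four";
        CheckingTokenizer tok = new CheckingTokenizer(text);
        consume(new LimitCountFilter(tok, 2));
        System.out.println("end() called early: " + tok.endCalledEarly);
        System.out.println("final offset: " + tok.finalOffset + " of " + text.length());
    }
}
```

The consumer follows the workflow correctly, yet the tokenizer sees end() after
only two tokens and reports a final offset of 7 rather than 18.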
We should probably figure out a way to address the inconsistency. Some ideas
are:
# fixing the javadocs, perhaps documenting that end() could be called at any
time, and accepting the fact that the finalOffset will be wrong.
# the limiting filters could consume the rest of the tokens in a while
(incrementToken()) loop to ensure totally proper behavior.
# the limiting filters could do something tricky like override end() so that
it's not invoked on the Tokenizer in a surprising state. This is still evil but
perhaps less evil than calling it "out of order".
# ...
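
The second and third ideas could look roughly like the sketch below. As before,
this is not Lucene code: Stream, CheckingTokenizer, DrainingLimitFilter and
NonPropagatingLimitFilter are hypothetical stand-ins, and the tokenizer is
assumed to report its read position as the final offset.

```java
public class LimitFilterOptions {
    /** Hypothetical stand-in for a token stream; not Lucene's actual API. */
    interface Stream {
        boolean incrementToken();
        void end();
    }

    /** Whitespace tokenizer stand-in that records how end() was reached. */
    static class CheckingTokenizer implements Stream {
        private final String text;
        private int pos = 0;
        private boolean exhausted = false;
        boolean endCalled = false;
        boolean endCalledEarly = false;
        int finalOffset = -1;

        CheckingTokenizer(String text) { this.text = text; }

        @Override
        public boolean incrementToken() {
            while (pos < text.length() && text.charAt(pos) == ' ') pos++;  // skip separators
            if (pos >= text.length()) { exhausted = true; return false; }
            while (pos < text.length() && text.charAt(pos) != ' ') pos++;  // consume one word
            return true;
        }

        @Override
        public void end() {
            endCalled = true;
            endCalledEarly = !exhausted;
            finalOffset = pos;  // assumption: final offset = how far we have read
        }
    }

    /** Idea 2: after the limit is hit, drain the input so end() sees an exhausted tokenizer. */
    static class DrainingLimitFilter implements Stream {
        private final Stream input;
        private final int maxTokens;
        private int count = 0;

        DrainingLimitFilter(Stream input, int maxTokens) {
            this.input = input;
            this.maxTokens = maxTokens;
        }

        @Override
        public boolean incrementToken() {
            if (count < maxTokens) {
                if (input.incrementToken()) { count++; return true; }
                return false;  // input ran out before the limit
            }
            while (input.incrementToken()) { /* discard tokens past the limit */ }
            return false;
        }

        @Override
        public void end() { input.end(); }
    }

    /** Idea 3: only forward end() if the input itself returned false. */
    static class NonPropagatingLimitFilter implements Stream {
        private final Stream input;
        private final int maxTokens;
        private int count = 0;
        private boolean inputExhausted = false;

        NonPropagatingLimitFilter(Stream input, int maxTokens) {
            this.input = input;
            this.maxTokens = maxTokens;
        }

        @Override
        public boolean incrementToken() {
            if (count >= maxTokens) return false;
            if (input.incrementToken()) { count++; return true; }
            inputExhausted = true;
            return false;
        }

        @Override
        public void end() {
            if (inputExhausted) input.end();  // otherwise the tokenizer never sees end()
        }
    }

    static void consume(Stream s) {
        while (s.incrementToken()) { /* a real consumer would read attributes here */ }
        s.end();
    }

    public static void main(String[] args) {
        CheckingTokenizer a = new CheckingTokenizer("one two three four");
        consume(new DrainingLimitFilter(a, 2));
        System.out.println("draining: early=" + a.endCalledEarly + " finalOffset=" + a.finalOffset);

        CheckingTokenizer b = new CheckingTokenizer("one two three four");
        consume(new NonPropagatingLimitFilter(b, 2));
        System.out.println("non-propagating: endCalled=" + b.endCalled + " finalOffset=" + b.finalOffset);
    }
}
```

The draining variant gets a correct final offset at the cost of tokenizing the
whole input. The non-propagating variant avoids the out-of-order call, but in
this sketch the final offset is then never set at all.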