[
https://issues.apache.org/jira/browse/LUCENE-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250666#comment-14250666
]
David Smiley commented on LUCENE-6033:
--------------------------------------
Hi Rob.
I think it's easier to make the case that CachingTokenFilter should have been
propagating reset() from it's fillCache() all along, and thus you would then
use CachingTokenFilter in a more normal way -- wrap it and call reset() then
increment in a loop, etc., instead of knowing you need to reset() on what it
wraps but not this token filter itself. That's weird. It's ab-normal for a
TokenFilter to never propagate reset, so every user of CachingTokenFilter to
date has worked around this by calling reset() on the underlying input
_instead_ of the final wrapping token filter (CachingTokenFilter in this case).
To be clear, CachingTokenFilter._reset()_ didn't and still doesn't with this
patch propagate reset(), it happens the one time it consumes the stream
indirectly via incrementToken().
The exact case that brought me here is as follows: DefaultSolrHighlighter has
a block of code activated when you pass "hl.usePhraseHighlighter" (around line
501). This block of code calls getPhraseHighlighter passing in a token stream
that may never actually be used by that method. This is an extension point for
subclassing, our shipped code doesn't use it at all. Prior to me doing
SOLR-6680, we'd always then pass the CachingTokenFilter further on into the
Highlighter. But unless getPhraseHighlighter actually uses the token stream,
doing this is a waste (needless caching of every token -- pretty bulky). So
with isCached() I can now see if it was used, and if not then toss the
CachingTokenFilter aside. The problem is that isCached() isn't enough here; I
overlooked it in SOLR-6680 (no test for this extension point). I was hoping to
simply declare that if you want to use this token stream, you need to call
reset() on it first. But CachingTokenFilter doesn't propagate the reset()! So
it won't get reset. I _could_ add a reset on the underlying stream before
calling getPhraseHighlighter but doing so would likely result in reset() being
called twice in a row when the caching isn't needed; Highlighter calls reset().
Test assertions trip when this happens, although I think in practice it's fine.
> Add CachingTokenFilter.isCached and switch LinkedList to ArrayList
> ------------------------------------------------------------------
>
> Key: LUCENE-6033
> URL: https://issues.apache.org/jira/browse/LUCENE-6033
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: David Smiley
> Assignee: David Smiley
> Fix For: 5.0, Trunk
>
> Attachments: LUCENE-6033.patch,
> LUCENE-6033_boolean_resetInput_option.patch
>
>
> CachingTokenFilter could use a simple boolean isCached() method implemented
> as-such:
> {code:java}
> /** If the underlying token stream was consumed and cached */
> public boolean isCached() {
> return cache != null;
> }
> {code}
> It's useful for the highlighting code to remove its wrapping of
> CachingTokenFilter if after handing-off to parts of its framework it turns
> out that it wasn't used.
> Furthermore, use an ArrayList, not a LinkedList. ArrayList is leaner when
> the token count is high, and this class doesn't manipulate the list in a way
> that might favor LL.
> A separate patch will come that actually uses this method.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]