[
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653893#action_12653893
]
Michael McCandless commented on LUCENE-1448:
--------------------------------------------
{quote}
What I'd like to work on soon is an efficient way to buffer attributes
(maybe add methods to attribute that write into a bytebuffer). Then
attributes can implement what variables need to be serialized and
which ones don't. In that case we could add a finalOffset to
OffsetAttribute that does not get serialized/deserialized.
{quote}
I like that (it'd make streams like CachingTokenFilter much more
efficient). It'd also presumably lead to more efficiently serialized
token streams.
But: you'd still need a way in this model to serialize finalOffset, once,
at the end?
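To make the idea concrete, here is a minimal sketch of what per-attribute serialization into a byte buffer could look like (the `SerializableAttribute` interface and `OffsetAttributeSketch` class are hypothetical illustrations, not Lucene's actual API):

```java
import java.nio.ByteBuffer;

// Hypothetical serialization hooks -- the real Lucene Attribute API has no
// such methods; this only illustrates the proposal above.
interface SerializableAttribute {
    void serialize(ByteBuffer out);
    void deserialize(ByteBuffer in);
}

// Sketch of an offset attribute that serializes only the per-token offsets.
// finalOffset is stream-level state, so it is deliberately left out of the
// per-token serialization -- which is why it would still need to be written
// once, separately, at the end of the stream.
class OffsetAttributeSketch implements SerializableAttribute {
    private int startOffset;
    private int endOffset;
    private int finalOffset; // not serialized per token

    void setOffset(int start, int end) { startOffset = start; endOffset = end; }
    void setFinalOffset(int f) { finalOffset = f; }
    int startOffset() { return startOffset; }
    int endOffset()   { return endOffset; }
    int finalOffset() { return finalOffset; }

    public void serialize(ByteBuffer out) {
        out.putInt(startOffset);
        out.putInt(endOffset);
    }

    public void deserialize(ByteBuffer in) {
        startOffset = in.getInt();
        endOffset = in.getInt();
    }
}
```

A caching filter could then buffer each token as a fixed-size record instead of cloning whole Token objects.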
{quote}
And possibly it might be worthwhile to have explicit states defined in
a TokenStream that we can enforce with three methods: start(),
increment(), end(). Then people would know that if they have to do
something at the end of a stream, they have to do it in end().
{quote}
This also seems good. So end() would be the obvious place to set
the OffsetAttribute.finalOffset,
PositionIncrementAttribute.positionIncrementGap, etc.
OK I'm gonna assign this one to you, Michael ;)
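As a sketch of that state machine (a hypothetical standalone class, not the real TokenStream API), end() would be the single well-defined place to record stream-final state such as finalOffset:

```java
// Hypothetical sketch of the proposed three-phase contract: start() before
// iteration, increment() once per token, end() exactly once after the last
// token. Stream-final state (here, finalOffset) is set only in end().
class LifecycleTokenStream {
    private final String[] tokens;
    private int pos = -1;
    private int charOffset = 0;
    private int finalOffset = -1; // unknown until end() is called
    private boolean started = false;
    private boolean ended = false;

    LifecycleTokenStream(String[] tokens) { this.tokens = tokens; }

    void start() {
        started = true;
        pos = -1;
        charOffset = 0;
    }

    boolean increment() {
        if (!started || ended) throw new IllegalStateException("not iterating");
        if (pos + 1 >= tokens.length) return false;
        pos++;
        // +1 models a single separator character (e.g. whitespace) per token.
        charOffset += tokens[pos].length() + 1;
        return true;
    }

    // The one place where end-of-stream work happens.
    void end() {
        ended = true;
        finalOffset = charOffset;
    }

    int getFinalOffset() { return finalOffset; }
}
```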
> add getFinalOffset() to TokenStream
> -----------------------------------
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch,
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a
> document, and you then index those fields with TermVectors storing offsets,
> it's very likely the offsets for all but the first field instance will be
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the
> offsets of each field instance, where that base is 1 + the endOffset of the
> last token it saw when analyzing that field.
> But this logic is overly simplistic. For example, if the WhitespaceAnalyzer
> is being used, and the text being analyzed ended in 3 whitespace characters,
> then that information is lost, and the next field's offsets are all 3
> too small. Similarly, if a StopFilter appears in the chain, and the last N
> tokens were stop words, then the base will be 1 + the endOffset of the last
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm
> thinking by default it returns -1, which means "I don't know, so you figure it
> out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the user's list.
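The faulty base computation described above can be illustrated with a small self-contained model (a simplified stand-in, not IndexWriter's actual code): a whitespace tokenizer reports each token's end offset, and the base applied to the next field instance is 1 + the last token's endOffset, so any trailing text the analyzer discarded is never counted.

```java
// Simplified model of the offset-base bug: trailing whitespace (or trailing
// stop words) after the last emitted token is invisible to the indexer,
// so the next field instance's offsets come out too small.
class OffsetBaseDemo {
    // End offset of the last token under whitespace tokenization.
    static int lastTokenEndOffset(String text) {
        int end = -1;
        int i = 0;
        while (i < text.length()) {
            if (!Character.isWhitespace(text.charAt(i))) {
                while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
                end = i;
            } else {
                i++;
            }
        }
        return end;
    }

    // The faulty cumulative base: 1 + endOffset of the last token seen.
    // Compare with text.length(), which accounts for all characters consumed.
    static int faultyBase(String fieldText) {
        return 1 + lastTokenEndOffset(fieldText);
    }
}
```

For a field value ending in trailing whitespace, faultyBase() is smaller than the text's actual length, which is exactly the discrepancy the issue describes.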
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]