Re: Incorrect Token Offset when using multiple fieldable instance

Michael McCandless Wed, 05 Mar 2008 03:10:23 -0800

Well, first off, sometimes the thing being indexed isn't a string, so you have no stringValue to get its length. It could be a Reader or a TokenStream.

Second off, it's conceivable that an analyzer computes its own "interesting" offsets that are not in fact simple indices into the stringValue, though I would expect that to be the exception not the rule.

I can't think of any other harm ... so if neither of these apply in your situation then it should be OK?

I do agree this seems like a bug. EG, if you use Highlighter on a multi-valued field indexed with stored field & term vectors and say the first field ended with a stop word that was filtered out, then your offsets will be off and the wrong parts will be highlighted in all but the first field (I think?). I think we really need some way for the tokenStream to "declare" its final offset at the end.


Mike
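
[Editor's sketch] The idea of a token stream that "declares" its final offset could look roughly like the following. This is plain Java, not Lucene code: the OffsetDeclaringStream interface, fixedStream, and consume are hypothetical names invented for illustration, and the assumption is that the indexer advances its base offset by whatever the stream declares rather than by the end of the last token it happened to see.

```java
import java.util.Arrays;
import java.util.Iterator;

final class FinalOffsetSketch {
    /** Hypothetical contract for illustration; this is not a Lucene interface. */
    interface OffsetDeclaringStream {
        /** Next surviving token as {start, end}, or null when exhausted. */
        int[] next();
        /** End of the consumed text, including anything that was filtered out. */
        int finalOffset();
    }

    /** Builds a stream over precomputed token spans plus a declared final offset. */
    static OffsetDeclaringStream fixedStream(final int[][] spans, final int finalOffset) {
        final Iterator<int[]> it = Arrays.asList(spans).iterator();
        return new OffsetDeclaringStream() {
            public int[] next() { return it.hasNext() ? it.next() : null; }
            public int finalOffset() { return finalOffset; }
        };
    }

    /** Drains the stream and returns the base offset for the next field value. */
    static int consume(OffsetDeclaringStream stream, int base) {
        while (stream.next() != null) {
            // A real indexer would record base + start and base + end here.
        }
        return base + stream.finalOffset();
    }

    public static void main(String[] args) {
        // "I like it." with the trailing stop word "it" and the dot filtered out.
        int[][] surviving = { {0, 1}, {2, 6} };
        System.out.println(consume(fixedStream(surviving, 10), 0));
    }
}
```

With such a contract the filtered trailing tokens no longer matter, because the next base offset comes from the declared value (10 here) and not from the last surviving token (which ends at 6).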

Renaud Delbru wrote:

Do you know if there will be side effects if we replace, in DocumentWriter$FieldData#invertField,
offset = offsetEnd+1;
by
offset = stringValue.length();
I still do not understand the reason for incrementing the start offset this way.
Regards.

Michael McCandless wrote:
This is how Lucene has worked for quite some time (since 1.9).
When there are multiple fields with the same name in one Document, each field's offset starts from the last offset (offset of the last token) seen in the previous field. If tokens are skipped at the end there's no way IndexWriter can know (because tokenStream doesn't return them). It's as if we need the ability to query a tokenStream for its "final" offset or something.
One workaround might be to insert an "end marker" token, with the true end offset, which is a term you would never search on?
Mike
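
[Editor's sketch] The "end marker" workaround might be simulated as below, outside of Lucene. Everything here is an assumption made for illustration: the Token class, the END_MARKER term, and the choice of a zero-length marker positioned at value.length(). The nextBase method mimics the offset = offsetEnd+1 rule quoted elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EndMarkerSketch {
    // Hypothetical term that nobody would ever search on.
    static final String END_MARKER = "__end_of_value__";

    /** Minimal stand-in for an analyzed token; not Lucene's Token class. */
    static final class Token {
        final String term;
        final int start;
        final int end;
        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Appends a zero-length marker token whose end offset is the true end of the value. */
    static List<Token> withEndMarker(List<Token> filtered, String value) {
        List<Token> out = new ArrayList<>(filtered);
        out.add(new Token(END_MARKER, value.length(), value.length()));
        return out;
    }

    /** Mimics "offset = offsetEnd + 1": next base is the last token's end plus one. */
    static int nextBase(int base, List<Token> tokens) {
        if (tokens.isEmpty()) {
            return base; // assumption: nothing seen, nothing to advance by
        }
        return base + tokens.get(tokens.size() - 1).end + 1;
    }

    public static void main(String[] args) {
        String value = "I like it."; // "it" and "." are assumed to be filtered out
        List<Token> filtered = Arrays.asList(new Token("i", 0, 1), new Token("like", 2, 6));
        System.out.println(nextBase(0, filtered));                       // 7: drifts back
        System.out.println(nextBase(0, withEndMarker(filtered, value))); // 11: length + 1
    }
}
```

Because the marker is always the last token, the base offset advances by the full value length plus one for every value, regardless of what the filters dropped at the end.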

Renaud Delbru wrote:
Hi,
I currently use multiple fieldable instances for indexing sentences of a document. When there is only a single fieldable instance, the token offset generation performed in DocumentWriter is correct. The problem appears when there are two or more fieldable instances. In the DocumentWriter$FieldData#invertField method, if the field is tokenized, instead of updating the offset attribute with stringValue.length() (which is what happens if the field is not tokenized, line 1458), you update the offset attribute with the end offset of the last token (line 1503: offset = offsetEnd+1;). As a consequence, if a token has been filtered out (for example a stopword, a dot, a space, etc.), the offset attribute is updated with the end offset of the last token that was not filtered. In this case, you store an incorrect offset in the offset attribute (the offset is shifted back), and all the following fieldable instances will have their offsets shifted back.
Is it a bug? Or is it the desired behavior (and in that case, why)?
--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/
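
[Editor's sketch] The drift described in the message above can be reproduced with a small stand-alone simulation. This is not the DocumentWriter code: the tokenizer, the stop list, and both update rules are simplified assumptions, and the length-based rule is assumed to accumulate (base + length) the way the non-tokenized branch is described.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OffsetDriftSketch {
    // Hypothetical stop list, standing in for whatever the analyzer filters.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "it"));

    /** Returns {start, end} spans (end exclusive) of tokens that survive filtering. */
    static List<int[]> tokenize(String value) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            // Skip separators and punctuation.
            while (i < value.length() && !Character.isLetterOrDigit(value.charAt(i))) i++;
            int start = i;
            while (i < value.length() && Character.isLetterOrDigit(value.charAt(i))) i++;
            if (i > start && !STOP_WORDS.contains(value.substring(start, i).toLowerCase())) {
                spans.add(new int[] {start, i});
            }
        }
        return spans;
    }

    /** Rule under discussion: next base = end of the last surviving token + 1. */
    static int nextBaseFromLastToken(int base, String value) {
        List<int[]> spans = tokenize(value);
        if (spans.isEmpty()) {
            return base; // assumption: no surviving tokens, base unchanged
        }
        return base + spans.get(spans.size() - 1)[1] + 1;
    }

    /** Proposed alternative (assumed to accumulate): next base = base + value length. */
    static int nextBaseFromLength(int base, String value) {
        return base + value.length();
    }

    public static void main(String[] args) {
        String first = "I like it."; // ends with a stop word and a dot
        System.out.println(nextBaseFromLastToken(0, first)); // 7
        System.out.println(nextBaseFromLength(0, first));    // 10
    }
}
```

For this value the two rules already disagree by three characters after the first instance, and the gap is carried into every following instance of the field.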

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



