Re: Token offset values for custom Tokenizer

Shahan Khatchadourian Mon, 16 Jul 2007 08:46:29 -0700

The issue continues to exist with nightly 146 from Jul 10, 2007.


http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/146/


Ard Schrijvers wrote:

Hello,

The issue is about lucene 1.9. Can you test it with lucene 2.2? Perhaps the 
issue is already addressed and solved...

Regards Ard
Thank you for the reply Ard,
The tokens exist in the index and are returned accurately, except for the offsets. In this case I am not dealing with the positions, so the term vector is specified as using 'with_offsets'. I have left the term position increment as its default. Looking at the existing token streams, they don't maintain knowledge of the current position; they always generate start offsets beginning at 0 of the current stream, and then a 'proper' offset is generated based on the +1 of the previous token that the DocumentWriter applies when indexing. Nor are there any test cases for offsets. I found a bug that was opened a while ago dealing with this issue (as well as a related one). It is:
https://issues.apache.org/jira/browse/LUCENE-579
I am retrieving a text token's offset values using TermPositionVector.getOffsets(), which returns TermVectorOffsetInfo[]. The same offset values that were placed into the token during indexing are not being returned; they have been shifted.
Thanks.
Shahan
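
One possible reading of the shift described above, offered as an assumption and not taken from the Lucene source: if the same field name is added to a document more than once, the offsets of each later instance may be rebased by the previous instance's last end offset plus one. The class and method names below are hypothetical and nothing here calls Lucene; it only models that arithmetic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the offset shift described in the thread.
// ASSUMPTION: a running base advances to (last end offset + 1) after each
// instance of a field; this is a guess at the behaviour, not Lucene code.
public class OffsetShiftSketch {

    // instances[i][j] = {start, end} of token j in field instance i,
    // as produced by a stream that restarts its offsets at 0.
    static int[][] storedOffsets(int[][][] instances) {
        List<int[]> out = new ArrayList<>();
        int base = 0;
        for (int[][] tokens : instances) {
            int lastEnd = -1;
            for (int[] t : tokens) {
                // Each token's offsets are shifted by the running base.
                out.add(new int[] { base + t[0], base + t[1] });
                lastEnd = t[1];
            }
            // Advance the base only if this instance produced tokens.
            if (lastEnd >= 0) {
                base += lastEnd + 1;
            }
        }
        return out.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        int[][][] twoInstances = {
            { {0, 3}, {4, 7} },
            { {0, 3}, {4, 7} }
        };
        for (int[] o : storedOffsets(twoInstances)) {
            System.out.println(o[0] + ", " + o[1]);
        }
    }
}
```

Under this assumption, the second instance's tokens would come back as 8, 11 and 12, 15 even though the stream emitted 0, 3 and 4, 7, which would look like a shift when read back from the term vector.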

Ard Schrijvers wrote:
Hello,
Hi,
I am storing custom values in the Tokens provided by a Tokenizer but
when retrieving them from the index the values don't match.
What do you mean by retrieving? Do you mean retrieving
terms, or do you mean doing a search with words you know should be in, but you do not find a match?
In the latter, you must make sure that you are using the
same analyzer for the search as you used for indexing.
I've looked in the LIA book but it's not current since it mentioned term vectors
aren't stored. I'm using Lucene Nightly 146 but the same thing has happened with older versions. Looking at the internals, DocumentWriter seems to keep track of the end offset that was placed into the index and modifies the token values (with +1) but I'm not sure whether I should be concerned with it. No existing analyzers are used when adding the document so all the offsets are generated manually.
Any suggestions of how the token offsets should be stored?
Look at other classes that implement TokenStream. Also take
a look at setPositionIncrement when you are putting in your own terms.
Regards Ard
Is this valid?
Token, start, end
aaa, 0, 3
bbb, 4, 7
ccc, 8, 11

Thanks,
Shahan
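
For reference, the table in the question matches the usual convention where the start offset is the index of a token's first character and the end offset is one past its last character. A minimal sketch, assuming whitespace-separated input and using a hypothetical Tok holder instead of Lucene's Token class, reproduces those numbers from the text "aaa bbb ccc":

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Lucene-free sketch: derives start/end character offsets
// for whitespace-separated tokens.
public class OffsetSketch {

    // Illustrative holder only; not a Lucene class.
    static final class Tok {
        final String text;
        final int start; // index of the first character
        final int end;   // one past the index of the last character

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + ", " + start + ", " + end;
        }
    }

    // Splits on whitespace and records offsets into the original text.
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip any run of whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new Tok(text.substring(start, i), start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("aaa bbb ccc")) {
            System.out.println(t);
        }
    }
}
```

Running main prints the same three rows as the table: aaa, 0, 3; bbb, 4, 7; ccc, 8, 11. Each end offset minus its start offset equals the token length, and the gap of one between an end and the next start is the single separating space.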
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
