The issue continues to exist with nightly 146 from Jul 10, 2007.

http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/146/


Ard Schrijvers wrote:
Hello,

The issue is about lucene 1.9. Can you test it with lucene 2.2? Perhaps the 
issue is already addressed and solved...

Regards Ard

Thank you for the reply Ard,

The tokens exist in the index and are returned accurately, except for the offsets. In this case I am not dealing with positions, so the term vector is specified as 'with_offsets'. I have left the term position increment at its default. Looking at the existing TokenStreams, they don't maintain knowledge of the current position; they always generate start offsets beginning at 0 for the current stream, and a 'proper' offset is then generated from the +1 that DocumentWriter applies to the previous token when indexing. Nor are there any test cases for offsets. I found a bug that was opened a while ago dealing with this issue (as well as a related one). It is:
https://issues.apache.org/jira/browse/LUCENE-579

I am retrieving a text token's offset values using TermPositionVector.getOffsets(), which returns TermVectorOffsetInfo[]. The offset values that were placed into the token during indexing are not being returned; they have been shifted.
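
To make the shift concrete, here is a minimal, Lucene-free sketch of the behavior as I understand it. It assumes that each stream reports offsets relative to its own start and that the writer adds a running base, advanced to (last end offset + 1) after every stream indexed under the same field name. The names OffsetShiftSketch, Span, and storedOffsets are hypothetical and are not part of the Lucene API; this models the description above and is not DocumentWriter's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the offset shift; none of these names exist in Lucene.
final class OffsetShiftSketch {

    /** A token's character span as a tokenizer would report it. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return start + ".." + end;
        }
    }

    /**
     * ASSUMPTION: the writer keeps a running base offset per field name and
     * adds it to every token. After each stream the base advances by the
     * last token's end offset plus one.
     */
    static List<Span> storedOffsets(List<List<Span>> streamsForOneField) {
        List<Span> stored = new ArrayList<Span>();
        int base = 0;
        for (List<Span> stream : streamsForOneField) {
            Span last = null;
            for (Span s : stream) {
                stored.add(new Span(base + s.start, base + s.end));
                last = s;
            }
            if (last != null) {
                base += last.end + 1; // the "+1" mentioned above
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        // Two values added under the same field name; each starts at offset 0.
        List<Span> first = Arrays.asList(new Span(0, 3), new Span(4, 7));
        List<Span> second = Arrays.asList(new Span(0, 3));
        System.out.println(storedOffsets(Arrays.asList(first, second)));
    }
}
```

Under that assumption, the single token of the second value would come back as 8..11 rather than the 0..3 that was supplied, which is the kind of shift I am seeing.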
Thanks.
Shahan

Ard Schrijvers wrote:
Hello,

Hi,
I am storing custom values in the Tokens provided by a Tokenizer, but when retrieving them from the index the values don't match.
What do you mean by retrieving? Do you mean retrieving terms, or do you mean doing a search with words you know should be in the index, but for which you do not find a match?
In the latter case, you must make sure that you are using the same analyzer for the search as you used for indexing.
I've looked in the LIA book, but it's not current since it mentions that term vectors aren't stored. I'm using Lucene Nightly 146, but the same thing has happened with older versions. Looking at the internals, DocumentWriter seems to keep track of the end offset that was placed into the index and modifies the token values (with +1), but I'm not sure whether I should be concerned with it. No existing analyzers are used when adding the document, so all the offsets are generated manually.
Any suggestions of how the token offsets should be stored?

Look at other classes that implement TokenStream. Also take a look at setPositionIncrement when you are putting in your own terms.
Regards Ard

Is this valid?
Token, start, end
aaa, 0, 3
bbb, 4, 7
ccc, 8, 11
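
For reference, the table above can be reproduced with a small plain-Java sketch. It assumes the source text is "aaa bbb ccc" (my assumption; the text itself is not shown above) and that the start offset is the index of the token's first character while the end offset is one past its last character. WhitespaceOffsets and Tok are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits on whitespace and records character offsets.
final class WhitespaceOffsets {

    static final class Tok {
        final String text;
        final int start; // index of the first character
        final int end;   // one past the last character

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + ", " + start + ", " + end;
        }
    }

    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<Tok>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            // Skip any run of whitespace before the next token.
            while (i < n && Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token's characters.
            while (i < n && !Character.isWhitespace(input.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new Tok(input.substring(start, i), start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("aaa bbb ccc")) {
            System.out.println(t);
        }
    }
}
```

With that input, each token's end minus start equals its length, and the gap of one between an end offset and the next start offset is the separating space.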

Thanks,
Shahan


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

