The issue continues to exist with nightly 146 from Jul 10, 2007.
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/146/
Ard Schrijvers wrote:
Hello,
The issue is about lucene 1.9. Can you test it with lucene 2.2? Perhaps the
issue is already addressed and solved...
Regards Ard
Thank you for the reply Ard,
The tokens exist in the index and are returned accurately, except for
the offsets. In this case I am not dealing with positions, so the
term vector is specified as 'with_offsets'. I have left the term
position increment at its default. Looking at the existing
TokenStreams, they don't maintain knowledge of the current position:
they always generate start offsets beginning at 0 for the current
stream, and then a 'proper' offset is generated based on the +1 past
the previous token that the DocumentWriter applies when indexing. Nor
are there any test cases for offsets. I found a bug that was opened a
while ago dealing with this issue (as well as a related one). It is:
https://issues.apache.org/jira/browse/LUCENE-579
I am retrieving a text token's offset values using
TermPositionVector.getOffsets(), which returns TermVectorOffsetInfo[].
The offset values that were placed into the token during indexing
are not the ones being returned; they have been shifted.
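
To make the shift described above concrete, here is a toy model in plain Java. It is only a sketch of the behaviour as described in this message, not Lucene's DocumentWriter code: it assumes that each token stream for a field reports offsets starting at 0, and that the indexer adds a running base equal to the previous stream's last end offset plus 1. The class and method names (OffsetShiftSketch, storedOffsets) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetShiftSketch {
    // Hypothetical model: each inner array is one token stream for the same
    // field, with tokens given as {start, end} relative to that stream.
    static List<int[]> storedOffsets(int[][][] streams) {
        List<int[]> stored = new ArrayList<>();
        int base = 0; // running offset carried across streams (assumed behaviour)
        for (int[][] stream : streams) {
            int lastEnd = 0;
            for (int[] tok : stream) {
                // The stored offset is the token's own offset plus the running base.
                stored.add(new int[] {base + tok[0], base + tok[1]});
                lastEnd = tok[1];
            }
            // Assumed rule: the next stream starts one past the last end offset.
            if (stream.length > 0) {
                base += lastEnd + 1;
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        int[][][] streams = {
            {{0, 3}, {4, 7}}, // first value of the field
            {{0, 3}}          // second value, offsets restart at 0
        };
        for (int[] o : storedOffsets(streams)) {
            System.out.println(o[0] + ", " + o[1]);
        }
    }
}
```

Under these assumptions the second stream's token, supplied as 0 to 3, comes back as 8 to 11, which is the kind of shift being reported.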
Thanks.
Shahan
Ard Schrijvers wrote:
Hello,
Hi,
I am storing custom values in the Tokens provided by a Tokenizer, but
when retrieving them from the index the values don't match.
What do you mean by retrieving? Do you mean retrieving
terms, or do you mean doing a search with words you know
should be in the index, but not finding a match?
In the latter case, you must make sure that you are using the
same analyzer for the search as you used for indexing.
I've looked in the LIA book but it's not current, since it mentioned
that term vectors aren't stored. I'm using Lucene Nightly 146 but the
same thing has happened with older versions. Looking at the internals,
DocumentWriter seems to keep track of the end offset that was placed
into the index and modifies the token values (with +1), but I'm not
sure whether I should be concerned with it.
No existing analyzers are used when adding the document so all the
offsets are generated manually.
Any suggestions of how the token offsets should be stored?
Look at other classes that implement TokenStream. Also take
a look at setPositionIncrement when you are putting in your own terms.
Regards Ard
Is this valid?
Token, start, end
aaa, 0, 3
bbb, 4, 7
ccc, 8, 11
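
For reference, the offsets in the table above are what you get when start is the index of a token's first character and end is one past its last character, for the text "aaa bbb ccc" split on whitespace. The plain-Java sketch below reproduces them; it does not use Lucene, and the class and method names (TokenOffsetSketch, offsets) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TokenOffsetSketch {
    // Hypothetical helper: returns {start, end} for each whitespace-separated
    // token, where end is exclusive (one past the token's last character).
    static List<int[]> offsets(String text) {
        List<int[]> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip any whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Advance to the end of the token.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                result.add(new int[] {start, i});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "aaa bbb ccc";
        for (int[] o : offsets(text)) {
            System.out.println(text.substring(o[0], o[1]) + ", " + o[0] + ", " + o[1]);
        }
    }
}
```

With this convention, text.substring(start, end) returns the token text, which is a quick way to sanity-check offsets before they go into the index.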
Thanks,
Shahan
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]