Hello,

The issue is about Lucene 1.9. Can you test it with Lucene 2.2? Perhaps the issue is already addressed and solved...
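
For testing, here is a minimal sketch of the shift described in the quoted message below, written in plain Java without any Lucene classes. It assumes, as the thread suggests but does not confirm, that when the same field name is added more than once, the stored offsets of each later instance are moved by the previous instance's last end offset plus one. The names OffsetShiftSketch, Tok and storedOffsets are hypothetical and are not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OffsetShiftSketch {

    // Hypothetical stand-in for a token with character offsets (not Lucene's Token).
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed model: offsets of each field instance are stored relative to a
    // running base, and the base grows by (last end offset + 1) per instance.
    static List<int[]> storedOffsets(List<List<Tok>> fieldInstances) {
        List<int[]> stored = new ArrayList<>();
        int base = 0;
        for (List<Tok> instance : fieldInstances) {
            Tok last = null;
            for (Tok t : instance) {
                stored.add(new int[] { base + t.start, base + t.end });
                last = t;
            }
            if (last != null) {
                base += last.end + 1;
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        // Token offsets taken from the question in the thread.
        List<Tok> tokens = Arrays.asList(
                new Tok("aaa", 0, 3), new Tok("bbb", 4, 7), new Tok("ccc", 8, 11));
        // The same field content added twice under one field name.
        List<List<Tok>> twice = Arrays.asList(tokens, tokens);
        for (int[] o : storedOffsets(twice)) {
            System.out.println(o[0] + "-" + o[1]);
        }
    }
}
```

Under this assumption, a single field instance would come back unchanged (0-3, 4-7, 8-11), while a second instance of the same field would come back as 12-15, 16-19 and 20-23. Comparing that against what TermPositionVector.getOffsets() returns on 1.9 and on 2.2 should show whether the shift you see has the same cause.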
Regards Ard

> Thank you for the reply Ard,
>
> The tokens exist in the index and are returned accurately, except for
> the offsets. In this case I am not dealing with the positions, so the
> term vector is specified as 'with_offsets'. I have left the term
> position increment at its default. Looking at the existing token
> streams, they don't maintain knowledge of the current position; they
> always generate start offsets beginning at 0 of the current stream, and
> then a 'proper' offset is generated from the +1 of the previous token
> that the DocumentWriter applies when indexing. Nor are there any test
> cases for offsets. I found a bug that was opened a while ago dealing
> with this issue (as well as a related one). It is:
> https://issues.apache.org/jira/browse/LUCENE-579
>
> I am retrieving a text token's offset values using
> TermPositionVector.getOffsets(), which returns TermVectorOffsetInfo[].
> The same offset values that were placed into the token during indexing
> are not being returned; they have been shifted.
> Thanks.
> Shahan
>
> Ard Schrijvers wrote:
> > Hello,
> >
> >> Hi,
> >> I am storing custom values in the Tokens provided by a Tokenizer but
> >> when retrieving them from the index the values don't match.
> >
> > What do you mean by retrieving? Do you mean retrieving terms, or do
> > you mean doing a search with words you know should be in the index,
> > but for which you do not find a match?
> >
> > In the latter case, you must make sure that you are using the same
> > analyzer for the search as you used for indexing.
> >
> >> I've looked in the LIA book but it's not current since it mentioned
> >> term vectors aren't stored. I'm using Lucene Nightly 146 but the
> >> same thing has happened with older versions. Looking at the
> >> internals, DocumentWriter seems to keep track of the end offset that
> >> was placed into the index and modifies the token values (with +1)
> >> but I'm not sure whether I should be concerned with it.
> >> No existing analyzers are used when adding the document so all the
> >> offsets are generated manually.
> >> Any suggestions of how the token offsets should be stored?
> >
> > Look at other classes that implement TokenStream. Also take a look at
> > setPositionIncrement when you are putting in your own terms.
> >
> > Regards Ard
> >
> >> Is this valid?
> >> Token, start, end
> >> aaa, 0, 3
> >> bbb, 4, 7
> >> ccc, 8, 11
> >>
> >> Thanks,
> >> Shahan

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]