RE: Token offset values for custom Tokenizer

Ard Schrijvers Mon, 16 Jul 2007 08:48:09 -0700

Hello,

The issue is about lucene 1.9. Can you test it with lucene 2.2? Perhaps the 
issue is already addressed and solved...


Regards Ard

> 
> Thank you for the reply Ard,
> 
> The tokens exist in the index and are returned accurately, except for 
> the offsets. In this case I am not dealing with the positions, so the 
> termvector is specified as using 'with_offsets'. I have left the term 
> position incrememt as its default. Looking at the existing 
> tokenstreams, 
> they don't maintain knowledge of the current position, they always 
> generate values startoffsets beginning at 0 of the current 
> stream, and 
> then a 'proper' offset is generated based on the +1 of the previous 
> token the DocumentWriter applies when indexeding. Nor are 
> there any test 
> cases for offsets. I found a bug that was opened a while ago dealing 
> with this issue (as well as related one). It is:
> https://issues.apache.org/jira/browse/LUCENE-579
> 
> I am retrieving the a text token's offset values using 
> TermPositionVector.getOffsets() which returns TermVectorOffsetInfo[]. 
> The same offset values that were placed into the token during 
> indexing 
> are not being returned, they have been shifted.
> Thanks.
> Shahan
> 
> Ard Schrijvers wrote:
> > Hello,
> >
> >   
> >> Hi,
> >> I am storing custom values in the Tokens provided by a 
> Tokenizer but 
> >> when retrieving them from the index the values don't match. 
> >>     
> >
> > What do you mean by retrieving? Do you mean retrieving 
> terms, or do you mean doing a search with words you know that 
> should be in, but you do not find a match?
> >
> > In the latter, you must make sure that you are using the 
> same analyzer for the search as you used for indexing. 
> >
> >   
> >> I've looked 
> >> in the LIA book but it's not current since it mentioned 
> term vectors 
> >> aren't stored. I'm using Lucene Nightly 146 but the same thing has 
> >> happened with older versions. Looking at the internals, 
> >> DocumentWriter 
> >> seems to keep track of the end offset that was placed into 
> >> the index and 
> >> modifies the token values (with +1) but I'm not sure whether 
> >> I should be 
> >> concerned with it.
> >> No existing analyzers are used when adding the document so all the 
> >> offsets are generated manually.
> >> Any suggestions of how the token offsets should be stored?
> >>
> >>     
> >
> > Look at other clases that implement TokenStream. Also take 
> a look at setPositionIncrement when you are putting in your own terms
> >
> > Regards Ard
> >
> >   
> >> Is this valid?
> >> Token, start, end
> >> aaa, 0, 3
> >> bbb, 4, 7
> >> ccc, 8, 11
> >>
> >> Thanks,
> >> Shahan
> >>
> >> 
> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >>     
> >
> > 
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >   
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Token offset values for custom Tokenizer

Reply via email to