RE: Does Index have a Tokenizer Built into it

Ard Schrijvers Fri, 13 Jul 2007 00:58:52 -0700

Hello,

> I'm wondering if after opening the index I can retrieve the Tokens (not
> the terms) of a document, something akin to
> IndexReader.Document(n).getTokenizer().


It is obviously not possible to get the original tokens of the document back 
when you haven't stored the document, because:

1) the analyzer might have removed stop words in the first place
2) the terms in a Lucene index may be stemmed words, synonyms, etc.
3) how would you expect things like spaces, commas, dots etc to be restored?

Also, I think what you want does not fit how an inverted index works. When you 
do not store the document, you always lose information about the document 
during indexing/analyzing.
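
To illustrate the three points above, here is a minimal sketch in plain Java. 
It is a toy positional index, not Lucene's API: the class ToyPositionalIndex, 
its stop-word list and the "_" gap marker are all made up for this example. It 
only shows lowercasing and stop-word removal (no stemming), but that is already 
enough to make the original text unrecoverable from terms and positions alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy stand-in for an inverted index with term positions. NOT Lucene's API.
class ToyPositionalIndex {
    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "is");

    // term -> positions at which the term occurs in the single indexed document
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // "Analyze" the text: split on non-letters, lowercase, drop stop words.
    // Positions still count the dropped stop words, which leaves gaps.
    public void index(String text) {
        int position = 0;
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String term = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(term)) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
            }
            position++;
        }
    }

    // Best-effort reconstruction: order terms by position, mark gaps with "_".
    public String reconstruct() {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        postings.forEach((term, positions) ->
                positions.forEach(p -> byPosition.put(p, term)));
        if (byPosition.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (int p = 0; p <= byPosition.lastKey(); p++) {
            if (p > 0) {
                out.append(' ');
            }
            out.append(byPosition.getOrDefault(p, "_"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ToyPositionalIndex idx = new ToyPositionalIndex();
        idx.index("The index of a Book, is sorted.");
        // prints: _ index _ _ book _ sorted
        System.out.println(idx.reconstruct());
    }
}
```

The stop words, the capital "B", the comma and the final dot are gone, and 
nothing in the postings says what they were. With a stemmer in the analyzer, 
the surviving terms would not even be the original words.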

How many documents are you talking about? They must be either somewhere on the 
file system or accessible over HTTP. When you need the document, why not just 
provide a link to the original location?

Regards Ard

> 
> In summary:
> 
> My current (too wasteful) implementation is this:
> 
> StandardTokenizer(BufferedReader(IndexReader.Document(n).getField("text")))
> 
> I'm wondering if Lucene has a more efficient way to retrieve the tokens
> of a document from an index, because it seems like it already has
> information about every "term", since you can retrieve a TermPositions
> object.
> 
> Thanks,
> 
> 
> --JP
> 
