Hello,
> I'm wondering if, after opening the index, I can retrieve the Tokens (not
> the terms) of a document, something akin to
> IndexReader.Document(n).getTokenizer().
It is not possible to get the original tokens of a document back from the
index when you haven't stored the document, because:
1) the analyzer may have removed stop words in the first place
2) the terms in the Lucene index may be stemmed words, synonyms, etc.
3) how would you expect things like spaces, commas, dots, etc. to be restored?
Also, I think what you want does not fit the way an inverted index works. When
you do not store the document, you always lose information about the document
during indexing/analysis.
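To make the three points above concrete, here is a small self-contained sketch
in plain Java. It is a toy model, not Lucene code: the class
InvertedIndexSketch, the stop-word list, and the crude "strip a trailing s"
stemmer are all assumptions made up for illustration. It builds a
term -> positions map and then tries to rebuild the text from it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: none of these names are Lucene classes.
class InvertedIndexSketch {

    // Hypothetical stop-word list, standing in for an analyzer's stop filter.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "is"));

    // Crude stand-in for a stemmer: lowercases and strips a trailing "s".
    static String normalize(String token) {
        String t = token.toLowerCase();
        return (t.length() > 3 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;
    }

    // Builds term -> positions, roughly what a positional inverted index keeps.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        int position = 0;
        for (String raw : text.split("[^A-Za-z0-9]+")) {
            if (raw.isEmpty()) continue; // split may yield a leading empty string
            if (!STOP_WORDS.contains(raw.toLowerCase())) {
                postings.computeIfAbsent(normalize(raw), k -> new ArrayList<>())
                        .add(position);
            }
            position++; // a dropped stop word still consumes a position
        }
        return postings;
    }

    // Rebuilds a position-ordered term sequence; unknown slots are shown as "_".
    static String reconstruct(Map<String, List<Integer>> postings) {
        int max = -1;
        for (List<Integer> ps : postings.values())
            for (int p : ps) max = Math.max(max, p);
        String[] slots = new String[max + 1];
        Arrays.fill(slots, "_");
        for (Map.Entry<String, List<Integer>> e : postings.entrySet())
            for (int p : e.getValue()) slots[p] = e.getKey();
        return String.join(" ", slots);
    }

    public static void main(String[] args) {
        String original = "The Tokens of a document, not the terms.";
        // prints: _ token _ _ document not _ term
        System.out.println(reconstruct(index(original)));
    }
}
```

The positions let you put the surviving terms back in order, but the stop
words, the original word forms, the capitalization, and the punctuation are
gone. That is the information loss described above.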
How many documents are you talking about? They must either be somewhere on the
file system or accessible over HTTP. When you need the document, why not just
provide a link to the original location?
Regards, Ard
>
> In summary:
>
> My current (too wasteful) implementation is this:
>
> StandardTokenizer(BufferedReader(
>     IndexReader.Document(n).getField("text")
> ))
>
> I'm wondering if Lucene has a more efficient way to retrieve the tokens of
> a document from an index, because it seems to have information about every
> "term" already, since you can retrieve a TermPositions object.
>
> Thanks,
>
>
> --JP
>