Hello,

> I'm wondering if after opening the index I can retrieve the Tokens
> (not the terms) of a document, something akin to
> IndexReader.Document(n).getTokenizer().
It is obviously not possible to get the original tokens of the document back when you haven't stored the document, because:

1) the analyzer might have removed stop words in the first place
2) the terms in the Lucene index may be stemmed words, synonyms, etc.
3) how would you expect things like spaces, commas, dots, etc. to be restored?

And I think what you want does not fit an inverted index. When you do not store the document, you always lose information about the document during indexing/analyzing.

How many documents are you talking about? They must be either somewhere on the FS or accessible over HTTP. When you need the document, why not just provide a link to the original location?

Regards,
Ard

> In summary:
>
> My current (too wasteful) implementation is this:
>
> StandardTokenizer(BufferedReader(
>     IndexReader.Document(n).getField("text")))
>
> I'm wondering if Lucene has a more efficient way to retrieve the tokens
> of a document from an index, because it seems like it already has
> information about every "term", since you can retrieve a
> TermPositions object.
>
> Thanks,
>
> --JP
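
As a footnote to the three points in the reply: the sketch below is a toy positional inverted index in plain Java, not Lucene code. The class name LossyIndexSketch, the stop-word list, and the methods index and rebuild are all hypothetical, made up for illustration. It assumes an analyzer that lowercases, splits on punctuation, and drops stop words while still counting their positions. Reading the postings back in position order gives you the analyzed terms with holes in them, not the original tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class LossyIndexSketch {
    // Hypothetical stop-word list for this sketch; not Lucene's default set.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and", "is");

    // Toy "analyzer + indexer": maps each surviving term to its positions.
    // Stop words are dropped, but they still consume a position.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        int position = 0;
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // leading delimiter produces an empty first piece
            }
            String term = raw.toLowerCase(Locale.ROOT); // case is lost here
            if (!STOP_WORDS.contains(term)) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
            }
            position++;
        }
        return postings;
    }

    // Reads the postings back in position order. Gaps left by removed
    // stop words show up as "_" because the index never stored them.
    static List<String> rebuild(Map<String, List<Integer>> postings) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int pos : entry.getValue()) {
                byPosition.put(pos, entry.getKey());
            }
        }
        List<String> out = new ArrayList<>();
        int expected = 0;
        for (Map.Entry<Integer, String> entry : byPosition.entrySet()) {
            while (expected < entry.getKey()) {
                out.add("_");
                expected++;
            }
            out.add(entry.getValue());
            expected++;
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "The Quick fox, and the lazy dog.";
        // prints [_, quick, fox, _, _, lazy, dog]
        System.out.println(rebuild(index(doc)));
    }
}
```

The capital letters, the comma, the full stop, and the stop words are gone and cannot be recovered from the postings alone. A real analyzer that also stems or injects synonyms would move the result even further from the original text.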