Hi, when Lucene's standard indexing is used to store documents, does it keep any information about the tokens themselves? I'm playing around with writing a snippet generator (similar to the Highlighter class), and it is going to involve a very large number of documents. For my test cases I have used only one document and simply passed its text into a StandardTokenizer, but now I am ready to start working with a large collection. I know one option is to store the full text of each document as a field, then open the index and pass that text back through a tokenizer, but storing the text of every document costs far too much space. I'm wondering whether, after opening the index, I can retrieve the Tokens (not just the terms) of a document, something akin to a hypothetical IndexReader.document(n).getTokenizer().
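For reference, here is a sketch of the re-tokenizing approach described above, assuming a Lucene 1.x/2.x-era API; the index path and the field name "text" are placeholders, and the field must have been stored for this to work at all:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class ReTokenize {
    public static void tokenize(String indexPath, int n) throws Exception {
        IndexReader reader = IndexReader.open(indexPath);   // placeholder path
        try {
            Document doc = reader.document(n);
            String text = doc.get("text");                  // only works if "text" was stored
            TokenStream tokens = new StandardTokenizer(new StringReader(text));
            Token token;
            while ((token = tokens.next()) != null) {
                // each Token carries termText(), startOffset(), endOffset()
                System.out.println(token.termText());
            }
            tokens.close();
        } finally {
            reader.close();
        }
    }
}
```

This is the approach I want to avoid, since it requires storing the full text of every document in the index.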
In summary, my current (too wasteful) implementation is roughly this:

new StandardTokenizer(new StringReader(reader.document(n).getField("text").stringValue()))

I'm wondering if Lucene has a more efficient way to retrieve the tokens of a document from an index, because it seems to have information about every "term" already, since you can retrieve a TermPositions object. Thanks, --JP
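For what it's worth, here is a sketch of the TermPositions access mentioned above (again assuming a 1.x/2.x-era API, with the field name "text" and the term itself as placeholders). Note that this yields positions per term, not the original token stream in document order:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class PositionsDemo {
    public static void show(String indexPath) throws Exception {
        IndexReader reader = IndexReader.open(indexPath);   // placeholder path
        try {
            // positions of the term "lucene" in the "text" field, per document
            TermPositions tp = reader.termPositions(new Term("text", "lucene"));
            while (tp.next()) {
                int docId = tp.doc();
                int freq = tp.freq();
                for (int i = 0; i < freq; i++) {
                    int pos = tp.nextPosition();  // token position within docId
                    System.out.println("doc " + docId + " pos " + pos);
                }
            }
            tp.close();
        } finally {
            reader.close();
        }
    }
}
```

So the position information is clearly in the index; what I don't see is how to get back from positions to the actual tokens of a given document without storing its text.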