Doug, do you believe that storing token offset information (as an option, of course) would be something you'd accept as a contribution to the core of Lucene? Does anyone else think this would be beneficial information to have?
I have mixed feelings about this. Aesthetically I don't like it a lot, as it is asymmetric: indexes store sequential positions, while vectors would store character offsets. On the other hand, it could be useful for summarizing long documents.
Another approach that someone mentioned for solving this problem is to create a fragment index for long documents. For example, if a document is over, say, 32k, then you could create a separate index for it that chops its text into 1000 character overlapping chunks. The first chunk would be characters 0-1000, the next 500-1500, and so on. Then, to summarize, you open this index and search it to figure out which chunks have the best hits. Then you can, based on the chunk document id, seek into the full text and retokenize only selected chunks. Such indexes should be fast to open, since they'd be small. I'd recommend calling IndexWriter#setUseCompoundFile(true) on these, and optimizing them. That way there'd only be a couple of files to open.
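The chunking step above can be sketched in plain Java (no Lucene calls here; the class and method names are made up for illustration). Each chunk would then be indexed as its own small document, with its start offset kept so you can later seek into the full text and retokenize only the chunks that scored well:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkDemo {
    // Split text into overlapping chunks: chunk i covers
    // characters [i*step, i*step + size). With size=1000 and step=500,
    // that gives 0-1000, 500-1500, 1000-2000, ... as described above.
    static List<String> chunk(String text, int size, int step) {
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break;  // last, possibly short, chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 2300 characters of dummy text yields four chunks:
        // 0-1000, 500-1500, 1000-2000, 1500-2300.
        List<String> chunks = chunk("x".repeat(2300), 1000, 500);
        System.out.println(chunks.size());
    }
}
```

In the real fragment index, the chunk's start offset (here `i*step`) would be stored alongside each chunk document, so the best-scoring chunk's document id maps directly back to a character position in the original text.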
Doug