I'm indexing books, with a significant amount of overhead in each document and a LOT of OCR data. I'm indexing over 20,000 books and the index size is 8G. So I decided to play around with not storing some of the termvector information and I'm shocked at how much smaller the index is. By storing all my fields with Field.TermVector.WITH_POSITIONS, my index is reduced by OVER 75%. It went from 485M to 100M for my sample of 1,000 documents. Which implies my full index will be somewhere around 2G (I'll build the full index tonight and see).
My reasoning was that I do need position information since I need to do Span queries, but character information (WITH_OFFSETS) isn't necessary here/now. So I thought I'd make a small test to see if this was worth pursuing. If omitting offsets had only saved me 10%, for instance, I wouldn't pursue it very much. But 75+% is a savings well worth pursuing. All of my unit tests run, some of which include spans and highlighting. Whether they're sophisticated enough to catch some subtle issue I won't guarantee. I do NOT need to reconstruct the text, nor do I need to highlight with what's in the index, I handle highlighting by putting my display data in a MemoryIndex and running a query against that. I play some fun games to correlate my display and MemoryIndex info, but that's another story. Many thanks for the MemoryIndex contribution!!! With that as a background, I have two questions.... 1> Am I going off a cliff here? I suppose this is really answered by 2> what is the difference between WITH_POSITIONS and WITH_OFFSETS and YES and NO? I assume that WITH_POSITIONS is necessary for Span queries, for instance, which is all I really care about. While this has been discussed, I searched and didn't find a satisfactory answer (or at least an answer that I understood<G>). I looked at Grants PowerPoint presentation and I guess I'm really looking for confirmation of my interpretation that WITH_POSITIONS lets me do span queries and WITH_OFFSETS is irrelevant in my situation, one where I don't highlight and don't need to reconstruct the document...... Many thanks Erick