> In an inverted index, terms point to documents. So you have to
> traverse *all* of the terms of a field across all documents, and keep
> track of when you run across the document you are interested in. When
> you do, then get the positions that the term appeared at, and keep
> track of them. After you have covered all the terms, you can put
> everything in order. There could be gaps (positionIncrement, stop
> word removal, etc) and it's also possible for multiple tokens to
> appear at the same position.
>
> For a full-text field with many terms, and a large index, this could
> take a *long* time.
> It's probably very useful for debugging though.
I just realized that it's worse... if you specify a field, then you only have to iterate over the terms for that field. But if you want *all* of the indexed, non-stored fields for a particular document and don't know what they are, there is no info to help you: you need to iterate over *all* terms in the index. Luckily, there is a patch in the works in Lucene that will make skipTo(myDoc) in TermDocs faster. That should speed things up a little.
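A rough sketch of that scan, in Python rather than against the real Lucene API. Everything here (`build_index`, `reconstruct`, the nested-dict layout) is made up for illustration and only assumes an index shaped as field -> term -> docid -> positions:

```python
from collections import defaultdict

def build_index(docs):
    """Toy inverted index: field -> term -> doc_id -> [positions]."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, fields in docs.items():
        for field, tokens in fields.items():
            for pos, term in tokens:
                index[field][term].setdefault(doc_id, []).append(pos)
    return index

def reconstruct(index, doc_id, field=None):
    """Rebuild a document's tokens by walking every term.

    With a field given, only that field's terms are scanned.
    Without one, every term of every field has to be visited.
    """
    fields = [field] if field is not None else list(index)
    result = {}
    for f in fields:
        by_pos = defaultdict(list)
        for term, postings in index.get(f, {}).items():
            # Most terms will not mention doc_id at all; we still visit them.
            for pos in postings.get(doc_id, ()):
                by_pos[pos].append(term)
        if by_pos:
            # Order by position; gaps and stacked tokens are preserved.
            result[f] = [(pos, sorted(by_pos[pos])) for pos in sorted(by_pos)]
    return result

# Hypothetical data: doc 0 has a gap at position 2 (say, a removed stop
# word) and two tokens stacked at position 1.
docs = {
    0: {"body": [(0, "quick"), (1, "brown"), (1, "tan"), (3, "fox")]},
    1: {"body": [(0, "lazy"), (1, "fox")], "title": [(0, "dogs")]},
}
index = build_index(docs)
body_of_doc0 = reconstruct(index, 0, "body")
all_of_doc1 = reconstruct(index, 1)
```

The cost is proportional to the number of terms scanned, not to the size of the document, which is why the no-field case is the painful one.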
> Remember that df is not updated when a document is marked for deletion
> in Lucene.
> So you can have a df of 2, do a search, and only come up with one document.
> that would explain why I'm seeing df > 1 for the uniqueKey!
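To make the quoted df behavior concrete, here is a toy model in Python. It is not Lucene code; `ToyIndex` and its methods are hypothetical, and the only assumption is that a delete marks the document without touching the postings:

```python
class ToyIndex:
    """Toy index where deletes only mark a doc; postings stay as they are."""

    def __init__(self):
        self.postings = {}    # term -> list of doc ids
        self.deleted = set()  # doc ids marked for deletion

    def add(self, doc_id, terms):
        for term in set(terms):
            self.postings.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        # Mark only; the doc id remains in every postings list.
        self.deleted.add(doc_id)

    def doc_freq(self, term):
        # df counts postings entries, including deleted docs.
        return len(self.postings.get(term, []))

    def search(self, term):
        # Searches filter out docs marked as deleted.
        return [d for d in self.postings.get(term, []) if d not in self.deleted]

# Updating a document with the same uniqueKey: delete the old, add the new.
idx = ToyIndex()
idx.add(0, ["id:42"])
idx.delete(0)
idx.add(1, ["id:42"])

df = idx.doc_freq("id:42")   # counts both the deleted and the live doc
hits = idx.search("id:42")   # only the live doc comes back
```

In this model the uniqueKey term ends up with a df of 2 while a search returns a single document, which matches the symptom described.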
Yep, that's not likely to ever be fixed in Lucene. Again, it's the nature of the inverted index... given a particular docid, you really have no clue what terms in the index point to that docid.

-Yonik
