> In an inverted index, terms point to documents.  So you have to
> traverse *all* of the terms of a field across all documents, and keep
> track of when you run across the document you are interested in.  When
> you do, then get the positions that the term appeared at, and keep
> track of them.  After you have covered all the terms, you can put
> everything in order.  There could be gaps (positionIncrement, stop
> word removal, etc) and it's also possible for multiple tokens to
> appear at the same position.
>
> For a full-text field with many terms, and a large index, this could
> take a *long* time.
> It's probably very useful for debugging though.

I just realized that it's worse... if you specified a field, then you
only have to iterate the terms for that field.  If you want *all* of
the indexed, non-stored fields for a particular document, but don't
know what they are, there is no info to help you.  You need to iterate
over *all* terms in the index.
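
As a rough illustration of the traversal described above, here is a
minimal sketch in Python over a toy in-memory index.  The dict layout,
the sample data and the name reconstruct() are all made up for the
example; this is not the Lucene API, just the shape of the work:

```python
from collections import defaultdict

# Toy inverted index (hypothetical layout, not Lucene's):
# (field, term) -> {doc_id: [positions]}
index = {
    ("body", "quick"): {1: [1], 2: [0]},
    ("body", "fast"):  {1: [1]},   # second token at the same position
    ("body", "brown"): {1: [2]},
    ("body", "fox"):   {1: [3], 3: [5]},
    ("title", "fox"):  {1: [0]},
}

def reconstruct(index, doc_id, field=None):
    """Rebuild the indexed tokens of doc_id by visiting every term.

    With field=None nothing tells us which fields the document has,
    so every term of every field has to be checked.
    """
    by_field = defaultdict(list)
    for (f, term), postings in index.items():
        if field is not None and f != field:
            continue  # a real term dictionary could seek past other fields
        for pos in postings.get(doc_id, ()):
            by_field[f].append((pos, term))
    # Only after all terms are covered can the tokens be put in order.
    return {f: sorted(tokens) for f, tokens in by_field.items()}

body_tokens = reconstruct(index, 1, field="body")
all_tokens = reconstruct(index, 1)
```

For doc 1, the "body" result has nothing at position 0 (a gap, e.g. a
removed stop word) and two tokens at position 1.  The cost is
proportional to the number of terms visited, not to the size of the
document.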

Luckily, there is a patch in the works in Lucene that will make
skipTo(myDoc) in TermDocs faster.  That should speed things up a
little.
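
To show why skipping helps at all: for each term you only need to know
whether its posting list contains myDoc, so you can jump ahead instead
of reading every docid.  A minimal sketch, assuming a sorted list of
docids and using binary search.  The names skip_to() and
term_hits_doc() are hypothetical, and this says nothing about how the
Lucene patch itself is implemented:

```python
from bisect import bisect_left

def skip_to(doc_ids, target):
    """Return the first doc id >= target in a sorted list, or None."""
    i = bisect_left(doc_ids, target)
    return doc_ids[i] if i < len(doc_ids) else None

def term_hits_doc(doc_ids, my_doc):
    """True if the term's (sorted) posting list contains my_doc."""
    return skip_to(doc_ids, my_doc) == my_doc

postings = [2, 5, 9, 14]
found = term_hits_doc(postings, 9)      # True
missing = term_hits_doc(postings, 6)    # False: skip_to lands on 9
```

You still have to do this once per term, so the number of terms
remains the dominant cost.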

> Remember that df is not updated when a document is marked for deletion
> in Lucene.
> So you can have a df of 2, do a search, and only come up with one document.
>

That would explain why I'm seeing df > 1 for the uniqueKey!

Yep, that's not likely to ever be fixed in Lucene.  Again, it's the
nature of the inverted index... given a particular docid, you really
have no clue what terms in the index point to that docid.
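
A toy sketch of the effect, in Python.  The ToyIndex class and the
"id:42" term are invented for illustration and do not mirror Lucene's
internals; the only point is that a delete just marks the docid,
because finding the terms that point to it would mean scanning all of
them:

```python
class ToyIndex:
    def __init__(self):
        self.postings = {}    # term -> list of doc ids
        self.deleted = set()  # doc ids marked as deleted
        self.next_id = 0

    def add(self, terms):
        doc_id = self.next_id
        self.next_id += 1
        for term in set(terms):
            self.postings.setdefault(term, []).append(doc_id)
        return doc_id

    def delete(self, doc_id):
        # Mark only; the posting lists are left untouched.
        self.deleted.add(doc_id)

    def doc_freq(self, term):
        # Counts postings, including those of deleted documents.
        return len(self.postings.get(term, []))

    def search(self, term):
        # Deleted documents are filtered out at search time.
        return [d for d in self.postings.get(term, [])
                if d not in self.deleted]

idx = ToyIndex()
old = idx.add(["id:42"])   # first version of the document
idx.delete(old)            # overwritten: old version marked deleted
new = idx.add(["id:42"])   # new version with the same unique key

df = idx.doc_freq("id:42")   # 2
hits = idx.search("id:42")   # [1]
```

So a unique key can show a df of 2 while a search on it returns a
single document.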

-Yonik
