Re: IndexWriter.deleteDocuments(Query query)

Michael McCandless Thu, 02 Apr 2009 02:21:10 -0700

On Wed, Apr 1, 2009 at 6:37 PM, John Wang <john.w...@gmail.com> wrote:
> a code snippet is worth 1000 words :)


Hear, hear!

OK, now I understand the difference.

With approach 1, for each of N UIDs you use a TermDocs to find the
postings for that UID, and retrieve the one docID corresponding to
that UID.  You retrieve UID -> docID.

With approach 2, you iterate through all docs in the index, using a
single full walk through the single TermPositions instance for your
special UID_TERM, and retrieve the UID stored in the 4-byte payload.
You retrieve docID -> UID.
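
To make the two directions concrete, here is a rough plain-Java
stand-in.  It does not call Lucene at all: a sorted int[] plays the
role of the terms dict, a byte[][] plays the role of the per-document
payloads, and the class name, method names and big-endian payload
encoding are all assumptions made just for illustration.

```java
// Plain-Java model of the two access patterns; this is NOT Lucene's API.
final class UidLookupSketch {

    // Approach 1 (UID -> docID): one dictionary lookup per requested UID.
    // sortedUids stands in for the terms dict; docIds[i] is the single
    // posting for sortedUids[i].  Returns -1 for a UID that is not indexed.
    static int[] lookupPerUid(int[] sortedUids, int[] docIds, int[] wanted) {
        int[] result = new int[wanted.length];
        for (int i = 0; i < wanted.length; i++) {
            int pos = java.util.Arrays.binarySearch(sortedUids, wanted[i]);
            result[i] = pos >= 0 ? docIds[pos] : -1;
        }
        return result;
    }

    // Approach 2 (docID -> UID): one sequential pass over every document,
    // decoding each 4-byte payload.  Big-endian order is an assumption.
    static int[] bulkWalk(byte[][] payloadByDoc) {
        int[] uidByDoc = new int[payloadByDoc.length];
        for (int doc = 0; doc < payloadByDoc.length; doc++) {
            byte[] p = payloadByDoc[doc];
            uidByDoc[doc] = ((p[0] & 0xFF) << 24) | ((p[1] & 0xFF) << 16)
                          | ((p[2] & 0xFF) << 8) | (p[3] & 0xFF);
        }
        return uidByDoc;
    }
}
```

Note that lookupPerUid does work proportional to the number of UIDs
requested, while bulkWalk always touches every document no matter how
many UIDs you actually care about.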

Approach 1 is expected to be more costly per UID: Lucene must
consult the terms dict (binary search on the terms index, followed by
a scan on disk within the 128-term block) to find the posting, then
seek to the posting and read it.

Approach 2 is an efficient "bulk" walk, but it loads all docID -> UIDs
into RAM (i.e., you cannot be selective about which UIDs you load).

So if the number of UIDs you need to process is small, approach 1
should win; but after that number crosses X (apparently X < 10000 for
you), approach 2's "bulk walk" will win.
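
A back-of-the-envelope way to think about X, as a sketch only: assume
approach 1 costs a fixed amount per UID and approach 2 costs a fixed
amount per document in the index.  The linear model, the class and
method names, and any per-unit costs you plug in are assumptions, not
measurements.

```java
// Toy linear cost model for the crossover point X; not measured data.
final class CrossoverSketch {

    // Approach 1: one terms-dict lookup plus one postings read per UID.
    static double perUidCost(long numUids, double costPerLookup) {
        return numUids * costPerLookup;
    }

    // Approach 2: walks every doc, independent of how many UIDs you need.
    static double bulkWalkCost(long maxDoc, double costPerDoc) {
        return maxDoc * costPerDoc;
    }

    // X = number of UIDs at which the two costs are equal.
    static double crossover(long maxDoc, double costPerDoc, double costPerLookup) {
        return maxDoc * costPerDoc / costPerLookup;
    }
}
```

Under this model X grows with the size of the index and shrinks as the
per-UID lookup gets more expensive relative to the per-doc walk, so
the real value has to be measured on your own index.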

Approach 1 will get faster with the "pulsing" approach for inlining
low-frequency postings directly into the terms dict (discussed on
java-dev and implemented as a codec in the experimental flexible
indexing patch on LUCENE-1458), because we save the second seek.
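
The saved seek can be illustrated with a toy model.  This is not the
codec from LUCENE-1458; the TermEntry layout, the seek counter and the
int[] "postings file" are hypothetical, chosen only to show where the
second seek goes away.

```java
// Toy model of "pulsing": a terms-dict entry either inlines its single
// docID or points into a separate postings file.  Not the real codec.
final class PulsingSketch {

    static final class TermEntry {
        final boolean inlined;     // true if the posting lives in the terms dict
        final int docId;           // used only when inlined
        final int postingsOffset;  // used only when not inlined

        TermEntry(boolean inlined, int docId, int postingsOffset) {
            this.inlined = inlined;
            this.docId = docId;
            this.postingsOffset = postingsOffset;
        }
    }

    int seeks = 0;                 // running count of simulated seeks
    final int[] postingsFile;      // stand-in for the on-disk postings

    PulsingSketch(int[] postingsFile) {
        this.postingsFile = postingsFile;
    }

    // Resolves a term entry to its docID, counting the seeks involved.
    int resolve(TermEntry entry) {
        seeks++;                   // seek 1: locate the entry in the terms dict
        if (entry.inlined) {
            return entry.docId;    // posting is already in hand
        }
        seeks++;                   // seek 2: jump to the postings file
        return postingsFile[entry.postingsOffset];
    }
}
```

For a primary-key field every term has exactly one posting, so in this
model every lookup would take the inlined path.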

Approach 2 will get much faster with column-stride fields
(LUCENE-1231).

Though we may want to take this even further and allow the inverted
index for special fields (a "primary key int" field, i.e., your UID)
to be stored as a column-stride field.  Probably this could simply be
another codec in LUCENE-1458.  Then, delete-by-Term would be
exceptionally fast for such fields.

Mike

