On Wed, Apr 1, 2009 at 6:37 PM, John Wang <john.w...@gmail.com> wrote:
> a code snippet is worth 1000 words :)
Hear, hear!

OK, now I understand the difference.

With approach 1, for each of N UIDs you use a TermDocs to find the postings for that UID, and retrieve the one docID corresponding to that UID. You retrieve UID -> docID.

With approach 2, you iterate through all docs in the index, using a single full walk through the single TermPositions instance for your special UID_TERM, and retrieve the UID stored in the 4-byte payload. You retrieve docID -> UID.

Approach 1 is expected to be more costly per UID: Lucene must consult the terms dict (binary search on the terms index, followed by a scan on disk within the 128-term block) to find the posting, then seek to the posting and read it.

Approach 2 is an efficient "bulk" walk, but it loads all docID -> UID mappings into RAM (ie, you cannot be selective about which UIDs you load).

So if the number of UIDs you need to process is small, approach 1 should win; but once that number crosses some threshold X (apparently X < 10000 for you), approach 2's "bulk walk" will win.

Approach 1 will get faster with the "pulsing" approach for inlining low-frequency postings directly into the terms dict (discussed on java-dev and implemented as a codec in the experimental flexible indexing patch on LUCENE-1458), because we save the second seek.

Approach 2 will get much faster with column-stride fields (LUCENE-1231).

Though we may want to take this even further and allow the inversion for special fields (a "primary key int" field, ie your UID) to be stored as a column-stride field. Probably this could simply be another codec in LUCENE-1458. Then, delete-by-Term would be exceptionally fast for such fields.

Mike
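
For anyone who wants the two access patterns side by side, here is a minimal in-memory sketch in plain Java. It deliberately does not call the Lucene API: the sorted int array stands in for the terms dict, the byte arrays stand in for the 4-byte payloads, and the class and method names (UidLookupSketch, docIdForUid, loadAllUids, encodeUid) as well as the big-endian payload encoding are assumptions made for illustration only. It shows the direction of each mapping, not real I/O costs.

```java
import java.util.Arrays;

// Illustrative only: in-memory stand-ins, not Lucene classes.
class UidLookupSketch {

    // Approach 1 stand-in: one lookup per UID (UID -> docID).
    // sortedUids plays the role of the sorted terms dict; docIds[i] is the
    // single posting for sortedUids[i]. Returns -1 if the UID is absent.
    public static int docIdForUid(int[] sortedUids, int[] docIds, int uid) {
        int slot = Arrays.binarySearch(sortedUids, uid);
        return slot >= 0 ? docIds[slot] : -1;
    }

    // Approach 2 stand-in: one sequential pass over every doc, decoding the
    // 4-byte payload (assumed big-endian here) into a docID -> UID array.
    // Note that it loads every UID; there is no way to be selective.
    public static int[] loadAllUids(byte[][] payloadByDoc) {
        int[] uidByDoc = new int[payloadByDoc.length];
        for (int docId = 0; docId < payloadByDoc.length; docId++) {
            byte[] p = payloadByDoc[docId];
            uidByDoc[docId] = ((p[0] & 0xFF) << 24) | ((p[1] & 0xFF) << 16)
                    | ((p[2] & 0xFF) << 8) | (p[3] & 0xFF);
        }
        return uidByDoc;
    }

    // Hypothetical helper: encode a UID as a 4-byte big-endian payload.
    public static byte[] encodeUid(int uid) {
        return new byte[] {
            (byte) (uid >>> 24), (byte) (uid >>> 16), (byte) (uid >>> 8), (byte) uid
        };
    }

    public static void main(String[] args) {
        // doc 0 has UID 205, doc 1 has UID 317, doc 2 has UID 100.
        int[] sortedUids = {100, 205, 317};
        int[] docIds = {2, 0, 1};
        byte[][] payloads = {encodeUid(205), encodeUid(317), encodeUid(100)};

        System.out.println(docIdForUid(sortedUids, docIds, 205));   // prints 0
        System.out.println(Arrays.toString(loadAllUids(payloads))); // prints [205, 317, 100]
    }
}
```

The per-UID method does a search for every UID you ask about, while the bulk method touches every doc exactly once regardless of how many UIDs you care about, which is why the crossover point depends on how many UIDs you need.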