[jira] Commented: (LUCENE-1526) For near real-time search, use paged copy-on-write BitVector impl

Michael McCandless (JIRA) Sat, 07 Nov 2009 15:55:01 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774688#action_12774688
 ]


Michael McCandless commented on LUCENE-1526:
--------------------------------------------

bq. We do not hold the deleted set for a long period of time. I agree the 
memory cost is not a "killer" but it is tremendously wasteful, e.g. 10 M doc 
index, you have say 2 docs deleted, 0 and 9999999, you are representing it with 
5M of memory (where you could have represented it with 2 ints, 8 bytes). Sure it 
is an extreme case, if you look at the avg number of deleted docs vs index 
size, it is usually sparse. Hence we avoided this approach.

Actually, a 10 M doc index would be a 1.25 MB BitVector (10 M bits / 8), right?

But, I agree it's wasteful of space when deletes are so
sparse... though it is fast.
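
To make the arithmetic concrete, here is a rough Java sketch of the two
footprints being compared. The class and method names are made up for
illustration and are not Lucene APIs; it counts only the raw payload
(one bit per doc vs. one 4-byte int per deleted doc) and ignores object
overhead.

```java
// Back-of-the-envelope comparison; names are illustrative, not Lucene APIs.
final class DeletedDocsFootprint {

    /** Bytes needed for a dense bit set with one bit per document (rounded up). */
    static long bitVectorBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    /** Bytes needed for a sparse set that stores each deleted doc ID as a 4-byte int. */
    static long sparseIntBytes(long deletedCount) {
        return deletedCount * Integer.BYTES;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L;
        System.out.println("dense bit vector: " + bitVectorBytes(maxDoc) + " bytes");
        System.out.println("two sparse ints:  " + sparseIntBytes(2) + " bytes");
    }
}
```

The dense form costs the same no matter how few docs are deleted, which
is the waste being pointed out; in exchange, a lookup is a single array
access.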

So are you using this as your only deleted-docs representation?  I.e.,
you don't store the deletions with Lucene?  I'm not sure whether this
is only for the NRT case or applies in general.

Have you compared performance of this vs straight lookup in BitVector?

bq. We do not accumulate deletes. Deletes/updates are tracked per batch, and 
special TermDocs are returned to skip over the deleted/mod set for the given 
reader.

OK, I think I'm catching up here... so you only open a new reader at
the batch boundary, right?  I.e., a batch update (all its adds & deletes)
is atomic from the reader's standpoint?

bq. We simply call delete on the diskreader for each batch and internal readers 
are refreshed with deletes loaded in background.

OK so a batch is quickly reopened, using bloom filter + int set for
fast "contains" check for the deletions that occurred during that
batch (and, custom TermDocs that does the "and not deleted").  This
gets you your fast turnaround and decent search performance.
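
As I understand the description, the per-batch check could look roughly
like the sketch below. Everything here is an assumption for
illustration: BatchDeletes, its hash functions and the liveDocs helper
are invented names, not the actual implementation and not Lucene
classes. A small Bloom filter rejects most non-deleted docs cheaply, a
sorted int array gives the exact answer, and liveDocs plays the role of
the custom TermDocs doing the "and not deleted".

```java
import java.util.Arrays;

// Hypothetical per-batch delete set: Bloom filter in front of an exact int set.
final class BatchDeletes {
    private final int[] sortedDocs; // exact set of doc IDs deleted in this batch
    private final long[] bloom;     // Bloom filter bits
    private final int mask;         // number of bits - 1 (bit count is a power of two)

    BatchDeletes(int[] deletedDocs) {
        sortedDocs = deletedDocs.clone();
        Arrays.sort(sortedDocs);
        int bits = 64;
        // Roughly 16 bits per deleted doc, capped at 2^30 bits.
        while (bits < 16L * sortedDocs.length && bits < (1 << 30)) {
            bits <<= 1;
        }
        bloom = new long[bits >>> 6];
        mask = bits - 1;
        for (int doc : sortedDocs) {
            setBit(hash1(doc));
            setBit(hash2(doc));
        }
    }

    private int hash1(int doc) {
        int h = doc * 0x9E3779B1;
        return (h ^ (h >>> 16)) & mask;
    }

    private int hash2(int doc) {
        int h = Integer.rotateLeft(doc, 15) * 0x85EBCA6B;
        return (h ^ (h >>> 13)) & mask;
    }

    private void setBit(int i) {
        bloom[i >>> 6] |= 1L << (i & 63);
    }

    private boolean getBit(int i) {
        return (bloom[i >>> 6] & (1L << (i & 63))) != 0;
    }

    /** True only if the doc was deleted in this batch; the Bloom filter is just a fast reject. */
    boolean contains(int doc) {
        if (!getBit(hash1(doc)) || !getBit(hash2(doc))) {
            return false; // definitely not deleted
        }
        return Arrays.binarySearch(sortedDocs, doc) >= 0; // rule out false positives
    }

    /** Stand-in for a filtering TermDocs: keeps only postings that are not deleted. */
    static int[] liveDocs(int[] postings, BatchDeletes deletes) {
        int[] out = new int[postings.length];
        int n = 0;
        for (int doc : postings) {
            if (!deletes.contains(doc)) {
                out[n++] = doc;
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

With sparse deletes almost every lookup stops at the Bloom filter, so
the per-doc cost stays small even though it is more work than a plain
bit lookup.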

In the BG the deletes are applied to Lucene "for real".

This is similar to Marvin's original tombstone deletions, in that the
deletions against old segments are stored with the new segment, rather
than aggressively pushed to the old segment's bit vectors.
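
A minimal sketch of the tombstone idea, under my own assumptions:
TombstonedDeletes is a hypothetical class, java.util.BitSet stands in
for Lucene's BitVector, and the merge trigger is left to the caller.
New deletions go into a small sorted int array layered over the old
segment's untouched bit set, and are folded into a fresh bit set only
when merge() is called.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical tombstone overlay; BitSet stands in for Lucene's BitVector.
final class TombstonedDeletes {
    private final BitSet baseDeleted; // old segment's deletions, never mutated here
    private final int[] tombstones;   // sorted doc IDs deleted since baseDeleted was built

    private TombstonedDeletes(BitSet baseDeleted, int[] sortedTombstones) {
        this.baseDeleted = baseDeleted;
        this.tombstones = sortedTombstones;
    }

    static TombstonedDeletes over(BitSet baseDeleted) {
        return new TombstonedDeletes(baseDeleted, new int[0]);
    }

    /** A doc is deleted if either the base bit set or the tombstones say so. */
    boolean isDeleted(int doc) {
        return baseDeleted.get(doc) || Arrays.binarySearch(tombstones, doc) >= 0;
    }

    /** Returns a new view with one more deletion; copies only the small tombstone array. */
    TombstonedDeletes delete(int doc) {
        if (isDeleted(doc)) {
            return this;
        }
        int[] next = Arrays.copyOf(tombstones, tombstones.length + 1);
        next[next.length - 1] = doc;
        Arrays.sort(next);
        return new TombstonedDeletes(baseDeleted, next);
    }

    /** Folds the tombstones into a fresh bit set; this is the expensive full copy. */
    TombstonedDeletes merge() {
        BitSet merged = (BitSet) baseDeleted.clone();
        for (int doc : tombstones) {
            merged.set(doc);
        }
        return new TombstonedDeletes(merged, new int[0]);
    }

    int tombstoneCount() {
        return tombstones.length;
    }
}
```

The cost of the full bit set copy is paid once per merge rather than
once per reopen; how often to merge is the policy question raised in
the issue description.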

> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
