[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774693#action_12774693 ]

Jake Mannix commented on LUCENE-1526:
-------------------------------------

bq. But, I agree it's wasteful of space when deletes are so
sparse... though it is fast.

It's fast for random access, but it's really slow if you need to make a lot of 
these (either during heavy indexing if copy-on-write, or during heavy query 
load if copy-on-reopen).
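
To make that concrete, here is a rough sketch (illustrative only - not 
Lucene's actual BitVector) of why copy-on-write gets expensive when clones 
are frequent: each clone pays a full O(maxDoc/8) array copy just to flip one 
bit.

{code:java}
// Hypothetical simplified bit vector, for illustration only.
class SimpleBitVector {
  private final byte[] bits;

  SimpleBitVector(int maxDoc) {
    this.bits = new byte[(maxDoc >> 3) + 1];
  }

  private SimpleBitVector(byte[] bits) {
    this.bits = bits;
  }

  // Fast O(1) random access - this is the part that performs well.
  boolean get(int doc) {
    return (bits[doc >> 3] & (1 << (doc & 7))) != 0;
  }

  // Copy-on-write: the whole backing array is duplicated per clone, so
  // under heavy indexing (many small delete batches) the copies dominate.
  SimpleBitVector cloneWithDelete(int doc) {
    byte[] copy = new byte[bits.length];
    System.arraycopy(bits, 0, copy, 0, bits.length);
    copy[doc >> 3] |= (1 << (doc & 7));
    return new SimpleBitVector(copy);
  }
}
{code}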

bq. So are you using this, only, as your deleted docs? Ie you don't store
the deletions with Lucene? I'm getting confused if this is only for
the NRT case, or, in general.

These only augment the deleted docs *of the disk reader* - the disk reader is 
reopened only infrequently: once a batch is ready to be flushed to disk (a 
big enough RAMDirectory has filled up, or enough time has gone by, depending 
on configuration), diskReader.addIndexes is called, and when the diskReader 
is reopened the deletes live in the normal diskReader's delete set.  Until 
that point, while a batch in RAM hasn't yet been flushed, the 
IntSetAccelerator is applied to the not-yet-reopened diskReader.  It's a 
copy-on-read ThreadLocal.

In case that didn't describe it clearly: only the deletes which should have 
been applied to the diskReader are treated separately - those are basically 
batched.  For T amount of time or D docs (both configurable), whichever comes 
first, they are applied to the diskReader, which then knows about Lucene's 
regular deletions as well as these new ones.  Once the in-memory index is 
flushed to disk, the in-memory delSet is applied to the diskReader using the 
regular APIs before reopening, and then emptied.
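
For illustration, a minimal sketch of that batching (the class name and the 
plain HashSet are stand-ins - zoie's actual IntSetAccelerator uses a 
bloom-filter-fronted int set behind a copy-on-read ThreadLocal, and its real 
API may differ):

{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;

// Hypothetical overlay holding deletes accumulated since the disk reader
// was last reopened.
class BatchedDeletes {
  private final Set<Integer> pending = new HashSet<Integer>();

  synchronized void markDeleted(int docId) {
    pending.add(docId);
  }

  // Combined view: the reader's own (on-disk) deletes OR the in-memory batch.
  synchronized boolean isDeleted(IndexReader diskReader, int docId) {
    return diskReader.isDeleted(docId) || pending.contains(docId);
  }

  // At the batch boundary: push the batched deletes through the regular
  // API, then clear - the reopened reader carries them itself afterwards.
  synchronized void flushTo(IndexReader diskReader) throws IOException {
    for (int docId : pending) {
      diskReader.deleteDocument(docId);
    }
    pending.clear();
  }
}
{code}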

bq. OK, I think I'm catching up here... so you only open a new reader at
the batch boundary right? Ie, a batch update (all its adds & deletes)
is atomic from the readers standpoint?

Yes - the disk reader, you mean, right?  It is only reopened at a batch 
boundary.

bq. OK so a batch is quickly reopened, using bloom filter + int set for
fast "contains" check for the deletions that occurred during that
batch (and, custom TermDocs that does the "and not deleted"). This
gets you your fast turnaround and decent search performance.

The disk reopen isn't that quick, but it happens in the background - or are 
you talking about the RAMDirectory?  Yeah, that one is reopened per query (if 
necessary - if there are no changes, of course there's no reopen), but it is 
kept very small (10k docs or less, for example).  The performance is actually 
pretty fantastic - check out the zoie perf pages: 
http://code.google.com/p/zoie/wiki/Performance_Comparisons_for_ZoieLucene24ZoieLucene29LuceneNRT
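
On the custom TermDocs point, here's a hedged sketch of the "and not deleted" 
wrapper against Lucene 2.x's TermDocs interface (the extra-deletes set is the 
illustrative overlay from the sketch above, not zoie's real class):

{code:java}
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Wraps a TermDocs and skips docs present in the extra (batched) delete set.
class NotDeletedTermDocs implements TermDocs {
  private final TermDocs in;
  private final Set<Integer> extraDeletes;

  NotDeletedTermDocs(TermDocs in, Set<Integer> extraDeletes) {
    this.in = in;
    this.extraDeletes = extraDeletes;
  }

  public boolean next() throws IOException {
    while (in.next()) {
      if (!extraDeletes.contains(in.doc())) return true; // fast "contains"
    }
    return false;
  }

  public boolean skipTo(int target) throws IOException {
    if (!in.skipTo(target)) return false;
    return !extraDeletes.contains(in.doc()) || next();
  }

  public int read(int[] docs, int[] freqs) throws IOException {
    // Simplified: fill one doc at a time so the filter stays correct.
    int i = 0;
    while (i < docs.length && next()) {
      docs[i] = in.doc();
      freqs[i] = in.freq();
      i++;
    }
    return i;
  }

  public int doc() { return in.doc(); }
  public int freq() { return in.freq(); }
  public void seek(Term term) throws IOException { in.seek(term); }
  public void seek(TermEnum termEnum) throws IOException { in.seek(termEnum); }
  public void close() throws IOException { in.close(); }
}
{code}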
 



> For near real-time search, use paged copy-on-write BitVector impl
> -----------------------------------------------------------------
>
>                 Key: LUCENE-1526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1526
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>         Attachments: LUCENE-1526.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> SegmentReader currently uses a BitVector to represent deleted docs.
> When performing rapid clone (see LUCENE-1314) and delete operations,
> performing a copy on write of the BitVector can become costly because
> the entire underlying byte array must be created and copied. A way to
> make this clone delete process faster is to implement tombstones, a
> term coined by Marvin Humphrey. Tombstones represent new deletions
> plus the incremental deletions from previously reopened readers in
> the current reader. 
> The proposed implementation of tombstones is to accumulate deletions
> into an int array represented as a DocIdSet. With LUCENE-1476,
> SegmentTermDocs iterates over deleted docs using a DocIdSet rather
> than accessing the BitVector by calling get. This allows a BitVector
> and a set of tombstones to be ANDed together as the current reader's
> delete docs. 
> A tombstone merge policy needs to be defined to determine when to
> merge tombstone DocIdSets into a new deleted docs BitVector as too
> many tombstones would eventually be detrimental to performance. A
> probable implementation will merge tombstones based on the number of
> tombstones and the total number of documents in the tombstones. The
> merge policy may be set in the clone/reopen methods or on the
> IndexReader. 
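
A hedged sketch of the kind of merge trigger the description above calls for 
(the class name and ratio threshold are made up for illustration; the 
attached patch may define this differently):

{code:java}
// Hypothetical tombstone merge trigger, not the API from the patch: fold
// tombstone DocIdSets back into a fresh deleted-docs BitVector once they
// get dense enough that iterating them costs more than one bit-set pass.
class TombstoneMergePolicy {
  private final double maxRatio;

  TombstoneMergePolicy(double maxRatio) {
    this.maxRatio = maxRatio; // e.g. 0.05: merge at 5% of the segment's docs
  }

  boolean shouldMerge(int numTombstones, int maxDoc) {
    return numTombstones > maxDoc * maxRatio;
  }
}
{code}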
