[ https://issues.apache.org/jira/browse/LUCENE-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774628#action_12774628 ]
Jake Mannix commented on LUCENE-1526: ------------------------------------- bq. Another approach we might take here is to track new deletions using a sorted tree. New deletions insert in O(log(N)) time, and then we can efficiently expose a DocIdSetIterator from that. We did this in Zoie for a while, and it turned out to be a bottleneck - not as much of a bottleneck as continually cloning a bitvector (that was even worse), but still not good. We currently use a bloomfilter on top of an openintset, which performs pretty fantastically: constant-time adds and even-faster constant-time contains() checks, with small size (necessary for the new Reader per query scenario since this requires lots of deep-cloning of this structure). Just a note from our experience over in zoie-land. It also helped to not produce a docIdset iterator using these bits, but instead override TermDocs to be returned on the disk reader, and keep track of it directly there. > For near real-time search, use paged copy-on-write BitVector impl > ----------------------------------------------------------------- > > Key: LUCENE-1526 > URL: https://issues.apache.org/jira/browse/LUCENE-1526 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Affects Versions: 2.4 > Reporter: Jason Rutherglen > Priority: Minor > Attachments: LUCENE-1526.patch > > Original Estimate: 168h > Remaining Estimate: 168h > > SegmentReader currently uses a BitVector to represent deleted docs. > When performing rapid clone (see LUCENE-1314) and delete operations, > performing a copy on write of the BitVector can become costly because > the entire underlying byte array must be created and copied. A way to > make this clone delete process faster is to implement tombstones, a > term coined by Marvin Humphrey. Tombstones represent new deletions > plus the incremental deletions from previously reopened readers in > the current reader. > The proposed implementation of tombstones is to accumulate deletions > into an int array represented as a DocIdSet. With LUCENE-1476, > SegmentTermDocs iterates over deleted docs using a DocIdSet rather > than accessing the BitVector by calling get. This allows a BitVector > and a set of tombstones to by ANDed together as the current reader's > delete docs. > A tombstone merge policy needs to be defined to determine when to > merge tombstone DocIdSets into a new deleted docs BitVector as too > many tombstones would eventually be detrimental to performance. A > probable implementation will merge tombstones based on the number of > tombstones and the total number of documents in the tombstones. The > merge policy may be set in the clone/reopen methods or on the > IndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org