Why not just write the first byte as 0 for a bit set and 1 for a sparse (compressed) bit set, and make the determination at write time based on the segment size and/or the number of set bits?
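A minimal sketch of that idea, under stated assumptions: the class name `DeletedDocsFormat`, the constants, and the 4-bytes-per-doc sparse encoding are all hypothetical and not taken from the attached patch or from Lucene's actual file format. It picks whichever encoding is smaller and records the choice in the first byte.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

// Hypothetical illustration only; not Lucene's on-disk format.
public class DeletedDocsFormat {
    static final byte FORMAT_BITSET = 0; // dense: one bit per document
    static final byte FORMAT_SPARSE = 1; // sparse: explicit list of doc ids

    // Assumed cost model: dense needs ceil(maxDoc / 8) bytes,
    // sparse needs a 4-byte count plus 4 bytes per deleted doc.
    static byte chooseFormat(int maxDoc, int numDeleted) {
        long denseBytes = (maxDoc + 7L) / 8;
        long sparseBytes = 4L + 4L * numDeleted;
        return sparseBytes < denseBytes ? FORMAT_SPARSE : FORMAT_BITSET;
    }

    // Serializes the deletions; the first byte tells a reader which form follows.
    // Assumes every set bit in `deleted` is below maxDoc.
    static byte[] write(BitSet deleted, int maxDoc) {
        int numDeleted = deleted.cardinality();
        byte format = chooseFormat(maxDoc, numDeleted);
        if (format == FORMAT_SPARSE) {
            ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 4 * numDeleted);
            buf.put(format).putInt(numDeleted);
            for (int doc = deleted.nextSetBit(0); doc >= 0; doc = deleted.nextSetBit(doc + 1)) {
                buf.putInt(doc);
            }
            return buf.array();
        }
        byte[] out = new byte[1 + (maxDoc + 7) / 8];
        out[0] = format;
        for (int doc = deleted.nextSetBit(0); doc >= 0; doc = deleted.nextSetBit(doc + 1)) {
            out[1 + (doc >> 3)] |= (byte) (1 << (doc & 7));
        }
        return out;
    }
}
```

With this cost model, a single deletion in a million-document segment is written as 9 bytes instead of roughly 125 KB, while a heavily deleted segment falls back to the dense form.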

On Jan 7, 2009, at 8:38 PM, Marvin Humphrey (JIRA) wrote:


[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661829#action_12661829 ]

Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

Jason Rutherglen:

For realtime search where a new transaction may only have a handful of
deletes the tombstones may not be optimal

The whole tombstone idea arose out of the need for (close to) realtime search! It's intended to improve write speed.

When you make deletes with the BitSet model, you have to rewrite files that scale with segment size, regardless of how few deletions you make. Deletion of a single document in a large segment may necessitate writing out a substantial bit vector file.

In contrast, i/o throughput for writing out a tombstone file scales with the number of tombstones.
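To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The class and method names are hypothetical, and the 4-bytes-per-tombstone figure is an assumption made for illustration; a real tombstone file would probably use a compressed encoding.

```java
// Hypothetical cost model for illustration only.
public class DeletionWriteCost {
    // A bit vector holds one bit per document in the segment,
    // so its size depends only on the segment size.
    static long bitVectorBytes(int maxDoc) {
        return (maxDoc + 7L) / 8;
    }

    // Assumption: each tombstone is stored as a fixed-width 4-byte doc id,
    // so the size depends only on the number of new deletions.
    static long tombstoneBytes(int newDeletes) {
        return 4L * newDeletes;
    }

    public static void main(String[] args) {
        int maxDoc = 10_000_000; // a large segment
        System.out.println(bitVectorBytes(maxDoc)); // 1250000
        System.out.println(tombstoneBytes(1));      // 4
    }
}
```

Under these assumptions, deleting one document from a ten-million-document segment means rewriting about 1.25 MB with a bit vector but writing only a few bytes as a tombstone.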

because too many tombstones would accumulate (I believe).

Say that you make a string of commits that are nothing but deleting a single document -- thus adding a new segment each time that contains nothing but a single tombstone. Those are going to be cheap to merge, so it seems unlikely that we'll end up with an unwieldy number of tombstone streams to interleave at search-time.
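As an illustration of why small tombstone streams are cheap to combine, here is a sketch of a k-way merge. It assumes each stream is a sorted array of deleted doc ids that all target the same segment; the class name `TombstoneMerge` and this in-memory representation are hypothetical, not the design under discussion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; real tombstone streams would be read from disk.
public class TombstoneMerge {
    // Merges sorted streams of deleted doc ids into one sorted, duplicate-free array.
    static int[] merge(List<int[]> streams) {
        // Queue entries are {docId, streamIndex, positionInStream}, ordered by docId.
        PriorityQueue<int[]> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int s = 0; s < streams.size(); s++) {
            if (streams.get(s).length > 0) {
                queue.add(new int[] {streams.get(s)[0], s, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            // Skip a doc id that another stream has already contributed.
            if (merged.isEmpty() || merged.get(merged.size() - 1) != head[0]) {
                merged.add(head[0]);
            }
            int[] stream = streams.get(head[1]);
            int next = head[2] + 1;
            if (next < stream.length) {
                queue.add(new int[] {stream[next], head[1], next});
            }
        }
        return merged.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

The work is proportional to the total number of tombstones (times a log factor in the number of streams), not to the size of the segment they apply to.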

The more likely problem is the one McCandless articulated regarding a large segment accumulating a lot of tombstone streams against it. But I agree with him that it only gets truly serious if your merge policy neglects such segments and allows them to deteriorate for too long.

For this scenario rolling bitsets may be better. Meaning pool bit sets and
throw away unused readers.

I don't think I understand. Is this the "combination index reader/writer" model, where the writer prepares a data structure that then gets handed off to the reader?

BitVector implement DocIdSet
----------------------------

                Key: LUCENE-1476
                URL: https://issues.apache.org/jira/browse/LUCENE-1476
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
   Affects Versions: 2.4
           Reporter: Jason Rutherglen
           Priority: Trivial
        Attachments: LUCENE-1476.patch

  Original Estimate: 12h
 Remaining Estimate: 12h

BitVector can implement DocIdSet. This is for making SegmentReader.deletedDocs pluggable.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



