[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668946#action_12668946 ]
Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

> Actually I used your entire patch on its own.

Ah. Thanks for running all these benchmarks. (I'm going to have to port the
benchmark suite sooner rather than later so that Lucy doesn't have to rely on
having algos worked out under Lucene.)

I think it's important to note that the degradation at low deletion
percentages is not as bad as the chart seems to imply. The worst performance
was on the cheapest query, and the best performance was on the most expensive
query. Furthermore, a 50% deletion percentage is very high.

Regardless of how deletions are implemented, the default merge policy ought to
trigger absorption of segments with deletion percentages over a certain
threshold. The actual threshold percentage is an arbitrary number because the
ideal tradeoff between index-time sacrifices and search-time sacrifices
varies, but my gut pick would be somewhere between 10% and 30% for bit vector
deletions and between 5% and 20% for iterated deletions.

> I'm now feeling like we're gonna have to keep random-access to deleted
> docs....

Ha, you're giving up a little early, amigo. :)

I do think it's clear that maintaining an individual deletions iterator for
each posting list isn't going to work well, particularly when it's implemented
using an opaque DocIdSetIterator and virtual methods. That can't compete with
what is essentially an array lookup (since BitVector.get() can be inlined) on
a shared bit vector, even at low deletion rates. I also think we can conclude
that high deletion rates cause an accelerating degradation with iterated
deletions, and that if they are implemented, that problem probably needs to be
addressed with more aggressive segment merging. However, I don't think the
benchmark data we've seen so far demonstrates that the filtered deletions
model won't work.
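To make the contrast concrete, here's a minimal, hypothetical sketch (the class,
method names, and the DeletionsIterator interface are invented for illustration,
not Lucene's actual API) of the two deletion-check strategies: the shared
bit-vector model reduces the test to an inlinable bit lookup per candidate doc,
while the iterator model pays a virtual advance() call for every run of deleted
docs it has to skip.

```java
import java.util.BitSet;

public class DeletionCheckSketch {

    // Stand-in for an opaque DocIdSetIterator over deleted docs:
    // returns the next deleted doc id >= target (MAX_VALUE when exhausted).
    interface DeletionsIterator {
        int advance(int target);
    }

    // Bit-vector model: deletion test is a cheap, inlinable bit lookup
    // against one shared structure, regardless of how many posting
    // lists are being iterated.
    static int countLiveBitVector(BitSet deleted, int maxDoc) {
        int live = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!deleted.get(doc)) {
                live++;
            }
        }
        return live;
    }

    // Iterator model: each posting list walks its own deletions
    // iterator; the virtual advance() dispatch is the cost that the
    // benchmarks showed growing as the deletion rate rises.
    static int countLiveIterated(DeletionsIterator it, int maxDoc) {
        int live = 0;
        int nextDeleted = it.advance(0);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (doc == nextDeleted) {
                nextDeleted = it.advance(doc + 1);
            } else {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        int maxDoc = 10;
        BitSet deleted = new BitSet(maxDoc);
        deleted.set(3);
        deleted.set(7);

        // Back the iterator with the same bit set so both models agree.
        DeletionsIterator it = target -> {
            int d = deleted.nextSetBit(target);
            return d < 0 ? Integer.MAX_VALUE : d;
        };

        System.out.println(countLiveBitVector(deleted, maxDoc));
        System.out.println(countLiveIterated(it, maxDoc));
    }
}
```

Both counts agree; the difference the benchmarks measured is purely in the cost
per check, which is why the iterator model degrades fastest on cheap queries
where that per-doc overhead dominates.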
Heck, with that model, deletions get pulled out of TermDocs so we lose the
per-iter null-check, possibly yielding a small performance increase in the
common case of zero deletions.

Maybe we should close this issue with a won't-fix and start a new one for
filtered deletions?

> BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: hacked-deliterator.patch, LUCENE-1476.patch,
> LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch,
> quasi_iterator_deletions.diff, quasi_iterator_deletions_r2.diff,
> quasi_iterator_deletions_r3.diff, searchdeletes.alg, sortBench2.py,
> sortCollate2.py, TestDeletesDocIdSet.java
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Update BitVector to implement DocIdSet. Expose deleted docs DocIdSet from
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.