[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs

Michael McCandless (JIRA) Fri, 30 Jan 2009 13:09:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669024#action_12669024
 ]


Michael McCandless commented on LUCENE-1476:
--------------------------------------------

{quote}
Thanks for running all these benchmarks. (I'm going to have to port the
benchmark suite sooner rather than later so that Lucy doesn't have to rely on
having algos worked out under Lucene.)
{quote}

No problem :) And I don't mind working these things out / discussing
them in Lucene -- many of these ideas turn out very well!  EG
LUCENE-1483.

Though... iteration may work better in C than it does in Java, so it's
hard to know what to conclude here for Lucy.

{quote}
I think it's important to note that the degradation at low deletion
percentages is not as bad as the chart seems to imply. The worst performance
was on the cheapest query, and the best performance was on the most expensive
query.
{quote}

True.

{quote}
Furthermore, a 50% deletion percentage is very high. Regardless of how
deletions are implemented, the default merge policy ought to trigger
absorption of segments with deletion percentages over a certain threshold.
The actual threshold percentage is an arbitrary number because the ideal
tradeoff between index-time sacrifices and search-time sacrifices varies, but
my gut pick would be somewhere between 10% and 30% for bit vector deletions
and between 5% and 20% for iterated deletions.
{quote}

Right, we should have the default merge policy try to coalesce segments
with too many deletions (Lucene does not do this today -- you have to
call expungeDeletes yourself, and even that's not perfect since it
eliminates every single delete).
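
As a rough illustration of the threshold idea, here is a minimal sketch
that picks segments whose deleted fraction exceeds a cutoff. The class,
field and method names (SegmentStats, segmentsOverThreshold) are
hypothetical and are not Lucene's MergePolicy API; the cutoff passed in
main is just a number taken from the 10%-30% range suggested above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DeletionThresholdSketch {

    /** Hypothetical per-segment stats; not a Lucene class. */
    static final class SegmentStats {
        final String name;
        final int maxDoc;    // total docs in the segment, including deleted ones
        final int delCount;  // number of deleted docs

        SegmentStats(String name, int maxDoc, int delCount) {
            this.name = name;
            this.maxDoc = maxDoc;
            this.delCount = delCount;
        }

        double deletedFraction() {
            return maxDoc == 0 ? 0.0 : (double) delCount / maxDoc;
        }
    }

    /** Returns the names of segments whose deleted fraction is above the threshold. */
    static List<String> segmentsOverThreshold(List<SegmentStats> segments, double threshold) {
        List<String> picked = new ArrayList<>();
        for (SegmentStats s : segments) {
            if (s.deletedFraction() > threshold) {
                picked.add(s.name);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<SegmentStats> segments = Arrays.asList(
            new SegmentStats("_0", 1000, 50),    // 5% deleted
            new SegmentStats("_1", 1000, 250));  // 25% deleted
        // 0.20 is an assumed cutoff, chosen only for illustration.
        System.out.println(segmentsOverThreshold(segments, 0.20));
    }
}
```

Unlike expungeDeletes, a policy along these lines would leave lightly
deleted segments alone and only rewrite the ones past the cutoff.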

bq. Ha, you're giving up a little early, amigo.

:)

{quote}
I do think it's clear that maintaining an individual deletions iterator for
each posting list isn't going to work well, particularly when it's implemented
using an opaque DocIdSetIterator and virtual methods. That can't compete with
what is essentially an array lookup (since BitVector.get() can be inlined) on
a shared bit vector, even at low deletion rates.
{quote}

Right.
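
To make the two access patterns concrete, here is a small self-contained
sketch. It uses java.util.BitSet and plain int arrays as stand-ins for
BitVector and DocIdSetIterator, so it shows only the shape of the work
per posting, not Lucene's actual classes or their performance. All names
in it are made up for illustration.

```java
import java.util.BitSet;

public class DeletionCheckSketch {

    /** Random access: one bit lookup per candidate doc against a shared bit set. */
    static int countLiveRandomAccess(int[] postings, BitSet deleted) {
        int live = 0;
        for (int doc : postings) {
            if (!deleted.get(doc)) {
                live++;
            }
        }
        return live;
    }

    /** Iterated: advance a cursor over the sorted deleted doc IDs alongside the postings. */
    static int countLiveIterated(int[] postings, int[] sortedDeleted) {
        int live = 0;
        int d = 0; // cursor into sortedDeleted; this state is needed per posting list
        for (int doc : postings) {
            while (d < sortedDeleted.length && sortedDeleted[d] < doc) {
                d++;
            }
            if (d == sortedDeleted.length || sortedDeleted[d] != doc) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        int[] postings = {1, 4, 5, 9, 12};
        int[] sortedDeleted = {4, 7, 12};
        BitSet deleted = new BitSet();
        for (int doc : sortedDeleted) {
            deleted.set(doc);
        }
        // Both approaches agree on which docs are live (1, 5 and 9 here).
        System.out.println(countLiveRandomAccess(postings, deleted));
        System.out.println(countLiveIterated(postings, sortedDeleted));
    }
}
```

The iterated version carries a cursor and an inner loop for every
posting list, while the random-access version is a single lookup on
shared state.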

{quote}
I also think we can conclude that high deletion rates cause an accelerating
degradation with iterated deletions, and that if they are implemented, that
problem probably needs to be addressed with more aggressive segment merging.
{quote}

Agreed.

{quote}
However, I don't think the benchmark data we've seen so far demonstrates that
the filtered deletions model won't work. Heck, with that model,
deletions get pulled out of TermDocs so we lose the per-iter null-check,
possibly yielding a small performance increase in the common case of zero
deletions.
{quote}

By "filtered deletions model", you mean treating deletions as a
"normal" filter and moving "up" when they are applied?  Right, we have
not explored that yet, here.

Except, filters are accessed by iteration in Lucene now (they used to
be random access before LUCENE-584).

So... now I'm wondering if we should go back and scrutinize the
performance cost of switching from random access to iteration for
filters, especially in cases where under-the-hood the filter needed to
create a non-sparse repr anyway.  It looks like there was a fair
amount of discussion about performance, but no hard conclusions, in
LUCENE-584, on first read.  Hmmm.

I think to do a fair test, we need to move deleted docs "up higher",
but access it w/ random access API?

Except... the boolean query QPS is not that different between the
no-deletions case and the 1% deletions case.  And it's only the non-simple
queries (boolean, span, phrase, etc.) that would see any difference
in moving deletions "up" as a filter?  Ie their QPS is already so low
that the added cost of doing the deletes "at the bottom" appears to be
quite minor.

(Conceptually, I completely agree that deletions are really the same
thing as filters).
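
A minimal sketch of what moving deletions "up" with a random access API
could look like for a two-clause AND query: the postings are intersected
with no deletion check at all, and the deleted bit is consulted only for
docs that match both clauses. Plain sorted int arrays and
java.util.BitSet stand in for TermDocs and BitVector here; the class and
method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class FilteredDeletionsSketch {

    /**
     * Intersects two sorted postings arrays and applies the deletions
     * "filter" only to docs that matched both clauses.
     */
    static List<Integer> andQuery(int[] a, int[] b, BitSet deleted) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                // Deletions are checked here, once per match, not once per posting.
                if (!deleted.get(a[i])) {
                    hits.add(a[i]);
                }
                i++;
                j++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7, 9};
        int[] b = {3, 4, 5, 9};
        BitSet deleted = new BitSet();
        deleted.set(5);
        System.out.println(andQuery(a, b, deleted));
    }
}
```

Whether this saves anything measurable is exactly the open question
above, since the per-posting check "at the bottom" already looks cheap
relative to the cost of these queries.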

{quote}
Maybe we should close this issue with a won't-fix and start a new one for
filtered deletions?
{quote}
Maybe discuss a bit more here or on java-dev first?


> BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: hacked-deliterator.patch, LUCENE-1476.patch, 
> LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch, 
> quasi_iterator_deletions.diff, quasi_iterator_deletions_r2.diff, 
> quasi_iterator_deletions_r3.diff, searchdeletes.alg, sortBench2.py, 
> sortCollate2.py, TestDeletesDocIdSet.java
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Update BitVector to implement DocIdSet.  Expose deleted docs DocIdSet from 
> IndexReader.

