[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668946#action_12668946 ]
Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

> Actually I used your entire patch on its own.

Ah. Thanks for running all these benchmarks. (I'm going to have to port the
benchmark suite sooner rather than later so that Lucy doesn't have to rely on
having algos worked out under Lucene.)

I think it's important to note that the degradation at low deletion
percentages is not as bad as the chart seems to imply. The worst performance
was on the cheapest query, and the best performance was on the most expensive
query. Furthermore, a 50% deletion percentage is very high.

Regardless of how deletions are implemented, the default merge policy ought to
trigger absorption of segments with deletion percentages over a certain
threshold. The actual threshold percentage is an arbitrary number because the
ideal tradeoff between index-time sacrifices and search-time sacrifices
varies, but my gut pick would be somewhere between 10% and 30% for bit vector
deletions and between 5% and 20% for iterated deletions.

> I'm now feeling like we're gonna have to keep random-access to deleted
> docs....

Ha, you're giving up a little early, amigo. :)

I do think it's clear that maintaining an individual deletions iterator for
each posting list isn't going to work well, particularly when it's implemented
using an opaque DocIdSetIterator and virtual methods. That can't compete with
what is essentially an array lookup (since BitVector.get() can be inlined) on
a shared bit vector, even at low deletion rates. I also think we can conclude
that high deletion rates cause an accelerating degradation with iterated
deletions, and that if they are implemented, that problem probably needs to be
addressed with more aggressive segment merging. However, I don't think the
benchmark data we've seen so far demonstrates that the filtered deletions
model won't work.
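To make the contrast concrete, here's a minimal, hypothetical sketch (the class,
method names, and the DeletionsIterator interface are invented for illustration,
not Lucene's actual API) of the two deletion-check strategies: the shared
bit-vector model reduces the test to an inlinable bit lookup per candidate doc,
while the iterator model pays a virtual advance() call for every run of deleted
docs it has to skip.

```java
import java.util.BitSet;

public class DeletionCheckSketch {

    // Stand-in for an opaque DocIdSetIterator over deleted docs:
    // returns the next deleted doc id >= target (MAX_VALUE when exhausted).
    interface DeletionsIterator {
        int advance(int target);
    }

    // Bit-vector model: deletion test is a cheap, inlinable bit lookup
    // against one shared structure, regardless of how many posting
    // lists are being iterated.
    static int countLiveBitVector(BitSet deleted, int maxDoc) {
        int live = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!deleted.get(doc)) {
                live++;
            }
        }
        return live;
    }

    // Iterator model: each posting list walks its own deletions
    // iterator; the virtual advance() dispatch is the cost that the
    // benchmarks showed growing as the deletion rate rises.
    static int countLiveIterated(DeletionsIterator it, int maxDoc) {
        int live = 0;
        int nextDeleted = it.advance(0);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (doc == nextDeleted) {
                nextDeleted = it.advance(doc + 1);
            } else {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        int maxDoc = 10;
        BitSet deleted = new BitSet(maxDoc);
        deleted.set(3);
        deleted.set(7);

        // Back the iterator with the same bit set so both models agree.
        DeletionsIterator it = target -> {
            int d = deleted.nextSetBit(target);
            return d < 0 ? Integer.MAX_VALUE : d;
        };

        System.out.println(countLiveBitVector(deleted, maxDoc));
        System.out.println(countLiveIterated(it, maxDoc));
    }
}
```

Both counts agree; the difference the benchmarks measured is purely in the cost
per check, which is why the iterator model degrades fastest on cheap queries
where that per-doc overhead dominates.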
Heck, with that model, deletions get pulled out of TermDocs so we lose the
per-iter null-check, possibly yielding a small performance increase in the
common case of zero deletions.

Maybe we should close this issue with a won't-fix and start a new one for
filtered deletions?

> BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: hacked-deliterator.patch, LUCENE-1476.patch,
> LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch,
> quasi_iterator_deletions.diff, quasi_iterator_deletions_r2.diff,
> quasi_iterator_deletions_r3.diff, searchdeletes.alg, sortBench2.py,
> sortCollate2.py, TestDeletesDocIdSet.java
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Update BitVector to implement DocIdSet. Expose deleted docs DocIdSet from
> IndexReader.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.