[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669024#action_12669024 ]
Michael McCandless commented on LUCENE-1476: -------------------------------------------- {quote} Thanks for running all these benchmarks. (I'm going to have to port the benchmark suite sooner rather than later so that Lucy doesn't have to rely on having algos worked out under Lucene.) {quote} No problem :) And I don't mind working these things out / discussing them in Lucene -- many of these ideas turn out very well! EG LUCENE-1483. Though... iteration may work better in C than it does in Java, so it's hard to know what to conclude here for Lucy. {quote} I think it's important to note that the degradation at low deletion percentages is not as bad as the chart seems to imply. The worst performance was on the cheapest query, and the best performance was on the most expensive query. {quote} True. {quote} Furthermore, a 50% deletion percentage is very high. Regardless of how deletions are implemented, the default merge policy ought to trigger absorbtion of segments with deletion percentages over a certain threshold. The actual threshold percentage is an arbitrary number because the ideal tradeoff between index-time sacrifices and search-time sacrifices varies, but my gut pick would be somewhere between 10% and 30% for bit vector deletions and between 5% and 20% for iterated deletions. {quote} Right, we should have the default merge policy try to coalsce segments with too many deletions (Lucene does not do this today -- you have to call expungeDeletes yourself, and even that's not perfect since it eliminates every single delete). bq. Ha, you're giving up a little early, amigo. :) {quote} I do think it's clear that maintaining an individual deletions iterator for each posting list isn't going to work well, particularly when it's implemented using an opaque DocIdSetIterator and virtual methods. That can't compete with what is essentially an array lookup (since BitVector.get() can be inlined) on a shared bit vector, even at low deletion rates. {quote} Right. {quote} I also think we can conclude that high deletion rates cause an accelerating degradation with iterated deletions, and that if they are implemented, that problem probably needs to be addressed with more aggressive segment merging. {quote} Agreed. {quote} However, I don't think the benchmark data we've seen so far demonstrates that the filtered deletions model won't work. Heck, with that model, deletions get pulled out out TermDocs so we lose the per-iter null-check, possibly yielding a small performance increase in the common case of zero deletions. {quote} By "filtered deletions model", you mean treating deletions as a "normal" filter and moving "up" when they are applied? Right, we have not explored that yet, here. Except, filters are accessed by iteration in Lucene now (they used to be random access before LUCENE-584). So... now I'm wondering if we should go back and scrutinize the performance cost of switching from random access to iteration for filters, especially in cases where under-the-hood the filter needed to create a non-sparse repr anyway. It looks like there was a fair amount of discussion about performance, but no hard conclusions, in LUCENE-584, on first read. Hmmm. I think to do a fair test, we need to move deleted docs "up higher", but access it w/ random access API? Except... the boolean query QPS are not that different with the no-deletions case vs the 1% deletions. And it's only the non-simple queries (boolean, span, phrase, etc.) that would see any difference in moving deletions "up" as a filter? Ie their QPS is already so low that the added cost of doing the deletes "at the bottom" appears to be quite minor. (Conceptually, I completely agree that deletions are really the same thing as filters). {quote} Maybe we should close this issue with a won't-fix and start a new one for filtered deletions? {quote} Maybe discuss a bit more here or on java-dev first? > BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs > ----------------------------------------------------------------------- > > Key: LUCENE-1476 > URL: https://issues.apache.org/jira/browse/LUCENE-1476 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Affects Versions: 2.4 > Reporter: Jason Rutherglen > Priority: Trivial > Attachments: hacked-deliterator.patch, LUCENE-1476.patch, > LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch, LUCENE-1476.patch, > quasi_iterator_deletions.diff, quasi_iterator_deletions_r2.diff, > quasi_iterator_deletions_r3.diff, searchdeletes.alg, sortBench2.py, > sortCollate2.py, TestDeletesDocIdSet.java > > Original Estimate: 12h > Remaining Estimate: 12h > > Update BitVector to implement DocIdSet. Expose deleted docs DocIdSet from > IndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org