[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665894#action_12665894
 ] 

Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

> For OR like queries, that use Scorer.next(), deletions might be treated as
> early as possible, since each hit will cause a matching doc and a
> corresponding score calculation.

It depends whether the OR-like query spawns a Scorer that pre-fetches.

At present, deletions are handled in Lucene at the SegmentTermDocs level.
Filtering deletions that early minimizes overhead from *all* Scorer.next()
implementations, including those that pre-fetch like the original
BooleanScorer.  The cost is that we must check every single posting in every
single SegmentTermDocs to see if the doc has been deleted.
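
To make that cost concrete, here is a minimal sketch of a postings iterator that filters deletions itself. FilteringTermDocs, its fields, and the deletionChecks counter are hypothetical stand-ins for illustration, not Lucene's actual SegmentTermDocs code; deletions are assumed to be a java.util.BitSet with one bit per doc.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a postings iterator that filters deletions itself. */
final class FilteringTermDocs {
    private final int[] postings;  // sorted doc ids for one term
    private final BitSet deleted;  // a set bit marks a deleted doc
    private int pos = -1;
    int deletionChecks = 0;        // how many times the bit set was consulted

    FilteringTermDocs(int[] postings, BitSet deleted) {
        this.postings = postings;
        this.deleted = deleted;
    }

    /** Advances to the next live doc; returns false when the postings run out. */
    boolean next() {
        while (++pos < postings.length) {
            deletionChecks++;                 // one check per posting, deleted or not
            if (!deleted.get(postings[pos])) {
                return true;
            }
        }
        return false;
    }

    /** Current doc id; only valid after next() has returned true. */
    int doc() {
        return postings[pos];
    }
}
```

Every Scorer built on top of such an iterator sees only live docs, but the counter grows by one for each posting of each term, whether or not anything has been deleted.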

The Scorer_Collect() routine above, which skips past deletions, consolidates
the cost of checking every single posting into one check.  However, the
original BooleanScorer can't skip, so it will incur overhead from scoring
documents which have been deleted.  Do the savings and the additional costs
balance each other out?  It's hard to say.

However, with OR-like Scorers which implement skipping and don't pre-fetch --
Lucene's DisjunctionSumScorer and its KS analogue ORScorer -- you get the best
of both worlds.  Not only do you avoid the per-posting-per-SegTermDocs
deletions check, you get to skip past deletions and avoid the extra overhead
in Scorer.next().
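
A rough Java sketch of that collection loop, under stated assumptions: CollectLoop, DocScorer, advance() and over() are hypothetical names for illustration, not the Lucene or KS API, and the scorer is assumed to support skipping without pre-fetching. Deletions are consulted once per candidate hit, and a run of deleted docs is skipped in one call.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class CollectLoop {
    /** Minimal hypothetical scorer that iterates matches in increasing doc order. */
    interface DocScorer {
        /** Returns the first match >= target, or -1 when exhausted. */
        int advance(int target);
    }

    /** Test helper: a scorer over a sorted array of doc ids. */
    static DocScorer over(int... docs) {
        return new DocScorer() {
            private int pos = 0;
            public int advance(int target) {
                while (pos < docs.length && docs[pos] < target) pos++;
                return pos < docs.length ? docs[pos] : -1;
            }
        };
    }

    /** Collects live hits, checking deletions once per candidate doc. */
    static List<Integer> collect(DocScorer scorer, BitSet deleted) {
        List<Integer> hits = new ArrayList<>();
        int doc = scorer.advance(0);
        int nextDeletion = deleted.nextSetBit(0);  // -1 if nothing is deleted
        while (doc != -1) {
            if (nextDeletion != -1 && nextDeletion < doc) {
                nextDeletion = deleted.nextSetBit(doc);  // catch up to the scorer
            }
            if (doc == nextDeletion) {
                // Skip the scorer past the whole run of consecutive deletions.
                doc = scorer.advance(deleted.nextClearBit(doc));
                continue;
            }
            hits.add(doc);
            doc = scorer.advance(doc + 1);
        }
        return hits;
    }
}
```

With a scorer that cannot skip, the advance() call in the deletion branch degenerates into repeated next() calls, which is the extra scoring overhead described above for the original BooleanScorer.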

Of course, the problem is that ScorerDocQueue isn't very efficient, so
DisjunctionSumScorer and ORScorer are often slower than the distributed sort
in the original BooleanScorer.

> For AND like queries, that use Scorer.skipTo(), deletions can be treated
> later, for example in skipTo() of the conjunction/and/scorer.

Yes, and it could be implemented by adding a Filter sub-clause to the
conjunction/and/scorer.
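
As a sketch of what that could look like, here is a leapfrog-style intersection in Java where the "not deleted" set is just one more clause. ConjunctionWithDeletions, DocIterator, postings(), liveDocs() and intersect() are hypothetical names for illustration, not Lucene's ConjunctionScorer or Filter API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class ConjunctionWithDeletions {
    /** Hypothetical iterator: first doc >= target, or -1 when exhausted. */
    interface DocIterator {
        int advance(int target);
    }

    /** An ordinary clause over a sorted array of doc ids. */
    static DocIterator postings(int... docs) {
        return new DocIterator() {
            private int pos = 0;
            public int advance(int target) {
                while (pos < docs.length && docs[pos] < target) pos++;
                return pos < docs.length ? docs[pos] : -1;
            }
        };
    }

    /** Filter clause: matches every doc below maxDoc whose deleted bit is clear. */
    static DocIterator liveDocs(BitSet deleted, int maxDoc) {
        return target -> {
            int doc = deleted.nextClearBit(target);
            return doc < maxDoc ? doc : -1;
        };
    }

    /** Round-robin skipping until all clauses agree on the same doc. */
    static List<Integer> intersect(DocIterator... clauses) {
        List<Integer> hits = new ArrayList<>();
        if (clauses.length == 0) return hits;
        int candidate = 0;
        int agreed = 0;  // consecutive clauses that landed exactly on candidate
        int i = 0;
        while (true) {
            int doc = clauses[i].advance(candidate);
            if (doc == -1) return hits;       // one clause is exhausted
            if (doc == candidate) {
                agreed++;
            } else {
                candidate = doc;              // overshoot: new candidate
                agreed = 1;
            }
            if (agreed == clauses.length) {
                hits.add(candidate);
                candidate++;
                agreed = 0;
            }
            i = (i + 1) % clauses.length;
        }
    }
}
```

The deletions clause is only asked about docs that some other clause has already proposed, so its cost scales with the number of candidates rather than the number of postings.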



> BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch, LUCENE-1476.patch, 
> quasi_iterator_deletions.diff, quasi_iterator_deletions_r2.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Update BitVector to implement DocIdSet.  Expose deleted docs DocIdSet from 
> IndexReader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
