[
https://issues.apache.org/jira/browse/LUCENE-6585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600550#comment-14600550
]
Robert Muir commented on LUCENE-6585:
-------------------------------------
I added lots of coord() tests, and changed it to test a crazy coord(), the
patch here is fine actually. It is difficult to see that its correct, but the
tests help.
I still opened LUCENE-6603 regardless, this stuff is really tricky and we
should try to simplify life at some point.
Separately my concerns around 2-phase iterators (phrase/multiphrase) should not
be an issue, unless they already have bugs with the two-phase API. I will look
into this now.
But overall I think the patch here is good. We should probably add a comment to
collapseSubs so that its a little more obvious:
We collapse the sub-iterators (removing the useless conjunction "in between")
but the two-phase views are also kept. So everything works because of
guarantees in the api.
I don't like the instanceof ConjunctionDISI check, with the assert false. The
reason is that this class is not final. I would rather it be an explicit class
check on getClass() == ConjunctionDISI.class or TwoPhase.class.
> Make ConjunctionDISI flatten sub ConjunctionDISI instances
> ----------------------------------------------------------
>
> Key: LUCENE-6585
> URL: https://issues.apache.org/jira/browse/LUCENE-6585
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-6585.patch
>
>
> Today ConjunctionDISI wraps some sub (two-phase) iterators. I would like to
> improve it by flattening sub iterators when they implement ConjunctionDISI.
> In practice, this would make "+A +(+B +C)" be executed more like "+A +B +C"
> (only in terms of matching, scoring would not change).
> My motivation for this is that if we don't flatten and are unlucky, we can
> sometimes hit some worst cases. For instance consider the 3 following
> postings lists (sorted by increasing cost):
> A: 1, 1001, 2001, 3001, ...
> C: 0, 2, 4, 6, 8, 10, 12, 14, ...
> B: 1, 3, 5, 7, 9, 11, 13, 15, ...
> If we run "+A +B +C", then everything works fine, we use A as a lead, and
> advance B 1000 by 1000 to find the next match (if any).
> However if we run "+A +(+B +C)", then we would iterate B and C 2 by 2 over
> the entire doc ID space when trying to find the first match which occurs on
> or after A:1.
> This is an extreme example which is unlikely to happen in practice, but
> flattening would also help a bit on some more common cases. For instance
> imagine that A, B and C have respective costs of 100, 10 and 1000. If you
> search for "+A +(+B +C)", then we will use the most costly iterator (C) to
> confirm matches of B (the least costly iterator, used as a lead) while it
> would have been more efficient to confirm matches of B with A first, since A
> is less costly than C.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]