Just chiming in here to answer David's question since I have some familiarity:
In this specific case, the logic was implemented inside a Collector,
and we tried to move it into a Query abstraction using a
TwoPhaseIterator with a high matchCost. The first phase (the
approximation) matched all docs (essentially
DocIdSetIterator.all(reader.maxDoc())), and the second phase did the
costly check. The matchCost was advertised as reader.maxDoc(), where
"reader" comes from the LeafReaderContext.
Moving the logic behind a Query abstraction caused performance
regressions. One theory is that iteration was somehow being led by
the expensive "match all docs" DISI, but we haven't confirmed that
yet.
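To make the theory a bit more concrete, here is a toy model in plain
Java. It is not Lucene code and does not use Lucene's classes:
ConjunctionSketch, CountingIterator, all, of and intersect are names
made up for this sketch, and the leapfrog loop is only my
simplification of how a two-clause conjunction advances. It counts
advance calls per clause and second-phase checks, so the two lead
orders can be compared.

```java
import java.util.function.IntPredicate;

/** Toy model of a two-clause conjunction with call counting; not Lucene code. */
public class ConjunctionSketch {
    public static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Minimal doc-ID iterator that counts how often it is advanced. */
    public abstract static class CountingIterator {
        public int calls;

        /** Moves to the first doc >= target and returns it. */
        public final int advance(int target) {
            calls++;
            return doAdvance(target);
        }

        protected abstract int doAdvance(int target);
    }

    /** Stand-in for a "match all docs" first phase over [0, maxDoc). */
    public static CountingIterator all(int maxDoc) {
        return new CountingIterator() {
            @Override
            protected int doAdvance(int target) {
                return target < maxDoc ? target : NO_MORE_DOCS;
            }
        };
    }

    /** Stand-in for a selective clause backed by a sorted list of doc IDs. */
    public static CountingIterator of(int... sortedDocs) {
        return new CountingIterator() {
            private int idx = 0;

            @Override
            protected int doAdvance(int target) {
                while (idx < sortedDocs.length && sortedDocs[idx] < target) {
                    idx++;
                }
                return idx < sortedDocs.length ? sortedDocs[idx] : NO_MORE_DOCS;
            }
        };
    }

    /**
     * Leapfrogs lead and follower, running the costly second-phase check
     * only on docs where both agree. Returns {hits, costlyChecks}.
     */
    public static int[] intersect(CountingIterator lead, CountingIterator follower,
                                  IntPredicate costlyCheck) {
        int hits = 0;
        int checks = 0;
        int doc = lead.advance(0);
        while (doc != NO_MORE_DOCS) {
            int other = follower.advance(doc);
            if (other == doc) {
                checks++;
                if (costlyCheck.test(doc)) {
                    hits++;
                }
                doc = lead.advance(doc + 1);
            } else {
                // Follower skipped ahead (or is exhausted); catch the lead up.
                doc = lead.advance(other);
            }
        }
        return new int[] {hits, checks};
    }

    public static void main(String[] args) {
        IntPredicate costly = doc -> doc % 2 == 0; // placeholder for the expensive filter
        CountingIterator selective = of(3, 10, 42, 77);
        CountingIterator matchAll = all(100);
        int[] r = intersect(selective, matchAll, costly); // swap the first two args to flip the lead
        System.out.println("hits=" + r[0] + " costlyChecks=" + r[1]
            + " selectiveCalls=" + selective.calls + " matchAllCalls=" + matchAll.calls);
    }
}
```

In this toy model the number of costly checks is the same whichever
clause leads; only the number of advance calls changes. So if real
counts showed a big jump in second-phase checks, lead order alone
would not explain it under these assumptions, which is why actual
per-clause counters would help.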
Cheers,
-Greg
On Fri, May 7, 2021 at 8:41 AM David Smiley <[email protected]> wrote:
>
> Instead of a Collector, why isn't this a TwoPhaseIterator with a high
> matchCost?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <[email protected]> wrote:
>>
>> Thanks Adrien, that is something like what I had in mind. If you are
>> able to share, that could be very helpful. And -- deleted docs is not
>> something I had considered, it's possibly a problem here. I'd have to
>> go check - I think these "filter" Queries were implemented in the
>> second part of the two-phase iteration.
>>
>> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[email protected]> wrote:
>> >
>> > We have something like that in Elasticsearch that wraps queries in order
>> > to be able to report cost, matchCost and the number of calls to
>> > nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in
>> > the query tree.
>> >
>> > It's not perfect as it needs to disable some optimizations in order to
>> > work properly. For instance bulk scorers are disabled and conjunctions are
>> > not inlined, which means that clauses may run in a different order. So
>> > results need to be interpreted carefully as the way the query gets
>> > executed when observed may differ a bit from how it gets executed
>> > normally. That said it has still been useful in a number of cases. I don't
>> > think our implementation works when IndexSearcher is configured with an
>> > executor but we could maybe put it in sandbox and iterate from there?
>> >
>> > For your case, do you think it could be attributed to deleted docs?
>> > Deleted docs are checked before two-phase confirmation and collectors but
>> > after disjunctions/conjunctions of postings.
>> >
>> > On Thu, May 6, 2021 at 8:20 PM Michael Sokolov <[email protected]> wrote:
>> >>
>> >> Do we have a way to understand how BooleanQuery (and other composite
>> >> queries) are advancing their child queries? For example, a simple
>> >> conjunction of two queries advances the more restrictive (lower
>> >> cost()) query first, enabling the more costly query to skip over more
>> >> documents. But we may not be making the best choice in every case, and
>> >> I would like to know, for some query, how we are doing. For example,
>> >> we could execute in a debugging mode, interposing something that wraps
>> >> or observes the Scorers in some way, gathering statistics about how
>> >> many documents are visited by each Scorer, which can be aggregated for
>> >> later analysis.
>> >>
>> >> This is motivated by a use case we have in which we currently
>> >> post-filter our query results in a custom collector using some filters
>> >> that we know to be expensive (they must be evaluated on every
>> >> document), but we would rather express these post-filters as Queries
>> >> and have them advanced during the main Query execution. However when
>> >> we tried to do that, we saw some slowdowns (in spite of marking these
>> >> Queries as high-cost) and I suspect it is due to the iteration order,
>> >> but I'm not sure how to debug.
>> >>
>> >> Suggestions welcome!
>> >>
>> >> -Mike
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>