Thanks for the clarification Greg. I've been looking into this recently and filed https://issues.apache.org/jira/browse/LUCENE-9938 based on a hunch that these DocIdSetIterator.all(maxDoc) iterators have a non-negligible cost inside ConjunctionDISI. Ultimately I closed the issue because the TPI design seems to prohibit removing them :-(. Feel free to comment there nonetheless if you have any thoughts on the matter. For my part, I have some benchmarking to do in Solr for a related matter that would move certain queries that work at the collector stage to be TPIs -- SOLR-14164.
~ David Smiley Apache Lucene/Solr Search Developer http://www.linkedin.com/in/davidwsmiley On Fri, May 7, 2021 at 6:06 PM Greg Miller <[email protected]> wrote: > Just chiming in here to answer David's question since I have some > familiarity: > > In this specific case, the logic was implemented inside a Collector > and we tried to move it into a Query abstraction using a > TwoPhaseIterator with a high matchCost. The first-phase would match on > all docs (essentially: DocIdSetIterator.all(reader.maxDoc())) and the > second phase would do the costly check. The matchCost was advertised > as reader.maxDoc(). ("reader" in this example is from the > LeafReaderContext). > > Moving the logic behind a Query abstraction caused performance > regressions. So one theory is that it was somehow leading iteration > with an expensive "match all docs" DISI, but we don't actually know if > that's true right now. > > Cheers, > -Greg > > On Fri, May 7, 2021 at 8:41 AM David Smiley <[email protected]> wrote: > > > > Instead of a Collector, why isn't this a TwoPhaseIterator with a high > matchCost? > > > > ~ David Smiley > > Apache Lucene/Solr Search Developer > > http://www.linkedin.com/in/davidwsmiley > > > > > > On Thu, May 6, 2021 at 6:43 PM Michael Sokolov <[email protected]> > wrote: > >> > >> Thanks Adrien, that is something like what I had in mind. If you are > >> able to share, that could be very helpful. And -- deleted docs is not > >> something I had considered, it's possibly a problem here. I'd have to > >> go check - I think these "filter" Queries were implemented in the > >> second part of the two-phase iteration. > >> > >> On Thu, May 6, 2021 at 4:24 PM Adrien Grand <[email protected]> wrote: > >> > > >> > We have something like that in Elasticsearch that wraps queries in > order to be able to report cost, matchCost and the number of calls to > nextDoc/advance/matches/score/advanceShallow/getMaxScore for every node in > the query tree. > >> > > >> > It's not perfect as it needs to disable some optimizations in order > to work properly. For instance bulk scorers are disabled and conjunctions > are not inlined, which means that clauses may run in a different order. So > results need to be interpreted carefully as the way the query gets executed > when observed may differ a bit from how it gets executed normally. That > said it has still been useful in a number of cases. I don't think our > implementation works when IndexSearcher is configured with an executor but > we could maybe put it in sandbox and iterate from there? > >> > > >> > For your case, do you think it could be attributed to deleted docs? > Deleted docs are checked before two-phase confirmation and collectors but > after disjunctions/conjunctions of postings. > >> > > >> > Le jeu. 6 mai 2021 à 20:20, Michael Sokolov <[email protected]> a > écrit : > >> >> > >> >> Do we have a way to understand how BooleanQuery (and other composite > >> >> queries) are advancing their child queries? For example, a simple > >> >> conjunction of two queries advances the more restrictive (lower > >> >> cost()) query first, enabling the more costly query to skip over more > >> >> documents. But we may not be making the best choice in every case, > and > >> >> I would like to know, for some query, how we are doing. For example, > >> >> we could execute in a debugging mode, interposing something that > wraps > >> >> or observes the Scorers in some way, gathering statistics about how > >> >> many documents are visited by each Scorer, which can be aggregated > for > >> >> later analysis. > >> >> > >> >> This is motivated by a use case we have in which we currently > >> >> post-filter our query results in a custom collector using some > filters > >> >> that we know to be expensive (they must be evaluated on every > >> >> document), but we would rather express these post-filters as Queries > >> >> and have them advanced during the main Query execution. However when > >> >> we tried to do that, we saw some slowdowns (in spite of marking these > >> >> Queries as high-cost) and I suspect it is due to the iteration order, > >> >> but I'm not sure how to debug. > >> >> > >> >> Suggestions welcome! > >> >> > >> >> -Mike > >> >> > >> >> --------------------------------------------------------------------- > >> >> To unsubscribe, e-mail: [email protected] > >> >> For additional commands, e-mail: [email protected] > >> >> > >> > >> --------------------------------------------------------------------- > >> To unsubscribe, e-mail: [email protected] > >> For additional commands, e-mail: [email protected] > >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > >
