I think it depends on what information we actually want to get here. If it’s just finding which fields matched in which document, then running Matches over the top-k results is fine. If you want to get some kind of aggregate data, as in you want to get a list of fields that matched in *any* document (or conversely, a list of fields that *didn’t* match - useful if you want to prune your schema, for example), then Matches will be too slow. But at the same time, queries are designed to tell you which *documents* match efficiently, and they are allowed to advance their sub-queries lazily or indeed not at all if the result isn’t needed for scoring. So we don’t really have any way of finding this kind of information via a collector that is accurate and performs reasonably.
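To make the distinction concrete, here is a minimal, library-free sketch of the two kinds of question. It is not Lucene code: the `IntFunction<Set<String>>` parameter is a hypothetical stand-in for a stateless per-document lookup (the role Matches plays above), and the class name, method names, and toy data are all invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntFunction;

// Hypothetical model only: no Lucene types are used here.
final class MatchedFieldsSketch {

    /** Per-document view: which fields matched in each of the given hits. */
    static Map<Integer, Set<String>> fieldsPerHit(int[] hits, IntFunction<Set<String>> matcher) {
        Map<Integer, Set<String>> result = new LinkedHashMap<>();
        for (int doc : hits) {
            // One independent lookup per hit: fine for top-k, costly for every match.
            result.put(doc, matcher.apply(doc));
        }
        return result;
    }

    /** Aggregate view: fields that matched in at least one of the inspected hits. */
    static Set<String> fieldsMatchedAnywhere(Map<Integer, Set<String>> perHit) {
        Set<String> union = new TreeSet<>();
        perHit.values().forEach(union::addAll);
        return union;
    }

    /** Schema fields that matched in none of the inspected hits. */
    static Set<String> fieldsNeverMatched(Set<String> schema, Set<String> matched) {
        Set<String> unused = new TreeSet<>(schema);
        unused.removeAll(matched);
        return unused;
    }

    public static void main(String[] args) {
        // Toy "index": doc id -> fields in which the query matched (assumed data).
        Map<Integer, Set<String>> toy = Map.of(
            0, Set.of("title", "body"),
            3, Set.of("body"),
            7, Set.of("tags"));
        IntFunction<Set<String>> matcher = doc -> toy.getOrDefault(doc, Set.of());

        int[] topK = {3, 0}; // document 7 also matches, but is not in the top-k
        Map<Integer, Set<String>> perHit = fieldsPerHit(topK, matcher);
        Set<String> matched = fieldsMatchedAnywhere(perHit);
        Set<String> schema = Set.of("title", "body", "tags", "author");

        System.out.println("per hit: " + perHit);
        System.out.println("matched in some hit: " + matched);
        System.out.println("never matched: " + fieldsNeverMatched(schema, matched));
    }
}
```

In the toy data, document 7 matches on `tags` but is not among the top-k hits, so the aggregate built from top-k alone reports `tags` as never matched. An aggregate answer is only trustworthy if every matching document is inspected, which is exactly where one lookup per document becomes expensive.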
It *might* be possible to rework Matches so that they act more like an iterator and maintain their state within a segment, but there hasn’t been a pressing need for that so far.

> On 27 Jun 2022, at 12:46, Shai Erera <ser...@gmail.com> wrote:
>
> Thanks Alan, yeah I guess I was thinking about the use case I described, which involves (usually) simple term queries, but you're definitely right about complex boolean clauses as well as non-term queries.
>
> I think the case for highlighter is different though? I mean you usually generate highlights only for the top-K results and therefore are probably less affected by whether the matches() API is slower than a Collector. And if you invoke the API for every document in the index, it might be much slower (depending on the index size) than the Collector.
>
> Maybe a hybrid approach which runs the query and caches the docs in a DocIdSet (like FacetsCollector does) and then invokes the matches() API only on those hits, will let you enjoy the best of both worlds? Assuming though that the number of matching documents is not huge.
>
> So it seems there are several options and one should choose based on their use case. Do you see an advantage for Lucene to offer a Collector for this use case? Or should we tell users to use the matches API?
>
> Shai
>
> On Mon, Jun 27, 2022 at 2:22 PM Dawid Weiss <dawid.we...@gmail.com> wrote:
> A side note - I've been using a highlighter based on the matches API for quite some time now and it's been fantastic. Very precise and handles non-trivial queries (interval queries) very well.
>
> https://lucene.apache.org/core/9_2_0/highlighter/org/apache/lucene/search/matchhighlight/package-summary.html
>
> Dawid
>
> On Mon, Jun 27, 2022 at 1:10 PM Alan Woodward <romseyg...@gmail.com> wrote:
> >
> > Your approach is almost certainly more efficient, but it might give you false matches in some cases - for example, if you have a complex query with many nested MUST and SHOULD clauses, you can have a leaf TermScorer that is positioned on the correct document, but which is part of a clause that doesn’t actually match. It also only works for term queries, so it won’t match phrases or span/interval groups. And Matches will work on points or docvalues queries as well. The reason I added Matches in the first place was precisely to handle these weird corner cases - I had written highlighters which more or less did the same thing you describe with a Collector and the Scorable tree, and I would occasionally get bad highlights back.
> >
> > On 27 Jun 2022, at 10:51, Shai Erera <ser...@gmail.com> wrote:
> >
> > Out of curiosity and for education purposes, is the Collector approach I proposed wrong/inefficient? Or less efficient than the matches() API?
> >
> > I'm thinking, if you want to both match/rank documents and as a side effect know which fields matched, the Collector will perform better than Weight.matches(), but I could be wrong.
> >
> > Shai
> >
> > On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
> >>
> >> The matches API is awesome. Use it. You can also get a rough glimpse into a superset of fields potentially matching the query via:
> >>
> >> query.visit(
> >>     new QueryVisitor() {
> >>       @Override
> >>       public boolean acceptField(String field) {
> >>         affectedFields.add(field);
> >>         return false;
> >>       }
> >>     });
> >>
> >> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
> >>
> >> I'd go with the Matches API though.
> >>
> >> Dawid
> >>
> >> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward <romseyg...@gmail.com> wrote:
> >> >
> >> > The Matches API will give you this information - it’s still likely to be fairly slow, but it’s a lot easier to use than trying to parse Explain output.
> >> >
> >> > Query q = ….;
> >> > Weight w = searcher.createWeight(searcher.rewrite(q), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
> >> >
> >> > // doc is relative to the segment `context`; matches() returns null if doc does not match
> >> > Matches m = w.matches(context, doc);
> >> > List<String> matchingFields = new ArrayList<>();
> >> > if (m != null) {
> >> >   for (String field : m) {
> >> >     matchingFields.add(field);
> >> >   }
> >> > }
> >> >
> >> > Bear in mind that `matches` doesn’t maintain any state between calls, so calling it for every matching document is likely to be slow; for those cases Shai’s suggestion of using a Collector and examining low-level scorers will perform better, but it won’t work for every query type.
> >> >
> >> >
> >> > > On 25 Jun 2022, at 04:14, Yichen Sun <yiche...@bu.edu> wrote:
> >> > >
> >> > > Hello!
> >> > >
> >> > > I’m an MSCS student from BU and learning to use Lucene. Recently I have been trying to output the fields matched by a query. For example, for one document, there are 10 fields and 2 of them match the query. I want to get the names of these fields.
> >> > >
> >> > > I have tried using the explain() method, getting the description and then applying a regex. However, it costs so much time.
> >> > >
> >> > > I wonder what is the efficient way to get the matched fields. Would you please offer some help? Thank you so much!
> >> > >
> >> > > Best regards,
> >> > > Yichen Sun
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> > For additional commands, e-mail: dev-h...@lucene.apache.org