Re: Finding out which fields matched the query

Alan Woodward Tue, 28 Jun 2022 06:09:05 -0700

I think it depends on what information we actually want to get here.  If it’s 
just finding which fields matched in which document, then running Matches over 
the top-k results is fine.  If you want some kind of aggregate data, such as a 
list of fields that matched in *any* document (or conversely, a list of fields 
that *didn’t* match - useful if you want to prune your schema, for example), 
then Matches will be too slow, because it has to be called once per matching 
document and keeps no state between calls.  At the same time, queries are 
designed to tell you efficiently which *documents* match, and they are allowed 
to advance their sub-queries lazily, or indeed not at all, if the result isn’t 
needed for scoring.  So we don’t really have any way of getting this kind of 
information via a collector that is both accurate and performs reasonably.
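
To make the laziness point concrete, here is a toy sketch in plain Java.  It 
is not Lucene code: `Clause`, `fieldsSeenByLazyOr` and 
`fieldsSeenExhaustively` are hypothetical names invented for illustration, and 
a "document" is just a map from field name to text.  It only shows why an 
observer watching a short-circuiting disjunction can under-report the fields 
that match.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these types are Lucene classes.
class LazyDisjunctionSketch {

    /** Hypothetical clause: "field contains term", checked against a toy document. */
    static final class Clause {
        final String field;
        final String term;

        Clause(String field, String term) {
            this.field = field;
            this.term = term;
        }

        boolean matches(Map<String, String> doc) {
            String value = doc.get(field);
            return value != null && Arrays.asList(value.split(" ")).contains(term);
        }
    }

    /** Lazy OR: stops at the first matching clause, which is enough to accept the document. */
    static Set<String> fieldsSeenByLazyOr(List<Clause> clauses, Map<String, String> doc) {
        Set<String> fields = new TreeSet<>();
        for (Clause clause : clauses) {
            if (clause.matches(doc)) {
                fields.add(clause.field);
                break; // the document is accepted; later clauses are never evaluated
            }
        }
        return fields;
    }

    /** Exhaustive OR: evaluates every clause, which is what a per-field report requires. */
    static Set<String> fieldsSeenExhaustively(List<Clause> clauses, Map<String, String> doc) {
        Set<String> fields = new TreeSet<>();
        for (Clause clause : clauses) {
            if (clause.matches(doc)) {
                fields.add(clause.field);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "lucene in action");
        doc.put("body", "lucene matches api");

        List<Clause> query = Arrays.asList(
            new Clause("title", "lucene"),
            new Clause("body", "lucene"));

        System.out.println("lazy:       " + fieldsSeenByLazyOr(query, doc));
        System.out.println("exhaustive: " + fieldsSeenExhaustively(query, doc));
    }
}
```

Both clauses match the toy document, but the lazy version reports only 
`title` because it never looks at `body`.  The exhaustive version reports both 
fields at the cost of evaluating every clause for every document, which is the 
trade-off described above.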


It *might* be possible to rework Matches so that they act more like an iterator 
and maintain their state within a segment, but there hasn’t been a pressing 
need for that so far.

> On 27 Jun 2022, at 12:46, Shai Erera <ser...@gmail.com> wrote:
> 
> Thanks Alan, yeah I guess I was thinking about the use case I described, which 
> involves (usually) simple term queries, but you're definitely right about 
> complex boolean clauses as well as non-term queries.
> 
> I think the case for highlighter is different though? I mean you usually 
> generate highlights only for the top-K results and therefore are probably 
> less affected by whether the matches() API is slower than a Collector. And if 
> you invoke the API for every document in the index, it might be much slower 
> (depending on the index size) than the Collector.
> 
> Maybe a hybrid approach which runs the query and caches the docs in a 
> DocIdSet (like FacetsCollector does) and then invokes the matches() API only 
> on those hits, will let you enjoy the best of both worlds? Assuming though 
> that the number of matching documents is not huge.
> 
> So it seems there are several options and one should choose based on their 
> use case. Do you see an advantage for Lucene to offer a Collector for this 
> use case? Or should we tell users to use the matches API?
> 
> Shai
> 
> On Mon, Jun 27, 2022 at 2:22 PM Dawid Weiss <dawid.we...@gmail.com> wrote:
> A side note - I've been using a highlighter based on matches API for
> quite some time now and it's been fantastic. Very precise and handles
> non-trivial queries (interval queries) very well.
> 
> https://lucene.apache.org/core/9_2_0/highlighter/org/apache/lucene/search/matchhighlight/package-summary.html
> 
> Dawid
> 
> On Mon, Jun 27, 2022 at 1:10 PM Alan Woodward <romseyg...@gmail.com> wrote:
> >
> > Your approach is almost certainly more efficient, but it might give you 
> > false matches in some cases - for example, if you have a complex query with 
> > many nested MUST and SHOULD clauses, you can have a leaf TermScorer that is 
> > positioned on the correct document, but which is part of a clause that 
> > doesn’t actually match.  It also only works for term queries, so it won’t 
> > match phrases or span/interval groups.  And Matches will work on points or 
> > docvalues queries as well.  The reason I added Matches in the first place 
> > was precisely to handle these weird corner cases - I had written 
> > highlighters which more or less did the same thing you describe with a 
> > Collector and the Scorable tree, and I would occasionally get bad 
> > highlights back.
> >
> > On 27 Jun 2022, at 10:51, Shai Erera <ser...@gmail.com> wrote:
> >
> > Out of curiosity and for education purposes, is the Collector approach I 
> > proposed wrong/inefficient? Or less efficient than the matches() API?
> >
> > I'm thinking, if you want to both match/rank documents and as a side effect 
> > know which fields matched, the Collector will perform better than 
> > Weight.matches(), but I could be wrong.
> >
> > Shai
> >
> > On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
> >>
> >> The matches API is awesome. Use it. You can also get a rough glimpse
> >> into a superset of fields potentially matching the query via:
> >>
> >>     query.visit(
> >>         new QueryVisitor() {
> >>           @Override
> >>           public boolean acceptField(String field) {
> >>             affectedFields.add(field);
> >>             return false;
> >>           }
> >>         });
> >>
> >> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
> >>
> >> I'd go with the Matches API though.
> >>
> >> Dawid
> >>
> >> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward <romseyg...@gmail.com> wrote:
> >> >
> >> > The Matches API will give you this information - it’s still likely to be 
> >> > fairly slow, but it’s a lot easier to use than trying to parse Explain 
> >> > output.
> >> >
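> >> > Query query = ….;
> >> > Weight w = searcher.createWeight(searcher.rewrite(query), 
> >> > ScoreMode.COMPLETE_NO_SCORES, 1.0f);
> >> >
> >> > // context is the segment's LeafReaderContext; doc is relative to it
> >> > Matches m = w.matches(context, doc);
> >> > List<String> matchingFields = new ArrayList<>();
> >> > if (m != null) { // matches() returns null if the document doesn't match
> >> >  for (String field : m) {
> >> >   matchingFields.add(field);
> >> >  }
> >> > }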
> >> >
> >> > Bear in mind that `matches` doesn’t maintain any state between calls, so 
> >> > calling it for every matching document is likely to be slow; for those 
> >> > cases Shai’s suggestion of using a Collector and examining low-level 
> >> > scorers will perform better, but it won’t work for every query type.
> >> >
> >> >
> >> > > On 25 Jun 2022, at 04:14, Yichen Sun <yiche...@bu.edu> wrote:
> >> > >
> >> > > Hello!
> >> > >
> >> > > I’m an MSCS student at BU and am learning to use Lucene. Recently I have 
> >> > > been trying to output the fields matched by a query. For example, if a 
> >> > > document has 10 fields and 2 of them match the query, I want to get the 
> >> > > names of those 2 fields.
> >> > >
> >> > > I have tried calling the explain() method and parsing its description 
> >> > > with a regex, but that takes too much time.
> >> > >
> >> > > Is there an efficient way to get the matched fields? Would you please 
> >> > > offer some help? Thank you so much!
> >> > >
> >> > > Best regards,
> >> > > Yichen Sun
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org 
> >> > <mailto:dev-unsubscr...@lucene.apache.org>
> >> > For additional commands, e-mail: dev-h...@lucene.apache.org 
> >> > <mailto:dev-h...@lucene.apache.org>
> >> >
> >>
> >>
> >
> 
