I think it's a matter of trade-offs. For example, faceting requires complete evaluation, and since this field matching is a kind of aggregation, I think it's OK if that's how it works. Users can choose which technique to apply based on their use case.
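For the top-k case, a rough sketch of what this could look like in user code follows. It is only an illustration under assumptions: the class and method names are made up, it targets the Lucene 9.x API, and it assumes the queried fields are indexed with positions so that term queries can report per-field matches. It runs the search first and then calls matches() only for the returned hits, so it is not a Collector.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Matches;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.Weight;

// Hypothetical helper, not part of Lucene.
public final class TopKMatchedFields {

  private TopKMatchedFields() {}

  /** Maps each top-k hit (global doc id) to the fields reported by Matches. */
  public static Map<Integer, List<String>> matchedFieldsForTopHits(
      IndexSearcher searcher, Query query, int k) throws IOException {
    // First pass: ordinary ranked search, limited to the top k hits.
    TopDocs top = searcher.search(query, k);

    // Scores are not needed when asking for matches.
    Weight weight =
        searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
    List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();

    Map<Integer, List<String>> result = new LinkedHashMap<>();
    for (ScoreDoc hit : top.scoreDocs) {
      // Locate the segment that holds this hit; matches() expects a segment-local doc id.
      LeafReaderContext ctx = leaves.get(ReaderUtil.subIndex(hit.doc, leaves));
      Matches matches = weight.matches(ctx, hit.doc - ctx.docBase);

      List<String> fields = new ArrayList<>();
      if (matches != null) { // null means the document does not match
        for (String field : matches) {
          fields.add(field);
        }
      }
      result.put(hit.doc, fields);
    }
    return result;
  }
}
```

Since matches() keeps no state between calls, the cost grows with k, which is why this only makes sense for a bounded number of hits.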
Anyway, I don't think we must introduce this kind of collector in Lucene; it's definitely something someone can write in his/her own project.

Shai

On Tue, Jun 28, 2022 at 4:09 PM Alan Woodward <[email protected]> wrote:

> I think it depends on what information we actually want to get here. If
> it’s just finding which fields matched in which document, then running
> Matches over the top-k results is fine. If you want to get some kind of
> aggregate data, as in you want to get a list of fields that matched in
> *any* document (or conversely, a list of fields that *didn’t* match -
> useful if you want to prune your schema, for example), then Matches will be
> too slow. But at the same time, queries are designed to tell you which
> *documents* match efficiently, and they are allowed to advance their
> sub-queries lazily or indeed not at all if the result isn’t needed for
> scoring. So we don’t really have any way of finding this kind of
> information via a collector that is accurate and performs reasonably.
>
> It *might* be possible to rework Matches so that they act more like an
> iterator and maintain their state within a segment, but there hasn’t been a
> pressing need for that so far.
>
> On 27 Jun 2022, at 12:46, Shai Erera <[email protected]> wrote:
>
> Thanks Alan, yeah I guess I was thinking about the use case I described,
> which involves (usually) simple term queries, but you're definitely right
> about complex boolean clauses as well as non-term queries.
>
> I think the case for highlighter is different though? I mean you usually
> generate highlights only for the top-K results and therefore are probably
> less affected by whether the matches() API is slower than a Collector. And
> if you invoke the API for every document in the index, it might be much
> slower (depending on the index size) than the Collector.
>
> Maybe a hybrid approach which runs the query and caches the docs in a
> DocIdSet (like FacetsCollector does) and then invokes the matches() API
> only on those hits, will let you enjoy the best of both worlds? Assuming
> though that the number of matching documents is not huge.
>
> So it seems there are several options and one should choose based on their
> use case. Do you see an advantage for Lucene to offer a Collector for this
> use case? Or should we tell users to use the matches API?
>
> Shai
>
> On Mon, Jun 27, 2022 at 2:22 PM Dawid Weiss <[email protected]> wrote:
>
>> A side note - I've been using a highlighter based on matches API for
>> quite some time now and it's been fantastic. Very precise and handles
>> non-trivial queries (interval queries) very well.
>>
>>
>> https://lucene.apache.org/core/9_2_0/highlighter/org/apache/lucene/search/matchhighlight/package-summary.html
>>
>> Dawid
>>
>> On Mon, Jun 27, 2022 at 1:10 PM Alan Woodward <[email protected]>
>> wrote:
>> >
>> > Your approach is almost certainly more efficient, but it might give you
>> > false matches in some cases - for example, if you have a complex query
>> > with many nested MUST and SHOULD clauses, you can have a leaf TermScorer
>> > that is positioned on the correct document, but which is part of a
>> > clause that doesn’t actually match. It also only works for term queries,
>> > so it won’t match phrases or span/interval groups. And Matches will work
>> > on points or docvalues queries as well. The reason I added Matches in
>> > the first place was precisely to handle these weird corner cases - I had
>> > written highlighters which more or less did the same thing you describe
>> > with a Collector and the Scorable tree, and I would occasionally get bad
>> > highlights back.
>> >
>> > On 27 Jun 2022, at 10:51, Shai Erera <[email protected]> wrote:
>> >
>> > Out of curiosity and for education purposes, is the Collector approach
>> > I proposed wrong/inefficient? Or less efficient than the matches() API?
>> >
>> > I'm thinking, if you want to both match/rank documents and as a side
>> > effect know which fields matched, the Collector will perform better than
>> > Weight.matches(), but I could be wrong.
>> >
>> > Shai
>> >
>> > On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss <[email protected]>
>> > wrote:
>> >>
>> >> The matches API is awesome. Use it. You can also get a rough glimpse
>> >> into a superset of fields potentially matching the query via:
>> >>
>> >> query.visit(
>> >>     new QueryVisitor() {
>> >>       @Override
>> >>       public boolean acceptField(String field) {
>> >>         affectedFields.add(field);
>> >>         return false;
>> >>       }
>> >>     });
>> >>
>> >>
>> >> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
>> >>
>> >> I'd go with the Matches API though.
>> >>
>> >> Dawid
>> >>
>> >> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward <[email protected]>
>> >> wrote:
>> >> >
>> >> > The Matches API will give you this information - it’s still likely
>> >> > to be fairly slow, but it’s a lot easier to use than trying to parse
>> >> > Explain output.
>> >> >
>> >> > Query q = ….;
>> >> > Weight w = searcher.createWeight(searcher.rewrite(q),
>> >> >     ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>> >> >
>> >> > // context is the segment holding the document; doc is segment-local
>> >> > Matches m = w.matches(context, doc);
>> >> > List<String> matchingFields = new ArrayList<>();
>> >> > if (m != null) { // null means the document does not match
>> >> >   for (String field : m) {
>> >> >     matchingFields.add(field);
>> >> >   }
>> >> > }
>> >> >
>> >> > Bear in mind that `matches` doesn’t maintain any state between
>> >> > calls, so calling it for every matching document is likely to be slow;
>> >> > for those cases Shai’s suggestion of using a Collector and examining
>> >> > low-level scorers will perform better, but it won’t work for every
>> >> > query type.
>> >> >
>> >> >
>> >> > > On 25 Jun 2022, at 04:14, Yichen Sun <[email protected]> wrote:
>> >> > >
>> >> > > Hello!
>> >> > >
>> >> > > I’m an MSCS student from BU and learning to use Lucene. Recently I
>> >> > > have been trying to output the fields matched by a query.
>> >> > > For example, for one document,
>> >> > > there are 10 fields and 2 of them match the query. I want to get the
>> >> > > names of these fields.
>> >> > >
>> >> > > I have tried using the explain() method and then extracting the
>> >> > > fields from the description with a regex. However, it takes too
>> >> > > much time.
>> >> > >
>> >> > > I wonder what an efficient way to get the matched fields is.
>> >> > > Would you please offer some help? Thank you so much!
>> >> > >
>> >> > > Best regards,
>> >> > > Yichen Sun
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: [email protected]
>> >> > For additional commands, e-mail: [email protected]
>> >> >
