Thanks Uwe, I didn't know about named queries, but they seem useful. Is there interest in getting similar functionality into Lucene, or perhaps just the FieldMatching collector? I'd be happy to open a PR for it.

As for the use case, I was thinking of using something similar to this collector for a (simple) entity recognition task. If you have a corpus of documents with many fields that denote product attributes, you could match a word like "Red" against the various product attribute fields and determine, based on the matching fields and their doc counts, whether this word more likely represents a Color or a Brand entity (hint: it matches both, the question is which is more probable). I'm sure there are other ways to achieve this, and probably much smarter NER implementations, but this one is at least based on the actual data that you index, which guarantees something about the results you will receive when applying a certain attribute filter.

Shai

On Mon, Jun 27, 2022 at 1:01 PM Uwe Schindler <[email protected]> wrote:

> I think the collector approach is perfectly fine for mass-processing of
> queries.
>
> By the way: Elasticsearch/OpenSearch already have a built-in feature that
> works on top of the Collector API in a way similar to what you described
> (as far as I remember). It is a bit different, as you can tag any clause in
> a BQ (so every query) using a "name" (they call it "named query",
> https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries).
> When you get the search results, for each hit it tells you which named
> queries were a match on the hit. The actual implementation is a wrapper
> query around each of those clauses that contains the name. During hit
> collection it just collects all named query instances found in the query
> tree. I think that in their implementation the wrapper query's scorer
> somehow adds the name to some global state.
>
> Uwe
> On 27.06.2022 at 11:51, Shai Erera wrote:
>
> Out of curiosity and for education purposes, is the Collector approach I
> proposed wrong/inefficient? Or less efficient than the matches() API?
>
> I'm thinking, if you want to both match/rank documents and as a side
> effect know which fields matched, the Collector will perform better than
> Weight.matches(), but I could be wrong.
>
> Shai
>
> On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss <[email protected]>
> wrote:
>
>> The matches API is awesome. Use it. You can also get a rough glimpse
>> into a superset of fields potentially matching the query via:
>>
>> query.visit(
>>     new QueryVisitor() {
>>       @Override
>>       public boolean acceptField(String field) {
>>         affectedFields.add(field);
>>         return false;
>>       }
>>     });
>>
>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
>>
>> I'd go with the Matches API though.
>>
>> Dawid
>>
>> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward <[email protected]>
>> wrote:
>> >
>> > The Matches API will give you this information - it’s still likely to
>> > be fairly slow, but it’s a lot easier to use than trying to parse Explain
>> > output.
>> >
>> > Query query = ….;
>> > Weight w = searcher.createWeight(searcher.rewrite(query),
>> >     ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>> >
>> > Matches m = w.matches(context, doc);
>> > List<String> matchingFields = new ArrayList<>();
>> > for (String field : m) {
>> >   matchingFields.add(field);
>> > }
>> >
>> > Bear in mind that `matches` doesn’t maintain any state between calls,
>> > so calling it for every matching document is likely to be slow; for those
>> > cases Shai’s suggestion of using a Collector and examining low-level
>> > scorers will perform better, but it won’t work for every query type.
>> >
>> >
>> > > On 25 Jun 2022, at 04:14, Yichen Sun <[email protected]> wrote:
>> > >
>> > > Hello!
>> > >
>> > > I’m an MSCS student from BU and learning to use Lucene. Recently I have
>> > > been trying to output the fields matched by a query. For example, one
>> > > document has 10 fields and 2 of them match the query. I want to get the
>> > > names of these fields.
>> > >
>> > > I have tried using the explain() method and parsing the description
>> > > with a regex. However, that takes too much time.
>> > >
>> > > I wonder what an efficient way to get the matched fields is. Would
>> > > you please offer some help? Thank you so much!
>> > >
>> > > Best regards,
>> > > Yichen Sun
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [email protected]
>> > For additional commands, e-mail: [email protected]
>> >
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: [email protected]
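
P.S. To make the entity-recognition use case from the top of this message a bit more concrete, here is a minimal sketch of the decision step only. It assumes the per-field doc counts for a term have already been gathered somehow (by a collector, or by one count query per field). `EntityGuess`, `fieldShares` and `mostLikelyField` are hypothetical names of mine, not Lucene APIs, and plain maps stand in for whatever the collector would produce.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: picks the attribute field a term
// most likely belongs to, given how many docs matched the term in each field.
final class EntityGuess {
  private EntityGuess() {}

  // Returns each field's share of the total doc count (count / sum of counts).
  // Returns an empty map when no field matched at all.
  static Map<String, Double> fieldShares(Map<String, Integer> docCountsByField) {
    long total = 0;
    for (int count : docCountsByField.values()) {
      total += count;
    }
    Map<String, Double> shares = new LinkedHashMap<>();
    if (total == 0) {
      return shares;
    }
    for (Map.Entry<String, Integer> e : docCountsByField.entrySet()) {
      shares.put(e.getKey(), e.getValue() / (double) total);
    }
    return shares;
  }

  // Returns the field with the highest doc count, or null if nothing matched.
  // On a tie, the field encountered first in the map's iteration order wins.
  static String mostLikelyField(Map<String, Integer> docCountsByField) {
    String best = null;
    int bestCount = 0;
    for (Map.Entry<String, Integer> e : docCountsByField.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }
}
```

With made-up counts of 950 docs matching "Red" in a `color` field and 50 in a `brand` field, `mostLikelyField` picks `color`, and `fieldShares` gives 0.95 and 0.05. Raw doc-count shares are only a crude estimate; a real implementation would probably want to normalize by how many docs have each field at all.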
