Hi Adrien,

Maybe it has changed a bit, but last time I looked into it, it was wrapping all queries in a wrapper ("NamedQuery" or similar). While collecting hits, a wrapper somewhere around Weight/Scorer/DISI could detect that the query matched and set a flag for it. It could be that this bit is only set when the document goes into the TopDocs, but in general the work was done during the collection phase.

I use this feature quite often, also when scanning through results, and it is about as fast as without named queries (at least for my queries; maybe the result scanning and data transfer took longer than the overhead).

Uwe

P.S.: We at PANGAEA use the feature to implement our "OAI-PMH sets" (Open Archives Initiative Protocol for Metadata Harvesting, a standard API used in the library world). This is for data centers harvesting our metadata: all delivered results are dynamically tagged with the sets they belong to (the sets are represented as queries). All those set queries are added as named SHOULD clauses to the main query, and for each result it returns which sets a PANGAEA dataset belongs to (as this is required by the protocol).

On 27.06.2022 at 13:48, Adrien Grand wrote:
Uwe,

Elasticsearch's named queries do not actually use a collector. After the top hits have been evaluated for the whole query, the named queries are evaluated independently on each of the top hits. It's probably faster than the collector approach since it doesn't add per-document overhead to collection, but also less flexible since it cannot compute statistics across all matches.
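
[Editor's sketch] The trade-off described above can be illustrated with a toy model in plain Java. This is not Lucene or Elasticsearch code: the class NamedQuerySketch, its helper methods, and the sample documents are all hypothetical, documents are plain maps, and "named clauses" are simple predicates. It only contrasts checking named clauses for every match during collection (which allows statistics across all matches) with checking them afterwards on the top hits only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Toy model only: hypothetical names, no Lucene or Elasticsearch APIs involved.
final class NamedQuerySketch {

    // A toy document maps field name -> text; its ID is its index in the list.
    static Map<String, String> doc(String title, String body) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("title", title);
        d.put("body", body);
        return d;
    }

    static List<Map<String, String>> sampleDocs() {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("lucene matches api", "fast search"));
        docs.add(doc("collector basics", "lucene collector overhead"));
        docs.add(doc("cooking", "pasta recipes"));
        docs.add(doc("search tips", "lucene"));
        return docs;
    }

    // A named clause is a name plus a predicate; the main query is the OR of all clauses.
    static Map<String, Predicate<Map<String, String>>> sampleClauses() {
        Map<String, Predicate<Map<String, String>>> clauses = new LinkedHashMap<>();
        clauses.put("title", d -> d.getOrDefault("title", "").contains("lucene"));
        clauses.put("body", d -> d.getOrDefault("body", "").contains("lucene"));
        return clauses;
    }

    // Collector style: every document is tested against every named clause while
    // collecting, so per-clause statistics over ALL matches are available.
    static Map<String, Integer> countDuringCollection(
            List<Map<String, String>> docs,
            Map<String, Predicate<Map<String, String>>> clauses) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> d : docs) {
            for (Map.Entry<String, Predicate<Map<String, String>>> c : clauses.entrySet()) {
                if (c.getValue().test(d)) {
                    counts.merge(c.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Post-hoc style: named clauses are tested only on the already selected top
    // hits, so collection itself carries no extra per-document work.
    static Map<Integer, List<String>> namesForTopHits(
            List<Map<String, String>> docs,
            List<Integer> topHitIds,
            Map<String, Predicate<Map<String, String>>> clauses) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (int id : topHitIds) {
            List<String> names = new ArrayList<>();
            for (Map.Entry<String, Predicate<Map<String, String>>> c : clauses.entrySet()) {
                if (c.getValue().test(docs.get(id))) {
                    names.add(c.getKey());
                }
            }
            result.put(id, names);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countDuringCollection(sampleDocs(), sampleClauses()));
        System.out.println(namesForTopHits(sampleDocs(), List.of(0, 3), sampleClauses()));
    }
}
```

In this toy model the second method touches only as many documents as there are top hits, while the first one touches every document but can report how many matches each named clause had overall.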

On Mon, Jun 27, 2022 at 12:01 PM Uwe Schindler <[email protected]> wrote:

    I think the collector approach is perfectly fine for
    mass-processing of queries.

    By the way: Elasticsearch/OpenSearch already have a built-in
    feature that works on top of the collector API in a way similar to
    what you described (as far as I remember). It is a bit different
    in that you can tag any clause in a BooleanQuery (so every query)
    with a "name" (they call it "named query",
    
https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries).
    When you get the search results, each hit tells you which
    named queries matched it. The actual implementation
    is a wrapper query around each of those clauses that carries the
    name. During hit collection it just collects all named query
    instances found in the query tree. I think that in their
    implementation the wrapper query's scorer somehow adds the name to
    some global state.

    Uwe

    On 27.06.2022 at 11:51, Shai Erera wrote:
    Out of curiosity and for education purposes, is the Collector
    approach I proposed wrong/inefficient? Or less efficient than the
    matches() API?

    I'm thinking, if you want to both match/rank documents and as a
    side effect know which fields matched, the Collector will perform
    better than Weight.matches(), but I could be wrong.

    Shai

    On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss
    <[email protected]> wrote:

        The matches API is awesome. Use it. You can also get a rough
        glimpse into a superset of fields potentially matching the
        query via:

            query.visit(
                new QueryVisitor() {
                  @Override
                  public boolean acceptField(String field) {
                    affectedFields.add(field);
                    return false;
                  }
                });

        
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)

        I'd go with the Matches API though.

        Dawid

        On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward
        <[email protected]> wrote:
        >
        > The Matches API will give you this information - it’s still
        likely to be fairly slow, but it’s a lot easier to use than
        trying to parse Explain output.
        >
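        > Query q = ….;
        > Weight w = searcher.createWeight(searcher.rewrite(q), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
        >
        > // context is the LeafReaderContext of the segment holding the hit;
        > // doc is the segment-relative doc ID (global ID minus context.docBase).
        > Matches m = w.matches(context, doc);
        > List<String> matchingFields = new ArrayList<>();
        > if (m != null) { // matches() returns null if the document does not match
        >   for (String field : m) {
        >     matchingFields.add(field);
        >   }
        > }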
        >
        > Bear in mind that `matches` doesn’t maintain any state
        between calls, so calling it for every matching document is
        likely to be slow; for those cases Shai’s suggestion of using
        a Collector and examining low-level scorers will perform
        better, but it won’t work for every query type.
        >
        >
        > > On 25 Jun 2022, at 04:14, Yichen Sun <[email protected]> wrote:
        > >
        > > Hello!
        > >
        > > I’m an MSCS student at BU and I am learning to use Lucene.
        Recently I have been trying to output the fields matched by a
        query. For example, for one document, there are 10 fields and
        2 of them match the query. I want to get the names of these fields.
        > >
        > > I have tried calling the explain() method and running a
        regex over the description. However, this takes a lot of time.
        > >
        > > I wonder what an efficient way to get the matched fields
        is. Could you please offer some help? Thank you so much!
        > >
        > > Best regards,
        > > Yichen Sun
        >
        >
        >
        ---------------------------------------------------------------------
        > To unsubscribe, e-mail: [email protected]
        > For additional commands, e-mail: [email protected]
        >


-- Uwe Schindler
    Achterdiek 19, D-28357 Bremen
    https://www.thetaphi.de
    eMail:[email protected]



--
Adrien

