Hi Adrien,
maybe it has changed a bit, but the last time I looked into it, it was
wrapping all queries in a wrapper called "NamedQuery" or similar. When it
collected hits, a wrapper somewhere around weight/scorer/DISI detected
the match and set a flag that the query was a hit. It could be
that this bit is only set when the document goes into the top docs, but in
general the work was done in the collection phase.
I use this feature quite often, also when scanning results, and it is
about as fast as without named queries (at least for my queries; maybe the
result scanning and data transfer took longer than the overhead).
Uwe
P.S.: We at PANGAEA use the feature to implement our "OAI-PMH sets"
(Open Archives Initiative Protocol for Metadata Harvesting, a standard API
used in the library world). This is for data centers harvesting our
metadata, and all the delivered results dynamically get their assigned sets
tagged (represented as queries). All those set queries are added as named
SHOULD clauses to the main query, and for each result it returns which
sets a PANGAEA dataset belongs to (as this is required by the protocol).
On 27.06.2022 at 13:48, Adrien Grand wrote:
Uwe,
Elasticsearch's named queries do not actually use a collector. After
the top hits have been evaluated for the whole query, the named queries
are evaluated independently on each of the top hits. It's probably faster
than the
collector approach since it doesn't add per-document overhead to
collection, but also less flexible since it cannot compute statistics
across all matches.
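To make the difference concrete, here is a toy model of the per-hit
evaluation described above. It is only a sketch and does not use any
Lucene or Elasticsearch classes: documents are plain field-to-value maps,
named clauses are predicates, and the class NamedQuerySketch and the method
matchedNames are hypothetical names made up for illustration. The point is
just that the named clauses are checked on the top hits only, after the
main query has already run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Toy model only: no Lucene or Elasticsearch classes are involved. */
public class NamedQuerySketch {

    /**
     * For each top hit, returns the names of the clauses that match it.
     * Hypothetical helper; docs are addressed by their index in the list.
     */
    static Map<Integer, List<String>> matchedNames(
            List<Map<String, String>> docs,
            List<Integer> topHits,
            Map<String, Predicate<Map<String, String>>> namedClauses) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (int docId : topHits) {
            List<String> names = new ArrayList<>();
            // Each named clause is evaluated independently on this one hit.
            for (Map.Entry<String, Predicate<Map<String, String>>> e
                    : namedClauses.entrySet()) {
                if (e.getValue().test(docs.get(docId))) {
                    names.add(e.getKey());
                }
            }
            result.put(docId, names);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
                Map.of("title", "lucene", "body", "search"),
                Map.of("title", "solr", "body", "lucene search"),
                Map.of("title", "other", "body", "nothing"));

        Map<String, Predicate<Map<String, String>>> named = new LinkedHashMap<>();
        named.put("in_title", d -> d.getOrDefault("title", "").contains("lucene"));
        named.put("in_body", d -> d.getOrDefault("body", "").contains("lucene"));

        // Pretend the main query already ran and returned docs 0 and 1.
        System.out.println(matchedNames(docs, List.of(0, 1), named));
        // prints {0=[in_title], 1=[in_body]}
    }
}
```

The work here grows with (number of top hits) x (number of named clauses)
and does not depend on how many documents matched overall, which is also
why this style cannot produce statistics across all matches.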
On Mon, Jun 27, 2022 at 12:01 PM Uwe Schindler <[email protected]> wrote:
I think the collector approach is perfectly fine for
mass-processing of queries.
By the way: Elasticsearch/OpenSearch already have a built-in
feature that works on top of the collector API in a way similar to
what you mentioned (as far as I remember). It is a bit different
in that you can tag any clause in a BQ (so any query) with a "name"
(they call it "named query",
https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries).
When you get the search results, each hit tells you which
named queries matched it. The actual implementation
is a wrapper query around each of those clauses that carries the
name. During hit collection it just collects all named query
instances found in the query tree. I think that in their
implementation the wrapper query's scorer adds the name to some
global state.
Uwe
On 27.06.2022 at 11:51, Shai Erera wrote:
Out of curiosity and for education purposes, is the Collector
approach I proposed wrong/inefficient? Or less efficient than the
matches() API?
I'm thinking, if you want to both match/rank documents and as a
side effect know which fields matched, the Collector will perform
better than Weight.matches(), but I could be wrong.
Shai
On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss
<[email protected]> wrote:
The matches API is awesome. Use it. You can also get a rough glimpse
into a superset of fields potentially matching the query via:
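Set<String> affectedFields = new HashSet<>();
query.visit(
    new QueryVisitor() {
      @Override
      public boolean acceptField(String field) {
        // Record every field the query tree asks about.
        affectedFields.add(field);
        // false: we only need the field name, not the query's terms.
        return false;
      }
    });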
https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
I'd go with the Matches API though.
Dawid
On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward
<[email protected]> wrote:
>
> The Matches API will give you this information - it’s still likely to
> be fairly slow, but it’s a lot easier to use than trying to parse
> Explain output.
>
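> Query query = …;
> Weight w = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>
> // context is the LeafReaderContext holding the hit; doc is relative to that leaf
> Matches m = w.matches(context, doc);
> List<String> matchingFields = new ArrayList<>();
> if (m != null) { // matches() returns null if the document does not match
>   for (String field : m) {
>     matchingFields.add(field);
>   }
> }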
>
> Bear in mind that `matches` doesn’t maintain any state between calls,
> so calling it for every matching document is likely to be slow; for
> those cases Shai’s suggestion of using a Collector and examining
> low-level scorers will perform better, but it won’t work for every
> query type.
>
>
> > On 25 Jun 2022, at 04:14, Yichen Sun <[email protected]> wrote:
> >
> > Hello!
> >
> > I’m an MSCS student at BU and am learning to use Lucene. Recently I
> > have been trying to output the fields matched by a query. For
> > example, a document has 10 fields and 2 of them match the query; I
> > want to get the names of those fields.
> >
> > I have tried using the explain() method and parsing the description
> > with a regex. However, it takes too much time.
> >
> > I wonder what the efficient way to get the matched fields is. Would
> > you please offer some help? Thank you so much!
> >
> > Best regards,
> > Yichen Sun
>
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:[email protected]
--
Adrien