Re: Finding out which fields matched the query

2022-06-27 Thread Walter Underwood
For a quick hack, you can use highlighting. That does more than you want, 
showing which words match, but it does have the info. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 27, 2022, at 3:23 AM, Shai Erera  wrote:
> 
> Thanks Uwe, I didn't know about named queries, but it seems useful. Is there 
> interest in getting similar functionality in Lucene, or perhaps just the 
> FieldMatching collector? I'd be happy to PR-it.
> 
> As for usecase, I was thinking of using something similar to this collector 
> for some kind of (simple) entity recognition task. If you have a corpus of 
> documents with many fields which denote product attributes, you could match a 
> word like "Red" to the various product attribute fields and determine based 
> on the matching fields + their doc count whether this word likely represents 
> a Color or Brand entity (hint: it matches both, the question is which is more 
> probable).
> 
> I'm sure there are other ways to achieve this, and probably much smarter NER 
> implementations, but this one is at least based on the actual data that you 
> index which guarantees something about the results you will receive if 
> applying a certain attribute filtering.
> 
> Shai
> 
> On Mon, Jun 27, 2022 at 1:01 PM Uwe Schindler wrote:
> I think the collector approach is perfectly fine for mass-processing of 
> queries.
> 
> By the way: Elasticsearch/OpenSearch have a feature already built in, and it 
> works on top of the collector API in a way similar to what you mentioned (as 
> far as I remember). It is a bit different, as you can tag any clause in a BQ 
> (so every query) using a "name" (they call it "named query", 
> https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries).
> When you get the search results, for each hit it tells you which named 
> queries were a match on the hit. The actual implementation is some wrapper 
> query on each of those clauses that contains the name. In hit collection it 
> just collects all named query instances found in the query tree. I think in 
> their implementation the wrapper query's scorer impl somehow adds the name 
> to some global state.
> 
> Uwe
> 
> Am 27.06.2022 um 11:51 schrieb Shai Erera:
>> Out of curiosity and for education purposes, is the Collector approach I 
>> proposed wrong/inefficient? Or less efficient than the matches() API?
>> 
>> I'm thinking, if you want to both match/rank documents and as a side effect 
>> know which fields matched, the Collector will perform better than 
>> Weight.matches(), but I could be wrong.
>> 
>> Shai
>> 
>> On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss wrote:
>> The matches API is awesome. Use it. You can also get a rough glimpse
>> into a superset of fields potentially matching the query via:
>> 
>> query.visit(
>>     new QueryVisitor() {
>>       @Override
>>       public boolean acceptField(String field) {
>>         affectedFields.add(field);
>>         return false;
>>       }
>>     });
>> 
>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
>> 
>> I'd go with the Matches API though.
>> 
>> Dawid
>> 
>> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward wrote:
>> >
>> > The Matches API will give you this information - it’s still likely to be 
>> > fairly slow, but it’s a lot easier to use than trying to parse Explain 
>> > output.
>> >
>> > Query query = ….;
>> > Weight w = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>> >
>> > Matches m = w.matches(context, doc);
>> > List<String> matchingFields = new ArrayList<>();
>> > for (String field : m) {
>> >   matchingFields.add(field);
>> > }
>> >
>> > Bear in mind that `matches` doesn’t maintain any state between calls, so 
>> > calling it for every matching document is likely to be slow; for those 
>> > cases Shai’s suggestion of using a Collector and examining low-level 
>> > scorers will perform better, but it won’t work for every query type.
>> >
>> >
>> > > On 25 Jun 2022, at 04:14, Yichen Sun wrote:
>> > >
>> > > Hello!
>> > >
>> > > I’m a MSCS student from BU and learning to use Lucene. Recently I try to 
>> > > output matched fields by one query. For example, for one document, there 
>> > > are 10 fields and 2 of them match the query. I want to get the name of 
>> > > these fields.
>> > >
>> > > I have tried using explain() method and getting description then regex. 
>> > > However it cost so much time.
>> > >
>> > > I wonder what is the efficient way to get the matched 

Re: Finding out which fields matched the query

2022-06-27 Thread Uwe Schindler

Hi Adrien,

maybe it changed a bit, but last time I looked into it, it was somehow 
wrapping all queries using a wrapper "NamedQuery" or similar. When it 
collected hits, a wrapper somewhere around weight/scorer/DISI was able to 
figure out that the query was a hit and set a flag. It could be 
that this bit is only set when the hit goes into the topdocs, but in general 
the work was done at collection phase.


I use this feature quite often, also when scanning results, and it is 
about as fast as without named queries (at least for my queries - maybe the 
result scanning and data transfer took longer than the overhead).


Uwe

P.S.: We at PANGAEA use the feature to implement our "OAI-PMH sets" 
(Open Archives Initiative Protocol for Metadata Harvesting, a standard API 
used in the library world). This is for datacenters harvesting our metadata, 
and all the delivered results dynamically get their assigned sets tagged 
(represented as queries). All those set queries are added as named 
should queries to the main query, and for each result it returns which 
sets a PANGAEA dataset belongs to (as this is required by the protocol).


Am 27.06.2022 um 13:48 schrieb Adrien Grand:

Uwe,

Elasticsearch's named queries are not using a collector actually. After 
top hits have been evaluated for the whole query, they are evaluated 
independently on each of the top hits. It's probably faster than the 
collector approach since it doesn't add per-document overhead to 
collection, but also less flexible since it cannot compute statistics 
across all matches.


On Mon, Jun 27, 2022 at 12:01 PM Uwe Schindler  wrote:

I think the collector approach is perfectly fine for
mass-processing of queries.

By the way: Elasticsearch/OpenSearch have a feature already
built in, and it works on top of the collector API in a way similar
to what you mentioned (as far as I remember). It is a bit different,
as you can tag any clause in a BQ (so every query) using a "name"
(they call it "named query",
https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries).
When you get the search results, for each hit it tells you which
named queries were a match on the hit. The actual implementation
is some wrapper query on each of those clauses that contains the
name. In hit collection it just collects all named query instances
found in the query tree. I think in their implementation the
wrapper query's scorer impl somehow adds the name to some global state.

Uwe

Am 27.06.2022 um 11:51 schrieb Shai Erera:

Out of curiosity and for education purposes, is the Collector
approach I proposed wrong/inefficient? Or less efficient than the
matches() API?

I'm thinking, if you want to both match/rank documents and as a
side effect know which fields matched, the Collector will perform
better than Weight.matches(), but I could be wrong.

Shai

On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss
 wrote:

The matches API is awesome. Use it. You can also get a rough
glimpse
into a superset of fields potentially matching the query via:

    query.visit(
        new QueryVisitor() {
          @Override
          public boolean acceptField(String field) {
            affectedFields.add(field);
            return false;
          }
        });


https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)

I'd go with the Matches API though.

Dawid

On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward
 wrote:
>
> The Matches API will give you this information - it’s still
likely to be fairly slow, but it’s a lot easier to use than
trying to parse Explain output.
>
> Query query = ….;
> Weight w = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>
> Matches m = w.matches(context, doc);
> List<String> matchingFields = new ArrayList<>();
> for (String field : m) {
>   matchingFields.add(field);
> }
>
> Bear in mind that `matches` doesn’t maintain any state
between calls, so calling it for every matching document is
likely to be slow; for those cases Shai’s suggestion of using
a Collector and examining low-level scorers will perform
better, but it won’t work for every query type.
>
>
> > On 25 Jun 2022, at 04:14, Yichen Sun  wrote:
> >
> > Hello!
> >
> > I’m a MSCS student from BU and learning to use Lucene.
Recently I try to output matched fields by one query. For
example, for one document, there are 10 fields and 2 of them
match the query. I want to get the name of these fields.
> >
> > I have tried using explain() 

Re: Finding out which fields matched the query

2022-06-27 Thread Adrien Grand
Uwe,

Elasticsearch's named queries are not using a collector actually. After top
hits have been evaluated for the whole query, they are evaluated
independently on each of the top hits. It's probably faster than the
collector approach since it doesn't add per-document overhead to
collection, but also less flexible since it cannot compute statistics
across all matches.

On Mon, Jun 27, 2022 at 12:01 PM Uwe Schindler  wrote:

> I think the collector approach is perfectly fine for mass-processing of
> queries.
>
> By the way: Elasticsearch/OpenSearch have a feature already built in, and
> it works on top of the collector API in a way similar to what you mentioned
> (as far as I remember). It is a bit different, as you can tag any clause in
> a BQ (so every query) using a "name" (they call it "named query",
> https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries).
> When you get the search results, for each hit it tells you which named
> queries were a match on the hit. The actual implementation is some wrapper
> query on each of those clauses that contains the name. In hit collection it
> just collects all named query instances found in the query tree. I think in
> their implementation the wrapper query's scorer impl somehow adds the name
> to some global state.
>
> Uwe
> Am 27.06.2022 um 11:51 schrieb Shai Erera:
>
> Out of curiosity and for education purposes, is the Collector approach I
> proposed wrong/inefficient? Or less efficient than the matches() API?
>
> I'm thinking, if you want to both match/rank documents and as a side
> effect know which fields matched, the Collector will perform better than
> Weight.matches(), but I could be wrong.
>
> Shai
>
> On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss 
> wrote:
>
>> The matches API is awesome. Use it. You can also get a rough glimpse
>> into a superset of fields potentially matching the query via:
>>
>> query.visit(
>>     new QueryVisitor() {
>>       @Override
>>       public boolean acceptField(String field) {
>>         affectedFields.add(field);
>>         return false;
>>       }
>>     });
>>
>>
>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
>>
>> I'd go with the Matches API though.
>>
>> Dawid
>>
>> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward 
>> wrote:
>> >
>> > The Matches API will give you this information - it’s still likely to
>> be fairly slow, but it’s a lot easier to use than trying to parse Explain
>> output.
>> >
>> > Query query = ….;
>> > Weight w = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>> >
>> > Matches m = w.matches(context, doc);
>> > List<String> matchingFields = new ArrayList<>();
>> > for (String field : m) {
>> >   matchingFields.add(field);
>> > }
>> >
>> > Bear in mind that `matches` doesn’t maintain any state between calls,
>> so calling it for every matching document is likely to be slow; for those
>> cases Shai’s suggestion of using a Collector and examining low-level
>> scorers will perform better, but it won’t work for every query type.
>> >
>> >
>> > > On 25 Jun 2022, at 04:14, Yichen Sun  wrote:
>> > >
>> > > Hello!
>> > >
>> > > I’m a MSCS student from BU and learning to use Lucene. Recently I try
>> to output matched fields by one query. For example, for one document, there
>> are 10 fields and 2 of them match the query. I want to get the name of
>> these fields.
>> > >
>> > > I have tried using explain() method and getting description then
>> regex. However it cost so much time.
>> > >
>> > > I wonder what is the efficient way to get the matched fields. Would
>> you please offer some help? Thank you so much!
>> > >
>> > > Best regards,
>> > > Yichen Sun
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>>
>> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>

-- 
Adrien


Re: Finding out which fields matched the query

2022-06-27 Thread Shai Erera
Thanks Alan, yeah I guess I was thinking about the usecase I described,
which involves (usually) simple term queries, but you're definitely right
about complex boolean clauses as well as non-term queries.

I think the case for highlighter is different though? I mean you usually
generate highlights only for the top-K results and therefore are probably
less affected by whether the matches() API is slower than a Collector. And
if you invoke the API for every document in the index, it might be much
slower (depending on the index size) than the Collector.

Maybe a hybrid approach which runs the query and caches the docs in a
DocIdSet (like FacetsCollector does) and then invokes the matches() API
only on those hits, will let you enjoy the best of both worlds? Assuming
though that the number of matching documents is not huge.
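
A rough sketch of that hybrid idea, in plain Java rather than against the
Lucene API: a BitSet stands in for the DocIdSet, and the two functional
parameters (queryMatches, fieldsForDoc) are hypothetical stand-ins for
running the query and for calling matches() on a single document. It only
shows the shape of the two passes, not a real Collector.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;
import java.util.function.IntPredicate;

/** Two-phase sketch: collect hits cheaply, then inspect only those hits. */
final class HybridFieldMatching {

  /** Phase 1: record every matching doc ID in a bit set (stand-in for a DocIdSet). */
  static BitSet collectHits(int maxDoc, IntPredicate queryMatches) {
    BitSet hits = new BitSet(maxDoc);
    for (int doc = 0; doc < maxDoc; doc++) {
      if (queryMatches.test(doc)) {
        hits.set(doc);
      }
    }
    return hits;
  }

  /** Phase 2: run the expensive per-document lookup only on the collected hits. */
  static Map<Integer, List<String>> matchedFields(
      BitSet hits, IntFunction<List<String>> fieldsForDoc) {
    Map<Integer, List<String>> result = new LinkedHashMap<>();
    for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
      result.put(doc, fieldsForDoc.apply(doc));
    }
    return result;
  }

  public static void main(String[] args) {
    // Toy index (invented data): doc ID -> fields that contain the word "red".
    List<List<String>> index = List.of(
        List.of("color"), List.of(), List.of("color", "brand"), List.of());
    BitSet hits = collectHits(index.size(), doc -> !index.get(doc).isEmpty());
    System.out.println(matchedFields(hits, index::get));
  }
}
```

The second pass costs one lookup per hit, so this only pays off while the hit
set stays small compared to the index.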

So it seems there are several options and one should choose based on their
usecase. Do you see an advantage for Lucene to offer a Collector for this
usecase? Or should we tell users to use the matches API?

Shai

On Mon, Jun 27, 2022 at 2:22 PM Dawid Weiss  wrote:

> A side note - I've been using a highlighter based on matches API for
> quite some time now and it's been fantastic. Very precise and handles
> non-trivial queries (interval queries) very well.
>
>
> https://lucene.apache.org/core/9_2_0/highlighter/org/apache/lucene/search/matchhighlight/package-summary.html
>
> Dawid
>
> On Mon, Jun 27, 2022 at 1:10 PM Alan Woodward 
> wrote:
> >
> > Your approach is almost certainly more efficient, but it might give you
> false matches in some cases - for example, if you have a complex query with
> many nested MUST and SHOULD clauses, you can have a leaf TermScorer that is
> positioned on the correct document, but which is part of a clause that
> doesn’t actually match.  It also only works for term queries, so it won’t
> match phrases or span/interval groups.  And Matches will work on points or
> docvalues queries as well.  The reason I added Matches in the first place
> was precisely to handle these weird corner cases - I had written
> highlighters which more or less did the same thing you describe with a
> Collector and the Scorable tree, and I would occasionally get bad
> highlights back.
> >
> > On 27 Jun 2022, at 10:51, Shai Erera  wrote:
> >
> > Out of curiosity and for education purposes, is the Collector approach I
> proposed wrong/inefficient? Or less efficient than the matches() API?
> >
> > I'm thinking, if you want to both match/rank documents and as a side
> effect know which fields matched, the Collector will perform better than
> Weight.matches(), but I could be wrong.
> >
> > Shai
> >
> > On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss 
> wrote:
> >>
> >> The matches API is awesome. Use it. You can also get a rough glimpse
> >> into a superset of fields potentially matching the query via:
> >>
> >> query.visit(
> >>     new QueryVisitor() {
> >>       @Override
> >>       public boolean acceptField(String field) {
> >>         affectedFields.add(field);
> >>         return false;
> >>       }
> >>     });
> >>
> >>
> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
> >>
> >> I'd go with the Matches API though.
> >>
> >> Dawid
> >>
> >> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward 
> wrote:
> >> >
> >> > The Matches API will give you this information - it’s still likely to
> be fairly slow, but it’s a lot easier to use than trying to parse Explain
> output.
> >> >
> >> > Query query = ….;
> >> > Weight w = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
> >> >
> >> > Matches m = w.matches(context, doc);
> >> > List<String> matchingFields = new ArrayList<>();
> >> > for (String field : m) {
> >> >   matchingFields.add(field);
> >> > }
> >> >
> >> > Bear in mind that `matches` doesn’t maintain any state between calls,
> so calling it for every matching document is likely to be slow; for those
> cases Shai’s suggestion of using a Collector and examining low-level
> scorers will perform better, but it won’t work for every query type.
> >> >
> >> >
> >> > > On 25 Jun 2022, at 04:14, Yichen Sun  wrote:
> >> > >
> >> > > Hello!
> >> > >
> >> > > I’m a MSCS student from BU and learning to use Lucene. Recently I
> try to output matched fields by one query. For example, for one document,
> there are 10 fields and 2 of them match the query. I want to get the name
> of these fields.
> >> > >
> >> > > I have tried using explain() method and getting description then
> regex. However it cost so much time.
> >> > >
> >> > > I wonder what is the efficient way to get the matched fields. Would
> you please offer some help? Thank you so much!
> >> > >
> >> > > Best regards,
> >> > > Yichen Sun
> >> >
> >> >
> >> > -
> >> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> > For additional 

Re: Plan for GitHub issue metadata management

2022-06-27 Thread Tomoko Uchida
I've prepared issue labels in a test repository. (A test migration is
in progress and not yet completed.)
https://github.com/mocobeta/sandbox-lucene-10557/issues

There are four label families.

- type:abcd (Issue Type)
- fixVersion:x.x.x (Fix Versions) [1]
- affectsVersion:x.x.x  (Affects Versions)
- component:module/ (Components)

If you have any suggestions on label management, please feel free to
redesign it. I have no strong opinion on that and may not be able to
take the time to think deliberately about it.

[1] I first thought of using Milestone for versions, but there were a
few questions about it, so I'd keep the current practice: multiple fix
versions.


2022年6月20日(月) 19:10 Tomoko Uchida :
>
> I haven't used the "project" feature either - maybe it could be an
> option but I can't have an opinion on it. Is there anyone who has
> experience with it and wants to lead us to use it?
>
> Tomoko
>
> 2022年6月20日(月) 18:59 Jens Wille :
> >
> > Hi,
> >
> > I'm just a bystander here. But are you aware that the new projects (beta)
> > includes support for custom fields?
> >
> > 
> >
> > I haven't used them myself yet, but it seems that they might be a viable
> > alternative to modeling everything with labels (which is more of a crutch I
> > suppose).
> >
> > Cheers,
> > Jens
> >
> >
> >




Re: Finding out which fields matched the query

2022-06-27 Thread Dawid Weiss
A side note - I've been using a highlighter based on matches API for
quite some time now and it's been fantastic. Very precise and handles
non-trivial queries (interval queries) very well.

https://lucene.apache.org/core/9_2_0/highlighter/org/apache/lucene/search/matchhighlight/package-summary.html

Dawid

On Mon, Jun 27, 2022 at 1:10 PM Alan Woodward  wrote:
>
> Your approach is almost certainly more efficient, but it might give you false 
> matches in some cases - for example, if you have a complex query with many 
> nested MUST and SHOULD clauses, you can have a leaf TermScorer that is 
> positioned on the correct document, but which is part of a clause that 
> doesn’t actually match.  It also only works for term queries, so it won’t 
> match phrases or span/interval groups.  And Matches will work on points or 
> docvalues queries as well.  The reason I added Matches in the first place was 
> precisely to handle these weird corner cases - I had written highlighters 
> which more or less did the same thing you describe with a Collector and the 
> Scorable tree, and I would occasionally get bad highlights back.
>
> On 27 Jun 2022, at 10:51, Shai Erera  wrote:
>
> Out of curiosity and for education purposes, is the Collector approach I 
> proposed wrong/inefficient? Or less efficient than the matches() API?
>
> I'm thinking, if you want to both match/rank documents and as a side effect 
> know which fields matched, the Collector will perform better than 
> Weight.matches(), but I could be wrong.
>
> Shai
>
> On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss  wrote:
>>
>> The matches API is awesome. Use it. You can also get a rough glimpse
>> into a superset of fields potentially matching the query via:
>>
>> query.visit(
>>     new QueryVisitor() {
>>       @Override
>>       public boolean acceptField(String field) {
>>         affectedFields.add(field);
>>         return false;
>>       }
>>     });
>>
>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
>>
>> I'd go with the Matches API though.
>>
>> Dawid
>>
>> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward  wrote:
>> >
>> > The Matches API will give you this information - it’s still likely to be 
>> > fairly slow, but it’s a lot easier to use than trying to parse Explain 
>> > output.
>> >
>> > Query query = ….;
>> > Weight w = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>> >
>> > Matches m = w.matches(context, doc);
>> > List<String> matchingFields = new ArrayList<>();
>> > for (String field : m) {
>> >   matchingFields.add(field);
>> > }
>> >
>> > Bear in mind that `matches` doesn’t maintain any state between calls, so 
>> > calling it for every matching document is likely to be slow; for those 
>> > cases Shai’s suggestion of using a Collector and examining low-level 
>> > scorers will perform better, but it won’t work for every query type.
>> >
>> >
>> > > On 25 Jun 2022, at 04:14, Yichen Sun  wrote:
>> > >
>> > > Hello!
>> > >
>> > > I’m a MSCS student from BU and learning to use Lucene. Recently I try to 
>> > > output matched fields by one query. For example, for one document, there 
>> > > are 10 fields and 2 of them match the query. I want to get the name of 
>> > > these fields.
>> > >
>> > > I have tried using explain() method and getting description then regex. 
>> > > However it cost so much time.
>> > >
>> > > I wonder what is the efficient way to get the matched fields. Would you 
>> > > please offer some help? Thank you so much!
>> > >
>> > > Best regards,
>> > > Yichen Sun
>> >
>> >
>>
>




Re: Finding out which fields matched the query

2022-06-27 Thread Alan Woodward
Your approach is almost certainly more efficient, but it might give you false 
matches in some cases - for example, if you have a complex query with many 
nested MUST and SHOULD clauses, you can have a leaf TermScorer that is 
positioned on the correct document, but which is part of a clause that doesn’t 
actually match.  It also only works for term queries, so it won’t match phrases 
or span/interval groups.  And Matches will work on points or docvalues queries 
as well.  The reason I added Matches in the first place was precisely to handle 
these weird corner cases - I had written highlighters which more or less did 
the same thing you describe with a Collector and the Scorable tree, and I would 
occasionally get bad highlights back.
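
To make the false-match case concrete, here is a toy model in plain Java (no
Lucene classes; Node, term and bool are made-up names for this illustration).
The query is (color:red AND size:large) OR brand:red, and the document has
color:red and brand:red but no size:large. Looking only at which leaves sit
on the document reports color as a match; walking the tree and skipping
clauses that do not match reports only brand.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class FalseMatchDemo {

  /** A query node evaluated against a document given as a set of "field:term" strings. */
  interface Node {
    boolean matches(Set<String> doc);
    /** Naive view: every leaf term present in the document, ignoring clause structure. */
    void naiveFields(Set<String> doc, Set<String> out);
    /** Tree-aware view: only leaves that sit under clauses which actually match. */
    void matchingFields(Set<String> doc, Set<String> out);
  }

  static Node term(String field, String text) {
    return new Node() {
      public boolean matches(Set<String> doc) { return doc.contains(field + ":" + text); }
      public void naiveFields(Set<String> doc, Set<String> out) { if (matches(doc)) out.add(field); }
      public void matchingFields(Set<String> doc, Set<String> out) { if (matches(doc)) out.add(field); }
    };
  }

  /** must == true models all-MUST clauses; must == false models all-SHOULD clauses. */
  static Node bool(boolean must, Node... clauses) {
    return new Node() {
      public boolean matches(Set<String> doc) {
        return must
            ? Arrays.stream(clauses).allMatch(c -> c.matches(doc))
            : Arrays.stream(clauses).anyMatch(c -> c.matches(doc));
      }
      public void naiveFields(Set<String> doc, Set<String> out) {
        for (Node c : clauses) c.naiveFields(doc, out);
      }
      public void matchingFields(Set<String> doc, Set<String> out) {
        if (!matches(doc)) return; // a clause that does not match contributes nothing
        for (Node c : clauses) c.matchingFields(doc, out);
      }
    };
  }

  static Set<String> naive(Node query, Set<String> doc) {
    Set<String> out = new TreeSet<>();
    query.naiveFields(doc, out);
    return out;
  }

  static Set<String> exact(Node query, Set<String> doc) {
    Set<String> out = new TreeSet<>();
    query.matchingFields(doc, out);
    return out;
  }

  public static void main(String[] args) {
    Node query = bool(false,
        bool(true, term("color", "red"), term("size", "large")),
        term("brand", "red"));
    Set<String> doc = Set.of("color:red", "brand:red"); // no size:large
    System.out.println("naive=" + naive(query, doc) + " exact=" + exact(query, doc));
    // prints: naive=[brand, color] exact=[brand]
  }
}
```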

> On 27 Jun 2022, at 10:51, Shai Erera  wrote:
> 
> Out of curiosity and for education purposes, is the Collector approach I 
> proposed wrong/inefficient? Or less efficient than the matches() API?
> 
> I'm thinking, if you want to both match/rank documents and as a side effect 
> know which fields matched, the Collector will perform better than 
> Weight.matches(), but I could be wrong.
> 
> Shai
> 
> On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss wrote:
> The matches API is awesome. Use it. You can also get a rough glimpse
> into a superset of fields potentially matching the query via:
> 
> query.visit(
>     new QueryVisitor() {
>       @Override
>       public boolean acceptField(String field) {
>         affectedFields.add(field);
>         return false;
>       }
>     });
> 
> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
> 
> I'd go with the Matches API though.
> 
> Dawid
> 
> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward wrote:
> >
> > The Matches API will give you this information - it’s still likely to be 
> > fairly slow, but it’s a lot easier to use than trying to parse Explain 
> > output.
> >
> > Query query = ….;
> > Weight w = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
> >
> > Matches m = w.matches(context, doc);
> > List<String> matchingFields = new ArrayList<>();
> > for (String field : m) {
> >   matchingFields.add(field);
> > }
> >
> > Bear in mind that `matches` doesn’t maintain any state between calls, so 
> > calling it for every matching document is likely to be slow; for those 
> > cases Shai’s suggestion of using a Collector and examining low-level 
> > scorers will perform better, but it won’t work for every query type.
> >
> >
> > > On 25 Jun 2022, at 04:14, Yichen Sun wrote:
> > >
> > > Hello!
> > >
> > > I’m a MSCS student from BU and learning to use Lucene. Recently I try to 
> > > output matched fields by one query. For example, for one document, there 
> > > are 10 fields and 2 of them match the query. I want to get the name of 
> > > these fields.
> > >
> > > I have tried using explain() method and getting description then regex. 
> > > However it cost so much time.
> > >
> > > I wonder what is the efficient way to get the matched fields. Would you 
> > > please offer some help? Thank you so much!
> > >
> > > Best regards,
> > > Yichen Sun
> >
> >
> 



Re: Finding out which fields matched the query

2022-06-27 Thread Shai Erera
Thanks Uwe, I didn't know about named queries, but it seems useful. Is
there interest in getting similar functionality in Lucene, or perhaps just
the FieldMatching collector? I'd be happy to PR-it.

As for usecase, I was thinking of using something similar to this collector
for some kind of (simple) entity recognition task. If you have a corpus of
documents with many fields which denote product attributes, you could match
a word like "Red" to the various product attribute fields and determine
based on the matching fields + their doc count whether this word likely
represents a Color or Brand entity (hint: it matches both, the question is
which is more probable).

I'm sure there are other ways to achieve this, and probably much smarter
NER implementations, but this one is at least based on the actual data that
you index which guarantees something about the results you will receive if
applying a certain attribute filtering.
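
As a rough illustration of the counting step, assuming the per-field doc
counts for the word have already been collected (the field names and numbers
below are invented, and EntityGuess is not a Lucene class), the most probable
attribute could be picked like this:

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: guesses which attribute field a word most likely denotes. */
final class EntityGuess {

  /** Converts per-field matching-document counts into relative frequencies. */
  static Map<String, Double> probabilities(Map<String, Integer> docCountByField) {
    long total = 0;
    for (int count : docCountByField.values()) {
      total += count;
    }
    Map<String, Double> result = new TreeMap<>();
    if (total == 0) {
      return result; // the word matched nothing
    }
    for (Map.Entry<String, Integer> e : docCountByField.entrySet()) {
      result.put(e.getKey(), e.getValue() / (double) total);
    }
    return result;
  }

  /** Returns the field with the highest doc count, or null if there are no counts. */
  static String mostLikely(Map<String, Integer> docCountByField) {
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : docCountByField.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Invented counts: "Red" matched 900 docs in "color" and 100 docs in "brand".
    Map<String, Integer> counts = Map.of("color", 900, "brand", 100);
    System.out.println(mostLikely(counts) + " " + probabilities(counts));
  }
}
```

With these made-up counts the word would be treated as a Color; ties are
broken arbitrarily in this sketch.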

Shai

On Mon, Jun 27, 2022 at 1:01 PM Uwe Schindler  wrote:

> I think the collector approach is perfectly fine for mass-processing of
> queries.
>
> By the way: Elasticsearch/OpenSearch have a feature already built in, and
> it works on top of the collector API in a way similar to what you mentioned
> (as far as I remember). It is a bit different, as you can tag any clause in
> a BQ (so every query) using a "name" (they call it "named query",
> https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries).
> When you get the search results, for each hit it tells you which named
> queries were a match on the hit. The actual implementation is some wrapper
> query on each of those clauses that contains the name. In hit collection it
> just collects all named query instances found in the query tree. I think in
> their implementation the wrapper query's scorer impl somehow adds the name
> to some global state.
>
> Uwe
> Am 27.06.2022 um 11:51 schrieb Shai Erera:
>
> Out of curiosity and for education purposes, is the Collector approach I
> proposed wrong/inefficient? Or less efficient than the matches() API?
>
> I'm thinking, if you want to both match/rank documents and as a side
> effect know which fields matched, the Collector will perform better than
> Weight.matches(), but I could be wrong.
>
> Shai
>
> On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss 
> wrote:
>
>> The matches API is awesome. Use it. You can also get a rough glimpse
>> into a superset of fields potentially matching the query via:
>>
>> query.visit(
>>     new QueryVisitor() {
>>       @Override
>>       public boolean acceptField(String field) {
>>         affectedFields.add(field);
>>         return false;
>>       }
>>     });
>>
>>
>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
>>
>> I'd go with the Matches API though.
>>
>> Dawid
>>
>> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward 
>> wrote:
>> >
>> > The Matches API will give you this information - it’s still likely to
>> be fairly slow, but it’s a lot easier to use than trying to parse Explain
>> output.
>> >
>> > Query query = ….;
>> > Weight w = searcher.createWeight(searcher.rewrite(query), ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>> >
>> > Matches m = w.matches(context, doc);
>> > List<String> matchingFields = new ArrayList<>();
>> > for (String field : m) {
>> >   matchingFields.add(field);
>> > }
>> >
>> > Bear in mind that `matches` doesn’t maintain any state between calls,
>> so calling it for every matching document is likely to be slow; for those
>> cases Shai’s suggestion of using a Collector and examining low-level
>> scorers will perform better, but it won’t work for every query type.
>> >
>> >
>> > > On 25 Jun 2022, at 04:14, Yichen Sun  wrote:
>> > >
>> > > Hello!
>> > >
>> > > I’m a MSCS student from BU and learning to use Lucene. Recently I try
>> to output matched fields by one query. For example, for one document, there
>> are 10 fields and 2 of them match the query. I want to get the name of
>> these fields.
>> > >
>> > > I have tried using explain() method and getting description then
>> regex. However it cost so much time.
>> > >
>> > > I wonder what is the efficient way to get the matched fields. Would
>> you please offer some help? Thank you so much!
>> > >
>> > > Best regards,
>> > > Yichen Sun
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremenhttps://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>


Re: Finding out which fields matched the query

2022-06-27 Thread Uwe Schindler
I think the collector approach is perfectly fine for mass-processing of 
queries.


By the way: Elasticsearch/OpenSearch already have a built-in feature
that, as far as I remember, works on top of the collector API in a way
similar to what you described. It is a bit different in that you can tag
any clause in a BQ (so every query) using a "name" (they call it "named query",
https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries).
When you get the search results, each hit tells you which named queries
matched it. The actual implementation is a wrapper query around each of
those clauses that carries the name. During hit collection it just
collects all named query instances found in the query tree. I think that
in their implementation the wrapper query's scorer adds the name to some
global state.
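
To make the wrapper idea concrete, here is a toy sketch in plain Java. It is not the Elasticsearch implementation and it uses no Lucene classes: `NamedClause`, `matchedNames`, and the map-based "documents" are hypothetical names invented for illustration. Each wrapper delegates to its clause and, when the clause matches, adds its name to a set shared for the current hit.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Toy model only: a "document" is a field -> value map and a "clause" is a predicate.
// None of these types exist in Lucene or Elasticsearch.
class NamedClauseDemo {

  /** Wraps a clause; when the clause matches, records its name in shared per-hit state. */
  static final class NamedClause implements Predicate<Map<String, String>> {
    private final String name;
    private final Predicate<Map<String, String>> delegate;
    private final Set<String> matchedNames;

    NamedClause(String name, Predicate<Map<String, String>> delegate, Set<String> matchedNames) {
      this.name = name;
      this.delegate = delegate;
      this.matchedNames = matchedNames;
    }

    @Override
    public boolean test(Map<String, String> doc) {
      boolean matched = delegate.test(doc);
      if (matched) {
        matchedNames.add(name);
      }
      return matched;
    }
  }

  /** Evaluates every named clause against one hit and returns the names that matched. */
  static Set<String> matchedNames(
      Map<String, String> doc, Map<String, Predicate<Map<String, String>>> clauses) {
    Set<String> names = new LinkedHashSet<>(); // fresh state for each hit
    for (Map.Entry<String, Predicate<Map<String, String>>> e : clauses.entrySet()) {
      new NamedClause(e.getKey(), e.getValue(), names).test(doc);
    }
    return names;
  }

  public static void main(String[] args) {
    Map<String, Predicate<Map<String, String>>> clauses = new LinkedHashMap<>();
    clauses.put("color_clause", d -> "red".equals(d.get("color")));
    clauses.put("brand_clause", d -> "red".equals(d.get("brand")));
    Map<String, String> hit = Map.of("color", "red", "brand", "acme");
    System.out.println(matchedNames(hit, clauses)); // prints [color_clause]
  }
}
```

Every clause is evaluated without short-circuiting, so all matching names are recorded for the hit and not only the first one.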


Uwe

On 27.06.2022 at 11:51, Shai Erera wrote:
Out of curiosity and for education purposes, is the Collector approach 
I proposed wrong/inefficient? Or less efficient than the matches() API?


I'm thinking, if you want to both match/rank documents and as a side 
effect know which fields matched, the Collector will perform better 
than Weight.matches(), but I could be wrong.


Shai


--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de


Re: Finding out which fields matched the query

2022-06-27 Thread Shai Erera
Out of curiosity and for education purposes, is the Collector approach I
proposed wrong/inefficient? Or less efficient than the matches() API?

I'm thinking, if you want to both match/rank documents and as a side effect
know which fields matched, the Collector will perform better than
Weight.matches(), but I could be wrong.

Shai



Re: Finding out which fields matched the query

2022-06-27 Thread Dawid Weiss
The matches API is awesome. Use it. You can also get a rough glimpse
into a superset of fields potentially matching the query via:

query.visit(
    new QueryVisitor() {
      @Override
      public boolean acceptField(String field) {
        affectedFields.add(field);
        return false;
      }
    });

https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
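
As a rough illustration of why visiting the query yields only a superset, the toy sketch below (plain Java; `ToyQuery`, `Term` and `AnyOf` are hypothetical types, not Lucene's `Query` or `QueryVisitor`) collects every field a query mentions and, separately, the fields that actually match one document.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: these are not Lucene classes.
class FieldSupersetDemo {

  interface ToyQuery {
    /** Adds every field the query mentions, whether or not any document matches. */
    void collectFields(Set<String> out);

    /** Adds only the fields whose clause matches the given document. */
    void collectMatchingFields(Map<String, String> doc, Set<String> out);
  }

  static final class Term implements ToyQuery {
    final String field;
    final String value;

    Term(String field, String value) {
      this.field = field;
      this.value = value;
    }

    @Override
    public void collectFields(Set<String> out) {
      out.add(field);
    }

    @Override
    public void collectMatchingFields(Map<String, String> doc, Set<String> out) {
      if (value.equals(doc.get(field))) {
        out.add(field);
      }
    }
  }

  /** Disjunction: recurses into every clause. */
  static final class AnyOf implements ToyQuery {
    final List<ToyQuery> clauses;

    AnyOf(List<ToyQuery> clauses) {
      this.clauses = clauses;
    }

    @Override
    public void collectFields(Set<String> out) {
      for (ToyQuery q : clauses) {
        q.collectFields(out);
      }
    }

    @Override
    public void collectMatchingFields(Map<String, String> doc, Set<String> out) {
      for (ToyQuery q : clauses) {
        q.collectMatchingFields(doc, out);
      }
    }
  }

  static Set<String> referencedFields(ToyQuery q) {
    Set<String> out = new LinkedHashSet<>();
    q.collectFields(out);
    return out;
  }

  static Set<String> matchingFields(ToyQuery q, Map<String, String> doc) {
    Set<String> out = new LinkedHashSet<>();
    q.collectMatchingFields(doc, out);
    return out;
  }

  public static void main(String[] args) {
    ToyQuery q = new AnyOf(List.of(new Term("title", "red"), new Term("color", "red")));
    Map<String, String> doc = Map.of("title", "shoes", "color", "red");
    System.out.println(referencedFields(q));    // prints [title, color]
    System.out.println(matchingFields(q, doc)); // prints [color]
  }
}
```

The first set depends only on the query, which is why it is cheap to compute but can include fields that do not match a given hit.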

I'd go with the Matches API though.

Dawid



Re: Finding out which fields matched the query

2022-06-27 Thread Alan Woodward
The Matches API will give you this information - it’s still likely to be fairly 
slow, but it’s a lot easier to use than trying to parse Explain output.

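Query q = ….;
Weight w = searcher.createWeight(searcher.rewrite(q),
    ScoreMode.COMPLETE_NO_SCORES, 1.0f);

// context is the LeafReaderContext of the segment containing the hit;
// doc is the document ID relative to that segment
Matches m = w.matches(context, doc);
List<String> matchingFields = new ArrayList<>();
if (m != null) { // matches() returns null if the document does not match
  for (String field : m) {
    matchingFields.add(field);
  }
}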

Bear in mind that `matches` doesn’t maintain any state between calls, so 
calling it for every matching document is likely to be slow; for those cases 
Shai’s suggestion of using a Collector and examining low-level scorers will 
perform better, but it won’t work for every query type.
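
One detail the snippet leaves implicit: `Weight.matches` takes a segment (`LeafReaderContext`) and a document ID relative to that segment, while the `ScoreDoc.doc` values in search results are index-wide. In Lucene, each leaf's `docBase` gives the offset and `ReaderUtil.subIndex` finds the owning leaf. The plain-Java sketch below only models that arithmetic; `LeafLookupDemo` and its methods are hypothetical names, and it assumes the doc bases are distinct and sorted in ascending order, starting at 0.

```java
import java.util.Arrays;

// Toy model of the global -> (segment, local) doc ID mapping; not a Lucene class.
class LeafLookupDemo {

  /**
   * Returns the index of the segment that contains globalDoc, given the first
   * global doc ID of each segment (assumed distinct, ascending, starting at 0).
   */
  static int leafIndex(int globalDoc, int[] docBases) {
    int pos = Arrays.binarySearch(docBases, globalDoc);
    // When the key is absent, binarySearch returns -(insertionPoint) - 1;
    // the owning segment is the one just before the insertion point.
    return pos >= 0 ? pos : -pos - 2;
  }

  /** Returns the doc ID relative to the start of its segment. */
  static int localDoc(int globalDoc, int[] docBases) {
    return globalDoc - docBases[leafIndex(globalDoc, docBases)];
  }

  public static void main(String[] args) {
    int[] docBases = {0, 100, 250}; // three segments
    int globalDoc = 120;
    System.out.println(
        "leaf=" + leafIndex(globalDoc, docBases)
            + " localDoc=" + localDoc(globalDoc, docBases)); // prints leaf=1 localDoc=20
  }
}
```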


> On 25 Jun 2022, at 04:14, Yichen Sun wrote:
> 
> Hello!
> 
> I’m an MSCS student at BU and I am learning to use Lucene. Recently I have
> been trying to output the fields matched by a query. For example, a document
> has 10 fields and 2 of them match the query. I want to get the names of
> these fields.
> 
> I have tried calling the explain() method and running a regex over its
> description. However, that takes a lot of time.
> 
> I wonder what an efficient way to get the matched fields would be. Would you
> please offer some help? Thank you so much!
> 
> Best regards,
> Yichen Sun


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[FINAL CALL] - Travel Assistance to ApacheCon New Orleans 2022

2022-06-27 Thread Gavin McDonald
To all committers and non-committers.

This is a final call to apply for travel/hotel assistance to get to and
stay in New Orleans for ApacheCon 2022.

Applications have been extended by one week and so the application deadline
is now the 8th July 2022.

The rest of this email is a copy of what has been sent out previously.

We will be supporting ApacheCon North America in New Orleans, Louisiana,
on October 3rd through 6th, 2022.

TAC exists to help those that would like to attend ApacheCon events, but
are unable to do so for financial reasons. This year, we are supporting
both committers and non-committers involved with projects at the
Apache Software Foundation, or open source projects in general.

For more info on this year's applications and qualifying criteria, please
visit the TAC website at http://www.apache.org/travel/
Applications have been extended until the 8th of July 2022.

Important: Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as required
to efficiently and accurately process their request); this will enable TAC
to announce successful awards shortly afterwards.

As usual, TAC expects to deal with a range of applications from a diverse
range of backgrounds. We therefore encourage (as always) anyone thinking
about sending in an application to do so ASAP.

Why should you attend as a TAC recipient? We encourage you to read stories
from past recipients at https://apache.org/travel/stories/ . Also note that
previous TAC recipients have gone on to become Committers, PMC Members, ASF
Members, Directors of the ASF Board and Infrastructure Staff members.
Others have gone from Committer to full time Open Source Developers!

How far can you go! - Let TAC help get you there.


===

Gavin McDonald on behalf of the Travel Assistance Committee.


Re: Finding out which fields matched the query

2022-06-27 Thread Jörn Franke
What is the reason you need the matched fields? Maybe your use case can be
solved using something completely different than knowing which fields were
matched.

> On 25.06.2022 at 06:58, Yichen Sun wrote:
> 
> Hello!
> 
> I’m an MSCS student at BU and I am learning to use Lucene. Recently I have
> been trying to output the fields matched by a query. For example, a document
> has 10 fields and 2 of them match the query. I want to get the names of
> these fields.
> 
> I have tried calling the explain() method and running a regex over its
> description. However, that takes a lot of time.
> 
> I wonder what an efficient way to get the matched fields would be. Would you
> please offer some help? Thank you so much!
> 
> Best regards,
> Yichen Sun
