I think it's a matter of tradeoffs. Faceting, for example, already requires
complete evaluation of the query, and since this field matching is a kind of
aggregation I think it's OK if it works the same way. Users can choose which
technique to apply based on their use case.

Anyway, I don't think we need to introduce this kind of collector in Lucene;
it's definitely something people can write in their own projects.
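
For anyone who wants to try this in their own project, here is a minimal
sketch of the hybrid idea from earlier in the thread: record the hits cheaply
first, then do the expensive per-document field check only on those hits. It
is plain Java with no Lucene dependency, so everything in it is an assumption
made for illustration: the class HybridFieldMatching and its methods are
hypothetical, the IntPredicate stands in for the query, and the IntFunction
stands in for the per-document work that Weight.matches() would do.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntFunction;
import java.util.function.IntPredicate;

// Hypothetical, Lucene-free model of the two-phase approach.
public class HybridFieldMatching {

  /** Phase 1: cheaply record which doc IDs match, much as a collector would. */
  static BitSet collectHits(int maxDoc, IntPredicate matchesQuery) {
    BitSet hits = new BitSet(maxDoc);
    for (int doc = 0; doc < maxDoc; doc++) {
      if (matchesQuery.test(doc)) {
        hits.set(doc);
      }
    }
    return hits;
  }

  /** Phase 2: run the expensive per-document check only on the recorded hits. */
  static Map<Integer, Set<String>> fieldsPerHit(
      BitSet hits, IntFunction<Set<String>> matchingFields) {
    Map<Integer, Set<String>> result = new LinkedHashMap<>();
    for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
      result.put(doc, matchingFields.apply(doc));
    }
    return result;
  }

  /** Aggregate view: every field that matched in at least one hit. */
  static Set<String> fieldsMatchedAnywhere(Map<Integer, Set<String>> perDoc) {
    Set<String> all = new TreeSet<>();
    for (Set<String> fields : perDoc.values()) {
      all.addAll(fields);
    }
    return all;
  }

  public static void main(String[] args) {
    // Toy "index": position = doc ID, entries = fields where the query matches.
    String[][] fieldsByDoc = {{"title"}, {}, {"title", "body"}, {}};

    BitSet hits = collectHits(fieldsByDoc.length, doc -> fieldsByDoc[doc].length > 0);
    Map<Integer, Set<String>> perDoc =
        fieldsPerHit(hits, doc -> new TreeSet<>(Arrays.asList(fieldsByDoc[doc])));

    System.out.println(perDoc);                        // {0=[title], 2=[body, title]}
    System.out.println(fieldsMatchedAnywhere(perDoc)); // [body, title]
  }
}
```

In a real project phase 1 would be a Collector that fills a DocIdSet per
segment and phase 2 would call Weight.matches(), so the cost of phase 2 still
grows with the number of hits.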

Shai

On Tue, Jun 28, 2022 at 4:09 PM Alan Woodward <[email protected]> wrote:

> I think it depends on what information we actually want to get here.  If
> it’s just finding which fields matched in which document, then running
> Matches over the top-k results is fine.  If you want to get some kind of
> aggregate data, as in you want to get a list of fields that matched in
> *any* document (or conversely, a list of fields that *didn’t* match -
> useful if you want to prune your schema, for example), then Matches will be
> too slow.  But at the same time, queries are designed to tell you which
> *documents* match efficiently, and they are allowed to advance their
> sub-queries lazily or indeed not at all if the result isn’t needed for
> scoring.  So we don’t really have any way of finding this kind of
> information via a collector that is accurate and performs reasonably.
>
> It *might* be possible to rework Matches so that they act more like an
> iterator and maintain their state within a segment, but there hasn’t been a
> pressing need for that so far.
>
> On 27 Jun 2022, at 12:46, Shai Erera <[email protected]> wrote:
>
> Thanks Alan, yeah I guess I was thinking about the use case I described,
> which involves (usually) simple term queries, but you're definitely right
> about complex boolean clauses as well as non-term queries.
>
> I think the case for highlighter is different though? I mean you usually
> generate highlights only for the top-K results and therefore are probably
> less affected by whether the matches() API is slower than a Collector. And
> if you invoke the API for every document in the index, it might be much
> slower (depending on the index size) than the Collector.
>
> Maybe a hybrid approach which runs the query and caches the docs in a
> DocIdSet (like FacetsCollector does) and then invokes the matches() API
> only on those hits, will let you enjoy the best of both worlds? Assuming
> though that the number of matching documents is not huge.
>
> So it seems there are several options and one should choose based on their
> use case. Do you see an advantage for Lucene to offer a Collector for this
> use case? Or should we tell users to use the matches API?
>
> Shai
>
> On Mon, Jun 27, 2022 at 2:22 PM Dawid Weiss <[email protected]> wrote:
>
>> A side note - I've been using a highlighter based on matches API for
>> quite some time now and it's been fantastic. Very precise and handles
>> non-trivial queries (interval queries) very well.
>>
>>
>> https://lucene.apache.org/core/9_2_0/highlighter/org/apache/lucene/search/matchhighlight/package-summary.html
>>
>> Dawid
>>
>> On Mon, Jun 27, 2022 at 1:10 PM Alan Woodward <[email protected]>
>> wrote:
>> >
>> > Your approach is almost certainly more efficient, but it might give you
>> > false matches in some cases - for example, if you have a complex query
>> > with many nested MUST and SHOULD clauses, you can have a leaf TermScorer
>> > that is positioned on the correct document, but which is part of a clause
>> > that doesn’t actually match.  It also only works for term queries, so it
>> > won’t match phrases or span/interval groups.  And Matches will work on
>> > points or docvalues queries as well.  The reason I added Matches in the
>> > first place was precisely to handle these weird corner cases - I had
>> > written highlighters which more or less did the same thing you describe
>> > with a Collector and the Scorable tree, and I would occasionally get bad
>> > highlights back.
>> >
>> > On 27 Jun 2022, at 10:51, Shai Erera <[email protected]> wrote:
>> >
>> > Out of curiosity and for education purposes, is the Collector approach
>> > I proposed wrong/inefficient? Or less efficient than the matches() API?
>> >
>> > I'm thinking, if you want to both match/rank documents and as a side
>> > effect know which fields matched, the Collector will perform better than
>> > Weight.matches(), but I could be wrong.
>> >
>> > Shai
>> >
>> > On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss <[email protected]>
>> wrote:
>> >>
>> >> The matches API is awesome. Use it. You can also get a rough glimpse
>> >> into a superset of fields potentially matching the query via:
>> >>
>> >>     query.visit(
>> >>         new QueryVisitor() {
>> >>           @Override
>> >>           public boolean acceptField(String field) {
>> >>             affectedFields.add(field);
>> >>             return false;
>> >>           }
>> >>         });
>> >>
>> >>
>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
>> >>
>> >> I'd go with the Matches API though.
>> >>
>> >> Dawid
>> >>
>> >> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward <[email protected]>
>> wrote:
>> >> >
>> >> > The Matches API will give you this information - it’s still likely
>> >> > to be fairly slow, but it’s a lot easier to use than trying to parse
>> >> > Explain output.
>> >> >
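>> >> > Query q = ….;
>> >> > Weight w = searcher.createWeight(searcher.rewrite(q),
>> >> >     ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>> >> >
>> >> > // context is the LeafReaderContext of the segment that holds the
>> >> > // document; doc is relative to that segment (global ID - docBase)
>> >> > Matches m = w.matches(context, doc);
>> >> > List<String> matchingFields = new ArrayList<>();
>> >> > if (m != null) { // matches() returns null if doc does not match
>> >> >   for (String field : m) {
>> >> >     matchingFields.add(field);
>> >> >   }
>> >> > }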
>> >> >
>> >> > Bear in mind that `matches` doesn’t maintain any state between
>> >> > calls, so calling it for every matching document is likely to be slow;
>> >> > for those cases Shai’s suggestion of using a Collector and examining
>> >> > low-level scorers will perform better, but it won’t work for every
>> >> > query type.
>> >> >
>> >> >
>> >> > > On 25 Jun 2022, at 04:14, Yichen Sun <[email protected]> wrote:
>> >> > >
>> >> > > Hello!
>> >> > >
>> >> > > I’m an MSCS student at BU and am learning to use Lucene. Recently I
>> >> > > have been trying to output the fields matched by a query. For
>> >> > > example, a document has 10 fields and 2 of them match the query; I
>> >> > > want to get the names of those fields.
>> >> > >
>> >> > > I have tried calling the explain() method and running a regex over
>> >> > > its description. However, that takes too much time.
>> >> > >
>> >> > > I wonder what the efficient way to get the matched fields is.
>> >> > > Would you please offer some help? Thank you so much!
>> >> > >
>> >> > > Best regards,
>> >> > > Yichen Sun
>> >> >
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: [email protected]
>> >> > For additional commands, e-mail: [email protected]
>> >> >
