Re: Avoiding false-positives in multivalued field search with intervals?

Michael Gibney Mon, 14 Sep 2020 07:57:10 -0700

This might be a little outside the spirit of this discussion (in that
it's not really "off-the-shelf") -- but I implemented a
proof-of-concept for a different use case that I think could be
adapted here:


For a given doc, for each term in your multivalued field, you could
record a bitset of the indexes of the individual field values in which
that term appears. Then, in the conjunction DISI (DocIdSetIterator)
over the query terms, intersect the per-term bitsets to quickly
determine whether the terms appear in the same value. You could store
the bitset, e.g., in the Payload for the first position of each term
or, for more general-purpose use, in polyField/subfield DocValues, or
whatever.
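
A minimal sketch of the intersection step in plain Java, assuming at
most 64 values per document so that a single long can serve as the
bitset. ValueMaskSketch, buildMasks and allInSameValue are hypothetical
names for illustration, not Lucene APIs; in a real index the mask would
be read from a payload or doc values while the conjunction advances,
which this sketch does not attempt to show.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Lucene): per-document term -> value-index bitmask. */
final class ValueMaskSketch {

    /**
     * For one document, maps each term to a bitmask whose bit i is set
     * when the term occurs in the i-th value of the multivalued field.
     */
    static Map<String, Long> buildMasks(List<List<String>> values) {
        if (values.size() > Long.SIZE) {
            throw new IllegalArgumentException("sketch supports at most 64 values per document");
        }
        Map<String, Long> masks = new HashMap<>();
        for (int i = 0; i < values.size(); i++) {
            long bit = 1L << i;
            for (String term : values.get(i)) {
                masks.merge(term, bit, (a, b) -> a | b);
            }
        }
        return masks;
    }

    /** Returns true if all given terms occur together in at least one value. */
    static boolean allInSameValue(Map<String, Long> masks, String... terms) {
        long acc = -1L; // start with every bit set
        for (String term : terms) {
            acc &= masks.getOrDefault(term, 0L);
            if (acc == 0L) {
                return false; // no single value contains all terms seen so far
            }
        }
        return terms.length > 0;
    }
}
```

With the values ["red", "car"] and ["blue", "bike"], the terms "red" and
"car" share bit 0 and match, while "red" and "bike" have disjoint masks
and are rejected, which is the false positive this thread is about.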

It seems like everyone's on the same page more-or-less, but I'll
explicitly note: this feels superficially a little like a "special
case", as it addresses only the "conjunction" case ... but for
avoiding false-positives in the multivalued-field case, arguably the
conjunction case *is* the general case.

Michael

On Mon, Sep 14, 2020 at 3:17 AM Dawid Weiss <[email protected]> wrote:
>
> bq. Expanding a query over numerous fields grows combinatorially
> in the number of fields (if I want my query to match when all terms
> match in *some* field), doesn't it?
>
> I don't think it does? It grows linearly with the number of fields? In
> my experience the number of fields searchable "by default" is typically
> limited - it's not *all* fields - it's just a subset that constitutes
> the "text body" of a document. Of course everyone's experience will
> vary depending on the application.
>
> > Re: query parsing; wasn't there at one time an interval query parser?
> > It had operators like w() and n() IIRC
>
> I've tried that but it's really unusable unless the queries are
> automated - the syntax is difficult to use; mistakes cause cryptic
> parse errors and are hard to recover from.
>
> Dawid
>
> On Thu, Sep 10, 2020 at 10:40 PM Michael Sokolov <[email protected]> wrote:
> >
> > A slightly different but related topic is how to manage lots of fields.
> >
> > I agree that sub-fields are a pain and that mashing everything
> > together in an all-field is a mess, but for best performance with a
> > large number of fields/sub-fields, it is the only workable option I
> > can see? Expanding a query over numerous fields grows combinatorially
> > in the number of fields (if I want my query to match when all terms
> > match in *some* field), doesn't it?
> >
> > I would like to see a mechanism for defining sub-fields using
> > positions. Together with an absolute positional query this would
> > enable both match-any-field as well as field-specific matching with
> > each token indexed only once (multi-values are possible within this
> > with boundary tokens or big enough position ranges, as Alan
> > suggested). It does mean that the sub-field boundaries have to be
> > managed somehow. Without index support, you can set an arbitrarily large
> > size for your sub-field and insert position gaps at the boundaries,
> > but maybe we could detect the largest sub-field at flush time and
> > write that metadata somewhere in the index to enable smaller gaps?
> > Another issue is differing analysis for the sub-fields, and properly
> > updating the positions during analysis: at the boundaries you don't
> > want to insert a gap, but rather advance to a fixed position, and you
> > have to index sub-fields in order. Maybe we could make it less horrible by
> > adding better support for it.
> >
> > Re: query parsing; wasn't there at one time an interval query parser?
> > It had operators like w() and n() IIRC
> >
> > On Thu, Sep 10, 2020 at 4:20 PM Dawid Weiss <[email protected]> wrote:
> > >
> > > > Ok so the more general question is whether we need an interval
> > > > query parser
> > >
> > > Oh, to this I'd say: yes, yes, yes.
> > >
> > > I didn't have much prior experience writing frontend apps on top of
> > > Solr/Lucene, but once I had to go that route it quickly turned out
> > > that several things that are readily available at the code level
> > > are so darn difficult to achieve and integrate from the outside.
> > > Specifically:
> > >
> > > - Field expansion in query parsers is a must (so that unqualified
> > > terms are expanded over multiple fields).
> > > Any query parser that doesn't support this is in my opinion of zero
> > > use. The "default" copy-to sink field known
> > > from Solr brings more problems than it solves.
> > >
> > > - Exact match-region hit highlighting is a strong expectation. I
> > > solved this with matches API (see LUCENE-9461)
> > > and flexible query parser's multifield expansion. Works like a charm.
> > >
> > > - Multivalued fields are common and sub-document handling is a pain.
> > > The problem I raised here is a result of
> > > direct user feedback. In real life multivalued fields are omnipresent
> > > and searches over those fields can be complex.
> > > Users see hits that just should not be there and are confused.
> > >
> > > - People do use complex queries. Maybe not all people but there are
> > > people out there who do... Just recently I extended
> > > flexible query parser with a handcrafted min-should-match operator
> > > because it is otherwise not accessible in any Lucene
> > > query parser (!). I can make this code available (it's not terribly
> > > complex), although, since you asked, I think a query parser that
> > > exposes all sorts of "higher level" functionality of intervals would
> > > be very, very useful.
> > >
> > > It may end up that I'll have to write something for intervals anyway
> > > so we can work on this together if you like.
> > > Especially the syntax is an open question - should it be
> > > operator-based (like the current boost or fuzzy operators) or
> > > meta-function-based (so that pseudo-functions would be available). Or
> > > maybe a mix of both? I don't know, really. :)
> > >
> > > Dawid
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> >
> >
>
>

