Re: Avoiding false-positives in multivalued field search with intervals?

Dawid Weiss Mon, 14 Sep 2020 22:42:42 -0700

Thanks Michael. The clear outcome of this discussion seems to be that
everyone is trying to reinvent the wheel somehow. ;) I think a solution
to this really should become part of core Lucene functionality. It seems
like a corner case people are not aware of until they hit it (and then
it's not clear what to do about it).


Dawid

On Mon, Sep 14, 2020 at 4:57 PM Michael Gibney
<mich...@michaelgibney.net> wrote:
>
> This might be a little outside the spirit of this discussion (in that
> it's not really "off-the-shelf") -- but I implemented a
> proof-of-concept for a different use case that I think could be
> adapted here:
>
> For a given doc, for each term in your multivalued field, you could
> record a bitset representation of the indexes of the individual field
> values in which that term appears; then, in the conjunction DISI
> (DocIdSetIterator) for different terms, intersect the bitset values to
> speed up determining whether the terms appear in the same value. You could
> put the bitset representation, e.g., in the Payload for the first
> position of each term, or for more general-purpose use, in
> polyField/subfield DocValues, or whatever.
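
A minimal sketch of the bitset idea described above, in plain Python rather
than Lucene code. It assumes a single document whose multivalued field is
given as a list of strings, and whitespace tokenization; the helper names
value_bitsets and same_value_match are hypothetical. Bit i of a term's mask
is set when the term occurs in the i-th value, so a non-zero AND of the
masks means all terms co-occur in at least one value.

```python
from functools import reduce


def value_bitsets(values):
    """Map each term to an int bitmask of the value indexes it occurs in."""
    masks = {}
    for index, value in enumerate(values):
        for term in value.lower().split():
            masks[term] = masks.get(term, 0) | (1 << index)
    return masks


def same_value_match(masks, terms):
    """True if all terms occur together in at least one single value."""
    if not terms:
        return False
    # Intersect the per-term masks; unknown terms contribute an empty mask.
    combined = reduce(lambda a, b: a & b, (masks.get(t, 0) for t in terms))
    return combined != 0


# One document with a three-valued field (illustrative data only).
doc_values = ["quick brown fox", "lazy dog", "brown dog"]
masks = value_bitsets(doc_values)
```

With this data, "fox" and "dog" both occur in the document, so a plain
conjunction would match, but their masks do not intersect and
same_value_match(masks, ["fox", "dog"]) rejects the pair.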
>
> It seems like everyone's on the same page more-or-less, but I'll
> explicitly note: this feels superficially a little like a "special
> case", as it addresses only the "conjunction" case ... but for
> avoiding false-positives in the multivalued-field case, arguably the
> conjunction case *is* the general case.
>
> Michael
>
> On Mon, Sep 14, 2020 at 3:17 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
> >
> > > Expanding a query over numerous fields grows combinatorically
> > > in the number of fields (if I want my query to match when all terms
> > > match in *some* field), doesn't it?
> >
> > I don't think it does? It grows linearly with the number of fields? In
> > my experience the number of fields searchable "by default" is
> > typically limited - it's not *all* fields - it's just a subset that
> > constitutes the "text body" of a document. Of course everyone's
> > experience will vary depending on the application.
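
To illustrate the linear-growth argument with a rough count: if each
unqualified term is expanded into a disjunction over the default fields, a
query with T terms and F fields needs T * F leaf clauses. The sketch below
is plain Python with hypothetical expand and clause_count helpers and
made-up field names; it is not a Lucene query parser.

```python
def expand(terms, fields):
    """Expand each term into a disjunction over all default fields.

    Returns a conjunction (outer list) of disjunctions (inner lists)
    of (field, term) clauses.
    """
    return [[(field, term) for field in fields] for term in terms]


def clause_count(expanded):
    """Total number of leaf (field, term) clauses in the expanded query."""
    return sum(len(disjunction) for disjunction in expanded)


fields = ["title", "body", "keywords"]  # hypothetical default fields
query = expand(["lucene", "intervals"], fields)
```

The stricter variant, where all terms must match in the same field, is one
conjunction of T terms per field, which is again T * F leaf clauses.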
> >
> > > Re: query parsing; wasn't there at one time an interval query parser? It 
> > > had operators like w() and n() IIRC
> >
> > I've tried that but it's really unusable unless the queries are
> > automated - the syntax is difficult to use; mistakes cause cryptic
> > parse errors and are hard to recover from.
> >
> > Dawid
> >
> > On Thu, Sep 10, 2020 at 10:40 PM Michael Sokolov <msoko...@gmail.com> wrote:
> > >
> > > A slightly different but related topic is how to manage lots of fields.
> > >
> > > I agree that sub-fields are a pain and that mashing everything
> > > together in an all-field is a mess, but for best performance with a
> > > large number of fields/sub-fields, it is the only workable option I
> > > can see? Expanding a query over numerous fields grows combinatorically
> > > in the number of fields (if I want my query to match when all terms
> > > match in *some* field), doesn't it?
> > >
> > > I would like to see a mechanism for defining sub-fields using
> > > positions. Together with an absolute positional query this would
> > > enable both match-any-field as well as field-specific matching with
> > > each token indexed only once (multi-values are possible within this
> > > with boundary tokens or big enough position ranges, as Alan
> > > suggested). It does mean that the sub-field boundaries have to be
> > > managed somehow. Without index support, you can set an arbitrarily large
> > > size for your sub-field and insert position gaps at the boundaries,
> > > but maybe we could detect the largest sub-field at flush time and
> > > write that metadata somewhere in the index to enable smaller gaps?
> > > Another issue is differing analysis for the sub-fields, and properly
> > > updating the positions during analysis at the boundaries (you don't
> > > want to insert a gap, but rather advance to a fixed position, and you
> > > have to index sub-fields in order). Maybe we could make it less horrible
> > > by adding better support for it.
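
A small sketch of the fixed-position approach mentioned above, in plain
Python rather than Lucene's analysis API. It assumes whitespace
tokenization and a fixed sub-field size of 100 positions, both arbitrary
choices for illustration; index_positions, within and value_index are
hypothetical helpers. Each value starts at a multiple of the gap, so a
proximity check cannot match across a value boundary as long as each
value's token count plus the allowed distance stays within the gap.

```python
GAP = 100  # assumed fixed sub-field size; at least the longest value's length


def index_positions(values, gap=GAP):
    """Assign positions so that value i starts at position i * gap."""
    postings = {}
    for index, value in enumerate(values):
        tokens = value.lower().split()
        if len(tokens) > gap:
            raise ValueError("value longer than the position gap")
        for offset, term in enumerate(tokens):
            postings.setdefault(term, []).append(index * gap + offset)
    return postings


def within(postings, a, b, max_distance):
    """True if terms a and b occur at most max_distance positions apart."""
    return any(
        abs(pa - pb) <= max_distance
        for pa in postings.get(a, [])
        for pb in postings.get(b, [])
    )


def value_index(position, gap=GAP):
    """Recover which value (sub-field) a position belongs to."""
    return position // gap


postings = index_positions(["quick brown fox", "lazy dog"])
```

Here "fox" and "lazy" are adjacent in the concatenated text, but their
positions are 2 and 100, so within(postings, "fox", "lazy", 5) does not
match across the boundary.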
> > >
> > > Re: query parsing; wasn't there at one time an interval query parser?
> > > It had operators like w() and n() IIRC
> > >
> > > On Thu, Sep 10, 2020 at 4:20 PM Dawid Weiss <dawid.we...@gmail.com> wrote:
> > > >
> > > > > Ok so the more general question is whether we need an interval query 
> > > > > parser
> > > >
> > > > Oh, to this I'd say: yes, yes, yes.
> > > >
> > > > > I didn't have much prior experience writing frontend apps on top of
> > > > > Solr/Lucene, but once I had to go that route it quickly turned out
> > > > > that several things that are readily available at the code level
> > > > > are so darn difficult to achieve and integrate from the outside.
> > > > > Specifically:
> > > >
> > > > - Field expansion in query parsers is a must (so that unqualified
> > > > terms are expanded over multiple fields).
> > > > Any query parser that doesn't support this is in my opinion of zero
> > > > use. The "default" copy-to sink field known
> > > > from Solr brings more problems than it solves.
> > > >
> > > > > - Exact match-region hit highlighting is a strong expectation. I
> > > > > solved this with the matches API (see LUCENE-9461)
> > > > > and the flexible query parser's multifield expansion. Works like a charm.
> > > >
> > > > - Multivalued fields are common and sub-document handling is a pain.
> > > > The problem I raised here is a result of
> > > > direct user feedback. In real life multivalued fields are omnipresent
> > > > and searches over those fields can be complex.
> > > > Users see hits that just should not be there and are confused.
> > > >
> > > > - People do use complex queries. Maybe not all people but there are
> > > > > people out there who do... Just recently I extended the
> > > > > flexible query parser with a handcrafted min-should-match operator
> > > > because it is otherwise not accessible in any Lucene
> > > > query parser (!). I can make this code available (it's not terribly
> > > > complex), although, since you asked, I think a query parser that
> > > > exposes all sorts of "higher level" functionality of intervals would
> > > > be very, very useful.
> > > >
> > > > It may end up that I'll have to write something for intervals anyway
> > > > so we can work on this together if you like.
> > > > > The syntax especially is an open question - should it be
> > > > > operator-based (like the current boost or fuzzy operators) or
> > > > > meta-function-based (so that pseudo-functions would be available)? Or
> > > > > maybe a mix of both? I don't know, really. :)
> > > >
> > > > Dawid
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: dev-h...@lucene.apache.org
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org