Re: Avoiding false-positives in multivalued field search with intervals?

Dawid Weiss Mon, 14 Sep 2020 00:17:13 -0700

> Expanding a query over numerous fields grows combinatorically
> in the number of fields (if I want my query to match when all terms
> match in *some* field), doesn't it?

I don't think it does? As far as I can tell it grows linearly with the
number of fields. In my experience the number of fields searchable "by
default" is typically limited: it's not *all* fields, just the subset
that constitutes the "text body" of a document. Of course everyone's
experience will vary depending on the application.
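
To make the counting concrete, here is a minimal sketch in plain Python
(not Lucene code; the function names, the tuple representation and the
example fields are all made up for illustration). It models an expanded
query as nested lists of (field, term) clauses and compares the two
readings of "all terms match in some field" against a naive enumeration
of every field assignment. Assuming this model is a fair approximation,
only the naive enumeration grows combinatorially.

```python
from itertools import product


def any_field(terms, fields):
    """AND over terms of (OR over fields): each term may hit a different field."""
    return [[(field, term) for field in fields] for term in terms]


def same_field(terms, fields):
    """OR over fields of (AND over terms): all terms must hit the same field."""
    return [[(field, term) for term in terms] for field in fields]


def every_assignment(terms, fields):
    """Naive enumeration: one conjunction per way of assigning a field to each term."""
    return [list(zip(assignment, terms))
            for assignment in product(fields, repeat=len(terms))]


def clause_count(groups):
    """Total number of (field, term) leaf clauses in an expanded query."""
    return sum(len(group) for group in groups)


# Hypothetical example data.
terms = ["interval", "query", "parser"]
fields = ["title", "body", "tags", "author"]

linear_any = clause_count(any_field(terms, fields))
linear_same = clause_count(same_field(terms, fields))
naive = clause_count(every_assignment(terms, fields))
```

Both structured expansions produce len(terms) * len(fields) leaf
clauses, while the naive enumeration produces len(fields) ** len(terms)
conjunctions of len(terms) clauses each.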

> Re: query parsing; wasn't there at one time an interval query parser?
> It had operators like w() and n() IIRC

I've tried that but it's really unusable unless the queries are
automated - the syntax is difficult to use; mistakes cause cryptic
parse errors and are hard to recover from.

Dawid

On Thu, Sep 10, 2020 at 10:40 PM Michael Sokolov <[email protected]> wrote:
>
> A slightly different but related topic is how to manage lots of fields.
>
> I agree that sub-fields are a pain and that mashing everything
> together in an all-field is a mess, but for best performance with a
> large number of fields/sub-fields, it is the only workable option I
> can see? Expanding a query over numerous fields grows combinatorically
> in the number of fields (if I want my query to match when all terms
> match in *some* field), doesn't it?
>
> I would like to see a mechanism for defining sub-fields using
> positions. Together with an absolute positional query this would
> enable both match-any-field as well as field-specific matching with
> each token indexed only once (multi-values are possible within this
> with boundary tokens or big enough position ranges, as Alan
> suggested). It does mean that the sub-field boundaries have to be
> managed somehow. Without index support, you can set an arbitrarily large
> size for your sub-field and insert position gaps at the boundaries,
> but maybe we could detect the largest sub-field at flush time and
> write that metadata somewhere in the index to enable smaller gaps?
> Another issue is differing analysis for the sub-fields, and properly
> updating the positions during analysis: at the boundaries you don't
> want to insert a gap but rather advance to a fixed position, and you
> have to index sub-fields in order. Maybe we could make it less horrible
> by adding better support for it.
>
> Re: query parsing; wasn't there at one time an interval query parser?
> It had operators like w() and n() IIRC
>
> On Thu, Sep 10, 2020 at 4:20 PM Dawid Weiss <[email protected]> wrote:
> >
> > > Ok so the more general question is whether we need an interval
> > > query parser
> >
> > Oh, to this I'd say: yes, yes, yes.
> >
> > I didn't have much prior experience writing frontend apps on top of
> > Solr/Lucene, but once I had to go that route it quickly turned out
> > that several things that are readily available at the code level are
> > so darn difficult to achieve and integrate from the outside.
> > Specifically:
> >
> > - Field expansion in query parsers is a must (so that unqualified
> > terms are expanded over multiple fields).
> > Any query parser that doesn't support this is in my opinion of zero
> > use. The "default" copy-to sink field known
> > from Solr brings more problems than it solves.
> >
> > - Exact match-region hit highlighting is a strong expectation. I
> > solved this with matches API (see LUCENE-9461)
> > and flexible query parser's multifield expansion. Works like a charm.
> >
> > - Multivalued fields are common and sub-document handling is a pain.
> > The problem I raised here is a result of
> > direct user feedback. In real life multivalued fields are omnipresent
> > and searches over those fields can be complex.
> > Users see hits that just should not be there and are confused.
> >
> > - People do use complex queries. Maybe not all people but there are
> > people out there who do... Just recently I extended
> > flexible query parser with a handcrafted min-should-match operator
> > because it is otherwise not accessible in any Lucene
> > query parser (!). I can make this code available (it's not terribly
> > complex), although, since you asked, I think a query parser that
> > exposes all sorts of "higher level" functionality of intervals would
> > be very, very useful.
> >
> > It may end up that I'll have to write something for intervals anyway
> > so we can work on this together if you like.
> > Especially the syntax is an open question - should it be
> > operator-based (like the current boost or fuzzy operators) or
> > meta-function-based (so that pseudo-functions would be available)? Or
> > maybe a mix of both? I don't know, really. :)
> >
> > Dawid
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
>

