Thanks, Michael. The clear outcome of this discussion seems to be that everyone is trying to reinvent the wheel somehow. ;) I think this really should become part of core Lucene functionality. It seems like a corner case people are not aware of until they hit it (and then it's not clear what to do about it).

Dawid

On Mon, Sep 14, 2020 at 4:57 PM Michael Gibney <mich...@michaelgibney.net> wrote:
>
> This might be a little outside the spirit of this discussion (in that
> it's not really "off-the-shelf") -- but I implemented a
> proof-of-concept for a different use case that I think could be
> adapted here:
>
> For a given doc, for each term in your multivalued field, you could
> record a bitset representation of the indexes of the individual fields
> in which that term appears; then in conjunction DISI for different
> terms, intersect the bitset values for different terms to speed the
> determination of whether the terms appear in the same field. You could
> put the bitset representation, e.g., in the Payload for the first
> position of each term, or for more general-purpose use, in
> polyField/subfield DocValues, or whatever.
>
> It seems like everyone's on the same page more-or-less, but I'll
> explicitly note: this feels superficially a little like a "special
> case", as it addresses only the "conjunction" case ... but for
> avoiding false-positives in the multivalued-field case, arguably the
> conjunction case *is* the general case.
>
> Michael
>
> On Mon, Sep 14, 2020 at 3:17 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
> >
> > bq. Expanding a query over numerous fields grows combinatorically
> > in the number of fields (if I want my query to match when all terms
> > match in *some* field), doesn't it?
> >
> > I don't think it does? It grows linearly with the number of fields?
> > In my experience the number of fields searchable "by default" is
> > typically limited - it's not *all* fields - it's just a subset that
> > constitutes the "text body" of a document. Of course everyone's
> > experience will vary depending on the application.
> >
> > > Re: query parsing; wasn't there at one time an interval query parser?
> > > It had operators like w() and n() IIRC
> >
> > I've tried that but it's really unusable unless the queries are
> > automated - the syntax is difficult to use; mistakes cause cryptic
> > parse errors and are hard to recover from.
> >
> > Dawid
> >
> > On Thu, Sep 10, 2020 at 10:40 PM Michael Sokolov <msoko...@gmail.com> wrote:
> > >
> > > A slightly different but related topic is how to manage lots of fields
> > >
> > > I agree that sub-fields are a pain and that mashing everything
> > > together in an all-field is a mess, but for best performance with a
> > > large number of fields/sub-fields, it is the only workable option I
> > > can see? Expanding a query over numerous fields grows combinatorically
> > > in the number of fields (if I want my query to match when all terms
> > > match in *some* field), doesn't it?
> > >
> > > I would like to see a mechanism for defining sub-fields using
> > > positions. Together with an absolute positional query this would
> > > enable both match-any-field as well as field-specific matching with
> > > each token indexed only once (multi-values are possible within this
> > > with boundary tokens or big enough position ranges, as Alan
> > > suggested). It does mean that the sub-field boundaries have to be
> > > managed somehow. Without index support, you can set an arbitrary large
> > > size for your sub-field and insert position gaps at the boundaries,
> > > but maybe we could detect the largest sub-field at flush time and
> > > write that metadata somewhere in the index to enable smaller gaps?
> > > Another issue is differing analysis for the sub-fields, and properly
> > > updating the positions during analysis: at the boundaries you don't
> > > want to insert a gap, but rather advance to a fixed position, and you
> > > have to index sub-fields in order. Maybe we could make it less
> > > horrible by adding better support for it.
> > >
> > > Re: query parsing; wasn't there at one time an interval query parser?
> > > It had operators like w() and n() IIRC
> > >
> > > On Thu, Sep 10, 2020 at 4:20 PM Dawid Weiss <dawid.we...@gmail.com> wrote:
> > > >
> > > > > Ok so the more general question is whether we need an interval query
> > > > > parser
> > > >
> > > > Oh, to this I'd say: yes, yes, yes.
> > > >
> > > > I didn't have much prior experience writing frontend apps on top of
> > > > Solr/Lucene but once I did have to go that route it quickly turned
> > > > out that several things that are readily available from code-level
> > > > are so darn difficult to achieve and integrate from the outside.
> > > > Specifically:
> > > >
> > > > - Field expansion in query parsers is a must (so that unqualified
> > > > terms are expanded over multiple fields). Any query parser that
> > > > doesn't support this is in my opinion of zero use. The "default"
> > > > copy-to sink field known from Solr brings more problems than it
> > > > solves.
> > > >
> > > > - Exact match-region hit highlighting is a strong expectation. I
> > > > solved this with matches API (see LUCENE-9461) and flexible query
> > > > parser's multifield expansion. Works like a charm.
> > > >
> > > > - Multivalued fields are common and sub-document handling is a pain.
> > > > The problem I raised here is a result of direct user feedback. In
> > > > real life multivalued fields are omnipresent and searches over those
> > > > fields can be complex. Users see hits that just should not be there
> > > > and are confused.
> > > >
> > > > - People do use complex queries. Maybe not all people but there are
> > > > people out there who do... Just recently I extended flexible query
> > > > parser with a handcrafted min-should-match operator because it is
> > > > otherwise not accessible in any Lucene query parser (!). I can make
> > > > this code available (it's not terribly complex), although, since you
> > > > asked, I think a query parser that exposes all sorts of "higher
> > > > level" functionality of intervals would be very, very useful.
> > > >
> > > > It may end up that I'll have to write something for intervals anyway
> > > > so we can work on this together if you like.
> > > > Especially the syntax is an open question - should it be
> > > > operator-based (like the current boost or fuzzy operators) or
> > > > meta-function-based (so that pseudo-functions would be available). Or
> > > > maybe a mix of both? I don't know, really. :)
> > > >
> > > > Dawid
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: dev-h...@lucene.apache.org
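
To illustrate the bitset idea Michael Gibney describes in the thread above, here is a minimal sketch in plain Java. It deliberately uses no Lucene API. The names ValueBitsetSketch, termValueMasks and matchInSameValue are hypothetical, the per-document masks are kept in an in-memory map rather than in payloads or DocValues, and using a single long is an assumption that limits a document to 64 values.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Lucene code.
final class ValueBitsetSketch {

    // For one document: map each term to a bitmask whose bit i is set
    // when the term occurs in the i-th value of the multivalued field.
    static Map<String, Long> termValueMasks(List<List<String>> values) {
        if (values.size() > Long.SIZE) {
            throw new IllegalArgumentException("sketch supports at most 64 values");
        }
        Map<String, Long> masks = new HashMap<>();
        for (int i = 0; i < values.size(); i++) {
            long bit = 1L << i;
            for (String term : values.get(i)) {
                masks.merge(term, bit, (a, b) -> a | b);
            }
        }
        return masks;
    }

    // Conjunction check: AND the masks of all terms. A non-zero result
    // means at least one value contains every term.
    static boolean matchInSameValue(Map<String, Long> masks, String... terms) {
        long acc = -1L; // all bits set; an empty term list matches trivially
        for (String term : terms) {
            acc &= masks.getOrDefault(term, 0L);
            if (acc == 0L) {
                return false; // no single value can contain all terms
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two values of one multivalued field, already tokenized.
        Map<String, Long> masks = termValueMasks(List.of(
                List.of("john", "smith"),
                List.of("jane", "doe")));
        System.out.println(matchInSameValue(masks, "john", "smith")); // true
        System.out.println(matchInSameValue(masks, "john", "doe"));   // false
    }
}
```

The second call is the false-positive case discussed in the thread: both terms occur in the document, but never in the same value, so the intersected mask is empty.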