I'm working on a product for librarians and similar people, who
apparently expect to be able to combine classic boolean operators
(i.e. AND, OR, NOT) with proximity operators (especially w/n and pre/n
-- which basically map to unordered and ordered SpanQueries with slop
n, respectively) in unrestricted fashion. For example, users appear to
believe that not only are relatively easy-to-grasp queries like these
legitimate:

medical w/5 agreement
(medical w/5 agreement) and (doctor w/10 rights)

but also crazier ones, perhaps like

agreement w/5 (medical and companion)
(dog or dragon) w/5 (cat and cow)
(daisy and (dog or dragon)) w/25 (cat not cow)

What I've noticed is that it's not always obvious how to interpret
such queries; it's not always obvious what the user had in mind, nor
how you might construct a Lucene query to carry out the user's intent.

Therefore, to broaden my understanding of the problem, I'd like to
learn more about schemes people may have used/proposed to assign
meanings to arbitrary combinations/nestings of, say, the AND, OR, NOT,
W, and PRE operators. Ultimately I'm interested in parsing queries
into Lucene Query objects (perhaps SpanQuery objects). But I'm also
interested in a more abstract/mathematical discussion. (After all, I'm
open to the possibility that the best implementation requires new
Query classes to be written.)

So far I've found two main sources of inspiration, but I'm not 100%
satisfied with either:

1. Mark Miller's no-longer-maintained qsol query parser
(https://github.com/markrmiller/qsol)

The premise of qsol seems to be that you can simplify any arbitrary
combination of boolean and proximity operators into an equivalent
expression where boolean operators can contain proximity operators,
but where no boolean operator appears nested inside a proximity
operator. (With the exception of OR, which maps nicely to a
SpanOrQuery...) This is a convenient transformation, because then you
can readily express the query in terms of existing SpanQuery and
BooleanQuery classes.

The main principle that qsol uses for these transformations might be
sloppily summarized as, "You can distribute any boolean operator over
any proximity operator".

Thus, for example, input query

    agreement w/5 (medical and companion)

gets morphed into something along along the lines of

    (agreement w/5 medical) and (medical w/5 companion)

In closer-to-Lucene syntax, that's

    +(spanNear(slop=5, inOrder=false, ageement, medical))
+(spanNear(slop=5, inOrder=false, medical, companion))

I like how qsol is able to provide at least *some* Lucene-executable
Query for every input query string, and I like how it seems to take a
single principle of distributing booleans over proximities and see it
through pretty systematically.

Unfortunately, the Query objects returned by qsol don't always align
perfectly with what I imagine user intent to be.

For example, when my users try queries like

    agreement w/5 (medical and companion)

I believe they are seeking spans where a single occurrence of
"agreement" is near both "medical" and "companion". qsol will find
documents like that, but it will also return what I think I consider
to be spurious matches, namely docs where there's an "agreement" near
"medical" and an "agreement" near "companion", but no "agreement"
that's near both.

Also non-intuitive is that qsol generates rather different Query
objects for these two queries:

    (dog or dragon) w/5 (cat and cow)
    (cat and cow) w/5 (dog or dragon)

The former maps to something like

    ((dog w/5 cat) AND (dog w/5 cow)) OR ((dragon w/5 cat) AND (dragon w/5 cow))

while the latter maps to something like

    ((cat w/5 dog) OR (cat w/5 dragon)) AND ((cow w/5 dog) OR (cow w/5 dragon))

I'm not sure there's an obviously better thing for qsol to do with
these, but it would be nice if w/5 were a symmetric operation.

Note: The above does not reflect qsol's actual syntax, only its semantics.

2. Minimum interval semantics

This approach is reflected in the Lucene "positions branch"
(PositionIntervalIterator stuff). It's also nicely described in the
paper "An Algebra for Structured Text Search and A Framework for its
Implementation" (Clarke, Cormack, Burkowski).

I don't know of a Lucene query parser built around this stuff yet, but
it seems possible to construct one. The basic idea is that each
subquery of a query string corresponds to a set of matching
spans/intervals/extents, and operators (whether "boolean" or
"proximity") are a means of filter and combining
spans/intervals/extents. For example:

* Query [ medical ] would correspond to every span where the term
"medical" appears
* Query [ medical and companion ] would correspond to all the
minimum-match spans containing both "medical" and "companion"
* Query [ agreement w/5 (medical and companion) ] would correspond to
all the minimum-match spans where "agreement" was within 5 words of a
minimum-match span for [medical and companion]

I like how minimum interval semantics gives a precise meaning to not
just a query but every subcomponent thereof. There is a very clear
story for why every document did/didn't match.

Unfortunately, I doubt that minimum interval semantics corresponds
especially well to my users' expectations. For example, in the above
interpretation of [agreement w/5 (medical and companion)], that query
would match the hypothetical span

    medical ... [insert 50,000 words here, none of which is
"companion"] ... companion to the agreement

whereas I'm pretty sure my users are hoping to find instances where
"medical" and "companion" are much closer together.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to