I'm working on a product for librarians and similar people, who apparently expect to be able to combine classic boolean operators (i.e. AND, OR, NOT) with proximity operators (especially w/n and pre/n -- which basically map to unordered and ordered SpanQueries with slop n, respectively) in unrestricted fashion. For example, users appear to believe that not only are relatively easy-to-grasp queries like these legitimate:
medical w/5 agreement (medical w/5 agreement) and (doctor w/10 rights) but also crazier ones, perhaps like agreement w/5 (medical and companion) (dog or dragon) w/5 (cat and cow) (daisy and (dog or dragon)) w/25 (cat not cow) What I've noticed is that it's not always obvious how to interpret such queries; it's not always obvious what the user had in mind, nor how you might construct a Lucene query to carry out the user's intent. Therefore, to broaden my understanding of the problem, I'd like to learn more about schemes people may have used/proposed to assign meanings to arbitrary combinations/nestings of, say, the AND, OR, NOT, W, and PRE operators. Ultimately I'm interested in parsing queries into Lucene Query objects (perhaps SpanQuery objects). But I'm also interested in a more abstract/mathematical discussion. (After all, I'm open to the possibility that the best implementation requires new Query classes to be written.) So far I've found two main sources of inspiration, but I'm not 100% satisfied with either: 1. Mark Miller's no-longer-maintained qsol query parser (https://github.com/markrmiller/qsol) The premise of qsol seems to be that you can simplify any arbitrary combination of boolean and proximity operators into an equivalent expression where boolean operators can contain proximity operators, but where no boolean operator appears nested inside a proximity operator. (With the exception of OR, which maps nicely to a SpanOrQuery...) This is a convenient transformation, because then you can readily express the query in terms of existing SpanQuery and BooleanQuery classes. The main principle that qsol uses for these transformations might be sloppily summarized as, "You can distribute any boolean operator over any proximity operator". Thus, for example, input query agreement w/5 (medical and companion) gets morphed into something along along the lines of (agreement w/5 medical) and (medical w/5 companion) In closer-to-Lucene syntax, that's +(spanNear(slop=5, inOrder=false, ageement, medical)) +(spanNear(slop=5, inOrder=false, medical, companion)) I like how qsol is able to provide at least *some* Lucene-executable Query for every input query string, and I like how it seems to take a single principle of distributing booleans over proximities and see it through pretty systematically. Unfortunately, the Query objects returned by qsol don't always align perfectly with what I imagine user intent to be. For example, when my users try queries like agreement w/5 (medical and companion) I believe they are seeking spans where a single occurrence of "agreement" is near both "medical" and "companion". qsol will find documents like that, but it will also return what I think I consider to be spurious matches, namely docs where there's an "agreement" near "medical" and an "agreement" near "companion", but no "agreement" that's near both. Also non-intuitive is that qsol generates rather different Query objects for these two queries: (dog or dragon) w/5 (cat and cow) (cat and cow) w/5 (dog or dragon) The former maps to something like ((dog w/5 cat) AND (dog w/5 cow)) OR ((dragon w/5 cat) AND (dragon w/5 cow)) while the latter maps to something like ((cat w/5 dog) OR (cat w/5 dragon)) AND ((cow w/5 dog) OR (cow w/5 dragon)) I'm not sure there's an obviously better thing for qsol to do with these, but it would be nice if w/5 were a symmetric operation. Note: The above does not reflect qsol's actual syntax, only its semantics. 2. Minimum interval semantics This approach is reflected in the Lucene "positions branch" (PositionIntervalIterator stuff). It's also nicely described in the paper "An Algebra for Structured Text Search and A Framework for its Implementation" (Clarke, Cormack, Burkowski). I don't know of a Lucene query parser built around this stuff yet, but it seems possible to construct one. The basic idea is that each subquery of a query string corresponds to a set of matching spans/intervals/extents, and operators (whether "boolean" or "proximity") are a means of filter and combining spans/intervals/extents. For example: * Query [ medical ] would correspond to every span where the term "medical" appears * Query [ medical and companion ] would correspond to all the minimum-match spans containing both "medical" and "companion" * Query [ agreement w/5 (medical and companion) ] would correspond to all the minimum-match spans where "agreement" was within 5 words of a minimum-match span for [medical and companion] I like how minimum interval semantics gives a precise meaning to not just a query but every subcomponent thereof. There is a very clear story for why every document did/didn't match. Unfortunately, I doubt that minimum interval semantics corresponds especially well to my users' expectations. For example, in the above interpretation of [agreement w/5 (medical and companion)], that query would match the hypothetical span medical ... [insert 50,000 words here, none of which is "companion"] ... companion to the agreement whereas I'm pretty sure my users are hoping to find instances where "medical" and "companion" are much closer together. --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
