[ 
https://issues.apache.org/jira/browse/LUCENE-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-7533:
-------------------------------
    Attachment: LUCENE-7533.patch

Patch that addresses some of this issue, with some failing tests and nocommits.

The existing autoGeneratePhraseQueries=true approach generates queries exactly 
as if the query had contained quotation marks, but as I mentioned above, this 
is inappropriate when splitOnWhitespace=false and the query text contains 
spaces.

The approach in the patch is to add a new QueryBuilder method to handle the 
autoGeneratePhraseQueries=true case.  The query text is split on whitespace and 
these tokens' offsets are compared to those produced by the configured 
analyzer.  When multiple non-overlapping tokens have offsets within the bounds 
of a single whitespace-separated token, a phrase query is created.  If the 
original token is present as a token overlapping with the first split token, 
then a disjunction query is created with the original token and the phrase 
query of the split tokens.
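
To make the offset comparison above concrete, here is a minimal standalone sketch. It is not the code in the patch: the names Tok, buildQuery and clauseFor are hypothetical, it emits a query string instead of Lucene Query objects, and it assumes the analyzer reports well-formed offsets. It also simplifies the "original token" test to "a token whose offsets span the whole whitespace-separated word", and it ignores overlapping split tokens such as synonyms.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: offset-based grouping of analyzer tokens per whitespace-separated word. */
final class AutoPhraseSketch {

    /** Hypothetical stand-in for an analyzed token: text plus [start, end) offsets into the query. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits queryText on whitespace and builds one clause per word from the analyzed tokens. */
    static String buildQuery(String queryText, List<Tok> analyzed) {
        List<String> clauses = new ArrayList<>();
        int i = 0;
        int n = queryText.length();
        while (i < n) {
            if (Character.isWhitespace(queryText.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < n && !Character.isWhitespace(queryText.charAt(i))) {
                i++;
            }
            String clause = clauseFor(start, i, analyzed);
            if (clause != null) {
                clauses.add(clause);
            }
        }
        return String.join(" ", clauses);
    }

    /** Builds the clause for the word occupying [start, end) in the query text. */
    static String clauseFor(int start, int end, List<Tok> analyzed) {
        String original = null;               // token covering the whole word, if any
        List<String> parts = new ArrayList<>(); // non-overlapping split tokens inside the word
        int lastEnd = start;
        for (Tok t : analyzed) {
            if (t.start < start || t.end > end) {
                continue; // token lies (partly) outside this word
            }
            if (t.start == start && t.end == end) {
                if (original == null) {
                    original = t.text;
                }
            } else if (t.start >= lastEnd) {
                parts.add(t.text);
                lastEnd = t.end;
            }
            // overlapping split tokens are ignored in this sketch
        }
        if (parts.size() < 2) {
            // nothing to auto-quote: fall back to a plain term, or no clause at all
            if (original != null) {
                return original;
            }
            return parts.isEmpty() ? null : parts.get(0);
        }
        String phrase = "\"" + String.join(" ", parts) + "\"";
        return original == null ? phrase : "(" + original + " OR " + phrase + ")";
    }
}
```

With tokens wi-fi(0,5), wi(0,2), fi(3,5) and router(6,12) for the query text "wi-fi router", the sketch yields a disjunction of the original token and the phrase for the first word, and a plain term for the second.  For "some words" with one token per word, no phrase is created.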

I've added a couple of tests that show posincr/poslength/offset output from 
SynonymFilter and WordDelimiterFilter (likely the two most frequently used 
analysis components that can create split tokens), and both create corrupt 
token graphs of various kinds (e.g. LUCENE-6582, LUCENE-5051), so solving this 
problem in a complete way just isn't possible right now.

So I'm not happy with the approach in the patch.  It only covers a subset of 
possible token graphs (e.g. more than one overlapping multi-term synonym 
doesn't work).  And it's a lot of new code solving a problem that AFAIK no user 
has reported (does anybody even use autoGeneratePhraseQueries=true with classic 
QP?).

I'd be much happier if we could somehow get TermAutomatonQuery hooked into the 
query parsers, and then rewrite to simpler queries if possible: LUCENE-6824.  
The first step, though, is fixing SynonymFilter and friends so that they 
produce well-formed token graphs.  Attempts to do this for SynonymFilter have 
stalled: LUCENE-6664.  (I have a germ of an idea that might break the 
logjam - I'll post over there.)

For this issue, maybe instead of my patch, for now, we just disallow 
autoGeneratePhraseQueries=true when splitOnWhitespace=false.
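
A rough sketch of what disallowing the combination could look like follows.  ParserOptionsSketch is a hypothetical stand-in, not the real QueryParser class, and the default values shown are assumptions made for the sketch.

```java
/** Illustrative only: rejects autoGeneratePhraseQueries=true together with splitOnWhitespace=false. */
final class ParserOptionsSketch {
    // Defaults here are assumptions for the sketch.
    private boolean splitOnWhitespace = true;
    private boolean autoGeneratePhraseQueries = false;

    void setSplitOnWhitespace(boolean value) {
        check(value, autoGeneratePhraseQueries);
        splitOnWhitespace = value;
    }

    void setAutoGeneratePhraseQueries(boolean value) {
        check(splitOnWhitespace, value);
        autoGeneratePhraseQueries = value;
    }

    boolean getSplitOnWhitespace() {
        return splitOnWhitespace;
    }

    boolean getAutoGeneratePhraseQueries() {
        return autoGeneratePhraseQueries;
    }

    // Validates the combination before either setter changes state.
    private static void check(boolean split, boolean autoPhrase) {
        if (!split && autoPhrase) {
            throw new IllegalArgumentException(
                "autoGeneratePhraseQueries=true is not supported when splitOnWhitespace=false");
        }
    }
}
```

Checking in both setters means the order in which the two options are set does not matter: whichever call would complete the disallowed combination is the one that fails.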

Thoughts?

> Classic query parser: autoGeneratePhraseQueries=true doesn't work when 
> splitOnWhitespace=false
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7533
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7533
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 6.2, 6.3, 6.2.1
>            Reporter: Steve Rowe
>         Attachments: LUCENE-7533.patch
>
>
> LUCENE-2605 introduced the classic query parser option to not split on 
> whitespace prior to performing analysis.
> When splitOnWhitespace=false, the output from analysis can now come from 
> multiple whitespace-separated tokens, which breaks code assumptions when 
> autoGeneratePhraseQueries=true: for this combination of options, it's not 
> appropriate to auto-quote multiple non-overlapping tokens produced by 
> analysis.  E.g. simple whitespace tokenization over the query "some words" 
> will produce the token sequence ("some", "words"), and even when 
> autoGeneratePhraseQueries=true, we should not be creating a phrase query here.


