[ 
https://issues.apache.org/jira/browse/LUCENE-7315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rowe updated LUCENE-7315:
-------------------------------
    Attachment: LUCENE-7315.patch

WIP patch against master, generated files not included ({{ant javacc-flexible}} 
in {{lucene/queryparser/}} will generate them), still has nocommits and failing 
tests.

In addition to adding the option not to split on whitespace before text 
analysis, the patch includes the following changes:

* Changed {{TermQueryNode}}'s {{positionIncrement}} name to {{position}}, since 
that's what it really holds.
* {{SynonymQueryNode}}/{{Builder}} now produces a {{SynonymQuery}} instead of a 
boolean query.
* Refactored {{AnalyzerQueryNodeProcessor.postProcessNode()}} into shorter 
methods and made it simpler and easier to follow.
* Moved split-on-whitespace tests to the shared {{QueryParserTestBase}}.

Some challenges remain:

* Unlike the classic QP, the flexible standard QP appears to remove a top-level 
MUST boolean query, e.g. {{+(word)}} -> {{word}}.  Some of the 
split-on-whitespace shared tests will need to be specialized for each parser.
* When not splitting on whitespace, text containing whitespace produces a 
boolean query, and there's no simple way to collapse that query's children 
into its ancestor boolean query (if there is one), so some of the shared 
split-on-whitespace tests are failing.
** The patch includes a {{FlattenQueryNodeProcessor}} meant to address this 
issue, but it's not working and I haven't figured out why yet.
* Recent master-only changes will likely make the branch_6x backport 
non-trivial, e.g. LUCENE-7347.
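
To make the flattening challenge above concrete, here is a minimal sketch of the general idea: splicing the children of a nested boolean group into its parent. The {{Node}} type and {{flatten()}} method are hypothetical stand-ins invented for illustration. They are not Lucene's query node classes and not the patch's {{FlattenQueryNodeProcessor}}, and the sketch ignores modifiers such as MUST/SHOULD, which a real processor would have to respect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins for query nodes; NOT Lucene's classes.
final class FlattenSketch {

    /** A node is either a term (leaf) or a boolean group of children. */
    static final class Node {
        final String term;          // non-null for leaves
        final List<Node> children;  // non-null for boolean groups

        private Node(String term, List<Node> children) {
            this.term = term;
            this.children = children;
        }

        static Node term(String text) {
            return new Node(text, null);
        }

        static Node bool(Node... children) {
            return new Node(null, new ArrayList<>(Arrays.asList(children)));
        }

        boolean isBool() {
            return children != null;
        }

        @Override
        public String toString() {
            if (!isBool()) {
                return term;
            }
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(' ');
                }
                sb.append(children.get(i));
            }
            return sb.append(')').toString();
        }
    }

    /** Returns a new tree in which nested boolean groups are spliced into their parent. */
    static Node flatten(Node node) {
        if (!node.isBool()) {
            return node;
        }
        List<Node> flat = new ArrayList<>();
        for (Node child : node.children) {
            Node flatChild = flatten(child);
            if (flatChild.isBool()) {
                // Assumption: child and parent combine their clauses the same way.
                flat.addAll(flatChild.children);
            } else {
                flat.add(flatChild);
            }
        }
        return new Node(null, flat);
    }

    public static void main(String[] args) {
        Node parsed = Node.bool(Node.term("a"), Node.bool(Node.term("b"), Node.term("c")));
        System.out.println(parsed + " -> " + flatten(parsed)); // (a (b c)) -> (a b c)
    }
}
```

Splicing like this is only safe when the inner and outer groups have the same semantics; a group carrying its own modifier or boost would have to be left intact.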

> Flexible "standard" query parser parses on whitespace
> -----------------------------------------------------
>
>                 Key: LUCENE-7315
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7315
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/queryparser
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>         Attachments: LUCENE-7315.patch
>
>
> Copied from LUCENE-2605:
> The queryparser parses input on whitespace, and sends each whitespace 
> separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across 
> whitespace boundaries:
> n-gram analysis
> shingles
> synonyms (especially multi-word for whitespace-separated languages)
> languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their 
> charfilters/tokenizers/tokenfilters will do the same thing at index and 
> query time, but in many cases they can't. Instead, preferably the queryparser 
> would parse around only real 'operators'.
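
The effect described in the issue can be illustrated with a toy example. The sketch below does not use Lucene; {{analyze()}} is a hypothetical stand-in for an analysis chain that emits single words plus two-word shingles, and {{splitThenAnalyze()}} mimics a parser that splits on whitespace before analysis. When each whitespace-separated chunk is analyzed on its own, the shingles can never be produced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration only; these methods are assumptions, not Lucene APIs.
final class SplitOnWhitespaceSketch {

    /** Toy "analyzer": lowercases, tokenizes on whitespace, emits words plus 2-word shingles. */
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]);
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1]);
            }
        }
        return out;
    }

    /** Mimics a parser that splits on whitespace first and analyzes each chunk independently. */
    static List<String> splitThenAnalyze(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }

    public static void main(String[] args) {
        String query = "New York City";
        // Each chunk is analyzed alone, so no shingles appear.
        System.out.println(splitThenAnalyze(query)); // [new, york, city]
        // The whole text is analyzed at once, so shingles span the whitespace.
        System.out.println(analyze(query)); // [new, new york, york, york city, city]
    }
}
```

The same reasoning applies to multi-word synonyms and n-grams: any filter that needs to see adjacent words only works if the whole text reaches the analysis chain in one piece.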



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)