[
https://issues.apache.org/jira/browse/LUCENE-7315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Rowe updated LUCENE-7315:
-------------------------------
Attachment: LUCENE-7315.patch
WIP patch against master. Generated files are not included ({{ant javacc-flexible}}
in {{lucene/queryparser/}} will generate them), and the patch still has nocommits
and failing tests.
In addition to adding the option not to split on whitespace prior to text
analysis, the patch includes the following changes:
* Changed {{TermQueryNode}}'s {{positionIncrement}} name to {{position}}, since
that's what it really holds.
* {{SynonymQueryNode}}/{{Builder}} now produces a {{SynonymQuery}} instead of a
boolean query.
* Refactored {{AnalyzerQueryNodeProcessor.postProcessNode()}} into shorter
methods and made it simpler and easier to follow.
* Moved split-on-whitespace tests to the shared {{QueryParserTestBase}}.
Some challenges remain:
* Unlike the classic QP, the flexible standard QP appears to remove a top-level
MUST boolean query, e.g. {{+(word)}} -> {{word}}. Some of the
split-on-whitespace shared tests will need to be specialized for each parser.
* When not splitting on whitespace, there's no simple way to collapse the
children of the boolean query produced for text containing whitespace into their
ancestor boolean query (if there is one), so some of the shared
split-on-whitespace tests are failing.
** The patch includes a {{FlattenQueryNodeProcessor}} meant to address this
issue, but it's not working and I haven't figured out why yet.
* Recent master-only changes will likely make the branch_6x backport
non-trivial, e.g. LUCENE-7347.
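To illustrate the collapsing problem described above, here is a minimal sketch of flattening nested boolean nodes into their ancestor. It is not the patch's {{FlattenQueryNodeProcessor}} and does not use the flexible QP's node classes: {{Node}}, {{Term}}, {{Bool}} and {{flatten()}} are hypothetical names invented for this example, and the sketch ignores occurrence modifiers (MUST/SHOULD/MUST_NOT), which a real processor would have to respect before merging.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FlattenSketch {
    // Hypothetical toy node types; NOT the flexible query parser's QueryNode classes.
    interface Node {}

    static final class Term implements Node {
        final String text;
        Term(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    static final class Bool implements Node {
        final List<Node> children;
        Bool(List<Node> children) { this.children = children; }
        @Override public String toString() {
            return children.stream().map(Object::toString)
                    .collect(Collectors.joining(" ", "(", ")"));
        }
    }

    // Recursively splices the children of nested Bool nodes into their parent.
    // Assumption: all clauses share the same occurrence, so merging is safe.
    static Node flatten(Node node) {
        if (!(node instanceof Bool)) {
            return node;
        }
        List<Node> flat = new ArrayList<>();
        for (Node child : ((Bool) node).children) {
            Node flattened = flatten(child);
            if (flattened instanceof Bool) {
                flat.addAll(((Bool) flattened).children);
            } else {
                flat.add(flattened);
            }
        }
        return new Bool(flat);
    }

    public static void main(String[] args) {
        Node nested = new Bool(Arrays.asList(
                new Term("a"),
                new Bool(Arrays.asList(new Term("b"), new Term("c")))));
        System.out.println(nested);          // (a (b c))
        System.out.println(flatten(nested)); // (a b c)
    }
}
```

The sketch rebuilds the tree bottom-up rather than mutating it in place, which keeps the input unchanged and makes the before/after comparison easy to test.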
> Flexible "standard" query parser parses on whitespace
> -----------------------------------------------------
>
> Key: LUCENE-7315
> URL: https://issues.apache.org/jira/browse/LUCENE-7315
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/queryparser
> Reporter: Steve Rowe
> Assignee: Steve Rowe
> Attachments: LUCENE-7315.patch
>
>
> Copied from LUCENE-2605:
> The queryparser parses input on whitespace, and sends each whitespace
> separated term to its own independent token stream.
> This breaks the following at query-time, because they can't see across
> whitespace boundaries:
> n-gram analysis
> shingles
> synonyms (especially multi-word for whitespace-separated languages)
> languages where a 'word' can contain whitespace (e.g. Vietnamese)
> It's also rather unexpected, as users think their
> charfilters/tokenizers/tokenfilters will do the same thing at index and
> query time, but in many cases they can't. Instead, preferably the queryparser
> would parse around only real 'operators'.
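The effect described in the quoted issue can be sketched in plain Java without any Lucene classes. The toy {{analyze()}} below stands in for an analysis chain with a bigram shingle filter; all names here ({{WhitespaceSplitDemo}}, {{analyze}}, {{parseSplitting}}, {{parseWhole}}) are hypothetical and exist only for illustration. It compares analyzing each whitespace-separated chunk independently with analyzing the whole text at once.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceSplitDemo {
    // Toy stand-in for an analysis chain: whitespace tokenization followed by
    // a bigram shingle step that also emits the unigrams.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            out.add(tokens[i]);
            if (i + 1 < tokens.length) {
                out.add(tokens[i] + " " + tokens[i + 1]);
            }
        }
        return out;
    }

    // Parser splits on whitespace first: each chunk gets its own analysis pass,
    // so the shingle step never sees two adjacent words together.
    static List<String> parseSplitting(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }

    // Parser hands the whole text to analysis in one pass.
    static List<String> parseWhole(String query) {
        return analyze(query);
    }

    public static void main(String[] args) {
        String query = "new york city";
        System.out.println(parseSplitting(query)); // [new, york, city]
        System.out.println(parseWhole(query));     // [new, new york, york, york city, city]
    }
}
```

With pre-splitting, the shingles "new york" and "york city" are never produced at query time, even though the same chain would produce them at index time.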