[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2605: --- Attachment: LUCENE-2605-dont-split-by-default.patch Patch for master only that switches the default split-on-whitespace behavior from *do* to *don't*. Committing shortly. > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Robert Muir >Assignee: Steve Rowe > Attachments: LUCENE-2605-dont-split-by-default.patch, > LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, > LUCENE-2605.patch, LUCENE-2605.patch > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2605: --- Attachment: LUCENE-2605.patch Okay, really final patch. On SOLR-9185 I was having trouble integrating the Solr standard QP's comment support with the whitespace tokenization I introduced here, so I tried switching the Solr parser back to ignoring both whitespace and comments, and it worked. The patch brings this grammar simplification back here too - in addition to many fewer whitespace mentions in the rules, fewer (and less complicated) lookaheads are required. I've included the generated files in the patch. No tests changed from the last patch. All Lucene tests pass, and precommit passes. > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Robert Muir >Assignee: Steve Rowe > Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, > LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2605: --- Attachment: LUCENE-2605.patch Patch adds lucene-test-framework files missing from last version of the patch. Also adds a CHANGES entry. I plan on committing in a couple days if there are no objections. > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Robert Muir >Assignee: Steve Rowe > Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, > LUCENE-2605.patch, LUCENE-2605.patch > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2605: --- Attachment: LUCENE-2605.patch Patch adding an option to preserve the old behavior: {{\{set/get\}SplitOnWhitespace()}}, defaulting to {{true}} (the current behavior). Though nobody said so here, on the Solr issue (SOLR-9185), a couple people mentioned that the old behavior should be preserved, and not be the default until a major release. That's what this patch does. > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Robert Muir >Assignee: Steve Rowe > Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch, > LUCENE-2605.patch > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2605: --- Fix Version/s: (was: 6.0) (was: 4.9) > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Robert Muir >Assignee: Steve Rowe > Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2605: --- Attachment: LUCENE-2605.patch Patch, I think it's ready. I've pulled MockSynonymFilter/Analyzer out into their own files in lucene-test-framework, and added tests for it, and added a fixed multi-term source synonym with one single-term target. I added tests to {{TestQueryParser}} using the modified MockSynonymAnalyzer ensuring operators block multi-term analysis when they should and don't when they shouldn't. I'll go make issues now for converting Solr's clone of this QueryParser, and the standard flexible query parser, to add the same capabilities. > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Robert Muir >Assignee: Steve Rowe > Fix For: 4.9, 6.0 > > Attachments: LUCENE-2605.patch, LUCENE-2605.patch, LUCENE-2605.patch > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2605: --- Attachment: LUCENE-2605.patch Patch fixing up two problems: # Multiple whitespace-separated terms' TermQuery-s within a BooleanClause are now flattened directly into the output Query list, rather than inserting the BooleanClause. # MultiFieldQuery's getFieldQuery() is modified to recombine multiple terms from each field's query, to produce a series of disjunctions of term against each field. All queryparser module tests now pass, with the exception of the flexible query parser's TestStandardQP run with QueryParserTestBase.testQPA(). Since this patch doesn't modify anything about the flexible query parser, this is not surprising. > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Robert Muir >Assignee: Steve Rowe > Fix For: 4.9, 6.0 > > Attachments: LUCENE-2605.patch, LUCENE-2605.patch > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rowe updated LUCENE-2605: --- Attachment: LUCENE-2605.patch Initial patch against the classic Lucene QueryParser grammar only. Runs of whitespace-separated text with no operators are now sent as a single input to the analyzer, rather than first splitting on whitespace. There are 10 failing tests in the queryparser module (didn't try anywhere else yet), mostly caused by the way getFieldQuery() produces a nested query when there are multiple terms. I'll look into how to address that. > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Robert Muir >Assignee: Steve Rowe > Fix For: 4.9, 6.0 > > Attachments: LUCENE-2605.patch > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Smiley updated LUCENE-2605: - Fix Version/s: (was: 4.7) 4.8 > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Robert Muir > Fix For: 4.8 > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2605: -- Fix Version/s: (was: 4.3) 4.4 > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Core > Issue Type: Bug > Components: core/queryparser >Reporter: Robert Muir > Fix For: 4.4 > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2605: Fix Version/s: (was: 3.1) > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Java > Issue Type: Bug > Components: QueryParser >Reporter: Robert Muir > Fix For: 4.0 > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2605) queryparser parses on whitespace
[ https://issues.apache.org/jira/browse/LUCENE-2605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2605: Component/s: QueryParser > queryparser parses on whitespace > > > Key: LUCENE-2605 > URL: https://issues.apache.org/jira/browse/LUCENE-2605 > Project: Lucene - Java > Issue Type: Bug > Components: QueryParser >Reporter: Robert Muir > Fix For: 3.1, 4.0 > > > The queryparser parses input on whitespace, and sends each whitespace > separated term to its own independent token stream. > This breaks the following at query-time, because they can't see across > whitespace boundaries: > * n-gram analysis > * shingles > * synonyms (especially multi-word for whitespace-separated languages) > * languages where a 'word' can contain whitespace (e.g. vietnamese) > Its also rather unexpected, as users think their > charfilters/tokenizers/tokenfilters will do the same thing at index and > querytime, but > in many cases they can't. Instead, preferably the queryparser would parse > around only real 'operators'. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org