[ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868003#action_12868003 ]
Michael McCandless commented on LUCENE-2465:
--------------------------------------------

OK that's a good point -- saying the quote must be preceded by "whitespace" is in fact wrong for non-whitespace languages.

Meaning, if we made this change we'd actually break String -> PhraseQuery conversions that QP handles correctly today.

> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
>                 Key: LUCENE-2465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2465
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4,
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1,
> 4.0
>            Reporter: Itamar Syn-Hershko
>
> The current implementation of Lucene's QueryParser identifies a phrase in the
> query when hitting a double-quote character, even if it is mid-word. For
> example, the string ' Foo"bar test" ' will produce a BooleanQuery, holding one
> term and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a
> phrase is a group of words surrounded by double quotes as defined by
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does
> it say double quotes will also tokenize the input. Arguably, a phrase should
> only be identified as such when it is also surrounded by whitespace.
> Other than being logically incorrect, this behavior makes parsing of Hebrew
> acronyms impossible. Hebrew acronyms contain one double-quote character in the
> middle of a word (for example, MNK"L), hence causing the QP to throw a syntax
> exception, since it is expecting another double quote to create a phrase
> query, essentially splitting the acronym into two.
> The solution to this is pretty simple - changing the JavaCC syntax to check
> if whitespace precedes the double quote when a phrase opening is expected,
> or peek to see if whitespace follows the double quote if a phrase closing
> is expected.
> This will both eliminate a logically incorrect behavior which shouldn't be
> relied on anyway, and allow Hebrew queries to be correctly parsed also when
> containing acronyms.
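
For illustration only, here is a minimal Java sketch of the whitespace rule proposed in the description. It is not the QueryParser's JavaCC grammar and uses no Lucene API; the class QuoteRuleSketch and the method phraseQuotePositions are hypothetical names made up for this example. It treats a double quote as a phrase opener only at the start of the input or after whitespace, and as a phrase closer only at the end of the input or before whitespace. Under that assumption it also shows the problem raised in the comment: in text written without whitespace between words, a quoted phrase is no longer recognized.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed rule; not part of Lucene.
class QuoteRuleSketch {

    /** Returns the indexes of the double quotes that would delimit a phrase. */
    static List<Integer> phraseQuotePositions(String query) {
        List<Integer> positions = new ArrayList<>();
        boolean inPhrase = false;
        for (int i = 0; i < query.length(); i++) {
            if (query.charAt(i) != '"') {
                continue;
            }
            if (!inPhrase) {
                // Opening quote: start of input, or preceded by whitespace.
                if (i == 0 || Character.isWhitespace(query.charAt(i - 1))) {
                    positions.add(i);
                    inPhrase = true;
                }
            } else {
                // Closing quote: end of input, or followed by whitespace.
                if (i == query.length() - 1
                        || Character.isWhitespace(query.charAt(i + 1))) {
                    positions.add(i);
                    inPhrase = false;
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Mid-word quote in an acronym: no phrase delimiters found.
        System.out.println(phraseQuotePositions("MNK\"L"));
        // The example from the issue: the mid-word quote no longer opens a phrase.
        System.out.println(phraseQuotePositions(" Foo\"bar test\" "));
        // A plain phrase is still recognized.
        System.out.println(phraseQuotePositions("\"bar test\""));
        // Quoted text with no surrounding whitespace (CJK characters, written
        // as Unicode escapes): the rule finds no phrase delimiters at all.
        System.out.println(phraseQuotePositions("\u4e2d\u6587\"\u77ed\u8bed\""));
    }
}
```

The last call is the case the comment warns about: the quotes there are clearly meant to mark a phrase, but a whitespace-based rule skips them.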