[ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867982#action_12867982 ]
Itamar Syn-Hershko commented on LUCENE-2465:
--------------------------------------------

Using QueryParser.escape() is not an option, since it would practically prevent the QP from ever returning a PhraseQuery for user queries (it simply escapes every occurrence of a QP syntax char).

Your other suggestion, using the "correct" Unicode char GERSHAYIM, is not doable, because we are talking about user-typed queries here, and no user has such a character on his keyboard. In 99.9% of Hebrew text files, old and new, double-quotes are used as GERSHAYIM. The only exceptions are when an automated program has converted the mid-word instance of double-quotes into U+05F4.

This is pretty much like asking the Lucene community to type U+201C and U+201D (left / right double quotation marks) around phrases or they won't be recognized as such. Because no one has those characters easily accessible from their k/b (to the best of my knowledge), and it doesn't really matter anyway what you type, this thought never crossed anyone's mind. Exactly the same goes for Hebrew.

The only doable workaround is to go through the query string before sending it to the QP, and resolve this by either escaping mid-word double-quotes or replacing them with U+05F4. Since most Hebrew dictionaries work with double-quotes for acronyms anyway, escaping seems much better, but then I ask again - why bother with a double pass on the query string if a simple change to the QP can resolve that? The effect the behavior has on non-Hebrew scripts is flawed anyway, and there's no reason to require such a pass for Hebrew consumers only (imagine what it'd be like to write a multi-lingual search interface with this issue in mind).
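For illustration, the pre-pass workaround described above could look roughly like the following. This is only a sketch under stated assumptions: the class name MidWordQuotes and the method escape are hypothetical (they are not part of Lucene), and "mid-word" is taken to mean a double-quote with a letter or digit immediately on both sides, which is one possible reading of the rule and leaves forms such as field:"a b" and "a b"~2 untouched.

```java
// Hypothetical helper, not part of Lucene: a pre-pass over the raw query string.
final class MidWordQuotes {
    private MidWordQuotes() {}

    /**
     * Returns the query with a backslash inserted before every double-quote
     * that has a letter or digit directly before and directly after it.
     * Assumption: this char-based check is enough for BMP scripts such as
     * Hebrew; supplementary characters are not treated as letters here.
     */
    static String escape(String query) {
        StringBuilder out = new StringBuilder(query.length() + 4);
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            boolean midWord = c == '"'
                    && i > 0
                    && i < query.length() - 1
                    && Character.isLetterOrDigit(query.charAt(i - 1))
                    && Character.isLetterOrDigit(query.charAt(i + 1));
            if (midWord) {
                out.append('\\'); // backslash is the query syntax escape character
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

With this sketch, MNK"L becomes MNK\"L, while "foo bar" and title:"foo bar"~2 pass through unchanged, so the result can still be handed to QueryParser.parse(String) and produce a PhraseQuery where the user typed one. A quote that is already preceded by a backslash is not escaped again, because the backslash is not a letter or digit.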
As a reference, see how Google and Wikipedia treat Hebrew acronyms:

http://www.google.com/#hl=en&source=hp&q=%D7%9E%D7%A0%D7%9B%22%D7%9C&aq=f&aqi=&aql=&oq=&gs_rfai=&fp=d059ab474882bfe2
http://he.wikipedia.org/wiki/%D7%9E%D7%A0%D7%9B%22%D7%9C

Google recognizes both double-quotes and GERSHAYIM as correct forms of Hebrew acronyms, while Wikipedia only uses the former in all acronyms.

Robert, I hear what you are saying, but this just ain't right when it comes to usability, when the resolution is so simple and doesn't break anything.

> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
>                 Key: LUCENE-2465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2465
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 4.0
>            Reporter: Itamar Syn-Hershko
>
> Current implementation of Lucene's QueryParser identifies a phrase in the
> query when hitting a double-quotes char, even if it is mid-word. For example,
> the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term
> and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase
> is a group of words surrounded by double quotes as defined by
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does
> it say double-quotes will also tokenize the input. Arguably, a phrase should
> only be identified as such when it is also surrounded by whitespace.
>
> Other than a logically incorrect behavior, this makes parsing of Hebrew
> acronyms impossible. Hebrew acronyms contain one double-quotes char in the
> middle of a word (for example, MNK"L), hence causing the QP to throw a syntax
> exception, since it is expecting another double-quotes to create a phrase
> query, essentially splitting the acronym into two.
>
> The solution to this is pretty simple - changing the JavaCC syntax to check
> if a whitespace precedes the double-quote when a phrase opening is expected,
> or peek to see if a whitespace follows the double-quotes if a phrase closing
> is expected.
>
> This will both eliminate a logically incorrect behavior which shouldn't be
> relied on anyway, and allow Hebrew queries to be correctly parsed also when
> containing acronyms.