[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Itamar Syn-Hershko (JIRA) Sun, 16 May 2010 05:17:08 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867982#action_12867982
 ]


Itamar Syn-Hershko commented on LUCENE-2465:
--------------------------------------------

Using QueryParser.escape() is not an option, since it escapes every occurrence 
of a QP syntax char, which in practice prevents the QP from ever returning a 
PhraseQuery for a user query.

Your other suggestion of using the "correct" Unicode char GERSHAYIM is not 
doable, because we are talking about user-typed queries here, and no user has 
such a character on their keyboard. In 99.9% of Hebrew text files, old and new, 
double-quotes are used as GERSHAYIM. The only exceptions are when a program 
has automatically converted the mid-word instance of double-quotes into 
U+05F4. This is pretty much like asking the Lucene community to type U+201C and 
U+201D (left / right double quotation marks) around phrases or they won't be 
recognized as such. Because no one has those characters easily accessible from 
their keyboard (to the best of my knowledge), and it doesn't really matter 
anyway what you type, this thought never crossed anyone's mind. Exactly the 
same goes for Hebrew.

The only doable workaround is to go through the query string before sending it 
to the QP, and resolve this by either escaping mid-word double-quotes or 
replacing them with U+05F4. Since most Hebrew dictionaries work with 
double-quotes for acronyms anyway, escaping them seems much better, but then I 
ask again: why bother with an extra pass over the query string if a simple 
change to the QP can resolve that? The current behavior is flawed for 
non-Hebrew scripts anyway, and there's no reason to require such a pass for 
Hebrew consumers only (imagine what it'd be like to write a multi-lingual 
search interface with this issue in mind).
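
For illustration only, the pre-pass workaround could look roughly like the 
sketch below. It is not part of Lucene: the class and method names are 
hypothetical, and it assumes that "mid-word" means a double-quote with a letter 
or digit directly on both sides. It inserts a backslash, the QP escape 
character, before each such double-quote and leaves everything else untouched.

```java
// Hypothetical helper, not a Lucene class: a pre-pass over the raw user query.
final class MidWordQuoteEscaper {
    private MidWordQuoteEscaper() {}

    /**
     * Returns the query with a backslash inserted before every double-quote
     * that has a letter or digit immediately before AND after it.
     * Assumption: that is what "mid-word" means here. Works on UTF-16 chars,
     * which is enough for Hebrew (all letters are in the BMP).
     */
    static String escapeMidWordQuotes(String query) {
        StringBuilder out = new StringBuilder(query.length() + 4);
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            boolean midWord = c == '"'
                    && i > 0
                    && i + 1 < query.length()
                    && Character.isLetterOrDigit(query.charAt(i - 1))
                    && Character.isLetterOrDigit(query.charAt(i + 1));
            if (midWord) {
                out.append('\\'); // escape so the QP treats the quote as a literal
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

The result would then be handed to QueryParser.parse() in place of the raw 
string. Note that under this assumption a query such as Foo"bar test" gets its 
first double-quote escaped and is left with an unbalanced closing one, so the 
sketch only covers the acronym case.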

As a reference, see how Google and Wikipedia treat Hebrew acronyms:
http://www.google.com/#hl=en&source=hp&q=%D7%9E%D7%A0%D7%9B%22%D7%9C&aq=f&aqi=&aql=&oq=&gs_rfai=&fp=d059ab474882bfe2
http://he.wikipedia.org/wiki/%D7%9E%D7%A0%D7%9B%22%D7%9C

Google recognizes both double-quotes and GERSHAYIM as correct forms of Hebrew 
acronyms, while Wikipedia uses only the former in all acronyms.

Robert, I hear what you are saying, but this just ain't right when it comes to 
usability, when the resolution is so simple and doesn't break anything.

> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
>                 Key: LUCENE-2465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2465
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 
> 4.0
>            Reporter: Itamar Syn-Hershko
>
> Current implementation of Lucene's QueryParser identifies a phrase in the 
> query when hitting a double-quotes char, even if it is mid-word. For example, 
> the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term 
> and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase 
> is a group of words surrounded by double quotes as defined by 
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but no-where does 
> it say double-quotes will also tokenize the input. Arguably, a phrase should 
> only be identified as such when it is also surrounded by whitespaces.
> Other than a logically incorrect behavior, this makes parsing of Hebrew 
> acronyms impossible. Hebrew acronyms contain one double-quotes char in the 
> middle of a word (for example, MNK"L), hence causing the QP to throw a syntax 
> exception, since it is expecting another double-quotes to create a phrase 
> query, essentially splitting the acronym into two.
> The solution to this is pretty simple - changing the JavaCC syntax to check 
> if a whitespace precedes the double-quote when a phrase opening is expected, 
> or peek to see if a whitespace follows the double-quotes if a phrase closing 
> is expected.
> This will both eliminate a logically incorrect behavior which shouldn't be 
> relied on anyway, and allow Hebrew queries to be correctly parsed also when 
> containing acronyms.

