[ https://issues.apache.org/jira/browse/LUCENE-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Busch resolved LUCENE-1189.
-----------------------------------
Resolution: Fixed
Fix Version/s: 2.4
Lucene Fields: [New, Patch Available] (was: [Patch Available, New])
Committed. Thanks, Tomer!
> QueryParser does not correctly handle escaped characters within quoted strings
> ------------------------------------------------------------------------------
>
> Key: LUCENE-1189
> URL: https://issues.apache.org/jira/browse/LUCENE-1189
> Project: Lucene - Java
> Issue Type: Bug
> Components: QueryParser
> Affects Versions: 2.2, 2.3, 2.3.1
> Environment: Windows Vista Business (x86 and x64) as well as latest
> Ubuntu server, both cases under Tomcat 6.0.14.
> This shouldn't matter though.
> Reporter: Tomer Gabel
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 2.4
>
> Attachments: lucene-1189.patch, QueryParser.jj.patch
>
>
> The Lucene query parser incorrectly handles escaped characters inside quoted
> strings; specifically, a quoted string that ends with an (escaped) backslash
> followed by any additional quoted string will not be properly tokenized.
> Consider the following example:
> bq. {{(name:"///mike\\\\\\") or (name:"alphonse")}}
> This is not a contrived example -- it derives from an actual bug we've
> encountered in our system. Running this query will throw an exception, but
> removing the second clause resolves the problem. After some digging I've
> found that the problem is with the way quoted strings are processed by the
> lexer: you'll notice that Mike's name is followed by three escaped
> backslashes right before the ending quote; looking at the JavaCC code for the
> query parser highlights the problem:
> {code:title=QueryParser.jj|borderStyle=solid}
> <DEFAULT> TOKEN : {
> <AND: ("AND" | "&&") >
> | <OR: ("OR" | "||") >
> | <NOT: ("NOT" | "!") >
> | <PLUS: "+" >
> | <MINUS: "-" >
> | <LPAREN: "(" >
> | <RPAREN: ")" >
> | <COLON: ":" >
> | <STAR: "*" >
> | <CARAT: "^" > : Boost
> | <QUOTED: "\"" (~["\""] | "\\\"")* "\"">
> ...
> {code}
> Take a look at the way the QUOTED token is constructed -- there is no lexical
> processing of the escaped characters within the quoted string itself. In the
> above query the lexer matches everything from the first quote through all the
> backslashes, _treating the end quote as an escaped character_, and so keeps
> matching up to the opening quote of the second term. The closing quote of the
> second term is then taken as the start of a new quoted string that is never
> terminated, which causes the lexer error.
> I've come to understand that the Lucene query handler is supposed to be able
> to handle unsanitized human input; indeed the lexer above would handle a
> query like {{"blah\"}} without complaining, but that's a "best-guess"
> approach that results in bugs with legal, automatically generated queries.
> I've attached a patch that fixes the erroneous behavior but does not maintain
> leniency with malformed queries; I believe this is the correct approach
> because the two design goals are fundamentally at odds. I'd appreciate any
> comments.
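
For readers following along, the tokenization problem described in the quoted report can be reproduced outside JavaCC with a rough sketch. The two regular expressions below are only approximations of lexer rules: OLD_QUOTED mirrors the QUOTED rule quoted above, while STRICT_QUOTED is an assumed stricter rule in which a backslash always consumes the next character. It is not taken from the attached patch. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Lucene.
class QuotedTokenDemo {
    // Approximation of the original rule: "\"" (~["\""] | "\\\"")* "\""
    // Body is either backslash+quote or any character that is not a quote.
    static final Pattern OLD_QUOTED = Pattern.compile("\"(?:\\\\\"|[^\"])*\"");

    // Assumed stricter rule (illustration only): a backslash always
    // escapes the character that follows it, whatever that character is.
    static final Pattern STRICT_QUOTED = Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"");

    // Returns the first quoted token found in the query, or null if none matches.
    static String firstQuoted(Pattern rule, String query) {
        Matcher m = rule.matcher(query);
        return m.find() ? m.group() : null;
    }

    // Builds the query from the report: "///mike" followed by three
    // escaped backslashes (six backslash characters), then a second clause.
    static String exampleQuery() {
        String escapedBackslash = "\\\\"; // two characters: a backslash pair
        return "(name:\"///mike" + escapedBackslash + escapedBackslash
                + escapedBackslash + "\") or (name:\"alphonse\")";
    }

    public static void main(String[] args) {
        String query = exampleQuery();
        // The old rule treats the last backslash plus the closing quote as an
        // escaped quote, so the token runs on into the second clause.
        System.out.println("old:    " + firstQuoted(OLD_QUOTED, query));
        // The strict rule pairs the backslashes up and stops at the real closing quote.
        System.out.println("strict: " + firstQuoted(STRICT_QUOTED, query));
        // The malformed input "blah\" is accepted by the old rule only.
        String malformed = "\"blah\\\"";
        System.out.println("old on malformed:    " + firstQuoted(OLD_QUOTED, malformed));
        System.out.println("strict on malformed: " + firstQuoted(STRICT_QUOTED, malformed));
    }
}
```

The last two lines of main show the trade-off mentioned in the report: a rule that pairs every backslash with the next character no longer accepts an unterminated input such as "blah\" on a best-guess basis.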
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.