[ https://issues.apache.org/jira/browse/LUCENE-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Busch updated LUCENE-1189:
----------------------------------

         Priority: Minor  (was: Major)
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

> QueryParser does not correctly handle escaped characters within quoted strings
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-1189
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1189
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 2.2, 2.3, 2.3.1
>         Environment: Windows Vista Business (x86 and x64) as well as latest
> Ubuntu server, both cases under Tomcat 6.0.14.
> This shouldn't matter though.
>            Reporter: Tomer Gabel
>            Assignee: Michael Busch
>            Priority: Minor
>         Attachments: lucene-1189.patch, QueryParser.jj.patch
>
>
> The Lucene query parser incorrectly handles escaped characters inside quoted
> strings; specifically, a quoted string that ends with an (escaped) backslash
> followed by any additional quoted string will not be properly tokenized.
> Consider the following example:
> bq. {{(name:"///mike\\\\\\") or (name:"alphonse")}}
> This is not a contrived example -- it derives from an actual bug we've
> encountered in our system. Running this query will throw an exception, but
> removing the second clause resolves the problem. After some digging I've
> found that the problem is with the way quoted strings are processed by the
> lexer: you'll notice that Mike's name is followed by three escaped
> backslashes right before the ending quote; looking at the JavaCC code for the
> query parser highlights the problem:
> {code:title=QueryParser.jj|borderStyle=solid}
> <DEFAULT> TOKEN : {
>   <AND:       ("AND" | "&&") >
> | <OR:        ("OR" | "||") >
> | <NOT:       ("NOT" | "!") >
> | <PLUS:      "+" >
> | <MINUS:     "-" >
> | <LPAREN:    "(" >
> | <RPAREN:    ")" >
> | <COLON:     ":" >
> | <STAR:      "*" >
> | <CARAT:     "^" > : Boost
> | <QUOTED:     "\"" (~["\""] | "\\\"")* "\"">
> ...
> {code}
> Take a look at the way the QUOTED token is constructed -- there is no lexical
> processing of the escaped characters within the quoted string itself. In the
> above query the lexer matches everything from the first quote through all the
> backslashes, _treating the end quote as an escaped character_, thus also
> matching the starting quote of the second term. This causes a lexer error,
> because the last quote is then considered the start of a new match.
> I've come to understand that the Lucene query handler is supposed to be able
> to handle unsanitized human input; indeed the lexer above would handle a
> query like {{"blah\"}} without complaining, but that's a "best-guess"
> approach that results in bugs with legal, automatically generated queries.
> I've attached a patch that fixes the erroneous behavior but does not maintain
> leniency with malformed queries; I believe this is the correct approach
> because the two design goals are fundamentally at odds. I'd appreciate any
> comments.
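
The tokenization problem described above can be reproduced outside the parser with a small, self-contained sketch. It uses `java.util.regex` rather than JavaCC, so it only approximates the lexer: JavaCC picks the longest match, and the `ORIGINAL` pattern below mimics that for this input by listing the escaped-quote alternative first. The class name `QuotedTokenDemo`, the helper names, and the `STRICT` pattern are all hypothetical; `STRICT` is one possible escape-aware rule for illustration and is not taken from the attached patches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Lucene.
final class QuotedTokenDemo {

    // Regex approximation of the QUOTED token quoted in the issue:
    //   "\"" (~["\""] | "\\\"")* "\""
    // The backslash-quote alternative comes first so that the backtracking
    // regex engine prefers it, imitating JavaCC's longest-match behavior here.
    static final Pattern ORIGINAL = Pattern.compile("\"(?:\\\\\"|[^\"])*\"");

    // Assumed escape-aware variant (illustration only): a backslash always
    // consumes the following character, so an escaped backslash can no longer
    // be mistaken for the start of an escaped quote.
    static final Pattern STRICT = Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"");

    // Returns the first quoted token the pattern finds, or null if none.
    static String firstQuoted(Pattern pattern, String query) {
        Matcher m = pattern.matcher(query);
        return m.find() ? m.group() : null;
    }

    // The query from the issue: three escaped backslashes (six characters)
    // directly before the closing quote of the first clause.
    static String exampleQuery() {
        String threeEscapedBackslashes = "\\\\" + "\\\\" + "\\\\";
        return "(name:\"///mike" + threeEscapedBackslashes + "\") or (name:\"alphonse\")";
    }

    public static void main(String[] args) {
        String query = exampleQuery();
        // Runs past the real closing quote and swallows the opening quote
        // of the second clause: "///mike\\\\\\") or (name:"
        System.out.println(firstQuoted(ORIGINAL, query));
        // Stops at the real closing quote: "///mike\\\\\\"
        System.out.println(firstQuoted(STRICT, query));
    }
}
```

With the original rule, what remains after the first token is `alphonse")`, which leaves an unmatched quote and corresponds to the lexer error reported in the issue. The stricter rule trades away leniency: an input such as `"blah\"` no longer forms a complete quoted token, which is the tension between the two design goals that the reporter points out.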