[ https://issues.apache.org/jira/browse/SPARK-22183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402042#comment-16402042 ]
Michal Szafranski commented on SPARK-22183:
-------------------------------------------

As for the reporter's use case, using the explicit {{contains()}} function would not just work around this issue; I would also expect it to be significantly faster. I don't think it is mapped in SQL, though:
{code:java}
sqlContext.sql("""SELECT * FROM test t""").filter($"_1".contains($"_2")).show()
{code}
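If a pure-SQL spelling is needed, a close alternative is the built-in {{instr}} function, which does a plain substring search and therefore needs no escaping at all. A minimal sketch, assuming the same {{test}} table as in the report below:
{code:java}
// instr(str, substr) returns the 1-based position of the first occurrence of
// substr in str, or 0 if there is none -- no LIKE metacharacters involved.
sqlContext.sql("SELECT * FROM test t WHERE instr(t._1, t._2) > 0").show()
{code}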
> Inconsistency in LIKE escaping between literal values and column-based ones
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-22183
>                 URL: https://issues.apache.org/jira/browse/SPARK-22183
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Adrien Lavoillotte
>            Priority: Minor
>
> I'm trying to implement auto-escaping for {{LIKE}} expressions, in order to
> have filters & join conditions like:
> * Column A's value contains column B's
> * Column A's value contains some literal string
> So I need to escape the {{LIKE}}-significant characters {{%}} and {{_}}. Since
> Spark SQL does not support {{LIKE expr ESCAPE char}}, I need to escape using
> \, and presumably also \ itself (twice in the case of literals, since '\\'
> represents a single \).
> But it seems that a literal in a {{LIKE}} expression does not get quite the
> same escaping as other literal strings or as non-literal {{LIKE}} expressions,
> seemingly depending on whether the left-hand side and/or right-hand side are
> literals or columns.
> Note: I'm using triple quotes below to avoid Scala-level \ escaping. And in
> the body of this description, I'm purposely using zero-width spaces to avoid
> Jira transforming my \.
> On Spark 2.2.0:
> {code}
> // both LHS & RHS literals
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+---------+
> |  \|\ LIKE \\|
> +---+---------+
> |  \|     true|
> +---+---------+
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> org.apache.spark.sql.AnalysisException: the pattern '\' is invalid, it is not
> allowed to end with the escape character;
>   at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:53)
>   at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.compile(regexpExpressions.scala:50)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.pattern(regexpExpressions.scala:53)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.nullSafeEval(regexpExpressions.scala:56)
>   at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
>   ...
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+-------------+
> |a\b|a\b LIKE a\\b|
> +---+-------------+
> |a\b|         true|
> +---+-------------+
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> org.apache.spark.sql.AnalysisException: the pattern 'a\b' is invalid, the
> escape character is not allowed to precede 'b';
>   at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
>   ...
> // test data
> spark.sql("""SELECT * FROM test""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> |  Ok|  ok|
> | a_b| a_b|
> | aab| a_b|
> | c%d| c%d|
> |caad| c%d|
> |e\nf|e\nf|
> | e
> f|e\nf|
> +----+----+
> // both column-based
> // not escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> org.apache.spark.sql.AnalysisException: the pattern 'e\nf' is invalid, the
> escape character is not allowed to precede 'n';
>   at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
>   at org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
>   ...
> // escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 2.2:
> * Escaping \ in the RHS is mandatory if the RHS is not a literal, otherwise we
> get an error. So far so good.
> * If the LHS is also a literal, the same applies.
> * If the RHS is a literal and the LHS is a column, it seems that the
> string-literal escaping of \ interferes with the {{LIKE}} escaping, since
> re-escaping \ no longer matches. I would expect to need \\\\ between quotes to
> match a single \ (one escaping for the string literal, one for the {{LIKE}}
> escaping).
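The 2.2 errors quoted above come from Spark's LIKE-to-regex translation, which accepts only {{%}}, {{_}} or {{\}} after the escape character and rejects a trailing escape. A simplified model of that check, reconstructed from the error messages above (a sketch only, not the actual {{StringUtils.escapeLikeRegex}} implementation):
{code:java}
import java.util.regex.Pattern

// Simplified model of the 2.2 LIKE-to-regex translation, reconstructed from
// the error messages quoted above; not Spark's actual code.
def likeToRegex(pattern: String): String = {
  val out = new StringBuilder
  val chars = pattern.iterator
  while (chars.hasNext) chars.next() match {
    case '\\' =>
      if (!chars.hasNext)
        sys.error(s"the pattern '$pattern' is invalid, it is not allowed to end with the escape character")
      chars.next() match {
        case c @ ('%' | '_' | '\\') => out.append(Pattern.quote(c.toString)) // escaped metacharacter matches literally
        case c => sys.error(s"the pattern '$pattern' is invalid, the escape character is not allowed to precede '$c'")
      }
    case '%' => out.append(".*") // % matches any sequence of characters
    case '_' => out.append(".")  // _ matches any single character
    case c   => out.append(Pattern.quote(c.toString))
  }
  out.result()
}

likeToRegex("a\\\\b") // ok: the 4-char pattern a\\b yields a regex matching the 3-char string a\b
// likeToRegex("e\\nf") throws: the escape character is not allowed to precede 'n'
{code}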
> On Spark 1.6 (and, it seems, on all versions up to Spark 2.2.0):
> {code}
> // both LHS & RHS literals
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+-----+
> |_c0|  _c1|
> +---+-----+
> |  \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> +---+-----+
> |_c0|  _c1|
> +---+-----+
> |  \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> // Same test data as for 2.2
> // both column-based
> // not escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 1.6:
> * A single-character literal string consisting of \ cannot be matched at all.
> * Matching a string (literal or column) containing \ works whether you escape
> the \ or not, which makes me think there is room for ambiguity (would \\n
> match \n?).
> * If the RHS is a literal and the LHS is a column, the same issue as in Spark 2.2.
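For completeness, a minimal Scala sketch of the auto-escaping the reporter describes ({{escapeLikePattern}} is a hypothetical helper, not a Spark API). It mirrors the REGEXP_REPLACE expression used above; given the inconsistency reported here, it would behave as intended for column-based patterns but not for a literal RHS matched against a column LHS:
{code:java}
// Hypothetical helper: prefix the LIKE metacharacters % and _, and the escape
// character \ itself, with a backslash -- the Scala-side equivalent of
// REGEXP_REPLACE(col, '([%_\\])', '\\$1').
def escapeLikePattern(s: String): String =
  s.replaceAll("([%_\\\\])", "\\\\$1")

escapeLikePattern("a_b")   // yields the 4-character string a\_b
escapeLikePattern("e\\nf") // the 4-character string e\nf becomes the 5-character string e\\nf
{code}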