[ https://issues.apache.org/jira/browse/SPARK-22183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402042#comment-16402042 ]
Michal Szafranski commented on SPARK-22183:
-------------------------------------------

As for the reporter's use case, using the explicit {{contains()}} function would not just work around this issue; I would also expect it to be significantly faster. I don't think it is mapped in SQL, though:
{code:java}
sqlContext.sql("""SELECT * FROM test t""").filter($"_1".contains($"_2")).show()
{code}
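If a pure-SQL spelling is needed, a close alternative is the built-in {{instr}} function, which does a plain substring search and therefore needs no escaping at all. A minimal sketch, assuming the same {{test}} table as in the report below:
{code:java}
// instr(str, substr) returns the 1-based position of the first occurrence of
// substr in str, or 0 if there is none -- no LIKE metacharacters involved.
sqlContext.sql("SELECT * FROM test t WHERE instr(t._1, t._2) > 0").show()
{code}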
> Inconsistency in LIKE escaping between literal values and column-based ones
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-22183
>                 URL: https://issues.apache.org/jira/browse/SPARK-22183
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Adrien Lavoillotte
>            Priority: Minor
>
> I'm trying to implement auto-escaping for {{LIKE}} expressions, in order to
> have filters & join conditions like:
> * Column A's value contains column B's
> * Column A's value contains some literal string
> So I need to escape the {{LIKE}}-significant characters {{%}} and {{_}}. Since
> Spark SQL does not support {{LIKE expr ESCAPE char}}, I need to escape using
> \, and presumably also \ itself (twice in the case of literals, since '\\'
> represents a single \).
> But it seems that a literal in a {{LIKE}} expression does not get quite the
> same escaping as other literal strings or as non-literal {{LIKE}} expressions,
> seemingly depending on whether the left-hand side and/or right-hand side are
> literals or columns.
> Note: I'm using triple quotes below to avoid Scala-level \ escaping. And in
> the body of this description, I'm purposely using zero-width spaces to avoid
> Jira transforming my \.
> On Spark 2.2.0:
> {code}
> // both LHS & RHS literals
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+---------+
> |  \|\ LIKE \\|
> +---+---------+
> |  \|     true|
> +---+---------+
> scala> spark.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> org.apache.spark.sql.AnalysisException: the pattern '\' is invalid, it is not
> allowed to end with the escape character;
>   at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:53)
>   at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.compile(regexpExpressions.scala:50)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.pattern(regexpExpressions.scala:53)
>   at org.apache.spark.sql.catalyst.expressions.StringRegexExpression.nullSafeEval(regexpExpressions.scala:56)
>   at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419)
>   ...
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+-------------+
> |a\b|a\b LIKE a\\b|
> +---+-------------+
> |a\b|         true|
> +---+-------------+
> scala> spark.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> org.apache.spark.sql.AnalysisException: the pattern 'a\b' is invalid, the
> escape character is not allowed to precede 'b';
>   at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
>   at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:105)
>   ...
> // test data
> spark.sql("""SELECT * FROM test""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> |  Ok|  ok|
> | a_b| a_b|
> | aab| a_b|
> | c%d| c%d|
> |caad| c%d|
> |e\nf|e\nf|
> | e
> f|e\nf|
> +----+----+
> // both column-based
> // not escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> org.apache.spark.sql.AnalysisException: the pattern 'e\nf' is invalid, the
> escape character is not allowed to precede 'n';
>   at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:42)
>   at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:51)
>   at org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(StringUtils.scala)
>   ...
> // escaping \
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> spark.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 2.2:
> * Escaping \ in the RHS is mandatory if the RHS is not a literal, otherwise we
> get an error. So far so good.
> * If the LHS is also a literal, the same applies.
> * If the RHS is a literal and the LHS is a column, it seems that the
> string-literal escaping of \ interferes with the {{LIKE}} escaping, since
> re-escaping \ no longer matches. I would expect to need \\\\ between quotes to
> match a single \ (one escaping for the string literal, one for the {{LIKE}}
> escaping).
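The 2.2 errors quoted above come from Spark's LIKE-to-regex translation, which accepts only {{%}}, {{_}} or {{\}} after the escape character and rejects a trailing escape. A simplified model of that check, reconstructed from the error messages above (a sketch only, not the actual {{StringUtils.escapeLikeRegex}} implementation):
{code:java}
import java.util.regex.Pattern

// Simplified model of the 2.2 LIKE-to-regex translation, reconstructed from
// the error messages quoted above; not Spark's actual code.
def likeToRegex(pattern: String): String = {
  val out = new StringBuilder
  val chars = pattern.iterator
  while (chars.hasNext) chars.next() match {
    case '\\' =>
      if (!chars.hasNext)
        sys.error(s"the pattern '$pattern' is invalid, it is not allowed to end with the escape character")
      chars.next() match {
        case c @ ('%' | '_' | '\\') => out.append(Pattern.quote(c.toString)) // escaped metacharacter matches literally
        case c => sys.error(s"the pattern '$pattern' is invalid, the escape character is not allowed to precede '$c'")
      }
    case '%' => out.append(".*") // % matches any sequence of characters
    case '_' => out.append(".")  // _ matches any single character
    case c   => out.append(Pattern.quote(c.toString))
  }
  out.result()
}

likeToRegex("a\\\\b") // ok: the 4-char pattern a\\b yields a regex matching the 3-char string a\b
// likeToRegex("e\\nf") throws: the escape character is not allowed to precede 'n'
{code}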
> On Spark 1.6 (and, it seems, on all versions up to Spark 2.2.0):
> {code}
> // both LHS & RHS literals
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\\\'""").show()
> +---+-----+
> |_c0|  _c1|
> +---+-----+
> |  \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT '\\', '\\' LIKE '\\'""").show()
> +---+-----+
> |_c0|  _c1|
> +---+-----+
> |  \|false|
> +---+-----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> scala> sqlContext.sql("""SELECT 'a\\b', 'a\\b' LIKE 'a\\\\b'""").show()
> +---+----+
> |_c0| _c1|
> +---+----+
> |a\b|true|
> +---+----+
> // Same test data as for 2.2
> // both column-based
> // not escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // escaping \
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE REGEXP_REPLACE(`_2`, '([%_\\\\])', '\\\\$1')""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |  ok|  ok|
> | a_b| a_b|
> | c%d| c%d|
> |e\nf|e\nf|
> +----+----+
> // LHS column-based, RHS literal
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\nf'""").show()
> +----+----+
> |  _1|  _2|
> +----+----+
> |e\nf|e\nf|
> +----+----+
> scala> sqlContext.sql("""SELECT * FROM test t WHERE `_1` LIKE 'e\\\\nf'""").show()
> +---+---+
> | _1| _2|
> +---+---+
> +---+---+
> {code}
> So in Spark 1.6:
> * A single-character literal string consisting of \ cannot be matched at all.
> * Matching a string (literal or column) containing \ works whether you escape
> the \ or not, which makes me think there is room for ambiguity (would \\n
> match \n?).
> * If the RHS is a literal and the LHS is a column, the same issue as in Spark 2.2.
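For completeness, a minimal Scala sketch of the auto-escaping the reporter describes ({{escapeLikePattern}} is a hypothetical helper, not a Spark API). It mirrors the REGEXP_REPLACE expression used above; given the inconsistency reported here, it would behave as intended for column-based patterns but not for a literal RHS matched against a column LHS:
{code:java}
// Hypothetical helper: prefix the LIKE metacharacters % and _, and the escape
// character \ itself, with a backslash -- the Scala-side equivalent of
// REGEXP_REPLACE(col, '([%_\\])', '\\$1').
def escapeLikePattern(s: String): String =
  s.replaceAll("([%_\\\\])", "\\\\$1")

escapeLikePattern("a_b")   // yields the 4-character string a\_b
escapeLikePattern("e\\nf") // the 4-character string e\nf becomes the 5-character string e\\nf
{code}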