spark git commit: [SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns.

2017-04-17 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 7aad057b0 -> db9517c16


[SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns.

This patch fixes a bug in the way LIKE patterns are translated to Java regexes. 
The bug causes any character following an escaped backslash to be escaped as 
well, i.e. there is double-escaping.
A concrete example is the pattern `'%\\%'`. The Java regex that this pattern 
should correspond to (according to the behavior described below) is 
`'.*\\.*'`; the current implementation instead produces `'.*\\%'`.
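
To illustrate the intended translation, here is a hypothetical standalone sketch (not Spark's actual `StringUtils.escapeLikeRegex`): the escape character and the character it escapes are consumed together in one step, so an escaped backslash can never act as an escape for the character that follows it.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a LIKE-to-Java-regex translator, not the actual
// Spark implementation. The key point: when a backslash escape is seen,
// the escaped character is consumed in the same step, so it cannot be
// treated as an escape for the character after it (no double-escaping).
public class LikePatterns {
  public static String likeToJavaRegex(String pattern) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < pattern.length(); i++) {
      char c = pattern.charAt(i);
      if (c == '\\' && i + 1 < pattern.length()) {
        // Emit the escaped character as a literal and skip past it.
        out.append(Pattern.quote(String.valueOf(pattern.charAt(++i))));
      } else if (c == '%') {
        out.append(".*");   // % matches any sequence of characters
      } else if (c == '_') {
        out.append(".");    // _ matches any single character
      } else {
        out.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // The LIKE pattern %\\% (percent, escaped backslash, percent):
    // "a\b" contains a literal backslash, so it should match.
    System.out.println("a\\b".matches(likeToJavaRegex("%\\\\%")));
  }
}
```

(`Pattern.quote` emits `\Q…\E` quoting rather than the literal `\\` shown above, but the matching behavior is equivalent.)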

---

Update: in light of the discussion that ensued, we should explicitly define the 
expected behaviour of LIKE expressions, especially in certain edge cases. With 
the help of gatorsmile, we put together a list of different RDBMS and their 
variations with respect to certain standard features.

| RDBMS\Features | Wildcards | Default escape [1] | Case sensitivity |
| --- | --- | --- | --- |
| [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) | _, %, [], [^] | none | no |
| [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) | _, % | none | yes |
| [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) | _, % | none | yes |
| [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) | _, % | none | no |
| [PostgreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) | _, % | \ | yes |
| [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) | _, % | none | yes |
| Current Spark | _, % | \ | yes |

[1] Default escape character: most systems do not have a default escape 
character; instead, the user can specify one by calling a LIKE expression with 
an escape argument, `[A] LIKE [B] ESCAPE [C]`. This syntax is currently not 
supported by Spark; however, I would volunteer to implement this feature in a 
separate ticket.

The specifications are often quite terse and certain scenarios are 
undocumented, so here is a list of scenarios that I am uncertain about and 
would appreciate input on. Specifically, I am looking for feedback on whether 
or not Spark's current behavior should be changed.
1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`.
   PostgreSQL gives an error: 'LIKE pattern must not end with escape character', 
which I personally find logical. Currently, Spark allows "non-terminated" 
escapes and simply treats them as part of the pattern.
   According to [DB2's 
documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html),
 ending a pattern in an escape character is invalid.
   _Proposed new behaviour in Spark: throw AnalysisException_
2. [x] Empty input, e.g. `'' like ''`.
   Postgres and DB2 match empty input only if the pattern is empty as well; 
any other combination involving empty input does not match. Spark currently 
follows this rule.
3. [x] Escape before a non-special character, e.g. `'a' like '\a'`.
   Escaping a non-wildcard character is not really documented, but PostgreSQL 
just treats it verbatim, which I also find the least surprising behavior. Spark 
does the same.
   According to [DB2's 
documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html),
 it is invalid to follow an escape character with anything other than an escape 
character, an underscore or a percent sign.
   _Proposed new behaviour in Spark: throw AnalysisException_

The current specification is also described in the operator's source code in 
this patch.
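
The two AnalysisException proposals above amount to a simple validation pass over the pattern. A hypothetical standalone sketch follows (`IllegalArgumentException` stands in for Spark's `AnalysisException`, and the escape character is taken as a parameter in anticipation of an ESCAPE clause; none of these names come from the actual patch):

```java
// Hypothetical validator for the proposed rules, not Spark's actual code:
// 1. a pattern must not end with the escape character;
// 2. the escape character may only precede another escape character,
//    an underscore or a percent sign.
public class LikePatternValidator {
  public static void validate(String pattern, char escape) {
    for (int i = 0; i < pattern.length(); i++) {
      if (pattern.charAt(i) == escape) {
        if (i + 1 == pattern.length()) {
          throw new IllegalArgumentException(
              "the pattern '" + pattern + "' must not end with the escape character");
        }
        char next = pattern.charAt(i + 1);
        if (next != escape && next != '_' && next != '%') {
          throw new IllegalArgumentException(
              "in pattern '" + pattern + "', the escape character may only precede '"
              + escape + "', '_' or '%'");
        }
        i++; // skip the (valid) escaped character
      }
    }
  }

  public static void main(String[] args) {
    validate("100\\%", '\\'); // valid: escaped percent sign
    try {
      validate("a\\", '\\');  // invalid: trailing escape
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```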

## How was this patch tested?

Extra case in regex unit tests.

Author: Jakob Odersky 

This patch had conflicts when merged, resolved by
Committer: Reynold Xin 

Closes #15398 from jodersky/SPARK-17647.

(cherry picked from commit e5fee3e4f853f906f0b476bb04ee35a15f1ae650)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db9517c1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db9517c1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db9517c1

Branch: refs/heads/branch-2.1
Commit: db9517c1661935e88fe9c5d27874d718c928d5d6
Parents: 7aad057
Author: Jakob Odersky 
Authored: Mon Apr 17 11:17:57 2017 -0700
Committer: Reynold Xin 
Committed: Mon Apr 17 11:57:01 2017 -0700

--
 .../expressions/regexpExpressions.scala |  28 +++-
 .../spark/sql/catalyst/util/StringUtils.scala   |  50 +++---
 .../expressions/RegexpExpressionsSuite.scala| 161 +++
 .../sql/catalyst/util/StringUtilsSuite.scala|   4 +-
 4 files changed, 154 insertions(+), 89 deletions(-)
--



spark git commit: [SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns.

2017-04-17 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 01ff0350a -> e5fee3e4f


[SPARK-17647][SQL] Fix backslash escaping in 'LIKE' patterns.



Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e5fee3e4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e5fee3e4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e5fee3e4

Branch: refs/heads/master
Commit: e5fee3e4f853f906f0b476bb04ee35a15f1ae650
Parents: 01ff035
Author: Jakob Odersky 
Authored: Mon Apr 17 11:17:57 2017 -0700
Committer: Reynold Xin 
Committed: Mon Apr 17 11:17:57 2017 -0700

--
 .../expressions/regexpExpressions.scala |  25 ++-
 .../spark/sql/catalyst/util/StringUtils.scala   |  50 +++---
 .../expressions/RegexpExpressionsSuite.scala| 161 +++
 .../sql/catalyst/util/StringUtilsSuite.scala|   4 +-
 4 files changed, 153 insertions(+), 87 deletions(-)
--