[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022430#comment-17022430 ]

Jeff Evans edited comment on SPARK-19248 at 1/23/20 7:53 PM:
-------------------------------------------------------------

After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
If you start the PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
backslash before the dot character, so the pattern would need to be 
{{'( |\\\\.)*'}}.
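
A minimal PySpark sketch of both workarounds (this assumes a Spark 2.x shell with a 
{{spark}} session already available; the data and expected outputs are taken from the 
reproduction below):

{noformat}
# Data from the issue description (hypothetical session, for illustration only)
df = spark.createDataFrame([('..   5.    ',)], ['col'])

# Option 1: restore the pre-2.0 handling of string literals via the config setting
spark.conf.set("spark.sql.parser.escapedStringLiterals", True)
df.selectExpr("regexp_replace(col, '( |\.)*', '') AS col").collect()
# expected: [Row(col=u'5')]

# Option 2: keep the 2.0 default and double-escape the backslash in the SQL literal,
# so the regex engine still receives \. (a literal dot)
spark.conf.set("spark.sql.parser.escapedStringLiterals", False)
df.selectExpr("regexp_replace(col, '( |\\\\.)*', '') AS col").collect()
# expected: [Row(col=u'5')]
{noformat}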


> Regex_replace works in 1.6 but not in 2.0
> -----------------------------------------
>
>                 Key: SPARK-19248
>                 URL: https://issues.apache.org/jira/browse/SPARK-19248
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2, 2.4.3
>            Reporter: Lucas Tittmann
>            Priority: Major
>              Labels: correctness
>
> We found an error in Spark 2.0.2's regex execution. Using PySpark in 1.6.2, 
> we get the following expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.    ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.    ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex behaves differently depending on the 
> Spark version. We checked both regexes in Java, and both are correct and 
> work as expected. Therefore, regex execution in 2.0.2 seems to be erroneous. 
> I am not able to confirm this on 2.1 at the moment.
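
To make the cause concrete: with the default 2.0 handling of string literals, the SQL 
parser drops the backslash in {{\.}} before the pattern ever reaches the regex engine, 
so the second expression effectively runs {{regexp_replace(col, '( |.)*', '')}}, which 
matches the whole string. Both patterns are valid regexes, as the reporter notes; a 
standalone Python check (illustration only, not part of the original report) shows the 
difference:

{noformat}
import re

s = '..   5.    '

# Pattern as intended: the backslash survives, so the dot is literal
# -> only spaces and dots are removed
print(re.sub(r'( |\.)*', '', s))   # '5'

# Pattern after the backslash is dropped: '.' matches any character
# -> everything is removed
print(re.sub(r'( |.)*', '', s))    # ''
{noformat}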


