[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022430#comment-17022430 ]
Jeff Evans edited comment on SPARK-19248 at 1/23/20 7:53 PM:
-------------------------------------------------------------

After some debugging, I figured out what's going on here. The crux of this is the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under SPARK-20399. This behavior changed in 2.0 (see [here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]). If you start the PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior. Otherwise, you need to escape the literal backslash before the dot character, so the pattern would need to be {{'( |\\\\.)*'}}.


> Regex_replace works in 1.6 but not in 2.0
> -----------------------------------------
>
>                 Key: SPARK-19248
>                 URL: https://issues.apache.org/jira/browse/SPARK-19248
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.2, 2.4.3
>            Reporter: Lucas Tittmann
>            Priority: Major
>              Labels: correctness
>
> We found an error in Spark 2.0.2's execution of regexes. Using PySpark in 1.6.2, we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex behaves differently depending on the Spark version. We checked both regexes in Java, and both are correct and work there. Therefore, regex execution in 2.0.2 seems to be erroneous. I am not able to confirm this in 2.1 at the moment.
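For anyone who wants to verify this quickly, below is a minimal PySpark sketch (not part of the original report or comment) illustrating the default 2.x behavior and both workarounds described in the comment above. It assumes a local Spark session on a version where {{spark.sql.parser.escapedStringLiterals}} exists (added under SPARK-20399); the master and app name are placeholders.

{noformat}
# Minimal sketch; assumes a local Spark 2.x+ build that has the
# spark.sql.parser.escapedStringLiterals config (SPARK-20399).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")       # placeholder master
         .appName("spark-19248")   # placeholder app name
         .getOrCreate())

df = spark.createDataFrame([('.. 5. ',)], ['col'])

# Default 2.x parsing: the SQL parser consumes the lone backslash, so the
# regex engine receives '( |.)*', which matches everything.
print(df.selectExpr("regexp_replace(col, '( |\\.)*', '') AS col").collect())
# [Row(col='')]

# Workaround 1: escape the backslash itself. Four backslashes in Python
# become two in the SQL literal, which the parser turns into '\.' for the
# regex engine.
print(df.selectExpr("regexp_replace(col, '( |\\\\.)*', '') AS col").collect())
# [Row(col='5')]

# Workaround 2: fall back to 1.6-style (non-escaping) string literal parsing
# for the whole session.
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
print(df.selectExpr("regexp_replace(col, '( |\\.)*', '') AS col").collect())
# [Row(col='5')]
{noformat}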