[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Chammas reopened SPARK-19248:
--------------------------------------

> Regex_replace works in 1.6 but not in 2.0
> -----------------------------------------
>
>                 Key: SPARK-19248
>                 URL: https://issues.apache.org/jira/browse/SPARK-19248
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Lucas Tittmann
>            Priority: Major
>              Labels: bulk-closed
>
> We found an error in Spark 2.0.2's execution of regular expressions. Using PySpark on 1.6.2, we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, the same code gives the following:
> {noformat}
> df = sqlContext.createDataFrame([('.. 5. ',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex behaves differently depending on the Spark version. We checked both regexes in Java, and both are correct and produce the same result there. Therefore, regex execution in 2.0.2 appears to be erroneous. I am not able to confirm this on 2.1 at the moment.
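> A minimal sketch (not part of the original report), assuming the difference comes from the Spark 2.x SQL parser unescaping backslashes inside string literals: under that assumption, the pattern '( |\.)*' reaches the regex engine as '( |.)*', which matches and removes every character. If that is the cause, doubling the backslash in the SQL expression, or passing the pattern through the DataFrame function API so it bypasses the SQL parser, should restore the 1.6 behaviour:
> {noformat}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame([('.. 5. ',)], ['col'])
>
> # SQL expression: '\\\\.' in the Python source is '\\.' in the SQL string,
> # which the 2.x parser would unescape to the intended regex token '\.'
> print(df.selectExpr("regexp_replace(col, '( |\\\\.)*', '') AS col").collect())
> # expected: [Row(col=u'5')]
>
> # DataFrame API: the pattern goes straight to the regex engine, so a single
> # (raw-string) backslash is enough
> print(df.select(F.regexp_replace('col', r'( |\.)*', '').alias('col')).collect())
> # expected: [Row(col=u'5')]
> {noformat}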