[ https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lucas Tittmann updated SPARK-19248:
-----------------------------------
    Description: 
We found an error in Spark 2.0.2's regex execution. Using PySpark on 1.6.2, we 
get the following expected behaviour:
{noformat}
df = sqlContext.createDataFrame([('..   5.    ',)], ['col'])
dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
z.show(dfout)
>>> [Row(col=u'5')]
dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
z.show(dfout2)
>>> [Row(col=u'5')]
{noformat}
In Spark 2.0.2, with the same code, we get the following:
{noformat}
df = sqlContext.createDataFrame([('..   5.    ',)], ['col'])
dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
z.show(dfout)
>>> [Row(col=u'5')]
dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS col"]).collect()
z.show(dfout2)
>>> [Row(col=u'')]
{noformat}

As you can see, the second regex behaves differently depending on the Spark 
version. We checked both regexes in plain Java, and both are valid and yield 
the same, expected result there. Regex execution in Spark 2.0.2 therefore 
seems to be erroneous. I am not able to confirm the behaviour on 2.1 at the 
moment.
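For reference, a minimal sketch of such a standalone Java check (the class name is just for illustration, not our exact test code); String.replaceAll goes through the same java.util.regex engine that Spark's regexp_replace appears to rely on, and both patterns reduce the test string to "5":
{noformat}
public class RegexCheck {
    public static void main(String[] args) {
        String input = "..   5.    ";

        // Character-class form: removes every run of spaces and dots.
        System.out.println(input.replaceAll("[ \\.]*", ""));   // prints: 5

        // Alternation form: equivalent in plain Java.
        System.out.println(input.replaceAll("( |\\.)*", ""));  // prints: 5
    }
}
{noformat}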


> Regex_replace works in 1.6 but not in 2.0
> -----------------------------------------
>
>                 Key: SPARK-19248
>                 URL: https://issues.apache.org/jira/browse/SPARK-19248
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Lucas Tittmann


