[ https://issues.apache.org/jira/browse/SPARK-13691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-13691:
---------------------------------
    Description: 
Here is an example where Scala and Python generate different results:

{code}
Scala:
scala> var i = 0
i: Int = 0
scala> val rdd = sc.parallelize(1 to 10).map(_ + i)
scala> rdd.collect()
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> i += 1
scala> rdd.collect()
res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

Python:
>>> i = 0
>>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i)
>>> rdd.collect()
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> i += 1
>>> rdd.collect()
[1, 2, 3, 4, 5, 6, 7, 8, 9]
{code}

The difference is that Scala captures the current values of the closure's 
variables every time a job runs, whereas Python captures the values only once 
and reuses them for all subsequent jobs.
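The two capture behaviors can be simulated without Spark. In the sketch below (plain Python; `scala_style_job`, `python_style_job`, and `frozen_i` are hypothetical names for illustration, not Spark APIs), the "Python-style" job freezes a copy of {{i}} to mimic PySpark serializing the closure once at RDD definition time, while the "Scala-style" job re-reads {{i}} on every run:

```python
i = 0
data = list(range(1, 11))

# Scala-style: the closure is effectively re-shipped for every job,
# so each run sees the current value of i.
scala_style_job = lambda: [x + i for x in data]

# Python/PySpark-style: the function is serialized once when the RDD
# is defined; frozen_i stands in for that one-time snapshot.
frozen_i = i
python_style_job = lambda: [x + frozen_i for x in data]

print(scala_style_job())   # [1, 2, ..., 10]
i += 1
print(scala_style_job())   # [2, 3, ..., 11] -- sees the new i
print(python_style_job())  # [1, 2, ..., 10] -- still the snapshot
```

This reproduces the divergence shown in the shell transcripts above: after {{i += 1}}, only the Scala-style run reflects the update.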

In addition, SQL UDFs have a similar issue. It would be better to fix that as 
well if anyone picks up this bug.

  was:
Here is an example where Scala and Python generate different results:

{code}
Scala:
scala> var i = 0
i: Int = 0
scala> val rdd = sc.parallelize(1 to 10).map(_ + i)
scala> rdd.collect()
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> i += 1
scala> rdd.collect()
res2: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

Python:
>>> i = 0
>>> rdd = sc.parallelize(range(1, 10)).map(lambda x: x + i)
>>> rdd.collect()
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> i += 1
>>> rdd.collect()
[1, 2, 3, 4, 5, 6, 7, 8, 9]
{code}

The difference is that Scala captures the current values of the closure's 
variables every time a job runs, whereas Python captures the values only once 
and reuses them for all subsequent jobs.


> Scala and Python generate inconsistent results
> ----------------------------------------------
>
>                 Key: SPARK-13691
>                 URL: https://issues.apache.org/jira/browse/SPARK-13691
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.1, 1.5.2, 1.6.0
>            Reporter: Shixiong Zhu
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
