[ 
https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391108#comment-15391108
 ] 

Maciej Szymkiewicz commented on SPARK-16589:
--------------------------------------------

[~holdenk] Makes sense. I was thinking more about the design than about other 
possible issues, but it is probably better to be safe than sorry. It should still 
be fixed as soon as possible; it is a really ugly bug and easy to miss.

I doubt there are many legitimate cases where one would do something like this, 
though (I guess this is why it hasn't been reported before).
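
For reference, a quick local sanity check of the expected count (a plain Python 
sketch using itertools, no Spark involved), just to make explicit that a chained 
cartesian over a 10-element RDD should yield 10 * 10 * 10 = 1000 records:

{code}
from itertools import product

data = list(range(10))
pairs = list(product(data, data))      # like rdd.cartesian(rdd): 100 pairs
triples = list(product(pairs, data))   # like .cartesian(rdd) again: 1000 records
print(len(triples))                    # 1000
{code}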

> Chained cartesian produces incorrect number of records
> ------------------------------------------------------
>
>                 Key: SPARK-16589
>                 URL: https://issues.apache.org/jira/browse/SPARK-16589
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0
>            Reporter: Maciej Szymkiewicz
>
> Chaining cartesian calls in PySpark results in fewer records than expected. 
> It can be reproduced as follows:
> {code}
> rdd = sc.parallelize(range(10), 1)
> rdd.cartesian(rdd).cartesian(rdd).count()
> ## 355
> rdd.cartesian(rdd).cartesian(rdd).distinct().count()
> ## 251
> {code}
> It looks like it is related to serialization. If we reserialize after the 
> initial cartesian:
> {code}
> rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), 
> 1)).cartesian(rdd).count()
> ## 1000
> {code}
> or insert an identity map:
> {code}
> rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count()
> ## 1000
> {code}
> it yields the correct results.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
