[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065317#comment-14065317 ]
Davies Liu commented on SPARK-2494:
-----------------------------------

The tip version already handles the hash of None, but it cannot handle the hash of a tuple with None in it. Here are the updated test cases, sorry for that:

>>> rdd = sc.parallelize([((None, 1), 1)] * 100, 100)
>>> assert len(rdd.groupByKey(10).collect()) == 1

> Hash of None is different cross machines in CPython
> ---------------------------------------------------
>
>                 Key: SPARK-2494
>                 URL: https://issues.apache.org/jira/browse/SPARK-2494
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.0, 1.0.1
>         Environment: CPython 2.x
>            Reporter: Davies Liu
>            Priority: Blocker
>              Labels: pyspark, shuffle
>             Fix For: 1.0.0, 1.0.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The hash of None, and of any tuple with None in it, differs across machines, so
> the result will be wrong if None appears in the key of partitionBy().
> PySpark should use a portable hash function as the default partition function,
> one that generates the same hash for all the builtin immutable types, especially
> tuple.
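
To make the request in the description concrete, here is a minimal sketch of what a portable hash could look like. It is an illustration under stated assumptions, not necessarily the fix that was merged: the names portable_hash and partition_for are chosen for this example, the tuple mixing constants are borrowed from the style of CPython's tuple hash, and only None, tuples, and values whose builtin hash is already stable (such as ints) are covered.

```python
import sys


def portable_hash(x):
    """Hash that does not depend on the memory address of None.

    Illustrative sketch: None maps to a fixed value and tuples are
    hashed element by element, so a None nested inside a tuple never
    reaches the builtin hash(). Everything else falls back to hash().
    """
    if x is None:
        return 0
    if isinstance(x, tuple):
        h = 0x345678
        for item in x:
            h ^= portable_hash(item)   # recurse so nested None is handled
            h *= 1000003
            h &= sys.maxsize           # keep the value in the native int range
        h ^= len(x)
        return h
    # Assumption: the builtin hash of this value is the same on every worker.
    return hash(x)


def partition_for(key, num_partitions):
    """Hypothetical helper: pick a partition index for a key."""
    return portable_hash(key) % num_partitions
```

With a function like this, every worker computes the same partition for the key (None, 1), so the 100 records in the test case above would end up in a single group. Note that the fallback to hash() is only safe for types whose hash is stable across processes; on Python 3, for example, string hashes are randomized per process unless PYTHONHASHSEED is fixed.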