spark git commit: [SPARK-4304] [PySpark] Fix sort on empty RDD (1.0 branch)

joshrosen Fri, 07 Nov 2014 20:58:49 -0800

Repository: spark
Updated Branches:
  refs/heads/branch-1.0 18c8c3833 -> d4aed266d



[SPARK-4304] [PySpark] Fix sort on empty RDD (1.0 branch)

This PR fixes sortBy()/sortByKey() on an empty RDD.

This should be backported to 1.0.

Author: Davies Liu <dav...@databricks.com>

Closes #3163 from davies/fix_sort_1.0 and squashes the following commits:

9be984f [Davies Liu] fix sort on empty RDD


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d4aed266
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d4aed266
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d4aed266

Branch: refs/heads/branch-1.0
Commit: d4aed266d3db3cb3aea711f30aa058c74bfe60a5
Parents: 18c8c38
Author: Davies Liu <dav...@databricks.com>
Authored: Fri Nov 7 20:57:56 2014 -0800
Committer: Josh Rosen <joshro...@databricks.com>
Committed: Fri Nov 7 20:57:56 2014 -0800

----------------------------------------------------------------------
 python/pyspark/rdd.py   | 2 ++
 python/pyspark/tests.py | 3 +++
 2 files changed, 5 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/d4aed266/python/pyspark/rdd.py
----------------------------------------------------------------------
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index 368ab50..57c2cd7 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -496,6 +496,8 @@ class RDD(object):
         # number of (key, value) pairs falling into them
         if numPartitions > 1:
             rddSize = self.count()
+            if not rddSize:
+                return self
             maxSampleSize = numPartitions * 20.0 # constant from Spark's RangePartitioner
             fraction = min(maxSampleSize / max(rddSize, 1), 1.0)
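
For illustration, the guard in the hunk above can be sketched outside Spark. The following is a simplified stand-in, not PySpark's code: range_bounds is a hypothetical helper that works on a plain Python list, and it assumes that partition boundaries are picked by indexing into a sorted sample of keys, a step that has nothing to index when the input is empty.

```python
import random


def range_bounds(keys, num_partitions, seed=1):
    """Pick num_partitions - 1 boundary keys from a sample of `keys`.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    if num_partitions <= 1:
        return []
    size = len(keys)
    if not size:
        # Same idea as the patch: nothing to sort, so skip sampling.
        return []
    # Constants mirror the lines shown in the diff.
    max_sample_size = num_partitions * 20.0
    fraction = min(max_sample_size / max(size, 1), 1.0)
    rng = random.Random(seed)
    sample = sorted(k for k in keys if rng.random() < fraction)
    if not sample:
        # Extra precaution in this sketch (not in the patch): an unlucky
        # sample of a large input could in principle be empty.
        return []
    # Assumed boundary selection: evenly spaced positions in the sorted sample.
    return [sample[len(sample) * (i + 1) // num_partitions]
            for i in range(num_partitions - 1)]
```

With an empty input the helper returns no boundaries instead of indexing into an empty sample, which corresponds to sortByKey() returning the RDD unchanged.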
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d4aed266/python/pyspark/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 45284ee..8f5b48d 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -198,6 +198,9 @@ class TestRDDFunctions(PySparkTestCase):
         os.unlink(tempFile.name)
         self.assertRaises(Exception, lambda: filtered_data.count())
 
+    def test_sort_on_empty_rdd(self):
+        self.assertEqual([], self.sc.parallelize(zip([], [])).sortByKey().collect())
+
     def test_itemgetter(self):
         rdd = self.sc.parallelize([range(10)])
         from operator import itemgetter


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
