Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21859#discussion_r208775584

    --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
    @@ -166,7 +169,16 @@ class RangePartitioner[K : Ordering : ClassTag, V](
           // Assume the input partitions are roughly balanced and over-sample a little bit.
           val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
           val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
    -      if (numItems == 0L) {
    +      val numSampled = sketched.map(_._3.length).sum
    +      if (numItems == 0) {
    +        Array.empty
    +      }
    +      // already got the whole data
    +      else if (sampleCacheEnabled && numItems == numSampled) {
    +        // get the sampled data
    +        sampledArray = sketched.foldLeft(Array.empty[K])((total, sample) => {
    +          total ++ sample._3
    +        })
             Array.empty
    --- End diff --

    Returning `Array.empty` here will result in a single partition.
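    To illustrate the concern, here is a minimal sketch of how an empty bounds array collapses everything into one partition. It assumes the partition count is derived as `rangeBounds.length + 1`, which is how `RangePartitioner` computes it as far as I can tell. `BoundsPartitioner` and `SinglePartitionDemo` are hypothetical stand-ins written for this example, not Spark classes, and the lookup only handles ascending order.

```scala
// Hypothetical stand-in for a range-bounds lookup; not Spark's actual RangePartitioner.
final class BoundsPartitioner[K](rangeBounds: Array[K])(implicit ord: Ordering[K]) {
  // Assumption: one more partition than there are upper bounds.
  def numPartitions: Int = rangeBounds.length + 1

  // Returns the index of the first bound that is >= key.
  // Keys larger than every bound fall into the last partition.
  def getPartition(key: K): Int = {
    var partition = 0
    while (partition < rangeBounds.length && ord.gt(key, rangeBounds(partition))) {
      partition += 1
    }
    partition
  }
}

object SinglePartitionDemo {
  def main(args: Array[String]): Unit = {
    // Empty bounds, as produced by returning Array.empty in the new branch.
    val noBounds = new BoundsPartitioner[Int](Array.empty[Int])
    println(noBounds.numPartitions)                        // 1
    println(Seq(1, 50, 1000).map(noBounds.getPartition))   // List(0, 0, 0)

    // Two bounds give three partitions, and the same keys spread out.
    val twoBounds = new BoundsPartitioner[Int](Array(10, 100))
    println(twoBounds.numPartitions)                       // 3
    println(Seq(1, 50, 1000).map(twoBounds.getPartition))  // List(0, 1, 2)
  }
}
```

    With no bounds, every key maps to partition 0 regardless of the requested number of partitions, so the fully sampled case would still need real bounds computed from `sampledArray`.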