Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20091#discussion_r162778187

--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -43,17 +43,19 @@ object Partitioner {
   /**
    * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
    *
-   * If any of the RDDs already has a partitioner, and the number of partitions of the
-   * partitioner is either greater than or is less than and within a single order of
-   * magnitude of the max number of upstream partitions, choose that one.
+   * If spark.default.parallelism is set, we'll use the value of SparkContext defaultParallelism
+   * as the default partitions number, otherwise we'll use the max number of upstream partitions.
    *
-   * Otherwise, we use a default HashPartitioner. For the number of partitions, if
-   * spark.default.parallelism is set, then we'll use the value from SparkContext
-   * defaultParallelism, otherwise we'll use the max number of upstream partitions.
+   * If any of the RDDs already has a partitioner, and the partitioner is an eligible one (with a
+   * partitions number that is not less than the max number of upstream partitions by an order of
+   * magnitude), or the number of partitions is larger than the default one, we'll choose the
+   * exsiting partitioner.
--- End diff --

We should rephrase this for clarity. How about: "When available, we choose the partitioner from the RDD with the maximum number of partitions. If this partitioner is eligible (its number of partitions is within an order of magnitude of the maximum number of partitions across the RDDs), or it has more partitions than the default partitions number, we use this partitioner."
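The "within a single order of magnitude" eligibility rule being documented above can be sketched as follows. This is an illustrative simplification, not the exact helper in Partitioner.scala; the function and parameter names here are assumptions for the sake of the example:

```scala
import scala.math.log10

// Hypothetical sketch of the eligibility check discussed in the review comment.
// existingPartitions:     numPartitions of the RDD that already has a partitioner.
// maxUpstreamPartitions:  max number of partitions across the upstream RDDs.
// Eligible when the existing partitioner's partition count is within a single
// order of magnitude of the max upstream partition count.
def isEligiblePartitioner(existingPartitions: Int, maxUpstreamPartitions: Int): Boolean =
  log10(maxUpstreamPartitions) - log10(existingPartitions) < 1
```

For example, an existing partitioner with 20 partitions would be eligible when the max upstream partition count is 100 (log10 gap of about 0.7), but not when it is 1000 (gap of about 1.7).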