Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19763#discussion_r152087573

--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---

```diff
@@ -472,16 +475,48 @@ private[spark] class MapOutputTrackerMaster(
     shuffleStatuses.get(shuffleId).map(_.findMissingPartitions())
   }

+  /**
+   * To equally divide n elements into m buckets, basically each bucket should have n/m elements,
+   * for the remaining n%m elements, add one more element to the first n%m buckets each.
+   */
+  def equallyDivide(numElements: Int, numBuckets: Int): Iterator[Seq[Int]] = {
+    val elementsPerBucket = numElements / numBuckets
+    val remaining = numElements % numBuckets
+    if (remaining == 0) {
+      0.until(numElements).grouped(elementsPerBucket)
+    } else {
+      val splitPoint = (elementsPerBucket + 1) * remaining
+      0.to(splitPoint).grouped(elementsPerBucket + 1) ++
```

--- End diff --

`grouped` is expensive here. I saw that it generates `Vector` rather than `Range`:

```
scala> (1 to 100).grouped(10).foreach(g => println(g.getClass))
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
class scala.collection.immutable.Vector
```

This means all of the numbers between 0 and `numElements` get materialized. Could you implement a special `grouped` for `Range` instead?
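A `Range`-based split along the lines the reviewer suggests might look like the sketch below. This is a hypothetical illustration, not the code that landed in the PR: each bucket is computed as a `Range` from its start offset alone, so no intermediate `Vector` is ever built. The first `numElements % numBuckets` buckets get one extra element, matching the intent of the original `equallyDivide` doc comment.

```scala
// Hypothetical sketch: split 0 until numElements into numBuckets
// contiguous Ranges without materializing any elements.
object RangeSplit {
  def equallyDivide(numElements: Int, numBuckets: Int): Iterator[Range] = {
    val elementsPerBucket = numElements / numBuckets
    val remaining = numElements % numBuckets
    // Elements before splitPoint belong to the larger (size + 1) buckets.
    val splitPoint = (elementsPerBucket + 1) * remaining
    (0 until numBuckets).iterator.map { i =>
      if (i < remaining) {
        // One of the first n % m buckets: gets elementsPerBucket + 1 elements.
        val start = i * (elementsPerBucket + 1)
        start until (start + elementsPerBucket + 1)
      } else {
        // Remaining buckets: exactly elementsPerBucket elements each.
        val start = splitPoint + (i - remaining) * elementsPerBucket
        start until (start + elementsPerBucket)
      }
    }
  }
}
```

Because every bucket is a `Range`, the whole split is O(numBuckets) in both time and memory, independent of `numElements`.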