Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16677 Let me take an example from the PR description > For example, we have three partitions with rows (100, 100, 50) respectively. In global limit of 100 rows, we may take (34, 33, 33) rows for each partition locally. After global limit we still have three partitions. Without this patch, we need to take the first 100 rows from each partition, and then perform a shuffle to send all data into one partition and take the first 100 rows. So if the limit is big, this patch is super useful, if the limit is small, this patch is not that useful but should not be slower. The only overhead I can think of is, `MapStatus` needs to carry the numRecords metrics. It should be a small overhead, as `MapStatus` already carries many information.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org