[GitHub] [spark] ulysses-you commented on pull request #33079: [SPARK-35888][SQL] Add dataSize field in CoalescedPartitionSpec
ulysses-you commented on pull request #33079:
URL: https://github.com/apache/spark/pull/33079#issuecomment-869728096

> how are we going to use dataSize? Only for tests is a bit overkill.

@cloud-fan It can help coalesce partitions in `ShufflePartitionsUtil.coalescePartitionsWithSkew` if we apply the optimize-skewed-partitions rule before the coalesce-partitions rule, as in [#32883](https://github.com/apache/spark/pull/32883).

Say we have the skewed partitions [0, 128MB, 0, 128MB, 0]. The order in which these two rules run changes the result:

1. coalesce partitions first, then optimize skewed partitions: [64MB, 64MB, 64MB, 64MB]
2. optimize skewed partitions first, then coalesce partitions: [0, 64MB, 64MB, 0, 64MB, 64MB, 0]

Then we can coalesce in `ShufflePartitionsUtil.coalescePartitionsWithSkew` over mixed `CoalescedPartitionSpec` and `PartialReducerPartitionSpec` when a `CoalescedPartitionSpec` is empty.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
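
The two orderings in the comment above can be reproduced with a small toy model. This is only a sketch and not Spark's implementation: the 64MB target size, the `Part` class, and the `coalesce` / `split_skewed` helpers are all hypothetical names chosen for illustration. It also assumes that an empty partition can always be merged into a non-split neighbor, and that slices produced by a skew split are never merged by the coalesce step.

```python
from dataclasses import dataclass
from typing import List, Optional

TARGET_MB = 64  # assumed target partition size, not taken from the PR


@dataclass(frozen=True)
class Part:
    size_mb: int
    partial: bool = False  # True for a slice produced by splitting a skewed partition


def coalesce(parts: List[Part], target: int = TARGET_MB) -> List[Part]:
    """Merge adjacent non-partial partitions; skew-split slices are left untouched."""
    out: List[Part] = []
    for p in parts:
        prev: Optional[Part] = out[-1] if out else None
        can_merge = (
            prev is not None
            and not prev.partial
            and not p.partial
            # Assumption: merging an empty partition is always allowed.
            and (prev.size_mb == 0 or p.size_mb == 0
                 or prev.size_mb + p.size_mb <= target)
        )
        if can_merge:
            out[-1] = Part(prev.size_mb + p.size_mb)
        else:
            out.append(p)
    return out


def split_skewed(parts: List[Part], target: int = TARGET_MB) -> List[Part]:
    """Split every partition larger than the target into near-equal slices."""
    out: List[Part] = []
    for p in parts:
        if p.size_mb > target:
            n = -(-p.size_mb // target)  # ceiling division
            base, extra = divmod(p.size_mb, n)
            out.extend(Part(base + (1 if i < extra else 0), partial=True)
                       for i in range(n))
        else:
            out.append(p)
    return out


def sizes(parts: List[Part]) -> List[int]:
    return [p.size_mb for p in parts]


original = [Part(s) for s in (0, 128, 0, 128, 0)]

# Order 1: coalesce first, then optimize skewed partitions.
print(sizes(split_skewed(coalesce(original))))  # [64, 64, 64, 64]

# Order 2: optimize skewed partitions first, then coalesce.
print(sizes(coalesce(split_skewed(original))))  # [0, 64, 64, 0, 64, 64, 0]
```

In the second ordering the empty partitions survive because each one is surrounded by split slices, which is the case the comment says a later coalesce over mixed specs could handle once empty `CoalescedPartitionSpec`s can be identified.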
[GitHub] [spark] ulysses-you commented on pull request #33079: [SPARK-35888][SQL] Add dataSize field in CoalescedPartitionSpec
ulysses-you commented on pull request #33079:
URL: https://github.com/apache/spark/pull/33079#issuecomment-868389742

cc @cloud-fan @yaooqinn @JkSelf @maryannxue