ulysses-you commented on pull request #34933: URL: https://github.com/apache/spark/pull/34933#issuecomment-1007224052
I see the requirement, but there are some potential issues if we only add a new config for the final stage of a write:

- If the final stage is heavy, making the partition size bigger will cause a regression, e.g. when the final stage is a join or even a multi-join.
- The input shuffle size is not equal to the output size. If the plan of the final stage changes the data size, this config becomes less meaningful.
- Not all queries contain a shuffle, so the semantics of this config are broken whenever there is no shuffle for it to apply to.
- Just increasing the partition size is not enough for dynamic partition writing. We should, as far as possible, cluster rows with the same partition value into just a few partitions.
- This config should also affect rebalance.

I think it's a good idea to add a `RebalancePartitions` node for all writing commands, as @wangyum is working on in SPARK-31264. Then we can consider adding a special partition size config for the added shuffle that comes from `RebalancePartitions`.
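To make the point about input shuffle size versus output size concrete, here is a toy model in Python. It is only a sketch and not Spark's actual coalescing code: `coalesce_partitions`, `estimated_output_sizes`, the partition sizes, and the 10% output ratio are all hypothetical and chosen purely for illustration.

```python
from typing import List


def coalesce_partitions(sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Greedily merge adjacent shuffle partitions up to roughly target_mb each.

    Simplified stand-in for size-based coalescing; not Spark's implementation.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sizes_mb:
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


def estimated_output_sizes(groups: List[List[int]], output_ratio: float) -> List[float]:
    """Scale each group's shuffle input size by a hypothetical output ratio.

    output_ratio models a final stage that shrinks (or grows) the data,
    e.g. through a filter, an aggregation, or columnar compression.
    """
    return [sum(group) * output_ratio for group in groups]


# Ten 16 MB shuffle partitions, coalesced toward a 64 MB target.
shuffle_sizes_mb = [16] * 10
groups = coalesce_partitions(shuffle_sizes_mb, target_mb=64)
input_sizes_mb = [sum(group) for group in groups]

# If the final stage keeps only about 10% of its input, the written files
# end up far smaller than the 64 MB that the config asked for.
output_sizes_mb = estimated_output_sizes(groups, output_ratio=0.1)

print("input per task (MB): ", input_sizes_mb)
print("output per task (MB):", output_sizes_mb)
```

In this model the target is honored on the shuffle read side, yet the files that get written are roughly a tenth of that size. That is why a partition size config that only looks at the input of the final stage does not control the output file size.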