[GitHub] [spark] ulysses-you commented on pull request #34933: [SPARK-37674][SQL] Reduce the output partition of output stage to avoid producing small files.

GitBox Sat, 08 Jan 2022 02:17:49 -0800


ulysses-you commented on pull request #34933:
URL: https://github.com/apache/spark/pull/34933#issuecomment-1007224052



   I see the requirement, but there are some potential issues if we only add a 
new config for the final stage of a write.
   
   - If the final stage is heavy, e.g. it contains a join or even multiple 
joins, making the partition size big will cause a regression.
   - The input shuffle size is not equal to the output size. If the plan of 
the final stage changes the data size, this config is less meaningful.
   - Not all queries contain a shuffle. In that case the semantics of this 
config are broken, since the config is never used.
   - For dynamic partition writing it is not enough to just increase the 
partition size. We should also cluster rows with the same partition value into 
as few partitions as possible.
   - This config should also affect the rebalance.
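
To make the dynamic partition point concrete, here is a toy model in plain Python (not Spark code). It assumes each write task produces one file per distinct partition value it holds; the function names and the data are hypothetical and only illustrate the argument.

```python
def count_files(tasks):
    """Toy assumption: a task writes one file per distinct partition value it holds."""
    return sum(len(set(task)) for task in tasks)


def split_round_robin(rows, num_tasks):
    """Spread rows over tasks without looking at the partition value."""
    tasks = [[] for _ in range(num_tasks)]
    for i, key in enumerate(rows):
        tasks[i % num_tasks].append(key)
    return tasks


def split_by_key(rows, num_tasks):
    """Send all rows with the same partition value to the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for key in rows:
        tasks[key % num_tasks].append(key)
    return tasks


# 1000 rows, 10 distinct partition values, 100 rows per value.
rows = [i // 100 for i in range(1000)]

scattered_small = count_files(split_round_robin(rows, 20))  # 20 tasks x 10 values
scattered_big = count_files(split_round_robin(rows, 5))     # 4x larger partitions
clustered = count_files(split_by_key(rows, 20))             # one task per value

print(scattered_small, scattered_big, clustered)  # 200 50 10
```

In this model, making partitions four times larger only reduces the file count from 200 to 50, while clustering by partition value reaches the minimum of 10 files without changing the partition size at all.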
   
   I think it's a good idea to add a `RebalancePartitions` node for all write 
commands, as @wangyum is working on in SPARK-31264. Then we can consider adding 
a dedicated partition size config for the shuffle introduced by 
`RebalancePartitions`.
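
A rough sketch of why a size target is more meaningful on a shuffle that directly feeds the write: the Python below is a simplified greedy model of size-based coalescing, not Spark's actual implementation. The `coalesce` helper, the sizes, and the 1/8 shrink factor are all assumptions, and encoding or compression of the output files is ignored.

```python
def coalesce(sizes_mb, target_mb):
    """Greedily merge adjacent shuffle partitions until adding one more would exceed the target."""
    groups, current, total = [], [], 0
    for size in sizes_mb:
        if current and total + size > target_mb:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups


shuffle_mb = [16] * 16             # 16 shuffle partitions of 16 MB each
groups = coalesce(shuffle_mb, 64)  # 4 coalesced partitions of 64 MB each

# Case 1: the final stage shrinks the data to 1/8 after the shuffle
# (e.g. a selective filter), so the files end up far below the target.
filtered_files_mb = [sum(g) // 8 for g in groups]

# Case 2: the shuffle is the last step before the write, so the coalesced
# partition size is what gets written.
direct_files_mb = [sum(g) for g in groups]

print(filtered_files_mb)  # [8, 8, 8, 8]
print(direct_files_mb)    # [64, 64, 64, 64]
```

Under these assumptions, the same 64 MB target yields 8 MB files when a size-changing operator sits between the shuffle and the write, and 64 MB files when the shuffle comes immediately before the write.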


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


