Zouxxyy commented on issue #4816: URL: https://github.com/apache/paimon/issues/4816#issuecomment-2774205817
@Aitozi @zhongyujiang For bucketed tables, Paimon shuffles with REPARTITION_BY_COL on the bucket column, which has two advantages:
1. It concentrates each bucket's data, reducing writer overhead and avoiding small files.
2. It does not produce concurrent compaction.

Spark has another mode, REBALANCE_BY_COL, which additionally splits skewed partitions, but it cannot guarantee the second advantage. We can therefore use this type of shuffle in write-only scenarios, where compaction is skipped. If skew still occurs, it may also be worth checking whether the bucket key values or the number of bucket keys are reasonable.
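
To illustrate the trade-off between the two shuffle modes, here is a minimal, self-contained simulation. It is only a sketch and is neither Spark's nor Paimon's code: the function names, the `bucket % num_tasks` assignment, and the row-count threshold `max_rows_per_task` are all hypothetical simplifications (Spark's real skew splitting works on shuffle data sizes, not row counts). It shows that a plain hash shuffle gives each bucket exactly one writer, while a skew split can spread one bucket across several writers, which is what makes concurrent compaction on that bucket possible.

```python
from collections import defaultdict


def repartition_by_col(rows, num_tasks):
    """Hash-style shuffle: every row of a bucket lands in the same writer task."""
    tasks = defaultdict(list)
    for bucket, value in rows:
        tasks[bucket % num_tasks].append((bucket, value))
    return dict(tasks)


def rebalance_by_col(rows, num_tasks, max_rows_per_task):
    """Same shuffle, followed by a simplified skew split (an assumption for
    illustration): any task with more than max_rows_per_task rows is cut
    into chunks, each handled by its own writer task."""
    out = {}
    next_id = 0
    for _, task_rows in sorted(repartition_by_col(rows, num_tasks).items()):
        for start in range(0, len(task_rows), max_rows_per_task):
            out[next_id] = task_rows[start:start + max_rows_per_task]
            next_id += 1
    return out


def writers_per_bucket(tasks):
    """Count how many distinct writer tasks touch each bucket."""
    writers = defaultdict(set)
    for task_id, task_rows in tasks.items():
        for bucket, _ in task_rows:
            writers[bucket].add(task_id)
    return {bucket: len(ids) for bucket, ids in writers.items()}


# Skewed input: bucket 0 has 8 rows, buckets 1 and 2 have 2 rows each.
rows = (
    [(0, i) for i in range(8)]
    + [(1, i) for i in range(2)]
    + [(2, i) for i in range(2)]
)

hashed = repartition_by_col(rows, num_tasks=3)
rebalanced = rebalance_by_col(rows, num_tasks=3, max_rows_per_task=4)

print(writers_per_bucket(hashed))      # one writer per bucket
print(writers_per_bucket(rebalanced))  # bucket 0 is now split across two writers
```

In this toy model the rebalanced layout caps every task at four rows, at the cost of bucket 0 being written by two tasks at once. That is acceptable when writers never compact, and a problem when they do.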