Re: [I] [Feature] Paimon Spark 2025 Roadmap [paimon]

via GitHub Wed, 02 Apr 2025 19:28:17 -0700


Zouxxyy commented on issue #4816:
URL: https://github.com/apache/paimon/issues/4816#issuecomment-2774205817


   @Aitozi @zhongyujiang For bucketed tables, Paimon shuffles with
REPARTITION_BY_COL on the bucket column, which has two advantages:
   1. It concentrates each bucket's data in fewer tasks, which reduces writer
overhead and avoids small files.
   2. It does not produce concurrent compaction.
   Spark also has another mode, REBALANCE_BY_COL, which additionally splits
skewed partitions, but it cannot guarantee the second advantage.
Therefore, we can choose this type of shuffle in a write-only scenario, where
the writers do not run compaction.
   
   Additionally, if skew occurs, it may still be necessary to consider whether 
the bucket key values or the number of bucket keys are reasonable.
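
   To make the trade-off concrete, here is a minimal, self-contained Python
sketch. It is a toy model only, not Spark's or Paimon's implementation: the
function names, the modulo task assignment, and the `target_size` split
threshold are all assumptions made for illustration. It shows that a plain
repartition by bucket keeps one writer task per bucket, while skew splitting
evens out task sizes at the cost of several tasks writing the same bucket.

```python
from collections import defaultdict


def repartition_by_col(bucket_ids, num_tasks):
    """Toy model of a hash repartition: every row of a bucket goes to one task."""
    tasks = defaultdict(list)
    for b in bucket_ids:
        # Assumption: task index is simply bucket id modulo the task count.
        tasks[b % num_tasks].append(b)
    return [tasks[t] for t in sorted(tasks)]


def rebalance_by_col(bucket_ids, num_tasks, target_size):
    """Toy model of a rebalance: same shuffle, then oversized tasks are split."""
    out = []
    for part in repartition_by_col(bucket_ids, num_tasks):
        # Skew splitting: cut any partition into chunks of at most target_size rows.
        for i in range(0, len(part), target_size):
            out.append(part[i:i + target_size])
    return out


def writers_per_bucket(tasks):
    """Count how many distinct tasks would write to each bucket."""
    writers = defaultdict(set)
    for task_id, part in enumerate(tasks):
        for b in part:
            writers[b].add(task_id)
    return {b: len(ts) for b, ts in writers.items()}


# Skewed input: bucket 0 holds 8 rows, buckets 1 and 2 hold 2 rows each.
rows = [0] * 8 + [1] * 2 + [2] * 2

print(writers_per_bucket(repartition_by_col(rows, 3)))
print(writers_per_bucket(rebalance_by_col(rows, 3, 4)))
```

   In this model the rebalanced layout has two tasks writing bucket 0. If each
writer also compacted, those two tasks could compact the same bucket at the
same time, which is why the rebalance shuffle fits a write-only job.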


