zhongyujiang opened a new pull request, #5241: URL: https://github.com/apache/paimon/pull/5241
<!-- Please specify the module before the PR name: [core] ... or [flink] ... -->

### Purpose

<!-- Linking this pull request to the issue -->
Linked issue: part of #4816

<!-- What is the purpose of the change -->
Support the Spark DataSource V2 write path to reduce write serialization overhead and accelerate writing to primary key tables in Spark. Currently only fixed-bucket tables are supported.

### Tests

<!-- List UT and IT cases to verify this change -->
BucketFunctionTest, SparkWriteITCase

PaimonSourceWriteBenchmark:

```md
Benchmark                            Mode  Cnt   Score    Error  Units
PaimonSourceWriteBenchmark.v1Write     ss    5  13.845 ± 23.192   s/op
PaimonSourceWriteBenchmark.v2Write     ss    5   9.579 ± 14.929   s/op
```

### API and Format

<!-- Does this change affect API or storage format -->

### Documentation

<!-- Does this change introduce a new feature -->
Adds a config `spark.sql.paimon.use-v2-write` to enable switching to the v2 write path. It falls back to the v1 write when encountering an unsupported scenario (e.g. a `HASH_DYNAMIC` bucket mode table).

Note: this is an overall draft PR, which will be split into smaller PRs for easier review.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
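For readers trying out the new flag, a minimal sketch of how a session might opt in (the config key is from this PR; the table name `t` and the inserted row are purely illustrative):

```sql
-- Opt into the DataSource V2 write path for this session.
-- Per the PR description, Paimon falls back to the v1 write when it hits an
-- unsupported scenario, e.g. a HASH_DYNAMIC bucket mode table.
SET spark.sql.paimon.use-v2-write = true;

-- Subsequent writes to a fixed-bucket primary key table should then take
-- the v2 path (hypothetical table `t`):
INSERT INTO t VALUES (1, 'a');
```

Since the fallback is automatic, enabling the flag should be safe even for mixed workloads that touch both fixed-bucket and dynamic-bucket tables.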
