[I] Job "Doing partition and writing data" performance regression from Hudi 0.9.0 to 0.14.0 [hudi]

via GitHub Sun, 30 Nov 2025 04:11:41 -0800


hudi-bot opened a new issue, #16971:
URL: https://github.com/apache/hudi/issues/16971


   A user is running the same Hudi upsert application with the same Hudi 
configuration on Hudi 0.9.0 and 0.14.0 for performance benchmarking.
   The results show roughly a 2x performance regression in the job
   {code:java}
   Doing partition and writing data{code}
   which takes ~2.5 mins on 0.9.0 and ~5 mins on 0.14.0. 
   Is this a known regression, and if so, what is its cause?
   
   Hudi config
   
   {code:java}
   upsert_hudi_config = {
   "hoodie.table.name": "[table_name]",
   "hoodie.database.name": "[database_name]",
   "hoodie.datasource.write.keygenerator.class": 
"org.apache.hudi.keygen.NonpartitionedKeyGenerator",
   "hoodie.datasource.write.operation": "upsert", 
   "hoodie.datasource.write.precombine.field": "[precombine_key]",
   "hoodie.datasource.write.recordkey.field": "[record_key]", 
   "hoodie.datasource.write.table.name": "[table_name]", 
   "hoodie.index.type": "BLOOM",
   "hoodie.metadata.enable": False, 
   "hoodie.upsert.shuffle.parallelism": 3,
   }{code}
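
For readers reproducing the benchmark, options like these are normally handed to the Spark DataFrame writer. The following is only a minimal sketch of that step, assuming PySpark with the Hudi Spark bundle on the classpath. The helper names `build_upsert_options` and `write_upsert`, and the `df` and `base_path` arguments, are hypothetical and do not come from the report; the option keys and values are copied from the config above.

```python
def build_upsert_options(table_name, database_name, record_key, precombine_key):
    """Return the writer options from the report with the placeholders filled in."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.database.name": database_name,
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.precombine.field": precombine_key,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.index.type": "BLOOM",
        "hoodie.metadata.enable": False,
        "hoodie.upsert.shuffle.parallelism": 3,
    }


def write_upsert(df, base_path, options):
    """Upsert `df` (assumed to be a pyspark.sql.DataFrame) into the table at `base_path`."""
    # format("hudi") only resolves when the Hudi Spark bundle is available.
    # mode("append") is the usual save mode for upserts into an existing table.
    df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Running the same call against both Hudi versions, with identical input data and Spark settings, keeps the comparison limited to the library version.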
   Data Characteristics
   
   {code:java}
   Table size: ~5GB uncompressed parquet data
   Column count: 310 columns
   High NULL density:
     - Average NULLs per row: 217.74
     - Min NULLs per row: 185
     - Max NULLs per row: 230{code}
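
The report does not say how the NULL statistics were computed. As an illustration only, per-row NULL counts of this kind could be derived as in the sketch below, which uses plain Python dictionaries as stand-ins for rows; the function name `null_density` and the sample data are hypothetical.

```python
def null_density(rows):
    """Return (average, minimum, maximum) number of None values per row."""
    counts = [sum(1 for value in row.values() if value is None) for row in rows]
    if not counts:
        raise ValueError("rows must not be empty")
    return sum(counts) / len(counts), min(counts), max(counts)


# Tiny made-up sample: three rows with 2, 3, and 1 NULLs respectively.
sample = [
    {"a": 1, "b": None, "c": None},
    {"a": None, "b": None, "c": None},
    {"a": 1, "b": 2, "c": None},
]
avg_nulls, min_nulls, max_nulls = null_density(sample)
print(f"avg={avg_nulls:.2f} min={min_nulls} max={max_nulls}")
```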
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-9313
   - Type: Bug


