nsivabalan commented on issue #9976:
URL: https://github.com/apache/hudi/issues/9976#issuecomment-1793247596

   hey @darlatrade :
   can you help w/ some more info?
   1. What's the size of the table?
   2. I assume it's a COW table.
   3. Based on your stats, it looks like you have at least 60 file groups per partition, and you are updating 12 partitions, which comes to 720 file groups. So this write involves rewriting 720 parquet files, for which Hudi might spin up 720 tasks. With a COW table, it's a known issue that updating a very small percentage of records spread across a lot of file groups can result in some overhead.
   
   If this matches your workload and you prefer faster write times, maybe you can try an MOR table; see the sketch below.
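
   To illustrate, a minimal sketch (PySpark) of switching the write to MERGE_ON_READ, assuming an existing upsert pipeline; the table name, key fields, and paths here are hypothetical placeholders, not taken from your setup:
   ```
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   df = spark.read.parquet("s3://bucket/incoming/")  # hypothetical input

   hudi_options = {
       "hoodie.table.name": "my_table",                   # hypothetical
       # MOR appends updates to log files instead of rewriting
       # every touched parquet file on each commit.
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.precombine.field": "ts",  # hypothetical
       "hoodie.datasource.write.operation": "upsert",
   }

   (df.write.format("hudi")
      .options(**hudi_options)
      .mode("append")
      .save("s3://bucket/warehouse/my_table"))            # hypothetical
   ```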
   
   4. Also, I see you are on 0.10.1. Some of the configs you have shared (the metadata-table index and data-skipping ones) may not be applicable in 0.10.x, so just in case you may want to remove them:
   ```
   "hoodie.metadata.index.bloom.filter.enable": "true",
   "hoodie.metadata.index.bloom.filter.parallelism": 100,
   "hoodie.metadata.index.bloom.filter.column.list": "id",
   "hoodie.bloom.index.use.metadata": "true",
   "hoodie.metadata.index.column.stats.enable": "true",
   "hoodie.metadata.index.column.stats.column.list": "col1,col2,col3",
   "hoodie.enable.data.skipping": "true"
   ``` 
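
   If you build these options programmatically, a small sketch (Python; the helper name is hypothetical) of stripping the keys above before writing on 0.10.x:
   ```
   # Configs from the snippet above that rely on metadata-table indexes /
   # data skipping and may not be applicable on 0.10.x.
   NOT_APPLICABLE_ON_0_10 = {
       "hoodie.metadata.index.bloom.filter.enable",
       "hoodie.metadata.index.bloom.filter.parallelism",
       "hoodie.metadata.index.bloom.filter.column.list",
       "hoodie.bloom.index.use.metadata",
       "hoodie.metadata.index.column.stats.enable",
       "hoodie.metadata.index.column.stats.column.list",
       "hoodie.enable.data.skipping",
   }

   def trim_options_for_0_10(options: dict) -> dict:
       """Drop write configs that 0.10.x may not understand."""
       return {k: v for k, v in options.items() if k not in NOT_APPLICABLE_ON_0_10}
   ```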
   
   

