bvaradar commented on issue #1852:
URL: https://github.com/apache/hudi/issues/1852#issuecomment-663816363


   @ssomuah : Looking at the commit metadata, this is a case where your updates 
are spread across a large number of files. For example, in the latest commit, 334 
files see updates whereas only one file is newly created due to inserts. It 
looks like this is simply the nature of your workload. 
   
   If your record key has some sort of ordering, then you can initially 
bootstrap using "bulk-insert", which sorts and writes the data in record-key 
order. This can potentially reduce the number of files getting updated if 
each batch of writes has a similar ordering. You can also try recreating the 
dataset with a larger parquet file size, a higher small-file limit, and async 
compaction (run more frequently to keep the number of active log files in check). 
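   As a rough sketch (not from this thread; the table name, column names, base 
path, and size values are placeholders, and option names follow recent Hudi 
releases), the bulk-insert bootstrap plus the file-sizing and compaction knobs 
could look something like this via the Spark datasource. The compaction options 
shown are the inline variants, only to illustrate the frequency knob; truly 
async compaction is typically driven by DeltaStreamer in continuous mode or a 
separate compaction job.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-upsert-layout-sketch")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Hudi requires Kryo
  .getOrCreate()
import spark.implicits._

val basePath = "/tmp/hudi/my_table" // placeholder base path

// Options shared by both writes (placeholder table/column names).
val hudiCommonOpts = Map(
  "hoodie.table.name"                        -> "my_table",
  "hoodie.datasource.write.table.type"       -> "MERGE_ON_READ",
  "hoodie.datasource.write.recordkey.field"  -> "record_key",
  "hoodie.datasource.write.precombine.field" -> "ts"
)

// 1) One-time bootstrap with bulk_insert so the initial files are laid out in
//    record-key order.
val initial = Seq(("key-001", "a", 1L), ("key-002", "b", 1L))
  .toDF("record_key", "value", "ts")
initial.write.format("hudi")
  .options(hudiCommonOpts)
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .mode(SaveMode.Overwrite)
  .save(basePath)

// 2) Ongoing upserts with a larger target parquet file size, a higher
//    small-file limit, and compaction triggered after fewer delta commits so
//    the number of active log files stays in check.
val updates = Seq(("key-001", "a2", 2L)).toDF("record_key", "value", "ts")
updates.write.format("hudi")
  .options(hudiCommonOpts)
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.parquet.max.file.size", (512L * 1024 * 1024).toString)    // e.g. 512 MB
  .option("hoodie.parquet.small.file.limit", (200L * 1024 * 1024).toString) // e.g. 200 MB
  .option("hoodie.compact.inline", "true")
  .option("hoodie.compact.inline.max.delta.commits", "2")
  .mode(SaveMode.Append)
  .save(basePath)
```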
   
   However, you are basically trading a reduction in the number of files being 
appended to for more data being appended to each individual file. This is a 
general upsert problem that arises from the nature of your workload.

