bvaradar commented on issue #1852: URL: https://github.com/apache/hudi/issues/1852#issuecomment-663816363
@ssomuah : Looking at the commit metadata, this is a case where your updates are spread across a large number of files. For example, in the latest commit, 334 files see updates while only one file is newly created from inserts. That appears to be the nature of your workload.

If your record key has some natural ordering, you can initially bootstrap the table using "bulk-insert", which sorts and writes the data in record-key order. This can help reduce the number of files touched per commit if each batch of writes has a similar ordering.

You can also try recreating the dataset with a larger parquet file size, a higher small-file limit, and more frequent async compaction to keep the number of active log files in check. Essentially, you are trading fewer files being appended to for more data being appended to each file. This is a general upsert problem arising from the nature of your workload.
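For illustration, here is a minimal sketch of the kind of Spark datasource options I mean, assuming a MERGE_ON_READ table written through the Spark datasource; the table name, base path, field names ("key", "ts", "dt"), file sizes, and compaction frequency below are placeholders you would need to adapt to your setup:

```scala
import org.apache.spark.sql.SaveMode

// 1) One-time bootstrap with bulk_insert so base files are laid out in record-key order
existingDf.write.format("hudi").
  option("hoodie.table.name", "my_table").
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  option("hoodie.datasource.write.recordkey.field", "key").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT"). // sort by record key during the bootstrap write
  mode(SaveMode.Overwrite).
  save(basePath)

// 2) Subsequent upserts (same key/precombine/partition options as above) with larger
//    base files, a higher small-file limit, and more frequent compaction scheduling
//    to bound the number of active log files
incrementalDf.write.format("hudi").
  option("hoodie.table.name", "my_table").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.parquet.max.file.size", String.valueOf(256L * 1024 * 1024)).   // e.g. ~256 MB base files
  option("hoodie.parquet.small.file.limit", String.valueOf(200L * 1024 * 1024)). // bin-pack into files below this size
  option("hoodie.compact.inline.max.delta.commits", "2"). // schedule compaction every 2 delta commits
  mode(SaveMode.Append).
  save(basePath)
```

Whether compaction actually runs inline or asynchronously depends on how you write (datasource vs. DeltaStreamer) and your compaction settings; the delta-commit threshold above only controls how frequently compactions are scheduled.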