gudladona commented on issue #8199: URL: https://github.com/apache/hudi/issues/8199#issuecomment-1474500670
We may have an indicator of what is causing this problem. We have a small file limit of 100 MB, and this works well (it produces larger files and cleans up smaller ones) for an average partition that meets the size requirements. However, for a very busy/high-volume partition, it seems to over-bucket the inserts into many files: based on the average record size and the size of the new inserts, the writer always exceeds the file size limit and falls through to writing a new file group. For example, here is the number of file groups written for a single instant (commit) in this partition:
```
aws s3 ls s3://<prefix>/<table>/<tenant>/date=20230316/ | awk -F _ '{print $3}' | sort | uniq -c | sort -nk1 | tail
    167 20230316203454183.parquet
    168 20230316195218670.parquet
    168 20230316201208079.parquet
    170 20230316200728433.parquet
    175 20230316210557345.parquet
    180 20230316130454342.parquet
    182 20230316212237421.parquet
    211 20230316192405566.parquet
    245 20230316210251305.parquet
    263 20230316204926437.parquet
```
As you can see, the sheer number of small files in this partition produces a HUGE JSON response from the driver, thereby triggering OOM errors. We need help figuring out how to tune this.
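For context, these are the Hudi file-sizing knobs that typically govern this behavior; the values below are illustrative, not a recommended fix, and whether they help depends on the actual average record size in this table:
```
# Small-file threshold: files below this are candidates for bin-packing new inserts (bytes)
hoodie.parquet.small.file.limit=104857600
# Target max size of a base parquet file (bytes); the gap between this and the
# small-file limit bounds how many inserts can be packed into an existing file group
hoodie.parquet.max.file.size=125829120
# For COW tables, the per-record size estimate (bytes) used to decide how many
# records fit in a file; if this is far off from reality, the writer over- or
# under-buckets inserts into new file groups
hoodie.copyonwrite.record.size.estimate=1024
```
If the estimated record size is much larger than the true average, each bucket is sized too small and high-volume partitions will spill into many new file groups per commit, which matches the pattern above.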