HEPBO3AH commented on issue #7062: URL: https://github.com/apache/hudi/issues/7062#issuecomment-1304913263
Hello! Thank you for the reply.

> hey @HEPBO3AH : do you mean to say that, even after our fix https://github.com/apache/hudi/pull/6864, your avg record size estimate is wrong in some cases. And as a result your are running into OOM issue.

Because of data variability, the `AvgRecordSize` is still off sometimes. This is expected, but we do not think this value is the primary cause of the issue. We believe the cause is the combination of `AvgRecordSize` and Hudi appending to existing small files. As described in the original ticket, bypassing Hudi's feature of appending to existing files completely eliminated the issue for us, even when we tried to pack much more data into the files.

> also, I don't get this statement of yours "We noticed that the class HoodieMergeHandle is not being used due to PARQUET_SMALL_FILE_LIMIT = 0 and the job passes successfully.". can you help clarify?

In the Hudi source code, we see the following:

1. The [UpsertPartitioner.assignInserts](https://github.com/apache/hudi/blob/e088faac9eca747f74a48f12358a3bb6f66c21d5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L162) method calls [getSmallFilesForPartitions](https://github.com/apache/hudi/blob/e088faac9eca747f74a48f12358a3bb6f66c21d5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L268).
2. If `PARQUET_SMALL_FILE_LIMIT = 0`, this method returns an empty Map.
3. The code path in `HoodieMergeHandle` is not invoked when there are no files to be updated. This can be seen in the [try section of the BaseSparkCommitActionExecutor](https://github.com/apache/hudi/blob/e088faac9eca747f74a48f12358a3bb6f66c21d5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L326).
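To make the three steps concrete, here is a toy model in Java. It is only a sketch of the behaviour described in this comment, not Hudi's code: the class `SmallFileRoutingSketch`, the `Handle` enum, and the methods `smallFiles` and `handleFor` are all hypothetical names, files are reduced to their sizes in bytes, and the two conditions for taking the merge path are read as alternatives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: none of these names are Hudi classes or methods.
public class SmallFileRoutingSketch {

    // Hypothetical labels for "write a brand-new file" vs "rewrite an existing file".
    enum Handle { CREATE, MERGE }

    // Models step 2: with a limit of 0, no existing file is reported as small.
    static List<Long> smallFiles(List<Long> existingFileSizes, long smallFileLimitBytes) {
        if (smallFileLimitBytes == 0) {
            return Collections.emptyList();
        }
        List<Long> result = new ArrayList<>();
        for (long size : existingFileSizes) {
            if (size < smallFileLimitBytes) {
                result.add(size);
            }
        }
        return result;
    }

    // Models step 3: the merge path is taken only when an existing file is rewritten,
    // either because records are updated or because inserts are packed into a small file.
    static Handle handleFor(boolean hasUpdates, List<Long> smallFiles) {
        return (hasUpdates || !smallFiles.isEmpty()) ? Handle.MERGE : Handle.CREATE;
    }

    public static void main(String[] args) {
        List<Long> existing = Arrays.asList(10L * 1024 * 1024, 200L * 1024 * 1024);

        // Insert-only batch with the limit set to 0: no small files, so no merge path.
        System.out.println(handleFor(false, smallFiles(existing, 0L)));

        // Same batch with a 100 MB limit: the 10 MB file qualifies, so the merge path is used.
        System.out.println(handleFor(false, smallFiles(existing, 100L * 1024 * 1024)));
    }
}
```

In this model, setting the limit to 0 on an insert-only batch means nothing is ever routed to the merge path, which matches what we observed.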
`handleUpdate` is the method that creates the `HoodieMergeHandle`, which only happens when:

* there are updates to existing records
* small files to which data can be added were returned in steps 1 and 2

> whats the write operation you are using.

We are using `UPSERT`.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org