HEPBO3AH commented on issue #7062:
URL: https://github.com/apache/hudi/issues/7062#issuecomment-1304913263

   Hello! Thank you for the reply.
   
   > hey @HEPBO3AH : do you mean to say that, even after our fix 
https://github.com/apache/hudi/pull/6864, your avg record size estimate is 
wrong in some cases. And as a result your are running into OOM issue.
   
   Because of data variability, the `AvgRecordSize` estimate is still off 
sometimes. This is expected, and we do not think this value alone is the 
primary cause of the issue.
   We believe the cause is the combination of the `AvgRecordSize` estimate and 
Hudi appending new records to existing small files. As described in the 
original ticket, bypassing Hudi's small-file append feature completely 
eliminated the issue for us, even when we tried to pack much more data into 
each file.
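   
   For illustration, a minimal sketch of the write options behind that workaround. It assumes the usual Hudi option keys `hoodie.datasource.write.operation` and `hoodie.parquet.small.file.limit`; the class and method names are hypothetical and only build a plain map, they do not call Hudi.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the write options discussed in this comment.
// It only builds a map; passing it to a Spark writer is left to the caller.
class SmallFileWorkaroundOptions {

    static Map<String, String> build() {
        Map<String, String> opts = new LinkedHashMap<>();
        // The write operation we use.
        opts.put("hoodie.datasource.write.operation", "upsert");
        // A limit of 0 means no existing file counts as "small",
        // so inserts are not appended to existing files.
        opts.put("hoodie.parquet.small.file.limit", "0");
        return opts;
    }

    public static void main(String[] args) {
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

   The resulting map could be passed to the Spark writer's `options(...)` call alongside the rest of the job's Hudi settings.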
   
   > also, I don't get this statement of yours "We noticed that the class 
HoodieMergeHandle is not being used due to PARQUET_SMALL_FILE_LIMIT = 0 and the 
job passes successfully.". can you help clarify? 
   
   In the hoodie source code, we see the following:
   1. The [UpsertPartitioner.assignInserts](https://github.com/apache/hudi/blob/e088faac9eca747f74a48f12358a3bb6f66c21d5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L162) method calls [getSmallFilesForPartitions](https://github.com/apache/hudi/blob/e088faac9eca747f74a48f12358a3bb6f66c21d5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L268).
   2. If `PARQUET_SMALL_FILE_LIMIT = 0`, this method returns an empty Map.
   3. The code path in `HoodieMergeHandle` is not invoked when there are no 
files to be updated. This can be seen in the [try section of the 
BaseSparkCommitActionExecutor](https://github.com/apache/hudi/blob/e088faac9eca747f74a48f12358a3bb6f66c21d5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L326).
 `handleUpdate` is the method that creates the `HoodieMergeHandle`, and it is 
only called when at least one of the following holds:
       * there are updates to existing records 
       * steps 1 and 2 returned small files that new data can be added to
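   
   The three steps above can be sketched as a simplified model. This is not Hudi's code: `MergePathSketch`, `smallFiles`, and `usesMergePath` are hypothetical names, and the logic only mirrors our reading of the linked sources.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of the routing described in steps 1-3.
class MergePathSketch {

    // Stand-in for the small-file lookup: returns the ids of files below the
    // limit. With a limit of 0 nothing qualifies, so the result is empty.
    static List<String> smallFiles(Map<String, Long> fileSizesBytes, long smallFileLimitBytes) {
        List<String> result = new ArrayList<>();
        if (smallFileLimitBytes <= 0) {
            return result;
        }
        for (Map.Entry<String, Long> e : fileSizesBytes.entrySet()) {
            if (e.getValue() < smallFileLimitBytes) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // Stand-in for the decision to take the merge (update) path: it is taken
    // when records update existing files, or when inserts are assigned to
    // existing small files.
    static boolean usesMergePath(boolean hasUpdatesToExistingRecords, List<String> smallFiles) {
        return hasUpdatesToExistingRecords || !smallFiles.isEmpty();
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("file-1", 10L * 1024 * 1024);   // 10 MB
        files.put("file-2", 120L * 1024 * 1024);  // 120 MB

        long limitBytes = 100L * 1024 * 1024;     // illustrative 100 MB limit
        System.out.println("limit=100MB, inserts only: merge path = "
                + usesMergePath(false, smallFiles(files, limitBytes)));
        System.out.println("limit=0, inserts only: merge path = "
                + usesMergePath(false, smallFiles(files, 0L)));
        System.out.println("limit=0, with updates: merge path = "
                + usesMergePath(true, smallFiles(files, 0L)));
    }
}
```

   In this model, an insert-only batch with the limit set to 0 never reaches the merge path, which matches what we observe when the job passes.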
   
   
   > whats the write operation you are using.
   
   We are using `UPSERT`.

