[GitHub] [hudi] HEPBO3AH commented on issue #7062: [SUPPORT] Appeding to files during UPSERT causes executors to die due to memory issues.


HEPBO3AH commented on issue #7062:
URL: https://github.com/apache/hudi/issues/7062#issuecomment-1304913263

Hello! Thank you for the reply.

> hey @HEPBO3AH : do you mean to say that, even after our fix
> https://github.com/apache/hudi/pull/6864, your avg record size estimate is
> wrong in some cases. And as a result you are running into OOM issue.

Because of data variability, the `AvgRecordSize` estimate is still off
sometimes. This is expected, and we do not think this value is the primary
cause of the issue.
We believe the cause is the combination of an inaccurate `AvgRecordSize` and
Hudi appending to existing small files. As described in the original ticket,
bypassing Hudi's feature of appending to existing files completely eliminated
the issue for us, even when we tried to pack much more data into files.
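
For reference, the workaround amounts to writer options along these lines. This
is a minimal sketch, not our actual job code: the class and method names are
hypothetical, and the option keys are the ones we understand Hudi to use for
the write operation and for `PARQUET_SMALL_FILE_LIMIT`, so please check them
against your Hudi version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the write options used for the workaround.
public class SmallFileWorkaroundOptions {

    // Returns options that keep UPSERT but disable small-file handling.
    public static Map<String, String> build() {
        Map<String, String> opts = new LinkedHashMap<>();
        // Write operation used by the job.
        opts.put("hoodie.datasource.write.operation", "upsert");
        // Assumed key behind PARQUET_SMALL_FILE_LIMIT; 0 means no existing
        // file is treated as a small file to append to.
        opts.put("hoodie.parquet.small.file.limit", "0");
        return opts;
    }

    public static void main(String[] args) {
        // Print the options so they can be copied into a job definition.
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting map would be passed to the writer alongside the usual table
options.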

> also, I don't get this statement of yours "We noticed that the class
> HoodieMergeHandle is not being used due to PARQUET_SMALL_FILE_LIMIT = 0 and the
> job passes successfully.". can you help clarify?

In the Hudi source code, we see the following:
1. The
[UpsertPartitioner.assignInserts](https://github.com/apache/hudi/blob/e088faac9eca747f74a48f12358a3bb6f66c21d5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L162)
method calls
[getSmallFilesForPartitions](https://github.com/apache/hudi/blob/e088faac9eca747f74a48f12358a3bb6f66c21d5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L268).
2. If `PARQUET_SMALL_FILE_LIMIT = 0`, this method returns an empty Map.
3. The code path in `HoodieMergeHandle` is not invoked when there are no
files to be updated. This can be seen in the [try section of the
BaseSparkCommitActionExecutor](https://github.com/apache/hudi/blob/e088faac9eca747f74a48f12358a3bb6f66c21d5/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L326).
`handleUpdate` is the method that creates the `HoodieMergeHandle`, which only
happens when:
* there are updates to existing records
* small files that data can be appended to were returned in steps 1 and 2
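
To make the reasoning above concrete, here is a simplified model of that
control flow. It is only a sketch of our reading of the code, not Hudi's
actual implementation: the class, enum, and method names are hypothetical
stand-ins, the two bullet conditions are treated as alternatives, and the
size comparison is an assumption about how small files are selected.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the small-file decision; none of these names exist in Hudi.
public class SmallFileFlowSketch {

    public enum WritePath { MERGE_INTO_EXISTING_FILE, CREATE_NEW_FILE }

    // Stand-in for getSmallFilesForPartitions: with a limit of 0 (or less)
    // nothing is returned; otherwise files below the limit are candidates.
    public static List<String> findSmallFiles(long smallFileLimitBytes,
                                              Map<String, Long> fileSizes) {
        List<String> small = new ArrayList<>();
        if (smallFileLimitBytes <= 0) {
            return small;
        }
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            if (e.getValue() < smallFileLimitBytes) {
                small.add(e.getKey());
            }
        }
        return small;
    }

    // Stand-in for the choice between the merge path and writing new files:
    // the merge path is taken only if there are updates to existing records
    // or small files that inserts can be appended to.
    public static WritePath choosePath(boolean hasUpdatesToExistingRecords,
                                       List<String> smallFiles) {
        if (hasUpdatesToExistingRecords || !smallFiles.isEmpty()) {
            return WritePath.MERGE_INTO_EXISTING_FILE;
        }
        return WritePath.CREATE_NEW_FILE;
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("file-a.parquet", 20L * 1024 * 1024);   // 20 MB
        sizes.put("file-b.parquet", 120L * 1024 * 1024);  // 120 MB

        // Insert-only batch with small-file handling disabled.
        System.out.println(choosePath(false, findSmallFiles(0, sizes)));
        // Same batch with a 100 MB small-file limit.
        System.out.println(choosePath(false, findSmallFiles(100L * 1024 * 1024, sizes)));
    }
}
```

In this model, an insert-only batch with the limit set to 0 never reaches the
merge path, which matches what we observe in our job.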

> whats the write operation you are using.

We are using `UPSERT`.
