hudi-bot opened a new issue, #14788: URL: https://github.com/apache/hudi/issues/14788
**Problem:** When we use upsert on a MOR table, Hudi assigns one task per fileId that needs to be created or updated. In this situation, nearly one million tasks may be created, most of which simply append a few records to a fileId. This process can be slow, and a few skewed tasks appear.

**Reason:** Hudi calls hsync to guarantee data is stored durably. Calling hsync this many times against an HDFS cluster within two minutes or less leads to high disk IOPS. In addition, creating so many tasks incurs high scheduling overhead just to append two or three records to a file.

**TODO:**
- Option one: use hflush instead of hsync. This may lose data if all DataNodes shut down at the same time, but that has a quite low chance of occurring when HDFS is deployed across AZs.
- Option two: make the hsync call asynchronous and let more than one writing process run in the same task. This reduces the number of tasks but increases memory use.

I may first try option one as it is simple enough.

## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-1875
- Type: Improvement
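A minimal stdlib sketch of what option two could look like: instead of blocking the writing thread on every hsync, the durability call is handed to a background executor so one task can keep appending to several file groups, and all outstanding syncs are awaited before commit. `SyncingWriter` and `doSync` are hypothetical names for illustration, not Hudi or HDFS APIs; in a real writer, `doSync` would be `FSDataOutputStream#hsync` on the log file for that fileId.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncSyncSketch {
  static class SyncingWriter {
    private final ExecutorService syncPool;
    private final List<Future<?>> pendingSyncs = new ArrayList<>();
    private final AtomicInteger syncCount = new AtomicInteger();

    SyncingWriter(int threads) {
      this.syncPool = Executors.newFixedThreadPool(threads);
    }

    // Append records for a fileId, then schedule the durability call
    // asynchronously instead of blocking the writing thread on it.
    void appendAndScheduleSync(String fileId) {
      pendingSyncs.add(syncPool.submit(() -> doSync(fileId)));
    }

    // Placeholder for the real hsync on the log file of this fileId.
    private void doSync(String fileId) {
      syncCount.incrementAndGet();
    }

    // Before committing, wait for every outstanding sync so durability
    // is still guaranteed at commit time; returns the number of syncs.
    int awaitAllSyncs() throws Exception {
      for (Future<?> f : pendingSyncs) {
        f.get();
      }
      syncPool.shutdown();
      return syncCount.get();
    }
  }

  public static void main(String[] args) throws Exception {
    SyncingWriter writer = new SyncingWriter(4);
    // One task writes to ten file groups instead of ten tasks doing one each.
    for (int i = 0; i < 100; i++) {
      writer.appendAndScheduleSync("file-" + (i % 10));
    }
    System.out.println("syncs completed: " + writer.awaitAllSyncs());
  }
}
```

The trade-off the issue mentions shows up here directly: the `pendingSyncs` list and the thread pool are the extra memory this approach pays for fewer, larger tasks.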
