hudi-bot opened a new issue, #14788:
URL: https://github.com/apache/hudi/issues/14788

   Problem: When we use upsert on a MOR table, Hudi assigns one task per fileId that needs to be created or updated. In this situation, close to one million tasks may be created, most of which simply append a few records to a fileId. This process can be slow, and a few skewed tasks appear.
   
   Reason: Hudi uses hsync to guarantee that data is stored durably. Calling hsync this many times against an HDFS cluster within two minutes or less leads to high disk IOPS. In addition, creating so many tasks incurs high scheduling overhead relative to the small amount of work each one does (appending two or three records to a file).
   
   TODO: 
   
   Option one: use hflush instead of hsync. This may lead to data loss if all DataNodes shut down at the same time; however, that has quite a low chance of occurring when HDFS is deployed across availability zones.
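   The distinction maps to Hadoop's `Syncable` interface on `FSDataOutputStream`: `hflush()` only guarantees the data has left the client and is visible to new readers, while `hsync()` additionally forces it to disk on the DataNodes. A minimal JDK-only analogy (not the HDFS API itself) of the same trade-off, using the OS page cache in place of DataNode memory:

   ```java
   import java.io.IOException;
   import java.nio.ByteBuffer;
   import java.nio.channels.FileChannel;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;

   public class FlushVsSync {
       public static void main(String[] args) throws IOException {
           Path path = Files.createTempFile("hudi-flush-demo", ".log");
           try (FileChannel ch = FileChannel.open(path, StandardOpenOption.WRITE)) {
               // Analogue of hflush: the write hands bytes to the OS page cache.
               // They are visible to readers, but not yet guaranteed durable --
               // a crash of the whole node (all DNs, in the HDFS case) loses them.
               ch.write(ByteBuffer.wrap("append a few records\n".getBytes()));

               // Analogue of hsync: force(true) issues an fsync and waits for the
               // disk. This per-call latency and IOPS cost is exactly what hurts
               // when it is paid once per fileId across ~a million small appends.
               ch.force(true);
           }
           System.out.println(Files.size(path) + " bytes durable");
           Files.delete(path);
       }
   }
   ```

   Under option one, the writer would simply stop paying the `force`-style cost per append and accept the (AZ-mitigated) durability window.
   
   
   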
   
   Option two: make the hsync process asynchronous and let more than one writing process run in the same task. This reduces the number of tasks but increases memory use.
   
   I may try option one first, as it is simple enough.
   
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-1875
   - Type: Improvement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
