steveloughran commented on PR #6716: URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2064541193
@snvijaya we actually know the total number of subdirectories for the deletion! It is propagated via the manifests: each TA manifest includes the number of dirs as an IOStatistic, and the aggregate summary adds these all up. The number of paths under the job dir is that counter (`committer_task_directory_count`) plus any from failed task attempts. This means we could actually have a threshold of subdirectories above which we automatically switch to parallel delete (see the sketch below). I'm just going to pass this down and log it immediately before the cleanup kicks off, so if there are problems we get the diagnostics adjacent to the error.

Note that your details on retry timings imply that on a MapReduce job (rather than a Spark one) the progress() callback will not take place, so there's a risk that the job will actually time out. I don't think that's an issue in MR job actions; it is in task-side actions, where a heartbeat back to the MapReduce AM is required.
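To make the threshold idea concrete, here is a minimal sketch of how the decision could be wired up, assuming the aggregated statistics are available as an `org.apache.hadoop.fs.statistics.IOStatistics` instance. The helper class, the threshold constant, and the `failedTaskAttemptDirs` parameter are hypothetical illustrations, not part of this PR:

```java
import org.apache.hadoop.fs.statistics.IOStatistics;
import org.apache.hadoop.fs.statistics.IOStatisticsLogging;

/**
 * Hypothetical helper: decide whether job cleanup should use
 * parallel delete, based on the directory count aggregated
 * from the task attempt manifests.
 */
public final class CleanupPlanner {

  /** Illustrative threshold: above this many dirs, go parallel. */
  private static final long PARALLEL_DELETE_DIR_THRESHOLD = 1000;

  /** Counter name referenced in the comment above. */
  private static final String TASK_DIRECTORY_COUNT =
      "committer_task_directory_count";

  private CleanupPlanner() {
  }

  /**
   * @param aggregate aggregated IOStatistics from all TA manifests
   * @param failedTaskAttemptDirs extra dirs left by failed task attempts
   * @return true if cleanup should delete directories in parallel
   */
  public static boolean useParallelDelete(
      IOStatistics aggregate,
      long failedTaskAttemptDirs) {
    // Log the statistics before cleanup starts, so any failure
    // diagnostics sit adjacent to the error in the job log.
    System.out.println(
        IOStatisticsLogging.ioStatisticsToPrettyString(aggregate));
    long taskDirs = aggregate.counters()
        .getOrDefault(TASK_DIRECTORY_COUNT, 0L);
    long totalDirs = taskDirs + failedTaskAttemptDirs;
    return totalDirs > PARALLEL_DELETE_DIR_THRESHOLD;
  }
}
```

The choice of threshold value would presumably be a tunable option; the key point is only that the aggregated counter makes the switch automatic rather than requiring the user to guess.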