[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

ASF GitHub Bot (Jira) Thu, 18 Apr 2024 09:54:08 -0700


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838736#comment-17838736
 ]


ASF GitHub Bot commented on MAPREDUCE-7474:
-------------------------------------------

steveloughran commented on PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#issuecomment-2064541193

   @snvijaya we actually know the total number of subdirs for the deletion!
   
   it is propagated via the manifests: each TA manifest includes the #of dirs 
as an IOStatistic, the aggregate summary adds these all up.
   
   the number of paths under the job dir is that number (counter 
committer_task_directory_count ) + any of failed task attempts.
   
   which means we could actually have a threshold of how many subdirectories 
will trigger an automatic switch to parallel delete.
   
   I'm just going to pass this down and log immediately before the cleanup 
kicks off, so if there are problems we will get the diagnostics adjacent to the 
error.
   
   Note that your details on retry timings imply that on a mapreduce job 
(rather than spark one) the progress() callback will not take place -so there's 
a risk that the job will actually timeout. I don't think that's an issue in MR 
job actions, the way it is is in task-side actions where a heartbeat back to 
the MapRed AM is required.
   
   




> [ABFS] Improve commit resilience and performance in Manifest Committer
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7474
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7474
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.4.0, 3.3.6
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>
> * Manifest committer is not resilient to rename failures on task commit 
> without HADOOP-18012 rename recovery enabled. 
> * large burst of delete calls noted: are they needed?
> relates to HADOOP-19093 but takes a more minimal approach with goal of 
> changes in manifest committer only.
> Initial proposed changes
> * retry recovery on task commit rename, always (repeat save, delete, rename)
> * audit delete use and see if it can be pruned



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (MAPREDUCE-7474) [ABFS] Improve commit resilience and performance in Manifest Committer

Reply via email to