[ https://issues.apache.org/jira/browse/YARN-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated YARN-9877:
---------------------------------
    Labels: pull-request-available  (was: )

> Intermittent TIME_OUT of LogAggregationReport
> ---------------------------------------------
>
>                 Key: YARN-9877
>                 URL: https://issues.apache.org/jira/browse/YARN-9877
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: log-aggregation, resourcemanager, yarn
>    Affects Versions: 3.0.3, 3.3.0, 3.2.1, 3.1.3
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: YARN-9877.001.patch
>
>
> I noticed intermittent TIME_OUT results in some downstream log-aggregation
> based tests.
> Steps to reproduce:
> - Run an MR job:
> {code}
> hadoop jar hadoop-mapreduce/hadoop-mapreduce-client-jobclient-tests.jar sleep -Dmapreduce.job.queuename=root.default -m 10 -r 10 -mt 5000 -rt 5000
> {code}
> - Suppose the AM requests more containers but, as soon as they are
> allocated, realizes it does not need them. The containers' state
> changes are: ALLOCATED -> ACQUIRED -> RELEASED.
> Suppose also that these extra containers are allocated on a different node
> from the one hosting the other 21 containers (AM + 10 mappers + 10 reducers).
> - All the containers finish successfully, and so does the app. The log
> aggregation status for the whole app appears to be stuck in the RUNNING
> state.
> - After a while the final log aggregation status for the app changes to
> TIME_OUT.
> Root cause:
> - As the unused containers go through their state transitions in the RM's
> internal representation, {{RMAppImpl$AppRunningOnNodeTransition}}'s
> transition function is called. It calls
> {{RMAppLogAggregation$addReportIfNecessary}}, which forcefully adds a
> "NOT_START" LogAggregationStatus for this NodeId to the app, even though
> the app has no running container on that node.
> - The node's LogAggregationStatus is never updated to "SUCCEEDED" by the
> NodeManager because the app has no running container on that node (note
> that the AM released the containers immediately after acquisition). The
> LogAggregationStatus remains NOT_START until the timeout is reached. After
> that point the RM aggregates the LogAggregationReports for all the nodes,
> and although all the containers have SUCCEEDED state, one particular node
> has NOT_START, so the final log aggregation status is TIME_OUT.
> (I crawled the RM UI for the log aggregation statuses, and it was always
> NOT_START for this particular node.)
> This situation is highly unlikely, but has an estimated failure rate of
> ~0.8%, based on 1500 runs over a year on an unstressed cluster.
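
The root cause described above can be illustrated with a small, self-contained Java model. This is only a sketch under stated assumptions, not the ResourceManager's code: the class `LogAggregationSketch`, its methods, and the node names are hypothetical, and the aggregation rule is reduced to the behaviour the description states (every known node must report SUCCEEDED, otherwise the app stays RUNNING until the timeout and then becomes TIME_OUT).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the per-app log aggregation bookkeeping; not Hadoop code.
public class LogAggregationSketch {
    /** Simplified subset of the statuses named in the description. */
    public enum Status { NOT_START, RUNNING, SUCCEEDED, TIME_OUT }

    // Assumed per-app table: node id -> last known status for that node.
    private final Map<String, Status> reports = new LinkedHashMap<>();

    // Models the reported behaviour: any node the app touches gets a
    // NOT_START entry, even if no container ever actually runs there.
    public void appRunningOnNode(String nodeId) {
        reports.putIfAbsent(nodeId, Status.NOT_START);
    }

    // Models a NodeManager reporting its log aggregation status for the app.
    public void nodeReports(String nodeId, Status status) {
        reports.put(nodeId, status);
    }

    // Simplified aggregation: every node must be SUCCEEDED; otherwise the app
    // is RUNNING until the timeout is reached and TIME_OUT afterwards.
    public Status aggregate(boolean timeoutReached) {
        for (Status s : reports.values()) {
            if (s != Status.SUCCEEDED) {
                return timeoutReached ? Status.TIME_OUT : Status.RUNNING;
            }
        }
        return Status.SUCCEEDED;
    }

    public static void main(String[] args) {
        LogAggregationSketch app = new LogAggregationSketch();
        app.appRunningOnNode("node-1"); // hosts the 21 real containers
        app.appRunningOnNode("node-2"); // only ALLOCATED -> ACQUIRED -> RELEASED
        app.nodeReports("node-1", Status.SUCCEEDED);
        // node-2 never reports, because nothing ran there.
        System.out.println(app.aggregate(false)); // prints RUNNING
        System.out.println(app.aggregate(true));  // prints TIME_OUT
    }
}
```

In this model the stale NOT_START entry for `node-2` is the only thing preventing a SUCCEEDED result, which matches the observation that the RM UI always showed NOT_START for one particular node.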