[jira] [Created] (YARN-8012) Support Unmanaged Container Cleanup
Yuqi Wang created YARN-8012:
---
Summary: Support Unmanaged Container Cleanup
Key: YARN-8012
URL: https://issues.apache.org/jira/browse/YARN-8012
Project: Hadoop YARN
Issue Type: New Feature
Components: nodemanager
Affects Versions: 2.7.1
Reporter: Yuqi Wang
Assignee: Yuqi Wang
Fix For: 2.7.1

An *unmanaged container* is a container which is no longer managed by the NM. Thus, it cannot be managed by YARN either.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
# For container resources managed by YARN, such as the container job object and disk data:
** The NM service is disabled or removed on the node.
** The NM is unable to start up again on the node, e.g. because required configuration or resources cannot be made ready.
** The NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
** The NM has bugs, such as wrongly marking a live container as complete.
# For container resources unmanaged by YARN:
** The user breaks processes away from the container job object.
** The user creates VMs from the container job object.
** The user acquires other resources on the machine that are unmanaged by YARN, such as producing data outside the container folder.

*Bad impacts of unmanaged containers, such as:*
# Resources cannot be managed for YARN and the node:
** YARN and node resources leak.
** The container cannot be killed to release YARN resources on the node.
# Container and app killing is not eventually consistent for the user:
** A buggy app can still produce bad impacts externally, even long after the app was killed.

*Initial patch for review:*
For the initial patch, the unmanaged container cleanup feature on Windows can only clean up the container job object of the unmanaged container. Cleanup for more container resources will be supported later, and UTs will be added if the design is agreed upon.

The current container will be considered unmanaged when:
# The NM is dead:
** The check of whether the container is managed by the NM failed to complete within the timeout.
# The NM is alive but the container is org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or not found:
** The container is org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or not found in the NM container list.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
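The two detection conditions above can be sketched as a small decision helper. This is only a hypothetical illustration of the described logic; the class, method, and enum below are stand-ins, not actual YARN code.

```java
import java.util.Optional;

// Hypothetical sketch of the unmanaged-container check described above;
// none of these names are actual YARN classes.
public class UnmanagedContainerCheck {

    // Stand-in for org.apache.hadoop.yarn.api.records.ContainerState.
    public enum ContainerState { NEW, RUNNING, COMPLETE }

    /**
     * A container is considered unmanaged when:
     *  1) the NM is dead: the managed-by-NM check did not finish within the
     *     timeout, or
     *  2) the NM is alive but reports the container as COMPLETE, or does not
     *     have the container in its container list at all.
     *
     * @param nmCheckSucceededWithinTimeout false when the NM check timed out
     * @param reportedState state reported by the NM; empty when the container
     *                      is not found in the NM container list
     */
    public static boolean isUnmanaged(boolean nmCheckSucceededWithinTimeout,
                                      Optional<ContainerState> reportedState) {
        if (!nmCheckSucceededWithinTimeout) {
            return true; // NM dead: cannot confirm the container is managed
        }
        // NM alive: unmanaged if COMPLETE or missing from the container list
        return reportedState.map(s -> s == ContainerState.COMPLETE).orElse(true);
    }
}
```

Note that under this sketch an NM timeout alone marks the container unmanaged, which matches the "failed to check within timeout" condition in the description.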
[jira] [Created] (YARN-7872) labeled node cannot be used to satisfy locality specified request
Yuqi Wang created YARN-7872:
---
Summary: labeled node cannot be used to satisfy locality specified request
Key: YARN-7872
URL: https://issues.apache.org/jira/browse/YARN-7872
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler, capacityscheduler, resourcemanager
Affects Versions: 2.7.2
Reporter: Yuqi Wang
Assignee: Yuqi Wang
Fix For: 2.7.2, 2.8.0

A labeled node (i.e. a node with a 'not empty' node label) cannot be used to satisfy a locality-specified request (i.e. a container request with a 'not ANY' resource name and relaxLocality set to false).

For example, a node with available resources:
[Resource: [MemoryMB: [100] CpuNumber: [12]] NodeLabel: [persistent] HostName: {SRG} RackName: {/default-rack}]

And the container request:
[Priority: [1] Resource: [MemoryMB: [1] CpuNumber: [1]] NodeLabel: [null] HostNames: {SRG} RackNames: {} RelaxLocality: [false]]

The current RM capacity scheduler behaviour is: the node cannot allocate a container for the request, because the node label does not match during the leaf queue's container assignment.

However, node locality and node label should be two orthogonal dimensions for selecting candidate nodes for a container request. Node label matching should only be executed for container requests with the ANY resource name, since only that kind of container request is allowed to have a 'not empty' node label. So, for a container request with a 'not ANY' resource name (which, besides, should not have a node label), we should match the node by resource name instead of by node label. This should be safe, since a node that is not accessible to the queue will not be sent to the leaf queue.

The attachment is the fix according to this principle; please help to review. Without it, we cannot use locality to request containers within these labeled nodes.
If the fix is acceptable, we should also recheck whether the same issue happens in trunk.
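The proposed matching principle can be sketched as follows. The class and the simplified request fields are hypothetical stand-ins, not the real CapacityScheduler code.

```java
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch of the proposed matching principle; this is not the
// real CapacityScheduler code, and the request fields are heavily simplified.
public class LocalityLabelMatching {

    /**
     * Decide whether a node can serve a request. For a request with a
     * 'not ANY' resource name (specific host names, relaxLocality = false),
     * match by resource name and skip node label matching; node label
     * matching only applies to requests with the ANY resource name.
     */
    public static boolean nodeMatchesRequest(String nodeHostName,
                                             String nodeLabel,
                                             Set<String> requestedHostNames,
                                             boolean relaxLocality,
                                             String requestedNodeLabel) {
        boolean notAnyResourceName = !requestedHostNames.isEmpty() && !relaxLocality;
        if (notAnyResourceName) {
            // Locality-specified request: match by host name, ignore the label.
            return requestedHostNames.contains(nodeHostName);
        }
        // ANY resource name: node label matching is the deciding dimension.
        return Objects.equals(nodeLabel, requestedNodeLabel);
    }
}
```

Under this sketch, the JIRA's example (node SRG with label "persistent", request HostNames {SRG}, NodeLabel null, RelaxLocality false) is accepted via the host-name match, whereas pure label matching would reject it.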
[jira] [Created] (YARN-6959) RM may allocate wrong AM Container for new attempt
Yuqi Wang created YARN-6959:
---
Summary: RM may allocate wrong AM Container for new attempt
Key: YARN-6959
URL: https://issues.apache.org/jira/browse/YARN-6959
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Affects Versions: 2.7.1
Reporter: Yuqi Wang
Assignee: Yuqi Wang
Fix For: 3.0.0-alpha4, 2.7.1

*Issue Summary:*
A previous attempt's ResourceRequests may be recorded into the current attempt's ResourceRequests. These mis-recorded ResourceRequests may confuse the AM Container request and allocation for the current attempt.

*Issue Pipeline:*
{code:java}
// Executing precondition check for the incoming attempt id.
ApplicationMasterService.allocate() ->
scheduler.allocate(attemptId, ask, ...) ->

// The previous precondition check for the attempt id may be outdated here,
// i.e. the currentAttempt may not be the corresponding attempt of the attemptId.
// For example, the attempt id may correspond to the previous attempt.
currentAttempt = scheduler.getApplicationAttempt(attemptId) ->

// The previous attempt's ResourceRequests may be recorded into the current
// attempt's ResourceRequests.
currentAttempt.updateResourceRequests(ask) ->

// The RM may allocate a wrong AM Container for the current attempt, because its
// ResourceRequests may come from the previous attempt (which can be any
// ResourceRequests the previous AM asked for), and there is no matching logic
// between the original AM Container ResourceRequest and the returned
// amContainerAllocation below.
AMContainerAllocatedTransition.transition(...) ->
amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
{code}

*Patch Correctness:*
After this patch, the RM will always record ResourceRequests from different attempts into different SchedulerApplicationAttempt.AppSchedulingInfo objects. So, even if the RM still records ResourceRequests from an old attempt at any time, those ResourceRequests will be recorded into the old AppSchedulingInfo object and will not impact the current attempt's resource requests and allocation.
*Concerns:*
The getApplicationAttempt method in AbstractYarnScheduler is confusing; we should rename it to getCurrentApplicationAttempt, and reconsider whether there are any other bugs related to getApplicationAttempt.
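The patch's isolation idea can be sketched with a hypothetical per-attempt request store. The names below are illustrative stand-ins, not the real SchedulerApplicationAttempt or AppSchedulingInfo classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the patch's isolation idea: one request store per
// attempt id, so a late ask tagged with a previous attempt id can never
// leak into the current attempt's requests. Not the real YARN classes.
public class PerAttemptSchedulingInfo {

    // Stand-in for per-attempt AppSchedulingInfo objects, keyed by attempt id.
    private final Map<Integer, List<String>> requestsByAttempt = new HashMap<>();

    /** Record an ask into the store owned by the asking attempt only. */
    public void updateResourceRequests(int attemptId, String ask) {
        requestsByAttempt
            .computeIfAbsent(attemptId, id -> new ArrayList<>())
            .add(ask);
    }

    /** Requests recorded for one attempt; other attempts never see them. */
    public List<String> getResourceRequests(int attemptId) {
        return requestsByAttempt.getOrDefault(attemptId, List.of());
    }
}
```

With this shape, even if the scheduler processes a stale ask carrying the previous attempt id, it lands in the previous attempt's store and cannot alter the current attempt's AM Container request.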