[jira] [Created] (YARN-8012) Support Unmanaged Container Cleanup

2018-03-07 Thread Yuqi Wang (JIRA)
Yuqi Wang created YARN-8012:
---

 Summary: Support Unmanaged Container Cleanup
 Key: YARN-8012
 URL: https://issues.apache.org/jira/browse/YARN-8012
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Yuqi Wang
Assignee: Yuqi Wang
 Fix For: 2.7.1


An *unmanaged container* is a container which is no longer managed by the NM and thus can no longer be managed by YARN either.

*There are many cases in which a YARN-managed container can become unmanaged, such as:*
 # For container resources managed by YARN, such as the container job object and disk data:
 ** The NM service is disabled or removed on the node.
 ** The NM is unable to start up again on the node, e.g. its required configuration or resources cannot be made ready.
 ** The NM local leveldb store is corrupted or lost, e.g. due to bad disk sectors.
 ** The NM has bugs, such as wrongly marking a live container as complete.
 # For container resources not managed by YARN:
 ** The user breaks processes away from the container job object.
 ** The user creates VMs from within the container job object.
 ** The user acquires other resources on the machine which are not managed by YARN, such as producing data outside the container folder.

*Bad impacts of an unmanaged container include:*
 # Resources cannot be managed by YARN or the node:
 ** YARN and node resources leak.
 ** The container cannot be killed to release YARN resources on the node.
 # Container and application killing is not eventually consistent for the user:
 ** A buggy application can keep producing bad impacts on the outside world long after the application has been killed.

*Initial patch for review:*

In the initial patch, the unmanaged container cleanup feature is Windows-only and can only clean up the container job object of the unmanaged container. Cleanup for more container resources will be supported later, and unit tests will be added once the design is agreed on.
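
For illustration only, a minimal sketch of what that job object cleanup could look like, assuming winutils.exe is available on the node and that the job object is named after the container id (both are assumptions for illustration, not details confirmed by the patch):

{code:java}
import java.io.IOException;

/** Hypothetical helper that kills the Windows job object of an unmanaged container. */
public final class UnmanagedContainerCleaner {

  /**
   * Kills the job object named after the container id by shelling out to winutils.
   * Assumes "winutils.exe task kill <jobObjectName>" semantics; adjust to the
   * actual container executor conventions on the node.
   */
  public static void cleanupJobObject(String containerIdStr)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder("winutils.exe", "task", "kill", containerIdStr)
        .inheritIO()
        .start();
    int exitCode = p.waitFor();
    if (exitCode != 0) {
      throw new IOException("Failed to kill job object " + containerIdStr
          + ", winutils exit code: " + exitCode);
    }
  }
}
{code}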

The current container is considered unmanaged when (see the sketch after this list):
 # The NM is dead:
 ** The check of whether the container is managed by the NM fails to complete within a timeout.
 # The NM is alive but the container is org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or not found:
 ** The container is in state org.apache.hadoop.yarn.api.records.ContainerState#COMPLETE or is not found in the NM container list.
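
A minimal sketch of that check, assuming the caller has already tried to fetch the NM's container list (the helper name and the null-on-timeout convention are illustrative only, not part of the patch):

{code:java}
import java.util.List;
import java.util.Optional;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerReport;
import org.apache.hadoop.yarn.api.records.ContainerState;

/** Hypothetical check for whether a locally known container is unmanaged. */
public final class UnmanagedContainerCheck {

  /**
   * @param nmContainers the NM container list, or null if the NM could not be
   *                     queried within the timeout (i.e. the NM is considered dead)
   * @param containerId  the container observed locally on the node
   * @return true if the container should be treated as unmanaged
   */
  public static boolean isUnmanaged(List<ContainerReport> nmContainers,
                                    ContainerId containerId) {
    if (nmContainers == null) {
      // Case 1: NM is dead; the managed-by-NM check did not finish within the timeout.
      return true;
    }
    // Case 2: NM is alive; the container is COMPLETE or missing from the NM list.
    Optional<ContainerReport> report = nmContainers.stream()
        .filter(r -> r.getContainerId().equals(containerId))
        .findFirst();
    return !report.isPresent()
        || report.get().getContainerState() == ContainerState.COMPLETE;
  }
}
{code}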






[jira] [Created] (YARN-7872) labeled node cannot be used to satisfy locality specified request

2018-02-01 Thread Yuqi Wang (JIRA)
Yuqi Wang created YARN-7872:
---

 Summary: labeled node cannot be used to satisfy locality specified request
 Key: YARN-7872
 URL: https://issues.apache.org/jira/browse/YARN-7872
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, capacityscheduler, resourcemanager
Affects Versions: 2.7.2
Reporter: Yuqi Wang
Assignee: Yuqi Wang
 Fix For: 2.7.2, 2.8.0


A labeled node (i.e. a node with a non-empty node label) cannot be used to satisfy a locality-specified request (i.e. a container request with a non-ANY resource name and relax locality set to false).

For example:

The node with available resources:

[Resource: [MemoryMB: [100] CpuNumber: [12]] NodeLabel: [persistent] HostName: {SRG} RackName: {/default-rack}]

The container request:

[Priority: [1] Resource: [MemoryMB: [1] CpuNumber: [1]] NodeLabel: [null] HostNames: {SRG} RackNames: {} RelaxLocality: [false]]
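
For reference, a hedged sketch of how such a request might be issued from an AM through AMRMClient (the values mirror the example above; client initialization and AM registration are omitted for brevity):

{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LocalityRequestExample {
  public static void main(String[] args) {
    // Client lifecycle (init/start) and registerApplicationMaster are omitted here.
    AMRMClient<ContainerRequest> amRMClient = AMRMClient.createAMRMClient();

    // A hard-locality request: 1 MB / 1 vcore on host SRG only,
    // relaxLocality = false and no node label expression.
    ContainerRequest request = new ContainerRequest(
        Resource.newInstance(1, 1),          // MemoryMB: 1, CpuNumber: 1
        new String[] { "SRG" },              // HostNames: {SRG}
        null,                                // RackNames: {}
        Priority.newInstance(1),             // Priority: 1
        false);                              // RelaxLocality: false
    amRMClient.addContainerRequest(request);
  }
}
{code}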

 

The current RM capacity scheduler behaviour is:
The node cannot allocate a container for the request, because the node label does not match during the leaf queue's container assignment.

However, node locality and node label should be two orthogonal dimensions for selecting candidate nodes for a container request, and node label matching should only be executed for container requests with the ANY resource name, since only that kind of container request is allowed to carry a non-empty node label.

So, for a container request with a non-ANY resource name (which, besides, should not carry a node label), we should match the node against the resource name instead of the node label. This should be safe, since a node that is not accessible to the queue is never offered to the leaf queue. A sketch of this matching rule follows below.
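
A minimal sketch of the proposed matching rule, kept independent of the actual scheduler classes (the class, method, and parameter names are illustrative, not the patch itself; rack-level requests are omitted for brevity):

{code:java}
import org.apache.hadoop.yarn.api.records.ResourceRequest;

/** Illustrative matching rule: node locality and node label are orthogonal. */
public final class NodeMatchRule {

  /**
   * Decides whether a candidate node can serve a resource request.
   *
   * @param requestResourceName resource name of the request (a host name or ResourceRequest.ANY)
   * @param requestNodeLabel    node label expression of the request (may be null)
   * @param nodeHostName        host name of the candidate node
   * @param nodeLabel           label of the candidate node ("" if unlabeled)
   */
  public static boolean canServe(String requestResourceName,
                                 String requestNodeLabel,
                                 String nodeHostName,
                                 String nodeLabel) {
    if (ResourceRequest.ANY.equals(requestResourceName)) {
      // Only ANY requests may carry a node label, so only they are label-matched.
      String expr = requestNodeLabel == null ? "" : requestNodeLabel;
      return expr.equals(nodeLabel);
    }
    // Non-ANY (host-specific) requests: match by resource name and ignore the label.
    // Queue accessibility has already filtered out nodes the queue cannot use.
    return requestResourceName.equals(nodeHostName);
  }
}
{code}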

The attachment is a fix following this principle; please help to review it.

Without it, we cannot use locality to request containers on these labeled nodes.

If the fix is acceptable, we should also recheck whether the same issue exists in trunk.






[jira] [Created] (YARN-6959) RM may allocate wrong AM Container for new attempt

2017-08-07 Thread Yuqi Wang (JIRA)
Yuqi Wang created YARN-6959:
---

 Summary: RM may allocate wrong AM Container for new attempt
 Key: YARN-6959
 URL: https://issues.apache.org/jira/browse/YARN-6959
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.7.1
Reporter: Yuqi Wang
Assignee: Yuqi Wang
 Fix For: 3.0.0-alpha4, 2.7.1


*Issue Summary:*
A previous attempt's ResourceRequests may be recorded into the current attempt's ResourceRequests. These mis-recorded ResourceRequests may confuse the AM container request and allocation for the current attempt.

*Issue Pipeline:*

{code:java}
// Executing precondition check for the incoming attempt id.
ApplicationMasterService.allocate() ->

scheduler.allocate(attemptId, ask, ...) ->

// The previous precondition check for the attempt id may be outdated here,
// i.e. the currentAttempt may not be the attempt corresponding to the attemptId,
// such as when the attempt id corresponds to the previous attempt.
currentAttempt = scheduler.getApplicationAttempt(attemptId) ->

// The previous attempt's ResourceRequests may be recorded into the current
// attempt's ResourceRequests.
currentAttempt.updateResourceRequests(ask) ->

// RM may allocate the wrong AM Container for the current attempt, because its
// ResourceRequests may come from the previous attempt (i.e. any ResourceRequests
// the previous AM asked for), and there is no matching logic between the original
// AM Container ResourceRequest and the returned amContainerAllocation below.
AMContainerAllocatedTransition.transition(...) ->
amContainerAllocation = scheduler.allocate(currentAttemptId, ...)
{code}

*Patch Correctness:*
After this patch, the RM will always record ResourceRequests from different attempts into different SchedulerApplicationAttempt.AppSchedulingInfo objects. So, even if the RM still records ResourceRequests from an old attempt at some point, these ResourceRequests are recorded in the old AppSchedulingInfo object and will not impact the current attempt's resource requests and allocation. A sketch of the per-attempt bookkeeping idea follows below.
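
To make the per-attempt isolation concrete, here is a stand-alone sketch of the bookkeeping idea (not the actual SchedulerApplicationAttempt/AppSchedulingInfo code): requests are keyed by ApplicationAttemptId, so a stale attempt's asks never leak into the current attempt's records.

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

/** Illustrative per-attempt request bookkeeping, keyed by ApplicationAttemptId. */
public final class PerAttemptRequestStore {

  private final Map<ApplicationAttemptId, List<ResourceRequest>> requestsByAttempt =
      new HashMap<>();

  /** Records asks under the attempt id they were issued for, never the "current" one. */
  public void updateResourceRequests(ApplicationAttemptId attemptId,
                                     List<ResourceRequest> ask) {
    requestsByAttempt
        .computeIfAbsent(attemptId, id -> new ArrayList<>())
        .addAll(ask);
  }

  /** Requests recorded under a stale attempt never show up for the current attempt. */
  public List<ResourceRequest> getResourceRequests(ApplicationAttemptId attemptId) {
    return requestsByAttempt.getOrDefault(attemptId, Collections.emptyList());
  }
}
{code}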

*Concerns:*
The getApplicationAttempt function in AbstractYarnScheduler is confusing; we had better rename it to getCurrentApplicationAttempt, and reconsider whether there are any other bugs related to getApplicationAttempt.


