[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory

2016-03-23 Thread Sharad Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209689#comment-15209689
 ] 

Sharad Agarwal commented on YARN-4852:
--

Thanks Rohith. Should we consider adding a duplicate check on the RM side for 
completed containers as well, as we already do for launched ones? This would make 
it more foolproof and eliminate scenarios such as resync where the NM might still 
send duplicates.
We can open a new ticket for this.
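A rough sketch of what such an RM-side check could look like, mirroring the existing launchedContainers guard. Note that completedContainersTracked is a hypothetical new field, not something RMNodeImpl has today, and its entries would need eviction once the NM acks the completion:
{code}
// Hypothetical sketch: track completed container ids already forwarded to the
// scheduler, analogous to rmNode.launchedContainers for running containers.
// completedContainersTracked is an assumed new Set<ContainerId> on RMNodeImpl.
if (remoteContainer.getState() == ContainerState.RUNNING) {
  if (!rmNode.launchedContainers.contains(containerId)) {
    // Just launched container. RM knows about it the first time.
    rmNode.launchedContainers.add(containerId);
    newlyLaunchedContainers.add(remoteContainer);
  }
} else {
  // A finished container: only forward it the first time we see it completed.
  if (rmNode.completedContainersTracked.add(containerId)) {
    rmNode.launchedContainers.remove(containerId);
    completedContainers.add(remoteContainer);
  }
  // Without eviction after the NM acks the completion, this set grows unbounded.
}
{code}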

> Resource Manager Ran Out of Memory
> --
>
> Key: YARN-4852
> URL: https://issues.apache.org/jira/browse/YARN-4852
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.6.0
>Reporter: Gokul
> Attachments: threadDump.log
>
>
> Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut 
> itself down.
> Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% 
> of memory. Digging deeper, there are around 0.5 million UpdatedContainerInfo 
> objects (nodeUpdateQueue inside RMNodeImpl). These in turn contain around 1.7 
> million objects of YarnProtos$ContainerIdProto, ContainerStatusProto, 
> ApplicationAttemptIdProto, and ApplicationIdProto, each of which retains around 
> 1 GB of heap.
> Back-to-back full GCs kept happening; GC wasn't able to recover any heap and 
> the RM went OOM. The JVM dumped the heap before quitting, and we analyzed it. 
> The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 
> 20 minutes and went OOM.
> There was no spike in job submissions or container counts at the time the 
> issue occurred.





[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory

2016-03-23 Thread Sharad Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209535#comment-15209535
 ] 

Sharad Agarwal commented on YARN-4852:
--

Further analysis shows that we are seeing an exceptionally high number of "Null 
container completed..." log lines, somewhere between 100k and 200k every minute. 
This could be related to a lot of duplicate UpdatedContainerInfo objects for 
completed containers.
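For context, that message comes from the scheduler's completed-container path when it is handed a container it no longer tracks; the snippet below is a paraphrase of the null check in CapacityScheduler#completedContainer on the 2.6.x line, so surrounding details may differ. At 100k-200k such lines per minute, every duplicate completion is also costing a scheduler event:
{code}
// Paraphrased from CapacityScheduler#completedContainer (2.6.x).
// A completion event for a container the scheduler has already released,
// e.g. a duplicate report, just logs and returns here.
private synchronized void completedContainer(RMContainer rmContainer,
    ContainerStatus containerStatus, RMContainerEventType event) {
  if (rmContainer == null) {
    LOG.info("Null container completed...");
    return;
  }
  // ... normal handling for a container the scheduler still tracks
}
{code}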



[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory

2016-03-23 Thread Sharad Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209520#comment-15209520
 ] 

Sharad Agarwal commented on YARN-4852:
--

[~rohithsharma] the slowness in the scheduler still does not explain the build-up 
of UpdatedContainerInfo to 0.5 million objects in such a short span. 
UpdatedContainerInfo should only be created for newly launched or completed 
containers. 
Looking at the code in RMNodeImpl.StatusUpdateWhenHealthyTransition (branch 
2.6.0):
{code}
    for (ContainerStatus remoteContainer : statusEvent.getContainers()) {
      ContainerId containerId = remoteContainer.getContainerId();
      // ... (other checks elided)
      // Process running containers
      if (remoteContainer.getState() == ContainerState.RUNNING) {
        if (!rmNode.launchedContainers.contains(containerId)) {
          // Just launched container. RM knows about it the first time.
          rmNode.launchedContainers.add(containerId);
          newlyLaunchedContainers.add(remoteContainer);
        }
      } else {
        // A finished container
        rmNode.launchedContainers.remove(containerId);
        completedContainers.add(remoteContainer);
      }
    }
    if (newlyLaunchedContainers.size() != 0
        || completedContainers.size() != 0) {
      rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(
          newlyLaunchedContainers, completedContainers));
    }
{code}

The UpdatedContainerInfo above appears to be created every time a completed 
container appears in the container status (there is no check for whether it was 
already created in a previous update). Wouldn't this lead to a lot of duplicate 
UpdatedContainerInfo objects, putting further unnecessary stress on the 
scheduler?
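To see how quickly this adds up, here is a back-of-envelope sketch; the 1200 node count comes from the heap dump, while the heartbeat interval and the duration of stale re-reporting are assumptions:
{code}
// Back-of-envelope sketch (not RM code): if a completed container keeps being
// re-reported on every NM heartbeat, each heartbeat enqueues a fresh
// UpdatedContainerInfo on that node's nodeUpdateQueue.
public class DuplicateUpdateEstimate {
  public static void main(String[] args) {
    int nodes = 1200;                 // RMNodeImpl instances in the heap dump
    int heartbeatsPerMinute = 60;     // assumes the default 1s heartbeat interval
    int minutesOfDuplication = 7;     // assumed window of stale re-reports

    long updatedContainerInfos =
        (long) nodes * heartbeatsPerMinute * minutesOfDuplication;
    // ~504,000 objects, in line with the ~0.5 million seen in the heap dump,
    // without any increase in actual job or container counts.
    System.out.println("UpdatedContainerInfo objects ~ " + updatedContainerInfos);
  }
}
{code}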




[jira] [Commented] (YARN-270) RM scheduler event handler thread gets behind

2013-01-06 Thread Sharad Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545679#comment-13545679
 ] 

Sharad Agarwal commented on YARN-270:
-

bq. When all else fails, try to parallelize the scheduler dispatcher

Long term, I think this should be the solution. We need ordering of events only 
within a given event type, so this should be doable and will give the next level 
of scalability for both the AM and the RM.
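A minimal sketch of that idea, assuming ordering only matters within an event type: give each event type its own single-threaded executor so different types dispatch in parallel while events of one type stay ordered (class and method names below are illustrative, not the actual AsyncDispatcher API):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only: one single-threaded executor per event type keeps
// per-type ordering while letting different event types be handled in parallel.
public class PerTypeDispatcherSketch {
  private final Map<Class<?>, ExecutorService> executors = new ConcurrentHashMap<>();

  public void dispatch(Object event, Runnable handler) {
    // Events of the same class share one queue/thread, so their order is preserved.
    executors
        .computeIfAbsent(event.getClass(), t -> Executors.newSingleThreadExecutor())
        .execute(handler);
  }

  public void stop() {
    executors.values().forEach(ExecutorService::shutdown);
  }
}
{code}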

> RM scheduler event handler thread gets behind
> -
>
> Key: YARN-270
> URL: https://issues.apache.org/jira/browse/YARN-270
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 0.23.5
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> We had a couple of incidents on a 2800-node cluster where the RM scheduler 
> event handler thread got behind processing events and basically became 
> unusable.  It was still processing apps, but taking a long time (1 hr 45 
> minutes) to accept new apps.  This actually happened twice within 5 days.
> We are using the capacity scheduler and at the time had between 400 and 500 
> applications running.  There were another 250 apps in the SUBMITTED state in 
> the RM that the scheduler hadn't yet processed into the pending state.  We had 
> about 15 queues, none of them hierarchical.  We also had plenty of space left 
> on the cluster.
