I remember you asking this question before. Check whether your OS's OOM 
killer is killing the process.
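
One rough way to do that check is to scan the kernel log (for example the output of dmesg, or /var/log/messages on many Linux distributions) for the OOM killer's "Killed process" lines. The following is a minimal sketch, not a definitive tool: the function name find_oom_kills is hypothetical, and the regular expression assumes the common Linux wording "Killed process <pid> (<name>)", which varies between kernel versions.

```python
import re

# Assumed wording of the kernel's OOM-kill line, e.g.
#   "Killed process 4242 (java) total-vm:..., anon-rss:..."
# Some kernels insert extra fields between the PID and the name,
# so anything up to the first parenthesised name is skipped.
OOM_KILL_RE = re.compile(r"Killed process (\d+)\b.*?\(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, process_name) for every OOM-kill line found.

    `lines` is any iterable of kernel log lines, such as the lines
    of captured dmesg output.
    """
    kills = []
    for line in lines:
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

If a PID in the result matches the NM or RM JVM, and the timestamp is close to the point where its log stops, the OOM killer is the likely cause. A process killed this way gets no chance to log anything, which would fit the missing log message.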

+Vinod

On Mar 4, 2014, at 6:53 AM, Krishna Kishore Bonagiri <write2kish...@gmail.com> 
wrote:

> Hi,
>   I am running an application on a 2-node cluster. It tries to acquire all 
> the containers that are available on one of the nodes and the remaining 
> containers from the other node. When I run this application 
> continuously in a loop, either the NM or the RM gets killed at a random 
> point. There is no corresponding message in the log files.
> 
> On one occasion today when the NM got killed, the tail of its log looked 
> like this:
> 
> 2014-03-04 02:42:44,386 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: 
> isredeng:52867 sending out status for 16 containers
> 2014-03-04 02:42:44,386 DEBUG 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's 
> health-status : true,
> 
> 
> At the time of the NM's crash, the RM's log had the following entries:
> 
> 2014-03-04 02:42:40,371 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing 
> isredeng:52867 of type STATUS_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Dispatching the event 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
>  NODE_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server 
> Responder: responding to 
> org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 
> 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
> 2014-03-04 02:42:40,371 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  nodeUpdate: isredeng:52867 clusterResources: 
> <memory:16384, vCores:16>
> 2014-03-04 02:42:40,371 DEBUG 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Node being looked for scheduling isredeng:52867 
> availableResource: <memory:0, vCores:-8>
> 2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server:  got #151
> 
> 
> Note: The node on which the NM got killed is named isredeng. Do the messages 
> above indicate anything about why it got killed?
> 
> Thanks,
> Kishore
> 
> 
> 


