Li Cheng created HDDS-3559:
------------------------------

             Summary: Datanode doesn't handle java heap OutOfMemory exception 
                 Key: HDDS-3559
                 URL: https://issues.apache.org/jira/browse/HDDS-3559
             Project: Hadoop Distributed Data Store
          Issue Type: Bug
          Components: Ozone Datanode
    Affects Versions: 0.5.0
            Reporter: Li Cheng


2020-05-05 15:47:41,568 [Datanode State Machine Thread - 167] WARN 
org.apache.hadoop.ozone.container.common.statemachine.Endpoi
ntStateMachine: Unable to communicate to SCM server at host-10-51-87-181:9861 
for past 0 seconds.
java.io.IOException: com.google.protobuf.ServiceException: 
java.lang.OutOfMemoryError: Java heap space
        at 
org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47)
        at 
org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:118)
        at 
org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.sendHeartbeat(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:148)
        at 
org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:145)
        at 
org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:76)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: 
Java heap space
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.getReturnMessage(ProtobufRpcEngine.java:293)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:270)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy38.submitRequest(Unknown Source)
        at 
org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:116)
 
On a cluster, one datanode stops reporting to SCM while being kept unknown. The 
datanode process is still working. Log shows Java heap OOM when it's 
serializing protobuf for rpc message. However, datanode silently stops reports 
to SCM and the process becomes stale.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org

Reply via email to