Li Cheng created HDDS-3559: ------------------------------ Summary: Datanode doesn't handle java heap OutOfMemory exception Key: HDDS-3559 URL: https://issues.apache.org/jira/browse/HDDS-3559 Project: Hadoop Distributed Data Store Issue Type: Bug Components: Ozone Datanode Affects Versions: 0.5.0 Reporter: Li Cheng
2020-05-05 15:47:41,568 [Datanode State Machine Thread - 167] WARN org.apache.hadoop.ozone.container.common.statemachine.Endpoi ntStateMachine: Unable to communicate to SCM server at host-10-51-87-181:9861 for past 0 seconds. java.io.IOException: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.ipc.ProtobufHelper.getRemoteException(ProtobufHelper.java:47) at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:118) at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.sendHeartbeat(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:148) at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:145) at org.apache.hadoop.ozone.container.common.states.endpoint.HeartbeatEndpointTask.call(HeartbeatEndpointTask.java:76) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: com.google.protobuf.ServiceException: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.getReturnMessage(ProtobufRpcEngine.java:293) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:270) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at com.sun.proxy.$Proxy38.submitRequest(Unknown Source) at org.apache.hadoop.ozone.protocolPB.StorageContainerDatanodeProtocolClientSideTranslatorPB.submitRequest(StorageContainerDatanodeProtocolClientSideTranslatorPB.java:116) On a cluster, one datanode stops reporting to SCM while being kept unknown. The datanode process is still working. Log shows Java heap OOM when it's serializing protobuf for rpc message. However, datanode silently stops reports to SCM and the process becomes stale. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org