Hi guys,

Environment: Hadoop 2.3-cdh5.0.2

I have a cluster of about sixty nodes. nn1 and nn2 are the HA NameNodes. dn1-dn58 are the data nodes, each running a DataNode and a NodeManager.
The NodeManager on one of the data nodes keeps crashing after it has executed some containers, sometimes after a few hours, sometimes after only a few minutes. Its Hadoop configuration is the same as on the other data nodes. Its kernel parameters are not, because I have been tuning them while chasing this issue. I have spent a lot of time investigating, but have found no solution. This is driving me crazy.

I have tuned these Linux kernel parameters:

vm.overcommit_memory=1
vm.swappiness = 20
#for dmesg page allocate failure
vm.zone_reclaim_mode = 1
vm.min_free_kbytes = 65536

I also changed the NodeManager's GC policy from gencon to optthruput.

Process log:

2015-12-16 17:24:39,663 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error [id: 0x34ed4c97, /172.19.206.148:34641 => /172.19.206.142:8080] EXCEPTION: java.lang.ArrayIndexOutOfBoundsException
2015-12-16 17:24:39,663 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Container Monitor,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.inflateTable(HashMap.java:328)
    at java.util.HashMap.<init>(HashMap.java:308)
    at org.apache.hadoop.yarn.util.ProcfsBasedProcessTree.updateProcessTree(ProcfsBasedProcessTree.java:154)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:390)
2015-12-16 17:24:39,666 INFO org.apache.hadoop.util.ExitUtil: Halt with status -1 Message: HaltException

Before the crash there are always errors like these:

2015-12-16 17:24:35,336 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 19947 for container-id container_1448915696877_23390_01_000037: 102.6 MB of 2 GB physical memory used; 2.1 GB of 4.2 GB virtual memory used
2015-12-16 17:24:38,379 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Uncaught exception in ContainerMemoryManager while managing memory of container_1448915696877_23390_01_000543
java.lang.IllegalArgumentException: disparate values
    at sun.misc.FDBigInt.quoRemIteration(FloatingDecimal.java:2931)
    at sun.misc.FormattedFloatingDecimal.dtoa(FormattedFloatingDecimal.java:922)
    at sun.misc.FormattedFloatingDecimal.<init>(FormattedFloatingDecimal.java:542)
    at java.util.Formatter$FormatSpecifier.print(Formatter.java:3264)
    at java.util.Formatter$FormatSpecifier.print(Formatter.java:3202)
    at java.util.Formatter$FormatSpecifier.printFloat(Formatter.java:2769)
    at java.util.Formatter$FormatSpecifier.print(Formatter.java:2720)
    at java.util.Formatter.format(Formatter.java:2500)
    at java.util.Formatter.format(Formatter.java:2435)
    at java.lang.String.format(String.java:2148)
    at org.apache.hadoop.util.StringUtils.format(StringUtils.java:123)
    at org.apache.hadoop.util.StringUtils$TraditionalBinaryPrefix.long2String(StringUtils.java:758)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.formatUsageString(ContainersMonitorImpl.java:487)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl$MonitoringThread.run(ContainersMonitorImpl.java:399)
2015-12-16 17:24:38,516 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Uncaught exception in ContainerMemoryManager while managing memory of container_1448915696877_23390_01_000374
java.lang.ArrayIndexOutOfBoundsException
    at sun.misc.FormattedFloatingDecimal.dtoa(FormattedFloatingDecimal.java:848)
    at sun.misc.FormattedFloatingDecimal.<init>(FormattedFloatingDecimal.java:542)
    at java.util.Formatter$FormatSpecifier.print(Formatter.java:3264)
    at java.util.Formatter$FormatSpecifier.print(Formatter.java:3202)
    at java.util.Formatter$FormatSpecifier.printFloat(Formatter.java:2769)
    at java.util.Formatter$FormatSpecifier.print(Formatter.java:2720)
    at java.util.Formatter.format(Formatter.java:2500)

Best Regards,
Evan Yao