Hi,

> cluster is scaled up to, say, 500 nodes, a very large percentage of the jobs
> fail due to either container timeout or attempt timeout and we're trying to
> figure out what might be causing this problem.

After you've configured your heartbeat settings according to what Sid said,
you need to do a basic sweep of the cluster TCP configs.

> it looks like the ping that's supposed to be coming from the
> attempt/container JVM isn't happening

During the 700 node concurrency tests, I found some kernel + Hadoop configs
which have a huge impact on IPC performance.

Kernel somaxconn (sysctl -a | grep somaxconn): this caps the listen backlog of
each server socket, i.e. how many incoming connections can queue up waiting to
be accepted, and that demand scales roughly as O(N^2) with the number of
nodes. My clusters all have >32 GB of memory, so that's kicked up from the
default 128 -> 16000.
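To be concrete about the kernel side, here's a sketch of what I set (16000 is
the value I use on the big-memory boxes; net.ipv4.tcp_max_syn_backlog is a
companion knob for half-open connections that I bump alongside it, my habit
rather than a hard requirement):

# runtime change, as root
$ sysctl -w net.core.somaxconn=16000
$ sysctl -w net.ipv4.tcp_max_syn_backlog=16000

# persist across reboots
$ echo "net.core.somaxconn = 16000" >> /etc/sysctl.conf
$ echo "net.ipv4.tcp_max_syn_backlog = 16000" >> /etc/sysctl.conf
$ sysctl -p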
Changing just that does help a bit, but there are linked parameters which need
to be changed as well, and some of the scale options might not be available on
CDH.

https://issues.apache.org/jira/browse/HADOOP-1849 (uses 128, increase to match
the kernel setting)
+
https://issues.apache.org/jira/browse/MAPREDUCE-6763 (YARN shuffle handler)

You effectively need to restart the NodeManagers and the NameNode with a
higher listen queue, to allow for high concurrency of connections (there's a
rough sketch of the relevant knobs at the end of this mail).

AFAIK, CDH doesn't ship with any of the performance fixes for multi-threaded
IPC; however, if you don't have a latency issue you can just lower the
heartbeat frequency.

https://issues.apache.org/jira/browse/HADOOP-11770

> Does anyone know if this is/was a problem with Tez 0.8.4? Or maybe it's a
> Cloudera/RM/cluster issue? Any suggestions on what to look for?

Here's the counter I use to detect lost heartbeats:

$ netstat -s | grep overflow
    0 listen queue overflow

and

$ dmesg | grep cookies

to see if TCP SYN cookies were used.

There is a way to turn these overflows from silent drops into actual errors
which bubble up to the JVM:

$ sysctl -a | grep tcp_abort_on_overflow

And if you make the config changes, you can verify that they took effect by
doing

$ ss -tl | grep 16000
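Since those netstat counters are cumulative, what I actually do during a test
run is watch them; and setting tcp_abort_on_overflow to 1 is the switch I was
referring to (it turns the silent drop into a connection reset, so the error
surfaces in the JVM instead of a timeout). Roughly:

$ sysctl -w net.ipv4.tcp_abort_on_overflow=1
$ watch -n 10 'netstat -s | egrep "overflow|SYNs to LISTEN"'

Turning abort_on_overflow on is a diagnostic aid more than a fix -- it makes
the failure loud, it doesn't make the queue any bigger.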
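And for the Hadoop side promised above: if I remember the property names right
(double-check them against your Hadoop/CDH version), HADOOP-1849 is
ipc.server.listen.queue.size in core-site.xml (default 128), and
MAPREDUCE-6763 adds the equivalent for the shuffle handler. Assuming the usual
/etc/hadoop/conf layout, you can check what a node is actually running with:

$ hdfs getconf -confKey ipc.server.listen.queue.size
$ grep -B1 -A1 "listen.queue" /etc/hadoop/conf/*-site.xml

If that still says 128, you're on the default.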
Cheers,
Gopal