Hi,

> cluster is scaled up to, say, 500 nodes, a very large percentage of the jobs
> fail due to either container timeout or attempt timeout and we're trying to
> figure out what might be causing this problem.

After you've configured your heartbeat settings according to what Sid said,
you need to do a basic sweep of the cluster TCP configs.

> it looks like the ping that's supposed to be coming from the
> attempt/container JVM isn't happening

During the 700 node concurrency tests, I found some kernel + Hadoop configs
which have a huge impact on IPC performance.

Kernel somaxconn (sysctl -a | grep somaxconn): this caps the listen backlog of
each server socket, i.e. how many incoming connections can queue up waiting to
be accepted, and that demand scales roughly as O(N^2) with the number of
nodes. My clusters all have >32 GB of memory, so that's kicked up from the
default 128 -> 16000.
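To be concrete about the kernel side, here's a sketch of what I set (16000 is
the value I use on the big-memory boxes; net.ipv4.tcp_max_syn_backlog is a
companion knob for half-open connections that I bump alongside it, my habit
rather than a hard requirement):

# runtime change, as root
$ sysctl -w net.core.somaxconn=16000
$ sysctl -w net.ipv4.tcp_max_syn_backlog=16000

# persist across reboots
$ echo "net.core.somaxconn = 16000" >> /etc/sysctl.conf
$ echo "net.ipv4.tcp_max_syn_backlog = 16000" >> /etc/sysctl.conf
$ sysctl -p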
Changing just that does help a bit, but there are linked parameters which need
to be changed as well, and some of the scale options might not be available on
CDH.

https://issues.apache.org/jira/browse/HADOOP-1849 (uses 128, increase to match
the kernel setting)
+
https://issues.apache.org/jira/browse/MAPREDUCE-6763 (YARN shuffle handler)

You effectively need to restart the NodeManagers and the NameNode with a
higher listen queue, to allow for high concurrency of connections (there's a
rough sketch of the relevant knobs at the end of this mail).

AFAIK, CDH doesn't ship with any of the performance fixes for multi-threaded
IPC; however, if you don't have a latency issue you can just lower the
heartbeat frequency.

https://issues.apache.org/jira/browse/HADOOP-11770

> Does anyone know if this is/was a problem with Tez 0.8.4? Or maybe it's a
> Cloudera/RM/cluster issue? Any suggestions on what to look for?

Here's the counter I use to detect lost heartbeats:

$ netstat -s | grep overflow
    0 listen queue overflow

and

$ dmesg | grep cookies

to see if TCP SYN cookies were used.

There is a way to turn these overflows from silent drops into actual errors
which bubble up to the JVM:

$ sysctl -a | grep tcp_abort_on_overflow

And if you make the config changes, you can verify that they took effect by
doing

$ ss -tl | grep 16000
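Since those netstat counters are cumulative, what I actually do during a test
run is watch them; and setting tcp_abort_on_overflow to 1 is the switch I was
referring to (it turns the silent drop into a connection reset, so the error
surfaces in the JVM instead of a timeout). Roughly:

$ sysctl -w net.ipv4.tcp_abort_on_overflow=1
$ watch -n 10 'netstat -s | egrep "overflow|SYNs to LISTEN"'

Turning abort_on_overflow on is a diagnostic aid more than a fix -- it makes
the failure loud, it doesn't make the queue any bigger.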
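And for the Hadoop side promised above: if I remember the property names right
(double-check them against your Hadoop/CDH version), HADOOP-1849 is
ipc.server.listen.queue.size in core-site.xml (default 128), and
MAPREDUCE-6763 adds the equivalent for the shuffle handler. Assuming the usual
/etc/hadoop/conf layout, you can check what a node is actually running with:

$ hdfs getconf -confKey ipc.server.listen.queue.size
$ grep -B1 -A1 "listen.queue" /etc/hadoop/conf/*-site.xml

If that still says 128, you're on the default.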
Cheers,
Gopal