Wow, very helpful and specific things for us to try! Thanks, Sid and
Gopal. I've changed the heartbeat settings, though I don't know whether
they've been deployed yet, and I've asked devops to look at all of the
OS-level config mentioned.
One thing I did find: in both of the timeout instances I have
consolidated logs for, a container is launched and then stopped ~327
seconds later. I have no idea where the 327 seconds comes from:
2017-08-01 18:20:41,825 [INFO] [ContainerLauncher #0]
|launcher.TezContainerLauncherImpl|: Launching
container_1501546028075_0555_01_000002
...
2017-08-01 18:26:09,100 [INFO] [Dispatcher thread {Central}]
|history.HistoryEventHandler|:
[HISTORY][DAG:dag_1501546028075_0555_1][Event:CONTAINER_STOPPED]:
containerId=container_1501546028075_0555_01_000002,
stoppedTime=1501611969100, exitStatus=-1000
From browsing the code, it looks like that exitStatus might be coming
from the RPC call, and if so, -1000 maps to INVALID in the
protoc-generated ContainerExitStatusProto enum in YarnProtos.java, but
I'm just fishing around so I could be wrong.
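In case it's useful to anyone digging through similar logs, a quick way
to tally exit statuses across a consolidated AM log is something like
this (the filename is just a placeholder):

$ grep 'Event:CONTAINER_STOPPED' am-syslog.log \
    | grep -o 'exitStatus=[-0-9]*' | sort | uniq -c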
On Tue, Aug 1, 2017 at 3:05 PM, Gopal Vijayaraghavan <[email protected]>
wrote:
> Hi,
>
> > cluster is scaled up to, say, 500 nodes, a very large percentage of the
> jobs fail due to either container timeout or attempt timeout and we're
> trying to figure out what might be causing this problem.
>
> After you've configured your heartbeat settings according to what Sid
> said, you need to do a basic sweep of the cluster TCP configs.
>
> > it looks like the ping that's supposed to be coming from the
> attempt/container JVM isn't happening
>
> During the 700-node concurrency tests, I found some kernel + Hadoop
> configs which have a huge impact on IPC performance.
>
> Kernel somaxconn (sysctl -a | grep somaxconn): this caps the listen
> backlog of pending connections on each server socket, and the demand on
> it grows roughly with O(N^2) of the nodes. My clusters all have >32GB
> of memory, so that's kicked up from 128 -> 16000.
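>
> The usual way to check and raise it on a stock Linux box (128 is the
> common default; persist the setting in /etc/sysctl.conf or under
> /etc/sysctl.d/ or it resets on reboot):
>
> $ sysctl net.core.somaxconn
> net.core.somaxconn = 128
> $ sudo sysctl -w net.core.somaxconn=16000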
>
> Changing just that does help a bit, but there are linked parameters
> which need to be changed along with it; note that some of the scale
> options might not be available on CDH.
>
> https://issues.apache.org/jira/browse/HADOOP-1849 (uses 128, increase to
> match kernel)
> +
> https://issues.apache.org/jira/browse/MAPREDUCE-6763 (YARN shuffle
> handler)
>
> You effectively need to restart the NodeManagers and the NameNode with
> a higher listen queue to allow for a high concurrency of connections; a
> sketch of the config side is below.
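>
> A sketch of the config side, assuming the property names those two
> JIRAs refer to are ipc.server.listen.queue.size (core-site.xml) and
> mapreduce.shuffle.listen.queue.size (the NodeManager side); double-check
> both names against your CDH build:
>
> <property>
>   <name>ipc.server.listen.queue.size</name>
>   <value>16000</value>
> </property>
> <property>
>   <name>mapreduce.shuffle.listen.queue.size</name>
>   <value>16000</value>
> </property>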
>
> AFAIK, CDH doesn't ship with any of the performance fixes for
> multi-threaded IPC; however, if you don't have a latency issue, you can
> just lower the heartbeat frequency instead:
>
> https://issues.apache.org/jira/browse/HADOOP-11770
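>
> For reference, the Tez-side knob for how often tasks heartbeat to the
> AM should be something like the property below in tez-site.xml (raising
> the interval lowers the heartbeat rate; double-check the exact name and
> default against your 0.8.4 build):
>
> <property>
>   <name>tez.task.am.heartbeat.interval-ms.max</name>
>   <value>250</value>
> </property>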
>
> > Does anyone know if this is/was a problem with Tez 0.8.4? Or maybe it's
> a Cloudera/RM/cluster issue? Any suggestions on what to look for?
>
> Here's the counter I use to detect lost heartbeats.
>
> $ netstat -s | grep overflow
> 0 listen queue overflow
>
> And to check whether SYN cookies were being sent (another sign that the
> listen queue was overflowing):
>
> $ dmesg | grep cookies
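>
> When cookies do kick in, the kernel message looks roughly like this
> (exact wording varies by kernel version, and 8020 is just an example
> port):
>
> possible SYN flooding on port 8020. Sending cookies.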
>
> There is a way to turn these overflows from silent drops into hard
> connection resets, so they bubble up to the JVM as actual errors:
>
> $ sysctl -a | grep tcp_abort_on_overflow
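>
> Flipping it on is a one-liner (sketch; the file name under
> /etc/sysctl.d/ is just an example). With this set, the kernel sends a
> reset instead of silently dropping the connection when the accept queue
> is full:
>
> $ sudo sysctl -w net.ipv4.tcp_abort_on_overflow=1
> $ echo 'net.ipv4.tcp_abort_on_overflow = 1' | \
>     sudo tee /etc/sysctl.d/90-tcp-overflow.conf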
>
> And if you make the config changes, you can verify that your changes took
> effect by doing
>
> $ ss -tl | grep 16000
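>
> On a listening socket, ss reports the configured backlog in the Send-Q
> column, so after the restarts you'd expect lines roughly like the one
> below (8020 is just the default NameNode IPC port; your ports will
> vary):
>
> LISTEN     0      16000      *:8020      *:*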
>
> Cheers,
> Gopal