Hi,
We're running a Cloudera Hadoop cluster (5.8.x, I believe) that we scale up and down as needed. The only jobs running on the cluster are Tez jobs (version 0.8.4). When the cluster is small (about 200 nodes or fewer), things work reasonably well, but when it is scaled up to, say, 500 nodes, a very large percentage of the jobs fail with either a container timeout or an attempt timeout. We're trying to figure out what might be causing this.
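As a point of reference, a heartbeat timeout of this kind can be sketched roughly as follows. This is a simplified illustration, not Tez's actual code: the `HeartbeatMonitor` class, its methods, and the attempt IDs are all hypothetical, and a fake clock is used so the example runs instantly.

```python
import time


class HeartbeatMonitor:
    """Hypothetical liveness tracker (not part of Tez).

    Each registered attempt must ping within `timeout_s` seconds,
    otherwise it is reported as expired.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the example can fake time
        self.last_ping = {}         # attempt id -> time of last ping

    def register(self, attempt_id):
        # Registration counts as the first ping.
        self.last_ping[attempt_id] = self.clock()

    def ping(self, attempt_id):
        # Only known attempts can refresh their timestamp.
        if attempt_id in self.last_ping:
            self.last_ping[attempt_id] = self.clock()

    def expired(self):
        # Attempts whose last ping is older than the timeout.
        now = self.clock()
        return sorted(
            a for a, t in self.last_ping.items() if now - t > self.timeout_s
        )


# Fake clock measured in seconds (an assumption for the demo only).
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=30 * 60, clock=lambda: now[0])
monitor.register("attempt_1")
monitor.register("attempt_2")

now[0] = 20 * 60            # 20 minutes in: only attempt_1 pings
monitor.ping("attempt_1")

now[0] = 40 * 60            # 40 minutes in: attempt_2 has been silent too long
print(monitor.expired())    # ['attempt_2']
```

The point of the sketch is that an attempt can be healthy and still time out if its pings are not sent, or are sent but never reach or get processed by the side doing the tracking.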
The timeouts are set to 30 minutes. From looking at the Tez code that raises the timeout error, it looks like the ping that is supposed to come from the attempt/container JVM isn't arriving. The failure has occurred on the initial input node in the DAG, so it is not necessarily caused by intra-DAG communication problems.
The Tez jobs are custom code (we're not using Tez for Hive queries), and some of the processing on key/value records can take quite a while, but that doesn't cause problems when the cluster is smaller. Also, we have Tez sessions and container reuse turned off.
Does anyone know whether this is, or was, a known problem with Tez 0.8.4? Or could it be a Cloudera/RM/cluster issue? Any suggestions on what to look for? (It would certainly be good to upgrade to a more recent version of Tez, but that might have to wait a short while.)
Thanks in advance for any help or suggestions.
--Scott
