Hi,

We're running a Cloudera Hadoop cluster (5.8.x, I believe) that we scale up
and down as needed.  The only jobs running on the cluster are Tez jobs
(version 0.8.4).  When the cluster is small (about 200 nodes or fewer) things
work reasonably well, but when the cluster is scaled up to, say, 500 nodes, a
very large percentage of the jobs fail due to either container timeout or
attempt timeout.  We're trying to figure out what might be causing this
problem.

The timeouts are set to 30 minutes.  From looking at the Tez code that
raises the timeout error, it looks like the ping that's supposed to be
coming from the attempt/container JVM isn't happening.  The failure has
occurred on the initial input vertex of the DAG, so it isn't necessarily
caused by intra-DAG communication problems.
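
To make clear what I mean by the ping check, here is a simplified sketch of
my understanding of the mechanism.  This is NOT the actual Tez code: the
class name, method names and attempt IDs are made up for illustration, and
the clock is injectable only so the behavior can be checked without waiting.

```python
import time


class HeartbeatMonitor:
    """Hypothetical illustration of a ping-timeout check (not Tez code)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_ping = {}  # attempt id -> time of the most recent ping

    def register(self, attempt_id):
        # Start tracking an attempt; registration counts as its first ping.
        self.last_ping[attempt_id] = self.clock()

    def ping(self, attempt_id):
        # Record a ping from a tracked attempt; unknown ids are ignored.
        if attempt_id in self.last_ping:
            self.last_ping[attempt_id] = self.clock()

    def expired(self):
        # Return the attempts whose last ping is older than the timeout.
        now = self.clock()
        return sorted(
            attempt_id
            for attempt_id, seen in self.last_ping.items()
            if now - seen > self.timeout_s
        )
```

With a 30-minute timeout (timeout_s=1800), an attempt shows up in expired()
only if no ping has been recorded for it in more than 1800 seconds, which
is the condition we appear to be hitting on the larger cluster.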

The Tez jobs are custom code--we're not using Tez for Hive queries--and
some of the processing on key/value records can take quite a while, but
that doesn't cause problems when the cluster is smaller.  Also, we have Tez
sessions and container reuse turned off.

Does anyone know if this is/was a problem with Tez 0.8.4?  Or maybe it's a
Cloudera/RM/cluster issue?  Any suggestions on what to look for?  (For sure
it would be good to upgrade to a more recent version of Tez but that might
have to wait for a short while.)

Thanks in advance for any help/suggestions.

--Scott
