There are several configs which control timeouts and heartbeats. For larger clusters, lowering the frequency at which some of these messages are sent out would likely help.
For controlling the timeouts:

tez.task.progress.stuck.interval-ms - user code needs to send heartbeats for this property. (The default is -1, which disables it.)
tez.task.timeout-ms - the framework takes care of sending pings for this property. (Defaults to a 5 minute timeout.)

Some of the heartbeat intervals are a little aggressive out of the box (100-200ms). This can cause network congestion, as well as an event backup on the AM.

tez.task.get-task.sleep.interval-ms.max - default 200. Set to > 3s on a larger cluster.
tez.task.am.heartbeat.interval-ms.max - default 100. Set to > 3s on a larger cluster.
tez.task.am.heartbeat.counter.interval-ms.max - default 4000. Set to > 30s on a larger cluster.

You can try tuning the suggested values to see how well things work; a rough example of setting these overrides is included at the end of this message.

A problem to look for is the AM event queue backing up - it will show up as messages in the AM logs, and the AM's GC logs are worth checking as well. A backed-up queue can lead to GC pressure on the AM and a general delay in processing events, which in turn can lead to a timeout. Also look at GC for the tasks that are running: are heartbeats actually going out?

On Tue, Aug 1, 2017 at 11:07 AM, Scott McCarty <[email protected]> wrote:

> Hi,
>
> We're running a Cloudera hadoop cluster (5.8.x, I believe) that we scale
> up and down as needed. The only jobs running on the cluster are Tez
> (version 0.8.4) and when the cluster is small (about 200 nodes or less)
> things work reasonably well, but when the cluster is scaled up to, say,
> 500 nodes, a very large percentage of the jobs fail due to either container
> timeout or attempt timeout and we're trying to figure out what might be
> causing this problem.
>
> The timeout(s) are set to 30 minutes and from looking at the Tez code that
> gives that timeout error, it looks like the ping that's supposed to be
> coming from the attempt/container JVM isn't happening. It's happened on
> the initial input node in the DAG so it's not necessarily failing due to
> intra-DAG communication problems.
>
> The Tez jobs are custom code--we're not using Tez for Hive queries--and
> some of the processing on key/value records can take quite a while but that
> doesn't cause problems when the cluster is smaller. Also, we have Tez
> sessions and container reuse turned off.
>
> Does anyone know if this is/was a problem with Tez 0.8.4? Or maybe it's a
> Cloudera/RM/cluster issue? Any suggestions on what to look for? (For sure
> it would be good to upgrade to a more recent version of Tez but that might
> have to wait for a short while.)
>
> Thanks in advance for any help/suggestions.
>
> --Scott
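As referenced above, here is a minimal sketch of how the overrides might be applied when submitting jobs from custom code via TezClient. The numeric values are only illustrative examples of the "> 3s" / "> 30s" suggestions, and the application name is a placeholder.

import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.TezConfiguration;

public class LargeClusterTezClientFactory {
  public static TezClient createClient() throws Exception {
    // Start from the default Tez configuration (tez-site.xml on the classpath).
    TezConfiguration tezConf = new TezConfiguration();

    // Back off the task->AM heartbeat traffic for a larger cluster.
    tezConf.setInt("tez.task.get-task.sleep.interval-ms.max", 5000);        // default 200
    tezConf.setInt("tez.task.am.heartbeat.interval-ms.max", 5000);          // default 100
    tezConf.setInt("tez.task.am.heartbeat.counter.interval-ms.max", 60000); // default 4000

    // Optional stuck-progress check: -1 (the default) disables it, and enabling it
    // only makes sense if the user code actually reports progress.
    // tezConf.setLong("tez.task.progress.stuck.interval-ms", 300000);

    // "MyTezApp" is just a placeholder application name.
    TezClient tezClient = TezClient.create("MyTezApp", tezConf);
    tezClient.start();
    return tezClient;
  }
}

The same properties can also be set cluster-wide in tez-site.xml instead of programmatically.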
