Hi,

I am running a YARN cluster on AWS. The slave nodes (NMs) are all
configured to listen on their private DNS names. For example, a sample
node manager listens on ip-10-16-141-168.ec2.internal:8042
<https://multicluster.qubole.net/cluster-proxy?encodedUrl=http%3A%2F%2Fip-10-16-141-168.ec2.internal%3A8042%2F>.

When I try to run a Tez job (even a simple one like select count(*)
from nation), it fails because the child tasks are unable to connect to
the AM. The issue is that they try to connect to the IP address instead
of the private DNS name. Here are some sample log lines (a couple of them
were added by me for debugging):

2015-07-21 17:08:21,919 INFO [main] task.TezChild: TezChild starting
2015-07-21 17:08:22,310 INFO [main] task.TezChild: Using socket
factory class: org.apache.hadoop.net.StandardSocketFactory
2015-07-21 17:08:22,336 INFO [main] task.TezChild: PID,
containerIdentifier:  3699, container_1437498369268_0001_01_000002
2015-07-21 17:08:22,418 INFO [main] Configuration.deprecation:
fs.default.name is deprecated. Instead, use fs.defaultFS
2015-07-21 17:08:23,025 INFO [main] task.TezChild: Got host:port:
10.16.141.168:37949
2015-07-21 17:08:23,035 INFO [main] task.TezChild: address variables:
10.16.141.168:37949
2015-07-21 17:08:23,143 INFO [TezChild] task.ContainerReporter:
Attempting to fetch new task
2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect
to server: 10.16.141.168/10.16.141.168:37949. Already tried 0 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50,
sleepTime=1000 MILLISECONDS)
2015-07-21 17:08:25,202 INFO [TezChild] ipc.Client: Retrying connect
to server: 10.16.141.168/10.16.141.168:37949. Already tried 1 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50,
sleepTime=1000 MILLISECONDS)
2015-07-21 17:08:26,757 INFO [TezChild] ipc.Client: Retrying connect
to server: 10.16.141.168/10.16.141.168:37949. Already tried 2 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50,
sleepTime=1000 MILLISECONDS)
2015-07-21 17:08:27,758 INFO [TezChild] ipc.Client: Retrying connect
to server: 10.16.141.168/10.16.141.168:37949. Already tried 3 time(s);
retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50,
sleepTime=1000 MILLISECONDS)


The task ultimately fails. Any idea how this can be fixed? These jobs ran
fine on Tez 0.4.1.
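
In case it helps anyone reproduce the diagnosis, here is a rough sketch of
how the retry lines above can be checked for an IP literal versus a DNS name.
It is not part of Tez or Hadoop; the helper names (parse_retry_target,
is_ip_literal) are made up for illustration, and it assumes the part before
the slash in the ipc.Client message is the host string the client was given.

```python
import ipaddress
import re

# Matches the "host/ip:port" target in an ipc.Client retry message.
# Assumption: the text before the slash is the host string handed to the client.
RETRY_RE = re.compile(
    r"Retrying connect to server: (?P<host>[^/\s]+)/(?P<ip>[^:\s]+):(?P<port>\d+)"
)


def is_ip_literal(host):
    """Return True if host is an IPv4/IPv6 literal rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def parse_retry_target(text):
    """Extract (host, ip, port) from a possibly line-wrapped retry message."""
    # The log lines are wrapped in the mail, so collapse all whitespace first.
    flat = " ".join(text.split())
    match = RETRY_RE.search(flat)
    if match is None:
        return None
    return match.group("host"), match.group("ip"), int(match.group("port"))


if __name__ == "__main__":
    sample = """2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect
to server: 10.16.141.168/10.16.141.168:37949. Already tried 0 time(s);"""
    host, ip, port = parse_retry_target(sample)
    kind = "IP literal" if is_ip_literal(host) else "DNS name"
    print(f"child was given {host}:{port} ({kind})")
```

On the log above this reports an IP literal, which matches what I am seeing:
the child is handed 10.16.141.168 rather than ip-10-16-141-168.ec2.internal.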

Thanks,
Rajat
