Here are the AM logs:

2015-07-21 17:08:14,279 INFO [ServiceThread:DAGClientRPCServer] ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2015-07-21 17:08:14,285 INFO [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2015-07-21 17:08:14,299 INFO [Socket Reader #1 for port 46373] ipc.Server: Starting Socket Reader #1 for port 46373
2015-07-21 17:08:14,300 INFO [Socket Reader #1 for port 37949] ipc.Server: Starting Socket Reader #1 for port 37949
2015-07-21 17:08:14,358 INFO [IPC Server Responder] ipc.Server: IPC Server Responder: starting
2015-07-21 17:08:14,364 INFO [IPC Server listener on 46373] ipc.Server: IPC Server listener on 46373: starting
2015-07-21 17:08:14,364 INFO [IPC Server Responder] ipc.Server: IPC Server Responder: starting
2015-07-21 17:08:14,365 INFO [IPC Server listener on 37949] ipc.Server: IPC Server listener on 37949: starting
2015-07-21 17:08:14,374 INFO [ServiceThread:DAGClientRPCServer] client.DAGClientServer: Instantiated DAGClientRPCServer at ip-10-16-141-168.ec2.internal/10.16.141.168:46373
2015-07-21 17:08:14,377 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event AM_LAUNCHED to history file
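For anyone who wants to check the same thing on their own AM log, here is a minimal sketch (not part of Tez; the function name and the returned keys are made up for illustration) that lists the ports the AM's IPC listeners start on and separates out the one announced by DAGClientRPCServer. It assumes the log line formats are exactly as in the excerpt above.

```python
import re

# Sample lines copied from the AM log excerpt above.
AM_LOG = """\
2015-07-21 17:08:14,364 INFO [IPC Server listener on 46373] ipc.Server: IPC Server listener on 46373: starting
2015-07-21 17:08:14,365 INFO [IPC Server listener on 37949] ipc.Server: IPC Server listener on 37949: starting
2015-07-21 17:08:14,374 INFO [ServiceThread:DAGClientRPCServer] client.DAGClientServer: Instantiated DAGClientRPCServer at ip-10-16-141-168.ec2.internal/10.16.141.168:46373
"""

# Patterns are assumptions based only on the message formats seen above.
LISTENER_RE = re.compile(r"IPC Server listener on (\d+): starting")
DAG_CLIENT_RE = re.compile(r"Instantiated DAGClientRPCServer at (\S+?)/(\S+?):(\d+)")


def summarize_am_ports(log_text):
    """Hypothetical helper: report listener ports and the DAGClientRPCServer address."""
    # Every port on which an IPC listener reported "starting".
    listeners = sorted({int(port) for port in LISTENER_RE.findall(log_text)})

    # The (hostname, ip, port) the DAGClientRPCServer says it was instantiated at.
    dag_client = None
    match = DAG_CLIENT_RE.search(log_text)
    if match:
        dag_client = (match.group(1), match.group(2), int(match.group(3)))

    # Listener ports that are not the DAGClientRPCServer port.
    other_ports = [p for p in listeners if dag_client is None or p != dag_client[2]]
    return {"listeners": listeners, "dag_client": dag_client, "other_ports": other_ports}


if __name__ == "__main__":
    print(summarize_am_ports(AM_LOG))
```

Whatever shows up under `other_ports` can then be compared by hand with the host:port that the task log reports retrying against.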

The interesting thing to note is that the Tez task is trying to connect to port 37949. The DAGClientRPCServer (which uses private DNS) is instantiated on 46373. But the AM also starts another IPC server on 37949, though I'm not sure what it is for.

On Tue, Jul 21, 2015 at 10:13 AM, Rajat Jain <[email protected]> wrote:

> Hi,
>
> I am running a YARN cluster on AWS. The slave nodes (NMs) are all
> configured to listen on private DNS. For example, a sample node manager
> listens on ip-10-16-141-168.ec2.internal:8042
> <https://multicluster.qubole.net/cluster-proxy?encodedUrl=http%3A%2F%2Fip-10-16-141-168.ec2.internal%3A8042%2F>.
>
> When I try to run a Tez job (even a simple one like select count(*)
> from nation), it fails because the child tasks are unable to connect to
> the AM. The issue is that they are trying to connect to the IP instead of
> the private DNS. Here are some sample log lines (a couple of them added by
> me for debugging):
>
> 2015-07-21 17:08:21,919 INFO [main] task.TezChild: TezChild starting
> 2015-07-21 17:08:22,310 INFO [main] task.TezChild: Using socket factory class: org.apache.hadoop.net.StandardSocketFactory
> 2015-07-21 17:08:22,336 INFO [main] task.TezChild: PID, containerIdentifier: 3699, container_1437498369268_0001_01_000002
> 2015-07-21 17:08:22,418 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
> 2015-07-21 17:08:23,025 INFO [main] task.TezChild: Got host:port: 10.16.141.168:37949
> 2015-07-21 17:08:23,035 INFO [main] task.TezChild: address variables: 10.16.141.168:37949
> 2015-07-21 17:08:23,143 INFO [TezChild] task.ContainerReporter: Attempting to fetch new task
> 2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-07-21 17:08:25,202 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-07-21 17:08:26,757 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-07-21 17:08:27,758 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
>
> The task ultimately fails. Any idea how this can be fixed? These jobs ran
> fine on Tez 0.4.1.
>
> Thanks,
> Rajat
