DAGClientRPCServer is for the client service, not for TezChild. You need to look at the "Instantiated TaskAttemptListener RPC at" line in the AM log.
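
If it helps, here is a minimal Python sketch for pulling every "Instantiated ... at" line out of an AM log, so you can see which RPC server is bound to which port. The function name `find_rpc_endpoints` is made up for illustration. The `<name> at <host>/<ip>:<port>` shape is copied from the DAGClientRPCServer line quoted in this thread; that the TaskAttemptListener line has the same shape is an assumption.

```python
import re

# Shape taken from the DAGClientRPCServer line in this thread:
#   "... Instantiated DAGClientRPCServer at <host>/<ip>:<port>"
# Assumption: other "Instantiated ... at" lines (for example the
# TaskAttemptListener one) end with an address in the same form.
INSTANTIATED = re.compile(
    r"Instantiated (?P<name>.+?) at (?P<addr>\S+):(?P<port>\d+)\s*$"
)


def find_rpc_endpoints(lines):
    """Return {server name: (address, port)} for each 'Instantiated ... at' line."""
    endpoints = {}
    for line in lines:
        match = INSTANTIATED.search(line)
        if match:
            endpoints[match.group("name")] = (
                match.group("addr"),
                int(match.group("port")),
            )
    return endpoints


# Typical use (the log path is up to you):
#   with open("/path/to/am.log") as log:
#       for name, (addr, port) in find_rpc_endpoints(log).items():
#           print(name, addr, port)
```

The port reported for the TaskAttemptListener entry is the one to compare against the host:port that TezChild is retrying.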

On Tue, Jul 21, 2015 at 10:21 AM, Rajat Jain <[email protected]> wrote:

> Here are the AM logs:
>
> 2015-07-21 17:08:14,279 INFO [ServiceThread:DAGClientRPCServer] ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2015-07-21 17:08:14,285 INFO [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
> 2015-07-21 17:08:14,299 INFO [Socket Reader #1 for port 46373] ipc.Server: Starting Socket Reader #1 for port 46373
> 2015-07-21 17:08:14,300 INFO [Socket Reader #1 for port 37949] ipc.Server: Starting Socket Reader #1 for port 37949
> 2015-07-21 17:08:14,358 INFO [IPC Server Responder] ipc.Server: IPC Server Responder: starting
> 2015-07-21 17:08:14,364 INFO [IPC Server listener on 46373] ipc.Server: IPC Server listener on 46373: starting
> 2015-07-21 17:08:14,364 INFO [IPC Server Responder] ipc.Server: IPC Server Responder: starting
> 2015-07-21 17:08:14,365 INFO [IPC Server listener on 37949] ipc.Server: IPC Server listener on 37949: starting
> 2015-07-21 17:08:14,374 INFO [ServiceThread:DAGClientRPCServer] client.DAGClientServer: Instantiated DAGClientRPCServer at ip-10-16-141-168.ec2.internal/10.16.141.168:46373
> 2015-07-21 17:08:14,377 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event AM_LAUNCHED to history file
>
>
> The interesting thing to note is the Tez Task is trying to connect to port
> 37949. The DAGClientRPCServer (which uses private DNS) is instantiated on
> 46373. But it also starts another IPC server on 37949 though I'm not sure
> what it is for.
>
> On Tue, Jul 21, 2015 at 10:13 AM, Rajat Jain <[email protected]> wrote:
>
>> Hi,
>>
>> I am running a yarn cluster on AWS. The slave nodes (NMs) are all
>> configured to listen on private DNS. For example, a sample node manager
>> listens on ip-10-16-141-168.ec2.internal:8042
>> <https://multicluster.qubole.net/cluster-proxy?encodedUrl=http%3A%2F%2Fip-10-16-141-168.ec2.internal%3A8042%2F>.
>>
>> When I'm trying to run a Tez job (even simple ones like select count(*)
>> from nation) - they fail because child tasks are unable to connect to the
>> AM. The issue is they are trying to connect to the IP instead of the
>> private DNS. Here's a sample log line (couple of them added by me for
>> debugging):
>>
>> 2015-07-21 17:08:21,919 INFO [main] task.TezChild: TezChild starting
>> 2015-07-21 17:08:22,310 INFO [main] task.TezChild: Using socket factory class: org.apache.hadoop.net.StandardSocketFactory
>> 2015-07-21 17:08:22,336 INFO [main] task.TezChild: PID, containerIdentifier: 3699, container_1437498369268_0001_01_000002
>> 2015-07-21 17:08:22,418 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>> 2015-07-21 17:08:23,025 INFO [main] task.TezChild: Got host:port: 10.16.141.168:37949
>> 2015-07-21 17:08:23,035 INFO [main] task.TezChild: address variables: 10.16.141.168:37949
>> 2015-07-21 17:08:23,143 INFO [TezChild] task.ContainerReporter: Attempting to fetch new task
>> 2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
>> 2015-07-21 17:08:25,202 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
>> 2015-07-21 17:08:26,757 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
>> 2015-07-21 17:08:27,758 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
>>
>>
>> The task ultimately fails. Any idea how this can be fixed? These jobs ran
>> fine on Tez 0.4.1.
>>
>> Thanks,
>> Rajat
>>
>
>

--
Best Regards

Jeff Zhang
