Tried that. AM is listening at the right address. But TezChild is receiving the IP address instead of the private DNS.
AM logs:

*2015-07-21 18:09:27,906 INFO [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] app.TaskAttemptListenerImpTezDag: Listening at address: ip-10-234-2-80.ec2.internal:49967*

TezChild logs:

2015-07-21 18:09:35,353 INFO [main] task.TezChild: TezChild starting
*2015-07-21 18:09:35,379 INFO [main] task.TezChild: Args: 10.234.2.80,49967,container_1437501941642_0001_01_000002,application_1437501941642_0001,1*
2015-07-21 18:09:35,770 INFO [main] task.TezChild: Using socket factory class: org.apache.hadoop.net.StandardSocketFactory
2015-07-21 18:09:35,785 INFO [main] task.TezChild: PID, containerIdentifier: 8670, container_1437501941642_0001_01_000002
2015-07-21 18:09:35,864 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
2015-07-21 18:09:36,403 INFO [main] task.TezChild: Got host:port: 10.234.2.80:49967
2015-07-21 18:09:36,413 INFO [main] task.TezChild: address variables: 10.234.2.80:49967

Any idea what changed between 0.4.1 and 0.7.0? Things worked fine out of the box in 0.4.1.

On Tue, Jul 21, 2015 at 10:28 AM, Jeff Zhang <[email protected]> wrote:

> DAGClientRPCServer is for client service, not for TezChild. You need to look
> at "Instantiated TaskAttemptListener RPC at"
>
> On Tue, Jul 21, 2015 at 10:21 AM, Rajat Jain <[email protected]> wrote:
>
>> Here are the AM logs:
>>
>> 2015-07-21 17:08:14,279 INFO [ServiceThread:DAGClientRPCServer]
>> ipc.CallQueueManager: Using callQueue class
>> java.util.concurrent.LinkedBlockingQueue
>> 2015-07-21 17:08:14,285 INFO
>> [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag]
>> ipc.CallQueueManager: Using callQueue class
>> java.util.concurrent.LinkedBlockingQueue
>> 2015-07-21 17:08:14,299 INFO [Socket Reader #1 for port 46373] ipc.Server:
>> Starting Socket Reader #1 for port 46373
>> 2015-07-21 17:08:14,300 INFO [Socket Reader #1 for port 37949] ipc.Server:
>> Starting Socket Reader #1 for port 37949
>> 2015-07-21 17:08:14,358 INFO [IPC Server Responder] ipc.Server: IPC Server
>> Responder: starting
>> 2015-07-21 17:08:14,364 INFO [IPC Server listener on 46373] ipc.Server: IPC
>> Server listener on 46373: starting
>> 2015-07-21 17:08:14,364 INFO [IPC Server Responder] ipc.Server: IPC Server
>> Responder: starting
>> 2015-07-21 17:08:14,365 INFO [IPC Server listener on 37949] ipc.Server: IPC
>> Server listener on 37949: starting
>> 2015-07-21 17:08:14,374 INFO [ServiceThread:DAGClientRPCServer]
>> client.DAGClientServer: Instantiated DAGClientRPCServer at
>> ip-10-16-141-168.ec2.internal/10.16.141.168:46373
>> 2015-07-21 17:08:14,377 INFO [HistoryEventHandlingThread]
>> impl.SimpleHistoryLoggingService: Writing event AM_LAUNCHED to history file
>>
>>
>> The interesting thing to note is the Tez Task is trying to connect to
>> port 37949. The DAGClientRPCServer (which uses private DNS) is instantiated
>> on 46373. But it also starts another IPC server on 37949 though I'm not
>> sure what it is for.
>>
>> On Tue, Jul 21, 2015 at 10:13 AM, Rajat Jain <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am running a yarn cluster on AWS. The slave nodes (NMs) are all
>>> configured to listen on private DNS. For example, a sample node manager
>>> listens on ip-10-16-141-168.ec2.internal:8042
>>> <https://multicluster.qubole.net/cluster-proxy?encodedUrl=http%3A%2F%2Fip-10-16-141-168.ec2.internal%3A8042%2F>
>>> .
>>>
>>> When I'm trying to run a Tez job (even simple ones like select count(*)
>>> from nation) - they fail because child tasks are unable to connect to the
>>> AM. The issue is they are trying to connect to the IP instead of the
>>> private DNS. Here's a sample log line (couple of them added by me for
>>> debugging):
>>>
>>> 2015-07-21 17:08:21,919 INFO [main] task.TezChild: TezChild starting
>>> 2015-07-21 17:08:22,310 INFO [main] task.TezChild: Using socket factory
>>> class: org.apache.hadoop.net.StandardSocketFactory
>>> 2015-07-21 17:08:22,336 INFO [main] task.TezChild: PID,
>>> containerIdentifier: 3699, container_1437498369268_0001_01_000002
>>> 2015-07-21 17:08:22,418 INFO [main] Configuration.deprecation:
>>> fs.default.name is deprecated. Instead, use fs.defaultFS
>>> 2015-07-21 17:08:23,025 INFO [main] task.TezChild: Got host:port:
>>> 10.16.141.168:37949
>>> 2015-07-21 17:08:23,035 INFO [main] task.TezChild: address variables:
>>> 10.16.141.168:37949
>>> 2015-07-21 17:08:23,143 INFO [TezChild] task.ContainerReporter: Attempting
>>> to fetch new task
>>> 2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect to
>>> server: 10.16.141.168/10.16.141.168:37949. Already tried 0 time(s); retry
>>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
>>> MILLISECONDS)
>>> 2015-07-21 17:08:25,202 INFO [TezChild] ipc.Client: Retrying connect to
>>> server: 10.16.141.168/10.16.141.168:37949. Already tried 1 time(s); retry
>>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
>>> MILLISECONDS)
>>> 2015-07-21 17:08:26,757 INFO [TezChild] ipc.Client: Retrying connect to
>>> server: 10.16.141.168/10.16.141.168:37949. Already tried 2 time(s); retry
>>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
>>> MILLISECONDS)
>>> 2015-07-21 17:08:27,758 INFO [TezChild] ipc.Client: Retrying connect to
>>> server: 10.16.141.168/10.16.141.168:37949. Already tried 3 time(s); retry
>>> policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000
>>> MILLISECONDS)
>>>
>>>
>>> The task ultimately fails. Any idea how this can be fixed? These jobs
>>> ran fine on Tez 0.4.1.
>>>
>>> Thanks,
>>> Rajat
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
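
For anyone who wants to compare the two sides mechanically, here is a rough Python sketch. It pulls the address the AM reports ("Listening at address: host:port") and the one TezChild is launched with ("Args: host,port,...") and reports whether they differ only in hostname versus IP form. The regexes are written against the log lines quoted in this thread and nothing else; the function names and result labels are made up for illustration, and none of this is part of Tez.

```python
import ipaddress
import re

# Patterns assumed from the log lines quoted in this thread only.
AM_LISTEN_RE = re.compile(r"Listening at address: (?P<host>[^\s:]+):(?P<port>\d+)")
CHILD_ARGS_RE = re.compile(r"task\.TezChild: Args: (?P<host>[^,\s]+),(?P<port>\d+),")


def is_ip_literal(host):
    """Return True if host is a literal IPv4/IPv6 address rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def find_endpoint(pattern, log_text):
    """Return (host, port) from the first matching log line, or None."""
    match = pattern.search(log_text)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


def compare_endpoints(am_log, child_log):
    """Classify how the AM's advertised address relates to the child's target."""
    am = find_endpoint(AM_LISTEN_RE, am_log)
    child = find_endpoint(CHILD_ARGS_RE, child_log)
    if am is None or child is None:
        return "missing"
    if am == child:
        return "match"
    # Same port, but the child was handed an IP where the AM logged a name.
    if am[1] == child[1] and is_ip_literal(child[0]) and not is_ip_literal(am[0]):
        return "ip-instead-of-hostname"
    return "mismatch"


if __name__ == "__main__":
    am_line = (
        "2015-07-21 18:09:27,906 INFO app.TaskAttemptListenerImpTezDag: "
        "Listening at address: ip-10-234-2-80.ec2.internal:49967"
    )
    child_line = (
        "2015-07-21 18:09:35,379 INFO [main] task.TezChild: Args: "
        "10.234.2.80,49967,container_1437501941642_0001_01_000002,"
        "application_1437501941642_0001,1"
    )
    print(compare_endpoints(am_line, child_line))
```

Run against the 18:09 lines above, this reports "ip-instead-of-hostname": the port is the same on both sides and only the host form differs. It only confirms the symptom; it says nothing about where in the AM the hostname gets turned into an IP.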
