Filed: https://issues.apache.org/jira/browse/TEZ-2630
On Tue, Jul 21, 2015 at 11:12 AM, Rajat Jain <[email protected]> wrote:

> Tried that. AM is listening at the right address. But TezChild is
> receiving the IP address instead of the private DNS.
>
> AM logs:
>
> *2015-07-21 18:09:27,906 INFO [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] app.TaskAttemptListenerImpTezDag: Listening at address: ip-10-234-2-80.ec2.internal:49967*
>
> TezChild logs:
>
> 2015-07-21 18:09:35,353 INFO [main] task.TezChild: TezChild starting
> *2015-07-21 18:09:35,379 INFO [main] task.TezChild: Args: 10.234.2.80,49967,container_1437501941642_0001_01_000002,application_1437501941642_0001,1*
> 2015-07-21 18:09:35,770 INFO [main] task.TezChild: Using socket factory class: org.apache.hadoop.net.StandardSocketFactory
> 2015-07-21 18:09:35,785 INFO [main] task.TezChild: PID, containerIdentifier: 8670, container_1437501941642_0001_01_000002
> 2015-07-21 18:09:35,864 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
> 2015-07-21 18:09:36,403 INFO [main] task.TezChild: Got host:port: 10.234.2.80:49967
> 2015-07-21 18:09:36,413 INFO [main] task.TezChild: address variables: 10.234.2.80:49967
>
> Any idea what changed between 0.4.1 and 0.7.0? Things worked fine out of
> the box in 0.4.1.
>
> On Tue, Jul 21, 2015 at 10:28 AM, Jeff Zhang <[email protected]> wrote:
>
>> DAGClientRPCServer is for client service, not for TezChild. You need to
>> look at "Instantiated TaskAttemptListener RPC at"
>>
>> On Tue, Jul 21, 2015 at 10:21 AM, Rajat Jain <[email protected]> wrote:
>>
>>> Here are the AM logs:
>>>
>>> 2015-07-21 17:08:14,279 INFO [ServiceThread:DAGClientRPCServer] ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
>>> 2015-07-21 17:08:14,285 INFO [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
>>> 2015-07-21 17:08:14,299 INFO [Socket Reader #1 for port 46373] ipc.Server: Starting Socket Reader #1 for port 46373
>>> 2015-07-21 17:08:14,300 INFO [Socket Reader #1 for port 37949] ipc.Server: Starting Socket Reader #1 for port 37949
>>> 2015-07-21 17:08:14,358 INFO [IPC Server Responder] ipc.Server: IPC Server Responder: starting
>>> 2015-07-21 17:08:14,364 INFO [IPC Server listener on 46373] ipc.Server: IPC Server listener on 46373: starting
>>> 2015-07-21 17:08:14,364 INFO [IPC Server Responder] ipc.Server: IPC Server Responder: starting
>>> 2015-07-21 17:08:14,365 INFO [IPC Server listener on 37949] ipc.Server: IPC Server listener on 37949: starting
>>> 2015-07-21 17:08:14,374 INFO [ServiceThread:DAGClientRPCServer] client.DAGClientServer: Instantiated DAGClientRPCServer at ip-10-16-141-168.ec2.internal/10.16.141.168:46373
>>> 2015-07-21 17:08:14,377 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event AM_LAUNCHED to history file
>>>
>>> The interesting thing to note is the Tez Task is trying to connect to
>>> port 37949. The DAGClientRPCServer (which uses private DNS) is instantiated
>>> on 46373. But it also starts another IPC server on 37949 though I'm not
>>> sure what it is for.
>>>
>>> On Tue, Jul 21, 2015 at 10:13 AM, Rajat Jain <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am running a YARN cluster on AWS. The slave nodes (NMs) are all
>>>> configured to listen on private DNS. For example, a sample node manager
>>>> listens on ip-10-16-141-168.ec2.internal:8042
>>>> <https://multicluster.qubole.net/cluster-proxy?encodedUrl=http%3A%2F%2Fip-10-16-141-168.ec2.internal%3A8042%2F>.
>>>>
>>>> When I'm trying to run a Tez job (even simple ones like select count(*)
>>>> from nation) - they fail because child tasks are unable to connect to the
>>>> AM. The issue is they are trying to connect to the IP instead of the
>>>> private DNS. Here's a sample log line (a couple of them added by me for
>>>> debugging):
>>>>
>>>> 2015-07-21 17:08:21,919 INFO [main] task.TezChild: TezChild starting
>>>> 2015-07-21 17:08:22,310 INFO [main] task.TezChild: Using socket factory class: org.apache.hadoop.net.StandardSocketFactory
>>>> 2015-07-21 17:08:22,336 INFO [main] task.TezChild: PID, containerIdentifier: 3699, container_1437498369268_0001_01_000002
>>>> 2015-07-21 17:08:22,418 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
>>>> 2015-07-21 17:08:23,025 INFO [main] task.TezChild: Got host:port: 10.16.141.168:37949
>>>> 2015-07-21 17:08:23,035 INFO [main] task.TezChild: address variables: 10.16.141.168:37949
>>>> 2015-07-21 17:08:23,143 INFO [TezChild] task.ContainerReporter: Attempting to fetch new task
>>>> 2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
>>>> 2015-07-21 17:08:25,202 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
>>>> 2015-07-21 17:08:26,757 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
>>>> 2015-07-21 17:08:27,758 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
>>>>
>>>> The task ultimately fails. Any idea how this can be fixed? These jobs
>>>> ran fine on Tez 0.4.1.
>>>>
>>>> Thanks,
>>>> Rajat
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
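
For anyone checking their own logs for the same symptom, here is a minimal Python sketch that compares the address the AM says it is listening on with the address passed to TezChild. It is not part of Tez. The function names are hypothetical, the two sample lines are copied from the logs quoted in this thread, and it assumes the first two comma-separated fields of the "Args:" line are host and port, as they appear to be here.

```python
import ipaddress
import re

# Sample log lines copied from the thread (AM side and TezChild side).
AM_LINE = (
    "2015-07-21 18:09:27,906 INFO "
    "[ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] "
    "app.TaskAttemptListenerImpTezDag: Listening at address: "
    "ip-10-234-2-80.ec2.internal:49967"
)
CHILD_LINE = (
    "2015-07-21 18:09:35,379 INFO [main] task.TezChild: Args: "
    "10.234.2.80,49967,container_1437501941642_0001_01_000002,"
    "application_1437501941642_0001,1"
)


def am_listen_address(line):
    """Return (host, port) from an AM 'Listening at address:' log line."""
    match = re.search(r"Listening at address: (\S+):(\d+)\s*$", line)
    if match is None:
        raise ValueError("not a 'Listening at address' line")
    return match.group(1), int(match.group(2))


def child_target_address(line):
    """Return (host, port) from a TezChild 'Args:' log line.

    Assumption: the first two comma-separated fields are host and port.
    """
    match = re.search(r"Args: ([^,\s]+),(\d+),", line)
    if match is None:
        raise ValueError("not an 'Args:' line")
    return match.group(1), int(match.group(2))


def is_ip_literal(host):
    """True if host is a literal IPv4/IPv6 address rather than a name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def compare(am_line, child_line):
    """Summarize whether the child was handed an IP where the AM used a name."""
    am_host, am_port = am_listen_address(am_line)
    child_host, child_port = child_target_address(child_line)
    return {
        "am": (am_host, am_port),
        "child": (child_host, child_port),
        "same_port": am_port == child_port,
        "child_got_ip": is_ip_literal(child_host) and not is_ip_literal(am_host),
    }


if __name__ == "__main__":
    print(compare(AM_LINE, CHILD_LINE))
```

With the two lines above, the ports match while the child side holds a literal IP and the AM side holds a hostname, which is the mismatch described in this thread. The sketch only parses log text; it does not resolve names or open connections.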
