Ok - this seems like a small fix that can be implemented. The TezChild is told to contact the AM at a certain address. We probably have a mismatch in how the rpc server is started and how the container launcher informs the TezChild about the address. And based on what Rajat says, it seems that this code path changed a bit since 0.4.1. It is likely that the fix we put it to make the server listen on the correct NM FQDN address did not propagate to fixing the address sent to the TezChild. We should fix it to talk back to the same address that the server is listening to.
@Rajat, would you mind filing a jira for this? thanks — Hitesh On Jul 21, 2015, at 11:12 AM, Rajat Jain <[email protected]> wrote: > Tried that. AM is listening at the right address. But TezChild is receiving > the IP address instead of the private DNS. > > AM logs: > 2015-07-21 18:09:27,906 INFO > [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] > app.TaskAttemptListenerImpTezDag: Listening at address: > ip-10-234-2-80.ec2.internal:49967 > > > TezChild logs: > 2015-07-21 18:09:35,353 INFO [main] task.TezChild: TezChild starting > > 2015-07-21 18:09:35,379 INFO [main] task.TezChild: Args: > 10.234.2.80,49967,container_1437501941642_0001_01_000002,application_1437501941642_0001,1 > > 2015-07-21 18:09:35,770 INFO [main] task.TezChild: Using socket factory > class: org.apache.hadoop.net.StandardSocketFactory > 2015-07-21 18:09:35,785 INFO [main] task.TezChild: PID, containerIdentifier: > 8670, container_1437501941642_0001_01_000002 > 2015-07-21 18:09:35,864 INFO [main] Configuration.deprecation: > fs.default.name > is deprecated. Instead, use fs.defaultFS > 2015-07-21 18:09:36,403 INFO [main] task.TezChild: Got host:port: > 10.234.2.80:49967 > > 2015-07-21 18:09:36,413 INFO [main] task.TezChild: address variables: > 10.234.2.80:49967 > > Any idea what changed between 0.4.1 and 0.7.0? Things worked fine out of the > box in 0.4.1. > > On Tue, Jul 21, 2015 at 10:28 AM, Jeff Zhang <[email protected]> wrote: > DAGClientRPCServer is for client service, not for TezChild. You need look at > "Instantiated TaskAttemptListener RPC at" > > On Tue, Jul 21, 2015 at 10:21 AM, Rajat Jain <[email protected]> wrote: > Here are the AM logs: > > 2015-07-21 17:08:14,279 INFO [ServiceThread:DAGClientRPCServer] > ipc.CallQueueManager: Using callQueue class > java.util.concurrent.LinkedBlockingQueue > 2015-07-21 17:08:14,285 INFO > [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] > ipc.CallQueueManager: Using callQueue class > java.util.concurrent.LinkedBlockingQueue > 2015-07-21 17:08:14,299 INFO [Socket Reader #1 for port 46373] ipc.Server: > Starting Socket Reader #1 for port 46373 > 2015-07-21 17:08:14,300 INFO [Socket Reader #1 for port 37949] ipc.Server: > Starting Socket Reader #1 for port 37949 > 2015-07-21 17:08:14,358 INFO [IPC Server Responder] ipc.Server: IPC Server > Responder: starting > 2015-07-21 17:08:14,364 INFO [IPC Server listener on 46373] ipc.Server: IPC > Server listener on 46373: starting > 2015-07-21 17:08:14,364 INFO [IPC Server Responder] ipc.Server: IPC Server > Responder: starting > 2015-07-21 17:08:14,365 INFO [IPC Server listener on 37949] ipc.Server: IPC > Server listener on 37949: starting > 2015-07-21 17:08:14,374 INFO [ServiceThread:DAGClientRPCServer] > client.DAGClientServer: Instantiated DAGClientRPCServer at > ip-10-16-141-168.ec2.internal/ > 10.16.141.168:46373 > > 2015-07-21 17:08:14,377 INFO [HistoryEventHandlingThread] > impl.SimpleHistoryLoggingService: Writing event AM_LAUNCHED to history file > > > The interesting thing to note is the Tez Task is trying to connect to port > 37949. The DAGClientRPCServer (which uses private DNS) is instantiated on > 46373. But it also starts another IPC server on 37949 though I'm not sure > what it is for. > > On Tue, Jul 21, 2015 at 10:13 AM, Rajat Jain <[email protected]> wrote: > Hi, > > I am running a yarn cluster on AWS. The slave nodes (NMs) are all configured > to listen on private DNS. For example, a sample node manager listens on > ip-10-16-141-168.ec2.internal:8042. > > When I'm trying to run a Tez job (even simple ones like select count(*) from > nation) - they fail because child tasks are unable to connect to the AM. The > issue is they are trying to connect to the IP instead of the private DNS. > Here's a sample log line (couple of them added by me for debugging): > > 2015-07-21 17:08:21,919 INFO [main] task.TezChild: TezChild starting > 2015-07-21 17:08:22,310 INFO [main] task.TezChild: Using socket factory > class: org.apache.hadoop.net.StandardSocketFactory > 2015-07-21 17:08:22,336 INFO [main] task.TezChild: PID, containerIdentifier: > 3699, container_1437498369268_0001_01_000002 > 2015-07-21 17:08:22,418 INFO [main] Configuration.deprecation: > fs.default.name > is deprecated. Instead, use fs.defaultFS > 2015-07-21 17:08:23,025 INFO [main] task.TezChild: Got host:port: > 10.16.141.168:37949 > > 2015-07-21 17:08:23,035 INFO [main] task.TezChild: address variables: > 10.16.141.168:37949 > > 2015-07-21 17:08:23,143 INFO [TezChild] task.ContainerReporter: Attempting to > fetch new task > 2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect to > server: > 10.16.141.168/10.16.141.168:37949 > . Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 > MILLISECONDS) > 2015-07-21 17:08:25,202 INFO [TezChild] ipc.Client: Retrying connect to > server: > 10.16.141.168/10.16.141.168:37949 > . Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 > MILLISECONDS) > 2015-07-21 17:08:26,757 INFO [TezChild] ipc.Client: Retrying connect to > server: > 10.16.141.168/10.16.141.168:37949 > . Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 > MILLISECONDS) > 2015-07-21 17:08:27,758 INFO [TezChild] ipc.Client: Retrying connect to > server: > 10.16.141.168/10.16.141.168:37949 > . Already tried 3 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 > MILLISECONDS) > > > The task ultimately fails. Any idea how this can be fixed? These jobs ran > fine on Tez 0.4.1. > > Thanks, > Rajat > > > > > -- > Best Regards > > Jeff Zhang >
