[ https://issues.apache.org/jira/browse/HIVE-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Seth reassigned HIVE-10480: ------------------------------------- Assignee: Siddharth Seth > LLAP: Tez task is interrupted for unknown reason after an IPC exception and > then fails to report completion > ----------------------------------------------------------------------------------------------------------- > > Key: HIVE-10480 > URL: https://issues.apache.org/jira/browse/HIVE-10480 > Project: Hive > Issue Type: Sub-task > Reporter: Sergey Shelukhin > Assignee: Siddharth Seth > Attachments: HIVE-10480.1.txt > > > No idea if this is LLAP bug, Tez bug, Hadoop IPC bug (due to patch on the > cluster), or all 3. > So for now I will just dump all I have here. > TPCH Q1 started running for a long time for me on large number of runs today > (didn't happen yesterday). It would always be one Map task timing out. > Example attempt (logs from am): > {noformat} > 2015-04-24 11:11:01,073 INFO [TaskCommunicator # 0] > tezplugins.LlapTaskCommunicator: Successfully launched task: > attempt_1429683757595_0321_9_00_000928_0 > 2015-04-24 11:16:25,498 INFO [Dispatcher thread: Central] > history.HistoryEventHandler: > [HISTORY][DAG:dag_1429683757595_0321_9][Event:TASK_ATTEMPT_FINISHED]: > vertexName=Map 1, taskAttemptId=attempt_1429683757595_0321_9_00_000928_0, > startTime=1429899061071, finishTime=1429899385498, timeTaken=324427, > status=FAILED, errorEnum=TASK_HEARTBEAT_ERROR, > diagnostics=AttemptID:attempt_1429683757595_0321_9_00_000928_0 Timed out > after 300 secs, counters=Counters: 1, > org.apache.tez.common.counters.DAGCounter, RACK_LOCAL_TASKS=1 > {noformat} > No other lines for this attempt in between. > However there's this: > {noformat} > 2015-04-24 11:11:01,074 WARN [Socket Reader #1 for port 59446] ipc.Server: > Unable to read call parameters for client 172.19.128.56on connection protocol > org.apache.hadoop.hive.llap.protocol.LlapTaskUmbilicalProtocol for rpcKind > RPC_WRITABLE > java.lang.ArrayIndexOutOfBoundsException > 2015-04-24 11:11:01,075 INFO [Socket Reader #1 for port 59446] ipc.Server: > Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 > threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable > to read call parameters: null] > {noformat} > On LLAP, the following is logged > {noformat} > 2015-04-24 11:11:01,142 [TaskHeartbeatThread()] ERROR > org.apache.tez.runtime.task.TezTaskRunner: TaskReporter reported error > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcServerException): > IPC server unable to read call parameters: null > at org.apache.hadoop.ipc.Client.call(Client.java:1492) > at org.apache.hadoop.ipc.Client.call(Client.java:1423) > at > org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:242) > at com.sun.proxy.$Proxy19.heartbeat(Unknown Source) > at > org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.heartbeat(LlapTaskReporter.java:258) > at > org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:186) > at > org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.call(LlapTaskReporter.java:128) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > The attempt starts but is then interrupted (not clear by whom) > {noformat} > 2015-04-24 11:11:01,144 [Initializer > 0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map > 1_928_0)] INFO org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: > Initialized Input with src edge: lineitem > 2015-04-24 11:11:01,145 > [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map > 1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Encounted an error > while executing task: attempt_1429683757595_0321_9_00_000928_0 > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) > at > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439) > at > java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(LogicalIOProcessorRuntimeTask.java:218) > at > org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:177) > at > org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) > {noformat} > There's a heartbeat error that may or may not be spurious: > {noformat} > 2015-04-24 11:11:01,146 > [TezTaskRunner_attempt_1429683757595_0321_9_00_000928_0(container_222212222_0321_01_008943_sershe_20150424110948_86ce1f6f-7cd2-4a40-b9a6-4a6854f010f6:9_Map > 1_928_0)] INFO org.apache.tez.runtime.task.TezTaskRunner: Ignoring the > following exception since a previous exception is already registered > java.lang.NullPointerException > at > org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter$HeartbeatCallable.access$300(LlapTaskReporter.java:129) > at > org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter.taskFailed(LlapTaskReporter.java:394) > at > org.apache.tez.runtime.task.TezTaskRunner.sendFailure(TezTaskRunner.java:261) > at > org.apache.tez.runtime.task.TezTaskRunner.access$600(TezTaskRunner.java:52) > at > org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:228) > at > org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at > org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172) > {noformat} > And nothing more for this task. > Task succeeds on rerun. Other tasks also succeed on this node meanwhile. > 1) Is it Hadoop IPC bug due to new patch? > 2) Even if so (assuming IPC is not totally broken) I wonder if the Heartbeat > NPE (which we have seen before, without the IPC patch too IIRC) is actually a > real issue that prevents the task from being sent to AM? > 3) Who interrupts the task and why? AM doesn't have any logs about that and > it happens after RPC error, not before. > Btw, there's another flavor or RPC error: > {noformat} > 2015-04-24 10:36:30,183 INFO [Socket Reader #1 for port 59446] ipc.Server: > Socket Reader #1 for port 59446: readAndProcess from client 172.19.128.56 > threw exception [org.apache.hadoop.ipc.RpcServerException: IPC server unable > to read call parameters: 1382376565] > {noformat} > application_1429683757595_0320, application_1429683757595_0321 -- This message was sent by Atlassian JIRA (v6.3.4#6332)