Siddharth Seth created HIVE-15722:
-------------------------------------
Summary: LLAP: Avoid marking a query as complete if the AMReporter runs into an error
Key: HIVE-15722
URL: https://issues.apache.org/jira/browse/HIVE-15722
Project: Hive
Issue Type: Bug
Reporter: Siddharth Seth
Assignee: Siddharth Seth
When the AMReporter runs into an error (typically intermittent), we end up
killing all fragments on the daemon. This is done by marking the query as
complete.
The AM continues to try scheduling on this node, and those tasks fail once the
daemon structures have been updated to show the query as complete.
Instead of clearing the structures, it's better to kill only the fragments and
wait for a queryComplete call to come in from the AM.
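The difference between the two behaviours can be sketched roughly as follows.
This is a simplified illustration, not the actual Hive code: the class
QueryTrackerSketch and its methods onAmReporterError and isComplete are
hypothetical names, and only registerFragment, queryComplete and the rejection
message are borrowed from this report. It assumes fragments can be tracked as
plain strings per query.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for the daemon's per-query bookkeeping. */
class QueryTrackerSketch {
    // Queries for which the AM has sent queryComplete.
    private final Set<String> completedQueries = new HashSet<>();
    // Fragments currently registered on this daemon, keyed by query id.
    private final Map<String, Set<String>> runningFragments = new HashMap<>();

    /** Registers a fragment, rejecting it if the query was already marked complete. */
    synchronized void registerFragment(String queryId, String fragmentId) {
        if (completedQueries.contains(queryId)) {
            throw new RuntimeException("Dag " + queryId
                + " already complete. Rejecting fragment [" + fragmentId + "]");
        }
        runningFragments.computeIfAbsent(queryId, k -> new HashSet<>()).add(fragmentId);
    }

    /**
     * Proposed handling of an AMReporter error (assumed shape): kill the
     * running fragments but do NOT mark the query complete, so fragments
     * scheduled later by the AM are still accepted.
     * Returns the number of fragments killed.
     */
    synchronized int onAmReporterError(String queryId) {
        Set<String> fragments = runningFragments.get(queryId);
        if (fragments == null) {
            return 0;
        }
        int killed = fragments.size();
        fragments.clear(); // a real daemon would interrupt the running tasks here
        return killed;
    }

    /** Invoked only when the AM itself reports that the query is complete. */
    synchronized void queryComplete(String queryId) {
        completedQueries.add(queryId);
        runningFragments.remove(queryId);
    }

    synchronized boolean isComplete(String queryId) {
        return completedQueries.contains(queryId);
    }
}
```

With this shape, a fragment submitted after an AMReporter error is still
accepted, and rejection only starts once the AM has sent queryComplete.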
Later, we could enhance the AM to avoid such nodes. That's not simple though:
because the failure is in communication from the daemon, the AM does not find
out what happened.
Marking the query as complete leads to rejections such as:
{code}
org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): Dag query16 already complete. Rejecting fragment [Map 7, 29, 0]
    at org.apache.hadoop.hive.llap.daemon.impl.QueryTracker.registerFragment(QueryTracker.java:149)
    at org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:226)
    at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:487)
    at org.apache.hadoop.hive.llap.daemon.impl.LlapProtocolServerImpl.submitWork(LlapProtocolServerImpl.java:101)
    at org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:16728)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
{code}