[ 
https://issues.apache.org/jira/browse/HIVE-23061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063720#comment-17063720
 ] 

Jason Dere commented on HIVE-23061:
-----------------------------------

As for why the error is occurring in the first place, it looks like LLAP is 
getting duplicate query fragment submissions from the external client (HWC). 
The duplicate fragment is submitted and then appears to fail at registration:
{noformat}2020-03-17T06:49:11,239 WARN  [IPC Server handler 2 on 15001 ()] 
org.apache.hadoop.ipc.Server: IPC Server handler 2 on 15001, call Call#75 
Retry#0 org.apache.hadoop.hive.llap.protocol.LlapProtocolBlockingPB.submitWork 
from 19.40.252.114:33906
java.lang.IllegalStateException: Only a single registration allowed per entity. 
Duplicate for TaskWrapper{task=attempt_1854104024183112753_6052_0_00_000128_1, 
inWaitQueue=true, inPreemptionQueue=false, registeredForNotifications=true, 
canFinish=true, canFinish(in queue)=true, isGuaranteed=false, 
firstAttemptStartTime=1584442003327, dagStartTime=1584442003327, 
withinDagPriority=0, vertexParallelism= 2132, selfAndUpstreamParallelism= 2132, 
selfAndUpstreamComplete= 0}
    at 
org.apache.hadoop.hive.llap.daemon.impl.QueryInfo$FinishableStateTracker.registerForUpdates(QueryInfo.java:233)
 ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
    at 
org.apache.hadoop.hive.llap.daemon.impl.QueryInfo.registerForFinishableStateUpdates(QueryInfo.java:205)
 ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
    at 
org.apache.hadoop.hive.llap.daemon.impl.QueryFragmentInfo.registerForFinishableStateUpdates(QueryFragmentInfo.java:160)
 ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
    at 
org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper.maybeRegisterForFinishedStateNotifications(TaskExecutorService.java:1167)
 ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
    at 
org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.schedule(TaskExecutorService.java:564)
 ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
    at 
org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService.schedule(TaskExecutorService.java:93)
 ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
    at 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:292)
 ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
    at 
org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:610)
 ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
    at 
org.apache.hadoop.hive.llap.daemon.impl.LlapProtocolServerImpl.submitWork(LlapProtocolServerImpl.java:122)
 ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
    at 
org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:22695)
 ~[hive-exec-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.32-1]
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 ~[hadoop-common-3.1.1.3.1.4.26-3.jar:?]
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) 
~[hadoop-common-3.1.1.3.1.4.26-3.jar:?]
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) 
~[hadoop-common-3.1.1.3.1.4.26-3.jar:?]
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) 
~[hadoop-common-3.1.1.3.1.4.26-3.jar:?]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_191]
    at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_191]
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 ~[hadoop-common-3.1.1.3.1.4.26-3.jar:?]
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) 
~[hadoop-common-3.1.1.3.1.4.26-3.jar:?]
{noformat}

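I suspect that when the duplicate submission fails, the duplicate fragment is 
cleaned up, which may also clear the registration info for the first fragment. 
When the first fragment later exits, it may then hit the "Cannot invoke 
unregister" exception from the issue description. This still needs more 
investigation.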
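
To make the suspected sequence concrete, here is a minimal sketch. The Tracker 
class and the fragment id are hypothetical stand-ins, not the real 
QueryInfo.FinishableStateTracker, and the cleanup step after the failed 
duplicate is an assumption that matches the suspicion above rather than 
confirmed behaviour. Only the two exception messages are taken from the logs.

```java
import java.util.HashSet;
import java.util.Set;

public class DuplicateRegistrationSketch {

    /** Hypothetical per-query tracker that allows one registration per fragment id. */
    static final class Tracker {
        private final Set<String> registered = new HashSet<>();

        void register(String fragmentId) {
            if (!registered.add(fragmentId)) {
                throw new IllegalStateException(
                    "Only a single registration allowed per entity. Duplicate for " + fragmentId);
            }
        }

        void unregister(String fragmentId) {
            if (!registered.remove(fragmentId)) {
                throw new IllegalStateException(
                    "Cannot invoke unregister on an entity which has not been registered");
            }
        }
    }

    /** Replays the suspected sequence; returns the final failure message, or null if none. */
    static String replaySuspectedSequence() {
        Tracker tracker = new Tracker();
        String id = "fragment-1"; // hypothetical id shared by both submissions

        tracker.register(id); // first submission registers successfully
        try {
            tracker.register(id); // duplicate submission is rejected
        } catch (IllegalStateException e) {
            // Assumed cleanup of the failed duplicate: it removes the entry
            // that actually belongs to the first submission.
            tracker.unregister(id);
        }

        try {
            tracker.unregister(id); // first fragment finishes and unregisters
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(replaySuspectedSequence());
    }
}
```

If the cleanup assumption holds, the final unregister fails with the same 
message that appears in the issue description.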

But in general, I don't think we should allow errors from one fragment 
submission to cause the entire LLAP daemon to die, which is why I've written 
this patch.
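
As an illustration of the general idea only (this is not necessarily what 
HIVE-23061.1.patch does), a completion callback can contain a failure in 
per-fragment bookkeeping so that it is logged instead of propagating to the 
thread's uncaught-exception handling. The class and method names below are 
hypothetical.

```java
public class GuardedCompletionSketch {

    /**
     * Runs the cleanup for one finished fragment.
     * Returns true if cleanup succeeded, false if a failure was contained.
     */
    static boolean completeFragment(String fragmentId, Runnable cleanup) {
        try {
            cleanup.run();
            return true;
        } catch (RuntimeException e) {
            // Contain the error: report it for this fragment only, so the
            // calling thread keeps running and the process stays up.
            System.err.println("Cleanup failed for " + fragmentId + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        boolean bad = completeFragment("fragment-1", () -> {
            throw new IllegalStateException(
                "Cannot invoke unregister on an entity which has not been registered");
        });
        boolean good = completeFragment("fragment-2", () -> { });
        System.out.println(bad + " " + good);
    }
}
```

The trade-off is that a contained error can hide inconsistent state, so the 
root cause of the duplicate submissions still needs to be tracked down.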

> LLAP crash due to unhandled exception: Cannot invoke unregister on an entity 
> which has not been registered
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23061
>                 URL: https://issues.apache.org/jira/browse/HIVE-23061
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>            Reporter: Jason Dere
>            Assignee: Jason Dere
>            Priority: Major
>         Attachments: HIVE-23061.1.patch
>
>
> The following exception goes uncaught and causes the entire LLAP daemon to 
> shut down:
> {noformat}
> 2020-03-17T06:49:11,304 ERROR [ExecutionCompletionThread #0 ()] 
> org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon: Thread 
> Thread[ExecutionCompletionThread #0,5,main] threw an Exception. Shutting down 
> now...
> java.lang.IllegalStateException: Cannot invoke unregister on an entity which 
> has not been registered
>     at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:508) 
> ~[hive-exec-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.32-1]
>     at 
> org.apache.hadoop.hive.llap.daemon.impl.QueryInfo$FinishableStateTracker.unregisterForUpdates(QueryInfo.java:256)
>  ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
>     at 
> org.apache.hadoop.hive.llap.daemon.impl.QueryInfo.unregisterFinishableStateUpdate(QueryInfo.java:209)
>  ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
>     at 
> org.apache.hadoop.hive.llap.daemon.impl.QueryFragmentInfo.unregisterForFinishableStateUpdates(QueryFragmentInfo.java:166)
>  ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
>     at 
> org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$TaskWrapper.maybeUnregisterForFinishedStateNotifications(TaskExecutorService.java:1177)
>  ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
>     at 
> org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$InternalCompletionListener.onSuccess(TaskExecutorService.java:980)
>  ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
>     at 
> org.apache.hadoop.hive.llap.daemon.impl.TaskExecutorService$InternalCompletionListener.onSuccess(TaskExecutorService.java:944)
>  ~[hive-llap-server-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.26-3]
>     at 
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1021)
>  ~[hive-exec-3.1.0.3.1.4.26-3.jar:3.1.0.3.1.4.32-1]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_191]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_191]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
