[ https://issues.apache.org/jira/browse/HIVE-23409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103382#comment-17103382 ]
Naresh P R commented on HIVE-23409:
-----------------------------------

[~ashutoshc] Thanks for looking into this.

If a Tez session AM is released after the DAG wait timeout, the next call to the Tez session tries to launch a new AM, which fails after 2 retries here:
{code:java}
Dag submit failed due to java.lang.RuntimeException: Failed to connect to timeline server. Connection retries limit exceeded. The posted timeline event may be missing stack trace: [org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:403)
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:363)
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:282)
org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:77)
org.apache.tez.client.TezClient.start(TezClient.java:402)
org.apache.hadoop.hive.ql.exec.tez.TezSessionState.startSessionAndContainers(TezSessionState.java:516)
org.apache.hadoop.hive.ql.exec.tez.TezSessionState.openInternal(TezSessionState.java:451)
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolSession.openInternal(TezSessionPoolSession.java:124)
org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:379)
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.reopenInternal(TezSessionPoolManager.java:498)
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.reopen(TezSessionPoolManager.java:487)
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolSession.reopen(TezSessionPoolSession.java:228)
org.apache.hadoop.hive.ql.exec.tez.TezTask.getNewTezSessionOnError(TezTask.java:531)
org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:547)
org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:221){code}
If it fails twice, we destroy the session, which is part of the TezSessionPool.
[HiveServer2-Background-Pool: Thread-12345]: tez.TezSessionPoolManager (:()) - We are closing a default session because of retry failure.

All new queries then wait for a session from TezSessionPool:
{code:java}
"HiveServer2-Background-Pool: Thread-21342" #21342
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000005c4567e10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
    at org.apache.hadoop.hive.ql.exec.tez.TezSessionPool.getSession(TezSessionPool.java:193)
    at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.getSession(TezSessionPoolManager.java:295)
    at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.getSession(TezSessionPoolManager.java:474)
    at org.apache.hadoop.hive.ql.exec.tez.WorkloadManagerFederation.getUnmanagedSession(WorkloadManagerFederation.java:66)
    at org.apache.hadoop.hive.ql.exec.tez.WorkloadManagerFederation.getSession(WorkloadManagerFederation.java:38)
    at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:189)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:103)
{code}
Only an HS2 restart resolves this issue.
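
To illustrate the hang, here is a minimal Java sketch of the failure mode. It is not Hive code: {{SessionPoolSketch}}, {{onReopenFailure}} and the {{returnToPool}} flag are hypothetical names, and the pool is modelled with a plain {{BlockingQueue}} rather than the lock and condition that TezSessionPool uses. A timeout is used so the example terminates. The "keep the slot" branch is only an assumption about one way to let later callers retry and fail, not a description of the attached patch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a pool of default sessions; not Hive's TezSessionPool.
class SessionPoolSketch {
    private final BlockingQueue<String> pool;

    SessionPoolSketch(int size) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add("session-" + i);
        }
    }

    // Waits up to timeoutMs for a free session; returns null on timeout or interrupt.
    String getSession(long timeoutMs) {
        try {
            return pool.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    // Models a reopen that failed after its retries.
    // returnToPool == false: the session is destroyed and its pool slot is lost.
    // returnToPool == true: the slot is kept, so a later caller can retry.
    void onReopenFailure(String session, boolean returnToPool) {
        if (returnToPool) {
            pool.offer(session);
        }
    }

    public static void main(String[] args) {
        // Slot lost: the next caller finds the pool empty and times out.
        SessionPoolSketch leaky = new SessionPoolSketch(1);
        String s = leaky.getSession(100);
        leaky.onReopenFailure(s, false);
        System.out.println("slot lost, next caller got: " + leaky.getSession(100));

        // Slot kept: the next caller obtains a session and can attempt its own reopen.
        SessionPoolSketch kept = new SessionPoolSketch(1);
        s = kept.getSession(100);
        kept.onReopenFailure(s, true);
        System.out.println("slot kept, next caller got: " + kept.getSession(100));
    }
}
```

In the first case the second caller gets nothing back, which corresponds to the thread parked in TezSessionPool.getSession above. In the second case it gets a session and can fail on its own if the timeline server is still down.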
> If TezSession application reopen fails for Timeline service down, default
> TezSession from SessionPool is closed after a retry
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23409
>                 URL: https://issues.apache.org/jira/browse/HIVE-23409
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Naresh P R
>            Assignee: Naresh P R
>            Priority: Major
>         Attachments: HIVE-23409.patch
>
>
> We are closing a default session from TezSessionPool here:
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java#L589]
> If all the sessions in a pool are destroyed, queries wait indefinitely at
> TezSessionPool.getSession until HS2 is restarted after the other services recover.
> [HiveServer2-Background-Pool: Thread-12345]: tez.TezSessionPoolManager (:())
> - We are closing a default session because of retry failure.
> It would be better to allow a retry and fail than to hang.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)