[
https://issues.apache.org/jira/browse/TEZ-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577244#comment-17577244
]
zhengchenyu edited comment on TEZ-4441 at 8/9/22 8:06 AM:
----------------------------------------------------------
[~abstractdog] Hi, my Tez app got stuck in a YARN federation cluster.
In our cluster, getAvailableResources returns null, which causes an NPE. The
AMRMClientAsyncImpl.CallbackHandlerThread then stopped, but TezAppMaster did not
shut down. There are two problems here, so I submitted two issues:
(1) TEZ-4440
Before YARN-8933, getAvailableResources may return null, which causes an NPE.
(2) TEZ-4441
The event handler never receives DAGAppMasterEventSchedulingServiceError, so
TezAppMaster cannot shut down and gets stuck.
Could you please review the two PRs:
[https://github.com/apache/tez/pull/235]
[https://github.com/apache/tez/pull/236]
> TezAppMaster may stuck because of reportError skip send error event.
> ---------------------------------------------------------------------
>
> Key: TEZ-4441
> URL: https://issues.apache.org/jira/browse/TEZ-4441
> Project: Apache Tez
> Issue Type: Bug
> Reporter: zhengchenyu
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In YARN mode, after parseAllPlugins, the className of NamedEntityDescriptor
> is null.
> When an exception is thrown (for example, the one described in TEZ-4441), the
> event handler never receives the DAGAppMasterEventSchedulingServiceError event.
> The AppMaster then gets stuck.