[ https://issues.apache.org/jira/browse/TEZ-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432206#comment-15432206 ]

Siddharth Seth commented on TEZ-3405:
-------------------------------------

bq. Could be fixed in the DAGClient to set up the regular pings to AM. Have not done that as yet as the primary use-case seems session specific. Could do that in a follow-up jira if this is needed.
This will get interesting with the parallel request to allow creation of a DAGClient to interact with a running DAG. Maybe we should drop non-session mode at some point.

bq. Mind filing a jira with more details? ... waitForProxy interrupts
Was assuming that any interaction with YARN to get the AppReport would result in an Interrupt. I don't think that actually happens - YARN throws an IOException, or YarnException - but no InterruptedException. Not sure if there's a way to detect if the YarnClient was interrupted during a request.

bq. For the most part, the client if actively pinging the AM for updates, will be logging errors talking to AM. Did not see a need to add more logging for an internal heartbeat ping.
Think it's worth logging that an interrupt was seen or a specific exception was seen without the entire trace. At least we know if this is hit.

bq. I can fix this to a max(1 sec, overall timeout/20 or /50 ) ? Would that work? Re-scheduling based on timeout-lastPingTime gets complicated ?
Haven't looked at the new patch yet.

One thing from a quick look - the SchedulerExecutor could use a name, and would be better to make it a daemon.

> Support ability for AM to kill itself if there is no client heartbeating to it
> ------------------------------------------------------------------------------
>
>                 Key: TEZ-3405
>                 URL: https://issues.apache.org/jira/browse/TEZ-3405
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Gunther Hagleitner
>            Assignee: Hitesh Shah
>            Priority: Critical
>         Attachments: TEZ-3405.1.patch, TEZ-3405.2.patch, TEZ-3405.3.patch
>
>
> HiveServer2 optionally maintains a pool of AMs in either Tez or LLAP mode.
> This is done to amortize the cost of launching a Tez session.
> We also try in a shutdown hook to kill all these AMs when HS2 goes down.
> However, there are cases where HS2 doesn't get the chance to kill these AMs
> before it goes away. As a result these zombie AMs hang around until the
> timeout kicks in.
> The trouble with the timeout is that we have to set it fairly high. Otherwise
> the benefit of having pre-launched AMs obviously goes away (in a lightly
> loaded cluster).
> So, if people kill/restart HS2 they often times run into situations where the
> cluster/queue doesn't have any more capacity for AMs. They either have to
> manually kill the zombies or wait.
> The request is therefore for Tez to maintain a heartbeat to the client. If
> the client goes away the AM should exit. That way we can keep the AMs alive
> for a long time regardless of activity and at the same time don't have to
> worry about them if HS2 goes down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
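
For reference, the two suggestions in the comment above (a ping interval of max(1 sec, timeout/20) and a named daemon thread for the scheduler) could look roughly like the sketch below. This is not code from any of the attached patches. The class name, method names, thread-name prefix, and the choice of 20 over 50 as the divisor are all assumptions made for illustration; only the standard java.util.concurrent APIs are real.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names here do not come from the Tez code base.
class HeartbeatSchedulerSketch {
    // Floor of 1 second, as proposed in the thread.
    static final long MIN_INTERVAL_MS = 1000L;
    // Assumed divisor; the thread leaves the choice between 20 and 50 open.
    static final int DIVISOR = 20;

    // Ping interval = max(1 sec, overall timeout / DIVISOR).
    static long pingIntervalMs(long timeoutMs) {
        return Math.max(MIN_INTERVAL_MS, timeoutMs / DIVISOR);
    }

    // Thread factory that names its threads and marks them as daemons,
    // so the scheduler shows up clearly in thread dumps and does not
    // keep the JVM alive on its own.
    static ThreadFactory namedDaemonFactory(final String prefix) {
        final AtomicInteger counter = new AtomicInteger();
        return new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, prefix + "-" + counter.incrementAndGet());
                t.setDaemon(true);
                return t;
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMs = pingIntervalMs(60_000L);
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(namedDaemonFactory("ClientHeartbeat"));
        // Placeholder task standing in for the real ping to the AM.
        scheduler.scheduleAtFixedRate(
            () -> System.out.println("ping from " + Thread.currentThread().getName()),
            0, intervalMs, TimeUnit.MILLISECONDS);
        Thread.sleep(100);
        scheduler.shutdownNow();
    }
}
```

A fixed interval derived from the timeout avoids the re-scheduling based on timeout-lastPingTime that the quoted comment calls complicated, at the cost of some extra pings when the timeout is large.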