Michael Ho created IMPALA-6025:
----------------------------------

             Summary: Improve hang diagnostics
                 Key: IMPALA-6025
                 URL: https://issues.apache.org/jira/browse/IMPALA-6025
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend, Distributed Exec
    Affects Versions: Impala 2.9.0
            Reporter: Michael Ho


In the past, users of Impalad have had a hard time getting diagnostic 
information when a query hangs. Usually, that involves a rather manual process 
of determining which fragment instances aren't making progress, generating a 
stack trace or core dump from that Impalad, and examining it under a debugger. 
Given the thousands of threads running when multiple queries are active, this 
is quite time-consuming.

This JIRA tracks improvement ideas that we can implement to make this kind of 
issue easier to debug. Some ideas include:

- Implement a diagnostic button (analogous to the cancellation button in the 
UI) to dump diagnostic information (e.g. threads' backtraces, executor nodes' 
internals, states of data stream senders and receivers, lock information such 
as the holder's pid) for fragment instances on some or all hosts of a query.

- Have a watchdog dump backtraces of threads which haven't made progress for a 
while. This probably doesn't apply to all threads (e.g. idle threads shouldn't 
trigger any alert).

- A fragment instance can appear not to be making progress because an operator 
or fragment it depends on may be hung (e.g. the probe side of a join will not 
be able to make much progress until the build side is done, and the build side 
itself could be another chain of joins). It'd be much easier to resolve this 
dependency chain programmatically to find the root cause of the cascading 
delay.
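
To illustrate the last idea, here is a minimal sketch in Python (not Impala 
code) of resolving a dependency chain. It assumes we can already collect, per 
query, a map from each stalled fragment instance to the instances it is 
waiting on. The function name find_root_blockers, the waits_on map and the 
instance ids are all hypothetical.

```python
from typing import Dict, List, Set


def find_root_blockers(waits_on: Dict[str, List[str]], start: str) -> List[str]:
    """Follow 'waits on' edges from `start` and return the instances that
    are reachable from it but are not themselves waiting on anything.

    `waits_on` is a hypothetical map: instance id -> ids it is blocked on.
    An instance missing from the map is treated as waiting on nothing.
    """
    roots: List[str] = []
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        deps = waits_on.get(node, [])
        if deps:
            stack.extend(deps)
        else:
            roots.append(node)
    return sorted(roots)


# F0 probes a join whose build side (F1) is itself fed by F2.
example = {"F0": ["F1"], "F1": ["F2"], "F2": []}
print(find_root_blockers(example, "F0"))
```

For the example map this reports F2 as the only instance worth inspecting. If 
the instances wait on each other in a cycle, the result is empty, which would 
itself be a useful signal of a possible deadlock.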

Please feel free to add more ideas to this JIRA.
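
As a starting point for the watchdog idea above, here is a rough sketch in 
Python rather than the backend's C++. It assumes worker threads report 
progress explicitly and deregister when they go idle, so idle threads never 
trigger an alert. The class ProgressWatchdog and its method names are made up 
for illustration; only the standard library calls are real.

```python
import sys
import threading
import time
import traceback
from typing import Dict, Optional


class ProgressWatchdog:
    """Hypothetical watchdog: tracks the last time each registered thread
    reported progress and returns backtraces of the stalled ones."""

    def __init__(self, stall_seconds: float) -> None:
        self._stall_seconds = stall_seconds
        self._last_progress: Dict[int, float] = {}
        self._lock = threading.Lock()

    def report_progress(self, now: Optional[float] = None) -> None:
        """Called by a worker thread whenever it makes progress."""
        stamp = time.monotonic() if now is None else now
        with self._lock:
            self._last_progress[threading.get_ident()] = stamp

    def mark_idle(self) -> None:
        """Called by a thread that goes idle, so it is no longer watched."""
        with self._lock:
            self._last_progress.pop(threading.get_ident(), None)

    def stalled_backtraces(self, now: Optional[float] = None) -> Dict[int, str]:
        """Return {thread id: backtrace} for threads past the threshold."""
        current = time.monotonic() if now is None else now
        with self._lock:
            stalled = [tid for tid, stamp in self._last_progress.items()
                       if current - stamp > self._stall_seconds]
        frames = sys._current_frames()  # snapshot of every thread's top frame
        return {tid: "".join(traceback.format_stack(frames[tid]))
                for tid in stalled if tid in frames}
```

A real implementation would call stalled_backtraces() periodically from a 
dedicated thread and write the result to the log. The explicit now argument is 
only there to make the sketch easy to test.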



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
