[ https://issues.apache.org/jira/browse/AMBARI-12522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko resolved AMBARI-12522.
-----------------------------------------
    Resolution: Fixed

Committed to trunk

> Provide traceback patch to debug hanging agents
> -----------------------------------------------
>
>                 Key: AMBARI-12522
>                 URL: https://issues.apache.org/jira/browse/AMBARI-12522
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>             Fix For: 2.2.0
>
>         Attachments: AMBARI-12522.patch
>
>
> There have been a few reports (on trunk and using local VMs) that at times the
> agent becomes extremely busy and stops processing commands. The only way out
> is an agent restart.
> Some time ago we had a signal handler that would dump a traceback to the log
> and open a debugger, or something like that, but it appears to have been
> removed. We decided to reimplement this signal handler.
> The patch tries to load and register the traceback handler if it is
> available, and skips registration if not. It also fixes signal handlers being
> bound twice during agent start.
> To install faulthandler on CentOS 6 (*faulthandler is not included in the
> default distribution of Python 2.x*), run:
> {code}
> yum install python-devel gcc -y
> # install setup tools
> curl https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py | python -
> # install pip
> curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python -
> easy_install faulthandler
> {code}
> If the faulthandler module is available, the agent writes
> {{Registered faulthandler}} to the agent out file.
> After installing it, we start the agent and can dump tracebacks for all
> running threads like this:
> {code}
> # kill -USR1 `cat /var/run/ambari-agent/ambari-agent.pid`
> # cat /var/log/ambari-agent/ambari-agent.out
> Registered faulthandler
> Thread 0x00007feccffff700 (most recent call first):
>   File "/usr/lib64/python2.6/socket.py", line 197 in accept
>   File "/usr/lib/python2.6/site-packages/ambari_agent/PingPortListener.py", line 67 in run
>   File "/usr/lib64/python2.6/threading.py", line 532 in __bootstrap_inner
>   File "/usr/lib64/python2.6/threading.py", line 504 in __bootstrap
> Thread 0x00007fecd4a89700 (most recent call first):
>   File "/usr/lib/python2.6/site-packages/ambari_agent/DataCleaner.py", line 123 in run
>   File "/usr/lib64/python2.6/threading.py", line 532 in __bootstrap_inner
>   File "/usr/lib64/python2.6/threading.py", line 504 in __bootstrap
> Current thread 0x00007fecdfe8c700 (most recent call first):
>   File "/usr/lib64/python2.6/threading.py", line 258 in wait
>   File "/usr/lib64/python2.6/threading.py", line 395 in wait
>   File "/usr/lib/python2.6/site-packages/ambari_agent/HeartbeatHandlers.py", line 122 in wait
>   File "/usr/lib/python2.6/site-packages/ambari_agent/NetUtil.py", line 108 in try_to_connect
>   File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 291 in main
>   File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 306 in <module>
> {code}
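
The "register if available, skip if not" behaviour described in the issue can be sketched roughly as follows. This is not the code from AMBARI-12522.patch; the function name register_traceback_handler and its out parameter are hypothetical, and only the faulthandler calls themselves (faulthandler.register with file and all_threads) are real API. It assumes a Unix platform where SIGUSR1 exists.

```python
import signal
import sys


def register_traceback_handler(out=sys.stderr):
    """Register a SIGUSR1 handler that dumps tracebacks of all threads.

    Hypothetical helper, not the actual patch. Returns True when
    faulthandler was found and registered, False when it was skipped.
    `out` must be a real file object (faulthandler needs its fileno()).
    """
    try:
        # Third-party package on Python 2.x, standard library on 3.3+.
        import faulthandler
    except ImportError:
        return False  # module not installed: skip without failing

    # register() and SIGUSR1 are not available on every platform.
    if not hasattr(faulthandler, "register") or not hasattr(signal, "SIGUSR1"):
        return False

    faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)
    out.write("Registered faulthandler\n")
    out.flush()
    return True
```

Calling the helper once at agent start-up, for example register_traceback_handler(open("/var/log/ambari-agent/ambari-agent.out", "a")), would be one way to get the kill -USR1 behaviour shown in the issue. Calling it from a single place also avoids binding the handler twice.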



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
