[ https://issues.apache.org/jira/browse/AMBARI-12522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitry Lysnichenko resolved AMBARI-12522.
-----------------------------------------
Resolution: Fixed
Committed to trunk
> Provide traceback patch to debug hanging agents
> -----------------------------------------------
>
> Key: AMBARI-12522
> URL: https://issues.apache.org/jira/browse/AMBARI-12522
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Reporter: Dmitry Lysnichenko
> Assignee: Dmitry Lysnichenko
> Fix For: 2.2.0
>
> Attachments: AMBARI-12522.patch
>
>
> There have been a few reports (on trunk and using local VMs) that at times the
> agent becomes extremely busy and stops processing commands. The only way out
> is an agent restart.
> Some time ago we had a signal handler that would dump a traceback to the log
> and open a debugger, or something like that, but it appears to have been
> removed. We decided to reimplement this signal handler.
> The patch tries to load and register the traceback handler if it is
> available, and skips registration if not. It also fixes signal handlers being
> bound twice during agent start.
> To install faulthandler under CentOS 6 (*faulthandler is not included in the
> default distribution of Python 2.x*), we have to run:
> {code}
> yum install python-devel gcc -y
> # install setup tools
> curl https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py | python -
> # install pip
> curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python -
> easy_install faulthandler
> {code}
> If the faulthandler module is available, the agent writes
> {{Registered faulthandler}} to the agent out file.
> After that, we start the agent and can dump tracebacks for all running
> threads like this:
> {code}
> # kill -USR1 `cat /var/run/ambari-agent/ambari-agent.pid`
> # cat /var/log/ambari-agent/ambari-agent.out
> Registered faulthandler
> Thread 0x00007feccffff700 (most recent call first):
> File "/usr/lib64/python2.6/socket.py", line 197 in accept
> File "/usr/lib/python2.6/site-packages/ambari_agent/PingPortListener.py", line 67 in run
> File "/usr/lib64/python2.6/threading.py", line 532 in __bootstrap_inner
> File "/usr/lib64/python2.6/threading.py", line 504 in __bootstrap
> Thread 0x00007fecd4a89700 (most recent call first):
> File "/usr/lib/python2.6/site-packages/ambari_agent/DataCleaner.py", line 123 in run
> File "/usr/lib64/python2.6/threading.py", line 532 in __bootstrap_inner
> File "/usr/lib64/python2.6/threading.py", line 504 in __bootstrap
> Current thread 0x00007fecdfe8c700 (most recent call first):
> File "/usr/lib64/python2.6/threading.py", line 258 in wait
> File "/usr/lib64/python2.6/threading.py", line 395 in wait
> File "/usr/lib/python2.6/site-packages/ambari_agent/HeartbeatHandlers.py", line 122 in wait
> File "/usr/lib/python2.6/site-packages/ambari_agent/NetUtil.py", line 108 in try_to_connect
> File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 291 in main
> File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 306 in <module>
> {code}
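For readers without the patch at hand, the "load and register if available, skip if not" behaviour described above can be sketched roughly as follows. This is not the code from AMBARI-12522.patch: the function name register_traceback_handler and its stream parameter are hypothetical, and the sketch only assumes the public faulthandler.register() call and the SIGUSR1 signal used in the example above.

```python
import signal
import sys


def register_traceback_handler(stream=None):
    """Hypothetical helper: register a SIGUSR1 traceback dump if possible.

    Returns True when the handler was registered, False when it was skipped.
    """
    if stream is None:
        stream = sys.stderr
    try:
        # Separate package on Python 2.x, part of the standard library since 3.3.
        import faulthandler
    except ImportError:
        return False  # module not installed: skip silently
    if not hasattr(faulthandler, "register") or not hasattr(signal, "SIGUSR1"):
        return False  # faulthandler.register() is not available on Windows
    # The stream must be backed by a real file descriptor, because the
    # handler writes to it directly when the signal arrives.
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    stream.write("Registered faulthandler\n")
    stream.flush()
    return True


if __name__ == "__main__":
    register_traceback_handler()
```

With such a handler in place, sending SIGUSR1 to the process (as in the kill -USR1 example above) makes faulthandler write the tracebacks of all threads to the given stream without stopping the process.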
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)