[ 
https://issues.apache.org/jira/browse/AMBARI-6702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089944#comment-14089944
 ] 

Alejandro Fernandez commented on AMBARI-6702:
---------------------------------------------

After some research, it appears that the rpm database can become corrupt if a 
yum command is running and killed abruptly.
When an agent first heartbeats to the server, it calculates which packages are 
installed on the agent; in some cases, running "yum list installed" can take a 
long time, so Ambari tries to do a soft kill (SIGTERM) after waiting some time 
(originally 10 secs). In order to avoid prematurely killing this command, I've 
increased the timeout to 20 secs.
If the process has still not exited 5 secs after the soft kill, Ambari then 
sends a hard kill (SIGKILL).
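
The soft-kill-then-hard-kill sequence described above can be sketched roughly as follows. This is only an illustration in Python 3, not the agent's actual PackagesAnalyzer code; the function name and the parameters `timeout_secs` and `grace_secs` are hypothetical, with defaults taken from the 20 sec and 5 sec values mentioned in this comment.

```python
import signal
import subprocess


def run_with_soft_then_hard_kill(cmd, timeout_secs=20, grace_secs=5):
    """Run cmd, escalating from SIGTERM to SIGKILL if it runs too long.

    Hypothetical helper: returns (returncode, output_bytes, timed_out).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    try:
        # Normal case: the command finishes within the timeout.
        out, _ = proc.communicate(timeout=timeout_secs)
        return proc.returncode, out, False
    except subprocess.TimeoutExpired:
        # Soft kill: ask the process to exit so it can clean up.
        proc.send_signal(signal.SIGTERM)

    try:
        # Give the process a grace period to honour SIGTERM.
        out, _ = proc.communicate(timeout=grace_secs)
    except subprocess.TimeoutExpired:
        # Hard kill: SIGKILL cannot be caught or ignored.
        proc.kill()
        out, _ = proc.communicate()
    return proc.returncode, out, True
```

A caller could run `run_with_soft_then_hard_kill(["yum", "list", "installed"])` and treat a `timed_out` value of True as a signal to log the event. A longer `timeout_secs` makes it less likely that yum is interrupted at all.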

If the rpm database becomes corrupt, it can be rebuilt by running:
{code}
rm /var/lib/rpm/__db*
rpm --rebuilddb
{code}

The longer timeout may not eliminate the problem entirely, but it should make 
it less frequent. 
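
As a rough aid for spotting the corrupt state, one could scan the output of "rpm -qa" for the error strings quoted in the issue description below. This is only a sketch: the function names and the marker list are assumptions for illustration and are not part of Ambari.

```python
import subprocess

# Substrings taken from the rpm errors quoted in the issue description.
CORRUPTION_MARKERS = ("DB_RUNRECOVERY", "cannot open Packages database")


def looks_corrupt(rpm_output):
    """Return True if the given rpm output contains a known corruption marker."""
    return any(marker in rpm_output for marker in CORRUPTION_MARKERS)


def rpm_db_looks_corrupt():
    """Run 'rpm -qa' and check its stderr for corruption markers.

    Hypothetical helper; assumes the rpm binary is on the PATH.
    """
    result = subprocess.run(["rpm", "-qa"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    return looks_corrupt(result.stderr)
```

If `rpm_db_looks_corrupt()` returns True, the rebuild commands above are the recovery step to try.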

> Ambari detects RPM DB corruption
> --------------------------------
>
>                 Key: AMBARI-6702
>                 URL: https://issues.apache.org/jira/browse/AMBARI-6702
>             Project: Ambari
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.5.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>             Fix For: 1.7.0
>
>
> Users have described scenarios in which the RPM DB becomes corrupt, usually 
> after stopping all services, rebooting all hosts (including the server), and 
> restarting all services.
> http://hortonworks.com/community/forums/topic/cant-restart-cluster-ambari-not-proving-useful/
> http://hortonworks.com/community/forums/topic/ambari-corrupts-rpmdb/
> * Problem: yum commands fail to run because the RPM database is corrupt.
> * Symptom: The ambari agent log will show something of the sort,
> {code}
> INFO 2014-04-24 05:30:11,051 Controller.py:186 - RegistrationCommand received 
> - repeat agent registration
> ERROR 2014-04-24 05:33:22,669 PackagesAnalyzer.py:43 - Task timed out and 
> will be killed
> INFO 2014-04-24 05:35:12,815 HostCheckReportFileHandler.py:43 - Host check 
> report at /var/lib/ambari-agent/data/hostcheck.result
> INFO 2014-04-24 05:35:12,845 HostCheckReportFileHandler.py:104 - Removing old 
> host check file at /var/lib/ambari-agent/data/hostcheck.result
> INFO 2014-04-24 05:35:12,845 HostCheckReportFileHandler.py:109 - Creating 
> host check file at /var/lib/ambari-agent/data/hostcheck.result
> root@xhadoopm32p rpm# rpm -qa
> rpmdb: Thread/process 30282/xx failed: Thread died in Berkeley DB library
> error: db3 error(30974) from dbenv>failchk: DB_RUNRECOVERY: Fatal error, run 
> database recovery
> error: cannot open Packages index using db3 - (-30974)
> error: cannot open Packages database in /var/lib/rpm
> rpmdb: Thread/process 30282/xx failed: Thread died in Berkeley DB library
> error: db3 error(30974) from dbenv>failchk: DB_RUNRECOVERY: Fatal error, run 
> database recovery
> error: cannot open Packages database in /var/lib/rpm
> {code}
> * Fix:
> Run the following
> {code}
> rm /var/lib/rpm/__db*
> rpm --rebuilddb
> {code}
> This appears to be an underlying issue with yum (either a lock is not 
> released, or multiple yum commands are run in parallel). To reduce how often 
> it occurs, the agent's PackagesAnalyzer will increase the time it waits for 
> the "yum list available" and "yum list installed" commands from 10 secs to 
> 20 secs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
