Dmitry Lysnichenko created AMBARI-7791:
------------------------------------------

             Summary: HBase Master CPU utilization alert is not suppressed at MM
                 Key: AMBARI-7791
                 URL: https://issues.apache.org/jira/browse/AMBARI-7791
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server
    Affects Versions: 1.7.0
            Reporter: Dmitry Lysnichenko
            Assignee: Dmitry Lysnichenko
             Fix For: 1.7.0


There appears to be a design flaw in how some alerts are suppressed. It
causes a rare bug that probably also affects 1.6.1.

h2. The short story
When we put HBase Master (or the entire HBase service) into maintenance mode
(MM) and then stop HBase Master, the alert "HBase Master CPU utilization" pops
up and is not suppressed. The issue reproduces only when HBase Master is
located on a different host than the Nagios server.

h2. How suppressing alerts works 
When we put a service, host, or host component into MM, the server builds a
complete map of the host components that are in MM and posts it to an agent.
The agent writes this information to the file /var/nagios/ignore.dat in the
following form:
{code}
vm-3.vm GANGLIA GANGLIA_MONITOR
vm-0.vm HBASE HBASE_MASTER
vm-3.vm HDFS DATANODE
vm-2.vm HBASE HBASE_REGIONSERVER
vm-0.vm HBASE HBASE_REGIONSERVER
vm-1.vm HBASE HBASE_REGIONSERVER
vm-3.vm YARN NODEMANAGER
vm-3.vm HBASE HBASE_REGIONSERVER
{code}
All Nagios alerts are wrapped in a shell script (check_wrapper.sh). When an
alert is generated, the wrapper checks whether the hostname, service name, and
component name for that alert are present in /var/nagios/ignore.dat. If they
are, the alert is suppressed.
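
The lookup described above can be sketched roughly as follows. This is an illustration in Python, not the contents of check_wrapper.sh (which is a shell script and is not shown in this issue). The function names and the assumption that the match is an exact comparison of all three fields are hypothetical.

```python
# Illustrative sketch only: models the ignore.dat lookup described above.
# Function names and exact-match semantics are assumptions, not Ambari code.

def parse_ignore_entries(text):
    """Parse 'host service component' lines into a set of 3-tuples."""
    entries = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:  # skip blank or malformed lines
            entries.add(tuple(parts))
    return entries


def is_suppressed(entries, host, service, component):
    """Return True if the alert's origin matches an entry exactly."""
    return (host, service, component) in entries


# A few entries copied from the ignore.dat listing in this issue.
SAMPLE = """\
vm-3.vm GANGLIA GANGLIA_MONITOR
vm-0.vm HBASE HBASE_MASTER
vm-3.vm HDFS DATANODE
"""

entries = parse_ignore_entries(SAMPLE)
print(is_suppressed(entries, "vm-0.vm", "HBASE", "HBASE_MASTER"))  # listed
print(is_suppressed(entries, "vm-1.vm", "HBASE", "HBASE_MASTER"))  # not listed
```

Under these assumptions, an alert is suppressed only when its host, service, and component all match one line of the file.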

h2. What exactly is broken
In https://issues.apache.org/jira/browse/AMBARI-6358 we had a requirement to
have only one 'HBase Master CPU utilization' check, even in HA mode. So this
check is bound to the Nagios host (so that it is executed only once even if the
HBase Master hostgroup has more than one host, as is done for the
"* Percent Count" alerts). As a result, the origin data of the HBase Master
alert does not match any entry in /var/nagios/ignore.dat. That is why the
alert is not suppressed.
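
A minimal illustration of the mismatch, written in Python rather than taken from the Nagios wrapper. It assumes that the host reported with the alert is the Nagios host and that the lookup is an exact match on host, service, and component. The Nagios host name used here (vm-3.vm) is hypothetical, since the issue does not say where Nagios runs.

```python
# Illustrative sketch only: shows why an alert bound to the Nagios host
# misses the ignore.dat entry written for the HBase Master host.

IGNORE_ENTRIES = {
    ("vm-0.vm", "HBASE", "HBASE_MASTER"),
    ("vm-0.vm", "HBASE", "HBASE_REGIONSERVER"),
}

HBASE_MASTER_HOST = "vm-0.vm"  # taken from the ignore.dat listing
NAGIOS_HOST = "vm-3.vm"        # hypothetical: not stated in the issue


def is_suppressed(host, service, component):
    """Exact-match lookup (an assumption about the wrapper's behaviour)."""
    return (host, service, component) in IGNORE_ENTRIES


# A check bound to the HBase Master host would match the MM entry.
per_host_check = is_suppressed(HBASE_MASTER_HOST, "HBASE", "HBASE_MASTER")

# The CPU utilization check is bound to the Nagios host, so it does not.
nagios_bound_check = is_suppressed(NAGIOS_HOST, "HBASE", "HBASE_MASTER")

print(per_host_check, nagios_bound_check)
```

When both components run on the same host the two lookups coincide, which is consistent with the issue reproducing only when HBase Master and Nagios are on different hosts.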



