Thanks Sid, appreciate the info about the HBase and HDFS alerts. I'll work on upgrading Ambari, but it will probably take time. One other question about the alert.
*What does the value in seconds in the 'Metrics Collector Process' alert mean?* The Ambari definition says: *"This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for number of seconds equal to threshold."* Is it the number of seconds the process was not active and reachable when the check happened? But if it's a point-in-time check and the check runs every 1 minute, why does it have defaults of 1.5s and 5s for WARNING and CRITICAL?

-Ganesh

On Fri, Oct 28, 2016 at 1:25 PM, Siddharth Wagle <swa...@hortonworks.com> wrote:

> Hi Ganesh,
>
> AMS in Ambari version 2.2.1 had some performance impact due to the HBase normalizer; I would recommend upgrading to 2.4.1 if possible.
>
> Regarding 2] The HBase and HDFS alerts are not all based on AMS; only the NameNode alerts ending with "_hourly" or "_daily" depend on AMS. Other alerts are either port/pid or JMX based.
>
> - Sid
>
> ------------------------------
> *From:* Ganesh Viswanathan <gan...@gmail.com>
> *Sent:* Friday, October 28, 2016 1:07 PM
> *To:* Jonathan Hurley
> *Cc:* user@ambari.apache.org
> *Subject:* Re: Ambari Metrics Collector Process alert - CRITICAL threshold rule
>
> Thanks Jonathan, that explains some of the behavior I'm seeing.
>
> Two additional questions:
>
> 1) How do I make sure the Ambari "Metrics Collector Process" alert does not fire immediately when the process is down? I am using Ambari 2.2.1.0, which has a bug [1] that can trigger restarts of the process. The fix for AMBARI-15492 <http://issues.apache.org/jira/browse/AMBARI-15492> has been documented in that wiki as "comment out auto-recovery", but that would mean the process would not restart (when the bug hits), taking away visibility into the cluster metrics. We have disabled the auto-restart count alert because of the bug, but what is a good way to say "if the Metrics Collector process has been down for 15 minutes, then alert"?
>
> 2) Will restarting the "Metrics Collector Process" impact the other HBase or HDFS health alerts? Or is this process only for the Ambari Metrics system (collecting usage and internal Ambari metrics)? I am trying to see if the Ambari Metrics Collector Process can be disabled while still keeping the other HBase and HDFS alerts.
>
> [1] https://cwiki.apache.org/confluence/display/AMBARI/Known+Issues
>
> -Ganesh
>
> On Fri, Oct 28, 2016 at 12:36 PM, Jonathan Hurley <jhur...@hortonworks.com> wrote:
>
>> It sounds like you're asking two different questions here. Let me see if I can address them:
>>
>> Most "CRITICAL" thresholds do contain different text than their OK/WARNING counterparts. This is because there is different information which needs to be conveyed when an alert has gone CRITICAL. In the case of this alert, it's a port connection problem. When that happens, administrators are mostly interested in the error message and the attempted host:port combination. I'm not sure what you mean by "CRITICAL is a point-in-time alert". All alerts of the PORT/WEB variety are point-in-time alerts. They represent the connection state of a socket and the data returned over that socket at a specific point in time. The alert which gets recorded in Ambari's database maintains the time of the alert. This value is available via a tooltip hover in the UI.
>>
>> The second part of your question is asking why increasing the timeout value to something large, like 600, doesn't prevent the alert from triggering. I believe this comes down to how the Python sockets are used: a failed connection is not subject to the same timeout restrictions as a socket which won't respond. If the machine is available and refuses the connection outright, then the timeout doesn't take effect.
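A minimal sketch of the connection behavior described above, assuming the check boils down to Python's standard socket module; this is not the actual Ambari alert script, and the host, port, and timeout values are placeholders. A host that actively refuses the connection fails almost instantly with a socket error no matter how large the timeout is, while a host that silently drops the request makes connect() wait out the full settimeout() value.

import socket
import time

def check_port(host, port, timeout_sec):
    # timeout_sec bounds how long we wait for an answer; it does not delay
    # a connection that the remote OS refuses outright.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout_sec)
    start = time.time()
    try:
        sock.connect((host, port))
        return "OK", time.time() - start         # response time for a successful connect
    except socket.timeout:
        return "TIMED OUT", time.time() - start  # nothing answered: waited the full timeout_sec
    except socket.error as err:
        # e.g. ECONNREFUSED when the process is down but the host is up:
        # this returns almost immediately, even with timeout_sec = 600
        return "CONNECTION FAILED: %s" % err, time.time() - start
    finally:
        sock.close()

Read this way, the 1.5 and 5.0 values in the alert definition look like response-time thresholds for a successful connection, while a refused connection goes straight to the CRITICAL "Connection failed" text within one check interval, which matches the behavior reported in the thread.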
>>
>> On Oct 28, 2016, at 1:37 PM, Ganesh Viswanathan <gan...@gmail.com> wrote:
>>
>> Hello,
>>
>> The Ambari "Metrics Collector Process" alert has a different definition for the CRITICAL threshold vs. the OK and WARNING thresholds. What is the reason for this?
>>
>> In my tests, CRITICAL seems like a "point-in-time" alert and the value of that field is not being used. When the metrics collector process is killed or restarts, the alert fires in 1 minute or less even when I set the threshold value to 600s. This means the alert description of *"This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for number of seconds equal to threshold."* is NOT VALID for the CRITICAL threshold. Is that true, and what is the reason for this discrepancy? Has anyone else gotten false pages because of this, and what is the fix?
>>
>> "ok": {
>>     "text": "TCP OK - {0:.3f}s response on port {1}"
>> },
>> "warning": {
>>     "text": "TCP OK - {0:.3f}s response on port {1}",
>>     "value": 1.5
>> },
>> "critical": {
>>     "text": "Connection failed: {0} to {1}:{2}",
>>     "value": 5.0
>> }
>>
>> Ref: https://github.com/apache/ambari/blob/2ad42074f1633c5c6f56cf979bdaa49440457566/ambari-server/src/main/resources/common-services/AMBARI_METRICS/0.1.0/alerts.json#L102
>>
>> Thanks,
>> Ganesh
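On the "alert only after the collector has been down for 15 minutes" question: the warning/critical values above read as response-time thresholds for the point-in-time probe rather than a downtime duration, so they cannot express that rule on their own. One knob that does exist is the alert definition's check interval, which can be read and adjusted through the Ambari REST API. The sketch below is a rough illustration, not a supported recipe: the definition name ("ams_metrics_collector_process"), the minimal PUT body, and the credentials are assumptions that should be verified against your Ambari version. Newer Ambari releases (2.4+) also reportedly support an alert repeat tolerance, requiring several consecutive failed checks before the state flips to CRITICAL, which maps more directly onto a "down for N minutes" rule.

# Hypothetical sketch against the Ambari REST API; the endpoint shape is the
# v1 API, but the definition name and field names are assumptions to verify
# on your version (e.g. with a plain GET first).
import json
import requests

AMBARI = "http://ambari.example.com:8080"   # placeholder Ambari Server URL
CLUSTER = "mycluster"                        # placeholder cluster name
AUTH = ("admin", "admin")                    # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}       # header Ambari expects on write calls

# Find the Metrics Collector process alert definition (name assumed).
resp = requests.get(
    "%s/api/v1/clusters/%s/alert_definitions?AlertDefinition/name=ams_metrics_collector_process"
    % (AMBARI, CLUSTER),
    auth=AUTH)
resp.raise_for_status()
definition = resp.json()["items"][0]["AlertDefinition"]

# Raise the check interval (minutes) so a brief collector restart is less
# likely to be caught by the point-in-time probe. This reduces, but does not
# eliminate, pages from short outages.
requests.put(
    "%s/api/v1/clusters/%s/alert_definitions/%s" % (AMBARI, CLUSTER, definition["id"]),
    auth=AUTH,
    headers=HEADERS,
    data=json.dumps({"AlertDefinition": {"interval": 15}})).raise_for_status()

Even with a longer interval, a refused connection still flips the alert to CRITICAL on the very next check, as discussed above.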