It sounds like you're asking two different questions here. Let me see if I can address them:
Most "CRITICAL" thresholds do contain different text then their OK/WARNING counterparts. This is because there is different information which needs to be conveyed when an alert has gone CRITICAL. In the case of this alert, it's a port connection problem. When that happens, administrators are mostly interested in the error message and the attempted host:port combination. I'm not sure what you mean by "CRITICAL is a point in time alert". All alerts of the PORT/WEB variety are point-in-time alerts. They represent the connection state of a socket and the data returned over that socket at a specific point in time. The alert which gets recorded in Ambari's database maintains the time of the alert. This value is available via a tooltip hover in the UI. The second part of your question is asking why increasing the timeout value to something large, like 600, doesn't prevent the alert from triggering. I believe this is how the python sockets are being used in that a failed connection is not limited to the same timeout restrictions as a socket which won't respond. If the machine is available and refuses the connection outright, then the timeout wouldn't take effect. On Oct 28, 2016, at 1:37 PM, Ganesh Viswanathan <gan...@gmail.com<mailto:gan...@gmail.com>> wrote: Hello, The Ambari "Metrics Collector Process" Alert has a different defintion for CRITICAL threshold vs. OK and WARNING thresholds. What is the reason for this? In my tests, CRITICAL seems like a "point-in-time" alert and the value of that field is not being used. When the metrics collector process is killed or restarts, the alert fires in 1min or less even when I set the threshold value to 600s. This means the alert description of "This alert is triggered if the Metrics Collector cannot be confirmed to be up and listening on the configured port for number of seconds equal to threshold." NOT VALID for CRITICAL threshold. Is that true and what is the reason for this discrepancy? Has anyone else gotten false pages because of this and what is the fix? "ok": { "text": "TCP OK - {0:.3f}s response on port {1}" }, "warning": { "text": "TCP OK - {0:.3f}s response on port {1}", "value": 1.5 }, "critical": { "text": "Connection failed: {0} to {1}:{2}", "value": 5.0 } Ref: https://github.com/apache/ambari/blob/2ad42074f1633c5c6f56cf979bdaa49440457566/ambari-server/src/main/resources/common-services/AMBARI_METRICS/0.1.0/alerts.json#L102 Thanks, Ganesh