On 17/10/2022 07:26, Rishabh Arora wrote:
Hello!
I'm currently in the process of implementing Prometheus along with
Alertmanager as our de facto solution for node health monitoring. We
have a kubernetes, kafka, mqtt setup and for monitoring our
infrastructure, prometheus is an obvious good fit.
We have an application / business case, where I'm wondering whether
Prometheus may be a reasonable solution. Our application needs to meet
certain SLAs. If those SLAs are not being met, alerts need to
fire. For example, consider the following case, which bears close
resemblance to our real business case:
An /Order/ schema in our system has a /payment/ field which can be one
of ['COMPLETED','FAILED','PENDING']. In our HA real time system, we
need to fire alerts for Orders which are in a PENDING state. Rows in
our /Orders/ collection will be in the order of potentially millions.
An order also has a /paymentEngine/ field, which represents the entity
responsible for processing the payment for the order.
Now, with Prometheus, finding the total count of PENDING Orders would
be a simple metric, but what we're interested in is also the Order
IDs. For instance, is there a way I could capture the PENDING order
IDs in the "metadata"(???) or "payload" of the metric? Downstream in
the alertmanager, I'd also like to group by /paymentEngine/ so I
could potentially inhibit alerts for an unstable engine.
Can anyone please help me out? Apologies in advance for my naivety :)

What you are asking for isn't really the job of Prometheus.
Having a metric detailing the number of pending orders & alerting on
that is completely within the normal area for Prometheus & Alertmanager
- observing the system and alerting if there are issues that need
investigation. However the next step of dealing with the individual
events/orders is the job for a different system. If paymentEngine could
be a small number of options (e.g. PayPal, Swipe, Cash) then it would be
reasonable to have that as a label to the pending orders metric (which
then would allow you to alert if one method stops working), but order ID
isn't something you should ever put in the metrics. Instead once you
were alerted about a potential issue you might query your order database
directly or look at log files to dig into the detail and figure out what
is happening.
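
As a rough illustration of the split described above, here is a minimal Python sketch that counts PENDING orders per payment engine and renders the result in the Prometheus text exposition format. The sample rows, the metric name orders_pending and the label name payment_engine are all assumptions made up for this example; in a real exporter you would more likely use a client library (for instance a Gauge with one label in prometheus_client) than format the lines by hand. The point is only that the engine becomes a label and the order IDs never reach the metric.

```python
from collections import Counter

# Hypothetical sample rows; real data would come from the Orders collection.
ORDERS = [
    {"id": "o-1", "payment": "PENDING", "paymentEngine": "paypal"},
    {"id": "o-2", "payment": "COMPLETED", "paymentEngine": "paypal"},
    {"id": "o-3", "payment": "PENDING", "paymentEngine": "cash"},
    {"id": "o-4", "payment": "PENDING", "paymentEngine": "paypal"},
]


def pending_by_engine(orders):
    """Count PENDING orders per payment engine; order IDs are deliberately dropped."""
    return Counter(
        o["paymentEngine"] for o in orders if o["payment"] == "PENDING"
    )


def render_metrics(counts, name="orders_pending"):
    """Render the counts as a gauge in the Prometheus text exposition format.

    Assumes engine names are simple strings that need no label escaping.
    """
    lines = [
        f"# HELP {name} Number of orders with payment state PENDING.",
        f"# TYPE {name} gauge",
    ]
    for engine, count in sorted(counts.items()):
        lines.append(f'{name}{{payment_engine="{engine}"}} {count}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics(pending_by_engine(ORDERS)), end="")
```

With one time series per engine, an alert can fire when a single engine's pending count grows, and Alertmanager can group on that label. Finding which orders are stuck remains a query against the order database.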
--
Stuart Clark