Awesome explanation. This helps a lot. Thanks, I appreciate it.

On Saturday, March 14, 2020 at 10:06:38 PM UTC+5:30, Christian Hoffmann wrote:
>
> On 3/14/20 5:06 PM, Yagyansh S. Kumar wrote:
> > Can you explain in a little detail please?
> I'll try to walk through your example in several steps:
>
> ## Step 1
> Your initial expression was this:
>
>     (node_load15 > count without (cpu, mode)
>     (node_cpu_seconds_total{mode="system"})) * on(instance)
>     group_left(nodename) node_uname_info
>
> ## Step 2
> Let's drop the info part for now to make things simpler (you can add it
> back at the end):
>
>     node_load15 > count without (cpu, mode)
>     (node_cpu_seconds_total{mode="system"})
>
> ## Step 3
> With that query, you could add a factor. The simplest way would be to
> have two alerts: one for your machines with the 1x factor, one for
> those with the 2x factor.
>
>     node_load15{instance=~"a|b|c"} > count without (cpu, mode)
>     (node_cpu_seconds_total{mode="system"})
>
> and
>
>     node_load15{instance!~"a|b|c"} > count without (cpu, mode)
>     (node_cpu_seconds_total{mode="system"}) * 2
>
> ## Step 4
> Depending on your use case, this may already be enough. However, you
> would need to modify those two alerts whenever you add a machine. So
> something more scalable would be to use a metric (e.g. from a recording
> rule) for the scale factor:
>
>     node_load15 > count without (cpu, mode)
>     (node_cpu_seconds_total{mode="system"}) * on(instance)
>     cpu_core_scale_factor
>
> This would require a recording rule for each and every one of your
> machines:
>
>     - record: cpu_core_scale_factor
>       labels:
>         instance: a
>       expr: 1
>     - record: cpu_core_scale_factor
>       labels:
>         instance: c
>       expr: 2 # factor two
>
> ## Step 5
> A further simplification regarding maintenance would be if you could
> omit those entries for your most common case (just the number of cores,
> no multiplication factor).
> This is what the linked blog post describes. Sadly, it complicates the
> alert rule a little bit:
>
>     node_load15 > count without (cpu, mode)
>     (node_cpu_seconds_total{mode="system"}) * on(instance) group_left() (
>       cpu_core_scale_factor
>       or on(instance)
>       node_load15 * 0 + 1 # <-- the "1" is the default value
>     )
>
> The part after group_left() basically returns the value from your factor
> recording rule. If it doesn't exist, it calculates a default value. This
> works by taking an arbitrary metric which exists exactly once for each
> instance; it makes sense to take the same metric your alert is based on.
> The value is multiplied by 0, as we do not care about the value at all.
> We then add 1, the default value you wanted. Essentially, this leads to
> a temporary, invisible metric. This part might be a bit hard to get
> across, but you can basically copy this pattern verbatim.
>
> In this case, you would only need to add a recording rule for those
> machines which should have a non-default (i.e. other than 1) CPU count
> scale factor (i.e. the "instance: c" rule above).
>
> ## Step 6
> As a last suggestion, you might want to revisit whether strict alerting
> on the system load is that useful at all. In our setup, we do alert on
> it, but only on really high values which should only trigger if the
> load is skyrocketing (usually due to some hanging network filesystem or
> other deadlock situation).
>
> Note: All examples are untested, so take them with a grain of salt. I
> just want to get the idea across.
>
> Hope this helps,
> Christian
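
For anyone landing on this thread later: here is a rough sketch of how the pieces from Christian's steps could be combined into one Prometheus rules file. It is untested, in the same spirit as the examples above, and the group name, alert name, "for" duration, and annotation text are made-up placeholders:

    groups:
      - name: node-load  # hypothetical group name
        rules:
          # Per Step 5, only machines that need a non-default factor get
          # an entry here; all others fall back to 1 inside the alert expr.
          - record: cpu_core_scale_factor
            labels:
              instance: c
            expr: 2

          # Alert from Step 5: fire when load15 exceeds the core count
          # multiplied by the per-instance factor (default 1).
          - alert: HighSystemLoad  # hypothetical name and duration
            expr: |
              node_load15 >
                count without (cpu, mode) (node_cpu_seconds_total{mode="system"})
                * on(instance) group_left() (
                  cpu_core_scale_factor
                  or on(instance)
                  node_load15 * 0 + 1
                )
            for: 15m
            annotations:
              summary: "load15 above scaled core count on {{ $labels.instance }}"

Running promtool check rules on the file will catch syntax mistakes before you deploy it.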