Glad it makes sense now. It was definitely a bump in the learning curve for me :-)
Regards, Brian.

On Friday, 4 March 2022 at 10:00:12 UTC Federico Buti wrote:

> Hi Brian.
>
> Thanks for the super-deep dive into the topic! This is simply awesome. And
> sorry for the mails mismatch...too many mail accounts! :-D
>
> On Fri, 4 Mar 2022 at 09:46, Brian Candler <[email protected]> wrote:
>
>> > Assuming the second metric goes missing how is the binary expression
>> > evaluated exactly?
>>
>> The same as it always is. Remember that the left-hand side and the
>> right-hand side are both vectors, containing zero or more values, each
>> value having a distinct set of labels. Noting the documentation here
>> <https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>:
>>
>> *vector1 and vector2 results in a vector consisting of the elements
>> of vector1 for which there are elements in vector2 with exactly matching
>> label sets. Other elements are dropped. The metric name and values are
>> carried over from the left-hand side vector.*
>>
>> Therefore, if the RHS of "and" is an empty vector, then the result of the
>> entire "and" expression is an empty vector - since there is nothing in
>> vector2 for vector1 to match.
>>
>> > In the "normal" case, i.e. "foo and bar" we would not have points but
>> > in the case of "absent(foo) and bar", from my tests, it seems to me the
>> > "bar" filtering is simply ignored.
>>
>> I don't understand what you mean by that. Can you give examples of the LHS
>> and the RHS vectors, and the combined expression, which don't behave how
>> you expect?
>>
>
> I was referring to "absent(foo) and bar", which was the source of my
> original question. On the surface it seemed to me that LHS was firing
> even though RHS was empty. But your detailed explanation below forced me to
> double-check again in the expression browser and now I see the RHS wasn't
> really empty as I first (erroneously) reported.
> Which matches the documentation you mentioned and makes everything click
> perfectly in my head. Was dumb of me, but I guess stuff happens. Thanks a lot.
>
>> Note that "foo and bar" and "absent(foo) and bar" will both be empty if
>> bar is empty, as just described.
>>
>> "absent(foo)" is an unusual function:
>> - if the input vector has one or more values, i.e. any non-empty vector,
>>   its output is an empty vector (no values)
>> - if the input vector is empty, its output is a one-element vector with a
>>   single value "1". The label set of that value depends on the exact form of
>>   the expression inside the parentheses; it tries to do "the right thing" but
>>   at worst you could have value 1 with empty label set {}
>>
>> In your case,
>>
>>     absent(our_metric{environment="pro",service="bar",stack="foo"})
>>
>> will return
>>
>>     {environment="pro",service="bar",stack="foo"} 1
>>
>> i.e. a single-element vector with empty metric name, those labels, and
>> the value 1.
>>
>> Going back to the whole original expression:
>>
>>     absent(our_metric{environment="pro",service="bar",stack="foo"}) and
>>     on(stack, environment) up{service="bar",source="app"} == 1
>>
>> ISTM that is saying you want to generate an alert if
>> our_metric{environment="pro",service="bar",stack="foo"} is missing, but
>> only if metric up{service="bar",source="app"} exists *and* has value 1.
>> That means the alert is suppressed if either:
>> (a) up{service="bar",source="app"} exists but its value is not 1
>> (b) up{service="bar",source="app"} does not exist - i.e. that expression
>>     returns an empty vector. ("up" is a special metric in prometheus; if it
>>     doesn't exist, it means there is no configured scrape job with those
>>     labels)
>>
>
> Yes, I was interested in having (a). Then yesterday we experienced (b)
> because of a provision problem and I wrote to the list to understand that
> case better. Just to improve my knowledge.
> We do NOT want disappearance of targets which would lead to (b) ofc, but
> that is an investigation we are doing on our side to avoid the problem in
> the future.
>
>> If that's not what you want, then think about what you actually want, and
>> then how to express that. For example, if you want to suppress the alert
>> in case (a) but not in case (b), then you can do this:
>>
>>     absent(our_metric{environment="pro",service="bar",stack="foo"})
>>     unless on(stack, environment) up{service="bar",source="app"} != 1
>>
>> ------
>>
>
> Cool! I've always struggled a bit with "unless" but I can totally give it
> a go for this case. As I should have mentioned I want to move away from the
> absent altogether but that is something that is not going to happen soon due
> to the way the exporter is written atm, unfortunately.
>
>> If you don't mind, I will make an observation about the use of "and
>> on(...)". Since the LHS and RHS are vectors, an expression needs to
>> identify corresponding values in the LHS vector and the RHS vector, to
>> generate a vector of results. The on(...) part is when the LHS and RHS
>> vectors don't have exactly the same label sets, and you need to ignore some
>> when matching them up. I think you know all this already.
>>
>> I find your expression rather confusing, because:
>> - we know that any values in the LHS vector must have labels
>>   {environment="pro",service="bar",stack="foo"}
>> - we know that any values in the RHS vector must have labels
>>   {service="bar",source="app"}
>> - "on(stack,environment)" says to pair up LHS and RHS values where the
>>   "stack" and "environment" labels match
>> - therefore, the RHS vector must also have stack="foo" and
>>   environment="pro"
>> - as this is a one-to-one vector match: it will fail if a particular pair of
>>   (stack,environment) labels returns multiple values for the LHS and one or
>>   more for the RHS, or vice versa.
>>   Therefore we know (stack,environment) must be a unique match for a given
>>   service (*)
>>
>> Therefore, implicitly I think all of (environment, service, stack) must
>> match, i.e. this expression is the same as:
>>
>>     absent(our_metric{environment="pro",service="bar",stack="foo"}) and
>>     on(environment, service, stack)
>>     up{environment="pro",service="bar",stack="foo",source="app"} == 1
>>
>> And this can be simplified to:
>>
>>     absent(our_metric{environment="pro",service="bar",stack="foo"}) and
>>     on(environment, service, stack) up{source="app"} == 1
>>
>> I find the second version easier to read and reason about, because the
>> environment/service/stack matching is all in one place, but you may
>> disagree :-)
>>
>
> Not really sure why I should disagree here! :-D
> This is a great insight and a source of reflection for us to improve our
> rule set. We have a few binary expressions using "and" for which the
> reasoning applied here could be taken into account. If anything it
> simplifies/shortens the expression a lot, which is always a plus, imo.
>
> Thanks a lot for your huge help!
> F.
>
>> (*) This does provide another reason why an alert could fail to trigger.
>> If the "and" expression returns multiple values for the same
>> (stack,environment) pair on either the LHS or the RHS, with at least one
>> match on the other side, then the whole expression will generate an error.
>>
>> However, I think it's unlikely in this particular case. We know the LHS
>> can only possibly return a single-element vector, so this error condition
>> could only occur if up{service="bar",source="app"} == 1 returns multiple
>> values with the same pair of (stack,environment) labels.
>> That is, it would only be a problem if you had something like this:
>>
>>     up{environment="pro",service="bar",stack="foo",source="app",xxx="yyy"} 1
>>     up{environment="pro",service="bar",stack="foo",source="app",xxx="zzz"} 1
>>
>> On Friday, 4 March 2022 at 07:23:16 UTC [email protected] wrote:
>>
>>> Hi Brian,
>>>
>>> thanks a lot for your reply.
>>>
>>> I re-read my original mail and I recognize I should have probably
>>> delivered less information and gone straight to the point. That probably
>>> created a bit of confusion. E.g. I never intended the up metric - or any
>>> other metric - to be considered a boolean. My bad. I'll try to get straight
>>> to the point this time.
>>>
>>> > This is *not* boolean. Rather, it takes the vector of timeseries "foo"
>>> > and matches them up with the vector of timeseries "bar". All those
>>> > elements of foo which have exactly matching label sets with bar, are
>>> > passed through unchanged. Anything else is dropped.
>>>
>>> Right, and my question is the following. Mostly to understand the
>>> underlying behaviour, not because I have any particular problem to resolve.
>>> Assuming the second metric goes missing how is the binary expression
>>> evaluated exactly? In the "normal" case, i.e. "foo and bar" we would not
>>> have points but in the case of "absent(foo) and bar", from my tests, it
>>> seems to me the "bar" filtering is simply ignored.
>>>
>>> I can guess that is because "absent" is not really a metric per se and
>>> thus we are comparing two empty sets of labels - effectively reducing
>>> "absent(foo) and bar" to "absent(foo)".
>>> I'd say, it would make sort of sense, right?
>>>
>>> Cheers,
>>> F.
>>>
>>> On Thursday, 3 March 2022 at 17:01:29 UTC+1 Brian Candler wrote:
>>>
>>>> You can use the PromQL browser in the prometheus web UI to debug this,
>>>> since you can view the value of an expression at any previous point in
>>>> time.
>>>>
>>>> Try the two halves separately:
>>>>
>>>>     absent(our_metric{environment="pro",service="bar",stack="foo"})
>>>>
>>>>     up{service="bar",source="app"} == 1
>>>>
>>>> Then try the whole expression at that point in time. Either view the
>>>> graph, or view the instant query and set the instant time to when there
>>>> was a problem.
>>>>
>>>> > As the node went missing the second operand of the binary operator
>>>> > could not be evaluated, simply because it was neither `1`, nor `0`
>>>>
>>>> The expression:
>>>>
>>>>     up{service="bar",source="app"} == 1
>>>>
>>>> can only ever have the value 1 or be missing. metric == constant is a
>>>> filter, not a boolean. The value it returns is the value of the LHS, or
>>>> no value if the filter condition is not met.
>>>>
>>>> Possibly you want to remove the "== 1" entirely:
>>>>
>>>>     absent(our_metric{environment="pro",service="bar",stack="foo"}) and
>>>>     on(stack, environment) up{service="bar",source="app"}
>>>>
>>>> "and" expressions behave in a corresponding way:
>>>>
>>>>     foo and bar
>>>>
>>>> This is *not* boolean. Rather, it takes the vector of timeseries "foo"
>>>> and matches them up with the vector of timeseries "bar". All those
>>>> elements of foo which have exactly matching label sets with bar, are
>>>> passed through unchanged. Anything else is dropped.
>>>>
>>>> So it's just a filter: "give me all values of foo, where there is also
>>>> a value present for bar". It does not have true/false values either as
>>>> its input or its output.
>>>>
>>>> > Or, in other words, the following was holding true:
>>>> >
>>>> > absent(up{service="bar",source="app"}) = 1
>>>>
>>>> How do you know? The "up" metric is always present for a target,
>>>> whether or not scraping is successful: it would only not be present if you
>>>> removed the target from the scrape job. This could be the case if you are
>>>> using some dynamic service discovery, and the service went away.
>>>> But then your real problem is how to stop services vanishing from service
>>>> discovery.
>>>>
>>>> Anyway, you can tell for sure by looking at historical values of these
>>>> queries:
>>>>
>>>>     up{service="bar",source="app"}
>>>>     absent(up{service="bar",source="app"})
>>>>
>>>> On Thursday, 3 March 2022 at 11:12:11 UTC Federico Buti wrote:
>>>>
>>>>> Hi list,
>>>>>
>>>>> For a monitored system we set up a rule as follows:
>>>>>
>>>>>     absent(our_metric{environment="pro",service="bar",stack="foo"}) and
>>>>>     on(stack, environment) up{service="bar",source="app"} == 1
>>>>>
>>>>> This is one of the few absence rules we have in our ruleset. This is
>>>>> also a bit special because the exporter uses the absence of the metric to
>>>>> indicate a problem - something that is discouraged by the guidelines. But
>>>>> that goes beyond my question anyway.
>>>>>
>>>>> Using a binary AND operator seems to work fine, cutting out the cases
>>>>> in which the node is not scrapable. However this morning the node went
>>>>> missing. We had probably a misconfiguration in our provisioning which we
>>>>> are currently investigating.
>>>>>
>>>>> As the node went missing the second operand of the binary operator
>>>>> could not be evaluated, simply because it was neither `1`, nor `0`. Or,
>>>>> in other words, the following was holding true:
>>>>>
>>>>>     absent(up{service="bar",source="app"}) = 1
>>>>>
>>>>> I understand an alert can resolve if the related metric goes stale but
>>>>> I'm not sure how the logic should translate in this case. On the surface
>>>>> I would not expect the AND expression to fire as we are not able to say
>>>>> the "up" metric is really 1.
>>>>>
>>>>> But maybe I'm missing the point here?
>>>>>
>>>>> Thanks in advance,
>>>>> F.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/7485f8ca-2304-4d3c-81fe-a38b3a1d80f9n%40googlegroups.com.
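For anyone who finds the filter semantics easier to follow as code, here is a rough Python model of the behaviour discussed in this thread. It is only a sketch and not how Prometheus is implemented: metric names are ignored, only equality matchers are modelled, a vector is a plain list of (labels, value) pairs, and the helper names (absent, keep_if, and_on, unless_on, alerts) are made up for the illustration. It assumes our_metric is missing in every scenario and compares the "and ... == 1" rule with the "unless ... != 1" rule.

```python
from typing import Callable, Dict, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[Labels, float]
Vector = List[Sample]  # an instant vector: zero or more labelled values


def absent(inner: Vector, matchers: Labels) -> Vector:
    """Empty if `inner` has any element; otherwise a single element with
    value 1 whose labels come from the equality matchers."""
    return [] if inner else [(dict(matchers), 1.0)]


def keep_if(vector: Vector, predicate: Callable[[float], bool]) -> Vector:
    """`metric == constant` style comparison: a filter, not a boolean.
    Elements that pass keep their original value."""
    return [(labels, value) for labels, value in vector if predicate(value)]


def _key(labels: Labels, on: Tuple[str, ...]) -> Tuple[str, ...]:
    # Assumption of this sketch: a label that is not set counts as "".
    return tuple(labels.get(name, "") for name in on)


def and_on(lhs: Vector, rhs: Vector, on: Tuple[str, ...]) -> Vector:
    """Keep LHS elements that have a partner in RHS on the `on` labels."""
    rhs_keys = {_key(labels, on) for labels, _ in rhs}
    return [(labels, value) for labels, value in lhs if _key(labels, on) in rhs_keys]


def unless_on(lhs: Vector, rhs: Vector, on: Tuple[str, ...]) -> Vector:
    """Keep LHS elements that have NO partner in RHS on the `on` labels."""
    rhs_keys = {_key(labels, on) for labels, _ in rhs}
    return [(labels, value) for labels, value in lhs if _key(labels, on) not in rhs_keys]


MATCHERS = {"environment": "pro", "service": "bar", "stack": "foo"}
UP_LABELS = {**MATCHERS, "source": "app"}
ON = ("stack", "environment")


def alerts(up: Vector) -> Tuple[Vector, Vector]:
    """Return (result of the `and` rule, result of the `unless` rule)."""
    lhs = absent([], MATCHERS)  # our_metric is missing
    with_and = and_on(lhs, keep_if(up, lambda v: v == 1), ON)
    with_unless = unless_on(lhs, keep_if(up, lambda v: v != 1), ON)
    return with_and, with_unless


target_up = alerts([(UP_LABELS, 1.0)])    # target scraped successfully
target_down = alerts([(UP_LABELS, 0.0)])  # case (a): up exists, value is not 1
target_gone = alerts([])                  # case (b): up does not exist

for name, (with_and, with_unless) in [
    ("up == 1", target_up),
    ("case (a)", target_down),
    ("case (b)", target_gone),
]:
    print(name, "| and fires:", bool(with_and), "| unless fires:", bool(with_unless))
```

In this model both rules fire while the target is up and both stay quiet in case (a). They differ only in case (b): the "and" rule returns nothing because its right-hand side is empty, while the "unless" rule still returns the absent() element.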

