Hi Debashish,

The way we did SLA reporting on our side was:

   - export an '*_up' metric for each VM, with a value of 1 (up) or 0 (down)
   - create silences in Alertmanager for maintenance periods, and make sure 
   they contain matchers that identify the VMs (we used matchers like 
   'resource_group' and 'resource_name', as the machines run in Azure)
   - export the silences as metrics, just like the machine state, via: 
   https://github.com/FXinnovation/alertmanager-silences-exporter
   The exporter gives you a value of 1 while a silence is active 
   and 0 for all other states.
   - create a recording rule that classifies each VM as 'up', 'down' or 
   'under maintenance', then evaluate the recorded metric over the time 
   range for which we want to calculate the SLA
   - share the results with our clients via Grafana
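
To make the classification step concrete, here is a minimal sketch of the 
logic in Python rather than PromQL, so the arithmetic can be checked offline. 
The (up, silenced) sample format, the function names, and the choice to leave 
maintenance samples out of the SLA denominator are assumptions for 
illustration, not necessarily what a real recording rule or the exporter does.

```python
from typing import Iterable, Tuple


def vm_state(up: int, silenced: int) -> str:
    """Classify one sample; an active silence takes precedence over 'up'."""
    if silenced == 1:
        return "maintenance"
    return "up" if up == 1 else "down"


def sla_percent(samples: Iterable[Tuple[int, int]]) -> float:
    """Percentage of non-maintenance samples in which the VM was up."""
    states = [vm_state(up, silenced) for up, silenced in samples]
    counted = [s for s in states if s != "maintenance"]
    if not counted:
        # Assumption: a window that is entirely under maintenance meets the SLA.
        return 100.0
    return 100.0 * counted.count("up") / len(counted)


# Hypothetical window: 6 samples up, 2 down during a silence, 2 down unplanned.
window = [(1, 0)] * 6 + [(0, 1)] * 2 + [(0, 0)] * 2
print(sla_percent(window))  # 75.0 (6 up out of 8 counted samples)
```

Only the two unplanned 'down' samples lower the result; the two silenced 
samples are ignored entirely.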

Hope this helps!

Thanks,
Roland

On Monday, March 16, 2020 at 4:44:01 PM UTC-4, Christian Hoffmann wrote:
>
> Hi, 
>
> On 3/16/20 9:21 PM, Debashish Ghosh wrote: 
> >   I am currently using Spring's Actuator/Micrometer to emit metrics 
> > that are scraped by Prometheus. 
> > The framework generates a metric called *process_uptime_seconds*, which 
> > is the number of seconds my app has been running in a VM. I have *2 VMs* 
> > where my app is running to provide high availability of 99.95%. 
> > 
> > I am using the formula 
> > 
> > 100-(((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d]))/(30*24*60*60))*100 
> > 
> > to calculate the SLA. 
> > 
> > 30*24*60*60 represents the number of seconds in 30 days, and the 
> > difference with the process_uptime_seconds will give the number of 
> > seconds the app was down in a VM. 
> > 
> > But the problem with this approach is that periodically we have to 
> > *restart* the service to apply patches, and we do so one VM at a time 
> > so that there is no downtime. 
> > 
> > But since the above formula creates one time series for each VM instance, 
> > the SLA goes down, since both servers are restarted one after the 
> > other. 
> > 
> > Is there a way to take this into consideration and calculate the SLA based 
> > on the time *when both servers were down together*? 
> Hrm, can't you just use the up metric to detect whether your application 
> was available? 
>
> That way, you could calculate availability of your service via 
> max(up{instance=~"server1|server2"}) == 1. I think that would make the 
> whole thing much easier, wouldn't it? 
>
> I fail to come up with an idea based on your process_uptime_seconds 
> approach. It may be possible (maybe using a recording rule which decides 
> for each evaluation interval whether your servers count as available or 
> not...?), but it sounds like it would get complicated quickly. 
>
>
> Kind regards, 
> Christian 
>
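
Christian's max(up) suggestion amounts to treating the service as available 
whenever at least one instance is up. A minimal sketch of that arithmetic in 
Python, assuming both instances are sampled at the same timestamps; the 
function name and the sample data are hypothetical.

```python
from typing import Dict, List


def service_availability_percent(up_by_instance: Dict[str, List[int]]) -> float:
    """Percentage of sample points at which at least one instance had up == 1."""
    series = list(up_by_instance.values())
    if not series or not series[0]:
        raise ValueError("need at least one instance with at least one sample")
    # zip() aligns samples by position, i.e. it assumes shared scrape timestamps.
    service_up = [max(point) for point in zip(*series)]
    return 100.0 * sum(service_up) / len(service_up)


# Hypothetical data: a rolling restart (indexes 2 and 3), then one joint outage (index 5).
samples = {
    "server1": [1, 1, 0, 1, 1, 0, 1, 1],
    "server2": [1, 1, 1, 0, 1, 0, 1, 1],
}
print(service_availability_percent(samples))  # 87.5 (7 of 8 points available)
```

The rolling restart does not lower the result; only the point where both 
servers are down does. In PromQL the same idea could probably be written as 
a subquery such as 
avg_over_time((max(up{instance=~"server1|server2"}))[30d:1m]) * 100, 
where the 1m resolution is an assumption.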
