[prometheus-users] Re: K6_http_req_duration_$quantile_stat Metrics are the Same Across Quantiles for Certain APIs

2024-10-04 Thread 'Brian Candler' via Prometheus Users
On Friday 4 October 2024 at 01:59:29 UTC+1 Zhang Zhao wrote:

When running a specific test case and switching the trend metric query to 
different quantile values in Grafana, the panels don't update properly.


I think you should first remove Grafana from the equation entirely. If the 
problem is something to do with Grafana, e.g. Grafana dashboard variables, 
then the appropriate place to ask would be the Grafana 
Community: https://community.grafana.com/

However, in this case it seems that the problem is likely in how you are 
generating the metrics in the first place and submitting them using the 
Remote Write protocol. You haven't shown any code which does that. If that 
code is part of the "k6" framework that you refer to, then probably the 
place you should be asking is on a discussion group for that framework.

Is "$quantile_stat" a feature of Grafana or k6? That should help you decide 
where to focus your attention.

If you still think the issue is to do with Prometheus, then you should 
reproduce your problem using only Prometheus components (e.g. the 
Prometheus web interface, which queries Prometheus directly rather than 
going through Grafana). You'd also need to provide basic information to 
allow the problem to be reproduced, such as what version of Prometheus 
you're running, and samples of the remote write requests.

I would say that in general, Prometheus is very good at faithfully storing 
the data you give it, so if you see a problem it's likely to be "garbage 
in, garbage out". But if you're using one of the more bleeding-edge 
features like native histograms, then it's possible that you've found a 
Prometheus issue.



[prometheus-users] Re: Synology NAS Details Dashboard

2024-09-30 Thread 'Brian Candler' via Prometheus Users
>> raidConformance OBJECT IDENTIFIER ::= { synoRaid 2 }
>> raidCompliances OBJECT IDENTIFIER ::= { raidConformance 1 }
>> raidGroups OBJECT IDENTIFIER ::= { raidConformance 2 }
>>
>> raidCompliance MODULE-COMPLIANCE
>> STATUS  current
>> DESCRIPTION
>> "The compliance statement for synoRaid entities which
>> implement the SYNOLOGY RAID MIB."
>> MODULE  -- this module
>> MANDATORY-GROUPS { raidGroup }
>>
>> ::= { raidCompliances 1 }
>>
>> raidGroup OBJECT-GROUP
>> OBJECTS { raidIndex,
>>   raidName,
>>   raidStatus,
>>   raidFreeSize,
>>   raidTotalSize,
>>   raidHotspareCnt}
>> STATUS  current
>> DESCRIPTION
>> "A collection of objects providing basic instrumentation and
>> control of an synology raid entity."
>> ::= { raidGroups 1 }
>>
>> END
>>
>>   Does this help?
>> On Monday, September 30, 2024 at 10:36:45 AM UTC-4 Brian Candler wrote:
>>
>>> I mean the MIB files consumed by generator.
>>>
>>> On Monday 30 September 2024 at 14:41:14 UTC+1 Mitchell Laframboise wrote:
>>>
>>>> I know the mib is working because when I do an snmpwalk i get the 
>>>> following output.
>>>>
>>>> ~/snmp_exporter/generator$ snmpwalk -v2c -c public *.*.*.* 
>>>> 1.3.6.1.4.1.6574.3
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.1.0 = INTEGER: 0
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.1.1 = INTEGER: 1
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.2.0 = STRING: "Volume 1"
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.2.1 = STRING: "Storage Pool 1"
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.3.0 = INTEGER: 1
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.3.1 = INTEGER: 1
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.4.0 = Counter64: 14395893346304
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.4.1 = Counter64: 398458880
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.5.0 = Counter64: 15355710676992
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.5.1 = Counter64: 15995942993920
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.6.0 = INTEGER: 0
>>>> SNMPv2-SMI::enterprises.6574.3.1.1.6.1 = INTEGER: 0
>>>>
>>>> On Monday, September 30, 2024 at 9:06:17 AM UTC-4 Mitchell Laframboise 
>>>> wrote:
>>>>
>>>>> Thank you for the explanations... I've checked the MIBs and that 
>>>>> metric is included as an object along with others like raidFreeSize that 
>>>>> are also not being included in the generated snmp.yml.  I'm using the latest 
>>>>> version of snmp_exporter 0.26.0. I'm wondering if the generator is broken?
>>>>>
>>>>> On Monday, September 30, 2024 at 8:48:11 AM UTC-4 Brian Candler wrote:
>>>>>
>>>>>> > Since the generator.yml has that metric in overrides, shouldn't it 
>>>>>> be generated?
>>>>>>
>>>>>> No. Overrides only change how a metric is rendered; if there's no 
>>>>>> matching metric in the MIB then there's nothing to override.
>>>>>>
>>>>>> On Monday 30 September 2024 at 13:45:57 UTC+1 Brian Candler wrote:
>>>>>>
>>>>>>> > I looked at the sample snmp.yml from Github that I assume is 
>>>>>>> generated from the default generator.yml and I see that the 
>>>>>>> "raidTotalSize" 
>>>>>>> metric is included, but when I check my snmp.yml that metric isn't 
>>>>>>> included.
>>>>>>>
>>>>>>> Either something is different in your generator.yml, or something is 
>>>>>>> different in the set of MIBs you are making available to generator. If 
>>>>>>> you 
>>>>>>> can solve that, it would avoid you having to hack snmp.yml manually, 
>>>>>>> and 
>>>>>>> might be covering up some other problem.
>>>>>>>
>>>>>>> > the dashboard is still not picking it up.  I guess I'm going to 
>>>>>>> have to ask the Grafana community.
>>>>>>>
>>>>>>> It will be a problem with the queries configured in Grafana, and if 
>>>>>>> they make use of Grafana variables they may not be set the way you 

[prometheus-users] Re: Synology NAS Details Dashboard

2024-09-30 Thread 'Brian Candler' via Prometheus Users
I mean the MIB files consumed by generator.

On Monday 30 September 2024 at 14:41:14 UTC+1 Mitchell Laframboise wrote:

> I know the mib is working because when I do an snmpwalk i get the 
> following output.
>
> ~/snmp_exporter/generator$ snmpwalk -v2c -c public *.*.*.* 
> 1.3.6.1.4.1.6574.3
> SNMPv2-SMI::enterprises.6574.3.1.1.1.0 = INTEGER: 0
> SNMPv2-SMI::enterprises.6574.3.1.1.1.1 = INTEGER: 1
> SNMPv2-SMI::enterprises.6574.3.1.1.2.0 = STRING: "Volume 1"
> SNMPv2-SMI::enterprises.6574.3.1.1.2.1 = STRING: "Storage Pool 1"
> SNMPv2-SMI::enterprises.6574.3.1.1.3.0 = INTEGER: 1
> SNMPv2-SMI::enterprises.6574.3.1.1.3.1 = INTEGER: 1
> SNMPv2-SMI::enterprises.6574.3.1.1.4.0 = Counter64: 14395893346304
> SNMPv2-SMI::enterprises.6574.3.1.1.4.1 = Counter64: 398458880
> SNMPv2-SMI::enterprises.6574.3.1.1.5.0 = Counter64: 15355710676992
> SNMPv2-SMI::enterprises.6574.3.1.1.5.1 = Counter64: 15995942993920
> SNMPv2-SMI::enterprises.6574.3.1.1.6.0 = INTEGER: 0
> SNMPv2-SMI::enterprises.6574.3.1.1.6.1 = INTEGER: 0
>
> On Monday, September 30, 2024 at 9:06:17 AM UTC-4 Mitchell Laframboise 
> wrote:
>
>> Thank you for the explanations... I've checked the MIBs and that metric 
>> is included as an object along with others like raidFreeSize that are also 
>> not being included in the generated snmp.yml.  I'm using the latest version 
>> of snmp_exporter 0.26.0. I'm wondering if the generator is broken?
>>
>> On Monday, September 30, 2024 at 8:48:11 AM UTC-4 Brian Candler wrote:
>>
>>> > Since the generator.yml has that metric in overrides, shouldn't it be 
>>> generated?
>>>
>>> No. Overrides only change how a metric is rendered; if there's no 
>>> matching metric in the MIB then there's nothing to override.
>>>
>>> On Monday 30 September 2024 at 13:45:57 UTC+1 Brian Candler wrote:
>>>
>>>> > I looked at the sample snmp.yml from Github that I assume is 
>>>> generated from the default generator.yml and I see that the 
>>>> "raidTotalSize" 
>>>> metric is included, but when I check my snmp.yml that metric isn't 
>>>> included.
>>>>
>>>> Either something is different in your generator.yml, or something is 
>>>> different in the set of MIBs you are making available to generator. If you 
>>>> can solve that, it would avoid you having to hack snmp.yml manually, and 
>>>> might be covering up some other problem.
>>>>
>>>> > the dashboard is still not picking it up.  I guess I'm going to have 
>>>> to ask the Grafana community.
>>>>
>>>> It will be a problem with the queries configured in Grafana, and if 
>>>> they make use of Grafana variables they may not be set the way you expect. 
>>>> So indeed, Grafana is where you need to look. Using (three dots) > Inspect 
>>>> > Query on a panel, you should be able to see what query it is sending.
>>>>
>>>> On Monday 30 September 2024 at 13:39:30 UTC+1 Mitchell Laframboise 
>>>> wrote:
>>>>
>>>>> I looked at the sample snmp.yml from Github that I assume is generated 
>>>>> from the default generator.yml and I see that the "raidTotalSize" metric 
>>>>> is 
>>>>> included, but when I check my snmp.yml that metric isn't included.  So I 
>>>>> edited the snmp.yml to include that metric and now Prometheus is scraping 
>>>>> that data, but the dashboard is still not picking it up.  I guess I'm 
>>>>> going 
>>>>> to have to ask the Grafana community.
>>>>>
>>>>> On Monday, September 30, 2024 at 3:03:00 AM UTC-4 Brian Candler wrote:
>>>>>
>>>>>> I can't see what you're looking at, because:
>>>>>>
>>>>>> 1. You've shown your generator.yml, but you've not shown the snmp.yml 
>>>>>> output that generator creates.
>>>>>> 2. You've not said how the output snmp.yml is different from the 
>>>>>> supplied snmp.yml
>>>>>> 3. You've not said what version of snmp_exporter you're using, so I 
>>>>>> can't look at the supplied snmp.yml.
>>>>>>
>>>>>> Have you tried using *exactly* the same synology section in your 
>>>>>> generator.yml as in the supplied generator.yml, and then comparing the 
>>>>>> snmp.yml output?
>>>>>>
>>

[prometheus-users] Re: Synology NAS Details Dashboard

2024-09-30 Thread 'Brian Candler' via Prometheus Users
> Since the generator.yml has that metric in overrides, shouldn't it be 
generated?

No. Overrides only change how a metric is rendered; if there's no matching 
metric in the MIB then there's nothing to override.
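
As a rough illustration (a sketch, not your full config): the object has to fall 
inside a walked subtree before an override can have any effect on it.

modules:
  synology:
    walk:
      - 1.3.6.1.4.1.6574.3   # synoRaid - raidTotalSize lives under this subtree
    overrides:
      raidTotalSize:
        type: gauge          # only changes how the walked metric is rendered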

On Monday 30 September 2024 at 13:45:57 UTC+1 Brian Candler wrote:

> > I looked at the sample snmp.yml from Github that I assume is generated 
> from the default generator.yml and I see that the "raidTotalSize" metric is 
> included, but when I check my snmp.yml that metric isn't included.
>
> Either something is different in your generator.yml, or something is 
> different in the set of MIBs you are making available to generator. If you 
> can solve that, it would avoid you having to hack snmp.yml manually, and 
> might be covering up some other problem.
>
> > the dashboard is still not picking it up.  I guess I'm going to have to 
> ask the Grafana community.
>
> It will be a problem with the queries configured in Grafana, and if they 
> make use of Grafana variables they may not be set the way you expect. So 
> indeed, Grafana is where you need to look. Using (three dots) > Inspect > 
> Query on a panel, you should be able to see what query it is sending.
>
> On Monday 30 September 2024 at 13:39:30 UTC+1 Mitchell Laframboise wrote:
>
>> I looked at the sample snmp.yml from Github that I assume is generated 
>> from the default generator.yml and I see that the "raidTotalSize" metric is 
>> included, but when I check my snmp.yml that metric isn't included.  So I 
>> edited the snmp.yml to include that metric and now Prometheus is scraping 
>> that data, but the dashboard is still not picking it up.  I guess I'm going 
>> to have to ask the Grafana community.
>>
>> On Monday, September 30, 2024 at 3:03:00 AM UTC-4 Brian Candler wrote:
>>
>>> I can't see what you're looking at, because:
>>>
>>> 1. You've shown your generator.yml, but you've not shown the snmp.yml 
>>> output that generator creates.
>>> 2. You've not said how the output snmp.yml is different from the 
>>> supplied snmp.yml
>>> 3. You've not said what version of snmp_exporter you're using, so I 
>>> can't look at the supplied snmp.yml.
>>>
>>> Have you tried using *exactly* the same synology section in your 
>>> generator.yml as in the supplied generator.yml, and then comparing the 
>>> snmp.yml output?
>>>
>>> Are you getting any errors or warnings from generator when you run it? 
>>> If so, maybe you've not got the correct versions of MIBs available. The 
>>> Makefile in the generator directory shows where it downloads them from when 
>>> building the default MIBs.
>>>
>>> On Monday 30 September 2024 at 03:20:32 UTC+1 Mitchell Laframboise wrote:
>>>
>>>>   Hi.  I'm having issues with another metric.  raidTotalSize 
>>>>
>>>> its in the default generator.yml under the synology module in overrides 
>>>> but when I generate the snmp.yml it doesn't put the metric in there???   I 
>>>> can't figure out why
>>>>
>>>> Here is my generator.yml file
>>>>
>>>> ---
>>>> auths:
>>>>   public_v1:
>>>> version: 1
>>>>   public_v2:
>>>> version: 2
>>>>
>>>> modules:
>>>>   # Default IF-MIB interfaces table with ifIndex.
>>>>   if_mib:
>>>> walk: [sysUpTime, interfaces, ifXTable]
>>>>
>>>> lookups:
>>>>   - source_indexes: [ifIndex]
>>>> lookup: ifAlias
>>>>   - source_indexes: [ifIndex]
>>>> # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
>>>> lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
>>>>   - source_indexes: [ifIndex]
>>>> # Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
>>>> lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
>>>> overrides:
>>>>   ifAlias:
>>>> ignore: true # Lookup metric
>>>>   ifDescr:
>>>> ignore: true # Lookup metric
>>>>   ifName:
>>>> ignore: true # Lookup metric
>>>>   ifType:
>>>> type: EnumAsInfo
>>>> # Synology
>>>> #
>>>> # Synology MIBs can be found here:
>>>> #   http://www.synology.com/support/snmp_mib.php
>>>> #   
>>>> http://dedl.synology.com/download/Docum

[prometheus-users] Re: Synology NAS Details Dashboard

2024-09-30 Thread 'Brian Candler' via Prometheus Users
> I looked at the sample snmp.yml from Github that I assume is generated 
from the default generator.yml and I see that the "raidTotalSize" metric is 
included, but when I check my snmp.yml that metric isn't included.

Either something is different in your generator.yml, or something is 
different in the set of MIBs you are making available to generator. If you 
can solve that, it would avoid you having to hack snmp.yml manually, and 
might be covering up some other problem.
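
One quick sanity check (a sketch; the path depends on where you unpacked the 
Synology MIBs) is to confirm that the object is actually defined in the MIB 
files generator can see:

grep -rl raidTotalSize mibs/

If that returns nothing, generator has no definition to emit, regardless of what 
generator.yml says.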

> the dashboard is still not picking it up.  I guess I'm going to have to 
ask the Grafana community.

It will be a problem with the queries configured in Grafana, and if they 
make use of Grafana variables they may not be set the way you expect. So 
indeed, Grafana is where you need to look. Using (three dots) > Inspect > 
Query on a panel, you should be able to see what query it is sending.

On Monday 30 September 2024 at 13:39:30 UTC+1 Mitchell Laframboise wrote:

> I looked at the sample snmp.yml from Github that I assume is generated 
> from the default generator.yml and I see that the "raidTotalSize" metric is 
> included, but when I check my snmp.yml that metric isn't included.  So I 
> edited the snmp.yml to include that metric and now Prometheus is scraping 
> that data, but the dashboard is still not picking it up.  I guess I'm going 
> to have to ask the Grafana community.
>
> On Monday, September 30, 2024 at 3:03:00 AM UTC-4 Brian Candler wrote:
>
>> I can't see what you're looking at, because:
>>
>> 1. You've shown your generator.yml, but you've not shown the snmp.yml 
>> output that generator creates.
>> 2. You've not said how the output snmp.yml is different from the supplied 
>> snmp.yml
>> 3. You've not said what version of snmp_exporter you're using, so I can't 
>> look at the supplied snmp.yml.
>>
>> Have you tried using *exactly* the same synology section in your 
>> generator.yml as in the supplied generator.yml, and then comparing the 
>> snmp.yml output?
>>
>> Are you getting any errors or warnings from generator when you run it? If 
>> so, maybe you've not got the correct versions of MIBs available. The 
>> Makefile in the generator directory shows where it downloads them from when 
>> building the default MIBs.
>>
>> On Monday 30 September 2024 at 03:20:32 UTC+1 Mitchell Laframboise wrote:
>>
>>>   Hi.  I'm having issues with another metric.  raidTotalSize 
>>>
>>> its in the default generator.yml under the synology module in overrides 
>>> but when I generate the snmp.yml it doesn't put the metric in there???   I 
>>> can't figure out why
>>>
>>> Here is my generator.yml file
>>>
>>> ---
>>> auths:
>>>   public_v1:
>>> version: 1
>>>   public_v2:
>>> version: 2
>>>
>>> modules:
>>>   # Default IF-MIB interfaces table with ifIndex.
>>>   if_mib:
>>> walk: [sysUpTime, interfaces, ifXTable]
>>>
>>> lookups:
>>>   - source_indexes: [ifIndex]
>>> lookup: ifAlias
>>>   - source_indexes: [ifIndex]
>>> # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
>>> lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
>>>   - source_indexes: [ifIndex]
>>> # Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
>>> lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
>>> overrides:
>>>   ifAlias:
>>> ignore: true # Lookup metric
>>>   ifDescr:
>>> ignore: true # Lookup metric
>>>   ifName:
>>> ignore: true # Lookup metric
>>>   ifType:
>>> type: EnumAsInfo
>>> # Synology
>>> #
>>> # Synology MIBs can be found here:
>>> #   http://www.synology.com/support/snmp_mib.php
>>> #   
>>> http://dedl.synology.com/download/Document/MIBGuide/Synology_MIB_File.zip
>>> #
>>> # Tested on RS2414rp+ NAS
>>> #
>>>   synology:
>>> walk:
>>>   - 1.3.6.1.4.1.6574.1   # synoSystem
>>>   - 1.3.6.1.4.1.6574.2   # synoDisk
>>>   - 1.3.6.1.4.1.6574.3   # synoRaid
>>>   - 1.3.6.1.4.1.6574.4   # synoUPS
>>>   - 1.3.6.1.4.1.6574.5   # synologyDiskSMART
>>>   - 1.3.6.1.4.1.6574.6   # synologyService
>>>   - 1.3.6.1.4.1.6574.101 # storageIO
>>>   - 1.3.6.1.4.1.6574.102 # spaceIO
>>>   - 1.3.6.1.4.1.6574.104   

[prometheus-users] Re: Synology NAS Details Dashboard

2024-09-30 Thread 'Brian Candler' via Prometheus Users
I can't see what you're looking at, because:

1. You've shown your generator.yml, but you've not shown the snmp.yml 
output that generator creates.
2. You've not said how the output snmp.yml is different from the supplied 
snmp.yml
3. You've not said what version of snmp_exporter you're using, so I can't 
look at the supplied snmp.yml.

Have you tried using *exactly* the same synology section in your 
generator.yml as in the supplied generator.yml, and then comparing the 
snmp.yml output?

Are you getting any errors or warnings from generator when you run it? If 
so, maybe you've not got the correct versions of MIBs available. The 
Makefile in the generator directory shows where it downloads them from when 
building the default MIBs.
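
To compare like for like, you could regenerate and diff against the published 
snmp.yml (a sketch; the exact Makefile targets and paths vary between 
snmp_exporter versions, and /path/to/released/snmp.yml is a placeholder):

cd snmp_exporter/generator
# use the Makefile here to fetch the default MIB set, then:
./generator generate          # reads ./generator.yml and ./mibs, writes ./snmp.yml
grep -n raidTotalSize snmp.yml
diff snmp.yml /path/to/released/snmp.yml | less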

On Monday 30 September 2024 at 03:20:32 UTC+1 Mitchell Laframboise wrote:

>   Hi.  I'm having issues with another metric.  raidTotalSize 
>
> its in the default generator.yml under the synology module in overrides 
> but when I generate the snmp.yml it doesn't put the metric in there???   I 
> can't figure out why
>
> Here is my generator.yml file
>
> ---
> auths:
>   public_v1:
> version: 1
>   public_v2:
> version: 2
>
> modules:
>   # Default IF-MIB interfaces table with ifIndex.
>   if_mib:
> walk: [sysUpTime, interfaces, ifXTable]
>
> lookups:
>   - source_indexes: [ifIndex]
> lookup: ifAlias
>   - source_indexes: [ifIndex]
> # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
> lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
>   - source_indexes: [ifIndex]
> # Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
> lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
> overrides:
>   ifAlias:
> ignore: true # Lookup metric
>   ifDescr:
> ignore: true # Lookup metric
>   ifName:
> ignore: true # Lookup metric
>   ifType:
> type: EnumAsInfo
> # Synology
> #
> # Synology MIBs can be found here:
> #   http://www.synology.com/support/snmp_mib.php
> #   
> http://dedl.synology.com/download/Document/MIBGuide/Synology_MIB_File.zip
> #
> # Tested on RS2414rp+ NAS
> #
>   synology:
> walk:
>   - 1.3.6.1.4.1.6574.1   # synoSystem
>   - 1.3.6.1.4.1.6574.2   # synoDisk
>   - 1.3.6.1.4.1.6574.3   # synoRaid
>   - 1.3.6.1.4.1.6574.4   # synoUPS
>   - 1.3.6.1.4.1.6574.5   # synologyDiskSMART
>   - 1.3.6.1.4.1.6574.6   # synologyService
>   - 1.3.6.1.4.1.6574.101 # storageIO
>   - 1.3.6.1.4.1.6574.102 # spaceIO
>   - 1.3.6.1.4.1.6574.104 # synologyiSCSILUN
> lookups:
>   - source_indexes: [spaceIOIndex]
> lookup: spaceIODevice
> drop_source_indexes: true
>   - source_indexes: [storageIOIndex]
> lookup: storageIODevice
> drop_source_indexes: true
>   - source_indexes: [serviceInfoIndex]
> lookup: serviceName
> drop_source_indexes: true
>   - source_indexes: [diskIndex]
> lookup: diskID
> drop_source_indexes: true
>   - source_indexes: [raidIndex]
> lookup: raidName
> drop_source_indexes: true
> overrides:
>   diskModel:
> type: DisplayString
>   diskSMARTAttrName:
> type: DisplayString
>   diskSMARTAttrStatus:
> type: DisplayString
>   diskSMARTInfoDevName:
> type: DisplayString
>   diskType:
> type: DisplayString
>   modelName:
> type: DisplayString
>   raidFreeSize:
> type: gauge
>   raidName:
> type: DisplayString
>   raidTotalSize:
> type: gauge
>   serialNumber:
> type: DisplayString
>   serviceName:
> type: DisplayString
>   version:
> type: DisplayString
>
> # UCD-SNMP-MIB
> #
> # University of California, Davis extensions. Commonly used for host
> # metrics. For example, Linux-based systems, DD-WRT, Synology,
> # Mikrotik, Kemp LoadMaster, etc.
> #
> # http://www.net-snmp.org/docs/mibs/UCD-SNMP-MIB.txt
> #
>   ucd_la_table:
> walk:
>   - 1.3.6.1.4.1.2021.10.1.2 # laNames
>   - 1.3.6.1.4.1.2021.10.1.5 # laLoadInt
>   - 1.3.6.1.4.1.2021.10.1.6 # laLoadFloat
> lookups:
>   - source_indexes: [laIndex]
> lookup: laNames
> drop_source_indexes: true
>   ucd_memory:
> walk:
>   - 1.3.6.1.4.1.2021.4 # memory
>   ucd_system_stats:
> walk:
>   - 1.3.6.1.4.1.2021.11 # systemStats
>
> any help would be appreciated.
>
> Thanks,
>
>
> On Sunday, September 29, 2024 at 4

[prometheus-users] Re: Synology NAS Details Dashboard

2024-09-29 Thread 'Brian Candler' via Prometheus Users
>  I am successful in querying the metrics in Prometheus

Which ones in particular *are* you able to see?

> I did some more queries and found that I'm unable to return ifName?

Please explain exactly what you're doing when you say "unable to return". 
If you're going to the Prometheus web interface (usually at x.x.x.x:9090) 
and entering "ifName" as the query and hitting Enter, and getting no 
results, then it seems like you're not successfully scraping the if_mib 
from any targets. However if you're getting some other metrics like 
ifHCInOctets from the if_mib, then maybe the way you built snmp.yml from 
generator.yml is broken.

You'll need to work out what's happening. In the same Prometheus web 
interface go to Status > Targets as a starting point. If it says the target 
is "up" then try doing exactly the same scrape manually:
curl -v 'x.x.x.x:9116/snmp?target=y.y.y.y&module=&auth='

and/or point a web browser at x.x.x.x:9116/snmp/status as I suggested 
before. Also look at snmp_exporter's stdout ("systemctl status 
snmp_exporter" if you're running it under systemd).

Basically, you need to divide and conquer. If ifName is not being returned 
from any targets, then is it a problem with your snmp.yml, or with your 
prometheus scrape config, or something else? You haven't shown your scrape 
config, so the problem could be there. You also haven't shown the snmp.yml 
which came from your generator.yml.
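
For reference, a typical snmp_exporter scrape job looks roughly like this (a 
sketch with placeholder addresses; the module and auth names must match what's 
in your snmp.yml):

scrape_configs:
  - job_name: snmp
    static_configs:
      - targets: ['y.y.y.y']            # the device being scraped
    metrics_path: /snmp
    params:
      module: [if_mib]
      auth: [public_v2]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: x.x.x.x:9116       # address of snmp_exporter itself

If the relabelling is missing, Prometheus will try to scrape the NAS directly 
instead of going through snmp_exporter.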

On Sunday 29 September 2024 at 14:40:33 UTC+1 Mitchell Laframboise wrote:

> Hi there,
>
>   I did some more queries and found that I'm unable to return ifName?  Im 
> walking that specific OID so I don't understand?
>
> Can you help
>
> On Sunday, September 29, 2024 at 9:23:37 AM UTC-4 Mitchell Laframboise 
> wrote:
>
>> Thanks Brian.  I am successful in querying the metrics in Prometheus, so 
>> I will check out the Grafana community for support.
>>
>> On Sunday, September 29, 2024 at 9:03:12 AM UTC-4 Brian Candler wrote:
>>
>>> First, do a query in the Prometheus web interface (for example, just 
>>> "ifPhysAddress"). If you see no answers, then you need to drill down into 
>>> your metrics collection. Check the query "up" to see if SNMP scraping is 
>>> successful. If it's not, then check logs from snmp_exporter ("journalctl 
>>> -eu snmp_exporter"), or use the test web interface at 
>>> :9116/snmp/status
>>>
>>> If the metrics collection into Prometheus is working, meaning that you 
>>> have a problem with Grafana, then please seek Grafana support from the 
>>> Grafana 
>>> Community <https://community.grafana.com/>.
>>>
>>> On Sunday 29 September 2024 at 13:57:12 UTC+1 Mitchell Laframboise wrote:
>>>
>>>> Good morning group,  I have only some of this public dashboard working 
>>>> and I'm wondering how to get the rest up and running.  I am starting with 
>>>> the interface.  Its showing no data, so I was hoping someone could point 
>>>> me 
>>>> in the right direction.  I've attached a screenshot of the queries for the 
>>>> dashboard and my generator.yml so you can see if I have this set up 
>>>> correctly.
>>>>
>>>> ---
>>>> auths:
>>>>   public_v1:
>>>> version: 1
>>>>   public_v2:
>>>> version: 2
>>>>
>>>> modules:
>>>>   # Default IF-MIB interfaces table with ifIndex.
>>>>   if_mib:
>>>> walk: [sysUpTime, 1.3.6.1.2.1.2.2, 1.3.6.1.2.1.31.1.1]
>>>> lookups:
>>>>   - source_indexes: [ifIndex]
>>>> lookup: ifAlias
>>>>   - source_indexes: [ifIndex]
>>>> # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
>>>> lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
>>>>   - source_indexes: [ifIndex]
>>>> # Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
>>>> lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
>>>> overrides:
>>>>   ifAlias:
>>>> ignore: true # Lookup metric
>>>>   ifDescr:
>>>> ignore: true # Lookup metric
>>>>   ifName:
>>>> ignore: true # Lookup metric
>>>>   ifType:
>>>> type: EnumAsInfo
>>>>   # Default IP-MIB with ipv4InterfaceTable for example.
>>>>   ip_mib:
>>>> walk: [ipv4InterfaceTable]
>>>>
>>>>   readynas

[prometheus-users] Re: Synology NAS Details Dashboard

2024-09-29 Thread 'Brian Candler' via Prometheus Users
First, do a query in the Prometheus web interface (for example, just 
"ifPhysAddress"). If you see no answers, then you need to drill down into 
your metrics collection. Check the query "up" to see if SNMP scraping is 
successful. If it's not, then check logs from snmp_exporter ("journalctl 
-eu snmp_exporter"), or use the test web interface at 
:9116/snmp/status
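
For example, if your scrape job is called "snmp" (the job name is whatever you 
used in prometheus.yml):

up{job="snmp"}        # 1 = last scrape succeeded, 0 = it failed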

If the metrics collection into Prometheus is working, then the problem is 
with Grafana, and you should seek Grafana support from the Grafana 
Community: https://community.grafana.com/

On Sunday 29 September 2024 at 13:57:12 UTC+1 Mitchell Laframboise wrote:

> Good morning group,  I have only some of this public dashboard working and 
> I'm wondering how to get the rest up and running.  I am starting with the 
> interface.  Its showing no data, so I was hoping someone could point me in 
> the right direction.  I've attached a screenshot of the queries for the 
> dashboard and my generator.yml so you can see if I have this set up 
> correctly.
>
> ---
> auths:
>   public_v1:
> version: 1
>   public_v2:
> version: 2
>
> modules:
>   # Default IF-MIB interfaces table with ifIndex.
>   if_mib:
> walk: [sysUpTime, 1.3.6.1.2.1.2.2, 1.3.6.1.2.1.31.1.1]
> lookups:
>   - source_indexes: [ifIndex]
> lookup: ifAlias
>   - source_indexes: [ifIndex]
> # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
> lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
>   - source_indexes: [ifIndex]
> # Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
> lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
> overrides:
>   ifAlias:
> ignore: true # Lookup metric
>   ifDescr:
> ignore: true # Lookup metric
>   ifName:
> ignore: true # Lookup metric
>   ifType:
> type: EnumAsInfo
>   # Default IP-MIB with ipv4InterfaceTable for example.
>   ip_mib:
> walk: [ipv4InterfaceTable]
>
>   readynas:
> walk:
>   - 1.3.6.1.4.1.4526   # Raid/Disks status
>
> # Synology
> #
> # Synology MIBs can be found here:
> #   http://www.synology.com/support/snmp_mib.php
> #   
> http://dedl.synology.com/download/Document/MIBGuide/Synology_MIB_File.zip
> #
> # Tested on RS2414rp+ NAS
> #
>   synology:
> walk:
>   - 1.3.6.1.4.1.6574.1   # synoSystem
>   - 1.3.6.1.4.1.6574.2   # synoDisk
>   - 1.3.6.1.4.1.6574.3   # synoRaid
>   - 1.3.6.1.4.1.6574.4   # synoUPS
>   - 1.3.6.1.4.1.6574.5   # synologyDiskSMART
>   - 1.3.6.1.4.1.6574.6   # synologyService
>   - 1.3.6.1.4.1.6574.101 # storageIO
>   - 1.3.6.1.4.1.6574.102 # spaceIO
>   - 1.3.6.1.4.1.6574.104 # synologyiSCSILUN
>   - 1.3.6.1.4.1.6574.3.1  # raid table
> lookups:
>   - source_indexes: [spaceIOIndex]
> lookup: spaceIODevice
> drop_source_indexes: true
>   - source_indexes: [storageIOIndex]
> lookup: storageIODevice
> drop_source_indexes: true
>   - source_indexes: [serviceInfoIndex]
> lookup: serviceName
> drop_source_indexes: true
>   - source_indexes: [diskIndex]
> lookup: diskID
> drop_source_indexes: true
>   - source_indexes: [raidIndex]
> lookup: raidName
> drop_source_indexes: true
> overrides:
>   diskModel:
> type: DisplayString
>   diskSMARTAttrName:
> type: DisplayString
>   diskSMARTAttrStatus:
> type: DisplayString
>   diskSMARTInfoDevName:
> type: DisplayString
>   diskType:
> type: DisplayString
>   modelName:
> type: DisplayString
>   raidFreeSize:
> type: gauge
>   raidName:
> type: DisplayString
>   raidTotalSize:
> type: gauge
>   serialNumber:
> type: DisplayString
>   serviceName:
> type: DisplayString
>   version:
> type: DisplayString
>
> # UCD-SNMP-MIB
> #
> # University of California, Davis extensions. Commonly used for host
> # metrics. For example, Linux-based systems, DD-WRT, Synology,
> # Mikrotik, Kemp LoadMaster, etc.
> #
> # http://www.net-snmp.org/docs/mibs/UCD-SNMP-MIB.txt
> #
>   ucd_la_table:
> walk:
>   - 1.3.6.1.4.1.2021.10.1.2 # laNames
>   - 1.3.6.1.4.1.2021.10.1.5 # laLoadInt
>   - 1.3.6.1.4.1.2021.10.1.6 # laLoadFloat
> lookups:
>   - source_indexes: [laIndex]
> lookup: laNames
> drop_source_indexes: true
>   ucd_memory:
> walk:
>   - 1.3.6.1.4.1.2021.4 # memory
>   ucd_system_stats:
> walk:
>   - 1.3.6.1.4.1.2021.11 # systemStats
>
>
> Any help would be greatly appreciated!
>
> Thanks,
>
> [image: Grafana Dashboard NAS.png]
>
>


[prometheus-users] Re: promql stat functions return identical values

2024-09-27 Thread 'Brian Candler' via Prometheus Users
> e.g. Grafana can quite happily render 0-1 as 0-100%

and in alerting rules:

- expr: blah > 0.9
  annotations:
    summary: 'filesystem usage is high: {{ $value | humanizePercentage }}'
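
Spelled out as a complete rule, still working with a 0-1 fraction (the alert 
name, threshold and "for" duration here are just illustrative):

groups:
  - name: filesystem
    rules:
      - alert: FilesystemNearlyFull
        expr: 1 - node_filesystem_avail_bytes / node_filesystem_size_bytes > 0.9
        for: 15m
        annotations:
          summary: 'filesystem usage is high: {{ $value | humanizePercentage }}'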



[prometheus-users] Re: promql stat functions return identical values

2024-09-27 Thread 'Brian Candler' via Prometheus Users
Perform the two halves of the query separately, i.e.
max_over_time(node_filesystem_avail_bytes{...}[1h])
max_over_time(node_filesystem_size_bytes{...}[1h])

and then you'll see why they divide to give 48% instead of 97%

I expect node_filesystem_size_bytes doesn't change much, so max_over_time 
doesn't do much for that. But max_over_time(node_filesystem_avail_bytes) 
will show the *largest* available space over that 1 hour window, and 
therefore you'll get the value for when the disk was *least full*. If you 
want to know the value when it was *most full* then it would be 
min_over_time(node_filesystem_avail_bytes).

Note that you showed a graph, rather than a table. When you're graphing, 
you're repeating the same query at different evaluation times. So where the 
time axis shows 04:00, the data point on the graph is for the 1 hour period 
from 03:00 to 04:00. Where the time axis is 04:45, the result is of your 
query covering the 1 hour from 03:45 to 04:45. 

Aside: in general, I'd advise keeping percentage queries simple by removing 
the factor of 100, so you get a fraction between 0 and 1 instead. This can 
be represented as a human-friendly percentage when rendered (e.g. Grafana 
can quite happily render 0-1 as 0-100%)
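
If what you actually want is the peak usage over the window, one option (a 
sketch; the label matchers are placeholders) is to compute the fraction first 
and then take max_over_time of it with a subquery:

max_over_time(
  (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})[1h:]
)

When the filesystem size is constant over the hour, this is the same as 
1 - min_over_time(avail)/size, but it stays correct even if the size changes.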

On Friday 27 September 2024 at 06:01:16 UTC+1 mohan garden wrote:

> Sorry for the double posting, image was corrupted, so reposting 
>
> Thank you for the response Brian,
>
> I removed the $__ variables and tried viewing disk usage metrics from past 
> 1 hour in PromUI -
> I tried the query in the Prometheus UI , and i was expecting value ~97% 
> with following query for past 1 hour metrics but the table view reports 
> 48%. 
>
> [image: max_over_time.png]
> I am not sure if i missed out on some thing within the query.
>
> i am under impression that max function works with multiple series,  and 
> over time will generate stats from the values within the series.
> Please advice.
>
>
>
>
>
>
> On Friday, September 27, 2024 at 10:29:22 AM UTC+5:30 mohan garden wrote:
>
>> Thank you for the response Brian,
>>
>> I removed the $__ variables and tried viewing disk usage metrics from 
>> past 1 hour in PromUI -
>>  
>>
>> I tried the query in the Prometheus UI , and i was expecting value ~97%  
>>  for past 1 hour metrics but the table view reports 48%. 
>>
>> [image: max_over_time.png]
>> I am not sure if i missed out on something within the query.
>>
>> i am under impression that max function works with multiple series,  and 
>> over time will generate stats from the values within the series.
>> Please advice.
>>
>>
>>
>> On Tuesday, September 24, 2024 at 8:12:29 PM UTC+5:30 Brian Candler wrote:
>>
>>> $__rate_interval is (roughly speaking) the interval between 2 adjacent 
>>> points in the graph, with a minimum of 4 times the configured scrape 
>>> interval. It's not the entire period over which Grafana is drawing the 
>>> graph. You probably want $__range or $__range_s. See:
>>>
>>> https://grafana.com/docs/grafana/latest/datasources/prometheus/template-variables/#use-__rate_interval
>>>
>>> https://grafana.com/docs/grafana/latest/dashboards/variables/add-template-variables/#global-variables
>>>
>>> However, questions about Grafana would be better off asked in the 
>>> Grafana community. Prometheus is not Grafana, and those variables are 
>>> Grafana-specific.
>>>
>>> > so you can see that avg|min|max_over_time functions return identical 
>>> values which dont make much sense
>>>
>>> It makes sense when you realise that the time period you're querying 
>>> over is very small; hence for a value that doesn't change rapidly, the 
>>> min/max/average over such a short time range will all be roughly the same.
>>>
>>> On Tuesday 24 September 2024 at 15:10:33 UTC+1 mohan garden wrote:
>>>
>>>> Hi , 
>>>> seems images in my previous post did not show up as expected.
>>>> Sorry for the spam , reposting again  - 
>>>>
>>>>
>>>> Hi , 
>>>>
>>>> I am trying to analyse memory usage of a server for 2 specific months 
>>>> using Grafana and prometheus. but seems _over_time functions are returning 
>>>> unexpected results.
>>>>
>>>> Here is the data for the duration
>>>>
>>>> [image: one.png]
>>>>
>>>> Now, the summary table shows expected values
>>>> [image: one.png]
>>>>
>>>>
>>>> query -
>>>> (( 
>>>> node_me

[prometheus-users] Re: promql stat functions return identical values

2024-09-24 Thread 'Brian Candler' via Prometheus Users
$__rate_interval is (roughly speaking) the interval between 2 adjacent 
points in the graph, with a minimum of 4 times the configured scrape 
interval. It's not the entire period over which Grafana is drawing the 
graph. You probably want $__range or $__range_s. See:
https://grafana.com/docs/grafana/latest/datasources/prometheus/template-variables/#use-__rate_interval
https://grafana.com/docs/grafana/latest/dashboards/variables/add-template-variables/#global-variables

However, questions about Grafana would be better off asked in the Grafana 
community. Prometheus is not Grafana, and those variables are 
Grafana-specific.

> so you can see that avg|min|max_over_time functions return identical 
values which dont make much sense

It makes sense when you realise that the time period you're querying over 
is very small; hence for a value that doesn't change rapidly, the 
min/max/average over such a short time range will all be roughly the same.
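
For example, to get the peak memory usage over the whole period the dashboard is 
displaying, you could use $__range together with a subquery (a sketch; this is 
still Grafana-specific templating):

max_over_time(
  (1 - node_memory_MemAvailable_bytes{instance="$node",job="$job"}
     / node_memory_MemTotal_bytes{instance="$node",job="$job"})[$__range:]
) * 100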

On Tuesday 24 September 2024 at 15:10:33 UTC+1 mohan garden wrote:

> Hi , 
> seems images in my previous post did not show up as expected.
> Sorry for the spam , reposting again  - 
>
>
> Hi , 
>
> I am trying to analyse memory usage of a server for 2 specific months 
> using Grafana and prometheus. but seems _over_time functions are returning 
> unexpected results.
>
> Here is the data for the duration
>
> [image: one.png]
>
> Now, the summary table shows expected values
> [image: one.png]
>
>
> query -
> (( 
> node_memory_MemAvailable_bytes{instance="$node",job="$job"}[$__rate_interval])
>  
> * 100 ) / 
> node_memory_MemTotal_bytes{instance="$node",job="$job"}[$__rate_interval]
>
>
> Issue - when i am trying to create similar stats using PromQL at my end , 
> i am facing issues . i fail to get the same stats when i use the following 
> promql , example -
>
> [image: two.png]
>
> ( 
> avg_over_time(node_memory_MemAvailable_bytes{instance="$node",job="$job"}[$__rate_interval])
>  
> / 
> avg_over_time(node_memory_MemTotal_bytes{instance="$node",job="$job"}[$__rate_interval])
>  
> ) * 100
>
> ( 
> min_over_time(node_memory_MemAvailable_bytes{instance="$node",job="$job"}[$__rate_interval])
>  
> / 
> min_over_time(node_memory_MemTotal_bytes{instance="$node",job="$job"}[$__rate_interval])
>  
> ) * 100
>
> ( 
> max_over_time(node_memory_MemAvailable_bytes{instance="$node",job="$job"}[$__rate_interval])
>  
> / 
> max_over_time(node_memory_MemTotal_bytes{instance="$node",job="$job"}[$__rate_interval])
>  
> ) * 100
>
> so you can see that avg|min|max_over_time functions return identical 
> values which dont make much sense. I was using  following setting
>
> [image: one.png]
>
> I tried changing from range -> instant, i see similar values
> [image: two.png]
>
> Where do i need to make modifications in PromQL so i can get the correct 
> min/max/avg values in the gauges as correctly reported by the
> [image: one.png]
>
>
> for a specific duration , say - 
>
> [image: one.png]
>
> please advice
>
>
>
>
>
>
>
>
>
> On Tuesday, September 24, 2024 at 7:25:00 PM UTC+5:30 mohan garden wrote:
>
>> I am trying to analyse memory usage of a server for 2 specific months 
>> using Grafana and prometheus. but seems _over_time functions are returning 
>> unexpected results.
>>
>> Here is the data for the duration
>> [image: image] 
>> 
>> the summary table shows expected values
>> [image: image] 
>> 
>> query -
>> (( 
>> node_memory_MemAvailable_bytes{

Re: [prometheus-users] Re: TLS CONFIGURATION

2024-09-15 Thread 'Brian Candler' via Prometheus Users
> The error
> ts=2024-09-15T17:58:49.480Z caller=coordinator.go:118 level=error 
component=configuration msg="Loadion file failed" 
file=/etc/alertmanager/alertmanager.yml err="yaml: unmarshal errors:\n 
 line 7: field tls_config not found in type config.plain"

It's saying that you cannot put "tls_config" as a top-level key in the 
Alertmanager config. Since the config file is invalid, it cannot run.

As I said before, if you need to use tls_config then it has to be under the 
E-mail receiver.

receivers:
  - name: send_email
    email_configs:
      - to: chi...@valucid.com
        from: chi...@valucid.com
        smarthost: smtp.zoho.com:587
        auth_username: chi...@valucid.com
        auth_password: pa
        require_tls: true
        tls_config:
          ... blah

You don't need to repeat the smarthost / auth_username / auth_password / 
require_tls if you've set them globally.
But unfortunately you *do* need to put a separate "tls_config" section 
under every email receiver.

> tls_config:
>   cert_file: /home/chinelo/alertmanager.crt
>   key_file: /home/chinelo/alertmanager.key

That means you want to authenticate to your SMTP server using a TLS client 
certificate. I note that if I connect to it, it says it only supports 
password authentication (LOGIN and PLAIN):

% openssl s_client -connect smtp.zoho.com:587 -starttls smtp
...
ehlo wombat
250-mx.zohomail.com Hello wombat (x.x.x.x (x.x.x.x))
250-AUTH LOGIN PLAIN
250 SIZE 32505856

I believe the normal way to do TLS client authentication would be with the 
SASL "EXTERNAL" mechanism. But since you are already providing an 
auth_username and auth_password, I don't think you'll need to provide a TLS 
certificate as well.  (In which case, maybe you don't need a tls_config 
section at all).

However, that's all detail around your particular SMTP server, and maybe it 
works in a weird way.

On Sunday 15 September 2024 at 19:08:28 UTC+1 Chinelo Ufondu wrote:

> This is what i did
> global:
>   smtp_smarthost: smtp.zoho.com:587
>   smtp_from: chi...@valucid.com
>   smtp_auth_username: 'chi...@valucid.com'
>   smtp_auth_password: passs
>
>   smtp_require_tls: true
> tls_config:
>   cert_file: /home/chinelo/alertmanager.crt
>   key_file: /home/chinelo/alertmanager.key
> receivers:
>   - name: send_email
> email_configs:
>   - to: chi...@valucid.com
> from: chi...@valucid.com
> smarthost: smtp.zoho.com:587
> auth_username: chi...@valucid.com
> auth_password: pa
>
> require_tls: true
>   - name: send_email2
> email_configs:
>   - to: la...@valucid.com
> from: la...@valucid.com
> smarthost: smtp.zoho.com:587
> auth_username: la...@valucid.com
> auth_password: pa
>
> require_tls: true
> route:
>   receiver: send_email
>   routes:
> - receiver: send_email2
> inhibit_rules:
>   - source_match:
>   severity: critical
> target_match:
>   severity: warning
> equal:
>   - alertname
>   - dev
>   - instance
>
> The error
>  ts=2024-09-15T17:58:49.480Z caller=coordinator.go:118 level=error 
> component=configuration msg="Loadion file failed" 
> file=/etc/alertmanager/alertmanager.yml err="yaml: unmarshal errors:\n 
>  line 7: field tls_config not found in type config.plain"
>
> Sep 15 17:58:49 localhost alertmanager[2767706]: 
> ts=2024-09-15T17:58:49.480Z 
> On Sun, 15 Sept 2024 at 18:46, 'Brian Candler' via Prometheus Users <
> promethe...@googlegroups.com> wrote:
>
>> Show what you did, and what the error was, and then maybe we can help you.
>>
>> There are some global settings that cover common use cases:
>>
>> https://prometheus.io/docs/alerting/latest/configuration/#file-layout-and-global-settings
>>
>> However, if you need more control (e.g. for client certificate auth or 
>> accepting self-signed certificates from the E-mail server) you'll need to 
>> use tls_config under the email receiver definition:
>> https://prometheus.io/docs/alerting/latest/configuration/#email_config
>> https://prometheus.io/docs/alerting/latest/configuration/#tls_config
>>
>> On Sunday 15 September 2024 at 16:48:16 UTC+1 Chinelo Ufondu wrote:
>>
>>> Hello all!!
>>>
>>> I am currently trying to configure TLS in my alert manager configuration 
>>> file to enable it authenticate to my smtp host, I have tried various 
>>> options from the documentation and forums , but all to no avail. I would 
>>&g

[prometheus-users] Re: TLS CONFIGURATION

2024-09-15 Thread 'Brian Candler' via Prometheus Users
Show what you did, and what the error was, and then maybe we can help you.

There are some global settings that cover common use cases:
https://prometheus.io/docs/alerting/latest/configuration/#file-layout-and-global-settings

However, if you need more control (e.g. for client certificate auth or 
accepting self-signed certificates from the E-mail server) you'll need to 
use tls_config under the email receiver definition:
https://prometheus.io/docs/alerting/latest/configuration/#email_config
https://prometheus.io/docs/alerting/latest/configuration/#tls_config
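
A minimal sketch of where that section sits, with placeholder values:

receivers:
  - name: send_email
    email_configs:
      - to: someone@example.com
        require_tls: true
        tls_config:
          # e.g. ca_file / cert_file / key_file / insecure_skip_verify as needed
          insecure_skip_verify: false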

On Sunday 15 September 2024 at 16:48:16 UTC+1 Chinelo Ufondu wrote:

> Hello all!!
>
> I am currently trying to configure TLS in my alert manager configuration 
> file to enable it authenticate to my smtp host, I have tried various 
> options from the documentation and forums , but all to no avail. I would 
> appreciate if I am being assisted with this blocker.
>



[prometheus-users] Re: Synology SNMP

2024-09-11 Thread 'Brian Candler' via Prometheus Users
> The job is running so the Dashboard must be broken

Quite possibly (many of them are). I suggest you don't go to the Targets 
screen and click on the scrape URL; that will make a scrape from your 
browser.  Rather, go to the PromQL interface (the main page) and enter some 
PromQL queries like

ifHCInOctets
ssCpuUser

and see if you get any results from the query.

As I said before: that dashboard requires ssCpuUser, which according to a 
web search is .1.3.6.1.4.1.2021.11.9.  The default generator.yml has a few 
entries under enterprise 2021 (ucd_la_table, ucd_memory, ucd_system_stats), 
and ucd_system_stats seems to cover it:

  ucd_system_stats:
    walk:
      - 1.3.6.1.4.1.2021.11 # systemStats

That's it - no overrides etc. So you should be able to test this with the 
default snmp.yml, and/or include that walk in your generator.yml.  If you 
need to, a modern snmp_exporter can scrape multiple modules in one 
scrape, e.g. 
/snmp?target=x.x.x.x&module=if_mib,ucd_system_stats&auth=public_v2

On Wednesday 11 September 2024 at 01:13:56 UTC+1 Mitchell Laframboise wrote:

> This is still not working yet.  
>
> Here is a copy of my generator.yml and prometheus.yml   I listed all the 
> OID's to walk, I just don't know how to configure the lookups and 
> overrides.  Can some please tell me if this looks right or if I did 
> something wrong?
>
> #generator.yml
> auths:
>   public_v1:
> version: 1
>   public_v2:
> version: 2
>
> modules:
>   # Default IF-MIB interfaces table with ifIndex.
>   if_mib:
> walk: [sysUpTime, interfaces, ifXTable]
> lookups:
>   - source_indexes: [ifIndex]
> lookup: ifAlias
>   - source_indexes: [ifIndex]
> # Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
> lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
>   - source_indexes: [ifIndex]
> # Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
> lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
> overrides:
>   ifAlias:
> ignore: true # Lookup metric
>   ifDescr:
> ignore: true # Lookup metric
>   ifName:
> ignore: true # Lookup metric
>   ifType:
> type: EnumAsInfo
> # Synology
> #
> # Synology MIBs can be found here:
> #   http://www.synology.com/support/snmp_mib.php
> #   
> http://dedl.synology.com/download/Document/MIBGuide/Synology_MIB_File.zip
> #
> # Tested on RS2414rp+ NAS
> #
>   synology:
> walk:
>   - 1.3.6.1.4.1.6574.1   # synoSystem
>   - 1.3.6.1.4.1.6574.2   # synoDisk
>   - 1.3.6.1.4.1.6574.3   # synoRaid
>   - 1.3.6.1.4.1.6574.4   # synoUPS
>   - 1.3.6.1.4.1.6574.5   # synologyDiskSMART
>   - 1.3.6.1.4.1.6574.6   # synologyService
>   - 1.3.6.1.4.1.6574.101 # storageIO
>   - 1.3.6.1.4.1.6574.102 # spaceIO
>   - 1.3.6.1.4.1.6574.104 # synologyiSCSILUN
>   - 1.3.6.1.4.1.2021.10
>   - 1.3.6.1.4.1.2021.4.3
>   - 1.3.6.1.4.1.2021.4.4
>   - 1.3.6.1.4.1.2021.4.5
>   - 1.3.6.1.4.1.2021.4.6
>   - 1.3.6.1.4.1.2021.4.11
>   - 1.3.6.1.4.1.2021.4.13
>   - 1.3.6.1.4.1.2021.4.14
>   - 1.3.6.1.4.1.2021.4.15
>   - 1.3.6.1.2.1.31.1.1.1.1
>   - 1.3.6.1.2.1.31.1.1.1.6
>   - 1.3.6.1.2.1.31.1.1.1.10
>   - 1.3.6.1.2.1.25.2.3.1.3
>   - 1.3.6.1.2.1.25.2.3.1.4
>   - 1.3.6.1.2.1.25.2.3.1.5
>   - 1.3.6.1.2.1.25.2.3.1.6
>   - 1.3.6.1.4.1.2021.13.15.1.1.2
>   - 1.3.6.1.4.1.2021.13.15.1.1.12
>   - 1.3.6.1.4.1.2021.13.15.1.1.13
>   - 1.3.6.1.4.1.6574.2
>   - 1.3.6.1.4.1.6574.1
>   - 1.3.6.1.4.1.6574.3
>   - 1.3.6.1.4.1.6574.4
> lookups:
>   - source_indexes: [spaceIOIndex]
> lookup: spaceIODevice
> drop_source_indexes: true
>   - source_indexes: [storageIOIndex]
> lookup: storageIODevice
> drop_source_indexes: true
>   - source_indexes: [serviceInfoIndex]
> lookup: serviceName
> drop_source_indexes: true
>   - source_indexes: [diskIndex]
> lookup: diskID
> drop_source_indexes: true
>   - source_indexes: [raidIndex]
> lookup: raidName
> drop_source_indexes: true
> overrides:
>   diskModel:
> type: DisplayString
>   diskSMARTAttrName:
> type: DisplayString
>   diskSMARTAttrStatus:
> type: DisplayString
>   diskSMARTInfoDevName:
> type: DisplayString
>   diskType:
> type: DisplayString
>   modelName:
> type: DisplayString
>   raidFreeSize:
> type: gauge
>   raidName:
> type: DisplayString
>   raidTotalSize:
> type: gauge
>   serialNumber:
> type: DisplayString
>   serviceName:
> type: DisplayString
>   version:
> type: DisplayString
>
> #prometheus.yml
>
> global:
>   scrape_interval: 30s # Set the scrape interval to every 15 seconds. 
> Default is every 1 minute.
>   evaluation_interval: 30s # Evaluate rules every 15 secon

Re: [prometheus-users] Re: ALERTMANAGER NOT RUNNING

2024-09-10 Thread 'Brian Candler' via Prometheus Users
> Yes, but i do not know why when trying to start Alertmanager it tells me 
the port is already in use and can’t start.

It's because there's an instance of alertmanager already running. (*)

This is not really a question about prometheus or alertmanager; it's a 
general system administration question. It all depends on how alertmanager 
was originally installed on your system, and whether it's running under 
some sort of supervisor process, and if so what that supervisor is.  For 
example, it's possible to run alertmanager under systemd, in which case 
you'd use systemd commands to start and stop it. But that configuration is 
not supplied as part of alertmanager; it's something that a third party 
would have added, perhaps when packaging it up.

So the answer depends entirely on the details of your system.  You might 
want to find a local system administrator who can help you identify how 
alertmanager was originally installed and configured.

(*) Or possibly it could be some other software listening on ports 9093 and 
9094. Either way, you need to identify what that process is. Julius gave 
you some commands as a starting point to help identify that process.
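
For example, on a typical Linux host either of these will show which process 
owns the port (run with root privileges):

ss -tlnp | grep -E ':(9093|9094)'
lsof -iTCP:9093 -sTCP:LISTEN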

On Tuesday 10 September 2024 at 20:01:25 UTC+1 Chinelo Ufondu wrote:

> Yes, but i do not know why when trying to start Alertmanager it tells me 
> the port is already in use and can’t start.
>
> I was able to change the default port to 9095 on Alertmanager.service 
> file, and specify this command on run
>
> *alertmanager --config.file=alertmanager.yml 
> --cluster.listen-address=0.0.0.0:8081 <http://0.0.0.0:8081> - - 
> web.listen-address= 0.0.0.0:9095 <http://0.0.0.0:9095> *
>
> It ran successfully, then I checked my log file and saw an http2 error, 
> meaning Alertmanager is till using its default port. I also tried accessing 
> Alertmanager via my web interface with the new port no attached, nothing 
> showed up 
>
> On Tue, 10 Sep 2024 at 19:34, 'Brian Candler' via Prometheus Users <
> promethe...@googlegroups.com> wrote:
>
>> alertmanager listens on two ports. By default:
>> --web.listen-address=:9093
>> --cluster.listen-address=0.0.0.0:9094
>>
>> On Tuesday 10 September 2024 at 15:25:31 UTC+1 Chinelo Ufondu wrote:
>>
>>> Hello 
>>> i have tried again by running this command like you suggested and 
>>> specifying a port that the clusters should listen on  *alertmanager 
>>> --config.file=alertmanager.yml --cluster.listen-address=0.0.0.0:8081 
>>> <http://0.0.0.0:8081>*, and i got a different error saying port 9093 is 
>>> already in use, and port 9093 is the default port alertmanager is currently 
>>> listening on
>>>
>>> ts=2024-09-10T13:10:36.310Z caller=main.go:181 level=info msg="Starting 
>>> Alertmanager" version="(version=0.27.0, branch=HEAD, revision=
>>> 0aa3c2aad14cff039931923ab16b26b7481783b5)"
>>> ts=2024-09-10T13:10:36.310Z caller=main.go:182 level=info 
>>> build_context="(go=go1.21.7, platform=linux/amd64, user=root@22cd11f671e9, 
>>> date=20240228-11:51:20, tags=netgo)"
>>> ts=2024-09-10T13:10:36.325Z caller=cluster.go:186 level=info 
>>> component=cluster msg="setting advertise address explicitly" 
>>> addr=192.168.101.2 port=8081
>>> ts=2024-09-10T13:10:36.326Z caller=cluster.go:683 level=info 
>>> component=cluster msg="Waiting for gossip to settle..." interval=2s
>>> ts=2024-09-10T13:10:36.359Z caller=coordinator.go:113 level=info 
>>> component=configuration msg="Loading configuration file" 
>>> file=alertmanager.yml
>>> ts=2024-09-10T13:10:36.359Z caller=coordinator.go:126 level=info 
>>> component=configuration msg="Completed loading of configuration file" 
>>> file=alertmanager.yml
>>> ts=2024-09-10T13:10:36.360Z caller=main.go:394 level=info 
>>> component=configuration msg="skipping creation of receiver not referenced 
>>> by any route" receiver=send_email2
>>> ts=2024-09-10T13:10:36.363Z caller=main.go:517 level=error msg="Listen 
>>> error" err="listen tcp :9093: bind: address already in use"
>>> ts=2024-09-10T13:10:36.365Z caller=cluster.go:692 level=info 
>>> component=cluster msg="gossip not settled but continuing anyway" polls=0 
>>> elapsed=39.047022ms
>>>
>>>
>>> On Sun, 8 Sep 2024 at 16:11, Chinelo Ufondu  
>>> wrote:
>>>
>>>> I tried running alertmanager again and i came across this issue, here 
>>>> is the error
>>>>
>>>> ts=2024-09-01T17:35:52.421Z cal

[prometheus-users] Re: Synology SNMP

2024-09-10 Thread 'Brian Candler' via Prometheus Users
So when you said "im getting metrics returned", which metrics were you 
talking about?

> the job name specified in the .yml isn't even showing up.

In which yml - the Grafana dashboard, the Prometheus scrape config, 
something else?

On Tuesday 10 September 2024 at 20:00:26 UTC+1 Mitchell Laframboise wrote:

> Im not getting any results from the queries but the job name specified in 
> the .yml isn't even showing up.
>
> On Tuesday, September 10, 2024 at 2:56:27 PM UTC-4 Brian Candler wrote:
>
>> Looking at the source of that dashboard, all the queries are filtered 
>> against
>> {job=~'$JobName',instance=~'$Device'
>>
>> and the way the JobName values are chosen is from this query:
>>
>> "name": "JobName",
>> "options": [],
>> "query": {
>>   "query": "label_values(ssCpuUser, job)",
>>   "refId": "StandardVariableQuery"
>> },
>>
>> I have no idea what the "ssCpuUser" metric is, but if you're not 
>> collecting that metric, this dashboard won't show you anything.
>>
>> It looks like it's a vendor-specific MIB from the net-snmp (ucdavis) MIB 
>> tree:
>> https://kb.synology.com/en-af/DG/Synology_DiskStation_MIB_Guide/4
>>
>> On Tuesday 10 September 2024 at 19:49:15 UTC+1 Mitchell Laframboise wrote:
>>
>>> I am using a published dashboard 
>>> https://grafana.com/grafana/dashboards/14284-synology-nas-details/
>>>
>>> I will try your suggestion about opening the panels and running the 
>>> queries.
>>>
>>> Thanks for the lead
>>>
>>> On Tuesday, September 10, 2024 at 2:43:23 PM UTC-4 Brian Candler wrote:
>>>
>>>> If you're "getting metrics returned" then either the dashboard is 
>>>> broken, or the metrics you're collecting are not the same as the ones the 
>>>> dashboard is expecting, or the dashboard has some hard-coded assumptions 
>>>> that don't match your environment (e.g. the queries are hard-coded to 
>>>> expect particular labels, such as job name)
>>>>
>>>> You can simply open the panels in Grafana, copy the queries into the 
>>>> Prometheus web interface, and try them there. If they give no results, 
>>>> then 
>>>> you need to drill down why (are the metrics missing, or the label matchers 
>>>> wrong, or something else about the query?)
>>>>
>>>> On Tuesday 10 September 2024 at 19:41:22 UTC+1 Brian Candler wrote:
>>>>
>>>>> Which dashboard? Did you write it yourself, or find one published? 
>>>>> There aren't many snmp_exporter dashboards on the grafana hub, but I see 
>>>>> several that claim to be for Synology.
>>>>>
>>>>> I made a couple for simple ones for generic if_mib (interface stats):
>>>>> https://grafana.com/grafana/dashboards/12492-snmp-interface-detail/
>>>>> https://grafana.com/grafana/dashboards/12489-snmp-device-summary/
>>>>>
>>>>> On Tuesday 10 September 2024 at 17:59:24 UTC+1 Mitchell Laframboise 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>I have installed snmp_exporter and enabled snmp on my NAS.  The 
>>>>>> job is running and up in Prometheus and im getting metrics returned  but 
>>>>>> Im 
>>>>>> unable to get data on my Synology Details Dashboard in Grafana??  
>>>>>>
>>>>>> Can someone please help
>>>>>>
>>>>>>
>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/90b1ee6b-35f7-49a5-9f62-dbfc05aafe48n%40googlegroups.com.


[prometheus-users] Re: Synology SNMP

2024-09-10 Thread 'Brian Candler' via Prometheus Users
Looking at the source of that dashboard, all the queries are filtered 
against
{job=~'$JobName',instance=~'$Device'

and the way the JobName values are chosen is from this query:

"name": "JobName",
"options": [],
"query": {
  "query": "label_values(ssCpuUser, job)",
  "refId": "StandardVariableQuery"
},

I have no idea what the "ssCpuUser" metric is, but if you're not collecting 
that metric, this dashboard won't show you anything.
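
A quick way to check is to paste something like this into the Prometheus 
expression browser (ssCpuUser being the metric the dashboard expects):

count by (job, instance) (ssCpuUser)

If that returns nothing, the dashboard has nothing to key its JobName 
variable off.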

It looks like it's a vendor-specific MIB from the net-snmp (ucdavis) MIB 
tree:
https://kb.synology.com/en-af/DG/Synology_DiskStation_MIB_Guide/4

On Tuesday 10 September 2024 at 19:49:15 UTC+1 Mitchell Laframboise wrote:

> I am using a published dashboard 
> https://grafana.com/grafana/dashboards/14284-synology-nas-details/
>
> I will try your suggestion about opening the panels and running the 
> queries.
>
> Thanks for the lead
>
> On Tuesday, September 10, 2024 at 2:43:23 PM UTC-4 Brian Candler wrote:
>
>> If you're "getting metrics returned" then either the dashboard is broken, 
>> or the metrics you're collecting are not the same as the ones the dashboard 
>> is expecting, or the dashboard has some hard-coded assumptions that don't 
>> match your environment (e.g. the queries are hard-coded to expect 
>> particular labels, such as job name)
>>
>> You can simply open the panels in Grafana, copy the queries into the 
>> Prometheus web interface, and try them there. If they give no results, then 
>> you need to drill down why (are the metrics missing, or the label matchers 
>> wrong, or something else about the query?)
>>
>> On Tuesday 10 September 2024 at 19:41:22 UTC+1 Brian Candler wrote:
>>
>>> Which dashboard? Did you write it yourself, or find one published? There 
>>> aren't many snmp_exporter dashboards on the grafana hub, but I see several 
>>> that claim to be for Synology.
>>>
>>> I made a couple for simple ones for generic if_mib (interface stats):
>>> https://grafana.com/grafana/dashboards/12492-snmp-interface-detail/
>>> https://grafana.com/grafana/dashboards/12489-snmp-device-summary/
>>>
>>> On Tuesday 10 September 2024 at 17:59:24 UTC+1 Mitchell Laframboise 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>I have installed snmp_exporter and enabled snmp on my NAS.  The job 
>>>> is running and up in Prometheus and im getting metrics returned  but Im 
>>>> unable to get data on my Synology Details Dashboard in Grafana??  
>>>>
>>>> Can someone please help
>>>>
>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6e70b91f-ccbb-4f54-b069-40f5a1df1028n%40googlegroups.com.


[prometheus-users] Re: Synology SNMP

2024-09-10 Thread 'Brian Candler' via Prometheus Users
If you're "getting metrics returned" then either the dashboard is broken, 
or the metrics you're collecting are not the same as the ones the dashboard 
is expecting, or the dashboard has some hard-coded assumptions that don't 
match your environment (e.g. the queries are hard-coded to expect 
particular labels, such as job name)

You can simply open the panels in Grafana, copy the queries into the 
Prometheus web interface, and try them there. If they give no results, then 
you need to drill down why (are the metrics missing, or the label matchers 
wrong, or something else about the query?)

On Tuesday 10 September 2024 at 19:41:22 UTC+1 Brian Candler wrote:

> Which dashboard? Did you write it yourself, or find one published? There 
> aren't many snmp_exporter dashboards on the grafana hub, but I see several 
> that claim to be for Synology.
>
> I made a couple for simple ones for generic if_mib (interface stats):
> https://grafana.com/grafana/dashboards/12492-snmp-interface-detail/
> https://grafana.com/grafana/dashboards/12489-snmp-device-summary/
>
> On Tuesday 10 September 2024 at 17:59:24 UTC+1 Mitchell Laframboise wrote:
>
>> Hi,
>>
>>I have installed snmp_exporter and enabled snmp on my NAS.  The job is 
>> running and up in Prometheus and im getting metrics returned  but Im unable 
>> to get data on my Synology Details Dashboard in Grafana??  
>>
>> Can someone please help
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/504826e8-4e92-4a2a-930b-e17a8b5a6231n%40googlegroups.com.


[prometheus-users] Re: Synology SNMP

2024-09-10 Thread 'Brian Candler' via Prometheus Users
Which dashboard? Did you write it yourself, or find one published? There 
aren't many snmp_exporter dashboards on the grafana hub, but I see several 
that claim to be for Synology.

I made a couple of simple ones for generic if_mib (interface stats):
https://grafana.com/grafana/dashboards/12492-snmp-interface-detail/
https://grafana.com/grafana/dashboards/12489-snmp-device-summary/

On Tuesday 10 September 2024 at 17:59:24 UTC+1 Mitchell Laframboise wrote:

> Hi,
>
>I have installed snmp_exporter and enabled snmp on my NAS.  The job is 
> running and up in Prometheus and im getting metrics returned  but Im unable 
> to get data on my Synology Details Dashboard in Grafana??  
>
> Can someone please help
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/71375cc5-f6ea-42bd-8eae-2c2dbe089d34n%40googlegroups.com.


[prometheus-users] Re: ALERTMANAGER NOT RUNNING

2024-09-10 Thread 'Brian Candler' via Prometheus Users
alertmanager listens on two ports. By default:
--web.listen-address=:9093
--cluster.listen-address=0.0.0.0:9094
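
If you need to move both (for example to run a second copy on the same 
host), something like this should work - the ports here are just examples:

alertmanager --config.file=alertmanager.yml \
  --web.listen-address=:9095 \
  --cluster.listen-address=0.0.0.0:9096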

On Tuesday 10 September 2024 at 15:25:31 UTC+1 Chinelo Ufondu wrote:

> Hello 
> i have tried again by running this command like you suggested and 
> specifying a port that the clusters should listen on  *alertmanager 
> --config.file=alertmanager.yml --cluster.listen-address=0.0.0.0:8081 
> *, and i got a different error saying port 9093 is 
> already in use, and port 9093 is the default port alertmanager is currently 
> listening on
>
> ts=2024-09-10T13:10:36.310Z caller=main.go:181 level=info msg="Starting 
> Alertmanager" version="(version=0.27.0, branch=HEAD, revision=
> 0aa3c2aad14cff039931923ab16b26b7481783b5)"
> ts=2024-09-10T13:10:36.310Z caller=main.go:182 level=info 
> build_context="(go=go1.21.7, platform=linux/amd64, user=root@22cd11f671e9, 
> date=20240228-11:51:20, tags=netgo)"
> ts=2024-09-10T13:10:36.325Z caller=cluster.go:186 level=info 
> component=cluster msg="setting advertise address explicitly" 
> addr=192.168.101.2 port=8081
> ts=2024-09-10T13:10:36.326Z caller=cluster.go:683 level=info 
> component=cluster msg="Waiting for gossip to settle..." interval=2s
> ts=2024-09-10T13:10:36.359Z caller=coordinator.go:113 level=info 
> component=configuration msg="Loading configuration file" 
> file=alertmanager.yml
> ts=2024-09-10T13:10:36.359Z caller=coordinator.go:126 level=info 
> component=configuration msg="Completed loading of configuration file" 
> file=alertmanager.yml
> ts=2024-09-10T13:10:36.360Z caller=main.go:394 level=info 
> component=configuration msg="skipping creation of receiver not referenced 
> by any route" receiver=send_email2
> ts=2024-09-10T13:10:36.363Z caller=main.go:517 level=error msg="Listen 
> error" err="listen tcp :9093: bind: address already in use"
> ts=2024-09-10T13:10:36.365Z caller=cluster.go:692 level=info 
> component=cluster msg="gossip not settled but continuing anyway" polls=0 
> elapsed=39.047022ms
>
>
> On Sun, 8 Sep 2024 at 16:11, Chinelo Ufondu  wrote:
>
>> I tried running alertmanager again and i came across this issue, here is 
>> the error
>>
>> ts=2024-09-01T17:35:52.421Z caller=main.go:181 level=info msg="Starting 
>> Alertmanager" version="(version=0.27.0, branch=HEAD, revision=
>> 0aa3c2aad14cff039931923ab16b26b7481783b5)"
>> ts=2024-09-01T17:35:52.421Z caller=main.go:182 level=info 
>> build_context="(go=go1.21.7, platform=linux/amd64, user=root@22cd11f671e9, 
>> date=20240228-11:51:20, tags=netgo)"
>> ts=2024-09-01T17:35:52.440Z caller=cluster.go:186 level=info 
>> component=cluster msg="setting advertise address explicitly" 
>> addr=192.168.101.2 port=9094
>> ts=2024-09-01T17:35:52.441Z caller=main.go:221 level=error msg="unable to 
>> initialize gossip mesh" err="create memberlist: Could not set up network 
>> transport: failed to obtain an address: Failed to start TCP listener on 
>> \"0.0.0.0\" port 9094: listen tcp 0.0.0.0:9094: bind: address already in 
>> use"
>>
>> I have tried all i can to stop the processes that is currently running  
>> on alert manager, but it didn't work out, i also tried adding an external 
>> command to run *alertmanager --web.listen-address=localhost:9095 
>> --config.file=alertmanager.yml, *but it still isn't picking the new port 
>> number i would appreciate further assistance from you guys please, Thank 
>> you.
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/b8ff032c-990d-4870-9bec-2d3abecf843fn%40googlegroups.com.


Re: [prometheus-users] SNMP EXPORTER GENERATOR ERRORS

2024-09-09 Thread 'Brian Candler' via Prometheus Users
I strongly advise you to use node_exporter rather than snmpd for collecting 
metrics from Linux hosts, unless there's something you can't get any other 
way (keepalived VRRP might be one example).

By "snmpd.conf ... access control" you might be asking how to create SNMPv2 
communities and SNMPv3 users, or you might mean configuring view-based ACLs 
to limit what parts of the MIB tree they can see.
For the first, there are simple examples at 
https://nsrc.org/activities/agendas/en/nmm-4-days/netmgmt/en/snmp/exercises-snmp.html
 
- scroll down to "Configuration of snmpd (server/agent)".  For view-based 
ACLs, if you really need that, you'll need to read the snmpd.conf manpage 
carefully.
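
As a rough sketch (community name, user name and source address are 
placeholders), the access-control part of snmpd.conf can be as simple as:

# SNMPv2c read-only community, limited to the monitoring host
rocommunity mycommunity 192.0.2.10

# SNMPv3 read-only user (create it with net-snmp-create-v3-user while snmpd is stopped)
rouser monuser priv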

But node_exporter FTW.

On Monday 9 September 2024 at 15:42:20 UTC+1 Mitchell Laframboise wrote:

> Thank you.  I ended up just downloading the mibs from mibbrowser.online, 
> they had them all.  But then I had a problem with SNMPv2-PDU that I 
> couldn't figure out so I just replaced it with a good copy.  
>
> Do you know how to edit the snmpd.conf file?  The access control part is 
> kind of confusing
>
>
>
>
>
> On Monday, September 9, 2024 at 10:35:52 AM UTC-4 Brian Candler wrote:
>
>> >  msg="Loading MIBs" 
>> from=$HOME/.snmp/mibs:/usr/share/snmp/mibs:/usr/share/snmp/mibs/iana:/usr/share/snmp/mibs/ietf
>>
>> If you do that, then it's your responsibility to download the mibs you 
>> need. "apt-get snmp-mibs-downloader" on Ubuntu/Debian will get a bunch, but 
>> I don't know if you'll have all the ones you need.
>>
>> My preferred option is to type "make mibs" then "make".  The first will 
>> fetch all the mibs required by the sample generator.yml, into a 
>> subdirectory called "mibs".  The second will build snmp.yml from 
>> generator.yml using those mibs.
>>
>> Once you have this working, you can then replace generator.yml with your 
>> own generator.yml, and run "make" again, if necessary dropping any 
>> additional mibs you need into the "mibs" subdirectory.
>>
>> On Monday 9 September 2024 at 13:37:29 UTC+1 Mitchell Laframboise wrote:
>>
>>> Yes I did read it, I just never had to do this last time I set this up.  
>>>
>>> On Monday, September 9, 2024 at 8:33:48 AM UTC-4 Ben Kochie wrote:
>>>
>>>> Did you read the error messages?
>>>>
>>>> Your MIBDIRS are missing a number of MIBs in order to satisfy all the 
>>>> requirements.
>>>>
>>>> Either find the missing MIBs, or set the `MIBDIRS` env var to point at 
>>>> the generator example "mibs" dir that is created with `make mibs`.
>>>>
>>>> On Mon, Sep 9, 2024 at 1:50 PM Mitchell Laframboise <
>>>> mlafra...@razzberrys.ca> wrote:
>>>>
>>>>> I am trying to generate an snmp.yml and have removed all sections from 
>>>>> the generator.yml except for if-mib and I'm getting all these parse 
>>>>> errors 
>>>>> and I don't know why.
>>>>>
>>>>> ./generator parse_errors
>>>>> ts=2024-09-09T11:22:12.963Z caller=net_snmp.go:175 level=info 
>>>>> msg="Loading MIBs" 
>>>>> from=$HOME/.snmp/mibs:/usr/share/snmp/mibs:/usr/share/snmp/mibs/iana:/usr/share/snmp/mibs/ietf
>>>>> ts=2024-09-09T11:22:13.109Z caller=main.go:177 level=warn msg="NetSNMP 
>>>>> reported parse error(s)" errors=36
>>>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error 
>>>>> msg="Missing MIB" mib=IANA-STORAGE-MEDIA-TYPE-MIB from="At line 19 in 
>>>>> /usr/share/snmp/mibs/ietf/VM-MIB"
>>>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error 
>>>>> msg="Missing MIB" mib=IEEE8021-CFM-MIB from="At line 30 in 
>>>>> /usr/share/snmp/mibs/ietf/TRILL-OAM-MIB"
>>>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error 
>>>>> msg="Missing MIB" mib=LLDP-MIB from="At line 35 in 
>>>>> /usr/share/snmp/mibs/ietf/TRILL-OAM-MIB"
>>>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error 
>>>>> msg="Missing MIB" mib=IANA-SMF-MIB from="At line 28 in 
>>>>> /usr/share/snmp/mibs/ietf/SMF-MIB"
>>>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error 
>>>>> msg="Missing MIB" mib=IANA-ENTITY-MIB from="At line 18 in 
>>>>> /usr/share/snmp

Re: [prometheus-users] SNMP EXPORTER GENERATOR ERRORS

2024-09-09 Thread 'Brian Candler' via Prometheus Users
>  msg="Loading MIBs" 
from=$HOME/.snmp/mibs:/usr/share/snmp/mibs:/usr/share/snmp/mibs/iana:/usr/share/snmp/mibs/ietf

If you do that, then it's your responsibility to download the mibs you 
need. "apt-get snmp-mibs-downloader" on Ubuntu/Debian will get a bunch, but 
I don't know if you'll have all the ones you need.

My preferred option is to type "make mibs" then "make".  The first will 
fetch all the mibs required by the sample generator.yml, into a 
subdirectory called "mibs".  The second will build snmp.yml from 
generator.yml using those mibs.

Once you have this working, you can then replace generator.yml with your 
own generator.yml, and run "make" again, if necessary dropping any 
additional mibs you need into the "mibs" subdirectory.
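
In practice that workflow is just (run from the generator directory of an 
snmp_exporter checkout):

make mibs   # fetch the MIBs referenced by the sample generator.yml into ./mibs
make        # build snmp.yml from generator.yml using those MIBs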

On Monday 9 September 2024 at 13:37:29 UTC+1 Mitchell Laframboise wrote:

> Yes I did read it, I just never had to do this last time I set this up.  
>
> On Monday, September 9, 2024 at 8:33:48 AM UTC-4 Ben Kochie wrote:
>
>> Did you read the error messages?
>>
>> Your MIBDIRS are missing a number of MIBs in order to satisfy all the 
>> requirements.
>>
>> Either find the missing MIBs, or set the `MIBDIRS` env var to point at 
>> the generator example "mibs" dir that is created with `make mibs`.
>>
>> On Mon, Sep 9, 2024 at 1:50 PM Mitchell Laframboise <
>> mlafra...@razzberrys.ca> wrote:
>>
>>> I am trying to generate an snmp.yml and have removed all sections from 
>>> the generator.yml except for if-mib and I'm getting all these parse errors 
>>> and I don't know why.
>>>
>>> ./generator parse_errors
>>> ts=2024-09-09T11:22:12.963Z caller=net_snmp.go:175 level=info 
>>> msg="Loading MIBs" 
>>> from=$HOME/.snmp/mibs:/usr/share/snmp/mibs:/usr/share/snmp/mibs/iana:/usr/share/snmp/mibs/ietf
>>> ts=2024-09-09T11:22:13.109Z caller=main.go:177 level=warn msg="NetSNMP 
>>> reported parse error(s)" errors=36
>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error msg="Missing 
>>> MIB" mib=IANA-STORAGE-MEDIA-TYPE-MIB from="At line 19 in 
>>> /usr/share/snmp/mibs/ietf/VM-MIB"
>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error msg="Missing 
>>> MIB" mib=IEEE8021-CFM-MIB from="At line 30 in 
>>> /usr/share/snmp/mibs/ietf/TRILL-OAM-MIB"
>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error msg="Missing 
>>> MIB" mib=LLDP-MIB from="At line 35 in 
>>> /usr/share/snmp/mibs/ietf/TRILL-OAM-MIB"
>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error msg="Missing 
>>> MIB" mib=IANA-SMF-MIB from="At line 28 in /usr/share/snmp/mibs/ietf/SMF-MIB"
>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error msg="Missing 
>>> MIB" mib=IANA-ENTITY-MIB from="At line 18 in 
>>> /usr/share/snmp/mibs/ietf/ENTITY-MIB"
>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error msg="Missing 
>>> MIB" mib=IANAPowerStateSet-MIB from="At line 20 in 
>>> /usr/share/snmp/mibs/ietf/ENERGY-OBJECT-MIB"
>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error msg="Missing 
>>> MIB" mib=IANA-OLSRv2-LINK-METRIC-TYPE-MIB from="At line 26 in 
>>> /usr/share/snmp/mibs/ietf/OLSRv2-MIB"
>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error msg="Missing 
>>> MIB" mib=IANA-ENERGY-RELATION-MIB from="At line 22 in 
>>> /usr/share/snmp/mibs/ietf/ENERGY-OBJECT-CONTEXT-MIB"
>>> ts=2024-09-09T11:22:13.109Z caller=main.go:183 level=error msg="Missing 
>>> MIB" mib=IANA-BFD-TC-STD-MIB from="At line 30 in 
>>> /usr/share/snmp/mibs/ietf/BFD-STD-MIB"
>>> ts=2024-09-09T11:22:13.232Z caller=tree.go:83 level=warn msg="Can't find 
>>> augmenting node" augments=dot1agCfmMepEntry node=trillOamMepEntry
>>> ts=2024-09-09T11:22:13.232Z caller=tree.go:83 level=warn msg="Can't find 
>>> augmenting node" augments=dot1agCfmMepDbEntry node=trillOamMepDbEntry
>>> MIB search path: 
>>> /home/mitchell/.snmp/mibs:/usr/share/snmp/mibs:/usr/share/snmp/mibs/iana:/usr/share/snmp/mibs/ietf
>>> Cannot find module (IANA-STORAGE-MEDIA-TYPE-MIB): At line 19 in 
>>> /usr/share/snmp/mibs/ietf/VM-MIB
>>> Did not find 'IANAStorageMediaType' in module #-1 
>>> (/usr/share/snmp/mibs/ietf/VM-MIB)
>>> Cannot find module (IEEE8021-CFM-MIB): At line 30 in 
>>> /usr/share/snmp/mibs/ietf/TRILL-OAM-MIB
>>> Cannot find module (LLDP-MIB): At line 35 in 
>>> /usr/share/snmp/mibs/ietf/TRILL-OAM-MIB
>>> Did not find 'dot1agCfmMdIndex' in module #-1 
>>> (/usr/share/snmp/mibs/ietf/TRILL-OAM-MIB)
>>> Did not find 'dot1agCfmMaIndex' in module #-1 
>>> (/usr/share/snmp/mibs/ietf/TRILL-OAM-MIB)
>>> Did not find 'dot1agCfmMepIdentifier' in module #-1 
>>> (/usr/share/snmp/mibs/ietf/TRILL-OAM-MIB)
>>> Did not find 'dot1agCfmMepEntry' in module #-1 
>>> (/usr/share/snmp/mibs/ietf/TRILL-OAM-MIB)
>>> Did not find 'dot1agCfmMepDbEntry' in module #-1 
>>> (/usr/share/snmp/mibs/ietf/TRILL-OAM-MIB)
>>> Did not find 'Dot1agCfmIngressActionFieldValue' in module #-1 
>>> (/usr/share/snmp/mibs/ietf/TRILL-OAM-MIB)
>>> Did not find 'Dot1agCfmEgressActionFieldValue' in module #-1 
>>> (/usr/share/snmp/mibs/ietf/TRILL-OAM-MIB)
>>> Did

[prometheus-users] Re: PromQL redirection

2024-09-06 Thread 'Brian Candler' via Prometheus Users
Have a look at https://github.com/jacksontj/promxy

But I don't think it's yet clever enough to avoid querying servers that 
couldn't possibly match the query.  (Presumably it could only do that if 
your PromQL was specific enough with its labels)
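
For comparison, option 1 from the quoted message below is just a list of 
remote_read entries in prometheus.yml, something like (backend URLs are 
placeholders):

remote_read:
  - url: http://backend-a:9090/api/v1/read
    read_recent: true
  - url: http://backend-b:9090/api/v1/read
    read_recent: true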

On Friday 6 September 2024 at 16:28:10 UTC+1 Samit Jain wrote:

>
> We've multiple segregated data systems which support PromQL, each storing 
> metrics for different class of applications, infra, etc.
>
> We would like to explore the possibility of abstracting promql over these 
> systems, such that user can run a query without knowing about the different 
> backends. The options we considered below use something of a brute force 
> approach and won't scale:
>
>1. support remote read API in all backends and configure Prometheus to 
>remote read from all of them.
>2. send PromQL query to all backends and merge the results.
>
> I think a system where there is an external 'router' component which knows 
> where different time series' are stored (some sort of an index table) and 
> uses it to query the right backend would be worth exploring. We can presume 
> for now that the time series are unique across all backends. Do you know of 
> something like this exists in some form, or some literature on this that we 
> could build upon?
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/7e65f7de-cab5-48b5-a4f6-ab810ea5823bn%40googlegroups.com.


[prometheus-users] Re: feed large csv file into the Prometheus

2024-09-05 Thread 'Brian Candler' via Prometheus Users
> Hi Brian, want to import 10GB csv file into the Prometheus, after that 
try to run different queries to find out how it performs with data with 
high cardinality.

In prometheus, the timeseries data consists of float values and there's no 
"cardinality" as such. But each timeseries is determined by its unique set 
of labels, and if those labels have high cardinality, it will perform very 
poorly (due to an explosion in the number of distinct timeseries).

> Now which option more suitable? And faster?

More suitable for ingestion into Prometheus? Backfilling via OpenMetrics 
format is the only approach.
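
Roughly (file names are placeholders): convert the CSV into OpenMetrics 
text - one "metric{labels} value timestamp" line per sample, with the 
timestamp in seconds, terminated by "# EOF" - and then:

promtool tsdb create-blocks-from openmetrics data.om ./output

The generated blocks can then be moved into the Prometheus data directory.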

More suitable for your application? I don't know what that application is, 
or anything about the data you're importing, so I can't really say.

If you have high cardinality and/or non-numeric data then you might want to 
look at logging systems (e.g. Loki, VictoriaLogs), document databases (e.g. 
OpenSearch/ElasticSearch, MongoDB), columnar databases (e.g. Clickhouse, 
Druid) or various other "analytics/big data" platforms.
 
On Thursday 5 September 2024 at 16:49:19 UTC+1 Mehrdad wrote:

> Hi Brian, want to import 10GB csv file into the Prometheus, after that try 
> to run different queries to find out how it performs with data with high 
> cardinality.
> This process need to run once, and data belong to last 24 hours of another 
> monitoring tool.
> Now which option more suitable? And faster?
>
> Thanks
> On Thursday, September 5, 2024 at 6:27:54 PM UTC+3:30 Brian Candler wrote:
>
>> Prometheus is very specific to timeseries data, and normally new data is 
>> ingested as of the current time.
>>
>> If you have previous timeseries data that you need to import as a 
>> one-time activity, then there is "backfilling", see
>>
>> https://prometheus.io/docs/prometheus/latest/storage/#backfilling-from-openmetrics-format
>> This is not something you would want to do on a regular basis though.
>>
>> If the reason for CSV import is you are trying to gather data from remote 
>> sites which don't have continuous connectivity, then another option is to 
>> run prometheus in those sites in "agent" mode, and have it upload data to 
>> another prometheus server using "remote write".
>>
>> On Thursday 5 September 2024 at 14:48:13 UTC+1 Mehrdad wrote:
>>
>>> Hi 
>>> how can i feed large csv file into the Prometheus?
>>> Thanks
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/23a4b49f-7036-467b-b634-f708b81ddb59n%40googlegroups.com.


[prometheus-users] Re: feed large csv file into the Prometheus

2024-09-05 Thread 'Brian Candler' via Prometheus Users
Prometheus is very specific to timeseries data, and normally new data is 
ingested as of the current time.

If you have previous timeseries data that you need to import as a one-time 
activity, then there is "backfilling", see
https://prometheus.io/docs/prometheus/latest/storage/#backfilling-from-openmetrics-format
This is not something you would want to do on a regular basis though.

If the reason for CSV import is you are trying to gather data from remote 
sites which don't have continuous connectivity, then another option is to 
run prometheus in those sites in "agent" mode, and have it upload data to 
another prometheus server using "remote write".
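
A sketch of the agent option (hostnames are placeholders; the central 
server needs to be started with --web.enable-remote-write-receiver):

# at the remote site
prometheus --enable-feature=agent --config.file=prometheus.yml

# prometheus.yml at that site
remote_write:
  - url: http://central-prometheus:9090/api/v1/write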

On Thursday 5 September 2024 at 14:48:13 UTC+1 Mehrdad wrote:

> Hi 
> how can i feed large csv file into the Prometheus?
> Thanks
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d8b45095-2359-4520-ada8-67f8b42c3ac7n%40googlegroups.com.


[prometheus-users] Re: max_over_time not working as expected - want to get the 3 most recent values higher than a specific threshold

2024-09-03 Thread 'Brian Candler' via Prometheus Users
Oops,

topk(3, max_over_time(foo[31d] @ 1725145200) )

On Tuesday 3 September 2024 at 09:41:23 UTC+1 Brian Candler wrote:

> > If I run "max_over_time{}[1m:15s]" it will show me the peak of every 1m 
> evaluating every 15s sample. That's ok.
>
> That expression is almost certainly wrong; it is querying a metric called 
> "max_over_time" (which probably doesn't exist), rather than calling the 
> function max_over_time(...) on an expression.
>
> Again, why are you not using "max_over_time(foo[1m])" ??  The subquery 
> foo[1m:15s] is just causing more work.
>
> > second step would be to get the top3 max_over_time values of the 
> selected time_range. If I run "max_over_time{}[1m:15s]" for the last 3 days 
> I will get 3d x 24h x 60m "max_over_time" values.
>
> It sounds like you are trying to do stuff with Grafana, and I can't help 
> you with that. If you have an issue with Grafana, please take it to the 
> Grafana discussion community; this mailing list is for Prometheus only.
>
> > May goal was and that is why I used "topk(3,)" (wrong) to get the top3 
> values of the last 24hrs evaluated by "max_over_time{}[1m:15s]". But 
> topk(3,) only show me the top3 values at the same timestamp
>
> If you want the maximum values of each timeseries over the last 24 hours, 
> then you want max_over_time(foo[24h]) - try it in the PromQL web interface.
>
> This expression will return an instant vector. The values are timestamped 
> with the time at which the query was evaluated for - and that's the end 
> time of the 24 hour window.  If you don't select an evaluation time, then 
> the time is "now" and the window is "now - 24h to now"
>
> However, the *values* returned will be the maximum value for each 
> timeseries over that 24 hour period.  And topk(3, max_over_time(foo[24h]) 
> will then give you the three timeseries which have the highest value of 
> that maximum.
>
> > third step would be to get the time of these top3 values calculated 
> earlier.
>
> As far as I know, prometheus can't do that for you.  You'd have to use the 
> plain range vector query "foo[24h]" - pass it to the Prometheus HTTP API as 
> an instant query 
> <https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries>. 
> This will return all the data points in the 24 hour period, each with its 
> own raw timestamp. Then write your own code to identify the maximum 
> value(s) you are interested in, and pick the associated timestamps.
>
> It would be an interesting feature request for max_over_time(...) to 
> return values timestamped with the time they actually occurred at, but it 
> would make max_over_time work differently to other range vector aggregation 
> functions 
> <https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time>.
>   
> And there are some edge cases to be ironed out, e.g. what happens if the 
> same maximum value occurs multiple times.
>
> > the fourth step would be to get the top3 vaues of the month august and 
> the top3 values of the month july.
>
> You can evaluate PromQL expressions at a given instant. There are two ways:
> - call the instant query API 
> <https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries> 
> and pass the "time" parameter
> - on more recent versions of Prometheus, send a PromQL query with the @ 
> modifier 
> <https://prometheus.io/docs/prometheus/latest/querying/basics/#modifier>.
>
> For example, for the maxima in August 2024, it would be something like 
> (untested):
>
> topk(3, max_over_time(foo[31d]) @ 1725145200) 
>
>
> On Monday 2 September 2024 at 20:52:33 UTC+1 Alexander Wilke wrote:
>
>> Hello Brian,
>>
>> thanks for clarification. I investigated the issue further and found that 
>> Grafana Dashboards is manipulationg the data and for that reason 
>> "max_over_time" for the last 1h showed the correct peak and max_over_time 
>> for the last 24hrs did not show that peak because of a too low set of Data 
>> points. I increased it to 11.000 and then I was able to see the peak value 
>> again as expected. As "min Step" I set 15s in Grafana.
>>
>>
>> However this only solved the first part of my main problem. Now I can 
>> reliably query the peaks of a time range and do not miss the peak.
>> If I run "max_over_time{}[1m:15s]" it will show me the peak of every 1m 
>> evaluating every 15s sample. That's ok.
>>
>> second step would be to get the top3 max_over_time values of the selected 
>> time

[prometheus-users] Re: max_over_time not working as expected - want to get the 3 most recent values higher than a specific threshold

2024-09-03 Thread 'Brian Candler' via Prometheus Users
> If I run "max_over_time{}[1m:15s]" it will show me the peak of every 1m 
evaluating every 15s sample. That's ok.

That expression is almost certainly wrong; it is querying a metric called 
"max_over_time" (which probably doesn't exist), rather than calling the 
function max_over_time(...) on an expression.

Again, why are you not using "max_over_time(foo[1m])" ??  The subquery 
foo[1m:15s] is just causing more work.

> second step would be to get the top3 max_over_time values of the selected 
time_range. If I run "max_over_time{}[1m:15s]" for the last 3 days I will 
get 3d x 24h x 60m "max_over_time" values.

It sounds like you are trying to do stuff with Grafana, and I can't help 
you with that. If you have an issue with Grafana, please take it to the 
Grafana discussion community; this mailing list is for Prometheus only.

> May goal was and that is why I used "topk(3,)" (wrong) to get the top3 
values of the last 24hrs evaluated by "max_over_time{}[1m:15s]". But 
topk(3,) only show me the top3 values at the same timestamp

If you want the maximum values of each timeseries over the last 24 hours, 
then you want max_over_time(foo[24h]) - try it in the PromQL web interface.

This expression will return an instant vector. The values are timestamped 
with the time at which the query was evaluated for - and that's the end 
time of the 24 hour window.  If you don't select an evaluation time, then 
the time is "now" and the window is "now - 24h to now"

However, the *values* returned will be the maximum value for each 
timeseries over that 24 hour period.  And topk(3, max_over_time(foo[24h])) 
will then give you the three timeseries which have the highest value of 
that maximum.

> third step would be to get the time of these top3 values calculated 
earlier.

As far as I know, prometheus can't do that for you.  You'd have to use the 
plain range vector query "foo[24h]" - pass it to the Prometheus HTTP API as 
an instant query 
<https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries>. 
This will return all the data points in the 24 hour period, each with its 
own raw timestamp. Then write your own code to identify the maximum 
value(s) you are interested in, and pick the associated timestamps.
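
For example, something like this returns every raw sample in the window 
together with its timestamp ("foo" is a placeholder metric name):

curl -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=foo[24h]'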

It would be an interesting feature request for max_over_time(...) to return 
values timestamped with the time they actually occurred at, but it would 
make max_over_time work differently to other range vector aggregation 
functions 
<https://prometheus.io/docs/prometheus/latest/querying/functions/#aggregation_over_time>.
  
And there are some edge cases to be ironed out, e.g. what happens if the 
same maximum value occurs multiple times.

> the fourth step would be to get the top3 vaues of the month august and 
the top3 values of the month july.

You can evaluate PromQL expressions at a given instant. There are two ways:
- call the instant query API 
<https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries> 
and pass the "time" parameter
- on more recent versions of Prometheus, send a PromQL query with the @ 
modifier 
<https://prometheus.io/docs/prometheus/latest/querying/basics/#modifier>.

For example, for the maxima in August 2024, it would be something like 
(untested):

topk(3, max_over_time(foo[31d]) @ 1725145200) 
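
The same thing via the HTTP API would pass the evaluation time explicitly 
(the time parameter plays the same role as the @ modifier):

curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(3, max_over_time(foo[31d]))' \
  --data-urlencode 'time=1725145200'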


On Monday 2 September 2024 at 20:52:33 UTC+1 Alexander Wilke wrote:

> Hello Brian,
>
> thanks for clarification. I investigated the issue further and found that 
> Grafana Dashboards is manipulationg the data and for that reason 
> "max_over_time" for the last 1h showed the correct peak and max_over_time 
> for the last 24hrs did not show that peak because of a too low set of Data 
> points. I increased it to 11.000 and then I was able to see the peak value 
> again as expected. As "min Step" I set 15s in Grafana.
>
>
> However this only solved the first part of my main problem. Now I can 
> reliably query the peaks of a time range and do not miss the peak.
> If I run "max_over_time{}[1m:15s]" it will show me the peak of every 1m 
> evaluating every 15s sample. That's ok.
>
> second step would be to get the top3 max_over_time values of the selected 
> time_range. If I run "max_over_time{}[1m:15s]" for the last 3 days I will 
> get 3d x 24h x 60m "max_over_time" values. May goal was and that is why I 
> used "topk(3,)" (wrong) to get the top3 values of the last 24hrs evaluated 
> by "max_over_time{}[1m:15s]". But topk(3,) only show me the top3 values at 
> the same timestamp
>
> third step would be to get the time of these top3 values calculated 
> earlier.
>
> the fourth step would be to get the top3 vaues of the month aug

[prometheus-users] Re: Drop unsed metrics

2024-09-03 Thread 'Brian Candler' via Prometheus Users
That looks OK to me. I think it should drop all metrics with name 
"container_processes".

Can you show a wider context of the entire scrape job config? 
metric_relabel_configs is configured under a specific scrape job, and only 
applies to metrics collected by that scrape job.
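
For reference, a minimal sketch of where it needs to sit (job name and 
target are placeholders):

scrape_configs:
  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: container_processes
        action: drop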

There are simple things to check, like did you send prometheus a HUP or 
reload signal after changing the config? And if you check prometheus output 
(typically "journalctl -eu prometheus") did it say the reload was 
successful, with no syntax errors in the new config?
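
The reload itself is either a HUP signal or, if Prometheus was started with 
--web.enable-lifecycle, an HTTP POST:

kill -HUP $(pidof prometheus)
# or
curl -X POST http://localhost:9090/-/reload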

Are you deploying prometheus via some higher-level wrapper, like a helm 
chart? If so, the issue is likely somewhere around that.

On Tuesday 3 September 2024 at 07:11:12 UTC+1 anwer shahith wrote:

> Hi Team, 
> I have find unsed metrics from my promethues i want to drop them 
> I tried various relabel_configs configs but looks like non of them are 
> working
>
> eg: 
>
>
> *metric_relabel_configs*:
>
> - *source_labels*: [__name__]
>
>   *regex*: container_processes
>
>   *action*: drop
>
>
> please help me to fix this
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/38382070-7317-4374-beba-ffd9396bdfa9n%40googlegroups.com.


[prometheus-users] Re: promehtus.yml autogenerated

2024-09-01 Thread 'Brian Candler' via Prometheus Users
That is not part of prometheus.

Where did /opt/prom-registry/scripts/update.php come from? That's what's 
doing it.  But I can't find any "prom-registry" PHP code on github or via a 
google search. Perhaps you obtained Prometheus as part of a third-party 
application or bundle?

On Sunday 1 September 2024 at 21:46:39 UTC+1 Vincent Romero wrote:

> The prometheus.yml file is rewritten when editing any changes
>
> At the beginning of the file you will find the following lines
>
> # THE FILE prometheus.yml IS GENERATED AUTOMATICALLY
> # To make changes do this:
> # 1) Edit the prometheus.yml.template file
> # 2) Run
> # sudo php -f /opt/prom-registry/scripts/update.php force
>
>
>
> How i can disable this? 
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/412d6f61-d153-4a1f-95ca-80c8fdf8426dn%40googlegroups.com.


Re: [prometheus-users] I am having an error while checking the alert manager status

2024-09-01 Thread 'Brian Candler' via Prometheus Users
Show your config file?

On Sunday 1 September 2024 at 18:45:19 UTC+1 Chinelo Ufondu wrote:

> I have been able to resolve the issue, the problem was my config file, it 
> wasn't properly indented and it had some syntax errors
>
> I tried running alertmanager again and i came across another issue, here 
> is the error
>
> ts=2024-09-01T17:35:52.421Z caller=main.go:181 level=info msg="Starting 
> Alertmanager" version="(version=0.27.0, branch=HEAD, 
> revision=0aa3c2aad14cff039931923ab16b26b7481783b5)"
> ts=2024-09-01T17:35:52.421Z caller=main.go:182 level=info 
> build_context="(go=go1.21.7, platform=linux/amd64, user=root@22cd11f671e9, 
> date=20240228-11:51:20, tags=netgo)"
> ts=2024-09-01T17:35:52.440Z caller=cluster.go:186 level=info 
> component=cluster msg="setting advertise address explicitly" 
> addr=192.168.101.2 port=9094
> ts=2024-09-01T17:35:52.441Z caller=main.go:221 level=error msg="unable to 
> initialize gossip mesh" err="create memberlist: Could not set up network 
> transport: failed to obtain an address: Failed to start TCP listener on 
> \"0.0.0.0\" port 9094: listen tcp 0.0.0.0:9094: bind: address already in 
> use"
>
> I have tried all i can to stop the processes that is currently running  on 
> alert manager, but it didn't work out, i also tried adding an external 
> command to run *alertmanager --web.listen-address=localhost:9095 
> --config.file=alertmanager.yml, *but it still isn't picking the new port 
> number i would appreciate further assistance from you guys please, Thank 
> you.
>
>
> On Wed, 28 Aug 2024 at 18:28, chinelo Ufondu  wrote:
>
>> × alertmanager.service - AlertManager
>>  Loaded: loaded (/lib/systemd/system/alertmanager.service; enabled; 
>> vendor preset: enabled)
>>  Active: failed (Result: exit-code) since Wed 2024-08-28 15:43:10 
>> UTC; 2s ago
>> Process: 1691244 ExecStart=/usr/bin/alertmanager --config.file 
>> /etc/alertmanager/alertmanager.yml (code=exited, status=1/FAIL>
>>Main PID: 1691244 (code=exited, status=1/FAILURE)
>> CPU: 29ms
>>
>> Aug 28 15:43:10 localhost systemd[1]: Started AlertManager.
>> Aug 28 15:43:10 localhost alertmanager[1691244]: 
>> ts=2024-08-28T15:43:10.392Z caller=main.go:181 level=info msg="Starting 
>> Alertman>
>> Aug 28 15:43:10 localhost alertmanager[1691244]: 
>> ts=2024-08-28T15:43:10.392Z caller=main.go:182 level=info 
>> build_context="(go=go1>
>> Aug 28 15:43:10 localhost alertmanager[1691244]: 
>> ts=2024-08-28T15:43:10.392Z caller=main.go:193 level=error msg="Unable to 
>> create>
>> Aug 28 15:43:10 localhost systemd[1]: alertmanager.service: Main process 
>> exited, code=exited, status=1/FAILURE
>> Aug 28 15:43:10 localhost systemd[1]: alertmanager.service: Failed with 
>> result 'exit-code'.
>> ~
>> ~
>>
>> -- 
>>
> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to prometheus-use...@googlegroups.com.
>>
> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/4010ba23-d4bb-4396-875b-13768c36eca3n%40googlegroups.com
>>  
>> 
>> .
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/12709b8a-ecb8-40ae-a9b7-2319fae3a1afn%40googlegroups.com.


[prometheus-users] Re: max_over_time not working as expected - want to get the 3 most recent values higher than a specific threshold

2024-08-31 Thread 'Brian Candler' via Prometheus Users
Why are you doing a subquery there?  max_over_time(metric[1h]) should give 
you the largest value at any time over that 1h period. The range vector 
includes all the points in that time period, without resampling.

A subquery could be used if you needed to take an instant vector expression 
and turn it into a range vector by evaluating it at multiple time instants, 
e.g.

max_over_time( (metric > 10 < 100)[1h:15s] )

But for a simple vector expression, the range vector is better than the 
subquery as you get all the data points without resampling.

You said before:

> However first problem if the max value is e.g. 22 and it appears several 
times within the timerange I see this displayxed several times.

That makes no sense. The result of max_over_time() is an *instant vector*. 
By definition, it only has one value for each unique set of labels. If you 
see multiple values of 22, then they are for separate timeseries, and each 
will be identified by its unique sets of labels.

That's what max_over_time does: it works on a range vector of timeseries, 
and gives you the *single* maximum for *each* timeseries. If you pass it a 
range vector with 10 timeseries, you will get an instant vector with 10 
timeseries.

> I would like to idealle see the most recent ones

That also makes no sense. For each timeseries, you will get the maximum 
value of that timeseries across the whole time range, regardless of at what 
time it occurred, and regardless of the values of any other timeseries.

topk(3, ...) then just picks whichever three timeseries have the highest 
maxima over the time period.

> Why do I see a correct peak using
>max_over_time(metric{}[1h:15s])
> 
> but if I run this command the peak is lower than with the other command 
before?
> max_over_time(metric{}[24h:15s])

I'm not sure, but first, try comparing the range vector forms:

max_over_time(metric{}[1h])
max_over_time(metric{}[24h])

If those work as expected, then there may be some issue with subqueries. 
That can be drilled down into by looking at the raw data. Try these queries 
in the PromQL browser, set to "table" rather than "graph" mode:

metric{}[24h]
metric{}[24h:15s]

It will show the actual data that max_over_time() is working across. It 
might be some issue around resampling of the data, but I can't think off 
the top of my head what it could be.

What version of prometheus are you running? It could be a bug with 
subqueries, which may or may not be fixed in later versions.

Also, please remove Grafana from the equation. Enter your PromQL queries 
directly into the PromQL browser in Prometheus.  There are lots of ways you 
can misconfigure Grafana or otherwise confuse matters, e.g. by asking it to 
sweep an instant vector query over a time range to form a graph.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/79c4ab5f-0241-49d5-9e07-c9bdd10eeb6cn%40googlegroups.com.


[prometheus-users] Re: I am having an error while checking the alert manager status

2024-08-30 Thread 'Brian Candler' via Prometheus Users
You're not taking a very helpful approach to debugging. If you had shown 
the full error message, as I suggested, then your issue could probably have 
been fixed.

You didn't say what guide you're following. The official documentation is 
IMO clear and detailed:
https://prometheus.io/docs/alerting/latest/overview/
https://github.com/prometheus/alertmanager

As for installation, it's just a binary that you download and run, but the 
above documents don't tell you how to configure systemd to run it 
(ultimately it's assumed that you know how to administer a Linux system).

You can find some step-by-step instructions here, which may or may not help:
https://nsrc.org/workshops/2022/rwnog/nmm/netmgmt/en/prometheus/ex-alertmanager.html
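
If you do end up writing your own systemd unit, a minimal sketch looks 
something like this (binary and config paths are assumptions):

# /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network-online.target

[Service]
User=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then "systemctl daemon-reload" followed by "systemctl enable --now alertmanager".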

On Friday 30 August 2024 at 10:19:04 UTC+1 chinelo Ufondu wrote:

> I just had to uninstall alert manager, its stressing me out
> Please i need a good guide on installing alertmanager, i want to start 
> afresh
> The guide i have seen so far is just complicating
>
> On Wednesday 28 August 2024 at 19:30:23 UTC+1 Brian Candler wrote:
>
>> The important error message has been truncated ("Unable to create..."). 
>> You can use left/right arrows to scroll sideways, but it would be better to 
>> use these commands:
>>
>> systemctl status alertmanager -l --no-pager
>> journalctl -u alertmanager -n100 --no-pager
>>
>> On Wednesday 28 August 2024 at 18:28:15 UTC+1 chinelo Ufondu wrote:
>>
>>> × alertmanager.service - AlertManager
>>>  Loaded: loaded (/lib/systemd/system/alertmanager.service; enabled; 
>>> vendor preset: enabled)
>>>  Active: failed (Result: exit-code) since Wed 2024-08-28 15:43:10 
>>> UTC; 2s ago
>>> Process: 1691244 ExecStart=/usr/bin/alertmanager --config.file 
>>> /etc/alertmanager/alertmanager.yml (code=exited, status=1/FAIL>
>>>Main PID: 1691244 (code=exited, status=1/FAILURE)
>>> CPU: 29ms
>>>
>>> Aug 28 15:43:10 localhost systemd[1]: Started AlertManager.
>>> Aug 28 15:43:10 localhost alertmanager[1691244]: 
>>> ts=2024-08-28T15:43:10.392Z caller=main.go:181 level=info msg="Starting 
>>> Alertman>
>>> Aug 28 15:43:10 localhost alertmanager[1691244]: 
>>> ts=2024-08-28T15:43:10.392Z caller=main.go:182 level=info 
>>> build_context="(go=go1>
>>> Aug 28 15:43:10 localhost alertmanager[1691244]: 
>>> ts=2024-08-28T15:43:10.392Z caller=main.go:193 level=error msg="Unable to 
>>> create>
>>> Aug 28 15:43:10 localhost systemd[1]: alertmanager.service: Main process 
>>> exited, code=exited, status=1/FAILURE
>>> Aug 28 15:43:10 localhost systemd[1]: alertmanager.service: Failed with 
>>> result 'exit-code'.
>>> ~
>>> ~
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/ca6da2eb-7ca9-4c35-8c22-81865d1d039bn%40googlegroups.com.


[prometheus-users] Re: I am having an error while checking the alert manager status

2024-08-28 Thread 'Brian Candler' via Prometheus Users
The important error message has been truncated ("Unable to create..."). You 
can use left/right arrows to scroll sideways, but it would be better to use 
these commands:

systemctl status alertmanager -l --no-pager
journalctl -u alertmanager -n100 --no-pager

On Wednesday 28 August 2024 at 18:28:15 UTC+1 chinelo Ufondu wrote:

> × alertmanager.service - AlertManager
>  Loaded: loaded (/lib/systemd/system/alertmanager.service; enabled; 
> vendor preset: enabled)
>  Active: failed (Result: exit-code) since Wed 2024-08-28 15:43:10 UTC; 
> 2s ago
> Process: 1691244 ExecStart=/usr/bin/alertmanager --config.file 
> /etc/alertmanager/alertmanager.yml (code=exited, status=1/FAIL>
>Main PID: 1691244 (code=exited, status=1/FAILURE)
> CPU: 29ms
>
> Aug 28 15:43:10 localhost systemd[1]: Started AlertManager.
> Aug 28 15:43:10 localhost alertmanager[1691244]: 
> ts=2024-08-28T15:43:10.392Z caller=main.go:181 level=info msg="Starting 
> Alertman>
> Aug 28 15:43:10 localhost alertmanager[1691244]: 
> ts=2024-08-28T15:43:10.392Z caller=main.go:182 level=info 
> build_context="(go=go1>
> Aug 28 15:43:10 localhost alertmanager[1691244]: 
> ts=2024-08-28T15:43:10.392Z caller=main.go:193 level=error msg="Unable to 
> create>
> Aug 28 15:43:10 localhost systemd[1]: alertmanager.service: Main process 
> exited, code=exited, status=1/FAILURE
> Aug 28 15:43:10 localhost systemd[1]: alertmanager.service: Failed with 
> result 'exit-code'.
> ~
> ~
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c40cfae6-dcbd-49a5-ad03-631ea064fe3en%40googlegroups.com.


[prometheus-users] Re: Changing Port Number and Implementing Authentication for Windows Exporter

2024-08-28 Thread 'Brian Candler' via Prometheus Users
You need to pass some flags to the server process:
https://github.com/prometheus-community/windows_exporter?tab=readme-ov-file#flags

--web.listen-address lets you change the port
--web.config.file lets you point to a config file for setting up HTTP basic 
auth and/or TLS client certificate auth
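
For example (port, user name and password hash are placeholders; the config 
file follows the standard exporter-toolkit web config format):

windows_exporter.exe --web.listen-address=:9999 --web.config.file=web-config.yml

# web-config.yml
basic_auth_users:
  prometheus: $2y$10$...   # bcrypt hash of the password, e.g. from "htpasswd -nB prometheus"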

On Wednesday 28 August 2024 at 11:28:48 UTC+1 madan wrote:

> Dear Prometheus Support Team,
>
> I am using the Windows Exporter to monitor my Windows server. I would like 
> to know how to change the default port number from 9182 to a different 
> port. Additionally, I would like to implement authentication to ensure that 
> only authorized users can access the metrics exposed by the exporter.
>
> Could you please provide detailed instructions on how to accomplish these 
> tasks? Any specific configuration changes or steps involved would be 
> greatly appreciated.
>
> Thank you for your time and assistance.
>
> Madan D S.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f1198bf9-0907-4004-a334-125cc7bd54a4n%40googlegroups.com.


Re: [prometheus-users] Re: My Query never fires an Alarm

2024-08-24 Thread 'Brian Candler' via Prometheus Users
Your alert has an odd name for its purpose ("alert: 
Dev-NotEqualtoBoolZero"). Is it possible you're using the same name for 
another alerting rule? Or maybe an alert name containing a dash is 
problematic, although I don't remember this being a problem.

What version of Prometheus are you running?

If you go to the web interface in the "Alerts" tab, you should be able to 
view green "Inactive" alerts. Is your alerting rule shown there? If you 
click on the ">" to expand it, do you see the rule you were expecting?

You could try copy-pasting the alert rule from this view directly into the 
PromQL browser, just in case some symbol is not what you expect it to be.

You could also try putting the whole expr in single quotes, or using the 
multi-line form:

- alert: Dev-NotEqualtoBoolZero
  expr: |
confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 100
  labels:
severity: critical
  annotations:
description: "The consumer lags for Dev client`"

Those are the only things I can think of.

On Friday 23 August 2024 at 14:45:48 UTC+1 Jay wrote:

> Brian
> "Please determine whether Prometheus is sending alerts to Alertmanager by 
> checking in the Prometheus web interface under the "Alerts" tab.  Then we 
> can focus on either Prometheus or Alertmanager configuration."
> Prometheus Alerts Tab is empty (Never see alarm there for This alert 
> rule). I see Alarms for UP.  Also, Labels for ALL rules are same! Critical.
>
> On Fri, Aug 23, 2024 at 2:38 AM 'Brian Candler' via Prometheus Users <
> promethe...@googlegroups.com> wrote:
>
>> > 2. I am getting other Alerts through Alertmanager for example, UP/down 
>> of instance. So its not the Alertmanager.
>>
>> No, that does not necessarily follow.  (e.g. different alerts can have 
>> different labels and are processed differently by alertmanager routing 
>> rules).
>>
>> Please determine whether Prometheus is sending alerts to Alertmanager by 
>> checking in the Prometheus web interface under the "Alerts" tab.  Then we 
>> can focus on either Prometheus or Alertmanager configuration.
>>
>> On Thursday 22 August 2024 at 16:53:29 UTC+1 Jay wrote:
>>
>>> Brian
>>>Let me answer you in bullet points:
>>> 1. I have tried the expression with both, ie. > 1 and also > 100. Both 
>>> don't fire.
>>> 2. I am getting other Alerts through Alertmanager for example, UP/down 
>>> of instance. So its not the Alertmanager.
>>>
>>> Expression shows non-empty results in PromQL Query interface but still 
>>> it doesn't fire.
>>>
>>> J 
>>>
>>> On Thu, Aug 22, 2024 at 10:20 AM 'Brian Candler' via Prometheus Users <
>>> promethe...@googlegroups.com> wrote:
>>>
>>>> Your test example in PromQL browser has:
>>>> confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 1
>>>> and the values were 2 or 3; but the alerting expression has 
>>>> confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 100
>>>> So clearly it's not going to trigger under that condition, when the 
>>>> lags are less than 100.
>>>>
>>>> If that's not the probelm, then you need to determine: is the rule not 
>>>> firing? Or is Alertmanager not sending an alert?
>>>>
>>>> To do this, check in the Prometheus web interface under the Alerts tab. 
>>>> Is there a firing alert there? If yes, then you focus your investigation 
>>>> on 
>>>> the alertmanager side (e.g. check alertmanager logs). If no, then drill 
>>>> further into the expression, although if the same expression shows a 
>>>> non-empty result in the PromQL query interface, then it certainly should 
>>>> be 
>>>> able to fire an alert.
>>>>
>>>> On Wednesday 21 August 2024 at 21:24:58 UTC+1 Jay wrote:
>>>>
>>>>> Here is the text:
>>>>>
>>>>> groups:
>>>>>   - name: confluent-rules
>>>>> rules:
>>>>>
>>>>> - alert: Dev-NotEqualtoBoolZero
>>>>>   expr: 
>>>>> confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} 
>>>>> > 100
>>>>>   labels:
>>>>> severity: critical
>>>>>   annotations:
>>>>> description: "The consumer lags for 

Re: [prometheus-users] Re: My Query never fires an Alarm

2024-08-23 Thread 'Brian Candler' via Prometheus Users
> 2. I am getting other Alerts through Alertmanager for example, UP/down of 
instance. So its not the Alertmanager.

No, that does not necessarily follow.  (e.g. different alerts can have 
different labels and are processed differently by alertmanager routing 
rules).

Please determine whether Prometheus is sending alerts to Alertmanager by 
checking in the Prometheus web interface under the "Alerts" tab.  Then we 
can focus on either Prometheus or Alertmanager configuration.
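
As a quick command-line cross-check (adjust hosts and ports to your setup;
these assume the default ports), you can compare what each side sees:

# alerts currently pending/firing according to Prometheus
curl -s http://localhost:9090/api/v1/alerts

# alerts currently held by Alertmanager
curl -s http://localhost:9093/api/v2/alerts
# or, with amtool:
amtool --alertmanager.url=http://localhost:9093 alert query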

On Thursday 22 August 2024 at 16:53:29 UTC+1 Jay wrote:

> Brian
>Let me answer you in bullet points:
> 1. I have tried the expression with both, ie. > 1 and also > 100. Both 
> don't fire.
> 2. I am getting other Alerts through Alertmanager for example, UP/down of 
> instance. So its not the Alertmanager.
>
> Expression shows non-empty results in PromQL Query interface but still it 
> doesn't fire.
>
> J 
>
> On Thu, Aug 22, 2024 at 10:20 AM 'Brian Candler' via Prometheus Users <
> promethe...@googlegroups.com> wrote:
>
>> Your test example in PromQL browser has:
>> confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 1
>> and the values were 2 or 3; but the alerting expression has 
>> confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 100
>> So clearly it's not going to trigger under that condition, when the lags 
>> are less than 100.
>>
>> If that's not the problem, then you need to determine: is the rule not 
>> firing? Or is Alertmanager not sending an alert?
>>
>> To do this, check in the Prometheus web interface under the Alerts tab. 
>> Is there a firing alert there? If yes, then you focus your investigation on 
>> the alertmanager side (e.g. check alertmanager logs). If no, then drill 
>> further into the expression, although if the same expression shows a 
>> non-empty result in the PromQL query interface, then it certainly should be 
>> able to fire an alert.
>>
>> On Wednesday 21 August 2024 at 21:24:58 UTC+1 Jay wrote:
>>
>>> Here is the text:
>>>
>>> groups:
>>>   - name: confluent-rules
>>> rules:
>>>
>>> - alert: Dev-NotEqualtoBoolZero
>>>   expr: 
>>> confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} 
>>> > 100
>>>   labels:
>>> severity: critical
>>>   annotations:
>>> description: "The consumer lags for Dev client`"
>>> On Wed, Aug 21, 2024 at 1:51 PM Daz Wilkin  wrote:
>>>
>>>> Please include the rule.
>>>>
>>>> You've shown that the query returns results which is necessary but 
>>>> insufficient.
>>>>
>>>> On Wednesday, August 21, 2024 at 8:19:34 AM UTC-7 Jay P wrote:
>>>>
>>>>> I am not new to Prometheus; however, I wrote the following rule which 
>>>>> never fires. (Alertmanager and all other settings are fine since I get 
>>>>> alarms for other rules except this one.)
>>>>>
>>>>> Attached here is the screenshot and I am copy-pasting here as well.
>>>>>
>>>>> confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 1
>>>>>
>>>>> Results:
>>>>> confluent_kafka_server_consumer_lag_offsets{consumer_group_id="XXX", 
>>>>> instance="api.telemetry.confluent.cloud:443", job="confluent-cloud", 
>>>>> kafka_id="XXX", topic="XXX"}
>>>>> 2
>>>>> confluent_kafka_server_consumer_lag_offsets{consumer_group_id="XXX", 
>>>>> instance="api.telemetry.confluent.cloud:443", job="confluent-cloud", 
>>>>> kafka_id="XXX", topic="XXX"}
>>>>> 3
>>>>>
>>>>> Any help is greatly appreciated. Thank you
>>>>>
>>>> -- 
>>>>
>>> You received this message because you are subscribed to a topic in the 
>>>> Google Groups "Prometheus Users" group.
>>>> To unsubscribe from this topic, visit 
>>>> https://groups.google.com/d/topic/prometheus-users/pBEqCDIUFug/unsubscribe
>>>> .
>>>> To unsubscribe from this group and all its topics, send an email to 
>>>> prometheus-use...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/prometheus-users/e8007886-5a85-4fdb-938e-3637

Re: [prometheus-users] Re: My Query never fires an Alarm

2024-08-22 Thread 'Brian Candler' via Prometheus Users
Your test example in PromQL browser has:
confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 1
and the values were 2 or 3; but the alerting expression has 
confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 100
So clearly it's not going to trigger under that condition, when the lags 
are less than 100.

If that's not the problem, then you need to determine: is the rule not 
firing? Or is Alertmanager not sending an alert?

To do this, check in the Prometheus web interface under the Alerts tab. Is 
there a firing alert there? If yes, then you focus your investigation on 
the alertmanager side (e.g. check alertmanager logs). If no, then drill 
further into the expression, although if the same expression shows a 
non-empty result in the PromQL query interface, then it certainly should be 
able to fire an alert.

On Wednesday 21 August 2024 at 21:24:58 UTC+1 Jay wrote:

> Here is the text:
>
> groups:
>   - name: confluent-rules
> rules:
>
> - alert: Dev-NotEqualtoBoolZero
>   expr: 
> confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} 
> > 100
>   labels:
> severity: critical
>   annotations:
> description: "The consumer lags for Dev client`"
> On Wed, Aug 21, 2024 at 1:51 PM Daz Wilkin  wrote:
>
>> Please include the rule.
>>
>> You've shown that the query returns results which is necessary but 
>> insufficient.
>>
>> On Wednesday, August 21, 2024 at 8:19:34 AM UTC-7 Jay P wrote:
>>
>>> I am not new to Prometheus; however, I wrote the following rule which 
>>> never fires. (Alertmanager and all other settings are fine since I get 
>>> alarms for other rules except this one.)
>>>
>>> Attached here is the screenshot and I am copy-pasting here as well.
>>>
>>> confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 1
>>>
>>> Results:
>>> confluent_kafka_server_consumer_lag_offsets{consumer_group_id="XXX", 
>>> instance="api.telemetry.confluent.cloud:443", job="confluent-cloud", 
>>> kafka_id="XXX", topic="XXX"}
>>> 2
>>> confluent_kafka_server_consumer_lag_offsets{consumer_group_id="XXX", 
>>> instance="api.telemetry.confluent.cloud:443", job="confluent-cloud", 
>>> kafka_id="XXX", topic="XXX"}
>>> 3
>>>
>>> Any help is greatly appreciated. Thank you
>>>
>> -- 
>>
> You received this message because you are subscribed to a topic in the 
>> Google Groups "Prometheus Users" group.
>> To unsubscribe from this topic, visit 
>> https://groups.google.com/d/topic/prometheus-users/pBEqCDIUFug/unsubscribe
>> .
>> To unsubscribe from this group and all its topics, send an email to 
>> prometheus-use...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/e8007886-5a85-4fdb-938e-36373409498cn%40googlegroups.com
>>  
>> 
>> .
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8be525b8-e216-4cc4-8f05-c126bf42fc35n%40googlegroups.com.


[prometheus-users] Oddity with v0.xxx tags

2024-08-19 Thread 'Brian Candler' via Prometheus Users
I have just noticed a load of tags in the prometheus repo for v0.XXX (from 
v0.35.0 to v0.54.0 inclusive) which match with v2.XXX

For example:
https://github.com/prometheus/prometheus/tree/v0.54.0
https://github.com/prometheus/prometheus/releases/v0.54.0

and which github claims was only tagged last week, matches
https://github.com/prometheus/prometheus/tree/v2.54.0
(both are commit 5354e87a)

Is this perhaps to do with Go module versioning, or is something else going 
on here?

Thanks,

Brian.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/2a8e9ee0-252f-4049-b306-7d0865419842n%40googlegroups.com.


[prometheus-users] Re: Questions about best way to monitor CPU usage and accuracy and container life-time

2024-08-19 Thread 'Brian Candler' via Prometheus Users
The @ modifier is no longer experimental. It was made a permanent part of 
Prometheus in Jan 2022, when the promql-at-modifier was made a no-op. It is 
no longer listed under feature flags.

See commit b39f2739e5b01560ad8299d2579f1041a0e9ae5f, which was included in v2.33.0.
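
So for the first part of question 2, something along these lines should
work (the timestamp is a placeholder for the unix time at the end of the
window you care about):

# CPU seconds used per container in the hour ending at the given unix time
sum by (namespace, pod, container) (
  increase(container_cpu_usage_seconds_total{container!=""}[1h] @ 1724083200)
)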

On Monday 19 August 2024 at 18:34:48 UTC+1 Simon Hardy-Francis wrote:

> I think I found one answer to question 2, which is to turn on this via "
> --enable-feature=promql-at-modifier" feature [1].
> However, [2] says such " features .. are disabled by default since they 
> are breaking changes or are considered experimental". So not sure if I want 
> to use it.
>
> [1] https://prometheus.io/blog/2021/02/18/introducing-the-@-modifier/
> [2] https://prometheus.io/docs/prometheus/2.45/feature_flags/
>
> On Friday, August 16, 2024 at 9:30:27 PM UTC-7 Simon Hardy-Francis wrote:
>
>> Hello!
>>
>> I am a relative newbie to Prometheus and promql and have the following 
>> questions:
>>
>> Question 1: I tried to find the CPU used by containers in the last 1 
>> minute, using 3 different ways via promql, and the ways do not agree with 
>> each other.
>>
>> One way is finding the absolute value for " 
>> container_cpu_usage_seconds_total" before and after and subtracting the 
>> values. The other way is using "rate" and the last way is using "increase". 
>> Why do the "rate" and "increase" ways show so much less CPU being used?
>>
>> Question 2: Let's say I want to find the CPU used for all containers 
>> active between 1 and 2 hours ago. And to discover how long each container 
>> is active for the case that e.g. a container only existed for e.g. 10 
>> minutes of that 60 minute window. How to create a promql command to do that?
>>
>> Thanks,
>> Simon
>>
>> P.S.: Here are my commands:
>>
>> $ cat promql.try-1m.sh
>> (promql --timeout 300 --no-headers --host 'https://<prometheus server>' 'sort_desc(sum by (instance, namespace, node, pod, container) (container_cpu_usage_seconds_total{container!=""}))' > promql.before.txt) &
>> sleep 60
>> (promql --timeout 300 --no-headers --host 'https://<prometheus server>' 'sort_desc(sum by (instance, namespace, node, pod, container) (container_cpu_usage_seconds_total{container!=""}))' > promql.after.txt) &
>> (promql --timeout 300 --no-headers --host 'https://<prometheus server>' 'sum(rate(container_cpu_usage_seconds_total{container!=""}[1m])) by (node, instance, namespace, pod, container)' > promql.rate-1m.txt) &
>> (promql --timeout 300 --no-headers --host 'https://<prometheus server>' 'sum(increase(container_cpu_usage_seconds_total{container!=""}[1m])) by (node, instance, namespace, pod, container)' > promql.increase-1m.txt) &
>>
>> $ ./promql.try-1m.sh
>>
>> $ ls -al promql.*.txt
>> -rw-rw-r-- 1 simon simon 3845801 Aug 16 15:36 promql.before.txt
>> -rw-rw-r-- 1 simon simon 3844193 Aug 16 15:37 promql.after.txt
>> -rw-rw-r-- 1 simon simon 3565377 Aug 16 15:37 promql.increase-1m.txt
>> -rw-rw-r-- 1 simon simon 3591045 Aug 16 15:37 promql.rate-1m.txt
>>
>> $ cat promql.before.txt | egrep kube-router | head -1
>> kube-router  10.34.28.252:10250  kube-system   
>>  kube-router-kjxnp  826909.35828385  2024-08-16T15:36:16-07:00
>>
>> $ cat promql.after.txt | egrep kube-router | egrep 
>> "kube-router.*kube-router-kjxnp"
>> kube-router  10.34.28.252:10250  kube-system   
>>  kube-router-kjxnp  826929.923827853  2024-08-16T15:37:16-07:00
>>
>> $ perl -e 'printf qq[%f\n], 826929.923827853 - 826909.35828385;'
>> 20.565544
>>
>> $ cat promql.increase-1m.txt | egrep kube-router | egrep 
>> "kube-router.*kube-router-kjxnp"
>> kube-router  10.34.28.252:10250  kube-system   
>>  kube-router-kjxnp  5.908468843536704  2024-08-16T15:37:16-07:00
>>
>> $ cat promql.rate-1m.txt | egrep kube-router | egrep 
>> "kube-router.*kube-router-kjxnp"
>> kube-router  10.34.28.252:10250  kube-system   
>>  kube-router-kjxnp  0.09846672408107424  2024-08-16T15:37:16-07:00
>>
>> $ perl -e 'printf qq[%f\n], 0.09846672408107424 * 60;'
>> 5.908003
>>
>> $ perl -e 'printf qq[%f%%\n], (20.565544 - 5.908003) / 5.908003 * 100;'
>> 248.096370%
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6cd6b961-893c-4394-90ef-14b0ec7c981cn%40googlegroups.com.


[prometheus-users] Re: Suggest any exporter which exports results of KQL query from azure resource to promethues

2024-08-01 Thread 'Brian Candler' via Prometheus Users
At worst, you can use a cronjob script to perform your KQL query 
periodically, write its results to a file in prometheus text-based 
exposition format, then pick it up using node_exporter textfile collector 
(or even just serve it as a static HTTP web page and have prometheus scrape 
it directly).
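
As a sketch of that cronjob approach (run_kql_query here is a placeholder
for whatever CLI you use to execute the query, and the output directory
assumes node_exporter runs with
--collector.textfile.directory=/var/lib/node_exporter):

#!/bin/sh
set -eu
OUT=/var/lib/node_exporter/kql_results.prom
TMP="${OUT}.tmp"
{
  echo '# HELP kql_result_value Value returned by the scheduled KQL query'
  echo '# TYPE kql_result_value gauge'
  # run_kql_query is assumed to print "<name> <value>" pairs, one per line
  run_kql_query | while read -r name value; do
    printf 'kql_result_value{query="%s"} %s\n' "$name" "$value"
  done
} > "$TMP"
mv "$TMP" "$OUT"   # atomic rename, so the collector never reads a partial file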

On Thursday 1 August 2024 at 10:38:55 UTC+1 Venkatraman Natarajan wrote:

> Hi Team,
>
> I have application insights which I am able to query using KQL and show 
> the result in Azure.
>
> I need to store the queried result in prometheus database. 
>
> Then using promQL I need to display it in grafana.
>
> https://github.com/webdevops/azure-metrics-exporter - I have tried this 
> one; it shows only metrics not able to query using KQL. 
>
> https://github.com/webdevops/azure-loganalytics-exporter - This one only 
> queries log analytics workspace not other resources.
>
> https://github.com/RobustPerception/azure_metrics_exporter - This one not 
> having dimension support.
>
> Could you please help me with this? 
>
> Thanks,
> Venkatraman N
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f12b9f05-9cfc-4496-a42c-c65ee3c84f2en%40googlegroups.com.


[prometheus-users] Re: Alertmananger keep send blank alert and how create resolve template

2024-07-31 Thread 'Brian Candler' via Prometheus Users
> So can I turn that blank alert off, or am I missing any config?

You haven't shown any of your config, so it's impossible to comment on it.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d38961b7-fc2e-4421-9534-770bd1c32a6cn%40googlegroups.com.


[prometheus-users] Re: [Relabel} For specific metric, persist only metrics coming from particular namespace and ignore rest

2024-07-30 Thread 'Brian Candler' via Prometheus Users
Use a temporary label to give the logic "drop if metric name is X and 
namespace is not Y"

Roughly like this (untested):

# Set a marker label on series from the namespace we want to keep
- source_labels: [namespace]
  regex: 'my_interesting_namespace'
  target_label: __tmp_keep_namespace
  replacement: '1'

# Drop the metric when the marker is absent (i.e. any other namespace)
- source_labels: [__name__, __tmp_keep_namespace]
  regex: 'envoy_cluster_upstream_cx_connect_ms_bucket;'
  action: drop

# Remove the temporary label again
- regex: __tmp_keep_namespace
  action: labeldrop

On Tuesday 30 July 2024 at 09:40:01 UTC+1 learner wrote:

> Hi Team,
>
> I have a scrape job called - job_name: 'kubernetes-pods'; there are so many 
> metrics being pushed into prometheus. But I have a bit of a tricky 
> requirement. I want to store all metrics, but for one specific metric, i.e. 
> envoy_cluster_upstream_cx_connect_ms_bucket, I want to store it only from a 
> specific namespace and ignore all other namespaces.
>
> Note: if I use keep then it will drop all other metrics on the job 
> kubernetes-pods, so I don't want that. And if I use drop then I have to 
> provide the whole list of namespaces, which is not dynamic.
>
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6b881303-8d0f-4455-acfd-4123eb6b6db9n%40googlegroups.com.


[prometheus-users] Re: SNMP.yml configuration

2024-07-29 Thread 'Brian Candler' via Prometheus Users
I suggest you use a text editor.

But you shouldn't create snmp.yml manually. You should create generator.yml 
and then use generator + MIB files to convert it to snmp.yml.

The format of generator.yml is documented in the generator README, and the 
snmp.yml output it produces is documented in the main snmp_exporter README.

The generator binary isn't included in the snmp_exporter release bundle, 
but once you've installed the dependencies you can install it with:
go install github.com/prometheus/snmp_exporter/generator@latest
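
For reference, a very small generator.yml sketch (the module name, auth
name and walk entries are only examples):

auths:
  public_v2:
    version: 2
    community: public
modules:
  my_device:
    walk:
      - sysUpTime
      - interfaces
      - ifXTable

Running the generator against that, with the MIBs on its search path,
produces the corresponding module in snmp.yml.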

On Monday 29 July 2024 at 08:25:56 UTC+1 test2 thejo wrote:

> Dear Team,
> How to write SNMP.yml manually in Prometheus and give any notes Please..
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8923eca6-eea9-4b19-82f0-05776051ba99n%40googlegroups.com.


[prometheus-users] Re: Prometheus alert tagging issue - multiple servers

2024-07-27 Thread 'Brian Candler' via Prometheus Users
Q1 - yes, each route can have separate group_by section, as shown in the 
documentation:
https://prometheus.io/docs/alerting/latest/configuration/#route-related-settings

Note that if you do
*group_by: [instance]*
then you'll get one Opsgenie alert group for an instance, even if there are 
multiple problems with that instance. If you want to disable grouping 
completely, put a string with three dots between the square brackets:
*group_by: ['...']*
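
For example (receiver names here are placeholders):

route:
  receiver: default
  group_by: ['alertname', 'cluster']
  routes:
    - receiver: opsgenie
      matchers:
        - severity="critical"
      group_by: ['...']   # no grouping at all on this route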

Q2 - I don't see why you want to put {{ $labels.instance }} in the alert 
name. It's then no longer the name of the alert, it's a combination of the 
name of the alert and the name of the instance; and to analyze the data by 
instance you'd have to parse it out of the alert.

Put it in the alert description instead.
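
i.e. keep the alertname fixed and reference the instance in the
annotations, roughly:

- alert: InstanceDown
  expr: up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    description: "{{ $labels.instance }} has been down for more than 5 minutes"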

> Having host name tag will be helpful and we can know via JIRA integration 
that how many incidents have occured for a host in past.

Surely it would be better to do this is analysis with alert labels, and 
from what I can see of the POST content you showed, Opsgenie calls these 
"tags" rather than "labels".

> It seems the alert manager needs to send another PUT request for updating 
the opsgenie tags.

Are you saying that the problem is that Alertmanager isn't updating the 
tags? But if these tags come from CommonLabels, and the alerts are part of 
a group, then the CommonLabels are by definition those which are common to 
all the alerts in the group.

It seems to me that there are two meaningful alternatives. Either:
1. multiple alerts from Prometheus are in the same group (in which case, 
it's a single alert as far as Opsgenie is concerned, and the tags are the 
labels common to all alerts in the group); or
2. you send separate alerts from Prometheus, each with their own tags, and 
then you analyze and/or group them Opsgenie-side.

If host-by-host incident analysis is what you want, then option (2) seems 
to be the way to go.

What version of Alertmanager are you running? Looking in the changelogs I 
don't see any particular recent changes, and I notice you're already using 
"update_alerts: true", but I thought it was worth checking.

## 0.25.0 / 2022-12-22

* [ENHANCEMENT] Support templating for Opsgenie's responder type. #3060

## 0.24.0 / 2022-03-24

* [ENHANCEMENT] Add `update_alerts` field to the OpsGenie configuration to 
update message and description when sending alerts. #2519
* [ENHANCEMENT] Add `entity` and `actions` fields to the OpsGenie 
configuration. #2753
* [ENHANCEMENT] Add `opsgenie_api_key_file` field to the global 
configuration. #2728
* [ENHANCEMENT] Add support for `teams` responders to the OpsGenie 
configuration. #2685

## 0.22.0 / 2021-05-21

* [ENHANCEMENT] OpsGenie: Propagate labels to Opsgenie details. #2276

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8f93ef01-8eff-40a6-9d51-fe0f374fac80n%40googlegroups.com.


Re: [prometheus-users] Re: SNMP Exporter - Gathering MAC and IP per port

2024-07-23 Thread 'Brian Candler' via Prometheus Users
Depending on what you mean by "fails", you may be able to get some more 
info by adding debug=true to the query params, e.g.

curl 'localhost:9116/snmp?target=x.x.x.x&module=foo&auth=bar&debug=true'

On Tuesday 23 July 2024 at 16:26:26 UTC+1 Matthew Koch wrote:

> Well you made rethink how I had the context setup in the auth and I did 
> find something interesting, for the context I had tried vlan-, vlan-*, * 
> and nothing worked previously. I just tried it with the context variable 
> there in the auth but without anything preceding it and I'm starting to get 
> some results. The original yml I sent works and now grabs every VLAN but 
> the second I try to add the ifIndex lookups it fails. 
>
>   ReadOnly:
> security_level: authPriv
> username: User
> password: password
> auth_protocol: SHA
> priv_protocol: AES
> priv_password: password
> context_name: vlan-200   
> version: 3
>
>   ReadOnly:
> security_level: authPriv
> username: User
> password: password
> auth_protocol: SHA
> priv_protocol: AES
> priv_password: password
> context_name: 
> version: 3
> On Tuesday, July 23, 2024 at 9:49:57 AM UTC-4 Brian Candler wrote:
>
>> And you can't create an SNMP context on the device that exposes all the 
>> parts of the MIB tree that you're interested in?
>>
>> On Tuesday 23 July 2024 at 14:19:14 UTC+1 Matthew Koch wrote:
>>
>>>
>>> *Correct, the SNMP context I am using is specific to the VLAN that I am 
>>> trying to get data from. *
>>> *Non-Cisco Switch: *
>>>
>>> dot1dTpFdbStatus_info{dot1dTpFdbAddress="0x00D51541",dot1dTpFdbPort="5",dot1dTpFdbStatus="learned",ifAlias="Switch5",ifDescr="Module:
>>>  
>>> 1 Port: 5 - 10/100 Mbit TX",ifIndex="5",ifName="1/5"} 1 
>>> *Cisco Switch  - No SNMP Context:*
>>>
>>> # HELP dot1dBasePortCircuit For a port which (potentially) has the same 
>>> value of dot1dBasePortIfIndex as another port on the same bridge, this 
>>> object contains the name of an object instance unique to this port - 
>>> 1.3.6.1.2.1.17.1.4.1.3 # TYPE dot1dBasePortCircuit gauge 
>>> dot1dBasePortCircuit{dot1dBasePort="25",dot1dBasePortCircuit="0.0"} 1 # 
>>> HELP dot1dBasePortDelayExceededDiscards The number of frames discarded by 
>>> this port due to excessive transit delay through the bridge - 
>>> 1.3.6.1.2.1.17.1.4.1.4 # TYPE dot1dBasePortDelayExceededDiscards counter 
>>> dot1dBasePortDelayExceededDiscards{dot1dBasePort="25"} 0 # HELP 
>>> dot1dBasePortIfIndex The value of the instance of the ifIndex object, 
>>> defined in MIB-II, for the interface corresponding to this port. - 
>>> 1.3.6.1.2.1.17.1.4.1.2 # TYPE dot1dBasePortIfIndex gauge 
>>> dot1dBasePortIfIndex{dot1dBasePort="25"} 25 # HELP 
>>> dot1dBasePortMtuExceededDiscards The number of frames discarded by this 
>>> port due to an excessive size - 1.3.6.1.2.1.17.1.4.1.5 # TYPE 
>>> dot1dBasePortMtuExceededDiscards counter 
>>> dot1dBasePortMtuExceededDiscards{dot1dBasePort="25"} 0 # HELP ifIndex 
>>> interface index reported by the SNMP agent - 1.3.6.1.2.1.2.2.1.1 # TYPE 
>>> ifIndex gauge ifIndex{ifIndex="1"} 1 ifIndex{ifIndex="10"} 1 
>>> ifIndex{ifIndex="11"} 1 ifIndex{ifIndex="12"} 1 ifIndex{ifIndex="13"} 1 
>>> ifIndex{ifIndex="14"} 1 *Cisco Switch with VLAN-100 context*
>>> dot1dTpFdbStatus_info{dot1dTpFdbAddress="0x00BD4526",dot1dTpFdbPort="1",dot1dTpFdbStatus="learned",ifAlias="",ifDescr="",ifIndex="",ifName=""}
>>>  
>>> 1 
>>> dot1dTpFdbStatus_info{dot1dTpFdbAddress="0x00013B93",dot1dTpFdbPort="11",dot1dTpFdbStatus="learned",ifAlias="",ifDescr="",ifIndex="",ifName=""}
>>>  
>>> 1 
>>> dot1dTpFdbStatus_info{dot1dTpFdbAddress="0x006664FA",dot1dTpFdbPort="4",dot1dTpFdbStatus="learned",ifAlias="",ifDescr="",ifIndex="",ifName=""}
>>>  
>>> 1 
>>> On Tuesday, July 23, 2024 at 6:21:05 AM UTC-4 Brian Candler wrote:
>>>
>>>> Ah right - so we're talking about SNMP v3 context then, not "VLAN 
>>>> context"?
>>>>
>>>> As I understand it, the SNMP context gives you a selected subset 

Re: [prometheus-users] Re: SNMP Exporter - Gathering MAC and IP per port

2024-07-23 Thread 'Brian Candler' via Prometheus Users
And you can't create an SNMP context on the device that exposes all the 
parts of the MIB tree that you're interested in?

On Tuesday 23 July 2024 at 14:19:14 UTC+1 Matthew Koch wrote:

>
> *Correct, the SNMP context I am using is specific to the VLAN that I am 
> trying to get data from. *
> *Non-Cisco Switch: *
>
> dot1dTpFdbStatus_info{dot1dTpFdbAddress="0x00D51541",dot1dTpFdbPort="5",dot1dTpFdbStatus="learned",ifAlias="Switch5",ifDescr="Module:
>  
> 1 Port: 5 - 10/100 Mbit TX",ifIndex="5",ifName="1/5"} 1 
> *Cisco Switch  - No SNMP Context:*
>
> # HELP dot1dBasePortCircuit For a port which (potentially) has the same 
> value of dot1dBasePortIfIndex as another port on the same bridge, this 
> object contains the name of an object instance unique to this port - 
> 1.3.6.1.2.1.17.1.4.1.3 # TYPE dot1dBasePortCircuit gauge 
> dot1dBasePortCircuit{dot1dBasePort="25",dot1dBasePortCircuit="0.0"} 1 # 
> HELP dot1dBasePortDelayExceededDiscards The number of frames discarded by 
> this port due to excessive transit delay through the bridge - 
> 1.3.6.1.2.1.17.1.4.1.4 # TYPE dot1dBasePortDelayExceededDiscards counter 
> dot1dBasePortDelayExceededDiscards{dot1dBasePort="25"} 0 # HELP 
> dot1dBasePortIfIndex The value of the instance of the ifIndex object, 
> defined in MIB-II, for the interface corresponding to this port. - 
> 1.3.6.1.2.1.17.1.4.1.2 # TYPE dot1dBasePortIfIndex gauge 
> dot1dBasePortIfIndex{dot1dBasePort="25"} 25 # HELP 
> dot1dBasePortMtuExceededDiscards The number of frames discarded by this 
> port due to an excessive size - 1.3.6.1.2.1.17.1.4.1.5 # TYPE 
> dot1dBasePortMtuExceededDiscards counter 
> dot1dBasePortMtuExceededDiscards{dot1dBasePort="25"} 0 # HELP ifIndex 
> interface index reported by the SNMP agent - 1.3.6.1.2.1.2.2.1.1 # TYPE 
> ifIndex gauge ifIndex{ifIndex="1"} 1 ifIndex{ifIndex="10"} 1 
> ifIndex{ifIndex="11"} 1 ifIndex{ifIndex="12"} 1 ifIndex{ifIndex="13"} 1 
> ifIndex{ifIndex="14"} 1 *Cisco Switch with VLAN-100 context*
> dot1dTpFdbStatus_info{dot1dTpFdbAddress="0x00BD4526",dot1dTpFdbPort="1",dot1dTpFdbStatus="learned",ifAlias="",ifDescr="",ifIndex="",ifName=""}
>  
> 1 
> dot1dTpFdbStatus_info{dot1dTpFdbAddress="0x00013B93",dot1dTpFdbPort="11",dot1dTpFdbStatus="learned",ifAlias="",ifDescr="",ifIndex="",ifName=""}
>  
> 1 
> dot1dTpFdbStatus_info{dot1dTpFdbAddress="0x006664FA",dot1dTpFdbPort="4",dot1dTpFdbStatus="learned",ifAlias="",ifDescr="",ifIndex="",ifName=""}
>  
> 1 
> On Tuesday, July 23, 2024 at 6:21:05 AM UTC-4 Brian Candler wrote:
>
>> Ah right - so we're talking about SNMP v3 context then, not "VLAN 
>> context"?
>>
>> As I understand it, the SNMP context gives you a selected subset of the 
>> OID tree. From RFC 5343:
>>
>>
>> * An SNMP context is a collection of management information accessible by 
>> an SNMP entity. An item of management information may exist in more than 
>> one context and an SNMP entity potentially has access to many contexts 
>> [RFC3411 <https://datatracker.ietf.org/doc/html/rfc3411>]. A context is 
>> identified by the snmpEngineID value of the entity hosting the management 
>> information (also called a contextEngineID) and a context name that 
>> identifies the specific context (also called a contextName).*
>> On Tuesday 23 July 2024 at 10:53:11 UTC+1 Ben Kochie wrote:
>>
>>> SNMP has the concept of a "Context Name" that is part of the walk, in 
>>> addition to the community and other security parameters.
>>>
>>> This can be included in the auth section of the config[0], or as a URL 
>>> parameter in the latest release[1].
>>>
>>> [0]: 
>>> https://github.com/prometheus/snmp_exporter/tree/main/generator#file-format
>>> [1]: https://github.com/prometheus/snmp_exporter/pull/1163
>>>
>>> On Tue, Jul 23, 2024 at 11:40 AM 'Brian Candler' via Prometheus Users <
>>> promethe...@googlegroups.com> wrote:
>>>
>>>> > The Cisco switches I am using require you to specify the VLAN context 
>>>> to retrieve the data
>>>>
>>>> I'm not sure I follow. Clearly, you "retrieve" the data simply by 
>>>> walking the relevant SNMP MIB, for which you need to specify nothing more 
>>>

[prometheus-users] Re: node exporter's data collection frequency

2024-07-23 Thread 'Brian Candler' via Prometheus Users
> does node exporter use the same method to collect the file system usage 
stats which the df command uses?

Essentially yes. But you can easily exclude certain filesystems and/or 
types of filesystem from collection.

The below is picked up from an old machine, but I believe the flags won't 
have changed much: "--collector.filesystem.fs-types-exclude" is probably 
the one of most of interest to you.

/usr/local/bin/node_exporter \
  --web.config=/etc/prometheus/node_exporter_web_config.yaml \
  --web.disable-exporter-metrics \
  --collector.textfile.directory=/var/lib/node_exporter \
  --collector.netdev.device-exclude=^(veth.+)$ \
  --collector.netclass.ignored-devices=^(veth.+)$ \
  --collector.diskstats.ignored-devices=^(ram|loop|fd|(h|s|v|xv)d[a-z]+|nvme[0-9]+n[0-9]+p)[0-9]+$ \
  --collector.filesystem.fs-types-exclude=^(aufs|autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fuse[.]glusterfs|fuse[.]lxcfs|fusectl|hugetlbfs|mqueue|nfs4?|nsfs|overlay|proc|procfs|pstore|rootfs|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tmpfs|tracefs)$ \
  --collector.filesystem.mount-points-exclude=^/(dev|proc|sys|var/lib/docker|.*[.]zfs/snapshot)($|/)

On Tuesday 23 July 2024 at 14:10:33 UTC+1 mohan garden wrote:

> came across this query , seems data is collected when prometheus contacts 
> the node exporter during scrape.
> https://groups.google.com/g/prometheus-users/c/h46MJjkEadQ
>
> This leaves me with just the following query , 
> does node exporter use the same method to collect the file system usage 
> stats which the df command uses?
>
> Regards,
> - MG
>
>
> On Tuesday, July 23, 2024 at 5:58:42 PM UTC+5:30 mohan garden wrote:
>
>> Hi All, 
>> We are running node exporter on our infra server and prometheus is able 
>> to scrape the reported data at a 30 second scrape interval. It appears that 
>> node exporter collects the df command's output, and we try to discourage 
>> the use of the df command on the NFS.
>>
>> As we plan to reduce the impact of node exporter on NFS metadata 
>> queries, we wanted to understand at what frequency the node 
>> exporter collects the data.
>>
>> Was unable to figure out this information from node exporter's --help 
>> option.
>>
>> Please advice.
>>
>> Regards
>> - MG
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/ca67af7d-064e-472d-b229-97b43dea8446n%40googlegroups.com.


Re: [prometheus-users] Re: SNMP Exporter - Gathering MAC and IP per port

2024-07-23 Thread 'Brian Candler' via Prometheus Users
Ah right - so we're talking about SNMP v3 context then, not "VLAN context"?

As I understand it, the SNMP context gives you a selected subset of the OID 
tree. From RFC 5343:


* An SNMP context is a collection of management information accessible by 
an SNMP entity. An item of management information may exist in more than 
one context and an SNMP entity potentially has access to many contexts 
[RFC3411 <https://datatracker.ietf.org/doc/html/rfc3411>]. A context is 
identified by the snmpEngineID value of the entity hosting the management 
information (also called a contextEngineID) and a context name that 
identifies the specific context (also called a contextName).*
On Tuesday 23 July 2024 at 10:53:11 UTC+1 Ben Kochie wrote:

> SNMP has the concept of a "Context Name" that is part of the walk, in 
> addition to the community and other security parameters.
>
> This can be included in the auth section of the config[0], or as a URL 
> parameter in the latest release[1].
>
> [0]: 
> https://github.com/prometheus/snmp_exporter/tree/main/generator#file-format
> [1]: https://github.com/prometheus/snmp_exporter/pull/1163
>
> On Tue, Jul 23, 2024 at 11:40 AM 'Brian Candler' via Prometheus Users <
> promethe...@googlegroups.com> wrote:
>
>> > The Cisco switches I am using require you to specify the VLAN context 
>> to retrieve the data
>>
>> I'm not sure I follow. Clearly, you "retrieve" the data simply by walking 
>> the relevant SNMP MIB, for which you need to specify nothing more than the 
>> OID to walk. Are you saying that Cisco have a proprietary MIB for this 
>> data, and/or that the VLAN is part of the table key?  Does it not have an 
>> equivalent to dot1dTpFdbPort, or does dot1dBasePortIfIndex not match with 
>> ifIndex?
>>
>> If you show some examples of snmpwalk output it may be clearer. Although 
>> I don't have anything to test with here (except perhaps IOSv)
>>
>> > In a perfect world I'm able to get ifIndex, ifDescr, ifAlias, ifName, 
>> mac address and IP address in one call.
>>
>> One call to what - Prometheus? If the IP-to-MAC mapping and MAC-to-port 
>> mapping are in different SNMP tables then it would not be straightforward 
>> to combine them in snmp_exporter (it might be possible with chained 
>> lookups).  You could also have a recording rule 
>> <https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/>
>>  
>> in prometheus which performs the join and stores the result.
>>
>> On Monday 22 July 2024 at 19:42:56 UTC+1 Matthew Koch wrote:
>>
>>> Unfortunately adding the ifIndex only works on some switches. The Cisco 
>>> switches I am using require you to specify the VLAN context to retrieve the 
>>> data which doesn't pull the ifIndex information. This definitely helps 
>>> though,  I was hoping there was a way trick it into 
>>> using dot1dBasePortIfIndex instead of ifIndex because they are equivalents 
>>> to pull ifAlias, ifDescr, ifName etc. This would also be useful to get the 
>>> IP address in the same poll. In a perfect world I'm able to get ifIndex, 
>>> ifDescr, ifAlias, ifName, mac address and IP address in one call. 
>>>
>>> On Saturday, July 20, 2024 at 7:56:10 AM UTC-4 Brian Candler wrote:
>>>
>>>> I had a play with this and I think I got most of the way there. Here's 
>>>> generator.yml:
>>>>
>>>> modules:
>>>>   bridge_mib:
>>>> walk:
>>>>   - dot1dBasePortTable
>>>>   - dot1dTpFdbTable
>>>> lookups:
>>>>   - source_indexes: [dot1dTpFdbAddress]
>>>> lookup: dot1dTpFdbPort
>>>>   - source_indexes: [dot1dTpFdbPort]
>>>> lookup: dot1dBasePortIfIndex
>>>> overrides:
>>>>   dot1dBasePort:
>>>> ignore: true
>>>>   dot1dTpFdbStatus:
>>>> type: EnumAsInfo
>>>>   dot1dTpFdbPort:
>>>> ignore: true
>>>>
>>>> Here's the snmp.yml that it creates:
>>>>
>>>> # WARNING: This file was auto-generated using snmp_exporter generator, 
>>>> manual changes will be lost.
>>>> modules:
>>>>   bridge_mib:
>>>> walk:
>>>> - 1.3.6.1.2.1.17.1.4
>>>> - 1.3.6.1.2.1.17.4.3
>>>>
>>>> metrics:
>>>> - name: dot1dBasePortIfIndex
>>>>   oid: 1.3.6.1.2.1.17.1.4.1.2
>>>>   type: gau

[prometheus-users] Re: SNMP Exporter - Gathering MAC and IP per port

2024-07-23 Thread 'Brian Candler' via Prometheus Users
> The Cisco switches I am using require you to specify the VLAN context to 
retrieve the data

I'm not sure I follow. Clearly, you "retrieve" the data simply by walking 
the relevant SNMP MIB, for which you need to specify nothing more than the 
OID to walk. Are you saying that Cisco have a proprietary MIB for this 
data, and/or that the VLAN is part of the table key?  Does it not have an 
equivalent to dot1dTpFdbPort, or does dot1dBasePortIfIndex not match with 
ifIndex?

If you show some examples of snmpwalk output it may be clearer. Although I 
don't have anything to test with here (except perhaps IOSv)

> In a perfect world I'm able to get ifIndex, ifDescr, ifAlias, ifName, mac 
address and IP address in one call.

One call to what - Prometheus? If the IP-to-MAC mapping and MAC-to-port 
mapping are in different SNMP tables then it would not be straightforward 
to combine them in snmp_exporter (it might be possible with chained 
lookups).  You could also have a recording rule 
<https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/> 
in prometheus which performs the join and stores the result.
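
A sketch of such a recording rule, assuming both metrics come from the same
target (instance) and each MAC currently maps to a single IP (otherwise the
many-to-one match will fail):

groups:
  - name: bridge_joins
    rules:
      - record: dot1dTpFdbPort:with_ip
        expr: |
          label_replace(dot1dTpFdbPort, "mac", "$1", "dot1dTpFdbAddress", "(.+)")
            * on (instance, mac) group_left (ipNetToMediaNetAddress)
          label_replace(ipNetToMediaPhysAddress, "mac", "$1", "ipNetToMediaPhysAddress", "(.+)")

The resulting series keep the port number as the value, with the IP address
attached as a label.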

On Monday 22 July 2024 at 19:42:56 UTC+1 Matthew Koch wrote:

> Unfortunately adding the ifIndex only works on some switches. The Cisco 
> switches I am using require you to specify the VLAN context to retrieve the 
> data which doesn't pull the ifIndex information. This definitely helps 
> though,  I was hoping there was a way trick it into 
> using dot1dBasePortIfIndex instead of ifIndex because they are equivalents 
> to pull ifAlias, ifDescr, ifName etc. This would also be useful to get the 
> IP address in the same poll. In a perfect world I'm able to get ifIndex, 
> ifDescr, ifAlias, ifName, mac address and IP address in one call. 
>
> On Saturday, July 20, 2024 at 7:56:10 AM UTC-4 Brian Candler wrote:
>
>> I had a play with this and I think I got most of the way there. Here's 
>> generator.yml:
>>
>> modules:
>>   bridge_mib:
>> walk:
>>   - dot1dBasePortTable
>>   - dot1dTpFdbTable
>> lookups:
>>   - source_indexes: [dot1dTpFdbAddress]
>> lookup: dot1dTpFdbPort
>>   - source_indexes: [dot1dTpFdbPort]
>> lookup: dot1dBasePortIfIndex
>> overrides:
>>   dot1dBasePort:
>> ignore: true
>>   dot1dTpFdbStatus:
>> type: EnumAsInfo
>>   dot1dTpFdbPort:
>> ignore: true
>>
>> Here's the snmp.yml that it creates:
>>
>> # WARNING: This file was auto-generated using snmp_exporter generator, 
>> manual changes will be lost.
>> modules:
>>   bridge_mib:
>> walk:
>> - 1.3.6.1.2.1.17.1.4
>> - 1.3.6.1.2.1.17.4.3
>>
>> metrics:
>> - name: dot1dBasePortIfIndex
>>   oid: 1.3.6.1.2.1.17.1.4.1.2
>>   type: gauge
>>   help: The value of the instance of the ifIndex object, defined in 
>> IF-MIB, for
>>
>> the interface corresponding to this port. - 1.3.6.1.2.1.17.1.4.1.2
>>   indexes:
>>   - labelname: dot1dBasePort
>> type: gauge
>> - name: dot1dBasePortCircuit
>>   oid: 1.3.6.1.2.1.17.1.4.1.3
>>   type: OctetString
>>   help: For a port that (potentially) has the same value of 
>> dot1dBasePortIfIndex
>> as another port on the same bridge - 1.3.6.1.2.1.17.1.4.1.3
>>
>>   indexes:
>>   - labelname: dot1dBasePort
>> type: gauge
>> - name: dot1dBasePortDelayExceededDiscards
>>   oid: 1.3.6.1.2.1.17.1.4.1.4
>>   type: counter
>>   help: The number of frames discarded by this port due to excessive 
>> transit delay
>> through the bridge - 1.3.6.1.2.1.17.1.4.1.4
>>
>>   indexes:
>>   - labelname: dot1dBasePort
>> type: gauge
>> - name: dot1dBasePortMtuExceededDiscards
>>   oid: 1.3.6.1.2.1.17.1.4.1.5
>>   type: counter
>>   help: The number of frames discarded by this port due to an 
>> excessive size -
>> 1.3.6.1.2.1.17.1.4.1.5
>>
>>   indexes:
>>   - labelname: dot1dBasePort
>> type: gauge
>> - name: dot1dTpFdbAddress
>>   oid: 1.3.6.1.2.1.17.4.3.1.1
>>   type: PhysAddress48
>>   help: A unicast MAC address for which the bridge has forwarding 
>> and/or filtering
>> information. - 1.3.6.1.2.1.17.4.3.1.1
>>
>>   indexes:
>>   - labelname: dot1dTpFdbAddress
>> type: PhysAddress48
>> fixed_size: 6
>>   - labelname:

Re: [prometheus-users] Counter or Gauge metric?

2024-07-21 Thread 'Brian Candler' via Prometheus Users
On Sunday 21 July 2024 at 00:51:48 UTC+1 Christoph Anton Mitterer wrote:

Hey. 

On Sat, 2024-07-20 at 10:26 -0700, 'Brian Candler' via Prometheus Users 
wrote: 
> 
> If the label stays constant, then the amount of extra space required 
> is tiny.  There is an internal mapping between a bag of labels and a 
> timeseries ID. 

Is it the same if one uses a metric (like for the RPMs from below) and 
that never changes? I mean is that also efficient?


Yes:

smartraid_physical_drive_rotational_speed_rpm 7200
smartraid_info{rpm="7200"} 1

are both static timeseries. Prometheus does delta compression; if you store 
the same value repeatedly the difference between adjacent points is zero. 
It doesn't matter if the timeseries value is 1 or 7200.

 


> But if any label changes, that generates a completely new timeseries. 
> This is not something you want to happen too often (a.k.a "timeseries 
> churn"), but moderate amounts are OK. 

Why exactly wouldn't one want this? I mean especially with respect to 
such _info metrics.


It's just a general consideration. When a timeseries churns you get new a 
new index entry, new head blocks etc.

For info metrics which rarely change, it's fine.

The limiting worst case is where you have a label value that changes every 
sample (for example, putting a timestamp in a label). Then every scrape 
generates a new timeseries containing one point. Have a few hundred 
thousand scrapes like that and your server will collapse.

 


Graphing _info time series doesn't make sense anyway... so it's not as 
if one would get some usable time series/graph (like a temperature or 
so) interrupted, if e.g. the state changes for a while from OK to 
degraded.


Indeed, and Grafana has a swim-lanes type view that works quite well for 
that.  When a time series disappears, it goes "stale". But the good news 
is, for quite some time now, Prometheus has been automatically inserting 
staleness markers for a timeseries which existed in a previous scrape but 
not in the current scrape from the same job and target.

Prior to that, timeseries would only go stale if there had been no data 
point ingested for 5 minutes, so it would be very unclear when the 
timeseries had actually vanished.
 


I guess with appearing/disappearing you mean, that one has to take into 
account, that e.g. pd_info{state=="OK",pd_name="foo"} won't exist while 
"foo" is failed... and thus e.g. when graphing the OK-times of a 
device, it would per default show nothing during that time and not a 
value of zero?


Yes. And it's a bit harder to alert on that condition, but you just have to 
approach it the right way. As you've realised, you can alert on the 
presence of a timeseries with a label not "OK", which is easier than 
alerting on the absence of a timeseries whose label is "OK".

 

> The other option, if the state values are integer enumerations at 
> source (e.g. as from SNMP), is to store the raw numeric value: 
> 
> foo 3 
> 
> That means the querier has to know the meaning of these values. 
> (Grafana can map specific values to textual labels and/or colours 
> though). 

But that also requires me to use a label like in enum_metric{value=3},


No, I mean

my_metric{other_labels="dontcare"} 3

An example is ifOperStatus in SNMP, where the meaning of values 1, 2, 3 
...etc is defined in the MIB.

 


or I have to construct metric names dynamically (which I could also 
have done for the symbolic name), which seems however discouraged (and 
I'd say for good reasons)?


Don't generate metric names dynamically. That's what labels are for.  (In 
any case, the metric name is itself just a hidden label called "__name__")
 
There is good advice at https://prometheus.io/docs/practices/naming/

I mean if both, label and metric, are equally efficient (in therms of 
storage)... then using a metric would have still the advantage of being 
able to do things like: 
smartraid_logical_drive_chunk_size_bytes > (256*1024) 
i.e. select those LDs, that use a chunk size > 256 KiB ... which I 
cannot (as easily) do if it's in a label.


Correct. The flip side is if you want to see at a glance all the 
information about a logical volume, you'll need to look at a bunch of 
different metrics and associate them by some common label (e.g. a unique 
volume ID)
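
e.g. the usual info-metric join, reusing the hypothetical pd_info example
from earlier in this thread (and assuming both series share a pd_name
label):

smartraid_physical_drive_size_bytes
  * on (instance, pd_name) group_left (state)
pd_info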

Both approaches are valid.  If you see a use case for the filtering or 
arithmetic, that pushes you down the path of separate metrics.

If you're comparing a hundred static metrics versus a single metric with a 
hundred labels then I'd *guess* the single metric would be a bit more 
efficient in terms of storage and ingestion performance, but it's marginal 
and shouldn't really be a consideration: data is there to be used, so pu

Re: [prometheus-users] Counter or Gauge metric?

2024-07-20 Thread 'Brian Candler' via Prometheus Users
> If one adds a label to a metric, which then stays mostly constant, does
> this add any considerably amount of space needed for storing it?

If the label stays constant, then the amount of extra space required is 
tiny.  There is an internal mapping between a bag of labels and a 
timeseries ID.

But if any label changes, that generates a completely new timeseries. This 
is not something you want to happen too often (a.k.a "timeseries churn"), 
but moderate amounts are OK.  For example, if a drive changes from "OK" to 
"degraded" then that would be reasonable, but putting the drive temperature 
in a label would not.

Some people prefer to enumerate a separate timeseries per state, which can 
make certain types of query easier since you don't have to worry about 
staleness or timeseries appearing and disappearing. e.g.

foo{state="OK"} 1
foo{state="degraded"} 0
foo{state="absent"} 0

It's much easier to alert on foo{state="OK"}  == 0, than on the absence of 
timeseries foo{state="OK"}. However, as you observe, you need to know in 
advance what all the possible states are.
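
For example, with the enumerated form above, the alert is simply (the "for"
duration and severity are illustrative):

- alert: DriveNotOK
  expr: foo{state="OK"} == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    description: "{{ $labels.instance }} is reporting a drive state other than OK"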

The other option, if the state values are integer enumerations at source 
(e.g. as from SNMP), is to store the raw numeric value:

foo 3

That means the querier has to know the meaning of these values. 
(Grafana can map specific values to textual labels and/or colours though).

> 2) Metrics like:
> - smartraid_physical_drive_size_bytes
> - smartraid_physical_drive_rotational_speed_rpm
> can in principle not change (again: unless the PD is replaced with
> another one of the same name).
> 
> So should they rather be labels, despite being numbers?
> 
> OTOH, labels like:
> - smartraid_logical_drive_size_bytes
> - smartraid_logical_drive_chunk_size_bytes
> - smartraid_logical_drive_data_stripe_size_bytes
> *can* in principle change (namely if the RAID is converted).

IMO those are all fine as labels. Creation or modification of a logical 
volume is a rare thing to do, and arguably changing such fundamental 
parameters is making a new logical volume.

If you ever wanted to do *arithmetic* on those values - like divide the 
physical drive size by the sum of logical drive sizes - then you'd want 
them as metrics. Also, filtering on labels can be awkward (e.g. "show me 
all drives with speed greater than 7200rpm" requires a bit of regexp magic, 
although "show me all drives with speed not 7200rpm" is easy).

But I don't think those are common use cases.  Rather it's just about 
collecting secondary identifying information and characteristics.

> I went now for the approach to have a dedicated metric for those
> where there's a dedicated property in the RAID tool output, like:
> - smartraid_controller_temperature_celsius

Yes: something that's continuously variable (and likely to vary 
frequently), and/or that you might want to draw a graph of or alert on, is 
definitely its own metric value, not a label.

On Saturday 20 July 2024 at 16:00:51 UTC+1 Christoph Anton Mitterer wrote:

> Hey Ben and Chris.
>
> Thanks for your replies!
>
> On Fri, 2024-07-19 at 09:17 +0200, Ben Kochie wrote:
> > This is one of those tricky situations where there's not a strict
> > correct answer.
>
> Indeed.
>
>
> > For power-on-hours I would probably go with a gauge.
> > * You don't really have a "perfect" monotonic counter here.
>
> Why not?
> I mean there's what Chris said about some drives that may overflow the
> number - which in principle sounds unlikely though I must admit that
> especially with NVMe SMART data I have seen unreasonably low numbers,
> too (but in those cases, I've never seen high numbers for these drives,
> so it may also just be some other issue).
>
>
> > * I would also include the serial number label as well, just for
> > uniqueness identification sake.
>
> If one adds a label to a metric, which then stays mostly constant, does
> this add any considerably amount of space needed for storing it?
>
> But more on that below.
>
>
> > * Power-on-hours doesn't really have a lot of use as a counter. Do
> > actually want to display a counter like `rate(power_on_hours[1h])`?
>
> No, not particularly. It's just a number that should at least in theory
> only increase, and I wanted to do it right.
>
>
> Perhaps I should describe things a bit more, because actually I would
> have some more cases where it's not clear to me how to map perfectly
> into metrics.
>
> I had already asked at
> https://discuss.prometheus.io/t/how-to-design-metrics-labels/2337
> and one further thread there (which was however swallowed by the anti-
> spam, and would need some admin to approve it).
>
>
> My exporter parses the RAID CLI tools output, which results in a structure 
> like this (as JSON):
> {
> "controllers": {
> "0": {
> "properties": {
> "slot": "0",
> "serial_number": "000A",
> "controller_status": "OK",
> "hardware_revision": "A",
> "firmware_version": "6.52",
> "rebuild_priority": "High",
> "cache_status": "OK",
> "battery_capacitor_status": "OK",
> "controller_te

[prometheus-users] Re: SNMP Exporter - Gathering MAC and IP per port

2024-07-20 Thread 'Brian Candler' via Prometheus Users
",dot1dTpFdbStatus="self"}
 
1
dot1dTpFdbStatus_info{dot1dBasePortIfIndex="10",dot1dTpFdbAddress="XX:XX:XX:5F:6C:B2",dot1dTpFdbPort="23",dot1dTpFdbStatus="learned"}
 
1
dot1dTpFdbStatus_info{dot1dBasePortIfIndex="10",dot1dTpFdbAddress="XX:XX:XX:81:98:C4",dot1dTpFdbPort="23",dot1dTpFdbStatus="learned"}
 
1
...

I think dot1dTpFdbAddress now gives more or less what you want. A few 
niggles:

(1) I would like to change "dot1dBasePortIfIndex" to "ifIndex" to make 
joins easier, without having to use label_replace(). I couldn't see a way 
to rename a metric in snmp_exporter.

(2) I would like to merge the enumerated dot1dTpFdbStatus strings 
into dot1dTpFdbAddress. However if I add this:

lookups:
  - source_indexes: [dot1dTpFdbAddress]
    lookup: dot1dTpFdbPort
  - source_indexes: [dot1dTpFdbAddress]
    lookup: dot1dTpFdbStatus
  - source_indexes: [dot1dTpFdbPort]
    lookup: dot1dBasePortIfIndex

then I get scraping errors, e.g.

* error collecting metric Desc{fqName: "snmp_error", help: "Error calling 
NewConstMetric for EnumAsInfo", constLabels: {}, variableLabels: {}}: error 
for metric dot1dTpFdbStatus with labels [9 3 5 XX:XX:XX:27:29:BA learned]: 
duplicate label names in constant and variable labels for metric 
"dot1dTpFdbStatus_info"

If I remove the override

  dot1dTpFdbStatus:
type: EnumAsInfo

then scraping works, but I only get the numeric status code 
e.g. dot1dTpFdbStatus="3"

---

Note that if you want to avoid the join in PromQL, you *can* walk 
if[X]Table as well:

modules:
  bridge_mib:
walk:
  - dot1dBasePortTable
  - dot1dTpFdbTable
  - ifIndex
  - ifAlias
  - 1.3.6.1.2.1.2.2.1.2
  - 1.3.6.1.2.1.31.1.1.1.1
lookups:
  - source_indexes: [dot1dTpFdbAddress]
lookup: dot1dTpFdbPort
  - source_indexes: [dot1dTpFdbPort]
lookup: dot1dBasePortIfIndex
  - source_indexes: [dot1dBasePortIfIndex]
lookup: ifIndex
drop_source_indexes: true
  - source_indexes: [ifIndex]
lookup: ifAlias
  - source_indexes: [ifIndex]
# Use OID to avoid conflict with PaloAlto PAN-COMMON-MIB.
lookup: 1.3.6.1.2.1.2.2.1.2 # ifDescr
  - source_indexes: [ifIndex]
# Use OID to avoid conflict with Netscaler NS-ROOT-MIB.
lookup: 1.3.6.1.2.1.31.1.1.1.1 # ifName
overrides:
  dot1dBasePort:
ignore: true
  dot1dTpFdbStatus:
type: EnumAsInfo
  dot1dTpFdbPort:
ignore: true
  ifAlias:
ignore: true
  ifDescr:
ignore: true
  ifName:
ignore: true

In this case, dot1dTpFdbAddress includes ifIndex *and* the other interface 
info, which makes the metric rather convenient to use:

# HELP dot1dTpFdbAddress A unicast MAC address for which the bridge has 
forwarding and/or filtering information. - 1.3.6.1.2.1.17.4.3.1.1
# TYPE dot1dTpFdbAddress gauge
dot1dTpFdbAddress{dot1dTpFdbAddress="XX:XX:XX:C5:A2:F2",dot1dTpFdbPort="9",ifAlias="",ifDescr="ether5",ifIndex="5",ifName="ether5"}
 
1
dot1dTpFdbAddress{dot1dTpFdbAddress="XX:XX:XX:12:91:4B",dot1dTpFdbPort="6",ifAlias="",ifDescr="ether2",ifIndex="2",ifName="ether2"}
 
1
dot1dTpFdbAddress{dot1dTpFdbAddress="XX:XX:XX:27:FE:A9",dot1dTpFdbPort="6",ifAlias="",ifDescr="ether2",ifIndex="2",ifName="ether2"}
 
1
...

But I suspect that if you're scraping if_mib as well, then snmp_exporter 
will end up walking bits of ifTable/ifXTable twice, making it less 
efficient network-wise.

On Saturday 20 July 2024 at 10:20:53 UTC+1 Brian Candler wrote:

> I found a relevant issue: 
> https://github.com/prometheus/snmp_exporter/issues/405
>
> Firstly, the PromQL count_values 
> <https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators>
>  
> operator can be used to convert a metric value to a label (very neat trick).
>
> And secondly, the ability to do "chainable lookups" was added:
> https://github.com/prometheus/snmp_exporter/pull/527/files
> This might be a way to solve this in the exporter - but I haven't got my 
> head around this. I'm not sure if you'd need to walk ifTable in your 
> generator, even if you're not actually interested in any additional values 
> from ifTable.
>
> On Saturday 20 July 2024 at 09:48:26 UTC+1 Brian Candler wrote:
>
>> > dot1dBasePortIfIndex{dot1dBasePort="12"} 12  - *This won't always be 
>> the same number*
>>
>> The MIB help text says "The value of the instance of the ifIndex object". 
>> So I'm guessing that what yo

[prometheus-users] Re: SNMP Exporter - Gathering MAC and IP per port

2024-07-20 Thread 'Brian Candler' via Prometheus Users
I found a relevant issue: 
https://github.com/prometheus/snmp_exporter/issues/405

Firstly, the PromQL count_values 
<https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators>
 
operator can be used to convert a metric value to a label (very neat trick).
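
For example, something like this turns the value of dot1dBasePortIfIndex
into an "ifIndex" label (a sketch; group by whichever labels you need to
keep):

count_values by (instance, dot1dBasePort) ("ifIndex", dot1dBasePortIfIndex)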

And secondly, the ability to do "chainable lookups" was added:
https://github.com/prometheus/snmp_exporter/pull/527/files
This might be a way to solve this in the exporter - but I haven't got my 
head around this. I'm not sure if you'd need to walk ifTable in your 
generator, even if you're not actually interested in any additional values 
from ifTable.

On Saturday 20 July 2024 at 09:48:26 UTC+1 Brian Candler wrote:

> > dot1dBasePortIfIndex{dot1dBasePort="12"} 12  - *This won't always be 
> the same number*
>
> The MIB help text says "The value of the instance of the ifIndex object". 
> So I'm guessing that what you currently get as
>
> dot1dBasePortIfIndex{dot1dBasePort="12"} 42
>
> would be more usefully returned as
>
> dot1dBasePortIfIndex{dot1dBasePort="12",ifIndex="42"} 1
>
> But I'm afraid I don't have enough generator.yml foo to know how to do 
> that :-(
>
> On Thursday 18 July 2024 at 20:04:35 UTC+1 Matthew Koch wrote:
>
>>
>> *This is a physical port ifIndex example*
>>
>> ifAdminStatus{ifAlias="Device; Device 
>> (DEVICE)",ifDescr="GigabitEthernet1/12",ifIndex="12",ifName="Gi1/12"} 1
>>
>>
>> *1. dot1dBasePortIfIndex is an equivalent of ifIndex but dot1dBasePort is 
>> not. dot1dBasePort is used to get the MAC address. *
>>
>> dot1dBasePortIfIndex{dot1dBasePort="12"} 12  - *This won't always be the 
>> same number*
>>
>> *2. I get the MAC address and port pair from this*
>>
>> dot1dTpFdbPort{dot1dTpFdbAddress="11:E0:E4:66:5E:11"} 12
>>
>>
>> *3.  I get the MAC address and IP pair from this. But 
>> the ipNetToMediaIfIndex is a VLAN not a physical port. *
>>
>> ipNetToMediaPhysAddress{ipNetToMediaIfIndex="28",ipNetToMediaNetAddress="
>> 10.10.1.33",ipNetToMediaPhysAddress="11:E0:E4:66:5E:11"} 1
>>
>>
>>
>> On Thursday, July 18, 2024 at 2:42:10 PM UTC-4 Brian Candler wrote:
>>
>>> > The challenge I am having is using promql to join the data so I can 
>>> show the IP associated with the MAC address on the physical port. 
>>>
>>> Can you show some examples of the metrics you're trying to join?
>>>
>>> On Thursday 18 July 2024 at 18:48:35 UTC+1 Matthew Koch wrote:
>>>
>>>> I am working on a project to gather the MAC address and IP which is on 
>>>> a specific port on a network switch. I've been able to gather this 
>>>> information with the below SNMP config but the challenge is the MAC 
>>>> address 
>>>> comes back against the physical port index and the IPs come back against 
>>>> the VLANs index which is expected. The challenge I am having is using 
>>>> promql to join the data so I can show the IP associated with the MAC 
>>>> address on the physical port. 
>>>>
>>>>  walk:
>>>> - 1.3.6.1.2.1.17.1.4.1
>>>> - 1.3.6.1.2.1.17.4.3.1
>>>> - 1.3.6.1.2.1.4.22.1
>>>> - 1.3.6.1.2.1.4.35.1
>>>> metrics:
>>>> - name: dot1dBasePortIfIndex
>>>>   oid: 1.3.6.1.2.1.17.1.4.1.2
>>>>   type: gauge
>>>>   help: The value of the instance of the ifIndex object, defined in 
>>>> MIB-II, for
>>>> the interface corresponding to this port. - 
>>>> 1.3.6.1.2.1.17.1.4.1.2
>>>>   indexes:
>>>>   - labelname: dot1dBasePort
>>>> type: gauge
>>>> - name: dot1dTpFdbPort
>>>>   oid: 1.3.6.1.2.1.17.4.3.1.2
>>>>   type: gauge
>>>>   help: Either the value '0', or the port number of the port on 
>>>> which a frame
>>>> having a source address equal to the value of the corresponding 
>>>> instance of
>>>> dot1dTpFdbAddress has been seen - 1.3.6.1.2.1.17.4.3.1.2
>>>>   indexes:
>>>>   - labelname: dot1dTpFdbAddress
>>>> type: PhysAddress48
>>>> fixed_size: 6
>>>> - name: dot1dTpFdbStatus
>>>>   oid: 1.3.6.1.2.1.17.4.3.1.3
>>>>   type: EnumA

[prometheus-users] Re: SNMP Exporter - Gathering MAC and IP per port

2024-07-20 Thread 'Brian Candler' via Prometheus Users
> dot1dBasePortIfIndex{dot1dBasePort="12"} 12  - *This won't always be the 
same number*

The MIB help text says "The value of the instance of the ifIndex object". 
So I'm guessing that what you currently get as

dot1dBasePortIfIndex{dot1dBasePort="12"} 42

would be more usefully returned as

dot1dBasePortIfIndex{dot1dBasePort="12",ifIndex="42"} 1

But I'm afraid I don't have enough generator.yml foo to know how to do that 
:-(

On Thursday 18 July 2024 at 20:04:35 UTC+1 Matthew Koch wrote:

>
> *This is a physical port ifIndex example*
>
> ifAdminStatus{ifAlias="Device; Device 
> (DEVICE)",ifDescr="GigabitEthernet1/12",ifIndex="12",ifName="Gi1/12"} 1
>
>
> *1. dot1dBasePortIfIndex is an equivalent of ifIndex but dot1dBasePort is 
> not. dot1dBasePort is used to get the MAC address. *
>
> dot1dBasePortIfIndex{dot1dBasePort="12"} 12  - *This won't always be the 
> same number*
>
> *2. I get the MAC address and port pair from this*
>
> dot1dTpFdbPort{dot1dTpFdbAddress="11:E0:E4:66:5E:11"} 12
>
>
> *3.  I get the MAC address and IP pair from this. But 
> the ipNetToMediaIfIndex is a VLAN not a physical port. *
>
> ipNetToMediaPhysAddress{ipNetToMediaIfIndex="28",ipNetToMediaNetAddress="
> 10.10.1.33",ipNetToMediaPhysAddress="11:E0:E4:66:5E:11"} 1
>
>
>
> On Thursday, July 18, 2024 at 2:42:10 PM UTC-4 Brian Candler wrote:
>
>> > The challenge I am having is using promql to join the data so I can 
>> show the IP associated with the MAC address on the physical port. 
>>
>> Can you show some examples of the metrics you're trying to join?
>>
>> On Thursday 18 July 2024 at 18:48:35 UTC+1 Matthew Koch wrote:
>>
>>> I am working on a project to gather the MAC address and IP which is on a 
>>> specific port on a network switch. I've been able to gather this 
>>> information with the below SNMP config but the challenge is the MAC address 
>>> comes back against the physical port index and the IPs come back against 
>>> the VLANs index which is expected. The challenge I am having is using 
>>> promql to join the data so I can show the IP associated with the MAC 
>>> address on the physical port. 
>>>
>>>  walk:
>>> - 1.3.6.1.2.1.17.1.4.1
>>> - 1.3.6.1.2.1.17.4.3.1
>>> - 1.3.6.1.2.1.4.22.1
>>> - 1.3.6.1.2.1.4.35.1
>>> metrics:
>>> - name: dot1dBasePortIfIndex
>>>   oid: 1.3.6.1.2.1.17.1.4.1.2
>>>   type: gauge
>>>   help: The value of the instance of the ifIndex object, defined in 
>>> MIB-II, for
>>> the interface corresponding to this port. - 
>>> 1.3.6.1.2.1.17.1.4.1.2
>>>   indexes:
>>>   - labelname: dot1dBasePort
>>> type: gauge
>>> - name: dot1dTpFdbPort
>>>   oid: 1.3.6.1.2.1.17.4.3.1.2
>>>   type: gauge
>>>   help: Either the value '0', or the port number of the port on 
>>> which a frame
>>> having a source address equal to the value of the corresponding 
>>> instance of
>>> dot1dTpFdbAddress has been seen - 1.3.6.1.2.1.17.4.3.1.2
>>>   indexes:
>>>   - labelname: dot1dTpFdbAddress
>>> type: PhysAddress48
>>> fixed_size: 6
>>> - name: dot1dTpFdbStatus
>>>   oid: 1.3.6.1.2.1.17.4.3.1.3
>>>   type: EnumAsInfo
>>>   help: The status of this entry - 1.3.6.1.2.1.17.4.3.1.3
>>>   indexes:
>>>   - labelname: dot1dTpFdbAddress
>>> type: PhysAddress48
>>> fixed_size: 6
>>>   enum_values:
>>> 1: other
>>> 2: invalid
>>> 3: learned
>>> 4: self
>>> 5: mgmt
>>> - name: ipNetToMediaPhysAddress
>>>   oid: 1.3.6.1.2.1.4.22.1.2
>>>   type: PhysAddress48
>>>   help: ' - 1.3.6.1.2.1.4.22.1.2'
>>>   indexes:
>>>   - labelname: ipNetToMediaIfIndex
>>> type: gauge
>>>   - labelname: ipNetToMediaNetAddress
>>> type: InetAddressIPv4
>>> - name: ipNetToMediaType
>>>   oid: 1.3.6.1.2.1.4.22.1.4
>>>   type: EnumAsInfo
>>>   help: ' - 1.3.6.1.2.1.4.22.1.4'
>>>   indexes:
>>>   - labelname: ipNetToMediaIfIndex
>>> type: gauge
>>>

[prometheus-users] Re: SNMP Exporter - Gathering MAC and IP per port

2024-07-18 Thread 'Brian Candler' via Prometheus Users
> The challenge I am having is using promql to join the data so I can show 
the IP associated with the MAC address on the physical port. 

Can you show some examples of the metrics you're trying to join?

On Thursday 18 July 2024 at 18:48:35 UTC+1 Matthew Koch wrote:

> I am working on a project to gather the MAC address and IP which is on a 
> specific port on a network switch. I've been able to gather this 
> information with the below SNMP config but the challenge is the MAC address 
> comes back against the physical port index and the IPs come back against 
> the VLANs index which is expected. The challenge I am having is using 
> promql to join the data so I can show the IP associated with the MAC 
> address on the physical port. 
>
>  walk:
> - 1.3.6.1.2.1.17.1.4.1
> - 1.3.6.1.2.1.17.4.3.1
> - 1.3.6.1.2.1.4.22.1
> - 1.3.6.1.2.1.4.35.1
> metrics:
> - name: dot1dBasePortIfIndex
>   oid: 1.3.6.1.2.1.17.1.4.1.2
>   type: gauge
>   help: The value of the instance of the ifIndex object, defined in 
> MIB-II, for
> the interface corresponding to this port. - 1.3.6.1.2.1.17.1.4.1.2
>   indexes:
>   - labelname: dot1dBasePort
> type: gauge
> - name: dot1dTpFdbPort
>   oid: 1.3.6.1.2.1.17.4.3.1.2
>   type: gauge
>   help: Either the value '0', or the port number of the port on which 
> a frame
> having a source address equal to the value of the corresponding 
> instance of
> dot1dTpFdbAddress has been seen - 1.3.6.1.2.1.17.4.3.1.2
>   indexes:
>   - labelname: dot1dTpFdbAddress
> type: PhysAddress48
> fixed_size: 6
> - name: dot1dTpFdbStatus
>   oid: 1.3.6.1.2.1.17.4.3.1.3
>   type: EnumAsInfo
>   help: The status of this entry - 1.3.6.1.2.1.17.4.3.1.3
>   indexes:
>   - labelname: dot1dTpFdbAddress
> type: PhysAddress48
> fixed_size: 6
>   enum_values:
> 1: other
> 2: invalid
> 3: learned
> 4: self
> 5: mgmt
> - name: ipNetToMediaPhysAddress
>   oid: 1.3.6.1.2.1.4.22.1.2
>   type: PhysAddress48
>   help: ' - 1.3.6.1.2.1.4.22.1.2'
>   indexes:
>   - labelname: ipNetToMediaIfIndex
> type: gauge
>   - labelname: ipNetToMediaNetAddress
> type: InetAddressIPv4
> - name: ipNetToMediaType
>   oid: 1.3.6.1.2.1.4.22.1.4
>   type: EnumAsInfo
>   help: ' - 1.3.6.1.2.1.4.22.1.4'
>   indexes:
>   - labelname: ipNetToMediaIfIndex
> type: gauge
>   - labelname: ipNetToMediaNetAddress
> type: InetAddressIPv4
>   enum_values:
> 1: other
> 2: invalid
> 3: dynamic
> 4: static
> - name: ipNetToPhysicalIfIndex
>   oid: 1.3.6.1.2.1.4.35.1.1
>   type: gauge
>   help: The index value that uniquely identifies the interface to 
> which this entry
> is applicable - 1.3.6.1.2.1.4.35.1.1
>   indexes:
>   - labelname: ipNetToPhysicalIfIndex
> type: gauge
>   - labelname: ipNetToPhysicalNetAddress
> type: InetAddress
> - name: ipNetToPhysicalNetAddressType
>   oid: 1.3.6.1.2.1.4.35.1.2
>   type: EnumAsInfo
>   help: The type of ipNetToPhysicalNetAddress. - 1.3.6.1.2.1.4.35.1.2
>   indexes:
>   - labelname: ipNetToPhysicalIfIndex
> type: gauge
>   - labelname: ipNetToPhysicalNetAddress
> type: InetAddress
>   enum_values:
> 0: unknown
> 1: ipv4
> 2: ipv6
> 3: ipv4z
> 4: ipv6z
> 16: dns
> - name: ipNetToPhysicalNetAddress
>   oid: 1.3.6.1.2.1.4.35.1.3
>   type: InetAddress
>   help: The IP Address corresponding to the media-dependent `physical' 
> address
> - 1.3.6.1.2.1.4.35.1.3
>   indexes:
>   - labelname: ipNetToPhysicalIfIndex
> type: gauge
>   - labelname: ipNetToPhysicalNetAddress
> type: InetAddress
> - name: ipNetToPhysicalPhysAddress
>   oid: 1.3.6.1.2.1.4.35.1.4
>   type: PhysAddress48
>   help: The media-dependent `physical' address - 1.3.6.1.2.1.4.35.1.4
>   indexes:
>   - labelname: ipNetToPhysicalIfIndex
> type: gauge
>   - labelname: ipNetToPhysicalNetAddress
> type: InetAddress
> - name: ipNetToPhysicalLastUpdated
>   oid: 1.3.6.1.2.1.4.35.1.5
>   type: gauge
>   help: The value of sysUpTime at the time this entry was last updated 
> - 1.3.6.1.2.1.4.35.1.5
>   indexes:
>   - labelname: ipNetToPhysicalIfIndex
> type: gauge
>   - labelname: ipNetToPhysicalNetAddress
> type: InetAddress
> - name: ipNetToPhysicalType
>   oid: 1.3.6.1.2.1.4.35.1.6
>   type: EnumAsInfo
>   help: The type of mapping - 1.3.6.1.2.1.4.35.1.6
>   indexes:
>   - labelname: ipNetToPhysicalIfIndex
> type: gauge
>   - labelname: ipNetToPhysicalNetAddress
> type: InetAddress
>   enum_values:
> 1: other

[prometheus-users] Re: Regexp match in template

2024-06-28 Thread 'Brian Candler' via Prometheus Users
See the template functions listed here:
https://prometheus.io/docs/prometheus/latest/configuration/template_reference/#strings

There is one called "match" which matches regular expressions. (Note that 
there is not one called "regexMatch")
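
For example, in a rule annotation something like this sketch should work (the label name here is made up for illustration):

{{ if match "^prod" $labels.instance }}production{{ else }}non-production{{ end }}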

You should also be able to use the go template global functions:
https://pkg.go.dev/text/template#hdr-Functions

If this is for alerts, it might be simpler to use routing rules, with two 
different alert receivers.

On Friday 28 June 2024 at 10:49:00 UTC+1 fiala...@gmail.com wrote:

> Hi,
>
> is it possible to make a regexp match in annotation?
>
> I need to check, if label contains specific string.
>
> function "regexMatch" not definedThank you.
>



Re: [prometheus-users] Uptime SLA in percentage for metric

2024-06-24 Thread 'Brian Candler' via Prometheus Users
A PromQL query like  "mymetric == bool 2" will return 1 when the value is 
2, and 0 otherwise.

You'll likely need to run this inside a subquery if you're doing time range 
aggregation over it. But if Grafana is doing the summarization that might 
not be necessary.
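
As a rough sketch (metric name illustrative, window hard-coded to 30 days; in Grafana you could substitute $__range):

1 - avg_over_time((mymetric == bool 2)[30d:1m])

gives the fraction of time over the last 30 days that the value was not 2, i.e. the availability as a fraction between 0 and 1.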

On Monday 24 June 2024 at 13:38:03 UTC+1 Ben Kochie wrote:

> IMO you need to fix your service metrics. Prometheus best practice is to 
> follow the pattern of probe_success. Boolean values are far easier to 
> handle.
>
> On Mon, Jun 24, 2024 at 2:36 PM Raúl Lopez  wrote:
>
>> Hello,
>> I need to know in percentage the time my service has been available in 
>> the last month, last week, etc (dynamic value).
>> The metric in question can return the values; 0, 1 and 2.
>>
>> 0 -> OK
>> 1 -> Warning
>> 2 -> KO
>>
>> The idea I have is to disregard value 1 and only treat my service as KO 
>> when it has returned value 2. I am trying to build in a Grafana 
>> visualisation for the SLA in percentage that my service has been available 
>> according to the time range that the user specifies in the dashboard.
>>
>> I've been doing some research and it seems that for this kind of cases it 
>> is not as simple as for example for those endpoints where Blackbox is used 
>> for example (as I cannot use probe_success).
>>
>> Could someone help me?
>> Thank you in advance.
>>
>> Regards.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to prometheus-use...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/a26d17b9-b507-413a-89d7-f95ca49ef725n%40googlegroups.com
>>  
>> 
>> .
>>
>



[prometheus-users] Re: Prometheus LTS EOL

2024-06-23 Thread 'Brian Candler' via Prometheus Users
The release cycle page <https://prometheus.io/docs/introduction/release-cycle/> has been updated: 
it shows 2.53 will be LTS supported from 2024-07-01 to 2025-07-31. (2.53.0 
was actually released a few days ago).

It would be nice to have a summary of key differences from 2.45 to 2.53 
though.

On Friday 14 June 2024 at 11:20:37 UTC+1 pentester 0006 wrote:

> Hi Team,
>
> From Long-Term Support | Prometheus 
>  Prometheus 2.45 
> is going EOL by July 31st 2024. The new major non LTS releases are also 
> going to be EOL in June. Will there be new LTS release from Prometheus ? if 
> yes when ?
>
> A community case has been raised for same. 
> https://discuss.prometheus.io/t/prometheus-lts-end-of-life/2301
>
>



[prometheus-users] Re: node_exporter CPU underutilized alert

2024-06-23 Thread 'Brian Candler' via Prometheus Users
node_cpu_seconds_total gives you a separate metric for each CPU, so with an 
8 vCPU VM you'll get 8 alerts (if they're all under 20%)

You're saying that you're happy with all these alerts, but want to suppress 
them where the VM has only one vCPU?  In that case:

count by (instance) (node_cpu_seconds_total{mode="idle"})

will give you the number of CPUs per instance, and hence you can modify 
your alert to something like

expr: ( <your existing expression> unless on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) == 1 )

which would give something like:

  (
  (100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20)
unless on (instance)
  count by (instance) (node_cpu_seconds_total{mode="idle"}) == 1
  )
* on (instance) group_left (nodename)
  node_uname_info{nodename=~".+"}

Aside 1: personally I like to leave percentages as fractions. You can 
change these to percentages in alerts using humanizePercentage.

Aside 2: It might be better to aggregate all the CPU usage for an 
instance. Otherwise, if you have 8 mostly-idle CPUs, but each CPU in turn 
has a short burst of activity, you'll get no alerts. To do this, you 
should use sum over rate, not rate over sum.
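
A sketch of that aggregated form (untested, same 20% threshold as before):

  (
      100 * (
          1 - sum by (instance) (rate(node_cpu_seconds_total{mode="idle"}[30m]))
              / count by (instance) (node_cpu_seconds_total{mode="idle"})
      ) < 20
  )
* on (instance) group_left (nodename)
  node_uname_info{nodename=~".+"}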

On Saturday 22 June 2024 at 17:01:58 UTC+1 mel wrote:

> I have this CPU underutilized alert for virtual machines.
>
> expr: '(100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20) 
> * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}'
>
> for: 1w
>
> The problem is that I get alerted even if the CPU is 1 so I cannot reduce 
> it further. I want the alert to fire only number of CPUs > 1.
>



[prometheus-users] Re: Alertmanager Configuration for Routing Alerts via Telegram

2024-06-20 Thread 'Brian Candler' via Prometheus Users
If you put multiple matchers, they must all be true to match ("AND" 
semantics). So when you wrote

   - matchers:
- alertname = "SystemdUnitDown"
- alertname = "InstanceDown"  

it means alertname must be simultaneously equal to both those values, which 
can never be true.

One solution is to rewrite your matchers, such as

   - matchers:
- alertname =~ "SystemdUnitDown|InstanceDown"

Personally though I find it easier to structure my rules the other way 
round: when a condition matches, list all the receivers who should receive 
this alert. You can do this using nested routing rules ("routes" instead of 
"receiver").  For example, for the InstanceDown alert:

- matchers:
- alertname = "InstanceDown"
  routes: [ { receiver: Team1, continue: true }, { receiver: Team2 } ]
  #continue: true

The magic here is that the nested routes don't have any matchers, so they 
always match and deliver to the receiver.

You then don't need the top-level "continue: True" either (I've shown it 
commented out), since once this condition matches, you've finished all the 
processing for InstanceDown and you don't need to test any subsequent rules.
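
Putting that together, a minimal sketch of the routing section (untested; receiver names as in your config, and SystemdUnitDown simply falls through to the default receiver):

route:
  group_by: ["instance"]
  receiver: 'Team2'
  routes:
  - matchers:
    - alertname = "InstanceDown"
    routes:
    - receiver: 'Team1'
      continue: true
    - receiver: 'Team2'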

On Thursday 20 June 2024 at 14:44:57 UTC+1 Alexander Varejão wrote:

> Hi,
>
> I need help again :(
>
> I am trying to configure my Alertmanager to send separate alerts without 
> success.
>
> Basically, I need to trigger two alerts for two different groups via 
> Telegram.
>
> So, I created two alerts (Alert1 and Alert2) and two teams (Team1 and 
> Team2).
>
> Team1 should only receive Alert1, while Team2 should receive both alerts 
> (Alert1 and Alert2).
>
> However, only Team 2 is receiving the alerts. I don't know what is wrong. 
> Could someone help me find the error in my configuration?
>
> [...]
> route:
>   group_by: ["instance"]
>   receiver: 'Team2'
>   routes:
> - matchers:
> - alertname = "InstanceDown"
>   receiver: 'Team2'
>   continue: true
> - matchers:
> - alertname = "SystemdUnitDown"
> - alertname = "InstanceDown"  
>   receiver: 'Team1'
>   continue: true
>
> receivers:
> - name: 'Team1'
>   email_configs:
>- to: 'email@domain'
>  send_resolved: true
>  html: ''
>  text: "Summary: {{ .CommonAnnotations.summary }}\ndescription: {{ 
> .CommonAnnotations.description }}\n\n"
>   telegram_configs:
>- api_url: 'https://api.telegram.org'
>  chat_id: -ID_HERE 
>  bot_token: -TOKEN_HERE 
> - name: 'Team2'
>   telegram_configs:
> - api_url: 'https://api.telegram.org'
>   chat_id: -ID_HERE
>   bot_token: -SAME_TOKEN_HERE
> [...]
>
> Tanks
>



[prometheus-users] Re: ZTE DSL modem statistics

2024-06-20 Thread 'Brian Candler' via Prometheus Users
Talk to the vendor, or a user group for that device.

Your starting point would be firstly to see if it supports SNMP, and if so 
to get hold of a copy of the MIB. If not, then you'd have to look at 
whether it exposes that information in any other way, and then write your 
own exporter to extract it from the device.

On Thursday 20 June 2024 at 09:48:54 UTC+1 Nexgn Technologies wrote:

> Dear Experts,
>
> I hope this message finds you well.
>
> I am seeking assistance to retrieve various statistics from my ZTE DSL 
> Modem. Specifically, I would like to obtain data on the following:
>
>- WAN traffic
>- LAN traffic per port
>- LAN side WiFi Traffic
>- Modem CPU usage
>- Modem RAM usage
>- Modem Temperature
>
> My Modem Model is ZTE ZXHN H168N v3.5, and the Serial Number is 
> ZTELRUVN8507364.
>
> Could anyone guide me on how to access and extract these statistics from 
> my modem?
>
> Thank you in advance for your help.
>
> Best regards,
>



[prometheus-users] Re: job label missing from discoveredLabels (prometheus v2.42.0)

2024-05-31 Thread 'Brian Candler' via Prometheus Users
I don't see this with v2.45.5, and I'm also concerned about why "app": 
"another-testapp" occurs in one of your discoveredLabels.

I suggest you try that, and/or the latest v2.52.1 (you can of course set 
up a completely separate instance but point it to the same service 
discovery source) and see if you can replicate the issue. Also check the 
changelogs and git history to see if there's anything relevant there.

On Friday 31 May 2024 at 20:22:00 UTC+1 Vu Nguyen wrote:

> Hi all,
>
> We have a test code that reads target metadata info and job label name 
> from `discoveredLabels` list. That list is included in the response we get 
> from '/api/v1/targets' endpoint.
>
> During the test, we noticed that the response from target endpoint is 
> inconsistent: the job label sometimes is missing from `discoveredLabels` 
> for a few discovered targets. 
>
> Below output is what I extracted from our deployment: the first target 
> have the job label in its discovered labels but missing in the second 
> target, on the same config job.
>
> {
>   "status" : "success",
>   "data" : {
> "activeTargets" : [ {
>   "discoveredLabels" : {
> "__meta_kubernetes_pod_phase" : "Running",
> "__meta_kubernetes_pod_ready" : "true",
> "__meta_kubernetes_pod_uid" : 
> "a7b4cce2-1be7-4df9-a032-c7a51bb655db",
> "__metrics_path__" : "/metrics",
> "__scheme__" : "http",
> "__scrape_interval__" : "15s",
> "__scrape_timeout__" : "10s",
> "job" : "kubernetes-pods"
>   },
>   "labels" : {
> "app" : "testapp",
> "job" : "kubernetes-pods",
> "kubernetes_namespace" : "spider1"
>   },
>   "health" : "down",
>   "scrapeInterval" : "15s",
>   "scrapeTimeout" : "10s"
> }, {
>   "discoveredLabels" : {
>  "__meta_kubernetes_pod_phase" : "Running",
> "__meta_kubernetes_pod_ready" : "true",
> "__meta_kubernetes_pod_uid" : 
> "85dfeac6-985d-479e-8459-fc20ae8dcec3",
> "__metrics_path__" : "/metrics",
> "__scheme__" : "http",
> "__scrape_interval__" : "15s",
> "__scrape_timeout__" : "10s",
> "app" : "another-testapp"
>   },
>   "labels" : {
> "app" : "another-testapp",
> "job" : "kubernetes-pods",
> "kubernetes_namespace" : "spider1"
>   },
>   "scrapePool" : "kubernetes-pods",
>   "health" : "down",
>   "scrapeInterval" : "15s",
>   "scrapeTimeout" : "10s"
> } ]
>   }
> }
>
> Could you please help us understand why we have this inconsistency? Is 
> that correct way to get job level value from `discoveredLabels` set?
>
> Thanks,
> Vu
>



Re: [prometheus-users] how to get count of no.of instance

2024-05-28 Thread 'Brian Candler' via Prometheus Users
Those mangled screenshots are no use. What I would need to see are the 
actual results of the two queries, from the Prometheus web interface (not 
Grafana), in plain text: e.g.

foo{bar="baz",qux="abc"} 42.0

...with the *complete* set of labels, not expurgated. That's what's needed 
to formulate the join query.

On Tuesday 28 May 2024 at 13:23:21 UTC+1 Sameer Modak wrote:

> Hello Brian,
>
> Actually tried as you suggested earlier but when i execute it says no data 
> . So below are the individual query ss , so if i ran individually they give 
> the output
>
> On Sunday, May 26, 2024 at 1:24:10 PM UTC+5:30 Brian Candler wrote:
>
>> The labels for the two sides of the division need to match exactly.
>>
>> If they match 1:1 except for additional labels, then you can use
>> xxx / on (foo,bar) yyy   # foo,bar are the matching labels
>> or
>> xxx / ignoring (baz,qux) zzz   # baz,qux are the labels to ignore
>>
>> If they match N:1 then you need to use group_left or group_right.
>>
>> If you show the results of the two halves of the query separately then we 
>> can be more specific. That is:
>>
>> sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~"$consumergroup",topic=~"$topic"})
>>  
>> by (consumergroup, topic) 
>>
>> count(up{job="prometheus.scrape.kafka_exporter"})
>>
>> On Sunday 26 May 2024 at 08:28:10 UTC+1 Sameer Modak wrote:
>>
>>> I tried the same i m not getting any data post adding below 
>>>
>>> sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~
>>> "$consumergroup",topic=~"$topic"}) by (consumergroup, topic) / count(up{
>>> job="prometheus.scrape.kafka_exporter"})
>>>
>>> On Saturday, May 25, 2024 at 11:53:44 AM UTC+5:30 Ben Kochie wrote:
>>>
>>>> You can use the `up` metric
>>>>
>>>> sum(...)
>>>> /
>>>> count(up{job="kafka"})
>>>>
>>>> On Fri, May 24, 2024 at 5:53 PM Sameer Modak  
>>>> wrote:
>>>>
>>>>> Hello Team,
>>>>>
>>>>> I want to know the no of instance data sending to prometheus. How do i 
>>>>> formulate the query .
>>>>>
>>>>>
>>>>> Basically i have below working query but issues is we have 6  
>>>>> instances hence its summing value of all instances. Instead we just need 
>>>>> value from one instance.
>>>>> sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~
>>>>> "$consumergroup",topic=~"$topic"})by (consumergroup, topic)
>>>>> I was thinking to divide it / 6 but it has to be variabalise on runtime
>>>>> if 3 exporters are running then it value/3 to get exact value.
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "Prometheus Users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to prometheus-use...@googlegroups.com.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/prometheus-users/fa5f309f-779f-45f9-b5a0-430b75ff0884n%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/prometheus-users/fa5f309f-779f-45f9-b5a0-430b75ff0884n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>>>>



Re: [prometheus-users] Pod with Pending phase is in endpoints scraping targets (Prometheus 2.46.0)

2024-05-27 Thread 'Brian Candler' via Prometheus Users
Have you looked in the changelog for Prometheus? I found:

## 2.51.0 / 2024-03-18

* [BUGFIX] Kubernetes SD: Pod status changes were not discovered by 
Endpoints service discovery #13337

*=> fixes #11305, which looks similar to your problem*

## 2.50.0 / 2024-02-22

* [ENHANCEMENT] Kubernetes SD: Check preconditions earlier and avoid 
unnecessary checks or iterations in kube_sd. #13408

I'd say it's worth trying the latest release, 2.51.2.

On Monday 27 May 2024 at 12:21:01 UTC+1 Vu Nguyen wrote:

> Hi,
>
> Do you have a response to this thread? Has anyone ever encountered the 
> issue?
>
> Regards,
> Vu
>
> On Mon, May 20, 2024 at 2:56 PM Vu Nguyen  wrote:
>
>> Hi,
>>
>> With endpoints scraping role, the job should scrape POD endpoint that is 
>> up and running. That is what we are expected. 
>>
>> I think by concept, K8S does not create an endpoint if Pod is in other 
>> phases like Pending, Failed, etc.
>>
>> In our environments, Prometheus 2.46.0 on K8S v1.28.2, we currently have 
>> issues: 
>> 1) POD is up and running from `kubectl get pod`, but from Prometheus 
>> discovery page, it shows:
>> __meta_kubernetes_pod_phase="Pending" 
>> __meta_kubernetes_pod_ready="false"  
>>
>> 2) The the endpoints job discover POD targets with pod phase=`Pending`.
>>
>> Those issues disappear after we restart Prometheus pod.  
>>
>> I am not sure if 1) that is K8S that does not trigger event after POD 
>> phase changes so Prometheus is not able to refresh its endpoints discovery 
>> or 2) it is a known problem of Prometheus? 
>>
>> And do you think it is worth to add the following relabeling rule to 
>> endpoints job role?
>>
>>   - source_labels: [ __meta_kubernetes_pod_phase ]
>> regex: Pending|Succeeded|Failed|Completed
>> action: drop
>>
>> Thanks, Vu 
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Prometheus Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to prometheus-use...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/prometheus-users/c0f97ed7-1421-4c7c-a57d-2d301bb12418n%40googlegroups.com
>>  
>> 
>> .
>>
>



Re: [prometheus-users] how to get count of no.of instance

2024-05-26 Thread 'Brian Candler' via Prometheus Users
The labels for the two sides of the division need to match exactly.

If they match 1:1 except for additional labels, then you can use
xxx / on (foo,bar) yyy   # foo,bar are the matching labels
or
xxx / ignoring (baz,qux) zzz   # baz,qux are the labels to ignore

If they match N:1 then you need to use group_left or group_right.

If you show the results of the two halves of the query separately then we 
can be more specific. That is:

sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~"$consumergroup",topic=~"$topic"})
 
by (consumergroup, topic) 

count(up{job="prometheus.scrape.kafka_exporter"})
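
In the simplest case, since count() with no grouping returns a single series with no labels, a sketch like this may be all you need (untested):

sum by (consumergroup, topic) (kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~"$consumergroup",topic=~"$topic"})
  / scalar(count(up{job="prometheus.scrape.kafka_exporter"}))

scalar() turns the single-element result into a scalar, so no label matching is needed for the division.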

On Sunday 26 May 2024 at 08:28:10 UTC+1 Sameer Modak wrote:

> I tried the same i m not getting any data post adding below 
>
> sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~
> "$consumergroup",topic=~"$topic"}) by (consumergroup, topic) / count(up{
> job="prometheus.scrape.kafka_exporter"})
>
> On Saturday, May 25, 2024 at 11:53:44 AM UTC+5:30 Ben Kochie wrote:
>
>> You can use the `up` metric
>>
>> sum(...)
>> /
>> count(up{job="kafka"})
>>
>> On Fri, May 24, 2024 at 5:53 PM Sameer Modak  
>> wrote:
>>
>>> Hello Team,
>>>
>>> I want to know the no of instance data sending to prometheus. How do i 
>>> formulate the query .
>>>
>>>
>>> Basically i have below working query but issues is we have 6  instances 
>>> hence its summing value of all instances. Instead we just need value from 
>>> one instance.
>>> sum(kafka_consumergroup_lag{cluster=~"$cluster",consumergroup=~
>>> "$consumergroup",topic=~"$topic"})by (consumergroup, topic)
>>> I was thinking to divide it / 6 but it has to be variabalise on runtime
>>> if 3 exporters are running then it value/3 to get exact value.
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "Prometheus Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to prometheus-use...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/prometheus-users/fa5f309f-779f-45f9-b5a0-430b75ff0884n%40googlegroups.com
>>>  
>>> 
>>> .
>>>
>>



[prometheus-users] Re: Regular Expression and Label Action Support to match two or more source labels

2024-05-22 Thread 'Brian Candler' via Prometheus Users
I would assume that the reason this feature was added was because there 
wasn't a feasible alternative way to do it.

I suggest you upgrade to v2.45.5 which is the current "Long Term Stable" 
release.  The previous LTS release (v2.37) went end-of-life 
<https://prometheus.io/docs/introduction/release-cycle/> in July 2023, so 
it seems you're very likely running something unsupported at the moment.

On Wednesday 22 May 2024 at 11:52:03 UTC+1 tejaswini vadlamudi wrote:

> Sure Brian, I was aware of this.
> This config comes with a software change, but is there any possibility or 
> workaround in the old (< 2.41) Prometheus releases on this topic?
>
> /Teja
>
> On Wednesday, May 22, 2024 at 12:01:31 PM UTC+2 Brian Candler wrote:
>
>> Yes, there are similar relabel actions "keepequal" and "dropequal":
>>
>> https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
>>
>> These were added in v2.41.0 
>> <https://github.com/prometheus/prometheus/releases/v2.41.0> / 2022-12-20
>> https://github.com/prometheus/prometheus/pull/11564
>>
>> They behave slightly differently from VM: in Prometheus, the 
>> concatenation of source_labels is compared with target_label.
>>
>> On Tuesday 21 May 2024 at 15:43:05 UTC+1 tejaswini vadlamudi wrote:
>>
>>> The below relabeling rule from Victoria Metrics is useful for matching 
>>> accurate ports and dropping unwanted targets.- action: 
>>> keep_if_equal
>>>   source_labels: 
>>> [__meta_kubernetes_service_annotation_prometheus_io_port, 
>>> __meta_kubernetes_pod_container_port_number]
>>> Does anyone know how we can compare two labels using Prometheus 
>>> Relabeling rules?
>>>
>>> Based on my analysis, Prometheus doesn't support regex patterns on 
>>> 1. backreferences like \1 
>>> 2. lookaheads or lookbehinds
>>>
>>



[prometheus-users] Re: Regular Expression and Label Action Support to match two or more source labels

2024-05-22 Thread 'Brian Candler' via Prometheus Users
Yes, there are similar relabel actions "keepequal" and "dropequal":
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

These were added in v2.41.0 
<https://github.com/prometheus/prometheus/releases/v2.41.0> / 2022-12-20
https://github.com/prometheus/prometheus/pull/11564

They behave slightly differently from VM: in Prometheus, the concatenation 
of source_labels is compared with target_label.
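
As a sketch, the rough Prometheus equivalent of that VM rule would be something like:

  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_port]
    target_label: __meta_kubernetes_pod_container_port_number
    action: keepequal

i.e. keep the target only when the annotation value equals the container port number.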

On Tuesday 21 May 2024 at 15:43:05 UTC+1 tejaswini vadlamudi wrote:

> The below relabeling rule from Victoria Metrics is useful for matching 
> accurate ports and dropping unwanted targets.- action: 
> keep_if_equal
>   source_labels: 
> [__meta_kubernetes_service_annotation_prometheus_io_port, 
> __meta_kubernetes_pod_container_port_number]
> Does anyone know how we can compare two labels using Prometheus Relabeling 
> rules?
>
> Based on my analysis, Prometheus doesn't support regex patterns on 
> 1. backreferences like \1 
> 2. lookaheads or lookbehinds
>



[prometheus-users] Re: All Samples Lost when prometheus server return 500 to prometheus agent

2024-05-19 Thread 'Brian Candler' via Prometheus Users
ding remote write" err="label name 
> \"resourceId\" is not unique: invalid sample"
> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:134 level=error 
> component=web msg="unknown error from remote write" err="label name 
> \"service\" is not unique: invalid sample" 
> series="{__name__=\"rest_client_request_size_bytes_bucket\", 
> clusterName=\"clustertest150\", clusterRegion=\"region0\", 
> clusterZone=\"zone1\", container=\"kube-scheduler\", endpoint=\"https\", 
> host=\"127.0.0.1:6443\", instance=\"10.253.58.236:10259\", 
> job=\"scheduler\", le=\"262144\", namespace=\"kube-scheduler\", 
> pod=\"kube-scheduler-20230428-wangbo-dev14\", 
> prometheus=\"monitoring/agent-0\", 
> prometheus_replica=\"prometheus-agent-0-0\", resourceType=\"NETWORK-HOST\", 
> service=\"scheduler\", service=\"net-monitor-vnet-ovs\", verb=\"POST\"}" 
> timestamp=1715349164522
> ts=2024-05-11T08:42:26.967Z caller=write_handler.go:76 level=error 
> component=web msg="Error appending remote write" err="label name 
> \"service\" is not unique: invalid sample"
> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:134 level=error 
> component=web msg="unknown error from remote write" err="label name 
> \"prometheus_replica\" is not unique: invalid sample" 
> series="{__name__=\"workqueue_work_duration_seconds_sum\", 
> clusterName=\"clustertest150\", clusterRegion=\"region0\", 
> clusterZone=\"zone1\", endpoint=\"https\", instance=\"21.100.10.52:8443\", 
> job=\"metrics\", name=\"ResourceSyncController\", 
> namespace=\"service-ca-operator\", 
> pod=\"service-ca-operator-645cfdbfb6-rjr4z\", 
> prometheus=\"monitoring/agent-0\", 
> prometheus_replica=\"prometheus-agent-0-0\", 
> prometheus_replica=\"prometheus-agent-0-0\", service=\"metrics\"}" 
> timestamp=1715349271085
> ts=2024-05-11T08:42:27.091Z caller=write_handler.go:76 level=error 
> component=web msg="Error appending remote write" err="label name 
> \"prometheus_replica\" is not unique: invalid sample"
>
> Currently we dont' know why there are duplicated labels. But when the 
> server encounters duplicated labels, it returns 500. Then the agent keeps 
> retrying, which means new samples cannot be handled.
>
> we set external_labels in prometheus-agent configs:
> global:
>   evaluation_interval: 30s
>   scrape_interval: 5m
>   scrape_timeout: 1m
>   external_labels:
> clusterName: clustertest150
> clusterRegion: region0
> clusterZone: zone1
> prometheus: ccos-monitoring/agent-0
> prometheus_replica: prometheus-agent-0-0
>   keep_dropped_targets: 1
>
> and the remote write config:
> remote_write:
> - url: https://prometheus-k8s-0.monitoring:9091/api/v1/write
>   remote_timeout: 30s
>   name: prometheus-k8s-0
>   write_relabel_configs:
>   - target_label: __tmp_cluster_id__
> replacement: 713c30cb-81c3-411d-b4dc-0c775a0f9564
> action: replace
>   - regex: __tmp_cluster_id__
> action: labeldrop
>   bearer_token: XDFSDF...
>   tls_config:
> insecure_skip_verify: true
>   queue_config:
> capacity: 1
> min_shards: 1
> max_shards: 500
> max_samples_per_send: 2000
> batch_send_deadline: 10s
> min_backoff: 30ms
> max_backoff: 5s
> sample_age_limit: 5m
>
> > You are saying that you would prefer the agent to throw away data, 
> rather than hold onto the data and try again later when it may succeed. In 
> this situation, retrying is normally the correct thing to do.
> Yes, retry is the normal solution. But there should be maximum number of 
> retries. We notice that prometheus agent sets the retry nubmers to the 
> request header, but it seems the request header is not used by the server.
>
> prometheus-agent sets the retry numbers to request header:
>
> https://github.com/prometheus/prometheus/blob/release-2.51/storage/remote/client.go#L214
>
> Besides, if some samples is incorrect and others are correct in the same 
> request, why don't prometheus server save the correct part and drop the 
> wrong part? It is more complicated as retry should be considered, but is it 
> possible to save partial data and return 206 when the maximum number of 
> retries is reached? 
>
> And should prometheus server log samples for all kinds of error?
>
> https://github.com/prome

[prometheus-users] Re: hundreds of containers, how to alert when a certain container is down?

2024-05-18 Thread 'Brian Candler' via Prometheus Users
Monitoring for a metric vanishing is not a very good way to do alerting. 
Metrics hang around for the "staleness" interval, which by default is 5 
minutes. Ideally, you should monitor all the things you care about 
explicitly, get a success metric like "up" (1 = working, 0 = not working) 
and then alert on "up == 0" or equivalent. This is much more flexible and 
timely.

Having said that, there's a quick and dirty hack that might be good enough 
for you:

expr: container_memory_usage_bytes offset 10m unless 
container_memory_usage_bytes

This will give you an alert if any metric container_memory_usage_bytes 
existed 10 minutes ago but does not exist now. The alert will resolve 
itself after 10 minutes.

The result of this expression is a vector, so it can alert on multiple 
containers at once; each element of the vector will have the container name 
in the label ("name")
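
Wrapped up as a rule, a rough sketch along the lines of your existing one (rule name and labels illustrative):

- name: containers
  rules:
  - alert: container_down
    expr: container_memory_usage_bytes offset 10m unless container_memory_usage_bytes
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.name }} down"
      description: "Container {{ $labels.name }} had metrics 10 minutes ago but has none now."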

On Saturday 18 May 2024 at 19:50:48 UTC+1 Sleep Man wrote:

> I have a large number of containers. I learned that the following 
> configuration can monitor a single container down. How to configure it to 
> monitor all containers and send the container name once a container is down.
>
>
> - name: containers
>   rules:
>   - alert: jenkins_down
> expr: absent(container_memory_usage_bytes{name="jenkins"})
> for: 30s
> labels:
>   severity: critical
> annotations:
>   summary: "Jenkins down"
>   description: "Jenkins container is down for more than 30 seconds."
>



[prometheus-users] Re: Alertmanager frequently sending erroneous resolve notifications

2024-05-18 Thread 'Brian Candler' via Prometheus Users
> What can be done?

Perhaps the alert condition resolved very briefly. The solution with modern 
versions of prometheus (v2.42.0 or later) is to do this:

for: 2d
keep_firing_for: 10m

The alert won't be resolved unless it has been *continuously* absent for 10 
minutes. (Of course, this means your "resolved" notifications will be 
delayed by 10 minutes - but that's basically the whole point, don't send 
them until you're sure they're not going to retrigger)

The other alternative is simply to turn off resolved notifications 
entirely. This approach sounds odd but has a lot to recommend it:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
https://blog.cloudflare.com/alerts-observability

The point is that if a problem occurred which was serious enough to alert 
on, then it requires investigation before the case can be "closed": either 
there's an underlying problem, or if it was a false positive then the alert 
condition needs tuning. Sending a resolved message encourages laziness 
("oh, it fixed itself, no further work required").  Also, turning off 
resolved messages instantly reduces your notifications by 50%.

On Saturday 18 May 2024 at 19:50:32 UTC+1 Sarah Dundras wrote:

> Hi, this problem is driving me mad: 
>
> I am monitoring backups that log their backup results to a textfile. It is 
> being picked up and all is well, also the alert are ok, BUT! Alertmanager 
> frequently sends out odd "resolved" notifications although the firing 
> status never changed! 
>
> Here's such an alert rule that does this: 
>
> - alert: Restic Prune Freshness
> expr: restic_prune_status{uptodate!="1"} and 
> restic_prune_status{alerts!="0"}
> for: 2d
> labels:
> topic: backup
> freshness: outdated
> job: "{{ $labels.restic_backup }}"
> server: "{{ $labels.server }}"
> product: veeam
> annotations:
> description: "Restic Prune for '{{ $labels.backup_name }}' on host '{{ 
> $labels.server_name }}' is not up-to-date (too old)"
> host_url: "
> https://backups.example.com/d/3be21566-3d15-4238-a4c5-508b059dccec/restic?orgId=2&var-server_name={{
>  
> $labels.server_name }}&var-result=0&var-backup_name=All"
> service_url: "
> https://backups.example.com/d/3be21566-3d15-4238-a4c5-508b059dccec/restic?orgId=2&var-server_name=All&var-result=0&var-backup_name={{
>  
> $labels.backup_name }}"
> service: "{{ $labels.job_name }}" 
>
> What can be done? 
>



[prometheus-users] Re: All Samples Lost when prometheus server return 500 to prometheus agent

2024-05-17 Thread 'Brian Candler' via Prometheus Users
It's difficult to make sense of what you're saying. Without seeing logs 
from both the agent and the server while this problem was occurring (e.g. 
`journalctl -eu prometheus`), it's hard to know what was really happening. 
Also you need to say what exact versions of prometheus and the agent were 
running.

The fundamental issue here is, why should restarting the *agent* cause the 
prometheus *server* to stop returning 500 errors?

> So my question is why 5xx from the promtheus server is considered 
Recoverable?

It is by definition of the HTTP protocol: 
https://datatracker.ietf.org/doc/html/rfc2616#section-10.5

Actually it depends on exactly which 5xx error code you're talking about, 
but common 500 and 503 errors are generally transient, meaning there was a 
problem at the server and the request may succeed if tried again later.  If 
the prometheus server wanted to tell the client that the request was 
invalid and could never possibly succeed, then it would return a 4xx error.

> And I believe there should be a way to exit the loop, for example a 
maximum times to  retry.

You are saying that you would prefer the agent to throw away data, rather 
than hold onto the data and try again later when it may succeed. In this 
situation, retrying is normally the correct thing to do.

You may have come across a bug where a *particular* piece of data being 
sent by the agent was causing a *particular* version of prometheus to fail 
with a 5xx internal error every time. The logs should make it clear if this 
was happening.

On Friday 17 May 2024 at 10:02:49 UTC+1 koly li wrote:

> Hello all,
>
> Recently we found that our samples are all lost. After some investigation, 
> we found:
> 1, we are using prometheus agent to send all data to prometheus server by 
> remote write
> 2, the agent sample sending code is in storage\remote\queue_manager.go, 
> the function is sendWriteRequestWithBackoff()
> 3, inside the function, if attempt(the function where request is made to 
> prometheus server) function returns an Recoverable Error, then it will 
> retry sending the request
> 4, when a Recoverable error is returned? one scenario is the prometheus 
> server returned 5xx error
> 5, I think not every 5xx error is recoverable, and there is no other way 
> to exit the for loop in sendWriteRequestWithBackoff(). The agent keeps 
> retrying but every time it receives an 5xx from the server. so we lost all 
> samples for hours until we restart the agent
>
> So my question is why 5xx from the promtheus server is considered 
> Recoverable? And I believe there should be a way to exit the loop, for 
> example a maximum times to  retry.
>
> It seems that the agent mode is not mature enough to work in production.
>



[prometheus-users] Re: what insecure_skip_verify will do

2024-05-16 Thread 'Brian Candler' via Prometheus Users
Then you did something wrong in your config, but you'll need to show the 
config if you want help fixing it.

It also depends on what you're talking to: is this a scrape job talking to 
an exporter? Is this service discovery? Something else?
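
If it's a scrape job, the TLS options belong under tls_config inside that job -- a sketch (job name, target and CA path are illustrative):

scrape_configs:
- job_name: 'my_exporter'
  scheme: https
  tls_config:
    insecure_skip_verify: true
    # or better, trust the CA that signed the exporter's certificate:
    # ca_file: /etc/prometheus/my_ca.pem
  static_configs:
  - targets: ['exporter.example.com:9100']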

On Thursday 16 May 2024 at 15:12:14 UTC+1 Sameer Modak wrote:

> So here is the update i did try this insecure skip but i am still getting 
> below error,
>
>  tls: failed to verify certificate: x509: certificate signed by unknown 
> authority
>
> On Thursday, May 16, 2024 at 1:28:43 PM UTC+5:30 Brian Candler wrote:
>
>> It depends what you mean by "secure".
>>
>> It's encrypted, because you've told it to use HTTPS (HTTP + TLS). If the 
>> remote end doesn't talk TLS, then the two won't be able to establish a 
>> connection at all.
>>
>> However it is also insecure, because the client has no way of knowing 
>> whether the remote device is the one it's expecting to talk to, or an 
>> imposter. If it's an imposter, they can capture any data sent by the 
>> client, and return any data they like to the client. It's the job of a 
>> certificate to verify the identity of the server, and you've told it to 
>> skip that check.
>>
>> On Thursday 16 May 2024 at 07:33:31 UTC+1 Sameer Modak wrote:
>>
>>> Thanks a lot . Any easy way to check if traffic is secure apart from 
>>> wireshark. 
>>>
>>> On Wednesday, May 15, 2024 at 8:50:18 PM UTC+5:30 Alexander Wilke wrote:
>>>
>>>> It will skip the certificate Check. So certificate May be valid or 
>>>> invalid and is Always trusted.
>>>> Connection is still encrypted
>>>>
>>>> Sameer Modak schrieb am Mittwoch, 15. Mai 2024 um 17:04:07 UTC+2:
>>>>
>>>>> Hello Team,
>>>>>
>>>>> If i set  insecure_skip_verify: true will my data be unsecured. Will 
>>>>> it be non ssl??
>>>>>
>>>>



[prometheus-users] Re: what insecure_skip_verify will do

2024-05-16 Thread 'Brian Candler' via Prometheus Users
It depends what you mean by "secure".

It's encrypted, because you've told it to use HTTPS (HTTP + TLS). If the 
remote end doesn't talk TLS, then the two won't be able to establish a 
connection at all.

However it is also insecure, because the client has no way of knowing 
whether the remote device is the one it's expecting to talk to, or an 
imposter. If it's an imposter, they can capture any data sent by the 
client, and return any data they like to the client. It's the job of a 
certificate to verify the identity of the server, and you've told it to 
skip that check.

On Thursday 16 May 2024 at 07:33:31 UTC+1 Sameer Modak wrote:

> Thanks a lot . Any easy way to check if traffic is secure apart from 
> wireshark. 
>
> On Wednesday, May 15, 2024 at 8:50:18 PM UTC+5:30 Alexander Wilke wrote:
>
>> It will skip the certificate Check. So certificate May be valid or 
>> invalid and is Always trusted.
>> Connection is still encrypted
>>
>> Sameer Modak schrieb am Mittwoch, 15. Mai 2024 um 17:04:07 UTC+2:
>>
>>> Hello Team,
>>>
>>> If i set  insecure_skip_verify: true will my data be unsecured. Will it 
>>> be non ssl??
>>>
>>



[prometheus-users] Re: Locatinme in Alertmanager

2024-05-09 Thread 'Brian Candler' via Prometheus Users
Can you describe what the actual problem is? Are you seeing an error 
message, if so what is it?

Why are you defining a time interval of 00:00 to 23:59, which is basically 
all the time apart from 1 minute between 23:59 and 24:00? You also don't 
seem to be referencing it from a routing rule.

In any case, "Time interval" only affects what times notifications are sent 
or muted, and only if you refer to them in a routing rule. It makes no 
change to the *content* of the notification.
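
For reference, a time interval only takes effect once a route points at it, e.g. (sketch only, matcher made up):

route:
  receiver: 'email and line-notify'
  routes:
  - matchers:
    - severity = "critical"
    receiver: 'email and line-notify'
    active_time_intervals:
    - everyday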

If you want the notifications to contain local times, then you'll need to 
show the configuration of your receivers - are you doing template 
expansion? Which exact parts of the message do you want to change? Of 
course, your webhook receiver can do whatever reformatting of the messages 
you like.

On Thursday 9 May 2024 at 09:06:03 UTC+1 Tareerat Pansuntia wrote:

> Hello! all.  I set up a website monitoring project using Prometheus, 
> Blackbox Exporter, and Alertmanager to monitor and send notifications. I 
> have configured it to send alerts via Line Notify using a webhook receiver. 
> However, i currently facing an issue with setting the timezone to my 
> country's timezone. this is my config
>
> global:
>  resolve_timeout: 1m
>
> route:
>   group_by: ['alertname']
>   group_wait: 30s
>   group_interval: 10s
>   repeat_interval: 10s
>   receiver: 'email and line-notify'
>
> receivers:
> - name: 'email and line-notify'
>   email_configs:
>   - to: '...
>   webhook_configs:
> - ...
>
> time_intervals:
> - name: everyday
>   time_intervals:
>   - times:
> - start_time: "00:00"
>   end_time: "23:59"
> location: 'Asia/Bangkok'
>
> inhibit_rules:
>   - source_match:
>   severity: 'critical'
> target_match:
>   severity: 'warning'
> equal: ['alertname', 'instance']
>
> Could someone please guide me on the correct format for specifying time 
> intervals in Prometheus?
>
> Regards.
> Tareerat
>
>



[prometheus-users] Re: Does anyone have any examples of what a postgres_exporter.yml file is supposed to look like?

2024-05-08 Thread 'Brian Candler' via Prometheus Users
...then move on to configuring *prometheus* I meant.

On Wednesday 8 May 2024 at 07:11:46 UTC+1 Brian Candler wrote:

> - job_name: 'postgresql_exporter'
> static_configs:
> - targets: ['host.docker.internal:5432']
>
> One problem I can see is that you're trying to get prometheus to scrape 
> the postgres SQL port. If you go to the Prometheus web UI and look at the 
> Status > Targets menu option, I think you will see it's currently failing.  
> Or run the query "up == 0".
>
> You need to change it to scrape prometheus exporter: that is port 9187, 
> not port 5432.
>
> However, before you get around to configuring prometheus, I suggest you 
> first make sure that postgres-exporter itself is working properly, by 
> scraping it manually:
>
> curl x.x.x.x:9187/metrics
>
> (or inside the exporter container you could try curl 
> 127.0.0.1:9187/metrics, but that depends if the container has a "curl" 
> binary)
>
> Once you're able to do that (which may also require adjusting your 
> postgres_exporter.yml and/or pg_hba.conf), then move on to configuring 
> postgres.
>
> On Tuesday 7 May 2024 at 21:24:18 UTC+1 Christian Sanchez wrote:
>
>> Hello, all.
>>
>> I've started to learn Prometheus and found out about the 
>> postgres_exporter. I'd like to include metrics from the PostgreSQL server I 
>> have running on Google Cloud.
>>
>> I don't understand how to actually build out the postgres_exporter.yml 
>> file. The prometheus-community GitHub repository 
>> <https://github.com/prometheus-community/postgres_exporter> doesn't seem 
>> to have examples of building this file out.
>>
>> Maybe I am not reading the README in the repo that well, but I'd like to 
>> see some examples of the exporter file.
>>
>> When running the Prometheus container, this is where I'm expecting to see 
>> the exporter query options (see attachment)
>>
>>
>> I am running Prometheus and the Postgres Exporter through Docker Compose.
>> Here is my docker-compose.yml file:
>> version: '3'
>> services:
>> prometheus:
>> image: prom/prometheus
>> volumes:
>> - "./prometheus.yml:/etc/prometheus/prometheus.yml"
>> ports:
>> - 9090:9090
>>
>> postgres-exporter:
>> image: prometheuscommunity/postgres-exporter
>> volumes:
>> - "./postgres_exporter.yml:/postgres_exporter.yml:ro"
>> ports:
>> - 9187:9187
>> environment:
>> DATA_SOURCE_NAME: "
>> postgresql://my-user:my-pa...@host.docker.internal:5432/my-database?sslmode=disable
>> "
>>
>>
>> Here is my prometheus.yml file:
>> global:
>> scrape_interval: 45s
>>
>> scrape_configs:
>> - job_name: 'prometheus'
>> static_configs:
>> - targets: ['localhost:9090']
>>
>> - job_name: 'postgresql_exporter'
>> static_configs:
>> - targets: ['host.docker.internal:5432']
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8d5d968c-0714-4ffe-885d-911e1a243cc9n%40googlegroups.com.


[prometheus-users] Re: Does anyone have any examples of what a postgres_exporter.yml file is supposed to look like?

2024-05-07 Thread 'Brian Candler' via Prometheus Users
 - job_name: 'postgresql_exporter'
static_configs:
- targets: ['host.docker.internal:5432']

One problem I can see is that you're trying to get prometheus to scrape the 
postgres SQL port. If you go to the Prometheus web UI and look at the 
Status > Targets menu option, I think you will see it's currently failing.  
Or run the query "up == 0".

You need to change it to scrape prometheus exporter: that is port 9187, not 
port 5432.
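
In other words, something like this (a sketch; with your docker-compose 
setup the exporter should be reachable via its compose service name):

  - job_name: 'postgresql_exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']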

However, before you get around to configuring prometheus, I suggest you 
first make sure that postgres-exporter itself is working properly, by 
scraping it manually:

curl x.x.x.x:9187/metrics

(or inside the exporter container you could try curl 
127.0.0.1:9187/metrics, but that depends if the container has a "curl" 
binary)

Once you're able to do that (which may also require adjusting your 
postgres_exporter.yml and/or pg_hba.conf), then move on to configuring 
postgres.

On Tuesday 7 May 2024 at 21:24:18 UTC+1 Christian Sanchez wrote:

> Hello, all.
>
> I've started to learn Prometheus and found out about the 
> postgres_exporter. I'd like to include metrics from the PostgreSQL server I 
> have running on Google Cloud.
>
> I don't understand how to actually build out the postgres_exporter.yml 
> file. The prometheus-community GitHub repository 
> <https://github.com/prometheus-community/postgres_exporter> doesn't seem 
> to have examples of building this file out.
>
> Maybe I am not reading the README in the repo that well, but I'd like to 
> see some examples of the exporter file.
>
> When running the Prometheus container, this is where I'm expecting to see 
> the exporter query options (see attachment)
>
>
> I am running Prometheus and the Postgres Exporter through Docker Compose.
> Here is my docker-compose.yml file:
> version: '3'
> services:
> prometheus:
> image: prom/prometheus
> volumes:
> - "./prometheus.yml:/etc/prometheus/prometheus.yml"
> ports:
> - 9090:9090
>
> postgres-exporter:
> image: prometheuscommunity/postgres-exporter
> volumes:
> - "./postgres_exporter.yml:/postgres_exporter.yml:ro"
> ports:
> - 9187:9187
> environment:
> DATA_SOURCE_NAME: "
> postgresql://my-user:my-pa...@host.docker.internal:5432/my-database?sslmode=disable
> "
>
>
> Here is my prometheus.yml file:
> global:
> scrape_interval: 45s
>
> scrape_configs:
> - job_name: 'prometheus'
> static_configs:
> - targets: ['localhost:9090']
>
> - job_name: 'postgresql_exporter'
> static_configs:
> - targets: ['host.docker.internal:5432']
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/32dfd4cb-6730-423c-ac6a-20aee647b6c2n%40googlegroups.com.


[prometheus-users] Re: Compare metrics with differents labels

2024-04-30 Thread 'Brian Candler' via Prometheus Users
There's no metric I see there that tells you whether messages are being 
produced, only whether they're being consumed.

Without that, then I'm not sure you can do any better than this:

sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]) 
* 60) == 0
unless on (topic) sum by (topic) 
(rate(kafka_consumergroup_current_offset[5m]) * 60) < 1

The first part:
sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]) 
* 60) == 0
will give you an alert for each (consumergroup,topic) combination which has 
not consumed anything in the last 5 minutes.

The second part:
unless on (topic) sum by (topic) 
(rate(kafka_consumergroup_current_offset[5m]) * 60) < 1
will suppress the alert if *no* consumers have consumed at least 1 message 
per minute.  But this won't be useful unless each topic has at least 2 
consumer groups, so that if one is consuming it can alert on the other.

Given the examples you show, it looks like you only have one consumer group 
per topic.  Therefore, I think you need to find a metric which explicitly 
gives the publisher offset for each topic/partition.
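
If your exporter does expose one - for example, danielqsj/kafka_exporter has 
kafka_topic_partition_current_offset; check your own /metrics output for the 
exact name - then an untested sketch of the kind of query I mean is:

sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]) * 60) == 0
and on (topic)
sum by (topic) (rate(kafka_topic_partition_current_offset[5m]) * 60) > 1

i.e. "this consumer group has consumed nothing in the last 5 minutes, while 
the topic itself has received more than 1 message per minute".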

On Tuesday 30 April 2024 at 18:30:24 UTC+1 Robson Jose wrote:

> like this ?
>
> kafka_consumergroup_current_offset{consumergroup="consumer-events", 
> env="prod", instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-EVENTS"}
> 292350417
> kafka_consumergroup_current_offset{consumergroup="$Default", env="prod", 
> instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION"}
> 30027218
> kafka_consumergroup_current_offset{consumergroup="$Default", env="prod", 
> instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION-CHAT"}
> 3493310
> kafka_consumergroup_current_offset{consumergroup="consumer-email", 
> env="prod", instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION-EMAIL"}
> 82381171
> kafka_consumergroup_current_offset{consumergroup="$Default", env="prod", 
> instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION-PUSH"}
> 31267495
> kafka_consumergroup_current_offset{consumergroup="$Default", env="prod", 
> instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION-SMS"}
> 366
> kafka_consumergroup_current_offset{consumergroup="$Default", env="prod", 
> instance="kafka-exporter.monitor:9308", job="kafka-exporter", 
> partition="0", topic="TOPIC-NOTIFICATION-WHATSAPP"}
> Em terça-feira, 30 de abril de 2024 às 12:28:29 UTC-3, Brian Candler 
> escreveu:
>
>> You're showing aggregates, not the raw metrics.
>>
>> On Tuesday 30 April 2024 at 16:23:15 UTC+1 Robson Jose wrote:
>>
>>> like this
>>>   sum by (consumergroup, topic) 
>>> (delta(kafka_consumergroup_current_offset{}[5m])/5)
>>>
>>> {consumergroup="consumer-shop", topic="SHOP-EVENTS"}
>>> 1535.25
>>> {consumergroup="$Default", topic="TOPIC-NOTIFICATION"}
>>> 1.5
>>> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-CHAT"}
>>> 0.25
>>> {consumergroup="consumer-email", topic="TOPIC-NOTIFICATION-EMAIL"}
>>> 0
>>> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-TESTE"}
>>> 1.25
>>> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-SMS"}
>>> 0
>>> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-WHATSAPP"}
>>> 0
>>> {consumergroup="consumer-user-event", topic="TOPIC-USER-EVENTS"}
>>> 0
>>>
>>> Em terça-feira, 30 de abril de 2024 às 12:14:23 UTC-3, Brian Candler 
>>> escreveu:
>>>
>>>> Without seeing examples of the exact metrics you are receiving then 
>>>> it's hard to be sure what the right query is.
>>>>
>>>> > I want that if the consumption of messages in the topic in the last 5 
>>>> minutes is 0 and the production of messages is greater than 1 in the topic
>>>>
>>>> Then you'll want metrics for the consumption (consumer group offset) and 
>>>> production (e.g. partition log-end offset or consumer group lag)

[prometheus-users] Re: Compare metrics with differents labels

2024-04-30 Thread 'Brian Candler' via Prometheus Users
You're showing aggregates, not the raw metrics.

On Tuesday 30 April 2024 at 16:23:15 UTC+1 Robson Jose wrote:

> like this
>   sum by (consumergroup, topic) 
> (delta(kafka_consumergroup_current_offset{}[5m])/5)
>
> {consumergroup="consumer-shop", topic="SHOP-EVENTS"}
> 1535.25
> {consumergroup="$Default", topic="TOPIC-NOTIFICATION"}
> 1.5
> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-CHAT"}
> 0.25
> {consumergroup="consumer-email", topic="TOPIC-NOTIFICATION-EMAIL"}
> 0
> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-TESTE"}
> 1.25
> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-SMS"}
> 0
> {consumergroup="$Default", topic="TOPIC-NOTIFICATION-WHATSAPP"}
> 0
> {consumergroup="consumer-user-event", topic="TOPIC-USER-EVENTS"}
> 0
>
> Em terça-feira, 30 de abril de 2024 às 12:14:23 UTC-3, Brian Candler 
> escreveu:
>
>> Without seeing examples of the exact metrics you are receiving then it's 
>> hard to be sure what the right query is.
>>
>> > I want that if the consumption of messages in the topic in the last 5 
>> minutes is 0 and the production of messages is greater than 1 in the topic
>>
>> Then you'll want metrics for the consumption (consumer group offset) and 
> production (e.g. partition log-end offset or consumer group lag)
>>
>> On Tuesday 30 April 2024 at 13:51:50 UTC+1 Robson Jose wrote:
>>
>>>
>>> Hello, Thanks for responding in case
>>>
>>> I want that if the consumption of messages in the topic in the last 5 
>>> minutes is 0 and the production of messages is greater than 1 in the topic, 
>>> then the group of consumers is not consuming messages and I wanted to 
>>> return which groups and topics these would be
>>> Em sexta-feira, 19 de abril de 2024 às 15:36:44 UTC-3, Brian Candler 
>>> escreveu:
>>>
>>>> Maybe what you're trying to do is:
>>>>
>>>> sum by (consumergroup, topic) 
>>>> (rate(kafka_consumergroup_current_offset[5m]) * 60) == 0
>>>> unless sum by (topic) (rate(kafka_consumergroup_current_offset[5m]) * 
>>>> 60) < 1
>>>>
>>>> That is: alert on any combination of (consumergroup,topic) where the 
>>>> 5-minute rate of consumption is zero, unless the rate for that topic 
>>>> across 
>>>> all consumers is less than 1 per minute.
>>>>
>>>> As far as I can tell, kafka_consumergroup_current_offset is a counter, 
>>>> and therefore you should use either rate() or increase().  The only 
>>>> difference is that rate(foo[5m]) gives the increase per second, while 
>>>> increase(foo[5m]) gives the increase per 5 minutes.
>>>>
>>>> Hence:
>>>> rate(kafka_consumergroup_current_offset[5m]) * 60
>>>> increase(kafka_consumergroup_current_offset[5m]) / 5
>>>> should both be the same, giving the per-minute increase.
>>>>
>>>> On Friday 19 April 2024 at 18:30:21 UTC+1 Brian Candler wrote:
>>>>
>>>>> Sorry, first link was wrong.
>>>>>
>>>>> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/unto0oGQAQAJ
>>>>>
>>>>> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>>>>>
>>>>> On Friday 19 April 2024 at 18:28:29 UTC+1 Brian Candler wrote:
>>>>>
>>>>>> Can you give examples of the metrics in question, and what conditions 
>>>>>> you're trying to check for?
>>>>>>
>>>>>> Looking at your specific PromQL query: Firstly, in my experience, 
>>>>>> it's very unusual in Prometheus queries to use ==bool or >bool, and in 
>>>>>> this 
>>>>>> specific case definitely seems to be wrong.
>>>>>>
>>>>>> Secondly, you won't be able to join the LH and RH sides of your 
>>>>>> expression with "and" unless either they have exactly the same label 
>>>>>> sets, 
>>>>>> or you modify your condition using "and on (...)" or "and ignoring 
>>>>>> (...)".
>>>>>>
>>>>>> "and" is a vector intersection operator, where the result vector 
>>>>>> includes a value if the labels match, and the value is taken from the 
>>

[prometheus-users] Re: Compare metrics with differents labels

2024-04-30 Thread 'Brian Candler' via Prometheus Users
Without seeing examples of the exact metrics you are receiving then it's 
hard to be sure what the right query is.

> I want that if the consumption of messages in the topic in the last 5 
minutes is 0 and the production of messages is greater than 1 in the topic

Then you'll want metrics for the consumption (consumer group offset) and 
production (e.g. partition log-end offset or consumer group lag)

On Tuesday 30 April 2024 at 13:51:50 UTC+1 Robson Jose wrote:

>
> Hello, Thanks for responding in case
>
> I want that if the consumption of messages in the topic in the last 5 
> minutes is 0 and the production of messages is greater than 1 in the topic, 
> then the group of consumers is not consuming messages and I wanted to 
> return which groups and topics these would be
> Em sexta-feira, 19 de abril de 2024 às 15:36:44 UTC-3, Brian Candler 
> escreveu:
>
>> Maybe what you're trying to do is:
>>
>> sum by (consumergroup, topic) 
>> (rate(kafka_consumergroup_current_offset[5m]) * 60) == 0
>> unless sum by (topic) (rate(kafka_consumergroup_current_offset[5m]) * 60) 
>> < 1
>>
>> That is: alert on any combination of (consumergroup,topic) where the 
>> 5-minute rate of consumption is zero, unless the rate for that topic across 
>> all consumers is less than 1 per minute.
>>
>> As far as I can tell, kafka_consumergroup_current_offset is a counter, 
>> and therefore you should use either rate() or increase().  The only 
>> difference is that rate(foo[5m]) gives the increase per second, while 
>> increase(foo[5m]) gives the increase per 5 minutes.
>>
>> Hence:
>> rate(kafka_consumergroup_current_offset[5m]) * 60
>> increase(kafka_consumergroup_current_offset[5m]) / 5
>> should both be the same, giving the per-minute increase.
>>
>> On Friday 19 April 2024 at 18:30:21 UTC+1 Brian Candler wrote:
>>
>>> Sorry, first link was wrong.
>>> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/unto0oGQAQAJ
>>> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>>>
>>> On Friday 19 April 2024 at 18:28:29 UTC+1 Brian Candler wrote:
>>>
>>>> Can you give examples of the metrics in question, and what conditions 
>>>> you're trying to check for?
>>>>
>>>> Looking at your specific PromQL query: Firstly, in my experience, it's 
>>>> very unusual in Prometheus queries to use ==bool or >bool, and in this 
>>>> specific case definitely seems to be wrong.
>>>>
>>>> Secondly, you won't be able to join the LH and RH sides of your 
>>>> expression with "and" unless either they have exactly the same label sets, 
>>>> or you modify your condition using "and on (...)" or "and ignoring (...)".
>>>>
>>>> "and" is a vector intersection operator, where the result vector 
>>>> includes a value if the labels match, and the value is taken from the LHS, 
>>>> and that means it doesn't combine the values like you might be used to in 
>>>> other programming languages. For example,
>>>>
>>>> vector(0) and vector(1)  => value is 0
>>>> vector(1) and vector(0)  => value is 1
>>>> vector(42) and vector(99)  => value is 42
>>>>
>>>> This is as described in the documentation 
>>>> <https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
>>>> :
>>>>
>>>> vector1 and vector2 results in a vector consisting of the elements of 
>>>> vector1 for which there are elements in vector2 with exactly matching 
>>>> label sets. Other elements are dropped. The metric name and values are 
>>>> carried over from the left-hand side vector.
>>>>
>>>> PromQL alerts on the presence of values, and in PromQL you need to 
>>>> think in terms of "what (labelled) values are present or absent in this 
>>>> vector", using the "and/unless" operators to suppress elements in the 
>>>> result vector, and the "or" operator to add additional elements to the 
>>>> result vector.
>>>>
>>>> Maybe these explanations help:
>>>>
>>>> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/NH2_CRPaAQAJ
>>>>
>>>> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>>>>
>>>> On Friday 19 April 2024 at 16:31:23 UTC+1 Robson Jose wrote:
>>>>
>>>>> Good afternoon, I would like to know if it is possible to do this 
>>>>> query, the value that should return is applications with a value of 0 in 
>>>>> the first query and greater than one in the 2nd
>>>>>
>>>>> (
>>>>>   sum by (consumergroup, topic) 
>>>>> (delta(kafka_consumergroup_current_offset{}[5m])/5) ==bool 0
>>>>> ) 
>>>>> and (
>>>>>   sum by (topic) (delta(kafka_consumergroup_current_offset{}[5m])/5) 
>>>>> >bool 1
>>>>> )
>>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/a3b6f298-1c3d-47cd-b04a-66c62bd71c86n%40googlegroups.com.


[prometheus-users] Re: Compare metrics with differents labels

2024-04-19 Thread 'Brian Candler' via Prometheus Users
Maybe what you're trying to do is:

sum by (consumergroup, topic) (rate(kafka_consumergroup_current_offset[5m]) 
* 60) == 0
unless sum by (topic) (rate(kafka_consumergroup_current_offset[5m]) * 60) < 
1

That is: alert on any combination of (consumergroup,topic) where the 
5-minute rate of consumption is zero, unless the rate for that topic across 
all consumers is less than 1 per minute.

As far as I can tell, kafka_consumergroup_current_offset is a counter, and 
therefore you should use either rate() or increase().  The only difference 
is that rate(foo[5m]) gives the increase per second, while 
increase(foo[5m]) gives the increase per 5 minutes.

Hence:
rate(kafka_consumergroup_current_offset[5m]) * 60
increase(kafka_consumergroup_current_offset[5m]) / 5
should both be the same, giving the per-minute increase.

On Friday 19 April 2024 at 18:30:21 UTC+1 Brian Candler wrote:

> Sorry, first link was wrong.
> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/unto0oGQAQAJ
> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>
> On Friday 19 April 2024 at 18:28:29 UTC+1 Brian Candler wrote:
>
>> Can you give examples of the metrics in question, and what conditions 
>> you're trying to check for?
>>
>> Looking at your specific PromQL query: Firstly, in my experience, it's 
>> very unusual in Prometheus queries to use ==bool or >bool, and in this 
>> specific case definitely seems to be wrong.
>>
>> Secondly, you won't be able to join the LH and RH sides of your 
>> expression with "and" unless either they have exactly the same label sets, 
>> or you modify your condition using "and on (...)" or "and ignoring (...)".
>>
>> "and" is a vector intersection operator, where the result vector includes 
>> a value if the labels match, and the value is taken from the LHS, and that 
>> means it doesn't combine the values like you might be used to in other 
>> programming languages. For example,
>>
>> vector(0) and vector(1)  => value is 0
>> vector(1) and vector(0)  => value is 1
>> vector(42) and vector(99)  => value is 42
>>
>> This is as described in the documentation 
>> <https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
>> :
>>
>> vector1 and vector2 results in a vector consisting of the elements of 
>> vector1 for which there are elements in vector2 with exactly matching 
>> label sets. Other elements are dropped. The metric name and values are 
>> carried over from the left-hand side vector.
>>
>> PromQL alerts on the presence of values, and in PromQL you need to think 
>> in terms of "what (labelled) values are present or absent in this vector", 
>> using the "and/unless" operators to suppress elements in the result vector, 
>> and the "or" operator to add additional elements to the result vector.
>>
>> Maybe these explanations help:
>> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/NH2_CRPaAQAJ
>> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>>
>> On Friday 19 April 2024 at 16:31:23 UTC+1 Robson Jose wrote:
>>
>>> Good afternoon, I would like to know if it is possible to do this query, 
>>> the value that should return is applications with a value of 0 in the first 
>>> query and greater than one in the 2nd
>>>
>>> (
>>>   sum by (consumergroup, topic) 
>>> (delta(kafka_consumergroup_current_offset{}[5m])/5) ==bool 0
>>> ) 
>>> and (
>>>   sum by (topic) (delta(kafka_consumergroup_current_offset{}[5m])/5) 
>>> >bool 1
>>> )
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9796ea54-47c9-47dc-8f87-460de1468a66n%40googlegroups.com.


[prometheus-users] Re: Compare metrics with differents labels

2024-04-19 Thread 'Brian Candler' via Prometheus Users
Sorry, first link was wrong.
https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/unto0oGQAQAJ
https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ

On Friday 19 April 2024 at 18:28:29 UTC+1 Brian Candler wrote:

> Can you give examples of the metrics in question, and what conditions 
> you're trying to check for?
>
> Looking at your specific PromQL query: Firstly, in my experience, it's 
> very unusual in Prometheus queries to use ==bool or >bool, and in this 
> specific case definitely seems to be wrong.
>
> Secondly, you won't be able to join the LH and RH sides of your expression 
> with "and" unless either they have exactly the same label sets, or you 
> modify your condition using "and on (...)" or "and ignoring (...)".
>
> "and" is a vector intersection operator, where the result vector includes 
> a value if the labels match, and the value is taken from the LHS, and that 
> means it doesn't combine the values like you might be used to in other 
> programming languages. For example,
>
> vector(0) and vector(1)  => value is 0
> vector(1) and vector(0)  => value is 1
> vector(42) and vector(99)  => value is 42
>
> This is as described in the documentation 
> <https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
> :
>
> vector1 and vector2 results in a vector consisting of the elements of 
> vector1 for which there are elements in vector2 with exactly matching 
> label sets. Other elements are dropped. The metric name and values are 
> carried over from the left-hand side vector.
>
> PromQL alerts on the presence of values, and in PromQL you need to think 
> in terms of "what (labelled) values are present or absent in this vector", 
> using the "and/unless" operators to suppress elements in the result vector, 
> and the "or" operator to add additional elements to the result vector.
>
> Maybe these explanations help:
> https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/NH2_CRPaAQAJ
> https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ
>
> On Friday 19 April 2024 at 16:31:23 UTC+1 Robson Jose wrote:
>
>> Good afternoon, I would like to know if it is possible to do this query, 
>> the value that should return is applications with a value of 0 in the first 
>> query and greater than one in the 2nd
>>
>> (
>>   sum by (consumergroup, topic) 
>> (delta(kafka_consumergroup_current_offset{}[5m])/5) ==bool 0
>> ) 
>> and (
>>   sum by (topic) (delta(kafka_consumergroup_current_offset{}[5m])/5) 
>> >bool 1
>> )
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/304a4437-6cbb-451b-b476-d3196dc6923bn%40googlegroups.com.


[prometheus-users] Re: Compare metrics with differents labels

2024-04-19 Thread 'Brian Candler' via Prometheus Users
Can you give examples of the metrics in question, and what conditions 
you're trying to check for?

Looking at your specific PromQL query: Firstly, in my experience, it's very 
unusual in Prometheus queries to use ==bool or >bool, and in this specific 
case definitely seems to be wrong.

Secondly, you won't be able to join the LH and RH sides of your expression 
with "and" unless either they have exactly the same label sets, or you 
modify your condition using "and on (...)" or "and ignoring (...)".

"and" is a vector intersection operator, where the result vector includes a 
value if the labels match, and the value is taken from the LHS, and that 
means it doesn't combine the values like you might be used to in other 
programming languages. For example,

vector(0) and vector(1)  => value is 0
vector(1) and vector(0)  => value is 1
vector(42) and vector(99)  => value is 42

This is as described in the documentation 
<https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
:

vector1 and vector2 results in a vector consisting of the elements of 
vector1 for which there are elements in vector2 with exactly matching label 
sets. Other elements are dropped. The metric name and values are carried 
over from the left-hand side vector.

PromQL alerts on the presence of values, and in PromQL you need to think in 
terms of "what (labelled) values are present or absent in this vector", 
using the "and/unless" operators to suppress elements in the result vector, 
and the "or" operator to add additional elements to the result vector.

Maybe these explanations help:
https://groups.google.com/g/prometheus-users/c/IeW_3nyGkR0/m/NH2_CRPaAQAJ
https://groups.google.com/g/prometheus-users/c/83pEAX44L3M/m/E20UmVJyBQAJ

On Friday 19 April 2024 at 16:31:23 UTC+1 Robson Jose wrote:

> Good afternoon, I would like to know if it is possible to do this query, 
> the value that should return is applications with a value of 0 in the first 
> query and greater than one in the 2nd
>
> (
>   sum by (consumergroup, topic) 
> (delta(kafka_consumergroup_current_offset{}[5m])/5) ==bool 0
> ) 
> and (
>   sum by (topic) (delta(kafka_consumergroup_current_offset{}[5m])/5) >bool 
> 1
> )
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/d54ade93-2ea4-438e-986a-a9c780ab71acn%40googlegroups.com.


Re: [prometheus-users] Re: Need urgent help!!! Want to modify tags "keys" to lowercase scraping from Cloudwatch-Exporter in Prometheus before sending to Mimir #13912

2024-04-18 Thread 'Brian Candler' via Prometheus Users
No. That test case demonstrates that it is the label *values* that are 
downcased, not the label names, exactly as you said.

On Thursday 18 April 2024 at 13:07:51 UTC+1 Vaibhav Ingulkar wrote:

> Thanks @Brian Candler
>
> Actually not possible fixing the data at source due to multiple 
> variations in diff aws services and huge data modification. So looking to 
> make it dynamically by capturing labels starting with "*tag_*".
>
> As mentioned here 
> https://github.com/prometheus/prometheus/blob/v2.45.4/model/relabel/relabel_test.go#L461-L482
>  can 
> you please give me one example of config to achieve it dynamically for all 
> labels starting with "*tag_*"
>
> It will be great help if that works for me. :)
>
>
> On Thursday, April 18, 2024 at 4:46:15 PM UTC+5:30 Brian Candler wrote:
>
>> You mean you're seeing tag_owner, tag_Owner, tag_OWNER from different 
>> instances? Because the tags weren't entered consistently?
>>
>> I don't see a lowercasing version of the "labelmap" action. So I think 
>> you're back to either:
>>
>> 1. fixing the data at source (e.g. using the EC2 API to read the tags and 
>> reset them to the desired values; and then make policies and procedures so 
>> that new instances have consistent tag names); or
>> 2. proxying / modifying the exporter
>>
>> > I think  lower/upper action in relabeling works to make "*values*" of 
>> labels to lower/upper 
>>
>> I believe so. The way I interpret it, "lowercase" action is the same as 
>> "replace", but the concatenated values from source_labels are lowercased 
>> first. Hence the fixed target_label that you specify will get the 
>> lowercased value, after any regex matching/capturing.
>>
>> The test case here agrees:
>>
>> https://github.com/prometheus/prometheus/blob/v2.45.4/model/relabel/relabel_test.go#L461-L482
>>
>> On Thursday 18 April 2024 at 11:47:16 UTC+1 Vaibhav Ingulkar wrote:
>>
>>> Additionally , I have prepare below config under metric_relable_configs
>>> - action: labelmap
>>>   regex: 'tag_(.*)'
>>>   replacement: $1
>>>
>>> It is giving one me new set of all label starting with word '*tag_*' as 
>>> added in regex but not converting them to lowercase and removing "*tag_*" 
>>> from label name, for ex. *tag_Name* is converted only "N*ame*"
>>> Also existing label *tag_Name* is also remaining as it is .i.e. old 
>>> label *tag_Name* and new label *Name*
>>>
>>> So Firstly I want that "*tag_"* should remain as it it in new label and 
>>> it should get converted to lower case i.e. for ex. *tag_Budget_Code* to 
>>> *tag_budget_code* or *tag_Name* to *tag_name*
>>> Secondly need to remove old label for ex. *tag_Budget_Code* , *tag_Name* , 
>>> etc
>>>
>>> On Thursday, April 18, 2024 at 3:46:57 PM UTC+5:30 Vaibhav Ingulkar 
>>> wrote:
>>>
>>>> Thanks @Brian Kochie
>>>>
>>>> Correct me if I am wrong but I think  lower/upper action in relabeling 
>>>> works to make "*values*" of labels to lower/upper and not "*keys*" *i.e. 
>>>> label name itself wont get convert to lowercase*. Right?
>>>>
>>>> Because I an using *v2.41.0 *and  have tried it and it is converting 
>>>> all values of labels to lowercase.
>>>>
>>>> Here my requirement is to convert labels i.e. keys to lowercase for ex. 
>>>> *tag_Budget_Code* to *tag_budget_code* or *tag_Name* to *tag_name*
>>>>
>>>> On Thursday, April 18, 2024 at 2:26:10 PM UTC+5:30 Brian Candler wrote:
>>>>
>>>>> On Thursday 18 April 2024 at 09:42:41 UTC+1 Ben Kochie wrote:
>>>>>
>>>>> Prometheus can lower/upper in relabeling.
>>>>>
>>>>>
>>>>> Thanks! That was added in v2.36.0 
>>>>> <https://github.com/prometheus/prometheus/releases/v2.36.0>, and I 
>>>>> missed it.
>>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c5a33b4a-b27c-447d-bc6a-3b14a5fb2e12n%40googlegroups.com.


Re: [prometheus-users] Re: Need urgent help!!! Want to modify tags "keys" to lowercase scraping from Cloudwatch-Exporter in Prometheus before sending to Mimir #13912

2024-04-18 Thread 'Brian Candler' via Prometheus Users
You mean you're seeing tag_owner, tag_Owner, tag_OWNER from different 
instances? Because the tags weren't entered consistently?

I don't see a lowercasing version of the "labelmap" action. So I think 
you're back to either:

1. fixing the data at source (e.g. using the EC2 API to read the tags and 
reset them to the desired values; and then make policies and procedures so 
that new instances have consistent tag names); or
2. proxying / modifying the exporter

> I think  lower/upper action in relabeling works to make "*values*" of 
labels to lower/upper 

I believe so. The way I interpret it, "lowercase" action is the same as 
"replace", but the concatenated values from source_labels are lowercased 
first. Hence the fixed target_label that you specify will get the 
lowercased value, after any regex matching/capturing.
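
For a single, known label name it can be done, e.g. (a sketch - tag_Owner is 
just an example):

metric_relabel_configs:
- source_labels: [tag_Owner]
  action: lowercase
  target_label: tag_owner
- action: labeldrop
  regex: tag_Owner

But that has to be written out once per label name; there's no lowercasing 
equivalent of labelmap which would handle every tag_* label dynamically.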

The test case here agrees:
https://github.com/prometheus/prometheus/blob/v2.45.4/model/relabel/relabel_test.go#L461-L482

On Thursday 18 April 2024 at 11:47:16 UTC+1 Vaibhav Ingulkar wrote:

> Additionally , I have prepare below config under metric_relable_configs
> - action: labelmap
>   regex: 'tag_(.*)'
>   replacement: $1
>
> It is giving one me new set of all label starting with word '*tag_*' as 
> added in regex but not converting them to lowercase and removing "*tag_*" 
> from label name, for ex. *tag_Name* is converted only "N*ame*"
> Also existing label *tag_Name* is also remaining as it is .i.e. old label 
> *tag_Name* and new label *Name*
>
> So Firstly I want that "*tag_"* should remain as it it in new label and 
> it should get converted to lower case i.e. for ex. *tag_Budget_Code* to 
> *tag_budget_code* or *tag_Name* to *tag_name*
> Secondly need to remove old label for ex. *tag_Budget_Code* , *tag_Name* , 
> etc
>
> On Thursday, April 18, 2024 at 3:46:57 PM UTC+5:30 Vaibhav Ingulkar wrote:
>
>> Thanks @Brian Kochie
>>
>> Correct me if I am wrong but I think  lower/upper action in relabeling 
>> works to make "*values*" of labels to lower/upper and not "*keys*" *i.e. 
>> label name itself wont get convert to lowercase*. Right?
>>
>> Because I an using *v2.41.0 *and  have tried it and it is converting all 
>> values of labels to lowercase.
>>
>> Here my requirement is to convert labels i.e. keys to lowercase for ex. 
>> *tag_Budget_Code* to *tag_budget_code* or *tag_Name* to *tag_name*
>>
>> On Thursday, April 18, 2024 at 2:26:10 PM UTC+5:30 Brian Candler wrote:
>>
>>> On Thursday 18 April 2024 at 09:42:41 UTC+1 Ben Kochie wrote:
>>>
>>> Prometheus can lower/upper in relabeling.
>>>
>>>
>>> Thanks! That was added in v2.36.0 
>>> <https://github.com/prometheus/prometheus/releases/v2.36.0>, and I 
>>> missed it.
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/bb7045b4-eeaf-404a-8aaf-affeae3bcf95n%40googlegroups.com.


Re: [prometheus-users] Re: Need urgent help!!! Want to modify tags "keys" to lowercase scraping from Cloudwatch-Exporter in Prometheus before sending to Mimir #13912

2024-04-18 Thread 'Brian Candler' via Prometheus Users
On Thursday 18 April 2024 at 09:42:41 UTC+1 Ben Kochie wrote:

Prometheus can lower/upper in relabeling.


Thanks! That was added in v2.36.0 
<https://github.com/prometheus/prometheus/releases/v2.36.0>, and I missed 
it.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/b6a6d314-9c36-4d81-957d-22048c6b04ben%40googlegroups.com.


[prometheus-users] Re: Need urgent help!!! Want to modify tags "keys" to lowercase scraping from Cloudwatch-Exporter in Prometheus before sending to Mimir #13912

2024-04-18 Thread 'Brian Candler' via Prometheus Users
> Need urgent help!!!

See https://www.catb.org/~esr/faqs/smart-questions.html#urgent

> we can add *only one pattern (Uppercase or lowercase)* in template code.

At worst you can match like this: tag_Name=~"[fF][oO][oO][bB][aA][rR]"

I don't know of any way internally to prometheus to lowercase labels. What 
you could do though is to write a HTTP proxy: you scrape the proxy from 
prometheus, the proxy scrapes the upstream source, and modifies the labels 
before returning the results to prometheus.

Or: since you're using an external package anyway (cloudwatch_exporter), 
you could modify and recompile it yourself.

IMO it would be better if you fix the data at source, i.e. make your tags be 
consistent in AWS. Prometheus faithfully reproduces the data you give it. 
Garbage in, garbage out.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/c6d8f693-8238-4929-9dbc-d96e64b57180n%40googlegroups.com.


[prometheus-users] Re: many-to-many not allowed error

2024-04-18 Thread 'Brian Candler' via Prometheus Users
Look at the results of each half of the query separately:

redis_memory_max_bytes{k8s_cluster_name="$cluster", 
namespace="$namespace", pod="$pod_name"}

redis_instance_info{role=~"master|slave"}

You then need to find some set of labels which mean that N entries on the 
left-hand side always match exactly 1 entry on the right-hand side.
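
As an untested sketch of the shape that usually works - assuming "role" is 
the label you want to copy across, and that each pod only ever has one role:

redis_memory_max_bytes{k8s_cluster_name="$cluster", namespace="$namespace", pod="$pod_name"}
  * on (k8s_cluster_name, namespace, pod) group_left(role)
    max by (k8s_cluster_name, namespace, pod, role) (redis_instance_info{role=~"master|slave"})

group_left() takes the extra label(s) to copy from the right-hand side (here 
"role"), and the "max by (...)" collapses any other labels (such as instance 
or job) that would otherwise cause duplicate matches.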

On Thursday 18 April 2024 at 07:30:49 UTC+1 saravanan E.M wrote:

> Hi Team
>
> Am getting many-to-many not allowed error while trying to join two time 
> series with role
>
> redis_memory_max_bytes{k8s_cluster_name="$cluster", 
> namespace="$namespace", pod="$pod_name"}
>   * on (k8s_cluster_name, namespace, pod) group_left(redis_instance_info) 
>   (redis_instance_info{role=~"master|slave"})
>
> Kindly help in having the correct query for this.
>
> Thanks
> Saravanan
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/205d246f-6b23-4474-955f-b71012eb3fbfn%40googlegroups.com.


Re: [prometheus-users] Re: Config DNS Prometheus/Blackbox_Exporter

2024-04-18 Thread 'Brian Candler' via Prometheus Users
You don't need a separate job for each DNS server. You can have a single 
job with multiple target blocks.

  - job_name: 'dns'
scrape_interval: 5s
metrics_path: /probe
params:
  module: [dns_probe]
static_configs:
  - targets:
- www.google.com
- www.mindfree.cl
labels:
  dns: 208.67.220.220 #australia cloudflare
  - targets:
- www.google.com
- www.microsoft.com
labels:
  dns: 198.55.49.149 #canada

relabel_configs:
- source_labels: [__address__]
  #target_label: __param_target
  target_label: __param_hostname
# Populate target URL parameter with dns server IP
- source_labels: [__param_hostname]
  target_label: instance
#QUERY
- source_labels: [dns]
  #target_label: __param_hostname
  target_label: __param_target
# Populate __address__ with the address of the blackbox exporter to hit
- target_label: __address__
  replacement: localhost:9115

(Although personally, I would use file_sd_configs for this, so I can edit 
the targets without having to re-read the prometheus config file).
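
As a sketch of what that looks like (file path invented):

  - job_name: 'dns'
    scrape_interval: 5s
    metrics_path: /probe
    params:
      module: [dns_probe]
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/dns_*.yml
    relabel_configs:
      # ... same relabel_configs as above ...

and then a targets file such as /etc/prometheus/targets/dns_opendns.yml:

- targets:
    - www.google.com
    - www.mindfree.cl
  labels:
    dns: 208.67.220.220

Prometheus re-reads the files when they change, so adding or removing 
targets doesn't need a config reload.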

On Thursday 18 April 2024 at 01:52:45 UTC+1 Vincent Romero wrote:

> [image: blackbox-dns1.png]
> log blackbox_exporter sorry
>
> El Wednesday, April 17, 2024 a la(s) 8:50:39 PM UTC-4, Vincent Romero 
> escribió:
>
>> Hello every i change the relabel
>>
>> y try this
>>
>> - job_name: '208.67.222.220-opendns' ##REBUILD new blackbox_expoerter
>> scrape_interval: 5s
>> metrics_path: /probe
>> params:
>> module: [dns_probe]
>> static_configs:
>> - targets:
>> - www.google.com
>> - www.mindfree.cl
>> labels:
>> dns: 208.67.220.220 #australia cloudflare
>>
>> relabel_configs:
>> - source_labels: [__address__]
>> #target_label: __param_target
>> target_label: __param_hostname
>> # Populate target URL parameter with dns server IP
>> - source_labels: [__param_hostname]
>> target_label: instance
>> #QUERY
>> - source_labels: [dns]
>> #target_label: __param_hostname
>> target_label: __param_target
>> # Populate __address__ with the address of the blackbox exporter to hit
>> - target_label: __address__
>> replacement: localhost:9115
>>
>> - job_name: '198.55.49.149-canada' ##REBUILD new blackbox_expoerter
>> scrape_interval: 5s
>> metrics_path: /probe
>> params:
>> module: [dns_probe]
>> static_configs:
>> - targets:
>> - www.google.com
>> - www.microsoft.com
>> labels:
>> dns: 198.55.49.149 #canada
>>
>> relabel_configs:
>> - source_labels: [__address__]
>> #target_label: __param_target
>> target_label: __param_hostname
>> # Populate target URL parameter with dns server IP
>> - source_labels: [__param_hostname]
>> target_label: instance
>> #QUERY
>> - source_labels: [dns]
>> #target_label: __param_hostname
>> target_label: __param_target
>> # Populate __address__ with the address of the blackbox exporter to hit
>> - target_label: __address__
>> replacement: localhost:9115
>>
>>
>> with this i can used in target any domain to resolve with labels dns
>>
>> in the log in blackbox i have this
>>
>> looking good no? 
>>
>>
>> El Friday, April 12, 2024 a la(s) 9:46:32 AM UTC-4, Brian Candler 
>> escribió:
>>
>>> It's not really related to blackbox_exporter itself, but I don't 
>>> entirely agree with that comment.
>>>
>>> There are two different things at play here: the address you send the 
>>> query to ("target"), and the name that you are looking up ("queryname").
>>>
>>> - For caching resolvers: large providers use anycast with fixed IP 
>>> addresses, since that's what you have to configure in your client (8.8.8.8, 
>>> 1.1.1.1 etc). Those target addresses will *never* change.  I think 
>>> 185.228.168.9 
>>> falls into this category too: although you could get to it by resolving "
>>> security-filter-dns.cleanbrowsing.org", for a filtered DNS service 
>>> you'd always be using the IP address directly.
>>>
>>> - For authoritative servers: using the nameserver's DNS name (e.g. 
>>> ns.example.com) more closely reflects what the DNS does during 
>>> resolution, but makes it harder to work out what's going wrong if it fails. 
>>> The IP addresses that NS records resolve to can change, but very rarely do 
>>> (and since it's your own authoritative nameservers, you'll know if you 
>>&

[prometheus-users] Re: Prometheus Azure Service Discovery behind a proxy server

2024-04-15 Thread 'Brian Candler' via Prometheus Users
> Is there a way to enable or add proxy config just for the service 
discoery and microsoft authentication part ?

The configuration of azure sd is here:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#azure_sd_config
It has its own local settings for proxy_url, proxy_connect_header etc, 
which relate purely to the service discovery, and not to scraping.

On Monday 15 April 2024 at 01:01:59 UTC+1 Durga Prasad Kommareddy wrote:

> I have Prometheus running on a azure VM. And have few other VMs in 
> multiple subscriptions peered with the prometheus VM/Vnet.
>
> So i can reach the target VM metrics at http://IP:9100/metrics. But the 
> service discovery itself is not working unless i use a public IP/internet  
> Prometheus service discovery and microsoft authentication.
>
> Is there a way to enable or add proxy config just for the service discoery 
> and microsoft authentication part ? i dont need proxy for the actual 
> metrcis scraping because the VM can talk to all my target VMs so that'll 
> work.  

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/16554dc8-5482-4c8e-936f-3f0e8294c7f4n%40googlegroups.com.


Re: [prometheus-users] Re: Config DNS Prometheus/Blackbox_Exporter

2024-04-12 Thread 'Brian Candler' via Prometheus Users
It's not really related to blackbox_exporter itself, but I don't entirely 
agree with that comment.

There are two different things at play here: the address you send the query 
to ("target"), and the name that you are looking up ("queryname").

- For caching resolvers: large providers use anycast with fixed IP 
addresses, since that's what you have to configure in your client (8.8.8.8, 
1.1.1.1 etc). Those target addresses will *never* change.  I think 
185.228.168.9 
falls into this category too: although you could get to it by resolving 
"security-filter-dns.cleanbrowsing.org", 
for a filtered DNS service you'd always be using the IP address directly.

- For authoritative servers: using the nameserver's DNS name (e.g. 
ns.example.com) more closely reflects what the DNS does during resolution, 
but makes it harder to work out what's going wrong if it fails. The IP 
addresses that NS records resolve to can change, but very rarely do (and 
since it's your own authoritative nameservers, you'll know if you renumber 
them). Furthermore, in my experience, NS names are never geo-aware: they 
always return static IPs (although these may point to anycast addresses).

- Geo-aware DNS generally takes place for the user-visible query names 
(like "www.google.com") and generally are affected by the *source* address 
where the query is coming from.

On Friday 12 April 2024 at 14:21:57 UTC+1 Conall O'Brien wrote:

> On Wed, 10 Apr 2024 at 06:47, 'Brian Candler' via Prometheus Users <
> promethe...@googlegroups.com> wrote:
>
>> One exporter scrape = one probe test and I think that should remain. You 
>> can get what you want by expanding the targets (which is a *list* of 
>> targets+labels):
>>
>>   static_configs:
>> - targets:
>> - 1.1.1.1
>> - 185.228.168.9
>>   labels:
>> queryname: www.google.com
>> - targets:
>> - 1.1.1.1
>> - 185.228.168.9
>>   labels:
>> queryname: www.microsoft.com
>>
>
> Given the targets, I would strongly suggest using DNS names over raw IP 
> addresses for every scrape. Large providers use geo-aware DNS systems, so 
> the IP numbers change over time for a number of reasons (e.g maintenance, 
> capacity turnup/turndown, etc). Probing raw IPs will not reflect the actual 
> state of the service.
>  
>
>> On Tuesday 9 April 2024 at 22:48:44 UTC+1 Vincent Romero wrote:
>>
>>> Hello, this worked
>>>
>>> With the new feature with simple domain works, but considered whether 
>>> the label required adding N domains?
>>>
>>> Y try add other domain in the same labels
>>>
>>>   - job_name: 'blackbox-dns-monitor'
>>> scrape_interval: 5s
>>> metrics_path: /probe
>>> params:
>>>   module: [dns_probe]
>>> static_configs:
>>>   - targets:
>>> - 1.1.1.1 #australia cloudflare
>>> - 185.228.168.9 #ireland
>>> labels:
>>>       queryname: www.google.com, www.microsoft.com NOT WORK
>>>   queryname: www.microsoft.com NOT WORK (add line)
>>>
>>> [image: Captura de pantalla 2024-04-09 a la(s) 17.44.20.png]
>>>
>>> El Tuesday, April 9, 2024 a la(s) 12:19:25 PM UTC-4, Vincent Romero 
>>> escribió:
>>>
>>>> i will try make build, with this change
>>>>
>>>>
>>>>
>>>> El Saturday, April 6, 2024 a la(s) 2:45:29 PM UTC-3, Brian Candler 
>>>> escribió:
>>>>
>>>>> You're correct that currently the qname is statically configured in 
>>>>> the prober config.
>>>>>
>>>>> A patch was submitted to allow what you want, but hasn't been merged:
>>>>> https://github.com/prometheus/blackbox_exporter/pull/1105
>>>>>
>>>>> You can build blackbox_exporter yourself with this patch applied 
>>>>> though.
>>>>>
>>>>> On Saturday 6 April 2024 at 18:06:01 UTC+1 Vincent Romero wrote:
>>>>>
>>>>>> Helo everyone
>>>>>>
>>>>>> what is the difference between http_2xx and dns module configuration
>>>>>>
>>>>>>
>>>>>> I have this example y my config
>>>>>>
>>>>>> blackbox.yml
>>>>>> modules:
>>>>>>   http_2xx:
>>>>>> prober: http
>>>>>> http:
>>>>>>   preferred_

[prometheus-users] Re: Config DNS Prometheus/Blackbox_Exporter

2024-04-09 Thread 'Brian Candler' via Prometheus Users
One exporter scrape = one probe test and I think that should remain. You 
can get what you want by expanding the targets (which is a *list* of 
targets+labels):

  static_configs:
- targets:
- 1.1.1.1
- 185.228.168.9
  labels:
queryname: www.google.com
- targets:
- 1.1.1.1
- 185.228.168.9
  labels:
queryname: www.microsoft.com

On Tuesday 9 April 2024 at 22:48:44 UTC+1 Vincent Romero wrote:

> Hello, this worked
>
> With the new feature with simple domain works, but considered whether the 
> label required adding N domains?
>
> Y try add other domain in the same labels
>
>   - job_name: 'blackbox-dns-monitor'
> scrape_interval: 5s
> metrics_path: /probe
> params:
>   module: [dns_probe]
> static_configs:
>   - targets:
> - 1.1.1.1 #australia cloudflare
> - 185.228.168.9 #ireland
> labels:
>   queryname: www.google.com, www.microsoft.com NOT WORK
>   queryname: www.microsoft.com NOT WORK (add line)
>
> [image: Captura de pantalla 2024-04-09 a la(s) 17.44.20.png]
>
> El Tuesday, April 9, 2024 a la(s) 12:19:25 PM UTC-4, Vincent Romero 
> escribió:
>
>> i will try make build, with this change
>>
>>
>>
>> El Saturday, April 6, 2024 a la(s) 2:45:29 PM UTC-3, Brian Candler 
>> escribió:
>>
>>> You're correct that currently the qname is statically configured in the 
>>> prober config.
>>>
>>> A patch was submitted to allow what you want, but hasn't been merged:
>>> https://github.com/prometheus/blackbox_exporter/pull/1105
>>>
>>> You can build blackbox_exporter yourself with this patch applied though.
>>>
>>> On Saturday 6 April 2024 at 18:06:01 UTC+1 Vincent Romero wrote:
>>>
>>>> Helo everyone
>>>>
>>>> what is the difference between http_2xx and dns module configuration
>>>>
>>>>
>>>> I have this example y my config
>>>>
>>>> blackbox.yml
>>>> modules:
>>>>   http_2xx:
>>>> prober: http
>>>> http:
>>>>   preferred_ip_protocol: "ip4"
>>>>   http_post_2xx:
>>>> prober: http
>>>> http:
>>>>   method: POST
>>>>   www.google.com:
>>>> prober: dns
>>>> timeout: 1s
>>>> dns:
>>>>   transport_protocol: "udp"
>>>>   preferred_ip_protocol: "ip4"
>>>>   query_name: "www.google.com"
>>>>   query_type: "A"
>>>>   valid_rcodes:
>>>> - NOERROR
>>>>
>>>> prometheus.yml
>>>>   - job_name: 'blackbox'
>>>> metrics_path: /probe
>>>> params:
>>>>   module: [http_2xx]
>>>> static_configs:
>>>>   - targets:
>>>> - https://www.google.com
>>>> relabel_configs:
>>>>   - source_labels: [__address__]
>>>> target_label: __param_target
>>>>   - source_labels: [__param_target]
>>>> target_label: instance
>>>>   - target_label: __address__
>>>> replacement: localhost:9115
>>>>
>>>>   - job_name: 'blackbox-dns-monitor'
>>>> scrape_interval: 1s
>>>> metrics_path: /probe
>>>>   #params:
>>>>   #module: [mindfree.cl]
>>>> relabel_configs:
>>>> # Populate domain label with domain portion of __address__
>>>> - source_labels: [__address__]
>>>>   regex: (.*):.*$
>>>>   replacement: $1
>>>>   target_label: domain
>>>> # Populate instance label with dns server IP portion of __address__
>>>> - source_labels: [__address__]
>>>>   regex: .*:(.*)$
>>>>   replacement: $1
>>>>   target_label: instance
>>>> # Populate module URL parameter with domain portion of __address__
>>>> # This is a parameter passed to the blackbox exporter
>>>> - source_labels: [domain]
>>>>   target_label: __param_module
>>>> # Populate target URL parameter with dns server IP
>>>> - source_labels: [instance]
>>>>   target_label: __param_target
>>>> # Populate __address__ with the address of the blackbox exporter to 
>>>> hit
>>>> - target_label: __address__
>>>>   replacement: localhost:9115
>>>>
>>>> static_configs:
>>>>   - targets:
>>>> - www.google.com:1.1.1.1 #australia cloudflare
>>>>  - www.google.com:8.8.8.8 #example other nameserver
>>>>
>>>>
>>>> So, i will try config a simple DNS resolution for any domain
>>>> If i want add other nameserver i need to add other line with the same 
>>>> domain
>>>>
>>>> Why whe i used module http_2xx need simple add the target
>>>>
>>>> Thanks
>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/74285578-2c0c-48e1-ac85-4ca80cd9bcffn%40googlegroups.com.


[prometheus-users] Re: what to do about flapping alerts?

2024-04-08 Thread 'Brian Candler' via Prometheus Users
On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:

Assume the following (arguably a bit made up) example:
One has a metric that counts the number of failed drives in a RAID. One 
drive fails so some alert starts firing. Eventually the computing centre 
replaces the drive and it starts rebuilding (guess it doesn't matter 
whether the rebuilding is still considered to cause an alert or not). 
Eventually it finishes and the alert should go away (and I should e.g. get 
a resolved message).
But because of keep_firing_for, it doesn't stop straight away.
Now before it does, yet another disk fails.
But for Prometheus, with keep_firing_for, it will be like the same alert.


If the alerts have the exact same set of labels (e.g. the alert is at the 
level of the RAID controller, not at the level of individual drives) then 
yes.

It failed, it fixed, it failed again within keep_firing_for: then you only 
get a single alert, with no additional notification.

But that's not the problem you originally asked for:

"When the target goes down, the alert clears and as soon as it's back, it 
pops up again, sending a fresh alert notification."

keep_firing_for can be set differently for different alerts.  So you can 
set it to 10m for the "up == 0" alert, and not set it at all for the RAID 
alert, if that's what you want.
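
For example (rule-file sketch; the RAID metric name is made up):

groups:
- name: example
  rules:
  - alert: TargetDown
    expr: up == 0
    for: 5m
    keep_firing_for: 10m      # de-flap scrape failures
  - alert: RaidDegraded
    expr: raid_failed_disks > 0
    # no keep_firing_for here: this alert resolves (and re-fires) immediately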

 


Also, depending on how large I have to set keep_firing_for, I will also get 
resolve messages later... which depending on what one does with the alerts 
may also be less desirable.


Surely that delay is essential for the de-flapping scenario you describe: 
you can't send the alert resolved message until you are *sure* the alert 
has resolved (i.e. after keep_firing_for).

Conversely: if you sent the alert resolved message immediately (before 
keeping_firing_for had expired), and the problem recurred, then you'd have 
to send out a new alert failing message - which is the flap noise I think 
you are asking to suppress.

In any case, sending out resolved messages is arguably a bad idea:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

I turned them off, and:
(a) it immediately reduced notifications by 50%
(b) it encourages that alerts are properly investigated (or that alerts are 
properly tuned)

That is: if something was important enough to alert on in the first place, 
then it's important enough to investigate thoroughly, even if the threshold 
has been crossed back to normal since then. And if it wasn't important 
enough to alert on, then the alerting rule needs adjusting to make it less 
noisy.

This is expanded upon in this document:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

 


I think the main problem behind may be rather a conceptual one, namely that 
Prometheus uses "no data" for no alert, which happens as well when there is 
no data because of e.g. scrape failures, so it can’t really differentiate 
between the two conditions.


I think it can.

Scrape failures can be explicitly detected by up == 0.  Alert on those 
separately.
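
For example, a sketch of a separate scrape-failure alert (the threshold and 
labels are illustrative):

groups:
  - name: scrape-health
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m   # tolerate a couple of missed scrapes before firing
        labels:
          severity: warning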

The odd occasional missed scrape doesn't affect most other queries because 
of the lookback-delta: i.e. instant vector queries will look up to 5 
minutes into the past. As long as you're scraping every 2 minutes, you can 
always survive a single failed scrape without noticing it.

If your device goes away for longer than 5 minutes, then sure the alerting 
data will no longer be there - but then you have no idea whether the 
condition you were alerting on or not exists (since you have no visibility 
of the target state).  Instead, you have a "scrape failed" condition, which 
as I said already, is easy to alert on.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6e6de7dd-b156-475f-b76d-6f758f2c3189n%40googlegroups.com.


[prometheus-users] Re: Config DNS Prometheus/Blackbox_Exporter

2024-04-06 Thread 'Brian Candler' via Prometheus Users
You're correct that currently the qname is statically configured in the 
prober config.

A patch was submitted to allow what you want, but hasn't been merged:
https://github.com/prometheus/blackbox_exporter/pull/1105

You can build blackbox_exporter yourself with this patch applied though.
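
Until then, the stock exporter needs one dns module per query name, roughly 
like this (a sketch; module names and query names are just examples):

modules:
  dns_google_a:
    prober: dns
    dns:
      query_name: "www.google.com"
      query_type: "A"
      valid_rcodes: [NOERROR]
  dns_example_a:
    prober: dns
    dns:
      query_name: "www.example.com"
      query_type: "A"
      valid_rcodes: [NOERROR]

and you select the module per target via the __param_module relabelling you 
already have.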

On Saturday 6 April 2024 at 18:06:01 UTC+1 Vincent Romero wrote:

> Helo everyone
>
> what is the difference between http_2xx and dns module configuration
>
>
> I have this example y my config
>
> blackbox.yml
> modules:
>   http_2xx:
> prober: http
> http:
>   preferred_ip_protocol: "ip4"
>   http_post_2xx:
> prober: http
> http:
>   method: POST
>   www.google.com:
> prober: dns
> timeout: 1s
> dns:
>   transport_protocol: "udp"
>   preferred_ip_protocol: "ip4"
>   query_name: "www.google.com"
>   query_type: "A"
>   valid_rcodes:
> - NOERROR
>
> prometheus.yml
>   - job_name: 'blackbox'
> metrics_path: /probe
> params:
>   module: [http_2xx]
> static_configs:
>   - targets:
> - https://www.google.com
> relabel_configs:
>   - source_labels: [__address__]
> target_label: __param_target
>   - source_labels: [__param_target]
> target_label: instance
>   - target_label: __address__
> replacement: localhost:9115
>
>   - job_name: 'blackbox-dns-monitor'
> scrape_interval: 1s
> metrics_path: /probe
>   #params:
>   #module: [mindfree.cl]
> relabel_configs:
> # Populate domain label with domain portion of __address__
> - source_labels: [__address__]
>   regex: (.*):.*$
>   replacement: $1
>   target_label: domain
> # Populate instance label with dns server IP portion of __address__
> - source_labels: [__address__]
>   regex: .*:(.*)$
>   replacement: $1
>   target_label: instance
> # Populate module URL parameter with domain portion of __address__
> # This is a parameter passed to the blackbox exporter
> - source_labels: [domain]
>   target_label: __param_module
> # Populate target URL parameter with dns server IP
> - source_labels: [instance]
>   target_label: __param_target
> # Populate __address__ with the address of the blackbox exporter to hit
> - target_label: __address__
>   replacement: localhost:9115
>
> static_configs:
>   - targets:
>     - www.google.com:1.1.1.1 #australia cloudflare
>     - www.google.com:8.8.8.8 #example other nameserver
>
>
> So, i will try config a simple DNS resolution for any domain
> If i want add other nameserver i need to add other line with the same 
> domain
>
> Why whe i used module http_2xx need simple add the target
>
> Thanks
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f2c1373c-51a6-446d-8ec1-d2e784abfd40n%40googlegroups.com.


[prometheus-users] Re: what to do about flapping alerts?

2024-04-06 Thread 'Brian Candler' via Prometheus Users
> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep 
firing, when the scraping failed, but also when it actually goes back to an 
ok state, right?

It affects all alerts individually, and I believe it's exactly what you 
want. A brief flip from "failing" to "OK" doesn't resolve the alert; it 
only resolves if it has remained in the "OK" state for the keep_firing_for 
duration. Therefore you won't get a fresh alert until it's been OK for at 
least keep_firing_for and *then* fails again.

As you correctly surmise, an alert isn't really a boolean condition, it's a 
presence/absence condition: the expr returns a vector of 0 or more alerts, 
each with a unique combination of labels.  "keep_firing_for" retains a 
particular labelled value in the vector for a period of time even if it's 
no longer being generated by the alerting "expr".  Hence if it does 
reappear in the expr output during that time, it's just a continuation of 
the previous alert.

> Similarly, when a node goes completely down (maintenance or so) and then 
up again, all alerts would then start again to fire (and even a generous 
keep_firing_for would have been exceeded)... and send new notifications.

I don't understand what you're saying here. Can you give some specific 
examples?

If you have an alerting expression like "up == 0" and you take 10 machines 
down then your alerting expression will return a vector of ten zeros and 
this will generate ten alerts (typically grouped into a single 
notification, if you use the default alertmanager config)

When they revert to up == 1 then they won't "start again to fire", because 
they were already firing. Indeed, it's almost the opposite. Let's say you 
have keep_firing_for: 10m, then if any machine goes down in the 10 minutes 
after the end of maintenance then it *won't* generate a new alert, because 
it will just be a continuation of the old one.

However, when you're doing maintenance, you might also be using silences to 
prevent notifications. In that case you might want your silence to extend 
10 minutes past the end of the maintenance period.

On Saturday 6 April 2024 at 04:03:07 UTC+1 Christoph Anton Mitterer wrote:

> Hey.
>
> I have some simple alerts like:
> - alert: node_upgrades_non-security_apt
>   expr:  'sum by (instance,job) ( 
> apt_upgrades_pending{origin!~"(?i)^.*-security(?:\\PL.*)?$"} )'
> - alert: node_upgrades_security_apt
>   expr:  'sum by (instance,job) ( 
> apt_upgrades_pending{origin=~"(?i)^.*-security(?:\\PL.*)?$"} )'
>
> If there's no upgrades, these give no value.
> Similarly, for all other simple alerts, like free disk space:
> 1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs", 
> instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} / 
> node_filesystem_size_bytes  >  0.80
>
> No value => all ok, some value => alert.
>
> I do have some instances which are pretty unstable (i.e. scraping fails 
> every know and then - or more often than that), which are however mostly 
> out of my control, so I cannot do anything about that.
>
> When the target goes down, the alert clears and as soon as it's back, it 
> pops up again, sending a fresh alert notification.
>
> Now I've seen:
> https://github.com/prometheus/prometheus/pull/11827
> which describes keep_firing_for as "the minimum amount of time that an 
> alert should remain firing, after the expression does not return any 
> results", respectively in 
> https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule
>  
> :
> # How long an alert will continue firing after the condition that 
> # triggered it has cleared.
> [ keep_firing_for: <duration> | default = 0s ] 
>
> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep 
> firing, when the scraping failed, but also when it actually goes back to an 
> ok state, right?
> That's IMO however rather undesirable.
>
> Similarly, when a node goes completely down (maintenance or so) and then 
> up again, all alerts would then start again to fire (and even a generous 
> keep_firing_for would have been exceeded)... and send new notifications.
>
>
> Is there any way to solve this? Especially that one doesn't get new 
> notifications sent, when the alert never really stopped?
>
> At least I wouldn't understand how keep_firing_for would do this.
>
> Thanks,
> Chris.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/fa157174-2d90-45f0-9084-dc28e52e88dan%40googlegroups.com.


[prometheus-users] Re: Prometheus alert tagging issue - multiple servers

2024-04-03 Thread 'Brian Candler' via Prometheus Users
On Wednesday 3 April 2024 at 16:01:21 UTC+1 mohan garden wrote:

Is there a way i can see the entire message which alert manager sends out 
to the Opsgenie? - somewhere in the alertmanager logs or a text file?


You could try setting api_url to point to a webserver that you control.
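
A minimal sketch of that (the listener address is illustrative; any simple 
HTTP server that logs request bodies will do):

receivers:
  - name: opsgenie-debug
    opsgenie_configs:
      - api_key: dummy
        api_url: http://127.0.0.1:8080/   # Alertmanager will POST the alert payloads here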

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/27e70b2b-9101-478e-9a2b-364f6287da32n%40googlegroups.com.


[prometheus-users] Re: Prometheus alert tagging issue - multiple servers

2024-04-03 Thread 'Brian Candler' via Prometheus Users
>> *Scenario1: *When server1 's local disk usage reaches 50%, i see that 
>> Opsgenie ticket is created having:
>> Opsgenie Ticket metadata: 
>> ticket header name:  local disk usage reached 50%
>> ticket description:  space on /var file system at server1:9100 server = 
>> 82%."
>> ticket tags: criteria: overuse , team: support, severity: critical, 
>> infra,monitor,host=server1
>>
>> so everything works as expected, no issues with Scenario1.
>>
>>
>> *Scenario2: *While server1 trigger is active, a second server ( say 
>> server2)'s local disk usage reaches 50%,
>>
>> i see that Opsgenie tickets are getting updated as:
>> ticket header name:  local disk usage reached 50%
>> ticket description:  space on /var file system at server1:9100 server = 
>> 82%."
>> ticket description:  space on /var file system at server2:9100 server = 
>> 80%."
>> ticket tags: criteria: overuse , team: support, severity: critical, 
>> infra,monitor,host=server1
>>
>>
>> but i was expecting an additional host=server2 tag on the ticket.  
>> in Summary - i see updated description , but unable to see updated tags.
>>
>> in tags section of the alertmanager - opsgenie integration configuration 
>> , i had tried iterating over Alerts and CommonLabels, but i was unable to 
>> add  additional host=server2 tag .
>> {{ range $idx, $alert := .Alerts}}{{range $k, $v := $alert.Labels 
>> }}{{$k}}={{$v}},{{end}}{{end}},test=test
>> {{ range $k, $v := .CommonLabels}}{{end}}
>>
>>
>> At the moment, i am not sure that what is potentially preventing the 
>> update of tags on the opsgenie tickets.
>> If i can get some clarity on the fact that if the configurations i have 
>> for  alertmanager are good enough, then i can look at the opsgenie 
>> configurations.
>>
>>
>> Please advice.
>>
>>
>> Regards
>> CP
>>
>>
>> On Tuesday, April 2, 2024 at 10:46:36 PM UTC+5:30 Brian Candler wrote:
>>
>>> FYI, those images are unreadable - copy-pasted text would be much better.
>>>
>>> My guess, though, is that you probably don't want to group alerts before 
>>> sending them to opsgenie. You haven't shown your full alertmanager config, 
>>> but if you have a line like
>>>
>>>group_by: ['alertname']
>>>
>>> then try
>>>
>>>group_by: ["..."]
>>>
>>> (literally, exactly that: a single string containing three dots, inside 
>>> square brackets)
>>>
>>> On Tuesday 2 April 2024 at 17:15:39 UTC+1 mohan garden wrote:
>>>
>>>> Dear Prometheus Community,
>>>> I am reaching out regarding an issue i have encountered with  
>>>> prometheus alert tagging, specifically while creating tickets in Opsgenie.
>>>>
>>>>
>>>> I have configured alertmanager  to send alerts to Opsgenie as , the 
>>>> configuration as :
>>>> [image: photo001.png] A ticket is generated with expected description 
>>>> and tags as - 
>>>> [image: photo002.png]
>>>>
>>>> Now, by default the alerts are grouped by the alert name( default 
>>>> behavior).So when the similar event happens on a different server i see 
>>>> that the description is updated as:
>>>> [image: photo003.png]
>>>> but the tag on the ticket remains same, 
>>>> expected behavior: criteria=..., host=108, host=114, infra.support 
>>>>
>>>> I have set update_alert and send_resolved settings to true.
>>>> I am not sure that in order to make it work as expected, If i need 
>>>> additional configuration at opsgenie or at the alertmanager. 
>>>>
>>>> I would appreciate any insight or guidance on the method to resolve 
>>>> this issue and ensure that alerts for different servers are correctly 
>>>> tagged in Opsgenie.
>>>>
>>>> Thank you in advance.
>>>> Regards,
>>>> CP
>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/9e2be26c-2fcf-46e4-af0a-9b4e56debaa1n%40googlegroups.com.


[prometheus-users] Re: Prometheus alert tagging issue - multiple servers

2024-04-02 Thread 'Brian Candler' via Prometheus Users
FYI, those images are unreadable - copy-pasted text would be much better.

My guess, though, is that you probably don't want to group alerts before 
sending them to opsgenie. You haven't shown your full alertmanager config, 
but if you have a line like

   group_by: ['alertname']

then try

   group_by: ["..."]

(literally, exactly that: a single string containing three dots, inside 
square brackets)
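
In context that looks something like this (receiver name is illustrative):

route:
  receiver: opsgenie
  group_by: ["..."]   # disable grouping: each alert is notified individually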

On Tuesday 2 April 2024 at 17:15:39 UTC+1 mohan garden wrote:

> Dear Prometheus Community,
> I am reaching out regarding an issue i have encountered with  prometheus 
> alert tagging, specifically while creating tickets in Opsgenie.
>
>
> I have configured alertmanager  to send alerts to Opsgenie as , the 
> configuration as :
> [image: photo001.png] A ticket is generated with expected description and 
> tags as - 
> [image: photo002.png]
>
> Now, by default the alerts are grouped by the alert name( default 
> behavior).So when the similar event happens on a different server i see 
> that the description is updated as:
> [image: photo003.png]
> but the tag on the ticket remains same, 
> expected behavior: criteria=..., host=108, host=114, infra.support 
>
> I have set update_alert and send_resolved settings to true.
> I am not sure that in order to make it work as expected, If i need 
> additional configuration at opsgenie or at the alertmanager. 
>
> I would appreciate any insight or guidance on the method to resolve this 
> issue and ensure that alerts for different servers are correctly tagged in 
> Opsgenie.
>
> Thank you in advance.
> Regards,
> CP
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/f4ec4e77-672a-42a5-ad5a-1aa9f82d6b3en%40googlegroups.com.


Re: [prometheus-users] Assistance Needed with Prometheus and Alertmanager Configuration

2024-03-30 Thread 'Brian Candler' via Prometheus Users
Only you can determine that, by comparing the lists of alerts from both 
sides and seeing what differs, and looking into how they are generated and 
measured. There are all kinds of things which might affect this, e.g. 
pending/keep_firing_for alerts, group wait etc.

But you might also want to read this:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

If you're generating more than a handful of alerts per day, then maybe you 
need to reconsider what constitutes an "alert".

On Saturday 30 March 2024 at 09:49:04 UTC Trio Official wrote:

> Thank you for your prompt response and guidance on addressing the metric 
> staleness issue.
>
> Regarding metric staleness  I confirm that I have already implemented the 
> approach to use square brackets for the recording metrics and alerting rule
>  (e.g. max_over_time(metric[1h])). However, the main challenge persists 
> with the discrepancy in the number of alerts generated by Prometheus 
> compared to those displayed in Alertmanager. 
>
> To illustrate, when observing Prometheus, I may observe approximately 
> 25,000 alerts triggered within a given period. However, when reviewing the 
> corresponding alerts in Alertmanager, the count often deviates 
> significantly, displaying figures such as 10,000 or 18,000, rather than the 
> expected 25,000.
>
> This inconsistency poses a significant challenge in our alert management 
> process, leading to confusion and potentially overlooking critical alerts.
>
> I would greatly appreciate any further insights or recommendations you may 
> have to address this issue and ensure alignment between Prometheus and 
> Alertmanager in terms of the number of alerts generated and displayed.
> On Saturday, March 30, 2024 at 2:29:42 PM UTC+5:30 Brian Candler wrote:
>
>> On Friday 29 March 2024 at 22:09:18 UTC Chris Siebenmann wrote:
>>
>> I believe that recording rules and alerting rules similarly may have 
>> their evaluation time happen at different offsets within their 
>> evaluation interval. This is done for the similar reason of spreading 
>> out the internal load of rule evaluations across time.
>>
>>
>> I think it's more accurate to say that *rule groups* are spread 
>> over their evaluation interval, and rules within the same rule group are 
>> evaluated 
>> sequentially 
>> <https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules>.
>>  
>> This is how you can build rules that depend on each other, e.g. a recording 
>> rule followed by other rules that depend on its output; put them in the 
>> same rule group.
>>
>> As for scraping: you *can* change this staleness interval, 
>> using --query.lookback-delta, but it's strongly not recommended. Using the 
>> default of 5 mins, you should use a maximum scrape interval of 2 mins so 
>> that even if you miss one scrape for a random reason, you still have two 
>> points within the lookback-delta so that the timeseries does not go stale.
>>
>> There's no good reason to scrape at one hour intervals:
>> * Prometheus is extremely efficient with its storage compression, 
>> especially when adjacent data points are equal, so scraping the same value 
>> every 2 minutes is going to use hardly any more storage than scraping it 
>> every hour.
>> * If you're worried about load on the exporter because responding to a 
>> scrape is slow or expensive, then you should run the exporter every hour 
>> from a local cronjob, and write its output to a persistent location (e.g. 
>> to PushGateway or statsd_exporter, or simply write it to a file which can 
>> be picked up by node_exporter textfile-collector or even a vanilla HTTP 
>> server).  You can then scrape this as often as you like.
>>
>> node_exporter textfile-collector exposes an extra metric for the 
>> timestamp on each file, so you can alert in the case that the file isn't 
>> being updated.
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/4471ac2e-ee83-494a-9a90-a7c86992a9f6n%40googlegroups.com.


Re: [prometheus-users] Assistance Needed with Prometheus and Alertmanager Configuration

2024-03-30 Thread 'Brian Candler' via Prometheus Users
On Friday 29 March 2024 at 22:09:18 UTC Chris Siebenmann wrote:

I believe that recording rules and alerting rules similarly may have 
their evaluation time happen at different offsets within their 
evaluation interval. This is done for the similar reason of spreading 
out the internal load of rule evaluations across time.


I think it's more accurate to say that *rule groups* are spread over their 
evaluation interval, and rules within the same rule group are evaluated 
sequentially 
<https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules>.
This is how you can build rules that depend on each other, e.g. a recording 
rule followed by other rules that depend on its output; put them in the 
same rule group.
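
A sketch of that pattern (metric, rule and group names are illustrative):

groups:
  - name: cpu
    interval: 1m
    rules:
      - record: instance:cpu_utilisation:ratio5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - alert: HighCpu
        expr: instance:cpu_utilisation:ratio5m > 0.9
        for: 15m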

As for scraping: you *can* change this staleness interval, 
using --query.lookback-delta, but it's strongly not recommended. Using the 
default of 5 mins, you should use a maximum scrape interval of 2 mins so 
that even if you miss one scrape for a random reason, you still have two 
points within the lookback-delta so that the timeseries does not go stale.
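
In prometheus.yml terms that's simply (a sketch):

global:
  scrape_interval: 2m   # one missed scrape still leaves a sample inside the 5m lookback-delta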

There's no good reason to scrape at one hour intervals:
* Prometheus is extremely efficient with its storage compression, 
especially when adjacent data points are equal, so scraping the same value 
every 2 minutes is going to use hardly any more storage than scraping it 
every hour.
* If you're worried about load on the exporter because responding to a 
scrape is slow or expensive, then you should run the exporter every hour 
from a local cronjob, and write its output to a persistent location (e.g. 
to PushGateway or statsd_exporter, or simply write it to a file which can 
be picked up by node_exporter textfile-collector or even a vanilla HTTP 
server).  You can then scrape this as often as you like.

node_exporter textfile-collector exposes an extra metric for the timestamp 
on each file, so you can alert in the case that the file isn't being 
updated.
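
For example, if the file is refreshed by an hourly cronjob, something like 
this (a sketch; the 2-hour threshold is arbitrary):

groups:
  - name: textfile-freshness
    rules:
      - alert: TextfileCollectorStale
        expr: time() - node_textfile_mtime_seconds > 7200
        for: 10m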

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/244eb39e-1ded-4161-80cf-b32deb9cd2c7n%40googlegroups.com.


[prometheus-users] Re: Relabeling for proxied hosts

2024-03-27 Thread 'Brian Candler' via Prometheus Users
According to the source in prometheus-common/model/labels.go, these are the 
only declared magic labels:

const (
        // AlertNameLabel is the name of the label containing an alert's name.
        AlertNameLabel = "alertname"

        // ExportedLabelPrefix is the prefix to prepend to the label names present in
        // exported metrics if a label of the same name is added by the server.
        ExportedLabelPrefix = "exported_"

        // MetricNameLabel is the label name indicating the metric name of a
        // timeseries.
        MetricNameLabel = "__name__"

        // SchemeLabel is the name of the label that holds the scheme on which to
        // scrape a target.
        SchemeLabel = "__scheme__"

        // AddressLabel is the name of the label that holds the address of
        // a scrape target.
        AddressLabel = "__address__"

        // MetricsPathLabel is the name of the label that holds the path on which to
        // scrape a target.
        MetricsPathLabel = "__metrics_path__"

        // ReservedLabelPrefix is a prefix which is not legal in user-supplied
        // label names.
        ReservedLabelPrefix = "__"

        // MetaLabelPrefix is a prefix for labels that provide meta information.
        // Labels with this prefix are used for intermediate label processing and
        // will not be attached to time series.
        MetaLabelPrefix = "__meta_"

        // TmpLabelPrefix is a prefix for temporary labels as part of relabelling.
        // Labels with this prefix are used for intermediate label processing and
        // will not be attached to time series. This is reserved for use in
        // Prometheus configuration files by users.
        TmpLabelPrefix = "__tmp_"

        // ParamLabelPrefix is a prefix for labels that provide URL parameters
        // used to scrape a target.
        ParamLabelPrefix = "__param_"

        // JobLabel is the label name indicating the job from which a timeseries
        // was scraped.
        JobLabel = "job"

        // InstanceLabel is the label name used for the instance label.
        InstanceLabel = "instance"

        // BucketLabel is used for the label that defines the upper bound of a
        // bucket of a histogram ("le" -> "less or equal").
        BucketLabel = "le"

        // QuantileLabel is used for the label that defines the quantile in a
        // summary.
        QuantileLabel = "quantile"
)

Hence I don't think you can do what you want in relabeling; you need 
separate jobs.
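
i.e. something like this (a sketch; hostnames and the proxy URL are 
illustrative):

scrape_configs:
  - job_name: node-direct
    static_configs:
      - targets: ['direct-host:9100']
  - job_name: node-proxied
    proxy_url: http://proxy.internal:3128   # only this job's scrapes go via the proxy
    static_configs:
      - targets: ['proxied-host:9100']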

On Wednesday 27 March 2024 at 21:20:08 UTC Mykola Buhryk wrote:

> Hello, 
>
> I'm looking for a possibility to have one Prometheus job that can include 
> targets that are not directly accessible by Prometheus.
>
> For now, I have 2 separate jobs, one for standard hosts, and a second one 
> for proxied where I need to set the *proxy_url* parameter
>
> So my question is, is there any way to achieve the same result with 
> relabeling within one job?
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/6a126188-03f6-4e60-acb1-01abe4a196c7n%40googlegroups.com.

