[prometheus-users] Re: Prometheus HA different metrics

2023-09-05 Thread Brian Candler
On Tuesday, 5 September 2023 at 14:26:07 UTC+1 Анастасия Зель wrote:

> I only have the pod IP, and I can't reach it from the Prometheus node
> because they are in different subnets.


Hosts on different subnets *could* talk to each other - that's what routers 
are for.

It's quite possible that you have a routing or network reachability issue, 
but you'll have to work out why you can reach some pods but not others.  
That will be down to how your particular k8s cluster(s) have been built and 
configured.

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/8a88f431-07c4-4b2f-873b-3152f06db064n%40googlegroups.com.


Re: [prometheus-users] Re: Prometheus HA different metrics

2023-09-05 Thread Stuart Clark

On 2023-09-05 14:26, Анастасия Зель wrote:

> Yeah, I think scraping manually would be useful, but remember that
> these are k8s pods :)
> I only have the pod IP, and I can't reach it from the Prometheus node
> because they are in different subnets. The pod subnet doesn't have
> access to the outside network, so I don't know how I can manually
> scrape a particular pod target from the Prometheus server.



That would explain why it isn't working. You need to have network 
connectivity to all of your scrape targets from the Prometheus server. 
So if you have configured Prometheus to scrape every pod (via the 
Kubernetes SD for example) the Prometheus server will either need to be 
inside the cluster or connected to the same network mechanism as the 
pods.
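
A plain TCP connection test, run from the Prometheus host, is one quick
way to confirm a connectivity problem like this. The sketch below is only
an illustration: the function name `can_connect` is made up for this
example, and a successful connection only shows that the port is
reachable, not that the metrics endpoint behind it works.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False
```

For example, `can_connect("10.244.1.23", 8080)` (a hypothetical pod IP and
port) returning False from the Prometheus host, while the same call
returns True from a node inside the cluster, would point at routing or
firewalling rather than at Prometheus itself.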


--
Stuart Clark



[prometheus-users] Re: Prometheus HA different metrics

2023-09-05 Thread Анастасия Зель
Yeah, I think scraping manually would be useful, but remember that these 
are k8s pods :)
I only have the pod IP, and I can't reach it from the Prometheus node 
because they are in different subnets. The pod subnet doesn't have access 
to the outside network, so I don't know how I can manually scrape a 
particular pod target from the Prometheus server.

But thank you for your guesses, I will check them out.



[prometheus-users] Re: Prometheus HA different metrics

2023-09-05 Thread Brian Candler
> they fail 100% of the time on the Prometheus server where they are down

Then you're lucky: in principle it's straightforward to debug.
- get a shell on the affected prometheus server
- use "curl" to do a manual scrape of the target which is down (using the 
same URL that the Targets list shows)
- if it fails, then you've taken Prometheus out of the equation.
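
If curl isn't available on the Prometheus host, the same manual scrape
can be approximated in Python. This is a rough sketch, not a replacement
for what Prometheus does: the name `manual_scrape` and the shape of the
returned dict are invented here, and it ignores things like authentication
or TLS settings that a real scrape config might carry.

```python
import time
import urllib.error
import urllib.request

# Ignore proxy environment variables so the request goes straight to the target.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def manual_scrape(url: str, timeout: float = 10.0) -> dict:
    """Fetch a metrics URL once and report the outcome and elapsed time."""
    started = time.monotonic()
    result = {"ok": False, "status": None, "bytes": 0, "error": None}
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            result["bytes"] = len(resp.read())
            result["status"] = resp.status
            result["ok"] = True
    except urllib.error.HTTPError as exc:
        # The target answered, but with an HTTP error status.
        result["status"] = exc.code
        result["error"] = str(exc)
    except OSError as exc:
        # URLError, refused connections and timeouts are all OSError subclasses.
        result["error"] = str(exc)
    result["seconds"] = time.monotonic() - started
    return result
```

Pass it the exact URL shown on the Targets page, with `timeout` set to
your scrape timeout. A result with `ok` False and `seconds` close to the
timeout suggests the target is unreachable or too slow; a quick failure
with an error message suggests a refused connection or a wrong URL.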

My best guesses would be (1) Network connectivity between the Prometheus 
server and the affected pods, or (2) service discovery is giving wrong 
information (i.e. you're scraping the wrong URL in the first place)

In case (2), I note that you're getting the targets to scrape from pod 
annotations. Look carefully at the values of those annotations, and how 
they are mapped into scrape address/port/path for the affected pods.
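
To make the annotation check concrete, here is a sketch of one common
way annotations get turned into a scrape URL. It rests on assumptions:
the `prometheus.io/scrape`, `prometheus.io/scheme`, `prometheus.io/path`
and `prometheus.io/port` names are a widely used convention, not
something Prometheus defines, and the real mapping is whatever the
relabel_configs in your own config say (which this thread doesn't show).
The function name is made up too.

```python
from typing import Mapping, Optional

# Assumed annotation names: a common convention, not defined by Prometheus.
PREFIX = "prometheus.io/"


def scrape_url(pod_ip: str, annotations: Mapping[str, str]) -> Optional[str]:
    """Build the URL this convention would scrape, or None if the pod opts out."""
    if annotations.get(PREFIX + "scrape") != "true":
        return None
    scheme = annotations.get(PREFIX + "scheme", "http")
    path = annotations.get(PREFIX + "path", "/metrics")
    # This sketch insists on an explicit port; real relabel rules may fall
    # back to the discovered container port instead.
    port = annotations.get(PREFIX + "port", "")
    if not port.isdigit() or not 0 < int(port) < 65536:
        raise ValueError(f"bad port annotation: {port!r}")
    if not path.startswith("/"):
        raise ValueError(f"path annotation must start with '/': {path!r}")
    return f"{scheme}://{pod_ip}:{port}{path}"
```

Running the annotations of a failing pod through something like this and
comparing the result with the URL on the Targets page can show quickly
whether a stray space, a wrong port or a missing leading slash is to
blame.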




[prometheus-users] Re: Prometheus HA different metrics

2023-09-05 Thread Анастасия Зель
Actually the targets are on different k8s nodes, but they fail 100% of the 
time on the Prometheus server where they are down. 
I got a list of all the down pod targets and noticed that the number of 
down pods is the same on both Prometheus nodes: 306 down pod targets. But 
they are different targets :D
Yes, both servers scrape the same pod URLs.



[prometheus-users] Re: Prometheus HA different metrics

2023-09-05 Thread Brian Candler
Note that setting the scrape timeout longer than the scrape interval won't 
achieve anything.

I'd suggest you investigate by looking at the history of the "up" metric: 
this will go to zero on scrape failures.  Can you discern a pattern?  Is it 
only on a certain type of target, or targets running on a particular k8s 
node?  Is it intermittent across all targets, or some targets which fail 
100% of the time?

If you compare the Targets page on both servers, are they scraping exactly 
the same URLs?  (That is, check whether service discovery is giving 
different results)
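
Comparing the two Targets pages by eye gets tedious with hundreds of
pods. The same information is available from the `/api/v1/targets`
endpoint of the HTTP API, so the comparison can be scripted. The sketch
below assumes both servers are reachable from wherever you run it; the
function names and the result keys are made up for this example.

```python
import json
import urllib.request


def fetch_targets(base_url: str, timeout: float = 10.0) -> dict:
    """GET /api/v1/targets from a Prometheus server and decode the JSON."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def scrape_urls(payload: dict, health: str = "") -> set:
    """Scrape URLs of active targets, optionally filtered by health."""
    return {
        t["scrapeUrl"]
        for t in payload["data"]["activeTargets"]
        if not health or t["health"] == health
    }


def compare(payload_a: dict, payload_b: dict) -> dict:
    """Show where two servers differ in discovered and in down targets."""
    all_a, all_b = scrape_urls(payload_a), scrape_urls(payload_b)
    down_a = scrape_urls(payload_a, "down")
    down_b = scrape_urls(payload_b, "down")
    return {
        "discovered_only_by_a": sorted(all_a - all_b),
        "discovered_only_by_b": sorted(all_b - all_a),
        "down_only_on_a": sorted(down_a - down_b),
        "down_only_on_b": sorted(down_b - down_a),
        "down_on_both": sorted(down_a & down_b),
    }
```

Usage would be along the lines of
`compare(fetch_targets("http://prometheus-1:9090"), fetch_targets("http://prometheus-2:9090"))`,
where both hostnames are placeholders. Non-empty "discovered_only_by"
lists mean service discovery differs; empty ones with differing "down"
lists point back at connectivity from each server.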




[prometheus-users] Re: Prometheus HA different metrics

2023-09-04 Thread Анастасия Зель
Yes, I see the errors on the Targets page in the web interface.
I tried increasing the timeout to 5 minutes and it changed nothing. 
It's strange, because Prometheus 2 always gets this error on similar pods, 
and Prometheus 1 never gets these errors on those pods. 



Re: [prometheus-users] Re: Prometheus HA different metrics

2023-09-04 Thread Ben Kochie
On Mon, Sep 4, 2023 at 5:00 PM Brian Candler wrote:

> Where *do* you see the "context deadline exceeded" errors then?
>

Usually on the `/targets` page.

Prometheus does not log scrape errors by default. I would love this to be a
configuration option, or even better, a per-job `scrape_configs` option.
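
In the meantime, the errors shown on `/targets` can be pulled out of the
`/api/v1/targets` API response and summarised, which gives something
close to a scrape error log. A minimal sketch, assuming `payload` is the
already decoded JSON from that endpoint; the function name is made up:

```python
from collections import Counter


def error_summary(payload: dict) -> Counter:
    """Count down targets per (job, lastError) pair."""
    counts = Counter()
    for target in payload["data"]["activeTargets"]:
        if target.get("health") == "down":
            job = target.get("labels", {}).get("job", "?")
            counts[(job, target.get("lastError", ""))] += 1
    return counts


if __name__ == "__main__":
    # Hand-written sample in the shape of the API response, not real data.
    sample = {"data": {"activeTargets": [
        {"health": "down", "labels": {"job": "pods"},
         "lastError": "context deadline exceeded"},
        {"health": "up", "labels": {"job": "pods"}, "lastError": ""},
    ]}}
    for (job, err), n in error_summary(sample).most_common():
        print(f"{n:5d}  {job}  {err}")
```

If nearly every down target shares one error string, such as "context
deadline exceeded", that points at a common cause like network
reachability rather than at individual exporters.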




[prometheus-users] Re: Prometheus HA different metrics

2023-09-04 Thread Brian Candler
On Monday, 4 September 2023 at 15:49:25 UTC+1 Анастасия Зель wrote:

> Hello, we use HA Prometheus with two servers.

You mean, two Prometheus servers with the same config, both scraping the 
same targets?

> The problem is we get different metrics in dashboards from these two
> servers.

Small differences are to be expected.  That's because the two servers won't 
be scraping the targets at the same points in time.  If you see more 
significant differences, then please provide some examples.

> And we also scrape metrics from k8s, and some pods are not being scraped
> because of the error "context deadline exceeded".

That basically means "scrape timed out".  The scrape hadn't completed 
within the "scrape_timeout:" value that you've set.  You'll need to look at 
your individual exporters and the failing scrape URLs: either the target is 
not reachable at all (e.g. firewalling or network configuration issue), or 
the target is taking too long to respond.

> It's different pods on each server. In the Prometheus logs we don't see
> any errors.

Where *do* you see the "context deadline exceeded" errors then?
