Re: [prometheus-users] Re: Unusual traffic in prometheus nodes.

2023-07-28 Thread Brian Candler
Another query to try:
topk(10, scrape_samples_scraped)
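
As a hedged variant on the same idea (assuming your retention still covers the comparison window, and using an illustrative 1-day offset), you could also rank targets by how much their per-scrape sample count has grown:

topk(10, scrape_samples_scraped - scrape_samples_scraped offset 1d)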

On Friday, 28 July 2023 at 09:53:00 UTC+1 Ben Kochie wrote:

> That's 7 billion metrics, which would require approximately 30-50 TiB of
> RAM.
>
> On Thu, Jul 27, 2023 at 5:50 PM Brian Candler  wrote:
>
>> As Stuart says, that looks correct, assuming your metrics don't have any 
>> labels other than the ones you've excluded. You'd save a lot of typing just 
>> by doing:
>>
>> sum(scrape_samples_scraped)
>>
>> which is expected to return a single value, with no labels (as it's 
>> summed across all timeseries of this metric).
>>
>> The value 7,525,871,918 does seem quite high - what was it before?  You 
>> can set an evaluation time for this query in the PromQL browser, or draw a 
>> graph of this expression over time, to see historical values.
>>
>> You could also look at
>> count(scrape_samples_scraped)
>>
>> or more simply
>> count(up)
>>
>> and see if that has jumped up: it would imply that lots more targets have 
>> been added (e.g. more pods are being monitored).
>>
>> If not, then as well as Stuart's suggestion of graphing 
>> "scrape_samples_scraped" by itself to see if one particular target is 
>> generating way more metrics than usual, you could try different summary 
>> variants like
>>
>> sum by (instance,job) (scrape_samples_scraped)
>> sum by (clusterName) (scrape_samples_scraped)
>> ... etc
>>
>> and see if there's a spike in any of these.  This may help you drill down 
>> to the offending item(s).
>>
>> On Thursday, 27 July 2023 at 15:51:24 UTC+1 Uvais Ibrahim wrote:
>>
>>> Hi Brian,
>>>
>>> This is the query that I have used.
>>>
>>> sum(scrape_samples_scraped)without(app,app_kubernetes_io_managed_by,clusterName,release,environment,instance,job,k8s_cluster,kubernetes_name,kubernetes_namespace,ou,app_kubernetes_io_component,app_kubernetes_io_name,app_kubernetes_io_version,kustomize_toolkit_fluxcd_io_name,kustomize_toolkit_fluxcd_io_namespace,application,name,role,app_kubernetes_io_instance,app_kubernetes_io_part_of,control_plane,beta_kubernetes_io_arch,beta_kubernetes_io_instance_type,
>>>  
>>> beta_kubernetes_io_os, failure_domain_beta_kubernetes_io_region, 
>>> failure_domain_beta_kubernetes_io_zone,kubernetes_io_arch, 
>>> kubernetes_io_hostname, kubernetes_io_os, node_kubernetes_io_instance_type, 
>>> nodegroup, topology_kubernetes_io_region, 
>>> topology_kubernetes_io_zone,chart,heritage,revised,transit,component,namespace,
>>>  
>>> pod_name, pod_template_hash, security_istio_io_tlsMode, 
>>> service_istio_io_canonical_name, 
>>> service_istio_io_canonical_revision,k8s_app,kubernetes_io_cluster_service,kubernetes_io_name,route_reflector)
>>>
>>> This query simply excludes every label, but I am still getting a result like 
>>> this:
>>>
>>> {}  7525871918
>>>
>>>
>>> It shouldn't return any results, right?
>>>
>>> Prometheus version: 2.36.2
>>>
>>> By increased traffic I meant that the Prometheus servers have been getting 
>>> high traffic since a specific point in time. Currently Prometheus is getting 
>>> around 13 million packets, whereas earlier it was around 2 to 3 million 
>>> packets on average. And the Prometheus endpoint is not public.
>>>
>>>
>>> On Thursday, July 27, 2023 at 6:06:10 PM UTC+5:30 Brian Candler wrote:
>>>
 scrape_samples_scraped always has the labels which prometheus itself 
 adds (i.e. job and instance).

 Extraordinary claims require extraordinary evidence. Are you saying 
 that the PromQL query *scrape_samples_scraped{job="",instance=""}* 
 returns a result?  If so, what's the number?  What do you mean by "with 
 increased size" - increased as compared to what? And what version of 
 prometheus are you running?

 In any case, what you see with scrape_samples_scraped may be completely 
 unrelated to the "high traffic" issue.  Is your prometheus server exposed 
 to the Internet? Maybe someone is accessing it remotely.  Even if not, you 
 can use packet capture to work out where the traffic is going to and from. 
  
 A tool like https://www.sniffnet.net/ may be helpful.

 On Thursday, 27 July 2023 at 13:14:25 UTC+1 Uvais Ibrahim wrote:

> Hi,
>
> Since last night, my Prometheus EC2 servers have been getting unusually high 
> traffic. When I was checking in Prometheus, I could see the 
> metric scrape_samples_scraped with an increased value but without any 
> labels. What could be the reason?
>
>
> Thanks,
> Uvais Ibrahim
>
>
>

Re: [prometheus-users] Re: Unusual traffic in prometheus nodes.

2023-07-28 Thread Ben Kochie
That's 7 billion metrics, which would require approximately 30-50 TiB of
RAM.
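
(As a rough sanity check, assuming the often-quoted ballpark of very roughly
4-8 KiB of head memory per active series - an assumption, not an exact figure:

7,500,000,000 series * ~4 KiB ~= 30 TiB
7,500,000,000 series * ~7 KiB ~= 50 TiB)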

On Thu, Jul 27, 2023 at 5:50 PM Brian Candler  wrote:

> As Stuart says, that looks correct, assuming your metrics don't have any
> labels other than the ones you've excluded. You'd save a lot of typing just
> by doing:
>
> sum(scrape_samples_scraped)
>
> which is expected to return a single value, with no labels (as it's summed
> across all timeseries of this metric).
>
> The value 7,525,871,918 does seem quite high - what was it before?  You
> can set an evaluation time for this query in the PromQL browser, or draw a
> graph of this expression over time, to see historical values.
>
> You could also look at
> count(scrape_samples_scraped)
>
> or more simply
> count(up)
>
> and see if that has jumped up: it would imply that lots more targets have
> been added (e.g. more pods are being monitored).
>
> If not, then as well as Stuart's suggestion of graphing
> "scrape_samples_scraped" by itself to see if one particular target is
> generating way more metrics than usual, you could try different summary
> variants like
>
> sum by (instance,job) (scrape_samples_scraped)
> sum by (clusterName) (scrape_samples_scraped)
> ... etc
>
> and see if there's a spike in any of these.  This may help you drill down
> to the offending item(s).
>
> On Thursday, 27 July 2023 at 15:51:24 UTC+1 Uvais Ibrahim wrote:
>
>> Hi Brian,
>>
>> This is the query that I have used.
>>
>> sum(scrape_samples_scraped)without(app,app_kubernetes_io_managed_by,clusterName,release,environment,instance,job,k8s_cluster,kubernetes_name,kubernetes_namespace,ou,app_kubernetes_io_component,app_kubernetes_io_name,app_kubernetes_io_version,kustomize_toolkit_fluxcd_io_name,kustomize_toolkit_fluxcd_io_namespace,application,name,role,app_kubernetes_io_instance,app_kubernetes_io_part_of,control_plane,beta_kubernetes_io_arch,beta_kubernetes_io_instance_type,
>> beta_kubernetes_io_os, failure_domain_beta_kubernetes_io_region,
>> failure_domain_beta_kubernetes_io_zone,kubernetes_io_arch,
>> kubernetes_io_hostname, kubernetes_io_os, node_kubernetes_io_instance_type,
>> nodegroup, topology_kubernetes_io_region,
>> topology_kubernetes_io_zone,chart,heritage,revised,transit,component,namespace,
>> pod_name, pod_template_hash, security_istio_io_tlsMode,
>> service_istio_io_canonical_name,
>> service_istio_io_canonical_revision,k8s_app,kubernetes_io_cluster_service,kubernetes_io_name,route_reflector)
>>
>> This query simply excludes every label, but I am still getting a result like
>> this:
>>
>> {}  7525871918
>>
>>
>> It shouldn't return any results, right?
>>
>> Prometheus version: 2.36.2
>>
>> By increased traffic I meant that the Prometheus servers have been getting
>> high traffic since a specific point in time. Currently Prometheus is getting
>> around 13 million packets, whereas earlier it was around 2 to 3 million
>> packets on average. And the Prometheus endpoint is not public.
>>
>>
>> On Thursday, July 27, 2023 at 6:06:10 PM UTC+5:30 Brian Candler wrote:
>>
>>> scrape_samples_scraped always has the labels which prometheus itself
>>> adds (i.e. job and instance).
>>>
>>> Extraordinary claims require extraordinary evidence. Are you saying that
>>> the PromQL query *scrape_samples_scraped{job="",instance=""}* returns a
>>> result?  If so, what's the number?  What do you mean by "with increased
>>> size" - increased as compared to what? And what version of prometheus are
>>> you running?
>>>
>>> In any case, what you see with scrape_samples_scraped may be completely
>>> unrelated to the "high traffic" issue.  Is your prometheus server exposed
>>> to the Internet? Maybe someone is accessing it remotely.  Even if not, you
>>> can use packet capture to work out where the traffic is going to and from.
>>> A tool like https://www.sniffnet.net/ may be helpful.
>>>
>>> On Thursday, 27 July 2023 at 13:14:25 UTC+1 Uvais Ibrahim wrote:
>>>
 Hi,

 Since last night, my Prometheus EC2 servers have been getting unusually high
 traffic. When I was checking in Prometheus, I could see the
 metric scrape_samples_scraped with an increased value but without any
 labels. What could be the reason?


 Thanks,
 Uvais Ibrahim





[prometheus-users] Re: Unusual traffic in prometheus nodes.

2023-07-27 Thread Brian Candler
As Stuart says, that looks correct, assuming your metrics don't have any 
labels other than the ones you've excluded. You'd save a lot of typing just 
by doing:

sum(scrape_samples_scraped)

which is expected to return a single value, with no labels (as it's summed 
across all timeseries of this metric).

The value 7,525,871,918 does seem quite high - what was it before?  You can 
set an evaluation time for this query in the PromQL browser, or draw a graph 
of this expression over time, to see historical values.
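
One hedged way to see an earlier value without changing the evaluation time (assuming the data is still within retention) is an offset query, for example:

sum(scrape_samples_scraped) offset 1w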

You could also look at
count(scrape_samples_scraped)

or more simply
count(up)

and see if that has jumped up: it would imply that lots more targets have 
been added (e.g. more pods are being monitored).
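
If it has jumped, a quick sketch (a hedged example) to see which job the extra targets belong to:

count by (job) (up)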

If not, then as well as Stuart's suggestion of graphing 
"scrape_samples_scraped" by itself to see if one particular target is 
generating way more metrics than usual, you could try different summary 
variants like

sum by (instance,job) (scrape_samples_scraped)
sum by (clusterName) (scrape_samples_scraped)
... etc

and see if there's a spike in any of these.  This may help you drill down 
to the offending item(s).

On Thursday, 27 July 2023 at 15:51:24 UTC+1 Uvais Ibrahim wrote:

> Hi Brian,
>
> This is the query that I have used.
>
> sum(scrape_samples_scraped)without(app,app_kubernetes_io_managed_by,clusterName,release,environment,instance,job,k8s_cluster,kubernetes_name,kubernetes_namespace,ou,app_kubernetes_io_component,app_kubernetes_io_name,app_kubernetes_io_version,kustomize_toolkit_fluxcd_io_name,kustomize_toolkit_fluxcd_io_namespace,application,name,role,app_kubernetes_io_instance,app_kubernetes_io_part_of,control_plane,beta_kubernetes_io_arch,beta_kubernetes_io_instance_type,
>  
> beta_kubernetes_io_os, failure_domain_beta_kubernetes_io_region, 
> failure_domain_beta_kubernetes_io_zone,kubernetes_io_arch, 
> kubernetes_io_hostname, kubernetes_io_os, node_kubernetes_io_instance_type, 
> nodegroup, topology_kubernetes_io_region, 
> topology_kubernetes_io_zone,chart,heritage,revised,transit,component,namespace,
>  
> pod_name, pod_template_hash, security_istio_io_tlsMode, 
> service_istio_io_canonical_name, 
> service_istio_io_canonical_revision,k8s_app,kubernetes_io_cluster_service,kubernetes_io_name,route_reflector)
>
> This query simply excludes every label, but I am still getting a result like this:
>
> {}  7525871918
>
>
> It shouldn't return any results, right?
>
> Prometheus version: 2.36.2
>
> By increased traffic I meant that the Prometheus servers have been getting high 
> traffic since a specific point in time. Currently Prometheus is getting around 
> 13 million packets, whereas earlier it was around 2 to 3 million packets on 
> average. And the Prometheus endpoint is not public.
>
>
> On Thursday, July 27, 2023 at 6:06:10 PM UTC+5:30 Brian Candler wrote:
>
>> scrape_samples_scraped always has the labels which prometheus itself adds 
>> (i.e. job and instance).
>>
>> Extraordinary claims require extraordinary evidence. Are you saying that 
>> the PromQL query *scrape_samples_scraped{job="",instance=""}* returns a 
>> result?  If so, what's the number?  What do you mean by "with increased 
>> size" - increased as compared to what? And what version of prometheus are 
>> you running?
>>
>> In any case, what you see with scrape_samples_scraped may be completely 
>> unrelated to the "high traffic" issue.  Is your prometheus server exposed 
>> to the Internet? Maybe someone is accessing it remotely.  Even if not, you 
>> can use packet capture to work out where the traffic is going to and from.  
>> A tool like https://www.sniffnet.net/ may be helpful.
>>
>> On Thursday, 27 July 2023 at 13:14:25 UTC+1 Uvais Ibrahim wrote:
>>
>>> Hi,
>>>
>>> Since last night, my Prometheus EC2 servers have been getting unusually high 
>>> traffic. When I was checking in Prometheus, I could see the 
>>> metric scrape_samples_scraped with an increased value but without any 
>>> labels. What could be the reason?
>>>
>>>
>>> Thanks,
>>> Uvais Ibrahim
>>>
>>>
>>>
>>>



Re: [prometheus-users] Re: Unusual traffic in prometheus nodes.

2023-07-27 Thread Stuart Clark

On 27/07/2023 15:51, Uvais Ibrahim wrote:

Hi Brian,

This is the query that I have used.

sum(scrape_samples_scraped)without(app,app_kubernetes_io_managed_by,clusterName,release,environment,instance,job,k8s_cluster,kubernetes_name,kubernetes_namespace,ou,app_kubernetes_io_component,app_kubernetes_io_name,app_kubernetes_io_version,kustomize_toolkit_fluxcd_io_name,kustomize_toolkit_fluxcd_io_namespace,application,name,role,app_kubernetes_io_instance,app_kubernetes_io_part_of,control_plane,beta_kubernetes_io_arch,beta_kubernetes_io_instance_type, 
beta_kubernetes_io_os, failure_domain_beta_kubernetes_io_region, 
failure_domain_beta_kubernetes_io_zone,kubernetes_io_arch, 
kubernetes_io_hostname, kubernetes_io_os, 
node_kubernetes_io_instance_type, nodegroup, 
topology_kubernetes_io_region, 
topology_kubernetes_io_zone,chart,heritage,revised,transit,component,namespace, 
pod_name, pod_template_hash, security_istio_io_tlsMode, 
service_istio_io_canonical_name, 
service_istio_io_canonical_revision,k8s_app,kubernetes_io_cluster_service,kubernetes_io_name,route_reflector)


This query simply excludes every label, but I am still getting a result like 
this:


{}  7525871918

I'm not sure what you are expecting, as that sounds about right. The 
query is adding together all the different variants of the 
scrape_samples_scraped metric (removing all the different labels), so if 
that is indeed a list of every label, the query is going to return a 
value without any associated labels.


Instead, you want to just graph the raw scrape_samples_scraped 
metric (no sum or without) and see how it varies over time. Is there a 
particular job or target which has a huge increase in the graph, or new 
series appearing? As to why that might happen, there could be many different 
reasons, but ideas could include (see the example queries after the list below):


* new version of software which increases number of exposed metrics (or 
more granular labels)
* bug in software where a label is set to something with high 
cardinality (e.g. there is a "path" label from a web app, which means a 
potentially infinite cardinality, and you could have had a web scan 
producing millions of combinations)
* lots of changes to the targets, such as new instances of software or 
high churn of applications restarting
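
As a rough sketch of the kind of queries that can help narrow this down (hedged examples; prometheus_tsdb_head_series assumes your Prometheus server scrapes its own /metrics endpoint):

topk(10, sum by (job) (scrape_samples_scraped))
prometheus_tsdb_head_series

The first ranks jobs by total samples per scrape; the second, graphed over time, shows whether the total number of in-memory series jumped at the same point the traffic did.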


--
Stuart Clark



[prometheus-users] Re: Unusual traffic in prometheus nodes.

2023-07-27 Thread Uvais Ibrahim
Hi Brian,

This is the query that I have used.

sum(scrape_samples_scraped)without(app,app_kubernetes_io_managed_by,clusterName,release,environment,instance,job,k8s_cluster,kubernetes_name,kubernetes_namespace,ou,app_kubernetes_io_component,app_kubernetes_io_name,app_kubernetes_io_version,kustomize_toolkit_fluxcd_io_name,kustomize_toolkit_fluxcd_io_namespace,application,name,role,app_kubernetes_io_instance,app_kubernetes_io_part_of,control_plane,beta_kubernetes_io_arch,beta_kubernetes_io_instance_type,
 
beta_kubernetes_io_os, failure_domain_beta_kubernetes_io_region, 
failure_domain_beta_kubernetes_io_zone,kubernetes_io_arch, 
kubernetes_io_hostname, kubernetes_io_os, node_kubernetes_io_instance_type, 
nodegroup, topology_kubernetes_io_region, 
topology_kubernetes_io_zone,chart,heritage,revised,transit,component,namespace, 
pod_name, pod_template_hash, security_istio_io_tlsMode, 
service_istio_io_canonical_name, 
service_istio_io_canonical_revision,k8s_app,kubernetes_io_cluster_service,kubernetes_io_name,route_reflector)

This query simply excludes every label, but I am still getting a result like this:

{}  7525871918


It shouldn't return any results, right?

Prometheus version: 2.36.2

By increased traffic I meant that the Prometheus servers have been getting high 
traffic since a specific point in time. Currently Prometheus is getting around 
13 million packets, whereas earlier it was around 2 to 3 million packets on 
average. And the Prometheus endpoint is not public.
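
Side note (hedged, and assuming node_exporter is running on these EC2 hosts): the packet rate can be graphed over time to pin down exactly when the jump started, e.g.:

rate(node_network_receive_packets_total[5m])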


On Thursday, July 27, 2023 at 6:06:10 PM UTC+5:30 Brian Candler wrote:

> scrape_samples_scraped always has the labels which prometheus itself adds 
> (i.e. job and instance).
>
> Extraordinary claims require extraordinary evidence. Are you saying that 
> the PromQL query *scrape_samples_scraped{job="",instance=""}* returns a 
> result?  If so, what's the number?  What do you mean by "with increased 
> size" - increased as compared to what? And what version of prometheus are 
> you running?
>
> In any case, what you see with scrape_samples_scraped may be completely 
> unrelated to the "high traffic" issue.  Is your prometheus server exposed 
> to the Internet? Maybe someone is accessing it remotely.  Even if not, you 
> can use packet capture to work out where the traffic is going to and from.  
> A tool like https://www.sniffnet.net/ may be helpful.
>
> On Thursday, 27 July 2023 at 13:14:25 UTC+1 Uvais Ibrahim wrote:
>
>> Hi,
>>
>> Since last night, my Prometheus EC2 servers have been getting unusually high 
>> traffic. When I was checking in Prometheus, I could see the 
>> metric scrape_samples_scraped with an increased value but without any 
>> labels. What could be the reason?
>>
>>
>> Thanks,
>> Uvais Ibrahim
>>
>>
>>
>>



[prometheus-users] Re: Unusual traffic in prometheus nodes.

2023-07-27 Thread Brian Candler
scrape_samples_scraped always has the labels which prometheus itself adds 
(i.e. job and instance).

Extraordinary claims require extraordinary evidence. Are you saying that 
the PromQL query *scrape_samples_scraped{job="",instance=""}* returns a 
result?  If so, what's the number?  What do you mean by "with increased 
size" - increased as compared to what? And what version of prometheus are 
you running?

In any case, what you see with scrape_samples_scraped may be completely 
unrelated to the "high traffic" issue.  Is your prometheus server exposed 
to the Internet? Maybe someone is accessing it remotely.  Even if not, you 
can use packet capture to work out where the traffic is going to and from.  
A tool like https://www.sniffnet.net/ may be helpful.
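
A hedged complement to packet capture (assuming Prometheus scrapes its own /metrics and that this metric name matches your version): Prometheus's own request rate can show whether the extra traffic is queries hitting the HTTP API rather than scrape traffic, e.g.:

sum by (handler, code) (rate(prometheus_http_requests_total[5m]))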

On Thursday, 27 July 2023 at 13:14:25 UTC+1 Uvais Ibrahim wrote:

> Hi,
>
> Since last night, my Prometheus EC2 servers have been getting unusually high 
> traffic. When I was checking in Prometheus, I could see the 
> metric scrape_samples_scraped with an increased value but without any 
> labels. What could be the reason?
>
>
> Thanks,
> Uvais Ibrahim
>
>
>
>
