On Sat, Oct 8, 2022 at 4:22 AM Muthuveerappan Periyakaruppan <
muthu.veerap...@gmail.com> wrote:

> Please find replies inline.
>
> On Friday, 7 October, 2022 at 1:25:27 pm UTC+5:30 Stuart Clark wrote:
>
>> On 07/10/2022 04:09, Muthuveerappan Periyakaruppan wrote:
>> > We have a situation where we have 8 to 15 million head series in
>> > each Prometheus, and we have 7 instances of them (federated). Our
>> > Prometheus servers are constantly flooded handling the incoming
>> > metrics and back-end recording rules.
>>
>> 8-15 million time series on a single Prometheus instance is pretty high.
>> What spec machine/pod are these?
>>
> 90 GB RAM, 5000 millicores.
>

Wait, are you federating multiple Prometheus instances on multiple clusters
into one? Maybe you should look at Thanos instead. It lets you federate,
but without actually forcing you to put all the data in one service.

We have a Thanos setup with 250+ million metrics. Thousands of Prometheus
instances across multiple large Kubernetes clusters.
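
Roughly what that looks like, if it helps (just a sketch, assuming a
Kubernetes setup; the image tag, paths, and bucket config file below are
placeholders you would adapt): you add a Thanos sidecar container next to
each Prometheus and point a central Thanos Query at the sidecars' Store APIs.

  # Sketch only: Thanos sidecar container added to each Prometheus pod.
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.28.0
    args:
      - sidecar
      - --prometheus.url=http://localhost:9090         # Prometheus in the same pod
      - --tsdb.path=/prometheus                        # same volume Prometheus writes to
      - --objstore.config-file=/etc/thanos/bucket.yml  # optional long-term storage
    ports:
      - name: grpc
        containerPort: 10901                           # Store API that Thanos Query fans out to

The querier then gets a --store (or DNS SD) entry per sidecar, so no single
Prometheus ever has to hold all the data.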


>
>
>> When you say "flooded", what do you mean?
>>
>
> Always high RAM usage, no OOMs, but missing metrics and an average
> scrape duration of around 35 seconds ... (maybe due to the number of targets/metrics).
> CPU demand/usage is not that high.
>
>
>> > One thought which came to mind was: do we have something similar to log
>> > levels for Prometheus metrics? If that exists, we could benefit
>> > from it by configuring all targets to run at error level in
>> > production and at debug/info level in development. This would help
>> > control the flooding of metrics.
>> >
>> I'm not sure I understand what you are suggesting. What would be
>> the difference between these hypothetical "error" and "debug"
>> levels? Do you mean some metrics would only be exposed in some
>> environments?
>>
> Let's say every pod has close to 100 metrics; we may not need all of them
> in production ...
>

100 metrics per pod is not a lot. Is that really what you're running? At 8 to
15 million series per Prometheus, that works out to between 80k and 150k pods.
Is that all in a single cluster?

And you said you have a scrape duration of 35s. For 100 metrics per pod,
your scrape duration should be closer to 35 milliseconds.

Something in what you're saying doesn't add up.
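
If you want to see where the time and the series are actually going, the
per-target scrape metadata that Prometheus records for every target
(scrape_duration_seconds, scrape_samples_scraped) is a good start. A rough
sketch of a rule file to surface the worst offenders (group and rule names
here are just mine, pick whatever you like):

  groups:
    - name: scrape-health
      rules:
        # slowest scrape per job
        - record: job:scrape_duration_seconds:max
          expr: max by (job) (scrape_duration_seconds)
        # heaviest targets by samples per scrape
        - record: job:scrape_samples_scraped:max
          expr: max by (job) (scrape_samples_scraped)

That should tell you fairly quickly whether the 35 seconds is one huge or slow
target or genuinely everything.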


> A developer, before adding a metric, can assess how useful that metric
> will be in production and which indicators it covers - Utilization,
> Saturation, and Errors (USE) or Rate, Errors, and Duration (RED) - and based
> on that choose the metric level.
> Based on the level of the metric, only a few would be enabled (ERROR / SEVERE
> level) in production, while the rest would be enabled (INFO / DEBUG level) in
> development / testing / staging environments.
> A few metrics should be / are enough to troubleshoot, and on demand we should
> have the option to change the metric level, like a log level, at runtime to
> get more metrics.
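
There's no built-in "metric level" in Prometheus today, but you can get most
of the way there with per-environment metric_relabel_configs that drop whole
families of metrics at scrape time. Roughly (the job name and metric name
pattern below are made up for illustration):

  scrape_configs:
    - job_name: my-app                 # hypothetical job
      kubernetes_sd_configs:
        - role: pod
      metric_relabel_configs:
        # production: drop the "debug level" metric families before ingestion
        - source_labels: [__name__]
          regex: "myapp_debug_.*|myapp_internal_.*"
          action: drop

In dev/staging you'd simply leave the drop rule out (or invert it with
action: keep). It isn't runtime-switchable like a log level, but a config
reload is cheap, and it does stop the unwanted series from ever hitting the
TSDB.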
>
> --
>> Stuart Clark
>>

