Hi all, I have been using Prometheus for a few years with almost no issues, but recently I am seeing performance problems when running queries from a Grafana dashboard.
The Prometheus container handles queries over 12 hours of metrics without any issue, but if I extend the range to 24 hours of metrics, the container gets restarted with the following output (sorry for the long trail...). Please also see the few manually run query metrics at the end of the logs for reference.

level=info ts=2022-04-13T05:15:17.798Z caller=main.go:851 fs_type=EXT4_SUPER_MAGIC
level=info ts=2022-04-13T05:15:17.798Z caller=main.go:854 msg="TSDB started"
level=info ts=2022-04-13T05:15:17.798Z caller=main.go:981 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2022-04-13T05:15:17.804Z caller=main.go:1012 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=5.955772ms remote_storage=1.608µs web_handler=397ns query_engine=1.065µs scrape=569.244µs scrape_sd=637.764µs notify=24.31µs notify_sd=18.627µs rules=1.984671ms
level=info ts=2022-04-13T05:15:17.804Z caller=main.go:796 msg="Server is ready to receive web requests."
level=warn ts=2022-04-13T06:12:38.025Z caller=main.go:378 deprecation_notice="'storage.tsdb.retention' flag is deprecated use 'storage.tsdb.retention.time' instead."
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:443 msg="Starting Prometheus" version="(version=2.28.1, branch=HEAD, revision=b0944590a1c9a6b35dc5a696869f75f422b107a1)"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:448 build_context="(go=go1.16.5, user=xxxxx, date=20210701-15:20:10)"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:449 host_details="(Linux 4.14.262-200.489.amzn2.x86_64 #1 SMP Fri Feb 4 20:34:30 UTC 2022 x86_64 (none))"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:450 fd_limits="(soft=32768, hard=65536)"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:451 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-04-13T06:12:38.026Z caller=query_logger.go:79 component=activeQueryTracker msg="These queries didn't finish in prometheus' last run:" queries="[{\"query\":\"count(_cpu_usage{process=~\\\".*.*\\\",job=\\\"sftp_top_prod\\\",app!=\\\"-c\\\",app!=\\\"\\u003cdefunct\\u003e\\\",app!=\\\"connected:\\\"}) by (app) \\u003e 5\",\"timestamp_sec\":1649830310},{\"query\":\"count(_cpu_usage{job=\\\"sftp_top_prod\\\",process=~\\\".*.*\\\",clientip!=\\\"-nd\\\", app!=\\\"\\u003cdefunct\\u003e\\\"}) by (clientip,app) \\u003e 5\",\"timestamp_sec\":1649830316},{\"query\":\"count(_cpu_usage{process=~\\\".*.*\\\",job=\\\"sftp_top_prod\\\"})\",\"timestamp_sec\":1649830310},{\"query\":\"count(_cpu_usage{process=~\\\".*.*\\\",job=\\\"sftp_top_prod\\\"}) by (instance)\",\"timestamp_sec\":1649830310},{\"query\":\"count(_cpu_usage{job=\\\"sftp_top_prod\\\", process=~\\\".*.*\\\"}) by (clientip) \\u003e 5\",\"timestamp_sec\":1649830311}]"
level=info ts=2022-04-13T06:12:38.028Z caller=web.go:541 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-04-13T06:12:38.029Z caller=main.go:824 msg="Starting TSDB ..."
level=info ts=2022-04-13T06:12:38.031Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2022-04-13T06:12:38.031Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1641708000000 maxt=1642291200000 ulid=01FSGDEHH869s5PYJ3HQZ8GEVNx5
level=info ts=2022-04-13T06:12:38.032Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1642291200000 maxt=1642874400000 ulid=01FT1SMMQZNXxAAEJS6NPDD32Sx3
.....
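One thing I notice in the unfinished queries above: the process=~".*.*" matcher matches every label value (including an empty one), so it only adds regex-evaluation cost without actually narrowing the selection. A rough sketch of a cheaper equivalent, assuming the intent really is "any process" (metric and label names copied from my queries above):

    # Current form, taken from the active-query log: the regex matches everything,
    # so every series for the job is still selected, just with extra regex work.
    count(_cpu_usage{process=~".*.*", job="sftp_top_prod"}) by (instance)

    # Cheaper form: drop the always-true matcher entirely, or use process=~".+"
    # if the process label must exist and be non-empty.
    count(_cpu_usage{job="sftp_top_prod"}) by (instance)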
level=warn ts=2022-04-13T06:12:38.048Z caller=db.go:676 component=tsdb msg="A TSDB lockfile from a previous execution already existed. It was replaced" file=/prometheus/lock
level=info ts=2022-04-13T06:12:40.121Z caller=head.go:780 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2022-04-13T06:12:43.525Z caller=head.go:794 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=3.4045956s
level=info ts=2022-04-13T06:12:43.526Z caller=head.go:800 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2022-04-13T06:12:48.512Z caller=head.go:826 component=tsdb msg="WAL checkpoint loaded"
level=info ts=2022-04-13T06:12:49.978Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=133826 maxSegment=133854
........
level=info ts=2022-04-13T06:13:33.511Z caller=head.go:860 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=4.986518678s wal_replay_duration=44.999189293s total_replay_duration=53.390369759s
level=info ts=2022-04-13T06:13:35.315Z caller=main.go:851 fs_type=EXT4_SUPER_MAGIC
level=info ts=2022-04-13T06:13:35.315Z caller=main.go:854 msg="TSDB started"
level=info ts=2022-04-13T06:13:35.315Z caller=main.go:981 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2022-04-13T06:13:35.324Z caller=main.go:1012 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=9.159204ms remote_storage=2.044µs web_handler=395ns query_engine=1.047µs scrape=1.011248ms scrape_sd=2.974467ms notify=16.737µs notify_sd=9.337µs rules=2.452176ms
level=info ts=2022-04-13T06:13:35.324Z caller=main.go:796 msg="Server is ready to receive web requests."

---- Query references:

prometheus_tsdb_data_replay_duration_seconds increases from about 25 s to around 50 s during the failure.
prometheus_engine_queries goes from 0 to 5 during the search, and prometheus_engine_queries_concurrent_max shows 20.
prometheus_engine_query_duration_seconds{instance="abc:80", job="prometheus", quantile="0.99", slice="inner_eval"} 7.31
prometheus_engine_query_duration_seconds{instance="abc:80", job="prometheus", quantile="0.99", slice="prepare_time"} 0.013609492
prometheus_engine_query_duration_seconds{instance="abc:80", job="prometheus", quantile="0.99", slice="queue_time"} 0.000030745
prometheus_engine_query_duration_seconds{instance="prometheus.prod.datadelivery.info:80", job="prometheus", quantile="0.99", slice="result_sort"} 0.000040619

I can see that RAM runs at full capacity and the container gets restarted right after. The EBS volume I am using is formatted ext4, which I presume could be one of the reasons for the slow I/O; I can try changing the volume to GP2. Increasing the RAM is not a feasible solution, as during normal operation utilization does not even reach half of its total capacity (30 GB).

I'd appreciate your quick advice; let me know if you need any further details.
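Adding a bit more detail in case it helps: these are the Prometheus self-metrics I can also check to see whether head-series cardinality or ingestion rate explains the memory spike during the 24-hour queries (a rough sketch; the metric names are standard Prometheus self-metrics, and the job label value is the one from my queries above):

    # Active series in the head block -- high cardinality here is what makes
    # long-range queries expensive in RAM.
    prometheus_tsdb_head_series

    # Ingestion rate, for a sense of how many samples a 24h range touches.
    rate(prometheus_tsdb_head_samples_appended_total[5m])

    # Resident memory of the Prometheus process itself.
    process_resident_memory_bytes{job="prometheus"}

    # Rough per-metric series counts for the job behind the dashboards.
    topk(10, count by (__name__) ({job="sftp_top_prod"}))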