Hi all, I have been using Prometheus for a few years with almost no issues, but recently I am seeing performance problems when running queries from a Grafana dashboard.
The Prometheus container handles queries over 12 hours of metrics without any issue, but if I extend the range to 24 hours of metrics, the container gets restarted with the following output (sorry for the long trail...). Please also see the few manually run query metrics at the end of the logs for reference.

level=info ts=2022-04-13T05:15:17.798Z caller=main.go:851 fs_type=EXT4_SUPER_MAGIC
level=info ts=2022-04-13T05:15:17.798Z caller=main.go:854 msg="TSDB started"
level=info ts=2022-04-13T05:15:17.798Z caller=main.go:981 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2022-04-13T05:15:17.804Z caller=main.go:1012 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=5.955772ms remote_storage=1.608µs web_handler=397ns query_engine=1.065µs scrape=569.244µs scrape_sd=637.764µs notify=24.31µs notify_sd=18.627µs rules=1.984671ms
level=info ts=2022-04-13T05:15:17.804Z caller=main.go:796 msg="Server is ready to receive web requests."
level=warn ts=2022-04-13T06:12:38.025Z caller=main.go:378 deprecation_notice="'storage.tsdb.retention' flag is deprecated use 'storage.tsdb.retention.time' instead."
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:443 msg="Starting Prometheus" version="(version=2.28.1, branch=HEAD, revision=b0944590a1c9a6b35dc5a696869f75f422b107a1)"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:448 build_context="(go=go1.16.5, user=xxxxx, date=20210701-15:20:10)"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:449 host_details="(Linux 4.14.262-200.489.amzn2.x86_64 #1 SMP Fri Feb 4 20:34:30 UTC 2022 x86_64 (none))"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:450 fd_limits="(soft=32768, hard=65536)"
level=info ts=2022-04-13T06:12:38.025Z caller=main.go:451 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2022-04-13T06:12:38.026Z caller=query_logger.go:79 component=activeQueryTracker msg="These queries didn't finish in prometheus' last run:" queries="[{\"query\":\"count(_cpu_usage{process=~\\\".*.*\\\",job=\\\"sftp_top_prod\\\",app!=\\\"-c\\\",app!=\\\"\\u003cdefunct\\u003e\\\",app!=\\\"connected:\\\"}) by (app) \\u003e 5\",\"timestamp_sec\":1649830310},{\"query\":\"count(_cpu_usage{job=\\\"sftp_top_prod\\\",process=~\\\".*.*\\\",clientip!=\\\"-nd\\\", app!=\\\"\\u003cdefunct\\u003e\\\"}) by (clientip,app) \\u003e 5\",\"timestamp_sec\":1649830316},{\"query\":\"count(_cpu_usage{process=~\\\".*.*\\\",job=\\\"sftp_top_prod\\\"})\",\"timestamp_sec\":1649830310},{\"query\":\"count(_cpu_usage{process=~\\\".*.*\\\",job=\\\"sftp_top_prod\\\"}) by (instance)\",\"timestamp_sec\":1649830310},{\"query\":\"count(_cpu_usage{job=\\\"sftp_top_prod\\\", process=~\\\".*.*\\\"}) by (clientip) \\u003e 5\",\"timestamp_sec\":1649830311}]"
level=info ts=2022-04-13T06:12:38.028Z caller=web.go:541 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2022-04-13T06:12:38.029Z caller=main.go:824 msg="Starting TSDB ..."
level=info ts=2022-04-13T06:12:38.031Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
level=info ts=2022-04-13T06:12:38.031Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1641708000000 maxt=1642291200000 ulid=01FSGDEHH869s5PYJ3HQZ8GEVNx5
level=info ts=2022-04-13T06:12:38.032Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1642291200000 maxt=1642874400000 ulid=01FT1SMMQZNXxAAEJS6NPDD32Sx3
.....
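One thing I notice in the unfinished queries above: the process=~".*.*" matcher matches every label value (including an empty one), so it only adds regex-evaluation cost without actually narrowing the selection. A rough sketch of a cheaper equivalent, assuming the intent really is "any process" (metric and label names copied from my queries above):

    # Current form, taken from the active-query log: the regex matches everything,
    # so every series for the job is still selected, just with extra regex work.
    count(_cpu_usage{process=~".*.*", job="sftp_top_prod"}) by (instance)

    # Cheaper form: drop the always-true matcher entirely, or use process=~".+"
    # if the process label must exist and be non-empty.
    count(_cpu_usage{job="sftp_top_prod"}) by (instance)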
level=warn ts=2022-04-13T06:12:38.048Z caller=db.go:676 component=tsdb msg="A TSDB lockfile from a previous execution already existed. It was replaced" file=/prometheus/lock
level=info ts=2022-04-13T06:12:40.121Z caller=head.go:780 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2022-04-13T06:12:43.525Z caller=head.go:794 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=3.4045956s
level=info ts=2022-04-13T06:12:43.526Z caller=head.go:800 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2022-04-13T06:12:48.512Z caller=head.go:826 component=tsdb msg="WAL checkpoint loaded"
level=info ts=2022-04-13T06:12:49.978Z caller=head.go:854 component=tsdb msg="WAL segment loaded" segment=133826 maxSegment=133854
........
level=info ts=2022-04-13T06:13:33.511Z caller=head.go:860 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=4.986518678s wal_replay_duration=44.999189293s total_replay_duration=53.390369759s
level=info ts=2022-04-13T06:13:35.315Z caller=main.go:851 fs_type=EXT4_SUPER_MAGIC
level=info ts=2022-04-13T06:13:35.315Z caller=main.go:854 msg="TSDB started"
level=info ts=2022-04-13T06:13:35.315Z caller=main.go:981 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2022-04-13T06:13:35.324Z caller=main.go:1012 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=9.159204ms remote_storage=2.044µs web_handler=395ns query_engine=1.047µs scrape=1.011248ms scrape_sd=2.974467ms notify=16.737µs notify_sd=9.337µs rules=2.452176ms
level=info ts=2022-04-13T06:13:35.324Z caller=main.go:796 msg="Server is ready to receive web requests."

---- Query references:

prometheus_tsdb_data_replay_duration_seconds increases from about 25 s to around 50 s during the failure.
prometheus_engine_queries goes from 0 to 5 during the search, and prometheus_engine_queries_concurrent_max shows 20.
prometheus_engine_query_duration_seconds{instance="abc:80", job="prometheus", quantile="0.99", slice="inner_eval"} 7.31
prometheus_engine_query_duration_seconds{instance="abc:80", job="prometheus", quantile="0.99", slice="prepare_time"} 0.013609492
prometheus_engine_query_duration_seconds{instance="abc:80", job="prometheus", quantile="0.99", slice="queue_time"} 0.000030745
prometheus_engine_query_duration_seconds{instance="prometheus.prod.datadelivery.info:80", job="prometheus", quantile="0.99", slice="result_sort"} 0.000040619

I can see that RAM runs at full capacity and the container gets restarted right after. The EBS volume I am using is formatted ext4, which I presume could be one of the reasons for the slow I/O; I can try changing the volume to GP2. Increasing the RAM is not a feasible solution, as during normal operation utilization does not even reach half of its total capacity (30 GB).

I'd appreciate your quick advice; let me know if you need any further details.
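Adding a bit more detail in case it helps: these are the Prometheus self-metrics I can also check to see whether head-series cardinality or ingestion rate explains the memory spike during the 24-hour queries (a rough sketch; the metric names are standard Prometheus self-metrics, and the job label value is the one from my queries above):

    # Active series in the head block -- high cardinality here is what makes
    # long-range queries expensive in RAM.
    prometheus_tsdb_head_series

    # Ingestion rate, for a sense of how many samples a 24h range touches.
    rate(prometheus_tsdb_head_samples_appended_total[5m])

    # Resident memory of the Prometheus process itself.
    process_resident_memory_bytes{job="prometheus"}

    # Rough per-metric series counts for the job behind the dashboards.
    topk(10, count by (__name__) ({job="sftp_top_prod"}))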