Re: [prometheus-users] Prometheus server increases CPU usage beyond 200%

Stuart Clark Thu, 14 May 2020 01:50:31 -0700

On 2020-05-14 09:45, Isabel Noronha wrote:

Hi,


Configuration of the server where Prometheus is running:
160 CPU cores
500 GB RAM
2 TB hard disk

Prometheus version: 2.18.0
cAdvisor version: 0.36.0

Prometheus is running inside a container.
I have already done relabeling.
The retention period is 15 days.

I am using cAdvisor to collect metrics from around 4K containers.
I have done relabeling for the container metrics as well.

Scrape interval is 40s

I used the top command to check the CPU usage.
To my surprise, Prometheus was exceeding 200% CPU usage.
The server where Prometheus is running has around 2K containers.

Memory, CPU and disk usage will be down to a number of different tasks:

- Scraping (more targets/time series, more resources)
- Recording rules (more rules touching more data, more resources)
- Queries (more queries and more complex queries, more resources)
- WAL processing, compaction and expiry (more time series, more resources)

Those different usages will add together. There are various metrics to show the number of scrapes, time series, queries, etc.
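As a rough illustration of where to look, the sketch below groups a few of the metrics Prometheus exposes about itself by the tasks listed above and builds instant-query URLs for the HTTP API. The grouping and the helper name `query_url` are only illustrative, and the base URL `http://localhost:9010` and the `job="prometheus"` label are assumptions taken from the posted config. It only prints URLs; it does not contact the server.

```python
from urllib.parse import urlencode

# PromQL expressions over Prometheus's own metrics, grouped by the task
# they most directly reflect. The grouping is an illustrative assumption.
SELF_MONITORING_QUERIES = {
    "scraping": [
        "sum(scrape_samples_scraped)",
        "rate(prometheus_tsdb_head_samples_appended_total[5m])",
    ],
    "rules": [
        "sum(prometheus_rule_group_last_duration_seconds)",
    ],
    "queries": [
        "rate(prometheus_engine_query_duration_seconds_sum[5m])",
    ],
    "storage": [
        "prometheus_tsdb_head_series",
        "rate(prometheus_tsdb_compactions_total[1h])",
    ],
    "overall": [
        # Assumes the self-scrape job is named 'prometheus', as in the config.
        'rate(process_cpu_seconds_total{job="prometheus"}[5m])',
    ],
}


def query_url(base_url, expr):
    """Build an instant-query URL for the /api/v1/query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    # Assumed address: the target of the 'prometheus' job in the posted config.
    base = "http://localhost:9010"
    for task, exprs in SELF_MONITORING_QUERIES.items():
        for expr in exprs:
            print(f"{task}: {query_url(base, expr)}")
```

Comparing these values over the period when CPU usage rises should give a hint as to which of the tasks is growing.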

Another target has a further 2K containers,
so there are 4K containers overall.

Could anyone help me understand the possible reasons for Prometheus to
increase its CPU usage?

prometheus.yml:

# my global config
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'prometheus-monitor'

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  #- 'alert.rules'
  # - "first.rules"
   - "alert_rules.yml"

# alert
alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets:
      - "server:9093"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 40 seconds.
    scrape_interval: 40s
    scrape_timeout: 40s

    static_configs:
      - targets: ['localhost:9010']

  - job_name: 'cadvisor'

    # Override the global default and scrape targets from this job every 40 seconds.
    scrape_interval: 40s
    scrape_timeout: 40s

    static_configs:
      - targets: ['server1:8080', 'server2:8080', 'server3:8080']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(container_fs_writes_total|container_fs_reads_total|container_tasks_state|container_cpu_user_seconds_total|container_last_seen|container_memory_usage_bytes|container_cpu_usage_seconds_total|container_network_transmit_bytes_total|container_memory_rss|container_network_receive_bytes_total|container_network_transmit_bytes_total|container_memory_cache|cadvisor_version_info)'
        action: keep

  - job_name: 'node-exporter'

    # Scrape targets from this job every 15 seconds.
    scrape_interval: 15s
    scrape_timeout: 15s
    static_configs:
      - targets: ['server1:9100', 'server2:9100', 'server3:9100']

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(process_start_time_seconds|node_load1|node_exporter_build_info|node_uname_info|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemFree_bytes|node_memory_SwapCached_bytes|node_memory_PageTables_bytes|node_memory_VmallocUsed_bytes|node_memory_SwapTotal_bytes|node_memory_Committed_AS_bytes|node_memory_Active_bytes|node_memory_Mapped_bytes|node_memory_Inactive_bytes|node_cpu_seconds_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_memory_MemAvailable_bytes|node_memory_MemTotal_bytes|node_memory_MemFree_bytes|node_memory_Cached_bytes|node_filesystem_free_bytes)'
        action: keep
  - job_name: 'docker'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    scrape_interval: 5s

    static_configs:
      - targets: ['172.17.0.1:9999']

Thank you!

 --
You received this message because you are subscribed to the Google
Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-users/41576f93-da37-4524-aba8-8e5d0e595402%40googlegroups.com.


--
Stuart Clark

