This appears to be a problem specific to your system, so I'm afraid you'll need to debug it on your side. Tools like strace may be able to identify specific system calls that are taking a lot of time.
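
For example (just a sketch, not something I've run against your host; you'll likely need root, and adjust the PID lookup if you run more than one instance), attach to the running daemon for 30 seconds or so and press Ctrl-C to get a per-syscall summary of time, call counts and errors:

sudo strace -c -f -p "$(pidof node_exporter)"

-c aggregates totals instead of logging every call, and -f follows all of the Go runtime's threads. If a handful of syscalls (say, reads under /proc or /sys) dominate the time column, that narrows down which collector or kernel interface is slow.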
FWIW, I ran that "go tool pprof" command here, pointing it at a node_exporter instance running on a very low power (NUC DN2820 dual Celeron) box.

go tool pprof -svg http://nuc1:9100/debug/pprof/profile > node_exporter.svg

At the same time I hit it with curl roughly every 2 seconds:

while true; do curl nuc1:9100/metrics >/dev/null; sleep 2; done

The target system's node_exporter is running with just one flag:

node_exporter --collector.textfile.directory=/var/lib/node_exporter

In the SVG summary I see:

File: node_exporter
Type: cpu
Time: Jan 28, 2022 at 4:20pm (GMT)
Duration: 30s, Total samples = 1.95s ( 6.50%)
Showing nodes accounting for 1.42s, 72.82% of 1.95s total
Showing top 80 nodes out of 374

and there are no nodes for the systemd collector. They do appear if I add "--collector.systemd", as expected.

So: there's something strange on your system. It could be one of a hundred things, but the fact that systemd appears in your pprof output, even though you claim you're not running the systemd collector, is a big red flag. Finding out what's happening there will probably point you at the answer.

Remember of course that if node_exporter is running on a remote host, say 1.2.3.4, but you're running go tool pprof on another system (say your laptop), then you'd need to do

go tool pprof -svg http://1.2.3.4:9100/debug/pprof/profile > node_exporter.svg

If you leave it as 127.0.0.1 then you're looking at the node_exporter instance running on the *same* system as where you're running go tool pprof.

Good luck with your hunting!
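
P.S. Two quick sanity checks on the affected host might help (sketches only, assuming a single node_exporter process): print the exact command line the running process was started with, and ask the exporter which collectors it thinks are enabled:

tr '\0' ' ' < /proc/"$(pidof node_exporter)"/cmdline; echo

curl -s localhost:9100/metrics | grep 'node_scrape_collector_success{collector="systemd"}'

If the second command prints nothing, the running binary really doesn't have the systemd collector enabled, and the systemd frames in your profile must be coming from somewhere else - for example a different node_exporter process or port than the one you're actually profiling.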

On Friday, 28 January 2022 at 12:59:09 UTC dyio...@gmail.com wrote:

> All,
>
> I realize this is a very long thread (with apologies). But I really need to find a solution to the high CPU usage. Please let me know if I can provide any additional information so that you can help me.
>
> On Friday, January 21, 2022 at 7:27:18 AM UTC-5 Dimitri Yioulos wrote:
>
>> That's a good question. The machine that I'm running node_exporter on, for which you see the pprof output, was just rebuilt. So the output is from a fresh, and basic, install of node_exporter. This is the systemd node_exporter service:
>>
>> [Unit]
>> Description=Node Exporter
>> After=network.target
>>
>> [Service]
>> User=node_exporter
>> Group=node_exporter
>> Type=simple
>> ExecStart=/usr/local/bin/node_exporter
>>
>> [Install]
>> WantedBy=multi-user.target
>>
>> and the prometheus target:
>>
>> - job_name: 'myserver1'
>>   scrape_interval: 5s
>>   static_configs:
>>     - targets: ['myserver1:9100']
>>       labels:
>>         env: prod
>>         alias: myserver1
>>
>> I'm not sure what else to look at.
>>
>> On Friday, January 21, 2022 at 3:06:43 AM UTC-5 Brian Candler wrote:
>>
>>> The question is, why are the systemd collector and process collector still in that graph?
>>>
>>> On Friday, 21 January 2022 at 00:27:14 UTC dyio...@gmail.com wrote:
>>>
>>>> The attached is pprof output in text format, which may be easier to read.
>>>>
>>>> On Thursday, January 20, 2022 at 6:30:25 PM UTC-5 Dimitri Yioulos wrote:
>>>>
>>>>> I ran pprof (attached). I'll have to work on /proc/<pid>/stat (even with the much appreciated reference :-) ).
>>>>>
>>>>> On Thursday, January 20, 2022 at 11:54:33 AM UTC-5 Brian Candler wrote:
>>>>>
>>>>>> So now go back to the original suggestion: run pprof with node_exporter running the way you *want* to be running it.
>>>>>>
>>>>>> > [root@myhost1 ~]# time for ((i=1;i<=1000;i++)); do node_exporter >/dev/null 2>&1; done
>>>>>>
>>>>>> That's meaningless. node_exporter is a daemon, not something you can run one-shot like that. If you remove the ">/dev/null 2>&1" you'll see lots of startup messages, probably ending with
>>>>>>
>>>>>> ts=2022-01-20T16:49:07.433Z caller=node_exporter.go:202 level=error err="listen tcp :9100: bind: address already in use"
>>>>>>
>>>>>> and then node_exporter terminating. So you're not seeing the CPU overhead of any node_exporter scrape jobs, only its startup overhead.
>>>>>>
>>>>>> If the system is idle apart from running node_exporter, then "top" will show you system time and cpu time. More accurately, find the process ID of node_exporter, then look in /proc/<pid>/stat:
>>>>>>
>>>>>> https://stackoverflow.com/questions/16726779/how-do-i-get-the-total-cpu-usage-of-an-application-from-proc-pid-stat
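
(As an aside, a quick way to pull the user and system CPU totals for the running process out of /proc/<pid>/stat - a sketch, assuming a single node_exporter instance:

awk -v hz="$(getconf CLK_TCK)" '{printf "user %.2fs system %.2fs\n", $14/hz, $15/hz}' /proc/"$(pidof node_exporter)"/stat

Fields 14 and 15 are the cumulative user and system CPU time in clock ticks. Run it twice, a minute apart, and the differences tell you roughly how much user versus system CPU node_exporter burns per minute.)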
>>>>>> On Thursday, 20 January 2022 at 12:33:06 UTC dyio...@gmail.com wrote:
>>>>>>
>>>>>>> Brian,
>>>>>>>
>>>>>>> Originally, I had not activated any additional collectors. Then I read somewhere that I should add the systemd and process collectors. Still learning here, so ... . That's why you saw them in the pprof graph. I then circled back and removed them. However, high CPU usage has *always* been an issue. That goes for every system on which I have node_exporter running. While a few are test machines, and I care a bit less, for production machines it's an issue.
>>>>>>>
>>>>>>> Here's some time output for node_exporter, though I'm not good at interpreting the results:
>>>>>>>
>>>>>>> [root@myhost1 ~]# time for ((i=1;i<=1000;i++)); do node_exporter >/dev/null 2>&1; done
>>>>>>>
>>>>>>> real 0m6.103s
>>>>>>> user 0m3.658s
>>>>>>> sys 0m3.151s
>>>>>>>
>>>>>>> So, if the above is a good way to measure node_exporter's user versus system time, then they're about equal. If you have another means to do such measurement, I'd appreciate your sharing it. Once that's determined, and if system time versus user time is "out-of-whack", how do I remediate?
>>>>>>>
>>>>>>> Many thanks.
>>>>>>>
>>>>>>> On Thursday, January 20, 2022 at 3:46:35 AM UTC-5 Brian Candler wrote:
>>>>>>>
>>>>>>>> So the systemd and process collectors aren't active. I wonder why they appeared in your pprof graph, then? Was it exactly the same binary you were running?
>>>>>>>>
>>>>>>>> 20% CPU usage from a once-every-five-second scrape implies that it should take about 1 CPU-second in total, but all the collectors seem very fast. The top five use between 0.01 and 0.015 seconds - and that's wall clock time, not CPU time.
>>>>>>>>
>>>>>>>> node_scrape_collector_duration_seconds{collector="cpu"} 0.010873961
>>>>>>>> node_scrape_collector_duration_seconds{collector="diskstats"} 0.01727642
>>>>>>>> node_scrape_collector_duration_seconds{collector="hwmon"} 0.014143617
>>>>>>>> node_scrape_collector_duration_seconds{collector="netclass"} 0.013852102
>>>>>>>> node_scrape_collector_duration_seconds{collector="thermal_zone"} 0.010936983
>>>>>>>>
>>>>>>>> Something weird is going on. Next you might want to drill down into node_exporter's user versus system time. Is the usage mostly system time? That might point you some way, although the implication then is that the high CPU usage is some part of node_exporter outside of individual collectors.
>>>>>>>>
>>>>>>>> On Wednesday, 19 January 2022 at 23:27:40 UTC dyio...@gmail.com wrote:
>>>>>>>>
>>>>>>>>> [root@myhost1 ~]# curl -Ss localhost:9100/metrics | grep -i collector
>>>>>>>>> # HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
>>>>>>>>> # TYPE node_scrape_collector_duration_seconds gauge
>>>>>>>>> node_scrape_collector_duration_seconds{collector="arp"} 0.002911805
>>>>>>>>> node_scrape_collector_duration_seconds{collector="bcache"} 1.4571e-05
>>>>>>>>> node_scrape_collector_duration_seconds{collector="bonding"} 0.000112308
>>>>>>>>> node_scrape_collector_duration_seconds{collector="btrfs"} 0.001308192
>>>>>>>>> node_scrape_collector_duration_seconds{collector="conntrack"} 0.002750716
>>>>>>>>> node_scrape_collector_duration_seconds{collector="cpu"} 0.010873961
>>>>>>>>> node_scrape_collector_duration_seconds{collector="cpufreq"} 0.008559194
>>>>>>>>> node_scrape_collector_duration_seconds{collector="diskstats"} 0.01727642
>>>>>>>>> node_scrape_collector_duration_seconds{collector="dmi"} 0.000971785
>>>>>>>>> node_scrape_collector_duration_seconds{collector="edac"} 0.006972343
>>>>>>>>> node_scrape_collector_duration_seconds{collector="entropy"} 0.001360089
>>>>>>>>> node_scrape_collector_duration_seconds{collector="fibrechannel"} 2.8256e-05
>>>>>>>>> node_scrape_collector_duration_seconds{collector="filefd"} 0.000739988
>>>>>>>>> node_scrape_collector_duration_seconds{collector="filesystem"} 0.00554684
>>>>>>>>> node_scrape_collector_duration_seconds{collector="hwmon"} 0.014143617
>>>>>>>>> node_scrape_collector_duration_seconds{collector="infiniband"} 1.3484e-05
>>>>>>>>> node_scrape_collector_duration_seconds{collector="ipvs"} 7.5532e-05
>>>>>>>>> node_scrape_collector_duration_seconds{collector="loadavg"} 0.004074291
>>>>>>>>> node_scrape_collector_duration_seconds{collector="mdadm"} 0.000974966
>>>>>>>>> node_scrape_collector_duration_seconds{collector="meminfo"} 0.004201816
>>>>>>>>> node_scrape_collector_duration_seconds{collector="netclass"} 0.013852102
>>>>>>>>> node_scrape_collector_duration_seconds{collector="netdev"} 0.006993921
>>>>>>>>> node_scrape_collector_duration_seconds{collector="netstat"} 0.007896151
>>>>>>>>> node_scrape_collector_duration_seconds{collector="nfs"} 0.000125062
>>>>>>>>> node_scrape_collector_duration_seconds{collector="nfsd"} 3.6075e-05
>>>>>>>>> node_scrape_collector_duration_seconds{collector="nvme"} 0.001064067
>>>>>>>>> node_scrape_collector_duration_seconds{collector="os"} 0.005645435
>>>>>>>>> node_scrape_collector_duration_seconds{collector="powersupplyclass"} 0.001394135
>>>>>>>>> node_scrape_collector_duration_seconds{collector="pressure"} 0.001466664
>>>>>>>>> node_scrape_collector_duration_seconds{collector="rapl"} 0.00226622
>>>>>>>>> node_scrape_collector_duration_seconds{collector="schedstat"} 0.006677493
>>>>>>>>> node_scrape_collector_duration_seconds{collector="sockstat"} 0.000970676
>>>>>>>>> node_scrape_collector_duration_seconds{collector="softnet"} 0.002014497
>>>>>>>>> node_scrape_collector_duration_seconds{collector="stat"} 0.004216999
>>>>>>>>> node_scrape_collector_duration_seconds{collector="tapestats"} 1.0296e-05
>>>>>>>>> node_scrape_collector_duration_seconds{collector="textfile"} 5.2573e-05
>>>>>>>>> node_scrape_collector_duration_seconds{collector="thermal_zone"} 0.010936983
>>>>>>>>> node_scrape_collector_duration_seconds{collector="time"} 0.00568072
>>>>>>>>> node_scrape_collector_duration_seconds{collector="timex"} 3.3662e-05
>>>>>>>>> node_scrape_collector_duration_seconds{collector="udp_queues"} 0.004138555
>>>>>>>>> node_scrape_collector_duration_seconds{collector="uname"} 1.3713e-05
>>>>>>>>> node_scrape_collector_duration_seconds{collector="vmstat"} 0.005691152
>>>>>>>>> node_scrape_collector_duration_seconds{collector="xfs"} 0.008633677
>>>>>>>>> node_scrape_collector_duration_seconds{collector="zfs"} 2.8179e-05
>>>>>>>>> # HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
>>>>>>>>> # TYPE node_scrape_collector_success gauge
>>>>>>>>> node_scrape_collector_success{collector="arp"} 1
>>>>>>>>> node_scrape_collector_success{collector="bcache"} 1
>>>>>>>>> node_scrape_collector_success{collector="bonding"} 0
>>>>>>>>> node_scrape_collector_success{collector="btrfs"} 1
>>>>>>>>> node_scrape_collector_success{collector="conntrack"} 1
>>>>>>>>> node_scrape_collector_success{collector="cpu"} 1
>>>>>>>>> node_scrape_collector_success{collector="cpufreq"} 1
>>>>>>>>> node_scrape_collector_success{collector="diskstats"} 1
>>>>>>>>> node_scrape_collector_success{collector="dmi"} 1
>>>>>>>>> node_scrape_collector_success{collector="edac"} 1
>>>>>>>>> node_scrape_collector_success{collector="entropy"} 1
>>>>>>>>> node_scrape_collector_success{collector="fibrechannel"} 0
>>>>>>>>> node_scrape_collector_success{collector="filefd"} 1
>>>>>>>>> node_scrape_collector_success{collector="filesystem"} 1
>>>>>>>>> node_scrape_collector_success{collector="hwmon"} 1
>>>>>>>>> node_scrape_collector_success{collector="infiniband"} 0
>>>>>>>>> node_scrape_collector_success{collector="ipvs"} 0
>>>>>>>>> node_scrape_collector_success{collector="loadavg"} 1
>>>>>>>>> node_scrape_collector_success{collector="mdadm"} 1
>>>>>>>>> node_scrape_collector_success{collector="meminfo"} 1
>>>>>>>>> node_scrape_collector_success{collector="netclass"} 1
>>>>>>>>> node_scrape_collector_success{collector="netdev"} 1
>>>>>>>>> node_scrape_collector_success{collector="netstat"} 1
>>>>>>>>> node_scrape_collector_success{collector="nfs"} 0
>>>>>>>>> node_scrape_collector_success{collector="nfsd"} 0
>>>>>>>>> node_scrape_collector_success{collector="nvme"} 0
>>>>>>>>> node_scrape_collector_success{collector="os"} 1
>>>>>>>>> node_scrape_collector_success{collector="powersupplyclass"} 1
>>>>>>>>> node_scrape_collector_success{collector="pressure"} 0
>>>>>>>>> node_scrape_collector_success{collector="rapl"} 1
>>>>>>>>> node_scrape_collector_success{collector="schedstat"} 1
>>>>>>>>> node_scrape_collector_success{collector="sockstat"} 1
>>>>>>>>> node_scrape_collector_success{collector="softnet"} 1
>>>>>>>>> node_scrape_collector_success{collector="stat"} 1
>>>>>>>>> node_scrape_collector_success{collector="tapestats"} 0
>>>>>>>>> node_scrape_collector_success{collector="textfile"} 1
>>>>>>>>> node_scrape_collector_success{collector="thermal_zone"} 1
>>>>>>>>> node_scrape_collector_success{collector="time"} 1
>>>>>>>>> node_scrape_collector_success{collector="timex"} 1
node_scrape_collector_success{collector="udp_queues"} 1 >>>>>>>>> node_scrape_collector_success{collector="uname"} 1 >>>>>>>>> node_scrape_collector_success{collector="vmstat"} 1 >>>>>>>>> node_scrape_collector_success{collector="xfs"} 1 >>>>>>>>> node_scrape_collector_success{collector="zfs"} 0 >>>>>>>>> >>>>>>>>> On Tuesday, January 18, 2022 at 1:12:04 PM UTC-5 Brian Candler >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Can you show the output of: >>>>>>>>>> >>>>>>>>>> curl -Ss localhost:9100/metrics | grep -i collector >>>>>>>>>> >>>>>>>>>> On Tuesday, 18 January 2022 at 14:33:25 UTC dyio...@gmail.com >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> [root@myhost1 ~]# ps auxwww | grep node_exporter >>>>>>>>>>> node_ex+ 4143664 12.5 0.0 725828 22668 ? Ssl 09:29 >>>>>>>>>>> 0:06 /usr/local/bin/node_exporter --no-collector.wifi >>>>>>>>>>> >>>>>>>>>>> On Saturday, January 15, 2022 at 11:23:43 AM UTC-5 Brian Candler >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> On Friday, 14 January 2022 at 14:12:02 UTC dyio...@gmail.com >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> @Brian Chandler I'm using the node_exporter defaults, as >>>>>>>>>>>>> described here - https://github.com/prometheus/node_exporter. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Are you *really*? Can you show the *exact* command line >>>>>>>>>>>> that node_exporter is running with? e.g. >>>>>>>>>>>> >>>>>>>>>>>> ps auxwww | grep node_exporter >>>>>>>>>>>> >>>>>>>>>>> -- You received this message because you are subscribed to the Google Groups "Prometheus Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/56a7defd-d75b-465d-860b-de11b021de78n%40googlegroups.com.