Looks like we've been hit with a potential kernel bug related to fork(). I believe InfluxDB Telegraf might be causing it.
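If it is the fork()-driven anon_vma slab growth described in the article linked below, a rough way to confirm it on a box (just a sketch; it assumes the caches show up under these names in /proc/slabinfo and that you have root) is to watch the two slab caches over time:

  # one-off check of the two slab caches named in the Red Hat article
  sudo grep -E '^(anon_vma|anon_vma_chain) ' /proc/slabinfo

  # or sample them once a minute and see whether they only ever grow
  while true; do date; sudo grep -E '^(anon_vma|anon_vma_chain) ' /proc/slabinfo; sleep 60; done

  # and to get a feel for how fast the box is fork()ing at all
  # (vmstat -f prints total forks since boot, so run it twice and diff)
  vmstat -f; sleep 60; vmstat -f

slabtop -o should show the same caches near the top if they really are the ones eating memory.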
Not sure if related, but we are trying it out: https://access.redhat.com/solutions/1228953

*Issue*
- Repeated fork() calls cause the anon_vma_chain and anon_vma slabs to grow, resulting in a kernel panic due to out of memory (OOM).
- Kernel panic due to out of memory (OOM) even though none of the applications running on the system are using memory.

We've rebooted all our caches that had this problem (one crashed last night due to OOM). The spikes in load average and the blocking are gone for now.

Thanks Bryan...

Steve

On Tue, Aug 16, 2016 at 10:00 PM, Steve Malenfant <[email protected]> wrote:
> Bryan, here's the output during one of those events...
>
> This server's RAM is used at about 80%. Hugepages are not enabled. 256GB of
> RAM cache on a server with 384GB RAM.
>
> Samples: 756K of event 'cycles', Event count (approx.): 26719576313
> Overhead  Shared Object        Symbol
> 70.35%  [kernel]             [k] try_to_unmap_anon
>  2.42%  [kernel]             [k] find_busiest_group
>  1.34%  [kernel]             [k] _spin_lock
>  0.73%  [kernel]             [k] thread_return
>  0.64%  libpthread-2.12.so   [.] pthread_cond_timedwait@@GLIBC_2.3.2
>  0.52%  [kernel]             [k] schedule
>  0.50%  [kernel]             [k] find_next_bit
>  0.42%  [kernel]             [k] cpumask_next_and
>  0.38%  [kernel]             [k] _spin_lock_irqsave
>
> [~]# cat /proc/meminfo
> MemTotal: 396790288 kB
> MemFree: 52626016 kB
> Buffers: 151148 kB
> Cached: 2035288 kB
> SwapCached: 0 kB
> Active: 1198652 kB
> Inactive: 1171944 kB
> Active(anon): 222464 kB
> Inactive(anon): 152 kB
> Active(file): 976188 kB
> Inactive(file): 1171792 kB
> Unevictable: 289593148 kB
> Mlocked: 289595216 kB
> SwapTotal: 4095996 kB
> SwapFree: 4095996 kB
> Dirty: 208 kB
> Writeback: 0 kB
> AnonPages: 289777432 kB
> Mapped: 128752 kB
> Shmem: 244 kB
> Slab: 15548892 kB
> SReclaimable: 15454924 kB
> SUnreclaim: 93968 kB
> KernelStack: 17648 kB
> PageTables: 573316 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 202491140 kB
> Committed_AS: 291736828 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 1383676 kB
> VmallocChunk: 34157955876 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 0 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 58368 kB
> DirectMap2M: 1968128 kB
> DirectMap1G: 400556032 kB
>
> On Tue, Aug 16, 2016 at 8:54 PM, Bryan Call <[email protected]> wrote:
>
>> Can you run sudo perf top or sudo perf record during the time you are
>> having a spike in CPU usage?
>>
>> -Bryan
>>
>> On Aug 16, 2016, at 3:09 PM, Steve Malenfant <[email protected]> wrote:
>>
>> We have a condition where 4 caches, every 33-34 minutes, go to high CPU
>> utilization (1000%) and stop accepting incoming connections. This lasts
>> about 30 seconds and then goes away.
>>
>> It seems to occur with some specific origin/content, but I don't know
>> exactly which one yet.
>>
>> I'm including some details (in case it's obvious), but how do we profile
>> a running server to see what's causing this high CPU usage?
>>
>> The load is normal, maybe about 100 Mbps (basically idling).
>>
>> top - 17:20:33 up 61 days, 3:32, 4 users, load average: 4.74, 1.20, 0.53
>> Tasks: 616 total, 16 running, 600 sleeping, 0 stopped, 0 zombie
>> Cpu(s): 2.6%us, 22.3%sy, 0.0%ni, 74.9%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
>> Mem: 396790288k total, 342530608k used, 54259680k free, 148764k buffers
>> Swap: 4095996k total, 0k used, 4095996k free, 1618892k cached
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 8703 ats 20 0 282g 275g 39m D 853.9 72.7 12496:54 [ET_NET 0]
>> 61 root 20 0 0 0 0 R 5.6 0.0 5:10.11 ksoftirqd/14
>> 77 root 20 0 0 0 0 R 5.0 0.0 5:15.08 ksoftirqd/18
>> 125 root 20 0 0 0 0 R 5.0 0.0 5:02.48 ksoftirqd/30
>> 13 root 20 0 0 0 0 R 4.6 0.0 5:23.63 ksoftirqd/2
>> 37 root 20 0 0 0 0 R 4.6 0.0 5:10.31 ksoftirqd/8
>> 45 root 20 0 0 0 0 R 4.6 0.0 5:10.36 ksoftirqd/10
>> 93 root 20 0 0 0 0 R 4.6 0.0 5:29.05 ksoftirqd/22
>> 109 root 20 0 0 0 0 R 4.6 0.0 5:06.27 ksoftirqd/26
>> 21 root 20 0 0 0 0 R 4.3 0.0 5:12.82 ksoftirqd/4
>> 29 root 20 0 0 0 0 R 4.3 0.0 5:10.79 ksoftirqd/6
>> 85 root 20 0 0 0 0 R 4.3 0.0 5:18.36 ksoftirqd/20
>> 101 root 20 0 0 0 0 R 4.3 0.0 5:13.54 ksoftirqd/24
>> 4 root 20 0 0 0 0 R 4.0 0.0 5:16.25 ksoftirqd/0
>> 69 root 20 0 0 0 0 R 4.0 0.0 5:09.72 ksoftirqd/16
>> 117 root 20 0 0 0 0 R 3.0 0.0 5:03.74 ksoftirqd/28
>> 53 root 20 0 0 0 0 S 1.7 0.0 5:07.40 ksoftirqd/12
>>
>> Here's a "dstat" output of that server when this is occurring... updated
>> every second.
>>
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
>> usr sys idl wai hiq siq| read writ| recv send| in out | int csw
>> 0 0 98 1 0 0|7940k 10M|6270k 18M| 0 0 | 13k 35k
>> 0 0 99 0 0 0|9738k 4160k|1600k 10M| 0 0 | 12k 34k
>> 0 0 99 0 0 0|3232k 4136k|1390k 5724k| 0 0 | 11k 34k
>> 0 0 100 0 0 0| 0 4096k| 34k 555k| 0 0 | 10k 33k
>> 0 0 99 0 0 0|1316k 9095k|7028k 7240k| 0 0 | 11k 34k
>> 0 1 99 0 0 0| 288k 8658k|6189k 5878k| 0 0 | 29k 36k
>> 0 0 99 1 0 0| 13M 6204k|2870k 17M| 0 0 | 14k 35k
>> 0 0 99 0 0 0| 0 4096k|3009k 3387k| 0 0 | 14k 35k
>> 0 0 99 0 0 0|1028k 6384k|5138k 6047k| 0 0 | 14k 35k
>> 0 0 99 1 0 0|5127k 8192k|4262k 7417k| 0 0 | 13k 34k
>> 0 0 99 0 0 0|2372k 2612k|2011k 4944k| 0 0 | 14k 35k
>> 0 0 99 0 0 0| 0 4172k|1633k 2201k| 0 0 | 13k 35k
>> 0 0 99 0 0 0| 11M 608k|2528k 12M| 0 0 | 13k 35k
>> 0 3 95 1 0 0| 15M 8409k| 11M 28M| 0 0 | 16k 37k
>> 0 3 95 1 0 0|6324k 4244k|7959k 20M| 0 0 | 12k 32k
>> 0 4 96 0 0 0| 256k 2128k| 853k 2823k| 0 0 | 27k 32k
>> 0 3 96 0 0 0|1028k 0 | 472k 562k| 0 0 | 10k 31k
>> 0 3 96 0 0 0|1028k 68k|1457k 1454k| 0 0 | 11k 32k
>> 0 3 97 0 0 0| 0 0 | 316k 317k| 0 0 | 11k 31k
>> 0 3 97 0 0 0|3076k 0 | 415k 3251k| 0 0 |9498 30k
>> 0 3 97 0 0 0|2688k 84k| 16k 2985k| 0 0 |8623 30k
>> 0 3 97 0 0 0| 0 0 | 64k 65k| 0 0 |9121 30k
>> 0 3 97 0 0 0| 0 68k|1607k 1556k| 0 0 | 11k 31k
>> 0 3 97 0 0 0| 0 0 | 18k 120k| 0 0 |7669 28k
>> 0 3 97 0 0 0| 0 0 | 536k 537k| 0 0 |8660 30k
>> 0 4 96 0 0 0| 0 72k| 65k 75k| 0 0 | 26k 30k
>> 0 3 96 0 0 0| 0 2104k|3079k 3093k| 0 0 |9267 30k
>> 0 3 97 0 0 0| 0 108k| 593k 506k| 0 0 |7591 28k
>> 0 3 96 0 0 0|1152k 4096B| 124k 238k| 0 0 |7914 29k
>> 0 3 97 0 0 0| 0 0 |5719B 1574B| 0 0 |8884 29k
>> 0 3 96 0 0 0| 0 224k| 18k 51k| 0 0 |9092 30k
>> 0 3 97 0 0 0| 0 0 | 87k 108k| 0 0 |8998 29k
>> 0 3 96 0 0 0|5328k 0 |1594k 6790k| 0 0 |8499 29k
>> 0 3 97 0 0 0| 0 156k| 44k 203k| 0 0 |8039 29k
>> 0 3 97 0 0 0| 0 0 | 17k 82k| 0 0 |5377 25k
>> 0 4 96 0 0 0| 0 252k|3224B 6858B| 0 0 | 28k 31k
>> 0 3 96 0 0 0| 0 0 |2139B 8535B| 0 0 | 10k 29k
>> 0 3 97 0 0 0|5508k 0 | 20k 5516k| 0 0 |7762 29k
>> 0 3 97 0 0 0| 0 72k| 10k 28k| 0 0 |7026 29k
>> 0 3 97 0 0 0| 0 0 |3782B 712B| 0 0 |5829 29k
>> 0 4 96 0 0 0| 0 264k| 11k 10k| 0 0 |5170 183k
>> 0 4 95 0 0 0| 0 0 | 15k 1415B| 0 0 |4125 245k
>> 0 3 97 0 0 0| 0 0 |5416B 966B| 0 0 |3608 40k
>> 0 3 97 0 0 0| 0 4096B|6233B 3868B| 0 0 |3636 41k
>> 0 3 96 0 0 0| 0 64k|5761B 1594B| 0 0 |3579 41k
>> 2 20 78 0 0 0| 0 80k| 11k 1544B| 0 0 | 16k 3590k
>> 3 26 71 0 0 0| 0 0 | 13k 9997B| 0 0 | 12k 4969k
>> 3 26 71 0 0 0| 0 0 |6685B 1042B| 0 0 | 12k 5017k
>> 3 25 71 0 0 0| 0 0 | 11k 1660B| 0 0 | 12k 4957k
>> 3 25 72 0 0 0| 0 0 |7765B 4463B| 0 0 | 12k 4862k
>> 3 25 72 0 0 0| 0 128k|8377B 1478B| 0 0 | 12k 4875k
>> 3 25 72 0 0 0| 0 0 |4407B 992B| 0 0 | 12k 4814k
>> 3 25 71 0 0 0| 0 0 | 17k 1959B| 0 0 | 12k 4993k
>> 3 26 71 0 0 0| 0 0 |7774B 1478B| 0 0 | 12k 4986k
>> 3 25 72 0 0 0| 0 0 |5728B 1066B| 0 0 | 12k 4921k
>> 3 25 71 0 0 0| 0 108k| 14k 1816B| 0 0 | 19k 4922k
>> 3 25 72 0 0 0| 0 0 |6296B 9096B| 0 0 | 12k 4899k
>> 3 25 72 0 0 0| 0 0 |8974B 1166B| 0 0 | 12k 4914k
>> 3 26 71 0 0 0| 0 0 | 12k 1038B| 0 0 | 12k 5133k
>> 3 25 72 0 0 0| 0 0 |9956B 1668B| 0 0 | 12k 4873k
>> 3 25 72 0 0 0| 0 228k|8025B 2172B| 0 0 | 12k 4841k
>> 3 25 72 0 0 0| 0 0 |7743B 1728B| 0 0 | 12k 4888k
>> 3 26 71 0 0 0| 0 0 |8287B 6061B| 0 0 | 12k 4955k
>> 3 26 70 0 0 0| 0 0 |8702B 1602B| 0 0 | 12k 5064k
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
>> usr sys idl wai hiq siq| read writ| recv send| in out | int csw
>> 3 26 71 0 0 0| 0 0 |8898B 1198B| 0 0 | 12k 4983k
>> 3 25 72 0 0 0| 0 152k| 10k 48k| 0 0 | 17k 4840k
>> 3 25 72 0 0 0| 0 0 |4030B 8837B| 0 0 | 15k 4900k
>> 3 25 72 0 0 0| 0 0 | 12k 2260B| 0 0 | 12k 4909k
>> 3 26 71 0 0 0| 0 0 | 14k 2169B| 0 0 | 12k 5001k
>> 3 26 70 0 0 0| 0 0 | 12k 1470B| 0 0 | 12k 5047k
>> 3 26 71 0 0 0| 0 232k| 14k 1964B| 0 0 | 12k 4967k
>> 3 25 72 0 0 0| 0 0 | 14k 2115B| 0 0 | 12k 4846k
>> 3 25 72 0 0 0| 0 0 | 15k 2144B| 0 0 | 12k 4945k
>> 3 25 72 0 0 0| 0 0 | 15k 1544B| 0 0 | 12k 4862k
>> 3 26 71 0 0 0| 0 0 |6595B 1124B| 0 0 | 12k 4991k
>> 3 27 70 0 0 0| 0 128k| 39k 7960B| 0 0 | 17k 5065k
>> 3 26 71 0 0 0| 0 0 | 13k 11k| 0 0 | 12k 4942k
>> 3 25 72 0 0 0| 0 0 |9035B 2072B| 0 0 | 12k 4881k
>> 3 25 72 0 0 0| 0 0 |9057B 2788B| 0 0 | 12k 4825k
>> 1 5 90 4 0 0| 41M 28M| 49M 104M| 0 0 | 30k 848k
>> 0 1 92 7 0 0| 28M 51M| 46M 166M| 0 0 | 22k 39k
>> 0 0 97 2 0 0| 19M 20M| 18M 48M| 0 0 | 15k 36k
>> 0 0 97 2 0 0| 21M 14M| 13M 41M| 0 0 | 15k 36k
>> 0 0 97 3 0 0| 13M 16M| 21M 33M| 0 0 | 14k 36k
>> 0 0 96 3 0 0| 14M 20M| 21M 42M| 0 0 | 14k 36k
>> 0 1 96 3 0 0| 15M 15M| 16M 32M| 0 0 | 24k 36k
>> 0 0 99 1 0 0| 11M 4351k|3789k 18M| 0 0 | 12k 34k
>> 0 0 99 0 0 0|7182k 0 |1881k 8056k| 0 0 | 11k 33k
>> 0 0 99 0 0 0|3164k 100k|2091k 6605k| 0 0 | 11k 35k
>> 0 0 99 0 0 0|3079k 2647k|6850k 6604k| 0 0 | 11k 33k
>> 0 0 98 1 0 0|8640k 7195k|6742k 17M| 0 0 | 12k 34k
>> 0 0 99 1 0 0| 0 4451k|6093k 7067k| 0 0 | 10k 33k
>> 0 0 99 1 0 0|3908k 4787k|1294k 4556k| 0 0 | 11k 33k
>> 0 0 99 0 0 0|7460k 44k|1505k 9884k| 0 0 |9809 32k
>>
>> Thanks,
>>
>> Steve
>>
>
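For anyone chasing something similar: the perf output earlier in the thread is the kind of thing Bryan's suggestion produces. A minimal sketch of capturing it during one of the 30-second events (exact flags can differ slightly between perf versions) would be:

  # live view of the hottest kernel/user symbols across all CPUs
  sudo perf top

  # or record ~30 seconds system-wide with call graphs, then summarize it
  sudo perf record -a -g -- sleep 30
  sudo perf report --stdio | head -50

The try_to_unmap_anon symbol at 70% in the sample above is consistent with the anon_vma slab issue described at the top of the thread.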
