Bryan,

Somehow the problem came back after a few hours (not as bad yet)... This is
ATS 5.3.2 with a few patches (available on the Traffic Control GitHub).

Per-node numastat info (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
Numa_Hit               831049.44       684360.21      1515409.66
Numa_Miss                   0.00            0.00            0.00
Numa_Foreign                0.00            0.00            0.00
Interleave_Hit            423.24          423.16          846.40
Local_Node             831046.29       683892.71      1514939.01
Other_Node                  3.15          467.50          470.65


#  sysctl -a | grep huge
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.hugetlb_shm_group = 0
vm.hugepages_treat_as_movable = 0
vm.nr_overcommit_hugepages = 0

We have ATS set with CONFIG proxy.config.allocator.hugepages INT 0 in
records.config.
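
The running value can also be double-checked with traffic_line on 5.3.x (the
0 below is what I'd expect, not a paste):

# traffic_line -r proxy.config.allocator.hugepages
0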

On Wed, Aug 17, 2016 at 5:11 PM, Bryan Call <[email protected]> wrote:

> Can you run: numastat -v?  Looking up try_to_unmap_anon, it looks like the
> kernel is trying to balance memory usage across the NUMA nodes.  There was
> a recent change to how memory allocation is done on the AIO threads in
> this ticket:
> https://issues.apache.org/jira/browse/TS-4719
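>
> If the kernel is new enough (3.8+), it may also be worth checking whether
> automatic NUMA balancing is enabled; the knob below won't exist on older
> kernels:
>
> # sysctl kernel.numa_balancing
> # cat /proc/sys/kernel/numa_balancing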
>
> Can you also run sudo sysctl -a | grep huge and paste the output?
>
> -Bryan
>
>
>
> On Aug 16, 2016, at 11:00 PM, Steve Malenfant <[email protected]>
> wrote:
>
> Bryan, here's the output during one of those events...
>
> This server's RAM is about 80% used. Hugepages are not enabled. There is
> 256GB of RAM cache on a server with 384GB of RAM.
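>
> (As a sanity check against the meminfo below: Mlocked is 289595216 kB,
> roughly 276GB, which is about the 256GB ram cache plus ATS overhead.)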
>
> Samples: 756K of event 'cycles', Event count (approx.): 26719576313
> Overhead  Shared Object            Symbol
>   70.35%  [kernel]                 [k] try_to_unmap_anon
>    2.42%  [kernel]                 [k] find_busiest_group
>    1.34%  [kernel]                 [k] _spin_lock
>    0.73%  [kernel]                 [k] thread_return
>    0.64%  libpthread-2.12.so       [.] pthread_cond_timedwait@@GLIBC_2.3.2
>    0.52%  [kernel]                 [k] schedule
>    0.50%  [kernel]                 [k] find_next_bit
>    0.42%  [kernel]                 [k] cpumask_next_and
>    0.38%  [kernel]                 [k] _spin_lock_irqsave
>
> [~]# cat /proc/meminfo
> MemTotal:       396790288 kB
> MemFree:        52626016 kB
> Buffers:          151148 kB
> Cached:          2035288 kB
> SwapCached:            0 kB
> Active:          1198652 kB
> Inactive:        1171944 kB
> Active(anon):     222464 kB
> Inactive(anon):      152 kB
> Active(file):     976188 kB
> Inactive(file):  1171792 kB
> Unevictable:    289593148 kB
> Mlocked:        289595216 kB
> SwapTotal:       4095996 kB
> SwapFree:        4095996 kB
> Dirty:               208 kB
> Writeback:             0 kB
> AnonPages:      289777432 kB
> Mapped:           128752 kB
> Shmem:               244 kB
> Slab:           15548892 kB
> SReclaimable:   15454924 kB
> SUnreclaim:        93968 kB
> KernelStack:       17648 kB
> PageTables:       573316 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    202491140 kB
> Committed_AS:   291736828 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:     1383676 kB
> VmallocChunk:   34157955876 kB
> HardwareCorrupted:     0 kB
> AnonHugePages:         0 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:       58368 kB
> DirectMap2M:     1968128 kB
> DirectMap1G:    400556032 kB
>
> On Tue, Aug 16, 2016 at 8:54 PM, Bryan Call <[email protected]> wrote:
>
>> Can you run sudo perf top or sudo perf record during the time you are
>> having a spike in CPU usage?
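>>
>> For example, something like this (adjust the duration so it covers one of
>> the spikes):
>>
>> sudo perf top -g
>> sudo perf record -a -g -- sleep 30 && sudo perf report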
>>
>> -Bryan
>>
>>
>>
>>
>> On Aug 16, 2016, at 3:09 PM, Steve Malenfant <[email protected]>
>> wrote:
>>
>> We have a condition where 4 caches go to high CPU utilization (1000%)
>> every 33-34 minutes and stop accepting incoming connections. This lasts
>> about 30 seconds and then goes away.
>>
>> It seems to occur with some specific origin/content, though I don't know
>> exactly which one so far.
>>
>> I'm including some details (in case it's obvious), but how do we profile
>> a running server to see what's causing this high CPU usage?
>>
>> The load is normal, maybe about 100Mbps (basically idling).
>>
>> top - 17:20:33 up 61 days,  3:32,  4 users,  load average: 4.74, 1.20,
>> 0.53
>> Tasks: 616 total,  16 running, 600 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  2.6%us, 22.3%sy,  0.0%ni, 74.9%id,  0.0%wa,  0.0%hi,  0.2%si,
>>  0.0%st
>> Mem:  396790288k total, 342530608k used, 54259680k free,   148764k buffers
>> Swap:  4095996k total,        0k used,  4095996k free,  1618892k cached
>>
>>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>   8703 ats       20   0  282g 275g  39m D 853.9 72.7  12496:54 [ET_NET 0]
>>     61 root      20   0     0    0    0 R  5.6  0.0   5:10.11 ksoftirqd/14
>>     77 root      20   0     0    0    0 R  5.0  0.0   5:15.08 ksoftirqd/18
>>    125 root      20   0     0    0    0 R  5.0  0.0   5:02.48 ksoftirqd/30
>>     13 root      20   0     0    0    0 R  4.6  0.0   5:23.63 ksoftirqd/2
>>     37 root      20   0     0    0    0 R  4.6  0.0   5:10.31 ksoftirqd/8
>>     45 root      20   0     0    0    0 R  4.6  0.0   5:10.36 ksoftirqd/10
>>     93 root      20   0     0    0    0 R  4.6  0.0   5:29.05 ksoftirqd/22
>>    109 root      20   0     0    0    0 R  4.6  0.0   5:06.27 ksoftirqd/26
>>     21 root      20   0     0    0    0 R  4.3  0.0   5:12.82 ksoftirqd/4
>>     29 root      20   0     0    0    0 R  4.3  0.0   5:10.79 ksoftirqd/6
>>     85 root      20   0     0    0    0 R  4.3  0.0   5:18.36 ksoftirqd/20
>>    101 root      20   0     0    0    0 R  4.3  0.0   5:13.54 ksoftirqd/24
>>      4 root      20   0     0    0    0 R  4.0  0.0   5:16.25 ksoftirqd/0
>>     69 root      20   0     0    0    0 R  4.0  0.0   5:09.72 ksoftirqd/16
>>    117 root      20   0     0    0    0 R  3.0  0.0   5:03.74 ksoftirqd/28
>>     53 root      20   0     0    0    0 S  1.7  0.0   5:07.40 ksoftirqd/12
>>
>> Here's a "dstat" output of that server when this is occurring... updated
>> every second.
>>
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
>>   0   0  98   1   0   0|7940k   10M|6270k   18M|   0     0 |  13k   35k
>>   0   0  99   0   0   0|9738k 4160k|1600k   10M|   0     0 |  12k   34k
>>   0   0  99   0   0   0|3232k 4136k|1390k 5724k|   0     0 |  11k   34k
>>   0   0 100   0   0   0|   0  4096k|  34k  555k|   0     0 |  10k   33k
>>   0   0  99   0   0   0|1316k 9095k|7028k 7240k|   0     0 |  11k   34k
>>   0   1  99   0   0   0| 288k 8658k|6189k 5878k|   0     0 |  29k   36k
>>   0   0  99   1   0   0|  13M 6204k|2870k   17M|   0     0 |  14k   35k
>>   0   0  99   0   0   0|   0  4096k|3009k 3387k|   0     0 |  14k   35k
>>   0   0  99   0   0   0|1028k 6384k|5138k 6047k|   0     0 |  14k   35k
>>   0   0  99   1   0   0|5127k 8192k|4262k 7417k|   0     0 |  13k   34k
>>   0   0  99   0   0   0|2372k 2612k|2011k 4944k|   0     0 |  14k   35k
>>   0   0  99   0   0   0|   0  4172k|1633k 2201k|   0     0 |  13k   35k
>>   0   0  99   0   0   0|  11M  608k|2528k   12M|   0     0 |  13k   35k
>>   0   3  95   1   0   0|  15M 8409k|  11M   28M|   0     0 |  16k   37k
>>   0   3  95   1   0   0|6324k 4244k|7959k   20M|   0     0 |  12k   32k
>>   0   4  96   0   0   0| 256k 2128k| 853k 2823k|   0     0 |  27k   32k
>>   0   3  96   0   0   0|1028k    0 | 472k  562k|   0     0 |  10k   31k
>>   0   3  96   0   0   0|1028k   68k|1457k 1454k|   0     0 |  11k   32k
>>   0   3  97   0   0   0|   0     0 | 316k  317k|   0     0 |  11k   31k
>>   0   3  97   0   0   0|3076k    0 | 415k 3251k|   0     0 |9498    30k
>>   0   3  97   0   0   0|2688k   84k|  16k 2985k|   0     0 |8623    30k
>>   0   3  97   0   0   0|   0     0 |  64k   65k|   0     0 |9121    30k
>>   0   3  97   0   0   0|   0    68k|1607k 1556k|   0     0 |  11k   31k
>>   0   3  97   0   0   0|   0     0 |  18k  120k|   0     0 |7669    28k
>>   0   3  97   0   0   0|   0     0 | 536k  537k|   0     0 |8660    30k
>>   0   4  96   0   0   0|   0    72k|  65k   75k|   0     0 |  26k   30k
>>   0   3  96   0   0   0|   0  2104k|3079k 3093k|   0     0 |9267    30k
>>   0   3  97   0   0   0|   0   108k| 593k  506k|   0     0 |7591    28k
>>   0   3  96   0   0   0|1152k 4096B| 124k  238k|   0     0 |7914    29k
>>   0   3  97   0   0   0|   0     0 |5719B 1574B|   0     0 |8884    29k
>>   0   3  96   0   0   0|   0   224k|  18k   51k|   0     0 |9092    30k
>>   0   3  97   0   0   0|   0     0 |  87k  108k|   0     0 |8998    29k
>>   0   3  96   0   0   0|5328k    0 |1594k 6790k|   0     0 |8499    29k
>>   0   3  97   0   0   0|   0   156k|  44k  203k|   0     0 |8039    29k
>>   0   3  97   0   0   0|   0     0 |  17k   82k|   0     0 |5377    25k
>>   0   4  96   0   0   0|   0   252k|3224B 6858B|   0     0 |  28k   31k
>>   0   3  96   0   0   0|   0     0 |2139B 8535B|   0     0 |  10k   29k
>>   0   3  97   0   0   0|5508k    0 |  20k 5516k|   0     0 |7762    29k
>>   0   3  97   0   0   0|   0    72k|  10k   28k|   0     0 |7026    29k
>>   0   3  97   0   0   0|   0     0 |3782B  712B|   0     0 |5829    29k
>>   0   4  96   0   0   0|   0   264k|  11k   10k|   0     0 |5170   183k
>>   0   4  95   0   0   0|   0     0 |  15k 1415B|   0     0 |4125   245k
>>   0   3  97   0   0   0|   0     0 |5416B  966B|   0     0 |3608    40k
>>   0   3  97   0   0   0|   0  4096B|6233B 3868B|   0     0 |3636    41k
>>   0   3  96   0   0   0|   0    64k|5761B 1594B|   0     0 |3579    41k
>>   2  20  78   0   0   0|   0    80k|  11k 1544B|   0     0 |  16k 3590k
>>   3  26  71   0   0   0|   0     0 |  13k 9997B|   0     0 |  12k 4969k
>>   3  26  71   0   0   0|   0     0 |6685B 1042B|   0     0 |  12k 5017k
>>   3  25  71   0   0   0|   0     0 |  11k 1660B|   0     0 |  12k 4957k
>>   3  25  72   0   0   0|   0     0 |7765B 4463B|   0     0 |  12k 4862k
>>   3  25  72   0   0   0|   0   128k|8377B 1478B|   0     0 |  12k 4875k
>>   3  25  72   0   0   0|   0     0 |4407B  992B|   0     0 |  12k 4814k
>>   3  25  71   0   0   0|   0     0 |  17k 1959B|   0     0 |  12k 4993k
>>   3  26  71   0   0   0|   0     0 |7774B 1478B|   0     0 |  12k 4986k
>>   3  25  72   0   0   0|   0     0 |5728B 1066B|   0     0 |  12k 4921k
>>   3  25  71   0   0   0|   0   108k|  14k 1816B|   0     0 |  19k 4922k
>>   3  25  72   0   0   0|   0     0 |6296B 9096B|   0     0 |  12k 4899k
>>   3  25  72   0   0   0|   0     0 |8974B 1166B|   0     0 |  12k 4914k
>>   3  26  71   0   0   0|   0     0 |  12k 1038B|   0     0 |  12k 5133k
>>   3  25  72   0   0   0|   0     0 |9956B 1668B|   0     0 |  12k 4873k
>>   3  25  72   0   0   0|   0   228k|8025B 2172B|   0     0 |  12k 4841k
>>   3  25  72   0   0   0|   0     0 |7743B 1728B|   0     0 |  12k 4888k
>>   3  26  71   0   0   0|   0     0 |8287B 6061B|   0     0 |  12k 4955k
>>   3  26  70   0   0   0|   0     0 |8702B 1602B|   0     0 |  12k 5064k
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
>>   3  26  71   0   0   0|   0     0 |8898B 1198B|   0     0 |  12k 4983k
>>   3  25  72   0   0   0|   0   152k|  10k   48k|   0     0 |  17k 4840k
>>   3  25  72   0   0   0|   0     0 |4030B 8837B|   0     0 |  15k 4900k
>>   3  25  72   0   0   0|   0     0 |  12k 2260B|   0     0 |  12k 4909k
>>   3  26  71   0   0   0|   0     0 |  14k 2169B|   0     0 |  12k 5001k
>>   3  26  70   0   0   0|   0     0 |  12k 1470B|   0     0 |  12k 5047k
>>   3  26  71   0   0   0|   0   232k|  14k 1964B|   0     0 |  12k 4967k
>>   3  25  72   0   0   0|   0     0 |  14k 2115B|   0     0 |  12k 4846k
>>   3  25  72   0   0   0|   0     0 |  15k 2144B|   0     0 |  12k 4945k
>>   3  25  72   0   0   0|   0     0 |  15k 1544B|   0     0 |  12k 4862k
>>   3  26  71   0   0   0|   0     0 |6595B 1124B|   0     0 |  12k 4991k
>>   3  27  70   0   0   0|   0   128k|  39k 7960B|   0     0 |  17k 5065k
>>   3  26  71   0   0   0|   0     0 |  13k   11k|   0     0 |  12k 4942k
>>   3  25  72   0   0   0|   0     0 |9035B 2072B|   0     0 |  12k 4881k
>>   3  25  72   0   0   0|   0     0 |9057B 2788B|   0     0 |  12k 4825k
>>   1   5  90   4   0   0|  41M   28M|  49M  104M|   0     0 |  30k  848k
>>   0   1  92   7   0   0|  28M   51M|  46M  166M|   0     0 |  22k   39k
>>   0   0  97   2   0   0|  19M   20M|  18M   48M|   0     0 |  15k   36k
>>   0   0  97   2   0   0|  21M   14M|  13M   41M|   0     0 |  15k   36k
>>   0   0  97   3   0   0|  13M   16M|  21M   33M|   0     0 |  14k   36k
>>   0   0  96   3   0   0|  14M   20M|  21M   42M|   0     0 |  14k   36k
>>   0   1  96   3   0   0|  15M   15M|  16M   32M|   0     0 |  24k   36k
>>   0   0  99   1   0   0|  11M 4351k|3789k   18M|   0     0 |  12k   34k
>>   0   0  99   0   0   0|7182k    0 |1881k 8056k|   0     0 |  11k   33k
>>   0   0  99   0   0   0|3164k  100k|2091k 6605k|   0     0 |  11k   35k
>>   0   0  99   0   0   0|3079k 2647k|6850k 6604k|   0     0 |  11k   33k
>>   0   0  98   1   0   0|8640k 7195k|6742k   17M|   0     0 |  12k   34k
>>   0   0  99   1   0   0|   0  4451k|6093k 7067k|   0     0 |  10k   33k
>>   0   0  99   1   0   0|3908k 4787k|1294k 4556k|   0     0 |  11k   33k
>>   0   0  99   0   0   0|7460k   44k|1505k 9884k|   0     0 |9809    32k
>>
>> Thanks,
>>
>> Steve
>>
>>
>>
>
>
