Strange system behaviour during haproxy run

Krishna Kumar (Engineering) Tue, 07 Jul 2015 00:55:31 -0700

Hi all,

This is not related to haproxy itself, but I am having a performance issue with
the number of packets processed. I am running haproxy on a 48-core system (we
have 64 such servers at present, and that number is going to increase for
production testing), where cpus 0,2,4,6,...,46 are part of NUMA node 1 and cpus
1,3,5,7,...,47 are part of NUMA node 2. The systems are running Debian 7 with
kernel 3.16.0-23 (the kernel has both CONFIG_XPS and CONFIG_RPS enabled).
nbproc is set to 12, and each haproxy process is bound to one of cpus
0,2,4,...,22, so that they are all on the same socket, as seen here:


# ps -efF | egrep "hap|PID" | cut -c1-80
UID         PID   PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
haproxy    3099      1 17 89697 324024  0 18:37 ?        00:11:19 haproxy -f hap
haproxy    3100      1 18 87171 314324  2 18:37 ?        00:12:00 haproxy -f hap
haproxy    3101      1 18 87214 305328  4 18:37 ?        00:12:00 haproxy -f hap
haproxy    3102      1 19 89215 322676  6 18:37 ?        00:12:02 haproxy -f hap
haproxy    3103      1 18 86788 310976  8 18:37 ?        00:11:59 haproxy -f hap
haproxy    3104      1 18 87197 314888 10 18:37 ?        00:12:00 haproxy -f hap
haproxy    3105      1 18 91311 319784 12 18:37 ?        00:11:59 haproxy -f hap
haproxy    3106      1 18 88785 305576 14 18:37 ?        00:12:00 haproxy -f hap
haproxy    3107      1 19 90366 326428 16 18:37 ?        00:12:09 haproxy -f hap
haproxy    3108      1 19 89758 320780 18 18:37 ?        00:12:09 haproxy -f hap
haproxy    3109      1 19 87670 314752 20 18:37 ?        00:12:07 haproxy -f hap
haproxy    3110      1 19 87763 316672 22 18:37 ?        00:12:10 haproxy -f hap
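
For reference, the cpu sets described above can be written in the hexadecimal
bitmask form that files such as /proc/irq/*/smp_affinity use (comma-separated
32-bit groups on systems with more than 32 cpus). A minimal sketch in Python;
cpu_mask() is a hypothetical helper, not part of haproxy or of
set_irq_affinity.sh:

```python
def cpu_mask(cpus):
    """Return a hex cpu bitmask, split into comma-separated 32-bit groups."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu          # one bit per cpu index
    digits = format(mask, "x")
    groups = []
    while digits:
        groups.insert(0, digits[-8:])   # 8 hex digits = 32 bits
        digits = digits[:-8]
    return ",".join(groups)

# The 12 haproxy processes sit on the even cpus 0..22.
haproxy_cpus = list(range(0, 24, 2))
# All cpus of each NUMA node on the 48-core layout described in this mail.
node1_cpus = list(range(0, 48, 2))
node2_cpus = list(range(1, 48, 2))

print("haproxy cpus:", cpu_mask(haproxy_cpus))
print("node 1 cpus :", cpu_mask(node1_cpus))
print("node 2 cpus :", cpu_mask(node2_cpus))
```

This only does the arithmetic; it does not read or write any affinity file.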

set_irq_affinity.sh was run on the ixgbe card, and /proc/irq/*/smp_affinity
shows that each irq is bound to cpus 0-47 correctly. However, I see that
packets are being processed on cpus of the 2nd socket too, though user/system
usage is zero on those cpus, as haproxy does not run on them. The following
shows the difference in the number of packets processed on the different rx/tx
queues over 10 seconds:

# ./rx_tx   /tmp/ethtool_start     /tmp/ethtool_end
"Significant" difference in #packets processed after 10 seconds on the various rx/tx queues:
Queue#        TX                    RX
0               2623165         2826065
1               2564573         2749859
2               2901998         2801043
3               2636856         2794000
4               2892465         2742228
5               3087442         2795762
6               2936588         2760732
7               2934087         2767705
8               2260933         2767707
9               2165087         2759038
10              2144893         2814390
11              2302304         2835790
12              3037722         2748335
13              2940284         2727689
14              2348277         2830378
15              2117679         2838013
16              2679899         487703
17              2447832         438733
18              2505330         429834
19              2611643         447960
20              2595708         449729
21              2534836         447217
22              2616150         466920
23              2522947         450145
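
The rx_tx script itself is not shown in this mail. A minimal sketch of how such
a per-queue difference could be computed from two saved `ethtool -S` snapshots
follows. It assumes counter names of the form tx_queue_<N>_packets and
rx_queue_<N>_packets (an assumption about the driver's statistics naming);
queue_packets() and queue_deltas() are hypothetical helpers:

```python
import re

# Assumed counter naming: "tx_queue_<N>_packets: <count>" (same for rx).
_QUEUE_RE = re.compile(r"^\s*(tx|rx)_queue_(\d+)_packets:\s*(\d+)\s*$")

def queue_packets(text):
    """Parse snapshot text into {(direction, queue): packets}."""
    counters = {}
    for line in text.splitlines():
        m = _QUEUE_RE.match(line)
        if m:
            counters[(m.group(1), int(m.group(2)))] = int(m.group(3))
    return counters

def queue_deltas(start_text, end_text):
    """Return {queue: (tx_delta, rx_delta)} for queues seen in both snapshots."""
    start, end = queue_packets(start_text), queue_packets(end_text)
    queues = sorted({q for (_, q) in start} & {q for (_, q) in end})
    return {
        q: (end.get(("tx", q), 0) - start.get(("tx", q), 0),
            end.get(("rx", q), 0) - start.get(("rx", q), 0))
        for q in queues
    }
```

Usage would be along the lines of
queue_deltas(open("/tmp/ethtool_start").read(), open("/tmp/ethtool_end").read()),
printing one line per queue.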

mpstat shows that the even-numbered cpus 0 through 22 are heavily used, while
the odd ones only do softirq processing:

Average:    CPU     %usr     %sys    %soft    %idle
Average:      0    15.47    60.0     24.47     0.00
Average:      1     0.00     0.00    12.86    87.14
Average:      2    20.32    58.49    21.19     0.00
Average:      3     0.10     0.00     2.59    97.30
Average:      4    18.20    60.87    20.93     0.00
Average:      5     0.10     0.00     4.15    95.75
Average:      6    18.75    59.37    21.88     0.00
Average:      7     0.00     0.00     3.03    96.97
Average:      8    22.75    57.71    19.55     0.00
Average:      9     0.00     0.00     2.78    97.22
Average:     10    21.87    57.67    20.47     0.00
Average:     11     0.00     0.00     2.80    97.20
Average:     12    19.48    59.84    20.68     0.00
Average:     13     0.00     0.00     1.76    98.24
Average:     14    22.58    57.16    20.25     0.00
Average:     15     0.00     0.00     1.57    98.43
Average:     16    27.00    67.00     6.00     0.00
Average:     17     0.00     0.07     0.59    99.27
Average:     18    26.17    67.84     5.93     0.07
Average:     19     0.00     0.00     0.15    99.78
Average:     20    26.52    67.36     6.13     0.00
Average:     21     0.00     0.00     0.30    99.63
Average:     22    27.69    66.71     5.60     0.00
Average:     23     0.00     0.00     0.07    99.93
Average:     24     0.00     0.00     0.00   100.00
        (remaining are 100% idle)

Is there a way to make sure that tx/rx happens only on the cpus that haproxy
runs on? The reason I think this is affecting performance is locking and IPIs:
cpu#0 gets skbs and is in the softirq handler. netif_receive_skb() calls
get_rps_cpu() and uses the flow information to find that this skb is for cpu#1.
Next, cpu#0 calls enqueue_to_backlog() with the cpu#1 index as a parameter,
which takes the input_pkt_queue_lock of cpu#1, contending across nodes for a
lock that should normally be used only by cpu#1, and then enqueues the skb.
Finally, cpu#0 sends an IPI to cpu#1 to process its backlog, since we added
skbs to it.
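
The path described above can be illustrated with a toy model (plain Python, not
kernel code). It assumes the cpu numbering of this system, where even cpus are
on one node and odd cpus on the other, and uses a simple modulo as a stand-in
for the flow lookup. pick_rps_cpu() and count_remote_enqueues() are
hypothetical names, and the real get_rps_cpu() logic is more involved:

```python
def numa_node(cpu):
    """Even cpus are on one node and odd cpus on the other, as on this system."""
    return cpu % 2

def pick_rps_cpu(flow_hash, rps_cpus):
    """Stand-in for the flow lookup: map a flow hash onto the allowed cpus."""
    return rps_cpus[flow_hash % len(rps_cpus)]

def count_remote_enqueues(irq_cpu, flow_hashes, rps_cpus):
    """Count packets whose backlog cpu is on a different node than irq_cpu.

    Each such packet stands for one cross-node lock acquisition plus a
    possible IPI in the description above.
    """
    remote = 0
    for h in flow_hashes:
        target = pick_rps_cpu(h, rps_cpus)
        if numa_node(target) != numa_node(irq_cpu):
            remote += 1
    return remote

flows = range(1000)                       # made-up flow hashes
all_cpus = list(range(48))                # steering allowed to every cpu
haproxy_cpus = list(range(0, 24, 2))      # steering limited to haproxy cpus

print("all cpus    :", count_remote_enqueues(0, flows, all_cpus))
print("haproxy cpus:", count_remote_enqueues(0, flows, haproxy_cpus))
```

In this model, packets received on cpu 0 never cross nodes when steering is
limited to the even cpus 0..22, while half of them do when all 48 cpus are
allowed. Whether the real system behaves this way depends on the actual RPS
and irq configuration.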

Thanks,
- Krishna Kumar
