Study of nginx-1.9.12 performance/latency on DragonFlyBSD-g67a73.

The performance and latency are measured using a modified version of wrk:
https://github.com/sepherosa/wrk.git (sephe/wrk branch)
It mainly adds a requests/connection setting and avoids several
unnecessary syscalls.
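For reference, a minimal way to fetch and build this wrk fork; the
build step is an assumption (stock wrk builds with a plain make, and
the fork is assumed to do the same):

  git clone https://github.com/sepherosa/wrk.git
  cd wrk
  git checkout sephe/wrk
  make
  # example run; --connreqs is the requests/connection knob added by the fork
  ./wrk -c 15000 --connreqs 4 -d 600s -t 8 --latency --delay \
      http://192.168.3.254/1K.bin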
Hardware configuration:
Server: 2-way E5-2620v2 (24 logical cpus), 32GB DDR3 1600 (4GBx8).
Client: i7-3770, 16GB DDR3 1600 (8GBx2).
NICs:   Intel 82599 10Ge, connected through DAC.

Network configuration:

+--------+                           +--------+
|        |192.168.3.254   192.168.3.1|        |
| Server +---------------------------+ Client |
|        |10Ge        DAC        10Ge|        |
+--------+                           +--------+

The MSL of the testing network is changed to 10ms by:
route change -net 192.168.3.0/24 -msl 10

DragonFlyBSD settings:

/boot/loader.conf:
kern.ipc.nmbclusters="524288"
kern.ipc.nmbjclusters="262144"

/etc/sysctl.conf:
kern.ipc.somaxconn=256
machdep.mwait.CX.idle=AUTODEEP
net.inet.ip.portrange.last=40000

And powerd(8) is enabled on both sides during the measurements.

NOTE: Unlike other nginx performance measurements, which use nginx's
default number of requests/connection (100) or even intentionally use
an infinite number of requests/connection, we use three values for
requests/connection throughout these measurements: 1 request/connection,
4 requests/connection and 14 requests/connection, which are closer to
real-world usage; as noted in RFC 6928, 35% of HTTP requests are made
on new connections, and see the data from httparchive.com around 2014:
https://discuss.httparchive.org/t/distribution-of-http-requests-per-tcp-connection/365

NOTE: Unless otherwise noted: polling(4) @1000hz and IW4 are used.
32 workers are used and the 'reuseport' option is enabled in
nginx-1.9.12.

==========================
The effect of DragonFlyBSD polling(4).

The results of the following command, with interrupt and different
polling frequency settings:
./wrk -c 15000 --connreqs 1 -d 600s -t 8 --latency --delay http://192.168.3.254/1K.bin
(15000 concurrent connections, 1 request/connection, 1KB web object,
600 seconds average).

          intr (7700/s) | poll (7000hz) | poll (4000hz) | poll (1000hz)
         ---------------+---------------+---------------+---------------
Reqs/s         116961   |    140580     |    142862     |    144807
LatAvg        64.14ms   |    54.25ms    |    52.87ms    |    51.20ms
LatStdev     150.30ms   |    21.68ms    |    19.16ms    |    13.96ms

So in addition to greatly improving the performance (~20%, even if we
set the polling rate close to the interrupt rate), polling also reduces
the average latency and the latency stdev.  And the lower the polling
rate, the better the performance.

==========================
The effect of the 'reuseport' option in nginx-1.9.12 on DragonFlyBSD.

The results of the following command, with the 'reuseport' option on
and off in nginx:
./wrk -c 15000 --connreqs X -d 600s -t 8 --latency --delay http://192.168.3.254/1K.bin
(15000 concurrent connections, X requests/connection X={1,4,14},
1KB web object, 600 seconds average).

1 request/connection
              no reuseport | reuseport
             --------------+-----------
Reqs/s               45589 |    144807
contention         1200K/s |     30K/s

4 requests/connection
              no reuseport | reuseport
             --------------+-----------
Reqs/s              158603 |    227856
contention         1300K/s |    100K/s

14 requests/connection
              no reuseport | reuseport
             --------------+-----------
Reqs/s              246833 |    250335
contention          500K/s |    150K/s

So the 'reuseport' option drastically improves the performance when the
requests/connection is low (~210% for 1 request/connection, and ~40%
for 4 requests/connection).  And obviously the 'reuseport' option
greatly reduces the contention rate.
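For illustration, a minimal nginx.conf sketch matching the setup used
here (32 workers, 'reuseport' on); worker_connections and the document
root are assumptions, not values taken from these measurements:

  worker_processes  32;

  events {
      use  kqueue;               # BSD event mechanism
      worker_connections  8192;  # assumption
  }

  http {
      server {
          # 'reuseport' makes nginx create one listen socket per worker
          listen  80  reuseport;
          location / {
              root  /var/www/html;  # assumption
          }
      }
  }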
==========================
The number of workers in nginx-1.9.12 on DragonFlyBSD (interaction of
the non-power-of-2 number of cpus and the power-of-2 number of netisrs
with nginx's 'reuseport' option on DragonFlyBSD).

The results of the following command, with different numbers of workers:
./wrk -c 15000 --connreqs 1 -d 600s -t 8 --latency --delay http://192.168.3.254/1K.bin
(15000 concurrent connections, 1 request/connection, 1KB web object,
600 seconds average).

            16 workers | 24 workers | 32 workers
           ------------+------------+------------
Reqs/s         132645  |    143276  |    144807
LatAvg        46.48ms  |   54.14ms  |   51.20ms
LatStdev      27.88ms  |   18.29ms  |   13.96ms
contention      20K/s  |     33K/s  |     30K/s

Since the server has 24 logical cpus, 16 workers give less performance
than 24/32 workers, even though 16 matches the number of netisrs.  Its
latency and contention rate are also lower, because fewer requests are
handled.

24 workers have slightly lower performance, and higher latency and
contention rate, than 32 workers.  Why? ;)  It's mainly because of how
DragonFlyBSD implements SO_REUSEPORT: incoming TCP connections are
dispatched to the netisrs (the number of which is a power of 2) based
on the SYN's RSS hash value, and from there the listen socket's inpcb
is looked up based on the same RSS hash value.  If the number of listen
sockets is not a power of 2, i.e. not aligned with the number of
netisrs, some listen sockets end up being accessed by more than one
netisr, so a certain amount of extra contention happens, which reduces
performance and increases latency.  That's why 32 workers (aligned with
the 16 netisrs on the server) behave better than 24 workers on
DragonFlyBSD.

==========================
Web object size, performance and interface bit rate on DragonFlyBSD.

The results of the following command, with different web object sizes:
./wrk -c 15000 --connreqs 1 -d 600s -t 8 --latency --delay http://192.168.3.254/_X_K.bin
(15000 concurrent connections, 1 request/connection, _X_ KB web object
_X_={1,8,16}, 600 seconds average).

            1KB object | 8KB object | 16KB object
           ------------+------------+-------------
Reqs/s         144807  |    105100  |      68909
LatAvg        51.20ms  |   70.53ms  |   195.49ms
BitRate       1.7Gbps  |   7.5Gbps  |    9.5Gbps
Idle               0%  |       34%  |        54%

DragonFlyBSD maxes out the 10Ge for the 16KB web object (or somewhere
between 8KB and 16KB :).

And as far as I have tested, IW10 helps neither performance nor latency
in these measurements.
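For reproducibility, the fixed-size web objects used above can be
generated along these lines (the document root path is an assumption):

  cd /var/www/html
  dd if=/dev/zero of=1K.bin  bs=1k count=1
  dd if=/dev/zero of=8K.bin  bs=1k count=8
  dd if=/dev/zero of=16K.bin bs=1k count=16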
Thanks,
sephe

--
Tomorrow Will Never Die