Re: Few problems seen in haproxy? (threads, connections).
Hi Willy,

My systems were out of rotation for some other tests, so I did not get to this till now. I have pulled the latest bits just now and tested.

Regarding maxconn, I simply kept maxconn in global/defaults at 1 million and have this line in the backend section:

    default-server maxconn 100

I have not seen the Queue/Max you mentioned earlier. The FD time has gone down to zero, but the LB time has increased about 50% from last time (7700 ns to 11600 ns; I am using 'balance leastconn'). Test was run for 1 minute:

    $ wrk -c 4800 -t 48 -d 60s http://www.flipkart.com/128

The results were for 32 threads, which is the same configuration I tested with earlier. Both of these tests were done with threads pinned to NUMA-1 cores (cores 1, 3, 5, ..., 47), and irqs to NUMA-0 (0, 2, 4, ..., 46). However, the cpu list recycles from 1-47 back to 1-15 for the thread pinning, so that may explain the much higher lock numbers that I am seeing. When I changed this to use all cpus (0-31), the LBPRM lock took 74339.117 ns per operation, but performance dropped from 210K to 80K.

Overall, I am not at ease with threading, or will have to settle for 12 threads on the 12 non-hyperthreaded cores of a single socket. Inlining the output of locks for the case where all threads are pinned to NUMA-1 cores (and hence 2 threads on the same core for some cores), at the end of this mail.
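For reference, a minimal sketch of the configuration described above (backend name, server names and addresses are placeholders, not from the original mail):

```
global
    maxconn 1000000

defaults
    mode http
    maxconn 1000000

backend be_test                  # hypothetical backend name
    balance leastconn
    default-server maxconn 100   # per-server limit as described above
    server srv1 10.0.0.1:80      # placeholder address
    server srv2 10.0.0.2:80      # placeholder address
```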
Thanks,
- Krishna

Stats about Lock FD:
# write lock : 2
# write unlock: 2 (0)
# wait time for write : 0.001 msec
# wait time for write/lock: 302.000 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_RQ:
# write lock : 373317
# write unlock: 373317 (0)
# wait time for write : 341.875 msec
# wait time for write/lock: 915.775 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_WQ:
# write lock : 373432
# write unlock: 373432 (0)
# wait time for write : 491.524 msec
# wait time for write/lock: 1316.235 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LISTENER:
# write lock : 1248
# write unlock: 1248 (0)
# wait time for write : 0.295 msec
# wait time for write/lock: 236.341 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock PROXY:
# write lock : 12524202
# write unlock: 12524202 (0)
# wait time for write : 20979.972 msec
# wait time for write/lock: 1675.154 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock SERVER:
# write lock : 50100330
# write unlock: 50100330 (0)
# wait time for write : 76908.311 msec
# wait time for write/lock: 1535.086 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LBPRM:
# write lock : 50096808
# write unlock: 50096808 (0)
# wait time for write : 584505.012 msec
# wait time for write/lock: 11667.510 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock BUF_WQ:
# write lock : 35653802
# write unlock: 35653802 (0)
# wait time for write : 80406.420 msec
# wait time for write/lock: 2255.199 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock STRMS:
# write lock : 9602
# write unlock: 9602 (0)
# wait time for write : 5.613 msec
# wait time for write/lock: 584.594 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock VARS:
# write lock : 37596611
# write unlock: 37596611 (0)
# wait time for write : 2285.148 msec
# wait time for write/lock: 60.781 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec

On Mon, Oct 15, 2018 at 11:14 PM Willy Tarreau wrote:

> Hi again,
>
> finally I got rid of the FD lock for single-threaded accesses (most of
> them), and based on Olivier's suggestion, I implemented a per-thread
> wait queue, and cache-aligned some list heads to avoid undesired cache
> line sharing. For me all of this combined resulted in a performance
> increase of 25% on a 12-threads workload. I'm interested in your test
> results, all of this is in the latest master.
>
> If you still see LBPRM a lot, I can send you the experimental patch
> to move the element inside the tree without unlinking/relinking it
> and we can see if that provides any benefit or not (I'm not convinced).
>
> Cheers,
> Willy
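As a sanity check on the dump above: each "wait time per lock" line is simply the total wait time divided by the number of lock operations. A small illustrative script (not part of haproxy) reproduces the LBPRM and SERVER figures:

```python
def avg_wait_ns(total_wait_msec, lock_count):
    """Average wait per lock operation in nanoseconds:
    total wait time (msec) * 1e6 / number of lock operations."""
    return total_wait_msec * 1e6 / lock_count

# Figures taken from the lock dump above:
print(avg_wait_ns(584505.012, 50096808))  # LBPRM: ~11667.5 ns
print(avg_wait_ns(76908.311, 50100330))   # SERVER: ~1535.1 ns
```

This makes it easy to confirm which locks dominate when comparing dumps from different runs.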
Re: Few problems seen in haproxy? (threads, connections).
Hi again,

finally I got rid of the FD lock for single-threaded accesses (most of
them), and based on Olivier's suggestion, I implemented a per-thread
wait queue, and cache-aligned some list heads to avoid undesired cache
line sharing. For me all of this combined resulted in a performance
increase of 25% on a 12-threads workload. I'm interested in your test
results, all of this is in the latest master.

If you still see LBPRM a lot, I can send you the experimental patch
to move the element inside the tree without unlinking/relinking it
and we can see if that provides any benefit or not (I'm not convinced).

Cheers,
Willy
Re: Few problems seen in haproxy? (threads, connections).
Hi Krishna,

On Thu, Oct 11, 2018 at 12:04:31PM +0530, Krishna Kumar (Engineering) wrote:
> I must say the improvements are pretty impressive!
>
> Earlier number reported with 24 processes: 519K
> Earlier number reported with 24 threads: 79K
> New RPS with system irq tuning, today's git, configuration changes, 24 threads: 353K
> Old code with same tuning gave: 290K

OK, that's much better, but I'm still horrified by the time taken in the
load balancing algorithm. I thought it could be fwlc_reposition(), which
contains an eb32_delete()+eb32_insert(), so I decided to replace this with
a new eb32_move() which moves the node within the tree, and it didn't
change anything here. Also I found that I cannot manage to reproduce such
a high time spent in this lock (300ms here, 58s for you).

There is one possible difference that might explain it: do you have a
maxconn setting on your servers? If so, is it possible that it's reached?
You can take a look at your stats page and see if the "Queue/Max" entry
for any backend is non-null. Indeed, I'm seeing that once a server is
saturated, we skip it for the next one. This part can be expensive.
Ideally we should remove such servers from the tree until they're
unblocked, but there is one special case making this difficult, which is
the dynamic limitation (minconn+maxconn+fullconn). However, I think we
could improve this so that only this use case would be affected and not
the other ones. I'm also seeing that this lock could be replaced by an
RW lock. But before taking a deeper look, I'm interested in verifying
that it's indeed the situation you're facing.

Thanks,
Willy
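The cost Willy describes can be illustrated with a toy model (this is not haproxy's actual fwlc code, just a sketch): when the least-loaded servers all sit at their maxconn limit, the picker has to walk past every saturated entry before finding an eligible one.

```python
def pick_leastconn(servers):
    """Toy leastconn picker: walk servers in ascending connection order,
    skipping any that already sit at their maxconn limit.
    `servers` is a list of (name, conns, maxconn) tuples (hypothetical)."""
    for name, conns, maxconn in sorted(servers, key=lambda s: s[1]):
        if conns < maxconn:
            return name
    return None  # all servers saturated -> the request would be queued

# With a low per-server maxconn, the least-loaded entries are often
# saturated, so the scan regularly skips several servers:
servers = [("s1", 100, 100), ("s2", 100, 100), ("s3", 120, 200)]
print(pick_leastconn(servers))  # skips saturated s1 and s2, picks s3
```

This is why removing saturated servers from the tree entirely, as suggested above, would help: the scan would never see them.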
Re: Few problems seen in haproxy? (threads, connections).
I must say the improvements are pretty impressive!

Earlier number reported with 24 processes: 519K
Earlier number reported with 24 threads: 79K
New RPS with system irq tuning, today's git, configuration changes, 24 threads: 353K
Old code with same tuning gave: 290K

My test machine is a 2-NUMA-node server: CPUs 0, 2, ..., 22 on node 0, and
1, 3, ..., 23 on node 1. That adds up to 24 cpus. The remaining 24 cores
are HT: node 0 has 24, 26, ..., 46 and node 1 has 25, 27, ..., 47. This
may explain why it scales well to 24. The 16 irqs of the NIC are pinned to
cpus 0, 2, ..., 30. Hoping performance further improves.

12 threads: 280K
16 threads: 318K
24 threads: 353K (occasional drop till 330K)
32 threads: 238K

I am attaching 2 text files of the lock metrics for 24 and 32 threads. A
vimdiff shows the differences nicely (fd, task_rq, task_wq, proxy, server,
lbprm and buf_wq increased significantly).

Thanks!

On Thu, Oct 11, 2018 at 8:53 AM Krishna Kumar (Engineering) < krishna...@flipkart.com> wrote:

> Thanks, will do that.
>
> On Thu, Oct 11, 2018 at 8:37 AM Willy Tarreau wrote:
>
>> On Thu, Oct 11, 2018 at 08:18:21AM +0530, Krishna Kumar (Engineering)
>> wrote:
>> > Hi Willy,
>> >
>> > Thank you very much for the in-depth analysis and configuration setting
>> > suggestions.
>> > I believe I have got the 3 key items to continue based on your mail:
>> >
>> > 1. Thread pinning
>> > 2. Fix system irq pinning accordingly
>> > 3. Listen on all threads.
>> >
>> > I will post the configuration changes and the results.
>>
>> By the way, please pull the latest master fixes. I've addressed two issues
>> there with locking :
>>   - one where the scheduler work was slightly too high, increasing the
>>     time spent on RQ lock
>>   - another one where I messed up on a fix, causing lock-free pools to be
>>     disabled (as seen in your output, where the POOL lock appears a lot)
>>
>> On some tests I've run here, I've found the stick-tables lock to be a
>> bottleneck when tracking is enabled.
>> I don't have a short-term solution to this, but looking at the code it's
>> obvious that it can significantly be improved (though it will take quite
>> some time). I'll probably at least try to replace it with an RW lock as
>> I think it could improve the situation.
>>
>> The FD lock is another one requiring some lift-up. I'm certain it's
>> possible, I just don't know if it will not degrade low-thread count
>> performance by using too many atomic ops instead. We'll have to
>> experiment.
>>
>> Cheers,
>> Willy

Stats about Lock FD:
# write lock : 407304162
# write unlock: 407304155 (-7)
# wait time for write : 39627.620 msec
# wait time for write/lock: 97.292 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_RQ:
# write lock : 2230051
# write unlock: 2230051 (0)
# wait time for write : 63163.277 msec
# wait time for write/lock: 28323.692 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_WQ:
# write lock : 14897430
# write unlock: 14897430 (0)
# wait time for write : 49136.313 msec
# wait time for write/lock: 3298.308 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock POOL:
# write lock : 0
# write unlock: 0 (0)
# wait time for write : 0.000 msec
# wait time for write/lock: 0.000 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LISTENER:
# write lock : 5500
# write unlock: 5500 (0)
# wait time for write : 0.076 msec
# wait time for write/lock: 13.734 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock PROXY:
# write lock : 7368276
# write unlock: 7368276 (0)
# wait time for write : 16768.394 msec
# wait time for write/lock: 2275.755 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock SERVER:
# write lock : 29505947
# write unlock: 29505947 (0)
# wait time for write : 38455.325 msec
# wait time for write/lock: 1303.308 nsec
# read lock : 0
# read unlock : 0 (0)
# wait time for read :
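Since the attached dumps are being compared by hand with vimdiff, a small throwaway parser (illustrative only, written against the dump format shown above) can turn each dump into a dict for programmatic diffing:

```python
import re

def parse_lock_stats(text):
    """Parse a 'Stats about Lock X:' dump (as shown above) into
    {lock_name: {field: value}}; a rough sketch for diffing two runs."""
    stats, current = {}, None
    for line in text.splitlines():
        m = re.match(r"Stats about Lock (\w+):", line.strip())
        if m:
            current = stats.setdefault(m.group(1), {})
            continue
        m = re.match(r"#\s*(.+?)\s*:\s*([\d.]+)", line.strip())
        if m and current is not None:
            current[m.group(1)] = float(m.group(2))
    return stats
```

Two parsed dumps can then be compared field by field instead of eyeballing a vimdiff.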
Re: Few problems seen in haproxy? (threads, connections).
On Thu, Oct 11, 2018 at 08:18:21AM +0530, Krishna Kumar (Engineering) wrote:
> Hi Willy,
>
> Thank you very much for the in-depth analysis and configuration setting
> suggestions.
> I believe I have got the 3 key items to continue based on your mail:
>
> 1. Thread pinning
> 2. Fix system irq pinning accordingly
> 3. Listen on all threads.
>
> I will post the configuration changes and the results.

By the way, please pull the latest master fixes. I've addressed two issues
there with locking :
  - one where the scheduler work was slightly too high, increasing the
    time spent on RQ lock
  - another one where I messed up on a fix, causing lock-free pools to be
    disabled (as seen in your output, where the POOL lock appears a lot)

On some tests I've run here, I've found the stick-tables lock to be a
bottleneck when tracking is enabled. I don't have a short-term solution
to this, but looking at the code it's obvious that it can significantly
be improved (though it will take quite some time). I'll probably at least
try to replace it with an RW lock as I think it could improve the
situation.

The FD lock is another one requiring some lift-up. I'm certain it's
possible, I just don't know if it will not degrade low-thread count
performance by using too many atomic ops instead. We'll have to experiment.

Cheers,
Willy
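The RW-lock idea mentioned above can be sketched as follows (a minimal illustrative reader-writer lock in Python, not haproxy's C implementation): many readers may hold the lock concurrently, while a writer excludes everyone.

```python
import threading

class RWLock:
    """Minimal writer-exclusive, reader-shared lock (illustration only)."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of active readers
        self._writer = False    # whether a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:             # readers only wait for writers
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()     # wake a pending writer

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()           # writers wait for everyone
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

The benefit for something like the stick-tables lock is that lookups (reads) no longer serialize against each other, only against updates.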
Re: Few problems seen in haproxy? (threads, connections).
Hi Willy, Thank you very much for the in-depth analysis and configuration setting suggestions. I believe I have got the 3 key items to continue based on your mail: 1. Thread pinning 2. Fix system irq pinning accordingly 3. Listen on all threads. I will post the configuration changes and the results. Regards, - Krishna On Wed, Oct 10, 2018 at 6:39 PM Willy Tarreau wrote: > Hi Krishna, > > On Tue, Oct 02, 2018 at 09:18:19PM +0530, Krishna Kumar (Engineering) > wrote: > (...) > > 1. HAProxy system: > > Kernel: 4.17.13, > > CPU: 48 core E5-2670 v3 > > Memory: 128GB memory > > NIC: Mellanox 40g with IRQ pinning > > > > 2. Client, 48 core similar to server. Test command line: > > wrk -c 4800 -t 48 -d 30s http:///128 > > > > 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout > as > > of > > Oct 2nd). > > # haproxy-git -vv > > HA-Proxy version 1.9-dev3 2018/09/29 > (...) > > 4. HAProxy results for #processes and #threads > > #Threads-RPS Procs-RPS > > 1 20903 19280 > > 2 46400 51045 > > 4 96587 142801 > > 8 172224 254720 > > 12 210451 437488 > > 16 173034 437375 > > 24 79069 519367 > > 32 55607 586367 > > 48 31739 596148 > > Our largest thread test was on 12 cores and it happens that in your case > it's also the optimal one. > > However I do have some comments about your config, before going back to > real thread issues : > > > # cpu-map auto:1/1-48 0-39 > => you must absolutely pin your processes, and they must be pinned > to cores *not* shared with the network card. That's critical. > Moreover it's also important that threads are not split across > multiple physical CPUs because the remote L3 cache access time > over QPI/UPI is terrible. When you run on 12 threads with two > 12-cores/24-threads CPUs, you could very well have haproxy using > 12 threads from 6 cores, and the NIC using 12 threads from the > other 6 cores of the same physical CPU. The second socket is, > as usual, useless for anything requiring low latency. 
However > it's perfect to run SSL. So you could be interested in testing > if running the NIC on one socket (try to figure what node the > PCIe lanes are physically connected to), and haproxy on the other > one. It *could* be possible that you get more performance from 12 > cores of each but I strongly doubt it based on a number of tests. > If you use SSL however it's different as you will benefit from > lots of cores much more than low latency. > > > bind :80 process 1/1-48 > => it's also capital for scalability to have individual bind lines. Here > you have a single socket accessed from all 48 threads. There's no > efficient thread load balancing here. By having this : > > bind :80 process 1/1 > bind :80 process 1/2 > ... > bind :80 process 1/47 > bind :80 process 1/48 > > You will let the kernel perform the load balancing and distribute a > fair load to all threads. This way none of them will risk to pick a > larger share of the incoming connections than optimal. I know it's > annoying to configure at the moment, I've been thinking about having > a way to automatically iterate from a single config line (like the > "auto" feature of cpu-map), but for now it's not done. > > Now back to the thread measurements : > > > 5. Lock stats for 1.9-dev3: Some write locks on average took a lot more > time > >to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get: > > Stats about Lock FD: > > # write lock : 143933900 > > # write unlock: 143933895 (-5) > > # wait time for write : 11370.245 msec > > This one definitely is huge. We know some work is still needed on this lock > and that there are still a few low hanging fruits but not much savings to > expect short term. This output is very revealing however of the importance > of this lock. > > > # wait time for write/lock: 78.996 nsec > > That's roughly the time it takes to access the other CPU's cache, so using > your two sockets for the same process definitely hurts a lot here. 
> > > Stats about Lock TASK_RQ: > > # write lock : 2062874 > > # write unlock: 2062875 (1) > > # wait time for write : 7820.234 msec > > This one is still far too large for what we'd hope, even though it > has significantly shrunk since 1.8. It could be related to the poor > distribution of the incoming connections across threads. > > > # wait time for write/lock: 3790.941 nsec > > Wow, 3.8 microseconds to acquire the write lock is a lot! I'm starting > to suspect some longer tree walks than expected. Again, lack of fairness > between threads can make this significantly worse than it should. > > > Stats about Lock TASK_WQ: > > # write lock : 2601227 > > # write unlock: 2601227 (0) > > # wait time for write : 5019.811 msec > > # wait time for write/lock: 1929.786 nsec > > For this one I had an idea that could significantly improve the situation. > It will unlikely be for 1.9
Re: Few problems seen in haproxy? (threads, connections).
Hi Krishna,

On Tue, Oct 02, 2018 at 09:18:19PM +0530, Krishna Kumar (Engineering) wrote:
(...)
> 1. HAProxy system:
>    Kernel: 4.17.13,
>    CPU: 48 core E5-2670 v3
>    Memory: 128GB memory
>    NIC: Mellanox 40g with IRQ pinning
>
> 2. Client, 48 core similar to server. Test command line:
>    wrk -c 4800 -t 48 -d 30s http:///128
>
> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout as of
>    Oct 2nd).
>    # haproxy-git -vv
>    HA-Proxy version 1.9-dev3 2018/09/29
(...)
> 4. HAProxy results for #processes and #threads
>    #Threads-RPS  Procs-RPS
>     1   20903    19280
>     2   46400    51045
>     4   96587   142801
>     8  172224   254720
>    12  210451   437488
>    16  173034   437375
>    24   79069   519367
>    32   55607   586367
>    48   31739   596148

Our largest thread test was on 12 cores and it happens that in your case
it's also the optimal one.

However I do have some comments about your config, before going back to
real thread issues :

> # cpu-map auto:1/1-48 0-39

=> you must absolutely pin your processes, and they must be pinned
   to cores *not* shared with the network card. That's critical.
   Moreover it's also important that threads are not split across
   multiple physical CPUs because the remote L3 cache access time
   over QPI/UPI is terrible. When you run on 12 threads with two
   12-cores/24-threads CPUs, you could very well have haproxy using
   12 threads from 6 cores, and the NIC using 12 threads from the
   other 6 cores of the same physical CPU. The second socket is,
   as usual, useless for anything requiring low latency. However
   it's perfect to run SSL. So you could be interested in testing
   if running the NIC on one socket (try to figure what node the
   PCIe lanes are physically connected to), and haproxy on the
   other one. It *could* be possible that you get more performance
   from 12 cores of each but I strongly doubt it based on a number
   of tests. If you use SSL however it's different as you will
   benefit from lots of cores much more than low latency.
> bind :80 process 1/1-48

=> it's also capital for scalability to have individual bind lines. Here
   you have a single socket accessed from all 48 threads. There's no
   efficient thread load balancing here. By having this :

   bind :80 process 1/1
   bind :80 process 1/2
   ...
   bind :80 process 1/47
   bind :80 process 1/48

   You will let the kernel perform the load balancing and distribute a
   fair load to all threads. This way none of them will risk to pick a
   larger share of the incoming connections than optimal. I know it's
   annoying to configure at the moment, I've been thinking about having
   a way to automatically iterate from a single config line (like the
   "auto" feature of cpu-map), but for now it's not done.

Now back to the thread measurements :

> 5. Lock stats for 1.9-dev3: Some write locks on average took a lot more time
>    to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:
> Stats about Lock FD:
> # write lock : 143933900
> # write unlock: 143933895 (-5)
> # wait time for write : 11370.245 msec

This one definitely is huge. We know some work is still needed on this lock
and that there are still a few low hanging fruits but not much savings to
expect short term. This output is very revealing however of the importance
of this lock.

> # wait time for write/lock: 78.996 nsec

That's roughly the time it takes to access the other CPU's cache, so using
your two sockets for the same process definitely hurts a lot here.

> Stats about Lock TASK_RQ:
> # write lock : 2062874
> # write unlock: 2062875 (1)
> # wait time for write : 7820.234 msec

This one is still far too large for what we'd hope, even though it
has significantly shrunk since 1.8. It could be related to the poor
distribution of the incoming connections across threads.

> # wait time for write/lock: 3790.941 nsec

Wow, 3.8 microseconds to acquire the write lock is a lot! I'm starting
to suspect some longer tree walks than expected. Again, lack of fairness
between threads can make this significantly worse than it should.
> Stats about Lock TASK_WQ:
> # write lock : 2601227
> # write unlock: 2601227 (0)
> # wait time for write : 5019.811 msec
> # wait time for write/lock: 1929.786 nsec

For this one I had an idea that could significantly improve the situation.
It will unlikely be for 1.9 though.

> Stats about Lock POOL:
> # write lock : 2823393
> # write unlock: 2823393 (0)
> # wait time for write : 11984.706 msec

Interesting. Very interesting... I thought we merged the lock-free pool
code. Maybe I was mistaken. Or I misunderstood something about the nature
of the changes that went in early during the 1.9-dev cycle. This one is
not expected to be hard to address, and I even see how we could go further
with a small thread-local cache since I developed such a thing a few years
ago (though it was over-engineered) that could easily be retrofitted here.

> # wait time for
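The "small thread-local cache" idea mentioned above can be sketched as follows (an illustrative Python model of a pool with a per-thread free list in front of a shared, locked pool; haproxy's actual pools are in C and differ in detail):

```python
import threading

class PooledAlloc:
    """Sketch: per-thread free-list cache in front of a shared locked pool.
    Most get/put calls hit the thread-local list and take no lock at all."""
    def __init__(self, factory, cache_size=16):
        self._factory = factory          # creates a fresh object on miss
        self._shared = []                # shared pool, protected by _lock
        self._lock = threading.Lock()
        self._local = threading.local()  # holds each thread's free list
        self._cache_size = cache_size

    def _cache(self):
        if not hasattr(self._local, "items"):
            self._local.items = []
        return self._local.items

    def get(self):
        cache = self._cache()
        if cache:                        # fast path: no lock taken
            return cache.pop()
        with self._lock:                 # slow path: refill from shared pool
            if self._shared:
                return self._shared.pop()
        return self._factory()           # pool empty: allocate a new object

    def put(self, obj):
        cache = self._cache()
        if len(cache) < self._cache_size:
            cache.append(obj)            # fast path: thread-local return
        else:
            with self._lock:             # overflow back to the shared pool
                self._shared.append(obj)
```

The point is that the POOL lock is only touched when a thread's local cache under- or overflows, which turns most pool traffic into lock-free operations.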
Re: Few problems seen in haproxy? (threads, connections).
Hi Krishna,

On Fri, Oct 05, 2018 at 02:25:13PM +0530, Krishna Kumar (Engineering) wrote:
> Sorry for repeating once again, but this is my last unsolicited
> mail on this topic. Any directions for what to look out for?

Sorry, but I didn't even have the time to read your mail over the last
two days and still have a huge backlog pending. Maybe next week.

Regards,
Willy
Re: Few problems seen in haproxy? (threads, connections).
Sorry for repeating once again, but this is my last unsolicited mail on
this topic. Any directions for what to look out for?

Thanks,
- Krishna

On Thu, Oct 4, 2018 at 8:42 AM Krishna Kumar (Engineering) < krishna...@flipkart.com> wrote:

> Re-sending in case this mail was missed. To summarise the 3 issues seen:
>
> 1. Performance drops 18x with higher number of nbthreads as compared to
>    nbprocs.
> 2. CPU utilisation remains at 100% after wrk finishes for 30 seconds (for
>    1.9-dev3 for nbprocs and nbthreads).
> 3. Sockets on client remain in FIN-WAIT-2, while on HAProxy it remains in
>    either CLOSE-WAIT (towards clients) and ESTAB (towards the backend
>    servers), till the server/client timeout expires.
>
> The tests for threads and processes were done on the same systems, so
> there is no difference in system parameters.
>
> Thanks,
> - Krishna
>
> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) < krishna...@flipkart.com> wrote:
>
>> Hi Willy, and community developers,
>>
>> I am not sure if I am doing something wrong, but wanted to report
>> some issues that I am seeing. Please let me know if this is a problem.
>>
>> 1. HAProxy system:
>>    Kernel: 4.17.13,
>>    CPU: 48 core E5-2670 v3
>>    Memory: 128GB memory
>>    NIC: Mellanox 40g with IRQ pinning
>>
>> 2. Client, 48 core similar to server. Test command line:
>>    wrk -c 4800 -t 48 -d 30s http:///128
>>
>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout
>>    as of Oct 2nd).
>> # haproxy-git -vv >> HA-Proxy version 1.9-dev3 2018/09/29 >> Copyright 2000-2018 Willy Tarreau >> >> Build options : >> TARGET = linux2628 >> CPU = generic >> CC = gcc >> CFLAGS = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement >> -fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare >> -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers >> -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits >> OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1 >> >> Default settings : >> maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200 >> >> Built with OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016 >> Running on OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016 >> OpenSSL library supports TLS extensions : yes >> OpenSSL library supports SNI : yes >> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 >> Built with transparent proxy support using: IP_TRANSPARENT >> IPV6_TRANSPARENT IP_FREEBIND >> Encrypted password support via crypt(3): yes >> Built with multi-threading support. >> Built with PCRE version : 8.38 2015-11-23 >> Running on PCRE version : 8.38 2015-11-23 >> PCRE library supports JIT : no (USE_PCRE_JIT not set) >> Built with zlib version : 1.2.8 >> Running on zlib version : 1.2.8 >> Compression algorithms supported : identity("identity"), >> deflate("deflate"), raw-deflate("deflate"), gzip("gzip") >> Built with network namespace support. >> >> Available polling systems : >> epoll : pref=300, test result OK >>poll : pref=200, test result OK >> select : pref=150, test result OK >> Total: 3 (3 usable), will use epoll. >> >> Available multiplexer protocols : >> (protocols markes as cannot be specified using 'proto' keyword) >> h2 : mode=HTTP side=FE >> : mode=TCP|HTTP side=FE|BE >> >> Available filters : >> [SPOE] spoe >> [COMP] compression >> [TRACE] trace >> >> 4. 
HAProxy results for #processes and #threads >> #Threads-RPS Procs-RPS >> 1 20903 19280 >> 2 46400 51045 >> 4 96587 142801 >> 8 172224 254720 >> 12 210451 437488 >> 16 173034 437375 >> 24 79069 519367 >> 32 55607 586367 >> 48 31739 596148 >> >> 5. Lock stats for 1.9-dev3: Some write locks on average took a lot more >> time >>to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get: >> Stats about Lock FD: >> # write lock : 143933900 >> # write unlock: 143933895 (-5) >> # wait time for write : 11370.245 msec >> # wait time for write/lock: 78.996 nsec >> # read lock : 0 >> # read unlock : 0 (0) >> # wait time for read : 0.000 msec >> # wait time for read/lock : 0.000 nsec >> Stats about Lock TASK_RQ: >> # write lock : 2062874 >> # write unlock: 2062875 (1) >> # wait time for write : 7820.234 msec >> # wait time for write/lock: 3790.941 nsec >> # read lock : 0 >> # read unlock : 0 (0) >> # wait time for read : 0.000 msec >> # wait time for read/lock : 0.000 nsec >> Stats about Lock TASK_WQ: >> # write lock : 2601227 >> # write unlock: 2601227 (0) >> # wait time for write : 5019.811 msec >> # wait time for write/lock: 1929.786 nsec >> # read lock : 0 >> # read unlock : 0 (0) >> # wait time for read : 0.000 msec >> # wait time for read/lock : 0.000 nsec >> Stats about Lock POOL: >> # write lock : 2823393 >> # write unlock: 2823393 (0) >> # wait time for write : 11984.706 msec >> # wait time for write/lock: 4244.788 nsec >> # read lock : 0 >> # read unlock : 0 (0) >> # wait time for
Re: Few problems seen in haproxy? (threads, connections).
Thanks, will take a look! On Thu, Oct 4, 2018 at 12:58 PM Илья Шипицин wrote: > what I going to try (when I will have some spare time) is sampling with > google perftools > > https://github.com/gperftools/gperftools > > they are great in cpu profiling. > you can try them youself if you have time/wish :) > > > чт, 4 окт. 2018 г. в 11:53, Krishna Kumar (Engineering) < > krishna...@flipkart.com>: > >> 1. haproxy config: Same as given above (both processes and threads were >> given in the mail) >> 2. nginx: default, no changes. >> 3. sysctl's: nothing set. All changes as described earlier (e.g. >> irqbalance, irq pinning, etc). >> 4. nf_conntrack: disabled >> 5. dmesg: no messages. >> >> With the same system and settings, threads gives 18x lesser RPS than >> processes, along with >> the other 2 issues given in my mail today. >> >> >> On Thu, Oct 4, 2018 at 12:09 PM Илья Шипицин >> wrote: >> >>> haproxy config, nginx config >>> non default sysctl (if any) >>> >>> as a side note, can you have a look at "dmesg" output ? do you have nf >>> conntrack enabled ? what are its limits ? >>> >>> чт, 4 окт. 2018 г. в 9:59, Krishna Kumar (Engineering) < >>> krishna...@flipkart.com>: >>> Sure. 1. Client: Use one of the following two setup's: - a single baremetal (48 core, 40g) system Run: "wrk -c 4800 -t 48 -d 30s http://:80/128", or, - 100 2 core vm's. Run "wrk -c 16 -t 2 -d 30s http://:80/128" from each VM and summarize the results using some parallel-ssh setup. 2. HAProxy running on a single baremetal (same system config as client - 48 core, 40g, 4.17.13 kernel, irq tuned to use different cores of the same NUMA node for each irq, kill irqbalance, with haproxy configuration file as given in my first mail. Around 60 backend servers are configured in haproxy. 3. Backend servers are 2 core VM's running nginx and serving a file called "/128", which is 128 bytes in size. Let me know if you need more information. 
Thanks, - Krishna On Thu, Oct 4, 2018 at 10:21 AM Илья Шипицин wrote: > load testing is somewhat good. > can you describe an overall setup ? (I want to reproduce and play with > it) > > чт, 4 окт. 2018 г. в 8:16, Krishna Kumar (Engineering) < > krishna...@flipkart.com>: > >> Re-sending in case this mail was missed. To summarise the 3 issues >> seen: >> >> 1. Performance drops 18x with higher number of nbthreads as compared >> to nbprocs. >> 2. CPU utilisation remains at 100% after wrk finishes for 30 seconds >> (for 1.9-dev3 >> for nbprocs and nbthreads). >> 3. Sockets on client remain in FIN-WAIT-2, while on HAProxy it >> remains in either >> CLOSE-WAIT (towards clients) and ESTAB (towards the backend >> servers), till >> the server/client timeout expires. >> >> The tests for threads and processes were done on the same systems, so >> there is >> no difference in system parameters. >> >> Thanks, >> - Krishna >> >> >> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) < >> krishna...@flipkart.com> wrote: >> >>> Hi Willy, and community developers, >>> >>> I am not sure if I am doing something wrong, but wanted to report >>> some issues that I am seeing. Please let me know if this is a >>> problem. >>> >>> 1. HAProxy system: >>> Kernel: 4.17.13, >>> CPU: 48 core E5-2670 v3 >>> Memory: 128GB memory >>> NIC: Mellanox 40g with IRQ pinning >>> >>> 2. Client, 48 core similar to server. Test command line: >>> wrk -c 4800 -t 48 -d 30s http:///128 >>> >>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git >>> checkout as of >>> Oct 2nd). 
>>> # haproxy-git -vv >>> HA-Proxy version 1.9-dev3 2018/09/29 >>> Copyright 2000-2018 Willy Tarreau >>> >>> Build options : >>> TARGET = linux2628 >>> CPU = generic >>> CC = gcc >>> CFLAGS = -O2 -g -fno-strict-aliasing >>> -Wdeclaration-after-statement -fwrapv -fno-strict-overflow >>> -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter >>> -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered >>> -Wno-missing-field-initializers -Wtype-limits >>> OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1 >>> >>> Default settings : >>> maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents >>> = 200 >>> >>> Built with OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016 >>> Running on OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016 >>> OpenSSL library supports TLS extensions : yes >>> OpenSSL library supports SNI : yes >>> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 >>> Built with transparent proxy support
Re: Few problems seen in haproxy? (threads, connections).
What I'm going to try (when I have some spare time) is sampling with google perftools (https://github.com/gperftools/gperftools); they are great at CPU profiling. You can try them yourself if you have the time/wish :)
Re: Few problems seen in haproxy? (threads, connections).
1. haproxy config: same as given above (both the processes and threads variants were given in the mail).
2. nginx: default, no changes.
3. sysctls: nothing set. All changes as described earlier (e.g. irqbalance, irq pinning, etc).
4. nf_conntrack: disabled.
5. dmesg: no messages.

With the same system and settings, threads give 18x lower RPS than processes, along with the other 2 issues given in my mail today.
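For reference, a minimal sketch of the two configurations being compared in this thread (the real config was posted in the first mail; the bind address, server addresses, timeouts and core numbers here are placeholders, and nbthread 12 simply matches the 12 non-hyperthreaded cores of one socket mentioned later):

```
global
    # Threads variant: one process, 12 threads, each pinned to a core.
    nbthread 12
    cpu-map auto:1/1-12 0-11
    maxconn 1000000

    # Processes variant (the one that scales in the tests): 12 single-
    # threaded processes instead.
    # nbproc 12
    # cpu-map auto:1-12 0-11

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend fe
    bind :80
    default_backend be

backend be
    balance leastconn
    default-server maxconn 100
    server srv1 10.0.0.1:80
    server srv2 10.0.0.2:80
```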
Re: Few problems seen in haproxy? (threads, connections).
haproxy config, nginx config, non-default sysctls (if any).

As a side note, can you have a look at the "dmesg" output? Do you have nf_conntrack enabled? What are its limits?
Re: Few problems seen in haproxy? (threads, connections).
Sure.

1. Client: use one of the following two setups:
   - a single baremetal (48 core, 40g) system;
     run: "wrk -c 4800 -t 48 -d 30s http://:80/128", or,
   - 100 2-core VMs;
     run "wrk -c 16 -t 2 -d 30s http://:80/128" from each VM and
     summarize the results using some parallel-ssh setup.

2. HAProxy running on a single baremetal (same system config as the client: 48 core, 40g, 4.17.13 kernel, irqs tuned to use different cores of the same NUMA node for each irq, irqbalance killed), with the haproxy configuration file as given in my first mail. Around 60 backend servers are configured in haproxy.

3. Backend servers are 2-core VMs running nginx and serving a file called "/128", which is 128 bytes in size.

Let me know if you need more information.
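For the 100-VM case, the per-VM wrk outputs can be merged with a small script. This is a hypothetical sketch (the parallel-ssh collection itself is assumed); it just sums the "Requests/sec:" lines that wrk prints:

```python
import re

def total_rps(wrk_outputs):
    """Sum the Requests/sec figures from a list of wrk output strings."""
    total = 0.0
    for out in wrk_outputs:
        m = re.search(r"Requests/sec:\s*([\d.]+)", out)
        if m:
            total += float(m.group(1))
    return total

# Example with two fabricated wrk outputs:
outputs = [
    "Running 30s test\nRequests/sec:  2100.50\nTransfer/sec: 1.2MB",
    "Running 30s test\nRequests/sec:  1899.50\nTransfer/sec: 1.1MB",
]
print(total_rps(outputs))  # 4000.0
```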
Re: Few problems seen in haproxy? (threads, connections).
Load testing is somewhat good. Can you describe the overall setup? (I want to reproduce and play with it.)
Re: Few problems seen in haproxy? (threads, connections).
Re-sending in case this mail was missed. To summarise, the 3 issues seen:

1. Performance drops 18x with higher numbers of nbthreads as compared to nbprocs.
2. CPU utilisation remains at 100% for 30 seconds after wrk finishes (for 1.9-dev3, for both nbprocs and nbthreads).
3. Sockets on the client remain in FIN-WAIT-2, while on HAProxy they remain in either CLOSE-WAIT (towards clients) or ESTAB (towards the backend servers), till the server/client timeout expires.

The tests for threads and processes were done on the same systems, so there is no difference in system parameters.

Thanks,
- Krishna
Few problems seen in haproxy? (threads, connections).
Hi Willy, and community developers,

I am not sure if I am doing something wrong, but wanted to report some issues that I am seeing. Please let me know if this is a problem.

1. HAProxy system:
   Kernel: 4.17.13
   CPU: 48 core E5-2670 v3
   Memory: 128GB
   NIC: Mellanox 40g with IRQ pinning

2. Client: 48 core, similar to the server. Test command line:
   wrk -c 4800 -t 48 -d 30s http:///128

3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout as of Oct 2nd).

# haproxy-git -vv
HA-Proxy version 1.9-dev3 2018/09/29
Copyright 2000-2018 Willy Tarreau

Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
  OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g 1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols markes as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTTP  side=FE
       <default> : mode=TCP|HTTP  side=FE|BE

Available filters :
    [SPOE] spoe
    [COMP] compression
    [TRACE] trace

4. HAProxy results for #processes and #threads (RPS):

    #   Threads-RPS   Procs-RPS
    1         20903       19280
    2         46400       51045
    4         96587      142801
    8        172224      254720
   12        210451      437488
   16        173034      437375
   24         79069      519367
   32         55607      586367
   48         31739      596148

5. Lock stats for 1.9-dev3: some write locks on average took a lot more time to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:

Stats about Lock FD:
  # write lock  : 143933900
  # write unlock: 143933895 (-5)
  # wait time for write     : 11370.245 msec
  # wait time for write/lock: 78.996 nsec
  # read lock   : 0
  # read unlock : 0 (0)
  # wait time for read      : 0.000 msec
  # wait time for read/lock : 0.000 nsec
Stats about Lock TASK_RQ:
  # write lock  : 2062874
  # write unlock: 2062875 (1)
  # wait time for write     : 7820.234 msec
  # wait time for write/lock: 3790.941 nsec
  # read lock   : 0
  # read unlock : 0 (0)
  # wait time for read      : 0.000 msec
  # wait time for read/lock : 0.000 nsec
Stats about Lock TASK_WQ:
  # write lock  : 2601227
  # write unlock: 2601227 (0)
  # wait time for write     : 5019.811 msec
  # wait time for write/lock: 1929.786 nsec
  # read lock   : 0
  # read unlock : 0 (0)
  # wait time for read      : 0.000 msec
  # wait time for read/lock : 0.000 nsec
Stats about Lock POOL:
  # write lock  : 2823393
  # write unlock: 2823393 (0)
  # wait time for write     : 11984.706 msec
  # wait time for write/lock: 4244.788 nsec
  # read lock   : 0
  # read unlock : 0 (0)
  # wait time for read      : 0.000 msec
  # wait time for read/lock : 0.000 nsec
Stats about Lock LISTENER:
  # write lock  : 184
  # write unlock: 184 (0)
  # wait time for write     : 0.011 msec
  # wait time for write/lock: 60.554 nsec
  # read lock   : 0
  # read unlock : 0 (0)
  # wait time for read      : 0.000 msec
  # wait time for read/lock : 0.000 nsec
Stats about Lock PROXY:
  # write lock  : 291557
  # write unlock: 291557 (0)
  # wait time for write     : 109.694 msec
  # wait time for write/lock: 376.235 nsec
  # read lock   : 0
  # read unlock : 0 (0)
  # wait time for read      : 0.000 msec
  # wait time for read/lock : 0.000 nsec
Stats about Lock SERVER:
  # write lock  : 1188511
  # write unlock: 1188511 (0)
  # wait time for write     : 854.171 msec
  # wait time for write/lock: 718.690 nsec
  # read lock   : 0
  # read unlock : 0 (0)
  # wait time for read      : 0.000 msec
  # wait time for read/lock : 0.000 nsec
Stats about Lock LBPRM:
  # write lock  : 1184709
  # write unlock: 1184709 (0)
  # wait time for write     : 778.947 msec
  # wait time for write/lock: 657.501 nsec
  # read lock   : 0
  # read unlock : 0 (0)
  # wait time for read      : 0.000 msec
  # wait time for read/lock : 0.000 nsec
Stats about Lock BUF_WQ:
  # write lock  : 669247
  # write unlock: 669247 (0)
  # wait time for write     : 252.265 msec
  # wait time for write/lock: 376.939 nsec
  # read lock
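The per-lock averages in these dumps can be recomputed from the raw counters (total wait time for write divided by write lock count). A small hypothetical parser for the "Stats about Lock NAME:" layout shown above:

```python
import re

def wait_per_write_lock(stats_text):
    """Return {lock_name: average write-wait in nsec} from haproxy lock-stats text."""
    result = {}
    # Split on the block headers; a capturing group keeps the lock names.
    blocks = re.split(r"Stats about Lock (\w+):", stats_text)[1:]
    for name, body in zip(blocks[0::2], blocks[1::2]):
        locks = int(re.search(r"# write lock\s*:\s*(\d+)", body).group(1))
        msec = float(re.search(r"# wait time for write\s*:\s*([\d.]+) msec", body).group(1))
        if locks:
            result[name] = msec * 1e6 / locks  # msec total -> nsec per lock
    return result

# Two blocks copied from the stats above (trimmed to the fields used here):
sample = """
Stats about Lock FD:
  # write lock  : 143933900
  # wait time for write     : 11370.245 msec
Stats about Lock POOL:
  # write lock  : 2823393
  # wait time for write     : 11984.706 msec
"""
for name, nsec in wait_per_write_lock(sample).items():
    print(f"{name}: {nsec:.3f} nsec/lock")
```

Running this reproduces the per-lock figures reported in the dump (about 79 nsec for FD, about 4245 nsec for POOL).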