Hi Yann,

Yes, without SO_REUSEPORT each child accepts connections from a single listening socket only. To address the situation of imbalanced traffic among different sockets/listen statements, the patch makes each bucket do its own idle server maintenance. For example, if we have two listen statements defined, one very busy and the other almost idle, the patch creates two buckets, each listening on one IP:port. The busy bucket would end up with many children, while the idle bucket would only maintain the minimum number of children, which is equal to 1/2 of the minimum idle servers (MinSpareServers) defined in httpd.conf.
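To make the idea concrete, here is a minimal sketch of the per-bucket decision (simplified and illustrative only; the function, the names and the exact division below are just the idea described above, not code taken from the patch):

/* Illustrative sketch only (not the actual patch code): the decision a
 * bucket's idle-maintenance pass would make, given its own idle-child
 * count and the global MinSpareServers/MaxSpareServers split per bucket. */
typedef enum {
    BUCKET_DO_NOTHING,
    BUCKET_KILL_ONE_IDLE,
    BUCKET_FORK_ONE
} bucket_action;

static bucket_action bucket_idle_maintenance(int idle_count,
                                             int min_spare_servers,
                                             int max_spare_servers,
                                             int num_buckets)
{
    /* Each bucket maintains its own share of the configured limits, e.g.
     * with 2 buckets the idle bucket keeps MinSpareServers/2 children. */
    int min_spare = min_spare_servers / num_buckets;
    int max_spare = max_spare_servers / num_buckets;

    if (min_spare < 1)
        min_spare = 1;              /* every bucket keeps at least one child */
    if (max_spare <= min_spare)
        max_spare = min_spare + 1;

    if (idle_count < min_spare)
        return BUCKET_FORK_ONE;     /* spawn a child for this bucket's listener */
    if (idle_count > max_spare)
        return BUCKET_KILL_ONE_IDLE;
    return BUCKET_DO_NOTHING;
}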
Thanks,
Yingqi

From: Yann Ylavic [mailto:[email protected]]
Sent: Thursday, March 06, 2014 5:49 AM
To: httpd
Subject: Re: [PATCH ASF bugzilla# 55897] prefork_mpm patch with SO_REUSEPORT support

On Wed, Mar 5, 2014 at 6:38 PM, Lu, Yingqi <[email protected]> wrote:

1. If I understand correctly (please correct me if not), do you suggest duplicating the listen sockets inside the child process with SO_REUSEPORT enabled? Yes, I agree this would be a cleaner implementation, and I actually tried that before. However, I encountered "connection reset" errors because the number of child processes changes over time. I googled and found this is actually discussed at http://lwn.net/Articles/542629/.

Actually I found that article too, but expected the "defect" to have been solved since then... This looks like a thorn in the side of MPMs in general, but I couldn't find any pointer to a fix; do you know if there is some progress on this in the latest Linux kernels?

For testing purposes (until then?), you could also configure MPM prefork not to create/terminate child processes once started (using the same value for StartServers and ServerLimit, still MaxRequestsPerChild 0). It could be interesting to see how SO_REUSEPORT scales in these optimal conditions (no lock, full OS round-robin on all listeners). For this you would have to use your former patch (duplicating listeners in each child process), and do nothing in SAFE_ACCEPT when HAVE_SO_REUSEPORT.

Also, SO_REUSEPORT exists on (and even comes from) FreeBSD if I am not mistaken, but it seems that there is no round-robin guarantee for it there, is there? Could this patch also take advantage of BSD's SO_REUSEPORT implementation?

2. Then, I decided to do the socket duplication in the parent process. The goal of this change is to extend CPU thread scalability on systems with a large thread count. Therefore, I simply defined number_of_listen_buckets = total_number_active_thread / 8, and each listen bucket has a dedicated listener. I do not want to over-duplicate the socket; otherwise it would create too many child processes at startup, since each listen bucket needs at least one child process to start with. However, this is only my understanding and it may not be correct or complete. If you have other ideas, please share them with us. Feedback and comments are very welcome here :)

The listener buckets make sense with SO_REUSEPORT given the defect; I hope this is temporary.

3. I am struggling with myself as well on whether we should put the with- and without-SO_REUSEPORT cases into two different patches. The only reason I put them together is that they both use the concept of listen buckets. If you think it would make more sense to separate them into two patches, I can certainly do that. Also, I am a little bit confused about your comment "On the other hand, each child is dedicated, won't one have to multiply the configured ServerLimit by the number of Listen to achieve the same (maximum theoretical) scalability with regard to all the listeners?". Can you please explain that a little bit more? Really appreciate it.

Sorry to have not been clear enough (nay, at all). I'm referring to the following code. In prefork.c::make_child(), each child is assigned a listener like this (before fork()ing):

child_listen = mpm_listen[bucket[slot]];

and then each child will use child_listen as its listeners list. The duplicated listeners array (mpm_listen) is built by the following (new) function:

/* This function is added for the patch.
 * This function duplicates open_listeners, alloc_listener() and re-calls
 * make_sock() for the duplicated listeners. In this function, the newly
 * created sockets will bind and listen. */
AP_DECLARE(apr_status_t) ap_post_config_listeners(server_rec *s, apr_pool_t *p,
                                                  int num_buckets)
{
    mpm_listen = apr_palloc(p, sizeof(ap_listen_rec*) * num_buckets);
    int i;
    ap_listen_rec *lr;

    /* duplicate from alloc_listener() for the additional listen record */
    lr = ap_listeners;
    for (i = 0; i < num_buckets; i++) {
#ifdef HAVE_SO_REUSEPORT
        ap_listen_rec *templr;
        ap_listen_rec *last = NULL;
        while (lr) {
            templr = ap_duplicate_listener(p, lr);
            ap_apply_accept_filter(p, templr, s);
            if (last == NULL) {
                mpm_listen[i] = last = templr;
            }
            else {
                last->next = templr;
                last = templr;
            }
            lr = lr->next;
        }
        lr = ap_listeners;
#else
        mpm_listen[i] = ap_duplicate_listener(p, lr);
        lr = (lr->next) ? lr->next : ap_listeners;
#endif
    }
    return APR_SUCCESS;
}

Since ap_duplicate_listener() will duplicate a single (unlinked) listener, my understanding is that:

- with SO_REUSEPORT: each child will use all the listeners (the whole list is duplicated per bucket),
- without SO_REUSEPORT: each child will use a single listener (one per bucket, although multiple children will use the same listener at the same time should num_buckets > num_listeners).

That's what I mean by "each child is dedicated" (without SO_REUSEPORT): each will accept connections from a single listening socket only. Is that correct?

If so, this is a change with regard to the current prefork sizing habits. Currently with prefork (and this is also true for the threaded MPMs, modulo ThreadsPerChild), when the admin sizes httpd.conf (max children/clients/... according to the hardware capabilities, application needs, ...), (s)he expects each child process to handle all the incoming connections (on any listening socket). Should one VirtualHost (on one listening socket) handle more traffic than the others, its load is distributed over all the children; the admin does not have to worry about how many processes serve this or that listening socket. With this patch though, that no longer holds: there can be idle processes (no activity on their listener) while others are busy (and even full). It's worth it if the load is close to even across all the listeners, but that won't fit all existing configurations... Hence I think we need a way to configure this.

Regards,
Yann.

This is our first patch to the open source and Apache community. We are still on the learning curve about a lot of things. Your feedback and comments really help us! Please let me know if you have any further questions.

Thanks,
Yingqi

From: Yann Ylavic [mailto:[email protected]]
Sent: Wednesday, March 05, 2014 5:04 AM
To: httpd
Subject: Re: [PATCH ASF bugzilla# 55897] prefork_mpm patch with SO_REUSEPORT support

Hi Yingqi,

I'm a bit confused about the patch, mainly because it seems to handle the with- and without-SO_REUSEPORT cases the same way, while SO_REUSEPORT could (IMHO) be handled in the children only (a less intrusive way).

With SO_REUSEPORT, I would have expected the accept mutex to be useless since, if I understand the option correctly, multiple processes/threads can accept() simultaneously provided they use their own socket (each one bound/listening on the same addr:port). Couldn't then each child duplicate the listeners (i.e.
new socket + bind(SO_REUSEPORT) + listen), before switching UIDs, and then poll() all of them without synchronisation (accept() is probably not an option for timeout reasons), and get fair scheduling from the OS (for all the listeners)? Is the lock still needed because the duplicated listeners are inherited from the parent process?

Without SO_REUSEPORT, if I understand correctly still, each child will poll() a single listener to avoid the serialized accept. On the other hand, since each child is dedicated, won't one have to multiply the configured ServerLimit by the number of Listen statements to achieve the same (maximum theoretical) scalability with regard to all the listeners? I don't pretend it is a good or bad thing, I am just figuring out what could then be a "rule" to size the configuration (e.g. MaxClients/ServerLimit/#cores/#Listen).

It seems to me that the patches with and without SO_REUSEPORT should be separate ones, but I may be missing something.

Also, though this is not related to this patch particularly (addressed to whoever knows), it's unclear to me why an accept mutex is needed at all. Is it really unsafe for multiple processes to poll() the same inherited socket? Is that an OS issue? Process-wide only? Still (in)valid in the latest OSes?

Thanks for the patch anyway, it looks promising.

Regards,
Yann.

On Sat, Jan 25, 2014 at 12:25 AM, Lu, Yingqi <[email protected]> wrote:

Dear All,

Our analysis of Apache httpd 2.4.7 prefork MPM, on 32- and 64-thread Intel Xeon 2600 series systems, using an open source three-tier social networking web server workload, revealed performance scaling issues. In the current code, a single listen statement (Listen 80) provides better scalability due to the unserialized accept; however, when the system is under very high load, this can lead to a large number of child processes stuck in the D state. On the other hand, the serialized accept approach cannot scale under high load either. In our analysis, a 32-thread system with 2 listen statements specified could scale to just 70% utilization, and a 64-thread system with a single listen statement specified (Listen 80, 4 network interfaces) could scale to only 60% utilization.

Based on those findings, we created a prototype patch for the prefork MPM which improves performance and thread utilization. In Linux kernels newer than 3.9, SO_REUSEPORT is available. This feature allows multiple sockets to listen on the same IP:port and automatically round-robins connections among them. We use this feature to create multiple duplicated listener records of the original one and partition the child processes into buckets, each bucket listening on one IP:port. In the case of an older kernel which does not have SO_REUSEPORT, we modified the "multiple listen statement" case by creating one listen record for each listen statement and partitioning the child processes into different buckets; again, each bucket listens on one IP:port.

Quick tests of the patch, running the same workload, demonstrated a 22% throughput increase on a 32-thread system with 2 listen statements (Linux kernel 3.10.4). With the older kernel (Linux kernel 3.8.8, without SO_REUSEPORT), a 10% performance gain was measured. With a single listen statement (Listen 80) configuration, we observed over 2X performance improvement on modern dual-socket Intel platforms (Linux kernel 3.10.4). We also observed a big reduction in response time, in addition to the throughput improvement, in our tests [1].
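As an aside, here is roughly what duplicating a listener with SO_REUSEPORT looks like at the plain-sockets level (a minimal, illustrative sketch assuming a Linux 3.9+ kernel and headers that define SO_REUSEPORT; this is not the patch's code, which operates on httpd's ap_listen_rec listener records):

#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create one listener sharing "port" with other SO_REUSEPORT listeners.
 * Every socket sharing the port must set the option before bind(); the
 * kernel then round-robins incoming connections across all of them. */
static int make_reuseport_listener(unsigned short port, int backlog)
{
    int one = 1;
    struct sockaddr_in addr;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0) {
        close(fd);
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

Each bucket (or each child, in the per-child alternative discussed above) would create its own such socket for the same IP:port and accept() on it independently, with no accept mutex.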
Following the feedback on the Bugzilla entry where we originally submitted the patch, we removed the dependency on an APR change to simplify patch testing. Thanks to Jeff Trawick for his good suggestion! As a next step, we are also actively working on extending the patch to the worker and event MPMs. Meanwhile, we would like to gather comments from all of you on the current prefork patch. Please take some time to test it and let us know how it works in your environment. This is our first patch to the Apache community. Please help us review it and let us know if there is anything we might revise to improve it. Your feedback is very much appreciated.

Configuration:

<IfModule prefork.c>
    ListenBacklog 105384
    ServerLimit 105000
    MaxClients 1024
    MaxRequestsPerChild 0
    StartServers 64
    MinSpareServers 8
    MaxSpareServers 16
</IfModule>

[1] Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Thanks,
Yingqi
