Hi Yann,

Thanks very much for your email.

1. If I understand correctly (please correct me if not), you suggest 
duplicating the listen sockets inside the child processes with SO_REUSEPORT 
enabled? Yes, I agree that would be a cleaner implementation, and I actually 
tried it first. However, I ran into "connection reset" errors whenever the 
number of child processes changed. Searching online, I found the issue 
discussed at http://lwn.net/Articles/542629/.
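For context, the per-child duplication idea can be sketched minimally in Python (a hypothetical helper, not the actual patch, which is C inside httpd; it assumes a Linux kernel >= 3.9 with SO_REUSEPORT):

```python
import socket

def make_reuseport_listener(host, port, backlog=128):
    """Create a listening socket bound with SO_REUSEPORT, so several
    sockets (one per process) can listen on the same addr:port and
    the kernel distributes incoming connections among them."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Every socket sharing the port must set SO_REUSEPORT before bind().
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(backlog)
    return s

# Two listeners on the same port -- what each child would do.
a = make_reuseport_listener("127.0.0.1", 0)  # port 0 = pick a free port
port = a.getsockname()[1]
b = make_reuseport_listener("127.0.0.1", port)
```

The failure mode I hit appears when one such socket closes while connections are still queued on it (i.e. whenever the child count changes): those pending connections are reset, as described in the LWN article above.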

2. I therefore decided to do the socket duplication in the parent process. The 
goal of this change is to extend CPU/thread scalability on systems with high 
thread counts. I simply defined 
number_of_listen_buckets = total_number_active_threads / 8, with each listen 
bucket getting a dedicated listener. I did not want to over-duplicate the 
socket; otherwise too many child processes would be created at startup, and 
each listen bucket needs at least one child process to start with. This is only 
my understanding, though, and it may not be correct or complete. If you have 
other ideas, please share them; feedback and comments are very welcome 
here :)
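To make the sizing rule concrete, the heuristic above works out to the following (a small Python sketch of just the arithmetic; the /8 divisor and the one-bucket floor are the rule described above, everything else is illustrative):

```python
def num_listen_buckets(total_active_threads, threads_per_bucket=8):
    """Heuristic from the discussion above: one listen bucket per 8
    hardware threads, with a floor of one bucket, since each bucket
    needs at least one child process to start with."""
    return max(1, total_active_threads // threads_per_bucket)

# e.g. a 64-thread system gets 8 buckets, a 32-thread system gets 4,
# and a small 4-thread system still gets its minimum of 1 bucket.
```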

3. I have also been debating whether the with- and without-SO_REUSEPORT changes 
should go into two different patches. The only reason I put them together is 
that they both use the concept of listen buckets. If you think it makes more 
sense to separate them, I can certainly do that. Also, I am a little confused 
by your comment: "On the other hand, each child is dedicated, won't one have to 
multiply the configured ServerLimit by the number of Listen to achieve the same 
(maximum theoretical) scalability with regard to all the listeners?". Could you 
please explain that a bit more? I would really appreciate it.

This is our first patch for the open-source Apache community, and we are still 
on the learning curve in many areas. Your feedback and comments really 
help us!

Please let me know if you have any further questions.

Thanks,
Yingqi


From: Yann Ylavic [mailto:ylavic....@gmail.com]
Sent: Wednesday, March 05, 2014 5:04 AM
To: httpd
Subject: Re: [PATCH ASF bugzilla# 55897]prefork_mpm patch with SO_REUSEPORT 
support

Hi Yingqi,

I'm a bit confused about the patch, mainly because it seems to handle the same 
way both with and without SO_REUSEPORT available, while SO_REUSEPORT could 
(IMHO) be handled in children only (a less intrusive way).
With SO_REUSEPORT, I would have expected the accept mutex to be useless since, 
if I understand the option correctly, multiple processes/threads can accept() 
simultaneously provided they use their own socket (each one bound/listening on 
the same addr:port).
Couldn't then each child duplicate the listeners (i.e. new 
socket+bind(SO_REUSEPORT)+listen), before switching UIDs, and then poll() all 
of them without synchronisation (accept() is probably not an option for timeout 
reasons), and then get fair scheduling from the OS (for all the listeners)?
Is the lock still needed because the duplicated listeners are inherited from 
the parent process?

Without SO_REUSEPORT, if I understand correctly still, each child will poll() a 
single listener to avoid the serialized accept.
On the other hand, each child is dedicated, won't one have to multiply the 
configured ServerLimit by the number of Listen to achieve the same (maximum 
theoretical) scalability with regard to all the listeners?
I don't pretend it is a good or bad thing, just figuring out what could then be 
a "rule" to size the configuration (eg. MaxClients/ServerLimit/#cores/#Listen).
It seems to me that the patches with and without SO_REUSEPORT should be 
separate ones, but I may be missing something.
Also, but this is not related to this patch particularly (addressed to who 
knows), it's unclear to me why an accept mutex is needed at all.
Multiple processes poll()ing the same inherited socket is safe but not multiple 
ones? Is that an OS issue? Process wide only? Still (in)valid in latest OSes?

Thanks for the patch anyway, it looks promising.
Regards,
Yann.

On Sat, Jan 25, 2014 at 12:25 AM, Lu, Yingqi 
<yingqi...@intel.com<mailto:yingqi...@intel.com>> wrote:
Dear All,

Our analysis of the Apache httpd 2.4.7 prefork MPM, on 32- and 64-thread Intel 
Xeon 2600 series systems, using an open-source three-tier social networking web 
server workload, revealed performance scaling issues. In the current code, a 
single Listen statement (Listen 80) provides better scalability thanks to the 
un-serialized accept; however, under very high load it can leave a large number 
of child processes stuck in the D (uninterruptible sleep) state.


The serialized accept approach, on the other hand, cannot scale under high 
load either. In our analysis, a 32-thread system with 2 Listen statements 
specified could scale to just 70% utilization, and a 64-thread system with a 
single Listen statement (Listen 80, 4 network interfaces) could scale to only 
60% utilization.

Based on those findings, we created a prototype patch for the prefork MPM that 
extends performance and thread utilization. Linux kernels 3.9 and newer support 
SO_REUSEPORT, which allows multiple sockets to listen on the same IP:port while 
the kernel round-robins connections among them. We use this feature to create 
multiple duplicates of the original listener record and partition the child 
processes into buckets, each bucket listening on one IP:port. On older kernels 
without SO_REUSEPORT, we modified the "multiple Listen statement" case by 
creating one listen record per Listen statement and partitioning the child 
processes into buckets, again with each bucket listening on one IP:port.
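As a rough illustration of the partitioning step (hypothetical names, in Python rather than the patch's actual C):

```python
def partition_children(num_children, num_buckets):
    """Assign child-process indices to listen buckets round-robin,
    so each bucket (one duplicated listener record) gets an even
    share of children, and each child serves only its own bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for child in range(num_children):
        buckets[child % num_buckets].append(child)
    return buckets
```

For example, 8 children over 2 buckets gives each bucket 4 children; each child then waits only on its own bucket's listener rather than contending for a single shared socket.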

Quick tests of the patch, running the same workload, showed a 22% throughput 
increase on a 32-thread system with 2 Listen statements (Linux kernel 3.10.4). 
On an older kernel (Linux kernel 3.8.8, without SO_REUSEPORT), a 10% 
performance gain was measured. With a single Listen statement (Listen 80), we 
observed over 2x performance improvement on modern dual-socket Intel platforms 
(Linux kernel 3.10.4). In addition to the throughput improvements, we also 
observed a large reduction in response time in our tests [1].

Following feedback on the Bugzilla entry where we originally submitted the 
patch, we removed the dependency on an APR change to simplify patch testing. 
Thanks to Jeff Trawick for his good suggestion! We are also actively working on 
extending the patch to the worker and event MPMs as a next step. Meanwhile, we 
would like to gather comments from all of you on the current prefork patch. 
Please take some time to test it and let us know how it works in your 
environment.

This is our first patch to the Apache community. Please help us review it and 
let us know if there is anything we might revise to improve it. Your feedback 
is very much appreciated.

Configuration:
<IfModule prefork.c>
    ListenBacklog 105384
    ServerLimit 105000
    MaxClients 1024
    MaxRequestsPerChild 0
    StartServers 64
    MinSpareServers 8
    MaxSpareServers 16
</IfModule>

[1] Software and workloads used in performance tests may have been optimized for 
performance only on Intel microprocessors. Performance tests, such as SYSmark 
and MobileMark, are measured using specific computer systems, components, 
software, operations and functions. Any change to any of those factors may 
cause the results to vary. You should consult other information and performance 
tests to assist you in fully evaluating your contemplated purchases, including 
the performance of that product when combined with other products.

Thanks,
Yingqi
