Niklas Quarfot Nielsen created MESOS-1706: ---------------------------------------------
             Summary: Introduce socket / connection pooling to libprocess
                 Key: MESOS-1706
                 URL: https://issues.apache.org/jira/browse/MESOS-1706
             Project: Mesos
          Issue Type: Improvement
          Components: libprocess
            Reporter: Niklas Quarfot Nielsen

Just wrote a libprocess connection throughput stress test (basically two libprocess programs sending messages back and forth). One end is multihomed so we can scale up the number of clients.

The throughput with a single client (10 "concurrent" connections, or rather, up to 10 messages sent before awaiting responses) is roughly 8000 - 9000 requests per second. I think I (accidentally) produced more load at one point (around 30,000 requests per second), but I am running into one particular error in both cases: `Failed to send, connect: Cannot assign requested address`.

The error happens during connect() and hints that the machine is running out of available ports on the sender end (when getting randomly assigned ports). According to http://khanna111.com/articles/TCPAAIU.html, it seems the only way around it is some kind of connection pooling (we already use SO_REUSEADDR).
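To put rough numbers on the port exhaustion theory, here is a back-of-the-envelope sketch. It assumes values that were not measured on the test machine: an ephemeral port range of 32768 to 60999 (a common Linux default for `net.ipv4.ip_local_port_range`), a 60 second TIME_WAIT, and a fresh outbound connection to the same destination for every request. The constant and function names are made up for illustration.

```cpp
#include <cassert>

// Assumed Linux defaults; the reporter's machine may be configured differently.
const int kFirstEphemeralPort = 32768;
const int kLastEphemeralPort = 60999;
const int kTimeWaitSeconds = 60;

// Number of source ports available for outbound connections.
inline int ephemeralPorts(int first, int last) {
  assert(last >= first);
  return last - first + 1;
}

// Highest rate of new connections to a single destination that can be
// sustained if every closed connection holds its port for the whole
// TIME_WAIT period.
inline double sustainableConnectsPerSecond(int ports, int timeWaitSeconds) {
  assert(timeWaitSeconds > 0);
  return static_cast<double>(ports) / timeWaitSeconds;
}

// Seconds until the range is used up when connecting at the given rate.
// Only meaningful while the result is shorter than TIME_WAIT, i.e. while
// no port has been released yet.
inline double secondsUntilExhaustion(int ports, double connectsPerSecond) {
  assert(connectsPerSecond > 0);
  return ports / connectsPerSecond;
}
```

Under those assumptions there are 28232 usable ports, which sustains only about 470 new connections per second, and a client opening roughly 8700 connections per second would use up the range in a little over 3 seconds. If libprocess reuses connections for part of the traffic, the real figures would be more forgiving.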
{code}
I0815 07:03:49.348409 30317 main.cpp:109] 8984.79 requests / second (delta: 1.000356864secs)
I0815 07:03:50.348898 30320 main.cpp:109] 8715.88 requests / second (delta: 1.000473088secs)
I0815 07:03:51.349040 30317 main.cpp:109] 8622.64 requests / second (delta: 1.000157184secs)
I0815 07:03:52.349184 30320 main.cpp:109] 9039.69 requests / second (delta: 1.000144896secs)
I0815 07:03:53.349478 30319 main.cpp:109] 8768.42 requests / second (delta: 1.000293888secs)
I0815 07:03:54.349954 30322 main.cpp:109] 8728.9 requests / second (delta: 1.000470016secs)
I0815 07:03:55.350334 30316 main.cpp:109] 8628.79 requests / second (delta: 1.000371968secs)
I0815 07:03:56.350957 30320 main.cpp:109] 8726.57 requests / second (delta: 1.000621824secs)
I0815 07:03:57.351474 30318 main.cpp:109] 8587.46 requests / second (delta: 1.000529152secs)
I0815 07:03:58.351805 30314 main.cpp:109] 8475.16 requests / second (delta: 1.000335104secs)
F0815 07:03:59.092653 30323 process.cpp:2197] Failed to send, connect: Cannot assign requested address [99]
*** Check failure stack trace: ***
Aborted
{code}

One way to deal with it could be to introduce the notion of connection pooling. Any thoughts?
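As a starting point for discussion, here is a minimal sketch of what pooling could look like: keep one open connection per peer and reuse it for later sends, so that only the first send to a peer calls connect() and takes an ephemeral port. Nothing here is existing libprocess code. `Connection`, `ConnectionPool`, and the `Connector` callback are hypothetical names, and the callback stands in for whatever actually opens the socket.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for an open TCP connection; not a libprocess type.
struct Connection {
  std::string peer;  // "ip:port" the connection is attached to
  int id;            // which connect attempt produced it (illustration only)
};

// Keeps at most one open connection per peer and hands it out again on
// later sends. Not thread-safe, and with no idle timeout or size limit.
class ConnectionPool {
public:
  // Callback that opens a new connection to the given peer.
  using Connector =
      std::function<std::shared_ptr<Connection>(const std::string&)>;

  explicit ConnectionPool(Connector connector)
    : connector_(std::move(connector)) {
    assert(connector_ != nullptr);
  }

  // Returns the pooled connection for `peer`, connecting only on a miss.
  std::shared_ptr<Connection> acquire(const std::string& peer) {
    auto it = connections_.find(peer);
    if (it != connections_.end()) {
      return it->second;
    }
    std::shared_ptr<Connection> connection = connector_(peer);
    connections_.emplace(peer, connection);
    return connection;
  }

  // Drops a connection, e.g. after a send error, so that the next
  // acquire() for that peer reconnects.
  void evict(const std::string& peer) { connections_.erase(peer); }

  std::size_t size() const { return connections_.size(); }

private:
  Connector connector_;
  std::map<std::string, std::shared_ptr<Connection>> connections_;
};
```

A sender would call `acquire()` with the destination address before each send and `evict()` when a send fails. A real version would also need locking, a cap on pooled sockets, and handling for connections the peer has closed.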