I'll let Amin follow up, but from what I understand, the way he's doing
batching doesn't introduce any additional delay. Rather, if he can
write to the socket, he writes. However, if the socket is blocked for
whatever reason (e.g. waiting for an ACK, or the send buffer is full),
he buffers all of the waiting packets and then sends them in aggregate.
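Something like this write-or-buffer pattern, I imagine (a minimal
sketch; the names and structure are illustrative, not Amin's actual
code):

// Hypothetical sketch of the write-or-buffer batching described above.
#include <cerrno>
#include <deque>
#include <string>
#include <vector>
#include <sys/uio.h>   // writev
#include <unistd.h>    // write

class BatchingWriter {
public:
    explicit BatchingWriter(int fd) : fd_(fd) {}

    // Try to write immediately; if the socket would block (send buffer
    // full, waiting on ACKs), queue the message instead.
    void send(const std::string& msg) {
        if (!pending_.empty()) {               // already blocked: just queue
            pending_.push_back(msg);
            return;
        }
        ssize_t n = ::write(fd_, msg.data(), msg.size());
        if (n == static_cast<ssize_t>(msg.size()))
            return;                            // went out right away
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return;                            // real error: handle elsewhere
        pending_.push_back(n > 0 ? msg.substr(n) : msg);
    }

    // Called when poll/epoll reports the socket writable again: flush
    // everything queued with one gathered write (a single syscall).
    void on_writable() {
        if (pending_.empty())
            return;
        std::vector<iovec> iov;
        for (auto& m : pending_) {
            iovec v;
            v.iov_base = const_cast<char*>(m.data());
            v.iov_len = m.size();
            iov.push_back(v);
        }
        ssize_t n = ::writev(fd_, &iov[0], static_cast<int>(iov.size()));
        while (n > 0 && !pending_.empty()) {   // drop fully-sent messages
            if (static_cast<size_t>(n) < pending_.front().size()) {
                pending_.front().erase(0, n);  // partial write: keep the tail
                break;
            }
            n -= pending_.front().size();
            pending_.pop_front();
        }
    }

private:
    int fd_;                                   // non-blocking socket
    std::deque<std::string> pending_;
};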
Oh, another point: if you are batching the frames, then what about
delay? There seems to be a trade-off between delay and throughput,
and we have gone for the former by disabling Nagle's algorithm.
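(For reference, disabling Nagle's algorithm is the standard
TCP_NODELAY socket option; with raw sockets it looks like this, and
Boost.Asio exposes the same thing as the ip::tcp::no_delay option.)

// Disable Nagle's algorithm on a controller<->switch TCP socket,
// trading some batching/throughput for lower per-message delay.
#include <netinet/in.h>
#include <netinet/tcp.h>   // TCP_NODELAY
#include <sys/socket.h>

void disable_nagle(int fd) {
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    // Boost.Asio equivalent:
    //   sock.set_option(boost::asio::ip::tcp::no_delay(true));
}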
Regards
KK
On 15 December 2010 12:46, kk yap <[email protected]> wrote:
Hi Amin,
Just to clarify, does your jumbo frames refer to the OpenFlow messages
or the frames in the datapath? By OpenFlow messages, I am assuming
you use a TCP connection between NOX and the switches, and you are
batching the messages into jumbo frames of 9000 bytes before sending
them out. By frames in the datapath, I mean jumbo Ethernet frames are
being sent in the datapath. The latter does not make sense to me,
because OpenFlow should send only the first 128 bytes of each packet
to the controller by default.
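For context, the 128 bytes is the default miss_send_len from the
OpenFlow 1.0 spec (paraphrased below), so jumbo frames in the datapath
would not change what the controller sees on a table miss:

// Paraphrased from the OpenFlow 1.0 spec: the switch sends at most
// miss_send_len bytes of each missed packet to the controller;
// OFP_DEFAULT_MISS_SEND_LEN is 128.
#include <cstdint>

struct ofp_header {
    uint8_t  version;   // OFP_VERSION
    uint8_t  type;      // e.g. OFPT_SET_CONFIG
    uint16_t length;    // length of the whole message
    uint32_t xid;       // transaction id
};

struct ofp_switch_config {
    ofp_header header;
    uint16_t   flags;          // OFPC_FRAG_* handling flags
    uint16_t   miss_send_len;  // bytes of a missed packet to send up
};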
Thanks.
Regards
KK
On 15 December 2010 12:36, Amin Tootoonchian <[email protected]> wrote:
I double checked. It does slightly improve the performance (on the
order of a few thousand replies/sec). Larger MTUs decrease the CPU
workload (by decreasing the number of transfers across the bus), which
leaves more CPU cycles for the controller to process requests.
However, I am not suggesting that people should use
jumbo frames. Apparently running with more user-space threads does the
trick here. Anyway, I should trust a profiler rather than guessing, so
I will get back with a definite answer once I have done a more
thorough evaluation.
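In case anyone wants to reproduce the setup: the MTU change is just
the standard Linux interface setting (ip link set eth0 mtu 9000), or
programmatically via the SIOCSIFMTU ioctl; a hypothetical helper:

// Set an interface's MTU on Linux via the SIOCSIFMTU ioctl
// (hypothetical helper, equivalent to `ip link set <if> mtu 9000`).
#include <cstring>
#include <net/if.h>      // struct ifreq, IFNAMSIZ
#include <sys/ioctl.h>   // SIOCSIFMTU (on Linux)
#include <sys/socket.h>
#include <unistd.h>

bool set_mtu(const char* ifname, int mtu) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);  // any socket works for ioctl
    if (fd < 0)
        return false;
    ifreq ifr;
    std::memset(&ifr, 0, sizeof(ifr));
    std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;
    bool ok = (ioctl(fd, SIOCSIFMTU, &ifr) == 0);
    close(fd);
    return ok;
}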
Cheers,
Amin
On Wed, Dec 15, 2010 at 2:51 PM, kk yap <[email protected]> wrote:
Random curiosity: Why would jumbo frames increase replies per sec?
Regards
KK
On 15 December 2010 11:45, Amin Tootoonchian <[email protected]> wrote:
I missed that. The single-core throughput is ~250k replies/sec, two
cores ~450k replies/sec, three cores ~650k replies/sec, four cores
~800k replies/sec. These numbers are higher than what I reported in my
previous post. That is most probably because, right now, I am testing
with MTU 9000 (jumbo frames) and with more user-space threads.
Cheers,
Amin
On Wed, Dec 15, 2010 at 12:36 AM, Martin Casado <[email protected]> wrote:
Also, do you mind posting the single core throughput?
[cross-posting to nox-dev, openflow-discuss, ovs-discuss]
I have prepared a patch based on NOX Zaku that improves its
performance by a factor of >10. This implies that a single controller
instance can run a large network with nearly a million flow initiations
per second. I am writing to open up a discussion and get feedback from
the community.
Here are some preliminary results:
- Benchmark configuration:
* Benchmark: Throughput test of cbench (controller benchmarker) with
64 switches. Cbench is a part of the OFlops package
(http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput
mode, cbench sends a batch of ofp_packet_in messages to the controller
and counts the number of replies it gets back.
* Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz
quad-core Intel Xeon processor (X3210) and 4GB RAM
* Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz
quad-core Intel Xeon processors (E5405) and 4GB RAM
* Connectivity: 1Gbps
- Benchmark results:
* NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
* Patched NOX: ~650k replies/sec (utilizing only 4 cores out of 8
available cores). The sustained controller->benchmarker throughput is
~400Mbps.
The patch ports the asynchronous harness of NOX to a standard
library (Boost.Asio, the Boost asynchronous I/O library), which
simplifies the code base. It reworks the code in several areas,
including but not limited to:
- Multi-threading: The patch enables running any number of worker
threads across multiple cores (a sketch of this pattern follows the list).
- Batching: Serving requests individually and sending replies one by
one is quite inefficient. The patch batches requests together where
possible, as well as replies (which reduces the number of system
calls significantly).
- Memory allocation: The standard C++ memory allocator scales poorly
in multi-threaded environments. Google's Thread-Caching Malloc
(TCMalloc) and the Hoard memory allocator perform much better for NOX.
- Fully asynchronous operation: The patched version avoids wasting CPU
cycles polling sockets or event/timer dispatchers when not necessary.
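A minimal sketch of the multi-threading plus fully-asynchronous
pattern above (not the actual patch; it assumes the io_service-era
Boost.Asio API):

// One io_service, N worker threads: completion handlers run on
// whichever thread is free, and nothing busy-polls.
#include <boost/asio.hpp>
#include <boost/thread.hpp>

int main() {
    boost::asio::io_service io;
    // Keeps run() from returning while no handlers are queued, so the
    // workers sleep inside the reactor instead of spinning.
    boost::asio::io_service::work work(io);

    // 6633 is the conventional OpenFlow controller port.
    using boost::asio::ip::tcp;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 6633));
    // ... chain async_accept/async_read handlers here: each completion
    // handler re-arms the next asynchronous operation.

    // Four workers to match the four cores used in the benchmark run.
    boost::thread_group workers;
    for (int i = 0; i < 4; ++i)
        workers.create_thread([&io] { io.run(); });
    workers.join_all();
    return 0;
}

Swapping in TCMalloc, for what it's worth, needs no code changes:
linking with -ltcmalloc (or LD_PRELOADing the library) replaces
malloc/free and operator new/delete.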
I would like to add that the patched version should perform much
better than what I reported above (the numbers reported are from a run
on 4 CPU cores). I expect a single NOX instance running on a machine
with 8 CPU cores to handle well above 1 million flow initiation
requests per second, and a more capable machine should serve even
more. The code will be made available soon, and I will post updates
as well.
Cheers,
Amin
_______________________________________________
openflow-discuss mailing list
[email protected]
https://mailman.stanford.edu/mailman/listinfo/openflow-discuss
_______________________________________________
nox-dev mailing list
[email protected]
http://noxrepo.org/mailman/listinfo/nox-dev_noxrepo.org