Oh, another point: if you are batching the frames, then what about delay? There seems to be a trade-off between delay and throughput, and we have gone for the former (low delay) by disabling Nagle's algorithm.
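Since Nagle's algorithm came up: on a stream socket it is disabled with the TCP_NODELAY option, which makes the stack send small segments immediately instead of coalescing them. A minimal sketch:

```python
import socket

# Create a TCP socket and disable Nagle's algorithm (TCP_NODELAY),
# trading some on-the-wire efficiency for lower per-message latency.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Verify the option took effect (getsockopt returns a nonzero value).
assert sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

With the option set, each small write goes out as its own segment, which lowers per-message delay at the cost of more (smaller) packets on the wire.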
Regards
KK

On 15 December 2010 12:46, kk yap <yap...@stanford.edu> wrote:
> Hi Amin,
>
> Just to clarify, does your jumbo frames refer to the OpenFlow messages
> or the frames in the datapath? By OpenFlow messages, I am assuming
> you use a TCP connection between NOX and the switches, and you are
> batching the messages into jumbo frames of 9000 bytes before sending
> them out. By frames in the datapath, I mean jumbo Ethernet frames are
> being sent in the datapath. The latter does not make any sense to me,
> because OpenFlow should send 128 bytes to the controller by default.
>
> Thanks.
>
> Regards
> KK
>
> On 15 December 2010 12:36, Amin Tootoonchian <a...@cs.toronto.edu> wrote:
>> I double checked. It does slightly improve the performance (on the
>> order of a few thousand replies/sec). Larger MTUs decrease the CPU
>> workload (by decreasing the number of transfers across the bus), and
>> this means that more CPU cycles are available to the controller to
>> process requests. However, I am not suggesting that people should use
>> jumbo frames. Apparently running with more user-space threads does the
>> trick here. Anyway, I should trust a profiler rather than guessing, so
>> I will get back with a definite answer once I have done a more
>> thorough evaluation.
>>
>> Cheers,
>> Amin
>>
>> On Wed, Dec 15, 2010 at 2:51 PM, kk yap <yap...@stanford.edu> wrote:
>>> Random curiosity: Why would jumbo frames increase replies per sec?
>>>
>>> Regards
>>> KK
>>>
>>> On 15 December 2010 11:45, Amin Tootoonchian <a...@cs.toronto.edu> wrote:
>>>> I missed that. The single-core throughput is ~250k replies/sec, two
>>>> cores ~450k replies/sec, three cores ~650k replies/sec, four cores
>>>> ~800k replies/sec. These numbers are higher than what I reported in my
>>>> previous post. That is most probably because, right now, I am testing
>>>> with MTU 9000 (jumbo frames) and with more user-space threads.
>>>>
>>>> Cheers,
>>>> Amin
>>>>
>>>> On Wed, Dec 15, 2010 at 12:36 AM, Martin Casado <cas...@nicira.com> wrote:
>>>>> Also, do you mind posting the single-core throughput?
>>>>>
>>>>>> [cross-posting to nox-dev, openflow-discuss, ovs-discuss]
>>>>>>
>>>>>> I have prepared a patch based on NOX Zaku that improves its
>>>>>> performance by a factor of >10. This implies that a single controller
>>>>>> instance can run a large network with nearly a million flow initiations
>>>>>> per second. I am writing to open up a discussion and get feedback from
>>>>>> the community.
>>>>>>
>>>>>> Here are some preliminary results:
>>>>>>
>>>>>> - Benchmark configuration:
>>>>>> * Benchmark: Throughput test of cbench (controller benchmarker) with
>>>>>> 64 switches. Cbench is part of the OFlops package
>>>>>> (http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput
>>>>>> mode, cbench sends a batch of ofp_packet_in messages to the controller
>>>>>> and counts the number of replies it gets back.
>>>>>> * Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz
>>>>>> quad-core Intel Xeon processor (X3210) and 4GB RAM
>>>>>> * Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz
>>>>>> quad-core Intel Xeon processors (E5405) and 4GB RAM
>>>>>> * Connectivity: 1Gbps
>>>>>>
>>>>>> - Benchmark results:
>>>>>> * NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
>>>>>> * Patched NOX: ~650k replies/sec (utilizing only 4 of the 8
>>>>>> available cores). The sustained controller->benchmarker throughput is
>>>>>> ~400Mbps.
>>>>>>
>>>>>> The patch updates the asynchronous harness of NOX to a standard
>>>>>> library (the Boost asynchronous I/O library), which simplifies the code
>>>>>> base. It fixes the code in several areas, including but not limited
>>>>>> to:
>>>>>>
>>>>>> - Multi-threading: The patch enables running any number of worker
>>>>>> threads on multiple cores.
>>>>>>
>>>>>> - Batching: Serving requests individually and sending replies one by
>>>>>> one is quite inefficient. The patch tries to batch requests together
>>>>>> where possible, as well as replies (which reduces the number of system
>>>>>> calls significantly).
>>>>>>
>>>>>> - Memory allocation: The standard C++ memory allocator is not robust
>>>>>> in multi-threaded environments. Google's Thread-Caching Malloc
>>>>>> (TCMalloc) or the Hoard memory allocator performs much better for NOX.
>>>>>>
>>>>>> - Fully asynchronous operation: The patched version avoids wasting CPU
>>>>>> cycles polling sockets or event/timer dispatchers when not necessary.
>>>>>>
>>>>>> I would like to add that the patched version should perform much
>>>>>> better than what I reported above (the number reported is from a run
>>>>>> on 4 CPU cores). I guess a single NOX instance running on a machine
>>>>>> with 8 CPU cores should handle well above 1 million flow initiation
>>>>>> requests per second. Also, a more capable machine should help to
>>>>>> serve more requests! The code will be made available soon and I will
>>>>>> post updates as well.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Amin
>>>>>> _______________________________________________
>>>>>> openflow-discuss mailing list
>>>>>> openflow-disc...@lists.stanford.edu
>>>>>> https://mailman.stanford.edu/mailman/listinfo/openflow-discuss
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> nox-dev mailing list
>>>> nox-dev@noxrepo.org
>>>> http://noxrepo.org/mailman/listinfo/nox-dev_noxrepo.org
>>>>
>>>
>>
>
_______________________________________________
nox-dev mailing list
nox-dev@noxrepo.org
http://noxrepo.org/mailman/listinfo/nox-dev_noxrepo.org
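For anyone curious how the batching point above cuts system calls, here is a minimal sketch of the idea: serialized messages accumulate in a user-space buffer and are flushed with a single send. All names here (BatchingWriter, make_packet_out) are illustrative, not the actual NOX patch API; the header layout follows the OpenFlow 1.0 ofp_header (version, type, length, xid).

```python
import struct

# OpenFlow 1.0 message header: version (1B), type (1B), length (2B), xid (4B).
OFP_HEADER = struct.Struct("!BBHI")

def make_packet_out(xid, payload=b""):
    """Build a minimal OpenFlow 1.0-style message (OFPT_PACKET_OUT = 13)."""
    return OFP_HEADER.pack(0x01, 13, OFP_HEADER.size + len(payload), xid) + payload

class BatchingWriter:
    """Accumulate outgoing messages; flush them with one send() call."""

    def __init__(self, sink):
        self.sink = sink      # anything with a send(bytes) method, e.g. a socket
        self.buf = bytearray()
        self.flushes = 0      # stands in for the syscall count

    def queue(self, msg):
        self.buf += msg       # no syscall here, just buffering

    def flush(self):
        if self.buf:
            self.sink.send(bytes(self.buf))  # one syscall for N messages
            self.flushes += 1
            self.buf.clear()
```

Queuing, say, 100 replies and then flushing once issues a single send() instead of 100, which is where the reported reduction in system calls comes from; the trade-off against added delay is the same one discussed for Nagle's algorithm at the top of this thread.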