Oh, another point: if you are batching the frames, then what about
delay?  There seems to be a trade-off between delay and throughput,
and we have gone for the former (low delay) by disabling Nagle's
algorithm.
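
(By "disabling Nagle" I mean setting TCP_NODELAY on the control
connection; in boost::asio terms that would be roughly the sketch
below, for illustration only rather than actual NOX code.)

    #include <boost/asio.hpp>

    // Trade some throughput for lower per-message delay by turning off
    // Nagle's algorithm (TCP_NODELAY) on the switch connection.
    void disable_nagle(boost::asio::ip::tcp::socket& sock)
    {
        boost::asio::ip::tcp::no_delay nodelay(true);
        sock.set_option(nodelay);
    }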

Regards
KK

On 15 December 2010 12:46, kk yap <yap...@stanford.edu> wrote:
> Hi Amin,
>
> Just to clarify, do your jumbo frames refer to the OpenFlow messages
> or the frames in the datapath?  By OpenFlow messages, I am assuming
> you use a TCP connection between NOX and the switches, and you are
> batching the messages into jumbo frames of 9000 bytes before sending
> them out.  By frames in the datapath, I mean jumbo Ethernet frames are
> being sent in the datapath.  The latter does not make much sense to me,
> because OpenFlow only sends the first 128 bytes of each packet to the
> controller by default.
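>
> (For reference, that 128-byte figure is the default miss_send_len from
> openflow.h, if I remember correctly:
>
>     /* openflow.h */
>     #define OFP_DEFAULT_MISS_SEND_LEN 128  /* bytes of a missed packet
>                                               forwarded to the controller */
>
> and the controller can raise it with an OFPT_SET_CONFIG message.)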
>
> Thanks.
>
> Regards
> KK
>
> On 15 December 2010 12:36, Amin Tootoonchian <a...@cs.toronto.edu> wrote:
>> I double checked. It does slightly improve performance (on the
>> order of a few thousand replies/sec). Larger MTUs decrease the CPU
>> workload (by decreasing the number of transfers across the bus) and
>> this means that more CPU cycles are available to the controller to
>> process requests. However, I am not suggesting that people should use
>> jumbo frames. Apparently running with more user-space threads does the
>> trick here. Anyway, I should trust a profiler rather than guessing, so
>> I will get back with a definite answer once I have done a more
>> thorough evaluation.
>>
>> Cheers,
>> Amin
>>
>> On Wed, Dec 15, 2010 at 2:51 PM, kk yap <yap...@stanford.edu> wrote:
>>> Random curiosity: Why would jumbo frames increase replies per sec?
>>>
>>> Regards
>>> KK
>>>
>>> On 15 December 2010 11:45, Amin Tootoonchian <a...@cs.toronto.edu> wrote:
>>>> I missed that. The single core throughput is ~250k replies/sec, two
>>>> cores ~450k replies/sec, three cores ~650k replies/sec, four cores
>>>> ~800k replies/sec. These numbers are higher than what I reported in my
>>>> previous post. That is most probably because, right now, I am testing
>>>> with MTU 9000 (jumbo frames) and with more user-space threads.
>>>>
>>>> Cheers,
>>>> Amin
>>>>
>>>> On Wed, Dec 15, 2010 at 12:36 AM, Martin Casado <cas...@nicira.com> wrote:
>>>>> Also, do you mind posting the single core throughput?
>>>>>
>>>>>> [cross-posting to nox-dev, openflow-discuss, ovs-discuss]
>>>>>>
>>>>>> I have prepared a patch based on NOX Zaku that improves its
>>>>>> performance by a factor of >10. This implies that a single controller
>>>>>> instance can run a large network with nearly a million flow initiations
>>>>>> per second. I am writing to open up a discussion and get feedback from
>>>>>> the community.
>>>>>>
>>>>>> Here are some preliminary results:
>>>>>>
>>>>>> - Benchmark configuration:
>>>>>>   * Benchmark: Throughput test of cbench (controller benchmarker) with
>>>>>> 64 switches. Cbench is part of the OFlops package
>>>>>> (http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput
>>>>>> mode, cbench sends a batch of ofp_packet_in messages to the controller
>>>>>> and counts the number of replies it gets back.
>>>>>>   * Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz
>>>>>> quad-core Intel Xeon processor (X3210), and 4GB RAM
>>>>>>   * Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz
>>>>>> quad-core Intel Xeon processors (E5405), and 4GB RAM
>>>>>>   * Connectivity: 1Gbps
>>>>>>
>>>>>> - Benchmark results:
>>>>>>   * NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
>>>>>>   * Patched NOX: ~650k replies/sec (utilizing only 4 cores out of 8
>>>>>> available cores). The sustained controller->benchmarker throughput is
>>>>>> ~400Mbps.
>>>>>>
>>>>>> The patch ports NOX's asynchronous I/O harness to a standard
>>>>>> library (Boost.Asio), which simplifies the code base. It also improves
>>>>>> the code in several areas, including but not limited to:
>>>>>>
>>>>>> - Multi-threading: The patch enables having any number of worker
>>>>>> threads running on multiple cores.
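>>>>>> Concretely, the workers are just threads that all call
>>>>>> io_service::run() on a shared io_service, so handlers execute on
>>>>>> whichever core is free. A simplified sketch, not the actual patch code:
>>>>>>
>>>>>>     #include <boost/asio.hpp>
>>>>>>     #include <boost/thread.hpp>
>>>>>>
>>>>>>     // Small functor so each worker simply runs the shared event loop.
>>>>>>     struct run_service {
>>>>>>         boost::asio::io_service* io;
>>>>>>         void operator()() const { io->run(); }
>>>>>>     };
>>>>>>
>>>>>>     int main()
>>>>>>     {
>>>>>>         boost::asio::io_service io;
>>>>>>         boost::asio::io_service::work work(io); // keep run() alive
>>>>>>
>>>>>>         boost::thread_group workers;
>>>>>>         run_service runner = { &io };
>>>>>>         for (int i = 0; i < 4; ++i)          // one thread per core
>>>>>>             workers.create_thread(runner);
>>>>>>
>>>>>>         workers.join_all(); // io.stop() would let the workers exit
>>>>>>     }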
>>>>>>
>>>>>> - Batching: Serving requests individually and sending replies one by
>>>>>> one is quite inefficient. The patch batches requests together where
>>>>>> possible, as well as replies (which reduces the number of system
>>>>>> calls significantly).
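>>>>>> On the reply side this amounts to gathered writes: encoded replies
>>>>>> queue up and are flushed with one scatter-gather write rather than one
>>>>>> send() per message. A simplified sketch with illustrative names, not
>>>>>> the patch code:
>>>>>>
>>>>>>     #include <boost/asio.hpp>
>>>>>>     #include <string>
>>>>>>     #include <vector>
>>>>>>
>>>>>>     // Flush a queue of already-encoded OpenFlow replies; asio hands
>>>>>>     // the whole buffer list to the kernel (typically one writev()).
>>>>>>     void flush_replies(boost::asio::ip::tcp::socket& sock,
>>>>>>                        const std::vector<std::string>& pending)
>>>>>>     {
>>>>>>         std::vector<boost::asio::const_buffer> bufs;
>>>>>>         for (std::vector<std::string>::const_iterator it = pending.begin();
>>>>>>              it != pending.end(); ++it)
>>>>>>             bufs.push_back(boost::asio::buffer(*it));
>>>>>>         boost::asio::write(sock, bufs);
>>>>>>     }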
>>>>>>
>>>>>> - Memory allocation: The standard C++ memory allocator scales poorly
>>>>>> in multi-threaded environments. Google's Thread-Caching Malloc
>>>>>> (TCMalloc) or the Hoard memory allocator perform much better for NOX.
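>>>>>> (Swapping the allocator needs no code changes: preloading the library
>>>>>> at run time, e.g. LD_PRELOAD=/usr/lib/libtcmalloc.so ./nox_core ..., or
>>>>>> linking with -ltcmalloc is enough; the exact library path depends on
>>>>>> your install.)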
>>>>>>
>>>>>> - Fully asynchronous operation: The patched version avoids wasting CPU
>>>>>> cycles polling sockets or event/timer dispatchers when not necessary.
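>>>>>> In other words, everything is completion-driven: sockets sit in
>>>>>> async_read_some()/async_write() and timers in deadline_timer, so no
>>>>>> thread ever spins in a poll loop. A stripped-down sketch of the read
>>>>>> path, with an illustrative class rather than the patch code:
>>>>>>
>>>>>>     #include <boost/asio.hpp>
>>>>>>     #include <boost/bind.hpp>
>>>>>>     #include <cstddef>
>>>>>>
>>>>>>     class Connection {
>>>>>>     public:
>>>>>>         explicit Connection(boost::asio::io_service& io) : socket_(io) {}
>>>>>>         boost::asio::ip::tcp::socket& socket() { return socket_; }
>>>>>>
>>>>>>         // Arm an asynchronous read; the handler runs only when bytes
>>>>>>         // actually arrive, so idle connections cost no CPU.
>>>>>>         void start_read()
>>>>>>         {
>>>>>>             socket_.async_read_some(boost::asio::buffer(rx_buf_),
>>>>>>                 boost::bind(&Connection::on_read, this,
>>>>>>                             boost::asio::placeholders::error,
>>>>>>                             boost::asio::placeholders::bytes_transferred));
>>>>>>         }
>>>>>>
>>>>>>     private:
>>>>>>         void on_read(const boost::system::error_code& ec, std::size_t n)
>>>>>>         {
>>>>>>             if (ec) return;   // peer closed or error: drop the connection
>>>>>>             // ... parse n bytes of OpenFlow messages from rx_buf_ ...
>>>>>>             start_read();     // re-arm for the next batch
>>>>>>         }
>>>>>>
>>>>>>         boost::asio::ip::tcp::socket socket_;
>>>>>>         char rx_buf_[4096];
>>>>>>     };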
>>>>>>
>>>>>> I would like to add that the patched version should perform much
>>>>>> better than what I reported above (the number reported is with a run
>>>>>> on 4 CPU cores). I guess a single NOX instance running on a machine
>>>>>> with 8 CPU cores should handle well above 1 million flow initiation
>>>>>> requests per second. A more capable machine should also help serve
>>>>>> more requests. The code will be made available soon and I will
>>>>>> post updates as well.
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Amin
>>>>>
>>>>>
>>>>
>>>
>>
>

_______________________________________________
nox-dev mailing list
nox-dev@noxrepo.org
http://noxrepo.org/mailman/listinfo/nox-dev_noxrepo.org
