Re: [nox-dev] dev/destiny-fast doesn't respond
Looks like you are already seeing ~3M packet-ins per sec (3k per msec = 3M per sec). Amin

On Tue, Dec 27, 2011 at 2:05 PM, Volkan YAZICI volkan.yaz...@gmail.com wrote: Thanks David! You are right, removing tcmalloc for nox_core solved the problem.
--8---cut here---start-8---
$ dpkg -l | grep tcmalloc
ii libtcmalloc-mi 1.5-1 an efficient thread-caching malloc
$ nox_core -i ptcp:6633 switch -l ~/usr/bin/nox -t 7
$ cbench -c localhost -p 6633 -m 1 -l 10 -s 32 -M 100 -t
cbench: controller benchmarking tool
running in mode 'throughput'
connecting to controller at localhost:6633
faking 32 switches :: 10 tests each; 1 ms per test with 100 unique source MACs per switch
starting test with 0 ms delay after features_reply
ignoring first 1 warmup and last 0 cooldown loops
debugging info is off
32 switches: fmods/sec: 630907 ... total = 1977.609403 per ms
32 switches: fmods/sec: 799125 ... total = 2558.905526 per ms
32 switches: fmods/sec: 903720 ... total = 2901.221645 per ms
32 switches: fmods/sec: 900237 ... total = 2868.801376 per ms
32 switches: fmods/sec: 875842 ... total = 2825.217623 per ms
...
--8---cut here---end---8---
This is a reasonably powerful machine, that is,
--8---cut here---start-8---
$ cat /etc/debian_version
6.0.3
$ uname -a
Linux odun 2.6.32-5-amd64 #1 SMP Thu Nov 3 03:41:26 UTC 2011 x86_64 GNU/Linux
$ grep ^processor /proc/cpuinfo | wc -l
8
$ grep "^model name" /proc/cpuinfo | head -n 1
model name : Intel(R) Xeon(R) CPU E5606 @ 2.13GHz
--8---cut here---end---8---
I still couldn't understand how you get results at the million level in your comparisons. Am I missing something? What should I suspect? Can tcmalloc cause such a 1000x performance impact? Best.

On Tue, 27 Dec 2011 10:44:39 -0800, David Erickson writes: What tcmalloc version do you have, and what OS? Try launching without tcmalloc; on some combinations NOX would just hang when a switch connects while tcmalloc is in use.
___ nox-dev mailing list nox-dev@noxrepo.org http://noxrepo.org/mailman/listinfo/nox-dev
Re: [nox-dev] Error building dev/destiny-fast branch
On Wed, Oct 26, 2011 at 7:32 PM, Andreas Voellmy andreas.voel...@gmail.com wrote:

On Wed, Oct 26, 2011 at 8:42 PM, Amin Tootoonchian a...@cs.toronto.edu wrote: I only updated the 'switch' app in that code base, and I never looked at 'hub'. My guess is that the hub app is doing so little that locking within the boost::asio scheduler outweighs the actual work done by the hub app. We need to make sure that the amount of work done by each thread upon its invocation is significantly more than the locking overhead in boost::asio's internal job queue.

I'm unclear about how components in the destiny branch work. Do the handlers run concurrently by default, or is there something extra that one has to write to get them to execute concurrently? If something extra is needed, what is it in switch.cc that makes it execute concurrently? Or are you saying that the event handlers in 'hub' are indeed running concurrently, but they aren't doing enough work to get much performance gain? (By the way, I was looking at /src/nox/coreapps/switch/switch.cc and /src/nox/coreapps/hub/hub.cc) Thanks, Andreas

They run concurrently by default. They should indeed be running concurrently, but I am guessing the locking overhead within boost::asio significantly outweighs the actual work done by each thread. It shouldn't be hard to fix, but it is not worth it since we consider that code base to be just a proof of concept. Thanks, Amin

Cheers, Amin P.S.: Btw, passing '--enable-ndebug' to configure should boost the performance.

On Wed, Oct 26, 2011 at 2:08 PM, Andreas Voellmy andreas.voel...@gmail.com wrote: Thanks. The code compiled after configuring without python. I was able to get roughly the same kind of performance out of the 'switch' application that is mentioned on the performance page (http://www.openflow.org/wk/index.php/Controller_Performance_Comparisons). However, the 'hub' controller doesn't get much speedup when running with more threads.
For example, when running with one thread I get a throughput of 213868.81, and when I run it with 8 threads I get a throughput of 264017.35. (To run with 8 threads, I am starting the controller like this: ./nox_core -i ptcp: hub -t 8; I am testing with cbench in throughput mode: cbench -p -t.) Is this behavior (little speedup for 'hub' while 'switch' gets lots of speedup) expected with this branch of NOX? Is there something that needs to be done to hub in order to enable the framework to run it concurrently? Regards, Andreas

On Wed, Oct 26, 2011 at 5:53 AM, Murphy McCauley jam...@nau.edu wrote: This branch is quite a bit behind the actual development. We're preparing to release the updated codebase in the near future. But for one thing, Python doesn't work in it. So you probably need to do --with-python=no when you run configure. Hope that helps. -- Murphy

On Oct 25, 2011, at 8:49 PM, Andreas Voellmy wrote: Thanks. I tried editing the conflict marker out in a couple of ways that seemed reasonable to me, but I got other compile errors. Does anyone know if there is a known working version of this branch in the repository, and how I can get back to it? Thanks, Andreas

2011/10/25 Zoltán Lajos Kis zoltan.lajos@ericsson.com Seems like someone checked in a conflict marker to that file: http://noxrepo.org/cgi-bin/gitweb.cgi?p=nox;a=blob;f=src/nox/coreapps/pyrt/context.i;h=cb8641d72feb3a1f0543e97830a2addd55d502b9;hb=dev/destiny-fast#l83 Z.

From: nox-dev-boun...@noxrepo.org [nox-dev-boun...@noxrepo.org] On Behalf Of Andreas Voellmy [andreas.voel...@gmail.com] Sent: Wednesday, October 26, 2011 4:40 AM To: nox-dev@noxrepo.org Subject: [nox-dev] Error building dev/destiny-fast branch

Hi, I'd like to try the destiny-fast branch (I saw it mentioned here: http://www.openflow.org/wk/index.php/Controller_Performance_Comparisons), so I did the following:

git clone git://noxrepo.org/nox
cd nox
git checkout dev/destiny-fast

Is that the right way to get this branch?
After that I ran:

./boot.sh
mkdir build
cd build
../configure
make

and got the following error:

Making all in pyrt
make[8]: Entering directory `/home/av/Download/nox-destiny/nox/build/src/nox/coreapps/pyrt'
/usr/bin/swig -c++ -python -DSWIGWORDSIZE64 -I../../../../../src/include/openflow -I../../../../../src/nox/lib/ -outdir ./. -o oxidereactor_wrap.cc -module oxidereactor ../../../../../src/nox/coreapps/pyrt/oxidereactor.i
/usr/bin/swig -c++ -python -DSWIGWORDSIZE64 -outdir ./. -o deferredcallback_wrap.cc -module deferredcallback ../../../../../src/nox/coreapps/pyrt/deferredcallback.i
/usr/bin/swig -c++ -python -DSWIGWORDSIZE64 -I../../../../../src/include/openflow -I../../../../../src/nox/lib/ -outdir ./. -o pycomponent_wrap.cc -module pycomponent ../../../../../src/nox/coreapps/pyrt/component.i
../../../../../src/nox/coreapps/pyrt/context.i:79: Error: Syntax error in input(3).
Re: [nox-dev] Error building dev/destiny-fast branch
I only updated the 'switch' app in that code base, and I never looked at 'hub'. My guess is that the hub app is doing so little that locking within the boost::asio scheduler outweighs the actual work done by the hub app. We need to make sure that the amount of work done by each thread upon its invocation is significantly more than the locking overhead in boost::asio's internal job queue. If that is the case, since we are working on a new release, it doesn't make much sense to fix it in that code base. Could you wait for that? Cheers, Amin P.S.: Btw, passing '--enable-ndebug' to configure should boost the performance.

On Wed, Oct 26, 2011 at 2:08 PM, Andreas Voellmy andreas.voel...@gmail.com wrote: Thanks. The code compiled after configuring without python. I was able to get roughly the same kind of performance out of the 'switch' application that is mentioned on the performance page (http://www.openflow.org/wk/index.php/Controller_Performance_Comparisons). However, the 'hub' controller doesn't get much speedup when running with more threads. For example, when running with one thread I get a throughput of 213868.81, and when I run it with 8 threads I get a throughput of 264017.35. (To run with 8 threads, I am starting the controller like this: ./nox_core -i ptcp: hub -t 8; I am testing with cbench in throughput mode: cbench -p -t.) Is this behavior (little speedup for 'hub' while 'switch' gets lots of speedup) expected with this branch of NOX? Is there something that needs to be done to hub in order to enable the framework to run it concurrently? Regards, Andreas

On Wed, Oct 26, 2011 at 5:53 AM, Murphy McCauley jam...@nau.edu wrote: This branch is quite a bit behind the actual development. We're preparing to release the updated codebase in the near future. But for one thing, Python doesn't work in it. So you probably need to do --with-python=no when you run configure. Hope that helps. -- Murphy

On Oct 25, 2011, at 8:49 PM, Andreas Voellmy wrote: Thanks.
I tried editing the conflict marker out in a couple of ways that seemed reasonable to me, but I got other compile errors. Does anyone know if there is a known working version of this branch in the repository, and how I can get back to it? Thanks, Andreas

2011/10/25 Zoltán Lajos Kis zoltan.lajos@ericsson.com Seems like someone checked in a conflict marker to that file: http://noxrepo.org/cgi-bin/gitweb.cgi?p=nox;a=blob;f=src/nox/coreapps/pyrt/context.i;h=cb8641d72feb3a1f0543e97830a2addd55d502b9;hb=dev/destiny-fast#l83 Z.

From: nox-dev-boun...@noxrepo.org [nox-dev-boun...@noxrepo.org] On Behalf Of Andreas Voellmy [andreas.voel...@gmail.com] Sent: Wednesday, October 26, 2011 4:40 AM To: nox-dev@noxrepo.org Subject: [nox-dev] Error building dev/destiny-fast branch

Hi, I'd like to try the destiny-fast branch (I saw it mentioned here: http://www.openflow.org/wk/index.php/Controller_Performance_Comparisons), so I did the following:

git clone git://noxrepo.org/nox
cd nox
git checkout dev/destiny-fast

Is that the right way to get this branch? After that I ran:

./boot.sh
mkdir build
cd build
../configure
make

and got the following error:

Making all in pyrt
make[8]: Entering directory `/home/av/Download/nox-destiny/nox/build/src/nox/coreapps/pyrt'
/usr/bin/swig -c++ -python -DSWIGWORDSIZE64 -I../../../../../src/include/openflow -I../../../../../src/nox/lib/ -outdir ./. -o oxidereactor_wrap.cc -module oxidereactor ../../../../../src/nox/coreapps/pyrt/oxidereactor.i
/usr/bin/swig -c++ -python -DSWIGWORDSIZE64 -outdir ./. -o deferredcallback_wrap.cc -module deferredcallback ../../../../../src/nox/coreapps/pyrt/deferredcallback.i
/usr/bin/swig -c++ -python -DSWIGWORDSIZE64 -I../../../../../src/include/openflow -I../../../../../src/nox/lib/ -outdir ./. -o pycomponent_wrap.cc -module pycomponent ../../../../../src/nox/coreapps/pyrt/component.i
../../../../../src/nox/coreapps/pyrt/context.i:79: Error: Syntax error in input(3).
make[8]: *** [pycomponent.py] Error 1

Does anyone know what went wrong and how to fix this? Thanks, Andreas
Re: [nox-dev] NOX Zaku with OF 1.1 support (C++ only)
That would be great! I will be able to work on it again in two weeks, I guess. Just a couple of quick notes:

* So far I have only ported the switch app. Porting is super easy for most apps: you just need to add a boost mutex to protect the data structure.
* I think some apps should be rewritten with performance in mind.
* The most important missing application is discovery, which you have already ported to C++.
* The dev/destiny-fast branch needs testing. It has been mostly used in different benchmarks, and there are parts of the system which I never tested after rewriting (e.g., SSL)!

Amin

Having a C++-only fork of Nox is long overdue. There are many Nox developers who have expressed interest in this. Off hand, I would suggest that we pull in Amin's changes (which make Nox blazingly fast) and remove most spurious apps.
Re: [nox-dev] Maestro: a new scalable OpenFlow controller
Hi Zheng, Sorry for the delay, and thanks a lot for all your comments.

First I want to clarify that the performance number we measured in the tech report is for the routing application. Because the routing application needs to generate multiple flow configuration messages for all switches along the path one packet is going to take, its performance is considered to be worse than that of the switch application. We also use our simulator (which supports LLDP packet exchanges between neighbor switches so that the routing application can work) to evaluate the switch application of NOX, and get a throughput of 90K rps on a 3.1GHz machine. But for the routing application, even after I turn on the ndebug option (by running ./configure --enable-ndebug and make, and I hope this is the right way to do it), the throughput is still around 20K rps. All the numbers in the tech report are for the routing application. I hope this makes sense to you.

Thanks for your reply. Using the routing application explains some of your observations. However, I think benchmarks should be based on the switch application or a no-op application, to illustrate the overhead of the controller itself (i.e., the underlying framework) as the baseline. I think the reason for the routing application performing badly here is binding lookups (this is just a guess), and for sure it could be tuned to perform significantly better than 20K rps.

About the latency, we measure it in an end-to-end style. That is, we measure the start time-stamp for a request before we call the socket.send function, and measure the end time-stamp after we receive the flow_mod/packet_out for that particular request. I agree that NOX will throttle the connection and that the latency of a request within NOX is very small. It's just that the latency we measure is the end-to-end delay, which includes the queuing delay in both the sending buffer and the receiving buffer.
Regarding latency measurements, it is tricky to measure things precisely at sub-millisecond resolution. One should be very careful with the resolution of timers and avoid being affected by the operating system scheduler. If you are seeing latencies close to the scheduler quantum size (under Linux, typically 10ms), your measurement might be affected by the scheduler. There is no real workaround for this under a non-realtime kernel; however, running the controller, benchmarker, and latency measurement tool with CPU affinity set (non-overlapping CPU sets, using taskset -c X) and with a high-priority FIFO scheduler (chrt -f X under Linux) relieves many of these side effects. Btw, I was talking about the end-to-end latency previously. It is quite possible to reduce that.

Furthermore, I would like to know: is the code of HyperFlow, or the NOX multi-threading patch you mentioned, going to be available? Because I really want to study the differences between all the existing systems, and hopefully my effort could contribute to this community :)

I think we could make the NOX multi-threading patch publicly available soon as a branch on noxrepo. The code needs to be reviewed and tested by the NOX team and the community before making its way into mainstream NOX. I, for sure, appreciate your efforts very much. The only thing is that if we had a framework to compare controller performance, we could better understand the trade-offs in controller design. Rob's oflops/cbench package is a great start, and we could all contribute to it. Cheers, Amin
Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
As Martin said, in some cases cbench may significantly over-report numbers in throughput mode (of course it depends on the controller implementation, so not all controllers might be affected). The cbench code sleeps for 100ms to clear out buffers after reading the switch counters (fakeswitch_get_count in fakeswitch.c). There are two problems here:

* Switch input and output buffers are not cleared under throughput mode.
* Having X switches means that the code sleeps for 100X ms instead of a single 100ms for all emulated switches.

These would result in a significant over-estimation of controller performance under throughput mode if one is using more than a few emulated switches. For instance, with 128 switches, cbench would sleep for almost 13 seconds before printing out the stats of each round; meanwhile the controller fills the input buffer of all the emulated switches. Since the input buffer is not cleared, the stats of the next round would contain the replies received for requests in previous rounds (which is a potentially large number).

Rob, I will post a patch soon. Meanwhile, a quick fix is to move the sleep to an appropriate place in run_test (cbench.c) and clear the buffers under throughput mode as well in fakeswitch_get_count (fakeswitch.c). Amin

A problem with cbench might even be of interest to those who wrote it :-) If I could bother you to just send me a diff of what you've changed, it would be much appreciated. I can push it back into the main branch. Fwiw, cbench is something I wrote very quickly while jetlagged, so it's not surprising that there are bugs in it. I didn't realize that people were actually using it, or I would try to snag some time to make it less crappy :-) Thanks for the feedback, - Rob
Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
I double checked. It does slightly improve the performance (on the order of a few thousand replies/sec). Larger MTUs decrease the CPU workload (by decreasing the number of transfers across the bus), and this means that more CPU cycles are available to the controller to process requests. However, I am not suggesting that people should use jumbo frames. Apparently running with more user-space threads does the trick here. Anyway, I should trust a profiler rather than guessing, so I will get back with a definite answer once I have done a more thorough evaluation. Cheers, Amin

On Wed, Dec 15, 2010 at 2:51 PM, kk yap yap...@stanford.edu wrote: Random curiosity: Why would jumbo frames increase replies per sec? Regards KK

On 15 December 2010 11:45, Amin Tootoonchian a...@cs.toronto.edu wrote: I missed that. The single core throughput is ~250k replies/sec, two cores ~450k replies/sec, three cores ~650k replies/sec, four cores ~800k replies/sec. These numbers are higher than what I reported in my previous post. That is most probably because, right now, I am testing with MTU 9000 (jumbo frames) and with more user-space threads. Cheers, Amin

On Wed, Dec 15, 2010 at 12:36 AM, Martin Casado cas...@nicira.com wrote: Also, do you mind posting the single core throughput?

[cross-posting to nox-dev, openflow-discuss, ovs-discuss] I have prepared a patch based on NOX Zaku that improves its performance by a factor of 10. This implies that a single controller instance can run a large network with nearly a million flow initiations per second. I am writing to open up a discussion and get feedback from the community. Here are some preliminary results:

- Benchmark configuration:
* Benchmark: Throughput test of cbench (controller benchmarker) with 64 switches. Cbench is a part of the OFlops package (http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput mode, cbench sends a batch of ofp_packet_in messages to the controller and counts the number of replies it gets back.
* Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz quad-core Intel Xeon processor (X3210), and 4GB RAM
* Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz quad-core Intel Xeon processors (E5405), and 4GB RAM
* Connectivity: 1Gbps

- Benchmark results:
* NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
* Patched NOX: ~650k replies/sec (utilizing only 4 cores out of 8 available cores). The sustained controller-benchmarker throughput is ~400Mbps.

The patch updates the asynchronous harness of NOX to a standard library (the boost asynchronous I/O library), which simplifies the code base. It fixes the code in several areas, including but not limited to:

- Multi-threading: The patch enables having any number of worker threads running on multiple cores.
- Batching: Serving requests individually and sending replies one by one is quite inefficient. The patch tries to batch requests together where possible, as well as replies (which reduces the number of system calls significantly).
- Memory allocation: The standard C++ memory allocator is not robust in multi-threaded environments. Google's Thread-Caching Malloc (TCMalloc) or the Hoard memory allocator perform much better for NOX.
- Fully asynchronous operation: The patched version avoids wasting CPU cycles polling sockets, or event/timer dispatchers, when not necessary.

I would like to add that the patched version should perform much better than what I reported above (the number reported is with a run on 4 CPU cores). I guess a single NOX instance running on a machine with 8 CPU cores should handle well above 1 million flow initiation requests per second. Also, having a more capable machine should help to serve more requests! The code will be made available soon and I will post updates as well.
Cheers, Amin

___ openflow-discuss mailing list openflow-disc...@lists.stanford.edu https://mailman.stanford.edu/mailman/listinfo/openflow-discuss
Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
I am talking about jumbo Ethernet frames here. By batching, I mean batching outgoing messages together and writing to the underlying layer, which would be the TCP write buffer. The TCP buffer is not limited to the MTU or anything like that, so in most cases my code flushes more than 64KB to the TCP write buffer. The gain is due to issuing a single system call with a larger buffer rather than many system calls with tiny buffers (e.g., the 128 bytes you mentioned).

I do not sacrifice delay for throughput here. I keep a write buffer and keep appending to it until the underlying socket is ready for writes. Once it is ready for a write operation, buffered replies are flushed to the underlying layer immediately. This is quite different from Nagle's algorithm and will not add any delays. Amin

On Wed, Dec 15, 2010 at 3:47 PM, kk yap yap...@stanford.edu wrote: Oh.. another point: if you are batching the frames, then what about delay? There seems to be a trade-off between delay and throughput, and we have gone for the former by disabling Nagle's algorithm. Regards KK

On 15 December 2010 12:46, kk yap yap...@stanford.edu wrote: Hi Amin, Just to clarify, does your jumbo frames refer to the OpenFlow messages or the frames in the datapath? By OpenFlow messages, I am assuming you use a TCP connection between NOX and the switches, and you are batching the messages into jumbo frames of 9000 bytes before sending them out. By frames in the datapath, I mean jumbo Ethernet frames being sent in the datapath. The latter does not make any sense to me, because OpenFlow should send 128 bytes to the controller by default. Thanks. Regards KK

On 15 December 2010 12:36, Amin Tootoonchian a...@cs.toronto.edu wrote: I double checked. It does slightly improve the performance (on the order of a few thousand replies/sec). Larger MTUs decrease the CPU workload (by decreasing the number of transfers across the bus), and this means that more CPU cycles are available to the controller to process requests.
However, I am not suggesting that people should use jumbo frames. Apparently running with more user-space threads does the trick here. Anyway, I should trust a profiler rather than guessing, so I will get back with a definite answer once I have done a more thorough evaluation. Cheers, Amin

On Wed, Dec 15, 2010 at 2:51 PM, kk yap yap...@stanford.edu wrote: Random curiosity: Why would jumbo frames increase replies per sec? Regards KK

On 15 December 2010 11:45, Amin Tootoonchian a...@cs.toronto.edu wrote: I missed that. The single core throughput is ~250k replies/sec, two cores ~450k replies/sec, three cores ~650k replies/sec, four cores ~800k replies/sec. These numbers are higher than what I reported in my previous post. That is most probably because, right now, I am testing with MTU 9000 (jumbo frames) and with more user-space threads. Cheers, Amin

On Wed, Dec 15, 2010 at 12:36 AM, Martin Casado cas...@nicira.com wrote: Also, do you mind posting the single core throughput?

[cross-posting to nox-dev, openflow-discuss, ovs-discuss] I have prepared a patch based on NOX Zaku that improves its performance by a factor of 10. This implies that a single controller instance can run a large network with nearly a million flow initiations per second. I am writing to open up a discussion and get feedback from the community. Here are some preliminary results:

- Benchmark configuration:
* Benchmark: Throughput test of cbench (controller benchmarker) with 64 switches. Cbench is a part of the OFlops package (http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput mode, cbench sends a batch of ofp_packet_in messages to the controller and counts the number of replies it gets back.
* Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz quad-core Intel Xeon processor (X3210), and 4GB RAM
* Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz quad-core Intel Xeon processors (E5405), and 4GB RAM
* Connectivity: 1Gbps

- Benchmark results:
* NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
* Patched NOX: ~650k replies/sec (utilizing only 4 cores out of 8 available cores). The sustained controller-benchmarker throughput is ~400Mbps.

The patch updates the asynchronous harness of NOX to a standard library (the boost asynchronous I/O library), which simplifies the code base. It fixes the code in several areas, including but not limited to:

- Multi-threading: The patch enables having any number of worker threads running on multiple cores.
- Batching: Serving requests individually and sending replies one by one is quite inefficient. The patch tries to batch requests together where possible, as well as replies (which reduces the number of system calls significantly).
- Memory allocation: The standard C++ memory allocator is not robust in multi-threaded environments. Google's Thread-Caching Malloc (TCMalloc) or the Hoard memory allocator perform much better for NOX.
[nox-dev] NOX performance improvement by a factor 10
[cross-posting to nox-dev, openflow-discuss, ovs-discuss] I have prepared a patch based on NOX Zaku that improves its performance by a factor of 10. This implies that a single controller instance can run a large network with nearly a million flow initiations per second. I am writing to open up a discussion and get feedback from the community. Here are some preliminary results:

- Benchmark configuration:
* Benchmark: Throughput test of cbench (controller benchmarker) with 64 switches. Cbench is a part of the OFlops package (http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput mode, cbench sends a batch of ofp_packet_in messages to the controller and counts the number of replies it gets back.
* Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz quad-core Intel Xeon processor (X3210), and 4GB RAM
* Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz quad-core Intel Xeon processors (E5405), and 4GB RAM
* Connectivity: 1Gbps

- Benchmark results:
* NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
* Patched NOX: ~650k replies/sec (utilizing only 4 cores out of 8 available cores). The sustained controller-benchmarker throughput is ~400Mbps.

The patch updates the asynchronous harness of NOX to a standard library (the boost asynchronous I/O library), which simplifies the code base. It fixes the code in several areas, including but not limited to:

- Multi-threading: The patch enables having any number of worker threads running on multiple cores.
- Batching: Serving requests individually and sending replies one by one is quite inefficient. The patch tries to batch requests together where possible, as well as replies (which reduces the number of system calls significantly).
- Memory allocation: The standard C++ memory allocator is not robust in multi-threaded environments. Google's Thread-Caching Malloc (TCMalloc) or the Hoard memory allocator perform much better for NOX.
- Fully asynchronous operation: The patched version avoids wasting CPU cycles polling sockets, or event/timer dispatchers, when not necessary.

I would like to add that the patched version should perform much better than what I reported above (the number reported is with a run on 4 CPU cores). I guess a single NOX instance running on a machine with 8 CPU cores should handle well above 1 million flow initiation requests per second. Also, having a more capable machine should help to serve more requests! The code will be made available soon and I will post updates as well. Cheers, Amin
[nox-dev] Typical NOX Memory Usage
Hi all, For a research project, I need to know the typical memory usage of a NOX controller in existing deployments. I am particularly interested in average and maximum memory usage. Btw, do you have any numbers on how many times NOX has crashed in existing deployments because of memory leaks? Thanks, Amin
[nox-dev] Order of event handlers
Hi all, Is there a way to specify a handler to be the last one receiving an event (without enumerating all components in nox.xml)? Thanks, Amin
Re: [nox-dev] Preparing for Nox 0.6.0
Hi Martin, I meant requiring the originators to add themselves (I can think of workarounds to have it work implicitly, but they are compiler/architecture specific).

This feature is not only useful for debugging; it can also be used to deploy multiple NOX controllers to control a single network: on each controller, *capture a set of ofp_msg_events* (e.g., only a small portion of packet_in events change the controller state) and *replay/dispatch* them on the others. We need to discard any outgoing ofp packets caused by the replayed events, and for that we need to keep track of which events triggered other events/messages. Also, to find out which events are *important* (i.e., alter the controller state), the controller and the running applications need to mark events explicitly. In other words, it is the controller/application developer's job to specify which events should be propagated to other controllers. This part also requires the feature mentioned above. That is because if a non-ofp_msg_event is marked, we should be able to trace back to the original ofp_msg_event and mark it. Am I right about the ofp_msg_events being the driving force of NOX operation?

And my last question: Is there currently any way for two NOX instances to synchronize their states for failover? If not, are there any plans to provide such a feature? Is there any way for a controller/application to store its transient state on disk? What happens in a production network with hundreds of switches when the controller crashes and comes back up in a few seconds? Should it rediscover the topology, host-ip-mac bindings, etc. from scratch? Cheers, Amin

Regarding tracing the event call stack: this would certainly be a useful debugging tool. However, the nature of events is that the infrastructure is decoupled from the senders and receivers, so it isn't clear to me how we'd mark the originator in a general way without requiring the originators to add themselves. I'm certainly open to ideas ...
Thanks, Amin