Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10

2010-12-15 Thread Amin Tootoonchian
As Martin said, in some cases cbench may significantly over-report
numbers in throughput mode (it depends on the controller
implementation, so not all controllers may be affected).

The cbench code sleeps for 100ms to clear out buffers after reading
the switch counters (fakeswitch_get_count in fakeswitch.c). There are
two problems here:

* Switch input and output buffers are not cleared under throughput mode.
* Having X switches means that the code sleeps for 100X ms instead of
a single 100ms for all emulated switches.

These problems can lead to a significant over-estimation of controller
performance in throughput mode when more than a few emulated switches
are used. For instance, with 128 switches, cbench sleeps for almost 13
seconds before printing the stats for each round; meanwhile, the
controller fills the input buffers of all the emulated switches. Since
the input buffers are not cleared, the stats for the next round also
include the replies received for requests from previous rounds (which
can be a large number).

Rob, I will post a patch soon. Meanwhile, a quick fix is to move the
sleep to an appropriate place in run_test (cbench.c) and to clear the
buffers in throughput mode as well, in fakeswitch_get_count
(fakeswitch.c).
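For illustration, a minimal, self-contained sketch of the intended
round structure; the names here (FakeSwitch, take_count, and so on) are
stand-ins, not the actual cbench internals:

    // Minimal sketch of the fix described above: sleep once per round, and
    // clear the emulated switches' buffers in every mode when counters are read.
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    struct FakeSwitch {
        long count = 0;                      // replies counted this round
        std::vector<char> inbuf, outbuf;     // emulated switch buffers

        long take_count() {                  // analogous to fakeswitch_get_count()
            long c = count;
            count = 0;
            inbuf.clear();                   // clear both buffers regardless of mode so
            outbuf.clear();                  // stale replies cannot inflate the next round
            return c;
        }
    };

    int main() {
        std::vector<FakeSwitch> switches(128);
        for (int round = 0; round < 3; ++round) {
            // ... send packet_ins and collect replies for one round (omitted) ...

            // Sleep once per round, not once per switch:
            // 100 ms total instead of 128 * 100 ms = 12.8 s.
            std::this_thread::sleep_for(std::chrono::milliseconds(100));

            long total = 0;
            for (auto& sw : switches) total += sw.take_count();
            std::printf("round %d: %ld replies\n", round, total);
        }
        return 0;
    }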

Amin

 A problem with cbench might even be of interest to those who wrote it
 :-)  If I could bother you to just send me a diff of what you've
 changed, it would be much appreciated.  I can push it back into the
 main branch.

 Fwiw, cbench is something I wrote very quickly while jetlagged, so
 it's not surprising that there are bugs in it.  I didn't realize that
 people were actually using it, or I would have tried to snag some time
 to make it less crappy :-)

 Thanks for the feedback,

 - Rob


Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10

2010-12-15 Thread kk yap
Random curiosity: Why would jumbo frames increase replies per sec?

Regards
KK

On 15 December 2010 11:45, Amin Tootoonchian a...@cs.toronto.edu wrote:
 I missed that. The single-core throughput is ~250k replies/sec, two
 cores ~450k replies/sec, three cores ~650k replies/sec, and four cores
 ~800k replies/sec. These numbers are higher than what I reported in my
 previous post. That is most probably because, right now, I am testing
 with MTU 9000 (jumbo frames) and with more user-space threads.

 Cheers,
 Amin

 On Wed, Dec 15, 2010 at 12:36 AM, Martin Casado cas...@nicira.com wrote:
 Also, do you mind posting the single core throughput?

 [cross-posting to nox-dev, openflow-discuss, ovs-discuss]

 I have prepared a patch based on NOX Zaku that improves its
 performance by a factor of 10. This implies that a single controller
 instance can run a large network with nearly a million flow initiations
 per second. I am writing to open up a discussion and get feedback from
 the community.

 Here are some preliminary results:

 - Benchmark configuration:
   * Benchmark: Throughput test of cbench (controller benchmarker) with
 64 switches. Cbench is a part of the OFlops package
 (http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput
 mode, cbench sends a batch of ofp_packet_in messages to the controller
 and counts the number of replies it gets back.
   * Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz
 quad-core Intel Xeon processor (X3210) and 4GB RAM
   * Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz
 quad-core Intel Xeon processors (E5405) and 4GB RAM
   * Connectivity: 1Gbps

 - Benchmark results:
   * NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
   * Patched NOX: ~650k replies/sec (utilizing only 4 cores out of 8
 available cores). The sustained controller-benchmarker throughput is
 ~400Mbps.

 The patch replaces the asynchronous harness of NOX with a standard
 library (the Boost asynchronous I/O library), which simplifies the code
 base. It also improves the code in several areas, including but not
 limited to:

 - Multi-threading: The patch allows any number of worker threads to
 run on multiple cores (see the sketch after this list).

 - Batching: Serving requests individually and sending replies one by
 one is quite inefficient. The patch batches requests together where
 possible, as well as replies, which reduces the number of system calls
 significantly.

 - Memory allocation: The standard C++ memory allocator scales poorly
 in multi-threaded environments. Google's Thread-Caching Malloc
 (TCMalloc) and the Hoard memory allocator both perform much better for
 NOX.

 - Fully asynchronous operation: The patched version avoids wasting CPU
 cycles polling sockets or event/timer dispatchers when not necessary.
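
 As a rough illustration of the worker-thread model in the
 multi-threading item above, here is a minimal Boost.Asio sketch (the
 generic one-event-loop, several-workers pattern, not the actual patch
 code):

    // Generic Boost.Asio pattern: one event loop served by several worker
    // threads, so completion handlers can run on multiple cores. Sketch only.
    #include <boost/asio.hpp>
    #include <thread>
    #include <vector>

    int main() {
        boost::asio::io_service io;               // shared event loop
        boost::asio::io_service::work work(io);   // keeps run() from returning while idle

        // OpenFlow sockets, timers, etc. would be registered on `io` here; their
        // handlers are then dispatched on whichever worker thread is free.

        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back([&io] { io.run(); });

        for (auto& t : workers) t.join();         // runs until the process is stopped
        return 0;
    }

 Swapping in TCMalloc or Hoard, as mentioned above, typically needs no
 code changes at all: the allocator is linked in (e.g., -ltcmalloc) or
 preloaded via LD_PRELOAD.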

 I would like to add that the patched version should perform much
 better than what I reported above (the number reported is from a run
 on 4 CPU cores). I expect a single NOX instance running on a machine
 with 8 CPU cores to handle well above 1 million flow initiation
 requests per second, and a more capable machine should serve even more
 requests. The code will be made available soon and I will post updates
 as well.


 Cheers,
 Amin


Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10

2010-12-15 Thread Amin Tootoonchian
I double checked. It does slightly improve the performance (on the
order of a few thousand replies/sec). Larger MTUs decrease the CPU
workload (by decreasing the number of transfers across the bus), which
means more CPU cycles are available to the controller to process
requests. However, I am not suggesting that people use jumbo frames;
apparently, running with more user-space threads does the trick here.
Anyway, I should trust a profiler rather than guess, so I will get back
with a definite answer once I have done a more thorough evaluation.

Cheers,
Amin


Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10

2010-12-15 Thread kk yap
Hi Amin,

Just to clarify, do your jumbo frames refer to the OpenFlow messages
or to the frames in the datapath? By OpenFlow messages, I am assuming
you use a TCP connection between NOX and the switches and batch the
messages into jumbo frames of 9000 bytes before sending them out. By
frames in the datapath, I mean jumbo Ethernet frames being sent on the
datapath. The latter does not make sense to me, because OpenFlow should
only send 128 bytes of each packet to the controller by default.

Thanks.

Regards
KK


Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10

2010-12-15 Thread kk yap
Oh, another point: if you are batching the frames, then what about
delay? There seems to be a trade-off between delay and throughput, and
we went for the former (lower delay) by disabling Nagle's algorithm.
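
For context, disabling Nagle is a one-line socket option; a minimal
sketch with error handling omitted:

    // Disable Nagle's algorithm on a connected TCP socket: small segments are
    // sent immediately instead of being coalesced, trading some batching
    // efficiency for lower per-message latency.
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    void disable_nagle(int fd) {
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }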

Regards
KK


Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10

2010-12-15 Thread Martin Casado
I'll let Amin follow up, but from what I understand, the way he's doing
batching doesn't introduce any additional delay. Rather, if he can
write to the socket, he writes. However, if the socket is blocked for
whatever reason (e.g., waiting for an ACK, or the send buffer is full),
he buffers all of the waiting packets and then sends them in aggregate.



Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10

2010-12-15 Thread Amin Tootoonchian
I am talking about jumbo Ethernet frames here. By batching, I mean
grouping outgoing messages together and writing them to the underlying
layer, which here is the TCP write buffer. The TCP buffer is not
limited by the MTU or anything like that, so in most cases my code
flushes more than 64KB to the TCP write buffer at once. The gain comes
from issuing a single system call with a large buffer rather than many
system calls with tiny buffers (e.g., the 128-byte messages you
mentioned).

I do not sacrifice delay for throughput here. I keep a write buffer
and keep appending to it until the underlying socket is ready for
writes. As soon as the socket is ready for a write operation, the
buffered replies are flushed to the underlying layer immediately. This
is quite different from Nagle's algorithm and does not add any delay.
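
For what it's worth, here is a generic Asio-style sketch of this
append-while-busy, flush-when-writable pattern (not the actual patch
code; connection setup and OpenFlow framing are omitted):

    // Sketch of the batching pattern described above: messages are appended to
    // a pending buffer; whenever the socket finishes a write, everything that
    // accumulated in the meantime is flushed in a single operation.
    #include <boost/asio.hpp>
    #include <cstddef>
    #include <string>

    class BatchingSender {
    public:
        explicit BatchingSender(boost::asio::ip::tcp::socket& sock) : sock_(sock) {}

        // Called once per outgoing OpenFlow message (already serialized).
        void send(const std::string& msg) {
            pending_ += msg;                  // never one write() per message
            if (!writing_) flush();
        }

    private:
        void flush() {
            if (pending_.empty()) { writing_ = false; return; }
            writing_ = true;
            in_flight_.swap(pending_);        // everything queued so far goes out at once
            pending_.clear();
            boost::asio::async_write(
                sock_, boost::asio::buffer(in_flight_),
                [this](const boost::system::error_code& ec, std::size_t) {
                    if (!ec) flush();         // socket writable again: flush whatever
                });                           // arrived while this write was in flight
        }

        boost::asio::ip::tcp::socket& sock_;
        std::string pending_, in_flight_;
        bool writing_ = false;
    };

As Martin described, writes go out immediately while the socket is
writable and only accumulate into a batch while it is not.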

Amin


[nox-dev] Associate xid with a flow mod event

2010-12-15 Thread Derek Cormier

Hello,

When you receive a flow mod event, is there any way to associate it
with the xid of the original request that caused it? I'm looking for a
way to confirm that a specific request generated a specific response.
For example, if two components are running and both send a request to
add the same flow (with the no-overlapping-flows flag on), then one
will get an error and the other will not. I suppose you could check the
xid of the errors to determine whose was successful, but that seems a
bit hackish.


Thanks!
Derek


Re: [nox-dev] Associate xid with a flow mod event

2010-12-15 Thread kk yap
Hi Derek,

Are you assuming the components will tag the flow_mod with the same
xid as the packet_in? I do not think this is true for unmodified NOX,
though I am not sure. Either way, what matters is that you can change
NOX to make it true, so you can definitely do this.

Regards
KK


Re: [nox-dev] Associate xid with a flow mod event

2010-12-15 Thread Rob Sherwood
I'm not sure if this is what you're asking, but flow_mods have a
'cookie' field associated with them that gets returned in all sorts of
flow_mod-related messages, e.g., flow_removed messages. Maybe that is
what you're looking for.
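
For reference, a minimal sketch of stamping a flow_mod this way against
the OpenFlow 1.0 structures; the include path and the 64-bit byte-swap
helper are assumptions, and the match/actions setup is omitted:

    // Sketch: a component stamps its own 64-bit cookie (and a fresh xid) onto a
    // flow_mod. The cookie is echoed back in flow_removed messages and flow-stats
    // replies; the xid is echoed in any ofp_error_msg triggered by this request.
    #include <cstdint>
    #include <cstring>
    #include <arpa/inet.h>
    #include <endian.h>                    // htobe64 (glibc); other platforms differ
    #include "openflow/openflow.h"         // assumed include path for the 1.0 header

    void fill_flow_mod(ofp_flow_mod *fm, uint32_t xid, uint64_t my_cookie) {
        std::memset(fm, 0, sizeof *fm);
        fm->header.version = OFP_VERSION;
        fm->header.type    = OFPT_FLOW_MOD;
        fm->header.length  = htons(sizeof *fm);           // no actions in this sketch
        fm->header.xid     = htonl(xid);
        fm->cookie         = htobe64(my_cookie);
        fm->command        = htons(OFPFC_ADD);
        fm->flags          = htons(OFPFF_CHECK_OVERLAP |  // the "no overlap" flag
                                   OFPFF_SEND_FLOW_REM);  // ask for flow_removed too
        // match, priority, timeouts, buffer_id and out_port are left out here.
    }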

- Rob


Re: [nox-dev] Associate xid with a flow mod event

2010-12-15 Thread Derek Cormier

@KK
It turns out I made a wrong assumption. I thought that when an
ofp_flow_mod (OFPFC_ADD) message was sent, the switch would return a
reply with the same xid. After looking at the OF protocol, it seems a
message is only sent back if an error occurred.


@Rob
The cookie isn't quite what I'm looking for; I'm not sure what it
might be used for in the future...


Basically, I'm looking for a way to validate that adding a flow
worked. Consider the following scenario:


There are two components: A and B

1. A sends a request to add a flow.
2. B sends a request to add the exact same flow.
3. B's flow gets added first and is successful.
4. A receives a flow mod event and thinks its flow was added.
5. A's flow gets added. It conflicts with B's flow and generates an
error.
6. A sees an error for its exact flow and doesn't know whether its flow
was added or another component's.


-Derek
