Dear Ryu developer team,
I am using a Ryu SDN controller on top of Mininet to deploy and test an active
loss monitoring system intended for datacenter topologies. Everything works
quite nicely for smaller networks, but we want data that is closer to
datacenter scale, so we aimed for a topology of at least the size of
FatTree(16).
Concretely that means:
- 320 distinct switches
- 2048 links between those switches
- 190 hosts attached to switches
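
For reference, the topology is a standard three-layer k-ary fat-tree built as
a custom Mininet Topo. A stripped-down sketch of the construction, not my
exact code (in particular the host attachment is simplified, since I only
attach 190 hosts), looks roughly like this:

from itertools import count

from mininet.topo import Topo


class FatTreeTopo(Topo):
    """Sketch of a k-ary fat-tree: (k/2)^2 core, k*(k/2) aggregation and
    k*(k/2) edge switches -- 320 switches and 2048 inter-switch links for
    k=16. Host attachment is simplified (hosts_per_edge is a placeholder)."""

    def build(self, k=16, hosts_per_edge=1):
        half = k // 2
        sid = count(1)

        def switch():
            return self.addSwitch('s%d' % next(sid))

        core = [switch() for _ in range(half * half)]
        for pod in range(k):
            aggs = [switch() for _ in range(half)]
            edges = [switch() for _ in range(half)]
            for i, agg in enumerate(aggs):
                for j in range(half):        # uplinks into the core layer
                    self.addLink(agg, core[i * half + j])
                for edge in edges:           # full bipartite graph inside the pod
                    self.addLink(agg, edge)
            for i, edge in enumerate(edges):
                for h in range(hosts_per_edge):
                    host = self.addHost('h%d_%d_%d' % (pod, i, h))
                    self.addLink(edge, host)
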
I have the necessary resources on my university server; CPU and memory load
are only rarely maxed out (briefly during startup or while post-processing)
when monitoring with htop.
Nevertheless, I am consistently observing in the Ryu logs that datapaths are
disconnecting, i.e. entries like this:
unregister Switch<dpid=294, Port<dpid=294, port_no=1, LIVE> Port<dpid=294,
port_no=3, LIVE> Port<dpid=294, port_no=4, LIVE> Port<dpid=294, port_no=5,
LIVE> Port<dpid=294, port_no=6, LIVE> Port<dpid=294, port_no=7, LIVE>
Port<dpid=294, port_no=8, LIVE> Port<dpid=294, port_no=9, LIVE> Port<dpid=294,
port_no=10, LIVE> Port<dpid=294, port_no=11, LIVE> Port<dpid=294, port_no=12,
LIVE> Port<dpid=294, port_no=13, LIVE> Port<dpid=294, port_no=14, LIVE>
Port<dpid=294, port_no=15, LIVE> Port<dpid=294, port_no=16, LIVE>
Port<dpid=294, port_no=17, LIVE> Port<dpid=294, port_no=18, LIVE> >
I have also seen switches disconnect at smaller network sizes, but there it
usually happens at a later stage of my code's run, which causes fewer
problems. With the big network, these events happen during startup: if I push
flows from the controller while some switches are not properly connected,
some of the flows do not get installed, which leads to black holes in my
system. This is usually accompanied by log entries such as:
Datapath in process of terminating; send() to ('127.0.0.1', 40006) discarded.
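
For context, the flow installation itself is nothing exotic: per datapath it
boils down to a single OFPFlowMod, roughly like the simplified sketch below
(OpenFlow 1.3 assumed; push_flow and the match/actions arguments are
placeholders, not my exact code):

def push_flow(dp, match, actions, priority=100):
    """Simplified sketch: install one flow on one datapath (OpenFlow 1.3)."""
    ofp = dp.ofproto
    parser = dp.ofproto_parser
    inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]
    mod = parser.OFPFlowMod(datapath=dp, priority=priority,
                            match=match, instructions=inst)
    # If the datapath is already being torn down at this point, the message
    # never reaches the switch -- this is when I see the
    # "send() ... discarded" log line and end up with black holes.
    dp.send_msg(mod)

The match and actions are built with the usual dp.ofproto_parser classes
(OFPMatch, OFPActionOutput and so on).
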
My best guess as to why this is happening is that the operating system is
overloaded by all the context switching needed to keep the emulation running
smoothly, and that some keepalive timer somewhere expires because one of the
connected processes was starved of CPU time. But of course I might also have
made a mistake somewhere in my implementation.
Since my goal is not to model a dynamically changing topology or complete
switch failures, there is no reason for this behavior to be part of my
emulation. So I tried to find any timers in the code that could be causing
these unwanted disconnects, and to disable them or set them to a very high
value:
In Ryu, in controller.py:
- I set the default value of 'echo-request-interval' to 604800.
- I set 'maximum-unreplied-echo-requests' to 10.
- I set the socket timeout to None in the __init__ method of the Datapath
class.
In Mininet:
- I set the default value of 'reconnectms' to zero in the __init__ method of
the OVSSwitch class, in the file node.py (see the sketch after this list).
In OVS:
- I set probe_interval to zero in the __init__ method of the Reconnect class
in the file reconnect.py, to disable the keepalive feature.
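
For completeness, here is a stripped-down sketch of how the network start
could look with the Mininet-side setting passed in from the script instead of
patched into node.py. FatTreeTopo is the class from the sketch above,
my_monitoring_app.py is a placeholder name, and the ovs-vsctl line in the
comment is an assumption on my part, not something I have verified:

from functools import partial

from mininet.net import Mininet
from mininet.node import OVSSwitch, RemoteController

# The controller runs separately; as far as I can tell, the controller.py
# values can also be overridden on the command line instead of editing the
# defaults, e.g.:
#   ryu-manager --echo-request-interval 604800 \
#               --maximum-unreplied-echo-requests 10  my_monitoring_app.py


def start_network():
    topo = FatTreeTopo(k=16)  # the topology sketched earlier
    net = Mininet(
        topo=topo,
        # reconnectms=0 corresponds to the default I changed in node.py
        # (as far as I can tell it controls the max_backoff that Mininet
        # sets on the OVS controller target)
        switch=partial(OVSSwitch, reconnectms=0),
        # or port=6633, depending on what ryu-manager listens on
        controller=partial(RemoteController, ip='127.0.0.1', port=6653),
        autoSetMacs=True,
    )
    net.start()
    # Assumption on my side: the OVS keepalive towards the controller can
    # also be relaxed per bridge with something like
    #   ovs-vsctl set controller <bridge> inactivity_probe=0
    return net
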
Unfortunately, none of this stopped the switches from disconnecting.
Could anyone more familiar with the code point me in the right direction for
further investigation? Did I miss a timer somewhere, or do you have a
different explanation for this behavior and how to stop it?
Thank you for your assistance.
Christelle Gloor