[ovs-discuss] OVS under load causes TCP connection stalls on the lo interface
Hi, I have been observing strange TCP connection aborts on an OVS lately. The TCP connection are all localhost only. So no external components can be blamed. tcpdump shows the TCP ACKs where missing and there might have been some data corruption as well (hard to tell without a proper decoder). The ovs instance is not configured to touch lo. After lots of debugging I have been able to find a correlation with an OVS instance on that host. To reproduce the issue I run netperf on lo like this: # netperf -l 600 -D 1,second -H localhost This reports a steady 48527.77 10^6bits/s through on lo. Then I push load through OVS. My OF controller creates on flow rule per TCP connection going through the switch. With about 100 new connections per second this loads the 8 cores to about 50% each. At some random point (mostly within the first 10 seconds of the test), CPU load drops to zero and netperf stalls. The kernel begins to spill out messages like this: grep : 1433 callbacks suppressed With systemtap, I have traced this message to ip_finish_output2 in net/ipv4/ip_output.c. The skb's at that point have a destination IP of 0.0.0.0. Combinations tested: openvswitch-1.11 on Linux 3.8.13 openvswitch-2.3.0 on Linux 3.14.19 openvswitch-git (2654cc338bfb413a6295078e3a7a8e1d4f67cbcc) on Linux 3.14.19 I seems that under this type of load openvswitch kills traffic through lo. Any ideas on what to try next? Andreas -- -- Dipl. Inform. Andreas Schultz ___ discuss mailing list discuss@openvswitch.org http://openvswitch.org/mailman/listinfo/discuss
Re: [ovs-discuss] intermitting ARP problems on DP interface
Hi Ben, Tanks for the insight. The original problem was in my first mail: Oct 18 12:22:54|00344|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:24, type=30(ovs_packet), flags=0, seq=0, pid=0(0:0)),genl(cmd=1,version=1) Oct 18 12:22:54|00345|dpif|DBG|system@dp0: miss upcall: in_port(0),eth(src=00:23:20:fc:70:43,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=172.28.0.1,tip=172.28.0.19,op=1,sha=00:23:20:fc:70:43,tha=00:00:00:00:00:00) 00:23:20:fc:70:43 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.28.0.19 tell 172.28.0.1, length 28 Oct 18 12:22:54|00346|ofproto_dpif|WARN|bridge dp0: received packet on unknown port 65534 Oct 18 12:22:54|00347|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): nl(len:140, type=30(ovs_packet), flags=1[REQUEST], seq=4e9d5835, pid=75500397(2925:18)),genl(cmd=3,version=1) Oct 18 12:22:54|00348|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): nl(len:92, type=29(ovs_flow), flags=5[REQUEST][ACK], seq=4e9d5836, pid=75500397(2925:18)),genl(cmd=1,version=1) Oct 18 12:22:54|00349|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:36, type=2(error), flags=0, seq=4e9d5836, pid=75500397(2925:18)) error(0, in-reply-to(nl(len:92, type=29(ovs_flow), flags=5[REQUEST][ACK], seq=4e9d5836, pid=75500397(2925:18 The netlink errors are obviously not error then, which leaves this message: Oct 18 12:22:54|00346|ofproto_dpif|WARN|bridge dp0: received packet on unknown port 65534 I can generate a full log if you are interested. About the toy program, we only need a way to talk to the datapath, the controller is actually not performing any other use full tasks. Also, if this tool is used for unit tests, then it obviously has to be correct or how are your test supposed to succeed and reflect real usage scenarios. Regards Andreas - Original Message - > The error type is 0, which indicates "success". It is the form of > acknowledgment sent when an operation completes successfully. > > I've diagnosed the first such error that you posted. I can't > diagnose > the ones that I have not seen. > > We haven't seen anything like this internally. I suggest using > ovs-vswitchd instead of a toy test program that is only intended for > unit tests. > > On Wed, Oct 19, 2011 at 06:13:11PM +0200, Andreas Schultz wrote: > > Hi Ben, > > > > It might be harmless, but it is followed by tons of similar errors > > and why > > errors anyway?. > > Also, there is not a single such error on the first start of > > controller. > > Only when I kill the fist instance and start it again those error > > occur > > and the switch is acting erratically. Removing the dp interface in > > between > > also does not help. > > > > Any ideas? > > > > Andreas > > > > - Original Message - > > > On Wed, Oct 19, 2011 at 01:23:53PM +0200, Andreas Schultz wrote: > > > > Hi all, > > > > > > > > Upon further investigation it turns out the the > > > > dpif_linux_open() > > > > sequence > > > > is broken somewhere. The flow of netlink messages simply makes > > > > no > > > > sence at > > > > all. > > > > > > > > first we get an create attempt, which is expected to fail since > > > > the > > > > DP already exists: > > > > Oct 19 11:07:52|00014|netlink_socket|DBG|nl_sock_recv__ > > > > (Success): > > > > nl(len:60, type=2(error), flags=0, seq=4e9ef0dc, > > > > pid=25535(25535:0)) error(-17(File exists), > > > > in-reply-to(nl(len:40, > > > > type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO], > > > > seq=4e9ef0dc, > > > > pid=25535(25535:0 > > > > Oct 19 11:07:52|00015|netlink_socket|DBG|received NAK error=17 > > > > (File exists) > > > > > > > > now it tries OVS_DP_CMD_GET: > > > > Oct 19 > > > > 11:07:52|00016|netlink_socket|DBG|nl_sock_transact_multiple__ > > > > (Success): nl(len:32, type=33(ovs_datapath), > > > > flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dd, > > > > pid=25535(25535:0)),genl(cmd=3,version=1) > > > > Oct 19 11:07:52|00017|netlink_socket|DBG|nl_sock_recv__ > > > > (Success): > > > > nl(len:84, type=33(ovs_datapath), flags=0, seq=4e9ef0dd, > > > > pid=25535(25535:0)),genl(cmd=1,version=1) > > > > > > > > and succeeds. So far so good... > > > > > > > > Next step is to flush any old flows: > > > > Oct 19 > > > > 11:07:52|00018|netlink_socket|DBG|nl_sock_transact_multiple__ > >
Re: [ovs-discuss] intermitting ARP problems on DP interface
Hi Ben, It might be harmless, but it is followed by tons of similar errors and why errors anyway?. Also, there is not a single such error on the first start of controller. Only when I kill the fist instance and start it again those error occur and the switch is acting erratically. Removing the dp interface in between also does not help. Any ideas? Andreas - Original Message - > On Wed, Oct 19, 2011 at 01:23:53PM +0200, Andreas Schultz wrote: > > Hi all, > > > > Upon further investigation it turns out the the dpif_linux_open() > > sequence > > is broken somewhere. The flow of netlink messages simply makes no > > sence at > > all. > > > > first we get an create attempt, which is expected to fail since the > > DP already exists: > > Oct 19 11:07:52|00014|netlink_socket|DBG|nl_sock_recv__ (Success): > > nl(len:60, type=2(error), flags=0, seq=4e9ef0dc, > > pid=25535(25535:0)) error(-17(File exists), in-reply-to(nl(len:40, > > type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dc, > > pid=25535(25535:0 > > Oct 19 11:07:52|00015|netlink_socket|DBG|received NAK error=17 > > (File exists) > > > > now it tries OVS_DP_CMD_GET: > > Oct 19 > > 11:07:52|00016|netlink_socket|DBG|nl_sock_transact_multiple__ > > (Success): nl(len:32, type=33(ovs_datapath), > > flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dd, > > pid=25535(25535:0)),genl(cmd=3,version=1) > > Oct 19 11:07:52|00017|netlink_socket|DBG|nl_sock_recv__ (Success): > > nl(len:84, type=33(ovs_datapath), flags=0, seq=4e9ef0dd, > > pid=25535(25535:0)),genl(cmd=1,version=1) > > > > and succeeds. So far so good... > > > > Next step is to flush any old flows: > > Oct 19 > > 11:07:52|00018|netlink_socket|DBG|nl_sock_transact_multiple__ > > (Success): nl(len:24, type=35(ovs_flow), flags=5[REQUEST][ACK], > > seq=4e9ef0de, pid=25535(25535:0)),genl(cmd=2,version=1) > > > > send that to the kernel... > > > > and the kernel give us a netlink error report for the > > OVS_DP_CMD_GET that was already ACKed OK: > > Oct 19 11:07:52|00019|netlink_socket|DBG|nl_sock_recv__ (Success): > > nl(len:36, type=2(error), flags=0, seq=4e9ef0dd, > > pid=25535(25535:0)) error(0, in-reply-to(nl(len:32, > > type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dd, > > pid=25535(25535:0 > > Oct 19 11:07:52|00020|netlink_socket|DBG|ignoring unexpected seq > > 0x4e9ef0dd > > > > I have verified above sequence with strace and the decoded netlink > > messages where indeed send to and received from the kernel. So it > > is not a buffering issue in the controller. > > This is a red herring. It's very common for a Netlink request to > have > two replies when the ACK flag is set, because the kernel > unconditionally > sends an "error" reply after the command implementation itself sends > any > reply of its own. We just ignore the second reply in userspace; it's > harmless. > -- -- Dipl. Inform. Andreas Schultz ___ discuss mailing list discuss@openvswitch.org http://openvswitch.org/mailman/listinfo/discuss
Re: [ovs-discuss] intermitting ARP problems on DP interface
Hi all, Upon further investigation it turns out the the dpif_linux_open() sequence is broken somewhere. The flow of netlink messages simply makes no sence at all. first we get an create attempt, which is expected to fail since the DP already exists: Oct 19 11:07:52|00014|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:60, type=2(error), flags=0, seq=4e9ef0dc, pid=25535(25535:0)) error(-17(File exists), in-reply-to(nl(len:40, type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dc, pid=25535(25535:0 Oct 19 11:07:52|00015|netlink_socket|DBG|received NAK error=17 (File exists) now it tries OVS_DP_CMD_GET: Oct 19 11:07:52|00016|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): nl(len:32, type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dd, pid=25535(25535:0)),genl(cmd=3,version=1) Oct 19 11:07:52|00017|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:84, type=33(ovs_datapath), flags=0, seq=4e9ef0dd, pid=25535(25535:0)),genl(cmd=1,version=1) and succeeds. So far so good... Next step is to flush any old flows: Oct 19 11:07:52|00018|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): nl(len:24, type=35(ovs_flow), flags=5[REQUEST][ACK], seq=4e9ef0de, pid=25535(25535:0)),genl(cmd=2,version=1) send that to the kernel... and the kernel give us a netlink error report for the OVS_DP_CMD_GET that was already ACKed OK: Oct 19 11:07:52|00019|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:36, type=2(error), flags=0, seq=4e9ef0dd, pid=25535(25535:0)) error(0, in-reply-to(nl(len:32, type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dd, pid=25535(25535:0 Oct 19 11:07:52|00020|netlink_socket|DBG|ignoring unexpected seq 0x4e9ef0dd I have verified above sequence with strace and the decoded netlink messages where indeed send to and received from the kernel. So it is not a buffering issue in the controller. The problem occurs at least with openvswitch version 1.2.2 and git-HEAD on kernel 2.6.32 and 2.6.38. Regards Andreas - Original Message - > Hi, > > We have encountered a strange behavior. After restarting the > controller process and even after removing and reinserting the > datapath module, ARP packets are forwarded from dp0 to the real port > intermittently. > tcpdump confirms that the ARP request is seen on dp0, but not on the > physical port (eth1). > At the same time the following sequence on log messages appear: > > Oct 18 12:22:54|00344|netlink_socket|DBG|nl_sock_recv__ (Success): > nl(len:24, type=30(ovs_packet), flags=0, seq=0, > pid=0(0:0)),genl(cmd=1,version=1) > Oct 18 12:22:54|00345|dpif|DBG|system@dp0: miss upcall: > in_port(0),eth(src=00:23:20:fc:70:43,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=172.28.0.1,tip=172.28.0.19,op=1,sha=00:23:20:fc:70:43,tha=00:00:00:00:00:00) > 00:23:20:fc:70:43 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length > 42: Request who-has 172.28.0.19 tell 172.28.0.1, length 28 > Oct 18 12:22:54|00346|ofproto_dpif|WARN|bridge dp0: received packet > on unknown port 65534 > Oct 18 12:22:54|00347|netlink_socket|DBG|nl_sock_transact_multiple__ > (Success): nl(len:140, type=30(ovs_packet), flags=1[REQUEST], > seq=4e9d5835, pid=75500397(2925:18)),genl(cmd=3,version=1) > Oct 18 12:22:54|00348|netlink_socket|DBG|nl_sock_transact_multiple__ > (Success): nl(len:92, type=29(ovs_flow), flags=5[REQUEST][ACK], > seq=4e9d5836, pid=75500397(2925:18)),genl(cmd=1,version=1) > Oct 18 12:22:54|00349|netlink_socket|DBG|nl_sock_recv__ (Success): > nl(len:36, type=2(error), flags=0, seq=4e9d5836, > pid=75500397(2925:18)) error(0, in-reply-to(nl(len:92, > type=29(ovs_flow), flags=5[REQUEST][ACK], seq=4e9d5836, > pid=75500397(2925:18 > > There clearly is something strange going on. How can the dp receive > something on an unknown port? > > Any hints? > > Regards > Andreas > -- > -- > Dipl. Inform. > Andreas Schultz > > > _______ > discuss mailing list > discuss@openvswitch.org > http://openvswitch.org/mailman/listinfo/discuss > -- -- Dipl. Inform. Andreas Schultz ___ discuss mailing list discuss@openvswitch.org http://openvswitch.org/mailman/listinfo/discuss
[ovs-discuss] intermitting ARP problems on DP interface
Hi, We have encountered a strange behavior. After restarting the controller process and even after removing and reinserting the datapath module, ARP packets are forwarded from dp0 to the real port intermittently. tcpdump confirms that the ARP request is seen on dp0, but not on the physical port (eth1). At the same time the following sequence on log messages appear: Oct 18 12:22:54|00344|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:24, type=30(ovs_packet), flags=0, seq=0, pid=0(0:0)),genl(cmd=1,version=1) Oct 18 12:22:54|00345|dpif|DBG|system@dp0: miss upcall: in_port(0),eth(src=00:23:20:fc:70:43,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=172.28.0.1,tip=172.28.0.19,op=1,sha=00:23:20:fc:70:43,tha=00:00:00:00:00:00) 00:23:20:fc:70:43 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 172.28.0.19 tell 172.28.0.1, length 28 Oct 18 12:22:54|00346|ofproto_dpif|WARN|bridge dp0: received packet on unknown port 65534 Oct 18 12:22:54|00347|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): nl(len:140, type=30(ovs_packet), flags=1[REQUEST], seq=4e9d5835, pid=75500397(2925:18)),genl(cmd=3,version=1) Oct 18 12:22:54|00348|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): nl(len:92, type=29(ovs_flow), flags=5[REQUEST][ACK], seq=4e9d5836, pid=75500397(2925:18)),genl(cmd=1,version=1) Oct 18 12:22:54|00349|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:36, type=2(error), flags=0, seq=4e9d5836, pid=75500397(2925:18)) error(0, in-reply-to(nl(len:92, type=29(ovs_flow), flags=5[REQUEST][ACK], seq=4e9d5836, pid=75500397(2925:18 There clearly is something strange going on. How can the dp receive something on an unknown port? Any hints? Regards Andreas -- -- Dipl. Inform. Andreas Schultz email: a...@travelping.com phone: +49-391-819099-224 mobil: +49-179-7654368 -- managed broadband access -- Travelping GmbH phone: +49-391-8190990 Roentgenstr. 13 fax: +49-391-819099299 D-39108 Magdeburg email: i...@travelping.com GERMANY web: http://www.travelping.com Company Registration: HRB21276 Handelsregistergericht Chemnitz Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780 -- ___ discuss mailing list discuss@openvswitch.org http://openvswitch.org/mailman/listinfo/discuss
[ovs-discuss] ofputil_check_output_port() should permit NONE as port
In lib/ofp-util.c, ofputil_check_output_port() validates to port in output actions. It currently rejects a output port with id NONE, however a output action with a port NONE is the only way to free all buffers associated with a packet without having to way for a timeout on the buffer. So it really should permit NONE as port id. Regards Andreas -- -- Dipl. Inform. Andreas Schultz email: a...@travelping.com phone: +49-391-819099-224 mobil: +49-179-7654368 -- managed broadband access -- Travelping GmbH phone: +49-391-8190990 Roentgenstr. 13 fax: +49-391-819099299 D-39108 Magdeburg email: i...@travelping.com GERMANY web: http://www.travelping.com Company Registration: HRB21276 Handelsregistergericht Chemnitz Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780 -- ___ discuss mailing list discuss@openvswitch.org http://openvswitch.org/mailman/listinfo/discuss