[ovs-discuss] OVS under load causes TCP connection stalls on the lo interface

2014-09-19 Thread Andreas Schultz
Hi,

I have been observing strange TCP connection aborts on an OVS lately. The TCP
connection are all localhost only. So no external components can be blamed.
tcpdump shows the TCP ACKs where missing and there might have been some data
corruption as well (hard to tell without a proper decoder).

The ovs instance is not configured to touch lo.

After lots of debugging I have been able to find a correlation with an OVS
instance on that host. To reproduce the issue I run netperf on lo like this:

# netperf -l 600 -D 1,second -H localhost

This reports a steady 48527.77 10^6bits/s through on lo. Then I push load
through OVS. My OF controller creates on flow rule per TCP connection going
through the switch. With about 100 new connections per second this loads
the 8 cores to about 50% each. At some random point (mostly within the first
10 seconds of the test), CPU load drops to zero and netperf stalls.

The kernel begins to spill out messages like this:

grep : 1433 callbacks suppressed

With systemtap, I have traced this message to ip_finish_output2 in
net/ipv4/ip_output.c. The skb's at that point have a destination IP
of 0.0.0.0.

Combinations tested:

openvswitch-1.11 on Linux 3.8.13
openvswitch-2.3.0 on Linux 3.14.19
openvswitch-git (2654cc338bfb413a6295078e3a7a8e1d4f67cbcc) on Linux 3.14.19

I seems that under this type of load openvswitch kills traffic through lo.

Any ideas on what to try next?

Andreas
-- 
-- 
Dipl. Inform.
Andreas Schultz
___
discuss mailing list
discuss@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss


Re: [ovs-discuss] intermitting ARP problems on DP interface

2011-10-19 Thread Andreas Schultz
Hi Ben,

Tanks for the insight. The original problem was in my first mail:

Oct 18 12:22:54|00344|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:24, 
type=30(ovs_packet), flags=0, seq=0, pid=0(0:0)),genl(cmd=1,version=1)
Oct 18 12:22:54|00345|dpif|DBG|system@dp0: miss upcall:
in_port(0),eth(src=00:23:20:fc:70:43,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=172.28.0.1,tip=172.28.0.19,op=1,sha=00:23:20:fc:70:43,tha=00:00:00:00:00:00)
00:23:20:fc:70:43 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: 
Request who-has 172.28.0.19 tell 172.28.0.1, length 28
Oct 18 12:22:54|00346|ofproto_dpif|WARN|bridge dp0: received packet on unknown 
port 65534
Oct 18 12:22:54|00347|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): 
nl(len:140, type=30(ovs_packet), flags=1[REQUEST], seq=4e9d5835, 
pid=75500397(2925:18)),genl(cmd=3,version=1)
Oct 18 12:22:54|00348|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): 
nl(len:92, type=29(ovs_flow), flags=5[REQUEST][ACK], seq=4e9d5836, 
pid=75500397(2925:18)),genl(cmd=1,version=1)
Oct 18 12:22:54|00349|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:36, 
type=2(error), flags=0, seq=4e9d5836, pid=75500397(2925:18)) error(0, 
in-reply-to(nl(len:92, type=29(ovs_flow), flags=5[REQUEST][ACK], seq=4e9d5836, 
pid=75500397(2925:18

The netlink errors are obviously not error then, which leaves this message:
Oct 18 12:22:54|00346|ofproto_dpif|WARN|bridge dp0: received packet on unknown 
port 65534

I can generate a full log if you are interested.

About the toy program, we only need a way to talk to the datapath, the 
controller is actually not performing any other use full tasks.
Also, if this tool is used for unit tests, then it obviously has to be correct 
or how are your test supposed to succeed and reflect real usage scenarios.

Regards
Andreas

- Original Message -
> The error type is 0, which indicates "success".  It is the form of
> acknowledgment sent when an operation completes successfully.
> 
> I've diagnosed the first such error that you posted.  I can't
> diagnose
> the ones that I have not seen.
> 
> We haven't seen anything like this internally.  I suggest using
> ovs-vswitchd instead of a toy test program that is only intended for
> unit tests.
> 
> On Wed, Oct 19, 2011 at 06:13:11PM +0200, Andreas Schultz wrote:
> > Hi Ben,
> > 
> > It might be harmless, but it is followed by tons of similar errors
> > and why
> > errors anyway?.
> > Also, there is not a single such error on the first start of
> > controller.
> > Only when I kill the fist instance and start it again those error
> > occur
> > and the switch is acting erratically. Removing the dp interface in
> > between
> > also does not help.
> > 
> > Any ideas?
> > 
> > Andreas
> > 
> > - Original Message -
> > > On Wed, Oct 19, 2011 at 01:23:53PM +0200, Andreas Schultz wrote:
> > > > Hi all,
> > > > 
> > > > Upon further investigation it turns out the the
> > > > dpif_linux_open()
> > > > sequence
> > > > is broken somewhere. The flow of netlink messages simply makes
> > > > no
> > > > sence at
> > > > all.
> > > > 
> > > > first we get an create attempt, which is expected to fail since
> > > > the
> > > > DP already exists:
> > > > Oct 19 11:07:52|00014|netlink_socket|DBG|nl_sock_recv__
> > > > (Success):
> > > > nl(len:60, type=2(error), flags=0, seq=4e9ef0dc,
> > > > pid=25535(25535:0)) error(-17(File exists),
> > > > in-reply-to(nl(len:40,
> > > > type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO],
> > > > seq=4e9ef0dc,
> > > > pid=25535(25535:0
> > > > Oct 19 11:07:52|00015|netlink_socket|DBG|received NAK error=17
> > > > (File exists)
> > > > 
> > > > now it tries OVS_DP_CMD_GET:
> > > > Oct 19
> > > > 11:07:52|00016|netlink_socket|DBG|nl_sock_transact_multiple__
> > > > (Success): nl(len:32, type=33(ovs_datapath),
> > > > flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dd,
> > > > pid=25535(25535:0)),genl(cmd=3,version=1)
> > > > Oct 19 11:07:52|00017|netlink_socket|DBG|nl_sock_recv__
> > > > (Success):
> > > > nl(len:84, type=33(ovs_datapath), flags=0, seq=4e9ef0dd,
> > > > pid=25535(25535:0)),genl(cmd=1,version=1)
> > > > 
> > > > and succeeds. So far so good...
> > > > 
> > > > Next step is to flush any old flows:
> > > > Oct 19
> > > > 11:07:52|00018|netlink_socket|DBG|nl_sock_transact_multiple__
> >

Re: [ovs-discuss] intermitting ARP problems on DP interface

2011-10-19 Thread Andreas Schultz
Hi Ben,

It might be harmless, but it is followed by tons of similar errors and why
errors anyway?.
Also, there is not a single such error on the first start of controller.
Only when I kill the fist instance and start it again those error occur
and the switch is acting erratically. Removing the dp interface in between
also does not help.

Any ideas?

Andreas

- Original Message -
> On Wed, Oct 19, 2011 at 01:23:53PM +0200, Andreas Schultz wrote:
> > Hi all,
> > 
> > Upon further investigation it turns out the the dpif_linux_open()
> > sequence
> > is broken somewhere. The flow of netlink messages simply makes no
> > sence at
> > all.
> > 
> > first we get an create attempt, which is expected to fail since the
> > DP already exists:
> > Oct 19 11:07:52|00014|netlink_socket|DBG|nl_sock_recv__ (Success):
> > nl(len:60, type=2(error), flags=0, seq=4e9ef0dc,
> > pid=25535(25535:0)) error(-17(File exists), in-reply-to(nl(len:40,
> > type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dc,
> > pid=25535(25535:0
> > Oct 19 11:07:52|00015|netlink_socket|DBG|received NAK error=17
> > (File exists)
> > 
> > now it tries OVS_DP_CMD_GET:
> > Oct 19
> > 11:07:52|00016|netlink_socket|DBG|nl_sock_transact_multiple__
> > (Success): nl(len:32, type=33(ovs_datapath),
> > flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dd,
> > pid=25535(25535:0)),genl(cmd=3,version=1)
> > Oct 19 11:07:52|00017|netlink_socket|DBG|nl_sock_recv__ (Success):
> > nl(len:84, type=33(ovs_datapath), flags=0, seq=4e9ef0dd,
> > pid=25535(25535:0)),genl(cmd=1,version=1)
> > 
> > and succeeds. So far so good...
> > 
> > Next step is to flush any old flows:
> > Oct 19
> > 11:07:52|00018|netlink_socket|DBG|nl_sock_transact_multiple__
> > (Success): nl(len:24, type=35(ovs_flow), flags=5[REQUEST][ACK],
> > seq=4e9ef0de, pid=25535(25535:0)),genl(cmd=2,version=1)
> > 
> > send that to the kernel...
> > 
> > and the kernel give us a netlink error report for the
> > OVS_DP_CMD_GET that was already ACKed OK:
> > Oct 19 11:07:52|00019|netlink_socket|DBG|nl_sock_recv__ (Success):
> > nl(len:36, type=2(error), flags=0, seq=4e9ef0dd,
> > pid=25535(25535:0)) error(0, in-reply-to(nl(len:32,
> > type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dd,
> > pid=25535(25535:0
> > Oct 19 11:07:52|00020|netlink_socket|DBG|ignoring unexpected seq
> > 0x4e9ef0dd
> > 
> > I have verified above sequence with strace and the decoded netlink
> > messages where indeed send to and received from the kernel. So it
> > is not a buffering issue in the controller.
> 
> This is a red herring.  It's very common for a Netlink request to
> have
> two replies when the ACK flag is set, because the kernel
> unconditionally
> sends an "error" reply after the command implementation itself sends
> any
> reply of its own.  We just ignore the second reply in userspace; it's
> harmless.
> 

-- 
-- 
Dipl. Inform.
Andreas Schultz
___
discuss mailing list
discuss@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss


Re: [ovs-discuss] intermitting ARP problems on DP interface

2011-10-19 Thread Andreas Schultz
Hi all,

Upon further investigation it turns out the the dpif_linux_open() sequence
is broken somewhere. The flow of netlink messages simply makes no sence at
all.

first we get an create attempt, which is expected to fail since the DP already 
exists:
Oct 19 11:07:52|00014|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:60, 
type=2(error), flags=0, seq=4e9ef0dc, pid=25535(25535:0)) error(-17(File 
exists), in-reply-to(nl(len:40, type=33(ovs_datapath), 
flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dc, pid=25535(25535:0
Oct 19 11:07:52|00015|netlink_socket|DBG|received NAK error=17 (File exists)

now it tries OVS_DP_CMD_GET:
Oct 19 11:07:52|00016|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): 
nl(len:32, type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO], seq=4e9ef0dd, 
pid=25535(25535:0)),genl(cmd=3,version=1)
Oct 19 11:07:52|00017|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:84, 
type=33(ovs_datapath), flags=0, seq=4e9ef0dd, 
pid=25535(25535:0)),genl(cmd=1,version=1)

and succeeds. So far so good...

Next step is to flush any old flows:
Oct 19 11:07:52|00018|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): 
nl(len:24, type=35(ovs_flow), flags=5[REQUEST][ACK], seq=4e9ef0de, 
pid=25535(25535:0)),genl(cmd=2,version=1)

send that to the kernel...

and the kernel give us a netlink error report for the OVS_DP_CMD_GET that was 
already ACKed OK:
Oct 19 11:07:52|00019|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:36, 
type=2(error), flags=0, seq=4e9ef0dd, pid=25535(25535:0)) error(0, 
in-reply-to(nl(len:32, type=33(ovs_datapath), flags=d[REQUEST][ACK][ECHO], 
seq=4e9ef0dd, pid=25535(25535:0
Oct 19 11:07:52|00020|netlink_socket|DBG|ignoring unexpected seq 0x4e9ef0dd

I have verified above sequence with strace and the decoded netlink messages 
where indeed send to and received from the kernel. So it is not a buffering 
issue in the controller.
The problem occurs at least with openvswitch version 1.2.2 and git-HEAD on 
kernel 2.6.32 and 2.6.38.

Regards
Andreas

- Original Message -
> Hi,
> 
> We have encountered a strange behavior. After restarting the
> controller process and even after removing and reinserting the
> datapath module, ARP packets are forwarded from dp0 to the real port
> intermittently.
> tcpdump confirms that the ARP request is seen on dp0, but not on the
> physical port (eth1).
> At the same time the following sequence on log messages appear:
> 
> Oct 18 12:22:54|00344|netlink_socket|DBG|nl_sock_recv__ (Success):
> nl(len:24, type=30(ovs_packet), flags=0, seq=0,
> pid=0(0:0)),genl(cmd=1,version=1)
> Oct 18 12:22:54|00345|dpif|DBG|system@dp0: miss upcall:
> in_port(0),eth(src=00:23:20:fc:70:43,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=172.28.0.1,tip=172.28.0.19,op=1,sha=00:23:20:fc:70:43,tha=00:00:00:00:00:00)
> 00:23:20:fc:70:43 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length
> 42: Request who-has 172.28.0.19 tell 172.28.0.1, length 28
> Oct 18 12:22:54|00346|ofproto_dpif|WARN|bridge dp0: received packet
> on unknown port 65534
> Oct 18 12:22:54|00347|netlink_socket|DBG|nl_sock_transact_multiple__
> (Success): nl(len:140, type=30(ovs_packet), flags=1[REQUEST],
> seq=4e9d5835, pid=75500397(2925:18)),genl(cmd=3,version=1)
> Oct 18 12:22:54|00348|netlink_socket|DBG|nl_sock_transact_multiple__
> (Success): nl(len:92, type=29(ovs_flow), flags=5[REQUEST][ACK],
> seq=4e9d5836, pid=75500397(2925:18)),genl(cmd=1,version=1)
> Oct 18 12:22:54|00349|netlink_socket|DBG|nl_sock_recv__ (Success):
> nl(len:36, type=2(error), flags=0, seq=4e9d5836,
> pid=75500397(2925:18)) error(0, in-reply-to(nl(len:92,
> type=29(ovs_flow), flags=5[REQUEST][ACK], seq=4e9d5836,
> pid=75500397(2925:18
> 
> There clearly is something strange going on. How can the dp receive
> something on an unknown port?
> 
> Any hints?
> 
> Regards
> Andreas
> --
> --
> Dipl. Inform.
> Andreas Schultz
> 
> 
> _______
> discuss mailing list
> discuss@openvswitch.org
> http://openvswitch.org/mailman/listinfo/discuss
> 

-- 
-- 
Dipl. Inform.
Andreas Schultz

___
discuss mailing list
discuss@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss


[ovs-discuss] intermitting ARP problems on DP interface

2011-10-18 Thread Andreas Schultz
Hi,

We have encountered a strange behavior. After restarting the controller process 
and even after removing and reinserting the datapath module, ARP packets are 
forwarded from dp0 to the real port intermittently.
tcpdump confirms that the ARP request is seen on dp0, but not on the physical 
port (eth1).
At the same time the following sequence on log messages appear:

Oct 18 12:22:54|00344|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:24, 
type=30(ovs_packet), flags=0, seq=0, pid=0(0:0)),genl(cmd=1,version=1)
Oct 18 12:22:54|00345|dpif|DBG|system@dp0: miss upcall:
in_port(0),eth(src=00:23:20:fc:70:43,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=172.28.0.1,tip=172.28.0.19,op=1,sha=00:23:20:fc:70:43,tha=00:00:00:00:00:00)
00:23:20:fc:70:43 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: 
Request who-has 172.28.0.19 tell 172.28.0.1, length 28
Oct 18 12:22:54|00346|ofproto_dpif|WARN|bridge dp0: received packet on unknown 
port 65534
Oct 18 12:22:54|00347|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): 
nl(len:140, type=30(ovs_packet), flags=1[REQUEST], seq=4e9d5835, 
pid=75500397(2925:18)),genl(cmd=3,version=1)
Oct 18 12:22:54|00348|netlink_socket|DBG|nl_sock_transact_multiple__ (Success): 
nl(len:92, type=29(ovs_flow), flags=5[REQUEST][ACK], seq=4e9d5836, 
pid=75500397(2925:18)),genl(cmd=1,version=1)
Oct 18 12:22:54|00349|netlink_socket|DBG|nl_sock_recv__ (Success): nl(len:36, 
type=2(error), flags=0, seq=4e9d5836, pid=75500397(2925:18)) error(0, 
in-reply-to(nl(len:92, type=29(ovs_flow), flags=5[REQUEST][ACK], seq=4e9d5836, 
pid=75500397(2925:18

There clearly is something strange going on. How can the dp receive something 
on an unknown port?

Any hints?

Regards
Andreas
-- 
-- 
Dipl. Inform.
Andreas Schultz

email: a...@travelping.com
phone: +49-391-819099-224
mobil: +49-179-7654368

-- managed broadband access --

Travelping GmbH   phone:   +49-391-8190990
Roentgenstr. 13   fax:   +49-391-819099299
D-39108 Magdeburg email:   i...@travelping.com
GERMANY   web:   http://www.travelping.com

Company Registration: HRB21276 Handelsregistergericht Chemnitz
Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780
--

___
discuss mailing list
discuss@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss


[ovs-discuss] ofputil_check_output_port() should permit NONE as port

2011-07-25 Thread Andreas Schultz
In lib/ofp-util.c, ofputil_check_output_port() validates to port in output 
actions. It currently rejects a output port with id NONE, however a output 
action with a port NONE is the only way to free all buffers associated with a 
packet without having to way for a timeout on the buffer.

So it really should permit NONE as port id.

Regards
Andreas
-- 
-- 
Dipl. Inform.
Andreas Schultz

email: a...@travelping.com
phone: +49-391-819099-224
mobil: +49-179-7654368

-- managed broadband access --

Travelping GmbH   phone:   +49-391-8190990
Roentgenstr. 13   fax:   +49-391-819099299
D-39108 Magdeburg email:   i...@travelping.com
GERMANY   web:   http://www.travelping.com

Company Registration: HRB21276 Handelsregistergericht Chemnitz
Geschaeftsfuehrer: Holger Winkelmann | VAT ID No.: DE236673780
--

___
discuss mailing list
discuss@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss