Re: [ovs-discuss] ovn-controller consuming lots of CPU

2017-12-13 Thread Kevin Lin
Thanks for the replies!

We’re using v2.8.1.

I don’t completely understand Russell’s patch, but I don’t think our ACLs were 
taking advantage of it. Do the ACLs need to be “tagged” with logical port 
information for the patch to be useful?

Before, our ACLs only matched on L3 and above. I brought the number of flows 
down from 252981 to 16224 by modifying the ACLs to also match on the logical 
port. For example:
from-lport: (inport == "1" || inport == "2" || inport == "3") && (ip4.dst == $addressSet)
to-lport: (ip4.src == $addressSet) && (outport == "1" || outport == "2" || outport == "3")
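
For reference, this is roughly how we create them; the switch name "lswitch", 
the priority, and the action here are placeholders rather than our exact 
configuration:

ovn-nbctl acl-add lswitch from-lport 1001 '(inport == "1" || inport == "2" || inport == "3") && ip4.dst == $addressSet' allow-related
ovn-nbctl acl-add lswitch to-lport 1001 'ip4.src == $addressSet && (outport == "1" || outport == "2" || outport == "3")' allow-related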

Is that the right way to take advantage of Russell’s patch? It effectively 
expands the address set into an explicit list of logical ports on one side of 
the connection, but it does decrease the number of flows installed on each 
vswitchd.

Even with the decrease in flows in vswitchd, I’m still seeing the log messages. 
Do 16224 flows per vswitchd instance and 559 flows in ovn-sb sound reasonable?
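
In case it matters, I’m counting roughly like this (br-int is the integration 
bridge on each worker):

ovs-ofctl -O OpenFlow13 dump-flows br-int | wc -l   # OpenFlow flows on one hypervisor (count includes one header line)
ovn-sbctl lflow-list | grep -c 'table='             # logical flows in the southbound database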

Thanks,
—Kevin

> On Dec 13, 2017, at 7:55 AM, Mark Michelson <mmich...@redhat.com> wrote:
> 
> On 12/12/2017 03:26 PM, Kevin Lin wrote:
>> Hi again,
>> We’re trying to scale up our OVN deployment and we’re seeing some worrying 
>> log messages.
>> The topology is 32 containers connected to another 32 containers on 10 
>> different ports. This is running on 17 machines (one machine runs ovn-northd 
>> and ovsdb-server, the other 16 run ovn-controller, ovs-vswitchd, and 
>> ovsdb-server). We’re using an address set for the source group, but not the 
>> destination group. We’re also creating a different ACL for each port. So the 
>> ACLs look like:
>> One address set for { container1, container2, … container32 }
>> addressSet -> container1 on port 80
>> addressSet -> container1 on port 81
>> …
>> addressSet -> container1 on port 90
>> addressSet -> container2 on port 80
>> …
>> addressSet -> container32 on port 90
>> The ovn-controller log:
>> 2017-12-12T20:14:49Z|11878|timeval|WARN|Unreasonably long 1843ms poll 
>> interval (1840ms user, 0ms system)
>> 2017-12-12T20:14:49Z|11879|timeval|WARN|disk: 0 reads, 16 writes
>> 2017-12-12T20:14:49Z|11880|timeval|WARN|context switches: 0 voluntary, 21 
>> involuntary
>> 2017-12-12T20:14:49Z|11881|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 
>> (172.31.11.193:48460<->172.31.2.181:6640) at lib/stream-fd.c:157 (36% CPU 
>> usage)
>> 2017-12-12T20:14:49Z|11882|poll_loop|DBG|wakeup due to [POLLIN] on fd 12 
>> (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (36% CPU usage)
>> 2017-12-12T20:14:49Z|11883|jsonrpc|DBG|tcp:172.31.2.181:6640: received 
>> reply, result=[], id="echo"
>> 2017-12-12T20:14:49Z|11884|netlink_socket|DBG|nl_sock_transact_multiple__ 
>> (Success): nl(len:36, type=38(family-defined), flags=9[REQUEST][ECHO], 
>> seq=b11, pid=2268452876
>> 2017-12-12T20:14:49Z|11885|netlink_socket|DBG|nl_sock_recv__ (Success): 
>> nl(len:136, type=36(family-defined), flags=0, seq=b11, pid=2268452876
>> 2017-12-12T20:14:49Z|11886|vconn|DBG|unix:/var/run/openvswitch/br-int.mgmt: 
>> received: OFPT_ECHO_REQUEST (OF1.3) (xid=0x0): 0 bytes of payload
>> 2017-12-12T20:14:49Z|11887|vconn|DBG|unix:/var/run/openvswitch/br-int.mgmt: 
>> sent (Success): OFPT_ECHO_REPLY (OF1.3) (xid=0x0): 0 bytes of payload
>> 2017-12-12T20:14:51Z|11888|timeval|WARN|Unreasonably long 1851ms poll 
>> interval (1844ms user, 8ms system)
>> 2017-12-12T20:14:51Z|11889|timeval|WARN|context switches: 0 voluntary, 11 
>> involuntary
>> 2017-12-12T20:14:52Z|11890|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 
>> (172.31.11.193:48460<->172.31.2.181:6640) at lib/stream-fd.c:157 (73% CPU 
>> usage)
>> 2017-12-12T20:14:52Z|11891|jsonrpc|DBG|tcp:172.31.2.181:6640: received 
>> request, method="echo", params=[], id="echo"
>> 2017-12-12T20:14:52Z|11892|jsonrpc|DBG|tcp:172.31.2.181:6640: send reply, 
>> result=[], id="echo"
>> 2017-12-12T20:14:52Z|11893|netlink_socket|DBG|nl_sock_transact_multiple__ 
>> (Success): nl(len:36, type=38(family-defined), flags=9[REQUEST][ECHO], 
>> seq=b12, pid=2268452876
>> 2017-12-12T20:14:52Z|11894|netlink_socket|DBG|nl_sock_recv__ (Success): 
>> nl(len:136, type=36(family-defined), flags=0, seq=b12, pid=2268452876
>> 2017-12-12T20:14:52Z|11895|netdev_linux|DBG|Dropped 18 log messages in last 
>> 56 seconds (most recently, 3 seconds ago) due to excessive rate
>> 2017-12-12T20:14:52Z|11896|netdev_linux|DBG|unknown qdisc "mq"
>> 2017-12-12T20:14:54Z|11897|hmap|DBG|Dropped 15511 log messages in last 6 
>> seconds (most recently, 0 seconds ago) due to excessive rate
>> 2017-12

Re: [ovs-discuss] Failed assertion in ovs-vswitchd when running OVN

2017-12-07 Thread Kevin Lin
The patch worked for me! (network works, and no more warnings in ovs-vswitchd)

> On Dec 7, 2017, at 3:54 PM, Ben Pfaff <b...@ovn.org> wrote:
> 
> On Thu, Dec 07, 2017 at 02:44:36PM -0800, Kevin Lin wrote:
>> Hi Ben,
>> 
>> I’ve included the traces for an ARP request, and a ping. ovs-vswitchd also 
>> logs errors for the return traffic, but I didn’t include that as it seems 
>> redundant.
>> 
>> root@ip-172-31-2-45:/# ovs-appctl ofproto/trace kelda-int 'arp,in_port=1,vlan_tci=0x,dl_src=02:00:0a:c5:34:1e,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=10.197.52.30,arp_tpa=10.203.4.66,arp_op=1,arp_sha=02:00:0a:c5:34:1e,arp_tha=ff:ff:ff:ff:ff:ff'
>> Flow: arp,in_port=1,vlan_tci=0x,dl_src=02:00:0a:c5:34:1e,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=10.197.52.30,arp_tpa=10.203.4.66,arp_op=1,arp_sha=02:00:0a:c5:34:1e,arp_tha=ff:ff:ff:ff:ff:ff
>> 
>> bridge("kelda-int")
>> ---
>> 0. in_port=1,dl_src=02:00:0a:c5:34:1e, priority 32768
>>load:0x2->NXM_NX_REG0[]
>>resubmit(,1)
>> 1. arp,dl_dst=ff:ff:ff:ff:ff:ff, priority 1000
>>LOCAL
>>output:NXM_NX_REG0[]
>> -> output port is 2
>>>>>> Cannot truncate output to patch port <<<<
> 
> Thanks for the details.
> 
> Would you mind testing this patch?
> https://patchwork.ozlabs.org/patch/845908/


Re: [ovs-discuss] Failed assertion in ovs-vswitchd when running OVN

2017-12-07 Thread Kevin Lin
:00:0a:c5:34:1e,dl_dst=02:00:0a:cb:04:42,nw_src=10.197.52.30,nw_dst=10.203.4.66,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0

bridge("kelda-int")
---
thaw
Resuming from table 43
43. metadata=0x1, priority 0, cookie 0x95c9e574
resubmit(,44)
44. 
ct_state=+new-est+trk,icmp,metadata=0x1,nw_src=10.197.52.30,nw_dst=10.203.4.66, 
priority 1001, cookie 0x6468c276
load:0x1->NXM_NX_XXREG0[97]
resubmit(,45)
45. metadata=0x1, priority 0, cookie 0x2b14b7f8
resubmit(,46)
46. ip,reg0=0x2/0x2,metadata=0x1, priority 100, cookie 0xac371eb4
ct(commit,zone=NXM_NX_REG13[0..15],exec(load:0->NXM_NX_CT_LABEL[0]))
load:0->NXM_NX_CT_LABEL[0]
resubmit(,47)
47. metadata=0x1, priority 0, cookie 0x390a9bed
resubmit(,48)
48. reg15=0x3,metadata=0x1, priority 50, cookie 0x8ebb8a35
resubmit(,64)
64. priority 0
resubmit(,65)
65. reg15=0x3,metadata=0x1, priority 100
output:2

bridge("kelda-int")
---
 0. in_port=4, priority 32768
output:3

Final flow: 
recirc_id=0xe,eth,icmp,reg0=0x3,reg11=0x3,reg12=0x2,reg13=0x6,reg14=0x1,reg15=0x3,metadata=0x1,in_port=1,vlan_tci=0x,dl_src=02:00:0a:c5:34:1e,dl_dst=02:00:0a:cb:04:42,nw_src=10.197.52.30,nw_dst=10.203.4.66,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0
Megaflow: 
recirc_id=0xe,ct_state=+new-est-rel-rpl-inv+trk,ct_label=0/0x1,eth,icmp,in_port=1,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00,nw_src=10.197.52.30,nw_dst=10.203.4.66,nw_frag=no
Datapath actions: ct(commit,zone=6,label=0/0x1),4

—Kevin
> On Dec 7, 2017, at 1:03 PM, Ben Pfaff <b...@ovn.org> wrote:
> 
> On Thu, Dec 07, 2017 at 09:26:14AM -0800, Kevin Lin wrote:
>> Hi,
>> 
>> I work on Kelda (kelda.io) with Ethan Jackson. We run a 
>> containerized, distributed version of OVN. The master branch of 
>> openvswitch/ovs (commit 07754b23ee5027508d64804d445e617b017cc2d1) fails with 
>> the following assertion in ovs-vswitchd:
>> 
>> ovs-vswitchd(handler2): ofproto/ofproto-dpif-xlate.c:3704: assertion 
>> !truncate failed in compose_output_action__()
>> 
>> whenever we try to use the OVN network. A little background on our setup:
>> We’re a container orchestrator that uses OVN for the container network.
>> One machine in our cluster runs ovn-northd and ovsdb-server. The network is 
>> mostly configured from here (creating the logical ports, creating ACLs etc).
>> Another machine runs ovn-controller, ovs-vswitchd, and ovsdb-server. We 
>> install some container-specific OpenFlow rules by connecting directly to 
>> ovs-vswitchd, and of course ovs-vswitchd also receives rules from OVN.
>> ovs-vswitchd does not crash immediately after the rules are installed. But 
>> it crashes as soon as the network is used (e.g. a ping from one container to 
>> another).
>> 
>> The commit before the commit that introduced the assertion works for us 
>> (https://github.com/openvswitch/ovs/commit/48f704f4768d13f85252bac4f93c8d45d8ab3eea).
>> 
>> I’ve attached the ovs-vswitchd logs. I’m not sure how helpful the output of 
>> ovs-bugtool will be given our containerized setup, but I’ve also attached 
>> the output of running that from within the ovs-vswitchd container from 
>> before and after the crash. Note, because the ovs-vswitchd container 
>> crashed, the “after” tarball was generated after restarting the container, 
>> so I’m not sure if any of the commands it ran actually succeeded.
>> 
>> The crash is trivial for me to reproduce, so please let me know if there’s 
>> anymore information I can give you.
> 
> Thank you for the report.
> 
> I don't see a good reason that this should be a condition that kills
> ovs-vswitchd.  I think that it will be both easier to debug and less
> inherently harmful if we change it to an error message.  I sent out a
> patch that does that:
>https://patchwork.ozlabs.org/patch/845845/
> 
> It would be great if you could apply the patch and then try to track
> down the activity that triggers the error.  "ofproto/trace" is the best
> way to do that, if you can find the right packet or flow, because it
> will give us all the details on how the problem gets triggered.
> 
> Thanks,
> 
> Ben.
