+1... Hey John, thanks a lot for the detailed analysis...

Dave

From: John Lo (loj)
Sent: Wednesday, January 11, 2017 5:40 PM
To: Dave Barach (dbarach) <dbar...@cisco.com>; Juraj Linkes -X (jlinkes - 
PANTHEON TECHNOLOGIES at Cisco) <jlin...@cisco.com>; vpp-dev@lists.fd.io
Subject: RE: VPP-556 - vpp crashing in an openstack odl stack

Hi Juraj,

I looked at the custom-dump of the API trace and noticed this "interesting" 
sequence:
SCRIPT: vxlan_add_del_tunnel src 192.168.11.22 dst 192.168.11.20 decap-next -1 
vni 1
SCRIPT: sw_interface_set_flags sw_if_index 4 admin-up link-up
SCRIPT: sw_interface_set_l2_bridge sw_if_index 4 bd_id 1 shg 1  enable
SCRIPT: sw_interface_set_l2_bridge sw_if_index 2 disable
SCRIPT: bridge_domain_add_del bd_id 1 del

Any idea why BD 1 is deleted while the VXLAN tunnel with sw_if_index 4 is still 
in the BD? Maybe this is what is causing the crash. From your vppctl output 
capture for "compute_that_crashed.txt", I do see BD 1 present with vxlan_tunnel0 
on it:
[root@overcloud-novacompute-1 ~]# vppctl show bridge-domain
  ID   Index   Learning   U-Forwrd   UU-Flood   Flooding   ARP-Term     BVI-Intf
  0      0        off        off        off        off        off        local0
  1      1        on         on         on         on         off          N/A
[root@overcloud-novacompute-1 ~]# vppctl show bridge-domain 1 detail
  ID   Index   Learning   U-Forwrd   UU-Flood   Flooding   ARP-Term     BVI-Intf
  1      1        on         on         on         on         off          N/A

           Interface           Index  SHG  BVI  TxFlood        VLAN-Tag-Rewrite
         vxlan_tunnel0           3     1    -      *                 none

I installed a vpp 1701 image on my server and performed an api trace replay 
of your api_post_mortem. After the replay, BD 1 is no longer present, yet 
vxlan_tunnel1 is still configured as being in BD 1:
DBGvpp# show bridge
  ID   Index   Learning   U-Forwrd   UU-Flood   Flooding   ARP-Term     BVI-Intf
  0      0        off        off        off        off        off        local0
DBGvpp# sho vxlan tunnel
[1] src 192.168.11.22 dst 192.168.11.20 vni 1 sw_if_index 4 encap_fib_index 0 
fib_entry_index 12 decap_next l2
DBGvpp# sho int addr
GigabitEthernet2/3/0 (dn):
VirtualEthernet0/0/0 (up):
local0 (dn):
vxlan_tunnel0 (dn):
vxlan_tunnel1 (up):
  l2 bridge bd_id 1 shg 1
DBGvpp# show int
              Name               Idx       State          Counter          Count
GigabitEthernet2/3/0              1        down
VirtualEthernet0/0/0              2         up
local0                            0        down
vxlan_tunnel0                     3        down
vxlan_tunnel1                     4         up
DBGvpp#

With the system in this state, I can easily imagine a packet received by 
vxlan_tunnel1 and forwarded in a non-existent BD causing VPP to crash. I will 
look into the VPP code from this angle. In general, however, there is really no 
need to create and delete BDs on VPP. Adding an interface/tunnel to a BD will 
cause it to be created. Deleting a BD without removing all the ports in it can 
cause problems, which may well be the cause here. If a BD is no longer to be 
used, all the ports on it should be removed. If a BD is to be reused, just add 
ports to it.

As mentioned by Dave, please test using a known good image like 1701, 
preferably built with debug enabled (with TAG=vpp_debug), so it is easier to 
find any issues.

Regards,
John

From: Dave Barach (dbarach)
Sent: Wednesday, January 11, 2017 9:01 AM
To: Juraj Linkes -X (jlinkes - PANTHEON TECHNOLOGIES at Cisco) 
<jlin...@cisco.com>; vpp-dev@lists.fd.io; John Lo (loj) <l...@cisco.com>
Subject: RE: VPP-556 - vpp crashing in an openstack odl stack

Dear Juraj,

I took a look. It appears that the last operation in the post-mortem API trace 
was to kill a vxlan tunnel. Is there a reasonable chance that other interfaces 
in the bridge group containing the tunnel were still admin-up? Was the tunnel 
interface removed from the bridge group prior to killing it?

The image involved is not stable/1701/LATEST. It's missing at least 20 fixes 
considered critical enough to justify merging them into the release throttle:

[root@overcloud-novacompute-1 ~]# vppctl show version verbose
Version:                  v17.01-rc0~242-gabd98b2~b1576
Compiled by:              jenkins
Compile host:             centos-7-a8b
Compile date:             Mon Dec 12 18:55:56 UTC 2016

Please re-test with stable/1701/LATEST. Please use a TAG=vpp_debug image. If 
the problem is reproducible, we'll need a core file to make further progress.
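
When comparing images, it can help to split the "Version:" line from the 
output above into its parts. The following Python sketch does that. The field 
names and the reading of the string as git-describe style (tag, commits since 
the tag, abbreviated commit hash, build number) are assumptions, not something 
stated in this thread, and parse_vpp_version is a hypothetical helper.

```python
import re

# Assumed layout: v<release>[-rcN][~<commits>-g<sha>][~b<build>]
VERSION_RE = re.compile(
    r"^v(?P<release>\d+\.\d+)"
    r"(?:-(?P<pre>rc\d+))?"
    r"(?:~(?P<commits>\d+)-g(?P<sha>[0-9a-f]+))?"
    r"(?:~b(?P<build>\d+))?$"
)


def parse_vpp_version(show_version_output):
    """Find the 'Version:' line and split its value into named fields."""
    for line in show_version_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Version":
            match = VERSION_RE.match(value.strip())
            if match is None:
                return None
            info = match.groupdict()
            # Convert the numeric fields, leaving missing ones as None.
            for field in ("commits", "build"):
                if info[field] is not None:
                    info[field] = int(info[field])
            return info
    return None


SAMPLE = """\
Version:                  v17.01-rc0~242-gabd98b2~b1576
Compiled by:              jenkins
Compile host:             centos-7-a8b
Compile date:             Mon Dec 12 18:55:56 UTC 2016
"""

if __name__ == "__main__":
    print(parse_vpp_version(SAMPLE))
```

For the image above this yields release 17.01, pre-release marker rc0, 242 
commits past the tag, commit abd98b2 and build 1576, which can then be compared 
against the image that was meant to be installed.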

Copying John Lo ("Dr. Vxlan") for any further thoughts he might have...

Thanks... Dave

From: vpp-dev-boun...@lists.fd.io [mailto:vpp-dev-boun...@lists.fd.io] 
On Behalf Of Juraj Linkes -X (jlinkes - PANTHEON TECHNOLOGIES at Cisco)
Sent: Wednesday, January 11, 2017 3:47 AM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] VPP-556 - vpp crashing in an openstack odl stack

Hi vpp-dev,

I just wanted to ask whether anyone has taken a look at 
VPP-556<https://jira.fd.io/browse/VPP-556>? There might not be enough logs; I 
collected just a backtrace from gdb. If we need anything more, please give me a 
little bit of guidance on what could help and how to get it.

This is one of the last few issues we're facing with the openstack odl scenario 
where we use vpp just for l2, and it's been there for a while.

Thanks,
Juraj
_______________________________________________
vpp-dev mailing list
vpp-dev@lists.fd.io
https://lists.fd.io/mailman/listinfo/vpp-dev