On Mon, Sep 14, 2020 at 9:54 AM Ilya Maximets <i.maxim...@ovn.org> wrote: > > On 9/14/20 6:45 PM, Dumitru Ceara wrote: > > On 9/14/20 6:36 PM, Han Zhou wrote: > >> > >> > >> On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dce...@redhat.com > >> <mailto:dce...@redhat.com>> wrote: > >>> > >>> On 9/11/20 9:16 PM, Dumitru Ceara wrote: > >>>> On 9/11/20 8:39 PM, Han Zhou wrote: > >>>>> > >>>>> > >>>>> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dce...@redhat.com > >> <mailto:dce...@redhat.com> > >>>>> <mailto:dce...@redhat.com <mailto:dce...@redhat.com>>> wrote: > >>>>>> > >>>>>> On 9/9/20 11:13 PM, Han Zhou wrote: > >>>>>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson > >> <mmich...@redhat.com <mailto:mmich...@redhat.com> > >>>>> <mailto:mmich...@redhat.com <mailto:mmich...@redhat.com>>> wrote: > >>>>>>>> > >>>>>>>> Thanks for addressing my concerns on patch 9. > >>>>>>>> > >>>>>>>> Acked-by: Mark Michelson <mmich...@redhat.com > >> <mailto:mmich...@redhat.com> > >>>>> <mailto:mmich...@redhat.com <mailto:mmich...@redhat.com>>> > >>>>>>>> > >>>>>>>> Warning: I'm also ACKing Numan's patch series that also adds some > >>>>>>>> lflow-level caching (but he is caching expressions). So whoever > >> loses > >>>>>>>> this merge race may have some work to do in order to apply their > >> patch > >>>>>>>> cleanly. > >>>>>>>> > >>>>>>> Thanks Mark for the review, and the warning :). > >>>>>>> I applied the series to master, and also the first 4 patches to > >>>>>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the conjunction > >>>>> flow > >>>>>>> bug in those branches. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Han > >>>>>>> > >>>>>> > >>>>>> Hi Han, > >>>>>> > >>>>>> This seems to cause an assertion failure in ovn-controller > >> compiled with > >>>>>> today's OVS/OVN master branches. I'm not completely sure about the > >> steps > >>>>>> to reproduce because I was testing some other ovn-k8s related > >> things but > >>>>>> I did get a core dump and the NB/DB/conf DBs. > >>>>>> > >>>>>> I opened a RedHat BZ [0] (more for my tracking than anything else) > >> and I > >>>>>> shared there the OVS/OVN revisions and how to build the exact same > >> image > >>>>>> that I was running when I hit the crash. > >>>>>> > >>>>>> I'll try to investigate what's happening but you're probably more > >>>>>> familiar with the code so any hints would be great. > >>>>>> > >>>>>> Thanks, > >>>>>> Dumitru > >>>>>> > >>>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139 > >>>>>> > >>>>> > >>>>> Thanks Dumitru for reporting the issue. I reviewed the code and the > >> test > >>>>> cases again, and didn't figure out any problem yet. The failed > >> assertion > >>>>> "ovs_list_is_empty(&f->list_node) " is to assert that the desired_flow > >>>>> is not accessed more than once in a single flood_remove call. This > >>>>> should be true because the loop at > >>>>> > >> ( https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134) > >>>>> ensures that all flows in to_be_removed (which is the only list that > >>>>> uses the "list_node" member) is detached from the cross reference > >> lists, > >>>>> so that in the following recursive calls these flows will never be > >>>>> accessed again. Now that this assert fails, it could be some orphaned > >>>>> pointer somewhere, leading to the "f" pointing to a flow that's already > >>>>> destroyed. > >>>>> > >>>>> One thing not clear to me is the version you used for the testing. > >> In BZ > >>>>> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa. > >>>>> However, in this revision, the line number of the assertion doesn't > >>>>> match the one you pasted in BZ. What you pasted was 1135: > >>>> > >>>> The revision should be good, I think the problem is with the way I > >>>> rebuilt the packages in the new container after I retrieved the core > >> dump. > >>>> > >>>> The assert message clearly points to line 1108. This is from the old > >>>> logs of the run where I saw the crash: > >>>> > >>>> 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108: > >>>> assertion ovs_list_is_empty(&f->list_node) failed in > >>>> flood_remove_flows_for_sb_uuid() > >>>> > >>>>> > >>>>> #6 0x00005607d73502fa in flood_remove_flows_for_sb_uuid > >>>>> (flow_table=flow_table@entry=0x5607d8c4b380, > >>>>> sb_uuid=sb_uuid@entry=0x5607d8f91430, > >>>>> flood_remove_nodes=flood_remove_nodes@entry=0x7ffeccadb9c0) at > >>>>> controller/ofctrl.c:1135 > >>>>> > >>>>> What's in the code for revision (520189bf) is 1108: > >>>>> > >> https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108 > >>>>> > >>>>> Did you change anything while doing the testing? > >>>>> > >>>> > >>>> I'll try to replicate the issue again next week, maybe we get more info. > >>>> > >>>> Thanks, > >>>> Dumitru > >>>> > >>> > >>> Hi Han, > >>> > >>> I managed to minimize the way to replicate the crash (with OVN master). > >>> I do: > >>> > >>> make sandbox > >>> ovn-nbctl ls-add ls > >>> ovn-nbctl lsp-add ls vm1 > >>> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e > >> fd00:10:244:1::5" > >> > >> Thanks Dumitru! Here it seems a copy/paste problem. Could you provide > >> the ACLs added? > > Just skip this line, it's not needed. > 100% reproducible for me without it: > > 2020-09-14T16:52:04.039Z|00020|util|EMER|controller/ofctrl.c:1108: > assertion ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid() > > > > > > > There are no ACLs involved. This line was a mistake, please ignore it. > > The conjunctions come from the IPv6 port security logical flows. > > > > https://paste.centos.org/view/bb93f7d3 > >
Thanks Dumitru and Ilya. With these clear steps to reproduce, I was able to find the root cause. It is related to different problems in the history: 1. commit 07e3017 added a check to skip a lflow if lport is not local. The check is done on every match condition in a loop, but instead I think it should be done only once per-flow. 2. commit ade4e77 added the resource reference for lport checking at the same location in the loop, which added redundant references for the same lport - lflow mapping. 3. commit 580aea7 added flood removing based on the lflow-resource references, which assumes that there are no redundant items in the lists. Since the assumption isn't true because of the above 2 problems, it results in corrupted pointers. There are different ways to fix the issue. I am trying to figure out the best way and also working on debugging some regression test failures. I will post the fix later today. Thanks, Han > > Regards, > > Dumitru > > > >> > >>> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" > >>> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5" > >>> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal -- set > >>> interface vm1 external-ids:iface-id=vm1 > >>> sleep 1 > >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo > >>> sleep 1 > >>> ovs-vsctl set interface vm1 external-ids:iface-id=vm1 > >>> sleep 1 > >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo > >>> > >>> This crashes ovn-controller and if I look at the openflows before the > >>> last port unbind, I notice some weird flows with action: > >>> > >>> > >> actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2) > >>> > >>> This looks broken. I tried to figure out what's happening but > >>> unfortunately I didn't manage to. > >>> > >>> Hope this helps pinpoint the root cause. > >>> > >>> Regards, > >>> Dumitru > >>> > > > > _______________________________________________ > > dev mailing list > > d...@openvswitch.org > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev