On 10/25/22 20:13, Han Zhou wrote: > On Tue, Oct 25, 2022 at 2:53 AM Dumitru Ceara <dce...@redhat.com> wrote: >> >> On 10/25/22 07:38, Ales Musil wrote: >>> On Tue, Oct 25, 2022 at 5:51 AM Han Zhou <zhou...@gmail.com> wrote: >>> >>>> >>>> >>>> On Sun, Oct 23, 2022 at 10:21 PM Ales Musil <amu...@redhat.com> wrote: >>>>> >>>>> >>>>> >>>>> On Sun, Oct 23, 2022 at 10:16 PM Han Zhou <zhou...@gmail.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Oct 21, 2022 at 1:37 AM Ales Musil <amu...@redhat.com> wrote: >>>>>>> >>>>>>> Run ovs_list_init on list_node from desired flow. >>>>>>> Not doing that can potentially cause crash on >>>>>>> ovs_assert(ovs_list_is_empty(&f->list_node)) >>>>>>> this might happen when we have conditional monitoring >>>>>>> enabled and have some flow like below being processed >>>>>>> by ofctrl in a single run. >>>>>>> >>>>>>> flowA is removed, flowA is added, flowA is removed. >>>>>>> >>>>>> >>>>> >>>>> Hi Han, >>>>> >>>>>> >>>>>> Thanks Ales for the patch. First of all I agree this patch is not >>>> harmful and is the right thing to do, so I am going to apply it. >>>>>> At the same time, I still don't understand the root cause analysis >>>> above and like you said, I am not sure if it would fix the bug. Two >>>> questions here: >>>>> >>>>> >>>>> The description is based on some assumptions and might not be the >>>> correct way to get into that state. >>>>> >>>>>> >>>>>> 1. How would it relate to conditional monitoring? >>>>> >>>>> >>>>> It is more likely to receive multiple updates at once with conditional >>>> monitoring or am I wrong? >>>>> >>>> I don't see a reason for this. I think it only depends on what changes >>>> happened in SB and notified and pending to ovn-controller during an >>>> iteration of ovn-controller control loop, which doesn't seem to have >>>> relationship with conditional monitoring. Do you have more details on > the >>>> reasoning? >>>> >>> >>> It was based on the configuration, however it was confirmed that it > still >>> happens even with conditional monitoring disabled. >>> >>> >>>> >>>>>> >>>>>> 2. For the example "flowA is removed, flowA is added, flowA is >>>> removed", I assume flowA is "a desired OVS flow", not a logical flow, >>>> correct? So when flowA is removed, it should be removed from the > desired >>>> flow table and added to the "tracked_flows". Then when flowA is added, > it >>>> would in fact be a newly allocated desired_flow, and no chance to find > the >>>> earlier tracked deleted flowA from the desired flow table. The 2nd > flowA >>>> deletion would find the 2nd flowA instance and destroy it, but the > first >>>> deleted desired_flow instance of flowA would not be referenced any > more and >>>> have no chance to use the list_node before it is finally destroyed, > right? >>>>> >>>>> >>>>> That makes sense, as I said, I was not able to reproduce it, the issue >>>> seemed to make sense under those conditions. Anyway I can update the > commit >>>> message leaving the flows part out. >>>> >>>> Understand, please update the commit message. >>>> I tend to believe that the core dump was caused by some other bug that > may >>>> result in corruption of the list/memory (in such case this change won't >>>> help), and also not sure if it is a problem only with this old version >>>> v20.12 or still a problem of the latest code. >>>> >>> >>> Ack, I will update the commit message in v2. We have also provided a > build >>> so we will get feedback if it still happens with this patch or not. I > will >>> postpone v2 until then. >>> >>> >>>> >>>>> >>>>> Also one thing to keep in mind, it was originally reported against >>>> v20.12. We should also backport it if possible. >>>> >>>> I am not sure if it is a candidate for backporting. Usually we backport >>>> critical bug fixes only, especially for such an old version. But in > this >>>> case the change itself is so simple and we are almost sure it won't > cause >>>> any harm, although no evidence of any benefit, either. >>>> >>>> What's your opinion of backporting this? Dumitru, Mark, Numan? >>>> >> >> Hi Ales, Han, >> >> Regardless if we think this is the root cause of the reported crash or >> not, I still think there's a possibility of failing the assertion >> because we didn't reinitialize the list_node field while computing >> incremental OF updates. > > Thanks Dumitru for the feedback. Maybe I am missing something, but from the > code I didn't find any possibility that the list_node would be accessed > again after the node is removed from the to_be_removed list. So I didn't > see the possibility of failing the assertion, unless there is a bug that > would cause the list_node to be accessed, probably after it is freed. >
I spent some more time looking into this and I realized I had missed the fact that when remove_flows_from_sb_to_flow() empties to_be_removed also removes all the items from the flow_table->match_flow_table. So I think you're right, at this point there should be no way to access list_node anymore. We probably need to get some more data. Ales, would it be an option to run an address-sanitizer enabled build for a test? >> >> As you said, it won't cause any harm but it hardens the ovn-controller >> code so I would backport it. >> > Ack. > >> Regards, >> Dumitru >> >>>> Thanks, >>>> Han >>>>> >>>>> Thanks, >>>>> Ales >>>>> >>>>>> >>>>>> >>>>>> Thanks, >>>>>> Han >>>>>> >>>>>>> This is most likely related to BZ[0]. >>>>>>> >>>>>>> [0] https://bugzilla.redhat.com/2125723 >>>>>>> Suggested-By: Dumitru Ceara <dce...@redhat.com> >>>>>>> Signed-off-by: Ales Musil <amu...@redhat.com> >>>>>>> --- >>>>>>> I'm not adding the BZ as Reported-At because this might >>>>>>> not fix the issue they are facing. It seems to be very hard >>>>>>> to reproduce, however it still is a good idea to apply this >>>>>>> fix as in the context it's the right to do and shouldn't cause >>>>>>> any harm. >>>>>>> --- >>>>>>> controller/ofctrl.c | 1 + >>>>>>> 1 file changed, 1 insertion(+) >>>>>>> >>>>>>> diff --git a/controller/ofctrl.c b/controller/ofctrl.c >>>>>>> index c77991258..fb0e0869c 100644 >>>>>>> --- a/controller/ofctrl.c >>>>>>> +++ b/controller/ofctrl.c >>>>>>> @@ -1535,6 +1535,7 @@ remove_flows_from_sb_to_flow(struct >>>> ovn_desired_flow_table *flow_table, >>>>>>> free(sfr); >>>>>>> } >>>>>>> ovs_list_remove(&f->list_node); >>>>>>> + ovs_list_init(&f->list_node); >>>>>>> if (log_msg) { >>>>>>> ovn_flow_log(&f->flow, log_msg); >>>>>>> } >>>>>>> -- >>>>>>> 2.37.3 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> dev mailing list >>>>>>> d...@openvswitch.org >>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Ales Musil >>>>> >>>>> Senior Software Engineer - OVN Core >>>>> >>>>> Red Hat EMEA >>>>> >>>>> amu...@redhat.com IM: amusil >>>> >>> >>> >>> Thanks, >>> Ales >>> >> > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev