On 08/07/2024 5:17, LIU Yulong wrote: > Hi, > > > Maybe this fix should be the root cause of: > ovs-vswitchd core at revalidator_sweep__ > https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html > > > Did you see such core at revalidator_sweep__ ? > > > Regards, > > > LIU Yulong > >
Hi, It doesn't look like the same issue. from the dump there is looks like the state of the ukey did get update to be UKEY_DELETE. > > > ------------------ Original ------------------ > From: "Eelco Chaudron"<echau...@redhat.com>; > Date: Wed, Jul 3, 2024 06:46 PM > To: "Roi Dayan"<r...@nvidia.com>; > Cc: "dev"<d...@openvswitch.org>; "Maor > Dickman"<ma...@nvidia.com>; > Subject: Re: [ovs-dev] [PATCH 1/1] ofproto-dpif-upcall: Avoid stale > ukeys leaks. > > > > > > On 3 Jul 2024, at 12:22, Roi Dayan wrote: > > > On 03/07/2024 13:01, Eelco Chaudron wrote: > >> > >> > >> On 3 Jul 2024, at 9:31, Roi Dayan wrote: > >> > >>> On 18/06/2024 11:53, Eelco Chaudron wrote: > >>>> > >>>> > >>>> On 18 Jun 2024, at 8:05, Roi Dayan wrote: > >>>> > >>>>> On 03/06/2024 16:29, Eelco Chaudron wrote: > >>>>>> > >>>>>> > >>>>>> On 3 Jun 2024, at 10:07, Roi Dayan wrote: > >>>>>> > >>>>>>> On 03/06/2024 10:18, Roi Dayan wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> On 30/05/2024 18:48, Eelco Chaudron wrote: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On 23 May 2024, at 12:46, Roi Dayan via > dev wrote: > >>>>>>>>> > >>>>>>>>>> It is observed in some environments > that there are much more ukeys than > >>>>>>>>>> actual DP flows. For example: > >>>>>>>>>> > >>>>>>>>>> $ ovs-appctl upcall/show > >>>>>>>>>> system@ovs-system: > >>>>>>>>>> flows : (current 7) (avg 6) (max > 117) (limit 2125) > >>>>>>>>>> offloaded flows : 525 > >>>>>>>>>> dump duration : 1063ms > >>>>>>>>>> ufid enabled : true > >>>>>>>>>> > >>>>>>>>>> 23: (keys 3612) > >>>>>>>>>> 24: (keys 3625) > >>>>>>>>>> 25: (keys 3485) > >>>>>>>>>> > >>>>>>>>>> The revalidator threads are busy > revalidating the stale ukeys leading to > >>>>>>>>>> high CPU and long dump duration. > >>>>>>>>>> > >>>>>>>>>> There are some possible situations > that may result in stale ukeys that > >>>>>>>>>> have no corresponding DP flows. > >>>>>>>>>> > >>>>>>>>>> In revalidator, push_dp_ops() > doesn't check error if the op type is not > >>>>>>>>>> DEL. It is possible that a > PUT(MODIFY) fails, especially for tc offload > >>>>>>>>>> case, where the old flow is deleted > first and then the new one is > >>>>>>>>>> created. If the creation fails, the > ukey will be stale (no corresponding > >>>>>>>>>> DP flow). This patch adds a warning > in such case. > >>>>>>>>> > >>>>>>>>> Not sure I understand, this behavior > should be captured by the UKEY_INCONSISTENT state. > >>>>>>>> > >>>>>>>> Hi Eelco, > >>>>>>>> > >>>>>>>> thanks for reviewing. > >>>>>>>> We started the patch on older branch as we > didn't rebase for some time > >>>>>>>> and got a little later on sending it. > >>>>>>>> I see the addition now for UKEY_INCOSISTENT > in the following patch: > >>>>>>>> > >>>>>>>> cf11766cbcf1 ofproto-dpif-upcall: Fix > push_dp_ops to handle all errors. > >>>>>>>> > >>>>>>>> Looks like it handles the same situation we > tried to handle in this patch. > >>>>>>>> > >>>>>>>>> > >>>>>>>>>> Another possible scenario is in > handle_upcalls() if a PUT operation did > >>>>>>>>>> not succeed and op->error > attribute was not set correctly it can lead to > >>>>>>>>>> stale ukey in operational state. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Guess we need to figure out these cases, > as these are the root cause of your problem. > >>>>>>>> > >>>>>>>> right. maybe. but this could help keep the > system alive/clean while logging the > >>>>>>>> bad situation instead of having increase in > those stale/inconsistent ukeys. > >>>>>>>> I understand if it's not nice to handle it > like this. > >>>>>>>> > >>>>>>> > >>>>>>> Hi Eelco, > >>>>>>> > >>>>>>> I remember now one of the reproduction scenarios > we did is do some traffic > >>>>>>> to get rules added using TC and then delete > those from tc command line > >>>>>>> and it got to stale ukeys. > >>>>>>> The revalidator dump thread not seeing the rules > so not updating anything > >>>>>>> and sweep over the ukeys skipping them as well. > >>>>>>> This is why we checked against the timing stats > of the ukey. > >>>>>>> > >>>>>>> I remember I tested on the upstream code and not > our development branch > >>>>>>> and it reproduces. I didn't notice if the commit > adding UKEY_INCONSISTENT > >>>>>>> existed but it handles errors from adding flows > so I dont think it matters. > >>>>>>> > >>>>>>> I'll try to check and verify again but I think > it's still here. > >>>>>>> So while cases getting dop.error handled with > UKEY_INCONSISTENT, > >>>>>>> this case I think is not. > >>>>>> > >>>>>> I think you are right this case is not covered by > the UKEY_INCONSISTENT test below. See my suggestion on fixing this in > revalidate_ukey(). > >>>>>> > >>>>>> Maybe you could also add a test case for this in the > offload test suite. > >>>>>> > >>>>>> //Eelco > >>>>> > >>>>> Hi Eelco, > >>>>> > >>>>> Thanks for the review. I didn't have time to check this > but wanted to > >>>>> reply that it's in my queue to verify your suggestion > and add a test. > >>>> > >>>> Thanks for letting me know, and I understand as we are all > busy :) > >>>> > >>>> //Eelco > >>>> > >>> > >>> > >>> Hi Eelco, > >>> > >>> I tested your suggestion to move to against removing tc rules > >>> from tc command I see it's helpful in cleaning those stale ukeys. > >>> I have trouble doing a clean test for this for the testsuite. > >> > >> With the tc rule deletion, it becomes an offload-specific test case, > but as this is the only way we’ve seen the issue, it might be fine to add it > just there. > >> > > > > yes. after the first half of the fix you pointed out was implemented > already. > > now the way i reproduce it is deleting tc rules. > > could potentially happen from something else or because of another bug > somewhere. > > > >>> At first I tested with modification in revalidator_sweep__() to > >>> always set seq_mismatch to reach revalidate_ukey(). > >>> Later I moved to create a redundant veth and doing add/del loop > >>> of it to the bridge to cause a seq mismatch. - sure this is > >>> the clean way but without the change I get stale ukeys and with > >>> it we get to the cleanup change and cleaning the ukeys. > >> > >> Can you maybe share the test case so I know what you are doing? Is > this working for the general system tests? > > > > I did this: > > > > - create bridge with 2 veth ports. configure ips. > > - ping between the ports to have tc rules. > > - repeated few times: clear the tc rules like this: tc filter del dev > veth1 ingress and also on the 2nd port. > > - set max-idle to 1 and remove it to cause a flush of the rules. > > - create another set of veth ports. add/del veth4 from the bridge. to > cause a sweep. > > - before the fix: ovs-appctl upcall/show will show ukeys. > > - after the fix upcall/show will show 0 ukeys. > > Thanks, will find some time to test this, and reply to the V2. > > //Eelco > > >>> I'll send v2 for now as a clean patch with only the relevant > change > >>> and cleaner commit msg and lets take it from there if my > suggestion > >>> here of the test is ok or something else or the test can be > postponed > >>> to a later time to think of a cleaner/right way to do it. > >> > >> You missed the change in the python script as explained in the > definition header. I will reply to the patch. > > > > oh right. i'll check it. thanks. > > > >> > >> //Eelco > >> > > _______________________________________________ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev