On 08/07/2024 5:17, LIU Yulong wrote:
> Hi,
> 
> 
> Maybe this fix should be the root cause of:
> ovs-vswitchd core at revalidator_sweep__
> https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
> 
> 
> Did you see such core at revalidator_sweep__ ?
> 
> 
> Regards,
> 
> 
> LIU Yulong
> 
> 

Hi,

It doesn't look like the same issue. from the dump there is  looks
like the state of the ukey did get update to be UKEY_DELETE.



>  
>  
> ------------------ Original ------------------
> From: &nbsp;"Eelco&nbsp;Chaudron"<echau...@redhat.com&gt;;
> Date: &nbsp;Wed, Jul 3, 2024 06:46 PM
> To: &nbsp;"Roi Dayan"<r...@nvidia.com&gt;; 
> Cc: &nbsp;"dev"<d...@openvswitch.org&gt;; "Maor 
> Dickman"<ma...@nvidia.com&gt;; 
> Subject: &nbsp;Re: [ovs-dev] [PATCH 1/1] ofproto-dpif-upcall: Avoid stale 
> ukeys leaks.
> 
> &nbsp;
> 
> 
> 
> On 3 Jul 2024, at 12:22, Roi Dayan wrote:
> 
> &gt; On 03/07/2024 13:01, Eelco Chaudron wrote:
> &gt;&gt;
> &gt;&gt;
> &gt;&gt; On 3 Jul 2024, at 9:31, Roi Dayan wrote:
> &gt;&gt;
> &gt;&gt;&gt; On 18/06/2024 11:53, Eelco Chaudron wrote:
> &gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt; On 18 Jun 2024, at 8:05, Roi Dayan wrote:
> &gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt; On 03/06/2024 16:29, Eelco Chaudron wrote:
> &gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt; On 3 Jun 2024, at 10:07, Roi Dayan wrote:
> &gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; On 03/06/2024 10:18, Roi Dayan wrote:
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; On 30/05/2024 18:48, Eelco Chaudron wrote:
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; On 23 May 2024, at 12:46, Roi Dayan via 
> dev wrote:
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; It is observed in some environments 
> that there are much more ukeys than
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; actual DP flows. For example:
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; $ ovs-appctl upcall/show
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; system@ovs-system:
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; flows : (current 7) (avg 6) (max 
> 117) (limit 2125)
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; offloaded flows : 525
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; dump duration : 1063ms
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; ufid enabled : true
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; 23: (keys 3612)
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; 24: (keys 3625)
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; 25: (keys 3485)
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; The revalidator threads are busy 
> revalidating the stale ukeys leading to
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; high CPU and long dump duration.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; There are some possible situations 
> that may result in stale ukeys that
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; have no corresponding DP flows.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; In revalidator, push_dp_ops() 
> doesn't check error if the op type is not
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; DEL. It is possible that a 
> PUT(MODIFY) fails, especially for tc offload
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; case, where the old flow is deleted 
> first and then the new one is
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; created. If the creation fails, the 
> ukey will be stale (no corresponding
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; DP flow). This patch adds a warning 
> in such case.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Not sure I understand, this behavior 
> should be captured by the UKEY_INCONSISTENT state.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Hi Eelco,
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; thanks for reviewing.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; We started the patch on older branch as we 
> didn't rebase for some time
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; and got a little later on sending it.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; I see the addition now for UKEY_INCOSISTENT 
> in the following patch:
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; cf11766cbcf1 ofproto-dpif-upcall: Fix 
> push_dp_ops to handle all errors.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Looks like it handles the same situation we 
> tried to handle in this patch.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Another possible scenario is in 
> handle_upcalls() if a PUT operation did
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; not succeed and op-&gt;error 
> attribute was not set correctly it can lead to
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; stale ukey in operational state.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Guess we need to figure out these cases, 
> as these are the root cause of your problem.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; right. maybe. but this could help keep the 
> system alive/clean while logging the
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; bad situation instead of having increase in 
> those stale/inconsistent ukeys.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; I understand if it's not nice to handle it 
> like this.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; Hi Eelco,
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; I remember now one of the reproduction scenarios 
> we did is do some traffic
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; to get rules added using TC and then delete 
> those from tc command line
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; and it got to stale ukeys.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; The revalidator dump thread not seeing the rules 
> so not updating anything
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; and sweep over the ukeys skipping them as well.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; This is why we checked against the timing stats 
> of the ukey.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; I remember I tested on the upstream code and not 
> our development branch
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; and it reproduces. I didn't notice if the commit 
> adding UKEY_INCONSISTENT
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; existed but it handles errors from adding flows 
> so I dont think it matters.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; I'll try to check and verify again but I think 
> it's still here.
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; So while cases getting dop.error handled with 
> UKEY_INCONSISTENT,
> &gt;&gt;&gt;&gt;&gt;&gt;&gt; this case I think is not.
> &gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt; I think you are right this case is not covered by 
> the UKEY_INCONSISTENT test below. See my suggestion on fixing this in 
> revalidate_ukey().
> &gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt; Maybe you could also add a test case for this in the 
> offload test suite.
> &gt;&gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt;&gt; //Eelco
> &gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt; Hi Eelco,
> &gt;&gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt;&gt; Thanks for the review. I didn't have time to check this 
> but wanted to
> &gt;&gt;&gt;&gt;&gt; reply that it's in my queue to verify your suggestion 
> and add a test.
> &gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt; Thanks for letting me know, and I understand as we are all 
> busy :)
> &gt;&gt;&gt;&gt;
> &gt;&gt;&gt;&gt; //Eelco
> &gt;&gt;&gt;&gt;
> &gt;&gt;&gt;
> &gt;&gt;&gt;
> &gt;&gt;&gt; Hi Eelco,
> &gt;&gt;&gt;
> &gt;&gt;&gt; I tested your suggestion to move to against removing tc rules
> &gt;&gt;&gt; from tc command I see it's helpful in cleaning those stale ukeys.
> &gt;&gt;&gt; I have trouble doing a clean test for this for the testsuite.
> &gt;&gt;
> &gt;&gt; With the tc rule deletion, it becomes an offload-specific test case, 
> but as this is the only way we’ve seen the issue, it might be fine to add it 
> just there.
> &gt;&gt;
> &gt;
> &gt; yes. after the first half of the fix you pointed out was implemented 
> already.
> &gt; now the way i reproduce it is deleting tc rules.
> &gt; could potentially happen from something else or because of another bug 
> somewhere.
> &gt;
> &gt;&gt;&gt; At first I tested with modification in revalidator_sweep__() to
> &gt;&gt;&gt; always set seq_mismatch to reach revalidate_ukey().
> &gt;&gt;&gt; Later I moved to create a redundant veth and doing add/del loop
> &gt;&gt;&gt; of it to the bridge to cause a seq mismatch. - sure this is
> &gt;&gt;&gt; the clean way but without the change I get stale ukeys and with
> &gt;&gt;&gt; it we get to the cleanup change and cleaning the ukeys.
> &gt;&gt;
> &gt;&gt; Can you maybe share the test case so I know what you are doing? Is 
> this working for the general system tests?
> &gt;
> &gt; I did this:
> &gt;
> &gt; - create bridge with 2 veth ports. configure ips.
> &gt; - ping between the ports to have tc rules.
> &gt; - repeated few times: clear the tc rules like this: tc filter del dev 
> veth1 ingress and also on the 2nd port.
> &gt; - set max-idle to 1 and remove it to cause a flush of the rules.
> &gt; - create another set of veth ports. add/del veth4 from the bridge. to 
> cause a sweep.
> &gt; - before the fix: ovs-appctl upcall/show will show ukeys.
> &gt; - after the fix upcall/show will show 0 ukeys.
> 
> Thanks, will find some time to test this, and reply to the V2.
> 
> //Eelco
> 
> &gt;&gt;&gt; I'll send v2 for now as a clean patch with only the relevant 
> change
> &gt;&gt;&gt; and cleaner commit msg and lets take it from there if my 
> suggestion
> &gt;&gt;&gt; here of the test is ok or something else or the test can be 
> postponed
> &gt;&gt;&gt; to a later time to think of a cleaner/right way to do it.
> &gt;&gt;
> &gt;&gt; You missed the change in the python script as explained in the 
> definition header. I will reply to the patch.
> &gt;
> &gt; oh right. i'll check it. thanks.
> &gt;
> &gt;&gt;
> &gt;&gt; //Eelco
> &gt;&gt;
> 
> _______________________________________________
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to