On 23 Oct 2025, at 4:24, LIU Yulong wrote:

> Maybe we should use the patch I uploaded 19 months ago:
> https://mail.openvswitch.org/pipermail/ovs-dev/2024-March/412491.html
>
>
> It did solve the issue for about 2 years.

First of all, sorry for my previous patch; it doesn’t actually fix anything, as I
misinterpreted some of the code. :(

As mentioned in the comment, that two-year-old patch might not address the root
cause of the problem. I’ve created some debug patches that you might want to
try out to see if they provide any clues about what’s really going on.

Note that this is based on a general debug library I’ve been using for a while. 
It does use a mutex to lock the circular buffer, but I hope that’s not enough 
to make the problem go away entirely.

Anyway, it will log quite a bit of information, and you can use the GDB macros
to dump the buffer once it has crashed. Make sure to build OVS with libunwind;
otherwise, the call trace won’t be useful.
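
For readers unfamiliar with the approach, here is a minimal sketch of a
mutex-protected circular debug buffer. It is not the code from the ec_dbg
branch: the names (struct dbg_ring, dbg_ring_log(), DBG_BUF_SIZE, the_ring) and
the use of a plain pthread mutex are assumptions made purely for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#define DBG_BUF_SIZE 4096   /* Hypothetical size, chosen for illustration. */

struct dbg_ring {
    pthread_mutex_t lock;   /* Serializes writers from different threads. */
    size_t head;            /* Offset where the next byte is written. */
    size_t used;            /* Valid bytes stored, capped at DBG_BUF_SIZE. */
    char data[DBG_BUF_SIZE];
};

static struct dbg_ring the_ring = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Formats one record and appends it, overwriting the oldest bytes once the
 * buffer is full.  Records longer than the local buffer are truncated. */
static void
dbg_ring_log(struct dbg_ring *r, const char *fmt, ...)
{
    char line[256];
    va_list ap;
    size_t len;
    int n;

    assert(r && fmt);

    va_start(ap, fmt);
    n = vsnprintf(line, sizeof line, fmt, ap);
    va_end(ap);
    if (n < 0) {
        return;
    }
    len = (size_t) n < sizeof line ? (size_t) n : sizeof line - 1;

    pthread_mutex_lock(&r->lock);
    for (size_t i = 0; i < len; i++) {
        r->data[r->head] = line[i];
        r->head = (r->head + 1) % DBG_BUF_SIZE;
    }
    r->used = r->used + len > DBG_BUF_SIZE ? DBG_BUF_SIZE : r->used + len;
    pthread_mutex_unlock(&r->lock);
}
```

Every writer takes the same lock, so a call like this at each ukey transition
point can change thread timing. That is why the mutex might mask the race.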

You can find the code here: 
https://github.com/chaudron/ovs/tree/refs/heads/dev/ec_dbg

Here’s an example of how to dump the buffer from GDB:

(gdb) source ovs_gdb.py
(gdb) ovs_dump_ec_debug
Dumping EC_DEBUG_BUFFER (Total Size: 536870912, Used: 258324):
3373100819104|handler11|||: time: 2025-10-27 10:59:11.457
3373100819104|handler11|ofproto/ofproto-dpif-upcall.c:1875|ukey_create__()[4462481]<-process_upcall()[4464430]<-recv_upcalls()[4469457]<-udpif_upcall_handler()[4470863]:ukey[0x7fd11c0055a0][1c2a11dd-9300-4dec-8f67-d85a70b5695f]create
3373100819881|handler11|ofproto/ofproto-dpif-upcall.c:2040|ukey_install__()[4467798]<-recv_upcalls()[4469906]<-udpif_upcall_handler()[4470863]<-ovsthread_wrapper()[5327791]:ukey[0x7fd11c0055a0][1c2a11dd-9300-4dec-8f67-d85a70b5695f]install/insert
3373100820426|handler11|ofproto/ofproto-dpif-upcall.c:2055|transition_ukey_at()[4465347]<-ukey_install__()[4467832]<-recv_upcalls()[4469906]<-udpif_upcall_handler()[4470863]:ukey[0x7fd11c0055a0][1c2a11dd-9300-4dec-8f67-d85a70b5695f]state 0 -> 1
3373100820972|handler11|ofproto/ofproto-dpif-upcall.c:2055|transition_ukey_at()[4465347]<-recv_upcalls()[4470738]<-udpif_upcall_handler()[4470863]<-ovsthread_wrapper()[5327791]:ukey[0x7fd11c0055a0][1c2a11dd-9300-4dec-8f67-d85a70b5695f]state 1 -> 2
3373100823070|handler6|ofproto/ofproto-dpif-upcall.c:1875|ukey_create__()[4462481]<-process_upcall()[4464430]<-recv_upcalls()[4469457]<-udpif_upcall_handler()[4470863]:ukey[0x7fd130004e30][5483ed11-6aef-4efd-a1f2-6dacfa1b3bda]create
3373100823061|handler11|ofproto/ofproto-dpif-upcall.c:1875|ukey_create__()[4462481]<-process_upcall()[4464430]<-recv_upcalls()[4469457]<-udpif_upcall_handler()[4470863]:ukey[0x7fd11c005e20][2b2a41bb-942a-4674-a995-94b210a0d09d]create

Cheers,
Eelco

>
>
> ------------------ Original ------------------
> From: "LIU Yulong"<[email protected]>;
> Date: Wed, Oct 22, 2025 06:03 PM
> To: "liuyulong"<[email protected]>; "Eelco Chaudron"<[email protected]>;
> Cc: "dev"<[email protected]>;
> Subject: Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Add ovsrcu_postpone
> for ukey_delete__.
>
>
> Updates:
> 2. The change to the function `upcall_receive` did not solve the issue;
> ovs-vswitchd still dumps core:
> #0  0x00007f7bf21ca337 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> #1  0x00007f7bf21cba28 in __GI_abort () at abort.c:90
> #2  0x000055811c97cc6e in ovs_abort_valist (err_no=<optimized out>, format=<optimized out>, args=args@entry=0x7f7bdfffa360) at lib/util.c:499
> #3  0x000055811c97cd04 in ovs_abort (err_no=err_no@entry=0, format=format@entry=0x55811cddaec0 "%s: %s() passed uninitialized ovs_mutex") at lib/util.c:491
> #4  0x000055811c947a81 in ovs_mutex_trylock_at (l_=l_@entry=0x7f7bc9a751f8, where=where@entry=0x55811cdb7e78 "ofproto/ofproto-dpif-upcall.c:3027") at lib/ovs-thread.c:106
> #5  0x000055811c86f4f1 in revalidator_sweep__ (revalidator=revalidator@entry=0x558120f6df00, purge=purge@entry=false) at ofproto/ofproto-dpif-upcall.c:3027
> #6  0x000055811c873516 in revalidator_sweep (revalidator=0x558120f6df00) at ofproto/ofproto-dpif-upcall.c:3085
> #7  udpif_revalidator (arg=0x558120f6df00) at ofproto/ofproto-dpif-upcall.c:1093
> #8  0x000055811c94863f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:422
> #9  0x00007f7bf4321e65 in start_thread (arg=0x7f7bdffff700) at pthread_create.c:307
> #10 0x00007f7bf229288d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
>
> ------------------ Original ------------------
> From: "LIU Yulong"<[email protected]>;
> Date: Wed, Oct 22, 2025 01:54 PM
> To: "Eelco Chaudron"<[email protected]>;
> Cc: "dev"<[email protected]>;
> Subject: Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Add ovsrcu_postpone for 
> ukey_delete__.
>
>
> Updates:
> 1. The change to the function `upcall_uninit` did not solve the issue;
> ovs-vswitchd can still dump core from the call `ukey_delete(umap, ukey);` in
> `revalidator_sweep__`.
> 2. The change to the function `upcall_receive` was applied to another host, and
> we have not seen the core dump issue for 24h. We need to run it for a longer
> period of time to verify.
>
>
>
>
> ------------------ Original ------------------
> From: "Eelco Chaudron"<[email protected]>;
> Date: Mon, Oct 20, 2025 07:04 PM
> To: "LIU Yulong"<[email protected]>;
> Cc: "dev"<[email protected]>;
> Subject: Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Add ovsrcu_postpone
> for ukey_delete__.
>
>
> On 20 Oct 2025, at 12:35, LIU Yulong wrote:
>
> > Thank you Eelco.
> >
> >
> > Code search shows we have `recv_upcalls` and `upcall_cb`, both of which call
> > `upcall_uninit`.
> > And dp_netdev_upcall will call the dp->upcall_cb.
> > So we have call stacks like this:
> > i) handle_packet_upcall->dp_netdev_upcall->upcall_cb->upcall_uninit
> > ii) dp_execute_userspace_action->dp_netdev_upcall->upcall_cb->upcall_uninit
> >
> > Could you confirm these calls?
>
> Off the top of my head, this is correct. However, the new ukey structure is
> never inserted, so we do not need the RCU-delayed remove.
>
> > For your change, I'll run tests with recorded packets to verify.
>
> Thanks, and let me know the results.
>
> //Eelco
>
> >
> > Regards,
> >
> >
> > LIU Yulong
> >
> >
> > ------------------ Original ------------------
> > From: "Eelco Chaudron"<[email protected]>;
> > Date: Fri, Oct 17, 2025 08:08 PM
> > To: "LIU Yulong"<[email protected]>;
> > Cc: "dev"<[email protected]>;
> > Subject: Re: [ovs-dev] [PATCH] ofproto-dpif-upcall: Add
> > ovsrcu_postpone for ukey_delete__.
> >
> >
> > Hi Liu,
> >
> > I looked at the change; however, upcall_uninit() is only called for
> > newly created (never inserted) ukeys, so the ovsrcu_postpone() call is not
> > needed.
> >
> > However, I did find an issue in upcall_receive(), where, in an error
> > path, it could use an uninitialized upcall structure, causing a ukey to be
> > freed that should not have been.
> >
> > Can you try out the diff below to see if it fixes your problem?
> >
> > Cheers,
> >
> > Eelco
> >
> > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> > index b3b4b2d2f..53b906a16 100644
> > --- a/ofproto/ofproto-dpif-upcall.c
> > +++ b/ofproto/ofproto-dpif-upcall.c
> > @@ -1230,6 +1230,17 @@ upcall_receive(struct upcall *upcall, const struct dpif_backer *backer,
> >  {
> >      int error;
> >
> > +    /* Initialize the minimal required fields in the upcall structure to ensure
> > +     * upcall_uninit() does not operate on invalid data. */
> > +    upcall->have_recirc_ref = false;
> > +    upcall->xout_initialized = false;
> > +    upcall->ukey_persists = false;
> > +    upcall->ukey = NULL;
> > +    ofpbuf_use_stub(&upcall->odp_actions, upcall->odp_actions_stub,
> > +                    sizeof upcall->odp_actions_stub);
> > +    ofpbuf_init(&upcall->put_actions, 0);
> > +
> > +
> >      upcall->type = classify_upcall(type, userdata, &upcall->cookie);
> >      if (upcall->type == BAD_UPCALL) {
> >          return EAGAIN;
> > @@ -1258,19 +1269,11 @@ upcall_receive(struct upcall *upcall, const struct dpif_backer *backer,
> >      }
> >
> >      upcall->recirc = NULL;
> > -    upcall->have_recirc_ref = false;
> >      upcall->flow = flow;
> >      upcall->packet = packet;
> >      upcall->ufid = ufid;
> >      upcall->pmd_id = pmd_id;
> > -    ofpbuf_use_stub(&upcall->odp_actions, upcall->odp_actions_stub,
> > -                    sizeof upcall->odp_actions_stub);
> > -    ofpbuf_init(&upcall->put_actions, 0);
> >
> > -    upcall->xout_initialized = false;
> > -    upcall->ukey_persists = false;
> > -
> > -    upcall->ukey = NULL;
> >      upcall->key = NULL;
> >      upcall->key_len = 0;
> >      upcall->mru = mru;
> >
> >
> > On 16 Oct 2025, at 3:12, LIU Yulong wrote:
> >
> > > We have the following call stack from a core dump:
> > > *0  0x00007f7f197ae337 in raise () from /lib64/libc.so.6
> > > *1  0x00007f7f197afa28 in abort () from /lib64/libc.so.6
> > > *2  0x000055934ca4f4ee in ovs_abort_valist (err_no=<optimized out>, format=<optimized out>, args=args@entry=0x7f7f07530360) at lib/util.c:499
> > > *3  0x000055934ca4f584 in ovs_abort (err_no=err_no@entry=0, format=format@entry=0x55934ccedd18 "%s: %s() passed uninitialized ovs_mutex") at lib/util.c:491
> > > *4  0x000055934ca1a4a1 in ovs_mutex_trylock_at (l_=l_@entry=0x7f7ed4a43e58, where=where@entry=0x55934cccb318 "ofproto/ofproto-dpif-upcall.c:3014") at lib/ovs-thread.c:106
> > > *5  0x000055934c943181 in revalidator_sweep__ (revalidator=revalidator@entry=0x5593518c1720, purge=purge@entry=false) at ofproto/ofproto-dpif-upcall.c:3014
> > > *6  0x000055934c9471a6 in revalidator_sweep (revalidator=0x5593518c1720) at ofproto/ofproto-dpif-upcall.c:3072
> > > *7  udpif_revalidator (arg=0x5593518c1720) at ofproto/ofproto-dpif-upcall.c:1086
> > > *8  0x000055934ca1b05f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:422
> > > *9  0x00007f7f1b6ece65 in start_thread () from /lib64/libpthread.so.0
> > > *10 0x00007f7f1987688d in clone () from /lib64/libc.so.6
> > >
> > > When calling ovs_mutex_trylock() on ukey->mutex, ovs_mutex_trylock_at
> > > sees that the input is an "uninitialized ovs_mutex" (l->where is NULL),
> > > and aborts.
> > >
> > > This state can only occur if the mutex was never initialized or
> > > has been destroyed. The ukey's mutex is definitely initialized in
> > > ukey_create__, so the "uninitialized" state almost certainly means
> > > "destroyed". Destruction occurs in ukey_delete__, which calls
> > > ovs_mutex_destroy(&ukey->mutex) and sets where to NULL.
> > >
> > > When revalidator_sweep__ is traversing the cmap and trying to lock
> > > (&ukey->mutex), it encounters a ukey that has been directly destroyed by
> > > ukey_delete__. However, the ukey is still visible to the revalidator
> > > (either still in the cmap or has not yet passed the RCU grace period),
> > > resulting in an abort. That is to say, there is a path where
> > > ukey_delete__ is directly called during concurrent traversal of ukeys,
> > > bypassing the cmap_remove + ovsrcu_postpone semantics of ukey_delete.
> > >
> > > Modify upcall_uninit() to change the direct ukey_delete__ call to an
> > > RCU-deferred release to avoid concurrent traversal conflicts with the
> > > revalidator. This ensures that ukey_delete__ is not executed until after
> > > the global grace period, and that the CMAP_FOR_EACH within
> > > revalidator_sweep__ will not encounter a destroyed mutex before
> > > the end of its running cycle.
> > >
> > > Some earlier email discussions:
> > > [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052973.html
> > > [2] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-February/052949.html
> > > [3] https://mail.openvswitch.org/pipermail/ovs-discuss/2024-March/052993.html
> > >
> > > Signed-off-by: LIU Yulong <[email protected]>
> > > ---
> > >  ofproto/ofproto-dpif-upcall.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> > > index 9dfa52d82..b3b4b2d2f 100644
> > > --- a/ofproto/ofproto-dpif-upcall.c
> > > +++ b/ofproto/ofproto-dpif-upcall.c
> > > @@ -1386,7 +1386,7 @@ upcall_uninit(struct upcall *upcall)
> > >          ofpbuf_uninit(&upcall->put_actions);
> > >          if (upcall->ukey) {
> > >              if (!upcall->ukey_persists) {
> > > -                ukey_delete__(upcall->ukey);
> > > +                ovsrcu_postpone(ukey_delete__, upcall->ukey);
> > >              }
> > >          } else if (upcall->have_recirc_ref) {
> > >              /* The reference was transferred to the ukey if one was created. */
> > > --
> > > 2.50.1 (Apple Git-155)
> > >
> > > _______________________________________________
> > > dev mailing list
> > > [email protected]
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
