I should have had the sense to ask this earlier: which version of vpp are you using?
The line number in your debug snippet is more than 100 lines off from master/latest. The timer wheel code has been relatively untouched, but there have been several important fixes over the years...

D.

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+      if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+        ASSERT(0);
+      }

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Sudhir CR via lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@lists.fd.io
Cc: rtbrick....@lists.fd.io
Subject: Re: [vpp-dev] process node suspended indefinitely

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. Our process node code now looks like this:

    while (1)
      {
        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type)
          {
          case ~0: /* handle timer expirations */
            rtb_event_loop_run_once ();
            break;

          default: /* bug! */
            ASSERT (0);
          }
      }

After these changes we did not observe any assertions, but we still hit the process node suspend issue. This makes it clear that we are not receiving any events other than the timeout.

In the issue state I collected the flags value of the vlib_process_t for this node (rtb_vpp_epoll_process), and it seems to be correct (flags = 11).

Please find the vlib_process_t and vlib_node_t data structure values collected in the issue state below.
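For readers following along, flags = 11 can be decoded by hand. The sketch below assumes the bit assignments found in src/vlib/node.h in recent VPP trees (waiting-for-clock = bit 0, waiting-for-event = bit 1, resume-pending = bit 2, is-running = bit 3). The PROC_* macros and decode_process_flags are hypothetical helpers written for this illustration, not VPP APIs, and the bit values should be verified against the tree you are actually running.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* ASSUMED bit values, mirroring the VLIB_PROCESS_* flags in
   src/vlib/node.h of recent VPP trees. Verify before relying on them. */
#define PROC_WAITING_FOR_CLOCK (1u << 0)
#define PROC_WAITING_FOR_EVENT (1u << 1)
#define PROC_RESUME_PENDING    (1u << 2)
#define PROC_IS_RUNNING        (1u << 3)

static const struct
{
  unsigned bit;
  const char *name;
} flag_names[] = {
  { PROC_WAITING_FOR_CLOCK, "WAITING_FOR_CLOCK" },
  { PROC_WAITING_FOR_EVENT, "WAITING_FOR_EVENT" },
  { PROC_RESUME_PENDING, "RESUME_PENDING" },
  { PROC_IS_RUNNING, "IS_RUNNING" },
};

/* Write the names of all set bits into buf, separated by '|'.
   Returns buf; the output is cut short if buf is too small. */
static const char *
decode_process_flags (unsigned flags, char *buf, size_t len)
{
  assert (buf != NULL && len > 0);
  buf[0] = '\0';

  for (size_t i = 0; i < sizeof (flag_names) / sizeof (flag_names[0]); i++)
    {
      size_t used = strlen (buf);

      if (!(flags & flag_names[i].bit))
        continue;

      int n = snprintf (buf + used, len - used, "%s%s",
                        used ? "|" : "", flag_names[i].name);
      if (n < 0 || (size_t) n >= len - used)
        break; /* out of space: stop appending */
    }
  return buf;
}
```

Under those assumptions, decode_process_flags (11, buf, sizeof (buf)) yields "WAITING_FOR_CLOCK|WAITING_FOR_EVENT|IS_RUNNING": the process is parked waiting for either its timer or an event, and no resume is pending. That would be consistent with the "any wait" state reported by show runtime.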
vlib_process_t:
============
$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
    errors = 0x7f9b3076a560,
    clocks_since_last_overflow = 0,
    max_clock = 3785970526,
    max_clock_n = 0,
    calls_since_last_overflow = 0,
    vectors_since_last_overflow = 0,
    next_frame_index = 1668,
    node_index = 437,
    input_main_loops_per_call = 0,
    main_loop_count_last_dispatch = 4147405645,
    main_loop_vector_stats = {0, 0},
    flags = 0,
    state = 0,
    n_next_nodes = 0,
    cached_next_index = 0,
    thread_index = 0,
    runtime_data = 0x7f9b2da503c6 ""
  },
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 94502640552512, 0, 140304430422032, 140306703608766}
  },
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 94502640552512, 0, 140304161734272, 140304430441787}
  },
  flags = 11,
  log2_n_stack_bytes = 16,
  suspended_process_frame_index = 0,
  n_suspends = 0,
  pending_event_data_by_type_index = 0x7f9b307b8310,
  non_empty_event_type_bitmap = 0x7f9b307b8390,
  one_time_event_type_bitmap = 0x0,
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
  event_type_pool = 0x7f9b2dcb5978,
  resume_clock_interval = 1000,
  stop_timer_handle = 3098,
  output_function = 0x0,
  output_function_arg = 0,
  stack = 0x7f9b1bb78000
}

vlib_node_t
=========
(gdb) p *n
$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
  name_elog_string = 11783,
  stats_total = {
    calls = 0,
    vectors = 0,
    clocks = 1971244932732,
    suspends = 6847366,
    max_clock = 3785970526,
    max_clock_n = 0
  },
  stats_last_clear = {
    calls = 0,
    vectors = 0,
    clocks = 0,
    suspends = 0,
    max_clock = 0,
    max_clock_n = 0
  },
  type = VLIB_NODE_TYPE_PROCESS,
  index = 437,
  runtime_index = 40,
  runtime_data = 0x0,
  flags = 0,
  state = 0 '\000',
  runtime_data_bytes = 0 '\000',
  protocol_hint = 0 '\000',
  n_errors = 0,
  scalar_size = 0,
  vector_size = 0,
  error_heap_handle = 0,
  error_heap_index = 0,
  error_counters = 0x0,
  next_node_names = 0x7f9b3076a530,
  next_nodes = 0x0,
  sibling_of = 0x0,
  sibling_bitmap = 0x0,
  n_vectors_by_next_node = 0x0,
  next_slot_by_node = 0x0,
  prev_node_bitmap = 0x0,
  owner_node_index = 4294967295,
  owner_next_index = 4294967295,
  format_buffer = 0x0,
  unformat_buffer = 0x0,
  format_trace = 0x0,
  validate_frame = 0x0,
  state_string = 0x0,
  node_fn_registrations = 0x0
}

I added an assert statement just before the VLIB_PROCESS_IS_RUNNING flag is cleared in the dispatch_suspended_process function, but this assert is never hit.

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+      if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+        ASSERT(0);
+      }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                       p->suspended_process_frame_index);

I am not able to figure out why this process node is suspended in some scenarios. Could you please give me some pointers on how to debug and resolve this issue?

Hi Jinsh,

I applied your patch to my code, but it did not solve the issue. Thank you for helping me out.

Thanks and Regards,
Sudhir

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io <sudhir=rtbrick....@lists.fd.io> wrote:

Hi Chetan,

In our case we observe this issue only occasionally, and the exact steps to recreate it are not known.

I changed our process node as suggested by Dave and am trying to recreate the issue with these changes. I will update this mail thread with my results and findings soon.

Thanks and Regards,
Sudhir

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin...@gmail.com> wrote:

Hi Sudhir,

Is your issue resolved?

We are actually facing the same issue on vpp.2106. In our case "api-rx-ring" is not getting called.
In our use case, workers call some functions in main-thread context, which results in RPC messages, and the memory for these is allocated from the API segment. As a result the API segment memory is used up completely, which leads to a crash.

Thanks,
Chetan

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io <sudhir=rtbrick....@lists.fd.io> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you with the results.

Regards,
Sudhir

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <v...@barachs.net> wrote:

Please try something like this, to eliminate the possibility that some bit of code is sending this process an event. It's not a good idea to skip the vec_reset_length (event_data) step.

    while (1)
      {
        uword event_type, * event_data = 0;
        int i;

        vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

        event_type = vlib_process_get_events (vm, &event_data);

        switch (event_type)
          {
          case ~0: /* handle timer expirations */
            rtb_event_loop_run_once ();
            break;

          default: /* bug! */
            ASSERT (0);
          }

        vec_reset_length(event_data);
      }

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Sudhir CR via lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] process node suspended indefinitely

Hi Dave,

Thank you for your response and help. Please find the additional details below.

VPP Version 21.10

We create a process node, rtb-vpp-epoll-process, to handle control-plane events such as interface add/delete and route add/delete. This process node waits for 10 ms (it is not interested in any events); once the 10 ms have expired, it processes the control-plane events mentioned above.

The code looks like this:

```
static uword
rtb_vpp_epoll_process (vlib_main_t *vm, vlib_node_runtime_t *rt,
                       vlib_frame_t *f)
{
  ...
  ...
  while (1)
    {
      vlib_process_wait_for_event_or_clock (vm, 10e-3);
      vlib_process_get_events (vm, NULL);
      rtb_event_loop_run_once (); /* control-plane event handling */
    }
}
```

What we observed is that sometimes (under high control-plane load, such as a request to install many routes) "rtb-vpp-epoll-process" is suspended and never scheduled again. We found this using "show runtime rtb-vpp-epoll-process": in the command output, the Suspends counter stops incrementing.

show runtime output in the working case:

```
DBGvpp# show runtime rtb-vpp-epoll-process
         Name              State      Calls   Vectors   Suspends   Clocks   Vectors/Call
rtb-vpp-epoll-process    any wait         0         0     192246   1.91e6           0.00
DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process
         Name              State      Calls   Vectors   Suspends   Clocks   Vectors/Call
rtb-vpp-epoll-process    any wait         0         0     193634   1.89e6           0.00
DBGvpp#
```

show runtime output in the issue case:

```
DBGvpp# show runtime rtb-vpp-epoll-process
         Name              State      Calls   Vectors   Suspends   Clocks   Vectors/Call
rtb-vpp-epoll-process    any wait         0         0      81477   7.08e6           0.00
DBGvpp# show runtime rtb-vpp-epoll-process
         Name              State      Calls   Vectors   Suspends   Clocks   Vectors/Call
rtb-vpp-epoll-process    any wait         0         0      81477   7.08e6           0.00
```

Other process nodes such as lldp-process, ip4-neighbor-age-process and ip6-ra-process are running without any issue. Only the "rtb-vpp-epoll-process" process node is suspended forever.

Please let me know if any additional information is required.

Hi Jinsh,

Thanks for pointing me to the issue you faced. The issue I am facing looks similar. I will verify with the given patch.

Thanks and Regards,
Sudhir

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jins...@chinatelecom.cn> wrote:

Hi,

I have the same problem: the bfd process node stops running. I raised this issue in https://lists.fd.io/g/vpp-dev/message/22380

I think there is a problem with the process scheduling module when it uses the timer wheel.
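As a cross-check on the vlib_process_t dump posted in this thread: resume_clock_interval = 1000 is the value a 10 ms wait would produce if the process timer wheel counts in 10 µs ticks. That tick size is an assumption here (recent trees appear to convert the suspend time as dt * 1e5 in vlib_process_suspend), and suspend_interval_to_ticks and ASSUMED_TICKS_PER_SECOND are hypothetical names for this sketch, not VPP APIs.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: the vlib process timer wheel advances one tick every
   10 us, i.e. 1e5 ticks per second. Check this against your tree. */
#define ASSUMED_TICKS_PER_SECOND 1e5

/* Convert a suspend interval in seconds to timer-wheel ticks,
   truncating toward zero as a plain float-to-integer cast does. */
static uint64_t
suspend_interval_to_ticks (double dt_seconds)
{
  assert (dt_seconds >= 0.0);
  return (uint64_t) (dt_seconds * ASSUMED_TICKS_PER_SECOND);
}
```

Under that assumption, suspend_interval_to_ticks (10e-3) gives 1000, which matches the dump. If so, the process was armed with the intended 10 ms timeout, and the thing to chase would be why that timer did not fire rather than whether the interval was wrong.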
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential and may be privileged. If you received this e-mail in error, any review, use, dissemination, distribution, or copying of this e-mail is strictly prohibited. Please notify us immediately of the error by return e-mail and please delete this message from your system. For more information about Rtbrick, please visit us at www.rtbrick.com <http://www.rtbrick.com>
View/Reply Online (#22690): https://lists.fd.io/g/vpp-dev/message/22690