I should have had the sense to ask this earlier: which version of vpp are you
using?
The line number in your debug snippet is more than 100 lines off from
master/latest. The timer wheel code has been relatively untouched, but there
have been several important fixes over the years...
D.
From: [email protected] <[email protected]> On Behalf Of Sudhir CR via
lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: [email protected]
Cc: [email protected]
Subject: Re: [vpp-dev] process node suspended indefinitely
Hi Dave,
Please excuse my delayed response. It took some time to recreate this issue.
I changed our process node as you suggested. Our process node code now looks
like this:
while (1) {
  vlib_process_wait_for_event_or_clock (vm,
                                        RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
  event_type = vlib_process_get_events (vm, &event_data);
  vec_reset_length(event_data);
  switch (event_type) {
    case ~0: /* handle timer expirations */
      rtb_event_loop_run_once ();
      break;
    default: /* bug! */
      ASSERT (0);
  }
}
After these changes we did not observe any assertions, but we still hit the
process node suspend issue. This makes it clear that the process is not
receiving any events other than the timeout.
In the issue state I collected the flags value of the vlib_process node
(rtb_vpp_epoll_process), and it seems to be correct (flags = 11).
Please find the vlib_process_t and vlib_node_t data structure values collected
in the issue state below.
vlib_process_t:
============
$38 = {
cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
node_runtime = {
cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177",
function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
errors = 0x7f9b3076a560,
clocks_since_last_overflow = 0,
max_clock = 3785970526,
max_clock_n = 0,
calls_since_last_overflow = 0,
vectors_since_last_overflow = 0,
next_frame_index = 1668,
node_index = 437,
input_main_loops_per_call = 0,
main_loop_count_last_dispatch = 4147405645,
main_loop_vector_stats = {0, 0},
flags = 0,
state = 0,
n_next_nodes = 0,
cached_next_index = 0,
thread_index = 0,
runtime_data = 0x7f9b2da503c6 ""
},
return_longjmp = {
regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048,
94502640552512, 0, 140304430422032, 140306703608766}
},
resume_longjmp = {
regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048,
94502640552512, 0, 140304161734272, 140304430441787}
},
flags = 11,
log2_n_stack_bytes = 16,
suspended_process_frame_index = 0,
n_suspends = 0,
pending_event_data_by_type_index = 0x7f9b307b8310,
non_empty_event_type_bitmap = 0x7f9b307b8390,
one_time_event_type_bitmap = 0x0,
event_type_index_by_type_opaque = 0x7f9b2dab8bd8,
event_type_pool = 0x7f9b2dcb5978,
resume_clock_interval = 1000,
stop_timer_handle = 3098,
output_function = 0x0,
output_function_arg = 0,
stack = 0x7f9b1bb78000
}
vlib_node_t
=========
(gdb) p *n
$17 = {
function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>,
name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process",
name_elog_string = 11783,
stats_total = {
calls = 0,
vectors = 0,
clocks = 1971244932732,
suspends = 6847366,
max_clock = 3785970526,
max_clock_n = 0
},
stats_last_clear = {
calls = 0,
vectors = 0,
clocks = 0,
suspends = 0,
max_clock = 0,
max_clock_n = 0
},
type = VLIB_NODE_TYPE_PROCESS,
index = 437,
runtime_index = 40,
runtime_data = 0x0,
flags = 0,
state = 0 '\000',
runtime_data_bytes = 0 '\000',
protocol_hint = 0 '\000',
n_errors = 0,
scalar_size = 0,
vector_size = 0,
error_heap_handle = 0,
error_heap_index = 0,
error_counters = 0x0,
next_node_names = 0x7f9b3076a530,
next_nodes = 0x0,
sibling_of = 0x0,
sibling_bitmap = 0x0,
n_vectors_by_next_node = 0x0,
next_slot_by_node = 0x0,
prev_node_bitmap = 0x0,
owner_node_index = 4294967295,
owner_next_index = 4294967295,
format_buffer = 0x0,
unformat_buffer = 0x0,
format_trace = 0x0,
validate_frame = 0x0,
state_string = 0x0,
node_fn_registrations = 0x0
}
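For readers following the dump, here is a minimal sketch of how the two most telling fields could be decoded. It assumes the flag bit values below match the vlib_process_t definitions in src/vlib/node.h of the tree in use, and that resume_clock_interval is expressed in 10 us timer-wheel ticks; both should be checked against the exact VPP version. The helper names ticks_to_usec and print_process_flags are hypothetical and are not part of VPP.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror the vlib_process_t flag bits in src/vlib/node.h;
   check the header of the VPP release actually in use. */
#define VLIB_PROCESS_IS_SUSPENDED_WAITING_FOR_CLOCK (1 << 0)
#define VLIB_PROCESS_IS_SUSPENDED_WAITING_FOR_EVENT (1 << 1)
#define VLIB_PROCESS_RESUME_PENDING (1 << 2)
#define VLIB_PROCESS_IS_RUNNING (1 << 3)

/* Assumption: resume_clock_interval counts 10 us timer-wheel ticks. */
#define USEC_PER_TICK 10

/* Hypothetical helper: convert a tick count to microseconds. */
uint64_t
ticks_to_usec (uint64_t ticks)
{
  return ticks * USEC_PER_TICK;
}

/* Hypothetical helper: print one line per flag bit (1 = set, 0 = clear). */
void
print_process_flags (FILE *out, uint16_t flags)
{
  assert (out != NULL);
  fprintf (out, "waiting for clock: %d\n",
           !!(flags & VLIB_PROCESS_IS_SUSPENDED_WAITING_FOR_CLOCK));
  fprintf (out, "waiting for event: %d\n",
           !!(flags & VLIB_PROCESS_IS_SUSPENDED_WAITING_FOR_EVENT));
  fprintf (out, "resume pending:    %d\n",
           !!(flags & VLIB_PROCESS_RESUME_PENDING));
  fprintf (out, "is running:        %d\n",
           !!(flags & VLIB_PROCESS_IS_RUNNING));
}
```

Under these assumptions, flags = 11 (binary 1011) means the process is waiting for both clock and event and is marked running, with no resume pending, and resume_clock_interval = 1000 corresponds to 10000 us, i.e. the 10 ms wait.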
I added an assert statement just before the VLIB_PROCESS_IS_RUNNING flag is
cleared in the dispatch_suspended_process function, but this assert is never
hit.
diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
}
else
{
+ if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+ ASSERT(0);
+ }
p->flags &= ~VLIB_PROCESS_IS_RUNNING;
pool_put_index (nm->suspended_process_frames,
p->suspended_process_frame_index);
I am not able to figure out why this process node stays suspended in some
scenarios. Can you please help me with some pointers to debug and resolve this
issue?
Hi Jinsh,
I applied your patch to my code. The issue is not solved with your patch. Thank
you for helping me out.
Thanks and Regards,
Sudhir
On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io
<[email protected]> wrote:
Hi Chetan,
In our case we observe this issue only occasionally; the exact steps to
recreate it are not known.
I have changed our process node as suggested by Dave and am trying to recreate
the issue with these changes.
I will update this mail thread with my results and findings soon.
Thanks and Regards,
Sudhir
On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <[email protected]> wrote:
Hi Sudhir,
Is your issue resolved?
We are actually facing the same issue on vpp.2106.
In our case "api-rx-ring" is not getting called.
In our use case, workers call some functions in main-thread context, which
results in RPC messages whose memory is allocated from the API segment.
As a result, the API segment memory is used up completely, which leads to a
crash.
Thanks,
Chetan
On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io
<[email protected]> wrote:
Hi Dave,
Thank you very much for your inputs. I will try this out and get back to you
with the results.
Regards,
Sudhir
On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <[email protected]> wrote:
Please try something like this, to eliminate the possibility that some bit of
code is sending this process an event. It’s not a good idea to skip the
vec_reset_length (event_data) step.
while (1)
  {
    uword event_type, * event_data = 0;
    int i;
    vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);
    event_type = vlib_process_get_events (vm, &event_data);
    switch (event_type) {
      case ~0: /* handle timer expirations */
        rtb_event_loop_run_once ();
        break;
      default: /* bug! */
        ASSERT (0);
    }
    vec_reset_length(event_data);
  }
From: [email protected] <[email protected]> On Behalf Of Sudhir CR via
lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: [email protected]
Subject: Re: [vpp-dev] process node suspended indefinitely
Hi Dave,
Thank you for your response and help.
Please find the additional details below.
VPP Version 21.10
We are creating a process node rtb-vpp-epoll-process to handle control plane
events like interface add/delete, route add/delete.
This process node waits for 10 ms (it is not interested in any events). Once
the 10 ms have expired, it processes the control plane events mentioned above.
The code snippet looks like this:
```
static uword
rtb_vpp_epoll_process (vlib_main_t *vm,
                       vlib_node_runtime_t *rt,
                       vlib_frame_t *f)
{
  ...
  ...
  while (1) {
    vlib_process_wait_for_event_or_clock (vm, 10e-3);
    vlib_process_get_events (vm, NULL);
    rtb_event_loop_run_once(); /* control-plane event handling */
  }
}
```
What we observed is that sometimes (when there is a high control-plane load,
such as a request to install more routes) "rtb-vpp-epoll-process" is suspended
and never scheduled again. We found this using "show runtime
rtb-vpp-epoll-process": in the command output, the suspends counter does not
increment.
show runtime output in working case :
```
DBGvpp# show runtime rtb-vpp-epoll-process
         Name              State    Calls  Vectors  Suspends  Clocks  Vectors/Call
rtb-vpp-epoll-process    any wait       0        0    192246  1.91e6          0.00
DBGvpp#
DBGvpp# show runtime rtb-vpp-epoll-process
         Name              State    Calls  Vectors  Suspends  Clocks  Vectors/Call
rtb-vpp-epoll-process    any wait       0        0    193634  1.89e6          0.00
DBGvpp#
```
show runtime output in issue case :
```
DBGvpp# show runtime rtb-vpp-epoll-process
         Name              State    Calls  Vectors  Suspends  Clocks  Vectors/Call
rtb-vpp-epoll-process    any wait       0        0     81477  7.08e6          0.00
DBGvpp# show runtime rtb-vpp-epoll-process
         Name              State    Calls  Vectors  Suspends  Clocks  Vectors/Call
rtb-vpp-epoll-process    any wait       0        0     81477  7.08e6          0.00
```
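The comparison made by eye in the two outputs above can be sketched as a small check on two samples of the Suspends counter. This is only an illustration: the helper names suspends_delta and process_looks_stalled are hypothetical, and the check assumes the two samples are taken more than one timer period apart and that the runtime counters were not cleared in between.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of suspends between two samples.
   Assumes the counters were not cleared between the samples. */
uint64_t
suspends_delta (uint64_t earlier, uint64_t later)
{
  assert (later >= earlier);
  return later - earlier;
}

/* Hypothetical helper: a process that wakes on a periodic timer should
   suspend at least once between samples taken more than one period apart,
   so an unchanged counter suggests it is no longer being scheduled. */
int
process_looks_stalled (uint64_t earlier, uint64_t later)
{
  return suspends_delta (earlier, later) == 0;
}
```

With the values above, the working case (192246, then 193634) gives a delta of 1388 and is not flagged, while the issue case (81477 both times) is flagged as stalled.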
Other process nodes such as lldp-process, ip4-neighbor-age-process and
ip6-ra-process are running without any issue. Only the "rtb-vpp-epoll-process"
process node is suspended forever.
Please let me know if any additional information is required.
Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks
similar.
I will verify with the given patch.
Thanks and Regards,
Sudhir
On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <[email protected]> wrote:
Hi,
I have the same problem: the bfd process node stops running. I raised this
issue here:
https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the process scheduling module when using the
timer wheel.
NOTICE TO RECIPIENT This e-mail message and any attachments are confidential
and may be privileged. If you received this e-mail in error, any review, use,
dissemination, distribution, or copying of this e-mail is strictly prohibited.
Please notify us immediately of the error by return e-mail and please delete
this message from your system. For more information about Rtbrick, please visit
us at www.rtbrick.com <http://www.rtbrick.com>
View/Reply Online (#22690): https://lists.fd.io/g/vpp-dev/message/22690