I should have had the sense to ask this earlier: which version of vpp are you 
using? 

 

The line number in your debug snippet is more than 100 lines off from 
master/latest. The timer wheel code has been relatively untouched, but there 
have been several important fixes over the years...

 

D.

 


 

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Sudhir CR via 
lists.fd.io
Sent: Thursday, March 9, 2023 4:00 AM
To: vpp-dev@lists.fd.io
Cc: rtbrick....@lists.fd.io
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,

Please excuse my delayed response. It took some time to recreate this issue.

I made changes to our process node as per your suggestion. Our process node 
code now looks like this:

 

    while (1) {
        vlib_process_wait_for_event_or_clock (vm, RTB_VPP_EPOLL_PROCESS_NODE_TIMER);
        event_type = vlib_process_get_events (vm, &event_data);
        vec_reset_length(event_data);

        switch (event_type) {
            case ~0: /* handle timer expirations */
                rtb_event_loop_run_once ();
                break;

            default: /* bug! */
                ASSERT (0);
        }
    }

After these changes we didn't observe any assertions, but we still hit the 
process node suspend issue. This makes it clear that we are not getting any 
events other than timeouts.

 

In the issue state I collected the flags value of the vlib_process_t for 
rtb_vpp_epoll_process, and it seems to be correct (flags = 11).
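
For readers following along, here is a minimal sketch of how flags = 11 breaks down into individual bits. The bit positions are an assumption based on the VLIB_PROCESS_* definitions in src/vlib/node.h as I recall them, so please verify them against your own tree. The SKETCH_* names and both helper functions are hypothetical and not part of VPP.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed bit positions; check VLIB_PROCESS_* in src/vlib/node.h. */
#define SKETCH_WAITING_FOR_CLOCK (1u << 0)
#define SKETCH_WAITING_FOR_EVENT (1u << 1)
#define SKETCH_RESUME_PENDING    (1u << 2)
#define SKETCH_IS_RUNNING        (1u << 3)

/* Append "name" (separated by '|') to buf when the given bit is set. */
static void
append_if_set (unsigned flags, unsigned bit, const char *name,
               char *buf, size_t len)
{
  if (flags & bit)
    {
      size_t used = strlen (buf);
      snprintf (buf + used, len - used, "%s%s", used ? "|" : "", name);
    }
}

/* Hypothetical helper: render a vlib_process_t flags value as text. */
static char *
decode_process_flags (unsigned flags, char *buf, size_t len)
{
  assert (buf != NULL && len > 0);
  buf[0] = '\0';
  append_if_set (flags, SKETCH_WAITING_FOR_CLOCK, "WAITING_FOR_CLOCK", buf, len);
  append_if_set (flags, SKETCH_WAITING_FOR_EVENT, "WAITING_FOR_EVENT", buf, len);
  append_if_set (flags, SKETCH_RESUME_PENDING, "RESUME_PENDING", buf, len);
  append_if_set (flags, SKETCH_IS_RUNNING, "IS_RUNNING", buf, len);
  return buf;
}
```

Under those assumed definitions, 11 = 8 + 2 + 1, so decode_process_flags (11, buf, sizeof buf) gives WAITING_FOR_CLOCK|WAITING_FOR_EVENT|IS_RUNNING with RESUME_PENDING clear. That would be consistent with a process parked in vlib_process_wait_for_event_or_clock that nothing has queued for resumption yet.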

 

Please find the vlib_process_t and vlib_node_t data structure values collected 
in the issue state below.

 

vlib_process_t:

============

$38 = {
  cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177", 
  node_runtime = {
    cacheline0 = 0x7f9b2da50380 "\200~\274+\233\177", 
    function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>, 
    errors = 0x7f9b3076a560, 
    clocks_since_last_overflow = 0, 
    max_clock = 3785970526, 
    max_clock_n = 0, 
    calls_since_last_overflow = 0, 
    vectors_since_last_overflow = 0, 
    next_frame_index = 1668, 
    node_index = 437, 
    input_main_loops_per_call = 0, 
    main_loop_count_last_dispatch = 4147405645, 
    main_loop_vector_stats = {0, 0}, 
    flags = 0, 
    state = 0, 
    n_next_nodes = 0, 
    cached_next_index = 0, 
    thread_index = 0, 
    runtime_data = 0x7f9b2da503c6 ""
  }, 
  return_longjmp = {
    regs = {94502584873984, 140304430422064, 140306731463680, 94502584874048, 
      94502640552512, 0, 140304430422032, 140306703608766}
  }, 
  resume_longjmp = {
    regs = {94502584873984, 140304161734368, 140306731463680, 94502584874048, 
      94502640552512, 0, 140304161734272, 140304430441787}
  }, 
  flags = 11, 
  log2_n_stack_bytes = 16, 
  suspended_process_frame_index = 0, 
  n_suspends = 0, 
  pending_event_data_by_type_index = 0x7f9b307b8310, 
  non_empty_event_type_bitmap = 0x7f9b307b8390, 
  one_time_event_type_bitmap = 0x0, 
  event_type_index_by_type_opaque = 0x7f9b2dab8bd8, 
  event_type_pool = 0x7f9b2dcb5978, 
  resume_clock_interval = 1000, 
  stop_timer_handle = 3098, 
  output_function = 0x0, 
  output_function_arg = 0, 
  stack = 0x7f9b1bb78000
}
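
As a sanity check on the dump above, resume_clock_interval = 1000 can be related to the wait time. This is a sketch under an assumption: that vlib_process_wait_for_event_or_clock stores the interval as dt * 1e5, that is, in 10 us timer-wheel ticks, which is how I remember it from src/vlib/node_funcs.h. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one timer-wheel tick is 10 us, so ticks = seconds * 1e5. */
#define SKETCH_TICKS_PER_SECOND 1e5

/* Convert a wait time in seconds to ticks, truncating like a C cast. */
static uint32_t
seconds_to_resume_ticks (double dt_seconds)
{
  assert (dt_seconds >= 0.0);
  return (uint32_t) (dt_seconds * SKETCH_TICKS_PER_SECOND);
}

/* Convert a tick count back to milliseconds. */
static double
resume_ticks_to_ms (uint32_t ticks)
{
  return (double) ticks * 1e3 / SKETCH_TICKS_PER_SECOND;
}
```

If that assumption holds, 1000 ticks correspond to 10 ms, which matches the 10 ms wait used earlier in this thread. The process would then have asked for the expected interval, and the open question is why the timer never fired.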

 

vlib_node_t

=========

 (gdb) p *n

$17 = {
  function = 0x7f9b2bbc7e80 <rtb_vpp_epoll_process>, 
  name = 0x7f9b3076a3f0 "rtb-vpp-epoll-process", 
  name_elog_string = 11783, 
  stats_total = {
    calls = 0, 
    vectors = 0, 
    clocks = 1971244932732, 
    suspends = 6847366, 
    max_clock = 3785970526, 
    max_clock_n = 0
  }, 
  stats_last_clear = {
    calls = 0, 
    vectors = 0, 
    clocks = 0, 
    suspends = 0, 
    max_clock = 0, 
    max_clock_n = 0
  }, 
  type = VLIB_NODE_TYPE_PROCESS, 
  index = 437, 
  runtime_index = 40, 
  runtime_data = 0x0, 
  flags = 0, 
  state = 0 '\000', 
  runtime_data_bytes = 0 '\000', 
  protocol_hint = 0 '\000', 
  n_errors = 0, 
  scalar_size = 0, 
  vector_size = 0, 
  error_heap_handle = 0, 
  error_heap_index = 0, 
  error_counters = 0x0, 
  next_node_names = 0x7f9b3076a530, 
  next_nodes = 0x0, 
  sibling_of = 0x0, 
  sibling_bitmap = 0x0, 
  n_vectors_by_next_node = 0x0, 
  next_slot_by_node = 0x0, 
  prev_node_bitmap = 0x0, 
  owner_node_index = 4294967295, 
  owner_next_index = 4294967295, 
  format_buffer = 0x0, 
  unformat_buffer = 0x0, 
  format_trace = 0x0, 
  validate_frame = 0x0, 
  state_string = 0x0, 
  node_fn_registrations = 0x0
}

 

I added an assert statement before the VLIB_PROCESS_IS_RUNNING flag is cleared 
in the dispatch_suspended_process function.

But this assert is never hit.

 

diff --git a/src/vlib/main.c b/src/vlib/main.c
index af0fcd1cb..55c231d8b 100644
--- a/src/vlib/main.c
+++ b/src/vlib/main.c
@@ -1490,6 +1490,9 @@ dispatch_suspended_process (vlib_main_t * vm,
     }
   else
     {
+           if (strcmp((char *)node->name, "rtb-vpp-epoll-process") == 0) {
+                   ASSERT(0);
+           }
       p->flags &= ~VLIB_PROCESS_IS_RUNNING;
       pool_put_index (nm->suspended_process_frames,
                      p->suspended_process_frame_index);

 

I am not able to figure out why this process node is suspended in some 
scenarios. Can you please help me by providing some pointers to debug and 
resolve this issue?

 

Hi Jinsh,

I applied your patch to my code. The issue is not solved with your patch. Thank 
you for helping me out.

 

Thanks and Regards,

Sudhir

 

 

On Fri, Mar 3, 2023 at 12:53 PM Sudhir CR via lists.fd.io 
<sudhir=rtbrick....@lists.fd.io> wrote:

Hi Chetan,

In our case we observe this issue only occasionally; the exact steps to 
recreate it are not known.

I made changes to our process node as suggested by Dave, and I am trying to 
recreate the issue with these changes.

I will update this mail thread with my results and findings soon.

 

Thanks and Regards,

Sudhir

 

On Fri, Mar 3, 2023 at 12:37 PM chetan bhasin <chetan.bhasin...@gmail.com> 
wrote:

Hi Sudhir,

 

Is your issue resolved?

 

We are actually facing the same issue on vpp.2106. 

In our case "api-rx-ring" is not getting called.

In our use case, workers call some functions in main-thread context, which 
generates RPC messages, and the memory for them is allocated from the API segment.

As a result the API-segment memory gets fully used, which leads to a crash.

 

Thanks,

Chetan 

 

On Mon, Feb 20, 2023, 18:24 Sudhir CR via lists.fd.io 
<sudhir=rtbrick....@lists.fd.io> wrote:

Hi Dave,

Thank you very much for your inputs. I will try this out and get back to you 
with the results.

 

Regards,

Sudhir 

 

On Mon, Feb 20, 2023 at 6:01 PM Dave Barach <v...@barachs.net> wrote:

Please try something like this, to eliminate the possibility that some bit of 
code is sending this process an event. It’s not a good idea to skip the 
vec_reset_length (event_data) step.

 

while (1)
{
   uword event_type, * event_data = 0;
   int i;

   vlib_process_wait_for_event_or_clock (vm, 1e-2 /* 10 ms */);

   event_type = vlib_process_get_events (vm, &event_data);

   switch (event_type) {
   case ~0: /* handle timer expirations */
       rtb_event_loop_run_once ();
       break;

   default: /* bug! */
       ASSERT (0);
   }

   vec_reset_length(event_data);
}

 

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Sudhir CR via 
lists.fd.io
Sent: Monday, February 20, 2023 4:02 AM
To: vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] process node suspended indefinitely

 

Hi Dave,
Thank you for your response and help. 

 

Please find the additional details below.

VPP Version 21.10


We are creating a process node, rtb-vpp-epoll-process, to handle control-plane 
events such as interface add/delete and route add/delete.
This process node waits for 10 ms (it is not interested in any events); once 
the 10 ms expire, it processes the control-plane events mentioned above.

The code snippet looks like this:

 

```

static uword
rtb_vpp_epoll_process (vlib_main_t *vm,
                       vlib_node_runtime_t *rt,
                       vlib_frame_t *f)
{

    ...
    ...
    while (1) {
        /* Sleep until an event arrives or the 10 ms timer expires. */
        vlib_process_wait_for_event_or_clock (vm, 10e-3);
        /* Event data is not used here; any pending events are discarded. */
        vlib_process_get_events (vm, NULL);

        rtb_event_loop_run_once();   /* control-plane event handling */
    }
}
``` 

What we observed is that sometimes (under high control-plane load, such as a 
request to install more routes) "rtb-vpp-epoll-process" is suspended and never 
scheduled again. We found this using "show runtime rtb-vpp-epoll-process": the 
suspends counter in its output stops incrementing.

Show runtime output in the working case:


```
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
rtb-vpp-epoll-process           any wait                 0               0          192246          1.91e6            0.00
DBGvpp# 

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
rtb-vpp-epoll-process           any wait                 0               0          193634          1.89e6            0.00
DBGvpp# 

``` 

Show runtime output in the issue case:
```

DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00
DBGvpp# show runtime rtb-vpp-epoll-process
             Name                 State         Calls          Vectors        Suspends         Clocks       Vectors/Call
rtb-vpp-epoll-process           any wait                 0               0           81477          7.08e6            0.00

```
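
The by-eye comparison of the two outputs can be written down as a small check. This is only an illustrative sketch: process_looks_stalled is a hypothetical helper, not a VPP API. The time between the two "show runtime" samples is not recorded in the outputs above, so it is a parameter the caller has to supply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of VPP: decide whether a process node looks
 * stalled from two samples of its "Suspends" counter.
 */
static bool
process_looks_stalled (uint64_t suspends_before, uint64_t suspends_after,
                       double elapsed_seconds, double wait_seconds)
{
  assert (wait_seconds > 0.0);

  /* Too little time between samples: a healthy node may not have woken up. */
  if (elapsed_seconds < 2.0 * wait_seconds)
    return false;

  /* A healthy node suspends once per wait, so the counter must advance. */
  return suspends_after == suspends_before;
}
```

With a 10 ms wait and samples taken, say, one second apart (an assumed gap), the working case (192246 then 193634) is reported as not stalled, while the issue case (81477 then 81477) is reported as stalled.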

Other process nodes such as lldp-process, ip4-neighbor-age-process and 
ip6-ra-process are running without any issue. Only the "rtb-vpp-epoll-process" 
process node is suspended forever. 

 

Please let me know if any additional information is required.

Hi Jinsh,
Thanks for pointing me to the issue you faced. The issue I am facing looks 
similar.
I will verify with the given patch.


Thanks and Regards,

Sudhir

 

On Sun, Feb 19, 2023 at 6:19 AM jinsh11 <jins...@chinatelecom.cn> wrote:

Hi,

I have the same problem: the bfd process node stopped running. I raised this 
issue:

https://lists.fd.io/g/vpp-dev/message/22380
I think there is a problem with the process scheduling module when using the 
timer wheel.

 

 

NOTICE TO RECIPIENT This e-mail message and any attachments are confidential 
and may be privileged. If you received this e-mail in error, any review, use, 
dissemination, distribution, or copying of this e-mail is strictly prohibited. 
Please notify us immediately of the error by return e-mail and please delete 
this message from your system. For more information about Rtbrick, please visit 
us at www.rtbrick.com <http://www.rtbrick.com> 

 

 


View/Reply Online (#22690): https://lists.fd.io/g/vpp-dev/message/22690