Re: [vpp-dev] VPP crashes because of API segment exhaustion

2023-01-25 Thread Alexander Chernavin via lists.fd.io
Hello Florin,

> 
> Agreed that it looks like vl_api_clnt_process sleeps, probably because it
> hit a queue size of 0, but memclnt_queue_callback or the timeout, albeit
> 20s is a lot, should wake it up.

It doesn't look like vl_api_clnt_process would have woken up later. First, QUEUE_SIGNAL_EVENT had already been signaled and vm->queue_signal_pending was set, and memclnt_queue_callback() is only triggered if vm->queue_signal_pending is unset. Thus, no new calls to memclnt_queue_callback() would have happened while vm->queue_signal_pending remained set. Second, the timer id that vl_api_clnt_process holds belongs to another process node. Even if the timer were still valid, it would have triggered that other process node instead.

> 
> So, given that QUEUE_SIGNAL_EVENT is set, the only thing that comes to
> mind is that maybe somehow vlib_process_signal_event context gets
> corrupted. Could you run a debug image and see if anything asserts? Is
> vlib_process_signal_event called by chance from a worker?

It's problematic to run a debug version of VPP on the affected instances.

There are no signs of vlib_process_signal_event() being called from a worker 
thread. If you look at memclnt_queue_callback(), it is called only in the main 
thread.

Regards,
Alexander

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#22508): https://lists.fd.io/g/vpp-dev/message/22508
Mute This Topic: https://lists.fd.io/mt/96500275/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/leave/1480452/21656/631435203/xyzzy 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-



[vpp-dev] VPP crashes because of API segment exhaustion

2023-01-24 Thread Alexander Chernavin via lists.fd.io
Hello all,

We are experiencing VPP crashes that occur a few days after startup
because of API segment exhaustion. Increasing the API segment size to 256 MB
didn't stop the crashes from occurring.

Can you please take a look at the description below and tell us whether you
have seen similar issues or have any ideas about what the cause may be?

Given:

   - VPP 22.10
   - 2 worker threads
   - API segment size is 256MB
   - ~893k IPv4 routes and ~160k IPv6 routes added


Backtrace:

> [..]
> #32660 0x55b02f606896 in os_panic () at
> /home/jenkins/tnsr-pkgs/work/vpp/src/vpp/vnet/main.c:414
> #32661 0x7fce3c0ec740 in clib_mem_heap_alloc_inline (heap=0x0,
> size=, align=8,
> os_out_of_memory_on_failure=1) at
> /home/jenkins/tnsr-pkgs/work/vpp/src/vppinfra/mem_dlmalloc.c:613
> #32662 clib_mem_alloc (size=)
> at /home/jenkins/tnsr-pkgs/work/vpp/src/vppinfra/mem_dlmalloc.c:628
> #32663 0x7fce3dc4ee6f in vl_msg_api_alloc_internal
> (vlib_rp=0x130026000, nbytes=69, pool=0,
> may_return_null=0) at
> /home/jenkins/tnsr-pkgs/work/vpp/src/vlibmemory/memory_shared.c:179
> #32664 0x7fce3dc592cd in vl_api_rpc_call_main_thread_inline
> (force_rpc=0 '\000',
> fp=, data=, data_length=)
> at /home/jenkins/tnsr-pkgs/work/vpp/src/vlibmemory/memclnt_api.c:617
> #32665 vl_api_rpc_call_main_thread (fp=0x7fce3c74de70 ,
> data=0x7fcc372bdc00 "& \001$ ", data_length=28)
> at /home/jenkins/tnsr-pkgs/work/vpp/src/vlibmemory/memclnt_api.c:641
> #32666 0x7fce3cc7fe2d in icmp6_neighbor_solicitation_or_advertisement
> (vm=0x7fccc0864000,
> frame=0x7fcccd7d2d40, is_solicitation=1, node=)
> at /home/jenkins/tnsr-pkgs/work/vpp/src/vnet/ip6-nd/ip6_nd.c:163
> #32667 icmp6_neighbor_solicitation (vm=0x7fccc0864000,
> node=0x7fccc09e3380, frame=0x7fcccd7d2d40)
> at /home/jenkins/tnsr-pkgs/work/vpp/src/vnet/ip6-nd/ip6_nd.c:322
> #32668 0x7fce3c1a2fe0 in dispatch_node (vm=0x7fccc0864000,
> node=0x7fce3dc74836,
> type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING,
> frame=0x7fcccd7d2d40,
> last_time_stamp=4014159654296481) at
> /home/jenkins/tnsr-pkgs/work/vpp/src/vlib/main.c:961
> #32669 dispatch_pending_node (vm=0x7fccc0864000, pending_frame_index=7,
> last_time_stamp=4014159654296481) at
> /home/jenkins/tnsr-pkgs/work/vpp/src/vlib/main.c:1120
> #32670 vlib_main_or_worker_loop (vm=0x7fccc0864000, is_main=0)
> at /home/jenkins/tnsr-pkgs/work/vpp/src/vlib/main.c:1589
> #32671 vlib_worker_loop (vm=vm@entry=0x7fccc0864000)
> at /home/jenkins/tnsr-pkgs/work/vpp/src/vlib/main.c:1723
> #32672 0x7fce3c1f581a in vlib_worker_thread_fn (arg=0x7fccbdb11b40)
> at /home/jenkins/tnsr-pkgs/work/vpp/src/vlib/threads.c:1579
> #32673 0x7fce3c1f02c1 in vlib_worker_thread_bootstrap_fn
> (arg=0x7fccbdb11b40)
> at /home/jenkins/tnsr-pkgs/work/vpp/src/vlib/threads.c:418
> #32674 0x7fce3be3db43 in start_thread (arg=) at
> ./nptl/pthread_create.c:442
> #32675 0x7fce3becfa00 in clone3 () at
> ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
>

According to the backtrace, an IPv6 neighbor is being learned. Since the
packet was received on a worker thread, the neighbor information is passed
to the main thread by making an RPC call (which works via the API). For
this, an API message for the RPC call is allocated from the API segment
(as a client), but the allocation fails because no memory is available.

Inspecting the API rings after the crash shows that they are all filled
with VL_API_RPC_CALL messages. It can also be seen that there are a lot
of pending RPC requests (vm->pending_rpc_requests has ~3.3M items).
Thus, API segment exhaustion occurs because of a huge number of pending
RPC messages.

RPC messages are processed in a process node called api-rx-from-ring (the
function is called vl_api_clnt_process), and process nodes are handled in
the main thread only.

The first hypothesis is that the main loop of the main thread pauses for such
a long time that a huge number of pending RPC messages are accumulated by the
worker threads (which keep running). But this doesn't seem to be confirmed by
inspecting vm->loop_interval_start of all threads after the crash: if it were
true, vm->loop_interval_start of the worker threads would have been greater
than that of the main thread.

> (gdb) p vlib_global_main.vlib_mains[0]->loop_interval_start
> $117 = 197662.55595008997
> (gdb) p vlib_global_main.vlib_mains[1]->loop_interval_start
> $119 = 197659.82887979984
> (gdb) p vlib_global_main.vlib_mains[2]->loop_interval_start
> $121 = 197659.93944517447
>

The second hypothesis is that pending RPC messages stop being processed
completely at some point and keep accumulating while the memory permits.
This seems to be confirmed by inspecting the process node after the crash:
vm->main_loop_count is much bigger than the process node's
main_loop_count_last_dispatch (the difference is ~50M iterations).
Although, according to the flags, the node is waiting for time

Re: [vpp-dev] NAT ED empty users dump #nat #nat44

2020-05-13 Thread Alexander Chernavin via lists.fd.io
Ole,

OK, nat44_user_dump is not going to return anything in NAT ED.

nat44_user_session_dump has required fields (ip_address and vrf_id) that don't 
allow you to dump all the sessions. If those fields were made optional, that 
should work.

Adding optional sort and limit fields is a good idea and might be helpful.

Thanks,
Alexander


Re: [vpp-dev] NAT ED empty users dump #nat #nat44

2020-05-13 Thread Alexander Chernavin via lists.fd.io
Hello Ole,

I'm not sure I get your question right.

The use case is being able to see NAT pool utilization and to debug NAT 
sessions. I don't think it's a specific use case.

NAT44 ED sessions:
 thread 0 vpp_main: 3 sessions 
   i2o 10.255.10.100 proto icmp port 1593 fib 0
   o2i 10.100.200.14 proto icmp port 16253 fib 0
  external host 10.255.30.100:0
  index 0
  last heard 27.67
  total pkts 8, total bytes 728
  dynamic translation

   i2o 10.255.10.100 proto udp port 45177 fib 0
   o2i 10.100.200.14 proto udp port 18995 fib 0
  external host 10.255.30.100:8161
  index 1
  last heard 32.66
  total pkts 2, total bytes 106
  dynamic translation

   i2o 10.255.10.100 proto tcp port 59664 fib 0
   o2i 10.100.200.14 proto tcp port 53893 fib 0
  external host 10.255.30.100:22
  index 2
  last heard 36.64
  total pkts 9, total bytes 635
  dynamic translation
The way I see it, there was an API that worked for both ED and non-ED NAT 
modes (except for deterministic). The ED mode logic has changed but the API 
remains the same: it still works for non-ED NAT modes but has stopped working 
for ED mode. I think that's inconsistent.

Thanks,
Alexander


Re: [vpp-dev] NAT ED empty users dump #nat #nat44

2020-05-12 Thread Alexander Chernavin via lists.fd.io
Klement,

I would prefer that the existing API keep working.

I expect millions of sessions, and it's clear that dumping them all is a 
blocker, but during debugging there are not so many of them.

Thanks,
Alexander


Re: [vpp-dev] NAT ED empty users dump #nat #nat44

2020-05-12 Thread Alexander Chernavin via lists.fd.io
Klement,

Basically, to print statistics and debug info: the number of users, how many 
sessions each user consumes, and which session was created for which 
communication.

Thanks,
Alexander


Re: [vpp-dev] NAT ED empty users dump #nat #nat44

2020-05-12 Thread Alexander Chernavin via lists.fd.io
Hello Klement,

I want to list all NAT sessions. In order to do that, I used to call 
VL_API_NAT44_USER_DUMP. After that, I had all the users and could call 
VL_API_NAT44_USER_SESSION_DUMP to get the sessions for every user.

Now VL_API_NAT44_USER_DUMP returns nothing in ED mode, so I don't know what 
the users are. At the same time, VL_API_NAT44_USER_SESSION_DUMP requires the 
ip_address and vrf_id arguments. So if you don't know the users, you cannot 
get the sessions.

Thanks,
Alexander


[vpp-dev] NAT ED empty users dump #nat #nat44

2020-05-12 Thread Alexander Chernavin via lists.fd.io
Hello,

As I understand it, the "users" concept has been removed from NAT ED, and now 
vl_api_nat44_user_dump_t returns nothing in ED mode. 
vl_api_nat44_user_session_dump_t returns sessions only if you know the user 
you are requesting sessions for, but you can't get the user list. Therefore 
this chain no longer works: dump all users, then dump all sessions of those 
users.

I think the user dump code could build the user list based on the sessions, 
but we would need to collect these fields: IP address, VRF id, and the number 
of static and dynamic sessions. For a large number of sessions, it might take 
a long time before the first user could be sent. Maintaining a user list 
would probably be cheaper.

How do you think vl_api_nat44_user_dump_t can be fixed for NAT ED?


Re: [vpp-dev] Events for IP address addition/deletion on an interface #vpp

2019-06-18 Thread Alexander Chernavin via Lists.Fd.Io
Hi Ole,

> 
> Where is the IP address configuration coming from? If it's your
> application that configures the addresses, shouldn't the control plane
> application know that itself?

There are several independent instances of the application. If one of them 
configures the addresses, the others should know about it.

Regards,
Alexander


[vpp-dev] Events for IP address addition/deletion on an interface #vpp

2019-06-18 Thread Alexander Chernavin via Lists.Fd.Io
Hello,

I have an application that is a client to the shared memory API and I would 
like to know when an IP address has been added or deleted on an interface. I 
see that there is sw_interface_event that can notify a client about interface 
admin/link status changes as well as interface deletion.

If I extend sw_interface_event and add ipv4_address/ipv6_address flags 
indicating the corresponding change, would upstream accept this?

Regards,
Alexander