Hi,

I saw VPP crash several times while running tests to evaluate IPsec
performance. The last upstream commit in my build of VPP is
'fd77f8c00 quic: remove cmake --target'. The tests ran on a C3000 with an
onboard QAT. I repeated the tests with the QAT removed from the device
whitelist in startup.conf (using async crypto with sw_scheduler) and the
same crash occurred.

The relevant part of the stack trace looks like this:

#8  0x00007fdbb4006459 in os_out_of_memory () at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/unix-misc.c:221
#9  0x00007fdbb400d1fb in clib_mem_alloc_aligned_at_offset
(size=2305843009213692256, align=8, align_offset=8,
os_out_of_memory_on_failure=1) at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/mem.h:243
#10 vec_resize_allocate_memory (v=0x7fdb36a9b7f0,
length_increment=288230376151711515, data_bytes=2305843009213692256,
header_bytes=8, data_align=8, numa_id=255) at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/vec.c:111
#11 0x00007fdbb60efe01 in _vec_resize_inline (v=0x7fdb36a9b7f0,
length_increment=288230376151711515, data_bytes=2305843009213692248,
header_bytes=0, data_align=8, numa_id=255) at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/vec.h:170
#12 clib_bitmap_ori_notrim (ai=0x7fdb36a9b7f0, i=18446744073709537927) at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/bitmap.h:643
#13 vnet_crypto_async_free_frame (vm=0x7fdb356f7a80, frame=0x7fdb3461c280)
at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/crypto.h:585
#14 crypto_dequeue_frame (vm=0x7fdb356f7a80, node=0x7fdb36bbd280,
ct=0x7fdb33537f80, hdl=0x7fdb2bc32810 <cryptodev_raw_dequeue>, n_cache=1,
n_total=0x7fdb145053dc) at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/node.c:135
#15 crypto_dispatch_node_fn (vm=0x7fdb356f7a80, node=0x7fdb36bbd280,
frame=0x0) at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/node.c:166
#16 0x00007fdbb4b789e5 in dispatch_node (vm=0x7fdb356f7a80,
node=0x7fdb36bbd280, type=VLIB_NODE_TYPE_INPUT,
dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0,
last_time_stamp=207016971809128) at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vlib/main.c:1024
#17 vlib_main_or_worker_loop (vm=0x7fdb356f7a80, is_main=0) at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vlib/main.c:1618

In vnet_crypto_async_free_frame() it appears that pool_put() is being
called with a frame pointer that does not belong to ct->frame_pool (the
computed index is negative):

(gdb) frame 13
#13 vnet_crypto_async_free_frame (vm=0x7fdb356f7a80, frame=0x7fdb3461c280)
at
/usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/crypto.h:585
585  pool_put (ct->frame_pool, frame);
(gdb) p frame - ct->frame_pool
$1 = -13689
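
The large values in frames #10 to #12 are consistent with that negative
index being reinterpreted as an unsigned 64-bit bit position. The sketch
below reproduces the arithmetic. It is not VPP code: the helper names are
made up for illustration, and it assumes a 64-bit uword, 64-bit bitmap
words, and a free bitmap that already held 16 words (1024 bits), which is
inferred from the length_increment in the trace rather than read from the
core.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of vppinfra. */

/* The signed pool index reinterpreted as an unsigned 64-bit value,
   which is how a bitmap "set bit i" call would see it. */
static uint64_t
index_as_unsigned (int64_t pool_index)
{
  return (uint64_t) pool_index;
}

/* Number of 64-bit words a bitmap needs in order to hold the given bit. */
static uint64_t
words_needed (uint64_t bit)
{
  return bit / 64 + 1;
}

/* Words that must be added to a bitmap that currently has current_words. */
static uint64_t
length_increment (uint64_t bit, uint64_t current_words)
{
  assert (words_needed (bit) > current_words);
  return words_needed (bit) - current_words;
}

/* Total data bytes of the resized bitmap (8 bytes per word). */
static uint64_t
data_bytes (uint64_t bit)
{
  return words_needed (bit) * sizeof (uint64_t);
}
```

With these assumptions, index_as_unsigned (-13689) matches the i= argument
in frame #12, length_increment (i, 16) matches length_increment in frames
#10 and #11, and data_bytes (i) matches data_bytes in frame #11 (frame #10
shows the same value plus the 8 header bytes).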

It seems that a pointer to a vnet_crypto_async_frame_t may have been stored
by the crypto engine and, before it could be dequeued, the pool filled and
had to be reallocated, leaving the stored pointer referring to the old
allocation. The per-thread frame_pools are initially allocated with room
for 1024 entries, and ct->frame_pool had a vector length of 1025 when the
crash occurred.
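
To make the suspected hazard concrete, here is a minimal sketch using a toy
growable array in plain C. None of it is VPP's pool implementation:
toy_frame_t, toy_pool_t and the functions are hypothetical. It only shows
that once the backing storage is reallocated the base address changes, so
anything that kept a raw element pointer is left with a stale address,
while an element index remains usable. Holding an index is one common way
to avoid this class of problem, not necessarily the right fix here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a frame; not vnet_crypto_async_frame_t. */
typedef struct
{
  uint32_t state;
  uint8_t pad[60];
} toy_frame_t;

/* Hypothetical growable array; not vppinfra's pool. */
typedef struct
{
  toy_frame_t *elts;
  size_t len;
  size_t cap;
} toy_pool_t;

/* Grow by allocating a new block and copying, so the base always moves.
   Returns 0 on success, -1 on allocation failure. */
static int
toy_pool_grow (toy_pool_t *p, size_t new_cap)
{
  toy_frame_t *n = calloc (new_cap, sizeof (*n));
  if (!n)
    return -1;
  if (p->len)
    memcpy (n, p->elts, p->len * sizeof (*n));
  free (p->elts);
  p->elts = n;
  p->cap = new_cap;
  return 0;
}

/* Returns the index of a new element, or (size_t) -1 on failure. */
static size_t
toy_pool_get (toy_pool_t *p)
{
  if (p->len == p->cap && toy_pool_grow (p, p->cap ? 2 * p->cap : 4) != 0)
    return (size_t) -1;
  return p->len++;
}

/* Returns 1 if the pool base moved during growth and the element is
   still reachable through its index; 0 otherwise. */
static int
index_survives_growth (void)
{
  toy_pool_t pool = { 0 };
  size_t idx = toy_pool_get (&pool);
  if (idx == (size_t) -1)
    return 0;
  pool.elts[idx].state = 42;

  /* Remember where the storage was before it fills up. */
  uintptr_t old_base = (uintptr_t) pool.elts;

  /* Add elements until the initial capacity of 4 is exceeded. */
  while (pool.len <= 4)
    if (toy_pool_get (&pool) == (size_t) -1)
      {
        free (pool.elts);
        return 0;
      }

  int moved = ((uintptr_t) pool.elts != old_base);
  int ok = moved && pool.elts[idx].state == 42;
  free (pool.elts);
  return ok;
}
```

In this toy, a consumer that had saved &pool.elts[idx] before the growth
would be holding an address inside the freed block, and subtracting the new
base from it would yield a meaningless index, which is the kind of value
gdb printed above.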

Can anyone with knowledge of the async crypto code confirm or refute that
theory? Anyone have suggestions on the best way to fix this?

Thanks,
-Matt