Hi Fan,

Thanks for working on it!
I found a separate related issue which exacerbates the problem: async crypto frames leak when more than one worker thread is in use. The ESP encrypt/decrypt nodes allocate a frame for the crypto operation to be applied to a packet if a frame has not already been allocated. After the frame is allocated, it may be decided that the packet should be handed off to another thread or dropped for some other reason. When that happens, a frame that never had any packets/operations added to it is never submitted, and it is also never freed. This leak is what caused the pool to need to be expanded in my test environment.

I uploaded a patch to gerrit to try to fix this: https://gerrit.fd.io/r/c/vpp/+/32596. If you have any feedback on it, that would be appreciated.

-Matt

On Tue, Jun 8, 2021 at 7:14 AM Zhang, Roy Fan <roy.fan.zh...@intel.com> wrote:

> Hi Matthew and Florin,
>
> We managed to recreate the problem.
>
> The cause is most likely that the pool got expanded while there were
> pending frames left to be dequeued. Once such a frame is dequeued later,
> returning it to the pool causes a seg-fault because the pool is in a new
> memory location.
>
> We are working on the fix – currently in the validation stage. If
> everything is fine we will upstream it by tomorrow evening.
>
> Regards,
> Fan
>
> *From:* vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> *On Behalf Of* Matthew Smith via lists.fd.io
> *Sent:* Thursday, May 27, 2021 2:02 PM
> *To:* Florin Coras <fcoras.li...@gmail.com>
> *Cc:* vpp-dev <vpp-dev@lists.fd.io>
> *Subject:* Re: [vpp-dev] IPsec crash with async crypto
>
> Hi Florin!
>
> It appears that the quic plugin is disabled in my build:
>
> 2021/05/27 07:44:49:044 notice plugin/load Plugin disabled (default): quic_plugin.so
>
> I didn't mean to give the impression that I thought this issue was caused
> by quic.
> My mention of the quic commit was just intended to indicate how up to
> date my build is with the gerrit master branch, in case there were
> recent/pending patches that people know of that might be relevant. That
> quic commit is from about 2 weeks ago, which is the last time I merged
> upstream changes.
>
> Thanks,
> -Matt
>
> On Wed, May 26, 2021 at 5:58 PM Florin Coras <fcoras.li...@gmail.com> wrote:
>
> Hi Matt,
>
> Did you try checking if the quic plugin is loaded, just to see if there's
> a connection there?
>
> Regards,
> Florin
>
> On May 26, 2021, at 3:19 PM, Matthew Smith via lists.fd.io <mgsmith=netgate....@lists.fd.io> wrote:
>
> Hi,
>
> I saw VPP crash several times during some tests that were running to
> evaluate IPsec performance. The last upstream commit on my build of VPP
> is 'fd77f8c00 quic: remove cmake --target'. The tests ran on a C3000 with
> an onboard QAT. The tests were repeated with the QAT removed from the
> device whitelist in startup.conf (using async crypto with sw_scheduler)
> and the same thing happened.
> The relevant part of the stack trace looks like this:
>
> #8  0x00007fdbb4006459 in os_out_of_memory () at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/unix-misc.c:221
> #9  0x00007fdbb400d1fb in clib_mem_alloc_aligned_at_offset (size=2305843009213692256, align=8, align_offset=8, os_out_of_memory_on_failure=1) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/mem.h:243
> #10 vec_resize_allocate_memory (v=0x7fdb36a9b7f0, length_increment=288230376151711515, data_bytes=2305843009213692256, header_bytes=8, data_align=8, numa_id=255) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/vec.c:111
> #11 0x00007fdbb60efe01 in _vec_resize_inline (v=0x7fdb36a9b7f0, length_increment=288230376151711515, data_bytes=2305843009213692248, header_bytes=0, data_align=8, numa_id=255) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/vec.h:170
> #12 clib_bitmap_ori_notrim (ai=0x7fdb36a9b7f0, i=18446744073709537927) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vppinfra/bitmap.h:643
> #13 vnet_crypto_async_free_frame (vm=0x7fdb356f7a80, frame=0x7fdb3461c280) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/crypto.h:585
> #14 crypto_dequeue_frame (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, ct=0x7fdb33537f80, hdl=0x7fdb2bc32810 <cryptodev_raw_dequeue>, n_cache=1, n_total=0x7fdb145053dc) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/node.c:135
> #15 crypto_dispatch_node_fn (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, frame=0x0) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/node.c:166
> #16 0x00007fdbb4b789e5 in dispatch_node (vm=0x7fdb356f7a80, node=0x7fdb36bbd280, type=VLIB_NODE_TYPE_INPUT, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0, last_time_stamp=207016971809128) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vlib/main.c:1024
> #17 vlib_main_or_worker_loop (vm=0x7fdb356f7a80, is_main=0) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vlib/main.c:1618
>
> In vnet_crypto_async_free_frame() it appears that a call to pool_put()
> is trying to return a pointer to a pool that it is not a member of:
>
> (gdb) frame 13
> #13 vnet_crypto_async_free_frame (vm=0x7fdb356f7a80, frame=0x7fdb3461c280) at /usr/src/debug/vpp-21.01-568~g67ff5da46.el8.x86_64/src/vnet/crypto/crypto.h:585
> 585         pool_put (ct->frame_pool, frame);
> (gdb) p frame - ct->frame_pool
> $1 = -13689
>
> It seems like maybe a pointer to a vnet_crypto_async_frame_t was stored
> by the crypto engine, and before it could be dequeued the pool filled up
> and had to be reallocated. The per-thread frame_pools are allocated with
> room for 1024 entries initially, and ct->frame_pool had a vector length
> of 1025 when the crash occurred.
>
> Can anyone with knowledge of the async crypto code confirm or refute
> that theory? Anyone have suggestions on the best way to fix this?
>
> Thanks,
> -Matt