Could we please see the faulting instruction, as well as the vector register 
contents involved?

As in "x/i $pc", and the ymmX registers involved?

If the vector instruction requires alignment, "movaps" or similar, it wouldn't 
be a shock to discover an unaligned address. We've already found and fixed a 
few of those since switching to clang, and I have to say that "va + 4" raises 
all sorts of aligned vector instruction red flags...
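
To make the alignment concern concrete, here is a minimal sketch in plain C. It is not VPP code: is_aligned and bswap64x4_unaligned are hypothetical helper names, and __builtin_bswap64 is a gcc/clang builtin. Whether "va + 4" is actually 32-byte aligned depends on the element type and base alignment of va, which the excerpt does not show. The first helper checks an address against a power-of-two boundary; the second byte-swaps four 64-bit words via memcpy, which makes no alignment assumption and so cannot be lowered to an alignment-faulting load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if p sits on an `align`-byte boundary.
   `align` must be a non-zero power of two. */
static int is_aligned(const void *p, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: byte-swap four consecutive 64-bit words in place.
   The words are copied through a local array with memcpy, so the compiler
   may not assume that p is 32-byte aligned. */
static void bswap64x4_unaligned(void *p)
{
    uint64_t w[4];

    memcpy(w, p, sizeof w);
    for (int i = 0; i < 4; i++)
        w[i] = __builtin_bswap64(w[i]);
    memcpy(p, w, sizeof w);
}
```

A check such as is_aligned (va + 4, 32) just before the faulting line, or printing the address in gdb, would show whether the pointer really is misaligned.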


FWIW... Dave

-----Original Message-----
From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Elias Rudberg
Sent: Wednesday, May 6, 2020 1:46 PM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] Segmentation fault in rdma_device_input_refill when using 
clang compiler

Hello VPP experts,

When trying to use the current master branch, we get a segmentation fault 
error. Here is what it looks like in gdb:

Thread 3 "vpp_wk_0" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fedf91fe700 (LWP 21309)]
rdma_device_input_refill (vm=0x7ff8a5d2f4c0, rd=0x7fedd35ed5c0, rxq=0x7ffff7edea80, is_mlx5dv=1)
    at vpp/src/plugins/rdma/input.c:115
115               *(u64x4 *) (va + 4) = u64x4_byte_swap (*(u64x4 *) (va + 4));
(gdb) bt
#0  rdma_device_input_refill (vm=0x7ff8a5d2f4c0, rd=0x7fedd35ed5c0, 
rxq=0x7ffff7edea80, is_mlx5dv=1)
    at vpp/src/plugins/rdma/input.c:115
#1  0x00007fffabbbb84d in rdma_device_input_inline (vm=0x7ff8a5d2f4c0, 
node=0x7ff5ccdfee00, frame=0x0, rd=0x7fedd35ed5c0, qid=0, use_mlx5dv=1)
    at vpp/src/plugins/rdma/input.c:622
#2  0x00007fffabbbae44 in rdma_input_node_fn_skx (vm=0x7ff8a5d2f4c0, 
node=0x7ff5ccdfee00, frame=0x0)
    at vpp/src/plugins/rdma/input.c:647
#3  0x00007ffff60e3155 in dispatch_node (vm=0x7ff8a5d2f4c0, 
node=0x7ff5ccdfee00, type=VLIB_NODE_TYPE_INPUT, 
dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0, 
    last_time_stamp=66486783453597600) at vpp/src/vlib/main.c:1235
#4  0x00007ffff60ddbf5 in vlib_main_or_worker_loop (vm=0x7ff8a5d2f4c0,
is_main=0) at vpp/src/vlib/main.c:1815
#5  0x00007ffff60dd227 in vlib_worker_loop (vm=0x7ff8a5d2f4c0) at
vpp/src/vlib/main.c:1996
#6  0x00007ffff61345a1 in vlib_worker_thread_fn (arg=0x7fffb74ea980) at
vpp/src/vlib/threads.c:1795
#7  0x00007ffff5531954 in clib_calljmp () at
vpp/src/vppinfra/longjmp.S:123
#8  0x00007fedf91fdce0 in ?? ()
#9  0x00007ffff612cd53 in vlib_worker_thread_bootstrap_fn
(arg=0x7fffb74ea980) at vpp/src/vlib/threads.c:584
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

This segmentation fault happens the same way every time I try to start VPP.

This is on Ubuntu 18.04.4, using the rdma plugin with Mellanox mlx5 NICs and an 
Intel Xeon Gold 6126 CPU.

I have looked back at recent changes and found that this problem started with 
the commit 4ba16a44 "misc: switch to clang-9" dated April 28. Before that we 
could use the master branch without this problem.

Changing back to gcc by removing clang in src/CMakeLists.txt makes the error go 
away. However, the gcc build then hits a different problem, a "symbol lookup 
error" for crypto_native_plugin.so: undefined symbol: 
crypto_native_aes_cbc_init_avx512. (That problem disappears when the 
crypto_native plugin is disabled.)

So, two problems:

(1) The segmentation fault itself, which perhaps indicates a bug somewhere, 
although it seems to appear only with clang and not with gcc

(2) The "undefined symbol: crypto_native_aes_cbc_init_avx512" problem when 
trying to use gcc instead of clang

What do you think about these?

As a short-term fix, is removing clang in src/CMakeLists.txt reasonable or is 
there a better/easier workaround?

Does anyone else use the rdma plugin when compiling using clang -- perhaps that 
combination triggers this problem?

Best regards,
Elias
