Hoi,
I am running VPP on a few aarch64 machines and observed regular crashes
with a stack trace that suggests an IPv6 FIB lookup issue:
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: #0 0x0000fc1b8ac808f8
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: from linux-vdso.so.1
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: #1 0x0000fc1b8998ac3c ip6_fib_table_lookup_exact_match + 0x3c
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: from /lib/aarch64-linux-gnu/libvnet.so.26.06
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: #2 0x0000fc1b89a21d2c proxy_arp_intfc_walk + 0x4030
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: from /lib/aarch64-linux-gnu/libvnet.so.26.06
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: #3 0x0000fc1b89101658 vlib_exit_with_status + 0x808
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: from /lib/aarch64-linux-gnu/libvlib.so.26.06
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: #4 0x0000fc1b89103eb0 vlib_exit_with_status + 0x3060
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: from /lib/aarch64-linux-gnu/libvlib.so.26.06
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: #5 0x0000fc1b8912d88c vlib_worker_thread_bootstrap_fn + 0x6c
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: from /lib/aarch64-linux-gnu/libvlib.so.26.06
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: #6 0x0000fc1b88de595c pthread_condattr_setpshared + 0x5bc
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: from /lib/aarch64-linux-gnu/libc.so.6
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: #7 0x0000fc1b88e4bb4c __clone + 0x2cc
Jun 04 19:42:29 dpu0-ddln0 vpp[1179016]: from /lib/aarch64-linux-gnu/libc.so.6
I was lucky enough to get two coredumps out of it, which I fed to Claude
and it came back with an analysis that pointed to me having changed the
link-local address on an interface:
The interface did have IPv6/ND enabled — otherwise the
icmp6_neighbor_solicitation node would never have run. What it
transiently lacked was its per-interface link-local FIB table
(ilt_fibs[sw_if_index]), which is a structure distinct from "IPv6 is
enabled." That table is created lazily when a link-local address is
added (ip6_ll_fib_create) and freed, with the slot reset to ~0, when
the last FIB_SOURCE_IP6_ND entry is removed
(ip6_ll_table.c:150-154). Crucially, ip6_link_set_local_address()
(ip6_link.c:359-362) changes a link-local address by doing
delete-then-update: it removes the old LL prefix (which frees the
FIB and sets ilt_fibs = ~0), then re-adds the new one (which
recreates it). The ND node stays enabled across this whole sequence.
So the NS did not arrive anywhere unexpected — it hit a
normally-configured interface during the brief window in which its
LL FIB had been torn down and not yet rebuilt. A worker thread
forwarding an ordinary link-local NS in that window read ilt_fibs ==
~0 and segfaulted; two crashes 31 minutes apart is consistent with
two separate LL-address-change events, matching your note that you'd
been changing link-local addresses.
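To make the window described above concrete, here is a toy single-threaded
model of the delete-then-re-add sequence. It is a sketch, not VPP code: the
array ilt_fibs is named after the VPP structure, but INVALID_INDEX,
ll_prefix_add, ll_prefix_delete, set_local_address and the hook are all
hypothetical names I made up, and the hook only stands in for a worker
thread that happens to look at the slot in the middle of the change.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_INDEX (~0u) /* stand-in for the ~0 "no table" marker */
#define N_INTERFACES 4

/* Toy stand-in for ilt_fibs[]: one link-local FIB index per interface. */
uint32_t ilt_fibs[N_INTERFACES] = {
  INVALID_INDEX, INVALID_INDEX, INVALID_INDEX, INVALID_INDEX
};
uint32_t next_fib_index = 0;

/* Hypothetical hook: called between delete and re-add, playing the role
 * of a worker thread that reads the slot during the window. */
typedef void (*window_hook_t) (uint32_t sw_if_index);

/* Lazily create the per-interface table when an LL prefix is added. */
void
ll_prefix_add (uint32_t sw_if_index)
{
  if (ilt_fibs[sw_if_index] == INVALID_INDEX)
    ilt_fibs[sw_if_index] = next_fib_index++;
}

/* Removing the last LL prefix frees the table and resets the slot. */
void
ll_prefix_delete (uint32_t sw_if_index)
{
  ilt_fibs[sw_if_index] = INVALID_INDEX;
}

/* Change the LL address as delete-then-add; the slot is invalid between
 * the two steps, which is when the hook gets to observe it. */
void
set_local_address (uint32_t sw_if_index, window_hook_t hook)
{
  ll_prefix_delete (sw_if_index);
  if (hook)
    hook (sw_if_index);
  ll_prefix_add (sw_if_index);
}

/* Records what a reader would have seen inside the window. */
uint32_t seen_in_window;

void
record_window (uint32_t sw_if_index)
{
  seen_in_window = ilt_fibs[sw_if_index];
}
```

Calling ll_prefix_add (1) and then set_local_address (1, record_window)
leaves seen_in_window equal to INVALID_INDEX even though the slot is valid
both before and after the call, which is the state the analysis says the
worker thread ran into.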
The debugging session made my head spin a little, as I'm not very good
with gdb, but here is what the cores do prove: an unguarded ~0 returned by
ip6_ll_fib_get() in the link-local NS branch causes an out-of-bounds
pool_elt_at_index; the trigger is an NS for a link-local target arriving
while ilt_fibs[sw_if_index] == ~0.
The fix is a simple check of ip6_ll_fib_get() before doing the FIB
lookup, just as happens a few lines further down in the same
vnet/ip6-nd/ip6_nd.c file.
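As a rough illustration of that kind of guard, here is a self-contained toy
version. It is only a sketch under my own assumptions and not the actual
change: toy_fib_t, toy_fib_pool and toy_ll_lookup_guarded are hypothetical
names, the fixed-size array stands in for a real pool, and INVALID_INDEX
plays the role of both the ~0 table marker and the "no entry" result.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_INDEX (~0u) /* stand-in for ~0 / "no entry" */
#define POOL_SIZE 2

/* Hypothetical toy FIB: each table just remembers one entry index. */
typedef struct
{
  uint32_t entry_index;
} toy_fib_t;

toy_fib_t toy_fib_pool[POOL_SIZE] = { { 10 }, { 20 } };

/* Check the table index before touching the pool. Without the check,
 * an index of ~0 would read far outside toy_fib_pool. The range test
 * already covers ~0; the explicit comparison is kept for readability. */
uint32_t
toy_ll_lookup_guarded (uint32_t fib_index)
{
  if (fib_index == INVALID_INDEX || fib_index >= POOL_SIZE)
    return INVALID_INDEX; /* caller treats this as "no entry" and drops */

  return toy_fib_pool[fib_index].entry_index;
}
```

With this shape, a caller that gets INVALID_INDEX back can drop the packet
instead of dereferencing a table that does not exist at that moment.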
A candidate fix is in https://gerrit.fd.io/r/c/vpp/+/46038 and I have
not observed crashes since applying it. I'm not certain, though, whether
returning FIB_NODE_INDEX_INVALID and dropping the packet is the right
call in this case, or whether we have to do something more.
I have the coredumps and symbols here if somebody wants to take a closer
look.
groet,
Pim
--
Pim van Pelt <[email protected]>
PBVP1-RIPE https://ipng.ch/
View/Reply Online (#27063): https://lists.fd.io/g/vpp-dev/message/27063