On Sat, Feb 7, 2026 at 6:35 AM Michal Luczaj <[email protected]> wrote:
>
> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe
> that the socket is properly set up, which would include having a defined
> peer. IOW, there's a window when unix_stream_bpf_update_proto() can be
> called on a socket which still has unix_peer(sk) == NULL.
>
> T0 bpf                           T1 connect
> ------                           ----------
>
>                                  WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> ...
> sk_pair = unix_peer(sk)
> sock_hold(sk_pair)
>                                  sock_hold(newsk)
>                                  smp_mb__after_atomic()
>                                  unix_peer(sk) = newsk
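>
> For reference, the faulting read is in unix_stream_bpf_update_proto()
> (net/unix/unix_bpf.c), which grabs the peer without checking for NULL.
> A simplified sketch, not the verbatim source:
>
>         struct sock *sk_pair;
>
>         /* Racy: connect() may not have published the peer yet. */
>         sk_pair = unix_peer(sk);
>         sock_hold(sk_pair);     /* NULL pointer dereference when sk_pair == NULL */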
>
> BUG: kernel NULL pointer dereference, address: 0000000000000080
> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
> Call Trace:
> sock_map_link+0x564/0x8b0
> sock_map_update_common+0x6e/0x340
> sock_map_update_elem_sys+0x17d/0x240
> __sys_bpf+0x26db/0x3250
> __x64_sys_bpf+0x21/0x30
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> The initial idea was to move the peer assignment _before_ the sk_state
> update[1], but that involved an additional memory barrier, and changing
> the hot path was rejected. Then a check during the proto update was
> considered[2], but a follow-up discussion[3] concluded that the root
> cause is sockmap taking the wrong lock.
>
> Thus, teach sockmap about the af_unix-specific locking: instead of the
> usual lock_sock() involving sock::sk_lock, af_unix protects critical
> sections under unix_state_lock() operating on unix_sock::lock.
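>
> For reference, unix_state_lock() is just a spinlock on the unix_sock;
> roughly, from include/net/af_unix.h:
>
>         #define unix_state_lock(s)      spin_lock(&unix_sk(s)->lock)
>         #define unix_state_unlock(s)    spin_unlock(&unix_sk(s)->lock)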
>
> [1]: https://lore.kernel.org/netdev/[email protected]/
> [2]: https://lore.kernel.org/netdev/[email protected]/
> [3]: https://lore.kernel.org/netdev/[email protected]/
>
> This patch also happens to fix a deadlock that may occur when
> bpf_iter_unix_seq_show()'s lock_sock_fast() takes the fast path and the
> iter prog attempts to update a sockmap, which then ends up spinning at
> sock_map_update_elem()'s bh_lock_sock() (see also the sketch after the
> lockdep splat below):
Hmm.. this seems to be a more general problem for bpf iter vs
sockmap: bpf_iter_{tcp,udp}_seq_show() also hold lock_sock(), so
this patch's solution does not help there. We need to resolve this
regardless of the socket family.
Also, I feel lock_sock() should be taken outside of
unix_state_lock(), since the former is usually sleepable. If some
future path ended up taking them in the reverse order, it would
trigger an ABBA deadlock and we would have to revisit this problem.
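
Roughly, the ordering I'm worried about (both call sites are
hypothetical, A = sk->sk_lock, B = unix_sk(sk)->lock):

        /* the order that works: sleepable lock taken first */
        lock_sock(sk);          /* A */
        unix_state_lock(sk);    /* B */

        /* a future path built on top of this patch's locking */
        unix_state_lock(sk);    /* B */
        lock_sock(sk);          /* A: may sleep under a spinlock, ABBA vs the above */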
>
> WARNING: possible recursive locking detected
> --------------------------------------------
> test_progs/1393 is trying to acquire lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0
>
> but task is already holding lock:
> ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(slock-AF_UNIX);
> lock(slock-AF_UNIX);
>
> *** DEADLOCK ***
>
> May be due to missing lock nesting notation
>
> 4 locks held by test_progs/1393:
> #0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
> #1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
> #2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
> #3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00
>
> Call Trace:
> dump_stack_lvl+0x5d/0x80
> print_deadlock_bug.cold+0xc0/0xce
> __lock_acquire+0x130f/0x2590
> lock_acquire+0x14e/0x2b0
> _raw_spin_lock+0x30/0x40
> sock_map_update_elem+0xdb/0x1f0
> bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
> bpf_iter_run_prog+0x5b9/0xb00
> bpf_iter_unix_seq_show+0x1f7/0x2e0
> bpf_seq_read+0x42c/0x10d0
> vfs_read+0x171/0xb20
> ksys_read+0xff/0x200
> do_syscall_64+0x6b/0x3a0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
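>
> To spell out the deadlock: lock_sock_fast() on an uncontended socket keeps
> sk->sk_lock.slock held while the iter prog runs, and sock_map_update_elem()
> then tries to grab that same slock via bh_lock_sock(). A rough sketch of the
> nesting, simplified and not the exact call chain:
>
>         /* bpf_iter_unix_seq_show(): fast path keeps the slock held */
>         bool slow = lock_sock_fast(sk);
>
>         /* iter prog runs and calls bpf_map_update_elem() on a sockmap */
>         bh_lock_sock(sk);       /* spins on the slock this CPU already holds */
>
>         unlock_sock_fast(sk, slow);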
>
> Suggested-by: Kuniyuki Iwashima <[email protected]>
> Suggested-by: Martin KaFai Lau <[email protected]>
> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
> Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
> Signed-off-by: Michal Luczaj <[email protected]>
> ---
> Keeping sparse annotations in sock_map_sk_{acquire,release}() required some
> hackery I'm not proud of. Is there a better way?
> ---
> net/core/sock_map.c | 47 +++++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index b6586d9590b7..0c638b1f363a 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -12,6 +12,7 @@
> #include <linux/list.h>
> #include <linux/jhash.h>
> #include <linux/sock_diag.h>
> +#include <net/af_unix.h>
> #include <net/udp.h>
>
> struct bpf_stab {
> @@ -115,17 +116,49 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
> }
>
> static void sock_map_sk_acquire(struct sock *sk)
> -        __acquires(&sk->sk_lock.slock)
> +        __acquires(sock_or_unix_lock)
> {
> -        lock_sock(sk);
> +        if (sk_is_unix(sk)) {
> +                unix_state_lock(sk);
> +                __release(sk); /* Silence sparse. */
> +        } else {
> +                lock_sock(sk);
> +        }
> +
>         rcu_read_lock();
> }
>
> static void sock_map_sk_release(struct sock *sk)
> -        __releases(&sk->sk_lock.slock)
> +        __releases(sock_or_unix_lock)
> {
>         rcu_read_unlock();
> -        release_sock(sk);
> +
> +        if (sk_is_unix(sk)) {
> +                unix_state_unlock(sk);
> +                __acquire(sk); /* Silence sparse. */
> +        } else {
> +                release_sock(sk);
> +        }
> +}
> +
> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
> +{
> +        local_bh_disable();
> +
> +        if (sk_is_unix(sk))
> +                unix_state_lock(sk);
> +        else
> +                bh_lock_sock(sk);
> +}
> +
> +static inline void sock_map_sk_release_fast(struct sock *sk)
> +{
> +        if (sk_is_unix(sk))
> +                unix_state_unlock(sk);
> +        else
> +                bh_unlock_sock(sk);
> +
> +        local_bh_enable();
> }
>
> static void sock_map_add_link(struct sk_psock *psock,
> @@ -604,16 +637,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
>         if (!sock_map_sk_is_suitable(sk))
>                 return -EOPNOTSUPP;
>
> -        local_bh_disable();
> -        bh_lock_sock(sk);
> +        sock_map_sk_acquire_fast(sk);
>         if (!sock_map_sk_state_allowed(sk))
>                 ret = -EOPNOTSUPP;
>         else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
>                 ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
>         else
>                 ret = sock_hash_update_common(map, key, sk, flags);
> -        bh_unlock_sock(sk);
> -        local_bh_enable();
> +        sock_map_sk_release_fast(sk);
>         return ret;
> }
>
>
> --
> 2.52.0
>