> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 9cd7b028592c..1e8754877dd1 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -2247,10 +2247,11 @@ EXPORT_SYMBOL_GPL(bpf_base_func_proto);
>  void bpf_list_head_free(const struct btf_field *field, void *list_head,
>                       struct bpf_spin_lock *spin_lock)
>  {
> -     struct list_head *head = list_head, *orig_head = list_head;
> +     struct list_head *head = list_head, drain, *pos, *n;
>
>       BUILD_BUG_ON(sizeof(struct list_head) > sizeof(struct bpf_list_head));
>       BUILD_BUG_ON(__alignof__(struct list_head) > __alignof__(struct bpf_list_head));
> +     INIT_LIST_HEAD(&drain);
>
>       /* Do the actual list draining outside the lock to not hold the lock for
>        * too long, and also prevent deadlocks if tracing programs end up
> @@ -2261,20 +2262,23 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
>       __bpf_spin_lock_irqsave(spin_lock);
>       if (!head->next || list_empty(head))
>               goto unlock;
> -     head = head->next;
> +     list_for_each_safe(pos, n, head) {
> +             WRITE_ONCE(container_of(pos,
> +                     struct bpf_list_node_kern, list_head)->owner, NULL);
> +             list_move_tail(pos, &drain);
> +     }
>  unlock:
> -     INIT_LIST_HEAD(orig_head);
> +     INIT_LIST_HEAD(head);
>       __bpf_spin_unlock_irqrestore(spin_lock);
>
> -     while (head != orig_head) {
> -             void *obj = head;
> -
> -             obj -= field->graph_root.node_offset;
> -             head = head->next;
> +     while (!list_empty(&drain)) {
> +             pos = drain.next;
> +             list_del_init(pos);

Does this create a race condition when a BPF program holds a
refcount-acquired reference to a node being freed?

The patch clears each node's owner to NULL under spin_lock, moves the
node to a local 'drain' list, then drops the lock. After unlock, the
drain loop walks the local list and calls list_del_init(pos) before
__bpf_obj_drop_impl().

Between spin_unlock and list_del_init(pos), each node in 'drain' has
owner == NULL AND is still linked in 'drain'.

When a BPF program holds an owning or refcount-acquired reference to
one of these nodes (call it X), which is possible because
__bpf_obj_drop_impl() uses refcount_dec_and_test() and only frees at
refcount 0, a concurrent push of X to a DIFFERENT bpf_list_head
corrupts both lists:

CPU 0 (bpf_list_head_free, lock released)  CPU 1 (BPF prog, refcount X)
-----------------------------------------   ----------------------------
(owner of X == NULL, X linked in drain)
                                            bpf_list_push_back(other, X)
                                              __bpf_list_add: spin_lock()
                                              cmpxchg(X->owner, NULL,
                                                      POISON) -> OK
                                              list_add_tail(&X->list_head,
                                                            other_head)
                                                -> overwrites X->next,
                                                   X->prev, corrupts
                                                   other_head's chain
                                                   because X is still
                                                   stitched into drain
pos = drain.next;      (may be X or neighbor using X's stale next)
list_del_init(pos);    reads X->next/prev now pointing into other_head,
                       corrupts other_head's list and/or drain

Pre-patch, the owner pointer was left pointing at the old head (never
cleared). Any concurrent push attempt cmpxchg(&node->owner, NULL,
POISON) would fail and return -EINVAL, blocking re-insertion. This
matches what bpf_rb_root_free() still does: it does NOT clear owner, so
concurrent bpf_rbtree_add() cmpxchg fails.

The patch clears owner under the lock but still links the node into the
unprotected local drain list, then drops the lock. The owner==NULL
state now acts as a 'push is allowed' signal even though the node is
still on a list, violating the invariant __bpf_list_add() relies on:

kernel/bpf/helpers.c:__bpf_list_add() {
    /* node->owner != NULL implies !list_empty(n), no need to separately
     * check the latter
     */
    if (cmpxchg(&node->owner, NULL, BPF_PTR_POISON)) {
            __bpf_obj_drop_impl((void *)n - off, rec, false);
            return -EINVAL;
    }

    tail ? list_add_tail(n, h) : list_add(n, h);
    WRITE_ONCE(node->owner, head);
}

Compounding this, 'drain' is a local stack variable: corruption can
redirect list pointers into kernel stack memory of the freeing CPU,
giving an unbounded list_add/list_del_init corruption vector.

Could the delinking be arranged so owner == NULL is never observable
while a node is still linked? Two options: swap the order in the drain
loop (list_del_init(pos) first, then WRITE_ONCE(node->owner, NULL), as
__bpf_list_del() does), or leave owner non-NULL while the nodes sit on
the drain list, as bpf_rb_root_free() does, so the cmpxchg in a
concurrent push still fails.
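One possible shape for the fix, combining both options above (an
untested sketch against this patch, reusing the identifiers from the
diff; list_splice_init() moves the whole chain in O(1)):

```c
	__bpf_spin_lock_irqsave(spin_lock);
	if (!head->next || list_empty(head))
		goto unlock;
	/* Keep each node's owner pointing at the old head while it sits
	 * on the local drain list, as bpf_rb_root_free() does: a
	 * concurrent push's cmpxchg(&node->owner, NULL, BPF_PTR_POISON)
	 * keeps failing, so a still-linked node cannot be re-inserted.
	 */
	list_splice_init(head, &drain);
unlock:
	__bpf_spin_unlock_irqrestore(spin_lock);

	while (!list_empty(&drain)) {
		pos = drain.next;
		/* Delink first, then publish owner == NULL, mirroring
		 * __bpf_list_del(): by the time a push can win the owner
		 * cmpxchg, the node is already self-linked.
		 */
		list_del_init(pos);
		WRITE_ONCE(container_of(pos, struct bpf_list_node_kern,
					list_head)->owner, NULL);
		__bpf_obj_drop_impl((char *)pos - field->graph_root.node_offset,
				    field->graph_root.value_rec, false);
	}
```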

The commit message's stated goal ('preventing subsequent users who hold
a reference count to the node from acquiring an invalid next node') is
only achievable if the delinking is done before unlock, not after.

>               /* The contained type can also have resources, including a
>                * bpf_list_head which needs to be freed.
>                */
> -             __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
> +             __bpf_obj_drop_impl((char *)pos - field->graph_root.node_offset,
> +                                 field->graph_root.value_rec, false);
>       }
>  }

[email protected] raised a concern in v9 about whether this patch
should include a Fixes tag. The CI bot suggested:

  Fixes: f0c5941ff5b2 ("bpf: Support bpf_list_head in map values")

since the concurrent reader issue was introduced when bpf_list_head
support was added. The author responded in v10 that the issue only
appears after the later patch 'bpf: Introduce the bpf_list_del kfunc' is
merged, so no Fixes tag was added.

However, by that reasoning the Fixes tag should point at the commit
that introduced bpf_list_del rather than at f0c5941ff5b2, not be
dropped entirely. Should this patch carry a Fixes tag pointing to the
commit that introduced bpf_list_del?


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25716874656
