On 6/4/26 15:34, Peter Xu wrote:
The race condition was not reported in any real bug, but I found it while
reviewing some relevant chnages from Akihiko [1].  It's not clear if
there's any way to reproduce it, so far it's only theoretical.  Hence
there's also no Fixes tag attached.  There's also no need to copy stable
until we have a solid reproducer.

Currently, mru_block might still points to ramblocks that are removed if
below race condition happens:

   Reader A                              Writer
   --------                              ------
   rcu_read_lock()
   walk list, find block X
                                         QLIST_REMOVE_RCU(X)
                                         qatomic_set(&mru_block, NULL)
                                         call_rcu(X, reclaim_ramblock)
   qatomic_set(&mru_block, X)  <----------- overwrites NULL
   rcu_read_unlock()
                                         grace period ends, X freed

   Reader C
   --------
   rcu_read_lock()
   qatomic_rcu_read(&mru_block) -> X
   Read X's block->offset ...  <----------- UAF

To fix it, we can introduce a nested RCU free for reset of the mru_block
field, and only free the ramblock until the 2nd RCU call.

There is a version field in RAMList.  Maybe we need mru_version too?

- reader A reads version before starting the list walk, then writes mru_version and mru_block

- writers update version and do not even have to clear mru_block

- new readers ignore mru_block if mru_version != version

This works, but there is a snag:

    reader A (MRU parts elided, assuming a miss)
    --------------------------------------------
    load_acquire version // load version before walking list
    walk list
    ...

            writer B
            -----------------------
            // like seqcount_write_begin
            store version++
            smp_wmb();
            // ----
            QLIST_REMOVE_RCU
            store release version++  // like seqcount_write_end

    // these two must be atomic!
    store release mru_version = version & ~1, mru_block = block

    reader C (MRU parts shown)
    --------------------
    load_acquire mru_version    // like seqcount_read_begin
    load mru_block
    // like seqcount_read_retry
    smp_rmb()
    load_acquire version
    // ----
    if block == NULL || version != mru_version || block mismatch
      walk list

In other words, the seqcount retry check compares the current version of the list against the one *stored by the previous reader*.

The snag is that torn writes of the (version, block) pair are a problem, so the store of mru_version and mru_block must be atomic and that requires CONFIG_ATOMIC128. Maybe avoid MRU altogether if it's not available, and if it is use qatomic_read__nocheck/qatomic_set__nocheck?

Paolo


Reply via email to