On Tue, Jun 17, 2025 at 09:41:20PM +0800, Julian Sun wrote:
> Recently, syzkaller reported the following issue:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> Call Trace:
> <TASK>
> mempool_alloc_noprof+0x1a7/0x510 mm/mempool.c:402
> bch2_btree_update_start+0x549/0x1480 fs/bcachefs/btree_update_interior.c:1194
> bch2_btree_node_rewrite+0x17e/0x1120 fs/bcachefs/btree_update_interior.c:2208
> bch2_move_btree+0x6f0/0xc70 fs/bcachefs/move.c:1093
> bch2_scan_old_btree_nodes+0x95/0x240 fs/bcachefs/move.c:1215
> bch2_data_job+0x646/0x910 fs/bcachefs/move.c:1354
> bch2_data_thread+0x8f/0x1d0 fs/bcachefs/chardev.c:315
> kthread+0x711/0x8a0 kernel/kthread.c:464
> ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>
> This is because after commit d4d71b58e513 ("bcachefs: RO mounts now use
> less memory"), read-only mounts no longer initialize
> btree_interior_update_pool, which is required for processing
> BCH_IOCTL_DATA requests.

Alan already gave me a better fix for this. You pretty much never want
to just check if the filesystem is ro or rw - that would be racy, since
that can change at any time. If you need the filesystem to be rw, you do
it by getting a write ref (which may fail).
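
To make the difference concrete, here's a rough userspace sketch of the
pattern, with C11 atomics standing in for the real refcount. Everything
named here (fs_model, write_ref_tryget(), fs_start_going_ro(), ...) is
made up for illustration and is not the actual bcachefs API:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: top bit = "going RO", low bits = write refs held. */
#define WRITES_DYING (UINT64_C(1) << 63)

struct fs_model {
	atomic_uint_fast64_t writes;
};

static void fs_model_init(struct fs_model *c)
{
	atomic_init(&c->writes, 0);
}

/* Racy: the answer may already be stale when the caller acts on it. */
static bool fs_is_rw_racy(struct fs_model *c)
{
	return !(atomic_load(&c->writes) & WRITES_DYING);
}

/* Take a write ref; fails once the filesystem has started going RO. */
static bool write_ref_tryget(struct fs_model *c)
{
	uint_fast64_t old = atomic_load(&c->writes);

	while (!(old & WRITES_DYING))
		if (atomic_compare_exchange_weak(&c->writes, &old, old + 1))
			return true;
	return false;
}

static void write_ref_put(struct fs_model *c)
{
	uint_fast64_t old = atomic_fetch_sub(&c->writes, 1);

	assert((old & ~WRITES_DYING) > 0);
	(void)old;
}

/* Stop handing out new refs; RO is reached once existing refs are put. */
static void fs_start_going_ro(struct fs_model *c)
{
	atomic_fetch_or(&c->writes, WRITES_DYING);
}

static bool fs_is_ro(struct fs_model *c)
{
	return atomic_load(&c->writes) == WRITES_DYING;
}

/* A write-side operation: hold a ref for exactly as long as we need RW. */
static int do_write_op(struct fs_model *c)
{
	if (!write_ref_tryget(c))
		return -EROFS;
	/* ... work that requires the filesystem to stay RW ... */
	write_ref_put(c);
	return 0;
}
```

The point is that a successful tryget is a guarantee that holds until
the matching put, whereas fs_is_rw_racy() only tells you what was true
at some moment in the past.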

Just checking SB_RDONLY here would be "technically" correct since we
only need the mempool, which is never deallocated until filesystem
teardown, and the interior update path should get its own ref on
c->writes before doing anything serious.

But it's bad form, because then other code changes might go "ok, we've
checked that we're RW, we're safe" - but we're actually not.

And, I'm just now noticing that bch2_btree_update_start() actually does
not get a ref on c->writes, so we might want to fix that - or move.c
needs to be getting a write ref, or both.

c->writes is a percpu refcount, so it's dirt cheap: there's generally
zero downside to taking a ref even if an upper layer already has one.
The only exception is if it's an internal operation that needs to run
when we're going RO - but we have a flag for that,
BCH_TRANS_COMMIT_no_check_rw, which bch2_btree_update_start() can check.
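
Something along these lines, again as a userspace sketch and not the
real code: update_start(), update_done(), struct update_model and
COMMIT_NO_CHECK_RW are hypothetical stand-ins (the last one for
BCH_TRANS_COMMIT_no_check_rw), and a plain atomic counter models the
refcount:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: top bit = "going RO", low bits = write refs held. */
#define WRITES_DYING       (UINT64_C(1) << 63)
/* Hypothetical stand-in for BCH_TRANS_COMMIT_no_check_rw. */
#define COMMIT_NO_CHECK_RW (1u << 0)

struct fs_model {
	atomic_uint_fast64_t writes;
};

struct update_model {
	struct fs_model *c;
	bool holds_write_ref;
};

static void fs_model_init(struct fs_model *c)
{
	atomic_init(&c->writes, 0);
}

static bool write_ref_tryget(struct fs_model *c)
{
	uint_fast64_t old = atomic_load(&c->writes);

	while (!(old & WRITES_DYING))
		if (atomic_compare_exchange_weak(&c->writes, &old, old + 1))
			return true;
	return false;
}

static void write_ref_put(struct fs_model *c)
{
	uint_fast64_t old = atomic_fetch_sub(&c->writes, 1);

	assert((old & ~WRITES_DYING) > 0);
	(void)old;
}

static void fs_start_going_ro(struct fs_model *c)
{
	atomic_fetch_or(&c->writes, WRITES_DYING);
}

/*
 * Take our own write ref even if the caller already holds one, unless
 * the caller says this is an internal op that must run while going RO.
 */
static int update_start(struct fs_model *c, unsigned flags,
			struct update_model *as)
{
	as->c = c;
	as->holds_write_ref = false;

	if (!(flags & COMMIT_NO_CHECK_RW)) {
		if (!write_ref_tryget(c))
			return -EROFS;
		as->holds_write_ref = true;
	}
	return 0;
}

static void update_done(struct update_model *as)
{
	if (as->holds_write_ref)
		write_ref_put(as->c);
}
```

Nesting is harmless here: the caller's ref and the update's ref are
just two increments on the same counter, each dropped by its own owner.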

The other consideration with write refs is that we don't want to be
holding them for an unbounded duration, because that will block going RO
- so I think bch2_ioctl_data() actually wasn't the best place for this,
we should be checking if we're RW in move.c, every time we kick off an
op.
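
Roughly what I mean by per-op refs, as a userspace sketch: move_one()
and move_all() are hypothetical names, not the functions in move.c, and
the ro_after parameter only exists to simulate a remount landing in the
middle of the loop:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: top bit = "going RO", low bits = write refs held. */
#define WRITES_DYING (UINT64_C(1) << 63)

struct fs_model {
	atomic_uint_fast64_t writes;
};

static void fs_model_init(struct fs_model *c)
{
	atomic_init(&c->writes, 0);
}

static bool write_ref_tryget(struct fs_model *c)
{
	uint_fast64_t old = atomic_load(&c->writes);

	while (!(old & WRITES_DYING))
		if (atomic_compare_exchange_weak(&c->writes, &old, old + 1))
			return true;
	return false;
}

static void write_ref_put(struct fs_model *c)
{
	uint_fast64_t old = atomic_fetch_sub(&c->writes, 1);

	assert((old & ~WRITES_DYING) > 0);
	(void)old;
}

static void fs_start_going_ro(struct fs_model *c)
{
	atomic_fetch_or(&c->writes, WRITES_DYING);
}

static bool fs_is_ro(struct fs_model *c)
{
	return atomic_load(&c->writes) == WRITES_DYING;
}

/* One bounded unit of work: the ref is held only for this op. */
static int move_one(struct fs_model *c)
{
	if (!write_ref_tryget(c))
		return -EROFS;
	/* ... rewrite one btree node ... */
	write_ref_put(c);
	return 0;
}

/*
 * Run up to nr ops and return how many completed. Going RO is
 * simulated just before op number ro_after (pass -1 for "never").
 */
static int move_all(struct fs_model *c, int nr, int ro_after)
{
	int done = 0;

	for (int i = 0; i < nr; i++) {
		if (i == ro_after)
			fs_start_going_ro(c);
		if (move_one(c))
			break;
		done++;
	}
	return done;
}
```

Because no ref is held across iterations, going RO only ever has to
wait for the op currently in flight, and the loop stops on its own at
the next tryget.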