Re: [PATCH v2] btrfs: fix mount failure caused by race with umount
On Fri, Jul 10, 2020 at 10:23:04AM -0700, Boris Burkov wrote: > Here is the sequence laid out in greater detail: > > CPU0CPU1 > down_write sb->s_umount > btrfs_kill_super > kill_anon_super(sb) > generic_shutdown_super(sb); > shrink_dcache_for_umount(sb); > sync_filesystem(sb); > evict_inodes(sb); // SLOW > > btrfs_mount_root > btrfs_scan_one_device > fs_devices = > device->fs_devices > fs_info->fs_devices = > fs_devices > // fs_devices-opened makes > this a no-op > > btrfs_open_devices(fs_devices, mode, fs_type) > s = sget(fs_type, test, set, > flags, fs_info); > find sb in s_instances > grab_super(sb); > down_write(&s->s_umount); > // blocks > > sop->put_super(sb) > // sb->fs_devices->opened == 2; no-op > spin_lock(&sb_lock); > hlist_del_init(&sb->s_instances); > spin_unlock(&sb_lock); > up_write(&sb->s_umount); > return 0; > retry lookup > don't find sb in > s_instances (deleted by CPU0) > s = alloc_super > return s; > btrfs_fill_super(s, > fs_devices, data) > open_ctree // fs_devices > total_rw_bytes improperly set! > btrfs_read_chunk_tree > read_one_dev // > increment total_rw_bytes again!! > super_total_bytes < > fs_devices->total_rw_bytes // ERROR!!! It seems weird that umount and mount can be mixed in such way but with the VFS locks and structures it's valid, so the devices managed by btrfs slipped through. With the suggested fix, the bit BTRFS_DEV_STATE_IN_FS_METADATA becomes quite important and the synchronization of the device related data. The semantics seems quite subtle and inconsistent regarding other uses of set_bit or clear_bit and the total_rw_bytes. I'm thinkig about unconditional setting of IN_FS_METADATA as it is now, but recalculating total_rw_size outside of read_one_dev in btrfs_read_chunk_tree. There it should not matter if the bit was set by the unmounted or the mounted filesystem, as long as the locking rules for updating fs_devices hold. For that we have uuid_mutex and fs_devices::device_list_mutex, this is used elsewhere so fixing it using existing mechanisms is IMHO better way than relying on subtle undocumented semantics of the state bit.
Re: [PATCH v2] btrfs: fix mount failure caused by race with umount
On 7/10/20 1:23 PM, Boris Burkov wrote: It is possible to cause a btrfs mount to fail by racing it with a slow umount. The crux of the sequence is generic_shutdown_super not yet calling sop->put_super before btrfs_mount_root calls btrfs_open_devices. If that occurs, btrfs_open_devices will decide the opened counter is non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to 0. From here, mount will call sget which will result in grab_super trying to take the super block umount semaphore. That semaphore will be held by the slow umount, so mount will block. Before up-ing the semaphore, umount will delete the super block, resulting in mount's sget reliably allocating a new one, which causes the mount path to dutifully fill it out, and increment total_rw_bytes a second time, which causes the mount to fail, as we see double the expected bytes. Here is the sequence laid out in greater detail: CPU0CPU1 down_write sb->s_umount btrfs_kill_super kill_anon_super(sb) generic_shutdown_super(sb); shrink_dcache_for_umount(sb); sync_filesystem(sb); evict_inodes(sb); // SLOW btrfs_mount_root btrfs_scan_one_device fs_devices = device->fs_devices fs_info->fs_devices = fs_devices // fs_devices-opened makes this a no-op btrfs_open_devices(fs_devices, mode, fs_type) s = sget(fs_type, test, set, flags, fs_info); find sb in s_instances grab_super(sb); down_write(&s->s_umount); // blocks sop->put_super(sb) // sb->fs_devices->opened == 2; no-op spin_lock(&sb_lock); hlist_del_init(&sb->s_instances); spin_unlock(&sb_lock); up_write(&sb->s_umount); return 0; retry lookup don't find sb in s_instances (deleted by CPU0) s = alloc_super return s; btrfs_fill_super(s, fs_devices, data) open_ctree // fs_devices total_rw_bytes improperly set! btrfs_read_chunk_tree read_one_dev // increment total_rw_bytes again!! super_total_bytes < fs_devices->total_rw_bytes // ERROR!!! To fix this, we observe that if we have already filled the device, the state bit BTRFS_DEV_STATE_IN_FS_METADATA will be set on it, and we can use that to avoid filling it a second time for no reason and, critically, avoid double counting in total_rw_bytes. One gotcha is that read_one_chunk also sets this bit, which happens before read_one_dev (in read_sys_array), so we must remove that setting of the bit as well, for the state bit to truly correspond to the device struct being filled from disk. To reproduce, it is sufficient to dirty a decent number of inodes, then quickly umount and mount. for i in $(seq 0 500) do dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1 done umount /mnt/foo& mount /mnt/foo does the trick for me. A final note is that this fix actually breaks the fstest btrfs/163, but having investigated it, I believe that is due to a subtle flaw in how btrfs replace works when used on a seed device. The replace target device never gets a correct dev_item with the sprout fsid written out. This causes several problems, but for the sake of btrfs/163, read_one_chunk marking the device with IN_FS_METADATA was wallpapering over it, which this patch breaks. I will be sending a subsequent fix for the seed replace issue which will also fix btrfs/163. Signed-off-by: Boris Burkov Reviewed-by: Josef Bacik Thanks, Josef
[PATCH v2] btrfs: fix mount failure caused by race with umount
It is possible to cause a btrfs mount to fail by racing it with a slow umount. The crux of the sequence is generic_shutdown_super not yet calling sop->put_super before btrfs_mount_root calls btrfs_open_devices. If that occurs, btrfs_open_devices will decide the opened counter is non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to 0. From here, mount will call sget which will result in grab_super trying to take the super block umount semaphore. That semaphore will be held by the slow umount, so mount will block. Before up-ing the semaphore, umount will delete the super block, resulting in mount's sget reliably allocating a new one, which causes the mount path to dutifully fill it out, and increment total_rw_bytes a second time, which causes the mount to fail, as we see double the expected bytes. Here is the sequence laid out in greater detail: CPU0CPU1 down_write sb->s_umount btrfs_kill_super kill_anon_super(sb) generic_shutdown_super(sb); shrink_dcache_for_umount(sb); sync_filesystem(sb); evict_inodes(sb); // SLOW btrfs_mount_root btrfs_scan_one_device fs_devices = device->fs_devices fs_info->fs_devices = fs_devices // fs_devices-opened makes this a no-op btrfs_open_devices(fs_devices, mode, fs_type) s = sget(fs_type, test, set, flags, fs_info); find sb in s_instances grab_super(sb); down_write(&s->s_umount); // blocks sop->put_super(sb) // sb->fs_devices->opened == 2; no-op spin_lock(&sb_lock); hlist_del_init(&sb->s_instances); spin_unlock(&sb_lock); up_write(&sb->s_umount); return 0; retry lookup don't find sb in s_instances (deleted by CPU0) s = alloc_super return s; btrfs_fill_super(s, fs_devices, data) open_ctree // fs_devices total_rw_bytes improperly set! btrfs_read_chunk_tree read_one_dev // increment total_rw_bytes again!! super_total_bytes < fs_devices->total_rw_bytes // ERROR!!! To fix this, we observe that if we have already filled the device, the state bit BTRFS_DEV_STATE_IN_FS_METADATA will be set on it, and we can use that to avoid filling it a second time for no reason and, critically, avoid double counting in total_rw_bytes. One gotcha is that read_one_chunk also sets this bit, which happens before read_one_dev (in read_sys_array), so we must remove that setting of the bit as well, for the state bit to truly correspond to the device struct being filled from disk. To reproduce, it is sufficient to dirty a decent number of inodes, then quickly umount and mount. for i in $(seq 0 500) do dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1 done umount /mnt/foo& mount /mnt/foo does the trick for me. A final note is that this fix actually breaks the fstest btrfs/163, but having investigated it, I believe that is due to a subtle flaw in how btrfs replace works when used on a seed device. The replace target device never gets a correct dev_item with the sprout fsid written out. This causes several problems, but for the sake of btrfs/163, read_one_chunk marking the device with IN_FS_METADATA was wallpapering over it, which this patch breaks. I will be sending a subsequent fix for the seed replace issue which will also fix btrfs/163. Signed-off-by: Boris Burkov --- fs/btrfs/volumes.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c7a3d4d730a3..865fab39ef43 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6633,9 +6633,6 @@ static int read_one_chunk(struct btrfs_key *key, struct extent_buffer *leaf, } btrfs_report_missing_device(fs_info, devid, uuid, false); } - set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, - &(map->stripes[i].dev->dev_state)); - } write_lock(&map_tree->lock); @@ -6815,8 +6812,15 @@ static int read_one_dev(struct extent_buffer *leaf,