On Wed, Mar 31, 2021 at 11:55:50AM +0100, [email protected] wrote:
> From: Filipe Manana <[email protected]>
>
> When we are running out of space for updating the chunk tree, that is,
> when we are low on available space in the system space info, if we have
> many tasks concurrently allocating block groups, via fallocate for example,
> many of them can end up all allocating new system chunks when only one is
> needed. In extreme cases this can lead to exhaustion of the system chunk
> array, which has a size limit of 2048 bytes, and results in a transaction
> abort with errno -EFBIG, producing a trace in dmesg like the following,
> which was triggered on a PowerPC machine with a node/leaf size of 64K:
>
> [ 1359.518899] ------------[ cut here ]------------
> [ 1359.518980] BTRFS: Transaction aborted (error -27)
> [ 1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968
> btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
> [ 1359.519152] Modules linked in: (...)
> [ 1359.519239] Supported: Yes, External
> [ 1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G X
> 5.3.18-47-default #1 SLE15-SP3
> [ 1359.519274] NIP: c008000000e36fe8 LR: c008000000e36fe4 CTR:
> 00000000006de8e8
> [ 1359.519293] REGS: c00000056890b700 TRAP: 0700 Tainted: G X
> (5.3.18-47-default)
> [ 1359.519317] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR:
> 48008222 XER: 00000007
> [ 1359.519356] CFAR: c00000000013e170 IRQMASK: 0
> [ 1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200
> 0000000000000026
> [ 1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651
> 0000000000000007
> [ 1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007
> 0000000000000000
> [ 1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028
> 000000001015dfd0
> [ 1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000
> 0000dd1e287affff
> [ 1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5
> 0000000000000000
> [ 1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100
> ffffffffffffffc0
> [ 1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8
> c000000562188b08
> [ 1359.519561] NIP [c008000000e36fe8]
> btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
> [ 1359.519613] LR [c008000000e36fe4]
> btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs]
> [ 1359.519626] Call Trace:
> [ 1359.519671] [c00000056890b990] [c008000000e36fe4]
> btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable)
> [ 1359.519729] [c00000056890ba90] [c008000000d68d44]
> __btrfs_end_transaction+0xbc/0x2f0 [btrfs]
> [ 1359.519782] [c00000056890bae0] [c008000000e309ac]
> btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs]
> [ 1359.519844] [c00000056890bba0] [c008000000d8a0fc]
> btrfs_fallocate+0xe4/0x10e0 [btrfs]
> [ 1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350
> [ 1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0
> [ 1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40
> [ 1359.519988] [c00000056890bdc0] [c000000000038968]
> system_call_exception+0xe8/0x170
> [ 1359.520021] [c00000056890be20] [c00000000000cb70]
> system_call_common+0xf0/0x278
> [ 1359.520037] Instruction dump:
> [ 1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048
> 3c620000
> [ 1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000
> e884bcc8 7ec6b378
> [ 1359.520122] ---[ end trace d6c186e151022e20 ]---
>
> The following steps explain how we can end up in this situation:
>
> 1) Task A is at check_system_chunk(), either because it is allocating a
>    new data or metadata block group, at btrfs_chunk_alloc(), or because
>    it is removing a block group or turning a block group RO. It does not
>    matter why;
>
> 2) Task A sees that there is not enough free space in the system
>    space_info object, that is 'left' is < 'thresh'. And at this point
>    the system space_info has a value of 0 for its 'bytes_may_use'
>    counter;
>
> 3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate
>    a new system block group (chunk) and then reserves 'thresh' bytes in
>    the chunk block reserve with the call to btrfs_block_rsv_add(). This
>    changes the chunk block reserve's 'reserved' and 'size' counters by an
>    amount of 'thresh', and changes the 'bytes_may_use' counter of the
>    system space_info object from 0 to 'thresh'.
>
>    Also during its call to btrfs_alloc_chunk(), we end up increasing the
>    value of the 'total_bytes' counter of the system space_info object by
>    8MiB (the size of a system chunk stripe). This happens through the
>    call chain:
>
>    btrfs_alloc_chunk()
>      create_chunk()
>        btrfs_make_block_group()
>          btrfs_update_space_info()
>
> 4) After it finishes the first phase of the block group allocation, at
>    btrfs_chunk_alloc(), task A unlocks the chunk mutex;
>
> 5) At this point the new system block group was added to the transaction
>    handle's list of new block groups, but its block group item, device
>    items and chunk item were not yet inserted in the extent, device and
>    chunk trees, respectively. That only happens later when we call
>    btrfs_finish_chunk_alloc() through a call to
>    btrfs_create_pending_block_groups();
>
>    Note that only when we update the chunk tree, through the call to
>    btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter
>    of the chunk block reserve as we COW/allocate extent buffers,
>    through:
>
>    btrfs_alloc_tree_block()
>      btrfs_use_block_rsv()
>        btrfs_block_rsv_use_bytes()
>
>    And the system space_info's 'bytes_may_use' is decremented every time
>    we allocate an extent buffer for COW operations on the chunk tree,
>    through:
>
>    btrfs_alloc_tree_block()
>      btrfs_reserve_extent()
>        find_free_extent()
>          btrfs_add_reserved_bytes()
>
>    If we end up COWing less chunk btree nodes/leaves than expected, which
>    is the typical case since the amount of space we reserve is always
>    pessimistic to account for the worst possible case, we release the
>    unused space through:
>
>    btrfs_create_pending_block_groups()
>      btrfs_trans_release_chunk_metadata()
>        btrfs_block_rsv_release()
>          block_rsv_release_bytes()
>            btrfs_space_info_free_bytes_may_use()
>
>    But before task A gets into btrfs_create_pending_block_groups()...
>
> 6) Many other tasks start allocating new block groups through fallocate,
>    each one does the first phase of block group allocation in a
>    serialized way, since btrfs_chunk_alloc() takes the chunk mutex
>    before calling check_system_chunk() and btrfs_alloc_chunk().
>
>    However before everyone enters the final phase of the block group
>    allocation, that is, before calling btrfs_create_pending_block_groups(),
>    new tasks keep coming to allocate new block groups and while at
>    check_system_chunk(), the system space_info's 'bytes_may_use' keeps
>    increasing each time a task reserves space in the chunk block reserve.
>    This means that eventually some other task can end up not seeing enough
>    free space in the system space_info and decide to allocate yet another
>    system chunk.
>
>    This may repeat several times if yet more new tasks keep allocating
>    new block groups before task A, and all the other tasks, finish the
>    creation of the pending block groups, which is when reserved space
>    in excess is released. Eventually this can result in exhaustion of
>    system chunk array in the superblock, with btrfs_add_system_chunk()
>    returning -EFBIG, resulting later in a transaction abort.
>
>    Even when we don't reach the extreme case of exhausting the system
>    array, most, if not all, unnecessarily created system block groups
>    end up being unused since when finishing creation of the first
>    pending system block group, the creation of the following ones end
>    up not needing to COW nodes/leaves of the chunk tree, so we never
>    allocate and deallocate from them, resulting in them never being
>    added to the list of unused block groups - as a consequence they
>    don't get deleted by the cleaner kthread - the only exceptions are
>    if we unmount and mount the filesystem again, which adds any unused
>    block groups to the list of unused block groups, if a scrub is
>    run, which also adds unused block groups to the unused list, and
>    under some circumstances when using a zoned filesystem or async
>    discard, which may also add unused block groups to the unused list.
>
> So fix this by:
>
> *) Tracking the number of reserved bytes for the chunk tree per
>    transaction, which is the sum of reserved chunk bytes by each
>    transaction handle currently being used;
>
> *) When there is not enough free space in the system space_info,
>    if there are other transaction handles which reserved chunk space,
>    wait for some of them to complete in order to have enough excess
>    reserved space released, and then try again. Otherwise proceed with
>    the creation of a new system chunk.
>
> Signed-off-by: Filipe Manana <[email protected]>
Added to misc-next.
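
For anyone following the accounting in the changelog, here is a rough userspace model of it in C. It is only a sketch and not kernel code: struct sys_space_info, simulate_system_chunk_allocs(), the 6MiB of already-used space and the per-transaction counter name are all hypothetical stand-ins. It also assumes that waiting for the other transaction handles releases all of their outstanding reservation, whereas the patch only says that the excess reserved space is released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
/* 8MiB system chunk stripe size, as stated in the changelog. */
#define SYSTEM_CHUNK_SIZE (8ULL * MIB)

/* Hypothetical, simplified stand-in for the system space_info counters. */
struct sys_space_info {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_may_use;
};

/* Free space as seen by the 'left' < 'thresh' check in this model. */
static uint64_t space_left(const struct sys_space_info *s)
{
	return s->total_bytes - s->bytes_used - s->bytes_may_use;
}

/*
 * Model 'ntasks' tasks that each run the first phase of block group
 * allocation before any of them reaches the final phase. Returns how
 * many new system chunks were allocated.
 *
 * wait_for_others == false: behaviour described in steps 1) to 6).
 * wait_for_others == true:  idea of the fix, wait and retry first.
 */
static int simulate_system_chunk_allocs(int ntasks, uint64_t thresh,
					bool wait_for_others)
{
	/* Assumed starting point: one system chunk, 6MiB of it in use. */
	struct sys_space_info sys = {
		.total_bytes = SYSTEM_CHUNK_SIZE,
		.bytes_used = 6 * MIB,
		.bytes_may_use = 0,
	};
	/* Sum of chunk bytes reserved by in-flight transaction handles. */
	uint64_t chunk_bytes_reserved = 0;
	int new_chunks = 0;

	/* Keeps the unsigned arithmetic in space_left() from wrapping. */
	assert(thresh <= SYSTEM_CHUNK_SIZE);

	for (int i = 0; i < ntasks; i++) {
		if (space_left(&sys) < thresh && wait_for_others &&
		    chunk_bytes_reserved > 0) {
			/*
			 * Assumption: the other handles finish and give
			 * back everything they reserved.
			 */
			sys.bytes_may_use -= chunk_bytes_reserved;
			chunk_bytes_reserved = 0;
		}

		if (space_left(&sys) < thresh) {
			/* Still not enough: allocate a new system chunk. */
			sys.total_bytes += SYSTEM_CHUNK_SIZE;
			new_chunks++;
		}

		/* Reserve 'thresh' bytes in the chunk block reserve. */
		sys.bytes_may_use += thresh;
		chunk_bytes_reserved += thresh;
	}

	return new_chunks;
}
```

Calling simulate_system_chunk_allocs(12, 3 * MIB, false) allocates more than one system chunk in this model, while the same call with wait_for_others set to true allocates only the first one, which is the effect the patch is after.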
