On Fri, Apr 12, 2019 at 04:35:20PM +0300, Nikolay Borisov wrote: > > > On 12.04.19 г. 16:26 ч., Josef Bacik wrote: > > On Fri, Apr 12, 2019 at 04:06:25PM +0300, Nikolay Borisov wrote: > >> > >> > >> On 10.04.19 г. 22:56 ч., Josef Bacik wrote: > >>> With the per-inode block rsvs we started refilling the reserve based on > >>> the calculated size of the outstanding csum bytes and extents for the > >>> inode, including the amount we were adding with the new operation. > >>> > >>> However generic/224 exposed a problem with this approach. With 1000 > >>> files all writing at the same time we ended up with a bunch of bytes > >>> being reserved but unusable. > >>> > >>> When you write to a file we reserve space for the csum leaves for those > >>> bytes, the number of extent items required to cover those bytes, and a > >>> single credit for updating the inode at ordered extent finish for that > >>> range of bytes. This is held until the ordered extent finishes and we > >>> release all of the reserved space. > >>> > >>> If a second write comes in at this point we would add a single > >>> reservation for the new outstanding extent and however many reservations > >>> for the csum leaves. > >> > >> If a second write comes we won't do anything different than the first > >> i.e calculate the number of extent items + csums bytes required, add > >> them to the block reservation and call btrfs_inode_rsv_refill which > >> should refill the delta necessary for the 2nd write. > >> > >> > >> At this point we find the delta of how much we > >>> have reserved and how much outstanding size this is and attempt to > >>> reserve this delta. If the first write finishes it will not release any > >>> space, because the space it had reserved for the initial write is still > >>> needed for the second write. However some space would have been used, > >> > >> Each and every reservation is responsible for itself how come the first > >> one will know some of its space is required for the second, hence it > >> won't be released? > >> > > > > Write 1 comes in, sets the size to 3mib, reserves 3mib. > > Write 2 comes in, sets the size to 5 mib, attempts to reserve 2mib. > > - can't reserve because there's not enough space, starts flushing. > > Write 1 finishes, used 1mib of it's 3mib reservation > > Write 1 sets the size to 3mib > > We still have 2mib in reserves, which is less than 3mib, so no bytes are > > released to the space info. > > > > Now multiply this by 1000, you have 1000 files with 2mib sitting in their > > reservations, but they need 2mib, and there's no space to be squeezed from > > the > > rest of the fs, so they start to ENOSPC out one by one. > > > > With the new thing we get this > > > > Write 1 comes in, reserves 3mib, sets the size to 3mib. > > Write 2 comes in, attempts to reserve 3mib. > > - can't reserve because there's not enough space, starts flushing. > > Um, no, you've removed btrfs_inode_rsv_refill so there is no flushing > happening in btrfs_delalloc_reserve_metadata whatsoever. None of the 2 > remaining callers of btrfs_delalloc_reserve_metadata does any flushing > based on the retval of that function. >
Please go read the code again. Thanks, Josef