Re: [PATCH 2/2] btrfs: reserve delalloc metadata differently

Josef Bacik Fri, 12 Apr 2019 06:37:57 -0700

On Fri, Apr 12, 2019 at 04:35:20PM +0300, Nikolay Borisov wrote:
> 
> 
> On 12.04.19 г. 16:26 ч., Josef Bacik wrote:
> > On Fri, Apr 12, 2019 at 04:06:25PM +0300, Nikolay Borisov wrote:
> >>
> >>
> >> On 10.04.19 г. 22:56 ч., Josef Bacik wrote:
> >>> With the per-inode block rsvs we started refilling the reserve based on
> >>> the calculated size of the outstanding csum bytes and extents for the
> >>> inode, including the amount we were adding with the new operation.
> >>>
> >>> However generic/224 exposed a problem with this approach.  With 1000
> >>> files all writing at the same time we ended up with a bunch of bytes
> >>> being reserved but unusable.
> >>>
> >>> When you write to a file we reserve space for the csum leaves for those
> >>> bytes, the number of extent items required to cover those bytes, and a
> >>> single credit for updating the inode at ordered extent finish for that
> >>> range of bytes.  This is held until the ordered extent finishes and we
> >>> release all of the reserved space.
> >>>
> >>> If a second write comes in at this point we would add a single
> >>> reservation for the new outstanding extent and however many reservations
> >>> for the csum leaves.  
> >>
> >> If a second write comes we won't do anything different than the first
> >> i.e calculate the number of extent items + csums bytes required, add
> >> them to the block reservation and call btrfs_inode_rsv_refill which
> >> should refill the delta necessary for the 2nd write.
> >>
> >>
> >> At this point we find the delta of how much we
> >>> have reserved and how much outstanding size this is and attempt to
> >>> reserve this delta.  If the first write finishes it will not release any
> >>> space, because the space it had reserved for the initial write is still
> >>> needed for the second write.  However some space would have been used,
> >>
> >> Each and every reservation is responsible for itself how come the first
> >> one will know some of its space is required for the second, hence it
> >> won't be released?
> >>
> > 
> > Write 1 comes in, sets the size to 3mib, reserves 3mib.
> > Write 2 comes in, sets the size to 5 mib, attempts to reserve 2mib.
> >   - can't reserve because there's not enough space, starts flushing.
> > Write 1 finishes, used 1mib of it's 3mib reservation
> > Write 1 sets the size to 3mib
> > We still have 2mib in reserves, which is less than 3mib, so no bytes are
> >   released to the space info.
> > 
> > Now multiply this by 1000, you have 1000 files with 2mib sitting in their
> > reservations, but they need 2mib, and there's no space to be squeezed from 
> > the
> > rest of the fs, so they start to ENOSPC out one by one.
> > 
> > With the new thing we get this
> > 
> > Write 1 comes in, reserves 3mib, sets the size to 3mib.
> > Write 2 comes in, attempts to reserve 3mib.
> >   - can't reserve because there's not enough space, starts flushing.
> 
> Um, no, you've removed btrfs_inode_rsv_refill so there is no flushing
> happening in btrfs_delalloc_reserve_metadata whatsoever. None of the 2
> remaining callers of btrfs_delalloc_reserve_metadata does any flushing
> based on the retval of that function.
>


Please go read the code again.  Thanks,

Josef

Re: [PATCH 2/2] btrfs: reserve delalloc metadata differently

Reply via email to