On 12.04.19 at 16:26, Josef Bacik wrote:
> On Fri, Apr 12, 2019 at 04:06:25PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 10.04.19 at 22:56, Josef Bacik wrote:
>>> With the per-inode block rsvs we started refilling the reserve based on
>>> the calculated size of the outstanding csum bytes and extents for the
>>> inode, including the amount we were adding with the new operation.
>>>
>>> However generic/224 exposed a problem with this approach.  With 1000
>>> files all writing at the same time we ended up with a bunch of bytes
>>> being reserved but unusable.
>>>
>>> When you write to a file we reserve space for the csum leaves for those
>>> bytes, the number of extent items required to cover those bytes, and a
>>> single credit for updating the inode at ordered extent finish for that
>>> range of bytes.  This is held until the ordered extent finishes and we
>>> release all of the reserved space.
>>>
>>> If a second write comes in at this point we would add a single
>>> reservation for the new outstanding extent and however many reservations
>>> for the csum leaves.  
>>
>> If a second write comes in we won't do anything different than for the
>> first, i.e. calculate the number of extent items + csum bytes required,
>> add them to the block reservation and call btrfs_inode_rsv_refill, which
>> should refill the delta necessary for the 2nd write.
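
For reference, this is roughly how I picture that old refill flow (a minimal
user-space sketch; all the names and numbers below are made up for
illustration, it is not the actual kernel code):

#include <stdbool.h>
#include <stdio.h>

struct toy_inode_rsv {
	unsigned long long size;	/* what we think we need */
	unsigned long long reserved;	/* what we actually hold */
};

/* Pretend the space_info has 4mib free for metadata. */
static unsigned long long space_info_free = 4ULL << 20;

/* Reserve only the delta between ->size and ->reserved. */
static bool toy_inode_rsv_refill(struct toy_inode_rsv *rsv)
{
	unsigned long long delta;

	if (rsv->reserved >= rsv->size)
		return true;
	delta = rsv->size - rsv->reserved;
	if (delta > space_info_free)
		return false;	/* this is where flushing would kick in */
	space_info_free -= delta;
	rsv->reserved += delta;
	return true;
}

/* New write: grow ->size by the metadata that range needs, then refill. */
static bool toy_delalloc_reserve(struct toy_inode_rsv *rsv,
				 unsigned long long meta_bytes)
{
	rsv->size += meta_bytes;
	return toy_inode_rsv_refill(rsv);
}

int main(void)
{
	struct toy_inode_rsv rsv = { 0, 0 };

	printf("write 1 ok: %d\n", toy_delalloc_reserve(&rsv, 3ULL << 20));
	printf("write 2 ok: %d\n", toy_delalloc_reserve(&rsv, 2ULL << 20));
	printf("reserved %llu, space_info free %llu\n",
	       rsv.reserved, space_info_free);
	return 0;
}

I.e. the 2nd write only ever asks the space_info for the delta on top of
what the inode rsv already holds.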
>>
>>
>>> At this point we find the delta between how much we
>>> have reserved and how much the outstanding size is, and attempt to
>>> reserve this delta.  If the first write finishes it will not release any
>>> space, because the space it had reserved for the initial write is still
>>> needed for the second write.  However some space would have been used,
>>
>> Each and every reservation is responsible for itself, so how come the
>> first one will know that some of its space is required for the second and
>> hence won't be released?
>>
> 
> Write 1 comes in, sets the size to 3mib, reserves 3mib.
> Write 2 comes in, sets the size to 5mib, attempts to reserve 2mib.
>   - can't reserve because there's not enough space, starts flushing.
> Write 1 finishes, used 1mib of its 3mib reservation
> Write 1 sets the size to 3mib
> We still have 2mib in reserves, which is less than 3mib, so no bytes are
>   released to the space info.
> 
> Now multiply this by 1000: you have 1000 files with 2mib sitting in their
> reservations, but they each still need 2mib, and there's no space to be
> squeezed from the rest of the fs, so they start to ENOSPC out one by one.
> 
> With the new thing we get this
> 
> Write 1 comes in, reserves 3mib, sets the size to 3mib.
> Write 2 comes in, attempts to reserve 3mib.
>   - can't reserve because there's not enough space, starts flushing.

Um, no, you've removed btrfs_inode_rsv_refill so there is no flushing
happening in btrfs_delalloc_reserve_metadata whatsoever. Neither of the 2
remaining callers of btrfs_delalloc_reserve_metadata does any flushing
based on the retval of that function.

This actually means that you should also remove the 'flush' variable in
the same function, otherwise you are leaving an unused variable behind,
which is not nice.

> Write 1 finishes, used 1mib of its 3mib reservation
> Write 1 sets the size to 0mib

Reviewing the way outstanding_extents is used I'm getting a little bit
confused - on the one hand we modify this value when we reserve metadata,
on the other hand we also modify the count when adding ordered extents.
Shouldn't that count have already been accounted for when doing the
initial metadata reservation?

> Write 1 releases 2mib to the space_info, allowing the next waiter to claim
>   2mib of whatever its reservation was.
> 
> Multiply this by 1000.  As we get smaller and smaller amounts of metadata
> space to work with, we get fewer and fewer writes happening in parallel,
> because we only have X inodes' worth of reservations to be in flight at any
> given time.  Thanks,
> 
> Josef
> 

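To check that I'm following the difference, here is a toy user-space model
with the numbers from your example (purely illustrative again - these are
not the kernel structures or functions):

#include <stdio.h>

#define MIB(x) ((unsigned long long)(x) << 20)

int main(void)
{
	/*
	 * Old scheme: one shared rsv per inode, space goes back to the
	 * space_info only when 'reserved' exceeds the calculated 'size'.
	 */
	unsigned long long size, reserved, released;

	size = MIB(3);		/* write 1 comes in */
	reserved = MIB(3);	/* and gets its full reservation */
	size = MIB(5);		/* write 2 comes in, its extra 2mib never arrives */
	reserved -= MIB(1);	/* write 1 finishes, having consumed 1mib */
	size = MIB(3);		/* only write 2 is left outstanding */
	released = reserved > size ? reserved - size : 0;
	printf("old: reserved %llu, size %llu, released %llu\n",
	       reserved, size, released);

	/*
	 * New scheme: every write owns its full reservation and releases
	 * whatever it did not consume once its ordered extent finishes.
	 */
	unsigned long long w1_reserved = MIB(3);
	unsigned long long w1_used = MIB(1);

	printf("new: write 1 releases %llu to the space_info\n",
	       w1_reserved - w1_used);
	return 0;
}

If that matches what you meant then the old scheme indeed keeps the leftover
2mib pinned in the inode rsv, while the new one hands it back to the
space_info as soon as write 1's ordered extent finishes.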