On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> >> I have to point out that in any case the extent is physically
> >> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> >> you want to write 128KB, the first half is written in the first disk,
> >> the other on the 2nd disk.  If you want to write 96KB, the first 64KB
> >> are written on the first disk, the last part on the 2nd, only in a
> >> different BG.
> > The "only on a different BG" part implies something expensive, either
> > a seek or a new erase page depending on the hardware.  Without that,
> > nearby logical blocks are nearby physical blocks as well.
> 
> In any case it happens on a different disk

No it doesn't.  The small-BG could be on the same disk(s) as the big-BG.

> >> So yes there is a fragmentation from a logical point of view; from a
> >> physical point of view the data is spread on the disks in any case.
> 
> > What matters is the extent-tree point of view.  There is (currently)
> > no fragmentation there, even for RAID5/6.  The extent tree is unaware
> > of RAID5/6 (to its peril).
> 
> Before, you pointed out that the non-contiguous block written has
> an impact on performance. I am replying that the switch to a
> different BG happens at the disk-stripe boundary, so in any case the
> block is physically interrupted and switched to another disk.

The difference is that the write is switched to a different local address
on the disk.

It's not "another" disk if it's a different BG.  Recall in this plan
there is a full-width BG that is on _every_ disk, which means every
small-width BG shares a disk with the full-width BG.  Every extent tail
write requires a seek on a minimum of two disks in the array for raid5,
three disks for raid6.  A tail that is stripe-width minus one will hit
N - 1 disks twice in an N-disk array.
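To make that disk counting concrete, here is a tiny Python sketch (the
function and layout model are mine for illustration, not btrfs code): a
tail write touches one disk per data stripe element plus the parity
disk(s) of that stripe.

```python
def disks_touched_by_tail(tail_elements, parity_disks=1):
    """Model (not btrfs code): a tail write in the small-width BG
    touches one disk per data stripe element it covers, plus the
    parity disk(s) of that stripe."""
    return tail_elements + parity_disks

# raid5: even a one-element tail seeks on two disks; raid6 on three.
assert disks_touched_by_tail(1, parity_disks=1) == 2
assert disks_touched_by_tail(1, parity_disks=2) == 3

# N-disk raid5: a tail of stripe-width minus one (N - 2 data
# elements) hits N - 1 disks, each also holding the full-width BG.
N = 6
assert disks_touched_by_tail(N - 2, parity_disks=1) == N - 1
```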

> However yes: from an extent-tree point of view there will be an increase
> in the number of extents, because the end of the write is allocated to
> another BG (if the size is not stripe-aligned)
> 
> > If an application does a loop writing 68K then fsync(), the multiple-BG
> > solution adds two seeks to read every 68K.  That's expensive if sequential
> > read bandwidth is more scarce than free space.
> 
> Why do you talk about additional seeks? In any case (even without the
> additional BG) the read happens from another disk

See above:  not another disk, usually a different location on two or
more of the same disks.
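As a sketch of the arithmetic in the 68K example (stripe size assumed,
not taken from btrfs source): each 68K write splits into a 64K aligned
part in the full-width BG plus a 4K tail in a small-width BG, so each
extent lives in two places on the same disks.

```python
STRIPE = 64 * 1024  # assumed per-disk stripe element size

def split_write(size):
    """Hypothetical model of the multiple-BG plan: the stripe-aligned
    part stays in the full-width BG, the tail goes to a separate
    small-width BG on (some of) the same disks."""
    aligned = (size // STRIPE) * STRIPE
    return aligned, size - aligned

aligned, tail = split_write(68 * 1024)
# 68K -> 64K in the full-width BG + a 4K tail in the small-width BG:
# reading the extent back needs an extra seek to jump between BGs.
assert (aligned, tail) == (64 * 1024, 4 * 1024)
```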

> >> * c),d),e) are applied only for the tail of the extent, in case the
> >> size is less than the stripe size.
> > 
> > It's only necessary to split an extent if there are no other writes
> > in the same transaction that could be combined with the extent tail
> > into a single RAID stripe.  As long as everything in the RAID stripe
> > belongs to a single transaction, there is no write hole.
> 
> Maybe a simpler optimization would be to close the transaction
> when the data reaches the stripe boundary... But I suspect that it is
> not so simple to implement.

Transactions exist in btrfs to batch up writes into big contiguous extents
already.  The trick is to _not_ do that when one transaction ends and
the next begins, i.e. leave a space at the end of the partially-filled
stripe so that the next transaction begins in an empty stripe.
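A minimal sketch of that trick (full-stripe size and names are assumed
for illustration): at commit, round the allocation cursor up to the next
full-stripe boundary so the following transaction starts in an empty
stripe.

```python
FULL_STRIPE = 3 * 64 * 1024  # assumed: 4-disk raid5, 3 data elements/stripe

def next_transaction_start(cursor):
    """Sketch of the commit-time rule: advance the allocation cursor
    to the next full-stripe boundary, leaving the partial stripe's
    gap empty so no two transactions ever share a stripe."""
    return -(-cursor // FULL_STRIPE) * FULL_STRIPE  # round up (ceil)

assert next_transaction_start(68 * 1024) == FULL_STRIPE    # skip the gap
assert next_transaction_start(FULL_STRIPE) == FULL_STRIPE  # already aligned
```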

This does mean that there will only be extra seeks during transaction
commit and fsync()--which were already very seeky to begin with.  It's
not necessary to write a partial stripe when there are other extents to
combine.

So there will be double the amount of seeking, but depending on the
workload, it could double a very small percentage of writes.

> > Not for d.  Balance doesn't know how to get rid of unreachable blocks
> > in extents (it just moves the entire extent around) so after a balance
> > the writes would still be rounded up to the stripe size.  Balance would
> > never be able to free the rounded-up space.  That space would just be
> > gone until the file was overwritten, deleted, or defragged.
> 
> If balance is capable of moving the extent, why not place one near the
> other during a balance?  The goal is not to limit the writing of
> the end of an extent, but to avoid writing the end of an extent without
> further data (e.g. the gap to the stripe has to be filled in the
> same transaction)

That's plan f (leave gaps in RAID stripes empty).  Balance will repack
short extents into RAID stripes nicely.

Plan d can't do that because plan d overallocates the extent so that
the extent fills the stripe (only some of the extent is used for data).
Small but important difference.
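That difference can be sketched numerically (sizes assumed; the plan
letters come from this thread, the functions are mine):

```python
FULL_STRIPE = 3 * 64 * 1024  # assumed: 4-disk raid5 full stripe

def plan_d(size):
    """Plan d sketch: over-allocate the extent to fill the stripe; the
    rounded-up slack is unreachable and balance cannot reclaim it."""
    allocated = -(-size // FULL_STRIPE) * FULL_STRIPE
    return allocated, allocated - size  # (extent size, wasted slack)

def plan_f(size):
    """Plan f sketch: the extent keeps its real size; the stripe's gap
    stays empty, and balance can later repack short extents into it."""
    return size, 0

assert plan_d(68 * 1024) == (FULL_STRIPE, 124 * 1024)  # 124K locked up
assert plan_f(68 * 1024) == (68 * 1024, 0)             # nothing wasted
```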

> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
