On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> >> I have to point out that in any case the extent is physically
> >> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> >> you want to write 128KB, the first half is written in the first
> >> disk, the other in the 2nd disk. If you want to write 96KB, the
> >> first 64 are written in the first disk, the last part in the 2nd,
> >> only on a different BG.
> >
> > The "only on a different BG" part implies something expensive,
> > either a seek or a new erase page depending on the hardware.
> > Without that, nearby logical blocks are nearby physical blocks
> > as well.
>
> In any case it happens on a different disk
No it doesn't.  The small BG could be on the same disk(s) as the
big BG.

> >> So yes there is a fragmentation from a logical point of view; from
> >> a physical point of view the data is spread on the disks in any
> >> case.
> >
> > What matters is the extent-tree point of view.  There is (currently)
> > no fragmentation there, even for RAID5/6.  The extent tree is
> > unaware of RAID5/6 (to its peril).
>
> Before, you pointed out that the non-contiguous block written has an
> impact on performance. I am replying that the switch to a different
> BG happens at the disk-stripe boundary, so in any case the block is
> physically interrupted and switched to another disk

The difference is that the write is switched to a different local
address on the same disk.  It's not "another" disk if it's a different
BG.  Recall that in this plan there is a full-width BG that is on
_every_ disk, which means every small-width BG shares a disk with the
full-width BG.  Every extent tail write requires a seek on a minimum
of two disks in the array for raid5, three disks for raid6.  A tail
that is stripe-width minus one will hit N - 1 disks twice in an N-disk
array.

> However yes: from an extent-tree point of view there will be an
> increase in the number of extents, because the end of the write is
> allocated to another BG (if the size is not stripe-aligned)
>
> > If an application does a loop writing 68K then fsync(), the
> > multiple-BG solution adds two seeks to read every 68K.  That's
> > expensive if sequential read bandwidth is more scarce than free
> > space.
>
> Why do you talk about additional seeks? In any case (even without
> the additional BG) the read happens from another disk

See above:  not another disk, usually a different location on two or
more of the same disks.

> >> * c),d),e) are applied only for the tail of the extent, in case
> >> the size is less than the stripe size.
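To make the "hits N - 1 disks twice" claim concrete, here is a toy
model (not btrfs code; disk numbering and the assumption that the
small-width BG maps onto the low-numbered disks are illustrative
only).  It counts how many disks one commit touches in both the
full-width BG and the small-width BG for a raid5 tail of stripe-width
minus one:

```python
# Toy model, NOT btrfs code: count the disks that must seek twice when
# an extent tail is diverted to a smaller-width block group (BG).
# Assumptions (mine, for illustration): raid5; the full-width BG spans
# every disk; the small-width BG happens to sit on the low-numbered
# disks.  Disk choice for the small BG doesn't change the count, only
# which disks pay the seeks.

def disks_touched(n_disks, tail_strips):
    """Return (full_bg_disks, small_bg_disks) for one commit that
    writes full stripes in the full-width BG plus a tail of
    `tail_strips` data strips in a small-width BG."""
    full_bg = set(range(n_disks))           # full-width BG is on every disk
    # raid5 small BG: tail data strips + 1 parity strip
    small_bg = set(range(tail_strips + 1))
    return full_bg, small_bg

n = 5                                       # 5-disk raid5: 4 data strips/stripe
full, small = disks_touched(n, tail_strips=n - 2)  # tail = stripe width minus one
print(len(full & small))                    # disks written twice -> 4 (= N - 1)
```

With a 5-disk array the tail lands on 4 disks, every one of which also
holds part of the full-width BG, so 4 disks seek twice in that commit.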
> > It's only necessary to split an extent if there are no other writes
> > in the same transaction that could be combined with the extent tail
> > into a single RAID stripe.  As long as everything in the RAID
> > stripe belongs to a single transaction, there is no write hole
>
> Maybe a "simpler" optimization would be to close the transaction when
> the data reaches the stripe boundary... But I suspect that it is not
> so simple to implement.

Transactions exist in btrfs to batch up writes into big contiguous
extents already.  The trick is to _not_ do that when one transaction
ends and the next begins, i.e. leave a gap at the end of the
partially-filled stripe so that the next transaction begins in an
empty stripe.

This does mean extra seeks, but only during transaction commit and
fsync()--which were already very seeky to begin with.  It's not
necessary to write a partial stripe when there are other extents to
combine.  So there will be double the amount of seeking, but depending
on the workload, it could double only a very small percentage of the
writes.

> > Not for d.  Balance doesn't know how to get rid of unreachable
> > blocks in extents (it just moves the entire extent around), so
> > after a balance the writes would still be rounded up to the stripe
> > size.  Balance would never be able to free the rounded-up space.
> > That space would just be gone until the file was overwritten,
> > deleted, or defragged.
>
> If balance is capable of moving the extent, why not place one near
> the other during a balance?  The goal is not to limit the writing of
> the end of an extent, but to avoid writing the end of an extent
> without further data (e.g. the gap to the stripe has to be filled in
> the same transaction)

That's plan f (leave the gaps in RAID stripes empty).  Balance will
repack short extents into RAID stripes nicely.

Plan d can't do that, because plan d overallocates the extent so that
the extent fills the stripe (only some of the extent is used for
data).
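The accounting difference between the two plans can be shown with toy
arithmetic (again not btrfs code; the 64KB strip and 5-disk raid5
geometry are illustrative, and `plan_d`/`plan_f` are my own names for
the two plans discussed above):

```python
# Toy arithmetic, NOT btrfs code: contrast plan d (round the extent up
# to the stripe) with plan f (leave the stripe gap empty) for a 96 KiB
# write.  Assumed geometry: 5-disk raid5, 64 KiB strips, so one full
# stripe holds 4 x 64 = 256 KiB of data.

STRIP = 64          # KiB per-disk strip
DATA_WIDTH = 4      # data strips per full stripe (5-disk raid5)
STRIPE = STRIP * DATA_WIDTH

def plan_d(write_kib):
    # Plan d: the extent itself is rounded up to the stripe size, so
    # the rounded-up space belongs to the extent and balance can never
    # free it.  Returns (allocated KiB, KiB stuck until rewrite/defrag).
    allocated = -(-write_kib // STRIPE) * STRIPE   # ceil to stripe
    return allocated, allocated - write_kib

def plan_f(write_kib):
    # Plan f: allocate only the data and leave the rest of the stripe
    # empty; the gap stays free space that balance can repack later.
    # Returns (allocated KiB, reclaimable gap KiB).
    gap = (-write_kib) % STRIPE
    return write_kib, gap

print(plan_d(96))   # (256, 160): 160 KiB unreachable after the write
print(plan_f(96))   # (96, 160): same 160 KiB gap, but still free space
```

Both plans leave the same 160 KiB hole in the stripe; the difference
is who owns it--under plan d it is inside the extent and unreclaimable
by balance, under plan f it is ordinary free space.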
Small but important difference.

> BR
> G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5