On 1/8/21 2:05 AM, Zygo Blaxell wrote:
On Thu, May 28, 2020 at 08:34:47PM +0200, Goffredo Baroncelli wrote:

[...]

I've been testing these patches for a while now.  They enable an
interesting use case that can't otherwise be done safely, sanely or
cheaply with btrfs.

Thanks Zygo for this feedback. As usual you are a source of very interesting
considerations.

Normally if we have an array of, say, 10 spinning disks, and we want to
implement a writeback cache layer with SSD, we would need 10 distinct SSD
devices to avoid reducing btrfs's ability to recover from drive failures.
The writeback cache will be modified on both reads and writes, data and
metadata, so we need high endurance SSDs if we want them to make it to
the end of their warranty.  The SSD firmware has to not have crippling
performance bugs while under heavy write load, which means we are now
restricted to an expensive subset of high endurance SSDs targeted at
the enterprise/NAS/video production markets...and we need 10 of them!

NVME has fairly draconian restrictions on drive count, and getting
anything close to 10 of them into a btrfs filesystem can be an expensive
challenge.  (I'm not counting solutions that use USB-to-NVME bridges
because those don't count as "sane" or "safe").

We can share the cache between disks, but not safely in writeback mode,
because a failure in one SSD could affect multiple logical btrfs disks.
Strictly speaking we can't do it safely in any cache mode, but at least
with a writethrough cache we can recover the btrfs by throwing the SSDs
away.

I will reply in a separate thread, because I found your considerations very
interesting but OT.


In the current btrfs raid5 and dm-cache implementations, a scrub through a
SSD/HDD cache pair triggers an _astronomical_ number of SSD cache writes
(enough to burn through a low-end SSD's TBW in a weekend).  That can
presumably be fixed one day, but it's definitely unusable today.

Again, this is another point to discuss...


If we have only 2 NVME drives, with this patch, we can put them to work
storing metadata, leave the data on the spinning rust, and keep all
the btrfs self-repair features working, and get most of the performance
gain of SSD caching at significantly lower cost.  The write loads are
reasonable (metadata only, no data, no reads) so we don't need a high
endurance SSD.  With raid1/10/c3/c4 redundancy, btrfs can fix silent
metadata corruption, so we can even use cheap SSDs as long as they
don't hang when they fail.  We can use btrfs raid5/6 for data since
preferred_metadata avoids the bugs that kill the SSD when we run a scrub.

Now all we have to do is keep porting these patches to new kernels until
some equivalent feature lands in mainline.  :-P

Ok, this would be doable.

I do have some comments about these particular patches:

I dropped the "preferred_metadata" mount option very early in testing.
Apart from the merge conflicts with later kernels, the option is
redundant, since we could just as easily change all the drive type
properties back to 0 to get the same effect.  Adding an incompatible
mount option (and potentially removing it again after a failed test)
was a comparatively onerous requirement.

Could you clarify this point? I understood that you removed some pieces
of patch #4, but the other parts are still needed.

Anyway, I have some doubts about removing this option entirely.
The option enables the new policy; the disk property instead marks a
disk as eligible as preferred storage for metadata. These are two very
different concepts, and I prefer to keep them separate.

The behavior that you suggested is something like:
- if the "preferred metadata" property is set on any disk, the
"preferred metadata" behavior is enabled.

My main concern is that setting this disk flag is not atomic, whereas
setting the mount option is. Would it help the maintenance of the
filesystem when the user is replacing (or dropping) disks?

Anyway, I am open to hearing other opinions.
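
To make the two alternatives concrete, here is a minimal sketch in plain C
of the two enabling mechanisms; the structure and field names are
hypothetical, this is not the actual patch code.

   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical, simplified filesystem state; not the real btrfs
    * structures or field names. */
   struct fs_state {
           bool opt_preferred_metadata;  /* (a) explicit mount option      */
           const bool *dev_preferred;    /* (b) per-device property values */
           size_t num_devices;
   };

   /* (a) Mount option: the policy is switched on or off atomically at
    *     mount time, independently of how the devices are marked. */
   static bool policy_enabled_by_mount_option(const struct fs_state *fs)
   {
           return fs->opt_preferred_metadata;
   }

   /* (b) Inferred from the device property: the policy becomes active
    *     as soon as any device carries the flag, so enabling or
    *     disabling it is a sequence of per-device updates rather than
    *     one atomic switch. */
   static bool policy_enabled_by_device_property(const struct fs_state *fs)
   {
           for (size_t i = 0; i < fs->num_devices; i++)
                   if (fs->dev_preferred[i])
                           return true;
           return false;
   }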


The fallback to the other allocation mode when all disks of one type
are full is not always a feature.  When an array gets full, sometimes
data gets on SSDs, or metadata gets on spinners, and we quickly lose
the benefit of separating them.  Balances in either direction to fix
this after it happens are time-consuming and waste SSD lifetime TBW.

I'd like an easy way to make the preference strict, i.e. if there are
two types of disks in the filesystem, and one type fills up, and we're
not in a weird state like degraded mode or deleting the last drive of
one type, then go directly to ENOSPC.  That requires a more complex
patch since we'd have to change the way free space is calculated for
'df' to exclude the metadata devices, and track whether there are two
types of device present or just one.


I fear another issue: what happens when the metadata disks are filled?
Does BTRFS allow updating the "preferred metadata" property on a
filled disk? I would say yes, but I am not sure what happens once
BTRFS forces the filesystem read-only.

Frankly speaking, I think that a slower filesystem is far better than a
locked one. A better approach would be to add a warning to btrfs-progs
which says: "WARNING: your preferred metadata disks are full !!!"
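
As a rough illustration of the two behaviours being discussed (the current
fallback versus the strict ENOSPC that Zygo proposes), here is a minimal
sketch in plain C; the device flags and the function are hypothetical, not
the kernel allocator code.

   #include <stdbool.h>
   #include <stddef.h>
   #include <errno.h>

   enum alloc_type { ALLOC_DATA, ALLOC_METADATA };

   /* Hypothetical per-device state, not the real btrfs structures. */
   struct dev {
           bool preferred_metadata;   /* device marked for metadata */
           bool has_free_space;
   };

   /*
    * Pick a device for a new chunk.  With strict == false (roughly the
    * behaviour of the current patches) we fall back to the "wrong"
    * class of device when the preferred class is full; with
    * strict == true we return -ENOSPC instead, as Zygo suggests.
    */
   static int pick_device(struct dev *devs, size_t n,
                          enum alloc_type type, bool strict,
                          struct dev **out)
   {
           bool want_pref = (type == ALLOC_METADATA);

           /* First pass: only devices of the preferred class. */
           for (size_t i = 0; i < n; i++) {
                   if (devs[i].preferred_metadata == want_pref &&
                       devs[i].has_free_space) {
                           *out = &devs[i];
                           return 0;
                   }
           }

           if (strict)
                   return -ENOSPC;   /* strict mode: no silent mixing */

           /* Fallback pass: any device with free space. */
           for (size_t i = 0; i < n; i++) {
                   if (devs[i].has_free_space) {
                           *out = &devs[i];
                           return 0;
                   }
           }
           return -ENOSPC;
   }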


Below I collected some data to highlight the performance increment.

Test setup:
As a test I performed a "dist-upgrade" of a Debian system from stretch to buster.
The test used an image of a Debian stretch install [1] with the needed packages
already present under /var/cache/apt/archives/ (so no networking was involved).
For each test I formatted the filesystem from scratch, un-tarred the
image and then ran "apt-get dist-upgrade" [2]. For each disk(s)/filesystem
combination I measured the time of the apt dist-upgrade with and
without the dpkg flag "force-unsafe-io", which reduces the use of sync(2)
and fsync(2). The SSD was 20GB, the HDD 230GB.

I considered the following scenarios:
- btrfs over ssd
- btrfs over ssd + hdd with my patch enabled
- btrfs over bcache over hdd+ssd
- btrfs over hdd (very, very slow....)
- ext4 over ssd
- ext4 over hdd

The test machine was an "AMD A6-6400K" with 4GB of RAM, of which about 3GB was
used as cache/buffers.

Data analysis:

Of course btrfs is slower than ext4 when a lot of sync/flush calls are involved.
Using apt on a rotational disk was a dramatic experience, and IMHO this sync-heavy
workflow should be replaced by one based on the btrfs snapshot capabilities. But
this is another (not easy) story.

Unsurprisingly, bcache performs better than my patch. This is an expected
result because bcache can also cache the data chunks (reads can go directly to
the SSD). bcache is about +60% slower than the SSD-only reference when there
are a lot of sync/flush calls, and only +20% slower in the other case.

Regarding the test with force-unsafe-io (fewer sync/flush calls), my patch
reduces the overhead from +256% (HDD only) to +113% relative to the SSD-only
reference, which I consider a good result considering how small the patch is.


Raw data:
The data below is the "real" time (as returned by the time command) consumed by
apt. The Delta % column is computed against the "btrfs ssd" reference time.


Test description         real (mmm:ss)  Delta %
--------------------     -------------  -------
btrfs hdd w/sync           142:38       +533%
btrfs ssd+hdd w/sync        81:04       +260%
ext4 hdd w/sync             52:39       +134%
btrfs bcache w/sync         35:59        +60%
btrfs ssd w/sync            22:31       reference
ext4 ssd w/sync             12:19        -45%



Test description         real (mmm:ss)  Delta %
--------------------     -------------  -------
btrfs hdd                   56:02       +256%
ext4 hdd                    51:32       +228%
btrfs ssd+hdd               33:30       +113%
btrfs bcache                18:57        +20%
btrfs ssd                   15:44       reference
ext4 ssd                    11:49        -25%
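
As a sanity check of the Delta % computation (delta = time / reference - 1),
here is a tiny C snippet with the values copied from the first table:

   #include <stdio.h>

   int main(void)
   {
           double ref     = 22 * 60 + 31;   /* btrfs ssd w/sync:     22:31 */
           double ssd_hdd = 81 * 60 +  4;   /* btrfs ssd+hdd w/sync: 81:04 */

           /* prints "+260%", matching the first table above */
           printf("%+.0f%%\n", (ssd_hdd / ref - 1) * 100);
           return 0;
   }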


[1] I created the image using "debootstrap stretch", then I installed a set
of packages with the following commands:

   # debootstrap stretch test/
   # chroot test/
   # mount -t proc proc proc
   # mount -t sysfs sys sys
   # apt --option=Dpkg::Options::=--force-confold \
         --option=Dpkg::options::=--force-unsafe-io \
         install mate-desktop-environment* xserver-xorg vim \
         task-kde-desktop task-gnome-desktop

Then I updated the release from stretch to buster by changing the file
/etc/apt/sources.list, and downloaded the packages for the dist-upgrade:

   # apt-get update
   # apt-get --download-only dist-upgrade

Then I created a tar of this image.
Before the dist-upgrade the space used was about 7GB, with 2281 packages
installed. After the dist-upgrade, the space used was 9GB with 2870 packages.
The upgrade installed/updated about 2251 packages.


[2] The actual command was a bit more complex, in order to avoid an interactive session:

   # mkfs.btrfs -m single -d single /dev/sdX
   # mount /dev/sdX test/
   # cd test
   # time tar xzf ../image.tgz
   # chroot .
   # mount -t proc proc proc
   # mount -t sysfs sys sys
   # export DEBIAN_FRONTEND=noninteractive
   # time apt-get -y --option=Dpkg::Options::=--force-confold \
        --option=Dpkg::options::=--force-unsafe-io dist-upgrade


BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5