[gentoo-user] Re: [offtopic] Copy-On-Write ?

Kai Krakow Sat, 16 Sep 2017 10:49:00 -0700

On Sat, 16 Sep 2017 10:05:21 -0700,
Rich Freeman <ri...@gentoo.org> wrote:

> On Sat, Sep 16, 2017 at 9:43 AM, Kai Krakow <hurikha...@gmail.com>
> wrote:
> >
> > Actually, I'm running across 3x 1TB here on my desktop, with mraid1
> > and draid 0. Combined with bcache it gives confident performance.
> >  
> 
> Not entirely sure I'd use the word "confident" to describe a
> filesystem where the loss of one disk guarantees that:
> 1.  You will lose data (no data redundancy).
> 2.  But the filesystem will be able to tell you exactly what data you
> lost (as metadata will be fine).

I take daily backups with borg backup. A run takes only 15 minutes, and
it has been tested successfully twice. The only breakdowns I had were
due to btrfs bugs, not hardware faults.

This is confident enough for my desktop system.

> > I was very happy a long time with XFS but switched to btrfs when it
> > became usable due to compression and stuff. But performance of
> > compression seems to get worse lately, IO performance drops due to
> > hogged CPUs even if my system really isn't that incapable.
> >  
> 
> Btrfs performance is pretty bad in general right now.  The problem is
> that they just simply haven't gotten around to optimizing it fully,
> mainly because they're more focused on getting rid of the data
> corruption bugs (which is of course the right priority).  For example,
> with raid1 mode btrfs picks the disk to use for raid based on whether
> the PID is even or odd, without any regard to disk utilization.
> 
> When I moved to zfs I noticed a huge performance boost.

Interesting... I never tried it, but I always feared that it would
perform worse unless you throw RAM and ZIL/L2ARC at it.
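
On the raid1 read policy quoted above, here is a minimal sketch of why
choosing the mirror from PID parity ignores disk utilization. This is
not the btrfs code: the function names and the queue-depth counters are
made up for illustration, and the load-aware variant is just one
hypothetical alternative.

```python
from collections import Counter
from typing import Sequence


def pick_mirror_by_pid(pid: int, num_mirrors: int = 2) -> int:
    """Choose a mirror purely from the caller's PID (even/odd for two mirrors)."""
    return pid % num_mirrors


def pick_mirror_by_load(queue_depths: Sequence[int]) -> int:
    """Hypothetical alternative: choose the mirror with the fewest issued requests."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])


# One busy process (illustrative PID 4242) always hits the same disk.
reads_by_pid = Counter(pick_mirror_by_pid(4242) for _ in range(1000))

# The load-aware policy spreads the same 1000 reads over both disks.
# Simplification: the counters only grow, requests never complete.
depths = [0, 0]
for _ in range(1000):
    depths[pick_mirror_by_load(depths)] += 1

print(dict(reads_by_pid), depths)
```

With the PID policy, a single heavy reader never benefits from the
second disk, no matter how idle it is.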

> Fundamentally I don't see why btrfs can't perform just as well as the
> others.  It just isn't there yet.

And it will still take a long time, because the devs keep throwing new
features at it which need to stabilize.

> > What's still cool is that I don't need to manage volumes since the
> > volume manager is built into btrfs. XFS on LVM was not that
> > flexible. If btrfs wouldn't have this feature, I probably would
> > have switched back to XFS already.  
> 
> My main concern with xfs/ext4 is that neither provides on-disk
> checksums or protection against the raid write hole.

Btrfs has suffered from the same RAID5 write hole problem for years. I
always planned to move to RAID5 later (which is why I have 3 disks), but
I fear this won't be fixed any time soon due to design decisions made
too early.

> I just switched motherboards a few weeks ago and either a connection
> or a SATA port was bad because one of my drives was getting a TON of
> checksum errors on zfs.  I moved it to an LSI card and scrubbed, and
> while it took forever and the system degraded the array more than once
> due to the high error rate, eventually it patched up all the errors
> and now the array is working without issue.  I didn't suffer more than
> a bit of inconvenience but with even mdadm raid1 I'd have had a HUGE
> headache trying to recover from that (doing who knows how much
> troubleshooting before realizing I had to do a slow full restore from
> backup with the system down).

I found md raid not very reliable in the past, but I haven't tried it
again in years, so this may have changed. I only remember that it
destroyed a file system after an unclean shutdown more than once, which
is not what I expect from RAID1. Other servers with file systems on
bare metal survived this just fine.

> I just don't see how a modern filesystem can get away without having
> full checksum support.  It is a bit odd that it has taken so long for
> Ceph to introduce it, and I'm still not sure if it is truly
> end-to-end, or if at any point in its life the data isn't protected by
> checksums.  If I were designing something like Ceph I'd checksum the
> data at the client the moment it enters storage, then independently
> store the checksum and data, and then retrieve both and check it at
> the client when the data leaves storage.  Then you're protected
> against corruption at any layer below that.  You could of course have
> additional protections to catch errors sooner before the client even
> sees them.  I think that the issue is that Ceph was really designed
> for object storage originally and they just figured the application
> would be responsible for data integrity.

I'd at least pass the checksum through all the layers and check it
again at each one, so you could detect which transport or layer is
broken.
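
A minimal sketch of that idea, assuming a CRC32 computed once at the
client and re-verified after every hop. The layer names and the
`flaky_cable` fault are invented for illustration; this says nothing
about how Ceph, ZFS or btrfs actually do it.

```python
import zlib


class ChecksumMismatch(Exception):
    """Raised with the name of the first layer that returned bad data."""

    def __init__(self, layer):
        super().__init__(f"checksum mismatch detected at layer: {layer}")
        self.layer = layer


def checksum(data: bytes) -> int:
    return zlib.crc32(data)


def pass_through(layers, data: bytes) -> bytes:
    """Send data through each layer, re-checking the client checksum after each hop."""
    expected = checksum(data)  # computed once, where the data enters storage
    for name, transform in layers:
        data = transform(data)
        if checksum(data) != expected:
            raise ChecksumMismatch(name)
    return data


def healthy(data: bytes) -> bytes:
    return data


def flaky_cable(data: bytes) -> bytes:
    # Hypothetical fault: flip one bit in the last byte.
    return data[:-1] + bytes([data[-1] ^ 0x01])


layers = [("network", healthy), ("sata-link", flaky_cable), ("disk", healthy)]
try:
    pass_through(layers, b"hello")
    broken = None
except ChecksumMismatch as err:
    broken = err.layer

print(broken)
```

Because the check runs after every hop, the exception names the first
layer that handed back corrupted data instead of only telling the
client that something went wrong somewhere below it.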

> The other benefit of checksums is that if they're done right scrubs
> can go a lot faster, because you don't have to scrub all the
> redundancy data synchronously.  You can just start an idle-priority
> read thread on every drive and then pause it anytime a drive is
> accessed, and an access on one drive won't slow down the others.  With
> traditional RAID you have to read all the redundancy data
> synchronously because you can't check the integrity of any of it
> without the full set.  I think even ZFS is stuck doing synchronous
> reads due to how it stores/computes the checksums.  This is something
> btrfs got right.

That's another reason I decided on btrfs, though I don't make much use
of it currently. I used to do regular scrubs a while ago, but combined
with bcache that is an SSD killer... I killed my old 128G SSD within
one year, although I used overprovisioning. Well, I didn't actually
kill it: I swapped it at 99% lifetime according to smartctl. It would
probably still work for a long time under normal workloads.

> >>  For the moment I'm
> >> relying more on zfs.  
> >
> > How does it perform memory-wise? Especially, I'm currently using
> > bees[1] for deduplication: It uses a 1G memory mapped file (you can
> > choose other sizes if you want), and it picks up new files really
> > fast, within a minute. I don't think zfs can do anything like that
> > within the same resources.  
> 
> I'm not using deduplication, but my understanding is that zfs
> deduplication:
> 1.  Works just fine.

No doubt...

> 2.  Uses a TON of RAM.

That's the problem. And I think there is no near-line dedup tool
available?

> So, it might not be your cup of tea.  There is no way to do
> semi-offline dedup as with btrfs (not really offline in that the
> filesystem is fully running - just that you periodically scan for dups
> and fix them after the fact, vs detect them in realtime).    With a
> semi-offline mode then the performance hits would only come at a time
> of my choosing, vs using gobs of RAM all the time to detect what are
> probably fairly rare dups.

I'm using bees, and I'd call it near-line. Changes to files are picked
up at commit time, when a new generation is made, and then it walks the
new extents, maps those to files, and deduplicates the blocks. I was
surprised how fast it detects new duplicate blocks. But it is still
working through the rest of the file system (for days now), at least
without much impact on performance. Giving up 1G of RAM for this is
totally okay.
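
To illustrate the general block-hash approach (not the actual bees
implementation), here is a toy sketch that hashes fixed-size blocks and
reports later copies of a block already seen. The 4096-byte block size,
the file names and the unbounded dict are assumptions made for the
example; a real tool would keep a size-limited table, compare the bytes,
and ask the kernel to share the extents.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def find_duplicate_blocks(files):
    """Return (duplicate_location, original_location) pairs.

    `files` maps a file name to its contents as bytes. A location is a
    (file_name, block_index) tuple. The first occurrence of a block is
    treated as the original, later identical blocks as duplicates.
    """
    seen = {}
    duplicates = []
    for name, data in files.items():
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            location = (name, offset // BLOCK_SIZE)
            if digest in seen:
                duplicates.append((location, seen[digest]))
            else:
                seen[digest] = location
    return duplicates


# Two hypothetical files that share one identical block.
files = {
    "a.img": b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE,
    "b.img": b"C" * BLOCK_SIZE + b"A" * BLOCK_SIZE,
}
dups = find_duplicate_blocks(files)
print(dups)
```

Only the hash table has to stay in memory, which is why a fixed memory
budget can be enough even for a large file system.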

Once it has finished the first scan, I'm thinking about starting it at
timed intervals. But it looks like the impact will be so low that I can
keep it running all the time. Using cgroups to limit its CPU and IO
shares works really well.

I still haven't evaluated how it interferes with defragmenting, though,
or how big the impact of bees fragmenting extents is.

> That aside, I find it works fine memory-wise (I don't use dedup).  It
> has its own cache system not integrated fully into the kernel's native
> cache, so it tends to hold on to a lot more ram than other
> filesystems, but you can tune this behavior so that it stays fairly
> tame.

I think the reason for using its own caching is that block caching at
the VFS layer just cannot be done efficiently for a COW file system
with scrubbing and everything. You need good cache hinting throughout
the whole pipeline, which is only slowly being integrated into the
kernel.

E.g., when btrfs performs a COW operation, bcache doesn't get notified
that it can discard the freed block from its cache. I don't know if
this is handled in the kernel cache layer...

-- 
Regards,
Kai

Replies to list-only preferred.
