Re: Migrate to bcache: A few questions

Duncan Sun, 29 Dec 2013 22:26:29 -0800

Kai Krakow posted on Sun, 29 Dec 2013 22:11:16 +0100 as excerpted:

> Hello list!
> 
> I'm planning to buy a small SSD (around 60GB) and use it for bcache in
> front of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back
> caching. Btrfs is my root device, thus the system must be able to boot
> from bcache using init ramdisk. My /boot is a separate filesystem
> outside of btrfs and will be outside of bcache. I am using Gentoo as my
> system.


Gentooer here too. =:^)

> I have a few questions:
> 
> * How stable is it? I've read about some csum errors lately...

FWIW, both bcache and btrfs are new and still developing technology.  
While I'm using btrfs here, I have tested usable backups (which for root 
means either directly bootable, or that you have tested booting to a 
recovery image and restoring from there; I do the former here), as 
STRONGLY recommended for btrfs in its current state, but haven't had to 
use them.

And I considered bcache previously and might otherwise be using it, but 
at least personally, I'm not willing to try BOTH of them at once, since 
neither one is mature yet and if there are problems as there very well 
might be, I'd have the additional issue of figuring out which one was the 
problem, and I'm personally not prepared to deal with that.

Instead, at this point I'd recommend choosing /either/ bcache /or/ btrfs, 
and using bcache with a more mature filesystem like ext4 or (what I used 
for years previous and still use for spinning rust) reiserfs.

And as I said, keep your backups as current as you're willing to deal 
with losing what's not backed up, and tested usable and (for root) either 
bootable or restorable from alternate boot, because while at least btrfs 
is /reasonably/ stable for /ordinary/ daily use, there remain corner-
cases and you never know when your case is going to BE a corner-case!

> * I want to migrate my current storage to bcache without replaying a
> backup.  Is it possible?

Since I've not actually used bcache, I won't try to answer some of these, 
but will answer based on what I've seen on the list where I can...  I 
don't know on this one.

> * Did others already use it? What is the perceived performance for
> desktop workloads in comparision to not using bcache?

Others are indeed already using it.  I've seen some btrfs/bcache problems 
reported on this list, but as mentioned above, when both are in use that 
means figuring out which is the problem, and at least from the btrfs side 
I've not seen a lot of resolution in that regard.  From here it /looks/ 
like that's simply being punted at this time, as there's still more 
easily traceable problems without the additional bcache variable to work 
on first.  But it's quite possible the bcache list is actively tackling 
btrfs/bcache combination problems; I'm not subscribed there, so I 
wouldn't know.

So I can't answer the desktop performance comparison question directly, 
but given that I /am/ running btrfs on SSD, I /can/ say I'm quite happy 
with that. =:^)

Keep in mind...

We're talking storage cache here.  Given the cost of memory and common 
system configurations these days, 4-16 gig of memory on a desktop isn't 
unusual or cost prohibitive, and a common desktop working set should fit 
comfortably in that.

I suspect my desktop setup, 16 gigs memory backing a 6-core AMD fx6100 
(bulldozer-1) @ 3.6 GHz, is probably a bit toward the high side even for 
a gentooer, but not inordinately so.  Based on my usage...

Typical app memory usage runs 1-2 GiB (that's with KDE 4.12.49.9999 from 
the gentoo/kde overlay, but USE=-semantic-desktop, etc).  Buffer memory 
runs a few MiB but isn't normally significant, so it can fold into that 
same 1-2 GiB too.

That leaves a full 14 GiB for cache.  But at least with /my/ usage, 
normal non-update cache memory usage tends to be below ~6 GiB too, so 
total apps/buffer/cache memory usage tends to be below 8 GiB as well.

When I'm doing multi-job builds or working with big media files, I'll 
sometimes go above 8 gig usage, and that occasional cache-spill was why I 
upgraded to 16 gig.  But in practice, 10 gig would take care of that most 
of the time, and were it not for the "accident" of powers-of-two meaning 
16 gig is the notch above 8 gig, 10 or 12 gig would be plenty.  Truth be 
told, I so seldom use that last 4 gig that it's almost embarrassing.

* Tho if I ran multi-GiB VMs that'd use up that extra memory real fast!  
But while that /is/ becoming more common, I'm not exactly sure I'd 
classify 4 gigs plus of VM usage as "desktop" usage just yet.  
Workstation, yes, and definitely server, but not really desktop.
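
As a rough illustration of the memory budget described above, here is a 
minimal Python sketch.  The figures are the ones quoted in this post 
(upper end of the app range, the ~6 GiB cache ceiling); the function and 
variable names are made up for the example.

```python
# Rough memory budget from the figures quoted above (all values in GiB).
APPS_AND_BUFFERS = 2   # upper end of the 1-2 GiB typical app/buffer usage
TYPICAL_CACHE = 6      # normal non-update cache usage stays below ~6 GiB


def headroom(installed, apps=APPS_AND_BUFFERS, cache=TYPICAL_CACHE):
    """Return GiB left over after apps/buffers and the usual page cache."""
    return installed - apps - cache


typical_use = APPS_AND_BUFFERS + TYPICAL_CACHE

for installed in (8, 12, 16):
    print(f"{installed:2d} GiB installed: {headroom(installed)} GiB spare")
```

With these numbers an 8 GiB machine has nothing spare once the cache 
fills, which matches the occasional cache-spill mentioned above.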

All that as background to this...

* Cache works only after first access.  If you only access something 
occasionally, it may not be worth caching at all.

* Similarly, if access isn't time critical, think of playing a huge video 
file where only a few meg in memory at once is plenty, and where storage 
access is several times faster than play-speed, cache isn't particularly 
useful.

* Bcache is designed not to cache sequential access (that large video 
file) in any case, since spinning rust tends to be more than fast enough 
for that sort of thing already.

Given the 3 x 1 TB btrfs config you mention (raid1 metadata, raid0 data), 
I'm wondering if big media is a/the big use case for you, in which case 
bcache isn't going to be a good solution anyway, since that tends to be 
sequential access, which bcache deliberately ignores as it doesn't fit 
the model it's targeting.

(I am a bit worried about that raid0 data, tho.  Unless you consider that 
data of trivial value that's not a good choice, since raid0 generally 
means you lose it all if you lose a physical device.  And you're running 
three devices, which means you just tripled the chance of a device 
failure over that of just putting it all on a single 3 TB drive!  And 
backups... a 3 TB restore on spinning rust will take some time any way 
you look at it, so backups may or may not be particularly viable here.  
The most common use case for that much data is probably a DVR scenario,  
which is video, and you may well consider it of low enough value that if 
you lose it, you lose it, and you're willing to take that risk, but for 
normally sequential access video/media, bcache isn't a good match anyway.)
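
To put a rough number on the "tripled" claim, here is a small Python 
sketch.  It assumes device failures are independent and that a single 
3 TB drive fails at the same rate as each 1 TB drive; the 3% per-drive 
figure is purely an illustrative assumption, not a measured value.

```python
def p_any_failure(p_single: float, n_devices: int) -> float:
    """Probability that at least one of n independent devices fails."""
    return 1.0 - (1.0 - p_single) ** n_devices


# ASSUMPTION: 3% failure probability per drive over some period,
# chosen only for illustration.
P_DRIVE = 0.03

single = p_any_failure(P_DRIVE, 1)    # everything on one drive
striped = p_any_failure(P_DRIVE, 3)   # raid0 data across three drives

print(f"one drive:    {single:.4f}")
print(f"three drives: {striped:.4f} ({striped / single:.2f}x)")
```

For small per-drive probabilities the ratio comes out just under 3, so 
"tripled" is a fair approximation for raid0 data, where losing any one 
device loses the lot.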

* With memory cost what it is, for repeat access where initial access 
time isn't /too/ critical, investing in more memory, to a point (for me, 
8-12 gig as explained above), and simply letting the kernel manage cache 
and memory as it normally does, may make more sense than bcache to an ssd.

* Of course, what bcache *DOES* effectively do, is extend the per-boot 
cache time of memory, making the cache persistent.  That effectively 
extends the time over which "occasional access" still justifies caching 
at all.

* That makes bcache well suited to boot-time and initial-access-speed-
critical scenarios, where more memory for a larger in-memory cache won't 
do any good, since it's first-access-since-boot, because for in-memory 
cache that's a cold-cache scenario, while with bcache's persistent cache, 
it's a hot-cache scenario.
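
The cold-cache vs. hot-cache point can be shown with a toy model.  This 
is only a sketch of the idea, not how bcache or the kernel page cache is 
actually implemented; LRUCache, hits_after_reboot, the capacity and the 
block numbers are all hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache keyed by block number (toy stand-in, not bcache)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, block):
        """Return True on a hit; on a miss, insert and evict the oldest."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        self.blocks[block] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)
        return False


def hits_after_reboot(cache, workload, persistent):
    """Run the workload, 'reboot', rerun it, and count second-run hits."""
    for block in workload:
        cache.access(block)
    if not persistent:
        cache.blocks.clear()   # an in-memory cache is lost at reboot
    return sum(cache.access(block) for block in workload)


boot_blocks = list(range(100))   # hypothetical blocks read during boot
ram_hits = hits_after_reboot(LRUCache(1000), boot_blocks, persistent=False)
ssd_hits = hits_after_reboot(LRUCache(1000), boot_blocks, persistent=True)
print(f"in-memory cache hits after reboot:  {ram_hits}")
print(f"persistent cache hits after reboot: {ssd_hits}")
```

Each block is read only once per boot here, so the in-memory cache never 
gets a hit no matter how large it is, while the persistent one hits on 
everything it kept from the previous boot.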


But what I'm actually wondering is if your use case better matches a 
split data model, where you put root and perhaps stuff like the portage 
tree and/or /home on fast SSD, while keeping all that big and generally 
sequential access media on slower but much cheaper big spinning rust.

That's effectively what I've done here, tho I'm looking at rather less 
than a TB of slow-access media, etc.  See below for the details.  The 
general idea is as I said to stick all the time-critical stuff on SSD 
directly (not using something like bcache), while keeping the slower 
spinning rust for the big less-time-critical and sequential-access stuff, 
and for non-btrfs backups of the stuff on the btrfs-formatted SSD, since 
btrfs /is/ after all still in development, and I /do/ intend to be 
prepared if /my/ particular case ends up being one of the corner-cases 
btrfs still worst-cases on.

> * How well does bcache handle power outages? Btrfs does handle them very
>   well since many months.

Since I don't run bcache I can't really speak to this at all, /except/, 
the btrfs/bcache combo trouble reports that have come to the list have I 
think all been power outage or kernel-crash scenarios... as could be 
predicted of course since that's a filesystem's worst-case scenario, at 
least that it has to commonly deal with.

But I know I'd definitely not trust that case, ATM.  Like I said, I'd not 
trust the combination of the two, and this is exactly where/why.  Under 
normal operation, the two should work together well.  But in a power-loss 
situation with both technologies being still relatively new and under 
development... not *MY* data!

> * How well does it play with dracut as initrd? Is it as simple as
> telling it the new device nodes or is there something complicate to
> configure?

I can't answer this at all for bcache, but I can say I've been relatively 
happy with the dracut initramfs solution for dual-device btrfs raid1 
root. =:^)  (At least back when I first set it up several kernels ago, 
the kernel's commandline parser apparently couldn't handle the multiple 
equals of something like rootflags=device=/dev/sda5,device=/dev/sdb5.  So 
the only way to get a multi-device btrfs rootfs to work was to use an 
initr* with userspace btrfs device scan before attempting to mount real-
root, and dracut has worked well for that.)

> * How does bcache handle a failing SSD when it starts to wear out in a
> few years?

Given the newness of the bcache technology, assuming your SSD doesn't 
fail early and it is indeed a few years, I'd suggest that question is 
premature.  Bcache will by that time be much older and more mature than 
it is now, and how it'd handle, or fail to handle, such an event /now/ 
likely hasn't a whole lot to do with how much (presumably) better it'll 
handle it /then/.

> * Is it worth waiting for hot-relocation support in btrfs to natively
> use a SSD as cache?

I wouldn't wait for it.  It's on the wishlist, but according to the wiki 
(project ideas, see the dm_cache or bcache like cache, and the hybrid 
storage points), nobody has claimed that project yet, which makes it 
effectively status "bluesky", which in turn means "nice idea, we might 
get to it... someday."

Given the btrfs project's history of everything taking rather longer 
than the original (as it turned out, wildly optimistic) projections, in 
the absence of a good filesystem dev personally getting that specific 
itch to scratch, that means it's likely a good two years out, and may be 
5-10.  So no, I'd definitely *NOT* wait on it!

> * Would you recommend going with a bigger/smaller SSD? I'm planning to
> use only 75% of it for bcache so wear-leveling can work better, maybe
> use another part of it for hibernation (suspend to disk).

FWIW, for my split data, some on SSD, some on spinning rust, setup, I had 
originally planned perhaps a 64 gig or so SSD, figuring I could put the 
boot-time-critical rootfs and a few other initial-access-time-critical 
things on it, with a reasonable amount of room to spare for wear-
leveling.  Maybe 128 gig or so, with a bit more stuff on it.

But when I actually went looking for hardware (some months ago now, but 
rather less than a year), I found the availability and price-point knee 
at closer to 256 gig.  128 gig or so was at a similar price-point per-
gig, but tends to sell out pretty fast as it's about half the gigs and 
thus about half the price.  There were some smaller ones available, but 
they tended to be either MUCH slower or MUCH higher priced, I'd guess 
left over from a previous generation before prices came down, and they 
simply hadn't been re-priced to match current price/capacity price-points.

But go much below 128 GiB (there were some 120 GB at about the same 
per-gig price, which "units" says is just under 112 GiB) and the price 
per gig tends to go up, while above 256 GB (not GiB) both the price per 
gig and the full price tend to go up.

In practice, 60 or 80 GB SSDs just didn't seem to be that much cheaper 
than 120-ish gig, and 120-ish gig were a good deal, but were popular 
enough that availability was a bit of an issue.

So I actually ended up with 256 GB, which works out to ~ 238 GiB.  Yeah I 
paid a bit more, but that both gave me a bit more flexibility in terms of 
what I put on them, AND meant after I set them up I STILL had nearly half 
unallocated, giving them *LOTS* of wear-leveling room.
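
For anyone wanting to redo the GB vs. GiB conversions above without 
"units", a minimal Python sketch follows.  It assumes the vendor's GB is 
exactly 10^9 bytes; real drives often report slightly more than the 
nominal size, so actual figures can differ a little.  gb_to_gib is just 
a name picked for the example.

```python
GB = 10**9    # decimal gigabyte, as printed on drive packaging
GIB = 2**30   # binary gibibyte, as most partitioning tools report


def gb_to_gib(size_gb: float) -> float:
    """Convert decimal gigabytes to binary gibibytes."""
    return size_gb * GB / GIB


for size_gb in (60, 120, 128, 256):
    print(f"{size_gb:3d} GB = {gb_to_gib(size_gb):6.2f} GiB")
```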

Of course that means if you do actually do bcache, 60-ish gigs should be 
good and I'd guess 128 gig would be overkill, as I guess 40-60 gigs is 
probably about what my "hot" data is, the stuff bcache would likely catch.

And 60 gig will likely be /some/ cheaper tho not as much as you might 
expect, but you'll lose flexibility too, and/or you might actually pay 
more for the 60 gig than the 120 gig, or it'll be slower speed-rated.  
That was what I found when I actually went out to buy, anyway.

As to layout (all GPT partitions, not legacy MBR):

On the SSD(s; I actually have two set up, mostly in btrfs dual-device 
data/metadata raid1 partitions, but with some single-device mixed/dup):

-       (boot area)

x       1007 KiB free space (so partitions are 1 MiB aligned)

1       3 MiB BIOS reserved partition

        (grub2 puts its core image here, partitions are now 4 MiB aligned)

2       124 MiB EFI reserved partition (for EFI forward compatibility)

        (partitions are now 128 MiB aligned)

3       256 MiB /boot (btrfs mixed-block mode, DUP data/metadata)

        I have a separate boot partition on each of the SSDs, with grub2
        installed to both SSDs separately, each pointing at its own
        /boot, with the SSD I boot selectable in BIOS.  That gives me a
        working /boot and a primary /boot backup.  I run git kernels and
        normally update the working /boot with a new kernel once or
        twice a week, while only updating the backup /boot with the
        release kernel, so every couple months.

4       640 MiB /var/log (btrfs mixed-mode, raid1 data/metadata)

        That gives me plenty of log space as long as logrotate doesn't
        break, while still keeping a reasonable cap on the log partition
        in case I get a runaway log.  As any good sysadmin should know,
        some from experience (!!), keeping a separate log partition is a
        good idea, since that limits the damage if something /does/ go
        runaway logging.

        (partitions beyond this are now 1 GiB aligned)

5       8 GiB rootfs (btrfs raid1 data/metadata)

        My rootfs includes (almost) all "installable" data, everything
        installed by packages except for /var/lib, which is a symlink to
        /home/var/lib.  The reason for that is that I keep rootfs mounted
        read-only by default, only mounting it read-write for updates or
        configuration changes, and /var/lib needs to be writable.  /home
        is mounted writable, thus the /var/lib symlink pointing into it.

        I learned the hard way to keep everything installed (but for
        /var/lib) on the same filesystem, along with the installed-
        package database (/var/db/pkg on gentoo), when I had to deal with
        a recovery situation with rootfs, /var, and /usr on separate
        partitions, recovering each one from a backup made at a different
        time!  Now I make **VERY** sure everything stays in sync, so
        the installed-package database matches what's actually installed.

        (Obviously /var/lib is a limited exception in order to keep
        rootfs read-only by default.  If I have to recover from an out of
        sync /home and thus /home/var/lib, I can query for what packages
        own /var/lib and reinstall them.)

6       20 GiB /home (btrfs raid1 data/metadata)

        20 GiB is plenty big enough for /home, since I keep my big media
        files on a dedicated media partition on spinning rust.

7       24 GiB build and packages tree (btrfs raid1 data/metadata)

        I mount this at /usr/src, since that seemed logical, but it
        contains the traditional /usr/src/linux (a git kernel, here),
        plus the gentoo tree and layman-based overlays, plus my binpkg
        cache, plus the ccache.  Additionally it contains the 32-bit
        chroot binpkg cache and ccache, see below.

8       8 GiB 32-bit chroot build-image (btrfs raid1 data/metadata)

        I have a 32-bit netbook that runs gentoo also.  This is its
        build image, more or less a copy of its rootfs, but on my main
        machine where I build the packages for it.  I keep this
        rsynced to the netbook for its updates.  That way the slower
        netbook with its smaller hard drive doesn't have to build
        packages or keep a copy of the gentoo tree or the 32-bit
        binpkg cache at all.

9-12    Primary backups of partitions 5-8, rootfs, /home, packages, and
        netbook build image.  These partitions are the same size and
        configuration as their working copies above, recreated
        periodically to protect against fat-finger mishaps as well as
        still-under-development btrfs corner-cases and ~arch plus
        live-branch-kde, etc update mishaps.

        (My SSDs, Corsair Neutron series, run a LAMD (Link A
        Media Devices) controller.  These don't have the compression
        or dedup features of something like the sandforce controllers,
        but the Neutrons at least (as opposed to the Neutron GTX) are
        enterprise targeted, with the resulting predictable performance,
        capacity and reliability bullet-point features.  What you save to
        the SSD is saved as-you-sent-it, regardless of compressibility or
        whether it's a dup of something else already on the SSD.  Thus,
        at least with my SSDs, the redundant working and backup copies
        are actually two copies on the SSD as well, not one compressed/
        dedupped copy.  That's a very nice confidence point when the
        whole /point/ of sending two copies is to have a backup!  So
        for anyone reading this that decides to do something similar,
        be sure your SSD firmware isn't doing de-duping in the background,
        leaving you with only the one copy regardless of what you thought
        you might have saved!)
        

That STILL leaves me 117.5 GiB of the 238.5 GiB entirely free and 
unallocated for wear-leveling and/or future flexibility, and I've a 
second copy (primary backup copy) of most of the data as well, which 
could be omitted from SSD (kept on spinning rust) if necessary.
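
The free-space figure can be checked by adding up the layout above.  
This Python sketch uses only the sizes listed in this post; the 1 MiB 
first entry (boot area plus leading free space) is inferred from the 
"1 MiB aligned" note, and 238.5 GiB is the device size as quoted here.

```python
MIB_PER_GIB = 1024
DEVICE_GIB = 238.5   # SSD size as quoted in this post

# Small partitions ahead of the 1 GiB boundary, sizes in MiB.
small_mib = {
    "boot area + leading free space": 1,
    "BIOS reserved": 3,
    "EFI reserved": 124,
    "/boot": 256,
    "/var/log": 640,
}

# Working partitions 5-8, sizes in GiB; partitions 9-12 mirror them.
working_gib = {
    "rootfs": 8,
    "/home": 20,
    "build and packages tree": 24,
    "32-bit chroot": 8,
}

small_gib = sum(small_mib.values()) / MIB_PER_GIB
allocated = small_gib + 2 * sum(working_gib.values())
free = DEVICE_GIB - allocated

print(f"allocated: {allocated} GiB")
print(f"free:      {free} GiB ({free / DEVICE_GIB:.0%} of the device)")
```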

Before I actually got the drives and was still figuring on 128-ish gigs, 
I was figuring 1 gig x-log, maybe 6 gig rootfs, 16 gig home, 20 gig pkg, 
and another 6 gig netbookroot, so about 49 gig of data if I went with 
60-80 gig SSDs, with the backups as well 97 gig, if I went with 120-ish 
gig SSDs.
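
The same sum for the earlier plan, as a short Python sketch.  The 97 gig 
total only works out if the log partition gets no backup copy; that is 
an assumption read back from the final layout, where only partitions 5-8 
are backed up.

```python
# Planned sizes in gigs, as listed above.
planned = {"log": 1, "rootfs": 6, "home": 16, "pkg": 20, "netbookroot": 6}

data_only = sum(planned.values())

# ASSUMPTION: everything except the log partition gets a backup copy.
backups = sum(size for name, size in planned.items() if name != "log")
with_backups = data_only + backups

print(f"data only:    {data_only} gig")
print(f"with backups: {with_backups} gig")
```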

But as I said, once I actually was out there shopping for 'em I ended up 
getting the 256 GB (238.5 GiB) SSDs as a near-best bargain in terms of 
rated performance and reliability vs. size vs. price.


Still on spinning rust, meanwhile, all my filesystems remain the many-
years-stable reiserfs.  I keep a working and backup media partition 
there, as well as second backup partitions for everything on btrfs on the 
ssds, just in case.

Additionally, I have an external USB-connected drive that's both 
disconnected and off most of the time, to recover from in case something 
takes out both the SSDs and internal spinning rust.

I figure if the external gets taken out too, say by fire if my house 
burnt down or by theft if someone broke in and stole it, I'd have much 
more important things to worry about for a while than what might have 
happened to my data!  And once I did get back on my feet and ready to 
think about computing again, much of the data would be sufficiently 
outdated as to be near worthless in any case.  At that point I might as 
well start from scratch but for the knowledge in my head, and whatever 
offsite or the like backups I might have had probably wouldn't be worth 
the trouble to recover anyway, so that's beyond cost/time/hassle 
effective and I don't bother.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
