On Thu, May 19, 2016 at 8:51 PM, Austin S. Hemmelgarn
<ahferro...@gmail.com> wrote:
> On 2016-05-19 14:09, Kai Krakow wrote:
>>
>> Am Wed, 18 May 2016 22:44:55 +0000 (UTC)
>> schrieb Ferry Toth <ft...@exalondelft.nl>:
>>
>>> Op Tue, 17 May 2016 20:33:35 +0200, schreef Kai Krakow:
>>>
>>>> Am Tue, 17 May 2016 07:32:11 -0400 schrieb "Austin S. Hemmelgarn"
>>>> <ahferro...@gmail.com>:
>>>>
>>>>> On 2016-05-17 02:27, Ferry Toth wrote:
>>>
>>>  [...]
>>>  [...]
>>>>>
>>>>>  [...]
>>>
>>>  [...]
>>>  [...]
>>>  [...]
>>>>>
>>>>> On the other hand, it's actually possible to do this all online
>>>>> with BTRFS because of the reshaping and device replacement tools.
>>>>>
>>>>> In fact, I've done even more complex reprovisioning online before
>>>>> (for example, my home server system has 2 SSDs and 4 HDDs running
>>>>> BTRFS on top of LVM, and I've at least twice completely recreated
>>>>> the LVM layer online without any data loss and with minimal
>>>>> performance degradation).
>>>
>>>  [...]
>>>>>
>>>>> I have absolutely no idea how bcache handles this, but I doubt
>>>>> it's any better than BTRFS.
>>>>
>>>>
>>>> Bcache should in theory fall back to write-through as soon as an
>>>> error counter exceeds a threshold. This is adjustable via the sysfs
>>>> knobs io_error_halflife and io_error_limit. Tho I never tried what
>>>> actually happens when either the HDD (in bcache writeback-mode) or
>>>> the SSD fails. Actually, btrfs should be able to handle this (tho,
>>>> according to list reports, it doesn't handle errors very well at
>>>> this point).
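
For what it's worth, those thresholds are plain sysfs writes. A minimal
sketch, assuming the knobs sit in the cache-set directory (check the
paths on your kernel version before relying on this):

  # show the current error thresholds of the cache set
  cat /sys/fs/bcache/<cset-uuid>/io_error_limit
  cat /sys/fs/bcache/<cset-uuid>/io_error_halflife
  # lower the limit so bcache gives up on a flaky SSD sooner
  echo 8 > /sys/fs/bcache/<cset-uuid>/io_error_limit
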
>>>>
>>>> BTW: Unnecessary copying doesn't take place in bcache's default
>>>> mode: data is only copied from SSD to HDD in writeback mode (data
>>>> is written to the cache first, then persisted to HDD in the
>>>> background). You can also use "write through" (data is written to
>>>> SSD and persisted to HDD at the same time, reporting persistence
>>>> to the application only when both copies have been written) and
>>>> "write around" mode (data is written to HDD only, and only data
>>>> that is read gets copied to the SSD cache device).
>>>>
>>>> If you want bcache to behave as a huge IO scheduler for writes, use
>>>> writeback mode. If you have write-intensive applications, you may
>>>> want to choose write-around so as not to wear out the SSDs early. If
>>>> you want writes to be cached for later reads, you can choose
>>>> write-through mode. The latter two modes will ensure written data
>>>> is always persisted to HDD with the same guarantees you had without
>>>> bcache. The last mode (write-through) is the default and should not
>>>> change the behavior of btrfs if the HDD fails, and if the SSD fails
>>>> bcache would simply turn off and fall back to the HDD.
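
Side note: switching between those modes at runtime is also just a sysfs
write. A minimal sketch, assuming your bcache device shows up as bcache0:

  # the active mode is shown in [brackets]
  cat /sys/block/bcache0/bcache/cache_mode
  # switch to write-around to spare the SSD from write wear
  echo writearound > /sys/block/bcache0/bcache/cache_mode
  # or back to write-back for fsync-heavy workloads
  echo writeback > /sys/block/bcache0/bcache/cache_mode
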
>>>
>>>
>>> Hello Kai,
>>>
>>> Yeah, lots of modes. So that means, none works well for all cases?
>>
>>
>> Just three, and they all work well. It's just a trade-off between wear
>> and performance/safety. Depending on your workload you might benefit
>> more or less from write-back caching - that's when you want to turn the
>> knob. Everything else works out of the box. In case of an SSD failure,
>> write-back is just less safe, while the other two modes should keep
>> your FS intact.
>>
>>> Our server has lots of old files on smb (various sizes), imap
>>> (10000's of small, 1000's of large), a postgresql server, virtualbox
>>> images (large) and 50 or so snapshots, and running synaptic for
>>> system upgrades is painfully slow.
>>
>>
>> I don't think that bcache even cares to cache imap accesses to mail
>> bodies - it won't help performance. Network is usually much slower than
>> SSD access. But it will cache fs meta data which will improve imap
>> performance a lot.
>
> Bcache caches anything that falls within its heuristics as a candidate for
> caching.  It pays no attention to what type of data you're accessing, just
> the access patterns.  This is also the case for dm-cache, and for Windows
> ReadyBoost (or whatever the hell they're calling it these days).  Unless
> you're shifting very big e-mails, it's pretty likely that ones that get
> accessed more than once in a short period of time will end up being cached.
>>
>>
>>> We expect the slowness to be caused by fsyncs, which appear to be
>>> much worse on a raid10 with snapshots. Presumably the whole thing
>>> would be fast enough with SSDs, but that would not be very
>>> cost-efficient.
>>>
>>> All the overhead of the cache layer could be avoided if btrfs would
>>> just prefer to write small, hot files to the SSD in the first place
>>> and clean up while balancing. A combination of 2 SSDs and 4 HDDs
>>> would be very nice (the mobo has 6 x SATA, which is pretty common).
>>
>>
>> Well, I don't want to advertise bcache. But there's nothing you
>> couldn't do with it in your particular case:
>>
>> Just attach two HDDs to one SSD. Bcache doesn't use a 1:1 relation
>> here, you can use 1:n where n is the number of backing devices. There's
>> no need to clean up using balancing because bcache will track hot data
>> by default. You just have to decide which balance between SSD wear and
>> performance you prefer. If slow fsyncs are your primary concern, I'd
>> go with write-back caching. The small file contents are probably not
>> your performance problem anyway, but rather the metadata management
>> btrfs has to do in the background. Bcache will help a lot here,
>> especially in write-back mode. I'd recommend against using balance too
>> often and too intensively (don't use too big usage% filters); it will
>> invalidate your block cache and probably also invalidate bcache if
>> bcache is too small. It will hurt performance more than you gain. You
>> may want to increase nr_requests in the IO scheduler for your
>> situation.
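
For reference, the 1:n attach Kai describes boils down to a few commands.
A sketch with hypothetical device names (sdc1/sdd1 as the HDD partitions,
sde1 as the SSD), not a tested recipe:

  # write bcache superblocks to the backing devices and the cache device
  make-bcache -B /dev/sdc1
  make-bcache -B /dev/sdd1
  make-bcache -C /dev/sde1
  # find the cache set UUID, then attach both backing devices to it
  bcache-super-show /dev/sde1 | grep cset.uuid
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo <cset-uuid> > /sys/block/bcache1/bcache/attach
  # optionally raise nr_requests on the HDD queues, as suggested above
  echo 512 > /sys/block/sdc/queue/nr_requests
  echo 512 > /sys/block/sdd/queue/nr_requests
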
>
> This may not perform as well as you would think, depending on your
> configuration.  If things are in raid1 (or raid10) mode on the BTRFS side,
> then you can end up caching duplicate data (and on some workloads, you're
> almost guaranteed to cache duplicate data), which is a bigger issue when
> you're sharing a cache between devices, because it means they are competing
> for cache space.
>>
>>
>>> Moreover, increasing the SSD's size in the future would then be just
>>> as simple as replacing a disk with a larger one.
>>
>>
>> It's as simple as detaching the HDDs from the caching SSD, replacing
>> it, and reattaching. It can be done online without a reboot. SATA is
>> usually hotpluggable nowadays.
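
In sysfs terms that replacement is roughly (a sketch, hypothetical names):

  # detach flushes any dirty data back to the HDDs first
  echo 1 > /sys/block/bcache0/bcache/detach
  echo 1 > /sys/block/bcache1/bcache/detach
  # format the new, larger SSD as a cache and attach the HDDs again
  make-bcache -C /dev/sdf1
  echo <new-cset-uuid> > /sys/block/bcache0/bcache/attach
  echo <new-cset-uuid> > /sys/block/bcache1/bcache/attach
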
>>
>>> I think many would sign up for such a low maintenance, efficient
>>> setup that doesn't require a PhD in IT to think out and configure.
>>
>>
>> Bcache is actually low maintenance, with no knobs to turn. Converting
>> to bcache protective superblocks is a one-time procedure which can be
>> done online. The bcache devices act as normal HDDs if not attached to
>> a caching SSD. It's really less pain than you may think. And it's a
>> solution available now. Converting back later is easy: just detach the
>> HDDs from the SSDs and use them for some other purpose if you feel
>> like it later. Having the bcache protective superblock still in place
>> doesn't hurt then. Bcache is a no-op without a caching device attached.
>
> No, bcache is _almost_ a no-op without a caching device.  From a userspace
> perspective, it does nothing, but it is still another layer of indirection
> in the kernel, which does have a small impact on performance.  The same is
> true of using LVM with a single volume taking up the entire partition: it
> looks almost no different from just using the partition, but it will perform
> worse than using the partition directly.  I've actually done profiling of
> both to figure out base values for the overhead, and while bcache with no
> cache device is not as bad as the LVM example, it can still be a roughly
> 0.5-2% slowdown (it gets more noticeable the faster your backing storage
> is).
>
> You also lose the ability to mount that filesystem directly on a kernel
> without bcache support (this may or may not be an issue for you).

The bcache (protective) superblock sits in an 8KiB block in front of the
filesystem device. In case the current, non-bcached HDDs use modern
partitioning, you can do a 5-minute remove or add of bcache without
moving/copying filesystem data. So in case you have a bcache-formatted
HDD with just 1 primary partition (512-byte logical sectors), the
partition start is at sector 2048 and the filesystem start is at 2064.
Hard removing bcache (so making sure the module is not
needed/loaded/used the next boot) can be done by changing the
start-sector of the partition from 2048 to 2064. In gdisk one has to
change the alignment to 16 first, otherwise it refuses. And of course,
also flush+stop+de-register bcache for the HDD first.
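
Roughly, the removal sequence looks like this (a sketch only; I used
interactive gdisk, the sgdisk flags here are untested and assume
partition 1 with the 2048/2064 layout above, filesystem unmounted):

  # flush, stop and de-register bcache for that disk
  echo 1 > /sys/block/bcache0/bcache/detach   # flushes dirty data, if any
  echo 1 > /sys/block/bcache0/bcache/stop
  # recreate the partition so it starts where the filesystem starts
  # (re-set the partition type/name afterwards if that matters to you)
  sgdisk --set-alignment=16 --delete=1 --new=1:2064:0 /dev/sdX
  partprobe /dev/sdX
  # the filesystem is now mountable directly from /dev/sdX1, no bcache needed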

The other way around is also possible, i.e. changing the start-sector
from 2048 to 2032. That makes adding bcache to an existing filesystem a
5-minute action instead of a GB- or TB-sized copy action. It is not
online of course, but just one reboot is needed (or just umount, gdisk,
partprobe, add bcache etc.).
For RAID setups, one could just do 1 HDD first.
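
The reverse (adding bcache) is similar; a sketch under the assumption
that make-bcache's default data offset is 16 sectors (8KiB), which is
exactly the gap the 2048-to-2032 move creates, and that it only writes
its superblock without touching the data behind it - double-check on a
scratch disk first:

  # filesystem unmounted; move the partition start back by 16 sectors
  sgdisk --set-alignment=16 --delete=1 --new=1:2032:0 /dev/sdX
  partprobe /dev/sdX
  # write the bcache superblock into the new 8KiB gap in front of the fs
  make-bcache -B /dev/sdX1
  # the filesystem now appears on /dev/bcacheN and a cache SSD can be attached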

There is also a tool doing the conversion in-place (I haven't used it
myself, my python(s) had trouble; I could do the partition table edit
much faster/easier):
https://github.com/g2p/blocks#bcache-conversion

>>> Even at home, I would just throw in a low-cost SSD next to the HDD if
>>> it were as simple as a device add. But I wouldn't want to store my
>>> photo/video collection on just an SSD, too expensive.
>>
>>
>> Bcache won't store your photos if you just copied them: large copy
>> operations (like backups) and sequential access are detected and
>> bypassed by bcache. It won't invalidate your valuable "hot data" in
>> the cache. It works really well.
>>
>> I'd even recommend formatting filesystems with the bcache protective
>> superblock (i.e. formatting the backing device) even if you're not
>> going to use caching or insert an SSD now, just to keep the option
>> open for the future without much hassle.
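
(In practice that is just, with hypothetical device names:

  make-bcache -B /dev/sdb1      # write the protective superblock
  mkfs.btrfs /dev/bcache0       # create the fs on the bcache device (number may differ)
  mount /dev/bcache0 /mnt       # use as usual; no SSD needed yet

and later a "make-bcache -C" on an SSD plus an attach when you do want
caching.)
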
>>
>> I don't think native hot-data tracking will land in btrfs anytime soon
>> (read: in the next 5 years). Bcache is a general-purpose solution for
>> all filesystems that works now (and works properly).
>>
>> You may want to clone your current system and try integrating bcache
>> to see the benefits. There's actually a really big impact on
>> performance from my testing (home machine, 3x 1TB HDD btrfs mraid1
>> draid0, 1x 500GB SSD as cache, hit rate >90%, cache utilization ~70%,
>> boot time improvement ~400%, application startup times almost instant,
>> workload: MariaDB development server, git usage, 3 nspawn containers,
>> VirtualBox Windows 7 + XP VMs, Steam gaming, daily rsync backups, btrfs
>> 60% filled).
>>
>> I'd recommend not using a too-small SSD because it wears out very fast
>> when used as a cache (I think that generally applies and is not bcache
>> specific). My old 120GB SSD was specified for 85TB of writes, and it
>> was worn out after 12 months of bcache usage, which included 2
>> complete backup restores, multiple scrubs (which relocate and rewrite
>> every data block), and weekly balances with relatime enabled. I've
>> since used noatime+nossd, completely stopped using balance and haven't
>> run scrub since, with the result of vastly reduced write accesses to
>> the caching SSD. This setup is able to write bursts of 800MB/s to disk
>> and read up to 800MB/s from disk (if btrfs can properly distribute
>> reads to all disks). Bootchart shows up to 600 MB/s during cold booting
>> (with a warmed SSD cache). My nspawn containers boot in 1-2 seconds and
>> do not add to the normal boot time at all (they are autostarted during
>> boot: 1x MySQL, 1x ElasticSearch, 1x idle/spare/testing container).
>> This is really impressive for a home machine, and c'mon: 3x 1TB HDD +
>> 1x 500GB SSD is not that expensive nowadays. If you still prefer a
>> low-end SSD, I'd recommend using write-around only, from my own
>> experience.
>>
>> The cache usage of the 120GB SSD was 100% with a 70-80% hit rate, which
>> means it was constantly rewriting stuff. The 500GB one (which I use now)
>> is a little underutilized but almost no writes happen after warming up,
>> so it's mostly a hot-data read cache (although I configured it as
>> write-back). Plus, bigger SSDs are usually faster - especially for
>> write ops.
>>
>> Conclusion: Btrfs + bcache make a very good pair. Btrfs is not really
>> optimized for good latency, and that's where bcache comes in. Operating
>> noise from the HDDs also drops a lot as soon as bcache is warmed up.
>>
>> BTW: If deployed, keep an eye on your SSD wear (using smartctl). But
>> given that you are using btrfs, you keep backups anyway. ;-)
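
Checking the wear is a one-liner, though attribute names vary by vendor;
something like:

  smartctl -A /dev/sda | grep -iE 'wear|percent|total.*written'
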
>
> Any decent SSD (read as 'any SSD of a major brand other than OCZ that you
> bought from a reputable source') will still take years to wear out unless
> you're constantly re-writing things and not using discard/trim support (and
> bcache does use discard).  Even if you're not using discard/trim, the
> typical wear-out point is well over 100x the size of the SSD for the good
> consumer devices.  For a point of reference, I've got a pair of 250GB
> Crucial MX100's (they cost less than 0.50 USD per GB when I got them and
> provide essentially the same power-loss protections that the high-end Intel
> SSDs do) which have seen more than 2.5TB of data writes over their
> lifetime, combined from at least three different filesystem formats (BTRFS,
> FAT32, and ext4), swap space, and LVM management, and the wear-leveling
> indicator on each still says they have 100% life remaining.  The similar
> 500GB one I just recently upgraded in my laptop had seen over 50TB of
> writes and was still saying 95% life remaining (and had been for months).