Re: Tiered storage?
On Wed, 15 Nov 2017 08:11:04 +0100, waxhead wrote:

> As for dedupe there is (to my knowledge) nothing fully automatic yet.
> You have to run a program to scan your filesystem but all the
> deduplication is done in the kernel.
> duperemove works apparently quite well when I tested it, but there
> may be some performance implications.

There's bees as a near-line deduplication tool; that is, it watches for
generation changes in the filesystem and walks the inodes. It only looks
at extents, not at files. Deduplication itself is then delegated to the
kernel, which ensures all changes are data-safe.

The process runs as a daemon and processes your changes in realtime
(delayed by a few seconds to minutes, of course, due to the transaction
commit and hashing phase). You need to dedicate part of your RAM to it;
around 1 GB is usually sufficient to work well enough. The RAM will be
locked and cannot be swapped out, so you should have a sufficiently
equipped system.

Works very well here (2 TB of data, 1 GB hash table, 16 GB RAM). Newly
duplicated files are picked up within seconds, scanned (hitting the cache
most of the time, thus not requiring physical IO), and then submitted to
the kernel for deduplication.

I'd call that fully automatic: once set up, it just works, and works
well. Performance impact is very low once the initial scan is done.

https://github.com/Zygo/bees

-- 
Regards,
Kai

Replies to list-only preferred.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
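For reference, a minimal bees setup along the lines of the message above might look like the sketch below. The config path, variable names and the way the daemon is started are assumptions that vary with the bees version and distro packaging, so check the bees README before relying on any of it:

```
# Hypothetical /etc/bees/beesd.conf -- verify path and variable names
# against your bees install.

# UUID of the btrfs filesystem to watch; find it with `btrfs filesystem show`.
UUID=01234567-89ab-cdef-0123-456789abcdef

# Hash table size: 1 GiB, as suggested above for ~2 TB of data.
# The table is mmap'd and effectively pinned in RAM, so size it to fit.
DB_SIZE=$((1024*1024*1024))
```

The daemon is then typically started via the `beesd` wrapper (or a distro-provided service unit) and picks up new extents as transactions commit.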
Re: Tiered storage?
Roy Sigurd Karlsbakk posted on Wed, 15 Nov 2017 15:10:08 +0100 as
excerpted:

>>> As for dedupe there is (to my knowledge) nothing fully automatic yet.
>>> You have to run a program to scan your filesystem but all the
>>> deduplication is done in the kernel.
>>> duperemove works apparently quite well when I tested it, but there may
>>> be some performance implications.
>>
>> Correct, there is nothing automatic (and there are pretty significant
>> arguments against doing automatic deduplication in most cases), but the
>> off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
>> Duperemove in particular does a good job, though it may take a long
>> time for large data sets.
>>
>> As far as performance, it's no worse than large numbers of snapshots.
>> The issues arise from using very large numbers of reflinks.
>
> What is this "large" number of snapshots? Not that it's directly
> comparable, but I've worked with ZFS a while, and haven't seen those
> issues there.

Btrfs has scaling issues with reflinks, not so much in normal operation,
but when it comes to filesystem maintenance such as btrfs check and btrfs
balance.

Numerically, low double digits of reflinks per extent seems to be
reasonably fine; high double digits to low triple digits begins to run
into scaling issues; and at high triple digits to over 1000... better be
prepared to wait awhile (it can be days or weeks!) for that balance or
check to complete, and check requires LOTS more memory as well,
particularly at TB+ scale.

Of course snapshots are the common instance of reflinking, and each
snapshot is another reflink to each extent of the data in the subvolume
it covers, so limiting snapshots to 10-50 of each subvolume is
recommended, and limiting to under 250-ish is STRONGLY recommended. (The
total number of snapshots per filesystem, where there are many subvolumes
and the snapshots per subvolume fall within the above limits, doesn't
seem to be a problem.)
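To make the snapshot arithmetic concrete, a sketch (paths are made up): each read-only snapshot is one more reflink to every extent below it, so fifty snapshots already sit at the top of the comfortable 10-50 range.

```
# Each read-only snapshot adds one more reference to every extent
# in the subvolume it covers.
btrfs subvolume create /mnt/pool/data

# 50 snapshots -> up to ~50 reflinks per extent.
for i in $(seq 1 50); do
    btrfs subvolume snapshot -r /mnt/pool/data "/mnt/pool/snap-$i"
done

# Manual reflink copies count toward the same per-extent total:
cp --reflink=always /mnt/pool/data/big.img /mnt/pool/big-copy.img
```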
Dedupe uses reflinking too, but the effects can be much more variable
depending on the use-case and how many actual reflinks are being created.
A single extent with 1000 deduping reflinks, as might be common in a
commercial/hosting use-case, shouldn't be too bad, perhaps comparable to
a single snapshot. But obviously, do that with a bunch of extents (as a
hosting use-case might) and it quickly builds to the effect of 1000
snapshots of the same subvolume, which as mentioned above puts
maintenance-task time out of the realm of reasonable, for many. Tho of
course in a commercial/hosting case maintenance may well not be done, as
a simple swap-in of a fresh backup is more likely, so it may not matter
for that scenario.

OTOH, a typical individual/personal use-case may dedup many files but
only single-digit times each, so the effect would be the same as a
single-digit number of snapshots at worst.

Meanwhile, while btrfs quotas are finally maturing in terms of actually
tracking the numbers correctly, their effect on scaling is pretty bad
too. The recommendation is to keep btrfs quotas off unless you actually
need them. If you do need quotas, temporarily disable them while doing
balances and device-removes (which do implicit balances), then
quota-rescan after the balance is done, because precisely tracking quotas
thru a balance ends up repeatedly recalculating the numbers again and
again during the balance, and that just doesn't scale.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
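In commands, that quota-around-balance workflow looks roughly like this (the mount point and the balance filter are examples; adjust to your setup):

```
btrfs quota disable /mnt/pool       # skip qgroup accounting during the balance
btrfs balance start -dusage=50 /mnt/pool
btrfs quota enable /mnt/pool
btrfs quota rescan -w /mnt/pool     # recompute the numbers once, afterwards
```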
Re: Tiered storage?
>> As for dedupe there is (to my knowledge) nothing fully automatic yet.
>> You have to run a program to scan your filesystem but all the
>> deduplication is done in the kernel.
>> duperemove works apparently quite well when I tested it, but there may
>> be some performance implications.
>
> Correct, there is nothing automatic (and there are pretty significant
> arguments against doing automatic deduplication in most cases), but the
> off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
> Duperemove in particular does a good job, though it may take a long time
> for large data sets.
>
> As far as performance, it's no worse than large numbers of snapshots.
> The issues arise from using very large numbers of reflinks.

What is this "large" number of snapshots? Not that it's directly
comparable, but I've worked with ZFS a while, and haven't seen those
issues there.

Vennlig hilsen

roy
-- 
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.
Re: Tiered storage?
On 2017-11-15 02:11, waxhead wrote:
> As a regular BTRFS user I can tell you that there is no such thing as
> hot data tracking yet. Some people seem to use bcache together with
> btrfs and come asking for help on the mailing list.

Bcache works fine recently. It was only with older versions that there
were issues. dm-cache similarly works fine on recent versions. In both
cases though, you need to be sure you know what you're doing, otherwise
you are liable to break things.

> Raid5/6 have received a few fixes recently, and it *may* soon be worth
> trying out raid5/6 for data, but keeping metadata in raid1/10 (I would
> rather lose a file or two than the entire filesystem). I had plans to
> run some tests on this a while ago, but forgot about it. As all good
> citizens, remember to have good backups. Last time I tested raid5/6 I
> ran into issues easily.
> For what it's worth - raid1/10 seems pretty rock solid as long as you
> have sufficient disks (hint: you need more than two for raid1 if you
> want to stay safe)

Parity profiles (raid5 and raid6) still have issues, although there are
fewer than there were, with most of the remaining issues surrounding
recovery. I would still recommend against them for production usage.

Simple replication (raid1) is pretty much rock solid as long as you keep
on top of replacing failing hardware and aren't stupid enough to run the
array degraded for any extended period of time (converting to a
single-device volume instead of leaving things with half a volume is
vastly preferred, for multiple reasons).

Striped replication (raid10) is generally fine, but you can get much
better performance by running BTRFS with a raid1 profile on top of two
MD/LVM/hardware RAID0 volumes (BTRFS still doesn't do a very good job of
parallelizing things).

> As for dedupe there is (to my knowledge) nothing fully automatic yet.
> You have to run a program to scan your filesystem but all the
> deduplication is done in the kernel.
> duperemove works apparently quite well when I tested it, but there may
> be some performance implications.

Correct, there is nothing automatic (and there are pretty significant
arguments against doing automatic deduplication in most cases), but the
off-line options (via the EXTENT_SAME ioctl) are reasonably reliable.
Duperemove in particular does a good job, though it may take a long time
for large data sets.

As far as performance, it's no worse than large numbers of snapshots. The
issues arise from using very large numbers of reflinks.

> Roy Sigurd Karlsbakk wrote:
>> Hi all
>>
>> I've been following this project on and off for quite a few years, and
>> I wonder if anyone has looked into tiered storage on it. With tiered
>> storage, I mean hot data lying on fast storage and cold data on slow
>> storage. I'm not talking about caching (where you just keep a copy of
>> the hot data on the fast storage).
>>
>> And btw, how far is raid[56] and block-level dedup from something
>> useful in production?
>>
>> Vennlig hilsen
>>
>> roy
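A typical off-line duperemove pass of the kind discussed above, sketched (the path is an example; check your version's man page, as flags have shifted between releases):

```
# -r recurse, -d actually submit the dedupe requests to the kernel
# (without -d it only reports duplicates),
# --hashfile stores checksums on disk so later runs are incremental
# and the hash data stays out of RAM.
duperemove -rd --hashfile=/var/tmp/pool.hash /mnt/pool
```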
Re: Tiered storage?
On 2017-11-15 04:26, Marat Khalili wrote:
> On 15/11/17 10:11, waxhead wrote:
>> hint: you need more than two for raid1 if you want to stay safe
>
> Huh? Two is not enough? Having three or more makes a difference? (Or,
> you mean hot spare?)

They're probably referring to an issue where a two-device array
configured for raid1, which had lost a device and was mounted degraded
and writable, would generate single-profile chunks on the remaining
device instead of a half-complete raid1 chunk. Combined with the fact
that older kernels only checked the filesystem as a whole for
normal/degraded/irreparable instead of checking individual chunks, and
would therefore refuse to mount the resultant filesystem, this meant that
you only had one chance to fix such an array. If instead you have more
than two devices, regular complete raid1 chunks are generated, and it
becomes a non-issue.

The second issue (degraded status being checked at the volume level
instead of the chunk level) has been fixed in the most recent kernels.
The first issue has not been fixed yet, but I'm pretty sure there are
patches pending.
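For completeness, a recovery sketch for that two-device situation. Device names and the devid are examples, and exact behavior varies by kernel and btrfs-progs version, so treat this as an outline rather than a recipe:

```
# One chance to fix it: mount degraded and writable...
mount -o degraded /dev/sda1 /mnt/pool

# ...then either rebuild raid1 onto a replacement disk
# (2 = devid of the missing device, from `btrfs filesystem show`):
btrfs replace start 2 /dev/sdb1 /mnt/pool

# ...or, with no replacement at hand, convert to single-device
# profiles rather than keep running degraded:
btrfs balance start -dconvert=single -mconvert=dup /mnt/pool
btrfs device delete missing /mnt/pool
```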
Re: Tiered storage?
On 15/11/17 10:11, waxhead wrote:
> hint: you need more than two for raid1 if you want to stay safe

Huh? Two is not enough? Having three or more makes a difference? (Or, you
mean hot spare?)

-- 
With Best Regards,
Marat Khalili
Re: Tiered storage?
As a regular BTRFS user I can tell you that there is no such thing as hot
data tracking yet. Some people seem to use bcache together with btrfs and
come asking for help on the mailing list.

Raid5/6 have received a few fixes recently, and it *may* soon be worth
trying out raid5/6 for data, but keeping metadata in raid1/10 (I would
rather lose a file or two than the entire filesystem). I had plans to run
some tests on this a while ago, but forgot about it. As all good
citizens, remember to have good backups. Last time I tested raid5/6 I ran
into issues easily.

For what it's worth - raid1/10 seems pretty rock solid as long as you
have sufficient disks (hint: you need more than two for raid1 if you want
to stay safe).

As for dedupe there is (to my knowledge) nothing fully automatic yet. You
have to run a program to scan your filesystem but all the deduplication
is done in the kernel. duperemove works apparently quite well when I
tested it, but there may be some performance implications.

Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been following this project on and off for quite a few years, and
> I wonder if anyone has looked into tiered storage on it. With tiered
> storage, I mean hot data lying on fast storage and cold data on slow
> storage. I'm not talking about caching (where you just keep a copy of
> the hot data on the fast storage).
>
> And btw, how far is raid[56] and block-level dedup from something
> useful in production?
>
> Vennlig hilsen
>
> roy
Tiered storage?
Hi all

I've been following this project on and off for quite a few years, and I
wonder if anyone has looked into tiered storage on it. With tiered
storage, I mean hot data lying on fast storage and cold data on slow
storage. I'm not talking about caching (where you just keep a copy of the
hot data on the fast storage).

And btw, how far is raid[56] and block-level dedup from something useful
in production?

Vennlig hilsen

roy
-- 
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.
btrfs RAID1 woes and tiered storage
I've been experimenting lately with the btrfs RAID1 implementation and
have to say that it is performing quite well, but there are a few
problems:

* When I purposefully damage partitions on which btrfs stores data (for
  example, by changing the case of letters), it will read the other copy
  and return correct data. It doesn't report this fact in dmesg every
  time, but it does correct the copy with the wrong checksum.
* When both copies are damaged, it returns the damaged block as it is
  written(!) and only adds a warning in dmesg with the exact same wording
  as for a single-block corruption(!!).
* From what I could find, btrfs doesn't remember anywhere the number of
  detected and fixed corruptions.

I don't know if it's the final design, and while the first and last
points are minor inconveniences, the second one is quite major. At this
time it doesn't prevent silent corruption from going unnoticed. I think
that reading from such blocks should return EIO (unless mounted
nodatasum), or at least a broadcast message noting that a corrupted block
is being returned to userspace.

I've also been thinking about tiered storage (meaning 2+, not only
two-tiered) and have some ideas about it. I think that there need to be 3
different mechanisms working together to achieve high performance:

* ability to store all metadata on selected volumes (probably
  read-optimised SSDs)
* ability to store all newly written data on selected volumes
  (write-optimised SSDs)
* ability to differentiate between often-written, often-read and
  infrequently accessed data (and based on this information, ability to
  move this data to fast SSDs, slow SSDs, fast RAID, slow RAID or MAID)

While the first two are rather straightforward, the third one needs some
explanation. I think that for this to work, we should save not only the
time of last access to a file and the last change time, but also a few
past values (I think that at least 8 to 16 ctimes and atimes are
necessary, but this will need testing).
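For what it's worth, a scrub forces exactly the verify-and-repair path described above over the whole filesystem, and more recent kernels do keep per-device error counters queryable from userspace (the mount point is an example; `btrfs device stats` needs a reasonably recent kernel and btrfs-progs):

```
btrfs scrub start -B /mnt/pool   # -B: stay in foreground, print a summary
btrfs device stats /mnt/pool     # per-device corruption/read/write error counts
```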
I'm not sure about how and exactly when to move this data around to keep
the arrays balanced, but a userspace daemon would be most flexible.

This solution won't work well for file systems with few very large files
of which very few parts change often; in other words, it won't be doing
block-level tiered storage. From what I know, databases would benefit
most from such a configuration, but then most databases can already
partition tables to different files based on access rate. As such, making
its granularity file-level would make this mechanism easy to implement
while still being useful.

On second thought: it won't make it exactly file-level granular. If we
introduce snapshots into the mix, the new version can have the data
regularly accessed while the old snapshot won't; this way the obsolete
blocks can be moved to slow storage.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
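As a rough illustration of the third mechanism, the access-age policy such a daemon might apply could be sketched like this. The thresholds (7/30/180 days) and tier names are purely illustrative assumptions, not anything btrfs defines:

```shell
# Map days-since-last-access to a storage tier.
# Thresholds and tier names are made up for illustration.
classify_tier() {
    age_days=$1
    if   [ "$age_days" -le 7 ];   then echo "fast-ssd"
    elif [ "$age_days" -le 30 ];  then echo "slow-ssd"
    elif [ "$age_days" -le 180 ]; then echo "fast-raid"
    else                               echo "cold-maid"
    fi
}

classify_tier 3     # -> fast-ssd
classify_tier 90    # -> fast-raid
```

A real daemon would feed this from the saved atime/ctime history and then relocate the file to the matching volume; keeping several past timestamps, as proposed above, lets it distinguish a one-off read from genuinely hot data.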