Re: BTRFS as image store for KVM?
On Mon, 05 Oct 2015 11:43:18 +0300 Erkki Seppala wrote:
> Lionel Bouton writes:
>
> > 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> > you need 2 processes to read from 2 devices at once) and I've never seen
> > anyone arguing that the current md code is unstable.
>
> This indeed seems to be the case on my MD RAID1 HDD.

The key difference is that MD is smart enough to distribute reads to the least loaded devices in a RAID (i.e. two processes reading from a two-HDD RAID1 will ~always use different HDDs, no matter what their PIDs are).

Another example of the differences: in my limited testing I saw Btrfs submit writes sequentially on RAID1. According to 'iostat', there were long periods of only one drive writing, and then only the other one. The result was that the RAID's write performance ended up being lower than that of a single drive.

-- 
With respect,
Roman
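To make the md heuristic Roman describes concrete, here is a hedged, much-simplified C sketch of the read_balance() idea from drivers/md/raid1.c: prefer an idle mirror continuing a sequential stream, otherwise take the mirror whose head is closest to the requested sector. The struct fields and numbers are illustrative assumptions for this sketch, not the kernel's actual layout (the real function also handles bad blocks, write-mostly devices, and more).

#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for md's per-mirror state; field names are
 * assumptions for this sketch, not the kernel's definitions. */
struct mirror {
    long long head_position;    /* sector the disk head last serviced */
    int pending;                /* requests currently in flight */
};

/* Prefer an idle mirror continuing a sequential stream; otherwise take
 * the mirror whose head is closest to the requested sector. */
static int read_balance(const struct mirror *m, int nmirrors,
                        long long sector)
{
    int best = 0;
    long long best_dist = llabs(m[0].head_position - sector);

    for (int i = 0; i < nmirrors; i++) {
        if (m[i].pending == 0 && m[i].head_position == sector)
            return i;           /* idle and sequential: use it now */
        long long d = llabs(m[i].head_position - sector);
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}

int main(void)
{
    /* mirror 1 is busy and far away, so the read goes to mirror 0 */
    struct mirror m[2] = { { 4000, 0 }, { 900000, 3 } };
    printf("read of sector 4096 -> mirror %d\n",
           read_balance(m, 2, 4096));
    return 0;
}

Note how the choice depends on per-device state (head position, pending I/O) rather than on anything about the reading process, which is exactly why two concurrent readers end up on different HDDs.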
Re: BTRFS as image store for KVM?
Rich Freeman posted on Sun, 04 Oct 2015 08:21:53 -0400 as excerpted:

> On Sun, Oct 4, 2015 at 8:03 AM, Lionel Bouton wrote:
>>
>> This focus on single reader RAID1 performance surprises me.
>>
>> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
>> you need 2 processes to read from 2 devices at once) and I've never
>> seen anyone arguing that the current md code is unstable.

I'm not a coder and could be wrong, but AFAIK, md/raid1 either works per thread (thus should multiplex I/O across raid1 devices in single-process-multi-thread), or handles multiple AIO requests in parallel, if not both.

(If I'm laboring under a severe misconception, and I could be, please do correct me -- I'd rather be publicly corrected and have just that, my world-view corrected to align with reality, than be wrong, publicly or privately, and never know it, thus never correcting it! =:^)

IOW, the primary case where I believe md/raid1 does single-device serial access is where the single process is doing just that: serialized, single-request, sleep-until-the-data's-ready. Otherwise read requests are spread among the available spindles. =:^) But...

> Perhaps, but with btrfs it wouldn't be hard to get 1000 processes
> reading from a raid1 in btrfs and have every single request directed to
> the same disk with the other disk remaining completely idle. I believe
> the algorithm is just whether the pid is even or odd, and doesn't take
> into account disk activity at all, let alone disk performance or
> anything more sophisticated than that.
>
> I'm sure md does a better job than that.

Exactly. Even/odd PID scheduling is great for testing, since it's simple enough to load either side exclusively or both sides exactly evenly, but it's absolutely horrible for multi-tasking, since a worst-case single-device bottleneck is all too easy to achieve by accident, and even a pure-random distribution is going to favor one side or the other to some extent, most of the time.

Even worse, due to the most-remaining-free-space chunk allocation algorithm and pair-mirroring only, no matter the number of devices: try to use 3+ devices of differing sizes, and until the space available on the largest pair drops to that of the others, that largest pair will get all the allocations.

Consider a bunch of quarter-TiB devices in raid1, with a pair of 2 TiB devices as well. The quarter-TiB devices will remain idle until the pair of 2 TiB devices reaches 1.75 TiB full, thus equalizing the space available on each compared to the other devices on the filesystem. (A sketch of this allocator follows below.)

Of course, that means reads, too, are going to be tied to only those two devices, for anything in that first 1.75 TiB of data, and if all those reads are from even (or all from odd) PIDs, they're only going to hit ONE of... perhaps 10 devices! Possibly hundreds of read threads bottlenecking on a single device of ten, while the other 9/10 of the filesystem-array remains entirely idle! =:^(

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
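Here is a hedged C sketch of the most-remaining-free-space rule Duncan walks through, using his own example numbers; the device names and 1 GiB chunk size are hypothetical, and the real allocator in fs/btrfs/volumes.c is considerably more involved.

#include <stdio.h>
#include <stdlib.h>

struct device { const char *name; unsigned long long free; /* GiB */ };

/* sort devices by unallocated space, largest first */
static int by_free_desc(const void *a, const void *b)
{
    const struct device *da = a, *db = b;
    return (db->free > da->free) - (db->free < da->free);
}

/* each new raid1 chunk goes on the two devices with the most
 * unallocated space */
static void alloc_raid1_chunk(struct device *devs, int n,
                              unsigned long long chunk_size)
{
    qsort(devs, n, sizeof(*devs), by_free_desc);
    devs[0].free -= chunk_size;
    devs[1].free -= chunk_size;
}

int main(void)
{
    /* hypothetical pool mirroring Duncan's example (sizes in GiB) */
    struct device devs[] = {
        { "2T-a", 2048 }, { "2T-b", 2048 },
        { "256G-a", 256 }, { "256G-b", 256 },
    };
    /* allocate 1.75 TiB in 1 GiB chunks: the quarter-TiB devices stay
     * untouched until the 2 TiB pair drops to matching free space */
    for (int i = 0; i < 1792; i++)
        alloc_raid1_chunk(devs, 4, 1);
    for (int i = 0; i < 4; i++)
        printf("%s: %llu GiB free\n", devs[i].name, devs[i].free);
    return 0;
}

Running this prints all four devices at 256 GiB free: every one of the first 1792 chunks landed on the 2 TiB pair, which is the behavior Duncan describes.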
Re: BTRFS as image store for KVM?
Lionel Bouton writes:
> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> you need 2 processes to read from 2 devices at once) and I've never seen
> anyone arguing that the current md code is unstable.

This indeed seems to be the case on my MD RAID1 HDD. But on an MD SSD RAID10 it does use all four devices (using dd on the md raid device and inspecting iostat at the same time). So MD RAID1 not doing it on HDDs seems to be a deliberate choice for devices that don't perform well in random access scenarios, as you mentioned.

In practical terms, I seem to be getting about 1.1 GB/s from the 4 SSDs with 'dd', whereas I get ~650 MB/s when I dd from the two fastest components of the MD device at the same time. As I get 330 MB/s from two of the SSDs and 150 MB/s from the other two, it seems the concurrent RAID10 IO is scaling linearly. (In fact, maybe I should look into why two of the devices are getting lower speeds overall - they used to be fast.)

I didn't calculate how large the linearly transferred chunks would need to be to overcome the seek latency. Probably quite large.

-- 
http://www.modeemi.fi/~flux/
Re: BTRFS as image store for KVM?
On 2015-10-05 07:16, Lionel Bouton wrote:
> Hi,
>
> On 04/10/2015 14:03, Lionel Bouton wrote:
>> [...]
>> This focus on single reader RAID1 performance surprises me.
>>
>> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
>> you need 2 processes to read from 2 devices at once) and I've never seen
>> anyone arguing that the current md code is unstable.
>
> To better illustrate my point.
>
> According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
> most of the time.
>
> http://www.phoronix.com/scan.php?page=article=btrfs_raid_mdadm=1
>
> The only case where md RAID1 was noticeably faster is sequential reads
> with FIO libaio.

Part of this is because BTRFS's built-in raid functionality is designed for COW workloads, whereas mdraid isn't. On top of that, I would be willing to bet that they were using the dup profile for metadata when testing mdraid (which is the default when using a single disk), which isn't a fair comparison because it stores more data in that case than the BTRFS raid. If you do the same thing with ZFS, I'd expect you would see similar results (although probably with a bigger difference between ZFS and mdraid).

A much better comparison (which _will_ sway in mdraid's favor) would be running XFS or ext4 (or even JFS) on top of mdraid, as that is the regular usage of mdraid. Furthermore, there is no sane reason to be running BTRFS on top of a single mdraid device, so this was even less of a reasonable comparison. It is worth noting, however, that using BTRFS raid1 on top of two md RAID0 devices is significantly faster than BTRFS RAID10.

> So if you base your analysis on Phoronix tests: when serving large files
> to a few clients, maybe md could perform better. In all other cases BTRFS
> RAID1 seems to be a better place to start if you want performance.
>
> According to the bad performance -> unstable logic, md would then be the
> less stable RAID1 implementation, which doesn't make sense to me.

No reasonable person should be basing their analysis on results from someone else's testing without replicating said results themselves, especially when those results are based on benchmarks and not real workloads. This goes double for Phoronix, as they are essentially paid to make whatever the newest technology on Linux is look good.

> I'm not even saying that BTRFS performs better than md for most
> real-world scenarios (these are only benchmarks), but arguing that BTRFS
> is not stable because it has performance issues still doesn't make sense
> to me. Even synthetic benchmarks aren't enough to find the best fit for
> real-world scenarios, so you could always find a very restrictive
> situation where any filesystem, RAID implementation, or volume manager
> could look bad, even the most robust ones.
>
> Of course, if BTRFS RAID1 were always slower than md RAID1, the logic
> might make more sense. But clearly there were design decisions and
> performance tuning in BTRFS that led to better or similar performance in
> several scenarios; if the remaining scenarios don't get attention, it may
> be because they represent a niche (at least from the point of view of the
> developers), not a lack of polishing.

Like I said above, a large part of this is probably because BTRFS raid was designed for COW workloads, and mdraid wasn't.
Re: BTRFS as image store for KVM?
On Mon, Oct 5, 2015 at 7:16 AM, Lionel Bouton wrote:
> According to the bad performance -> unstable logic, md would then be the
> less stable RAID1 implementation which doesn't make sense to me.

The argument wasn't that bad performance meant that something was unstable. The argument was that a lack of significant performance optimization meant that the developers considered it unstable and not worth investing time in optimizing.

So, the question isn't whether btrfs is or isn't faster than something else. The question is whether it is or isn't faster than it could be if it were properly optimized. That is, how does btrfs today compare against btrfs from 20 years from now -- which obviously cannot be benchmarked today.

That said, I'm not really convinced that the developers haven't fixed this because they feel it would need to be redone later after major refactoring. I think it is more likely that there are just very few developers working on btrfs, and load-balancing on raid doesn't rank high on their list of interests or possibly expertise. If any are being paid to work on btrfs, then most likely their employers don't care too much about it either.

I did find the Phoronix results interesting, though. The whole driver for "layer-violation" is that with knowledge of the filesystem you can better optimize what you do/don't read and write, and that may be showing here.

-- 
Rich
Re: BTRFS as image store for KVM?
Hi,

On 04/10/2015 14:03, Lionel Bouton wrote:
> [...]
> This focus on single reader RAID1 performance surprises me.
>
> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> you need 2 processes to read from 2 devices at once) and I've never seen
> anyone arguing that the current md code is unstable.

To better illustrate my point.

According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1 most of the time.

http://www.phoronix.com/scan.php?page=article=btrfs_raid_mdadm=1

The only case where md RAID1 was noticeably faster is sequential reads with FIO libaio. So if you base your analysis on Phoronix tests: when serving large files to a few clients, maybe md could perform better. In all other cases BTRFS RAID1 seems to be a better place to start if you want performance.

According to the bad performance -> unstable logic, md would then be the less stable RAID1 implementation, which doesn't make sense to me.

I'm not even saying that BTRFS performs better than md for most real-world scenarios (these are only benchmarks), but arguing that BTRFS is not stable because it has performance issues still doesn't make sense to me. Even synthetic benchmarks aren't enough to find the best fit for real-world scenarios, so you could always find a very restrictive situation where any filesystem, RAID implementation, or volume manager could look bad, even the most robust ones.

Of course, if BTRFS RAID1 were always slower than md RAID1, the logic might make more sense. But clearly there were design decisions and performance tuning in BTRFS that led to better or similar performance in several scenarios; if the remaining scenarios don't get attention, it may be because they represent a niche (at least from the point of view of the developers), not a lack of polishing.

Best regards,

Lionel
Re: BTRFS as image store for KVM?
On Mon, 5 Oct 2015 13:16:03 +0200 Lionel Bouton wrote:

> To better illustrate my point.
>
> According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
> most of the time.
>
> http://www.phoronix.com/scan.php?page=article=btrfs_raid_mdadm=1
>
> The only case where md RAID1 was noticeably faster is sequential reads
> with FIO libaio.
>
> So if you base your analysis on Phoronix tests

[Oops. Actually sent to the list too this time.]

FYI...

1) It's worth noting that while I personally think Phoronix has more merit than it's sometimes given credit for, it does have a rather bad rep in kernel circles, due to how results are (claimed to be) misused in support of points they don't actually support at all, if you know how to read the results taking into account the configuration used.

As such, Phoronix is about the last reference you want to be using if trying to support points with kernel folks, because rightly or wrongly, a lot of them will see that and simply shut down right there, having already decided based on previous experience that there's little use arguing with somebody quoting Phoronix, since they invariably don't know how to read the results based on what was /actually/ tested, given the independent variable and the test configuration.

Tho I personally actually do find quite some use for various Phoronix benchmark articles, reading them with the testing context in mind. But I definitely wouldn't be pulling them out to demonstrate a point to kernel folks, unless it was near enough the end of a list of references making a similar point, after I'd demonstrated my ability to keep the testing context in mind when looking at their results, because I know based on the history, quoting Phoronix in support of something simply isn't likely to get me anywhere with kernel folks.

As for the specific test you referenced...

2) At least the URL you pointed at was benchmarks for Intel SSDs, not spinning rust. The issues and bottlenecks in the case of good SSDs are so entirely different than for spinning rust that it's an entirely different debate.

Among other things, good SATA-based SSDs are, and have been for awhile now, so fast that if the tests really are I/O bound, SATA-bus speed tends to be the bottleneck, not device speed (thru SATA 3.0, 600 MB/s, anyway; SATA 3.2 aka SATA Express and M.2, 1969 MB/s, are fast enough to often put the bottleneck on the device again). However, in many cases, because good SSDs and modern buses are so fast, the bottleneck actually ends up being CPU, once again. So in the case of good, reasonably current SSDs, CPU is likely to be the bottleneck.

Tho these aren't as current as might be expected... The actual devices tested here are SATA 2, 300 MB/s bus speed, and are rather dated given the Dec 2014 article date, as they're only rated 205 MB/s read, 45 MB/s write, so only read is anything near bus speed, while bus speed itself is only SATA 2, 300 MB/s. Given that, it's likely that write speed is device-bound, and while raid isn't likely to slow it down despite the multi-time writing, because it /is/ device-bound, it's unlikely to be faster than single-device writing, either.

But this thread was addressing read-speed. Read-speed is much closer to the bus speed and, depending on the application, particularly for raid, may well be CPU-bound. Where it's CPU-bound, because the device and bus speeds are relatively high, the multiple devices of RAID aren't likely to be of much benefit at all.

Meanwhile, what was the actual configuration on the devices themselves?

Here, we see that in both cases it was actually btrfs -- btrfs with defaults as installed (in single-device mode, if reading between the lines) on top of md/raid of the tested level for the md/raid side, native btrfs raid of the tested level on the native btrfs raid side. But... there's already so much that isn't known -- he says defaults where not stated, but that's still ambiguous in some cases.

For instance, he does specifically state that in native mode, btrfs detects the ssds and activates ssd mode, but that it doesn't do so when installed on the md/raid. So we know for sure that he took the detected-ssd (or not) defaults there. But... we do NOT know... does btrfs native raid level mean for both data and metadata, or only for (presumably) data, leaving metadata at the defaults (which is raid1 for multi-device), or perhaps the reverse, tested-level metadata, defaults for data (which AFAIK is single mode for multi-device)?

And in single-device btrfs on top of md/raid mode, with the md/raid at the tested level, we already know that it didn't detect ssd and enable ssd mode, and he didn't enable it manually, but what we /don't/ know for sure is how it was installed at mkfs.btrfs time and whether that might have detected ssd. If mkfs.btrfs detected ssd, it would have created single-mode metadata by default, otherwise, dup mode metadata, unless specifically told
Re: BTRFS as image store for KVM?
On 2015-10-05 10:04, Duncan wrote:
> On Mon, 5 Oct 2015 13:16:03 +0200 Lionel Bouton wrote:
>> To better illustrate my point.
>>
>> According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
>> most of the time.
>>
>> http://www.phoronix.com/scan.php?page=article=btrfs_raid_mdadm=1
>>
>> The only case where md RAID1 was noticeably faster is sequential reads
>> with FIO libaio.
>>
>> So if you base your analysis on Phoronix tests
> [...snip...]
>
> Hmm... I think I've begun to see the kernel folks point about people
> quoting Phoronix in support of their points, when it's really not
> apropos at all.
>
> Yes, I do still consider Phoronix reports in context to contain useful
> information, at some level. However, one really must be aware of what
> was actually tested in order to understand what the results actually
> mean, and unfortunately, it seems most people quoting it, including
> here, really can't properly do so in context, and thus end up using it
> in support of points that simply are not supported by the given
> evidence in the Phoronix articles people are attempting to use.

Even aside from the obvious past issues with Phoronix reports, people forget that they are a news organization (regardless of what they claim, they _are_ a news organization), and as such their employees are not paid to verify existing results; they're paid to make impactful articles that grab people's attention (I'd be willing to bet that this story started in response to the people who pointed out correctly that XFS or ext4 on top of mdraid beats the pants off of BTRFS performance-wise, and (incorrectly) assumed that this meant that mdraid was better than BTRFS raid). This, combined with almost no evidence in many cases of actual statistical analysis, really hurts their credibility (at least, for me it does).

The other issue is that so many people tout benchmarks as the pinnacle of testing, when they really aren't. Benchmarks are by definition synthetic workloads, and as such only the _really_ good ones (of which there aren't many) give you more than a very basic idea of what performance differences you can expect with a given workload. On top of that, people just accept results without trying to reproduce them themselves (kernel folks tend to be much better about this than many other people, though).

A truly sane person, looking to determine the best configuration for a given workload, will:

1. Look at a wide variety of sources to determine what configurations he should even be testing. (The author of the linked article obviously didn't do this, or just didn't care; the defaults on btrfs are unsuitable for a significant number of cases, including usage on top of mdraid.)
2. Using this information, run established benchmarks similar to his use-case to further narrow down the test candidates.
3. Write his own benchmark to simulate to the greatest degree possible the actual workload he expects to run, and then use that for testing the final candidates.
4. Gather some reasonable number of samples with the above-mentioned benchmark, and use _real_ statistical analysis to determine what he should be using.

To put this in further perspective, most people just do step one, assume that other people know what they're talking about, and don't do any further testing; and there are other people who just do step two and then claim their results are infallible.
Re: BTRFS as image store for KVM?
On Sun, Oct 4, 2015 at 8:03 AM, Lionel Bouton wrote:
>
> This focus on single reader RAID1 performance surprises me.
>
> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> you need 2 processes to read from 2 devices at once) and I've never seen
> anyone arguing that the current md code is unstable.

Perhaps, but with btrfs it wouldn't be hard to get 1000 processes reading from a raid1 in btrfs and have every single request directed to the same disk with the other disk remaining completely idle. I believe the algorithm is just whether the pid is even or odd, and doesn't take into account disk activity at all, let alone disk performance or anything more sophisticated than that.

I'm sure md does a better job than that.

-- 
Rich
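A hedged sketch of the heuristic Rich describes: reports from this era indicate the kernel picked the read mirror with roughly current->pid modulo the number of mirrors (in fs/btrfs/volumes.c); getpid() stands in for current->pid here, and the demo PIDs are made up.

#include <stdio.h>
#include <sys/types.h>

/* Roughly what the thread says the kernel does with current->pid;
 * the choice ignores disk activity and performance entirely. */
static int pick_read_mirror(pid_t pid, int num_mirrors)
{
    return pid % num_mirrors;
}

int main(void)
{
    /* hypothetical PIDs: every even-PID reader lands on mirror 0 and
     * every odd-PID reader on mirror 1, so 1000 even-PID processes
     * would hammer one disk while the other sat idle */
    for (pid_t pid = 4200; pid < 4210; pid++)
        printf("pid %d -> mirror %d\n", (int)pid,
               pick_read_mirror(pid, 2));
    return 0;
}

Contrast this with the md read_balance() sketch earlier in the thread, which keys off per-device state instead of the caller's PID.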
Re: BTRFS as image store for KVM?
Hi,

On 04/10/2015 04:09, Duncan wrote:
> Russell Coker posted on Sat, 03 Oct 2015 18:32:17 +1000 as excerpted:
>
>> Last time I checked a BTRFS RAID-1 filesystem would assign each process
>> to read from one disk based on its PID. Every RAID-1 implementation
>> that has any sort of performance optimisation will allow a single
>> process that's reading to use both disks to some extent.
>>
>> When the BTRFS developers spend some serious effort optimising for
>> performance it will be useful to compare BTRFS and ZFS.
>
> This is the example I use as to why btrfs isn't really stable, as well.
> Devs tend to be very aware of the dangers of premature optimization,
> because done too early, it either means throwing that work away when a
> rewrite comes, or it severely limits options as to what can be
> rewritten, if necessary, in order to avoid throwing away all that work
> that went into optimization.
>
> So at least for devs that have been around awhile, that don't have some
> boss that's paying the bills saying optimize now, an actually really
> good mark of when the /devs/ consider something stable is when they
> start focusing on that optimization.

This focus on single-reader RAID1 performance surprises me.

1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked you need 2 processes to read from 2 devices at once) and I've never seen anyone arguing that the current md code is unstable.

2/ I'm not familiar with implementations taking advantage of several disks for single-process reads, but clearly they'll have more problems with seeks on rotating devices to solve (see the illustration below). So are there really implementations with better performance across the spectrum, or do they have to pay a performance penalty in the multiple-readers case to optimize the (arguably less frequent/important) single-reader case?

Best regards,

Lionel
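To illustrate the trade-off Lionel raises in 2/, here is a hedged sketch (not any particular implementation) of splitting one large sequential read into alternating chunks across two mirrors: on SSDs this can nearly double streaming bandwidth, while on rotating disks each mirror must skip over the chunks it doesn't serve, paying seek latency. The 256 KiB split size is an arbitrary assumption.

#include <stdio.h>

#define SPLIT (256LL * 1024)    /* arbitrary per-mirror chunk size */

/* issue one logical read as alternating per-mirror chunks */
static void split_read(long long offset, long long len)
{
    for (long long pos = offset; pos < offset + len; pos += SPLIT) {
        int mirror = (int)((pos / SPLIT) % 2);  /* alternate mirrors */
        long long n = offset + len - pos;
        if (n > SPLIT)
            n = SPLIT;
        printf("mirror %d: %lld bytes at offset %lld\n",
               mirror, n, pos);
    }
}

int main(void)
{
    /* a single 1 MiB read becomes two 256 KiB chunks per mirror; each
     * rotating disk must then seek over the 256 KiB gaps it skips */
    split_read(0, 1024 * 1024);
    return 0;
}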
Re: BTRFS as image store for KVM?
Russell Coker posted on Sat, 03 Oct 2015 18:32:17 +1000 as excerpted:

> Last time I checked a BTRFS RAID-1 filesystem would assign each process
> to read from one disk based on its PID. Every RAID-1 implementation
> that has any sort of performance optimisation will allow a single
> process that's reading to use both disks to some extent.
>
> When the BTRFS developers spend some serious effort optimising for
> performance it will be useful to compare BTRFS and ZFS.

This is the example I use as to why btrfs isn't really stable, as well.

Devs tend to be very aware of the dangers of premature optimization, because done too early, it either means throwing that work away when a rewrite comes, or it severely limits options as to what can be rewritten, if necessary, in order to avoid throwing away all that work that went into optimization.

So at least for devs that have been around awhile, that don't have some boss that's paying the bills saying optimize now, an actually really good mark of when the /devs/ consider something stable is when they start focusing on that optimization.

Since this rather obvious low-hanging-fruit bit of optimization hasn't yet been done, there's really no question: btrfs doesn't pass the optimized-stability test yet, and thus is self-evidently not stable, in the opinion of the very devs working on it. Were they to really consider it stable, this optimization would already be done.

So once we see this optimization done, /then/ we can debate whether btrfs is stable yet, or not. Until then, settled question, it's obviously not. It may indeed be some distance into the process of stabilization, "stabiliz_ing_", and I'd characterize it as exactly that, but not yet "stable".

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: BTRFS as image store for KVM?
On Fri, 2 Oct 2015 10:07:24 PM Austin S Hemmelgarn wrote:
>> ARC presumably worked better than the other Solaris caching options. It
>> was ported to Linux with zfsonlinux because that was the easy way of
>> doing it.
>
> Actually, I think part of that was also the fact that ZFS is a COW
> filesystem, and classical LRU caching (like the regular Linux pagecache)
> often does horribly with COW workloads (and I'm relatively convinced
> that this is a significant part of why BTRFS has such horrible
> performance compared to ZFS).

Last time I checked a BTRFS RAID-1 filesystem would assign each process to read from one disk based on its PID. Every RAID-1 implementation that has any sort of performance optimisation will allow a single process that's reading to use both disks to some extent.

When the BTRFS developers spend some serious effort optimising for performance it will be useful to compare BTRFS and ZFS.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
Re: BTRFS as image store for KVM?
On 2015-10-02 00:21, Russell Coker wrote:
> On Sat, 26 Sep 2015 12:20:41 AM Austin S Hemmelgarn wrote:
>>> FYI:
>>> Linux pagecache use LRU cache algo, and in general case it's working
>>> good enough
>>
>> I'd argue that 'general usage' should be better defined in this
>> statement. Obviously, ZFS's ARC implementation provides better
>> performance in a significant number of common use cases for Linux,
>> otherwise people wouldn't be using it to the degree they are.
>
> No-one gets a free choice about this. I have a number of servers running
> ZFS because I needed the data consistency features and BTRFS wasn't
> ready. There is no choice of LRU vs ARC once you've made the BTRFS vs
> ZFS decision.

I'm not saying there is a free choice in this, although that is largely because the pagecache wasn't written in a way on Linux that allows for easy development of alternative caching algorithms for it. When I said 'using it', I meant using ZFS, not just ARC. I would love to be able some day to use ARC, or even just SLRU (ARC without the adaptive internal sizing bits), on Linux, as both provide better performance for COW workloads than plain LRU (although, somewhat paradoxically, for some COW workloads an MRU algorithm is even better).

> ARC presumably worked better than the other Solaris caching options. It
> was ported to Linux with zfsonlinux because that was the easy way of
> doing it.

Actually, I think part of that was also the fact that ZFS is a COW filesystem, and classical LRU caching (like the regular Linux pagecache) often does horribly with COW workloads (and I'm relatively convinced that this is a significant part of why BTRFS has such horrible performance compared to ZFS).

> Some people here have reported that ARC worked well for them on Linux.
> My experience was that the zfsonlinux kernel modules wouldn't respect
> the module load options to reduce the size of the ARC and the default
> size would cause smaller servers to have kernel panics due to lack of
> RAM. My solution to that problem was to get more RAM for all ZFS servers
> as buying RAM is cheaper for my clients than paying me to diagnose the
> problems with ZFS.

The whole ARC sizing issue with zfsonlinux is largely orthogonal to whether or not ARC is better for a given workload, and I think that there is actually some lower limit they force based on the amount of RAM at boot.
Re: BTRFS as image store for KVM?
On Sat, 26 Sep 2015 12:20:41 AM Austin S Hemmelgarn wrote:
>> FYI:
>> Linux pagecache use LRU cache algo, and in general case it's working
>> good enough
>
> I'd argue that 'general usage' should be better defined in this
> statement. Obviously, ZFS's ARC implementation provides better
> performance in a significant number of common use cases for Linux,
> otherwise people wouldn't be using it to the degree they are.

No-one gets a free choice about this. I have a number of servers running ZFS because I needed the data consistency features and BTRFS wasn't ready. There is no choice of LRU vs ARC once you've made the BTRFS vs ZFS decision.

ARC presumably worked better than the other Solaris caching options. It was ported to Linux with zfsonlinux because that was the easy way of doing it.

Some people here have reported that ARC worked well for them on Linux. My experience was that the zfsonlinux kernel modules wouldn't respect the module load options to reduce the size of the ARC, and the default size would cause smaller servers to have kernel panics due to lack of RAM. My solution to that problem was to get more RAM for all ZFS servers, as buying RAM is cheaper for my clients than paying me to diagnose the problems with ZFS.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
Re: BTRFS as image store for KVM?
Hi,

thank you all for your helpful comments. From what I've read, I forged the following guidelines (for myself; ymmv):

- Use btrfs for generic data storage on spinning disks and for everything on ssds.
- Use zfs for spinning disks that may be used for cow-unfriendly workloads, like vm images (if they are too big and/or too fast-changing for a scheduled defrag to make sense).

For now I'm going with the following setup: a Debian system with root on btrfs/raid1 on two ssds, and a raidz1 pool for storage and vm images. However, those few vms that really should be fast would also fit on the SSDs, so I might move them there and switch from ZFS to btrfs on the storage pool at some point in the future.

Some of the ideas presented here sound really interesting - for example, I think that improving the Linux page cache to be more "arc-like" will probably benefit not only btrfs. Having both the page cache and the arc in parallel when using ZoL does not feel like an elegant solution, so maybe there's hope for that. (But I don't know if it is feasible for ZoL to abandon the arc in favor of an improved Linux page cache; I imagine it might be much work for little benefit.)

Thanks again,

Gert
Re: BTRFS as image store for KVM?
On Sat, Sep 19, 2015 at 9:26 PM, Jim Salter wrote:
>
> ZFS, by contrast, works like absolute gangbusters for KVM image storage.

I'd be interested in what allows ZFS to handle KVM image storage well, and whether this could be implemented in btrfs. I'd think that the fragmentation issues would potentially apply to any COW filesystem, and if ZFS has a solution for this then it would probably benefit btrfs to implement the same solution, and not just for VM images.

-- 
Rich
Re: BTRFS as image store for KVM?
I suspect that the answer most likely boils down to "the ARC". ZFS uses an Adaptive Replacement Cache instead of a standard FIFO, which keeps blocks in cache longer if they have been accessed while in cache. This means much higher cache hit rates, which also means minimizing the effects of fragmentation.

That's an off-the-top-of-my-head guess, though. All I can tell you for certain is that I've done both - KVM stores on btrfs and on ZFS (and on LVM and on mdraid and...) - and it works extremely, extremely well on ZFS for long periods of time, where with btrfs it works very well at first but then degrades rapidly.

FWIW, I've been using KVM + ZFS in wide production (>50 hosts) for 5+ years now.

On 09/25/2015 08:48 AM, Rich Freeman wrote:
> On Sat, Sep 19, 2015 at 9:26 PM, Jim Salter wrote:
>> ZFS, by contrast, works like absolute gangbusters for KVM image storage.
>
> I'd be interested in what allows ZFS to handle KVM image storage well,
> and whether this could be implemented in btrfs. I'd think that the
> fragmentation issues would potentially apply to any COW filesystem, and
> if ZFS has a solution for this then it would probably benefit btrfs to
> implement the same solution, and not just for VM images.
Re: BTRFS as image store for KVM?
On 2015-09-25 08:48, Rich Freeman wrote:
> On Sat, Sep 19, 2015 at 9:26 PM, Jim Salter wrote:
>> ZFS, by contrast, works like absolute gangbusters for KVM image storage.
>
> I'd be interested in what allows ZFS to handle KVM image storage well,
> and whether this could be implemented in btrfs. I'd think that the
> fragmentation issues would potentially apply to any COW filesystem, and
> if ZFS has a solution for this then it would probably benefit btrfs to
> implement the same solution, and not just for VM images.

That may be tough to do, however; the internal design of ZFS is _very_ different from that of BTRFS (and, for that matter, every other filesystem on Linux). Part of it may just be better data locality (if all of the fragments of a file are close to each other, then the fragmentation of the file is not as much of a performance hit), and part of it is probably how they do caching (and I personally _do not_ want BTRFS to try to do caching the way ZFS does; we have a unified pagecache in the VFS for a reason, and we should be improving that, not trying to come up with multiple independent solutions).

Even aside from that, however, just saying that ZFS works great for some particular use case isn't giving enough info; it has so many optional features and configuration knobs that you really need to give specifics on how you have ZFS set up in that case.
Re: BTRFS as image store for KVM?
Pretty much bog-standard, as ZFS goes. Nothing different than what's recommended for any generic ZFS use:

* Set blocksize to match hardware blocksize - 4K drives get 4K blocksize, 8K drives get 8K blocksize (Samsung SSDs).
* LZO compression is a win. But it's not like anything sucks without it. No real impact on performance for most use, + or -. Just saves space.
* >4GB allocated to the ARC. General rule of thumb: half the RAM belongs to the host (which is mostly ARC), half belongs to the guests.

I strongly prefer pool-of-mirrors topology, but nothing crazy happens if you use striped-with-parity instead. I used to use RAIDZ1 (the rough equivalent of RAID5) quite frequently, and there wasn't anything amazingly sucky about it; it performed at least as well as you'd expect ext4 on mdraid5 to perform.

ZFS might or might not do a better job of managing fragmentation; I really don't know. I strongly suspect the design difference between the kernel's simple FIFO page cache and ZFS' weighted cache makes a really, really big difference.

On 09/25/2015 09:04 AM, Austin S Hemmelgarn wrote:
> you really need to give specifics on how you have ZFS set up in that
> case.
Re: BTRFS as image store for KVM?
On 2015-09-25 09:12, Jim Salter wrote:
> Pretty much bog-standard, as ZFS goes. Nothing different than what's
> recommended for any generic ZFS use.
>
> * Set blocksize to match hardware blocksize - 4K drives get 4K
> blocksize, 8K drives get 8K blocksize (Samsung SSDs).
> * LZO compression is a win. But it's not like anything sucks without
> it. No real impact on performance for most use, + or -. Just saves space.
> * >4GB allocated to the ARC. General rule of thumb: half the RAM
> belongs to the host (which is mostly ARC), half belongs to the guests.
>
> I strongly prefer pool-of-mirrors topology, but nothing crazy happens
> if you use striped-with-parity instead. I used to use RAIDZ1 (the rough
> equivalent of RAID5) quite frequently, and there wasn't anything
> amazingly sucky about it; it performed at least as well as you'd expect
> ext4 on mdraid5 to perform.
>
> ZFS might or might not do a better job of managing fragmentation; I
> really don't know. I /strongly/ suspect the design difference between
> the kernel's simple FIFO page cache and ZFS' weighted cache makes a
> really, really big difference.

I've been coming to that same conclusion myself over the years. I would really love to see a drop-in replacement for Linux's pagecache with better performance (I don't remember for sure, but I seem to remember that the native pagecache isn't straight FIFO), but the likelihood of that actually getting into mainline is slim to none (can you imagine, though, how fast XFS or ext* would be with a good caching algorithm?).
Re: BTRFS as image store for KVM?
2015-09-25 16:52 GMT+03:00 Jim Salter:
> Pretty much bog-standard, as ZFS goes. Nothing different than what's
> recommended for any generic ZFS use.
>
> * Set blocksize to match hardware blocksize - 4K drives get 4K blocksize,
> 8K drives get 8K blocksize (Samsung SSDs).
> * LZO compression is a win. But it's not like anything sucks without it.
> No real impact on performance for most use, + or -. Just saves space.
> * >4GB allocated to the ARC. General rule of thumb: half the RAM belongs
> to the host (which is mostly ARC), half belongs to the guests.
>
> I strongly prefer pool-of-mirrors topology, but nothing crazy happens if
> you use striped-with-parity instead. I used to use RAIDZ1 (the rough
> equivalent of RAID5) quite frequently, and there wasn't anything amazingly
> sucky about it; it performed at least as well as you'd expect ext4 on
> mdraid5 to perform.
>
> ZFS might or might not do a better job of managing fragmentation; I really
> don't know. I strongly suspect the design difference between the kernel's
> simple FIFO page cache and ZFS' weighted cache makes a really, really big
> difference.

FYI: Linux pagecache use LRU cache algo, and in general case it's working good enough.

-- 
Have a nice day,
Timofey.
Re: BTRFS as image store for KVM?
On 2015-09-25 10:02, Timofey Titovets wrote:
> 2015-09-25 16:52 GMT+03:00 Jim Salter:
>> Pretty much bog-standard, as ZFS goes. Nothing different than what's
>> recommended for any generic ZFS use.
>> [...]
>
> FYI: Linux pagecache use LRU cache algo, and in general case it's
> working good enough

I'd argue that 'general usage' should be better defined in this statement. Obviously, ZFS's ARC implementation provides better performance in a significant number of common use cases for Linux, otherwise people wouldn't be using it to the degree they are. LRU often gives abysmal performance for VM images in my experience, and virtualization is becoming a very common use case for Linux. On top of that, there are lots of applications that bypass the cache almost completely, and while that is a valid option in some cases, it shouldn't be needed most of the time.

If it's just plain LRU, I may take the time at some point to try and write some patches to test whether SLRU works any better (as SLRU is essentially ARC without the auto-tuning; a toy sketch follows below), although I have nowhere near the resources to test something like that to the degree that would be required to get it even considered for inclusion in mainline.
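Since Austin floats SLRU as a testable alternative, here is a hedged toy sketch of the segmented-LRU idea: pages enter a probationary segment on first touch and are promoted to a protected segment on a second touch, so a one-pass scan (common in COW workloads) cannot evict the frequently reused pages. Segment sizes are tiny fixed constants for readability, and one simplification is flagged in the comments; a real pagecache patch would look nothing like this.

#include <stdio.h>
#include <string.h>

#define SEG_SIZE 4

struct segment { int keys[SEG_SIZE]; int n; };

static int seg_find(const struct segment *s, int key)
{
    for (int i = 0; i < s->n; i++)
        if (s->keys[i] == key)
            return i;
    return -1;
}

static void seg_remove(struct segment *s, int idx)
{
    memmove(&s->keys[idx], &s->keys[idx + 1],
            (size_t)(s->n - idx - 1) * sizeof(int));
    s->n--;
}

/* insert at the MRU end (index 0); if full, drop the LRU entry.
 * Simplification: a real SLRU would demote a full protected segment's
 * LRU entry back to probation instead of dropping it. */
static void seg_push(struct segment *s, int key)
{
    if (s->n == SEG_SIZE)
        s->n--;
    memmove(&s->keys[1], &s->keys[0], (size_t)s->n * sizeof(int));
    s->keys[0] = key;
    s->n++;
}

static struct segment probation, protected_seg;

static void touch(int key)
{
    int i;
    if ((i = seg_find(&protected_seg, key)) >= 0) {
        seg_remove(&protected_seg, i);   /* refresh within protected */
        seg_push(&protected_seg, key);
    } else if ((i = seg_find(&probation, key)) >= 0) {
        seg_remove(&probation, i);       /* second touch: promote */
        seg_push(&protected_seg, key);
    } else {
        seg_push(&probation, key);       /* first touch: probation */
    }
}

int main(void)
{
    /* pages 1 and 2 are reused, so they get promoted */
    int hot[] = { 1, 2, 1, 2 };
    for (unsigned i = 0; i < sizeof(hot) / sizeof(*hot); i++)
        touch(hot[i]);
    /* a one-pass scan churns probation but can't touch protected */
    for (int scan = 100; scan < 110; scan++)
        touch(scan);
    printf("protected after scan:");
    for (int i = 0; i < protected_seg.n; i++)
        printf(" %d", protected_seg.keys[i]);
    printf("\n");   /* prints: protected after scan: 2 1 */
    return 0;
}

The scan resistance shown here is the property plain LRU lacks, and it is the part of ARC that survives even without ARC's adaptive sizing of the two segments.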
Re: BTRFS as image store for KVM?
On Sat, 19 Sep 2015 12:13:29 AM Austin S Hemmelgarn wrote:
> The other option (which for some reason I almost never see anyone
> suggest) is to expose 2 disks to the guest (ideally stored on different
> filesystems), and do BTRFS raid1 on top of that. In general, this is
> what I do (except I use LVM for the storage back-end instead of a
> filesystem) when I have data integrity requirements in the guest. On
> the other hand of course, most of my VMs are trivial for me to
> recreate, so I don't often need this and just use DM-RAID via LVM.

I used to do that. But it was very fiddly, and snapshotting the virtual machine images required making a snapshot of half a RAID-1 array via LVM (or snapshotting both when the virtual machine wasn't running). Now I just have a single big BTRFS RAID-1 filesystem and use regular files for the virtual machine images, with the Ext3 filesystem inside them.

On Sun, 20 Sep 2015 11:26:26 AM Jim Salter wrote:
> Performance will be fantastic... except when it's completely abysmal.
> When I tried it, I also ended up with a completely borked (btrfs-raid1)
> filesystem that would only mount read-only and read at hideously reduced
> speeds after about a year of usage in a small office environment. Did
> not make me happy.

I've found performance to be acceptable, not great (you can't expect great performance from such things) but good enough for lightly loaded servers and test systems. I even ran a training session on BTRFS and ZFS filesystems with the images stored on a BTRFS RAID-1 (of 15,000rpm SAS disks). When more than 3 students ran a scrub at the same time, performance dropped, but it was mostly usable and there were no complaints.

Admittedly that server hit a BTRFS bug and needed "reboot -nf" half way through, but I don't think that was a BTRFS virtual-machine issue; rather, it was a more general BTRFS-under-load issue.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
Re: BTRFS as image store for KVM?
On Fri, 18 Sep 2015 12:00:15 PM Duncan wrote:
> The caveat here is that if the VM/DB is active during the backups (btrfs
> send/receive or other), it'll still COW1 any writes during the existence
> of the btrfs snapshot. If the backup can be scheduled during VM/DB
> downtime or at least when activity is very low, the relatively short COW1
> time should avoid serious fragmentation, but if not, even only relatively
> temporary snapshots are likely to trigger noticeable cow1 fragmentation
> issues eventually.

One relevant issue for this is whether the working set of the database fits into RAM. RAM has been getting bigger and cheaper while the databases I run haven't been getting bigger. Now every database I run has a working set that fits into RAM, so read performance (and therefore fragmentation) doesn't matter for me except when rebooting - and database servers don't get rebooted that often.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
Re: BTRFS as image store for KVM?
I can't recommend btrfs+KVM, and I speak from experience.

Performance will be fantastic... except when it's completely abysmal. When I tried it, I also ended up with a completely borked (btrfs-raid1) filesystem that would only mount read-only and read at hideously reduced speeds after about a year of usage in a small office environment. Did *not* make me happy.

ZFS, by contrast, works like absolute gangbusters for KVM image storage. Just create a dataset, drop a .qcow2 file in it, and off to the races. I don't recommend messing about with zvols; it's a PITA and isn't necessary.

HTH.

On 09/15/2015 05:34 PM, Gert Menke wrote:
> Hi everybody,
>
> first off, I'm not 100% sure if this is the right place to ask, so if
> it's not, I apologize and I'd appreciate a pointer in the right
> direction.
>
> I want to build a virtualization server to replace my current home
> server. I'm thinking about a Debian system with libvirt/KVM. The system
> will have one or two SSDs and five harddisks with some kind of software
> RAID5 for storage. I'd like to have a filesystem with data checksums, so
> BTRFS seems like the right way to go. However, I read that BTRFS does
> not perform well as storage for KVM disk images. (See here:
> http://www.linux-kvm.org/page/Tuning_KVM )
>
> Is this still true? I would appreciate any comments and/or tips you
> might have on this topic. Is anyone using BTRFS as an image store? Are
> there any special settings I should be aware of to make it work well?
>
> Thanks,
> Gert
Re: BTRFS as image store for KVM?
On 2015-09-18 04:22, Duncan wrote:
> one way or another, you're going to have to write two things, one a
> checksum of the other, and if they are in-place-overwrites, while the
> race can be narrowed, there's always going to be a point at which either
> one or the other will have been written, while the other hasn't been,
> and if failure occurs at that point...

...then you can still recover the old data from the mirror or parity, and at least you don't have any inconsistent data. It's as if the failure occurred just a tiny bit earlier.

> The only real way around that is /some/ form of copy-on-write, such that
> both the change and its checksum can be written to a different location
> than the old version, with a single, atomic write then updating a
> pointer to point to the new version of both the data and its checksum,
> instead of the old one.

Or an intent log, but I guess that introduces a lot of additional writes (and seeks) that would impact performance noticeably...

Gert
Re: BTRFS as image store for KVM?
On 2015-09-17 14:35, Chris Murphy wrote:
> On Thu, Sep 17, 2015 at 11:56 AM, Gert Menke wrote:
>> Hi,
>>
>> thank you for your answers!
>>
>> So it seems there are several suboptimal alternatives here...
>>
>> MD+LVM is very close to what I want, but md has no way to cope with
>> silent data corruption. So if I'd want to use a guest filesystem that
>> has no checksums either, I'm out of luck.
>
> You can use Btrfs in the guest to get at least notification of SDC. If
> you want recovery also then that's a bit more challenging.
>
> The way this has been done up until ZFS and Btrfs is T10 DIF (PI). There
> are already checksums on the drive, but this adds more checksums that
> can be confirmed through the entire storage stack, not just internal to
> the drive hardware.
>
> Another way is to put a conventional fs image on e.g. GlusterFS with
> checksumming enabled (and at least distributed+replicated filtering).
>
> If you do this directly on Btrfs, maybe you can mitigate some of the
> fragmentation issues with bcache or dmcache; and for persistent
> snapshotting, use qcow2 to do it instead of Btrfs. You'd use Btrfs
> snapshots to create a subvolume for doing backups of the images, and
> then get rid of the Btrfs snapshot.

The other option (which for some reason I almost never see anyone suggest) is to expose 2 disks to the guest (ideally stored on different filesystems) and do BTRFS raid1 on top of that. In general, this is what I do (except I use LVM for the storage back-end instead of a filesystem) when I have data integrity requirements in the guest. On the other hand, of course, most of my VMs are trivial for me to recreate, so I don't often need this and just use DM-RAID via LVM.
Re: BTRFS as image store for KVM?
On Thu, Sep 17, 2015 at 11:56 AM, Gert Menke wrote:
> Hi,
>
> thank you for your answers!
>
> So it seems there are several suboptimal alternatives here...
>
> MD+LVM is very close to what I want, but md has no way to cope with
> silent data corruption. So if I'd want to use a guest filesystem that
> has no checksums either, I'm out of luck.

You can use Btrfs in the guest to get at least notification of SDC. If you want recovery also then that's a bit more challenging.

The way this has been done up until ZFS and Btrfs is T10 DIF (PI). There are already checksums on the drive, but this adds more checksums that can be confirmed through the entire storage stack, not just internal to the drive hardware.

Another way is to put a conventional fs image on e.g. GlusterFS with checksumming enabled (and at least distributed+replicated filtering).

If you do this directly on Btrfs, maybe you can mitigate some of the fragmentation issues with bcache or dmcache; and for persistent snapshotting, use qcow2 to do it instead of Btrfs. You'd use Btrfs snapshots to create a subvolume for doing backups of the images, and then get rid of the Btrfs snapshot.

-- 
Chris Murphy
Re: BTRFS as image store for KVM?
Hi,

thank you for your answers!

So it seems there are several suboptimal alternatives here...

MD+LVM is very close to what I want, but md has no way to cope with silent data corruption. So if I'd want to use a guest filesystem that has no checksums either, I'm out of luck. I'm honestly a bit confused here - isn't checksumming one of the most obvious things to want in a software RAID setup? Is it a feature that might appear in the future? Maybe I should talk to the md guys...

BTRFS looks really nice feature-wise, but is not (yet) optimized for my use-case, I guess. Disabling COW would certainly help, but I don't want to lose the data checksums. Is nodatacowbutkeepdatachecksums a feature that might turn up in the future?

Maybe ZFS is the best choice for my scenario. At least, it seems to work fine for Joyent - their SmartOS virtualization OS is essentially Illumos (Solaris) with ZFS, and KVM ported from Linux. Since ZFS supports "volumes" (virtual block devices inside a ZPool), I suspect these are probably optimized to be used for VM images (i.e. they do as little COW as possible). Of course, snapshots will always degrade performance to a degree.

However, there are some drawbacks to ZFS:
- It's less flexible, especially when it comes to reconfiguration of disk arrays. Add or remove a disk to/from a RaidZ and rebalance - that would be just awesome. It's possible in BTRFS, but not ZFS. :-(
- The not-so-good integration with the fs cache, at least on Linux. I don't know if this is really an issue, though. Actually, I imagine it's more of an issue for guest systems, because it probably breaks memory ballooning. (?)

So it seems there are two options for me:
1. Go with ZFS for now, until BTRFS finds a better way to handle disk images, or until md gets data checksums.
2. Buy a bunch of SSDs for VM disk images and use spinning disks for data storage only. In that case, BTRFS should probably do fine.

Any comments on that? Am I missing something?

Thanks!

Gert
Re: BTRFS as image store for KVM?
On 17 September 2015 at 18:56, Gert Menke wrote:
> MD+LVM is very close to what I want, but md has no way to cope with
> silent data corruption. So if I'd want to use a guest filesystem that
> has no checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the most
> obvious things to want in a software RAID setup? Is it a feature that
> might appear in the future? Maybe I should talk to the md guys...
...
> Any comments on that? Am I missing something?

How about using file integrity checking tools for cases when the chosen storage stack doesn't provide data checksumming? E.g.:

aide - http://aide.sourceforge.net/
cfv - http://cfv.sourceforge.net/
tripwire - http://sourceforge.net/projects/tripwire/

I don't use them myself, just providing options.

Mike
Re: BTRFS as image store for KVM?
On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
> Hi,
>
> thank you for your answers!
>
> So it seems there are several suboptimal alternatives here...
>
> MD+LVM is very close to what I want, but md has no way to cope with
> silent data corruption. So if I'd want to use a guest filesystem
> that has no checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the
> most obvious things to want in a software RAID setup? Is it a
> feature that might appear in the future? Maybe I should talk to the
> md guys...
>
> BTRFS looks really nice feature-wise, but is not (yet) optimized for
> my use-case I guess. Disabling COW would certainly help, but I don't
> want to lose the data checksums. Is nodatacowbutkeepdatachecksums a
> feature that might turn up in the future?
[snip]

No. If you try doing that particular combination of features, you end up with a filesystem that can be inconsistent: there's a race condition between updating the data in a file and updating the csum record for it, and the race can't be fixed.

Hugo.

-- 
Hugo Mills             | I spent most of my money on drink, women and
hugo@... carfax.org.uk | fast cars. The rest I wasted.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                     James Hunt
Re: BTRFS as image store for KVM?
On 17.09.2015 at 21:43, Hugo Mills wrote:
> On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
>> BTRFS looks really nice feature-wise, but I guess it is not (yet)
>> optimized for my use-case. Disabling COW would certainly help, but I
>> don't want to lose the data checksums. Is nodatacowbutkeepdatachecksums
>> a feature that might turn up in the future?
> [snip]
>
> No. If you try doing that particular combination of features, you
> end up with a filesystem that can be inconsistent: there's a race
> condition between updating the data in a file and updating the csum
> record for it, and the race can't be fixed.

I'm no filesystem expert, but isn't that what an intent log is for? (Does btrfs have an intent log?)

And, is this also true for mirrored or raid5 disks? I'm thinking something like "if the data does not match the checksum, just restore both from mirror/parity" should be possible, right?

Gert
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
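[For checksummed (i.e. non-NOCOW) data, the restore-from-mirror behaviour Gert asks about is in fact what btrfs already does on raid1/raid10 profiles: a read that fails checksum verification is retried from the other copy and the bad copy rewritten, and a scrub applies the same check-and-repair pass to the whole filesystem. A minimal sketch, with a hypothetical mount point:

  # verify every block against its checksum, repairing from the good
  # mirror where possible; -B stays in the foreground, -d prints
  # per-device statistics
  btrfs scrub start -Bd /srv/images
  btrfs scrub status /srv/images

The unfixable race Hugo describes is specifically about NOCOW files, which have no checksums to verify in the first place.]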
Re: BTRFS as image store for KVM?
On 17.09.2015 at 20:35, Chris Murphy wrote:
> You can use Btrfs in the guest to get at least notification of SDC.
Yes, but I'd rather not depend on all potential guest OSes having btrfs or something similar.

> Another way is to put a conventional fs image on e.g. GlusterFS with
> checksumming enabled (and at least distributed+replicated filtering).
This sounds interesting! I'll have a look at this.

> If you do this directly on Btrfs, maybe you can mitigate some of the
> fragmentation issues with bcache or dmcache;
Thanks, I did not know about these. bcache seems to be more or less what "zpool add foo cache /dev/ssd" does. Definitely worth a look.

> and for persistent snapshotting, use qcow2 to do it instead of Btrfs.
> You'd use Btrfs snapshots to create a subvolume for doing backups of
> the images, and then get rid of the Btrfs snapshot.
Good idea. Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
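[Chris's last suggestion, as a concrete sketch -- paths and the backup host name are hypothetical. The snapshot only needs to live for the duration of the backup, which keeps the window of COW-under-snapshot writes short:

  # take a read-only snapshot (send requires one), back it up, delete it
  btrfs subvolume snapshot -r /srv/images /srv/images-backup
  btrfs send /srv/images-backup | ssh backuphost btrfs receive /backup
  btrfs subvolume delete /srv/images-backup
]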
Re: BTRFS as image store for KVM?
Hugo Mills posted on Thu, 17 Sep 2015 19:43:14 + as excerpted:

>> Is nodatacowbutkeepdatachecksums a feature that might turn up
>> in the future?
>
> No. If you try doing that particular combination of features, you
> end up with a filesystem that can be inconsistent: there's a race
> condition between updating the data in a file and updating the csum
> record for it, and the race can't be fixed.

... Which is both why btrfs disables checksumming on nocow, and why more traditional in-place-overwrite filesystems don't normally offer a checksumming feature -- it's only easily and reliably possible with copy-on-write, as in-place-overwrite introduces race issues that are basically impossible to solve.

Logging can narrow the race, but consider: either the log introduces some level of copy-on-write itself, or, one way or another, you're going to have to write two things, one a checksum of the other. If both are in-place-overwrites, then no matter how narrow the race, there's always going to be a point at which one or the other has been written while the other hasn't, and if failure occurs at that point...

The only real way around that is /some/ form of copy-on-write, such that both the change and its checksum can be written to a different location than the old version, with a single, atomic write then updating a pointer to point to the new version of both the data and its checksum, instead of the old one.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
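[Duncan's "single atomic pointer update" can be illustrated from userspace; a hypothetical sketch, where every name is made up and modify_image stands in for whatever produces the new data. Both the new data and its checksum land in a fresh directory, and one rename(2) flips a symlink, so a reader always sees a matching data/checksum pair:

  # userspace "copy on write" of an image plus its checksum
  mkdir v2
  modify_image v2/disk.img                 # hypothetical stand-in command
  sha256sum v2/disk.img > v2/disk.img.sum
  sync                                     # both on stable storage before the flip
  # the one atomic "pointer" update: rename(2) never exposes a half state
  ln -s v2 current.new && mv -T current.new current

With in-place overwrite there is no equivalent of that final rename, which is exactly the unfixable window Hugo describes.]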
Re: BTRFS as image store for KVM?
Chris Murphy posted on Thu, 17 Sep 2015 12:35:41 -0600 as excerpted:

> You'd use Btrfs snapshots to create a subvolume for doing backups of
> the images, and then get rid of the Btrfs snapshot.

The caveat here is that if the VM/DB is active during the backups (btrfs send/receive or other), it'll still cow1 any writes during the existence of the btrfs snapshot. If the backup can be scheduled during VM/DB downtime, or at least when activity is very low, the relatively short cow1 window should avoid serious fragmentation, but if not, even relatively temporary snapshots are likely to trigger noticeable cow1 fragmentation issues eventually.

Some users have ameliorated that by scheduling a weekly or monthly btrfs defrag, reporting that cow1 issues with temporary snapshots build up slowly enough that the scheduled defrag effectively eliminates the otherwise growing problem, but it's still an additional complication to have to configure and administer, longer term.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
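[That scheduled defrag can be as simple as a cron script; a minimal sketch, using the libvirt default image path as an assumed example. One caveat worth noting: defragmenting a file that is still shared with a live snapshot unshares its extents (duplicating space), so it's best run after the temporary backup snapshots are deleted:

  #!/bin/sh
  # hypothetical /etc/cron.weekly/defrag-vm-images
  btrfs filesystem defragment -r /var/lib/libvirt/images
]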
Re: BTRFS as image store for KVM?
On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
> MD+LVM is very close to what I want, but md has no way to cope with silent
> data corruption. So if I want to use a guest filesystem that has no
> checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the most
> obvious things to want in a software RAID setup? Is it a feature that might
> appear in the future? Maybe I should talk to the md guys...

MD is emulating hardware RAID. In hardware RAID, you are doing work at the block level. Block-level RAID has no understanding of the filesystem(s) running on top of it. Therefore it would have to checksum groups of blocks, and store those checksums on the physical disks somewhere, perhaps by keeping some portion of the drive for itself. But then this is not very efficient, since it is maintaining checksums for data that may be useless (blocks the FS is not currently using).

So then you might make the RAID filesystem-aware... and you now have BTRFS RAID.

Simply put, the block level is probably not an appropriate place for checksumming to occur. BTRFS can make checksumming work much more effectively and efficiently by doing it at the filesystem level.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
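[To put an illustrative number on that inefficiency (the figures are back-of-envelope, not from the original mail): a block layer keeping, say, a 4-byte crc32c for every 4 KiB block has to store and maintain checksums for the whole device, whether or not the filesystem above ever allocates those blocks:

  # bash arithmetic: 4-byte crc per 4 KiB block on a 4 TB disk
  echo $(( (4000000000000 / 4096) * 4 / 1000000 )) MB   # ~3906 MB of checksums
  # ...all of which must be kept write-consistent, even for never-used blocks
]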
RE: BTRFS as image store for KVM?
> -----Original Message-----
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Brendan Heading
> Sent: Wednesday, 16 September 2015 9:36 PM
> To: Duncan <1i5t5.dun...@cox.net>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: BTRFS as image store for KVM?
>
> > Btrfs has two possible solutions to work around the problem. The
> > first one is the autodefrag mount option, which detects file
> > fragmentation during the write and queues up the affected file for a
> > defragmenting rewrite by a lower priority worker thread.
> [...]
> > But it's not going to work so well for multi-GiB VM images.
>
> [unlurking for the first time]
>
> This problem has been faced by a certain very large storage vendor whom I
> won't name, who provide an option similar to the above. Reading between
> the lines I think their approach is to try to detect which accesses are
> read-sequential, and schedule those blocks for rewriting in sequence. They
> also have a feature to run as a background job which can be scheduled to
> run during an off-peak period, where they can reorder entire files that are
> significantly out of sequence. I'd expect the algorithm is intelligent,
> i.e. there's no need to rewrite entire large files that are mostly
> sequential with a few out-of-order sections.
>
> Has anyone considered these options for btrfs? Not being able to run VMs
> on it is probably going to be a bit of a killer ..

I run VMs on BTRFS using regular consumer-grade SSDs and hardware; it works great, I think. My guests are Windows Server + MS SQL. Not the most ideal workload, but I care about data integrity, so I'm willing to sacrifice a bit of speed for it. Checksums have prevented countless corruption issues. Although, now that I think about it, the spinning rust backup disks are the only ones that have ever had any corruption. I guess SSDs have their own internal checksumming as well.

The speed seems quite reasonable, and the server has around 16G of RAM free, which I presume is being used as a cache; that seems to help.

Paul.
Re: BTRFS as image store for KVM?
On 2015-09-16 07:35, Brendan Heading wrote:
>> Btrfs has two possible solutions to work around the problem. The
>> first one is the autodefrag mount option, which detects file
>> fragmentation during the write and queues up the affected file for a
>> defragmenting rewrite by a lower priority worker thread.
[...]
>> But it's not going to work so well for multi-GiB VM images.
>
> [unlurking for the first time]
>
> This problem has been faced by a certain very large storage vendor whom
> I won't name, who provide an option similar to the above. Reading
> between the lines I think their approach is to try to detect which
> accesses are read-sequential, and schedule those blocks for rewriting in
> sequence. They also have a feature to run as a background job which can
> be scheduled during an off-peak period, where they can reorder entire
> files that are significantly out of sequence. I'd expect the algorithm
> is intelligent, i.e. there's no need to rewrite entire large files that
> are mostly sequential with a few out-of-order sections.
>
> Has anyone considered these options for btrfs? Not being able to run
> VMs on it is probably going to be a bit of a killer ..

3 things to mention here:

1. It's perfectly possible to run VMs on BTRFS, it just takes some effort to get decent efficiency, and you can't really over-provision storage (the above-mentioned effort is to create the file with NOCOW set, and then use fallocate or dd to pre-allocate space for it).

2. If you are using a file for the disk image, you are already sacrificing performance for portability; it's just a bigger tradeoff with BTRFS than with most other filesystems on Linux.

3. Almost all of the issues that BTRFS has with VM disk images are also present in other filesystems; they are just much worse on BTRFS because of the fact that it is COW based.
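[Point 1 above, spelled out as a minimal sketch -- the directory and image size are hypothetical. The attribute has to be set on the directory before the image file is created, as Duncan explains elsewhere in the thread, so that new files inherit it:

  mkdir /srv/vm-images
  chattr +C /srv/vm-images                       # new files in here inherit NOCOW
  fallocate -l 40G /srv/vm-images/guest1.raw     # pre-allocate the image
  lsattr /srv/vm-images/guest1.raw               # the 'C' flag confirms NOCOW stuck
]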
Re: BTRFS as image store for KVM?
> Btrfs has two possible solutions to work around the problem. The first
> one is the autodefrag mount option, which detects file fragmentation
> during the write and queues up the affected file for a defragmenting
> rewrite by a lower priority worker thread. This works best on the small
> end, because as file size increases, so does time to actually write it
> out, and at some point, depending on the size of the file and how busy
> the database/VM is, writes are (trying to) come in faster than the file
> can be rewritten. Typically, there's no problem under a quarter GiB,
> with people beginning to notice performance issues at half to 3/4 GiB,
> tho on fast disks and not too busy VMs/DBs (which may well include your
> home system, depending on what you use the VMs for), you might not see
> problems until size reaches 2 GiB or so. As such, autodefrag tends to be
> a very good option for firefox sqlite database files, for instance, as
> they tend to be small enough not to have issues. But it's not going to
> work so well for multi-GiB VM images.

[unlurking for the first time]

This problem has been faced by a certain very large storage vendor whom I won't name, who provide an option similar to the above. Reading between the lines, I think their approach is to try to detect which accesses are read-sequential, and schedule those blocks for rewriting in sequence. They also have a feature to run as a background job which can be scheduled during an off-peak period, where they can reorder entire files that are significantly out of sequence. I'd expect the algorithm is intelligent, i.e. there's no need to rewrite entire large files that are mostly sequential with a few out-of-order sections.

Has anyone considered these options for btrfs? Not being able to run VMs on it is probably going to be a bit of a killer ..

regards

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS as image store for KVM?
As others have said here, it's probably not going to work for you, especially if you want to use regularly scheduled btrfs snapshots on the host (which I consider to be 50% of the reason why I use btrfs in the first place).

After I learned this lesson the hard way, I had a Xen server using libvirt configured to provision from LVM storage. I had some preseed/kickstart/ansible recipes for VM provisioning that configured btrfs in the guests, with appropriate scheduled snapshotting and remote send/receive to backup hosts. This worked well for me, but it's not for everyone.

I'm a btrfs fan, but I have talked a few admins out of naively using btrfs in the manner that you've described. Particularly busy Windows HVM guests on a file-backed image sitting on a btrfs host filesystem with regularly scheduled snapshots will deteriorate depressingly quickly.

Having said that, most of my current btrfs superstitions on this use-case were formed around kernel 3.12, a long time ago now.

On 16 September 2015 at 07:34, Gert Menke wrote:
> Hi everybody,
>
> first off, I'm not 100% sure if this is the right place to ask, so if it's
> not, I apologize and I'd appreciate a pointer in the right direction.
>
> I want to build a virtualization server to replace my current home server.
> I'm thinking about a Debian system with libvirt/KVM. The system will have
> one or two SSDs and five harddisks with some kind of software RAID5 for
> storage. I'd like to have a filesystem with data checksums, so BTRFS seems
> like the right way to go. However, I read that BTRFS does not perform well
> as storage for KVM disk images.
> (See here: http://www.linux-kvm.org/page/Tuning_KVM )
>
> Is this still true?
>
> I would appreciate any comments and/or tips you might have on this topic.
>
> Is anyone using BTRFS as an image store? Are there any special settings I
> should be aware of to make it work well?
>
> Thanks,
> Gert
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS as image store for KVM?
If you don't need image portability, use an LVM logical volume for backing the VM. That LV gets partitioned as if it were a disk, and you can use Btrfs inside it for root, home, data, or whatever.

If you need image portability, e.g. qcow2, then I'd put it on ext4 or XFS, and you can use Btrfs within the VM for data integrity.

If you put the qcow2 on Btrfs, you're advised (by the FAQ and many threads on this topic in the list archives) to set the image file with chattr +C, which is nocow and implies nodatasum, so no checksumming. And I've also run into this problem, depending on the qemu cache setting:

https://www.marc.info/?l=linux-btrfs&m=142714523110528&w=1
https://bugzilla.redhat.com/show_bug.cgi?id=1204569

-- Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
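[A minimal sketch of the LV-per-guest approach Chris describes; the volume group, guest name, and sizes are made up for illustration:

  # carve a 40 GiB LV out of an existing volume group
  lvcreate -L 40G -n guest1 vg0
  # hand it to the guest as a raw virtio disk
  qemu-system-x86_64 -enable-kvm -m 2048 \
      -drive file=/dev/vg0/guest1,format=raw,if=virtio

The raw LV trades portability for performance: no image file, so no host-filesystem fragmentation in the first place.]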
Re: BTRFS as image store for KVM?
Gert Menke posted on Tue, 15 Sep 2015 23:34:04 +0200 as excerpted:

> I'm not 100% sure if this is the right place to ask[.]

It is. =:^)

> I want to build a virtualization server to replace my current home
> server. I'm thinking about a Debian system with libvirt/KVM. The system
> will have one or two SSDs and five harddisks with some kind of software
> RAID5 for storage. I'd like to have a filesystem with data checksums, so
> BTRFS seems like the right way to go. However, I read that BTRFS does
> not perform well as storage for KVM disk images.
> (See here: http://www.linux-kvm.org/page/Tuning_KVM )
>
> Is this still true?
>
> I would appreciate any comments and/or tips you might have on this
> topic.
>
> Is anyone using BTRFS as an image store? Are there any special settings
> I should be aware of to make it work well?

Looks like you're doing some solid research before you deploy. =:^)

Here's the deal. The problem is fragmentation, which is much more of an issue on spinning rust than it typically is on ssds, since ssds have effectively zero seek-time. If you can put the VMs on those ssds you mentioned, not on the spinning rust, the fragmentation won't matter so much, and you may well not have to worry about it.

Any copy-on-write filesystem (which btrfs is) is going to have serious problems with a file-internal-rewrite write pattern (as contrasted to append, or simply rewriting the entire thing sequentially, beginning to end), because as various blocks are rewritten, they get written elsewhere, worst-case one at a time, dramatically increasing fragmentation -- hundreds of thousands of extents are not unheard-of with files in the multi-GiB size range.[1] The two typical problematic cases are database files and VM images (your case).

Btrfs has two possible solutions to work around the problem. The first one is the autodefrag mount option, which detects file fragmentation during the write and queues up the affected file for a defragmenting rewrite by a lower priority worker thread. This works best on the small end, because as file size increases, so does the time to actually write it out, and at some point, depending on the size of the file and how busy the database/VM is, writes are (trying to) come in faster than the file can be rewritten. Typically, there's no problem under a quarter GiB, with people beginning to notice performance issues at half to 3/4 GiB, tho on fast disks and not too busy VMs/DBs (which may well include your home system, depending on what you use the VMs for), you might not see problems until size reaches 2 GiB or so. As such, autodefrag tends to be a very good option for firefox sqlite database files, for instance, as they tend to be small enough not to have issues. But it's not going to work so well for multi-GiB VM images.

The second solution, or more like a workaround, for larger internal-rewrite-pattern files, generally 1 GiB plus (so many VMs), is to use the NOCOW file attribute (set with chattr +C), which tells btrfs to rewrite the file in-place instead of using the usual copy-on-write method. However, you're not going to like the side effects, as btrfs turns off both checksumming and transparent compression on nocow files, because there are serious checksum/data-it-covers write-race issues with in-place rewrite, and of course the rewritten data may compress better or worse than the old version, so rewriting a compressed copy in-place is problematic as well.
So setting nocow turns off checksumming, the biggest reason you're considering btrfs in the first place, likely making this option effectively unworkable for you. =:^(

Which means btrfs itself likely isn't a particularly good choice, UNLESS (a) your VM images are small (under a GiB, ideally under a quarter-gig, admittedly a pretty small VM), OR (b) your VMs are primarily reading, not writing, or aren't likely to be busy enough for autodefrag to be a problem, given the size, OR (c) you put the VM images (and thus the btrfs containing them) on ssd, not spinning rust.

Meanwhile, quickly tying up a couple of loose ends with nocow, in case you do decide to use it for this or some other use-case:

a) On btrfs, setting nocow on a file that's already larger than zero-size doesn't work as expected (cow writes can continue to occur for some time). Typically the easiest way to ensure that the file is nocow before it gets data is to set nocow on its containing directory before the file is created, so new files inherit the attribute. For existing files, set it on the dir and copy the file in from a different filesystem (or move it to, say, a tmpfs and back), so the file gets created with the nocow attribute as it is copied in.

b) Btrfs' snapshot feature depends on COW, locking in place the existing version of the file, forcing otherwise nocow files to be what I've seen described as cow1 -- the first write to a file block will cow it to a new location, because the existing version is locked in place by the snapshot; writes after the first go in-place again, until another snapshot once more locks the current version in place.
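[For completeness, the autodefrag workaround from earlier in Duncan's mail is just a mount option; a minimal sketch of an fstab entry, with a placeholder UUID, suited to filesystems holding smallish rewrite-heavy files such as browser sqlite databases:

  # /etc/fstab -- placeholder UUID; autodefrag applies to the whole filesystem
  UUID=0000-0000  /home  btrfs  defaults,autodefrag  0  0
]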