Re: BTRFS as image store for KVM?

2015-10-05 Thread Roman Mamedov
On Mon, 05 Oct 2015 11:43:18 +0300
Erkki Seppala  wrote:

> Lionel Bouton  writes:
> 
> > 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> > you need 2 processes to read from 2 devices at once) and I've never seen
> > anyone arguing that the current md code is unstable.
> 
> This indeed seems to be the case on my MD RAID1 HDD.

The key difference is that MD is smart enough to distribute reads to the least
loaded devices in a RAID (i.e. two processes reading from a two-HDD RAID1 will
~always use different HDDs, no matter what their PIDs are).

Another example of the differences was that in my limited testing I saw Btrfs
submit writes sequentially on RAID1: according to 'iostat', there were long
periods of only one drive writing, and then only the other one. The result was
the RAID's write performance ended up being lower than that of a single drive.
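
For anyone wanting to reproduce this kind of observation, something along
these lines works (a rough sketch; device and array names are examples, not
the ones from my setup):

  # Two concurrent readers against an md RAID1; iostat should show both
  # members busy.  Repeat against a 2-device Btrfs RAID1 to compare.
  dd if=/dev/md0 of=/dev/null bs=1M count=8192 skip=0    iflag=direct &
  dd if=/dev/md0 of=/dev/null bs=1M count=8192 skip=8192 iflag=direct &
  iostat -dxm sda sdb 2 10    # per-member rMB/s and %util
  wait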

-- 
With respect,
Roman


signature.asc
Description: PGP signature


Re: BTRFS as image store for KVM?

2015-10-05 Thread Duncan
Rich Freeman posted on Sun, 04 Oct 2015 08:21:53 -0400 as excerpted:

> On Sun, Oct 4, 2015 at 8:03 AM, Lionel Bouton
>  wrote:
>>
>> This focus on single reader RAID1 performance surprises me.
>>
>> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
>> you need 2 processes to read from 2 devices at once) and I've never
>> seen anyone arguing that the current md code is unstable.

I'm not a coder and could be wrong, but AFAIK, md/raid1 either works per 
thread (thus should multiplex I/O across raid1 devices in single-process-
multi-thread), or handles multiple AIO requests in parallel, if not both.

(If I'm laboring under a severe misconception, and I could be, please do 
correct me -- I'll rather be publicly corrected and have just that, my 
world-view corrected to align with reality, than be wrong, publicly or 
privately, and never know it, thus never correcting it!  =:^)

IOW, the primary case where I believe md/raid1 does single-device serial 
access, is where the single process is doing just that, serialized-single-
request-sleep-until-the-data's-ready.  Otherwise read requests are spread 
among the available spindles.  =:^)

But...

> Perhaps, but with btrfs it wouldn't be hard to get 1000 processes
> reading from a raid1 in btrfs and have every single request directed to
> the same disk with the other disk remaining completely idle.  I believe
> the algorithm is just whether the pid is even or odd, and doesn't take
> into account disk activity at all, let alone disk performance or
> anything more sophisticated than that.
> 
> I'm sure md does a better job than that.

Exactly.  Even/odd PID scheduling is great for testing, since it's simple 
enough to load either side exclusively or both sides exactly evenly, but 
it's absolutely horrible for multi-task, since worst-case single-device-
bottleneck is all too easy to achieve by accident, and even pure-random 
distribution is going to favor one side or the other to some extent, most 
of the time.
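
For the record, the selection rule being described amounts to something
like the following (a sketch of the idea only, not the actual kernel code):

  # Each submitting process picks a mirror from its PID's parity, with no
  # regard to device load:
  pid=$$
  mirror=$(( pid % 2 ))      # 0 -> first copy, 1 -> second copy
  echo "process $pid reads from mirror $mirror"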

Even worse, due to the most-remaining-free-space chunk allocation 
algorithm and pair-mirroring only, no matter the number of devices, try 
to use 3+ devices of differing sizes, and until the space-available on 
the largest pair reaches that of the others, that largest pair will get 
the allocations.  Consider a bunch of quarter-TiB devices in raid1, with 
a pair of 2 TiB devices as well.  The quarter-TiB devices will remain 
idle until the pair of 2 TiB devices reach 1.75 TiB full, thus equalizing 
the space available on each compared to the other devices on the 
filesystem.  Of course, that means reads too, are going to be tied to 
only those two devices, for anything in that first 1.75 TiB of data, and 
if all those reads are from even or all from odd PIDs, it's only going to 
be ONE of... perhaps 10 devices! Possibly hundreds of read threads 
bottlenecking on a single device of ten, while the other 9/10 of the 
filesystem-array remains entirely idle! =:^(
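
For anyone wanting to check how allocation actually plays out on such a
filesystem, a reasonably recent btrfs-progs can show it directly (/mnt is a
placeholder mount point):

  btrfs filesystem usage /mnt   # allocated vs unallocated space per device
  btrfs device usage /mnt       # where the raid1 chunks actually landed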

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-05 Thread Erkki Seppala
Lionel Bouton  writes:

> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> you need 2 processes to read from 2 devices at once) and I've never seen
> anyone arguing that the current md code is unstable.

This indeed seems to be the case on my MD RAID1 HDD.

But on MD SSD RAID10 it does use all four devices (using dd on the md
raid device and inspecting iostat at the same time).

So MD RAID1 declining to do this on HDDs seems to be limited to devices
that don't perform well in random-access scenarios, as you mentioned.

In practical terms I get about 1.1 GB/s from the 4 SSDs with 'dd', whereas
I get ~650 MB/s when I dd from the two fastest components of the MD device
at the same time. Since I get 330 MB/s from two of the SSDs and 150 MB/s
from the other two, the concurrent RAID10 IO appears to scale linearly.

(In fact maybe I should look into why the two devices are getting lower
speeds overall - they used to be fast.)

I didn't calculate how large the linearly transferred chunks would need
to be to overcome the seek latency. Probably quite large.
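
For completeness, the measurement was essentially this (device names are
examples):

  dd if=/dev/md0 of=/dev/null bs=1M count=16384 iflag=direct &
  iostat -dxm sda sdb sdc sdd 2 20  # per-member rMB/s shows how many SSDs are busy
  wait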

-- 
  _
 / __// /__   __   http://www.modeemi.fi/~flux/\   \
/ /_ / // // /\ \/ /\  /
   /_/  /_/ \___/ /_/\_\@modeemi.fi  \/

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-05 Thread Austin S Hemmelgarn

On 2015-10-05 07:16, Lionel Bouton wrote:

Hi,

Le 04/10/2015 14:03, Lionel Bouton a écrit :

[...]
This focus on single reader RAID1 performance surprises me.

1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
you need 2 processes to read from 2 devices at once) and I've never seen
anyone arguing that the current md code is unstable.


To better illustrate my point.

According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
most of the time.

http://www.phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1

The only case where md RAID1 was noticeably faster is sequential reads
with FIO libaio.
Part of this is because BTRFS's built-in raid functionality is designed 
for COW workloads, whereas mdraid isn't.  On top of that, I would be 
willing to bet that they were using the dup profile for metadata when 
testing mdraid (which is the default when using a single disk), which 
isn't a fair comparison because it stores more data in that case than 
the BTRFS raid.
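
To make that comparison apples-to-apples the profiles would have to be
pinned explicitly; a sketch, with /dev/md0 and the raw disks as placeholder
names:

  # On top of md, let md provide the redundancy and keep one copy in btrfs
  # (overriding the single-disk default of dup metadata):
  mkfs.btrfs -d single -m single /dev/md0
  # Native btrfs raid1 for both data and metadata, for the other side:
  mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb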


If you do the same thing with ZFS, I'd expect that you would see similar 
results (although probably with a bigger difference between ZFS and 
mdraid).  A much better comparison (which _will_ sway in mdraid's favor) 
would be running XFS or ext4 (or even JFS) on top of mdraid, as that is 
the regular usage of mdraid.


Furthermore, there is no sane reason to be running BTRFS on top of a 
single mdraid device, thus this was even less of a reasonable 
comparison.  It is worth noting however, that using BTRFS raid1 on top 
of two md RAID0 devices is significantly faster than BTRFS RAID10.
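
That layout looks roughly like this (a sketch; device names are examples):

  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1  # btrfs mirrors across the two stripes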

So if you base your analysis on the Phoronix tests, maybe md could perform
better when serving large files to a few clients. In all other cases BTRFS
RAID1 seems to be a better place to start if you want performance.
According to the bad performance -> unstable logic, md would then be the
less stable RAID1 implementation, which doesn't make sense to me.
No reasonable person should be basing their analysis on results from 
someone else's testing without replicating said results themselves, 
especially when those results are based on benchmarks and not real 
workloads.  This goes double for Phoronix, as they are essentially paid 
to make whatever the newest technology on Linux is look good.

I'm not even saying that BTRFS performs better than md for most
real-world scenarios (these are only benchmarks), but that arguing that
BTRFS is not stable because it has performance issues still doesn't make
sense to me. Even synthetic benchmarks aren't enough to find the best
fit for real-world scenarios, so you could always find a very restrictive
situation where any filesystem, RAID implementation or volume manager could
look bad, even the most robust ones.

Of course if BTRFS RAID1 was always slower than md RAID1 the logic might
make more sense. But clearly there were design decisions and performance
tuning in BTRFS that led to better or similar performance in several
scenarios; if the remaining scenarios don't get attention, it may be
because they represent a niche (at least from the point of view of the
developers), not a lack of polish.
Like I said above, a large part of this is probably because BTRFS raid 
was designed for COW workloads, and mdraid wasn't.





smime.p7s
Description: S/MIME Cryptographic Signature


Re: BTRFS as image store for KVM?

2015-10-05 Thread Rich Freeman
On Mon, Oct 5, 2015 at 7:16 AM, Lionel Bouton
 wrote:
> According to the bad performance -> unstable logic, md would then be the
> less stable RAID1 implementation, which doesn't make sense to me.
>

The argument wasn't that bad performance meant that something was unstable.

The argument was that a lack of significant performance optimization
meant that the developers considered it unstable and not worth
investing time on optimizing.

So, the question isn't whether btrfs is or isn't faster than something
else.  The question is whether it is or isn't faster than it could be
if it were properly optimized.  That is, how does btrfs today compare
against btrfs 20 years from now, which obviously cannot be benchmarked
today.

That said, I'm not really convinced that the developers haven't fixed
this because they feel that it would need to be redone later after
major refactoring.  I think it is more likely that there are just very
few developers working on btrfs and load-balancing on raid just
doesn't rank high on their list of interests or possibly expertise.
If any are being paid to work on btrfs then most likely their
employers don't care too much about it either.

I did find the phoronix results interesting though.  The whole driver
for "layer-violation" is that with knowledge of the filesystem you can
better optimize what you do/don't read and write, and that may be
showing here.

--
Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-05 Thread Lionel Bouton
Hi,

Le 04/10/2015 14:03, Lionel Bouton a écrit :
> [...]
> This focus on single reader RAID1 performance surprises me.
>
> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> you need 2 processes to read from 2 devices at once) and I've never seen
> anyone arguing that the current md code is unstable.

To better illustrate my point.

According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
most of the time.

http://www.phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1

The only case where md RAID1 was noticeably faster is sequential reads
with FIO libaio.

So if you base your analysis on the Phoronix tests, maybe md could perform
better when serving large files to a few clients. In all other cases BTRFS
RAID1 seems to be a better place to start if you want performance.
According to the bad performance -> unstable logic, md would then be the
less stable RAID1 implementation, which doesn't make sense to me.

I'm not even saying that BTRFS performs better than md for most
real-world scenarios (these are only benchmarks), but that arguing that
BTRFS is not stable because it has performance issues still doesn't make
sense to me. Even synthetic benchmarks aren't enough to find the best
fit for real-world scenarios, so you could always find a very restrictive
situation where any filesystem, RAID implementation or volume manager could
look bad, even the most robust ones.

Of course if BTRFS RAID1 was always slower than md RAID1 the logic might
make more sense. But clearly there were design decisions and performance
tuning in BTRFS that led to better or similar performance in several
scenarios; if the remaining scenarios don't get attention, it may be
because they represent a niche (at least from the point of view of the
developers), not a lack of polish.

Best regards,

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-05 Thread Duncan
On Mon, 5 Oct 2015 13:16:03 +0200
Lionel Bouton  wrote:

> To better illustrate my point.
> 
> According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
> most of the time.
> 
> http://www.phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1
> 
> The only case where md RAID1 was noticeably faster is sequential reads
> with FIO libaio.
> 
> So if you base your analysis on Phoronix tests

[Oops.  Actually sent to the list too this time.]

FYI...

1) It's worth noting that while I personally think Phoronix has more
merit than sometimes given credit for, it does have a rather bad rep in
kernel circles, due to how results are (claimed to be) misused in
support of points they don't actually support at all, if you know how
to read the results taking into account the configuration used.

As such, Phoronix is about the last reference you want to be using if
trying to support points with kernel folks, because rightly or wrongly,
a lot of them will see that and simply shut down right there, having
already decided based on previous experience that there's little use
arguing with somebody quoting Phoronix, since they invariably don't
know how to read the results based on what was /actually/ tested given
the independent variable and the test configuration.

Tho I personally actually do find quite some use for various Phoronix
benchmark articles, reading them with the testing context in mind.  But
I definitely wouldn't be pulling them out to demonstrate a point to
kernel folks, unless it was near enough the end of a list of references
making a similar point, after I've demonstrated my ability to keep the
testing context in mind when looking at their results, because I know
based on the history, quoting Phoronix in support of something simply
isn't likely to get me anywhere with kernel folks.


As for the specific test you referenced...

2) At least the URL you pointed at was benchmarks for Intel
SSDs, not spinning rust.  The issues and bottlenecks in the case of
good SSDs are so entirely different than for spinning rust that it's an
entirely different debate.  Among other things, good SATA-based SSDs have
been fast enough for a while now that, if the tests really are I/O bound,
the bottleneck tends to be SATA-bus speed rather than device speed (thru
SATA 3.0, 600 MB/s, anyway; SATA 3.2 aka SATA Express and M.2, at 1969
MB/s, are fast enough to often put the bottleneck back on the device).
However, in many cases, because good SSDs and modern buses are so fast,
the bottleneck actually ends up being CPU, once again.

So in the case of good reasonably current SSDs, CPU is likely to be the
bottleneck.  Tho these aren't as current as might be expected...

The actual devices tested here are SATA 2 (300 MB/s bus speed) and are
rather dated given the Dec 2014 article date, as they're only rated 205
MB/s read and 45 MB/s write, so only read comes anywhere near the bus
speed, and that bus itself is only SATA 2.

Given that, it's likely that write speed is device-bound, and while
raid isn't likely to slow it down despite the multi-time-writing
because it /is/ device-bound, it's unlikely to be faster than
single-device writing, either.

But this thread was addressing read-speed.  Read-speed is much closer
to the bus speed, and depending on the application, particularly for
raid, may well be CPU-bound.  Where it's CPU-bound, because the device
and bus speed are relatively high, the multiple devices of RAID aren't
likely to be of much benefit at all.

Meanwhile, what was the actual configuration on the devices themselves?

Here, we see that in both cases it was actually btrfs, btrfs with
defaults as installed (in single-device-mode, if reading between the
lines) on top of md/raid of the tested level for the md/raid side,
native btrfs raid of the tested level on the native btrfs raid side.

But... there's already so much that isn't known -- he says defaults
where not stated, but that's still ambiguous in some cases.

For instance, he does specifically state that in native mode, btrfs
detects the ssds, and activates ssd mode, but that it doesn't do so
when installed on the md/raid.  So we know for sure that he took the
detected ssd (or not) defaults there.

But...  we do NOT know... does btrfs native raid level mean for both
data and metadata, or only for (presumably) data, leaving metadata at
the defaults (which is raid1 for multi-device), or perhaps the
reverse, tested level metadata, defaults for data (which AFAIK is
single mode for multi-device).

And in single-device btrfs on top of md/raid mode, with the md/raid at
the tested level, we already know that it didn't detect ssd and enable
ssd mode, and he didn't enable it manually, but what we /don't/ know
for sure is how it was installed at mkfs.btrfs time and whether that
might have detected ssd.   If mkfs.btrfs detected ssd, it would have
created single-mode metadata by default, otherwise, dup mode metadata,
unless specifically told 

Re: BTRFS as image store for KVM?

2015-10-05 Thread Austin S Hemmelgarn

On 2015-10-05 10:04, Duncan wrote:

On Mon, 5 Oct 2015 13:16:03 +0200
Lionel Bouton  wrote:


To better illustrate my point.

According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
most of the time.

http://www.phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1

The only case where md RAID1 was noticeably faster is sequential reads
with FIO libaio.

So if you base your analysis on Phoronix tests



[...snip...]


Hmm... I think I've begun to see the kernel folks' point about people
quoting Phoronix in support of their points, when it's really not
apropos at all.  Yes, I do still consider Phoronix reports in context
to contain useful information, at some level.  However, one really must
be aware of what was actually tested in order to understand what the
results actually mean, and unfortunately, it seems most people quoting
it, including here, really can't properly do so in context, and thus
end up using it in support of points that simply are not supported by
the given evidence in the Phoronix articles people are attempting to
use.

Even aside from the obvious past issues with Phoronix reports, people 
forget that they are a news organization (regardless of what they claim, 
they _are_ a news organization), and as such their employees are not 
paid to verify existing results; they're paid to make impactful articles 
that grab people's attention (I'd be willing to bet that this story 
started in response to the people who pointed out correctly that XFS or 
ext4 on top of mdraid beats the pants off of BTRFS performance-wise, and 
(incorrectly) assumed that this meant that mdraid was better than BTRFS 
raid).  This, combined with almost no evidence in many cases of actual 
statistical analysis, really hurts their credibility (at least, for me 
it does).


The other issue is that so many people tout benchmarks as the pinnacle 
of testing, when they really aren't.  Benchmarks are by definition 
synthetic workloads, and as such only the _really_ good ones (which 
there aren't many of) give you more than a very basic idea of what 
performance differences you can expect with a given workload.  On top of 
that, people just accept results without trying to reproduce them 
themselves (Kernel folks tend to be much better about this than many 
other people though).


A truly sane person, looking to determine the best configuration for a 
given workload, will:
1. Look at a wide variety of sources to determine what configurations he 
should even be testing.  (The author of the linked article obviously 
didn't do this, or just didn't care; the defaults on btrfs are unsuitable 
for a significant number of cases, including usage on top of mdraid.)
2. Using this information, run established benchmarks similar to his 
use-case to further narrow down the test candidates.
3. Write his own benchmark to simulate to the greatest degree possible 
the actual workload he expects to run, and then use that for testing the 
final candidates (see the sketch after this list).
4. Gather some reasonable number of samples with the above mentioned 
benchmark, and use _real_ statistical analysis to determine what he 
should be using.
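
As an illustration of step 3, a workload-shaped fio run could look like
this (a sketch only, not the job the article used; the path and numbers are
assumptions):

  # Mixed 4k random I/O with libaio, roughly approximating a VM image store:
  fio --name=kvm-image-sim --filename=/mnt/test/vm-image.bin --size=20g \
      --rw=randrw --rwmixread=70 --bs=4k --ioengine=libaio --iodepth=16 \
      --direct=1 --runtime=300 --time_based --group_reporting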


To put this in further perspective, most people just do step one, assume 
that other people know what they're talking about, and don't do any 
further testing, while others just do step two and then claim their 
results are infallible.




smime.p7s
Description: S/MIME Cryptographic Signature


Re: BTRFS as image store for KVM?

2015-10-04 Thread Rich Freeman
On Sun, Oct 4, 2015 at 8:03 AM, Lionel Bouton
 wrote:
>
> This focus on single reader RAID1 performance surprises me.
>
> 1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
> you need 2 processes to read from 2 devices at once) and I've never seen
> anyone arguing that the current md code is unstable.

Perhaps, but with btrfs it wouldn't be hard to get 1000 processes
reading from a raid1 in btrfs and have every single request directed
to the same disk with the other disk remaining completely idle.  I
believe the algorithm is just whether the pid is even or odd, and
doesn't take into account disk activity at all, let alone disk
performance or anything more sophisticated than that.

I'm sure md does a better job than that.

--
Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-04 Thread Lionel Bouton
Hi,

Le 04/10/2015 04:09, Duncan a écrit :
> Russell Coker posted on Sat, 03 Oct 2015 18:32:17 +1000 as excerpted:
>
>> Last time I checked a BTRFS RAID-1 filesystem would assign each process
>> to read from one disk based on its PID.  Every RAID-1 implementation
>> that has any sort of performance optimisation will allow a single
>> process that's reading to use both disks to some extent.
>>
>> When the BTRFS developers spend some serious effort optimising for
>> performance it will be useful to compare BTRFS and ZFS.
> This is the example I use as to why btrfs isn't really stable, as well.  
> Devs tend to be very aware of the dangers of premature optimization, 
> because done too early, it either means throwing that work away when a 
> rewrite comes, or it severely limits options as to what can be rewritten, 
> if necessary, in order to avoid throwing all that work that went into 
> optimization away.
>
> So at least for devs that have been around awhile, that don't have some 
> boss that's paying the bills saying optimize now, an actually really good 
> mark of when the /devs/ consider something stable, is when they start 
> focusing on that optimization.

This focus on single reader RAID1 performance surprises me.

1/ AFAIK the kernel md RAID1 code behaves the same (last time I checked
you need 2 processes to read from 2 devices at once) and I've never seen
anyone arguing that the current md code is unstable.

2/ I'm not familiar with implementations taking advantage of several
disks for single process reads but clearly they'll have more problems
with seeks on rotating devices to solve. So are there really
implementations with better performance across the spectrum or do they
have to pay a performance penalty in the multiple readers case to
optimize the (arguably less frequent/important) single reader case?

Best regards,

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-03 Thread Duncan
Russell Coker posted on Sat, 03 Oct 2015 18:32:17 +1000 as excerpted:

> Last time I checked a BTRFS RAID-1 filesystem would assign each process
> to read from one disk based on its PID.  Every RAID-1 implementation
> that has any sort of performance optimisation will allow a single
> process that's reading to use both disks to some extent.
> 
> When the BTRFS developers spend some serious effort optimising for
> performance it will be useful to compare BTRFS and ZFS.

This is the example I use as to why btrfs isn't really stable, as well.  
Devs tend to be very aware of the dangers of premature optimization, 
because done too early, it either means throwing that work away when a 
rewrite comes, or it severely limits options as to what can be rewritten, 
if necessary, in order to avoid throwing all that work that went into 
optimization away.

So at least for devs that have been around awhile, that don't have some 
boss that's paying the bills saying optimize now, an actually really good 
mark of when the /devs/ consider something stable, is when they start 
focusing on that optimization.

Since this rather obvious low hanging fruit bit of optimization hasn't 
yet been done, then, there's really no question, btrfs doesn't pass the 
optimized stability test yet, and thus is self-evidently not stable, in 
the opinion of the very devs working on it.  Were they to really consider 
it stable, this optimization would already be done.

So once we see this optimization done, /then/ we can debate whether btrfs 
is stable yet, or not.  Until then, settled question, it's obviously 
not.  It may indeed be some distance into the process of stabilization, 
"stabiliz_ing_", and I'd characterize it as exactly that, but not yet, 
"stable".

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-03 Thread Russell Coker
On Fri, 2 Oct 2015 10:07:24 PM Austin S Hemmelgarn wrote:
> > ARC presumably worked better than the other Solaris caching options.  It
> > was ported to Linux with zfsonlinux because that was the easy way of
> > doing it.
> 
> Actually, I think part of that was also the fact that ZFS is a COW 
> filesystem, and classical LRU caching (like the regular Linux pagecache) 
> often does horribly with COW workloads (and I'm relatively convinced 
> that this is a significant part of why BTRFS has such horrible 
> performance compared to ZFS).

Last time I checked a BTRFS RAID-1 filesystem would assign each process to read 
from one disk based on its PID.  Every RAID-1 implementation that has any 
sort of performance optimisation will allow a single process that's reading to 
use both disks to some extent.

When the BTRFS developers spend some serious effort optimising for performance 
it will be useful to compare BTRFS and ZFS.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-10-02 Thread Austin S Hemmelgarn

On 2015-10-02 00:21, Russell Coker wrote:

On Sat, 26 Sep 2015 12:20:41 AM Austin S Hemmelgarn wrote:

FYI:
Linux pagecache uses the LRU cache algo, and in the general case it works
well enough


I'd argue that 'general usage' should be better defined in this
statement.  Obviously, ZFS's ARC implementation provides better
performance in a significant number of common use cases for Linux,
otherwise people wouldn't be using it to the degree they are.


No-one gets a free choice about this.  I have a number of servers running ZFS
because I needed the data consistency features and BTRFS wasn't ready.  There
is no choice of LRU vs ARC once you've made the BTRFS vs ZFS decision.
I'm not saying there is a free choice in this, although that is largely 
because the page-cache wasn't written in a way on Linux that allows for 
easy development of alternative caching algorithms for it. When I said 
'using it', I meant using ZFS, not just ARC. I would love to be able 
some day to use ARC or even just SLRU (ARC without the adaptive internal 
sizing bits) on Linux, as both provide better 
performance for COW workloads than plain LRU (although, somewhat 
paradoxically, for some COW workloads, an MRU algorithm is even better).


ARC presumably worked better than the other Solaris caching options.  It was
ported to Linux with zfsonlinux because that was the easy way of doing it.
Actually, I think part of that was also the fact that ZFS is a COW 
filesystem, and classical LRU caching (like the regular Linux pagecache) 
often does horribly with COW workloads (and I'm relatively convinced 
that this is a significant part of why BTRFS has such horrible 
performance compared to ZFS).


Some people here have reported that ARC worked well for them on Linux.  My
experience was that the zfsonlinux kernel modules wouldn't respect the module
load options to reduce the size of the ARC and the default size would cause
smaller servers to have kernel panics due to lack of RAM.  My solution to that
problem was to get more RAM for all ZFS servers as buying RAM is cheaper for
my clients than paying me to diagnose the problems with ZFS.
The whole ARC sizing issue with zfsonlinux is largely orthogonal to 
whether or not ARC is better for a given workload, and I think that 
there is actually some lower limit they force based on the amount of RAM 
at boot.




smime.p7s
Description: S/MIME Cryptographic Signature


Re: BTRFS as image store for KVM?

2015-10-01 Thread Russell Coker
On Sat, 26 Sep 2015 12:20:41 AM Austin S Hemmelgarn wrote:
> > FYI:
> > Linux pagecache uses the LRU cache algo, and in the general case it works
> > well enough
> 
> I'd argue that 'general usage' should be better defined in this 
> statement.  Obviously, ZFS's ARC implementation provides better 
> performance in a significant number of common use cases for Linux, 
> otherwise people wouldn't be using it to the degree they are.

No-one gets a free choice about this.  I have a number of servers running ZFS 
because I needed the data consistency features and BTRFS wasn't ready.  There 
is no choice of LRU vs ARC once you've made the BTRFS vs ZFS decision.

ARC presumably worked better than the other Solaris caching options.  It was 
ported to Linux with zfsonlinux because that was the easy way of doing it.

Some people here have reported that ARC worked well for them on Linux.  My 
experience was that the zfsonlinux kernel modules wouldn't respect the module 
load options to reduce the size of the ARC and the default size would cause 
smaller servers to have kernel panics due to lack of RAM.  My solution to that 
problem was to get more RAM for all ZFS servers as buying RAM is cheaper for 
my clients than paying me to diagnose the problems with ZFS.
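
For reference, the knob in question is the zfs_arc_max module parameter;
whether a given zfsonlinux build actually honours it is exactly the problem
described above.  Setting it looks like this (the ~2 GiB value is just an
example):

  echo "options zfs zfs_arc_max=2147483648" > /etc/modprobe.d/zfs.conf  # at module load
  echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max              # runtime attempt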

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-29 Thread Gert Menke

Hi,

thank you all for your helpful comments.

From what I've read, I came up with the following guidelines (for myself; 
ymmv):
- Use btrfs for generic data storage on spinning disks and for 
everything on ssds.
- Use zfs for spinning disks that may be used for cow-unfriendly 
workloads, like vm images (if they are too big and/or too fast-changing 
for a scheduled defrag to make sense).


For now I'm going with the following setup: a Debian system with root on 
btrfs/raid1 on two ssds, and a raidz1 pool for storage and vm images. 
However, those few vms that really should be fast would also fit on the 
SSDs, so I might move them there and switch from ZFS to btrfs on the 
storage pool at some point in the future.
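
In concrete terms the layout is something like this (device, pool and
dataset names are placeholders for the real ones):

  mkfs.btrfs -d raid1 -m raid1 /dev/sda2 /dev/sdb2    # root on the two SSDs
  zpool create tank raidz1 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
  zfs create tank/vm-images                           # storage + VM images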


Some of the ideas presented here sound really interesting - for example 
I think that improving the Linux page cache to be more "arc-like" will 
probably benefit not only btrfs. Having both the page cache and the arc 
in parallel when using ZoL does not feel like an elegant solution, so 
maybe there's hope for that. (But I don't know if it is feasible for ZoL 
to abandon the arc in favor of an improved Linux page cache; I imagine 
it might be much work for little benefit.)


Thanks again
Gert
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-25 Thread Rich Freeman
On Sat, Sep 19, 2015 at 9:26 PM, Jim Salter  wrote:
>
> ZFS, by contrast, works like absolute gangbusters for KVM image storage.

I'd be interested in what allows ZFS to handle KVM image storage well,
and whether this could be implemented in btrfs.  I'd think that the
fragmentation issues would potentially apply to any COW filesystem,
and if ZFS has a solution for this then it would probably benefit
btrfs to implement the same solution, and not just for VM images.

--
Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-25 Thread Jim Salter

I suspect that the answer most likely boils down to "the ARC".

ZFS uses an Adaptive Replacement Cache instead of a standard FIFO, which 
keeps blocks in cache longer if they have been accessed in cache.  This 
means much higher cache hit rates, which also means minimizing the 
effects of fragmentation.


That's an off-the-top-of-my-head guess, though.  All I can tell you for 
certain is that I've done both - KVM stores on btrfs and on ZFS (and on 
LVM and on mdraid and...) - and it works extremely, extremely well on 
ZFS for long periods of time, where with btrfs it works very well at 
first but then degrades rapidly.


FWIW I've been using KVM + ZFS in wide production (>50 hosts) for 5+ 
years now.


On 09/25/2015 08:48 AM, Rich Freeman wrote:

On Sat, Sep 19, 2015 at 9:26 PM, Jim Salter  wrote:

ZFS, by contrast, works like absolute gangbusters for KVM image storage.

I'd be interested in what allows ZFS to handle KVM image storage well,
and whether this could be implemented in btrfs.  I'd think that the
fragmentation issues would potentially apply to any COW filesystem,
and if ZFS has a solution for this then it would probably benefit
btrfs to implement the same solution, and not just for VM images.

--
Rich


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-25 Thread Austin S Hemmelgarn

On 2015-09-25 08:48, Rich Freeman wrote:

On Sat, Sep 19, 2015 at 9:26 PM, Jim Salter  wrote:


ZFS, by contrast, works like absolute gangbusters for KVM image storage.


I'd be interested in what allows ZFS to handle KVM image storage well,
and whether this could be implemented in btrfs.  I'd think that the
fragmentation issues would potentially apply to any COW filesystem,
and if ZFS has a solution for this then it would probably benefit
btrfs to implement the same solution, and not just for VM images.
That may be tough to do however, the internal design of ZFS is _very_ 
different from that of BTRFS (and for that matter, every other 
filesystem on Linux).  Part of it may just be better data locality (if 
all of the fragments of a file are close to each other, then the 
fragmentation of the file is not as much of a performance hit), and part 
of it is probably how they do caching (and I personally _do not_ want 
BTRFS to try to do caching the way ZFS does, we have a unified pagecache 
in the VFS for a reason, we should be improving that, not trying to come 
up with multiple independent solutions).



Even aside from that however, just saying that ZFS works great for some 
particular use case isn't giving enough info, it has so many optional 
features and configuration knobs, you really need to give specifics on 
how you have ZFS set up in that case.




smime.p7s
Description: S/MIME Cryptographic Signature


Re: BTRFS as image store for KVM?

2015-09-25 Thread Jim Salter
Pretty much bog-standard, as ZFS goes.  Nothing different than what's 
recommended for any generic ZFS use.


* set blocksize to match hardware blocksize - 4K drives get 4K 
blocksize, 8K drives get 8K blocksize (Samsung SSDs)
* LZO compression is a win.  But it's not like anything sucks without 
it.  No real impact on performance for most use, + or -. Just saves space.
* > 4GB allocated to the ARC.  General rule of thumb: half the RAM 
belongs to the host (which is mostly ARC), half belongs to the guests.


I strongly prefer pool-of-mirrors topology, but nothing crazy happens if 
you use striped-with-parity instead.  I used to use RAIDZ1 (the rough 
equivalent of RAID5) quite frequently, and there wasn't anything 
amazingly sucky about it; it performed at least as well as you'd expect 
ext4 on mdraid5 to perform.
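
Spelled out as commands, the above looks roughly like this (pool and
dataset names are examples; ZFS's compression options are lz4/lzjb rather
than literally LZO, so lz4 is what's shown):

  zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd  # pool of mirrors
  zfs create tank/vm
  zfs set recordsize=8k tank/vm     # match the 8K flash blocksize mentioned above
  zfs set compression=lz4 tank/vm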


ZFS might or might not do a better job of managing fragmentation; I 
really don't know.  I strongly suspect the design difference between the 
kernel's simple FIFO page cache and ZFS' weighted cache makes a really, 
really big difference.




On 09/25/2015 09:04 AM, Austin S Hemmelgarn wrote:
> you really need to give specifics on how you have ZFS set up in that 
case.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-25 Thread Austin S Hemmelgarn

On 2015-09-25 09:12, Jim Salter wrote:

Pretty much bog-standard, as ZFS goes.  Nothing different than what's
recommended for any generic ZFS use.

* set blocksize to match hardware blocksize - 4K drives get 4K
blocksize, 8K drives get 8K blocksize (Samsung SSDs)
* LZO compression is a win.  But it's not like anything sucks without
it.  No real impact on performance for most use, + or -. Just saves space.
* > 4GB allocated to the ARC.  General rule of thumb: half the RAM
belongs to the host (which is mostly ARC), half belongs to the guests.

I strongly prefer pool-of-mirrors topology, but nothing crazy happens if
you use striped-with-parity instead.  I used to use RAIDZ1 (the rough
equivalent of RAID5) quite frequently, and there wasn't anything
amazingly sucky about it; it performed at least as well as you'd expect
ext4 on mdraid5 to perform.

ZFS might or might not do a better job of managing fragmentation; I
really don't know.  I /strongly/ suspect the design difference between
the kernel's simple FIFO page cache and ZFS' weighted cache makes a
really, really big difference.
I've been coming to that same conclusion myself over the years.  I would 
really love to see a drop-in replacement for Linux's pagecache with 
better performance (I don't remember for sure, but I seem to remember 
that the native pagecache isn't straight FIFO), but the likelihood of 
that actually getting into mainline is slim to none (can you imagine 
though how fast XFS or ext* would be with a good caching algorithm?).




On 09/25/2015 09:04 AM, Austin S Hemmelgarn wrote:

you really need to give specifics on how you have ZFS set up in that
case.








smime.p7s
Description: S/MIME Cryptographic Signature


Re: BTRFS as image store for KVM?

2015-09-25 Thread Timofey Titovets
2015-09-25 16:52 GMT+03:00 Jim Salter :
> Pretty much bog-standard, as ZFS goes.  Nothing different than what's
> recommended for any generic ZFS use.
>
> * set blocksize to match hardware blocksize - 4K drives get 4K blocksize, 8K
> drives get 8K blocksize (Samsung SSDs)
> * LZO compression is a win.  But it's not like anything sucks without it.
> No real impact on performance for most use, + or -. Just saves space.
> * > 4GB allocated to the ARC.  General rule of thumb: half the RAM belongs
> to the host (which is mostly ARC), half belongs to the guests.
>
> I strongly prefer pool-of-mirrors topology, but nothing crazy happens if you
> use striped-with-parity instead.  I used to use RAIDZ1 (the rough equivalent
> of RAID5) quite frequently, and there wasn't anything amazingly sucky about
> it; it performed at least as well as you'd expect ext4 on mdraid5 to
> perform.
>
> ZFS might or might not do a better job of managing fragmentation; I really
> don't know.  I strongly suspect the design difference between the kernel's
> simple FIFO page cache and ZFS' weighted cache makes a really, really big
> difference.
>
>
>
> On 09/25/2015 09:04 AM, Austin S Hemmelgarn wrote:
>> you really need to give specifics on how you have ZFS set up in that case.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

FYI:
Linux pagecache uses the LRU cache algo, and in the general case it works well enough

-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-25 Thread Austin S Hemmelgarn

On 2015-09-25 10:02, Timofey Titovets wrote:

2015-09-25 16:52 GMT+03:00 Jim Salter :

Pretty much bog-standard, as ZFS goes.  Nothing different than what's
recommended for any generic ZFS use.

* set blocksize to match hardware blocksize - 4K drives get 4K blocksize, 8K
drives get 8K blocksize (Samsung SSDs)
* LZO compression is a win.  But it's not like anything sucks without it.
No real impact on performance for most use, + or -. Just saves space.
* > 4GB allocated to the ARC.  General rule of thumb: half the RAM belongs
to the host (which is mostly ARC), half belongs to the guests.

I strongly prefer pool-of-mirrors topology, but nothing crazy happens if you
use striped-with-parity instead.  I use to use RAIDZ1 (the rough equivalent
of RAID5) quite frequently, and there wasn't anything amazingly sucky about
it; it performed at least as well as you'd expect ext4 on mdraid5 to
perform.

ZFS might or might not do a better job of managing fragmentation; I really
don't know.  I strongly suspect the design difference between the kernel's
simple FIFO page cache and ZFS' weighted cache makes a really, really big
difference.



On 09/25/2015 09:04 AM, Austin S Hemmelgarn wrote:

you really need to give specifics on how you have ZFS set up in that case.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


FYI:
Linux pagecache uses the LRU cache algo, and in the general case it works well enough

I'd argue that 'general usage' should be better defined in this 
statement.  Obviously, ZFS's ARC implementation provides better 
performance in a significant number of common use cases for Linux, 
otherwise people wouldn't be using it to the degree they are.  LRU often 
gives abysmal performance for VM images in my experience, and 
virtualization is becoming a very common use case for Linux.  On top of 
that, there are lots of applications that bypass the cache almost 
completely, and while that is a valid option in some cases, it shouldn't 
be needed most of the time.


If it's just plain LRU, I may take the time at some point to try and 
write some patches to test if SLRU works any better (as SLRU is 
essentially ARC without the auto-tuning), although I have nowhere near 
the resources to test something like that to the degree that would be 
required to get it even considered for inclusion in mainline.




smime.p7s
Description: S/MIME Cryptographic Signature


Re: BTRFS as image store for KVM?

2015-09-23 Thread Russell Coker
On Sat, 19 Sep 2015 12:13:29 AM Austin S Hemmelgarn wrote:
> The other option (which for some reason I almost never see anyone
> suggest), is to expose 2 disks to the guest (ideally stored on different
> filesystems), and do BTRFS raid1 on top of that.  In general, this is
> what I do (except I use LVM for the storage back-end instead of a
> filesystem) when I have data integrity requirements in the guest.  On
> the other hand of course, most of my VM's are trivial for me to
> recreate, so I don't often need this and just use DM-RAID via LVM.

I used to do that.  But it was very fiddly and snapshotting the virtual machine 
images required making a snapshot of half a RAID-1 array via LVM (or 
snapshotting both when the virtual machine wasn't running).

Now I just have a single big BTRFS RAID-1 filesystem and use regular files for 
the virtual machine images, with the Ext3 filesystem inside the guests.

On Sun, 20 Sep 2015 11:26:26 AM Jim Salter wrote:
> Performance will be fantastic... except when it's completely abysmal.  
> When I tried it, I also ended up with a completely borked (btrfs-raid1) 
> filesystem that would only mount read-only and read at hideously reduced 
> speeds after about a year of usage in a small office environment.  Did 
> not make me happy.

I've found performance to be acceptable, not great (you can't expect great 
performance from such things) but good enough for lightly loaded servers and 
test systems.

I even ran a training session on BTRFS and ZFS filesystems with the images 
stored on a BTRFS RAID-1 (of 15,000rpm SAS disks).  When more than 3 students 
ran a scrub at the same time performance dropped but it was mostly usable and 
there were no complaints.  Admittedly that server hit a BTRFS bug and needed 
"reboot -nf" half way through, but I don't think that was a BTRFS virtual 
machine issue, rather it was a more general BTRFS under load issue.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-23 Thread Russell Coker
On Fri, 18 Sep 2015 12:00:15 PM Duncan wrote:
> The caveat here is that if the VM/DB is active during the backups (btrfs 
> send/receive or other), it'll still COW1 any writes during the existence 
> of the btrfs snapshot.  If the backup can be scheduled during VM/DB 
> downtime or at least when activity is very low, the relatively short COW1 
> time should avoid serious fragmentation, but if not, even only relatively 
> temporary snapshots are likely to trigger noticeable cow1 fragmentation 
> issues eventually.

One relevant issue for this is whether the working set of the database fits 
into RAM.  RAM has been getting bigger and cheaper while databases I run 
haven't been getting bigger.  Now every database I run has a working set that 
fits into RAM so read performance (and therefore fragmentation) doesn't matter 
for me except when rebooting - and database servers don't get rebooted that 
often.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-19 Thread Jim Salter

I can't recommend btrfs+KVM, and I speak from experience.

Performance will be fantastic... except when it's completely abysmal.  
When I tried it, I also ended up with a completely borked (btrfs-raid1) 
filesystem that would only mount read-only and read at hideously reduced 
speeds after about a year of usage in a small office environment.  Did 
*not* make me happy.


ZFS, by contrast, works like absolute gangbusters for KVM image 
storage.  Just create a dataset, drop a .qcow2 file in it, and off to 
the races.  I don't recommend messing about with zvols, it's a PITA and 
isn't necessary.


HTH.



On 09/15/2015 05:34 PM, Gert Menke wrote:

Hi everybody,

first off, I'm not 100% sure if this is the right place to ask, so if 
it's not, I apologize and I'd appreciate a pointer in the right 
direction.


I want to build a virtualization server to replace my current home 
server. I'm thinking about a Debian system with libvirt/KVM. The 
system will have one or two SSDs and five harddisks with some kind of 
software RAID5 for storage. I'd like to have a filesystem with data 
checksums, so BTRFS seems like the right way to go. However, I read 
that BTRFS does not perform well as storage for KVM disk images.

(See here: http://www.linux-kvm.org/page/Tuning_KVM )

Is this still true?

I would appreciate any comments and/or tips you might have on this topic.

Is anyone using BTRFS as an image store? Are there any special 
settings I should be aware of to make it work well?


Thanks,
Gert
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-18 Thread Gert Menke

On 2015-09-18 04:22, Duncan wrote:

one way or another, you're going to have to write two things, one a 
checksum of the other, and if they are in-place-overwrites, while the 
race can be narrowed, there's always going to be a point at which either 
one or the other will have been written, while the other hasn't been, 
and if failure occurs at that point...


...then you still can recover the old data from the mirror or parity, 
and at least you don't have any inconsistent data. It's like the failure 
occurred just a tiny bit earlier.


The only real way around that is /some/ form of copy-on-write, such that 
both the change and its checksum can be written to a different location 
than the old version, with a single, atomic write then updating a pointer 
to point to the new version of both the data and its checksum, instead of 
the old one.


Or an intent log, but I guess that introduces a lot of additional writes 
(and seeks) that would impact performance noticeably...


Gert

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-18 Thread Austin S Hemmelgarn

On 2015-09-17 14:35, Chris Murphy wrote:

On Thu, Sep 17, 2015 at 11:56 AM, Gert Menke  wrote:

Hi,

thank you for your answers!

So it seems there are several suboptimal alternatives here...

MD+LVM is very close to what I want, but md has no way to cope with silent
data corruption. So if I'd want to use a guest filesystem that has no
checksums either, I'm out of luck.


You can use Btrfs in the guest to get at least notification of SDC. If
you want recovery also then that's a bit more challenging. The way
this has been done up until ZFS and Btrfs is T10 DIF (PI). There are
already checksums on the drive, but this adds more checksums that can
be confirmed through the entire storage stack, not just internal to
the drive hardware.

Another way is to put a conventional fs image on e.g. GlusterFS with
checksumming enabled (and at least distributed+replicated filtering).

If you do this directly on Btrfs, maybe you can mitigate some of the
fragmentation issues with bcache or dmcache; and for persistent
snapshotting, use qcow2 to do it instead of Btrfs. You'd use Btrfs
snapshots to create a subvolume for doing backups of the images, and
then get rid of the Btrfs snapshot.


The other option (which for some reason I almost never see anyone 
suggest), is to expose 2 disks to the guest (ideally stored on different 
filesystems), and do BTRFS raid1 on top of that.  In general, this is 
what I do (except I use LVM for the storage back-end instead of a 
filesystem) when I have data integrity requirements in the guest.  On 
the other hand of course, most of my VM's are trivial for me to 
recreate, so I don't often need this and just use DM-RAID via LVM.
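
Concretely, the setup is along these lines (volume group names and sizes
are examples):

  # On the host: one LV per mirror leg, deliberately on different disks/VGs.
  lvcreate -L 40G -n vm1-a vg_disk1
  lvcreate -L 40G -n vm1-b vg_disk2
  # Attach both to the guest (e.g. as vdb and vdc), then inside the guest:
  mkfs.btrfs -d raid1 -m raid1 /dev/vdb /dev/vdc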




smime.p7s
Description: S/MIME Cryptographic Signature


Re: BTRFS as image store for KVM?

2015-09-17 Thread Chris Murphy
On Thu, Sep 17, 2015 at 11:56 AM, Gert Menke  wrote:
> Hi,
>
> thank you for your answers!
>
> So it seems there are several suboptimal alternatives here...
>
> MD+LVM is very close to what I want, but md has no way to cope with silent
> data corruption. So if I'd want to use a guest filesystem that has no
> checksums either, I'm out of luck.

You can use Btrfs in the guest to get at least notification of SDC. If
you want recovery also then that's a bit more challenging. The way
this has been done up until ZFS and Btrfs is T10 DIF (PI). There are
already checksums on the drive, but this adds more checksums that can
be confirmed through the entire storage stack, not just internal to
the drive hardware.

Another way is to put a conventional fs image on e.g. GlusterFS with
checksumming enabled (and at least distributed+replicated filtering).

If you do this directly on Btrfs, maybe you can mitigate some of the
fragmentation issues with bcache or dmcache; and for persistent
snapshotting, use qcow2 to do it instead of Btrfs. You'd use Btrfs
snapshots to create a subvolume for doing backups of the images, and
then get rid of the Btrfs snapshot.
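
A sketch of that arrangement (file and path names are examples, and it
assumes the images directory is itself a btrfs subvolume):

  qemu-img create -f qcow2 /var/lib/libvirt/images/vm1.qcow2 40G
  qemu-img snapshot -c clean-install /var/lib/libvirt/images/vm1.qcow2  # persistent snapshot inside the qcow2
  btrfs subvolume snapshot -r /var/lib/libvirt/images /var/lib/libvirt/images-backup  # short-lived, for the backup run
  btrfs subvolume delete /var/lib/libvirt/images-backup  # dropped once the backup is done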


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS as image store for KVM?

2015-09-17 Thread Gert Menke

Hi,

thank you for your answers!

So it seems there are several suboptimal alternatives here...

MD+LVM is very close to what I want, but md has no way to cope with 
silent data corruption. So if I'd want to use a guest filesystem that 
has no checksums either, I'm out of luck.
I'm honestly a bit confused here - isn't checksumming one of the most 
obvious things to want in a software RAID setup? Is it a feature that 
might appear in the future? Maybe I should talk to the md guys...


BTRFS looks really nice feature-wise, but is not (yet) optimized for my 
use-case I guess. Disabling COW would certainly help, but I don't want 
to lose the data checksums. Is nodatacowbutkeepdatachecksums a feature 
that might turn up in the future?
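
For completeness, 'disabling COW' today looks like the following, and it
is precisely what turns the data checksums off for the affected files
(paths are examples):

  mkdir -p /var/lib/libvirt/images
  chattr +C /var/lib/libvirt/images        # new files created here inherit nodatacow
  qemu-img create -f raw /var/lib/libvirt/images/vm1.img 40G
  lsattr /var/lib/libvirt/images/vm1.img   # should show the C attribute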


Maybe ZFS is the best choice for my scenario. At least, it seems to work 
fine for Joyent - their SmartOS virtualization OS is essentially Illumos 
(Solaris) with ZFS, and KVM ported from Linux.
Since ZFS supports "Volumes" (virtual block devices inside a ZPool), I 
suspect these are probably optimized to be used for VM images (i.e. do 
as little COW as possible). Of course, snapshots will always degrade 
performance to a degree.
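
A zvol handed to a guest would be created roughly like this (pool, name
and size are examples):

  zfs create -V 40G -o volblocksize=8k tank/vm1-disk0
  # the guest then gets /dev/zvol/tank/vm1-disk0 as its block device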


However, there are some drawbacks to ZFS:
- It's less flexible, especially when it comes to reconfiguration of 
disk arrays. Add or remove a disk to/from a RaidZ and rebalance, that 
would be just awesome. It's possible in BTRFS, but not ZFS. :-(
- The not-so-good integration of the fs cache, at least on Linux. I 
don't know if this is really an issue, though. Actually, I imagine it's 
more of an issue for guest systems, because it probably breaks memory 
ballooning. (?)


So it seems there are two options for me:
1. Go with ZFS for now, until BTRFS finds a better way to handle disk 
images, or until md gets data checksums.
2. Buy a bunch of SSDs for VM disk images and use spinning disks for 
data storage only. In that case, BTRFS should probably do fine.


Any comments on that? Am I missing something?

Thanks!
Gert


Re: BTRFS as image store for KVM?

2015-09-17 Thread Mike Fleetwood
On 17 September 2015 at 18:56, Gert Menke  wrote:
> MD+LVM is very close to what I want, but md has no way to cope with silent
> data corruption. So if I'd want to use a guest filesystem that has no
> checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the most
> obvious things to want in a software RAID setup? Is it a feature that might
> appear in the future? Maybe I should talk to the md guys...
...
> Any comments on that? Am I missing something?

How about using file integrity checking tools for cases when the chosen
storage stack doesn't provide data checksumming?
E.g.
aide - http://aide.sourceforge.net/
cfv - http://cfv.sourceforge.net/
tripwire - http://sourceforge.net/projects/tripwire/

I don't use them myself; just providing options.
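
Even plain coreutils can cover the basic case; a minimal sketch using
sha256sum rather than the tools above (paths are hypothetical):

  # record checksums while the images are known-good
  cd /srv/images && sha256sum *.qcow2 > SHA256SUMS
  # re-run later (e.g. from cron); any "FAILED" line means corruption
  sha256sum -c SHA256SUMS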

Mike


Re: BTRFS as image store for KVM?

2015-09-17 Thread Hugo Mills
On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
> Hi,
> 
> thank you for your answers!
> 
> So it seems there are several suboptimal alternatives here...
> 
> MD+LVM is very close to what I want, but md has no way to cope with
> silent data corruption. So if I'd want to use a guest filesystem
> that has no checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the
> most obvious things to want in a software RAID setup? Is it a
> feature that might appear in the future? Maybe I should talk to the
> md guys...
> 
> BTRFS looks really nice feature-wise, but is not (yet) optimized for
> my use-case I guess. Disabling COW would certainly help, but I don't
> want to lose the data checksums. Is nodatacowbutkeepdatachecksums a
> feature that might turn up in the future?
[snip]

   No. If you try doing that particular combination of features, you
end up with a filesystem that can be inconsistent: there's a race
condition between updating the data in a file and updating the csum
record for it, and the race can't be fixed.

   Hugo.

-- 
Hugo Mills             | I spent most of my money on drink, women and fast
hugo@... carfax.org.uk | cars. The rest I wasted.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                          James Hunt




Re: BTRFS as image store for KVM?

2015-09-17 Thread Gert Menke

On 17.09.2015 at 21:43, Hugo Mills wrote:
> On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
>> BTRFS looks really nice feature-wise, but is not (yet) optimized for
>> my use-case I guess. Disabling COW would certainly help, but I don't
>> want to lose the data checksums. Is nodatacowbutkeepdatachecksums a
>> feature that might turn up in the future?
> [snip]
>
> No. If you try doing that particular combination of features, you
> end up with a filesystem that can be inconsistent: there's a race
> condition between updating the data in a file and updating the csum
> record for it, and the race can't be fixed.

I'm no filesystem expert, but isn't that what an intent log is for?
(Does btrfs have an intent log?)


And, is this also true for mirrored or raid5 disks?
I'm thinking something like "if the data does not match the checksum, 
just restore both from mirror/parity" should be possible, right?


Gert


Re: BTRFS as image store for KVM?

2015-09-17 Thread Gert Menke

On 17.09.2015 at 20:35, Chris Murphy wrote:
> You can use Btrfs in the guest to get at least notification of SDC.

Yes, but I'd rather not depend on all potential guest OSes having btrfs
or something similar.

> Another way is to put a conventional fs image on e.g. GlusterFS with
> checksumming enabled (and at least distributed+replicated filtering).

This sounds interesting! I'll have a look at this.

> If you do this directly on Btrfs, maybe you can mitigate some of the
> fragmentation issues with bcache or dmcache;

Thanks, I did not know about these. bcache seems to be more or less what
"zpool add foo cache /dev/ssd" does. Definitely worth a look.


> and for persistent snapshotting, use qcow2 to do it instead of Btrfs.
> You'd use Btrfs snapshots to create a subvolume for doing backups of
> the images, and then get rid of the Btrfs snapshot.

Good idea.
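
Since bcache came up above: a minimal sketch of a fresh bcache-backed
Btrfs setup (device names are hypothetical; the backing device has to be
formatted for bcache before the filesystem is created on it, so this is
not a conversion of an existing disk):

  make-bcache -B /dev/sdb -C /dev/sdc   # HDD as backing device, SSD as cache
  mkfs.btrfs /dev/bcache0
  mount /dev/bcache0 /srv/images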

Thanks!


Re: BTRFS as image store for KVM?

2015-09-17 Thread Duncan
Hugo Mills posted on Thu, 17 Sep 2015 19:43:14 + as excerpted:

>> Is nodatacowbutkeepdatachecksums a feature that might turn up
>> in the future?
> 
> No. If you try doing that particular combination of features, you
> end up with a filesystem that can be inconsistent: there's a race
> condition between updating the data in a file and updating the csum
> record for it, and the race can't be fixed.

...  Which is both why btrfs disables checksumming on nocow, and why
more traditional in-place-overwrite filesystems don't normally offer a 
checksumming feature -- it's only easily and reliably possible with copy-
on-write, as in-place-overwrite introduces race issues that are basically 
impossible to solve.

Logging can narrow the race, but consider, either they introduce some 
level of copy-on-write themselves, or one way or another, you're going to 
have to write two things, one a checksum of the other, and if they are in-
place-overwrites, while the race can be narrowed, there's always going to 
be a point at which either one or the other will have been written, while 
the other hasn't been, and if failure occurs at that point...

The only real way around that is /some/ form of copy-on-write, such that 
both the change and its checksum can be written to a different location 
than the old version, with a single, atomic write then updating a pointer 
to point to the new version of both the data and its checksum, instead of 
the old one.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: BTRFS as image store for KVM?

2015-09-17 Thread Duncan
Chris Murphy posted on Thu, 17 Sep 2015 12:35:41 -0600 as excerpted:

> You'd use Btrfs snapshots to create a subvolume for doing backups of
> the images, and then get rid of the Btrfs snapshot.

The caveat here is that if the VM/DB is active during the backups (btrfs 
send/receive or other), it'll still COW1 any writes during the existence 
of the btrfs snapshot.  If the backup can be scheduled during VM/DB 
downtime or at least when activity is very low, the relatively short COW1 
time should avoid serious fragmentation, but if not, even only relatively 
temporary snapshots are likely to trigger noticeable cow1 fragmentation 
issues eventually.

Some users have ameliorated that by scheduling a weekly or monthly btrfs 
defrag, reporting that cow1 issues with temporary snapshots build up 
slowly enough that the scheduled defrag effectively eliminates the 
otherwise growing problem, but it's still an additional complication to 
have to configure and administer longer term.
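
For the record, the scheduled defrag mentioned above is just a recursive
defragment run from cron or a systemd timer (path is hypothetical):

  # weekly, ideally while no snapshots of the images are being kept around
  btrfs filesystem defragment -r /srv/images
  # note: defrag is currently not snapshot-aware, so running it against
  # snapshotted files un-shares their extents and uses more space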

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: BTRFS as image store for KVM?

2015-09-17 Thread Sean Greenslade
On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
> MD+LVM is very close to what I want, but md has no way to cope with silent
> data corruption. So if I'd want to use a guest filesystem that has no
> checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the most
> obvious things to want in a software RAID setup? Is it a feature that might
> appear in the future? Maybe I should talk to the md guys...

MD is emulating hardware RAID. In hardware RAID, you are doing
work at the block level. Block-level RAID has no understanding of the
filesystem(s) running on top of it. Therefore it would have to checksum
groups of blocks, and store those checksums on the physical disks
somewhere, perhaps by keeping some portion of the drive for itself. But
then this is not very efficient, since it is maintaining checksums for
data that may be useless (blocks the FS is not currently using). So then
you might make the RAID filesystem aware...and you now have BTRFS RAID.

Simply put, the block level is probably not an appropriate place for
checksumming to occur. BTRFS can make checksumming work much more
effectively and efficiently by doing it at the filesystem level.

--Sean


RE: BTRFS as image store for KVM?

2015-09-16 Thread Paul Jones

> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Brendan Heading
> Sent: Wednesday, 16 September 2015 9:36 PM
> To: Duncan <1i5t5.dun...@cox.net>
> Cc: linux-btrfs@vger.kernel.org
> Subject: Re: BTRFS as image store for KVM?
> 
> > Btrfs has two possible solutions to work around the problem.  The
> > first one is the autodefrag mount option, which detects file
> > fragmentation during the write and queues up the affected file for a
> > defragmenting rewrite by a lower priority worker thread.  This works
> > best on the small end, because as file size increases, so does time to
> > actually write it out, and at some point, depending on the size of the
> > file and how busy the database/VM is, writes are (trying to) come in
> > faster than the file can be rewritten.  Typically, there's no problem
> > under a quarter GiB, with people beginning to notice performance
> > issues at half to 3/4 GiB, tho on fast disks and not too busy VMs/DBs
> > (which may well include your home system, depending on what you use
> > the VMs for), you might not see problems until size reaches 2 GiB or
> > so.  As such, autodefrag tends to be a very good option for firefox
> > sqlite database files, for instance, as they tend to be small enough
> > not to have issues.  But it's not going to work so well for multi-GiB VM
> images.
> 
> [unlurking for the first time]
> 
> This problem has been faced by a certain very large storage vendor whom I
> won't name, who provide an option similar to the above. Reading between
> the lines I think their approach is to try to detect which accesses are read-
> sequential, and schedule those blocks for rewriting in sequence. They also
> have a feature to run as a background job which can be scheduled to run
> during an off peak period where they can reorder entire files that are
> significantly out of sequence. I'd expect the algorithm is intelligent ie 
> there's
> no need to rewrite entire large files that are mostly sequential with a few
> out-of-order sections.
> 
> Has anyone considered these options for btrfs ? Not being able to run VMs
> on it is probably going to be a bit of a killer ..

I run VMs on BTRFS using regular consumer-grade SSDs and hardware, and it 
works great, I think.  The guests are Windows Server + MS SQL.  Not the 
most ideal workload, but I care about data integrity, so I'm willing to 
sacrifice a bit of speed for it.  Checksums have prevented countless 
corruption issues.  Although, now that I think about it, the spinning-rust 
backup disks are the only ones that have ever had any corruption; I guess 
SSDs have their own internal checksumming as well.
The speed seems quite reasonable, but the server has around 16G of RAM 
free, which I presume is being used as cache, and that seems to help.

Paul.


Re: BTRFS as image store for KVM?

2015-09-16 Thread Austin S Hemmelgarn

On 2015-09-16 07:35, Brendan Heading wrote:

>> Btrfs has two possible solutions to work around the problem.  The first
>> one is the autodefrag mount option, which detects file fragmentation
>> during the write and queues up the affected file for a defragmenting
>> rewrite by a lower priority worker thread.  This works best on the small
>> end, because as file size increases, so does time to actually write it
>> out, and at some point, depending on the size of the file and how busy
>> the database/VM is, writes are (trying to) come in faster than the file
>> can be rewritten.  Typically, there's no problem under a quarter GiB,
>> with people beginning to notice performance issues at half to 3/4 GiB,
>> tho on fast disks and not too busy VMs/DBs (which may well include your
>> home system, depending on what you use the VMs for), you might not see
>> problems until size reaches 2 GiB or so.  As such, autodefrag tends to be
>> a very good option for firefox sqlite database files, for instance, as
>> they tend to be small enough not to have issues.  But it's not going to
>> work so well for multi-GiB VM images.
>
> [unlurking for the first time]
>
> This problem has been faced by a certain very large storage vendor
> whom I won't name, who provide an option similar to the above. Reading
> between the lines I think their approach is to try to detect which
> accesses are read-sequential, and schedule those blocks for rewriting
> in sequence. They also have a feature to run as a background job which
> can be scheduled to run during an off peak period where they can
> reorder entire files that are significantly out of sequence. I'd
> expect the algorithm is intelligent ie there's no need to rewrite
> entire large files that are mostly sequential with a few out-of-order
> sections.
>
> Has anyone considered these options for btrfs ? Not being able to run
> VMs on it is probably going to be a bit of a killer ..


3 things to mention here:
1. It's perfectly possible to run VMs on BTRFS; it just takes some 
effort to get decent efficiency, and you can't really over-provision 
storage (the above-mentioned effort is to create the file with NOCOW 
set, and then use fallocate or dd to pre-allocate space for it; see the 
sketch after this list).
2. If you are using a file for the disk image, you are already 
sacrificing performance for portability, it's just a bigger tradeoff 
with BTRFS than most other filesystems on Linux.
3. Almost all of the issues that BTRFS has with VM disk images are also 
present in other filesystems, they are just much worse on BTRFS because 
of the fact that it is COW based.
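
A minimal sketch of point 1 (paths and sizes are hypothetical):

  mkdir /srv/images
  chattr +C /srv/images                  # new files inherit NOCOW
  fallocate -l 40G /srv/images/vm1.img   # pre-allocate the full image
  lsattr /srv/images/vm1.img             # should show the 'C' attribute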






Re: BTRFS as image store for KVM?

2015-09-16 Thread Brendan Heading
> Btrfs has two possible solutions to work around the problem.  The first
> one is the autodefrag mount option, which detects file fragmentation
> during the write and queues up the affected file for a defragmenting
> rewrite by a lower priority worker thread.  This works best on the small
> end, because as file size increases, so does time to actually write it
> out, and at some point, depending on the size of the file and how busy
> the database/VM is, writes are (trying to) come in faster than the file
> can be rewritten.  Typically, there's no problem under a quarter GiB,
> with people beginning to notice performance issues at half to 3/4 GiB,
> tho on fast disks and not too busy VMs/DBs (which may well include your
> home system, depending on what you use the VMs for), you might not see
> problems until size reaches 2 GiB or so.  As such, autodefrag tends to be
> a very good option for firefox sqlite database files, for instance, as
> they tend to be small enough not to have issues.  But it's not going to
> work so well for multi-GiB VM images.

[unlurking for the first time]

This problem has been faced by a certain very large storage vendor
whom I won't name, who provide an option similar to the above. Reading
between the lines I think their approach is to try to detect which
accesses are read-sequential, and schedule those blocks for rewriting
in sequence. They also have a feature to run as a background job which
can be scheduled to run during an off peak period where they can
reorder entire files that are significantly out of sequence. I'd
expect the algorithm is intelligent ie there's no need to rewrite
entire large files that are mostly sequential with a few out-of-order
sections.

Has anyone considered these options for btrfs ? Not being able to run
VMs on it is probably going to be a bit of a killer ..

regards

Brendan


Re: BTRFS as image store for KVM?

2015-09-16 Thread Paul Harvey
As others have said here, it's probably not going to work for you,
especially if you want to use regular scheduled btrfs snapshots on the
host (which I consider to be 50% of the reason why I use btrfs in the
first place).

Once I had learned this lesson the hard way, I set up a Xen server using
libvirt, configured to provision from LVM storage. I had some
preseed/kickstart/ansible recipes for VM provisioning that configured
btrfs in the guests, with appropriate scheduled snapshotting and remote
send/receive to backup hosts.

This worked well for me but it's not for everyone. I'm a btrfs fan,
but have talked a few admins out of naively using btrfs in the manner
that you've described. Particularly busy Windows HVM guests on a
file-backed image sitting on a btrfs host filesystem with regular
scheduled snapshots will deteriorate depressingly quickly.

Having said that, most of my current btrfs superstitions on this
use-case were formed around kernel 3.12, a long time ago now.

On 16 September 2015 at 07:34, Gert Menke  wrote:
> Hi everybody,
>
> first off, I'm not 100% sure if this is the right place to ask, so if it's
> not, I apologize and I'd appreciate a pointer in the right direction.
>
> I want to build a virtualization server to replace my current home server.
> I'm thinking about a Debian system with libvirt/KVM. The system will have
> one or two SSDs and five harddisks with some kind of software RAID5 for
> storage. I'd like to have a filesystem with data checksums, so BTRFS seems
> like the right way to go. However, I read that BTRFS does not perform well
> as storage for KVM disk images.
> (See here: http://www.linux-kvm.org/page/Tuning_KVM )
>
> Is this still true?
>
> I would appreciate any comments and/or tips you might have on this topic.
>
> Is anyone using BTRFS as an image store? Are there any special settings I
> should be aware of to make it work well?
>
> Thanks,
> Gert


Re: BTRFS as image store for KVM?

2015-09-15 Thread Chris Murphy
If you don't need image portability, use an LVM logical volume as the
backing for the VM. That LV gets partitioned as if it were a disk, and
inside it you can use Btrfs for root, home, data, or whatever.
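
A minimal sketch of that setup (volume group, LV and guest names are
hypothetical):

  # carve out a raw LV for the guest and hand it over as a virtio disk
  lvcreate -L 40G -n vm1 vg0
  virt-install --name vm1 --ram 2048 --vcpus 2 \
      --disk path=/dev/vg0/vm1,bus=virtio,cache=none \
      --cdrom /srv/iso/installer.iso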

If you need image portability, e.g. qcow2, then I'd put it on ext4 or
XFS, and you can use Btrfs within the VM for data integrity.

If you put the qcow2 on Btrfs, you're advised (by the FAQ and many threads
on this topic in the list archives) to set chattr +C on the image file,
which is nocow and implies nodatasum, so no checksumming. I've also run
into this problem, depending on the qemu cache setting:

https://www.marc.info/?l=linux-btrfs=142714523110528=1
https://bugzilla.redhat.com/show_bug.cgi?id=1204569


-- 
Chris Murphy


Re: BTRFS as image store for KVM?

2015-09-15 Thread Duncan
Gert Menke posted on Tue, 15 Sep 2015 23:34:04 +0200 as excerpted:

> I'm not 100% sure if this is the right place to ask[.]

It is. =:^)

> I want to build a virtualization server to replace my current home
> server. I'm thinking about a Debian system with libvirt/KVM. The system
> will have one or two SSDs and five harddisks with some kind of software
> RAID5 for storage. I'd like to have a filesystem with data checksums, so
> BTRFS seems like the right way to go. However, I read that BTRFS does
> not perform well as storage for KVM disk images.
> (See here: http://www.linux-kvm.org/page/Tuning_KVM )
> 
> Is this still true?
> 
> I would appreciate any comments and/or tips you might have on this
> topic.
> 
> Is anyone using BTRFS as an image store? Are there any special settings
> I should be aware of to make it work well?

Looks like you're doing some solid research before you deploy. =:^)

Here's the deal.  The problem is fragmentation, which is much more of an 
issue on spinning rust than it typically is on ssds, since ssds have 
effectively zero seek-time.  If you can put the VMs on those ssds you 
mentioned, not on the spinning rust, the fragmentation won't matter so 
much, and you may well not have to worry about it.

Any copy-on-write filesystem (which btrfs is) is going to have serious 
problems with a file-internal-rewrite write pattern (as contrasted with 
append, or simply rewriting the entire thing sequentially, beginning to 
end), because as various blocks are rewritten, they get written 
elsewhere, worst-case one at a time, dramatically increasing 
fragmentation -- hundreds of thousands of extents are not unheard-of with 
files in the multi-GiB size range.[1]
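
You can watch this happen with filefrag (path hypothetical):

  filefrag /srv/vms/vm1.img   # prints the current extent count
  # a freshly written image shows a handful of extents; a busy,
  # cow-rewritten image can show tens or hundreds of thousands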

The two typical problematic cases are database files and VM images (your 
case).

Btrfs has two possible solutions to work around the problem.  The first 
one is the autodefrag mount option, which detects file fragmentation 
during the write and queues up the affected file for a defragmenting 
rewrite by a lower priority worker thread.  This works best on the small 
end, because as file size increases, so does time to actually write it 
out, and at some point, depending on the size of the file and how busy 
the database/VM is, writes are (trying to) come in faster than the file 
can be rewritten.  Typically, there's no problem under a quarter GiB, 
with people beginning to notice performance issues at half to 3/4 GiB, 
tho on fast disks and not too busy VMs/DBs (which may well include your 
home system, depending on what you use the VMs for), you might not see 
problems until size reaches 2 GiB or so.  As such, autodefrag tends to be 
a very good option for firefox sqlite database files, for instance, as 
they tend to be small enough not to have issues.  But it's not going to 
work so well for multi-GiB VM images.
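
For reference, autodefrag is just a mount option; a hypothetical fstab
entry and the equivalent remount:

  UUID=<fs-uuid>  /srv/vms  btrfs  autodefrag,noatime  0  0
  mount -o remount,autodefrag /srv/vms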

The second solution, or more like workaround, for larger internal-rewrite-
pattern files, generally 1 GiB plus (so many VMs), is to use the NOCOW 
file attribute (set with chattr +C), which tells btrfs to rewrite the 
file in-place instead of using the usual copy-on-write method.  However, 
you're not going to like the side effects, as btrfs turns off both 
checksumming and transparent compression on nocow files, because there's 
serious checksum/data-it-covers write-race issues with in-place rewrite, 
and of course the rewritten data may compress better or worse than the 
old version, so rewriting a compressed copy in-place is problematic as 
well.

So setting nocow turns off checksumming, the biggest reason you're 
considering btrfs in the first place, likely making this option 
effectively unworkable for you. =:^(

Which means btrfs itself likely isn't a particularly good choice, UNLESS 
(a) your VM images are small (under a GiB, ideally under a quarter-gig, 
admittedly a pretty small VM), OR (b) your VMs are primarily reading, not 
writing, or aren't likely to be busy enough for autodefrag to be a 
problem, given the size, OR (c) you put the VM images (and thus the btrfs 
containing them) on ssd, not spinning rust.

Meanwhile, quickly tying up a couple loose ends with nocow in case you do 
decide to use it for this or some other use-case:

a) On btrfs, setting nocow on a file that's already larger than zero-size 
doesn't work as expected (cow writes can continue to occur for some 
time).  Typically the easiest way to ensure that the file is nocow before 
getting data, is to set nocow on its containing directory before the file 
is created, so new files inherit the attribute.  For existing files, set 
it on the dir and copy the file in from a different filesystem (or move 
it to say a tmpfs and back), so the file gets created with the nocow 
attribute as it is copied in.
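
A sketch of that copy-in dance for an already-existing image (paths are
hypothetical; /tmp is assumed to be a different filesystem, e.g. tmpfs):

  chattr +C /srv/images                 # containing directory is NOCOW
  mv /srv/images/vm1.img /tmp/          # move the file off the btrfs
  cp /tmp/vm1.img /srv/images/          # recreated file inherits NOCOW
  rm /tmp/vm1.img
  lsattr /srv/images/vm1.img            # verify the 'C' attribute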

b) Btrfs' snapshot feature depends on COW, locking in place the existing 
version of the file, forcing otherwise nocow files to be what I've seen 
described as cow1 -- the first write to a file block will cow it to a new 
location because the existing