On Mon, 5 Oct 2015 13:16:03 +0200 Lionel Bouton <lionel-subscript...@bouton.name> wrote:
> To better illustrate my point.
>
> According to Phoronix tests, BTRFS RAID-1 is even faster than md RAID1
> most of the time.
>
> http://www.phoronix.com/scan.php?page=article&item=btrfs_raid_mdadm&num=1
>
> The only case where md RAID1 was noticeably faster is sequential reads
> with FIO libaio.
>
> So if you base your analysis on Phoronix tests

[Oops. Actually sent to the list too this time.]

FYI...

1) It's worth noting that while I personally think Phoronix has more merit than it's sometimes given credit for, it has a rather bad rep in kernel circles, due to how its results are (claimed to be) misused in support of points they don't actually support at all, once you read those results with the tested configuration in mind.

As such, Phoronix is about the last reference you want to be using when trying to support a point with kernel folks, because, rightly or wrongly, many of them will see it and simply shut down right there, having already decided from previous experience that there's little use arguing with somebody quoting Phoronix, since they invariably don't know how to read the results in terms of what was /actually/ tested, given the independent variable and the test configuration.

Tho I personally do find quite some use in various Phoronix benchmark articles, reading them with the testing context in mind. But I definitely wouldn't pull them out to demonstrate a point to kernel folks, unless it was near the end of a list of references making a similar point, and after I'd demonstrated my own ability to keep the testing context in mind when looking at their results, because history says quoting Phoronix in support of something simply isn't likely to get me anywhere with kernel folks.

As for the specific test you referenced...

2) For one thing, the URL you pointed at was benchmarks of Intel SSDs, not spinning rust. The issues and bottlenecks for good SSDs are so entirely different than for spinning rust that it's an entirely different debate.

Among other things, good SATA-based SSDs have been fast enough for a while now that if the tests really are I/O bound, the bottleneck tends to be SATA-bus speed, not device speed (thru SATA 3.0 at 600 MB/s, anyway; SATA 3.2 aka SATA Express and M.2, at 1969 MB/s, is often fast enough to put the bottleneck back on the device). However, in many cases, because good SSDs and modern buses are so fast, the bottleneck actually ends up being CPU once again. So with good, reasonably current SSDs, CPU is likely to be the bottleneck.

Tho these aren't as current as might be expected... The devices tested here are SATA 2, 300 MB/s bus speed, and are rather dated given the Dec 2014 article date, as they're only rated 205 MB/s read, 45 MB/s write, so only read is anywhere near the (SATA 2, 300 MB/s) bus speed. Given that, write speed is likely device-bound, and while raid isn't likely to slow writes down despite writing the data multiple times, precisely because it /is/ device-bound, it's unlikely to be faster than single-device writing either.

But this thread was addressing read speed. Read speed is much closer to the bus speed and, depending on the application, particularly for raid, may well be CPU-bound. Where it's CPU-bound, because the device and bus speeds are relatively high, the multiple devices of raid aren't likely to be of much benefit at all.
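To put rough numbers on that, here's a back-of-the-envelope sketch (the throughput figures are the article's rated specs; the raid1 lines are idealized ceilings under those assumptions, not measurements):

  # Rated figures from the article, approximate:
  #   SATA 2 bus:               ~300 MB/s per port
  #   device sequential read:   ~205 MB/s
  #   device sequential write:   ~45 MB/s
  #
  # raid1 sequential write: every mirror writes the same data in parallel,
  # so the ceiling stays at the single-device ~45 MB/s regardless of the
  # number of mirrors.
  #
  # raid1 sequential read: at best the stream is spread over N mirrors,
  # but only if the scheduler actually splits it and the CPU keeps up.
  echo "2-way raid1 ideal read ceiling: $((2 * 205)) MB/s"   # ~410 MB/s
  echo "4-way raid1 ideal read ceiling: $((4 * 205)) MB/s"   # ~820 MB/s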
Meanwhile, what was the actual configuration on the devices themselves?

Here, we see that in both cases it was actually btrfs: btrfs with defaults as installed (in single-device mode, if reading between the lines) on top of md/raid of the tested level for the md/raid side, and native btrfs raid of the tested level on the native btrfs raid side.

But there's already so much that isn't known -- he says defaults where not stated, but that's still ambiguous in some cases. For instance, he does specifically state that in native mode btrfs detects the ssds and activates ssd mode, but that it doesn't do so when installed on the md/raid. So we know for sure that he took the detected-ssd (or not) defaults there.

But we do NOT know whether "btrfs native raid level" means both data and metadata, or only (presumably) data, leaving metadata at the default (which is raid1 for multi-device), or perhaps the reverse, tested-level metadata with default data (which AFAIK is raid0 for multi-device).

And in the single-device-btrfs-on-top-of-md/raid mode, with the md/raid at the tested level, we already know it didn't detect ssd and enable ssd mode, and he didn't enable it manually; what we /don't/ know for sure is how it was created at mkfs.btrfs time and whether ssd might have been detected then. If mkfs.btrfs detected ssd, it would have created single-mode metadata by default; otherwise dup-mode metadata, unless specifically told otherwise.

If he /really/ meant defaults unless stated otherwise, then the btrfs on top of md/raid was pretty obviously running with known deoptimizations compared to what one would sanely run in that case. For instance, the metadata will be dup on what btrfs sees as a single device, due to the non-detection of ssd, and for the raid1 and raid10 md levels that dup is then mirrored again by md: *4*-way again in the 4-way md/raid1 case, for 8 physical copies of every metadata block in total. And that's not even counting the effect of the missing ssd option on chunk placement. Meanwhile, on the native btrfs side, it's detecting ssd and optimizing accordingly.

So it's /very/ /possible/ that the apparently bad results you see for the btrfs-on-md/raid1 case aren't really due to md/raid1 scheduling problems as compared to btrfs native raid1 scheduling in the first place, but rather to a combination of the obviously ssd-deoptimized btrfs defaults in that mode, the relatively high per-device read speeds of ssd (even those ancient SATA 2 ssds), and the zero seek time of ssd.

The way to disprove that would obviously be to run the same set of tests, but with btrfs configured for ssd mode at runtime so as not to use the default, and with both data and metadata modes specified, presumably single for both, for the case where it's deployed over md/raid1 (see the sketch below).

And that in a thread where the previous context was spinning rust, with its definitely *not* zero seek times and generally rather slower sequential per-device I/O speeds as well. So even if there weren't massive issues with the Phoronix tests as displayed in that article (with btrfs so deoptimized in the on-md/raid case, no real conclusions about a sane configuration can be made), the fact that the article was reporting on ssds (rather old and slow for ssds, but still fast at reading compared to spinning rust), while the discussion was in the context of more traditional spinning rust, means it shouldn't have been introduced here, at least not without big caveats that it was dealing with the ssd use-case.
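Concretely, for such a re-test, something like this is what I have in mind (device names and the mountpoint are examples only; the options themselves are standard mkfs.btrfs and mount options):

  # btrfs keys its ssd detection off the rotational flag, which an md
  # array typically still reports as 1 (rotating) even when its member
  # devices are SSDs:
  cat /sys/block/sda/queue/rotational   # 0 on the raw SSD
  cat /sys/block/md0/queue/rotational   # usually 1 on the md array

  # Single-device btrfs on top of the md raid1, with single metadata
  # instead of the dup default (dup metadata over 4-way md/raid1 means
  # 8 physical copies of every metadata block), and ssd mode forced at
  # mount time since it won't be autodetected:
  mkfs.btrfs -d single -m single /dev/md0
  mount -o ssd /dev/md0 /mnt

  # And the native side, with both data and metadata pinned to raid1 so
  # there's no ambiguity about which profiles were actually tested:
  mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb

  # Confirm the data/metadata profiles actually in use:
  btrfs filesystem df /mnt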
Hmm... I think I've begun to see the kernel folks' point about people quoting Phoronix in support of their points when it's really not apropos at all. Yes, I do still consider Phoronix reports, in context, to contain useful information at some level. However, one really must be aware of what was actually tested in order to understand what the results actually mean, and unfortunately it seems most people quoting it, including here, really can't properly do so in context, and thus end up using it in support of points that simply are not supported by the evidence actually given in the Phoronix articles they're attempting to use.

--
Duncan - No HTML messages please; they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman