P. Remek posted on Mon, 09 Feb 2015 18:26:49 +0100 as excerpted:

> Hello,
> 
> I am benchmarking Btrfs and when benchmarking random writes with fio
> utility, I noticed following two things:
> 
> 1) On first run when target file doesn't exist yet, performance is about
> 8000 IOPs. On second, and every other run, performance goes up to 70000
> IOPs. It's a massive difference. The target file is the one created
> during the first run.

You say a file size of 10 GiB with a block size of 4 KiB, but you don't 
say whether you're using the autodefrag mount option, or whether you set 
nocow on the file at creation (generally done by setting it on the 
directory with chattr +C, so new files inherit the option).
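
If you're not sure what's currently in effect, a couple of quick checks 
will tell you (the filename here matches your fio job, adjust as needed):

  grep btrfs /proc/mounts   # active mount options (autodefrag, compress, commit=, ...)
  lsattr test9              # a 'C' in the attribute column means nocow is set on the file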

What I /suspect/ is happening is that at the 10 GiB file size, on 
original file creation, btrfs is creating a large file of several 
comparatively large extents (possibly 1 GiB each, the nominal data chunk 
size, tho it can be larger on large enough filesystems).  Note that btrfs 
will normally wait to sync, accumulating further writes to the file 
before actually writing it out.  By default the commit interval is 30 
seconds, but there's a mount option to change it.  So btrfs is probably 
waiting, then writing out all changes for the last 30 seconds at once, 
allowing it to use fairly large extents when it does so.
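
That interval is the commit= mount option, for instance (the mountpoint 
here is just an example):

  # default is commit=30 (seconds); a longer interval batches more
  # writes per commit, a shorter one flushes more often
  mount -o remount,commit=60 /mnt/btrfs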

Then when the file already exists, keeping in mind that btrfs is COW 
(copy-on-write) and that by default it keeps two copies of metadata (dup 
on a single device, or one copy each on two separate devices on a multi-
device filesystem) and one copy of data (single on a single device, I 
believe raid0 on multi-device), it's having to COW individual 4 KiB 
blocks within the file as they are rewritten.
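
You can see which profiles your filesystem actually ended up with 
(again, the mountpoint is just an example):

  # typical single-device output: Data, single / Metadata, DUP / System, DUP
  btrfs filesystem df /mnt/btrfs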

This is going to massively fragment the file, driving up IOPs 
tremendously.  On top of that, each time a data fragment is written, 
there are going to be two metadata updates due to the dup/raid1 metadata 
default, and while they won't be written out immediately, every commit 
(30 seconds) those metadata changes are going to replicate up the 
metadata tree to its root.

So instead of having a few orderly GiB-ish size extents written, along 
with their metadata, as at file-create, now you're writing a new extent 
for each changed 4 KiB block, plus 2X metadata updates for each one, 
plus, every commit, the updated metadata chain up to the root.

Those 70K IOPs are all the extra work the filesystem is doing in order 
to track those 4 KiB COWed writes!
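
You can actually watch the fragmentation happen; filefrag works on btrfs 
via FIEMAP, so running it on the test file before and after a random-
write pass should show the extent count exploding:

  # -v lists every extent; the summary line gives the total count
  filefrag -v test9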

The autodefrag option will likely increase this even further, as it 
doesn't prevent the COWs, but instead queues up any files it detects as 
fragmented for later cleanup by the autodefrag worker thread.  This is 
one reason the option isn't recommended for large (say quarter- to half-
gig-plus) heavy-internal-rewrite-pattern use-cases (typically VM images 
or large database files), tho it works quite well for files up to a 
couple hundred MiB or so (typical of firefox sqlite database files, 
etc), since those get rewritten by the worker pretty fast.
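
For the big files you can still defrag manually at a time of your own 
choosing instead of letting autodefrag queue them up, something like 
(the path is only illustrative):

  # rewrite fragmented files found under the given path
  btrfs filesystem defragment -r /mnt/btrfs/vm-images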

The nocow file attribute can be used on these larger files, but it does 
have additional implications.  Nocow turns off btrfs compression for that 
file, if you had it enabled (mount option), and also turns off 
checksumming.  Turning off checksumming means btrfs will no longer detect 
corruption in that file.  But many databases and VM tools have their own 
corruption detection and possibly correction schemes already, since they 
have to cope with filesystems such as ext* that don't have builtin 
checksumming, so turning off btrfs checksumming and error detection for 
these files isn't as bad as it might seem, and in many cases it avoids 
the filesystem duplicating work the application is already doing.  (Also, 
on btrfs, nocow must be set at file creation, while the file is still 
zero-sized.  As mentioned above, this is usually accomplished by setting 
it on the directory and letting new files and subdirs inherit the 
attribute.)
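
In practice the setup looks something like this (the directory name is 
just an example):

  mkdir fio-nocow
  chattr +C fio-nocow    # new files created inside inherit nocow
  lsattr -d fio-nocow    # should show the 'C' attribute on the directory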

But with the nocow file attribute properly applied, these random rewrites 
will be done in-place, no cascading fragmentation and metadata updates, 
and my guess is that you'll see the IOPs on existing nocow files reduce 
to something far more sane as a result.
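
In other words, repeating your exact fio run with the target file 
created inside such a nocow directory should make for a much fairer 
comparison:

  fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
      --name=test9 --filename=fio-nocow/test9 --bs=4k --iodepth=256 \
      --size=10G --numjobs=1 --readwrite=randwrite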

> 2) There are windows during the test where IOPs drop to 0 and stay at 0
> for about 10 seconds and then it goes back again, and after a couple of
> seconds drops to 0 again. This is reproducible 100% of the time.

I recall this periodic behavior coming up in at least one earlier thread 
as well, but I'm not a dev, just a btrfs user and list regular, and I 
don't recall what the explanation was, unless it was related to internal 
btrfs bookkeeping due to that 30-second commit cycle I mentioned above.
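
If you want to see whether those stalls line up with commit/writeback 
activity, watching the disks and the dirty-page counters alongside the 
fio run might help (standard tools, nothing btrfs-specific):

  iostat -x 1                                          # per-device utilization, second by second
  watch -n1 "grep -E 'Dirty|Writeback' /proc/meminfo"  # pages waiting to be written out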

But I'm guessing that if you properly set nocow on the file, you'll 
probably see this go away as well, since you won't be overwhelming btrfs 
and the hardware with IOPs any longer.

Perhaps someone with a better understanding of the situation will jump in 
and explain this bit better than I can...

> Can somebody shed some light on what's happening?
> 
> 
> Command: fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> --name=test9 --filename=test9 --bs=4k --iodepth=256 --size=10G
> --numjobs=1 --readwrite=randwrite
> 
> Environment:
> CPU: dual socket: E5-2630 v2
> RAM: 32 GB ram
> OS: Ubuntu server 14.10
> Kernel: 3.19.0-031900rc2-generic
> btrfs tools: Btrfs v3.14.1
> 2x LSI 9300 HBAs - SAS3 12 Gb/s
> 8x SSD Ultrastar SSD1600MM 400GB SAS3 12 Gb/s

I suppose you're already aware that you're running a rather outdated 
userspace/btrfs-progs (which is what I assume you mean by tools).  
Userspace versions sync with the kernel cycle, with a particular 3.x.0 
version typically being released a couple weeks after the kernel of the 
same version, usually with a couple of 3.x.y update releases following 
before the next kernel-synced x-version bump.

So userspace/progs v3.19.0 isn't out yet (tho rc2 is available), but 
3.18.2 is current, well beyond your 3.14.1.
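
Easy enough to confirm what you're actually running:

  uname -r           # running kernel
  btrfs --version    # installed btrfs-progs / userspace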

FWIW, a current kernel is most important during normal operation, as the 
userspace simply tells the kernel what to do at a high level and the 
kernel follows thru with its lower level code.  So for normal operation, 
userspace getting a bit behind isn't a major issue unless you want a 
feature only available in a newer version.

But if something goes wrong and you're trying to diagnose and repair from 
userspace, THAT is when the userspace low-level code is run, and thus 
when userspace version becomes vitally important.

So as long as nothing's going wrong, you're probably OK with that 3.14 
userspace.  But I'd still recommend updating to current and keeping 
current, because you don't want to be scrambling to build a newer 
userspace after something goes wrong, in order to have the best chance 
at recovery.

Kudos on having a current kernel, at least.  There have been quite a few 
kernel bugs fixed since the 3.14 era, and since the kernel is what 
matters operationally, running a current one means you at least aren't 
needlessly risking the known bugs of the older ones. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
