Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-27 Thread Brian Foster
On Wed, Aug 26, 2020 at 08:34:32PM +0200, Alberto Garcia wrote:
> On Tue 25 Aug 2020 09:47:24 PM CEST, Brian Foster  wrote:
> > My fio fallocates the entire file by default with this command. Is that
> > the intent of this particular test? I added --fallocate=none to my test
> > runs to incorporate the allocation cost in the I/Os.
> 
> That wasn't intentional, you're right, it should use --fallocate=none (I
> don't see a big difference in my test anyway).
> 
> >> The Linux version is 4.19.132-1 from Debian.
> >
> > Thanks. I don't have LUKS in the mix on my box, but I was running on a
> > more recent kernel (Fedora 5.7.15-100). I threw v4.19 on the box and
> > saw a bit more of a delta between XFS (~14k iops) and ext4 (~24k). The
> > same test shows ~17k iops for XFS and ~19k iops for ext4 on v5.7. If I
> > increase the size of the LVM volume from 126G to >1TB, ext4 runs at
> > roughly the same rate and XFS closes the gap to around ~19k iops as
> > well. I'm not sure what might have changed since v4.19, but care to
> > see if this is still an issue on a more recent kernel?
> 
> Ok, I gave 5.7.10-1 a try but I still get similar numbers.
> 

Strange.

> Perhaps with a larger filesystem there would be a difference? I don't
> know.
> 

Perhaps. I believe Dave mentioned earlier how log size might affect
things.

I created a 125GB lvm volume and see slight deltas in iops going from
testing directly on the block device, to a fully allocated file on
XFS/ext4 and then to a preallocated file on XFS/ext4. In both cases the
numbers are comparable between XFS and ext4. On XFS, I can reproduce a
serious drop in iops if I reduce the default ~64MB log down to 8MB.
Perhaps you could try increasing your log ('-l size=...' at mkfs time)
and see if that changes anything?
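
Something along these lines, purely as an example (the device, mount point
and log size are placeholders for your setup, and mkfs of course destroys
the existing filesystem):

  # recreate the fs with a larger internal log, e.g. 256M instead of ~64M
  mkfs.xfs -f -l size=256m /dev/vg/test
  mount /dev/vg/test /mnt/test
  # the log line of xfs_info should now show blocks=65536 (256M / 4k)
  xfs_info /mnt/test | grep '^log'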

Beyond that, I'd probably try to normalize and simplify your storage
stack if you wanted to narrow it down further. E.g., clean format the
same bdev for XFS and ext4 and pull out things like LUKS just to rule
out any poor interactions.
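
E.g., something like the following, where /dev/sdX and /mnt/test stand in
for whatever device you can dedicate to the test (this wipes the device):

  # XFS pass
  mkfs.xfs -f /dev/sdX
  mount /dev/sdX /mnt/test
  # ... run the same fio job as before against /mnt/test/file.raw ...
  umount /mnt/test

  # ext4 pass on the very same device
  mkfs.ext4 -F /dev/sdX
  mount /dev/sdX /mnt/test
  # ... run the same fio job ...
  umount /mnt/test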

Brian

> Berto
> 




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-25 Thread Brian Foster
On Tue, Aug 25, 2020 at 07:18:19PM +0200, Alberto Garcia wrote:
> On Tue 25 Aug 2020 06:54:15 PM CEST, Brian Foster wrote:
> > If I compare this 5m fio test between XFS and ext4 on a couple of my
> > systems (with either no prealloc or full file prealloc), I end up seeing
> > ext4 run slightly faster on my vm and XFS slightly faster on bare metal.
> > Either way, I don't see that huge disparity where ext4 is 5-6 times
> > faster than XFS. Can you describe the test, filesystem and storage in
> > detail where you observe such a discrepancy?
> 
> Here's the test:
> 
> fio --filename=/path/to/file.raw --direct=1 --randrepeat=1 \
> --eta=always --ioengine=libaio --iodepth=32 --numjobs=1 \
> --name=test --size=25G --io_limit=25G --ramp_time=0 \
> --rw=randwrite --bs=4k --runtime=300 --time_based=1
> 

My fio fallocates the entire file by default with this command. Is that
the intent of this particular test? I added --fallocate=none to my test
runs to incorporate the allocation cost in the I/Os.
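
For reference, that's just your command line from above with the flag
appended, i.e. roughly:

  fio --filename=/path/to/file.raw --direct=1 --randrepeat=1 \
      --eta=always --ioengine=libaio --iodepth=32 --numjobs=1 \
      --name=test --size=25G --io_limit=25G --ramp_time=0 \
      --rw=randwrite --bs=4k --runtime=300 --time_based=1 \
      --fallocate=none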

> The size of the XFS filesystem is 126 GB and it's almost empty, here's
> the xfs_info output:
> 
> meta-data=/dev/vg/test       isize=512    agcount=4, agsize=8248576 blks
>          =                   sectsz=512   attr=2, projid32bit=1
>          =                   crc=1        finobt=1, sparse=1, rmapbt=0
>          =                   reflink=0
> data     =                   bsize=4096   blocks=32994304, imaxpct=25
>          =                   sunit=0      swidth=0 blks
> naming   =version 2          bsize=4096   ascii-ci=0, ftype=1
> log      =internal log       bsize=4096   blocks=16110, version=2
>          =                   sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none               extsz=4096   blocks=0, rtextents=0
> 
> The size of the ext4 filesystem is 99GB, of which 49GB are free (that
> is, without the file used in this test). The filesystem uses 4KB
> blocks, a 128M journal and these features:
> 
> Filesystem revision #:1 (dynamic)
> Filesystem features:  has_journal ext_attr resize_inode dir_index
>   filetype needs_recovery extent flex_bg
>   sparse_super large_file huge_file uninit_bg
>   dir_nlink extra_isize
> Filesystem flags: signed_directory_hash
> Default mount options:user_xattr acl
> 
> In both cases I'm using LVM on top of LUKS and the hard drive is a
> Samsung SSD 850 PRO 1TB.
> 
> The Linux version is 4.19.132-1 from Debian.
> 

Thanks. I don't have LUKS in the mix on my box, but I was running on a
more recent kernel (Fedora 5.7.15-100). I threw v4.19 on the box and saw
a bit more of a delta between XFS (~14k iops) and ext4 (~24k). The same
test shows ~17k iops for XFS and ~19k iops for ext4 on v5.7. If I
increase the size of the LVM volume from 126G to >1TB, ext4 runs at
roughly the same rate and XFS closes the gap to around ~19k iops as
well. I'm not sure what might have changed since v4.19, but care to see
if this is still an issue on a more recent kernel?

Brian

> Berto
> 




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-25 Thread Brian Foster
On Tue, Aug 25, 2020 at 02:24:58PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 07:02:32 PM CEST, Brian Foster wrote:
> >> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> >> of data in order to let performance settle, but if I remove that I can
> >> see the effect more clearly. I can observe it with raw files (in 'off'
> >> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
> >> preallocation=off the performance is stable during the whole test.
> >
> > That's interesting. I ran your fio command (without --ramp_time and
> > with --runtime=5m) against a file on XFS (so no qcow2, no zero_range)
> > once with a sparse file with a 64k extent size hint and again with a
> > fully preallocated 25GB file and I saw similar results in terms of the
> > delta.  This was just against an SSD backed vdisk in my local dev VM,
> > but I saw ~5800 iops for the full preallocation test and ~6200 iops
> > with the extent size hint.
> >
> > I do notice an initial iops burst as described for both tests, so I
> > switched to use a 60s ramp time and 60s runtime. With that longer ramp
> > up time, I see ~5000 iops with the 64k extent size hint and ~5500 iops
> > with the full 25GB prealloc. Perhaps the unexpected performance delta
> > with qcow2 is similarly transient towards the start of the test and
> > the runtime is short enough that it skews the final results..?
> 
> I also tried running directly against a file on xfs (no qcow2, no VMs)
> but it doesn't really matter whether I use --ramp_time=5 or 60.
> 
> Here are the results:
> 
> |---+---+---|
> | preallocation |   xfs |  ext4 |
> |---+---+---|
> | off   |  7277 | 43260 |
> | fallocate |  7299 | 42810 |
> | full  | 88404 | 83197 |
> |---+---+---|
> 
> I ran the first case (no preallocation) for 5 minutes and I said there's
> a peak during the first 5 seconds, but then the number remains under 10k
> IOPS for the rest of the 5 minutes.
> 

I don't think we're talking about the same thing. I was referring to the
difference between full file preallocation and the extent size hint in
XFS, and how the latter was faster with the shorter ramp time but that
swapped around when the test ramped up for longer. Here, it looks like
you're comparing XFS to ext4 writing direct to a file..

If I compare this 5m fio test between XFS and ext4 on a couple of my
systems (with either no prealloc or full file prealloc), I end up seeing
ext4 run slightly faster on my vm and XFS slightly faster on bare metal.
Either way, I don't see that huge disparity where ext4 is 5-6 times
faster than XFS. Can you describe the test, filesystem and storage in
detail where you observe such a discrepancy?

Brian




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-21 Thread Brian Foster
On Fri, Aug 21, 2020 at 02:12:32PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:42:52 PM CEST, Alberto Garcia wrote:
> > On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster wrote:
> >>> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >>> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >>> > 
> >>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >>> > of the cluster with zeroes.
> >>> > 
> >>> > 3) metadata: all clusters were allocated when the image was created
> >>> > but they are sparse, QEMU only writes the 4KB of data.
> >>> > 
> >>> > 4) falloc: all clusters were allocated with fallocate() when the image
> >>> > was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > 5) full: all clusters were allocated by writing zeroes to all of them
> >>> > when the image was created, QEMU only writes 4KB of data.
> >>> > 
> >>> > As I said in a previous message I'm not familiar with xfs, but the
> >>> > parts that I don't understand are
> >>> > 
> >>> >- Why is (4) slower than (1)?
> >>> 
> >>> Because fallocate() is a full IO serialisation barrier at the
> >>> filesystem level. If you do:
> >>> 
> >>> fallocate(whole file)
> >>> <IO>
> >>> <IO>
> >>> <IO>
> >>> .
> >>> 
> >>> The IO can run concurrent and does not serialise against anything in
> >>> the filesystem except unwritten extent conversions at IO completion
> >>> (see answer to next question!)
> >>> 
> >>> However, if you just use (4) you get:
> >>> 
> >>> falloc(64k)
> >>>   
> >>>   
> >>> <4k io>
> >>>   
> >>> falloc(64k)
> >>>   
> >>>   
> >>>   <4k IO completes, converts 4k to written>
> >>>   
> >>> <4k io>
> >>> falloc(64k)
> >>>   
> >>>   
> >>>   <4k IO completes, converts 4k to written>
> >>>   
> >>> <4k io>
> >>>   
> >>> 
> >>
> >> Option 4 is described above as initial file preallocation whereas
> >> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> >> is reporting that the initial file preallocation mode is slower than
> >> the per cluster prealloc mode. Berto, am I following that right?
> 
> After looking more closely at the data I can see that there is a peak of
> ~30K IOPS during the first 5 or 6 seconds and then it suddenly drops to
> ~7K for the rest of the test.
> 
> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> of data in order to let performance settle, but if I remove that I can
> see the effect more clearly. I can observe it with raw files (in 'off'
> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2 and
> preallocation=off the performance is stable during the whole test.
> 

That's interesting. I ran your fio command (without --ramp_time and with
--runtime=5m) against a file on XFS (so no qcow2, no zero_range) once
with a sparse file with a 64k extent size hint and again with a fully
preallocated 25GB file and I saw similar results in terms of the delta.
This was just against an SSD backed vdisk in my local dev VM, but I saw
~5800 iops for the full preallocation test and ~6200 iops with the
extent size hint.

I do notice an initial iops burst as described for both tests, so I
switched to use a 60s ramp time and 60s runtime. With that longer ramp
up time, I see ~5000 iops with the 64k extent size hint and ~5500 iops
with the full 25GB prealloc. Perhaps the unexpected performance delta
with qcow2 is similarly transient towards the start of the test and the
runtime is short enough that it skews the final results..?
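
For reference, the two file layouts can be set up with something like the
following (paths are arbitrary; the preallocated file here uses fallocate,
i.e. unwritten extents, rather than written zeroes):

  # sparse file with a 64k extent size hint
  xfs_io -f -c "extsize 64k" /mnt/test/sparse.raw

  # fully preallocated 25GB file
  xfs_io -f -c "falloc 0 25g" /mnt/test/prealloc.raw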

Brian

> Berto
> 




Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-21 Thread Brian Foster
On Fri, Aug 21, 2020 at 01:42:52PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster  wrote:
> >> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >> > 
> >> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >> > of the cluster with zeroes.
> >> > 
> >> > 3) metadata: all clusters were allocated when the image was created
> >> > but they are sparse, QEMU only writes the 4KB of data.
> >> > 
> >> > 4) falloc: all clusters were allocated with fallocate() when the image
> >> > was created, QEMU only writes 4KB of data.
> >> > 
> >> > 5) full: all clusters were allocated by writing zeroes to all of them
> >> > when the image was created, QEMU only writes 4KB of data.
> >> > 
> >> > As I said in a previous message I'm not familiar with xfs, but the
> >> > parts that I don't understand are
> >> > 
> >> >- Why is (4) slower than (1)?
> >> 
> >> Because fallocate() is a full IO serialisation barrier at the
> >> filesystem level. If you do:
> >> 
> >> fallocate(whole file)
> >> <IO>
> >> <IO>
> >> <IO>
> >> .
> >> 
> >> The IO can run concurrent and does not serialise against anything in
> >> the filesystem except unwritten extent conversions at IO completion
> >> (see answer to next question!)
> >> 
> >> However, if you just use (4) you get:
> >> 
> >> falloc(64k)
> >>   
> >>   
> >> <4k io>
> >>   
> >> falloc(64k)
> >>   
> >>   
> >>   <4k IO completes, converts 4k to written>
> >>   
> >> <4k io>
> >> falloc(64k)
> >>   
> >>   
> >>   <4k IO completes, converts 4k to written>
> >>   
> >> <4k io>
> >>   
> >> 
> >
> > Option 4 is described above as initial file preallocation whereas
> > option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> > is reporting that the initial file preallocation mode is slower than
> > the per cluster prealloc mode. Berto, am I following that right?
> 
> Option (1) means that no qcow2 cluster is allocated at the beginning of
> the test so, apart from updating the relevant qcow2 metadata, each write
> request clears the cluster first (with fallocate(ZERO_RANGE)) then
> writes the requested 4KB of data. Further writes to the same cluster
> don't need changes on the qcow2 metadata so they go directly to the area
> that was cleared with fallocate().
> 
> Option (4) means that all clusters are allocated when the image is
> created and they are initialized with fallocate() (actually with
> posix_fallocate() now that I read the code, I suppose it's the same for
> xfs?). Only after that the test starts. All write requests are simply
> forwarded to the disk, there is no need to touch any qcow2 metadata nor
> do anything else.
> 

Ok, I think that's consistent with what I described above (sorry, I find
the preallocation mode names rather confusing so I was trying to avoid
using them). Have you confirmed that posix_fallocate() in this case
translates directly to fallocate()? I suppose that's most likely the
case, otherwise you'd see numbers more like with preallocation=full
(file preallocated via writing zeroes).
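
One quick way to check, assuming the image is created with qemu-img on the
command line, is to watch which syscalls are actually issued, e.g.:

  # glibc's posix_fallocate() should show up as fallocate() calls here; a
  # flood of pwrite64() calls would mean it fell back to writing data
  strace -f -e trace=fallocate,pwrite64 \
      qemu-img create -f qcow2 -o preallocation=falloc image.qcow2 25G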

> And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
> more IOPS.
> 
> I just ran the tests with aio=native and with a raw image instead of
> qcow2, here are the results:
> 
> qcow2:
> |--+-+|
> | preallocation| aio=threads | aio=native |
> |--+-+|
> | off  |8139 |   7649 |
> | off (w/o ZERO_RANGE) |2965 |   2779 |
> | metadata |7768 |   8265 |
> | falloc   |7742 |   7956 |
> | full |   41389 |  56668 |
> |--+-+|
> 

So it seems that Dave's suggestion to use native aio produced more
predictable results, with full file prealloc being a bit faster than per
cluster prealloc. Not sure why that isn't the case with aio=threads. I
was wondering if perhaps the threading affects something indirectly like
the qcow2 metadata allocation itself, but I guess that would be
inconsistent with ext4 showing a notable jump from

Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-21 Thread Brian Foster
On Fri, Aug 21, 2020 at 07:58:11AM +1000, Dave Chinner wrote:
> On Thu, Aug 20, 2020 at 10:03:10PM +0200, Alberto Garcia wrote:
> > Cc: linux-xfs
> > 
> > On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> > > In any event, if you're seeing unclear or unexpected performance
> > > deltas between certain XFS configurations or other fs', I think the
> > > best thing to do is post a more complete description of the workload,
> > > filesystem/storage setup, and test results to the linux-xfs mailing
> > > list (feel free to cc me as well). As it is, aside from the questions
> > > above, it's not really clear to me what the storage stack looks like
> > > for this test, if/how qcow2 is involved, what the various
> > > 'preallocation=' modes actually mean, etc.
> > 
> > (see [1] for a bit of context)
> > 
> > I repeated the tests with a larger (125GB) filesystem. Things are a bit
> > faster but not radically different, here are the new numbers:
> > 
> > |--+---+---|
> > | preallocation mode   |   xfs |  ext4 |
> > |--+---+---|
> > | off  |  8139 | 11688 |
> > | off (w/o ZERO_RANGE) |  2965 |  2780 |
> > | metadata |  7768 |  9132 |
> > | falloc   |  7742 | 13108 |
> > | full | 41389 | 16351 |
> > |--+---+---|
> > 
> > The numbers are I/O operations per second as reported by fio, running
> > inside a VM.
> > 
> > The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
> > 2.16-1. I'm using QEMU 5.1.0.
> > 
> > fio is sending random 4KB write requests to a 25GB virtual drive, this
> > is the full command line:
> > 
> > fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
> > --ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
> > --io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
> >   
> > The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
> > the host (on an xfs or ext4 filesystem as the table above shows), and
> > it is attached to QEMU using a virtio-blk-pci device:
> > 
> >-drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M
> 
> You're not using AIO on this image file, so it can't do
> concurrent IO? what happens when you add "aio=native" to this?
> 
> > cache=none means that the image is opened with O_DIRECT and
> > l2-cache-size is large enough so QEMU is able to cache all the
> > relevant qcow2 metadata in memory.
> 
> What happens when you just use a sparse file (i.e. a raw image) with
> aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
> sparse files so using qcow2 to provide sparse image file support is
> largely an unnecessary layer of indirection and overhead...
> 
> And with XFS, you don't need qcow2 for snapshots either because you
> can use reflink copies to take an atomic copy-on-write snapshot of
> the raw image file... (assuming you made the xfs filesystem with
> reflink support (which is the TOT default now)).
> 
> I've been using raw sparse files on XFS for all my VMs for over a
> decade now, and using reflink to create COW copies of golden
> image files when deploying new VMs for a couple of years now...
> 
> > The host is running Linux 4.19.132 and has an SSD drive.
> > 
> > About the preallocation modes: a qcow2 file is divided into clusters
> > of the same size (64KB in this case). That is the minimum unit of
> > allocation, so when writing 4KB to an unallocated cluster QEMU needs
> > to fill the other 60KB with zeroes. So here's what happens with the
> > different modes:
> 
> Which is something that sparse files on filesystems do not need to
> do. If, on XFS, you really want 64kB allocation clusters, use an
> extent size hint of 64kB. Though for image files, I highly recommend
> using 1MB or larger extent size hints.
> 
> 
> > 1) off: for every write request QEMU initializes the cluster (64KB)
> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> > 
> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> > of the cluster with zeroes.
> > 
> > 3) metadata: all clusters were allocated when the image was created
> > but they are sparse, QEMU only writes the 4KB of data.
> > 
> > 4) falloc: all clusters were allocated with fallocate() when the image
> > was created, QEMU only writes 4KB of data.
> > 
> > 5) full: all clusters were allocated by writing

Re: [PATCH 0/1] qcow2: Skip copy-on-write when allocating a zero cluster

2020-08-19 Thread Brian Foster
On Wed, Aug 19, 2020 at 05:07:11PM +0200, Kevin Wolf wrote:
> On 19.08.2020 at 16:25, Alberto Garcia wrote:
> > On Mon 17 Aug 2020 05:53:07 PM CEST, Kevin Wolf wrote:
> > >> > Or are you saying that ZERO_RANGE + pwrite on a sparse file (=
> > >> > cluster allocation) is faster for you than just the pwrite alone (=
> > >> > writing to already allocated cluster)?
> > >> 
> > >> Yes, 20% faster in my tests (4KB random writes), but in the latter
> > >> case the cluster is already allocated only at the qcow2 level, not on
> > >> the filesystem. preallocation=falloc is faster than
> > >> preallocation=metadata (preallocation=off sits in the middle).
> > >
> > > Hm, this feels wrong. Doing more operations should never be faster
> > > than doing less operations.
> > >
> > > Maybe the difference is in allocating 64k at once instead of doing a
> > > separate allocation for every 4k block? But with the extent size hint
> > > patches to file-posix, we should allocate 1 MB at once by default now
> > > (if your test image was newly created). Can you check whether this is
> > > in effect for your image file?
> > 
> > I checked with xfs on my computer. I'm not very familiar with that
> > filesystem so I was using the default options and I didn't tune
> > anything.
> > 
> > What I got with my tests (using fio):
> > 
> > - Using extent_size_hint didn't make any difference in my test case (I
> >   do see a clear difference however with the test case described in
> >   commit ffa244c84a).
> 
> Hm, interesting. What is your exact fio configuration? Specifically,
> which iodepth are you using? I guess with a low iodepth (and O_DIRECT),
> the effect of draining the queue might not be as visible.
> 
> > - preallocation=off is still faster than preallocation=metadata.
> 
> Brian, can you help us here with some input?
> 
> Essentially what we're having here is a sparse image file on XFS that is
> opened with O_DIRECT (presumably - Berto, is this right?), and Berto is
> seeing cases where a random write benchmark is faster if we're doing the
> 64k ZERO_RANGE + 4k pwrite when touching a 64k cluster for the first
> time compared to always just doing the 4k pwrite. This is with a 1 MB
> extent size hint.
> 

Which is with the 1MB extent size hint? Both, or just the non-ZERO_RANGE
test? A quick test on a vm shows that a 1MB extent size hint widens a
smaller zero range request to the hint. Just based on that, I guess I
wouldn't expect much difference between the tests in the former case
(extra syscall overhead perhaps?) since they'd both be doing 1MB extent
allocs and 4k dio writes. If the hint is only active in the latter case,
then I suppose you'd be comparing 64k unwritten allocs + 4k writes vs.
1MB unwritten allocs + 4k writes.
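
Something along the lines of the following xfs_io sequence shows that
(file name is arbitrary); with the 1MB hint set, fiemap reports the 64k
zero range rounded up to a ~1MB unwritten extent:

  xfs_io -f -c "extsize 1m" -c "fzero 0 64k" -c "fiemap -v" /tmp/hintfile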

I also see that Berto noted in a followup email that the XFS filesystem
is close to full, which can have a significant effect on block
allocation performance. I'd strongly recommend not testing for
performance under low free space conditions.


> From the discussions we had the other day [1][2] I took away that your
> suggestion is that we should not try to optimise things with
> fallocate(), but just write the areas we really want to write and let
> the filesystem deal with the sparse parts. Especially with the extent
> size hint that we're now setting, I'm surprised to hear that doing a
> ZERO_RANGE first still seems to improve the performance.
> 
> Do you have any idea why this is happening and what we should be doing
> with this?
> 

Note that I'm just returning from several weeks of leave so my memory is
a bit fuzzy, but I thought the previous issues were around performance
associated with fragmentation caused by doing such small allocations
over time, not necessarily finding the most performant configuration
according to a particular I/O benchmark.

In any event, if you're seeing unclear or unexpected performance deltas
between certain XFS configurations or other fs', I think the best thing
to do is post a more complete description of the workload,
filesystem/storage setup, and test results to the linux-xfs mailing list
(feel free to cc me as well). As it is, aside from the questions above,
it's not really clear to me what the storage stack looks like for this
test, if/how qcow2 is involved, what the various 'preallocation=' modes
actually mean, etc.

Brian

> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1850660
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666864
> 
> >   If I disable handle_alloc_space() (so there is no ZERO_RANGE used)
> >   then it is much slower.
> 
> This makes some sense because then we're falling back to writing
> explicit zero buffers (unless you disabled that, too).
> 
> > - With preallocation=falloc I get the same results as with
> >   preallocation=metadata.
> 
> Interesting, this means that the fallocate() call costs you basically no
> time. I would have expected preallocation=falloc to be a little faster.
> 
> > - preallocation=full is the fastest by far.