On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfos...@redhat.com> wrote:
>> > 1) off: for every write request QEMU initializes the cluster (64KB)
>> >    with fallocate(ZERO_RANGE) and then writes the 4KB of data.
>> >
>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
>> >    of the cluster with zeroes.
>> >
>> > 3) metadata: all clusters were allocated when the image was created
>> >    but they are sparse, QEMU only writes the 4KB of data.
>> >
>> > 4) falloc: all clusters were allocated with fallocate() when the image
>> >    was created, QEMU only writes 4KB of data.
>> >
>> > 5) full: all clusters were allocated by writing zeroes to all of them
>> >    when the image was created, QEMU only writes 4KB of data.
>> >
>> > As I said in a previous message I'm not familiar with xfs, but the
>> > parts that I don't understand are
>> >
>> >    - Why is (4) slower than (1)?
>>
>> Because fallocate() is a full IO serialisation barrier at the
>> filesystem level. If you do:
>>
>> fallocate(whole file)
>> <IO>
>> <IO>
>> <IO>
>> .....
>>
>> The IO can run concurrent and does not serialise against anything in
>> the filesystem except unwritten extent conversions at IO completion
>> (see answer to next question!)
>>
>> However, if you just use (4) you get:
>>
>> falloc(64k)
>>   <wait for inflight IO to complete>
>>   <allocates 64k as unwritten>
>> <4k io>
>>   ....
>> falloc(64k)
>>   <wait for inflight IO to complete>
>>   ....
>>   <4k IO completes, converts 4k to written>
>>   <allocates 64k as unwritten>
>> <4k io>
>> falloc(64k)
>>   <wait for inflight IO to complete>
>>   ....
>>   <4k IO completes, converts 4k to written>
>>   <allocates 64k as unwritten>
>> <4k io>
>>   ....
>>
>
> Option 4 is described above as initial file preallocation whereas
> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> is reporting that the initial file preallocation mode is slower than
> the per cluster prealloc mode. Berto, am I following that right?

Option (1) means that no qcow2 cluster is allocated at the beginning of
the test so, apart from updating the relevant qcow2 metadata, each write
request clears the cluster first (with fallocate(ZERO_RANGE)) then
writes the requested 4KB of data. Further writes to the same cluster
don't need changes on the qcow2 metadata so they go directly to the area
that was cleared with fallocate().

Option (4) means that all clusters are allocated when the image is
created and they are initialized with fallocate() (actually with
posix_fallocate() now that I read the code, I suppose it's the same for
xfs?). Only after that the test starts. All write requests are simply
forwarded to the disk, there is no need to touch any qcow2 metadata nor
do anything else.

And yes, (4) is a bit slower than (1) in my tests. On ext4 I get 10%
more IOPS.

I just ran the tests with aio=native and with a raw image instead of
qcow2, here are the results:

qcow2:
|----------------------+-------------+------------|
| preallocation        | aio=threads | aio=native |
|----------------------+-------------+------------|
| off                  |        8139 |       7649 |
| off (w/o ZERO_RANGE) |        2965 |       2779 |
| metadata             |        7768 |       8265 |
| falloc               |        7742 |       7956 |
| full                 |       41389 |      56668 |
|----------------------+-------------+------------|

raw:
|---------------+-------------+------------|
| preallocation | aio=threads | aio=native |
|---------------+-------------+------------|
| off           |        7647 |       7928 |
| falloc        |        7662 |       7856 |
| full          |       45224 |      58627 |
|---------------+-------------+------------|

A qcow2 file with preallocation=metadata is more or less similar to a
sparse raw file (and the numbers are indeed similar).

preallocation=off on qcow2 does not have an equivalent on raw files.

Berto
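
For reference, the per-cluster pattern behind options (1) and (2) can be
sketched roughly as below. This is only an illustration of the syscall
sequence, not the QEMU code: write_to_new_cluster() and the two size
macros are hypothetical names, the qcow2 metadata update is left out,
and the fallback zeroes the whole cluster before writing the data,
which is a simplification of what option (2) does.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CLUSTER_SIZE (64 * 1024) /* qcow2 cluster size used in the tests */
#define REQUEST_SIZE (4 * 1024)  /* size of each write request */

/*
 * Hypothetical helper: initialize one not-yet-allocated cluster and
 * write REQUEST_SIZE bytes of data into it.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
static int write_to_new_cluster(int fd, off_t cluster_off,
                                off_t off_in_cluster, const void *data)
{
    /* The request must fit inside the cluster. */
    assert(off_in_cluster >= 0 &&
           off_in_cluster + REQUEST_SIZE <= CLUSTER_SIZE);

    /* Option (1): let the filesystem zero the whole cluster. */
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, cluster_off, CLUSTER_SIZE) != 0) {
        if (errno != EOPNOTSUPP && errno != ENOSYS) {
            return -1;
        }
        /* Option (2), simplified: no ZERO_RANGE, write the zeroes. */
        char *zeroes = calloc(1, CLUSTER_SIZE);
        if (zeroes == NULL) {
            return -1;
        }
        ssize_t n = pwrite(fd, zeroes, CLUSTER_SIZE, cluster_off);
        free(zeroes);
        if (n != CLUSTER_SIZE) {
            return -1;
        }
    }

    /* Either way, the 4KB of data is written afterwards. */
    if (pwrite(fd, data, REQUEST_SIZE, cluster_off + off_in_cluster)
        != REQUEST_SIZE) {
        return -1;
    }
    return 0;
}
```

Later writes to the same cluster would skip the initialization step and
only issue the final pwrite(). A short write is treated as an error here
to keep the sketch small.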