On 17.08.2020 at 17:31, Alberto Garcia wrote:
> On Mon 17 Aug 2020 12:10:19 PM CEST, Kevin Wolf wrote:
> >> Since commit c8bb23cbdbe / QEMU 4.1.0 (and if the storage backend
> >> allows it) writing to an image created with preallocation=metadata
> >> can be slower (20% in my tests) than writing to an image with no
> >> preallocation at all.
> >
> > A while ago we had a case where commit c8bb23cbdbe was actually
> > reported as a major performance regression, so it's a big "it
> > depends".
> >
> > XFS people told me that they consider this code a bad idea. Just
> > because it's a specialised "write zeroes" operation, it's not
> > necessarily fast on filesystems. In particular, on XFS, ZERO_RANGE
> > causes a queue drain with O_DIRECT (probably hurts cases with high
> > queue depths) and additionally even a page cache flush without
> > O_DIRECT.
> >
> > So in a way this whole thing is a two-edged sword.
>
> I see... on ext4 the improvements are clearly visible. Are we not
> detecting this for xfs? We do have an s->is_xfs flag.

My understanding is that XFS and ext4 behave very similarly in this
respect. It's not a clear loss on XFS either; some cases are improved.
But cases that get a performance regression exist, too. It's a question
of the workload, the file system state (e.g. fragmentation of the image
file) and the storage. So I don't think checking for a specific
filesystem is going to improve things.

> >> a) shall we include a warning in the documentation ("note that this
> >> preallocation mode can result in worse performance")?
> >
> > To be honest, I don't really understand this case yet. With metadata
> > preallocation, the clusters are already marked as allocated, so why
> > would handle_alloc_space() even be called? We're not allocating new
> > clusters after all?
>
> It's not called, what happens is what you say below:
>
> > Or are you saying that ZERO_RANGE + pwrite on a sparse file (= cluster
> > allocation) is faster for you than just the pwrite alone (= writing to
> > already allocated cluster)?
>
> Yes, 20% faster in my tests (4KB random writes), but in the latter case
> the cluster is already allocated only at the qcow2 level, not on the
> filesystem. preallocation=falloc is faster than preallocation=metadata
> (preallocation=off sits in the middle).

Hm, this feels wrong. Doing more operations should never be faster than
doing fewer operations.

Maybe the difference is in allocating 64k at once instead of doing a
separate allocation for every 4k block? But with the extent size hint
patches to file-posix, we should allocate 1 MB at once by default now
(if your test image was newly created). Can you check whether this is in
effect for your image file?

Kevin
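
For reference, the two write patterns being compared can be sketched at the syscall level roughly as below. This is only an illustration, not QEMU's code: the function names, the 64 KiB cluster size and the 4 KiB payload are assumptions made for the example, and the ctypes binding assumes 64-bit Linux (64-bit off_t). On filesystems without ZERO_RANGE support the sketch falls back to the plain pwrite.

```python
import ctypes
import errno
import os
import tempfile

FALLOC_FL_ZERO_RANGE = 0x10  # value from <linux/falloc.h>

# Assumption: 64-bit Linux, where off_t is 64 bits wide.
_libc = ctypes.CDLL(None, use_errno=True)
_fallocate = getattr(_libc, "fallocate", None)
if _fallocate is not None:
    _fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                           ctypes.c_int64, ctypes.c_int64]
    _fallocate.restype = ctypes.c_int


def zero_range(fd, offset, length):
    """Zero a byte range with fallocate(); return False if unsupported."""
    if _fallocate is None:
        return False
    if _fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length) == 0:
        return True
    err = ctypes.get_errno()
    if err in (errno.EOPNOTSUPP, errno.ENOSYS, errno.EINVAL):
        return False  # filesystem or kernel cannot do ZERO_RANGE here
    raise OSError(err, os.strerror(err))


def write_with_zero_range(fd, cluster_off, cluster_size, data, data_off):
    """Pattern 1: ZERO_RANGE the whole cluster, then pwrite only the data."""
    used = zero_range(fd, cluster_off, cluster_size)
    os.pwrite(fd, data, cluster_off + data_off)
    return used


def write_plain(fd, cluster_off, data, data_off):
    """Pattern 2: just the pwrite into a cluster that is sparse on disk."""
    os.pwrite(fd, data, cluster_off + data_off)


def demo():
    cluster, data = 64 * 1024, b"x" * 4096  # hypothetical sizes
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.ftruncate(fd, 4 * cluster)  # sparse file, nothing allocated yet
        used = write_with_zero_range(fd, 0, cluster, data, 8192)
        write_plain(fd, cluster, data, 8192)
        same = os.pread(fd, cluster, 0) == os.pread(fd, cluster, cluster)
    return used, same


if __name__ == "__main__":
    print(demo())
```

Both patterns leave the same bytes in the file; they differ only in the syscall sequence and in how much the filesystem allocates per write. The sketch says nothing about timing, which would need a real benchmark with O_DIRECT on the filesystem in question.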