> On 19 Jun 2024, at 11:16, Kevin Wolf <kw...@redhat.com> wrote: > > Am 18.06.2024 um 23:24 hat Nir Soffer geschrieben: >> Set has_discard only when BDRV_O_UNMAP is not set. With this users that >> want to keep their images fully allocated can disable hole punching >> when writing zeros or discarding using: >> >> -drive file=thick.img,discard=off >> >> This change is not entirely correct since it changes the default discard >> behavior. Previously we always allowed punching holes, but now you have >> must use discard=unmap|on to enable it. We probably need to add the >> BDDR_O_UNMAP flag by default. >> >> make check still works, so maybe we don't have tests for sparsifying >> images, or maybe you need to run special tests that do not run by >> default. We needs tests for keeping images non-sparse. >> >> Signed-off-by: Nir Soffer <nsof...@redhat.com> > > So first of all, I agree with you that this patch is wrong. ;-) > > At first, I failed to understand the problem this is trying to solve. I > put a debug message in handle_aiocb_discard() and tried with which > options it triggers. [1] To me, this looked exactly like it should be. > We only try to discard blocks when discard=unmap is given as an option. > > That leaves the case of write_zeroes. And while at the first sight, the > code looked good, we do seem to have a problem there and it tried to > unmap even with discard=off. > >> block/file-posix.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/block/file-posix.c b/block/file-posix.c >> index be25e35ff6..acac2abadc 100644 >> --- a/block/file-posix.c >> +++ b/block/file-posix.c >> @@ -738,11 +738,11 @@ static int raw_open_common(BlockDriverState *bs, QDict >> *options, >> ret = -EINVAL; >> goto fail; >> } >> #endif /* !defined(CONFIG_LINUX_IO_URING) */ >> >> - s->has_discard = true; >> + s->has_discard = !!(bdrv_flags & BDRV_O_UNMAP); >> s->has_write_zeroes = true; >> >> if (fstat(s->fd, &st) < 0) { >> ret = -errno; >> error_setg_errno(errp, errno, "Could not stat file"); > > s->has_discard is about what the host supports, not about the semantics > of the QEMU block node. So this doesn't feel right to me. > > So for the buggy case, write_zeroes, bdrv_co_pwrite_zeroes() has code > that considers the case and clears the ~BDRV_REQ_MAY_UNMAP flags: > > if (!(child->bs->open_flags & BDRV_O_UNMAP)) { > flags &= ~BDRV_REQ_MAY_UNMAP; > } > > But it turns out that we don't necessarily even go through this function > for the top node which has discard=off, so it can't take effect: > > (gdb) bt > #0 0x00007ffff4f2f144 in __pthread_kill_implementation () at /lib64/libc.so > <http://libc.so/>.6 > #1 0x00007ffff4ed765e in raise () at /lib64/libc.so <http://libc.so/>.6 > #2 0x00007ffff4ebf902 in abort () at /lib64/libc.so <http://libc.so/>.6 > #3 0x000055555615aff0 in raw_do_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, > bytes=1048576, flags=BDRV_REQ_MAY_UNMAP, blkdev=false) at > ../block/file-posix.c:3643 > #4 0x000055555615557e in raw_co_pwrite_zeroes (bs=0x555557f4bcf0, offset=0, > bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/file-posix.c:3655 > #5 0x00005555560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x555557f4bcf0, > offset=0, bytes=1048576, flags=6) at ../block/io.c:1901 > #6 0x00005555560c72f9 in bdrv_aligned_pwritev (child=0x555557f51460, > req=0x7fffed5ff800, offset=0, bytes=1048576, align=1, qiov=0x0, > qiov_offset=0, flags=6) at ../block/io.c:2100 > #7 0x00005555560c6b41 in bdrv_co_do_zero_pwritev (child=0x555557f51460, > offset=0, bytes=1048576, flags=6, req=0x7fffed5ff800) at ../block/io.c:2183 > #8 0x00005555560c6647 in bdrv_co_pwritev_part (child=0x555557f51460, > offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at > ../block/io.c:2283 > #9 0x00005555560c634f in bdrv_co_pwritev (child=0x555557f51460, offset=0, > bytes=1048576, qiov=0x0, flags=6) at ../block/io.c:2216 > #10 0x00005555560c75b5 in bdrv_co_pwrite_zeroes (child=0x555557f51460, > offset=0, bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/io.c:2322 > #11 0x0000555556117d24 in raw_co_pwrite_zeroes (bs=0x555557f44980, offset=0, > bytes=1048576, flags=BDRV_REQ_MAY_UNMAP) at ../block/raw-format.c:307 > #12 0x00005555560cde2a in bdrv_co_do_pwrite_zeroes (bs=0x555557f44980, > offset=0, bytes=1048576, flags=6) at ../block/io.c:1901 > #13 0x00005555560c72f9 in bdrv_aligned_pwritev (child=0x555557f513f0, > req=0x7fffed5ffd90, offset=0, bytes=1048576, align=1, qiov=0x0, > qiov_offset=0, flags=6) at ../block/io.c:2100 > #14 0x00005555560c6b41 in bdrv_co_do_zero_pwritev (child=0x555557f513f0, > offset=0, bytes=1048576, flags=6, req=0x7fffed5ffd90) at ../block/io.c:2183 > #15 0x00005555560c6647 in bdrv_co_pwritev_part (child=0x555557f513f0, > offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at > ../block/io.c:2283 > #16 0x00005555560ad741 in blk_co_do_pwritev_part (blk=0x555557f51660, > offset=0, bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at > ../block/block-backend.c:1425 > #17 0x00005555560ad5f2 in blk_co_pwritev_part (blk=0x555557f51660, offset=0, > bytes=1048576, qiov=0x0, qiov_offset=0, flags=6) at > ../block/block-backend.c:1440 > #18 0x00005555560ad8cf in blk_co_pwritev (blk=0x555557f51660, offset=0, > bytes=1048576, qiov=0x0, flags=6) at ../block/block-backend.c:1462 > #19 0x00005555560b0f79 in blk_co_pwrite_zeroes (blk=0x555557f51660, offset=0, > bytes=1048576, flags=6) at ../block/block-backend.c:2590 > #20 0x000055555606d240 in blk_co_pwrite_zeroes_entry (opaque=0x7fffffffbc18) > at block/block-gen.c:2162 > #21 0x00005555562c36ba in coroutine_trampoline (i0=1475685216, i1=21845) at > ../util/coroutine-ucontext.c:175 > #22 0x00007ffff4ef1190 in ??? () at /lib64/libc.so <http://libc.so/>.6 > #23 0x00007fffffffc010 in ??? () > #24 0x0000000000000000 in ??? () > > I haven't checked the details yet, but my first impression is that this > check should probably move to bdrv_co_do_pwrite_zeroes().
Punching holes only if BDRV_REQ_MAY_UNMAP bit is set sounds like the right way. > > Kevin > > > [1] Tests I did: > > # For discard > > $ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive > if=none,file=/tmp/test.raw,format=raw -monitor stdio > QEMU 9.0.50 monitor - type 'help' for more information > (qemu) qemu-io none0 "discard 0 1M" > discard 1048576/1048576 bytes at offset 0 > 1 MiB, 1 ops; 00.00 sec (396.171 GiB/sec and 405679.5132 ops/sec) > (qemu) quit > > $ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive > if=none,file=/tmp/test.raw,format=raw,discard=unmap -monitor stdio > QEMU 9.0.50 monitor - type 'help' for more information > (qemu) qemu-io none0 "discard 0 1M" > handle_aiocb_discard > discard 1048576/1048576 bytes at offset 0 > 1 MiB, 1 ops; 00.00 sec (6.623 GiB/sec and 6782.2820 ops/sec) > (qemu) quit > > $ echo -e 'qemu-io none0 "discard 0 1M"\nquit' | ./qemu-system-x86_64 -drive > if=none,file=/tmp/test.raw,format=raw,discard=off -monitor stdio > QEMU 9.0.50 monitor - type 'help' for more information > (qemu) qemu-io none0 "discard 0 1M" > discard 1048576/1048576 bytes at offset 0 > 1 MiB, 1 ops; 00.00 sec (375.168 GiB/sec and 384172.1091 ops/sec) > (qemu) quit > > # For write_zeroes > > $ echo -e 'qemu-io none0 "write -zu 0 1M"\nquit' | ./qemu-system-x86_64 > -drive if=none,file=/tmp/test.raw,format=raw,discard=unmap -monitor stdio > QEMU 9.0.50 monitor - type 'help' for more information > (qemu) qemu-io none0 "write -zu 0 1M" > handle_aiocb_write_zeroes_unmap > wrote 1048576/1048576 bytes at offset 0 > 1 MiB, 1 ops; 00.00 sec (4.943 GiB/sec and 5061.5997 ops/sec) > (qemu) quit > $ echo -e 'qemu-io none0 "write -zu 0 1M"\nquit' | ./qemu-system-x86_64 > -drive if=none,file=/tmp/test.raw,format=raw,discard=off -monitor stdio > QEMU 9.0.50 monitor - type 'help' for more information > (qemu) qemu-io none0 "write -zu 0 1M" > handle_aiocb_write_zeroes_unmap > wrote 1048576/1048576 bytes at offset 0 > 1 MiB, 1 ops; 00.00 sec (7.208 GiB/sec and 7381.4357 ops/sec) > (qemu) quit Looks like a good way to write tests for this, thanks for sharing. Nir