Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-13 Thread Michael Galaxy
One thing to keep in mind here (despite me not having any hardware to
test) was that one of the original goals of the RDMA implementation was
not simply raw throughput nor raw latency, but a lack of CPU utilization
in kernel space due to the offload. While it is entirely possible that
newer hardware with TCP might compete, the significant reductions in CPU
usage in the TCP/IP stack were a big win at the time.

Just something to consider while you're doing the testing.

- Michael

On 5/9/24 03:58, Zheng Chuan wrote:

Hi Peter, Lei, Jinpu,

On 2024/5/8 0:28, Peter Xu wrote:

On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote:

Hello,


-Original Message-
From: Peter Xu [mailto:pet...@redhat.com]
Sent: Monday, May 6, 2024 11:18 PM
To: Gonglei (Arei) 
Cc: Daniel P. Berrangé ; Markus Armbruster
; Michael Galaxy ; Yu Zhang
; Zhijian Li (Fujitsu) ; Jinpu Wang
; Elmar Gerdes ;
qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf
; Prasanna Kumar Kalever
; Cornelia Huck ;
Michael Roth ; Prasanna Kumar Kalever
; integrat...@gluster.org; Paolo Bonzini
; qemu-block@nongnu.org; de...@lists.libvirt.org;
Hanna Reitz ; Michael S. Tsirkin ;
Thomas Huth ; Eric Blake ; Song
Gao ; Marc-André Lureau
; Alex Bennée ;
Wainer dos Santos Moschetta ; Beraldo Leal
; Pannengyuan ;
Xiexiangyou 
Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:

Hi, Peter

Hey, Lei,

Happy to see you around again after years.


Haha, me too.


RDMA features high bandwidth, low latency (in non-blocking lossless
network), and direct remote memory access by bypassing the CPU (As you
know, CPU resources are expensive for cloud vendors, which is one of
the reasons why we introduced offload cards.), which TCP does not have.

It's another cost to use offload cards, vs. preparing more CPU resources?


A converged software and hardware offload architecture is the way to go
for all cloud vendors (considering the combined benefits in performance,
cost, security, and innovation speed); it's not just a matter of adding
the resources of a DPU card.


In some scenarios that need fast live migration (extremely short
interruption duration and migration duration), RDMA is very useful. To
this end, we have also developed RDMA support for multifd.

Will any of you upstream that work?  I'm curious how intrusive it would be
when adding it to multifd; if it can keep only 5 exported functions like
rdma.h does right now, that would be pretty nice.  We also want to make sure
it works with arbitrarily sized loads and buffers, e.g. vfio is considering
adding IO loads to multifd channels too.


In fact, we sent the patchset to the community in 2021. Pls see:
https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/

Yes, I sent the patchset adding multifd support for RDMA migration,
taking over from my colleague, and I'm sorry for not continuing this
work at the time for various reasons.
I also strongly agree with Lei that the RDMA protocol has some special
advantages over TCP in some scenarios, and we do indeed use it in our
product.


I wasn't aware of that for sure in the past.

Multifd has changed quite a bit in the last 9.0 release, so that may not
apply anymore.  One thing to mention is please look at Dan's comment on the
possible use of rsocket.h:

https://lore.kernel.org/all/zjjm6rcqs5eho...@redhat.com/

And Jinpu did help provide an initial test result over the library:

https://lore.kernel.org/qemu-devel/camgffek8wiknqmouyxcathgtiem2dwocf_w7t0vmcd-i30t...@mail.gmail.com/

It looks like we have a chance to apply that in QEMU.




One thing to note: the question here is not about a pure performance
comparison between RDMA and NICs only.  It's about helping us decide
whether to drop RDMA; in other words, even if RDMA performs well, the
community still has the right to drop it if nobody can actively work on
and maintain it.  It's just that if NICs can perform as well, that's more
of a reason to drop it, unless companies can help to provide good support
and work together.


We are happy to provide the necessary review and maintenance work for RDMA
if the community needs it.

CC'ing Chuan Zheng.

I'm not sure whether you and Jinpu's team would like to work together and
provide a final solution for rdma over multifd.  It could be much simpler
than the original 2021 proposal if the rsocket API will work out.

Thanks,


That's good news, the socket abstraction for RDMA!
When I developed the series above, the biggest pain was that RDMA
migration had no QIOChannel abstraction.

Re: [PATCH 5/6] migration: Rephrase message on failure to save / load Xen device state

2024-05-13 Thread Fabiano Rosas
Markus Armbruster  writes:

> Functions that use an Error **errp parameter to return errors should
> not also report them to the user, because reporting is the caller's
> job.  When the caller does, the error is reported twice.  When it
> doesn't (because it recovered from the error), there is no error to
> report, i.e. the report is bogus.
>
> qmp_xen_save_devices_state() and qmp_xen_load_devices_state() violate
> this principle: they call qemu_save_device_state() and
> qemu_loadvm_state(), which call error_report_err().
>
> I wish I could clean this up now, but migration's error reporting is
> too complicated (confused?) for me to mess with it.
>
> Instead, I'm merely improving the error reported by
> qmp_xen_save_devices_state() and qmp_xen_load_devices_state() to the
> QMP core from
>
> An IO error has occurred
>
> to
>
> saving Xen device state failed
>
> and
>
> loading Xen device state failed
>
> respectively.
>
> Signed-off-by: Markus Armbruster 

Acked-by: Fabiano Rosas 



Re: [PATCH 4/6] cpus: Improve error messages on memsave, pmemsave write error

2024-05-13 Thread Philippe Mathieu-Daudé

On 13/5/24 16:45, Markus Armbruster wrote:

Philippe Mathieu-Daudé  writes:


On 13/5/24 16:17, Markus Armbruster wrote:

qmp_memsave() and qmp_pmemsave() report fwrite() error as

  An IO error has occurred

Improve this to

  writing memory to '<filename>' failed

Signed-off-by: Markus Armbruster 
---
   system/cpus.c | 6 ++++--
   1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/system/cpus.c b/system/cpus.c
index 68d161d96b..f8fa78f33d 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -813,7 +813,8 @@ void qmp_memsave(int64_t addr, int64_t size, const char *filename,
   goto exit;
   }
   if (fwrite(buf, 1, l, f) != l) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg(errp, "writing memory to '%s' failed",
+   filename);
   goto exit;
   }
   addr += l;
@@ -843,7 +844,8 @@ void qmp_pmemsave(int64_t addr, int64_t size, const char *filename,
   l = size;
   cpu_physical_memory_read(addr, buf, l);
   if (fwrite(buf, 1, l, f) != l) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg(errp, "writing memory to '%s' failed",
+   filename);


What about including errno with error_setg_errno()?


Sure fwrite() fails with errno reliably set?  The manual page doesn't
mention it...


Indeed. I can see some uses in the code base:

qemu-io-cmds.c:409:if (ferror(f)) {
qemu-io-cmds.c-410-perror(file_name);

qga/commands-posix.c-632-write_count = fwrite(buf, 1, count, fh);
qga/commands-posix.c:633:if (ferror(fh)) {
qga/commands-posix.c-634-error_setg_errno(errp, errno, "failed to write to file");


util/qemu-config.c:152:if (ferror(fp)) {
util/qemu-config.c-153-loc_pop();
util/qemu-config.c-154-error_setg_errno(errp, errno, "Cannot read config file");


Regardless,

Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH 2/6] dump/win_dump: Improve error messages on write error

2024-05-13 Thread Philippe Mathieu-Daudé

On 13/5/24 16:48, Markus Armbruster wrote:

Philippe Mathieu-Daudé  writes:


On 13/5/24 16:16, Markus Armbruster wrote:

create_win_dump() and write_run report qemu_write_full() failure to
their callers as

  An IO error has occurred

The errno set by qemu_write_full() is lost.

Improve this to

  win-dump: failed to write header: <error>

and

  win-dump: failed to save memory: <error>

This matches how dump.c reports similar errors.

Signed-off-by: Markus Armbruster 
---
   dump/win_dump.c | 7 ++++---
   1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/dump/win_dump.c b/dump/win_dump.c
index b7bfaff379..0e4fe692ce 100644
--- a/dump/win_dump.c
+++ b/dump/win_dump.c
@@ -12,7 +12,6 @@
   #include "sysemu/dump.h"
   #include "qapi/error.h"
   #include "qemu/error-report.h"
-#include "qapi/qmp/qerror.h"
   #include "exec/cpu-defs.h"
   #include "hw/core/cpu.h"
   #include "qemu/win_dump_defs.h"
@@ -52,6 +51,7 @@ static size_t write_run(uint64_t base_page, uint64_t page_count,
   uint64_t addr = base_page << TARGET_PAGE_BITS;
   uint64_t size = page_count << TARGET_PAGE_BITS;
   uint64_t len, l;
+int eno;
   size_t total = 0;
 while (size) {
@@ -65,9 +65,10 @@ static size_t write_run(uint64_t base_page, uint64_t page_count,
   }
 l = qemu_write_full(fd, buf, len);
+eno = errno;


Hmm, this shows the qemu_write_full() API isn't ideal.
Maybe we could pass  as argument and return errno.
There are only 20 calls.


qemu_write_full() is a drop-in replacement for write().


Fine.

Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH 2/6] dump/win_dump: Improve error messages on write error

2024-05-13 Thread Markus Armbruster
Philippe Mathieu-Daudé  writes:

> On 13/5/24 16:16, Markus Armbruster wrote:
>> create_win_dump() and write_run report qemu_write_full() failure to
>> their callers as
>>  An IO error has occurred
>> The errno set by qemu_write_full() is lost.
>> Improve this to
>>  win-dump: failed to write header: <error>
>> and
>>  win-dump: failed to save memory: <error>
>> This matches how dump.c reports similar errors.
>> Signed-off-by: Markus Armbruster 
>> ---
>>   dump/win_dump.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>> diff --git a/dump/win_dump.c b/dump/win_dump.c
>> index b7bfaff379..0e4fe692ce 100644
>> --- a/dump/win_dump.c
>> +++ b/dump/win_dump.c
>> @@ -12,7 +12,6 @@
>>   #include "sysemu/dump.h"
>>   #include "qapi/error.h"
>>   #include "qemu/error-report.h"
>> -#include "qapi/qmp/qerror.h"
>>   #include "exec/cpu-defs.h"
>>   #include "hw/core/cpu.h"
>>   #include "qemu/win_dump_defs.h"
>> @@ -52,6 +51,7 @@ static size_t write_run(uint64_t base_page, uint64_t page_count,
>>   uint64_t addr = base_page << TARGET_PAGE_BITS;
>>   uint64_t size = page_count << TARGET_PAGE_BITS;
>>   uint64_t len, l;
>> +int eno;
>>   size_t total = 0;
>> while (size) {
>> @@ -65,9 +65,10 @@ static size_t write_run(uint64_t base_page, uint64_t page_count,
>>   }
>> l = qemu_write_full(fd, buf, len);
>> +eno = errno;
>
> Hmm, this shows the qemu_write_full() API isn't ideal.
> Maybe we could pass  as argument and return errno.
> There are only 20 calls.

qemu_write_full() is a drop-in replacement for write().

>>   cpu_physical_memory_unmap(buf, addr, false, len);
>>   if (l != len) {
>> -error_setg(errp, QERR_IO_ERROR);
>> +error_setg_errno(errp, eno, "win-dump: failed to save memory");
>>   return 0;
>>   }
>>   @@ -459,7 +460,7 @@ void create_win_dump(DumpState *s, Error **errp)
>> s->written_size = qemu_write_full(s->fd, h, hdr_size);
>>   if (s->written_size != hdr_size) {
>> -error_setg(errp, QERR_IO_ERROR);
>> +error_setg_errno(errp, errno, "win-dump: failed to write header");
>>   goto out_restore;
>>   }
>>   




Re: [PATCH 4/6] cpus: Improve error messages on memsave, pmemsave write error

2024-05-13 Thread Markus Armbruster
Philippe Mathieu-Daudé  writes:

> On 13/5/24 16:17, Markus Armbruster wrote:
>> qmp_memsave() and qmp_pmemsave() report fwrite() error as
>>  An IO error has occurred
>> Improve this to
>>  writing memory to '<filename>' failed
>> Signed-off-by: Markus Armbruster 
>> ---
>>   system/cpus.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>> diff --git a/system/cpus.c b/system/cpus.c
>> index 68d161d96b..f8fa78f33d 100644
>> --- a/system/cpus.c
>> +++ b/system/cpus.c
>> @@ -813,7 +813,8 @@ void qmp_memsave(int64_t addr, int64_t size, const char *filename,
>>   goto exit;
>>   }
>>   if (fwrite(buf, 1, l, f) != l) {
>> -error_setg(errp, QERR_IO_ERROR);
>> +error_setg(errp, "writing memory to '%s' failed",
>> +   filename);
>>   goto exit;
>>   }
>>   addr += l;
>> @@ -843,7 +844,8 @@ void qmp_pmemsave(int64_t addr, int64_t size, const char *filename,
>>   l = size;
>>   cpu_physical_memory_read(addr, buf, l);
>>   if (fwrite(buf, 1, l, f) != l) {
>> -error_setg(errp, QERR_IO_ERROR);
>> +error_setg(errp, "writing memory to '%s' failed",
>> +   filename);
>
> What about including errno with error_setg_errno()?

Sure fwrite() fails with errno reliably set?  The manual page doesn't
mention it...


>>   goto exit;
>>   }
>>   addr += l;




Re: [PATCH 6/6] qerror: QERR_IO_ERROR is no longer used, drop

2024-05-13 Thread Philippe Mathieu-Daudé

On 13/5/24 16:17, Markus Armbruster wrote:

Signed-off-by: Markus Armbruster 
---
  include/qapi/qmp/qerror.h | 3 ---
  1 file changed, 3 deletions(-)


One less!

Reviewed-by: Philippe Mathieu-Daudé 





Re: [PATCH 4/6] cpus: Improve error messages on memsave, pmemsave write error

2024-05-13 Thread Philippe Mathieu-Daudé

On 13/5/24 16:17, Markus Armbruster wrote:

qmp_memsave() and qmp_pmemsave() report fwrite() error as

 An IO error has occurred

Improve this to

 writing memory to '<filename>' failed

Signed-off-by: Markus Armbruster 
---
  system/cpus.c | 6 ++++--
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/system/cpus.c b/system/cpus.c
index 68d161d96b..f8fa78f33d 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -813,7 +813,8 @@ void qmp_memsave(int64_t addr, int64_t size, const char *filename,
  goto exit;
  }
  if (fwrite(buf, 1, l, f) != l) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg(errp, "writing memory to '%s' failed",
+   filename);
  goto exit;
  }
  addr += l;
@@ -843,7 +844,8 @@ void qmp_pmemsave(int64_t addr, int64_t size, const char *filename,
  l = size;
  cpu_physical_memory_read(addr, buf, l);
  if (fwrite(buf, 1, l, f) != l) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg(errp, "writing memory to '%s' failed",
+   filename);


What about including errno with error_setg_errno()?


  goto exit;
  }
  addr += l;





Re: [PATCH 1/6] block: Improve error message when external snapshot can't flush

2024-05-13 Thread Philippe Mathieu-Daudé

On 13/5/24 16:16, Markus Armbruster wrote:

external_snapshot_action() reports bdrv_flush() failure to its caller
as

 An IO error has occurred

The errno code returned by bdrv_flush() is lost.

Improve this to

 Write to node '<node name>' failed: <error>

Signed-off-by: Markus Armbruster 
---
  blockdev.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 





Re: [PATCH 5/6] migration: Rephrase message on failure to save / load Xen device state

2024-05-13 Thread Philippe Mathieu-Daudé

On 13/5/24 16:17, Markus Armbruster wrote:

Functions that use an Error **errp parameter to return errors should
not also report them to the user, because reporting is the caller's
job.  When the caller does, the error is reported twice.  When it
doesn't (because it recovered from the error), there is no error to
report, i.e. the report is bogus.

qmp_xen_save_devices_state() and qmp_xen_load_devices_state() violate
this principle: they call qemu_save_device_state() and
qemu_loadvm_state(), which call error_report_err().

I wish I could clean this up now, but migration's error reporting is
too complicated (confused?) for me to mess with it.

Instead, I'm merely improving the error reported by
qmp_xen_save_devices_state() and qmp_xen_load_devices_state() to the
QMP core from

 An IO error has occurred

to

 saving Xen device state failed

and

 loading Xen device state failed

respectively.

Signed-off-by: Markus Armbruster 
---
  migration/savevm.c | 5 ++---
  1 file changed, 2 insertions(+), 3 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH 3/6] block/vmdk: Improve error messages on extent write error

2024-05-13 Thread Philippe Mathieu-Daudé

On 13/5/24 16:17, Markus Armbruster wrote:

vmdk_init_extent() reports blk_co_pwrite() failure to its caller as

 An IO error has occurred

The errno code returned by blk_co_pwrite() is lost.

Improve this to

 failed to write VMDK <what>: <error>

Signed-off-by: Markus Armbruster 
---
  block/vmdk.c | 10 +++++-----
  1 file changed, 5 insertions(+), 5 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 




Re: [PATCH 2/6] dump/win_dump: Improve error messages on write error

2024-05-13 Thread Philippe Mathieu-Daudé

On 13/5/24 16:16, Markus Armbruster wrote:

create_win_dump() and write_run report qemu_write_full() failure to
their callers as

 An IO error has occurred

The errno set by qemu_write_full() is lost.

Improve this to

 win-dump: failed to write header: <error>

and

 win-dump: failed to save memory: <error>

This matches how dump.c reports similar errors.

Signed-off-by: Markus Armbruster 
---
  dump/win_dump.c | 7 ++++---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/dump/win_dump.c b/dump/win_dump.c
index b7bfaff379..0e4fe692ce 100644
--- a/dump/win_dump.c
+++ b/dump/win_dump.c
@@ -12,7 +12,6 @@
  #include "sysemu/dump.h"
  #include "qapi/error.h"
  #include "qemu/error-report.h"
-#include "qapi/qmp/qerror.h"
  #include "exec/cpu-defs.h"
  #include "hw/core/cpu.h"
  #include "qemu/win_dump_defs.h"
@@ -52,6 +51,7 @@ static size_t write_run(uint64_t base_page, uint64_t page_count,
  uint64_t addr = base_page << TARGET_PAGE_BITS;
  uint64_t size = page_count << TARGET_PAGE_BITS;
  uint64_t len, l;
+int eno;
  size_t total = 0;
  
  while (size) {

@@ -65,9 +65,10 @@ static size_t write_run(uint64_t base_page, uint64_t page_count,
  }
  
  l = qemu_write_full(fd, buf, len);

+eno = errno;


Hmm, this shows the qemu_write_full() API isn't ideal.
Maybe we could pass  as argument and return errno.
There are only 20 calls.


  cpu_physical_memory_unmap(buf, addr, false, len);
  if (l != len) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg_errno(errp, eno, "win-dump: failed to save memory");
  return 0;
  }
  
@@ -459,7 +460,7 @@ void create_win_dump(DumpState *s, Error **errp)
  
  s->written_size = qemu_write_full(s->fd, h, hdr_size);

  if (s->written_size != hdr_size) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg_errno(errp, errno, "win-dump: failed to write header");
  goto out_restore;
  }
  





[PATCH 4/6] cpus: Improve error messages on memsave, pmemsave write error

2024-05-13 Thread Markus Armbruster
qmp_memsave() and qmp_pmemsave() report fwrite() error as

An IO error has occurred

Improve this to

writing memory to '<filename>' failed

Signed-off-by: Markus Armbruster 
---
 system/cpus.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/system/cpus.c b/system/cpus.c
index 68d161d96b..f8fa78f33d 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -813,7 +813,8 @@ void qmp_memsave(int64_t addr, int64_t size, const char *filename,
 goto exit;
 }
 if (fwrite(buf, 1, l, f) != l) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg(errp, "writing memory to '%s' failed",
+   filename);
 goto exit;
 }
 addr += l;
@@ -843,7 +844,8 @@ void qmp_pmemsave(int64_t addr, int64_t size, const char *filename,
 l = size;
 cpu_physical_memory_read(addr, buf, l);
 if (fwrite(buf, 1, l, f) != l) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg(errp, "writing memory to '%s' failed",
+   filename);
 goto exit;
 }
 addr += l;
-- 
2.45.0




[PATCH 5/6] migration: Rephrase message on failure to save / load Xen device state

2024-05-13 Thread Markus Armbruster
Functions that use an Error **errp parameter to return errors should
not also report them to the user, because reporting is the caller's
job.  When the caller does, the error is reported twice.  When it
doesn't (because it recovered from the error), there is no error to
report, i.e. the report is bogus.

qmp_xen_save_devices_state() and qmp_xen_load_devices_state() violate
this principle: they call qemu_save_device_state() and
qemu_loadvm_state(), which call error_report_err().

I wish I could clean this up now, but migration's error reporting is
too complicated (confused?) for me to mess with it.

Instead, I'm merely improving the error reported by
qmp_xen_save_devices_state() and qmp_xen_load_devices_state() to the
QMP core from

An IO error has occurred

to

saving Xen device state failed

and

loading Xen device state failed

respectively.

Signed-off-by: Markus Armbruster 
---
 migration/savevm.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 4509482ec4..a4a856982a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -45,7 +45,6 @@
 #include "qapi/qapi-commands-migration.h"
 #include "qapi/clone-visitor.h"
 #include "qapi/qapi-builtin-visit.h"
-#include "qapi/qmp/qerror.h"
 #include "qemu/error-report.h"
 #include "sysemu/cpus.h"
 #include "exec/memory.h"
@@ -3208,7 +3207,7 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
 object_unref(OBJECT(ioc));
 ret = qemu_save_device_state(f);
 if (ret < 0 || qemu_fclose(f) < 0) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg(errp, "saving Xen device state failed");
 } else {
 /* libxl calls the QMP command "stop" before calling
  * "xen-save-devices-state" and in case of migration failure, libxl
@@ -3257,7 +3256,7 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
 ret = qemu_loadvm_state(f);
 qemu_fclose(f);
 if (ret < 0) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg(errp, "loading Xen device state failed");
 }
 migration_incoming_state_destroy();
 }
-- 
2.45.0




[PATCH 1/6] block: Improve error message when external snapshot can't flush

2024-05-13 Thread Markus Armbruster
external_snapshot_action() reports bdrv_flush() failure to its caller
as

An IO error has occurred

The errno code returned by bdrv_flush() is lost.

Improve this to

Write to node '<node name>' failed: <error>

Signed-off-by: Markus Armbruster 
---
 blockdev.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 08eccc9052..528db3452f 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1406,8 +1406,10 @@ static void external_snapshot_action(TransactionAction *action,
 }
 
 if (!bdrv_is_read_only(state->old_bs)) {
-if (bdrv_flush(state->old_bs)) {
-error_setg(errp, QERR_IO_ERROR);
+ret = bdrv_flush(state->old_bs);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "Write to node '%s' failed",
+ bdrv_get_device_or_node_name(state->old_bs));
 return;
 }
 }
-- 
2.45.0




[PATCH 0/6] error: Eliminate QERR_IO_ERROR

2024-05-13 Thread Markus Armbruster
Markus Armbruster (6):
  block: Improve error message when external snapshot can't flush
  dump/win_dump: Improve error messages on write error
  block/vmdk: Improve error messages on extent write error
  cpus: Improve error messages on memsave, pmemsave write error
  migration: Rephrase message on failure to save / load Xen device state
  qerror: QERR_IO_ERROR is no longer used, drop

 include/qapi/qmp/qerror.h |  3 ---
 block/vmdk.c              | 10 +++++-----
 blockdev.c                |  6 ++++--
 dump/win_dump.c           |  7 ++++---
 migration/savevm.c        |  5 ++---
 system/cpus.c             |  6 ++++--
 6 files changed, 19 insertions(+), 18 deletions(-)

-- 
2.45.0




[PATCH 3/6] block/vmdk: Improve error messages on extent write error

2024-05-13 Thread Markus Armbruster
vmdk_init_extent() reports blk_co_pwrite() failure to its caller as

An IO error has occurred

The errno code returned by blk_co_pwrite() is lost.

Improve this to

failed to write VMDK <what>: <error>

Signed-off-by: Markus Armbruster 
---
 block/vmdk.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index 3b82979fdf..78f6433607 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -28,7 +28,6 @@
 #include "block/block_int.h"
 #include "sysemu/block-backend.h"
 #include "qapi/qmp/qdict.h"
-#include "qapi/qmp/qerror.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
 #include "qemu/option.h"
@@ -2278,12 +2277,12 @@ vmdk_init_extent(BlockBackend *blk, int64_t filesize, bool flat, bool compress,
 /* write all the data */
 ret = blk_co_pwrite(blk, 0, sizeof(magic), &magic, 0);
 if (ret < 0) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg_errno(errp, -ret, "failed to write VMDK magic");
 goto exit;
 }
 ret = blk_co_pwrite(blk, sizeof(magic), sizeof(header), &header, 0);
 if (ret < 0) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg_errno(errp, -ret, "failed to write VMDK header");
 goto exit;
 }
 
@@ -2303,7 +2302,7 @@ vmdk_init_extent(BlockBackend *blk, int64_t filesize, bool flat, bool compress,
 ret = blk_co_pwrite(blk, le64_to_cpu(header.rgd_offset) * BDRV_SECTOR_SIZE,
 gd_buf_size, gd_buf, 0);
 if (ret < 0) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg_errno(errp, -ret, "failed to write VMDK grain directory");
 goto exit;
 }
 
@@ -2315,7 +2314,8 @@ vmdk_init_extent(BlockBackend *blk, int64_t filesize, bool flat, bool compress,
 ret = blk_co_pwrite(blk, le64_to_cpu(header.gd_offset) * BDRV_SECTOR_SIZE,
 gd_buf_size, gd_buf, 0);
 if (ret < 0) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg_errno(errp, -ret,
+ "failed to write VMDK backup grain directory");
 }
 
 ret = 0;
-- 
2.45.0




[PATCH 6/6] qerror: QERR_IO_ERROR is no longer used, drop

2024-05-13 Thread Markus Armbruster
Signed-off-by: Markus Armbruster 
---
 include/qapi/qmp/qerror.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/include/qapi/qmp/qerror.h b/include/qapi/qmp/qerror.h
index 00b18e9082..bc9116f76a 100644
--- a/include/qapi/qmp/qerror.h
+++ b/include/qapi/qmp/qerror.h
@@ -20,9 +20,6 @@
 #define QERR_INVALID_PARAMETER_VALUE \
 "Parameter '%s' expects %s"
 
-#define QERR_IO_ERROR \
-"An IO error has occurred"
-
 #define QERR_MISSING_PARAMETER \
 "Parameter '%s' is missing"
 
-- 
2.45.0




[PATCH 2/6] dump/win_dump: Improve error messages on write error

2024-05-13 Thread Markus Armbruster
create_win_dump() and write_run report qemu_write_full() failure to
their callers as

An IO error has occurred

The errno set by qemu_write_full() is lost.

Improve this to

win-dump: failed to write header: <error>

and

win-dump: failed to save memory: <error>

This matches how dump.c reports similar errors.

Signed-off-by: Markus Armbruster 
---
 dump/win_dump.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/dump/win_dump.c b/dump/win_dump.c
index b7bfaff379..0e4fe692ce 100644
--- a/dump/win_dump.c
+++ b/dump/win_dump.c
@@ -12,7 +12,6 @@
 #include "sysemu/dump.h"
 #include "qapi/error.h"
 #include "qemu/error-report.h"
-#include "qapi/qmp/qerror.h"
 #include "exec/cpu-defs.h"
 #include "hw/core/cpu.h"
 #include "qemu/win_dump_defs.h"
@@ -52,6 +51,7 @@ static size_t write_run(uint64_t base_page, uint64_t page_count,
 uint64_t addr = base_page << TARGET_PAGE_BITS;
 uint64_t size = page_count << TARGET_PAGE_BITS;
 uint64_t len, l;
+int eno;
 size_t total = 0;
 
 while (size) {
@@ -65,9 +65,10 @@ static size_t write_run(uint64_t base_page, uint64_t page_count,
 }
 
 l = qemu_write_full(fd, buf, len);
+eno = errno;
 cpu_physical_memory_unmap(buf, addr, false, len);
 if (l != len) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg_errno(errp, eno, "win-dump: failed to save memory");
 return 0;
 }
 
@@ -459,7 +460,7 @@ void create_win_dump(DumpState *s, Error **errp)
 
 s->written_size = qemu_write_full(s->fd, h, hdr_size);
 if (s->written_size != hdr_size) {
-error_setg(errp, QERR_IO_ERROR);
+error_setg_errno(errp, errno, "win-dump: failed to write header");
 goto out_restore;
 }
 
-- 
2.45.0




Re: [PATCH 1/2] copy-before-write: allow specifying minimum cluster size

2024-05-13 Thread Fiona Ebner
Am 26.03.24 um 10:06 schrieb Markus Armbruster:
>> @@ -365,7 +368,13 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
>>  
>>  GLOBAL_STATE_CODE();
>>  
>> -cluster_size = block_copy_calculate_cluster_size(target->bs, errp);
>> +if (min_cluster_size && !is_power_of_2(min_cluster_size)) {
> 
> min_cluster_size is int64_t, is_power_of_2() takes uint64_t.  Bad if
> min_cluster_size is negative.  Could this happen?
> 

No, because it comes in as a uint32_t via the QAPI (the internal caller
added by patch 2/2 from the backup code also gets the value via QAPI and
there uint32_t is used too).

---snip---

>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 0a72c590a8..85c8f88f6e 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -4625,12 +4625,18 @@
>>  # @on-cbw-error parameter will decide how this failure is handled.
>>  # Default 0. (Since 7.1)
>>  #
>> +# @min-cluster-size: Minimum size of blocks used by copy-before-write
>> +# operations.  Has to be a power of 2.  No effect if smaller than
>> +# the maximum of the target's cluster size and 64 KiB.  Default 0.
>> +# (Since 9.0)
>> +#
>>  # Since: 6.2
>>  ##
>>  { 'struct': 'BlockdevOptionsCbw',
>>'base': 'BlockdevOptionsGenericFormat',
>>'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap',
>> -'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32' } }
>> +'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32',
>> +'*min-cluster-size': 'uint32' } }
> 
> Elsewhere in the schema, we use either 'int' or 'size' for cluster-size.
> Why the difference?
> 

The motivation was to disallow negative values up front and to have it
work with block_copy_calculate_cluster_size(), whose result is an
int64_t.  If I go with 'int', I'll have to add a check to disallow
negative values.  If I go with 'size', I'll have to add a check to
disallow too-large values.

Which approach should I go with?

Best Regards,
Fiona




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-13 Thread Jinpu Wang
Hi Peter, Hi Chuan,

On Thu, May 9, 2024 at 4:14 PM Peter Xu  wrote:
>
> On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> > That's good news, the socket abstraction for RDMA!
> > When I developed the series above, the biggest pain was that RDMA
> > migration had no QIOChannel abstraction, and I needed to use a 'fake
> > channel' for it, which is awkward in the code implementation.
> > So, as far as I know, we can do this by:
> > i. first, evaluate whether rsocket is good enough to satisfy our
> > fundamental QIOChannel abstraction
> > ii. if it works right, continue to see whether it gives us the
> > opportunity to hide the details of the rdma protocol inside rsocket
> > by removing most of the code in rdma.c and also some hacks in the
> > migration main process.
> > iii. implement the advanced features like multi-fd and multi-uri for
> > rdma migration.
> >
> > Since I am not familiar with rsocket, I need some time to look at it
> > and do some quick verification of rdma migration based on rsocket.
> > But, yes, I am willing to be involved in this refactoring work and to
> > see if we can make this migration feature better :)
>
> Based on what we have now, it looks like we'd better halt the deprecation
> process a bit, so I think we shouldn't need to rush it at least in 9.1
> then, and we'll need to see how it goes on the refactoring.
>
> It'll be perfect if rsocket works, otherwise supporting multifd with little
> overhead / exported APIs would also be a good thing in general with
> whatever approach.  And obviously all based on the facts that we can get
> resources from companies to support this feature first.
>
> Note that so far nobody has yet compared rdma vs. nic perf, so I hope if
> any of us can provide some test results please do so.  Many people are
> saying RDMA is better, but I haven't yet seen any numbers comparing it with
> modern TCP networks.  I don't want to have old impressions floating around
> even if things might have changed.  When we have consolidated results, we
> should share them out and also reflect that in QEMU's migration docs when a
> rdma document page is ready.
I also ran tests with a Mellanox ConnectX-6 100G RoCE NIC; the results are
mixed: with fewer than 3 streams native Ethernet is faster, while with more
than 3 streams rsocket performs better.

root@x4-right:~# iperf -c 1.1.1.16 -P 1

Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)

[  3] local 1.1.1.15 port 44214 connected with 1.1.1.16 port 5001
[ ID] Interval   Transfer Bandwidth
[  3] 0.0000-10.0000 sec  52.9 GBytes  45.4 Gbits/sec
root@x4-right:~# iperf -c 1.1.1.16 -P 2
[  3] local 1.1.1.15 port 33118 connected with 1.1.1.16 port 5001
[  4] local 1.1.1.15 port 33130 connected with 1.1.1.16 port 5001

Client connecting to 1.1.1.16, TCP port 5001
TCP window size: 4.00 MByte (default)

[ ID] Interval   Transfer Bandwidth
[  3] 0.0000-10.0001 sec  45.0 GBytes  38.7 Gbits/sec
[  4] 0.0000-10.0000 sec  43.9 GBytes  37.7 Gbits/sec
[SUM] 0.0000-10.0000 sec  88.9 GBytes  76.4 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.172/0.189/0.205/0.172 ms (tot/err) = 2/0
root@x4-right:~# iperf -c 1.1.1.16 -P 4

Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)

[  5] local 1.1.1.15 port 50748 connected with 1.1.1.16 port 5001
[  4] local 1.1.1.15 port 50734 connected with 1.1.1.16 port 5001
[  6] local 1.1.1.15 port 50764 connected with 1.1.1.16 port 5001
[  3] local 1.1.1.15 port 50730 connected with 1.1.1.16 port 5001
[ ID] Interval   Transfer Bandwidth
[  6] 0.0000-10.0000 sec  24.7 GBytes  21.2 Gbits/sec
[  3] 0.0000-10.0004 sec  23.6 GBytes  20.3 Gbits/sec
[  4] 0.0000-10.0000 sec  27.8 GBytes  23.9 Gbits/sec
[  5] 0.0000-10.0000 sec  28.0 GBytes  24.0 Gbits/sec
[SUM] 0.0000-10.0000 sec   104 GBytes  89.4 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.104/0.156/0.204/0.124 ms (tot/err) = 4/0
root@x4-right:~# iperf -c 1.1.1.16 -P 8
[  4] local 1.1.1.15 port 55588 connected with 1.1.1.16 port 5001
[  5] local 1.1.1.15 port 55600 connected with 1.1.1.16 port 5001

Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)

[ 10] local 1.1.1.15 port 55628 connected with 1.1.1.16 port 5001
[ 15] local 1.1.1.15 port 55648 connected with 1.1.1.16 port 5001
[  7] local 1.1.1.15 port 55620 connected with 1.1.1.16 port 5001
[  3] local 1.1.1.15 port 55584 connected with 1.1.1.16 port 5001
[ 

Re: [PATCH] hw/nvme: Add CLI options for PCI vendor/device IDs and IEEE-OUI ID

2024-05-13 Thread Markus Armbruster
Saif Abrar  writes:

> Add CLI options for user-specified
> - PCI vendor, device, subsystem vendor and subsystem IDs
> - IEEE-OUI ID
>
> e.g. PCI IDs to be specified as follows:
> -device 
> nvme,id_vendor=0xABCD,id_device=0xA0B0,id_subsys_vendor=0xEF00,id_subsys=0xEF01
>
> IEEE-OUI ID (Identify Controller bytes 75:73) is to be
> specified in LE format.
> (e.g. ieee_oui=0xABCDEF => Byte[73]=0xEF, Byte[74]=0xCD, Byte[75]=0xAB).
>
> Signed-off-by: Saif Abrar 

[...]

> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 127c3d2383..35aeb48e0b 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c

[...]

> @@ -8451,6 +8480,13 @@ static Property nvme_props[] = {
>params.sriov_max_vq_per_vf, 0),
>  DEFINE_PROP_BOOL("msix-exclusive-bar", NvmeCtrl, 
> params.msix_exclusive_bar,
>   false),
> +DEFINE_PROP_UINT16("id_vendor", NvmeCtrl, params.id_vendor, 0),
> +DEFINE_PROP_UINT16("id_device", NvmeCtrl, params.id_device, 0),
> +DEFINE_PROP_UINT16("id_subsys_vendor", NvmeCtrl,
> +params.id_subsys_vendor, 
> 0),
> +DEFINE_PROP_UINT16("id_subsys", NvmeCtrl, params.id_subsys, 0),
> +DEFINE_PROP("ieee_oui", NvmeCtrl, params.ieee_oui, nvme_prop_ieee,
> +  
> uint32_t),
>  DEFINE_PROP_END_OF_LIST(),
>  };

You add properties, not CLI options.  Properties are accessible via CLI
-device, but also via monitor device_add, qom-set, qom-get.

Please rephrase your commit message like "Add properties for".




Re: [PATCH v2 1/2] Revert "vhost-user: fix lost reconnect"

2024-05-13 Thread Michael S. Tsirkin
On Mon, May 13, 2024 at 03:10:47PM +0800, Li Feng wrote:
> This reverts commit f02a4b8e6431598612466f76aac64ab492849abf.
> 


Please include the subject of the reverted commit and the motivation for the revert.


> Signed-off-by: Li Feng 
> ---
>  hw/block/vhost-user-blk.c  |  2 +-
>  hw/scsi/vhost-user-scsi.c  |  3 +--
>  hw/virtio/vhost-user-base.c|  2 +-
>  hw/virtio/vhost-user.c | 10 ++
>  include/hw/virtio/vhost-user.h |  3 +--
>  5 files changed, 6 insertions(+), 14 deletions(-)
> 
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 9e6bbc6950..41d1ac3a5a 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -384,7 +384,7 @@ static void vhost_user_blk_event(void *opaque, 
> QEMUChrEvent event)
>  case CHR_EVENT_CLOSED:
>  /* defer close until later to avoid circular close */
>  vhost_user_async_close(dev, >chardev, >dev,
> -   vhost_user_blk_disconnect, 
> vhost_user_blk_event);
> +   vhost_user_blk_disconnect);
>  break;
>  case CHR_EVENT_BREAK:
>  case CHR_EVENT_MUX_IN:
> diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
> index a63b1f4948..48a59e020e 100644
> --- a/hw/scsi/vhost-user-scsi.c
> +++ b/hw/scsi/vhost-user-scsi.c
> @@ -214,8 +214,7 @@ static void vhost_user_scsi_event(void *opaque, 
> QEMUChrEvent event)
>  case CHR_EVENT_CLOSED:
>  /* defer close until later to avoid circular close */
>  vhost_user_async_close(dev, >conf.chardev, >dev,
> -   vhost_user_scsi_disconnect,
> -   vhost_user_scsi_event);
> +   vhost_user_scsi_disconnect);
>  break;
>  case CHR_EVENT_BREAK:
>  case CHR_EVENT_MUX_IN:
> diff --git a/hw/virtio/vhost-user-base.c b/hw/virtio/vhost-user-base.c
> index a83167191e..4b54255682 100644
> --- a/hw/virtio/vhost-user-base.c
> +++ b/hw/virtio/vhost-user-base.c
> @@ -254,7 +254,7 @@ static void vub_event(void *opaque, QEMUChrEvent event)
>  case CHR_EVENT_CLOSED:
>  /* defer close until later to avoid circular close */
>  vhost_user_async_close(dev, >chardev, >vhost_dev,
> -   vub_disconnect, vub_event);
> +   vub_disconnect);
>  break;
>  case CHR_EVENT_BREAK:
>  case CHR_EVENT_MUX_IN:
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index cdf9af4a4b..c929097e87 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -2776,7 +2776,6 @@ typedef struct {
>  DeviceState *dev;
>  CharBackend *cd;
>  struct vhost_dev *vhost;
> -IOEventHandler *event_cb;
>  } VhostAsyncCallback;
>  
>  static void vhost_user_async_close_bh(void *opaque)
> @@ -2791,10 +2790,7 @@ static void vhost_user_async_close_bh(void *opaque)
>   */
>  if (vhost->vdev) {
>  data->cb(data->dev);
> -} else if (data->event_cb) {
> -qemu_chr_fe_set_handlers(data->cd, NULL, NULL, data->event_cb,
> - NULL, data->dev, NULL, true);
> -   }
> +}
>  
>  g_free(data);
>  }
> @@ -2806,8 +2802,7 @@ static void vhost_user_async_close_bh(void *opaque)
>   */
>  void vhost_user_async_close(DeviceState *d,
>  CharBackend *chardev, struct vhost_dev *vhost,
> -vu_async_close_fn cb,
> -IOEventHandler *event_cb)
> +vu_async_close_fn cb)
>  {
>  if (!runstate_check(RUN_STATE_SHUTDOWN)) {
>  /*
> @@ -2823,7 +2818,6 @@ void vhost_user_async_close(DeviceState *d,
>  data->dev = d;
>  data->cd = chardev;
>  data->vhost = vhost;
> -data->event_cb = event_cb;
>  
>  /* Disable any further notifications on the chardev */
>  qemu_chr_fe_set_handlers(chardev,
> diff --git a/include/hw/virtio/vhost-user.h b/include/hw/virtio/vhost-user.h
> index d7c09ffd34..324cd8663a 100644
> --- a/include/hw/virtio/vhost-user.h
> +++ b/include/hw/virtio/vhost-user.h
> @@ -108,7 +108,6 @@ typedef void (*vu_async_close_fn)(DeviceState *cb);
>  
>  void vhost_user_async_close(DeviceState *d,
>  CharBackend *chardev, struct vhost_dev *vhost,
> -vu_async_close_fn cb,
> -IOEventHandler *event_cb);
> +vu_async_close_fn cb);
>  
>  #endif
> -- 
> 2.45.0




[PATCH v2 2/2] vhost-user: fix lost reconnect again

2024-05-13 Thread Li Feng
When vhost-user is reconnecting to the backend, if it fails at get_features
in vhost_dev_init(), then the reconnect fails and is never retriggered.

The reason is:
When vhost-user fails at get_features, vhost_dev_cleanup() is called
immediately.

vhost_dev_cleanup() calls 'memset(hdev, 0, sizeof(struct vhost_dev))'.

The reconnect path is:
vhost_user_blk_event
   vhost_user_async_close(.. vhost_user_blk_disconnect ..)
     qemu_chr_fe_set_handlers <- clear the notifier callback
       schedule vhost_user_async_close_bh

vhost->vdev is NULL, so vhost_user_blk_disconnect() is not called, and the
event fd callback is never reinstalled.

With this patch, vhost_user_blk_disconnect() calls vhost_dev_cleanup()
again, which is safe.

In addition, the CLOSE event may occur in a scenario where connected is
false. In that case the event handler would be cleared, so we need to
ensure that it remains installed.

All vhost-user devices have this issue, including vhost-user-blk/scsi.

Fixes: 71e076a07d ("hw/virtio: generalise CHR_EVENT_CLOSED handling")

Signed-off-by: Li Feng 
---
 hw/block/vhost-user-blk.c   |  3 ++-
 hw/scsi/vhost-user-scsi.c   |  3 ++-
 hw/virtio/vhost-user-base.c |  3 ++-
 hw/virtio/vhost-user.c  | 10 +-
 4 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 41d1ac3a5a..c6842ced48 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -353,7 +353,7 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
 VHostUserBlk *s = VHOST_USER_BLK(vdev);
 
 if (!s->connected) {
-return;
+goto done;
 }
 s->connected = false;
 
@@ -361,6 +361,7 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
 
 vhost_dev_cleanup(>dev);
 
+done:
 /* Re-instate the event handler for new connections */
 qemu_chr_fe_set_handlers(>chardev, NULL, NULL, vhost_user_blk_event,
  NULL, dev, NULL, true);
diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index 48a59e020e..b49a11d23b 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -181,7 +181,7 @@ static void vhost_user_scsi_disconnect(DeviceState *dev)
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev);
 
 if (!s->connected) {
-return;
+goto done;
 }
 s->connected = false;
 
@@ -189,6 +189,7 @@ static void vhost_user_scsi_disconnect(DeviceState *dev)
 
 vhost_dev_cleanup(>dev);
 
+done:
 /* Re-instate the event handler for new connections */
 qemu_chr_fe_set_handlers(>conf.chardev, NULL, NULL,
  vhost_user_scsi_event, NULL, dev, NULL, true);
diff --git a/hw/virtio/vhost-user-base.c b/hw/virtio/vhost-user-base.c
index 4b54255682..11e72b1e3b 100644
--- a/hw/virtio/vhost-user-base.c
+++ b/hw/virtio/vhost-user-base.c
@@ -225,13 +225,14 @@ static void vub_disconnect(DeviceState *dev)
 VHostUserBase *vub = VHOST_USER_BASE(vdev);
 
 if (!vub->connected) {
-return;
+goto done;
 }
 vub->connected = false;
 
 vub_stop(vdev);
 vhost_dev_cleanup(>vhost_dev);
 
+done:
 /* Re-instate the event handler for new connections */
 qemu_chr_fe_set_handlers(>chardev,
  NULL, NULL, vub_event,
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index c929097e87..c407ea8939 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2781,16 +2781,8 @@ typedef struct {
 static void vhost_user_async_close_bh(void *opaque)
 {
 VhostAsyncCallback *data = opaque;
-struct vhost_dev *vhost = data->vhost;
 
-/*
- * If the vhost_dev has been cleared in the meantime there is
- * nothing left to do as some other path has completed the
- * cleanup.
- */
-if (vhost->vdev) {
-data->cb(data->dev);
-}
+data->cb(data->dev);
 
 g_free(data);
 }
-- 
2.45.0




[PATCH v2 1/2] Revert "vhost-user: fix lost reconnect"

2024-05-13 Thread Li Feng
This reverts commit f02a4b8e6431598612466f76aac64ab492849abf.

Signed-off-by: Li Feng 
---
 hw/block/vhost-user-blk.c  |  2 +-
 hw/scsi/vhost-user-scsi.c  |  3 +--
 hw/virtio/vhost-user-base.c|  2 +-
 hw/virtio/vhost-user.c | 10 ++
 include/hw/virtio/vhost-user.h |  3 +--
 5 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 9e6bbc6950..41d1ac3a5a 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -384,7 +384,7 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent 
event)
 case CHR_EVENT_CLOSED:
 /* defer close until later to avoid circular close */
 vhost_user_async_close(dev, >chardev, >dev,
-   vhost_user_blk_disconnect, 
vhost_user_blk_event);
+   vhost_user_blk_disconnect);
 break;
 case CHR_EVENT_BREAK:
 case CHR_EVENT_MUX_IN:
diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index a63b1f4948..48a59e020e 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -214,8 +214,7 @@ static void vhost_user_scsi_event(void *opaque, 
QEMUChrEvent event)
 case CHR_EVENT_CLOSED:
 /* defer close until later to avoid circular close */
 vhost_user_async_close(dev, >conf.chardev, >dev,
-   vhost_user_scsi_disconnect,
-   vhost_user_scsi_event);
+   vhost_user_scsi_disconnect);
 break;
 case CHR_EVENT_BREAK:
 case CHR_EVENT_MUX_IN:
diff --git a/hw/virtio/vhost-user-base.c b/hw/virtio/vhost-user-base.c
index a83167191e..4b54255682 100644
--- a/hw/virtio/vhost-user-base.c
+++ b/hw/virtio/vhost-user-base.c
@@ -254,7 +254,7 @@ static void vub_event(void *opaque, QEMUChrEvent event)
 case CHR_EVENT_CLOSED:
 /* defer close until later to avoid circular close */
 vhost_user_async_close(dev, >chardev, >vhost_dev,
-   vub_disconnect, vub_event);
+   vub_disconnect);
 break;
 case CHR_EVENT_BREAK:
 case CHR_EVENT_MUX_IN:
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index cdf9af4a4b..c929097e87 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -2776,7 +2776,6 @@ typedef struct {
 DeviceState *dev;
 CharBackend *cd;
 struct vhost_dev *vhost;
-IOEventHandler *event_cb;
 } VhostAsyncCallback;
 
 static void vhost_user_async_close_bh(void *opaque)
@@ -2791,10 +2790,7 @@ static void vhost_user_async_close_bh(void *opaque)
  */
 if (vhost->vdev) {
 data->cb(data->dev);
-} else if (data->event_cb) {
-qemu_chr_fe_set_handlers(data->cd, NULL, NULL, data->event_cb,
- NULL, data->dev, NULL, true);
-   }
+}
 
 g_free(data);
 }
@@ -2806,8 +2802,7 @@ static void vhost_user_async_close_bh(void *opaque)
  */
 void vhost_user_async_close(DeviceState *d,
 CharBackend *chardev, struct vhost_dev *vhost,
-vu_async_close_fn cb,
-IOEventHandler *event_cb)
+vu_async_close_fn cb)
 {
 if (!runstate_check(RUN_STATE_SHUTDOWN)) {
 /*
@@ -2823,7 +2818,6 @@ void vhost_user_async_close(DeviceState *d,
 data->dev = d;
 data->cd = chardev;
 data->vhost = vhost;
-data->event_cb = event_cb;
 
 /* Disable any further notifications on the chardev */
 qemu_chr_fe_set_handlers(chardev,
diff --git a/include/hw/virtio/vhost-user.h b/include/hw/virtio/vhost-user.h
index d7c09ffd34..324cd8663a 100644
--- a/include/hw/virtio/vhost-user.h
+++ b/include/hw/virtio/vhost-user.h
@@ -108,7 +108,6 @@ typedef void (*vu_async_close_fn)(DeviceState *cb);
 
 void vhost_user_async_close(DeviceState *d,
 CharBackend *chardev, struct vhost_dev *vhost,
-vu_async_close_fn cb,
-IOEventHandler *event_cb);
+vu_async_close_fn cb);
 
 #endif
-- 
2.45.0




Re: [PATCH v3 3/5] mirror: allow specifying working bitmap

2024-05-13 Thread Markus Armbruster
Fiona Ebner  writes:

> From: John Snow 
>
> for the mirror job. The bitmap's granularity is used as the job's
> granularity.
>
> The new @bitmap parameter is marked unstable in the QAPI and can
> currently only be used for @sync=full mode.
>
> Clusters initially dirty in the bitmap as well as new writes are
> copied to the target.
>
> Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API,
> callers can simulate the three kinds of @BitmapSyncMode (which is used
> by backup):
> 1. always: default, just pass bitmap as working bitmap.
> 2. never: copy bitmap and pass copy to the mirror job.
> 3. on-success: copy bitmap and pass copy to the mirror job and if
>successful, merge bitmap into original afterwards.
>
> When the target image is a non-COW "diff image", i.e. one that was not
> used as the target of a previous mirror and the target image's cluster
> size is larger than the bitmap's granularity, or when
> @copy-mode=write-blocking is used, there is a pitfall, because the
> cluster in the target image will be allocated, but not contain all the
> data corresponding to the same region in the source image.
>
> An idea to avoid the limitation would be to mark clusters which are
> affected by unaligned writes and are not allocated in the target image
> dirty, so they would be copied fully later. However, for migration,
> the invariant that an actively synced mirror stays actively synced
> (unless an error happens) is useful, because without that invariant,
> migration might inactivate block devices when mirror still got work
> to do and run into an assertion failure [0].
>
> Another approach would be to read the missing data from the source
> upon unaligned writes to be able to write the full target cluster
> instead.
>
> But certain targets like NBD do not allow querying the cluster size.
> To avoid limiting/breaking the use case of syncing to an existing
> target, which is arguably more common than the diff image use case,
> document the limitation in QAPI.
>
> This patch was originally based on one by Ma Haocong, but it has since
> been modified pretty heavily, first by John and then again by Fiona.
>
> [0]: 
> https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/
>
> Suggested-by: Ma Haocong 
> Signed-off-by: Ma Haocong 
> Signed-off-by: John Snow 
> [FG: switch to bdrv_dirty_bitmap_merge_internal]
> Signed-off-by: Fabian Grünbichler 
> Signed-off-by: Thomas Lamprecht 
> [FE: rebase for 9.1
>  get rid of bitmap mode parameter
>  use caller-provided bitmap as working bitmap
>  turn bitmap parameter experimental]
> Signed-off-by: Fiona Ebner 

[...]

> diff --git a/include/block/block_int-global-state.h 
> b/include/block/block_int-global-state.h
> index 54f8c8cbcb..8b93db017e 100644
> --- a/include/block/block_int-global-state.h
> +++ b/include/block/block_int-global-state.h
> @@ -138,6 +138,8 @@ BlockJob *commit_active_start(const char *job_id, 
> BlockDriverState *bs,
>   * @granularity: The chosen granularity for the dirty bitmap.
>   * @buf_size: The amount of data that can be in flight at one time.
>   * @mode: Whether to collapse all images in the chain to the target.
> + * @bitmap: Use this bitmap as a working bitmap, i.e. non-dirty clusters are
> +only mirrored if written to later.
>   * @backing_mode: How to establish the target's backing chain after 
> completion.
>   * @zero_target: Whether the target should be explicitly zero-initialized
>   * @on_source_error: The action to take upon error reading from the source.
> @@ -158,7 +160,8 @@ void mirror_start(const char *job_id, BlockDriverState 
> *bs,
>BlockDriverState *target, const char *replaces,
>int creation_flags, int64_t speed,
>uint32_t granularity, int64_t buf_size,
> -  MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
> +  MirrorSyncMode mode, BdrvDirtyBitmap *bitmap,
> +  BlockMirrorBackingMode backing_mode,
>bool zero_target,
>BlockdevOnError on_source_error,
>BlockdevOnError on_target_error,
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 9d94129ea2..5ddebe71ee 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2191,6 +2191,18 @@
>  # destination (all the disk, only the sectors allocated in the
>  # topmost image, or only new I/O).
>  #
> +# @bitmap: The name of a bitmap to use as a working bitmap for
> +# sync=full mode.  This argument must not be present for other
> +# sync modes and not at the same time as @granularity.  The
> +# bitmap's granularity is used as the job's granularity.  When
> +# the target does not support COW and is a diff image, i.e. one
> +# that should only contain the delta and was not synced to
> +# previously, the target's cluster size must not be larger than
> +# the bitmap's granularity 

[PATCH v2 00/11] qcow2: make subclusters discardable

2024-05-13 Thread Andrey Drobyshev
v1: https://lists.nongnu.org/archive/html/qemu-devel/2023-10/msg07223.html

Andrey Drobyshev (11):
  qcow2: make function update_refcount_discard() global
  qcow2: simplify L2 entries accounting for discard-no-unref
  qcow2: put discard requests in the common queue when discard-no-unref
enabled
  block/file-posix: add trace event for fallocate() calls
  iotests/common.rc: add disk_usage function
  iotests/290: add test case to check 'discard-no-unref' option behavior
  qcow2: add get_sc_range_info() helper for working with subcluster
ranges
  qcow2: zeroize the entire cluster when there're no non-zero
subclusters
  qcow2: make subclusters discardable
  qcow2: zero_l2_subclusters: fall through to discard operation when
requested
  iotests/271: add test cases for subcluster-based discard/unmap

 block/file-posix.c   |   1 +
 block/qcow2-cluster.c| 346 ---
 block/qcow2-refcount.c   |   8 +-
 block/qcow2-snapshot.c   |   6 +-
 block/qcow2.c|  25 +--
 block/qcow2.h|   6 +-
 block/trace-events   |   1 +
 tests/qemu-iotests/250   |   5 -
 tests/qemu-iotests/271   |  72 ++--
 tests/qemu-iotests/271.out   |  69 ++-
 tests/qemu-iotests/290   |  34 
 tests/qemu-iotests/290.out   |  28 +++
 tests/qemu-iotests/common.rc |   6 +
 13 files changed, 490 insertions(+), 117 deletions(-)

-- 
2.39.3




[PATCH v2 03/11] qcow2: put discard requests in the common queue when discard-no-unref enabled

2024-05-13 Thread Andrey Drobyshev
Normally discard requests are stored in the queue attached to BDRVQcow2State
and processed later in one batch.  Currently the discard-no-unref option
handling causes these requests to be processed straight away.  Let's fix that.

Note that when doing regular discards, qcow2_free_any_cluster() checks for
the presence of an external data file for us and redirects the request to
the underlying data_file.  Here we want to do the same but avoid refcount
updates, so we perform the same checks.

Suggested-by: Hanna Czenczek 
Signed-off-by: Andrey Drobyshev 
---
 block/qcow2-cluster.c | 39 +--
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 5f057ba2fd..7dff0bd5a1 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1893,6 +1893,28 @@ again:
 return 0;
 }
 
+/*
+ * Helper for adding a discard request to the queue without any refcount
+ * modifications.  If external data file is used redirects the request to
+ * the corresponding BdrvChild.
+ */
+static inline void
+discard_no_unref_any_file(BlockDriverState *bs, uint64_t offset,
+  uint64_t length, QCow2ClusterType ctype,
+  enum qcow2_discard_type dtype)
+{
+BDRVQcow2State *s = bs->opaque;
+
+if (s->discard_passthrough[dtype] &&
+(ctype == QCOW2_CLUSTER_NORMAL || ctype == QCOW2_CLUSTER_ZERO_ALLOC)) {
+if (has_data_file(bs)) {
+bdrv_pdiscard(s->data_file, offset, length);
+} else {
+qcow2_queue_discard(bs, offset, length);
+}
+}
+}
+
 /*
  * This discards as many clusters of nb_clusters as possible at once (i.e.
  * all clusters in the same L2 slice) and returns the number of discarded
@@ -1974,12 +1996,10 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t 
offset, uint64_t nb_clusters,
 if (!keep_reference) {
 /* Then decrease the refcount */
 qcow2_free_any_cluster(bs, old_l2_entry, type);
-} else if (s->discard_passthrough[type] &&
-   (cluster_type == QCOW2_CLUSTER_NORMAL ||
-cluster_type == QCOW2_CLUSTER_ZERO_ALLOC)) {
+} else {
 /* If we keep the reference, pass on the discard still */
-bdrv_pdiscard(s->data_file, old_l2_entry & L2E_OFFSET_MASK,
-  s->cluster_size);
+discard_no_unref_any_file(bs, old_l2_entry & L2E_OFFSET_MASK,
+  s->cluster_size, cluster_type, type);
 }
 }
 
@@ -2088,12 +2108,11 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
 if (!keep_reference) {
 /* Then decrease the refcount */
 qcow2_free_any_cluster(bs, old_l2_entry, 
QCOW2_DISCARD_REQUEST);
-} else if (s->discard_passthrough[QCOW2_DISCARD_REQUEST] &&
-   (type == QCOW2_CLUSTER_NORMAL ||
-type == QCOW2_CLUSTER_ZERO_ALLOC)) {
+} else {
 /* If we keep the reference, pass on the discard still */
-bdrv_pdiscard(s->data_file, old_l2_entry & L2E_OFFSET_MASK,
-s->cluster_size);
+discard_no_unref_any_file(bs, old_l2_entry & L2E_OFFSET_MASK,
+  s->cluster_size, type,
+  QCOW2_DISCARD_REQUEST);
 }
 }
 }
-- 
2.39.3




[PATCH v2 04/11] block/file-posix: add trace event for fallocate() calls

2024-05-13 Thread Andrey Drobyshev
This would ease debugging of write zeroes and discard operations.

Signed-off-by: Andrey Drobyshev 
---
 block/file-posix.c | 1 +
 block/trace-events | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block/file-posix.c b/block/file-posix.c
index 35684f7e21..45134f0eef 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1859,6 +1859,7 @@ static int translate_err(int err)
 static int do_fallocate(int fd, int mode, off_t offset, off_t len)
 {
 do {
+trace_file_do_fallocate(fd, mode, offset, len);
 if (fallocate(fd, mode, offset, len) == 0) {
 return 0;
 }
diff --git a/block/trace-events b/block/trace-events
index 8e789e1f12..2f7ad28996 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -203,6 +203,7 @@ curl_setup_preadv(uint64_t bytes, uint64_t start, const 
char *range) "reading %"
 curl_close(void) "close"
 
 # file-posix.c
+file_do_fallocate(int fd, int mode, int64_t offset, int64_t len) "fd=%d 
mode=0x%02x offset=%" PRIi64 " len=%" PRIi64
 file_copy_file_range(void *bs, int src, int64_t src_off, int dst, int64_t 
dst_off, int64_t bytes, int flags, int64_t ret) "bs %p src_fd %d offset 
%"PRIu64" dst_fd %d offset %"PRIu64" bytes %"PRIu64" flags %d ret %"PRId64
 file_FindEjectableOpticalMedia(const char *media) "Matching using %s"
 file_setup_cdrom(const char *partition) "Using %s as optical disc"
-- 
2.39.3




[PATCH v2 10/11] qcow2: zero_l2_subclusters: fall through to discard operation when requested

2024-05-13 Thread Andrey Drobyshev
When zeroizing subclusters within a single cluster, detect usage of the
BDRV_REQ_MAY_UNMAP flag and fall through to the subcluster-based discard
operation, much like it's done with the cluster-based discards.  That way
subcluster-aligned operations "qemu-io -c 'write -z -u ...'" will lead to
an actual unmap.

Signed-off-by: Andrey Drobyshev 
---
 block/qcow2-cluster.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 3c134a7e80..53e04eff93 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2107,15 +2107,16 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t 
offset, uint64_t nb_clusters,
 
 static int coroutine_fn GRAPH_RDLOCK
 discard_l2_subclusters(BlockDriverState *bs, uint64_t offset,
-   uint64_t nb_subclusters,
-   enum qcow2_discard_type type,
-   bool full_discard)
+   uint64_t nb_subclusters, enum qcow2_discard_type type,
+   bool full_discard, bool uncond_zeroize)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t new_l2_bitmap, bitmap_alloc_mask, bitmap_zero_mask;
 int ret, sc = offset_to_sc_index(s, offset);
 g_auto(SubClusterRangeInfo) scri = { 0 };
 
+assert(!(full_discard && uncond_zeroize));
+
 ret = get_sc_range_info(bs, offset, nb_subclusters, );
 if (ret < 0) {
 return ret;
@@ -2140,7 +2141,8 @@ discard_l2_subclusters(BlockDriverState *bs, uint64_t 
offset,
  */
 if (full_discard) {
 new_l2_bitmap &= ~bitmap_zero_mask;
-} else if (bs->backing || scri.l2_bitmap & bitmap_alloc_mask) {
+} else if (uncond_zeroize || bs->backing ||
+   scri.l2_bitmap & bitmap_alloc_mask) {
 new_l2_bitmap |= bitmap_zero_mask;
 }
 
@@ -2197,7 +2199,7 @@ int qcow2_subcluster_discard(BlockDriverState *bs, 
uint64_t offset,
 if (head) {
 ret = discard_l2_subclusters(bs, offset - head,
  size_to_subclusters(s, head), type,
- full_discard);
+ full_discard, false);
 if (ret < 0) {
 goto fail;
 }
@@ -2221,7 +2223,7 @@ int qcow2_subcluster_discard(BlockDriverState *bs, 
uint64_t offset,
 if (tail) {
 ret = discard_l2_subclusters(bs, end_offset,
  size_to_subclusters(s, tail), type,
- full_discard);
+ full_discard, false);
 if (ret < 0) {
 goto fail;
 }
@@ -2318,6 +2320,18 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t 
offset,
 int ret, sc = offset_to_sc_index(s, offset);
 g_auto(SubClusterRangeInfo) scri = { 0 };
 
+/*
+ * If the request allows discarding subclusters we go down the discard
+ * path regardless of their allocation status.  After the discard
+ * operation with full_discard=false subclusters are going to be read as
+ * zeroes anyway.  But we make sure that subclusters are explicitly
+ * zeroed anyway with uncond_zeroize=true.
+ */
+if (flags & BDRV_REQ_MAY_UNMAP) {
+return discard_l2_subclusters(bs, offset, nb_subclusters,
+  QCOW2_DISCARD_REQUEST, false, true);
+}
+
 ret = get_sc_range_info(bs, offset, nb_subclusters, );
 if (ret < 0) {
 return ret;
-- 
2.39.3




[PATCH v2 07/11] qcow2: add get_sc_range_info() helper for working with subcluster ranges

2024-05-13 Thread Andrey Drobyshev
This helper simply obtains the L2 table parameters of the cluster which
contains the given subcluster range.  Right now this info is obtained and
used by zero_l2_subclusters().  As we're about to introduce the subcluster
discard operation, this helper lets us avoid code duplication.

Also introduce struct SubClusterRangeInfo, which contains all the needed
parameters.

Signed-off-by: Andrey Drobyshev 
---
 block/qcow2-cluster.c | 140 --
 1 file changed, 108 insertions(+), 32 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 7dff0bd5a1..475f167035 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1915,6 +1915,103 @@ discard_no_unref_any_file(BlockDriverState *bs, 
uint64_t offset,
 }
 }
 
+/*
+ * Structure containing info about the subclusters range within one cluster.
+ *
+ * Since @l2_slice is a strong reference to the l2 table slice containing
+ * the corresponding l2 entry, it must be explicitly released by
+ * qcow2_cache_put().  Thus the user must either declare it with g_auto()
+ * (in which case sc_range_info_cleanup() is called automatically) or do
+ * the cleanup themselves.
+ */
+typedef struct SubClusterRangeInfo {
+uint64_t *l2_slice;
+int l2_index;
+uint64_t l2_entry;
+uint64_t l2_bitmap;
+QCow2ClusterType ctype;
+Qcow2Cache *l2_table_cache;
+} SubClusterRangeInfo;
+
+static void sc_range_info_cleanup(SubClusterRangeInfo *scri)
+{
+if (scri->l2_table_cache && scri->l2_slice) {
+qcow2_cache_put(scri->l2_table_cache, (void **) >l2_slice);
+}
+}
+
+G_DEFINE_AUTO_CLEANUP_CLEAR_FUNC(SubClusterRangeInfo, sc_range_info_cleanup);
+
+/*
+ * For a given @offset and @nb_subclusters, fill out the SubClusterRangeInfo
+ * structure describing the subclusters range and referred to by @scri.
+ * Only the subclusters which can be independently discarded/zeroized
+ * (i.e. not compressed or invalid) are considered to be valid here.
+ *
+ * The subclusters range is denoted by @offset and @nb_subclusters and must
+ * not cross the cluster boundary.  @offset must be aligned to the subcluster
+ * size.
+ *
+ * Return: 0 if the SubClusterRangeInfo is successfully filled out and the
+ * subclusters within the given range might be discarded/zeroized;
+ * -EINVAL if any of the subclusters within the range is invalid;
+ * -ENOTSUP if the range is contained within a compressed cluster.
+ */
+static int GRAPH_RDLOCK
+get_sc_range_info(BlockDriverState *bs, uint64_t offset,
+  unsigned nb_subclusters, SubClusterRangeInfo *scri)
+{
+BDRVQcow2State *s = bs->opaque;
+int ret, sc_cleared, sc_index = offset_to_sc_index(s, offset);
+QCow2SubclusterType sctype;
+
+/* Here we only work with the subclusters within single cluster. */
+assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster);
+assert(sc_index + nb_subclusters <= s->subclusters_per_cluster);
+assert(offset_into_subcluster(s, offset) == 0);
+
+scri->l2_table_cache = s->l2_table_cache;
+
+ret = get_cluster_table(bs, offset, &scri->l2_slice, &scri->l2_index);
+if (ret < 0) {
+goto cleanup;
+}
+
+scri->l2_entry = get_l2_entry(s, scri->l2_slice, scri->l2_index);
+scri->l2_bitmap = get_l2_bitmap(s, scri->l2_slice, scri->l2_index);
+scri->ctype = qcow2_get_cluster_type(bs, scri->l2_entry);
+
+sc_cleared = 0;
+do {
+ret = qcow2_get_subcluster_range_type(
+bs, scri->l2_entry, scri->l2_bitmap, sc_index + sc_cleared,
+);
+if (ret < 0) {
+goto cleanup;
+}
+
+switch (sctype) {
+case QCOW2_SUBCLUSTER_COMPRESSED:
+/* We cannot partially zeroize/discard compressed clusters. */
+ret = -ENOTSUP;
+goto cleanup;
+case QCOW2_SUBCLUSTER_INVALID:
+ret = -EINVAL;
+goto cleanup;
+default:
+break;
+}
+
+sc_cleared += ret;
+} while (sc_cleared < nb_subclusters);
+
+return 0;
+
+cleanup:
+sc_range_info_cleanup(scri);
+return ret;
+}
+
 /*
  * This discards as many clusters of nb_clusters as possible at once (i.e.
  * all clusters in the same L2 slice) and returns the number of discarded
@@ -2127,46 +2224,25 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
 unsigned nb_subclusters)
 {
 BDRVQcow2State *s = bs->opaque;
-uint64_t *l2_slice;
-uint64_t old_l2_bitmap, l2_bitmap;
-int l2_index, ret, sc = offset_to_sc_index(s, offset);
-
-/* For full clusters use zero_in_l2_slice() instead */
-assert(nb_subclusters > 0 && nb_subclusters < s->subclusters_per_cluster);
-assert(sc + nb_subclusters <= s->subclusters_per_cluster);
-assert(offset_into_subcluster(s, offset) == 0);
+uint64_t new_l2_bitmap;
+int ret, sc = offset_to_sc_index(s, offset);
+g_auto(SubClusterRangeInfo) scri = { 0 };
 
-ret = get_cluster_table(bs, offset, &l2_slice, &l2_index);

[PATCH v2 05/11] iotests/common.rc: add disk_usage function

2024-05-13 Thread Andrey Drobyshev
Move the definition from iotests/250 to common.rc.  This is used to
detect real disk usage of sparse files.  In particular, we want to use
it for checking subclusters-based discards.

Signed-off-by: Andrey Drobyshev 
---
 tests/qemu-iotests/250   | 5 -
 tests/qemu-iotests/common.rc | 6 ++
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/250 b/tests/qemu-iotests/250
index af48f83aba..c0a0dbc0ff 100755
--- a/tests/qemu-iotests/250
+++ b/tests/qemu-iotests/250
@@ -52,11 +52,6 @@ _unsupported_imgopts data_file
 # bdrv_co_truncate(bs->file) call in qcow2_co_truncate(), which might succeed
 # anyway.
 
-disk_usage()
-{
-du --block-size=1 $1 | awk '{print $1}'
-}
-
 size=2100M
 
 _make_test_img -o "cluster_size=1M,preallocation=metadata" $size
diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index 95c12577dd..237f746af8 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -140,6 +140,12 @@ _optstr_add()
 fi
 }
 
+# report real disk usage for sparse files
+disk_usage()
+{
+du --block-size=1 "$1" | awk '{print $1}'
+}
+
 # Set the variables to the empty string to turn Valgrind off
 # for specific processes, e.g.
 # $ VALGRIND_QEMU_IO= ./check -qcow2 -valgrind 015
-- 
2.39.3




[PATCH v2 11/11] iotests/271: add test cases for subcluster-based discard/unmap

2024-05-13 Thread Andrey Drobyshev
Add a bunch of test cases covering the new subcluster behaviour: unmap of
last allocated subclusters; unmap of subclusters within unallocated
cluster; discard of unallocated subclusters within a cluster; regular discard
of subclusters within a cluster; discard of last allocated subclusters.

Also make all discard/unmap operations enable trace point 'file_do_fallocate'
so that actual fallocate() calls are visible.

Signed-off-by: Andrey Drobyshev 
---
 tests/qemu-iotests/271 | 70 +-
 tests/qemu-iotests/271.out | 69 ++---
 2 files changed, 119 insertions(+), 20 deletions(-)

diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
index 04c57813c2..8b80682cff 100755
--- a/tests/qemu-iotests/271
+++ b/tests/qemu-iotests/271
@@ -81,6 +81,12 @@ _verify_l2_bitmap()
 fi
 }
 
+# Filter out trace params which we don't control (file fd)
+_filter_trace_fallocate()
+{
+sed -E 's/fd=[0-9]+/fd=N/g'
+}
+
 # This should be called as _run_test c=XXX sc=XXX off=XXX len=XXX cmd=XXX
 # c:   cluster number (0 if unset)
 # sc:  subcluster number inside cluster @c (0 if unset)
@@ -94,12 +100,13 @@ _verify_l2_bitmap()
 #  discard  -> discard
 _run_test()
 {
-unset c sc off len cmd
+unset c sc off len cmd opt
 for var in "$@"; do eval "$var"; done
 case "${cmd:-write}" in
 zero)
 cmd="write -q -z";;
 unmap)
+opt="--trace enable=file_do_fallocate"
 cmd="write -q -z -u";;
 compress)
 pat=$((${pat:-0} + 1))
@@ -108,6 +115,7 @@ _run_test()
 pat=$((${pat:-0} + 1))
 cmd="write -q -P ${pat}";;
 discard)
+opt="--trace enable=file_do_fallocate"
 cmd="discard -q";;
 *)
 echo "Unknown option $cmd"
@@ -121,7 +129,7 @@ _run_test()
 cmd="$cmd ${offset} ${len}"
 raw_cmd=$(echo $cmd | sed s/-c//) # Raw images don't support -c
 echo $cmd | sed 's/-P [0-9][0-9]\?/-P PATTERN/'
-$QEMU_IO -c "$cmd" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO $opt -c "$cmd" "$TEST_IMG" 2>&1 | _filter_qemu_io | _filter_trace_fallocate
 $QEMU_IO -c "$raw_cmd" -f raw "$TEST_IMG.raw" | _filter_qemu_io
 _verify_img
 _verify_l2_bitmap "$c"
@@ -202,9 +210,10 @@ for use_backing_file in yes no; do
 alloc="$(seq 16 31)"; zero="$(seq 0 15)"
 _run_test sc=0 len=32k cmd=unmap
 
-### Zero and unmap cluster #0
+### Zero and unmap second half of cluster #0 (this will unmap it and
+### discard l2 entry)
 alloc=""; zero="$(seq 0 31)"
-_run_test sc=0 len=64k cmd=unmap
+_run_test sc=16 len=32k cmd=unmap
 
 ### Write subcluster #1 (middle of subcluster)
 alloc="1"; zero="0 $(seq 2 31)"
@@ -439,6 +448,12 @@ for use_backing_file in yes no; do
 _verify_l2_bitmap 16
 _verify_l2_bitmap 17
 
+# Unmap subclusters #0-#3 of an unallocated cluster #8.  Since
+# 'write -z -u' doesn't lead to full discard, subclusters should become
+# explicitly zeroized
+alloc=""; zero="$(seq 0 3)"
+_run_test c=8 sc=0 len=8k cmd=unmap
+
 # Cluster-aligned request from clusters #9 to #11
 alloc=""; zero="$(seq 0 31)"
 _run_test c=9 sc=0 len=192k cmd=unmap
@@ -523,26 +538,45 @@ for use_backing_file in yes no; do
 echo
echo "### Discarding clusters with non-zero bitmaps (backing file: $use_backing_file) ###"
 echo
+
+_reset_img 1M
+
+# Write first half of cluster #0 and discard another half
+alloc="$(seq 0 15)"; zero=""
+_run_test sc=0 len=32k
+# When discarding unallocated subclusters, they only have to be
+# explicitly zeroized when the image has a backing file
 if [ "$use_backing_file" = "yes" ]; then
-_make_test_img -o extended_l2=on -F raw -b "$TEST_IMG.base" 1M
+alloc="$(seq 0 15)"; zero="$(seq 16 31)"
 else
-_make_test_img -o extended_l2=on 1M
+alloc="$(seq 0 15)"; zero=""
 fi
-# Write clusters #0-#2 and then discard them
-$QEMU_IO -c 'write -q 0 128k' "$TEST_IMG"
-$QEMU_IO -c 'discard -q 0 128k' "$TEST_IMG"
+_run_test sc=16 len=32k cmd=discard
+
+# Write cluster #0 and discard its subclusters #0-#3
+alloc="$(seq 0 31)"; zero=""
+_run_test sc=0 len=64k
+alloc="$(seq 4 31)"; zero="$(seq 0 3)"
+_run_test sc=0 len=8k cmd=discard
+
+# Discard remaining subclusters #4-#31.  This should unmap the cluster
+# and discard its l2 entry
+alloc=""; zero="$(seq 0 31)"
+_run_test sc=4 len=56k cmd=discard
+
+# Write clusters #0-#1 and then discard them
+alloc="$(seq 0 31)"; zero=""
+_run_test sc=0 len=128k
 # 'qemu-io discard' doesn't do a full discard, it zeroizes the
-# cluster, so both clusters have all zero bits set now
+# cluster, so both clusters have all zero bits set after discard
 alloc=""; zero="$(seq 0 31)"
-_verify_l2_bitmap 0
+_run_test sc=0 len=128k cmd=discard
 _verify_l2_bitmap 1
+
 # 

[PATCH v2 01/11] qcow2: make function update_refcount_discard() global

2024-05-13 Thread Andrey Drobyshev
We are going to need it for discarding separate subclusters.  The
function itself doesn't do anything with the refcount tables, it simply
adds a discard request to the queue, so rename it to qcow2_queue_discard().

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Hanna Czenczek 
---
 block/qcow2-refcount.c | 8 
 block/qcow2.h  | 2 ++
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 0266542cee..2026f5fa21 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -754,8 +754,8 @@ void qcow2_process_discards(BlockDriverState *bs, int ret)
 }
 }
 
-static void update_refcount_discard(BlockDriverState *bs,
-uint64_t offset, uint64_t length)
+void qcow2_queue_discard(BlockDriverState *bs, uint64_t offset,
+ uint64_t length)
 {
 BDRVQcow2State *s = bs->opaque;
 Qcow2DiscardRegion *d, *p, *next;
@@ -902,7 +902,7 @@ update_refcount(BlockDriverState *bs, int64_t offset, int64_t length,
 }
 
 if (s->discard_passthrough[type]) {
-update_refcount_discard(bs, cluster_offset, s->cluster_size);
+qcow2_queue_discard(bs, cluster_offset, s->cluster_size);
 }
 }
 }
@@ -3619,7 +3619,7 @@ qcow2_discard_refcount_block(BlockDriverState *bs, uint64_t discard_block_offs)
 /* discard refblock from the cache if refblock is cached */
 qcow2_cache_discard(s->refcount_block_cache, refblock);
 }
-update_refcount_discard(bs, discard_block_offs, s->cluster_size);
+qcow2_queue_discard(bs, discard_block_offs, s->cluster_size);
 
 return 0;
 }
diff --git a/block/qcow2.h b/block/qcow2.h
index a9e3481c6e..197bdcdf53 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -891,6 +891,8 @@ int coroutine_fn qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *re
BdrvCheckMode fix);
 
 void GRAPH_RDLOCK qcow2_process_discards(BlockDriverState *bs, int ret);
+void qcow2_queue_discard(BlockDriverState *bs, uint64_t offset,
+ uint64_t length);
 
 int GRAPH_RDLOCK
 qcow2_check_metadata_overlap(BlockDriverState *bs, int ign, int64_t offset,
-- 
2.39.3




[PATCH v2 09/11] qcow2: make subclusters discardable

2024-05-13 Thread Andrey Drobyshev
This commit makes the discard operation work on the subcluster level
rather than cluster level.  It introduces discard_l2_subclusters()
function and makes use of it in qcow2 discard implementation, much like
it's done with zero_in_l2_slice() / zero_l2_subclusters().  It also
changes the qcow2 driver pdiscard_alignment to subcluster_size.  That
way subcluster-aligned discards lead to actual fallocate(PUNCH_HOLE)
operation and free host disk space.

This feature will let us gain additional disk space on guest
TRIM/discard requests, especially when using large enough clusters
(1M, 2M) with subclusters enabled.

Also rename qcow2_cluster_discard() -> qcow2_subcluster_discard() to
reflect the change.

Signed-off-by: Andrey Drobyshev 
---
 block/qcow2-cluster.c  | 106 +
 block/qcow2-snapshot.c |   6 +--
 block/qcow2.c  |  25 +-
 block/qcow2.h  |   4 +-
 tests/qemu-iotests/271 |   2 +-
 5 files changed, 117 insertions(+), 26 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 8d39e2f960..3c134a7e80 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2105,25 +2105,106 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters,
 return nb_clusters;
 }
 
-int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
-  uint64_t bytes, enum qcow2_discard_type type,
-  bool full_discard)
+static int coroutine_fn GRAPH_RDLOCK
+discard_l2_subclusters(BlockDriverState *bs, uint64_t offset,
+   uint64_t nb_subclusters,
+   enum qcow2_discard_type type,
+   bool full_discard)
+{
+BDRVQcow2State *s = bs->opaque;
+uint64_t new_l2_bitmap, bitmap_alloc_mask, bitmap_zero_mask;
+int ret, sc = offset_to_sc_index(s, offset);
+g_auto(SubClusterRangeInfo) scri = { 0 };
+
+ret = get_sc_range_info(bs, offset, nb_subclusters, &scri);
+if (ret < 0) {
+return ret;
+}
+
+new_l2_bitmap = scri.l2_bitmap;
+bitmap_alloc_mask = QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
+bitmap_zero_mask = QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
+
+new_l2_bitmap &= ~bitmap_alloc_mask;
+
+/*
+ * Full discard means we fall through to the backing file, thus we need
+ * to mark the subclusters as deallocated and clear the corresponding
+ * zero bits.
+ *
+ * Non-full discard means subclusters should be explicitly marked as
+ * zeroes.  In this case QCOW2 specification requires the corresponding
+ * allocation status bits to be unset as well.  If the subclusters are
+ * deallocated in the first place and there's no backing, the operation
+ * can be skipped.
+ */
+if (full_discard) {
+new_l2_bitmap &= ~bitmap_zero_mask;
+} else if (bs->backing || scri.l2_bitmap & bitmap_alloc_mask) {
+new_l2_bitmap |= bitmap_zero_mask;
+}
+
+/*
+ * If after discarding this range no allocated subclusters are left, and
+ * the new bitmap becomes the same as it would be after discarding the
+ * whole cluster, we'd better discard the cluster entirely.  That way we
+ * also update the refcount table.
+ */
+if ((full_discard && new_l2_bitmap == 0) ||
+(!full_discard && new_l2_bitmap == QCOW_L2_BITMAP_ALL_ZEROES)) {
+return discard_in_l2_slice(
+bs, QEMU_ALIGN_DOWN(offset, s->cluster_size),
+1, type, full_discard);
+}
+
+if (scri.l2_bitmap != new_l2_bitmap) {
+set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
+}
+
+discard_no_unref_any_file(
+bs, (scri.l2_entry & L2E_OFFSET_MASK) + offset_into_cluster(s, offset),
+nb_subclusters * s->subcluster_size, scri.ctype, type);
+
+return 0;
+}
+
+int qcow2_subcluster_discard(BlockDriverState *bs, uint64_t offset,
+ uint64_t bytes, enum qcow2_discard_type type,
+ bool full_discard)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t end_offset = offset + bytes;
 uint64_t nb_clusters;
+unsigned head, tail;
 int64_t cleared;
 int ret;
 
 /* Caller must pass aligned values, except at image end */
-assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
-assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
+assert(QEMU_IS_ALIGNED(offset, s->subcluster_size));
+assert(QEMU_IS_ALIGNED(end_offset, s->subcluster_size) ||
end_offset == bs->total_sectors << BDRV_SECTOR_BITS);
 
-nb_clusters = size_to_clusters(s, bytes);
+head = MIN(end_offset, ROUND_UP(offset, s->cluster_size)) - offset;
+offset += head;
+
+tail = (end_offset >= bs->total_sectors << BDRV_SECTOR_BITS) ? 0 :
+   end_offset - MAX(offset, start_of_cluster(s, end_offset));
+end_offset -= tail;
 
 

[PATCH v2 02/11] qcow2: simplify L2 entries accounting for discard-no-unref

2024-05-13 Thread Andrey Drobyshev
Commits 42a2890a and b2b10904 introduce handling of the discard-no-unref
option in discard_in_l2_slice() and zero_in_l2_slice().  They add even
more if's when choosing the right l2 entry.  What we really need for this
option is for the new entry simply to contain the same host cluster offset,
no matter whether we unmap or zeroize the cluster.  For that, OR'ing with
the old entry is enough.

This patch doesn't change the logic and is pure refactoring.

Signed-off-by: Andrey Drobyshev 
---
 block/qcow2-cluster.c | 34 +++---
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index ce8c0076b3..5f057ba2fd 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1946,25 +1946,21 @@ discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, uint64_t nb_clusters,
 new_l2_entry = new_l2_bitmap = 0;
 } else if (bs->backing || qcow2_cluster_is_allocated(cluster_type)) {
 if (has_subclusters(s)) {
-if (keep_reference) {
-new_l2_entry = old_l2_entry;
-} else {
-new_l2_entry = 0;
-}
+new_l2_entry = 0;
 new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
 } else {
-if (s->qcow_version >= 3) {
-if (keep_reference) {
-new_l2_entry |= QCOW_OFLAG_ZERO;
-} else {
-new_l2_entry = QCOW_OFLAG_ZERO;
-}
-} else {
-new_l2_entry = 0;
-}
+new_l2_entry = s->qcow_version >= 3 ? QCOW_OFLAG_ZERO : 0;
 }
 }
 
+/*
+ * No need to check for the QCOW version since discard-no-unref is
+ * only allowed since version 3.
+ */
+if (keep_reference) {
+new_l2_entry |= old_l2_entry;
+}
+
 if (old_l2_entry == new_l2_entry && old_l2_bitmap == new_l2_bitmap) {
 continue;
 }
@@ -2064,19 +2060,19 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
 ((flags & BDRV_REQ_MAY_UNMAP) && qcow2_cluster_is_allocated(type));
 bool keep_reference =
 (s->discard_no_unref && type != QCOW2_CLUSTER_COMPRESSED);
-uint64_t new_l2_entry = old_l2_entry;
+uint64_t new_l2_entry = unmap ? 0 : old_l2_entry;
 uint64_t new_l2_bitmap = old_l2_bitmap;
 
-if (unmap && !keep_reference) {
-new_l2_entry = 0;
-}
-
 if (has_subclusters(s)) {
 new_l2_bitmap = QCOW_L2_BITMAP_ALL_ZEROES;
 } else {
 new_l2_entry |= QCOW_OFLAG_ZERO;
 }
 
+if (keep_reference) {
+new_l2_entry |= old_l2_entry;
+}
+
 if (old_l2_entry == new_l2_entry && old_l2_bitmap == new_l2_bitmap) {
 continue;
 }
-- 
2.39.3




[PATCH v2 06/11] iotests/290: add test case to check 'discard-no-unref' option behavior

2024-05-13 Thread Andrey Drobyshev
We basically fill 2 images with identical data and perform discard
operations with and without 'discard-no-unref' enabled.  Then we check
that images still read identically, that their disk usage is the same
(i.e. fallocate(FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE) is called for
both) and that with the option enabled cluster is still marked as
allocated in "qemu-img map" output.  We also check that the option
doesn't work with qcow2 v2 images.

Signed-off-by: Andrey Drobyshev 
---
 tests/qemu-iotests/290 | 34 ++
 tests/qemu-iotests/290.out | 28 
 2 files changed, 62 insertions(+)

diff --git a/tests/qemu-iotests/290 b/tests/qemu-iotests/290
index 776b59de1b..4eb929d15f 100755
--- a/tests/qemu-iotests/290
+++ b/tests/qemu-iotests/290
@@ -92,6 +92,40 @@ for qcow2_compat in 0.10 1.1; do
 $QEMU_IMG map "$TEST_IMG" | _filter_testdir
 done
 
+echo
+echo "### Test 'qemu-io -c discard' with 'discard-no-unref' option enabled"
+echo
+
+echo "# Check that qcow2 v2 images don't support 'discard-no-unref' option"
+NOUNREF_IMG="$TEST_IMG.nounref"
+TEST_IMG="$NOUNREF_IMG" _make_test_img -o "compat=0.10" 128k
+# This should immediately fail with an error
+$QEMU_IO -c 'reopen -o discard-no-unref=on' "$NOUNREF_IMG" | _filter_qemu_io
+
+echo "# Create two compat=1.1 images and fill them with identical data"
+_make_test_img -o "compat=1.1" 128k
+TEST_IMG="$NOUNREF_IMG" _make_test_img -o "compat=1.1" 128k
+$QEMU_IO -c 'write -P 0xaa 0 128k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c 'write -P 0xaa 0 128k' "$NOUNREF_IMG" | _filter_qemu_io
+
+echo "# Enable 'discard-no-unref' in one of them, discard 2nd cluster in both"
+$QEMU_IO -c 'discard 64k 64k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c 'reopen -o discard-no-unref=on' \
+ -c 'discard 64k 64k' "$NOUNREF_IMG" | _filter_qemu_io
+
+echo "# Compare disk usage of the 2 images"
+# Don't check the exact disk usage values but rather that they're equal
+echo "disk_usage(`basename $TEST_IMG`) - disk_usage(`basename $NOUNREF_IMG`)" \
+ "= $(( `disk_usage $TEST_IMG` - `disk_usage $NOUNREF_IMG`))"
+
+echo "# Check that images are still identical"
+$QEMU_IMG compare "$TEST_IMG" "$NOUNREF_IMG"
+
+echo "# Output of qemu-img map for the image with dropped reference"
+$QEMU_IMG map --output=json "$TEST_IMG" | _filter_qemu_img_map
+echo "# Output of qemu-img map for the image with kept reference"
+$QEMU_IMG map --output=json "$NOUNREF_IMG" | _filter_qemu_img_map
+
 # success, all done
 echo "*** done"
 rm -f $seq.full
diff --git a/tests/qemu-iotests/290.out b/tests/qemu-iotests/290.out
index 22b476594f..f790feae81 100644
--- a/tests/qemu-iotests/290.out
+++ b/tests/qemu-iotests/290.out
@@ -58,4 +58,32 @@ read 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 # Output of qemu-img map
 Offset  Length  Mapped to   File
+
+### Test 'qemu-io -c discard' with 'discard-no-unref' option enabled
+
+# Check that qcow2 v2 images don't support 'discard-no-unref' option
+Formatting 'TEST_DIR/t.IMGFMT.nounref', fmt=IMGFMT size=131072
+qemu-io: discard-no-unref is only supported since qcow2 version 3
+# Create two compat=1.1 images and fill them with identical data
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072
+Formatting 'TEST_DIR/t.IMGFMT.nounref', fmt=IMGFMT size=131072
+wrote 131072/131072 bytes at offset 0
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 0
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+# Enable 'discard-no-unref' in one of them, discard 2nd cluster in both
+discard 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discard 65536/65536 bytes at offset 65536
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+# Compare disk usage of the 2 images
+disk_usage(t.qcow2) - disk_usage(t.qcow2.nounref) = 0
+# Check that images are still identical
+Images are identical.
+# Output of qemu-img map for the image with dropped reference
+[{ "start": 0, "length": 65536, "depth": 0, "present": true, "zero": false, "data": true, "compressed": false, "offset": OFFSET},
+{ "start": 65536, "length": 65536, "depth": 0, "present": true, "zero": true, "data": false, "compressed": false}]
+# Output of qemu-img map for the image with kept reference
+[{ "start": 0, "length": 65536, "depth": 0, "present": true, "zero": false, "data": true, "compressed": false, "offset": OFFSET},
+{ "start": 65536, "length": 65536, "depth": 0, "present": true, "zero": true, "data": false, "compressed": false, "offset": OFFSET}]
 *** done
-- 
2.39.3




[PATCH v2 08/11] qcow2: zeroize the entire cluster when there're no non-zero subclusters

2024-05-13 Thread Andrey Drobyshev
When zeroizing the last non-zero subclusters within a single cluster, it
makes sense to zeroize the entire cluster and go down the zero_in_l2_slice()
path right away.  That way we also update the corresponding refcount
table.

Signed-off-by: Andrey Drobyshev 
Reviewed-by: Hanna Czenczek 
---
 block/qcow2-cluster.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 475f167035..8d39e2f960 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2221,7 +2221,7 @@ zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
 
 static int coroutine_fn GRAPH_RDLOCK
 zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
-unsigned nb_subclusters)
+unsigned nb_subclusters, int flags)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t new_l2_bitmap;
@@ -2237,6 +2237,16 @@ zero_l2_subclusters(BlockDriverState *bs, uint64_t offset,
 new_l2_bitmap |=  QCOW_OFLAG_SUB_ZERO_RANGE(sc, sc + nb_subclusters);
 new_l2_bitmap &= ~QCOW_OFLAG_SUB_ALLOC_RANGE(sc, sc + nb_subclusters);
 
+/*
+ * If there are no non-zero subclusters left, we might as well zeroize
+ * the entire cluster.  That way we'd also update the refcount table.
+ */
+if ((new_l2_bitmap & QCOW_L2_BITMAP_ALL_ZEROES) ==
+QCOW_L2_BITMAP_ALL_ZEROES) {
+return zero_in_l2_slice(bs, QEMU_ALIGN_DOWN(offset, s->cluster_size),
+1, flags);
+}
+
 if (new_l2_bitmap != scri.l2_bitmap) {
 set_l2_bitmap(s, scri.l2_slice, scri.l2_index, new_l2_bitmap);
 qcow2_cache_entry_mark_dirty(s->l2_table_cache, scri.l2_slice);
@@ -2293,7 +2303,7 @@ int coroutine_fn qcow2_subcluster_zeroize(BlockDriverState *bs, uint64_t offset,
 
 if (head) {
 ret = zero_l2_subclusters(bs, offset - head,
-  size_to_subclusters(s, head));
+  size_to_subclusters(s, head), flags);
 if (ret < 0) {
 goto fail;
 }
@@ -2314,7 +2324,8 @@ int coroutine_fn qcow2_subcluster_zeroize(BlockDriverState *bs, uint64_t offset,
 }
 
 if (tail) {
-ret = zero_l2_subclusters(bs, end_offset, size_to_subclusters(s, tail));
+ret = zero_l2_subclusters(bs, end_offset,
+  size_to_subclusters(s, tail), flags);
 if (ret < 0) {
 goto fail;
 }
-- 
2.39.3