[PATCH] linux-user: Do not treat madvise()'s advice as a bitmask

2022-07-25 Thread Ilya Leoshkevich
Advice is an enum, not a set of flags. Doing (advice & MADV_DONTNEED) also
matches e.g. MADV_MERGEABLE.
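
For illustration, a minimal standalone demo (on Linux, MADV_DONTNEED is 4
and MADV_MERGEABLE is 12, so the two advice values share a bit):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        int advice = MADV_MERGEABLE;

        /* 12 & 4 == 4: the bitmask test wrongly fires for MADV_MERGEABLE */
        if (advice & MADV_DONTNEED) {
            puts("bitmask check: mistaken for MADV_DONTNEED");
        }
        /* the equality test only matches the intended value */
        if (advice == MADV_DONTNEED) {
            puts("equality check: not reached for MADV_MERGEABLE");
        }
        return 0;
    }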

Signed-off-by: Ilya Leoshkevich 
---
 linux-user/mmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/linux-user/mmap.c b/linux-user/mmap.c
index 4e7a6be6ee..edceaca4a8 100644
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -891,7 +891,7 @@ abi_long target_madvise(abi_ulong start, abi_ulong len_in, int advice)
  * anonymous mappings. In this case passthrough is safe, so do it.
  */
 mmap_lock();
-if ((advice & MADV_DONTNEED) &&
+if (advice == MADV_DONTNEED &&
 can_passthrough_madv_dontneed(start, end)) {
 ret = get_errno(madvise(g2h_untagged(start), len, MADV_DONTNEED));
 }
-- 
2.35.3




Re: [PATCH v2] qemu-options: bring the kernel and image options together

2022-07-25 Thread Alex Bennée


Alex Bennée  writes:

> How to control the booting of QEMU is often a source of confusion for
> users. Bring the options that control this together in the manual
> pages and add some verbiage to describe when each option is
> appropriate. This attempts to codify some of the knowledge expressed
> in:
>
>   
> https://stackoverflow.com/questions/58420670/qemu-bios-vs-kernel-vs-device-loader-file/58434837#58434837
>
> Signed-off-by: Alex Bennée 
> Cc: Cédric Le Goater 
> Message-Id: <20220622145052.4012981-1-alex.ben...@linaro.org>

Queued to testing/next, thanks.

-- 
Alex Bennée



Re: [RFC PATCH] docs/devel: fix description of OBJECT_DECLARE_SIMPLE_TYPE

2022-07-25 Thread Alex Bennée


Alex Bennée  writes:

> Since 30b5707c26 (qom: Remove module_obj_name parameter from
> OBJECT_DECLARE* macros) we don't need the additional two parameters.
> Fix the documentation.
>
> Signed-off-by: Alex Bennée 
> ---
>  docs/devel/qom.rst | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/docs/devel/qom.rst b/docs/devel/qom.rst
> index e5fe3597cd..0cf9a714f0 100644
> --- a/docs/devel/qom.rst
> +++ b/docs/devel/qom.rst
> @@ -292,8 +292,7 @@ in the header file:
>  .. code-block:: c
> :caption: Declaring a simple type
>  
> -   OBJECT_DECLARE_SIMPLE_TYPE(MyDevice, my_device,
> -  MY_DEVICE, DEVICE)
> +   OBJECT_DECLARE_SIMPLE_TYPE(MyDevice, MY_DEVICE)
>  
>  This is equivalent to the following:

Queued to testing/next, thanks.

-- 
Alex Bennée



Re: [PATCH for-7.2 04/10] hmp, device_tree.c: introduce fdt-save

2022-07-25 Thread Daniel Henrique Barboza




On 7/22/22 20:13, BALATON Zoltan wrote:

On Fri, 22 Jul 2022, Daniel Henrique Barboza wrote:

To save the FDT blob we have the '-machine dumpdtb=<file>' property. With this
property set, the machine saves the FDT in <file> and exits. The created
file can then be converted to plain text dts format using 'dtc'.

There's nothing particularly sophisticated about saving the FDT that
can't be done with the machine in any state, as long as the machine has
a valid FDT to be saved.

The 'fdt-save' command receives a 'filename' parameter and, if a valid
FDT is available, saves it in the file 'filename'. In short, this is
'-machine dumpdtb' that can be fired on demand via HMP.
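
For example, a possible session (illustrative file names):

    (qemu) fdt-save /tmp/guest.dtb
    $ dtc -I dtb -O dts -o /tmp/guest.dts /tmp/guest.dtb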


If it does the same as -machine dumpdtb, wouldn't it be more intuitive to call
the HMP command the same, so either dumpdtb or machine-dumpdtb or similar? That
way it's more obvious that these do the same thing.



Good point. I'll rename 'fdt-save' to 'dumpdtb'.


Daniel



Regards,
BALATON Zoltan


A valid FDT is one that was created using libfdt and can be retrieved
via 'current_machine->fdt' in device_tree.c. This condition is
met by most FDT users in QEMU.

Cc: Dr. David Alan Gilbert 
Signed-off-by: Daniel Henrique Barboza 
---
hmp-commands.hx  | 13 +
include/sysemu/device_tree.h |  2 ++
monitor/misc.c   | 13 +
softmmu/device_tree.c    | 18 ++
4 files changed, 46 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index c9d465735a..3c134cf652 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1768,3 +1768,16 @@ ERST
  "\n\t\t\t -b to specify dirty bitmap as method of 
calculation)",
    .cmd    = hmp_calc_dirty_rate,
    },
+
+    {
+    .name   = "fdt-save",
+    .args_type  = "filename:s",
+    .params = "[filename] file to save the FDT",
+    .help   = "save the FDT in the 'filename' file to be decoded using dtc",
+    .cmd    = hmp_fdt_save,
+    },
+
+SRST
+``fdt-save`` *filename*
+  Save the FDT in the 'filename' file to be decoded using dtc
+ERST
diff --git a/include/sysemu/device_tree.h b/include/sysemu/device_tree.h
index ef060a9759..1397adb21c 100644
--- a/include/sysemu/device_tree.h
+++ b/include/sysemu/device_tree.h
@@ -123,6 +123,8 @@ int qemu_fdt_nop_node(void *fdt, const char *node_path);
int qemu_fdt_add_subnode(void *fdt, const char *name);
int qemu_fdt_add_path(void *fdt, const char *path);

+void fdt_save(const char *filename, Error **errp);
+
#define qemu_fdt_setprop_cells(fdt, node_path, property, ...) \
    do {  \
    uint32_t qdt_tmp[] = { __VA_ARGS__ }; \
diff --git a/monitor/misc.c b/monitor/misc.c
index 3d2312ba8d..145285cec0 100644
--- a/monitor/misc.c
+++ b/monitor/misc.c
@@ -78,6 +78,7 @@
#include "qapi/qmp-event.h"
#include "sysemu/cpus.h"
#include "qemu/cutils.h"
+#include "sysemu/device_tree.h"

#if defined(TARGET_S390X)
#include "hw/s390x/storage-keys.h"
@@ -936,6 +937,18 @@ static void hmp_boot_set(Monitor *mon, const QDict *qdict)
    }
}

+static void hmp_fdt_save(Monitor *mon, const QDict *qdict)
+{
+    const char *path = qdict_get_str(qdict, "filename");
+    Error *local_err = NULL;
+
+    fdt_save(path, &local_err);
+
+    if (local_err) {
+    error_report_err(local_err);
+    }
+}
+
static void hmp_info_mtree(Monitor *mon, const QDict *qdict)
{
    bool flatview = qdict_get_try_bool(qdict, "flatview", false);
diff --git a/softmmu/device_tree.c b/softmmu/device_tree.c
index 6ca3fad285..eeab6a5ef0 100644
--- a/softmmu/device_tree.c
+++ b/softmmu/device_tree.c
@@ -643,3 +643,21 @@ out:
    g_free(propcells);
    return ret;
}
+
+void fdt_save(const char *filename, Error **errp)
+{
+    int size;
+
+    if (!current_machine->fdt) {
+    error_setg(errp, "Unable to find the machine FDT");
+    return;
+    }
+
+    size = fdt_totalsize(current_machine->fdt);
+
+    if (g_file_set_contents(filename, current_machine->fdt, size, NULL)) {
+    return;
+    }
+
+    error_setg(errp, "Error when saving machine FDT to file %s", filename);
+}





Re: [PATCH for-7.2 00/10] add hmp 'save-fdt' and 'info fdt' commands

2022-07-25 Thread Daniel Henrique Barboza




On 7/25/22 06:11, Daniel P. Berrangé wrote:

On Fri, Jul 22, 2022 at 04:59:57PM -0300, Daniel Henrique Barboza wrote:

Hi,

After dealing with an FDT element that wasn't visible in userspace, which
required shutting down the guest, dumping the FDT using '-machine dumpdtb'
and then using 'dtc' to see what was inside it, I thought it was a good
idea to add extra support for FDT handling in QEMU.

This series introduces 2 commands. 'fdt-save' behaves similarly to
'-machine dumpdtb', with the advantage of saving the FDT of a running
guest on demand. This command is implemented in patch 03.

The second command, 'info fdt <path>', is more sophisticated. This
command can print specific nodes and properties of the FDT. A few
examples:

- print the /cpus/cpu@0 from an ARM 'virt' machine:

(qemu) info fdt /cpus/cpu@0
/cpus/cpu@0 {
 phandle = <0x8001>
 reg = <0x0>
 compatible = 'arm,cortex-a57'
 device_type = 'cpu'
}
(qemu)

- print the device_type property of the interrupt-controller node of a
pSeries machine:

(qemu) info fdt /interrupt-controller/device_type
/interrupt-controller/device_type = 'PowerPC-External-Interrupt-Presentation'
(qemu)


Please don't add new HMP-only commands. These should be provided
as QMP commands, where the HMP is a tiny shim to the QMP.


I wasn't sure whether this would be useful in QMP, but perhaps it's better to
let QMP consumers decide whether to use it or not.

I'll repost both as QMP commands with an HMP layer on top of them.


Daniel



If you don't want to think about formally modelling the data
for 'info fdt' / 'query-fdt', then just put an 'x-' prefix on
the QMP command and return printed formatted data, as illustrated
in:

https://www.qemu.org/docs/master/devel/writing-monitor-commands.html#writing-a-debugging-aid-returning-unstructured-text
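
E.g. the wire exchange could look like this ('x-query-fdt' is merely an
illustrative name, not an existing command):

  -> { "execute": "x-query-fdt",
       "arguments": { "fullpath": "/cpus/cpu@0" } }
  <- { "return": { "human-readable-text": "/cpus/cpu@0 {\n...\n}\n" } }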

With regards,
Daniel




Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions

2022-07-25 Thread Chao Peng
On Thu, Jul 21, 2022 at 05:58:50PM +, Sean Christopherson wrote:
> On Thu, Jul 21, 2022, Chao Peng wrote:
> > On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
> > > 
> > > 
> > > On 7/21/22 00:21, Sean Christopherson wrote:
> > > Maybe you could tag it with cgs for all the confidential guest support
> > > related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
> > > 
> > > bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > > ...
> > > kvm_vm_ioctl_set_cgs_mem(, is_private)
> > 
> > If we plan to widely use such abbr. through KVM (e.g. it's well known),
> > I'm fine.
> 
> I'd prefer to stay away from "confidential guest", and away from any VM-scoped
> name for that matter.  User-unmappable memory has use cases beyond hiding
> guest state from the host, e.g. userspace could use inaccessible/unmappable
> memory to harden itself against unintentional access to guest memory.
> 
> > I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610
> > But I also don't quite like it, it's so generic and says nothing.
> > 
> > But I do want a name that can cover future usages other than just
> > private/shared (pKVM for example may have a third state).
> 
> I don't think there can be a third top-level state.  Memory is either private
> to the guest or it's not.  There can be sub-states, e.g. memory could be
> selectively shared or encrypted with a different key, in which case we'd need
> metadata to track that state.
> 
> Though that begs the question of whether or not private_fd is the correct
> terminology.  E.g. if guest memory is backed by a memfd that can't be mapped
> by userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel
> plugs that memory into a device or another VM, then arguably that memory is
> shared, especially in the multi-VM scenario.
> 
> For TDX and SNP "private vs. shared" is likely the correct terminology given
> the current specs, but for generic KVM it's probably better to align with
> whatever terminology is used for memfd.  "inaccessible_fd" and
> "user_inaccessible_fd" are a bit odd since the fd itself is accessible.
> 
> What about "user_unmappable"?  E.g.
> 
>   F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY,
>   MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc...

For KVM I also think user_unmappable looks better than 'private', e.g.
user_unmappable_fd/KVM_HAS_USER_UNMAPPABLE_MEMORY sound like more
appropriate names. For memfd, however, I don't feel that strongly about
changing it from the current 'inaccessible' to 'user_unmappable'; one of
the reasons is that it's not just about being unmappable, but actually
also inaccessible through direct syscalls like read()/write().

> 
> that gives us flexibility to map the memory from within the kernel, e.g. into
> other VMs or devices.
> 
> Hmm, and then keep your original "mem_attr_array" name?  And probably 
> 
>  int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
>  bool is_user_mappable)
> 
> Then the x86/mmu code for TDX/SNP private faults could be:
> 
>   is_private = !kvm_is_gpa_user_mappable();
> 
>   if (fault->is_private != is_private) {
> 
> or if we want to avoid mixing up "user_mappable" and "user_unmappable":
> 
>   is_private = kvm_is_gpa_user_unmappable();
> 
>   if (fault->is_private != is_private) {
> 
> though a helper that returns a negative (not mappable) feels kludgy.  And I
> like kvm_is_gpa_user_mappable() because then when there's no "special" memory,
> it defaults to true, which is more intuitive IMO.

yes.

> 
> And then if the future needs more precision, e.g. user-unmappable memory isn't
> necessarily guest-exclusive, the uAPI names still work even though KVM
> internals will need to be reworked, but that's unavoidable.  E.g. piggybacking
> KVM_MEMORY_ENCRYPT_(UN)REG_REGION doesn't allow for further differentiation,
> so we'd need to _extend_ the uAPI, but the _existing_ uAPI would still be
> sane.

Right, that has to be extended.

Chao



Re: [PATCH] hw/display/bcm2835_fb: Fix framebuffer allocation address

2022-07-25 Thread Peter Maydell
On Sat, 23 Jul 2022 at 12:15, alanjian85  wrote:
>
> This patch fixes the dedicated framebuffer mailbox interface (marked as
> deprecated in the official docs, but it can still be fixed for emulation
> purposes) by removing an unneeded offset to make it work like the buffer
> allocate tag in the bcm2835_property interface [1]. Some bare-metal
> applications, like the Screen01/Screen02 examples from the Baking Pi
> tutorial [2], didn't work before this patch.
>
> [1] https://github.com/qemu/qemu/blob/master/hw/misc/bcm2835_property.c#L158
> [2] https://www.cl.cam.ac.uk/projects/raspberrypi/tutorials/os/screen01.html
>
> Signed-off-by: alanjian85 

Thanks for this patch. Our submission rules require a full
personal name in the Signed-off-by line, not just a username.
We can't take the patch without this, I'm afraid.

> ---
>  hw/display/bcm2835_fb.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Is there any difference between this patch and the one you
sent a few days previously?

https://patchew.org/QEMU/20220721081544.38228-1-alanjia...@outlook.com/

Generally if you're sending an updated version of patch it's helpful
to mark it as "[PATCH v2]" or whatever, and to include a note below
the '---' line that says what the changes are.

thanks
-- PMM



Re: driver type raw-xz supports discard=unmap?

2022-07-25 Thread Chris Murphy



On Mon, Jul 25, 2022, at 5:13 AM, Daniel P. Berrangé wrote:
> On Fri, Jul 22, 2022 at 04:03:52PM -0400, Chris Murphy wrote:
>> Is this valid?
>> 
>> `<driver name='qemu' type='raw-xz' discard='unmap'/>`
>> 
>> I know type="raw" works fine, I'm wondering if there'd be any problem
>> with type "raw-xz" combined with discards?
>
> This is libvirt configuration, so libvirt-us...@redhat.com is the better
> list in general. That said, what is this 'raw-xz' type you refer to ?
>
> AFAIK, that is not a disk driver type that exists in either libvirt or
> QEMU releases.

Huh, interesting. I have no idea then. I just happened to notice it in the 
(libvirt) XML config that's used by oz.
https://kojipkgs.fedoraproject.org//packages/Fedora-Workstation/Rawhide/20220721.n.0/images/libvirt-raw-xz-aarch64.xml

When manually modifying a virt-manager created config, to change "raw" to 
"raw-xz" I get an error:

# virsh edit uefivm
error: XML document failed to validate against schema: Unable to validate doc against /usr/share/libvirt/schemas/domain.rng
Extra element devices in interleave
Element domain failed to validate content

Failed. Try again? [y,n,i,f,?]: 

I've got no idea what happens if an invalid type is specified in the config.
The VMs are definitely running despite this. I'll ask the oz devs.



[PATCH v2] linux-user: Passthrough MADV_DONTNEED for certain file mappings

2022-07-25 Thread Ilya Leoshkevich
This is a follow-up for commit 892a4f6a750a ("linux-user: Add partial
support for MADV_DONTNEED"), which added passthrough for anonymous
mappings. File mappings can be handled in a similar manner.

In order to do that, mark pages for which mmap() was passed through
with PAGE_PASSTHROUGH, and then allow madvise() passthrough for these
pages as well.

Signed-off-by: Ilya Leoshkevich 
---

v1: https://lists.gnu.org/archive/html/qemu-devel/2022-07/msg00112.html
v1 -> v2: Fix PAGE_PASSTHROUGH value; make checks consistent with
  page_set_flags() (Laurent).

 include/exec/cpu-all.h |  6 ++
 linux-user/mmap.c  | 25 +
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index f5bda2c3ca..2d29ba13c0 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -262,6 +262,12 @@ extern const TargetPageBits target_page;
 #define PAGE_TARGET_1  0x0200
 #define PAGE_TARGET_2  0x0400
 
+/*
+ * For linux-user, indicates that the page is mapped with the same semantics
+ * in both guest and host.
+ */
+#define PAGE_PASSTHROUGH 0x0800
+
 #if defined(CONFIG_USER_ONLY)
 void page_dump(FILE *f);
 
diff --git a/linux-user/mmap.c b/linux-user/mmap.c
index 4e7a6be6ee..79928e3ae5 100644
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -424,7 +424,8 @@ abi_ulong mmap_find_vma(abi_ulong start, abi_ulong size, abi_ulong align)
 abi_long target_mmap(abi_ulong start, abi_ulong len, int target_prot,
  int flags, int fd, abi_ulong offset)
 {
-abi_ulong ret, end, real_start, real_end, retaddr, host_offset, host_len;
+abi_ulong ret, end, real_start, real_end, retaddr, host_offset, host_len,
+  passthrough_start = -1, passthrough_end = -1;
 int page_flags, host_prot;
 
 mmap_lock();
@@ -537,6 +538,8 @@ abi_long target_mmap(abi_ulong start, abi_ulong len, int target_prot,
 host_start += offset - host_offset;
 }
 start = h2g(host_start);
+passthrough_start = start;
+passthrough_end = start + len;
 } else {
 if (start & ~TARGET_PAGE_MASK) {
 errno = EINVAL;
@@ -619,6 +622,8 @@ abi_long target_mmap(abi_ulong start, abi_ulong len, int target_prot,
  host_prot, flags, fd, offset1);
 if (p == MAP_FAILED)
 goto fail;
+passthrough_start = real_start;
+passthrough_end = real_end;
 }
 }
  the_end1:
@@ -626,7 +631,18 @@ abi_long target_mmap(abi_ulong start, abi_ulong len, int target_prot,
 page_flags |= PAGE_ANON;
 }
 page_flags |= PAGE_RESET;
-page_set_flags(start, start + len, page_flags);
+if (passthrough_start == passthrough_end) {
+page_set_flags(start, start + len, page_flags);
+} else {
+if (start < passthrough_start) {
+page_set_flags(start, passthrough_start, page_flags);
+}
+page_set_flags(passthrough_start, passthrough_end,
+   page_flags | PAGE_PASSTHROUGH);
+if (passthrough_end < start + len) {
+page_set_flags(passthrough_end, start + len, page_flags);
+}
+}
  the_end:
 trace_target_mmap_complete(start);
 if (qemu_loglevel_mask(CPU_LOG_PAGE)) {
@@ -845,7 +861,7 @@ static bool can_passthrough_madv_dontneed(abi_ulong start, abi_ulong end)
 }
 
 for (addr = start; addr < end; addr += TARGET_PAGE_SIZE) {
-if (!(page_get_flags(addr) & PAGE_ANON)) {
+if (!(page_get_flags(addr) & (PAGE_ANON | PAGE_PASSTHROUGH))) {
 return false;
 }
 }
@@ -888,7 +904,8 @@ abi_long target_madvise(abi_ulong start, abi_ulong len_in, int advice)
  *
  * This breaks MADV_DONTNEED, completely implementing which is quite
  * complicated. However, there is one low-hanging fruit: host-page-aligned
- * anonymous mappings. In this case passthrough is safe, so do it.
+ * anonymous mappings or mappings that are known to have the same semantics
+ * in the host and the guest. In this case passthrough is safe, so do it.
  */
 mmap_lock();
 if ((advice & MADV_DONTNEED) &&
-- 
2.35.3




Re: [PATCH] linux-user: Don't assume 0 is not a valid host timer_t value

2022-07-25 Thread Peter Maydell
On Mon, 25 Jul 2022 at 12:13, Daniel P. Berrangé  wrote:
>
> On Mon, Jul 25, 2022 at 12:00:35PM +0100, Peter Maydell wrote:
> > For handling guest POSIX timers, we currently use an array
> > g_posix_timers[], whose entries are a host timer_t value, or 0 for
> > "this slot is unused".  When the guest calls the timer_create syscall
> > we look through the array for a slot containing 0, and use that for
> > the new timer.
> >
> > This scheme assumes that host timer_t values can never be zero.  This
> > is unfortunately not a valid assumption -- for some host libc
> > versions, timer_t values are simply indexes starting at 0.  When
> > using this kind of host libc, the effect is that the first and second
> > timers end up sharing a slot, and so when the guest tries to operate
> > on the first timer it changes the second timer instead.
>
> For sake of historical record, could you mention here which specific
> libc impl / version highlights the problem.

Jon, which host libc are you seeing this with?

thanks
-- PMM
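
For reference, the pattern at issue is roughly the following (simplified
sketch, not the exact linux-user code):

    /* timer_t slots; the value 0 doubles as the "slot unused" marker */
    static timer_t g_posix_timers[32];

    static int next_free_host_timer(void)
    {
        int k;
        for (k = 0; k < 32; k++) {
            /* breaks on host libcs that hand out 0 as a valid timer_t */
            if (g_posix_timers[k] == 0) {
                return k;
            }
        }
        return -1;
    }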



Re: [PATCH v2 0/2] accel/tcg: Test unaligned stores to s390x low-address-protected lowcore

2022-07-25 Thread Alex Bennée


Ilya Leoshkevich  writes:

> Hi,
>
> This is a follow-up series for [1].
>
> The fix has been committed.
>
> I asked Christian what might be a good alternative for the
> mmio-debug-exit device for testing, and he suggested to look into
> shutdown/panic actions.
>
> Patch 1 adds a new panic action.
> Patch 2 tests unaligned stores to s390x low-address-protected lowcore;
> it performs a shutdown on success and panic on failure.
>
> Best regards,
> Ilya

Queued to testing/next, thanks.

-- 
Alex Bennée



[PATCH] .gitlab-ci.d/windows.yml: Enable native Windows symlink

2022-07-25 Thread Bin Meng
From: Bin Meng 

The following error message was seen during configure:

  "ln: failed to create symbolic link
  'x86_64-softmmu/qemu-system-x86_64.exe': No such file or directory"

By default the MSYS environment variable is not defined, so the runtime
behavior of winsymlinks is: if the link target does not exist, 'ln -s' fails.
At the configure phase, qemu-system-x86_64.exe has not been built yet,
so creation of the symbolic link fails, hence the error message.

Set winsymlinks to 'native', whose behavior is most similar to the
behavior of 'ln -s' on *nix, that is:

  a) if native symlinks are enabled, then whether the target exists
     or not, the link is created as a native Windows symlink;
  b) else if native symlinks are not enabled, then whether the target
     exists or not, 'ln -s' creates the link as a Windows shortcut file.
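
The difference is easy to see from an MSYS2 shell (illustrative names,
target intentionally missing):

  $ ln -s missing-target.exe link.exe
  ln: failed to create symbolic link 'link.exe': No such file or directory

  $ MSYS=winsymlinks:native ln -s missing-target.exe link.exe   # succeeds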

Signed-off-by: Bin Meng 
---

 .gitlab-ci.d/windows.yml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/.gitlab-ci.d/windows.yml b/.gitlab-ci.d/windows.yml
index 1b2ede49e1..0b9572a8a3 100644
--- a/.gitlab-ci.d/windows.yml
+++ b/.gitlab-ci.d/windows.yml
@@ -57,6 +57,7 @@ msys2-64bit:
   mingw-w64-x86_64-zstd "
   - $env:CHERE_INVOKING = 'yes'  # Preserve the current working directory
   - $env:MSYSTEM = 'MINGW64' # Start a 64 bit Mingw environment
+  - $env:MSYS = 'winsymlinks:native' # Enable native Windows symlink
   - .\msys64\usr\bin\bash -lc './configure --target-list=x86_64-softmmu
   --enable-capstone --without-default-devices'
   - .\msys64\usr\bin\bash -lc "sed -i '/^ROMS=/d' build/config-host.mak"
@@ -89,6 +90,7 @@ msys2-32bit:
   mingw-w64-i686-usbredir "
   - $env:CHERE_INVOKING = 'yes'  # Preserve the current working directory
  - $env:MSYSTEM = 'MINGW32' # Start a 32-bit MinGW environment
+  - $env:MSYS = 'winsymlinks:native' # Enable native Windows symlink
   - mkdir output
   - cd output
   - ..\msys64\usr\bin\bash -lc "../configure --target-list=ppc64-softmmu"
-- 
2.34.1




Re: [PATCH for-7.2 10/10] hmp, device_tree.c: add 'info fdt ' support

2022-07-25 Thread Dr. David Alan Gilbert
* Daniel Henrique Barboza (danielhb...@gmail.com) wrote:
> 'info fdt' is only able to print full nodes so far. It would be good to
> be able to also print single properties, since sometimes we just want
> to verify a single value from the FDT.
> 
> libfdt does not have support for finding a property given its full path, but
> it does have a way to return an fdt_property given a property name and the
> node that contains it.
> 
> This is how we're going to support it:
> 
> - given the same fullpath parameter, assume it's a node. If we have a
> match with an existing node, print it. If not, assume it's a property;
> 
> - in fdt_find_property() we're going to split 'fullpath' into node and
> property. Unfortunately we can't use g_path_get_basename() to help us
> because, although the device tree path format is similar to Linux's, it'll
> not work when trying to run QEMU under Windows, where the path format is
> different;
> 
> - after splitting into node + property, try to find the node in the FDT.
> If we have a match, use fdt_get_property() to retrieve the fdt_property.
> Return it if found;
> 
> - using the fdt_print_property() created previously, print the property.

Would it be easier to make 'info fdt' have an optional 2nd parameter
(hmp can do optionals) which if present is the property name, then these
would become:

> After this change, if a user wants to print just the value of 'cpu' inside
> /cpus/cpu-map(...) from an ARM FDT, we can do it:
> 
> (qemu) info fdt /cpus/cpu-map/socket0/cluster0/core0/cpu

info fdt /cpus/cpu-map/socket0/cluster0/core0 cpu

> /cpus/cpu-map/socket0/cluster0/core0/cpu = <0x8001>
> (qemu)
> 
> Or the 'ibm,my-dma-window' from the v-scsi device inside the pSeries
> FDT:
> 
> (qemu) info fdt /vdevice/v-scsi@7103/ibm,my-dma-window

info fdt /vdevice/v-scsi@7103 ibm,my-dma-window

it seems more explicit, and seems easier to implement.

Dave

> /vdevice/v-scsi@7103/ibm,my-dma-window = <0x7103 0x0 0x0 0x0 
> 0x1000>
> (qemu)
> 
> Cc: Dr. David Alan Gilbert 
> Signed-off-by: Daniel Henrique Barboza 
> ---
>  hmp-commands-info.hx  |  2 +-
>  softmmu/device_tree.c | 79 ---
>  2 files changed, 76 insertions(+), 5 deletions(-)
> 
> diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
> index abf277be7d..8891c2918a 100644
> --- a/hmp-commands-info.hx
> +++ b/hmp-commands-info.hx
> @@ -913,7 +913,7 @@ ERST
>  .name   = "fdt",
>  .args_type  = "fullpath:s",
>  .params = "fullpath",
> -.help   = "show firmware device tree node given its full path",
> +.help   = "show firmware device tree node or property given its full path",
>  .cmd= hmp_info_fdt,
>  },
>  
> diff --git a/softmmu/device_tree.c b/softmmu/device_tree.c
> index e41894fbef..f6eb060acc 100644
> --- a/softmmu/device_tree.c
> +++ b/softmmu/device_tree.c
> @@ -774,9 +774,74 @@ static void fdt_print_node(int node, int depth, const char *fullpath)
>  qemu_printf("%*s}\n", padding, "");
>  }
>  
> +static const struct fdt_property *fdt_find_property(const char *fullpath,
> +int *prop_size,
> +Error **errp)
> +{
> +const struct fdt_property *prop = NULL;
> +void *fdt = current_machine->fdt;
> +g_autoptr(GString) nodename = NULL;
> +const char *propname = NULL;
> +int path_len = strlen(fullpath);
> +int node = 0; /* default to root node '/' */
> +int i, idx = -1;
> +
> +/*
> + * We'll assume that we're dealing with a property. libfdt
> + * does not have an API to find a property given the full
> + * path, but it does have an API to find a property inside
> + * a node.
> + */
> +nodename = g_string_new("");
> +
> +for (i = path_len - 1; i >= 0; i--) {
> +if (fullpath[i] == '/') {
> +idx = i;
> +break;
> +}
> +}
> +
> +if (idx == -1) {
> +error_setg(errp, "FDT paths must contain at least one '/' character");
> +return NULL;
> +}
> +
> +if (idx == path_len - 1) {
> +error_setg(errp, "FDT paths can't end with a '/' character");
> +return NULL;
> +}
> +
> +propname = &fullpath[idx + 1];
> +
> +if (idx != 0) {
> +g_string_append_len(nodename, fullpath, idx);
> +
> +node = fdt_path_offset(fdt, nodename->str);
> +if (node < 0) {
> +error_setg(errp, "node '%s' of property '%s' not found in FDT",
> +   nodename->str, propname);
> +return NULL;
> +}
> +} else {
> +/* idx = 0 means that it's a property of the root node */
> +g_string_append(nodename, "/");
> +}
> +
> +prop = fdt_get_property(fdt, node, propname, prop_size);
> +if (!prop) {
> +error_setg(errp, "property '%s' not found in node '%s' in FDT",
> +   propname, nodename->str);
> + 

Re: [PATCH 0/4] semihosting: fix various coverity issues

2022-07-25 Thread Alex Bennée


Peter Maydell  writes:

> This patchset fixes a handful of bugs in the semihosting code
> noticed by Coverity.

Queued to testing/next, thanks.

-- 
Alex Bennée



[PATCH v2 10/11] block: rename bdrv_child_try_change_aio_context to bdrv_try_change_aio_context

2022-07-25 Thread Emanuele Giuseppe Esposito
No functional changes intended.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 19 ---
 block/block-backend.c  |  3 +--
 include/block/block-global-state.h | 12 +---
 3 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/block.c b/block.c
index c066b41c8c..b82eb0ba6d 100644
--- a/block.c
+++ b/block.c
@@ -7462,11 +7462,9 @@ static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
  * For the same reason, it temporarily holds also the new AioContext, since
  * bdrv_drained_end calls BDRV_POLL_WHILE that assumes the lock is taken too.
  */
-int bdrv_child_try_change_aio_context_tran(BlockDriverState *bs,
-   AioContext *ctx,
-   BdrvChild *ignore_child,
-   Transaction *tran,
-   Error **errp)
+int bdrv_try_change_aio_context_tran(BlockDriverState *bs, AioContext *ctx,
+ BdrvChild *ignore_child, Transaction *tran,
+ Error **errp)
 {
 GHashTable *visited;
 int ret;
@@ -7483,11 +7481,11 @@ int bdrv_child_try_change_aio_context_tran(BlockDriverState *bs,
 }
 
 /*
- * See bdrv_child_try_change_aio_context_tran for invariants on
+ * See bdrv_try_change_aio_context_tran for invariants on
  * AioContext locks.
  */
-int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
-  BdrvChild *ignore_child, Error **errp)
+int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+BdrvChild *ignore_child, Error **errp)
 {
 Transaction *tran;
 int ret;
@@ -7496,8 +7494,7 @@ int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
 
 /* Recursion phase: go through all nodes of the graph */
 tran = tran_new();
-ret = bdrv_child_try_change_aio_context_tran(bs, ctx, ignore_child, tran,
- errp);
+ret = bdrv_try_change_aio_context_tran(bs, ctx, ignore_child, tran, errp);
 
 /* Linear phase: go through all callbacks collected in the transaction */
 
@@ -7542,7 +7539,7 @@ int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
  Error **errp)
 {
 GLOBAL_STATE_CODE();
-return bdrv_child_try_change_aio_context(bs, ctx, NULL, errp);
+return bdrv_try_change_aio_context(bs, ctx, NULL, errp);
 }
 
 void bdrv_add_aio_context_notifier(BlockDriverState *bs,
diff --git a/block/block-backend.c b/block/block-backend.c
index a27b8b7a89..f785c1e7e2 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2182,8 +2182,7 @@ static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context,
  * update_root_node MUST be false for blk_root_set_aio_ctx_commit(),
  * as we are already in the commit function of a transaction.
  */
-ret = bdrv_child_try_change_aio_context(bs, new_context, blk->root,
-errp);
+ret = bdrv_try_change_aio_context(bs, new_context, blk->root, errp);
 if (ret < 0) {
 bdrv_unref(bs);
 return ret;
diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 54fd008442..11f80768c3 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -223,13 +223,11 @@ AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c);
 bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
GHashTable *visited, Transaction *tran,
Error **errp);
-int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
-  BdrvChild *ignore_child, Error **errp);
-int bdrv_child_try_change_aio_context_tran(BlockDriverState *bs,
-   AioContext *ctx,
-   BdrvChild *ignore_child,
-   Transaction *tran,
-   Error **errp);
+int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+BdrvChild *ignore_child, Error **errp);
+int bdrv_try_change_aio_context_tran(BlockDriverState *bs, AioContext *ctx,
+ BdrvChild *ignore_child, Transaction *tran,
+ Error **errp);
 
 int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz);
 int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo);
-- 
2.31.1




Re: [PATCH] .cirrus.yml: Change winsymlinks to 'native'

2022-07-25 Thread Alex Bennée


Bin Meng  writes:

> From: Bin Meng 
>
> At present winsymlinks is set to 'nativestrict', and its behavior is:
>
>   a) if native symlinks are enabled and  exists, creates
>   as a native Windows symlink;
>   b) else if native symlinks are not enabled or if  does
>  not exist, 'ln -s' fails.
>
> This causes the following error message was seen during the configure:
>
>   "ln: failed to create symbolic link
>   'x86_64-softmmu/qemu-system-x86_64.exe': No such file or directory"
>
> Change winsymlinks to 'native' whose behavior is most similar to the
> behavior of 'ln -s' on *nix, that is:
>
>   a) if native symlinks are enabled, and whether  exists
>  or not, creates  as a native Windows symlink;
>   b) else if native symlinks are not enabled, and whether 
>  exists or not, 'ln -s' creates as a Windows shortcut file.
>
> Signed-off-by: Bin Meng 

Queued to testing/next, thanks.

-- 
Alex Bennée



[PATCH v2 08/11] block: use the new _change_ API instead of _can_set_ and _set_

2022-07-25 Thread Emanuele Giuseppe Esposito
Replace all direct usage of ->can_set_aio_ctx and ->set_aio_ctx,
and call bdrv_child_try_change_aio_context() in
bdrv_try_set_aio_context(), the main function called through
the whole block layer.

>From this point onwards, ->can_set_aio_ctx and ->set_aio_ctx
won't be used anymore.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c   | 44 ---
 block/block-backend.c |  8 ++--
 2 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/block.c b/block.c
index bcc9b0d099..9b47aacad2 100644
--- a/block.c
+++ b/block.c
@@ -2970,17 +2970,21 @@ static void bdrv_attach_child_common_abort(void *opaque)
 }
 
 if (bdrv_child_get_parent_aio_context(child) != s->old_parent_ctx) {
-GSList *ignore;
+Transaction *tran;
+GHashTable *visited;
+bool ret;
 
-/* No need to ignore `child`, because it has been detached already */
-ignore = NULL;
-child->klass->can_set_aio_ctx(child, s->old_parent_ctx, &ignore,
-  &error_abort);
-g_slist_free(ignore);
+tran = tran_new();
 
-ignore = NULL;
-child->klass->set_aio_ctx(child, s->old_parent_ctx, &ignore);
-g_slist_free(ignore);
+/* No need to visit `child`, because it has been detached already */
+visited = g_hash_table_new(NULL, NULL);
+ret = child->klass->change_aio_ctx(child, s->old_parent_ctx, visited,
+   tran, &error_abort);
+g_hash_table_destroy(visited);
+
+/* transaction is supposed to always succeed */
+assert(ret == true);
+tran_commit(tran);
 }
 
 bdrv_unref(bs);
@@ -3041,18 +3045,20 @@ static int bdrv_attach_child_common(BlockDriverState *child_bs,
 Error *local_err = NULL;
    int ret = bdrv_try_set_aio_context(child_bs, parent_ctx, &local_err);
 
-if (ret < 0 && child_class->can_set_aio_ctx) {
-GSList *ignore = g_slist_prepend(NULL, new_child);
-if (child_class->can_set_aio_ctx(new_child, child_ctx, ,
- NULL))
-{
+if (ret < 0 && child_class->change_aio_ctx) {
+Transaction *tran = tran_new();
+GHashTable *visited = g_hash_table_new(NULL, NULL);
+bool ret_child;
+
+g_hash_table_add(visited, new_child);
+ret_child = child_class->change_aio_ctx(new_child, child_ctx,
+visited, tran, NULL);
+if (ret_child == true) {
 error_free(local_err);
 ret = 0;
-g_slist_free(ignore);
-ignore = g_slist_prepend(NULL, new_child);
-child_class->set_aio_ctx(new_child, child_ctx, );
 }
-g_slist_free(ignore);
+tran_finalize(tran, ret_child == true ? 0 : -1);
+g_hash_table_destroy(visited);
 }
 
 if (ret < 0) {
@@ -7732,7 +7738,7 @@ int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
  Error **errp)
 {
 GLOBAL_STATE_CODE();
-return bdrv_child_try_set_aio_context(bs, ctx, NULL, errp);
+return bdrv_child_try_change_aio_context(bs, ctx, NULL, errp);
 }
 
 void bdrv_add_aio_context_notifier(BlockDriverState *bs,
diff --git a/block/block-backend.c b/block/block-backend.c
index b4951c6e21..3046b4cc54 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2184,8 +2184,12 @@ static int blk_do_set_aio_context(BlockBackend *blk, AioContext *new_context,
 bdrv_ref(bs);
 
 if (update_root_node) {
-ret = bdrv_child_try_set_aio_context(bs, new_context, blk->root,
- errp);
+/*
+ * update_root_node MUST be false for blk_root_set_aio_ctx_commit(),
+ * as we are already in the commit function of a transaction.
+ */
+ret = bdrv_child_try_change_aio_context(bs, new_context, blk->root,
+errp);
 if (ret < 0) {
 bdrv_unref(bs);
 return ret;
-- 
2.31.1




[PATCH v2 11/11] block: remove bdrv_try_set_aio_context and replace it with bdrv_try_change_aio_context

2022-07-25 Thread Emanuele Giuseppe Esposito
No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 14 --
 block/export/export.c  |  2 +-
 blockdev.c | 22 +++---
 docs/devel/multiple-iothreads.txt  |  4 ++--
 include/block/block-global-state.h |  2 --
 job.c  |  2 +-
 tests/unit/test-bdrv-drain.c   |  6 +++---
 tests/unit/test-block-iothread.c   | 10 +-
 8 files changed, 27 insertions(+), 35 deletions(-)

diff --git a/block.c b/block.c
index b82eb0ba6d..435e8fe731 100644
--- a/block.c
+++ b/block.c
@@ -2950,7 +2950,7 @@ static void bdrv_attach_child_common_abort(void *opaque)
 bdrv_replace_child_noperm(s->child, NULL, false);
 
 if (bdrv_get_aio_context(bs) != s->old_child_ctx) {
-bdrv_try_set_aio_context(bs, s->old_child_ctx, &error_abort);
+bdrv_try_change_aio_context(bs, s->old_child_ctx, NULL, &error_abort);
 }
 
 if (bdrv_child_get_parent_aio_context(child) != s->old_parent_ctx) {
@@ -3027,7 +3027,8 @@ static int bdrv_attach_child_common(BlockDriverState *child_bs,
 parent_ctx = bdrv_child_get_parent_aio_context(new_child);
 if (child_ctx != parent_ctx) {
 Error *local_err = NULL;
-int ret = bdrv_try_set_aio_context(child_bs, parent_ctx, &local_err);
+int ret = bdrv_try_change_aio_context(child_bs, parent_ctx, NULL,
+  &local_err);
 
 if (ret < 0 && child_class->change_aio_ctx) {
 Transaction *tran = tran_new();
@@ -3130,7 +3131,7 @@ static void bdrv_detach_child(BdrvChild **childp)
  * When the parent requiring a non-default AioContext is removed, the
  * node moves back to the main AioContext
  */
-bdrv_try_set_aio_context(old_bs, qemu_get_aio_context(), NULL);
+bdrv_try_change_aio_context(old_bs, qemu_get_aio_context(), NULL, NULL);
 }
 }
 
@@ -7535,13 +7536,6 @@ int bdrv_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
 return 0;
 }
 
-int bdrv_try_set_aio_context(BlockDriverState *bs, AioContext *ctx,
- Error **errp)
-{
-GLOBAL_STATE_CODE();
-return bdrv_try_change_aio_context(bs, ctx, NULL, errp);
-}
-
 void bdrv_add_aio_context_notifier(BlockDriverState *bs,
 void (*attached_aio_context)(AioContext *new_context, void *opaque),
 void (*detach_aio_context)(void *opaque), void *opaque)
diff --git a/block/export/export.c b/block/export/export.c
index 4744862915..7cc0c25c1c 100644
--- a/block/export/export.c
+++ b/block/export/export.c
@@ -129,7 +129,7 @@ BlockExport *blk_exp_add(BlockExportOptions *export, Error **errp)
 
 /* Ignore errors with fixed-iothread=false */
 set_context_errp = fixed_iothread ? errp : NULL;
-ret = bdrv_try_set_aio_context(bs, new_ctx, set_context_errp);
+ret = bdrv_try_change_aio_context(bs, new_ctx, NULL, set_context_errp);
 if (ret == 0) {
 aio_context_release(ctx);
 aio_context_acquire(new_ctx);
diff --git a/blockdev.c b/blockdev.c
index 2cd84d206c..fb0ec67523 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1619,8 +1619,8 @@ static void external_snapshot_abort(BlkActionState *common)
 aio_context_release(aio_context);
 aio_context_acquire(tmp_context);
 
-ret = bdrv_try_set_aio_context(state->old_bs,
-   aio_context, NULL);
+ret = bdrv_try_change_aio_context(state->old_bs,
+  aio_context, NULL, NULL);
 assert(ret == 0);
 
 aio_context_release(tmp_context);
@@ -1781,12 +1781,12 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)
 goto out;
 }
 
-/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+/* Honor bdrv_try_change_aio_context() context acquisition requirements. */
 old_context = bdrv_get_aio_context(target_bs);
 aio_context_release(aio_context);
 aio_context_acquire(old_context);
 
-ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
+ret = bdrv_try_change_aio_context(target_bs, aio_context, NULL, errp);
 if (ret < 0) {
 bdrv_unref(target_bs);
 aio_context_release(old_context);
@@ -1881,12 +1881,12 @@ static void blockdev_backup_prepare(BlkActionState *common, Error **errp)
 return;
 }
 
-/* Honor bdrv_try_set_aio_context() context acquisition requirements. */
+/* Honor bdrv_try_change_aio_context() context acquisition requirements. */
 aio_context = bdrv_get_aio_context(bs);
 old_context = bdrv_get_aio_context(target_bs);
 aio_context_acquire(old_context);
 
-ret = bdrv_try_set_aio_context(target_bs, aio_context, errp);
+ret = bdrv_try_change_aio_context(target_bs, aio_context, NULL, errp);
 if (ret < 0) {
 

[PATCH v2 01/11] block.c: assert bs->aio_context is written under BQL and drains

2022-07-25 Thread Emanuele Giuseppe Esposito
Also here ->aio_context is read by I/O threads and written
under BQL.

Reviewed-by: Hanna Reitz 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block.c b/block.c
index 7559965dbc..58a9cfc8b7 100644
--- a/block.c
+++ b/block.c
@@ -7282,6 +7282,7 @@ static void bdrv_detach_aio_context(BlockDriverState *bs)
 if (bs->quiesce_counter) {
 aio_enable_external(bs->aio_context);
 }
+assert_bdrv_graph_writable(bs);
 bs->aio_context = NULL;
 }
 
@@ -7295,6 +7296,7 @@ static void bdrv_attach_aio_context(BlockDriverState *bs,
 aio_disable_external(new_context);
 }
 
+assert_bdrv_graph_writable(bs);
 bs->aio_context = new_context;
 
 if (bs->drv && bs->drv->bdrv_attach_aio_context) {
-- 
2.31.1




[PATCH v2 06/11] block: implement .change_aio_ctx in child_of_bds

2022-07-25 Thread Emanuele Giuseppe Esposito
bdrv_child_cb_change_aio_ctx() is identical to
bdrv_child_cb_can_set_aio_ctx(), as we only need
to recursively go on the parent bs.

Note: bdrv_child_try_change_aio_context() is not called by
anyone at this point.

Reviewed-by: Hanna Reitz 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/block.c b/block.c
index 221bf90268..bcc9b0d099 100644
--- a/block.c
+++ b/block.c
@@ -1239,6 +1239,14 @@ static int bdrv_child_cb_inactivate(BdrvChild *child)
 return 0;
 }
 
+static bool bdrv_child_cb_change_aio_ctx(BdrvChild *child, AioContext *ctx,
+ GHashTable *visited, Transaction 
*tran,
+ Error **errp)
+{
+BlockDriverState *bs = child->opaque;
+return bdrv_change_aio_context(bs, ctx, visited, tran, errp);
+}
+
 static bool bdrv_child_cb_can_set_aio_ctx(BdrvChild *child, AioContext *ctx,
   GSList **ignore, Error **errp)
 {
@@ -1495,6 +1503,7 @@ const BdrvChildClass child_of_bds = {
 .inactivate  = bdrv_child_cb_inactivate,
 .can_set_aio_ctx = bdrv_child_cb_can_set_aio_ctx,
 .set_aio_ctx = bdrv_child_cb_set_aio_ctx,
+.change_aio_ctx  = bdrv_child_cb_change_aio_ctx,
 .update_filename = bdrv_child_cb_update_filename,
 .get_parent_aio_context = child_of_bds_get_parent_aio_context,
 };
-- 
2.31.1




[PATCH v2 02/11] block: use transactions as a replacement of ->{can_}set_aio_context()

2022-07-25 Thread Emanuele Giuseppe Esposito
Simplify the way the aiocontext can be changed in a BDS graph.
There are currently two problems in bdrv_try_set_aio_context:
- There is confusion over which AioContext locks are taken and released,
  because we assume that the old aiocontext is always held and the new
  one is acquired inside.

- It doesn't look very safe to call bdrv_drained_begin while some
  nodes have already switched to the new aiocontext and others haven't.
  This could be especially dangerous because bdrv_drained_begin polls, so
  something else could be executed while graph is in an inconsistent
  state.

Additional minor nitpick: the can_set_ and set_ callbacks both traverse the
graph, each using the ignore list of visited nodes in a different way.

Therefore, get rid of all of this and introduce a new callback,
change_aio_context, that uses transactions to efficiently, cleanly
and most importantly safely change the aiocontext of a graph.

This new callback is a "merge" of the two previous ones:
- Just like can_set_aio_context, recursively traverses the graph.
  Marks all nodes that are visited using a GList, and checks if
  they *could* change the aio_context.
- For each node that passes the above check, drain it and add a new transaction
  that implements a callback that effectively changes the aiocontext.
- Once done, the recursive function returns whether *all* nodes can change
  the AioContext. If so, commit the above transactions.
  Regardless of the outcome, call transaction.clean() to undo all drains
  done in the recursion.
- The transaction list is scanned only after all nodes have been drained, so
  we are sure that they are all in the same context; then
  we switch their AioContext, concluding the drain only after all nodes have
  switched to the new AioContext. In this way we make sure that
  bdrv_drained_begin() is always called under the old AioContext, and
  bdrv_drained_end() under the new one.
- Because of the above, we don't need to release and re-acquire the
  old AioContext every time, as everything is done once (and not
  per-node drain and aiocontext change).

Note that the "change" API is not yet invoked anywhere.
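
A condensed sketch of the resulting pattern (using the API from
include/qemu/transactions.h; names abridged, 'state' and 'all_ok' are
placeholders):

    static void set_aio_context_commit(void *opaque)
    {
        /* actually switch the node's AioContext; runs only on success */
    }

    static TransactionActionDrv set_aio_context = {
        .commit = set_aio_context_commit,
        .clean  = g_free,   /* runs after commit or abort, frees the state */
    };

    Transaction *tran = tran_new();
    /* recursion phase: every visited node is drained and adds one action */
    tran_add(tran, &set_aio_context, state);
    /* linear phase: commit all actions if every check passed, else abort */
    tran_finalize(tran, all_ok ? 0 : -EPERM);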

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 203 -
 include/block/block-global-state.h |   6 +
 include/block/block_int-common.h   |   3 +
 3 files changed, 211 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 58a9cfc8b7..c80e49009a 100644
--- a/block.c
+++ b/block.c
@@ -108,6 +108,10 @@ static void bdrv_reopen_abort(BDRVReopenState *reopen_state);
 
 static bool bdrv_backing_overridden(BlockDriverState *bs);
 
+static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+GSList **visited, Transaction *tran,
+Error **errp);
+
 /* If non-zero, use only whitelisted block drivers */
 static int use_bdrv_whitelist;
 
@@ -7325,7 +7329,7 @@ static void bdrv_attach_aio_context(BlockDriverState *bs,
  * must not own the AioContext lock for new_context (unless new_context is the
  * same as the current context of bs).
  *
- * @ignore will accumulate all visited BdrvChild object. The caller is
+ * @ignore will accumulate all visited BdrvChild objects. The caller is
  * responsible for freeing the list afterwards.
  */
 void bdrv_set_aio_context_ignore(BlockDriverState *bs,
@@ -7434,6 +7438,38 @@ static bool bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx,
 return true;
 }
 
+typedef struct BdrvStateSetAioContext {
+AioContext *new_ctx;
+BlockDriverState *bs;
+} BdrvStateSetAioContext;
+
+static bool bdrv_parent_change_aio_context(BdrvChild *c, AioContext *ctx,
+   GSList **visited, Transaction *tran,
+   Error **errp)
+{
+GLOBAL_STATE_CODE();
+if (g_slist_find(*visited, c)) {
+return true;
+}
+*visited = g_slist_prepend(*visited, c);
+
+/*
+ * A BdrvChildClass that doesn't handle AioContext changes cannot
+ * tolerate any AioContext changes
+ */
+if (!c->klass->change_aio_ctx) {
+char *user = bdrv_child_user_desc(c);
+error_setg(errp, "Changing iothreads is not supported by %s", user);
+g_free(user);
+return false;
+}
+if (!c->klass->change_aio_ctx(c, ctx, visited, tran, errp)) {
+assert(!errp || *errp);
+return false;
+}
+return true;
+}
+
 bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
 GSList **ignore, Error **errp)
 {
@@ -7445,6 +7481,18 @@ bool bdrv_child_can_set_aio_context(BdrvChild *c, AioContext *ctx,
 return bdrv_can_set_aio_context(c->bs, ctx, ignore, errp);
 }
 
+bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
+   GSList **visited, Transaction *tran,
+   Error **errp)
+{
+GLOBAL_STATE_CODE();
+

[PATCH v2 09/11] block: remove all unused ->can_set_aio_ctx and ->set_aio_ctx callbacks

2022-07-25 Thread Emanuele Giuseppe Esposito
Together with all _can_set_ and _set_ APIs, as they are not needed
anymore.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 196 -
 block/block-backend.c  |  33 -
 blockjob.c |  35 --
 include/block/block-global-state.h |   9 --
 include/block/block_int-common.h   |   4 -
 5 files changed, 277 deletions(-)

diff --git a/block.c b/block.c
index 9b47aacad2..c066b41c8c 100644
--- a/block.c
+++ b/block.c
@@ -1247,20 +1247,6 @@ static bool bdrv_child_cb_change_aio_ctx(BdrvChild *child, AioContext *ctx,
 return bdrv_change_aio_context(bs, ctx, visited, tran, errp);
 }
 
-static bool bdrv_child_cb_can_set_aio_ctx(BdrvChild *child, AioContext *ctx,
-  GSList **ignore, Error **errp)
-{
-BlockDriverState *bs = child->opaque;
-return bdrv_can_set_aio_context(bs, ctx, ignore, errp);
-}
-
-static void bdrv_child_cb_set_aio_ctx(BdrvChild *child, AioContext *ctx,
-  GSList **ignore)
-{
-BlockDriverState *bs = child->opaque;
-return bdrv_set_aio_context_ignore(bs, ctx, ignore);
-}
-
 /*
  * Returns the options and flags that a temporary snapshot should get, based on
  * the originally requested flags (the originally requested image will have
@@ -1501,8 +1487,6 @@ const BdrvChildClass child_of_bds = {
 .attach  = bdrv_child_cb_attach,
 .detach  = bdrv_child_cb_detach,
 .inactivate  = bdrv_child_cb_inactivate,
-.can_set_aio_ctx = bdrv_child_cb_can_set_aio_ctx,
-.set_aio_ctx = bdrv_child_cb_set_aio_ctx,
 .change_aio_ctx  = bdrv_child_cb_change_aio_ctx,
 .update_filename = bdrv_child_cb_update_filename,
 .get_parent_aio_context = child_of_bds_get_parent_aio_context,
@@ -7334,125 +7318,6 @@ static void bdrv_attach_aio_context(BlockDriverState *bs,
 bs->walking_aio_notifiers = false;
 }
 
-/*
- * Changes the AioContext used for fd handlers, timers, and BHs by this
- * BlockDriverState and all its children and parents.
- *
- * Must be called from the main AioContext.
- *
- * The caller must own the AioContext lock for the old AioContext of bs, but it
- * must not own the AioContext lock for new_context (unless new_context is the
- * same as the current context of bs).
- *
- * @ignore will accumulate all visited BdrvChild objects. The caller is
- * responsible for freeing the list afterwards.
- */
-void bdrv_set_aio_context_ignore(BlockDriverState *bs,
- AioContext *new_context, GSList **ignore)
-{
-AioContext *old_context = bdrv_get_aio_context(bs);
-GSList *children_to_process = NULL;
-GSList *parents_to_process = NULL;
-GSList *entry;
-BdrvChild *child, *parent;
-
-g_assert(qemu_get_current_aio_context() == qemu_get_aio_context());
-GLOBAL_STATE_CODE();
-
-if (old_context == new_context) {
-return;
-}
-
-bdrv_drained_begin(bs);
-
-QLIST_FOREACH(child, &bs->children, next) {
-if (g_slist_find(*ignore, child)) {
-continue;
-}
-*ignore = g_slist_prepend(*ignore, child);
-children_to_process = g_slist_prepend(children_to_process, child);
-}
-
-QLIST_FOREACH(parent, &bs->parents, next_parent) {
-if (g_slist_find(*ignore, parent)) {
-continue;
-}
-*ignore = g_slist_prepend(*ignore, parent);
-parents_to_process = g_slist_prepend(parents_to_process, parent);
-}
-
-for (entry = children_to_process;
- entry != NULL;
- entry = g_slist_next(entry)) {
-child = entry->data;
-bdrv_set_aio_context_ignore(child->bs, new_context, ignore);
-}
-g_slist_free(children_to_process);
-
-for (entry = parents_to_process;
- entry != NULL;
- entry = g_slist_next(entry)) {
-parent = entry->data;
-assert(parent->klass->set_aio_ctx);
-parent->klass->set_aio_ctx(parent, new_context, ignore);
-}
-g_slist_free(parents_to_process);
-
-bdrv_detach_aio_context(bs);
-
-/* Acquire the new context, if necessary */
-if (qemu_get_aio_context() != new_context) {
-aio_context_acquire(new_context);
-}
-
-bdrv_attach_aio_context(bs, new_context);
-
-/*
- * If this function was recursively called from
- * bdrv_set_aio_context_ignore(), there may be nodes in the
- * subtree that have not yet been moved to the new AioContext.
- * Release the old one so bdrv_drained_end() can poll them.
- */
-if (qemu_get_aio_context() != old_context) {
-aio_context_release(old_context);
-}
-
-bdrv_drained_end(bs);
-
-if (qemu_get_aio_context() != old_context) {
-aio_context_acquire(old_context);
-}
-if (qemu_get_aio_context() != new_context) {
-aio_context_release(new_context);
-}
-}
-
-static bool bdrv_parent_can_set_aio_context(BdrvChild *c, AioContext *ctx,

[PATCH v2 05/11] blockjob: implement .change_aio_ctx in child_job

2022-07-25 Thread Emanuele Giuseppe Esposito
child_job_change_aio_ctx() is very similar to
child_job_can_set_aio_ctx(), but it implements a new transaction
so that if all checks pass, the new transaction's .commit()
will take care of changing the BlockJob AioContext.
child_job_set_aio_ctx_commit() is similar to child_job_set_aio_ctx(),
but it doesn't need to invoke the recursion, as this is already
taken care of by child_job_change_aio_ctx().

Note: bdrv_child_try_change_aio_context() is not called by
anyone at this point.

Reviewed-by: Hanna Reitz 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockjob.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/blockjob.c b/blockjob.c
index 375c90e4b8..704bab060f 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -126,6 +126,50 @@ static void child_job_drained_end(BdrvChild *c, int *drained_end_counter)
  job_resume(&job->job);
 }
 
+typedef struct BdrvStateChildJobContext {
+AioContext *new_ctx;
+BlockJob *job;
+} BdrvStateChildJobContext;
+
+static void child_job_set_aio_ctx_commit(void *opaque)
+{
+BdrvStateChildJobContext *s = opaque;
+BlockJob *job = s->job;
+
+job_set_aio_context(&job->job, s->new_ctx);
+}
+
+static TransactionActionDrv change_child_job_context = {
+.commit = child_job_set_aio_ctx_commit,
+.clean = g_free,
+};
+
+static bool child_job_change_aio_ctx(BdrvChild *c, AioContext *ctx,
+ GHashTable *visited, Transaction *tran,
+ Error **errp)
+{
+BlockJob *job = c->opaque;
+BdrvStateChildJobContext *s;
+GSList *l;
+
+for (l = job->nodes; l; l = l->next) {
+BdrvChild *sibling = l->data;
+if (!bdrv_child_change_aio_context(sibling, ctx, visited,
+   tran, errp)) {
+return false;
+}
+}
+
+s = g_new(BdrvStateChildJobContext, 1);
+*s = (BdrvStateChildJobContext) {
+.new_ctx = ctx,
+.job = job,
+};
+
+tran_add(tran, &change_child_job_context, s);
+return true;
+}
+
 static bool child_job_can_set_aio_ctx(BdrvChild *c, AioContext *ctx,
   GSList **ignore, Error **errp)
 {
@@ -174,6 +218,7 @@ static const BdrvChildClass child_job = {
 .drained_end= child_job_drained_end,
 .can_set_aio_ctx= child_job_can_set_aio_ctx,
 .set_aio_ctx= child_job_set_aio_ctx,
+.change_aio_ctx = child_job_change_aio_ctx,
 .stay_at_node   = true,
 .get_parent_aio_context = child_job_get_parent_aio_context,
 };
-- 
2.31.1




[PATCH v2 07/11] block-backend: implement .change_aio_ctx in child_root

2022-07-25 Thread Emanuele Giuseppe Esposito
blk_root_change_aio_ctx() is very similar to blk_root_can_set_aio_ctx(),
but implements a new transaction so that if all checks pass, the new
transaction's .commit will take care of changing the BlockBackend
AioContext. blk_root_set_aio_ctx_commit() is the same as
blk_root_set_aio_ctx().

Note: bdrv_child_try_change_aio_context() is not called by
anyone at this point.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/block-backend.c | 52 +++
 1 file changed, 52 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index f425b00793..b4951c6e21 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -138,6 +138,9 @@ static bool blk_root_can_set_aio_ctx(BdrvChild *child, AioContext *ctx,
  GSList **ignore, Error **errp);
 static void blk_root_set_aio_ctx(BdrvChild *child, AioContext *ctx,
  GSList **ignore);
+static bool blk_root_change_aio_ctx(BdrvChild *child, AioContext *ctx,
+GHashTable *visited, Transaction *tran,
+Error **errp);
 
 static char *blk_root_get_parent_desc(BdrvChild *child)
 {
@@ -336,6 +339,7 @@ static const BdrvChildClass child_root = {
 
 .can_set_aio_ctx= blk_root_can_set_aio_ctx,
 .set_aio_ctx= blk_root_set_aio_ctx,
+.change_aio_ctx = blk_root_change_aio_ctx,
 
 .get_parent_aio_context = blk_root_get_parent_aio_context,
 };
@@ -2208,6 +2212,54 @@ int blk_set_aio_context(BlockBackend *blk, AioContext *new_context,
 return blk_do_set_aio_context(blk, new_context, true, errp);
 }
 
+typedef struct BdrvStateBlkRootContext {
+AioContext *new_ctx;
+BlockBackend *blk;
+} BdrvStateBlkRootContext;
+
+static void blk_root_set_aio_ctx_commit(void *opaque)
+{
+BdrvStateBlkRootContext *s = opaque;
+BlockBackend *blk = s->blk;
+
+blk_do_set_aio_context(blk, s->new_ctx, false, &error_abort);
+}
+
+static TransactionActionDrv set_blk_root_context = {
+.commit = blk_root_set_aio_ctx_commit,
+.clean = g_free,
+};
+
+static bool blk_root_change_aio_ctx(BdrvChild *child, AioContext *ctx,
+GHashTable *visited, Transaction *tran,
+Error **errp)
+{
+BlockBackend *blk = child->opaque;
+BdrvStateBlkRootContext *s;
+
+if (!blk->allow_aio_context_change) {
+/*
+ * Manually created BlockBackends (those with a name) that are not
+ * attached to anything can change their AioContext without updating
+ * their user; return an error for others.
+ */
+if (!blk->name || blk->dev) {
+/* TODO Add BB name/QOM path */
+error_setg(errp, "Cannot change iothread of active block backend");
+return false;
+}
+}
+
+s = g_new(BdrvStateBlkRootContext, 1);
+*s = (BdrvStateBlkRootContext) {
+.new_ctx = ctx,
+.blk = blk,
+};
+
+tran_add(tran, &set_blk_root_context, s);
+return true;
+}
+
 static bool blk_root_can_set_aio_ctx(BdrvChild *child, AioContext *ctx,
  GSList **ignore, Error **errp)
 {
-- 
2.31.1
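
For readers unfamiliar with the QEMU transaction API that the two patches
above build on, here is a minimal, self-contained sketch of the
record-then-commit pattern. The demo_* names are illustrative, not from the
series, and error handling is elided:

    #include "qemu/osdep.h"
    #include "qemu/transactions.h"

    static void demo_commit(void *opaque)
    {
        /* Apply the recorded change, e.g. actually switch the AioContext. */
    }

    static TransactionActionDrv demo_action = {
        .commit = demo_commit,
        .clean = g_free,            /* runs after either commit or abort */
    };

    static bool demo_record(Transaction *tran)
    {
        /* Recursion phase: only record the intent, change nothing yet. */
        int *state = g_new0(int, 1);  /* stand-in for BdrvState...Context */

        tran_add(tran, &demo_action, state);
        return true;
    }

    void demo_driver(void)
    {
        Transaction *tran = tran_new();

        if (demo_record(tran)) {
            tran_commit(tran);      /* run every .commit, then every .clean */
        } else {
            tran_abort(tran);       /* run every .abort, then every .clean */
        }
    }

This is why blk_root_change_aio_ctx() above can return true without touching
the BlockBackend: the visible change is deferred to the .commit handler it
schedules.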




[PATCH v2 03/11] bdrv_change_aio_context: use hash table instead of list of visited nodes

2022-07-25 Thread Emanuele Giuseppe Esposito
Minor performance improvement, but given that we have hash tables
available, avoid iterating over the visited nodes list every time just
to check whether a node has already been visited.

The data structure is not actually a proper hash map, but a hash set,
as we are just adding nodes and not key/value pairs.

Suggested-by: Hanna Reitz 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 28 
 include/block/block-global-state.h |  2 +-
 include/block/block_int-common.h   |  3 ++-
 3 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/block.c b/block.c
index c80e49009a..c02a628336 100644
--- a/block.c
+++ b/block.c
@@ -109,7 +109,7 @@ static void bdrv_reopen_abort(BDRVReopenState 
*reopen_state);
 static bool bdrv_backing_overridden(BlockDriverState *bs);
 
 static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
-GSList **visited, Transaction *tran,
+GHashTable *visited, Transaction *tran,
 Error **errp);
 
 /* If non-zero, use only whitelisted block drivers */
@@ -7444,14 +7444,15 @@ typedef struct BdrvStateSetAioContext {
 } BdrvStateSetAioContext;
 
 static bool bdrv_parent_change_aio_context(BdrvChild *c, AioContext *ctx,
-   GSList **visited, Transaction *tran,
+   GHashTable *visited,
+   Transaction *tran,
Error **errp)
 {
 GLOBAL_STATE_CODE();
-if (g_slist_find(*visited, c)) {
+if (g_hash_table_contains(visited, c)) {
 return true;
 }
-*visited = g_slist_prepend(*visited, c);
+g_hash_table_add(visited, c);
 
 /*
  * A BdrvChildClass that doesn't handle AioContext changes cannot
@@ -7482,14 +7483,14 @@ bool bdrv_child_can_set_aio_context(BdrvChild *c, 
AioContext *ctx,
 }
 
 bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
-   GSList **visited, Transaction *tran,
+   GHashTable *visited, Transaction *tran,
Error **errp)
 {
 GLOBAL_STATE_CODE();
-if (g_slist_find(*visited, c)) {
+if (g_hash_table_contains(visited, c)) {
 return true;
 }
-*visited = g_slist_prepend(*visited, c);
+g_hash_table_add(visited, c);
 return bdrv_change_aio_context(c->bs, ctx, visited, tran, errp);
 }
 
@@ -7561,7 +7562,7 @@ static TransactionActionDrv set_aio_context = {
  * responsible for freeing the list afterwards.
  */
 static bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
-GSList **visited, Transaction *tran,
+GHashTable *visited, Transaction *tran,
 Error **errp)
 {
 BdrvChild *c;
@@ -7646,16 +7647,19 @@ int bdrv_child_try_change_aio_context(BlockDriverState 
*bs, AioContext *ctx,
   BdrvChild *ignore_child, Error **errp)
 {
 Transaction *tran;
-GSList *visited;
+GHashTable *visited;
 int ret;
 AioContext *old_context = bdrv_get_aio_context(bs);
 GLOBAL_STATE_CODE();
 
 /* Recursion phase: go through all nodes of the graph */
 tran = tran_new();
-visited = ignore_child ? g_slist_prepend(NULL, ignore_child) : NULL;
-ret = bdrv_change_aio_context(bs, ctx, &visited, tran, errp);
-g_slist_free(visited);
+visited = g_hash_table_new(NULL, NULL);
+if (ignore_child) {
+g_hash_table_add(visited, ignore_child);
+}
+ret = bdrv_change_aio_context(bs, ctx, visited, tran, errp);
+g_hash_table_destroy(visited);
 
 /* Linear phase: go through all callbacks collected in the transaction */
 
diff --git a/include/block/block-global-state.h 
b/include/block/block-global-state.h
index fdcb81a175..ceecf0aa8e 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -230,7 +230,7 @@ bool bdrv_can_set_aio_context(BlockDriverState *bs, 
AioContext *ctx,
 AioContext *bdrv_child_get_parent_aio_context(BdrvChild *c);
 
 bool bdrv_child_change_aio_context(BdrvChild *c, AioContext *ctx,
-   GSList **visited, Transaction *tran,
+   GHashTable *visited, Transaction *tran,
Error **errp);
 int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
   BdrvChild *ignore_child, Error **errp);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 43828cf74f..c639873487 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -897,7 +897,8 @@ struct BdrvChildClass {
 void (*set_aio_ctx)(BdrvChild *child, AioContext *ctx, GSList 
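
For reference, the GHashTable-as-set idiom the patch relies on looks like
this in isolation — a small compilable sketch, not code from the series:

    #include <glib.h>

    int main(void)
    {
        /* NULL hash/equal functions fall back to direct pointer hashing. */
        GHashTable *visited = g_hash_table_new(NULL, NULL);
        int node;

        if (!g_hash_table_contains(visited, &node)) {
            g_hash_table_add(visited, &node);  /* O(1) vs. an O(n) list scan */
        }
        g_hash_table_destroy(visited);
        return 0;
    }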

[PATCH v2 04/11] bdrv_child_try_change_aio_context: add transaction parameter

2022-07-25 Thread Emanuele Giuseppe Esposito
This enables the caller to use the same transaction to also
keep track of aiocontext changes.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 31 --
 include/block/block-global-state.h |  5 +
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index c02a628336..221bf90268 100644
--- a/block.c
+++ b/block.c
@@ -7643,17 +7643,16 @@ int bdrv_child_try_set_aio_context(BlockDriverState 
*bs, AioContext *ctx,
  * For the same reason, it temporarily holds also the new AioContext, since
  * bdrv_drained_end calls BDRV_POLL_WHILE that assumes the lock is taken too.
  */
-int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
-  BdrvChild *ignore_child, Error **errp)
+int bdrv_child_try_change_aio_context_tran(BlockDriverState *bs,
+   AioContext *ctx,
+   BdrvChild *ignore_child,
+   Transaction *tran,
+   Error **errp)
 {
-Transaction *tran;
 GHashTable *visited;
 int ret;
-AioContext *old_context = bdrv_get_aio_context(bs);
 GLOBAL_STATE_CODE();
 
-/* Recursion phase: go through all nodes of the graph */
-tran = tran_new();
 visited = g_hash_table_new(NULL, NULL);
 if (ignore_child) {
 g_hash_table_add(visited, ignore_child);
@@ -7661,6 +7660,26 @@ int bdrv_child_try_change_aio_context(BlockDriverState 
*bs, AioContext *ctx,
 ret = bdrv_change_aio_context(bs, ctx, visited, tran, errp);
 g_hash_table_destroy(visited);
 
+return ret;
+}
+
+/*
+ * See bdrv_child_try_change_aio_context_tran for invariants on
+ * AioContext locks.
+ */
+int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
+  BdrvChild *ignore_child, Error **errp)
+{
+Transaction *tran;
+int ret;
+AioContext *old_context = bdrv_get_aio_context(bs);
+GLOBAL_STATE_CODE();
+
+/* Recursion phase: go through all nodes of the graph */
+tran = tran_new();
+ret = bdrv_child_try_change_aio_context_tran(bs, ctx, ignore_child, tran,
+ errp);
+
 /* Linear phase: go through all callbacks collected in the transaction */
 
 if (!ret) {
diff --git a/include/block/block-global-state.h 
b/include/block/block-global-state.h
index ceecf0aa8e..1bd445b507 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -234,6 +234,11 @@ bool bdrv_child_change_aio_context(BdrvChild *c, 
AioContext *ctx,
Error **errp);
 int bdrv_child_try_change_aio_context(BlockDriverState *bs, AioContext *ctx,
   BdrvChild *ignore_child, Error **errp);
+int bdrv_child_try_change_aio_context_tran(BlockDriverState *bs,
+   AioContext *ctx,
+   BdrvChild *ignore_child,
+   Transaction *tran,
+   Error **errp);
 
 int bdrv_probe_blocksizes(BlockDriverState *bs, BlockSizes *bsz);
 int bdrv_probe_geometry(BlockDriverState *bs, HDGeometry *geo);
-- 
2.31.1




[PATCH v2 00/11] Refactor bdrv_try_set_aio_context using transactions

2022-07-25 Thread Emanuele Giuseppe Esposito
The aim of this series is to reorganize bdrv_try_set_aio_context
and drop BDS ->set_aio_context and ->can_set_aio_ctx callbacks in
favour of a new one, ->change_aio_ctx.

More information in patch 3 (which is also RFC, due to the doubts
I have with AioContext locks).

Patch 1 just adds assertions in the code; patch 2 extends the transaction
API to make it possible to also add transactions at the tail of the list.
Patch 3 is the core of this series, and introduces the new callback.
It is marked as RFC and the reason is explained in the commit message.
Patches 4-5-6 implement ->change_aio_ctx in the various block, blockjob
and block-backend BDSes.
Patch 7 substitutes ->change_aio_ctx with the old callbacks, and
patch 8 takes care of deleting the old callbacks and unused code.

This series is based on "job: replace AioContext lock with job_mutex",
but just because it uses job_set_aio_context() introduced there.

Suggested-by: Paolo Bonzini 
Based-on: <20220706201533.289775-1-eespo...@redhat.com>
---
v2:
* remove transaction patch, and drain transactions
* re-add old AioContext lock/unlock, since it makes sense to have it
* typos in commit message
* various cleanups in block-backend callbacks
* use hash map instead of glist when marking visited nodes
* 2 additional patches, getting rid of
  bdrv_try_set_aio_context and bdrv_child_try_change_aio_context

Emanuele Giuseppe Esposito (11):
  block.c: assert bs->aio_context is written under BQL and drains
  block: use transactions as a replacement of ->{can_}set_aio_context()
  bdrv_change_aio_context: use hash table instead of list of visited
nodes
  bdrv_child_try_change_aio_context: add transaction parameter
  blockjob: implement .change_aio_ctx in child_job
  block: implement .change_aio_ctx in child_of_bds
  block-backend: implement .change_aio_ctx in child_root
  block: use the new _change_ API instead of _can_set_ and _set_
  block: remove all unused ->can_set_aio_ctx and ->set_aio_ctx callbacks
  block: rename bdrv_child_try_change_aio_context in
bdrv_try_change_aio_context
  block: remove bdrv_try_set_aio_context and replace it with
bdrv_try_change_aio_context

 block.c| 360 -
 block/block-backend.c  |  74 +++---
 block/export/export.c  |   2 +-
 blockdev.c |  22 +-
 blockjob.c |  50 ++--
 docs/devel/multiple-iothreads.txt  |   4 +-
 include/block/block-global-state.h |  18 +-
 include/block/block_int-common.h   |   6 +-
 job.c  |   2 +-
 tests/unit/test-bdrv-drain.c   |   6 +-
 tests/unit/test-block-iothread.c   |  10 +-
 11 files changed, 310 insertions(+), 244 deletions(-)

-- 
2.31.1




Re: [PATCH for-7.2 04/10] hmp, device_tree.c: introduce fdt-save

2022-07-25 Thread Dr. David Alan Gilbert
* Daniel Henrique Barboza (danielhb...@gmail.com) wrote:
> To save the FDT blob we have the '-machine dumpdtb=' property. With this
> property set, the machine saves the FDT in  and exit. The created
> file can then be converted to plain text dts format using 'dtc'.
> 
> There's nothing particularly sophisticated into saving the FDT that
> can't be done with the machine at any state, as long as the machine has
> a valid FDT to be saved.
> 
> The 'fdt-save' command receives a 'filename' parameter and, if a valid
> FDT is available, it'll save it in a file 'filename'. In short, this is
> a '-machine dumpdtb' that can be fired on demand via HMP.
> 
> A valid FDT consists of a FDT that was created using libfdt being
> retrieved via 'current_machine->fdt' in device_tree.c. This condition is
> met by most FDT users in QEMU.
> 
> Cc: Dr. David Alan Gilbert 
> Signed-off-by: Daniel Henrique Barboza 

These all probably should be done as wrappers around qmp equivalents.

Dave

> ---
>  hmp-commands.hx  | 13 +
>  include/sysemu/device_tree.h |  2 ++
>  monitor/misc.c   | 13 +
>  softmmu/device_tree.c| 18 ++
>  4 files changed, 46 insertions(+)
> 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index c9d465735a..3c134cf652 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -1768,3 +1768,16 @@ ERST
>"\n\t\t\t -b to specify dirty bitmap as method of 
> calculation)",
>  .cmd= hmp_calc_dirty_rate,
>  },
> +
> +{
> +.name   = "fdt-save",
> +.args_type  = "filename:s",
> +.params = "[filename] file to save the FDT",
> +.help   = "save the FDT in the 'filename' file to be decoded 
> using dtc",
> +.cmd= hmp_fdt_save,
> +},
> +
> +SRST
> +``fdt-save`` *filename*
> +  Save the FDT in the 'filename' file to be decoded using dtc
> +ERST
> diff --git a/include/sysemu/device_tree.h b/include/sysemu/device_tree.h
> index ef060a9759..1397adb21c 100644
> --- a/include/sysemu/device_tree.h
> +++ b/include/sysemu/device_tree.h
> @@ -123,6 +123,8 @@ int qemu_fdt_nop_node(void *fdt, const char *node_path);
>  int qemu_fdt_add_subnode(void *fdt, const char *name);
>  int qemu_fdt_add_path(void *fdt, const char *path);
>  
> +void fdt_save(const char *filename, Error **errp);
> +
>  #define qemu_fdt_setprop_cells(fdt, node_path, property, ...)
>  \
>  do { 
>  \
>  uint32_t qdt_tmp[] = { __VA_ARGS__ };
>  \
> diff --git a/monitor/misc.c b/monitor/misc.c
> index 3d2312ba8d..145285cec0 100644
> --- a/monitor/misc.c
> +++ b/monitor/misc.c
> @@ -78,6 +78,7 @@
>  #include "qapi/qmp-event.h"
>  #include "sysemu/cpus.h"
>  #include "qemu/cutils.h"
> +#include "sysemu/device_tree.h"
>  
>  #if defined(TARGET_S390X)
>  #include "hw/s390x/storage-keys.h"
> @@ -936,6 +937,18 @@ static void hmp_boot_set(Monitor *mon, const QDict 
> *qdict)
>  }
>  }
>  
> +static void hmp_fdt_save(Monitor *mon, const QDict *qdict)
> +{
> +const char *path = qdict_get_str(qdict, "filename");
> +Error *local_err = NULL;
> +
> +fdt_save(path, &local_err);
> +
> +if (local_err) {
> +error_report_err(local_err);
> +}
> +}
> +
>  static void hmp_info_mtree(Monitor *mon, const QDict *qdict)
>  {
>  bool flatview = qdict_get_try_bool(qdict, "flatview", false);
> diff --git a/softmmu/device_tree.c b/softmmu/device_tree.c
> index 6ca3fad285..eeab6a5ef0 100644
> --- a/softmmu/device_tree.c
> +++ b/softmmu/device_tree.c
> @@ -643,3 +643,21 @@ out:
>  g_free(propcells);
>  return ret;
>  }
> +
> +void fdt_save(const char *filename, Error **errp)
> +{
> +int size;
> +
> +if (!current_machine->fdt) {
> +error_setg(errp, "Unable to find the machine FDT");
> +return;
> +}
> +
> +size = fdt_totalsize(current_machine->fdt);
> +
> +if (g_file_set_contents(filename, current_machine->fdt, size, NULL)) {
> +return;
> +}
> +
> +error_setg(errp, "Error when saving machine FDT to file %s", filename);
> +}
> -- 
> 2.36.1
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




[PATCH] ui/console: fix qemu_console_resize() regression

2022-07-25 Thread marcandre . lureau
From: Marc-André Lureau 

The display may be corrupted when changing screen colour depth in
qemu-system-ppc/MacOS since 7.0.

Do not short-cut qemu_console_resize() if the surface is backed by vga
vram. When the scanout isn't set, or it is already allocated, or it is an
OpenGL scanout, and the size fits, we still avoid the reallocation & replace
path.

Fixes: commit cb8962c1 ("ui: do not create a surface when resizing a GL 
scanout")

Reported-by: Mark Cave-Ayland 
Signed-off-by: Marc-André Lureau 
---
 ui/console.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/ui/console.c b/ui/console.c
index e139f7115e1f..765892f84f1c 100644
--- a/ui/console.c
+++ b/ui/console.c
@@ -2575,11 +2575,13 @@ static void vc_chr_open(Chardev *chr,
 
 void qemu_console_resize(QemuConsole *s, int width, int height)
 {
-DisplaySurface *surface;
+DisplaySurface *surface = qemu_console_surface(s);
 
 assert(s->console_type == GRAPHIC_CONSOLE);
 
-if (qemu_console_get_width(s, -1) == width &&
+if ((s->scanout.kind != SCANOUT_SURFACE ||
+ (surface && surface->flags & QEMU_ALLOCATED_FLAG)) &&
+qemu_console_get_width(s, -1) == width &&
 qemu_console_get_height(s, -1) == height) {
 return;
 }
-- 
2.37.0.rc0




Re: [PATCH v2 3/3] python/qemu/machine: use socketpair() for QMP by default

2022-07-25 Thread Daniel P . Berrangé
On Thu, Jun 30, 2022 at 04:34:19PM +0400, marcandre.lur...@redhat.com wrote:
> From: Marc-André Lureau 
> 
> When no monitor address is given, establish the QMP communication through
> a socketpair() (API is also supported on Windows since Python 3.5)
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  python/qemu/machine/machine.py | 24 
>  1 file changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
> index 37191f433b2d..aa1d9447352d 100644
> --- a/python/qemu/machine/machine.py
> +++ b/python/qemu/machine/machine.py
> @@ -158,17 +158,13 @@ def __init__(self,
>  self._qmp_timer = qmp_timer
>  
>  self._name = name or f"qemu-{os.getpid()}-{id(self):02x}"
> +self._sock_pair: Optional[Tuple[socket.socket, socket.socket]] = None
>  self._temp_dir: Optional[str] = None
>  self._base_temp_dir = base_temp_dir
>  self._sock_dir = sock_dir
>  self._log_dir = log_dir
>  
> -if monitor_address is not None:
> -self._monitor_address = monitor_address
> -else:
> -self._monitor_address = os.path.join(
> -self.sock_dir, f"{self._name}-monitor.sock"
> -)
> +self._monitor_address = monitor_address

Almost nothing in QEMU passes 'monitor_address' right now, so the effect
of this will be that essentially all usage switches to the socketpair
behaviour. Should be ok, as nothing is expecting to have the ability to
leave QEMU running, and re-connect to its monitor in another process
later.

>  
>  self._console_log_path = console_log
>  if self._console_log_path:
> @@ -303,7 +299,11 @@ def _base_args(self) -> List[str]:
>  args = ['-display', 'none', '-vga', 'none']
>  
>  if self._qmp_set:
> -if isinstance(self._monitor_address, tuple):
> +if self._sock_pair:
> +fd = self._sock_pair[0].fileno()
> +os.set_inheritable(fd, True)
> +moncdev = f"socket,id=mon,fd={fd}"
> +elif isinstance(self._monitor_address, tuple):
>  moncdev = "socket,id=mon,host={},port={}".format(
>  *self._monitor_address
>  )
> @@ -337,10 +337,17 @@ def _pre_launch(self) -> None:
>  self._remove_files.append(self._console_address)
>  
>  if self._qmp_set:
> +monitor_address = None
> +sock = None
> +if self._monitor_address is None:
> +self._sock_pair = socket.socketpair()
> +sock = self._sock_pair[1]
>  if isinstance(self._monitor_address, str):
>  self._remove_files.append(self._monitor_address)
> +monitor_address = self._monitor_address
>  self._qmp_connection = QEMUMonitorProtocol(
> -self._monitor_address,
> +address=monitor_address,
> +sock=sock,
>  server=True,
>  nickname=self._name
>  )
> @@ -360,6 +367,7 @@ def _pre_launch(self) -> None:
>  ))
>  
>  def _post_launch(self) -> None:
> +self._sock_pair[0].close()
>  if self._qmp_connection:
>  self._qmp.accept(self._qmp_timer)
>  
> -- 
> 2.37.0.rc0
> 

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
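
For context, the mechanism being adopted above, reduced to a plain POSIX C
sketch (QEMU itself receives the inherited descriptor via
"-chardev socket,id=mon,fd=N"; the forked child below is just a stand-in):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];
        char buf[32];
        ssize_t n;

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
            perror("socketpair");
            return 1;
        }
        if (fork() == 0) {
            /* Child: inherits fds[1] and uses it as its end of the link. */
            close(fds[0]);
            if (write(fds[1], "ready\n", 6) != 6) {
                _exit(1);
            }
            _exit(0);
        }
        /* Parent: talks over fds[0]; no listening socket or path needed. */
        close(fds[1]);
        n = read(fds[0], buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("child says: %s", buf);
        }
        close(fds[0]);
        return 0;
    }

In C the pair is inheritable across fork() by default; CPython creates its
descriptors non-inheritable, which is why the patch calls
os.set_inheritable() before spawning QEMU.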




Re: [PATCH v2 0/3] python/qemu/machine: fix potential hang in QMP accept

2022-07-25 Thread Daniel P . Berrangé
On Mon, Jul 25, 2022 at 03:23:26PM +0400, Marc-André Lureau wrote:
> Hi
> 
> On Fri, Jul 1, 2022 at 2:51 AM John Snow  wrote:
> 
> > On Thu, Jun 30, 2022 at 8:34 AM  wrote:
> > >
> > > From: Marc-André Lureau 
> > >
> > > Hi,
> > >
> > > As reported earlier by Richard Henderson ("virgl avocado hang" thread),
> > avocado
> > > tests may hang when QEMU exits before the QMP connection is established.
> > >
> > > v2:
> > >  - use a socketpair() for QMP (instead of async concurrent code from v1)
> > as
> > >suggested by Daniel Berrange.
> > >  - should not regress (hopefully)
> > >
> > > Marc-André Lureau (3):
> > >   python/qmp/protocol: add open_with_socket()
> > >   python/qmp/legacy: make QEMUMonitorProtocol accept a socket
> > >   python/qemu/machine: use socketpair() for QMP by default
> > >
> > >  python/qemu/machine/machine.py | 24 
> > >  python/qemu/qmp/legacy.py  | 18 +++---
> > >  python/qemu/qmp/protocol.py| 25 -
> > >  3 files changed, 51 insertions(+), 16 deletions(-)
> > >
> > > --
> > > 2.37.0.rc0
> > >
> >
> > For anything that touches python/qemu/qmp/*, may I please ask that you
> > submit them to https://gitlab.com/qemu-project/python-qemu-qmp ?
> >
> >
> Ok
> 
> 
> > (I'll review them in the meantime on-list just in the interest of
> > moving things along.)
> >
> 
> I was waiting for a review before updating the patches / moving to
> python-qemu-qmp.

This code looks decent to me

  Reviewed-by: Daniel P. Berrangé 

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2 0/3] python/qemu/machine: fix potential hang in QMP accept

2022-07-25 Thread Marc-André Lureau
Hi

On Fri, Jul 1, 2022 at 2:51 AM John Snow  wrote:

> On Thu, Jun 30, 2022 at 8:34 AM  wrote:
> >
> > From: Marc-André Lureau 
> >
> > Hi,
> >
> > As reported earlier by Richard Henderson ("virgl avocado hang" thread),
> avocado
> > tests may hang when QEMU exits before the QMP connection is established.
> >
> > v2:
> >  - use a socketpair() for QMP (instead of async concurrent code from v1)
> as
> >suggested by Daniel Berrange.
> >  - should not regress (hopefully)
> >
> > Marc-André Lureau (3):
> >   python/qmp/protocol: add open_with_socket()
> >   python/qmp/legacy: make QEMUMonitorProtocol accept a socket
> >   python/qemu/machine: use socketpair() for QMP by default
> >
> >  python/qemu/machine/machine.py | 24 
> >  python/qemu/qmp/legacy.py  | 18 +++---
> >  python/qemu/qmp/protocol.py| 25 -
> >  3 files changed, 51 insertions(+), 16 deletions(-)
> >
> > --
> > 2.37.0.rc0
> >
>
> For anything that touches python/qemu/qmp/*, may I please ask that you
> submit them to https://gitlab.com/qemu-project/python-qemu-qmp ?
>
>
Ok


> (I'll review them in the meantime on-list just in the interest of
> moving things along.)
>

I was waiting for a review before updating the patches / moving to
python-qemu-qmp.

thanks

-- 
Marc-André Lureau


Re: [PATCH] linux-user: Don't assume 0 is not a valid host timer_t value

2022-07-25 Thread Daniel P . Berrangé
On Mon, Jul 25, 2022 at 12:00:35PM +0100, Peter Maydell wrote:
> For handling guest POSIX timers, we currently use an array
> g_posix_timers[], whose entries are a host timer_t value, or 0 for
> "this slot is unused".  When the guest calls the timer_create syscall
> we look through the array for a slot containing 0, and use that for
> the new timer.
> 
> This scheme assumes that host timer_t values can never be zero.  This
> is unfortunately not a valid assumption -- for some host libc
> versions, timer_t values are simply indexes starting at 0.  When
> using this kind of host libc, the effect is that the first and second
> timers end up sharing a slot, and so when the guest tries to operate
> on the first timer it changes the second timer instead.

For sake of historical record, could you mention here which specific
libc impl / version highlights the problem.

> 
> Rework the timer allocation code, so that:
>  * the 'slot in use' indication uses a separate array from the
>host timer_t array
>  * we grab the free slot atomically, to avoid races when multiple
>threads call timer_create simultaneously
>  * releasing an allocated slot is abstracted out into a new
>free_host_timer_slot() function called in the correct places
> 
> This fixes:
>  * problems on hosts where timer_t 0 is valid
>  * the FIXME in next_free_host_timer() about locking
>  * bugs in the error paths in timer_create where we forgot to release
>the slot we grabbed, or forgot to free the host timer
> 
> Reported-by: Jon Alduan 
> Signed-off-by: Peter Maydell 
> ---
>  linux-user/syscall.c | 24 
>  1 file changed, 16 insertions(+), 8 deletions(-)

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




[PATCH] linux-user: Don't assume 0 is not a valid host timer_t value

2022-07-25 Thread Peter Maydell
For handling guest POSIX timers, we currently use an array
g_posix_timers[], whose entries are a host timer_t value, or 0 for
"this slot is unused".  When the guest calls the timer_create syscall
we look through the array for a slot containing 0, and use that for
the new timer.

This scheme assumes that host timer_t values can never be zero.  This
is unfortunately not a valid assumption -- for some host libc
versions, timer_t values are simply indexes starting at 0.  When
using this kind of host libc, the effect is that the first and second
timers end up sharing a slot, and so when the guest tries to operate
on the first timer it changes the second timer instead.

Rework the timer allocation code, so that:
 * the 'slot in use' indication uses a separate array from the
   host timer_t array
 * we grab the free slot atomically, to avoid races when multiple
   threads call timer_create simultaneously
 * releasing an allocated slot is abstracted out into a new
   free_host_timer_slot() function called in the correct places

This fixes:
 * problems on hosts where timer_t 0 is valid
 * the FIXME in next_free_host_timer() about locking
 * bugs in the error paths in timer_create where we forgot to release
   the slot we grabbed, or forgot to free the host timer

Reported-by: Jon Alduan 
Signed-off-by: Peter Maydell 
---
 linux-user/syscall.c | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 991b85e6b4d..b75de1bc8d6 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -497,20 +497,25 @@ _syscall4(int, sys_prlimit64, pid_t, pid, int, resource,
 
 #if defined(TARGET_NR_timer_create)
 /* Maximum of 32 active POSIX timers allowed at any one time. */
-static timer_t g_posix_timers[32] = { 0, } ;
+#define GUEST_TIMER_MAX 32
+static timer_t g_posix_timers[GUEST_TIMER_MAX];
+static int g_posix_timer_allocated[GUEST_TIMER_MAX];
 
 static inline int next_free_host_timer(void)
 {
-int k ;
-/* FIXME: Does finding the next free slot require a lock? */
-for (k = 0; k < ARRAY_SIZE(g_posix_timers); k++) {
-if (g_posix_timers[k] == 0) {
-g_posix_timers[k] = (timer_t) 1;
+int k;
+for (k = 0; k < ARRAY_SIZE(g_posix_timer_allocated); k++) {
+if (qatomic_xchg(g_posix_timer_allocated + k, 1) == 0) {
 return k;
 }
 }
 return -1;
 }
+
+static inline void free_host_timer_slot(int id)
+{
+qatomic_store_release(g_posix_timer_allocated + id, 0);
+}
 #endif
 
 static inline int host_to_target_errno(int host_errno)
@@ -12832,15 +12837,18 @@ static abi_long do_syscall1(CPUArchState *cpu_env, 
int num, abi_long arg1,
 phost_sevp = &host_sevp;
 ret = target_to_host_sigevent(phost_sevp, arg2);
 if (ret != 0) {
+free_host_timer_slot(timer_index);
 return ret;
 }
 }
 
 ret = get_errno(timer_create(clkid, phost_sevp, phtimer));
 if (ret) {
-phtimer = NULL;
+free_host_timer_slot(timer_index);
 } else {
 if (put_user(TIMER_MAGIC | timer_index, arg3, target_timer_t)) 
{
+timer_delete(*phtimer);
+free_host_timer_slot(timer_index);
 return -TARGET_EFAULT;
 }
 }
@@ -12976,7 +12984,7 @@ static abi_long do_syscall1(CPUArchState *cpu_env, int 
num, abi_long arg1,
 } else {
 timer_t htimer = g_posix_timers[timerid];
 ret = get_errno(timer_delete(htimer));
-g_posix_timers[timerid] = 0;
+free_host_timer_slot(timerid);
 }
 return ret;
 }
-- 
2.25.1
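
The allocation scheme in the patch, restated as a self-contained sketch in
portable C11 atomics (QEMU's qatomic_xchg()/qatomic_store_release() map onto
the same semantics):

    #include <stdatomic.h>

    #define SLOT_MAX 32
    static atomic_int slot_allocated[SLOT_MAX];

    static int alloc_slot(void)
    {
        for (int k = 0; k < SLOT_MAX; k++) {
            /* The exchange returns the old value: seeing 0 means this
             * thread won the slot, even if several race over it. */
            if (atomic_exchange(&slot_allocated[k], 1) == 0) {
                return k;
            }
        }
        return -1;  /* all slots busy */
    }

    static void free_slot(int k)
    {
        /* Release ordering: teardown of the slot's contents is visible
         * before the slot can be observed as free again. */
        atomic_store_explicit(&slot_allocated[k], 0, memory_order_release);
    }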




Re: [PATCH v2 4/7] vdpa: Add asid parameter to vhost_vdpa_dma_map/unmap

2022-07-25 Thread Jason Wang



On 2022/7/22 21:43, Eugenio Pérez wrote:

So the caller can choose which ASID the mapping is destined for.

No need to update the batch functions as they will always be called from
memory listener updates at the moment. Memory listener updates will
always update ASID 0, as it's the passthrough ASID.

All vhost devices' ASIDs are 0 at this moment.

Signed-off-by: Eugenio Pérez 
---
  include/hw/virtio/vhost-vdpa.h |  8 +---
  hw/virtio/vhost-vdpa.c | 26 --
  net/vhost-vdpa.c   |  6 +++---
  hw/virtio/trace-events |  4 ++--
  4 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index d85643..6560bb9d78 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -29,6 +29,7 @@ typedef struct vhost_vdpa {
  int index;
  uint32_t msg_type;
  bool iotlb_batch_begin_sent;
+uint32_t address_space_id;
  MemoryListener listener;
  struct vhost_vdpa_iova_range iova_range;
  uint64_t acked_features;
@@ -42,8 +43,9 @@ typedef struct vhost_vdpa {
  VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
  } VhostVDPA;
  
-int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,

-   void *vaddr, bool readonly);
-int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, hwaddr iova, hwaddr size);
+int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+   hwaddr size, void *vaddr, bool readonly);
+int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+ hwaddr size);
  
  #endif

diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index e1ed56b26d..79623badf2 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -72,22 +72,24 @@ static bool 
vhost_vdpa_listener_skipped_section(MemoryRegionSection *section,
  return false;
  }
  
-int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size,

-   void *vaddr, bool readonly)
+int vhost_vdpa_dma_map(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+   hwaddr size, void *vaddr, bool readonly)
  {
  struct vhost_msg_v2 msg = {};
  int fd = v->device_fd;
  int ret = 0;
  
  msg.type = v->msg_type;

+msg.asid = asid;
  msg.iotlb.iova = iova;
  msg.iotlb.size = size;
  msg.iotlb.uaddr = (uint64_t)(uintptr_t)vaddr;
  msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
  msg.iotlb.type = VHOST_IOTLB_UPDATE;
  
-   trace_vhost_vdpa_dma_map(v, fd, msg.type, msg.iotlb.iova, msg.iotlb.size,

-msg.iotlb.uaddr, msg.iotlb.perm, msg.iotlb.type);
+trace_vhost_vdpa_dma_map(v, fd, msg.type, msg.asid, msg.iotlb.iova,
+ msg.iotlb.size, msg.iotlb.uaddr, msg.iotlb.perm,
+ msg.iotlb.type);
  
  if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {

  error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -98,18 +100,20 @@ int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, 
hwaddr size,
  return ret;
  }
  
-int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, hwaddr iova, hwaddr size)

+int vhost_vdpa_dma_unmap(struct vhost_vdpa *v, uint32_t asid, hwaddr iova,
+ hwaddr size)
  {
  struct vhost_msg_v2 msg = {};
  int fd = v->device_fd;
  int ret = 0;
  
  msg.type = v->msg_type;

+msg.asid = asid;
  msg.iotlb.iova = iova;
  msg.iotlb.size = size;
  msg.iotlb.type = VHOST_IOTLB_INVALIDATE;
  
-trace_vhost_vdpa_dma_unmap(v, fd, msg.type, msg.iotlb.iova,

+trace_vhost_vdpa_dma_unmap(v, fd, msg.type, msg.asid, msg.iotlb.iova,
 msg.iotlb.size, msg.iotlb.type);
  
  if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {

@@ -228,7 +232,7 @@ static void vhost_vdpa_listener_region_add(MemoryListener 
*listener,
  }
  
  vhost_vdpa_iotlb_batch_begin_once(v);

-ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
+ret = vhost_vdpa_dma_map(v, 0, iova, int128_get64(llsize),
   vaddr, section->readonly);
  if (ret) {
  error_report("vhost vdpa map fail!");
@@ -293,7 +297,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener 
*listener,
  vhost_iova_tree_remove(v->iova_tree, result);
  }
  vhost_vdpa_iotlb_batch_begin_once(v);
-ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
+ret = vhost_vdpa_dma_unmap(v, 0, iova, int128_get64(llsize));
  if (ret) {
  error_report("vhost_vdpa dma unmap error!");
  }
@@ -884,7 +888,7 @@ static bool vhost_vdpa_svq_unmap_ring(struct vhost_vdpa *v,
  }
  
  size = ROUND_UP(result->size, qemu_real_host_page_size());

-r = vhost_vdpa_dma_unmap(v, result->iova, size);
+r = vhost_vdpa_dma_unmap(v, v->address_space_id, result->iova, size);
  return r == 0;
  }
  
@@ 

Re: [PATCH v4 6/7] vdpa: Add virtio-net mac address via CVQ at start

2022-07-25 Thread Jason Wang



On 2022/7/22 19:12, Eugenio Pérez wrote:

This is needed so the destination vdpa device see the same state a the
guest set in the source.

Signed-off-by: Eugenio Pérez 
---
  net/vhost-vdpa.c | 61 
  1 file changed, 61 insertions(+)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 61516b1432..3e15a42c35 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -365,10 +365,71 @@ static virtio_net_ctrl_ack 
vhost_vdpa_net_cvq_add(VhostShadowVirtqueue *svq,
  return VIRTIO_NET_OK;
  }
  
+static int vhost_vdpa_net_start(NetClientState *nc)

+{
+VhostVDPAState *s = DO_UPCAST(VhostVDPAState, nc, nc);
+struct vhost_vdpa *v = &s->vhost_vdpa;
+VirtIONet *n;
+uint64_t features;
+VhostShadowVirtqueue *svq;
+
+assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA);
+
+if (!v->shadow_vqs_enabled) {
+return 0;
+}
+
+if (v->dev->nvqs != 1 &&
+v->dev->vq_index + v->dev->nvqs != v->dev->vq_index_end) {
+/* Only interested in CVQ */
+return 0;
+}



I'd have a dedicated NetClientInfo for cvq.



+
+n = VIRTIO_NET(v->dev->vdev);
+features = v->dev->vdev->host_features;
+svq = g_ptr_array_index(v->shadow_vqs, 0);
+if (features & BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR)) {
+const struct virtio_net_ctrl_hdr ctrl = {
+.class = VIRTIO_NET_CTRL_MAC,
+.cmd = VIRTIO_NET_CTRL_MAC_ADDR_SET,
+};
+uint8_t mac[6];
+const struct iovec out[] = {
+{
+.iov_base = (void *)&ctrl,
+.iov_len = sizeof(ctrl),
+},{
+.iov_base = mac,
+.iov_len = sizeof(mac),
+},
+};
+struct iovec dev_buffers[2] = {
+{ .iov_base = s->cvq_cmd_out_buffer },
+{ .iov_base = s->cvq_cmd_in_buffer },
+};
+bool ok;
+virtio_net_ctrl_ack state;
+
+ok = vhost_vdpa_net_cvq_map_sg(s, out, ARRAY_SIZE(out), dev_buffers);



To speed up the state recovery, can we map those buffers during svq start?

Thanks



+if (unlikely(!ok)) {
+return -1;
+}
+
+memcpy(mac, n->mac, sizeof(mac));
+state = vhost_vdpa_net_cvq_add(svq, dev_buffers);
+vhost_vdpa_cvq_unmap_buf(v, dev_buffers[0].iov_base);
+vhost_vdpa_cvq_unmap_buf(v, dev_buffers[1].iov_base);
+return state == VIRTIO_NET_OK ? 0 : 1;
+}
+
+return 0;
+}
+
  static NetClientInfo net_vhost_vdpa_info = {
  .type = NET_CLIENT_DRIVER_VHOST_VDPA,
  .size = sizeof(VhostVDPAState),
  .receive = vhost_vdpa_receive,
+.start = vhost_vdpa_net_start,
  .cleanup = vhost_vdpa_cleanup,
  .has_vnet_hdr = vhost_vdpa_has_vnet_hdr,
  .has_ufo = vhost_vdpa_has_ufo,





Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-07-25 Thread Gupta, Pankaj



I view it as a performance problem because nothing stops KVM from copying
from userspace into the private fd during the SEV ioctl().  What's missing
is the ability for userspace to directly initialize the private fd, which
may or may not avoid an extra memcpy() depending on how clever userspace is.

Can you please elaborate more on what you see as a performance problem? And
possible ways to solve it?


Oh, I'm not saying there actually _is_ a performance problem.  What I'm
saying is that in-place encryption is not a functional requirement, which
means it's purely an optimization, and thus we should only bother supporting
in-place encryption _if_ it would solve a performance bottleneck.


Even if we end up having a performance problem, I think we need to 
understand the workloads that we want to optimize before getting too 
excited about designing a speedup.


In particular, there's (depending on the specific technology, perhaps, 
and also architecture) a possible tradeoff between trying to reduce 
copying and trying to reduce unmapping and the associated flushes.  If a 
user program maps an fd, populates it, and then converts it in place 
into private memory (especially if it doesn't do it in a single shot), 
then that memory needs to get unmapped both from the user mm and 
probably from the kernel direct map.  On the flip side, it's possible to 
imagine an ioctl that does copy-and-add-to-private-fd that uses a 
private mm and doesn't need any TLB IPIs.


All of this is to say that trying to optimize right now seems quite 
premature to me.


Agreed. Thank you for explaining!

Thanks,
Pankaj






Re: driver type raw-xz supports discard=unmap?

2022-07-25 Thread Daniel P . Berrangé
On Fri, Jul 22, 2022 at 04:03:52PM -0400, Chris Murphy wrote:
> Is this valid?
> 
> [libvirt <disk> XML elided by the list archive; the <driver> element
> set type="raw-xz" together with discard="unmap"]
> 
> I know type="raw" works fine, I'm wondering if there'd be any problem
> with type "raw-xz" combined with discards?

This is libvirt configuration, so libvirt-us...@redhat.com is the better
list in general. That said, what is this 'raw-xz' type you refer to ?

AFAIK, that is not a disk driver type that exists in either libvirt or
QEMU releases.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH for-7.2 00/10] add hmp 'save-fdt' and 'info fdt' commands

2022-07-25 Thread Daniel P . Berrangé
On Fri, Jul 22, 2022 at 04:59:57PM -0300, Daniel Henrique Barboza wrote:
> Hi,
> 
> After dealing with an FDT element that isn't being shown in the userspace,
> having to shut down the guest, dump the FDT using 'machine -dumpdtb', and
> then use 'dtc' to see what was inside the FDT, I thought it was a good
> idea to add extra support for FDT handling in QEMU.
> 
> This series introduces 2 commands. 'fdt-save' behaves similarly to what
> 'machine -dumpdtb' does, with the advantage of saving the FDT of a running
> guest on demand. This command is implemented in patch 03.
> 
> The second command, 'info fdt ' is more sophisticated. This
> command can print specific nodes and properties of the FDT. A few
> examples:
> 
> - print the /cpus/cpu@0 from an ARM 'virt' machine:
> 
> (qemu) info fdt /cpus/cpu@0
> /cpus/cpu@0 {
> phandle = <0x8001>
> reg = <0x0>
> compatible = 'arm,cortex-a57'
> device_type = 'cpu'
> }
> (qemu) 
> 
> - print the device_type property of the interrupt-controller node of a
> pSeries machine:
> 
> (qemu) info fdt /interrupt-controller/device_type
> /interrupt-controller/device_type = 'PowerPC-External-Interrupt-Presentation'
> (qemu)

Please don't add new HMP-only commands. These should be provided
as QMP commands, where the HMP is a tiny shim to the QMP.

If you don't want to think about formally modelling the data
for 'info fdt' / 'query-fdt', then just put an 'x-' prefix on
the QMP command and return printed formatted data, as illustrated
in:

https://www.qemu.org/docs/master/devel/writing-monitor-commands.html#writing-a-debugging-aid-returning-unstructured-text

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
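
Following the pattern in that document, the HMP command then shrinks to a
thin shim over a QMP command returning HumanReadableText. A sketch with a
hypothetical x-query-fdt command (the qmp_x_query_fdt() stub would be
generated from a corresponding QAPI schema entry); this is not code from
the series:

    /* HMP side: a thin wrapper over the (hypothetical) QMP command */
    static void hmp_info_fdt(Monitor *mon, const QDict *qdict)
    {
        Error *err = NULL;
        g_autoptr(HumanReadableText) info = qmp_x_query_fdt(&err);

        if (hmp_handle_error(mon, err)) {
            return;
        }
        monitor_puts(mon, info->human_readable_text);
    }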




Re: Access target TranslatorOps

2022-07-25 Thread Peter Maydell
On Fri, 22 Jul 2022 at 06:09, Kenneth Adam Miller
 wrote:
>
> I need to determine the set of instruction encodings that the TCG can support 
> for a given platform. I am not bothered whether the target runs at all, and 
> in fact it is better if it doesn't, so runtime or translate time doesn't 
> bother me.

So, something like "does the emulated CPU support guest architecture
feature X" ? Look at how eg arm handles setting the Linux hwcap bits,
for instance.

> Imagine I were adding support for more instructions for a given platform.
> I would like to check that I'm using the API right. It's amazing that
> it's been so far and there's no way to check that the correct behavior
> occurs when a given encoding is encountered regarding the TCG.

The way to test "is the emulation correct" is to have test programs.
For Arm we use 'risu' to generate random instruction sequences and
check their behaviour against some golden reference, which can catch
things like "is this insn supposed to undef in this configuration".

> A boolean result from a can_translate called just when the target
> encounters the instruction would be good. Additionally, the ability
> to force the translation of arbitrary encodings would be good.

I am completely confused about what you want to do here, because
these requests just sound completely bizarre to me. The translator
is its own self-contained component, and linux-user should have no requirement at
all to be told whether an instruction happens to translate to
"raise an exception" or "generate code to do something".

> Additionally, the ability
> to force the translation of arbitrary encodings would be good.

This is easy -- just put the right bytes into the test binary.

-- PMM
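
"Put the right bytes into the test binary" can be as simple as planting a
raw encoding with the assembler's .inst directive — an AArch64 guest example
(the encoding shown is the architectural NOP; substitute the one under test):

    #include <stdio.h>

    int main(void)
    {
        /* Force the translator to see this exact 32-bit encoding. */
        __asm__ volatile(".inst 0xd503201f");  /* AArch64 NOP */
        printf("planted encoding executed\n");
        return 0;
    }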



Re: [PATCH v3 1/1] python/machine: Fix AF_UNIX path too long on macOS

2022-07-25 Thread Daniel P . Berrangé
On Fri, Jul 22, 2022 at 11:25:08AM -0700, Peter Delevoryas wrote:
> On macOS, private $TMPDIR's are the default. These $TMPDIR's are
> generated from a user's unix UID and UUID [1], which can create a
> relatively long path:
> 
> /var/folders/d7/rz20f6hd709c1ty8f6_6y_z4gn/T/
> 
> QEMU's avocado tests create a temporary directory prefixed by
> "avo_qemu_sock_", and create QMP sockets within _that_ as well.
> The QMP socket is unnecessarily long, because a temporary directory
> is created for every QEMUMachine object.
> 
> /avo_qemu_sock_uh3w_dgc/qemu-37331-10bacf110-monitor.sock
> 
> The path limit for unix sockets on macOS is 104: [2]
> 
> /*
>  * [XSI] Definitions for UNIX IPC domain.
>  */
> struct  sockaddr_un {
> unsigned char   sun_len;/* sockaddr len including null */
> sa_family_t sun_family; /* [XSI] AF_UNIX */
> charsun_path[104];  /* [XSI] path name (gag) */
> };
> 
> This results in avocado tests failing on macOS because the QMP unix
> socket can't be created, because the path is too long:
> 
> ERROR| Failed to establish connection: OSError: AF_UNIX path too long
> 
> This change resolves by reducing the size of the socket directory prefix
> and the suffix on the QMP and console socket names.
> 
> The result is paths like this:
> 
> pdel@pdel-mbp:/var/folders/d7/rz20f6hd709c1ty8f6_6y_z4gn/T
> $ tree qemu*
> qemu_df4evjeq
> qemu_jbxel3gy
> qemu_ml9s_gg7
> qemu_oc7h7f3u
> qemu_oqb1yf97
> ├── 10a004050.con
> └── 10a004050.qmp
> 
> [1] 
> https://apple.stackexchange.com/questions/353832/why-is-mac-osx-temp-directory-in-weird-path
> [2] 
> /Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/usr/include/sys/un.h
> 
> Signed-off-by: Peter Delevoryas 
> ---
>  python/qemu/machine/machine.py | 6 +++---
>  tests/avocado/avocado_qemu/__init__.py | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)

Reviewed-by: Daniel P. Berrangé 


With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
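
The limit being hit, shown as a standalone sketch: bind() has nowhere to put
a path that does not fit in sun_path (104 bytes on macOS, including the
trailing NUL):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    /* Returns -1 (without calling bind) when the path cannot fit. */
    static int bind_unix(int fd, const char *path)
    {
        struct sockaddr_un sa = { .sun_family = AF_UNIX };

        if (strlen(path) >= sizeof(sa.sun_path)) {
            fprintf(stderr, "AF_UNIX path too long: %s\n", path);
            return -1;
        }
        strcpy(sa.sun_path, path);
        return bind(fd, (struct sockaddr *)&sa, sizeof(sa));
    }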




Re: [PATCH] .cirrus.yml: Change winsymlinks to 'native'

2022-07-25 Thread Thomas Huth

On 19/07/2022 18.12, Bin Meng wrote:

From: Bin Meng 

At present winsymlinks is set to 'nativestrict', and its behavior is:

   a) if native symlinks are enabled and <target> exists, creates
      <destination> as a native Windows symlink;
   b) else if native symlinks are not enabled or if <target> does
      not exist, 'ln -s' fails.

This causes the following error message to be seen during configure:

   "ln: failed to create symbolic link
   'x86_64-softmmu/qemu-system-x86_64.exe': No such file or directory"

Change winsymlinks to 'native' whose behavior is most similar to the
behavior of 'ln -s' on *nix, that is:

   a) if native symlinks are enabled, and whether <target> exists
      or not, creates <destination> as a native Windows symlink;
   b) else if native symlinks are not enabled, and whether <target>
      exists or not, 'ln -s' creates <destination> as a Windows shortcut file.

Signed-off-by: Bin Meng 
---

  .cirrus.yml | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index 20843a420c..eac39024f2 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -10,7 +10,7 @@ windows_msys2_task:
  memory: 8G
env:
  CIRRUS_SHELL: powershell
-MSYS: winsymlinks:nativestrict
+MSYS: winsymlinks:native
  MSYSTEM: MINGW64
  MSYS2_URL: 
https://github.com/msys2/msys2-installer/releases/download/2022-05-03/msys2-base-x86_64-20220503.sfx.exe
  MSYS2_FINGERPRINT: 0


Acked-by: Thomas Huth 

Alex, if I've got that right, you're currently assembling a "testing" pull 
request - could you please pick up this patch for that, too? ... I currently 
don't have anything else pending right now, so I don't plan any new pull 
request soon.


 Thomas




Re: [PATCH v4 3/7] vdpa: Make vhost_vdpa_net_cvq_map_elem accept any out sg

2022-07-25 Thread Jason Wang



On 2022/7/22 19:12, Eugenio Pérez wrote:

So it's generic enough to accept any out sg buffer and we can inject
NIC state messages.

Signed-off-by: Eugenio Pérez 
---
  net/vhost-vdpa.c | 29 +++--
  1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c
index 1b82ac2e07..bbe1830824 100644
--- a/net/vhost-vdpa.c
+++ b/net/vhost-vdpa.c
@@ -302,35 +302,36 @@ dma_map_err:
  }
  
  /**

- * Copy the guest element into a dedicated buffer suitable to be sent to NIC
+ * Maps out sg and in buffer into dedicated buffers suitable to be sent to NIC
   *
- * @iov: [0] is the out buffer, [1] is the in one
+ * @dev_iov: [0] is the out buffer, [1] is the in one



This still has an assumption about the layout. I wonder if it's better to
simply use in_sg and out_sg.


Thanks



   */
-static bool vhost_vdpa_net_cvq_map_elem(VhostVDPAState *s,
-VirtQueueElement *elem,
-struct iovec *iov)
+static bool vhost_vdpa_net_cvq_map_sg(VhostVDPAState *s,
+  const struct iovec *out, size_t out_num,
+  struct iovec *dev_iov)
  {
  size_t in_copied;
  bool ok;
  
-iov[0].iov_base = s->cvq_cmd_out_buffer;

-ok = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, elem->out_sg, elem->out_num,
-vhost_vdpa_net_cvq_cmd_len(), iov[0].iov_base,
-&iov[0].iov_len, false);
+dev_iov[0].iov_base = s->cvq_cmd_out_buffer;
+ok = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, out, out_num,
+vhost_vdpa_net_cvq_cmd_len(),
+dev_iov[0].iov_base, &dev_iov[0].iov_len,
+false);
  if (unlikely(!ok)) {
  return false;
  }
  
-iov[1].iov_base = s->cvq_cmd_in_buffer;

+dev_iov[1].iov_base = s->cvq_cmd_in_buffer;
  ok = vhost_vdpa_cvq_map_buf(&s->vhost_vdpa, NULL, 0,
-sizeof(virtio_net_ctrl_ack), iov[1].iov_base,
-_copied, true);
+sizeof(virtio_net_ctrl_ack),
+dev_iov[1].iov_base, &in_copied, true);
  if (unlikely(!ok)) {
  vhost_vdpa_cvq_unmap_buf(&s->vhost_vdpa, s->cvq_cmd_out_buffer);
  return false;
  }
  
-iov[1].iov_len = sizeof(virtio_net_ctrl_ack);

+dev_iov[1].iov_len = sizeof(virtio_net_ctrl_ack);
  return true;
  }
  
@@ -434,7 +435,7 @@ static int vhost_vdpa_net_handle_ctrl_avail(VhostShadowVirtqueue *svq,

  };
  bool ok;
  
-ok = vhost_vdpa_net_cvq_map_elem(s, elem, dev_buffers);

+ok = vhost_vdpa_net_cvq_map_sg(s, elem->out_sg, elem->out_num, 
dev_buffers);
  if (unlikely(!ok)) {
  goto out;
  }





Re: [RFC PATCH 3/8] RFC: block: use transactions as a replacement of ->{can_}set_aio_context()

2022-07-25 Thread Emanuele Giuseppe Esposito



Am 20/07/2022 um 16:09 schrieb Vladimir Sementsov-Ogievskiy:
> On 7/13/22 00:19, Emanuele Giuseppe Esposito wrote:
>> +/*
>> + * @visited will accumulate all visited BdrvChild object. The caller is
>> + * responsible for freeing the list afterwards.
>> + */
>> +bool bdrv_change_aio_context(BlockDriverState *bs, AioContext *ctx,
>> + GSList **visited, Transaction *tran,
>> + Error **errp)
>> +{
>> +    BdrvChild *c;
>> +    BdrvStateSetAioContext *state;
>> +
>> +    GLOBAL_STATE_CODE();
>> +
>> +    if (bdrv_get_aio_context(bs) == ctx) {
>> +    return true;
>> +    }
>> +
>> +    QLIST_FOREACH(c, &bs->parents, next_parent) {
>> +    if (!bdrv_parent_change_aio_context(c, ctx, visited, tran,
>> errp)) {
>> +    return false;
>> +    }
>> +    }
>> +
>> +    QLIST_FOREACH(c, &bs->children, next) {
>> +    if (!bdrv_child_change_aio_context(c, ctx, visited, tran,
>> errp)) {
>> +    return false;
>> +    }
>> +    }
>> +
>> +    state = g_new(BdrvStateSetAioContext, 1);
>> +    *state = (BdrvStateSetAioContext) {
>> +    .new_ctx = ctx,
>> +    .bs = bs,
>> +    };
>> +
>> +    /* First half of transactions are drains */
>> +    tran_add(tran, &drain_begin_end, state);
>> +    /*
>> + * Second half of transactions takes care of setting the
>> + * AioContext and free the state
>> + */
>> +    tran_add_tail(tran, &set_aio_context, state);
>> +
>> +    return true;
>> +}
> 
> 
> First, it looks like you want to use .commit() as .prepare(), .clean()
> as commit(), and .prepare() as something that just schedule other actions.
> 
> In block transaction any effect that is visible to other transaction
> actions should be done in .prepare(). (and .prepare is the main function
> of the action with *tran parameter, i.e. bdrv_change_aio_context()
> function is actually .prepare() phase).
> 
> So, ideally, the action:
> 
>  - does the complete thing in .prepare (i.e. in the main function of the
> action)
>  - rollback it in .abort
>  - nothing in .commit
> 
> Usually we still need a .commit phase for the irreversible part of the
> action (like calling free() on the object). But that should be
> transparent for other actions.
> 
> So, generally we do:
> 
> tran = tran_new()
> 
> action_a(..., tran);
> action_b(..., tran);
> action_c(..., tran);
> 
> 
> And we expect, that action_b() operates on top of action_a() already
> done. And action_c() operate on top of action_b() done. So we cannot
> postpone visible (to other actions) effect of the action to .commit phase.
> 
> So, for example, if you want a kind of smart drained section in
> transaction, you may add simple
> 
> bdrv_drain_tran(..., tran);
> 
> that just call drained_begin(), and have only .clean() handler that call
> drained_end(). This way you may do
> 
> bdrv_drain_tran(bs, tran);
> action_a(..., tran);
> action_b(..., tran);
> 
> And be sure that both actions and all their .abort/.commit/.clean
> handlers are called inside drained seaction. Isn't it what you need?

I understand how you see transactions, but I think we can also use them
in a different way than intended (with proper documentation).

I don't think it makes sense to use transactions as rollback in this
case, even though with the coming v2 it would be more similar to what
you propose:

.prepare takes care of drain and checking if the node is ok
.commit changes the aiocontext lock only if all nodes are drained and
passed the test
.clean undrains (drained end)

> 
> 
> Second, this all becomes extremely difficult when we mix transactions
> and recursion. When I moved permission system to transactions, I've
> introduced topological dfs search to just get a list of nodes to handle.
> I think, if we want to rework aio context change, we should first get
> rid of recursion.
> 

I honestly don't see how recursion complicates things. It's just graph
traversal, and building a list of things to do while doing that.

Isn't it much more complex to do it with a loop (i.e. not recursively)?
Your bdrv_topological_dfs is recursive too.

Think about it as 2 separate steps:
- Recursion takes care of checking the nodes and draining them
- Once done, if all nodes are ready to switch, switch linearly by
iterating in a list of callbacks

Anyway, I am probably not going to wait for your feedback and will send v2
just because I think the way I did this double transaction list in v1
confuses people.

Feel free to continue the discussion here or in v2 directly, though.

Thank you,
Emanuele
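
For concreteness, the drained-section action Vladimir describes above could
look roughly like this — a sketch of the idea with illustrative names, not
code from either version of the series:

    static void bdrv_drain_tran_clean(void *opaque)
    {
        /* Runs last, after all .abort/.commit handlers of later actions. */
        bdrv_drained_end(opaque);
    }

    static TransactionActionDrv bdrv_drain_tran_drv = {
        .clean = bdrv_drain_tran_clean,
    };

    static void bdrv_drain_tran(BlockDriverState *bs, Transaction *tran)
    {
        bdrv_drained_begin(bs);  /* visible effect happens immediately */
        tran_add(tran, &bdrv_drain_tran_drv, bs);
    }

Because the drain begins immediately and only ends in .clean, every action
added after this one, and all of their handlers, execute inside the drained
section.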




Re: [PATCH] linux-user: Passthrough MADV_DONTNEED for certain file mappings

2022-07-25 Thread Laurent Vivier

On 01/07/2022 15:52, Ilya Leoshkevich wrote:

This is a follow-up for commit 892a4f6a750a ("linux-user: Add partial
support for MADV_DONTNEED"), which added passthrough for anonymous
mappings. File mappings can be handled in a similar manner.

In order to do that, mark pages, for which mmap() was passed through,
with PAGE_PASSTHROUGH, and then allow madvise() passthrough for these
pages as well.

Signed-off-by: Ilya Leoshkevich 
---
  include/exec/cpu-all.h |  6 ++
  linux-user/mmap.c  | 25 +
  2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index f5bda2c3ca..fbdbc0fdec 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -262,6 +262,12 @@ extern const TargetPageBits target_page;
  #define PAGE_TARGET_1  0x0200
  #define PAGE_TARGET_2  0x0400
  
+/*

+ * For linux-user, indicates that the page is mapped with the same semantics
+ * in both guest and host.
+ */
+#define PAGE_PASSTHROUGH 0x0080


0x0080 is already PAGE_ANON, do you mean 0x0800?


+
  #if defined(CONFIG_USER_ONLY)
  void page_dump(FILE *f);
  
diff --git a/linux-user/mmap.c b/linux-user/mmap.c

index 4e7a6be6ee..58622a0c15 100644
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -424,7 +424,8 @@ abi_ulong mmap_find_vma(abi_ulong start, abi_ulong size, 
abi_ulong align)
  abi_long target_mmap(abi_ulong start, abi_ulong len, int target_prot,
   int flags, int fd, abi_ulong offset)
  {
-abi_ulong ret, end, real_start, real_end, retaddr, host_offset, host_len;
+abi_ulong ret, end, real_start, real_end, retaddr, host_offset, host_len,
+  passthrough_start = -1, passthrough_end = -1;
  int page_flags, host_prot;
  
  mmap_lock();

@@ -537,6 +538,8 @@ abi_long target_mmap(abi_ulong start, abi_ulong len, int 
target_prot,
  host_start += offset - host_offset;
  }
  start = h2g(host_start);
+passthrough_start = start;
+passthrough_end = start + len;
  } else {
  if (start & ~TARGET_PAGE_MASK) {
  errno = EINVAL;
@@ -619,6 +622,8 @@ abi_long target_mmap(abi_ulong start, abi_ulong len, int 
target_prot,
   host_prot, flags, fd, offset1);
  if (p == MAP_FAILED)
  goto fail;
+passthrough_start = real_start;
+passthrough_end = real_end;
  }
  }
   the_end1:
@@ -626,7 +631,18 @@ abi_long target_mmap(abi_ulong start, abi_ulong len, int 
target_prot,
  page_flags |= PAGE_ANON;
  }
  page_flags |= PAGE_RESET;
-page_set_flags(start, start + len, page_flags);
+if (passthrough_start == passthrough_end) {
+page_set_flags(start, start + len, page_flags);
+} else {
+if (start != passthrough_start) {


it would be consistent to check "start < passthrough_start"


+page_set_flags(start, passthrough_start, page_flags);
+}
+page_set_flags(passthrough_start, passthrough_end,
+   page_flags | PAGE_PASSTHROUGH);
+if (passthrough_end != start + len) {


passthrough_end < start + len


+page_set_flags(passthrough_end, start + len, page_flags);
+}
+}
   the_end:
  trace_target_mmap_complete(start);
  if (qemu_loglevel_mask(CPU_LOG_PAGE)) {
@@ -845,7 +861,7 @@ static bool can_passthrough_madv_dontneed(abi_ulong start, 
abi_ulong end)
  }
  
  for (addr = start; addr < end; addr += TARGET_PAGE_SIZE) {

-if (!(page_get_flags(addr) & PAGE_ANON)) {
+if (!(page_get_flags(addr) & (PAGE_ANON | PAGE_PASSTHROUGH))) {
  return false;
  }
  }
@@ -888,7 +904,8 @@ abi_long target_madvise(abi_ulong start, abi_ulong len_in, 
int advice)
   *
   * This breaks MADV_DONTNEED, completely implementing which is quite
   * complicated. However, there is one low-hanging fruit: host-page-aligned
- * anonymous mappings. In this case passthrough is safe, so do it.
+ * anonymous mappings or mappings that are known to have the same semantics
+ * in the host and the guest. In this case passthrough is safe, so do it.
   */
  mmap_lock();
  if ((advice & MADV_DONTNEED) &&


Thanks,
Laurent
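
One way to catch the bit overlap Laurent points out at compile time — an
illustrative fragment, not part of the patch, using the values discussed
above:

    #include <assert.h>

    /* PAGE_ANON is 0x0080 already; PAGE_PASSTHROUGH must use a free bit. */
    #define PAGE_ANON        0x0080
    #define PAGE_PASSTHROUGH 0x0800   /* the value Laurent suggests */

    static_assert((PAGE_ANON & PAGE_PASSTHROUGH) == 0,
                  "page flags must occupy distinct bits");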



[PATCH v10 20/21] blockjob: remove unused functions

2022-07-25 Thread Emanuele Giuseppe Esposito
These public functions are not used anywhere, so they can be dropped.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
 blockjob.c   | 16 ++--
 include/block/blockjob.h | 31 ---
 2 files changed, 14 insertions(+), 33 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index f0ae466c34..375c90e4b8 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -56,12 +56,6 @@ BlockJob *block_job_next_locked(BlockJob *bjob)
 return job ? container_of(job, BlockJob, job) : NULL;
 }
 
-BlockJob *block_job_next(BlockJob *bjob)
-{
-JOB_LOCK_GUARD();
-return block_job_next_locked(bjob);
-}
-
 BlockJob *block_job_get_locked(const char *id)
 {
 Job *job = job_get_locked(id);
@@ -308,7 +302,7 @@ bool block_job_set_speed_locked(BlockJob *job, int64_t 
speed, Error **errp)
 return true;
 }
 
-bool block_job_set_speed(BlockJob *job, int64_t speed, Error **errp)
+static bool block_job_set_speed(BlockJob *job, int64_t speed, Error **errp)
 {
 JOB_LOCK_GUARD();
 return block_job_set_speed_locked(job, speed, errp);
@@ -357,12 +351,6 @@ BlockJobInfo *block_job_query_locked(BlockJob *job, Error 
**errp)
 return info;
 }
 
-BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
-{
-JOB_LOCK_GUARD();
-return block_job_query_locked(job, errp);
-}
-
 /* Called with job lock held */
 static void block_job_iostatus_set_err_locked(BlockJob *job, int error)
 {
@@ -525,7 +513,7 @@ void block_job_iostatus_reset_locked(BlockJob *job)
 job->iostatus = BLOCK_DEVICE_IO_STATUS_OK;
 }
 
-void block_job_iostatus_reset(BlockJob *job)
+static void block_job_iostatus_reset(BlockJob *job)
 {
 JOB_LOCK_GUARD();
 block_job_iostatus_reset_locked(job);
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 912e10b083..e499d2bc90 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -93,17 +93,15 @@ typedef struct BlockJob {
  */
 
 /**
- * block_job_next:
+ * block_job_next_locked:
  * @job: A block job, or %NULL.
  *
  * Get the next element from the list of block jobs after @job, or the
  * first one if @job is %NULL.
  *
  * Returns the requested job, or %NULL if there are no more jobs left.
+ * Called with job lock held.
  */
-BlockJob *block_job_next(BlockJob *job);
-
-/* Same as block_job_next(), but called with job lock held. */
 BlockJob *block_job_next_locked(BlockJob *job);
 
 /**
@@ -113,6 +111,7 @@ BlockJob *block_job_next_locked(BlockJob *job);
  * Get the block job identified by @id (which must not be %NULL).
  *
  * Returns the requested job, or %NULL if it doesn't exist.
+ * Called with job lock *not* held.
  */
 BlockJob *block_job_get(const char *id);
 
@@ -152,43 +151,37 @@ void block_job_remove_all_bdrv(BlockJob *job);
 bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs);
 
 /**
- * block_job_set_speed:
+ * block_job_set_speed_locked:
  * @job: The job to set the speed for.
  * @speed: The new value
  * @errp: Error object.
  *
  * Set a rate-limiting parameter for the job; the actual meaning may
  * vary depending on the job type.
- */
-bool block_job_set_speed(BlockJob *job, int64_t speed, Error **errp);
-
-/*
- * Same as block_job_set_speed(), but called with job lock held.
- * Might release the lock temporarily.
+ *
+ * Called with job lock held, but might release it temporarily.
  */
 bool block_job_set_speed_locked(BlockJob *job, int64_t speed, Error **errp);
 
 /**
- * block_job_query:
+ * block_job_query_locked:
  * @job: The job to get information about.
  *
  * Return information about a job.
+ *
+ * Called with job lock held.
  */
-BlockJobInfo *block_job_query(BlockJob *job, Error **errp);
-
-/* Same as block_job_query(), but called with job lock held. */
 BlockJobInfo *block_job_query_locked(BlockJob *job, Error **errp);
 
 /**
- * block_job_iostatus_reset:
+ * block_job_iostatus_reset_locked:
  * @job: The job whose I/O status should be reset.
  *
  * Reset I/O status on @job and on BlockDriverState objects it uses,
  * other than job->blk.
+ *
+ * Called with job lock held.
  */
-void block_job_iostatus_reset(BlockJob *job);
-
-/* Same as block_job_iostatus_reset(), but called with job lock held. */
 void block_job_iostatus_reset_locked(BlockJob *job);
 
 /*
-- 
2.31.1




[PATCH v10 15/21] blockjob.h: categorize fields in struct BlockJob

2022-07-25 Thread Emanuele Giuseppe Esposito
The same job lock is being used also to protect some of blockjob fields.
Categorize them just as done in job.h.

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/block/blockjob.h | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 8b65d3949d..912e10b083 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -40,10 +40,16 @@ typedef struct BlockJobDriver BlockJobDriver;
  * Long-running operation on a BlockDriverState.
  */
 typedef struct BlockJob {
-/** Data belonging to the generic Job infrastructure */
+/**
+ * Data belonging to the generic Job infrastructure.
+ * Protected by job mutex.
+ */
 Job job;
 
-/** Status that is published by the query-block-jobs QMP API */
+/**
+ * Status that is published by the query-block-jobs QMP API.
+ * Protected by job mutex.
+ */
 BlockDeviceIoStatus iostatus;
 
 /** Speed that was set with @block_job_set_speed.  */
@@ -55,6 +61,8 @@ typedef struct BlockJob {
 /** Block other operations when block job is running */
 Error *blocker;
 
+/** All notifiers are set once in block_job_create() and never modified. */
+
 /** Called when a cancelled job is finalised. */
 Notifier finalize_cancelled_notifier;
 
@@ -70,7 +78,10 @@ typedef struct BlockJob {
 /** Called when the job coroutine yields or terminates */
 Notifier idle_notifier;
 
-/** BlockDriverStates that are involved in this block job */
+/**
+ * BlockDriverStates that are involved in this block job.
+ * Always modified and read under QEMU global mutex (GLOBAL_STATE_CODE)
+ */
 GSList *nodes;
 } BlockJob;
 
-- 
2.31.1




[PATCH v10 10/21] block/mirror.c: use of job helpers in drivers to avoid TOC/TOU

2022-07-25 Thread Emanuele Giuseppe Esposito
Once job lock is used and aiocontext is removed, mirror has
to perform job operations under the same critical section,
using the helpers prepared in the previous commit.

Note: at this stage, job_{lock/unlock} and job lock guard macros
are *nop*.

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/mirror.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index d8ecb9efa2..b38676e19d 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -654,9 +654,13 @@ static int mirror_exit_common(Job *job)
 BlockDriverState *target_bs;
 BlockDriverState *mirror_top_bs;
 Error *local_err = NULL;
-bool abort = job->ret < 0;
+bool abort;
 int ret = 0;
 
+WITH_JOB_LOCK_GUARD() {
+abort = job->ret < 0;
+}
+
 if (s->prepared) {
 return 0;
 }
@@ -1152,8 +1156,10 @@ static void mirror_complete(Job *job, Error **errp)
 s->should_complete = true;
 
 /* If the job is paused, it will be re-entered when it is resumed */
-if (!job->paused) {
-job_enter(job);
+WITH_JOB_LOCK_GUARD() {
+if (!job->paused) {
+job_enter_cond_locked(job, NULL);
+}
 }
 }
 
@@ -1173,8 +1179,11 @@ static bool mirror_drained_poll(BlockJob *job)
  * from one of our own drain sections, to avoid a deadlock waiting for
  * ourselves.
  */
-if (!s->common.job.paused && !job_is_cancelled(&job->job) && !s->in_drain) {
-return true;
+WITH_JOB_LOCK_GUARD() {
+if (!s->common.job.paused && !job_is_cancelled_locked(&job->job)
+&& !s->in_drain) {
+return true;
+}
 }
 
 return !!s->in_flight;
-- 
2.31.1
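
All three hunks in this patch follow the same shape: take the guard,
read or update the job field, and leave the guard before doing anything
long-running. Reduced to its essence (job->ret as in the first hunk;
schematic, not complete code):

    bool abort;

    WITH_JOB_LOCK_GUARD() {
        abort = job->ret < 0;   /* snapshot the field under job_mutex */
    }
    /* act on the snapshot outside the critical section */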




[PATCH v10 18/21] job.c: enable job lock/unlock and remove Aiocontext locks

2022-07-25 Thread Emanuele Giuseppe Esposito
Change the job_{lock/unlock} and macros to use job_mutex.

Now that they are not nop anymore, remove the aiocontext
to avoid deadlocks.

Therefore:
- when possible, remove completely the aiocontext lock/unlock pair
- if it is used by some other function too, reduce the locking
  section as much as possible, leaving the job API outside.
- change AIO_WAIT_WHILE to AIO_WAIT_WHILE_UNLOCKED, since we
  are not using the aiocontext lock anymore

There is only one JobDriver callback, ->free(), that assumes that
the aiocontext lock is held (because it calls bdrv_unref), so for
now keep that one under the aiocontext lock.

Also remove real_job_{lock/unlock}, as they are replaced by the
public functions.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
 blockdev.c   | 74 +---
 include/qemu/job.h   | 22 -
 job-qmp.c| 46 +++--
 job.c| 84 ++--
 tests/unit/test-bdrv-drain.c |  4 +-
 tests/unit/test-block-iothread.c |  2 +-
 tests/unit/test-blockjob.c   | 15 ++
 7 files changed, 52 insertions(+), 195 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 5b79093155..2cd84d206c 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -155,12 +155,7 @@ void blockdev_mark_auto_del(BlockBackend *blk)
 for (job = block_job_next_locked(NULL); job;
  job = block_job_next_locked(job)) {
 if (block_job_has_bdrv(job, blk_bs(blk))) {
-AioContext *aio_context = job->job.aio_context;
-aio_context_acquire(aio_context);
-
 job_cancel_locked(&job->job, false);
-
-aio_context_release(aio_context);
 }
 }
 
@@ -1836,14 +1831,7 @@ static void drive_backup_abort(BlkActionState *common)
 DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
 
 if (state->job) {
-AioContext *aio_context;
-
-aio_context = bdrv_get_aio_context(state->bs);
-aio_context_acquire(aio_context);
-
 job_cancel_sync(&state->job->job, true);
-
-aio_context_release(aio_context);
 }
 }
 
@@ -1937,14 +1925,7 @@ static void blockdev_backup_abort(BlkActionState *common)
 BlockdevBackupState *state = DO_UPCAST(BlockdevBackupState, common, common);
 
 if (state->job) {
-AioContext *aio_context;
-
-aio_context = bdrv_get_aio_context(state->bs);
-aio_context_acquire(aio_context);
-
 job_cancel_sync(&state->job->job, true);
-
-aio_context_release(aio_context);
 }
 }
 
@@ -3306,19 +3287,14 @@ out:
 }
 
 /*
- * Get a block job using its ID and acquire its AioContext.
- * Called with job_mutex held.
+ * Get a block job using its ID. Called with job_mutex held.
  */
-static BlockJob *find_block_job_locked(const char *id,
-   AioContext **aio_context,
-   Error **errp)
+static BlockJob *find_block_job_locked(const char *id, Error **errp)
 {
 BlockJob *job;
 
 assert(id != NULL);
 
-*aio_context = NULL;
-
 job = block_job_get_locked(id);
 
 if (!job) {
@@ -3327,36 +3303,30 @@ static BlockJob *find_block_job_locked(const char *id,
 return NULL;
 }
 
-*aio_context = block_job_get_aio_context(job);
-aio_context_acquire(*aio_context);
-
 return job;
 }
 
 void qmp_block_job_set_speed(const char *device, int64_t speed, Error **errp)
 {
-AioContext *aio_context;
 BlockJob *job;
 
 JOB_LOCK_GUARD();
-job = find_block_job_locked(device, &aio_context, errp);
+job = find_block_job_locked(device, errp);
 
 if (!job) {
 return;
 }
 
 block_job_set_speed_locked(job, speed, errp);
-aio_context_release(aio_context);
 }
 
 void qmp_block_job_cancel(const char *device,
   bool has_force, bool force, Error **errp)
 {
-AioContext *aio_context;
 BlockJob *job;
 
 JOB_LOCK_GUARD();
-job = find_block_job_locked(device, &aio_context, errp);
+job = find_block_job_locked(device, errp);
 
 if (!job) {
 return;
@@ -3369,22 +3339,19 @@ void qmp_block_job_cancel(const char *device,
 if (job_user_paused_locked(&job->job) && !force) {
 error_setg(errp, "The block job for device '%s' is currently paused",
device);
-goto out;
+return;
 }
 
 trace_qmp_block_job_cancel(job);
 job_user_cancel_locked(&job->job, force, errp);
-out:
-aio_context_release(aio_context);
 }
 
 void qmp_block_job_pause(const char *device, Error **errp)
 {
-AioContext *aio_context;
 BlockJob *job;
 
 JOB_LOCK_GUARD();
-job = find_block_job_locked(device, &aio_context, errp);
+job = find_block_job_locked(device, errp);
 
 if (!job) {
 return;
@@ -3392,16 +3359,14 @@ void qmp_block_job_pause(const char *device, Error **errp)
 
 trace_qmp_block_job_pause(job);
 job_user_pause_locked(&job->job, errp);
-

[PATCH v10 13/21] job: detect change of aiocontext within job coroutine

2022-07-25 Thread Emanuele Giuseppe Esposito
From: Paolo Bonzini 

We want to make sure access of job->aio_context is always done
under either BQL or job_mutex. The problem is that using
aio_co_enter(job->aiocontext, job->co) in job_start and job_enter_cond
makes the coroutine immediately resume, so we can't hold the job lock.
And caching it is not safe either, as it might change.

job_start is under BQL, so it can freely read job->aiocontext, but
job_enter_cond is not. In order to fix this, use aio_co_wake():
the advantage is that it won't use job->aiocontext, but the
main disadvantage is that it won't be able to detect a change of
job AioContext.

Calling bdrv_try_set_aio_context() will issue the following calls
(simplified):
* in terms of  bdrv callbacks:
  .drained_begin -> .set_aio_context -> .drained_end
* in terms of child_job functions:
  child_job_drained_begin -> child_job_set_aio_context -> child_job_drained_end
* in terms of job functions:
  job_pause_locked -> job_set_aio_context -> job_resume_locked

We can see that after setting the new aio_context, job_resume_locked
calls job_enter_cond again, which then invokes aio_co_wake(). But
while job->aiocontext has been updated in job_set_aio_context,
job->co->ctx has not changed, so the coroutine would be entered in
the wrong aiocontext.

Using aio_co_schedule in job_resume_locked() might seem as a valid
alternative, but the problem is that the bh resuming the coroutine
is not scheduled immediately, and if in the meanwhile another
bdrv_try_set_aio_context() is run (see test_propagate_mirror() in
test-block-iothread.c), we would have the first schedule in the
wrong aiocontext, and the second set of drains won't even manage
to schedule the coroutine, as job->busy would still be true from
the previous job_resume_locked().

The solution is to stick with aio_co_wake(), but then detect every time
the coroutine resumes back from yielding if job->aio_context
has changed. If so, we can reschedule it to the new context.

Check for the aiocontext change in job_do_yield_locked because:
1) aio_co_reschedule_self requires to be in the running coroutine
2) since child_job_set_aio_context allows changing the aiocontext only
   while the job is paused, this is the exact place where the coroutine
   resumes, before running JobDriver's code.

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 job.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/job.c b/job.c
index b0729e2bb2..ecec66b44e 100644
--- a/job.c
+++ b/job.c
@@ -585,7 +585,9 @@ void job_enter_cond_locked(Job *job, bool(*fn)(Job *job))
 timer_del(&job->sleep_timer);
 job->busy = true;
 real_job_unlock();
-aio_co_enter(job->aio_context, job->co);
+job_unlock();
+aio_co_wake(job->co);
+job_lock();
 }
 
 void job_enter_cond(Job *job, bool(*fn)(Job *job))
@@ -611,6 +613,8 @@ void job_enter(Job *job)
  */
 static void coroutine_fn job_do_yield_locked(Job *job, uint64_t ns)
 {
+AioContext *next_aio_context;
+
 real_job_lock();
 if (ns != -1) {
 timer_mod(>sleep_timer, ns);
@@ -622,7 +626,20 @@ static void coroutine_fn job_do_yield_locked(Job *job, uint64_t ns)
 qemu_coroutine_yield();
 job_lock();
 
-/* Set by job_enter_cond() before re-entering the coroutine.  */
+next_aio_context = job->aio_context;
+/*
+ * Coroutine has resumed, but in the meanwhile the job AioContext
+ * might have changed via bdrv_try_set_aio_context(), so we need to move
+ * the coroutine too in the new aiocontext.
+ */
+while (qemu_get_current_aio_context() != next_aio_context) {
+job_unlock();
+aio_co_reschedule_self(next_aio_context);
+job_lock();
+next_aio_context = job->aio_context;
+}
+
+/* Set by job_enter_cond_locked() before re-entering the coroutine.  */
 assert(job->busy);
 }
 
-- 
2.31.1
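
The pattern at the heart of this fix - re-read the routing decision
under the lock after every wakeup, and keep migrating until it stops
changing - is independent of QEMU's coroutine machinery. A
self-contained pthread analogue, where "home" stands in for
job->aio_context and the printf for the actual rescheduling:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int home = 2;            /* the "context" we should run in */

    static void settle(void)
    {
        int current = 0;            /* where we are running right now */

        pthread_mutex_lock(&lock);
        while (current != home) {   /* re-check after every migration */
            int target = home;
            pthread_mutex_unlock(&lock);
            printf("migrating %d -> %d\n", current, target);
            current = target;
            pthread_mutex_lock(&lock);
        }
        pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
        settle();
        return 0;
    }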




[PATCH v10 14/21] jobs: protect job.aio_context with BQL and job_mutex

2022-07-25 Thread Emanuele Giuseppe Esposito
In order to make it thread safe, implement a "fake rwlock",
where we allow reads under BQL *or* job_mutex held, but
writes only under BQL *and* job_mutex.

The only write we have is in child_job_set_aio_ctx, which always
happens under drain (so the job is paused).
For this reason, introduce job_set_aio_context and make sure that
the context is set under BQL, job_mutex and drain.
Also make sure all other places where the aiocontext is read
are protected.

Note: at this stage, job_{lock/unlock} and job lock guard macros
are *nop*.

Suggested-by: Paolo Bonzini 
Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
 block/replication.c |  6 --
 blockjob.c  |  3 ++-
 include/qemu/job.h  | 19 ++-
 job.c   | 12 
 4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 55c8f894aa..2189863df1 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -148,8 +148,10 @@ static void replication_close(BlockDriverState *bs)
 }
 if (s->stage == BLOCK_REPLICATION_FAILOVER) {
 commit_job = &s->commit_job->job;
-assert(commit_job->aio_context == qemu_get_current_aio_context());
-job_cancel_sync(commit_job, false);
+WITH_JOB_LOCK_GUARD() {
+assert(commit_job->aio_context == qemu_get_current_aio_context());
+job_cancel_sync_locked(commit_job, false);
+}
 }
 
 if (s->mode == REPLICATION_MODE_SECONDARY) {
diff --git a/blockjob.c b/blockjob.c
index 96fb9d9f73..9ff2727025 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -162,12 +162,13 @@ static void child_job_set_aio_ctx(BdrvChild *c, AioContext *ctx,
 bdrv_set_aio_context_ignore(sibling->bs, ctx, ignore);
 }
 
-job->job.aio_context = ctx;
+job_set_aio_context(&job->job, ctx);
 }
 
 static AioContext *child_job_get_parent_aio_context(BdrvChild *c)
 {
 BlockJob *job = c->opaque;
+assert(qemu_in_main_thread());
 
 return job->job.aio_context;
 }
diff --git a/include/qemu/job.h b/include/qemu/job.h
index 5709e8d4a8..c144aabefc 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -77,7 +77,12 @@ typedef struct Job {
 
 /** Protected by AioContext lock */
 
-/** AioContext to run the job coroutine in */
+/**
+ * AioContext to run the job coroutine in.
+ * This field can be read when holding either the BQL (so we are in
+ * the main loop) or the job_mutex.
+ * It can be only written when we hold *both* BQL and job_mutex.
+ */
 AioContext *aio_context;
 
 /** Reference count of the block job */
@@ -741,4 +746,16 @@ int job_finish_sync(Job *job, void (*finish)(Job *, Error **errp),
 int job_finish_sync_locked(Job *job, void (*finish)(Job *, Error **errp),
Error **errp);
 
+/**
+ * Sets the @job->aio_context.
+ * Called with job_mutex *not* held.
+ *
+ * This function must run in the main thread to protect against
+ * concurrent read in job_finish_sync_locked(),
+ * takes the job_mutex lock to protect against the read in
+ * job_do_yield_locked(), and must be called when the coroutine
+ * is quiescent.
+ */
+void job_set_aio_context(Job *job, AioContext *ctx);
+
 #endif
diff --git a/job.c b/job.c
index ecec66b44e..0a857b1468 100644
--- a/job.c
+++ b/job.c
@@ -394,6 +394,17 @@ Job *job_get(const char *id)
 return job_get_locked(id);
 }
 
+void job_set_aio_context(Job *job, AioContext *ctx)
+{
+/* protect against read in job_finish_sync_locked and job_start */
+assert(qemu_in_main_thread());
+/* protect against read in job_do_yield_locked */
+JOB_LOCK_GUARD();
+/* ensure the coroutine is quiescent while the AioContext is changed */
+assert(job->pause_count > 0);
+job->aio_context = ctx;
+}
+
 /* Called with job_mutex *not* held. */
 static void job_sleep_timer_cb(void *opaque)
 {
@@ -1376,6 +1387,7 @@ int job_finish_sync_locked(Job *job,
 {
 Error *local_err = NULL;
 int ret;
+assert(qemu_in_main_thread());
 
 job_ref_locked(job);
 
-- 
2.31.1
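
The read/write rule above fits in two accessors. A compilable sketch
with stubbed types; in_main_thread() and job_mutex_held() are
hypothetical stand-ins for "BQL held" and "job_mutex held" and do not
exist in QEMU:

    #include <assert.h>
    #include <stdbool.h>

    typedef struct AioContext AioContext;
    typedef struct Job { AioContext *aio_context; } Job;

    static bool in_main_thread(void) { return true; }  /* stub */
    static bool job_mutex_held(void) { return true; }  /* stub */

    /* read side: either the BQL or job_mutex is enough */
    static AioContext *job_get_ctx(Job *job)
    {
        assert(in_main_thread() || job_mutex_held());
        return job->aio_context;
    }

    /* write side: both locks must be held */
    static void job_set_ctx(Job *job, AioContext *ctx)
    {
        assert(in_main_thread() && job_mutex_held());
        job->aio_context = ctx;
    }

    int main(void)
    {
        Job job = { 0 };
        job_set_ctx(&job, 0);
        return job_get_ctx(&job) != 0;
    }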




[PATCH v10 06/21] job: move and update comments from blockjob.c

2022-07-25 Thread Emanuele Giuseppe Esposito
This comment applies more to job.c than to blockjob.c; it was left in
blockjob.c because in the past the whole job logic was implemented there.

Note: at this stage, job_{lock/unlock} and job lock guard macros
are *nop*.

No functional change intended.

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockjob.c | 20 
 job.c  | 14 ++
 2 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index 4868453d74..7da59a1f1c 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -36,21 +36,6 @@
 #include "qemu/main-loop.h"
 #include "qemu/timer.h"
 
-/*
- * The block job API is composed of two categories of functions.
- *
- * The first includes functions used by the monitor.  The monitor is
- * peculiar in that it accesses the block job list with block_job_get, and
- * therefore needs consistency across block_job_get and the actual operation
- * (e.g. block_job_set_speed).  The consistency is achieved with
- * aio_context_acquire/release.  These functions are declared in blockjob.h.
- *
- * The second includes functions used by the block job drivers and sometimes
- * by the core block layer.  These do not care about locking, because the
- * whole coroutine runs under the AioContext lock, and are declared in
- * blockjob_int.h.
- */
-
 static bool is_block_job(Job *job)
 {
 return job_type(job) == JOB_TYPE_BACKUP ||
@@ -433,11 +418,6 @@ static void block_job_event_ready(Notifier *n, void *opaque)
 }
 
 
-/*
- * API for block job drivers and the block layer.  These functions are
- * declared in blockjob_int.h.
- */
-
 void *block_job_create(const char *job_id, const BlockJobDriver *driver,
JobTxn *txn, BlockDriverState *bs, uint64_t perm,
uint64_t shared_perm, int64_t speed, int flags,
diff --git a/job.c b/job.c
index ae25db97ac..ebaa4e585b 100644
--- a/job.c
+++ b/job.c
@@ -32,6 +32,20 @@
 #include "trace/trace-root.h"
 #include "qapi/qapi-events-job.h"
 
+/*
+ * The job API is composed of two categories of functions.
+ *
+ * The first includes functions used by the monitor.  The monitor is
+ * peculiar in that it accesses the block job list with job_get, and
+ * therefore needs consistency across job_get and the actual operation
+ * (e.g. job_user_cancel). To achieve this consistency, the caller
+ * calls job_lock/job_unlock itself around the whole operation.
+ *
+ *
+ * The second includes functions used by the block job drivers and sometimes
+ * by the core block layer. These delegate the locking to the callee instead.
+ */
+
 /*
  * job_mutex protects the jobs list, but also makes the
  * struct job fields thread-safe.
-- 
2.31.1




[PATCH v10 21/21] job: remove unused functions

2022-07-25 Thread Emanuele Giuseppe Esposito
These public functions are not used anywhere, thus can be dropped.
Also, since this is the final job API that doesn't use AioContext
lock and replaces it with job_lock, adjust all remaining function
documentation to clearly specify if the job lock is taken or not.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
 include/qemu/job.h |  97 ++-
 job.c  | 101 ++---
 2 files changed, 35 insertions(+), 163 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 676f69bb2e..e1c0ed5940 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -361,6 +361,8 @@ JobTxn *job_txn_new(void);
 /**
  * Release a reference that was previously acquired with job_txn_add_job or
  * job_txn_new. If it's the last reference to the object, it will be freed.
+ *
+ * Called with job lock *not* held.
  */
 void job_txn_unref(JobTxn *txn);
 
@@ -390,19 +392,17 @@ void *job_create(const char *job_id, const JobDriver *driver, JobTxn *txn,
 /**
  * Add a reference to Job refcnt, it will be decreased with job_unref, and then
  * be freed if it comes to be the last reference.
+ *
+ * Called with job lock held.
  */
-void job_ref(Job *job);
-
-/* Same as job_ref(), but called with job lock held. */
 void job_ref_locked(Job *job);
 
 /**
  * Release a reference that was previously acquired with job_ref() or
  * job_create(). If it's the last reference to the object, it will be freed.
+ *
+ * Called with job lock held.
  */
-void job_unref(Job *job);
-
-/* Same as job_unref(), but called with job lock held. */
 void job_unref_locked(Job *job);
 
 /**
@@ -448,12 +448,8 @@ void job_progress_increase_remaining(Job *job, uint64_t delta);
  * Conditionally enter the job coroutine if the job is ready to run, not
  * already busy and fn() returns true. fn() is called while under the job_lock
  * critical section.
- */
-void job_enter_cond(Job *job, bool(*fn)(Job *job));
-
-/*
- * Same as job_enter_cond(), but called with job lock held.
- * Might release the lock temporarily.
+ *
+ * Called with job lock held, but might release it temporarily.
  */
 void job_enter_cond_locked(Job *job, bool(*fn)(Job *job));
 
@@ -532,11 +528,8 @@ bool job_cancel_requested(Job *job);
 
 /**
  * Returns whether the job is in a completed state.
- * Called with job_mutex *not* held.
+ * Called with job lock held.
  */
-bool job_is_completed(Job *job);
-
-/* Same as job_is_completed(), but called with job lock held. */
 bool job_is_completed_locked(Job *job);
 
 /**
@@ -552,13 +545,15 @@ bool job_is_ready_locked(Job *job);
  * Request @job to pause at the next pause point. Must be paired with
  * job_resume(). If the job is supposed to be resumed by user action, call
  * job_user_pause() instead.
+ *
+ * Called with job lock *not* held.
  */
 void job_pause(Job *job);
 
 /* Same as job_pause(), but called with job lock held. */
 void job_pause_locked(Job *job);
 
-/** Resumes a @job paused with job_pause. */
+/** Resumes a @job paused with job_pause. Called with job lock *not* held. */
 void job_resume(Job *job);
 
 /*
@@ -570,27 +565,20 @@ void job_resume_locked(Job *job);
 /**
  * Asynchronously pause the specified @job.
  * Do not allow a resume until a matching call to job_user_resume.
+ * Called with job lock held.
  */
-void job_user_pause(Job *job, Error **errp);
-
-/* Same as job_user_pause(), but called with job lock held. */
 void job_user_pause_locked(Job *job, Error **errp);
 
-/** Returns true if the job is user-paused. */
-bool job_user_paused(Job *job);
-
-/* Same as job_user_paused(), but called with job lock held. */
+/**
+ * Returns true if the job is user-paused.
+ * Called with job lock held.
+ */
 bool job_user_paused_locked(Job *job);
 
 /**
  * Resume the specified @job.
  * Must be paired with a preceding job_user_pause.
- */
-void job_user_resume(Job *job, Error **errp);
-
-/*
- * Same as job_user_resume(), but called with job lock held.
- * Might release the lock temporarily.
+ * Called with job lock held, but might release it temporarily.
  */
 void job_user_resume_locked(Job *job, Error **errp);
 
@@ -599,6 +587,7 @@ void job_user_resume_locked(Job *job, Error **errp);
  * first one if @job is %NULL.
  *
  * Returns the requested job, or %NULL if there are no more jobs left.
+ * Called with job lock *not* held.
  */
 Job *job_next(Job *job);
 
@@ -609,20 +598,17 @@ Job *job_next_locked(Job *job);
  * Get the job identified by @id (which must not be %NULL).
  *
  * Returns the requested job, or %NULL if it doesn't exist.
+ * Called with job lock held.
  */
-Job *job_get(const char *id);
-
-/* Same as job_get(), but called with job lock held. */
 Job *job_get_locked(const char *id);
 
 /**
  * Check whether the verb @verb can be applied to @job in its current state.
  * Returns 0 if the verb can be applied; otherwise errp is set and -EPERM
  * returned.
+ *
+ * Called with job lock held.
  */
-int 

[PATCH v10 16/21] blockjob: rename notifier callbacks as _locked

2022-07-25 Thread Emanuele Giuseppe Esposito
They are all called with job_lock held, from job_event_*_locked().

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockjob.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index 9ff2727025..0663faee2c 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -250,7 +250,8 @@ int block_job_add_bdrv(BlockJob *job, const char *name, BlockDriverState *bs,
 return 0;
 }
 
-static void block_job_on_idle(Notifier *n, void *opaque)
+/* Called with job_mutex lock held. */
+static void block_job_on_idle_locked(Notifier *n, void *opaque)
 {
 aio_wait_kick();
 }
@@ -370,7 +371,8 @@ static void block_job_iostatus_set_err(BlockJob *job, int error)
 }
 }
 
-static void block_job_event_cancelled(Notifier *n, void *opaque)
+/* Called with job_mutex lock held. */
+static void block_job_event_cancelled_locked(Notifier *n, void *opaque)
 {
 BlockJob *job = opaque;
 uint64_t progress_current, progress_total;
@@ -389,7 +391,8 @@ static void block_job_event_cancelled(Notifier *n, void *opaque)
 job->speed);
 }
 
-static void block_job_event_completed(Notifier *n, void *opaque)
+/* Called with job_mutex lock held. */
+static void block_job_event_completed_locked(Notifier *n, void *opaque)
 {
 BlockJob *job = opaque;
 const char *msg = NULL;
@@ -415,7 +418,8 @@ static void block_job_event_completed(Notifier *n, void *opaque)
 msg);
 }
 
-static void block_job_event_pending(Notifier *n, void *opaque)
+/* Called with job_mutex lock held. */
+static void block_job_event_pending_locked(Notifier *n, void *opaque)
 {
 BlockJob *job = opaque;
 
@@ -427,7 +431,8 @@ static void block_job_event_pending(Notifier *n, void *opaque)
   job->job.id);
 }
 
-static void block_job_event_ready(Notifier *n, void *opaque)
+/* Called with job_mutex lock held. */
+static void block_job_event_ready_locked(Notifier *n, void *opaque)
 {
 BlockJob *job = opaque;
 uint64_t progress_current, progress_total;
@@ -472,11 +477,11 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
 
 ratelimit_init(>limit);
 
-job->finalize_cancelled_notifier.notify = block_job_event_cancelled;
-job->finalize_completed_notifier.notify = block_job_event_completed;
-job->pending_notifier.notify = block_job_event_pending;
-job->ready_notifier.notify = block_job_event_ready;
-job->idle_notifier.notify = block_job_on_idle;
+job->finalize_cancelled_notifier.notify = block_job_event_cancelled_locked;
+job->finalize_completed_notifier.notify = block_job_event_completed_locked;
+job->pending_notifier.notify = block_job_event_pending_locked;
+job->ready_notifier.notify = block_job_event_ready_locked;
+job->idle_notifier.notify = block_job_on_idle_locked;
 
 WITH_JOB_LOCK_GUARD() {
 notifier_list_add(&job->job.on_finalize_cancelled,
-- 
2.31.1




[PATCH v10 09/21] jobs: use job locks also in the unit tests

2022-07-25 Thread Emanuele Giuseppe Esposito
Add missing job synchronization in the unit tests, with
explicit locks.

We are deliberately using _locked functions wrapped by a guard
instead of a normal call because the normal call will be removed
in future, as the only usage is limited to the tests.

In other words, if a function like job_pause() is (or will be) used
only in tests, as a shorthand for:

WITH_JOB_LOCK_GUARD() {
job_pause_locked();
}

then it is not worth keeping job_pause(); the tests can just use the
guard directly.

Note: at this stage, job_{lock/unlock} and job lock guard macros
are *nop*.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
---
 tests/unit/test-bdrv-drain.c |  76 +---
 tests/unit/test-block-iothread.c |   8 ++-
 tests/unit/test-blockjob-txn.c   |  24 ---
 tests/unit/test-blockjob.c   | 116 +++
 4 files changed, 141 insertions(+), 83 deletions(-)

diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 36be84ae55..0db056ea63 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -943,61 +943,83 @@ static void test_blockjob_common_drain_node(enum drain_type drain_type,
 }
 }
 
-g_assert_cmpint(job->job.pause_count, ==, 0);
-g_assert_false(job->job.paused);
-g_assert_true(tjob->running);
-g_assert_true(job->job.busy); /* We're in qemu_co_sleep_ns() */
+WITH_JOB_LOCK_GUARD() {
+g_assert_cmpint(job->job.pause_count, ==, 0);
+g_assert_false(job->job.paused);
+g_assert_true(tjob->running);
+g_assert_true(job->job.busy); /* We're in qemu_co_sleep_ns() */
+}
 
 do_drain_begin_unlocked(drain_type, drain_bs);
 
-if (drain_type == BDRV_DRAIN_ALL) {
-/* bdrv_drain_all() drains both src and target */
-g_assert_cmpint(job->job.pause_count, ==, 2);
-} else {
-g_assert_cmpint(job->job.pause_count, ==, 1);
+WITH_JOB_LOCK_GUARD() {
+if (drain_type == BDRV_DRAIN_ALL) {
+/* bdrv_drain_all() drains both src and target */
+g_assert_cmpint(job->job.pause_count, ==, 2);
+} else {
+g_assert_cmpint(job->job.pause_count, ==, 1);
+}
+g_assert_true(job->job.paused);
+g_assert_false(job->job.busy); /* The job is paused */
 }
-g_assert_true(job->job.paused);
-g_assert_false(job->job.busy); /* The job is paused */
 
 do_drain_end_unlocked(drain_type, drain_bs);
 
 if (use_iothread) {
-/* paused is reset in the I/O thread, wait for it */
+/*
+ * Here we are waiting for the paused status to change,
+ * so don't bother protecting the read every time.
+ *
+ * paused is reset in the I/O thread, wait for it
+ */
 while (job->job.paused) {
 aio_poll(qemu_get_aio_context(), false);
 }
 }
 
-g_assert_cmpint(job->job.pause_count, ==, 0);
-g_assert_false(job->job.paused);
-g_assert_true(job->job.busy); /* We're in qemu_co_sleep_ns() */
+WITH_JOB_LOCK_GUARD() {
+g_assert_cmpint(job->job.pause_count, ==, 0);
+g_assert_false(job->job.paused);
+g_assert_true(job->job.busy); /* We're in qemu_co_sleep_ns() */
+}
 
 do_drain_begin_unlocked(drain_type, target);
 
-if (drain_type == BDRV_DRAIN_ALL) {
-/* bdrv_drain_all() drains both src and target */
-g_assert_cmpint(job->job.pause_count, ==, 2);
-} else {
-g_assert_cmpint(job->job.pause_count, ==, 1);
+WITH_JOB_LOCK_GUARD() {
+if (drain_type == BDRV_DRAIN_ALL) {
+/* bdrv_drain_all() drains both src and target */
+g_assert_cmpint(job->job.pause_count, ==, 2);
+} else {
+g_assert_cmpint(job->job.pause_count, ==, 1);
+}
+g_assert_true(job->job.paused);
+g_assert_false(job->job.busy); /* The job is paused */
 }
-g_assert_true(job->job.paused);
-g_assert_false(job->job.busy); /* The job is paused */
 
 do_drain_end_unlocked(drain_type, target);
 
 if (use_iothread) {
-/* paused is reset in the I/O thread, wait for it */
+/*
+ * Here we are waiting for the paused status to change,
+ * so don't bother protecting the read every time.
+ *
+ * paused is reset in the I/O thread, wait for it
+ */
 while (job->job.paused) {
 aio_poll(qemu_get_aio_context(), false);
 }
 }
 
-g_assert_cmpint(job->job.pause_count, ==, 0);
-g_assert_false(job->job.paused);
-g_assert_true(job->job.busy); /* We're in qemu_co_sleep_ns() */
+WITH_JOB_LOCK_GUARD() {
+g_assert_cmpint(job->job.pause_count, ==, 0);
+g_assert_false(job->job.paused);
+g_assert_true(job->job.busy); /* We're in qemu_co_sleep_ns() */
+}
 
 aio_context_acquire(ctx);
-ret = job_complete_sync(&job->job, &error_abort);
+WITH_JOB_LOCK_GUARD() {
+ret = job_complete_sync_locked(&job->job, &error_abort);

[PATCH v10 05/21] job.c: add job_lock/unlock while keeping job.h intact

2022-07-25 Thread Emanuele Giuseppe Esposito
With "intact" we mean that all job.h functions implicitly
take the lock. Therefore API callers are unmodified.

This means that:
- many static functions that will be always called with job lock held
  become _locked, and call _locked functions
- all public functions take the lock internally if needed, and call _locked
  functions
- all public functions called internally by other functions in job.c will have a
  _locked counterpart (sometimes public), to avoid deadlocks (job lock already taken).
  These functions are not used for now.
- some public functions called only from external files (not job.c) do not
  have a _locked() counterpart and take the lock inside. Others won't need
  the lock at all because they use fields only set at initialization and
  never modified.

job_{lock/unlock} is independent from real_job_{lock/unlock}.

Note: at this stage, job_{lock/unlock} and job lock guard macros
are *nop*

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/qemu/job.h | 138 ++-
 job.c  | 600 +++--
 2 files changed, 553 insertions(+), 185 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 4b64eb15f7..5709e8d4a8 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -358,8 +358,15 @@ JobTxn *job_txn_new(void);
  */
 void job_txn_unref(JobTxn *txn);
 
+/*
+ * Same as job_txn_unref(), but called with job lock held.
+ * Might release the lock temporarily.
+ */
+void job_txn_unref_locked(JobTxn *txn);
+
 /**
  * Create a new long-running job and return it.
+ * Called with job_mutex *not* held.
  *
  * @job_id: The id of the newly-created job, or %NULL for internal jobs
  * @driver: The class object for the newly-created job.
@@ -380,17 +387,25 @@ void *job_create(const char *job_id, const JobDriver *driver, JobTxn *txn,
  */
 void job_ref(Job *job);
 
+/* Same as job_ref(), but called with job lock held. */
+void job_ref_locked(Job *job);
+
 /**
  * Release a reference that was previously acquired with job_ref() or
  * job_create(). If it's the last reference to the object, it will be freed.
  */
 void job_unref(Job *job);
 
+/* Same as job_unref(), but called with job lock held. */
+void job_unref_locked(Job *job);
+
 /**
  * @job: The job that has made progress
  * @done: How much progress the job made since the last call
  *
  * Updates the progress counter of the job.
+ *
+ * May be called with mutex held or not held.
  */
 void job_progress_update(Job *job, uint64_t done);
 
@@ -401,6 +416,8 @@ void job_progress_update(Job *job, uint64_t done);
  *
  * Sets the expected end value of the progress counter of a job so that a
  * completion percentage can be calculated when the progress is updated.
+ *
+ * May be called with mutex held or not held.
  */
 void job_progress_set_remaining(Job *job, uint64_t remaining);
 
@@ -416,6 +433,8 @@ void job_progress_set_remaining(Job *job, uint64_t remaining);
  * length before, and job_progress_update() afterwards.
  * (So the operation acts as a parenthesis in regards to the main job
  * operation running in background.)
+ *
+ * May be called with mutex held or not held.
  */
 void job_progress_increase_remaining(Job *job, uint64_t delta);
 
@@ -426,11 +445,19 @@ void job_progress_increase_remaining(Job *job, uint64_t delta);
  */
 void job_enter_cond(Job *job, bool(*fn)(Job *job));
 
+/*
+ * Same as job_enter_cond(), but called with job lock held.
+ * Might release the lock temporarily.
+ */
+void job_enter_cond_locked(Job *job, bool(*fn)(Job *job));
+
 /**
  * @job: A job that has not yet been started.
  *
  * Begins execution of a job.
  * Takes ownership of one reference to the job object.
+ *
+ * Called with job_mutex *not* held.
  */
 void job_start(Job *job);
 
@@ -438,6 +465,7 @@ void job_start(Job *job);
  * @job: The job to enter.
  *
  * Continue the specified job by entering the coroutine.
+ * Called with job_mutex *not* held.
  */
 void job_enter(Job *job);
 
@@ -446,6 +474,8 @@ void job_enter(Job *job);
  *
  * Pause now if job_pause() has been called. Jobs that perform lots of I/O
  * must call this between requests so that the job can be paused.
+ *
+ * Called with job_mutex *not* held.
  */
 void coroutine_fn job_pause_point(Job *job);
 
@@ -453,6 +483,7 @@ void coroutine_fn job_pause_point(Job *job);
  * @job: The job that calls the function.
  *
  * Yield the job coroutine.
+ * Called with job_mutex *not* held.
  */
 void job_yield(Job *job);
 
@@ -463,6 +494,8 @@ void job_yield(Job *job);
  * Put the job to sleep (assuming that it wasn't canceled) for @ns
  * %QEMU_CLOCK_REALTIME nanoseconds.  Canceling the job will immediately
  * interrupt the wait.
+ *
+ * Called with job_mutex *not* held.
  */
 void coroutine_fn job_sleep_ns(Job *job, int64_t ns);
 
@@ -475,21 +508,40 @@ const char *job_type_str(const Job *job);
 /** Returns true if the job should not be visible to the management layer. */
 bool job_is_internal(Job 
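
The wrapper convention described above is easy to state in miniature.
A self-contained sketch with pthread stand-ins for qemu_mutex and a
hypothetical job_count() API that is not part of the series:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t job_mutex = PTHREAD_MUTEX_INITIALIZER;
    static int n_jobs;

    /* _locked variant: the caller already holds job_mutex */
    static int job_count_locked(void)
    {
        return n_jobs;
    }

    /* public variant: takes the lock itself and delegates */
    static int job_count(void)
    {
        pthread_mutex_lock(&job_mutex);
        int n = job_count_locked();
        pthread_mutex_unlock(&job_mutex);
        return n;
    }

    int main(void)
    {
        printf("%d jobs\n", job_count());
        return 0;
    }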

[PATCH v10 19/21] block_job_query: remove atomic read

2022-07-25 Thread Emanuele Giuseppe Esposito
Not sure what the atomic here was supposed to do, since job.busy
is protected by the job lock. Since the whole function
is called under job_mutex, just remove the atomic.

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockjob.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/blockjob.c b/blockjob.c
index 448bdb5a53..f0ae466c34 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -338,7 +338,7 @@ BlockJobInfo *block_job_query_locked(BlockJob *job, Error **errp)
 info = g_new0(BlockJobInfo, 1);
 info->type  = g_strdup(job_type_str(>job));
 info->device= g_strdup(job->job.id);
-info->busy  = qatomic_read(&job->job.busy);
+info->busy  = job->job.busy;
 info->paused= job->job.pause_count > 0;
 info->offset= progress_current;
 info->len   = progress_total;
-- 
2.31.1




[PATCH v10 08/21] jobs: add job lock in find_* functions

2022-07-25 Thread Emanuele Giuseppe Esposito
Both blockdev.c and job-qmp.c have TOC/TOU conditions, because
they first search for the job and then perform an action on it.
Therefore, we need to do the search + action under the same
job mutex critical section.

Note: at this stage, job_{lock/unlock} and job lock guard macros
are *nop*.

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockdev.c | 67 +-
 job-qmp.c  | 57 --
 2 files changed, 86 insertions(+), 38 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 9230888e34..71f793c4ab 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3302,9 +3302,13 @@ out:
 aio_context_release(aio_context);
 }
 
-/* Get a block job using its ID and acquire its AioContext */
-static BlockJob *find_block_job(const char *id, AioContext **aio_context,
-Error **errp)
+/*
+ * Get a block job using its ID and acquire its AioContext.
+ * Called with job_mutex held.
+ */
+static BlockJob *find_block_job_locked(const char *id,
+   AioContext **aio_context,
+   Error **errp)
 {
 BlockJob *job;
 
@@ -3312,7 +3316,7 @@ static BlockJob *find_block_job(const char *id, AioContext **aio_context,
 
 *aio_context = NULL;
 
-job = block_job_get(id);
+job = block_job_get_locked(id);
 
 if (!job) {
 error_set(errp, ERROR_CLASS_DEVICE_NOT_ACTIVE,
@@ -3329,13 +3337,16 @@ static BlockJob *find_block_job(const char *id, AioContext **aio_context,
 void qmp_block_job_set_speed(const char *device, int64_t speed, Error **errp)
 {
 AioContext *aio_context;
-BlockJob *job = find_block_job(device, &aio_context, errp);
+BlockJob *job;
+
+JOB_LOCK_GUARD();
+job = find_block_job_locked(device, &aio_context, errp);
 
 if (!job) {
 return;
 }
 
-block_job_set_speed(job, speed, errp);
+block_job_set_speed_locked(job, speed, errp);
 aio_context_release(aio_context);
 }
 
@@ -3343,7 +3350,10 @@ void qmp_block_job_cancel(const char *device,
   bool has_force, bool force, Error **errp)
 {
 AioContext *aio_context;
-BlockJob *job = find_block_job(device, &aio_context, errp);
+BlockJob *job;
+
+JOB_LOCK_GUARD();
+job = find_block_job_locked(device, &aio_context, errp);
 
 if (!job) {
 return;
@@ -3353,14 +3363,14 @@ void qmp_block_job_cancel(const char *device,
 force = false;
 }
 
-if (job_user_paused(&job->job) && !force) {
+if (job_user_paused_locked(&job->job) && !force) {
 error_setg(errp, "The block job for device '%s' is currently paused",
device);
 goto out;
 }
 
 trace_qmp_block_job_cancel(job);
-job_user_cancel(&job->job, force, errp);
+job_user_cancel_locked(&job->job, force, errp);
 out:
 aio_context_release(aio_context);
 }
@@ -3368,57 +3378,69 @@ out:
 void qmp_block_job_pause(const char *device, Error **errp)
 {
 AioContext *aio_context;
-BlockJob *job = find_block_job(device, &aio_context, errp);
+BlockJob *job;
+
+JOB_LOCK_GUARD();
+job = find_block_job_locked(device, &aio_context, errp);
 
 if (!job) {
 return;
 }
 
 trace_qmp_block_job_pause(job);
-job_user_pause(&job->job, errp);
+job_user_pause_locked(&job->job, errp);
 aio_context_release(aio_context);
 }
 
 void qmp_block_job_resume(const char *device, Error **errp)
 {
 AioContext *aio_context;
-BlockJob *job = find_block_job(device, &aio_context, errp);
+BlockJob *job;
+
+JOB_LOCK_GUARD();
+job = find_block_job_locked(device, &aio_context, errp);
 
 if (!job) {
 return;
 }
 
 trace_qmp_block_job_resume(job);
-job_user_resume(&job->job, errp);
+job_user_resume_locked(&job->job, errp);
 aio_context_release(aio_context);
 }
 
 void qmp_block_job_complete(const char *device, Error **errp)
 {
 AioContext *aio_context;
-BlockJob *job = find_block_job(device, &aio_context, errp);
+BlockJob *job;
+
+JOB_LOCK_GUARD();
+job = find_block_job_locked(device, &aio_context, errp);
 
 if (!job) {
 return;
 }
 
 trace_qmp_block_job_complete(job);
-job_complete(&job->job, errp);
+job_complete_locked(&job->job, errp);
 aio_context_release(aio_context);
 }
 
 void qmp_block_job_finalize(const char *id, Error **errp)
 {
 AioContext *aio_context;
-BlockJob *job = find_block_job(id, &aio_context, errp);
+BlockJob *job;
+
+JOB_LOCK_GUARD();
+job = find_block_job_locked(id, &aio_context, errp);
 
 if (!job) {
 return;
 }
 
 trace_qmp_block_job_finalize(job);
-job_ref(&job->job);
-job_finalize(&job->job, errp);
+job_ref_locked(&job->job);
+job_finalize_locked(&job->job, errp);
 
 /*
  * Job's context might have changed via job_finalize (and job_txn_apply
@@ -3426,23 +3448,26 @@ void qmp_block_job_finalize(const char *id, Error **errp)
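
Schematically, the TOC/TOU window being closed here, with the function
names from the hunks above (elided code marked "..."):

    /* before: lookup and action are two separate critical sections,
     * so the job can change state or go away in between */
    job = find_block_job(device, &aio_context, errp);
    ...
    job_user_pause(&job->job, errp);

    /* after: one lock guard covers both the lookup and the action */
    JOB_LOCK_GUARD();
    job = find_block_job_locked(device, &aio_context, errp);
    ...
    job_user_pause_locked(&job->job, errp);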

[PATCH v10 11/21] jobs: group together API calls under the same job lock

2022-07-25 Thread Emanuele Giuseppe Esposito
Now that the API offers also _locked() functions, take advantage
of it and give also the caller control to take the lock and call
_locked functions.

This is especially useful in for loops, where it would be
wasteful to have:

for(job = job_next(); ...)

where each job_next() takes the lock internally.
Instead we want

JOB_LOCK_GUARD();
for(job = job_next_locked(); ...)
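
Concretely, with the block job API the same pattern reads (as in the
blockdev.c hunk below):

    JOB_LOCK_GUARD();

    for (job = block_job_next_locked(NULL); job;
         job = block_job_next_locked(job)) {
        ...
    }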

In addition, also protect direct field accesses, by either creating a
new critical section or widening the existing ones.

Note: at this stage, job_{lock/unlock} and job lock guard macros
are *nop*.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c| 17 ++---
 blockdev.c | 12 +---
 blockjob.c | 35 ++-
 job-qmp.c  |  4 +++-
 job.c  |  7 +--
 monitor/qmp-cmds.c |  7 +--
 qemu-img.c | 37 +
 7 files changed, 75 insertions(+), 44 deletions(-)

diff --git a/block.c b/block.c
index 2c0080..7559965dbc 100644
--- a/block.c
+++ b/block.c
@@ -4978,8 +4978,8 @@ static void bdrv_close(BlockDriverState *bs)
 
 void bdrv_close_all(void)
 {
-assert(job_next(NULL) == NULL);
 GLOBAL_STATE_CODE();
+assert(job_next(NULL) == NULL);
 
 /* Drop references from requests still in flight, such as canceled block
  * jobs whose AIO context has not been polled yet */
@@ -6165,13 +6165,16 @@ XDbgBlockGraph *bdrv_get_xdbg_block_graph(Error **errp)
 }
 }
 
-for (job = block_job_next(NULL); job; job = block_job_next(job)) {
-GSList *el;
+WITH_JOB_LOCK_GUARD() {
+for (job = block_job_next_locked(NULL); job;
+ job = block_job_next_locked(job)) {
+GSList *el;
 
-xdbg_graph_add_node(gr, job, X_DBG_BLOCK_GRAPH_NODE_TYPE_BLOCK_JOB,
-   job->job.id);
-for (el = job->nodes; el; el = el->next) {
-xdbg_graph_add_edge(gr, job, (BdrvChild *)el->data);
+xdbg_graph_add_node(gr, job, X_DBG_BLOCK_GRAPH_NODE_TYPE_BLOCK_JOB,
+job->job.id);
+for (el = job->nodes; el; el = el->next) {
+xdbg_graph_add_edge(gr, job, (BdrvChild *)el->data);
+}
 }
 }
 
diff --git a/blockdev.c b/blockdev.c
index 71f793c4ab..5b79093155 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -150,12 +150,15 @@ void blockdev_mark_auto_del(BlockBackend *blk)
 return;
 }
 
-for (job = block_job_next(NULL); job; job = block_job_next(job)) {
+JOB_LOCK_GUARD();
+
+for (job = block_job_next_locked(NULL); job;
+ job = block_job_next_locked(job)) {
 if (block_job_has_bdrv(job, blk_bs(blk))) {
 AioContext *aio_context = job->job.aio_context;
 aio_context_acquire(aio_context);
 
-job_cancel(&job->job, false);
+job_cancel_locked(&job->job, false);
 
 aio_context_release(aio_context);
 }
@@ -3745,7 +3748,10 @@ BlockJobInfoList *qmp_query_block_jobs(Error **errp)
 BlockJobInfoList *head = NULL, **tail = &head;
 BlockJob *job;
 
-for (job = block_job_next(NULL); job; job = block_job_next(job)) {
+JOB_LOCK_GUARD();
+
+for (job = block_job_next_locked(NULL); job;
+ job = block_job_next_locked(job)) {
 BlockJobInfo *value;
 AioContext *aio_context;
 
diff --git a/blockjob.c b/blockjob.c
index 0d59aba439..96fb9d9f73 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -111,8 +111,10 @@ static bool child_job_drained_poll(BdrvChild *c)
 /* An inactive or completed job doesn't have any pending requests. Jobs
  * with !job->busy are either already paused or have a pause point after
  * being reentered, so no job driver code will run before they pause. */
-if (!job->busy || job_is_completed(job)) {
-return false;
+WITH_JOB_LOCK_GUARD() {
+if (!job->busy || job_is_completed_locked(job)) {
+return false;
+}
 }
 
 /* Otherwise, assume that it isn't fully stopped yet, but allow the job to
@@ -475,13 +477,15 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
 job->ready_notifier.notify = block_job_event_ready;
 job->idle_notifier.notify = block_job_on_idle;
 
-notifier_list_add(&job->job.on_finalize_cancelled,
-  &job->finalize_cancelled_notifier);
-notifier_list_add(&job->job.on_finalize_completed,
-  &job->finalize_completed_notifier);
-notifier_list_add(&job->job.on_pending, &job->pending_notifier);
-notifier_list_add(&job->job.on_ready, &job->ready_notifier);
-notifier_list_add(&job->job.on_idle, &job->idle_notifier);
+WITH_JOB_LOCK_GUARD() {
+notifier_list_add(&job->job.on_finalize_cancelled,
+  &job->finalize_cancelled_notifier);
+notifier_list_add(&job->job.on_finalize_completed,
+  &job->finalize_completed_notifier);
+notifier_list_add(&job->job.on_pending, 

[PATCH v10 04/21] aio-wait.h: introduce AIO_WAIT_WHILE_UNLOCKED

2022-07-25 Thread Emanuele Giuseppe Esposito
Same as the AIO_WAIT_WHILE macro, but if we are in the main loop it
does not release and then re-acquire ctx_'s AioContext.

Once all AioContext locks go away, this macro will replace
AIO_WAIT_WHILE.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/aio-wait.h | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
index 54840f8622..a61f82c617 100644
--- a/include/block/aio-wait.h
+++ b/include/block/aio-wait.h
@@ -59,10 +59,13 @@ typedef struct {
 extern AioWait global_aio_wait;
 
 /**
- * AIO_WAIT_WHILE:
+ * _AIO_WAIT_WHILE:
  * @ctx: the aio context, or NULL if multiple aio contexts (for which the
  *   caller does not hold a lock) are involved in the polling condition.
  * @cond: wait while this conditional expression is true
+ * @unlock: whether to unlock and then lock again @ctx. This applies
+ * only when waiting for another AioContext from the main loop.
+ * Otherwise it's ignored.
  *
  * Wait while a condition is true.  Use this to implement synchronous
  * operations that require event loop activity.
@@ -75,7 +78,7 @@ extern AioWait global_aio_wait;
  * wait on conditions between two IOThreads since that could lead to deadlock,
  * go via the main loop instead.
  */
-#define AIO_WAIT_WHILE(ctx, cond) ({   \
+#define _AIO_WAIT_WHILE(ctx, cond, unlock) ({  \
 bool waited_ = false;  \
 AioWait *wait_ = &global_aio_wait;  \
 AioContext *ctx_ = (ctx);  \
@@ -92,11 +95,11 @@ extern AioWait global_aio_wait;
 assert(qemu_get_current_aio_context() ==   \
qemu_get_aio_context());\
 while ((cond)) {   \
-if (ctx_) {\
+if (unlock && ctx_) {  \
 aio_context_release(ctx_); \
 }  \
 aio_poll(qemu_get_aio_context(), true);\
-if (ctx_) {\
+if (unlock && ctx_) {  \
 aio_context_acquire(ctx_); \
 }  \
 waited_ = true;\
@@ -105,6 +108,12 @@ extern AioWait global_aio_wait;
 qatomic_dec(&wait_->num_waiters);  \
 waited_; })
 
+#define AIO_WAIT_WHILE(ctx, cond)  \
+_AIO_WAIT_WHILE(ctx, cond, true)
+
+#define AIO_WAIT_WHILE_UNLOCKED(ctx, cond) \
+_AIO_WAIT_WHILE(ctx, cond, false)
+
 /**
  * aio_wait_kick:
  * Wake up the main thread if it is waiting on AIO_WAIT_WHILE().  During
-- 
2.31.1
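
What the new "unlock" parameter changes can be shown with a
self-contained toy: a pthread mutex stands in for the AioContext lock,
and cond()/poll_once() are stubs rather than QEMU APIs:

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;
    static int pending = 3;

    static bool cond(void) { return pending > 0; }  /* stub condition */
    static void poll_once(void) { pending--; }      /* aio_poll() stand-in */

    /* the caller holds ctx_lock, just as AIO_WAIT_WHILE callers hold
     * the AioContext lock */
    static void wait_while(bool unlock)
    {
        while (cond()) {
            if (unlock) {            /* AIO_WAIT_WHILE behaviour */
                pthread_mutex_unlock(&ctx_lock);
            }
            poll_once();
            if (unlock) {
                pthread_mutex_lock(&ctx_lock);
            }
        }                            /* unlock=false behaves like
                                      * AIO_WAIT_WHILE_UNLOCKED */
    }

    int main(void)
    {
        pthread_mutex_lock(&ctx_lock);
        wait_while(true);
        pthread_mutex_unlock(&ctx_lock);
        return 0;
    }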




[PATCH v10 17/21] blockjob: protect iostatus field in BlockJob struct

2022-07-25 Thread Emanuele Giuseppe Esposito
iostatus is the only field (together with .job) that needs
protection using the job mutex.

It is set in the main loop (GLOBAL_STATE functions) but read
in I/O code (block_job_error_action).

In order to protect it, change block_job_iostatus_set_err
to block_job_iostatus_set_err_locked(), always called under
job lock.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockjob.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index 0663faee2c..448bdb5a53 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -363,7 +363,8 @@ BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
 return block_job_query_locked(job, errp);
 }
 
-static void block_job_iostatus_set_err(BlockJob *job, int error)
+/* Called with job lock held */
+static void block_job_iostatus_set_err_locked(BlockJob *job, int error)
 {
 if (job->iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
 job->iostatus = error == ENOSPC ? BLOCK_DEVICE_IO_STATUS_NOSPACE :
@@ -577,8 +578,8 @@ BlockErrorAction block_job_error_action(BlockJob *job, BlockdevOnError on_err,
  */
 job->job.user_paused = true;
 }
+block_job_iostatus_set_err_locked(job, error);
 }
-block_job_iostatus_set_err(job, error);
 }
 return action;
 }
-- 
2.31.1




[PATCH v10 01/21] job.c: make job_mutex and job_lock/unlock() public

2022-07-25 Thread Emanuele Giuseppe Esposito
job mutex will be used to protect the job struct elements and list,
replacing AioContext locks.

Right now use a shared lock for all jobs, in order to keep things
simple. Once the AioContext lock is gone, we can introduce per-job
locks.

To simplify the switch from aiocontext to job lock, introduce
*nop* lock/unlock functions and macros.
We want to always call job_lock/unlock outside the AioContext locks,
and not vice-versa, otherwise we might get a deadlock. This is not
straightforward to do, and that's why we start with nop functions.
Once everything is protected by job_lock/unlock, we can change the nop into
an actual mutex and remove the aiocontext lock.

Since job_mutex is already being used, add static
real_job_{lock/unlock} for the existing usage.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/qemu/job.h | 24 
 job.c  | 35 +++
 2 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index c105b31076..d1192ffd61 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -303,6 +303,30 @@ typedef enum JobCreateFlags {
 JOB_MANUAL_DISMISS = 0x04,
 } JobCreateFlags;
 
+extern QemuMutex job_mutex;
+
+#define JOB_LOCK_GUARD() /* QEMU_LOCK_GUARD(&job_mutex) */
+
+#define WITH_JOB_LOCK_GUARD() /* WITH_QEMU_LOCK_GUARD(&job_mutex) */
+
+/**
+ * job_lock:
+ *
+ * Take the mutex protecting the list of jobs and their status.
+ * Most functions called by the monitor need to call job_lock
+ * and job_unlock manually.  On the other hand, function called
+ * by the block jobs themselves and by the block layer will take the
+ * lock for you.
+ */
+void job_lock(void);
+
+/**
+ * job_unlock:
+ *
+ * Release the mutex protecting the list of jobs and their status.
+ */
+void job_unlock(void);
+
 /**
  * Allocate and return a new job transaction. Jobs can be added to the
  * transaction using job_txn_add_job().
diff --git a/job.c b/job.c
index 075c6f3a20..2b4ffca9d4 100644
--- a/job.c
+++ b/job.c
@@ -32,6 +32,12 @@
 #include "trace/trace-root.h"
 #include "qapi/qapi-events-job.h"
 
+/*
+ * job_mutex protects the jobs list, but also makes the
+ * struct job fields thread-safe.
+ */
+QemuMutex job_mutex;
+
 static QLIST_HEAD(, Job) jobs = QLIST_HEAD_INITIALIZER(jobs);
 
 /* Job State Transition Table */
@@ -74,17 +80,22 @@ struct JobTxn {
 int refcnt;
 };
 
-/* Right now, this mutex is only needed to synchronize accesses to job->busy
- * and job->sleep_timer, such as concurrent calls to job_do_yield and
- * job_enter. */
-static QemuMutex job_mutex;
+void job_lock(void)
+{
+/* nop */
+}
+
+void job_unlock(void)
+{
+/* nop */
+}
 
-static void job_lock(void)
+static void real_job_lock(void)
 {
 qemu_mutex_lock(&job_mutex);
 }
 
-static void job_unlock(void)
+static void real_job_unlock(void)
 {
 qemu_mutex_unlock(&job_mutex);
 }
@@ -450,21 +461,21 @@ void job_enter_cond(Job *job, bool(*fn)(Job *job))
 return;
 }
 
-job_lock();
+real_job_lock();
 if (job->busy) {
-job_unlock();
+real_job_unlock();
 return;
 }
 
 if (fn && !fn(job)) {
-job_unlock();
+real_job_unlock();
 return;
 }
 
 assert(!job->deferred_to_main_loop);
 timer_del(&job->sleep_timer);
 job->busy = true;
-job_unlock();
+real_job_unlock();
 aio_co_enter(job->aio_context, job->co);
 }
 
@@ -481,13 +492,13 @@ void job_enter(Job *job)
  * called explicitly. */
 static void coroutine_fn job_do_yield(Job *job, uint64_t ns)
 {
-job_lock();
+real_job_lock();
 if (ns != -1) {
 timer_mod(&job->sleep_timer, ns);
 }
 job->busy = false;
 job_event_idle(job);
-job_unlock();
+real_job_unlock();
 qemu_coroutine_yield();
 
 /* Set by job_enter_cond() before re-entering the coroutine.  */
-- 
2.31.1
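
For reference, patch 18 of this series later turns the nop into the
real thing, at which point the helpers reduce to (simplified):

    void job_lock(void)
    {
        qemu_mutex_lock(&job_mutex);
    }

    void job_unlock(void)
    {
        qemu_mutex_unlock(&job_mutex);
    }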




[PATCH v10 12/21] commit and mirror: create new nodes using bdrv_get_aio_context, and not the job aiocontext

2022-07-25 Thread Emanuele Giuseppe Esposito
We are always using the given bs AioContext, so there is no need
to take the job's (which is identical anyway).
This also reduces the places we need to check when protecting the
job.aio_context field.

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/commit.c | 4 ++--
 block/mirror.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/commit.c b/block/commit.c
index 851d1c557a..336f799172 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -370,7 +370,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 goto fail;
 }
 
-s->base = blk_new(s->common.job.aio_context,
+s->base = blk_new(bdrv_get_aio_context(bs),
   base_perms,
   BLK_PERM_CONSISTENT_READ
   | BLK_PERM_WRITE_UNCHANGED);
@@ -382,7 +382,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 s->base_bs = base;
 
 /* Required permissions are already taken with block_job_add_bdrv() */
-s->top = blk_new(s->common.job.aio_context, 0, BLK_PERM_ALL);
+s->top = blk_new(bdrv_get_aio_context(bs), 0, BLK_PERM_ALL);
 ret = blk_insert_bs(s->top, top, errp);
 if (ret < 0) {
 goto fail;
diff --git a/block/mirror.c b/block/mirror.c
index b38676e19d..1977e25171 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1728,7 +1728,7 @@ static BlockJob *mirror_start_job(
 goto fail;
 }
 
-s->target = blk_new(s->common.job.aio_context,
+s->target = blk_new(bdrv_get_aio_context(bs),
 target_perms, target_shared_perms);
 ret = blk_insert_bs(s->target, target, errp);
 if (ret < 0) {
-- 
2.31.1
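
The commit message's claim that the two contexts are identical can be
written as an invariant; a hypothetical assertion (not part of the
patch) at the point of the change would read:

    /* Sketch only: the job was created to run in bs's AioContext,
     * so both expressions name the same context. */
    assert(bdrv_get_aio_context(bs) == s->common.job.aio_context);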




[PATCH v10 03/21] job.c: API functions not used outside should be static

2022-07-25 Thread Emanuele Giuseppe Esposito
The job_event_* functions can all be static, as they are not used
outside job.c.

The same applies to job_txn_add_job().

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/qemu/job.h | 18 --
 job.c  | 22 +++---
 2 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 876e13d549..4b64eb15f7 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -358,18 +358,6 @@ JobTxn *job_txn_new(void);
  */
 void job_txn_unref(JobTxn *txn);
 
-/**
- * @txn: The transaction (may be NULL)
- * @job: Job to add to the transaction
- *
- * Add @job to the transaction.  The @job must not already be in a transaction.
- * The caller must call either job_txn_unref() or job_completed() to release
- * the reference that is automatically grabbed here.
- *
- * If @txn is NULL, the function does nothing.
- */
-void job_txn_add_job(JobTxn *txn, Job *job);
-
 /**
  * Create a new long-running job and return it.
  *
@@ -431,12 +419,6 @@ void job_progress_set_remaining(Job *job, uint64_t 
remaining);
  */
 void job_progress_increase_remaining(Job *job, uint64_t delta);
 
-/** To be called when a cancelled job is finalised. */
-void job_event_cancelled(Job *job);
-
-/** To be called when a successfully completed job is finalised. */
-void job_event_completed(Job *job);
-
 /**
  * Conditionally enter the job coroutine if the job is ready to run, not
  * already busy and fn() returns true. fn() is called while under the job_lock
diff --git a/job.c b/job.c
index 2b4ffca9d4..cafd597ba4 100644
--- a/job.c
+++ b/job.c
@@ -125,7 +125,17 @@ void job_txn_unref(JobTxn *txn)
 }
 }
 
-void job_txn_add_job(JobTxn *txn, Job *job)
+/**
+ * @txn: The transaction (may be NULL)
+ * @job: Job to add to the transaction
+ *
+ * Add @job to the transaction.  The @job must not already be in a transaction.
+ * The caller must call either job_txn_unref() or job_completed() to release
+ * the reference that is automatically grabbed here.
+ *
+ * If @txn is NULL, the function does nothing.
+ */
+static void job_txn_add_job(JobTxn *txn, Job *job)
 {
 if (!txn) {
 return;
@@ -427,12 +437,18 @@ void job_progress_increase_remaining(Job *job, uint64_t 
delta)
progress_increase_remaining(&job->progress, delta);
 }
 
-void job_event_cancelled(Job *job)
+/**
+ * To be called when a cancelled job is finalised.
+ */
+static void job_event_cancelled(Job *job)
 {
notifier_list_notify(&job->on_finalize_cancelled, job);
 }
 
-void job_event_completed(Job *job)
+/**
+ * To be called when a successfully completed job is finalised.
+ */
+static void job_event_completed(Job *job)
 {
notifier_list_notify(&job->on_finalize_completed, job);
 }
-- 
2.31.1




[PATCH v10 07/21] blockjob: introduce block_job _locked() APIs

2022-07-25 Thread Emanuele Giuseppe Esposito
Just as done with job.h, create _locked() functions in blockjob.h.

These functions will be useful later when the caller has already taken
the lock. All blockjob _locked functions call job _locked functions.

Note: at this stage, job_{lock/unlock} and the job lock guard macros
are *nops*.

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockjob.c   | 52 
 include/block/blockjob.h | 18 ++
 2 files changed, 60 insertions(+), 10 deletions(-)

diff --git a/blockjob.c b/blockjob.c
index 7da59a1f1c..0d59aba439 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -44,21 +44,27 @@ static bool is_block_job(Job *job)
job_type(job) == JOB_TYPE_STREAM;
 }
 
-BlockJob *block_job_next(BlockJob *bjob)
+BlockJob *block_job_next_locked(BlockJob *bjob)
 {
Job *job = bjob ? &bjob->job : NULL;
 GLOBAL_STATE_CODE();
 
 do {
-job = job_next(job);
+job = job_next_locked(job);
 } while (job && !is_block_job(job));
 
 return job ? container_of(job, BlockJob, job) : NULL;
 }
 
-BlockJob *block_job_get(const char *id)
+BlockJob *block_job_next(BlockJob *bjob)
 {
-Job *job = job_get(id);
+JOB_LOCK_GUARD();
+return block_job_next_locked(bjob);
+}
+
+BlockJob *block_job_get_locked(const char *id)
+{
+Job *job = job_get_locked(id);
 GLOBAL_STATE_CODE();
 
 if (job && is_block_job(job)) {
@@ -68,6 +74,12 @@ BlockJob *block_job_get(const char *id)
 }
 }
 
+BlockJob *block_job_get(const char *id)
+{
+JOB_LOCK_GUARD();
+return block_job_get_locked(id);
+}
+
 void block_job_free(Job *job)
 {
 BlockJob *bjob = container_of(job, BlockJob, job);
@@ -256,14 +268,14 @@ static bool job_timer_pending(Job *job)
 return timer_pending(>sleep_timer);
 }
 
-bool block_job_set_speed(BlockJob *job, int64_t speed, Error **errp)
+bool block_job_set_speed_locked(BlockJob *job, int64_t speed, Error **errp)
 {
 const BlockJobDriver *drv = block_job_driver(job);
 int64_t old_speed = job->speed;
 
 GLOBAL_STATE_CODE();
 
-if (job_apply_verb(&job->job, JOB_VERB_SET_SPEED, errp) < 0) {
+if (job_apply_verb_locked(&job->job, JOB_VERB_SET_SPEED, errp) < 0) {
 return false;
 }
 if (speed < 0) {
@@ -277,7 +289,9 @@ bool block_job_set_speed(BlockJob *job, int64_t speed, 
Error **errp)
 job->speed = speed;
 
 if (drv->set_speed) {
+job_unlock();
 drv->set_speed(job, speed);
+job_lock();
 }
 
 if (speed && speed <= old_speed) {
@@ -285,18 +299,24 @@ bool block_job_set_speed(BlockJob *job, int64_t speed, 
Error **errp)
 }
 
 /* kick only if a timer is pending */
-job_enter_cond(&job->job, job_timer_pending);
+job_enter_cond_locked(&job->job, job_timer_pending);
 
 return true;
 }
 
+bool block_job_set_speed(BlockJob *job, int64_t speed, Error **errp)
+{
+JOB_LOCK_GUARD();
+return block_job_set_speed_locked(job, speed, errp);
+}
+
 int64_t block_job_ratelimit_get_delay(BlockJob *job, uint64_t n)
 {
 IO_CODE();
return ratelimit_calculate_delay(&job->limit, n);
 }
 
-BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
+BlockJobInfo *block_job_query_locked(BlockJob *job, Error **errp)
 {
 BlockJobInfo *info;
 uint64_t progress_current, progress_total;
@@ -320,7 +340,7 @@ BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
 info->len   = progress_total;
 info->speed = job->speed;
 info->io_status = job->iostatus;
-info->ready = job_is_ready(&job->job),
+info->ready = job_is_ready_locked(&job->job),
 info->status= job->job.status;
 info->auto_finalize = job->job.auto_finalize;
 info->auto_dismiss  = job->job.auto_dismiss;
@@ -333,6 +353,12 @@ BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
 return info;
 }
 
+BlockJobInfo *block_job_query(BlockJob *job, Error **errp)
+{
+JOB_LOCK_GUARD();
+return block_job_query_locked(job, errp);
+}
+
 static void block_job_iostatus_set_err(BlockJob *job, int error)
 {
 if (job->iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
@@ -478,7 +504,7 @@ fail:
 return NULL;
 }
 
-void block_job_iostatus_reset(BlockJob *job)
+void block_job_iostatus_reset_locked(BlockJob *job)
 {
 GLOBAL_STATE_CODE();
 if (job->iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
@@ -488,6 +514,12 @@ void block_job_iostatus_reset(BlockJob *job)
 job->iostatus = BLOCK_DEVICE_IO_STATUS_OK;
 }
 
+void block_job_iostatus_reset(BlockJob *job)
+{
+JOB_LOCK_GUARD();
+block_job_iostatus_reset_locked(job);
+}
+
 void block_job_user_resume(Job *job)
 {
 BlockJob *bjob = container_of(job, BlockJob, job);
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 6525e16fd5..8b65d3949d 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -92,6 +92,9 @@ typedef struct BlockJob {
  */
 BlockJob *block_job_next(BlockJob *job);
 
+/* Same as block_job_next(), but called with job lock held. */

[PATCH v10 02/21] job.h: categorize fields in struct Job

2022-07-25 Thread Emanuele Giuseppe Esposito
Categorize the fields in struct Job to understand which ones
need to be protected by the job mutex and which don't.

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/qemu/job.h | 61 +++---
 1 file changed, 36 insertions(+), 25 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index d1192ffd61..876e13d549 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -40,27 +40,52 @@ typedef struct JobTxn JobTxn;
  * Long-running operation.
  */
 typedef struct Job {
+
+/* Fields set at initialization (job_create), and never modified */
+
 /** The ID of the job. May be NULL for internal jobs. */
 char *id;
 
-/** The type of this job. */
+/**
+ * The type of this job.
+ * All callbacks are called with job_mutex *not* held.
+ */
 const JobDriver *driver;
 
-/** Reference count of the block job */
-int refcnt;
-
-/** Current state; See @JobStatus for details. */
-JobStatus status;
-
-/** AioContext to run the job coroutine in */
-AioContext *aio_context;
-
 /**
  * The coroutine that executes the job.  If not NULL, it is reentered when
  * busy is false and the job is cancelled.
+ * Initialized in job_start()
  */
 Coroutine *co;
 
+/** True if this job should automatically finalize itself */
+bool auto_finalize;
+
+/** True if this job should automatically dismiss itself */
+bool auto_dismiss;
+
+/** The completion function that will be called when the job completes.  */
+BlockCompletionFunc *cb;
+
+/** The opaque value that is passed to the completion function.  */
+void *opaque;
+
+/* ProgressMeter API is thread-safe */
+ProgressMeter progress;
+
+
+/** Protected by AioContext lock */
+
+/** AioContext to run the job coroutine in */
+AioContext *aio_context;
+
+/** Reference count of the block job */
+int refcnt;
+
+/** Current state; See @JobStatus for details. */
+JobStatus status;
+
 /**
  * Timer that is used by @job_sleep_ns. Accessed under job_mutex (in
  * job.c).
@@ -112,14 +137,6 @@ typedef struct Job {
 /** Set to true when the job has deferred work to the main loop. */
 bool deferred_to_main_loop;
 
-/** True if this job should automatically finalize itself */
-bool auto_finalize;
-
-/** True if this job should automatically dismiss itself */
-bool auto_dismiss;
-
-ProgressMeter progress;
-
 /**
  * Return code from @run and/or @prepare callback(s).
  * Not final until the job has reached the CONCLUDED status.
@@ -134,12 +151,6 @@ typedef struct Job {
  */
 Error *err;
 
-/** The completion function that will be called when the job completes.  */
-BlockCompletionFunc *cb;
-
-/** The opaque value that is passed to the completion function.  */
-void *opaque;
-
 /** Notifiers called when a cancelled job is finalised */
 NotifierList on_finalize_cancelled;
 
@@ -167,6 +178,7 @@ typedef struct Job {
 
 /**
  * Callbacks and other information about a Job driver.
+ * All callbacks are invoked with job_mutex *not* held.
  */
 struct JobDriver {
 
@@ -472,7 +484,6 @@ void job_yield(Job *job);
  */
 void coroutine_fn job_sleep_ns(Job *job, int64_t ns);
 
-
 /** Returns the JobType of a given Job. */
 JobType job_type(const Job *job);
 
-- 
2.31.1




[PATCH v10 00/21] job: replace AioContext lock with job_mutex

2022-07-25 Thread Emanuele Giuseppe Esposito
In this series, we want to remove the AioContext lock and instead
use the already existing job_mutex to protect the job structures
and list. This is part of the work to get rid of AioContext lock
usage in favour of finer-grained locks.

To simplify the reviewers' job, the job lock/unlock functions and
macros are added as empty prototypes (nops) in patch 1.
They are converted to use the actual job mutex only in the last
patch. In this way we can freely create locking sections
without worrying about deadlocks with the AioContext lock.
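
As a concrete sketch of that final step (the enabled forms are already
visible, commented out, in patch 1; the actual patch also drops the
AioContext locking at the same time), the nops become:

    #define JOB_LOCK_GUARD() QEMU_LOCK_GUARD(&job_mutex)

    #define WITH_JOB_LOCK_GUARD() WITH_QEMU_LOCK_GUARD(&job_mutex)

    void job_lock(void)
    {
        qemu_mutex_lock(&job_mutex);
    }

    void job_unlock(void)
    {
        qemu_mutex_unlock(&job_mutex);
    }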

Patch 2 defines which fields in the job structure need protection.
Patches 3-6 are in preparation for the job locks, moving functions
from global to static and introducing helpers.

Patches 7-9 introduce the (nop) job lock into the job API and
its users, and patches 10-13 categorize locked and unlocked
functions in the job API, respectively.

Patches 14-17 take care of protecting job->aio_context, and
finally patch 18 makes the prototypes in patch 1 use the
job_mutex and removes all AioContext locks at the same time.

Tested this series by running unit tests, qemu-iotests and qtests
(x86_64).

---
v10:
* protect job->status in unit tests
* patch 11: change commit description and avoid using lock guard for a single
function call
* move patch 19 before patch 15

v9:
* merge patch 6 and 7 to 5.
* additional "taken with job lock/unlock" added and propagated in callers
* protect iostatus field of BlockJobs
* move all blockjob patches torward the end of the serie

v8:
* reorganize patch ordering according with Vladimir proposal
* minor nitpicks

v7:
* s/temporary/temporarly
* double identical locking comment to the same function
* patch 2: add "Protected by AioContext lock" to better categorize fields in
  job.h
* use same comment style in all function headers ("Just like {funct}, but
  called between job_lock and job_unlock")

v6:
* patch 4 and 6 squashed with patch 19 (enable job lock and
  reduce/remove AioContext lock)
* patch 19: job_unref_locked read the aiocontext inside the
  job lock.

v5:
* just restructured patches a little bit better, as there were
  functions used before they were defined.
* rebased on kwolf/block branch and API split serie

v4:
* move "protected by job_mutex" from patch 2 to 15, where the job_mutex is
  actually added.
* s/aio_co_enter/aio_co_schedule in job.c, and adjust tests accordingly.
* remove job_get_aio_context, add job_set_aio_context. Use "fake rwlock"
  to protect job->aiocontext.
* get rid of useless getters method, namely:
  job_get_status
  job_get_pause_count
  job_get_paused
  job_get_busy
  They are all used only by tests, and such getter is pretty useless.
  Replace with job_lock(); assert(); job_unlock();
* use job lock macros instead of job lock/unlock in unit tests.
* convert also blockjob functions to have _locked
* put the job_lock/unlock patches before the _locked ones
* replace aio_co_enter in job.c and detect change of context

v3:
* add "_locked" suffix to the functions called under job_mutex lock
* rename _job_lock in real_job_lock
* job_mutex is now public, and drivers like monitor use it directly
* introduce and protect job_get_aio_context
* remove mirror-specific APIs and just use WITH_JOB_GUARD
* more extensive use of WITH_JOB_GUARD and JOB_LOCK_GUARD

RFC v2:
* use JOB_LOCK_GUARD and WITH_JOB_LOCK_GUARD
* mu(u)ltiple typos in commit messages
* job API split patches are sent separately in another series
* use of empty job_{lock/unlock} and JOB_LOCK_GUARD/WITH_JOB_LOCK_GUARD
  to avoid deadlocks and simplify the reviewer job
* move patch 11 (block_job_query: remove atomic read) as last

Emanuele Giuseppe Esposito (20):
  job.c: make job_mutex and job_lock/unlock() public
  job.h: categorize fields in struct Job
  job.c: API functions not used outside should be static
  aio-wait.h: introduce AIO_WAIT_WHILE_UNLOCKED
  job.c: add job_lock/unlock while keeping job.h intact
  job: move and update comments from blockjob.c
  blockjob: introduce block_job  _locked() APIs
  jobs: add job lock in find_* functions
  jobs: use job locks also in the unit tests
  block/mirror.c: use of job helpers in drivers to avoid TOC/TOU
  jobs: group together API calls under the same job lock
  commit and mirror: create new nodes using bdrv_get_aio_context, and
not the job aiocontext
  jobs: protect job.aio_context with BQL and job_mutex
  blockjob.h: categorize fields in struct BlockJob
  blockjob: rename notifier callbacks as _locked
  blockjob: protect iostatus field in BlockJob struct
  job.c: enable job lock/unlock and remove Aiocontext locks
  block_job_query: remove atomic read
  blockjob: remove unused functions
  job: remove unused functions

Paolo Bonzini (1):
  job: detect change of aiocontext within job coroutine

 block.c  |  17 +-
 block/commit.c   |   4 +-
 block/mirror.c   |  21 +-
 block/replication.c  |   6 +-
 blockdev.c   | 129 +++---
 

Re: [PATCH v2 1/2] qapi: Add exit-failure PanicAction

2022-07-25 Thread David Hildenbrand
On 23.07.22 01:36, Ilya Leoshkevich wrote:
> Currently QEMU exits with code 0 on both panic and shutdown. For tests
> it is useful to return 1 on panic, so that it counts as a test
> failure.
> 
> Introduce a new exit-failure PanicAction that makes main() return
> EXIT_FAILURE. Tests can use the -action panic=exit-failure option to
> activate this behavior.

Roughly what I proposed, nice.

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb
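
For reference, a test harness would exercise the new behavior roughly
like this (an illustrative invocation; the binary and the remaining
arguments are placeholders):

    qemu-system-x86_64 -action panic=exit-failure ...
    echo $?    # 1 after a guest panic, instead of 0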




Re: [PATCH v6 3/5] target/riscv: smstateen check for fcsr

2022-07-25 Thread Weiwei Li


在 2022/7/24 下午11:49, Mayuresh Chitale 写道:

On Fri, 2022-07-22 at 09:42 +0800, Weiwei Li wrote:

在 2022/7/21 下午11:31, Mayuresh Chitale 写道:

If smstateen is implemented and sstateen0.fcsr is clear, then
floating point operations must raise an illegal instruction
exception.

Signed-off-by: Mayuresh Chitale 
---
   target/riscv/csr.c| 23 ++
   target/riscv/insn_trans/trans_rvf.c.inc   | 38
+--
   target/riscv/insn_trans/trans_rvzfh.c.inc |  4 +++
   3 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/target/riscv/csr.c b/target/riscv/csr.c
index ab06b117f9..a597b6cbc7 100644
--- a/target/riscv/csr.c
+++ b/target/riscv/csr.c
@@ -96,6 +96,10 @@ static RISCVException fs(CPURISCVState *env, int
csrno)
   !RISCV_CPU(env_cpu(env))->cfg.ext_zfinx) {
   return RISCV_EXCP_ILLEGAL_INST;
   }
+
+if (!env->debugger && !riscv_cpu_fp_enabled(env)) {
+return smstateen_acc_ok(env, 0, SMSTATEEN0_FCSR);
+}
   #endif
   return RISCV_EXCP_NONE;
   }
@@ -1876,6 +1880,9 @@ static RISCVException
write_mstateen0(CPURISCVState *env, int csrno,
 target_ulong new_val)
   {
   uint64_t wr_mask = SMSTATEEN_STATEN | SMSTATEEN0_HSENVCFG;
+if (!riscv_has_ext(env, RVF)) {
+wr_mask |= SMSTATEEN0_FCSR;
+}
   
   return write_mstateen(env, csrno, wr_mask, new_val);

   }
@@ -1924,6 +1931,10 @@ static RISCVException
write_mstateen0h(CPURISCVState *env, int csrno,
   {
   uint64_t wr_mask = SMSTATEEN_STATEN | SMSTATEEN0_HSENVCFG;
   
+if (!riscv_has_ext(env, RVF)) {

+wr_mask |= SMSTATEEN0_FCSR;
+}
+
   return write_mstateenh(env, csrno, wr_mask, new_val);
   }
   
@@ -1973,6 +1984,10 @@ static RISCVException

write_hstateen0(CPURISCVState *env, int csrno,
   {
   uint64_t wr_mask = SMSTATEEN_STATEN | SMSTATEEN0_HSENVCFG;
   
+if (!riscv_has_ext(env, RVF)) {

+wr_mask |= SMSTATEEN0_FCSR;
+}
+
   return write_hstateen(env, csrno, wr_mask, new_val);
   }
   
@@ -2024,6 +2039,10 @@ static RISCVException

write_hstateen0h(CPURISCVState *env, int csrno,
   {
   uint64_t wr_mask = SMSTATEEN_STATEN | SMSTATEEN0_HSENVCFG;
   
+if (!riscv_has_ext(env, RVF)) {

+wr_mask |= SMSTATEEN0_FCSR;
+}
+
   return write_hstateenh(env, csrno, wr_mask, new_val);
   }
   
@@ -2083,6 +2102,10 @@ static RISCVException

write_sstateen0(CPURISCVState *env, int csrno,
   {
   uint64_t wr_mask = SMSTATEEN_STATEN | SMSTATEEN0_HSENVCFG;
   
+if (!riscv_has_ext(env, RVF)) {

+wr_mask |= SMSTATEEN0_FCSR;
+}
+
   return write_sstateen(env, csrno, wr_mask, new_val);
   }
   
diff --git a/target/riscv/insn_trans/trans_rvf.c.inc

b/target/riscv/insn_trans/trans_rvf.c.inc
index a1d3eb52ad..c43c48336b 100644
--- a/target/riscv/insn_trans/trans_rvf.c.inc
+++ b/target/riscv/insn_trans/trans_rvf.c.inc
@@ -24,9 +24,43 @@
   return false; \
   } while (0)
   
+#ifndef CONFIG_USER_ONLY

+#define SMSTATEEN_CHECK(ctx) do {\
+CPUState *cpu = ctx->cs; \
+CPURISCVState *env = cpu->env_ptr; \
+if (ctx->cfg_ptr->ext_smstateen && \
+(env->priv < PRV_M)) { \
+uint64_t stateen = env->mstateen[0]; \
+uint64_t hstateen = env->hstateen[0]; \
+uint64_t sstateen = env->sstateen[0]; \
+if (!(stateen & SMSTATEEN_STATEN)) {\
+hstateen = 0; \
+sstateen = 0; \
+} \
+if (ctx->virt_enabled) { \
+stateen &= hstateen; \
+if (!(hstateen & SMSTATEEN_STATEN)) {\
+sstateen = 0; \
+} \
+} \
+if (env->priv == PRV_U && has_ext(ctx, RVS)) {\
+stateen &= sstateen; \
+} \
+if (!(stateen & SMSTATEEN0_FCSR)) { \
+return false; \
+} \
+} \
+} while (0)

It's better to add a space before '\'.

ok. will modify in the next version.

+#else
+#define SMSTATEEN_CHECK(ctx)
+#endif
+
   #define REQUIRE_ZFINX_OR_F(ctx) do {\
-if (!ctx->cfg_ptr->ext_zfinx) { \
-REQUIRE_EXT(ctx, RVF); \
+if (!has_ext(ctx, RVF)) { \
+SMSTATEEN_CHECK(ctx); \
+if (!ctx->cfg_ptr->ext_zfinx) { \
+return false; \
+} \
   } \
   } while (0)

SMSTATEEN_CHECK is for the CSR, and REQUIRE_ZFINX_OR_F is for the
extension. I think it's better to separate them. By the way, if we want
the smallest modification to the current code, adding it to REQUIRE_FPU
seems better.

Actually, REQUIRE_FPU checks for mstatus.fs, but as per the smstateen
spec we need to check for misa.f, which is done in REQUIRE_ZFINX_OR_F.


OK. It's acceptable to me, even though I prefer separating them.

However, I have another question about SMSTATEEN_CHECK: when access is
disallowed by Xstateen.FCSR, it always returns false, which will
eventually trigger an illegal instruction exception.


However, this exception is triggered by accessing the fcsr CSR, which 

Re: [PATCH v6 1/5] target/riscv: Add smstateen support

2022-07-25 Thread Weiwei Li



在 2022/7/24 下午11:39, Mayuresh Chitale 写道:

On Fri, 2022-07-22 at 08:31 +0800, Weiwei Li wrote:

在 2022/7/21 下午11:31, Mayuresh Chitale 写道:

The Smstateen extension specifies a mechanism to close
potential covert channels that could cause security issues.

This patch adds the CSRs defined in the specification and
the corresponding predicates and read/write functions.

Signed-off-by: Mayuresh Chitale 
---
   target/riscv/cpu.h  |   4 +
   target/riscv/cpu_bits.h |  37 
   target/riscv/csr.c  | 370

   target/riscv/machine.c  |  21 +++
   4 files changed, 432 insertions(+)

diff --git a/target/riscv/cpu.h b/target/riscv/cpu.h
index ffb1a18873..7f8e5b0014 100644
--- a/target/riscv/cpu.h
+++ b/target/riscv/cpu.h
@@ -354,6 +354,9 @@ struct CPUArchState {
   
   /* CSRs for execution enviornment configuration */

   uint64_t menvcfg;
+uint64_t mstateen[SMSTATEEN_MAX_COUNT];
+uint64_t hstateen[SMSTATEEN_MAX_COUNT];
+uint64_t sstateen[SMSTATEEN_MAX_COUNT];
   target_ulong senvcfg;
   uint64_t henvcfg;
   #endif
@@ -426,6 +429,7 @@ struct RISCVCPUConfig {
   bool ext_zkt;
   bool ext_ifencei;
   bool ext_icsr;
+bool ext_smstateen;
   bool ext_svinval;
   bool ext_svnapot;
   bool ext_svpbmt;
diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h
index 6be5a9e9f0..56b7c5bed6 100644
--- a/target/riscv/cpu_bits.h
+++ b/target/riscv/cpu_bits.h
@@ -199,6 +199,12 @@
   /* Supervisor Configuration CSRs */
   #define CSR_SENVCFG 0x10A
   
+/* Supervisor state CSRs */

+#define CSR_SSTATEEN0   0x10C
+#define CSR_SSTATEEN1   0x10D
+#define CSR_SSTATEEN2   0x10E
+#define CSR_SSTATEEN3   0x10F
+
   /* Supervisor Trap Handling */
   #define CSR_SSCRATCH0x140
   #define CSR_SEPC0x141
@@ -242,6 +248,16 @@
   #define CSR_HENVCFG 0x60A
   #define CSR_HENVCFGH0x61A
   
+/* Hypervisor state CSRs */

+#define CSR_HSTATEEN0   0x60C
+#define CSR_HSTATEEN0H  0x61C
+#define CSR_HSTATEEN1   0x60D
+#define CSR_HSTATEEN1H  0x61D
+#define CSR_HSTATEEN2   0x60E
+#define CSR_HSTATEEN2H  0x61E
+#define CSR_HSTATEEN3   0x60F
+#define CSR_HSTATEEN3H  0x61F
+
   /* Virtual CSRs */
   #define CSR_VSSTATUS0x200
   #define CSR_VSIE0x204
@@ -283,6 +299,27 @@
   #define CSR_MENVCFG 0x30A
   #define CSR_MENVCFGH0x31A
   
+/* Machine state CSRs */

+#define CSR_MSTATEEN0   0x30C
+#define CSR_MSTATEEN0H  0x31C
+#define CSR_MSTATEEN1   0x30D
+#define CSR_MSTATEEN1H  0x31D
+#define CSR_MSTATEEN2   0x30E
+#define CSR_MSTATEEN2H  0x31E
+#define CSR_MSTATEEN3   0x30F
+#define CSR_MSTATEEN3H  0x31F
+
+/* Common defines for all smstateen */
+#define SMSTATEEN_MAX_COUNT 4
+#define SMSTATEEN0_CS   (1ULL << 0)
+#define SMSTATEEN0_FCSR (1ULL << 1)
+#define SMSTATEEN0_HSCONTXT (1ULL << 57)
+#define SMSTATEEN0_IMSIC(1ULL << 58)
+#define SMSTATEEN0_AIA  (1ULL << 59)
+#define SMSTATEEN0_SVSLCT   (1ULL << 60)
+#define SMSTATEEN0_HSENVCFG (1ULL << 62)
+#define SMSTATEEN_STATEN(1ULL << 63)

Maybe  SMSTATEEN_STATEEN better.

ok. Will update in the next version.

+
   /* Enhanced Physical Memory Protection (ePMP) */
   #define CSR_MSECCFG 0x747
   #define CSR_MSECCFGH0x757
diff --git a/target/riscv/csr.c b/target/riscv/csr.c
index 235f2a011e..27032a416c 100644
--- a/target/riscv/csr.c
+++ b/target/riscv/csr.c
@@ -339,6 +339,68 @@ static RISCVException hmode32(CPURISCVState
*env, int csrno)
   
   }
   
+static RISCVException mstateen(CPURISCVState *env, int csrno)

+{
+CPUState *cs = env_cpu(env);
+RISCVCPU *cpu = RISCV_CPU(cs);
+
+if (!cpu->cfg.ext_smstateen) {
+return RISCV_EXCP_ILLEGAL_INST;
+}
+
+return any(env, csrno);
+}
+
+static RISCVException hstateen_pred(CPURISCVState *env, int csrno,
int base)
+{
+CPUState *cs = env_cpu(env);
+RISCVCPU *cpu = RISCV_CPU(cs);
+
+if (!cpu->cfg.ext_smstateen) {
+return RISCV_EXCP_ILLEGAL_INST;
+}
+
+if (!(env->mstateen[csrno - base] & SMSTATEEN_STATEN)) {
+return RISCV_EXCP_ILLEGAL_INST;
+}
+
+return hmode(env, csrno);
+}
+
+static RISCVException hstateen(CPURISCVState *env, int csrno)
+{
+return hstateen_pred(env, csrno, CSR_HSTATEEN0);
+}
+
+static RISCVException hstateenh(CPURISCVState *env, int csrno)
+{
+return hstateen_pred(env, csrno, CSR_HSTATEEN0H);
+}
+
+static RISCVException sstateen(CPURISCVState *env, int csrno)
+{
+bool virt = riscv_cpu_virt_enabled(env);
+int index = csrno - CSR_SSTATEEN0;
+CPUState *cs = env_cpu(env);
+RISCVCPU *cpu = RISCV_CPU(cs);
+
+if (!cpu->cfg.ext_smstateen) {
+return RISCV_EXCP_ILLEGAL_INST;
+}
+
+if (!(env->mstateen[index] & SMSTATEEN_STATEN)) {
+return RISCV_EXCP_ILLEGAL_INST;
+}
+
+if (virt) {
+if (!(env->hstateen[index] & SMSTATEEN_STATEN)) {
+   

Re: [PATCH v2] vhost: Get vring base from vq, not svq

2022-07-25 Thread Jason Wang
On Mon, Jul 18, 2022 at 8:05 PM Eugenio Pérez  wrote:
>
> The SVQ vring used idx usually matches the guest-visible one, as long
> as all the guest buffers (GPA) map to exactly one buffer within qemu's
> VA. However, as we can see in virtqueue_map_desc, a single guest buffer
> could map to many buffers in the SVQ vring.
>
> It is also a mistake to rewind them at the source of migration.
> Since VirtQueue is able to migrate the inflight descriptors, it is the
> responsibility of the destination to perform the rewind just in case it
> cannot report the inflight descriptors to the device.
>
> This makes it easier to migrate between backends or to recover them in
> vhost devices that support setting inflight descriptors.
>
> Fixes: 6d0b22266633 ("vdpa: Adapt vhost_vdpa_get_vring_base to SVQ")
> Signed-off-by: Eugenio Pérez 

Acked-by: Jason Wang 

>
> --
> v2: Squash both fixes in one.
> ---
>  hw/virtio/vhost-vdpa.c | 24 
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 795ed5a049..4458c8d23e 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -1178,7 +1178,18 @@ static int vhost_vdpa_set_vring_base(struct vhost_dev 
> *dev,
> struct vhost_vring_state *ring)
>  {
>  struct vhost_vdpa *v = dev->opaque;
> +VirtQueue *vq = virtio_get_queue(dev->vdev, ring->index);
>
> +/*
> + * vhost-vdpa devices does not support in-flight requests. Set all of 
> them
> + * as available.
> + *
> + * TODO: This is ok for networking, but other kinds of devices might
> + * have problems with these retransmissions.
> + */
> +while (virtqueue_rewind(vq, 1)) {
> +continue;
> +}
>  if (v->shadow_vqs_enabled) {
>  /*
>   * Device vring base was set at device start. SVQ base is handled by
> @@ -1194,21 +1205,10 @@ static int vhost_vdpa_get_vring_base(struct vhost_dev 
> *dev,
> struct vhost_vring_state *ring)
>  {
>  struct vhost_vdpa *v = dev->opaque;
> -int vdpa_idx = ring->index - dev->vq_index;
>  int ret;
>
>  if (v->shadow_vqs_enabled) {
> -VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, 
> vdpa_idx);
> -
> -/*
> - * Setting base as last used idx, so destination will see as 
> available
> - * all the entries that the device did not use, including the 
> in-flight
> - * processing ones.
> - *
> - * TODO: This is ok for networking, but other kinds of devices might
> - * have problems with these retransmissions.
> - */
> -ring->num = svq->last_used_idx;
> +ring->num = virtio_queue_get_last_avail_idx(dev->vdev, ring->index);
>  return 0;
>  }
>
> --
> 2.31.1
>




Re: [PATCH] vdpa: Fix memory listener deletions of iova tree

2022-07-25 Thread Jason Wang
On Fri, Jul 22, 2022 at 4:26 PM Eugenio Pérez  wrote:
>
> vhost_vdpa_listener_region_del always deletes the first iova entry
> of the tree, since it uses the needle iova instead of the result's
> one.
>
> This was detected using a vga virtual device in the VM using vdpa SVQ.
> That device does some extra memory mapping and unmapping, so the wrong
> entry was mapped / unmapped. This went undetected before since all the
> memory was mapped and unmapped as a whole without that device, but
> other conditions could trigger it too:
>
> * mem_region was left with .iova = 0, .translated_addr = (correct GPA).
> * iova_tree_find_iova returned the right result, but did not update
>   mem_region.
> * iova_tree_remove always removed the region with .iova = 0. The right
>   iova was sent to the device.
> * The next map will fill the first region with .iova = 0, causing a
>   mapping with the same iova, and the device complains, if the next
>   action is a map.
> * The next unmap will try to unmap iova = 0 again, causing the
>   device to complain that no region was mapped at iova = 0.
>
> Fixes: 34e3c94edaef ("vdpa: Add custom IOTLB translations to SVQ")
> Reported-by: Lei Yang 
> Signed-off-by: Eugenio Pérez 

Acked-by: Jason Wang 

> ---
>  hw/virtio/vhost-vdpa.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 291cd19054..00e990ea40 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -290,7 +290,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener 
> *listener,
>
>  result = vhost_iova_tree_find_iova(v->iova_tree, &mem_region);
>  iova = result->iova;
> -vhost_iova_tree_remove(v->iova_tree, &mem_region);
> +vhost_iova_tree_remove(v->iova_tree, result);
>  }
>  vhost_vdpa_iotlb_batch_begin_once(v);
>  ret = vhost_vdpa_dma_unmap(v, iova, int128_get64(llsize));
> --
> 2.31.1
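
The underlying pattern (search with a stack-allocated needle, then
operate on the entry the tree actually returned) looks like this in
isolation; the variable names are hypothetical and the DMAMap fields
follow the patch above:

    /* The needle only carries the search key; the tree hands back
     * the stored entry, and that entry is what must be removed. */
    DMAMap needle = { .translated_addr = gpa, .size = size - 1 };
    const DMAMap *found = vhost_iova_tree_find_iova(tree, &needle);
    if (found) {
        vhost_iova_tree_remove(tree, found); /* not &needle */
    }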
>




Re: [PATCH v7 01/92] target/arm: Add ID_AA64ZFR0 fields and isar_feature_aa64_sve2

2022-07-25 Thread Zenghui Yu via

Hi Richard,

On 2021/5/25 9:02, Richard Henderson wrote:

Will be used for SVE2 isa subset enablement.

Reviewed-by: Alex Bennée 
Signed-off-by: Richard Henderson 
---
v2: Do not read zfr0 from kvm unless sve is available.
v7: Move zfr0 read inside existing sve_enabled block.


[...]


diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index dff85f6db9..37ceadd9a9 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -647,17 +647,26 @@ bool kvm_arm_get_host_cpu_features(ARMHostCPUFeatures 
*ahcf)
 
 sve_supported = ioctl(fdarray[0], KVM_CHECK_EXTENSION, KVM_CAP_ARM_SVE) > 0;
 
-kvm_arm_destroy_scratch_host_vcpu(fdarray);

-
-if (err < 0) {
-return false;
-}
-
 /* Add feature bits that can't appear until after VCPU init. */
 if (sve_supported) {
 t = ahcf->isar.id_aa64pfr0;
 t = FIELD_DP64(t, ID_AA64PFR0, SVE, 1);
 ahcf->isar.id_aa64pfr0 = t;
+
+/*
+ * Before v5.1, KVM did not support SVE and did not expose
+ * ID_AA64ZFR0_EL1 even as RAZ.  After v5.1, KVM still does
+ * not expose the register to "user" requests like this
+ * unless the host supports SVE.
+ */
+err |= read_sys_reg64(fdarray[2], &ahcf->isar.id_aa64zfr0,
+  ARM64_SYS_REG(3, 0, 0, 4, 4));


If I read it correctly, we haven't yet enabled SVE for the scratch vcpu
(using the KVM_ARM_VCPU_INIT ioctl with KVM_ARM_VCPU_SVE). KVM will
therefore expose ID_AA64ZFR0_EL1 to userspace as RAZ at this point and
isar.id_aa64zfr0 is reset to 0. I wonder if it was intentional?

Thanks,
Zenghui



Re: [PATCH v3 8/8] arm/aspeed: Replace mx25l25635e chip model

2022-07-25 Thread Andrew Jeffery



On Mon, 25 Jul 2022, at 16:02, Cédric Le Goater wrote:
> On 7/25/22 04:08, Andrew Jeffery wrote:
>> 
>> 
>> On Fri, 22 Jul 2022, at 16:06, Cédric Le Goater wrote:
>>> aspeed_machine_witherspoon_class_init(ObjectClass *oc, void *data)
>>>   mc->desc   = "OpenPOWER Witherspoon BMC (ARM1176)";
>>>   amc->soc_name  = "ast2500-a1";
>>>   amc->hw_strap1 = WITHERSPOON_BMC_HW_STRAP1;
>>> -amc->fmc_model = "mx25l25635e";
>>> +amc->fmc_model = "mx25l25635f";
>> 
>> The witherspoon I checked also reported mx25l25635e in dmesg for the
>> FMC.
>> 
>> You do say "generally" in the commit message though.
>
> You cannot tell from dmesg.
>
> The MX25L25635F and MX25L25635E models share the same JEDEC ID and
> the spi-nor flash device table only contains an mx25l25635e entry.
> The MX25L25635F is detected in mx25l25635_post_bfpt_fixups using SFDP.
>
> That's why I added this warning:
>
>
> https://github.com/legoater/linux/commit/934f0708ac597022cbf6c8d6f2ce91d55025e943
>

Oh righto, sorry for the noise.



Re: [PATCH v3 8/8] arm/aspeed: Replace mx25l25635e chip model

2022-07-25 Thread Cédric Le Goater

On 7/25/22 04:08, Andrew Jeffery wrote:



On Fri, 22 Jul 2022, at 16:06, Cédric Le Goater wrote:

An mx25l25635f chip model is generally found on these machines. It's
newer and uses 4B opcodes, which better exercises the support in
the Linux kernel.

Signed-off-by: Cédric Le Goater 
---
  hw/arm/aspeed.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
index 1c611284819d..7e95abc55b09 100644
--- a/hw/arm/aspeed.c
+++ b/hw/arm/aspeed.c
@@ -1157,7 +1157,7 @@ static void
aspeed_machine_palmetto_class_init(ObjectClass *oc, void *data)
  amc->soc_name  = "ast2400-a1";
  amc->hw_strap1 = PALMETTO_BMC_HW_STRAP1;
  amc->fmc_model = "n25q256a";
-amc->spi_model = "mx25l25635e";
+amc->spi_model = "mx25l25635f";


Hmm, dmesg reported mx25l25635e on the palmetto I checked


  amc->num_cs= 1;
  amc->i2c_init  = palmetto_bmc_i2c_init;
  mc->default_ram_size   = 256 * MiB;
@@ -1208,7 +1208,7 @@ static void
aspeed_machine_ast2500_evb_class_init(ObjectClass *oc, void *data)
  amc->soc_name  = "ast2500-a1";
  amc->hw_strap1 = AST2500_EVB_HW_STRAP1;
  amc->fmc_model = "mx25l25635e";
-amc->spi_model = "mx25l25635e";
+amc->spi_model = "mx25l25635f";
  amc->num_cs= 1;
  amc->i2c_init  = ast2500_evb_i2c_init;
  mc->default_ram_size   = 512 * MiB;
@@ -1258,7 +1258,7 @@ static void
aspeed_machine_witherspoon_class_init(ObjectClass *oc, void *data)
  mc->desc   = "OpenPOWER Witherspoon BMC (ARM1176)";
  amc->soc_name  = "ast2500-a1";
  amc->hw_strap1 = WITHERSPOON_BMC_HW_STRAP1;
-amc->fmc_model = "mx25l25635e";
+amc->fmc_model = "mx25l25635f";


The witherspoon I checked also reported mx25l25635e in dmesg for the
FMC.

You do say "generally" in the commit message though.


You cannot tell from dmesg.

The MX25L25635F and MX25L25635E models share the same JEDEC ID and
the spi-nor flash device table only contains an mx25l25635e entry.
The MX25L25635F is detected in mx25l25635_post_bfpt_fixups using SFDP.

That's why I added this warning:

  
https://github.com/legoater/linux/commit/934f0708ac597022cbf6c8d6f2ce91d55025e943


C.



Andrew




