Re: [PATCH] iotests: Fix unsupported_imgopts for refcount_bits

2021-02-10 Thread Max Reitz

On 09.02.21 19:49, Eric Blake wrote:

On 2/9/21 12:27 PM, Max Reitz wrote:

Many _unsupported_imgopts lines for refcount_bits values use something
like "refcount_bits=1[^0-9]" to forbid only "refcount_bits=1" (so that,
e.g., "refcount_bits=16" remains allowed).

That does not work when $IMGOPTS does not have any entry past the
refcount_bits option, which has become apparent with the "check" script
rewrite.

Use \b instead of [^0-9] to check for a word boundary, which is what we
really want.


\b is a Linux-ism (that is, glibc supports it, but BSD libc does not).

https://mail-index.netbsd.org/tech-userlevel/2012/12/02/msg006954.html


:(



Signed-off-by: Max Reitz 
---
Reproducible with:
$ ./check -qcow2 -o refcount_bits=1
(The tests touched here should be skipped)

I don't know whether \b is portable.  I hope it is.
(This is why I CC-ed you, Eric.)


No, it's not portable.  \> and [[:>:]] are other spellings for the same
task, equally non-portable.



Then again, it appears that nobody ever runs the iotests with
refcount_bits=1 but me, and I do that on Linux.  So even if it isn't
portable, it shouldn't be an issue in practice... O:)


What exactly is failing?  Is it merely a case of our python script
running the regex against "${unsupported_imgopts}" instead of
"${unsupported_imgopts} " with an added trailing space to guarantee
that we have something to match against?


A bit of a hack, but one that indeed works, yes.  Thanks!

Max
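
Since POSIX ERE has no portable word-boundary escape, the trailing-space
trick discussed above can be demonstrated with plain regcomp()/regexec().
A minimal standalone sketch (the helper name and buffer size are made up
for illustration; this is not the actual iotests code):

  #include <regex.h>
  #include <stdbool.h>
  #include <stdio.h>

  static bool imgopts_unsupported(const char *imgopts, const char *pattern)
  {
      char padded[256];
      regex_t re;
      bool match;

      /* Append a trailing space so a pattern ending in [^0-9] can
       * match even when refcount_bits is the last option. */
      snprintf(padded, sizeof(padded), "%s ", imgopts);

      if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
          return false;
      }
      match = regexec(&re, padded, 0, NULL, 0) == 0;
      regfree(&re);
      return match;
  }

  int main(void)
  {
      /* Prints 1: matches thanks to the appended space -> test skipped. */
      printf("%d\n", imgopts_unsupported("compat=1.1,refcount_bits=1",
                                         "refcount_bits=1[^0-9]"));
      /* Prints 0: refcount_bits=16 remains supported. */
      printf("%d\n", imgopts_unsupported("refcount_bits=16",
                                         "refcount_bits=1[^0-9]"));
      return 0;
  }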




[PULL v4 00/27] Block patches

2021-02-10 Thread Stefan Hajnoczi
The following changes since commit 1214d55d1c41fbab3a9973a05085b8760647e411:

  Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into 
staging (2021-02-09 13:24:37 +)

are available in the Git repository at:

  https://gitlab.com/stefanha/qemu.git tags/block-pull-request

for you to fetch changes up to eb847c42296497978942f738cd41dc29a35a49b2:

  docs: fix Parallels Image "dirty bitmap" section (2021-02-10 09:23:28 +)


Pull request

v4:
 * Add PCI_EXPRESS Kconfig dependency to fix s390x in "multi-process: setup PCI
   host bridge for remote device" [Philippe and Thomas]



Denis V. Lunev (1):
  docs: fix Parallels Image "dirty bitmap" section

Elena Ufimtseva (8):
  multi-process: add configure and usage information
  io: add qio_channel_writev_full_all helper
  io: add qio_channel_readv_full_all_eof & qio_channel_readv_full_all
helpers
  multi-process: define MPQemuMsg format and transmission functions
  multi-process: introduce proxy object
  multi-process: add proxy communication functions
  multi-process: Forward PCI config space accesses to the remote process
  multi-process: perform device reset in the remote process

Jagannathan Raman (11):
  memory: alloc RAM from file at offset
  multi-process: Add config option for multi-process QEMU
  multi-process: setup PCI host bridge for remote device
  multi-process: setup a machine object for remote device process
  multi-process: Initialize message handler in remote device
  multi-process: Associate fd of a PCIDevice with its object
  multi-process: setup memory manager for remote device
  multi-process: PCI BAR read/write handling for proxy & remote
endpoints
  multi-process: Synchronize remote memory
  multi-process: create IOHUB object to handle irq
  multi-process: Retrieve PCI info from remote process

John G Johnson (1):
  multi-process: add the concept description to
docs/devel/qemu-multiprocess

Stefan Hajnoczi (6):
  .github: point Repo Lockdown bot to GitLab repo
  gitmodules: use GitLab repos instead of qemu.org
  gitlab-ci: remove redundant GitLab repo URL command
  docs: update README to use GitLab repo URLs
  pc-bios: update mirror URLs to GitLab
  get_maintainer: update repo URL to GitLab

 MAINTAINERS   |  24 +
 README.rst|   4 +-
 docs/devel/index.rst  |   1 +
 docs/devel/multi-process.rst  | 966 ++
 docs/system/index.rst |   1 +
 docs/system/multi-process.rst |  64 ++
 docs/interop/parallels.txt|   2 +-
 configure |  10 +
 meson.build   |   5 +-
 hw/remote/trace.h |   1 +
 include/exec/memory.h |   2 +
 include/exec/ram_addr.h   |   4 +-
 include/hw/pci-host/remote.h  |  30 +
 include/hw/pci/pci_ids.h  |   3 +
 include/hw/remote/iohub.h |  42 +
 include/hw/remote/machine.h   |  38 +
 include/hw/remote/memory.h|  19 +
 include/hw/remote/mpqemu-link.h   |  99 +++
 include/hw/remote/proxy-memory-listener.h |  28 +
 include/hw/remote/proxy.h |  48 ++
 include/io/channel.h  |  78 ++
 include/qemu/mmap-alloc.h |   4 +-
 include/sysemu/iothread.h |   6 +
 backends/hostmem-memfd.c  |   2 +-
 hw/misc/ivshmem.c |   3 +-
 hw/pci-host/remote.c  |  75 ++
 hw/remote/iohub.c | 119 +++
 hw/remote/machine.c   |  80 ++
 hw/remote/memory.c|  65 ++
 hw/remote/message.c   | 230 ++
 hw/remote/mpqemu-link.c   | 267 ++
 hw/remote/proxy-memory-listener.c | 227 +
 hw/remote/proxy.c | 379 +
 hw/remote/remote-obj.c| 203 +
 io/channel.c  | 116 ++-
 iothread.c|   6 +
 softmmu/memory.c  |   3 +-
 softmmu/physmem.c |  12 +-
 util/mmap-alloc.c |   8 +-
 util/oslib-posix.c|   2 +-
 .github/lockdown.yml  |   8 +-
 .gitlab-ci.yml|   1 -
 .gitmodules   |  44 +-
 Kconfig.host  |   4 +
 hw/Kconfig|   1 +
 hw/meson.build|   1 +
 hw/pci-host/Kconfig   |   3 +
 hw/pci-host/meson.build   |   1 +
 hw/remote/Kconfig |   4 +
 hw/remote/meson.build |  13 +
 hw/remote/t

[PULL v4 01/27] .github: point Repo Lockdown bot to GitLab repo

2021-02-10 Thread Stefan Hajnoczi
Use the GitLab repo URL as the main repo location in order to reduce
load on qemu.org.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Wainer dos Santos Moschetta 
Reviewed-by: Thomas Huth 
Message-id: 2021015017.156802-2-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 .github/lockdown.yml | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/.github/lockdown.yml b/.github/lockdown.yml
index 9acc393f1c..07fc2f31ee 100644
--- a/.github/lockdown.yml
+++ b/.github/lockdown.yml
@@ -10,8 +10,8 @@ issues:
   comment: |
 Thank you for your interest in the QEMU project.
 
-This repository is a read-only mirror of the project's master
-repostories hosted on https://git.qemu.org/git/qemu.git.
+This repository is a read-only mirror of the project's repostories hosted
+at https://gitlab.com/qemu-project/qemu.git.
 The project does not process issues filed on GitHub.
 
 The project issues are tracked on Launchpad:
@@ -24,8 +24,8 @@ pulls:
   comment: |
 Thank you for your interest in the QEMU project.
 
-This repository is a read-only mirror of the project's master
-repostories hosted on https://git.qemu.org/git/qemu.git.
+This repository is a read-only mirror of the project's repostories hosted
+on https://gitlab.com/qemu-project/qemu.git.
 The project does not process merge requests filed on GitHub.
 
 QEMU welcomes contributions of code (either fixing bugs or adding new
-- 
2.29.2



[PULL v4 02/27] gitmodules: use GitLab repos instead of qemu.org

2021-02-10 Thread Stefan Hajnoczi
qemu.org is running out of bandwidth and the QEMU project is moving
towards a gating CI on GitLab. Use the GitLab repos instead of qemu.org
(they will become mirrors).

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Wainer dos Santos Moschetta 
Reviewed-by: Thomas Huth 
Reviewed-by: Philippe Mathieu-Daudé 
Message-id: 2021015017.156802-3-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 .gitmodules | 44 ++--
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/.gitmodules b/.gitmodules
index 2bdeeacef8..08b1b48a09 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,66 +1,66 @@
 [submodule "roms/seabios"]
path = roms/seabios
-   url = https://git.qemu.org/git/seabios.git/
+   url = https://gitlab.com/qemu-project/seabios.git/
 [submodule "roms/SLOF"]
path = roms/SLOF
-   url = https://git.qemu.org/git/SLOF.git
+   url = https://gitlab.com/qemu-project/SLOF.git
 [submodule "roms/ipxe"]
path = roms/ipxe
-   url = https://git.qemu.org/git/ipxe.git
+   url = https://gitlab.com/qemu-project/ipxe.git
 [submodule "roms/openbios"]
path = roms/openbios
-   url = https://git.qemu.org/git/openbios.git
+   url = https://gitlab.com/qemu-project/openbios.git
 [submodule "roms/qemu-palcode"]
path = roms/qemu-palcode
-   url = https://git.qemu.org/git/qemu-palcode.git
+   url = https://gitlab.com/qemu-project/qemu-palcode.git
 [submodule "roms/sgabios"]
path = roms/sgabios
-   url = https://git.qemu.org/git/sgabios.git
+   url = https://gitlab.com/qemu-project/sgabios.git
 [submodule "dtc"]
path = dtc
-   url = https://git.qemu.org/git/dtc.git
+   url = https://gitlab.com/qemu-project/dtc.git
 [submodule "roms/u-boot"]
path = roms/u-boot
-   url = https://git.qemu.org/git/u-boot.git
+   url = https://gitlab.com/qemu-project/u-boot.git
 [submodule "roms/skiboot"]
path = roms/skiboot
-   url = https://git.qemu.org/git/skiboot.git
+   url = https://gitlab.com/qemu-project/skiboot.git
 [submodule "roms/QemuMacDrivers"]
path = roms/QemuMacDrivers
-   url = https://git.qemu.org/git/QemuMacDrivers.git
+   url = https://gitlab.com/qemu-project/QemuMacDrivers.git
 [submodule "ui/keycodemapdb"]
path = ui/keycodemapdb
-   url = https://git.qemu.org/git/keycodemapdb.git
+   url = https://gitlab.com/qemu-project/keycodemapdb.git
 [submodule "capstone"]
path = capstone
-   url = https://git.qemu.org/git/capstone.git
+   url = https://gitlab.com/qemu-project/capstone.git
 [submodule "roms/seabios-hppa"]
path = roms/seabios-hppa
-   url = https://git.qemu.org/git/seabios-hppa.git
+   url = https://gitlab.com/qemu-project/seabios-hppa.git
 [submodule "roms/u-boot-sam460ex"]
path = roms/u-boot-sam460ex
-   url = https://git.qemu.org/git/u-boot-sam460ex.git
+   url = https://gitlab.com/qemu-project/u-boot-sam460ex.git
 [submodule "tests/fp/berkeley-testfloat-3"]
path = tests/fp/berkeley-testfloat-3
-   url = https://git.qemu.org/git/berkeley-testfloat-3.git
+   url = https://gitlab.com/qemu-project/berkeley-testfloat-3.git
 [submodule "tests/fp/berkeley-softfloat-3"]
path = tests/fp/berkeley-softfloat-3
-   url = https://git.qemu.org/git/berkeley-softfloat-3.git
+   url = https://gitlab.com/qemu-project/berkeley-softfloat-3.git
 [submodule "roms/edk2"]
path = roms/edk2
-   url = https://git.qemu.org/git/edk2.git
+   url = https://gitlab.com/qemu-project/edk2.git
 [submodule "slirp"]
path = slirp
-   url = https://git.qemu.org/git/libslirp.git
+   url = https://gitlab.com/qemu-project/libslirp.git
 [submodule "roms/opensbi"]
path = roms/opensbi
-   url =   https://git.qemu.org/git/opensbi.git
+   url =   https://gitlab.com/qemu-project/opensbi.git
 [submodule "roms/qboot"]
path = roms/qboot
-   url = https://git.qemu.org/git/qboot.git
+   url = https://gitlab.com/qemu-project/qboot.git
 [submodule "meson"]
path = meson
-   url = https://git.qemu.org/git/meson.git
+   url = https://gitlab.com/qemu-project/meson.git
 [submodule "roms/vbootrom"]
path = roms/vbootrom
-   url = https://git.qemu.org/git/vbootrom.git
+   url = https://gitlab.com/qemu-project/vbootrom.git
-- 
2.29.2
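
A practical note for existing checkouts (standard git behavior, not part
of this patch): editing .gitmodules does not update the URLs recorded in
already-initialized submodules, so a re-sync is needed:

  $ git submodule sync --recursive
  $ git submodule update --init --recursive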



[PULL v4 05/27] pc-bios: update mirror URLs to GitLab

2021-02-10 Thread Stefan Hajnoczi
qemu.org is running out of bandwidth and the QEMU project is moving
towards a gating CI on GitLab. Use the GitLab repos instead of qemu.org
(they will become mirrors).

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Wainer dos Santos Moschetta 
Reviewed-by: Thomas Huth 
Reviewed-by: Philippe Mathieu-Daudé 
Message-id: 2021015017.156802-6-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 pc-bios/README | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/pc-bios/README b/pc-bios/README
index 33f9754ad3..db7129ef64 100644
--- a/pc-bios/README
+++ b/pc-bios/README
@@ -20,7 +20,7 @@
   legacy x86 software to communicate with an attached serial console as
   if a video card were attached.  The master sources reside in a subversion
   repository at http://sgabios.googlecode.com/svn/trunk.  A git mirror is
-  available at https://git.qemu.org/git/sgabios.git.
+  available at https://gitlab.com/qemu-project/sgabios.git.
 
 - The PXE roms come from the iPXE project. Built with BANNER_TIME 0.
   Sources available at http://ipxe.org.  Vendor:Device ID -> ROM mapping:
@@ -37,7 +37,7 @@
 
 - The u-boot binary for e500 comes from the upstream denx u-boot project where
   it was compiled using the qemu-ppce500 target.
-  A git mirror is available at: https://git.qemu.org/git/u-boot.git
+  A git mirror is available at: https://gitlab.com/qemu-project/u-boot.git
   The hash used to compile the current version is: 2072e72
 
 - Skiboot (https://github.com/open-power/skiboot/) is an OPAL
-- 
2.29.2



[PULL v4 08/27] multi-process: add configure and usage information

2021-02-10 Thread Stefan Hajnoczi
From: Elena Ufimtseva 

Adds documentation explaining the command-line arguments needed
to use multi-process.

Signed-off-by: Elena Ufimtseva 
Signed-off-by: Jagannathan Raman 
Signed-off-by: John G Johnson 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
49f757a84e5dd6fae14b22544897d1124c5fdbad.1611938319.git.jag.ra...@oracle.com

[Move orphan docs/multi-process.rst document into docs/system/ and add
it to index.rst to prevent Sphinx "document isn't included in any
toctree" error.
--Stefan]

Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS   |  1 +
 docs/system/index.rst |  1 +
 docs/system/multi-process.rst | 64 +++
 3 files changed, 66 insertions(+)
 create mode 100644 docs/system/multi-process.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index ddff8d25e8..1658397762 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3206,6 +3206,7 @@ M: Jagannathan Raman 
 M: John G Johnson 
 S: Maintained
 F: docs/devel/multi-process.rst
+F: docs/system/multi-process.rst
 
 Build and test automation
 -
diff --git a/docs/system/index.rst b/docs/system/index.rst
index d40f72c92b..625b494372 100644
--- a/docs/system/index.rst
+++ b/docs/system/index.rst
@@ -34,6 +34,7 @@ Contents:
pr-manager
targets
security
+   multi-process
deprecated
removed-features
build-platforms
diff --git a/docs/system/multi-process.rst b/docs/system/multi-process.rst
new file mode 100644
index 00..46bb0cafc2
--- /dev/null
+++ b/docs/system/multi-process.rst
@@ -0,0 +1,64 @@
+Multi-process QEMU
+==
+
+This document describes how to configure and use multi-process qemu.
+For the design document refer to docs/devel/qemu-multiprocess.
+
+1) Configuration
+
+
+multi-process is enabled by default for targets that enable KVM
+
+
+2) Usage
+
+
+Multi-process QEMU requires an orchestrator to launch.
+
+Following is a description of command-line used to launch mpqemu.
+
+* Orchestrator:
+
+  - The Orchestrator creates a unix socketpair
+
+  - It launches the remote process and passes one of the
+sockets to it via command-line.
+
+  - It then launches QEMU and specifies the other socket as an option
+to the Proxy device object
+
+* Remote Process:
+
+  - QEMU can enter remote process mode by using the "remote" machine
+option.
+
+  - The orchestrator creates a "remote-object" with details about
+the device and the file descriptor for the device
+
+  - The remaining options are no different from how one launches QEMU with
+devices.
+
+  - Example command-line for the remote process is as follows:
+
+  /usr/bin/qemu-system-x86_64\
+  -machine x-remote  \
+  -device lsi53c895a,id=lsi0 \
+  -drive id=drive_image2,file=/build/ol7-nvme-test-1.qcow2   \
+  -device scsi-hd,id=drive2,drive=drive_image2,bus=lsi0.0,scsi-id=0  \
+  -object x-remote-object,id=robj1,devid=lsi1,fd=4,
+
+* QEMU:
+
+  - Since parts of the RAM are shared between QEMU & remote process, a
+memory-backend-memfd is required to facilitate this, as follows:
+
+-object memory-backend-memfd,id=mem,size=2G
+
+  - A "x-pci-proxy-dev" device is created for each of the PCI devices emulated
+in the remote process. A "socket" sub-option specifies the other end of
+unix channel created by orchestrator. The "id" sub-option must be specified
+and should be the same as the "id" specified for the remote PCI device
+
+  - Example commandline for QEMU is as follows:
+
+  -device x-pci-proxy-dev,id=lsi0,socket=3
-- 
2.29.2
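
To make the orchestrator's role concrete, here is a minimal standalone C
sketch of the socketpair-and-launch sequence described above. The device,
IDs and options follow the documentation's example; the missing error
handling and the exec-into-QEMU shortcut are simplifications, not how a
production orchestrator would be written:

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/socket.h>

  int main(void)
  {
      int sv[2];
      char opt[64];

      /* One end for the remote process, one for QEMU's proxy device. */
      if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
          perror("socketpair");
          return 1;
      }

      if (fork() == 0) {
          /* Remote process: receives its socket via the fd= sub-option. */
          close(sv[1]);
          snprintf(opt, sizeof(opt),
                   "x-remote-object,id=robj1,devid=lsi0,fd=%d", sv[0]);
          execlp("qemu-system-x86_64", "qemu-system-x86_64",
                 "-machine", "x-remote",
                 "-device", "lsi53c895a,id=lsi0",
                 "-object", opt, (char *)NULL);
          _exit(1);
      }

      /* QEMU proper: the proxy device takes the other end via socket=. */
      close(sv[0]);
      snprintf(opt, sizeof(opt), "x-pci-proxy-dev,id=lsi0,socket=%d", sv[1]);
      execlp("qemu-system-x86_64", "qemu-system-x86_64",
             "-object", "memory-backend-memfd,id=mem,size=2G",
             "-device", opt, (char *)NULL);
      _exit(1);
  }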



[PULL v4 03/27] gitlab-ci: remove redundant GitLab repo URL command

2021-02-10 Thread Stefan Hajnoczi
It is no longer necessary to point .gitmodules at GitLab repos when
running in GitLab CI since they are now used all the time.

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Wainer dos Santos Moschetta 
Reviewed-by: Thomas Huth 
Reviewed-by: Philippe Mathieu-Daudé 
Message-id: 2021015017.156802-4-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 .gitlab-ci.yml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
index 7c0db64710..28a83afb91 100644
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
@@ -18,7 +18,6 @@ include:
   image: $CI_REGISTRY_IMAGE/qemu/$IMAGE:latest
   before_script:
 - JOBS=$(expr $(nproc) + 1)
-- sed -i s,git.qemu.org/git,gitlab.com/qemu-project, .gitmodules
   script:
 - mkdir build
 - cd build
-- 
2.29.2



[PULL v4 04/27] docs: update README to use GitLab repo URLs

2021-02-10 Thread Stefan Hajnoczi
qemu.org is running out of bandwidth and the QEMU project is moving
towards a gating CI on GitLab. Use the GitLab repos instead of qemu.org
(they will become mirrors).

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Wainer dos Santos Moschetta 
Reviewed-by: Thomas Huth 
Reviewed-by: Philippe Mathieu-Daudé 
Message-id: 2021015017.156802-5-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 README.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.rst b/README.rst
index 58b9f2dc15..ce39d89077 100644
--- a/README.rst
+++ b/README.rst
@@ -60,7 +60,7 @@ The QEMU source code is maintained under the GIT version 
control system.
 
 .. code-block:: shell
 
-   git clone https://git.qemu.org/git/qemu.git
+   git clone https://gitlab.com/qemu-project/qemu.git
 
 When submitting patches, one common approach is to use 'git
 format-patch' and/or 'git send-email' to format & send the mail to the
@@ -78,7 +78,7 @@ The QEMU website is also maintained under source control.
 
 .. code-block:: shell
 
-  git clone https://git.qemu.org/git/qemu-web.git
+  git clone https://gitlab.com/qemu-project/qemu-web.git
 
 * ``_
 
-- 
2.29.2



[PULL v4 07/27] multi-process: add the concept description to docs/devel/qemu-multiprocess

2021-02-10 Thread Stefan Hajnoczi
From: John G Johnson 

Signed-off-by: John G Johnson 
Signed-off-by: Elena Ufimtseva 
Signed-off-by: Jagannathan Raman 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
02a68adef99f5df6a380bf8fd7b90948777e411c.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS  |   7 +
 docs/devel/index.rst |   1 +
 docs/devel/multi-process.rst | 966 +++
 3 files changed, 974 insertions(+)
 create mode 100644 docs/devel/multi-process.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 06635ba81a..ddff8d25e8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3200,6 +3200,13 @@ S: Maintained
 F: hw/semihosting/
 F: include/hw/semihosting/
 
+Multi-process QEMU
+M: Elena Ufimtseva 
+M: Jagannathan Raman 
+M: John G Johnson 
+S: Maintained
+F: docs/devel/multi-process.rst
+
 Build and test automation
 -
 Build and test automation
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index 98a7016a9b..22854e334d 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -37,3 +37,4 @@ Contents:
clocks
qom
block-coroutine-wrapper
+   multi-process
diff --git a/docs/devel/multi-process.rst b/docs/devel/multi-process.rst
new file mode 100644
index 00..69699329d6
--- /dev/null
+++ b/docs/devel/multi-process.rst
@@ -0,0 +1,966 @@
+This is the design document for multi-process QEMU. It does not
+necessarily reflect the status of the current implementation, which
+may lack features or be considerably different from what is described
+in this document. This document is still useful as a description of
+the goals and general direction of this feature.
+
+Please refer to the following wiki for latest details:
+https://wiki.qemu.org/Features/MultiProcessQEMU
+
+Multi-process QEMU
+===
+
+QEMU is often used as the hypervisor for virtual machines running in the
+Oracle cloud. Since one of the advantages of cloud computing is the
+ability to run many VMs from different tenants in the same cloud
+infrastructure, a guest that compromised its hypervisor could
+potentially use the hypervisor's access privileges to access data it is
+not authorized for.
+
+QEMU can be susceptible to security attacks because it is a large,
+monolithic program that provides many features to the VMs it services.
+Many of these features can be configured out of QEMU, but even a reduced
+configuration QEMU has a large amount of code a guest can potentially
+attack. Separating QEMU reduces the attack surface by aiding to
+limit each component in the system to only access the resources that
+it needs to perform its job.
+
+QEMU services
+-
+
+QEMU can be broadly described as providing three main services. One is a
+VM control point, where VMs can be created, migrated, re-configured, and
+destroyed. A second is to emulate the CPU instructions within the VM,
+often accelerated by HW virtualization features such as Intel's VT
+extensions. Finally, it provides IO services to the VM by emulating HW
+IO devices, such as disk and network devices.
+
+A multi-process QEMU
+
+
+A multi-process QEMU involves separating QEMU services into separate
+host processes. Each of these processes can be given only the privileges
+it needs to provide its service, e.g., a disk service could be given
+access only to the disk images it provides, and not be allowed to
+access other files, or any network devices. An attacker who compromised
+this service would not be able to use this exploit to access files or
+devices beyond what the disk service was given access to.
+
+A QEMU control process would remain, but in multi-process mode, will
+have no direct interfaces to the VM. During VM execution, it would still
+provide the user interface to hot-plug devices or live migrate the VM.
+
+A first step in creating a multi-process QEMU is to separate IO services
+from the main QEMU program, which would continue to provide CPU
+emulation. i.e., the control process would also be the CPU emulation
+process. In a later phase, CPU emulation could be separated from the
+control process.
+
+Separating IO services
+--
+
+Separating IO services into individual host processes is a good place to
+begin for a couple of reasons. One is the sheer number of IO devices QEMU
+can emulate provides a large surface of interfaces which could potentially
+be exploited, and, indeed, have been a source of exploits in the past.
+Another is the modular nature of QEMU device emulation code provides
+interface points where the QEMU functions that perform device emulation
+can be separated from the QEMU functions that manage the emulation of
+guest CPU instructions. The devices emulated in the separate process are
+referred to as remote devices.
+
+QEMU device emulation
+~
+
+QEMU uses an object oriented SW architecture for device emulation code.
+Configured objects are all compiled into the QEMU binary, then objects

[PULL v4 12/27] multi-process: setup a machine object for remote device process

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

The x-remote-machine object sets up the various subsystems of the remote
device process. It instantiates the PCI host bridge object and initializes
the RAM, IO & PCI memory regions.

Signed-off-by: John G Johnson 
Signed-off-by: Jagannathan Raman 
Signed-off-by: Elena Ufimtseva 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
c537f38d17f90453ca610c6b70cf3480274e0ba1.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS  |  2 ++
 include/hw/pci-host/remote.h |  1 +
 include/hw/remote/machine.h  | 27 ++
 hw/remote/machine.c  | 70 
 hw/meson.build   |  1 +
 hw/remote/meson.build|  5 +++
 6 files changed, 106 insertions(+)
 create mode 100644 include/hw/remote/machine.h
 create mode 100644 hw/remote/machine.c
 create mode 100644 hw/remote/meson.build

diff --git a/MAINTAINERS b/MAINTAINERS
index 4a19e20815..aad849196c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3209,6 +3209,8 @@ F: docs/devel/multi-process.rst
 F: docs/system/multi-process.rst
 F: hw/pci-host/remote.c
 F: include/hw/pci-host/remote.h
+F: hw/remote/machine.c
+F: include/hw/remote/machine.h
 
 Build and test automation
 -
diff --git a/include/hw/pci-host/remote.h b/include/hw/pci-host/remote.h
index 06b8a83a4b..3dcf6aa51d 100644
--- a/include/hw/pci-host/remote.h
+++ b/include/hw/pci-host/remote.h
@@ -24,6 +24,7 @@ struct RemotePCIHost {
 
 MemoryRegion *mr_pci_mem;
 MemoryRegion *mr_sys_io;
+MemoryRegion *mr_sys_mem;
 };
 
 #endif
diff --git a/include/hw/remote/machine.h b/include/hw/remote/machine.h
new file mode 100644
index 00..bdfbca40b9
--- /dev/null
+++ b/include/hw/remote/machine.h
@@ -0,0 +1,27 @@
+/*
+ * Remote machine configuration
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef REMOTE_MACHINE_H
+#define REMOTE_MACHINE_H
+
+#include "qom/object.h"
+#include "hw/boards.h"
+#include "hw/pci-host/remote.h"
+
+struct RemoteMachineState {
+MachineState parent_obj;
+
+RemotePCIHost *host;
+};
+
+#define TYPE_REMOTE_MACHINE "x-remote-machine"
+OBJECT_DECLARE_SIMPLE_TYPE(RemoteMachineState, REMOTE_MACHINE)
+
+#endif
diff --git a/hw/remote/machine.c b/hw/remote/machine.c
new file mode 100644
index 00..9519a6c0a4
--- /dev/null
+++ b/hw/remote/machine.c
@@ -0,0 +1,70 @@
+/*
+ * Machine for remote device
+ *
+ *  This machine type is used by the remote device process in multi-process
+ *  QEMU. QEMU device models depend on parent busses, interrupt controllers,
+ *  memory regions, etc. The remote machine type offers this environment so
+ *  that QEMU device models can be used as remote devices.
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/remote/machine.h"
+#include "exec/address-spaces.h"
+#include "exec/memory.h"
+#include "qapi/error.h"
+
+static void remote_machine_init(MachineState *machine)
+{
+MemoryRegion *system_memory, *system_io, *pci_memory;
+RemoteMachineState *s = REMOTE_MACHINE(machine);
+RemotePCIHost *rem_host;
+
+system_memory = get_system_memory();
+system_io = get_system_io();
+
+pci_memory = g_new(MemoryRegion, 1);
+memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
+
+rem_host = REMOTE_PCIHOST(qdev_new(TYPE_REMOTE_PCIHOST));
+
+rem_host->mr_pci_mem = pci_memory;
+rem_host->mr_sys_mem = system_memory;
+rem_host->mr_sys_io = system_io;
+
+s->host = rem_host;
+
+object_property_add_child(OBJECT(s), "remote-pcihost", OBJECT(rem_host));
+memory_region_add_subregion_overlap(system_memory, 0x0, pci_memory, -1);
+
+qdev_realize(DEVICE(rem_host), sysbus_get_default(), &error_fatal);
+}
+
+static void remote_machine_class_init(ObjectClass *oc, void *data)
+{
+MachineClass *mc = MACHINE_CLASS(oc);
+
+mc->init = remote_machine_init;
+mc->desc = "Experimental remote machine";
+}
+
+static const TypeInfo remote_machine = {
+.name = TYPE_REMOTE_MACHINE,
+.parent = TYPE_MACHINE,
+.instance_size = sizeof(RemoteMachineState),
+.class_init = remote_machine_class_init,
+};
+
+static void remote_machine_register_types(void)
+{
+type_register_static(&remote_machine);
+}
+
+type_init(remote_machine_register_types);
diff --git a/hw/meson.build b/hw/meson.build
index 010de7219c..e615d72d4d 100644
--- a/hw/meson.build
+++ b/hw/meson.build
@@ -56,6 +56,7 @@ subdir('moxie')
 subdir('nios2')
 subdir('openrisc')
 subdir('ppc')
+subdir('remote')
 subdir('riscv')
 subdir('rx')
 subdir('s390x')
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
new file mode 100644
index 00
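
As a usage note, the remote device process is started with this machine
type exactly as in the documentation patch earlier in this series:

  $ qemu-system-x86_64 -machine x-remote -device lsi53c895a,id=lsi0 ...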

[PULL v4 10/27] multi-process: Add config option for multi-process QEMU

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

Add configuration options to enable or disable multiprocess QEMU code

Signed-off-by: John G Johnson 
Signed-off-by: Jagannathan Raman 
Signed-off-by: Elena Ufimtseva 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
6cc37253e35418ebd7b675a31a3df6e3c7a12dc1.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 configure | 10 ++
 meson.build   |  4 +++-
 Kconfig.host  |  4 
 hw/Kconfig|  1 +
 hw/remote/Kconfig |  3 +++
 5 files changed, 21 insertions(+), 1 deletion(-)
 create mode 100644 hw/remote/Kconfig

diff --git a/configure b/configure
index 7c496d81fc..a79b3746d4 100755
--- a/configure
+++ b/configure
@@ -463,6 +463,7 @@ skip_meson=no
 gettext="auto"
 fuse="auto"
 fuse_lseek="auto"
+multiprocess="no"
 
 malloc_trim="auto"
 
@@ -797,6 +798,7 @@ Linux)
   linux="yes"
   linux_user="yes"
   vhost_user=${default_feature:-yes}
+  multiprocess=${default_feature:-yes}
 ;;
 esac
 
@@ -1556,6 +1558,10 @@ for opt do
   ;;
   --disable-fuse-lseek) fuse_lseek="disabled"
   ;;
+  --enable-multiprocess) multiprocess="yes"
+  ;;
+  --disable-multiprocess) multiprocess="no"
+  ;;
   *)
   echo "ERROR: unknown option $opt"
   echo "Try '$0 --help' for more information"
@@ -1908,6 +1914,7 @@ disabled with --disable-FEATURE, default is enabled if 
available
   libdaxctl   libdaxctl support
   fuseFUSE block device export
   fuse-lseek  SEEK_HOLE/SEEK_DATA support for FUSE exports
+  multiprocessMultiprocess QEMU support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -6082,6 +6089,9 @@ fi
 if test "$have_mlockall" = "yes" ; then
   echo "HAVE_MLOCKALL=y" >> $config_host_mak
 fi
+if test "$multiprocess" = "yes" ; then
+  echo "CONFIG_MULTIPROCESS_ALLOWED=y" >> $config_host_mak
+fi
 if test "$fuzzing" = "yes" ; then
   # If LIB_FUZZING_ENGINE is set, assume we are running on OSS-Fuzz, and the
   # needed CFLAGS have already been provided
diff --git a/meson.build b/meson.build
index e3ef660670..c8c07df735 100644
--- a/meson.build
+++ b/meson.build
@@ -1226,7 +1226,8 @@ host_kconfig = \
   ('CONFIG_VHOST_KERNEL' in config_host ? ['CONFIG_VHOST_KERNEL=y'] : []) + \
   (have_virtfs ? ['CONFIG_VIRTFS=y'] : []) + \
   ('CONFIG_LINUX' in config_host ? ['CONFIG_LINUX=y'] : []) + \
-  ('CONFIG_PVRDMA' in config_host ? ['CONFIG_PVRDMA=y'] : [])
+  ('CONFIG_PVRDMA' in config_host ? ['CONFIG_PVRDMA=y'] : []) + \
+  ('CONFIG_MULTIPROCESS_ALLOWED' in config_host ? 
['CONFIG_MULTIPROCESS_ALLOWED=y'] : [])
 
 ignored = [ 'TARGET_XML_FILES', 'TARGET_ABI_DIR', 'TARGET_ARCH' ]
 
@@ -2652,6 +2653,7 @@ summary_info += {'libpmem support':   
config_host.has_key('CONFIG_LIBPMEM')}
 summary_info += {'libdaxctl support': config_host.has_key('CONFIG_LIBDAXCTL')}
 summary_info += {'libudev':   libudev.found()}
 summary_info += {'FUSE lseek':fuse_lseek.found()}
+summary_info += {'Multiprocess QEMU': 
config_host.has_key('CONFIG_MULTIPROCESS_ALLOWED')}
 summary(summary_info, bool_yn: true, section: 'Dependencies')
 
 if not supported_cpus.contains(cpu)
diff --git a/Kconfig.host b/Kconfig.host
index a9a55a9c31..24255ef441 100644
--- a/Kconfig.host
+++ b/Kconfig.host
@@ -37,3 +37,7 @@ config VIRTFS
 
 config PVRDMA
 bool
+
+config MULTIPROCESS_ALLOWED
+bool
+imply MULTIPROCESS
diff --git a/hw/Kconfig b/hw/Kconfig
index d4cec9e476..8ea26479c4 100644
--- a/hw/Kconfig
+++ b/hw/Kconfig
@@ -27,6 +27,7 @@ source pci-host/Kconfig
 source pcmcia/Kconfig
 source pci/Kconfig
 source rdma/Kconfig
+source remote/Kconfig
 source rtc/Kconfig
 source scsi/Kconfig
 source sd/Kconfig
diff --git a/hw/remote/Kconfig b/hw/remote/Kconfig
new file mode 100644
index 00..54844467a0
--- /dev/null
+++ b/hw/remote/Kconfig
@@ -0,0 +1,3 @@
+config MULTIPROCESS
+bool
+depends on PCI && KVM
-- 
2.29.2
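
As a usage example of the options added above: multi-process support
defaults to enabled on Linux (via ${default_feature:-yes}) and can be
forced either way at configure time:

  $ ./configure --enable-multiprocess
  $ ./configure --disable-multiprocess

The resulting CONFIG_MULTIPROCESS_ALLOWED key in config-host.mak is what
gates the hw/remote/ code through Kconfig.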



[PULL v4 15/27] multi-process: define MPQemuMsg format and transmission functions

2021-02-10 Thread Stefan Hajnoczi
From: Elena Ufimtseva 

Defines MPQemuMsg, which is the message that is sent to the remote
process. This message is sent over QIOChannel and is used to
command the remote process to perform various tasks.
Also defines the transmission functions used by the proxy and by the
remote.

Signed-off-by: Jagannathan Raman 
Signed-off-by: John G Johnson 
Signed-off-by: Elena Ufimtseva 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
56ca8bcf95195b2b195b08f6b9565b6d7410bce5.1611938319.git.jag.ra...@oracle.com

[Replace struct iovec send[2] = {0} with {} to make clang happy as
suggested by Peter Maydell .
--Stefan]

Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS |   2 +
 meson.build |   1 +
 hw/remote/trace.h   |   1 +
 include/hw/remote/mpqemu-link.h |  63 ++
 include/sysemu/iothread.h   |   6 +
 hw/remote/mpqemu-link.c | 205 
 iothread.c  |   6 +
 hw/remote/meson.build   |   1 +
 hw/remote/trace-events  |   4 +
 9 files changed, 289 insertions(+)
 create mode 100644 hw/remote/trace.h
 create mode 100644 include/hw/remote/mpqemu-link.h
 create mode 100644 hw/remote/mpqemu-link.c
 create mode 100644 hw/remote/trace-events

diff --git a/MAINTAINERS b/MAINTAINERS
index aad849196c..389693f59a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3211,6 +3211,8 @@ F: hw/pci-host/remote.c
 F: include/hw/pci-host/remote.h
 F: hw/remote/machine.c
 F: include/hw/remote/machine.h
+F: hw/remote/mpqemu-link.c
+F: include/hw/remote/mpqemu-link.h
 
 Build and test automation
 -
diff --git a/meson.build b/meson.build
index c8c07df735..a923f249d8 100644
--- a/meson.build
+++ b/meson.build
@@ -1818,6 +1818,7 @@ if have_system
 'net',
 'softmmu',
 'ui',
+'hw/remote',
   ]
 endif
 if have_system or have_user
diff --git a/hw/remote/trace.h b/hw/remote/trace.h
new file mode 100644
index 00..5d5e3ac720
--- /dev/null
+++ b/hw/remote/trace.h
@@ -0,0 +1 @@
+#include "trace/trace-hw_remote.h"
diff --git a/include/hw/remote/mpqemu-link.h b/include/hw/remote/mpqemu-link.h
new file mode 100644
index 00..cac699cb42
--- /dev/null
+++ b/include/hw/remote/mpqemu-link.h
@@ -0,0 +1,63 @@
+/*
+ * Communication channel between QEMU and remote device process
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef MPQEMU_LINK_H
+#define MPQEMU_LINK_H
+
+#include "qom/object.h"
+#include "qemu/thread.h"
+#include "io/channel.h"
+
+#define REMOTE_MAX_FDS 8
+
+#define MPQEMU_MSG_HDR_SIZE offsetof(MPQemuMsg, data.u64)
+
+/**
+ * MPQemuCmd:
+ *
+ * MPQemuCmd enum type to specify the command to be executed on the remote
+ * device.
+ *
+ * This uses a private protocol between QEMU and the remote process. vfio-user
+ * protocol would supersede this in the future.
+ *
+ */
+typedef enum {
+MPQEMU_CMD_MAX,
+} MPQemuCmd;
+
+/**
+ * MPQemuMsg:
+ * @cmd: The remote command
+ * @size: Size of the data to be shared
+ * @data: Structured data
+ * @fds: File descriptors to be shared with remote device
+ *
+ * MPQemuMsg Format of the message sent to the remote device from QEMU.
+ *
+ */
+typedef struct {
+int cmd;
+size_t size;
+
+union {
+uint64_t u64;
+} data;
+
+int fds[REMOTE_MAX_FDS];
+int num_fds;
+} MPQemuMsg;
+
+bool mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc, Error **errp);
+bool mpqemu_msg_recv(MPQemuMsg *msg, QIOChannel *ioc, Error **errp);
+
+bool mpqemu_msg_valid(MPQemuMsg *msg);
+
+#endif
diff --git a/include/sysemu/iothread.h b/include/sysemu/iothread.h
index 0c5284dbbc..f177142f16 100644
--- a/include/sysemu/iothread.h
+++ b/include/sysemu/iothread.h
@@ -57,4 +57,10 @@ IOThread *iothread_create(const char *id, Error **errp);
 void iothread_stop(IOThread *iothread);
 void iothread_destroy(IOThread *iothread);
 
+/*
+ * Returns true if executing within IOThread context,
+ * false otherwise.
+ */
+bool qemu_in_iothread(void);
+
 #endif /* IOTHREAD_H */
diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
new file mode 100644
index 00..0d1899fd94
--- /dev/null
+++ b/hw/remote/mpqemu-link.c
@@ -0,0 +1,205 @@
+/*
+ * Communication channel between QEMU and remote device process
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qemu/module.h"
+#include "hw/remote/mpqemu-link.h"
+#include "qapi/error.h"
+#include "qemu/iov.h"
+#include "qemu/error-report.h"
+#include "qemu/main-loop.h"
+#include "io/channel.h"
+#include "sysemu/iothread.h"
+#include "trace.h"
+
+/*
+ * Send message over the ioc QIOChannel.
+ * This function is safe to call from:
+ * - main loop in co
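
To illustrate the wire format defined in this patch, here is a hedged
sketch of filling in and sending a message. No real command codes exist
yet at this point in the series (the enum only holds MPQEMU_CMD_MAX), so
the command value below is a placeholder:

  #include "qemu/osdep.h"
  #include "hw/remote/mpqemu-link.h"
  #include "qemu/error-report.h"
  #include "qapi/error.h"

  /* Sketch: send a message carrying a 64-bit payload and one fd. */
  static void send_example(QIOChannel *ioc, int memfd)
  {
      MPQemuMsg msg = {0};
      Error *local_err = NULL;

      msg.cmd = 0;                      /* placeholder command code */
      msg.size = sizeof(msg.data.u64);  /* payload bytes after the header */
      msg.data.u64 = 0xcafe;
      msg.fds[0] = memfd;               /* up to REMOTE_MAX_FDS descriptors */
      msg.num_fds = 1;

      if (!mpqemu_msg_send(&msg, ioc, &local_err)) {
          error_report_err(local_err);
      }
  }

The header (cmd and size, i.e. MPQEMU_MSG_HDR_SIZE bytes) travels first,
followed by msg.size bytes of payload; on a socket QIOChannel the fds are
passed as SCM_RIGHTS ancillary data.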

[PULL v4 13/27] io: add qio_channel_writev_full_all helper

2021-02-10 Thread Stefan Hajnoczi
From: Elena Ufimtseva 

Adds qio_channel_writev_full_all() to transmit both data and FDs.
Refactors existing code to use this helper.

Signed-off-by: Elena Ufimtseva 
Signed-off-by: John G Johnson 
Signed-off-by: Jagannathan Raman 
Reviewed-by: Stefan Hajnoczi 
Acked-by: Daniel P. Berrangé 
Message-id: 
480fbf1fe4152495d60596c9b665124549b426a5.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 include/io/channel.h | 25 +
 io/channel.c | 15 ++-
 2 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index ab9ea77959..19e76fc32f 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -777,4 +777,29 @@ void qio_channel_set_aio_fd_handler(QIOChannel *ioc,
 IOHandler *io_write,
 void *opaque);
 
+/**
+ * qio_channel_writev_full_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @fds: an array of file handles to send
+ * @nfds: number of file handles in @fds
+ * @errp: pointer to a NULL-initialized error object
+ *
+ *
+ * Behaves like qio_channel_writev_full but will attempt
+ * to send all data passed (file handles and memory regions).
+ * The function will wait for all requested data
+ * to be written, yielding from the current coroutine
+ * if required.
+ *
+ * Returns: 0 if all bytes were written, or -1 on error
+ */
+
+int qio_channel_writev_full_all(QIOChannel *ioc,
+const struct iovec *iov,
+size_t niov,
+int *fds, size_t nfds,
+Error **errp);
+
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel.c b/io/channel.c
index 93d449dee2..0d4b8b5160 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -156,6 +156,15 @@ int qio_channel_writev_all(QIOChannel *ioc,
const struct iovec *iov,
size_t niov,
Error **errp)
+{
+return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, errp);
+}
+
+int qio_channel_writev_full_all(QIOChannel *ioc,
+const struct iovec *iov,
+size_t niov,
+int *fds, size_t nfds,
+Error **errp)
 {
 int ret = -1;
 struct iovec *local_iov = g_new(struct iovec, niov);
@@ -168,7 +177,8 @@ int qio_channel_writev_all(QIOChannel *ioc,
 
 while (nlocal_iov > 0) {
 ssize_t len;
-len = qio_channel_writev(ioc, local_iov, nlocal_iov, errp);
+len = qio_channel_writev_full(ioc, local_iov, nlocal_iov, fds, nfds,
+  errp);
 if (len == QIO_CHANNEL_ERR_BLOCK) {
 if (qemu_in_coroutine()) {
 qio_channel_yield(ioc, G_IO_OUT);
@@ -182,6 +192,9 @@ int qio_channel_writev_all(QIOChannel *ioc,
 }
 
 iov_discard_front(&local_iov, &nlocal_iov, len);
+
+fds = NULL;
+nfds = 0;
 }
 
 ret = 0;
-- 
2.29.2
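
A short usage sketch for the new helper (hypothetical caller, assuming a
connected socket channel so that fd passing is supported):

  #include "qemu/osdep.h"
  #include "io/channel.h"
  #include "qapi/error.h"

  /* Sketch: send a buffer and one file descriptor in a single call. */
  static int send_buf_and_fd(QIOChannel *ioc, void *buf, size_t len,
                             int fd, Error **errp)
  {
      struct iovec iov = { .iov_base = buf, .iov_len = len };

      /* The fd goes out with the first write; the helper then loops,
       * yielding in coroutine context, until all bytes are written. */
      return qio_channel_writev_full_all(ioc, &iov, 1, &fd, 1, errp);
  }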



[PULL v4 14/27] io: add qio_channel_readv_full_all_eof & qio_channel_readv_full_all helpers

2021-02-10 Thread Stefan Hajnoczi
From: Elena Ufimtseva 

Adds qio_channel_readv_full_all_eof() and qio_channel_readv_full_all()
to read both data and FDs. Refactors existing code to use these helpers.

Signed-off-by: Elena Ufimtseva 
Signed-off-by: John G Johnson 
Signed-off-by: Jagannathan Raman 
Acked-by: Daniel P. Berrangé 
Message-id: 
b059c4cc0fb741e794d644c144cc21372cad877d.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 include/io/channel.h |  53 +++
 io/channel.c | 101 ++-
 2 files changed, 134 insertions(+), 20 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index 19e76fc32f..88988979f8 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -777,6 +777,59 @@ void qio_channel_set_aio_fd_handler(QIOChannel *ioc,
 IOHandler *io_write,
 void *opaque);
 
+/**
+ * qio_channel_readv_full_all_eof:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to read data to
+ * @niov: the length of the @iov array
+ * @fds: an array of file handles to read
+ * @nfds: number of file handles in @fds
+ * @errp: pointer to a NULL-initialized error object
+ *
+ *
+ * Performs same function as qio_channel_readv_all_eof.
+ * Additionally, attempts to read file descriptors shared
+ * over the channel. The function will wait for all
+ * requested data to be read, yielding from the current
+ * coroutine if required. data refers to both file
+ * descriptors and the iovs.
+ *
+ * Returns: 1 if all bytes were read, 0 if end-of-file
+ *  occurs without data, or -1 on error
+ */
+
+int qio_channel_readv_full_all_eof(QIOChannel *ioc,
+   const struct iovec *iov,
+   size_t niov,
+   int **fds, size_t *nfds,
+   Error **errp);
+
+/**
+ * qio_channel_readv_full_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to read data to
+ * @niov: the length of the @iov array
+ * @fds: an array of file handles to read
+ * @nfds: number of file handles in @fds
+ * @errp: pointer to a NULL-initialized error object
+ *
+ *
+ * Performs same function as qio_channel_readv_all_eof.
+ * Additionally, attempts to read file descriptors shared
+ * over the channel. The function will wait for all
+ * requested data to be read, yielding from the current
+ * coroutine if required. data refers to both file
+ * descriptors and the iovs.
+ *
+ * Returns: 0 if all bytes were read, or -1 on error
+ */
+
+int qio_channel_readv_full_all(QIOChannel *ioc,
+   const struct iovec *iov,
+   size_t niov,
+   int **fds, size_t *nfds,
+   Error **errp);
+
 /**
  * qio_channel_writev_full_all:
  * @ioc: the channel object
diff --git a/io/channel.c b/io/channel.c
index 0d4b8b5160..4555021b62 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -91,20 +91,48 @@ int qio_channel_readv_all_eof(QIOChannel *ioc,
   const struct iovec *iov,
   size_t niov,
   Error **errp)
+{
+return qio_channel_readv_full_all_eof(ioc, iov, niov, NULL, NULL, errp);
+}
+
+int qio_channel_readv_all(QIOChannel *ioc,
+  const struct iovec *iov,
+  size_t niov,
+  Error **errp)
+{
+return qio_channel_readv_full_all(ioc, iov, niov, NULL, NULL, errp);
+}
+
+int qio_channel_readv_full_all_eof(QIOChannel *ioc,
+   const struct iovec *iov,
+   size_t niov,
+   int **fds, size_t *nfds,
+   Error **errp)
 {
 int ret = -1;
 struct iovec *local_iov = g_new(struct iovec, niov);
 struct iovec *local_iov_head = local_iov;
 unsigned int nlocal_iov = niov;
+int **local_fds = fds;
+size_t *local_nfds = nfds;
 bool partial = false;
 
+if (nfds) {
+*nfds = 0;
+}
+
+if (fds) {
+*fds = NULL;
+}
+
 nlocal_iov = iov_copy(local_iov, nlocal_iov,
   iov, niov,
   0, iov_size(iov, niov));
 
-while (nlocal_iov > 0) {
+while ((nlocal_iov > 0) || local_fds) {
 ssize_t len;
-len = qio_channel_readv(ioc, local_iov, nlocal_iov, errp);
+len = qio_channel_readv_full(ioc, local_iov, nlocal_iov, local_fds,
+ local_nfds, errp);
 if (len == QIO_CHANNEL_ERR_BLOCK) {
 if (qemu_in_coroutine()) {
 qio_channel_yield(ioc, G_IO_IN);
@@ -112,20 +140,50 @@ int qio_channel_readv_all_eof(QIOChannel *ioc,
 qio_channel_wait(ioc, G_IO_IN);
 }
 continue;
-} else if (len < 
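
A short usage sketch for the receiving side (hypothetical caller; this
assumes the usual QIOChannel convention that the helper allocates the fd
array and ownership of the array and the descriptors passes to the
caller):

  #include "qemu/osdep.h"
  #include "io/channel.h"
  #include "qapi/error.h"

  /* Sketch: read a fixed-size buffer plus any fds sent along with it. */
  static int recv_buf_and_fds(QIOChannel *ioc, void *buf, size_t len,
                              Error **errp)
  {
      struct iovec iov = { .iov_base = buf, .iov_len = len };
      int *fds = NULL;
      size_t nfds = 0;
      int ret;

      ret = qio_channel_readv_full_all(ioc, &iov, 1, &fds, &nfds, errp);
      if (ret == 0) {
          for (size_t i = 0; i < nfds; i++) {
              close(fds[i]);  /* keep or stash the ones you need */
          }
      }
      g_free(fds);
      return ret;
  }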

[PULL v4 06/27] get_maintainer: update repo URL to GitLab

2021-02-10 Thread Stefan Hajnoczi
qemu.org is running out of bandwidth and the QEMU project is moving
towards a gating CI on GitLab. Use the GitLab repos instead of qemu.org
(they will become mirrors).

Signed-off-by: Stefan Hajnoczi 
Reviewed-by: Wainer dos Santos Moschetta 
Reviewed-by: Thomas Huth 
Reviewed-by: Philippe Mathieu-Daudé 
Message-id: 2021015017.156802-7-stefa...@redhat.com
Signed-off-by: Stefan Hajnoczi 
---
 scripts/get_maintainer.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/get_maintainer.pl b/scripts/get_maintainer.pl
index 271f5ff42a..e5499b94b4 100755
--- a/scripts/get_maintainer.pl
+++ b/scripts/get_maintainer.pl
@@ -1377,7 +1377,7 @@ sub vcs_exists {
warn("$P: No supported VCS found.  Add --nogit to options?\n");
warn("Using a git repository produces better results.\n");
warn("Try latest git repository using:\n");
-   warn("git clone https://git.qemu.org/git/qemu.git\n";);
+   warn("git clone https://gitlab.com/qemu-project/qemu.git\n";);
$printed_novcs = 1;
 }
 return 0;
-- 
2.29.2



[PULL v4 16/27] multi-process: Initialize message handler in remote device

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

Initializes the message handler function in the remote process. It is
called whenever there is an event pending on the QIOChannel with which
this function is registered.

Signed-off-by: Elena Ufimtseva 
Signed-off-by: John G Johnson 
Signed-off-by: Jagannathan Raman 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
99d38d8b93753a6409ac2340e858858cda59ab1b.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS |  1 +
 include/hw/remote/machine.h |  9 ++
 hw/remote/message.c | 57 +
 hw/remote/meson.build   |  1 +
 4 files changed, 68 insertions(+)
 create mode 100644 hw/remote/message.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 389693f59a..8d2693525c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3213,6 +3213,7 @@ F: hw/remote/machine.c
 F: include/hw/remote/machine.h
 F: hw/remote/mpqemu-link.c
 F: include/hw/remote/mpqemu-link.h
+F: hw/remote/message.c
 
 Build and test automation
 -
diff --git a/include/hw/remote/machine.h b/include/hw/remote/machine.h
index bdfbca40b9..b92b2ce705 100644
--- a/include/hw/remote/machine.h
+++ b/include/hw/remote/machine.h
@@ -14,6 +14,7 @@
 #include "qom/object.h"
 #include "hw/boards.h"
 #include "hw/pci-host/remote.h"
+#include "io/channel.h"
 
 struct RemoteMachineState {
 MachineState parent_obj;
@@ -21,7 +22,15 @@ struct RemoteMachineState {
 RemotePCIHost *host;
 };
 
+/* Used to pass to co-routine device and ioc. */
+typedef struct RemoteCommDev {
+PCIDevice *dev;
+QIOChannel *ioc;
+} RemoteCommDev;
+
 #define TYPE_REMOTE_MACHINE "x-remote-machine"
 OBJECT_DECLARE_SIMPLE_TYPE(RemoteMachineState, REMOTE_MACHINE)
 
+void coroutine_fn mpqemu_remote_msg_loop_co(void *data);
+
 #endif
diff --git a/hw/remote/message.c b/hw/remote/message.c
new file mode 100644
index 00..36e2d4fb0c
--- /dev/null
+++ b/hw/remote/message.c
@@ -0,0 +1,57 @@
+/*
+ * Copyright © 2020, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL-v2, version 2 or later.
+ *
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/remote/machine.h"
+#include "io/channel.h"
+#include "hw/remote/mpqemu-link.h"
+#include "qapi/error.h"
+#include "sysemu/runstate.h"
+
+void coroutine_fn mpqemu_remote_msg_loop_co(void *data)
+{
+g_autofree RemoteCommDev *com = (RemoteCommDev *)data;
+PCIDevice *pci_dev = NULL;
+Error *local_err = NULL;
+
+assert(com->ioc);
+
+pci_dev = com->dev;
+for (; !local_err;) {
+MPQemuMsg msg = {0};
+
+if (!mpqemu_msg_recv(&msg, com->ioc, &local_err)) {
+break;
+}
+
+if (!mpqemu_msg_valid(&msg)) {
+error_setg(&local_err, "Received invalid message from proxy"
+   "in remote process pid="FMT_pid"",
+   getpid());
+break;
+}
+
+switch (msg.cmd) {
+default:
+error_setg(&local_err,
+   "Unknown command (%d) received for device %s"
+   " (pid="FMT_pid")",
+   msg.cmd, DEVICE(pci_dev)->id, getpid());
+}
+}
+
+if (local_err) {
+error_report_err(local_err);
+qemu_system_shutdown_request(SHUTDOWN_CAUSE_HOST_ERROR);
+} else {
+qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+}
+}
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index a2b2fc0e59..9f5c57f35a 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -2,5 +2,6 @@ remote_ss = ss.source_set()
 
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('machine.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('mpqemu-link.c'))
+remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('message.c'))
 
 softmmu_ss.add_all(when: 'CONFIG_MULTIPROCESS', if_true: remote_ss)
-- 
2.29.2
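
For context, a hedged sketch of how a caller would hand a device/channel
pair to this loop (the actual hookup lands in the remote-obj patch later
in this series; names here follow the header above):

  #include "qemu/osdep.h"
  #include "qemu/coroutine.h"
  #include "hw/remote/machine.h"

  /* Sketch: start the per-channel message loop for one device.
   * The loop owns and frees the RemoteCommDev (g_autofree). */
  static void start_msg_loop(PCIDevice *dev, QIOChannel *ioc)
  {
      RemoteCommDev *com = g_new0(RemoteCommDev, 1);
      Coroutine *co;

      com->dev = dev;
      com->ioc = ioc;

      co = qemu_coroutine_create(mpqemu_remote_msg_loop_co, com);
      qemu_coroutine_enter(co);
  }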



[PULL v4 18/27] multi-process: setup memory manager for remote device

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

The SyncSysmemMsg message format is defined. It is used to send the
file descriptors of the RAM regions to the remote device.
RAM on the remote device is configured with a set of file descriptors:
old RAM regions are deleted and new regions, each backed by an fd, are
added in their place.

Signed-off-by: Jagannathan Raman 
Signed-off-by: John G Johnson 
Signed-off-by: Elena Ufimtseva 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
7d2d1831d812e85f681e7a8ab99e032cf4704689.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS |  2 +
 include/hw/remote/memory.h  | 19 ++
 include/hw/remote/mpqemu-link.h | 10 +
 hw/remote/memory.c  | 65 +
 hw/remote/mpqemu-link.c | 11 ++
 hw/remote/meson.build   |  2 +
 6 files changed, 109 insertions(+)
 create mode 100644 include/hw/remote/memory.h
 create mode 100644 hw/remote/memory.c

diff --git a/MAINTAINERS b/MAINTAINERS
index bcbb5a100c..00ea834ed0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3215,6 +3215,8 @@ F: hw/remote/mpqemu-link.c
 F: include/hw/remote/mpqemu-link.h
 F: hw/remote/message.c
 F: hw/remote/remote-obj.c
+F: include/hw/remote/memory.h
+F: hw/remote/memory.c
 
 Build and test automation
 -
diff --git a/include/hw/remote/memory.h b/include/hw/remote/memory.h
new file mode 100644
index 00..bc2e30945f
--- /dev/null
+++ b/include/hw/remote/memory.h
@@ -0,0 +1,19 @@
+/*
+ * Memory manager for remote device
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef REMOTE_MEMORY_H
+#define REMOTE_MEMORY_H
+
+#include "exec/hwaddr.h"
+#include "hw/remote/mpqemu-link.h"
+
+void remote_sysmem_reconfig(MPQemuMsg *msg, Error **errp);
+
+#endif
diff --git a/include/hw/remote/mpqemu-link.h b/include/hw/remote/mpqemu-link.h
index cac699cb42..6ee5bc5751 100644
--- a/include/hw/remote/mpqemu-link.h
+++ b/include/hw/remote/mpqemu-link.h
@@ -14,6 +14,7 @@
 #include "qom/object.h"
 #include "qemu/thread.h"
 #include "io/channel.h"
+#include "exec/hwaddr.h"
 
 #define REMOTE_MAX_FDS 8
 
@@ -30,9 +31,16 @@
  *
  */
 typedef enum {
+MPQEMU_CMD_SYNC_SYSMEM,
 MPQEMU_CMD_MAX,
 } MPQemuCmd;
 
+typedef struct {
+hwaddr gpas[REMOTE_MAX_FDS];
+uint64_t sizes[REMOTE_MAX_FDS];
+off_t offsets[REMOTE_MAX_FDS];
+} SyncSysmemMsg;
+
 /**
  * MPQemuMsg:
  * @cmd: The remote command
@@ -43,12 +51,14 @@ typedef enum {
  * MPQemuMsg Format of the message sent to the remote device from QEMU.
  *
  */
+
 typedef struct {
 int cmd;
 size_t size;
 
 union {
 uint64_t u64;
+SyncSysmemMsg sync_sysmem;
 } data;
 
 int fds[REMOTE_MAX_FDS];
diff --git a/hw/remote/memory.c b/hw/remote/memory.c
new file mode 100644
index 00..32085b1e05
--- /dev/null
+++ b/hw/remote/memory.c
@@ -0,0 +1,65 @@
+/*
+ * Memory manager for remote device
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/remote/memory.h"
+#include "exec/address-spaces.h"
+#include "exec/ram_addr.h"
+#include "qapi/error.h"
+
+static void remote_sysmem_reset(void)
+{
+MemoryRegion *sysmem, *subregion, *next;
+
+sysmem = get_system_memory();
+
+QTAILQ_FOREACH_SAFE(subregion, &sysmem->subregions, subregions_link, next) 
{
+if (subregion->ram) {
+memory_region_del_subregion(sysmem, subregion);
+object_unparent(OBJECT(subregion));
+}
+}
+}
+
+void remote_sysmem_reconfig(MPQemuMsg *msg, Error **errp)
+{
+ERRP_GUARD();
+SyncSysmemMsg *sysmem_info = &msg->data.sync_sysmem;
+MemoryRegion *sysmem, *subregion;
+static unsigned int suffix;
+int region;
+
+sysmem = get_system_memory();
+
+remote_sysmem_reset();
+
+for (region = 0; region < msg->num_fds; region++) {
+g_autofree char *name;
+subregion = g_new(MemoryRegion, 1);
+name = g_strdup_printf("remote-mem-%u", suffix++);
+memory_region_init_ram_from_fd(subregion, NULL,
+   name, sysmem_info->sizes[region],
+   true, msg->fds[region],
+   sysmem_info->offsets[region],
+   errp);
+
+if (*errp) {
+g_free(subregion);
+remote_sysmem_reset();
+return;
+}
+
+memory_region_add_subregion(sysmem, sysmem_info->gpas[region],
+subregion);
+
+}
+}
diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
index 0d1899fd94..4ee1128285 100644
--- a/hw/remote/
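
To show how the message defined above would be populated on the sending
side, a hedged sketch (the real sender is the proxy memory listener from
a later patch; fd values, addresses and sizes here are made up):

  #include "qemu/osdep.h"
  #include "hw/remote/mpqemu-link.h"

  /* Sketch: describe two memfd-backed RAM regions for the remote. */
  static void build_sync_sysmem(MPQemuMsg *msg, int fd0, int fd1)
  {
      memset(msg, 0, sizeof(*msg));
      msg->cmd = MPQEMU_CMD_SYNC_SYSMEM;
      msg->size = sizeof(SyncSysmemMsg);

      msg->fds[0] = fd0;
      msg->data.sync_sysmem.gpas[0] = 0x0;         /* guest-physical base */
      msg->data.sync_sysmem.sizes[0] = 0x80000000; /* 2G of low RAM */
      msg->data.sync_sysmem.offsets[0] = 0;        /* offset into fd0 */

      msg->fds[1] = fd1;
      msg->data.sync_sysmem.gpas[1] = 0x100000000ULL;
      msg->data.sync_sysmem.sizes[1] = 0x40000000;
      msg->data.sync_sysmem.offsets[1] = 0;

      msg->num_fds = 2;
  }

On receipt, remote_sysmem_reconfig() above drops the old RAM subregions
and maps each fd at its gpas[] address via memory_region_init_ram_from_fd().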

[PULL v4 20/27] multi-process: add proxy communication functions

2021-02-10 Thread Stefan Hajnoczi
From: Elena Ufimtseva 

Signed-off-by: Elena Ufimtseva 
Signed-off-by: Jagannathan Raman 
Signed-off-by: John G Johnson 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
d54edb4176361eed86b903e8f27058363b6c83b3.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 include/hw/remote/mpqemu-link.h |  4 
 hw/remote/mpqemu-link.c | 34 +
 2 files changed, 38 insertions(+)

diff --git a/include/hw/remote/mpqemu-link.h b/include/hw/remote/mpqemu-link.h
index 6ee5bc5751..1b35d408f8 100644
--- a/include/hw/remote/mpqemu-link.h
+++ b/include/hw/remote/mpqemu-link.h
@@ -15,6 +15,8 @@
 #include "qemu/thread.h"
 #include "io/channel.h"
 #include "exec/hwaddr.h"
+#include "io/channel-socket.h"
+#include "hw/remote/proxy.h"
 
 #define REMOTE_MAX_FDS 8
 
@@ -68,6 +70,8 @@ typedef struct {
 bool mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc, Error **errp);
 bool mpqemu_msg_recv(MPQemuMsg *msg, QIOChannel *ioc, Error **errp);
 
+uint64_t mpqemu_msg_send_and_await_reply(MPQemuMsg *msg, PCIProxyDev *pdev,
+ Error **errp);
 bool mpqemu_msg_valid(MPQemuMsg *msg);
 
 #endif
diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
index 4ee1128285..f5e9e01923 100644
--- a/hw/remote/mpqemu-link.c
+++ b/hw/remote/mpqemu-link.c
@@ -182,6 +182,40 @@ fail:
 return ret;
 }
 
+/*
+ * Send msg and wait for a reply with command code RET_MSG.
+ * Returns the message received of size u64 or UINT64_MAX
+ * on error.
+ * Called from VCPU thread in non-coroutine context.
+ * Used by the Proxy object to communicate to remote processes.
+ */
+uint64_t mpqemu_msg_send_and_await_reply(MPQemuMsg *msg, PCIProxyDev *pdev,
+ Error **errp)
+{
+ERRP_GUARD();
+MPQemuMsg msg_reply = {0};
+uint64_t ret = UINT64_MAX;
+
+assert(!qemu_in_coroutine());
+
+QEMU_LOCK_GUARD(&pdev->io_mutex);
+if (!mpqemu_msg_send(msg, pdev->ioc, errp)) {
+return ret;
+}
+
+if (!mpqemu_msg_recv(&msg_reply, pdev->ioc, errp)) {
+return ret;
+}
+
+if (!mpqemu_msg_valid(&msg_reply)) {
+error_setg(errp, "ERROR: Invalid reply received for command %d",
+ msg->cmd);
+return ret;
+}
+
+return msg_reply.data.u64;
+}
+
 bool mpqemu_msg_valid(MPQemuMsg *msg)
 {
 if (msg->cmd >= MPQEMU_CMD_MAX && msg->cmd < 0) {
-- 
2.29.2
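
A hedged sketch of a caller in the VCPU thread (the command code is a
placeholder; concrete commands such as the PCI config accesses arrive in
the next patches):

  #include "qemu/osdep.h"
  #include "hw/remote/mpqemu-link.h"
  #include "qapi/error.h"

  /* Sketch: one synchronous request/reply round trip to the remote. */
  static uint64_t proxy_roundtrip(PCIProxyDev *pdev, Error **errp)
  {
      MPQemuMsg msg = {0};
      uint64_t ret;

      msg.cmd = 0;                     /* placeholder command code */
      msg.size = sizeof(msg.data.u64);
      msg.data.u64 = 42;

      ret = mpqemu_msg_send_and_await_reply(&msg, pdev, errp);
      /* UINT64_MAX can mean failure; check *errp to distinguish it
       * from a legitimate all-ones reply. */
      return ret;
  }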



[PULL v4 21/27] multi-process: Forward PCI config space accesses to the remote process

2021-02-10 Thread Stefan Hajnoczi
From: Elena Ufimtseva 

The proxy object sends PCI config space accesses as messages to the
remote process over the communication channel.

Signed-off-by: Elena Ufimtseva 
Signed-off-by: Jagannathan Raman 
Signed-off-by: John G Johnson 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
d3c94f4618813234655356c60e6f0d0362ff42d6.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 include/hw/remote/mpqemu-link.h | 10 ++
 hw/remote/message.c | 60 +
 hw/remote/mpqemu-link.c |  8 -
 hw/remote/proxy.c   | 55 ++
 4 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/include/hw/remote/mpqemu-link.h b/include/hw/remote/mpqemu-link.h
index 1b35d408f8..7bc0bddb5a 100644
--- a/include/hw/remote/mpqemu-link.h
+++ b/include/hw/remote/mpqemu-link.h
@@ -34,6 +34,9 @@
  */
 typedef enum {
 MPQEMU_CMD_SYNC_SYSMEM,
+MPQEMU_CMD_RET,
+MPQEMU_CMD_PCI_CFGWRITE,
+MPQEMU_CMD_PCI_CFGREAD,
 MPQEMU_CMD_MAX,
 } MPQemuCmd;
 
@@ -43,6 +46,12 @@ typedef struct {
 off_t offsets[REMOTE_MAX_FDS];
 } SyncSysmemMsg;
 
+typedef struct {
+uint32_t addr;
+uint32_t val;
+int len;
+} PciConfDataMsg;
+
 /**
  * MPQemuMsg:
  * @cmd: The remote command
@@ -60,6 +69,7 @@ typedef struct {
 
 union {
 uint64_t u64;
+PciConfDataMsg pci_conf_data;
 SyncSysmemMsg sync_sysmem;
 } data;
 
diff --git a/hw/remote/message.c b/hw/remote/message.c
index 36e2d4fb0c..636bd161bd 100644
--- a/hw/remote/message.c
+++ b/hw/remote/message.c
@@ -15,6 +15,12 @@
 #include "hw/remote/mpqemu-link.h"
 #include "qapi/error.h"
 #include "sysemu/runstate.h"
+#include "hw/pci/pci.h"
+
+static void process_config_write(QIOChannel *ioc, PCIDevice *dev,
+ MPQemuMsg *msg, Error **errp);
+static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
+MPQemuMsg *msg, Error **errp);
 
 void coroutine_fn mpqemu_remote_msg_loop_co(void *data)
 {
@@ -40,6 +46,12 @@ void coroutine_fn mpqemu_remote_msg_loop_co(void *data)
 }
 
 switch (msg.cmd) {
+case MPQEMU_CMD_PCI_CFGWRITE:
+process_config_write(com->ioc, pci_dev, &msg, &local_err);
+break;
+case MPQEMU_CMD_PCI_CFGREAD:
+process_config_read(com->ioc, pci_dev, &msg, &local_err);
+break;
 default:
 error_setg(&local_err,
"Unknown command (%d) received for device %s"
@@ -55,3 +67,51 @@ void coroutine_fn mpqemu_remote_msg_loop_co(void *data)
 qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
 }
 }
+
+static void process_config_write(QIOChannel *ioc, PCIDevice *dev,
+ MPQemuMsg *msg, Error **errp)
+{
+ERRP_GUARD();
+PciConfDataMsg *conf = (PciConfDataMsg *)&msg->data.pci_conf_data;
+MPQemuMsg ret = { 0 };
+
+if ((conf->addr + sizeof(conf->val)) > pci_config_size(dev)) {
+error_setg(errp, "Bad address for PCI config write, pid "FMT_pid".",
+   getpid());
+ret.data.u64 = UINT64_MAX;
+} else {
+pci_default_write_config(dev, conf->addr, conf->val, conf->len);
+}
+
+ret.cmd = MPQEMU_CMD_RET;
+ret.size = sizeof(ret.data.u64);
+
+if (!mpqemu_msg_send(&ret, ioc, NULL)) {
+error_prepend(errp, "Error returning code to proxy, pid "FMT_pid": ",
+  getpid());
+}
+}
+
+static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
+MPQemuMsg *msg, Error **errp)
+{
+ERRP_GUARD();
+PciConfDataMsg *conf = (PciConfDataMsg *)&msg->data.pci_conf_data;
+MPQemuMsg ret = { 0 };
+
+if ((conf->addr + sizeof(conf->val)) > pci_config_size(dev)) {
+error_setg(errp, "Bad address for PCI config read, pid "FMT_pid".",
+   getpid());
+ret.data.u64 = UINT64_MAX;
+} else {
+ret.data.u64 = pci_default_read_config(dev, conf->addr, conf->len);
+}
+
+ret.cmd = MPQEMU_CMD_RET;
+ret.size = sizeof(ret.data.u64);
+
+if (!mpqemu_msg_send(&ret, ioc, NULL)) {
+error_prepend(errp, "Error returning code to proxy, pid "FMT_pid": ",
+  getpid());
+}
+}
diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
index f5e9e01923..b45f325686 100644
--- a/hw/remote/mpqemu-link.c
+++ b/hw/remote/mpqemu-link.c
@@ -207,7 +207,7 @@ uint64_t mpqemu_msg_send_and_await_reply(MPQemuMsg *msg, PCIProxyDev *pdev,
 return ret;
 }
 
-if (!mpqemu_msg_valid(&msg_reply)) {
+if (!mpqemu_msg_valid(&msg_reply) || msg_reply.cmd != MPQEMU_CMD_RET) {
 error_setg(errp, "ERROR: Invalid reply received for command %d",
  msg->cmd);
 return ret;
@@ -242,6 +242,12 @@ bool mpqemu_msg_valid(MPQemuMsg *msg)
 return false;
 }
 break;
+case MPQEMU_CMD_PCI_C

[PULL v4 22/27] multi-process: PCI BAR read/write handling for proxy & remote endpoints

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

The proxy device object implements handlers for PCI BAR writes and
reads. The handlers use BAR_WRITE/BAR_READ messages to communicate
the BAR address and the value to be written/read to the remote
process. The remote process implements handlers for the
BAR_WRITE/BAR_READ messages.

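For illustration, the proxy-side read path reduces to a sketch like the
following (hypothetical function name; the real MemoryRegionOps callbacks
live in hw/remote/proxy.c, and additionally offset the address by the BAR
base):

static uint64_t proxy_bar_read_sketch(void *opaque, hwaddr addr,
                                      unsigned size)
{
    ProxyMemoryRegion *pmr = opaque;     /* see proxy.h below */
    MPQemuMsg msg = {
        .cmd = MPQEMU_CMD_BAR_READ,
        .size = sizeof(BarAccessMsg),
        .data.bar_access = {
            .addr = addr,                /* offset within the BAR region */
            .size = size,                /* access width: 1, 2, 4 or 8 */
            .memory = pmr->memory,       /* MMIO (true) vs. port I/O */
        },
    };
    Error *local_err = NULL;
    uint64_t val;

    val = mpqemu_msg_send_and_await_reply(&msg, pmr->dev, &local_err);
    if (local_err) {
        error_report_err(local_err);
    }
    return val;
}
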
Signed-off-by: Jagannathan Raman 
Signed-off-by: Elena Ufimtseva 
Signed-off-by: John G Johnson 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
a8b76714a9688be5552c4c92d089bc9e8a4707ff.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 include/hw/remote/mpqemu-link.h | 10 
 include/hw/remote/proxy.h   |  9 
 hw/remote/message.c | 83 +
 hw/remote/mpqemu-link.c |  6 +++
 hw/remote/proxy.c   | 60 
 5 files changed, 168 insertions(+)

diff --git a/include/hw/remote/mpqemu-link.h b/include/hw/remote/mpqemu-link.h
index 7bc0bddb5a..6303e62b17 100644
--- a/include/hw/remote/mpqemu-link.h
+++ b/include/hw/remote/mpqemu-link.h
@@ -37,6 +37,8 @@ typedef enum {
 MPQEMU_CMD_RET,
 MPQEMU_CMD_PCI_CFGWRITE,
 MPQEMU_CMD_PCI_CFGREAD,
+MPQEMU_CMD_BAR_WRITE,
+MPQEMU_CMD_BAR_READ,
 MPQEMU_CMD_MAX,
 } MPQemuCmd;
 
@@ -52,6 +54,13 @@ typedef struct {
 int len;
 } PciConfDataMsg;
 
+typedef struct {
+hwaddr addr;
+uint64_t val;
+unsigned size;
+bool memory;
+} BarAccessMsg;
+
 /**
  * MPQemuMsg:
  * @cmd: The remote command
@@ -71,6 +80,7 @@ typedef struct {
 uint64_t u64;
 PciConfDataMsg pci_conf_data;
 SyncSysmemMsg sync_sysmem;
+BarAccessMsg bar_access;
 } data;
 
 int fds[REMOTE_MAX_FDS];
diff --git a/include/hw/remote/proxy.h b/include/hw/remote/proxy.h
index faa9c4d580..ea7fa4fb3c 100644
--- a/include/hw/remote/proxy.h
+++ b/include/hw/remote/proxy.h
@@ -15,6 +15,14 @@
 #define TYPE_PCI_PROXY_DEV "x-pci-proxy-dev"
 OBJECT_DECLARE_SIMPLE_TYPE(PCIProxyDev, PCI_PROXY_DEV)
 
+typedef struct ProxyMemoryRegion {
+PCIProxyDev *dev;
+MemoryRegion mr;
+bool memory;
+bool present;
+uint8_t type;
+} ProxyMemoryRegion;
+
 struct PCIProxyDev {
 PCIDevice parent_dev;
 char *fd;
@@ -28,6 +36,7 @@ struct PCIProxyDev {
 QemuMutex io_mutex;
 QIOChannel *ioc;
 Error *migration_blocker;
+ProxyMemoryRegion region[PCI_NUM_REGIONS];
 };
 
 #endif /* PROXY_H */
diff --git a/hw/remote/message.c b/hw/remote/message.c
index 636bd161bd..f2e84457e0 100644
--- a/hw/remote/message.c
+++ b/hw/remote/message.c
@@ -16,11 +16,14 @@
 #include "qapi/error.h"
 #include "sysemu/runstate.h"
 #include "hw/pci/pci.h"
+#include "exec/memattrs.h"
 
 static void process_config_write(QIOChannel *ioc, PCIDevice *dev,
  MPQemuMsg *msg, Error **errp);
 static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
 MPQemuMsg *msg, Error **errp);
+static void process_bar_write(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
+static void process_bar_read(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
 
 void coroutine_fn mpqemu_remote_msg_loop_co(void *data)
 {
@@ -52,6 +55,12 @@ void coroutine_fn mpqemu_remote_msg_loop_co(void *data)
 case MPQEMU_CMD_PCI_CFGREAD:
 process_config_read(com->ioc, pci_dev, &msg, &local_err);
 break;
+case MPQEMU_CMD_BAR_WRITE:
+process_bar_write(com->ioc, &msg, &local_err);
+break;
+case MPQEMU_CMD_BAR_READ:
+process_bar_read(com->ioc, &msg, &local_err);
+break;
 default:
 error_setg(&local_err,
"Unknown command (%d) received for device %s"
@@ -115,3 +124,77 @@ static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
   getpid());
 }
 }
+
+static void process_bar_write(QIOChannel *ioc, MPQemuMsg *msg, Error **errp)
+{
+ERRP_GUARD();
+BarAccessMsg *bar_access = &msg->data.bar_access;
+AddressSpace *as =
+bar_access->memory ? &address_space_memory : &address_space_io;
+MPQemuMsg ret = { 0 };
+MemTxResult res;
+uint64_t val;
+
+if (!is_power_of_2(bar_access->size) ||
+   (bar_access->size > sizeof(uint64_t))) {
+ret.data.u64 = UINT64_MAX;
+goto fail;
+}
+
+val = cpu_to_le64(bar_access->val);
+
+res = address_space_rw(as, bar_access->addr, MEMTXATTRS_UNSPECIFIED,
+   (void *)&val, bar_access->size, true);
+
+if (res != MEMTX_OK) {
+error_setg(errp, "Bad address %"PRIx64" for mem write, pid "FMT_pid".",
+   bar_access->addr, getpid());
+ret.data.u64 = -1;
+}
+
+fail:
+ret.cmd = MPQEMU_CMD_RET;
+ret.size = sizeof(ret.data.u64);
+
+if (!mpqemu_msg_send(&ret, ioc, NULL)) {
+error_prepend(errp, "Error returning code to proxy, pid "FMT_pid": ",
+  getpid());
+}
+}
+
+static void process_bar_read(QIOChannel *ioc

[PULL v4 09/27] memory: alloc RAM from file at offset

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

Allow a RAM MemoryRegion to be created from an offset in a file,
instead of at offset 0 by default. This is needed to synchronize
RAM between QEMU and the remote process.

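A minimal usage sketch of the extended API (hypothetical caller; fd and
offset are placeholders supplied by the communication setup):

/* Sketch only: map RAM starting at a non-zero offset within fd. */
static void map_remote_ram(MemoryRegion *mr, Object *owner, int fd,
                           ram_addr_t offset, uint64_t size, Error **errp)
{
    /* share=true so QEMU and the remote process see the same pages */
    memory_region_init_ram_from_fd(mr, owner, "remote-ram", size,
                                   true, fd, offset, errp);
}
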
Signed-off-by: Jagannathan Raman 
Signed-off-by: John G Johnson 
Signed-off-by: Elena Ufimtseva 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
609996697ad8617e3b01df38accc5c208c24d74e.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 include/exec/memory.h |  2 ++
 include/exec/ram_addr.h   |  4 ++--
 include/qemu/mmap-alloc.h |  4 +++-
 backends/hostmem-memfd.c  |  2 +-
 hw/misc/ivshmem.c |  3 ++-
 softmmu/memory.c  |  3 ++-
 softmmu/physmem.c | 12 +++-
 util/mmap-alloc.c |  8 +---
 util/oslib-posix.c|  2 +-
 9 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index ed292767cd..c6fb714e49 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -998,6 +998,7 @@ void memory_region_init_ram_from_file(MemoryRegion *mr,
  * @size: size of the region.
  * @share: %true if memory must be mmaped with the MAP_SHARED flag
  * @fd: the fd to mmap.
+ * @offset: offset within the file referenced by fd
  * @errp: pointer to Error*, to store an error if it happens.
  *
  * Note that this function does not do anything to cause the data in the
@@ -1009,6 +1010,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
 uint64_t size,
 bool share,
 int fd,
+ram_addr_t offset,
 Error **errp);
 #endif
 
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 40b16609ab..3cb9791df3 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -121,8 +121,8 @@ RAMBlock *qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
uint32_t ram_flags, const char *mem_path,
bool readonly, Error **errp);
 RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
- uint32_t ram_flags, int fd, bool readonly,
- Error **errp);
+ uint32_t ram_flags, int fd, off_t offset,
+ bool readonly, Error **errp);
 
 RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
   MemoryRegion *mr, Error **errp);
diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 8b7a5c70f3..456ff87df1 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -17,6 +17,7 @@ size_t qemu_mempath_getpagesize(const char *mem_path);
  *  @readonly: true for a read-only mapping, false for read/write.
  *  @shared: map has RAM_SHARED flag.
  *  @is_pmem: map has RAM_PMEM flag.
+ *  @map_offset: map starts at offset map_offset from the start of fd
  *
  * Return:
  *  On success, return a pointer to the mapped area.
@@ -27,7 +28,8 @@ void *qemu_ram_mmap(int fd,
 size_t align,
 bool readonly,
 bool shared,
-bool is_pmem);
+bool is_pmem,
+off_t map_offset);
 
 void qemu_ram_munmap(int fd, void *ptr, size_t size);
 
diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index e5626d4330..69b0ae30bb 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -55,7 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 name = host_memory_backend_get_name(backend);
 memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
name, backend->size,
-   backend->share, fd, errp);
+   backend->share, fd, 0, errp);
 g_free(name);
 }
 
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 0505b52c98..603e992a7f 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -495,7 +495,8 @@ static void process_msg_shmem(IVShmemState *s, int fd, Error **errp)
 
 /* mmap the region and map into the BAR2 */
 memory_region_init_ram_from_fd(&s->server_bar2, OBJECT(s),
-   "ivshmem.bar2", size, true, fd, &local_err);
+   "ivshmem.bar2", size, true, fd, 0,
+   &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
 return;
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 23e8e33001..874a8fccde 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -1612,6 +1612,7 @@ void memory_region_init_ram_from_fd(MemoryRegion *mr,
 uint64_t size,
 bool share,
 int fd,

[PULL v4 23/27] multi-process: Synchronize remote memory

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

Add a ProxyMemoryListener object which is used to keep the view of
the RAM in sync between QEMU and the remote process.
A MemoryListener is registered for the system-memory AddressSpace.
The listener sends a SYNC_SYSMEM message to the remote process when
it commits changes to memory; the remote process receives the message
and processes it in its handler for the SYNC_SYSMEM message.

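Conceptually, the listener's commit callback boils down to the following
sketch (field and member names follow the SyncSysmemMsg/MPQemuMsg
definitions in mpqemu-link.h; computing each section's offset within its
backing fd from the host address is elided here):

/* Sketch only: publish n RAM sections to the remote process. */
static void sync_sysmem_sketch(QIOChannel *ioc, MemoryRegionSection *secs,
                               int n, Error **errp)
{
    MPQemuMsg msg = {
        .cmd = MPQEMU_CMD_SYNC_SYSMEM,
        .size = sizeof(SyncSysmemMsg),
        .num_fds = n,                    /* bounded by REMOTE_MAX_FDS */
    };

    for (int i = 0; i < n; i++) {
        msg.fds[i] = memory_region_get_fd(secs[i].mr);
        msg.data.sync_sysmem.gpas[i] = secs[i].offset_within_address_space;
        msg.data.sync_sysmem.sizes[i] = int128_get64(secs[i].size);
        /* simplification: assumes the section starts at the fd's origin */
        msg.data.sync_sysmem.offsets[i] = secs[i].offset_within_region;
    }

    mpqemu_msg_send(&msg, ioc, errp);
}
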
Signed-off-by: Jagannathan Raman 
Signed-off-by: John G Johnson 
Signed-off-by: Elena Ufimtseva 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
04fe4e6a9ca90d4f11ab6f59be7652f5b086a071.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS   |   2 +
 include/hw/remote/proxy-memory-listener.h |  28 +++
 include/hw/remote/proxy.h |   2 +
 hw/remote/message.c   |   4 +
 hw/remote/proxy-memory-listener.c | 227 ++
 hw/remote/proxy.c |   6 +
 hw/remote/meson.build |   1 +
 7 files changed, 270 insertions(+)
 create mode 100644 include/hw/remote/proxy-memory-listener.h
 create mode 100644 hw/remote/proxy-memory-listener.c

diff --git a/MAINTAINERS b/MAINTAINERS
index aec5d2d076..3817e807b1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3219,6 +3219,8 @@ F: include/hw/remote/memory.h
 F: hw/remote/memory.c
 F: hw/remote/proxy.c
 F: include/hw/remote/proxy.h
+F: hw/remote/proxy-memory-listener.c
+F: include/hw/remote/proxy-memory-listener.h
 
 Build and test automation
-------------------------
diff --git a/include/hw/remote/proxy-memory-listener.h b/include/hw/remote/proxy-memory-listener.h
new file mode 100644
index 00..c4f3efb928
--- /dev/null
+++ b/include/hw/remote/proxy-memory-listener.h
@@ -0,0 +1,28 @@
+/*
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PROXY_MEMORY_LISTENER_H
+#define PROXY_MEMORY_LISTENER_H
+
+#include "exec/memory.h"
+#include "io/channel.h"
+
+typedef struct ProxyMemoryListener {
+MemoryListener listener;
+
+int n_mr_sections;
+MemoryRegionSection *mr_sections;
+
+QIOChannel *ioc;
+} ProxyMemoryListener;
+
+void proxy_memory_listener_configure(ProxyMemoryListener *proxy_listener,
+ QIOChannel *ioc);
+void proxy_memory_listener_deconfigure(ProxyMemoryListener *proxy_listener);
+
+#endif
diff --git a/include/hw/remote/proxy.h b/include/hw/remote/proxy.h
index ea7fa4fb3c..12888b4f90 100644
--- a/include/hw/remote/proxy.h
+++ b/include/hw/remote/proxy.h
@@ -11,6 +11,7 @@
 
 #include "hw/pci/pci.h"
 #include "io/channel.h"
+#include "hw/remote/proxy-memory-listener.h"
 
 #define TYPE_PCI_PROXY_DEV "x-pci-proxy-dev"
 OBJECT_DECLARE_SIMPLE_TYPE(PCIProxyDev, PCI_PROXY_DEV)
@@ -36,6 +37,7 @@ struct PCIProxyDev {
 QemuMutex io_mutex;
 QIOChannel *ioc;
 Error *migration_blocker;
+ProxyMemoryListener proxy_listener;
 ProxyMemoryRegion region[PCI_NUM_REGIONS];
 };
 
diff --git a/hw/remote/message.c b/hw/remote/message.c
index f2e84457e0..25341d8ad2 100644
--- a/hw/remote/message.c
+++ b/hw/remote/message.c
@@ -17,6 +17,7 @@
 #include "sysemu/runstate.h"
 #include "hw/pci/pci.h"
 #include "exec/memattrs.h"
+#include "hw/remote/memory.h"
 
 static void process_config_write(QIOChannel *ioc, PCIDevice *dev,
  MPQemuMsg *msg, Error **errp);
@@ -61,6 +62,9 @@ void coroutine_fn mpqemu_remote_msg_loop_co(void *data)
 case MPQEMU_CMD_BAR_READ:
 process_bar_read(com->ioc, &msg, &local_err);
 break;
+case MPQEMU_CMD_SYNC_SYSMEM:
+remote_sysmem_reconfig(&msg, &local_err);
+break;
 default:
 error_setg(&local_err,
"Unknown command (%d) received for device %s"
diff --git a/hw/remote/proxy-memory-listener.c b/hw/remote/proxy-memory-listener.c
new file mode 100644
index 00..af1fa6f5aa
--- /dev/null
+++ b/hw/remote/proxy-memory-listener.c
@@ -0,0 +1,227 @@
+/*
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qemu/compiler.h"
+#include "qemu/int128.h"
+#include "qemu/range.h"
+#include "exec/memory.h"
+#include "exec/cpu-common.h"
+#include "cpu.h"
+#include "exec/ram_addr.h"
+#include "exec/address-spaces.h"
+#include "qapi/error.h"
+#include "hw/remote/mpqemu-link.h"
+#include "hw/remote/proxy-memory-listener.h"
+
+/*
+ * TODO: get_fd_from_hostaddr(), proxy_mrs_can_merge() and
+ * proxy_memory_listener_commit() defined below perform tasks similar to the
+ * functions defined in vhost-user.c. These functions are good candidates
+ * for refactoring.
+

[PULL v4 11/27] multi-process: setup PCI host bridge for remote device

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

A PCI host bridge is set up for the remote device process. It is
implemented using the remote-pcihost object, an extension of the PCI
host bridge set up by QEMU.
Remote-pcihost configures a PCI bus which the remote PCI device can
attach to.

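As a sketch (hypothetical helper; the actual wiring is done by the remote
machine object, which stores the bridge in RemoteMachineState), creating
the bridge amounts to:

/* Sketch only: create and realize the remote PCI host bridge. */
static RemotePCIHost *create_remote_pcihost(MemoryRegion *pci_mem,
                                            MemoryRegion *sys_io)
{
    RemotePCIHost *host = REMOTE_PCIHOST(qdev_new(TYPE_REMOTE_PCIHOST));

    /* Address spaces under which realize() creates the PCIe root bus */
    host->mr_pci_mem = pci_mem;
    host->mr_sys_io = sys_io;

    sysbus_realize_and_unref(SYS_BUS_DEVICE(host), &error_fatal);
    return host;
}
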
Signed-off-by: Jagannathan Raman 
Signed-off-by: John G Johnson 
Signed-off-by: Elena Ufimtseva 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
0871ba857abb2eafacde07e7fe66a3f12415bfb2.1611938319.git.jag.ra...@oracle.com

[Added PCI_EXPRESS condition in hw/remote/Kconfig since remote-pcihost
needs PCIe. This solves "make check" failure on s390x. Fix suggested by
Philippe Mathieu-Daudé and Thomas Huth.
--Stefan]

Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS  |  2 +
 include/hw/pci-host/remote.h | 29 ++
 hw/pci-host/remote.c | 75 
 hw/pci-host/Kconfig  |  3 ++
 hw/pci-host/meson.build  |  1 +
 hw/remote/Kconfig|  3 +-
 6 files changed, 112 insertions(+), 1 deletion(-)
 create mode 100644 include/hw/pci-host/remote.h
 create mode 100644 hw/pci-host/remote.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1658397762..4a19e20815 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3207,6 +3207,8 @@ M: John G Johnson 
 S: Maintained
 F: docs/devel/multi-process.rst
 F: docs/system/multi-process.rst
+F: hw/pci-host/remote.c
+F: include/hw/pci-host/remote.h
 
 Build and test automation
-------------------------
diff --git a/include/hw/pci-host/remote.h b/include/hw/pci-host/remote.h
new file mode 100644
index 00..06b8a83a4b
--- /dev/null
+++ b/include/hw/pci-host/remote.h
@@ -0,0 +1,29 @@
+/*
+ * PCI Host for remote device
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef REMOTE_PCIHOST_H
+#define REMOTE_PCIHOST_H
+
+#include "exec/memory.h"
+#include "hw/pci/pcie_host.h"
+
+#define TYPE_REMOTE_PCIHOST "remote-pcihost"
+OBJECT_DECLARE_SIMPLE_TYPE(RemotePCIHost, REMOTE_PCIHOST)
+
+struct RemotePCIHost {
+/*< private >*/
+PCIExpressHost parent_obj;
+/*< public >*/
+
+MemoryRegion *mr_pci_mem;
+MemoryRegion *mr_sys_io;
+};
+
+#endif
diff --git a/hw/pci-host/remote.c b/hw/pci-host/remote.c
new file mode 100644
index 00..eee45444ef
--- /dev/null
+++ b/hw/pci-host/remote.c
@@ -0,0 +1,75 @@
+/*
+ * Remote PCI host device
+ *
+ * Unlike PCI host devices that model physical hardware, the purpose
+ * of this PCI host is to host multi-process QEMU devices.
+ *
+ * Multi-process QEMU extends the PCI host of a QEMU machine into a
+ * remote process. Any PCI device attached to the remote process is
+ * visible in the QEMU guest. This allows existing QEMU device models
+ * to be reused in the remote process.
+ *
+ * This PCI host is purely a container for PCI devices. It's fake in the
+ * sense that the guest never sees this PCI host and has no way of
+ * accessing it. Its job is just to provide the environment that QEMU
+ * PCI device models need when running in a remote process.
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_host.h"
+#include "hw/pci/pcie_host.h"
+#include "hw/qdev-properties.h"
+#include "hw/pci-host/remote.h"
+#include "exec/memory.h"
+
+static const char *remote_pcihost_root_bus_path(PCIHostState *host_bridge,
+PCIBus *rootbus)
+{
+return "0000:00";
+}
+
+static void remote_pcihost_realize(DeviceState *dev, Error **errp)
+{
+PCIHostState *pci = PCI_HOST_BRIDGE(dev);
+RemotePCIHost *s = REMOTE_PCIHOST(dev);
+
+pci->bus = pci_root_bus_new(DEVICE(s), "remote-pci",
+s->mr_pci_mem, s->mr_sys_io,
+0, TYPE_PCIE_BUS);
+}
+
+static void remote_pcihost_class_init(ObjectClass *klass, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(klass);
+PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_CLASS(klass);
+
+hc->root_bus_path = remote_pcihost_root_bus_path;
+dc->realize = remote_pcihost_realize;
+
+dc->user_creatable = false;
+set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
+dc->fw_name = "pci";
+}
+
+static const TypeInfo remote_pcihost_info = {
+.name = TYPE_REMOTE_PCIHOST,
+.parent = TYPE_PCIE_HOST_BRIDGE,
+.instance_size = sizeof(RemotePCIHost),
+.class_init = remote_pcihost_class_init,
+};
+
+static void remote_pcihost_register(void)
+{
+type_register_static(&remote_pcihost_info);
+}
+
+type_init(remote_pcihost_register)
diff --git a/hw/pci-host/Kconfig b/hw/pci-host/Kconfig
index eb03f0489d..8b8c763c28 100644
--- a/hw/pci-host/Kco

[PULL v4 24/27] multi-process: create IOHUB object to handle irq

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

An IOHUB object is added to manage PCI IRQs. It uses the KVM_IRQFD
ioctl to create an irqfd for injecting PCI interrupts into the guest.
The IOHUB object forwards the irqfd to the remote process, which uses
this fd to send interrupts directly to the guest, bypassing QEMU.

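For illustration, the proxy's half of the handshake reduces to this
sketch (hypothetical helper; registering dev->intr with KVM via
kvm_irqchip_add_irqfd_notifier_gsi() is elided; see setup_irqfd() in
hw/remote/proxy.c):

/* Sketch only: hand the irqfd/resamplefd pair to the remote process. */
static void send_irqfds_sketch(PCIProxyDev *dev, Error **errp)
{
    ERRP_GUARD();
    MPQemuMsg msg = {
        .cmd = MPQEMU_CMD_SET_IRQFD,
        .num_fds = 2,
        .fds = {
            event_notifier_get_fd(&dev->intr),     /* remote raises IRQs */
            event_notifier_get_fd(&dev->resample), /* INTx EOI resampling */
        },
    };

    if (!mpqemu_msg_send(&msg, dev->ioc, errp)) {
        error_prepend(errp, "Failed to send irqfds to remote process: ");
    }
}
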
Signed-off-by: John G Johnson 
Signed-off-by: Jagannathan Raman 
Signed-off-by: Elena Ufimtseva 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
51d5c3d54e28a68b002e3875c59599c9f5a424a1.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS |   2 +
 include/hw/pci/pci_ids.h|   3 +
 include/hw/remote/iohub.h   |  42 +++
 include/hw/remote/machine.h |   2 +
 include/hw/remote/mpqemu-link.h |   1 +
 include/hw/remote/proxy.h   |   4 ++
 hw/remote/iohub.c   | 119 
 hw/remote/machine.c |  10 +++
 hw/remote/message.c |   4 ++
 hw/remote/mpqemu-link.c |   5 ++
 hw/remote/proxy.c   |  56 +++
 hw/remote/meson.build   |   1 +
 12 files changed, 249 insertions(+)
 create mode 100644 include/hw/remote/iohub.h
 create mode 100644 hw/remote/iohub.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3817e807b1..e6f1eca30f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3221,6 +3221,8 @@ F: hw/remote/proxy.c
 F: include/hw/remote/proxy.h
 F: hw/remote/proxy-memory-listener.c
 F: include/hw/remote/proxy-memory-listener.h
+F: hw/remote/iohub.c
+F: include/hw/remote/iohub.h
 
 Build and test automation
-------------------------
diff --git a/include/hw/pci/pci_ids.h b/include/hw/pci/pci_ids.h
index 11f8ab7149..bd0c17dc78 100644
--- a/include/hw/pci/pci_ids.h
+++ b/include/hw/pci/pci_ids.h
@@ -192,6 +192,9 @@
 #define PCI_DEVICE_ID_SUN_SIMBA  0x5000
 #define PCI_DEVICE_ID_SUN_SABRE  0xa000
 
+#define PCI_VENDOR_ID_ORACLE 0x108e
+#define PCI_DEVICE_ID_REMOTE_IOHUB   0xb000
+
 #define PCI_VENDOR_ID_CMD        0x1095
 #define PCI_DEVICE_ID_CMD_646    0x0646
 
diff --git a/include/hw/remote/iohub.h b/include/hw/remote/iohub.h
new file mode 100644
index 00..0bf98e0d78
--- /dev/null
+++ b/include/hw/remote/iohub.h
@@ -0,0 +1,42 @@
+/*
+ * IO Hub for remote device
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef REMOTE_IOHUB_H
+#define REMOTE_IOHUB_H
+
+#include "hw/pci/pci.h"
+#include "qemu/event_notifier.h"
+#include "qemu/thread-posix.h"
+#include "hw/remote/mpqemu-link.h"
+
+#define REMOTE_IOHUB_NB_PIRQS    PCI_DEVFN_MAX
+
+typedef struct ResampleToken {
+void *iohub;
+int pirq;
+} ResampleToken;
+
+typedef struct RemoteIOHubState {
+PCIDevice d;
+EventNotifier irqfds[REMOTE_IOHUB_NB_PIRQS];
+EventNotifier resamplefds[REMOTE_IOHUB_NB_PIRQS];
+unsigned int irq_level[REMOTE_IOHUB_NB_PIRQS];
+ResampleToken token[REMOTE_IOHUB_NB_PIRQS];
+QemuMutex irq_level_lock[REMOTE_IOHUB_NB_PIRQS];
+} RemoteIOHubState;
+
+int remote_iohub_map_irq(PCIDevice *pci_dev, int intx);
+void remote_iohub_set_irq(void *opaque, int pirq, int level);
+void process_set_irqfd_msg(PCIDevice *pci_dev, MPQemuMsg *msg);
+
+void remote_iohub_init(RemoteIOHubState *iohub);
+void remote_iohub_finalize(RemoteIOHubState *iohub);
+
+#endif
diff --git a/include/hw/remote/machine.h b/include/hw/remote/machine.h
index b92b2ce705..2a2a33c4b2 100644
--- a/include/hw/remote/machine.h
+++ b/include/hw/remote/machine.h
@@ -15,11 +15,13 @@
 #include "hw/boards.h"
 #include "hw/pci-host/remote.h"
 #include "io/channel.h"
+#include "hw/remote/iohub.h"
 
 struct RemoteMachineState {
 MachineState parent_obj;
 
 RemotePCIHost *host;
+RemoteIOHubState iohub;
 };
 
 /* Used to pass to co-routine device and ioc. */
diff --git a/include/hw/remote/mpqemu-link.h b/include/hw/remote/mpqemu-link.h
index 6303e62b17..71d206f00e 100644
--- a/include/hw/remote/mpqemu-link.h
+++ b/include/hw/remote/mpqemu-link.h
@@ -39,6 +39,7 @@ typedef enum {
 MPQEMU_CMD_PCI_CFGREAD,
 MPQEMU_CMD_BAR_WRITE,
 MPQEMU_CMD_BAR_READ,
+MPQEMU_CMD_SET_IRQFD,
 MPQEMU_CMD_MAX,
 } MPQemuCmd;
 
diff --git a/include/hw/remote/proxy.h b/include/hw/remote/proxy.h
index 12888b4f90..741def71f1 100644
--- a/include/hw/remote/proxy.h
+++ b/include/hw/remote/proxy.h
@@ -12,6 +12,7 @@
 #include "hw/pci/pci.h"
 #include "io/channel.h"
 #include "hw/remote/proxy-memory-listener.h"
+#include "qemu/event_notifier.h"
 
 #define TYPE_PCI_PROXY_DEV "x-pci-proxy-dev"
 OBJECT_DECLARE_SIMPLE_TYPE(PCIProxyDev, PCI_PROXY_DEV)
@@ -38,6 +39,9 @@ struct PCIProxyDev {
 QIOChannel *ioc;
 Error *migration_blocker;
 ProxyMemoryListener proxy_listener;
+int virq;
+EventNotifier intr;
+EventNotifier resample;
 ProxyMemoryRegion region[PCI_NUM_REGIONS];
 };
 
diff --git a/hw/remote/i

[PULL v4 17/27] multi-process: Associate fd of a PCIDevice with its object

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

Associate the file descriptor for a PCIDevice in the remote process
with its DeviceState object.

Signed-off-by: Elena Ufimtseva 
Signed-off-by: John G Johnson 
Signed-off-by: Jagannathan Raman 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
f405a2ed5d7518b87bea7c59cfdf334d67e5ee51.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS|   1 +
 hw/remote/remote-obj.c | 203 +
 hw/remote/meson.build  |   1 +
 3 files changed, 205 insertions(+)
 create mode 100644 hw/remote/remote-obj.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 8d2693525c..bcbb5a100c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3214,6 +3214,7 @@ F: include/hw/remote/machine.h
 F: hw/remote/mpqemu-link.c
 F: include/hw/remote/mpqemu-link.h
 F: hw/remote/message.c
+F: hw/remote/remote-obj.c
 
 Build and test automation
-------------------------
diff --git a/hw/remote/remote-obj.c b/hw/remote/remote-obj.c
new file mode 100644
index 00..4f21254219
--- /dev/null
+++ b/hw/remote/remote-obj.c
@@ -0,0 +1,203 @@
+/*
+ * Copyright © 2020, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ *
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "qemu/error-report.h"
+#include "qemu/notify.h"
+#include "qom/object_interfaces.h"
+#include "hw/qdev-core.h"
+#include "io/channel.h"
+#include "hw/qdev-core.h"
+#include "hw/remote/machine.h"
+#include "io/channel-util.h"
+#include "qapi/error.h"
+#include "sysemu/sysemu.h"
+#include "hw/pci/pci.h"
+#include "qemu/sockets.h"
+#include "monitor/monitor.h"
+
+#define TYPE_REMOTE_OBJECT "x-remote-object"
+OBJECT_DECLARE_TYPE(RemoteObject, RemoteObjectClass, REMOTE_OBJECT)
+
+struct RemoteObjectClass {
+ObjectClass parent_class;
+
+unsigned int nr_devs;
+unsigned int max_devs;
+};
+
+struct RemoteObject {
+/* private */
+Object parent;
+
+Notifier machine_done;
+
+int32_t fd;
+char *devid;
+
+QIOChannel *ioc;
+
+DeviceState *dev;
+DeviceListener listener;
+};
+
+static void remote_object_set_fd(Object *obj, const char *str, Error **errp)
+{
+RemoteObject *o = REMOTE_OBJECT(obj);
+int fd = -1;
+
+fd = monitor_fd_param(monitor_cur(), str, errp);
+if (fd == -1) {
+error_prepend(errp, "Could not parse remote object fd %s:", str);
+return;
+}
+
+if (!fd_is_socket(fd)) {
+error_setg(errp, "File descriptor '%s' is not a socket", str);
+close(fd);
+return;
+}
+
+o->fd = fd;
+}
+
+static void remote_object_set_devid(Object *obj, const char *str, Error **errp)
+{
+RemoteObject *o = REMOTE_OBJECT(obj);
+
+g_free(o->devid);
+
+o->devid = g_strdup(str);
+}
+
+static void remote_object_unrealize_listener(DeviceListener *listener,
+ DeviceState *dev)
+{
+RemoteObject *o = container_of(listener, RemoteObject, listener);
+
+if (o->dev == dev) {
+object_unref(OBJECT(o));
+}
+}
+
+static void remote_object_machine_done(Notifier *notifier, void *data)
+{
+RemoteObject *o = container_of(notifier, RemoteObject, machine_done);
+DeviceState *dev = NULL;
+QIOChannel *ioc = NULL;
+Coroutine *co = NULL;
+RemoteCommDev *comdev = NULL;
+Error *err = NULL;
+
+dev = qdev_find_recursive(sysbus_get_default(), o->devid);
+if (!dev || !object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
+error_report("%s is not a PCI device", o->devid);
+return;
+}
+
+ioc = qio_channel_new_fd(o->fd, &err);
+if (!ioc) {
+error_report_err(err);
+return;
+}
+qio_channel_set_blocking(ioc, false, NULL);
+
+o->dev = dev;
+
+o->listener.unrealize = remote_object_unrealize_listener;
+device_listener_register(&o->listener);
+
+/* co-routine should free this. */
+comdev = g_new0(RemoteCommDev, 1);
+*comdev = (RemoteCommDev) {
+.ioc = ioc,
+.dev = PCI_DEVICE(dev),
+};
+
+co = qemu_coroutine_create(mpqemu_remote_msg_loop_co, comdev);
+qemu_coroutine_enter(co);
+}
+
+static void remote_object_init(Object *obj)
+{
+RemoteObjectClass *k = REMOTE_OBJECT_GET_CLASS(obj);
+RemoteObject *o = REMOTE_OBJECT(obj);
+
+if (k->nr_devs >= k->max_devs) {
+error_report("Reached maximum number of devices: %u", k->max_devs);
+return;
+}
+
+o->ioc = NULL;
+o->fd = -1;
+o->devid = NULL;
+
+k->nr_devs++;
+
+o->machine_done.notify = remote_object_machine_done;
+qemu_add_machine_init_done_notifier(&o->machine_done);
+}
+
+static void remote_object_finalize(Object *obj)
+{
+RemoteObjectClass *k = REMOTE_OBJECT_GET_CLASS(obj);
+RemoteObject *o = REMOTE_OBJECT(obj);
+
+device_listener_unregister(&o->listener);
+
+if (o->ioc) {
+qio_channel_shutdown(o->i

[PULL v4 26/27] multi-process: perform device reset in the remote process

2021-02-10 Thread Stefan Hajnoczi
From: Elena Ufimtseva 

Perform device reset in the remote process when QEMU performs device
reset. This is required to reset the internal state (registers,
etc.) of emulated devices.

Signed-off-by: Elena Ufimtseva 
Signed-off-by: John G Johnson 
Signed-off-by: Jagannathan Raman 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
7cb220a51f565dc0817bd76e2f540e89c2d2b850.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 include/hw/remote/mpqemu-link.h |  1 +
 hw/remote/message.c | 22 ++
 hw/remote/proxy.c   | 19 +++
 3 files changed, 42 insertions(+)

diff --git a/include/hw/remote/mpqemu-link.h b/include/hw/remote/mpqemu-link.h
index 71d206f00e..4ec0915885 100644
--- a/include/hw/remote/mpqemu-link.h
+++ b/include/hw/remote/mpqemu-link.h
@@ -40,6 +40,7 @@ typedef enum {
 MPQEMU_CMD_BAR_WRITE,
 MPQEMU_CMD_BAR_READ,
 MPQEMU_CMD_SET_IRQFD,
+MPQEMU_CMD_DEVICE_RESET,
 MPQEMU_CMD_MAX,
 } MPQemuCmd;
 
diff --git a/hw/remote/message.c b/hw/remote/message.c
index adab040ca1..11d729845c 100644
--- a/hw/remote/message.c
+++ b/hw/remote/message.c
@@ -19,6 +19,7 @@
 #include "exec/memattrs.h"
 #include "hw/remote/memory.h"
 #include "hw/remote/iohub.h"
+#include "sysemu/reset.h"
 
 static void process_config_write(QIOChannel *ioc, PCIDevice *dev,
  MPQemuMsg *msg, Error **errp);
@@ -26,6 +27,8 @@ static void process_config_read(QIOChannel *ioc, PCIDevice *dev,
 MPQemuMsg *msg, Error **errp);
 static void process_bar_write(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
 static void process_bar_read(QIOChannel *ioc, MPQemuMsg *msg, Error **errp);
+static void process_device_reset_msg(QIOChannel *ioc, PCIDevice *dev,
+ Error **errp);
 
 void coroutine_fn mpqemu_remote_msg_loop_co(void *data)
 {
@@ -69,6 +72,9 @@ void coroutine_fn mpqemu_remote_msg_loop_co(void *data)
 case MPQEMU_CMD_SET_IRQFD:
 process_set_irqfd_msg(pci_dev, &msg);
 break;
+case MPQEMU_CMD_DEVICE_RESET:
+process_device_reset_msg(com->ioc, pci_dev, &local_err);
+break;
 default:
 error_setg(&local_err,
"Unknown command (%d) received for device %s"
@@ -206,3 +212,19 @@ fail:
   getpid());
 }
 }
+
+static void process_device_reset_msg(QIOChannel *ioc, PCIDevice *dev,
+ Error **errp)
+{
+DeviceClass *dc = DEVICE_GET_CLASS(dev);
+DeviceState *s = DEVICE(dev);
+MPQemuMsg ret = { 0 };
+
+if (dc->reset) {
+dc->reset(s);
+}
+
+ret.cmd = MPQEMU_CMD_RET;
+
+mpqemu_msg_send(&ret, ioc, errp);
+}
diff --git a/hw/remote/proxy.c b/hw/remote/proxy.c
index a082709881..4fa4be079d 100644
--- a/hw/remote/proxy.c
+++ b/hw/remote/proxy.c
@@ -26,6 +26,7 @@
 #include "util/event_notifier-posix.c"
 
 static void probe_pci_info(PCIDevice *dev, Error **errp);
+static void proxy_device_reset(DeviceState *dev);
 
 static void proxy_intx_update(PCIDevice *pci_dev)
 {
@@ -202,6 +203,8 @@ static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
 k->config_read = pci_proxy_read_config;
 k->config_write = pci_proxy_write_config;
 
+dc->reset = proxy_device_reset;
+
 device_class_set_props(dc, proxy_properties);
 }
 
@@ -358,3 +361,19 @@ static void probe_pci_info(PCIDevice *dev, Error **errp)
 }
 }
 }
+
+static void proxy_device_reset(DeviceState *dev)
+{
+PCIProxyDev *pdev = PCI_PROXY_DEV(dev);
+MPQemuMsg msg = { 0 };
+Error *local_err = NULL;
+
+msg.cmd = MPQEMU_CMD_DEVICE_RESET;
+msg.size = 0;
+
+mpqemu_msg_send_and_await_reply(&msg, pdev, &local_err);
+if (local_err) {
+error_report_err(local_err);
+}
+}
-- 
2.29.2



[PATCH] iotests: Consistent $IMGOPTS boundary matching

2021-02-10 Thread Max Reitz
To disallow certain refcount_bits values, some _unsupported_imgopts
invocations look like "refcount_bits=1[^0-9]", i.e. they match an
integer boundary with [^0-9].  This expression does not match the end of
the string, though, so it breaks down when refcount_bits is the last
option (which it tends to be after the rewrite of the check script in
Python).

Those invocations could use \b or \> instead, but those are not
portable.  They could use something like \([^0-9]\|$\), but that would
be cumbersome.  To make it simple and keep the existing invocations
working, just let _unsupported_imgopts match the regex against $IMGOPTS
plus a trailing space.

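To illustrate the boundary problem and the fix (the second echo mimics
the trailing space that _unsupported_imgopts now appends):

$ echo "compat=1.1,refcount_bits=1" | grep -c "refcount_bits=1[^0-9]"
0
$ echo "compat=1.1,refcount_bits=1 " | grep -c "refcount_bits=1[^0-9]"
1
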
Suggested-by: Eric Blake 
Signed-off-by: Max Reitz 
---
Supersedes "iotests: Fix unsupported_imgopts for refcount_bits", and can
be reproduced in the same way:

$ ./check -qcow2 -o refcount_bits=1 7 15 29 58 62 66 68 80

(those tests should be skipped)
---
 tests/qemu-iotests/common.rc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
index 77c37e8312..65cdba5723 100644
--- a/tests/qemu-iotests/common.rc
+++ b/tests/qemu-iotests/common.rc
@@ -885,7 +885,9 @@ _unsupported_imgopts()
 {
 for bad_opt
 do
-if echo "$IMGOPTS" | grep -q 2>/dev/null "$bad_opt"
+# Add a space so tests can match for whitespace that marks the
+# end of an option (\b or \> are not portable)
+if echo "$IMGOPTS " | grep -q 2>/dev/null "$bad_opt"
 then
 _notrun "not suitable for image option: $bad_opt"
 fi
-- 
2.29.2




Re: [PATCH 0/7] qcow2: compressed write cache

2021-02-10 Thread Max Reitz

On 09.02.21 19:51, Vladimir Sementsov-Ogievskiy wrote:

09.02.2021 21:41, Denis V. Lunev wrote:

On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:

09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:

09.02.2021 17:47, Max Reitz wrote:

On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:

09.02.2021 16:25, Max Reitz wrote:

On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

I know, I have several series waiting for a resend, but I had to
switch
to another task spawned from our customer's bug.

Original problem: we use O_DIRECT for all vm images in our
product, it's
the policy. The only exclusion is backup target qcow2 image for
compressed backup, because compressed backup is extremely slow with
O_DIRECT (due to unaligned writes). Customer complains that backup
produces a lot of pagecache.

So we can either implement some internal cache or use fadvise
somehow.
Backup has several async workers, which write simultaneously, so
in both
ways we have to track host cluster filling (before dropping the
cache
corresponding to the cluster).  So, if we have to track anyway,
let's
try to implement the cache.


I wanted to be excited here, because that sounds like it would be
very easy to implement caching.  Like, just keep the cluster at
free_byte_offset cached until the cluster it points to changes,
then flush the cluster.


The problem is that chunks are written asynchronously.. That's why
this all is not so easy.



But then I see like 900 new lines of code, and I’m much less
excited...


Idea is simple: cache small unaligned write and flush the cluster
when
filled.

Performance results are very good (the table gives the time of a
compressed backup of a 1000M disk filled with ones, in seconds):


“Filled with ones” really is an edge case, though.


Yes, I think, all clusters are compressed to rather small chunks :)




---------------  -----------  -----------
                 backup(old)  backup(new)
ssd:hdd(direct)  3e+02        4.4 (-99%)
ssd:hdd(cached)  5.7          5.4 (-5%)
---------------  -----------  -----------

So, we have benefit even for cached mode! And the fastest thing is
O_DIRECT with new implemented cache. So, I suggest to enable the 
new

cache by default (which is done by the series).


First, I’m not sure how O_DIRECT really is relevant, because I
don’t really see the point for writing compressed images.


compressed backup is a point


(Perhaps irrelevant, but just to be clear:) I meant the point of
using O_DIRECT, which one can decide to not use for backup targets
(as you have done already).


Second, I find it a bit cheating if you say there is a huge
improvement for the no-cache case, when actually, well, you just
added a cache.  So the no-cache case just became faster because
there is a cache now.


Still, performance comparison is relevant to show that O_DIRECT as
is unusable for compressed backup.


(Again, perhaps irrelevant, but:) Yes, but my first point was
exactly whether O_DIRECT is even relevant for writing compressed
images.


Well, I suppose I could follow that if O_DIRECT doesn’t make much
sense for compressed images, qemu’s format drivers are free to
introduce some caching (because technically the cache.direct
option only applies to the protocol driver) for collecting
compressed writes.


Yes I thought in this way, enabling the cache by default.


That conclusion makes both of my complaints kind of moot.

*shrug*

Third, what is the real-world impact on the page cache?  You
described that that’s the reason why you need the cache in qemu,
because otherwise the page cache is polluted too much.  How much
is the difference really?  (I don’t know how good the compression
ratio is for real-world images.)


Hm. I don't know the ratio.. Customer reported that most of RAM is
polluted by Qemu's cache, and we use O_DIRECT for everything except
for target of compressed backup.. Still the pollution may relate to
several backups and of course it is simple enough to drop the cache
after each backup. But I think that even one backup of 16T disk may
pollute RAM enough.


Oh, sorry, I just realized I had a brain fart there.  I was
referring to whether this series improves the page cache pollution.
But obviously it will if it allows you to re-enable O_DIRECT.


Related to that, I remember a long time ago we had some discussion
about letting qemu-img convert set a special cache mode for the
target image that would make Linux drop everything before the last
offset written (i.e., I suppose fadvise() with
POSIX_FADV_SEQUENTIAL).  You discarded that idea based on the fact
that implementing a cache in qemu would be simple, but it isn’t,
really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
advantage of using that would be that we could reuse it for
non-compressed images that are written by backup or qemu-img
convert.)


The problem is that writes are async. And therefore, not sequential.


In theory, yes, bu

Re: [PATCH 0/7] qcow2: compressed write cache

2021-02-10 Thread Max Reitz

On 09.02.21 17:52, Denis V. Lunev wrote:

On 2/9/21 5:47 PM, Max Reitz wrote:

On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:

09.02.2021 16:25, Max Reitz wrote:

On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

I know, I have several series waiting for a resend, but I had to
switch
to another task spawned from our customer's bug.

Original problem: we use O_DIRECT for all vm images in our product,
it's
the policy. The only exclusion is backup target qcow2 image for
compressed backup, because compressed backup is extremely slow with
O_DIRECT (due to unaligned writes). Customer complains that backup
produces a lot of pagecache.

So we can either implement some internal cache or use fadvise somehow.
Backup has several async workers, which write simultaneously, so in
both
ways we have to track host cluster filling (before dropping the cache
corresponding to the cluster).  So, if we have to track anyway, let's
try to implement the cache.


I wanted to be excited here, because that sounds like it would be
very easy to implement caching.  Like, just keep the cluster at
free_byte_offset cached until the cluster it points to changes, then
flush the cluster.


The problem is that chunks are written asynchronously.. That's why
this all is not so easy.



But then I see like 900 new lines of code, and I’m much less excited...


Idea is simple: cache small unaligned write and flush the cluster when
filled.

Performance results are very good (the table gives the time of a
compressed backup of a 1000M disk filled with ones, in seconds):


“Filled with ones” really is an edge case, though.


Yes, I think, all clusters are compressed to rather small chunks :)




---------------  -----------  -----------
                 backup(old)  backup(new)
ssd:hdd(direct)  3e+02        4.4 (-99%)
ssd:hdd(cached)  5.7          5.4 (-5%)
---------------  -----------  -----------

So, we have benefit even for cached mode! And the fastest thing is
O_DIRECT with new implemented cache. So, I suggest to enable the new
cache by default (which is done by the series).


First, I’m not sure how O_DIRECT really is relevant, because I don’t
really see the point for writing compressed images.


compressed backup is a point


(Perhaps irrelevant, but just to be clear:) I meant the point of using
O_DIRECT, which one can decide to not use for backup targets (as you
have done already).


Second, I find it a bit cheating if you say there is a huge
improvement for the no-cache case, when actually, well, you just
added a cache.  So the no-cache case just became faster because
there is a cache now.


Still, performance comparison is relevant to show that O_DIRECT as is
unusable for compressed backup.


(Again, perhaps irrelevant, but:) Yes, but my first point was exactly
whether O_DIRECT is even relevant for writing compressed images.


Well, I suppose I could follow that if O_DIRECT doesn’t make much
sense for compressed images, qemu’s format drivers are free to
introduce some caching (because technically the cache.direct option
only applies to the protocol driver) for collecting compressed writes.


Yes I thought in this way, enabling the cache by default.


That conclusion makes both of my complaints kind of moot.

*shrug*

Third, what is the real-world impact on the page cache?  You
described that that’s the reason why you need the cache in qemu,
because otherwise the page cache is polluted too much.  How much is
the difference really?  (I don’t know how good the compression ratio
is for real-world images.)


Hm. I don't know the ratio.. Customer reported that most of RAM is
polluted by Qemu's cache, and we use O_DIRECT for everything except
for target of compressed backup.. Still the pollution may relate to
several backups and of course it is simple enough to drop the cache
after each backup. But I think that even one backup of 16T disk may
pollute RAM enough.


Oh, sorry, I just realized I had a brain fart there.  I was referring
to whether this series improves the page cache pollution.  But
obviously it will if it allows you to re-enable O_DIRECT.


Related to that, I remember a long time ago we had some discussion
about letting qemu-img convert set a special cache mode for the
target image that would make Linux drop everything before the last
offset written (i.e., I suppose fadvise() with
POSIX_FADV_SEQUENTIAL).  You discarded that idea based on the fact
that implementing a cache in qemu would be simple, but it isn’t,
really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
advantage of using that would be that we could reuse it for
non-compressed images that are written by backup or qemu-img convert.)


The problem is that writes are async. And therefore, not sequential.


In theory, yes, but all compressed writes still goes through
qcow2_alloc_bytes() right before submitting the write, so I wonder
whether in practice the writes aren’t usually sufficient

[PULL v4 19/27] multi-process: introduce proxy object

2021-02-10 Thread Stefan Hajnoczi
From: Elena Ufimtseva 

Defines a PCI Device proxy object as a child of TYPE_PCI_DEVICE.

Signed-off-by: Elena Ufimtseva 
Signed-off-by: Jagannathan Raman 
Signed-off-by: John G Johnson 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
b5186ebfedf8e557044d09a768846c59230ad3a7.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 MAINTAINERS   |  2 +
 include/hw/remote/proxy.h | 33 +
 hw/remote/proxy.c | 99 +++
 hw/remote/meson.build |  1 +
 4 files changed, 135 insertions(+)
 create mode 100644 include/hw/remote/proxy.h
 create mode 100644 hw/remote/proxy.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 00ea834ed0..aec5d2d076 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3217,6 +3217,8 @@ F: hw/remote/message.c
 F: hw/remote/remote-obj.c
 F: include/hw/remote/memory.h
 F: hw/remote/memory.c
+F: hw/remote/proxy.c
+F: include/hw/remote/proxy.h
 
 Build and test automation
-------------------------
diff --git a/include/hw/remote/proxy.h b/include/hw/remote/proxy.h
new file mode 100644
index 00..faa9c4d580
--- /dev/null
+++ b/include/hw/remote/proxy.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef PROXY_H
+#define PROXY_H
+
+#include "hw/pci/pci.h"
+#include "io/channel.h"
+
+#define TYPE_PCI_PROXY_DEV "x-pci-proxy-dev"
+OBJECT_DECLARE_SIMPLE_TYPE(PCIProxyDev, PCI_PROXY_DEV)
+
+struct PCIProxyDev {
+PCIDevice parent_dev;
+char *fd;
+
+/*
+ * Mutex used to protect the QIOChannel fd from
+ * concurrent access by the vCPUs, since the proxy
+ * blocks while awaiting replies from the
+ * remote process.
+ */
+QemuMutex io_mutex;
+QIOChannel *ioc;
+Error *migration_blocker;
+};
+
+#endif /* PROXY_H */
diff --git a/hw/remote/proxy.c b/hw/remote/proxy.c
new file mode 100644
index 00..cd5b071ab4
--- /dev/null
+++ b/hw/remote/proxy.c
@@ -0,0 +1,99 @@
+/*
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+
+#include "hw/remote/proxy.h"
+#include "hw/pci/pci.h"
+#include "qapi/error.h"
+#include "io/channel-util.h"
+#include "hw/qdev-properties.h"
+#include "monitor/monitor.h"
+#include "migration/blocker.h"
+#include "qemu/sockets.h"
+
+static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
+{
+ERRP_GUARD();
+PCIProxyDev *dev = PCI_PROXY_DEV(device);
+int fd;
+
+if (!dev->fd) {
+error_setg(errp, "fd parameter not specified for %s",
+   DEVICE(device)->id);
+return;
+}
+
+fd = monitor_fd_param(monitor_cur(), dev->fd, errp);
+if (fd == -1) {
+error_prepend(errp, "proxy: unable to parse fd %s: ", dev->fd);
+return;
+}
+
+if (!fd_is_socket(fd)) {
+error_setg(errp, "proxy: fd %d is not a socket", fd);
+close(fd);
+return;
+}
+
+dev->ioc = qio_channel_new_fd(fd, errp);
+
+error_setg(&dev->migration_blocker, "%s does not support migration",
+   TYPE_PCI_PROXY_DEV);
+migrate_add_blocker(dev->migration_blocker, errp);
+
+qemu_mutex_init(&dev->io_mutex);
+qio_channel_set_blocking(dev->ioc, true, NULL);
+}
+
+static void pci_proxy_dev_exit(PCIDevice *pdev)
+{
+PCIProxyDev *dev = PCI_PROXY_DEV(pdev);
+
+if (dev->ioc) {
+qio_channel_close(dev->ioc, NULL);
+}
+
+migrate_del_blocker(dev->migration_blocker);
+
+error_free(dev->migration_blocker);
+}
+
+static Property proxy_properties[] = {
+DEFINE_PROP_STRING("fd", PCIProxyDev, fd),
+DEFINE_PROP_END_OF_LIST(),
+};
+
+static void pci_proxy_dev_class_init(ObjectClass *klass, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(klass);
+PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
+
+k->realize = pci_proxy_dev_realize;
+k->exit = pci_proxy_dev_exit;
+device_class_set_props(dc, proxy_properties);
+}
+
+static const TypeInfo pci_proxy_dev_type_info = {
+.name  = TYPE_PCI_PROXY_DEV,
+.parent= TYPE_PCI_DEVICE,
+.instance_size = sizeof(PCIProxyDev),
+.class_init= pci_proxy_dev_class_init,
+.interfaces = (InterfaceInfo[]) {
+{ INTERFACE_CONVENTIONAL_PCI_DEVICE },
+{ },
+},
+};
+
+static void pci_proxy_dev_register_types(void)
+{
+type_register_static(&pci_proxy_dev_type_info);
+}
+
+type_init(pci_proxy_dev_register_types)
diff --git a/hw/remote/meson.build b/hw/remote/meson.build
index 64da16c1de..569cd20edf 100644
--- a/hw/remote/meson.build
+++ b/hw/remote/meson.build
@@ -4,6 +4,7 @@ remote_ss.add(when: 'CONFIG_MULTIPROCESS', if_true: files('machine.c'))
 remote_ss.add(when: 'CONFIG_MULTIPROCES

[PULL v4 25/27] multi-process: Retrieve PCI info from remote process

2021-02-10 Thread Stefan Hajnoczi
From: Jagannathan Raman 

Retrieve PCI configuration info about the remote device and
configure the proxy PCI object based on the returned information.

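The BAR probing below uses the standard size-discovery handshake: save
the BAR, write all-ones, read it back, compute the size from the bits
that stayed zero, then restore the saved value. A worked example with an
illustrative readback value:

uint32_t new_val = 0xfffff000;                  /* example readback */
uint32_t size = (~(new_val & 0xFFFFFFF0)) + 1;  /* == 0x1000: 4 KiB BAR */
bool is_io = new_val & 0x1;                     /* bit 0 set: I/O BAR */
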
Signed-off-by: Elena Ufimtseva 
Signed-off-by: John G Johnson 
Signed-off-by: Jagannathan Raman 
Reviewed-by: Stefan Hajnoczi 
Message-id: 
85ee367bbb993aa23699b44cfedd83b4ea6d5221.1611938319.git.jag.ra...@oracle.com
Signed-off-by: Stefan Hajnoczi 
---
 hw/remote/proxy.c | 84 +++
 1 file changed, 84 insertions(+)

diff --git a/hw/remote/proxy.c b/hw/remote/proxy.c
index 555b3103f4..a082709881 100644
--- a/hw/remote/proxy.c
+++ b/hw/remote/proxy.c
@@ -25,6 +25,8 @@
 #include "sysemu/kvm.h"
 #include "util/event_notifier-posix.c"
 
+static void probe_pci_info(PCIDevice *dev, Error **errp);
+
 static void proxy_intx_update(PCIDevice *pci_dev)
 {
 PCIProxyDev *dev = PCI_PROXY_DEV(pci_dev);
@@ -77,6 +79,7 @@ static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
 {
 ERRP_GUARD();
 PCIProxyDev *dev = PCI_PROXY_DEV(device);
+uint8_t *pci_conf = device->config;
 int fd;
 
 if (!dev->fd) {
@@ -106,9 +109,14 @@ static void pci_proxy_dev_realize(PCIDevice *device, Error **errp)
 qemu_mutex_init(&dev->io_mutex);
 qio_channel_set_blocking(dev->ioc, true, NULL);
 
+pci_conf[PCI_LATENCY_TIMER] = 0xff;
+pci_conf[PCI_INTERRUPT_PIN] = 0x01;
+
 proxy_memory_listener_configure(&dev->proxy_listener, dev->ioc);
 
 setup_irqfd(dev);
+
+probe_pci_info(PCI_DEVICE(dev), errp);
 }
 
 static void pci_proxy_dev_exit(PCIDevice *pdev)
@@ -274,3 +282,79 @@ const MemoryRegionOps proxy_mr_ops = {
 .max_access_size = 8,
 },
 };
+
+static void probe_pci_info(PCIDevice *dev, Error **errp)
+{
+PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
+uint32_t orig_val, new_val, base_class, val;
+PCIProxyDev *pdev = PCI_PROXY_DEV(dev);
+DeviceClass *dc = DEVICE_CLASS(pc);
+uint8_t type;
+int i, size;
+
+config_op_send(pdev, PCI_VENDOR_ID, &val, 2, MPQEMU_CMD_PCI_CFGREAD);
+pc->vendor_id = (uint16_t)val;
+
+config_op_send(pdev, PCI_DEVICE_ID, &val, 2, MPQEMU_CMD_PCI_CFGREAD);
+pc->device_id = (uint16_t)val;
+
+config_op_send(pdev, PCI_CLASS_DEVICE, &val, 2, MPQEMU_CMD_PCI_CFGREAD);
+pc->class_id = (uint16_t)val;
+
+config_op_send(pdev, PCI_SUBSYSTEM_ID, &val, 2, MPQEMU_CMD_PCI_CFGREAD);
+pc->subsystem_id = (uint16_t)val;
+
+base_class = pc->class_id >> 8;
+switch (base_class) {
+case PCI_BASE_CLASS_BRIDGE:
+set_bit(DEVICE_CATEGORY_BRIDGE, dc->categories);
+break;
+case PCI_BASE_CLASS_STORAGE:
+set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
+break;
+case PCI_BASE_CLASS_NETWORK:
+set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
+break;
+case PCI_BASE_CLASS_INPUT:
+set_bit(DEVICE_CATEGORY_INPUT, dc->categories);
+break;
+case PCI_BASE_CLASS_DISPLAY:
+set_bit(DEVICE_CATEGORY_DISPLAY, dc->categories);
+break;
+case PCI_BASE_CLASS_PROCESSOR:
+set_bit(DEVICE_CATEGORY_CPU, dc->categories);
+break;
+default:
+set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+break;
+}
+
+for (i = 0; i < PCI_NUM_REGIONS; i++) {
+config_op_send(pdev, PCI_BASE_ADDRESS_0 + (4 * i), &orig_val, 4,
+   MPQEMU_CMD_PCI_CFGREAD);
+new_val = 0xffffffff;
+config_op_send(pdev, PCI_BASE_ADDRESS_0 + (4 * i), &new_val, 4,
+   MPQEMU_CMD_PCI_CFGWRITE);
+config_op_send(pdev, PCI_BASE_ADDRESS_0 + (4 * i), &new_val, 4,
+   MPQEMU_CMD_PCI_CFGREAD);
+size = (~(new_val & 0xFFFFFFF0)) + 1;
+config_op_send(pdev, PCI_BASE_ADDRESS_0 + (4 * i), &orig_val, 4,
+   MPQEMU_CMD_PCI_CFGWRITE);
+type = (new_val & 0x1) ?
+   PCI_BASE_ADDRESS_SPACE_IO : PCI_BASE_ADDRESS_SPACE_MEMORY;
+
+if (size) {
+g_autofree char *name;
+pdev->region[i].dev = pdev;
+pdev->region[i].present = true;
+if (type == PCI_BASE_ADDRESS_SPACE_MEMORY) {
+pdev->region[i].memory = true;
+}
+name = g_strdup_printf("bar-region-%d", i);
+memory_region_init_io(&pdev->region[i].mr, OBJECT(pdev),
+  &proxy_mr_ops, &pdev->region[i],
+  name, size);
+pci_register_bar(dev, i, type, &pdev->region[i].mr);
+}
+}
+}
-- 
2.29.2



[PULL v4 27/27] docs: fix Parallels Image "dirty bitmap" section

2021-02-10 Thread Stefan Hajnoczi
From: "Denis V. Lunev" 

The original specification says that the L1 table size is 64 * l1_size,
which is obviously wrong. The size of an L1 entry is 64 _bits_, not
bytes. Thus 64 is to be replaced with 8, as the specification counts
bytes.

There is also a minor tweak: the field is renamed from l1 to l1_table,
which matches the later text.

Signed-off-by: Denis V. Lunev 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-id: 20210128171313.2210947-1-...@openvz.org
CC: Stefan Hajnoczi 
CC: Vladimir Sementsov-Ogievskiy 

[Replace the original commit message "docs: fix mistake in dirty bitmap
feature description" as suggested by Eric Blake.
--Stefan]

Signed-off-by: Stefan Hajnoczi 
---
 docs/interop/parallels.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/interop/parallels.txt b/docs/interop/parallels.txt
index e9271eba5d..f15bf35bd1 100644
--- a/docs/interop/parallels.txt
+++ b/docs/interop/parallels.txt
@@ -208,7 +208,7 @@ of its data area are:
   28 - 31:l1_size
   The number of entries in the L1 table of the bitmap.
 
-  variable:   l1 (64 * l1_size bytes)
+  variable:   l1_table (8 * l1_size bytes)
   L1 offset table (in bytes)
 
 A dirty bitmap is stored using a one-level structure for the mapping to host
-- 
2.29.2



[PATCH] hw/block: nvme: Fix a build error in nvme_process_sq()

2021-02-10 Thread Bin Meng
From: Bin Meng 

Current QEMU HEAD nvme.c does not compile:

  hw/block/nvme.c: In function ‘nvme_process_sq’:
  hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 ^
  hw/block/nvme.c:3150:14: note: ‘result’ was declared here
 uint32_t result;
  ^

Explicitly initialize the result to fix it.

Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
Signed-off-by: Bin Meng 
---

 hw/block/nvme.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5ce21b7..c122ac0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -3228,6 +3228,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
 result = ns->features.err_rec;
 goto out;
 case NVME_VOLATILE_WRITE_CACHE:
+result = 0;
 for (i = 1; i <= n->num_namespaces; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
-- 
2.7.4




Re: [PATCH 0/7] qcow2: compressed write cache

2021-02-10 Thread Vladimir Sementsov-Ogievskiy

10.02.2021 13:00, Max Reitz wrote:

On 09.02.21 19:51, Vladimir Sementsov-Ogievskiy wrote:

09.02.2021 21:41, Denis V. Lunev wrote:

On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:

09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:

09.02.2021 17:47, Max Reitz wrote:

On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:

09.02.2021 16:25, Max Reitz wrote:

On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

I know, I have several series waiting for a resend, but I had to
switch
to another task spawned from our customer's bug.

Original problem: we use O_DIRECT for all vm images in our
product, it's
the policy. The only exclusion is backup target qcow2 image for
compressed backup, because compressed backup is extremely slow with
O_DIRECT (due to unaligned writes). Customer complains that backup
produces a lot of pagecache.

So we can either implement some internal cache or use fadvise
somehow.
Backup has several async workers, which write simultaneously, so
in both
ways we have to track host cluster filling (before dropping the
cache
corresponding to the cluster).  So, if we have to track anyway,
let's
try to implement the cache.


I wanted to be excited here, because that sounds like it would be
very easy to implement caching.  Like, just keep the cluster at
free_byte_offset cached until the cluster it points to changes,
then flush the cluster.


The problem is that chunks are written asynchronously.. That's why
this all is not so easy.



But then I see like 900 new lines of code, and I’m much less
excited...


Idea is simple: cache small unaligned write and flush the cluster
when
filled.

Performance result is very good (results in a table is time of
compressed backup of 1000M disk filled with ones in seconds):


“Filled with ones” really is an edge case, though.


Yes, I think, all clusters are compressed to rather small chunks :)




---------------  -----------  -----------
                 backup(old)  backup(new)
ssd:hdd(direct)  3e+02        4.4 (-99%)
ssd:hdd(cached)  5.7          5.4 (-5%)
---------------  -----------  -----------

So, we have benefit even for cached mode! And the fastest thing is
O_DIRECT with new implemented cache. So, I suggest to enable the new
cache by default (which is done by the series).


First, I’m not sure how O_DIRECT really is relevant, because I
don’t really see the point for writing compressed images.


compressed backup is a point


(Perhaps irrelevant, but just to be clear:) I meant the point of
using O_DIRECT, which one can decide to not use for backup targets
(as you have done already).


Second, I find it a bit cheating if you say there is a huge
improvement for the no-cache case, when actually, well, you just
added a cache.  So the no-cache case just became faster because
there is a cache now.


Still, performance comparison is relevant to show that O_DIRECT as
is unusable for compressed backup.


(Again, perhaps irrelevant, but:) Yes, but my first point was
exactly whether O_DIRECT is even relevant for writing compressed
images.


Well, I suppose I could follow that if O_DIRECT doesn’t make much
sense for compressed images, qemu’s format drivers are free to
introduce some caching (because technically the cache.direct
option only applies to the protocol driver) for collecting
compressed writes.


Yes I thought in this way, enabling the cache by default.


That conclusion makes both of my complaints kind of moot.

*shrug*

Third, what is the real-world impact on the page cache?  You
described that that’s the reason why you need the cache in qemu,
because otherwise the page cache is polluted too much.  How much
is the difference really?  (I don’t know how good the compression
ratio is for real-world images.)


Hm. I don't know the ratio.. Customer reported that most of RAM is
polluted by Qemu's cache, and we use O_DIRECT for everything except
for target of compressed backup.. Still the pollution may relate to
several backups and of course it is simple enough to drop the cache
after each backup. But I think that even one backup of 16T disk may
pollute RAM enough.


Oh, sorry, I just realized I had a brain fart there.  I was
referring to whether this series improves the page cache pollution.
But obviously it will if it allows you to re-enable O_DIRECT.


Related to that, I remember a long time ago we had some discussion
about letting qemu-img convert set a special cache mode for the
target image that would make Linux drop everything before the last
offset written (i.e., I suppose fadvise() with
POSIX_FADV_SEQUENTIAL).  You discard that idea based on the fact
that implementing a cache in qemu would be simple, but it isn’t,
really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
advantage of using that would be that we could reuse it for
non-compressed images that are written by backup or qemu-img
convert.)
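
For reference, the interface under discussion is posix_fadvise(2). A
minimal sketch, assuming a POSIX host (not code from the series):

    #include <fcntl.h>

    static void advise_sequential(int fd)
    {
        /* Hint that access is sequential, so pages behind the current
         * position become eviction candidates sooner. */
        (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    }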


The problem is that writes are async. And therefore, no

Re: [PATCH] hw/block: nvme: Fix a build error in nvme_process_sq()

2021-02-10 Thread Bin Meng
On Wed, Feb 10, 2021 at 5:54 PM Bin Meng  wrote:
>
> From: Bin Meng 
>
> Current QEMU HEAD nvme.c does not compile:
>
>   hw/block/nvme.c: In function ‘nvme_process_sq’:

Not sure why the compiler reports this error as happening in nvme_process_sq().

It should actually be in nvme_get_feature(). I will update the commit message in v2.

>   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
>  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
>  ^
>   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
>  uint32_t result;
>   ^
>
> Explicitly initialize the result to fix it.
>
> Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
> Signed-off-by: Bin Meng 
> ---
>
>  hw/block/nvme.c | 1 +
>  1 file changed, 1 insertion(+)
>

Regards,
Bin



[PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Bin Meng
From: Bin Meng 

Current QEMU HEAD nvme.c does not compile:

  hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
function [-Werror=maybe-uninitialized]
 trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 ^
  hw/block/nvme.c:3150:14: note: ‘result’ was declared here
 uint32_t result;
  ^

Explicitly initialize the result to fix it.

Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
Signed-off-by: Bin Meng 

---

Changes in v2:
- update function name in the commit message

 hw/block/nvme.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5ce21b7..c122ac0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -3228,6 +3228,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest 
*req)
 result = ns->features.err_rec;
 goto out;
 case NVME_VOLATILE_WRITE_CACHE:
+result = 0;
 for (i = 1; i <= n->num_namespaces; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
-- 
2.7.4




Re: [PATCH] hw/block: nvme: Fix a build error in nvme_process_sq()

2021-02-10 Thread Klaus Jensen
On Feb 10 18:15, Bin Meng wrote:
> On Wed, Feb 10, 2021 at 5:54 PM Bin Meng  wrote:
> >
> > From: Bin Meng 
> >
> > Current QEMU HEAD nvme.c does not compile:
> >
> >   hw/block/nvme.c: In function ‘nvme_process_sq’:
> 
> Not sure why compiler reports this error happens in nvme_process_sq()?
> 

Yeah, that is kinda weird. Also, this went through the full CI suite.
What compiler is this?

> But it should be in nvme_get_feature(). I will update the commit message in 
> v2.
> 
> >   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
> > function [-Werror=maybe-uninitialized]
> >  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
> >  ^
> >   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
> >  uint32_t result;
> >   ^
> >
> > Explicitly initialize the result to fix it.
> >
> > Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
> > Signed-off-by: Bin Meng 
> > ---
> >
> >  hw/block/nvme.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> 
> Regards,
> Bin
> 




Re: [PATCH] hw/block: nvme: Fix a build error in nvme_process_sq()

2021-02-10 Thread Bin Meng
On Wed, Feb 10, 2021 at 6:23 PM Klaus Jensen  wrote:
>
> On Feb 10 18:15, Bin Meng wrote:
> > On Wed, Feb 10, 2021 at 5:54 PM Bin Meng  wrote:
> > >
> > > From: Bin Meng 
> > >
> > > Current QEMU HEAD nvme.c does not compile:
> > >
> > >   hw/block/nvme.c: In function ‘nvme_process_sq’:
> >
> > Not sure why compiler reports this error happens in nvme_process_sq()?
> >
>
> Yeah, that is kinda weird. Also, this went through the full CI suite.
> What compiler is this?
>

Yes it's quite strange.

I am using the default GCC 5.4 on a Ubuntu 16.04 host.

Regards,
Bin



Re: [PATCH] hw/block: nvme: Fix a build error in nvme_process_sq()

2021-02-10 Thread Klaus Jensen
On Feb 10 18:24, Bin Meng wrote:
> On Wed, Feb 10, 2021 at 6:23 PM Klaus Jensen  wrote:
> >
> > On Feb 10 18:15, Bin Meng wrote:
> > > On Wed, Feb 10, 2021 at 5:54 PM Bin Meng  wrote:
> > > >
> > > > From: Bin Meng 
> > > >
> > > > Current QEMU HEAD nvme.c does not compile:
> > > >
> > > >   hw/block/nvme.c: In function ‘nvme_process_sq’:
> > >
> > > Not sure why compiler reports this error happens in nvme_process_sq()?
> > >
> >
> > Yeah, that is kinda weird. Also, this went through the full CI suite.
> > What compiler is this?
> >
> 
> Yes it's quite strange.
> 
> I am using the default GCC 5.4 on a Ubuntu 16.04 host.
> 

Alright. I'm actually not sure why newer compilers do not report this.
The warning looks reasonable.

I'll queue up your patch, Thanks!




Re: [PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Klaus Jensen
CC qemu-trivial.

On Feb 10 18:23, Bin Meng wrote:
> From: Bin Meng 
> 
> Current QEMU HEAD nvme.c does not compile:
> 
>   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
>  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
>  ^
>   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
>  uint32_t result;
>   ^
> 
> Explicitly initialize the result to fix it.
> 
> Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
> Signed-off-by: Bin Meng 
> 
> ---
> 
> Changes in v2:
> - update function name in the commit message
> 
>  hw/block/nvme.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 5ce21b7..c122ac0 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -3228,6 +3228,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, 
> NvmeRequest *req)
>  result = ns->features.err_rec;
>  goto out;
>  case NVME_VOLATILE_WRITE_CACHE:
> +result = 0;
>  for (i = 1; i <= n->num_namespaces; i++) {
>  ns = nvme_ns(n, i);
>  if (!ns) {
> -- 
> 2.7.4
> 
> 




Re: [PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Minwoo Im
On 21-02-10 18:23:17, Bin Meng wrote:
> From: Bin Meng 
> 
> Current QEMU HEAD nvme.c does not compile:
> 
>   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
>  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
>  ^
>   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
>  uint32_t result;
>   ^
> 
> Explicitly initialize the result to fix it.
> 
> Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
> Signed-off-by: Bin Meng 

Bin,

Thanks for the fix!



Re: [PATCH] hw/block: nvme: Fix a build error in nvme_process_sq()

2021-02-10 Thread Peter Maydell
On Wed, 10 Feb 2021 at 10:31, Klaus Jensen  wrote:
> On Feb 10 18:24, Bin Meng wrote:
> > I am using the default GCC 5.4 on a Ubuntu 16.04 host.
> >
>
> Alright. I'm actually not sure why newer compilers does not report this.
> The warning looks reasonable.

It's not actually ever possible for nvme_ns() to return
NULL in this loop, because nvme_ns() will only return NULL
if it is passed an nsid value that is 0 or > n->num_namespaces,
and the loop conditions mean that we never do that. So
we can only end up using an uninitialized result if
n->num_namespaces is zero.

Newer compilers tend to do deeper analysis (eg inlining a
function like nvme_ns() here and analysing on the basis of
what that function does), so they can identify that
the "if (!ns) { continue; }" codepath is never taken.
I haven't actually done the analysis but I'm guessing that
newer compilers also manage to figure out somehow that it's not
possible to get here with n->num_namespaces being zero.

GCC 5.4 is not quite so sophisticated, so it can't tell.

There does seem to be a consistent pattern in the code of

for (i = 1; i <= n->num_namespaces; i++) {
ns = nvme_ns(n, i);
if (!ns) {
continue;
}
[stuff]
}

Might be worth considering replacing the "if (!ns) { continue; }"
with "assert(ns);".

thanks
-- PMM



Re: [PATCH 1/2] hw/block/nvme: add oncs device parameter

2021-02-10 Thread Minwoo Im
On 21-02-10 08:06:45, Klaus Jensen wrote:
> From: Gollu Appalanaidu 
> 
> Add the 'oncs' nvme device parameter to allow optional features to be
> enabled/disabled explicitly. Since most of these are optional commands,
> make the CSE log pages dynamic to account for the value of ONCS.
> 
> Signed-off-by: Gollu Appalanaidu 
> Signed-off-by: Klaus Jensen 

Reviewed-by: Minwoo Im 



Re: [PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Philippe Mathieu-Daudé
Hi Bin,

On 2/10/21 11:23 AM, Bin Meng wrote:
> From: Bin Meng 
> 
> Current QEMU HEAD nvme.c does not compile:
> 
>   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
>  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
>  ^
>   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
>  uint32_t result;
>   ^

Why isn't this caught by our CI? What is your host OS? Fedora 33?

> 
> Explicitly initialize the result to fix it.
> 
> Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
> Signed-off-by: Bin Meng 
> 
> ---
> 
> Changes in v2:
> - update function name in the commit message
> 
>  hw/block/nvme.c | 1 +
>  1 file changed, 1 insertion(+)




Re: [PATCH 2/2] hw/block/nvme: add write uncorrectable command

2021-02-10 Thread Minwoo Im
On 21-02-10 08:06:46, Klaus Jensen wrote:
> From: Gollu Appalanaidu 
> 
> Add support for marking blocks invalid with the Write Uncorrectable
> command. Block status is tracked in a (non-persistent) bitmap that is
> checked on all reads and written to on all writes. This is potentially
> expensive, so keep Write Uncorrectable disabled by default.
> 
> Signed-off-by: Gollu Appalanaidu 
> Signed-off-by: Klaus Jensen 
> ---
>  docs/specs/nvme.txt   |  3 ++
>  hw/block/nvme-ns.h|  2 ++
>  hw/block/nvme.h   |  1 +
>  hw/block/nvme-ns.c|  2 ++
>  hw/block/nvme.c   | 65 +--
>  hw/block/trace-events |  1 +
>  6 files changed, 66 insertions(+), 8 deletions(-)
> 
> diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
> index 56d393884e7a..88f9cc278d4c 100644
> --- a/docs/specs/nvme.txt
> +++ b/docs/specs/nvme.txt
> @@ -19,5 +19,8 @@ Known issues
>  
>  * The accounting numbers in the SMART/Health are reset across power cycles
>  
> +* Marking blocks invalid with the Write Uncorrectable is not persisted across
> +  power cycles.
> +
>  * Interrupt Coalescing is not supported and is disabled by default in 
> volation
>of the specification.
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index 7af6884862b5..15fa422ded03 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -72,6 +72,8 @@ typedef struct NvmeNamespace {
>  struct {
>  uint32_t err_rec;
>  } features;
> +
> +unsigned long *uncorrectable;
>  } NvmeNamespace;
>  
>  static inline uint32_t nvme_nsid(NvmeNamespace *ns)
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 98082b2dfba3..9b8f85b9cf16 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -68,6 +68,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
>  case NVME_CMD_FLUSH:return "NVME_NVM_CMD_FLUSH";
>  case NVME_CMD_WRITE:return "NVME_NVM_CMD_WRITE";
>  case NVME_CMD_READ: return "NVME_NVM_CMD_READ";
> +case NVME_CMD_WRITE_UNCOR:  return "NVME_CMD_WRITE_UNCOR";
>  case NVME_CMD_COMPARE:  return "NVME_NVM_CMD_COMPARE";
>  case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
>  case NVME_CMD_DSM:  return "NVME_NVM_CMD_DSM";
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index ade46e2f3739..742bbc4b4b62 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -72,6 +72,8 @@ static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
>  id_ns->mcl = cpu_to_le32(ns->params.mcl);
>  id_ns->msrc = ns->params.msrc;
>  
> +ns->uncorrectable = bitmap_new(id_ns->nsze);
> +
>  return 0;
>  }
>  
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index e5f725d7..56048046c193 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1112,6 +1112,20 @@ static uint16_t nvme_check_dulbe(NvmeNamespace *ns, 
> uint64_t slba,
>  return NVME_SUCCESS;
>  }
>  
> +static inline uint16_t nvme_check_uncor(NvmeNamespace *ns, uint64_t slba,
> +uint32_t nlb)
> +{
> +uint64_t elba = nlb + slba;
> +
> +if (ns->uncorrectable) {
> +if (find_next_bit(ns->uncorrectable, elba, slba) < elba) {
> +return NVME_UNRECOVERED_READ | NVME_DNR;
> +}
> +}
> +
> +return NVME_SUCCESS;
> +}
> +
>  static void nvme_aio_err(NvmeRequest *req, int ret)
>  {
>  uint16_t status = NVME_SUCCESS;
> @@ -1423,14 +1437,24 @@ static void nvme_rw_cb(void *opaque, int ret)
>  BlockAcctCookie *acct = &req->acct;
>  BlockAcctStats *stats = blk_get_stats(blk);
>  
> +bool is_write = nvme_is_write(req);
> +
>  trace_pci_nvme_rw_cb(nvme_cid(req), blk_name(blk));
>  
> -if (ns->params.zoned && nvme_is_write(req)) {
> +if (ns->params.zoned && is_write) {
>  nvme_finalize_zoned_write(ns, req);
>  }
>  
>  if (!ret) {
>  block_acct_done(stats, acct);
> +
> +if (is_write) {
> +NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> +uint64_t slba = le64_to_cpu(rw->slba);
> +uint32_t nlb = le16_to_cpu(rw->nlb) + 1;
> +
> +bitmap_clear(ns->uncorrectable, slba, nlb);

It might be nitpick, 'nlb' would easily represent the value which is
defined itself in the spec which is zero-based.  Can we have this like:

uint32_t nlb = le16_to_cpu(rw->nlb);

bitmap_clear(ns->uncorrectable, slba, nlb + 1);

Otherwise, it looks good to me.

Reviewed-by: Minwoo Im 
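
For readers following the zero-based discussion above: the NVMe spec
encodes the NLB field as "number of logical blocks minus one", so a
command covering [slba, slba + nlb] touches nlb + 1 blocks.  A tiny
illustration (hypothetical helper, not from the patch):

    #include <stdint.h>

    /* NVMe's NLB field is zero-based: an on-the-wire value of 0
     * means one logical block. */
    static inline uint32_t nvme_nlb_to_block_count(uint16_t nlb_field)
    {
        return (uint32_t)nlb_field + 1;
    }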



Re: [PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Bin Meng
Hi Philippe,

On Wed, Feb 10, 2021 at 7:12 PM Philippe Mathieu-Daudé
 wrote:
>
> Hi Bin,
>
> On 2/10/21 11:23 AM, Bin Meng wrote:
> > From: Bin Meng 
> >
> > Current QEMU HEAD nvme.c does not compile:
> >
> >   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
> > function [-Werror=maybe-uninitialized]
> >  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
> >  ^
> >   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
> >  uint32_t result;
> >   ^
>
> Why isn't this caught by our CI? What is your host OS? Fedora 33?
>

I am using GCC 5.4 on Ubuntu 16.04. Please see some initial analysis
from Peter about why newer versions of GCC do not report it.

Regards,
Bin



Re: [PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Philippe Mathieu-Daudé
On 2/10/21 12:12 PM, Philippe Mathieu-Daudé wrote:
> Hi Bin,
> 
> On 2/10/21 11:23 AM, Bin Meng wrote:
>> From: Bin Meng 
>>
>> Current QEMU HEAD nvme.c does not compile:
>>
>>   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
>> function [-Werror=maybe-uninitialized]
>>  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
>>  ^
>>   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
>>  uint32_t result;
>>   ^
> 
> Why isn't this caught by our CI? What is your host OS? Fedora 33?

Just noticed v1 and Peter's explanation:
https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg03528.html

Can you amend "default GCC 5.4 on a Ubuntu 16.04 host" information
please?

> 
>>
>> Explicitly initialize the result to fix it.
>>
>> Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
>> Signed-off-by: Bin Meng 
>>
>> ---
>>
>> Changes in v2:
>> - update function name in the commit message
>>
>>  hw/block/nvme.c | 1 +
>>  1 file changed, 1 insertion(+)




Re: [PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Bin Meng
Hi Philippe,

On Wed, Feb 10, 2021 at 7:15 PM Philippe Mathieu-Daudé
 wrote:
>
> On 2/10/21 12:12 PM, Philippe Mathieu-Daudé wrote:
> > Hi Bin,
> >
> > On 2/10/21 11:23 AM, Bin Meng wrote:
> >> From: Bin Meng 
> >>
> >> Current QEMU HEAD nvme.c does not compile:
> >>
> >>   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in 
> >> this function [-Werror=maybe-uninitialized]
> >>  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
> >>  ^
> >>   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
> >>  uint32_t result;
> >>   ^
> >
> > Why isn't this caught by our CI? What is your host OS? Fedora 33?
>
> Just noticed v1 and Peter's explanation:
> https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg03528.html
>
> Can you amend "default GCC 5.4 on a Ubuntu 16.04 host" information
> please?
>

Sure.

Regards,
Bin



[PATCH v3] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Bin Meng
From: Bin Meng 

Current QEMU HEAD nvme.c does not compile with the default GCC 5.4
on a Ubuntu 16.04 host:

  hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
function [-Werror=maybe-uninitialized]
 trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
 ^
  hw/block/nvme.c:3150:14: note: ‘result’ was declared here
 uint32_t result;
  ^

Explicitly initialize the result to fix it.

Cc: qemu-triv...@nongnu.org
Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
Signed-off-by: Bin Meng 

---

Changes in v3:
- mention compiler and host information in the commit message

Changes in v2:
- update function name in the commit message

 hw/block/nvme.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5ce21b7..c122ac0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -3228,6 +3228,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest 
*req)
 result = ns->features.err_rec;
 goto out;
 case NVME_VOLATILE_WRITE_CACHE:
+result = 0;
 for (i = 1; i <= n->num_namespaces; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
-- 
2.7.4




Re: [PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Daniel P . Berrangé
On Wed, Feb 10, 2021 at 12:15:45PM +0100, Philippe Mathieu-Daudé wrote:
> On 2/10/21 12:12 PM, Philippe Mathieu-Daudé wrote:
> > Hi Bin,
> > 
> > On 2/10/21 11:23 AM, Bin Meng wrote:
> >> From: Bin Meng 
> >>
> >> Current QEMU HEAD nvme.c does not compile:
> >>
> >>   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in 
> >> this function [-Werror=maybe-uninitialized]
> >>  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
> >>  ^
> >>   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
> >>  uint32_t result;
> >>   ^
> > 
> > Why isn't this caught by our CI? What is your host OS? Fedora 33?
> 
> Just noticed v1 and Peter's explanation:
> https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg03528.html
> 
> Can you amend "default GCC 5.4 on a Ubuntu 16.04 host" information
> please?

Well Ubuntu 16.04 hasn't been considered a supported build target for
QEMU for a year now.

https://qemu.readthedocs.io/en/latest/system/build-platforms.html#linux-os-macos-freebsd-netbsd-openbsd

  "The project aims to support the most recent major version 
   at all times. Support for the previous major version will 
   be dropped 2 years after the new major version is released
   or when the vendor itself drops support, whichever comes 
   first."

IOW, we only aim for QEMU to be buildable on Ubuntu LTS 20.04 and 18.04
at this point in time.  16.04 is explicitly dropped and we will increasingly
introduce incompatibilities with it.

While this specific patch is simple, trying to keep QEMU git master
working on 16.04 is not a goal, so I'd really suggest upgrading to
a newer Ubuntu version at the soonest opportunity.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Thomas Huth

On 10/02/2021 12.15, Bin Meng wrote:

Hi Philippe,

On Wed, Feb 10, 2021 at 7:12 PM Philippe Mathieu-Daudé
 wrote:


Hi Bin,

On 2/10/21 11:23 AM, Bin Meng wrote:

From: Bin Meng 

Current QEMU HEAD nvme.c does not compile:

   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
function [-Werror=maybe-uninitialized]
  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
  ^
   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
  uint32_t result;
   ^


Why isn't this caught by our CI? What is your host OS? Fedora 33?



I am using GCC 5.4 on Ubuntu 16.04. Please see some initial analysis
from Peter about why newer version GCC does not report it.


Please note that Ubuntu 16.04 is no longer a version supported by QEMU;
see https://qemu.readthedocs.io/en/latest/system/build-platforms.html and
https://wiki.qemu.org/Supported_Build_Platforms for details.


 Thomas




Re: [PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Daniel P . Berrangé
On Wed, Feb 10, 2021 at 11:22:19AM +, Daniel P. Berrangé wrote:
> On Wed, Feb 10, 2021 at 12:15:45PM +0100, Philippe Mathieu-Daudé wrote:
> > On 2/10/21 12:12 PM, Philippe Mathieu-Daudé wrote:
> > > Hi Bin,
> > > 
> > > On 2/10/21 11:23 AM, Bin Meng wrote:
> > >> From: Bin Meng 
> > >>
> > >> Current QEMU HEAD nvme.c does not compile:
> > >>
> > >>   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in 
> > >> this function [-Werror=maybe-uninitialized]
> > >>  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
> > >>  ^
> > >>   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
> > >>  uint32_t result;
> > >>   ^
> > > 
> > > Why isn't this caught by our CI? What is your host OS? Fedora 33?
> > 
> > Just noticed v1 and Peter's explanation:
> > https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg03528.html
> > 
> > Can you amend "default GCC 5.4 on a Ubuntu 16.04 host" information
> > please?
> 
> Well Ubuntu 16.04 hasn't been considered a supported build target for
> QEMU for a year now.
> 
> https://qemu.readthedocs.io/en/latest/system/build-platforms.html#linux-os-macos-freebsd-netbsd-openbsd
> 
>   "The project aims to support the most recent major version 
>at all times. Support for the previous major version will 
>be dropped 2 years after the new major version is released
>or when the vendor itself drops support, whichever comes 
>first."
> 
> IOW, we only aim for QEMU to be buildable on Ubuntu LTS 20.04 and 18.04
> at this point in time.  16.04 is explicitly dropped and we will increasingly
> introduce incompatibilities with it.
> 
> While this specific patch is simple, trying to keep QEMU git master
> working on 16.04 is not a goal, so I'd really suggest upgrading to
> a newer Ubuntu version at the soonest opportunity.

In particular after 6.0 QEMU is released, we'll be dropping RHEL-7
and then likely setting the  min required GCC to somewhere around
6.3 which will cut off Ubuntu 16.04 upfront.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH] hw/block: nvme: Fix a build error in nvme_process_sq()

2021-02-10 Thread Klaus Jensen
On Feb 10 11:01, Peter Maydell wrote:
> On Wed, 10 Feb 2021 at 10:31, Klaus Jensen  wrote:
> > On Feb 10 18:24, Bin Meng wrote:
> > > I am using the default GCC 5.4 on a Ubuntu 16.04 host.
> > >
> >
> > Alright. I'm actually not sure why newer compilers do not report this.
> > The warning looks reasonable.
> 
> It's not actually ever possible for nvme_ns() to return
> NULL in this loop, because nvme_ns() will only return NULL
> if it is passed an nsid value that is 0 or > n->num_namespaces,

NvmeCtrl.namespaces is an array of pointers and some of those will most
likely be NULL (those are unallocated namespaces).

> and the loop conditions mean that we never do that. So
> we can only end up using an uninitialized result if
> n->num_namespaces is zero.
> 
> Newer compilers tend to do deeper analysis (eg inlining a
> function like nvme_ns() here and analysing on the basis of
> what that function does), so they can identify that
> the "if (!ns) { continue; }" codepath is never taken.
> I haven't actually done the analysis but I'm guessing that
> newer compilers also manage to figure out somehow that it's not
> possible to get here with n->num_namespaces being zero.
> 
> GCC 5.4 is not quite so sophisticated, so it can't tell.
> 
> There does seem to be a consistent pattern in the code of
> 
> for (i = 1; i <= n->num_namespaces; i++) {
> ns = nvme_ns(n, i);
> if (!ns) {
> continue;
> }
> [stuff]
> }
> 
> Might be worth considering replacing the "if (!ns) { continue; }"
> with "assert(ns);".
> 

As mentioned above, ns may very well be NULL (an unallocated namespace).

I know that "it's never the compiler". But in this case, wtf? If there
are no allocated namespaces, then we will actually never hit the
statement that initializes result. I just confirmed this with a
configuration without any namespaces.

The patch is good. I wonder why newer GCCs do NOT detect this. Trying
to use `result` as the first statement in the loop also does not cause a
warning. Only using the variable just before the loop triggers a
warning on this.

I'm more than happy to be schooled by compiler people about why the
compiler might be more clever than me!
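
A minimal standalone reduction of the pattern in question (illustrative
C, not QEMU code): with no allocated slots the loop body never runs and
`result` stays uninitialized, which is exactly the path old GCC warns
about.

    #include <stdio.h>

    /* Returns NULL for out-of-range or unallocated slots, like nvme_ns(). */
    static int *get_slot(int **slots, int n, int i)
    {
        if (i == 0 || i > n) {
            return NULL;
        }
        return slots[i - 1];
    }

    int main(void)
    {
        int *slots[2] = { NULL, NULL };   /* no allocated namespaces */
        int n = 2;
        unsigned result;                  /* written only inside the loop */

        for (int i = 1; i <= n; i++) {
            int *p = get_slot(slots, n, i);
            if (!p) {
                continue;
            }
            result = (unsigned)*p;
        }

        printf("%u\n", result);           /* maybe-uninitialized use */
        return 0;
    }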




Re: [PATCH 2/2] hw/block/nvme: add write uncorrectable command

2021-02-10 Thread Klaus Jensen
On Feb 10 20:14, Minwoo Im wrote:
> On 21-02-10 08:06:46, Klaus Jensen wrote:
> > From: Gollu Appalanaidu 
> > 
> > Add support for marking blocks invalid with the Write Uncorrectable
> > command. Block status is tracked in a (non-persistent) bitmap that is
> > checked on all reads and written to on all writes. This is potentially
> > expensive, so keep Write Uncorrectable disabled by default.
> > 
> > Signed-off-by: Gollu Appalanaidu 
> > Signed-off-by: Klaus Jensen 
> > ---
> >  docs/specs/nvme.txt   |  3 ++
> >  hw/block/nvme-ns.h|  2 ++
> >  hw/block/nvme.h   |  1 +
> >  hw/block/nvme-ns.c|  2 ++
> >  hw/block/nvme.c   | 65 +--
> >  hw/block/trace-events |  1 +
> >  6 files changed, 66 insertions(+), 8 deletions(-)
> > 
> > diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
> > index 56d393884e7a..88f9cc278d4c 100644
> > --- a/docs/specs/nvme.txt
> > +++ b/docs/specs/nvme.txt
> > @@ -19,5 +19,8 @@ Known issues
> >  
> >  * The accounting numbers in the SMART/Health are reset across power cycles
> >  
> > +* Marking blocks invalid with the Write Uncorrectable is not persisted 
> > across
> > +  power cycles.
> > +
> >  * Interrupt Coalescing is not supported and is disabled by default in 
> > volation
> >of the specification.
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index 7af6884862b5..15fa422ded03 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -72,6 +72,8 @@ typedef struct NvmeNamespace {
> >  struct {
> >  uint32_t err_rec;
> >  } features;
> > +
> > +unsigned long *uncorrectable;
> >  } NvmeNamespace;
> >  
> >  static inline uint32_t nvme_nsid(NvmeNamespace *ns)
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 98082b2dfba3..9b8f85b9cf16 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -68,6 +68,7 @@ static inline const char *nvme_io_opc_str(uint8_t opc)
> >  case NVME_CMD_FLUSH:return "NVME_NVM_CMD_FLUSH";
> >  case NVME_CMD_WRITE:return "NVME_NVM_CMD_WRITE";
> >  case NVME_CMD_READ: return "NVME_NVM_CMD_READ";
> > +case NVME_CMD_WRITE_UNCOR:  return "NVME_CMD_WRITE_UNCOR";
> >  case NVME_CMD_COMPARE:  return "NVME_NVM_CMD_COMPARE";
> >  case NVME_CMD_WRITE_ZEROES: return "NVME_NVM_CMD_WRITE_ZEROES";
> >  case NVME_CMD_DSM:  return "NVME_NVM_CMD_DSM";
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index ade46e2f3739..742bbc4b4b62 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -72,6 +72,8 @@ static int nvme_ns_init(NvmeNamespace *ns, Error **errp)
> >  id_ns->mcl = cpu_to_le32(ns->params.mcl);
> >  id_ns->msrc = ns->params.msrc;
> >  
> > +ns->uncorrectable = bitmap_new(id_ns->nsze);
> > +
> >  return 0;
> >  }
> >  
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index e5f725d7..56048046c193 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1112,6 +1112,20 @@ static uint16_t nvme_check_dulbe(NvmeNamespace *ns, 
> > uint64_t slba,
> >  return NVME_SUCCESS;
> >  }
> >  
> > +static inline uint16_t nvme_check_uncor(NvmeNamespace *ns, uint64_t slba,
> > +uint32_t nlb)
> > +{
> > +uint64_t elba = nlb + slba;
> > +
> > +if (ns->uncorrectable) {
> > +if (find_next_bit(ns->uncorrectable, elba, slba) < elba) {
> > +return NVME_UNRECOVERED_READ | NVME_DNR;
> > +}
> > +}
> > +
> > +return NVME_SUCCESS;
> > +}
> > +
> >  static void nvme_aio_err(NvmeRequest *req, int ret)
> >  {
> >  uint16_t status = NVME_SUCCESS;
> > @@ -1423,14 +1437,24 @@ static void nvme_rw_cb(void *opaque, int ret)
> >  BlockAcctCookie *acct = &req->acct;
> >  BlockAcctStats *stats = blk_get_stats(blk);
> >  
> > +bool is_write = nvme_is_write(req);
> > +
> >  trace_pci_nvme_rw_cb(nvme_cid(req), blk_name(blk));
> >  
> > -if (ns->params.zoned && nvme_is_write(req)) {
> > +if (ns->params.zoned && is_write) {
> >  nvme_finalize_zoned_write(ns, req);
> >  }
> >  
> >  if (!ret) {
> >  block_acct_done(stats, acct);
> > +
> > +if (is_write) {
> > +NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> > +uint64_t slba = le64_to_cpu(rw->slba);
> > +uint32_t nlb = le16_to_cpu(rw->nlb) + 1;
> > +
> > +bitmap_clear(ns->uncorrectable, slba, nlb);
> 
> It might be nitpick, 'nlb' would easily represent the value which is
> defined itself in the spec which is zero-based.  Can we have this like:
> 
>   uint32_t nlb = le16_to_cpu(rw->nlb);
> 
>   bitmap_clear(ns->uncorrectable, slba, nlb + 1);
> 


I do not disagree, but the `uint32_t nlb = le16_to_cpu(rw->nlb) + 1;`
pattern is already used in several places.

> Otherwise, it looks good to me.
> 
> Reviewed-by: Minwoo Im 

-- 
One of us - No more doubt, silence or taboo about mental illness

Re: [PATCH 0/7] qcow2: compressed write cache

2021-02-10 Thread Kevin Wolf
Am 29.01.2021 um 17:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Hi all!
> 
> I know, I have several series waiting for a resend, but I had to switch
> to another task spawned from our customer's bug.
> 
> Original problem: we use O_DIRECT for all vm images in our product, it's
> the policy. The only exclusion is backup target qcow2 image for
> compressed backup, because compressed backup is extremely slow with
> O_DIRECT (due to unaligned writes). Customer complains that backup
> produces a lot of pagecache.
> 
> So we can either implement some internal cache or use fadvise somehow.
> Backup has several async workers, which write simultaneously, so in both
> ways we have to track host cluster filling (before dropping the cache
> corresponding to the cluster).  So, if we have to track anyway, let's
> try to implement the cache.
> 
> Idea is simple: cache small unaligned write and flush the cluster when
> filled.

I haven't had the time to properly look at the patches, but is there
anything in it that is actually specific to compressed writes?

I'm asking because you may remember that a few years ago I talked at KVM
Forum about how a data cache could be used for small unaligned (to
cluster sizes) writes to reduce COW cost (mostly for sequential access
where the other part of the cluster would be filled soon enough).

So if we're introducing some kind of data cache, wouldn't it be nice to
use it even in the more general case instead of just restricting it to
compression?

Kevin




Re: [PATCH] hw/block: nvme: Fix a build error in nvme_process_sq()

2021-02-10 Thread Peter Maydell
On Wed, 10 Feb 2021 at 11:37, Klaus Jensen  wrote:
>
> On Feb 10 11:01, Peter Maydell wrote:
> > On Wed, 10 Feb 2021 at 10:31, Klaus Jensen  wrote:
> > > On Feb 10 18:24, Bin Meng wrote:
> > > > I am using the default GCC 5.4 on a Ubuntu 16.04 host.
> > > >
> > >
> > > Alright. I'm actually not sure why newer compilers do not report this.
> > > The warning looks reasonable.
> >
> > It's not actually ever possible for nvme_ns() to return
> > NULL in this loop, because nvme_ns() will only return NULL
> > if it is passed an nsid value that is 0 or > n->num_namespaces,
>
> NvmeCtrl.namespaces is an array of pointers and some of those will most
> likely be NULL (those are unallocated namespaces).

Whoops, yes.

> I know that "it's never the compiler". But in this case, wtf? If there
> are no allocated namespaces, then we will actually never hit the
> statement that initializes result. I just confirmed this with a
> configuration without any namespaces.
>
> The patch is good. I wonder why newer GCCs does NOT detect this. Trying
> to use `result` as the first statement in the loop also does not cause a
> warning. Only using the variable just before the loop triggers a
> warning on this.

My new hypothesis is that maybe newer GCCs are more cautious
about when they produce the 'may be used uninitialized' warning,
to avoid having too many false positives.

-- PMM



Re: [PATCH] iotests/210: Fix reference output

2021-02-10 Thread Kevin Wolf
Am 09.02.2021 um 19:19 hat Max Reitz geschrieben:
> Commit 69b55e03f has changed an error message, adjust the reference
> output to account for it.
> 
> Fixes: 69b55e03f7e65a36eb954d0b7d4698b258df2708
>("block: refactor bdrv_check_request: add errp")
> Signed-off-by: Max Reitz 

Reviewed-by: Kevin Wolf 

> diff --git a/tests/qemu-iotests/210.out b/tests/qemu-iotests/210.out
> index dc1a3c9786..2e9fc596eb 100644
> --- a/tests/qemu-iotests/210.out
> +++ b/tests/qemu-iotests/210.out
> @@ -182,7 +182,7 @@ Job failed: The requested file size is too large
>  === Resize image with invalid sizes ===
>  
>  {"execute": "block_resize", "arguments": {"node-name": "node1", "size": 
> 9223372036854775296}}
> -{"error": {"class": "GenericError", "desc": "Required too big image size, it 
> must be not greater than 9223372035781033984"}}
> +{"error": {"class": "GenericError", "desc": "offset(9223372036854775296) 
> exceeds maximum(9223372035781033984)"}}

This doesn't exactly feel like an improved error message...

Kevin




Re: [PATCH v2 30/36] block: bdrv_reopen_multiple: refresh permissions on updated graph

2021-02-10 Thread Kevin Wolf
Am 08.02.2021 um 12:21 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 05.02.2021 20:57, Kevin Wolf wrote:
> > Am 27.11.2020 um 15:45 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > Move bdrv_reopen_multiple to new paradigm of permission update:
> > > first update graph relations, then do refresh the permissions.
> > > 
> > > We have to modify reopen process in file-posix driver: with new scheme
> > > we don't have prepared permissions in raw_reopen_prepare(), so we
> > > should reconfigure fd in raw_check_perm(). Still this seems more native
> > > and simple anyway.
> > 
> > Hm... The diffstat shows that it is simpler because it needs less code.
> > 
> > But relying on the permission change callbacks for getting a new file
> > descriptor that changes more than just permissions doesn't feel
> > completely right either. Can we even expect the permission callbacks to
> > be called when the permissions aren't changed?
> 
> With new scheme permission update becomes an obvious step of
> bdrv_reopen_multiple(): we do call bdrv_list_refresh_perms(), for the
> list of all touched nodes and all their subtrees. And callbacks are
> called unconditionally bdrv_node_refresh_perm()->bdrv_drv_set_perm().
> So, I think, we can rely on it. Probably worth one-two comments.

Yes, some comments in the right places saying that we must call the driver
callbacks even if the permissions are the same as before wouldn't hurt.

> > 
> > But then, reopen and permission updates were already a bit entangled
> > before. If we can guarantee that the permission functions will always be
> > called, even if the permissions don't change, I guess it's okay.
> > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy 
> > > ---
> > >   include/block/block.h |   2 +-
> > >   block.c   | 183 +++---
> > >   block/file-posix.c|  84 +--
> > >   3 files changed, 70 insertions(+), 199 deletions(-)
> > > 
> > > diff --git a/include/block/block.h b/include/block/block.h
> > > index 0f21ef313f..82271d9ccd 100644
> > > --- a/include/block/block.h
> > > +++ b/include/block/block.h
> > > @@ -195,7 +195,7 @@ typedef struct BDRVReopenState {
> > >   BlockdevDetectZeroesOptions detect_zeroes;
> > >   bool backing_missing;
> > >   bool replace_backing_bs;  /* new_backing_bs is ignored if this is 
> > > false */
> > > -BlockDriverState *new_backing_bs; /* If NULL then detach the current 
> > > bs */
> > > +BlockDriverState *old_backing_bs; /* keep pointer for permissions 
> > > update */
> > >   uint64_t perm, shared_perm;
> > 
> > perm and shared_perm are unused now and can be removed.
> > 
> > >   QDict *options;
> > >   QDict *explicit_options;
> > > diff --git a/block.c b/block.c
> > > index 617cba9547..474e624152 100644
> > > --- a/block.c
> > > +++ b/block.c
> > > @@ -103,8 +103,9 @@ static int bdrv_attach_child_common(BlockDriverState 
> > > *child_bs,
> > >   GSList **tran, Error **errp);
> > >   static void bdrv_remove_backing(BlockDriverState *bs, GSList **tran);
> > > -static int bdrv_reopen_prepare(BDRVReopenState *reopen_state, 
> > > BlockReopenQueue
> > > -   *queue, Error **errp);
> > > +static int bdrv_reopen_prepare(BDRVReopenState *reopen_state,
> > > +   BlockReopenQueue *queue,
> > > +   GSList **set_backings_tran, Error **errp);
> > >   static void bdrv_reopen_commit(BDRVReopenState *reopen_state);
> > >   static void bdrv_reopen_abort(BDRVReopenState *reopen_state);
> > > @@ -2403,6 +2404,7 @@ static void bdrv_list_abort_perm_update(GSList 
> > > *list)
> > >   }
> > >   }
> > > +__attribute__((unused))
> > >   static void bdrv_abort_perm_update(BlockDriverState *bs)
> > >   {
> > >   g_autoptr(GSList) list = bdrv_topological_dfs(NULL, NULL, bs);
> > > @@ -2498,6 +2500,7 @@ char *bdrv_perm_names(uint64_t perm)
> > >*
> > >* Needs to be followed by a call to either bdrv_set_perm() or
> > >* bdrv_abort_perm_update(). */
> > > +__attribute__((unused))
> > >   static int bdrv_check_update_perm(BlockDriverState *bs, 
> > > BlockReopenQueue *q,
> > > uint64_t new_used_perm,
> > > uint64_t new_shared_perm,
> > > @@ -4100,10 +4103,6 @@ static BlockReopenQueue 
> > > *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
> > >   bs_entry->state.explicit_options = explicit_options;
> > >   bs_entry->state.flags = flags;
> > > -/* This needs to be overwritten in bdrv_reopen_prepare() */
> > > -bs_entry->state.perm = UINT64_MAX;
> > > -bs_entry->state.shared_perm = 0;
> > > -
> > >   /*
> > >* If keep_old_opts is false then it means that unspecified
> > >* options must be reset to their original value. We don't allow
> > > @@ -4186,40 +4185,37 @@ BlockReopenQueue 
> > > *bdrv_reopen_queue(BlockReopenQueue *bs_queue

Re: [PATCH 0/7] qcow2: compressed write cache

2021-02-10 Thread Vladimir Sementsov-Ogievskiy

10.02.2021 15:35, Kevin Wolf wrote:

Am 29.01.2021 um 17:50 hat Vladimir Sementsov-Ogievskiy geschrieben:

Hi all!

I know, I have several series waiting for a resend, but I had to switch
to another task spawned from our customer's bug.

Original problem: we use O_DIRECT for all vm images in our product, it's
the policy. The only exclusion is backup target qcow2 image for
compressed backup, because compressed backup is extremely slow with
O_DIRECT (due to unaligned writes). Customer complains that backup
produces a lot of pagecache.

So we can either implement some internal cache or use fadvise somehow.
Backup has several async workers, which write simultaneously, so in both
ways we have to track host cluster filling (before dropping the cache
corresponding to the cluster).  So, if we have to track anyway, let's
try to implement the cache.

Idea is simple: cache small unaligned write and flush the cluster when
filled.


I haven't had the time to properly look at the patches, but is there
anything in it that is actually specific to compressed writes?

I'm asking because you may remember that a few years ago I talked at KVM
Forum about how a data cache could be used for small unaligned (to
cluster sizes) writes to reduce COW cost (mostly for sequential access
where the other part of the cluster would be filled soon enough).

So if we're introducing some kind of data cache, wouldn't it be nice to
use it even in the more general case instead of just restricting it to
compression?



Specific things are:

 - setting data_end per cluster at some moment (so we flush the cluster when
   it is not full). In this case we align up the data_end, as we know that
   the remaining part of the cluster is unused. But that may be refactored
   into an option.
 - waiting for the whole cluster to be filled

So it can be reused for a (more or less) sequential copying process with
unaligned chunks. But the different copying jobs in qemu always have aligned
chunks; the only exception is copying to a compressed target.

Still, I intentionally implemented it in a separate file, and there is no use
of BDRVQcow2State, so it's simple enough to refactor and reuse if needed.

I can rename it to "unaligned_copy_cache" or something like this.

--
Best regards,
Vladimir
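
For readers skimming the thread, a very rough sketch of the caching idea
being discussed: stage unaligned chunks per host cluster and issue one
aligned write when the cluster fills.  Names and structure are
illustrative, not the series' actual code, which also tracks multiple
clusters in flight and asynchronous completion:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define CLUSTER_SIZE (64 * 1024)

    typedef struct ClusterBuf {
        uint64_t base;                 /* cluster-aligned file offset */
        uint64_t filled;               /* bytes staged so far */
        uint8_t  data[CLUSTER_SIZE];
    } ClusterBuf;

    /* Stage one chunk; once the cluster is complete, issue a single
     * aligned write (friendly to O_DIRECT). */
    static int stage_chunk(ClusterBuf *c, int fd, uint64_t offset,
                           const void *buf, size_t len)
    {
        memcpy(c->data + (offset - c->base), buf, len);
        c->filled += len;

        if (c->filled == CLUSTER_SIZE) {
            if (pwrite(fd, c->data, CLUSTER_SIZE, (off_t)c->base) < 0) {
                return -1;
            }
            c->filled = 0;             /* ready for the next cluster */
        }
        return 0;
    }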



Re: [PATCH v2 30/36] block: bdrv_reopen_multiple: refresh permissions on updated graph

2021-02-10 Thread Kevin Wolf
Am 08.02.2021 um 12:21 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > Come to think of it, the AioContext handling is probably wrong already
> > before your series. reopen_commit for one node could move the whole tree
> > to a different context and then the later nodes would all be processed
> > while holding the wrong lock.
> 
> Probably proper way is to acquire all involved aio contexts as I do in
> 29 and update aio-context updating functions to work in such
> conditions(all aio contexts are already acquired by caller).

Whoops, what I gave was kind of a non-answer...

So essentially the reason for the locking rules of changing the
AioContext is that they drain the node first and drain imposes the
locking rule that the AioContext for the node to be drained must be
locked, and all other AioContexts must be unlocked.

The reason why drain imposes the rule is that we run AIO_WAIT_WHILE() in
one thread and we may need the event loops in other threads to make
progress until the while condition can eventually become false. If other
threads can't make progress because their lock is taken, we'll see
deadlocks sooner or later.

Kevin
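
Schematically, the constraint Kevin describes (a simplified sketch
assuming QEMU's AioContext/aio_poll(), not the actual AIO_WAIT_WHILE()
implementation): the waiting thread polls its own context, but the
condition may only become false through progress in other threads'
event loops, so those threads' AioContext locks must not be held.

    /* Caller holds ctx's lock and no other AioContext lock; if another
     * thread needs a lock we hold to flip *done, we deadlock. */
    static void wait_until_done(AioContext *ctx, bool *done)
    {
        while (!*done) {
            aio_poll(ctx, true);   /* blocking poll on our own context */
        }
    }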




Re: [PATCH v2 34/36] block: refactor bdrv_child_set_perm_safe() transaction action

2021-02-10 Thread Kevin Wolf
Am 27.11.2020 um 15:45 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Old interfaces dropped, nobody directly calls
> bdrv_child_set_perm_abort() and bdrv_child_set_perm_commit(), so we can
> use personal state structure for the action and stop exploiting
> BdrvChild structure. Also, drop "_safe" suffix which is redundant now.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 

> diff --git a/block.c b/block.c
> index 3093d20db8..1fde22e4f4 100644
> --- a/block.c
> +++ b/block.c
> @@ -2070,59 +2070,40 @@ static GSList *bdrv_topological_dfs(GSList *list, 
> GHashTable *found,
>  return g_slist_prepend(list, bs);
>  }
>  
> -static void bdrv_child_set_perm_commit(void *opaque)
> -{
> -BdrvChild *c = opaque;
> -
> -c->has_backup_perm = false;
> -}
> +typedef struct BdrvChildSetPermState {
> +BdrvChild *child;
> +uint64_t old_perm;
> +uint64_t old_shared_perm;
> +} BdrvChildSetPermState;
>  
>  static void bdrv_child_set_perm_abort(void *opaque)
>  {
> -BdrvChild *c = opaque;
> -/*
> - * We may have child->has_backup_perm unset at this point, as in case of
> - * _check_ stage of permission update failure we may _check_ not the 
> whole
> - * subtree.  Still, _abort_ is called on the whole subtree anyway.
> - */
> -if (c->has_backup_perm) {
> -c->perm = c->backup_perm;
> -c->shared_perm = c->backup_shared_perm;
> -c->has_backup_perm = false;
> -}
> +BdrvChildSetPermState *s = opaque;
> +
> +s->child->perm = s->old_perm;
> +s->child->shared_perm = s->old_shared_perm;
>  }

Ah, so this patch actually implements what I had asked for somewhere at
the start of the series.

Don't bother changing it earlier then. As long as it's in the same
series, this is fine.

Kevin




Re: [PATCH] iotests: Consistent $IMGOPTS boundary matching

2021-02-10 Thread Eric Blake
On 2/10/21 3:51 AM, Max Reitz wrote:
> To disallow certain refcount_bits values, some _unsupported_imgopts
> invocations look like "refcount_bits=1[^0-9]", i.e. they match an
> integer boundary with [^0-9].  This expression does not match the end of
> the string, though, so it breaks down when refcount_bits is the last
> option (which it tends to be after the rewrite of the check script in
> Python).
> 
> Those invocations could use \b or \> instead, but those are not
> portable.  They could use something like \([^0-9]\|$\), but that would
> be cumbersome.  To make it simple and keep the existing invocations
> working, just let _unsupported_imgopts match the regex against $IMGOPTS
> plus a trailing space.
> 
> Suggested-by: Eric Blake 
> Signed-off-by: Max Reitz 
> ---
> Supersedes "iotests: Fix unsupported_imgopts for refcount_bits", and can
> be reproduced in the same way:
> 
> $ ./check -qcow2 -o refcount_bits=1 7 15 29 58 62 66 68 80
> 
> (those tests should be skipped)
> ---
>  tests/qemu-iotests/common.rc | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Much smaller fix ;)

Reviewed-by: Eric Blake 

> 
> diff --git a/tests/qemu-iotests/common.rc b/tests/qemu-iotests/common.rc
> index 77c37e8312..65cdba5723 100644
> --- a/tests/qemu-iotests/common.rc
> +++ b/tests/qemu-iotests/common.rc
> @@ -885,7 +885,9 @@ _unsupported_imgopts()
>  {
>  for bad_opt
>  do
> -if echo "$IMGOPTS" | grep -q 2>/dev/null "$bad_opt"
> +# Add a space so tests can match for whitespace that marks the
> +# end of an option (\b or \> are not portable)
> +if echo "$IMGOPTS " | grep -q 2>/dev/null "$bad_opt"
>  then
>  _notrun "not suitable for image option: $bad_opt"
>  fi
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 36/36] block: refactor bdrv_node_check_perm()

2021-02-10 Thread Kevin Wolf
Am 27.11.2020 um 15:45 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Now, bdrv_node_check_perm() is called only with fresh cumulative
> permissions, so its actually "refresh_perm".
> 
> Move permission calculation to the function. Also, drop unreachable
> error message.
> 
> Add also Virtuozzo copyright, as big work is done at this point.

I guess we could add many copyright lines then... Maybe we should, I
don't know.

> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block.c | 38 +-
>  1 file changed, 9 insertions(+), 29 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 20b1cf59f7..576b145cbf 100644
> --- a/block.c
> +++ b/block.c
> @@ -2,6 +2,7 @@
>   * QEMU System Emulator block driver
>   *
>   * Copyright (c) 2003 Fabrice Bellard
> + * Copyright (c) 2020 Virtuozzo International GmbH.
>   *
>   * Permission is hereby granted, free of charge, to any person obtaining a 
> copy
>   * of this software and associated documentation files (the "Software"), to 
> deal
> @@ -2204,23 +2205,15 @@ static void bdrv_replace_child(BdrvChild *child, 
> BlockDriverState *new_bs,
>  /* old_bs reference is transparently moved from @child to @s */
>  }
>  
> -/*
> - * Check whether permissions on this node can be changed in a way that
> - * @cumulative_perms and @cumulative_shared_perms are the new cumulative
> - * permissions of all its parents. This involves checking whether all 
> necessary
> - * permission changes to child nodes can be performed.
> - *
> - * A call to this function must always be followed by a call to 
> bdrv_set_perm()
> - * or bdrv_abort_perm_update().
> - */

Would you mind updating the comment rather than removing it?

> -static int bdrv_node_check_perm(BlockDriverState *bs, BlockReopenQueue *q,
> -uint64_t cumulative_perms,
> -uint64_t cumulative_shared_perms,
> -GSList **tran, Error **errp)
> +static int bdrv_node_refresh_perm(BlockDriverState *bs, BlockReopenQueue *q,
> +  GSList **tran, Error **errp)
>  {
>  BlockDriver *drv = bs->drv;
>  BdrvChild *c;
>  int ret;
> +uint64_t cumulative_perms, cumulative_shared_perms;
> +
> +bdrv_get_cumulative_perm(bs, &cumulative_perms, 
> &cumulative_shared_perms);
>  
>  /* Write permissions never work with read-only images */
>  if ((cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) &&
> @@ -2229,15 +,8 @@ static int bdrv_node_check_perm(BlockDriverState *bs, 
> BlockReopenQueue *q,
>  if (!bdrv_is_writable_after_reopen(bs, NULL)) {
>  error_setg(errp, "Block node is read-only");
>  } else {
> -uint64_t current_perms, current_shared;
> > -bdrv_get_cumulative_perm(bs, &current_perms, &current_shared);
> -if (current_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) 
> {
> -error_setg(errp, "Cannot make block node read-only, there is 
> "
> -   "a writer on it");
> -} else {
> -error_setg(errp, "Cannot make block node read-only and 
> create "
> -   "a writer on it");
> -}
> +error_setg(errp, "Cannot make block node read-only, there is "
> +   "a writer on it");

Hm, so if you want to add a new writer to an existing read-only node,
this is the error message that you would get?

Now that we can't distinguish both cases any more, should we try to
rephrase it so that it makes sense for both directions? Like "Read-only
block node  cannot support read-write users"?


Sorry for it taking so long, but I've now finally looked at all patches
in this series. Please feel free to send v3 when you're ready.

Kevin




Re: [PATCH 2/2] hw/block/nvme: add write uncorrectable command

2021-02-10 Thread Minwoo Im
> > It might be nitpick, 'nlb' would easily represent the value which is
> > defined itself in the spec which is zero-based.  Can we have this like:
> > 
> > uint32_t nlb = le16_to_cpu(rw->nlb);
> > 
> > bitmap_clear(ns->uncorrectable, slba, nlb + 1);
> > 
> 
> 
> I do not disagree, but the `uint32_t nlb = le16_to_cpu(rw->nlb) + 1;`
> pattern is already used in several places.

Oh yes, now I just saw some places.  Then, please take my review tag for
this patch.

Thanks!



[PATCH V2 0/6] hw/block/nvme: support namespace attachment

2021-02-10 Thread Minwoo Im
Hello,

This series supports namespace attachment: attach and detach.  This is
the second version of the series, with a fix for a bug in choosing the
controller to attach a namespace to in the attach command handler.

Since V1:
  - Fix to take 'ctrl' which is given from the command rather than 'n'.
(Klaus)
  - Add a [7/7] patch to support CNS 12h Identify command (Namespace
Attached Controller list).

This series has been tested with the following: (!CONFIG_NVME_MULTIPATH)

  -device nvme-subsys,id=subsys0 \
  -device nvme,serial=foo,id=nvme0,subsys=subsys0 \
  -device nvme,serial=bar,id=nvme1,subsys=subsys0 \
  -device nvme-ns,id=ns1,drive=drv0,nsid=1,subsys=subsys0,zoned=false \
  -device nvme-ns,id=ns2,drive=drv1,nsid=2,subsys=subsys0,zoned=true \
  -device 
nvme-ns,id=ns3,drive=drv2,nsid=3,subsys=subsys0,detached=true,zoned=false \
  -device 
nvme-ns,id=ns4,drive=drv3,nsid=4,subsys=subsys0,detached=true,zoned=true \

  root@vm:~/work# nvme list
  Node  SN   Model  
  Namespace Usage  Format   FW Rev
  -  
 - -- 
 
  /dev/nvme0n1  foo  QEMU NVMe Ctrl 
  1 268.44  MB / 268.44  MB512   B +  0 B   1.0
  /dev/nvme0n2  foo  QEMU NVMe Ctrl 
  2 268.44  MB / 268.44  MB512   B +  0 B   1.0
  /dev/nvme1n1  bar  QEMU NVMe Ctrl 
  1 268.44  MB / 268.44  MB512   B +  0 B   1.0
  /dev/nvme1n2  bar  QEMU NVMe Ctrl 
  2 268.44  MB / 268.44  MB512   B +  0 B   1.0
  root@vm:~/work# nvme attach-ns /dev/nvme0 --namespace-id=3 --controllers=0,1
  attach-ns: Success, nsid:3
  root@vm:~/work# echo 1 > /sys/class/nvme/nvme0/rescan_controller 
  root@vm:~/work# echo 1 > /sys/class/nvme/nvme1/rescan_controller 
  root@vm:~/work# nvme list
  Node          SN   Model           Namespace  Usage                  Format       FW Rev
  ------------- ---- --------------- ---------- ---------------------- ------------ ------
  /dev/nvme0n1  foo  QEMU NVMe Ctrl  1          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  /dev/nvme0n2  foo  QEMU NVMe Ctrl  2          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  /dev/nvme0n3  foo  QEMU NVMe Ctrl  3          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  /dev/nvme1n1  bar  QEMU NVMe Ctrl  1          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  /dev/nvme1n2  bar  QEMU NVMe Ctrl  2          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  /dev/nvme1n3  bar  QEMU NVMe Ctrl  3          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  root@vm:~/work# nvme detach-ns /dev/nvme0 --namespace-id=3 --controllers=0
  detach-ns: Success, nsid:3
  root@vm:~/work# echo 1 > /sys/class/nvme/nvme0/rescan_controller 
  root@vm:~/work# nvme list
  Node          SN   Model           Namespace  Usage                  Format       FW Rev
  ------------- ---- --------------- ---------- ---------------------- ------------ ------
  /dev/nvme0n1  foo  QEMU NVMe Ctrl  1          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  /dev/nvme0n2  foo  QEMU NVMe Ctrl  2          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  /dev/nvme1n1  bar  QEMU NVMe Ctrl  1          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  /dev/nvme1n2  bar  QEMU NVMe Ctrl  2          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  /dev/nvme1n3  bar  QEMU NVMe Ctrl  3          268.44 MB / 268.44 MB  512 B + 0 B  1.0
  root@vm:~/work# nvme detach-ns /dev/nvme0 --namespace-id=1 --controllers=1
  detach-ns: Success, nsid:1
  root@vm:~/work# echo 1 > /sys/class/nvme/nvme1/rescan_controller 
  root@vm:~/work# nvme list
  Node          SN   Model           Namespace  Usage                  Format       FW Rev
  --

[PATCH V2 3/7] hw/block/nvme: fix allocated namespace list to 256

2021-02-10 Thread Minwoo Im
Expand the allocated namespace list (subsys->namespaces) to have 256
entries, a value larger than NVME_MAX_NAMESPACES, which sizes the
attached namespace list in a controller.

The allocated namespace list should be at least as large as the attached
namespace list.

n->num_namespaces = NVME_MAX_NAMESPACES;

The above line sets the NN field via id->nn, so the subsystem should
also prepare at least this number of namespace list entries.

Signed-off-by: Minwoo Im 
---
 hw/block/nvme-subsys.h | 2 +-
 hw/block/nvme.h| 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme-subsys.h b/hw/block/nvme-subsys.h
index 574774390c4c..8a0732b22316 100644
--- a/hw/block/nvme-subsys.h
+++ b/hw/block/nvme-subsys.h
@@ -14,7 +14,7 @@
 OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
 
 #define NVME_SUBSYS_MAX_CTRLS   32
-#define NVME_SUBSYS_MAX_NAMESPACES  32
+#define NVME_SUBSYS_MAX_NAMESPACES  256
 
 typedef struct NvmeCtrl NvmeCtrl;
 typedef struct NvmeNamespace NvmeNamespace;
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index bde0ed7c2679..1c7796b20996 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -10,6 +10,12 @@
 #define NVME_DEFAULT_ZONE_SIZE   (128 * MiB)
 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
 
+/*
+ * Subsystem namespace list for allocated namespaces should be larger than
+ * attached namespace list in a controller.
+ */
+QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_SUBSYS_MAX_NAMESPACES);
+
 typedef struct NvmeParams {
 char *serial;
 uint32_t num_queues; /* deprecated since 5.1 */
-- 
2.17.1




[PATCH V2 1/7] hw/block/nvme: support namespace detach

2021-02-10 Thread Minwoo Im
Given that we now support the nvme-subsys device, we can manage
namespaces that are allocated but not attached: detached.  This patch
introduces a parameter for the nvme-ns device named 'detached'.  This
parameter indicates whether the given namespace device is detached from
the entire NVMe subsystem (the 'subsys' case, shared namespace) or from
a controller (the 'bus' case, private namespace).

- Allocated namespace

  1) Shared ns in the subsystem 'subsys0':

 -device nvme-ns,id=ns1,drive=blknvme0,nsid=1,subsys=subsys0,detached=true

  2) Private ns for the controller 'nvme0' of the subsystem 'subsys0':

 -device nvme-subsys,id=subsys0
 -device nvme,serial=foo,id=nvme0,subsys=subsys0
 -device nvme-ns,id=ns1,drive=blknvme0,nsid=1,bus=nvme0,detached=true

  3) (Invalid case) Controller 'nvme0' has no subsystem to manage ns:

 -device nvme,serial=foo,id=nvme0
 -device nvme-ns,id=ns1,drive=blknvme0,nsid=1,bus=nvme0,detached=true

Signed-off-by: Minwoo Im 
---
 hw/block/nvme-ns.c |  1 +
 hw/block/nvme-ns.h |  1 +
 hw/block/nvme-subsys.h |  1 +
 hw/block/nvme.c| 41 +++--
 hw/block/nvme.h| 22 ++
 5 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index c3b513b0fc78..cdcb81319fb5 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -393,6 +393,7 @@ static Property nvme_ns_props[] = {
 DEFINE_BLOCK_PROPERTIES(NvmeNamespace, blkconf),
 DEFINE_PROP_LINK("subsys", NvmeNamespace, subsys, TYPE_NVME_SUBSYS,
  NvmeSubsystem *),
+DEFINE_PROP_BOOL("detached", NvmeNamespace, params.detached, false),
 DEFINE_PROP_UINT32("nsid", NvmeNamespace, params.nsid, 0),
 DEFINE_PROP_UUID("uuid", NvmeNamespace, params.uuid),
 DEFINE_PROP_UINT16("mssrl", NvmeNamespace, params.mssrl, 128),
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index 7af6884862b5..b0c00e115d81 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -26,6 +26,7 @@ typedef struct NvmeZone {
 } NvmeZone;
 
 typedef struct NvmeNamespaceParams {
+bool detached;
 uint32_t nsid;
 QemuUUID uuid;
 
diff --git a/hw/block/nvme-subsys.h b/hw/block/nvme-subsys.h
index ccf6a71398d3..890d118117dc 100644
--- a/hw/block/nvme-subsys.h
+++ b/hw/block/nvme-subsys.h
@@ -23,6 +23,7 @@ typedef struct NvmeSubsystem {
 uint8_t subnqn[256];
 
 NvmeCtrl*ctrls[NVME_SUBSYS_MAX_CTRLS];
+/* Allocated namespaces for this subsystem */
 NvmeNamespace *namespaces[NVME_SUBSYS_MAX_NAMESPACES];
 } NvmeSubsystem;
 
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 6b84e34843f5..a1e930f7c8e4 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -23,7 +23,7 @@
  *  max_ioqpairs=, \
  *  aerl=, aer_max_queued=, \
  *  mdts=,zoned.append_size_limit=, \
- *  subsys= \
+ *  subsys=,detached=
  *  -device nvme-ns,drive=,bus=,nsid=,\
  *  zoned=, \
  *  subsys=
@@ -78,6 +78,13 @@
  *   controllers in the subsystem. Otherwise, `bus` must be given to attach
  *   this namespace to a specified single controller as a non-shared namespace.
  *
+ * - `detached`
+ *   Not to attach the namespace device to controllers in the NVMe subsystem
+ *   during boot-up. If not given, namespaces are all attached to all
+ *   controllers in the subsystem by default.
+ *   It is mutually exclusive with the 'bus' parameter. It is only valid in case
+ *   `subsys` is provided.
+ *
  * Setting `zoned` to true selects Zoned Command Set at the namespace.
  * In this case, the following namespace properties are available to configure
  * zoned operation:
@@ -4521,6 +4528,20 @@ static void nvme_init_state(NvmeCtrl *n)
 n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
 }
 
+static int nvme_attach_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+{
+if (nvme_ns_is_attached(n, ns)) {
+error_setg(errp,
+   "namespace %d is already attached to controller %d",
+   nvme_nsid(ns), n->cntlid);
+return -1;
+}
+
+nvme_ns_attach(n, ns);
+
+return 0;
+}
+
 int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 {
 uint32_t nsid = nvme_nsid(ns);
@@ -4552,7 +4573,23 @@ int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace 
*ns, Error **errp)
 
 trace_pci_nvme_register_namespace(nsid);
 
-n->namespaces[nsid - 1] = ns;
+/*
+ * If subsys is not given, namespace is always attached to the controller
+ * because there's no subsystem to manage namespace allocation.
+ */
+if (!n->subsys) {
+if (ns->params.detached) {
+error_setg(errp,
+   "detached needs nvme-subsys specified nvme or nvme-ns");
+return -1;
+}
+
+return nvme_attach_namespace(n, ns, errp);
+} else {
+if (!ns->params.detached) {
+return nvme_attach_name

[PATCH V2 2/7] hw/block/nvme: fix namespaces array to 1-based

2021-02-10 Thread Minwoo Im
The subsys->namespaces array used to be sized to NVME_SUBSYS_MAX_NAMESPACES.
But subsys->namespaces is accessed with 1-based namespace IDs, which
means the very first array entry will always be empty (NULL).

Signed-off-by: Minwoo Im 
---
 hw/block/nvme-subsys.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/nvme-subsys.h b/hw/block/nvme-subsys.h
index 890d118117dc..574774390c4c 100644
--- a/hw/block/nvme-subsys.h
+++ b/hw/block/nvme-subsys.h
@@ -24,7 +24,7 @@ typedef struct NvmeSubsystem {
 
 NvmeCtrl*ctrls[NVME_SUBSYS_MAX_CTRLS];
 /* Allocated namespaces for this subsystem */
-NvmeNamespace *namespaces[NVME_SUBSYS_MAX_NAMESPACES];
+NvmeNamespace *namespaces[NVME_SUBSYS_MAX_NAMESPACES + 1];
 } NvmeSubsystem;
 
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp);
-- 
2.17.1
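
A side note on the 1-based convention: a bounds-checked accessor keeps
the off-by-one in a single place.  A sketch only (the helper name and
the bounds check are illustrative; the series itself adds a simpler
nvme_subsys_ns() in a later patch):

    static inline NvmeNamespace *nvme_subsys_ns_checked(NvmeSubsystem *subsys,
                                                        uint32_t nsid)
    {
        if (!subsys || !nsid || nsid > NVME_SUBSYS_MAX_NAMESPACES) {
            return NULL;
        }

        return subsys->namespaces[nsid];
    }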




[PATCH V2 6/7] hw/block/nvme: support namespace attachment command

2021-02-10 Thread Minwoo Im
This patch supports the Namespace Attachment command for the pre-defined
nvme-ns device nodes.  Of course, attaching or detaching a namespace
should only be supported when 'subsys' is given: if we detach a
namespace from a controller, somebody needs to manage the detached, but
still allocated, namespace in the NVMe subsystem.

Signed-off-by: Minwoo Im 
---
 hw/block/nvme-subsys.h | 10 +++
 hw/block/nvme.c| 59 ++
 hw/block/nvme.h|  5 
 hw/block/trace-events  |  2 ++
 include/block/nvme.h   |  5 
 5 files changed, 81 insertions(+)

diff --git a/hw/block/nvme-subsys.h b/hw/block/nvme-subsys.h
index 14627f9ccb41..ef4bec928eae 100644
--- a/hw/block/nvme-subsys.h
+++ b/hw/block/nvme-subsys.h
@@ -30,6 +30,16 @@ typedef struct NvmeSubsystem {
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp);
 int nvme_subsys_register_ns(NvmeNamespace *ns, Error **errp);
 
+static inline NvmeCtrl *nvme_subsys_ctrl(NvmeSubsystem *subsys,
+uint32_t cntlid)
+{
+if (!subsys) {
+return NULL;
+}
+
+return subsys->ctrls[cntlid];
+}
+
 /*
  * Return allocated namespace of the specified nsid in the subsystem.
  */
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 697368a6ae0c..71bcd66f1956 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -183,6 +183,7 @@ static const uint32_t nvme_cse_acs[256] = {
 [NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
 [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
+[NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP,
 };
 
 static const uint32_t nvme_cse_iocs_none[256];
@@ -3766,6 +3767,62 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req)
 return NVME_NO_COMPLETE;
 }
 
+static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns);
+static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req)
+{
+NvmeNamespace *ns;
+NvmeCtrl *ctrl;
+uint16_t list[NVME_CONTROLLER_LIST_SIZE] = {};
+uint32_t nsid = le32_to_cpu(req->cmd.nsid);
+uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
+bool attach = !(dw10 & 0xf);
+uint16_t *nr_ids = &list[0];
+uint16_t *ids = &list[1];
+uint16_t ret;
+int i;
+
+trace_pci_nvme_ns_attachment(nvme_cid(req), dw10 & 0xf);
+
+ns = nvme_subsys_ns(n->subsys, nsid);
+if (!ns) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+ret = nvme_dma(n, (uint8_t *)list, 4096,
+   DMA_DIRECTION_TO_DEVICE, req);
+if (ret) {
+return ret;
+}
+
+if (!*nr_ids) {
+return NVME_NS_CTRL_LIST_INVALID | NVME_DNR;
+}
+
+for (i = 0; i < *nr_ids; i++) {
+ctrl = nvme_subsys_ctrl(n->subsys, ids[i]);
+if (!ctrl) {
+return NVME_NS_CTRL_LIST_INVALID | NVME_DNR;
+}
+
+if (attach) {
+if (nvme_ns_is_attached(ctrl, ns)) {
+return NVME_NS_ALREADY_ATTACHED | NVME_DNR;
+}
+
+nvme_ns_attach(ctrl, ns);
+__nvme_select_ns_iocs(ctrl, ns);
+} else {
+if (!nvme_ns_is_attached(ctrl, ns)) {
+return NVME_NS_NOT_ATTACHED | NVME_DNR;
+}
+
+nvme_ns_detach(ctrl, ns);
+}
+}
+
+return NVME_SUCCESS;
+}
+
 static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
 trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
@@ -3797,6 +3854,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 return nvme_get_feature(n, req);
 case NVME_ADM_CMD_ASYNC_EV_REQ:
 return nvme_aer(n, req);
+case NVME_ADM_CMD_NS_ATTACHMENT:
+return nvme_ns_attachment(n, req);
 default:
 assert(false);
 }
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 1c7796b20996..5a1ab857d166 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -222,6 +222,11 @@ static inline void nvme_ns_attach(NvmeCtrl *n, 
NvmeNamespace *ns)
 n->namespaces[nvme_nsid(ns) - 1] = ns;
 }
 
+static inline void nvme_ns_detach(NvmeCtrl *n, NvmeNamespace *ns)
+{
+n->namespaces[nvme_nsid(ns) - 1] = NULL;
+}
+
 static inline NvmeCQueue *nvme_cq(NvmeRequest *req)
 {
 NvmeSQueue *sq = req->sq;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index b6e972d733a6..bf67fe7873d2 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -80,6 +80,8 @@ pci_nvme_aer(uint16_t cid) "cid %"PRIu16""
 pci_nvme_aer_aerl_exceeded(void) "aerl exceeded"
 pci_nvme_aer_masked(uint8_t type, uint8_t mask) "type 0x%"PRIx8" mask 
0x%"PRIx8""
 pci_nvme_aer_post_cqe(uint8_t typ, uint8_t info, uint8_t log_page) "type 
0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
+pci_nvme_ns_attachment(uint16_t cid, uint8_t sel) "cid %"PRIu16", 
sel=0x%"PRIx8""
+pci_nvme_ns_attachment_attach(uint16_t cntlid, uint32_t nsid) 
"cntlid=0x%"PRIx16", nsid=0x%"PRIx32""
 pci_nvme_enqueue_event(uint8_t typ, uint8_t info, uint8_t log_page) "type 

[PATCH V2 4/7] hw/block/nvme: support allocated namespace type

2021-02-10 Thread Minwoo Im
From NVMe spec 1.4b, "6.1.5. NSID and Namespace Relationships" defines
valid namespace types:

- Unallocated: Does not exist in the NVMe subsystem
- Allocated: Exists in the NVMe subsystem
- Inactive: Not attached to the controller
- Active: Attached to the controller

This patch adds support for the allocated, but not attached, namespace type:

!nvme_ns(n, nsid) && nvme_subsys_ns(n->subsys, nsid)

nvme_ns() returns the attached namespace instance of the given
controller, and nvme_subsys_ns() returns the allocated namespace
instance in the subsystem.

Signed-off-by: Minwoo Im 
---
 hw/block/nvme-subsys.h | 13 +
 hw/block/nvme.c| 63 +++---
 2 files changed, 60 insertions(+), 16 deletions(-)

diff --git a/hw/block/nvme-subsys.h b/hw/block/nvme-subsys.h
index 8a0732b22316..14627f9ccb41 100644
--- a/hw/block/nvme-subsys.h
+++ b/hw/block/nvme-subsys.h
@@ -30,4 +30,17 @@ typedef struct NvmeSubsystem {
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp);
 int nvme_subsys_register_ns(NvmeNamespace *ns, Error **errp);
 
+/*
+ * Return allocated namespace of the specified nsid in the subsystem.
+ */
+static inline NvmeNamespace *nvme_subsys_ns(NvmeSubsystem *subsys,
+uint32_t nsid)
+{
+if (!subsys) {
+return NULL;
+}
+
+return subsys->namespaces[nsid];
+}
+
 #endif /* NVME_SUBSYS_H */
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a1e930f7c8e4..d1761a82731f 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -3124,7 +3124,7 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, 
NvmeRequest *req)
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req, bool active)
 {
 NvmeNamespace *ns;
 NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -3138,7 +3138,14 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
NvmeRequest *req)
 
 ns = nvme_ns(n, nsid);
 if (unlikely(!ns)) {
-return nvme_rpt_empty_id_struct(n, req);
+if (!active) {
+ns = nvme_subsys_ns(n->subsys, nsid);
+if (!ns) {
+return nvme_rpt_empty_id_struct(n, req);
+}
+} else {
+return nvme_rpt_empty_id_struct(n, req);
+}
 }
 
 if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
@@ -3149,7 +3156,8 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest 
*req)
 return NVME_INVALID_CMD_SET | NVME_DNR;
 }
 
-static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
+bool active)
 {
 NvmeNamespace *ns;
 NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -3163,7 +3171,14 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, 
NvmeRequest *req)
 
 ns = nvme_ns(n, nsid);
 if (unlikely(!ns)) {
-return nvme_rpt_empty_id_struct(n, req);
+if (!active) {
+ns = nvme_subsys_ns(n->subsys, nsid);
+if (!ns) {
+return nvme_rpt_empty_id_struct(n, req);
+}
+} else {
+return nvme_rpt_empty_id_struct(n, req);
+}
 }
 
 if (c->csi == NVME_CSI_NVM && nvme_csi_has_nvm_support(ns)) {
@@ -3176,7 +3191,8 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, 
NvmeRequest *req)
 return NVME_INVALID_FIELD | NVME_DNR;
 }
 
-static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req,
+bool active)
 {
 NvmeNamespace *ns;
 NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -3201,7 +3217,14 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
NvmeRequest *req)
 for (i = 1; i <= n->num_namespaces; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
-continue;
+if (!active) {
+ns = nvme_subsys_ns(n->subsys, i);
+if (!ns) {
+continue;
+}
+} else {
+continue;
+}
 }
 if (ns->params.nsid <= min_nsid) {
 continue;
@@ -3215,7 +3238,8 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
NvmeRequest *req)
 return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req)
+static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, NvmeRequest *req,
+bool active)
 {
 NvmeNamespace *ns;
 NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
@@ -3241,7 +3265,14 @@ static uint16_t nvme_identify_nslist_csi(NvmeCtrl *n, 
NvmeRequest *req)
 for (i = 1; i <= n->num_namespaces; i++) {
 ns = nvme_ns(n, i);
 if (!ns) {
-continue;
+if (!active) {
+ns = nvme_subsys_ns(n->subsys, i);
+if (!ns) {
+continue;
+

[PATCH V2 5/7] hw/block/nvme: refactor nvme_select_ns_iocs

2021-02-10 Thread Minwoo Im
This patch has no functional changes.  It just refactors
nvme_select_ns_iocs() to iterate over the attached namespaces of the
controller and invoke __nvme_select_ns_iocs() for each of them.

Signed-off-by: Minwoo Im 
---
 hw/block/nvme.c | 36 +---
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index d1761a82731f..697368a6ae0c 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -3896,6 +3896,25 @@ static void nvme_ctrl_shutdown(NvmeCtrl *n)
 }
 }
 
+static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns)
+{
+ns->iocs = nvme_cse_iocs_none;
+switch (ns->csi) {
+case NVME_CSI_NVM:
+if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
+ns->iocs = nvme_cse_iocs_nvm;
+}
+break;
+case NVME_CSI_ZONED:
+if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_CSI) {
+ns->iocs = nvme_cse_iocs_zoned;
+} else if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_NVM) {
+ns->iocs = nvme_cse_iocs_nvm;
+}
+break;
+}
+}
+
 static void nvme_select_ns_iocs(NvmeCtrl *n)
 {
 NvmeNamespace *ns;
@@ -3906,21 +3925,8 @@ static void nvme_select_ns_iocs(NvmeCtrl *n)
 if (!ns) {
 continue;
 }
-ns->iocs = nvme_cse_iocs_none;
-switch (ns->csi) {
-case NVME_CSI_NVM:
-if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
-ns->iocs = nvme_cse_iocs_nvm;
-}
-break;
-case NVME_CSI_ZONED:
-if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_CSI) {
-ns->iocs = nvme_cse_iocs_zoned;
-} else if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_NVM) {
-ns->iocs = nvme_cse_iocs_nvm;
-}
-break;
-}
+
+__nvme_select_ns_iocs(n, ns);
 }
 }
 
-- 
2.17.1




[PATCH V2 7/7] hw/block/nvme: support Identify NS Attached Controller List

2021-02-10 Thread Minwoo Im
Support the Identify command for the Namespace Attached Controller list.
This command handler traverses the controller instances in the given
subsystem to figure out whether the specified nsid is attached to each
controller or not.

The 4096-byte Identify data is returned with the first entry (16 bits)
indicating the number of controller ID entries, so the data can hold up
to 2047 controller IDs.

Signed-off-by: Minwoo Im 
---
 hw/block/nvme.c   | 42 ++
 hw/block/trace-events |  1 +
 include/block/nvme.h  |  1 +
 3 files changed, 44 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 71bcd66f1956..da60335def9f 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -3157,6 +3157,46 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
NvmeRequest *req, bool active)
 return NVME_INVALID_CMD_SET | NVME_DNR;
 }
 
+static uint16_t nvme_identify_ns_attached_list(NvmeCtrl *n, NvmeRequest *req)
+{
+NvmeIdentify *c = (NvmeIdentify *)&req->cmd;
+uint16_t min_id = le16_to_cpu(c->ctrlid);
+uint16_t list[NVME_CONTROLLER_LIST_SIZE] = {};
+uint16_t *ids = &list[1];
+NvmeNamespace *ns;
+NvmeCtrl *ctrl;
+int cntlid, nr_ids = 0;
+
+trace_pci_nvme_identify_ns_attached_list(min_id);
+
+if (c->nsid == NVME_NSID_BROADCAST) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+ns = nvme_subsys_ns(n->subsys, c->nsid);
+if (!ns) {
+return NVME_INVALID_FIELD | NVME_DNR;
+}
+
+for (cntlid = min_id; cntlid < ARRAY_SIZE(n->subsys->ctrls); cntlid++) {
+ctrl = nvme_subsys_ctrl(n->subsys, cntlid);
+if (!ctrl) {
+continue;
+}
+
+if (!nvme_ns_is_attached(ctrl, ns)) {
+continue;
+}
+
+ids[nr_ids++] = cntlid;
+}
+
+list[0] = nr_ids;
+
+return nvme_dma(n, (uint8_t *)list, sizeof(list),
+DMA_DIRECTION_FROM_DEVICE, req);
+}
+
 static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req,
 bool active)
 {
@@ -3356,6 +3396,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest 
*req)
 return nvme_identify_ns(n, req, true);
 case NVME_ID_CNS_NS_PRESENT:
 return nvme_identify_ns(n, req, false);
+case NVME_ID_CNS_NS_ATTACHED_CTRL_LIST:
+return nvme_identify_ns_attached_list(n, req);
 case NVME_ID_CNS_CS_NS:
 return nvme_identify_ns_csi(n, req, true);
 case NVME_ID_CNS_CS_NS_PRESENT:
diff --git a/hw/block/trace-events b/hw/block/trace-events
index bf67fe7873d2..2d88d96c2165 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -62,6 +62,7 @@ pci_nvme_del_cq(uint16_t cqid) "deleted completion queue, 
cqid=%"PRIu16""
 pci_nvme_identify_ctrl(void) "identify controller"
 pci_nvme_identify_ctrl_csi(uint8_t csi) "identify controller, csi=0x%"PRIx8""
 pci_nvme_identify_ns(uint32_t ns) "nsid %"PRIu32""
+pci_nvme_identify_ns_attached_list(uint16_t cntid) "cntid=%"PRIu16""
 pci_nvme_identify_ns_csi(uint32_t ns, uint8_t csi) "nsid=%"PRIu32", 
csi=0x%"PRIx8""
 pci_nvme_identify_nslist(uint32_t ns) "nsid %"PRIu32""
 pci_nvme_identify_nslist_csi(uint16_t ns, uint8_t csi) "nsid=%"PRIu16", 
csi=0x%"PRIx8""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 4b016f954fee..fb82d8682e9f 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -968,6 +968,7 @@ enum NvmeIdCns {
 NVME_ID_CNS_CS_NS_ACTIVE_LIST = 0x07,
 NVME_ID_CNS_NS_PRESENT_LIST   = 0x10,
 NVME_ID_CNS_NS_PRESENT= 0x11,
+NVME_ID_CNS_NS_ATTACHED_CTRL_LIST = 0x12,
 NVME_ID_CNS_CS_NS_PRESENT_LIST= 0x1a,
 NVME_ID_CNS_CS_NS_PRESENT = 0x1b,
 NVME_ID_CNS_IO_COMMAND_SET= 0x1c,
-- 
2.17.1
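
The 4096-byte controller list payload shared by the attachment command
and this Identify handler can be pictured as follows (an illustrative
struct, not one from the patch; the code simply uses a uint16_t array
with index 0 holding the count):

    typedef struct NvmeCtrlList {
        uint16_t count;        /* number of valid identifiers that follow */
        uint16_t cntlid[2047]; /* controller IDs; 2 + 2047 * 2 = 4096 bytes */
    } NvmeCtrlList;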




Re: [PATCH 1/2] file-posix: Use OFD lock only if the filesystem supports the lock

2021-02-10 Thread Masayoshi Mizuma
On Fri, Nov 20, 2020 at 04:42:28PM +0100, Kevin Wolf wrote:
> Am 20.11.2020 um 00:56 hat Masayoshi Mizuma geschrieben:
> > On Thu, Nov 19, 2020 at 11:44:42AM +0100, Kevin Wolf wrote:
> > > Am 18.11.2020 um 20:48 hat Masayoshi Mizuma geschrieben:
> > > > On Wed, Nov 18, 2020 at 02:10:36PM -0500, Masayoshi Mizuma wrote:
> > > > > On Wed, Nov 18, 2020 at 04:42:47PM +0100, Kevin Wolf wrote:
> > > > > > The logic looks fine to me, at least assuming that EINVAL is really 
> > > > > > what
> > > > > > we will consistently get from the kernel if OFD locks are not 
> > > > > > supported.
> > > > > > Is this documented anywhere? The fcntl manpage doesn't seem to 
> > > > > > mention
> > > > > > this case.
> > > > 
> > > > The man page of fcntl(2) says:
> > > > 
> > > >EINVAL The value specified in cmd is not recognized by this 
> > > > kernel.
> > > > 
> > > > So I think EINVAL is good enough to check whether the filesystem 
> > > > supports
> > > > OFD locks or not...
> > > 
> > > A kernel not knowing the cmd at all is a somewhat different case (and
> > > certainly a different code path) than a filesystem not supporting it.
> > > 
> > > I just had a look at the kernel code, and to me it seems that the
> > > difference between POSIX locks and OFD locks is handled entirely in
> > > filesystem independent code. A filesystem driver would in theory have
> > > ways to distinguish both, but I don't see any driver in the kernel tree
> > > that actually does this (and there is probably little reason for a
> > > driver to do so).
> > > 
> > > So now I wonder what filesystem you are using? I'm curious what I
> > > missed.
> > 
> > I'm using a proprietary filesystem, which isn't in the linux kernel.
> > The filesystem supports posix lock only, doesn't support OFD lock...
> 
> Do you know why that proprietary filesystem driver makes a difference
> between POSIX locks and OFD locks? The main difference between both
> types is when they are released automatically, and this is handled by
> generic kernel code and not the filesystem driver.
> 
> From a filesystem perspective, I don't see any reason to even
> distuingish. So unless there are good reasons for making the
> distinction, I'm currently inclined to view this as a filesystem
> driver bug.
> 
> It makes handling this case hard because when the case isn't even
> supposed to exist, of course there won't be a defined error code.

Hi Kevin,

The filesystem team found a locking issue in the filesystem.
Your comments were very helpful! I really appreciate it.

Thanks,
Masa



Re: [PATCH 1/7] qemu/queue: add some useful QLIST_ and QTAILQ_ macros

2021-02-10 Thread Max Reitz

On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:

Add QLIST_FOREACH_FUNC_SAFE(), QTAILQ_FOREACH_FUNC_SAFE() and
QTAILQ_POP_HEAD(), to be used in the following commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/qemu/queue.h | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index e029e7bf66..03e1fce85f 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -173,6 +173,13 @@ struct {   
 \
  (var) && ((next_var) = ((var)->field.le_next), 1);  \
  (var) = (next_var))
  
+#define QLIST_FOREACH_FUNC_SAFE(head, field, func) do { \

+typeof(*QLIST_FIRST(head)) *qffs_var, *qffs_next_var;   \
+QLIST_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {\
+(func)(qffs_var);   \
+}   \
+} while (/*CONSTCOND*/0)
+


On one hand I have inexplicable reservations against adding these macros 
if they’re only used one time in the next patch.


On the other, I have one clearly expressible reservation, and that's the 
fact that some future functions that could make use of this might want 
to take more arguments, like closures.


Could we make these macros take varargs?  I.e., e.g.,

#define QLIST_FOREACH_FUNC_SAFE(head, field, func, ...) do {
...
(func)(qffs_var, ## __VA_ARGS__);
...

Max


  /*
   * List access methods.
   */
@@ -490,6 +497,13 @@ union {
 \
   (var) && ((prev_var) = QTAILQ_PREV(var, field), 1);\
   (var) = (prev_var))
  
+#define QTAILQ_FOREACH_FUNC_SAFE(head, field, func) do {\

+typeof(*QTAILQ_FIRST(head)) *qffs_var, *qffs_next_var;  \
+QTAILQ_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {   \
+(func)(qffs_var);   \
+}   \
+} while (/*CONSTCOND*/0)
+
  /*
   * Tail queue access methods.
   */
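
For concreteness, Max's vararg suggestion applied to the QLIST variant
would look something like this (a sketch, untested):

    #define QLIST_FOREACH_FUNC_SAFE(head, field, func, ...) do {        \
            typeof(*QLIST_FIRST(head)) *qffs_var, *qffs_next_var;       \
            QLIST_FOREACH_SAFE(qffs_var, (head), field, qffs_next_var) {\
                (func)(qffs_var, ## __VA_ARGS__);                       \
            }                                                           \
    } while (/*CONSTCOND*/0)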






Re: [RFC PATCH v2 1/4] block: Allow changing bs->file on reopen

2021-02-10 Thread Kevin Wolf
Am 08.02.2021 um 19:44 hat Alberto Garcia geschrieben:
> When the x-blockdev-reopen was added it allowed reconfiguring the
> graph by replacing backing files, but changing the 'file' option was
> forbidden. Because of this restriction some operations are not
> possible, notably inserting and removing block filters.
> 
> This patch adds support for replacing the 'file' option. This is
> similar to replacing the backing file and the user is likewise
> responsible for the correctness of the resulting graph, otherwise this
> can lead to data corruption.
> 
> Signed-off-by: Alberto Garcia 
> ---
>  include/block/block.h  |  1 +
>  block.c| 65 ++
>  tests/qemu-iotests/245 |  7 +++--
>  3 files changed, 70 insertions(+), 3 deletions(-)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index 82271d9ccd..6dd687a69e 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -196,6 +196,7 @@ typedef struct BDRVReopenState {
>  bool backing_missing;
>  bool replace_backing_bs;  /* new_backing_bs is ignored if this is false 
> */
>  BlockDriverState *old_backing_bs; /* keep pointer for permissions update 
> */
> +BlockDriverState *old_file_bs;/* keep pointer for permissions update 
> */
>  uint64_t perm, shared_perm;
>  QDict *options;
>  QDict *explicit_options;
> diff --git a/block.c b/block.c
> index 576b145cbf..19b62da4af 100644
> --- a/block.c
> +++ b/block.c
> @@ -3978,6 +3978,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
> Error **errp)
>  refresh_list = bdrv_topological_dfs(refresh_list, found,
>  state->old_backing_bs);
>  }
> +if (state->old_file_bs) {
> +refresh_list = bdrv_topological_dfs(refresh_list, found,
> +state->old_file_bs);
> +}
>  }
>  
>  ret = bdrv_list_refresh_perms(refresh_list, bs_queue, &tran, errp);
> @@ -4196,6 +4200,61 @@ static int bdrv_reopen_parse_backing(BDRVReopenState 
> *reopen_state,
>  return 0;
>  }
>  
> +static int bdrv_reopen_parse_file(BDRVReopenState *reopen_state,
> +  GSList **tran,
> +  Error **errp)
> +{
> +BlockDriverState *bs = reopen_state->bs;
> +BlockDriverState *new_file_bs;
> +QObject *value;
> +const char *str;
> +
> +value = qdict_get(reopen_state->options, "file");
> +if (value == NULL) {
> +return 0;
> +}
> +
> +/* The 'file' option only allows strings */
> +assert(qobject_type(value) == QTYPE_QSTRING);

This is true, but not entirely obvious: The QAPI schema has BlockdevRef,
which can be either a string or a dict. However, we're dealing with a
flattened options dict here, so no more nested dicts.

qemu-io doesn't go through the schema, but its parser represents all
scalars as strings, so it's correct even in this case.

> +
> +str = qobject_get_try_str(value);

This function doesn't exist in master any more, but we already know that
we have a string here, so it's easy enough to replace:

str = qstring_get_str(qobject_to(QString, value));

> +new_file_bs = bdrv_lookup_bs(NULL, str, errp);
> +if (new_file_bs == NULL) {
> +return -EINVAL;
> +} else if (bdrv_recurse_has_child(new_file_bs, bs)) {
> +error_setg(errp, "Making '%s' a file of '%s' "
> +   "would create a cycle", str, bs->node_name);
> +return -EINVAL;
> +}
> +
> +assert(bs->file && bs->file->bs);
> +
> +/* If 'file' points to the current child then there's nothing to do */
> +if (bs->file->bs == new_file_bs) {
> +return 0;
> +}
> +
> +if (bs->file->frozen) {
> +error_setg(errp, "Cannot change the 'file' link of '%s' "
> +   "from '%s' to '%s'", bs->node_name,
> +   bs->file->bs->node_name, new_file_bs->node_name);
> +return -EPERM;
> +}
> +
> +/* Check AioContext compatibility */
> +if (!bdrv_reopen_can_attach(bs, bs->file, new_file_bs, errp)) {
> +return -EINVAL;
> +}
> +
> +/* Store the old file bs because we'll need to refresh its permissions */
> +reopen_state->old_file_bs = bs->file->bs;
> +
> +/* And finally replace the child */
> +bdrv_replace_child(bs->file, new_file_bs, tran);
> +
> +return 0;
> +}

As Vladimir said, it would be nice to avoid some duplication with the
backing file switching code (especially when you consider that we might
get more of these cases, think of qcow2 data files or VMDK extents), but
generally this patch makes sense to me.

Kevin




[PATCH v2 1/2] migration: dirty-bitmap: Convert alias map inner members to a struct

2021-02-10 Thread Peter Krempa
Currently the alias mapping hash stores just strings of the target
objects internally. In upcoming patches we'll be adding another member
that will need to be stored in the map, so convert the members to a
struct.

Signed-off-by: Peter Krempa 
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---

v2:
 - NULL-check in freeing function (Eric)
 - style problems (Vladimir)

 migration/block-dirty-bitmap.c | 43 +++---
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index c61d382be8..0169f672df 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -169,6 +169,22 @@ typedef struct DBMState {

 static DBMState dbm_state;

+typedef struct AliasMapInnerBitmap {
+char *string;
+} AliasMapInnerBitmap;
+
+static void free_alias_map_inner_bitmap(void *amin_ptr)
+{
+AliasMapInnerBitmap *amin = amin_ptr;
+
+if (!amin_ptr) {
+return;
+}
+
+g_free(amin->string);
+g_free(amin);
+}
+
 /* For hash tables that map node/bitmap names to aliases */
 typedef struct AliasMapInnerNode {
 char *string;
@@ -263,8 +279,8 @@ static GHashTable *construct_alias_map(const 
BitmapMigrationNodeAliasList *bbm,
 node_map_to = bmna->node_name;
 }

-bitmaps_map = g_hash_table_new_full(g_str_hash, g_str_equal,
-g_free, g_free);
+bitmaps_map = g_hash_table_new_full(g_str_hash, g_str_equal, g_free,
+free_alias_map_inner_bitmap);

 amin = g_new(AliasMapInnerNode, 1);
 *amin = (AliasMapInnerNode){
@@ -277,6 +293,7 @@ static GHashTable *construct_alias_map(const 
BitmapMigrationNodeAliasList *bbm,
 for (bmbal = bmna->bitmaps; bmbal; bmbal = bmbal->next) {
 const BitmapMigrationBitmapAlias *bmba = bmbal->value;
 const char *bmap_map_from, *bmap_map_to;
+AliasMapInnerBitmap *bmap_inner;

 if (strlen(bmba->alias) > UINT8_MAX) {
 error_setg(errp,
@@ -311,8 +328,11 @@ static GHashTable *construct_alias_map(const 
BitmapMigrationNodeAliasList *bbm,
 }
 }

+bmap_inner = g_new0(AliasMapInnerBitmap, 1);
+bmap_inner->string = g_strdup(bmap_map_to);
+
 g_hash_table_insert(bitmaps_map,
-g_strdup(bmap_map_from), 
g_strdup(bmap_map_to));
+g_strdup(bmap_map_from), bmap_inner);
 }
 }

@@ -538,11 +558,15 @@ static int add_bitmaps_to_list(DBMSaveState *s, 
BlockDriverState *bs,
 }

 if (bitmap_aliases) {
-bitmap_alias = g_hash_table_lookup(bitmap_aliases, bitmap_name);
-if (!bitmap_alias) {
+AliasMapInnerBitmap *bmap_inner;
+
+bmap_inner = g_hash_table_lookup(bitmap_aliases, bitmap_name);
+if (!bmap_inner) {
 /* Skip bitmaps with no alias */
 continue;
 }
+
+bitmap_alias = bmap_inner->string;
 } else {
 if (strlen(bitmap_name) > UINT8_MAX) {
 error_report("Cannot migrate bitmap '%s' on node '%s': "
@@ -1074,14 +1098,17 @@ static int dirty_bitmap_load_header(QEMUFile *f, 
DBMLoadState *s,

 bitmap_name = s->bitmap_alias;
 if (!s->cancelled && bitmap_alias_map) {
-bitmap_name = g_hash_table_lookup(bitmap_alias_map,
-  s->bitmap_alias);
-if (!bitmap_name) {
+AliasMapInnerBitmap *bmap_inner;
+
+bmap_inner = g_hash_table_lookup(bitmap_alias_map, 
s->bitmap_alias);
+if (!bmap_inner) {
 error_report("Error: Unknown bitmap alias '%s' on node "
  "'%s' (alias '%s')", s->bitmap_alias,
  s->bs->node_name, s->node_alias);
 cancel_incoming_locked(s);
 }
+
+bitmap_name = bmap_inner->string;
 }

 if (!s->cancelled) {
-- 
2.29.2




[PATCH v2 0/2] migration: dirty-bitmap: Allow control of bitmap persistence

2021-02-10 Thread Peter Krempa
See 2/2 for explanation.

Peter Krempa (2):
  migration: dirty-bitmap: Convert alias map inner members to a struct
  migration: dirty-bitmap: Allow control of bitmap persistence

 migration/block-dirty-bitmap.c | 73 +-
 qapi/migration.json| 20 +-
 2 files changed, 81 insertions(+), 12 deletions(-)

-- 
2.29.2




[PATCH v2 2/2] migration: dirty-bitmap: Allow control of bitmap persistence

2021-02-10 Thread Peter Krempa
A bitmap's persistence on the source is transported over the migration
stream and the destination mirrors it. In some cases the destination
might want to persist bitmaps which are not persistent on the source
(e.g. the result of merging bitmaps from a number of layers on the
source when migrating into a squashed image), but currently it would
need to create another set of persistent bitmaps and merge them.

This patch adds a 'transform' property to the alias map which allows
overriding the persistence of migrated bitmaps, both on the source and
destination sides.

Signed-off-by: Peter Krempa 
---

v2:
 - grammar fixes (Eric)
 - added 'transform' object to group other possible transformations (Vladimir)
 - transformation can also be used on source (Vladimir)
 - put bmap_inner directly into DBMLoadState for deduplication  (Vladimir)

 migration/block-dirty-bitmap.c | 38 +++---
 qapi/migration.json| 20 +-
 2 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 0169f672df..a05bf74073 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -138,6 +138,13 @@ typedef struct LoadBitmapState {
 bool enabled;
 } LoadBitmapState;

+typedef struct AliasMapInnerBitmap {
+char *string;
+
+/* 'transform' properties borrowed from QAPI */
+BitmapMigrationBitmapAliasTransform *transform;
+} AliasMapInnerBitmap;
+
 /* State of the dirty bitmap migration (DBM) during load process */
 typedef struct DBMLoadState {
 uint32_t flags;
@@ -148,6 +155,7 @@ typedef struct DBMLoadState {
 BdrvDirtyBitmap *bitmap;

 bool before_vm_start_handled; /* set in dirty_bitmap_mig_before_vm_start */
+AliasMapInnerBitmap *bmap_inner;

 /*
  * cancelled
@@ -169,10 +177,6 @@ typedef struct DBMState {

 static DBMState dbm_state;

-typedef struct AliasMapInnerBitmap {
-char *string;
-} AliasMapInnerBitmap;
-
 static void free_alias_map_inner_bitmap(void *amin_ptr)
 {
 AliasMapInnerBitmap *amin = amin_ptr;
@@ -330,6 +334,7 @@ static GHashTable *construct_alias_map(const 
BitmapMigrationNodeAliasList *bbm,

 bmap_inner = g_new0(AliasMapInnerBitmap, 1);
 bmap_inner->string = g_strdup(bmap_map_to);
+bmap_inner->transform = bmba->transform;

 g_hash_table_insert(bitmaps_map,
 g_strdup(bmap_map_from), bmap_inner);
@@ -547,6 +552,7 @@ static int add_bitmaps_to_list(DBMSaveState *s, 
BlockDriverState *bs,
 }

 FOR_EACH_DIRTY_BITMAP(bs, bitmap) {
+BitmapMigrationBitmapAliasTransform *bitmap_transform = NULL;
 bitmap_name = bdrv_dirty_bitmap_name(bitmap);
 if (!bitmap_name) {
 continue;
@@ -567,6 +573,7 @@ static int add_bitmaps_to_list(DBMSaveState *s, 
BlockDriverState *bs,
 }

 bitmap_alias = bmap_inner->string;
+bitmap_transform = bmap_inner->transform;
 } else {
 if (strlen(bitmap_name) > UINT8_MAX) {
 error_report("Cannot migrate bitmap '%s' on node '%s': "
@@ -592,8 +599,15 @@ static int add_bitmaps_to_list(DBMSaveState *s, 
BlockDriverState *bs,
 if (bdrv_dirty_bitmap_enabled(bitmap)) {
 dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_ENABLED;
 }
-if (bdrv_dirty_bitmap_get_persistence(bitmap)) {
-dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT;
+if (bitmap_transform &&
+bitmap_transform->has_persistent) {
+if (bitmap_transform->persistent) {
+dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT;
+}
+} else {
+if (bdrv_dirty_bitmap_get_persistence(bitmap)) {
+dbms->flags |= DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT;
+}
 }

 QSIMPLEQ_INSERT_TAIL(&s->dbms_list, dbms, entry);
@@ -801,6 +815,7 @@ static int dirty_bitmap_load_start(QEMUFile *f, 
DBMLoadState *s)
 uint32_t granularity = qemu_get_be32(f);
 uint8_t flags = qemu_get_byte(f);
 LoadBitmapState *b;
+bool persistent;

 if (s->cancelled) {
 return 0;
@@ -825,7 +840,15 @@ static int dirty_bitmap_load_start(QEMUFile *f, 
DBMLoadState *s)
 return -EINVAL;
 }

-if (flags & DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT) {
+if (s->bmap_inner &&
+s->bmap_inner->transform &&
+s->bmap_inner->transform->has_persistent) {
+persistent = s->bmap_inner->transform->persistent;
+} else {
+persistent = flags & DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT;
+}
+
+if (persistent) {
 bdrv_dirty_bitmap_set_persistence(s->bitmap, true);
 }

@@ -1109,6 +1132,7 @@ static int dirty_bitmap_load_header(QEMUFile *f, 
DBMLoadState *s,
 }

 bitmap_name = bmap_inner->string;
+s->bmap_inner = bmap_inner;
 }

 if (!s->cancelled) {
di
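
The persistence decision the patch makes in dirty_bitmap_load_start()
condenses to the following (a sketch using the names from the patch;
the transform, when present, overrides the migration stream's start
flags):

    static bool bitmap_should_be_persistent(const AliasMapInnerBitmap *inner,
                                            uint8_t flags)
    {
        if (inner && inner->transform && inner->transform->has_persistent) {
            /* explicit override from the alias map */
            return inner->transform->persistent;
        }

        /* otherwise mirror what the source sent */
        return flags & DIRTY_BITMAP_MIG_START_FLAG_PERSISTENT;
    }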

Re: [PATCH v3 1/2] qemu-nbd: Use SOMAXCONN for socket listen() backlog

2021-02-10 Thread Nir Soffer
On Tue, Feb 9, 2021 at 5:28 PM Eric Blake  wrote:
>
> Our default of a backlog of 1 connection is rather puny; it gets in
> the way when we are explicitly allowing multiple clients (such as
> qemu-nbd -e N [--shared], or nbd-server-start with its default
> "max-connections":0 for unlimited), but is even a problem when we
> stick to qemu-nbd's default of only 1 active client but use -t
> [--persistent] where a second client can start using the server once
> the first finishes.  While the effects are less noticeable on TCP
> sockets (since the client can poll() to learn when the server is ready
> again), it is definitely observable on Unix sockets, where on Unix, a
> client will fail with EAGAIN and no recourse but to sleep an arbitrary
> amount of time before retrying if the server backlog is already full.
>
> Since QMP nbd-server-start is always persistent, it now always
> requests a backlog of SOMAXCONN;

This makes sense since we don't limit the number of connections.

> meanwhile, qemu-nbd will request
> SOMAXCONN if persistent, otherwise its backlog should be based on the
> expected number of clients.

If --persistent is used without --shared, we allow only one concurrent
connection, so it's not clear why we need the maximum backlog.

I think that keeping --persistent and --shared separate would be easier
to understand and use; the backlog would always be based on the shared value.

> See https://bugzilla.redhat.com/1925045 for a demonstration of where
> our low backlog prevents libnbd from connecting as many parallel
> clients as it wants.
>
> Reported-by: Richard W.M. Jones 
> Signed-off-by: Eric Blake 
> CC: qemu-sta...@nongnu.org
> ---
>  blockdev-nbd.c |  7 ++-
>  qemu-nbd.c | 10 +-
>  2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/blockdev-nbd.c b/blockdev-nbd.c
> index d8443d235b73..b264620b98d8 100644
> --- a/blockdev-nbd.c
> +++ b/blockdev-nbd.c
> @@ -134,7 +134,12 @@ void nbd_server_start(SocketAddress *addr, const char 
> *tls_creds,
>  qio_net_listener_set_name(nbd_server->listener,
>"nbd-listener");
>
> -if (qio_net_listener_open_sync(nbd_server->listener, addr, 1, errp) < 0) 
> {
> +/*
> + * Because this server is persistent, a backlog of SOMAXCONN is
> + * better than trying to size it to max_connections.

The comment is not clear. Previously we used a hard-coded value (1), but we
do support more than one connection. Maybe it is better to explain that we
don't know how many connections are needed?

> + */
> +if (qio_net_listener_open_sync(nbd_server->listener, addr, SOMAXCONN,
> +   errp) < 0) {
>  goto error;
>  }
>
> diff --git a/qemu-nbd.c b/qemu-nbd.c
> index 608c63e82a25..1a340ea4858d 100644
> --- a/qemu-nbd.c
> +++ b/qemu-nbd.c
> @@ -964,8 +964,16 @@ int main(int argc, char **argv)
>
>  server = qio_net_listener_new();
>  if (socket_activation == 0) {
> +int backlog;
> +
> +if (persistent) {
> +backlog = SOMAXCONN;

This increases the backlog, but since the default shared value is still 1,
we will not accept more than 1 connection, so it's not clear why SOMAXCONN
is better.

> +} else {
> +backlog = MIN(shared, SOMAXCONN);
> +}
>  saddr = nbd_build_socket_address(sockpath, bindto, port);
> -if (qio_net_listener_open_sync(server, saddr, 1, &local_err) < 0) {
> +if (qio_net_listener_open_sync(server, saddr, backlog,
> +   &local_err) < 0) {
>  object_unref(OBJECT(server));
>  error_report_err(local_err);
>  exit(EXIT_FAILURE);
> --
> 2.30.0
>
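
For reference, what the backlog argument buys (a minimal standalone
sketch, not code from the patch): the kernel accepts up to 'backlog'
pending connections on the server's behalf, and on Unix sockets further
connect() attempts fail with EAGAIN, which is the behaviour the commit
message describes.

    #include <stdbool.h>
    #include <sys/socket.h>

    #ifndef MIN
    #define MIN(a, b) ((a) < (b) ? (a) : (b))
    #endif

    static int nbd_listen(int fd, int expected_clients, bool persistent)
    {
        /* a persistent server outlives any estimate of its client
         * count, so use the system maximum */
        int backlog = persistent ? SOMAXCONN
                                 : MIN(expected_clients, SOMAXCONN);

        return listen(fd, backlog);
    }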




Re: [RFC PATCH v2 2/4] iotests: Update 245 to support replacing files with x-blockdev-reopen

2021-02-10 Thread Kevin Wolf
Am 08.02.2021 um 19:44 hat Alberto Garcia geschrieben:
> Signed-off-by: Alberto Garcia 

> +def test_insert_throttle_filter(self):
> +hd0_opts = hd_opts(0)
> +result = self.vm.qmp('blockdev-add', conv_keys = False, **hd0_opts)
> +self.assert_qmp(result, 'return', {})
> +
> +opts = { 'qom-type': 'throttle-group', 'id': 'group0',
> + 'props': { 'limits': { 'iops-total': 1000 } } }

Please don't add new users of 'props', it's deprecated. Instead, specify
'limits' on the top level.

Kevin




Re: [PATCH 2/7] block/qcow2: introduce cache for compressed writes

2021-02-10 Thread Max Reitz

On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:

Compressed writes and O_DIRECT are not friends: they work too slowly,
because compressed writes issue many small writes that are not aligned
to 512 bytes.

Let's introduce an internal cache, so that compressed writes may work
well when O_DIRECT is on.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/qcow2.h|  29 +
  block/qcow2-compressed-write-cache.c | 770 +++
  block/meson.build|   1 +
  3 files changed, 800 insertions(+)
  create mode 100644 block/qcow2-compressed-write-cache.c

diff --git a/block/qcow2.h b/block/qcow2.h
index 0678073b74..fbdedf89fa 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -322,6 +322,8 @@ typedef struct Qcow2BitmapHeaderExt {
  uint64_t bitmap_directory_offset;
  } QEMU_PACKED Qcow2BitmapHeaderExt;
  
+typedef struct Qcow2CompressedWriteCache Qcow2CompressedWriteCache;

+
  #define QCOW2_MAX_THREADS 4
  
  typedef struct BDRVQcow2State {

@@ -1010,4 +1012,31 @@ int coroutine_fn
  qcow2_co_decrypt(BlockDriverState *bs, uint64_t host_offset,
   uint64_t guest_offset, void *buf, size_t len);
  
+Qcow2CompressedWriteCache *qcow2_compressed_cache_new(BdrvChild *data_file,

+  int64_t cluster_size,
+  int64_t cache_size);
+void qcow2_compressed_cache_free(Qcow2CompressedWriteCache *s);
+int coroutine_fn
+qcow2_compressed_cache_co_read(Qcow2CompressedWriteCache *s, int64_t offset,
+   int64_t bytes, void *buf);
+int coroutine_fn
+qcow2_compressed_cache_co_write(Qcow2CompressedWriteCache *s, int64_t offset,
+int64_t bytes, void *buf);
+void coroutine_fn
+qcow2_compressed_cache_co_set_cluster_end(Qcow2CompressedWriteCache *s,
+  int64_t cluster_data_end);
+int coroutine_fn
+qcow2_compressed_cache_co_flush(Qcow2CompressedWriteCache *s);
+int qcow2_compressed_cache_flush(BlockDriverState *bs,
+ Qcow2CompressedWriteCache *state);
+int coroutine_fn
+qcow2_compressed_cache_co_stop_flush(Qcow2CompressedWriteCache *s);
+int qcow2_compressed_cache_stop_flush(BlockDriverState *bs,
+  Qcow2CompressedWriteCache *s);
+void qcow2_compressed_cache_set_size(Qcow2CompressedWriteCache *s,
+ int64_t size);
+void coroutine_fn
+qcow2_compressed_cache_co_discard(Qcow2CompressedWriteCache *s,
+  int64_t cluster_offset);
+


It would be nice if these functions had their interface documented 
somewhere.



  #endif
diff --git a/block/qcow2-compressed-write-cache.c 
b/block/qcow2-compressed-write-cache.c
new file mode 100644
index 00..7bb92cb550
--- /dev/null
+++ b/block/qcow2-compressed-write-cache.c
@@ -0,0 +1,770 @@
+/*
+ * Write cache for qcow2 compressed writes
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu/osdep.h"
+
+#include "block/block_int.h"
+#include "block/block-gen.h"
+#include "qemu/coroutine.h"
+#include "qapi/qapi-events-block-core.h"
+#include "qcow2.h"
+
+typedef struct CacheExtent {
+int64_t offset;
+int64_t bytes;
+void *buf;
+QLIST_ENTRY(CacheExtent) next;
+} CacheExtent;
+
+typedef struct CacheCluster {


It isn’t immediately clear what these two structures mean by just their 
name, because “extent” has no meaning in the context of qcow2.


I understand CacheExtent to basically be a compressed cluster, and 
CacheCluster to be a host cluster.  Perhaps their names should reflect that.


(OTOH, the Cache* prefix seems unnecessary to me, because these are just 
local structs.)



+int64_t cluster_offset;
+int64_t n_bytes; /* sum of extents lengths */
+
+/*
+ * data_end: cluster i

Re: [RFC PATCH v2 0/4] Allow changing bs->file on reopen

2021-02-10 Thread Kevin Wolf
Am 08.02.2021 um 19:44 hat Alberto Garcia geschrieben:
> Hi,
> 
> this series allows changing bs->file using x-blockdev-reopen. Read
> here for more details:
> 
>https://lists.gnu.org/archive/html/qemu-block/2021-01/msg00437.html
> 
> Version 2 of the series introduces a very significant change:
> x-blockdev-reopen now receives a list of BlockdevOptions instead of
> just one, so it is possible to reopen multiple block devices using a
> single transaction.

Adding Peter to Cc for this one.

> This is still an RFC, I haven't updated the documentation and the
> structure of the patches will probably change in the future, but I'd
> like to know your opinion about the approach.

I like the direction where this is going.

You have a test case for adding a throttling filter. Can we also remove
it again or is there still a problem with that? I seem to remember that
that was a bit trickier, though I'm not sure what it was. Was it that we
can't have the throttle node without a file, so it would possibly still
have permission conflicts?

Kevin




Re: [PATCH 1/2] file-posix: Use OFD lock only if the filesystem supports the lock

2021-02-10 Thread Kevin Wolf
Hi Masa,

Am 10.02.2021 um 17:43 hat Masayoshi Mizuma geschrieben:
> Hi Kevin,
> 
> The filesystem team found a locking issue in the filesystem.
> Your comments were very helpful! I really appriciate it.
> 
> Thanks,
> Masa

I'm glad that I could help you to find the root cause. Thanks for
reporting back!

Kevin




Re: [PATCH 3/7] block/qcow2: use compressed write cache

2021-02-10 Thread Max Reitz

On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:

Introduce a new option, compressed-cache-size, defaulting to 64
clusters (so as not to be less than the default of 64 max-workers for
the backup job).

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  qapi/block-core.json   |  8 +++-
  block/qcow2.h  |  4 ++
  block/qcow2-refcount.c | 13 +++
  block/qcow2.c  | 87 --
  4 files changed, 108 insertions(+), 4 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 9f555d5c1d..e0be6657f3 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3202,6 +3202,11 @@
  # an image, the data file name is loaded from the image
  # file. (since 4.0)
  #
+# @compressed-cache-size: The maximum size of the compressed write cache
+# in bytes. If positive, it must not be less than
+# the cluster size. 0 disables the feature.
+# Default is 64 * cluster_size. (since 6.0)


Do we need this, really?  If you don’t use compression, the cache won’t 
use any memory, right?  Do you plan on using this option?


I’d just set it to a sane default.

OTOH, “a sane default” poses two questions, namely whether 64 * 
cluster_size is reasonable – with subclusters, the cluster size may be 
rather high, so 64 * cluster_size may well be like 128 MB.  Are 64 
clusters really necessary for a reasonable performance?


Second, I think I could live with a rather high default if clusters are 
flushed as soon as they are full.  OTOH, as I briefly touched on, in 
practice, I suppose compressed images are just written to constantly, so 
even if clusters are flushed as soon as they are full, the cache will 
still remain full all the time.



Different topic: Why is the cache disableable?  I thought there are no 
downsides?


(Not being able to disable it would make the code simpler, hence me asking.)

Max
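
For context, the constraint behind the whole series (a sketch; the 512
figure comes from the 2/7 commit message, and real O_DIRECT alignment
requirements are device-dependent): every direct write must be aligned
in both offset and length, so compressed clusters written piecemeal
degenerate into read-modify-write cycles unless batched first.

    static bool odirect_write_ok(uint64_t offset, uint64_t bytes,
                                 uint64_t align /* e.g. 512 */)
    {
        /* an unaligned offset or length forces a read-modify-write,
         * which is what the compressed write cache avoids */
        return (offset % align) == 0 && (bytes % align) == 0;
    }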




Re: [PULL v4 00/27] Block patches

2021-02-10 Thread Peter Maydell
On Wed, 10 Feb 2021 at 09:26, Stefan Hajnoczi  wrote:
>
> The following changes since commit 1214d55d1c41fbab3a9973a05085b8760647e411:
>
>   Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' 
> into staging (2021-02-09 13:24:37 +)
>
> are available in the Git repository at:
>
>   https://gitlab.com/stefanha/qemu.git tags/block-pull-request
>
> for you to fetch changes up to eb847c42296497978942f738cd41dc29a35a49b2:
>
>   docs: fix Parallels Image "dirty bitmap" section (2021-02-10 09:23:28 +)
>
> 
> Pull request
>
> v4:
>  * Add PCI_EXPRESS Kconfig dependency to fix s390x in "multi-process: setup 
> PCI
>host bridge for remote device" [Philippe and Thomas]
>
> 
>


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/6.0
for any user-visible changes.

-- PMM



[RFC PATCH 1/3] hw/block/nvme: set NVME_DNR in a single place

2021-02-10 Thread Minwoo Im
Set NVME_DNR in the CQ entry status field right before writing the CQ
entry, in nvme_post_cqes().  We used to set NVME_DNR in the CQ entry
status for all error cases.  This is a preparatory patch for the
command retry feature.

Signed-off-by: Minwoo Im 
---
 hw/block/nvme.c | 192 
 1 file changed, 97 insertions(+), 95 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 93345bf3c1fc..816e0e8e5205 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -270,12 +270,12 @@ static int nvme_aor_check(NvmeNamespace *ns, uint32_t 
act, uint32_t opn)
 if (ns->params.max_active_zones != 0 &&
 ns->nr_active_zones + act > ns->params.max_active_zones) {
 trace_pci_nvme_err_insuff_active_res(ns->params.max_active_zones);
-return NVME_ZONE_TOO_MANY_ACTIVE | NVME_DNR;
+return NVME_ZONE_TOO_MANY_ACTIVE;
 }
 if (ns->params.max_open_zones != 0 &&
 ns->nr_open_zones + opn > ns->params.max_open_zones) {
 trace_pci_nvme_err_insuff_open_res(ns->params.max_open_zones);
-return NVME_ZONE_TOO_MANY_OPEN | NVME_DNR;
+return NVME_ZONE_TOO_MANY_OPEN;
 }
 
 return NVME_SUCCESS;
@@ -492,7 +492,7 @@ static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, 
QEMUIOVector *iov,
 
 if (cmb || pmr) {
 if (qsg && qsg->sg) {
-return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+return NVME_INVALID_USE_OF_CMB;
 }
 
 assert(iov);
@@ -509,7 +509,7 @@ static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, 
QEMUIOVector *iov,
 }
 
 if (iov && iov->iov) {
-return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+return NVME_INVALID_USE_OF_CMB;
 }
 
 assert(qsg);
@@ -568,7 +568,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, uint64_t prp1, 
uint64_t prp2,
 if (i == n->max_prp_ents - 1 && len > n->page_size) {
 if (unlikely(prp_ent & (n->page_size - 1))) {
 trace_pci_nvme_err_invalid_prplist_ent(prp_ent);
-return NVME_INVALID_PRP_OFFSET | NVME_DNR;
+return NVME_INVALID_PRP_OFFSET;
 }
 
 i = 0;
@@ -585,7 +585,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, uint64_t prp1, 
uint64_t prp2,
 
 if (unlikely(prp_ent & (n->page_size - 1))) {
 trace_pci_nvme_err_invalid_prplist_ent(prp_ent);
-return NVME_INVALID_PRP_OFFSET | NVME_DNR;
+return NVME_INVALID_PRP_OFFSET;
 }
 
 trans_len = MIN(len, n->page_size);
@@ -600,7 +600,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, uint64_t prp1, 
uint64_t prp2,
 } else {
 if (unlikely(prp2 & (n->page_size - 1))) {
 trace_pci_nvme_err_invalid_prp2_align(prp2);
-return NVME_INVALID_PRP_OFFSET | NVME_DNR;
+return NVME_INVALID_PRP_OFFSET;
 }
 status = nvme_map_addr(n, qsg, iov, prp2, len);
 if (status) {
@@ -637,9 +637,9 @@ static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList 
*qsg,
 break;
 case NVME_SGL_DESCR_TYPE_SEGMENT:
 case NVME_SGL_DESCR_TYPE_LAST_SEGMENT:
-return NVME_INVALID_NUM_SGL_DESCRS | NVME_DNR;
+return NVME_INVALID_NUM_SGL_DESCRS;
 default:
-return NVME_SGL_DESCR_TYPE_INVALID | NVME_DNR;
+return NVME_SGL_DESCR_TYPE_INVALID;
 }
 
 dlen = le32_to_cpu(segment[i].len);
@@ -660,7 +660,7 @@ static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList 
*qsg,
 }
 
 trace_pci_nvme_err_invalid_sgl_excess_length(nvme_cid(req));
-return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
+return NVME_DATA_SGL_LEN_INVALID;
 }
 
 trans_len = MIN(*len, dlen);
@@ -672,7 +672,7 @@ static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList 
*qsg,
 addr = le64_to_cpu(segment[i].addr);
 
 if (UINT64_MAX - addr < dlen) {
-return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
+return NVME_DATA_SGL_LEN_INVALID;
 }
 
 status = nvme_map_addr(n, qsg, iov, addr, trans_len);
@@ -731,7 +731,7 @@ static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg, 
QEMUIOVector *iov,
 case NVME_SGL_DESCR_TYPE_LAST_SEGMENT:
 break;
 default:
-return NVME_INVALID_SGL_SEG_DESCR | NVME_DNR;
+return NVME_INVALID_SGL_SEG_DESCR;
 }
 
 seg_len = le32_to_cpu(sgld->len);
@@ -739,11 +739,11 @@ static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList 
*qsg, QEMUIOVector *iov,
 /* check the length of the (Last) Segment descriptor */
 if ((!seg_len || seg_len & 0xf) &&
 (NVME_SGL_TYPE(sgld->type) != NVME_SGL_DESCR_TYPE_BIT_BUCKET)) {
-return NVME_INVALID_SGL_SEG_DESCR | NVME_DNR;
+return NVME_INVALID_SGL_SEG_DESCR;

[RFC PATCH 0/3] support command retry

2021-02-10 Thread Minwoo Im
Hello,

This series is an RFC about supporting the command retry feature in the
NVMe device model.  The motivation is that in kernel development and
testing, the retry scheme has so far not been coverable with the QEMU
NVMe device model.  If we are able to control the retry scheme from the
device side, it would be nice for kernel developers to test against.

We have been putting NVME_DNR in the CQ entry status field for
all error cases.  This series adds control over command retry
based on the newly added 'cmd-retry-delay' parameter.  If it's given
a positive value, Command Retry Delay Time 1 (CRDT1) in the Identify
Controller data structure will be set in 100 msec units.
Accordingly, the host will issue a Set Features command for the Host
Behavior feature (0x16) to enable the Advanced Command Retry
Enable (ACRE) feature to support command retry with the defined delay.
If the 'cmd-retry-delay' param is given as 0, then command failures
will be retried immediately without delay.
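
For reference, a minimal sketch of the CRDT1 derivation this implies
(the DIV_ROUND_UP rounding is an assumption; the series may simply
divide):

    /* sketch only: derive CRDT1 (100 msec units) from the msec parameter */
    if (n->params.cmd_retry_delay > 0) {
        id->crdt1 = cpu_to_le16(DIV_ROUND_UP(n->params.cmd_retry_delay, 100));
    }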

This series considers only the Command Interrupted status code first,
which is mainly what the ACRE feature addition is about.  The
nvme_should_retry() helper decides whether a command should be retried
or not.
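
The retry outcome sits next to DNR in the CQ entry Status field; a
rough sketch of the bit positions involved (values follow the NVMe 1.4
status field layout, the NVME_CRD_CRDT1 name follows this series):

    /* req->status layout (before the phase-bit shift into the CQE):
     *   SC = bits 7:0, SCT = bits 10:8, CRD = bits 12:11,
     *   M = bit 13, DNR = bit 14 */
    enum {
        NVME_CRD_CRDT1 = 1 << 11,   /* CRD = 01b: delay for CRDT1 */
        NVME_MORE      = 1 << 13,
        NVME_DNR       = 1 << 14,   /* Do Not Retry */
    };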

But we don't have any use-cases specified for the Command Interrupted
status code in the device model.  So patch [3/3] proposes an
'nvme_inject_state' HMP command that lets users put the controller
device into a pre-defined state by injecting it via the QEMU monitor.

Usage:

  # Configure the nvme0 device so failed commands are retried every
  # 1 sec (1000 msec)
  -device nvme,id=nvme0,cmd-retry-delay=1000,...

  (qemu) nvme_inject_state nvme0 cmd-interrupted
  -device nvme,id=nvme0: state cmd-interrupted injected
  (qemu)

  # From now on, the controller will complete all commands with the
  # Command Interrupted status code, and the host will retry them
  # based on the delay.

Thanks,

Minwoo Im (3):
  hw/block/nvme: set NVME_DNR in a single place
  hw/block/nvme: support command retry delay
  hw/block/nvme: add nvme_inject_state HMP command

 hmp-commands.hx   |  13 ++
 hw/block/nvme.c   | 304 +-
 hw/block/nvme.h   |  10 ++
 include/block/nvme.h  |  13 +-
 include/monitor/hmp.h |   1 +
 5 files changed, 244 insertions(+), 97 deletions(-)

-- 
2.17.1




[RFC PATCH 2/3] hw/block/nvme: support command retry delay

2021-02-10 Thread Minwoo Im
Set CRDT1 (Command Retry Delay Time 1) in the Identify Controller data
structure, in units of 100 milliseconds, based on the value of the
newly added 'cmd-retry-delay' parameter.  If cmd-retry-delay=1000,
CRDT1 will be set to 10.  This patch only considers CRDT1, without
CRDT2 and 3, for simplicity.

This patch also introduces set/get feature command handlers for the
Host Behavior feature (16h).  Through this feature, ACRE (Advanced
Command Retry Enable) will be set by the host based on the Identify
Controller data structure, especially on the CRDTs.

If 'cmd-retry-delay' is not given, the default value will be -1, which
means CRDT will not be configured at all and ACRE will not be
supported.  In this case, we just set NVME_DNR in the error CQ entry
just like we used to.  If it's given a positive value, then ACRE will
be supported by the device.
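
For reference, the Host Behavior Support data structure is 512 bytes
with ACRE in byte 0; a sketch of the structure this implies (the
reserved-field name is an assumption):

    typedef struct NvmeFeatureHostBehavior {
        uint8_t acre;        /* Advanced Command Retry Enable */
        uint8_t rsvd1[511];  /* bytes 511:1 reserved in NVMe 1.4 */
    } NvmeFeatureHostBehavior;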

Signed-off-by: Minwoo Im 
---
 hw/block/nvme.c  | 65 ++--
 hw/block/nvme.h  |  2 ++
 include/block/nvme.h | 13 -
 3 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 816e0e8e5205..6d3c554a0e99 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -23,7 +23,7 @@
  *  max_ioqpairs=, \
  *  aerl=, aer_max_queued=, \
  *  mdts=,zoned.append_size_limit=, \
- *  subsys= \
+ *  subsys=,cmd-retry-delay= \
  *  -device nvme-ns,drive=,bus=,nsid=,\
  *  zoned=, \
  *  subsys=
@@ -71,6 +71,14 @@
  *   data size being in effect. By setting this property to 0, users can make
  *   ZASL to be equal to MDTS. This property only affects zoned namespaces.
  *
+ * - `cmd-retry-delay`
+ *   Command Retry Delay value in units of milliseconds.  This value is
+ *   reported via CRDT1 (Command Retry Delay Time 1) in the Identify Controller
+ *   data structure, in units of 100 milliseconds.  If this is not given, the
+ *   DNR (Do Not Retry) bit is set in the Status field of the CQ entry.  If it
+ *   is given as 0, CRD (Command Retry Delay) is set to 0, i.e. retry without
+ *   delay.  Otherwise, it is set to 1 to delay for the CRDT1 value.
+ *
  * nvme namespace device parameters
  * 
  * - `subsys`
@@ -154,6 +162,7 @@ static const bool nvme_feature_support[NVME_FID_MAX] = {
 [NVME_WRITE_ATOMICITY]  = true,
 [NVME_ASYNCHRONOUS_EVENT_CONF]  = true,
 [NVME_TIMESTAMP]= true,
+[NVME_HOST_BEHAVIOR_SUPPORT]= true,
 };
 
 static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
@@ -163,6 +172,7 @@ static const uint32_t nvme_feature_cap[NVME_FID_MAX] = {
 [NVME_NUMBER_OF_QUEUES] = NVME_FEAT_CAP_CHANGE,
 [NVME_ASYNCHRONOUS_EVENT_CONF]  = NVME_FEAT_CAP_CHANGE,
 [NVME_TIMESTAMP]= NVME_FEAT_CAP_CHANGE,
+[NVME_HOST_BEHAVIOR_SUPPORT]= NVME_FEAT_CAP_CHANGE,
 };
 
 static const uint32_t nvme_cse_acs[256] = {
@@ -904,6 +914,16 @@ static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, 
uint32_t len,
 return status;
 }
 
+static inline bool nvme_should_retry(NvmeRequest *req)
+{
+switch (req->status) {
+case NVME_COMMAND_INTERRUPTED:
+return true;
+default:
+return false;
+}
+}
+
 static void nvme_post_cqes(void *opaque)
 {
 NvmeCQueue *cq = opaque;
@@ -947,7 +967,13 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 assert(cq->cqid == req->sq->cqid);
 
 if (req->status != NVME_SUCCESS) {
-req->status |= NVME_DNR;
+if (cq->ctrl->features.acre && nvme_should_retry(req)) {
+if (cq->ctrl->params.cmd_retry_delay > 0) {
+req->status |= NVME_CRD_CRDT1;
+}
+} else {
+req->status |= NVME_DNR;
+}
 }
 
 trace_pci_nvme_enqueue_req_completion(nvme_cid(req), cq->cqid,
@@ -3401,6 +3427,16 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, 
NvmeRequest *req)
 DMA_DIRECTION_FROM_DEVICE, req);
 }
 
+static uint16_t nvme_get_feature_host_behavior(NvmeCtrl *n, NvmeRequest *req)
+{
+NvmeFeatureHostBehavior data = {};
+
+data.acre = n->features.acre;
+
+return nvme_dma(n, (uint8_t *)&data, sizeof(data),
+DMA_DIRECTION_FROM_DEVICE, req);
+}
+
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
 {
 NvmeCmd *cmd = &req->cmd;
@@ -3506,6 +3542,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest 
*req)
 goto out;
 case NVME_TIMESTAMP:
 return nvme_get_feature_timestamp(n, req);
+case NVME_HOST_BEHAVIOR_SUPPORT:
+return nvme_get_feature_host_behavior(n, req);
 default:
 break;
 }
@@ -3569,6 +3607,22 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, 
NvmeRequest *req)
 return NVME_SUCCESS;
 }
 
+static uint16_t nvme_set_feature_host_behavior(NvmeCtrl *n, NvmeRequest *req)
+{
+NvmeFeatureHostBehavior data;
+int ret;
+
+

[RFC PATCH 3/3] hw/block/nvme: add nvme_inject_state HMP command

2021-02-10 Thread Minwoo Im
The nvme_inject_state command puts the controller into a given state.
Via the Human Monitor Interface (HMP), users can set the controller to
one of the following states:

normal:          Normal state (no injection)
cmd-interrupted: Commands will be interrupted internally

This patch is just a start at driving the QEMU NVMe device model
dynamically from the HMP.  If the "cmd-interrupted" state is given,
then the controller will return all CQ entries with the Command
Interrupted status code.

Usage:
-device nvme,id=nvme0,

(qemu) nvme_inject_state nvme0 cmd-interrupted



(qemu) nvme_inject_state nvme0 normal

This feature is required to test Linux kernel NVMe driver for the
command retry feature.

Signed-off-by: Minwoo Im 
---
 hmp-commands.hx   | 13 
 hw/block/nvme.c   | 49 +++
 hw/block/nvme.h   |  8 +++
 include/monitor/hmp.h |  1 +
 4 files changed, 71 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index d4001f9c5dc6..ef288c567b46 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1307,6 +1307,19 @@ SRST
   Inject PCIe AER error
 ERST
 
+{
+.name   = "nvme_inject_state",
+.args_type  = "id:s,state:s",
+.params = "id [normal|cmd-interrupted]",
+.help   = "inject controller/namespace state",
+.cmd= hmp_nvme_inject_state,
+},
+
+SRST
+``nvme_inject_state``
+  Inject NVMe controller/namespace state
+ERST
+
 {
 .name   = "netdev_add",
 .args_type  = "netdev:O",
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 6d3c554a0e99..42cf5bd113e6 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -123,6 +123,7 @@
 #include "sysemu/sysemu.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
+#include "qapi/qmp/qdict.h"
 #include "sysemu/hostmem.h"
 #include "sysemu/block-backend.h"
 #include "exec/memory.h"
@@ -132,6 +133,7 @@
 #include "trace.h"
 #include "nvme.h"
 #include "nvme-ns.h"
+#include "monitor/monitor.h"
 
 #define NVME_MAX_IOQPAIRS 0xffff
 #define NVME_DB_SIZE  4
@@ -966,6 +968,14 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
NvmeRequest *req)
 {
 assert(cq->cqid == req->sq->cqid);
 
+/*
+ * Override the request status field if a controller state has been
+ * injected via the monitor.
+ */
+if (cq->ctrl->state == NVME_STATE_CMD_INTERRUPTED) {
+req->status = NVME_COMMAND_INTERRUPTED;
+}
+
 if (req->status != NVME_SUCCESS) {
 if (cq->ctrl->features.acre && nvme_should_retry(req)) {
 if (cq->ctrl->params.cmd_retry_delay > 0) {
@@ -5025,4 +5035,43 @@ static void nvme_register_types(void)
 type_register_static(&nvme_bus_info);
 }
 
+static void nvme_inject_state(NvmeCtrl *n, NvmeState state)
+{
+n->state = state;
+}
+
+static const char *nvme_states[] = {
+[NVME_STATE_NORMAL] = "normal",
+[NVME_STATE_CMD_INTERRUPTED]= "cmd-interrupted",
+};
+
+void hmp_nvme_inject_state(Monitor *mon, const QDict *qdict)
+{
+const char *id = qdict_get_str(qdict, "id");
+const char *state = qdict_get_str(qdict, "state");
+PCIDevice *dev;
+NvmeCtrl *n;
+int ret, i;
+
+ret = pci_qdev_find_device(id, &dev);
+if (ret < 0) {
+monitor_printf(mon, "invalid device id %s\n", id);
+return;
+}
+
+n = NVME(dev);
+
+for (i = 0; i < ARRAY_SIZE(nvme_states); i++) {
+if (!strcmp(nvme_states[i], state)) {
+nvme_inject_state(n, i);
+monitor_printf(mon,
+   "-device nvme,id=%s: state %s injected\n",
+   id, state);
+return;
+}
+}
+
+monitor_printf(mon, "invalid state %s\n", state);
+}
+
 type_init(nvme_register_types)
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 37940b3ac2d2..1af1e0380d9b 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -128,6 +128,11 @@ typedef struct NvmeFeatureVal {
 uint8_t acre;
 } NvmeFeatureVal;
 
+typedef enum NvmeState {
+NVME_STATE_NORMAL,
+NVME_STATE_CMD_INTERRUPTED,
+} NvmeState;
+
 typedef struct NvmeCtrl {
 PCIDeviceparent_obj;
 MemoryRegion bar0;
@@ -185,6 +190,8 @@ typedef struct NvmeCtrl {
 NvmeCQueue  admin_cq;
 NvmeIdCtrl  id_ctrl;
 NvmeFeatureVal  features;
+
+NvmeState   state;
 } NvmeCtrl;
 
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
@@ -212,4 +219,5 @@ static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req)
 
 int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
 
+void hmp_nvme_inject_state(Monitor *mon, const QDict *qdict);
 #endif /* HW_NVME_H */
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index ed2913fd18e8..668384ea2e34 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -79,6 +79,7 @@ void hmp_migrate(Monitor *mon, const QDict *qdict);
 void hmp_device_add(Monitor *mon, const QDict *qdict);

Re: [RFC PATCH 1/3] hw/block/nvme: set NVME_DNR in a single place

2021-02-10 Thread Klaus Jensen
On Feb 11 04:52, Minwoo Im wrote:
> @@ -945,6 +945,11 @@ static void nvme_post_cqes(void *opaque)
>  static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>  {
>  assert(cq->cqid == req->sq->cqid);
> +
> +if (req->status != NVME_SUCCESS) {
> +req->status |= NVME_DNR;
> +}

There are status codes where we do not set the DNR bit (e.g. Data
Transfer Error, and that might be the only one actually).

Maybe a switch such that we do not explicitly set DNR for Data Transfer
Error (and any other errors we identify), but only if we set it earlier
in the stack.
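
Something like the following sketch, perhaps (NVME_DATA_TRAS_ERROR
being the existing Data Transfer Error status code; which other codes
to exempt is an open question):

    /* sketch: under the req->status != NVME_SUCCESS guard, set DNR for
     * everything except codes where a retry can actually help */
    switch (req->status) {
    case NVME_DATA_TRAS_ERROR:
        break;
    default:
        req->status |= NVME_DNR;
        break;
    }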




Re: [RFC PATCH 3/3] hw/block/nvme: add nvme_inject_state HMP command

2021-02-10 Thread Klaus Jensen
On Feb 11 04:52, Minwoo Im wrote:
> nvme_inject_state command is to give a controller state to be.
> Human Monitor Interface(HMP) supports users to make controller to a
> specified state of:
> 
>   normal: Normal state (no injection)
>   cmd-interrupted:Commands will be interrupted internally
> 
> This patch is just a start to give dynamic command from the HMP to the
> QEMU NVMe device model.  If "cmd-interrupted" state is given, then the
> controller will return all the CQ entries with Command Interrupts status
> code.
> 
> Usage:
>   -device nvme,id=nvme0,
> 
>   (qemu) nvme_inject_state nvme0 cmd-interrupted
> 
>   
> 
>   (qemu) nvme_inject_state nvme0 normal
> 
> This feature is required to test Linux kernel NVMe driver for the
> command retry feature.
> 

This is super cool and commands like this feel much nicer than the
qom-set approach that the SMART critical warning feature took.

But... looking at the existing commands I don't think we can "bloat" it
up with a device specific command like this, but I don't know the policy
around this.

If an HMP command is out, then we should be able to make do with the
qom-set approach just fine though.
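
For comparison, the qom-set variant would look roughly like this (the
property name is hypothetical):

    (qemu) qom-set /machine/peripheral/nvme0 inject-state cmd-interrupted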

> Signed-off-by: Minwoo Im 
> ---
>  hmp-commands.hx   | 13 
>  hw/block/nvme.c   | 49 +++
>  hw/block/nvme.h   |  8 +++
>  include/monitor/hmp.h |  1 +
>  4 files changed, 71 insertions(+)
> 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index d4001f9c5dc6..ef288c567b46 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -1307,6 +1307,19 @@ SRST
>Inject PCIe AER error
>  ERST
>  
> +{
> +.name   = "nvme_inject_state",
> +.args_type  = "id:s,state:s",
> +.params = "id [normal|cmd-interrupted]",
> +.help   = "inject controller/namespace state",
> +.cmd= hmp_nvme_inject_state,
> +},
> +
> +SRST
> +``nvme_inject_state``
> +  Inject NVMe controller/namespace state
> +ERST
> +
>  {
>  .name   = "netdev_add",
>  .args_type  = "netdev:O",
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 6d3c554a0e99..42cf5bd113e6 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -123,6 +123,7 @@
>  #include "sysemu/sysemu.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> +#include "qapi/qmp/qdict.h"
>  #include "sysemu/hostmem.h"
>  #include "sysemu/block-backend.h"
>  #include "exec/memory.h"
> @@ -132,6 +133,7 @@
>  #include "trace.h"
>  #include "nvme.h"
>  #include "nvme-ns.h"
> +#include "monitor/monitor.h"
>  
>  #define NVME_MAX_IOQPAIRS 0xffff
>  #define NVME_DB_SIZE  4
> @@ -966,6 +968,14 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, 
> NvmeRequest *req)
>  {
>  assert(cq->cqid == req->sq->cqid);
>  
> +/*
> + * Override request status field if controller state has been injected by
> + * the QMP.
> + */
> +if (cq->ctrl->state == NVME_STATE_CMD_INTERRUPTED) {
> +req->status = NVME_COMMAND_INTERRUPTED;
> +}
> +
>  if (req->status != NVME_SUCCESS) {
>  if (cq->ctrl->features.acre && nvme_should_retry(req)) {
>  if (cq->ctrl->params.cmd_retry_delay > 0) {
> @@ -5025,4 +5035,43 @@ static void nvme_register_types(void)
>  type_register_static(&nvme_bus_info);
>  }
>  
> +static void nvme_inject_state(NvmeCtrl *n, NvmeState state)
> +{
> +n->state = state;
> +}
> +
> +static const char *nvme_states[] = {
> +[NVME_STATE_NORMAL] = "normal",
> +[NVME_STATE_CMD_INTERRUPTED]= "cmd-interrupted",
> +};
> +
> +void hmp_nvme_inject_state(Monitor *mon, const QDict *qdict)
> +{
> +const char *id = qdict_get_str(qdict, "id");
> +const char *state = qdict_get_str(qdict, "state");
> +PCIDevice *dev;
> +NvmeCtrl *n;
> +int ret, i;
> +
> +ret = pci_qdev_find_device(id, &dev);
> +if (ret < 0) {
> +monitor_printf(mon, "invalid device id %s\n", id);
> +return;
> +}
> +
> +n = NVME(dev);
> +
> +for (i = 0; i < ARRAY_SIZE(nvme_states); i++) {
> +if (!strcmp(nvme_states[i], state)) {
> +nvme_inject_state(n, i);
> +monitor_printf(mon,
> +   "-device nvme,id=%s: state %s injected\n",
> +   id, state);
> +return;
> +}
> +}
> +
> +monitor_printf(mon, "invalid state %s\n", state);
> +}
> +
>  type_init(nvme_register_types)
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 37940b3ac2d2..1af1e0380d9b 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -128,6 +128,11 @@ typedef struct NvmeFeatureVal {
>  uint8_t acre;
>  } NvmeFeatureVal;
>  
> +typedef enum NvmeState {
> +NVME_STATE_NORMAL,
> +NVME_STATE_CMD_INTERRUPTED,
> +} NvmeState;
> +
>  typedef struct NvmeCtrl {
>  PCIDeviceparent_obj;
>  MemoryRegion bar0;
> @@ -185,6 +190,8 @@ typedef struct NvmeCtrl {

Re: [RFC PATCH 2/3] hw/block/nvme: support command retry delay

2021-02-10 Thread Klaus Jensen
On Feb 11 04:52, Minwoo Im wrote:
> Set CRDT1(Command Retry Delay Time 1) in the Identify controller data
> structure to milliseconds units of 100ms by the given value of
> 'cmd-retry-delay' parameter which is newly added.  If
> cmd-retry-delay=1000, it will be set CRDT1 to 10.  This patch only
> considers the CRDT1 without CRDT2 and 3 for the simplicity.
> 
> This patch also introduced set/get feature command handler for Host
> Behavior feature (16h).  In this feature, ACRE(Advanced Command Retry
> Enable) will be set by the host based on the Identify controller data
> structure, especially by CRDTs.
> 
> If 'cmd-retry-delay' is not given, the default value will be -1 which is
> CRDT will not be configured at all and ACRE will not be supported.  In
> this case, we just set NVME_DNR to the error CQ entry just like we used
> to.  If it's given to positive value, then ACRE will be supported by the
> device.
> 
> Signed-off-by: Minwoo Im 
> ---

LGTM.

Reviewed-by: Klaus Jensen 




Re: [PATCH v3] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Klaus Jensen
On Feb 10 19:22, Bin Meng wrote:
> From: Bin Meng 
> 
> Current QEMU HEAD nvme.c does not compile with the default GCC 5.4
> on a Ubuntu 16.04 host:
> 
>   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
>  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
>  ^
>   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
>  uint32_t result;
>   ^
> 
> Explicitly initialize the result to fix it.
> 
> Cc: qemu-triv...@nongnu.org
> Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
> Signed-off-by: Bin Meng 
> 

Reviewed-by: Klaus Jensen 

> ---
> 
> Changes in v3:
> - mention compiler and host information in the commit message
> 
> Changes in v2:
> - update function name in the commit message
> 
>  hw/block/nvme.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 5ce21b7..c122ac0 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -3228,6 +3228,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, 
> NvmeRequest *req)
>  result = ns->features.err_rec;
>  goto out;
>  case NVME_VOLATILE_WRITE_CACHE:
> +result = 0;
>  for (i = 1; i <= n->num_namespaces; i++) {
>  ns = nvme_ns(n, i);
>  if (!ns) {
> -- 
> 2.7.4
> 




Re: [PATCH] hw/block/nvme: improve invalid zasl value reporting

2021-02-10 Thread Klaus Jensen
On Feb  8 09:25, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> The Zone Append Size Limit (ZASL) must be at least 4096 bytes, so
> improve the user experience by adding an early parameter check in
> nvme_check_constraints.
> 
> When ZASL is still too small due to the host configuring the device for
> an even larger page size, convert the trace point in nvme_start_ctrl to
> an NVME_GUEST_ERR such that this is logged by QEMU instead of only
> traced.
> 
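
For reference, the early check can be as simple as the following
sketch (the zasl_bs parameter name is an assumption):

    /* reject a too-small ZASL up front instead of at controller start */
    if (n->params.zasl_bs && n->params.zasl_bs < 4096) {
        error_setg(errp,
                   "zoned.append_size_limit must be at least 4096 bytes");
        return;
    }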

Thanks for the review; applied to nvme-next!




Re: [PATCH v2] hw/block/nvme: use locally assigned QEMU IEEE OUI

2021-02-10 Thread Klaus Jensen
On Feb  9 12:10, Philippe Mathieu-Daudé wrote:
> On 2/9/21 11:45 AM, Klaus Jensen wrote:
> > From: Gollu Appalanaidu 
> > 
> > Commit 6eb7a071292a ("hw/block/nvme: change controller pci id") changed
> > the controller to use a Red Hat assigned PCI Device and Vendor ID, but
> > did not change the IEEE OUI away from the Intel IEEE OUI.
> > 
> > Fix that and use the locally assigned QEMU IEEE OUI instead if the
> > `use-intel-id` parameter is not explicitly set. Also reverse the Intel
> > IEEE OUI bytes.
> > 
> > Signed-off-by: Gollu Appalanaidu 
> > Signed-off-by: Klaus Jensen 
> > ---
> > 
> > v2: drop telemetry and add a check on the use_intel_id parameter.
> > 
> >  hw/block/nvme.c | 14 +++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index c2f0c88fbf39..870e9d8e1c17 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -4685,9 +4685,17 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
> > *pci_dev)
> >  id->cntlid = cpu_to_le16(n->cntlid);
> >  
> >  id->rab = 6;
> > -id->ieee[0] = 0x00;
> > -id->ieee[1] = 0x02;
> > -id->ieee[2] = 0xb3;
> > +
> > +if (n->params.use_intel_id) {
> > +id->ieee[0] = 0xb3;
> > +id->ieee[1] = 0x02;
> > +id->ieee[2] = 0x00;
> > +} else {
> > +id->ieee[0] = 0x00;
> > +id->ieee[1] = 0x54;
> > +id->ieee[2] = 0x52;
> > +}
> 
> Correct.
> 
> Reviewed-by: Philippe Mathieu-Daudé 
> 
> Ideally we should have definitions and use them here and in
> qemu_macaddr_default_if_unset() instead of this magic values.
> 

For MAC-addresses we seem to inject some more bytes.

And thanks! Applied to nvme-next!




Re: [PATCH] hw/block/nvme: drain namespaces on sq deletion

2021-02-10 Thread Klaus Jensen
On Jan 27 14:15, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> For most commands, when issuing an AIO, the BlockAIOCB is stored in the
> NvmeRequest aiocb pointer when the AIO is issued. The purpose of storing
> this is to allow the AIO to be cancelled when deleting submission
> queues (it is currently not used for Abort).
> 
> Since the addition of the Dataset Management command and Zoned
> Namespaces, NvmeRequests may involve more than one AIO and the AIOs are
> issued without saving a reference to the BlockAIOCB. This is a problem
> since nvme_del_sq will attempt to cancel outstanding AIOs, potentially
> with an invalid BlockAIOCB.
> 
> Fix this by instead of explicitly cancelling the requests, just allow
> the AIOs to complete by draining the namespace blockdevs.
> 
> Signed-off-by: Klaus Jensen 
> ---
>  hw/block/nvme.c | 18 +-
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 316858fd8adf..91f6fb6da1e2 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -403,6 +403,7 @@ static void nvme_req_clear(NvmeRequest *req)
>  {
>  req->ns = NULL;
>  req->opaque = NULL;
> +req->aiocb = NULL;
>  memset(&req->cqe, 0x0, sizeof(req->cqe));
>  req->status = NVME_SUCCESS;
>  }
> @@ -2396,6 +2397,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest 
> *req)
>  NvmeSQueue *sq;
>  NvmeCQueue *cq;
>  uint16_t qid = le16_to_cpu(c->qid);
> +int i;
>  
>  if (unlikely(!qid || nvme_check_sqid(n, qid))) {
>  trace_pci_nvme_err_invalid_del_sq(qid);
> @@ -2404,12 +2406,18 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest 
> *req)
>  
>  trace_pci_nvme_del_sq(qid);
>  
> -sq = n->sq[qid];
> -while (!QTAILQ_EMPTY(&sq->out_req_list)) {
> -r = QTAILQ_FIRST(&sq->out_req_list);
> -assert(r->aiocb);
> -blk_aio_cancel(r->aiocb);
> +for (i = 1; i <= n->num_namespaces; i++) {
> +NvmeNamespace *ns = nvme_ns(n, i);
> +if (!ns) {
> +continue;
> +}
> +
> +nvme_ns_drain(ns);
>  }
> +
> +sq = n->sq[qid];
> +assert(QTAILQ_EMPTY(&sq->out_req_list));
> +
>  if (!nvme_check_cqid(n, sq->cqid)) {
>  cq = n->cq[sq->cqid];
>  QTAILQ_REMOVE(&cq->sq_list, sq, entry);
> -- 
> 2.30.0
> 

Ping on this.




Re: [PATCH v3 2/2] qemu-nbd: Permit --shared=0 for unlimited clients

2021-02-10 Thread Nir Soffer
On Tue, Feb 9, 2021 at 5:28 PM Eric Blake  wrote:
>
> This gives us better feature parity with QMP nbd-server-start, where
> max-connections defaults to 0 for unlimited.

Sounds useful.

> Signed-off-by: Eric Blake 
> ---
>  docs/tools/qemu-nbd.rst | 4 ++--
>  qemu-nbd.c  | 7 +++
>  2 files changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/docs/tools/qemu-nbd.rst b/docs/tools/qemu-nbd.rst
> index fe41336dc550..ee862fa0bc02 100644
> --- a/docs/tools/qemu-nbd.rst
> +++ b/docs/tools/qemu-nbd.rst
> @@ -136,8 +136,8 @@ driver options if ``--image-opts`` is specified.
>  .. option:: -e, --shared=NUM
>
>Allow up to *NUM* clients to share the device (default
> -  ``1``). Safe for readers, but for now, consistency is not
> -  guaranteed between multiple writers.
> +  ``1``), 0 for unlimited. Safe for readers, but for now,
> +  consistency is not guaranteed between multiple writers.
>
>  .. option:: -t, --persistent
>
> diff --git a/qemu-nbd.c b/qemu-nbd.c
> index 1a340ea4858d..5416509ece18 100644
> --- a/qemu-nbd.c
> +++ b/qemu-nbd.c
> @@ -328,7 +328,7 @@ static void *nbd_client_thread(void *arg)
>
>  static int nbd_can_accept(void)
>  {
> -return state == RUNNING && nb_fds < shared;
> +return state == RUNNING && (shared == 0 || nb_fds < shared);
>  }
>
>  static void nbd_update_server_watch(void);
> @@ -706,8 +706,8 @@ int main(int argc, char **argv)
>  device = optarg;
>  break;
>  case 'e':
>  if (qemu_strtoi(optarg, NULL, 0, &shared) < 0 ||
> -shared < 1) {
> +shared < 0) {
>  error_report("Invalid shared device number '%s'", optarg);
>  exit(EXIT_FAILURE);
>  }
> @@ -966,7 +965,7 @@ int main(int argc, char **argv)
>  if (socket_activation == 0) {
>  int backlog;
>
> -if (persistent) {
> +if (persistent || shared == 0) {
>  backlog = SOMAXCONN;
>  } else {
>  backlog = MIN(shared, SOMAXCONN);
> --
> 2.30.0
>

Reviewed-by: Nir Soffer 




Re: [PATCH] hw/sd: sdhci: Do not transfer any data when command fails

2021-02-10 Thread Alistair Francis
On Tue, Feb 9, 2021 at 2:55 AM Bin Meng  wrote:
>
> At the end of sdhci_send_command(), it starts a data transfer if
> the command register indicates a data is associated. However the
> data transfer should only be initiated when the command execution
> has succeeded.
>
> Cc: qemu-sta...@nongnu.org
> Fixes: CVE-2020-17380
> Fixes: CVE-2020-25085
> Reported-by: Alexander Bulekov 
> Reported-by: Sergej Schumilo (Ruhr-University Bochum)
> Reported-by: Cornelius Aschermann (Ruhr-University Bochum)
> Reported-by: Simon Wörner (Ruhr-University Bochum)
> Buglink: https://bugs.launchpad.net/qemu/+bug/1892960

Isn't this already fixed?

> Signed-off-by: Bin Meng 

Acked-by: Alistair Francis 

Alistair

> ---
>
>  hw/sd/sdhci.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/hw/sd/sdhci.c b/hw/sd/sdhci.c
> index 8ffa539..0450110 100644
> --- a/hw/sd/sdhci.c
> +++ b/hw/sd/sdhci.c
> @@ -326,6 +326,7 @@ static void sdhci_send_command(SDHCIState *s)
>  SDRequest request;
>  uint8_t response[16];
>  int rlen;
> +bool cmd_failure = false;
>
>  s->errintsts = 0;
>  s->acmd12errsts = 0;
> @@ -349,6 +350,7 @@ static void sdhci_send_command(SDHCIState *s)
>  trace_sdhci_response16(s->rspreg[3], s->rspreg[2],
> s->rspreg[1], s->rspreg[0]);
>  } else {
> +cmd_failure = true;
>  trace_sdhci_error("timeout waiting for command response");
>  if (s->errintstsen & SDHC_EISEN_CMDTIMEOUT) {
>  s->errintsts |= SDHC_EIS_CMDTIMEOUT;
> @@ -369,7 +371,7 @@ static void sdhci_send_command(SDHCIState *s)
>
>  sdhci_update_irq(s);
>
> -if (s->blksize && (s->cmdreg & SDHC_CMD_DATA_PRESENT)) {
> +if (!cmd_failure && s->blksize && (s->cmdreg & SDHC_CMD_DATA_PRESENT)) {
>  s->data_count = 0;
>  sdhci_data_transfer(s);
>  }
> --
> 2.7.4
>
>



Re: [PATCH v2] hw/block: nvme: Fix a build error in nvme_get_feature()

2021-02-10 Thread Peter Maydell
On Wed, 10 Feb 2021 at 10:23, Bin Meng  wrote:
>
> From: Bin Meng 
>
> Current QEMU HEAD nvme.c does not compile:
>
>   hw/block/nvme.c:3242:9: error: ‘result’ may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
>  trace_pci_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
>  ^
>   hw/block/nvme.c:3150:14: note: ‘result’ was declared here
>  uint32_t result;
>   ^
>
> Explicitly initialize the result to fix it.
>
> Fixes: aa5e55e3b07e ("hw/block/nvme: open code for volatile write cache")
> Signed-off-by: Bin Meng 
>
> ---
>
> Changes in v2:
> - update function name in the commit message
>
>  hw/block/nvme.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 5ce21b7..c122ac0 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -3228,6 +3228,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, 
> NvmeRequest *req)
>  result = ns->features.err_rec;
>  goto out;
>  case NVME_VOLATILE_WRITE_CACHE:
> +result = 0;
>  for (i = 1; i <= n->num_namespaces; i++) {
>  ns = nvme_ns(n, i);
>  if (!ns) {
> --

Also spotted by Coverity: CID 1446371

-- PMM



Re: [PATCH v3 1/2] qemu-nbd: Use SOMAXCONN for socket listen() backlog

2021-02-10 Thread Eric Blake
On 2/10/21 10:58 AM, Nir Soffer wrote:
> On Tue, Feb 9, 2021 at 5:28 PM Eric Blake  wrote:
>>
>> Our default of a backlog of 1 connection is rather puny; it gets in
>> the way when we are explicitly allowing multiple clients (such as
>> qemu-nbd -e N [--shared], or nbd-server-start with its default
>> "max-connections":0 for unlimited), but is even a problem when we
>> stick to qemu-nbd's default of only 1 active client but use -t
>> [--persistent] where a second client can start using the server once
>> the first finishes.  While the effects are less noticeable on TCP
>> sockets (since the client can poll() to learn when the server is ready
>> again), it is definitely observable on Unix sockets, where on Unix, a
>> client will fail with EAGAIN and no recourse but to sleep an arbitrary
>> amount of time before retrying if the server backlog is already full.
>>
>> Since QMP nbd-server-start is always persistent, it now always
>> requests a backlog of SOMAXCONN;
> 
> This makes sense since we don't limit the number of connections.
> 
>> meanwhile, qemu-nbd will request
>> SOMAXCONN if persistent, otherwise its backlog should be based on the
>> expected number of clients.
> 
> If --persistent is used without --shared, we allow only one concurrent
> connection, so not clear why we need maximum backlog.

We only allow one active connection, but other clients can queue up to
also take advantage of the server once the first client disconnects.  A
larger backlog allows those additional clients to reach the point where
they can poll() for activity, rather than getting EAGAIN failures.
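
To illustrate the Unix-socket case: with the backlog full, connect(2)
fails with EAGAIN and a client can only sleep and retry blindly (a
sketch, not qemu-nbd code):

    /* what a client is reduced to when the server backlog is full */
    while (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        if (errno != EAGAIN) {
            perror("connect");
            exit(EXIT_FAILURE);
        }
        usleep(100 * 1000);  /* arbitrary delay; nothing to poll() on */
    }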

> 
> I think that separating --persistent and --shared would be easier to
> understand and use. The backlog will always be based on shared value.
> 

>> +++ b/qemu-nbd.c
>> @@ -964,8 +964,16 @@ int main(int argc, char **argv)
>>
>>  server = qio_net_listener_new();
>>  if (socket_activation == 0) {
>> +int backlog;
>> +
>> +if (persistent) {
>> +backlog = SOMAXCONN;
> 
> This increases the backlog, but since default shared is still 1, we will
> not accept more than 1 connection, so not clear why SOMAXCONN
> is better.

While we aren't servicing the next client yet, we are at least allowing
them to make it further in their connection by supporting a backlog.


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

