Re: [PATCH v4 0/2] Replaced locks with lock guard macros

2020-03-29 Thread Daniel Brodsky
On Sat, Mar 28, 2020 at 9:12 AM Richard Henderson
 wrote:
>
> On 3/27/20 11:38 PM, Daniel Brodsky wrote:
> > On Thu, Mar 26, 2020 at 11:01 AM Richard Henderson
> >  wrote:
> >>
> >> My preference is to add -Wno-tautological-type-limit-compare in
> >> configure, so we don't have to work around this same issue elsewhere
> >> in the code base.
> >>
> >> r~
> >
> > What do you think would be the best way to add this? I could change
> > all additions of the `-m32` flag to instead use `-m32
> > -Wno-tautological-type-limit-compare` or add the flag if qemu is being
> > compiled with clang and `-m32` already enabled.
>
> I was going to add it unconditionally, with all of the other warning flags.
>
> Except that it doesn't work -- clang-9 *still* warns.  Clearly a clang
> bug, but there doesn't seem to be any workaround at all except
> --disable-werror.
>
>
> r~
Using `#pragma clang diagnostic ignored "-Wtautological-type-limit-compare"`
suppresses the errors (on Clang 9). I could go and drop that in for the
problem areas? There are only a few, so it wouldn't be a major change.
I'm thinking of adding a macro like this:

#define PRAGMA(x) _Pragma(stringify(x))
#define IF_IGNORE_TYPE_LIMIT(statement) \
    PRAGMA(clang diagnostic push) \
    PRAGMA(clang diagnostic ignored "-Wtautological-type-limit-compare") \
    if (statement) \
    PRAGMA(clang diagnostic pop)

and replacing the problem conditionals with it.
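
For example, a use site would then look something like this (untested,
names illustrative):

IF_IGNORE_TYPE_LIMIT(val > UINT32_MAX) {
    /* only reached on targets where the comparison isn't tautological */
    return -ERANGE;
}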

Daniel



Re: [PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs

2020-03-29 Thread no-reply
Patchew URL: 
https://patchew.org/QEMU/1585542301-84087-1-git-send-email-yi.l@intel.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the
testing commands and their output below. If you have Docker installed,
you can probably reproduce it locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

 from /tmp/qemu-test/src/include/hw/pci/pci_bus.h:4,
 from /tmp/qemu-test/src/include/hw/pci-host/i440fx.h:15,
 from /tmp/qemu-test/src/stubs/pci-host-piix.c:2:
/tmp/qemu-test/src/include/hw/iommu/host_iommu_context.h:28:10: fatal error: linux/iommu.h: No such file or directory
 #include <linux/iommu.h>
          ^~~~~~~~~~~~~~~
compilation terminated.
  CC  scsi/pr-manager-stub.o
make: *** [/tmp/qemu-test/src/rules.mak:69: stubs/pci-host-piix.o] Error 1
make: *** Waiting for unfinished jobs
  CC  block/curl.o
Traceback (most recent call last):
---
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=a71cba547b0b47ef91f874b42e00f828', '-u', 
'1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-enp9m7rr/src/docker-src.2020-03-30-01.38.53.2480:/var/tmp/qemu:z,ro',
 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=a71cba547b0b47ef91f874b42e00f828
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-enp9m7rr/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    2m1.872s
user    0m8.422s


The full log is available at
http://patchew.org/logs/1585542301-84087-1-git-send-email-yi.l@intel.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

[Bug 1868116] Re: QEMU monitor no longer works

2020-03-29 Thread Christian Ehrhardt 
Thanks Ken!
I verified it and the new version indeed fixes the issue in focal.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1868116

Title:
  QEMU monitor no longer works

Status in QEMU:
  New
Status in qemu package in Ubuntu:
  Triaged
Status in vte2.91 package in Ubuntu:
  Fix Released
Status in qemu package in Debian:
  Unknown

Bug description:
  Repro:
  VTE
  $ meson _build && ninja -C _build && ninja -C _build install

  qemu:
  $ ../configure --python=/usr/bin/python3 --disable-werror --disable-user 
--disable-linux-user --disable-docs --disable-guest-agent --disable-sdl 
--enable-gtk --disable-vnc --disable-xen --disable-brlapi --disable-fdt 
--disable-hax --disable-vde --disable-netmap --disable-rbd --disable-libiscsi 
--disable-libnfs --disable-smartcard --disable-libusb --disable-usb-redir 
--disable-seccomp --disable-glusterfs --disable-tpm --disable-numa 
--disable-opengl --disable-virglrenderer --disable-xfsctl --disable-vxhs 
--disable-slirp --disable-blobs --target-list=x86_64-softmmu --disable-rdma 
--disable-pvrdma --disable-attr --disable-vhost-net --disable-vhost-vsock 
--disable-vhost-scsi --disable-vhost-crypto --disable-vhost-user 
--disable-spice --disable-qom-cast-debug --disable-vxhs --disable-bochs 
--disable-cloop --disable-dmg --disable-qcow1 --disable-vdi --disable-vvfat 
--disable-qed --disable-parallels --disable-sheepdog --disable-avx2 
--disable-nettle --disable-gnutls --disable-capstone --disable-tools 
--disable-libpmem --disable-iconv --disable-cap-ng
  $ make

  Test:
  $ LD_LIBRARY_PATH=/usr/local/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH 
./build/x86_64-softmmu/qemu-system-x86_64 -enable-kvm --drive 
media=cdrom,file=http://archive.ubuntu.com/ubuntu/dists/bionic/main/installer-amd64/current/images/netboot/mini.iso
  - switch to monitor with CTRL+ALT+2
  - try to enter something

  Affects head of both upstream git repos.

  
  --- original bug ---

  It was observed that the QEMU console (normally accessible using
  Ctrl+Alt+2) accepts no input, so it can't be used. This is
  problematic because there are cases where it's required to send
  commands to the guest, or key combinations that the host would grab
  (such as Ctrl-Alt-F1 or Alt-F4).

  ProblemType: Bug
  DistroRelease: Ubuntu 20.04
  Package: qemu 1:4.2-3ubuntu2
  Uname: Linux 5.6.0-rc6+ x86_64
  ApportVersion: 2.20.11-0ubuntu20
  Architecture: amd64
  CurrentDesktop: XFCE
  Date: Thu Mar 19 12:16:31 2020
  Dependencies:

  InstallationDate: Installed on 2017-06-13 (1009 days ago)
  InstallationMedia: Xubuntu 17.04 "Zesty Zapus" - Release amd64 (20170412)
  KvmCmdLine:
   COMMAND STAT  EUID  RUID PIDPPID %CPU COMMAND
   qemu-system-x86 Sl+   1000  1000   34275   25235 29.2 qemu-system-x86_64 -m 
4G -cpu Skylake-Client -device virtio-vga,virgl=true,xres=1280,yres=720 -accel 
kvm -device nec-usb-xhci -serial vc -serial stdio -hda 
/home/usuario/Sistemas/androidx86.img -display gtk,gl=on -device usb-audio
   kvm-nx-lpage-re S0 0   34284   2  0.0 [kvm-nx-lpage-re]
   kvm-pit/34275   S0 0   34286   2  0.0 [kvm-pit/34275]
  MachineType: LENOVO 80UG
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.6.0-rc6+ 
root=UUID=6b4ae5c0-c78c-49a6-a1ba-029192618a7a ro quiet ro kvm.ignore_msrs=1 
kvm.report_ignored_msrs=0 kvm.halt_poll_ns=0 kvm.halt_poll_ns_grow=0 
i915.enable_gvt=1 i915.fastboot=1 cgroup_enable=memory swapaccount=1 
zswap.enabled=1 zswap.zpool=z3fold 
resume=UUID=a82e38a0-8d20-49dd-9cbd-de7216b589fc log_buf_len=16M 
usbhid.quirks=0x0079:0x0006:0x10 config_scsi_mq_default=y 
scsi_mod.use_blk_mq=1 mtrr_gran_size=64M mtrr_chunk_size=64M nbd.nbds_max=2 
nbd.max_part=63
  SourcePackage: qemu
  UpgradeStatus: Upgraded to focal on 2019-12-22 (87 days ago)
  dmi.bios.date: 08/09/2018
  dmi.bios.vendor: LENOVO
  dmi.bios.version: 0XCN45WW
  dmi.board.asset.tag: NO Asset Tag
  dmi.board.name: Toronto 4A2
  dmi.board.vendor: LENOVO
  dmi.board.version: SDK0J40679 WIN
  dmi.chassis.asset.tag: NO Asset Tag
  dmi.chassis.type: 10
  dmi.chassis.vendor: LENOVO
  dmi.chassis.version: Lenovo ideapad 310-14ISK
  dmi.modalias: 
dmi:bvnLENOVO:bvr0XCN45WW:bd08/09/2018:svnLENOVO:pn80UG:pvrLenovoideapad310-14ISK:rvnLENOVO:rnToronto4A2:rvrSDK0J40679WIN:cvnLENOVO:ct10:cvrLenovoideapad310-14ISK:
  dmi.product.family: IDEAPAD
  dmi.product.name: 80UG
  dmi.product.sku: LENOVO_MT_80UG_BU_idea_FM_Lenovo ideapad 310-14ISK
  dmi.product.version: Lenovo ideapad 310-14ISK
  dmi.sys.vendor: LENOVO
  mtime.conffile..etc.apport.crashdb.conf: 2019-08-29T08:39:36.787240

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1868116/+subscriptions



Re: [PATCH for 5.0 v1 0/2] RISC-V: Fix Hypervisor guest user space

2020-03-29 Thread Anup Patel
Hi Palmer,

On Fri, Mar 27, 2020 at 5:30 AM Palmer Dabbelt  wrote:
>
> On Thu, 26 Mar 2020 15:44:04 PDT (-0700), Alistair Francis wrote:
> > This series fixes two bugs in the RISC-V two stage lookup
> > implementation. This fixes the Hypervisor userspace failing to start.
> >
> > Alistair Francis (2):
> >   riscv: Don't use stage-2 PTE lookup protection flags
> >   riscv: AND stage-1 and stage-2 protection flags
> >
> >  target/riscv/cpu_helper.c | 11 +++
> >  1 file changed, 7 insertions(+), 4 deletions(-)
>
> Thanks, these are in the queue.
>

I have tested this patch series on the latest QEMU master without the
"target/riscv: Don't set write permissions on dirty PTEs" workaround
patch. It works fine now.

Tested-by: Anup Patel 

Please drop the workaround patch "target/riscv: Don't set write
permissions on dirty PTEs" from your for-next.

Regards,
Anup



[PATCH v2 21/22] intel_iommu: process PASID-based Device-TLB invalidation

2020-03-29 Thread Liu Yi L
This patch adds empty handling for PASID-based Device-TLB
invalidation. For now this is enough, as it is not necessary to
propagate it to the host for passthrough devices, and no
emulated device has a device tlb yet.

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Reviewed-by: Peter Xu 
Signed-off-by: Liu Yi L 
---
 hw/i386/intel_iommu.c  | 18 ++
 hw/i386/intel_iommu_internal.h |  1 +
 2 files changed, 19 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 02ad90a..e8877d4 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3224,6 +3224,17 @@ static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
 return true;
 }
 
+static bool vtd_process_device_piotlb_desc(IntelIOMMUState *s,
+   VTDInvDesc *inv_desc)
+{
+/*
+ * no need to handle it for passthru devices; for emulated
+ * devices with a device tlb it may be required, but for now
+ * returning is enough
+ */
+return true;
+}
+
 static bool vtd_process_device_iotlb_desc(IntelIOMMUState *s,
   VTDInvDesc *inv_desc)
 {
@@ -3345,6 +3356,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
 }
 break;
 
+case VTD_INV_DESC_DEV_PIOTLB:
+trace_vtd_inv_desc("device-piotlb", inv_desc.hi, inv_desc.lo);
+if (!vtd_process_device_piotlb_desc(s, &inv_desc)) {
+return false;
+}
+break;
+
 case VTD_INV_DESC_DEVICE:
 trace_vtd_inv_desc("device", inv_desc.hi, inv_desc.lo);
 if (!vtd_process_device_iotlb_desc(s, &inv_desc)) {
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 85ebaa5..4910e63 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -386,6 +386,7 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_WAIT   0x5 /* Invalidation Wait Descriptor */
 #define VTD_INV_DESC_PIOTLB 0x6 /* PASID-IOTLB Invalidate Desc */
 #define VTD_INV_DESC_PC 0x7 /* PASID-cache Invalidate Desc */
+#define VTD_INV_DESC_DEV_PIOTLB 0x8 /* PASID-based-DIOTLB inv_desc*/
 #define VTD_INV_DESC_NONE   0   /* Not an Invalidate Descriptor */
 
 /* Masks for Invalidation Wait Descriptor*/
-- 
2.7.4




[PATCH v2 13/22] intel_iommu: add PASID cache management infrastructure

2020-03-29 Thread Liu Yi L
This patch adds a PASID cache management infrastructure based on
the newly added structure VTDPASIDAddressSpace, which is used to track
PASID usage and to support future PASID-tagged DMA address translation
in vIOMMU.

struct VTDPASIDAddressSpace {
VTDBus *vtd_bus;
uint8_t devfn;
AddressSpace as;
uint32_t pasid;
IntelIOMMUState *iommu_state;
VTDContextCacheEntry context_cache_entry;
QLIST_ENTRY(VTDPASIDAddressSpace) next;
VTDPASIDCacheEntry pasid_cache_entry;
};

Ideally, a VTDPASIDAddressSpace instance is created when a PASID
is bound to a DMA AddressSpace. The Intel VT-d spec requires guest
software to issue a pasid cache invalidation when binding or unbinding
a pasid to/from an address space under caching-mode. However, as
VTDPASIDAddressSpace instances also act as the pasid cache in this
implementation, their creation also happens during vIOMMU PASID-tagged
DMA translation. The creation in this path is not added in this
patch, since there are no PASID-capable emulated devices for now.

The implementation in this patch manages VTDPASIDAddressSpace
instances per PASID+BDF (lookup and insert use PASID and
BDF) since the Intel VT-d spec allows a per-BDF PASID Table. When a
guest binds a PASID to an AddressSpace, QEMU captures the
guest's pasid-selective pasid cache invalidation, and allocates or
removes a VTDPASIDAddressSpace instance per the invalidation
reason:

*) a present pasid entry moved to non-present
*) a present pasid entry modified while staying present
*) a non-present pasid entry moved to present

The vIOMMU emulator can figure out which case applies by fetching the
latest guest pasid entry.
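
In rough pseudo-code, the per-entry sync then looks like this (the
lookup helper is added below; the other names are illustrative only):

VTDPASIDEntry pe;

if (vtd_dev_get_pe_from_pasid(s, bus_num, devfn, pasid, &pe)) {
    /* present -> non-present: destroy the cached instance */
} else if (!pasid_entry_cached(vtd_pasid_as)) {
    /* non-present -> present: create the instance, bind pasid to host */
} else if (pasid_entry_changed(vtd_pasid_as, &pe)) {
    /* present -> present (modified): update the cache and re-bind */
}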

v1 -> v2: - merged this patch with the former replay binding patch; this
makes PSI/DSI/GSI use a unified function to do cache invalidation
and pasid binding replay.
  - dropped pasid_cache_gen in both iommu_state and vtd_pasid_as,
as it is not necessary so far; we may want it when we one day
introduce an emulated SVA-capable device.

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Signed-off-by: Liu Yi L 
---
 hw/i386/intel_iommu.c  | 473 +
 hw/i386/intel_iommu_internal.h |  18 ++
 hw/i386/trace-events   |   1 +
 include/hw/i386/intel_iommu.h  |  24 +++
 4 files changed, 516 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2eb60c3..a7e9973 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
 #include "kvm_i386.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "qemu/jhash.h"
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -65,6 +66,8 @@
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
+static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+
 static void vtd_panic_require_caching_mode(void)
 {
 error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -276,6 +279,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
 vtd_iommu_lock(s);
 vtd_reset_iotlb_locked(s);
 vtd_reset_context_cache_locked(s);
+vtd_pasid_cache_reset(s);
 vtd_iommu_unlock(s);
 }
 
@@ -686,6 +690,16 @@ static inline bool vtd_pe_type_check(X86IOMMUState 
*x86_iommu,
 return true;
 }
 
+static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
+{
+return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
+}
+
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
 return pdire->val & 1;
@@ -2395,9 +2409,452 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, 
VTDInvDesc *inv_desc)
 return true;
 }
 
+static inline void vtd_init_pasid_key(uint32_t pasid,
+ uint16_t sid,
+ struct pasid_key *key)
+{
+key->pasid = pasid;
+key->sid = sid;
+}
+
+static guint vtd_pasid_as_key_hash(gconstpointer v)
+{
+struct pasid_key *key = (struct pasid_key *)v;
+uint32_t a, b, c;
+
+/* Jenkins hash */
+a = b = c = JHASH_INITVAL + sizeof(*key);
+a += key->sid;
+b += extract32(key->pasid, 0, 16);
+c += extract32(key->pasid, 16, 16);
+
+__jhash_mix(a, b, c);
+__jhash_final(a, b, c);
+
+return c;
+}
+
+static gboolean vtd_pasid_as_key_equal(gconstpointer v1, gconstpointer v2)
+{
+const struct pasid_key *k1 = v1;
+const struct pasid_key *k2 = v2;
+
+return (k1->pasid == k2->pasid) && (k1->sid == k2->sid);
+}
+
+static inline int vtd_dev_get_pe_from_pasid(IntelIOMMUState *s,
+uint8_t bus_num,
+uint8_t devfn,
+

[PATCH v2 02/22] header file update VFIO/IOMMU vSVA APIs

2020-03-29 Thread Liu Yi L
The kernel uapi/linux/iommu.h header file includes the
extensions for vSVA support, e.g. bind gpasid, iommu
fault report related user structures, etc.

Note: this should be replaced with a full header file update when
the vSVA uAPI is stable.

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Michael S. Tsirkin 
Cc: Cornelia Huck 
Cc: Paolo Bonzini 
Signed-off-by: Liu Yi L 
---
 linux-headers/linux/iommu.h | 378 
 linux-headers/linux/vfio.h  | 127 +++
 2 files changed, 505 insertions(+)
 create mode 100644 linux-headers/linux/iommu.h

diff --git a/linux-headers/linux/iommu.h b/linux-headers/linux/iommu.h
new file mode 100644
index 000..9025496
--- /dev/null
+++ b/linux-headers/linux/iommu.h
@@ -0,0 +1,378 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * IOMMU user API definitions
+ */
+
+#ifndef _IOMMU_H
+#define _IOMMU_H
+
+#include <linux/types.h>
+
+/**
+ * Current version of the IOMMU user API. This is intended for query
+ * between user and kernel to determine compatible data structures.
+ *
+ * UAPI version can be bumped up with the following rules:
+ * 1. All data structures passed between user and kernel space share
+ *the same version number. i.e. any extension to any structure
+ *results in version number increment.
+ *
+ * 2. Data structures are open to extension but closed to modification.
+ *Extension should leverage the padding bytes first where a new
+ *flag bit is required to indicate the validity of each new member.
+ *The above rule for padding bytes also applies to adding new union
+ *members.
+ *After padding bytes are exhausted, new fields must be added at the
+ *end of each data structure with 64bit alignment. Flag bits can be
+ *added without size change but existing ones cannot be altered.
+ *
+ * 3. Versions are backward compatible.
+ *
+ * 4. Version to size lookup is supported by kernel internal API for each
+ *API function type. @version is mandatory for new data structures
+ *and must be at the beginning with type of __u32.
+ */
+#define IOMMU_UAPI_VERSION 1
+static __inline__ int iommu_get_uapi_version(void)
+{
+   return IOMMU_UAPI_VERSION;
+}
+
+/*
+ * Supported UAPI features that can be reported to user space.
+ * These types represent the capability available in the kernel.
+ *
+ * REVISIT: UAPI version also implies the capabilities. Should we
+ * report them explicitly?
+ */
+enum IOMMU_UAPI_DATA_TYPES {
+   IOMMU_UAPI_BIND_GPASID,
+   IOMMU_UAPI_CACHE_INVAL,
+   IOMMU_UAPI_PAGE_RESP,
+   NR_IOMMU_UAPI_TYPE,
+};
+
+#define IOMMU_UAPI_CAP_MASK ((1 << IOMMU_UAPI_BIND_GPASID) |   \
+   (1 << IOMMU_UAPI_CACHE_INVAL) | \
+   (1 << IOMMU_UAPI_PAGE_RESP))
+
+#define IOMMU_FAULT_PERM_READ  (1 << 0) /* read */
+#define IOMMU_FAULT_PERM_WRITE (1 << 1) /* write */
+#define IOMMU_FAULT_PERM_EXEC  (1 << 2) /* exec */
+#define IOMMU_FAULT_PERM_PRIV  (1 << 3) /* privileged */
+
+/* Generic fault types, can be expanded IRQ remapping fault */
+enum iommu_fault_type {
+   IOMMU_FAULT_DMA_UNRECOV = 1,/* unrecoverable fault */
+   IOMMU_FAULT_PAGE_REQ,   /* page request fault */
+};
+
+enum iommu_fault_reason {
+   IOMMU_FAULT_REASON_UNKNOWN = 0,
+
+   /* Could not access the PASID table (fetch caused external abort) */
+   IOMMU_FAULT_REASON_PASID_FETCH,
+
+   /* PASID entry is invalid or has configuration errors */
+   IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
+
+   /*
+* PASID is out of range (e.g. exceeds the maximum PASID
+* supported by the IOMMU) or disabled.
+*/
+   IOMMU_FAULT_REASON_PASID_INVALID,
+
+   /*
+* An external abort occurred fetching (or updating) a translation
+* table descriptor
+*/
+   IOMMU_FAULT_REASON_WALK_EABT,
+
+   /*
+* Could not access the page table entry (Bad address),
+* actual translation fault
+*/
+   IOMMU_FAULT_REASON_PTE_FETCH,
+
+   /* Protection flag check failed */
+   IOMMU_FAULT_REASON_PERMISSION,
+
+   /* access flag check failed */
+   IOMMU_FAULT_REASON_ACCESS,
+
+   /* Output address of a translation stage caused Address Size fault */
+   IOMMU_FAULT_REASON_OOR_ADDRESS,
+};
+
+/**
+ * struct iommu_fault_unrecoverable - Unrecoverable fault data
+ * @reason: reason of the fault, from  iommu_fault_reason
+ * @flags: parameters of this fault (IOMMU_FAULT_UNRECOV_* values)
+ * @pasid: Process Address Space ID
+ * @perm: requested permission access using by the incoming transaction
+ *(IOMMU_FAULT_PERM_* values)
+ * @addr: offending page address
+ * @fetch_addr: address that caused a fetch abort, if any
+ */
+struct iommu_fault_unrecoverable {
+   __u32   reason;
+#define IOMMU_FAULT_UNRECOV_PASID_VALID (1 << 0)
+#define IOMMU_FAULT_UNRECOV_ADDR_VALID (1 

[PATCH v2 20/22] intel_iommu: propagate PASID-based iotlb invalidation to host

2020-03-29 Thread Liu Yi L
This patch propagates PASID-based iotlb invalidation to the host.

Intel VT-d 3.0 supports nested translation at PASID granularity.
Guest SVA support can be implemented by configuring nested
translation on a specific PASID. This is also known as dual stage
DMA translation.

Under such a configuration, the guest owns the GVA->GPA translation,
which is configured as the first level page table on the host side
for a specific pasid, and the host owns the GPA->HPA translation. As
the guest owns the first level translation table, piotlb invalidations
should be propagated to the host, since the host IOMMU will cache
first level page table related mappings during DMA address translation.

This patch traps the guest PASID-based iotlb flush and propagates
it to the host.
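
Concretely, for a guest PASID-selective flush, the invalidation info
handed to the host (via host_iommu_ctx_flush_stage1_cache(), see below)
is filled roughly as:

struct iommu_cache_invalidate_info *ci = &stage1_cache->cache_info;

ci->version = IOMMU_UAPI_VERSION;
ci->cache = IOMMU_CACHE_INV_TYPE_IOTLB;
ci->granularity = IOMMU_INV_GRANU_PASID;
ci->pasid_info.pasid = pasid;
ci->pasid_info.flags = IOMMU_INV_PASID_FLAGS_PASID;

VFIO then passes this on to the VFIO_IOMMU_CACHE_INVALIDATE ioctl.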

v1 -> v2: removed the valid check on the vtd_pasid_as instance, as
  v2 ensures all vtd_pasid_as instances are valid
Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Signed-off-by: Liu Yi L 
---
 hw/i386/intel_iommu.c  | 117 +
 hw/i386/intel_iommu_internal.h |   7 +++
 2 files changed, 124 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6114dd8..02ad90a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3045,16 +3045,133 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
 return true;
 }
 
+/**
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_invalidate_piotlb(IntelIOMMUState *s,
+  VTDBus *vtd_bus,
+  int devfn,
+  DualIOMMUStage1Cache *stage1_cache)
+{
+VTDHostIOMMUContext *vtd_dev_icx;
+HostIOMMUContext *iommu_ctx;
+
+vtd_dev_icx = vtd_bus->dev_icx[devfn];
+if (!vtd_dev_icx) {
+goto out;
+}
+iommu_ctx = vtd_dev_icx->iommu_ctx;
+if (!iommu_ctx) {
+goto out;
+}
+if (host_iommu_ctx_flush_stage1_cache(iommu_ctx, stage1_cache)) {
+error_report("Cache flush failed");
+}
+out:
+return;
+}
+
+/**
+ * This function is a loop function for the s->vtd_pasid_as
+ * list with VTDPIOTLBInvInfo as execution filter. It propagates
+ * the piotlb invalidation to host. Caller of this function
+ * should hold iommu_lock.
+ */
+static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
+  gpointer user_data)
+{
+VTDPIOTLBInvInfo *piotlb_info = user_data;
+VTDPASIDAddressSpace *vtd_pasid_as = value;
+VTDPASIDCacheEntry *pc_entry = &vtd_pasid_as->pasid_cache_entry;
+uint16_t did;
+
+did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
+
+if ((piotlb_info->domain_id == did) &&
+(piotlb_info->pasid == vtd_pasid_as->pasid)) {
+vtd_invalidate_piotlb(vtd_pasid_as->iommu_state,
+  vtd_pasid_as->vtd_bus,
+  vtd_pasid_as->devfn,
+  piotlb_info->stage1_cache);
+}
+
+/*
+ * TODO: needs to add QEMU piotlb flush when QEMU piotlb
+ * infrastructure is ready. For now, it is enough for passthru
+ * devices.
+ */
+}
+
 static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
 uint16_t domain_id,
 uint32_t pasid)
 {
+VTDPIOTLBInvInfo piotlb_info;
+DualIOMMUStage1Cache *stage1_cache;
+struct iommu_cache_invalidate_info *cache_info;
+
+stage1_cache = g_malloc0(sizeof(*stage1_cache));
+stage1_cache->pasid = pasid;
+
+cache_info = &stage1_cache->cache_info;
+cache_info->version = IOMMU_UAPI_VERSION;
+cache_info->cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+cache_info->granularity = IOMMU_INV_GRANU_PASID;
+cache_info->pasid_info.pasid = pasid;
+cache_info->pasid_info.flags = IOMMU_INV_PASID_FLAGS_PASID;
+
+piotlb_info.domain_id = domain_id;
+piotlb_info.pasid = pasid;
+piotlb_info.stage1_cache = stage1_cache;
+
+vtd_iommu_lock(s);
+/*
+ * Here loops all the vtd_pasid_as instances in s->vtd_pasid_as
+ * to find out the affected devices since piotlb invalidation
+ * should check pasid cache per architecture point of view.
+ */
+g_hash_table_foreach(s->vtd_pasid_as,
+ vtd_flush_pasid_iotlb, &piotlb_info);
+vtd_iommu_unlock(s);
+g_free(stage1_cache);
 }
 
 static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
uint32_t pasid, hwaddr addr, uint8_t am,
bool ih)
 {
+VTDPIOTLBInvInfo piotlb_info;
+DualIOMMUStage1Cache *stage1_cache;
+struct iommu_cache_invalidate_info *cache_info;
+
+stage1_cache = g_malloc0(sizeof(*stage1_cache));
+stage1_cache->pasid = pasid;
+
+cache_info = &stage1_cache->cache_info;
+cache_info->version = IOMMU_UAPI_VERSION;
+cache_info->cache = IOMMU_CACHE_INV_TYPE_IOTLB;
+

[PATCH v2 08/22] vfio/common: provide PASID alloc/free hooks

2020-03-29 Thread Liu Yi L
This patch defines vfio_host_iommu_context_info and implements the PASID
alloc/free hooks defined in HostIOMMUContextClass.

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Eric Auger 
Cc: Yi Sun 
Cc: David Gibson 
Cc: Alex Williamson 
Signed-off-by: Liu Yi L 
---
 hw/vfio/common.c  | 69 +++
 include/hw/iommu/host_iommu_context.h |  3 ++
 include/hw/vfio/vfio-common.h |  4 ++
 3 files changed, 76 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c276732..5f3534d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1179,6 +1179,53 @@ static int vfio_get_iommu_type(VFIOContainer *container,
 return -EINVAL;
 }
 
+static int vfio_host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx,
+   uint32_t min, uint32_t max,
+   uint32_t *pasid)
+{
+VFIOContainer *container = container_of(iommu_ctx,
+VFIOContainer, iommu_ctx);
+struct vfio_iommu_type1_pasid_request req;
+unsigned long argsz;
+int ret;
+
+argsz = sizeof(req);
+req.argsz = argsz;
+req.flags = VFIO_IOMMU_PASID_ALLOC;
+req.alloc_pasid.min = min;
+req.alloc_pasid.max = max;
+
+if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
+ret = -errno;
+error_report("%s: %d, alloc failed", __func__, ret);
+return ret;
+}
+*pasid = req.alloc_pasid.result;
+return 0;
+}
+
+static int vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
+  uint32_t pasid)
+{
+VFIOContainer *container = container_of(iommu_ctx,
+VFIOContainer, iommu_ctx);
+struct vfio_iommu_type1_pasid_request req;
+unsigned long argsz;
+int ret;
+
+argsz = sizeof(req);
+req.argsz = argsz;
+req.flags = VFIO_IOMMU_PASID_FREE;
+req.free_pasid = pasid;
+
+if (ioctl(container->fd, VFIO_IOMMU_PASID_REQUEST, &req)) {
+ret = -errno;
+error_report("%s: %d, free failed", __func__, ret);
+return ret;
+}
+return 0;
+}
+
 static int vfio_init_container(VFIOContainer *container, int group_fd,
Error **errp)
 {
@@ -1791,3 +1838,25 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
 }
 return vfio_eeh_container_op(container, op);
 }
+
+static void vfio_host_iommu_context_class_init(ObjectClass *klass,
+   void *data)
+{
+HostIOMMUContextClass *hicxc = HOST_IOMMU_CONTEXT_CLASS(klass);
+
+hicxc->pasid_alloc = vfio_host_iommu_ctx_pasid_alloc;
+hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free;
+}
+
+static const TypeInfo vfio_host_iommu_context_info = {
+.parent = TYPE_HOST_IOMMU_CONTEXT,
+.name = TYPE_VFIO_HOST_IOMMU_CONTEXT,
+.class_init = vfio_host_iommu_context_class_init,
+};
+
+static void vfio_register_types(void)
+{
+type_register_static(_host_iommu_context_info);
+}
+
+type_init(vfio_register_types)
diff --git a/include/hw/iommu/host_iommu_context.h 
b/include/hw/iommu/host_iommu_context.h
index 35c4861..227c433 100644
--- a/include/hw/iommu/host_iommu_context.h
+++ b/include/hw/iommu/host_iommu_context.h
@@ -33,6 +33,9 @@
 #define TYPE_HOST_IOMMU_CONTEXT "qemu:host-iommu-context"
 #define HOST_IOMMU_CONTEXT(obj) \
 OBJECT_CHECK(HostIOMMUContext, (obj), TYPE_HOST_IOMMU_CONTEXT)
+#define HOST_IOMMU_CONTEXT_CLASS(klass) \
+OBJECT_CLASS_CHECK(HostIOMMUContextClass, (klass), \
+ TYPE_HOST_IOMMU_CONTEXT)
 #define HOST_IOMMU_CONTEXT_GET_CLASS(obj) \
 OBJECT_GET_CLASS(HostIOMMUContextClass, (obj), \
  TYPE_HOST_IOMMU_CONTEXT)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fd56420..0b07303 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -26,12 +26,15 @@
 #include "qemu/notify.h"
 #include "ui/console.h"
 #include "hw/display/ramfb.h"
+#include "hw/iommu/host_iommu_context.h"
 #ifdef CONFIG_LINUX
 #include <linux/vfio.h>
 #endif
 
 #define VFIO_MSG_PREFIX "vfio %s: "
 
+#define TYPE_VFIO_HOST_IOMMU_CONTEXT "qemu:vfio-host-iommu-context"
+
 enum {
 VFIO_DEVICE_TYPE_PCI = 0,
 VFIO_DEVICE_TYPE_PLATFORM = 1,
@@ -71,6 +74,7 @@ typedef struct VFIOContainer {
 MemoryListener listener;
 MemoryListener prereg_listener;
 unsigned iommu_type;
+HostIOMMUContext iommu_ctx;
 Error *error;
 bool initialized;
 unsigned long pgsizes;
-- 
2.7.4




[PATCH v2 19/22] intel_iommu: process PASID-based iotlb invalidation

2020-03-29 Thread Liu Yi L
This patch adds basic PASID-based iotlb (piotlb) invalidation
support. The piotlb is used when walking the Intel VT-d 1st level page
table. This patch only adds the basic processing; detailed handling
will be added in the next patch.

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Signed-off-by: Liu Yi L 
---
 hw/i386/intel_iommu.c  | 53 ++
 hw/i386/intel_iommu_internal.h | 13 +++
 2 files changed, 66 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 074d966..6114dd8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3045,6 +3045,55 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
 return true;
 }
 
+static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
+uint16_t domain_id,
+uint32_t pasid)
+{
+}
+
+static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
+   uint32_t pasid, hwaddr addr, uint8_t am,
+   bool ih)
+{
+}
+
+static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
+VTDInvDesc *inv_desc)
+{
+uint16_t domain_id;
+uint32_t pasid;
+uint8_t am;
+hwaddr addr;
+
+if ((inv_desc->val[0] & VTD_INV_DESC_PIOTLB_RSVD_VAL0) ||
+(inv_desc->val[1] & VTD_INV_DESC_PIOTLB_RSVD_VAL1)) {
+error_report_once("non-zero-field-in-piotlb_inv_desc hi: 0x%" PRIx64
+  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+return false;
+}
+
+domain_id = VTD_INV_DESC_PIOTLB_DID(inv_desc->val[0]);
+pasid = VTD_INV_DESC_PIOTLB_PASID(inv_desc->val[0]);
+switch (inv_desc->val[0] & VTD_INV_DESC_IOTLB_G) {
+case VTD_INV_DESC_PIOTLB_ALL_IN_PASID:
+vtd_piotlb_pasid_invalidate(s, domain_id, pasid);
+break;
+
+case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
+am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
+addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
+vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+   VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
+break;
+
+default:
+error_report_once("Invalid granularity in P-IOTLB desc hi: 0x%" PRIx64
+  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+return false;
+}
+return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
  VTDInvDesc *inv_desc)
 {
@@ -3159,6 +3208,10 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
 break;
 
 case VTD_INV_DESC_PIOTLB:
+trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
+if (!vtd_process_piotlb_desc(s, &inv_desc)) {
+return false;
+}
 break;
 
 case VTD_INV_DESC_WAIT:
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 9122601..5a49d5b 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -457,6 +457,19 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
 #define VTD_INV_DESC_PASIDC_GLOBAL (3ULL << 4)
 
+#define VTD_INV_DESC_PIOTLB_ALL_IN_PASID  (2ULL << 4)
+#define VTD_INV_DESC_PIOTLB_PSI_IN_PASID  (3ULL << 4)
+
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff0ffc0ULL
+#define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
+
+#define VTD_INV_DESC_PIOTLB_PASID(val)(((val) >> 32) & 0xfULL)
+#define VTD_INV_DESC_PIOTLB_DID(val)  (((val) >> 16) & \
+ VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PIOTLB_ADDR(val) ((val) & ~0xfffULL)
+#define VTD_INV_DESC_PIOTLB_AM(val)   ((val) & 0x3fULL)
+#define VTD_INV_DESC_PIOTLB_IH(val)   (((val) >> 6) & 0x1)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
 uint16_t domain_id;
-- 
2.7.4




[PATCH v2 10/22] vfio/pci: set host iommu context to vIOMMU

2020-03-29 Thread Liu Yi L
For vfio-pci devices, pci_device_set/unset_iommu() can be used to
expose the host iommu context to vIOMMU emulators. vIOMMU emulators
can then make use of the methods provided by the host iommu context,
e.g. to propagate requests to the host iommu.

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Eric Auger 
Cc: Yi Sun 
Cc: David Gibson 
Cc: Alex Williamson 
Signed-off-by: Liu Yi L 
---
 hw/vfio/pci.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5e75a95..c140c88 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2717,6 +2717,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 VFIOPCIDevice *vdev = PCI_VFIO(pdev);
 VFIODevice *vbasedev_iter;
 VFIOGroup *group;
+VFIOContainer *container;
 char *tmp, *subsys, group_path[PATH_MAX], *group_name;
 Error *err = NULL;
 ssize_t len;
@@ -3028,6 +3029,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 vfio_register_req_notifier(vdev);
 vfio_setup_resetfn_quirk(vdev);
 
+container = vdev->vbasedev.group->container;
+if (container->iommu_ctx.initialized) {
+pci_device_set_iommu_context(pdev, &container->iommu_ctx);
+}
+
 return;
 
 out_deregister:
@@ -3072,9 +3078,16 @@ static void vfio_instance_finalize(Object *obj)
 static void vfio_exitfn(PCIDevice *pdev)
 {
 VFIOPCIDevice *vdev = PCI_VFIO(pdev);
+VFIOContainer *container;
 
 vfio_unregister_req_notifier(vdev);
 vfio_unregister_err_notifier(vdev);
+
+container = vdev->vbasedev.group->container;
+if (container->iommu_ctx.initialized) {
+pci_device_unset_iommu_context(pdev);
+}
+
 pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
 if (vdev->irqchip_change_notifier.notify) {
 kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
-- 
2.7.4




[PATCH v2 14/22] vfio: add bind stage-1 page table support

2020-03-29 Thread Liu Yi L
This patch adds the bind_stage1_pgtbl() definition in
HostIOMMUContextClass, and adds the corresponding implementation in
VFIO. This exposes a way for a vIOMMU to set up dual stage DMA
translation for passthrough devices in hardware.
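
For illustration, a vIOMMU would then bind/unbind a guest page table
roughly like below (sketch; the vendor-specific part of bind_data is
filled per the host IOMMU, e.g. struct iommu_gpasid_bind_data for VT-d):

DualIOMMUStage1BindData bind_data;

bind_data.pasid = pasid;
/* fill bind_data.bind_data with the guest stage-1 table info */
ret = host_iommu_ctx_bind_stage1_pgtbl(iommu_ctx, &bind_data);
...
ret = host_iommu_ctx_unbind_stage1_pgtbl(iommu_ctx, &bind_data);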

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Eric Auger 
Cc: Yi Sun 
Cc: David Gibson 
Cc: Alex Williamson 
Signed-off-by: Liu Yi L 
---
 hw/iommu/host_iommu_context.c | 47 +-
 hw/vfio/common.c  | 55 ++-
 include/hw/iommu/host_iommu_context.h | 26 -
 3 files changed, 125 insertions(+), 3 deletions(-)

diff --git a/hw/iommu/host_iommu_context.c b/hw/iommu/host_iommu_context.c
index 5fb2223..8ae20fe 100644
--- a/hw/iommu/host_iommu_context.c
+++ b/hw/iommu/host_iommu_context.c
@@ -69,15 +69,60 @@ int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, 
uint32_t pasid)
 return hicxc->pasid_free(iommu_ctx, pasid);
 }
 
+int host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+ DualIOMMUStage1BindData *data)
+{
+HostIOMMUContextClass *hicxc;
+
+if (!iommu_ctx) {
+return -EINVAL;
+}
+
+hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+if (!hicxc) {
+return -EINVAL;
+}
+
+if (!(iommu_ctx->flags & HOST_IOMMU_NESTING) ||
+!hicxc->bind_stage1_pgtbl) {
+return -EINVAL;
+}
+
+return hicxc->bind_stage1_pgtbl(iommu_ctx, data);
+}
+
+int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+   DualIOMMUStage1BindData *data)
+{
+HostIOMMUContextClass *hicxc;
+
+if (!iommu_ctx) {
+return -EINVAL;
+}
+
+hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+if (!hicxc) {
+return -EINVAL;
+}
+
+if (!(iommu_ctx->flags & HOST_IOMMU_NESTING) ||
+!hicxc->unbind_stage1_pgtbl) {
+return -EINVAL;
+}
+
+return hicxc->unbind_stage1_pgtbl(iommu_ctx, data);
+}
+
 void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
  const char *mrtypename,
- uint64_t flags)
+ uint64_t flags, uint32_t formats)
 {
 HostIOMMUContext *iommu_ctx;
 
 object_initialize(_iommu_ctx, instance_size, mrtypename);
 iommu_ctx = HOST_IOMMU_CONTEXT(_iommu_ctx);
 iommu_ctx->flags = flags;
+iommu_ctx->stage1_formats = formats;
 iommu_ctx->initialized = true;
 }
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 44b142c..465e4d8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1226,6 +1226,54 @@ static int 
vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
 return 0;
 }
 
+static int vfio_host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+  DualIOMMUStage1BindData *bind_data)
+{
+VFIOContainer *container = container_of(iommu_ctx,
+VFIOContainer, iommu_ctx);
+struct vfio_iommu_type1_bind *bind;
+unsigned long argsz;
+int ret = 0;
+
+argsz = sizeof(*bind) + sizeof(bind_data->bind_data);
+bind = g_malloc0(argsz);
+bind->argsz = argsz;
+bind->flags = VFIO_IOMMU_BIND_GUEST_PGTBL;
+memcpy(&bind->data, &bind_data->bind_data, sizeof(bind_data->bind_data));
+
+if (ioctl(container->fd, VFIO_IOMMU_BIND, bind)) {
+ret = -errno;
+error_report("%s: pasid (%u) bind failed: %d",
+  __func__, bind_data->pasid, ret);
+}
+g_free(bind);
+return ret;
+}
+
+static int vfio_host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
+DualIOMMUStage1BindData *bind_data)
+{
+VFIOContainer *container = container_of(iommu_ctx,
+VFIOContainer, iommu_ctx);
+struct vfio_iommu_type1_bind *bind;
+unsigned long argsz;
+int ret = 0;
+
+argsz = sizeof(*bind) + sizeof(bind_data->bind_data);
+bind = g_malloc0(argsz);
+bind->argsz = argsz;
+bind->flags = VFIO_IOMMU_UNBIND_GUEST_PGTBL;
+memcpy(&bind->data, &bind_data->bind_data, sizeof(bind_data->bind_data));
+
+if (ioctl(container->fd, VFIO_IOMMU_BIND, bind)) {
+ret = -errno;
+error_report("%s: pasid (%u) unbind failed: %d",
+  __func__, bind_data->pasid, ret);
+}
+g_free(bind);
+return ret;
+}
+
 /**
  * Get iommu info from host. Caller of this function should free
  * the memory pointed by the returned pointer stored in @info
@@ -1350,10 +1398,13 @@ static int vfio_init_container(VFIOContainer 
*container, int group_fd,
 
 flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
  HOST_IOMMU_PASID_REQUEST : 0;
+flags |= HOST_IOMMU_NESTING;
+
+host_iommu_ctx_init(&container->iommu_ctx,
 sizeof(container->iommu_ctx),
 

[PATCH v2 17/22] intel_iommu: do not pass down pasid bind for PASID #0

2020-03-29 Thread Liu Yi L
The RID_PASID field was introduced in the VT-d 3.0 spec; it is used
for DMA requests w/o PASID in scalable mode VT-d. Such requests are
also known as IOVA requests. The VT-d 3.1 spec has this definition for it:

"Implementations not supporting RID_PASID capability
(ECAP_REG.RPS is 0b), use a PASID value of 0 to perform
address translation for requests without PASID."

This patch adds a check against the PASIDs which are going to be
bound to devices. For PASID #0, it is not necessary to pass down a
pasid bind request, since PASID #0 is used as RID_PASID for
DMA requests without pasid. A further reason is that the current Intel
vIOMMU supports gIOVA by shadowing the guest 2nd level page table.
However, in the future, if the guest IOMMU driver uses a 1st level page
table to store IOVA mappings, then guest IOVA support will also be done
via nested translation. When gIOVA is over FLPT, the vIOMMU should pass
down the pasid bind request for PASID #0 to the host, and the host needs
to bind the guest IOVA page table to a proper PASID, e.g. the PASID value
in the RID_PASID field for a PF/VF if ECAP_REG.RPS is clear, or the
default PASID for an ADI (Assignable Device Interface in the Scalable
IOV solution).

IOVA over FLPT support on Intel VT-d:
https://lkml.org/lkml/2019/9/23/297

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Reviewed-by: Peter Xu 
Signed-off-by: Liu Yi L 
---
 hw/i386/intel_iommu.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 883aeac..074d966 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1895,6 +1895,16 @@ static int vtd_bind_guest_pasid(IntelIOMMUState *s, 
VTDBus *vtd_bus,
 struct iommu_gpasid_bind_data *g_bind_data;
 int ret = -1;
 
+if (pasid < VTD_HPASID_MIN) {
+/*
+ * If pasid < VTD_HPASID_MIN, this pasid is not allocated
+ * from host. No need to pass down the changes on it to host.
+ * TODO: when IOVA over FLPT is ready, this switch should be
+ * refined.
+ */
+return 0;
+}
+
 vtd_dev_icx = vtd_bus->dev_icx[devfn];
 if (!vtd_dev_icx) {
 /* means no need to go further, e.g. for emulated devices */
-- 
2.7.4




[PATCH v2 16/22] intel_iommu: replay pasid binds after context cache invalidation

2020-03-29 Thread Liu Yi L
This patch replays guest pasid bindings after a context cache
invalidation. This is done to ensure safety. Actually, the
programmer should issue a pasid cache invalidation with the proper
granularity after issuing a context cache invalidation.

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Signed-off-by: Liu Yi L 
---
 hw/i386/intel_iommu.c  | 51 ++
 hw/i386/intel_iommu_internal.h |  6 -
 hw/i386/trace-events   |  1 +
 3 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d87f608..883aeac 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -68,6 +68,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState 
*s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
 static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+  VTDBus *vtd_bus, uint16_t devfn);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -1853,7 +1857,10 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+VTDPASIDCacheInfo pc_info;
+
 trace_vtd_inv_desc_cc_global();
+
 /* Protects context cache */
 vtd_iommu_lock(s);
 s->context_cache_gen++;
@@ -1870,6 +1877,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState 
*s)
  * VT-d emulation codes.
  */
 vtd_iommu_replay_all(s);
+
+pc_info.flags = VTD_PASID_CACHE_GLOBAL;
+vtd_pasid_cache_sync(s, &pc_info);
 }
 
 /**
@@ -2005,6 +2015,22 @@ static void 
vtd_context_device_invalidate(IntelIOMMUState *s,
  * happened.
  */
 vtd_sync_shadow_page_table(vtd_as);
+/*
+ * Per spec, a context flush should also be followed by a PASID
+ * cache and iotlb flush. Regarding a device selective
+ * context cache invalidation:
+ * if (emulated_device)
+ *modify the pasid cache gen and pasid-based iotlb gen
+ *value (will be added in following patches)
+ * else if (assigned_device)
+ *check if the device has been bound to any pasid
+ *invoke pasid_unbind for each bound pasid
+ * Here, we have vtd_pasid_cache_devsi() to invalidate pasid
+ * caches, while for the piotlb in QEMU, we don't have it yet, so
+ * there is no handling. For an assigned device, the host iommu
+ * driver would flush the piotlb when a pasid unbind is passed down.
+ */
+ vtd_pasid_cache_devsi(s, vtd_bus, devfn_it);
 }
 }
 }
@@ -2619,6 +2645,12 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer 
value,
 /* Fall through */
 case VTD_PASID_CACHE_GLOBAL:
 break;
+case VTD_PASID_CACHE_DEVSI:
+if (pc_info->vtd_bus != vtd_bus ||
+pc_info->devfn == devfn) {
+return false;
+}
+break;
 default:
 error_report("invalid pc_info->flags");
 abort();
@@ -2827,6 +2859,11 @@ static void 
vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
 walk_info.flags |= VTD_PASID_TABLE_DID_SEL_WALK;
 /* loop all assigned devices */
 break;
+case VTD_PASID_CACHE_DEVSI:
+walk_info.vtd_bus = pc_info->vtd_bus;
+walk_info.devfn = pc_info->devfn;
+vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+return;
 case VTD_PASID_CACHE_FORCE_RESET:
 /* For force reset, no need to go further replay */
 return;
@@ -2912,6 +2949,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
 vtd_iommu_unlock(s);
 }
 
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+  VTDBus *vtd_bus, uint16_t devfn)
+{
+VTDPASIDCacheInfo pc_info;
+
+trace_vtd_pasid_cache_devsi(devfn);
+
+pc_info.flags = VTD_PASID_CACHE_DEVSI;
+pc_info.vtd_bus = vtd_bus;
+pc_info.devfn = devfn;
+
+vtd_pasid_cache_sync(s, &pc_info);
+}
+
 /**
  * Caller of this function should hold iommu_lock
  */
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index b9e48ab..9122601 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -529,14 +529,18 @@ struct VTDPASIDCacheInfo {
 #define VTD_PASID_CACHE_GLOBAL (1ULL << 1)
 #define VTD_PASID_CACHE_DOMSI  (1ULL << 2)
 #define VTD_PASID_CACHE_PASIDSI(1ULL << 3)
+#define VTD_PASID_CACHE_DEVSI  (1ULL << 4)
 uint32_t flags;
 uint16_t domain_id;
 uint32_t pasid;
+VTDBus *vtd_bus;

[PATCH v2 04/22] hw/iommu: introduce HostIOMMUContext

2020-03-29 Thread Liu Yi L
Currently, many platform vendors provide the capability of dual stage
DMA address translation in hardware. For example, nested translation
on Intel VT-d scalable mode, nested stage translation on ARM SMMUv3,
etc. In dual stage DMA address translation, there are two stages of
address translation, stage-1 (a.k.a. first-level) and stage-2 (a.k.a.
second-level) translation structures. Stage-1 translation results are
also subject to the stage-2 translation structures. Take vSVA (Virtual
Shared Virtual Addressing) as an example: the guest IOMMU driver owns
the stage-1 translation structures (covering GVA->GPA translation), and
the host IOMMU driver owns the stage-2 translation structures (covering
GPA->HPA translation). The VMM is responsible for binding the stage-1
translation structures to the host, so that hardware can achieve
GVA->GPA and then GPA->HPA translation. For more background on SVA,
refer to the links below.
 - https://www.youtube.com/watch?v=Kq_nfGK5MwQ
 - https://events19.lfasiallc.com/wp-content/uploads/2017/11/\
Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf

In QEMU, vIOMMU emulators expose IOMMUs to the VM per their own spec
(e.g. the Intel VT-d spec). Devices are passed through to the guest via
device pass-through components like VFIO. VFIO is a userspace driver
framework which exposes host IOMMU programming capability to userspace
in a secure manner, e.g. IOVA MAP/UNMAP requests. Thus the major
connection between VFIO and vIOMMU is MAP/UNMAP. However, with dual
stage DMA translation support, there are more interactions between
vIOMMU and VFIO, as below:
 1) PASID allocation (allow host to intercept in PASID allocation)
 2) bind stage-1 translation structures to host
 3) propagate stage-1 cache invalidation to host
 4) DMA address translation fault (I/O page fault) servicing etc.

With the above new interactions in QEMU, an abstraction layer is
required to facilitate these operations and give vIOMMU emulators an
explicit way to call into VFIO. This patch introduces
HostIOMMUContext to stand for a hardware IOMMU w/ dual stage DMA
address translation capability, and introduces HostIOMMUContextClass
to provide methods for vIOMMU emulators to propagate dual-stage
translation related requests to the host. As a beginning, PASID
allocation/free are defined to propagate PASID allocation/free
requests to the host, which is helpful for vendors who manage PASIDs
system-wide. In future, there will be more operations like
bind_stage1_pgtbl, flush_stage1_cache, etc.
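
As a sketch of the call flow this enables, a vIOMMU emulator would
request a host PASID as below (the min/max names are the VT-d ones used
later in this series):

uint32_t pasid;
int ret;

ret = host_iommu_ctx_pasid_alloc(iommu_ctx, VTD_HPASID_MIN,
                                 VTD_HPASID_MAX, &pasid);
if (ret) {
    /* no host PASID available, report failure to the guest */
}
...
host_iommu_ctx_pasid_free(iommu_ctx, pasid);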

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Eric Auger 
Cc: Yi Sun 
Cc: David Gibson 
Cc: Michael S. Tsirkin 
Signed-off-by: Liu Yi L 
---
 hw/Makefile.objs  |  1 +
 hw/iommu/Makefile.objs|  1 +
 hw/iommu/host_iommu_context.c | 97 +++
 include/hw/iommu/host_iommu_context.h | 75 +++
 4 files changed, 174 insertions(+)
 create mode 100644 hw/iommu/Makefile.objs
 create mode 100644 hw/iommu/host_iommu_context.c
 create mode 100644 include/hw/iommu/host_iommu_context.h

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 660e2b4..cab83fe 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -40,6 +40,7 @@ devices-dirs-$(CONFIG_MEM_DEVICE) += mem/
 devices-dirs-$(CONFIG_NUBUS) += nubus/
 devices-dirs-y += semihosting/
 devices-dirs-y += smbios/
+devices-dirs-y += iommu/
 endif
 
 common-obj-y += $(devices-dirs-y)
diff --git a/hw/iommu/Makefile.objs b/hw/iommu/Makefile.objs
new file mode 100644
index 000..e6eed4e
--- /dev/null
+++ b/hw/iommu/Makefile.objs
@@ -0,0 +1 @@
+obj-y += host_iommu_context.o
diff --git a/hw/iommu/host_iommu_context.c b/hw/iommu/host_iommu_context.c
new file mode 100644
index 000..5fb2223
--- /dev/null
+++ b/hw/iommu/host_iommu_context.c
@@ -0,0 +1,97 @@
+/*
+ * QEMU abstract of Host IOMMU
+ *
+ * Copyright (C) 2020 Intel Corporation.
+ *
+ * Authors: Liu Yi L 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qom/object.h"
+#include "qapi/visitor.h"
+#include "hw/iommu/host_iommu_context.h"
+
+int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
+   uint32_t max, uint32_t *pasid)
+{
+HostIOMMUContextClass *hicxc;
+
+if (!iommu_ctx) {
+return -EINVAL;
+}
+
+hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+

[PATCH v2 18/22] vfio: add support for flush iommu stage-1 cache

2020-03-29 Thread Liu Yi L
This patch adds the flush_stage1_cache() definition in
HostIOMMUContextClass, and adds the corresponding implementation in
VFIO. This exposes a way for the vIOMMU to flush the stage-1 cache on
the host side, since the guest owns the stage-1 translation structures
in the dual stage DMA translation configuration.

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Eric Auger 
Cc: Yi Sun 
Cc: David Gibson 
Cc: Alex Williamson 
Acked-by: Peter Xu 
Signed-off-by: Liu Yi L 
---
 hw/iommu/host_iommu_context.c | 19 +++
 hw/vfio/common.c  | 25 +
 include/hw/iommu/host_iommu_context.h | 14 ++
 3 files changed, 58 insertions(+)

diff --git a/hw/iommu/host_iommu_context.c b/hw/iommu/host_iommu_context.c
index 8ae20fe..e884752 100644
--- a/hw/iommu/host_iommu_context.c
+++ b/hw/iommu/host_iommu_context.c
@@ -113,6 +113,25 @@ int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext 
*iommu_ctx,
 return hicxc->unbind_stage1_pgtbl(iommu_ctx, data);
 }
 
+int host_iommu_ctx_flush_stage1_cache(HostIOMMUContext *iommu_ctx,
+  DualIOMMUStage1Cache *cache)
+{
+HostIOMMUContextClass *hicxc;
+
+hicxc = HOST_IOMMU_CONTEXT_GET_CLASS(iommu_ctx);
+
+if (!hicxc) {
+return -EINVAL;
+}
+
+if (!(iommu_ctx->flags & HOST_IOMMU_NESTING) ||
+!hicxc->flush_stage1_cache) {
+return -EINVAL;
+}
+
+return hicxc->flush_stage1_cache(iommu_ctx, cache);
+}
+
 void host_iommu_ctx_init(void *_iommu_ctx, size_t instance_size,
  const char *mrtypename,
  uint64_t flags, uint32_t formats)
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 465e4d8..6b730b6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1274,6 +1274,30 @@ static int 
vfio_host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
 return ret;
 }
 
+static int vfio_host_iommu_ctx_flush_stage1_cache(HostIOMMUContext *iommu_ctx,
+DualIOMMUStage1Cache *cache)
+{
+VFIOContainer *container = container_of(iommu_ctx,
+VFIOContainer, iommu_ctx);
+struct vfio_iommu_type1_cache_invalidate *cache_inv;
+unsigned long argsz;
+int ret = 0;
+
+argsz = sizeof(*cache_inv) + sizeof(cache->cache_info);
+cache_inv = g_malloc0(argsz);
+cache_inv->argsz = argsz;
+cache_inv->flags = 0;
+memcpy(&cache_inv->cache_info, &cache->cache_info,
+   sizeof(cache->cache_info));
+
+if (ioctl(container->fd, VFIO_IOMMU_CACHE_INVALIDATE, cache_inv)) {
+error_report("%s: iommu cache flush failed: %d", __func__, -errno);
+ret = -errno;
+}
+g_free(cache_inv);
+return ret;
+}
+
 /**
  * Get iommu info from host. Caller of this function should free
  * the memory pointed by the returned pointer stored in @info
@@ -1998,6 +2022,7 @@ static void 
vfio_host_iommu_context_class_init(ObjectClass *klass,
 hicxc->pasid_free = vfio_host_iommu_ctx_pasid_free;
 hicxc->bind_stage1_pgtbl = vfio_host_iommu_ctx_bind_stage1_pgtbl;
 hicxc->unbind_stage1_pgtbl = vfio_host_iommu_ctx_unbind_stage1_pgtbl;
+hicxc->flush_stage1_cache = vfio_host_iommu_ctx_flush_stage1_cache;
 }
 
 static const TypeInfo vfio_host_iommu_context_info = {
diff --git a/include/hw/iommu/host_iommu_context.h 
b/include/hw/iommu/host_iommu_context.h
index 44daca9..69b1b7b 100644
--- a/include/hw/iommu/host_iommu_context.h
+++ b/include/hw/iommu/host_iommu_context.h
@@ -42,6 +42,7 @@
 
 typedef struct HostIOMMUContext HostIOMMUContext;
 typedef struct DualIOMMUStage1BindData DualIOMMUStage1BindData;
+typedef struct DualIOMMUStage1Cache DualIOMMUStage1Cache;
 
 typedef struct HostIOMMUContextClass {
 /* private */
@@ -65,6 +66,12 @@ typedef struct HostIOMMUContextClass {
 /* Undo a previous bind. @bind_data specifies the unbind info. */
 int (*unbind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
DualIOMMUStage1BindData *bind_data);
+/*
+ * Propagate stage-1 cache flush to host IOMMU, cache
+ * info specified in @cache
+ */
+int (*flush_stage1_cache)(HostIOMMUContext *iommu_ctx,
+  DualIOMMUStage1Cache *cache);
 } HostIOMMUContextClass;
 
 /*
@@ -86,6 +93,11 @@ struct DualIOMMUStage1BindData {
 } bind_data;
 };
 
+struct DualIOMMUStage1Cache {
+uint32_t pasid;
+struct iommu_cache_invalidate_info cache_info;
+};
+
 int host_iommu_ctx_pasid_alloc(HostIOMMUContext *iommu_ctx, uint32_t min,
uint32_t max, uint32_t *pasid);
 int host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx, uint32_t pasid);
@@ -93,6 +105,8 @@ int host_iommu_ctx_bind_stage1_pgtbl(HostIOMMUContext 
*iommu_ctx,
  DualIOMMUStage1BindData *data);
 int host_iommu_ctx_unbind_stage1_pgtbl(HostIOMMUContext *iommu_ctx,
DualIOMMUStage1BindData 

[PATCH v2 11/22] intel_iommu: add virtual command capability support

2020-03-29 Thread Liu Yi L
This patch adds virtual command support to the Intel vIOMMU per the
Intel VT-d 3.1 spec, and adds two virtual commands: allocate
pasid and free pasid.
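
From the guest software point of view, the protocol is roughly as below
(sketch, with illustrative register accessors):

uint64_t rsp;

/* request a PASID allocation */
vtd_write_quad(DMAR_VCMD_REG, VTD_VCMD_ALLOC_PASID);
/* wait for the emulated IP bit to clear */
while (vtd_read_quad(DMAR_VCRSP_REG) & 1) {
    ;
}
rsp = vtd_read_quad(DMAR_VCRSP_REG);
/* the status code and the allocated PASID are encoded in VCRSP */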

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Reviewed-by: Peter Xu 
Signed-off-by: Liu Yi L 
Signed-off-by: Yi Sun 
---
 hw/i386/intel_iommu.c  | 154 -
 hw/i386/intel_iommu_internal.h |  37 ++
 hw/i386/trace-events   |   1 +
 include/hw/i386/intel_iommu.h  |  10 ++-
 4 files changed, 200 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fd349c6..6c3159f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2651,6 +2651,129 @@ static void vtd_handle_iectl_write(IntelIOMMUState *s)
 }
 }
 
+static int vtd_request_pasid_alloc(IntelIOMMUState *s, uint32_t *pasid)
+{
+VTDHostIOMMUContext *vtd_dev_icx;
+int ret = -1;
+
+vtd_iommu_lock(s);
+QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
+HostIOMMUContext *iommu_ctx = vtd_dev_icx->iommu_ctx;
+
+/*
+ * We'll return the first valid result we got. It's
+ * a bit hackish in that we don't have a good global
+ * interface yet to talk to modules like vfio to deliver
+ * this allocation request, so we're leveraging this
+ * per-device iommu context to do the same thing just
+ * to make sure the allocation happens only once.
+ */
+ret = host_iommu_ctx_pasid_alloc(iommu_ctx, VTD_HPASID_MIN,
+ VTD_HPASID_MAX, pasid);
+if (!ret) {
+break;
+}
+}
+vtd_iommu_unlock(s);
+
+return ret;
+}
+
+static int vtd_request_pasid_free(IntelIOMMUState *s, uint32_t pasid)
+{
+VTDHostIOMMUContext *vtd_dev_icx;
+int ret = -1;
+
+vtd_iommu_lock(s);
+QLIST_FOREACH(vtd_dev_icx, &s->vtd_dev_icx_list, next) {
+HostIOMMUContext *iommu_ctx = vtd_dev_icx->iommu_ctx;
+
+/*
+ * Similar with pasid allocation. We'll free the pasid
+ * on the first successful free operation. It's a bit
+ * hackish in that we don't have a good global interface
+ * yet to talk to modules like vfio to deliver this pasid
+ * free request, so we're leveraging this per-device iommu
+ * context to do the same thing just to make sure the free
+ * happens only once.
+ */
+ret = host_iommu_ctx_pasid_free(iommu_ctx, pasid);
+if (!ret) {
+break;
+}
+}
+vtd_iommu_unlock(s);
+
+return ret;
+}
+
+/*
+ * If IP is not set, set it then return.
+ * If IP is already set, return.
+ */
+static void vtd_vcmd_set_ip(IntelIOMMUState *s)
+{
+s->vcrsp = 1;
+vtd_set_quad_raw(s, DMAR_VCRSP_REG,
+ ((uint64_t) s->vcrsp));
+}
+
+static void vtd_vcmd_clear_ip(IntelIOMMUState *s)
+{
+s->vcrsp &= (~((uint64_t)(0x1)));
+vtd_set_quad_raw(s, DMAR_VCRSP_REG,
+ ((uint64_t) s->vcrsp));
+}
+
+/* Handle write to Virtual Command Register */
+static int vtd_handle_vcmd_write(IntelIOMMUState *s, uint64_t val)
+{
+uint32_t pasid;
+int ret = -1;
+
+trace_vtd_reg_write_vcmd(s->vcrsp, val);
+
+if (!(s->vccap & VTD_VCCAP_PAS) ||
+ (s->vcrsp & 1)) {
+return -1;
+}
+
+/*
+ * The vCPU should be blocked while this guest VCMD write is
+ * trapped here, so no other vCPU should be accessing VCMD if
+ * guest software is well behaved. However, we still emulate
+ * the IP bit here in case of bad guest software, and to
+ * align with the spec.
+ */
+vtd_vcmd_set_ip(s);
+
+switch (val & VTD_VCMD_CMD_MASK) {
+case VTD_VCMD_ALLOC_PASID:
+ret = vtd_request_pasid_alloc(s, &pasid);
+if (ret) {
+s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_NO_AVAILABLE_PASID);
+} else {
+s->vcrsp |= VTD_VCRSP_RSLT(pasid);
+}
+break;
+
+case VTD_VCMD_FREE_PASID:
+pasid = VTD_VCMD_PASID_VALUE(val);
+ret = vtd_request_pasid_free(s, pasid);
+if (ret < 0) {
+s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_FREE_INVALID_PASID);
+}
+break;
+
+default:
+s->vcrsp |= VTD_VCRSP_SC(VTD_VCMD_UNDEFINED_CMD);
+error_report_once("Virtual Command: unsupported command!!!");
+break;
+}
+vtd_vcmd_clear_ip(s);
+return 0;
+}
+
 static uint64_t vtd_mem_read(void *opaque, hwaddr addr, unsigned size)
 {
 IntelIOMMUState *s = opaque;
@@ -2939,6 +3062,23 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
 vtd_set_long(s, addr, val);
 break;
 
+case DMAR_VCMD_REG:
+if (!vtd_handle_vcmd_write(s, val)) {
+if (size == 4) {
+vtd_set_long(s, addr, val);
+} else {
+vtd_set_quad(s, addr, val);
+}
+}
+

[PATCH v2 12/22] intel_iommu: process PASID cache invalidation

2020-03-29 Thread Liu Yi L
This patch adds PASID cache invalidation handling. When the guest
enables PASID usages (e.g. SVA), guest software should issue proper
PASID cache invalidations when caching-mode is exposed. This patch only
adds draft handling of PASID cache invalidation; detailed handling will
be added in subsequent patches.

v1 -> v2: remove vtd_pasid_cache_gsi(), vtd_pasid_cache_psi() and
  vtd_pasid_cache_dsi()
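
To make the descriptor layout concrete, below is a hedged sketch of
composing the low 64 bits of a PASID-selective PC invalidation
descriptor, mirroring the field layout decoded in this patch; the 0x7
descriptor type code is an assumption from the VT-d spec, not taken
from this diff:

#include <stdint.h>

#define INV_DESC_TYPE_PC   0x7ULL        /* pasid-cache invalidate (assumed code) */
#define PC_GRAN_DSI        (0ULL << 4)   /* domain-selective */
#define PC_GRAN_PASID_SI   (1ULL << 4)   /* pasid-selective within domain */
#define PC_GRAN_GLOBAL     (3ULL << 4)   /* global */

/* Compose the low 64 bits of a pasid-selective PC invalidation
 * descriptor: granularity in bits 5:4, domain id in bits 31:16,
 * 20-bit pasid in bits 51:32. */
static uint64_t pc_inv_desc_lo(uint16_t did, uint32_t pasid)
{
    return INV_DESC_TYPE_PC
           | PC_GRAN_PASID_SI
           | ((uint64_t)did << 16)
           | (((uint64_t)pasid & 0xfffff) << 32);
}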

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Reviewed-by: Peter Xu 
Signed-off-by: Liu Yi L 
---
 hw/i386/intel_iommu.c  | 40 +++-
 hw/i386/intel_iommu_internal.h | 12 
 hw/i386/trace-events   |  3 +++
 3 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6c3159f..2eb60c3 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2395,6 +2395,37 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, 
VTDInvDesc *inv_desc)
 return true;
 }
 
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+   VTDInvDesc *inv_desc)
+{
+if ((inv_desc->val[0] & VTD_INV_DESC_PASIDC_RSVD_VAL0) ||
+(inv_desc->val[1] & VTD_INV_DESC_PASIDC_RSVD_VAL1) ||
+(inv_desc->val[2] & VTD_INV_DESC_PASIDC_RSVD_VAL2) ||
+(inv_desc->val[3] & VTD_INV_DESC_PASIDC_RSVD_VAL3)) {
+error_report_once("non-zero-field-in-pc_inv_desc hi: 0x%" PRIx64
+  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+return false;
+}
+
+switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
+case VTD_INV_DESC_PASIDC_DSI:
+break;
+
+case VTD_INV_DESC_PASIDC_PASID_SI:
+break;
+
+case VTD_INV_DESC_PASIDC_GLOBAL:
+break;
+
+default:
+error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
+  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+return false;
+}
+
+return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
  VTDInvDesc *inv_desc)
 {
@@ -2501,12 +2532,11 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
 }
 break;
 
-/*
- * TODO: the entity of below two cases will be implemented in future 
series.
- * To make guest (which integrates scalable mode support patch set in
- * iommu driver) work, just return true is enough so far.
- */
 case VTD_INV_DESC_PC:
+trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+if (!vtd_process_pasid_desc(s, &inv_desc)) {
+return false;
+}
 break;
 
 case VTD_INV_DESC_PIOTLB:
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 3fc83f1..9a76f20 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -444,6 +444,18 @@ typedef union VTDInvDesc VTDInvDesc;
 (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
 (0x3800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
 
+#define VTD_INV_DESC_PASIDC_G  (3ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PASIDC_DID(val)   (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0  0xfff000000000ffc0ULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL1  0xffffffffffffffffULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL2  0xffffffffffffffffULL
+#define VTD_INV_DESC_PASIDC_RSVD_VAL3  0xffffffffffffffffULL
+
+#define VTD_INV_DESC_PASIDC_DSI(0ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
+#define VTD_INV_DESC_PASIDC_GLOBAL (3ULL << 4)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
 uint16_t domain_id;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 71536a7..f7cd4e5 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -22,6 +22,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.7.4




[PATCH v2 06/22] hw/pci: introduce pci_device_set/unset_iommu_context()

2020-03-29 Thread Liu Yi L
This patch adds pci_device_set/unset_iommu_context() to set/unset
host_iommu_context for a given device. New callbacks are added to
PCIIOMMUOps. As such, the vIOMMU can make use of host IOMMU
capabilities, e.g. to set up nested translation.
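
As a usage illustration (a minimal sketch; the example_* names are
hypothetical, not part of the patch), a pass-through backend would wire
its HostIOMMUContext in at device attach time and drop it on detach:

static int example_attach_device(PCIDevice *pdev,
                                 HostIOMMUContext *iommu_ctx,
                                 Error **errp)
{
    int ret = pci_device_set_iommu_context(pdev, iommu_ctx);

    if (ret) {
        /* e.g. the vIOMMU has no set_iommu_context callback */
        error_setg_errno(errp, -ret,
                         "vIOMMU did not accept the host iommu context");
    }
    return ret;
}

static void example_detach_device(PCIDevice *pdev)
{
    pci_device_unset_iommu_context(pdev);
}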

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Eric Auger 
Cc: Yi Sun 
Cc: David Gibson 
Cc: Michael S. Tsirkin 
Reviewed-by: Peter Xu 
Signed-off-by: Liu Yi L 
---
 hw/pci/pci.c | 49 -
 include/hw/pci/pci.h | 10 ++
 2 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index aa9025c..af3c1a1 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2638,7 +2638,8 @@ static void pci_device_class_base_init(ObjectClass 
*klass, void *data)
 }
 }
 
-AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
+  PCIBus **pbus, uint8_t *pdevfn)
 {
 PCIBus *bus = pci_get_bus(dev);
 PCIBus *iommu_bus = bus;
@@ -2683,14 +2684,52 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice 
*dev)
 
 iommu_bus = parent_bus;
 }
-if (iommu_bus && iommu_bus->iommu_ops &&
- iommu_bus->iommu_ops->get_address_space) {
-return iommu_bus->iommu_ops->get_address_space(bus,
- iommu_bus->iommu_opaque, devfn);
+*pbus = iommu_bus;
+*pdevfn = devfn;
+}
+
+AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
+{
+PCIBus *bus;
+uint8_t devfn;
+
+pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
+if (bus && bus->iommu_ops &&
+bus->iommu_ops->get_address_space) {
+return bus->iommu_ops->get_address_space(bus,
+bus->iommu_opaque, devfn);
 }
 return &address_space_memory;
 }
 
+int pci_device_set_iommu_context(PCIDevice *dev,
+ HostIOMMUContext *iommu_ctx)
+{
+PCIBus *bus;
+uint8_t devfn;
+
+pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
+if (bus && bus->iommu_ops &&
+bus->iommu_ops->set_iommu_context) {
+return bus->iommu_ops->set_iommu_context(bus,
+  bus->iommu_opaque, devfn, iommu_ctx);
+}
+return -ENOENT;
+}
+
+void pci_device_unset_iommu_context(PCIDevice *dev)
+{
+PCIBus *bus;
+uint8_t devfn;
+
+pci_device_get_iommu_bus_devfn(dev, &bus, &devfn);
+if (bus && bus->iommu_ops &&
+bus->iommu_ops->unset_iommu_context) {
+bus->iommu_ops->unset_iommu_context(bus,
+ bus->iommu_opaque, devfn);
+}
+}
+
 void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
 {
 bus->iommu_ops = ops;
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index ffe192d..0ec5680 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -9,6 +9,8 @@
 
 #include "hw/pci/pcie.h"
 
+#include "hw/iommu/host_iommu_context.h"
+
 extern bool pci_available;
 
 /* PCI bus */
@@ -489,9 +491,17 @@ typedef struct PCIIOMMUOps PCIIOMMUOps;
 struct PCIIOMMUOps {
 AddressSpace * (*get_address_space)(PCIBus *bus,
 void *opaque, int32_t devfn);
+int (*set_iommu_context)(PCIBus *bus, void *opaque,
+ int32_t devfn,
+ HostIOMMUContext *iommu_ctx);
+void (*unset_iommu_context)(PCIBus *bus, void *opaque,
+int32_t devfn);
 };
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
+int pci_device_set_iommu_context(PCIDevice *dev,
+ HostIOMMUContext *iommu_ctx);
+void pci_device_unset_iommu_context(PCIDevice *dev);
 void pci_setup_iommu(PCIBus *bus, const PCIIOMMUOps *iommu_ops, void *opaque);
 
 static inline void
-- 
2.7.4




[PATCH v2 15/22] intel_iommu: bind/unbind guest page table to host

2020-03-29 Thread Liu Yi L
This patch captures guest PASID table entry modifications and
propagates the changes to the host to set up dual-stage DMA translation.
The guest page table is configured as the 1st-level page table (GVA->GPA),
whose translation result further goes through the host VT-d 2nd-level
page table (GPA->HPA) under nested translation mode. This is the
key part of vSVA support, and also key to supporting IOVA over the
1st-level page table for Intel VT-d in a virtualization environment.
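
As a quick standalone check of the first-level address-width
computation used later in this patch (illustrative only):

#include <assert.h>
#include <stdint.h>

/* Mirrors vtd_pe_get_fl_aw() added below: the first-level address
 * width is 48 + FLPM * 9 bits, i.e. 4-level paging (FLPM = 0) gives
 * a 48-bit GVA space and 5-level paging (FLPM = 1) gives 57 bits. */
static uint32_t fl_aw(uint32_t flpm)
{
    return 48 + flpm * 9;
}

int main(void)
{
    assert(fl_aw(0) == 48);   /* 4-level first-level paging */
    assert(fl_aw(1) == 57);   /* 5-level first-level paging */
    return 0;
}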

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Signed-off-by: Liu Yi L 
---
 hw/i386/intel_iommu.c  | 98 +++---
 hw/i386/intel_iommu_internal.h | 18 
 2 files changed, 111 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a7e9973..d87f608 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -41,6 +41,7 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 #include "qemu/jhash.h"
+#include 
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -700,6 +701,16 @@ static inline uint32_t 
vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
 return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
 }
 
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
+}
+
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
 return pdire->val & 1;
@@ -1861,6 +1872,82 @@ static void 
vtd_context_global_invalidate(IntelIOMMUState *s)
 vtd_iommu_replay_all(s);
 }
 
+/**
+ * Caller should hold iommu_lock.
+ */
+static int vtd_bind_guest_pasid(IntelIOMMUState *s, VTDBus *vtd_bus,
+int devfn, int pasid, VTDPASIDEntry *pe,
+VTDPASIDOp op)
+{
+VTDHostIOMMUContext *vtd_dev_icx;
+HostIOMMUContext *iommu_ctx;
+DualIOMMUStage1BindData *bind_data;
+struct iommu_gpasid_bind_data *g_bind_data;
+int ret = -1;
+
+vtd_dev_icx = vtd_bus->dev_icx[devfn];
+if (!vtd_dev_icx) {
+/* means no need to go further, e.g. for emulated devices */
+return 0;
+}
+
+iommu_ctx = vtd_dev_icx->iommu_ctx;
+if (!iommu_ctx) {
+return -EINVAL;
+}
+
+if (!(iommu_ctx->stage1_formats
+ & IOMMU_PASID_FORMAT_INTEL_VTD)) {
+error_report_once("IOMMU Stage 1 format is not compatible!\n");
+return -EINVAL;
+}
+
+bind_data = g_malloc0(sizeof(*bind_data));
+bind_data->pasid = pasid;
+g_bind_data = &bind_data->bind_data.gpasid_bind;
+
+g_bind_data->flags = 0;
+g_bind_data->vtd.flags = 0;
+switch (op) {
+case VTD_PASID_BIND:
+g_bind_data->version = IOMMU_UAPI_VERSION;
+g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+g_bind_data->gpgd = vtd_pe_get_flpt_base(pe);
+g_bind_data->addr_width = vtd_pe_get_fl_aw(pe);
+g_bind_data->hpasid = pasid;
+g_bind_data->gpasid = pasid;
+g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
+g_bind_data->vtd.flags =
+ (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ? 1 : 0)
+   | (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ? 1 : 0)
+   | (VTD_SM_PASID_ENTRY_PCD_BIT(pe->val[1]) ? 1 : 0)
+   | (VTD_SM_PASID_ENTRY_PWT_BIT(pe->val[1]) ? 1 : 0)
+   | (VTD_SM_PASID_ENTRY_EMTE_BIT(pe->val[1]) ? 1 : 0)
+   | (VTD_SM_PASID_ENTRY_CD_BIT(pe->val[1]) ? 1 : 0);
+g_bind_data->vtd.pat = VTD_SM_PASID_ENTRY_PAT(pe->val[1]);
+g_bind_data->vtd.emt = VTD_SM_PASID_ENTRY_EMT(pe->val[1]);
+ret = host_iommu_ctx_bind_stage1_pgtbl(iommu_ctx, bind_data);
+break;
+case VTD_PASID_UNBIND:
+g_bind_data->version = IOMMU_UAPI_VERSION;
+g_bind_data->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+g_bind_data->gpgd = 0;
+g_bind_data->addr_width = 0;
+g_bind_data->hpasid = pasid;
+g_bind_data->gpasid = pasid;
+g_bind_data->flags |= IOMMU_SVA_GPASID_VAL;
+ret = host_iommu_ctx_unbind_stage1_pgtbl(iommu_ctx, bind_data);
+break;
+default:
+error_report_once("Unknown VTDPASIDOp!!!\n");
+break;
+}
+
+g_free(bind_data);
+
+return ret;
+}
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -2489,10 +2576,10 @@ static void vtd_fill_pe_in_cache(IntelIOMMUState *s,
 }
 
 pc_entry->pasid_entry = *pe;
-/*
- * TODO:
- * - send pasid bind to host for passthru devices
- */
+vtd_bind_guest_pasid(s, vtd_pasid_as->vtd_bus,
+ vtd_pasid_as->devfn,
+ vtd_pasid_as->pasid,
+ pe, 

[PATCH v2 05/22] hw/pci: modify pci_setup_iommu() to set PCIIOMMUOps

2020-03-29 Thread Liu Yi L
This patch modifies pci_setup_iommu() to set PCIIOMMUOps
instead of setting PCIIOMMUFunc. PCIIOMMUFunc was used to
get an address space for a PCI device in a vendor-specific
way. PCIIOMMUOps still offers this functionality, but
leaves room to add more vendor-specific IOMMU operations.
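
A minimal sketch of the new registration pattern, assuming a
hypothetical host bridge state "FooState" with a single shared DMA
address space (mirroring the per-board conversions below):

typedef struct FooState {
    AddressSpace dma_as;    /* one DMA address space for all devfns */
} FooState;

static AddressSpace *foo_dma_iommu(PCIBus *bus, void *opaque, int devfn)
{
    FooState *s = opaque;

    return &s->dma_as;
}

static const PCIIOMMUOps foo_iommu_ops = {
    .get_address_space = foo_dma_iommu,
    /* further vendor-specific callbacks slot in here later */
};

static void foo_realize(FooState *s, PCIBus *bus)
{
    pci_setup_iommu(bus, &foo_iommu_ops, s);
}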

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Eric Auger 
Cc: Yi Sun 
Cc: David Gibson 
Cc: Michael S. Tsirkin 
Reviewed-by: David Gibson 
Reviewed-by: Peter Xu 
Signed-off-by: Liu Yi L 
---
 hw/alpha/typhoon.c   |  6 +-
 hw/arm/smmu-common.c |  6 +-
 hw/hppa/dino.c   |  6 +-
 hw/i386/amd_iommu.c  |  6 +-
 hw/i386/intel_iommu.c|  6 +-
 hw/pci-host/designware.c |  6 +-
 hw/pci-host/pnv_phb3.c   |  6 +-
 hw/pci-host/pnv_phb4.c   |  6 +-
 hw/pci-host/ppce500.c|  6 +-
 hw/pci-host/prep.c   |  6 +-
 hw/pci-host/sabre.c  |  6 +-
 hw/pci/pci.c | 12 +++-
 hw/ppc/ppc440_pcix.c |  6 +-
 hw/ppc/spapr_pci.c   |  6 +-
 hw/s390x/s390-pci-bus.c  |  8 ++--
 hw/virtio/virtio-iommu.c |  6 +-
 include/hw/pci/pci.h |  8 ++--
 include/hw/pci/pci_bus.h |  2 +-
 18 files changed, 90 insertions(+), 24 deletions(-)

diff --git a/hw/alpha/typhoon.c b/hw/alpha/typhoon.c
index 1795e2f..f271de1 100644
--- a/hw/alpha/typhoon.c
+++ b/hw/alpha/typhoon.c
@@ -740,6 +740,10 @@ static AddressSpace *typhoon_pci_dma_iommu(PCIBus *bus, 
void *opaque, int devfn)
 return &s->pchip.iommu_as;
 }
 
+static const PCIIOMMUOps typhoon_iommu_ops = {
+.get_address_space = typhoon_pci_dma_iommu,
+};
+
 static void typhoon_set_irq(void *opaque, int irq, int level)
 {
 TyphoonState *s = opaque;
@@ -897,7 +901,7 @@ PCIBus *typhoon_init(MemoryRegion *ram, ISABus **isa_bus, 
qemu_irq *p_rtc_irq,
  "iommu-typhoon", UINT64_MAX);
 address_space_init(>pchip.iommu_as, MEMORY_REGION(>pchip.iommu),
"pchip0-pci");
-pci_setup_iommu(b, typhoon_pci_dma_iommu, s);
+pci_setup_iommu(b, &typhoon_iommu_ops, s);
 
 /* Pchip0 PCI special/interrupt acknowledge, 0x801.F800.0000, 64MB.  */
 memory_region_init_io(>pchip.reg_iack, OBJECT(s), _pci_iack_ops,
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index e13a5f4..447146e 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -343,6 +343,10 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void 
*opaque, int devfn)
 return &sdev->as;
 }
 
+static const PCIIOMMUOps smmu_ops = {
+.get_address_space = smmu_find_add_as,
+};
+
 IOMMUMemoryRegion *smmu_iommu_mr(SMMUState *s, uint32_t sid)
 {
 uint8_t bus_n, devfn;
@@ -437,7 +441,7 @@ static void smmu_base_realize(DeviceState *dev, Error 
**errp)
 s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
 
 if (s->primary_bus) {
-pci_setup_iommu(s->primary_bus, smmu_find_add_as, s);
+pci_setup_iommu(s->primary_bus, &smmu_ops, s);
 } else {
 error_setg(errp, "SMMU is not attached to any PCI bus!");
 }
diff --git a/hw/hppa/dino.c b/hw/hppa/dino.c
index 2b1b38c..3da4f84 100644
--- a/hw/hppa/dino.c
+++ b/hw/hppa/dino.c
@@ -459,6 +459,10 @@ static AddressSpace *dino_pcihost_set_iommu(PCIBus *bus, 
void *opaque,
 return &s->bm_as;
 }
 
+static const PCIIOMMUOps dino_iommu_ops = {
+.get_address_space = dino_pcihost_set_iommu,
+};
+
 /*
  * Dino interrupts are connected as shown on Page 78, Table 23
  * (Little-endian bit numbers)
@@ -580,7 +584,7 @@ PCIBus *dino_init(MemoryRegion *addr_space,
 memory_region_add_subregion(&s->bm, 0xfff00000,
                             &s->bm_cpu_alias);
 address_space_init(&s->bm_as, &s->bm, "pci-bm");
-pci_setup_iommu(b, dino_pcihost_set_iommu, s);
+pci_setup_iommu(b, &dino_iommu_ops, s);
 
 *p_rtc_irq = qemu_allocate_irq(dino_set_timer_irq, s, 0);
 *p_ser_irq = qemu_allocate_irq(dino_set_serial_irq, s, 0);
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index b1175e5..5fec30e 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1451,6 +1451,10 @@ static AddressSpace *amdvi_host_dma_iommu(PCIBus *bus, 
void *opaque, int devfn)
 return &iommu_as[devfn]->as;
 }
 
+static const PCIIOMMUOps amdvi_iommu_ops = {
+.get_address_space = amdvi_host_dma_iommu,
+};
+
 static const MemoryRegionOps mmio_mem_ops = {
 .read = amdvi_mmio_read,
 .write = amdvi_mmio_write,
@@ -1577,7 +1581,7 @@ static void amdvi_realize(DeviceState *dev, Error **errp)
 
 sysbus_init_mmio(SYS_BUS_DEVICE(s), >mmio);
 sysbus_mmio_map(SYS_BUS_DEVICE(s), 0, AMDVI_BASE_ADDR);
-pci_setup_iommu(bus, amdvi_host_dma_iommu, s);
+pci_setup_iommu(bus, &amdvi_iommu_ops, s);
 s->devid = object_property_get_int(OBJECT(&s->pci), "addr", errp);
 msi_init(&s->pci.dev, 0, 1, true, false, errp);
 amdvi_init(s);
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index df7ad25..4b22910 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3729,6 +3729,10 @@ 

[PATCH v2 09/22] vfio/common: init HostIOMMUContext per-container

2020-03-29 Thread Liu Yi L
In this patch, QEMU first gets the iommu info from the kernel to check
the capabilities supported by a VFIO_TYPE1_NESTING_IOMMU, and then
initializes a HostIOMMUContext instance.
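
The info-fetching helper below uses the usual VFIO "argsz grows" idiom;
here is a standalone sketch of that idiom (error handling trimmed,
names illustrative):

#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static struct vfio_iommu_type1_info *get_iommu_info(int container_fd)
{
    size_t argsz = sizeof(struct vfio_iommu_type1_info);
    struct vfio_iommu_type1_info *info = calloc(1, argsz);

    for (;;) {
        info->argsz = argsz;
        if (ioctl(container_fd, VFIO_IOMMU_GET_INFO, info)) {
            free(info);
            return NULL;
        }
        if (info->argsz <= argsz) {
            return info;           /* caller frees when done */
        }
        argsz = info->argsz;       /* kernel reported a larger payload */
        info = realloc(info, argsz);
        memset(info, 0, argsz);    /* re-zero before the retry */
    }
}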

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Eric Auger 
Cc: Yi Sun 
Cc: David Gibson 
Cc: Alex Williamson 
Signed-off-by: Liu Yi L 
---
 hw/vfio/common.c | 99 
 1 file changed, 99 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5f3534d..44b142c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1226,10 +1226,89 @@ static int 
vfio_host_iommu_ctx_pasid_free(HostIOMMUContext *iommu_ctx,
 return 0;
 }
 
+/**
+ * Get iommu info from host. On success the caller is responsible for
+ * freeing the memory pointed to by the returned pointer stored in
+ * @info once it is no longer needed.
+ */
+static int vfio_get_iommu_info(VFIOContainer *container,
+ struct vfio_iommu_type1_info **info)
+{
+
+size_t argsz = sizeof(struct vfio_iommu_type1_info);
+
+*info = g_malloc0(argsz);
+
+retry:
+(*info)->argsz = argsz;
+
+if (ioctl(container->fd, VFIO_IOMMU_GET_INFO, *info)) {
+g_free(*info);
+*info = NULL;
+return -errno;
+}
+
+if (((*info)->argsz > argsz)) {
+argsz = (*info)->argsz;
+*info = g_realloc(*info, argsz);
+goto retry;
+}
+
+return 0;
+}
+
+static struct vfio_info_cap_header *
+vfio_get_iommu_info_cap(struct vfio_iommu_type1_info *info, uint16_t id)
+{
+struct vfio_info_cap_header *hdr;
+void *ptr = info;
+
+if (!(info->flags & VFIO_IOMMU_INFO_CAPS)) {
+return NULL;
+}
+
+for (hdr = ptr + info->cap_offset; hdr != ptr; hdr = ptr + hdr->next) {
+if (hdr->id == id) {
+return hdr;
+}
+}
+
+return NULL;
+}
+
+static int vfio_get_nesting_iommu_cap(VFIOContainer *container,
+   struct vfio_iommu_type1_info_cap_nesting *cap_nesting)
+{
+struct vfio_iommu_type1_info *info;
+struct vfio_info_cap_header *hdr;
+struct vfio_iommu_type1_info_cap_nesting *cap;
+int ret;
+
+ret = vfio_get_iommu_info(container, &info);
+if (ret) {
+return ret;
+}
+
+hdr = vfio_get_iommu_info_cap(info,
+VFIO_IOMMU_TYPE1_INFO_CAP_NESTING);
+if (!hdr) {
+g_free(info);
+return -errno;
+}
+
+cap = container_of(hdr,
+struct vfio_iommu_type1_info_cap_nesting, header);
+*cap_nesting = *cap;
+
+g_free(info);
+return 0;
+}
+
 static int vfio_init_container(VFIOContainer *container, int group_fd,
Error **errp)
 {
 int iommu_type, ret;
+uint64_t flags = 0;
 
 iommu_type = vfio_get_iommu_type(container, errp);
 if (iommu_type < 0) {
@@ -1257,6 +1336,26 @@ static int vfio_init_container(VFIOContainer *container, 
int group_fd,
 return -errno;
 }
 
+if (iommu_type == VFIO_TYPE1_NESTING_IOMMU) {
+struct vfio_iommu_type1_info_cap_nesting nesting = {
+ .nesting_capabilities = 0x0,
+ .stage1_formats = 0, };
+
+ret = vfio_get_nesting_iommu_cap(container, );
+if (ret) {
+error_setg_errno(errp, -ret,
+ "Failed to get nesting iommu cap");
+return ret;
+}
+
+flags |= (nesting.nesting_capabilities & VFIO_IOMMU_PASID_REQS) ?
+ HOST_IOMMU_PASID_REQUEST : 0;
+host_iommu_ctx_init(&container->iommu_ctx,
+sizeof(container->iommu_ctx),
+TYPE_VFIO_HOST_IOMMU_CONTEXT,
+flags);
+}
+
 container->iommu_type = iommu_type;
 return 0;
 }
-- 
2.7.4




[PATCH v2 22/22] intel_iommu: modify x-scalable-mode to be string option

2020-03-29 Thread Liu Yi L
Intel VT-d 3.0 introduces scalable mode, and it has a bunch of capabilities
related to scalable mode translation, thus there are multiple combinations.
This vIOMMU implementation simplifies it for the user by providing typical
combinations. The user can configure it with the "x-scalable-mode" option.
The usage is as below:

"-device intel-iommu,x-scalable-mode=["legacy"|"modern"|"off"]"

 - "legacy": gives support for SL page table
 - "modern": gives support for FL page table, pasid, virtual command
 - "off": no scalable mode support
 -  if not configured, there is no scalable mode support; if improperly
    configured, an error is thrown

Note: this patch is supposed to be merged when the whole vSVA patch series
is merged.
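
For example, a hypothetical invocation enabling the modern set
(everything here besides x-scalable-mode is just illustrative context):

  qemu-system-x86_64 -M q35,kernel-irqchip=split \
      -device intel-iommu,x-scalable-mode=modern,caching-mode=on ...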

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Signed-off-by: Liu Yi L 
Signed-off-by: Yi Sun 
---
 hw/i386/intel_iommu.c  | 30 --
 hw/i386/intel_iommu_internal.h |  4 
 include/hw/i386/intel_iommu.h  |  2 ++
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e8877d4..2e745e8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4056,7 +4056,7 @@ static Property vtd_properties[] = {
 DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
   VTD_HOST_ADDRESS_WIDTH),
 DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
-DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
+DEFINE_PROP_STRING("x-scalable-mode", IntelIOMMUState, scalable_mode_str),
 DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
 DEFINE_PROP_END_OF_LIST(),
 };
@@ -4688,8 +4688,12 @@ static void vtd_init(IntelIOMMUState *s)
 }
 
 /* TODO: read cap/ecap from host to decide which cap to be exposed. */
-if (s->scalable_mode) {
+if (s->scalable_mode && !s->scalable_modern) {
 s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+} else if (s->scalable_mode && s->scalable_modern) {
+s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_PASID
+   | VTD_ECAP_FLTS | VTD_ECAP_PSS | VTD_ECAP_VCS;
+s->vccap |= VTD_VCCAP_PAS;
 }
 
 vtd_reset_caches(s);
@@ -4821,6 +4825,28 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error 
**errp)
 return false;
 }
 
+if (s->scalable_mode_str &&
+(strcmp(s->scalable_mode_str, "off") &&
+ strcmp(s->scalable_mode_str, "modern") &&
+ strcmp(s->scalable_mode_str, "legacy"))) {
+error_setg(errp, "Invalid x-scalable-mode config,"
+ "Please use \"modern\", \"legacy\" or \"off\"");
+return false;
+}
+
+if (s->scalable_mode_str &&
+!strcmp(s->scalable_mode_str, "legacy")) {
+s->scalable_mode = true;
+s->scalable_modern = false;
+} else if (s->scalable_mode_str &&
+!strcmp(s->scalable_mode_str, "modern")) {
+s->scalable_mode = true;
+s->scalable_modern = true;
+} else {
+s->scalable_mode = false;
+s->scalable_modern = false;
+}
+
 return true;
 }
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 4910e63..e0719bc 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -196,8 +196,12 @@
 #define VTD_ECAP_PT (1ULL << 6)
 #define VTD_ECAP_MHMV   (15ULL << 20)
 #define VTD_ECAP_SRS(1ULL << 31)
+#define VTD_ECAP_PSS    (19ULL << 35)
+#define VTD_ECAP_PASID  (1ULL << 40)
 #define VTD_ECAP_SMTS   (1ULL << 43)
+#define VTD_ECAP_VCS(1ULL << 44)
 #define VTD_ECAP_SLTS   (1ULL << 46)
+#define VTD_ECAP_FLTS   (1ULL << 47)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 626c1cd..3831ba7 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -263,6 +263,8 @@ struct IntelIOMMUState {
 
 bool caching_mode;  /* RO - is cap CM enabled? */
 bool scalable_mode; /* RO - is Scalable Mode supported? */
+char *scalable_mode_str;/* RO - admin's Scalable Mode config */
+bool scalable_modern;   /* RO - is modern SM supported? */
 
 dma_addr_t root;/* Current root table pointer */
 bool root_scalable; /* Type of root table (scalable or not) */
-- 
2.7.4




[PATCH v2 07/22] intel_iommu: add set/unset_iommu_context callback

2020-03-29 Thread Liu Yi L
This patch adds the set/unset_iommu_context() implementation in the Intel
vIOMMU. On Intel platforms, pass-through modules (e.g. VFIO) can set a
HostIOMMUContext on the Intel vIOMMU emulator.

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Paolo Bonzini 
Cc: Richard Henderson 
Cc: Eduardo Habkost 
Signed-off-by: Liu Yi L 
---
 hw/i386/intel_iommu.c | 71 ---
 include/hw/i386/intel_iommu.h | 21 ++---
 2 files changed, 83 insertions(+), 9 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 4b22910..fd349c6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3354,23 +3354,33 @@ static const MemoryRegionOps vtd_mem_ir_ops = {
 },
 };
 
-VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
+/**
+ * Fetch a VTDBus instance for a given PCIBus. If there is no existing
+ * instance, allocate one.
+ */
+static VTDBus *vtd_find_add_bus(IntelIOMMUState *s, PCIBus *bus)
 {
 uintptr_t key = (uintptr_t)bus;
 VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
-VTDAddressSpace *vtd_dev_as;
-char name[128];
 
 if (!vtd_bus) {
 uintptr_t *new_key = g_malloc(sizeof(*new_key));
 *new_key = (uintptr_t)bus;
 /* No corresponding free() */
-vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
-PCI_DEVFN_MAX);
+vtd_bus = g_malloc0(sizeof(VTDBus));
 vtd_bus->bus = bus;
 g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
 }
+return vtd_bus;
+}
 
+VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
+{
+VTDBus *vtd_bus;
+VTDAddressSpace *vtd_dev_as;
+char name[128];
+
+vtd_bus = vtd_find_add_bus(s, bus);
 vtd_dev_as = vtd_bus->dev_as[devfn];
 
 if (!vtd_dev_as) {
@@ -3436,6 +3446,55 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, 
PCIBus *bus, int devfn)
 return vtd_dev_as;
 }
 
+static int vtd_dev_set_iommu_context(PCIBus *bus, void *opaque,
+ int devfn,
+ HostIOMMUContext *iommu_ctx)
+{
+IntelIOMMUState *s = opaque;
+VTDBus *vtd_bus;
+VTDHostIOMMUContext *vtd_dev_icx;
+
+assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
+
+vtd_bus = vtd_find_add_bus(s, bus);
+
+vtd_iommu_lock(s);
+
+vtd_dev_icx = vtd_bus->dev_icx[devfn];
+
+assert(!vtd_dev_icx);
+
+vtd_bus->dev_icx[devfn] = vtd_dev_icx =
+g_malloc0(sizeof(VTDHostIOMMUContext));
+vtd_dev_icx->vtd_bus = vtd_bus;
+vtd_dev_icx->devfn = (uint8_t)devfn;
+vtd_dev_icx->iommu_state = s;
+vtd_dev_icx->iommu_ctx = iommu_ctx;
+
+vtd_iommu_unlock(s);
+
+return 0;
+}
+
+static void vtd_dev_unset_iommu_context(PCIBus *bus, void *opaque, int devfn)
+{
+IntelIOMMUState *s = opaque;
+VTDBus *vtd_bus;
+VTDHostIOMMUContext *vtd_dev_icx;
+
+assert(0 <= devfn && devfn < PCI_DEVFN_MAX);
+
+vtd_bus = vtd_find_add_bus(s, bus);
+
+vtd_iommu_lock(s);
+
+vtd_dev_icx = vtd_bus->dev_icx[devfn];
+g_free(vtd_dev_icx);
+vtd_bus->dev_icx[devfn] = NULL;
+
+vtd_iommu_unlock(s);
+}
+
 static uint64_t get_naturally_aligned_size(uint64_t start,
uint64_t size, int gaw)
 {
@@ -3731,6 +3790,8 @@ static AddressSpace *vtd_host_dma_iommu(PCIBus *bus, void 
*opaque, int devfn)
 
 static PCIIOMMUOps vtd_iommu_ops = {
 .get_address_space = vtd_host_dma_iommu,
+.set_iommu_context = vtd_dev_set_iommu_context,
+.unset_iommu_context = vtd_dev_unset_iommu_context,
 };
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 3870052..b5fefb9 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -64,6 +64,7 @@ typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
 typedef struct VTDPASIDEntry VTDPASIDEntry;
+typedef struct VTDHostIOMMUContext VTDHostIOMMUContext;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -112,10 +113,20 @@ struct VTDAddressSpace {
 IOVATree *iova_tree;  /* Traces mapped IOVA ranges */
 };
 
+struct VTDHostIOMMUContext {
+VTDBus *vtd_bus;
+uint8_t devfn;
+HostIOMMUContext *iommu_ctx;
+IntelIOMMUState *iommu_state;
+};
+
 struct VTDBus {
-PCIBus* bus;   /* A reference to the bus to provide 
translation for */
+/* A reference to the bus to provide translation for */
+PCIBus *bus;
 /* A table of VTDAddressSpace objects indexed by devfn */
-VTDAddressSpace *dev_as[];
+VTDAddressSpace *dev_as[PCI_DEVFN_MAX];
+/* A table of VTDHostIOMMUContext objects indexed by devfn */
+VTDHostIOMMUContext *dev_icx[PCI_DEVFN_MAX];
 };
 
 struct VTDIOTLBEntry {
@@ -269,8 +280,10 @@ struct 

[PATCH v2 00/22] intel_iommu: expose Shared Virtual Addressing to VMs

2020-03-29 Thread Liu Yi L
Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
Intel platforms allows address space sharing between device DMA and
applications. SVA can reduce programming complexity and enhance security.

This QEMU series is intended to expose SVA usage to VMs, i.e. sharing
guest application address spaces with passthrough devices. This is called
vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
changes.

The high-level architecture for SVA virtualization is as below, the key
design of vSVA support is to utilize the dual-stage IOMMU translation (
also known as IOMMU nesting translation) capability in host IOMMU.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

The complete vSVA kernel upstream patches are divided into three phases:
1. Common APIs and PCI device direct assignment
2. IOMMU-backed Mediated Device assignment
3. Page Request Services (PRS) support

This QEMU patchset is aiming for the phase 1 and phase 2. It is based
on the two kernel series below.
[1] [PATCH V10 00/11] Nested Shared Virtual Address (SVA) VT-d support:
https://lkml.org/lkml/2020/3/20/1172
[2] [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs
https://lkml.org/lkml/2020/3/22/116

There are roughly two parts:
 1. Introduce HostIOMMUContext as an abstraction of the host IOMMU. It provides
    explicit methods for vIOMMU emulators to communicate with the host IOMMU,
    e.g. to propagate guest page table bindings to the host IOMMU to set up
    dual-stage DMA translation there, and to flush the IOMMU iotlb. (A
    condensed sketch of its method table follows this list.)
 2. Setup dual-stage IOMMU translation for Intel vIOMMU. Includes
    - Check IOMMU uAPI version compatibility and VFIO nesting capabilities,
      which include hardware compatibility (stage 1 format) and VFIO_PASID_REQ
      availability. This is preparation for setting up dual-stage DMA
      translation in the host IOMMU.
    - Propagate guest PASID allocation and free requests to the host.
    - Propagate guest page table bindings to the host to set up dual-stage
      IOMMU DMA translation in the host IOMMU.
    - Propagate guest IOMMU cache invalidations to the host to ensure iotlb
      correctness.
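
A condensed sketch of the HostIOMMUContext method table the series
introduces, for orientation (signatures abridged from the patches in
this thread, not a verbatim copy):

typedef struct HostIOMMUContextClass {
    int (*pasid_alloc)(HostIOMMUContext *iommu_ctx, uint32_t min,
                       uint32_t max, uint32_t *pasid);
    int (*pasid_free)(HostIOMMUContext *iommu_ctx, uint32_t pasid);
    int (*bind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
                             DualIOMMUStage1BindData *bind_data);
    int (*unbind_stage1_pgtbl)(HostIOMMUContext *iommu_ctx,
                               DualIOMMUStage1BindData *bind_data);
    int (*flush_stage1_cache)(HostIOMMUContext *iommu_ctx,
                              DualIOMMUStage1Cache *cache);
} HostIOMMUContextClass;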

The complete QEMU set can be found in below link:
https://github.com/luxis1999/qemu.git: sva_vtd_v10_v2

Complete kernel can be found in:
https://github.com/luxis1999/linux-vsva.git: vsva-linux-5.6-rc6

Tests: basic vSVA functionality test, VM reboot/shutdown/crash, kernel build in
guest, boot VM with vSVA disabled, full compilation with all archs.

Regards,
Yi Liu

Changelog:
- Patch v1 -> Patch v2:
  a) Refactor the vfio HostIOMMUContext init code (patch 0008 - 0009 of 
v1 series)
  b) Refactor the pasid binding handling (patch 0011 - 0016 of v1 
series)
  Patch v1: https://patchwork.ozlabs.org/cover/1259648/

- RFC v3.1 -> Patch v1:
  a) Implement HostIOMMUContext in QOM manner.
  b) Add pci_set/unset_iommu_context() to register HostIOMMUContext with the
     vIOMMU, so that the lifecycle of HostIOMMUContext is known on the vIOMMU
     side. This way, the vIOMMU can safely use the methods provided by the
     HostIOMMUContext.
  c) Add back patch "[RFC v3 01/25] hw/pci: modify pci_setup_iommu() to 
set PCIIOMMUOps"
  RFCv3.1: https://patchwork.kernel.org/cover/11397879/

- RFC v3 -> v3.1:
  a) Drop IOMMUContext, and rename DualStageIOMMUObject to HostIOMMUContext.
     HostIOMMUContext is per-vfio-container; it is exposed to the vIOMMU via
     the PCI layer. VFIO registers a PCIHostIOMMUFunc callback with the PCI
     layer, through which the vIOMMU can get the HostIOMMUContext instance.
  b) Check IOMMU uAPI version by VFIO_CHECK_EXTENSION
  c) Add a check on VFIO_PASID_REQ availability via VFIO_IOMMU_GET_INFO
  d) Reorder the series: put the vSVA linux header file update at the
     beginning and the x-scalable-mode option modification at the end of
     the series.
  e) Dropped patch "[RFC v3 01/25] hw/pci: modify pci_setup_iommu() to 
set PCIIOMMUOps"
  RFCv3: 

[PATCH v2 01/22] scripts/update-linux-headers: Import iommu.h

2020-03-29 Thread Liu Yi L
From: Eric Auger 

Update the script to import the new iommu.h uapi header.

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Yi Sun 
Cc: Michael S. Tsirkin 
Cc: Cornelia Huck 
Cc: Paolo Bonzini 
Acked-by: Cornelia Huck 
Signed-off-by: Eric Auger 
---
 scripts/update-linux-headers.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 29c27f4..5b64ee3 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -141,7 +141,7 @@ done
 
 rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
-for header in kvm.h vfio.h vfio_ccw.h vhost.h \
+for header in kvm.h vfio.h vfio_ccw.h vhost.h iommu.h \
   psci.h psp-sev.h userfaultfd.h mman.h; do
 cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
 done
-- 
2.7.4




[PATCH v2 03/22] vfio: check VFIO_TYPE1_NESTING_IOMMU support

2020-03-29 Thread Liu Yi L
VFIO needs to check VFIO_TYPE1_NESTING_IOMMU support with the kernel
before using it, e.g. it is required to check the IOMMU UAPI version.
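
A userspace-side sketch of this probe is below; note that
VFIO_NESTING_IOMMU_UAPI and IOMMU_UAPI_VERSION come from the
not-yet-merged kernel series, so treat those two names as assumptions:

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int nesting_iommu_usable(int container_fd)
{
    int version;

    if (!ioctl(container_fd, VFIO_CHECK_EXTENSION,
               VFIO_TYPE1_NESTING_IOMMU)) {
        return 0;                         /* nesting type not supported */
    }
    version = ioctl(container_fd, VFIO_CHECK_EXTENSION,
                    VFIO_NESTING_IOMMU_UAPI);
    return version >= IOMMU_UAPI_VERSION; /* uAPI recent enough? */
}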

Cc: Kevin Tian 
Cc: Jacob Pan 
Cc: Peter Xu 
Cc: Eric Auger 
Cc: Yi Sun 
Cc: David Gibson 
Cc: Alex Williamson 
Signed-off-by: Liu Yi L 
Signed-off-by: Yi Sun 
---
 hw/vfio/common.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b..c276732 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1157,12 +1157,21 @@ static void vfio_put_address_space(VFIOAddressSpace 
*space)
 static int vfio_get_iommu_type(VFIOContainer *container,
Error **errp)
 {
-int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
+int iommu_types[] = { VFIO_TYPE1_NESTING_IOMMU,
+  VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
   VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
-int i;
+int i, version;
 
 for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
 if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
+if (iommu_types[i] == VFIO_TYPE1_NESTING_IOMMU) {
+version = ioctl(container->fd, VFIO_CHECK_EXTENSION,
+VFIO_NESTING_IOMMU_UAPI);
+if (version < IOMMU_UAPI_VERSION) {
+info_report("IOMMU UAPI incompatible for nesting");
+continue;
+}
+}
 return iommu_types[i];
 }
 }
@@ -1278,6 +1287,7 @@ static int vfio_connect_container(VFIOGroup *group, 
AddressSpace *as,
 }
 
 switch (container->iommu_type) {
+case VFIO_TYPE1_NESTING_IOMMU:
 case VFIO_TYPE1v2_IOMMU:
 case VFIO_TYPE1_IOMMU:
 {
-- 
2.7.4




Re: [PATCH v2] [Qemu-devel] target/i386: HAX: Enable ROM/ROM device memory region support

2020-03-29 Thread Colin Xu

Looks good to me.

Reviewed-by: Colin Xu 

On 2020-03-30 11:25, hang.y...@linux.intel.com wrote:

From: Hang Yuan 

Add ROM and ROM device memory region support in HAX. Their memory regions are
read-only, and write accesses will generate EPT violations. The violations will
be handled in the HAX kernel with the following patch.

https://github.com/intel/haxm/commit/33ceea09a1655fca12c47f1e112b1d269357ff28

v2: fix coding style problems

Signed-off-by: Hang Yuan 
---
  target/i386/hax-mem.c | 11 ---
  1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/target/i386/hax-mem.c b/target/i386/hax-mem.c
index 6bb5a24917..a8bfd37977 100644
--- a/target/i386/hax-mem.c
+++ b/target/i386/hax-mem.c
@@ -175,13 +175,10 @@ static void hax_process_section(MemoryRegionSection 
*section, uint8_t flags)
  uint64_t host_va;
  uint32_t max_mapping_size;
  
-/* We only care about RAM and ROM regions */

-if (!memory_region_is_ram(mr)) {
-if (memory_region_is_romd(mr)) {
-/* HAXM kernel module does not support ROMD yet  */
-warn_report("Ignoring ROMD region 0x%016" PRIx64 "->0x%016" PRIx64,
-start_pa, start_pa + size);
-}
+/* We only care about RAM/ROM regions and ROM device */
+if (memory_region_is_rom(mr) || (memory_region_is_romd(mr))) {
+flags |= HAX_RAM_INFO_ROM;
+} else if (!memory_region_is_ram(mr)) {
  return;
  }
  


--
Best Regards,
Colin Xu




[PATCH v2] [Qemu-devel] target/i386: HAX: Enable ROM/ROM device memory region support

2020-03-29 Thread hang . yuan
From: Hang Yuan 

Add ROM and ROM device memory region support in HAX. Their memory regions are
read-only, and write accesses will generate EPT violations. The violations will
be handled in the HAX kernel with the following patch.

https://github.com/intel/haxm/commit/33ceea09a1655fca12c47f1e112b1d269357ff28

v2: fix coding style problems

Signed-off-by: Hang Yuan 
---
 target/i386/hax-mem.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/target/i386/hax-mem.c b/target/i386/hax-mem.c
index 6bb5a24917..a8bfd37977 100644
--- a/target/i386/hax-mem.c
+++ b/target/i386/hax-mem.c
@@ -175,13 +175,10 @@ static void hax_process_section(MemoryRegionSection 
*section, uint8_t flags)
 uint64_t host_va;
 uint32_t max_mapping_size;
 
-/* We only care about RAM and ROM regions */
-if (!memory_region_is_ram(mr)) {
-if (memory_region_is_romd(mr)) {
-/* HAXM kernel module does not support ROMD yet  */
-warn_report("Ignoring ROMD region 0x%016" PRIx64 "->0x%016" PRIx64,
-start_pa, start_pa + size);
-}
+/* We only care about RAM/ROM regions and ROM device */
+if (memory_region_is_rom(mr) || (memory_region_is_romd(mr))) {
+flags |= HAX_RAM_INFO_ROM;
+} else if (!memory_region_is_ram(mr)) {
 return;
 }
 
-- 
2.21.0




Re: [PATCH v16 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking.

2020-03-29 Thread Yan Zhao
On Fri, Mar 27, 2020 at 01:28:13PM +0800, Kirti Wankhede wrote:
> Hit send button little early.
> 
>  >
>  > I checked v12, it's not like what I said.
>  > In v12, bitmaps are generated per vfio_dma, and combination of the
>  > bitmaps are required in order to generate a big bitmap suiting for dirty
>  > query. It can cause problem when offset not aligning.
>  > But what I propose here is to generate an rb tree orthogonal to the tree
>  > of vfio_dma.
>  >
>  > as to CPU cycles saving, I don't think iterating/translating page by page
>  > would achieve that purpose.
>  >
> 
> Instead of creating one extra rb tree for dirty pages tracking in v10 
> tried to use dma->pfn_list itself, we tried changes in v10, v11 and v12, 
> latest version is evolved version with best possible approach after 
> discussion. Probably, go through v11 as well.
> https://patchwork.kernel.org/patch/11298335/
>
I'm not sure why all those previous implementations are bound to
vfio_dma. With vIOMMU on, in most cases a vfio_dma covers only a single
page, so would we be generating a one-byte bitmap for a single page in
each vfio_dma? Is it possible to create one extra rb tree to keep dirty
ranges, plus one fixed-length kernel bitmap whose content is generated
on query, serving as a bounce buffer for copy_to_user?

> 
> On 3/27/2020 6:00 AM, Yan Zhao wrote:
> > On Fri, Mar 27, 2020 at 05:39:01AM +0800, Kirti Wankhede wrote:
> >>
> >>
> >> On 3/25/2020 7:41 AM, Yan Zhao wrote:
> >>> On Wed, Mar 25, 2020 at 05:18:52AM +0800, Kirti Wankhede wrote:
>  VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
>  - Start dirty pages tracking while migration is active
>  - Stop dirty pages tracking.
>  - Get dirty pages bitmap. Its user space application's responsibility to
>  copy content of dirty pages from source to destination during 
>  migration.
> 
>  To prevent DoS attack, memory for bitmap is allocated per vfio_dma
>  structure. Bitmap size is calculated considering smallest supported page
>  size. Bitmap is allocated for all vfio_dmas when dirty logging is enabled
> 
>  Bitmap is populated for already pinned pages when bitmap is allocated for
>  a vfio_dma with the smallest supported page size. Update bitmap from
>  pinning functions when tracking is enabled. When user application queries
>  bitmap, check if requested page size is same as page size used to
>  populated bitmap. If it is equal, copy bitmap, but if not equal, return
>  error.
> 
>  Signed-off-by: Kirti Wankhede 
>  Reviewed-by: Neo Jia 
>  ---
> drivers/vfio/vfio_iommu_type1.c | 266 
>  +++-
> 1 file changed, 260 insertions(+), 6 deletions(-)
> 
>  diff --git a/drivers/vfio/vfio_iommu_type1.c 
>  b/drivers/vfio/vfio_iommu_type1.c
>  index 70aeab921d0f..874a1a7ae925 100644
>  --- a/drivers/vfio/vfio_iommu_type1.c
>  +++ b/drivers/vfio/vfio_iommu_type1.c
>  @@ -71,6 +71,7 @@ struct vfio_iommu {
>   unsigned intdma_avail;
>   boolv2;
>   boolnesting;
>  +booldirty_page_tracking;
> };
> 
> struct vfio_domain {
>  @@ -91,6 +92,7 @@ struct vfio_dma {
>   boollock_cap;   /* 
>  capable(CAP_IPC_LOCK) */
>   struct task_struct  *task;
>   struct rb_root  pfn_list;   /* Ex-user pinned pfn 
>  list */
>  +unsigned long   *bitmap;
> };
> 
> struct vfio_group {
>  @@ -125,7 +127,21 @@ struct vfio_regions {
> #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)   \
>   
>  (!list_empty(&iommu->domain_list))
> 
>  +#define DIRTY_BITMAP_BYTES(n)   (ALIGN(n, BITS_PER_TYPE(u64)) / 
>  BITS_PER_BYTE)
>  +
>  +/*
>  + * Input argument of number of bits to bitmap_set() is unsigned 
>  integer, which
>  + * further casts to signed integer for unaligned multi-bit operation,
>  + * __bitmap_set().
>  + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 
>  bits/byte,
>  + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K 
>  page
>  + * system.
>  + */
>  +#define DIRTY_BITMAP_PAGES_MAX  (uint64_t)(INT_MAX - 1)
>  +#define DIRTY_BITMAP_SIZE_MAX
>  DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>  +
> static int put_pfn(unsigned long pfn, int prot);
>  +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
> 
> /*
>  * This code handles mapping and unmapping of user data buffers
>  @@ -175,6 +191,77 @@ static void vfio_unlink_dma(struct vfio_iommu 
>  *iommu, struct vfio_dma *old)
>   rb_erase(&old->node, &iommu->dma_list);
> }
> 
>  +
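
A quick standalone check of the arithmetic in the DIRTY_BITMAP_* comment
quoted above (illustrative only, not part of the patch):

#include <assert.h>
#include <stdint.h>

/* 2^31 bits of bitmap occupy 2^28 bytes (256 MB) and, at one bit per
 * 4 KB page, cover 2^31 * 2^12 = 2^43 bytes (8 TB) of guest memory. */
int main(void)
{
    uint64_t max_bits = 1ULL << 31;

    assert(max_bits / 8 == 1ULL << 28);              /* 256 MB of bitmap */
    assert(max_bits * (1ULL << 12) == 1ULL << 43);   /* 8 TB tracked */
    return 0;
}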

Re: [PATCH] fix vhost_user_blk_watch crash

2020-03-29 Thread Raphael Norwitz
On Mon, Mar 23, 2020 at 01:29:24PM +0800, Li Feng wrote:
> 
> G_IO_HUP is already watched in tcp_chr_connect, so the callback
> vhost_user_blk_watch is not needed: tcp_chr_hup is registered as the
> callback, and it will close the tcp link.
> 
> Signed-off-by: Li Feng 

Reviewed-by: Raphael Norwitz 

> ---
>  hw/block/vhost-user-blk.c  | 19 ---
>  include/hw/virtio/vhost-user-blk.h |  1 -
>  2 files changed, 20 deletions(-)
> 
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 12925a47ec..17df5338e7 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -349,18 +349,6 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
>  vhost_dev_cleanup(&s->dev);
>  }
>  
> -static gboolean vhost_user_blk_watch(GIOChannel *chan, GIOCondition cond,
> - void *opaque)
> -{
> -DeviceState *dev = opaque;
> -VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> -VHostUserBlk *s = VHOST_USER_BLK(vdev);
> -
> -qemu_chr_fe_disconnect(&s->chardev);
> -
> -return true;
> -}
> -
>  static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
>  {
>  DeviceState *dev = opaque;
> @@ -373,15 +361,9 @@ static void vhost_user_blk_event(void *opaque, 
> QEMUChrEvent event)
>  qemu_chr_fe_disconnect(&s->chardev);
>  return;
>  }
> -s->watch = qemu_chr_fe_add_watch(&s->chardev, G_IO_HUP,
> - vhost_user_blk_watch, dev);
>  break;
>  case CHR_EVENT_CLOSED:
>  vhost_user_blk_disconnect(dev);
> -if (s->watch) {
> -g_source_remove(s->watch);
> -s->watch = 0;
> -}
>  break;
>  case CHR_EVENT_BREAK:
>  case CHR_EVENT_MUX_IN:
> @@ -428,7 +410,6 @@ static void vhost_user_blk_device_realize(DeviceState 
> *dev, Error **errp)
>  
>  s->inflight = g_new0(struct vhost_inflight, 1);
>  s->vhost_vqs = g_new0(struct vhost_virtqueue, s->num_queues);
> -s->watch = 0;
>  s->connected = false;
>  
>  qemu_chr_fe_set_handlers(&s->chardev, NULL, NULL, vhost_user_blk_event,
> diff --git a/include/hw/virtio/vhost-user-blk.h 
> b/include/hw/virtio/vhost-user-blk.h
> index 05ea0ad183..34ad6f0c0e 100644
> --- a/include/hw/virtio/vhost-user-blk.h
> +++ b/include/hw/virtio/vhost-user-blk.h
> @@ -38,7 +38,6 @@ typedef struct VHostUserBlk {
>  VhostUserState vhost_user;
>  struct vhost_virtqueue *vhost_vqs;
>  VirtQueue **virtqs;
> -guint watch;
>  bool connected;
>  } VHostUserBlk;
>  
> -- 
> 2.11.0
> 
> 



Re: [PATCH v16 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap

2020-03-29 Thread Yan Zhao
On Fri, Mar 27, 2020 at 12:42:43PM +0800, Kirti Wankhede wrote:
> 
> 
> On 3/27/2020 5:34 AM, Yan Zhao wrote:
> > On Fri, Mar 27, 2020 at 05:39:44AM +0800, Kirti Wankhede wrote:
> >>
> >>
> >> On 3/25/2020 7:48 AM, Yan Zhao wrote:
> >>> On Wed, Mar 25, 2020 at 03:32:37AM +0800, Kirti Wankhede wrote:
>  DMA mapped pages, including those pinned by mdev vendor drivers, might
>  get unpinned and unmapped while migration is active and device is still
>  running. For example, in pre-copy phase while guest driver could access
>  those pages, host device or vendor driver can dirty these mapped pages.
>  Such pages should be marked dirty so as to maintain memory consistency
>  for a user making use of dirty page tracking.
> 
>  To get bitmap during unmap, user should allocate memory for bitmap, set
>  size of allocated memory, set page size to be considered for bitmap and
>  set flag VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP.
> 
>  Signed-off-by: Kirti Wankhede 
>  Reviewed-by: Neo Jia 
>  ---
> drivers/vfio/vfio_iommu_type1.c | 54 
>  ++---
> include/uapi/linux/vfio.h   | 10 
> 2 files changed, 60 insertions(+), 4 deletions(-)
> 
>  diff --git a/drivers/vfio/vfio_iommu_type1.c 
>  b/drivers/vfio/vfio_iommu_type1.c
>  index 27ed069c5053..b98a8d79e13a 100644
>  --- a/drivers/vfio/vfio_iommu_type1.c
>  +++ b/drivers/vfio/vfio_iommu_type1.c
>  @@ -982,7 +982,8 @@ static int verify_bitmap_size(uint64_t npages, 
>  uint64_t bitmap_size)
> }
> 
> static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  - struct vfio_iommu_type1_dma_unmap *unmap)
>  + struct vfio_iommu_type1_dma_unmap *unmap,
>  + struct vfio_bitmap *bitmap)
> {
>   uint64_t mask;
>   struct vfio_dma *dma, *dma_last = NULL;
>  @@ -1033,6 +1034,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu 
>  *iommu,
>    * will be returned if these conditions are not met.  The v2 
>  interface
>    * will only return success and a size of zero if there were no
>    * mappings within the range.
>  + *
>  + * When VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP flag is set, unmap 
>  request
>  + * must be for single mapping. Multiple mappings with this flag 
>  set is
>  + * not supported.
>    */
>   if (iommu->v2) {
>   dma = vfio_find_dma(iommu, unmap->iova, 1);
>  @@ -1040,6 +1045,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu 
>  *iommu,
>   ret = -EINVAL;
>   goto unlock;
>   }
>  +
>  +if ((unmap->flags & 
>  VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
>  +(dma->iova != unmap->iova || dma->size != 
>  unmap->size)) {
> >>> potential NULL pointer!
> >>>
> >>> And could you address the comments in v14?
> >>> How to handle DSI unmaps in vIOMMU
> >>> (https://lore.kernel.org/kvm/20200323011041.GB5456@joy-OptiPlex-7040/)
> >>>
> >>
> >> Sorry, I drafted reply to it, but I missed to send, it remained in my 
> >> drafts
> >>
> >>   >
> >>   > it happens in vIOMMU Domain level invalidation of IOTLB
> >>   > (domain-selective invalidation, see vtd_iotlb_domain_invalidate() in
> >> qemu).
> >>   > common in VTD lazy mode, and NOT just happening once at boot time.
> >>   > rather than invalidate page by page, it batches the page invalidation.
> >>   > so, when this invalidation takes place, even higher level page tables
> >>   > have been invalid and therefore it has to invalidate a bigger
> >> combined range.
> >>   > That's why we see IOVAs are mapped in 4k pages, but are unmapped in 2M
> >>   > pages.
> >>   >
> >>   > I think those UNMAPs should also have GET_DIRTY_BIMTAP flag on, right?
> >>
> >>
> >> vtd_iotlb_domain_invalidate()
> >> vtd_sync_shadow_page_table()
> >>   vtd_sync_shadow_page_table_range(vtd_as, &ce, 0, UINT64_MAX)
> >> vtd_page_walk()
> >>   vtd_page_walk_level() - walk over specific level for IOVA range
> >> vtd_page_walk_one()
> >>   memory_region_notify_iommu()
> >>   ...
> >> vfio_iommu_map_notify()
> >>
> >> In the above trace, isn't page walk will take care of creating proper
> >> IOTLB entry which should be same as created during mapping for that
> >> IOTLB entry?
> >>
> > No. It does walk the page table, but as it's dsi (delay & batched unmap),
> > pages table entry for a whole 2M (the higher level, not last level for 4K)
> > range is invalid, so the iotlb->addr_mask what vfio_iommu_map_notify()
> > receives is (2M - 1), not the same as the size for map.
> > 
> 
> When do this happen? 

Re: [PATCH v16 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking.

2020-03-29 Thread Yan Zhao
On Fri, Mar 27, 2020 at 01:07:38PM +0800, Kirti Wankhede wrote:
> 
> 
> On 3/27/2020 6:00 AM, Yan Zhao wrote:
> > On Fri, Mar 27, 2020 at 05:39:01AM +0800, Kirti Wankhede wrote:
> >>
> >>
> >> On 3/25/2020 7:41 AM, Yan Zhao wrote:
> >>> On Wed, Mar 25, 2020 at 05:18:52AM +0800, Kirti Wankhede wrote:
>  VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
>  - Start dirty pages tracking while migration is active
>  - Stop dirty pages tracking.
>  - Get dirty pages bitmap. Its user space application's responsibility to
>  copy content of dirty pages from source to destination during 
>  migration.
> 
>  To prevent DoS attack, memory for bitmap is allocated per vfio_dma
>  structure. Bitmap size is calculated considering smallest supported page
>  size. Bitmap is allocated for all vfio_dmas when dirty logging is enabled
> 
>  Bitmap is populated for already pinned pages when bitmap is allocated for
>  a vfio_dma with the smallest supported page size. Update bitmap from
>  pinning functions when tracking is enabled. When user application queries
>  bitmap, check if requested page size is same as page size used to
>  populated bitmap. If it is equal, copy bitmap, but if not equal, return
>  error.
> 
>  Signed-off-by: Kirti Wankhede 
>  Reviewed-by: Neo Jia 
>  ---
> drivers/vfio/vfio_iommu_type1.c | 266 
>  +++-
> 1 file changed, 260 insertions(+), 6 deletions(-)
> 
>  diff --git a/drivers/vfio/vfio_iommu_type1.c 
>  b/drivers/vfio/vfio_iommu_type1.c
>  index 70aeab921d0f..874a1a7ae925 100644
>  --- a/drivers/vfio/vfio_iommu_type1.c
>  +++ b/drivers/vfio/vfio_iommu_type1.c
>  @@ -71,6 +71,7 @@ struct vfio_iommu {
>   unsigned intdma_avail;
>   boolv2;
>   boolnesting;
>  +booldirty_page_tracking;
> };
> 
> struct vfio_domain {
>  @@ -91,6 +92,7 @@ struct vfio_dma {
>   boollock_cap;   /* 
>  capable(CAP_IPC_LOCK) */
>   struct task_struct  *task;
>   struct rb_root  pfn_list;   /* Ex-user pinned pfn 
>  list */
>  +unsigned long   *bitmap;
> };
> 
> struct vfio_group {
>  @@ -125,7 +127,21 @@ struct vfio_regions {
> #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)   \
>   
>  (!list_empty(&iommu->domain_list))
> 
>  +#define DIRTY_BITMAP_BYTES(n)   (ALIGN(n, BITS_PER_TYPE(u64)) / 
>  BITS_PER_BYTE)
>  +
>  +/*
>  + * Input argument of number of bits to bitmap_set() is unsigned 
>  integer, which
>  + * further casts to signed integer for unaligned multi-bit operation,
>  + * __bitmap_set().
>  + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 
>  bits/byte,
>  + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K 
>  page
>  + * system.
>  + */
>  +#define DIRTY_BITMAP_PAGES_MAX  (uint64_t)(INT_MAX - 1)
>  +#define DIRTY_BITMAP_SIZE_MAX
>  DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>  +
> static int put_pfn(unsigned long pfn, int prot);
>  +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
> 
> /*
>  * This code handles mapping and unmapping of user data buffers
>  @@ -175,6 +191,77 @@ static void vfio_unlink_dma(struct vfio_iommu 
>  *iommu, struct vfio_dma *old)
>   rb_erase(&old->node, &iommu->dma_list);
> }
> 
>  +
>  +static int vfio_dma_bitmap_alloc(struct vfio_dma *dma, uint64_t pgsize)
>  +{
>  +uint64_t npages = dma->size / pgsize;
>  +
>  +if (npages > DIRTY_BITMAP_PAGES_MAX)
>  +return -EINVAL;
>  +
>  +dma->bitmap = kvzalloc(DIRTY_BITMAP_BYTES(npages), GFP_KERNEL);
>  +if (!dma->bitmap)
>  +return -ENOMEM;
>  +
>  +return 0;
>  +}
>  +
>  +static void vfio_dma_bitmap_free(struct vfio_dma *dma)
>  +{
>  +kfree(dma->bitmap);
>  +dma->bitmap = NULL;
>  +}
>  +
>  +static void vfio_dma_populate_bitmap(struct vfio_dma *dma, uint64_t 
>  pgsize)
>  +{
>  +struct rb_node *p;
>  +
>  +if (RB_EMPTY_ROOT(>pfn_list))
>  +return;
>  +
>  +for (p = rb_first(>pfn_list); p; p = rb_next(p)) {
>  +struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn, 
>  node);
>  +
>  +bitmap_set(dma->bitmap, (vpfn->iova - dma->iova) / 
>  pgsize, 1);
>  +}
>  +}
>  +
>  +static int vfio_dma_bitmap_alloc_all(struct 

Re: [PATCH] fix vhost_user_blk_watch crash

2020-03-29 Thread Raphael Norwitz
On Sun, Mar 29, 2020 at 09:30:24AM -0400, Michael S. Tsirkin wrote:
> 
> On Mon, Mar 23, 2020 at 01:29:24PM +0800, Li Feng wrote:
> > G_IO_HUP is already watched in tcp_chr_connect, so the callback
> > vhost_user_blk_watch is not needed: tcp_chr_hup is registered as the
> > callback, and it will close the tcp link.
> > 
> > Signed-off-by: Li Feng 
> 
> Raphael would you like to review this?

Sure - I'll review it now.

> Also, I think at this point
> nutanix is the biggest contributor to vhost user blk.
> So if you want to be added to MAINTAINERS on this
> one so people Cc you on patches, that'd be great.
> 

Yes, please add me to MAINTAINERS.
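
(For reference, such a MAINTAINERS entry would look roughly like the sketch
below; the section name, status line, and file list are illustrative
assumptions, and the email address is deliberately left out rather than
guessed:

vhost-user-blk
M: Raphael Norwitz
S: Maintained
F: hw/block/vhost-user-blk.c
F: include/hw/virtio/vhost-user-blk.h
)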

 



Re: [PATCH] hw/vfio: let readonly flag take effect for mmaped regions

2020-03-29 Thread Yan Zhao
On Sat, Mar 28, 2020 at 01:25:37AM +0800, Alex Williamson wrote:
> On Fri, 27 Mar 2020 11:19:34 +
> yan.y.z...@intel.com wrote:
> 
> > From: Yan Zhao 
> > 
> > currently, vfio regions without VFIO_REGION_INFO_FLAG_WRITE are only
> > read-only when VFIO_REGION_INFO_FLAG_MMAP is not set.
> > 
> > regions with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP
> > are only read-only in host page table for qemu.
> > 
> > This patch sets corresponding ept page entries read-only for regions
> > with flag VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_MMAP.
> > 
> > accordingly, it ignores guest write when guest writes to the read-only
> > regions are trapped.
> > 
> > Signed-off-by: Yan Zhao 
> > Signed-off-by: Xin Zeng 
> > ---
> 
> Currently we set the r/w protection on the mmap, do I understand
> correctly that the change in the vfio code below results in KVM exiting
> to QEMU to handle a write to a read-only region and therefore we need
> the memory.c change to drop the write?  This prevents a SIGBUS or
> similar?
Yes, correct. The change in memory.c is to prevent a SIGSEGV in the host,
as the region is mmapped read-only. We think it's better to just drop the
writes from the guest rather than corrupt QEMU.

> 
> Meanwhile vfio_region_setup() uses the same vfio_region_ops for all
> regions and vfio_region_write() would still allow writes, so if the
> device were using x-no-mmap=on, I think we'd still get a write to this
> region and expect the vfio device to drop it.  Should we prevent that
> write in QEMU as well?
Yes, right now it expects the vfio device to drop it.
As the driver sets the flags without VFIO_REGION_INFO_FLAG_WRITE, it should
handle that properly.
Both dropping in QEMU and dropping in the vfio device are fine to us;
we wonder which one is your preference :)
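
(Illustrative sketch of the QEMU-side option being discussed: a guard at the
top of vfio_region_write() in hw/vfio/common.c. The exact placement, and
whether a tracepoint should accompany it, are assumptions:

    /* sketch: mirror the mmap path and silently ignore guest writes to
     * regions the kernel did not mark writable */
    VFIORegion *region = opaque;

    if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
        return;
    }
)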


> Can you also identify what device and region requires this so that we
> can decide whether this is QEMU 5.0 or 5.1 material?  PCI BARs are of
> course always R/W and the ROM uses different ops and doesn't support
> mmap, so this is a device specific region of some sort.  Thanks,
> 
It's a virtual mdev device for which we want to emulate a virtual
read-only MMIO BAR.
Is there any consideration that PCI BARs have to be R/W? We didn't
find that requirement in the PCI specification.

Thanks
Yan


> 
> >  hw/vfio/common.c | 4 
> >  memory.c | 3 +++
> >  2 files changed, 7 insertions(+)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 0b3593b3c0..e901621ca0 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -971,6 +971,10 @@ int vfio_region_mmap(VFIORegion *region)
> >name, region->mmaps[i].size,
> >region->mmaps[i].mmap);
> >  g_free(name);
> > +
> > +if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
> > +memory_region_set_readonly(&region->mmaps[i].mem, true);
> > +}
> >  memory_region_add_subregion(region->mem, region->mmaps[i].offset,
> >  &region->mmaps[i].mem);
> >  
> > diff --git a/memory.c b/memory.c
> > index 601b749906..4b1071dc74 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1313,6 +1313,9 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr,
> >  MemoryRegion *mr = opaque;
> >  
> >  trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
> > +if (mr->readonly) {
> > +return;
> > +}
> >  
> >  switch (size) {
> >  case 1:
> 



[Bug 1869497] Re: x86_cpu_gdb_read_register segfaults when gdb requests registers

2020-03-29 Thread Peter Maydell
Thanks for tracking down the source of the bug. Our 'submitting patches'
policy is at https://wiki.qemu.org/Contribute/SubmitAPatch in case you
haven't already found it. (It's quite long but for a simple one-shot
bugfix patch the important stuff is just the summarized bits at the
top.)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1869497

Title:
  x86_cpu_gdb_read_register segfaults when gdb requests registers

Status in QEMU:
  New

Bug description:
  When attempting to attach to the gdbstub, a segfault occurs.

  I traced this down to a problem in a call to gdb_get_reg16 where the
  mem_buf was being treated like a uint8_t* instead of a GByteArray.
  The buffer passed to gdb_get_reg16 ends up passing an invalid
  GByteArray pointer, which subsequently causes a segfault in memcpy.

  I have a fix for this - just need to educate myself on how to submit a
  patch.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1869497/+subscriptions
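
(For context, QEMU's gdbstub register helpers now append into a GByteArray
instead of writing through a raw uint8_t pointer. A minimal sketch of the
corrected call pattern; the function name is illustrative, not the actual
patch, and it assumes "qemu/osdep.h", "cpu.h" and "exec/gdbstub.h":

/* sketch: append a 16-bit segment selector to the reply buffer;
 * gdb_get_reg16() grows the GByteArray rather than dereferencing an
 * invalid pointer, which is what the reported crash was about */
static int sketch_read_segment(CPUX86State *env, GByteArray *mem_buf, int sreg)
{
    return gdb_get_reg16(mem_buf, env->segs[sreg].selector);
}
)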



[Bug 1869497] Re: x86_cpu_gdb_read_register segfaults when gdb requests registers

2020-03-29 Thread Matt Wilbur
** Changed in: qemu
 Assignee: (unassigned) => Matt Wilbur (mattwilbur)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1869497

Title:
  x86_cpu_gdb_read_register segfaults when gdb requests registers

Status in QEMU:
  New

Bug description:
  When attempting to attach to the gdbstub, a segfault occurs.

  I traced this down to a problem in a call to gdb_get_reg16 where the
  mem_buf was being treated like a uint8_t* instead of a GByteArray.
  The buffer passed to gdb_get_reg16 ends up passing an invalid
  GByteArray pointer, which subsequently causes a segfault in memcpy.

  I have a fix for this - just need to educate myself on how to submit a
  patch.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1869497/+subscriptions



Re: Request for Wiki write access

2020-03-29 Thread Mark Cave-Ayland
On 28/03/2020 16:31, Chad Kennedy wrote:

> I'm getting started with building QEMU on Windows and, following the 
> instructions
> at  https://wiki.qemu.org/Hosts/W32#Native_builds_with_MSYS2, I ran into some 
> small
> issues. I'd like to be able to tweak the wiki a bit to help save others who 
> might
> follow behind me some time.
> 
> Therefore, I'm requesting a wiki account per the bottom
> of https://wiki.qemu.org/Main_Page
> 
> Thanks and best regards!

Hi Chad,

Thanks for getting involved in the project! I've created a wiki account
for you and emailed across the details.


ATB,

Mark.



Re: [PATCH v2] tests/acceptance/machine_sparc_leon3: Do not run HelenOS test by default

2020-03-29 Thread Philippe Mathieu-Daudé

I just noticed I forgot to Cc Eduardo when posting this :S

Eduardo, do you want me to prepare a pullreq for rc1 with this patch and
Oksana's other fix?


On 2/12/20 9:36 PM, Philippe Mathieu-Daudé wrote:

The www.helenos.org server is slow and downloading the Leon3 binary
takes too long [*]. Do not include this test in the default suite.

Similarly to commit 471c97a69b:

   Currently the Avocado framework does not distinguish the time spent
   downloading assets vs. the time spent running a test. With big
   assets (like a full VM image) the tests likely fail.

   This is a limitation known by the Avocado team.
   Until this issue gets fixed, do not run these tests automatically.

   Tests can still be run setting the AVOCADO_TIMEOUT_EXPECTED
   environment variable.

[*] https://travis-ci.org/stsquad/qemu/jobs/649599880#L4198

Reported-by: Alex Bennée 
Signed-off-by: Philippe Mathieu-Daudé 
---
v2: Add missing staged hunk...
---
  tests/acceptance/machine_sparc_leon3.py | 4 
  1 file changed, 4 insertions(+)

diff --git a/tests/acceptance/machine_sparc_leon3.py 
b/tests/acceptance/machine_sparc_leon3.py
index f77e210ccb..27e4717a51 100644
--- a/tests/acceptance/machine_sparc_leon3.py
+++ b/tests/acceptance/machine_sparc_leon3.py
@@ -5,6 +5,9 @@
  # This work is licensed under the terms of the GNU GPL, version 2 or
  # later. See the COPYING file in the top-level directory.
  
+import os
+
+from avocado import skipUnless
  from avocado_qemu import Test
  from avocado_qemu import wait_for_console_pattern
  
@@ -13,6 +16,7 @@ class Leon3Machine(Test):
  
  timeout = 60
  
+@skipUnless(os.getenv('AVOCADO_TIMEOUT_EXPECTED'), 'Test might timeout')
  def test_leon3_helenos_uimage(self):
  """
  :avocado: tags=arch:sparc
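
(The test stays runnable on demand; an illustrative invocation, assuming a
stock Avocado setup and the path shown in the diff above, would be:

  AVOCADO_TIMEOUT_EXPECTED=1 avocado run tests/acceptance/machine_sparc_leon3.py
)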






Re: [PATCH v5 02/50] multi-process: Refactor machine_init and exit notifiers

2020-03-29 Thread Marc-André Lureau
Hi

On Mon, Feb 24, 2020 at 9:56 PM Jagannathan Raman  wrote:
>
> Relocate machine_init and exit notifiers into common code

util/notify.c is not a good place to relocate those.

eventually, add a new softmmu/notifiers.c ?

And that patch broke "make check" test-char /char/mux, because it
overrides machine_init_done from stubs/machine-init-done.c.

>
> Signed-off-by: Elena Ufimtseva 
> Signed-off-by: John G Johnson 
> Signed-off-by: Jagannathan Raman 
> ---
>  include/sysemu/sysemu.h |  2 ++
>  softmmu/vl.c| 42 --
>  util/notify.c   | 43 +++
>  3 files changed, 45 insertions(+), 42 deletions(-)
>
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index dec64fc..2f37e2b 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -17,11 +17,13 @@ extern bool qemu_uuid_set;
>
>  void qemu_add_exit_notifier(Notifier *notify);
>  void qemu_remove_exit_notifier(Notifier *notify);
> +void qemu_run_exit_notifiers(void);
>
>  extern bool machine_init_done;
>
>  void qemu_add_machine_init_done_notifier(Notifier *notify);
>  void qemu_remove_machine_init_done_notifier(Notifier *notify);
> +void qemu_run_machine_init_done_notifiers(void);
>
>  extern int autostart;
>
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 92c7b3a..94a7b93 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -173,12 +173,6 @@ int icount_align_option;
>  QemuUUID qemu_uuid;
>  bool qemu_uuid_set;
>
> -static NotifierList exit_notifiers =
> -NOTIFIER_LIST_INITIALIZER(exit_notifiers);
> -
> -static NotifierList machine_init_done_notifiers =
> -NOTIFIER_LIST_INITIALIZER(machine_init_done_notifiers);
> -
>  bool xen_allowed;
>  uint32_t xen_domid;
>  enum xen_mode xen_mode = XEN_EMULATE;
> @@ -2324,21 +2318,6 @@ static MachineClass *machine_parse(const char *name, GSList *machines)
>  return mc;
>  }
>
> -void qemu_add_exit_notifier(Notifier *notify)
> -{
> -notifier_list_add(&exit_notifiers, notify);
> -}
> -
> -void qemu_remove_exit_notifier(Notifier *notify)
> -{
> -notifier_remove(notify);
> -}
> -
> -static void qemu_run_exit_notifiers(void)
> -{
> -notifier_list_notify(&exit_notifiers, NULL);
> -}
> -
>  static const char *pid_file;
>  static Notifier qemu_unlink_pidfile_notifier;
>
> @@ -2349,27 +2328,6 @@ static void qemu_unlink_pidfile(Notifier *n, void *data)
>  }
>  }
>
> -bool machine_init_done;
> -
> -void qemu_add_machine_init_done_notifier(Notifier *notify)
> -{
> -notifier_list_add(&machine_init_done_notifiers, notify);
> -if (machine_init_done) {
> -notify->notify(notify, NULL);
> -}
> -}
> -
> -void qemu_remove_machine_init_done_notifier(Notifier *notify)
> -{
> -notifier_remove(notify);
> -}
> -
> -static void qemu_run_machine_init_done_notifiers(void)
> -{
> -machine_init_done = true;
> -notifier_list_notify(&machine_init_done_notifiers, NULL);
> -}
> -
>  static const QEMUOption *lookup_opt(int argc, char **argv,
>  const char **poptarg, int *poptind)
>  {
> diff --git a/util/notify.c b/util/notify.c
> index 76bab21..0e7479b 100644
> --- a/util/notify.c
> +++ b/util/notify.c
> @@ -15,6 +15,15 @@
>
>  #include "qemu/osdep.h"
>  #include "qemu/notify.h"
> +#include "sysemu/sysemu.h"
> +
> +bool machine_init_done;
> +
> +static NotifierList machine_init_done_notifiers =
> +NOTIFIER_LIST_INITIALIZER(machine_init_done_notifiers);
> +
> +static NotifierList exit_notifiers =
> +NOTIFIER_LIST_INITIALIZER(exit_notifiers);
>
>  void notifier_list_init(NotifierList *list)
>  {
> @@ -74,3 +83,37 @@ int notifier_with_return_list_notify(NotifierWithReturnList *list, void *data)
>  }
>  return ret;
>  }
> +
> +void qemu_add_machine_init_done_notifier(Notifier *notify)
> +{
> +notifier_list_add(&machine_init_done_notifiers, notify);
> +if (machine_init_done) {
> +notify->notify(notify, NULL);
> +}
> +}
> +
> +void qemu_remove_machine_init_done_notifier(Notifier *notify)
> +{
> +notifier_remove(notify);
> +}
> +
> +void qemu_run_machine_init_done_notifiers(void)
> +{
> +machine_init_done = true;
> +notifier_list_notify(&machine_init_done_notifiers, NULL);
> +}
> +
> +void qemu_add_exit_notifier(Notifier *notify)
> +{
> +notifier_list_add(&exit_notifiers, notify);
> +}
> +
> +void qemu_remove_exit_notifier(Notifier *notify)
> +{
> +notifier_remove(notify);
> +}
> +
> +void qemu_run_exit_notifiers(void)
> +{
> +notifier_list_notify(&exit_notifiers, NULL);
> +}
> --
> 1.8.3.1
>


-- 
Marc-André Lureau



Re: [PATCH v1 02/22] header file update VFIO/IOMMU vSVA APIs

2020-03-29 Thread Auger Eric
Hi Yi,

On 3/22/20 1:35 PM, Liu Yi L wrote:
> The kernel uapi/linux/iommu.h header file includes the
> extensions for vSVA support, e.g. the bind_gpasid and iommu
> fault report related user structures, etc.
> 
> Note: this should be replaced with a full header file update when
> the vSVA uAPI is stable.

Until this gets upstreamed, maybe add the branch against which you
updated the headers?

Thanks

Eric
> 
> Cc: Kevin Tian 
> Cc: Jacob Pan 
> Cc: Peter Xu 
> Cc: Yi Sun 
> Cc: Michael S. Tsirkin 
> Cc: Cornelia Huck 
> Cc: Paolo Bonzini 
> Signed-off-by: Liu Yi L 
> ---
>  linux-headers/linux/iommu.h | 378 
> 
>  linux-headers/linux/vfio.h  | 127 +++
>  2 files changed, 505 insertions(+)
>  create mode 100644 linux-headers/linux/iommu.h
> 
> diff --git a/linux-headers/linux/iommu.h b/linux-headers/linux/iommu.h
> new file mode 100644
> index 000..9025496
> --- /dev/null
> +++ b/linux-headers/linux/iommu.h
> @@ -0,0 +1,378 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * IOMMU user API definitions
> + */
> +
> +#ifndef _IOMMU_H
> +#define _IOMMU_H
> +
> +#include <linux/types.h>
> +
> +/**
> + * Current version of the IOMMU user API. This is intended for query
> + * between user and kernel to determine compatible data structures.
> + *
> + * UAPI version can be bumped up with the following rules:
> + * 1. All data structures passed between user and kernel space share
> + *the same version number. i.e. any extension to any structure
> + *results in version number increment.
> + *
> + * 2. Data structures are open to extension but closed to modification.
> + *Extension should leverage the padding bytes first where a new
> + *flag bit is required to indicate the validity of each new member.
> + *The above rule for padding bytes also applies to adding new union
> + *members.
> + *After padding bytes are exhausted, new fields must be added at the
> + *end of each data structure with 64bit alignment. Flag bits can be
> + *added without size change but existing ones cannot be altered.
> + *
> + * 3. Versions are backward compatible.
> + *
> + * 4. Version to size lookup is supported by kernel internal API for each
> + *API function type. @version is mandatory for new data structures
> + *and must be at the beginning with type of __u32.
> + */
> +#define IOMMU_UAPI_VERSION   1
> +static __inline__ int iommu_get_uapi_version(void)
> +{
> + return IOMMU_UAPI_VERSION;
> +}
> +
> +/*
> + * Supported UAPI features that can be reported to user space.
> + * These types represent the capability available in the kernel.
> + *
> + * REVISIT: UAPI version also implies the capabilities. Should we
> + * report them explicitly?
> + */
> +enum IOMMU_UAPI_DATA_TYPES {
> + IOMMU_UAPI_BIND_GPASID,
> + IOMMU_UAPI_CACHE_INVAL,
> + IOMMU_UAPI_PAGE_RESP,
> + NR_IOMMU_UAPI_TYPE,
> +};
> +
> +#define IOMMU_UAPI_CAP_MASK ((1 << IOMMU_UAPI_BIND_GPASID) | \
> + (1 << IOMMU_UAPI_CACHE_INVAL) | \
> + (1 << IOMMU_UAPI_PAGE_RESP))
> +
> +#define IOMMU_FAULT_PERM_READ(1 << 0) /* read */
> +#define IOMMU_FAULT_PERM_WRITE   (1 << 1) /* write */
> +#define IOMMU_FAULT_PERM_EXEC(1 << 2) /* exec */
> +#define IOMMU_FAULT_PERM_PRIV(1 << 3) /* privileged */
> +
> +/* Generic fault types, can be expanded IRQ remapping fault */
> +enum iommu_fault_type {
> + IOMMU_FAULT_DMA_UNRECOV = 1,/* unrecoverable fault */
> + IOMMU_FAULT_PAGE_REQ,   /* page request fault */
> +};
> +
> +enum iommu_fault_reason {
> + IOMMU_FAULT_REASON_UNKNOWN = 0,
> +
> + /* Could not access the PASID table (fetch caused external abort) */
> + IOMMU_FAULT_REASON_PASID_FETCH,
> +
> + /* PASID entry is invalid or has configuration errors */
> + IOMMU_FAULT_REASON_BAD_PASID_ENTRY,
> +
> + /*
> +  * PASID is out of range (e.g. exceeds the maximum PASID
> +  * supported by the IOMMU) or disabled.
> +  */
> + IOMMU_FAULT_REASON_PASID_INVALID,
> +
> + /*
> +  * An external abort occurred fetching (or updating) a translation
> +  * table descriptor
> +  */
> + IOMMU_FAULT_REASON_WALK_EABT,
> +
> + /*
> +  * Could not access the page table entry (Bad address),
> +  * actual translation fault
> +  */
> + IOMMU_FAULT_REASON_PTE_FETCH,
> +
> + /* Protection flag check failed */
> + IOMMU_FAULT_REASON_PERMISSION,
> +
> + /* access flag check failed */
> + IOMMU_FAULT_REASON_ACCESS,
> +
> + /* Output address of a translation stage caused Address Size fault */
> + IOMMU_FAULT_REASON_OOR_ADDRESS,
> +};
> +
> +/**
> + * struct iommu_fault_unrecoverable - Unrecoverable fault data
> + * @reason: reason of the fault, from  iommu_fault_reason
> + * @flags: parameters of this fault (IOMMU_FAULT_UNRECOV_* values)
> + * @pasid: Process Address Space ID
> + * 

[PATCH v7 5/7] virtio-net: reference implementation of hash report

2020-03-29 Thread Yuri Benditovich
Suggest VIRTIO_NET_F_HASH_REPORT if specified in device
parameters.
If the VIRTIO_NET_F_HASH_REPORT is set,
the device extends configuration space. If the feature
is negotiated, the packet layout is extended to
accommodate the hash information. In this case deliver
packet's hash value and report type in virtio header
extension.
Use for configuration the same procedure as already
used for RSS. We add two fields in rss_data that
controls what the device does with the calculated hash
if rss_data.enabled is set. If field 'populate' is set
the hash is set in the packet, if field 'redirect' is
set the hash is used to decide the queue to place the
packet to.

Signed-off-by: Yuri Benditovich 
---
 hw/net/virtio-net.c| 99 +++---
 include/hw/virtio/virtio-net.h |  2 +
 2 files changed, 81 insertions(+), 20 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index de2d68d4ca..61c956d0ff 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -184,7 +184,7 @@ static VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_net_config, mtu)},
 {.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
  .end = endof(struct virtio_net_config, duplex)},
-{.flags = 1ULL << VIRTIO_NET_F_RSS,
+{.flags = (1ULL << VIRTIO_NET_F_RSS) | (1ULL << VIRTIO_NET_F_HASH_REPORT),
  .end = endof(struct virtio_net_config_with_rss, supported_hash_types)},
 {}
 };
@@ -218,7 +218,8 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 netcfg.cfg.duplex = n->net_conf.duplex;
 netcfg.rss_max_key_size = VIRTIO_NET_RSS_MAX_KEY_SIZE;
 virtio_stw_p(vdev, &netcfg.rss_max_indirection_table_length,
- VIRTIO_NET_RSS_MAX_TABLE_LEN);
+ virtio_host_has_feature(vdev, VIRTIO_NET_F_RSS) ?
+ VIRTIO_NET_RSS_MAX_TABLE_LEN : 1);
 virtio_stl_p(vdev, &netcfg.supported_hash_types,
  VIRTIO_NET_RSS_SUPPORTED_HASHES);
 memcpy(config, &netcfg, n->config_size);
@@ -644,7 +645,7 @@ static int peer_has_ufo(VirtIONet *n)
 }
 
 static void virtio_net_set_mrg_rx_bufs(VirtIONet *n, int mergeable_rx_bufs,
-   int version_1)
+   int version_1, int hash_report)
 {
 int i;
 NetClientState *nc;
@@ -652,7 +653,10 @@ static void virtio_net_set_mrg_rx_bufs(VirtIONet *n, int mergeable_rx_bufs,
 n->mergeable_rx_bufs = mergeable_rx_bufs;
 
 if (version_1) {
-n->guest_hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+n->guest_hdr_len = hash_report ?
+sizeof(struct virtio_net_hdr_v1_hash) :
+sizeof(struct virtio_net_hdr_mrg_rxbuf);
+n->rss_data.populate_hash = !!hash_report;
 } else {
 n->guest_hdr_len = n->mergeable_rx_bufs ?
 sizeof(struct virtio_net_hdr_mrg_rxbuf) :
@@ -773,6 +777,8 @@ static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
 virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_TSO4);
 virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_TSO6);
 virtio_clear_feature(&features, VIRTIO_NET_F_GUEST_ECN);
+
+virtio_clear_feature(&features, VIRTIO_NET_F_HASH_REPORT);
 }
 
 if (!peer_has_vnet_hdr(n) || !peer_has_ufo(n)) {
@@ -785,6 +791,7 @@ static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
 }
 
 virtio_clear_feature(&features, VIRTIO_NET_F_RSS);
+virtio_clear_feature(&features, VIRTIO_NET_F_HASH_REPORT);
 features = vhost_net_get_features(get_vhost_net(nc->peer), features);
 vdev->backend_features = features;
 
@@ -951,12 +958,15 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
virtio_has_feature(features,
   VIRTIO_NET_F_MRG_RXBUF),
virtio_has_feature(features,
-  VIRTIO_F_VERSION_1));
+  VIRTIO_F_VERSION_1),
+   virtio_has_feature(features,
+  VIRTIO_NET_F_HASH_REPORT));
 
 n->rsc4_enabled = virtio_has_feature(features, VIRTIO_NET_F_RSC_EXT) &&
 virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO4);
 n->rsc6_enabled = virtio_has_feature(features, VIRTIO_NET_F_RSC_EXT) &&
 virtio_has_feature(features, VIRTIO_NET_F_GUEST_TSO6);
+n->rss_data.redirect = virtio_has_feature(features, VIRTIO_NET_F_RSS);
 
 if (n->has_vnet_hdr) {
 n->curr_guest_offloads =
@@ -1230,7 +1240,9 @@ static void virtio_net_disable_rss(VirtIONet *n)
 }
 
 static uint16_t virtio_net_handle_rss(VirtIONet *n,
-  struct iovec *iov, unsigned int iov_cnt)
+  struct iovec *iov,
+  unsigned int iov_cnt,
+  bool do_rss)
 {
 VirtIODevice *vdev = 

[PATCH v7 6/7] vmstate.h: provide VMSTATE_VARRAY_UINT16_ALLOC macro

2020-03-29 Thread Yuri Benditovich
Similar to VMSTATE_VARRAY_UINT32_ALLOC, but the size is
a 16-bit field.

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Yuri Benditovich 
---
 include/migration/vmstate.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 30667631bc..baaefb6b9b 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -432,6 +432,16 @@ extern const VMStateInfo vmstate_info_qlist;
 .offset = vmstate_offset_pointer(_state, _field, _type), \
 }
 
+#define VMSTATE_VARRAY_UINT16_ALLOC(_field, _state, _field_num, _version, _info, _type) {\
+.name   = (stringify(_field)),   \
+.version_id = (_version),\
+.num_offset = vmstate_offset_value(_state, _field_num, uint16_t),\
+.info   = &(_info),  \
+.size   = sizeof(_type), \
+.flags  = VMS_VARRAY_UINT16 | VMS_POINTER | VMS_ALLOC,   \
+.offset = vmstate_offset_pointer(_state, _field, _type), \
+}
+
 #define VMSTATE_VARRAY_UINT16_UNSAFE(_field, _state, _field_num, _version, _info, _type) {\
 .name   = (stringify(_field)),   \
 .version_id = (_version),\
-- 
2.17.1
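
(A typical use of the new macro, matching how the migration patch later in
this series wires it up for the RSS indirection table; sketch only, with
field names taken from that patch:

VMSTATE_VARRAY_UINT16_ALLOC(rss_data.indirections_table, VirtIONet,
                            rss_data.indirections_len, 0,
                            vmstate_info_uint16, uint16_t),

This migrates a heap-allocated uint16_t array whose element count lives in
a uint16_t field of the same device state structure.)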




[PATCH v7 7/7] virtio-net: add migration support for RSS and hash report

2020-03-29 Thread Yuri Benditovich
Save and restore RSS/hash report configuration.

Signed-off-by: Yuri Benditovich 
---
 hw/net/virtio-net.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 61c956d0ff..8e09aa0b99 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -2842,6 +2842,13 @@ static int virtio_net_post_load_device(void *opaque, int version_id)
 }
 }
 
+if (n->rss_data.enabled) {
+trace_virtio_net_rss_enable(n->rss_data.hash_types,
+n->rss_data.indirections_len,
+sizeof(n->rss_data.key));
+} else {
+trace_virtio_net_rss_disable();
+}
 return 0;
 }
 
@@ -3019,6 +3026,32 @@ static const VMStateDescription vmstate_virtio_net_has_vnet = {
 },
 };
 
+static bool virtio_net_rss_needed(void *opaque)
+{
+return VIRTIO_NET(opaque)->rss_data.enabled;
+}
+
+static const VMStateDescription vmstate_virtio_net_rss = {
+.name  = "virtio-net-device/rss",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = virtio_net_rss_needed,
+.fields = (VMStateField[]) {
+VMSTATE_BOOL(rss_data.enabled, VirtIONet),
+VMSTATE_BOOL(rss_data.redirect, VirtIONet),
+VMSTATE_BOOL(rss_data.populate_hash, VirtIONet),
+VMSTATE_UINT32(rss_data.hash_types, VirtIONet),
+VMSTATE_UINT16(rss_data.indirections_len, VirtIONet),
+VMSTATE_UINT16(rss_data.default_queue, VirtIONet),
+VMSTATE_UINT8_ARRAY(rss_data.key, VirtIONet,
+VIRTIO_NET_RSS_MAX_KEY_SIZE),
+VMSTATE_VARRAY_UINT16_ALLOC(rss_data.indirections_table, VirtIONet,
+rss_data.indirections_len, 0,
+vmstate_info_uint16, uint16_t),
+VMSTATE_END_OF_LIST()
+},
+};
+
 static const VMStateDescription vmstate_virtio_net_device = {
 .name = "virtio-net-device",
 .version_id = VIRTIO_NET_VM_VERSION,
@@ -3069,6 +3102,10 @@ static const VMStateDescription vmstate_virtio_net_device = {
 has_ctrl_guest_offloads),
 VMSTATE_END_OF_LIST()
},
+.subsections = (const VMStateDescription * []) {
+&vmstate_virtio_net_rss,
+NULL
+}
 };
 
 static NetClientInfo net_virtio_info = {
-- 
2.17.1




[PATCH v7 4/7] tap: allow extended virtio header with hash info

2020-03-29 Thread Yuri Benditovich
Signed-off-by: Yuri Benditovich 
---
 net/tap.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/tap.c b/net/tap.c
index 6207f61f84..47de7fdeb6 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -63,6 +63,14 @@ typedef struct TAPState {
 Notifier exit;
 } TAPState;
 
+/* TODO: remove when virtio_net.h updated */
+struct virtio_net_hdr_v1_hash {
+struct virtio_net_hdr_v1 hdr;
+uint32_t hash_value;
+uint16_t hash_report;
+uint16_t padding;
+};
+
 static void launch_script(const char *setup_script, const char *ifname,
   int fd, Error **errp);
 
@@ -254,7 +262,8 @@ static void tap_set_vnet_hdr_len(NetClientState *nc, int 
len)
 
 assert(nc->info->type == NET_CLIENT_DRIVER_TAP);
 assert(len == sizeof(struct virtio_net_hdr_mrg_rxbuf) ||
-   len == sizeof(struct virtio_net_hdr));
+   len == sizeof(struct virtio_net_hdr) ||
+   len == sizeof(struct virtio_net_hdr_v1_hash));
 
 tap_fd_set_vnet_hdr_len(s->fd, len);
 s->host_vnet_hdr_len = len;
-- 
2.17.1




[PATCH v7 3/7] virtio-net: implement RX RSS processing

2020-03-29 Thread Yuri Benditovich
If VIRTIO_NET_F_RSS negotiated and RSS is enabled, process
incoming packets, calculate packet's hash and place the
packet into respective RX virtqueue.

Signed-off-by: Yuri Benditovich 
---
 hw/net/virtio-net.c| 88 +-
 include/hw/virtio/virtio-net.h |  1 +
 2 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 6d21922746..de2d68d4ca 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -42,6 +42,7 @@
 #include "trace.h"
 #include "monitor/qdev.h"
 #include "hw/pci/pci.h"
+#include "net_rx_pkt.h"
 
 #define VIRTIO_NET_VM_VERSION11
 
@@ -1598,8 +1599,80 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, int size)
 return 0;
 }
 
+static uint8_t virtio_net_get_hash_type(bool isip4,
+bool isip6,
+bool isudp,
+bool istcp,
+uint32_t types)
+{
+if (isip4) {
+if (istcp && (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4)) {
+return NetPktRssIpV4Tcp;
+}
+if (isudp && (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4)) {
+return NetPktRssIpV4Udp;
+}
+if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4) {
+return NetPktRssIpV4;
+}
+} else if (isip6) {
+uint32_t mask = VIRTIO_NET_RSS_HASH_TYPE_TCP_EX |
+VIRTIO_NET_RSS_HASH_TYPE_TCPv6;
+
+if (istcp && (types & mask)) {
+return (types & VIRTIO_NET_RSS_HASH_TYPE_TCP_EX) ?
+NetPktRssIpV6TcpEx : NetPktRssIpV6Tcp;
+}
+mask = VIRTIO_NET_RSS_HASH_TYPE_UDP_EX | VIRTIO_NET_RSS_HASH_TYPE_UDPv6;
+if (isudp && (types & mask)) {
+return (types & VIRTIO_NET_RSS_HASH_TYPE_UDP_EX) ?
+NetPktRssIpV6UdpEx : NetPktRssIpV6Udp;
+}
+mask = VIRTIO_NET_RSS_HASH_TYPE_IP_EX | VIRTIO_NET_RSS_HASH_TYPE_IPv6;
+if (types & mask) {
+return (types & VIRTIO_NET_RSS_HASH_TYPE_IP_EX) ?
+NetPktRssIpV6Ex : NetPktRssIpV6;
+}
+}
+return 0xff;
+}
+
+static int virtio_net_process_rss(NetClientState *nc, const uint8_t *buf,
+  size_t size)
+{
+VirtIONet *n = qemu_get_nic_opaque(nc);
+unsigned int index = nc->queue_index, new_index;
+struct NetRxPkt *pkt = n->rx_pkt;
+uint8_t net_hash_type;
+uint32_t hash;
+bool isip4, isip6, isudp, istcp;
+
+net_rx_pkt_set_protocols(pkt, buf + n->host_hdr_len,
+ size - n->host_hdr_len);
+net_rx_pkt_get_protocols(pkt, &isip4, &isip6, &isudp, &istcp);
+if (isip4 && (net_rx_pkt_get_ip4_info(pkt)->fragment)) {
+istcp = isudp = false;
+}
+if (isip6 && (net_rx_pkt_get_ip6_info(pkt)->fragment)) {
+istcp = isudp = false;
+}
+net_hash_type = virtio_net_get_hash_type(isip4, isip6, isudp, istcp,
+ n->rss_data.hash_types);
+if (net_hash_type > NetPktRssIpV6UdpEx) {
+return n->rss_data.default_queue;
+}
+
+hash = net_rx_pkt_calc_rss_hash(pkt, net_hash_type, n->rss_data.key);
+new_index = hash & (n->rss_data.indirections_len - 1);
+new_index = n->rss_data.indirections_table[new_index];
+if (index == new_index) {
+return -1;
+}
+return new_index;
+}
+
 static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
-  size_t size)
+  size_t size, bool no_rss)
 {
 VirtIONet *n = qemu_get_nic_opaque(nc);
 VirtIONetQueue *q = virtio_net_get_subqueue(nc);
@@ -1613,6 +1686,14 @@ static ssize_t virtio_net_receive_rcu(NetClientState *nc, const uint8_t *buf,
 return -1;
 }
 
+if (!no_rss && n->rss_data.enabled) {
+int index = virtio_net_process_rss(nc, buf, size);
+if (index >= 0) {
+NetClientState *nc2 = qemu_get_subqueue(n->nic, index);
+return virtio_net_receive_rcu(nc2, buf, size, true);
+}
+}
+
 /* hdr_len refers to the header we supply to the guest */
 if (!virtio_net_has_buffers(q, size + n->guest_hdr_len - n->host_hdr_len)) {
 return 0;
@@ -1707,7 +1788,7 @@ static ssize_t virtio_net_do_receive(NetClientState *nc, const uint8_t *buf,
 {
 RCU_READ_LOCK_GUARD();
 
-return virtio_net_receive_rcu(nc, buf, size);
+return virtio_net_receive_rcu(nc, buf, size, false);
 }
 
 static void virtio_net_rsc_extract_unit4(VirtioNetRscChain *chain,
@@ -3283,6 +3364,8 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 
 QTAILQ_INIT(&n->rsc_chains);
 n->qdev = dev;
+
+net_rx_pkt_init(&n->rx_pkt, false);
 }
 
 static void virtio_net_device_unrealize(DeviceState *dev, Error **errp)
@@ -3320,6 +3403,7 @@ static void virtio_net_device_unrealize(DeviceState *dev, 
Error 

[PATCH v7 1/7] virtio-net: introduce RSS and hash report features

2020-03-29 Thread Yuri Benditovich
Signed-off-by: Yuri Benditovich 
---
 hw/net/virtio-net.c | 65 +
 1 file changed, 65 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 3627bb1717..90b01221e9 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -71,6 +71,71 @@
 #define VIRTIO_NET_IP6_ADDR_SIZE   32  /* ipv6 saddr + daddr */
 #define VIRTIO_NET_MAX_IP6_PAYLOAD VIRTIO_NET_MAX_TCP_PAYLOAD
 
+/* TODO: remove after virtio-net header update */
+#if !defined(VIRTIO_NET_RSS_HASH_TYPE_IPv4)
+#define VIRTIO_NET_F_HASH_REPORT57  /* Supports hash report */
+#define VIRTIO_NET_F_RSS60  /* Supports RSS RX steering */
+
+/* supported/enabled hash types */
+#define VIRTIO_NET_RSS_HASH_TYPE_IPv4  (1 << 0)
+#define VIRTIO_NET_RSS_HASH_TYPE_TCPv4 (1 << 1)
+#define VIRTIO_NET_RSS_HASH_TYPE_UDPv4 (1 << 2)
+#define VIRTIO_NET_RSS_HASH_TYPE_IPv6  (1 << 3)
+#define VIRTIO_NET_RSS_HASH_TYPE_TCPv6 (1 << 4)
+#define VIRTIO_NET_RSS_HASH_TYPE_UDPv6 (1 << 5)
+#define VIRTIO_NET_RSS_HASH_TYPE_IP_EX (1 << 6)
+#define VIRTIO_NET_RSS_HASH_TYPE_TCP_EX(1 << 7)
+#define VIRTIO_NET_RSS_HASH_TYPE_UDP_EX(1 << 8)
+
+struct virtio_net_config_with_rss {
+struct virtio_net_config cfg;
+/* maximum size of RSS key */
+uint8_t rss_max_key_size;
+/* maximum number of indirection table entries */
+uint16_t rss_max_indirection_table_length;
+/* bitmask of supported VIRTIO_NET_RSS_HASH_ types */
+uint32_t supported_hash_types;
+} QEMU_PACKED;
+
+struct virtio_net_hdr_v1_hash {
+struct virtio_net_hdr_v1 hdr;
+uint32_t hash_value;
+#define VIRTIO_NET_HASH_REPORT_NONE0
+#define VIRTIO_NET_HASH_REPORT_IPv41
+#define VIRTIO_NET_HASH_REPORT_TCPv4   2
+#define VIRTIO_NET_HASH_REPORT_UDPv4   3
+#define VIRTIO_NET_HASH_REPORT_IPv64
+#define VIRTIO_NET_HASH_REPORT_TCPv6   5
+#define VIRTIO_NET_HASH_REPORT_UDPv6   6
+#define VIRTIO_NET_HASH_REPORT_IPv6_EX 7
+#define VIRTIO_NET_HASH_REPORT_TCPv6_EX8
+#define VIRTIO_NET_HASH_REPORT_UDPv6_EX9
+uint16_t hash_report;
+uint16_t padding;
+};
+
+/*
+ * The command VIRTIO_NET_CTRL_MQ_RSS_CONFIG has the same effect as
+ * VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET does and additionally configures
+ * the receive steering to use a hash calculated for incoming packet
+ * to decide on receive virtqueue to place the packet. The command
+ * also provides parameters to calculate a hash and receive virtqueue.
+ */
+struct virtio_net_rss_config {
+uint32_t hash_types;
+uint16_t indirection_table_mask;
+uint16_t unclassified_queue;
+uint16_t indirection_table[1/* + indirection_table_mask */];
+uint16_t max_tx_vq;
+uint8_t hash_key_length;
+uint8_t hash_key_data[/* hash_key_length */];
+};
+
+#define VIRTIO_NET_CTRL_MQ_RSS_CONFIG  1
+#define VIRTIO_NET_CTRL_MQ_HASH_CONFIG 2
+
+#endif
+
 /* Purge coalesced packets timer interval, This value affects the performance
a lot, and should be tuned carefully, '30'(300us) is the recommended
value to pass the WHQL test, '5' can gain 2x netperf throughput with
-- 
2.17.1




[PATCH v7 2/7] virtio-net: implement RSS configuration command

2020-03-29 Thread Yuri Benditovich
Optionally report RSS feature.
Handle RSS configuration command and keep RSS parameters
in virtio-net device context.

Signed-off-by: Yuri Benditovich 
---
 hw/net/trace-events|   3 +
 hw/net/virtio-net.c| 189 +
 include/hw/virtio/virtio-net.h |  13 +++
 3 files changed, 185 insertions(+), 20 deletions(-)

diff --git a/hw/net/trace-events b/hw/net/trace-events
index a1da98a643..a84b9c3d9f 100644
--- a/hw/net/trace-events
+++ b/hw/net/trace-events
@@ -371,6 +371,9 @@ virtio_net_announce_notify(void) ""
 virtio_net_announce_timer(int round) "%d"
 virtio_net_handle_announce(int round) "%d"
 virtio_net_post_load_device(void)
+virtio_net_rss_disable(void)
+virtio_net_rss_error(const char *msg, uint32_t value) "%s, value 0x%08x"
+virtio_net_rss_enable(uint32_t p1, uint16_t p2, uint8_t p3) "hashes 0x%x, table of %d, key of %d"
 
 # tulip.c
 tulip_reg_write(uint64_t addr, const char *name, int size, uint64_t val) "addr 0x%02"PRIx64" (%s) size %d value 0x%08"PRIx64
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 90b01221e9..6d21922746 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -142,6 +142,16 @@ struct virtio_net_rss_config {
tso/gso/gro 'off'. */
 #define VIRTIO_NET_RSC_DEFAULT_INTERVAL 30
 
+#define VIRTIO_NET_RSS_SUPPORTED_HASHES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+ VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+ VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+ VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+ VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+ VIRTIO_NET_RSS_HASH_TYPE_UDPv6 | \
+ VIRTIO_NET_RSS_HASH_TYPE_IP_EX | \
+ VIRTIO_NET_RSS_HASH_TYPE_TCP_EX | \
+ VIRTIO_NET_RSS_HASH_TYPE_UDP_EX)
+
 /* temporary until standard header include it */
 #if !defined(VIRTIO_NET_HDR_F_RSC_INFO)
 
@@ -173,6 +183,8 @@ static VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_net_config, mtu)},
 {.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
  .end = endof(struct virtio_net_config, duplex)},
+{.flags = 1ULL << VIRTIO_NET_F_RSS,
+ .end = endof(struct virtio_net_config_with_rss, supported_hash_types)},
 {}
 };
 
@@ -195,28 +207,33 @@ static int vq2q(int queue_index)
 static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 {
 VirtIONet *n = VIRTIO_NET(vdev);
-struct virtio_net_config netcfg;
-
-virtio_stw_p(vdev, &netcfg.status, n->status);
-virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
-virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
-memcpy(netcfg.mac, n->mac, ETH_ALEN);
-virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
-netcfg.duplex = n->net_conf.duplex;
+struct virtio_net_config_with_rss netcfg;
+
+virtio_stw_p(vdev, &netcfg.cfg.status, n->status);
+virtio_stw_p(vdev, &netcfg.cfg.max_virtqueue_pairs, n->max_queues);
+virtio_stw_p(vdev, &netcfg.cfg.mtu, n->net_conf.mtu);
+memcpy(netcfg.cfg.mac, n->mac, ETH_ALEN);
+virtio_stl_p(vdev, &netcfg.cfg.speed, n->net_conf.speed);
+netcfg.cfg.duplex = n->net_conf.duplex;
+netcfg.rss_max_key_size = VIRTIO_NET_RSS_MAX_KEY_SIZE;
+virtio_stw_p(vdev, &netcfg.rss_max_indirection_table_length,
+ VIRTIO_NET_RSS_MAX_TABLE_LEN);
+virtio_stl_p(vdev, &netcfg.supported_hash_types,
+ VIRTIO_NET_RSS_SUPPORTED_HASHES);
 memcpy(config, &netcfg, n->config_size);
 }
 
 static void virtio_net_set_config(VirtIODevice *vdev, const uint8_t *config)
 {
 VirtIONet *n = VIRTIO_NET(vdev);
-struct virtio_net_config netcfg = {};
+struct virtio_net_config_with_rss netcfg = {};
 
 memcpy(&netcfg, config, n->config_size);
 
 if (!virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_MAC_ADDR) &&
 !virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1) &&
-memcmp(netcfg.mac, n->mac, ETH_ALEN)) {
-memcpy(n->mac, netcfg.mac, ETH_ALEN);
+memcmp(netcfg.cfg.mac, n->mac, ETH_ALEN)) {
+memcpy(n->mac, netcfg.cfg.mac, ETH_ALEN);
 qemu_format_nic_info_str(qemu_get_queue(n->nic), n->mac);
 }
 }
@@ -766,6 +783,7 @@ static uint64_t virtio_net_get_features(VirtIODevice *vdev, uint64_t features,
 return features;
 }
 
+virtio_clear_feature(&features, VIRTIO_NET_F_RSS);
 features = vhost_net_get_features(get_vhost_net(nc->peer), features);
 vdev->backend_features = features;
 
@@ -925,6 +943,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint64_t features)
 }
 
 virtio_net_set_multiqueue(n,
+  virtio_has_feature(features, VIRTIO_NET_F_RSS) ||
   virtio_has_feature(features, VIRTIO_NET_F_MQ));
 
 virtio_net_set_mrg_rx_bufs(n,
@@ -1201,25 +1220,152 @@ static int virtio_net_handle_announce(VirtIONet *n, uint8_t cmd,
 }
 }
 
+static void virtio_net_disable_rss(VirtIONet *n)
+{
+ 

[PATCH v7 0/7] reference implementation of RSS and hash report

2020-03-29 Thread Yuri Benditovich
Support for VIRTIO_NET_F_RSS and VIRTIO_NET_F_HASH_REPORT
features in QEMU for reference purposes.
Implements Toeplitz hash calculation for incoming
packets according to configuration provided by driver.
Uses the calculated hash to decide on the receive virtqueue
and/or reports the hash in the virtio header.
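
(For readers unfamiliar with it: a Toeplitz hash XORs together the 32-bit
key windows selected by the set bits of the input. A compact standalone
sketch, not the series' exact code, assuming a key at least input_len + 4
bytes long, as the standard 40-byte RSS key is for IPv6 + ports input:

#include <stdint.h>
#include <stddef.h>

static uint32_t toeplitz_hash(const uint8_t *in, size_t len, const uint8_t *key)
{
    uint32_t hash = 0;
    /* 32-bit window over the key, starting at key bit 0 */
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | key[3];

    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (in[i] & (1u << b)) {
                hash ^= window;   /* XOR in the window at this bit position */
            }
            /* slide the window one bit, pulling in the next key bit */
            window = (window << 1) | ((key[i + 4] >> b) & 1);
        }
    }
    return hash;
}
)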

Changes from v6:
Fixed a bug in patch 5 "reference implementation of hash report"
that caused the ASAN test to fail
was: n->rss_data.populate_hash = true;
fixed: n->rss_data.populate_hash = !!hash_report;

Yuri Benditovich (7):
  virtio-net: introduce RSS and hash report features
  virtio-net: implement RSS configuration command
  virtio-net: implement RX RSS processing
  tap: allow extended virtio header with hash info
  virtio-net: reference implementation of hash report
  vmstate.h: provide VMSTATE_VARRAY_UINT16_ALLOC macro
  virtio-net: add migration support for RSS and hash report

 hw/net/trace-events|   3 +
 hw/net/virtio-net.c| 448 +++--
 include/hw/virtio/virtio-net.h |  16 ++
 include/migration/vmstate.h|  10 +
 net/tap.c  |  11 +-
 5 files changed, 460 insertions(+), 28 deletions(-)

-- 
2.17.1




Re: [PATCH v5 07/18] s390x: protvirt: Inhibit balloon when switching to protected mode

2020-03-29 Thread Michael S. Tsirkin
On Fri, Mar 20, 2020 at 10:36:53AM +0100, David Hildenbrand wrote:
> On 19.03.20 18:45, Michael S. Tsirkin wrote:
> > On Thu, Mar 19, 2020 at 02:54:11PM +0100, David Hildenbrand wrote:
> >> Why does the balloon driver not support VIRTIO_F_IOMMU_PLATFORM? It is
> >> absolutely not clear to me. The introducing commit mentioned that it
> >> "bypasses DMA". I fail to see that.
> > 
> > Well sure one can put the balloon behind an IOMMU.  It will shuffle PFN
> > lists through a shared page.  Problem is, you can't run an untrusted
> > driver with it since if you do it can corrupt guest memory.
> > And VIRTIO_F_IOMMU_PLATFORM so far meant that you can run
> > a userspace driver.
> 
> Just to clarify: Is it sufficient to clear VIRTIO_F_IOMMU_PLATFORM in
> the *guest kernel driver* to prohibit *guest userspace drivers*?

No it isn't sufficient.

> I would have thought we would have to disallow on the hypervisor/device
> side. (no expert on user space drivers, especially how they
> detect/enable/access virtio devices)

QEMU does exactly this:

static int virtio_validate_features(VirtIODevice *vdev)
{
VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);

if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM) &&
!virtio_vdev_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM)) {
return -EFAULT;
}
...
}


> > 
> > Maybe we need a separate feature bit for this kind of thing where you
> > assume the driver is trusted? Such a bit - unlike
> > VIRTIO_F_IOMMU_PLATFORM - would allow legacy guests ...
> 
> Let's take virtio-mem as an example. You cannot zap memory outside of
> the scope of a virtio-mem device. So I assume having a user space driver
> would be ok (although most probably of limited use :) )?
> 
> Still, for virtio-mem, special s390x handling, similar to virtio-balloon
> - (un)sharing of pages - would have to be performed.
> 
> So some feature bits to cleanly separate the different limitations would
> be great. At least in regard to s390x, I guess we don't have to worry
> too much about legacy guests.

So if you have the cycles to think through and document how balloon
interacts with different access limitations, that would be great!

> -- 
> Thanks,
> 
> David / dhildenb




Re: [PATCH RESEND v3 0/4] virtio-pci: enable blk and scsi multi-queue by default

2020-03-29 Thread Michael S. Tsirkin
On Fri, Mar 20, 2020 at 10:30:37AM +, Stefan Hajnoczi wrote:
> v3:
>  * Add new performance results that demonstrate the scalability
>  * Mention that this is PCI-specific [Cornelia]
> v2:
>  * Let the virtio-DEVICE-pci device select num-queues because the optimal
>multi-queue configuration may differ between virtio-pci, virtio-mmio, and
>virtio-ccw [Cornelia]


I'd like to queue it for merge after the release. If possible
please ping me after the release to help make sure it didn't get
dropped.

Thanks!


> Enabling multi-queue on virtio-pci storage devices improves performance on SMP
> guests because the completion interrupt is handled on the vCPU that submitted
> the I/O request.  This avoids IPIs inside the guest.
> 
> Note that performance is unchanged in these cases:
> 1. Uniprocessor guests.  They don't have IPIs.
> 2. Application threads might be scheduled on the sole vCPU that handles
>completion interrupts purely by chance.  (This is one reason why benchmark
>results can vary noticeably between runs.)
> 3. Users may bind the application to the vCPU that handles completion
>interrupts.
> 
> Set the number of queues to the number of vCPUs by default on virtio-blk and
> virtio-scsi PCI devices.  Older machine types continue to default to 1 queue
> for live migration compatibility.
> 
> Random read performance:
>   IOPS
> q=178k
> q=32  104k  +33%
> 
> Boot time:
>   Duration
> q=151s
> q=32 1m41s  +98%
> 
> Guest configuration: 32 vCPUs, 101 virtio-blk-pci disks
> 
> Previously measured results on a 4 vCPU guest were also positive but showed a
> smaller 1-4% performance improvement.  They are no longer valid because
> significant event loop optimizations have been merged.
> 
> Stefan Hajnoczi (4):
>   virtio-scsi: introduce a constant for fixed virtqueues
>   virtio-scsi: default num_queues to -smp N
>   virtio-blk: default num_queues to -smp N
>   vhost-user-blk: default num_queues to -smp N
> 
>  hw/block/vhost-user-blk.c  |  6 +-
>  hw/block/virtio-blk.c  |  6 +-
>  hw/core/machine.c  |  5 +
>  hw/scsi/vhost-scsi.c   |  3 ++-
>  hw/scsi/vhost-user-scsi.c  |  5 +++--
>  hw/scsi/virtio-scsi.c  | 13 +
>  hw/virtio/vhost-scsi-pci.c | 10 --
>  hw/virtio/vhost-user-blk-pci.c |  6 ++
>  hw/virtio/vhost-user-scsi-pci.c| 10 --
>  hw/virtio/virtio-blk-pci.c |  9 -
>  hw/virtio/virtio-scsi-pci.c| 10 --
>  include/hw/virtio/vhost-user-blk.h |  2 ++
>  include/hw/virtio/virtio-blk.h |  2 ++
>  include/hw/virtio/virtio-scsi.h|  5 +
>  14 files changed, 76 insertions(+), 16 deletions(-)
> 
> -- 
> 2.24.1
> 
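
(For comparison testing, the queue count can also be pinned explicitly
rather than relying on the new default; an illustrative command line, where
the drive id and image file are placeholders:

  qemu-system-x86_64 -smp 4 \
      -drive if=none,id=disk0,file=test.img,format=raw \
      -device virtio-blk-pci,drive=disk0,num-queues=4
)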




Re: [PATCH] Refactor vhost_user_set_mem_table functions

2020-03-29 Thread Michael S. Tsirkin
On Wed, Mar 25, 2020 at 06:35:06AM -0400, Raphael Norwitz wrote:
> vhost_user_set_mem_table() and vhost_user_set_mem_table_postcopy() have
> gotten convoluted, and have some identical code.
> 
> This change moves the logic populating the VhostUserMemory struct and
> fds array from vhost_user_set_mem_table() and
> vhost_user_set_mem_table_postcopy() to a new function,
> vhost_user_fill_set_mem_table_msg().
> 
> No functionality is impacted.
> 
> Signed-off-by: Raphael Norwitz 
> Signed-off-by: Peter Turschmid 


Thanks!

I'd like to queue it for merge after the release. If possible
please ping me after the release to help make sure it didn't get
dropped.


> ---
>  hw/virtio/vhost-user.c | 143 
> +++--
>  1 file changed, 67 insertions(+), 76 deletions(-)
> 
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 08e7e63..ec21e8f 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -407,18 +407,79 @@ static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base,
>  return 0;
>  }
>  
> +static int vhost_user_fill_set_mem_table_msg(struct vhost_user *u,
> + struct vhost_dev *dev,
> + VhostUserMsg *msg,
> + int *fds, size_t *fd_num,
> + bool track_ramblocks)
> +{
> +int i, fd;
> +ram_addr_t offset;
> +MemoryRegion *mr;
> +struct vhost_memory_region *reg;
> +
> +msg->hdr.request = VHOST_USER_SET_MEM_TABLE;
> +
> +for (i = 0; i < dev->mem->nregions; ++i) {
> +reg = dev->mem->regions + i;
> +
> +assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
> +mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
> + &offset);
> +fd = memory_region_get_fd(mr);
> +if (fd > 0) {
> +if (track_ramblocks) {
> +assert(*fd_num < VHOST_MEMORY_MAX_NREGIONS);
> +trace_vhost_user_set_mem_table_withfd(*fd_num, mr->name,
> +  reg->memory_size,
> +  reg->guest_phys_addr,
> +  reg->userspace_addr,
> +  offset);
> +u->region_rb_offset[i] = offset;
> +u->region_rb[i] = mr->ram_block;
> +} else if (*fd_num == VHOST_MEMORY_MAX_NREGIONS) {
> +error_report("Failed preparing vhost-user memory table msg");
> +return -1;
> +}
> +msg->payload.memory.regions[*fd_num].userspace_addr =
> +reg->userspace_addr;
> +msg->payload.memory.regions[*fd_num].memory_size =
> +reg->memory_size;
> +msg->payload.memory.regions[*fd_num].guest_phys_addr =
> +reg->guest_phys_addr;
> +msg->payload.memory.regions[*fd_num].mmap_offset = offset;
> +fds[(*fd_num)++] = fd;
> +} else if (track_ramblocks) {
> +u->region_rb_offset[i] = 0;
> +u->region_rb[i] = NULL;
> +}
> +}
> +
> +msg->payload.memory.nregions = *fd_num;
> +
> +if (!*fd_num) {
> +error_report("Failed initializing vhost-user memory map, "
> + "consider using -object memory-backend-file share=on");
> +return -1;
> +}
> +
> +msg->hdr.size = sizeof(msg->payload.memory.nregions);
> +msg->hdr.size += sizeof(msg->payload.memory.padding);
> +msg->hdr.size += *fd_num * sizeof(VhostUserMemoryRegion);
> +
> +return 1;
> +}
> +
>  static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
>   struct vhost_memory *mem)
>  {
>  struct vhost_user *u = dev->opaque;
>  int fds[VHOST_MEMORY_MAX_NREGIONS];
> -int i, fd;
>  size_t fd_num = 0;
>  VhostUserMsg msg_reply;
>  int region_i, msg_i;
>  
>  VhostUserMsg msg = {
> -.hdr.request = VHOST_USER_SET_MEM_TABLE,
>  .hdr.flags = VHOST_USER_VERSION,
>  };
>  
> @@ -433,48 +494,11 @@ static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
>  u->region_rb_len = dev->mem->nregions;
>  }
>  
> -for (i = 0; i < dev->mem->nregions; ++i) {
> -struct vhost_memory_region *reg = dev->mem->regions + i;
> -ram_addr_t offset;
> -MemoryRegion *mr;
> -
> -assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
> -mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
> - &offset);
> -fd = memory_region_get_fd(mr);
> -if (fd > 0) {
> -assert(fd_num < VHOST_MEMORY_MAX_NREGIONS);
> -

Re: [PATCH] fix vhost_user_blk_watch crash

2020-03-29 Thread Michael S. Tsirkin
On Mon, Mar 23, 2020 at 01:29:24PM +0800, Li Feng wrote:
> > The G_IO_HUP event is already watched in tcp_chr_connect(), so the
> > vhost_user_blk_watch callback is not needed: tcp_chr_hup is registered
> > as the callback, and it will close the tcp link.
> 
> Signed-off-by: Li Feng 

Raphael would you like to review this?
Also, I think at this point
nutanix is the biggest contributor to vhost user blk.
So if you want to be added to MAINTAINERS on this
one so people Cc you on patches, that'd be great.

> ---
>  hw/block/vhost-user-blk.c  | 19 ---
>  include/hw/virtio/vhost-user-blk.h |  1 -
>  2 files changed, 20 deletions(-)
> 
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 12925a47ec..17df5338e7 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -349,18 +349,6 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
>  vhost_dev_cleanup(&s->dev);
>  }
>  
> -static gboolean vhost_user_blk_watch(GIOChannel *chan, GIOCondition cond,
> - void *opaque)
> -{
> -DeviceState *dev = opaque;
> -VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> -VHostUserBlk *s = VHOST_USER_BLK(vdev);
> -
> -qemu_chr_fe_disconnect(&s->chardev);
> -
> -return true;
> -}
> -
>  static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
>  {
>  DeviceState *dev = opaque;
> @@ -373,15 +361,9 @@ static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
>  qemu_chr_fe_disconnect(&s->chardev);
>  return;
>  }
> -s->watch = qemu_chr_fe_add_watch(&s->chardev, G_IO_HUP,
> - vhost_user_blk_watch, dev);
>  break;
>  case CHR_EVENT_CLOSED:
>  vhost_user_blk_disconnect(dev);
> -if (s->watch) {
> -g_source_remove(s->watch);
> -s->watch = 0;
> -}
>  break;
>  case CHR_EVENT_BREAK:
>  case CHR_EVENT_MUX_IN:
> @@ -428,7 +410,6 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
>  
>  s->inflight = g_new0(struct vhost_inflight, 1);
>  s->vhost_vqs = g_new0(struct vhost_virtqueue, s->num_queues);
> -s->watch = 0;
>  s->connected = false;
>  
>  qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL, vhost_user_blk_event,
> diff --git a/include/hw/virtio/vhost-user-blk.h b/include/hw/virtio/vhost-user-blk.h
> index 05ea0ad183..34ad6f0c0e 100644
> --- a/include/hw/virtio/vhost-user-blk.h
> +++ b/include/hw/virtio/vhost-user-blk.h
> @@ -38,7 +38,6 @@ typedef struct VHostUserBlk {
>  VhostUserState vhost_user;
>  struct vhost_virtqueue *vhost_vqs;
>  VirtQueue **virtqs;
> -guint watch;
>  bool connected;
>  } VHostUserBlk;
>  
> -- 
> 2.11.0
> 
> 
> 




Re: [PATCH 0/6] acpi: i386 tweaks

2020-03-29 Thread Michael S. Tsirkin



On Fri, Mar 27, 2020 at 01:11:05PM +0100, Gerd Hoffmann wrote:
> First batch of microvm patches, some generic acpi stuff.
> Split the acpi-build.c monster, specifically split the
> pc and q35 and pci bits into a separate file which we
> can skip building at some point in the future.
> 
> Also some small refactorings and simplifications.
> 
> take care,
>   Gerd


I'd like to queue it for merge after the release. If possible
please ping me after the release to help make sure it didn't get
dropped.

Thanks!

-- 
MST




Re: [PATCH v2 2/3] acpi: Add Windows ACPI Emulated Device Table (WAET)

2020-03-29 Thread Michael S. Tsirkin


On Fri, Mar 27, 2020 at 06:14:26PM +0300, Liran Alon wrote:
> As almost two weeks have passed since this Reviewed-By comment, can I assume
> this series is done from my perspective and ready to be merged?
> As I haven't seen a Reviewed-By for the rest of the patches of this series.
> 
> Thanks,
> -Liran

I'll queue it for merge after the release. If possible please ping me
after the release to help make sure it didn't get dropped.

Thanks!

-- 
MST




RE: [PATCH v1 19/22] intel_iommu: process PASID-based iotlb invalidation

2020-03-29 Thread Liu, Yi L
> From: Peter Xu 
> Sent: Wednesday, March 25, 2020 11:16 PM
> To: Liu, Yi L 
> Subject: Re: [PATCH v1 19/22] intel_iommu: process PASID-based iotlb 
> invalidation
> 
> On Wed, Mar 25, 2020 at 01:36:03PM +, Liu, Yi L wrote:
> > > From: Peter Xu 
> > > Sent: Wednesday, March 25, 2020 2:26 AM
> > > To: Liu, Yi L 
> > > Subject: Re: [PATCH v1 19/22] intel_iommu: process PASID-based iotlb
> > > invalidation
> > >
> > > On Sun, Mar 22, 2020 at 05:36:16AM -0700, Liu Yi L wrote:
> > > > This patch adds the basic PASID-based iotlb (piotlb) invalidation
> > > > support. piotlb is used during walking Intel VT-d 1st level page
> > > > table. This patch only adds the basic processing. Detailed
> > > > handling will be added in next patch.
> > > >
> > > > Cc: Kevin Tian 
> > > > Cc: Jacob Pan 
> > > > Cc: Peter Xu 
> > > > Cc: Yi Sun 
> > > > Cc: Paolo Bonzini 
> > > > Cc: Richard Henderson 
> > > > Cc: Eduardo Habkost 
> > > > Signed-off-by: Liu Yi L 
> > > > ---
> > > >  hw/i386/intel_iommu.c  | 57
> > > ++
> > > >  hw/i386/intel_iommu_internal.h | 13 ++
> > > >  2 files changed, 70 insertions(+)
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> > > > b007715..b9ac07d 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -3134,6 +3134,59 @@ static bool
> > > > vtd_process_pasid_desc(IntelIOMMUState
> > > *s,
> > > >  return (ret == 0) ? true : false;  }
> > > >
> > > > +static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
> > > > +uint16_t domain_id,
> > > > +uint32_t pasid) { }
> > > > +
> > > > +static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t
> domain_id,
> > > > + uint32_t pasid, hwaddr addr, uint8_t
> > > > +am, bool ih) { }
> > > > +
> > > > +static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
> > > > +VTDInvDesc *inv_desc) {
> > > > +uint16_t domain_id;
> > > > +uint32_t pasid;
> > > > +uint8_t am;
> > > > +hwaddr addr;
> > > > +
> > > > +if ((inv_desc->val[0] & VTD_INV_DESC_PIOTLB_RSVD_VAL0) ||
> > > > +(inv_desc->val[1] & VTD_INV_DESC_PIOTLB_RSVD_VAL1)) {
> > > > +error_report_once("non-zero-field-in-piotlb_inv_desc hi: 0x%" 
> > > > PRIx64
> > > > +  " lo: 0x%" PRIx64, inv_desc->val[1], 
> > > > inv_desc->val[0]);
> > > > +return false;
> > > > +}
> > > > +
> > > > +domain_id = VTD_INV_DESC_PIOTLB_DID(inv_desc->val[0]);
> > > > +pasid = VTD_INV_DESC_PIOTLB_PASID(inv_desc->val[0]);
> > > > +switch (inv_desc->val[0] & VTD_INV_DESC_IOTLB_G) {
> > > > +case VTD_INV_DESC_PIOTLB_ALL_IN_PASID:
> > > > +vtd_piotlb_pasid_invalidate(s, domain_id, pasid);
> > > > +break;
> > > > +
> > > > +case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
> > > > +am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
> > > > +addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
> > > > +if (am > VTD_MAMV) {
> > >
> > > I saw this of spec 10.4.2, MAMV:
> > >
> > > Independent of value reported in this field, implementations
> > > supporting SMTS must support address-selective PASID-based
> > > IOTLB invalidations (p_iotlb_inv_dsc) with any defined address
> > > mask.
> > >
> > > Does it mean we should even support larger AM?
> > >
> > > Besides that, the patch looks good to me.
> >
> > I don't think so. This field is for second-level table in scalable
> > mode and the translation table in legacy mode. For first-level table,
> > it always supports page selective invalidation and all the supported
> > masks regardless of the PSI support bit and the MAMV field in the CAP_REG.
> 
> Yes that's exactly what I wanted to ask...  Let me try again.
> 
> I thought VTD_MAMV was only for 2nd level page table, not for pasid-iotlb
> invalidations.  So I think we should remove this "if"
> check (that corresponds to "we should even support larger AM"), right?

Right. I confirmed with spec owner. Will remove it. :-)

Regards,
Yi Liu
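
(For clarity, with that check dropped the page-selective branch reduces to
roughly the sketch below; whether the invalidation hint bit "ih" is
extracted at this exact point is an assumption on top of the quoted patch:

case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
    am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
    addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
    /* first-level tables accept any defined address mask, so no
     * "am > VTD_MAMV" rejection for PASID-based invalidation */
    vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am, ih);
    break;
)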


Re: [PATCH v3 00/10] ARM virt: Add NVDIMM support

2020-03-29 Thread Michael S. Tsirkin
On Wed, Mar 11, 2020 at 05:20:04PM +, Shameer Kolothum wrote:
> This series adds NVDIMM support to arm/virt platform.


So I'm still confused about whether there's a bugfix here
that we need for 5.0. If yes pls post just that part
with acks included and for-5.0 in the subject.

> The series reuses some of the patches posted by Eric
> in his earlier attempt here[1].
> 
> This also include few fixes to qemu in general which were
> discovered while adding nvdimm support to arm/virt.
> 
> Patch #2 addresses the issue[2] that, during migration, the 
> source and destination might end up with an inconsistency
> in acpi table memory region sizes.
> 
> Patch #3 is to fix the qemu_ram_resize() callback issue[2].
> 
> Patch #4 is another fix to the nvdimm aml issue discussed
> here[3].
> 
> I have done basic sanity testing of NVDIMM devices
> with the guest booting with ACPI. Further testing is always
> welcome.
> 
> Please let me know your feedback.
> 
> Thanks,
> Shameer
> 
> [1] https://patchwork.kernel.org/cover/10830777/
> [2] https://patchwork.kernel.org/patch/11339591/
> [3] https://patchwork.kernel.org/cover/11174959/
> 
> v2 --> v3
>  - Added patch #1 and # 2 to fix the inconsistency in acpi
>table memory region sizes during migration. Thanks to
>David H.
>  - The fix for qemu_ram_resize() callback was modified to
>the one in patch #3. Again thanks to David H.
>  - Addressed comments from MST and Eric on tests added.
>  - Addressed comments from Igor/MST on Integer size in patch #4
>  - Added Eric's R-by to patch #7.
> 
> v1 --> v2
>  -Reworked patch #1 and now fix is inside qemu_ram_resize().
>  -Added patch #2 to fix the nvdim aml issue.
>  -Dropped support to DT cold plug.
>  -Updated test_acpi_virt_tcg_memhp() with pc-dimm and nvdimms(patch #7)
> 
> David Hildenbrand (1):
>   exec: Fix for qemu_ram_resize() callback
> 
> Kwangwoo Lee (2):
>   nvdimm: Use configurable ACPI IO base and size
>   hw/arm/virt: Add nvdimm hot-plug infrastructure
> 
> Shameer Kolothum (7):
>   acpi: Use macro for table-loader file name
>   fw_cfg: Migrate ACPI table mr sizes separately
>   hw/acpi/nvdimm: Fix for NVDIMM incorrect DSM output buffer length
>   hw/arm/virt: Add nvdimm hotplug support
>   tests: Update ACPI tables list for upcoming arm/virt test changes
>   tests/bios-tables-test: Update arm/virt memhp test
>   tests/acpi: add expected tables for bios-tables-test
> 
>  docs/specs/acpi_hw_reduced_hotplug.rst |   1 +
>  exec.c |  14 +++-
>  hw/acpi/generic_event_device.c |  15 -
>  hw/acpi/nvdimm.c   |  72 +
>  hw/arm/Kconfig |   1 +
>  hw/arm/virt-acpi-build.c   |   8 ++-
>  hw/arm/virt.c  |  35 --
>  hw/core/machine.c  |   1 +
>  hw/i386/acpi-build.c   |   8 ++-
>  hw/i386/acpi-build.h   |   3 +
>  hw/i386/pc_piix.c  |   2 +
>  hw/i386/pc_q35.c   |   2 +
>  hw/mem/Kconfig |   2 +-
>  hw/nvram/fw_cfg.c  |  86 -
>  include/hw/acpi/aml-build.h|   1 +
>  include/hw/acpi/generic_event_device.h |   1 +
>  include/hw/arm/virt.h  |   1 +
>  include/hw/mem/nvdimm.h|   3 +
>  include/hw/nvram/fw_cfg.h  |   6 ++
>  tests/data/acpi/pc/SSDT.dimmpxm| Bin 685 -> 734 bytes
>  tests/data/acpi/q35/SSDT.dimmpxm   | Bin 685 -> 734 bytes
>  tests/data/acpi/virt/DSDT.memhp| Bin 6644 -> 6668 bytes
>  tests/data/acpi/virt/NFIT.memhp| Bin 0 -> 224 bytes
>  tests/data/acpi/virt/SSDT.memhp| Bin 0 -> 736 bytes
>  tests/qtest/bios-tables-test.c |   9 ++-
>  25 files changed, 244 insertions(+), 27 deletions(-)
>  create mode 100644 tests/data/acpi/virt/NFIT.memhp
>  create mode 100644 tests/data/acpi/virt/SSDT.memhp
> 
> -- 
> 2.17.1
>