[Qemu-devel] [PULL 00/30] Misc changes for 2016-05-27

2016-05-30 Thread Paolo Bonzini
The following changes since commit d6550e9ed2e1a60d889dfb721de00d9a4e3bafbe:

  Merge remote-tracking branch 'remotes/riku/tags/pull-linux-user-20160527' into staging (2016-05-27 14:05:48 +0100)

are available in the git repository at:

  git://github.com/bonzini/qemu.git tags/for-upstream

for you to fetch changes up to 0878d0e11ba8013dd759c6921cbf05ba6a41bd71:

  exec: hide mr->ram_addr from qemu_get_ram_ptr users (2016-05-29 09:11:12 +0200)

Dropped the jinxed DMA change once more, and seriously thinking of
rewriting the whole thing in assembly language...


* docs/atomics fixes and atomic_rcu_* optimization (Emilio)
* NBD bugfix (Eric)
* Memory fixes and cleanups (Paolo, Paul)
* scsi-block support for SCSI status, including persistent
  reservations (Paolo)
* kvm_stat moves to the Linux repository
* SCSI bug fixes (Peter, Prasad)
* Killing qemu_char_get_next_serial, non-ARM parts (Xiaoqiang)


Emilio G. Cota (3):
  docs/atomics: update atomic_read/set comparison with Linux
  atomics: emit an smp_read_barrier_depends() barrier only for Alpha and Thread Sanitizer
  atomics: do not emit consume barrier for atomic_rcu_read

Eric Blake (1):
  nbd: Don't trim unrequested bytes

Fam Zheng (1):
  scsi-generic: Merge block max xfer len in INQUIRY response

Paolo Bonzini (13):
  Revert "memory: Drop FlatRange.romd_mode"
  kvm_stat: Remove
  bt: rewrite csrhci_write to avoid out-of-bounds writes
  docs/atomics: update comparison with Linux
  scsi-disk: introduce a common base class
  scsi-disk: introduce dma_readv and dma_writev
  scsi-disk: add need_fua_emulation to SCSIDiskClass
  scsi-disk: introduce scsi_disk_req_check_error
  scsi-block: always use SG_IO
  memory: remove qemu_get_ram_fd, qemu_set_ram_fd, qemu_ram_block_host_ptr
  exec: remove ram_addr argument from qemu_ram_block_from_host
  memory: split memory_region_from_host from qemu_ram_addr_from_host
  exec: hide mr->ram_addr from qemu_get_ram_ptr users

Paul Durrant (1):
  xen-hvm: ignore background I/O sections

Peter Lieven (1):
  block/iscsi: avoid potential overflow of acb->task->cdb

Prasad J Pandit (5):
  scsi: pvscsi: check command descriptor ring buffer size (CVE-2016-4952)
  scsi: mptsas: infinite loop while fetching requests
  scsi: megasas: use appropriate property buffer size
  scsi: megasas: initialise local configuration data buffer
  scsi: megasas: check 'read_queue_head' index value

xiaoqiang zhao (5):
  hw/char: QOM'ify escc.c
  hw/char: QOM'ify etraxfs_ser.c
  hw/char: QOM'ify lm32_juart.c
  hw/char: QOM'ify lm32_uart.c
  hw/char: QOM'ify milkymist-uart.c

 Makefile                     |   9 -
 block/iscsi.c                |   7 +
 cputlb.c                     |   3 +-
 docs/atomics.txt             |  38 +-
 exec.c                       | 110 ++
 hw/bt/hci-csr.c              |  67 ++--
 hw/char/escc.c               |  30 +-
 hw/char/etraxfs_ser.c        |  27 +-
 hw/char/lm32_juart.c         |  17 +-
 hw/char/lm32_uart.c          |  28 +-
 hw/char/milkymist-uart.c     |  10 +-
 hw/cris/axis_dev88.c         |   4 +-
 hw/lm32/lm32.h               |  19 +-
 hw/lm32/lm32_boards.c        |   9 +-
 hw/lm32/milkymist-hw.h       |   4 +-
 hw/lm32/milkymist.c          |   4 +-
 hw/misc/ivshmem.c            |   5 +-
 hw/scsi/megasas.c            |   6 +-
 hw/scsi/mptsas.c             |   9 +-
 hw/scsi/scsi-disk.c          | 415 --
 hw/scsi/scsi-generic.c       |  12 +
 hw/scsi/vmw_pvscsi.c         |  24 +-
 hw/virtio/vhost-user.c       |  25 +-
 include/exec/cpu-common.h    |   4 +-
 include/exec/memory.h        |  36 +-
 include/exec/ram_addr.h      |   3 -
 include/hw/cris/etraxfs.h    |  16 +
 include/qemu/atomic.h        |  25 +-
 memory.c                     |  43 ++-
 migration/postcopy-ram.c     |   3 +-
 nbd/server.c                 |  20 +-
 scripts/dump-guest-memory.py |  19 +-
 scripts/kvm/kvm_stat         | 825 ---
 scripts/kvm/kvm_stat.texi    |  55 ---
 target-i386/kvm.c            |   6 +-
 xen-hvm.c                    |  14 +-
 36 files changed, 709 insertions(+), 1242 deletions(-)
 delete mode 100755 scripts/kvm/kvm_stat
 delete mode 100644 scripts/kvm/kvm_stat.texi
-- 
2.5.5



Re: [Qemu-devel] [PATCH v6 01/17] pci: fix unaligned access in pci_xxx_quad()

2016-05-30 Thread Michael S. Tsirkin
On Mon, May 30, 2016 at 06:05:57PM +0300, Dmitry Fleytman wrote:
> 
> > On 30 May 2016, at 17:47 PM, Michael S. Tsirkin  wrote:
> > 
> > On Mon, May 30, 2016 at 12:14:26PM +0300, Leonid Bloch wrote:
> >> From: Dmitry Fleytman 
> >> 
> >> Replace legacy cpu_to_le64w()/le64_to_cpup()
> >> calls with stq_le_p()/ldq_le_p().
> >> 
> >> Signed-off-by: Dmitry Fleytman 
> >> Signed-off-by: Leonid Bloch 
> > 
> 
> Hi Michael,
> 
> > Could you please add a code comment to clarify what's going on a bit more?
> > Something to the effect that capabilities are guaranteed to
> > be dword-aligned only.
> > 
> 
> Just to clarify, do you want to add these comments to 
> pci_set/get_quad functions or to the new DSN-generation function?

pci_set/get_quad

> > Also, I don't think this is a dependency of this patchset:
> > as far as I can tell, the only user of this is
> > pcie: Introduce function for DSN capability creation
> > but that merely accesses a capability, and all callers pass in
> > an aligned offset.
> 
> Right, this issue appeared after introduction of DSN generation function.

Does DSN generation function pass unaligned offsets?
It does not look like it does...

> All other callers pass aligned offsets so far.
> 
> Thanks,
> Dmitry
> 
> > 
> >> ---
> >> include/hw/pci/pci.h | 4 ++--
> >> 1 file changed, 2 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> >> index ef6ba51..ee238ad 100644
> >> --- a/include/hw/pci/pci.h
> >> +++ b/include/hw/pci/pci.h
> >> @@ -468,13 +468,13 @@ pci_get_long(const uint8_t *config)
> >>  static inline void
> >>  pci_set_quad(uint8_t *config, uint64_t val)
> >>  {
> >> -    cpu_to_le64w((uint64_t *)config, val);
> >> +    stq_le_p(config, val);
> >>  }
> >> 
> >>  static inline uint64_t
> >>  pci_get_quad(const uint8_t *config)
> >>  {
> >> -    return le64_to_cpup((const uint64_t *)config);
> >> +    return ldq_le_p(config);
> >>  }
> >> 
> >> static inline void
> >> -- 
> >> 2.5.5



Re: [Qemu-devel] [PATCH v6 16/17] net: Introduce e1000e device emulation

2016-05-30 Thread Michael S. Tsirkin
On Mon, May 30, 2016 at 12:14:41PM +0300, Leonid Bloch wrote:
> diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
> new file mode 100644
> index 000..4da6bb1
> --- /dev/null
> +++ b/hw/net/e1000e.c

Here are minor style issues that can be fixed after this is upstream.
See below.

> @@ -0,0 +1,739 @@
> +/*
> +* QEMU INTEL 82574 GbE NIC emulation
> +*
> +* Software developer's manuals:
> +* 
> http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf
> +*
> +* Copyright (c) 2015 Ravello Systems LTD (http://ravellosystems.com)
> +* Developed by Daynix Computing LTD (http://www.daynix.com)
> +*
> +* Authors:
> +* Dmitry Fleytman 
> +* Leonid Bloch 
> +* Yan Vugenfirer 
> +*
> +* Based on work done by:
> +* Nir Peleg, Tutis Systems Ltd. for Qumranet Inc.
> +* Copyright (c) 2008 Qumranet
> +* Based on work done by:
> +* Copyright (c) 2007 Dan Aloni
> +* Copyright (c) 2004 Antony T Curtis
> +*
> +* This library is free software; you can redistribute it and/or
> +* modify it under the terms of the GNU Lesser General Public
> +* License as published by the Free Software Foundation; either
> +* version 2 of the License, or (at your option) any later version.
> +*
> +* This library is distributed in the hope that it will be useful,
> +* but WITHOUT ANY WARRANTY; without even the implied warranty of
> +* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +* Lesser General Public License for more details.
> +*
> +* You should have received a copy of the GNU Lesser General Public
> +* License along with this library; if not, see 
> .
> +*/
> +
> +#include "qemu/osdep.h"
> +#include "net/net.h"
> +#include "net/tap.h"
> +#include "qemu/range.h"
> +#include "sysemu/sysemu.h"
> +#include "hw/pci/msi.h"
> +#include "hw/pci/msix.h"
> +
> +#include "hw/net/e1000_regs.h"
> +
> +#include "e1000x_common.h"
> +#include "e1000e_core.h"
> +
> +#include "trace.h"
> +
> +#define TYPE_E1000E "e1000e"
> +#define E1000E(obj) OBJECT_CHECK(E1000EState, (obj), TYPE_E1000E)
> +
> +typedef struct {
> +    PCIDevice parent_obj;
> +    NICState *nic;
> +    NICConf conf;
> +
> +    MemoryRegion mmio;
> +    MemoryRegion flash;
> +    MemoryRegion io;
> +    MemoryRegion msix;
> +
> +    uint32_t ioaddr;
> +
> +    uint16_t subsys_ven;
> +    uint16_t subsys;
> +
> +    uint16_t subsys_ven_used;
> +    uint16_t subsys_used;
> +
> +    uint32_t intr_state;
> +    bool disable_vnet;
> +
> +    E1000ECore core;
> +
> +} E1000EState;

typedef struct E1000EState is preferable because older
gdb versions do not always see typedefs.
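
The difference between the two declaration styles can be shown with a minimal sketch. The type names SketchAnonState and SketchTaggedState are hypothetical and are not part of the patch; only the tagged form gives a debugger a struct tag to look up in addition to the typedef name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Anonymous struct: the type can only be referred to via the typedef name. */
typedef struct {
    uint32_t ioaddr;
} SketchAnonState;

/*
 * Tagged struct: both "struct SketchTaggedState" and the typedef name
 * "SketchTaggedState" refer to the same type.
 */
typedef struct SketchTaggedState {
    uint32_t ioaddr;
    bool disable_vnet;
} SketchTaggedState;
```

With the tagged form, code (or a debugger expression) may use either spelling interchangeably.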

> +
> +#define E1000E_MMIO_IDX     0
> +#define E1000E_FLASH_IDX    1
> +#define E1000E_IO_IDX       2
> +#define E1000E_MSIX_IDX     3
> +
> +#define E1000E_MMIO_SIZE    (128 * 1024)
> +#define E1000E_FLASH_SIZE   (128 * 1024)
> +#define E1000E_IO_SIZE      (32)
> +#define E1000E_MSIX_SIZE    (16 * 1024)
> +
> +#define E1000E_MSIX_TABLE   (0x)
> +#define E1000E_MSIX_PBA     (0x2000)
> +
> +#define E1000E_USE_MSI     BIT(0)
> +#define E1000E_USE_MSIX    BIT(1)
> +
> +static uint64_t
> +e1000e_mmio_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    E1000EState *s = opaque;
> +    return e1000e_core_read(&s->core, addr, size);
> +}
> +
> +static void
> +e1000e_mmio_write(void *opaque, hwaddr addr,
> +                  uint64_t val, unsigned size)
> +{
> +    E1000EState *s = opaque;
> +    e1000e_core_write(&s->core, addr, val, size);
> +}
> +
> +static bool
> +e1000e_io_get_reg_index(E1000EState *s, uint32_t *idx)
> +{
> +    if (s->ioaddr < 0x1) {
> +        *idx = s->ioaddr;
> +        return true;
> +    }
> +
> +    if (s->ioaddr < 0x7) {
> +        trace_e1000e_wrn_io_addr_undefined(s->ioaddr);
> +        return false;
> +    }
> +
> +    if (s->ioaddr < 0xF) {
> +        trace_e1000e_wrn_io_addr_flash(s->ioaddr);
> +        return false;
> +    }
> +
> +    trace_e1000e_wrn_io_addr_unknown(s->ioaddr);
> +    return false;
> +}
> +
> +static uint64_t
> +e1000e_io_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    E1000EState *s = opaque;
> +    uint32_t idx;
> +    uint64_t val;
> +
> +    switch (addr) {
> +    case E1000_IOADDR:
> +        trace_e1000e_io_read_addr(s->ioaddr);
> +        return s->ioaddr;
> +    case E1000_IODATA:
> +        if (e1000e_io_get_reg_index(s, &idx)) {
> +            val = e1000e_core_read(&s->core, idx, sizeof(val));
> +            trace_e1000e_io_read_data(idx, val);
> +            return val;
> +        }
> +        return 0;
> +    default:
> +        trace_e1000e_wrn_io_read_unknown(addr);
> +        return 0;
> +    }
> +}
> +
> +static void
> +e1000e_io_write(void *opaque, hwaddr addr,
> +                uint64_t val, unsigned size)
> +{
> +    E1000EState *s = opaque;
> +    uint32_t idx;
> +
> +    switch (addr) {
> +    case E1000_IOADDR:
> +        trace_e1000e_io_write_addr(val);
> +        s->ioaddr = (uint32_t) val;
> +        return;
> +    case E1000_IODATA:
> +   

Re: [Qemu-devel] [PATCH v6 01/17] pci: fix unaligned access in pci_xxx_quad()

2016-05-30 Thread Dmitry Fleytman

> On 30 May 2016, at 17:47 PM, Michael S. Tsirkin  wrote:
> 
> On Mon, May 30, 2016 at 12:14:26PM +0300, Leonid Bloch wrote:
>> From: Dmitry Fleytman 
>> 
>> Replace legacy cpu_to_le64w()/le64_to_cpup()
>> calls with stq_le_p()/ldq_le_p().
>> 
>> Signed-off-by: Dmitry Fleytman 
>> Signed-off-by: Leonid Bloch 
> 

Hi Michael,

> Could you please add a code comment to clarify what's going on a bit more?
> Something to the effect that capabilities are guaranteed to
> be dword-aligned only.
> 

Just to clarify, do you want to add these comments to 
pci_set/get_quad functions or to the new DSN-generation function?

> Also, I don't think this is a dependency of this patchset:
> as far as I can tell, the only user of this is
>   pcie: Introduce function for DSN capability creation
> but that merely accesses a capability, and all callers pass in
> an aligned offset.

Right, this issue appeared after introduction of DSN generation function.
All other callers pass aligned offsets so far.

Thanks,
Dmitry

> 
>> ---
>> include/hw/pci/pci.h | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index ef6ba51..ee238ad 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -468,13 +468,13 @@ pci_get_long(const uint8_t *config)
>>  static inline void
>>  pci_set_quad(uint8_t *config, uint64_t val)
>>  {
>> -    cpu_to_le64w((uint64_t *)config, val);
>> +    stq_le_p(config, val);
>>  }
>> 
>>  static inline uint64_t
>>  pci_get_quad(const uint8_t *config)
>>  {
>> -    return le64_to_cpup((const uint64_t *)config);
>> +    return ldq_le_p(config);
>>  }
>> 
>> static inline void
>> -- 
>> 2.5.5




Re: [Qemu-devel] [PATCH v6 00/17] Introduce Intel 82574 GbE Controller Emulation (e1000e)

2016-05-30 Thread Michael S. Tsirkin
On Mon, May 30, 2016 at 12:14:25PM +0300, Leonid Bloch wrote:
> Hello All,
> 
> This is v6 of e1000e series.
> 
> For convenience, the same patches are available at:
> https://github.com/daynix/qemu-e1000e/tree/e1000e-submit-v6
> 
> Best regards,
> Dmitry.

There are some things that can be improved further
but overall I think it's OK to merge this.

Reviewed-by: Michael S. Tsirkin 



> Changes since v5:
> 
> 1. Fixed build failure on old clang versions
> 2. Added patch that fixes unaligned access in pci_[set|get]_quad()
> 3. Rebased to the latest master
> 
> Changes since v4:
> 
> 1. Rebased to the latest master (2.6.0+)
> 
> Changes since v3:
> 
> 1. Various code fixes as suggested by Jason and Michael
> 2. Rebased to the latest master
> 
> Changes since v2:
> 
> 1. Interrupt storm on latest Linux kernels fixed
> 2. Device unit test added
> 3. Introduced code sharing between e1000 and e1000e
> 4. Various code fixes as suggested by Jason
> 5. Rebased to the latest master
> 
> Changes since v1:
> 
> 1. PCI_PM_CAP_VER_1_1 is defined now in include/hw/pci/pci_regs.h and
>not in include/standard-headers/linux/pci_regs.h.
> 2. Changes in naming and extra comments in hw/pci/pcie.c and in
>include/hw/pci/pcie.h.
> 3. Defining pci_dsn_ver and pci_dsn_cap static const variables in
>hw/pci/pcie.c, instead of PCI_DSN_VER and PCI_DSN_CAP symbolic
>constants in include/hw/pci/pcie_regs.h.
> 4. Changing the vmxnet3_device_serial_num function in hw/net/vmxnet3.c
>to avoid the cast when it is called.
> 5. Avoiding a preceding underscore in all the e1000e-related names.
> 6. Minor style changes.
> 
> ===
> 
> Hello All,
> 
> This series is the final code of the e1000e device emulation, that we
> have developed. Please review, and consider acceptance of these patches
> to the upstream QEMU repository.
> 
> The code stability was verified by various traffic tests using Fedora 22
> Linux, and Windows Server 2012R2 guests. Also, Microsoft Hardware
> Certification Kit (HCK) tests were run on a Windows Server 2012R2 guest.
> 
> There was a discussion on the possibility of code sharing between the
> e1000e, and the existing e1000 devices. We have reviewed the final code
> for parts that may be shared between this device and the currently
> available e1000 emulation. The device specifications are very different,
> and there are almost no registers, nor functions, that were left as is
> from e1000. The ring descriptor structures were changed as well, by the
> introduction of extended and PS descriptors, as well as additional bits.
> 
> Additional differences stem from the fact that the e1000e device re-uses
> network packet abstractions introduced by the vmxnet3 device, while the
> e1000 has its own code for packet handling. BTW, it may be worth reusing
> those abstractions in e1000 as well. (Following these changes the
> vmxnet3 device was successfully tested for possible regressions.)
> 
> There are a few minor parts that may be shared, e.g. the default
> register handlers, and the ring management functions. The total amount
> of shared lines will be about 100--150, so we're not sure if it makes
> sense bothering, and taking a risk of breaking e1000, which is a good,
> old, and stable device.
> 
> Currently, the e1000e code is stand alone w.r.t. e1000.
> 
> Please share your thoughts.
> 
> Thanks in advance,
> Dmitry.
> 
> Changes since RFCv2:
> 
> 1. Device functionality verified using Microsoft Hardware Certification
> Test Kit (HCK)
> 2. Introduced a number of performance improvements
> 3. The code was cleaned, and rebased to the latest master
> 4. Patches verified with checkpatch.pl
> 
> ===
> 
> Changes since RFCv1:
> 
> 1. Added support for all the device features:
>   - Interrupt moderation.
>   - RSS.
>   - Multiqueue.
> 2. Simulated exact PCI/PCIe configuration space layout.
> 3. Made fixes needed to pass Microsoft's HW certification tests (HCK).
> 
> This series is still an RFC, because the following tasks are not done
> yet:
> 
> 1. See which code can be shared between this device and the existing
> e1000 device.
> 2. Rebase patches to the latest master (current base is v2.3.0).
> 
> Please share your thoughts,
> Thanks, Dmitry.
> 
> ===
> 
> Hello qemu-devel,
> 
> This patch series is an RFC for the new networking device emulation
> we're developing for QEMU.
> 
> This new device emulates the Intel 82574 GbE Controller and works
> with unmodified Intel e1000e drivers from the Linux/Windows kernels.
> 
> The status of the current series is "Functional Device Ready, work
> on Extended Features in Progress".
> 
> More precisely, these patches represent a functional device, which
> is recognized by the standard Intel drivers, and is able to transfer
> TX/RX packets with CSO/TSO offloads, according to the spec.
> 
> Extended features not supported yet (work in progress):
>   1. TX/RX Interrupt moderation mechanisms
>   2. RSS
>   3. Full-featured multi-q

Re: [Qemu-devel] [PATCH v6 01/17] pci: fix unaligned access in pci_xxx_quad()

2016-05-30 Thread Michael S. Tsirkin
On Mon, May 30, 2016 at 12:14:26PM +0300, Leonid Bloch wrote:
> From: Dmitry Fleytman 
> 
> Replace legacy cpu_to_le64w()/le64_to_cpup()
> calls with stq_le_p()/ldq_le_p().
> 
> Signed-off-by: Dmitry Fleytman 
> Signed-off-by: Leonid Bloch 

Could you please add a code comment to clarify what's going on a bit more?
Something to the effect that capabilities are guaranteed to
be dword-aligned only.

Also, I don't think this is a dependency of this patchset:
as far as I can tell, the only user of this is
pcie: Introduce function for DSN capability creation
but that merely accesses a capability, and all callers pass in
an aligned offset.

> ---
>  include/hw/pci/pci.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index ef6ba51..ee238ad 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -468,13 +468,13 @@ pci_get_long(const uint8_t *config)
>  static inline void
>  pci_set_quad(uint8_t *config, uint64_t val)
>  {
> -    cpu_to_le64w((uint64_t *)config, val);
> +    stq_le_p(config, val);
>  }
>  
>  static inline uint64_t
>  pci_get_quad(const uint8_t *config)
>  {
> -    return le64_to_cpup((const uint64_t *)config);
> +    return ldq_le_p(config);
>  }
>  
>  static inline void
> -- 
> 2.5.5
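
The alignment concern discussed above can be illustrated with a minimal C sketch. It does not reproduce QEMU's stq_le_p()/ldq_le_p(); the helper names sketch_store_le64() and sketch_load_le64() are hypothetical, and the byte-by-byte loop is just one portable way to avoid casting a uint8_t pointer that is only dword-aligned to a uint64_t pointer.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store val in little-endian order starting at p.
 * Each byte is written individually, so p needs no particular alignment.
 */
static inline void sketch_store_le64(uint8_t *p, uint64_t val)
{
    for (size_t i = 0; i < 8; i++) {
        p[i] = (uint8_t)(val >> (8 * i));
    }
}

/*
 * Hypothetical helper: load a little-endian 64-bit value starting at p,
 * again without assuming any alignment of p.
 */
static inline uint64_t sketch_load_le64(const uint8_t *p)
{
    uint64_t val = 0;

    for (size_t i = 0; i < 8; i++) {
        val |= (uint64_t)p[i] << (8 * i);
    }
    return val;
}
```

A caller could store a value at a config-space offset such as 4 (dword-aligned but not qword-aligned) and read it back with the matching load, independent of host endianness.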



Re: [Qemu-devel] [PATCH 04/10] qcow: add qcow_co_write_compressed

2016-05-30 Thread Pavel Butsykin

On 27.05.2016 20:45, Stefan Hajnoczi wrote:
> On Sat, May 14, 2016 at 03:45:52PM +0300, Denis V. Lunev wrote:
> > +    qemu_co_mutex_lock(&s->lock);
> > +    cluster_offset = get_cluster_offset(bs, sector_num << 9, 2, out_len, 0, 0);
> > +    qemu_co_mutex_unlock(&s->lock);
> > +    if (cluster_offset == 0) {
> > +        ret = -EIO;
> > +        goto fail;
> > +    }
> > +    cluster_offset &= s->cluster_offset_mask;
> > +
> > +    iov = (struct iovec) {
> > +        .iov_base   = out_buf,
> > +        .iov_len    = out_len,
> > +    };
> > +    qemu_iovec_init_external(&hd_qiov, &iov, 1);
> > +    ret = bdrv_co_pwritev(bs->file->bs, cluster_offset, out_len, &hd_qiov, 0);
>
> Not sure if this has the same race condition as the qcow2 patch.  It
> seems that bdrv_getlength() is used to extend the file on a per-sector
> basis.  That would mean compressed data is not packed inside sectors and
> no read-write-modify race condition exists, but I haven't fully audited
> get_cluster_offset().

get_cluster_offset() also does not allow multiple compressed
writes in a single cluster, because it returns the same
cluster_offset for every offset within the cluster. So here we
cannot write at an arbitrary offset inside the cluster, only at the
beginning of the cluster.

> Stefan
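
The point about every offset in a cluster resolving to the same cluster_offset can be sketched as follows. This is an illustrative model only, not the code of get_cluster_offset(): the function name sketch_l2_index() and the parameters cluster_bits and l2_bits are assumptions chosen for the example. It shows that the lookup index depends only on the cluster number, so the low bits of the guest offset play no part in selecting the table entry.

```c
#include <stdint.h>

/*
 * Hypothetical model of a per-cluster lookup index: the offset is
 * shifted right by cluster_bits, discarding the position inside the
 * cluster, and then masked to the size of a table with 2^l2_bits entries.
 */
static unsigned sketch_l2_index(uint64_t offset, unsigned cluster_bits,
                                unsigned l2_bits)
{
    uint64_t l2_mask = (UINT64_C(1) << l2_bits) - 1;

    return (unsigned)((offset >> cluster_bits) & l2_mask);
}
```

With 4 KiB clusters (cluster_bits = 12), offsets 0x1000 and 0x1FFF select the same entry, while 0x2000 selects the next one.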





[Qemu-devel] [Bug 1587065] [NEW] btrfs qemu-ga - multiple mounts block fsfreeze

2016-05-30 Thread Dadio
Public bug reported:

Having two mounts of the same device makes fsfreeze through qemu-ga impossible.
root@CmsrvMTA:/# mount -l | grep /dev/vda2
/dev/vda2 on / type btrfs (rw,relatime,space_cache,subvolid=257,subvol=/@)
/dev/vda2 on /home type btrfs (rw,relatime,space_cache,subvolid=258,subvol=/@home)

Having two mounts is rather common with btrfs, so the feature fsfreeze
is unusable on these systems.


Below is more information about how we encountered this issue...

Message sent to qemu-disc...@nongnu.org:

Message 1:
--
I use external snapshots to back up my guests. I use the 'quiesce' option to 
flush and freeze the guest file system with the qemu guest agent.

With the exception of one guest, this procedure works fine. On the 'unwilling' 
guest, I get this error message:
"ERROR 2016-05-25 00:51:19 | T25-bakVMSCmsrvVH2 | error: internal error: unable 
to execute QEMU agent command 'guest-fsfreeze-freeze': failed to freeze /: 
Device or resource busy"

I don't think this is some sort of time-out error, because
activation of the fsfreeze and the error message happen immediately
after each other:

$ grep qemu-ga syslog.1
May 25 00:51:19 CmsrvMTA qemu-ga: info: guest-fsfreeze called

This is the only entry of the qemu guest agent in syslog.

$ sudo virsh version
Compiled against library: libvirt 1.3.1
Using library: libvirt 1.3.1
Using API: QEMU 1.3.1
Running hypervisor: QEMU 2.5.0

$ virsh qemu-agent-command CmsrvMTA '{"execute": "guest-info"}'
{"return":{"version":"2.5.0", ... 
,{"enabled":true,"name":"guest-fstrim","success-response":true},{"enabled":true,"name":"guest-fsfreeze-thaw","success-response":true},{"enabled":true,"name":"guest-fsfreeze-status","success-response":true},{"enabled":true,"name":"guest-fsfreeze-freeze-list","success-response":true},{"enabled":true,"name":"guest-fsfreeze-freeze","success-response":true},
 ... }

For making an external snapshot, I use this command:
$ virsh snapshot-create-as --domain CmsrvMTA sn1 --disk-only --atomic --quiesce --no-metadata --diskspec vda,file=/srv/poolVMS/CmsrvMTA.sn1

Piece of reply 1:
-
I have encountered this before. Some operating systems may have bind-mounts
that let a device appear multiple times in the mount list. Unfortunately the
guest agent is not smart enough to treat a device that has already been frozen
as successful and keep going. This causes this specific error.

Piece of reply 2:
-
Ok, that seems to be it.

I’ve got the ‘/’ and ‘/home’ on the same device formatted as btrfs on
two separate sub volumes.
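
The failure mode described in the replies can be sketched in C. This is a simplified model under stated assumptions, not the qemu-ga implementation: the struct sketch_mount and both functions are hypothetical, and identifying a filesystem by its device string is only one possible way to avoid freezing the same device twice.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mount-list entry; not the structure qemu-ga uses. */
struct sketch_mount {
    const char *device; /* e.g. "/dev/vda2" */
    const char *dir;    /* e.g. "/home" */
};

/* Return 1 if an earlier entry in the list refers to the same device. */
static int sketch_seen_before(const struct sketch_mount *mounts, size_t idx)
{
    for (size_t i = 0; i < idx; i++) {
        if (strcmp(mounts[i].device, mounts[idx].device) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Count how many freeze calls would be issued after de-duplication. */
static size_t sketch_count_freezes(const struct sketch_mount *mounts, size_t n)
{
    size_t freezes = 0;

    for (size_t i = 0; i < n; i++) {
        if (!sketch_seen_before(mounts, i)) {
            freezes++; /* a real agent would freeze mounts[i].dir here */
        }
    }
    return freezes;
}
```

For the mount list in this report (/ and /home both on /dev/vda2), such a loop would issue a single freeze instead of two, so the second attempt that reports "Device or resource busy" would not happen.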

** Affects: qemu
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1587065

Title:
  btrfs qemu-ga - multiple mounts block fsfreeze

Status in QEMU:
  New


Re: [Qemu-devel] [PATCH RFC 0/2] enable iommu with -device

2016-05-30 Thread Marcel Apfelbaum

On 05/30/2016 04:43 PM, Peter Xu wrote:
> On Mon, May 23, 2016 at 05:01:28PM +0300, Marcel Apfelbaum wrote:
> > This is a proposal on how to create the iommu with
> > '-device intel-iommu' instead of '-machine,iommu=on'.
> >
> > The device is part of the machine properties because we wanted
> > to ensure it is created before any other PCI device.
> >
> > The alternative is to skip the bus_master_enable_region at
> > the time the device is created. We can create this region
> > at machine_done phase. (patch 1)
> >
> > Then we can enable sysbus devices for PC machines and make all the
> > init steps inside the iommu realize function. (patch 2)
> >
> > The series is working, but a lot of issues are not resolved:
> >   - minimum testing was done
> >   - the iommu addr should be passed (maybe) in command line rather than
> >     hard-coded
> >   - enabling sysbus devices for PC machines is risky, I am not aware yet
> >     of the side effects of this modification.
> >   - I am not sure moving the bus_master_enable_region to machine_done
> >     is with no undesired effects.
>
> I gave the patches a shot and they work nicely (of course with no
> complex configurations, like hot plug).
>
> Could you explain what we gain by using "-device" rather
> than "-M" options?  The benefit I can see is that we can specify
> parameters on the specific device, rather than mixing them into the
> "machine" options. Are there any other benefits that I may have
> missed?

Hi Peter,
Thanks for trying it!

It is mainly about not hard-coding device options (e.g. the PCI address for the AMD IOMMU),
but also about avoiding devices being added as a side effect of some machine option.
This will bring us closer to a cleaner model of a modular machine.

I plan to post a non-RFC version soon.
Thanks,
Marcel

> Thanks!
>
> -- peterx






[Qemu-devel] [Bug 1587065] Re: btrfs qemu-ga - multiple mounts block fsfreeze

2016-05-30 Thread Christian Theune
I'm the responder from the Qemu list. I reviewed that C code a while ago
when I stumbled over it. If someone helps me through the Qemu patch
acceptance process then I'm willing to provide a patch.




Re: [Qemu-devel] [PATCH RFC 0/2] enable iommu with -device

2016-05-30 Thread Peter Xu
On Mon, May 23, 2016 at 05:01:28PM +0300, Marcel Apfelbaum wrote:
> This is a proposal on how to create the iommu with
> '-device intel-iommu' instead of '-machine,iommu=on'.
> 
> The device is part of the machine properties because we wanted
> to ensure it is created before any other PCI device.
> 
> The alternative is to skip the bus_master_enable_region at
> the time the device is created. We can create this region
> at machine_done phase. (patch 1)
> 
> Then we can enable sysbus devices for PC machines and make all the
> init steps inside the iommu realize function. (patch 2)
> 
> The series is working, but a lot of issues are not resolved:
>   - minimum testing was done
>   - the iommu addr should be passed (maybe) in command line rather than 
> hard-coded
>   - enabling sysbus devices for PC machines is risky, I am not aware yet
> of the side effects of this modification.
>   - I am not sure moving the bus_master_enable_region to machine_done
> is with no undesired effects. 

I gave the patches a shot and they work nicely (of course with no
complex configurations, like hot plug).

Could you explain what we gain by using "-device" rather
than "-M" options?  The benefit I can see is that we can specify
parameters on the specific device, rather than mixing them into the
"machine" options. Are there any other benefits that I may have
missed?

Thanks!

-- peterx



Re: [Qemu-devel] [PATCH v8 16/25] q35: add "intremap" parameter to enable IR

2016-05-30 Thread Peter Xu
On Mon, May 30, 2016 at 02:43:22PM +0200, Jan Kiszka wrote:
> On 2016-05-30 12:31, Peter Xu wrote:
> > One flag is added to specify whether to enable IR for emulated IOMMU. By
> > default, interrupt remapping is not supported. To enable it, we should
> > specify something like:
> > 
> > $ qemu-system-x86_64 -M q35,iommu=on,intremap=on
> 
> Maybe it's time to move on to Marcel's "-device iommu" patches and
> convert this switch to a device property directly. Or what is the plan?

I just kept everything as it is, since I do not know whether I should
rebase onto Marcel's interface now. Anyway, I can do it in a future
version once we settle on it.

Marcel, do you have any suggestion?

-- peterx



Re: [Qemu-devel] [PATCH v3 0/4] migration: skip scanning and migrating ram pages released by virtio-balloon driver.

2016-05-30 Thread Jitendra Kolhe
ping...
for entire v3 version of the patchset.
http://patchwork.ozlabs.org/project/qemu-devel/list/?submitter=68462

- Jitendra

On Wed, May 18, 2016 at 4:50 PM, Jitendra Kolhe  wrote:
> While measuring live migration performance for qemu/kvm guest, it was observed
> that the qemu doesn’t maintain any intelligence for the guest ram pages
> released by the guest balloon driver and treats such pages as any other
> normal guest ram pages. This has direct impact on overall migration time for
> the guest which has released (ballooned out) memory to the host.
>
> In case of large systems, where we can configure large guests with 1TB and
> with considerable amount of memory released by balloon driver to the host,
> the migration time gets worse.
>
> The solution proposed below is local to qemu (and does not require any
> modification to Linux kernel or any guest driver). We have verified the fix
> for large guests =1TB on HPE Superdome X (which can support up to 240 cores
> and 12TB of memory).
>
> During live migration, as part of first iteration in ram_save_iterate() ->
> ram_find_and_save_block () will try to migrate ram pages even if they are
> released by virtio-balloon driver (balloon inflate). Although these pages
> which are returned to the host by virtio-balloon driver are zero pages,
> the migration algorithm will still end up scanning the entire page
> ram_find_and_save_block() -> ram_save_page()/ram_save_compressed_page() ->
> save_zero_page() -> is_zero_range(). We also end-up sending header information
> over network for these pages during migration. This adds to the total
> migration time.
>
> The solution creates a balloon bitmap ramblock as a part of virtio-balloon
> device initialization. The bits in the balloon bitmap represent a guest ram
> page of size 1UL << VIRTIO_BALLOON_PFN_SHIFT or 4K. If TARGET_PAGE_BITS <=
> VIRTIO_BALLOON_PFN_SHIFT, ram_addr offset for the dirty page which is used by
> dirty page bitmap during migration is checked against the balloon bitmap as
> is, if the bit is set in the balloon bitmap, the corresponding ram page will
> be excluded from scanning and sending header information during migration. In
> case TARGET_PAGE_BITS > VIRTIO_BALLOON_PFN_SHIFT for a given dirty page
> ram_addr, all sub-pages of 1UL << VIRTIO_BALLOON_PFN_SHIFT size should be
> ballooned out to avoid zero page scan and sending header information.
>
> The bitmap represents entire guest ram memory till max configured memory.
> Guest ram pages claimed by the virtio-balloon driver will be represented by 1
> in the bitmap. Since the bitmap is maintained as a ramblock, it’s migrated to
> the target as part of migration’s ram iterative and ram complete phases, so
> that subsequent migrations from the target can continue to use the optimization.
>
> A new migration capability called skip-balloon is introduced. The user can
> disable the capability in cases where user does not expect much benefit or in
> case the migration is from an older version.
>
> During live migration setup the optimization can be set to disabled state if
> . no virtio-balloon device is initialized.
> . skip-balloon migration capability is disabled.
> . the guest virtio-balloon driver has not set the
>   VIRTIO_BALLOON_F_MUST_TELL_HOST flag, which means the guest may start
>   using ram pages freed by the guest balloon driver even before the
>   host/qemu is aware of it. In such case, the optimization is disabled so
>   that the ram pages that are being used by the guest will continue to be
>   scanned and migrated.
>
> Balloon bitmap ramblock size is set to zero if the optimization is disabled,
> to avoid overhead of migrating the bitmap. If the bitmap is not migrated to
> the target, the destination starts with a fresh bitmap and tracks the
> ballooning operation thereafter.
>
> Jitendra Kolhe (4):
>   balloon: maintain bitmap for pages released by guest balloon driver.
>   balloon: add balloon bitmap migration capability and setup bitmap
>     migration status.
>   balloon: reset balloon bitmap ramblock size on source and target.
>   migration: skip scanning and migrating ram pages released by
>     virtio-balloon driver.
>
> Changed in v2:
>  - Resolved compilation issue for qemu-user binaries in exec.c
>  - Localize balloon bitmap test to save_zero_page().
>  - Updated version string for newly added migration capability to 2.7.
>  - Made minor modifications to patch commit text.
>
> Changed in v3:
>  - Add balloon bitmap to RAMBlock.
>  - Resolve bitmap offset calculation by translating host addr back to a
>    RAMBlock and ram_addr
>  - Add balloon bitmap support for case if TARGET_PAGE_BITS >
>    VIRTIO_BALLOON_PFN_SHIFT.
>  - Remove dependency of skip-balloon migration capability on postcopy
>    migration.
>  - Disable optimization if the guest balloon driver does not support
>    VIRTIO_BALLOON_F_MUST_TELL_HOST feature.
>  - Split single patch into 4 small patches.
>
>  balloon.c  | 196 
> +++
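
The bitmap check described in the cover letter above can be sketched
roughly as follows. This is only an illustration under stated
assumptions, not code from the series: BALLOON_PFN_SHIFT stands in for
VIRTIO_BALLOON_PFN_SHIFT, and test_bit_sketch(), set_bit_sketch() and
page_is_ballooned() are hypothetical helpers written for this example.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: mirrors VIRTIO_BALLOON_PFN_SHIFT (4K balloon pages). */
#define BALLOON_PFN_SHIFT 12

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: mark balloon page 'nr' as released by the guest. */
static void set_bit_sketch(unsigned long *bitmap, size_t nr)
{
    bitmap[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Hypothetical helper: test whether balloon page 'nr' is released. */
static bool test_bit_sketch(const unsigned long *bitmap, size_t nr)
{
    return (bitmap[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

/*
 * Return true if the target page starting at ram_addr may be skipped.
 * If the target page is larger than a balloon page, every balloon-sized
 * sub-page must be ballooned out; otherwise the single balloon page
 * containing ram_addr is checked as is.
 */
static bool page_is_ballooned(const unsigned long *bitmap,
                              uint64_t ram_addr,
                              unsigned target_page_bits)
{
    size_t first = (size_t)(ram_addr >> BALLOON_PFN_SHIFT);
    size_t count = 1;
    size_t i;

    if (target_page_bits > BALLOON_PFN_SHIFT) {
        count = (size_t)1 << (target_page_bits - BALLOON_PFN_SHIFT);
    }
    for (i = 0; i < count; i++) {
        if (!test_bit_sketch(bitmap, first + i)) {
            return false;
        }
    }
    return true;
}
```

With a 16K target page, for example, the page is only skipped once all
four of its 4K sub-pages have been marked in the bitmap.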

Re: [Qemu-devel] [PATCH 02/10] qcow2: add qcow2_co_write_compressed

2016-05-30 Thread Pavel Butsykin

On 30.05.2016 12:12, Pavel Butsykin wrote:

On 27.05.2016 20:33, Stefan Hajnoczi wrote:

On Sat, May 14, 2016 at 03:45:50PM +0300, Denis V. Lunev wrote:

+qemu_co_mutex_lock(&s->lock);
+cluster_offset = \
+qcow2_alloc_compressed_cluster_offset(bs, sector_num << 9,
out_len);


The backslash isn't necessary for wrapping lines in C.  This kind of
thing is only necessary in languages like Python where the grammar is
whitespace sensitive.

The C compiler is happy with an arbitrary amount of whitespace
(newlines) in the middle of a statement.  The backslash in C is handled
by the preprocessor: it joins the line.  That's useful for macro
definitions where you need to tell the preprocessor that several lines
belong to one macro definition.  But it's not needed for normal C code.
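
A minimal sketch of that distinction, for illustration only (the names
SECTOR_TO_BYTES and wrapped_statement are made up for this example and
do not come from the patch):

```c
#include <stdint.h>

/*
 * The backslash is required here: the preprocessor needs it to know
 * that the macro definition continues on the next line.
 */
#define SECTOR_TO_BYTES(sector_num) \
    ((uint64_t)(sector_num) << 9)

/*
 * No backslash is needed here: the compiler treats the newlines in the
 * middle of the statement as ordinary whitespace.
 */
static uint64_t wrapped_statement(uint64_t sector_num, uint64_t extra)
{
    uint64_t byte_offset =
        SECTOR_TO_BYTES(sector_num) +
        extra;
    return byte_offset;
}
```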


Thanks for the explanation, but the backslash is there more for the
human reader, as a marker of a line break. The current coding style
doesn't cover this point, but I can remove the backslash, because I
don't think it's something important :)


+if (!cluster_offset) {
+qemu_co_mutex_unlock(&s->lock);
+ret = -EIO;
+goto fail;
+}
+cluster_offset &= s->cluster_offset_mask;

-BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
-ret = bdrv_pwrite(bs->file->bs, cluster_offset, out_buf,
out_len);
-if (ret < 0) {
-goto fail;
-}
+ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset,
out_len);
+qemu_co_mutex_unlock(&s->lock);
+if (ret < 0) {
+goto fail;
  }

+iov = (struct iovec) {
+.iov_base   = out_buf,
+.iov_len= out_len,
+};
+qemu_iovec_init_external(&hd_qiov, &iov, 1);
+
+BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
+ret = bdrv_co_pwritev(bs->file->bs, cluster_offset, out_len,
&hd_qiov, 0);


There is a race condition here:

If the newly allocated cluster is only partially filled by compressed
data then qcow2_alloc_compressed_cluster_offset() remembers that more
bytes are still available in the cluster.  The
qcow2_alloc_compressed_cluster_offset() caller will continue filling the
same cluster.

Imagine two compressed writes running at the same time.  Write A
allocates just a few bytes so write B shares a sector with the first
write:


Sorry, but it seems this will never happen, because the second write
will not pass this check:

uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
                                               uint64_t offset,
                                               int compressed_size)
{
    ...
    /* Compression can't overwrite anything. Fail if the cluster was already
     * allocated. */
    cluster_offset = be64_to_cpu(l2_table[l2_index]);
    if (cluster_offset & L2E_OFFSET_MASK) {
        qcow2_cache_put(bs, s->l2_table_cache, (void**) &l2_table);
        return 0;
    }
    ...

As you can see we can't do the compressed write in the already allocated
cluster.



  Sector 1
   |AAAB|

The race condition is that bdrv_co_pwritev() uses read-modify-write (a
bounce buffer).  If both requests call bdrv_co_pwritev() around the same
time then the following could happen:

  Sector 1
   |000B|

or:

  Sector 1
   |AAA0|

It's necessary to hold s->lock around the compressed data write to avoid
this race condition.


I agree, there is really a race.. Thank you, this is a very good point!






Re: [Qemu-devel] [PATCH v8 16/25] q35: add "intremap" parameter to enable IR

2016-05-30 Thread Jan Kiszka
On 2016-05-30 12:31, Peter Xu wrote:
> One flag is added to specify whether to enable IR for emulated IOMMU. By
> default, interrupt remapping is not supported. To enable it, we should
> specify something like:
> 
> $ qemu-system-x86_64 -M q35,iommu=on,intremap=on

Maybe it's time to move on to Marcel's "-device iommu" patches and
convert this switch to a device property directly. Or what is the plan?

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] virtio_mmio address mapping

2016-05-30 Thread Peter Maydell
On 30 May 2016 at 13:24, Dean <664543...@qq.com> wrote:
> I am writing a UIO linux driver for a virtio_mmio device and found out that
> all those virtio_mmio devices' registers are arranged in one single 4K aligned
> page, which makes writing a UIO driver hard since linux doesn't allow a UIO
> driver to map a register space smaller than page size.
> Would you consider dividing them into different pages?
>
> what I am talking about is devices like:
> a003000.virtio_mmio
> a003c00.virtio_mmio
> a003e00.virtio_mmio
>
> what I am dealing with is a003e00.virtio_mmio

Unfortunately we can't really move them, for backwards
compatibility reasons. Also, virtio-mmio is now pretty
much obsolete since you can use virtio-pci instead, and
we wouldn't want it to take up much more of the physical
address space.

thanks
-- PMM
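
For reference, the page arithmetic behind the question can be sketched
as below. This is an illustration only: it assumes the numbers in the
device names are hexadecimal physical addresses and that the page size
is 4K; PAGE_SIZE_4K, page_base() and page_offset() are names invented
for this example.

```c
#include <stdint.h>

/* Assumption: 4K pages, as stated in the original question. */
#define PAGE_SIZE_4K 4096u

/* Start of the page that contains addr. */
static uint64_t page_base(uint64_t addr)
{
    return addr & ~(uint64_t)(PAGE_SIZE_4K - 1);
}

/* Offset of addr within its page. */
static uint64_t page_offset(uint64_t addr)
{
    return addr & (PAGE_SIZE_4K - 1);
}
```

Under those assumptions all three devices listed above fall into the
page starting at 0xa003000, and the device of interest sits at offset
0xe00 within it, so a page-granular mapping necessarily covers the
neighbouring devices as well.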



[Qemu-devel] virtio_mmio address mapping

2016-05-30 Thread Dean
Hi all,

I am writing a UIO linux driver for a virtio_mmio device and found out that
all those virtio_mmio devices' registers are arranged in one single 4K aligned
page, which makes writing a UIO driver hard since linux doesn't allow a UIO
driver to map a register space smaller than page size.
Would you consider dividing them into different pages?

what I am talking about is devices like:
a003000.virtio_mmio
a003c00.virtio_mmio
a003e00.virtio_mmio

what I am dealing with is a003e00.virtio_mmio

Dean

Re: [Qemu-devel] [RFC] virtio-blk: simple multithreaded MQ implementation for bdrv_raw

2016-05-30 Thread Roman Penyaev
On Mon, May 30, 2016 at 8:40 AM, Alexandre DERUMIER  wrote:
> Hi,
>
>>>To avoid any locks in qemu backend and not to introduce thread safety
>>>into qemu block-layer I open same backend device several times, one
>>>device per one MQ.  e.g. the following is the stack for a virtio-blk
>>>with num-queues=2:
>
> Could it be possible in the future to not open several times the same
> backend ?

You are too fast :) I think nobody will do that in the near future.

> I'm thinking about ceph/librbd, which since last version allow only to open
> once a backend by default
> (exclusive-lock, which is a requirement for advanced features like
> rbd-mirroring, fast-diff,)

Consider my patch a hack with only one purpose: to provide true MQ
support for non-expandable file images and/or block devices, in order to
get some perf numbers on a lockless IO path.

If you are using a block device as a backend and want to squeeze the
last drop of IO out of the path from guest MQ bdev to host MQ bdev, feel
free to apply it.  That's the only reason for this work.

--
Roman


>
> Regards,
>
> Alexandre Derumier
>
>
> - Original Message -
> From: "Stefan Hajnoczi" 
> To: "Roman Pen" 
> Cc: "qemu-devel" , "stefanha" 
> Sent: Saturday, 28 May 2016 00:27:10
> Subject: Re: [Qemu-devel] [RFC] virtio-blk: simple multithreaded MQ
> implementation for bdrv_raw
>
> On Fri, May 27, 2016 at 01:55:04PM +0200, Roman Pen wrote:
>> Hello, all.
>>
>> This is RFC because mostly this patch is a quick attempt to get true
>> multithreaded multiqueue support for a block device with native AIO.
>> The goal is to squeeze everything possible on lockless IO path from
>> MQ block on a guest to MQ block on a host.
>>
>> To avoid any locks in qemu backend and not to introduce thread safety
>> into qemu block-layer I open same backend device several times, one
>> device per one MQ. e.g. the following is the stack for a virtio-blk
>> with num-queues=2:
>>
>>              VirtIOBlock
>>             /           \
>>     VirtQueue#0       VirtQueue#1
>>     IOThread#0        IOThread#1
>>     BH#0              BH#1
>>     Backend#0         Backend#1
>>             \           /
>>              /dev/null0
>>
>> To group all objects related to one vq new structure is introduced:
>>
>> typedef struct VirtQueueCtx {
>>     BlockBackend *blk;
>>     struct VirtIOBlock *s;
>>     VirtQueue *vq;
>>     void *rq;
>>     QEMUBH *bh;
>>     QEMUBH *batch_notify_bh;
>>     IOThread *iothread;
>>     Notifier insert_notifier;
>>     Notifier remove_notifier;
>>     /* Operation blocker on BDS */
>>     Error *blocker;
>> } VirtQueueCtx;
>>
>> And VirtIOBlock includes an array of these contexts:
>>
>>  typedef struct VirtIOBlock {
>>      VirtIODevice parent_obj;
>> +    VirtQueueCtx mq[VIRTIO_QUEUE_MAX];
>>      ...
>>
>> This patch is based on Stefan's series: "virtio-blk: multiqueue support",
>> with minor difference: I reverted "virtio-blk: multiqueue batch notify",
>> which does not make a lot of sense when each VQ is handled by its own
>> iothread.
>>
>> The qemu configuration stays the same, i.e. put num-queues=N and N
>> iothreads will be started on demand and N drives will be opened:
>>
>> qemu -device virtio-blk-pci,num-queues=8
>>
>> My configuration is the following:
>>
>> host:
>> Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz,
>> 8 CPUs,
>> /dev/nullb0 as backend with the following parameters:
>> $ cat /sys/module/null_blk/parameters/submit_queues
>> 8
>> $ cat /sys/module/null_blk/parameters/irqmode
>> 1
>>
>> guest:
>> 8 VCPUs
>>
>> qemu:
>> -object iothread,id=t0 \
>> -drive if=none,id=d0,file=/dev/nullb0,format=raw,snapshot=off,cache=none,aio=native \
>> -device virtio-blk-pci,num-queues=$N,iothread=t0,drive=d0,disable-modern=off,disable-legacy=on
>>
>> where $N varies during the tests.
>>
>> fio:
>> [global]
>> description=Emulation of Storage Server Access Pattern
>> bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
>> fadvise_hint=0
>> rw=randrw:2
>> direct=1
>>
>> ioengine=libaio
>> iodepth=64
>> iodepth_batch_submit=64
>> iodepth_batch_complete=64
>> numjobs=8
>> gtod_reduce=1
>> group_reporting=1
>>
>> time_based=1
>> runtime=30
>>
>> [job]
>> filename=/dev/vda
>>
>> Results:
>> num-queues   RD bw      WR bw
>> ----------   -----      -----
>>
>> * with 1 iothread *
>>
>> 1 thr 1 mq   1225MB/s   1221MB/s
>> 1 thr 2 mq   1559MB/s   1553MB/s
>> 1 thr 4 mq   1729MB/s   1725MB/s
>> 1 thr 8 mq   1660MB/s   1655MB/s
>>
>> * with N iothreads *
>>
>> 2 thr 2 mq   1845MB/s   1842MB/s
>> 4 thr 4 mq   2187MB/s   2183MB/s
>> 8 thr 8 mq   1383MB/s   1378MB/s
>>
>> Obviously, 8 iothreads + 8 vcpu threads is too much for my machine
>> with 8 CPUs, but 4 iothreads show quite good result.
>
> Cool, thanks for trying this experiment and posting results.
>
> It's encouraging to see the improvement. Did you use any CPU affinity
> settings to co-locate vcpu and iothreads onto host CPUs?
>
> Stefan



[Qemu-devel] [PATCH] block: assert that bs->request_alignment is a power of 2

2016-05-30 Thread Peter Lieven
at least bdrv_co_preadv/pwritev expect this.

Signed-off-by: Peter Lieven 
---
 block.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 736432f..f54bc25 100644
--- a/block.c
+++ b/block.c
@@ -1018,7 +1018,7 @@ static int bdrv_open_common(BlockDriverState *bs, BdrvChild *file,
 
     assert(bdrv_opt_mem_align(bs) != 0);
     assert(bdrv_min_mem_align(bs) != 0);
-    assert((bs->request_alignment != 0) || bdrv_is_sg(bs));
+    assert(is_power_of_2(bs->request_alignment) || bdrv_is_sg(bs));
 
     qemu_opts_del(opts);
     return 0;
-- 
1.9.1
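
A rough sketch of why a power-of-two alignment matters for mask-based
rounding. This is an illustration, not QEMU code: is_pow2() is a local
stand-in for the is_power_of_2() helper used in the patch, and
align_down() is a hypothetical function showing the kind of masking the
request paths rely on.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the power-of-two test; zero is rejected. */
static bool is_pow2(uint64_t value)
{
    return value && !(value & (value - 1));
}

/*
 * Round offset down to a multiple of align using a bit mask.
 * The result is only a multiple of align if align is a power of two.
 */
static uint64_t align_down(uint64_t offset, uint64_t align)
{
    return offset & ~(align - 1);
}
```

With align = 512, align_down(1000, 512) gives 512 as expected. With
align = 768, which is not a power of two, the same expression gives
256, which is not a multiple of 768; the stricter assertion catches
that class of misconfiguration early.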




Re: [Qemu-devel] [RFC] virtio-blk: simple multithreaded MQ implementation for bdrv_raw

2016-05-30 Thread Roman Penyaev
On Sat, May 28, 2016 at 12:27 AM, Stefan Hajnoczi  wrote:
> On Fri, May 27, 2016 at 01:55:04PM +0200, Roman Pen wrote:
>> Hello, all.
>>
>> This is RFC because mostly this patch is a quick attempt to get true
>> multithreaded multiqueue support for a block device with native AIO.
>> The goal is to squeeze everything possible on lockless IO path from
>> MQ block on a guest to MQ block on a host.
>>
>> To avoid any locks in qemu backend and not to introduce thread safety
>> into qemu block-layer I open same backend device several times, one
>> device per one MQ.  e.g. the following is the stack for a virtio-blk
>> with num-queues=2:
>>
>>              VirtIOBlock
>>             /           \
>>     VirtQueue#0       VirtQueue#1
>>     IOThread#0        IOThread#1
>>     BH#0              BH#1
>>     Backend#0         Backend#1
>>             \           /
>>              /dev/null0
>>
>> To group all objects related to one vq new structure is introduced:
>>
>> typedef struct VirtQueueCtx {
>>     BlockBackend *blk;
>>     struct VirtIOBlock *s;
>>     VirtQueue *vq;
>>     void *rq;
>>     QEMUBH *bh;
>>     QEMUBH *batch_notify_bh;
>>     IOThread *iothread;
>>     Notifier insert_notifier;
>>     Notifier remove_notifier;
>>     /* Operation blocker on BDS */
>>     Error *blocker;
>> } VirtQueueCtx;
>>
>> And VirtIOBlock includes an array of these contexts:
>>
>>  typedef struct VirtIOBlock {
>>      VirtIODevice parent_obj;
>> +    VirtQueueCtx mq[VIRTIO_QUEUE_MAX];
>>      ...
>>
>> This patch is based on Stefan's series: "virtio-blk: multiqueue support",
>> with minor difference: I reverted "virtio-blk: multiqueue batch notify",
>> which does not make a lot of sense when each VQ is handled by its own
>> iothread.
>>
>> The qemu configuration stays the same, i.e. put num-queues=N and N
>> iothreads will be started on demand and N drives will be opened:
>>
>> qemu -device virtio-blk-pci,num-queues=8
>>
>> My configuration is the following:
>>
>> host:
>> Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz,
>> 8 CPUs,
>> /dev/nullb0 as backend with the following parameters:
>>   $ cat /sys/module/null_blk/parameters/submit_queues
>>   8
>>   $ cat /sys/module/null_blk/parameters/irqmode
>>   1
>>
>> guest:
>> 8 VCPUs
>>
>> qemu:
>> -object iothread,id=t0 \
>> -drive if=none,id=d0,file=/dev/nullb0,format=raw,snapshot=off,cache=none,aio=native \
>> -device virtio-blk-pci,num-queues=$N,iothread=t0,drive=d0,disable-modern=off,disable-legacy=on
>>
>> where $N varies during the tests.
>>
>> fio:
>> [global]
>> description=Emulation of Storage Server Access Pattern
>> bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
>> fadvise_hint=0
>> rw=randrw:2
>> direct=1
>>
>> ioengine=libaio
>> iodepth=64
>> iodepth_batch_submit=64
>> iodepth_batch_complete=64
>> numjobs=8
>> gtod_reduce=1
>> group_reporting=1
>>
>> time_based=1
>> runtime=30
>>
>> [job]
>> filename=/dev/vda
>>
>> Results:
>> num-queues   RD bw      WR bw
>> ----------   -----      -----
>>
>> * with 1 iothread *
>>
>> 1 thr 1 mq   1225MB/s   1221MB/s
>> 1 thr 2 mq   1559MB/s   1553MB/s
>> 1 thr 4 mq   1729MB/s   1725MB/s
>> 1 thr 8 mq   1660MB/s   1655MB/s
>>
>> * with N iothreads *
>>
>> 2 thr 2 mq   1845MB/s   1842MB/s
>> 4 thr 4 mq   2187MB/s   2183MB/s
>> 8 thr 8 mq   1383MB/s   1378MB/s
>>
>> Obviously, 8 iothreads + 8 vcpu threads is too much for my machine
>> with 8 CPUs, but 4 iothreads show quite good result.
>
> Cool, thanks for trying this experiment and posting results.
>
> It's encouraging to see the improvement.  Did you use any CPU affinity
> settings to co-locate vcpu and iothreads onto host CPUs?

No, in these measurements I did not try to pin anything.
But the following are results with pinning, take a look:

8 VCPUs, 8 fio jobs
===================

 o each fio job is pinned to VCPU in 1 to 1
 o VCPUs are not pinned
 o iothreads are not pinned

num queues   RD bw
----------   -----

* with 1 iothread *

1 thr 1 mq   1096MB/s
1 thr 2 mq   1602MB/s
1 thr 4 mq   1818MB/s
1 thr 8 mq   1860MB/s

* with N iothreads *

2 thr 2 mq   2008MB/s
4 thr 4 mq   2267MB/s
8 thr 8 mq   1388MB/s



8 VCPUs, 8 fio jobs
===================

 o each fio job is pinned to VCPU in 1 to 1
 o each VCPU is pinned to CPU in 1 to 1
 o each iothread is pinned to CPU in 1 to 1

affinity masks:
 CPUs   01234567
VCPUs   

num queues   RD bw      iothreads affinity mask
----------   -----      -----------------------

* with 1 iothread *

1 thr 1 mq   997MB/s    X---
1 thr 2 mq   1066MB/s   X---
1 thr 4 mq   969MB/s    X---
1 thr 8 mq   1050MB/s   X---

* with N iothreads *

2 thr 2 mq   1597MB/s   XX--
4 thr 4 mq   1985MB/s   
8 thr 8 mq   1230

Re: [Qemu-devel] [PATCH v2 1/1] block: clarify error message for qmp-eject

2016-05-30 Thread Markus Armbruster
John Snow  writes:

> It already got applied, but I can change it to your preference. (Always
> return an -errno and an Error, delete-and-free when we don't care about
> it...)

I think that would be an improvement.  This is advice, not a demand :)



Re: [Qemu-devel] [PULL V3 00/20] Net patches

2016-05-30 Thread Peter Maydell
On 30 May 2016 at 02:51, Jason Wang  wrote:
> git grep shows lots of places. Is it ok to send a new version of pull
> request with Dmitry's fix first?

Sure; I was talking about the audit as a later cleanup thing
we should do at some point, not something to do immediately.

-- PMM



[Qemu-devel] [Bug 1586611] Re: usb-hub can not be detached when detach usb device from VM

2016-05-30 Thread Michael liu
Of course, using a virtual usb controller works normally. The problem
occurs when using passthrough usb devices.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1586611

Title:
  usb-hub can not be detached when detach usb  device from VM

Status in QEMU:
  New

Bug description:
  I gave a host usb device to the guest using the "virsh attach-device" cmd.
  In the guest os, "lsusb" shows that two devices have been added: one is the
  usb device and the other is a usb-hub (0409:55aa NEC Corp. Hub).
  When I use "virsh detach-device" to detach the usb device, the usb-hub still
  exists in the guest os.
  This causes problems when operating the VM. For example, on suspend and
  resume of the VM, qemu reports:

  2016-05-24T12:03:54.434369Z qemu-kvm: Unknown savevm section or
  instance ':00:01.2/2/usb-hub' 0

  2016-05-24T12:03:54.434742Z qemu-kvm: load of migration failed:
  Invalid argument

  From qemu's code, it is clear that the usb-hub is generated by qemu, and
  that the process of detaching the usb-hub has already been executed, but
  failed. With added print statements, the errors are as follows:
  libusbx: error [do_close] Device handle closed while transfer was still being processed, but the device is still connected as far as we know
  libusbx: warning [do_close] A cancellation for an in-flight transfer hasn't completed but closing the device handle

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1586611/+subscriptions



[Qemu-devel] [PATCH] block/nfs: Implement .bdrv_co_preadv/pwritev interfaces

2016-05-30 Thread Peter Lieven
The libnfs read and write functions already take byte arguments,
so that's an easy change.

Signed-off-by: Peter Lieven 
---
 block/nfs.c | 40 +++-
 1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/block/nfs.c b/block/nfs.c
index 9f51cc3..386f846 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -133,20 +133,19 @@ nfs_co_generic_cb(int ret, struct nfs_context *nfs, void *data,
     }
 }
 
-static int coroutine_fn nfs_co_readv(BlockDriverState *bs,
-                                     int64_t sector_num, int nb_sectors,
-                                     QEMUIOVector *iov)
+static int coroutine_fn
+nfs_co_preadv(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+              QEMUIOVector *qiov, int flags)
 {
     NFSClient *client = bs->opaque;
     NFSRPC task;
 
     nfs_co_init_task(client, &task);
-    task.iov = iov;
+    task.iov = qiov;
 
     if (nfs_pread_async(client->context, client->fh,
-                        sector_num * BDRV_SECTOR_SIZE,
-                        nb_sectors * BDRV_SECTOR_SIZE,
-                        nfs_co_generic_cb, &task) != 0) {
+                        offset, bytes, nfs_co_generic_cb,
+                        &task) != 0) {
         return -ENOMEM;
     }
 
@@ -160,16 +159,16 @@ static int coroutine_fn nfs_co_readv(BlockDriverState *bs,
     }
 
     /* zero pad short reads */
-    if (task.ret < iov->size) {
-        qemu_iovec_memset(iov, task.ret, 0, iov->size - task.ret);
+    if (task.ret < qiov->size) {
+        qemu_iovec_memset(qiov, task.ret, 0, qiov->size - task.ret);
     }
 
     return 0;
 }
 
-static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
-                                      int64_t sector_num, int nb_sectors,
-                                      QEMUIOVector *iov)
+static int coroutine_fn
+nfs_co_pwritev(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+               QEMUIOVector *qiov, int flags)
 {
     NFSClient *client = bs->opaque;
     NFSRPC task;
@@ -177,17 +176,16 @@ static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
 
     nfs_co_init_task(client, &task);
 
-    buf = g_try_malloc(nb_sectors * BDRV_SECTOR_SIZE);
-    if (nb_sectors && buf == NULL) {
+    buf = g_try_malloc(bytes);
+    if (bytes && buf == NULL) {
         return -ENOMEM;
     }
 
-    qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);
+    qemu_iovec_to_buf(qiov, 0, buf, bytes);
 
     if (nfs_pwrite_async(client->context, client->fh,
-                         sector_num * BDRV_SECTOR_SIZE,
-                         nb_sectors * BDRV_SECTOR_SIZE,
-                         buf, nfs_co_generic_cb, &task) != 0) {
+                         offset, bytes, buf,
+                         nfs_co_generic_cb, &task) != 0) {
         g_free(buf);
         return -ENOMEM;
     }
@@ -199,7 +197,7 @@ static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
 
     g_free(buf);
 
-    if (task.ret != nb_sectors * BDRV_SECTOR_SIZE) {
+    if (task.ret != bytes) {
         return task.ret < 0 ? task.ret : -EIO;
     }
 
@@ -547,8 +545,8 @@ static BlockDriver bdrv_nfs = {
     .bdrv_create                    = nfs_file_create,
     .bdrv_reopen_prepare            = nfs_reopen_prepare,
 
-    .bdrv_co_readv                  = nfs_co_readv,
-    .bdrv_co_writev                 = nfs_co_writev,
+    .bdrv_co_preadv                 = nfs_co_preadv,
+    .bdrv_co_pwritev                = nfs_co_pwritev,
     .bdrv_co_flush_to_disk          = nfs_co_flush,
 
     .bdrv_detach_aio_context        = nfs_detach_aio_context,
-- 
1.9.1




Re: [Qemu-devel] [PATCH 1/6] hw/char: QOM'ify pl011 model

2016-05-30 Thread Markus Armbruster
Peter Maydell  writes:

> On 25 May 2016 at 11:58, xiaoqiang zhao  wrote:
>> * drop qemu_char_get_next_serial and use chardev prop
>> * add pl011_create wrapper function to create pl011 uart device
>> * change affected board code to use the new way
>>
>> Signed-off-by: xiaoqiang zhao 
>> ---
>>  hw/arm/bcm2835_peripherals.c | 16 +++---
>>  hw/arm/highbank.c|  3 ++-
>>  hw/arm/integratorcp.c|  5 +++--
>>  hw/arm/realview.c|  9 
>>  hw/arm/stellaris.c   |  6 +++--
>>  hw/arm/versatilepb.c |  9 
>>  hw/arm/vexpress.c|  9 
>>  hw/arm/virt.c|  1 +
>>  hw/char/pl011.c  | 11 +-
>>  include/hw/char/pl011.h  | 52 
>> 
>>  10 files changed, 86 insertions(+), 35 deletions(-)
>>  create mode 100644 include/hw/char/pl011.h
>>
>> diff --git a/hw/arm/bcm2835_peripherals.c b/hw/arm/bcm2835_peripherals.c
>> index 234d518..46c320f 100644
>> --- a/hw/arm/bcm2835_peripherals.c
>> +++ b/hw/arm/bcm2835_peripherals.c
>> @@ -14,6 +14,7 @@
>>  #include "hw/misc/bcm2835_mbox_defs.h"
>>  #include "hw/arm/raspi_platform.h"
>>  #include "sysemu/char.h"
>> +#include "sysemu/sysemu.h"
>>
>>  /* Peripheral base address on the VC (GPU) system bus */
>>  #define BCM2835_VC_PERI_BASE 0x7e00
>> @@ -106,7 +107,6 @@ static void bcm2835_peripherals_realize(DeviceState *dev, Error **errp)
>>  MemoryRegion *ram;
>>  Error *err = NULL;
>>  uint32_t ram_size, vcram_size;
>> -CharDriverState *chr;
>>  int n;
>>
>>  obj = object_property_get_link(OBJECT(dev), "ram", &err);
>> @@ -147,6 +147,7 @@ static void bcm2835_peripherals_realize(DeviceState *dev, Error **errp)
>>  sysbus_pass_irq(SYS_BUS_DEVICE(s), SYS_BUS_DEVICE(&s->ic));
>>
>>  /* UART0 */
>> +qdev_prop_set_chr(DEVICE(&s->uart0), "chardev", serial_hds[0]);
>>  object_property_set_bool(OBJECT(s->uart0), true, "realized", &err);
>>  if (err) {
>>  error_propagate(errp, err);
>> @@ -158,17 +159,8 @@ static void bcm2835_peripherals_realize(DeviceState *dev, Error **errp)
>>  sysbus_connect_irq(s->uart0, 0,
>>  qdev_get_gpio_in_named(DEVICE(&s->ic), BCM2835_IC_GPU_IRQ,
>> INTERRUPT_UART));
>> -
>>  /* AUX / UART1 */
>> -/* TODO: don't call qemu_char_get_next_serial() here, instead set
>> - * chardev properties for each uart at the board level, once pl011
>> - * (uart0) has been updated to avoid qemu_char_get_next_serial()
>> - */
>
> This comment says this should be fixed by having board-level
> properties; you've removed it but this patch isn't adding
> the properties to this (SoC-level) device. I think the board
> level should be looking at serial_hds[], not this code.

Device models should not fish backends out of global state.  Whether
they fish directly or via some helper like qemu_char_get_next_serial()
doesn't matter.  The ones that still do need to set
cannot_instantiate_with_device_add_yet with a suitable comment.

>> @@ -310,8 +312,7 @@ static void pl011_class_init(ObjectClass *oc, void *data)
>>
>>  dc->realize = pl011_realize;
>>  dc->vmsd = &vmstate_pl011;
>> -/* Reason: realize() method uses qemu_char_get_next_serial() */
>> -dc->cannot_instantiate_with_device_add_yet = true;
>
> Why does instantiating with device_add work now? There's
> still no way to wire up interrupt lines or map mmio regions.
> (This has never made much sense to me -- Markus?)

Uh, which part does "this" refer to?

I systematically reviewed devices for my "Clean up and fix no_user"
series (commit f976b09..7ea5e78), and wrote down my findings in
"Reason:" comments next to cannot_instantiate_with_device_add_yet
assignments.  Any such assignment must have such a comment.

Testing can catch cases where we missed *all* reasons.  Example: my "Fix
device introspection regressions" series (commit b37686f..33fe968).  It
can't catch cases where we missed *some* reasons.

Note that cannot_instantiate_with_device_add_yet can get set by
(possibly abstract) parent devices as well.  If a parent device sets it,
its children should nevertheless set it *again* if they have additional
reasons.  I believe this is such a case.  I'm not 100% sure, because I
haven't been 100% sure about anything related to sysbus devices ever
since Alex's commit 33cd52b "sysbus: Make devices spawnable via -device"
dropped cannot_instantiate_with_device_add_yet from
sysbus_device_class_init(), quoted below.  As you see, that assignment
covered "no way to wire up interrupt lines or map mmio regions."


diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
index 19437e6..7bfe381 100644
--- a/hw/core/sysbus.c
+++ b/hw/core/sysbus.c
@@ -283,13 +283,6 @@ static void sysbus_device_class_init(ObjectClass *klass, void *data)
     DeviceClass *k = DEVICE_CLASS(klass);
     k->init = sysbus_device_init;
     k->bus_type = TYPE_SYSTEM_BUS;
-/*
-

[Qemu-devel] [PATCH V3] block/io: optimize bdrv_co_pwritev for small requests

2016-05-30 Thread Peter Lieven
In a read-modify-write cycle, a small request might cause
head and tail to fall into the same aligned block. Currently
QEMU reads the same block twice in this case, which is
not necessary.

Signed-off-by: Peter Lieven 
---
v1->v2: following Paolo's suggestions to simplify the if condition and
adjusting the comment
v2->v3: fix iotest 077 for requests that are within the same aligned block
[Fam, Kevin]

 block/io.c |  8 
 tests/qemu-iotests/077 | 12 +---
 tests/qemu-iotests/077.out | 26 --
 3 files changed, 9 insertions(+), 37 deletions(-)

diff --git a/block/io.c b/block/io.c
index 2d832aa..0e4bb1e 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1427,6 +1427,14 @@ int coroutine_fn bdrv_co_pwritev(BlockDriverState *bs,
 
         bytes += offset & (align - 1);
         offset = offset & ~(align - 1);
+
+        /* We have read the tail already if the request is smaller
+         * than one aligned block.
+         */
+        if (bytes < align) {
+            qemu_iovec_add(&local_qiov, head_buf + bytes, align - bytes);
+            bytes = align;
+        }
     }
 
     if ((offset + bytes) & (align - 1)) {
diff --git a/tests/qemu-iotests/077 b/tests/qemu-iotests/077
index 4dc680b..d2d2a2d 100755
--- a/tests/qemu-iotests/077
+++ b/tests/qemu-iotests/077
@@ -60,7 +60,7 @@ EOF
 
 # Sequential RMW requests on the same physical sector
 off=0x1000
-for ev in "head" "after_head" "tail" "after_tail"; do
+for ev in "head" "after_head"; do
 cat  <
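
The alignment arithmetic in the block/io.c hunk above can be sketched
in isolation. This is only an illustration, not QEMU code: RmwLayout
and rmw_layout() are hypothetical names, and the sketch assumes
'align' is a power of two, as the mask arithmetic in the patch
requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of QEMU: describes how an unaligned
 * write maps onto aligned blocks. */
typedef struct {
    uint64_t aligned_offset; /* request start rounded down to 'align' */
    uint64_t head_padding;   /* bytes before the request in its first block */
    bool single_block;       /* head and tail share one aligned block */
} RmwLayout;

static RmwLayout rmw_layout(uint64_t offset, uint64_t bytes, uint64_t align)
{
    RmwLayout l;

    /* 'align' must be a non-zero power of two for the mask arithmetic */
    assert(align != 0 && (align & (align - 1)) == 0);

    l.head_padding = offset & (align - 1);
    l.aligned_offset = offset & ~(align - 1);

    /* Same condition as the patch: the request has an unaligned head,
     * and even with the head padding added it is still smaller than
     * one aligned block, so the head read already covers the tail. */
    l.single_block = l.head_padding != 0 &&
                     (l.head_padding + bytes) < align;
    return l;
}
```

With a 512-byte alignment, a 16-byte write at 0x1001 has single_block
set, so one read of the block at 0x1000 is enough; a 64-byte write at
0x11f0 crosses into the next block and still needs a separate tail
read.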

Re: [Qemu-devel] [PATCH v2 00/22] GICv3 emulation

2016-05-30 Thread Andrew Jones
On Thu, May 26, 2016 at 03:55:18PM +0100, Peter Maydell wrote:
> This series implements emulation of the GICv3 interrupt controller.
> It is based to some extent on previous patches from Shlomo and
> Pavel, but the bulk of it has turned out to be new code. (The
> combination of changing the underlying data structures, adding
> support for TrustZone and implementing proper GICv3 behaviour rather
> than borrowing fragments of GICv2 emulation code meant there wasn't
> much left to reuse.) I've tried to reflect this in the various
> authorship credits on the patches, but please let me know if you
> feel I got anything miscredited one way or the other.
> 
> Key points about the GICv3 emulated here:
>  * "non-legacy" only, ie system registers and affinity routing
>  * TrustZone is implemented
>  * no virtualization support
>  * only the "core" GICv3, so no LPI support (via ITS or otherwise)
>  * no attempt to work around the Linux guest kernel bug fixed
>in commit 7c9b973061b0 (so you need that fix for your guest to
>boot with this GICv3)
> 
> I have included the "support KVM save/restore" patches from Pavel,
> reworked to use the new data structures, but they are only RFC
> status because the kernel API is not yet final (there are a couple
> of loose threads there to be followed up). Those patches are at the
> end of the series; I think everything else is in a committable state
> (give or take code review).
> 
> Testing: I have confirmed that we can boot a Linux guest kernel,
> but not tried any other GIC-using guest code. I've done some light
> stress-testing using 'stress', and checked an SMP (2-cpu) boot.
> I've also tested booting a guest kernel via UEFI.
> 
> Design: all the code here is in hw/intc/, split into several
> files to avoid them being huge. The interface between the CPU
> proper and the CPU interface is a bit ad-hoc (you can see the
> awkward seams that result from the choice to think of the cpu
> i/f as part of the GIC device rather than part of the CPU itself),
> but I think that if you put the cpu i/f in the CPU you'd end up
> with an ad-hoc interface and awkward seams in the other direction.
> The GICv3 device currently assumes it is always connected to all
> CPUs; we can change that later to allow some kind of QOM link
> property to specify the CPUs explicitly, but I think this is OK
> for now (and it's already a pretty huge set of code changes to
> have to review).
> 
> Code review, testing, attempts to run guests other than Linux
> welcome.
> 
> Changes v1->v2:
>  * I have dropped the kernel bug workaround, since it didn't work for
>boots via UEFI anyway. This means that you will need kernel commit
>7c9b973061b0 (or its equivalent backports to stable) to boot a Linux
>guest with this emulated GICv3
>  * make bitmaps and arrays be GIC_MAXIRQS in size rather than
>GIC_MAXSPIS in size; this uses an extra 512 bytes or so per
>vcpu, but makes bugs of the "forgot to add/subtract GIC_INTERNAL"
>variety less likely to happen (and indeed a few were found and fixed
>in making this change...)
>  * fixed GICD_CTLR NS read to use correct bitmask
>  * moved ARMELChangeHook related prototypes etc into cpu.h from cpu-qom.h
>(needed after Paolo's recent header reshuffles)
>  * added missing 'inline' qualifier to arm_is_el3_or_mon()
>  * fixed missing reset of GICD_NSACR
>  * fixed icc_activate_irq() to call gicv3_redist_update() rather than
>gicv3_update() when it changes redistributor state
>  * make sure (and assert) gicv3_update() isn't called for out of range irqs
>  * add missing "bad num_irqs values" checks from gicv2 code

I've lightly tested this with kvm-unit-tests (I say lightly because
it only does IPI testing so far), although I did try with 123 vcpus,
the max mach-virt currently supports. I'm not sure the testing is
enough to warrant any tested-by's. I'm mostly just advertising the
unit test, which is here:
https://github.com/rhdrjones/kvm-unit-tests/commits/arm/gic

Also, below is a follow-on patch for mach-virt, which might be
of interest.

Thanks,
drew


From: Andrew Jones 
Date: Mon, 30 May 2016 12:58:26 +0200
Subject: [PATCH] hw/arm/virt: gicv3: use all target-list bits

Signed-off-by: Andrew Jones 
---
 hw/arm/virt.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index e77ed88afb8a2..753c9ff8ccd64 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1204,6 +1204,16 @@ static void machvirt_init(MachineState *machine)
 }
 cpuobj = object_new(object_class_get_name(oc));
 
+/* Adjust MPIDR per the GIC's target-list size. */
+if (gic_version == 3) {
+CPUState *cs = CPU(cpuobj);
+uint8_t Aff1 = cs->cpu_index / 16;
+uint8_t Aff0 = cs->cpu_index % 16;
+
+object_property_set_int(cpuobj, (Aff1 << ARM_AFF1_SHIFT) | Aff0,
+"mp-affinity", NULL);
+}
+
 /* Handle any CPU options specifi
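
The affinity packing in the mach-virt hunk above can be checked with a
small standalone sketch. The names below are hypothetical;
SKETCH_AFF1_SHIFT stands in for ARM_AFF1_SHIFT and is assumed to be 8
(Aff1 in MPIDR bits [15:8]), and the cluster size of 16 follows the
16-bit SGI target list that the patch subject refers to.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_AFF1_SHIFT       8   /* assumed position of Aff1 in MPIDR */
#define SKETCH_CPUS_PER_CLUSTER 16  /* one target-list bit per CPU */

/* Hypothetical helper: map a linear cpu index to an Aff1/Aff0 pair so
 * that every bit of the target list within a cluster is used. */
static uint64_t sketch_mp_affinity(unsigned cpu_index)
{
    uint8_t aff1 = cpu_index / SKETCH_CPUS_PER_CLUSTER;
    uint8_t aff0 = cpu_index % SKETCH_CPUS_PER_CLUSTER;

    return ((uint64_t)aff1 << SKETCH_AFF1_SHIFT) | aff0;
}
```

Under these assumptions cpu 15 is the last CPU of cluster 0, cpu 16
is the first CPU of cluster 1, and the 123rd vcpu (index 122) lands in
cluster 7.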

[Qemu-devel] [PATCH v8 24/25] kvm-irqchip: do explicit commit when update irq

2016-05-30 Thread Peter Xu
In the past, we committed the gsi routes on every irqchip route
update. This is inefficient when updating many routes at the same
time. This patch removes the commit phase from
kvm_irqchip_update_msi_route(). Instead, callers do an explicit commit
after all routes have been updated.

Signed-off-by: Peter Xu 
---
 hw/i386/kvm/pci-assign.c | 2 ++
 hw/misc/ivshmem.c| 1 +
 hw/vfio/pci.c| 1 +
 hw/virtio/virtio-pci.c   | 1 +
 include/sysemu/kvm.h | 2 +-
 kvm-all.c| 2 --
 kvm-stub.c   | 4 
 target-i386/kvm.c| 1 +
 8 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
index 5f7d5c6..0c911af 100644
--- a/hw/i386/kvm/pci-assign.c
+++ b/hw/i386/kvm/pci-assign.c
@@ -1016,6 +1016,7 @@ static void assigned_dev_update_msi_msg(PCIDevice 
*pci_dev)
 
 kvm_irqchip_update_msi_route(kvm_state, assigned_dev->msi_virq[0],
  msi_get_message(pci_dev, 0), pci_dev);
+kvm_irqchip_commit_routes(kvm_state);
 }
 
 static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
@@ -1602,6 +1603,7 @@ static void assigned_dev_msix_mmio_write(void *opaque, 
hwaddr addr,
 if (ret) {
 error_report("Error updating irq routing entry (%d)", ret);
 }
+kvm_irqchip_commit_routes(kvm_state);
 }
 }
 }
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 6909346..953d7f8 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -325,6 +325,7 @@ static int ivshmem_vector_unmask(PCIDevice *dev, unsigned 
vector,
 if (ret < 0) {
 return ret;
 }
+kvm_irqchip_commit_routes(kvm_state);
 
 return kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, n, NULL, v->virq);
 }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 06ad15e..aa5fc74 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -458,6 +458,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, 
MSIMessage msg,
  PCIDevice *pdev)
 {
 kvm_irqchip_update_msi_route(kvm_state, vector->virq, msg, pdev);
+kvm_irqchip_commit_routes(kvm_state);
 }
 
 static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index df85f28..6342435 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -872,6 +872,7 @@ static int virtio_pci_vq_vector_unmask(VirtIOPCIProxy 
*proxy,
 if (ret < 0) {
 return ret;
 }
+kvm_irqchip_commit_routes(kvm_state);
 }
 }
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 043bd12..0542dd8 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -370,7 +370,6 @@ int kvm_set_irq(KVMState *s, int irq, int level);
 int kvm_irqchip_send_msi(KVMState *s, MSIMessage msg);
 
 void kvm_irqchip_add_irq_route(KVMState *s, int gsi, int irqchip, int pin);
-void kvm_irqchip_commit_routes(KVMState *s);
 
 void kvm_put_apic_state(DeviceState *d, struct kvm_lapic_state *kapic);
 void kvm_get_apic_state(DeviceState *d, struct kvm_lapic_state *kapic);
@@ -493,6 +492,7 @@ static inline void cpu_synchronize_post_init(CPUState *cpu)
 int kvm_irqchip_add_msi_route(KVMState *s, int vector, PCIDevice *dev);
 int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg,
  PCIDevice *dev);
+void kvm_irqchip_commit_routes(KVMState *s);
 void kvm_irqchip_release_virq(KVMState *s, int virq);
 
 int kvm_irqchip_add_adapter_route(KVMState *s, AdapterInfo *adapter);
diff --git a/kvm-all.c b/kvm-all.c
index f4ce357..eeb3d97 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1034,8 +1034,6 @@ static int kvm_update_routing_entry(KVMState *s,
 
 *entry = *new_entry;
 
-kvm_irqchip_commit_routes(s);
-
 return 0;
 }
 
diff --git a/kvm-stub.c b/kvm-stub.c
index 29604cc..8db2d02 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -130,6 +130,10 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, 
MSIMessage msg,
 return -ENOSYS;
 }
 
+void kvm_irqchip_commit_routes(KVMState *s)
+{
+}
+
 int kvm_irqchip_add_adapter_route(KVMState *s, AdapterInfo *adapter)
 {
 return -ENOSYS;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 173afd7..1a92c15 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -3374,6 +3374,7 @@ static void kvm_update_msi_routes_all(void *private, bool 
global,
 kvm_irqchip_update_msi_route(kvm_state, entry->virq,
  msg, entry->dev);
 }
+kvm_irqchip_commit_routes(kvm_state);
 trace_kvm_x86_update_msi_routes(cnt);
 }
 
-- 
2.4.11
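
The update-then-commit split described in the patch above can be
illustrated with a toy model. Nothing here is QEMU API: SketchRoutes
and the sketch_* functions are hypothetical, and the "commit" is just
an array copy standing in for the expensive call into the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ROUTES 8

typedef struct {
    int pending[SKETCH_MAX_ROUTES];   /* userspace copy of the table */
    int committed[SKETCH_MAX_ROUTES]; /* what the "kernel" currently sees */
    unsigned commits;                 /* number of expensive commit calls */
} SketchRoutes;

/* Update one entry only; the commit is left to the caller. */
static void sketch_update_route(SketchRoutes *r, size_t idx, int value)
{
    assert(idx < SKETCH_MAX_ROUTES);
    r->pending[idx] = value;
}

/* Push the whole pending table in one go. */
static void sketch_commit_routes(SketchRoutes *r)
{
    for (size_t i = 0; i < SKETCH_MAX_ROUTES; i++) {
        r->committed[i] = r->pending[i];
    }
    r->commits++;
}

/* Bulk update: many updates, then a single commit. */
static unsigned sketch_update_all(SketchRoutes *r, int value)
{
    for (size_t i = 0; i < SKETCH_MAX_ROUTES; i++) {
        sketch_update_route(r, i, value);
    }
    sketch_commit_routes(r);
    return r->commits;
}
```

The trade-off is visible in the model: a caller that forgets the
commit leaves the committed table stale, which is why the patch adds
an explicit kvm_irqchip_commit_routes() call at every update site.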




[Qemu-devel] [PATCH v8 19/25] intel_iommu: Add support for Extended Interrupt Mode

2016-05-30 Thread Peter Xu
From: Jan Kiszka 

As neither QEMU nor KVM support more than 255 CPUs so far, this is
simple: we only need to switch the destination ID translation in
vtd_remap_irq_get if EIME is set.

Once CFI support is there, it will have to take EIM into account as
well. So far, nothing to do for this.

This patch allows x2APIC to be used in split irqchip mode of KVM.

Signed-off-by: Jan Kiszka 
[use le32_to_cpu() to retrieve dest_id]
Signed-off-by: Peter Xu 
---
 hw/i386/intel_iommu.c  | 16 +---
 hw/i386/intel_iommu_internal.h |  2 ++
 include/hw/i386/intel_iommu.h  |  1 +
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5be9010..22c82fc 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -913,6 +913,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState 
*s)
 value = vtd_get_quad_raw(s, DMAR_IRTA_REG);
 s->intr_size = 1UL << ((value & VTD_IRTA_SIZE_MASK) + 1);
 s->intr_root = value & VTD_IRTA_ADDR_MASK;
+s->intr_eime = value & VTD_IRTA_EIME;
 
 /* Notify global invalidation */
 vtd_iec_notify_all(s, true, 0, 0);
@@ -2047,11 +2048,13 @@ static int vtd_remap_irq_get(IntelIOMMUState *iommu, 
uint16_t index, VTDIrq *irq
 irq->trigger_mode = irte.trigger_mode;
 irq->vector = irte.vector;
 irq->delivery_mode = irte.delivery_mode;
-/* Not support EIM yet: please refer to vt-d 9.10 DST bits */
+irq->dest = le32_to_cpu(irte.dest_id);
+if (!iommu->intr_eime) {
 #define  VTD_IR_APIC_DEST_MASK (0xff00ULL)
 #define  VTD_IR_APIC_DEST_SHIFT(8)
-irq->dest = (le32_to_cpu(irte.dest_id) & VTD_IR_APIC_DEST_MASK) >> \
-VTD_IR_APIC_DEST_SHIFT;
+irq->dest = (irq->dest & VTD_IR_APIC_DEST_MASK) >>
+VTD_IR_APIC_DEST_SHIFT;
+}
 irq->dest_mode = irte.dest_mode;
 irq->redir_hint = irte.redir_hint;
 
@@ -2304,7 +2307,7 @@ static void vtd_init(IntelIOMMUState *s)
 s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
 if (ms->iommu_intr) {
-s->ecap |= VTD_ECAP_IR;
+s->ecap |= VTD_ECAP_IR | VTD_ECAP_EIM;
 }
 
 vtd_reset_context_cache(s);
@@ -2358,10 +2361,9 @@ static void vtd_init(IntelIOMMUState *s)
 vtd_define_quad(s, DMAR_FRCD_REG_0_2, 0, 0, 0x8000000000000000ULL);
 
 /*
- * Interrupt remapping registers, not support extended interrupt
- * mode for now.
+ * Interrupt remapping registers.
  */
-vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xfffffffffffff00fULL, 0);
+vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xfffffffffffff80fULL, 0);
 }
 
 /* Should not reset address_spaces when reset because devices will still use
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 10c20fe..72b0114 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -176,6 +176,7 @@
 
 /* IRTA_REG */
 #define VTD_IRTA_ADDR_MASK  (VTD_HAW_MASK ^ 0xfffULL)
+#define VTD_IRTA_EIME   (1ULL << 11)
 #define VTD_IRTA_SIZE_MASK  (0xfULL)
 
 /* ECAP_REG */
@@ -184,6 +185,7 @@
 #define VTD_ECAP_QI (1ULL << 1)
 /* Interrupt Remapping support */
 #define VTD_ECAP_IR (1ULL << 3)
+#define VTD_ECAP_EIM(1ULL << 4)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 3bca390..2fdca5b 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -271,6 +271,7 @@ struct IntelIOMMUState {
 bool intr_enabled;  /* Whether guest enabled IR */
 dma_addr_t intr_root;   /* Interrupt remapping table pointer */
 uint32_t intr_size; /* Number of IR table entries */
+bool intr_eime; /* Extended interrupt mode enabled */
 };
 
 #endif
-- 
2.4.11
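
The destination ID selection in vtd_remap_irq_get() above can be
sketched as a pure function. The names are hypothetical, the mask and
shift mirror VTD_IR_APIC_DEST_MASK and VTD_IR_APIC_DEST_SHIFT from the
hunk, and dest_id is assumed to be in host byte order already (i.e.
after le32_to_cpu()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_APIC_DEST_MASK  0xff00u
#define SKETCH_APIC_DEST_SHIFT 8

/* Hypothetical helper: pick the interrupt destination from an IRTE
 * dest_id field, depending on whether extended interrupt mode is on. */
static uint32_t sketch_irte_dest(uint32_t dest_id, bool eime)
{
    if (eime) {
        /* x2APIC: the full 32-bit value is the destination */
        return dest_id;
    }
    /* xAPIC: only bits 15:8 carry the 8-bit APIC ID */
    return (dest_id & SKETCH_APIC_DEST_MASK) >> SKETCH_APIC_DEST_SHIFT;
}
```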




[Qemu-devel] [PATCH v8 25/25] intel_iommu: support all masks in interrupt entry cache invalidation

2016-05-30 Thread Peter Xu
From: Radim Krčmář 

Linux guests do not gracefully handle cases when the invalidation mask
they wanted is not supported, probably because real hardware always
allowed all.

We can just say that all 16 masks are supported, because both
ioapic_iec_notifier and kvm_update_msi_routes_all invalidate all caches.

Signed-off-by: Radim Krčmář 
---
 hw/i386/intel_iommu.c  | 2 +-
 hw/i386/intel_iommu_internal.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c4ea0c3..d0c9743 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2354,7 +2354,7 @@ static void vtd_init(IntelIOMMUState *s)
 s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
 if (ms->iommu_intr) {
-s->ecap |= VTD_ECAP_IR | VTD_ECAP_EIM;
+s->ecap |= VTD_ECAP_IR | VTD_ECAP_EIM | VTD_ECAP_MHMV;
 }
 
 vtd_reset_context_cache(s);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 72b0114..0829a50 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -186,6 +186,7 @@
 /* Interrupt Remapping support */
 #define VTD_ECAP_IR (1ULL << 3)
 #define VTD_ECAP_EIM(1ULL << 4)
+#define VTD_ECAP_MHMV   (15ULL << 20)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
-- 
2.4.11
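
To make the "all 16 masks" statement above concrete, here is a small
sketch of how a guest might read the field that VTD_ECAP_MHMV sets.
The helper names are hypothetical; the shift of 20 comes from the
define in the patch, while the 5-bit field width is an assumption
taken from my reading of the VT-d spec rather than from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_ECAP_MHMV_SHIFT 20
#define SKETCH_ECAP_MHMV_MASK  0x1fULL  /* assumed field width: 5 bits */

/* Hypothetical helper: maximum handle mask value advertised in ECAP. */
static unsigned sketch_ecap_mhmv(uint64_t ecap)
{
    return (unsigned)((ecap >> SKETCH_ECAP_MHMV_SHIFT) &
                      SKETCH_ECAP_MHMV_MASK);
}

/* An IEC invalidation with index mask 'm' is acceptable only if
 * m does not exceed the advertised maximum. */
static bool sketch_iec_mask_supported(uint64_t ecap, unsigned index_mask)
{
    return index_mask <= sketch_ecap_mhmv(ecap);
}
```

With the value from the patch (15 << 20) every mask from 0 to 15 is
accepted; without it only mask 0 would be.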




[Qemu-devel] [PATCH v8 16/25] q35: add "intremap" parameter to enable IR

2016-05-30 Thread Peter Xu
One flag is added to specify whether to enable IR for the emulated
IOMMU. By default, interrupt remapping is not supported. To enable it,
we should specify something like:

$ qemu-system-x86_64 -M q35,iommu=on,intremap=on

To be clear, the following command:

$ qemu-system-x86_64 -M q35,iommu=on

will enable the IOMMU only, without interrupt remapping support.

Currently, Intel IOMMU IR only supports kernel-irqchip={off|split}. We
need to specify one of them in -M as well.

Signed-off-by: Peter Xu 
---
 hw/core/machine.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 98471a7..41d6a95 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -314,6 +314,20 @@ static void machine_set_iommu(Object *obj, bool value, 
Error **errp)
 ms->iommu = value;
 }
 
+static bool machine_get_intremap(Object *obj, Error **errp)
+{
+MachineState *ms = MACHINE(obj);
+
+return ms->iommu_intr;
+}
+
+static void machine_set_intremap(Object *obj, bool value, Error **errp)
+{
+MachineState *ms = MACHINE(obj);
+
+ms->iommu_intr = value;
+}
+
 static void machine_set_suppress_vmdesc(Object *obj, bool value, Error **errp)
 {
 MachineState *ms = MACHINE(obj);
@@ -501,6 +515,12 @@ static void machine_initfn(Object *obj)
 object_property_set_description(obj, "iommu",
 "Set on/off to enable/disable Intel IOMMU 
(VT-d)",
 NULL);
+object_property_add_bool(obj, "intremap", machine_get_intremap,
+ machine_set_intremap, NULL);
+object_property_set_description(obj, "intremap",
+"Set on/off to enable/disable IOMMU"
+" interrupt remapping",
+NULL);
 object_property_add_bool(obj, "suppress-vmdesc",
  machine_get_suppress_vmdesc,
  machine_set_suppress_vmdesc, NULL);
-- 
2.4.11




Re: [Qemu-devel] [PATCH v8 00/25] IOMMU: Enable interrupt remapping for Intel IOMMU

2016-05-30 Thread Peter Xu
On Mon, May 30, 2016 at 06:31:13PM +0800, Peter Xu wrote:
> This is the v8 patchset for Intel IOMMU IR support. To test it with
> pci-bridges, we still need to apply the following fix to solve a known
> issue which will hang the guest:
> 
> - [PATCH v4] pci: fix pci_requester_id()
> 
>   https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg02769.html
> 
> V8 mostly fixes some issues with bit-field definitions, which are
> possibly erroneous when the host is a big endian machine.
> 
> v8 changes:
> - rebase to latest master
> - patch 7
>   - remove VTD_IR_IOAPICEntry, which is useless now
>   - fix possible issue on big endian machines for VTD_IRTE,
> VTD_IR_MSIAddress
> - patch 12
>   - fix endianness issue with bit-field defines: fix BE issue with
>     VTD_MSIMessage, do cpu_to_*() or reverse when necessary on
>     bit-field uses.
> - patch 19
>   - used le32_to_cpu() for dest_id, and added my s-o-b line beneath
> Jan's.

Online repository:

  https://github.com/xzpeter/qemu vtd-intr-v8

Thanks,

-- peterx



[Qemu-devel] [PATCH v8 22/25] kvm-irqchip: i386: add hook for add/remove virq

2016-05-30 Thread Peter Xu
Add two hooks to be notified when adding/removing MSI routes. There
are two kinds of MSI routes:

- in kvm_irqchip_add_msi_route(): before assigning IRQFD. Used by
  vhost, vfio, etc.

- in kvm_irqchip_send_msi(): when sending a direct MSI message, if
  direct MSI is not allowed, we will first create one MSI route entry
  in the kernel, then trigger it.

This patch only hooks the first one (irqfd case). We do not need to
take care of the second one, since it's only used by QEMU userspace
(kvm-apic) and the messages are always translated in time when
triggered. For the first one, however, we need to note the routes
down so that we can notify the kernel when cache invalidation happens.

Also, we do not hook IOAPIC MSI routes (we have an explicit notifier
for IOAPIC to keep its cache updated). We only need to care about
irqfd users.

Signed-off-by: Peter Xu 
---
 include/sysemu/kvm.h |  6 ++
 kvm-all.c|  2 ++
 target-arm/kvm.c | 11 +++
 target-i386/kvm.c| 48 
 target-mips/kvm.c| 11 +++
 target-ppc/kvm.c | 11 +++
 target-s390x/kvm.c   | 11 +++
 trace-events |  2 ++
 8 files changed, 102 insertions(+)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 9144c6a..043bd12 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -358,6 +358,12 @@ void kvm_arch_init_irq_routing(KVMState *s);
 int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
  uint64_t address, uint32_t data, PCIDevice *dev);
 
+/* Notify arch about newly added MSI routes */
+int kvm_arch_add_msi_route_post(struct kvm_irq_routing_entry *route,
+int vector, PCIDevice *dev);
+/* Notify arch about released MSI routes */
+int kvm_arch_release_virq_post(int virq);
+
 int kvm_arch_msi_data_to_gsi(uint32_t data);
 
 int kvm_set_irq(KVMState *s, int irq, int level);
diff --git a/kvm-all.c b/kvm-all.c
index 46190f7..a9eb366 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1073,6 +1073,7 @@ void kvm_irqchip_release_virq(KVMState *s, int virq)
 }
 }
 clear_gsi(s, virq);
+kvm_arch_release_virq_post(virq);
 }
 
 static unsigned int kvm_hash_msi(uint32_t data)
@@ -1221,6 +1222,7 @@ int kvm_irqchip_add_msi_route(KVMState *s, int vector, 
PCIDevice *dev)
 }
 
 kvm_add_routing_entry(s, &kroute);
+kvm_arch_add_msi_route_post(&kroute, vector, dev);
 kvm_irqchip_commit_routes(s);
 
 return virq;
diff --git a/target-arm/kvm.c b/target-arm/kvm.c
index 83da447..81d3e98 100644
--- a/target-arm/kvm.c
+++ b/target-arm/kvm.c
@@ -623,6 +623,17 @@ int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry 
*route,
 return 0;
 }
 
+int kvm_arch_add_msi_route_post(struct kvm_irq_routing_entry *route,
+int vector, PCIDevice *dev)
+{
+return 0;
+}
+
+int kvm_arch_release_virq_post(int virq)
+{
+return 0;
+}
+
 int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 return (data - 32) & 0xffff;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 1f5b1d6..9051d16 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -3347,6 +3347,54 @@ int kvm_arch_fixup_msi_route(struct 
kvm_irq_routing_entry *route,
 return 0;
 }
 
+typedef struct MSIRouteEntry MSIRouteEntry;
+
+struct MSIRouteEntry {
+PCIDevice *dev; /* Device pointer */
+int vector; /* MSI/MSIX vector index */
+int virq;   /* Virtual IRQ index */
+QLIST_ENTRY(MSIRouteEntry) list;
+};
+
+/* List of used GSI routes */
+static QLIST_HEAD(, MSIRouteEntry) msi_route_list = \
+QLIST_HEAD_INITIALIZER(msi_route_list);
+
+int kvm_arch_add_msi_route_post(struct kvm_irq_routing_entry *route,
+int vector, PCIDevice *dev)
+{
+MSIRouteEntry *entry;
+
+if (!dev) {
+/* These are (possibly) IOAPIC routes only used for split
+ * kernel irqchip mode, while what we are housekeeping are
+ * PCI devices only. */
+return 0;
+}
+
+entry = g_new0(MSIRouteEntry, 1);
+entry->dev = dev;
+entry->vector = vector;
+entry->virq = route->gsi;
+QLIST_INSERT_HEAD(&msi_route_list, entry, list);
+
+trace_kvm_x86_add_msi_route(route->gsi);
+return 0;
+}
+
+int kvm_arch_release_virq_post(int virq)
+{
+MSIRouteEntry *entry, *next;
+QLIST_FOREACH_SAFE(entry, &msi_route_list, list, next) {
+if (entry->virq == virq) {
+trace_kvm_x86_remove_msi_route(virq);
+QLIST_REMOVE(entry, list);
+break;
+}
+}
+return 0;
+}
+
 int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 abort();
diff --git a/target-mips/kvm.c b/target-mips/kvm.c
index a854e4d..7fc84d5 100644
--- a/target-mips/kvm.c
+++ b/target-mips/kvm.c
@@ -1044,6 +1044,17 @@ int kvm_arch_fixup_msi_route(struct 
kvm_irq_routing_entry *route,
 return 0;
 }
 
+int kvm_arch_add_msi_route_post(struct kvm_irq_ro
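
The bookkeeping done by the two i386 hooks above amounts to a list
keyed by virq. A minimal standalone sketch follows; it uses a plain
singly linked list instead of QLIST, all names are hypothetical, and
unlike the patch it frees the node on removal since the toy list owns
its entries.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct SketchRoute {
    int vector;               /* MSI/MSI-X vector index */
    int virq;                 /* virtual IRQ the route was given */
    struct SketchRoute *next;
} SketchRoute;

static SketchRoute *sketch_routes; /* head of the tracked-route list */

/* Called after a route was added: remember it for later updates. */
static int sketch_add_route(int vector, int virq)
{
    SketchRoute *e = calloc(1, sizeof(*e));

    if (!e) {
        return -1;
    }
    e->vector = vector;
    e->virq = virq;
    e->next = sketch_routes;
    sketch_routes = e;
    return 0;
}

/* Called after a virq was released: forget the matching route.
 * Returns 1 if an entry was removed, 0 if the virq was not tracked. */
static int sketch_release_virq(int virq)
{
    for (SketchRoute **p = &sketch_routes; *p; p = &(*p)->next) {
        if ((*p)->virq == virq) {
            SketchRoute *dead = *p;
            *p = dead->next;
            free(dead);
            return 1;
        }
    }
    return 0;
}

static int sketch_route_count(void)
{
    int n = 0;

    for (SketchRoute *e = sketch_routes; e; e = e->next) {
        n++;
    }
    return n;
}
```

Releasing a virq that was never tracked is a no-op here, which
matches the patch: IOAPIC routes (dev == NULL) are never inserted, so
their release simply finds nothing.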

[Qemu-devel] [PATCH v8 23/25] kvm-irqchip: x86: add msi route notify fn

2016-05-30 Thread Peter Xu
One more IEC notifier is added to let msi routes know about the IEC
changes. When interrupt invalidation happens, all registered msi routes
will be updated for all PCI devices.

Since both vfio and vhost are possible gsi route consumers, this patch
will go one step further to keep them safe in split irqchip mode and
when irqfd is enabled.

Signed-off-by: Peter Xu 
---
 hw/pci/pci.c | 15 +++
 include/hw/pci/pci.h |  2 ++
 kvm-all.c| 10 +-
 target-i386/kvm.c| 30 ++
 trace-events |  1 +
 5 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 7430715..ec1928f 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2527,6 +2527,21 @@ uint16_t pci_requester_id(PCIDevice *dev)
 return result;
 }
 
+MSIMessage pci_get_msi_message(PCIDevice *dev, int vector)
+{
+MSIMessage msg;
+if (msix_enabled(dev)) {
+msg = msix_get_message(dev, vector);
+} else if (msi_enabled(dev)) {
+msg = msi_get_message(dev, vector);
+} else {
+/* Should never happen */
+error_report("%s: unknown interrupt type", __func__);
+abort();
+}
+return msg;
+}
+
 static const TypeInfo pci_device_type_info = {
 .name = TYPE_PCI_DEVICE,
 .parent = TYPE_DEVICE,
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 351266c..359c22e 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -779,4 +779,6 @@ extern const VMStateDescription vmstate_pci_device;
 .offset = vmstate_offset_pointer(_state, _field, PCIDevice), \
 }
 
+MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
+
 #endif
diff --git a/kvm-all.c b/kvm-all.c
index a9eb366..f4ce357 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1186,15 +1186,7 @@ int kvm_irqchip_add_msi_route(KVMState *s, int vector, 
PCIDevice *dev)
 MSIMessage msg = {0, 0};
 
 if (dev) {
-if (msix_enabled(dev)) {
-msg = msix_get_message(dev, vector);
-} else if (msi_enabled(dev)) {
-msg = msi_get_message(dev, vector);
-} else {
-/* Should never happen */
-error_report("%s: unknown interrupt type", __func__);
-abort();
-}
+msg = pci_get_msi_message(dev, vector);
 }
 
 if (kvm_gsi_direct_mapping()) {
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 9051d16..173afd7 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -37,6 +37,7 @@
 #include "hw/i386/apic_internal.h"
 #include "hw/i386/apic-msidef.h"
 #include "hw/i386/intel_iommu.h"
+#include "hw/i386/x86-iommu.h"
 
 #include "exec/ioport.h"
 #include "standard-headers/asm-x86/hyperv.h"
@@ -3360,9 +3361,26 @@ struct MSIRouteEntry {
 static QLIST_HEAD(, MSIRouteEntry) msi_route_list = \
 QLIST_HEAD_INITIALIZER(msi_route_list);
 
+static void kvm_update_msi_routes_all(void *private, bool global,
+  uint32_t index, uint32_t mask)
+{
+int cnt = 0;
+MSIRouteEntry *entry;
+MSIMessage msg;
+/* TODO: explicit route update */
+QLIST_FOREACH(entry, &msi_route_list, list) {
+cnt++;
+msg = pci_get_msi_message(entry->dev, entry->vector);
+kvm_irqchip_update_msi_route(kvm_state, entry->virq,
+ msg, entry->dev);
+}
+trace_kvm_x86_update_msi_routes(cnt);
+}
+
 int kvm_arch_add_msi_route_post(struct kvm_irq_routing_entry *route,
 int vector, PCIDevice *dev)
 {
+static bool notify_list_inited = false;
 MSIRouteEntry *entry;
 
 if (!dev) {
@@ -3379,6 +3397,18 @@ int kvm_arch_add_msi_route_post(struct 
kvm_irq_routing_entry *route,
 QLIST_INSERT_HEAD(&msi_route_list, entry, list);
 
 trace_kvm_x86_add_msi_route(route->gsi);
+
+if (!notify_list_inited) {
+/* For the first time we do add route, add ourselves into
+ * IOMMU's IEC notify list if needed. */
+X86IOMMUState *iommu = x86_iommu_get_default();
+if (iommu) {
+x86_iommu_iec_register_notifier(iommu,
+kvm_update_msi_routes_all,
+NULL);
+}
+notify_list_inited = true;
+}
 return 0;
 }
 
diff --git a/trace-events b/trace-events
index 6a423fd..8848fe2 100644
--- a/trace-events
+++ b/trace-events
@@ -1951,3 +1951,4 @@ gic_acknowledge_irq(int cpu, int irq) "cpu %d 
acknowledged irq %d"
 kvm_x86_fixup_msi_error(uint32_t gsi) "VT-d failed to remap interrupt for GSI 
%" PRIu32
 kvm_x86_add_msi_route(int virq) "Adding route entry for virq %d"
 kvm_x86_remove_msi_route(int virq) "Removing route entry for virq %d"
+kvm_x86_update_msi_routes(int num) "Updated %d MSI routes"
-- 
2.4.11




[Qemu-devel] [PATCH v8 17/25] x86-iommu: introduce IEC notifiers

2016-05-30 Thread Peter Xu
This patch introduces the x86 IOMMU IEC (Interrupt Entry Cache)
invalidation notifier list. When the vIOMMU receives an IEC
invalidation request, all the registered units will be notified with
the specific invalidation request.

Intel IOMMU is the first provider that generates such an event.

Signed-off-by: Peter Xu 
---
 hw/i386/intel_iommu.c  | 36 +---
 hw/i386/intel_iommu_internal.h | 24 
 hw/i386/x86-iommu.c| 23 +++
 include/hw/i386/x86-iommu.h| 40 
 4 files changed, 112 insertions(+), 11 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a6bfd66..5be9010 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -901,6 +901,12 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
 (s->root_extended ? "(extended)" : ""));
 }
 
+static void vtd_iec_notify_all(IntelIOMMUState *s, bool global,
+   uint32_t index, uint32_t mask)
+{
+x86_iommu_iec_notify_all(X86_IOMMU_DEVICE(s), global, index, mask);
+}
+
 static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
 {
 uint64_t value = 0;
@@ -908,7 +914,8 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState 
*s)
 s->intr_size = 1UL << ((value & VTD_IRTA_SIZE_MASK) + 1);
 s->intr_root = value & VTD_IRTA_ADDR_MASK;
 
-/* TODO: invalidate interrupt entry cache */
+/* Notify global invalidation */
+vtd_iec_notify_all(s, true, 0, 0);
 
 VTD_DPRINTF(CSR, "int remap table addr 0x%"PRIx64 " size %"PRIu32,
 s->intr_root, s->intr_size);
@@ -1410,6 +1417,21 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, 
VTDInvDesc *inv_desc)
 return true;
 }
 
+static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
+ VTDInvDesc *inv_desc)
+{
+VTD_DPRINTF(INV, "inv ir glob %d index %d mask %d",
+inv_desc->iec.granularity,
+inv_desc->iec.index,
+inv_desc->iec.index_mask);
+
+vtd_iec_notify_all(s, inv_desc->iec.granularity,
+   inv_desc->iec.index,
+   inv_desc->iec.index_mask);
+
+return true;
+}
+
 static bool vtd_process_inv_desc(IntelIOMMUState *s)
 {
 VTDInvDesc inv_desc;
@@ -1450,12 +1472,12 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
 break;
 
 case VTD_INV_DESC_IEC:
-VTD_DPRINTF(INV, "Interrupt Entry Cache Invalidation "
-"not implemented yet");
-/*
- * Since currently we do not cache interrupt entries, we can
- * just mark this descriptor as "good" and move on.
- */
+VTD_DPRINTF(INV, "Invalidation Interrupt Entry Cache "
+"Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
+inv_desc.hi, inv_desc.lo);
+if (!vtd_process_inv_iec_desc(s, &inv_desc)) {
+return false;
+}
 break;
 
 default:
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e1a08cb..10c20fe 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -296,12 +296,28 @@ typedef enum VTDFaultReason {
 
 #define VTD_CONTEXT_CACHE_GEN_MAX   0xffffffffUL
 
+/* Interrupt Entry Cache Invalidation Descriptor: VT-d 6.5.2.7. */
+struct VTDInvDescIEC {
+uint32_t type:4;/* Should always be 0x4 */
+uint32_t granularity:1; /* If set, it's global IR invalidation */
+uint32_t resved_1:22;
+uint32_t index_mask:5;  /* 2^N for continuous int invalidation */
+uint32_t index:16;  /* Start index to invalidate */
+uint32_t reserved_2:16;
+};
+typedef struct VTDInvDescIEC VTDInvDescIEC;
+
 /* Queued Invalidation Descriptor */
-struct VTDInvDesc {
-uint64_t lo;
-uint64_t hi;
+union VTDInvDesc {
+struct {
+uint64_t lo;
+uint64_t hi;
+};
+union {
+VTDInvDescIEC iec;
+};
 };
-typedef struct VTDInvDesc VTDInvDesc;
+typedef union VTDInvDesc VTDInvDesc;
 
 /* Masks for struct VTDInvDesc */
 #define VTD_INV_DESC_TYPE   0xf
diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
index 2d6d221..0b14b09 100644
--- a/hw/i386/x86-iommu.c
+++ b/hw/i386/x86-iommu.c
@@ -22,6 +22,27 @@
 #include "hw/boards.h"
 #include "hw/i386/x86-iommu.h"
 
+void x86_iommu_iec_register_notifier(X86IOMMUState *iommu,
+ iec_notify_fn fn, void *data)
+{
+IEC_Notifier *notifier = g_new0(IEC_Notifier, 1);
+notifier->iec_notify = fn;
+notifier->private = data;
+QLIST_INSERT_HEAD(&iommu->iec_notifiers, notifier, list);
+}
+
+void x86_iommu_iec_notify_all(X86IOMMUState *iommu, bool global,
+  uint32_t index, uint32_t mask)
+{
+IEC_Notifier *notifier;
+QLIST_FOREACH(notifier, &iommu->iec_notifiers, list) {
+if (notifier->iec_notify) {
+ 
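
The register/notify pair introduced in x86-iommu.c above follows the
usual callback-list pattern. Below is a self-contained sketch of that
pattern only: the Sketch* types and functions are hypothetical, a
hand-rolled list replaces QLIST, and the counting callback exists
purely so the behaviour can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef void (*sketch_iec_fn)(void *opaque, bool global,
                              uint32_t index, uint32_t mask);

typedef struct SketchNotifier {
    sketch_iec_fn fn;
    void *opaque;
    struct SketchNotifier *next;
} SketchNotifier;

typedef struct {
    SketchNotifier *notifiers; /* registered IEC consumers */
} SketchIommu;

/* Add a consumer; returns 0 on success, -1 on allocation failure. */
static int sketch_register_notifier(SketchIommu *iommu,
                                    sketch_iec_fn fn, void *opaque)
{
    SketchNotifier *n = calloc(1, sizeof(*n));

    if (!n) {
        return -1;
    }
    n->fn = fn;
    n->opaque = opaque;
    n->next = iommu->notifiers;
    iommu->notifiers = n;
    return 0;
}

/* Forward one invalidation request to every registered consumer. */
static void sketch_notify_all(SketchIommu *iommu, bool global,
                              uint32_t index, uint32_t mask)
{
    for (SketchNotifier *n = iommu->notifiers; n; n = n->next) {
        if (n->fn) {
            n->fn(n->opaque, global, index, mask);
        }
    }
}

/* Example consumer: counts how many invalidations it has seen. */
static void sketch_count_cb(void *opaque, bool global,
                            uint32_t index, uint32_t mask)
{
    (void)global;
    (void)index;
    (void)mask;
    (*(unsigned *)opaque)++;
}
```

A consumer that always refreshes everything, like the counting
callback here, can ignore the global/index/mask arguments; they are
passed along so that a more precise consumer could act on them.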

[Qemu-devel] [PATCH v8 18/25] ioapic: register IOMMU IEC notifier for ioapic

2016-05-30 Thread Peter Xu
Make the IOAPIC the first consumer of x86 IOMMU IEC invalidation
notifiers. This is only used for the split irqchip case: when the
vIOMMU receives IR invalidation requests, the IOAPIC will be notified
to update kernel irq routes. For simplicity, we just update all IOAPIC
routes, even if the invalidated entries are not IOAPIC ones.

Signed-off-by: Peter Xu 
---
 hw/intc/ioapic.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
index c4469e4..4823211 100644
--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -31,6 +31,7 @@
 #include "sysemu/kvm.h"
 #include "target-i386/cpu.h"
 #include "hw/i386/apic-msidef.h"
+#include "hw/i386/x86-iommu.h"
 
 //#define DEBUG_IOAPIC
 
@@ -198,6 +199,14 @@ static void ioapic_update_kvm_routes(IOAPICCommonState *s)
 #endif
 }
 
+static void ioapic_iec_notifier(void *private, bool global,
+uint32_t index, uint32_t mask)
+{
+IOAPICCommonState *s = (IOAPICCommonState *)private;
+/* For simplicity, we just update all the routes */
+ioapic_update_kvm_routes(s);
+}
+
 void ioapic_eoi_broadcast(int vector)
 {
 IOAPICCommonState *s;
@@ -364,6 +373,18 @@ static void ioapic_realize(DeviceState *dev, Error **errp)
 qdev_init_gpio_in(dev, ioapic_set_irq, IOAPIC_NUM_PINS);
 
 ioapics[ioapic_no] = s;
+
+#ifdef CONFIG_KVM
+if (kvm_irqchip_is_split()) {
+X86IOMMUState *iommu = x86_iommu_get_default();
+if (iommu) {
+/* Register this IOAPIC with IOMMU IEC notifier, so that
+ * when there are IR invalidates, we can be notified to
+ * update kernel IR cache. */
+x86_iommu_iec_register_notifier(iommu, ioapic_iec_notifier, s);
+}
+}
+#endif
 }
 
 static void ioapic_class_init(ObjectClass *klass, void *data)
-- 
2.4.11




[Qemu-devel] [PATCH v8 20/25] intel_iommu: add SID validation for IR

2016-05-30 Thread Peter Xu
This patch enables SID validation. Invalid interrupts will be dropped.
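
The validation rule can be sketched in isolation. This is a minimal sketch, assuming SVT encodings of 0 (no check), 1 (compare the SID under the SQ mask) and 2 (bus range check); the toy_* names are hypothetical, and only the mask table and the bus-range logic follow the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encodings of the IRTE SVT field (hypothetical names). */
enum toy_svt { TOY_SVT_NONE = 0, TOY_SVT_ALL = 1, TOY_SVT_BUS = 2 };

/* One mask per SQ value: which requester-id bits must match. */
static const uint16_t toy_sq_mask[4] = { 0xffff, 0xfffb, 0xfff9, 0xfff8 };

/* Return true if a request from req_sid may use an IRTE whose
 * source-id field is irte_sid. */
bool toy_sid_valid(unsigned svt, unsigned sq,
                   uint16_t irte_sid, uint16_t req_sid)
{
    switch (svt) {
    case TOY_SVT_NONE:
        return true;
    case TOY_SVT_ALL: {
        uint16_t mask;

        assert(sq < 4);
        mask = toy_sq_mask[sq];
        return (irte_sid & mask) == (req_sid & mask);
    }
    case TOY_SVT_BUS: {
        /* Upper byte is the highest allowed bus, lower byte the lowest. */
        uint8_t bus_max = (uint8_t)(irte_sid >> 8);
        uint8_t bus_min = (uint8_t)(irte_sid & 0xff);
        uint8_t bus = (uint8_t)(req_sid >> 8);

        return bus >= bus_min && bus <= bus_max;
    }
    default:
        /* Reserved SVT value: treat as a verification failure. */
        return false;
    }
}
```

For example, with SQ = 1 the mask 0xfffb ignores bit 2 of the requester ID, so the two IDs may differ only in that bit.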

Signed-off-by: Peter Xu 
---
 hw/i386/intel_iommu.c | 69 ---
 include/hw/i386/intel_iommu.h | 17 +++
 2 files changed, 75 insertions(+), 11 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 22c82fc..c4ea0c3 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1993,11 +1993,15 @@ static Property vtd_properties[] = {
 DEFINE_PROP_END_OF_LIST(),
 };
 
+uint16_t vtd_svt_mask[VTD_SQ_MAX] = {0xffff, 0xfffb, 0xfff9, 0xfff8};
+
 /* Read IRTE entry with specific index */
 static int vtd_irte_get(IntelIOMMUState *iommu, uint16_t index,
-VTD_IRTE *entry)
+VTD_IRTE *entry, uint16_t sid)
 {
 dma_addr_t addr = 0x00;
+uint16_t mask, source_id;
+uint8_t bus, bus_max, bus_min;
 
 addr = iommu->intr_root + index * sizeof(*entry);
 if (dma_memory_read(&address_space_memory, addr, entry,
@@ -2024,23 +2028,58 @@ static int vtd_irte_get(IntelIOMMUState *iommu, uint16_t index,
 return -VTD_FR_IR_IRTE_RSVD;
 }
 
-/*
- * TODO: Check Source-ID corresponds to SVT (Source Validation
- * Type) bits
- */
+if (sid != X86_IOMMU_SID_INVALID) {
+/* Validate IRTE SID */
+source_id = le32_to_cpu(entry->source_id);
+switch (entry->sid_vtype) {
+case VTD_SVT_NONE:
+VTD_DPRINTF(IR, "No SID validation for IRTE index %d", index);
+break;
+
+case VTD_SVT_ALL:
+mask = vtd_svt_mask[entry->sid_q];
+if ((source_id & mask) != (sid & mask)) {
+VTD_DPRINTF(GENERAL, "SID validation for IRTE index "
+"%d failed (reqid 0x%04x sid 0x%04x)", index,
+sid, source_id);
+return -VTD_FR_IR_SID_ERR;
+}
+break;
+
+case VTD_SVT_BUS:
+bus_max = source_id >> 8;
+bus_min = source_id & 0xff;
+bus = sid >> 8;
+if (bus > bus_max || bus < bus_min) {
+VTD_DPRINTF(GENERAL, "SID validation for IRTE index %d "
+"failed (bus %d outside %d-%d)", index, bus,
+bus_min, bus_max);
+return -VTD_FR_IR_SID_ERR;
+}
+break;
+
+default:
+VTD_DPRINTF(GENERAL, "Invalid SVT bits (0x%x) in IRTE index "
+"%d", entry->sid_vtype, index);
+/* Take this as verification failure. */
+return -VTD_FR_IR_SID_ERR;
+break;
+}
+}
 
 return 0;
 }
 
 /* Fetch IRQ information of specific IR index */
-static int vtd_remap_irq_get(IntelIOMMUState *iommu, uint16_t index, VTDIrq *irq)
+static int vtd_remap_irq_get(IntelIOMMUState *iommu, uint16_t index,
+ VTDIrq *irq, uint16_t sid)
 {
 VTD_IRTE irte;
 int ret = 0;
 
 bzero(&irte, sizeof(irte));
 
-ret = vtd_irte_get(iommu, index, &irte);
+ret = vtd_irte_get(iommu, index, &irte, sid);
 if (ret) {
 return ret;
 }
@@ -2092,7 +2131,8 @@ static void vtd_generate_msi_message(VTDIrq *irq, MSIMessage *msg_out)
 /* Interrupt remapping for MSI/MSI-X entry */
 static int vtd_interrupt_remap_msi(IntelIOMMUState *iommu,
MSIMessage *origin,
-   MSIMessage *translated)
+   MSIMessage *translated,
+   uint16_t sid)
 {
 int ret = 0;
 VTD_IR_MSIAddress addr;
@@ -2127,7 +2167,7 @@ static int vtd_interrupt_remap_msi(IntelIOMMUState *iommu,
 
 index = addr.index_h << 15 | le16_to_cpu(addr.index_l);
 
-ret = vtd_remap_irq_get(iommu, index, &irq);
+ret = vtd_remap_irq_get(iommu, index, &irq, sid);
 if (ret) {
 return ret;
 }
@@ -2174,7 +2214,8 @@ do_not_translate:
 static int vtd_int_remap(X86IOMMUState *iommu, MSIMessage *src,
  MSIMessage *dst, uint16_t sid)
 {
-return vtd_interrupt_remap_msi(INTEL_IOMMU_DEVICE(iommu), src, dst);
+return vtd_interrupt_remap_msi(INTEL_IOMMU_DEVICE(iommu),
+   src, dst, sid);
 }
 
 static MemTxResult vtd_mem_ir_read(void *opaque, hwaddr addr,
@@ -2200,11 +2241,17 @@ static MemTxResult vtd_mem_ir_write(void *opaque, hwaddr addr,
 {
 int ret = 0;
 MSIMessage from = {0}, to = {0};
+uint16_t sid = X86_IOMMU_SID_INVALID;
 
 from.address = (uint64_t) addr + VTD_INTERRUPT_ADDR_FIRST;
 from.data = (uint32_t) value;
 
-ret = vtd_interrupt_remap_msi(opaque, &from, &to);
+if (!attrs.unspecified) {
+/* We have explicit Source ID */
+sid = attrs.requester_id;
+}
+
+ret = vtd_interrupt_remap_msi(opaque, &from, &to, sid);
 if (ret) {
 /* TODO: report error */
 VTD_DPRINTF(GENER

[Qemu-devel] [PATCH v8 21/25] kvm-irqchip: simplify kvm_irqchip_add_msi_route

2016-05-30 Thread Peter Xu
Change the original MSIMessage parameter of kvm_irqchip_add_msi_route
into the vector number. The vector index carries more information than the
MSIMessage, since we can easily retrieve the MSIMessage from the vector.
This avoids fetching the MSIMessage every time before adding MSI routes.

Meanwhile, the vector info will be used in the coming patches to further
enable gsi route update notifications.
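
The shape of the API change can be shown with a toy model. This is a sketch under stated assumptions, not QEMU code: ToyPCIDevice, ToyRoute and the toy_* functions are hypothetical, and the lookup helper plays the role of msi_get_message()/msix_get_message().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_VECTORS 4

typedef struct ToyMSIMessage {
    uint64_t address;
    uint32_t data;
} ToyMSIMessage;

typedef struct ToyPCIDevice {
    ToyMSIMessage table[TOY_NR_VECTORS];    /* one message per vector */
} ToyPCIDevice;

typedef struct ToyRoute {
    int vector;             /* remembered for later route updates */
    ToyMSIMessage msg;
} ToyRoute;

/* Hypothetical counterpart of msi_get_message(). */
ToyMSIMessage toy_get_message(const ToyPCIDevice *dev, int vector)
{
    assert(dev != NULL && vector >= 0 && vector < TOY_NR_VECTORS);
    return dev->table[vector];
}

/* New-style call: the caller passes only the vector; the callee looks
 * up the message itself. Returns 0 on success, -1 on bad arguments. */
int toy_add_msi_route(ToyRoute *route, int vector, const ToyPCIDevice *dev)
{
    if (route == NULL || dev == NULL ||
        vector < 0 || vector >= TOY_NR_VECTORS) {
        return -1;
    }
    route->vector = vector;
    route->msg = toy_get_message(dev, vector);
    return 0;
}
```

Because the route now records the vector, the message can be fetched again later without the caller keeping a copy around.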

Signed-off-by: Peter Xu 
---
 hw/i386/kvm/pci-assign.c |  8 ++--
 hw/misc/ivshmem.c|  3 +--
 hw/vfio/pci.c| 11 +--
 hw/virtio/virtio-pci.c   |  9 +++--
 include/sysemu/kvm.h | 13 -
 kvm-all.c| 18 --
 kvm-stub.c   |  2 +-
 target-i386/kvm.c|  3 +--
 8 files changed, 41 insertions(+), 26 deletions(-)

diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
index bceed09..5f7d5c6 100644
--- a/hw/i386/kvm/pci-assign.c
+++ b/hw/i386/kvm/pci-assign.c
@@ -975,10 +975,9 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev)
 }
 
 if (ctrl_byte & PCI_MSI_FLAGS_ENABLE) {
-MSIMessage msg = msi_get_message(pci_dev, 0);
 int virq;
 
-virq = kvm_irqchip_add_msi_route(kvm_state, msg, pci_dev);
+virq = kvm_irqchip_add_msi_route(kvm_state, 0, pci_dev);
 if (virq < 0) {
 perror("assigned_dev_update_msi: kvm_irqchip_add_msi_route");
 return;
@@ -1043,7 +1042,6 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
 uint16_t entries_nr = 0;
 int i, r = 0;
 MSIXTableEntry *entry = adev->msix_table;
-MSIMessage msg;
 
 /* Get the usable entry number for allocating */
 for (i = 0; i < adev->msix_max; i++, entry++) {
@@ -1080,9 +1078,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
 continue;
 }
 
-msg.address = entry->addr_lo | ((uint64_t)entry->addr_hi << 32);
-msg.data = entry->data;
-r = kvm_irqchip_add_msi_route(kvm_state, msg, pci_dev);
+r = kvm_irqchip_add_msi_route(kvm_state, i, pci_dev);
 if (r < 0) {
 return r;
 }
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index e40f23b..6909346 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -444,13 +444,12 @@ static void ivshmem_add_kvm_msi_virq(IVShmemState *s, int vector,
  Error **errp)
 {
 PCIDevice *pdev = PCI_DEVICE(s);
-MSIMessage msg = msix_get_message(pdev, vector);
 int ret;
 
 IVSHMEM_DPRINTF("ivshmem_add_kvm_msi_virq vector:%d\n", vector);
 assert(!s->msi_vectors[vector].pdev);
 
-ret = kvm_irqchip_add_msi_route(kvm_state, msg, pdev);
+ret = kvm_irqchip_add_msi_route(kvm_state, vector, pdev);
 if (ret < 0) {
 error_setg(errp, "kvm_irqchip_add_msi_route failed");
 return;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index deab0c6..06ad15e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -417,11 +417,11 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
 }
 
 static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
-  MSIMessage *msg, bool msix)
+  int vector_n, bool msix)
 {
 int virq;
 
-if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi) || !msg) {
+if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
 return;
 }
 
@@ -429,7 +429,7 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
 return;
 }
 
-virq = kvm_irqchip_add_msi_route(kvm_state, *msg, &vdev->pdev);
+virq = kvm_irqchip_add_msi_route(kvm_state, vector_n, &vdev->pdev);
 if (virq < 0) {
 event_notifier_cleanup(&vector->kvm_interrupt);
 return;
@@ -495,7 +495,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
 vfio_update_kvm_msi_virq(vector, *msg, pdev);
 }
 } else {
-vfio_add_kvm_msi_virq(vdev, vector, msg, true);
+vfio_add_kvm_msi_virq(vdev, vector, nr, true);
 }
 
 /*
@@ -639,7 +639,6 @@ retry:
 
 for (i = 0; i < vdev->nr_vectors; i++) {
 VFIOMSIVector *vector = &vdev->msi_vectors[i];
-MSIMessage msg = msi_get_message(&vdev->pdev, i);
 
 vector->vdev = vdev;
 vector->virq = -1;
@@ -656,7 +655,7 @@ retry:
  * Attempt to enable route through KVM irqchip,
  * default to userspace handling if unavailable.
  */
-vfio_add_kvm_msi_virq(vdev, vector, &msg, false);
+vfio_add_kvm_msi_virq(vdev, vector, i, false);
 }
 
 /* Set interrupt type prior to possible interrupts */
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index bfedbbf..df85f28 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -727,14 +727,13 @@ static uint32_t virtio_read_config(PCIDevice *pci_dev,
 
 static int kvm_virtio_pci_vq_vector_use(VirtIOPCIPr

[Qemu-devel] [PATCH v8 15/25] intel_iommu: add support for split irqchip

2016-05-30 Thread Peter Xu
In split irqchip mode, IOAPIC works in user space and only updates the
kernel irq routes when an entry changes. When IR is enabled, we directly
update the kernel with translated messages, so the kernel routes act as
a cache for the remapping entries.

Since KVM irqfd uses kernel gsi routes to deliver interrupts, supporting
split irqchip means we support irqfd as well. Also, since the kernel gsi
routes cache translated interrupts, irqfd delivery does not suffer any
performance impact from IR.

And, since irqfd is supported, vhost devices can now work seamlessly
with IR. Logically this should cover both the vhost-net and vhost-user
cases.
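
The route fixup amounts to: rebuild the 64-bit MSI address from its two halves, ask the IOMMU to translate, and write the result back. The following is a rough sketch of that step, assuming a 32-bit split of the address; ToyRouteEntry, the remap callback type and the two example translators are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyRouteEntry {
    uint32_t address_lo;
    uint32_t address_hi;
    uint32_t data;
} ToyRouteEntry;

typedef struct ToyMSIMessage {
    uint64_t address;
    uint32_t data;
} ToyMSIMessage;

/* Hypothetical translation hook: returns 0 on success, non-zero on error. */
typedef int (*toy_remap_fn)(const ToyMSIMessage *src, ToyMSIMessage *dst);

/* Rewrite a route with the translated message before it is cached.
 * A NULL hook models "no IOMMU present": the route is left untouched. */
int toy_fixup_msi_route(ToyRouteEntry *route, toy_remap_fn remap)
{
    ToyMSIMessage src, dst;

    if (remap == NULL) {
        return 0;
    }
    src.address = ((uint64_t)route->address_hi << 32) | route->address_lo;
    src.data = route->data;
    if (remap(&src, &dst) != 0) {
        return 1;               /* translation failed, route unchanged */
    }
    route->address_hi = (uint32_t)(dst.address >> 32);
    route->address_lo = (uint32_t)(dst.address & 0xffffffffu);
    route->data = dst.data;
    return 0;
}

/* Example translators, for illustration only. */
int toy_remap_fixed(const ToyMSIMessage *src, ToyMSIMessage *dst)
{
    (void)src;
    dst->address = 0xfee01000u;
    dst->data = 0x41;
    return 0;
}

int toy_remap_fail(const ToyMSIMessage *src, ToyMSIMessage *dst)
{
    (void)src;
    (void)dst;
    return -1;
}
```

Since the translated message is what ends up in the cached route, delivery through that route needs no further translation; the cost is that routes must be refreshed when the remapping table changes.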

Signed-off-by: Peter Xu 
---
 hw/i386/intel_iommu.c |  7 +++
 include/hw/i386/intel_iommu.h |  1 +
 include/hw/i386/x86-iommu.h   |  4 
 target-i386/kvm.c | 27 +++
 trace-events  |  3 +++
 5 files changed, 42 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e832780..a6bfd66 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2146,6 +2146,12 @@ do_not_translate:
 return 0;
 }
 
+static int vtd_int_remap(X86IOMMUState *iommu, MSIMessage *src,
+ MSIMessage *dst, uint16_t sid)
+{
+return vtd_interrupt_remap_msi(INTEL_IOMMU_DEVICE(iommu), src, dst);
+}
+
 static MemTxResult vtd_mem_ir_read(void *opaque, hwaddr addr,
uint64_t *data, unsigned size,
MemTxAttrs attrs)
@@ -2374,6 +2380,7 @@ static void vtd_class_init(ObjectClass *klass, void *data)
 dc->props = vtd_properties;
 x86_class->realize = vtd_realize;
 x86_class->find_add_as = vtd_find_add_as;
+x86_class->int_remap = vtd_int_remap;
 }
 
 static const TypeInfo vtd_info = {
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index b3f17d7..3bca390 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -26,6 +26,7 @@
 #include "hw/i386/x86-iommu.h"
 #include "hw/i386/ioapic.h"
 #include "hw/pci/msi.h"
+#include "hw/sysbus.h"
 
 #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
 #define INTEL_IOMMU_DEVICE(obj) \
diff --git a/include/hw/i386/x86-iommu.h b/include/hw/i386/x86-iommu.h
index 2070cd1..1eb62cf 100644
--- a/include/hw/i386/x86-iommu.h
+++ b/include/hw/i386/x86-iommu.h
@@ -22,6 +22,7 @@
 
 #include "hw/sysbus.h"
 #include "exec/memory.h"
+#include "hw/pci/pci.h"
 
 #define  TYPE_X86_IOMMU_DEVICE  ("x86-iommu")
 #define  X86_IOMMU_DEVICE(obj) \
@@ -43,6 +44,9 @@ struct X86IOMMUClass {
 DeviceRealize realize;
 /* Find/Add IOMMU address space for specific PCI device */
 AddressSpace *(*find_add_as)(X86IOMMUState *s, PCIBus *bus, int devfn);
+/* MSI-based interrupt remapping */
+int (*int_remap)(X86IOMMUState *iommu, MSIMessage *src,
+ MSIMessage *dst, uint16_t sid);
 };
 
 struct X86IOMMUState {
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 7b3667a..ef10ccb 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -36,6 +36,7 @@
 #include "hw/i386/apic.h"
 #include "hw/i386/apic_internal.h"
 #include "hw/i386/apic-msidef.h"
+#include "hw/i386/intel_iommu.h"
 
 #include "exec/ioport.h"
 #include "standard-headers/asm-x86/hyperv.h"
@@ -43,6 +44,7 @@
 #include "hw/pci/msi.h"
 #include "migration/migration.h"
 #include "exec/memattrs.h"
+#include "trace.h"
 
 //#define DEBUG_KVM
 
@@ -3318,6 +3320,31 @@ int kvm_device_msix_deassign(KVMState *s, uint32_t dev_id)
 int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
  uint64_t address, uint32_t data, PCIDevice *dev)
 {
+X86IOMMUState *iommu = x86_iommu_get_default();
+
+if (iommu) {
+int ret;
+MSIMessage src, dst;
+X86IOMMUClass *class = X86_IOMMU_GET_CLASS(iommu);
+
+src.address = route->u.msi.address_hi;
+src.address <<= VTD_MSI_ADDR_HI_SHIFT;
+src.address |= route->u.msi.address_lo;
+src.data = route->u.msi.data;
+
+ret = class->int_remap(iommu, &src, &dst, dev ? \
+   pci_requester_id(dev) : \
+   X86_IOMMU_SID_INVALID);
+if (ret) {
+trace_kvm_x86_fixup_msi_error(route->gsi);
+return 1;
+}
+
+route->u.msi.address_hi = dst.address >> VTD_MSI_ADDR_HI_SHIFT;
+route->u.msi.address_lo = dst.address & VTD_MSI_ADDR_LO_MASK;
+route->u.msi.data = dst.data;
+}
+
 return 0;
 }
 
diff --git a/trace-events b/trace-events
index b27d1da..54c0d41 100644
--- a/trace-events
+++ b/trace-events
@@ -1946,3 +1946,6 @@ gic_set_irq(int irq, int level, int cpumask, int target) "irq %d level %d cpumas
 gic_update_bestirq(int cpu, int irq, int prio, int priority_mask, int running_priority) "cpu %d irq %d priority %d cpu priority mask %d cpu running priority %d"
 gic_update_set_irq(int cpu, const char *name,

[Qemu-devel] [PATCH v8 13/25] q35: ioapic: add support for emulated IOAPIC IR

2016-05-30 Thread Peter Xu
This patch translates all IOAPIC interrupts into MSI ones. One pseudo
ioapic address space is added to transfer the MSI message. By default,
it will be system memory address space. When IR is enabled, it will be
IOMMU address space.
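
The IOAPIC-to-MSI conversion is plain bit shuffling. Here is a minimal sketch of it; the TOY_* constants are assumptions based on the usual IOAPIC redirection entry and x86 MSI layouts (the truncated header hunk below does not show the real values), and the ExtINT special case is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed IOAPIC redirection table entry layout. */
#define TOY_LVT_VECTOR_MASK          0xffu
#define TOY_LVT_DELIV_MODE_SHIFT     8
#define TOY_LVT_DELIV_MODE_MASK      0x7u
#define TOY_LVT_DEST_MODE_SHIFT      11
#define TOY_LVT_TRIGGER_SHIFT        15
#define TOY_LVT_MASKED_SHIFT         16
#define TOY_LVT_DEST_IDX_SHIFT       48

/* Assumed MSI address/data layout. */
#define TOY_APIC_DEFAULT_ADDRESS     0xfee00000u
#define TOY_MSI_ADDR_DEST_MODE_SHIFT 2
#define TOY_MSI_ADDR_DEST_IDX_SHIFT  4
#define TOY_MSI_DATA_VECTOR_SHIFT    0
#define TOY_MSI_DATA_DELIV_SHIFT     8
#define TOY_MSI_DATA_TRIGGER_SHIFT   15

typedef struct ToyMSI {
    uint32_t addr;
    uint32_t data;
    int masked;         /* non-zero: entry is masked, do not deliver */
} ToyMSI;

/* Turn one 64-bit redirection entry into an MSI address/data pair. */
ToyMSI toy_ioapic_entry_to_msi(uint64_t entry)
{
    ToyMSI msi;
    uint32_t vector = (uint32_t)(entry & TOY_LVT_VECTOR_MASK);
    uint32_t deliv = (uint32_t)(entry >> TOY_LVT_DELIV_MODE_SHIFT) &
                     TOY_LVT_DELIV_MODE_MASK;
    uint32_t dest_mode = (uint32_t)(entry >> TOY_LVT_DEST_MODE_SHIFT) & 1;
    uint32_t trig = (uint32_t)(entry >> TOY_LVT_TRIGGER_SHIFT) & 1;
    /* Upper 16 bits: dest_id plus reserved bits, or the IR index/format. */
    uint32_t dest_idx = (uint32_t)(entry >> TOY_LVT_DEST_IDX_SHIFT) & 0xffff;

    msi.masked = (int)((entry >> TOY_LVT_MASKED_SHIFT) & 1);
    msi.addr = TOY_APIC_DEFAULT_ADDRESS |
               (dest_idx << TOY_MSI_ADDR_DEST_IDX_SHIFT) |
               (dest_mode << TOY_MSI_ADDR_DEST_MODE_SHIFT);
    msi.data = (vector << TOY_MSI_DATA_VECTOR_SHIFT) |
               (trig << TOY_MSI_DATA_TRIGGER_SHIFT) |
               (deliv << TOY_MSI_DATA_DELIV_SHIFT);
    return msi;
}
```

Under these assumed shifts, an 8-bit destination ID in entry bits 56-63 lands in address bits 12-19, which is where a compatibility-format MSI carries it.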

Currently, only emulated IOAPIC is supported.

Idea suggested by Jan Kiszka and Rita Sinha in the following patch:

https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg01933.html

Signed-off-by: Peter Xu 
---
 hw/i386/pc.c  |  3 +++
 hw/intc/ioapic.c  | 28 
 hw/pci-host/q35.c |  4 
 include/hw/i386/apic-msidef.h |  1 +
 include/hw/i386/ioapic_internal.h |  1 +
 include/hw/i386/pc.h  |  4 
 6 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index e29ccc8..8d523f8 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1396,6 +1396,9 @@ void pc_memory_init(PCMachineState *pcms,
 rom_add_option(option_rom[i].name, option_rom[i].bootindex);
 }
 pcms->fw_cfg = fw_cfg;
+
+/* Init default IOAPIC address space */
+pcms->ioapic_as = &address_space_memory;
 }
 
 qemu_irq pc_allocate_cpu_irq(void)
diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
index 273bb08..36dd42a 100644
--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -29,6 +29,8 @@
 #include "hw/i386/ioapic_internal.h"
 #include "include/hw/pci/msi.h"
 #include "sysemu/kvm.h"
+#include "target-i386/cpu.h"
+#include "hw/i386/apic-msidef.h"
 
 //#define DEBUG_IOAPIC
 
@@ -50,13 +52,15 @@ extern int ioapic_no;
 
 static void ioapic_service(IOAPICCommonState *s)
 {
+AddressSpace *ioapic_as = PC_MACHINE(qdev_get_machine())->ioapic_as;
+uint32_t addr, data;
 uint8_t i;
 uint8_t trig_mode;
 uint8_t vector;
 uint8_t delivery_mode;
 uint32_t mask;
 uint64_t entry;
-uint8_t dest;
+uint16_t dest_idx;
 uint8_t dest_mode;
 
 for (i = 0; i < IOAPIC_NUM_PINS; i++) {
@@ -67,7 +71,14 @@ static void ioapic_service(IOAPICCommonState *s)
 entry = s->ioredtbl[i];
 if (!(entry & IOAPIC_LVT_MASKED)) {
 trig_mode = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1);
-dest = entry >> IOAPIC_LVT_DEST_SHIFT;
+/*
+ * By default, this would be dest_id[8] +
+ * reserved[8]. When IR is enabled, this would be
+ * interrupt_index[15] + interrupt_format[1]. This
+ * field never means anything, but only used to
+ * generate corresponding MSI.
+ */
+dest_idx = entry >> IOAPIC_LVT_DEST_IDX_SHIFT;
 dest_mode = (entry >> IOAPIC_LVT_DEST_MODE_SHIFT) & 1;
 delivery_mode =
 (entry >> IOAPIC_LVT_DELIV_MODE_SHIFT) & IOAPIC_DM_MASK;
@@ -97,8 +108,17 @@ static void ioapic_service(IOAPICCommonState *s)
 #else
 (void)coalesce;
 #endif
-apic_deliver_irq(dest, dest_mode, delivery_mode, vector,
- trig_mode);
+/* No matter whether IR is enabled, we translate
+ * the IOAPIC message into a MSI one, and its
+ * address space will decide whether we need a
+ * translation. */
+addr = APIC_DEFAULT_ADDRESS | \
+(dest_idx << MSI_ADDR_DEST_IDX_SHIFT) |
+(dest_mode << MSI_ADDR_DEST_MODE_SHIFT);
+data = (vector << MSI_DATA_VECTOR_SHIFT) |
+(trig_mode << MSI_DATA_TRIGGER_SHIFT) |
+(delivery_mode << MSI_DATA_DELIVERY_MODE_SHIFT);
+stl_le_phys(ioapic_as, addr, data);
 }
 }
 }
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 6835da1..f3d47ad 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -435,6 +435,7 @@ static AddressSpace *q35_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 
 static void mch_init_dmar(MCHPCIState *mch)
 {
+PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
 PCIBus *pci_bus = PCI_BUS(qdev_get_parent_bus(DEVICE(mch)));
 
 mch->iommu = INTEL_IOMMU_DEVICE(qdev_create(NULL, TYPE_INTEL_IOMMU_DEVICE));
@@ -444,6 +445,9 @@ static void mch_init_dmar(MCHPCIState *mch)
 sysbus_mmio_map(SYS_BUS_DEVICE(mch->iommu), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
 
 pci_setup_iommu(pci_bus, q35_host_dma_iommu, mch->iommu);
+/* Pseudo address space under root PCI bus. */
+pcms->ioapic_as = q35_host_dma_iommu(pci_bus, mch->iommu,
+ Q35_PSEUDO_DEVFN_IOAPIC);
 }
 
 static void mch_realize(PCIDevice *d, Error **errp)
diff --git a/include/hw/i386/apic-msidef.h b/include/hw/i386/apic-msidef.h
index 6e2eb71..8b4d4cc 100644
--- a/include/hw/i386/apic-msidef.h
+++ b/include/hw/i386/apic-msidef.h
@@ -25,6 +25,7 @@
 #define MSI_ADDR_REDIRECTION_SHIFT  3
 
 #define MSI_ADDR_DEST_ID_SHIFT  12
+#defin

[Qemu-devel] [PATCH v8 14/25] ioapic: introduce ioapic_entry_parse() helper

2016-05-30 Thread Peter Xu
Abstract IOAPIC entry parsing logic into a helper function.

Signed-off-by: Peter Xu 
---
 hw/intc/ioapic.c | 110 +++
 1 file changed, 54 insertions(+), 56 deletions(-)

diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
index 36dd42a..c4469e4 100644
--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -50,18 +50,56 @@ static IOAPICCommonState *ioapics[MAX_IOAPICS];
 /* global variable from ioapic_common.c */
 extern int ioapic_no;
 
+struct ioapic_entry_info {
+/* fields parsed from IOAPIC entries */
+uint8_t masked;
+uint8_t trig_mode;
+uint16_t dest_idx;
+uint8_t dest_mode;
+uint8_t delivery_mode;
+uint8_t vector;
+
+/* MSI message generated from above parsed fields */
+uint32_t addr;
+uint32_t data;
+};
+
+static void ioapic_entry_parse(uint64_t entry, struct ioapic_entry_info *info)
+{
+bzero(info, sizeof(*info));
+info->masked = (entry >> IOAPIC_LVT_MASKED_SHIFT) & 1;
+info->trig_mode = (entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1;
+/*
+ * By default, this would be dest_id[8] + reserved[8]. When IR
+ * is enabled, this would be interrupt_index[15] +
+ * interrupt_format[1]. This field never means anything, but
+ * only used to generate corresponding MSI.
+ */
+info->dest_idx = (entry >> IOAPIC_LVT_DEST_IDX_SHIFT) & 0xffff;
+info->dest_mode = (entry >> IOAPIC_LVT_DEST_MODE_SHIFT) & 1;
+info->delivery_mode = (entry >> IOAPIC_LVT_DELIV_MODE_SHIFT) \
+& IOAPIC_DM_MASK;
+if (info->delivery_mode == IOAPIC_DM_EXTINT) {
+info->vector = pic_read_irq(isa_pic);
+} else {
+info->vector = entry & IOAPIC_VECTOR_MASK;
+}
+
+info->addr = APIC_DEFAULT_ADDRESS | \
+(info->dest_idx << MSI_ADDR_DEST_IDX_SHIFT) | \
+(info->dest_mode << MSI_ADDR_DEST_MODE_SHIFT);
+info->data = (info->vector << MSI_DATA_VECTOR_SHIFT) | \
+(info->trig_mode << MSI_DATA_TRIGGER_SHIFT) | \
+(info->delivery_mode << MSI_DATA_DELIVERY_MODE_SHIFT);
+}
+
 static void ioapic_service(IOAPICCommonState *s)
 {
 AddressSpace *ioapic_as = PC_MACHINE(qdev_get_machine())->ioapic_as;
-uint32_t addr, data;
+struct ioapic_entry_info info;
 uint8_t i;
-uint8_t trig_mode;
-uint8_t vector;
-uint8_t delivery_mode;
 uint32_t mask;
 uint64_t entry;
-uint16_t dest_idx;
-uint8_t dest_mode;
 
 for (i = 0; i < IOAPIC_NUM_PINS; i++) {
 mask = 1 << i;
@@ -69,33 +107,18 @@ static void ioapic_service(IOAPICCommonState *s)
 int coalesce = 0;
 
 entry = s->ioredtbl[i];
-if (!(entry & IOAPIC_LVT_MASKED)) {
-trig_mode = ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1);
-/*
- * By default, this would be dest_id[8] +
- * reserved[8]. When IR is enabled, this would be
- * interrupt_index[15] + interrupt_format[1]. This
- * field never means anything, but only used to
- * generate corresponding MSI.
- */
-dest_idx = entry >> IOAPIC_LVT_DEST_IDX_SHIFT;
-dest_mode = (entry >> IOAPIC_LVT_DEST_MODE_SHIFT) & 1;
-delivery_mode =
-(entry >> IOAPIC_LVT_DELIV_MODE_SHIFT) & IOAPIC_DM_MASK;
-if (trig_mode == IOAPIC_TRIGGER_EDGE) {
+ioapic_entry_parse(entry, &info);
+if (!info.masked) {
+if (info.trig_mode == IOAPIC_TRIGGER_EDGE) {
 s->irr &= ~mask;
 } else {
 coalesce = s->ioredtbl[i] & IOAPIC_LVT_REMOTE_IRR;
 s->ioredtbl[i] |= IOAPIC_LVT_REMOTE_IRR;
 }
-if (delivery_mode == IOAPIC_DM_EXTINT) {
-vector = pic_read_irq(isa_pic);
-} else {
-vector = entry & IOAPIC_VECTOR_MASK;
-}
+
 #ifdef CONFIG_KVM
 if (kvm_irqchip_is_split()) {
-if (trig_mode == IOAPIC_TRIGGER_EDGE) {
+if (info.trig_mode == IOAPIC_TRIGGER_EDGE) {
 kvm_set_irq(kvm_state, i, 1);
 kvm_set_irq(kvm_state, i, 0);
 } else {
@@ -112,13 +135,7 @@ static void ioapic_service(IOAPICCommonState *s)
  * the IOAPIC message into a MSI one, and its
  * address space will decide whether we need a
  * translation. */
-addr = APIC_DEFAULT_ADDRESS | \
-(dest_idx << MSI_ADDR_DEST_IDX_SHIFT) |
-(dest_mode << MSI_ADDR_DEST_MODE_SHIFT);
-data = (vector << MSI_DATA_VECTOR_SHIFT) |
-(trig_mode << MSI_DATA_TRIGGER_SHIFT) |
-(delivery_mode << MSI_DATA_DELIVERY_MODE_SHIFT);
-stl_le_phys(ioapic_as, addr, data);
+stl_le_phys(ioa

[Qemu-devel] [PATCH v8 10/25] x86-iommu: q35: generalize find_add_as()

2016-05-30 Thread Peter Xu
Remove VT-d calls from the common q35 code. Instead, provide a generic
find_add_as() hook for the x86-iommu type.
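
The pattern is ordinary dispatch through a class hook. A minimal sketch of it follows, with plain structs standing in for QOM classes; every toy_* name is hypothetical, and PCIBus is dropped to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DEVFN_MAX 256

typedef struct ToyAddressSpace {
    int devfn;
    bool created;
} ToyAddressSpace;

typedef struct ToyIOMMU ToyIOMMU;

/* Generic "class": vendor code fills in the hook. */
typedef struct ToyIOMMUClass {
    ToyAddressSpace *(*find_add_as)(ToyIOMMU *s, int devfn);
} ToyIOMMUClass;

struct ToyIOMMU {
    const ToyIOMMUClass *klass;
};

/* Vendor-specific state embeds the generic one as its first member. */
typedef struct ToyVendorIOMMU {
    ToyIOMMU parent;
    ToyAddressSpace as[TOY_DEVFN_MAX];
} ToyVendorIOMMU;

/* Vendor hook: find the per-device address space, creating it lazily. */
ToyAddressSpace *toy_vendor_find_add_as(ToyIOMMU *iommu, int devfn)
{
    ToyVendorIOMMU *s = (ToyVendorIOMMU *)iommu;
    ToyAddressSpace *as = &s->as[devfn];

    if (!as->created) {
        as->devfn = devfn;
        as->created = true;
    }
    return as;
}

const ToyIOMMUClass toy_vendor_class = { toy_vendor_find_add_as };

/* Generic host-bridge code: knows nothing about the vendor type. */
ToyAddressSpace *toy_host_dma_iommu(ToyIOMMU *iommu, int devfn)
{
    assert(0 <= devfn && devfn < TOY_DEVFN_MAX);
    return iommu->klass->find_add_as(iommu, devfn);
}
```

The host-bridge function only ever sees the generic type, so another vendor implementation could be plugged in without touching it.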

Signed-off-by: Peter Xu 
---
 hw/i386/intel_iommu.c |  7 +--
 hw/pci-host/q35.c | 10 --
 include/hw/i386/intel_iommu.h |  5 -
 include/hw/i386/x86-iommu.h   |  3 +++
 4 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 0c7b24d..38cecae 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1971,8 +1971,10 @@ static Property vtd_properties[] = {
 };
 
 
-VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
+static AddressSpace *vtd_find_add_as(X86IOMMUState *x86_iommu, PCIBus *bus,
+ int devfn)
 {
+IntelIOMMUState *s = (IntelIOMMUState *)x86_iommu;
 uintptr_t key = (uintptr_t)bus;
 VTDBus *vtd_bus = g_hash_table_lookup(s->vtd_as_by_busptr, &key);
 VTDAddressSpace *vtd_dev_as;
@@ -2000,7 +2002,7 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
 address_space_init(&vtd_dev_as->as,
&vtd_dev_as->iommu, "intel_iommu");
 }
-return vtd_dev_as;
+return &vtd_dev_as->as;
 }
 
 /* Do the initialization. It will also be called when reset, so pay
@@ -2128,6 +2130,7 @@ static void vtd_class_init(ObjectClass *klass, void *data)
 dc->vmsd = &vtd_vmstate;
 dc->props = vtd_properties;
 x86_class->realize = vtd_realize;
+x86_class->find_add_as = vtd_find_add_as;
 }
 
 static const TypeInfo vtd_info = {
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 27ee0c8..6835da1 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -426,13 +426,11 @@ static void mch_reset(DeviceState *qdev)
 
 static AddressSpace *q35_host_dma_iommu(PCIBus *bus, void *opaque, int devfn)
 {
-IntelIOMMUState *s = opaque;
-VTDAddressSpace *vtd_as;
+X86IOMMUState *x86_iommu = opaque;
+X86IOMMUClass *x86_class = X86_IOMMU_GET_CLASS(x86_iommu);
 
 assert(0 <= devfn && devfn <= X86_IOMMU_PCI_DEVFN_MAX);
-
-vtd_as = vtd_find_add_as(s, bus, devfn);
-return &vtd_as->as;
+return x86_class->find_add_as(x86_iommu, bus, devfn);
 }
 
 static void mch_init_dmar(MCHPCIState *mch)
@@ -440,7 +438,7 @@ static void mch_init_dmar(MCHPCIState *mch)
 PCIBus *pci_bus = PCI_BUS(qdev_get_parent_bus(DEVICE(mch)));
 
 mch->iommu = INTEL_IOMMU_DEVICE(qdev_create(NULL, TYPE_INTEL_IOMMU_DEVICE));
-object_property_add_child(OBJECT(mch), "intel-iommu",
+object_property_add_child(OBJECT(mch), TYPE_X86_IOMMU_DEVICE,
   OBJECT(mch->iommu), NULL);
 qdev_init_nofail(DEVICE(mch->iommu));
 sysbus_mmio_map(SYS_BUS_DEVICE(mch->iommu), 0, Q35_HOST_BRIDGE_IOMMU_ADDR);
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 260aa8e..9a898c1 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -206,9 +206,4 @@ struct IntelIOMMUState {
 uint32_t intr_size; /* Number of IR table entries */
 };
 
-/* Find the VTD Address space associated with the given bus pointer,
- * create a new one if none exists
- */
-VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn);
-
 #endif
diff --git a/include/hw/i386/x86-iommu.h b/include/hw/i386/x86-iommu.h
index d6991cb..2070cd1 100644
--- a/include/hw/i386/x86-iommu.h
+++ b/include/hw/i386/x86-iommu.h
@@ -21,6 +21,7 @@
 #define IOMMU_COMMON_H
 
 #include "hw/sysbus.h"
+#include "exec/memory.h"
 
 #define  TYPE_X86_IOMMU_DEVICE  ("x86-iommu")
 #define  X86_IOMMU_DEVICE(obj) \
@@ -40,6 +41,8 @@ struct X86IOMMUClass {
 SysBusDeviceClass parent;
 /* Intel/AMD specific realize() hook */
 DeviceRealize realize;
+/* Find/Add IOMMU address space for specific PCI device */
+AddressSpace *(*find_add_as)(X86IOMMUState *s, PCIBus *bus, int devfn);
 };
 
 struct X86IOMMUState {
-- 
2.4.11




[Qemu-devel] [PATCH v8 12/25] intel_iommu: Add support for PCI MSI remap

2016-05-30 Thread Peter Xu
This patch enables interrupt remapping for PCI devices.

To play the trick, one memory region "iommu_ir" is added as a child region
of the original iommu memory region, covering range 0xfeeXXXXX (which is
the address range for APIC). All the writes to this range will be taken
as MSI, and translation is carried out only when IR is enabled.
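
Decoding a write to that range can be sketched as below. The bit positions are an assumption taken from the VT-d remappable MSI address format (bits 31-20 equal to 0xfee, bits 19-5 index[14:0], bit 4 the format flag, bit 2 index[15]); the toy_* helpers are hypothetical and only the index_h/index_l combination mirrors the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the address falls in the 0xfeeXXXXX interrupt range and the
 * upper 32 bits are zero, as the remapping code requires. */
bool toy_is_interrupt_addr(uint64_t addr)
{
    return (addr >> 32) == 0 && ((uint32_t)addr >> 20) == 0xfee;
}

/* Assumed format flag: 1 = remappable, 0 = compatibility format. */
bool toy_is_remappable(uint32_t addr_lo)
{
    return ((addr_lo >> 4) & 1) != 0;
}

/* Rebuild the 16-bit IRTE index from its two fields. */
uint16_t toy_ir_index(uint32_t addr_lo)
{
    uint16_t index_l = (uint16_t)((addr_lo >> 5) & 0x7fff);
    uint16_t index_h = (uint16_t)((addr_lo >> 2) & 1);

    return (uint16_t)((index_h << 15) | index_l);
}
```

The resulting index selects the IRTE that vtd_irte_get() reads from the interrupt remapping table.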

Idea suggested by Paolo Bonzini.

Signed-off-by: Peter Xu 
---
 hw/i386/intel_iommu.c  | 243 +
 hw/i386/intel_iommu_internal.h |   2 +
 include/hw/i386/intel_iommu.h  |  66 +++
 3 files changed, 311 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 38cecae..e832780 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1970,6 +1970,244 @@ static Property vtd_properties[] = {
 DEFINE_PROP_END_OF_LIST(),
 };
 
+/* Read IRTE entry with specific index */
+static int vtd_irte_get(IntelIOMMUState *iommu, uint16_t index,
+VTD_IRTE *entry)
+{
+dma_addr_t addr = 0x00;
+
+addr = iommu->intr_root + index * sizeof(*entry);
+if (dma_memory_read(&address_space_memory, addr, entry,
+sizeof(*entry))) {
+VTD_DPRINTF(GENERAL, "error: fail to access IR root at 0x%"PRIx64
+" + %"PRIu16, iommu->intr_root, index);
+return -VTD_FR_IR_ROOT_INVAL;
+}
+
+if (!entry->present) {
+VTD_DPRINTF(GENERAL, "error: present flag not set in IRTE"
+" entry index %u value 0x%"PRIx64 " 0x%"PRIx64,
+index, le64_to_cpu(entry->data[1]),
+le64_to_cpu(entry->data[0]));
+return -VTD_FR_IR_ENTRY_P;
+}
+
+if (entry->__reserved_0 || entry->__reserved_1 || \
+entry->__reserved_2) {
+VTD_DPRINTF(GENERAL, "error: IRTE entry index %"PRIu16
+" reserved fields non-zero: 0x%"PRIx64 " 0x%"PRIx64,
+index, le64_to_cpu(entry->data[1]),
+le64_to_cpu(entry->data[0]));
+return -VTD_FR_IR_IRTE_RSVD;
+}
+
+/*
+ * TODO: Check Source-ID corresponds to SVT (Source Validation
+ * Type) bits
+ */
+
+return 0;
+}
+
+/* Fetch IRQ information of specific IR index */
+static int vtd_remap_irq_get(IntelIOMMUState *iommu, uint16_t index, VTDIrq *irq)
+{
+VTD_IRTE irte;
+int ret = 0;
+
+bzero(&irte, sizeof(irte));
+
+ret = vtd_irte_get(iommu, index, &irte);
+if (ret) {
+return ret;
+}
+
+irq->trigger_mode = irte.trigger_mode;
+irq->vector = irte.vector;
+irq->delivery_mode = irte.delivery_mode;
+/* Not support EIM yet: please refer to vt-d 9.10 DST bits */
+#define  VTD_IR_APIC_DEST_MASK (0xff00ULL)
+#define  VTD_IR_APIC_DEST_SHIFT (8)
+irq->dest = (le32_to_cpu(irte.dest_id) & VTD_IR_APIC_DEST_MASK) >> \
+VTD_IR_APIC_DEST_SHIFT;
+irq->dest_mode = irte.dest_mode;
+irq->redir_hint = irte.redir_hint;
+
+VTD_DPRINTF(IR, "remapping interrupt index %d: trig:%u,vec:%u,"
+"deliver:%u,dest:%u,dest_mode:%u", index,
+irq->trigger_mode, irq->vector, irq->delivery_mode,
+irq->dest, irq->dest_mode);
+
+return 0;
+}
+
+/* Generate one MSI message from VTDIrq info */
+static void vtd_generate_msi_message(VTDIrq *irq, MSIMessage *msg_out)
+{
+VTD_MSIMessage msg = {};
+
+/* Generate address bits */
+msg.dest_mode = irq->dest_mode;
+msg.redir_hint = irq->redir_hint;
+msg.dest = irq->dest;
+msg.__addr_head = cpu_to_le32(0xfee);
+/* Keep this from original MSI address bits */
+msg.__not_used = irq->msi_addr_last_bits;
+
+/* Generate data bits */
+msg.vector = irq->vector;
+msg.delivery_mode = irq->delivery_mode;
+msg.level = 1;
+msg.trigger_mode = irq->trigger_mode;
+
+msg_out->address = msg.msi_addr;
+msg_out->data = msg.msi_data;
+}
+
+/* Interrupt remapping for MSI/MSI-X entry */
+static int vtd_interrupt_remap_msi(IntelIOMMUState *iommu,
+   MSIMessage *origin,
+   MSIMessage *translated)
+{
+int ret = 0;
+VTD_IR_MSIAddress addr;
+uint16_t index;
+VTDIrq irq = {0};
+
+assert(origin && translated);
+
+if (!iommu || !iommu->intr_enabled) {
+goto do_not_translate;
+}
+
+if (origin->address & VTD_MSI_ADDR_HI_MASK) {
+VTD_DPRINTF(GENERAL, "error: MSI addr high 32 bits nonzero"
+" during interrupt remapping: 0x%"PRIx32,
+(uint32_t)((origin->address & VTD_MSI_ADDR_HI_MASK) >> \
+VTD_MSI_ADDR_HI_SHIFT));
+return -VTD_FR_IR_REQ_RSVD;
+}
+
+addr.data = origin->address & VTD_MSI_ADDR_LO_MASK;
+if (le16_to_cpu(addr.__head) != 0xfee) {
+VTD_DPRINTF(GENERAL, "error: MSI addr low 32 bits invalid: "
+"0x%"PRIx32, addr.data);
+return -VTD_FR_IR_REQ_RSVD;
+}
+
+/* 

[Qemu-devel] [PATCH v8 09/25] x86-iommu: provide x86_iommu_get_default

2016-05-30 Thread Peter Xu
Instead of searching the device tree every time, one static variable is
declared for the default system x86 IOMMU device.  Also, some VT-d
macros are replaced by x86 ones.
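
The "default device" idea is a set-once static pointer. A minimal sketch, with hypothetical toy_* names standing in for the x86_iommu_* functions added in the diff:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyIOMMU {
    int id;
} ToyIOMMU;

/* The single system-wide default device; NULL until one is realized. */
static ToyIOMMU *toy_iommu_default;

/* Called once from the realize path; a second device is a bug. */
void toy_iommu_set_default(ToyIOMMU *iommu)
{
    assert(iommu != NULL);
    assert(toy_iommu_default == NULL);
    toy_iommu_default = iommu;
}

ToyIOMMU *toy_iommu_get_default(void)
{
    return toy_iommu_default;
}

/* Callers such as an ACPI builder only need a yes/no answer. */
bool toy_has_iommu(void)
{
    return toy_iommu_get_default() != NULL;
}
```

The lookup becomes a pointer read instead of a path search, at the price of assuming there is at most one such device in the system.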

Signed-off-by: Peter Xu 
---
 hw/i386/acpi-build.c  |  9 ++---
 hw/i386/intel_iommu.c |  8 +---
 hw/i386/x86-iommu.c   | 16 
 hw/pci-host/q35.c |  2 +-
 include/hw/i386/intel_iommu.h |  1 -
 include/hw/i386/x86-iommu.h   |  9 +
 6 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 6c572a3..9af1da0 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -51,7 +51,7 @@
 #include "hw/i386/ich9.h"
 #include "hw/pci/pci_bus.h"
 #include "hw/pci-host/q35.h"
-#include "hw/i386/intel_iommu.h"
+#include "hw/i386/x86-iommu.h"
 #include "hw/timer/hpet.h"
 
 #include "hw/acpi/aml-build.h"
@@ -2656,12 +2656,7 @@ static bool acpi_get_mcfg(AcpiMcfgInfo *mcfg)
 
 static bool acpi_has_iommu(void)
 {
-bool ambiguous;
-Object *intel_iommu;
-
-intel_iommu = object_resolve_path_type("", TYPE_INTEL_IOMMU_DEVICE,
-   &ambiguous);
-return intel_iommu && !ambiguous;
+return !!x86_iommu_get_default();
 }
 
 static
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 0a70577..0c7b24d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -25,6 +25,7 @@
 #include "intel_iommu_internal.h"
 #include "hw/pci/pci.h"
 #include "hw/boards.h"
+#include "hw/i386/x86-iommu.h"
 
 /*#define DEBUG_INTEL_IOMMU*/
 #ifdef DEBUG_INTEL_IOMMU
@@ -191,7 +192,7 @@ static void vtd_reset_context_cache(IntelIOMMUState *s)
 
 VTD_DPRINTF(CACHE, "global context_cache_gen=1");
 while (g_hash_table_iter_next (&bus_it, NULL, (void**)&vtd_bus)) {
-for (devfn_it = 0; devfn_it < VTD_PCI_DEVFN_MAX; ++devfn_it) {
+for (devfn_it = 0; devfn_it < X86_IOMMU_PCI_DEVFN_MAX; ++devfn_it) {
 vtd_as = vtd_bus->dev_as[devfn_it];
 if (!vtd_as) {
 continue;
@@ -976,7 +977,7 @@ static void vtd_context_device_invalidate(IntelIOMMUState 
*s,
 vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
 if (vtd_bus) {
 devfn = VTD_SID_TO_DEVFN(source_id);
-for (devfn_it = 0; devfn_it < VTD_PCI_DEVFN_MAX; ++devfn_it) {
+for (devfn_it = 0; devfn_it < X86_IOMMU_PCI_DEVFN_MAX; ++devfn_it) {
 vtd_as = vtd_bus->dev_as[devfn_it];
 if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
 VTD_DPRINTF(INV, "invalidate context-cahce of devfn 0x%"PRIx16,
@@ -1978,7 +1979,8 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, 
PCIBus *bus, int devfn)
 
 if (!vtd_bus) {
 /* No corresponding free() */
-vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * 
VTD_PCI_DEVFN_MAX);
+vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
+X86_IOMMU_PCI_DEVFN_MAX);
 vtd_bus->bus = bus;
 key = (uintptr_t)bus;
 g_hash_table_insert(s->vtd_as_by_busptr, &key, vtd_bus);
diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
index d739afb..2d6d221 100644
--- a/hw/i386/x86-iommu.c
+++ b/hw/i386/x86-iommu.c
@@ -22,12 +22,28 @@
 #include "hw/boards.h"
 #include "hw/i386/x86-iommu.h"
 
+/* Default X86 IOMMU device */
+static X86IOMMUState *x86_iommu_default = NULL;
+
+static void x86_iommu_set_default(X86IOMMUState *x86_iommu)
+{
+assert(x86_iommu);
+assert(x86_iommu_default == NULL);
+x86_iommu_default = x86_iommu;
+}
+
+X86IOMMUState *x86_iommu_get_default(void)
+{
+return x86_iommu_default;
+}
+
 static void x86_iommu_realize(DeviceState *dev, Error **errp)
 {
 X86IOMMUClass *x86_class = X86_IOMMU_GET_CLASS(dev);
 if (x86_class->realize) {
 x86_class->realize(dev, errp);
 }
+x86_iommu_set_default(X86_IOMMU_DEVICE(dev));
 }
 
 static void x86_iommu_class_init(ObjectClass *klass, void *data)
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 70f897e..27ee0c8 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -429,7 +429,7 @@ static AddressSpace *q35_host_dma_iommu(PCIBus *bus, void 
*opaque, int devfn)
 IntelIOMMUState *s = opaque;
 VTDAddressSpace *vtd_as;
 
-assert(0 <= devfn && devfn <= VTD_PCI_DEVFN_MAX);
+assert(0 <= devfn && devfn <= X86_IOMMU_PCI_DEVFN_MAX);
 
 vtd_as = vtd_find_add_as(s, bus, devfn);
 return &vtd_as->as;
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index cb0c406..260aa8e 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -35,7 +35,6 @@
 #define VTD_PCI_BUS_MAX 256
 #define VTD_PCI_SLOT_MAX 32
 #define VTD_PCI_FUNC_MAX 8
-#define VTD_PCI_DEVFN_MAX   256
 #define VTD_PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
 #define VTD_PCI_FUNC(devfn) ((devfn) & 0x07)
 #define VTD_SID_TO

[Qemu-devel] [PATCH v8 05/25] intel_iommu: define interrupt remap table addr register

2016-05-30 Thread Peter Xu
Define the Interrupt Remap Table Address register to store the IR table
pointer. Also handle global command register writes properly, so that
the table pointer and its size are stored.

One more debug flag "DEBUG_IR" is added for interrupt remapping.

Signed-off-by: Peter Xu 
---
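As a rough illustration of how the table pointer and size are derived from the register value, the sketch below decodes a 64-bit IRTA value the same way vtd_interrupt_remap_table_setup() does in this patch. The `DEMO_*` names are hypothetical, and the 39-bit host address width is an assumption made only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed host address width for this sketch only. */
#define DEMO_HAW_BITS        39
#define DEMO_HAW_MASK        ((1ULL << DEMO_HAW_BITS) - 1)
/* Same shape as VTD_IRTA_ADDR_MASK / VTD_IRTA_SIZE_MASK in the patch. */
#define DEMO_IRTA_ADDR_MASK  (DEMO_HAW_MASK ^ 0xfffULL)
#define DEMO_IRTA_SIZE_MASK  0xfULL

typedef struct DemoIrta {
    uint64_t root;     /* 4K-aligned base address of the table */
    uint32_t entries;  /* number of entries: 2^(S + 1) */
} DemoIrta;

DemoIrta demo_decode_irta(uint64_t value)
{
    DemoIrta t;

    /* The low 4 bits hold S; the table has 2^(S + 1) entries. */
    t.entries = 1U << ((value & DEMO_IRTA_SIZE_MASK) + 1);
    /* Bits below 12 are not part of the address and are masked off. */
    t.root = value & DEMO_IRTA_ADDR_MASK;
    return t;
}
```

For example, a register value of 0x12345007 decodes to a table at 0x12345000 with 256 entries.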
 hw/i386/intel_iommu.c  | 52 +-
 hw/i386/intel_iommu_internal.h |  4 
 include/hw/i386/intel_iommu.h  |  5 
 3 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 17668d6..00b873c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -30,7 +30,7 @@
 #ifdef DEBUG_INTEL_IOMMU
 enum {
 DEBUG_GENERAL, DEBUG_CSR, DEBUG_INV, DEBUG_MMU, DEBUG_FLOG,
-DEBUG_CACHE,
+DEBUG_CACHE, DEBUG_IR,
 };
 #define VTD_DBGBIT(x)   (1 << DEBUG_##x)
 static int vtd_dbgflags = VTD_DBGBIT(GENERAL) | VTD_DBGBIT(CSR);
@@ -900,6 +900,19 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
 (s->root_extended ? "(extended)" : ""));
 }
 
+static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
+{
+uint64_t value = 0;
+value = vtd_get_quad_raw(s, DMAR_IRTA_REG);
+s->intr_size = 1UL << ((value & VTD_IRTA_SIZE_MASK) + 1);
+s->intr_root = value & VTD_IRTA_ADDR_MASK;
+
+/* TODO: invalidate interrupt entry cache */
+
+VTD_DPRINTF(CSR, "int remap table addr 0x%"PRIx64 " size %"PRIu32,
+s->intr_root, s->intr_size);
+}
+
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
 s->context_cache_gen++;
@@ -1138,6 +1151,16 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
 vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
 }
 
+/* Set Interrupt Remap Table Pointer */
+static void vtd_handle_gcmd_sirtp(IntelIOMMUState *s)
+{
+VTD_DPRINTF(CSR, "set Interrupt Remap Table Pointer");
+
+vtd_interrupt_remap_table_setup(s);
+/* Ok - report back to driver */
+vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRTPS);
+}
+
 /* Handle Translation Enable/Disable */
 static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 {
@@ -1177,6 +1200,10 @@ static void vtd_handle_gcmd_write(IntelIOMMUState *s)
 /* Queued Invalidation Enable */
 vtd_handle_gcmd_qie(s, val & VTD_GCMD_QIE);
 }
+if (val & VTD_GCMD_SIRTP) {
+/* Set/update the interrupt remapping root-table pointer */
+vtd_handle_gcmd_sirtp(s);
+}
 }
 
 /* Handle write to Context Command Register */
@@ -1838,6 +1865,23 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
 vtd_update_fsts_ppf(s);
 break;
 
+case DMAR_IRTA_REG:
+VTD_DPRINTF(IR, "DMAR_IRTA_REG write addr 0x%"PRIx64
+", size %d, val 0x%"PRIx64, addr, size, val);
+if (size == 4) {
+vtd_set_long(s, addr, val);
+} else {
+vtd_set_quad(s, addr, val);
+}
+break;
+
+case DMAR_IRTA_REG_HI:
+VTD_DPRINTF(IR, "DMAR_IRTA_REG_HI write addr 0x%"PRIx64
+", size %d, val 0x%"PRIx64, addr, size, val);
+assert(size == 4);
+vtd_set_long(s, addr, val);
+break;
+
 default:
 VTD_DPRINTF(GENERAL, "error: unhandled reg write addr 0x%"PRIx64
 ", size %d, val 0x%"PRIx64, addr, size, val);
@@ -2017,6 +2061,12 @@ static void vtd_init(IntelIOMMUState *s)
 /* Fault Recording Registers, 128-bit */
 vtd_define_quad(s, DMAR_FRCD_REG_0_0, 0, 0, 0);
 vtd_define_quad(s, DMAR_FRCD_REG_0_2, 0, 0, 0x8000ULL);
+
+/*
+ * Interrupt remapping registers. Extended interrupt mode is not
+ * supported for now.
+ */
+vtd_define_quad(s, DMAR_IRTA_REG, 0, 0xf00fULL, 0);
 }
 
 /* Should not reset address_spaces when reset because devices will still use
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 5b98a11..309833f 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -172,6 +172,10 @@
 #define VTD_RTADDR_RTT  (1ULL << 11)
 #define VTD_RTADDR_ADDR_MASK    (VTD_HAW_MASK ^ 0xfffULL)
 
+/* IRTA_REG */
+#define VTD_IRTA_ADDR_MASK  (VTD_HAW_MASK ^ 0xfffULL)
+#define VTD_IRTA_SIZE_MASK  (0xfULL)
+
 /* ECAP_REG */
 /* (offset >> 4) << 8 */
 #define VTD_ECAP_IRO    (DMAR_IOTLB_REG_OFFSET << 4)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 0d89796..cc49839 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -125,6 +125,11 @@ struct IntelIOMMUState {
 MemoryRegionIOMMUOps iommu_ops;
 GHashTable *vtd_as_by_busptr;   /* VTDBus objects indexed by PCIBus* 
reference */
 VTDBus *vtd_as_by_bus_num[VTD_PCI_BUS_MAX]; /* VTDBus objects indexed by 
bus number */
+
+/* interrupt remapping */
+bool intr_enabled;  /* Whether guest enabled IR */
+dma_addr_t intr_root;   /* Interrupt remapping table poi

[Qemu-devel] [PATCH v8 04/25] acpi: add DMAR scope definition for root IOAPIC

2016-05-30 Thread Peter Xu
To enable interrupt remapping for the Intel IOMMU device, each IOAPIC
device in the system reported via ACPI MADT must be explicitly enumerated
under one specific remapping hardware unit. This patch adds the
root-complex IOAPIC to the default DMAR device.

Please refer to VT-d spec 8.3.1.1 for more information.

Signed-off-by: Peter Xu 
---
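The byte layout of one device scope entry with a single path element can be sketched as below. This is a standalone illustration, not the code in this patch: it writes bytes directly instead of using a packed struct, and the `demo_*` / `DEMO_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SCOPE_TYPE_IOAPIC 0x03
/* type(1) + length(1) + reserved(2) + enumeration id(1) + start bus(1) */
#define DEMO_SCOPE_HDR_LEN     6

/*
 * Write one device scope entry with exactly one {device, function} path
 * element. Returns the number of bytes written, or 0 if buf is too small.
 */
size_t demo_build_ioapic_scope(uint8_t *buf, size_t cap, uint8_t ioapic_id,
                               uint8_t bus, uint8_t dev, uint8_t fn)
{
    const size_t len = DEMO_SCOPE_HDR_LEN + 2;

    if (cap < len) {
        return 0;
    }
    buf[0] = DEMO_SCOPE_TYPE_IOAPIC;
    buf[1] = (uint8_t)len;      /* length covers header plus path */
    buf[2] = 0;                 /* reserved */
    buf[3] = 0;                 /* reserved */
    buf[4] = ioapic_id;         /* must match the IOAPIC ID in the MADT */
    buf[5] = bus;               /* start bus number */
    buf[6] = dev;               /* path[0]: device */
    buf[7] = fn;                /* path[0]: function */
    return len;
}
```

Because every field here is a single byte, no byte swapping is involved; the enclosing DRHD length then has to grow by the returned size.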
 hw/i386/acpi-build.c| 17 +++--
 include/hw/acpi/acpi-defs.h | 15 +++
 include/hw/pci-host/q35.h   |  9 +
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index ddc6f16..6c572a3 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -77,6 +77,9 @@
 #define ACPI_BUILD_DPRINTF(fmt, ...)
 #endif
 
+/* Default IOAPIC ID */
+#define ACPI_BUILD_IOAPIC_ID 0x0
+
 typedef struct AcpiMcfgInfo {
 uint64_t mcfg_base;
 uint32_t mcfg_size;
@@ -375,7 +378,6 @@ build_madt(GArray *table_data, GArray *linker, 
PCMachineState *pcms)
 io_apic = acpi_data_push(table_data, sizeof *io_apic);
 io_apic->type = ACPI_APIC_IO;
 io_apic->length = sizeof(*io_apic);
-#define ACPI_BUILD_IOAPIC_ID 0x0
 io_apic->io_apic_id = ACPI_BUILD_IOAPIC_ID;
 io_apic->address = cpu_to_le32(IO_APIC_DEFAULT_ADDRESS);
 io_apic->interrupt = cpu_to_le32(0);
@@ -2561,6 +2563,9 @@ build_dmar_q35(MachineState *ms, GArray *table_data, 
GArray *linker)
 AcpiTableDmar *dmar;
 AcpiDmarHardwareUnit *drhd;
 uint8_t dmar_flags = 0;
+AcpiDmarDeviceScope *scope = NULL;
+/* Root complex IOAPIC uses one path[0] only */
+uint16_t scope_size = sizeof(*scope) + sizeof(uint16_t);
 
 if (ms->iommu_intr) {
 /* enable INTR for the IOMMU device */
@@ -2574,11 +2579,19 @@ build_dmar_q35(MachineState *ms, GArray *table_data, 
GArray *linker)
 /* DMAR Remapping Hardware Unit Definition structure */
 drhd = acpi_data_push(table_data, sizeof(*drhd));
 drhd->type = cpu_to_le16(ACPI_DMAR_TYPE_HARDWARE_UNIT);
-drhd->length = cpu_to_le16(sizeof(*drhd));   /* No device scope now */
+drhd->length = cpu_to_le16(sizeof(*drhd) + scope_size);
 drhd->flags = ACPI_DMAR_INCLUDE_PCI_ALL;
 drhd->pci_segment = cpu_to_le16(0);
 drhd->address = cpu_to_le64(Q35_HOST_BRIDGE_IOMMU_ADDR);
 
+/* Scope definition for the root-complex IOAPIC */
+scope = acpi_data_push(table_data, scope_size);
+scope->entry_type = ACPI_DMAR_DEV_SCOPE_TYPE_IOAPIC;
+scope->length = scope_size;
+scope->enumeration_id = ACPI_BUILD_IOAPIC_ID;
+scope->bus = Q35_PSEUDO_BUS_PLATFORM;
+scope->path[0] = cpu_to_le16(Q35_PSEUDO_DEVFN_IOAPIC);
+
 build_header(linker, table_data, (void *)(table_data->data + dmar_start),
  "DMAR", table_data->len - dmar_start, 1, NULL, NULL);
 }
diff --git a/include/hw/acpi/acpi-defs.h b/include/hw/acpi/acpi-defs.h
index 850a962..b46e472 100644
--- a/include/hw/acpi/acpi-defs.h
+++ b/include/hw/acpi/acpi-defs.h
@@ -569,6 +569,20 @@ enum {
 /*
  * Sub-structures for DMAR
  */
+
+#define ACPI_DMAR_DEV_SCOPE_TYPE_IOAPIC (0x03)
+
+/* Device scope structure for DRHD. */
+struct AcpiDmarDeviceScope {
+uint8_t entry_type;
+uint8_t length;
+uint16_t reserved;
+uint8_t enumeration_id;
+uint8_t bus;
+uint16_t path[0];   /* list of dev:func pairs */
+} QEMU_PACKED;
+typedef struct AcpiDmarDeviceScope AcpiDmarDeviceScope;
+
 /* Type 0: Hardware Unit Definition */
 struct AcpiDmarHardwareUnit {
 uint16_t type;
@@ -577,6 +591,7 @@ struct AcpiDmarHardwareUnit {
 uint8_t reserved;
 uint16_t pci_segment;   /* The PCI Segment associated with this unit */
 uint64_t address;   /* Base address of remapping hardware register-set */
+AcpiDmarDeviceScope scope[0];
 } QEMU_PACKED;
 typedef struct AcpiDmarHardwareUnit AcpiDmarHardwareUnit;
 
diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
index c5c073d..9afc221 100644
--- a/include/hw/pci-host/q35.h
+++ b/include/hw/pci-host/q35.h
@@ -175,4 +175,13 @@ typedef struct Q35PCIHost {
 
 uint64_t mch_mcfg_base(void);
 
+/*
+ * Arbitrary but unique BNF number for IOAPIC device. This is only
+ * used when interrupt remapping is enabled.
+ *
+ * TODO: make sure there is no conflict with a real PCI bus
+ */
+#define Q35_PSEUDO_BUS_PLATFORM (0xff)
+#define Q35_PSEUDO_DEVFN_IOAPIC (0x00)
+
 #endif /* HW_Q35_H */
-- 
2.4.11




[Qemu-devel] [PATCH v8 08/25] x86-iommu: introduce parent class

2016-05-30 Thread Peter Xu
Introduce a parent class named "x86-iommu" for intel-iommu devices. This
is preparation work to abstract shared functionality out of the Intel
and AMD IOMMUs. For now only the parent class is introduced; it does
nothing yet.

Signed-off-by: Peter Xu 
---
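The parent/child arrangement can be sketched without QOM as plain struct embedding plus a class-level hook. This is a simplified model under stated assumptions: all `Demo*` / `demo_*` names are hypothetical, and real QOM type registration and cast macros are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoDevice DemoDevice;

/* Parent "class": holds a hook that a concrete subclass may fill in. */
typedef struct DemoIOMMUClass {
    int (*realize)(DemoDevice *dev);   /* NULL: nothing extra to do */
} DemoIOMMUClass;

/* Parent instance state, shared by every IOMMU flavour. */
struct DemoDevice {
    const DemoIOMMUClass *klass;
    int realized;
};

/* Child instance: the parent must come first so pointers convert safely. */
typedef struct DemoIntelIOMMU {
    DemoDevice parent;
    int regs_ready;
} DemoIntelIOMMU;

/* Child-specific realize, reached only through the parent's hook. */
int demo_intel_realize(DemoDevice *dev)
{
    DemoIntelIOMMU *s = (DemoIntelIOMMU *)dev;

    s->regs_ready = 1;
    return 0;
}

const DemoIOMMUClass demo_intel_class = { demo_intel_realize };

/* Parent realize: delegate to the subclass hook when one is set. */
int demo_iommu_realize(DemoDevice *dev)
{
    int ret = 0;

    if (dev->klass->realize) {
        ret = dev->klass->realize(dev);
    }
    if (ret == 0) {
        dev->realized = 1;
    }
    return ret;
}
```

The point of the indirection is that shared work can later be added to the parent realize without touching each subclass.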
 hw/i386/Makefile.objs |  2 +-
 hw/i386/intel_iommu.c |  5 ++--
 hw/i386/x86-iommu.c   | 53 +++
 include/hw/i386/intel_iommu.h |  3 ++-
 include/hw/i386/x86-iommu.h   | 46 +
 5 files changed, 105 insertions(+), 4 deletions(-)
 create mode 100644 hw/i386/x86-iommu.c
 create mode 100644 include/hw/i386/x86-iommu.h

diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
index b52d5b8..90e94ff 100644
--- a/hw/i386/Makefile.objs
+++ b/hw/i386/Makefile.objs
@@ -2,7 +2,7 @@ obj-$(CONFIG_KVM) += kvm/
 obj-y += multiboot.o
 obj-y += pc.o pc_piix.o pc_q35.o
 obj-y += pc_sysfw.o
-obj-y += intel_iommu.o
+obj-y += x86-iommu.o intel_iommu.o
 obj-$(CONFIG_XEN) += ../xenpv/ xen/
 
 obj-y += kvmvapic.o
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 4d14124..0a70577 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2120,16 +2120,17 @@ static void vtd_realize(DeviceState *dev, Error **errp)
 static void vtd_class_init(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
+X86IOMMUClass *x86_class = X86_IOMMU_CLASS(klass);
 
 dc->reset = vtd_reset;
-dc->realize = vtd_realize;
 dc->vmsd = &vtd_vmstate;
 dc->props = vtd_properties;
+x86_class->realize = vtd_realize;
 }
 
 static const TypeInfo vtd_info = {
 .name  = TYPE_INTEL_IOMMU_DEVICE,
-.parent= TYPE_SYS_BUS_DEVICE,
+.parent= TYPE_X86_IOMMU_DEVICE,
 .instance_size = sizeof(IntelIOMMUState),
 .class_init= vtd_class_init,
 };
diff --git a/hw/i386/x86-iommu.c b/hw/i386/x86-iommu.c
new file mode 100644
index 000..d739afb
--- /dev/null
+++ b/hw/i386/x86-iommu.c
@@ -0,0 +1,53 @@
+/*
+ * QEMU emulation of common X86 IOMMU
+ *
+ * Copyright (C) 2016 Peter Xu, Red Hat 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/sysbus.h"
+#include "hw/boards.h"
+#include "hw/i386/x86-iommu.h"
+
+static void x86_iommu_realize(DeviceState *dev, Error **errp)
+{
+X86IOMMUClass *x86_class = X86_IOMMU_GET_CLASS(dev);
+if (x86_class->realize) {
+x86_class->realize(dev, errp);
+}
+}
+
+static void x86_iommu_class_init(ObjectClass *klass, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(klass);
+dc->realize = x86_iommu_realize;
+}
+
+static const TypeInfo x86_iommu_info = {
+.name  = TYPE_X86_IOMMU_DEVICE,
+.parent= TYPE_SYS_BUS_DEVICE,
+.instance_size = sizeof(X86IOMMUState),
+.class_init= x86_iommu_class_init,
+.class_size= sizeof(X86IOMMUClass),
+.abstract  = true,
+};
+
+static void x86_iommu_register_types(void)
+{
+type_register_static(&x86_iommu_info);
+}
+
+type_init(x86_iommu_register_types)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index c9bb85b..cb0c406 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -23,6 +23,7 @@
 #define INTEL_IOMMU_H
 #include "hw/qdev.h"
 #include "sysemu/dma.h"
+#include "hw/i386/x86-iommu.h"
 
 #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
 #define INTEL_IOMMU_DEVICE(obj) \
@@ -166,7 +167,7 @@ union VTD_IR_MSIAddress {
 
 /* The iommu (DMAR) device state struct */
 struct IntelIOMMUState {
-SysBusDevice busdev;
+X86IOMMUState x86_iommu;
 MemoryRegion csrmem;
 uint8_t csr[DMAR_REG_SIZE]; /* register values */
 uint8_t wmask[DMAR_REG_SIZE];   /* R/W bytes */
diff --git a/include/hw/i386/x86-iommu.h b/include/hw/i386/x86-iommu.h
new file mode 100644
index 000..924f39a
--- /dev/null
+++ b/include/hw/i386/x86-iommu.h
@@ -0,0 +1,46 @@
+/*
+ * Common IOMMU interface for X86 platform
+ *
+ * Copyright (C) 2016 Peter Xu, Red Hat 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT AN

[Qemu-devel] [PATCH v8 11/25] intel_iommu: add IR translation faults defines

2016-05-30 Thread Peter Xu
Add translation fault definitions for interrupt remapping. Please
refer to VT-d spec section 7.1.

Signed-off-by: Peter Xu 
---
 hw/i386/intel_iommu_internal.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 309833f..2a9987f 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -271,6 +271,19 @@ typedef enum VTDFaultReason {
  * context-entry.
  */
 VTD_FR_CONTEXT_ENTRY_TT,
+
+/* Interrupt remapping transition faults */
+VTD_FR_IR_REQ_RSVD = 0x20, /* One or more IR request reserved
+* fields set */
+VTD_FR_IR_INDEX_OVER = 0x21, /* Index value greater than max */
+VTD_FR_IR_ENTRY_P = 0x22,/* Present (P) not set in IRTE */
+VTD_FR_IR_ROOT_INVAL = 0x23, /* IR Root table invalid */
+VTD_FR_IR_IRTE_RSVD = 0x24,  /* IRTE Rsvd field non-zero with
+  * Present flag set */
+VTD_FR_IR_REQ_COMPAT = 0x25, /* Encountered compatible IR
+  * request while disabled */
+VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
+
 /* This is not a normal fault reason. We use this to indicate some faults
  * that are not referenced by the VT-d specification.
  * Fault event with such reason should not be recorded.
-- 
2.4.11




[Qemu-devel] [PATCH v8 06/25] intel_iommu: handle interrupt remap enable

2016-05-30 Thread Peter Xu
Handle writes to the IRE bit in the global command register.

Signed-off-by: Peter Xu 
---
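A minimal sketch of the enable/disable handling follows. It assumes that the "changed" bits are computed as the XOR of the current status and the newly written command value, which is not shown in this patch; the `Demo*` names are hypothetical, and bit 25 is used for both IRE and IRES.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_GCMD_IRE   (1U << 25)   /* command: interrupt remap enable */
#define DEMO_GSTS_IRES  (1U << 25)   /* status: interrupt remap enabled */

typedef struct DemoState {
    uint32_t gsts;         /* global status register image */
    int intr_enabled;      /* cached "guest enabled IR" flag */
} DemoState;

void demo_handle_gcmd_write(DemoState *s, uint32_t val)
{
    /* Assumption: a bit "changed" if command and status disagree. */
    uint32_t changed = s->gsts ^ val;

    if (changed & DEMO_GCMD_IRE) {
        if (val & DEMO_GCMD_IRE) {
            s->intr_enabled = 1;
            s->gsts |= DEMO_GSTS_IRES;    /* report back to the driver */
        } else {
            s->intr_enabled = 0;
            s->gsts &= ~DEMO_GSTS_IRES;   /* report back to the driver */
        }
    }
}
```

Writing the same value twice is a no-op in this model, since the second write finds no changed bit.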
 hw/i386/intel_iommu.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 00b873c..4d14124 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1180,6 +1180,22 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool 
en)
 }
 }
 
+/* Handle Interrupt Remap Enable/Disable */
+static void vtd_handle_gcmd_ire(IntelIOMMUState *s, bool en)
+{
+VTD_DPRINTF(CSR, "Interrupt Remap Enable %s", (en ? "on" : "off"));
+
+if (en) {
+s->intr_enabled = true;
+/* Ok - report back to driver */
+vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_IRES);
+} else {
+s->intr_enabled = false;
+/* Ok - report back to driver */
+vtd_set_clear_mask_long(s, DMAR_GSTS_REG, VTD_GSTS_IRES, 0);
+}
+}
+
 /* Handle write to Global Command Register */
 static void vtd_handle_gcmd_write(IntelIOMMUState *s)
 {
@@ -1204,6 +1220,10 @@ static void vtd_handle_gcmd_write(IntelIOMMUState *s)
 /* Set/update the interrupt remapping root-table pointer */
 vtd_handle_gcmd_sirtp(s);
 }
+if (changed & VTD_GCMD_IRE) {
+/* Interrupt remap enable/disable */
+vtd_handle_gcmd_ire(s, val & VTD_GCMD_IRE);
+}
 }
 
 /* Handle write to Context Command Register */
-- 
2.4.11




[Qemu-devel] [PATCH v8 02/25] intel_iommu: allow queued invalidation for IR

2016-05-30 Thread Peter Xu
Queued invalidation is required for IR. This patch adds basic support
for interrupt cache invalidation requests. Since no IR cache is
implemented yet, all interrupt cache invalidation requests can simply
be skipped for now.

Signed-off-by: Peter Xu 
---
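The "accept and skip" behaviour for the new descriptor type can be sketched as a small dispatcher. The type values match the defines in this patch; extracting the type from the low 4 bits of the descriptor's low quadword is an assumption based on the VTD_INV_DESC_TYPE mask, and the `Demo*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INV_DESC_TYPE   0xf
#define DEMO_INV_DESC_CC     0x1   /* context-cache invalidate */
#define DEMO_INV_DESC_IOTLB  0x2   /* IOTLB invalidate */
#define DEMO_INV_DESC_IEC    0x4   /* interrupt entry cache invalidate */
#define DEMO_INV_DESC_WAIT   0x5   /* invalidation wait */

typedef enum DemoDescResult {
    DEMO_DESC_HANDLED,   /* descriptor did real work */
    DEMO_DESC_SKIPPED,   /* valid descriptor, nothing to invalidate */
    DEMO_DESC_INVALID    /* unknown type: would raise a queue error */
} DemoDescResult;

DemoDescResult demo_process_inv_desc(uint64_t lo)
{
    switch (lo & DEMO_INV_DESC_TYPE) {
    case DEMO_INV_DESC_CC:
    case DEMO_INV_DESC_IOTLB:
    case DEMO_INV_DESC_WAIT:
        return DEMO_DESC_HANDLED;
    case DEMO_INV_DESC_IEC:
        /* No interrupt entries are cached, so treat it as done. */
        return DEMO_DESC_SKIPPED;
    default:
        return DEMO_DESC_INVALID;
    }
}
```

The important distinction is between "skipped" and "invalid": an IEC descriptor must not be reported as an error, or the guest driver would see its invalidation queue fail.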
 hw/i386/intel_iommu.c  | 9 +
 hw/i386/intel_iommu_internal.h | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 347718f..4b0558e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1400,6 +1400,15 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
 }
 break;
 
+case VTD_INV_DESC_IEC:
+VTD_DPRINTF(INV, "Interrupt Entry Cache Invalidation "
+"not implemented yet");
+/*
+ * Since currently we do not cache interrupt entries, we can
+ * just mark this descriptor as "good" and move on.
+ */
+break;
+
 default:
 VTD_DPRINTF(GENERAL, "error: unkonw Invalidation Descriptor type "
 "hi 0x%"PRIx64 " lo 0x%"PRIx64 " type %"PRIu8,
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e5f514c..b648e69 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -286,6 +286,8 @@ typedef struct VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_TYPE   0xf
 #define VTD_INV_DESC_CC 0x1 /* Context-cache Invalidate Desc */
 #define VTD_INV_DESC_IOTLB  0x2
+#define VTD_INV_DESC_IEC    0x4 /* Interrupt Entry Cache
+   Invalidate Descriptor */
 #define VTD_INV_DESC_WAIT   0x5 /* Invalidation Wait Descriptor */
 #define VTD_INV_DESC_NONE   0   /* Not an Invalidate Descriptor */
 
-- 
2.4.11




[Qemu-devel] [PATCH v8 07/25] intel_iommu: define several structs for IOMMU IR

2016-05-30 Thread Peter Xu
Several data structs are defined to better support the rest of the
patches: IRTE to parse remapping table entries, and IOAPIC/MSI related
structure bits to parse interrupt entries to be filled in by guest
kernel.

Signed-off-by: Peter Xu 
---
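For comparison with the bit-field approach in this patch, the remappable MSI address layout can also be decoded with shifts and masks, which does not depend on host bit-field ordering. The bit positions below follow the little-endian branch of VTD_IR_MSIAddress; the `Demo*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct DemoMsiAddr {
    uint16_t head;       /* bits 31:20, expected to be 0xfee */
    uint16_t index;      /* interrupt index, bit 15 taken from bit 2 */
    int remappable;      /* bit 4: interrupt format */
    int sub_valid;       /* bit 3: SHV, Sub-Handle Valid */
} DemoMsiAddr;

DemoMsiAddr demo_decode_msi_addr(uint32_t addr)
{
    DemoMsiAddr m;
    uint32_t index_l = (addr >> 5) & 0x7fff;   /* index bits 14:0 */
    uint32_t index_h = (addr >> 2) & 0x1;      /* index bit 15 */

    m.head = (uint16_t)(addr >> 20);
    m.index = (uint16_t)((index_h << 15) | index_l);
    m.remappable = (int)((addr >> 4) & 0x1);
    m.sub_valid = (int)((addr >> 3) & 0x1);
    return m;
}
```

For instance, 0xfee000b8 decodes to index 5 with both the format bit and SHV set, while 0xfee00004 decodes to index 0x8000 in compatibility format.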
 include/hw/i386/intel_iommu.h | 74 +++
 1 file changed, 74 insertions(+)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index cc49839..c9bb85b 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -52,6 +52,8 @@ typedef struct IntelIOMMUState IntelIOMMUState;
 typedef struct VTDAddressSpace VTDAddressSpace;
 typedef struct VTDIOTLBEntry VTDIOTLBEntry;
 typedef struct VTDBus VTDBus;
+typedef union VTD_IRTE VTD_IRTE;
+typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
 
 /* Context-Entry */
 struct VTDContextEntry {
@@ -90,6 +92,78 @@ struct VTDIOTLBEntry {
 bool write_flags;
 };
 
+/* Interrupt Remapping Table Entry Definition */
+union VTD_IRTE {
+struct {
+#ifdef HOST_WORDS_BIGENDIAN
+uint32_t dest_id:32; /* Destination ID */
+uint32_t __reserved_1:8; /* Reserved 1 */
+uint32_t vector:8;   /* Interrupt Vector */
+uint32_t irte_mode:1;/* IRTE Mode */
+uint32_t __reserved_0:3; /* Reserved 0 */
+uint32_t __avail:4;  /* Available spaces for software */
+uint32_t delivery_mode:3;/* Delivery Mode */
+uint32_t trigger_mode:1; /* Trigger Mode */
+uint32_t redir_hint:1;   /* Redirection Hint */
+uint32_t dest_mode:1;/* Destination Mode */
+uint32_t fault_disable:1;/* Fault Processing Disable */
+uint32_t present:1;  /* Whether entry present/available */
+#else
+uint32_t present:1;  /* Whether entry present/available */
+uint32_t fault_disable:1;/* Fault Processing Disable */
+uint32_t dest_mode:1;/* Destination Mode */
+uint32_t redir_hint:1;   /* Redirection Hint */
+uint32_t trigger_mode:1; /* Trigger Mode */
+uint32_t delivery_mode:3;/* Delivery Mode */
+uint32_t __avail:4;  /* Available spaces for software */
+uint32_t __reserved_0:3; /* Reserved 0 */
+uint32_t irte_mode:1;/* IRTE Mode */
+uint32_t vector:8;   /* Interrupt Vector */
+uint32_t __reserved_1:8; /* Reserved 1 */
+uint32_t dest_id:32; /* Destination ID */
+#endif
+uint16_t source_id:16;   /* Source-ID */
+#ifdef HOST_WORDS_BIGENDIAN
+uint64_t __reserved_2:44;/* Reserved 2 */
+uint64_t sid_vtype:2;/* Source-ID Validation Type */
+uint64_t sid_q:2;/* Source-ID Qualifier */
+#else
+uint64_t sid_q:2;/* Source-ID Qualifier */
+uint64_t sid_vtype:2;/* Source-ID Validation Type */
+uint64_t __reserved_2:44;/* Reserved 2 */
+#endif
+} QEMU_PACKED;
+uint64_t data[2];
+};
+
+#define VTD_IR_INT_FORMAT_COMPAT (0) /* Compatible Interrupt */
+#define VTD_IR_INT_FORMAT_REMAP  (1) /* Remappable Interrupt */
+
+/* Programming format for MSI/MSI-X addresses */
+union VTD_IR_MSIAddress {
+struct {
+#ifdef HOST_WORDS_BIGENDIAN
+uint32_t __head:12;  /* Should always be: 0x0fee */
+uint32_t index_l:15; /* Interrupt index bit 14-0 */
+uint32_t int_mode:1; /* Interrupt format */
+uint32_t sub_valid:1;/* SHV: Sub-Handle Valid bit */
+uint32_t index_h:1;  /* Interrupt index bit 15 */
+uint32_t __not_care:2;
+#else
+uint32_t __not_care:2;
+uint32_t index_h:1;  /* Interrupt index bit 15 */
+uint32_t sub_valid:1;/* SHV: Sub-Handle Valid bit */
+uint32_t int_mode:1; /* Interrupt format */
+uint32_t index_l:15; /* Interrupt index bit 14-0 */
+uint32_t __head:12;  /* Should always be: 0x0fee */
+#endif
+} QEMU_PACKED;
+uint32_t data;
+};
+
+/* When IR is enabled, all MSI/MSI-X data bits should be zero */
+#define VTD_IR_MSI_DATA  (0)
+
 /* The iommu (DMAR) device state struct */
 struct IntelIOMMUState {
 SysBusDevice busdev;
-- 
2.4.11




[Qemu-devel] [PATCH v8 00/25] IOMMU: Enable interrupt remapping for Intel IOMMU

2016-05-30 Thread Peter Xu
This is the v8 patchset for Intel IOMMU IR support. To test it with
pci-bridges, the following fix still needs to be applied; it solves a
known issue that hangs the guest:

- [PATCH v4] pci: fix pci_requester_id()

  https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg02769.html

v8 mostly fixes issues with the bit-field definitions, which were
possibly erroneous on big-endian hosts.

v8 changes:
- rebase to latest master
- patch 7
  - remove VTD_IR_IOAPICEntry, which is useless now
  - fix possible issue on big endian machines for VTD_IRTE,
VTD_IR_MSIAddress
- patch 12
  - fix endianness issue with bit-field defines: fix BE issue with
VTD_MSIMessage, do cpu_to_*() or the reverse where necessary on
bit-field uses.
- patch 19
  - used le32_to_cpu() for dest_id, and added my s-o-b line beneath
Jan's.

v7 changes (using v6 patch index):
- patch 10: trivial change in debug string (remove one more "\n")
- patch 17-18: ioapic remote irr patches, already sent separately,
  so removed from this series.
- patch 24: 
  - fix commit message: only irqfd msi routes are maintained, not
all msi routes.
  - skip all IOAPIC msi entries (dev == NULL). We only need to
housekeep irqfd users.
- added patches
  - pick up Radim's patch on adding MHMV ecap bits [Radim]
- remove all vtd_* patches and use x86-iommu ones in the first place
  instead. This introduced lots of patch order and content changes,
  affecting everything from original patch 8 to the end. Sorry!
  [Jan]

v6 changes:
- patch 10: use write_with_attrs() rather than write(), preparing
  for SID verification [Jan]
- patch 17-18: add r-b line from Radim [Radim]
- new patch 19: put together Jan's EIM patch [Jan]
- new patch 20: add SID validation process
- new patch 21-22: introduce X86IOMMU class, which is the parent of
  IntelIOMMU class. Patch 21 only introduces the class and does
  nothing else; patch 22 converts all the vtd_*() hooks into x86
  ones. This is only a start. In the future, we can abstract more
  things into X86IOMMU class, like iotlb, address spaces mgmt,
  etc. [Jan]
- new patch 23-25: this is to do IEC notify to all irqfd consumers
  like vhost/vfio. patch 23 changed interface for
  kvm_irqchip_add_msi_route(), provide vector info rather than a raw
  MSI message. Patch 24 added new hooks to do arch-specific
  notification on addition/deletion of msi routes. Patch 25 is x86
  specific, which added one more IEC notifier for msi routes. [Jan]
- new patch 26: this is to partially solve the issue that Jan has
  encountered (1 sec delay when invalidating IR cache).

v5 changes:
- patch 10: add vector checking for IOAPIC interrupts (this may help
  debug in the future, will only generate warning if specify
  IOMMU_DEBUG)
- patch 13: replace error_report() with a trace. [Jan]
- patch 14: rename parameter "intr" to "intremap", to be aligned
  with kernel parameter [Jan]
- patch 15: fix comments for vtd_iec_notify_fn
- patch 17 & 18 (added): fix an issue when IR is enabled with devices
  using level-triggered interrupts, like e1000. Added at the end of
  the series, since this issue never happens without IR.

  Patch 17 adds read-only check for IOAPIC entries.
  Patch 18 clears remote IRR bit when entry configured as
  edge-triggered.

v4 changes (all patch numbers correspond to v3):
- add one patch at the start of v3 series: I forgot to send the
  first patch in v3; adding it back. [Jan]
- patch 9: add support for compatible mode (no reason not to support
  it, if not, we will get some warnings when using split irqchip)
- patch 11: further simplify ioapic_update_kvm_routes() using the
  helper function.
- patch 12: tweak on kvm_arch_fixup_msi_route() rather than
  ioapic_update_kvm_routes() only. [Radim]
- add patch 15: introduce IEC (Interrupt Entry Cache) invalidation
  notifier list. We can register to this list if we want to be
  notified when we got IR invalidation requests [Radim]
- add patch 16: make IOAPIC the first consumer of the above IEC
  notifier list. [Radim]
- several other trivial fixes (like moving some defines from .c to
  .h, moving several lines of changes from one patch to another to
  make it make more sense, etc.)

v3 changes (all patch numbers correspond to v2):
- patch 1 (-> v3 patch 13)
  - move to the end of series [Alex]
- patch 10 (dropped)
  - drop this one, since re-worked on IOAPIC support, so we do not
need this any more.
- patch 12 (-> v3 patch 10)
  - leverage MSI path for IOAPIC IR [Jan]
- patch 13 (-> v3 patch 9)
  - remove vtd_interrupt_remap_msi() declaration by reordering the
functions [mst]
  - vtd_generate_msi_message(): init msg using {}, remove FIXME
[mst]
- new patches
  - v3 patch 11: introduce ioapic_entry_parse() helper function
  - v3 patch 12: add support for kernel-irqchip=split. This needs
more reviews, logically this should enable lots of things:
split irqchip, irqfd, vhost, and irqfd support for
passthrough devices (not tested). Please refe

[Qemu-devel] [PATCH v8 03/25] intel_iommu: set IR bit for ECAP register

2016-05-30 Thread Peter Xu
Enable IR in IOMMU Extended Capability register.

Signed-off-by: Peter Xu 
---
 hw/i386/intel_iommu.c  | 7 +++
 hw/i386/intel_iommu_internal.h | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 4b0558e..17668d6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -24,6 +24,7 @@
 #include "exec/address-spaces.h"
 #include "intel_iommu_internal.h"
 #include "hw/pci/pci.h"
+#include "hw/boards.h"
 
 /*#define DEBUG_INTEL_IOMMU*/
 #ifdef DEBUG_INTEL_IOMMU
@@ -1941,6 +1942,8 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, 
PCIBus *bus, int devfn)
  */
 static void vtd_init(IntelIOMMUState *s)
 {
+MachineState *ms = MACHINE(qdev_get_machine());
+
 memset(s->csr, 0, DMAR_REG_SIZE);
 memset(s->wmask, 0, DMAR_REG_SIZE);
 memset(s->w1cmask, 0, DMAR_REG_SIZE);
@@ -1961,6 +1964,10 @@ static void vtd_init(IntelIOMMUState *s)
  VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
 s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
+if (ms->iommu_intr) {
+s->ecap |= VTD_ECAP_IR;
+}
+
 vtd_reset_context_cache(s);
 vtd_reset_iotlb(s);
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index b648e69..5b98a11 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -176,6 +176,8 @@
 /* (offset >> 4) << 8 */
 #define VTD_ECAP_IRO    (DMAR_IOTLB_REG_OFFSET << 4)
 #define VTD_ECAP_QI (1ULL << 1)
+/* Interrupt Remapping support */
+#define VTD_ECAP_IR (1ULL << 3)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
-- 
2.4.11




[Qemu-devel] [PATCH v8 01/25] acpi: enable INTR for DMAR report structure

2016-05-30 Thread Peter Xu
Introduce iommu_intr in MachineState to indicate whether IOMMU IR is
enabled. By default, IR is off.

In the ACPI DMA remapping report structure, set the INTR flag when IR
is enabled.

Signed-off-by: Peter Xu 
---
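How the two header fields touched by this patch are computed can be sketched as below. The flag value (bit 0, INTR_REMAP) and the 39-bit address width are assumptions for this example, since neither definition is visible in the diff; the `Demo*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: bit 0 of the DMAR table flags byte is INTR_REMAP. */
#define DEMO_DMAR_F_INTR         0x1
/* Assumed host address width for this sketch only. */
#define DEMO_HOST_ADDRESS_WIDTH  39

typedef struct DemoDmarHeader {
    uint8_t host_address_width;   /* stored as width minus one */
    uint8_t flags;
} DemoDmarHeader;

DemoDmarHeader demo_dmar_header(int intr_enabled)
{
    DemoDmarHeader h;

    h.host_address_width = DEMO_HOST_ADDRESS_WIDTH - 1;
    /* Advertise interrupt remapping only when the machine enables it. */
    h.flags = intr_enabled ? DEMO_DMAR_F_INTR : 0;
    return h;
}
```

With the machine option left at its default (off), the flags byte stays zero, which matches the behaviour before this patch.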
 hw/core/machine.c |  2 ++
 hw/i386/acpi-build.c  | 12 +---
 include/hw/boards.h   |  1 +
 include/hw/i386/intel_iommu.h |  2 ++
 4 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index ccdd5fa..98471a7 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -397,6 +397,8 @@ static void machine_initfn(Object *obj)
     ms->dump_guest_core = true;
     ms->mem_merge = true;
     ms->enable_graphics = true;
+    /* Disable interrupt remapping by default. */
+    ms->iommu_intr = false;
 
     object_property_add_str(obj, "accel",
                             machine_get_accel, machine_set_accel, NULL);
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 279f0d7..ddc6f16 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2554,16 +2554,22 @@ build_mcfg_q35(GArray *table_data, GArray *linker, AcpiMcfgInfo *info)
 }
 
 static void
-build_dmar_q35(GArray *table_data, GArray *linker)
+build_dmar_q35(MachineState *ms, GArray *table_data, GArray *linker)
 {
     int dmar_start = table_data->len;
 
     AcpiTableDmar *dmar;
     AcpiDmarHardwareUnit *drhd;
+    uint8_t dmar_flags = 0;
+
+    if (ms->iommu_intr) {
+        /* enable INTR for the IOMMU device */
+        dmar_flags |= DMAR_REPORT_F_INTR;
+    }
 
     dmar = acpi_data_push(table_data, sizeof(*dmar));
     dmar->host_address_width = VTD_HOST_ADDRESS_WIDTH - 1;
-    dmar->flags = 0;    /* No intr_remap for now */
+    dmar->flags = dmar_flags;
 
     /* DMAR Remapping Hardware Unit Definition structure */
     drhd = acpi_data_push(table_data, sizeof(*drhd));
@@ -2724,7 +2730,7 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
     }
     if (acpi_has_iommu()) {
         acpi_add_table(table_offsets, tables_blob);
-        build_dmar_q35(tables_blob, tables->linker);
+        build_dmar_q35(MACHINE(pcms), tables_blob, tables->linker);
     }
     if (pcms->acpi_nvdimm_state.is_enabled) {
         nvdimm_build_acpi(table_offsets, tables_blob, tables->linker);
diff --git a/include/hw/boards.h b/include/hw/boards.h
index d268bd0..b27018d 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -152,6 +152,7 @@ struct MachineState {
     bool igd_gfx_passthru;
     char *firmware;
     bool iommu;
+    bool iommu_intr;
     bool suppress_vmdesc;
     bool enforce_config_section;
     bool enable_graphics;
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index b024ffa..0d89796 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -44,6 +44,8 @@
 #define VTD_HOST_ADDRESS_WIDTH  39
 #define VTD_HAW_MASK            ((1ULL << VTD_HOST_ADDRESS_WIDTH) - 1)
 
+#define DMAR_REPORT_F_INTR  (1)
+
 typedef struct VTDContextEntry VTDContextEntry;
 typedef struct VTDContextCacheEntry VTDContextCacheEntry;
 typedef struct IntelIOMMUState IntelIOMMUState;
-- 
2.4.11




[Qemu-devel] How to enable limited set of cpu features when using KVM

2016-05-30 Thread Gaurav Sharma
I am trying to boot a 64-bit image using KVM. As I understand it, 'qemu64'
is the default guest processor.
What I am trying to do is not expose certain features such as sse and sse2.
Even though I changed these in 'builtin_x86_defs' for qemu64, I still
see these features in the guest CPU.
Am I missing something here?

Thanks,


Re: [Qemu-devel] [PATCH] docs: Fix a couple of typos in throttle.txt

2016-05-30 Thread Changlong Xie

On 05/30/2016 06:00 PM, Alberto Garcia wrote:

On Mon 30 May 2016 08:49:18 AM CEST, Changlong Xie wrote:

 - Water leaks from the bucket at a rate of 100 IOPS.
 - Water can be added to the bucket at a rate of 2000 IOPS.
 - The size of the bucket is 2000 x 60 = 120000
-  - If 'iops-total-max-length' is unset then the bucket size is 100.
+  - If 'iops-total-max' is unset then the bucket size is 100.


Sorry to bother you, but why is the bucket size 100 rather than 100 x 60?


Oh, that's because 'iops-total-max-length' can only be set if
'iops-total-max' is set as well. It's explained earlier in the document,
maybe I should make it clear there as well.


Thanks for your explanation,

Thanks
-Xie



Michael, shall I send a new patch on top of my previous one or can the
previous one be replaced?

Berto









Re: [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-30 Thread Peter Lieven

On 30.05.2016 at 12:06, Kevin Wolf wrote:

On 30.05.2016 at 11:53, Peter Lieven wrote:

On 30.05.2016 at 11:47, Kevin Wolf wrote:

On 30.05.2016 at 11:30, Peter Lieven wrote:

On 30.05.2016 at 10:24, Kevin Wolf wrote:

On 30.05.2016 at 08:25, Peter Lieven wrote:

On 27.05.2016 at 10:55, Kevin Wolf wrote:

On 27.05.2016 at 02:36, Fam Zheng wrote:

On Thu, 05/26 11:20, Paolo Bonzini wrote:

On 26/05/2016 10:30, Fam Zheng wrote:

This doesn't look too wrong...  Should the right sequence of events be
head/after_head or head/after_tail?  It's probably simplest to just emit
all four events.

I've no idea. (That's why I leaned towards fixing the test case).

Well, fixing the testcase means knowing what events should be emitted.

QEMU with Peter's patch emits head/after_head.  If the right one is
head/after_tail, _both QEMU and the testcase_ need to be adjusted.  Your
patch keeps the backwards-compatible route.

Yes, I mean I was not very convinced in tweaking the events at all: each pair
of them has been emitted around bdrv_aligned_preadv(), and the new branch
doesn't do it anymore. So I don't see a reason to add events here.

Yes, if you can assume that anyone who uses the debug events knows
exactly what the code looks like, adding the events here is pointless
because TAIL, AFTER_TAIL and for the greatest part also AFTER_HEAD are
essentially the same then.

Having TAIL before the qiov change and AFTER_TAIL afterwards doesn't
make any difference, they could (and should) be called immediately one
after another if we wanted to keep the behaviour.

I would agree that we should take a look at the test case and what it
actually wants to achieve before we can decide whether AFTER_HEAD and
TAIL/AFTER_TAIL would be the same (the former could trigger earlier if
there are two requests and only one is unaligned at the tail). Maybe we
even need to extend the test case now so that both paths (explicit read
of the tail and the shortcut) are covered.

The part that actually blocks in 077 is

# Sequential RMW requests on the same physical sector

it's expecting all 4 events around the RMW cycle.

However, it seems that also other parts of 077 would need an adjustment
and the output might differ depending on the alignment. So I guess we
have to emit the events if we don't want to recode the whole 077 and make
it aware of the alignment.

Yes, but my point is that we may need to rework 077 anyway if we don't
only want to make it pass again, but to cover all relevant paths, too.
We got a new code path and it's unlikely that the existing tests covered
both the old code path and the new one.

So you would postpone this patch until 077 is reworked?
I found this one a nice improvement and 077 might take some time.

The problem with "we'll rework the tests later" is always that it
doesn't happen if the patches for the functional parts and a workaround
for the test case are merged.

I don't think that making 077 cover both cases should be hard or take
much time, it just needs to be done. If all the time for writing emails
in this thread had been used to work on the test case, it would already
be done.

Understood. If you can give a hint how to get the value of the align
parameter into the test script I can try. Otherwise the test will fail
also if any block driver has an align value that is not equal to 512.

The test case already uses blkdebug to enforce a specific align value
(which is 4096 in this test case, not 512):

 echo "open -o driver=$IMGFMT,file.align=4k blkdebug::$TEST_IMG"


Sorry, I missed that. Then I will try to fix 077.

Peter




Re: [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-30 Thread Kevin Wolf
On 30.05.2016 at 11:53, Peter Lieven wrote:
> On 30.05.2016 at 11:47, Kevin Wolf wrote:
> >On 30.05.2016 at 11:30, Peter Lieven wrote:
> >>On 30.05.2016 at 10:24, Kevin Wolf wrote:
> >>>On 30.05.2016 at 08:25, Peter Lieven wrote:
> On 27.05.2016 at 10:55, Kevin Wolf wrote:
> >On 27.05.2016 at 02:36, Fam Zheng wrote:
> >>On Thu, 05/26 11:20, Paolo Bonzini wrote:
> >>>On 26/05/2016 10:30, Fam Zheng wrote:
> >>This doesn't look too wrong...  Should the right sequence of events 
> >>be
> >>head/after_head or head/after_tail?  It's probably simplest to just 
> >>emit
> >>all four events.
> I've no idea. (That's why I leaned towards fixing the test case).
> >>>Well, fixing the testcase means knowing what events should be emitted.
> >>>
> >>>QEMU with Peter's patch emits head/after_head.  If the right one is
> >>>head/after_tail, _both QEMU and the testcase_ need to be adjusted.  
> >>>Your
> >>>patch keeps the backwards-compatible route.
> >>Yes, I mean I was not very convinced in tweaking the events at all: 
> >>each pair
> >>of them has been emitted around bdrv_aligned_preadv(), and the new 
> >>branch
> >>doesn't do it anymore. So I don't see a reason to add events here.
> >Yes, if you can assume that anyone who uses the debug events knows
> >exactly what the code looks like, adding the events here is pointless
> >because TAIL, AFTER_TAIL and for the greatest part also AFTER_HEAD are
> >essentially the same then.
> >
> >Having TAIL before the qiov change and AFTER_TAIL afterwards doesn't
> >make any difference, they could (and should) be called immediately one
> >after another if we wanted to keep the behaviour.
> >
> >I would agree that we should take a look at the test case and what it
> >actually wants to achieve before we can decide whether AFTER_HEAD and
> >TAIL/AFTER_TAIL would be the same (the former could trigger earlier if
> >there are two requests and only one is unaligned at the tail). Maybe we
> >even need to extend the test case now so that both paths (explicit read
> >of the tail and the shortcut) are covered.
> The part that actually blocks in 077 is
> 
> # Sequential RMW requests on the same physical sector
> 
> it's expecting all 4 events around the RMW cycle.
> 
> However, it seems that also other parts of 077 would need an adjustment
> and the output might differ depending on the alignment. So I guess we
> have to emit the events if we don't want to recode the whole 077 and make
> it aware of the alignment.
> >>>Yes, but my point is that we may need to rework 077 anyway if we don't
> >>>only want to make it pass again, but to cover all relevant paths, too.
> >>>We got a new code path and it's unlikely that the existing tests covered
> >>>both the old code path and the new one.
> >>So you would postpone this patch until 077 is reworked?
> >>I found this one a nice improvement and 077 might take some time.
> >The problem with "we'll rework the tests later" is always that it
> >doesn't happen if the patches for the functional parts and a workaround
> >for the test case are merged.
> >
> >I don't think that making 077 cover both cases should be hard or take
> >much time, it just needs to be done. If all the time for writing emails
> >in this thread had been used to work on the test case, it would already
> >be done.
> 
> Understood. If you can give a hint how to get the value of the align
> parameter into the test script I can try. Otherwise the test will fail
> also if any block driver has an align value that is not equal to 512.

The test case already uses blkdebug to enforce a specific align value
(which is 4096 in this test case, not 512):

echo "open -o driver=$IMGFMT,file.align=4k blkdebug::$TEST_IMG"
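
For readers following the head/tail discussion, here is a minimal sketch
(not QEMU's actual code) of how an unaligned write maps onto head and tail
padding when the request alignment is 4096 bytes, as enforced by the
blkdebug line above. The names rmw_plan and plan_rmw are hypothetical and
exist only for illustration.

```c
#include <stdint.h>

/* Bytes that must be read around a write to fill whole aligned blocks. */
struct rmw_plan {
    uint64_t head_bytes; /* bytes between the block start and the request */
    uint64_t tail_bytes; /* bytes between the request end and the block end */
};

/* Hypothetical helper: align must be non-zero. */
static struct rmw_plan plan_rmw(uint64_t offset, uint64_t bytes,
                                uint64_t align)
{
    struct rmw_plan p;
    uint64_t end = offset + bytes;

    /* Unaligned start: the head of the first block has to be read. */
    p.head_bytes = offset % align;
    /* Unaligned end: the rest of the last block has to be read. */
    p.tail_bytes = (align - end % align) % align;
    return p;
}
```

With align = 4096, a 512-byte write at offset 512 gives head_bytes = 512
and tail_bytes = 3072, so both ends are unaligned. A 4096-byte write at
offset 0 gives zero for both and needs no read-modify-write at all.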

Kevin



Re: [Qemu-devel] [PATCH] docs: Fix a couple of typos in throttle.txt

2016-05-30 Thread Alberto Garcia
On Mon 30 May 2016 08:49:18 AM CEST, Changlong Xie wrote:
>>> - Water leaks from the bucket at a rate of 100 IOPS.
>>> - Water can be added to the bucket at a rate of 2000 IOPS.
>>> - The size of the bucket is 2000 x 60 = 120000
>>> -  - If 'iops-total-max-length' is unset then the bucket size is 100.
>>> +  - If 'iops-total-max' is unset then the bucket size is 100.
>
> Sorry to bother you, but why is the bucket size 100 rather than 100 x 60?

Oh, that's because 'iops-total-max-length' can only be set if
'iops-total-max' is set as well. It's explained earlier in the document,
maybe I should make it clear there as well.
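
To make the rule concrete, here is a minimal sketch of the simplified
bucket model quoted above. It is not the code in QEMU's throttling
implementation; the function name bucket_size and its parameters are
hypothetical, and treating 0 as "unset" is an assumption of this sketch.

```c
#include <stdint.h>

/*
 * Hypothetical helper following the simplified model in the thread:
 *   avg        - sustained rate ('iops-total'), e.g. 100
 *   max        - burst rate ('iops-total-max'), 0 meaning unset
 *   max_length - burst duration in seconds ('iops-total-max-length'),
 *                0 meaning unset
 */
static uint64_t bucket_size(uint64_t avg, uint64_t max, uint64_t max_length)
{
    if (max == 0) {
        /* Without a burst rate the length cannot be set either. */
        return avg;
    }
    if (max_length == 0) {
        max_length = 1; /* assumed default of one second */
    }
    return max * max_length;
}
```

bucket_size(100, 2000, 60) yields 120000, matching "2000 x 60", while
bucket_size(100, 0, 0) yields 100, because the length has no meaning when
no burst rate is configured.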

Michael, shall I send a new patch on top of my previous one or can the
previous one be replaced?

Berto



Re: [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-30 Thread Peter Lieven

On 30.05.2016 at 11:47, Kevin Wolf wrote:

On 30.05.2016 at 11:30, Peter Lieven wrote:

On 30.05.2016 at 10:24, Kevin Wolf wrote:

On 30.05.2016 at 08:25, Peter Lieven wrote:

On 27.05.2016 at 10:55, Kevin Wolf wrote:

On 27.05.2016 at 02:36, Fam Zheng wrote:

On Thu, 05/26 11:20, Paolo Bonzini wrote:

On 26/05/2016 10:30, Fam Zheng wrote:

This doesn't look too wrong...  Should the right sequence of events be
head/after_head or head/after_tail?  It's probably simplest to just emit
all four events.

I've no idea. (That's why I leaned towards fixing the test case).

Well, fixing the testcase means knowing what events should be emitted.

QEMU with Peter's patch emits head/after_head.  If the right one is
head/after_tail, _both QEMU and the testcase_ need to be adjusted.  Your
patch keeps the backwards-compatible route.

Yes, I mean I was not very convinced in tweaking the events at all: each pair
of them has been emitted around bdrv_aligned_preadv(), and the new branch
doesn't do it anymore. So I don't see a reason to add events here.

Yes, if you can assume that anyone who uses the debug events knows
exactly what the code looks like, adding the events here is pointless
because TAIL, AFTER_TAIL and for the greatest part also AFTER_HEAD are
essentially the same then.

Having TAIL before the qiov change and AFTER_TAIL afterwards doesn't
make any difference, they could (and should) be called immediately one
after another if we wanted to keep the behaviour.

I would agree that we should take a look at the test case and what it
actually wants to achieve before we can decide whether AFTER_HEAD and
TAIL/AFTER_TAIL would be the same (the former could trigger earlier if
there are two requests and only one is unaligned at the tail). Maybe we
even need to extend the test case now so that both paths (explicit read
of the tail and the shortcut) are covered.

The part that actually blocks in 077 is

# Sequential RMW requests on the same physical sector

it's expecting all 4 events around the RMW cycle.

However, it seems that also other parts of 077 would need an adjustment
and the output might differ depending on the alignment. So I guess we
have to emit the events if we don't want to recode the whole 077 and make
it aware of the alignment.

Yes, but my point is that we may need to rework 077 anyway if we don't
only want to make it pass again, but to cover all relevant paths, too.
We got a new code path and it's unlikely that the existing tests covered
both the old code path and the new one.

So you would postpone this patch until 077 is reworked?
I found this one a nice improvement and 077 might take some time.

The problem with "we'll rework the tests later" is always that it
doesn't happen if the patches for the functional parts and a workaround
for the test case are merged.

I don't think that making 077 cover both cases should be hard or take
much time, it just needs to be done. If all the time for writing emails
in this thread had been used to work on the test case, it would already
be done.


Understood. If you can give a hint how to get the value of the align
parameter into the test script I can try. Otherwise the test will fail
also if any block driver has an align value that is not equal to 512.

Peter



Re: [Qemu-devel] [PATCH 0/2] convert device initialization functions

2016-05-30 Thread Wei, Jiangang
Ping 
Any comments?
Thanks in advance. 

On Tue, 2016-05-17 at 18:18 +0800, Wei Jiangang wrote:
> The first had been reviewed.
> The second had been posted last month, but no feedback.
> They're similar, so resend them together.
> 
> Wei Jiangang (2):
>   hw/pci-bridge: Convert pxb initialization functions to Error
>   apb: convert init to realize
> 
>  hw/pci-bridge/pci_expander_bridge.c | 52 
> ++---
>  hw/pci-host/apb.c   |  5 ++--
>  2 files changed, 27 insertions(+), 30 deletions(-)
> 





[Qemu-devel] [PATCH 2/2] migration/block: Convert saving to BlockBackend

2016-05-30 Thread Kevin Wolf
This creates a new BlockBackend for copying data from an image to the
migration stream on the source host. All I/O for block migration goes
through BlockBackend now.

Signed-off-by: Kevin Wolf 
---
 migration/block.c | 124 ++
 1 file changed, 79 insertions(+), 45 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index 30af182..ebc10e6 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -52,7 +52,8 @@
 
 typedef struct BlkMigDevState {
 /* Written during setup phase.  Can be read without a lock.  */
-BlockDriverState *bs;
+BlockBackend *blk;
+char *blk_name;
 int shared_base;
 int64_t total_sectors;
 QSIMPLEQ_ENTRY(BlkMigDevState) entry;
@@ -145,9 +146,9 @@ static void blk_send(QEMUFile *f, BlkMigBlock * blk)
  | flags);
 
 /* device name */
-len = strlen(bdrv_get_device_name(blk->bmds->bs));
+len = strlen(blk->bmds->blk_name);
 qemu_put_byte(f, len);
-qemu_put_buffer(f, (uint8_t *)bdrv_get_device_name(blk->bmds->bs), len);
+qemu_put_buffer(f, (uint8_t *) blk->bmds->blk_name, len);
 
 /* if a block is zero we need to flush here since the network
  * bandwidth is now a lot higher than the storage device bandwidth.
@@ -201,7 +202,7 @@ static int bmds_aio_inflight(BlkMigDevState *bmds, int64_t sector)
 {
 int64_t chunk = sector / (int64_t)BDRV_SECTORS_PER_DIRTY_CHUNK;
 
-if (sector < bdrv_nb_sectors(bmds->bs)) {
+if (sector < blk_nb_sectors(bmds->blk)) {
 return !!(bmds->aio_bitmap[chunk / (sizeof(unsigned long) * 8)] &
 (1UL << (chunk % (sizeof(unsigned long) * 8))));
 } else {
@@ -235,10 +236,10 @@ static void bmds_set_aio_inflight(BlkMigDevState *bmds, int64_t sector_num,
 
 static void alloc_aio_bitmap(BlkMigDevState *bmds)
 {
-BlockDriverState *bs = bmds->bs;
+BlockBackend *bb = bmds->blk;
 int64_t bitmap_size;
 
-bitmap_size = bdrv_nb_sectors(bs) + BDRV_SECTORS_PER_DIRTY_CHUNK * 8 - 1;
+bitmap_size = blk_nb_sectors(bb) + BDRV_SECTORS_PER_DIRTY_CHUNK * 8 - 1;
 bitmap_size /= BDRV_SECTORS_PER_DIRTY_CHUNK * 8;
 
 bmds->aio_bitmap = g_malloc0(bitmap_size);
@@ -268,19 +269,19 @@ static int mig_save_device_bulk(QEMUFile *f, BlkMigDevState *bmds)
 {
 int64_t total_sectors = bmds->total_sectors;
 int64_t cur_sector = bmds->cur_sector;
-BlockDriverState *bs = bmds->bs;
+BlockBackend *bb = bmds->blk;
 BlkMigBlock *blk;
 int nr_sectors;
 
 if (bmds->shared_base) {
 qemu_mutex_lock_iothread();
-aio_context_acquire(bdrv_get_aio_context(bs));
+aio_context_acquire(blk_get_aio_context(bb));
 while (cur_sector < total_sectors &&
-   !bdrv_is_allocated(bs, cur_sector, MAX_IS_ALLOCATED_SEARCH,
-  &nr_sectors)) {
+   !bdrv_is_allocated(blk_bs(bb), cur_sector,
+  MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
 cur_sector += nr_sectors;
 }
-aio_context_release(bdrv_get_aio_context(bs));
+aio_context_release(blk_get_aio_context(bb));
 qemu_mutex_unlock_iothread();
 }
 
@@ -323,12 +324,12 @@ static int mig_save_device_bulk(QEMUFile *f, BlkMigDevState *bmds)
  * without the need to acquire the AioContext.
  */
 qemu_mutex_lock_iothread();
-aio_context_acquire(bdrv_get_aio_context(bmds->bs));
-blk->aiocb = bdrv_aio_readv(bs, cur_sector, &blk->qiov,
-nr_sectors, blk_mig_read_cb, blk);
+aio_context_acquire(blk_get_aio_context(bmds->blk));
+blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
+0, blk_mig_read_cb, blk);
 
 bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector, nr_sectors);
-aio_context_release(bdrv_get_aio_context(bmds->bs));
+aio_context_release(blk_get_aio_context(bmds->blk));
 qemu_mutex_unlock_iothread();
 
 bmds->cur_sector = cur_sector + nr_sectors;
@@ -343,10 +344,10 @@ static int set_dirty_tracking(void)
 int ret;
 
 QSIMPLEQ_FOREACH(bmds, &block_mig_state.bmds_list, entry) {
-aio_context_acquire(bdrv_get_aio_context(bmds->bs));
-bmds->dirty_bitmap = bdrv_create_dirty_bitmap(bmds->bs, BLOCK_SIZE,
-  NULL, NULL);
-aio_context_release(bdrv_get_aio_context(bmds->bs));
+aio_context_acquire(blk_get_aio_context(bmds->blk));
+bmds->dirty_bitmap = bdrv_create_dirty_bitmap(blk_bs(bmds->blk),
+  BLOCK_SIZE, NULL, NULL);
+aio_context_release(blk_get_aio_context(bmds->blk));
 if (!bmds->dirty_bitmap) {
 ret = -errno;
 goto fail;
@@ -357,9 +358,9 @@ static int set_dirty_tracking(void)
 fail:
 QSIMPLEQ_FOREACH(bmds, &block_mig_state.bmds_list, entry) {
 if (bmds->dirty_bitmap) {
-  

Re: [Qemu-devel] [libvirt] [PATCH 7/9] qmp: Add runnability information to query-cpu-definitions

2016-05-30 Thread Markus Armbruster
Eduardo Habkost  writes:

> Just noticed that I hadn't replied to this yet. Sorry for the
> long delay!
>
> On Thu, May 12, 2016 at 09:46:25AM +0200, Markus Armbruster wrote:
>> Eduardo Habkost  writes:
> [...]
>> > ##
>> > # @CpuDefinitionInfo:
>> > #
>> > # Virtual CPU definition.
>> > #
>> > # @name: the name of the CPU definition
>> > # @runnable: #optional. whether the CPU model us usable with the
>> > #current machine and accelerator. Omitted if we don't
>> > #know the answer. (since 2.7)
>> > # @unavailable-features: List of attributes that prevent the CPU
>> 
>> Unless you drop the * sigil from '*unavailable-features', you need to
>> insert #optional after the colon.
>
> Fixed.
>
>> 
>> > #model from running in the current host.
>> > #(since 2.7)
>> > #
>> > # @unavailable-features is a list of QOM property names that
>> > # represent CPU model attributes that prevent the CPU from running.
>> > # If the QOM property is read-only, that means the CPU model can
>> > # never run in the current host. If the property is read-write, it
>> > # means that it MAY be possible to run the CPU model in the current
>> > # host if that property is changed. Management software can use it
>> > # as hints to suggest or choose an alternative for the user, or
>> > # just to generate meaningful error messages explaining why the CPU
>> > # model can't be used.
>> > #
>> > # Since: 1.2.0
>> > ##
>> 
>> Better.
>> 
>> Next issue: how @runnable and @unavailable-features are related isn't
>> fully documented.  Here's my guess:
>> 
>> Combinations possible?        @runnable
>> @unavailable-features   absent  false  true
>> absent                    yes     ?      ?
>> present, empty             ?      ?      ?
>> present, non-empty         ?     yes    no
>
> unavailable-features should be present only if runnable is false.
> It may be absent or empty if the architecture code still doesn't
> provide detailed info.
>
> Once we have additional architectures implementing the new
> fields, we can consider requiring unavailable-features to be
> always present (and non-empty) if runnable is false.
>
> In other words:
>
> Combinations possible?        @runnable
> @unavailable-features   absent  false   true
> absent                    yes   yes[1]   yes
> present, empty            no    yes[1]   no
> present, non-empty        no     yes     no
>
> [1] I would like it to be "no", but I prefer to make it mandatory
> only after we get some experience with other architectures.
>
>
> I'm making the following changes to the documentation:
>
>  # Virtual CPU definition.
>  #
>  # @name: the name of the CPU definition
> -# @runnable: #optional. whether the CPU model us usable with the
> +# @runnable: #optional Whether the CPU model us usable with the
>  #current machine and accelerator. Omitted if we don't
>  #know the answer. (since 2.7)
> -# @unavailable-features: List of attributes that prevent the CPU
> -#model from running in the current host.
> +# @unavailable-features: #optional List of attributes that prevent
> +#the CPU model from running in the current
> +#host. Present only if @runnable is false.
>  #(since 2.7)
>  #
>  # @unavailable-features is a list of QOM property names that

"Present only if @runnable is false" makes me wonder why we need two
separate optional members tied together with constraints.  I dislike
such constraints, and avoid them whenever practical.

The new members encode an answer to the question of whether a certain CPU
is usable with the current machine and accelerator, and if not, why.
The possible answers are:

(1) Don't know.
(2) Yes.
(3) No, but we can't say why.
(4) No, and here's a list of reasons.

The two "dunno" answers (1) and (3) exist so we don't have to boil the
CPU ocean now.

Without them, the natural solution is a single member, where (4) is
encoded as nonempty list, and (2) could be encoded as empty list or
absent.

Now let me try to fit in (1) and (3).

The obvious way to do (1) is absent.  So let's use empty list for (2).

That leaves (3).  I think the simplest solution that could possibly work
is to treat it as a special "dunno" reason: encode it just like (4), but
with a special "dunno" list element.  I'd use the empty string.

Could even be used if we need to distinguish

(4a) No, and here's the *complete* list of reasons.
(4b) No, and here's a possibly incomplete list of reasons.

For (4b), include the "dunno" element with the others.

Unlike the proposed solution, this one doesn't leave interface crud
behind if we succeed in getting rid of (1) and (3):

* When (1) goes away, the single member becomes mandatory.

* When (3) goes away, the special "dunno" list element no longer occurs.
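
As an illustration of the single-member encoding proposed above, here is
a minimal sketch of how a client might decode it. This is not QEMU or
libvirt code: the enum, the function classify_cpu, and the representation
of the member as an array of C strings (NULL for "absent") are all
assumptions made for this example.

```c
#include <stddef.h>

enum cpu_usable {
    CPU_USABLE_UNKNOWN,           /* (1)  member absent */
    CPU_USABLE_YES,               /* (2)  empty list */
    CPU_USABLE_NO_REASON_UNKNOWN, /* (3)  only the "dunno" element "" */
    CPU_USABLE_NO_COMPLETE,       /* (4a) reasons, no "dunno" element */
    CPU_USABLE_NO_INCOMPLETE      /* (4b) reasons plus the "dunno" element */
};

/* reasons == NULL stands for an absent member; n is the list length. */
static enum cpu_usable classify_cpu(const char *const *reasons, size_t n)
{
    size_t known = 0, dunno = 0;
    size_t i;

    if (reasons == NULL) {
        return CPU_USABLE_UNKNOWN;
    }
    for (i = 0; i < n; i++) {
        if (reasons[i][0] == '\0') {
            dunno++;  /* the empty string is the special "dunno" reason */
        } else {
            known++;
        }
    }
    if (known == 0 && dunno == 0) {
        return CPU_USABLE_YES;
    }
    if (known == 0) {
        return CPU_USABLE_NO_REASON_UNKNOWN;
    }
    return dunno ? CPU_USABLE_NO_INCOMPLETE : CPU_USABLE_NO_COMPLETE;
}
```

The decoder keeps working unchanged if (1) and (3) disappear later: the
NULL branch and the empty-string branch simply stop being taken, which is
the "no interface crud" property argued for above.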



[Qemu-devel] [PATCH v6 15/17] e1000: Move out code that will be reused in e1000e

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Move code that will be shared into separate files.

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 MAINTAINERS|   5 +
 hw/net/Makefile.objs   |   2 +-
 hw/net/e1000.c | 411 +++--
 hw/net/e1000x_common.c | 267 
 hw/net/e1000x_common.h | 213 +
 trace-events   |  13 ++
 6 files changed, 591 insertions(+), 320 deletions(-)
 create mode 100644 hw/net/e1000x_common.c
 create mode 100644 hw/net/e1000x_common.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e890849..ab4e884 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -981,6 +981,11 @@ F: hw/acpi/nvdimm.c
 F: hw/mem/nvdimm.c
 F: include/hw/mem/nvdimm.h
 
+e1000x
+M: Dmitry Fleytman 
+S: Maintained
+F: hw/net/e1000x*
+
 Subsystems
 --
 Audio
diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
index 527d264..bc69948 100644
--- a/hw/net/Makefile.objs
+++ b/hw/net/Makefile.objs
@@ -6,7 +6,7 @@ common-obj-$(CONFIG_NE2000_PCI) += ne2000.o
 common-obj-$(CONFIG_EEPRO100_PCI) += eepro100.o
 common-obj-$(CONFIG_PCNET_PCI) += pcnet-pci.o
 common-obj-$(CONFIG_PCNET_COMMON) += pcnet.o
-common-obj-$(CONFIG_E1000_PCI) += e1000.o
+common-obj-$(CONFIG_E1000_PCI) += e1000.o e1000x_common.o
 common-obj-$(CONFIG_RTL8139_PCI) += rtl8139.o
 common-obj-$(CONFIG_VMXNET3_PCI) += net_tx_pkt.o net_rx_pkt.o
 common-obj-$(CONFIG_VMXNET3_PCI) += vmxnet3.o
diff --git a/hw/net/e1000.c b/hw/net/e1000.c
index 8e79b55..36e3dbe 100644
--- a/hw/net/e1000.c
+++ b/hw/net/e1000.c
@@ -36,7 +36,7 @@
 #include "qemu/iov.h"
 #include "qemu/range.h"
 
-#include "e1000_regs.h"
+#include "e1000x_common.h"
 
 static const uint8_t bcast[] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
 
@@ -64,11 +64,6 @@ static int debugflags = DBGBIT(TXERR) | DBGBIT(GENERAL);
 #define PNPMMIO_SIZE  0x20000
 #define MIN_BUF_SIZE  60 /* Min. octets in an ethernet frame sans FCS */
 
-/* this is the size past which hardware will drop packets when setting LPE=0 */
-#define MAXIMUM_ETHERNET_VLAN_SIZE 1522
-/* this is the size past which hardware will drop packets when setting LPE=1 */
-#define MAXIMUM_ETHERNET_LPE_SIZE 16384
-
 #define MAXIMUM_ETHERNET_HDR_LEN (14+4)
 
 /*
@@ -102,22 +97,9 @@ typedef struct E1000State_st {
 unsigned char vlan[4];
 unsigned char data[0x10000];
 uint16_t size;
-unsigned char sum_needed;
 unsigned char vlan_needed;
-uint8_t ipcss;
-uint8_t ipcso;
-uint16_t ipcse;
-uint8_t tucss;
-uint8_t tucso;
-uint16_t tucse;
-uint8_t hdr_len;
-uint16_t mss;
-uint32_t paylen;
+e1000x_txd_props props;
 uint16_t tso_frames;
-char tse;
-int8_t ip;
-int8_t tcp;
-char cptse; // current packet tse bit
 } tx;
 
 struct {
@@ -162,52 +144,19 @@ typedef struct E1000BaseClass {
 #define E1000_DEVICE_GET_CLASS(obj) \
 OBJECT_GET_CLASS(E1000BaseClass, (obj), TYPE_E1000_BASE)
 
-#define defreg(x)x = (E1000_##x>>2)
-enum {
-defreg(CTRL),defreg(EECD),defreg(EERD),defreg(GPRC),
-defreg(GPTC),defreg(ICR), defreg(ICS), defreg(IMC),
-defreg(IMS), defreg(LEDCTL),  defreg(MANC),defreg(MDIC),
-defreg(MPC), defreg(PBA), defreg(RCTL),defreg(RDBAH),
-defreg(RDBAL),   defreg(RDH), defreg(RDLEN),   defreg(RDT),
-defreg(STATUS),  defreg(SWSM),defreg(TCTL),defreg(TDBAH),
-defreg(TDBAL),   defreg(TDH), defreg(TDLEN),   defreg(TDT),
-defreg(TORH),defreg(TORL),defreg(TOTH),defreg(TOTL),
-defreg(TPR), defreg(TPT), defreg(TXDCTL),  defreg(WUFC),
-defreg(RA),  defreg(MTA), defreg(CRCERRS), defreg(VFTA),
-defreg(VET), defreg(RDTR),defreg(RADV),defreg(TADV),
-defreg(ITR), defreg(FCRUC),   defreg(TDFH),defreg(TDFT),
-defreg(TDFHS),   defreg(TDFTS),   defreg(TDFPC),   defreg(RDFH),
-defreg(RDFT),defreg(RDFHS),   defreg(RDFTS),   defreg(RDFPC),
-defreg(IPAV),defreg(WUC), defreg(WUS), defreg(AIT),
-defreg(IP6AT),   defreg(IP4AT),   defreg(FFLT),defreg(FFMT),
-defreg(FFVT),defreg(WUPM),defreg(PBM), defreg(SCC),
-defreg(ECOL),defreg(MCC), defreg(LATECOL), defreg(COLC),
-defreg(DC),  defreg(TNCRS),   defreg(SEC), defreg(CEXTERR),
-defreg(RLEC),defreg(XONRXC),  defreg(XONTXC),  defreg(XOFFRXC),
-defreg(XOFFTXC), defreg(RFC), defreg(RJC), defreg(RNBC),
-defreg(TSCTFC),  defreg(MGTPRC),  defreg(MGTPDC),  defreg(MGTPTC),
-defreg(RUC), defreg(ROC), defreg(GORCL),   defreg(GORCH),
-defreg(GOTCL),   defreg(GOTCH),   defreg(BPRC),defreg(MPRC),
-defreg(TSCTC),   defreg(PRC64),   defreg(PRC127),  defreg(PRC255),
-defreg(PRC511),  defreg(PRC1023), defreg(PRC1522), defreg(PTC64),
-defreg(PTC127),  defreg(PTC255),  defreg(PTC511), 

Re: [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-30 Thread Kevin Wolf
On 30.05.2016 at 11:30, Peter Lieven wrote:
> On 30.05.2016 at 10:24, Kevin Wolf wrote:
> >On 30.05.2016 at 08:25, Peter Lieven wrote:
> >>On 27.05.2016 at 10:55, Kevin Wolf wrote:
> >>>On 27.05.2016 at 02:36, Fam Zheng wrote:
> On Thu, 05/26 11:20, Paolo Bonzini wrote:
> >On 26/05/2016 10:30, Fam Zheng wrote:
> This doesn't look too wrong...  Should the right sequence of events be
> head/after_head or head/after_tail?  It's probably simplest to just 
> emit
> all four events.
> >>I've no idea. (That's why I leaned towards fixing the test case).
> >Well, fixing the testcase means knowing what events should be emitted.
> >
> >QEMU with Peter's patch emits head/after_head.  If the right one is
> >head/after_tail, _both QEMU and the testcase_ need to be adjusted.  Your
> >patch keeps the backwards-compatible route.
> Yes, I mean I was not very convinced in tweaking the events at all: each 
> pair
> of them has been emitted around bdrv_aligned_preadv(), and the new branch
> doesn't do it anymore. So I don't see a reason to add events here.
> >>>Yes, if you can assume that anyone who uses the debug events knows
> >>>exactly what the code looks like, adding the events here is pointless
> >>>because TAIL, AFTER_TAIL and for the greatest part also AFTER_HEAD are
> >>>essentially the same then.
> >>>
> >>>Having TAIL before the qiov change and AFTER_TAIL afterwards doesn't
> >>>make any difference, they could (and should) be called immediately one
> >>>after another if we wanted to keep the behaviour.
> >>>
> >>>I would agree that we should take a look at the test case and what it
> >>>actually wants to achieve before we can decide whether AFTER_HEAD and
> >>>TAIL/AFTER_TAIL would be the same (the former could trigger earlier if
> >>>there are two requests and only one is unaligned at the tail). Maybe we
> >>>even need to extend the test case now so that both paths (explicit read
> >>>of the tail and the shortcut) are covered.
> >>The part that actually blocks in 077 is
> >>
> >># Sequential RMW requests on the same physical sector
> >>
> >>it's expecting all 4 events around the RMW cycle.
> >>
> >>However, it seems that also other parts of 077 would need an adjustment
> >>and the output might differ depending on the alignment. So I guess we
> >>have to emit the events if we don't want to recode the whole 077 and make
> >>it aware of the alignment.
> >Yes, but my point is that we may need to rework 077 anyway if we don't
> >only want to make it pass again, but to cover all relevant paths, too.
> >We got a new code path and it's unlikely that the existing tests covered
> >both the old code path and the new one.
> 
> So you would postpone this patch until 077 is reworked?
> I found this one a nice improvement and 077 might take some time.

The problem with "we'll rework the tests later" is always that it
doesn't happen if the patches for the functional parts and a workaround
for the test case are merged.

I don't think that making 077 cover both cases should be hard or take
much time, it just needs to be done. If all the time for writing emails
in this thread had been used to work on the test case, it would already
be done.

Kevin



[Qemu-devel] [PATCH v6 17/17] e1000e: Introduce qtest for e1000e device

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 tests/Makefile  |   3 +
 tests/e1000e-test.c | 480 
 2 files changed, 483 insertions(+)
 create mode 100644 tests/e1000e-test.c

diff --git a/tests/Makefile b/tests/Makefile
index c79691a..a3e20e3 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -142,6 +142,8 @@ gcov-files-virtio-y += $(gcov-files-virtioserial-y)
 
 check-qtest-pci-y += tests/e1000-test$(EXESUF)
 gcov-files-pci-y += hw/net/e1000.c
+check-qtest-pci-y += tests/e1000e-test$(EXESUF)
+gcov-files-pci-y += hw/net/e1000e.c hw/net/e1000e_core.c
 check-qtest-pci-y += tests/rtl8139-test$(EXESUF)
 gcov-files-pci-y += hw/net/rtl8139.c
 check-qtest-pci-y += tests/pcnet-test$(EXESUF)
@@ -551,6 +553,7 @@ tests/i440fx-test$(EXESUF): tests/i440fx-test.o $(libqos-pc-obj-y)
 tests/q35-test$(EXESUF): tests/q35-test.o $(libqos-pc-obj-y)
 tests/fw_cfg-test$(EXESUF): tests/fw_cfg-test.o $(libqos-pc-obj-y)
 tests/e1000-test$(EXESUF): tests/e1000-test.o
+tests/e1000e-test$(EXESUF): tests/e1000e-test.o $(libqos-pc-obj-y)
 tests/rtl8139-test$(EXESUF): tests/rtl8139-test.o $(libqos-pc-obj-y)
 tests/pcnet-test$(EXESUF): tests/pcnet-test.o
 tests/eepro100-test$(EXESUF): tests/eepro100-test.o
diff --git a/tests/e1000e-test.c b/tests/e1000e-test.c
new file mode 100644
index 000..d6e6311
--- /dev/null
+++ b/tests/e1000e-test.c
@@ -0,0 +1,480 @@
+ /*
+ * QTest testcase for e1000e NIC
+ *
+ * Copyright (c) 2015 Ravello Systems LTD (http://ravellosystems.com)
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman 
+ * Leonid Bloch 
+ * Yan Vugenfirer 
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see .
+ */
+
+
+#include "qemu/osdep.h"
+#include 
+#include "libqtest.h"
+#include "qemu-common.h"
+#include "libqos/pci-pc.h"
+#include "qemu/sockets.h"
+#include "qemu/iov.h"
+#include "qemu/bitops.h"
+#include "libqos/malloc.h"
+#include "libqos/malloc-pc.h"
+#include "libqos/malloc-generic.h"
+
+#define E1000E_IMS  (0x00d0)
+
+#define E1000E_STATUS   (0x0008)
+#define E1000E_STATUS_LU BIT(1)
+#define E1000E_STATUS_ASDV1000 BIT(9)
+
+#define E1000E_CTRL (0x)
+#define E1000E_CTRL_RESET BIT(26)
+
+#define E1000E_RCTL (0x0100)
+#define E1000E_RCTL_EN  BIT(1)
+#define E1000E_RCTL_UPE BIT(3)
+#define E1000E_RCTL_MPE BIT(4)
+
+#define E1000E_RFCTL (0x5008)
+#define E1000E_RFCTL_EXTEN  BIT(15)
+
+#define E1000E_TCTL (0x0400)
+#define E1000E_TCTL_EN  BIT(1)
+
+#define E1000E_CTRL_EXT (0x0018)
+#define E1000E_CTRL_EXT_DRV_LOAD BIT(28)
+#define E1000E_CTRL_EXT_TXLSFLOW BIT(22)
+
+#define E1000E_RX0_MSG_ID   (0)
+#define E1000E_TX0_MSG_ID   (1)
+#define E1000E_OTHER_MSG_ID (2)
+
+#define E1000E_IVAR (0x00E4)
+#define E1000E_IVAR_TEST_CFG ((E1000E_RX0_MSG_ID << 0)    | BIT(3)  | \
+                              (E1000E_TX0_MSG_ID << 8)    | BIT(11) | \
+                              (E1000E_OTHER_MSG_ID << 16) | BIT(19) | \
+                              BIT(31))
+
+#define E1000E_RING_LEN (0x1000)
+#define E1000E_TXD_LEN  (16)
+#define E1000E_RXD_LEN  (16)
+
+#define E1000E_TDBAL (0x3800)
+#define E1000E_TDBAH (0x3804)
+#define E1000E_TDLEN (0x3808)
+#define E1000E_TDH  (0x3810)
+#define E1000E_TDT  (0x3818)
+
+#define E1000E_RDBAL (0x2800)
+#define E1000E_RDBAH (0x2804)
+#define E1000E_RDLEN (0x2808)
+#define E1000E_RDH  (0x2810)
+#define E1000E_RDT  (0x2818)
+
+typedef struct {
+QPCIDevice *pci_dev;
+void *mac_regs;
+
+uint64_t tx_ring;
+uint64_t rx_ring;
+} e1000e_device;
+
+static int test_sockets[2];
+static QGuestAllocator *test_alloc;
+static QPCIBus *test_bus;
+
+static void e1000e_pci_foreach_callback(QPCIDevice *dev, int devfn, void *data)
+{
+*(QPCIDevice **) data = dev;
+}
+
+static QPCIDevice *e1000e_device_find(QPCIBus *bus)
+{
+static const int e1000e_vendor_id = 0x8086;
+static const int e1000e_dev_id = 0x10D3;
+
+QPCIDevice *e1000e_dev = NULL;
+
+qpci_device_foreach(bus, e1000e_vendor_id, e1000e_dev_id,
+e1000e_pci_foreach_callback, &e1000e_dev);
+
+g_assert_nonnull(e1000e_dev);
+
+return e1000e_dev;
+}
+
+static void e1000e_macreg_wr

[Qemu-devel] [PATCH v6 10/17] net_pkt: Name vmxnet3 packet abstractions more generic

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

This patch drops the "vmx" prefix from the packet abstraction names
to emphasize that they are generic and not tied to any specific
network device.

These abstractions will be reused by the e1000e emulation introduced
by the following patches, so their names need to be generalized.

This patch (except renamed files, adjusted comments and changes in MAINTAINERS)
was produced by:

git grep -lz 'vmxnet_tx_pkt' | xargs -0 perl -i'' -pE "s/vmxnet_tx_pkt/net_tx_pkt/g"
git grep -lz 'vmxnet_rx_pkt' | xargs -0 perl -i'' -pE "s/vmxnet_rx_pkt/net_rx_pkt/g"
git grep -lz 'VmxnetTxPkt' | xargs -0 perl -i'' -pE "s/VmxnetTxPkt/NetTxPkt/g"
git grep -lz 'VMXNET_TX_PKT' | xargs -0 perl -i'' -pE "s/VMXNET_TX_PKT/NET_TX_PKT/g"
git grep -lz 'VmxnetRxPkt' | xargs -0 perl -i'' -pE "s/VmxnetRxPkt/NetRxPkt/g"
git grep -lz 'VMXNET_RX_PKT' | xargs -0 perl -i'' -pE "s/VMXNET_RX_PKT/NET_RX_PKT/g"
sed -ie 's/VMXNET_/NET_/g' hw/net/vmxnet_rx_pkt.c
sed -ie 's/VMXNET_/NET_/g' hw/net/vmxnet_tx_pkt.c

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 MAINTAINERS|   8 +
 hw/net/Makefile.objs   |   2 +-
 hw/net/net_rx_pkt.c| 187 
 hw/net/net_rx_pkt.h| 174 +++
 hw/net/net_tx_pkt.c| 581 +
 hw/net/net_tx_pkt.h| 146 +
 hw/net/vmxnet3.c   |  88 
 hw/net/vmxnet_rx_pkt.c | 187 
 hw/net/vmxnet_rx_pkt.h | 174 ---
 hw/net/vmxnet_tx_pkt.c | 581 -
 hw/net/vmxnet_tx_pkt.h | 146 -
 tests/Makefile |   4 +-
 12 files changed, 1143 insertions(+), 1135 deletions(-)
 create mode 100644 hw/net/net_rx_pkt.c
 create mode 100644 hw/net/net_rx_pkt.h
 create mode 100644 hw/net/net_tx_pkt.c
 create mode 100644 hw/net/net_tx_pkt.h
 delete mode 100644 hw/net/vmxnet_rx_pkt.c
 delete mode 100644 hw/net/vmxnet_rx_pkt.h
 delete mode 100644 hw/net/vmxnet_tx_pkt.c
 delete mode 100644 hw/net/vmxnet_tx_pkt.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 3c949d5..e890849 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -954,6 +954,14 @@ S: Maintained
 F: hw/*/xilinx_*
 F: include/hw/xilinx.h
 
+Network packet abstractions
+M: Dmitry Fleytman 
+S: Maintained
+F: include/net/eth.h
+F: net/eth.c
+F: hw/net/net_rx_pkt*
+F: hw/net/net_tx_pkt*
+
 Vmware
 M: Dmitry Fleytman 
 S: Maintained
diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
index 64d0449..527d264 100644
--- a/hw/net/Makefile.objs
+++ b/hw/net/Makefile.objs
@@ -8,7 +8,7 @@ common-obj-$(CONFIG_PCNET_PCI) += pcnet-pci.o
 common-obj-$(CONFIG_PCNET_COMMON) += pcnet.o
 common-obj-$(CONFIG_E1000_PCI) += e1000.o
 common-obj-$(CONFIG_RTL8139_PCI) += rtl8139.o
-common-obj-$(CONFIG_VMXNET3_PCI) += vmxnet_tx_pkt.o vmxnet_rx_pkt.o
+common-obj-$(CONFIG_VMXNET3_PCI) += net_tx_pkt.o net_rx_pkt.o
 common-obj-$(CONFIG_VMXNET3_PCI) += vmxnet3.o
 
 common-obj-$(CONFIG_SMC91C111) += smc91c111.o
diff --git a/hw/net/net_rx_pkt.c b/hw/net/net_rx_pkt.c
new file mode 100644
index 000..8a4f29f
--- /dev/null
+++ b/hw/net/net_rx_pkt.c
@@ -0,0 +1,187 @@
+/*
+ * QEMU RX packets abstractions
+ *
+ * Copyright (c) 2012 Ravello Systems LTD (http://ravellosystems.com)
+ *
+ * Developed by Daynix Computing LTD (http://www.daynix.com)
+ *
+ * Authors:
+ * Dmitry Fleytman 
+ * Tamir Shomer 
+ * Yan Vugenfirer 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "net_rx_pkt.h"
+#include "net/eth.h"
+#include "qemu-common.h"
+#include "qemu/iov.h"
+#include "net/checksum.h"
+#include "net/tap.h"
+
+/*
+ * RX packet may contain up to 2 fragments - rebuilt eth header
+ * in case of VLAN tag stripping
+ * and payload received from QEMU - in any case
+ */
+#define NET_MAX_RX_PACKET_FRAGMENTS (2)
+
+struct NetRxPkt {
+struct virtio_net_hdr virt_hdr;
+uint8_t ehdr_buf[ETH_MAX_L2_HDR_LEN];
+struct iovec vec[NET_MAX_RX_PACKET_FRAGMENTS];
+uint16_t vec_len;
+uint32_t tot_len;
+uint16_t tci;
+bool vlan_stripped;
+bool has_virt_hdr;
+eth_pkt_types_e packet_type;
+
+/* Analysis results */
+bool isip4;
+bool isip6;
+bool isudp;
+bool istcp;
+};
+
+void net_rx_pkt_init(struct NetRxPkt **pkt, bool has_virt_hdr)
+{
+struct NetRxPkt *p = g_malloc0(sizeof *p);
+p->has_virt_hdr = has_virt_hdr;
+*pkt = p;
+}
+
+void net_rx_pkt_uninit(struct NetRxPkt *pkt)
+{
+g_free(pkt);
+}
+
+struct virtio_net_hdr *net_rx_pkt_get_vhdr(struct NetRxPkt *pkt)
+{
+assert(pkt);
+return &pkt->virt_hdr;
+}
+
+void net_rx_pkt_attach_data(struct NetRxPkt *pkt, const void *data,
+   size_t len, bool strip_vlan)
+{
+uint16_t tci = 0;
+uint16_t ploff;
+assert(pkt);
+pkt->vlan_stripped = false;
+
+if (strip_vlan) {
+pkt->vlan_stripped = eth_strip_vlan(data, pkt->ehdr_buf, &plo

[Qemu-devel] [PATCH v6 14/17] e1000_regs: Add definitions for Intel 82574-specific bits

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 hw/net/e1000_regs.h | 345 +++-
 1 file changed, 342 insertions(+), 3 deletions(-)

diff --git a/hw/net/e1000_regs.h b/hw/net/e1000_regs.h
index 1c40244..d62b3fa 100644
--- a/hw/net/e1000_regs.h
+++ b/hw/net/e1000_regs.h
@@ -85,6 +85,7 @@
 #define E1000_DEV_ID_82573E  0x108B
 #define E1000_DEV_ID_82573E_IAMT 0x108C
 #define E1000_DEV_ID_82573L  0x109A
+#define E1000_DEV_ID_82574L  0x10D3
 #define E1000_DEV_ID_82546GB_QUAD_COPPER_KSP3 0x10B5
 #define E1000_DEV_ID_80003ES2LAN_COPPER_DPT 0x1096
 #define E1000_DEV_ID_80003ES2LAN_SERDES_DPT 0x1098
@@ -104,6 +105,7 @@
 #define E1000_PHY_ID2_82544x 0xC30
 #define E1000_PHY_ID2_8254xx_DEFAULT 0xC20 /* 82540x, 82545x, and 82546x */
 #define E1000_PHY_ID2_82573x 0xCC0
+#define E1000_PHY_ID2_82574x 0xCB1
 
 /* Register Set. (82543, 82544)
  *
@@ -135,8 +137,11 @@
 #define E1000_ITR  0x000C4  /* Interrupt Throttling Rate - RW */
 #define E1000_ICS  0x000C8  /* Interrupt Cause Set - WO */
 #define E1000_IMS  0x000D0  /* Interrupt Mask Set - RW */
+#define E1000_EIAC 0x000DC  /* Ext. Interrupt Auto Clear - RW */
 #define E1000_IMC  0x000D8  /* Interrupt Mask Clear - WO */
 #define E1000_IAM  0x000E0  /* Interrupt Acknowledge Auto Mask */
+#define E1000_IVAR 0x000E4  /* Interrupt Vector Allocation Register - RW */
+#define E1000_EITR 0x000E8  /* Extended Interrupt Throttling Rate - RW */
 #define E1000_RCTL 0x00100  /* RX Control - RW */
 #define E1000_RDTR1 0x02820  /* RX Delay Timer (1) - RW */
 #define E1000_RDBAL1   0x02900  /* RX Descriptor Base Address Low (1) - RW */
@@ -145,6 +150,7 @@
 #define E1000_RDH1 0x02910  /* RX Descriptor Head (1) - RW */
 #define E1000_RDT1 0x02918  /* RX Descriptor Tail (1) - RW */
 #define E1000_FCTTV 0x00170  /* Flow Control Transmit Timer Value - RW */
+#define E1000_FCRTV 0x05F40  /* Flow Control Refresh Timer Value - RW */
 #define E1000_TXCW 0x00178  /* TX Configuration Word - RW */
 #define E1000_RXCW 0x00180  /* RX Configuration Word - RO */
 #define E1000_TCTL 0x00400  /* TX Control - RW */
@@ -161,6 +167,10 @@
 #define E1000_PBM  0x1  /* Packet Buffer Memory - RW */
 #define E1000_PBS  0x01008  /* Packet Buffer Size - RW */
 #define E1000_EEMNGCTL 0x01010  /* MNG EEprom Control */
+#define E1000_EEMNGDATA 0x01014 /* MNG EEPROM Read/Write data */
+#define E1000_FLMNGCTL 0x01018 /* MNG Flash Control */
+#define E1000_FLMNGDATA 0x0101C /* MNG FLASH Read data */
+#define E1000_FLMNGCNT 0x01020 /* MNG FLASH Read Counter */
 #define E1000_FLASH_UPDATES 1000
 #define E1000_EEARBC   0x01024  /* EEPROM Auto Read Bus Control */
 #define E1000_FLASHT   0x01028  /* FLASH Timer Register */
@@ -169,9 +179,12 @@
 #define E1000_FLSWDATA 0x01034  /* FLASH data register */
 #define E1000_FLSWCNT  0x01038  /* FLASH Access Counter */
 #define E1000_FLOP 0x0103C  /* FLASH Opcode Register */
+#define E1000_FLOL 0x01050  /* FEEP Auto Load */
 #define E1000_ERT  0x02008  /* Early Rx Threshold - RW */
 #define E1000_FCRTL 0x02160  /* Flow Control Receive Threshold Low - RW */
+#define E1000_FCRTL_A  0x00168  /* Alias to FCRTL */
 #define E1000_FCRTH 0x02168  /* Flow Control Receive Threshold High - RW */
+#define E1000_FCRTH_A  0x00160  /* Alias to FCRTH */
 #define E1000_PSRCTL   0x02170  /* Packet Split Receive Control - RW */
 #define E1000_RDBAL 0x02800  /* RX Descriptor Base Address Low - RW */
 #define E1000_RDBAH 0x02804  /* RX Descriptor Base Address High - RW */
@@ -179,11 +192,17 @@
 #define E1000_RDH  0x02810  /* RX Descriptor Head - RW */
 #define E1000_RDT  0x02818  /* RX Descriptor Tail - RW */
 #define E1000_RDTR 0x02820  /* RX Delay Timer - RW */
+#define E1000_RDTR_A   0x00108  /* Alias to RDTR */
 #define E1000_RDBAL0   E1000_RDBAL /* RX Desc Base Address Low (0) - RW */
+#define E1000_RDBAL0_A 0x00110 /* Alias to RDBAL0 */
 #define E1000_RDBAH0   E1000_RDBAH /* RX Desc Base Address High (0) - RW */
+#define E1000_RDBAH0_A 0x00114 /* Alias to RDBAH0 */
 #define E1000_RDLEN0   E1000_RDLEN /* RX Desc Length (0) - RW */
+#define E1000_RDLEN0_A 0x00118 /* Alias to RDLEN0 */
 #define E1000_RDH0 E1000_RDH   /* RX Desc Head (0) - RW */
+#define E1000_RDH0_A   0x00120 /* Alias to RDH0 */
 #define E1000_RDT0 E1000_RDT   /* RX Desc Tail (0) - RW */
+#define E1000_RDT0_A   0x00128 /* Alias to RDT0 */
 #define E1000_RDTR0 E1000_RDTR  /* RX Delay Timer (0) - RW */
 #define E1000_RXDCTL   0x02828  /* RX Descriptor Control queue 0 - RW */
 #define E1000_RXDCTL1  0x02928  /* RX Descriptor Control queue 1 - RW */
@@ -192,22 +211,33 @@
 #define E1000_RAID 0x02C08  /* Receive Ack Interrupt Delay - RW */
 #define E1000_TXDMAC   0x03000  /* TX DMA Control - RW */
 #define E1000_KABGTXD  0x03004  /* AFE Band G

[Qemu-devel] [PATCH v6 12/17] net_pkt: Extend packet abstraction as required by e1000e functionality

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

This patch extends the TX/RX packet abstractions with features that will
be used by the e1000e device implementation.

Changes are:

  1. Support iovec lists for RX buffers
  2. Deeper RX packet parsing
  3. Loopback option for TX packets
  4. Extended VLAN header handling
  5. RSS processing for RX packets

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
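A note on item 1 of the list above: the diff below replaces the fixed
two-entry vec[] array with a heap-allocated array that grows on demand
(net_rx_pkt_iovec_realloc). The following is only a standalone sketch of
that grow-only pattern, not the patch code: it uses plain malloc()/free()
instead of g_malloc()/g_free(), and the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Hypothetical stand-in for the iovec bookkeeping in struct NetRxPkt. */
struct sketch_rx_pkt {
    struct iovec *vec;   /* heap-allocated fragment array */
    int vec_capacity;    /* number of entries allocated in vec */
};

/*
 * Make sure at least 'needed' entries are available.  The array only
 * ever grows; old contents are discarded on growth, because the caller
 * refills the array for every packet anyway.
 */
static void sketch_iovec_reserve(struct sketch_rx_pkt *pkt, int needed)
{
    if (pkt->vec_capacity < needed) {
        free(pkt->vec);
        pkt->vec = malloc(sizeof(*pkt->vec) * needed);
        assert(pkt->vec != NULL);
        pkt->vec_capacity = needed;
    }
}
```

Since the buffer is never shrunk, a long-lived packet object settles at
the largest fragment count it has seen and stops allocating after that.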
 hw/net/net_rx_pkt.c| 473 +
 hw/net/net_rx_pkt.h| 193 +++-
 hw/net/net_tx_pkt.c| 204 +
 hw/net/net_tx_pkt.h|  60 ++-
 include/net/checksum.h |   4 +-
 include/net/eth.h  | 153 +++-
 net/checksum.c |   7 +-
 net/eth.c  | 410 +-
 trace-events   |  40 +
 9 files changed, 1336 insertions(+), 208 deletions(-)

diff --git a/hw/net/net_rx_pkt.c b/hw/net/net_rx_pkt.c
index 8a4f29f..1019b50 100644
--- a/hw/net/net_rx_pkt.c
+++ b/hw/net/net_rx_pkt.c
@@ -16,24 +16,16 @@
  */
 
 #include "qemu/osdep.h"
+#include "trace.h"
 #include "net_rx_pkt.h"
-#include "net/eth.h"
-#include "qemu-common.h"
-#include "qemu/iov.h"
 #include "net/checksum.h"
 #include "net/tap.h"
 
-/*
- * RX packet may contain up to 2 fragments - rebuilt eth header
- * in case of VLAN tag stripping
- * and payload received from QEMU - in any case
- */
-#define NET_MAX_RX_PACKET_FRAGMENTS (2)
-
 struct NetRxPkt {
 struct virtio_net_hdr virt_hdr;
-uint8_t ehdr_buf[ETH_MAX_L2_HDR_LEN];
-struct iovec vec[NET_MAX_RX_PACKET_FRAGMENTS];
+uint8_t ehdr_buf[sizeof(struct eth_header)];
+struct iovec *vec;
+uint16_t vec_len_total;
 uint16_t vec_len;
 uint32_t tot_len;
 uint16_t tci;
@@ -46,17 +38,31 @@ struct NetRxPkt {
 bool isip6;
 bool isudp;
 bool istcp;
+
+size_t l3hdr_off;
+size_t l4hdr_off;
+size_t l5hdr_off;
+
+eth_ip6_hdr_info ip6hdr_info;
+eth_ip4_hdr_info ip4hdr_info;
+eth_l4_hdr_info  l4hdr_info;
 };
 
 void net_rx_pkt_init(struct NetRxPkt **pkt, bool has_virt_hdr)
 {
 struct NetRxPkt *p = g_malloc0(sizeof *p);
 p->has_virt_hdr = has_virt_hdr;
+p->vec = NULL;
+p->vec_len_total = 0;
 *pkt = p;
 }
 
 void net_rx_pkt_uninit(struct NetRxPkt *pkt)
 {
+if (pkt->vec_len_total != 0) {
+g_free(pkt->vec);
+}
+
 g_free(pkt);
 }
 
@@ -66,33 +72,88 @@ struct virtio_net_hdr *net_rx_pkt_get_vhdr(struct NetRxPkt *pkt)
 return &pkt->virt_hdr;
 }
 
-void net_rx_pkt_attach_data(struct NetRxPkt *pkt, const void *data,
-   size_t len, bool strip_vlan)
+static inline void
+net_rx_pkt_iovec_realloc(struct NetRxPkt *pkt,
+int new_iov_len)
+{
+if (pkt->vec_len_total < new_iov_len) {
+g_free(pkt->vec);
+pkt->vec = g_malloc(sizeof(*pkt->vec) * new_iov_len);
+pkt->vec_len_total = new_iov_len;
+}
+}
+
+static void
+net_rx_pkt_pull_data(struct NetRxPkt *pkt,
+const struct iovec *iov, int iovcnt,
+size_t ploff)
+{
+if (pkt->vlan_stripped) {
+net_rx_pkt_iovec_realloc(pkt, iovcnt + 1);
+
+pkt->vec[0].iov_base = pkt->ehdr_buf;
+pkt->vec[0].iov_len = sizeof(pkt->ehdr_buf);
+
+pkt->tot_len =
+iov_size(iov, iovcnt) - ploff + sizeof(struct eth_header);
+
+pkt->vec_len = iov_copy(pkt->vec + 1, pkt->vec_len_total - 1,
+iov, iovcnt, ploff, pkt->tot_len);
+} else {
+net_rx_pkt_iovec_realloc(pkt, iovcnt);
+
+pkt->tot_len = iov_size(iov, iovcnt) - ploff;
+pkt->vec_len = iov_copy(pkt->vec, pkt->vec_len_total,
+iov, iovcnt, ploff, pkt->tot_len);
+}
+
+eth_get_protocols(pkt->vec, pkt->vec_len, &pkt->isip4, &pkt->isip6,
+  &pkt->isudp, &pkt->istcp,
+  &pkt->l3hdr_off, &pkt->l4hdr_off, &pkt->l5hdr_off,
+  &pkt->ip6hdr_info, &pkt->ip4hdr_info, &pkt->l4hdr_info);
+
+trace_net_rx_pkt_parsed(pkt->isip4, pkt->isip6, pkt->isudp, pkt->istcp,
+pkt->l3hdr_off, pkt->l4hdr_off, pkt->l5hdr_off);
+}
+
+void net_rx_pkt_attach_iovec(struct NetRxPkt *pkt,
+const struct iovec *iov, int iovcnt,
+size_t iovoff, bool strip_vlan)
 {
 uint16_t tci = 0;
-uint16_t ploff;
+uint16_t ploff = iovoff;
 assert(pkt);
 pkt->vlan_stripped = false;
 
 if (strip_vlan) {
-pkt->vlan_stripped = eth_strip_vlan(data, pkt->ehdr_buf, &ploff, &tci);
+pkt->vlan_stripped = eth_strip_vlan(iov, iovcnt, iovoff, pkt->ehdr_buf,
+&ploff, &tci);
 }
 
-if (pkt->vlan_stripped) {
-pkt->vec[0].iov_base = pkt->ehdr_buf;
-pkt->vec[0].iov_len = ploff - sizeof(struct vlan_header);
-pkt->vec[1].iov_base = (u

[Qemu-devel] [PATCH 1/2] migration/block: Convert load to BlockBackend

2016-05-30 Thread Kevin Wolf
This converts the loading part of block migration to use BlockBackend
interfaces rather than accessing the BlockDriverState directly.

Note that this takes a lazy shortcut. We should really use a separate
BlockBackend that is configured for the migration rather than for the
guest (e.g. writethrough caching is unnecessary) and holds its own
reference to the BlockDriverState, but the impact isn't that big and we
didn't have a separate migration reference before either, so it must be
good enough, I guess...

Signed-off-by: Kevin Wolf 
---
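A note on units in this conversion: the old bdrv_write() and
bdrv_write_zeroes() calls took sector numbers, while blk_pwrite() and
blk_pwrite_zeroes() take byte offsets and lengths, which is why the last
hunk multiplies by BDRV_SECTOR_SIZE. A minimal standalone sketch of that
conversion, assuming 512-byte sectors; the SKETCH_* macros and
sketch_sectors_to_bytes() are hypothetical and not part of QEMU:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512-byte sectors, mirroring QEMU's BDRV_SECTOR_SIZE. */
#define SKETCH_SECTOR_BITS 9
#define SKETCH_SECTOR_SIZE (1LL << SKETCH_SECTOR_BITS)

/* Hypothetical helper: convert a sector index or count to bytes. */
static int64_t sketch_sectors_to_bytes(int64_t sectors)
{
    /* Reject negative values and anything that would overflow int64_t. */
    assert(sectors >= 0 && sectors <= INT64_MAX / SKETCH_SECTOR_SIZE);
    return sectors * SKETCH_SECTOR_SIZE;
}
```

For example, sector 2048 maps to byte offset 1048576 (1 MiB).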
 migration/block.c | 23 +--
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/migration/block.c b/migration/block.c
index e0628d1..30af182 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -827,8 +827,7 @@ static int block_load(QEMUFile *f, void *opaque, int version_id)
 int len, flags;
 char device_name[256];
 int64_t addr;
-BlockDriverState *bs, *bs_prev = NULL;
-BlockBackend *blk;
+BlockBackend *blk, *blk_prev = NULL;
 Error *local_err = NULL;
 uint8_t *buf;
 int64_t total_sectors = 0;
@@ -853,23 +852,17 @@ static int block_load(QEMUFile *f, void *opaque, int version_id)
 device_name);
 return -EINVAL;
 }
-bs = blk_bs(blk);
-if (!bs) {
-fprintf(stderr, "Block device %s has no medium\n",
-device_name);
-return -EINVAL;
-}
 
-if (bs != bs_prev) {
-bs_prev = bs;
-total_sectors = bdrv_nb_sectors(bs);
+if (blk != blk_prev) {
+blk_prev = blk;
+total_sectors = blk_nb_sectors(blk);
 if (total_sectors <= 0) {
 error_report("Error getting length of block device %s",
  device_name);
 return -EINVAL;
 }
 
-bdrv_invalidate_cache(bs, &local_err);
+blk_invalidate_cache(blk, &local_err);
 if (local_err) {
 error_report_err(local_err);
 return -EINVAL;
@@ -883,12 +876,14 @@ static int block_load(QEMUFile *f, void *opaque, int version_id)
 }
 
 if (flags & BLK_MIG_FLAG_ZERO_BLOCK) {
-ret = bdrv_write_zeroes(bs, addr, nr_sectors,
+ret = blk_pwrite_zeroes(blk, addr * BDRV_SECTOR_SIZE,
+nr_sectors * BDRV_SECTOR_SIZE,
 BDRV_REQ_MAY_UNMAP);
 } else {
 buf = g_malloc(BLOCK_SIZE);
 qemu_get_buffer(f, buf, BLOCK_SIZE);
-ret = bdrv_write(bs, addr, buf, nr_sectors);
+ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
+ nr_sectors * BDRV_SECTOR_SIZE, 0);
 g_free(buf);
 }
 
-- 
1.8.3.1




Re: [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-30 Thread Peter Lieven

On 30.05.2016 at 10:24, Kevin Wolf wrote:

On 30.05.2016 at 08:25, Peter Lieven wrote:

On 27.05.2016 at 10:55, Kevin Wolf wrote:

On 27.05.2016 at 02:36, Fam Zheng wrote:

On Thu, 05/26 11:20, Paolo Bonzini wrote:

On 26/05/2016 10:30, Fam Zheng wrote:

This doesn't look too wrong...  Should the right sequence of events be
head/after_head or head/after_tail?  It's probably simplest to just emit
all four events.

I've no idea. (That's why I leaned towards fixing the test case).

Well, fixing the testcase means knowing what events should be emitted.

QEMU with Peter's patch emits head/after_head.  If the right one is
head/after_tail, _both QEMU and the testcase_ need to be adjusted.  Your
patch keeps the backwards-compatible route.

Yes, I mean I was not very convinced in tweaking the events at all: each pair
of them has been emitted around bdrv_aligned_preadv(), and the new branch
doesn't do it anymore. So I don't see a reason to add events here.

Yes, if you can assume that anyone who uses the debug events knows
exactly what the code looks like, adding the events here is pointless
because TAIL, AFTER_TAIL and for the greatest part also AFTER_HEAD are
essentially the same then.

Having TAIL before the qiov change and AFTER_TAIL afterwards doesn't
make any difference, they could (and should) be called immediately one
after another if we wanted to keep the behaviour.

I would agree that we should take a look at the test case and what it
actually wants to achieve before we can decide whether AFTER_HEAD and
TAIL/AFTER_TAIL would be the same (the former could trigger earlier if
there are two requests and only one is unaligned at the tail). Maybe we
even need to extend the test case now so that both paths (explicit read
of the tail and the shortcut) are covered.

The part that actually blocks in 077 is

# Sequential RMW requests on the same physical sector

it's expecting all 4 events around the RMW cycle.

However, it seems that also other parts of 077 would need an adjustment
and the output might differ depending on the alignment. So I guess we
have to emit the events if we don't want to recode the whole 077 and make
it aware of the alignment.

Yes, but my point is that we may need to rework 077 anyway if we don't
only want to make it pass again, but to cover all relevant paths, too.
We got a new code path and it's unlikely that the existing tests covered
both the old code path and the new one.


So you would postpone this patch until 077 is reworked?
I found this one a nice improvement and 077 might take some time.

Peter




[Qemu-devel] [PATCH v6 13/17] vmxnet3: Use pci_dma_* API instead of cpu_physical_memory_*

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Make this device and the network packet abstractions
ready for IOMMU.

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
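A note on the writev helper touched below (vmxnet3_pci_dma_writev): it
walks an iovec list from a starting byte offset and issues one write per
fragment through the device's DMA path. The following is only a
standalone sketch of that walk, not the patch code: the write is
abstracted behind a callback so it runs outside QEMU, and all sketch_*
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical stand-in for a per-device DMA write routine. */
typedef void (*sketch_dma_write_fn)(void *dev, size_t addr,
                                    const void *buf, size_t len);

/*
 * Copy 'len' bytes, starting 'skip' bytes into the iovec list, to
 * consecutive target addresses beginning at 'target_addr'.
 * Returns the number of bytes actually written.
 */
static size_t sketch_dma_writev(void *dev, sketch_dma_write_fn write_fn,
                                const struct iovec *iov, int iovcnt,
                                size_t skip, size_t target_addr, size_t len)
{
    size_t copied = 0;

    for (int i = 0; i < iovcnt && copied < len; i++) {
        if (skip >= iov[i].iov_len) {
            /* Fragment lies entirely before the start offset. */
            skip -= iov[i].iov_len;
            continue;
        }

        size_t chunk = iov[i].iov_len - skip;
        if (chunk > len - copied) {
            chunk = len - copied;
        }

        write_fn(dev, target_addr + copied,
                 (const char *)iov[i].iov_base + skip, chunk);
        copied += chunk;
        skip = 0; /* later fragments are read from their beginning */
    }

    return copied;
}

/* Mock device for the sketch: "guest memory" is a plain byte array. */
static void sketch_mem_write(void *dev, size_t addr,
                             const void *buf, size_t len)
{
    memcpy((char *)dev + addr, buf, len);
}
```

Passing the device handle down to every write is what allows the
accesses to be translated per device once an IOMMU is involved.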
 hw/net/net_tx_pkt.c | 16 +++-
 hw/net/net_tx_pkt.h |  5 +++--
 hw/net/vmxnet3.c| 51 ++-
 3 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/hw/net/net_tx_pkt.c b/hw/net/net_tx_pkt.c
index ad2258c..dbcbe23 100644
--- a/hw/net/net_tx_pkt.c
+++ b/hw/net/net_tx_pkt.c
@@ -20,6 +20,7 @@
 #include "net/checksum.h"
 #include "net/tap.h"
 #include "net/net.h"
+#include "hw/pci/pci.h"
 
 enum {
 NET_TX_PKT_VHDR_FRAG = 0,
@@ -30,6 +31,8 @@ enum {
 
 /* TX packet private context */
 struct NetTxPkt {
+PCIDevice *pci_dev;
+
 struct virtio_net_hdr virt_hdr;
 bool has_virt_hdr;
 
@@ -54,11 +57,13 @@ struct NetTxPkt {
 bool is_loopback;
 };
 
-void net_tx_pkt_init(struct NetTxPkt **pkt, uint32_t max_frags,
-bool has_virt_hdr)
+void net_tx_pkt_init(struct NetTxPkt **pkt, PCIDevice *pci_dev,
+uint32_t max_frags, bool has_virt_hdr)
 {
 struct NetTxPkt *p = g_malloc0(sizeof *p);
 
+p->pci_dev = pci_dev;
+
 p->vec = g_malloc((sizeof *p->vec) *
 (max_frags + NET_TX_PKT_PL_START_FRAG));
 
@@ -383,7 +388,8 @@ bool net_tx_pkt_add_raw_fragment(struct NetTxPkt *pkt, hwaddr pa,
 ventry = &pkt->raw[pkt->raw_frags];
 mapped_len = len;
 
-ventry->iov_base = cpu_physical_memory_map(pa, &mapped_len, false);
+ventry->iov_base = pci_dma_map(pkt->pci_dev, pa,
+   &mapped_len, DMA_DIRECTION_TO_DEVICE);
 
 if ((ventry->iov_base != NULL) && (len == mapped_len)) {
 ventry->iov_len = mapped_len;
@@ -444,8 +450,8 @@ void net_tx_pkt_reset(struct NetTxPkt *pkt)
 assert(pkt->raw);
 for (i = 0; i < pkt->raw_frags; i++) {
 assert(pkt->raw[i].iov_base);
-cpu_physical_memory_unmap(pkt->raw[i].iov_base, pkt->raw[i].iov_len,
-  false, pkt->raw[i].iov_len);
+pci_dma_unmap(pkt->pci_dev, pkt->raw[i].iov_base, pkt->raw[i].iov_len,
+  DMA_DIRECTION_TO_DEVICE, 0);
 }
 pkt->raw_frags = 0;
 
diff --git a/hw/net/net_tx_pkt.h b/hw/net/net_tx_pkt.h
index e49772d..07b9a20 100644
--- a/hw/net/net_tx_pkt.h
+++ b/hw/net/net_tx_pkt.h
@@ -31,11 +31,12 @@ struct NetTxPkt;
  * Init function for tx packet functionality
  *
  * @pkt:packet pointer
+ * @pci_dev:PCI device processing this packet
  * @max_frags:  max tx ip fragments
  * @has_virt_hdr:   device uses virtio header.
  */
-void net_tx_pkt_init(struct NetTxPkt **pkt, uint32_t max_frags,
-bool has_virt_hdr);
+void net_tx_pkt_init(struct NetTxPkt **pkt, PCIDevice *pci_dev,
+uint32_t max_frags, bool has_virt_hdr);
 
 /**
  * Clean all tx packet resources.
diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 33cd07d..16645e6 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -802,7 +802,9 @@ vmxnet3_pop_rxc_descr(VMXNET3State *s, int qidx, uint32_t *descr_gen)
 hwaddr daddr =
 vmxnet3_ring_curr_cell_pa(&s->rxq_descr[qidx].comp_ring);
 
-cpu_physical_memory_read(daddr, &rxcd, sizeof(struct Vmxnet3_RxCompDesc));
+pci_dma_read(PCI_DEVICE(s), daddr,
+ &rxcd, sizeof(struct Vmxnet3_RxCompDesc));
+
 ring_gen = vmxnet3_ring_curr_gen(&s->rxq_descr[qidx].comp_ring);
 
 if (rxcd.gen != ring_gen) {
@@ -1023,10 +1025,11 @@ nocsum:
 }
 
 static void
-vmxnet3_physical_memory_writev(const struct iovec *iov,
-   size_t start_iov_off,
-   hwaddr target_addr,
-   size_t bytes_to_copy)
+vmxnet3_pci_dma_writev(PCIDevice *pci_dev,
+   const struct iovec *iov,
+   size_t start_iov_off,
+   hwaddr target_addr,
+   size_t bytes_to_copy)
 {
 size_t curr_off = 0;
 size_t copied = 0;
@@ -1036,9 +1039,9 @@ vmxnet3_physical_memory_writev(const struct iovec *iov,
 size_t chunk_len =
 MIN((curr_off + iov->iov_len) - start_iov_off, bytes_to_copy);
 
-cpu_physical_memory_write(target_addr + copied,
-  iov->iov_base + start_iov_off - curr_off,
-  chunk_len);
+pci_dma_write(pci_dev, target_addr + copied,
+  iov->iov_base + start_iov_off - curr_off,
+  chunk_len);
 
 copied += chunk_len;
 start_iov_off += chunk_len;
@@ -1088,15 +1091,15 @@ vmxnet3_indicate_packet(VMXNET3State *s)
 }
 
 chunk_size = MIN(bytes_left, rxd.len);
-vmxnet3_physical_memory_writev(data, bytes_copied,
-   le64_to_cpu(rxd.addr), chunk_size);
+vmxnet3_pci_dma_writev(PCI_DEVICE(s), data, bytes_copied,
+   le64_to_cpu(r

[Qemu-devel] [PATCH v6 00/17] Introduce Intel 82574 GbE Controller Emulation (e1000e)

2016-05-30 Thread Leonid Bloch
Hello All,

This is v6 of e1000e series.

For convenience, the same patches are available at:
https://github.com/daynix/qemu-e1000e/tree/e1000e-submit-v6

Best regards,
Dmitry.

Changes since v5:

1. Fixed build failure on old clang versions
2. Added patch that fixes unaligned access in pci_[set|get]_quad()
3. Rebased to the latest master

Changes since v4:

1. Rebased to the latest master (2.6.0+)

Changes since v3:

1. Various code fixes as suggested by Jason and Michael
2. Rebased to the latest master

Changes since v2:

1. Interrupt storm on latest Linux kernels fixed
2. Device unit test added
3. Introduced code sharing between e1000 and e1000e
4. Various code fixes as suggested by Jason
5. Rebased to the latest master

Changes since v1:

1. PCI_PM_CAP_VER_1_1 is defined now in include/hw/pci/pci_regs.h and
   not in include/standard-headers/linux/pci_regs.h.
2. Changes in naming and extra comments in hw/pci/pcie.c and in
   include/hw/pci/pcie.h.
3. Defining pci_dsn_ver and pci_dsn_cap static const variables in
   hw/pci/pcie.c, instead of PCI_DSN_VER and PCI_DSN_CAP symbolic
   constants in include/hw/pci/pcie_regs.h.
4. Changing the vmxnet3_device_serial_num function in hw/net/vmxnet3.c
   to avoid the cast when it is called.
5. Avoiding a preceding underscore in all the e1000e-related names.
6. Minor style changes.

===

Hello All,

This series is the final code of the e1000e device emulation that we
have developed. Please review it and consider accepting these patches
into the upstream QEMU repository.

The code stability was verified by various traffic tests using Fedora 22
Linux, and Windows Server 2012R2 guests. Also, Microsoft Hardware
Certification Kit (HCK) tests were run on a Windows Server 2012R2 guest.

There was a discussion on the possibility of code sharing between the
e1000e and the existing e1000 devices. We have reviewed the final code
for parts that may be shared between this device and the currently
available e1000 emulation. The device specifications are very different,
and almost no registers or functions were left as is from e1000. The
ring descriptor structures were changed as well, by the introduction of
extended and PS descriptors, as well as additional bits.
Additional differences stem from the fact that the e1000e device re-uses
network packet abstractions introduced by the vmxnet3 device, while the
e1000 has its own code for packet handling. BTW, it may be worth reusing
those abstractions in e1000 as well. (Following these changes the
vmxnet3 device was successfully tested for possible regressions.)

There are a few minor parts that may be shared, e.g. the default
register handlers, and the ring management functions. The total amount
of shared lines will be about 100--150, so we're not sure if it makes
sense bothering, and taking a risk of breaking e1000, which is a good,
old, and stable device.

Currently, the e1000e code is standalone w.r.t. e1000.

Please share your thoughts.

Thanks in advance,
Dmitry.

Changes since RFCv2:

1. Device functionality verified using Microsoft Hardware Certification
Test Kit (HCK)
2. Introduced a number of performance improvements
3. The code was cleaned, and rebased to the latest master
4. Patches verified with checkpatch.pl

===

Changes since RFCv1:

1. Added support for all the device features:
  - Interrupt moderation.
  - RSS.
  - Multiqueue.
2. Simulated exact PCI/PCIe configuration space layout.
3. Made fixes needed to pass Microsoft's HW certification tests (HCK).

This series is still an RFC, because the following tasks are not done
yet:

1. See which code can be shared between this device and the existing
e1000 device.
2. Rebase patches to the latest master (current base is v2.3.0).

Please share your thoughts,
Thanks, Dmitry.

===

Hello qemu-devel,

This patch series is an RFC for the new networking device emulation
we're developing for QEMU.

This new device emulates the Intel 82574 GbE Controller and works
with unmodified Intel e1000e drivers from the Linux/Windows kernels.

The status of the current series is "Functional Device Ready, work
on Extended Features in Progress".

More precisely, these patches represent a functional device, which
is recognized by the standard Intel drivers, and is able to transfer
TX/RX packets with CSO/TSO offloads, according to the spec.

Extended features not supported yet (work in progress):
  1. TX/RX Interrupt moderation mechanisms
  2. RSS
  3. Full-featured multi-queue (use of multiqueued network backend)

Also, there will be some code refactoring and performance
optimization efforts.

This series was tested on Linux (Fedora 22) and Windows (2012R2)
guests, using Iperf, with TX/RX and TCP/UDP streams, and various
packet sizes.

More thorough testing, including data streams with different MTU
sizes, and Microsoft Certification (HLK) tests, are pending missing
features' development.

See commit messages (esp. "net: Intro

Re: [Qemu-devel] [PATCH v7 07/25] intel_iommu: define several structs for IOMMU IR

2016-05-30 Thread David Kiarie
On Mon, May 30, 2016 at 12:16 PM, Peter Xu  wrote:
> On Mon, May 30, 2016 at 11:54:52AM +0300, David Kiarie wrote:
>> On Mon, May 30, 2016 at 11:14 AM, Peter Xu  wrote:
>> > On Mon, May 30, 2016 at 07:56:16AM +0200, Jan Kiszka wrote:
>> >> On 2016-05-30 07:45, Peter Xu wrote:
> [...]
>> >> >
>> >> > I assume you mean when host cpu is big endian. x86 was little endian,
>> >> > and I was testing on x86.
>> >> >
>> >> > I think you are right. I should do conditional byte swap for all
>> >> > uint{16/32/64} cases within the fields. For example, index_l field in
>> >> > above VTD_IR_MSIAddress. And there are several other cases that need
>> >> > special treatment in the patchset. Will go over and fix corresponding
>> >> > issues in next version.
>> >>
>> >> You actually need bit-swap with bit fields, see e.g. hw/net/vmxnet3.h.
>> >
>> > Not noticed about bit-field ordering before... So maybe I need both?
>>
>> Yes, I think we will need both though, I think, byte swapping the
>> whole struct will break the code but swapping individual fields is
>> what we need.
>>
>> Myself, I'm defining bitfields as below:
>>
>>   struct CMDCompletionWait {
>>
>> #ifdef __BIG_ENDIAN_BITFIELD
>> uint32_t type:4;   /* command type   */
>> uint32_t reserved:8;
>> uint64_t store_addr:49;/* addr to write  */
>> uint32_t completion_flush:1;   /* allow more executions  */
>> uint32_t completion_int:1; /* set MMIOWAITINT*/
>> uint32_t completion_store:1;   /* write data to address  */
>
> I guess what we need might be this one:
>
>   uint64_t type:4;   /* command type   */
>   uint64_t reserved:8;
>   uint64_t store_addr:49;/* addr to write  */
>   uint64_t completion_flush:1;   /* allow more executions  */
>   uint64_t completion_int:1; /* set MMIOWAITINT*/
>   uint64_t completion_store:1;   /* write data to address  */
>
> IIUC, if we define type:4 as uint32_t rather than uint64_t, it should
> be bits [29:32] of the struct on big endian machines, not bits
> [61:64].

Yes, you're right.

>
> Thanks,
>
> -- peterx



[Qemu-devel] [PATCH] iostatus: fix comments for block_job_iostatus_reset

2016-05-30 Thread Changlong Xie
Signed-off-by: Changlong Xie 
---
 include/block/blockjob.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 86d2807..00ac418 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -397,7 +397,7 @@ int block_job_complete_sync(BlockJob *job, Error **errp);
  * @job: The job whose I/O status should be reset.
  *
  * Reset I/O status on @job and on BlockDriverState objects it uses,
- * other than job->bs.
+ * other than job->blk.
  */
 void block_job_iostatus_reset(BlockJob *job);
 
-- 
1.9.3






[Qemu-devel] [PATCH v6 07/17] net: Introduce Toeplitz hash calculator

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 include/net/checksum.h | 45 +
 1 file changed, 45 insertions(+)

diff --git a/include/net/checksum.h b/include/net/checksum.h
index 7de1acb..dd8b4f6 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -18,6 +18,7 @@
 #ifndef QEMU_NET_CHECKSUM_H
 #define QEMU_NET_CHECKSUM_H
 
+#include "qemu/bswap.h"
 struct iovec;
 
 uint32_t net_checksum_add_cont(int len, uint8_t *buf, int seq);
@@ -50,4 +51,48 @@ uint32_t net_checksum_add_iov(const struct iovec *iov,
   const unsigned int iov_cnt,
   uint32_t iov_off, uint32_t size);
 
+typedef struct toeplitz_key_st {
+uint32_t leftmost_32_bits;
+uint8_t *next_byte;
+} net_toeplitz_key;
+
+static inline
+void net_toeplitz_key_init(net_toeplitz_key *key, uint8_t *key_bytes)
+{
+key->leftmost_32_bits = be32_to_cpu(*(uint32_t *)key_bytes);
+key->next_byte = key_bytes + sizeof(uint32_t);
+}
+
+static inline
+void net_toeplitz_add(uint32_t *result,
+  uint8_t *input,
+  uint32_t len,
+  net_toeplitz_key *key)
+{
+register uint32_t accumulator = *result;
+register uint32_t leftmost_32_bits = key->leftmost_32_bits;
+register uint32_t byte;
+
+for (byte = 0; byte < len; byte++) {
+register uint8_t input_byte = input[byte];
+register uint8_t key_byte = *(key->next_byte++);
+register uint8_t bit;
+
+for (bit = 0; bit < 8; bit++) {
+if (input_byte & (1 << 7)) {
+accumulator ^= leftmost_32_bits;
+}
+
+leftmost_32_bits =
+(leftmost_32_bits << 1) | ((key_byte & (1 << 7)) >> 7);
+
+input_byte <<= 1;
+key_byte <<= 1;
+}
+}
+
+key->leftmost_32_bits = leftmost_32_bits;
+*result = accumulator;
+}
+
 #endif /* QEMU_NET_CHECKSUM_H */
-- 
2.5.5
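
The hash loop in the patch above can be exercised outside QEMU with a
small standalone sketch. This is not the patch's code: toeplitz_hash()
and its key_len parameter are hypothetical names, and the first 32 key
bits are assembled byte by byte instead of through a uint32_t cast, on
the assumption that the key buffer may be unaligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical standalone helper (not part of the patch): computes a
 * Toeplitz hash over `len` input bytes. The key must hold at least
 * len + 4 bytes, because the 32-bit window slides by one key byte for
 * every input byte.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *input, size_t len)
{
    uint32_t window;
    uint32_t hash = 0;
    size_t i;
    int bit;

    assert(key_len >= len + 4);

    /* Leftmost 32 key bits, read big-endian without an unaligned load. */
    window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
             ((uint32_t)key[2] << 8) | (uint32_t)key[3];

    for (i = 0; i < len; i++) {
        uint8_t in = input[i];
        uint8_t next = key[4 + i];

        for (bit = 0; bit < 8; bit++) {
            /* XOR in the current window for every set input bit. */
            if (in & 0x80) {
                hash ^= window;
            }
            /* Slide the window left by one bit of the key. */
            window = (window << 1) | (uint32_t)(next >> 7);
            in <<= 1;
            next <<= 1;
        }
    }
    return hash;
}
```

Two properties make this easy to sanity-check: an input whose only set
bit is the very first one hashes to the first 32 bits of the key, and
the hash of the XOR of two equal-length inputs is the XOR of their
hashes.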




[Qemu-devel] [PATCH v6 05/17] pcie: Introduce function for DSN capability creation

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 hw/pci/pcie.c | 10 ++
 include/hw/pci/pcie.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 24cfc3b..9599fde 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -695,3 +695,13 @@ void pcie_ari_init(PCIDevice *dev, uint16_t offset, 
uint16_t nextfn)
 offset, PCI_ARI_SIZEOF);
 pci_set_long(dev->config + offset + PCI_ARI_CAP, (nextfn & 0xff) << 8);
 }
+
+void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num)
+{
+static const int pci_dsn_ver = 1;
+static const int pci_dsn_cap = 4;
+
+pcie_add_capability(dev, PCI_EXT_CAP_ID_DSN, pci_dsn_ver, offset,
+PCI_EXT_CAP_DSN_SIZEOF);
+pci_set_quad(dev->config + offset + pci_dsn_cap, ser_num);
+}
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index cbbf0c5..056d25e 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -119,6 +119,7 @@ void pcie_add_capability(PCIDevice *dev,
  uint16_t offset, uint16_t size);
 
 void pcie_ari_init(PCIDevice *dev, uint16_t offset, uint16_t nextfn);
+void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num);
 
 extern const VMStateDescription vmstate_pcie_device;
 
-- 
2.5.5




[Qemu-devel] [PATCH 0/2] Block migration: Convert to BlockBackend

2016-05-30 Thread Kevin Wolf
Users outside of the block layer shouldn't directly use BlockDriverState for
issuing their I/O requests, but go through a BlockBackend to do so. Block
migration ('migrate -b') is (one of?) the last remaining users that need to be
converted.

Kevin Wolf (2):
  migration/block: Convert load to BlockBackend
  migration/block: Convert saving to BlockBackend

 migration/block.c | 147 --
 1 file changed, 88 insertions(+), 59 deletions(-)

-- 
1.8.3.1




[Qemu-devel] [PATCH v6 11/17] rtl8139: Move more TCP definitions to common header

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 hw/net/rtl8139.c  | 5 -
 include/net/eth.h | 8 
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/hw/net/rtl8139.c b/hw/net/rtl8139.c
index 1e5ec14..562c1fd 100644
--- a/hw/net/rtl8139.c
+++ b/hw/net/rtl8139.c
@@ -1867,11 +1867,6 @@ static int rtl8139_transmit_one(RTL8139State *s, int 
descriptor)
 return 1;
 }
 
-/* structures and macros for task offloading */
-#define TCP_HEADER_DATA_OFFSET(tcp) (((be16_to_cpu(tcp->th_offset_flags) >> 
12)&0xf) << 2)
-#define TCP_FLAGS_ONLY(flags) ((flags)&0x3f)
-#define TCP_HEADER_FLAGS(tcp) TCP_FLAGS_ONLY(be16_to_cpu(tcp->th_offset_flags))
-
 #define TCP_HEADER_CLEAR_FLAGS(tcp, off) ((tcp)->th_offset_flags &= 
cpu_to_be16(~TCP_FLAGS_ONLY(off)))
 
 /* produces ones' complement sum of data */
diff --git a/include/net/eth.h b/include/net/eth.h
index 18d0be3..5a32259 100644
--- a/include/net/eth.h
+++ b/include/net/eth.h
@@ -67,6 +67,14 @@ typedef struct tcp_header {
 uint16_t th_urp;/* urgent pointer */
 } tcp_header;
 
+#define TCP_FLAGS_ONLY(flags) ((flags) & 0x3f)
+
+#define TCP_HEADER_FLAGS(tcp) \
+TCP_FLAGS_ONLY(be16_to_cpu((tcp)->th_offset_flags))
+
+#define TCP_HEADER_DATA_OFFSET(tcp) \
+(((be16_to_cpu((tcp)->th_offset_flags) >> 12) & 0xf) << 2)
+
 typedef struct udp_header {
 uint16_t uh_sport; /* source port */
 uint16_t uh_dport; /* destination port */
-- 
2.5.5
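
As a rough illustration of what the moved macros compute, the sketch
below decodes the 16-bit offset/flags word of a TCP header from raw
bytes. The helper names (load_be16, tcp_header_len, tcp_flags) are
hypothetical and stand in for be16_to_cpu() and the macros above; only
the shift and mask values are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same mask as in the patch: the low 6 bits carry the TCP flags. */
#define TCP_FLAGS_ONLY(flags) ((flags) & 0x3f)

/* Hypothetical helper: read a big-endian 16-bit value from two bytes. */
static uint16_t load_be16(const uint8_t *p)
{
    assert(p != NULL);
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Header length in bytes: the top 4 bits count 32-bit words. */
static unsigned tcp_header_len(const uint8_t *offset_flags)
{
    return ((load_be16(offset_flags) >> 12) & 0xf) << 2;
}

/* Flag bits (FIN, SYN, RST, PSH, ACK, URG) from the same word. */
static unsigned tcp_flags(const uint8_t *offset_flags)
{
    return TCP_FLAGS_ONLY(load_be16(offset_flags));
}
```

For the bytes 0x50 0x12 this yields a 20-byte header with SYN and ACK
set (0x12).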




[Qemu-devel] [PATCH v6 09/17] vmxnet3: Use common MAC address tracing macros

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 hw/net/vmxnet3.c  | 8 
 hw/net/vmxnet_debug.h | 3 ---
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 586e915..200d2ea 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -474,7 +474,7 @@ static void vmxnet3_set_variable_mac(VMXNET3State *s, 
uint32_t h, uint32_t l)
 s->conf.macaddr.a[4] = VMXNET3_GET_BYTE(h, 0);
 s->conf.macaddr.a[5] = VMXNET3_GET_BYTE(h, 1);
 
-VMW_CFPRN("Variable MAC: " VMXNET_MF, VMXNET_MA(s->conf.macaddr.a));
+VMW_CFPRN("Variable MAC: " MAC_FMT, MAC_ARG(s->conf.macaddr.a));
 
 qemu_format_nic_info_str(qemu_get_queue(s->nic), s->conf.macaddr.a);
 }
@@ -1219,7 +1219,7 @@ static void vmxnet3_reset_interrupt_states(VMXNET3State 
*s)
 static void vmxnet3_reset_mac(VMXNET3State *s)
 {
 memcpy(&s->conf.macaddr.a, &s->perm_mac.a, sizeof(s->perm_mac.a));
-VMW_CFPRN("MAC address set to: " VMXNET_MF, VMXNET_MA(s->conf.macaddr.a));
+VMW_CFPRN("MAC address set to: " MAC_FMT, MAC_ARG(s->conf.macaddr.a));
 }
 
 static void vmxnet3_deactivate_device(VMXNET3State *s)
@@ -1301,7 +1301,7 @@ static void vmxnet3_update_mcast_filters(VMXNET3State *s)
 cpu_physical_memory_read(mcast_list_pa, s->mcast_list, list_bytes);
 VMW_CFPRN("Current multicast list len is %d:", s->mcast_list_len);
 for (i = 0; i < s->mcast_list_len; i++) {
-VMW_CFPRN("\t" VMXNET_MF, VMXNET_MA(s->mcast_list[i].a));
+VMW_CFPRN("\t" MAC_FMT, MAC_ARG(s->mcast_list[i].a));
 }
 }
 }
@@ -2102,7 +2102,7 @@ static void vmxnet3_net_init(VMXNET3State *s)
 
 s->link_status_and_speed = VMXNET3_LINK_SPEED | VMXNET3_LINK_STATUS_UP;
 
-VMW_CFPRN("Permanent MAC: " VMXNET_MF, VMXNET_MA(s->perm_mac.a));
+VMW_CFPRN("Permanent MAC: " MAC_FMT, MAC_ARG(s->perm_mac.a));
 
 s->nic = qemu_new_nic(&net_vmxnet3_info, &s->conf,
   object_get_typename(OBJECT(s)),
diff --git a/hw/net/vmxnet_debug.h b/hw/net/vmxnet_debug.h
index 96495db..5aab00b 100644
--- a/hw/net/vmxnet_debug.h
+++ b/hw/net/vmxnet_debug.h
@@ -142,7 +142,4 @@
 } \
 } while (0)
 
-#define VMXNET_MF   "%02X:%02X:%02X:%02X:%02X:%02X"
-#define VMXNET_MA(a)(a)[0], (a)[1], (a)[2], (a)[3], (a)[4], (a)[5]
-
 #endif /* _QEMU_VMXNET3_DEBUG_H  */
-- 
2.5.5




[Qemu-devel] [PATCH v6 04/17] pcie: Add support for PCIe CAP v1

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Added support for PCIe CAP v1, while reusing some of the existing v2
infrastructure.

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 hw/pci/pcie.c  | 84 --
 include/hw/pci/pcie.h  |  4 +++
 include/hw/pci/pcie_regs.h |  5 +--
 3 files changed, 73 insertions(+), 20 deletions(-)

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index 728386a..24cfc3b 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -43,26 +43,15 @@
 /***
  * pci express capability helper functions
  */
-int pcie_cap_init(PCIDevice *dev, uint8_t offset, uint8_t type, uint8_t port)
-{
-int pos;
-uint8_t *exp_cap;
-
-assert(pci_is_express(dev));
-
-pos = pci_add_capability(dev, PCI_CAP_ID_EXP, offset,
- PCI_EXP_VER2_SIZEOF);
-if (pos < 0) {
-return pos;
-}
-dev->exp.exp_cap = pos;
-exp_cap = dev->config + pos;
 
+static void
+pcie_cap_v1_fill(uint8_t *exp_cap, uint8_t port, uint8_t type, uint8_t version)
+{
 /* capability register
-   interrupt message number defaults to 0 */
+interrupt message number defaults to 0 */
 pci_set_word(exp_cap + PCI_EXP_FLAGS,
  ((type << PCI_EXP_FLAGS_TYPE_SHIFT) & PCI_EXP_FLAGS_TYPE) |
- PCI_EXP_FLAGS_VER2);
+ version);
 
 /* device capability register
  * table 7-12:
@@ -81,7 +70,27 @@ int pcie_cap_init(PCIDevice *dev, uint8_t offset, uint8_t 
type, uint8_t port)
 
 pci_set_word(exp_cap + PCI_EXP_LNKSTA,
  PCI_EXP_LNK_MLW_1 | PCI_EXP_LNK_LS_25 |PCI_EXP_LNKSTA_DLLLA);
+}
+
+int pcie_cap_init(PCIDevice *dev, uint8_t offset, uint8_t type, uint8_t port)
+{
+/* PCIe cap v2 init */
+int pos;
+uint8_t *exp_cap;
+
+assert(pci_is_express(dev));
+
+pos = pci_add_capability(dev, PCI_CAP_ID_EXP, offset, PCI_EXP_VER2_SIZEOF);
+if (pos < 0) {
+return pos;
+}
+dev->exp.exp_cap = pos;
+exp_cap = dev->config + pos;
+
+/* Filling values common with v1 */
+pcie_cap_v1_fill(exp_cap, port, type, PCI_EXP_FLAGS_VER2);
 
+/* Filling v2 specific values */
 pci_set_long(exp_cap + PCI_EXP_DEVCAP2,
  PCI_EXP_DEVCAP2_EFF | PCI_EXP_DEVCAP2_EETLPP);
 
@@ -89,7 +98,29 @@ int pcie_cap_init(PCIDevice *dev, uint8_t offset, uint8_t 
type, uint8_t port)
 return pos;
 }
 
-int pcie_endpoint_cap_init(PCIDevice *dev, uint8_t offset)
+int pcie_cap_v1_init(PCIDevice *dev, uint8_t offset, uint8_t type,
+ uint8_t port)
+{
+/* PCIe cap v1 init */
+int pos;
+uint8_t *exp_cap;
+
+assert(pci_is_express(dev));
+
+pos = pci_add_capability(dev, PCI_CAP_ID_EXP, offset, PCI_EXP_VER1_SIZEOF);
+if (pos < 0) {
+return pos;
+}
+dev->exp.exp_cap = pos;
+exp_cap = dev->config + pos;
+
+pcie_cap_v1_fill(exp_cap, port, type, PCI_EXP_FLAGS_VER1);
+
+return pos;
+}
+
+static int
+pcie_endpoint_cap_common_init(PCIDevice *dev, uint8_t offset, uint8_t cap_size)
 {
 uint8_t type = PCI_EXP_TYPE_ENDPOINT;
 
@@ -102,7 +133,19 @@ int pcie_endpoint_cap_init(PCIDevice *dev, uint8_t offset)
 type = PCI_EXP_TYPE_RC_END;
 }
 
-return pcie_cap_init(dev, offset, type, 0);
+return (cap_size == PCI_EXP_VER1_SIZEOF)
+? pcie_cap_v1_init(dev, offset, type, 0)
+: pcie_cap_init(dev, offset, type, 0);
+}
+
+int pcie_endpoint_cap_init(PCIDevice *dev, uint8_t offset)
+{
+return pcie_endpoint_cap_common_init(dev, offset, PCI_EXP_VER2_SIZEOF);
+}
+
+int pcie_endpoint_cap_v1_init(PCIDevice *dev, uint8_t offset)
+{
+return pcie_endpoint_cap_common_init(dev, offset, PCI_EXP_VER1_SIZEOF);
 }
 
 void pcie_cap_exit(PCIDevice *dev)
@@ -110,6 +153,11 @@ void pcie_cap_exit(PCIDevice *dev)
 pci_del_capability(dev, PCI_CAP_ID_EXP, PCI_EXP_VER2_SIZEOF);
 }
 
+void pcie_cap_v1_exit(PCIDevice *dev)
+{
+pci_del_capability(dev, PCI_CAP_ID_EXP, PCI_EXP_VER1_SIZEOF);
+}
+
 uint8_t pcie_cap_get_type(const PCIDevice *dev)
 {
 uint32_t pos = dev->exp.exp_cap;
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index b48a7a2..cbbf0c5 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -80,8 +80,12 @@ struct PCIExpressDevice {
 
 /* PCI express capability helper functions */
 int pcie_cap_init(PCIDevice *dev, uint8_t offset, uint8_t type, uint8_t port);
+int pcie_cap_v1_init(PCIDevice *dev, uint8_t offset,
+ uint8_t type, uint8_t port);
 int pcie_endpoint_cap_init(PCIDevice *dev, uint8_t offset);
 void pcie_cap_exit(PCIDevice *dev);
+int pcie_endpoint_cap_v1_init(PCIDevice *dev, uint8_t offset);
+void pcie_cap_v1_exit(PCIDevice *dev);
 uint8_t pcie_cap_get_type(const PCIDevice *dev);
 void pcie_cap_flags_set_vector(PCIDevice *dev, uint8_t vector);
 uint8_t pcie_cap_flags_get_vector(PCIDevice *dev);
diff --git a/include/hw/pci/pcie_regs.h b/include

[Qemu-devel] [PATCH v6 06/17] vmxnet3: Use generic function for DSN capability definition

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 hw/net/vmxnet3.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/hw/net/vmxnet3.c b/hw/net/vmxnet3.c
index 20f26b7..586e915 100644
--- a/hw/net/vmxnet3.c
+++ b/hw/net/vmxnet3.c
@@ -2255,9 +2255,9 @@ static const MemoryRegionOps b1_ops = {
 },
 };
 
-static uint8_t *vmxnet3_device_serial_num(VMXNET3State *s)
+static uint64_t vmxnet3_device_serial_num(VMXNET3State *s)
 {
-static uint64_t dsn_payload;
+uint64_t dsn_payload;
 uint8_t *dsnp = (uint8_t *)&dsn_payload;
 
 dsnp[0] = 0xfe;
@@ -2268,7 +2268,7 @@ static uint8_t *vmxnet3_device_serial_num(VMXNET3State *s)
 dsnp[5] = s->conf.macaddr.a[1];
 dsnp[6] = s->conf.macaddr.a[2];
 dsnp[7] = 0xff;
-return dsnp;
+return dsn_payload;
 }
 
 static void vmxnet3_pci_realize(PCIDevice *pci_dev, Error **errp)
@@ -2313,10 +2313,8 @@ static void vmxnet3_pci_realize(PCIDevice *pci_dev, 
Error **errp)
 pcie_endpoint_cap_init(pci_dev, VMXNET3_EXP_EP_OFFSET);
 }
 
-pcie_add_capability(pci_dev, PCI_EXT_CAP_ID_DSN, 0x1,
-VMXNET3_DSN_OFFSET, PCI_EXT_CAP_DSN_SIZEOF);
-memcpy(pci_dev->config + VMXNET3_DSN_OFFSET + 4,
-   vmxnet3_device_serial_num(s), sizeof(uint64_t));
+pcie_dev_ser_num_init(pci_dev, VMXNET3_DSN_OFFSET,
+  vmxnet3_device_serial_num(s));
 }
 
 register_savevm(dev, "vmxnet3-msix", -1, 1,
-- 
2.5.5




[Qemu-devel] [PATCH v6 08/17] net: Add macros for MAC address tracing

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

These macros will be used by future commits introducing
e1000e device emulation and by vmxnet3 tracing code.

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 include/net/net.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index 73e4c46..129d46b 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -9,6 +9,11 @@
 #include "migration/vmstate.h"
 #include "qapi-types.h"
 
+#define MAC_FMT "%02X:%02X:%02X:%02X:%02X:%02X"
+#define MAC_ARG(x) ((uint8_t *)(x))[0], ((uint8_t *)(x))[1], \
+   ((uint8_t *)(x))[2], ((uint8_t *)(x))[3], \
+   ((uint8_t *)(x))[4], ((uint8_t *)(x))[5]
+
 #define MAX_QUEUE_NUM 1024
 
 /* Maximum GSO packet size (64k) plus plenty of room for
-- 
2.5.5
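
A minimal usage sketch for the two macros, assuming they are defined
exactly as in the patch above. format_mac() is a hypothetical helper
added only for illustration; in QEMU the macros would normally be
passed straight to a tracing or printf-style call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAC_FMT "%02X:%02X:%02X:%02X:%02X:%02X"
#define MAC_ARG(x) ((uint8_t *)(x))[0], ((uint8_t *)(x))[1], \
                   ((uint8_t *)(x))[2], ((uint8_t *)(x))[3], \
                   ((uint8_t *)(x))[4], ((uint8_t *)(x))[5]

/*
 * Hypothetical helper: formats a 6-byte MAC address into buf.
 * The output is 17 characters plus the terminating NUL.
 */
static int format_mac(char *buf, size_t size, uint8_t *mac)
{
    assert(buf != NULL && mac != NULL && size >= 18);
    return snprintf(buf, size, MAC_FMT, MAC_ARG(mac));
}
```

MAC_FMT supplies six conversion specifiers and MAC_ARG expands to the
six matching arguments, so the pair must always be used together.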




[Qemu-devel] [PATCH v6 03/17] pci: Introduce define for PM capability version 1.1

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 include/hw/pci/pci_regs.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
index ba8cbe9..7a83142 100644
--- a/include/hw/pci/pci_regs.h
+++ b/include/hw/pci/pci_regs.h
@@ -1 +1,3 @@
 #include "standard-headers/linux/pci_regs.h"
+
+#define  PCI_PM_CAP_VER_1_1 0x0002  /* PCI PM spec ver. 1.1 */
-- 
2.5.5




[Qemu-devel] [PATCH v6 01/17] pci: fix unaligned access in pci_xxx_quad()

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

Replace legacy cpu_to_le64w()/le64_to_cpup()
calls with stq_le_p()/ldq_le_p().

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 include/hw/pci/pci.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index ef6ba51..ee238ad 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -468,13 +468,13 @@ pci_get_long(const uint8_t *config)
 static inline void
 pci_set_quad(uint8_t *config, uint64_t val)
 {
-cpu_to_le64w((uint64_t *)config, val);
+stq_le_p(config, val);
 }
 
 static inline uint64_t
 pci_get_quad(const uint8_t *config)
 {
-return le64_to_cpup((const uint64_t *)config);
+return ldq_le_p(config);
 }
 
 static inline void
-- 
2.5.5
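
The reason the old code could fault is that casting a config-space byte
pointer to uint64_t * assumes 8-byte alignment. The sketch below shows
one alignment-safe way to store and load a little-endian 64-bit value.
It is only an illustration of the idea: store_le64() and load_le64()
are hypothetical names, and this is not how stq_le_p()/ldq_le_p() are
implemented in QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store val at p in little-endian order, one byte at a time. */
static void store_le64(uint8_t *p, uint64_t val)
{
    int i;

    assert(p != NULL);
    for (i = 0; i < 8; i++) {
        p[i] = (uint8_t)(val >> (8 * i));
    }
}

/* Load a little-endian 64-bit value from p, one byte at a time. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t val = 0;
    int i;

    assert(p != NULL);
    for (i = 0; i < 8; i++) {
        val |= (uint64_t)p[i] << (8 * i);
    }
    return val;
}
```

Because only byte accesses are used, the pointer may sit at any offset,
and the result does not depend on host endianness.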




[Qemu-devel] [PATCH v6 02/17] msix: make msix_clr_pending() visible for clients

2016-05-30 Thread Leonid Bloch
From: Dmitry Fleytman 

This function will be used by e1000e device code.

Signed-off-by: Dmitry Fleytman 
Signed-off-by: Leonid Bloch 
---
 hw/pci/msix.c | 2 +-
 include/hw/pci/msix.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index b75f0e9..0ec1cb1 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -72,7 +72,7 @@ void msix_set_pending(PCIDevice *dev, unsigned int vector)
 *msix_pending_byte(dev, vector) |= msix_pending_mask(vector);
 }
 
-static void msix_clr_pending(PCIDevice *dev, int vector)
+void msix_clr_pending(PCIDevice *dev, int vector)
 {
 *msix_pending_byte(dev, vector) &= ~msix_pending_mask(vector);
 }
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 72e5f93..048a29d 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -29,6 +29,7 @@ int msix_present(PCIDevice *dev);
 
 bool msix_is_masked(PCIDevice *dev, unsigned vector);
 void msix_set_pending(PCIDevice *dev, unsigned vector);
+void msix_clr_pending(PCIDevice *dev, int vector);
 
 int msix_vector_use(PCIDevice *dev, unsigned vector);
 void msix_vector_unuse(PCIDevice *dev, unsigned vector);
-- 
2.5.5
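
For readers unfamiliar with the pending bit array that
msix_set_pending() and msix_clr_pending() operate on, here is a
simplified sketch. It assumes one pending bit per vector, with vector N
stored in byte N / 8 at bit N % 8. All names and the PBA_VECTORS limit
are hypothetical; the real functions take a PCIDevice and are not
reproduced here.

```c
#include <assert.h>
#include <stdint.h>

#define PBA_VECTORS 64  /* hypothetical table size for this sketch */

/* Bit mask of a vector within its byte of the pending bit array. */
static uint8_t pending_mask(unsigned vector)
{
    return (uint8_t)(1u << (vector % 8));
}

/* Address of the byte that holds the vector's pending bit. */
static uint8_t *pending_byte(uint8_t *pba, unsigned vector)
{
    assert(vector < PBA_VECTORS);
    return pba + vector / 8;
}

static void set_pending(uint8_t *pba, unsigned vector)
{
    *pending_byte(pba, vector) |= pending_mask(vector);
}

static void clr_pending(uint8_t *pba, unsigned vector)
{
    *pending_byte(pba, vector) &= (uint8_t)~pending_mask(vector);
}

static int is_pending(uint8_t *pba, unsigned vector)
{
    return (*pending_byte(pba, vector) & pending_mask(vector)) != 0;
}
```

Clearing one vector leaves the other bits of the same byte untouched,
which is what a device model needs when it acknowledges a single
interrupt cause.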




[Qemu-devel] [PATCH] block/io: Remove unused bdrv_aio_write_zeroes()

2016-05-30 Thread Kevin Wolf
Signed-off-by: Kevin Wolf 
---
 block/io.c| 11 ---
 include/block/block.h |  3 ---
 trace-events  |  1 -
 3 files changed, 15 deletions(-)

diff --git a/block/io.c b/block/io.c
index 2d832aa..7ac9897 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1865,17 +1865,6 @@ BlockAIOCB *bdrv_aio_writev(BlockDriverState *bs, 
int64_t sector_num,
  cb, opaque, true);
 }
 
-BlockAIOCB *bdrv_aio_write_zeroes(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, BdrvRequestFlags flags,
-BlockCompletionFunc *cb, void *opaque)
-{
-trace_bdrv_aio_write_zeroes(bs, sector_num, nb_sectors, flags, opaque);
-
-return bdrv_co_aio_rw_vector(bs, sector_num, NULL, nb_sectors,
- BDRV_REQ_ZERO_WRITE | flags,
- cb, opaque, true);
-}
-
 void bdrv_aio_cancel(BlockAIOCB *acb)
 {
 qemu_aio_ref(acb);
diff --git a/include/block/block.h b/include/block/block.h
index 70ea299..d6bb74d 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -229,9 +229,6 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
const uint8_t *buf, int nb_sectors);
 int bdrv_write_zeroes(BlockDriverState *bs, int64_t sector_num,
int nb_sectors, BdrvRequestFlags flags);
-BlockAIOCB *bdrv_aio_write_zeroes(BlockDriverState *bs, int64_t sector_num,
-  int nb_sectors, BdrvRequestFlags flags,
-  BlockCompletionFunc *cb, void *opaque);
 int bdrv_make_zero(BlockDriverState *bs, BdrvRequestFlags flags);
 int bdrv_pread(BlockDriverState *bs, int64_t offset,
void *buf, int count);
diff --git a/trace-events b/trace-events
index b27d1da..1ca886e 100644
--- a/trace-events
+++ b/trace-events
@@ -70,7 +70,6 @@ bdrv_aio_discard(void *bs, int64_t sector_num, int 
nb_sectors, void *opaque) "bs
 bdrv_aio_flush(void *bs, void *opaque) "bs %p opaque %p"
 bdrv_aio_readv(void *bs, int64_t sector_num, int nb_sectors, void *opaque) "bs 
%p sector_num %"PRId64" nb_sectors %d opaque %p"
 bdrv_aio_writev(void *bs, int64_t sector_num, int nb_sectors, void *opaque) 
"bs %p sector_num %"PRId64" nb_sectors %d opaque %p"
-bdrv_aio_write_zeroes(void *bs, int64_t sector_num, int nb_sectors, int flags, 
void *opaque) "bs %p sector_num %"PRId64" nb_sectors %d flags %#x opaque %p"
 bdrv_co_readv(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num 
%"PRId64" nb_sectors %d"
 bdrv_co_writev(void *bs, int64_t sector_num, int nb_sector) "bs %p sector_num 
%"PRId64" nb_sectors %d"
 bdrv_co_write_zeroes(void *bs, int64_t sector_num, int nb_sector, int flags) 
"bs %p sector_num %"PRId64" nb_sectors %d flags %#x"
-- 
1.8.3.1




Re: [Qemu-devel] [PATCH v7 07/25] intel_iommu: define several structs for IOMMU IR

2016-05-30 Thread Peter Xu
On Mon, May 30, 2016 at 11:54:52AM +0300, David Kiarie wrote:
> On Mon, May 30, 2016 at 11:14 AM, Peter Xu  wrote:
> > On Mon, May 30, 2016 at 07:56:16AM +0200, Jan Kiszka wrote:
> >> On 2016-05-30 07:45, Peter Xu wrote:
[...]
> >> >
> >> > I assume you mean when host cpu is big endian. x86 was little endian,
> >> > and I was testing on x86.
> >> >
> >> > I think you are right. I should do conditional byte swap for all
> >> > uint{16/32/64} cases within the fields. For example, index_l field in
> >> > above VTD_IR_MSIAddress. And there are several other cases that need
> >> > special treatment in the patchset. Will go over and fix corresponding
> >> > issues in next version.
> >>
> >> You actually need bit-swap with bit fields, see e.g. hw/net/vmxnet3.h.
> >
> > Not noticed about bit-field ordering before... So maybe I need both?
> 
> Yes, I think we will need both though, I think, byte swapping the
> whole struct will break the code but swapping individual fields is
> what we need.
> 
> Myself, I'm defining bitfields as below:
> 
>   struct CMDCompletionWait {
> 
> #ifdef __BIG_ENDIAN_BITFIELD
> uint32_t type:4;   /* command type   */
> uint32_t reserved:8;
> uint64_t store_addr:49;/* addr to write  */
> uint32_t completion_flush:1;   /* allow more executions  */
> uint32_t completion_int:1; /* set MMIOWAITINT*/
> uint32_t completion_store:1;   /* write data to address  */

I guess what we need might be this one:

  uint64_t type:4;   /* command type   */
  uint64_t reserved:8;
  uint64_t store_addr:49;/* addr to write  */
  uint64_t completion_flush:1;   /* allow more executions  */
  uint64_t completion_int:1; /* set MMIOWAITINT*/
  uint64_t completion_store:1;   /* write data to address  */

IIUC, if we define type:4 as uint32_t rather than uint64_t, it should
be bits [29:32] of the struct on big endian machines, not bits
[61:64].

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH 02/10] qcow2: add qcow2_co_write_compressed

2016-05-30 Thread Pavel Butsykin

On 27.05.2016 20:33, Stefan Hajnoczi wrote:

On Sat, May 14, 2016 at 03:45:50PM +0300, Denis V. Lunev wrote:

+qemu_co_mutex_lock(&s->lock);
+cluster_offset = \
+qcow2_alloc_compressed_cluster_offset(bs, sector_num << 9, out_len);


The backslash isn't necessary for wrapping lines in C.  This kind of
thing is only necessary in languages like Python where the grammar is
whitespace sensitive.

The C compiler is happy with an arbitrary amount of whitespace
(newlines) in the middle of a statement.  The backslash in C is handled
by the preprocessor: it joins the line.  That's useful for macro
definitions where you need to tell the preprocessor that several lines
belong to one macro definition.  But it's not needed for normal C code.


Thanks for the explanation, but the backslash is meant more as a marker
of a line break for the human reader. The current coding style doesn't
cover this point, but I can remove the backslash, because I don't think
it's something important :)
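
The distinction discussed above can be shown in a few lines. This is an
illustrative sketch only: CLAMP and sector_to_offset() are made-up
names, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * A macro definition ends at the end of the line, so a multi-line
 * macro needs a backslash on every line except the last one.
 */
#define CLAMP(x, lo, hi) \
    ((x) < (lo) ? (lo) : \
     (x) > (hi) ? (hi) : (x))

static uint64_t sector_to_offset(uint64_t sector_num)
{
    uint64_t offset;

    /* The shift below must not overflow. */
    assert(sector_num <= (UINT64_MAX >> 9));

    /* An ordinary statement may simply continue on the next line. */
    offset =
        sector_num << 9;
    return offset;
}
```

Both forms compile to the same thing as their single-line equivalents;
the backslash only matters to the preprocessor.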


+if (!cluster_offset) {
+qemu_co_mutex_unlock(&s->lock);
+ret = -EIO;
+goto fail;
+}
+cluster_offset &= s->cluster_offset_mask;

-BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
-ret = bdrv_pwrite(bs->file->bs, cluster_offset, out_buf, out_len);
-if (ret < 0) {
-goto fail;
-}
+ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len);
+qemu_co_mutex_unlock(&s->lock);
+if (ret < 0) {
+goto fail;
  }

+iov = (struct iovec) {
+.iov_base   = out_buf,
+.iov_len= out_len,
+};
+qemu_iovec_init_external(&hd_qiov, &iov, 1);
+
+BLKDBG_EVENT(bs->file, BLKDBG_WRITE_COMPRESSED);
+ret = bdrv_co_pwritev(bs->file->bs, cluster_offset, out_len, &hd_qiov, 0);


There is a race condition here:

If the newly allocated cluster is only partially filled by compressed
data then qcow2_alloc_compressed_cluster_offset() remembers that more
bytes are still available in the cluster.  The
qcow2_alloc_compressed_cluster_offset() caller will continue filling the
same cluster.

Imagine two compressed writes running at the same time.  Write A
allocates just a few bytes so write B shares a sector with the first
write:

  Sector 1
   |AAAB|

The race condition is that bdrv_co_pwritev() uses read-modify-write (a
bounce buffer).  If both requests call bdrv_co_pwritev() around the same
time then the following could happen:

  Sector 1
   |000B|

or:

  Sector 1
   |AAA0|

It's necessary to hold s->lock around the compressed data write to avoid
this race condition.


I agree, there really is a race. Thank you, this is a very good point!
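
The lost update described above can be reproduced deterministically
with a single-threaded simulation of one bad interleaving. This is a
sketch under simplifying assumptions: a 4-byte "sector", no real I/O or
coroutines, and hypothetical names (RmwRequest, rmw_prepare,
rmw_commit). It models the read-modify-write pattern only and is not
the block layer's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 4  /* tiny sector, matching the diagrams above */

typedef struct {
    uint8_t bounce[SECTOR_SIZE];
} RmwRequest;

/* Read phase: copy the sector into a bounce buffer and patch it. */
static void rmw_prepare(RmwRequest *req, const uint8_t *disk,
                        size_t off, const uint8_t *data, size_t len)
{
    assert(off + len <= SECTOR_SIZE);
    memcpy(req->bounce, disk, SECTOR_SIZE);
    memcpy(req->bounce + off, data, len);
}

/* Write phase: write the whole bounce buffer back. */
static void rmw_commit(const RmwRequest *req, uint8_t *disk)
{
    memcpy(disk, req->bounce, SECTOR_SIZE);
}

static const uint8_t data_a[3] = {'A', 'A', 'A'};
static const uint8_t data_b[1] = {'B'};

/* Both requests read the sector before either one writes it back. */
static void write_interleaved(uint8_t *disk)
{
    RmwRequest a, b;

    rmw_prepare(&a, disk, 0, data_a, sizeof(data_a));
    rmw_prepare(&b, disk, 3, data_b, sizeof(data_b));
    rmw_commit(&a, disk);
    rmw_commit(&b, disk);  /* B's stale copy overwrites A's bytes */
}

/* What holding a lock around the whole write would enforce. */
static void write_serialized(uint8_t *disk)
{
    RmwRequest a, b;

    rmw_prepare(&a, disk, 0, data_a, sizeof(data_a));
    rmw_commit(&a, disk);
    rmw_prepare(&b, disk, 3, data_b, sizeof(data_b));
    rmw_commit(&b, disk);
}
```

Starting from a sector of '0' characters, the interleaved order leaves
"000B" on disk, while the serialized order leaves "AAAB".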




[Qemu-devel] running qemu for powerpc (32bits) architecture

2016-05-30 Thread Marwa Hamza
Hello everyone,

I'm trying to run QEMU for the PowerPC architecture, but one of three
things happens:

*1/* I get a black screen with the message "QEMU 2.4.0.1 monitor - type
help for more information"
   (QEMU)
when I run this command:
./ppc-softmmu/qemu-system-ppc -M ppce500 -kernel
../linux-4.4.1/arch/powerpc/boot/zImage -initrd
powerpc/busybox-1.21.0/rootfs.img.gz -append "root=/dev/ram rdinit=/bin/sh"

*2/* I get this error:
"rom: requested regions overlap (rom
/home/marwa/Bureau/lauterbach/powerpc/busybox-1.21.0/rootfs.img.gz.
free=0x01878dfc, addr=0x01579000)
rom check and register reset failed"
when I run this command:
./ppc-softmmu/qemu-system-ppc -M mac99 -kernel
/home/marwa/Bureau/lauterbach/powerpc/linux-4.4.1/arch/powerpc/boot/zImage
-initrd /home/marwa/Bureau/lauterbach/powerpc/busybox-1.21.0/rootfs.img.gz
-append "root=/dev/ram rdinit=/bin/sh"

*3/* I get this error:
qemu-system-ppc: Could not load PowerPC BIOS 'ppc405_rom.bin'
when I run this command:
./ppc-softmmu/qemu-system-ppc -M ref405ep -kernel
/home/marwa/Bureau/lauterbach/powerpc/linux-4.4.1/arch/powerpc/boot/zImage
-initrd /home/marwa/Bureau/lauterbach/powerpc/busybox-1.21.0/rootfs.img.gz
-append "root=/dev/ram rdinit=/bin/sh"

I have tried all the machines available for PowerPC in QEMU, but I
always get one of these three results; I can't access the file system.

Any suggestions, please?
Regards,
Marwa


Re: [Qemu-devel] [PATCH v7 07/25] intel_iommu: define several structs for IOMMU IR

2016-05-30 Thread David Kiarie
On Mon, May 30, 2016 at 11:14 AM, Peter Xu  wrote:
> On Mon, May 30, 2016 at 07:56:16AM +0200, Jan Kiszka wrote:
>> On 2016-05-30 07:45, Peter Xu wrote:
>> > On Sun, May 29, 2016 at 11:21:35AM +0300, David Kiarie wrote:
>> > [...]
>>  +
>>  +/* Programming format for MSI/MSI-X addresses */
>>  +union VTD_IR_MSIAddress {
>>  +struct {
>>  +uint8_t __not_care:2;
>>  +uint8_t index_h:1;  /* Interrupt index bit 15 */
>>  +uint8_t sub_valid:1;/* SHV: Sub-Handle Valid bit */
>>  +uint8_t int_mode:1; /* Interrupt format */
>>  +uint16_t index_l:15;/* Interrupt index bit 14-0 */
>>  +uint16_t __head:12; /* Should always be: 0x0fee */
>>  +} QEMU_PACKED;
>>  +uint32_t data;
>>  +};
>> >>>
>> >>> In a recent discussion, it was brought to my attention that you might
>> >>> have a problem with bitfields when the host cpu is not x86. Have you
>> >>> considered this ?
>> >>
>> >> In a case when say the host cpu is little endian.
>> >
>> > I assume you mean when host cpu is big endian. x86 was little endian,
>> > and I was testing on x86.
>> >
>> > I think you are right. I should do conditional byte swap for all
>> > uint{16/32/64} cases within the fields. For example, index_l field in
>> > above VTD_IR_MSIAddress. And there are several other cases that need
>> > special treatment in the patchset. Will go over and fix corresponding
>> > issues in next version.
>>
>> You actually need bit-swap with bit fields, see e.g. hw/net/vmxnet3.h.
>
> I had not noticed bit-field ordering before... So maybe I need both?

Yes, I think we will need both. Byte swapping the whole struct would
break the code, though; swapping the individual fields is what we
need.

Myself, I'm defining bitfields as below:

  struct CMDCompletionWait {

#ifdef __BIG_ENDIAN_BITFIELD
      uint32_t type:4;               /* command type           */
      uint32_t reserved:8;
      uint64_t store_addr:49;        /* addr to write          */
      uint32_t completion_flush:1;   /* allow more executions  */
      uint32_t completion_int:1;     /* set MMIOWAITINT        */
      uint32_t completion_store:1;   /* write data to address  */
#else
      uint32_t completion_store:1;
      uint32_t completion_int:1;
      uint32_t completion_flush:1;
      uint64_t store_addr:49;
      uint32_t reserved:8;
      uint32_t type:4;
#endif /* __BIG_ENDIAN_BITFIELD */

      uint64_t store_data;           /* data to write          */
  } QEMU_PACKED;

So, the bitfields are basically aligned to a {1,2,4,8}-byte boundary.
I will have to swap store_addr, type, store_data, etc.

>
> Thanks,
>
> -- peterx
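
For readers following the bit-field discussion above: one way to sidestep compiler-dependent bit-field ordering altogether is to decode the 32-bit value with explicit shifts and masks. The following is only a sketch. The bit positions are inferred from the little-endian field order in the quoted VTD_IR_MSIAddress union, and the names field32(), MsiAddrFields and decode_msi_addr() are hypothetical; none of them come from the patch series.

```c
#include <stdint.h>

/* Hypothetical helper: extract `len` bits (len < 32) starting at bit `start`. */
static inline uint32_t field32(uint32_t value, unsigned start, unsigned len)
{
    return (value >> start) & ((1u << len) - 1u);
}

/* Hypothetical decoded form of the MSI address; not a QEMU type. */
typedef struct {
    uint32_t index;      /* index_h (bit 15) combined with index_l (bits 14-0) */
    uint32_t sub_valid;  /* SHV: Sub-Handle Valid bit */
    uint32_t int_mode;   /* interrupt format */
    uint32_t head;       /* expected to be 0xfee */
} MsiAddrFields;

/*
 * Decode an address that is already in host byte order.  Assumed layout:
 * bits 1-0 ignored, bit 2 index_h, bit 3 sub_valid, bit 4 int_mode,
 * bits 19-5 index_l, bits 31-20 head.
 */
static MsiAddrFields decode_msi_addr(uint32_t addr)
{
    MsiAddrFields f;
    uint32_t index_h = field32(addr, 2, 1);
    uint32_t index_l = field32(addr, 5, 15);

    f.sub_valid = field32(addr, 3, 1);
    f.int_mode = field32(addr, 4, 1);
    f.head = field32(addr, 20, 12);
    f.index = (index_h << 15) | index_l;
    return f;
}
```

Because shifts operate on the value rather than on its memory layout, the same code gives the same result on little-endian and big-endian hosts once the value has been converted to host order, so no per-endianness #ifdef is needed.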



Re: [Qemu-devel] [PATCH 8/9] target-i386: Use "-" instead of "_" on all feature names

2016-05-30 Thread Igor Mammedov
On Fri, 27 May 2016 17:32:34 -0300
Eduardo Habkost  wrote:

> On Tue, May 24, 2016 at 03:22:27PM +0200, Igor Mammedov wrote:
> > On Tue, 24 May 2016 09:34:05 -0300
> > Eduardo Habkost  wrote:
> >   
> > > On Tue, May 24, 2016 at 02:17:03PM +0200, Igor Mammedov wrote:  
> > > > On Fri,  6 May 2016 15:11:31 -0300
> > > > Eduardo Habkost  wrote:  
> [...]
> > > > > -/* Convert all '_' in a feature string option name to '-', to make 
> > > > > feature
> > > > > - * name conform to QOM property naming rule, which uses '-' instead 
> > > > > of '_'.
> > > > > +/* Convert all '_' in a feature string option name to '-', to keep 
> > > > > compatibility
> > > > > + * with old feature names that used "_" instead of "-".
> > > > >   */
> > > > >  static inline void feat2prop(char *s)
> > > > >  {
> > > > > @@ -1925,8 +1925,10 @@ static void x86_cpu_parse_featurestr(CPUState 
> > > > > *cs, char *features,
> > > > >  while (featurestr) {
> > > > >  char *val;
> > > > I'd place a single feat2prop() here
> > > > and delete it from other call sites in this function.
> > > 
> > > A previous version of this patch had it. But it would change the
> > > property value too, not just the property name (breaking stuff
> > > like "model-id=some_string").
> > >  
> > It's a bug in feat2prop(), which should probably be fixed there,
> > so it would do what the comment above it says. Or as an alternative:  
> 
> The comment above it doesn't say anything about stopping at a '='
> delimiter. I always expected it to just replace "_" with "-" in a
> null-terminated string.
> 
> (I am not completely against making it stop at '=', but I believe
> your suggestion below sounds better).
> 
> > 
> > 
> > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > index ca2a893..e46e4c3 100644
> > --- a/target-i386/cpu.c
> > +++ b/target-i386/cpu.c
> > @@ -1941,14 +1941,16 @@ static void x86_cpu_parse_featurestr(CPUState *cs, 
> > char *features
> >  featurestr = features ? strtok(features, ",") : NULL;
> >  
> >  while (featurestr) {
> > -char *val;
> > +char *val = strchr(featurestr, '=');
> > +if (val) {
> > +*val = 0; val++;
> > +}
> > +feat2prop(featurestr);  
> 
> This would make "+feature=FOO" treated as a valid option, and it
> isn't. It would keep the existing behavior if we did this:
> 
> -  if (featurestr[0] == '+') {
> +  if (featurestr[0] == '+' && !val) {
>add_flagname_to_bitmaps(featurestr + 1, plus_features, 
> &local_err);
> -  } else if (featurestr[0] == '-') {
> +  } else if (featurestr[0] == '-' && !val) {
>add_flagname_to_bitmaps(featurestr + 1, minus_features, 
> &local_err);
> 
> In either case, I prefer to get this optimization reviewed as a
> separate patch. Can you send it as a follow-up?
sure

> 
> > -} else if ((val = strchr(featurestr, '='))) {
> > -*val = 0; val++;
> > -feat2prop(featurestr);
> > +} else if (val) {
> >  if (!strcmp(featurestr, "xlevel")) {
> >  char *err;
> >  char num[32];
> > @@ -2000,7 +2002,6 @@ static void x86_cpu_parse_featurestr(CPUState *cs, 
> > char *features,
> >  object_property_parse(OBJECT(cpu), val, featurestr, 
> > &local_err);
> >  }
> >  } else {
> > -feat2prop(featurestr);
> >  object_property_parse(OBJECT(cpu), "on", featurestr, 
> > &local_err);
> >  }
> >  if (local_err) {
> > 
> >   
> 
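
The idea discussed in this thread (split at '=' first, then normalize only the name) can be sketched in isolation as follows. This is an illustration, not the code from target-i386/cpu.c: feat2prop_sketch() mirrors the behaviour described for feat2prop(), and split_and_normalize() is a hypothetical helper name.

```c
#include <string.h>

/* Replace every '_' with '-' in a NUL-terminated string, in place. */
static void feat2prop_sketch(char *s)
{
    while ((s = strchr(s, '_'))) {
        *s = '-';
    }
}

/*
 * Hypothetical helper: split "name=value" in place at the first '=',
 * normalize only the name, and return a pointer to the value
 * (or NULL if there is no '=').  The value is left untouched.
 */
static char *split_and_normalize(char *featurestr)
{
    char *val = strchr(featurestr, '=');

    if (val) {
        *val = '\0';
        val++;
    }
    feat2prop_sketch(featurestr);
    return val;
}
```

With this ordering, "model_id=some_string" yields the name "model-id" while the value keeps its underscore, which is the breakage the thread wants to avoid.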




[Qemu-devel] [PATCH] console: ignore ui_info updates which don't actually update something

2016-05-30 Thread Gerd Hoffmann
Signed-off-by: Gerd Hoffmann 
---
 ui/console.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/ui/console.c b/ui/console.c
index 6402010..581480f 100644
--- a/ui/console.c
+++ b/ui/console.c
@@ -1457,16 +1457,21 @@ bool dpy_ui_info_supported(QemuConsole *con)
 int dpy_set_ui_info(QemuConsole *con, QemuUIInfo *info)
 {
 assert(con != NULL);
-con->ui_info = *info;
+
 if (!dpy_ui_info_supported(con)) {
 return -1;
 }
+if (memcmp(&con->ui_info, info, sizeof(con->ui_info)) == 0) {
+/* nothing changed -- ignore */
+return 0;
+}
 
 /*
  * Typically we get a flood of these as the user resizes the window.
  * Wait until the dust has settled (one second without updates), then
  * go notify the guest.
  */
+con->ui_info = *info;
 timer_mod(con->ui_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + 1000);
 return 0;
 }
-- 
1.8.3.1
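
The compare-before-store pattern used by this patch can be sketched on its own as below. The types are hypothetical stand-ins (UiInfoSketch is not QemuUIInfo), and the counter only stands in for re-arming the timer. Note that memcmp() also compares padding bytes, so the sketch deliberately uses a struct whose members are all the same size.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the UI info struct; no padding between members. */
typedef struct {
    uint32_t width;
    uint32_t height;
    uint32_t xoff;
    uint32_t yoff;
} UiInfoSketch;

typedef struct {
    UiInfoSketch ui_info;
    int pending_notifications;  /* stands in for re-arming the timer */
} ConsoleSketch;

/* Returns true if the update was stored, false if it changed nothing. */
static bool set_ui_info_sketch(ConsoleSketch *con, const UiInfoSketch *info)
{
    if (memcmp(&con->ui_info, info, sizeof(con->ui_info)) == 0) {
        return false;           /* nothing changed, so skip the notification */
    }
    con->ui_info = *info;       /* store only after the comparison */
    con->pending_notifications++;
    return true;
}
```

The ordering matters: if the new value were stored before the comparison, as in the removed line of the patch, the comparison would always report "no change".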




[Qemu-devel] [PATCH] virtio-gpu: fix scanout rectangles

2016-05-30 Thread Gerd Hoffmann
Commit "ca58b45 ui/virtio-gpu: add and use qemu_create_displaysurface_pixman"
breaks scanouts which use a region of the underlying resource only.

So, we need another way to handle the underlying issue.  Let's create a
new pixman image, grab a reference on the pixman providing the
underlying storage, hook up a destroy callback which releases the
reference.  That way regions work again and releasing the backing
storage should still be impossible thanks to the extra reference we are
holding.

Signed-off-by: Gerd Hoffmann 
---
 hw/display/virtio-gpu.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c
index 2116106..4746edc 100644
--- a/hw/display/virtio-gpu.c
+++ b/hw/display/virtio-gpu.c
@@ -497,6 +497,11 @@ static void virtio_gpu_resource_flush(VirtIOGPU *g,
 pixman_region_fini(&flush_region);
 }
 
+static void virtio_unref_resource(pixman_image_t *image, void *data)
+{
+pixman_image_unref(data);
+}
+
 static void virtio_gpu_set_scanout(VirtIOGPU *g,
struct virtio_gpu_ctrl_command *cmd)
 {
@@ -576,8 +581,15 @@ static void virtio_gpu_set_scanout(VirtIOGPU *g,
 != ((uint8_t *)pixman_image_get_data(res->image) + offset) ||
 scanout->width != ss.r.width ||
 scanout->height != ss.r.height) {
+pixman_image_t *rect;
+void *ptr = (uint8_t *)pixman_image_get_data(res->image) + offset;
+rect = pixman_image_create_bits(format, ss.r.width, ss.r.height, ptr,
+pixman_image_get_stride(res->image));
+pixman_image_ref(res->image);
+pixman_image_set_destroy_function(rect, virtio_unref_resource,
+  res->image);
 /* realloc the surface ptr */
-scanout->ds = qemu_create_displaysurface_pixman(res->image);
+scanout->ds = qemu_create_displaysurface_pixman(rect);
 if (!scanout->ds) {
 cmd->error = VIRTIO_GPU_RESP_ERR_UNSPEC;
 return;
-- 
1.8.3.1




Re: [Qemu-devel] [PATCH V2] block/io: optimize bdrv_co_pwritev for small requests

2016-05-30 Thread Kevin Wolf
On 30.05.2016 at 08:25, Peter Lieven wrote:
> On 27.05.2016 at 10:55, Kevin Wolf wrote:
> >On 27.05.2016 at 02:36, Fam Zheng wrote:
> >>On Thu, 05/26 11:20, Paolo Bonzini wrote:
> >>>On 26/05/2016 10:30, Fam Zheng wrote:
> >>This doesn't look too wrong...  Should the right sequence of events be
> >>head/after_head or head/after_tail?  It's probably simplest to just emit
> >>all four events.
> I've no idea. (That's why I leaned towards fixing the test case).
> >>>Well, fixing the testcase means knowing what events should be emitted.
> >>>
> >>>QEMU with Peter's patch emits head/after_head.  If the right one is
> >>>head/after_tail, _both QEMU and the testcase_ need to be adjusted.  Your
> >>>patch keeps the backwards-compatible route.
> >>Yes, I mean I was not very convinced in tweaking the events at all: each 
> >>pair
> >>of them has been emitted around bdrv_aligned_preadv(), and the new branch
> >>doesn't do it anymore. So I don't see a reason to add events here.
> >Yes, if you can assume that anyone who uses the debug events knows
> >exactly what the code looks like, adding the events here is pointless
> >because TAIL, AFTER_TAIL and for the greatest part also AFTER_HEAD are
> >essentially the same then.
> >
> >Having TAIL before the qiov change and AFTER_TAIL afterwards doesn't
> >make any difference, they could (and should) be called immediately one
> >after another if we wanted to keep the behaviour.
> >
> >I would agree that we should take a look at the test case and what it
> >actually wants to achieve before we can decide whether AFTER_HEAD and
> >TAIL/AFTER_TAIL would be the same (the former could trigger earlier if
> >there are two requests and only one is unaligned at the tail). Maybe we
> >even need to extend the test case now so that both paths (explicit read
> >of the tail and the shortcut) are covered.
> 
> The part that actually blocks in 077 is
> 
> # Sequential RMW requests on the same physical sector
> 
> It's expecting all 4 events around the RMW cycle.
> 
> However, it seems that also other parts of 077 would need an adjustment
> and the output might differ depending on the alignment. So I guess we
> have to emit the events if we don't want to recode the whole 077 and make
> it aware of the alignment.

Yes, but my point is that we may need to rework 077 anyway if we don't
only want to make it pass again, but to cover all relevant paths, too.
We got a new code path and it's unlikely that the existing tests covered
both the old code path and the new one.

Kevin



Re: [Qemu-devel] [PATCH 4/4] target-tricore: Added new JNE instruction variant

2016-05-30 Thread Bastian Koppelmann
On 05/30/2016 12:59 AM, peer.ad...@c-lab.de wrote:
> From: Peer Adelt 
> 
> If D[15] is != sign_ext(const4) then PC will be set to (PC +
> zero_ext(disp4 + 16)).
> 
> Signed-off-by: Peer Adelt 
> ---
>  target-tricore/translate.c   | 1 +
>  target-tricore/tricore-opcodes.h | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/target-tricore/translate.c b/target-tricore/translate.c
> index 2145f64..9ad9fcc 100644
> --- a/target-tricore/translate.c
> +++ b/target-tricore/translate.c
> @@ -3363,6 +3363,7 @@ static void gen_compute_branch(DisasContext *ctx, 
> uint32_t opc, int r1,
>  gen_branch_condi(ctx, TCG_COND_EQ, cpu_gpr_d[15], constant, offset);
>  break;
>  case OPC1_16_SBC_JNE:
> +case OPC1_16_SBC_JNE16:
>  gen_branch_condi(ctx, TCG_COND_NE, cpu_gpr_d[15], constant, offset);
>  break;

You forgot to call gen_compute_branch() from decode_16Bit_opc() for this
instruction, which should do the addition of 16 to disp4. Also please
add a check for 1.6+ ISA as suggested for the patch before.

Cheers,
Bastian
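
As a reading aid for the semantics quoted in the commit message, here is a small sketch of the two operand transformations involved: sign-extending const4 and adding 16 to disp4. It follows the commit message literally and leaves out any further scaling of the offset, since the quoted text does not state one. The function names are hypothetical and are not taken from target-tricore/translate.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sign-extend a 4-bit immediate: 0x0..0x7 map to 0..7, 0x8..0xf to -8..-1. */
static int32_t sign_ext4(uint32_t const4)
{
    const4 &= 0xf;
    return (const4 & 0x8) ? (int32_t)const4 - 16 : (int32_t)const4;
}

/* Offset operand of the new variant as described: zero_ext(disp4 + 16). */
static uint32_t jne16_offset(uint32_t disp4)
{
    return (disp4 & 0xf) + 16;
}

/* Branch condition: taken when D[15] differs from sign_ext(const4). */
static bool jne_taken(int32_t d15, uint32_t const4)
{
    return d15 != sign_ext4(const4);
}
```

The review comment above concerns where the "+ 16" is applied: it has to happen in the 16-bit decoder before the branch is generated, since the shared case label alone does not add it.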



[Qemu-devel] [Bug 1586611] Re: usb-hub can not be detached when detach usb device from VM

2016-05-30 Thread Michael liu
Have you tried detaching the usb-hub device with "virsh detach-device usb-hub.xml"?

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1586611

Title:
  usb-hub can not be detached when detach usb  device from VM

Status in QEMU:
  New

Bug description:
  I pass a host USB device to the guest using the "virsh attach-device"
  command. In the guest OS, "lsusb" shows that two devices have been added: one
  is the USB device and the other is a usb-hub (0409:55aa NEC Corp. Hub).
  When I detach the USB device with "virsh detach-device", the usb-hub still
  exists in the guest OS.
  This causes problems when operating the VM. For example, when suspending and
  resuming the VM, qemu reports:

  2016-05-24T12:03:54.434369Z qemu-kvm: Unknown savevm section or
  instance ':00:01.2/2/usb-hub' 0

  2016-05-24T12:03:54.434742Z qemu-kvm: load of migration failed:
  Invalid argument

  From qemu's code, it is clear that the usb-hub is created by qemu and that
  detaching the usb-hub has already been attempted, but failed. With extra
  print statements added, the errors are as follows:
  libusbx: error [do_close] Device handle closed while transfer was still being 
processed, but the device is still connected as far as we know
  libusbx: warning [do_close] A cancellation for an in-flight transfer hasn't 
completed but closing the device handle

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1586611/+subscriptions



Re: [Qemu-devel] [PATCH v7 07/25] intel_iommu: define several structs for IOMMU IR

2016-05-30 Thread Peter Xu
On Mon, May 30, 2016 at 07:56:16AM +0200, Jan Kiszka wrote:
> On 2016-05-30 07:45, Peter Xu wrote:
> > On Sun, May 29, 2016 at 11:21:35AM +0300, David Kiarie wrote:
> > [...]
>  +
>  +/* Programming format for MSI/MSI-X addresses */
>  +union VTD_IR_MSIAddress {
>  +struct {
>  +uint8_t __not_care:2;
>  +uint8_t index_h:1;  /* Interrupt index bit 15 */
>  +uint8_t sub_valid:1;/* SHV: Sub-Handle Valid bit */
>  +uint8_t int_mode:1; /* Interrupt format */
>  +uint16_t index_l:15;/* Interrupt index bit 14-0 */
>  +uint16_t __head:12; /* Should always be: 0x0fee */
>  +} QEMU_PACKED;
>  +uint32_t data;
>  +};
> >>>
> >>> In a recent discussion, it was brought to my attention that you might
> >>> have a problem with bitfields when the host cpu is not x86. Have you
> >>> considered this ?
> >>
> >> In a case when say the host cpu is little endian.
> > 
> > I assume you mean when host cpu is big endian. x86 was little endian,
> > and I was testing on x86.
> > 
> > I think you are right. I should do conditional byte swap for all
> > uint{16/32/64} cases within the fields. For example, index_l field in
> > above VTD_IR_MSIAddress. And there are several other cases that need
> > special treatment in the patchset. Will go over and fix corresponding
> > issues in next version.
> 
> You actually need bit-swap with bit fields, see e.g. hw/net/vmxnet3.h.

I had not noticed bit-field ordering before... So maybe I need both?

Thanks,

-- peterx



[Qemu-devel] [Bug 1586611] Re: usb-hub can not be detached when detach usb device from VM

2016-05-30 Thread Michael liu
I found that when I attach a USB device to the VM, the VM adds a
usb-hub automatically if there is none.
After adding the usb-hub, the VM assigns a port to the actual USB device. When
detaching the USB device, qemu only detaches the port, without detaching the
usb-hub. So actions like migrating or suspending/resuming the VM will
fail.




Re: [Qemu-devel] linux-user: add option to intercept execve() syscalls

2016-05-30 Thread Riku Voipio
On Wed, May 25, 2016 at 05:07:48PM +0100, Joel Holdsworth wrote:
> This patch-set includes Peter Angelatos's previous patch-set [1] and
> adds code to pass arguments for setting the environment variables,
> passing the interpreter prefix, and passing the strace option.

Considering the messiness this series adds to QEMU, I do wonder how
much of a win this avoidance really is. If you have permissions to chroot,
you generally have permissions to set binfmt_misc too. Alternatively,
these kinds of exec manipulations are already done by external tools like
proot and scratchbox.

However, if you are ready to stay around to maintain it, and nobody else
objects to the code, I can merge it.


> [1] https://patchwork.ozlabs.org/patch/582756/
> 


