date:20160322

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-22 Thread Li, Liang Z

> > Obviously, the virtio-balloon mechanism has a bigger performance
> > impact to the guest than the way we are trying to implement.
> 
> Yeh, we should separately try and fix that; if it's that slow then people 
> will be
> annoyed about it when they're just using it for balloon.
> 
> > 3. Virtio interface
> > There are three different ways of using the virtio interface to send
> > the free page information.
> > a. Extend the current virtio device
> > The virtio spec has already defined some virtio devices, and we can
> > extend one of these devices so as to use it to transport the free page
> > information. It requires modifying the virtio spec.
> >
> > b. Implement a new virtio device
> > Implementing a brand new virtio device to exchange information between
> > host and guest is another choice. It requires modifying the virtio
> > spec too.
> 
> If the right solution is to change the spec then we should do it; we shouldn't
> use a technically worse solution just to avoid the spec change; although we
> have to be even more careful to get the right solution if we want to change
> the spec.
> 
> > c. Make use of virtio-serial (Amit’s suggestion, my choice) It’s
> > possible to make use the virtio-serial for communication between host
> > and guest, the benefit of this solution is no need to modify the
> > virtio spec.
> >
> > 4. Construct free page bitmap
> > To minimize the space for saving free page information, it’s better to
> > use a bitmap to describe the free pages. There are two ways to
> > construct the free page bitmap.
> >
> > a. Construct free page bitmap when demand (My choice) Guest can
> > allocate memory for the free page bitmap only when it receives the
> > request from QEMU, and set the free page bitmap by traversing the free
> > page list. The advantage of this way is that it’s quite simple and
> > easy to implement. The disadvantage is that the traversing operation
> > may consume quite a long time when there are a lot of free pages.
> > (About 20ms for 7GB free pages)
> 
> I wonder how that scales; 20ms isn't too bad - but I'm more worried about
> what happens when someone does it to the 1TB database VM.

Totally depends on the count of free pages in the VM, if 90% of the memory
in the 1TB VM are free pages, the time is about:
1024 * 0.9 / 7 *20 = 2633 ms

Is it unbearable? if so, we can use 4b to construct the free page bitmap, hope
the kernel guys can tolerate it.

> > b. Update free page bitmap when allocating/freeing pages Another
> > choice is to allocate the memory for the free page bitmap when guest
> > boots, and then update the free page bitmap when allocating/freeing
> > pages. It needs more modification to the code related to memory
> > management in guest. The advantage of this way is that guest can
> > response QEMU’s request for a free page bitmap very quickly, no matter
> > how many free pages in the guest. Do the kernel guys like this?
> >
> > 5. Tighten the free page bitmap
> > At last, the free page bitmap should be operated with the
> > ramlist.dirty_memory to filter out the free pages. We should make sure
> > the bit N in the free page bitmap and the bit N in the
> > ramlist.dirty_memory are corresponding to the same guest’s page.
> > Some arch, like X86, there are ‘holes’ in the memory’s physical
> > address, which means there are no actual physical RAM pages
> > corresponding to some PFNs. So, some arch specific information is
> > needed to construct a proper free page bitmap.
> >
> > migration dirty page bitmap:
> > -
> > |a|b|c|d|e|f|g|h|i|j|
> > -
> > loose free page bitmap:
> > -
> > |a|b|c|d|e|f| | | | |g|h|i|j|
> > -
> > tight free page bitmap:
> > -
> > |a|b|c|d|e|f|g|h|i|j|
> > -
> >
> > There are two places for tightening the free page bitmap:
> > a. In guest
> > Constructing the free page bitmap in guest requires adding the arch
> > related code in guest for building a tight bitmap. The advantage of
> > this way is that less memory is needed to store the free page bitmap.
> > b. In QEMU (My choice)
> > Constructing the free page bitmap in QEMU is more flexible, we can get
> > a loose free page bitmap which contains the holes, and then filter out
> > the holes in QEMU, the advantage of this way is that we can keep the
> > kernel code as simple as we can, the disadvantage is that more memory
> > is needed to save the loose free page bitmap. Because this is a mainly
> > QEMU feature, if possible, do all the related things in QEMU is
> > better.
> 
> Yes, maybe; although we'd have to be careful to validate what the guest fills
> in makes sense.
> 
> > 6. Handling page cache in the guest
> > The memory used for page cache in the guest will change depends on the
> > workload, if guest run some block IO intensive work load, there will
> > be lots of pages used for page cache, only a few

Re: [Qemu-devel] [PATCH v4 08/10] vfio: vote the function 0 to do host bus reset when aer occurred

2016-03-22 Thread Chen Fan



On 03/22/2016 05:40 AM, Alex Williamson wrote:

On Mon, 21 Mar 2016 18:08:44 +0800
Cao jin  wrote:


From: Chen Fan 

Due to all devices assigned to VM on the same way as host if enable
aer, so we can easily do the hot reset by selecting the function #0
to do the hot reset.

Signed-off-by: Chen Fan 
---
  hw/vfio/pci.c | 50 ++
  hw/vfio/pci.h |  2 ++
  2 files changed, 52 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 9902c87..718cde7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1900,6 +1900,8 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, 
Error **errp)
  /* List all affected devices by bus reset */
  devices = &info->devices[0];
  
+vdev->single_depend_dev = (info->count == 1);

+
  /* Verify that we have all the groups required */
  for (i = 0; i < info->count; i++) {
  PCIHostDeviceAddress host;
@@ -2608,11 +2610,36 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
  static void vfio_err_notifier_handler(void *opaque)
  {
  VFIOPCIDevice *vdev = opaque;
+PCIDevice *pdev = &vdev->pdev;
  
  if (!event_notifier_test_and_clear(&vdev->err_notifier)) {

  return;
  }
  
+if (vdev->features & VFIO_FEATURE_ENABLE_AER) {

+VFIOPCIDevice *tmp;
+PCIDevice *dev;
+int devfn;
+
+/*
+ * If one device has aer capability on a bus, when aer occurred,
+ * we should notify all devices on the bus there was an aer arrived,
+ * then we are able to vote the device #0 to do host bus reset.
+ */
+for (devfn = 0; devfn < 8; devfn++) {

ARI?


+dev = pci_find_device(pdev->bus, pci_bus_num(pdev->bus),
+  PCI_DEVFN(PCI_SLOT(pdev->devfn), devfn));
+if (!dev) {
+continue;
+}
+if (!object_dynamic_cast(OBJECT(dev), "vfio-pci")) {
+continue;
+}
+tmp = DO_UPCAST(VFIOPCIDevice, pdev, dev);
+tmp->aer_occurred = true;
+}
+}
+
  /*
   * TBD. Retrieve the error details and decide what action
   * needs to be taken. One of the actions could be to pass
@@ -3075,6 +3102,29 @@ static void vfio_pci_reset(DeviceState *dev)
  
  trace_vfio_pci_reset(vdev->vbasedev.name);
  
+if (vdev->aer_occurred) {

+PCIDevice *br = pci_bridge_get_device(pdev->bus);
+
+if (br &&
+(pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
+ PCI_BRIDGE_CTL_BUS_RESET)) {
+/* simply voting the function 0 to do hot bus reset */
+if (pci_get_function_0(pdev) == pdev) {
+if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
+vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+} else {
+/* if this device has not AER capability, code
+ * coming here indicates there is another function
+ * on the bus has AER capability.
+ * */

This shouldn't be possible, right?


+vfio_pci_hot_reset(vdev, false);
+}
+}
+vdev->aer_occurred = false;
+return;
+}
+}

Why do we care than an AER occurred now?  Can't we simply test:

 if (vdev->features & VFIO_FEATURE_ENABLE_AER &&
 pci_get_function_0(pdev) == pdev) {
 PCIDevice *br = pci_bridge_get_device(pdev->bus);

 if (pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
 PCI_BRIDGE_CTL_BUS_RESET)) {

 vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
 return;
 }
 }

I think this is not practicable, not_function_0 device can't
pass the condition, it would do reset normally. but our intention
is doing nothing for them, and direct return.
I will change this in next version.

Thanks,
Chen




+
  vfio_pci_pre_reset(vdev);
  
  if (vdev->resetfn && !vdev->resetfn(vdev)) {

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index db7c6d5..17c75b8 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -143,6 +143,8 @@ typedef struct VFIOPCIDevice {
  bool no_kvm_intx;
  bool no_kvm_msi;
  bool no_kvm_msix;
+bool aer_occurred;
+bool single_depend_dev;
  } VFIOPCIDevice;
  
  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);



.

Re: [Qemu-devel] [PATCH v4 08/10] vfio: vote the function 0 to do host bus reset when aer occurred

2016-03-22 Thread Chen Fan



On 03/22/2016 05:40 AM, Alex Williamson wrote:

On Mon, 21 Mar 2016 18:08:44 +0800
Cao jin  wrote:


From: Chen Fan 

Due to all devices assigned to VM on the same way as host if enable
aer, so we can easily do the hot reset by selecting the function #0
to do the hot reset.

Signed-off-by: Chen Fan 
---
  hw/vfio/pci.c | 50 ++
  hw/vfio/pci.h |  2 ++
  2 files changed, 52 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 9902c87..718cde7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1900,6 +1900,8 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, 
Error **errp)
  /* List all affected devices by bus reset */
  devices = &info->devices[0];
  
+vdev->single_depend_dev = (info->count == 1);

+
  /* Verify that we have all the groups required */
  for (i = 0; i < info->count; i++) {
  PCIHostDeviceAddress host;
@@ -2608,11 +2610,36 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
  static void vfio_err_notifier_handler(void *opaque)
  {
  VFIOPCIDevice *vdev = opaque;
+PCIDevice *pdev = &vdev->pdev;
  
  if (!event_notifier_test_and_clear(&vdev->err_notifier)) {

  return;
  }
  
+if (vdev->features & VFIO_FEATURE_ENABLE_AER) {

+VFIOPCIDevice *tmp;
+PCIDevice *dev;
+int devfn;
+
+/*
+ * If one device has aer capability on a bus, when aer occurred,
+ * we should notify all devices on the bus there was an aer arrived,
+ * then we are able to vote the device #0 to do host bus reset.
+ */
+for (devfn = 0; devfn < 8; devfn++) {

ARI?


+dev = pci_find_device(pdev->bus, pci_bus_num(pdev->bus),
+  PCI_DEVFN(PCI_SLOT(pdev->devfn), devfn));
+if (!dev) {
+continue;
+}
+if (!object_dynamic_cast(OBJECT(dev), "vfio-pci")) {
+continue;
+}
+tmp = DO_UPCAST(VFIOPCIDevice, pdev, dev);
+tmp->aer_occurred = true;
+}
+}
+
  /*
   * TBD. Retrieve the error details and decide what action
   * needs to be taken. One of the actions could be to pass
@@ -3075,6 +3102,29 @@ static void vfio_pci_reset(DeviceState *dev)
  
  trace_vfio_pci_reset(vdev->vbasedev.name);
  
+if (vdev->aer_occurred) {

+PCIDevice *br = pci_bridge_get_device(pdev->bus);
+
+if (br &&
+(pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
+ PCI_BRIDGE_CTL_BUS_RESET)) {
+/* simply voting the function 0 to do hot bus reset */
+if (pci_get_function_0(pdev) == pdev) {
+if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
+vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
+} else {
+/* if this device has not AER capability, code
+ * coming here indicates there is another function
+ * on the bus has AER capability.
+ * */

This shouldn't be possible, right?


+vfio_pci_hot_reset(vdev, false);
+}
+}
+vdev->aer_occurred = false;
+return;
+}
+}

Why do we care than an AER occurred now?  Can't we simply test:

 if (vdev->features & VFIO_FEATURE_ENABLE_AER &&
 pci_get_function_0(pdev) == pdev) {
 PCIDevice *br = pci_bridge_get_device(pdev->bus);

 if (pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
 PCI_BRIDGE_CTL_BUS_RESET)) {

 vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
 return;
 }
 }

I think this is not practicable, not_function_0 device can't
pass the condition, it would do reset normally. but our intention
is doing nothing for them, and direct return.
I will change this in next version.

Thanks,
CHen




+
  vfio_pci_pre_reset(vdev);
  
  if (vdev->resetfn && !vdev->resetfn(vdev)) {

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index db7c6d5..17c75b8 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -143,6 +143,8 @@ typedef struct VFIOPCIDevice {
  bool no_kvm_intx;
  bool no_kvm_msi;
  bool no_kvm_msix;
+bool aer_occurred;
+bool single_depend_dev;
  } VFIOPCIDevice;
  
  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);



.

[Qemu-devel] [PATCH] ps2kbd: default to scancode_set 2, as with KBD_CMD_RESET

2016-03-22 Thread Hervé Poussineau

This line has been added in commit ef74679a810fe6858f625b9d52b68cc3fc61eb3d with
other initializations. However, scancode set 0 doesn't exist (only 1, 2, 3).
This works well as long as operating system is resetting keyboard, or 
overwriting
the current scancode set with the one it wants.

This fixes IBM 40p firmware, which doesn't bother sending KBD_CMD_RESET or 
KBD_CMD_SCANCODE.

Signed-off-by: Hervé Poussineau 
---
 hw/input/ps2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/input/ps2.c b/hw/input/ps2.c
index 58892d5..a8aa36f 100644
--- a/hw/input/ps2.c
+++ b/hw/input/ps2.c
@@ -628,7 +628,7 @@ static void ps2_kbd_reset(void *opaque)
 ps2_common_reset(&s->common);
 s->scan_enabled = 0;
 s->translate = 0;
-s->scancode_set = 0;
+s->scancode_set = 2;
 }
 
 static void ps2_mouse_reset(void *opaque)
-- 
2.1.4

[Qemu-devel] About global_props

2016-03-22 Thread ch...@foxmail.com

global_props is defined in qdev-properties.c. But I can not find where it is 
read ans used? Can someone tell me? Thanks!



ch...@foxmail.com

Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

2016-03-22 Thread David Gibson

On Wed, Mar 23, 2016 at 02:28:01PM +1100, Alexey Kardashevskiy wrote:
> On 03/23/2016 01:13 PM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
> >>This adds support for Dynamic DMA Windows (DDW) option defined by
> >>the SPAPR specification which allows to have additional DMA window(s)
> >>
> >>This implements DDW for emulated and VFIO devices.
> >>This reserves RTAS token numbers for DDW calls.
> >>
> >>This changes the TCE table migration descriptor to support dynamic
> >>tables as from now on, PHB will create as many stub TCE table objects
> >>as PHB can possibly support but not all of them might be initialized at
> >>the time of migration because DDW might or might not be requested by
> >>the guest.
> >>
> >>The "ddw" property is enabled by default on a PHB but for compatibility
> >>the pseries-2.5 machine and older disable it.
> >>
> >>This implements DDW for VFIO. The host kernel support is required.
> >>This adds a "levels" property to PHB to control the number of levels
> >>in the actual TCE table allocated by the host kernel, 0 is the default
> >>value to tell QEMU to calculate the correct value. Current hardware
> >>supports up to 5 levels.
> >>
> >>The existing linux guests try creating one additional huge DMA window
> >>with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
> >>the guest switches to dma_direct_ops and never calls TCE hypercalls
> >>(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> >>and not waste time on map/unmap later. This adds a "dma64_win_addr"
> >>property which is a bus address for the 64bit window and by default
> >>set to 0x800... as this is what the modern POWER8 hardware
> >>uses and this allows having emulated and VFIO devices on the same bus.
> >>
> >>This adds 4 RTAS handlers:
> >>* ibm,query-pe-dma-window
> >>* ibm,create-pe-dma-window
> >>* ibm,remove-pe-dma-window
> >>* ibm,reset-pe-dma-window
> >>These are registered from type_init() callback.
> >>
> >>These RTAS handlers are implemented in a separate file to avoid polluting
> >>spapr_iommu.c with PCI.
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >>---
> >>  hw/ppc/Makefile.objs|   1 +
> >>  hw/ppc/spapr.c  |   7 +-
> >>  hw/ppc/spapr_pci.c  |  73 ---
> >>  hw/ppc/spapr_rtas_ddw.c | 300 
> >> 
> >>  hw/vfio/common.c|   5 -
> >>  include/hw/pci-host/spapr.h |  13 ++
> >>  include/hw/ppc/spapr.h  |  16 ++-
> >>  trace-events|   4 +
> >>  8 files changed, 395 insertions(+), 24 deletions(-)
> >>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> >>
> >>diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >>index c1ffc77..986b36f 100644
> >>--- a/hw/ppc/Makefile.objs
> >>+++ b/hw/ppc/Makefile.objs
> >>@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o 
> >>spapr_drc.o spapr_rng.o
> >>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> >>  obj-y += spapr_pci_vfio.o
> >>  endif
> >>+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
> >>  # PowerPC 4xx boards
> >>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
> >>  obj-y += ppc4xx_pci.o
> >>diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>index d0bb423..ef4c637 100644
> >>--- a/hw/ppc/spapr.c
> >>+++ b/hw/ppc/spapr.c
> >>@@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
> >>   * pseries-2.5
> >>   */
> >>  #define SPAPR_COMPAT_2_5 \
> >>-HW_COMPAT_2_5
> >>+HW_COMPAT_2_5 \
> >>+{\
> >>+.driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> >>+.property = "ddw",\
> >>+.value= stringify(off),\
> >>+},
> >>
> >>  static void spapr_machine_2_5_instance_options(MachineState *machine)
> >>  {
> >>diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>index af99a36..3bb294a 100644
> >>--- a/hw/ppc/spapr_pci.c
> >>+++ b/hw/ppc/spapr_pci.c
> >>@@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState 
> >>*sphb, PCIDevice *pdev)
> >>  return buf;
> >>  }
> >>
> >>-static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>-   uint32_t liobn,
> >>-   uint32_t page_shift,
> >>-   uint64_t window_addr,
> >>-   uint64_t window_size,
> >>-   Error **errp)
> >>+void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>+ uint32_t liobn,
> >>+ uint32_t page_shift,
> >>+ uint64_t window_addr,
> >>+ uint64_t window_size,
> >>+ Error **errp)
> >>  {
> >>  sPAPRTCETable *tcet;
> >>  uint32_t nb_table = window_size >> page_shift;
> >>@@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHB

Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU

2016-03-22 Thread David Gibson

On Wed, Mar 23, 2016 at 02:06:36PM +1100, Alexey Kardashevskiy wrote:
> On 03/23/2016 01:53 PM, David Gibson wrote:
> >On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/23/2016 12:08 PM, David Gibson wrote:
> >>>On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 04:14 PM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
> >>New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >>This adds ability to VFIO common code to dynamically allocate/remove
> >>DMA windows in the host kernel when new VFIO container is added/removed.
> >>
> >>This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> >>and adds just created IOMMU into the host IOMMU list; the opposite
> >>action is taken in vfio_listener_region_del.
> >>
> >>When creating a new window, this uses euristic to decide on the TCE 
> >>table
> >>levels number.
> >>
> >>This should cause no guest visible change in behavior.
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >>---
> >>Changes:
> >>v14:
> >>* new to the series
> >>
> >>---
> >>TODO:
> >>* export levels to PHB
> >>---
> >>  hw/vfio/common.c | 108 
> >> ---
> >>  trace-events |   2 ++
> >>  2 files changed, 105 insertions(+), 5 deletions(-)
> >>
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index 4e873b7..421d6eb 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer 
> >>*container,
> >>  return 0;
> >>  }
> >>
> >>+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr 
> >>min_iova)
> >>+{
> >>+VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, 
> >>min_iova, 0x1000);
> >
> >The hard-coded 0x1000 looks dubious..
> 
> Well, that's the minimal page size...
> >>>
> >>>Really?  Some BookE CPUs support 1KiB page size..
> >>
> >>Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)
> >
> >Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
> >it's been done for CPU MMU I wouldn't count on it not being done for
> >IOMMU.
> >
> >1 is a safer choice.
> >
> >>
> >>
> >>>
> >>+g_assert(hiommu);
> >>+QLIST_REMOVE(hiommu, hiommu_next);
> >>+}
> >>+
> >>  static bool vfio_listener_skipped_section(MemoryRegionSection 
> >> *section)
> >>  {
> >>  return (!memory_region_is_ram(section->mr) &&
> >>@@ -392,6 +400,61 @@ static void 
> >>vfio_listener_region_add(MemoryListener *listener,
> >>  }
> >>  end = int128_get64(llend);
> >>
> >>+if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >
> >I think this would be clearer split out into a helper function,
> >vfio_create_host_window() or something.
> 
> 
> It is rather vfio_spapr_create_host_window() and we were avoiding
> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
> separate file but this usually triggers more discussion and never ends 
> well.
> 
> 
> 
> >>+unsigned entries, pages;
> >>+struct vfio_iommu_spapr_tce_create create = { .argsz = 
> >>sizeof(create) };
> >>+
> >>+g_assert(section->mr->iommu_ops);
> >>+g_assert(memory_region_is_iommu(section->mr));
> >
> >I don't think you need these asserts.  AFAICT the same logic should
> >work if a RAM MR was added directly to PCI address space - this would
> >create the new host window, then the existing code for adding a RAM MR
> >would map that block of RAM statically into the new window.
> 
> In what configuration/machine can we do that on SPAPR?
> >>>
> >>>spapr guests won't ever do that.  But you can run an x86 guest on a
> >>>powernv host and this situation could come up.
> >>
> >>
> >>I am pretty sure VFIO won't work in this case anyway.
> >
> >I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.
> 
> This is not about TCG (pseries TCG guest works with VFIO on powernv host),
> this is about things like VFIO_IOMMU_GET_INFO vs.
> VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.
> 
> Should I add such support in this patchset?

Unless adding the generality is really complex, and so far I haven't
seen a reason for it to be.

> 
> 
> >
> >>>In any case there's no point asserting if the code is correct anyway.
> >>
> >>Assert here says (at least) "not tested" or "not expected to
> >>happen".
> >
> >Hmmm..
> >
> >>
> >>
> >>>
> >>+trace_vfio_listener_region_add_iommu(iova, end - 1);
> >>+/*
> >>+ * FIXME: For VFIO iommu types which have KVM acceleration to
> >>+ * avoid bounc

[Qemu-devel] About GlobalProperty

2016-03-22 Thread ch...@foxmail.com

I read code in test-qdev-global-props.c. The test function 
test_static_globalprop_subprocess is:

static void test_static_globalprop_subprocess(void)
{
MyType *mt;
static GlobalProperty props[] = {
{ TYPE_STATIC_PROPS, "prop1", "200" },
{}
};

qdev_prop_register_global_list(props);

mt = STATIC_TYPE(object_new(TYPE_STATIC_PROPS));
qdev_init_nofail(DEVICE(mt));

g_assert_cmpuint(mt->prop1, ==, 200);
g_assert_cmpuint(mt->prop2, ==, PROP_DEFAULT);
}

I found that the GlobalProperty props are registered, but I can not find where 
it be used. Can someone explain it? Thanks! 


ch...@foxmail.com

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-22 Thread Li, Liang Z

> > To make things easier, I wrote this doc about the possible designs and
> > my choices. Comments are welcome!
> 
> Thanks for putting this together, and especially for taking the trouble to
> benchmark existing code paths!
> 
> I think these numbers do show that there are gains to be had from merging
> your code with the existing balloon device. It will probably be a bit more 
> work,
> but I think it'll be worth it.
> 
> More comments below.
> 

Thanks for your comments!

> > 2. Why not use virtio-balloon
> > Actually, the virtio-balloon can do the similar thing by inflating the
> > balloon before live migration, but its performance is no good, for an
> > 8GB idle guest just boots, it takes about 5.7 Sec to inflate the
> > balloon to 7GB, but it only takes 25ms to get a valid free page bitmap
> > from the guest.  There are some of reasons for the bad performance of
> > vitio-balloon:
> > a. allocating pages (5%, 304ms)
> 
> Interesting. This is definitely worth improving in guest kernel.
> Also, will it be faster if we allocate and pass to guest huge pages instead?
> Might speed up madvise as well.

Maybe.

> > b. sending PFNs to host (71%, 4194ms)
> 
> OK, so we probably should teach balloon to pass huge lists in bitmaps.
> Will be benefitial for regular balloon operation, as well.
> 

Agree. Current balloon just send 256 PFNs a time, that's too few and lead to 
too many times 
of virtio transmission, that's the main reason for the bad performance.
Change the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value can  improve the
performance significant. Maybe we should increase it before doing the further 
optimization,
do you think so ?

> > c. address translation and madvise() operation (24%, 1423ms)
> 
> How is this split between translation and madvise?  I suspect it's mostly
> madvise since you need translation when using bitmap as well.
> Correct? Could you measure this please?  Also, what if we use the new
> MADV_FREE instead?  By how much would this help?
> 
For the current balloon, address translation is needed. 
But for live migration, there is no need to do address translation.

I did a another try and got the following data:
   a. allocating pages (6.4%, 402ms)
   b. sending PFNs to host (68.3%, 4263ms)
   c. address translation (6.2%, 389ms)
   d. madvise (19.0%, 1188ms)

The address translation is a time consuming operation too.
I will try MADV_FREE later.

> Finally, we could teach balloon to skip madvise completely.
> By how much would this help?
> 
> > Debugging shows the time spends on these operations are listed in the
> > brackets above. By changing the VIRTIO_BALLOON_ARRAY_PFNS_MAX to
> a
> > large value, such as 16384, the time spends on sending the PFNs can be
> > reduced to about 400ms, but it’s still too long.
> > Obviously, the virtio-balloon mechanism has a bigger performance
> > impact to the guest than the way we are trying to implement.
> 
> Since as we see some of the new interfaces might be benefitial to balloon as
> well, I am rather of the opinion that extending the balloon (basically 3a)
> might be the right thing to do.
> 
> > 3. Virtio interface
> > There are three different ways of using the virtio interface to send
> > the free page information.
> > a. Extend the current virtio device
> > The virtio spec has already defined some virtio devices, and we can
> > extend one of these devices so as to use it to transport the free page
> > information. It requires modifying the virtio spec.
> 
> You don't have to do it all by yourself by the way.
> Submit the proposal to the oasis virtio tc mailing list, we will take it from 
> there.
> 
That's great.

>> 4. Construct free page bitmap
>> To minimize the space for saving free page information, it’s better to 
>> use a bitmap to describe the free pages. There are two ways to 
>> construct the free page bitmap.
>> 
>> a. Construct free page bitmap when demand (My choice) Guest can 
>> allocate memory for the free page bitmap only when it receives the 
>> request from QEMU, and set the free page bitmap by traversing the free 
>> page list. The advantage of this way is that it’s quite simple and 
>> easy to implement. The disadvantage is that the traversing operation 
>> may consume quite a long time when there are a lot of free pages. 
>> (About 20ms for 7GB free pages)
>> 
>> b. Update free page bitmap when allocating/freeing pages Another 
>> choice is to allocate the memory for the free page bitmap when guest 
>>boots, and then update the free page bitmap when allocating/freeing 
>> pages. It needs more modification to the code related to memory 
>>management in guest. The advantage of this way is that guest can 
>> response QEMU’s request for a free page bitmap very quickly, no matter 
>> how many free pages in the guest. Do the kernel guys like this?
>>

> > 8. Pseudo code
> > Dirty page logging should be enabled before getting the free page
> > information from guest, this is important because during the process
> > of getti

Re: [Qemu-devel] [Patch 1/1] omap_lcdc: Remove support for DEPTH != 32

2016-03-22 Thread Pooja Dhannawat

On Wed, Mar 23, 2016 at 3:55 AM, Eric Blake  wrote:

> On 03/12/2016 03:23 AM, Pooja Dhannawat wrote:
> > As only DEPTH ==32 case is used, removing other dead code
> > which is based on DEPTH !== 32
> >
> > Signed-off-by: Pooja Dhannawat 
> > ---
>
> > +++ b/hw/display/omap_lcdc.c
> > @@ -71,44 +71,23 @@ static void omap_lcd_interrupts(struct
> omap_lcd_panel_s *s)
> >
> >  #define draw_line_func drawfn
> >
> > -#define DEPTH 8
> > -#include "omap_lcd_template.h"
> > -#define DEPTH 15
> > -#include "omap_lcd_template.h"
> > -#define DEPTH 16
> > -#include "omap_lcd_template.h"
> >  #define DEPTH 32
> >  #include "omap_lcd_template.h"
>
> Umm, the old code WAS using DEPTH != 32.  I'm not a display expert, so
> there may be justification in nuking that code; but the commit message
> needs a better argument than "the code wasn't used" when it sure seems
> to be used.  If the argument is that surface_bits_per_pixel() always
> returns 32, that would be a good argument.
>
>Correct me If I am wrong, I dig down on this and I found internally it
used PIXMAN_FORMAT which uses bits_per_pixel = 32 only.

>
> >  static draw_line_func draw_line_table2[33] = {
> >  [0 ... 32]   = NULL,
> > -[8]  = draw_line2_8,
> > -[15] = draw_line2_15,
> > -[16] = draw_line2_16,
> >  [32] = draw_line2_32,
>
> This array is now wasteful.  If surface_bits_per_pixel() always returns
> 32, then just ditch this array, and later on, change:
>
> Yes sure.


> /* Colour depth */
> switch ((omap_lcd->palette[0] >> 12) & 7) {
> case 1:
> -   draw_line = draw_line_table2[surface_bits_per_pixel(surface)];
> +   draw_line = draw_line2_32;
> bpp = 2;
> break;
>
> In other words, your cleanup, if justified, is incomplete.
>
> --
> Eric Blake   eblake redhat com+1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
>

Re: [Qemu-devel] [PATCH v6 0/4] ARM: add query-gic-capabilities SMP command

2016-03-22 Thread Peter Xu

On Wed, Mar 23, 2016 at 01:32:29PM +0800, Peter Xu wrote:
> The new command can report which kind of GIC device the host/QEMU
> support. The returned result is in the form of array.
> 
> Sample command and output:
> 
> {"execute": "query-gic-capability"}
> {"return": [{"emulated": false, "version": 3, "kernel": false},
> {"emulated": true, "version": 2, "kernel": true}]}

Sorry! This is wrong, which is v5 version of interface.

The new interface should return a dict rather than array:

-> { "execute": "query-gic-capabilities" }
<- { "return": { "capabilities": [
{ "version": 2, "emulated": true, "kernel": false },
{ "version": 3, "emulated": false, "kernel": true } ] } }

Thanks.

-- peterx

Re: [Qemu-devel] [PATCH v5 0/5] ARM: add query-gic-capabilities SMP command

2016-03-22 Thread Peter Xu

On Tue, Mar 22, 2016 at 02:49:19PM +, Peter Maydell wrote:
> On 18 March 2016 at 03:27, Peter Xu  wrote:
> > This patch is to add ARM-specific command "query-gic-capability".
> >
> > The new command can report which kind of GIC device the host/QEMU
> > support. The returned result is in the form of array.
> >
> > Sample command and output:
> >
> > {"execute": "query-gic-capability"}
> > {"return": [{"emulated": false, "version": 3, "kernel": false},
> > {"emulated": true, "version": 2, "kernel": true}]}
> 
> Reminder: I still need confirmation from the libvirt folks
> that this API meets their needs, and review from somebody
> on the QEMU QMP protocol design side. This should go into
> 2.6 but we are getting closer to hardfreeze...

Hi, Peter,

Regarding to QMP part, I have got review comments from Eric/Markus,
and will do more spins if necessary (latest v6). For libvirt part,
Andrea should be the one to finish the libvirt part of this
feature, so I am considering him to be the libvirt guy. Do you think
we need more review besides CCed, especially on libvirt part?

Thanks.

-- peterx

[Qemu-devel] A custom idea for Google Summer of Code 2016

2016-03-22 Thread Anatoly Gusev

Hello,

I'm the computer science student of Brest State Technical University (Belarus), 
and have an idea for Google Summer of Code (I see it as an important feature 
and have already started working on it, but mentoring would be a great help).

The proposition is about implementing one more absolute coordinates device for 
QEMU – a serial tablet with behaviour like usb-tablet we use to get mouse 
integration.

The implementing serial tablet would allow to have mouse integration for 
vintage operating systems. This feature is absent in other virtualisation 
systems, as far a I know. Emulating additional character  device with absolute 
coordinates – the one which was rather popular in 1990x– would allow us to use 
already existing drivers for old operating systems without USB support and so 
such device would bring mouse integration to them.

Supporting such old systems is more important than meets the eye. We see more 
and more web based projects demonstrating historical systems with their 
'screen' embedded into HTML (with VNC client written in JavaScript or something 
similar). One of such projects is oldweb (http://oldweb.today/ 
) , and there are several others 
(e.g. there are single demos of some old Microsoft or Apple OS, and Amiga). Of 
course these systems have no mouse integration, and demonstrate two mouse 
cursors moving with different speed, making it not so easy to operate with. 

We also have a project suffering from this problem: a live history timeline of 
GUI-based mobile & desktop operating systems. This timeline also uses virtual 
machines, embedded into HTML instead of screenshots. Our goal is to virtualise 
about 40 desktop & 30 mobile OSes for this demo (more than 50% is already 
done). The architecture of our project involves QEMU as the container for 
historical operating systems (either straightforward or with an iternal 
emulator).  The project is hosted at 
https://gitssh.com/fiowro/ostimeline/tree/master (dedicated web site with live 
demo demanded separate hardware server and finally should appear this spring…). 
And of course we also have the same problem with guest systems not supporting 
Wacom USB tablet. 

So there are enough projects and users who would benefit from implementing such 
feature.

At first I was thinking to make a plugin subsystem which would allow to write 
separate loadable modules for different character backends (to re-vitalize also 
some abandoned QEMU patches for exotic relative coordinates mice). But later I 
decided to chose more simple goal and just to implement serial tablet.

So I've studied these abandoned patches (mainly one for serial mouse with PC 
mouse systems protocol) and started working to implement similar patch for 
serial Wacom tablet. RS-232 Wacom PenPartner tablet was chosen to match the 
newer USB one, and because it is enough old to have drivers for the larger 
share of antique systems, and also has described protocol. 

At this moment I have got the real hardware PenPartner tablet (model CT-0405-R) 
and three virtual machines configured to use it:
- wacom wtest program under FreeDOS (it discovers the tablet and prints events 
from it),
- windows 3.11,
- windows 95. 
Drivers for OS/2 and MacOS X (and perhaps some others we will find for our 
ostimeline project) are still waiting to be installed into virtualised systems, 
but of course it's not so urgent.

My initial efforts have resulted into some dirty patch to QEMU 1.7 which makes 
wtest to recognise emulated tablet and sends coordinates (a pair of constants) 
on mouse move.  The following link provides pre-compiled qemu binary, its souce 
code, starting script and qcow image with wtest: 
https://www.dropbox.com/sh/q8ud6og8fxjzm60/tcStLohBP6PR-KLyIvxSa?dl=0

Of course the code is far from being finished, but at least it seems to be a 
good proof… And participating in GSoC program would help me a lot not to make 
something useful for other projects and not to finish with one more patch for 
an outdated QEMU version (like an abandoned genius backend, a vmware mouse, and 
a bus mouse patch used to run NextSTEP OS).

P.S. Sorry for such a long intro… Don't know how did it happened…  :)

[Qemu-devel] [PATCH v6 4/4] arm: implement query-gic-capabilities

2016-03-22 Thread Peter Xu

For emulated GIC capabilities, currently only gicv2 is supported. We
need to add gicv3 in when emulated gicv3 ready. For KVM accelerated ARM
VM, we detect the capability bits by creating a scratch VM.

Signed-off-by: Peter Xu 
---
 target-arm/monitor.c | 60 +++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/target-arm/monitor.c b/target-arm/monitor.c
index 254a9c9..4a2db59 100644
--- a/target-arm/monitor.c
+++ b/target-arm/monitor.c
@@ -21,8 +21,66 @@
  */
 #include "qemu/osdep.h"
 #include "qmp-commands.h"
+#include "hw/boards.h"
+#include "kvm_arm.h"
+
+static GICCapability *gic_cap_new(int version)
+{
+GICCapability *cap = g_new0(GICCapability, 1);
+cap->version = version;
+/* by default, support none */
+cap->emulated = false;
+cap->kernel = false;
+return cap;
+}
+
+static GICCapabilityList *gic_cap_list_add(GICCapabilityList *head,
+   GICCapability *cap)
+{
+GICCapabilityList *item = g_new0(GICCapabilityList, 1);
+item->value = cap;
+item->next = head;
+return item;
+}
 
 GICCapabilityResult *qmp_query_gic_capabilities(Error **errp)
 {
-return NULL;
+GICCapabilityResult *result = g_new0(GICCapabilityResult, 1);
+GICCapabilityList *head = NULL;
+GICCapability *v2 = gic_cap_new(2), *v3 = gic_cap_new(3);
+
+v2->emulated = true;
+/* FIXME: we'd change to true after we get emulated GICv3. */
+v3->emulated = false;
+
+#ifdef CONFIG_KVM
+{
+int fdarray[3];
+
+if (!kvm_arm_create_scratch_host_vcpu(NULL, fdarray, NULL)) {
+goto out;
+}
+
+/* Test KVM GICv2 */
+if (kvm_support_device(fdarray[1], KVM_DEV_TYPE_ARM_VGIC_V2)) {
+v2->kernel = true;
+}
+
+/* Test KVM GICv3 */
+if (kvm_support_device(fdarray[1], KVM_DEV_TYPE_ARM_VGIC_V3)) {
+v3->kernel = true;
+}
+
+kvm_arm_destroy_scratch_host_vcpu(fdarray);
+out:
+;
+}
+#endif
+
+head = gic_cap_list_add(head, v2);
+head = gic_cap_list_add(head, v3);
+
+result->capabilities = head;
+
+return result;
 }
-- 
2.4.3

[Qemu-devel] [PATCH v6 1/4] arm: qmp: add query-gic-capabilities interface

2016-03-22 Thread Peter Xu

This patch add "query-gic-capabilities" but does not implemnet it. The
command is ARM-only. The command will return one GICCapabilityResult
object, which contains a array of GICCapability struct that describes
all GIC versions that current QEMU and system support.

Libvirt is possibly the first consumer of this new command.

Before this patch, user will successfully configure all kinds of GIC
devices for ARM guests, no matter whether current QEMU/kernel support
it. If the specified GIC version/type is not supported, user will got an
ambiguous "QEMU boot failure" when trying to start the VM. This is not
user-friendly.

With this patch, libvirt should be able to query which type (and which
version) of GIC device that we support. Use this information, libvirt
can warn the user during configuration of guests when specified GIC
device type is not supported. Or better, we can just list those versions
that we support, and filter out those not-supported ones.

Besides, the community is working on a more generic way to query these
kind of information. However, due to the eagerness of this command, we
decided to first implement this ad-hoc one, then when the generic method
is ready, we can move on to that one smoothly.

Signed-off-by: Peter Xu 
---
 monitor.c|  8 
 qapi-schema.json | 50 
 qmp-commands.hx  | 28 +++
 target-arm/Makefile.objs |  2 +-
 target-arm/monitor.c | 28 +++
 5 files changed, 115 insertions(+), 1 deletion(-)
 create mode 100644 target-arm/monitor.c

diff --git a/monitor.c b/monitor.c
index 4c02f0f..3845213 100644
--- a/monitor.c
+++ b/monitor.c
@@ -4258,3 +4258,11 @@ void qmp_dump_skeys(const char *filename, Error **errp)
 error_setg(errp, QERR_FEATURE_DISABLED, "dump-skeys");
 }
 #endif
+
+#ifndef TARGET_ARM
+GICCapabilityResult *qmp_query_gic_capabilities(Error **errp)
+{
+error_setg(errp, QERR_FEATURE_DISABLED, "query-gic-capabilities");
+return NULL;
+}
+#endif
diff --git a/qapi-schema.json b/qapi-schema.json
index 88f9b81..72359dd 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -4126,3 +4126,53 @@
 ##
 { 'enum': 'ReplayMode',
   'data': [ 'none', 'record', 'play' ] }
+
+##
+# @GICCapability:
+#
+# The struct describes capability for a specific GIC (Generic
+# Interrupt Controller) version. These bits are not only decided by
+# QEMU/KVM software version, but also decided by the hardware that
+# the program is running upon.
+#
+# @version:  version of GIC to be described. Currently, only 2 and 3
+#are supported.
+#
+# @emulated: whether current QEMU/hardware supports emulated GIC
+#device in user space.
+#
+# @kernel:   whether current QEMU/hardware supports hardware
+#accelerated GIC device in kernel.
+#
+# Since: 2.6
+##
+{ 'struct': 'GICCapability',
+  'data': { 'version': 'int',
+'emulated': 'bool',
+'kernel': 'bool' } }
+
+##
+# @GICCapabilityResult:
+#
+# Query result for GIC capabilities.
+#
+# @data:  currently with a single key 'capabilities', which contains
+# a list of GICCapability elements, describing all supported
+# versions/kinds of GIC devices.
+#
+# Since: 2.6
+##
+{ 'struct': 'GICCapabilityResult',
+  'data': { 'capabilities': ['GICCapability'] } }
+
+##
+# @query-gic-capabilities:
+#
+# This command is ARM-only. It will return a GICCapabilityResult
+# object that describes its capability bits.
+#
+# Returns: GICCapabilityResult object.
+#
+# Since: 2.6
+##
+{ 'command': 'query-gic-capabilities', 'returns': 'GICCapabilityResult' }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 9e05365..4144982 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -4853,3 +4853,31 @@ Example:
  {"type": 0, "out-pport": 0, "pport": 0, "vlan-id": 3840,
   "pop-vlan": 1, "id": 251658240}
]}
+
+EQMP
+
+#if defined TARGET_ARM
+{
+.name   = "query-gic-capabilities",
+.args_type  = "",
+.mhandler.cmd_new = qmp_marshal_query_gic_capabilities,
+},
+#endif
+
+SQMP
+query-gic-capabilities
+---
+
+Return a GICCapabilityResult object, which contains a list of
+GICCapability elements, describing supported GIC versions.
+
+Arguments: None
+
+Example:
+
+-> { "execute": "query-gic-capabilities" }
+<- { "return": { "capabilities": [
+{ "version": 2, "emulated": true, "kernel": false },
+{ "version": 3, "emulated": false, "kernel": true } ] } }
+
+EQMP
diff --git a/target-arm/Makefile.objs b/target-arm/Makefile.objs
index a80eb39..334074c 100644
--- a/target-arm/Makefile.objs
+++ b/target-arm/Makefile.objs
@@ -8,4 +8,4 @@ obj-y += translate.o op_helper.o helper.o cpu.o
 obj-y += neon_helper.o iwmmxt_helper.o
 obj-y += gdbstub.o
 obj-$(TARGET_AARCH64) += cpu64.o translate-a64.o helper-a64.o gdbstub64.o
-obj-y += crypto_helper.o
+obj-y += crypto_helper.o monito

[Qemu-devel] [PATCH v6 0/4] ARM: add query-gic-capabilities SMP command

2016-03-22 Thread Peter Xu

v6 changes:
- patch 1 (squashed into patch 2)
  - explain more about the following in commit message: why we need
this command, and what does the entries mean [Markus]
  - explain what GIC is [Eric]
  - squash this patch into patch 2 [Eric]
- patch 2 (new patch 1)
  - fix "does not implement" english error [Eric]
  - remove useless headers in target-arm/monitor.c [Markus]
  - remove this command from whitelist [Markus, Eric]
  - return dict rather than array, with single key points to the
original array results [Markus, Eric]
- patch 5 (new patch 4)
  - tiny change in implementation to suite the new interface.

v5 changes:
- patch 2: moved to target-arm/monitor.c (from target-arm/machine.c)
   [Peter]
- patch 3: splitted into three patches: [all from Peter's comments]
  - patch 3 (new): leverage kvm_arm_create_scratch_host_vcpu(), tiny
enhancement of old one to suite our need
  - patch 4: introduce kvm_support_device() in kvm-all.c
  - patch 5: do the implementation.

v4 changes:
- all: rename query-gic-capability to query-gic-capabilities [Andrea]
- patch 3: rename helper function to kvm_support_device, make it
  inline and lighter. [Drew]

v3 changes:
- patch 2: remove func declaration, add qmp header [Drew]
- patch 3: being able to detect KVM GIC capabilities even without
  kvm enabled [Andrea]: this is a little bit hacky, need some more
  review on this.

v2 changes:
- result layout change: use array and dict for the capability bits
  rather than a single array of strings [Andrea/Markus]
- spelling out what GIC is in doc [Eric]

This patch is to add ARM-specific command "query-gic-capability".

The new command can report which kind of GIC device the host/QEMU
support. The returned result is in the form of array.

Sample command and output:

{"execute": "query-gic-capability"}
{"return": [{"emulated": false, "version": 3, "kernel": false},
{"emulated": true, "version": 2, "kernel": true}]}

Testing:

Smoke tests on both x86 (emulated) and another moonshot ARM server.

Peter Xu (4):
  arm: qmp: add query-gic-capabilities interface
  arm: enhance kvm_arm_create_scratch_host_vcpu
  kvm: add kvm_support_device() helper function
  arm: implement query-gic-capabilities

 include/sysemu/kvm.h |  9 +
 kvm-all.c| 15 +
 monitor.c|  8 +
 qapi-schema.json | 50 
 qmp-commands.hx  | 28 
 target-arm/Makefile.objs |  2 +-
 target-arm/kvm.c | 10 +-
 target-arm/kvm_arm.h |  6 ++--
 target-arm/monitor.c | 86 
 9 files changed, 210 insertions(+), 4 deletions(-)
 create mode 100644 target-arm/monitor.c

-- 
2.4.3

[Qemu-devel] [PATCH v6 3/4] kvm: add kvm_support_device() helper function

2016-03-22 Thread Peter Xu

This can be used when probing whether KVM support specific device. Here,
a raw vmfd is used.

Signed-off-by: Peter Xu 
---
 include/sysemu/kvm.h |  9 +
 kvm-all.c| 15 +++
 2 files changed, 24 insertions(+)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 6695fa7..8738fa1 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -306,6 +306,15 @@ void kvm_device_access(int fd, int group, uint64_t attr,
  */
 int kvm_create_device(KVMState *s, uint64_t type, bool test);
 
+/**
+ * kvm_support_device - probe whether KVM support specific device
+ *
+ * @vmfd: The fd handler for VM
+ * @type: type of device
+ *
+ * @return: true if supported, otherwise false.
+ */
+bool kvm_support_device(int vmfd, uint64_t type);
 
 /* Arch specific hooks */
 
diff --git a/kvm-all.c b/kvm-all.c
index 44c0464..77deccc 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -2339,6 +2339,21 @@ int kvm_create_device(KVMState *s, uint64_t type, bool 
test)
 return test ? 0 : create_dev.fd;
 }
 
+bool kvm_support_device(int vmfd, uint64_t type)
+{
+struct kvm_create_device create_dev = {
+.type = type,
+.fd = -1,
+.flags = KVM_CREATE_DEVICE_TEST,
+};
+
+if (ioctl(vmfd, KVM_CHECK_EXTENSION, KVM_CAP_DEVICE_CTRL) <= 0) {
+return false;
+}
+
+return (ioctl(vmfd, KVM_CREATE_DEVICE, &create_dev) >= 0);
+}
+
 int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source)
 {
 struct kvm_one_reg reg;
-- 
2.4.3

[Qemu-devel] [PATCH v6 2/4] arm: enhance kvm_arm_create_scratch_host_vcpu

2016-03-22 Thread Peter Xu

Some more lines to make sure we allow NULL for 1st/3rd parameter.

Signed-off-by: Peter Xu 
---
 target-arm/kvm.c | 10 +-
 target-arm/kvm_arm.h |  6 --
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/target-arm/kvm.c b/target-arm/kvm.c
index 969ab0b..0a7f9a6 100644
--- a/target-arm/kvm.c
+++ b/target-arm/kvm.c
@@ -62,13 +62,17 @@ bool kvm_arm_create_scratch_host_vcpu(const uint32_t 
*cpus_to_try,
 goto err;
 }
 
+if (!init) {
+goto finish;
+}
+
 ret = ioctl(vmfd, KVM_ARM_PREFERRED_TARGET, init);
 if (ret >= 0) {
 ret = ioctl(cpufd, KVM_ARM_VCPU_INIT, init);
 if (ret < 0) {
 goto err;
 }
-} else {
+} else if (cpus_to_try) {
 /* Old kernel which doesn't know about the
  * PREFERRED_TARGET ioctl: we know it will only support
  * creating one kind of guest CPU which is its preferred
@@ -85,8 +89,12 @@ bool kvm_arm_create_scratch_host_vcpu(const uint32_t 
*cpus_to_try,
 if (ret < 0) {
 goto err;
 }
+} else {
+/* Not providing cpus_to_try, do nothing. */
+;
 }
 
+finish:
 fdarray[0] = kvmfd;
 fdarray[1] = vmfd;
 fdarray[2] = cpufd;
diff --git a/target-arm/kvm_arm.h b/target-arm/kvm_arm.h
index 07f0c72..6bcfe6c 100644
--- a/target-arm/kvm_arm.h
+++ b/target-arm/kvm_arm.h
@@ -124,9 +124,11 @@ void kvm_arm_reset_vcpu(ARMCPU *cpu);
  * kvm_arm_create_scratch_host_vcpu:
  * @cpus_to_try: array of QEMU_KVM_ARM_TARGET_* values (terminated with
  * QEMU_KVM_ARM_TARGET_NONE) to try as fallback if the kernel does not
- * know the PREFERRED_TARGET ioctl
+ * know the PREFERRED_TARGET ioctl. If NULL is provided, will try
+ * nothing.
  * @fdarray: filled in with kvmfd, vmfd, cpufd file descriptors in that order
- * @init: filled in with the necessary values for creating a host vcpu
+ * @init: filled in with the necessary values for creating a host
+ * vcpu. If NULL is provided, will not init the vCPU.
  *
  * Create a scratch vcpu in its own VM of the type preferred by the host
  * kernel (as would be used for '-cpu host'), for purposes of probing it
-- 
2.4.3

Re: [Qemu-devel] [PATCH] hw/display contains files named *_template.h. These are included many times with different values of the DEPTH macro. However, only the DEPTH == 32 case is used. Removed suppo

2016-03-22 Thread Nutan Shinde

Hi John,

My patch is same as the patch mentioned in the link. I just picked up the
task from BitSizedTask page, as it was not crossed.

I guess the issue is resolved then, what should I do in this case? Also,
please tell me, how do I know which tasks are already taken up?

Regards,
Nutan Shinde.

Re: [Qemu-devel] [PATCH v5 0/5] ARM: add query-gic-capabilities SMP command

2016-03-22 Thread Peter Xu

On Tue, Mar 22, 2016 at 03:20:25PM +0100, Markus Armbruster wrote:
> We discussed the need for a yet another ad hoc query command.  Can you
> please summarize the discussion and its conclusion?  Explaining *why* a
> feature is needed is always a good idea.  Cover letter and commit
> messages are good places for it.

Sure. I will mention that in commit message of patch 1 in the next
spin.  Thanks.

-- peterx

Re: [Qemu-devel] [PATCH v5 2/5] arm: qmp: add query-gic-capabilities interface

2016-03-22 Thread Peter Xu

On Tue, Mar 22, 2016 at 07:28:13PM +0100, Markus Armbruster wrote:
> Peter Xu  writes:
> > diff --git a/scripts/qapi.py b/scripts/qapi.py
> > index 6b2aa6e..716474e 100644
> > --- a/scripts/qapi.py
> > +++ b/scripts/qapi.py
> > @@ -46,6 +46,7 @@ returns_whitelist = [
># Whitelist of commands allowed to return a non-dictionary
>returns_whitelist = [
>'human-monitor-command',
>'qom-get',
>'query-migrate-cache-size',
> >  'query-tpm-models',
> >  'query-tpm-types',
> >  'ringbuf-read',
> > +'query-gic-capability',
> >  
> >  # From QGA:
> >  'guest-file-open',
> 
> The whitelist exists to except existing commands from design rules on
> return types.  New commands don't get to violate the rules without a
> really, really compelling reason.
> 
> Do you actually need this?
> 
> If yes, why should your command be permitted to violate the design
> rules?

This might not be required. Agree with you and Eric. Will use a hash
instead with single key.

> > +#include "qemu/osdep.h"
> > +#include "hw/boards.h"
> > +#include "qemu/error-report.h"
> > +#include "sysemu/kvm.h"
> > +#include "qmp-commands.h"
> 
> I very much doubt you need all these includes.  Try dropping all but the
> first and the last one.

I just copied them from machine.c and dropped lots, seems still
redundant... Will keep it a minimum subset in next version. Thanks!

-- peterx

Re: [Qemu-devel] [PATCH v5 2/5] arm: qmp: add query-gic-capabilities interface

2016-03-22 Thread Peter Xu

On Tue, Mar 22, 2016 at 12:42:42PM -0600, Eric Blake wrote:
> On 03/17/2016 09:27 PM, Peter Xu wrote:
> > This patch adds the command "query-gic-capabilities" but not implemnet
> 
> s/not implemnet/does not implement/

Yep, again. Thanks.

> 
> > it. The command is ARM-only. Return of the command is a list of
> > GICCapability struct that describes all GIC versions that current QEMU
> > and system support.
> > 
> > Signed-off-by: Peter Xu 
> > ---
> 
> > +++ b/qapi-schema.json
> > @@ -4156,3 +4156,14 @@
> >'data': { 'version': 'int',
> >  'emulated': 'bool',
> >  'kernel': 'bool' } }
> > +
> > +##
> > +# @query-gic-capabilities:
> > +#
> > +# Return a list of supported GIC version capabilities.
> > +#
> > +# Returns: a list of GICCapability.
> > +#
> > +# Since: 2.6
> > +##
> > +{ 'command': 'query-gic-capabilities', 'returns': ['GICCapability'] }
> 
> On the surface, this seems okay.  As mentioned before, I would have
> squashed 1 and 2 into a single patch.  The GICCapability type is
> extensible, and introspection is sufficient at seeing what the type is
> currently capable of exposing.
> 
> On the other hand...
> 
> > +++ b/scripts/qapi.py
> > @@ -46,6 +46,7 @@ returns_whitelist = [
> >  'query-tpm-models',
> >  'query-tpm-types',
> >  'ringbuf-read',
> > +'query-gic-capability',
> 
> ...it required a whitelist, because you are violating the usual
> convention of returning a dict.  If you DO need the whitelist, your
> addition should have been kept sorted.  But you don't need it, if you
> would modify your QAPI to return a dict:
> 
> { 'struct': 'GICCapabilitiesReturn',
>   'data': { 'capabilities': ['GICCapability'] } }
> { 'command': 'query-gic-capabilities',
>   'returns': 'GICCapabilitiesReturn' }
> 
> Yes, the dict has only a single key, and that key points to the same
> list; but now you have future extensibility: in the future, we could
> return any future global data as a sibling to the array, without having
> to modify every element of the array to repeat redundant information.

Yes, I think this is better solution. Will adopt this in next version.

Thanks for the comments!

-- peterx

Re: [Qemu-devel] [PATCH RFC v2 1/2] Add param Error** to msi_init() & modify the callers

2016-03-22 Thread Cao jin



On 03/02/2016 05:13 PM, Markus Armbruster wrote:

This got lost over the Christmas break, sorry.

Cc'ing Marcel for additional PCI expertise.

Cao jin  writes:


msi_init() is a supporting function in PCI device initialization,
in order to convert .init() to .realize(), it should be modified first.


"Supporting function" doesn't imply "should use Error to report
errors".  HACKING explains:

 Use the simplest suitable method to communicate success / failure to
 callers.  Stick to common methods: non-negative on success / -1 on
 error, non-negative / -errno, non-null / null, or Error objects.

 Example: when a function returns a non-null pointer on success, and it
 can fail only in one way (as far as the caller is concerned), returning
 null on failure is just fine, and certainly simpler and a lot easier on
 the eyes than propagating an Error object through an Error ** parameter.

 Example: when a function's callers need to report details on failure
 only the function really knows, use Error **, and set suitable errors.

 Do not report an error to the user when you're also returning an error
 for somebody else to handle.  Leave the reporting to the place that
 consumes the error returned.



Really appreciate your review, I just finished reading all the comments 
and discussion.


Seems pci_add_capability2()(commit cd9aa33e introduced) doesn`t follow 
the new error reporting rule(report error while also return error).


So I am thinking, could we revert commit cd9aa33e, let 
pci_add_capability() return error code and assert when out of pci space, 
and let caller(only assigned device, others could ignore the error) 
handle the error code(new a error object, propagate it)


Hope to hear PCI Maintainer`s advice(So I don`t cc other in this round)

--
Yours Sincerely,

Cao jin

Re: [Qemu-devel] [PATCH 00/15] NVDIMM: introduce nvdimm label support

2016-03-22 Thread Xiao Guangrong




On 03/22/2016 07:17 PM, Michael S. Tsirkin wrote:

On Thu, Mar 17, 2016 at 04:32:46PM +0800, Xiao Guangrong wrote:

This patchset is against commit d4207b223eef3 (fw-cfg: support writeable
blobs) on pci branch of Michael's git tree and can be found at:
   https://github.com/xiaogr/qemu.git nvdimm-label-v1


I see there have been comments, so feel free to iterate on-list, but
please also remember to repost after 2.6 is out.


Okay.

Thank you, Michael!

Re: [Qemu-devel] [PATCH 12/15] nvdimm acpi: support Get Namespace Label Size function

2016-03-22 Thread Xiao Guangrong




On 03/17/2016 06:58 PM, Stefan Hajnoczi wrote:

On Thu, Mar 17, 2016 at 04:32:58PM +0800, Xiao Guangrong wrote:

-/* No function except function 0 is supported yet. */
-nvdimm_dsm_no_payload(1 /* Not Supported */, dsm_mem_addr);
+if (!nvdimm) {
+return nvdimm_dsm_no_payload(1 /* Non-Existing Memory Device */,


"Non-existing Memory Device" is 2 according to the spec.  1 would be
"Not Supported".


Yes, indeed.



Please define constants for all these magic values instead of putting
literals into the code:

enum {
 NVDIMM_DSM_SUCCESS = 0,
 NVDIMM_DSM_NOT_SUPPORTED = 1,
 NVDIMM_DSM_NON_EXISTING_MEMORY_DEVICE = 2,
 NVDIMM_DSM_INVALID_INPUT_PARAMETERS = 3,
 NVDIMM_DSM_VENDOR_SPECIFIC_ERROR = 4,
};


Er, it seems Michael much prefers the style which uses raw number which
is followed by a comment. :(




+ dsm_mem_addr);
+}
+
+/* Encode DSM function according to DSM Spec Rev1. */
+switch (in->function) {
+case 4 /* Get Namespace Label Size */:
+if (nvdimm->reserve_label) {
+nvdimm_dsm_label_size(nvdimm, dsm_mem_addr);
+}
+break;


What is the expected return status code if the device has no labels?

This function must write a return status code, otherwise the guest will
read the out buffer and attempt to interpret uninitialized memory.



Yes, my fault, will fix it. Thanks you for pointing it out.

Re: [Qemu-devel] [RFC PATCH v2 2/2] spapr: Memory hot-unplug support

2016-03-22 Thread David Gibson

On Wed, Mar 16, 2016 at 10:11:54AM +0530, Bharata B Rao wrote:
> On Wed, Mar 16, 2016 at 12:36:05PM +1100, David Gibson wrote:
> > On Tue, Mar 15, 2016 at 10:08:56AM +0530, Bharata B Rao wrote:
> > > Add support to hot remove pc-dimm memory devices.
> > > 
> > > Signed-off-by: Bharata B Rao 
> > 
> > Reviewed-by: David Gibson 
> > 
> > Looks correct, but again, needs to wait on the PAPR change.
> > 
> > Have you thought any further on the idea of sending an index message,
> > then a count message as an interim approach to fixing this without
> > requiring a PAPR change?
> 
> Removal by index and removal by count are valid messages by themselves
> and drmgr would go ahead and start the removal in reponse to those
> calls. IIUC, you are suggesting that lets remove one LMB by index in
> response to 1st message and remove (count -1) LMBs from where the last
> removal was done in the previous message.

Yes, that's the idea.

> Since the same code base of powerpc-utils works on PowerVM too, I am not
> sure if such an approach would impact PowerVM in any undesirable manner.
> May be Nathan can clarify ?

Heard anything from Nathan?  I don't really see how it would be bad
under PowerVM.  Under PowerVM it generally doesn't matter which LMBs
you remove, right?  So removing the ones immediately after the last
one you removed should be as good a choice as any.

> I see that this can be done, but the changes in drmgr code specially the
> code related to LMB list handling/removal can be non-trivial. So not sure
> if the temporary approach is all that worth here and hence I feel it is better
> to wait and do it the count-indexed way.

Really?  drmgr is already scanning LMBs to find ones it can remove.
Seeding that scan with the last removed LMB shouldn't be too hard.

> While we are here, I would also like to get some opinion on the real
> need for memory unplug. Is there anything that memory unplug gives us
> which memory ballooning (shrinking mem via ballooning) can't give ?

Hmm.. that's an interesting question.  

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature

Re: [Qemu-devel] [PATCH 10/15] nvdimm acpi: abstract the operations for root device and nvdimm devices

2016-03-22 Thread Xiao Guangrong




On 03/17/2016 06:35 PM, Stefan Hajnoczi wrote:

On Thu, Mar 17, 2016 at 04:32:56PM +0800, Xiao Guangrong wrote:

+static void
+nvdimm_dsm_function0(uint32_t supported_func, hwaddr dsm_mem_addr)
+{
+NvdimmDsmFunc0Out func0 = {
+.len = cpu_to_le32(sizeof(func0)),
+.supported_func = cpu_to_le32(supported_func),
+};
+cpu_physical_memory_write(dsm_mem_addr, &func0, sizeof(func0));
+}
+
+static void
+nvdimm_dsm_no_payload(uint32_t func_ret_status, hwaddr dsm_mem_addr)
+{
+NvdimmDsmFuncNoPayloadOut out = {
+.len = cpu_to_le32(sizeof(out)),
+.func_ret_status = cpu_to_le32(func_ret_status),
+};
+cpu_physical_memory_write(dsm_mem_addr, &out, sizeof(out));
+}
+
+static void nvdimm_dsm_root(NvdimmDsmIn *in, hwaddr dsm_mem_addr)
+{
+/*
+ * function 0 is called to inquire which functions are supported by
+ * OSPM
+ */
+if (!in->function) {
+return nvdimm_dsm_function0(0 /* No function supported other
+ than function 0 */, dsm_mem_addr);


The return type is void so "return foo()" looks strange.  I went back
and double-checked function prototypes because I was surprised by this
line of code.  Please use the conventional "foo(); return;" for void
return instead.



I am so lazy to omit this single line. :)

Okay, will follow this style in the next version.

Re: [Qemu-devel] [PATCH 04/15] nvdimm: support nvdimm label

2016-03-22 Thread Xiao Guangrong




On 03/17/2016 06:28 PM, Stefan Hajnoczi wrote:

On Thu, Mar 17, 2016 at 04:32:50PM +0800, Xiao Guangrong wrote:

+static void nvdimm_init(Object *obj)
+{
+object_property_add_bool(obj, "reserve-label", nvdimm_get_reserve_label,
+ nvdimm_set_reserve_label, NULL);


In the future users may wish for larger namespace label sizes.  This
bool option will not allow that.

Perhaps the option should be an integer called "label-size"?


Yes, good to me.




+static void nvdimm_assert_rw_label_data(NVDIMMDevice *nvdimm, uint64_t size,
+uint64_t offset)
+{
+assert(nvdimm->reserve_label &&
+   (nvdimm->label_size >= size + offset) && (offset + size > offset));
+}


It's not clear from this patch alone, but QEMU is not allowed to assert
due to invalid inputs from the guest.  So if input validation is
necessary here because the values may be invalid, please write if
statements and error returns.


The caller should check it before calling these callbacks, in our case, we did
it in nvdimm_rw_label_data_check() in patch 13.

So if that happen, it is really a QEMU internal BUG.



This is important so guests cannot cause QEMU to core dump (SIGABRT
default behavior) and so that nested virtualization doesn't allow a
nested guest to DoS its parent guest.



Yes, i understood it, but it is not the case in this patch as the assert()
can not be triggered by guest.

Maybe i should mention it in the changelog to make this fact more clean.

[Qemu-devel] [PATCH for-2.6 6/7] block/vpc: set errp in vpc_open

2016-03-22 Thread Jeff Cody

Add more useful error information to failure paths in vpc_open

Signed-off-by: Jeff Cody 
---
 block/vpc.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/block/vpc.c b/block/vpc.c
index 67ab376..5dd9950 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -237,6 +237,7 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
 
 ret = bdrv_pread(bs->file->bs, 0, s->footer_buf, HEADER_SIZE);
 if (ret < 0) {
+error_setg(errp, "Unable to read VHD header");
 goto fail;
 }
 
@@ -245,9 +246,11 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
 int64_t offset = bdrv_getlength(bs->file->bs);
 if (offset < 0) {
 ret = offset;
+error_setg(errp, "Invalid file size");
 goto fail;
 } else if (offset < HEADER_SIZE) {
 ret = -EINVAL;
+error_setg(errp, "File too small for a VHD header");
 goto fail;
 }
 
@@ -326,12 +329,14 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
 ret = bdrv_pread(bs->file->bs, be64_to_cpu(footer->data_offset), buf,
  HEADER_SIZE);
 if (ret < 0) {
+error_setg(errp, "Error reading dynamic VHD header");
 goto fail;
 }
 
 dyndisk_header = (VHDDynDiskHeader *) buf;
 
 if (strncmp(dyndisk_header->magic, "cxsparse", 8)) {
+error_setg(errp, "Invalid header magic");
 ret = -EINVAL;
 goto fail;
 }
@@ -347,12 +352,14 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
 s->max_table_entries = be32_to_cpu(dyndisk_header->max_table_entries);
 
 if ((bs->total_sectors * 512) / s->block_size > 0xU) {
+error_setg(errp, "Too many blocks");
 ret = -EINVAL;
 goto fail;
 }
 
 computed_size = (uint64_t) s->max_table_entries * s->block_size;
 if (computed_size < bs->total_sectors * 512) {
+error_setg(errp, "Page table too small");
 ret = -EINVAL;
 goto fail;
 }
@@ -369,6 +376,7 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
 
 s->pagetable = qemu_try_blockalign(bs->file->bs, pagetable_size);
 if (s->pagetable == NULL) {
+error_setg(errp, "Unable to allocate memory for page table");
 ret = -ENOMEM;
 goto fail;
 }
@@ -378,6 +386,7 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
 ret = bdrv_pread(bs->file->bs, s->bat_offset, s->pagetable,
  pagetable_size);
 if (ret < 0) {
+error_setg(errp, "Error reading pagetable");
 goto fail;
 }
 
-- 
1.9.3

[Qemu-devel] [PATCH for-2.6 7/7] block/vpc: update comments to be compliant w/coding guidelines

2016-03-22 Thread Jeff Cody

Signed-off-by: Jeff Cody 
---
 block/vpc.c | 70 ++---
 1 file changed, 35 insertions(+), 35 deletions(-)

diff --git a/block/vpc.c b/block/vpc.c
index 5dd9950..0b48cf4 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -36,7 +36,7 @@
 
 #define HEADER_SIZE 512
 
-//#define CACHE
+/* #define CACHE */
 
 enum vhd_type {
 VHD_FIXED   = 2,
@@ -44,7 +44,7 @@ enum vhd_type {
 VHD_DIFFERENCING= 4,
 };
 
-// Seconds since Jan 1, 2000 0:00:00 (UTC)
+/* Seconds since Jan 1, 2000 0:00:00 (UTC) */
 #define VHD_TIMESTAMP_BASE 946684800
 
 #define VHD_CHS_MAX_C   65535LL
@@ -56,22 +56,22 @@ enum vhd_type {
 
 #define VPC_OPT_FORCE_SIZE "force_size"
 
-// always big-endian
+/* always big-endian */
 typedef struct vhd_footer {
-charcreator[8]; // "conectix"
+charcreator[8]; /* "conectix" */
 uint32_tfeatures;
 uint32_tversion;
 
-// Offset of next header structure, 0x if none
+/* Offset of next header structure, 0x if none */
 uint64_tdata_offset;
 
-// Seconds since Jan 1, 2000 0:00:00 (UTC)
+/* Seconds since Jan 1, 2000 0:00:00 (UTC) */
 uint32_ttimestamp;
 
-charcreator_app[4]; // "vpc "
+charcreator_app[4]; /*  e.g., "vpc " */
 uint16_tmajor;
 uint16_tminor;
-charcreator_os[4]; // "Wi2k"
+charcreator_os[4]; /* "Wi2k" */
 
 uint64_torig_size;
 uint64_tcurrent_size;
@@ -82,29 +82,29 @@ typedef struct vhd_footer {
 
 uint32_ttype;
 
-// Checksum of the Hard Disk Footer ("one's complement of the sum of all
-// the bytes in the footer without the checksum field")
+/* Checksum of the Hard Disk Footer ("one's complement of the sum of all
+   the bytes in the footer without the checksum field") */
 uint32_tchecksum;
 
-// UUID used to identify a parent hard disk (backing file)
+/* UUID used to identify a parent hard disk (backing file) */
 uint8_t uuid[16];
 
 uint8_t in_saved_state;
 } QEMU_PACKED VHDFooter;
 
 typedef struct vhd_dyndisk_header {
-charmagic[8]; // "cxsparse"
+charmagic[8]; /* "cxsparse" */
 
-// Offset of next header structure, 0x if none
+/* Offset of next header structure, 0x if none */
 uint64_tdata_offset;
 
-// Offset of the Block Allocation Table (BAT)
+/* Offset of the Block Allocation Table (BAT) */
 uint64_ttable_offset;
 
 uint32_tversion;
-uint32_tmax_table_entries; // 32bit/entry
+uint32_tmax_table_entries; /* 32bit/entry */
 
-// 2 MB by default, must be a power of two
+/* 2 MB by default, must be a power of two */
 uint32_tblock_size;
 
 uint32_tchecksum;
@@ -112,7 +112,7 @@ typedef struct vhd_dyndisk_header {
 uint32_tparent_timestamp;
 uint32_treserved;
 
-// Backing file name (in UTF-16)
+/* Backing file name (in UTF-16) */
 uint8_t parent_name[512];
 
 struct {
@@ -277,9 +277,9 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
 /* Write 'checksum' back to footer, or else will leave it with zero. */
 footer->checksum = cpu_to_be32(checksum);
 
-// The visible size of a image in Virtual PC depends on the geometry
-// rather than on the size stored in the footer (the size in the footer
-// is too large usually)
+/* The visible size of a image in Virtual PC depends on the geometry
+   rather than on the size stored in the footer (the size in the footer
+   is too large usually) */
 bs->total_sectors = (int64_t)
 be16_to_cpu(footer->cyls) * footer->heads * footer->secs_per_cyl;
 
@@ -465,16 +465,16 @@ static inline int64_t get_sector_offset(BlockDriverState 
*bs,
 pageentry_index = (offset % s->block_size) / 512;
 
 if (pagetable_index >= s->max_table_entries || 
s->pagetable[pagetable_index] == 0x)
-return -1; // not allocated
+return -1; /* not allocated */
 
 bitmap_offset = 512 * (uint64_t) s->pagetable[pagetable_index];
 block_offset = bitmap_offset + s->bitmap_size + (512 * pageentry_index);
 
-// We must ensure that we don't write to any sectors which are marked as
-// unused in the bitmap. We get away with setting all bits in the block
-// bitmap each time we write to a new block. This might cause Virtual PC to
-// miss sparse read optimization, but it's not a problem in terms of
-// correctness.
+/* We must ensure that we don't write to any sectors which are marked as
+   unused in the bitmap. We get away with setting all bits in the block
+   bitmap each time we write to a new block. This might cause Virtual PC to
+   miss sparse read optimization, but it's not a problem in terms of
+   correctness. */
 if (write && (s->last_bitmap_offset != bitmap_offset)) {
 uint8_t bitmap[s->bitmap_si

[Qemu-devel] [PATCH for-2.6 4/7] block/vpc: Use the correct max sector count for VHD images

2016-03-22 Thread Jeff Cody

The old VHD_MAX_SECTORS value is incorrect, and is a throwback
to the CHS calculations.  The VHD specification allows images up to 2040
GiB, which (using 512 byte sectors) corresponds to a maximum number of
sectors of 0xff00, rather than the old value of 0xfe0001ff.

Update VHD_MAX_SECTORS to reflect the correct value.

Also, update comment references to the actual size limit, and correct
one compare so that we can have sizes up to the limit.

Signed-off-by: Jeff Cody 
---
 block/vpc.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/vpc.c b/block/vpc.c
index 6ad8406..2e023d0 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -51,7 +51,7 @@ enum vhd_type {
 #define VHD_CHS_MAX_H   16
 #define VHD_CHS_MAX_S   255
 
-#define VHD_MAX_SECTORS   (65535LL * 255 * 255)
+#define VHD_MAX_SECTORS   0xff00/* 2040 GiB max image size */
 #define VHD_MAX_GEOMETRY  (VHD_CHS_MAX_C * VHD_CHS_MAX_H * VHD_CHS_MAX_S)
 
 #define VPC_OPT_FORCE_SIZE "force_size"
@@ -316,8 +316,8 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
 BDRV_SECTOR_SIZE;
 }
 
-/* Allow a maximum disk size of approximately 2 TB */
-if (bs->total_sectors >= VHD_MAX_SECTORS) {
+/* Allow a maximum disk size of 2040 GiB */
+if (bs->total_sectors > VHD_MAX_SECTORS) {
 ret = -EFBIG;
 goto fail;
 }
@@ -721,7 +721,7 @@ static int64_t coroutine_fn 
vpc_co_get_block_status(BlockDriverState *bs,
  * Note that the geometry doesn't always exactly match total_sectors but
  * may round it down.
  *
- * Returns 0 on success, -EFBIG if the size is larger than ~2 TB. Override
+ * Returns 0 on success, -EFBIG if the size is larger than 2040 GiB. Override
  * the hardware EIDE and ATA-2 limit of 16 heads (max disk size of 127 GB)
  * and instead allow up to 255 heads.
  */
@@ -927,7 +927,7 @@ static int vpc_create(const char *filename, QemuOpts *opts, 
Error **errp)
 
 if ((int64_t)cyls * heads * secs_per_cyl == VHD_MAX_GEOMETRY) {
 total_sectors = total_size / BDRV_SECTOR_SIZE;
-/* Allow a maximum disk size of approximately 2 TB */
+/* Allow a maximum disk size of 2040 GiB */
 if (total_sectors > VHD_MAX_SECTORS) {
 error_setg(errp, "Disk size is too large, max size is 2040 GiB");
 ret = -EFBIG;
-- 
1.9.3

[Qemu-devel] [PATCH for-2.6 3/7] block/vpc: use current_size field for XenConverter VHD images

2016-03-22 Thread Jeff Cody

XenConverter VHD images are another VHD image where current_size is
different from the CHS values in the the format header.  Use
current_size as the default, by looking at the creator_app signature
field.

Signed-off-by: Jeff Cody 
---
 block/vpc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/vpc.c b/block/vpc.c
index 8b8b9a7..6ad8406 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -299,6 +299,7 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
  *  'win '  :  current_size Hyper-V
  *  'd2v '  :  current_size Disk2vhd
  *  'tap\0' :  current_size XenServer
+ *  'CTXS'  :  current_size XenConverter
  *
  *  The user can override the table values via drive options, however
  *  even with an override we will still use current_size for images
@@ -307,6 +308,7 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
 use_chs = (!!strncmp(footer->creator_app, "win ", 4) &&
!!strncmp(footer->creator_app, "qem2", 4) &&
!!strncmp(footer->creator_app, "d2v ", 4) &&
+   !!strncmp(footer->creator_app, "CTXS", 4) &&
!!memcmp(footer->creator_app, "tap", 4)) || s->force_use_chs;
 
 if (!use_chs || bs->total_sectors == VHD_MAX_GEOMETRY || s->force_use_sz) {
-- 
1.9.3

[Qemu-devel] [PATCH for-2.6 5/7] block/vpc: make checks on max table size a bit more lax

2016-03-22 Thread Jeff Cody

The check on the max_table_size field not being larger than required is
valid, and in accordance with the VHD spec.  However, there have been
VHD images encountered in the wild that have an out-of-spec max table
size that is technically too large.

There is no issue in allowing this larger table size, as we also
later verify that the computed size (used for the pagetable) is
large enough to fit all sectors.  In addition, max_table_entries
is bounds checked against SIZE_MAX and INT_MAX.

Remove the strict check, so that we can accomodate these sorts of
images that are benignly out of spec.

Reported-by: Stefan Hajnoczi 
Reported-by: Grant Wu 
Signed-off-by: Jeff Cody 
---
 block/vpc.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/vpc.c b/block/vpc.c
index 2e023d0..67ab376 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -350,10 +350,6 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
 ret = -EINVAL;
 goto fail;
 }
-if (s->max_table_entries > (VHD_MAX_SECTORS * 512) / s->block_size) {
-ret = -EINVAL;
-goto fail;
-}
 
 computed_size = (uint64_t) s->max_table_entries * s->block_size;
 if (computed_size < bs->total_sectors * 512) {
-- 
1.9.3

[Qemu-devel] [PATCH for-2.6 2/7] vpc: use current_size field for XenServer VHD images

2016-03-22 Thread Jeff Cody

From: Stefan Hajnoczi 

The vpc driver has two methods of determining virtual disk size.  The
correct one to use depends on the software that generated the image
file.  Add the XenServer creator_app signature so that image size is
correctly detected for those images.

Reported-by: Grant Wu 
Reported-by: Spencer Baugh 
Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Jeff Cody 
---
 block/vpc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/vpc.c b/block/vpc.c
index bc3d1c6..8b8b9a7 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -298,6 +298,7 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
  *  'qem2'  :  current_size QEMU (uses current_size)
  *  'win '  :  current_size Hyper-V
  *  'd2v '  :  current_size Disk2vhd
+ *  'tap\0' :  current_size XenServer
  *
  *  The user can override the table values via drive options, however
  *  even with an override we will still use current_size for images
@@ -305,7 +306,8 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
int flags,
  */
 use_chs = (!!strncmp(footer->creator_app, "win ", 4) &&
!!strncmp(footer->creator_app, "qem2", 4) &&
-   !!strncmp(footer->creator_app, "d2v ", 4)) || s->force_use_chs;
+   !!strncmp(footer->creator_app, "d2v ", 4) &&
+   !!memcmp(footer->creator_app, "tap", 4)) || s->force_use_chs;
 
 if (!use_chs || bs->total_sectors == VHD_MAX_GEOMETRY || s->force_use_sz) {
 bs->total_sectors = be64_to_cpu(footer->current_size) /
-- 
1.9.3

[Qemu-devel] [PATCH for-2.6 1/7] block/vpc: fix VPC 'qemu-img create' regression

2016-03-22 Thread Jeff Cody

Commit 'b8f45cdf7827e39f9a1e6cc446f5972cc6144237' switched VPC
over to using blk_pwrite() instead of bdrv_pwrite_sync().  The
return value of bdrv_pwrite_sync() was always 0 for success, and
create_dynamic_disk() in one instance checked for a non-zero return
value to indicate error.  However, blk_pwrite() may return positive
values for success.

This fails silently as well, since vpc_create() did not set errp
in this failuer case.  Set errp in all instances in vpc_create().

Signed-off-by: Jeff Cody 
---
 block/vpc.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/vpc.c b/block/vpc.c
index 8435205..bc3d1c6 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -774,7 +774,7 @@ static int create_dynamic_disk(BlockBackend *blk, uint8_t 
*buf,
 num_bat_entries = (total_sectors + block_size / 512) / (block_size / 512);
 
 ret = blk_pwrite(blk, offset, buf, HEADER_SIZE);
-if (ret) {
+if (ret < 0) {
 goto fail;
 }
 
@@ -873,6 +873,7 @@ static int vpc_create(const char *filename, QemuOpts *opts, 
Error **errp)
 } else if (!strcmp(disk_type_param, "fixed")) {
 disk_type = VHD_FIXED;
 } else {
+error_setg(errp, "Invalid disk type, %s", disk_type_param);
 ret = -EINVAL;
 goto out;
 }
@@ -924,6 +925,7 @@ static int vpc_create(const char *filename, QemuOpts *opts, 
Error **errp)
 total_sectors = total_size / BDRV_SECTOR_SIZE;
 /* Allow a maximum disk size of approximately 2 TB */
 if (total_sectors > VHD_MAX_SECTORS) {
+error_setg(errp, "Disk size is too large, max size is 2040 GiB");
 ret = -EFBIG;
 goto out;
 }
@@ -974,6 +976,9 @@ static int vpc_create(const char *filename, QemuOpts *opts, 
Error **errp)
 } else {
 ret = create_fixed_disk(blk, buf, total_size);
 }
+if (ret < 0) {
+error_setg(errp, "Unable to create or write VHD header");
+}
 
 out:
 blk_unref(blk);
-- 
1.9.3

[Qemu-devel] [PATCH for-2.6 0/7] block: VHD format fixes

2016-03-22 Thread Jeff Cody

Fixes for a regression in vpc_create(), as well as a few issues with VHD format
compatibility.


Jeff Cody (6):
  block/vpc: fix VPC 'qemu-img create' regression
  block/vpc: use current_size field for XenConverter VHD images
  block/vpc: Use the correct max sector count for VHD images
  block/vpc: make checks on max table size a bit more lax
  block/vpc: set errp in vpc_open
  block/vpc: update comments to be compliant w/coding guidelines

Stefan Hajnoczi (1):
  vpc: use current_size field for XenServer VHD images

 block/vpc.c | 106 ++--
 1 file changed, 60 insertions(+), 46 deletions(-)

-- 
1.9.3

Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

2016-03-22 Thread Alexey Kardashevskiy


On 03/23/2016 01:13 PM, David Gibson wrote:

On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:

This adds support for Dynamic DMA Windows (DDW) option defined by
the SPAPR specification which allows to have additional DMA window(s)

This implements DDW for emulated and VFIO devices.
This reserves RTAS token numbers for DDW calls.

This changes the TCE table migration descriptor to support dynamic
tables as from now on, PHB will create as many stub TCE table objects
as PHB can possibly support but not all of them might be initialized at
the time of migration because DDW might or might not be requested by
the guest.

The "ddw" property is enabled by default on a PHB but for compatibility
the pseries-2.5 machine and older disable it.

This implements DDW for VFIO. The host kernel support is required.
This adds a "levels" property to PHB to control the number of levels
in the actual TCE table allocated by the host kernel, 0 is the default
value to tell QEMU to calculate the correct value. Current hardware
supports up to 5 levels.

The existing linux guests try creating one additional huge DMA window
with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
the guest switches to dma_direct_ops and never calls TCE hypercalls
(H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
and not waste time on map/unmap later. This adds a "dma64_win_addr"
property which is a bus address for the 64bit window and by default
set to 0x800... as this is what the modern POWER8 hardware
uses and this allows having emulated and VFIO devices on the same bus.

This adds 4 RTAS handlers:
* ibm,query-pe-dma-window
* ibm,create-pe-dma-window
* ibm,remove-pe-dma-window
* ibm,reset-pe-dma-window
These are registered from type_init() callback.

These RTAS handlers are implemented in a separate file to avoid polluting
spapr_iommu.c with PCI.

Signed-off-by: Alexey Kardashevskiy 
---
  hw/ppc/Makefile.objs|   1 +
  hw/ppc/spapr.c  |   7 +-
  hw/ppc/spapr_pci.c  |  73 ---
  hw/ppc/spapr_rtas_ddw.c | 300 
  hw/vfio/common.c|   5 -
  include/hw/pci-host/spapr.h |  13 ++
  include/hw/ppc/spapr.h  |  16 ++-
  trace-events|   4 +
  8 files changed, 395 insertions(+), 24 deletions(-)
  create mode 100644 hw/ppc/spapr_rtas_ddw.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index c1ffc77..986b36f 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o 
spapr_rng.o
  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
  obj-y += spapr_pci_vfio.o
  endif
+obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
  # PowerPC 4xx boards
  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
  obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d0bb423..ef4c637 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
   * pseries-2.5
   */
  #define SPAPR_COMPAT_2_5 \
-HW_COMPAT_2_5
+HW_COMPAT_2_5 \
+{\
+.driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
+.property = "ddw",\
+.value= stringify(off),\
+},

  static void spapr_machine_2_5_instance_options(MachineState *machine)
  {
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index af99a36..3bb294a 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, 
PCIDevice *pdev)
  return buf;
  }

-static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
-   uint32_t liobn,
-   uint32_t page_shift,
-   uint64_t window_addr,
-   uint64_t window_size,
-   Error **errp)
+void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
+ uint32_t liobn,
+ uint32_t page_shift,
+ uint64_t window_addr,
+ uint64_t window_size,
+ Error **errp)
  {
  sPAPRTCETable *tcet;
  uint32_t nb_table = window_size >> page_shift;
@@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState 
*sphb,
  return;
  }

+if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
+error_setg(errp,
+   "Attempt to use second window when DDW is disabled on PHB");
+return;
+}


This should never happen unless something is wrong with the tests in
the RTAS functions, yes?  In which case it should probably be an
assert().


This should not. But this is called from the RTAS caller so I'd really like 
to have a message rather than assert() if that cond

Re: [Qemu-devel] [PATCH v5 1/5] arm: qmp: add GICCapability struct

2016-03-22 Thread Peter Xu

On Tue, Mar 22, 2016 at 12:32:28PM -0600, Eric Blake wrote:
> On 03/17/2016 09:27 PM, Peter Xu wrote:
> > +##
> > +# @GICCapability:
> > +#
> > +# This struct describes capability for a specific GIC version. These
> 
> Might be nice to spell out what the acronym GIC means, but that's cosmetic.

Ah! I thought I have added that... It's missing again. Will do in
next spin.

> 
> > +# bits are not only decided by QEMU/KVM software version, but also
> > +# decided by the hardware that the program is running upon.
> > +#
> > +# @version:  version of GIC to be described.
> > +#
> > +# @emulated: whether current QEMU/hardware supports emulated GIC
> > +#device in user space.
> > +#
> > +# @kernel:   whether current QEMU/hardware supports hardware
> > +#accelerated GIC device in kernel.
> > +#
> > +# Since: 2.6
> > +##
> > +{ 'struct': 'GICCapability',
> > +  'data': { 'version': 'int',
> > +'emulated': 'bool',
> > +'kernel': 'bool' } }
> > 
> 
> I might have squashed this with the patch that first uses GICCapability,
> as defining a type in isolation doesn't do much.

I can do the squash in next spin if you prefer that one. Actually I
got this question before, about when I should split and when to
squash. E.g., shall I make sure that I should have no "definition
only" patches in the future?

-- peterx

Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU

2016-03-22 Thread Alexey Kardashevskiy


On 03/23/2016 01:53 PM, David Gibson wrote:

On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:

On 03/23/2016 12:08 PM, David Gibson wrote:

On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:

On 03/22/2016 04:14 PM, David Gibson wrote:

On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:

New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
This adds ability to VFIO common code to dynamically allocate/remove
DMA windows in the host kernel when new VFIO container is added/removed.

This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
and adds just created IOMMU into the host IOMMU list; the opposite
action is taken in vfio_listener_region_del.

When creating a new window, this uses euristic to decide on the TCE table
levels number.

This should cause no guest visible change in behavior.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v14:
* new to the series

---
TODO:
* export levels to PHB
---
  hw/vfio/common.c | 108 ---
  trace-events |   2 ++
  2 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4e873b7..421d6eb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
  return 0;
  }

+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
+{
+VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 
0x1000);


The hard-coded 0x1000 looks dubious..


Well, that's the minimal page size...


Really?  Some BookE CPUs support 1KiB page size..


Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)


Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
it's been done for CPU MMU I wouldn't count on it not being done for
IOMMU.

1 is a safer choice.







+g_assert(hiommu);
+QLIST_REMOVE(hiommu, hiommu_next);
+}
+
  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
  {
  return (!memory_region_is_ram(section->mr) &&
@@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
  }
  end = int128_get64(llend);

+if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {


I think this would be clearer split out into a helper function,
vfio_create_host_window() or something.



It is rather vfio_spapr_create_host_window() and we were avoiding
xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
separate file but this usually triggers more discussion and never ends well.




+unsigned entries, pages;
+struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) 
};
+
+g_assert(section->mr->iommu_ops);
+g_assert(memory_region_is_iommu(section->mr));


I don't think you need these asserts.  AFAICT the same logic should
work if a RAM MR was added directly to PCI address space - this would
create the new host window, then the existing code for adding a RAM MR
would map that block of RAM statically into the new window.


In what configuration/machine can we do that on SPAPR?


spapr guests won't ever do that.  But you can run an x86 guest on a
powernv host and this situation could come up.



I am pretty sure VFIO won't work in this case anyway.


I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.


This is not about TCG (pseries TCG guest works with VFIO on powernv host), 
this is about things like VFIO_IOMMU_GET_INFO vs. 
VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work.


Should I add such support in this patchset?





In any case there's no point asserting if the code is correct anyway.


Assert here says (at least) "not tested" or "not expected to
happen".


Hmmm..







+trace_vfio_listener_region_add_iommu(iova, end - 1);
+/*
+ * FIXME: For VFIO iommu types which have KVM acceleration to
+ * avoid bouncing all map/unmaps through qemu this way, this
+ * would be the right place to wire that up (tell the KVM
+ * device emulation the VFIO iommu handles to use).
+ */
+create.window_size = memory_region_size(section->mr);
+create.page_shift =
+ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));


Ah.. except that I guess you'd need to fall back to host page size
here to handle a RAM MR.


Can you give an example of such RAM MR being added to PCI AS on
SPAPR?


On spapr, no.  But you can run other machine types as guests (at least
with TCG) on a host with the spapr IOMMU.


+/*
+ * SPAPR host supports multilevel TCE tables, there is some
+ * euristic to decide how many levels we want for our table:
+ * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+ */
+entries = create.window_size >> create.page_shift;
+pages = (entries * sizeof(uint64_t)) / getpagesize();
+

Re: [Qemu-devel] [PATCH v5 1/5] arm: qmp: add GICCapability struct

2016-03-22 Thread Peter Xu

On Tue, Mar 22, 2016 at 07:41:17PM +0100, Markus Armbruster wrote:
> Markus Armbruster  writes:
> >> +##
> >> +# @GICCapability:
> >> +#
> >> +# This struct describes capability for a specific GIC version. These
> >> +# bits are not only decided by QEMU/KVM software version, but also
> >> +# decided by the hardware that the program is running upon.
> >> +#
> >> +# @version:  version of GIC to be described.
> >> +#
> >> +# @emulated: whether current QEMU/hardware supports emulated GIC
> >> +#device in user space.
> >> +#
> >> +# @kernel:   whether current QEMU/hardware supports hardware
> >> +#accelerated GIC device in kernel.
> >> +#
> >> +# Since: 2.6
> >> +##
> >> +{ 'struct': 'GICCapability',
> >> +  'data': { 'version': 'int',
> >> +'emulated': 'bool',
> >> +'kernel': 'bool' } }
> >
> > Are all four combinations of (emulated, kernel) possible?

Currently, version can only be 2 or 3. And for now, we should only
support:

- emulated/kernel support for v2
- kernel support for v3

and emulated v3 is still not supported.

> Moreover, what do the combinations mean from a practical point of view?
> What would a management application do with the information?

AFAIU, this command will mostly be used by libvirt. E.g., on ARM
host, user can choose which kind of GIC device he/she uses. However,
not all of them are supported, and it is decided by both the ARM
hardware and the software (here software should include at least
QEMU and the kernel version). With this command, libvirt can easily
know what is supported, and what is not.

So it can: warn the user before-hand during configuration like "GIC
emulated/kvm-accelerated version X is/isn't supported",

rather than: firstly, configuration success. After that, QEMU boot
failed with ambiguous error.

I can add more explainations into the commit message (some can be
added to the schema directly possibly) about why we need this
command, and what does every entry mean.

Thanks!

-- peterx

Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU

2016-03-22 Thread David Gibson

On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
> On 03/23/2016 12:08 PM, David Gibson wrote:
> >On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/22/2016 04:14 PM, David Gibson wrote:
> >>>On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> This adds ability to VFIO common code to dynamically allocate/remove
> DMA windows in the host kernel when new VFIO container is added/removed.
> 
> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> and adds just created IOMMU into the host IOMMU list; the opposite
> action is taken in vfio_listener_region_del.
> 
> When creating a new window, this uses euristic to decide on the TCE table
> levels number.
> 
> This should cause no guest visible change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v14:
> * new to the series
> 
> ---
> TODO:
> * export levels to PHB
> ---
>   hw/vfio/common.c | 108 
>  ---
>   trace-events |   2 ++
>   2 files changed, 105 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 4e873b7..421d6eb 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer 
> *container,
>   return 0;
>   }
> 
> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr 
> min_iova)
> +{
> +VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 
> 0x1000);
> >>>
> >>>The hard-coded 0x1000 looks dubious..
> >>
> >>Well, that's the minimal page size...
> >
> >Really?  Some BookE CPUs support 1KiB page size..
> 
> Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)

Uh.. actually I don't think those CPUs generally had an IOMMU.  But if
it's been done for CPU MMU I wouldn't count on it not being done for
IOMMU.

1 is a safer choice.

> 
> 
> >
> +g_assert(hiommu);
> +QLIST_REMOVE(hiommu, hiommu_next);
> +}
> +
>   static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>   {
>   return (!memory_region_is_ram(section->mr) &&
> @@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>   }
>   end = int128_get64(llend);
> 
> +if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >>>
> >>>I think this would be clearer split out into a helper function,
> >>>vfio_create_host_window() or something.
> >>
> >>
> >>It is rather vfio_spapr_create_host_window() and we were avoiding
> >>xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
> >>separate file but this usually triggers more discussion and never ends well.
> >>
> >>
> >>
> +unsigned entries, pages;
> +struct vfio_iommu_spapr_tce_create create = { .argsz = 
> sizeof(create) };
> +
> +g_assert(section->mr->iommu_ops);
> +g_assert(memory_region_is_iommu(section->mr));
> >>>
> >>>I don't think you need these asserts.  AFAICT the same logic should
> >>>work if a RAM MR was added directly to PCI address space - this would
> >>>create the new host window, then the existing code for adding a RAM MR
> >>>would map that block of RAM statically into the new window.
> >>
> >>In what configuration/machine can we do that on SPAPR?
> >
> >spapr guests won't ever do that.  But you can run an x86 guest on a
> >powernv host and this situation could come up.
> 
> 
> I am pretty sure VFIO won't work in this case anyway.

I'm not.  There's no fundamental reason VFIO shouldn't work with TCG.

> >In any case there's no point asserting if the code is correct anyway.
> 
> Assert here says (at least) "not tested" or "not expected to
> happen".

Hmmm..

> 
> 
> >
> +trace_vfio_listener_region_add_iommu(iova, end - 1);
> +/*
> + * FIXME: For VFIO iommu types which have KVM acceleration to
> + * avoid bouncing all map/unmaps through qemu this way, this
> + * would be the right place to wire that up (tell the KVM
> + * device emulation the VFIO iommu handles to use).
> + */
> +create.window_size = memory_region_size(section->mr);
> +create.page_shift =
> +
> ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));
> >>>
> >>>Ah.. except that I guess you'd need to fall back to host page size
> >>>here to handle a RAM MR.
> >>
> >>Can you give an example of such RAM MR being added to PCI AS on
> >>SPAPR?
> >
> >On spapr, no.  But you can run other machine types as guests (at least
> >with TCG) on a host with the spapr IOMMU.
> >
> +/*
> + * SPAPR host su

Re: [Qemu-devel] [PATCH 00/15] NVDIMM: introduce nvdimm label support

2016-03-22 Thread Xiao Guangrong



Sorry for the delay and i just come back from my vacation.

On 03/23/2016 04:30 AM, Stefan Hajnoczi wrote:

On Tue, Mar 22, 2016 at 08:37:40AM -0700, Dan Williams wrote:

On Thu, Mar 17, 2016 at 3:04 AM, Stefan Hajnoczi  wrote:

On Thu, Mar 17, 2016 at 04:32:46PM +0800, Xiao Guangrong wrote:

This patchset is against commit d4207b223eef3 (fw-cfg: support writeable
blobs) on pci branch of Michael's git tree and can be found at:
   https://github.com/xiaogr/qemu.git nvdimm-label-v1

This is the last part of vNVDIMM implementation which introduces nvdimm
label support

Currently Linux NVDIMM driver does not support namespace operation on this
kind of PMEM, apply below changes to support dynamical namespace:

@@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *a
 continue;
 }

-   if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+   //if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+   if (nfit_mem->memdev_pmem)
 flags |= NDD_ALIASING;


Not a blocker for this patch series, but why does Linux require Block
Device Window to enable namespace support?


A namespace label delineates aliased capacity between the pmem and
block-window access mechanisms.  If there is no aliased capacity then
the size of the namespace can be directly derived from the nfit range
and a label need not be considered.  Contiguous (dax-capable)
sub-divisions of pmem can be had via partitioning of the resulting
gendisk.


Xiao Guangrong: Given this new information, what is the purpose of the
QEMU patches for ACPI DSM and especially the namespace label support?



The initiation goal we introduced vlabel support is:
| Yes, I see Linux driver supports label-less vNVDIMM that is exact current QEMU
| doing. However, label-less is only Linux specific implementation (as it
| completely bypasses namespace), other OS vendors (e.g Microsoft) will use 
label
| storage to address their own requirements，or they do not follow namespace spec
| at all. Another reason is that label is essential for PBLK support.

I did not get the windows driver so far, so this is not tested on windows yet,
just double checked with windows driver team, they told me that:
"It always assumes there is one there so we will need to change how this works"
So the update of windows side will be made.


The QEMU NVDIMM device only supports pmem so now I'm not sure we need
namespace labels or the ACPI DSM at all.



We are planing to support PBLK which "simulated BLK it can simply make one big
aperture that happens to overlap the same address range as PMEM", it will "allow
enabling BTT on BLK namespaces to support power-fail-safe sector updates for a
filesystem on PMEM". The idea was contributed by Dan in our internal 
discussion...

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-22 Thread Peter Xu

On Tue, Mar 22, 2016 at 03:07:00PM -0400, Bandan Das wrote:
> Thank you Peter, I wasn't aware. But unfortunately, I don't think he's doing 
> the right
> thing either when handing @end ? Oh well, it's a RFC :) 

Possibly. Just to make sure you know about the whole thing (rather
than only the @end fix), since I see that fixing the core dump
should possibly be the start to move on to VFIO+IOMMU. As you
mentioned, guessing that it's never a bad thing to know more about
ideas. :)

Thanks.

-- peterx

Re: [Qemu-devel] [PATCH qemu v14 18/18] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW)

2016-03-22 Thread David Gibson

On Mon, Mar 21, 2016 at 06:47:06PM +1100, Alexey Kardashevskiy wrote:
> This adds support for Dynamic DMA Windows (DDW) option defined by
> the SPAPR specification which allows to have additional DMA window(s)
> 
> This implements DDW for emulated and VFIO devices.
> This reserves RTAS token numbers for DDW calls.
> 
> This changes the TCE table migration descriptor to support dynamic
> tables as from now on, PHB will create as many stub TCE table objects
> as PHB can possibly support but not all of them might be initialized at
> the time of migration because DDW might or might not be requested by
> the guest.
> 
> The "ddw" property is enabled by default on a PHB but for compatibility
> the pseries-2.5 machine and older disable it.
> 
> This implements DDW for VFIO. The host kernel support is required.
> This adds a "levels" property to PHB to control the number of levels
> in the actual TCE table allocated by the host kernel, 0 is the default
> value to tell QEMU to calculate the correct value. Current hardware
> supports up to 5 levels.
> 
> The existing linux guests try creating one additional huge DMA window
> with 64K or 16MB pages and map the entire guest RAM to. If succeeded,
> the guest switches to dma_direct_ops and never calls TCE hypercalls
> (H_PUT_TCE,...) again. This enables VFIO devices to use the entire RAM
> and not waste time on map/unmap later. This adds a "dma64_win_addr"
> property which is a bus address for the 64bit window and by default
> set to 0x800... as this is what the modern POWER8 hardware
> uses and this allows having emulated and VFIO devices on the same bus.
> 
> This adds 4 RTAS handlers:
> * ibm,query-pe-dma-window
> * ibm,create-pe-dma-window
> * ibm,remove-pe-dma-window
> * ibm,reset-pe-dma-window
> These are registered from type_init() callback.
> 
> These RTAS handlers are implemented in a separate file to avoid polluting
> spapr_iommu.c with PCI.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  hw/ppc/Makefile.objs|   1 +
>  hw/ppc/spapr.c  |   7 +-
>  hw/ppc/spapr_pci.c  |  73 ---
>  hw/ppc/spapr_rtas_ddw.c | 300 
> 
>  hw/vfio/common.c|   5 -
>  include/hw/pci-host/spapr.h |  13 ++
>  include/hw/ppc/spapr.h  |  16 ++-
>  trace-events|   4 +
>  8 files changed, 395 insertions(+), 24 deletions(-)
>  create mode 100644 hw/ppc/spapr_rtas_ddw.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index c1ffc77..986b36f 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -7,6 +7,7 @@ obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o 
> spapr_rng.o
>  ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>  obj-y += spapr_pci_vfio.o
>  endif
> +obj-$(CONFIG_PSERIES) += spapr_rtas_ddw.o
>  # PowerPC 4xx boards
>  obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>  obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d0bb423..ef4c637 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2362,7 +2362,12 @@ DEFINE_SPAPR_MACHINE(2_6, "2.6", true);
>   * pseries-2.5
>   */
>  #define SPAPR_COMPAT_2_5 \
> -HW_COMPAT_2_5
> +HW_COMPAT_2_5 \
> +{\
> +.driver   = TYPE_SPAPR_PCI_HOST_BRIDGE,\
> +.property = "ddw",\
> +.value= stringify(off),\
> +},
>  
>  static void spapr_machine_2_5_instance_options(MachineState *machine)
>  {
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index af99a36..3bb294a 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -803,12 +803,12 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState 
> *sphb, PCIDevice *pdev)
>  return buf;
>  }
>  
> -static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> -   uint32_t liobn,
> -   uint32_t page_shift,
> -   uint64_t window_addr,
> -   uint64_t window_size,
> -   Error **errp)
> +void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> + uint32_t liobn,
> + uint32_t page_shift,
> + uint64_t window_addr,
> + uint64_t window_size,
> + Error **errp)
>  {
>  sPAPRTCETable *tcet;
>  uint32_t nb_table = window_size >> page_shift;
> @@ -825,10 +825,16 @@ static void spapr_phb_dma_window_enable(sPAPRPHBState 
> *sphb,
>  return;
>  }
>  
> +if (SPAPR_PCI_DMA_WINDOW_NUM(liobn) && !sphb->ddw_enabled) {
> +error_setg(errp,
> +   "Attempt to use second window when DDW is disabled on 
> PHB");
> +return;
> +}

This should never happen unless something is wrong with the tests in
the RTAS functions, yes?  In which c

Re: [Qemu-devel] [PATCH v1 3/3] .travis.yml: make -j3

2016-03-22 Thread David Gibson

On Tue, Mar 22, 2016 at 12:53:15PM +, Alex Bennée wrote:
> The move from Travis VMs to Containers came with a upgrade from 1.5
> cores to 2. The received wisdom is -j N+1 means a core can be doing work
> while other threads wait for IO to complete. This is hard to test on the
> Travis infrastructure but an initial before/after eyeballing seems to
> confirm it is an improvement.
> 
> Signed-off-by: Alex Bennée 

Reviewed-by: David Gibson 

> ---
>  .travis.yml | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/.travis.yml b/.travis.yml
> index 3f77bfa..345e3ca 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -59,7 +59,7 @@ before_install:
>  before_script:
>- ./configure ${CONFIG}
>  script:
> -  - make -j2 && ${TEST_CMD}
> +  - make -j3 && ${TEST_CMD}
>  matrix:
>include:
>  # Sparse is GCC only

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature

Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU

2016-03-22 Thread Alexey Kardashevskiy


On 03/23/2016 12:08 PM, David Gibson wrote:

On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:

On 03/22/2016 04:14 PM, David Gibson wrote:

On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:

New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
This adds ability to VFIO common code to dynamically allocate/remove
DMA windows in the host kernel when new VFIO container is added/removed.

This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
and adds just created IOMMU into the host IOMMU list; the opposite
action is taken in vfio_listener_region_del.

When creating a new window, this uses euristic to decide on the TCE table
levels number.

This should cause no guest visible change in behavior.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v14:
* new to the series

---
TODO:
* export levels to PHB
---
  hw/vfio/common.c | 108 ---
  trace-events |   2 ++
  2 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4e873b7..421d6eb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
  return 0;
  }

+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
+{
+VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 
0x1000);


The hard-coded 0x1000 looks dubious..


Well, that's the minimal page size...


Really?  Some BookE CPUs support 1KiB page size..



Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)





+g_assert(hiommu);
+QLIST_REMOVE(hiommu, hiommu_next);
+}
+
  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
  {
  return (!memory_region_is_ram(section->mr) &&
@@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
  }
  end = int128_get64(llend);

+if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {


I think this would be clearer split out into a helper function,
vfio_create_host_window() or something.



It is rather vfio_spapr_create_host_window() and we were avoiding
xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
separate file but this usually triggers more discussion and never ends well.




+unsigned entries, pages;
+struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) 
};
+
+g_assert(section->mr->iommu_ops);
+g_assert(memory_region_is_iommu(section->mr));


I don't think you need these asserts.  AFAICT the same logic should
work if a RAM MR was added directly to PCI address space - this would
create the new host window, then the existing code for adding a RAM MR
would map that block of RAM statically into the new window.


In what configuration/machine can we do that on SPAPR?


spapr guests won't ever do that.  But you can run an x86 guest on a
powernv host and this situation could come up.



I am pretty sure VFIO won't work in this case anyway.


In any case there's no point asserting if the code is correct anyway.


Assert here says (at least) "not tested" or "not expected to happen".





+trace_vfio_listener_region_add_iommu(iova, end - 1);
+/*
+ * FIXME: For VFIO iommu types which have KVM acceleration to
+ * avoid bouncing all map/unmaps through qemu this way, this
+ * would be the right place to wire that up (tell the KVM
+ * device emulation the VFIO iommu handles to use).
+ */
+create.window_size = memory_region_size(section->mr);
+create.page_shift =
+ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));


Ah.. except that I guess you'd need to fall back to host page size
here to handle a RAM MR.


Can you give an example of such RAM MR being added to PCI AS on
SPAPR?


On spapr, no.  But you can run other machine types as guests (at least
with TCG) on a host with the spapr IOMMU.


+/*
+ * SPAPR host supports multilevel TCE tables, there is some
+ * euristic to decide how many levels we want for our table:
+ * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+ */
+entries = create.window_size >> create.page_shift;
+pages = (entries * sizeof(uint64_t)) / getpagesize();
+create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
+
+ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+if (ret) {
+error_report("Failed to create a window, ret = %d (%m)", ret);
+goto fail;
+}
+
+if (create.start_addr != section->offset_within_address_space ||
+vfio_host_iommu_lookup(container, create.start_addr,
+   create.start_addr + create.window_size - 
1)) {


Under what circumstances can this trigger?  Is the kernel ioctl
allowed to return a different window start address than the one

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-22 Thread Wei Yang

Hi, Liang

This is a very clear documentation of your work, I appreciated it a lot. Below
are some of my personal opinion and question.

On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
>I have sent the RFC version patch set for live migration optimization
>by skipping processing the free pages in the ram bulk stage and
>received a lot of comments. The related threads can be found at:
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html 
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
>

Actually there are two threads, Qemu thread and kernel thread. It would be
more clear for audience, if you just list two first mail for these two thread
respectively.

>To make things easier, I wrote this doc about the possible designs
>and my choices. Comments are welcome! 
>
>Content
>===
>1. Background
>2. Why not use virtio-balloon
>3. Virtio interface
>4. Constructing free page bitmap
>5. Tighten free page bitmap
>6. Handling page cache in the guest
>7. APIs for live migration
>8. Pseudo code 
>
>Details
>===
>1. Background
>As we know, in the ram bulk stage of live migration, current QEMU live
>migration implementation mark the all guest's RAM pages as dirtied in
>the ram bulk stage, all these pages will be checked for zero page
>first, and the page content will be sent to the destination depends on
>the checking result, that process consumes quite a lot of CPU cycles
>and network bandwidth.
>
>>From guest's point of view, there are some pages currently not used by

I see in your original RFC patch and your RFC doc, this line starts with a
character '>'. Not sure this one has a special purpose?

>the guest, guest doesn't care about the content in these pages. Free
>pages are this kind of pages which are not used by guest. We can make
>use of this fact and skip processing the free pages in the ram bulk
>stage, it can save a lot CPU cycles and reduce the network traffic
>while speed up the live migration process obviously.
>
>Usually, only the guest has the information of free pages. But it’s
>possible to let the guest tell QEMU it’s free page information by some
>mechanism. E.g. Through the virtio interface. Once QEMU get the free
>page information, it can skip processing these free pages in the ram
>bulk stage by clearing the corresponding bit of the migration bitmap. 
>
>2. Why not use virtio-balloon 
>Actually, the virtio-balloon can do the similar thing by inflating the
>balloon before live migration, but its performance is no good, for an
>8GB idle guest just boots, it takes about 5.7 Sec to inflate the
>balloon to 7GB, but it only takes 25ms to get a valid free page bitmap
>from the guest.  There are some of reasons for the bad performance of
>vitio-balloon:
>a. allocating pages (5%, 304ms)
>b. sending PFNs to host (71%, 4194ms)
>c. address translation and madvise() operation (24%, 1423ms)
>Debugging shows the time spends on these operations are listed in the
>brackets above. By changing the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a
>large value, such as 16384, the time spends on sending the PFNs can be
>reduced to about 400ms, but it’s still too long.
>
>Obviously, the virtio-balloon mechanism has a bigger performance
>impact to the guest than the way we are trying to implement.
>
>3. Virtio interface
>There are three different ways of using the virtio interface to
>send the free page information.
>a. Extend the current virtio device
>The virtio spec has already defined some virtio devices, and we can
>extend one of these devices so as to use it to transport the free page
>information. It requires modifying the virtio spec.
>
>b. Implement a new virtio device
>Implementing a brand new virtio device to exchange information
>between host and guest is another choice. It requires modifying the
>virtio spec too.
>
>c. Make use of virtio-serial (Amit’s suggestion, my choice)
>It’s possible to make use the virtio-serial for communication between
>host and guest, the benefit of this solution is no need to modify the
>virtio spec. 
>
>4. Construct free page bitmap
>To minimize the space for saving free page information, it’s better to
>use a bitmap to describe the free pages. There are two ways to
>construct the free page bitmap.
>
>a. Construct free page bitmap when demand (My choice)
>Guest can allocate memory for the free page bitmap only when it
>receives the request from QEMU, and set the free page bitmap by
>traversing the free page list. The advantage of this way is that it’s
>quite simple and easy to implement. The disadvantage is that the
>traversing oper

Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU

2016-03-22 Thread David Gibson

On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 04:14 PM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
> >>New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> >>This adds ability to VFIO common code to dynamically allocate/remove
> >>DMA windows in the host kernel when new VFIO container is added/removed.
> >>
> >>This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> >>and adds just created IOMMU into the host IOMMU list; the opposite
> >>action is taken in vfio_listener_region_del.
> >>
> >>When creating a new window, this uses euristic to decide on the TCE table
> >>levels number.
> >>
> >>This should cause no guest visible change in behavior.
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >>---
> >>Changes:
> >>v14:
> >>* new to the series
> >>
> >>---
> >>TODO:
> >>* export levels to PHB
> >>---
> >>  hw/vfio/common.c | 108 
> >> ---
> >>  trace-events |   2 ++
> >>  2 files changed, 105 insertions(+), 5 deletions(-)
> >>
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index 4e873b7..421d6eb 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer 
> >>*container,
> >>  return 0;
> >>  }
> >>
> >>+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
> >>+{
> >>+VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 
> >>0x1000);
> >
> >The hard-coded 0x1000 looks dubious..
> 
> Well, that's the minimal page size...

Really?  Some BookE CPUs support 1KiB page size..

> >>+g_assert(hiommu);
> >>+QLIST_REMOVE(hiommu, hiommu_next);
> >>+}
> >>+
> >>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>  {
> >>  return (!memory_region_is_ram(section->mr) &&
> >>@@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener 
> >>*listener,
> >>  }
> >>  end = int128_get64(llend);
> >>
> >>+if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> >
> >I think this would be clearer split out into a helper function,
> >vfio_create_host_window() or something.
> 
> 
> It is rather vfio_spapr_create_host_window() and we were avoiding
> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
> separate file but this usually triggers more discussion and never ends well.
> 
> 
> 
> >>+unsigned entries, pages;
> >>+struct vfio_iommu_spapr_tce_create create = { .argsz = 
> >>sizeof(create) };
> >>+
> >>+g_assert(section->mr->iommu_ops);
> >>+g_assert(memory_region_is_iommu(section->mr));
> >
> >I don't think you need these asserts.  AFAICT the same logic should
> >work if a RAM MR was added directly to PCI address space - this would
> >create the new host window, then the existing code for adding a RAM MR
> >would map that block of RAM statically into the new window.
> 
> In what configuration/machine can we do that on SPAPR?

spapr guests won't ever do that.  But you can run an x86 guest on a
powernv host and this situation could come up.  In any case there's no
point asserting if the code is correct anyway.

> >>+trace_vfio_listener_region_add_iommu(iova, end - 1);
> >>+/*
> >>+ * FIXME: For VFIO iommu types which have KVM acceleration to
> >>+ * avoid bouncing all map/unmaps through qemu this way, this
> >>+ * would be the right place to wire that up (tell the KVM
> >>+ * device emulation the VFIO iommu handles to use).
> >>+ */
> >>+create.window_size = memory_region_size(section->mr);
> >>+create.page_shift =
> >>+ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));
> >
> >Ah.. except that I guess you'd need to fall back to host page size
> >here to handle a RAM MR.
> 
> Can you give an example of such RAM MR being added to PCI AS on
> SPAPR?

On spapr, no.  But you can run other machine types as guests (at least
with TCG) on a host with the spapr IOMMU.

> >>+/*
> >>+ * SPAPR host supports multilevel TCE tables, there is some
> >>+ * euristic to decide how many levels we want for our table:
> >>+ * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> >>+ */
> >>+entries = create.window_size >> create.page_shift;
> >>+pages = (entries * sizeof(uint64_t)) / getpagesize();
> >>+create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
> >>+
> >>+ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> >>+if (ret) {
> >>+error_report("Failed to create a window, ret = %d (%m)", ret);
> >>+goto fail;
> >>+}
> >>+
> >>+if (create.start_addr != section->offset_within_address_space ||
> >>+vfio_host_iommu_lookup(container, create.start_addr,
> >>+

Re: [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space

2016-03-22 Thread David Gibson

On Tue, Mar 22, 2016 at 09:47:10AM -0600, Alex Williamson wrote:
> On Tue, 22 Mar 2016 14:05:15 +1100
> David Gibson  wrote:
> 
> > On Mon, Mar 21, 2016 at 06:47:00PM +1100, Alexey Kardashevskiy wrote:
> > > At the moment IOMMU MR only translate to the system memory.
> > > However if some new code changes this, we will need clear indication why
> > > it is not working so here is the check.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy   
> > 
> > Reviewed-by: David Gibson 
> > 
> > Alex, any chance we could merge this quickly, since it is a reasonable
> > sanity check even without the rest of the changes.
> 
> It all sounds very theoretical to inspire some rush to merge it
> quickly, is there any chance we could actually hit this currently?

Hm, I guess not.  Ok, let's leave it for now.

> 
> > > ---
> > > Changes:
> > > v14:
> > > * new to the series
> > > ---
> > >  hw/vfio/common.c | 6 ++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > index 55723c9..9587c25 100644
> > > --- a/hw/vfio/common.c
> > > +++ b/hw/vfio/common.c
> > > @@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void 
> > > *data)
> > >  
> > >  trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
> > >  
> > > +if (iotlb->target_as != &address_space_memory) {
> > > +error_report("Wrong target AS \"%s\", only system memory is 
> > > allowed",
> > > + 
> > > iotlb->target_as->name?iotlb->target_as->name:"noname");
> 
> Spaces please.
> 
> > > +return;
> > > +}
> > > +
> > >  /*
> > >   * The IOMMU TLB entry we have just covers translation through
> > >   * this IOMMU to its immediate target.  We need to translate  
> > 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature

Re: [Qemu-devel] [PATCH] ppc: move POWER8 Book4 regs in their own routine

2016-03-22 Thread David Gibson

On Tue, Mar 22, 2016 at 03:23:13PM +0100, Cédric Le Goater wrote:
> commit fce55481360d "ppc: A couple more dummy POWER8 Book4 regs"
> squashed in to rapidly a set of POWER8 Book4 regs in the wrong
> routine. This patch introduces the missing gen_spr_power8_book4()
> routine to fix their location.
> 
> Signed-off-by: Cédric Le Goater 

Applied to ppc-for-2.6, thanks.

> ---
>  target-ppc/translate_init.c |8 
>  1 file changed, 8 insertions(+)
> 
> Index: qemu-dgibson-for-2.6.git/target-ppc/translate_init.c
> ===
> --- qemu-dgibson-for-2.6.git.orig/target-ppc/translate_init.c
> +++ qemu-dgibson-for-2.6.git/target-ppc/translate_init.c
> @@ -8018,6 +8018,13 @@ static void gen_spr_power8_ic(CPUPPCStat
>  &spr_read_generic, SPR_NOACCESS,
>  &spr_read_generic, &spr_write_generic,
>  0);
> +#endif
> +}
> +
> +static void gen_spr_power8_book4(CPUPPCState *env)
> +{
> +/* Add a number of P8 book4 registers */
> +#if !defined(CONFIG_USER_ONLY)
>  spr_register_kvm(env, SPR_ACOP, "ACOP",
>   SPR_NOACCESS, SPR_NOACCESS,
>   &spr_read_generic, &spr_write_generic,
> @@ -8086,6 +8093,7 @@ static void init_proc_book3s_64(CPUPPCSt
>  gen_spr_power8_pspb(env);
>  gen_spr_vtb(env);
>  gen_spr_power8_ic(env);
> +gen_spr_power8_book4(env);
>  }
>  if (version < BOOK3S_CPU_POWER8) {
>  gen_spr_book3s_dbg(env);
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature

Re: [Qemu-devel] [PATCH v1 1/3] .travis.yml: collapse the test matrix

2016-03-22 Thread David Gibson

On Tue, Mar 22, 2016 at 12:53:13PM +, Alex Bennée wrote:
> Remove the concept of TARGETS and build the complete target list for
> each config combination. Now the matrix is just based on CONFIG stanzas
> and we use the additional stuff for:
> 
>   - things that only work on one compiler (sparse, gcov, gprof)
>   - combos where "make check" fails
> 
> Signed-off-by: Alex Bennée 

Reviewed-by: David Gibson 

> ---
>  .travis.yml | 75 
> +++--
>  1 file changed, 18 insertions(+), 57 deletions(-)
> 
> diff --git a/.travis.yml b/.travis.yml
> index 9e5873b..18c04af 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -42,17 +42,13 @@ notifications:
>  env:
>global:
>  - TEST_CMD="make check"
> -- EXTRA_CONFIG=""
>matrix:
> -# Group major targets together with their linux-user counterparts
> -- 
> TARGETS=alpha-softmmu,alpha-linux-user,cris-softmmu,cris-linux-user,m68k-softmmu,m68k-linux-user,microblaze-softmmu,microblazeel-softmmu,microblaze-linux-user,microblazeel-linux-user
> -- 
> TARGETS=arm-softmmu,arm-linux-user,armeb-linux-user,aarch64-softmmu,aarch64-linux-user
> -- TARGETS=i386-softmmu,i386-linux-user,x86_64-softmmu,x86_64-linux-user
> -- 
> TARGETS=mips-softmmu,mips64-softmmu,mips64el-softmmu,mipsel-softmmu,mips-linux-user,mips64-linux-user,mips64el-linux-user,mipsel-linux-user,mipsn32-linux-user,mipsn32el-linux-user
> -- 
> TARGETS=or32-softmmu,or32-linux-user,ppc-softmmu,ppc64-softmmu,ppcemb-softmmu,ppc-linux-user,ppc64-linux-user,ppc64abi32-linux-user,ppc64le-linux-user
> -- 
> TARGETS=s390x-softmmu,s390x-linux-user,sh4-softmmu,sh4eb-softmmu,sh4-linux-user,sh4eb-linux-user,sparc-softmmu,sparc64-softmmu,sparc-linux-user,sparc32plus-linux-user,sparc64-linux-user,unicore32-softmmu,unicore32-linux-user
> -# Group remaining softmmu only targets into one build
> -- 
> TARGETS=lm32-softmmu,moxie-softmmu,tricore-softmmu,xtensa-softmmu,xtensaeb-softmmu
> +- CONFIG=""
> +- CONFIG="--enable-debug --enable-debug-tcg --enable-trace-backends=log"
> +- CONFIG="--disable-linux-aio --disable-cap-ng --disable-attr 
> --disable-brlapi --disable-uuid --disable-libusb"
> +- CONFIG="--enable-modules"
> +- CONFIG="--with-coroutine=ucontext"
> +- CONFIG="--with-coroutine=sigaltstack"
>  git:
># we want to do this ourselves
>submodules: false
> @@ -60,65 +56,30 @@ before_install:
>- wget -O - 
> http://people.linaro.org/~alex.bennee/qemu-submodule-git-seed.tar.xz | tar 
> -xvJ
>- git submodule update --init --recursive
>  before_script:
> -  - ./configure --target-list=${TARGETS} --enable-debug-tcg ${EXTRA_CONFIG}
> +  - ./configure ${CONFIG}
>  script:
>- make -j2 && ${TEST_CMD}
>  matrix:
> -  # We manually include a number of additional build for non-standard bits
>include:
> -# Debug related options
> -- env: TARGETS=x86_64-softmmu
> -   EXTRA_CONFIG="--enable-debug"
> +# Sparse is GCC only
> +- env: CONFIG="--enable-sparse"
>compiler: gcc
> -# We currently disable "make check"
> -- env: TARGETS=alpha-softmmu
> -   EXTRA_CONFIG="--enable-debug --enable-tcg-interpreter"
> -   TEST_CMD=""
> -  compiler: gcc
> -# Disable a few of the optional features
> -- env: TARGETS=x86_64-softmmu
> -   EXTRA_CONFIG="--disable-linux-aio --disable-cap-ng --disable-attr 
> --disable-brlapi --disable-uuid --disable-libusb"
> -  compiler: gcc
> -# Currently configure doesn't force --disable-pie
> -- env: TARGETS=x86_64-softmmu
> -   EXTRA_CONFIG="--enable-gprof --enable-gcov --disable-pie"
> -  compiler: gcc
> -# Sparse
> -- env: TARGETS=x86_64-softmmu
> -   EXTRA_CONFIG="--enable-sparse"
> +# gprof/gcov are GCC features
> +- env: CONFIG="--enable-gprof --enable-gcov --disable-pie"
>compiler: gcc
> -# Modules
> -- env: TARGETS=arm-softmmu,x86_64-softmmu
> -   EXTRA_CONFIG="--enable-modules"
> -  compiler: gcc
> -# All the trace backends (apart from dtrace)
> -- env: TARGETS=i386-softmmu
> -   EXTRA_CONFIG="--enable-trace-backends=log"
> -  compiler: gcc
> -# We currently disable "make check" (until 41fc57e44ed regression fixed)
> -- env: TARGETS=x86_64-softmmu
> -   EXTRA_CONFIG="--enable-trace-backends=simple"
> +# We manually include builds which we disable "make check" for
> +- env: CONFIG="--enable-debug --enable-tcg-interpreter"
> TEST_CMD=""
>compiler: gcc
> -# We currently disable "make check"
> -- env: TARGETS=x86_64-softmmu
> -   EXTRA_CONFIG="--enable-trace-backends=ftrace"
> +- env: CONFIG="--enable-trace-backends=simple"
> TEST_CMD=""
>compiler: gcc
> -# We currently disable "make check"
> -- env: TARGETS=x86_64-softmmu
> -   EXTRA_CONFIG="--enable-trace-backends=ust"
> +- env: CONFIG="--enable-trace-

Re: [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space

2016-03-22 Thread Alexey Kardashevskiy


On 03/23/2016 02:47 AM, Alex Williamson wrote:

On Tue, 22 Mar 2016 14:05:15 +1100
David Gibson  wrote:


On Mon, Mar 21, 2016 at 06:47:00PM +1100, Alexey Kardashevskiy wrote:

At the moment IOMMU MR only translate to the system memory.
However if some new code changes this, we will need clear indication why
it is not working so here is the check.

Signed-off-by: Alexey Kardashevskiy 


Reviewed-by: David Gibson 

Alex, any chance we could merge this quickly, since it is a reasonable
sanity check even without the rest of the changes.


It all sounds very theoretical to inspire some rush to merge it
quickly, is there any chance we could actually hit this currently?



The chances are as big as chances that some platform starts supporting VFIO 
soon, for these new folks such a check would be a good piece of 
documentation or at least a warning trigger to ask a question in the lists.





---
Changes:
v14:
* new to the series
---
  hw/vfio/common.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 55723c9..9587c25 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)

  trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);

+if (iotlb->target_as != &address_space_memory) {
+error_report("Wrong target AS \"%s\", only system memory is allowed",
+ iotlb->target_as->name?iotlb->target_as->name:"noname");


Spaces please.


+return;
+}
+
  /*
   * The IOMMU TLB entry we have just covers translation through
   * this IOMMU to its immediate target.  We need to translate







--
Alexey

Re: [Qemu-devel] [PATCH] util: Improved qemu_hexmap() to include an ascii dump of the buffer

2016-03-22 Thread John Snow



On 03/22/2016 03:41 AM, Isaac Lozano wrote:
> qemu_hexdump() in util/hexdump.c has been changed to give also include a
> ascii dump of the buffer. Also, calls to hex_dump() in net/net.c have
> been replaced with calls to qemu_hexdump(). This takes care of two misc
> BiteSized Tasks.
> 
> Signed-off-by: Isaac Lozano <109loza...@gmail.com>
> ---
>  net/net.c  | 31 +--
>  util/hexdump.c | 35 ---
>  2 files changed, 25 insertions(+), 41 deletions(-)
> 
> diff --git a/net/net.c b/net/net.c
> index 1a78edf..8d51ffb 100644
> --- a/net/net.c
> +++ b/net/net.c
> @@ -79,34 +79,6 @@ int default_net = 1;
>  /***/
>  /* network device redirectors */
>  
> -#if defined(DEBUG_NET)
> -static void hex_dump(FILE *f, const uint8_t *buf, int size)
> -{
> -int len, i, j, c;
> -
> -for(i=0;i -len = size - i;
> -if (len > 16)
> -len = 16;
> -fprintf(f, "%08x ", i);
> -for(j=0;j<16;j++) {
> -if (j < len)
> -fprintf(f, " %02x", buf[i+j]);
> -else
> -fprintf(f, "   ");
> -}
> -fprintf(f, " ");
> -for(j=0;j -c = buf[i+j];
> -if (c < ' ' || c > '~')
> -c = '.';
> -fprintf(f, "%c", c);
> -}
> -fprintf(f, "\n");
> -}
> -}
> -#endif
> -
>  static int get_str_sep(char *buf, int buf_size, const char **pp, int sep)
>  {
>  const char *p, *p1;
> @@ -661,8 +633,7 @@ static ssize_t 
> qemu_send_packet_async_with_flags(NetClientState *sender,
>  int ret;
>  
>  #ifdef DEBUG_NET
> -printf("qemu_send_packet_async:\n");
> -hex_dump(stdout, buf, size);
> +qemu_hexdump((const char*)buf, stdout, "qemu_send_packet_async", size);
>  #endif
>  
>  if (sender->link_down || !sender->peer) {
> diff --git a/util/hexdump.c b/util/hexdump.c
> index 1d9c129..886e53c 100644
> --- a/util/hexdump.c
> +++ b/util/hexdump.c
> @@ -18,21 +18,34 @@
>  
>  void qemu_hexdump(const char *buf, FILE *fp, const char *prefix, size_t size)
>  {
> -unsigned int b;
> +unsigned int b, len, i, c;
>  
> -for (b = 0; b < size; b++) {
> -if ((b % 16) == 0) {
> -fprintf(fp, "%s: %04x:", prefix, b);
> +for (b = 0; b < size; b += 16) {
> +len = size - b;
> +if (len > 16) {
> +len = 16;
>  }
> -if ((b % 4) == 0) {
> -fprintf(fp, " ");
> +fprintf(fp, "%s: %04x:", prefix, b);
> +for (i = 0; i < 16; i++) {
> +if ((i % 4) == 0) {
> +fprintf(fp, " ");
> +}
> +if (i < len) {
> +fprintf(fp, " %02x", (unsigned char)buf[b + i]);
> +}
> +else
> +{
> +fprintf(fp, "   ");
> +}
>  }
> -fprintf(fp, " %02x", (unsigned char)buf[b]);
> -if ((b % 16) == 15) {
> -fprintf(fp, "\n");
> +fprintf(fp, " ");
> +for (i = 0; i < len; i++) {
> +c = buf[b+i];
> +if (c < ' ' || c > '~') {
> +c = '.';
> +}
> +fprintf(fp, "%c", c);
>  }
> -}
> -if ((b % 16) != 0) {
>  fprintf(fp, "\n");
>  }
>  }
> 

Don't forget to CC a maintainer when you submit a patch to the open
list, otherwise you run the risk of having your patch get lost. See:
http://wiki.qemu.org/Contribute/SubmitAPatch#CC_the_relevant_maintainer

CCing net/net.c maintainer Jason Wang and the contributor who suggested
the BiteSizedTask, Thomas Huth (...I think. I'm not sure who else 'huth'
on the wiki might be, but he'll certainly correct me if I'm wrong...!)

Thanks,
--js

Re: [Qemu-devel] [PULL 00/13] vhost, virtio, pci, pxe

2016-03-22 Thread Samuel Thibault

Hello Peter,

Peter Maydell, on Fri 19 Feb 2016 12:09:17 +, wrote:
> diff --git a/slirp/mbuf.h b/slirp/mbuf.h
> index 38fedf4..ef5a4f7 100644
> --- a/slirp/mbuf.h
> +++ b/slirp/mbuf.h
> @@ -81,11 +81,9 @@ struct mbuf {
> Slirp *slirp;
> boolresolution_requested;
> uint64_t expiration_date;
> +   char*m_ext;
> /* start of dynamic buffer area, must be last element */
> -   union {
> -   charm_dat[1]; /* ANSI don't like 0 sized arrays */
> -   char*m_ext;
> -   };
> +   charm_dat[];
>  };
> 
>  #define ifq_prev m_prev

Could you give your Signed-off-by on this?

Samuel

Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address

2016-03-22 Thread David Gibson

On Tue, Mar 22, 2016 at 06:19:40PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 03:59 PM, David Gibson wrote:
> >On Tue, Mar 22, 2016 at 03:28:52PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/22/2016 02:26 PM, David Gibson wrote:
> >>>On Tue, Mar 22, 2016 at 02:12:30PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 11:49 AM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
> >>Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
> >>when new VFIO listener is added, all existing IOMMU mappings are
> >>replayed. However there is a problem that the base address of
> >>an IOMMU memory region (IOMMU MR) is ignored which is not a problem
> >>for the existing user (which is pseries) with its default 32bit DMA
> >>window starting at 0 but it is if there is another DMA window.
> >>
> >>This stores the IOMMU's offset_within_address_space and adjusts
> >>the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> >>
> >>As the IOMMU notifier expects IOVA offset rather than the absolute
> >>address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
> >>calling notifier(s).
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >>Reviewed-by: David Gibson 
> >
> >On a closer look, I realised this still isn't quite correct, although
> >I don't think any cases which would break it exist or are planned.
> >
> >>---
> >>  hw/ppc/spapr_iommu.c  |  2 +-
> >>  hw/vfio/common.c  | 14 --
> >>  include/hw/vfio/vfio-common.h |  1 +
> >>  3 files changed, 10 insertions(+), 7 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>index 7dd4588..277f289 100644
> >>--- a/hw/ppc/spapr_iommu.c
> >>+++ b/hw/ppc/spapr_iommu.c
> >>@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable 
> >>*tcet, target_ulong ioba,
> >>  tcet->table[index] = tce;
> >>
> >>  entry.target_as = &address_space_memory,
> >>-entry.iova = ioba & page_mask;
> >>+entry.iova = (ioba - tcet->bus_offset) & page_mask;
> >>  entry.translated_addr = tce & page_mask;
> >>  entry.addr_mask = ~page_mask;
> >>  entry.perm = spapr_tce_iommu_access_flags(tce);
> >
> >This bit's right/
> >
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index fb588d8..d45e2db 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, 
> >>void *data)
> >>  VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >>  VFIOContainer *container = giommu->container;
> >>  IOMMUTLBEntry *iotlb = data;
> >>+hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
> >
> >This bit might be right, depending on how you define 
> >giommu->offset_within_address_space.
> >
> >>  MemoryRegion *mr;
> >>  hwaddr xlat;
> >>  hwaddr len = iotlb->addr_mask + 1;
> >>  void *vaddr;
> >>  int ret;
> >>
> >>-trace_vfio_iommu_map_notify(iotlb->iova,
> >>-iotlb->iova + iotlb->addr_mask);
> >>+trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
> >>
> >>  /*
> >>   * The IOMMU TLB entry we have just covers translation through
> >>@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, 
> >>void *data)
> >>
> >>  if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> >>  vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >>-ret = vfio_dma_map(container, iotlb->iova,
> >>+ret = vfio_dma_map(container, iova,
> >> iotlb->addr_mask + 1, vaddr,
> >> !(iotlb->perm & IOMMU_WO) || mr->readonly);
> >>  if (ret) {
> >>  error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >>   "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >>- container, iotlb->iova,
> >>+ container, iova,
> >>   iotlb->addr_mask + 1, vaddr, ret);
> >>  }
> >>  } else {
> >>-ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask 
> >>+ 1);
> >>+ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
> >>  if (ret) {
> >>  error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >>   "0x%"HWADDR_PRIx") = %d (%m)",
> >>- container, iotlb->iova,
> >>+ container, iova,
> >>   iotlb->addr_mask + 1, ret);
> >>  }
> >>  }
> >
> >This is fine.
> >
> >>@@ -377,6 +377,8

Re: [Qemu-devel] [PATCH v3 04/10] ppc: Create cpu_ppc_set_papr() helper

2016-03-22 Thread David Gibson

On Tue, Mar 22, 2016 at 08:05:08AM +0100, Cédric Le Goater wrote:
> On 03/22/2016 12:15 AM, David Gibson wrote:
> > On Mon, Mar 21, 2016 at 01:52:34PM +0100, Cédric Le Goater wrote:
> >> From: Benjamin Herrenschmidt 
> >>
> >> And move the code adjusting the MSR mask and calling kvmppc_set_papr()
> >> to it. This allows us to add a few more things such as disabling setting
> >> of MSR:HV and appropriate LPCR bits which will be used when fixing
> >> the exception model.
> >>
> >> Signed-off-by: Benjamin Herrenschmidt 
> >> Reviewed-by: David Gibson 
> >> [clg: removed LPCR setting ]
> >> Signed-off-by: Cédric Le Goater 
> > 
> > Nothing wrong with the patch, but your mailer seems to have done
> > something really weird with the headers:
> > 
> >> Content-Type: text/plain; charset=a
> > 
> > Oddly enough, git am had some trouble with charset "a".
> > 
> 
> Ah. Interesting. 
> 
> When the patch was sent, git send-email requested a change on the 
> Content-type (a switch to UTF8). Most probably because of the accent 
> on my first name. It seems something went wrong. I will look into it.

Right, as far as I can tell the content was valid UTF-8, including the
accent, just the header was wrong - I was able to apply by hand
editing the mbox to say UTF-8 instead.

For some reason it only happened to this one mail.  The other ones in
the series showed up as UTF-8, although at least some had the same accent.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature

Re: [Qemu-devel] [PATCH qemu v14 16/18] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO

2016-03-22 Thread David Gibson

On Tue, Mar 22, 2016 at 05:24:33PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 03:45 PM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:47:04PM +1100, Alexey Kardashevskiy wrote:
> >>The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> >>a guest view of the table and a hardware TCE table. If there is no VFIO
> >>presense in the address space, then just the guest view is used, if
> >>this is the case, it is allocated in the KVM. However since there is no
> >>support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
> >>we need to move the guest view from KVM to the userspace; and we need
> >>to do this for every IOMMU on a bus with VFIO devices.
> >>
> >>This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
> >>notifiy IOMMU about changing environment so it can reallocate the table
> >>to/from KVM or (when available) hook the IOMMU groups with the logical
> >>bus (LIOBN) in the KVM.
> >>
> >>This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> >>path as the new callbacks do this better - they notify IOMMU at
> >>the exact moment when the configuration is changed, and this also
> >>includes the case of PCI hot unplug.
> >>
> >>TODO: split into 2 or 3 patches, per maintainership area.
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >
> >I'm finding this one much easier to follow than the previous revision.
> >
> >>---
> >>  hw/ppc/spapr_iommu.c  | 12 
> >>  hw/ppc/spapr_pci.c|  6 --
> >>  hw/vfio/common.c  |  9 +
> >>  include/exec/memory.h |  4 
> >>  4 files changed, 25 insertions(+), 6 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>index 6dc3c45..702075d 100644
> >>--- a/hw/ppc/spapr_iommu.c
> >>+++ b/hw/ppc/spapr_iommu.c
> >>@@ -151,6 +151,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion 
> >>*iommu)
> >>  return 1ULL << tcet->page_shift;
> >>  }
> >>
> >>+static void spapr_tce_vfio_start(MemoryRegion *iommu)
> >>+{
> >>+spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), 
> >>true);
> >>+}
> >>+
> >>+static void spapr_tce_vfio_stop(MemoryRegion *iommu)
> >>+{
> >>+spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), 
> >>false);
> >>+}
> >
> >Wonder if a single callback which takes a boolean might be a little
> >less clunky.
> 
> I have a feeling that at least once I was asked to do the opposite and now
> we have take_ownership/release_ownership. This does not seem to be much
> different and the existing names are more self-documenting than the previous
> vfio_notify() or whatever name I could think of.

Ok, leave it as is.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature

Re: [Qemu-devel] [Qemu-stable] [PATCH 00/35] Patch Round-up for stable 2.5.1, freeze on 2016-03-25

2016-03-22 Thread Michael Roth

Quoting Peter Lieven (2016-03-22 05:00:28)
> Am 21.03.2016 um 18:27 schrieb Michael Roth:
> > Hi everyone,
> >
> > The following new patches are queued for QEMU stable v2.5.1:
> >
> >https://github.com/mdroth/qemu/commits/stable-2.5-staging
> >
> > The release is planned for 2016-03-29:
> >
> >http://wiki.qemu.org/Planning/2.5
> >
> > Please respond here or CC qemu-sta...@nongnu.org on any patches you
> > think should be included in the release.
> 
> Is this stable material?
> 
> 4467c6c hyperv: cpu hotplug fix with HyperV enabled

Doesn't seem like it would cause any side-effects outside of the cpu
hotplug case it's fixing, so makes sense to me.

> 
> Peter
>

Re: [Qemu-devel] [Qemu-stable] [PATCH 28/35] spapr: skip configuration section during migration of older machines

2016-03-22 Thread Michael Roth

Quoting Greg Kurz (2016-03-22 02:49:35)
> On Mon, 21 Mar 2016 12:28:26 -0500
> Michael Roth  wrote:
> 
> > From: Greg Kurz 
> > 
> > Since QEMU 2.4, we have a configuration section in the migration stream.
> > This must be skipped for older machines, like it is already done for x86.
> > 
> > This patch fixes the migration of pseries-2.3 from/to QEMU 2.3, but it
> > breaks migration of the same machine from/to QEMU 2.4/2.4.1/2.5. We do
> > that anyway because QEMU 2.3 is likely to be more widely deployed than
> > newer QEMU versions.
> > 
> > Fixes: 61964c23e5ddd5a33f15699e45ce126f879e3e33
> > Signed-off-by: Greg Kurz 
> > Reviewed-by: Laurent Vivier 
> > Reviewed-by: Juan Quintela 
> > Signed-off-by: David Gibson 
> > (cherry picked from commit 09b5e30da5b19f44768a5429f603caaede216757)
> > 
> > Conflicts:
> >   hw/ppc/spapr.c
> > 
> > *remove dep on 5013c5474
> > 
> > Signed-off-by: Michael Roth 
> > ---
> 
> This will break migration of pseries-2.3 from QEMU 2.5 to QEMU 2.5.1. Maybe
> this should be documented somewhere ?
> 
> Also, there's a companion patch to allow migration to succeed with manual
> intervention on the destination:
> 
> commit 902c053d834e3b802ec736f170edf226d4a841ff
> Author: Greg Kurz 
> Date:   Thu Feb 18 12:32:25 2016 +0100
> 
> migration: allow machine to enforce configuration section migration
> 
> Maybe worth to push to stable as well ?

Yah, makes sense. I'll include this patch and document compatibility
changes in the release notes.

Is this wording correct?

"Fixes migration between QEMU 2.3 and QEMU 2.5.1 when running
pseries-2.3 machine model. Note that for migration of same
pseries-2.3 machine to/from QEMU versions 2.4.x and 2.5, 
the -machine enforce-config-section=on option now needs to be
used."

> 
> >  hw/ppc/spapr.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 6bfb908..ff1537a 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -2327,6 +2327,7 @@ static void spapr_compat_2_3(Object *obj)
> >  {
> >  savevm_skip_section_footers();
> >  global_state_set_optional();
> > +savevm_skip_configuration();
> >  }
> > 
> >  static void spapr_compat_2_2(Object *obj)
>

Re: [Qemu-devel] [Qemu-stable] [PATCH 00/35] Patch Round-up for stable 2.5.1, freeze on 2016-03-25

2016-03-22 Thread Michael Roth

Quoting Cole Robinson (2016-03-21 14:32:14)
> On 03/21/2016 01:27 PM, Michael Roth wrote:
> > Hi everyone,
> > 
> > The following new patches are queued for QEMU stable v2.5.1:
> > 
> >   https://github.com/mdroth/qemu/commits/stable-2.5-staging
> > 
> > The release is planned for 2016-03-29:
> > 
> >   http://wiki.qemu.org/Planning/2.5
> > 
> > Please respond here or CC qemu-sta...@nongnu.org on any patches you
> > think should be included in the release.
> > 
> > Testing/feedback is greatly appreciated.
> > 
> 
> Here's some patches we have in Fedora that aren't in your tree that IMO are
> stable candidates:
> 
> 99b4cb7 ahci: Do not unmap NULL addresses
> 64ffbe0 hmp: fix sendkey out of bounds write (CVE-2015-8619)
> 4c1396c i386: avoid null pointer dereference
> 4ab0359 ide: ahci: reset ncq object to unused on error
> 362786f net: check packet payload length
> aa7f996 net: ne2000: fix bounds check in ioport operations
> 49d925c usb: check page select value while processing iTD
> fe3c546 usb: check RNDIS buffer offsets & length
> 64c9bc1 usb: check RNDIS message length
> 80eecda usb: check USB configuration descriptor object
> d62d9dc vmdk: Create streamOptimized as version 3
> 3db1d98 vmdk: Fix converting to streamOptimized
> 
> vmdk patches are for https://bugzilla.redhat.com/show_bug.cgi?id=1299185
> Rest are security issues

Thanks, definitely good to have. Have them all applied locally and will
push after another round of testing.

> 
> Thanks,
> Cole
>

Re: [Qemu-devel] [Patch 1/1] omap_lcdc: Remove support for DEPTH != 32

2016-03-22 Thread Eric Blake

On 03/12/2016 03:23 AM, Pooja Dhannawat wrote:
> As only DEPTH ==32 case is used, removing other dead code
> which is based on DEPTH !== 32
> 
> Signed-off-by: Pooja Dhannawat 
> ---

> +++ b/hw/display/omap_lcdc.c
> @@ -71,44 +71,23 @@ static void omap_lcd_interrupts(struct omap_lcd_panel_s 
> *s)
>  
>  #define draw_line_func drawfn
>  
> -#define DEPTH 8
> -#include "omap_lcd_template.h"
> -#define DEPTH 15
> -#include "omap_lcd_template.h"
> -#define DEPTH 16
> -#include "omap_lcd_template.h"
>  #define DEPTH 32
>  #include "omap_lcd_template.h"

Umm, the old code WAS using DEPTH != 32.  I'm not a display expert, so
there may be justification in nuking that code; but the commit message
needs a better argument than "the code wasn't used" when it sure seems
to be used.  If the argument is that surface_bits_per_pixel() always
returns 32, that would be a good argument.

>  
>  static draw_line_func draw_line_table2[33] = {
>  [0 ... 32]   = NULL,
> -[8]  = draw_line2_8,
> -[15] = draw_line2_15,
> -[16] = draw_line2_16,
>  [32] = draw_line2_32,

This array is now wasteful.  If surface_bits_per_pixel() always returns
32, then just ditch this array, and later on, change:

/* Colour depth */
switch ((omap_lcd->palette[0] >> 12) & 7) {
case 1:
-   draw_line = draw_line_table2[surface_bits_per_pixel(surface)];
+   draw_line = draw_line2_32;
bpp = 2;
break;

In other words, your cleanup, if justified, is incomplete.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH 1/3] block: detach devices from DriveInfo at unrealize time

2016-03-22 Thread Paolo Bonzini



On 21/03/2016 18:19, Markus Armbruster wrote:
> I think only if some our users actually expect the alternate wart can we
> seriosuly consider switching, because then we have to choose between two
> breakages anyway:
> 
> * We can stick to the current wart, and leave these users broken.
> 
> * We can switch to the alternate wart, unbreak these users, and break
>   the users that expect the current wart.
> 
> Without further evidence on who expects what, I'd stick to the current
> wart.

I certainly would expect nvme to behave the same as virtio-blk, for one.

Paolo

Re: [Qemu-devel] [PATCH 3/3] block: remove legacy_dinfo at blk_detach_dev time

2016-03-22 Thread Paolo Bonzini



On 21/03/2016 18:39, Kevin Wolf wrote:
> > When I wrote my review, I forgot that I expect x-blockdev-del to accept
> > only backends created with blockdev-add.  With that, my question is
> > indeed moot.
> > 
> > However, I've now tested my expectation, and it turned out to be wrong.
> > I'm inclined to call that a bug.
> 
> Yes.

Like this?

diff --git a/blockdev.c b/blockdev.c
index 3eb05d1..0bc7ea2 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -4023,6 +4023,11 @@ void qmp_x_blockdev_del(bool has_id, const char *id,
 error_setg(errp, "Cannot find block backend %s", id);
 return;
 }
+if (blk_legacy_dinfo(blk)) {
+error_setg(errp, "Deleting block backend added with drive-add"
+   " is not supported");
+return;
+}
 if (blk_get_refcnt(blk) > 1) {
 error_setg(errp, "Block backend %s is in use", id);
 return;

Paolo

Re: [Qemu-devel] [PATCH] hw/display contains files named *_template.h. These are included many times with different values of the DEPTH macro. However, only the DEPTH == 32 case is used. Removed suppo

2016-03-22 Thread Eric Blake

On 03/22/2016 12:34 AM, Nutan Shinde wrote:
> Signed-off-by: Nutan Shinde 
> ---
>  hw/display/blizzard.c  |  8 
>  hw/display/blizzard_template.h | 26 +-
>  hw/display/omap_lcd_template.h |  8 +---
>  hw/display/omap_lcdc.c |  6 --
>  hw/display/sm501.c | 17 -
>  hw/display/sm501_template.h|  8 +---
>  6 files changed, 3 insertions(+), 70 deletions(-)

How does your patch differ from this earlier posting?

https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg03159.html

I'm going to tweak the Bite Sized Tasks wiki page to mention that
searching the list archives might be in order, since this is not the
first time we've had competing and overlapping patches based on that
task list.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org

signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH 3/3] block: remove legacy_dinfo at blk_detach_dev time

2016-03-22 Thread Paolo Bonzini

On 22/03/2016 11:25, Markus Armbruster wrote:
> Regardless of how and when we create BlockBackend, we'll want to keep
> the clean separation between frontend and backend internally and at the
> user interface.

This means that the BlockBackend should not own the DriveInfo.  The
backend and frontend need not know of the object that mixes concepts
from both of them.  Instead, the DriveInfo can instantiate itself into a
BlockBackend and the board can (if required) use the frontend parts of
DriveInfo to instantiate a device and connect it to the BlocKBackend.

In Kevin's idea there would be no ownership either way.  Until then, I
think my patch actually gets us closer to the ideal.

Paolo

> DriveInfo has no role in cleanly separate creation of frontend and
> backend now, and it shouldn't get one in the future.  Its purpose is to
> support the legacy user interface that has frontend and backend matters
> mixed up.

Re: [Qemu-devel] [PATCH] hw/display contains files named *_template.h. These are included many times with different values of the DEPTH macro. However, only the DEPTH == 32 case is used. Removed suppo

2016-03-22 Thread John Snow



On 03/22/2016 02:34 AM, Nutan Shinde wrote:
> Signed-off-by: Nutan Shinde 
> ---
>  hw/display/blizzard.c  |  8 
>  hw/display/blizzard_template.h | 26 +-
>  hw/display/omap_lcd_template.h |  8 +---
>  hw/display/omap_lcdc.c |  6 --
>  hw/display/sm501.c | 17 -
>  hw/display/sm501_template.h|  8 +---
>  6 files changed, 3 insertions(+), 70 deletions(-)
> 
> diff --git a/hw/display/blizzard.c b/hw/display/blizzard.c
> index c231960..da5e133 100644
> --- a/hw/display/blizzard.c
> +++ b/hw/display/blizzard.c
> @@ -925,14 +925,6 @@ static void blizzard_update_display(void *opaque)
>  s->my[1] = 0;
>  }
>  
> -#define DEPTH 8
> -#include "blizzard_template.h"
> -#define DEPTH 15
> -#include "blizzard_template.h"
> -#define DEPTH 16
> -#include "blizzard_template.h"
> -#define DEPTH 24
> -#include "blizzard_template.h"
>  #define DEPTH 32
>  #include "blizzard_template.h"
>  
> diff --git a/hw/display/blizzard_template.h b/hw/display/blizzard_template.h
> index b7ef27c..a4c733d 100644
> --- a/hw/display/blizzard_template.h
> +++ b/hw/display/blizzard_template.h
> @@ -19,31 +19,7 @@
>   */
>  
>  #define SKIP_PIXEL(to) (to += deststep)
> -#if DEPTH == 8
> -# define PIXEL_TYPEuint8_t
> -# define COPY_PIXEL(to, from)  do { *to = from; SKIP_PIXEL(to); } while (0)
> -# define COPY_PIXEL1(to, from) (*to++ = from)
> -#elif DEPTH == 15 || DEPTH == 16
> -# define PIXEL_TYPEuint16_t
> -# define COPY_PIXEL(to, from)  do { *to = from; SKIP_PIXEL(to); } while (0)
> -# define COPY_PIXEL1(to, from) (*to++ = from)
> -#elif DEPTH == 24
> -# define PIXEL_TYPEuint8_t
> -# define COPY_PIXEL(to, from) \
> -do {  \
> -to[0] = from; \
> -to[1] = (from) >> 8;  \
> -to[2] = (from) >> 16; \
> -SKIP_PIXEL(to);   \
> -} while (0)
> -
> -# define COPY_PIXEL1(to, from) \
> -do {   \
> -*to++ = from;  \
> -*to++ = (from) >> 8;   \
> -*to++ = (from) >> 16;  \
> -} while (0)
> -#elif DEPTH == 32
> +#if DEPTH == 32
>  # define PIXEL_TYPEuint32_t
>  # define COPY_PIXEL(to, from)  do { *to = from; SKIP_PIXEL(to); } while (0)
>  # define COPY_PIXEL1(to, from) (*to++ = from)
> diff --git a/hw/display/omap_lcd_template.h b/hw/display/omap_lcd_template.h
> index f0ce71f..c1714ce 100644
> --- a/hw/display/omap_lcd_template.h
> +++ b/hw/display/omap_lcd_template.h
> @@ -27,13 +27,7 @@
>   * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
>   */
>  
> -#if DEPTH == 8
> -# define BPP 1
> -# define PIXEL_TYPE uint8_t
> -#elif DEPTH == 15 || DEPTH == 16
> -# define BPP 2
> -# define PIXEL_TYPE uint16_t
> -#elif DEPTH == 32
> +#if DEPTH == 32
>  # define BPP 4
>  # define PIXEL_TYPE uint32_t
>  #else
> diff --git a/hw/display/omap_lcdc.c b/hw/display/omap_lcdc.c
> index ce1058b..84b5f79 100644
> --- a/hw/display/omap_lcdc.c
> +++ b/hw/display/omap_lcdc.c
> @@ -71,12 +71,6 @@ static void omap_lcd_interrupts(struct omap_lcd_panel_s *s)
>  
>  #define draw_line_func drawfn
>  
> -#define DEPTH 8
> -#include "omap_lcd_template.h"
> -#define DEPTH 15
> -#include "omap_lcd_template.h"
> -#define DEPTH 16
> -#include "omap_lcd_template.h"
>  #define DEPTH 32
>  #include "omap_lcd_template.h"
>  
> diff --git a/hw/display/sm501.c b/hw/display/sm501.c
> index 2957243..3fb64b5 100644
> --- a/hw/display/sm501.c
> +++ b/hw/display/sm501.c
> @@ -1170,23 +1170,6 @@ typedef void draw_line_func(uint8_t *d, const uint8_t 
> *s,
>  typedef void draw_hwc_line_func(SM501State * s, int crt, uint8_t * palette,
>  int c_y, uint8_t *d, int width);
>  
> -#define DEPTH 8
> -#include "sm501_template.h"
> -
> -#define DEPTH 15
> -#include "sm501_template.h"
> -
> -#define BGR_FORMAT
> -#define DEPTH 15
> -#include "sm501_template.h"
> -
> -#define DEPTH 16
> -#include "sm501_template.h"
> -
> -#define BGR_FORMAT
> -#define DEPTH 16
> -#include "sm501_template.h"
> -
>  #define DEPTH 32
>  #include "sm501_template.h"
>  
> diff --git a/hw/display/sm501_template.h b/hw/display/sm501_template.h
> index f33e499..4e5801e 100644
> --- a/hw/display/sm501_template.h
> +++ b/hw/display/sm501_template.h
> @@ -22,13 +22,7 @@
>   * THE SOFTWARE.
>   */
>  
> -#if DEPTH == 8
> -#define BPP 1
> -#define PIXEL_TYPE uint8_t
> -#elif DEPTH == 15 || DEPTH == 16
> -#define BPP 2
> -#define PIXEL_TYPE uint16_t
> -#elif DEPTH == 32
> +#if DEPTH == 32
>  #define BPP 4
>  #define PIXEL_TYPE uint32_t
>  #else
> 

I can't comment on the patch itself (personally), but it looks like
maybe your patch submission got a little garbled.

The first line of your patch should be the subject line, which should be
limited to somewhere less than about 72 characters or so, following the
standard format of:

": "

For example:

"display: remove redundant macro templates" or similar.

Then, t

[Qemu-devel] [PULL v2 00/29] Misc patches for QEMU 2.6 hard freeze

2016-03-22 Thread Paolo Bonzini

The following changes since commit 4829e0378dfb91d55af9dfd741bd09e8f2c4f91a:

  Merge remote-tracking branch 'remotes/armbru/tags/pull-qapi-2016-03-18' into 
staging (2016-03-18 17:18:41 +)

are available in the git repository at:

  git://github.com/bonzini/qemu.git tags/for-upstream

for you to fetch changes up to ba0fb75c6400fbf75e2e8b52d2c6e2e8e848777c:

  target-i386: implement PKE for TCG (2016-03-22 22:20:18 +0100)


* Log filtering from Alex and Peter
* Chardev fixes from Daniel and Marc-André
* config.status tweak from David
* Header file tweaks from Markus, myself and Veronia (Outreachy candidate)
* get_ticks_per_sec() removal from Rutuja (Outreachy candidate)
* Coverity fix from myself
* PKE implementation from myself, based on rth's XSAVE support


Alex Bennée (7):
  tcg: pass down TranslationBlock to tcg_code_gen
  qemu-log: correct help text for -d cpu
  qemu-log: new option -dfilter to limit output
  qemu-log: dfilter-ise exec, out_asm, op and opt_op
  target-arm: dfilter support for in_asm
  qemu-log: support simple pid substitution for logs
  cputlb: modernise the debug support

Daniel P. Berrange (1):
  char: ensure all clients are in non-blocking mode

Dr. David Alan Gilbert (1):
  config.status: Pass extra parameters

Marc-André Lureau (1):
  char: translate from QIOChannel error to errno

Markus Armbruster (12):
  include/qemu/osdep.h: Don't include qapi/error.h
  Use scripts/clean-includes to drop redundant qemu/typedefs.h
  Clean up includes some more
  fw_cfg: Split fw_cfg_keys.h off fw_cfg.h
  include/qemu/iov.h: Don't include qemu-common.h
  include/hw/hw.h: Don't include qemu-common.h
  hw/pci/pci.h: Don't include qemu-common.h
  Move HOST_LONG_BITS from qemu-common.h to qemu/osdep.h
  Move QEMU_ALIGN_*() from qemu-common.h to qemu/osdep.h
  Move ParallelIOArg from qemu-common.h to sysemu/char.h
  isa: Move DMA_transfer_handler from qemu-common.h to hw/isa/isa.h
  include/crypto: Include qapi-types.h or qemu/bswap.h instead of 
qemu-common.h

Paolo Bonzini (3):
  hw: explicitly include qemu-common.h and cpu.h
  exec: fix error handling in file_ram_alloc
  target-i386: implement PKE for TCG

Peter Maydell (2):
  qemu-log: Avoid function call for disabled qemu_log_mask logging
  qemu-log: Improve the "exec" TB execution logging

Rutuja Shah (1):
  Replaced get_tick_per_sec() by NANOSECONDS_PER_SECOND

Veronia Bahaa (1):
  util: move declarations out of qemu-common.h

 arch_init.c  |   1 +
 async.c  |   1 +
 audio/audio.c|   4 +-
 audio/noaudio.c  |   8 +-
 audio/spiceaudio.c   |   4 +-
 audio/wavaudio.c |   2 +-
 backends/baum.c  |   3 +-
 backends/hostmem-file.c  |   1 +
 backends/hostmem-ram.c   |   1 +
 backends/hostmem.c   |   1 +
 backends/rng-egd.c   |   1 +
 backends/rng-random.c|   1 +
 backends/rng.c   |   1 +
 backends/tpm.c   |   1 +
 block.c  |   3 +-
 block/archipelago.c  |   2 +-
 block/backup.c   |   2 +
 block/blkdebug.c |   3 +-
 block/blkverify.c|   2 +
 block/block-backend.c|   1 +
 block/bochs.c|   1 +
 block/cloop.c|   1 +
 block/commit.c   |   1 +
 block/curl.c |   2 +
 block/dirty-bitmap.c |   2 +-
 block/dmg.c  |   1 +
 block/gluster.c  |   1 +
 block/io.c   |   2 +
 block/mirror.c   |   1 +
 block/nbd.c  |   3 +-
 block/null.c |   1 +
 block/parallels.c|   1 +
 block/qapi.c |   1 +
 block/qcow.c |   1 +
 block/qcow2-cluster.c|   1 +
 block/qcow2-refcount.c   |   1 +
 block/qcow2-snapshot.c   |   3 +-
 block/qcow2.c|   2 +-
 block/qed.c  |   3 +-
 block/qed.h  |   1 +
 block/raw-aio.h  |   2 +
 block/raw-posix.c|   3 +-
 block/raw-win32.c|   3 +-
 block/raw_bsd.c  |   1 +
 block/rbd.c  |   3 +-
 block/sheepdog.c |   3 +-

Re: [Qemu-devel] [PATCH 03/12] block: Introduce BlockBackendPublic

2016-03-22 Thread Eric Blake

On 03/22/2016 09:33 AM, Kevin Wolf wrote:
> Some features, like I/O throttling, are implemented outside
> block-backend.c, but still want to keep BlockBackends in a list. In
> order to avoid exposing the whole struct layout in the public header
> file, this patch introduces an embedded public struct where list entry
> structs can be added and a pair of functions to convert between
> BlockBackend and BlockBackendPublic.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/block-backend.c  | 17 +
>  include/sysemu/block-backend.h |  9 +
>  2 files changed, 26 insertions(+)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index df8f717..4394950 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -33,6 +33,7 @@ struct BlockBackend {
>  DriveInfo *legacy_dinfo;/* null unless created by drive_new() */
>  QTAILQ_ENTRY(BlockBackend) link; /* for block_backends */
>  QTAILQ_ENTRY(BlockBackend) monitor_link; /* for monitor_block_backends */
> +BlockBackendPublic public;

Any reason to not put the public struct at offset 0?

> +
> +/*
> + * Returns a BlockBackend given the associated @public fields.
> + */
> +BlockBackend *blk_by_public(BlockBackendPublic *public)
> +{
> +return container_of(public, BlockBackend, public);
> +}

At least container_of() doesn't care, so I guess it doesn't matter.

> +/* This struct is embedded in (the private) BlockBackend struct and contains
> + * fields that must be public. This is in particular for QLIST_ENTRY() and
> + * friends so that BlockBackends can be kept in lists outside 
> block-backend.c */
> +typedef struct BlockBackendPublic {
> +} BlockBackendPublic;

No fields?  So really all we need this for is so that we can have an
address of a member of the larger struct, so that we can do list
operations based on that address?

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH 06/22] hbitmap: load/store

2016-03-22 Thread John Snow



On 03/15/2016 04:04 PM, Vladimir Sementsov-Ogievskiy wrote:
> Add functions for load/store HBitmap to BDS, using clusters table:
> Last level of the bitmap is splitted into chunks of 'cluster_size'
> size. Each cell of the table contains offset in bds, to load/store
> corresponding chunk.
> 
> Also,
> 0 in cell means all-zeroes-chunk (should not be saved)
> 1 in cell means all-ones-chunk (should not be saved)
> hbitmap_prepare_store() fills table with
>   0 for all-zeroes chunks
>   1 for all-ones chunks
>   2 for others
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/dirty-bitmap.c |  23 +
>  include/block/dirty-bitmap.h |  11 +++
>  include/qemu/hbitmap.h   |  12 +++
>  util/hbitmap.c   | 209 
> +++
>  4 files changed, 255 insertions(+)
> 
> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> index e68c177..816c6ee 100644
> --- a/block/dirty-bitmap.c
> +++ b/block/dirty-bitmap.c
> @@ -396,3 +396,26 @@ int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap)
>  {
>  return hbitmap_count(bitmap->bitmap);
>  }
> +
> +int bdrv_dirty_bitmap_load(BdrvDirtyBitmap *bitmap, BlockDriverState *bs,
> +   const uint64_t *table, uint32_t table_size,
> +   uint32_t cluster_size)
> +{
> +return hbitmap_load(bitmap->bitmap, bs, table, table_size, cluster_size);
> +}
> +
> +int bdrv_dirty_bitmap_prepare_store(const BdrvDirtyBitmap *bitmap,
> +uint32_t cluster_size,
> +uint64_t *table,
> +uint32_t *table_size)
> +{
> +return hbitmap_prepare_store(bitmap->bitmap, cluster_size,
> + table, table_size);
> +}
> +
> +int bdrv_dirty_bitmap_store(const BdrvDirtyBitmap *bitmap, BlockDriverState 
> *bs,
> +const uint64_t *table, uint32_t table_size,
> +uint32_t cluster_size)
> +{
> +return hbitmap_store(bitmap->bitmap, bs, table, table_size, 
> cluster_size);
> +}
> diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
> index 27515af..20cb540 100644
> --- a/include/block/dirty-bitmap.h
> +++ b/include/block/dirty-bitmap.h
> @@ -43,4 +43,15 @@ void bdrv_set_dirty_iter(struct HBitmapIter *hbi, int64_t 
> offset);
>  int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
>  void bdrv_dirty_bitmap_truncate(BlockDriverState *bs);
>  
> +int bdrv_dirty_bitmap_load(BdrvDirtyBitmap *bitmap, BlockDriverState *bs,
> +   const uint64_t *table, uint32_t table_size,
> +   uint32_t cluster_size);
> +int bdrv_dirty_bitmap_prepare_store(const BdrvDirtyBitmap *bitmap,
> +uint32_t cluster_size,
> +uint64_t *table,
> +uint32_t *table_size);
> +int bdrv_dirty_bitmap_store(const BdrvDirtyBitmap *bitmap, BlockDriverState 
> *bs,
> +const uint64_t *table, uint32_t table_size,
> +uint32_t cluster_size);
> +
>  #endif
> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
> index 6d1da4d..d83bb79 100644
> --- a/include/qemu/hbitmap.h
> +++ b/include/qemu/hbitmap.h
> @@ -241,5 +241,17 @@ static inline size_t hbitmap_iter_next_word(HBitmapIter 
> *hbi, unsigned long *p_c
>  return hbi->pos;
>  }
>  
> +typedef struct BlockDriverState BlockDriverState;
> +
> +int hbitmap_load(HBitmap *bitmap, BlockDriverState *bs,
> + const uint64_t *table, uint32_t table_size,
> + uint32_t cluster_size);
> +
> +int hbitmap_prepare_store(const HBitmap *bitmap, uint32_t cluster_size,
> +  uint64_t *table, uint32_t *table_size);
> +
> +int hbitmap_store(HBitmap *bitmap, BlockDriverState *bs,
> +  const uint64_t *table, uint32_t table_size,
> +  uint32_t cluster_size);
>  
>  #endif
> diff --git a/util/hbitmap.c b/util/hbitmap.c
> index 28595fb..1960e4f 100644
> --- a/util/hbitmap.c
> +++ b/util/hbitmap.c
> @@ -15,6 +15,8 @@
>  #include "qemu/host-utils.h"
>  #include "trace.h"
>  
> +#include "block/block.h"
> +
>  /* HBitmaps provides an array of bits.  The bits are stored as usual in an
>   * array of unsigned longs, but HBitmap is also optimized to provide fast
>   * iteration over set bits; going from one bit to the next is O(logB n)
> @@ -499,3 +501,210 @@ char *hbitmap_md5(const HBitmap *bitmap)
>  const guchar *data = (const guchar *)bitmap->levels[HBITMAP_LEVELS - 1];
>  return g_compute_checksum_for_data(G_CHECKSUM_MD5, data, size);
>  }
> +
> +/* hb_restore_levels()
> + * Using the last level restore all other levels
> + */
> +static void hb_restore_levels(HBitmap *bitmap)
> +{
> +int64_t i, size, prev_size;
> +int lev;
> +
> +/* restore levels starting fr

Re: [Qemu-devel] [PATCH 02/12] block: Make sure throttled BDSes always have a BB

2016-03-22 Thread Eric Blake

On 03/22/2016 09:33 AM, Kevin Wolf wrote:
> It was already true in principle that a throttled BDS always has a BB
> attached, except that the order of operations while attaching or
> detaching a BDS to/from a BB wasn't careful enough.
> 
> This commit breaks graph manipulations while I/O throttling is enabled.
> It would have been possible, but quite cumbersome, to keep things
> working with some temporary hacks, so it's not worth the hassle. We'll
> fix things again in a minute.

Might read slightly better as:

It would have been possible to keep things working with some temporary
hacks, but quite cumbersome, so it's not worth the hassle.

> 
> Signed-off-by: Kevin Wolf 
> ---

> +#if 0
>  bdrv_io_limits_enable(bs_new, throttle_group_get_name(bs_top));
>  bdrv_io_limits_disable(bs_top);
> +#else
> +abort();

If it weren't fixed in the same series, then I'd want some text to go
along with the abort.  But since it's temporary, this is fine.

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH] slirp: Allow to disable IPv4 or IPv6

2016-03-22 Thread Samuel Thibault

Hello,

Markus Armbruster, on Mon 21 Mar 2016 08:33:52 +0100, wrote:
> > -# @net: #optional IP address and optional netmask
> > +# @net: #optional IP address and optional netmask. Set to 0.0.0.0 to 
> > disable IPv4 completely
> 
> Long line.
> 
> Syntax?  Default value?

Something like this?

# @net: #optional IP network address that the guest will see, in the
# form addr[/netmask]. The netmask is optional, and can be either in the
# form a.b.c.d or as a number of valid top-most bits. Default is
# 10.0.2.0/24. Set to 'none' to disable IPv4 completely.

> > -# @ip6-prefix: #optional IPv6 network prefix (default is fec0::) (since 
> > 2.6)
> > +# @ip6-prefix: #optional IPv6 network prefix. Set to :: to disable IPv6 
> > completely (default is fec0::) (since 2.6)
> 
> Syntax?

Well, it's just an IPv6 address, is there really something to document?

Samuel

Re: [Qemu-devel] (no subject)

2016-03-22 Thread John Snow



On 03/21/2016 05:09 PM, Peter Maydell wrote:
> On 21 March 2016 at 18:00, John Snow  wrote:
>> Looks like one of your libraries is outdated, for me
>> 'IBV_LINK_LAYER_INFINIBAND' is defined in
>> /usr/include/infiniband/verbs.h; provided by
>> libibverbs-devel-1.1.8-3.fc22.x86_64.
>>
>> Maybe your libibverbs is too old.
> 
> We should probably add a suitable configure test.
> 
> thanks
> -- PMM
> 

Sure, I don't know formally what our minimum version requirement for
libibverbs is.

In this case at least,
http://git.kernel.org/cgit/libs/infiniband/libibverbs.git/commit/include/infiniband/verbs.h?id=e5df8c64df6877facc045608a25f4d4fecd5f2b0

committed 2011-07-26 20:15:33 (GMT),

so probably:
http://git.kernel.org/cgit/libs/infiniband/libibverbs.git/commit/?id=8b2ffc598bd7f8294f3653cab430146985040739

libibverbs 1.1.6 (December 2011) as a minimum, unless there's an even
newer requirement?

--js

Re: [Qemu-devel] [PATCH 01/12] block: Don't disable I/O throttling on sync requests

2016-03-22 Thread Eric Blake

On 03/22/2016 09:33 AM, Kevin Wolf wrote:
> We had to disable I/O throttling with synchronous requests because we
> didn't use to run timers in nested event loops when the code was
> introduced. This isn't true any more, and throttling works just fine
> even when using the synchronous API.
> 
> The removed code is in fact dead code since commit a8823a3b ('block: Use
> blk_co_pwritev() for blk_write()') because I/O throttling can only be
> set on the top layer, but BlockBackend always uses the coroutine
> interface now instead of using the sync API emulation in block.c.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/io.c | 11 ---
>  1 file changed, 11 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index cce508a..e4438da 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -561,17 +561,6 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t 
> offset,
>  .flags = flags,
>  };
>  
> -/**
> - * In sync call context, when the vcpu is blocked, this throttling timer
> - * will not fire; so the I/O throttling function has to be disabled here
> - * if it has been enabled.
> - */
> -if (bs->io_limits_enabled) {
> -fprintf(stderr, "Disabling I/O throttling on '%s' due "
> -"to synchronous I/O.\n", bdrv_get_device_name(bs));

And we get rid of an fprintf().  Nice bonus.

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH v3 10/10] vnc: allow specifying a custom ACL object name

2016-03-22 Thread Eric Blake

On 03/10/2016 11:59 AM, Daniel P. Berrange wrote:
> The VNC server has historically had support for ACLs to check
> both the SASL username and the TLS x509 distinguished name.
> The VNC server was responsible for creating the initial ACL,
> and the client app was then responsible for populating it with
> rules using the HMP 'acl_add' command.
> 
> This is not satisfactory for a variety of reasons. There is
> no way to populate the ACLs from the command line, users are
> forced to use the HMP. With multiple network services all
> supporting TLS and ACLs now, it is desirable to be able to
> define a single ACL that is referenced by all services.
> 
> To address these limitations, two new options are added to the
> VNC server CLI. The 'tls-acl' option takes the ID of a QAuthZ
> object to use for checking TLS x509 distinguished names, and
> the 'sasl-acl' option takes the ID of another object to use for
> checking SASL usernames.
> 
> In this example, we setup two ACLs. The first allows any client
> with a certificate issued by the 'RedHat' organization in the
> 'London' locality. The second ACL allows clients with either
> the 'j...@redhat.com' or  'f...@redhat.com' kerberos usernames.
> Both ACLs must pass for the user to be allowed.
> 
> $QEMU -object tls-creds-x509,id=tls0,dir=/home/berrange/qemutls,\
>   endpoint=server,verify-peer=yes \
>   -object authz-simple,id=acl0,policy=deny,\
>   rules.0.match=O=RedHat,,L=London,rules.0.policy=allow \
>   -object authz-simple,id=acl0,policy=deny,\

Umm, you can't reuse 'acl0' as the id.

>   rules.0.match=f...@redhat.com,rules.0.policy=allow \
>   rules.0.match=j...@redhat.com,rules.0.policy=allow \
>   -vnc 0.0.0.0:1,tls-creds=tls0,tls-acl=tlsacl0,
>  sasl,sasl-acl=saslacl0 \

And this fails because the ids don't exist.  I think you meant
authz-simple,id=tlsacl0 in the first instance, and
authz-simple,id=saslacl0 in the second instance.

>   ...other QEMU args...
> 
> Signed-off-by: Daniel P. Berrange 
> ---
>  ui/vnc.c | 73 
> 
>  1 file changed, 60 insertions(+), 13 deletions(-)
> 
> @@ -3670,6 +3680,21 @@ void vnc_display_open(const char *id, Error **errp)
>  }
>  }
>  acl = qemu_opt_get_bool(opts, "acl", false);
> +tlsacl = qemu_opt_get(opts, "tls-acl");
> +if (acl && tlsacl) {
> +error_setg(errp, "'acl' option is mutually exclusive with the "
> +   "'tls-acl' options");
> +goto fail;
> +}
> +
> +#ifdef CONFIG_VNC_SASL
> +saslacl = qemu_opt_get(opts, "sasl-acl");
> +if (acl && saslacl) {
> +error_setg(errp, "'acl' option is mutually exclusive with the "
> +   "'sasl-acl' options");
> +goto fail;
> +}
> +#endif

Do we explicitly fail if sasl-acl was provided but CONFIG_VNC_SASL is
not defined?  It looks here like you silently ignore it, which would not
be good.

> @@ -3710,19 +3737,39 @@ void vnc_display_open(const char *id, Error **errp)
>&error_abort);
>  }
>  #ifdef CONFIG_VNC_SASL
> -if (acl && sasl) {
> -char *aclname;
> +if (sasl) {
> +if (saslacl) {
> +Object *container, *acl;
> +container = object_get_objects_root();
> +acl = object_resolve_path_component(container, saslacl);
> +if (!acl) {
> +error_setg(errp, "Cannot find ACL %s", saslacl);
> +goto fail;
> +}
>  
> -if (strcmp(vs->id, "default") == 0) {
> -aclname = g_strdup("vnc.username");
> -} else {
> -aclname = g_strdup_printf("vnc.%s.username", vs->id);
> -}
> -vs->sasl.acl =
> -QAUTHZ(qauthz_simple_new(aclname,
> - QAUTHZ_SIMPLE_POLICY_DENY,
> - &error_abort));
> -g_free(aclname);
> +if (!object_dynamic_cast(acl, TYPE_QAUTHZ)) {
> +error_setg(errp, "Object '%s' is not a QAuthZ subclass",
> +   saslacl);
> +goto fail;
> +}
> +vs->sasl.acl = QAUTHZ(acl);
> +} else if (acl) {
> +char *aclname;
> +
> +if (strcmp(vs->id, "default") == 0) {
> +aclname = g_strdup("vnc.username");
> +} else {
> +aclname = g_strdup_printf("vnc.%s.username", vs->id);
> +}
> +vs->sasl.acl =
> +QAUTHZ(qauthz_simple_new(aclname,
> + QAUTHZ_SIMPLE_POLICY_DENY,
> + &error_abort));
> +g_free(aclname);
> +}
> +} else if (saslacl) {
> +error_setg(errp, "SASL ACL provided when SASL is disabled");
> +goto fail;
>  }
>  #endif
>  

Again, th

Re: [Qemu-devel] [PATCH 0/9] block: Remove BlockDriverState.blk

2016-03-22 Thread Paolo Bonzini



On 22/03/2016 20:36, Kevin Wolf wrote:
> This is the final patch series that is required before we can start allowing
> setups with more than one BlockBackend per BlockDriverState.
> 
> My current plan is to put the patches up to (and including) this series into
> 2.6 so that we have a relatively consistent block layer state in the release
> that isn't in the middle of an overhaul, but not to make use of the new setups
> that we could allow now before 2.7.
> 
> Depends on 'block: Move I/O throttling to BlockBackend'.

I'm not sure how "posted one month after soft freeze", "depends on a
series also posted one month after soft freeze" and "put the patches up
to (and including) this series into 2.6" could go in the same message.

Paolo

Re: [Qemu-devel] [PATCH 00/12] block: Move I/O throttling to BlockBackend

2016-03-22 Thread Paolo Bonzini

On 22/03/2016 16:33, Kevin Wolf wrote:
> This is another feature that was "logically" part of the BlockBackend, but
> implemented as a BlockDriverState feature. It was always kept on top using
> swap_feature_fields().
> 
> This series moves it to be actually implemented in the BlockBackend, removing
> another obstacle for removing bs->blk and allowing multiple BBs per BDS.
> 
> Depends on 'block: Implement writethrough in BlockBackend'.

This series would mess up my own I/O throttling cleanups that have been
posted in February and have hardly seen a review for one month.

I expect the rules for soft freeze to apply to maintainers as well.
These patches and the removal of bs->blk are about one month late and
shouldn't be included in 2.6.

Thanks,

Paolo

> Kevin Wolf (12):
>   block: Don't disable I/O throttling on sync requests
>   block: Make sure throttled BDSes always have a BB
>   block: Introduce BlockBackendPublic
>   block: throttle-groups: Use BlockBackend pointers internally
>   block: Convert throttle_group_get_name() to BlockBackend
>   block: Move throttling fields from BDS to BB
>   block: Move actual I/O throttling to BlockBackend
>   block: Move I/O throttling configuration functions to BlockBackend
>   block: Introduce BdrvChild.opaque
>   block: Drain throttling queue with BdrvChild callback
>   block: Decouple throttling from BlockDriverState
>   block: Don't check throttled reqs in bdrv_requests_pending()
> 
>  block.c |  23 +
>  block/block-backend.c   | 146 +-
>  block/io.c  | 115 +++--
>  block/qapi.c|   6 +-
>  block/throttle-groups.c | 223 
> +---
>  blockdev.c  |  43 +++-
>  include/block/block.h   |   6 +-
>  include/block/block_int.h   |  23 +
>  include/block/throttle-groups.h |  12 +--
>  include/sysemu/block-backend.h  |  27 +
>  tests/test-throttle.c   |  62 ++-
>  11 files changed, 345 insertions(+), 341 deletions(-)
>

Re: [Qemu-devel] [PULL 00/29] Miscellaneous changes for 2016-03-22

2016-03-22 Thread Paolo Bonzini



On 22/03/2016 21:27, Peter Maydell wrote:
> 
> PPC64 Linux:
> 
> /home/pm215/qemu/hw/intc/openpic_kvm.c: In function 'kvm_openpic_realize':
> /home/pm215/qemu/hw/intc/openpic_kvm.c:207:9: error: implicit
> declaration of function 'error_setg'
> [-Werror=implicit-function-declaration]
>  error_setg(errp, "Kernel is lacking Device Control API");
>  ^
> 
> OSX:
> /Users/pm215/src/qemu-for-merges/hw/usb/dev-mtp.c:1354:13: warning:
> implicit declaration of function 'error_setg' is invalid in C99
> [-Wimplicit-function-declaration]
> error_setg(errp, "usb-mtp: x-root property must be configured");
> ^
> 
> /Users/pm215/src/qemu-for-merges/net/tap-bsd.c:75:9: warning: implicit
> declaration of function 'error_setg_errno' is invalid in C99
> [-Wimplicit-function-declaration]
> error_setg_errno(errp, errno, "could not open %s", dname);
> ^
> /Users/pm215/src/qemu-for-merges/net/tap-bsd.c:99:13: warning:
> implicit declaration of function 'error_setg' is invalid in C99
> [-Wimplicit-function-declaration]
> error_setg(errp, "vnet_hdr=1 requested, but no kernel "
> ^

OS X is not surprising, since I had to fix both Win32 and ARM builds
compared to Markus's original patch...  I thought I had covered all
files in PPC, but openpic slipped.

Paolo

Re: [Qemu-devel] [PATCH v3 09/10] chardev: add support for ACLs for TLS clients

2016-03-22 Thread Eric Blake

On 03/10/2016 11:59 AM, Daniel P. Berrange wrote:
> Currently any client which can complete the TLS handshake
> is able to use a chardev server. The server admin can turn
> on the 'verify-peer' option for the x509 creds to require
> the client to provide a x509 certificate. This means the
> client will have to acquire a certificate from the CA before
> they are permitted to use the chardev server. This is still
> a fairly weak bar.
> 
> This adds a 'tls-acl=ACL-ID' option to the socket chardev
> backend which takes the ID of a previously added 'QAuthZ'
> object instance. This ACL will be used to validate the client's
> x509 distinguished name. Clients failing the ACL will not be
> permitted to use the chardev server.
> 
> For example to setup an ACL that only allows connection from
> a client whose x509 certificate distinguished name contains
> 'CN=fred', you would use:
> 
>   $QEMU -object tls-creds-x509,id=tls0,dir=/home/berrange/qemutls,\
> endpoint=server,verify-peer=yes \
> -object authz-simple,id=acl0,policy=deny,\
> rules.0.match=*CN=fred,rules.0.policy=allow \

Needs shell quoting for *, and also the same recurring comment about
whitespace for presentation not actually being in the command line.

Food for thought: should we enhance QemuOpts to skip all whitespace
after ',', since we _know_ that valid key names start with a letter
rather than a space?  Then, we could represent command lines as:

$QEMU -object 'name,
   param1=value,
   param2=value'

with the same semantics as:

$QEMU -object name,param1=value,param2=value

and without having to worry about backslash-newline-whitespace
formatting.  Obviously, such an enhancement would be a separate patch.

> -chardev socket,host=127.0.0.1,port=9000,server,\
>tls-creds=tls0,tls-acl=acl0 \
> ...other qemud args...
> 
> Signed-off-by: Daniel P. Berrange 
> ---
>  qapi-schema.json |  2 ++
>  qemu-char.c  | 11 ++-
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 

Code is fine; my only comments were on the commit message.
Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-22 Thread Bandan Das

Alex Williamson  writes:
...
>> 
>> And it does. If we fix this assert, then vfio_dma_map() attempts mapping
>> this direct mapped address range starting from 0 and prints a 
>> warning message; happens for the whole range and goes on for ever.
>> The overflow check seemed to me like something we should fix, but now
>> I am more confused then ever!
>
> Is the MemoryRegion memory_region_is_iommu() such that you're calling
> vfio_dma_map() from vfio_iommu_map_notify()?  If so then we should

Yes, that is correct. This all started after we added the iommu mapping
replay changes but I was wrong about the vfio_dma_map part. Please see
below.

> probably be using 128bit helpers for doing sanity checking and go ahead
> and let something assert if we get to the vfio_dma_map() in
> vfio_listener_region_add() with a 2^64 size.  Then if you're taking the
> memory_region_is_iommu() path, vfio_dma_map() is going to be called
> with translations within that 2^64 bit address space, not mapping the
> entire space, right?  Thanks,

The 128 bit operations make sense...

The error message comes from:
if (!memory_region_is_ram(mr)) {
error_report("iommu map to non memory area %"HWADDR_PRIx"",
 xlat);
goto out;
}
in vfio_iommu_map_notify() before we even get to vfio_dma_map().

This gets attempted for the entire range because dmar isn't enabled yet and
vtd_iommu_translate() does this direct mapping in 4k increments in the translate
path :
...

if (!s->dmar_enabled) {
/* DMAR disabled, passthrough, use 4k-page*/
ret.iova = addr & VTD_PAGE_MASK_4K;
ret.translated_addr = addr & VTD_PAGE_MASK_4K;
ret.addr_mask = ~VTD_PAGE_MASK_4K;
ret.perm = IOMMU_RW;
return ret;
}

I am not sure yet who actually uses it though.
memory_region_iommu_replay() does the whole iteration
if perm != IOMMU_NONE:

void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
hwaddr granularity, bool is_write)
{
hwaddr addr;
IOMMUTLBEntry iotlb;

for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
iotlb = mr->iommu_ops->translate(mr, addr, is_write);
if (iotlb.perm != IOMMU_NONE) {
n->notify(n, &iotlb);
}
...

> Alex

Re: [Qemu-devel] [PATCH 00/15] NVDIMM: introduce nvdimm label support

2016-03-22 Thread Stefan Hajnoczi

On Tue, Mar 22, 2016 at 08:37:40AM -0700, Dan Williams wrote:
> On Thu, Mar 17, 2016 at 3:04 AM, Stefan Hajnoczi  wrote:
> > On Thu, Mar 17, 2016 at 04:32:46PM +0800, Xiao Guangrong wrote:
> >> This patchset is against commit d4207b223eef3 (fw-cfg: support writeable
> >> blobs) on pci branch of Michael's git tree and can be found at:
> >>   https://github.com/xiaogr/qemu.git nvdimm-label-v1
> >>
> >> This is the last part of vNVDIMM implementation which introduces nvdimm
> >> label support
> >>
> >> Currently Linux NVDIMM driver does not support namespace operation on this
> >> kind of PMEM, apply below changes to support dynamical namespace:
> >>
> >> @@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct 
> >> acpi_nfit_desc *a
> >> continue;
> >> }
> >>
> >> -   if (nfit_mem->bdw && nfit_mem->memdev_pmem)
> >> +   //if (nfit_mem->bdw && nfit_mem->memdev_pmem)
> >> +   if (nfit_mem->memdev_pmem)
> >> flags |= NDD_ALIASING;
> >
> > Not a blocker for this patch series, but why does Linux require Block
> > Device Window to enable namespace support?
> 
> A namespace label delineates aliased capacity between the pmem and
> block-window access mechanisms.  If there is no aliased capacity then
> the size of the namespace can be directly derived from the nfit range
> and a label need not be considered.  Contiguous (dax-capable)
> sub-divisions of pmem can be had via partitioning of the resulting
> gendisk.

Xiao Guangrong: Given this new information, what is the purpose of the
QEMU patches for ACPI DSM and especially the namespace label support?

The QEMU NVDIMM device only supports pmem so now I'm not sure we need
namespace labels or the ACPI DSM at all.

Stefan


signature.asc
Description: PGP signature

Re: [Qemu-devel] [PULL 00/29] Miscellaneous changes for 2016-03-22

2016-03-22 Thread Peter Maydell

On 22 March 2016 at 14:16, Paolo Bonzini  wrote:
> The following changes since commit 4829e0378dfb91d55af9dfd741bd09e8f2c4f91a:
>
>   Merge remote-tracking branch 'remotes/armbru/tags/pull-qapi-2016-03-18' 
> into staging (2016-03-18 17:18:41 +)
>
> are available in the git repository at:
>
>   git://github.com/bonzini/qemu.git tags/for-upstream
>
> for you to fetch changes up to 67262f458f145452614468137aff17c7b51e5257:
>
>   target-i386: implement PKE for TCG (2016-03-22 15:15:59 +0100)
>
> 
> * Log filtering from Alex and Peter
> * Chardev fixes from Daniel and Marc-André
> * config.status tweak from David
> * Header file tweaks from Markus, myself and Veronia (Outreachy candidate)
> * get_ticks_per_sec() removal from Rutuja (Outreachy candidate)
> * Coverity fix from myself
> * PKE implementation from myself, based on rth's XSAVE support

Hi; I'm afraid this didn't build:

PPC64 Linux:

/home/pm215/qemu/hw/intc/openpic_kvm.c: In function 'kvm_openpic_realize':
/home/pm215/qemu/hw/intc/openpic_kvm.c:207:9: error: implicit
declaration of function 'error_setg'
[-Werror=implicit-function-declaration]
 error_setg(errp, "Kernel is lacking Device Control API");
 ^

OSX:
/Users/pm215/src/qemu-for-merges/hw/usb/dev-mtp.c:1354:13: warning:
implicit declaration of function 'error_setg' is invalid in C99
[-Wimplicit-function-declaration]
error_setg(errp, "usb-mtp: x-root property must be configured");
^

/Users/pm215/src/qemu-for-merges/net/tap-bsd.c:75:9: warning: implicit
declaration of function 'error_setg_errno' is invalid in C99
[-Wimplicit-function-declaration]
error_setg_errno(errp, errno, "could not open %s", dname);
^
/Users/pm215/src/qemu-for-merges/net/tap-bsd.c:99:13: warning:
implicit declaration of function 'error_setg' is invalid in C99
[-Wimplicit-function-declaration]
error_setg(errp, "vnet_hdr=1 requested, but no kernel "
^

(and subsequent link errors).

thanks
-- PMM

Re: [Qemu-devel] [PATCH 0/5] trace: Add events for vCPU memory accesses

2016-03-22 Thread Peter Maydell

On 22 March 2016 at 19:23, Lluís Vilanova  wrote:
> Peter Maydell writes:
>> Well, that's why I'm not certain about it. I would prefer us to
>> not trace the accesses rather than put a trace in a wrong
>> position.
>
> Theoretically, that's what DEBUG_REMAP is for. A temporary host buffer is
> allocated and data is copied in/out of it according to 'type'
> (VERIFY_READ/VERIFY_WRITE) and 'copy'; so if the lock_user/unlock_user calls 
> are
> incorrect, QEMU will not read/write the correct guest memory values for 
> syscall
> emulation.
>
> I used the logic when DEBUG_REMAP is used to understand when guest memory is
> declared to be read/written. So the current approach for tracing is never in a
> wrong position.
>
> The only problem I see is that the syscall emulation code can declare it wants
> read & write access to some guest memory while a real system would only do one
> of the two (or even none). Tracing this is not incorrect either, since it's 
> just
> how QEMU is emulating that syscall, but it could be sub-optimal compared to a
> real syscall implementation.

Also we sometimes lock an entire buffer but then don't read or write
all of it. And a few places call g2h() directly and accesses via
those code paths won't get traced.

But really, an access to implement a syscall isn't a guest memory
access, it's part of the implementation of the emulated kernel ABIs.
We have tracing for that, it's the strace code (though the strace
code is really very poor currently).

Tracing memory accesses made by the emulated CPU makes total sense;
but I think "what tracing should we do for linux-user code" is
a separate question.

thanks
-- PMM

Re: [Qemu-devel] [PULL 00/10] tricore-patches

2016-03-22 Thread Peter Maydell

On 22 March 2016 at 13:46, Bastian Koppelmann
 wrote:
> The following changes since commit 4829e0378dfb91d55af9dfd741bd09e8f2c4f91a:
>
>   Merge remote-tracking branch 'remotes/armbru/tags/pull-qapi-2016-03-18' 
> into staging (2016-03-18 17:18:41 +)
>
> are available in the git repository at:
>
>   https://github.com/bkoppelmann/qemu-tricore-upstream.git 
> tags/pull-tricore-20160322
>
> for you to fetch changes up to d66718ccf37b71eff686c3d6116223889736188f:
>
>   target-tricore: Add ftoi and itof instructions (2016-03-21 18:04:16 +0100)

Hi; I'm afraid this doesn't build with clang:

/home/petmay01/linaro/qemu-for-merges/target-tricore/fpu_helper.c:45:20:
error: unused function 'f_is_pos_inf' [-Werror,-Wunused-function]
static inline bool f_is_pos_inf(float32 arg)
   ^
/home/petmay01/linaro/qemu-for-merges/target-tricore/fpu_helper.c:50:20:
error: unused function 'f_is_neg_inf' [-Werror,-Wunused-function]
static inline bool f_is_neg_inf(float32 arg)
   ^

(clang is stricter than gcc about unused static functions in
.c files.)

thanks
-- PMM

[Qemu-devel] [PATCH 4/9] block: Remove bdrv_aio_multiwrite()

2016-03-22 Thread Kevin Wolf

Since virtio-blk implements request merging itself these days, the only
remaining users are test cases for the function. That doesn't make the
function exactly useful any more.

Signed-off-by: Kevin Wolf 
---
 block/block-backend.c  |  14 ---
 block/io.c | 194 ---
 include/block/block.h  |   7 +-
 include/sysemu/block-backend.h |   1 -
 qemu-io-cmds.c | 203 -
 tests/qemu-iotests/100 | 146 -
 tests/qemu-iotests/100.out |  89 --
 tests/qemu-iotests/136 |  20 +---
 tests/qemu-iotests/136.out |   4 +-
 tests/qemu-iotests/group   |   2 +-
 trace-events   |   2 -
 11 files changed, 9 insertions(+), 673 deletions(-)
 delete mode 100755 tests/qemu-iotests/100
 delete mode 100644 tests/qemu-iotests/100.out

diff --git a/block/block-backend.c b/block/block-backend.c
index 03e68a7..dfc11b5 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1140,20 +1140,6 @@ void blk_aio_cancel_async(BlockAIOCB *acb)
 bdrv_aio_cancel_async(acb);
 }
 
-int blk_aio_multiwrite(BlockBackend *blk, BlockRequest *reqs, int num_reqs)
-{
-int i, ret;
-
-for (i = 0; i < num_reqs; i++) {
-ret = blk_check_request(blk, reqs[i].sector, reqs[i].nb_sectors);
-if (ret < 0) {
-return ret;
-}
-}
-
-return bdrv_aio_multiwrite(blk_bs(blk), reqs, num_reqs);
-}
-
 int blk_ioctl(BlockBackend *blk, unsigned long int req, void *buf)
 {
 if (!blk_is_available(blk)) {
diff --git a/block/io.c b/block/io.c
index 24bdd6c..6ec133f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1701,200 +1701,6 @@ BlockAIOCB *bdrv_aio_write_zeroes(BlockDriverState *bs,
  cb, opaque, true);
 }
 
-
-typedef struct MultiwriteCB {
-int error;
-int num_requests;
-int num_callbacks;
-struct {
-BlockCompletionFunc *cb;
-void *opaque;
-QEMUIOVector *free_qiov;
-} callbacks[];
-} MultiwriteCB;
-
-static void multiwrite_user_cb(MultiwriteCB *mcb)
-{
-int i;
-
-for (i = 0; i < mcb->num_callbacks; i++) {
-mcb->callbacks[i].cb(mcb->callbacks[i].opaque, mcb->error);
-if (mcb->callbacks[i].free_qiov) {
-qemu_iovec_destroy(mcb->callbacks[i].free_qiov);
-}
-g_free(mcb->callbacks[i].free_qiov);
-}
-}
-
-static void multiwrite_cb(void *opaque, int ret)
-{
-MultiwriteCB *mcb = opaque;
-
-trace_multiwrite_cb(mcb, ret);
-
-if (ret < 0 && !mcb->error) {
-mcb->error = ret;
-}
-
-mcb->num_requests--;
-if (mcb->num_requests == 0) {
-multiwrite_user_cb(mcb);
-g_free(mcb);
-}
-}
-
-static int multiwrite_req_compare(const void *a, const void *b)
-{
-const BlockRequest *req1 = a, *req2 = b;
-
-/*
- * Note that we can't simply subtract req2->sector from req1->sector
- * here as that could overflow the return value.
- */
-if (req1->sector > req2->sector) {
-return 1;
-} else if (req1->sector < req2->sector) {
-return -1;
-} else {
-return 0;
-}
-}
-
-/*
- * Takes a bunch of requests and tries to merge them. Returns the number of
- * requests that remain after merging.
- */
-static int multiwrite_merge(BlockDriverState *bs, BlockRequest *reqs,
-int num_reqs, MultiwriteCB *mcb)
-{
-int i, outidx;
-
-// Sort requests by start sector
-qsort(reqs, num_reqs, sizeof(*reqs), &multiwrite_req_compare);
-
-// Check if adjacent requests touch the same clusters. If so, combine them,
-// filling up gaps with zero sectors.
-outidx = 0;
-for (i = 1; i < num_reqs; i++) {
-int merge = 0;
-int64_t oldreq_last = reqs[outidx].sector + reqs[outidx].nb_sectors;
-
-// Handle exactly sequential writes and overlapping writes.
-if (reqs[i].sector <= oldreq_last) {
-merge = 1;
-}
-
-if (reqs[outidx].qiov->niov + reqs[i].qiov->niov + 1 >
-bs->bl.max_iov) {
-merge = 0;
-}
-
-if (bs->bl.max_transfer_length && reqs[outidx].nb_sectors +
-reqs[i].nb_sectors > bs->bl.max_transfer_length) {
-merge = 0;
-}
-
-if (merge) {
-size_t size;
-QEMUIOVector *qiov = g_malloc0(sizeof(*qiov));
-qemu_iovec_init(qiov,
-reqs[outidx].qiov->niov + reqs[i].qiov->niov + 1);
-
-// Add the first request to the merged one. If the requests are
-// overlapping, drop the last sectors of the first request.
-size = (reqs[i].sector - reqs[outidx].sector) << 9;
-qemu_iovec_concat(qiov, reqs[outidx].qiov, 0, size);
-
-// We should need to add any zeros between the two requests
-assert (reqs[i].sector <= oldreq_last);
-
-// Add the second request
-

[Qemu-devel] [PATCH 9/9] block: Remove BlockDriverState.blk

2016-03-22 Thread Kevin Wolf

This patch removes the remaining users of bs->blk, which will allow us
to have multiple BBs on top of a single BDS. All checks that are
currently in place to prevent the user from creating such setups.

Future patches can allow them and e.g. enable users to mirror to a block
device that already has a BlockBackend on it.

Signed-off-by: Kevin Wolf 
---
 block.c|  8 
 block/block-backend.c  |  8 
 block/mirror.c |  4 ++--
 blockdev.c | 19 ---
 include/block/block_int.h  |  2 --
 tests/qemu-iotests/085.out |  6 +++---
 6 files changed, 13 insertions(+), 34 deletions(-)

diff --git a/block.c b/block.c
index eefbcf3..05f9ad4 100644
--- a/block.c
+++ b/block.c
@@ -2227,14 +2227,6 @@ static void change_parent_backing_link(BlockDriverState 
*from,
 {
 BdrvChild *c, *next;
 
-if (from->blk) {
-/* FIXME We bypass blk_set_bs(), so we need to make these updates
- * manually. The root problem is not in this change function, but the
- * existence of BlockDriverState.blk. */
-to->blk = from->blk;
-from->blk = NULL;
-}
-
 QLIST_FOREACH_SAFE(c, &from->parents, next_parent, next) {
 assert(c->role != &child_backing);
 c->bs = to;
diff --git a/block/block-backend.c b/block/block-backend.c
index b71b822..42f95a6 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -190,7 +190,6 @@ BlockBackend *blk_new_with_bs(Error **errp)
 bs = bdrv_new_root();
 blk->root = bdrv_root_attach_child(bs, "root", &child_root);
 blk->root->opaque = blk;
-bs->blk = blk;
 return blk;
 }
 
@@ -453,12 +452,10 @@ bool bdrv_has_blk(BlockDriverState *bs)
 BdrvChild *child;
 QLIST_FOREACH(child, &bs->parents, next_parent) {
 if (child->role == &child_root) {
-assert(bs->blk);
 return true;
 }
 }
 
-assert(!bs->blk);
 return false;
 }
 
@@ -518,8 +515,6 @@ BlockBackend *blk_by_public(BlockBackendPublic *public)
  */
 void blk_remove_bs(BlockBackend *blk)
 {
-assert(blk->root->bs->blk == blk);
-
 notifier_list_notify(&blk->remove_bs_notifiers, blk);
 if (blk->public.throttle_state) {
 throttle_timers_detach_aio_context(&blk->public.throttle_timers);
@@ -527,7 +522,6 @@ void blk_remove_bs(BlockBackend *blk)
 
 blk_update_root_state(blk);
 
-blk->root->bs->blk = NULL;
 bdrv_root_unref_child(blk->root);
 blk->root = NULL;
 }
@@ -537,11 +531,9 @@ void blk_remove_bs(BlockBackend *blk)
  */
 void blk_insert_bs(BlockBackend *blk, BlockDriverState *bs)
 {
-assert(!blk->root && !bs->blk);
 bdrv_ref(bs);
 blk->root = bdrv_root_attach_child(bs, "root", &child_root);
 blk->root->opaque = blk;
-bs->blk = blk;
 
 notifier_list_notify(&blk->insert_bs_notifiers, blk);
 if (blk->public.throttle_state) {
diff --git a/block/mirror.c b/block/mirror.c
index b9a93fd..8b32271 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -450,7 +450,7 @@ static void mirror_exit(BlockJob *job, void *opaque)
 
 /* This was checked in mirror_start_job(), but meanwhile one of the
  * nodes could have been newly attached to a BlockBackend. */
-if (to_replace->blk && s->target->blk) {
+if (bdrv_has_blk(to_replace) && bdrv_has_blk(s->target)) {
 error_report("block job: Can't create node with two 
BlockBackends");
 data->ret = -EINVAL;
 goto out;
@@ -825,7 +825,7 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
 } else {
 replaced_bs = bs;
 }
-if (replaced_bs->blk && target->blk) {
+if (bdrv_has_blk(replaced_bs) && bdrv_has_blk(target)) {
 error_setg(errp, "Can't create node with two BlockBackends");
 return;
 }
diff --git a/blockdev.c b/blockdev.c
index c59cf3e..a658869 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1774,9 +1774,8 @@ static void external_snapshot_prepare(BlkActionState 
*common,
 return;
 }
 
-if (state->new_bs->blk != NULL) {
-error_setg(errp, "The snapshot is already in use by %s",
-   blk_name(state->new_bs->blk));
+if (bdrv_has_blk(state->new_bs)) {
+error_setg(errp, "The snapshot is already in use");
 return;
 }
 
@@ -2492,9 +2491,8 @@ void qmp_x_blockdev_insert_medium(const char *device, 
const char *node_name,
 return;
 }
 
-if (bs->blk) {
-error_setg(errp, "Node '%s' is already in use by '%s'", node_name,
-   blk_name(bs->blk));
+if (bdrv_has_blk(bs)) {
+error_setg(errp, "Node '%s' is already in use", node_name);
 return;
 }
 
@@ -3439,7 +3437,7 @@ static void blockdev_mirror_common(BlockDriverState *bs,
 if (bdrv_op_is_blocked(target, BLOCK_OP_TYPE_MIRROR_TARGET, errp)) {
 return;
 }
-if (target->blk) {
+if (bdrv_has_blk(target)) {
 error_setg(errp, "Cannot mirror to

[Qemu-devel] [PATCH 8/9] block: Don't return throttling info in query-named-block-nodes

2016-03-22 Thread Kevin Wolf

query-named-block-nodes should not return information that is related
to the attached BlockBackend rather than the node itself, so throttling
information needs to be removed from it.

Signed-off-by: Kevin Wolf 
---
 block/qapi.c   | 6 +++---
 tests/qemu-iotests/096 | 3 ++-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 0a222f6..d167e67 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -66,10 +66,10 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
 info->backing_file_depth = bdrv_get_backing_file_depth(bs);
 info->detect_zeroes = bs->detect_zeroes;
 
-if (bs->blk && blk_get_public(bs->blk)->throttle_state) {
+if (blk && blk_get_public(blk)->throttle_state) {
 ThrottleConfig cfg;
 
-throttle_group_get_config(bs->blk, &cfg);
+throttle_group_get_config(blk, &cfg);
 
 info->bps = cfg.buckets[THROTTLE_BPS_TOTAL].avg;
 info->bps_rd  = cfg.buckets[THROTTLE_BPS_READ].avg;
@@ -117,7 +117,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
 info->iops_size = cfg.op_size;
 
 info->has_group = true;
-info->group = g_strdup(throttle_group_get_name(bs->blk));
+info->group = g_strdup(throttle_group_get_name(blk));
 }
 
 info->write_threshold = bdrv_write_threshold_get(bs);
diff --git a/tests/qemu-iotests/096 b/tests/qemu-iotests/096
index e34204b..aeeb375 100644
--- a/tests/qemu-iotests/096
+++ b/tests/qemu-iotests/096
@@ -45,8 +45,9 @@ class TestLiveSnapshot(iotests.QMPTestCase):
 os.remove(self.target_img)
 
 def checkConfig(self, active_layer):
-result = self.vm.qmp('query-named-block-nodes')
+result = self.vm.qmp('query-block')
 for r in result['return']:
+r = r['inserted']
 if r['node-name'] == active_layer:
 self.assertEqual(r['group'], self.group)
 self.assertEqual(r['iops'], self.iops)
-- 
1.8.3.1

[Qemu-devel] [PATCH 3/9] block jobs: Use BdrvChild callbacks for iostatus operations

2016-03-22 Thread Kevin Wolf

The block jobs currently modify the target BB's error handling options
and require that the source BB's iostatus is enabled in order to
implement the per-job error options. It's obvious that this is something
between ugly, adventurous and plain wrong, so we should ideally fix this
instead of thinking about how to handle multiple BBs in this case.

Unfortunately we have a chicken-and-egg problem here: In order to fix
the problem, the block jobs need their own BBs. This requires multiple
BBs per BDS, which in turn means that BDS.blk must be removed. Removing
BDS.blk means that the jobs already need their own BB. The only other
options is that we lift the iostatus operations to the BdrvChild level
and deal with multiple BBs now.

So even though we don't want to have these callbacks in the end (because
they indicate broken logic), this patch converts the iostatus operations
for block jobs targets to BdrvChild callbacks; if more than one parent
implements iostatus operations, they are called for each of them.

Once the conversion is completed, block jobs will get their own internal
BB and the callbacks can be removed again.

Signed-off-by: Kevin Wolf 
---
 block/backup.c| 20 +--
 block/block-backend.c | 42 +++
 block/commit.c|  2 +-
 block/mirror.c| 21 ++--
 block/stream.c|  2 +-
 blockjob.c| 85 +--
 include/block/block_int.h | 10 ++
 include/block/blockjob.h  |  9 +
 8 files changed, 165 insertions(+), 26 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 10397e2..016741e 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -220,9 +220,8 @@ static void backup_iostatus_reset(BlockJob *job)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common);
 
-if (s->target->blk) {
-blk_iostatus_reset(s->target->blk);
-}
+/* FIXME Only touch iostatus of a job-owned BB */
+bdrv_iostatus_reset(s->target);
 }
 
 static void backup_cleanup_sync_bitmap(BackupBlockJob *job, int ret)
@@ -402,10 +401,9 @@ static void coroutine_fn backup_run(void *opaque)
 
 job->done_bitmap = bitmap_new(end);
 
-if (target->blk) {
-blk_set_on_error(target->blk, on_target_error, on_target_error);
-blk_iostatus_enable(target->blk);
-}
+/* FIXME Only touch on_error and iostatus of a job-owned BB */
+bdrv_set_on_error(target, on_target_error, on_target_error);
+bdrv_iostatus_enable(target);
 
 bdrv_add_before_write_notifier(bs, &before_write);
 
@@ -482,9 +480,9 @@ static void coroutine_fn backup_run(void *opaque)
 qemu_co_rwlock_unlock(&job->flush_rwlock);
 g_free(job->done_bitmap);
 
-if (target->blk) {
-blk_iostatus_disable(target->blk);
-}
+/* FIXME Only touch iostatus of a job-owned BB */
+bdrv_iostatus_disable(target);
+
 bdrv_op_unblock_all(target, job->common.blocker);
 
 data = g_malloc(sizeof(*data));
@@ -515,7 +513,7 @@ void backup_start(BlockDriverState *bs, BlockDriverState 
*target,
 
 if ((on_source_error == BLOCKDEV_ON_ERROR_STOP ||
  on_source_error == BLOCKDEV_ON_ERROR_ENOSPC) &&
-(!bs->blk || !blk_iostatus_is_enabled(bs->blk))) {
+!bdrv_iostatus_is_enabled(bs)) {
 error_setg(errp, QERR_INVALID_PARAMETER, "on-source-error");
 return;
 }
diff --git a/block/block-backend.c b/block/block-backend.c
index c4c0dc0..03e68a7 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -100,6 +100,39 @@ static const char *blk_root_get_name(BdrvChild *child)
 return blk_name(child->opaque);
 }
 
+static void blk_root_iostatus_reset(BdrvChild *child)
+{
+return blk_iostatus_reset(child->opaque);
+}
+
+static void blk_root_iostatus_enable(BdrvChild *child)
+{
+return blk_iostatus_enable(child->opaque);
+}
+
+static void blk_root_iostatus_disable(BdrvChild *child)
+{
+return blk_iostatus_disable(child->opaque);
+}
+
+static bool blk_root_iostatus_is_enabled(BdrvChild *child)
+{
+return blk_iostatus_is_enabled(child->opaque);
+}
+
+static void blk_root_iostatus_set_err(BdrvChild *child, int error)
+{
+return blk_iostatus_set_err(child->opaque, error);
+}
+
+static void blk_root_set_on_error(BdrvChild *child,
+  BlockdevOnError on_read_error,
+  BlockdevOnError on_write_error)
+{
+return blk_set_on_error(child->opaque, on_read_error, on_write_error);
+}
+
+
 static const BdrvChildRole child_root = {
 .inherit_options= blk_root_inherit_options,
 
@@ -108,6 +141,15 @@ static const BdrvChildRole child_root = {
 .get_name   = blk_root_get_name,
 
 .drain_queue= blk_drain_throttling_queue,
+
+/* FIXME Remove these callbacks. Children setting the iostatus are wrong,
+ * it should be controlled by the parents. */
+.iostatus_reset = blk_root_iostatus_reset,
+.iostatus_enable

[Qemu-devel] [PATCH 6/9] block: Remove bdrv_move_feature_fields()

2016-03-22 Thread Kevin Wolf

bdrv_move_feature_fields() and swap_feature_fields() are empty now, they
can be removed.

Signed-off-by: Kevin Wolf 
---
 block.c | 30 --
 1 file changed, 30 deletions(-)

diff --git a/block.c b/block.c
index 66f918e..3770fb0 100644
--- a/block.c
+++ b/block.c
@@ -,13 +,6 @@ void bdrv_close_all(void)
 }
 }
 
-/* Fields that need to stay with the top-level BDS */
-static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
- BlockDriverState *bs_src)
-{
-/* move some fields that need to stay attached to the device */
-}
-
 static void change_parent_backing_link(BlockDriverState *from,
BlockDriverState *to)
 {
@@ -2252,16 +2245,6 @@ static void change_parent_backing_link(BlockDriverState 
*from,
 }
 }
 
-static void swap_feature_fields(BlockDriverState *bs_top,
-BlockDriverState *bs_new)
-{
-BlockDriverState tmp;
-
-bdrv_move_feature_fields(&tmp, bs_top);
-bdrv_move_feature_fields(bs_top, bs_new);
-bdrv_move_feature_fields(bs_new, &tmp);
-}
-
 /*
  * Add new bs contents at the top of an image chain while the chain is
  * live, while keeping required fields on the top layer.
@@ -2285,9 +2268,6 @@ void bdrv_append(BlockDriverState *bs_new, 
BlockDriverState *bs_top)
 
 bdrv_ref(bs_top);
 
-/* Some fields always stay on top of the backing file chain */
-swap_feature_fields(bs_top, bs_new);
-
 change_parent_backing_link(bs_top, bs_new);
 bdrv_set_backing_hd(bs_new, bs_top);
 bdrv_unref(bs_top);
@@ -2304,16 +2284,6 @@ void bdrv_replace_in_backing_chain(BlockDriverState 
*old, BlockDriverState *new)
 
 bdrv_ref(old);
 
-if (old->blk) {
-/* As long as these fields aren't in BlockBackend, but in the top-level
- * BlockDriverState, it's not possible for a BDS to have two BBs.
- *
- * We really want to copy the fields from old to new, but we go for a
- * swap instead so that pointers aren't duplicated and cause trouble.
- * (Also, bdrv_swap() used to do the same.) */
-assert(!new->blk);
-swap_feature_fields(old, new);
-}
 change_parent_backing_link(old, new);
 
 /* Change backing files if a previously independent node is added to the
-- 
1.8.3.1

[Qemu-devel] [PATCH 5/9] block: Avoid BDS.blk in bdrv_next()

2016-03-22 Thread Kevin Wolf

We just want to know whether a BDS has at least one BB attached in order
to avoid enumerating it twice. This doesn't depend on the exact BB that
is attached and is still a valid question when more than one BB can be
attached, so just answer it by checking the parents list.

Signed-off-by: Kevin Wolf 
---
 block.c|  4 ++--
 block/block-backend.c  | 17 +
 include/sysemu/block-backend.h |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 4cc117d..66f918e 100644
--- a/block.c
+++ b/block.c
@@ -2904,7 +2904,7 @@ BlockDriverState *bdrv_next_node(BlockDriverState *bs)
  * the monitor or attached to a BlockBackend */
 BlockDriverState *bdrv_next(BlockDriverState *bs)
 {
-if (!bs || bs->blk) {
+if (!bs || bdrv_has_blk(bs)) {
 bs = blk_next_root_bs(bs);
 if (bs) {
 return bs;
@@ -2915,7 +2915,7 @@ BlockDriverState *bdrv_next(BlockDriverState *bs)
  * handled by the above block already */
 do {
 bs = bdrv_next_monitor_owned(bs);
-} while (bs && bs->blk);
+} while (bs && bdrv_has_blk(bs));
 return bs;
 }
 
diff --git a/block/block-backend.c b/block/block-backend.c
index dfc11b5..fdd1727 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -434,6 +434,23 @@ BlockDriverState *blk_bs(BlockBackend *blk)
 }
 
 /*
+ * Returns true if @bs has an associated BlockBackend.
+ */
+bool bdrv_has_blk(BlockDriverState *bs)
+{
+BdrvChild *child;
+QLIST_FOREACH(child, &bs->parents, next_parent) {
+if (child->role == &child_root) {
+assert(bs->blk);
+return true;
+}
+}
+
+assert(!bs->blk);
+return false;
+}
+
+/*
  * Return @blk's DriveInfo if any, else null.
  */
 DriveInfo *blk_legacy_dinfo(BlockBackend *blk)
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 1ae1ac9..bd68a17 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -99,6 +99,7 @@ BlockBackend *blk_by_public(BlockBackendPublic *public);
 BlockDriverState *blk_bs(BlockBackend *blk);
 void blk_remove_bs(BlockBackend *blk);
 void blk_insert_bs(BlockBackend *blk, BlockDriverState *bs);
+bool bdrv_has_blk(BlockDriverState *bs);
 
 void blk_set_allow_write_beyond_eof(BlockBackend *blk, bool allow);
 void blk_iostatus_enable(BlockBackend *blk);
-- 
1.8.3.1

[Qemu-devel] [PATCH 7/9] block: Avoid bs->blk in bdrv_next()

2016-03-22 Thread Kevin Wolf

We need to introduce a separate BdrvNextIterator struct that can keep
more state than just the current BDS in order to avoid using the bs->blk
pointer.

Signed-off-by: Kevin Wolf 
---
 block.c| 34 +---
 block/block-backend.c  | 44 +++---
 block/io.c | 13 +++--
 block/snapshot.c   | 30 
 blockdev.c |  3 ++-
 include/block/block.h  |  3 ++-
 include/sysemu/block-backend.h |  1 -
 migration/block.c  |  4 +++-
 monitor.c  |  6 --
 qmp.c  |  5 -
 10 files changed, 77 insertions(+), 66 deletions(-)

diff --git a/block.c b/block.c
index 3770fb0..eefbcf3 100644
--- a/block.c
+++ b/block.c
@@ -2870,25 +2870,6 @@ BlockDriverState *bdrv_next_node(BlockDriverState *bs)
 return QTAILQ_NEXT(bs, node_list);
 }
 
-/* Iterates over all top-level BlockDriverStates, i.e. BDSs that are owned by
- * the monitor or attached to a BlockBackend */
-BlockDriverState *bdrv_next(BlockDriverState *bs)
-{
-if (!bs || bdrv_has_blk(bs)) {
-bs = blk_next_root_bs(bs);
-if (bs) {
-return bs;
-}
-}
-
-/* Ignore all BDSs that are attached to a BlockBackend here; they have been
- * handled by the above block already */
-do {
-bs = bdrv_next_monitor_owned(bs);
-} while (bs && bdrv_has_blk(bs));
-return bs;
-}
-
 const char *bdrv_get_node_name(const BlockDriverState *bs)
 {
 return bs->node_name;
@@ -3212,10 +3193,11 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error 
**errp)
 
 void bdrv_invalidate_cache_all(Error **errp)
 {
-BlockDriverState *bs = NULL;
+BlockDriverState *bs;
 Error *local_err = NULL;
+BdrvNextIterator *it = NULL;
 
-while ((bs = bdrv_next(bs)) != NULL) {
+while ((it = bdrv_next(it, &bs)) != NULL) {
 AioContext *aio_context = bdrv_get_aio_context(bs);
 
 aio_context_acquire(aio_context);
@@ -3245,10 +3227,11 @@ static int bdrv_inactivate(BlockDriverState *bs)
 
 int bdrv_inactivate_all(void)
 {
-BlockDriverState *bs = NULL;
+BlockDriverState *bs;
+BdrvNextIterator *it = NULL;
 int ret;
 
-while ((bs = bdrv_next(bs)) != NULL) {
+while ((it = bdrv_next(it, &bs)) != NULL) {
 AioContext *aio_context = bdrv_get_aio_context(bs);
 
 aio_context_acquire(aio_context);
@@ -3748,10 +3731,11 @@ bool bdrv_recurse_is_first_non_filter(BlockDriverState 
*bs,
  */
 bool bdrv_is_first_non_filter(BlockDriverState *candidate)
 {
-BlockDriverState *bs = NULL;
+BlockDriverState *bs;
+BdrvNextIterator *it = NULL;
 
 /* walk down the bs forest recursively */
-while ((bs = bdrv_next(bs)) != NULL) {
+while ((it = bdrv_next(it, &bs)) != NULL) {
 bool perm;
 
 /* try to recurse in this top level bs */
diff --git a/block/block-backend.c b/block/block-backend.c
index fdd1727..b71b822 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -325,28 +325,40 @@ BlockBackend *blk_next(BlockBackend *blk)
: QTAILQ_FIRST(&monitor_block_backends);
 }
 
-/*
- * Iterates over all BlockDriverStates which are attached to a BlockBackend.
- * This function is for use by bdrv_next().
- *
- * @bs must be NULL or a BDS that is attached to a BB.
- */
-BlockDriverState *blk_next_root_bs(BlockDriverState *bs)
-{
+struct BdrvNextIterator {
+int phase;
 BlockBackend *blk;
+BlockDriverState *bs;
+};
 
-if (bs) {
-assert(bs->blk);
-blk = bs->blk;
-} else {
-blk = NULL;
+/* Iterates over all top-level BlockDriverStates, i.e. BDSs that are owned by
+ * the monitor or attached to a BlockBackend */
+BdrvNextIterator *bdrv_next(BdrvNextIterator *it, BlockDriverState **bs)
+{
+if (!it) {
+it = g_new0(BdrvNextIterator, 1);
+}
+
+if (it->phase == 0) {
+do {
+it->blk = blk_all_next(it->blk);
+*bs = it->blk ? blk_bs(it->blk) : NULL;
+} while (it->blk && *bs == NULL);
+
+if (*bs) {
+return it;
+}
+it->phase = 1;
 }
 
+/* Ignore all BDSs that are attached to a BlockBackend here; they have been
+ * handled by the above block already */
 do {
-blk = blk_all_next(blk);
-} while (blk && !blk->root);
+it->bs = bdrv_next_monitor_owned(it->bs);
+*bs = it->bs;
+} while (*bs && bdrv_has_blk(*bs));
 
-return blk ? blk->root->bs : NULL;
+return *bs ? it : NULL;
 }
 
 /*
diff --git a/block/io.c b/block/io.c
index 6ec133f..ead65cd 100644
--- a/block/io.c
+++ b/block/io.c
@@ -213,10 +213,11 @@ void bdrv_drain_all(void)
 {
 /* Always run first iteration so any pending completion BHs run */
 bool busy = true;
-BlockDriverState *bs = NULL;
+BlockDriverState *bs;
+BdrvNextIterator *it = NULL;
 GSList *aio_ctxs = NU

[Qemu-devel] [PATCH 1/9] block: Use BdrvChild callbacks for change_media/resize

2016-03-22 Thread Kevin Wolf

We want to get rid of BlockDriverState.blk in order to allow multiple
BlockBackends per BDS. Converting the device callbacks in block.c (which
assume a single BlockBackend) to per-child callbacks gets us rid of the
first few instances.

Signed-off-by: Kevin Wolf 
---
 block.c   | 38 +-
 block/block-backend.c | 15 ++-
 include/block/block_int.h |  4 +++-
 3 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/block.c b/block.c
index fb37b91..1fb5dac 100644
--- a/block.c
+++ b/block.c
@@ -1216,6 +1216,27 @@ void bdrv_unref_child(BlockDriverState *parent, 
BdrvChild *child)
 bdrv_root_unref_child(child);
 }
 
+
+static void bdrv_parent_cb_change_media(BlockDriverState *bs, bool load)
+{
+BdrvChild *c;
+QLIST_FOREACH(c, &bs->parents, next_parent) {
+if (c->role->change_media) {
+c->role->change_media(c, load);
+}
+}
+}
+
+static void bdrv_parent_cb_resize(BlockDriverState *bs)
+{
+BdrvChild *c;
+QLIST_FOREACH(c, &bs->parents, next_parent) {
+if (c->role->resize) {
+c->role->resize(c);
+}
+}
+}
+
 /*
  * Sets the backing file link of a BDS. A new reference is created; callers
  * which don't need their own reference any more must call bdrv_unref().
@@ -1674,9 +1695,7 @@ static int bdrv_open_inherit(BlockDriverState **pbs, 
const char *filename,
 }
 
 if (!bdrv_key_required(bs)) {
-if (bs->blk) {
-blk_dev_change_media_cb(bs->blk, true);
-}
+bdrv_parent_cb_change_media(bs, true);
 } else if (!runstate_check(RUN_STATE_PRELAUNCH)
&& !runstate_check(RUN_STATE_INMIGRATE)
&& !runstate_check(RUN_STATE_PAUSED)) { /* HACK */
@@ -2122,9 +2141,7 @@ static void bdrv_close(BlockDriverState *bs)
 bdrv_release_named_dirty_bitmaps(bs);
 assert(QLIST_EMPTY(&bs->dirty_bitmaps));
 
-if (bs->blk) {
-blk_dev_change_media_cb(bs->blk, false);
-}
+bdrv_parent_cb_change_media(bs, false);
 
 if (bs->drv) {
 BdrvChild *child, *next;
@@ -2605,9 +2622,7 @@ int bdrv_truncate(BlockDriverState *bs, int64_t offset)
 if (ret == 0) {
 ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
 bdrv_dirty_bitmap_truncate(bs);
-if (bs->blk) {
-blk_dev_resize_cb(bs->blk);
-}
+bdrv_parent_cb_resize(bs);
 }
 return ret;
 }
@@ -2718,10 +2733,7 @@ int bdrv_set_key(BlockDriverState *bs, const char *key)
 bs->valid_key = 0;
 } else if (!bs->valid_key) {
 bs->valid_key = 1;
-if (bs->blk) {
-/* call the change callback now, we skipped it on open */
-blk_dev_change_media_cb(bs->blk, true);
-}
+bdrv_parent_cb_change_media(bs, true);
 }
 return ret;
 }
diff --git a/block/block-backend.c b/block/block-backend.c
index 8b4eb1a..9d973ba 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -92,9 +92,15 @@ static void blk_root_inherit_options(int *child_flags, QDict 
*child_options,
 }
 static bool blk_drain_throttling_queue(BdrvChild *child);
 
+static void blk_root_change_media(BdrvChild *child, bool load);
+static void blk_root_resize(BdrvChild *child);
+
 static const BdrvChildRole child_root = {
 .inherit_options= blk_root_inherit_options,
 
+.change_media   = blk_root_change_media,
+.resize = blk_root_resize,
+
 .drain_queue= blk_drain_throttling_queue,
 };
 
@@ -553,6 +559,11 @@ void blk_dev_change_media_cb(BlockBackend *blk, bool load)
 }
 }
 
+static void blk_root_change_media(BdrvChild *child, bool load)
+{
+blk_dev_change_media_cb(child->opaque, load);
+}
+
 /*
  * Does @blk's attached device model have removable media?
  * %true if no device model is attached.
@@ -607,8 +618,10 @@ bool blk_dev_is_medium_locked(BlockBackend *blk)
 /*
  * Notify @blk's attached device model of a backend size change.
  */
-void blk_dev_resize_cb(BlockBackend *blk)
+static void blk_root_resize(BdrvChild *child)
 {
+BlockBackend *blk = child->opaque;
+
 if (blk->dev_ops && blk->dev_ops->resize_cb) {
 blk->dev_ops->resize_cb(blk->dev_opaque);
 }
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 000d2c0..f60cb7c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -356,6 +356,9 @@ struct BdrvChildRole {
 void (*inherit_options)(int *child_flags, QDict *child_options,
 int parent_flags, QDict *parent_options);
 
+void (*change_media)(BdrvChild *child, bool load);
+void (*resize)(BdrvChild *child);
+
 bool (*drain_queue)(BdrvChild *child);
 };
 
@@ -695,7 +698,6 @@ bool blk_dev_has_tray(BlockBackend *blk);
 void blk_dev_eject_request(BlockBackend *blk, bool force);
 bool blk_dev_is_tray_open(BlockBackend *blk);
 bool blk_dev_is_medium_locked(BlockBackend *blk);
-void blk_dev_resize_cb(BlockBack

[Qemu-devel] [PATCH 2/9] block: User BdrvChild callback for device name

2016-03-22 Thread Kevin Wolf

In order to get rid of bs->blk for bdrv_get_device_name() and
bdrv_get_device_or_node_name(), ask all parents for their name and
simply pick the first one.

Signed-off-by: Kevin Wolf 
---
 block.c   | 22 --
 block/block-backend.c |  6 ++
 include/block/block_int.h |  1 +
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 1fb5dac..4cc117d 100644
--- a/block.c
+++ b/block.c
@@ -2924,10 +2924,28 @@ const char *bdrv_get_node_name(const BlockDriverState 
*bs)
 return bs->node_name;
 }
 
+static const char *bdrv_get_parent_name(const BlockDriverState *bs)
+{
+BdrvChild *c;
+const char *name;
+
+/* If multiple parents have a name, just pick the first one. */
+QLIST_FOREACH(c, &bs->parents, next_parent) {
+if (c->role->get_name) {
+name = c->role->get_name(c);
+if (name && *name) {
+return name;
+}
+}
+}
+
+return NULL;
+}
+
 /* TODO check what callers really want: bs->node_name or blk_name() */
 const char *bdrv_get_device_name(const BlockDriverState *bs)
 {
-return bs->blk ? blk_name(bs->blk) : "";
+return bdrv_get_parent_name(bs) ?: "";
 }
 
 /* This can be used to identify nodes that might not have a device
@@ -2936,7 +2954,7 @@ const char *bdrv_get_device_name(const BlockDriverState 
*bs)
  * absent, then this returns an empty (non-null) string. */
 const char *bdrv_get_device_or_node_name(const BlockDriverState *bs)
 {
-return bs->blk ? blk_name(bs->blk) : bs->node_name;
+return bdrv_get_parent_name(bs) ?: bs->node_name;
 }
 
 int bdrv_get_flags(BlockDriverState *bs)
diff --git a/block/block-backend.c b/block/block-backend.c
index 9d973ba..c4c0dc0 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -95,11 +95,17 @@ static bool blk_drain_throttling_queue(BdrvChild *child);
 static void blk_root_change_media(BdrvChild *child, bool load);
 static void blk_root_resize(BdrvChild *child);
 
+static const char *blk_root_get_name(BdrvChild *child)
+{
+return blk_name(child->opaque);
+}
+
 static const BdrvChildRole child_root = {
 .inherit_options= blk_root_inherit_options,
 
 .change_media   = blk_root_change_media,
 .resize = blk_root_resize,
+.get_name   = blk_root_get_name,
 
 .drain_queue= blk_drain_throttling_queue,
 };
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f60cb7c..195abe8 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -358,6 +358,7 @@ struct BdrvChildRole {
 
 void (*change_media)(BdrvChild *child, bool load);
 void (*resize)(BdrvChild *child);
+const char* (*get_name)(BdrvChild *child);
 
 bool (*drain_queue)(BdrvChild *child);
 };
-- 
1.8.3.1

[Qemu-devel] [PATCH 0/9] block: Remove BlockDriverState.blk

2016-03-22 Thread Kevin Wolf

This is the final patch series that is required before we can start allowing
setups with more than one BlockBackend per BlockDriverState.

My current plan is to put the patches up to (and including) this series into
2.6 so that we have a relatively consistent block layer state in the release
that isn't in the middle of an overhaul, but not to make use of the new setups
that we could allow now before 2.7.

Depends on 'block: Move I/O throttling to BlockBackend'.

Kevin Wolf (9):
  block: Use BdrvChild callbacks for change_media/resize
  block: User BdrvChild callback for device name
  block jobs: Use BdrvChild callbacks for iostatus operations
  block: Remove bdrv_aio_multiwrite()
  block: Avoid BDS.blk in bdrv_next()
  block: Remove bdrv_move_feature_fields()
  block: Avoid bs->blk in bdrv_next()
  block: Don't return throttling info in query-named-block-nodes
  block: Remove BlockDriverState.blk

 block.c| 130 +++---
 block/backup.c |  20 ++--
 block/block-backend.c  | 142 
 block/commit.c |   2 +-
 block/io.c | 207 ++---
 block/mirror.c |  25 +++--
 block/qapi.c   |   6 +-
 block/snapshot.c   |  30 +++---
 block/stream.c |   2 +-
 blockdev.c |  22 ++---
 blockjob.c |  85 -
 include/block/block.h  |  10 +-
 include/block/block_int.h  |  17 +++-
 include/block/blockjob.h   |   9 ++
 include/sysemu/block-backend.h |   3 +-
 migration/block.c  |   4 +-
 monitor.c  |   6 +-
 qemu-io-cmds.c | 203 
 qmp.c  |   5 +-
 tests/qemu-iotests/085.out |   6 +-
 tests/qemu-iotests/096 |   3 +-
 tests/qemu-iotests/100 | 146 -
 tests/qemu-iotests/100.out |  89 --
 tests/qemu-iotests/136 |  20 +---
 tests/qemu-iotests/136.out |   4 +-
 tests/qemu-iotests/group   |   2 +-
 trace-events   |   2 -
 27 files changed, 353 insertions(+), 847 deletions(-)
 delete mode 100755 tests/qemu-iotests/100
 delete mode 100644 tests/qemu-iotests/100.out

-- 
1.8.3.1

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-22 Thread Alex Williamson

On Tue, 22 Mar 2016 15:07:00 -0400
Bandan Das  wrote:

> Peter Xu  writes:
> 
> > On Mon, Mar 21, 2016 at 06:00:50PM -0400, Bandan Das wrote:  
> >> 
> >> vfio_listener_region_add for a iommu mr results in
> >> an overflow assert since emulated iommu memory region is initialized
> >> with UINT64_MAX. Add a check just like memory_region_size()
> >> does.  
> >
> > Hi, Bandan,
> >
> > In case you missed this:
> >
> > https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg02865.html  
> 
> Thank you Peter, I wasn't aware. But unfortunately, I don't think he's doing 
> the right
> thing either when handing @end ? Oh well, it's a RFC :) 

Agree, the fix there is bogus too.

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-22 Thread Alex Williamson

On Tue, 22 Mar 2016 14:55:14 -0400
Bandan Das  wrote:

> Alex Williamson  writes:
> ...
> >> >>mr->size = int128_make64(size);
> >> >>if (size == UINT64_MAX) {
> >> >>   mr->size = int128_2_64();
> >> >>}
> >> >> So, end - 1 is still valid for end = UINT64_MAX, no ?
> >> >
> >> > int128_2_64() is not equal to UINT64_MAX, so assigning UIN64_MAX to
> >> > @end is clearing altering the value.  If we had a range from zero to
> >> 
> >> I thought in128_2_64 is the 128 bit representation of UINT64_MAX. The
> >> if condition in memory_region_init doesn't make sense otherwise.  
> >
> > 2^64 cannot be represented with a uint64_t, 2^64 - 1 can:
> >
> > int128_2_64 = 1____h
> > UINT64_MAX  =   ___h  
> 
> Thanks, understood this part. I still don't understand the if condition
> in memory_region_init however. Unless, that function actually takes the
> last address for the size parameter and in that case, it should be
> UINT64_MAX-1 for a size of UINT64_MAX.

Seems like some sort of compatibility convention since
memory_region_init() only takes a uint64_t as the size.  memory.c
interprets that as "oh, I know you really mean 2^64".
 
> >> > int128_2_64() then the size of that region is int128_2_64().  If we
> >> > alter @end to be UINT64_MAX, then the size is only UINT64_MAX and @end
> >> > - 1 is off by one versus the case where we use the value directly.
> >> 
> >> Ok, you mean something like:
> >> int128_get64(int128_sub(int128_2_64(), int128_make64(1)));  for (end - 1) ?
> >> But we still have to deal with (end - iova) when calling vfio_dmap_map().
> >> int128_get64() will definitely assert for iova = 0.   
> >
> > I don't know that that's the most efficient way to handle it, but @end
> > represents a different thing by imposing that -1 and it needs to be
> > handled in the reset of the code.
> >  
> >> > You're effectively changing @end to be the last address in the range,
> >> 
> >> No, I think I am changing "end" to what we initally started with for size
> >> before converting to 128 bit.  
> >
> > Nope, it's the difference between the size of the region and the last
> > address of the region.  
> 
> Ok, but note that it's the "size" that actually asserts here since the
> offset is 0. So, we started with a size UINT64_MAX but end with mr->size =
> 128_2_64().

A size of UINT64_MAX doesn't make much sense to me, that would mean the
last address is UINT64_MAX - 1.  A size of 2^64 means the last address
is UINT64_MAX, which seems reasonable.
 
> >> > but only in some cases, and not adjusting the remaining code to match.
> >> > Not only that, but the vfio map command is probably going to fail if we
> >> > pass in such an unaligned size since the mapping granularity is
> >> 
> >> Trying to map such a large region is wrong anyway, I am still trying
> >> to workout a solution to avoid calling memory_region_init_iommu()
> >> with UINT64_MAX which is what emulated vt-d currently does.  
> >
> > Right, the address width of the IOMMU on x86 is typically nowhere near
> > 2^64, so if you take the vfio_dma_map path, you'll surely explode.  
> 
> And it does. If we fix this assert, then vfio_dma_map() attempts mapping
> this direct mapped address range starting from 0 and prints a 
> warning message; happens for the whole range and goes on for ever.
> The overflow check seemed to me like something we should fix, but now
> I am more confused then ever!

Is the MemoryRegion memory_region_is_iommu() such that you're calling
vfio_dma_map() from vfio_iommu_map_notify()?  If so then we should
probably be using 128bit helpers for doing sanity checking and go ahead
and let something assert if we get to the vfio_dma_map() in
vfio_listener_region_add() with a 2^64 size.  Then if you're taking the
memory_region_is_iommu() path, vfio_dma_map() is going to be called
with translations within that 2^64 bit address space, not mapping the
entire space, right?  Thanks,

Alex

Re: [Qemu-devel] [PATCH v5 2/5] arm: qmp: add query-gic-capabilities interface

2016-03-22 Thread Markus Armbruster

Eric Blake  writes:

> On 03/17/2016 09:27 PM, Peter Xu wrote:
>> This patch adds the command "query-gic-capabilities" but not implemnet
>
> s/not implemnet/does not implement/
>
>> it. The command is ARM-only. Return of the command is a list of
>> GICCapability struct that describes all GIC versions that current QEMU
>> and system support.
>> 
>> Signed-off-by: Peter Xu 
>> ---
>
>> +++ b/qapi-schema.json
>> @@ -4156,3 +4156,14 @@
>>'data': { 'version': 'int',
>>  'emulated': 'bool',
>>  'kernel': 'bool' } }
>> +
>> +##
>> +# @query-gic-capabilities:
>> +#
>> +# Return a list of supported GIC version capabilities.
>> +#
>> +# Returns: a list of GICCapability.
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'command': 'query-gic-capabilities', 'returns': ['GICCapability'] }
>
> On the surface, this seems okay.  As mentioned before, I would have
> squashed 1 and 2 into a single patch.  The GICCapability type is
> extensible, and introspection is sufficient at seeing what the type is
> currently capable of exposing.
>
> On the other hand...
>
>> +++ b/scripts/qapi.py
>> @@ -46,6 +46,7 @@ returns_whitelist = [
>>  'query-tpm-models',
>>  'query-tpm-types',
>>  'ringbuf-read',
>> +'query-gic-capability',
>
> ...it required a whitelist, because you are violating the usual
> convention of returning a dict.  If you DO need the whitelist, your
> addition should have been kept sorted.  But you don't need it, if you
> would modify your QAPI to return a dict:
>
> { 'struct': 'GICCapabilitiesReturn',
>   'data': { 'capabilities': ['GICCapability'] } }
> { 'command': 'query-gic-capabilities',
>   'returns': 'GICCapabilitiesReturn' }
>
> Yes, the dict has only a single key, and that key points to the same
> list; but now you have future extensibility: in the future, we could
> return any future global data as a sibling to the array, without having
> to modify every element of the array to repeat redundant information.

I just tested it, the whitelist entry is superfluous, check_command()
accepts a list of dicts just fine.

Re: [Qemu-devel] [PATCH 0/5] trace: Add events for vCPU memory accesses

2016-03-22 Thread Lluís Vilanova

Peter Maydell writes:

> On 22 March 2016 at 14:02, Lluís Vilanova  wrote:
>> Stefan Hajnoczi writes:
>> 
>>> On Wed, Mar 16, 2016 at 03:10:01PM +, Peter Maydell wrote:
 The first two patches which add TCG guest data access tracing look
 OK to me, but I'm much less sure about the last three which are
 adding tracing into linux-user syscall emulation. I'm not sure
 that lock_user is the right place to put that tracepoint.
>> 
>>> Any thoughts on this, Lluís?
>> 
>> Mmmm, I was struggling to find a place to easily add the tracing events
>> whenever the syscall emulation code accesses guest memory.
>> 
>> The lock_user function is used precisely for that, but it can be a bit
>> heavy-handed as to what memory is actually read/written, since it only marks 
>> the
>> "potential" ability of doing so (it's a sort of acquire/release interface 
>> that
>> differentiates between read and write acquires).

> Well, that's why I'm not certain about it. I would prefer us to
> not trace the accesses rather than put a trace in a wrong
> position.

Theoretically, that's what DEBUG_REMAP is for. A temporary host buffer is
allocated and data is copied in/out of it according to 'type'
(VERIFY_READ/VERIFY_WRITE) and 'copy'; so if the lock_user/unlock_user calls are
incorrect, QEMU will not read/write the correct guest memory values for syscall
emulation.

I used the logic when DEBUG_REMAP is used to understand when guest memory is
declared to be read/written. So the current approach for tracing is never in a
wrong position.

The only problem I see is that the syscall emulation code can declare it wants
read & write access to some guest memory while a real system would only do one
of the two (or even none). Tracing this is not incorrect either, since it's just
how QEMU is emulating that syscall, but it could be sub-optimal compared to a
real syscall implementation.

Cheers,
  Lluis

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-22 Thread Bandan Das

Peter Xu  writes:

> On Mon, Mar 21, 2016 at 06:00:50PM -0400, Bandan Das wrote:
>> 
>> vfio_listener_region_add for a iommu mr results in
>> an overflow assert since emulated iommu memory region is initialized
>> with UINT64_MAX. Add a check just like memory_region_size()
>> does.
>
> Hi, Bandan,
>
> In case you missed this:
>
> https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg02865.html

Thank you Peter, I wasn't aware. But unfortunately, I don't think he's doing 
the right
thing either when handing @end ? Oh well, it's a RFC :) 

> -- peterx

Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages

2016-03-22 Thread Dr. David Alan Gilbert

* Liang Li (liang.z...@intel.com) wrote:
> I have sent the RFC version patch set for live migration optimization
> by skipping processing the free pages in the ram bulk stage and
> received a lot of comments. The related threads can be found at:

Thanks!


> Obviously, the virtio-balloon mechanism has a bigger performance
> impact to the guest than the way we are trying to implement.

Yeh, we should separately try and fix that; if it's that slow then
people will be annoyed about it when they're just using it for balloon.

> 3. Virtio interface
> There are three different ways of using the virtio interface to
> send the free page information.
> a. Extend the current virtio device
> The virtio spec has already defined some virtio devices, and we can
> extend one of these devices so as to use it to transport the free page
> information. It requires modifying the virtio spec.
> 
> b. Implement a new virtio device
> Implementing a brand new virtio device to exchange information
> between host and guest is another choice. It requires modifying the
> virtio spec too.

If the right solution is to change the spec then we should do it;
we shouldn't use a technically worse solution just to avoid the spec
change; although we have to be even more careful to get the right
solution if we want to change the spec.

> c. Make use of virtio-serial (Amit’s suggestion, my choice)
> It’s possible to make use the virtio-serial for communication between
> host and guest, the benefit of this solution is no need to modify the
> virtio spec. 
> 
> 4. Construct free page bitmap
> To minimize the space for saving free page information, it’s better to
> use a bitmap to describe the free pages. There are two ways to
> construct the free page bitmap.
> 
> a. Construct free page bitmap when demand (My choice)
> Guest can allocate memory for the free page bitmap only when it
> receives the request from QEMU, and set the free page bitmap by
> traversing the free page list. The advantage of this way is that it’s
> quite simple and easy to implement. The disadvantage is that the
> traversing operation may consume quite a long time when there are a
> lot of free pages. (About 20ms for 7GB free pages)

I wonder how that scales; 20ms isn't too bad - but I'm more worried about
what happens when someone does it to the 1TB database VM.

> b. Update free page bitmap when allocating/freeing pages 
> Another choice is to allocate the memory for the free page bitmap
> when guest boots, and then update the free page bitmap when
> allocating/freeing pages. It needs more modification to the code
> related to memory management in guest. The advantage of this way is
> that guest can response QEMU’s request for a free page bitmap very
> quickly, no matter how many free pages in the guest. Do the kernel guys
> like this?
> 
> 5. Tighten the free page bitmap
> At last, the free page bitmap should be operated with the
> ramlist.dirty_memory to filter out the free pages. We should make sure
> the bit N in the free page bitmap and the bit N in the
> ramlist.dirty_memory are corresponding to the same guest’s page. 
> Some arch, like X86, there are ‘holes’ in the memory’s physical
> address, which means there are no actual physical RAM pages
> corresponding to some PFNs. So, some arch specific information is
> needed to construct a proper free page bitmap.
> 
> migration dirty page bitmap:
> -
> |a|b|c|d|e|f|g|h|i|j|
> -
> loose free page bitmap:
> -  
> |a|b|c|d|e|f| | | | |g|h|i|j|
> -
> tight free page bitmap:
> -
> |a|b|c|d|e|f|g|h|i|j|
> -
> 
> There are two places for tightening the free page bitmap:
> a. In guest 
> Constructing the free page bitmap in guest requires adding the arch
> related code in guest for building a tight bitmap. The advantage of
> this way is that less memory is needed to store the free page bitmap.
> b. In QEMU (My choice)
> Constructing the free page bitmap in QEMU is more flexible, we can get
> a loose free page bitmap which contains the holes, and then filter out
> the holes in QEMU, the advantage of this way is that we can keep the
> kernel code as simple as we can, the disadvantage is that more memory
> is needed to save the loose free page bitmap. Because this is a mainly
> QEMU feature, if possible, do all the related things in QEMU is
> better.

Yes, maybe; although we'd have to be careful to validate what the guest
fills in makes sense.

> 6. Handling page cache in the guest
> The memory used for page cache in the guest will change depends on the
> workload, if guest run some block IO intensive work load, there will
> be lots of pages used for page cache, only a few free pages are left in
> the guest. In order to get more free pages, we can select to ask guest
> to drop some page caches.  Because dropping the page cache may lead to
> performance degradation

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-22 Thread Bandan Das

Alex Williamson  writes:
...
>> >>mr->size = int128_make64(size);
>> >>if (size == UINT64_MAX) {
>> >>   mr->size = int128_2_64();
>> >>}
>> >> So, end - 1 is still valid for end = UINT64_MAX, no ?  
>> >
>> > int128_2_64() is not equal to UINT64_MAX, so assigning UIN64_MAX to
>> > @end is clearing altering the value.  If we had a range from zero to  
>> 
>> I thought in128_2_64 is the 128 bit representation of UINT64_MAX. The
>> if condition in memory_region_init doesn't make sense otherwise.
>
> 2^64 cannot be represented with a uint64_t, 2^64 - 1 can:
>
> int128_2_64 = 1____h
> UINT64_MAX  =   ___h

Thanks, understood this part. I still don't understand the if condition
in memory_region_init however. Unless, that function actually takes the
last address for the size parameter and in that case, it should be
UINT64_MAX-1 for a size of UINT64_MAX.

>> > int128_2_64() then the size of that region is int128_2_64().  If we
>> > alter @end to be UINT64_MAX, then the size is only UINT64_MAX and @end
>> > - 1 is off by one versus the case where we use the value directly.  
>> 
>> Ok, you mean something like:
>> int128_get64(int128_sub(int128_2_64(), int128_make64(1)));  for (end - 1) ?
>> But we still have to deal with (end - iova) when calling vfio_dmap_map().
>> int128_get64() will definitely assert for iova = 0. 
>
> I don't know that that's the most efficient way to handle it, but @end
> represents a different thing by imposing that -1 and it needs to be
> handled in the reset of the code.
>
>> > You're effectively changing @end to be the last address in the range,  
>> 
>> No, I think I am changing "end" to what we initally started with for size
>> before converting to 128 bit.
>
> Nope, it's the difference between the size of the region and the last
> address of the region.

Ok, but note that it's the "size" that actually asserts here since the
offset is 0. So, we started with a size UINT64_MAX but end with mr->size =
128_2_64().

>> > but only in some cases, and not adjusting the remaining code to match.
>> > Not only that, but the vfio map command is probably going to fail if we
>> > pass in such an unaligned size since the mapping granularity is  
>> 
>> Trying to map such a large region is wrong anyway, I am still trying
>> to workout a solution to avoid calling memory_region_init_iommu()
>> with UINT64_MAX which is what emulated vt-d currently does.
>
> Right, the address width of the IOMMU on x86 is typically nowhere near
> 2^64, so if you take the vfio_dma_map path, you'll surely explode.

And it does. If we fix this assert, then vfio_dma_map() attempts mapping
this direct mapped address range starting from 0 and prints a 
warning message; happens for the whole range and goes on for ever.
The overflow check seemed to me like something we should fix, but now
I am more confused then ever!

> Does this fix actually fix anything or just move us to the next
> assert?  Thanks,
>
> Alex

1 2 3 4 >

1 - 100 of 300 matches

Mail list logo