date:20160321

Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU

2016-03-21 Thread Alexey Kardashevskiy


On 03/22/2016 04:14 PM, David Gibson wrote:

On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:

New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
This adds ability to VFIO common code to dynamically allocate/remove
DMA windows in the host kernel when new VFIO container is added/removed.

This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
and adds just created IOMMU into the host IOMMU list; the opposite
action is taken in vfio_listener_region_del.

When creating a new window, this uses a heuristic to decide on the number of
TCE table levels.

This should cause no guest visible change in behavior.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v14:
* new to the series

---
TODO:
* export levels to PHB
---
  hw/vfio/common.c | 108 ---
  trace-events |   2 ++
  2 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4e873b7..421d6eb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
  return 0;
  }

+static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
+{
+VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 
0x1000);


The hard-coded 0x1000 looks dubious..


Well, that's the minimal page size...





+g_assert(hiommu);
+QLIST_REMOVE(hiommu, hiommu_next);
+}
+
  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
  {
  return (!memory_region_is_ram(section->mr) &&
@@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
  }
  end = int128_get64(llend);

+if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {


I think this would be clearer split out into a helper function,
vfio_create_host_window() or something.



It is rather vfio_spapr_create_host_window() and we were avoiding 
xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a 
separate file but this usually triggers more discussion and never ends well.





+unsigned entries, pages;
+struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) 
};
+
+g_assert(section->mr->iommu_ops);
+g_assert(memory_region_is_iommu(section->mr));


I don't think you need these asserts.  AFAICT the same logic should
work if a RAM MR was added directly to PCI address space - this would
create the new host window, then the existing code for adding a RAM MR
would map that block of RAM statically into the new window.


In what configuration/machine can we do that on SPAPR?



+trace_vfio_listener_region_add_iommu(iova, end - 1);
+/*
+ * FIXME: For VFIO iommu types which have KVM acceleration to
+ * avoid bouncing all map/unmaps through qemu this way, this
+ * would be the right place to wire that up (tell the KVM
+ * device emulation the VFIO iommu handles to use).
+ */
+create.window_size = memory_region_size(section->mr);
+create.page_shift =
+ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));


Ah.. except that I guess you'd need to fall back to host page size
here to handle a RAM MR.


Can you give an example of such RAM MR being added to PCI AS on SPAPR?



+/*
+ * SPAPR host supports multilevel TCE tables, there is some
+ * heuristic to decide how many levels we want for our table:
+ * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+ */
+entries = create.window_size >> create.page_shift;
+pages = (entries * sizeof(uint64_t)) / getpagesize();
+create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
+
+ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, );
+if (ret) {
+error_report("Failed to create a window, ret = %d (%m)", ret);
+goto fail;
+}
+
+if (create.start_addr != section->offset_within_address_space ||
+vfio_host_iommu_lookup(container, create.start_addr,
+   create.start_addr + create.window_size - 
1)) {


Under what circumstances can this trigger?  Is the kernel ioctl
allowed to return a different window start address than the one
requested?


You already asked this some time ago :) Userspace cannot request an
address; the host kernel returns one.




The second check looks very strange - if it returns true doesn't that
mean you *do* have a host window which can accommodate this guest region,
which is what you want?


This should not happen; that is what this check is for. I can make it an
assert() or something like that.






+struct vfio_iommu_spapr_tce_remove remove = {
+.argsz = sizeof(remove),
+.start_addr = create.start_addr
+};
+error_report("Host doesn't support DMA window at

Re: [Qemu-devel] [PATCH v1] migration: skip sending ram pages released by virtio-balloon driver.

2016-03-21 Thread Jitendra Kolhe

On 3/18/2016 4:57 PM, Roman Kagan wrote:
> [ Sorry I've lost this thread with email setup changes on my side;
> catching up ]
> 
> On Tue, Mar 15, 2016 at 06:50:45PM +0530, Jitendra Kolhe wrote:
>> On 3/11/2016 8:09 PM, Jitendra Kolhe wrote:
>>> Here is what
>>> I tried, let’s say we have 3 versions of qemu (below timings are for
>>> 16GB idle guest with 12GB ballooned out)
>>>
>>> v1. Unmodified qemu – absolutely no code change – Total Migration time
>>> = ~7600ms (I rounded this one to ~8000ms)
>>> v2. Modified qemu 1 – with proposed patch set (which skips both zero
>>> pages scan and migrating control information for ballooned out pages) -
>>> Total Migration time = ~5700ms
>>> v3. Modified qemu 2 – only with changes to save_zero_page() as discussed
>>> in previous mail (and of course using proposed patch set only to
>>> maintain bitmap for ballooned out pages) – Total migration time is
>>> irrelevant in this case.
>>> Total Zero page scan time = ~1789ms
>>> Total (save_page_header + qemu_put_byte(f, 0)) = ~556ms.
>>> Everything seems to add up here (may not be exact) – 5700+1789+559 =
>>> ~8000ms
>>>
>>> I see 2 factors that we have not considered in this add up a. overhead
>>> for migrating balloon bitmap to target and b. as you mentioned below
>>> overhead of qemu_clock_get_ns().
>>
>> Missed one more factor of testing each page against balloon bitmap during
>> migration, which is consuming around ~320ms for same configuration. If we
>> remove this overhead which is introduced by proposed patch set from above
>> calculation we almost get total migration time for unmodified qemu
>> (5700-320+1789+559=~7700ms)

Thanks for your response. Just to clarify my understanding first: by
"protocol", do you mean saving or sending the header or control information
per page during migration?
I am drafting my response below based on this assumption.

> 
> I'm a bit lost in the numbers you quote, so let me try with
> back-of-the-envelope calculation.
> 
> First off, the way you identify pages that don't need to be sent is
> basically orthogonal to how you optimize the protocol to send them.  So
> teaching is_zero_range() to consult unmapped or ballooned out page map
> looks like a low-hanging fruit that may benefit the migration time by
> avoiding scanning the memory, without protocol changes. 

Yes, the intention of proposed patch is not to optimize existing
protocol, which is used to send control or header information during migration.
Changes only to is_zero_range() should still show benefit in migration time.

> [And vice versa,
> if sending the zero pages bitmap brought so big benefit it would make
> sense to apply it to pages found by scanning, too].
> 

I am not sure we would see much benefit from this. With the timings we are
seeing, the difference between the time to test against a bitmap and the
time to send control or header information is not large.
With the proposed patch we are spending time testing against the bitmap
anyway, in order to avoid the zero page scan.

> Now regarding the protocol:
> 
>  - as a first approximation, let's speak in terms of transferred data
>size
> 
>  - consider a VM using 1/10 of its memory (I think this can be
>considered an extreme of over-provisioning)
> 
>  - a whiteout is 3 decimal orders smaller than a page, so with zero
>pages replaced by whiteouts (current protocol) the overall
>transferred data size for zero pages is on the order of a percent of
>the total transferred data size
> 
>  - zero page bitmap would reduce that further by a couple of orders
> 
> So, if this calculation is not totally off, extending the protocol to
> use zero page bitmaps is unlikely to give an improvement at more than a
> percent level.
> 
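As a rough check of the estimate quoted above, here is a minimal C sketch. It assumes, as the quoted text does, a guest using 1/10 of its memory and a whiteout about 1000 times smaller than a page. The function name `whiteout_share` is made up for illustration and is not part of QEMU.

```c
#include <assert.h>

/*
 * Share of the transferred bytes that is spent on zero-page markers
 * ("whiteouts") under the current protocol.
 *
 * used_fraction:          fraction of guest pages that carry real data
 * whiteout_to_page_ratio: size of one whiteout divided by the page size
 *
 * Everything is expressed in units of "one page per guest page", so the
 * absolute guest size cancels out.
 */
static double whiteout_share(double used_fraction, double whiteout_to_page_ratio)
{
    double data = used_fraction;
    double markers = (1.0 - used_fraction) * whiteout_to_page_ratio;

    return markers / (data + markers);
}
```

With `used_fraction = 0.1` and `whiteout_to_page_ratio = 0.001` the markers come to a little under 1% of the transferred bytes, which agrees with the "order of a percent" figure in the quoted estimate.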

I agree that the current protocol has already reduced the total transferred
data size to less than a percent compared to actually sending the zero pages.
But here we are talking about reducing it even further by not sending control
or header information.
On my test setup the average zero page scan time for 12GB of zero pages
is around 1789ms and the time taken to send header or control information is
around 559ms for the same 12GB of zero pages, which is around 30% of the
zero page scan time.
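The arithmetic behind these figures can be replayed with a few lines of C. This is only a sketch: the constants are the measurements quoted in this thread (16GB idle guest, 12GB ballooned out), and the helper names are invented for illustration.

```c
#include <assert.h>

/* All times in milliseconds, as quoted in this thread. */
enum {
    PATCHED_TOTAL_MS = 5700, /* total migration time with the patch set */
    ZERO_SCAN_MS     = 1789, /* zero page scan time for 12GB            */
    HEADER_SEND_MS   = 559,  /* header/control information for 12GB     */
    BITMAP_TEST_MS   = 320   /* per-page balloon bitmap test overhead   */
};

/*
 * Rebuild the expected total for unmodified qemu by adding back the work
 * the patch set skips. When keep_bitmap_cost is 0, the bitmap test time
 * (which only exists with the patch set) is subtracted first.
 */
static int rebuilt_unmodified_total_ms(int keep_bitmap_cost)
{
    int total = PATCHED_TOTAL_MS + ZERO_SCAN_MS + HEADER_SEND_MS;

    if (!keep_bitmap_cost) {
        total -= BITMAP_TEST_MS;
    }
    return total;
}

/* Header/control send time as a percentage of the zero page scan time. */
static double header_vs_scan_percent(void)
{
    return 100.0 * HEADER_SEND_MS / ZERO_SCAN_MS;
}
```

The two totals are 8048ms and 7728ms, which correspond to the ~8000ms and ~7700ms figures quoted earlier, and the header share works out to roughly 31% of the scan time.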

I think the point here is, should we consider ballooned out pages as guest
pages and treat them as any other guest ram pages so we expect existing
protocol to take care of them or should we treat them as non guest ram pages
in which case, it may be fine to skip standard protocol?
Note, proposed patch is only focused on ballooned out pages which is a
subset of guest zero page set.

> I'm not sure it pays off the extra code paths and incompatible protocol
> changes...
> 
> Roman.
> 

If skipping sending control or header information for “only” ballooned out
pages raises doubt about protocol compatibility then, yes I agree it’s not
worth the gain we see. We can still localize solution to is_zero_range() 
scan and avoid scanning for zero pages.

Thanks,
- Jitendra

Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
> This adds ability to VFIO common code to dynamically allocate/remove
> DMA windows in the host kernel when new VFIO container is added/removed.
> 
> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
> and adds just created IOMMU into the host IOMMU list; the opposite
> action is taken in vfio_listener_region_del.
> 
> When creating a new window, this uses a heuristic to decide on the number of
> TCE table levels.
> 
> This should cause no guest visible change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v14:
> * new to the series
> 
> ---
> TODO:
> * export levels to PHB
> ---
>  hw/vfio/common.c | 108 
> ---
>  trace-events |   2 ++
>  2 files changed, 105 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 4e873b7..421d6eb 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
>  return 0;
>  }
>  
> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
> +{
> +VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 
> 0x1000);

The hard-coded 0x1000 looks dubious..

> +g_assert(hiommu);
> +QLIST_REMOVE(hiommu, hiommu_next);
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>  return (!memory_region_is_ram(section->mr) &&
> @@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  }
>  end = int128_get64(llend);
>  
> +if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {

I think this would be clearer split out into a helper function,
vfio_create_host_window() or something.

> +unsigned entries, pages;
> +struct vfio_iommu_spapr_tce_create create = { .argsz = 
> sizeof(create) };
> +
> +g_assert(section->mr->iommu_ops);
> +g_assert(memory_region_is_iommu(section->mr));

I don't think you need these asserts.  AFAICT the same logic should
work if a RAM MR was added directly to PCI address space - this would
create the new host window, then the existing code for adding a RAM MR
would map that block of RAM statically into the new window.

> +trace_vfio_listener_region_add_iommu(iova, end - 1);
> +/*
> + * FIXME: For VFIO iommu types which have KVM acceleration to
> + * avoid bouncing all map/unmaps through qemu this way, this
> + * would be the right place to wire that up (tell the KVM
> + * device emulation the VFIO iommu handles to use).
> + */
> +create.window_size = memory_region_size(section->mr);
> +create.page_shift =
> +ctz64(section->mr->iommu_ops->get_page_sizes(section->mr));

Ah.. except that I guess you'd need to fall back to host page size
here to handle a RAM MR.
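A minimal sketch of the fallback suggested here, under stated assumptions: the page-size mask is treated the way the quoted hunk treats the result of get_page_sizes() (smallest supported size = lowest set bit), and `window_page_shift` and `count_trailing_zeros64` are illustrative names, not QEMU functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable trailing-zero count; returns 64 for an input of 0. */
static unsigned count_trailing_zeros64(uint64_t v)
{
    unsigned n = 0;

    if (v == 0) {
        return 64;
    }
    while (!(v & 1)) {
        v >>= 1;
        n++;
    }
    return n;
}

/*
 * Pick the page shift for a new host window.
 * - IOMMU region: smallest page size advertised in iommu_page_sizes.
 * - Plain RAM region: fall back to the host page size.
 */
static unsigned window_page_shift(bool is_iommu, uint64_t iommu_page_sizes,
                                  uint64_t host_page_size)
{
    if (is_iommu) {
        return count_trailing_zeros64(iommu_page_sizes);
    }
    return count_trailing_zeros64(host_page_size);
}
```

For example, an IOMMU region advertising 4K and 64K pages would get a shift of 12, while a RAM region on a host with 64K pages would get 16.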

> +/*
> + * SPAPR host supports multilevel TCE tables, there is some
> + * heuristic to decide how many levels we want for our table:
> + * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> + */
> +entries = create.window_size >> create.page_shift;
> +pages = (entries * sizeof(uint64_t)) / getpagesize();
> +create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1;
> +
> +ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, );
> +if (ret) {
> +error_report("Failed to create a window, ret = %d (%m)", ret);
> +goto fail;
> +}
> +
> +if (create.start_addr != section->offset_within_address_space ||
> +vfio_host_iommu_lookup(container, create.start_addr,
> +   create.start_addr + create.window_size - 
> 1)) {

Under what circumstances can this trigger?  Is the kernel ioctl
allowed to return a different window start address than the one
requested?

The second check looks very strange - if it returns true doesn't that
mean you *do* have a host window which can accommodate this guest region,
which is what you want?
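The level table in the quoted comment (0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4) can be written out as a small standalone C sketch. It follows the table rather than the quoted expression: read literally, `pow2ceil(pages) - 1` is odd for any power of two greater than 1, so its trailing-zero count is 0 and the expression appears not to produce the table. The helper names are hypothetical, `round_up_pow2` is a local stand-in for QEMU's pow2ceil(), and capping at 4 is an assumption taken from the open-ended last bucket.

```c
#include <assert.h>
#include <stdint.h>

/* Round up to the next power of two; 0 and 1 both map to 1. */
static uint64_t round_up_pow2(uint64_t v)
{
    uint64_t p = 1;

    while (p < v && p < (UINT64_C(1) << 63)) {
        p <<= 1;
    }
    return p;
}

/* log2 of a value that is already a power of two. */
static unsigned log2_of_pow2(uint64_t p)
{
    unsigned n = 0;

    while (p > 1) {
        p >>= 1;
        n++;
    }
    return n;
}

/* Levels for a TCE table occupying `pages` host pages: one per 6 bits. */
static unsigned tce_levels_for_pages(uint64_t pages)
{
    unsigned bits = log2_of_pow2(round_up_pow2(pages));
    unsigned levels = bits == 0 ? 1 : (bits - 1) / 6 + 1;

    return levels > 4 ? 4 : levels;
}

/* Same computation starting from the window geometry, as in the hunk. */
static unsigned tce_levels(uint64_t window_size, unsigned page_shift,
                           uint64_t host_page_size)
{
    uint64_t entries = window_size >> page_shift;
    uint64_t pages = entries * sizeof(uint64_t) / host_page_size;

    return tce_levels_for_pages(pages);
}
```

For a 1GB window of 4K IOMMU pages, this gives one level on a host with 64K pages (32 table pages) and two levels on a host with 4K pages (512 table pages).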

> +struct vfio_iommu_spapr_tce_remove remove = {
> +.argsz = sizeof(remove),
> +.start_addr = create.start_addr
> +};
> +error_report("Host doesn't support DMA window at %"HWADDR_PRIx", 
> must be %"PRIx64,
> + section->offset_within_address_space,
> + create.start_addr);
> +ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_REMOVE, );
> +ret = -EINVAL;
> +goto fail;
> +}
> +trace_vfio_spapr_create_window(create.page_shift,
> +   create.window_size,
> +   create.start_addr);
> +
> +

Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address

2016-03-21 Thread David Gibson

On Tue, Mar 22, 2016 at 03:28:52PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 02:26 PM, David Gibson wrote:
> >On Tue, Mar 22, 2016 at 02:12:30PM +1100, Alexey Kardashevskiy wrote:
> >>On 03/22/2016 11:49 AM, David Gibson wrote:
> >>>On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
> Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
> when new VFIO listener is added, all existing IOMMU mappings are
> replayed. However there is a problem that the base address of
> an IOMMU memory region (IOMMU MR) is ignored which is not a problem
> for the existing user (which is pseries) with its default 32bit DMA
> window starting at 0 but it is if there is another DMA window.
> 
> This stores the IOMMU's offset_within_address_space and adjusts
> the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> 
> As the IOMMU notifier expects IOVA offset rather than the absolute
> address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
> calling notifier(s).
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: David Gibson 
> >>>
> >>>On a closer look, I realised this still isn't quite correct, although
> >>>I don't think any cases which would break it exist or are planned.
> >>>
> ---
>   hw/ppc/spapr_iommu.c  |  2 +-
>   hw/vfio/common.c  | 14 --
>   include/hw/vfio/vfio-common.h |  1 +
>   3 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 7dd4588..277f289 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, 
> target_ulong ioba,
>   tcet->table[index] = tce;
> 
>   entry.target_as = _space_memory,
> -entry.iova = ioba & page_mask;
> +entry.iova = (ioba - tcet->bus_offset) & page_mask;
>   entry.translated_addr = tce & page_mask;
>   entry.addr_mask = ~page_mask;
>   entry.perm = spapr_tce_iommu_access_flags(tce);
> >>>
> >>>This bit's right.
> >>>
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fb588d8..d45e2db 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void 
> *data)
>   VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>   VFIOContainer *container = giommu->container;
>   IOMMUTLBEntry *iotlb = data;
> +hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
> >>>
> >>>This bit might be right, depending on how you define 
> >>>giommu->offset_within_address_space.
> >>>
>   MemoryRegion *mr;
>   hwaddr xlat;
>   hwaddr len = iotlb->addr_mask + 1;
>   void *vaddr;
>   int ret;
> 
> -trace_vfio_iommu_map_notify(iotlb->iova,
> -iotlb->iova + iotlb->addr_mask);
> +trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
> 
>   /*
>    * The IOMMU TLB entry we have just covers translation through
> @@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void 
> *data)
> 
>   if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>   vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -ret = vfio_dma_map(container, iotlb->iova,
> +ret = vfio_dma_map(container, iova,
>  iotlb->addr_mask + 1, vaddr,
>  !(iotlb->perm & IOMMU_WO) || mr->readonly);
>   if (ret) {
>   error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>    "0x%"HWADDR_PRIx", %p) = %d (%m)",
> - container, iotlb->iova,
> + container, iova,
>    iotlb->addr_mask + 1, vaddr, ret);
>   }
>   } else {
> -ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 
> 1);
> +ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>   if (ret) {
>   error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>    "0x%"HWADDR_PRIx") = %d (%m)",
> - container, iotlb->iova,
> + container, iova,
>    iotlb->addr_mask + 1, ret);
>   }
>   }
> >>>
> >>>This is fine.
> >>>
> @@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>    */
>   giommu = g_malloc0(sizeof(*giommu));
>   giommu->iommu = section->mr;
> +giommu->offset_within_address_space =
> +

Re: [Qemu-devel] [sheepdog] [PATCH v2] block/sheepdog: add error handling to sd_snapshot_delete()

2016-03-21 Thread Hitoshi Mitake

On Tue, Mar 22, 2016 at 1:33 PM, Takashi Menjo 
wrote:

> Errors have been ignored or not propagated in some code paths
> in sd_snapshot_delete(). This patch adds error handling.
>
> Cc: Hitoshi Mitake 
> Cc: Jeff Cody 
> Cc: Vasiliy Tolstov 
> Cc: sheep...@lists.wpkg.org
>
> Signed-off-by: Takashi Menjo 
> ---
>  block/sheepdog.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>

Looks good to me.
Reviewed-by: Hitoshi Mitake 

Vasiliy, could you test it on your environment if you have time?

Thanks,
Hitoshi


>
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index a3aeae4..39d13d2 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -2565,6 +2565,7 @@ static int sd_snapshot_delete(BlockDriverState *bs,
>  SheepdogVdiRsp *rsp = (SheepdogVdiRsp *)
>
>  if (!remove_objects(s)) {
> +error_setg(errp, "failed to discard snapshot inode");
>  return -1;
>  }
>
> @@ -2588,12 +2589,13 @@ static int sd_snapshot_delete(BlockDriverState *bs,
>  ret = find_vdi_name(s, s->name, snap_id, snap_tag, , true,
>  _err);
>  if (ret) {
> +error_propagate(errp, local_err);
>  return ret;
>  }
>
>  fd = connect_to_sdog(s, _err);
>  if (fd < 0) {
> -error_report_err(local_err);
> +error_propagate(errp, local_err);
>  return -1;
>  }
>
> @@ -2601,16 +2603,17 @@ static int sd_snapshot_delete(BlockDriverState *bs,
>   buf, , );
>  closesocket(fd);
>  if (ret) {
> +error_setg_errno(errp, -ret, "failed to delete %s", s->name);
>  return ret;
>  }
>
>  switch (rsp->result) {
>  case SD_RES_NO_VDI:
> -error_report("%s was already deleted", s->name);
> +error_setg(errp, "%s was already deleted", s->name);
>  case SD_RES_SUCCESS:
>  break;
>  default:
> -error_report("%s, %s", sd_strerror(rsp->result), s->name);
> +error_setg(errp, "%s, %s", sd_strerror(rsp->result), s->name);
>  return -1;
>  }
>
> --
> 2.7.4.windows.1
>
>
>
> --
> sheepdog mailing list
> sheep...@lists.wpkg.org
> https://lists.wpkg.org/mailman/listinfo/sheepdog
>

Re: [Qemu-devel] [PATCH qemu v14 16/18] spapr_iommu, vfio, memory: Notify IOMMU about starting/stopping being used by VFIO

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:47:04PM +1100, Alexey Kardashevskiy wrote:
> The sPAPR TCE tables manage 2 copies when VFIO is using an IOMMU -
> a guest view of the table and a hardware TCE table. If there is no VFIO
> presence in the address space, then just the guest view is used, if
> this is the case, it is allocated in the KVM. However since there is no
> support yet for VFIO in KVM TCE hypercalls, when we start using VFIO,
> we need to move the guest view from KVM to the userspace; and we need
> to do this for every IOMMU on a bus with VFIO devices.
> 
> This adds vfio_start/vfio_stop callbacks in MemoryRegionIOMMUOps to
> notify IOMMU about changing environment so it can reallocate the table
> to/from KVM or (when available) hook the IOMMU groups with the logical
> bus (LIOBN) in the KVM.
> 
> This removes explicit spapr_tce_set_need_vfio() call from PCI hotplug
> path as the new callbacks do this better - they notify IOMMU at
> the exact moment when the configuration is changed, and this also
> includes the case of PCI hot unplug.
> 
> TODO: split into 2 or 3 patches, per maintainership area.
> 
> Signed-off-by: Alexey Kardashevskiy 

I'm finding this one much easier to follow than the previous revision.

> ---
>  hw/ppc/spapr_iommu.c  | 12 
>  hw/ppc/spapr_pci.c|  6 --
>  hw/vfio/common.c  |  9 +
>  include/exec/memory.h |  4 
>  4 files changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 6dc3c45..702075d 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -151,6 +151,16 @@ static uint64_t spapr_tce_get_page_sizes(MemoryRegion 
> *iommu)
>  return 1ULL << tcet->page_shift;
>  }
>  
> +static void spapr_tce_vfio_start(MemoryRegion *iommu)
> +{
> +spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), true);
> +}
> +
> +static void spapr_tce_vfio_stop(MemoryRegion *iommu)
> +{
> +spapr_tce_set_need_vfio(container_of(iommu, sPAPRTCETable, iommu), 
> false);
> +}

Wonder if a single callback which takes a boolean might be a little
less clunky.
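What a single boolean callback might look like, as a standalone sketch. None of these names exist in QEMU: `iommu_region` stands in for MemoryRegion, `vfio_notify` is a hypothetical replacement for the vfio_start/vfio_stop pair, and `sketch_set_need_vfio` only mimics the role of spapr_tce_set_need_vfio().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct iommu_region; /* hypothetical stand-in for MemoryRegion */

struct iommu_ops_sketch {
    /* One hook instead of two: attached is true on start, false on stop. */
    void (*vfio_notify)(struct iommu_region *iommu, bool attached);
};

struct iommu_region {
    const struct iommu_ops_sketch *ops;
    bool need_vfio; /* state the IOMMU implementation would switch on */
};

/* Example implementation of the hook for one IOMMU type. */
static void sketch_set_need_vfio(struct iommu_region *iommu, bool attached)
{
    iommu->need_vfio = attached;
}

/* Caller side: the same NULL checks as in the quoted hunk, written once. */
static void iommu_notify_vfio(struct iommu_region *iommu, bool attached)
{
    if (iommu->ops && iommu->ops->vfio_notify) {
        iommu->ops->vfio_notify(iommu, attached);
    }
}
```

The listener would then call `iommu_notify_vfio(region, true)` in its region_add path and `iommu_notify_vfio(region, false)` in region_del, and each IOMMU implementation would register a single function.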

>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>  
> @@ -211,6 +221,8 @@ static const VMStateDescription vmstate_spapr_tce_table = 
> {
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>  .translate = spapr_tce_translate_iommu,
>  .get_page_sizes = spapr_tce_get_page_sizes,
> +.vfio_start = spapr_tce_vfio_start,
> +.vfio_stop = spapr_tce_vfio_stop,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index bfcafdf..af99a36 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1121,12 +1121,6 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector 
> *drc,
>  void *fdt = NULL;
>  int fdt_start_offset = 0, fdt_size;
>  
> -if (object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> -sPAPRTCETable *tcet = spapr_tce_find_by_liobn(phb->dma_liobn);
> -
> -spapr_tce_set_need_vfio(tcet, true);
> -}
> -
>  if (dev->hotplugged) {
>  fdt = create_device_tree(_size);
>  fdt_start_offset = spapr_create_pci_child_dt(phb, pdev, fdt, 0);
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index b257655..4e873b7 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -421,6 +421,9 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  QLIST_INSERT_HEAD(>giommu_list, giommu, giommu_next);
>  
>  memory_region_register_iommu_notifier(giommu->iommu, >n);
> +if (section->mr->iommu_ops && section->mr->iommu_ops->vfio_start) {
> +section->mr->iommu_ops->vfio_start(section->mr);
> +}
>  memory_region_iommu_replay(giommu->iommu, >n,
> false);
>  
> @@ -466,6 +469,7 @@ static void vfio_listener_region_del(MemoryListener 
> *listener,
>  VFIOContainer *container = container_of(listener, VFIOContainer, 
> listener);
>  hwaddr iova, end;
>  int ret;
> +MemoryRegion *iommu = NULL;
>  
>  if (vfio_listener_skipped_section(section)) {
>  trace_vfio_listener_region_del_skip(
> @@ -487,6 +491,7 @@ static void vfio_listener_region_del(MemoryListener 
> *listener,
>  QLIST_FOREACH(giommu, >giommu_list, giommu_next) {
>  if (giommu->iommu == section->mr) {
>  memory_region_unregister_iommu_notifier(>n);
> +iommu = giommu->iommu;
>  QLIST_REMOVE(giommu, giommu_next);
>  g_free(giommu);
>  break;
> @@ -519,6 +524,10 @@ static void vfio_listener_region_del(MemoryListener 
> *listener,
>   "0x%"HWADDR_PRIx") = %d (%m)",
>   container, iova, end - iova, ret);
>  }
> +
> +if (iommu && iommu->iommu_ops && iommu->iommu_ops->vfio_stop) {
> +

Re: [Qemu-devel] Query about BiteSizeTasks

2016-03-21 Thread haris iqbal

Hello,

One more thing I noticed was that there is no function prototype for
the function get_ticks_per_sec().

I was planning to use a script to change all the uses of
get_ticks_per_sec() to NANOSECONDS_PER_SECOND, with the exception of the
function definition itself in include/qemu/timer.h. So I wanted to know
if there is a function prototype that should also be skipped.

On Mon, Mar 21, 2016 at 7:39 PM, haris iqbal  wrote:
> On Mon, Mar 21, 2016 at 7:26 PM, Paolo Bonzini  wrote:
>>
>>
>> On 21/03/2016 14:41, haris iqbal wrote:
>>> Hello,
>>>
>>> I saw a task which says to "Add checks for NULL return value to uses
>>> of load_image_targphys,...". But from what I saw in the codebase,
>>> load_image_targphys() returns int. So why should it be checked for
>>> NULL?
>>
>> You're right, in the case of load_image_targphys the error value is -1.
>>  I've fixed the wiki page.
>
>
> Thanks.
>
>>
>> Paolo
>
>
>
> --
>
> With regards,
>
> Md Haris Iqbal,
> Placement Coordinator, MTech IT
> NITK Surathkal,
> Contact: +91 8861996962

-- 

With regards,

Md Haris Iqbal,
Placement Coordinator, MTech IT
NITK Surathkal,
Contact: +91 8861996962

[Qemu-devel] [PATCH v2] block/sheepdog: add error handling to sd_snapshot_delete()

2016-03-21 Thread Takashi Menjo

Errors have been ignored or not propagated in some code paths
in sd_snapshot_delete(). This patch adds error handling.

Cc: Hitoshi Mitake 
Cc: Jeff Cody 
Cc: Vasiliy Tolstov 
Cc: sheep...@lists.wpkg.org

Signed-off-by: Takashi Menjo 
---
 block/sheepdog.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index a3aeae4..39d13d2 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -2565,6 +2565,7 @@ static int sd_snapshot_delete(BlockDriverState *bs,
 SheepdogVdiRsp *rsp = (SheepdogVdiRsp *)
 
 if (!remove_objects(s)) {
+error_setg(errp, "failed to discard snapshot inode");
 return -1;
 }
 
@@ -2588,12 +2589,13 @@ static int sd_snapshot_delete(BlockDriverState *bs,
 ret = find_vdi_name(s, s->name, snap_id, snap_tag, , true,
 _err);
 if (ret) {
+error_propagate(errp, local_err);
 return ret;
 }
 
 fd = connect_to_sdog(s, _err);
 if (fd < 0) {
-error_report_err(local_err);
+error_propagate(errp, local_err);
 return -1;
 }
 
@@ -2601,16 +2603,17 @@ static int sd_snapshot_delete(BlockDriverState *bs,
  buf, , );
 closesocket(fd);
 if (ret) {
+error_setg_errno(errp, -ret, "failed to delete %s", s->name);
 return ret;
 }
 
 switch (rsp->result) {
 case SD_RES_NO_VDI:
-error_report("%s was already deleted", s->name);
+error_setg(errp, "%s was already deleted", s->name);
 case SD_RES_SUCCESS:
 break;
 default:
-error_report("%s, %s", sd_strerror(rsp->result), s->name);
+error_setg(errp, "%s, %s", sd_strerror(rsp->result), s->name);
 return -1;
 }
 
-- 
2.7.4.windows.1
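The pattern this patch applies (hand the error to the caller through errp instead of printing it locally) can be illustrated with a self-contained C sketch. The `Error` type and the `sketch_*` helpers below are simplified stand-ins written for this example; they only imitate the general shape of QEMU's error_setg() and error_propagate() and are not the real API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Error {
    char msg[128];
} Error;

/* Store a new error in *errp; a NULL errp means the caller ignores errors. */
static void sketch_error_set(Error **errp, const char *msg)
{
    if (!errp) {
        return;
    }
    assert(*errp == NULL);
    *errp = calloc(1, sizeof(**errp));
    assert(*errp != NULL);
    snprintf((*errp)->msg, sizeof((*errp)->msg), "%s", msg);
}

/* Hand a locally collected error to the caller, or drop it if unwanted. */
static void sketch_error_propagate(Error **dst, Error *local)
{
    if (!local) {
        return;
    }
    if (dst && !*dst) {
        *dst = local;
    } else {
        free(local);
    }
}

/* Stand-in for a helper that reports failure through its errp argument. */
static int sketch_connect(int fail, Error **errp)
{
    if (fail) {
        sketch_error_set(errp, "connect failed");
        return -1;
    }
    return 3; /* pretend file descriptor */
}

/* Caller: collects the helper's error locally, then propagates it. */
static int sketch_snapshot_delete(int fail_connect, Error **errp)
{
    Error *local_err = NULL;
    int fd = sketch_connect(fail_connect, &local_err);

    if (fd < 0) {
        sketch_error_propagate(errp, local_err);
        return -1;
    }
    return 0;
}
```

The design point is that the function no longer decides how the error is shown: the caller receives both the return code and the message and can report or discard them as it sees fit.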

Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address

2016-03-21 Thread Alexey Kardashevskiy


On 03/22/2016 02:26 PM, David Gibson wrote:

On Tue, Mar 22, 2016 at 02:12:30PM +1100, Alexey Kardashevskiy wrote:

On 03/22/2016 11:49 AM, David Gibson wrote:

On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:

Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
when new VFIO listener is added, all existing IOMMU mappings are
replayed. However there is a problem that the base address of
an IOMMU memory region (IOMMU MR) is ignored which is not a problem
for the existing user (which is pseries) with its default 32bit DMA
window starting at 0 but it is if there is another DMA window.

This stores the IOMMU's offset_within_address_space and adjusts
the IOVA before calling vfio_dma_map/vfio_dma_unmap.

As the IOMMU notifier expects IOVA offset rather than the absolute
address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
calling notifier(s).

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 


On a closer look, I realised this still isn't quite correct, although
I don't think any cases which would break it exist or are planned.


---
  hw/ppc/spapr_iommu.c  |  2 +-
  hw/vfio/common.c  | 14 --
  include/hw/vfio/vfio-common.h |  1 +
  3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 7dd4588..277f289 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, 
target_ulong ioba,
  tcet->table[index] = tce;

  entry.target_as = _space_memory,
-entry.iova = ioba & page_mask;
+entry.iova = (ioba - tcet->bus_offset) & page_mask;
  entry.translated_addr = tce & page_mask;
  entry.addr_mask = ~page_mask;
  entry.perm = spapr_tce_iommu_access_flags(tce);


This bit's right.


diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fb588d8..d45e2db 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
  VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
  VFIOContainer *container = giommu->container;
  IOMMUTLBEntry *iotlb = data;
+hwaddr iova = iotlb->iova + giommu->offset_within_address_space;


This bit might be right, depending on how you define 
giommu->offset_within_address_space.


  MemoryRegion *mr;
  hwaddr xlat;
  hwaddr len = iotlb->addr_mask + 1;
  void *vaddr;
  int ret;

-trace_vfio_iommu_map_notify(iotlb->iova,
-iotlb->iova + iotlb->addr_mask);
+trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);

  /*
   * The IOMMU TLB entry we have just covers translation through
@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)

  if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
  vaddr = memory_region_get_ram_ptr(mr) + xlat;
-ret = vfio_dma_map(container, iotlb->iova,
+ret = vfio_dma_map(container, iova,
 iotlb->addr_mask + 1, vaddr,
 !(iotlb->perm & IOMMU_WO) || mr->readonly);
  if (ret) {
  error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
   "0x%"HWADDR_PRIx", %p) = %d (%m)",
- container, iotlb->iova,
+ container, iova,
   iotlb->addr_mask + 1, vaddr, ret);
  }
  } else {
-ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
+ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
  if (ret) {
  error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
   "0x%"HWADDR_PRIx") = %d (%m)",
- container, iotlb->iova,
+ container, iova,
   iotlb->addr_mask + 1, ret);
  }
  }


This is fine.


@@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
   */
  giommu = g_malloc0(sizeof(*giommu));
  giommu->iommu = section->mr;
+giommu->offset_within_address_space =
+section->offset_within_address_space;


But here there's a problem.  The iova in IOMMUTLBEntry is relative to
the IOMMU MemoryRegion, but - at least in theory - only a subsection
of that MemoryRegion could be mapped into the AddressSpace.


But the IOMMU MR stays the same - size, offset, and iova will be relative to
its start, why does it matter if only portion is mapped?


Because the portion mapped may not sit at the start of the MR.  For
example if you had a 2G MR, and the second half is mapped at address 0
in the AS,


My imagination fails here. How could you do this in practice?

address_space_init(, )
memory_region_init(, 2GB)
memory_region_add_subregion(, -1GB, )

But offsets are unsigned.

In general, how to map
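
To make the offset question concrete, here is a minimal sketch of the arithmetic, not the patch's code. It assumes the IOVA is relative to the start of the IOMMU MR, as stated above; `SectionSketch` and `iova_to_as_addr()` are hypothetical names that only mirror the two MemoryRegionSection offsets under discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the MemoryRegionSection fields discussed. */
typedef struct {
    uint64_t offset_within_region;        /* where the mapped part starts in the MR */
    uint64_t offset_within_address_space; /* where that part appears in the AS */
    uint64_t size;                        /* length of the mapped part */
} SectionSketch;

/* Convert an MR-relative IOVA into an address-space address. */
static uint64_t iova_to_as_addr(const SectionSketch *s, uint64_t iova)
{
    /* The IOVA must fall inside the part of the MR that is actually mapped. */
    assert(iova >= s->offset_within_region);
    assert(iova - s->offset_within_region < s->size);

    /* Unsigned wraparound keeps this correct when the delta is "negative". */
    return iova + (s->offset_within_address_space - s->offset_within_region);
}
```

For the 2G example (second half mapped at address 0), the section would have offset_within_region = 1G and offset_within_address_space = 0, so an MR-relative IOVA of 1G + 0x1000 lands at 0x1000 in the AS. Adding only offset_within_address_space would leave it at 1G + 0x1000.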

Re: [Qemu-devel] [PATCH v2 1/1] Introduce "xen-load-devices-state"

2016-03-21 Thread Changlong Xie


ping..

On 03/14/2016 04:03 PM, Changlong Xie wrote:

From: Wen Congyang 

Introduce a "xen-load-devices-state" QAPI command that can be used to
load the state of all devices, but not the RAM or the block devices of
the VM.

We only have the hmp commands savevm/loadvm and the qmp command
xen-save-devices-state.

We use this new command for COLO:
1. suspend both primary vm and secondary vm
2. sync the state
3. resume both primary vm and secondary vm

In such a case, we need to update all devices' state at any time.

Signed-off-by: Wen Congyang 
Signed-off-by: Changlong Xie 
---
  migration/savevm.c | 36 
  qapi-schema.json   | 14 ++
  qmp-commands.hx| 27 +++
  3 files changed, 77 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index 96e7db5..aaead12 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -50,6 +50,7 @@
  #include "qemu/iov.h"
  #include "block/snapshot.h"
  #include "block/qapi.h"
+#include "hw/xen/xen.h"


  #ifndef ETH_P_RARP
@@ -1768,6 +1769,12 @@ qemu_loadvm_section_start_full(QEMUFile *f, 
MigrationIncomingState *mis)
  return -EINVAL;
  }

+/* Validate if it is a device's state */
+if (xen_enabled() && se->is_ram) {
+error_report("loadvm: %s RAM loading not allowed on Xen", idstr);
+return -EINVAL;
+}
+
  /* Add entry */
  le = g_malloc0(sizeof(*le));

@@ -2077,6 +2084,35 @@ void qmp_xen_save_devices_state(const char *filename, 
Error **errp)
  }
  }

+void qmp_xen_load_devices_state(const char *filename, Error **errp)
+{
+QEMUFile *f;
+int saved_vm_running;
+int ret;
+
+saved_vm_running = runstate_is_running();
+vm_stop(RUN_STATE_RESTORE_VM);
+
+f = qemu_fopen(filename, "rb");
+if (!f) {
+error_setg_file_open(errp, errno, filename);
+goto out;
+}
+
+migration_incoming_state_new(f);
+ret = qemu_loadvm_state(f);
+qemu_fclose(f);
+migration_incoming_state_destroy();
+if (ret < 0) {
+error_setg(errp, QERR_IO_ERROR);
+}
+
+out:
+if (saved_vm_running) {
+vm_start();
+}
+}
+
  int load_vmstate(const char *name)
  {
  BlockDriverState *bs, *bs_vm_state;
diff --git a/qapi-schema.json b/qapi-schema.json
index 362c9d8..8cca59d 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -4122,3 +4122,17 @@
  ##
  { 'enum': 'ReplayMode',
'data': [ 'none', 'record', 'play' ] }
+
+##
+# @xen-load-devices-state:
+#
+# Load the state of all devices from file. The RAM and the block devices
+# of the VM are not loaded by this command.
+#
+# @filename: the file to load the state of the devices from as binary
+# data. See xen-save-devices-state.txt for a description of the binary
+# format.
+#
+# Since: 2.7
+##
+{ 'command': 'xen-load-devices-state', 'data': {'filename': 'str'} }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index b629673..4925702 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -587,6 +587,33 @@ Example:
  EQMP

  {
+.name   = "xen-load-devices-state",
+.args_type  = "filename:F",
+.mhandler.cmd_new = qmp_marshal_xen_load_devices_state,
+},
+
+SQMP
+xen-load-devices-state
+--
+
+Load the state of all devices from file. The RAM and the block devices
+of the VM are not loaded by this command.
+
+Arguments:
+
+- "filename": the file to load the state of the devices from as binary
+data. See xen-save-devices-state.txt for a description of the binary
+format.
+
+Example:
+
+-> { "execute": "xen-load-devices-state",
+ "arguments": { "filename": "/tmp/resume" } }
+<- { "return": {} }
+
+EQMP
+
+{
  .name   = "xen-set-global-dirty-log",
  .args_type  = "enable:b",
  .mhandler.cmd_new = qmp_marshal_xen_set_global_dirty_log,

Re: [Qemu-devel] [PATCH qemu v14 15/18] vfio: Add host side IOMMU capabilities

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:47:03PM +1100, Alexey Kardashevskiy wrote:
> There are going to be multiple IOMMUs per container. This moves
> the single host IOMMU parameter set to a list of VFIOHostIOMMU.
> 
> This should cause no behavioral change and will be used later by
> the SPAPR TCE IOMMU v2 which will also add a vfio_host_iommu_del() helper.
> 
> Signed-off-by: Alexey Kardashevskiy 

This looks ok except for the name.  Calling each window a separate
"host IOMMU" is misleading.  The different windows the container
supports might be implemented by different IOMMUs on the host side, or
they might be implemented by one IOMMU with multiple tables.

Better to call them host DMA windows, or maybe container DMA windows.
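
As an illustration of the window bookkeeping being reviewed, here is a rough sketch using inclusive bounds. The type and function names are hypothetical (they follow the "host DMA window" naming suggested above) and are not taken from the patch; comparing the bounds directly is one way to avoid computing a length for a window that spans the whole 64-bit space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; both bounds are inclusive. */
typedef struct {
    uint64_t min_iova;
    uint64_t max_iova;
} HostDMAWindowSketch;

/* True when [min, max] lies entirely inside window w. */
static bool window_contains(const HostDMAWindowSketch *w,
                            uint64_t min, uint64_t max)
{
    assert(min <= max);
    return w->min_iova <= min && max <= w->max_iova;
}

/* True when [min, max] shares at least one address with window w. */
static bool window_overlaps(const HostDMAWindowSketch *w,
                            uint64_t min, uint64_t max)
{
    assert(min <= max);
    return !(max < w->min_iova || w->max_iova < min);
}

/* Return the first window that fully contains [min, max], or NULL. */
static const HostDMAWindowSketch *window_lookup(const HostDMAWindowSketch *wins,
                                                size_t n,
                                                uint64_t min, uint64_t max)
{
    for (size_t i = 0; i < n; i++) {
        if (window_contains(&wins[i], min, max)) {
            return &wins[i];
        }
    }
    return NULL;
}
```

A region that straddles the end of a window overlaps it but is not contained in it, so a lookup for that region fails while an attempt to add a window over it would be rejected.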

> ---
>  hw/vfio/common.c  | 65 
> +--
>  include/hw/vfio/vfio-common.h |  9 --
>  2 files changed, 57 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index a8deb16..b257655 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -29,6 +29,7 @@
>  #include "exec/memory.h"
>  #include "hw/hw.h"
>  #include "qemu/error-report.h"
> +#include "qemu/range.h"
>  #include "sysemu/kvm.h"
>  #include "trace.h"
>  
> @@ -239,6 +240,45 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr 
> iova,
>  return -errno;
>  }
>  
> +static VFIOHostIOMMU *vfio_host_iommu_lookup(VFIOContainer *container,
> + hwaddr min_iova, hwaddr 
> max_iova)
> +{
> +VFIOHostIOMMU *hiommu;
> +
> +QLIST_FOREACH(hiommu, &container->hiommu_list, hiommu_next) {
> +if (hiommu->min_iova <= min_iova && max_iova <= hiommu->max_iova) {
> +return hiommu;
> +}
> +}
> +
> +return NULL;
> +}
> +
> +static int vfio_host_iommu_add(VFIOContainer *container,
> +   hwaddr min_iova, hwaddr max_iova,
> +   uint64_t iova_pgsizes)
> +{
> +VFIOHostIOMMU *hiommu;
> +
> +QLIST_FOREACH(hiommu, &container->hiommu_list, hiommu_next) {
> +if (ranges_overlap(min_iova, max_iova - min_iova + 1,
> +   hiommu->min_iova,
> +   hiommu->max_iova - hiommu->min_iova + 1)) {
> +error_report("%s: Overlapped IOMMU are not enabled", __func__);
> +return -1;
> +}
> +}
> +
> +hiommu = g_malloc0(sizeof(*hiommu));
> +
> +hiommu->min_iova = min_iova;
> +hiommu->max_iova = max_iova;
> +hiommu->iova_pgsizes = iova_pgsizes;
> +QLIST_INSERT_HEAD(&container->hiommu_list, hiommu, hiommu_next);
> +
> +return 0;
> +}
> +
>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>  return (!memory_region_is_ram(section->mr) &&
> @@ -352,7 +392,7 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  }
>  end = int128_get64(llend);
>  
> -if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
> +if (!vfio_host_iommu_lookup(container, iova, end - 1)) {
>  error_report("vfio: IOMMU container %p can't map guest IOVA region"
>   " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx,
>   container, iova, end - 1);
> @@ -367,10 +407,6 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  
>  trace_vfio_listener_region_add_iommu(iova, end - 1);
>  /*
> - * FIXME: We should do some checking to see if the
> - * capabilities of the host VFIO IOMMU are adequate to model
> - * the guest IOMMU
> - *
>   * FIXME: For VFIO iommu types which have KVM acceleration to
>   * avoid bouncing all map/unmaps through qemu this way, this
>   * would be the right place to wire that up (tell the KVM
> @@ -818,16 +854,14 @@ static int vfio_connect_container(VFIOGroup *group, 
> AddressSpace *as)
>   * existing Type1 IOMMUs generally support any IOVA we're
>   * going to actually try in practice.
>   */
> -container->min_iova = 0;
> -container->max_iova = (hwaddr)-1;
> -
> -/* Assume just 4K IOVA page size */
> -container->iova_pgsizes = 0x1000;
>  info.argsz = sizeof(info);
>  ret = ioctl(fd, VFIO_IOMMU_GET_INFO, &info);
>  /* Ignore errors */
>  if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
> -container->iova_pgsizes = info.iova_pgsizes;
> +vfio_host_iommu_add(container, 0, (hwaddr)-1, info.iova_pgsizes);
> +} else {
> +/* Assume just 4K IOVA page size */
> +vfio_host_iommu_add(container, 0, (hwaddr)-1, 0x1000);
>  }
>  } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
> @@ -884,11 +918,12 @@ static int vfio_connect_container(VFIOGroup *group, 
> AddressSpace *as)
>  ret = -errno;
>

Re: [Qemu-devel] [PATCH qemu v14 13/18] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:47:01PM +1100, Alexey Kardashevskiy wrote:
> This makes use of the new "memory registering" feature. The idea is
> to provide the userspace ability to notify the host kernel about pages
> which are going to be used for DMA. Having this information, the host
> kernel can pin them all once per user process, do locked pages
> accounting (once) and not spend time on doing that in real time with
> possible failures which cannot be handled nicely in some cases.
> 
> This adds a prereg memory listener which listens on address_space_memory
> and notifies a VFIO container about memory which needs to be
> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.
> 
> As there is no per-IOMMU-type release() callback anymore, this stores
> the IOMMU type in the container so vfio_listener_release() can decide
> if it needs to unregister @prereg_listener.
> 
> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
> not call it when v2 is detected and enabled.
> 
> This does not change the guest visible interface.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v14:
> * s/free_container_exit/listener_release_exit/g
> * added "if memory_region_is_iommu()" to vfio_prereg_listener_skipped_section
> ---
>  hw/vfio/Makefile.objs |   1 +
>  hw/vfio/common.c  |  38 +---
>  hw/vfio/prereg.c  | 137 
> ++
>  include/hw/vfio/vfio-common.h |   4 ++
>  trace-events  |   2 +
>  5 files changed, 172 insertions(+), 10 deletions(-)
>  create mode 100644 hw/vfio/prereg.c
> 
> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
> index ceddbb8..5800e0e 100644
> --- a/hw/vfio/Makefile.objs
> +++ b/hw/vfio/Makefile.objs
> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o
>  obj-$(CONFIG_SOFTMMU) += platform.o
>  obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o
>  obj-$(CONFIG_SOFTMMU) += amd-xgbe.o
> +obj-$(CONFIG_SOFTMMU) += prereg.o
>  endif
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9587c25..a8deb16 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -493,6 +493,9 @@ static const MemoryListener vfio_memory_listener = {
>  static void vfio_listener_release(VFIOContainer *container)
>  {
>  memory_listener_unregister(&container->listener);
> +if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
> +memory_listener_unregister(&container->prereg_listener);
> +}
>  }
>  
>  int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
> @@ -800,8 +803,8 @@ static int vfio_connect_container(VFIOGroup *group, 
> AddressSpace *as)
>  goto free_container_exit;
>  }
>  
> -ret = ioctl(fd, VFIO_SET_IOMMU,
> -v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
> +container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
> +ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>  if (ret) {
>  error_report("vfio: failed to set iommu for container: %m");
>  ret = -errno;
> @@ -826,8 +829,10 @@ static int vfio_connect_container(VFIOGroup *group, 
> AddressSpace *as)
>  if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) {
>  container->iova_pgsizes = info.iova_pgsizes;
>  }
> -} else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +} else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
> +   ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
>  struct vfio_iommu_spapr_tce_info info;
> +bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
>  
>  ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>  if (ret) {
> @@ -835,7 +840,9 @@ static int vfio_connect_container(VFIOGroup *group, 
> AddressSpace *as)
>  ret = -errno;
>  goto free_container_exit;
>  }
> -ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +container->iommu_type =
> +v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
> +ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type);
>  if (ret) {
>  error_report("vfio: failed to set iommu for container: %m");
>  ret = -errno;
> @@ -847,11 +854,22 @@ static int vfio_connect_container(VFIOGroup *group, 
> AddressSpace *as)
>   * when container fd is closed so we do not call it explicitly
>   * in this file.
>   */
> -ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> -if (ret) {
> -error_report("vfio: failed to enable container: %m");
> -ret = -errno;
> -goto free_container_exit;
> +if (!v2) {
> +ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> +if (ret) {
> +error_report("vfio: failed to enable

Re: [Qemu-devel] [PATCH] block/sheepdog: add error handling to sd_snapshot_delete()

2016-03-21 Thread MENJO, Takashi

Thank you for your review, Jeff!
I'll submit a patch v2 soon.

In addition, I found that we also need to set errp below.
This will be fixed in v2, too.
| 2607 switch (rsp->result) {
| 2608 case SD_RES_NO_VDI:
| 2609 error_report("%s was already deleted", s->name);


Takashi

> -Original Message-
> From: Jeff Cody [mailto:jc...@redhat.com]
> Sent: Saturday, March 19, 2016 1:17 AM
> To: Takashi Menjo 
> Cc: qemu-devel@nongnu.org; Hitoshi Mitake ;
> Vasiliy Tolstov ; sheep...@lists.wpkg.org
> Subject: Re: [PATCH] block/sheepdog: add error handling to
> sd_snapshot_delete()
> 
> On Fri, Mar 18, 2016 at 05:54:38PM +0900, Takashi Menjo wrote:
> > Errors have been ignored in some code paths in sd_snapshot_delete().
> > This patch adds error handling.
> >
> > Signed-off-by: Takashi Menjo 
> 
> Thank you for the patch!
> 
> > ---
> >  block/sheepdog.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/block/sheepdog.c b/block/sheepdog.c
> > index a3aeae4..6492405 100644
> > --- a/block/sheepdog.c
> > +++ b/block/sheepdog.c
> > @@ -2565,6 +2565,7 @@ static int sd_snapshot_delete(BlockDriverState
*bs,
> >  SheepdogVdiRsp *rsp = (SheepdogVdiRsp *)
> >
> >  if (!remove_objects(s)) {
> > +error_report("failed to discard snapshot inode");
> 
> We want to set errp, so that the error is picked up correctly.  It is
> assumed in QEMU that if there is an Error object passed, that it is
> sufficient to check it for error (as opposed to checking the return
> value).
> 
> You can use error_setg() here to do this, e.g.:
> 
>error_setg(errp, "failed to discard snapshot inode");
> 
> >  return -1;
> >  }
> >
> > @@ -2588,6 +2589,7 @@ static int sd_snapshot_delete(BlockDriverState
*bs,
> >  ret = find_vdi_name(s, s->name, snap_id, snap_tag, , true,
> >  &local_err);
> >  if (ret) {
> > +error_report_err(local_err);
> 
> To propagate the local_err value to errp, use error_propagate:
> 
> error_propagate(errp, local_err);
> 
> >  return ret;
> >  }
> 
> 
> There is another hunk that is missing an error_propagate in
> sd_snapshot_delete:
> 
> 2594 fd = connect_to_sdog(s, &local_err);
> 2595 if (fd < 0) {
> 2596 error_report_err(local_err);
> 2597 return -1;
> 2598 }
> 2599
> 
> >
> > @@ -2601,6 +2603,7 @@ static int sd_snapshot_delete(BlockDriverState
*bs,
> >   buf, , );
> >  closesocket(fd);
> >  if (ret) {
> > +error_setg_errno(errp, -ret, "failed to delete %s", s->name);
> >  return ret;
> >  }
> 
> We also need to set errp in the switch statement on rsp->result:
> 
> 2607 switch (rsp->result) {
> 
> [...]
> 
> 2612 default:
> 2613 error_report("%s, %s", sd_strerror(rsp->result), s->name);
> 2614 return -1;
> 2615 }
> 
> 
>
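
The errp convention described in the review above can be modelled with a toy example. This is only a sketch of the calling pattern: `ToyError`, `toy_error_set()` and `toy_error_propagate()` are simplified, hypothetical stand-ins, not QEMU's Error API, and `toy_connect()` always fails purely for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for an error object; not QEMU's Error type. */
typedef struct ToyError {
    char msg[128];
} ToyError;

/* Store a new error in *errp unless the caller passed NULL. */
static void toy_error_set(ToyError **errp, const char *msg)
{
    if (!errp) {
        return;
    }
    assert(*errp == NULL);              /* never overwrite a pending error */
    *errp = calloc(1, sizeof(**errp));
    if (*errp) {
        snprintf((*errp)->msg, sizeof((*errp)->msg), "%s", msg);
    }
}

/* Hand a locally collected error to the caller, or drop it if unwanted. */
static void toy_error_propagate(ToyError **dst, ToyError *local)
{
    if (!local) {
        return;
    }
    if (dst && !*dst) {
        *dst = local;
    } else {
        free(local);
    }
}

/* Helper that always fails, standing in for a connection attempt. */
static int toy_connect(ToyError **errp)
{
    toy_error_set(errp, "connect failed");
    return -1;
}

/* The caller can rely on *errp being set whenever this fails. */
static int toy_snapshot_delete(ToyError **errp)
{
    ToyError *local_err = NULL;

    if (toy_connect(&local_err) < 0) {
        toy_error_propagate(errp, local_err);   /* set errp, do not just print */
        return -1;
    }
    return 0;
}
```

The point of the pattern is that a caller passing an error pointer only needs to check that pointer; printing the error locally and returning -1 would leave it unset.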

Re: [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper

2016-03-21 Thread David Gibson

On Tue, Mar 22, 2016 at 02:17:24PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 12:02 PM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:46:51PM +1100, Alexey Kardashevskiy wrote:
> >>We are going to have multiple DMA windows soon so let's start preparing.
> >>
> >>This adds a new helper to create a DMA window and makes use of it in
> >>sPAPRPHBState::realize().
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >
> >Reviewed-by: David Gibson 
> >
> >With one tweak..
> >
> >>---
> >>Changes:
> >>v14:
> >>* replaced "int" return with Error* in spapr_phb_dma_window_enable()
> >>---
> >>  hw/ppc/spapr_pci.c | 47 ++-
> >>  1 file changed, 34 insertions(+), 13 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >>index 79baa7b..18332bf 100644
> >>--- a/hw/ppc/spapr_pci.c
> >>+++ b/hw/ppc/spapr_pci.c
> >>@@ -803,6 +803,33 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState 
> >>*sphb, PCIDevice *pdev)
> >>  return buf;
> >>  }
> >>
> >>+static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> >>+   uint32_t liobn,
> >>+   uint32_t page_shift,
> >>+   uint64_t window_addr,
> >>+   uint64_t window_size,
> >>+   Error **errp)
> >>+{
> >>+sPAPRTCETable *tcet;
> >>+uint32_t nb_table = window_size >> page_shift;
> >>+
> >>+if (!nb_table) {
> >>+error_setg(errp, "Zero size table");
> >>+return;
> >>+}
> >>+
> >>+tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
> >>+   page_shift, nb_table, false);
> >>+if (!tcet) {
> >>+error_setg(errp, "Unable to create TCE table liobn %x for %s",
> >>+   liobn, sphb->dtbusname);
> >>+return;
> >>+}
> >>+
> >>+memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> >>+spapr_tce_get_iommu(tcet));
> >>+}
> >>+
> >>  /* Macros to operate with address in OF binding to PCI */
> >>  #define b_x(x, p, l)(((x) & ((1<<(l))-1)) << (p))
> >>  #define b_n(x)  b_x((x), 31, 1) /* 0 if relocatable */
> >>@@ -1307,8 +1334,7 @@ static void spapr_phb_realize(DeviceState *dev, Error 
> >>**errp)
> >>  int i;
> >>  PCIBus *bus;
> >>  uint64_t msi_window_size = 4096;
> >>-sPAPRTCETable *tcet;
> >>-uint32_t nb_table;
> >>+Error *local_err = NULL;
> >>
> >>  if (sphb->index != (uint32_t)-1) {
> >>  hwaddr windows_base;
> >>@@ -1460,18 +1486,13 @@ static void spapr_phb_realize(DeviceState *dev, 
> >>Error **errp)
> >>  }
> >>  }
> >>
> >>-nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
> >>-tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
> >>-   0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
> >>-if (!tcet) {
> >>-error_setg(errp, "Unable to create TCE table for %s",
> >>-   sphb->dtbusname);
> >>-return;
> >>-}
> >>-
> >>  /* Register default 32bit DMA window */
> >>-memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
> >>-spapr_tce_get_iommu(tcet));
> >>+spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, 
> >>SPAPR_TCE_PAGE_SHIFT,
> >>+sphb->dma_win_addr, sphb->dma_win_size,
> >>+&local_err);
> >>+if (local_err) {
> >>+error_propagate(errp, local_err);
> >
> >Should be a return; here so we don't continue if there's an error.
> >
> >Actually.. that's not really right, we should be cleaning up all setup
> >we've done already on the failure path.  Without that I think we'll
> >leak some objects on a failed device_add.
> >
> >But.. there are already a bunch of cases here that will do that, so we
> >can clean that up separately.  Probably the sanest way would be to add
> >an unrealize() function that can handle a partially realized object
> >and make sure it's called on all the error paths.
> 
> 
> So what do I do right now with this patch? Leave it as is, add "return",
> implement unrealize(), ...? In practice, being unable to create a PHB is a
> fatal error today (as we do not have PHB hotplug yet and this is what
> unrealize() is for).

Add the return for now, since the series will need a respin anyway.
If you have time it'd be great if you could do an unrealize() patch
that cleans up the existing failure paths, but that would be separate
from this series.
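
For what it's worth, the "unrealize that copes with a partially realized object" idea could look roughly like the sketch below. `ToyDevice` and its two resources are hypothetical and unrelated to the real sPAPRPHBState; the `fail_msi` flag exists only to force the second step to fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical device with two resources acquired during realize. */
typedef struct {
    int *window_table;
    int *msi_table;
} ToyDevice;

/* Safe on fully, partially, or never realized devices. */
static void toy_unrealize(ToyDevice *d)
{
    free(d->msi_table);         /* free(NULL) is a no-op */
    d->msi_table = NULL;
    free(d->window_table);
    d->window_table = NULL;
}

/* Returns false on failure, leaving the device fully torn down. */
static bool toy_realize(ToyDevice *d, size_t nb_table, bool fail_msi)
{
    assert(!d->window_table && !d->msi_table);

    if (nb_table == 0) {
        return false;           /* nothing acquired yet, nothing to undo */
    }
    d->window_table = calloc(nb_table, sizeof(int));
    if (!d->window_table) {
        goto fail;
    }
    d->msi_table = fail_msi ? NULL : calloc(16, sizeof(int));
    if (!d->msi_table) {
        goto fail;              /* window_table is released below */
    }
    return true;

fail:
    toy_unrealize(d);
    return false;
}
```

Every error path funnels into the same teardown, so a failure in a later step does not leak what earlier steps allocated.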

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address

2016-03-21 Thread David Gibson

On Tue, Mar 22, 2016 at 02:12:30PM +1100, Alexey Kardashevskiy wrote:
> On 03/22/2016 11:49 AM, David Gibson wrote:
> >On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
> >>Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
> >>when new VFIO listener is added, all existing IOMMU mappings are
> >>replayed. However there is a problem that the base address of
> >>an IOMMU memory region (IOMMU MR) is ignored which is not a problem
> >>for the existing user (which is pseries) with its default 32bit DMA
> >>window starting at 0 but it is if there is another DMA window.
> >>
> >>This stores the IOMMU's offset_within_address_space and adjusts
> >>the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> >>
> >>As the IOMMU notifier expects IOVA offset rather than the absolute
> >>address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
> >>calling notifier(s).
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >>Reviewed-by: David Gibson 
> >
> >On a closer look, I realised this still isn't quite correct, although
> >I don't think any cases which would break it exist or are planned.
> >
> >>---
> >>  hw/ppc/spapr_iommu.c  |  2 +-
> >>  hw/vfio/common.c  | 14 --
> >>  include/hw/vfio/vfio-common.h |  1 +
> >>  3 files changed, 10 insertions(+), 7 deletions(-)
> >>
> >>diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >>index 7dd4588..277f289 100644
> >>--- a/hw/ppc/spapr_iommu.c
> >>+++ b/hw/ppc/spapr_iommu.c
> >>@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, 
> >>target_ulong ioba,
> >>  tcet->table[index] = tce;
> >>
> >>  entry.target_as = &address_space_memory,
> >>-entry.iova = ioba & page_mask;
> >>+entry.iova = (ioba - tcet->bus_offset) & page_mask;
> >>  entry.translated_addr = tce & page_mask;
> >>  entry.addr_mask = ~page_mask;
> >>  entry.perm = spapr_tce_iommu_access_flags(tce);
> >
> >This bit's right.
> >
> >>diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >>index fb588d8..d45e2db 100644
> >>--- a/hw/vfio/common.c
> >>+++ b/hw/vfio/common.c
> >>@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void 
> >>*data)
> >>  VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >>  VFIOContainer *container = giommu->container;
> >>  IOMMUTLBEntry *iotlb = data;
> >>+hwaddr iova = iotlb->iova + giommu->offset_within_address_space;
> >
> >This bit might be right, depending on how you define 
> >giommu->offset_within_address_space.
> >
> >>  MemoryRegion *mr;
> >>  hwaddr xlat;
> >>  hwaddr len = iotlb->addr_mask + 1;
> >>  void *vaddr;
> >>  int ret;
> >>
> >>-trace_vfio_iommu_map_notify(iotlb->iova,
> >>-iotlb->iova + iotlb->addr_mask);
> >>+trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
> >>
> >>  /*
> >>   * The IOMMU TLB entry we have just covers translation through
> >>@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void 
> >>*data)
> >>
> >>  if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> >>  vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >>-ret = vfio_dma_map(container, iotlb->iova,
> >>+ret = vfio_dma_map(container, iova,
> >> iotlb->addr_mask + 1, vaddr,
> >> !(iotlb->perm & IOMMU_WO) || mr->readonly);
> >>  if (ret) {
> >>  error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >>   "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >>- container, iotlb->iova,
> >>+ container, iova,
> >>   iotlb->addr_mask + 1, vaddr, ret);
> >>  }
> >>  } else {
> >>-ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> >>+ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
> >>  if (ret) {
> >>  error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >>   "0x%"HWADDR_PRIx") = %d (%m)",
> >>- container, iotlb->iova,
> >>+ container, iova,
> >>   iotlb->addr_mask + 1, ret);
> >>  }
> >>  }
> >
> >This is fine.
> >
> >>@@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener 
> >>*listener,
> >>   */
> >>  giommu = g_malloc0(sizeof(*giommu));
> >>  giommu->iommu = section->mr;
> >>+giommu->offset_within_address_space =
> >>+section->offset_within_address_space;
> >
> >But here there's a problem.  The iova in IOMMUTLBEntry is relative to
> >the IOMMU MemoryRegion, but - at least in theory - only a subsection
> >of that MemoryRegion could be mapped into the AddressSpace.
> 
> But the IOMMU MR stays the same - size, offset, and iova will be relative to
> its start, why does it matter if

Re: [Qemu-devel] [PATCH qemu v14 11/18] memory: Add reporting of supported page sizes

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:46:59PM +1100, Alexey Kardashevskiy wrote:
> Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
> uses when translating, however this information is not available outside
> the translate context for various checks.
> 
> This adds a get_page_sizes callback to MemoryRegionIOMMUOps and
> a wrapper for it so IOMMU users (such as VFIO) can know the actual
> page size(s) used by an IOMMU.
> 
> The qemu_real_host_page_mask is used as fallback.

You're still mismatching concepts here.  The MemoryRegionIOMMUOps
represents a guest IOMMU, so falling back to qemu_real_host_page_mask
(a host property) makes no sense.  I think what you want is to fall
back to TARGET_PAGE_SIZE.
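
As a small illustration of the page-size bitmap convention used here (bit n set means pages of 2^n bytes are supported), the minimum granularity and a fallback could be computed as in this sketch. The function names are hypothetical, and the fallback is passed in as a parameter because which value to use is exactly the point under review.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest page size in a bitmap where bit n set means 2^n bytes. */
static uint64_t min_page_size(uint64_t page_sizes)
{
    assert(page_sizes != 0);
    return page_sizes & (~page_sizes + 1);  /* isolate the lowest set bit */
}

/* Use the reported bitmap if there is one, otherwise a single default size. */
static uint64_t page_sizes_or_default(uint64_t reported,
                                      uint64_t fallback_page_size)
{
    return reported ? reported : fallback_page_size;
}
```

With a bitmap advertising 4K, 64K and 16M pages, the minimum granularity is 4K; with no bitmap reported, the caller gets just the fallback size.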

> This removes vfio_container_granularity() and uses new callback in
> memory_region_iommu_replay() when replaying IOMMU mappings on added
> IOMMU memory region.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v14:
> * removed vfio_container_granularity(), changed memory_region_iommu_replay()
> 
> v4:
> * s/1< ---
>  hw/ppc/spapr_iommu.c  |  8 
>  hw/vfio/common.c  |  6 --
>  include/exec/memory.h | 18 ++
>  memory.c  | 17 ++---
>  4 files changed, 36 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index dd662da..6dc3c45 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -144,6 +144,13 @@ static void spapr_tce_table_pre_save(void *opaque)
>  tcet->mig_table = tcet->table;
>  }
>  
> +static uint64_t spapr_tce_get_page_sizes(MemoryRegion *iommu)
> +{
> +sPAPRTCETable *tcet = container_of(iommu, sPAPRTCETable, iommu);
> +
> +return 1ULL << tcet->page_shift;
> +}
> +
>  static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
>  static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
>  
> @@ -203,6 +210,7 @@ static const VMStateDescription vmstate_spapr_tce_table = 
> {
>  
>  static MemoryRegionIOMMUOps spapr_iommu_ops = {
>  .translate = spapr_tce_translate_iommu,
> +.get_page_sizes = spapr_tce_get_page_sizes,
>  };
>  
>  static int spapr_tce_table_realize(DeviceState *dev)
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index d45e2db..55723c9 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -313,11 +313,6 @@ out:
>  rcu_read_unlock();
>  }
>  
> -static hwaddr vfio_container_granularity(VFIOContainer *container)
> -{
> -return (hwaddr)1 << ctz64(container->iova_pgsizes);
> -}
> -
>  static void vfio_listener_region_add(MemoryListener *listener,
>   MemoryRegionSection *section)
>  {
> @@ -385,7 +380,6 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>  
>  memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>  memory_region_iommu_replay(giommu->iommu, &giommu->n,
> -   vfio_container_granularity(container),
> false);
>  
>  return;
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 2de7898..eb5ce67 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -150,6 +150,8 @@ typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
>  struct MemoryRegionIOMMUOps {
>  /* Return a TLB entry that contains a given address. */
>  IOMMUTLBEntry (*translate)(MemoryRegion *iommu, hwaddr addr, bool 
> is_write);
> +/* Returns supported page sizes */
> +uint64_t (*get_page_sizes)(MemoryRegion *iommu);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -573,6 +575,15 @@ static inline bool memory_region_is_iommu(MemoryRegion 
> *mr)
>  
>  
>  /**
> + * memory_region_iommu_get_page_sizes: get supported page sizes in an iommu
> + *
> + * Returns %bitmap of supported page sizes for an iommu.
> + *
> + * @mr: the memory region being queried
> + */
> +uint64_t memory_region_iommu_get_page_sizes(MemoryRegion *mr);
> +
> +/**
>   * memory_region_notify_iommu: notify a change in an IOMMU translation entry.
>   *
>   * @mr: the memory region that was changed
> @@ -596,16 +607,15 @@ void memory_region_register_iommu_notifier(MemoryRegion 
> *mr, Notifier *n);
>  
>  /**
>   * memory_region_iommu_replay: replay existing IOMMU translations to
> - * a notifier
> + * a notifier with the minimum page granularity returned by
> + * mr->iommu_ops->get_page_sizes().
>   *
>   * @mr: the memory region to observe
>   * @n: the notifier to which to replay iommu mappings
> - * @granularity: Minimum page granularity to replay notifications for
>   * @is_write: Whether to treat the replay as a translate "write"
>   * through the iommu
>   */
> -void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n,
> -hwaddr granularity, bool is_write);
> +void memory_region_iommu_replay(MemoryRegion *mr, Notifier *n, bool 
> is_write);
>  
>  /**
>   * memory_region_unregister_iommu_notifier:

Re: [Qemu-devel] [PATCH qemu v14 12/18] vfio: Check that IOMMU MR translates to system address space

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:47:00PM +1100, Alexey Kardashevskiy wrote:
> At the moment IOMMU MRs only translate to system memory.
> However, if some new code changes this, we will need a clear indication
> of why it is not working, so here is the check.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

Alex, any chance we could merge this quickly, since it is a reasonable
sanity check even without the rest of the changes?

> ---
> Changes:
> v14:
> * new to the series
> ---
>  hw/vfio/common.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 55723c9..9587c25 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -266,6 +266,12 @@ static void vfio_iommu_map_notify(Notifier *n, void 
> *data)
>  
>  trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>  
> +if (iotlb->target_as != &address_space_memory) {
> +error_report("Wrong target AS \"%s\", only system memory is allowed",
> + iotlb->target_as->name?iotlb->target_as->name:"noname");
> +return;
> +}
> +
>  /*
>   * The IOMMU TLB entry we have just covers translation through
>   * this IOMMU to its immediate target.  We need to translate

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper

2016-03-21 Thread Alexey Kardashevskiy


On 03/22/2016 12:02 PM, David Gibson wrote:

On Mon, Mar 21, 2016 at 06:46:51PM +1100, Alexey Kardashevskiy wrote:

We are going to have multiple DMA windows soon so let's start preparing.

This adds a new helper to create a DMA window and makes use of it in
sPAPRPHBState::realize().

Signed-off-by: Alexey Kardashevskiy 


Reviewed-by: David Gibson 

With one tweak..


---
Changes:
v14:
* replaced "int" return with Error* in spapr_phb_dma_window_enable()
---
  hw/ppc/spapr_pci.c | 47 ++-
  1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 79baa7b..18332bf 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -803,6 +803,33 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, 
PCIDevice *pdev)
  return buf;
  }

+static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
+   uint32_t liobn,
+   uint32_t page_shift,
+   uint64_t window_addr,
+   uint64_t window_size,
+   Error **errp)
+{
+sPAPRTCETable *tcet;
+uint32_t nb_table = window_size >> page_shift;
+
+if (!nb_table) {
+error_setg(errp, "Zero size table");
+return;
+}
+
+tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
+   page_shift, nb_table, false);
+if (!tcet) {
+error_setg(errp, "Unable to create TCE table liobn %x for %s",
+   liobn, sphb->dtbusname);
+return;
+}
+
+memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
+spapr_tce_get_iommu(tcet));
+}
+
  /* Macros to operate with address in OF binding to PCI */
  #define b_x(x, p, l)(((x) & ((1<<(l))-1)) << (p))
  #define b_n(x)  b_x((x), 31, 1) /* 0 if relocatable */
@@ -1307,8 +1334,7 @@ static void spapr_phb_realize(DeviceState *dev, Error 
**errp)
  int i;
  PCIBus *bus;
  uint64_t msi_window_size = 4096;
-sPAPRTCETable *tcet;
-uint32_t nb_table;
+Error *local_err = NULL;

  if (sphb->index != (uint32_t)-1) {
  hwaddr windows_base;
@@ -1460,18 +1486,13 @@ static void spapr_phb_realize(DeviceState *dev, Error 
**errp)
  }
  }

-nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
-tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
-   0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
-if (!tcet) {
-error_setg(errp, "Unable to create TCE table for %s",
-   sphb->dtbusname);
-return;
-}
-
  /* Register default 32bit DMA window */
-memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
-spapr_tce_get_iommu(tcet));
+spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
+sphb->dma_win_addr, sphb->dma_win_size,
+&local_err);
+if (local_err) {
+error_propagate(errp, local_err);


Should be a return; here so we don't continue if there's an error.

Actually.. that's not really right, we should be cleaning up all setup
we've done already on the failure path.  Without that I think we'll
leak some objects on a failed device_add.

But.. there are already a bunch of cases here that will do that, so we
can clean that up separately.  Probably the sanest way would be to add
an unrealize() function that can handle a partially realized object
and make sure it's called on all the error paths.



So what do I do right now with this patch? Leave it as is, add "return", 
implement unrealize(), ...? In practice, being unable to create a PHB is a 
fatal error today (as we do not have PHB hotplug yet and this is what 
unrealize() is for).






+}

  sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, 
g_free);
  }





--
Alexey

Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address

2016-03-21 Thread Alexey Kardashevskiy


On 03/22/2016 11:49 AM, David Gibson wrote:

On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:

Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
when new VFIO listener is added, all existing IOMMU mappings are
replayed. However there is a problem that the base address of
an IOMMU memory region (IOMMU MR) is ignored which is not a problem
for the existing user (which is pseries) with its default 32bit DMA
window starting at 0 but it is if there is another DMA window.

This stores the IOMMU's offset_within_address_space and adjusts
the IOVA before calling vfio_dma_map/vfio_dma_unmap.

As the IOMMU notifier expects IOVA offset rather than the absolute
address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
calling notifier(s).

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: David Gibson 


On a closer look, I realised this still isn't quite correct, although
I don't think any cases which would break it exist or are planned.


---
  hw/ppc/spapr_iommu.c  |  2 +-
  hw/vfio/common.c  | 14 --
  include/hw/vfio/vfio-common.h |  1 +
  3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 7dd4588..277f289 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, 
target_ulong ioba,
  tcet->table[index] = tce;

  entry.target_as = &address_space_memory,
-entry.iova = ioba & page_mask;
+entry.iova = (ioba - tcet->bus_offset) & page_mask;
  entry.translated_addr = tce & page_mask;
  entry.addr_mask = ~page_mask;
  entry.perm = spapr_tce_iommu_access_flags(tce);


This bit's right.


diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fb588d8..d45e2db 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)
  VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
  VFIOContainer *container = giommu->container;
  IOMMUTLBEntry *iotlb = data;
+hwaddr iova = iotlb->iova + giommu->offset_within_address_space;


This bit might be right, depending on how you define 
giommu->offset_within_address_space.


  MemoryRegion *mr;
  hwaddr xlat;
  hwaddr len = iotlb->addr_mask + 1;
  void *vaddr;
  int ret;

-trace_vfio_iommu_map_notify(iotlb->iova,
-iotlb->iova + iotlb->addr_mask);
+trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);

  /*
   * The IOMMU TLB entry we have just covers translation through
@@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void *data)

  if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
  vaddr = memory_region_get_ram_ptr(mr) + xlat;
-ret = vfio_dma_map(container, iotlb->iova,
+ret = vfio_dma_map(container, iova,
 iotlb->addr_mask + 1, vaddr,
 !(iotlb->perm & IOMMU_WO) || mr->readonly);
  if (ret) {
  error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
   "0x%"HWADDR_PRIx", %p) = %d (%m)",
- container, iotlb->iova,
+ container, iova,
   iotlb->addr_mask + 1, vaddr, ret);
  }
  } else {
-ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
+ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
  if (ret) {
  error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
   "0x%"HWADDR_PRIx") = %d (%m)",
- container, iotlb->iova,
+ container, iova,
   iotlb->addr_mask + 1, ret);
  }
  }


This is fine.


@@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
   */
  giommu = g_malloc0(sizeof(*giommu));
  giommu->iommu = section->mr;
+giommu->offset_within_address_space =
+section->offset_within_address_space;


But here there's a problem.  The iova in IOMMUTLBEntry is relative to
the IOMMU MemoryRegion, but - at least in theory - only a subsection
of that MemoryRegion could be mapped into the AddressSpace.


But the IOMMU MR stays the same - size, offset, and iova will be relative 
to its start, why does it matter if only a portion is mapped?




So, to find the IOVA within the AddressSpace, from the IOVA within the
MemoryRegion, you need to first subtract the section's offset within
the MemoryRegion, then add the section's offset within the
AddressSpace.

You could precalculate the combined delta here, but...
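
The translation described above can be sketched in a few lines of C. This is only an illustration of the arithmetic, not code from the patch: SectionOffsets is a hypothetical stand-in for the two MemoryRegionSection offsets involved, and the helper names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two MemoryRegionSection fields that matter. */
typedef struct {
    uint64_t offset_within_region;        /* section start inside the IOMMU MR */
    uint64_t offset_within_address_space; /* section start inside the AddressSpace */
} SectionOffsets;

/* Convert an IOVA relative to the IOMMU MemoryRegion into an IOVA
 * within the AddressSpace: subtract one offset, add the other. */
static uint64_t mr_iova_to_as_iova(const SectionOffsets *s, uint64_t mr_iova)
{
    return mr_iova - s->offset_within_region + s->offset_within_address_space;
}

/* The combined delta can be computed once per section.  Unsigned
 * arithmetic wraps modulo 2^64, so a "negative" delta still works
 * when it is later added to an IOVA. */
static uint64_t iova_delta(const SectionOffsets *s)
{
    return s->offset_within_address_space - s->offset_within_region;
}
```

When the section starts at offset 0 of the MemoryRegion, the delta reduces to offset_within_address_space alone, which is the case the patch handles.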

>



  giommu->container = container;
  giommu->n.notify = vfio_iommu_map_notify;
  QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
diff --git

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-21 Thread Peter Xu

On Mon, Mar 21, 2016 at 06:00:50PM -0400, Bandan Das wrote:
> 
> vfio_listener_region_add for a iommu mr results in
> an overflow assert since emulated iommu memory region is initialized
> with UINT64_MAX. Add a check just like memory_region_size()
> does.

Hi, Bandan,

In case you missed this:

https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg02865.html

-- peterx

Re: [Qemu-devel] [PATCH qemu v14 06/18] spapr_iommu: Finish renaming vfio_accel to need_vfio

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:46:54PM +1100, Alexey Kardashevskiy wrote:
> 6a81dd17 "spapr_iommu: Rename vfio_accel parameter" renamed vfio_accel
> flag everywhere but one spot was missed.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 


> ---
>  target-ppc/kvm_ppc.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
> index fc79312..3b2090e 100644
> --- a/target-ppc/kvm_ppc.h
> +++ b/target-ppc/kvm_ppc.h
> @@ -163,7 +163,7 @@ static inline bool kvmppc_spapr_use_multitce(void)
>  
>  static inline void *kvmppc_create_spapr_tce(uint32_t liobn,
>  uint32_t window_size, int *fd,
> -bool vfio_accel)
> +bool need_vfio)
>  {
>  return NULL;
>  }

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH qemu v14 07/18] spapr_iommu: Realloc table during migration

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:46:55PM +1100, Alexey Kardashevskiy wrote:
> The source guest could have reallocated the default TCE table and
> migrate bigger/smaller table. This adds reallocation in post_load()
> if the default table size is different on source and destination.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v14:
> * new to the series
> ---
>  hw/ppc/spapr_iommu.c   | 36 ++--
>  include/hw/ppc/spapr.h |  2 ++
>  2 files changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 9bcd3f6..549cd94 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -137,6 +137,16 @@ static IOMMUTLBEntry 
> spapr_tce_translate_iommu(MemoryRegion *iommu, hwaddr addr,
>  return ret;
>  }
>  
> +static void spapr_tce_table_pre_save(void *opaque)
> +{
> +sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> +
> +tcet->mig_table = tcet->table;

Don't you need to set mig_nb_table here as well?  I can't see anywhere
else it's initialized.

> +}
> +
> +static void spapr_tce_table_do_enable(sPAPRTCETable *tcet);
> +static void spapr_tce_table_do_disable(sPAPRTCETable *tcet);
> +
>  static int spapr_tce_table_post_load(void *opaque, int version_id)
>  {
>  sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
> @@ -145,6 +155,26 @@ static int spapr_tce_table_post_load(void *opaque, int 
> version_id)
>  spapr_vio_set_bypass(tcet->vdev, tcet->bypass);
>  }
>  
> +if (tcet->enabled) {
> +if (tcet->nb_table != tcet->mig_nb_table) {
> +if (tcet->nb_table) {
> +spapr_tce_table_do_disable(tcet);
> +}
> +tcet->nb_table = tcet->mig_nb_table;
> +spapr_tce_table_do_enable(tcet);
> +}
> +
> +memcpy(tcet->table, tcet->mig_table,
> +   tcet->nb_table * sizeof(tcet->table[0]));
> +
> +free(tcet->mig_table);
> +tcet->mig_table = NULL;
> +
> +} else if (tcet->table) {
> +/* Destination guest has a default table but source does not -> free 
> */
> +spapr_tce_table_do_disable(tcet);
> +}
> +

Clunky, but I don't know of a better way.

>  return 0;
>  }
>  
> @@ -152,15 +182,17 @@ static const VMStateDescription vmstate_spapr_tce_table 
> = {
>  .name = "spapr_iommu",
>  .version_id = 2,
>  .minimum_version_id = 2,
> +.pre_save = spapr_tce_table_pre_save,
>  .post_load = spapr_tce_table_post_load,
>  .fields  = (VMStateField []) {
>  /* Sanity check */
>  VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
> -VMSTATE_UINT32_EQUAL(nb_table, sPAPRTCETable),
>  
>  /* IOMMU state */
> +VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
>  VMSTATE_BOOL(bypass, sPAPRTCETable),
> -VMSTATE_VARRAY_UINT32(table, sPAPRTCETable, nb_table, 0, 
> vmstate_info_uint64, uint64_t),
> +VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, nb_table, 0,
> +vmstate_info_uint64, uint64_t),
>  
>  VMSTATE_END_OF_LIST()
>  },
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 75b0b55..c1ea49c 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -545,6 +545,8 @@ struct sPAPRTCETable {
>  uint64_t bus_offset;
>  uint32_t page_shift;
>  uint64_t *table;
> +uint32_t mig_nb_table;
> +uint64_t *mig_table;
>  bool bypass;
>  bool need_vfio;
>  int fd;

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH qemu v14 08/18] spapr_iommu: Migrate full state

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:46:56PM +1100, Alexey Kardashevskiy wrote:
> This adds @bus_offset, @page_shift, @enabled members to migration stream.
> These cannot change without dynamic DMA windows so no change in
> behavior is expected.
> 
> Signed-off-by: Alexey Kardashevskiy 

I think you should combine this patch with the previous one.  They're
both simple, and the functions in the previous one check
tcet->enabled, which doesn't make a lot of sense if you're not
migrating that value.

The version bump here looks correct, but it will break migration of
(for example) a pseries-2.5 VM running under qemu-2.7 back into
qemu-2.5.  That sort of backwards migration isn't considered
essential, but it is nice to have (and it's something RH cares about
downstream).

So, if possible it would be preferable to do the migration in a
backwards compatible way.  The standard trick for that seems to be to
add an optional section with the extra info, and make the "needed"
function return true iff the parameters differ from the defaults.
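
A rough model of that trick, written as plain C rather than against QEMU's VMState API: the stream always carries the old fields, and the extra fields are emitted only when a "needed" predicate says they differ from the defaults. TceState, the default values (enabled, bus offset 0, page shift 12) and the flat word-array stream are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified device state; not the real sPAPRTCETable. */
typedef struct {
    uint32_t nb_table;   /* field that old versions already migrate */
    bool enabled;        /* new fields, sent only when non-default */
    uint64_t bus_offset;
    uint32_t page_shift;
} TceState;

enum { DEFAULT_PAGE_SHIFT = 12 }; /* assumed default for this sketch */

/* "needed" predicate: true iff any new field differs from its default. */
static bool extra_needed(const TceState *s)
{
    return !s->enabled || s->bus_offset != 0 ||
           s->page_shift != DEFAULT_PAGE_SHIFT;
}

/* Serialize into 'out' (at least 4 words); returns the word count.
 * With default parameters the output matches the old, shorter format. */
static size_t save_state(const TceState *s, uint64_t *out)
{
    size_t n = 0;

    out[n++] = s->nb_table;
    if (extra_needed(s)) {
        out[n++] = s->enabled;
        out[n++] = s->bus_offset;
        out[n++] = s->page_shift;
    }
    return n;
}
```

A guest that never touches the DMA window configuration produces the short form, so an older receiver can still load it; only guests that actually use the new features lose backwards migration.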

> ---
> Changes:
> v14:
> * new to the series
> ---
>  hw/ppc/spapr_iommu.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 549cd94..5ea5948 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -180,7 +180,7 @@ static int spapr_tce_table_post_load(void *opaque, int 
> version_id)
>  
>  static const VMStateDescription vmstate_spapr_tce_table = {
>  .name = "spapr_iommu",
> -.version_id = 2,
> +.version_id = 3,
>  .minimum_version_id = 2,
>  .pre_save = spapr_tce_table_pre_save,
>  .post_load = spapr_tce_table_post_load,
> @@ -189,6 +189,9 @@ static const VMStateDescription vmstate_spapr_tce_table = 
> {
>  VMSTATE_UINT32_EQUAL(liobn, sPAPRTCETable),
>  
>  /* IOMMU state */
> +VMSTATE_BOOL_V(enabled, sPAPRTCETable, 3),
> +VMSTATE_UINT64_V(bus_offset, sPAPRTCETable, 3),
> +VMSTATE_UINT32_V(page_shift, sPAPRTCETable, 3),
>  VMSTATE_UINT32(mig_nb_table, sPAPRTCETable),
>  VMSTATE_BOOL(bypass, sPAPRTCETable),
>  VMSTATE_VARRAY_UINT32_ALLOC(mig_table, sPAPRTCETable, nb_table, 0,

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


[Qemu-devel] [PATCH v2] qdict: fix unbounded stack warning for qdict_array_entries

2016-03-21 Thread Peter Xu

Use a single g_strdup_printf() to replace the two stack-allocated
arrays. It is more convenient and safer, and the function is called
rarely (only when a quorum device opens), so the extra allocation does
not matter. This removes the unbounded stack warning when compiling
with "-Wstack-usage=100".

Reviewed-by:   Eric Blake 
Signed-off-by: Peter Xu 
---
 qobject/qdict.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/qobject/qdict.c b/qobject/qdict.c
index 9833bd0..fe6ffa1 100644
--- a/qobject/qdict.c
+++ b/qobject/qdict.c
@@ -704,19 +704,16 @@ int qdict_array_entries(QDict *src, const char *subqdict)
 for (i = 0; i < INT_MAX; i++) {
 QObject *subqobj;
 int subqdict_entries;
-size_t slen = 32 + subqdict_len;
-char indexstr[slen], prefix[slen];
-size_t snprintf_ret;
+char *prefix = g_strdup_printf("%s%u.", subqdict, i);
 
-snprintf_ret = snprintf(indexstr, slen, "%s%u", subqdict, i);
-assert(snprintf_ret < slen);
+subqdict_entries = qdict_count_prefixed_entries(src, prefix);
 
-subqobj = qdict_get(src, indexstr);
+/* Remove ending "." */
+prefix[strlen(prefix) - 1] = 0;
+subqobj = qdict_get(src, prefix);
 
-snprintf_ret = snprintf(prefix, slen, "%s%u.", subqdict, i);
-assert(snprintf_ret < slen);
+g_free(prefix);
 
-subqdict_entries = qdict_count_prefixed_entries(src, prefix);
 if (subqdict_entries < 0) {
 return subqdict_entries;
 }
-- 
2.4.3
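
The idea of the patch above (build the "name<index>." prefix on the heap, use it, then chop the trailing dot to get the exact key) can be sketched without GLib. Here snprintf() plus malloc() stands in for g_strdup_printf(); make_prefix() and strip_trailing_dot() are hypothetical helper names, not functions from qdict.c.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Heap-allocate "<subqdict><i>." so no variable-length array is needed.
 * The caller frees the result.  Returns NULL on failure. */
static char *make_prefix(const char *subqdict, unsigned i)
{
    int len = snprintf(NULL, 0, "%s%u.", subqdict, i); /* size query only */
    char *buf;

    if (len < 0) {
        return NULL;
    }
    buf = malloc((size_t)len + 1);
    if (!buf) {
        return NULL;
    }
    snprintf(buf, (size_t)len + 1, "%s%u.", subqdict, i);
    return buf;
}

/* Drop the trailing '.' in place so the same buffer serves as the exact key. */
static void strip_trailing_dot(char *prefix)
{
    size_t len = strlen(prefix);

    if (len > 0 && prefix[len - 1] == '.') {
        prefix[len - 1] = '\0';
    }
}
```

One allocation per loop iteration is acceptable here only because, as the commit message says, the function is not on a hot path.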

Re: [Qemu-devel] [PATCH] qdict: fix unbounded stack for qdict_array_entries

2016-03-21 Thread Peter Xu

On Mon, Mar 21, 2016 at 02:58:25PM -0600, Eric Blake wrote:
> On 03/09/2016 06:36 PM, Peter Xu wrote:
> > Sorry, I forgot to CC Eric/Markus/Kevin.
> > 
> > This patch title is not correct, which should be:
> > 
> > "Fix unbounded stack warning for qdict_array_entries"
> 
> Keep the 'qdict:' prefix, but yes, adding "warning" helps the commit
> message.
> 
> > 
> > Do I need to re-send with the same content?
> 
> For just the title adjustment, it's up to the maintainer.  Often, a
> maintainer will make small changes like that before sending a pull request.
> 
> > 
> > I'm using g_strdup_printf() here, considering it's most convenient,
> > safe, and as long as it's called rarely only when quorum device
> > opens.
> 
> On the other hand, this information might have been useful...
> 
> > 
> > Thanks.
> > Peter
> > 
> > On Wed, Mar 09, 2016 at 02:03:38PM +0800, Peter Xu wrote:
> >> Signed-off-by: Peter Xu 
> 
> ...in the commit body proper (explaining why you are always allocating,
> because it is not a hot path).  So a v2 might indeed be easier.
> 
> >> +++ b/qobject/qdict.c
> >> @@ -704,19 +704,16 @@ int qdict_array_entries(QDict *src, const char 
> >> *subqdict)
> >>  for (i = 0; i < INT_MAX; i++) {
> >>  QObject *subqobj;
> >>  int subqdict_entries;
> >> -size_t slen = 32 + subqdict_len;
> >> -char indexstr[slen], prefix[slen];
> >> -size_t snprintf_ret;
> >> +char *prefix = g_strdup_printf("%s%u.", subqdict, i);
> 
> If we were worried that this could be a hot path, you could add a %n and
>  here...
> 
> >>  
> >> -snprintf_ret = snprintf(indexstr, slen, "%s%u", subqdict, i);
> >> -assert(snprintf_ret < slen);
> >> +subqdict_entries = qdict_count_prefixed_entries(src, prefix);
> >>  
> >> -subqobj = qdict_get(src, indexstr);
> >> +/* Remove ending "." */
> >> +prefix[strlen(prefix) - 1] = 0x00;
> 
> ...to avoid the strlen() call here.  But this is not a hot path, and %n
> always makes me worry about security, so I'm fine with your approach.
> 
> However, 0x00 is a rather verbose way of writing 0 (and even if you want
> verbosity, '\0' is more idiomatic than 0x00).
> 
> At this point, if you send a v2 with s/0x00/0/ and the improved commit
> message, you can also include:
> Reviewed-by: Eric Blake 

Will respin just like above, and with your r-b. Thanks!

-- peterx

Re: [Qemu-devel] [PATCH v5 0/5] ARM: add query-gic-capabilities SMP command

2016-03-21 Thread Peter Xu

On Mon, Mar 21, 2016 at 04:56:07PM +0100, Andrea Bolognani wrote:
> On Fri, 2016-03-18 at 11:27 +0800, Peter Xu wrote:
> > v5 changes:
> > - patch 2: moved to target-arm/monitor.c (from target-arm/machine.c)
> >    [Peter]
> > - patch 3: split into three patches: [all from Peter's comments]
> >   - patch 3 (new): leverage kvm_arm_create_scratch_host_vcpu(), tiny
> > enhancement of old one to suit our needs
> >   - patch 4: introduce kvm_support_device() in kvm-all.c
> >   - patch 5: do the implementation.
> 
> Tested on two separate aarch64 hosts, seems to work fine.

Thanks Andrea!

Sorry I forgot to CC libvir-list. To avoid duplication, not
re-sending but CCing in this reply. If anyone is interested, please
review in the following link:

https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg04465.html

(which points to exactly this thread.)

Sorry for the inconvenience!

-- peterx

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-21 Thread Alex Williamson

On Mon, 21 Mar 2016 21:54:48 -0400
Bandan Das  wrote:

> Alex Williamson  writes:
> 
> > On Mon, 21 Mar 2016 20:06:32 -0400
> > Bandan Das  wrote:
> >  
> >> Alex Williamson  writes:
> >>   
> >> > On Mon, 21 Mar 2016 18:00:50 -0400
> >> > Bandan Das  wrote:
> >> >
> >> >> vfio_listener_region_add for a iommu mr results in
> >> >> an overflow assert since emulated iommu memory region is initialized
> >> >> with UINT64_MAX. Add a check just like memory_region_size()
> >> >> does.
> >> >> 
> >> >> Signed-off-by: Bandan Das 
> >> >> ---
> >> >>  hw/vfio/common.c | 7 ++-
> >> >>  1 file changed, 6 insertions(+), 1 deletion(-)
> >> >> 
> >> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> >> index fb588d8..269244b 100644
> >> >> --- a/hw/vfio/common.c
> >> >> +++ b/hw/vfio/common.c
> >> >> @@ -349,7 +349,12 @@ static void 
> >> >> vfio_listener_region_add(MemoryListener *listener,
> >> >>  if (int128_ge(int128_make64(iova), llend)) {
> >> >>  return;
> >> >>  }
> >> >> -end = int128_get64(llend);
> >> >> +
> >> >> +if (int128_eq(llend, int128_2_64())) {
> >> >> +end = UINT64_MAX;
> >> >> +} else {
> >> >> +end = int128_get64(llend);
> >> >> +}
> >> >>  
> >> >>  if ((iova < container->min_iova) || ((end - 1) > 
> >> >> container->max_iova)) {
> >> >>  error_report("vfio: IOMMU container %p can't map guest IOVA 
> >> >> region"
> >> >
> >> > But now all the calculations where we use end-1 are wrong.  See the
> >> > discussion with Pierre Morel in the January qemu-devel archives.
> >> > There's a solution in there, but I never saw a follow-up from Pierre
> >> > with a revised patch.  Thanks,
> >> 
> >> I am missing something. When end < UINT64_MAX, end - 1 calculations are 
> >> valid because
> >> the patch doesn't change that behavior. When end is UINT64_MAX, 
> >> int128_get64() doesn't know how
> >> to calculate this value and we are just feeding it manually. The patch is 
> >> just the opposite
> >> of what memory_region_init() did to init the mem region in the first place:
> >>mr->size = int128_make64(size);
> >>if (size == UINT64_MAX) {
> >>   mr->size = int128_2_64();
> >>}
> >> So, end - 1 is still valid for end = UINT64_MAX, no ?  
> >
> > int128_2_64() is not equal to UINT64_MAX, so assigning UINT64_MAX to
> > @end is clearly altering the value.  If we had a range from zero to  
> 
> I thought int128_2_64 is the 128 bit representation of UINT64_MAX. The
> if condition in memory_region_init doesn't make sense otherwise.

2^64 cannot be represented with a uint64_t, 2^64 - 1 can:

int128_2_64 = 1_0000_0000_0000_0000h
UINT64_MAX  =   ffff_ffff_ffff_ffffh
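
The distinction can be checked with a small two-word model of a 128-bit value. U128 and its helpers below are hypothetical and only mimic the idea behind int128_2_64(), int128_sub() and the range check that int128_get64() asserts on; they are not QEMU's Int128 implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit unsigned value as two 64-bit halves. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} U128;

static U128 u128_from64(uint64_t v)
{
    return (U128){ .hi = 0, .lo = v };
}

/* 2^64: bit 64 set, low word zero. */
static U128 u128_2_64(void)
{
    return (U128){ .hi = 1, .lo = 0 };
}

/* A value can be narrowed to uint64_t only if the high word is zero. */
static bool u128_fits64(U128 v)
{
    return v.hi == 0;
}

/* a - b with borrow from the low word into the high word. */
static U128 u128_sub(U128 a, U128 b)
{
    U128 r;

    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);
    return r;
}
```

Subtracting 1 from 2^64 in the wide type gives UINT64_MAX, the last address of the range. Narrowing first to UINT64_MAX and then subtracting 1 gives UINT64_MAX - 1, which is the off-by-one under discussion.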
 
> > int128_2_64() then the size of that region is int128_2_64().  If we
> > alter @end to be UINT64_MAX, then the size is only UINT64_MAX and @end
> > - 1 is off by one versus the case where we use the value directly.  
> 
> Ok, you mean something like:
> int128_get64(int128_sub(int128_2_64(), int128_make64(1)));  for (end - 1) ?
> But we still have to deal with (end - iova) when calling vfio_dma_map().
> int128_get64() will definitely assert for iova = 0. 

I don't know that that's the most efficient way to handle it, but @end
represents a different thing once that -1 is imposed, and that needs to
be handled in the rest of the code.

> > You're effectively changing @end to be the last address in the range,  
> 
> No, I think I am changing "end" to what we initially started with for size
> before converting to 128 bit.

Nope, it's the difference between the size of the region and the last
address of the region.

> > but only in some cases, and not adjusting the remaining code to match.
> > Not only that, but the vfio map command is probably going to fail if we
> > pass in such an unaligned size since the mapping granularity is  
> 
> Trying to map such a large region is wrong anyway, I am still trying
> to work out a solution to avoid calling memory_region_init_iommu()
> with UINT64_MAX which is what emulated vt-d currently does.

Right, the address width of the IOMMU on x86 is typically nowhere near
2^64, so if you take the vfio_dma_map path, you'll surely explode.
Does this fix actually fix anything or just move us to the next
assert?  Thanks,

Alex

Re: [Qemu-devel] [PATCH 1/2] block/qapi: make two printf() formats literal

2016-03-21 Thread Peter Xu

On Mon, Mar 21, 2016 at 03:14:48PM -0600, Eric Blake wrote:
> On 03/09/2016 06:46 PM, Peter Xu wrote:
> > 
> > Is this a grammar btw?
> 
> Yes, C has an ugly grammar, because [] is just syntactic sugar for
> dereferencing pointer addition with nicer operator precedence.  Quoting
> C99 6.5.2.1:
> 
> "The definition of the subscript operator [] is that E1[E2] is identical
> to (*((E1)+(E2))).  Because of the conversion rules that apply to the
> binary + operator, if E1 is an array object (equivalently, a pointer to
> the initial element of an array object) and E2 is an integer, E1[E2]
> designates the E2-th element of E1 (counting from zero)."
> 
> And a string literal is just a fancy way of writing the address of an
> array of characters (where the address is chosen by the compiler).
> 
> Thus, it IS valid to dereference the addition of an integer offset with
> the address implied by a string literal in order to obtain a character
> within the string.  And since the [] operator is commutative (even
> though no one in their right mind commutes the operands), you can also
> write the even-uglier:
> 
> composite["\n "]
> 
> But now we've gone far astray from the original patch review :)

Interesting thing to know.  Thanks. :)
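
For reference, the equivalence Eric describes can be tried directly. The function names are made up for this sketch, and the "\n " literal with an index of 0 or 1 is taken from the quoted example; all three forms are the same expression under the C99 6.5.2.1 definition.

```c
#include <assert.h>

/* Conventional form: index into the string literal. */
static char sep_conventional(int composite)
{
    return "\n "[composite];
}

/* Commuted form: legal because E1[E2] is defined as *((E1)+(E2)). */
static char sep_commuted(int composite)
{
    return composite["\n "];
}

/* The explicit pointer arithmetic both forms expand to. */
static char sep_pointer(int composite)
{
    return *("\n " + composite);
}
```

Index 0 selects the newline and index 1 selects the space, whichever spelling is used.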

-- peterx

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-21 Thread Bandan Das

Alex Williamson  writes:

> On Mon, 21 Mar 2016 20:06:32 -0400
> Bandan Das  wrote:
>
>> Alex Williamson  writes:
>> 
>> > On Mon, 21 Mar 2016 18:00:50 -0400
>> > Bandan Das  wrote:
>> >  
>> >> vfio_listener_region_add for a iommu mr results in
>> >> an overflow assert since emulated iommu memory region is initialized
>> >> with UINT64_MAX. Add a check just like memory_region_size()
>> >> does.
>> >> 
>> >> Signed-off-by: Bandan Das 
>> >> ---
>> >>  hw/vfio/common.c | 7 ++-
>> >>  1 file changed, 6 insertions(+), 1 deletion(-)
>> >> 
>> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> >> index fb588d8..269244b 100644
>> >> --- a/hw/vfio/common.c
>> >> +++ b/hw/vfio/common.c
>> >> @@ -349,7 +349,12 @@ static void vfio_listener_region_add(MemoryListener 
>> >> *listener,
>> >>  if (int128_ge(int128_make64(iova), llend)) {
>> >>  return;
>> >>  }
>> >> -end = int128_get64(llend);
>> >> +
>> >> +if (int128_eq(llend, int128_2_64())) {
>> >> +end = UINT64_MAX;
>> >> +} else {
>> >> +end = int128_get64(llend);
>> >> +}
>> >>  
>> >>  if ((iova < container->min_iova) || ((end - 1) > 
>> >> container->max_iova)) {
>> >>  error_report("vfio: IOMMU container %p can't map guest IOVA 
>> >> region"  
>> >
>> > But now all the calculations where we use end-1 are wrong.  See the
>> > discussion with Pierre Morel in the January qemu-devel archives.
>> > There's a solution in there, but I never saw a follow-up from Pierre
>> > with a revised patch.  Thanks,  
>> 
>> I am missing something. When end < UINT64_MAX, end - 1 calculations are valid 
>> because
>> the patch doesn't change that behavior. When end is UINT64_MAX, 
>> int128_get64() doesn't know how
>> to calculate this value and we are just feeding it manually. The patch is 
>> just the opposite
>> of what memory_region_init() did to init the mem region in the first place:
>>mr->size = int128_make64(size);
>>if (size == UINT64_MAX) {
>>   mr->size = int128_2_64();
>>}
>> So, end - 1 is still valid for end = UINT64_MAX, no ?
>
> int128_2_64() is not equal to UINT64_MAX, so assigning UINT64_MAX to
> @end is clearly altering the value.  If we had a range from zero to

I thought int128_2_64 is the 128 bit representation of UINT64_MAX. The
if condition in memory_region_init doesn't make sense otherwise.

> int128_2_64() then the size of that region is int128_2_64().  If we
> alter @end to be UINT64_MAX, then the size is only UINT64_MAX and @end
> - 1 is off by one versus the case where we use the value directly.

Ok, you mean something like:
int128_get64(int128_sub(int128_2_64(), int128_make64(1)));  for (end - 1) ?
But we still have to deal with (end - iova) when calling vfio_dma_map().
int128_get64() will definitely assert for iova = 0. 

> You're effectively changing @end to be the last address in the range,

No, I think I am changing "end" to what we initially started with for size
before converting to 128 bit.

> but only in some cases, and not adjusting the remaining code to match.
> Not only that, but the vfio map command is probably going to fail if we
> pass in such an unaligned size since the mapping granularity is

Trying to map such a large region is wrong anyway, I am still trying
to work out a solution to avoid calling memory_region_init_iommu()
with UINT64_MAX which is what emulated vt-d currently does.

> likely the system page size.  Thanks,
>
> Alex

Re: [Qemu-devel] [PATCH qemu v14 05/18] spapr_iommu: Introduce "enabled" state for TCE table

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:46:53PM +1100, Alexey Kardashevskiy wrote:
> Currently TCE tables are created once at start and their sizes never
> change. We are going to change that by introducing a Dynamic DMA windows
> support where DMA configuration may change during the guest execution.
> 
> This changes spapr_tce_new_table() to create an empty zero-size IOMMU
> memory region (IOMMU MR). Only LIOBN is assigned by the time of creation.
> It still will be called once at the owner object (VIO or PHB) creation.
> 
> This introduces an "enabled" state for TCE table objects with two
> helper functions - spapr_tce_table_enable()/spapr_tce_table_disable().
> - spapr_tce_table_enable() receives TCE table parameters, allocates
> a guest view of the TCE table (in the user space or KVM) and
> sets the correct size on the IOMMU MR.
> - spapr_tce_table_disable() disposes the table and resets the IOMMU MR
> size.
> 
> This changes the PHB reset handler to do the default DMA initialization
> instead of spapr_phb_realize(). This does not make a difference now but
> later with more than just one DMA window, we will have to remove them all
> and create the default one on a system reset.
> 
> No visible change in behaviour is expected except the actual table
> will be reallocated every reset. We might optimize this later.
> 
> The other way to implement this would be dynamically create/remove
> the TCE table QOM objects but this would make migration impossible
> as the migration code expects all QOM objects to exist at the receiver
> so we have to have TCE table objects created when migration begins.
> 
> spapr_tce_table_do_enable() is separated from spapr_tce_table_enable()
> as later it will be called at the sPAPRTCETable post-migration stage when
> it already has all the properties set after the migration; the same is
> done for spapr_tce_table_disable().
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: David Gibson 

R-b stands, but I noticed one nit:

> ---
> Changes:
> v14:
> * added spapr_tce_table_do_disable(), will make difference in following
> patch with fully dynamic table migration
> ---
>  hw/ppc/spapr_iommu.c   | 86 
> --
>  hw/ppc/spapr_pci.c | 13 ++--
>  hw/ppc/spapr_vio.c |  8 ++---
>  include/hw/ppc/spapr.h | 10 +++---
>  4 files changed, 81 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 8132f64..9bcd3f6 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -17,6 +17,7 @@
>   * License along with this library; if not, see 
> <http://www.gnu.org/licenses/>.
>   */
>  #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>  #include "hw/hw.h"
>  #include "sysemu/kvm.h"
>  #include "hw/qdev.h"
> @@ -174,15 +175,9 @@ static int spapr_tce_table_realize(DeviceState *dev)
>  sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>  
>  tcet->fd = -1;
> -tcet->table = spapr_tce_alloc_table(tcet->liobn,
> -tcet->page_shift,
> -tcet->nb_table,
> -&tcet->fd,
> -tcet->need_vfio);
> -
> +tcet->need_vfio = false;
>  memory_region_init_iommu(&tcet->iommu, OBJECT(dev), &spapr_iommu_ops,
> - "iommu-spapr",
> - (uint64_t)tcet->nb_table << tcet->page_shift);
> + "iommu-spapr", 0);
>  
>  QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
>  
> @@ -224,14 +219,10 @@ void spapr_tce_set_need_vfio(sPAPRTCETable *tcet, bool 
> need_vfio)
>  tcet->table = newtable;
>  }
>  
> -sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
> -   uint64_t bus_offset,
> -   uint32_t page_shift,
> -   uint32_t nb_table,
> -   bool need_vfio)
> +sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn)
>  {
>  sPAPRTCETable *tcet;
> -char tmp[64];
> +char tmp[32];
>  
>  if (spapr_tce_find_by_liobn(liobn)) {
>  fprintf(stderr, "Attempted to create TCE table with duplicate"
> @@ -239,16 +230,8 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, 
> uint32_t liobn,
>  return NULL;
>  }
>  
> -if (!nb_table) {
> -return NULL;
> -}
> -
>  tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE));
>  tcet->liobn = liobn;
> -tcet->bus_offset = bus_offset;
> -tcet->page_shift = page_shift;
> -tcet->nb_table = nb_table;
> -tcet->need_vfio = need_vfio;
>  
>  snprintf(tmp, sizeof(tmp), "tce-table-%x", liobn);
>  object_property_add_child(OBJECT(owner), tmp, OBJECT(tcet), NULL);
> @@ -258,14 +241,69 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, 
> uint32_t liobn,
>  return tcet;

Re: [Qemu-devel] [PATCH qemu v14 03/18] spapr_pci: Move DMA window enablement to a helper

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:46:51PM +1100, Alexey Kardashevskiy wrote:
> We are going to have multiple DMA windows soon so let's start preparing.
> 
> This adds a new helper to create a DMA window and makes use of it in
> sPAPRPHBState::realize().
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

With one tweak..

> ---
> Changes:
> v14:
> * replaced "int" return to Error* in spapr_phb_dma_window_enable()
> ---
>  hw/ppc/spapr_pci.c | 47 ++-
>  1 file changed, 34 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 79baa7b..18332bf 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -803,6 +803,33 @@ static char *spapr_phb_get_loc_code(sPAPRPHBState *sphb, 
> PCIDevice *pdev)
>  return buf;
>  }
>  
> +static void spapr_phb_dma_window_enable(sPAPRPHBState *sphb,
> +   uint32_t liobn,
> +   uint32_t page_shift,
> +   uint64_t window_addr,
> +   uint64_t window_size,
> +   Error **errp)
> +{
> +sPAPRTCETable *tcet;
> +uint32_t nb_table = window_size >> page_shift;
> +
> +if (!nb_table) {
> +error_setg(errp, "Zero size table");
> +return;
> +}
> +
> +tcet = spapr_tce_new_table(DEVICE(sphb), liobn, window_addr,
> +   page_shift, nb_table, false);
> +if (!tcet) {
> +error_setg(errp, "Unable to create TCE table liobn %x for %s",
> +   liobn, sphb->dtbusname);
> +return;
> +}
> +
> +memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> +spapr_tce_get_iommu(tcet));
> +}
> +
>  /* Macros to operate with address in OF binding to PCI */
>  #define b_x(x, p, l)(((x) & ((1<<(l))-1)) << (p))
>  #define b_n(x)  b_x((x), 31, 1) /* 0 if relocatable */
> @@ -1307,8 +1334,7 @@ static void spapr_phb_realize(DeviceState *dev, Error 
> **errp)
>  int i;
>  PCIBus *bus;
>  uint64_t msi_window_size = 4096;
> -sPAPRTCETable *tcet;
> -uint32_t nb_table;
> +Error *local_err = NULL;
>  
>  if (sphb->index != (uint32_t)-1) {
>  hwaddr windows_base;
> @@ -1460,18 +1486,13 @@ static void spapr_phb_realize(DeviceState *dev, Error 
> **errp)
>  }
>  }
>  
> -nb_table = sphb->dma_win_size >> SPAPR_TCE_PAGE_SHIFT;
> -tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
> -   0, SPAPR_TCE_PAGE_SHIFT, nb_table, false);
> -if (!tcet) {
> -error_setg(errp, "Unable to create TCE table for %s",
> -   sphb->dtbusname);
> -return;
> -}
> -
>  /* Register default 32bit DMA window */
> -memory_region_add_subregion(&sphb->iommu_root, sphb->dma_win_addr,
> -spapr_tce_get_iommu(tcet));
> +spapr_phb_dma_window_enable(sphb, sphb->dma_liobn, SPAPR_TCE_PAGE_SHIFT,
> +sphb->dma_win_addr, sphb->dma_win_size,
> +&local_err);
> +if (local_err) {
> +error_propagate(errp, local_err);

Should be a return; here so we don't continue if there's an error.

Actually.. that's not really right, we should be cleaning up all setup
we've done already on the failure path.  Without that I think we'll
leak some objects on a failed device_add.

But.. there are already a bunch of cases here that will do that, so we
can clean that up separately.  Probably the sanest way would be to add
an unrealize() function that can handle a partially realized object
and make sure it's called on all the error paths.
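
The two points above (return on error, and undo partial setup through one cleanup routine) can be sketched as follows. This is a minimal illustration only: the Error struct, the Phb struct and every function name here are hypothetical stand-ins, not QEMU's Error API or the sPAPR PHB code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of an error object; NOT QEMU's Error API. */
typedef struct { const char *msg; } Error;

/* Hypothetical device with three setup stages. */
typedef struct {
    bool bus_ready;
    bool window_ready;
    bool msi_ready;
} Phb;

/* Stand-in for a helper that reports failure through errp. */
static void window_enable(Phb *phb, unsigned nb_table, Error **errp)
{
    static Error zero_size = { "Zero size table" };

    if (!nb_table) {
        *errp = &zero_size;
        return;
    }
    phb->window_ready = true;
}

/* Undo whatever realize set up; safe on a partially realized object. */
static void phb_unrealize(Phb *phb)
{
    phb->msi_ready = false;
    phb->window_ready = false;
    phb->bus_ready = false;
}

static void phb_realize(Phb *phb, unsigned nb_table, Error **errp)
{
    Error *local_err = NULL;

    phb->bus_ready = true;

    window_enable(phb, nb_table, &local_err);
    if (local_err) {
        *errp = local_err;      /* hand the error to the caller */
        phb_unrealize(phb);     /* release the earlier stages */
        return;                 /* do not continue with later stages */
    }

    phb->msi_ready = true;
}
```

With a zero-sized window, phb_realize() reports the error and leaves no stage marked ready; with a non-zero size all three stages complete.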

> +}
>  
>  sphb->msi = g_hash_table_new_full(g_int_hash, g_int_equal, g_free, 
> g_free);
>  }

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH qemu v14 01/18] memory: Fix IOMMU replay base address

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 06:46:49PM +1100, Alexey Kardashevskiy wrote:
> Since a788f227 "memory: Allow replay of IOMMU mapping notifications"
> when new VFIO listener is added, all existing IOMMU mappings are
> replayed. However there is a problem that the base address of
> an IOMMU memory region (IOMMU MR) is ignored which is not a problem
> for the existing user (which is pseries) with its default 32bit DMA
> window starting at 0 but it is if there is another DMA window.
> 
> This stores the IOMMU's offset_within_address_space and adjusts
> the IOVA before calling vfio_dma_map/vfio_dma_unmap.
> 
> As the IOMMU notifier expects IOVA offset rather than the absolute
> address, this also adjusts IOVA in sPAPR H_PUT_TCE handler before
> calling notifier(s).
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: David Gibson 

On a closer look, I realised this still isn't quite correct, although
I don't think any cases which would break it exist or are planned.

> ---
>  hw/ppc/spapr_iommu.c  |  2 +-
>  hw/vfio/common.c  | 14 --
>  include/hw/vfio/vfio-common.h |  1 +
>  3 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 7dd4588..277f289 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -277,7 +277,7 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, 
> target_ulong ioba,
>  tcet->table[index] = tce;
>  
>  entry.target_as = &address_space_memory,
> -entry.iova = ioba & page_mask;
> +entry.iova = (ioba - tcet->bus_offset) & page_mask;
>  entry.translated_addr = tce & page_mask;
>  entry.addr_mask = ~page_mask;
>  entry.perm = spapr_tce_iommu_access_flags(tce);

This bit's right.

> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fb588d8..d45e2db 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -257,14 +257,14 @@ static void vfio_iommu_map_notify(Notifier *n, void 
> *data)
>  VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>  VFIOContainer *container = giommu->container;
>  IOMMUTLBEntry *iotlb = data;
> +hwaddr iova = iotlb->iova + giommu->offset_within_address_space;

This bit might be right, depending on how you define 
giommu->offset_within_address_space.

>  MemoryRegion *mr;
>  hwaddr xlat;
>  hwaddr len = iotlb->addr_mask + 1;
>  void *vaddr;
>  int ret;
>  
> -trace_vfio_iommu_map_notify(iotlb->iova,
> -iotlb->iova + iotlb->addr_mask);
> +trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
>  
>  /*
>   * The IOMMU TLB entry we have just covers translation through
> @@ -291,21 +291,21 @@ static void vfio_iommu_map_notify(Notifier *n, void 
> *data)
>  
>  if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>  vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -ret = vfio_dma_map(container, iotlb->iova,
> +ret = vfio_dma_map(container, iova,
> iotlb->addr_mask + 1, vaddr,
> !(iotlb->perm & IOMMU_WO) || mr->readonly);
>  if (ret) {
>  error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>   "0x%"HWADDR_PRIx", %p) = %d (%m)",
> - container, iotlb->iova,
> + container, iova,
>   iotlb->addr_mask + 1, vaddr, ret);
>  }
>  } else {
> -ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> +ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1);
>  if (ret) {
>  error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>   "0x%"HWADDR_PRIx") = %d (%m)",
> - container, iotlb->iova,
> + container, iova,
>   iotlb->addr_mask + 1, ret);
>  }
>  }

This is fine.

> @@ -377,6 +377,8 @@ static void vfio_listener_region_add(MemoryListener 
> *listener,
>   */
>  giommu = g_malloc0(sizeof(*giommu));
>  giommu->iommu = section->mr;
> +giommu->offset_within_address_space =
> +section->offset_within_address_space;

But here there's a problem.  The iova in IOMMUTLBEntry is relative to
the IOMMU MemoryRegion, but - at least in theory - only a subsection
of that MemoryRegion could be mapped into the AddressSpace.

So, to find the IOVA within the AddressSpace, from the IOVA within the
MemoryRegion, you need to first subtract the section's offset within
the MemoryRegion, then add the section's offset within the
AddressSpace.
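
As a rough sketch of that arithmetic (the struct and function names below are hypothetical; only the two offset field names mirror MemoryRegionSection):

```c
#include <stdint.h>

/* Hypothetical stand-in for the two MemoryRegionSection fields involved. */
typedef struct {
    uint64_t offset_within_region;        /* section start inside the IOMMU MR */
    uint64_t offset_within_address_space; /* section start inside the AS */
} SectionOffsets;

/*
 * Combined delta between an IOVA in the MemoryRegion and the same IOVA
 * in the AddressSpace. Unsigned arithmetic wraps modulo 2^64, so adding
 * the delta still gives the right answer when offset_within_region is
 * the larger of the two.
 */
static uint64_t iova_delta(const SectionOffsets *s)
{
    return s->offset_within_address_space - s->offset_within_region;
}

/* Subtract the offset within the region, then add the offset within the AS. */
static uint64_t iova_in_address_space(const SectionOffsets *s,
                                      uint64_t iova_in_mr)
{
    return iova_in_mr + iova_delta(s);
}
```

When the whole MemoryRegion is mapped, offset_within_region is 0 and this reduces to adding offset_within_address_space, which is what the patch does.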

You could precalculate the combined delta here, but...

>  giommu->container = container;
>  giommu->n.notify = vfio_iommu_map_notify;
>  QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>

Re: [Qemu-devel] [PATCH 0/4] Tweaks around virtio-blk start/stop

2016-03-21 Thread Fam Zheng

On Mon, 03/21 15:19, Cornelia Huck wrote:
> On Mon, 21 Mar 2016 14:54:04 +0100
> Paolo Bonzini  wrote:
> 
> > On 21/03/2016 14:47, TU BO wrote:
> > >> I'll see if I can produce something based on Conny's patches, which are
> > >> already a start.  Today I had a short day so I couldn't play with the
> > >> bug; out of curiosity, does the bug reproduce with her work + patch 4
> > >> from this series + the reentrancy assertion?
> > > I did NOT see crash with qemu master + "[PATCH RFC 0/6] virtio: refactor
> > > host notifiers" from Conny + patch 4 + assertion.  thx
> > 
> > That's unexpected, but I guess it only says that I didn't review her
> > patches well enough. :)
> 
> I'm also a bit surprised, the only thing that should really be
> different is passing the 'assign' argument in stop_ioeventfd(). Any
> other fixes are purely accidental :)
> 
> Would be interesting to see how this setup fares with virtio-pci.
> 

Seems to fix the assertion I'm hitting too.

Fam

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-21 Thread Alex Williamson

On Mon, 21 Mar 2016 20:06:32 -0400
Bandan Das  wrote:

> Alex Williamson  writes:
> 
> > On Mon, 21 Mar 2016 18:00:50 -0400
> > Bandan Das  wrote:
> >  
> >> vfio_listener_region_add for a iommu mr results in
> >> an overflow assert since emulated iommu memory region is initialized
> >> with UINT64_MAX. Add a check just like memory_region_size()
> >> does.
> >> 
> >> Signed-off-by: Bandan Das 
> >> ---
> >>  hw/vfio/common.c | 7 ++-
> >>  1 file changed, 6 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index fb588d8..269244b 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -349,7 +349,12 @@ static void vfio_listener_region_add(MemoryListener 
> >> *listener,
> >>  if (int128_ge(int128_make64(iova), llend)) {
> >>  return;
> >>  }
> >> -end = int128_get64(llend);
> >> +
> >> +if (int128_eq(llend, int128_2_64())) {
> >> +end = UINT64_MAX;
> >> +} else {
> >> +end = int128_get64(llend);
> >> +}
> >>  
> >>  if ((iova < container->min_iova) || ((end - 1) > 
> >> container->max_iova)) {
> >>  error_report("vfio: IOMMU container %p can't map guest IOVA 
> >> region"  
> >
> > But now all the calculations where we use end-1 are wrong.  See the
> > discussion with Pierre Morel in the January qemu-devel archives.
> > There's a solution in there, but I never saw a follow-up from Pierre
> > with a revised patch.  Thanks,  
> 
> I am missing something. When end < UINT64_MAX, end - 1 calculations are valid
> because the patch doesn't change that behavior. When end is UINT64_MAX,
> int128_get64() doesn't know how to calculate this value and we are just
> feeding it manually. The patch is just the opposite of what
> memory_region_init() did to init the mem region in the first place:
>mr->size = int128_make64(size);
>if (size == UINT64_MAX) {
>   mr->size = int128_2_64();
>}
> So, end - 1 is still valid for end = UINT64_MAX, no ?

int128_2_64() is not equal to UINT64_MAX, so assigning UINT64_MAX to
@end is clearly altering the value.  If we had a range from zero to
int128_2_64() then the size of that region is int128_2_64().  If we
alter @end to be UINT64_MAX, then the size is only UINT64_MAX and @end
- 1 is off by one versus the case where we use the value directly.
You're effectively changing @end to be the last address in the range,
but only in some cases, and not adjusting the remaining code to match.
Not only that, but the vfio map command is probably going to fail if we
pass in such an unaligned size since the mapping granularity is
likely the system page size.  Thanks,

Alex
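
The off-by-one described above can be reproduced in a few lines. This is only a sketch: unsigned __int128 (a GCC/Clang extension on 64-bit targets) stands in for QEMU's Int128, and both function names are made up for the illustration.

```c
#include <stdint.h>

/* GCC/Clang extension used here as a stand-in for QEMU's Int128. */
typedef unsigned __int128 u128;

/* Last address of [0, llend), computed before narrowing to 64 bits. */
static uint64_t last_addr_exact(u128 llend)
{
    return (uint64_t)(llend - 1);
}

/*
 * Same question, but the exclusive end is first clamped to UINT64_MAX
 * when it equals 2^64, and only then used as "end - 1".
 */
static uint64_t last_addr_clamped(u128 llend)
{
    u128 two_64 = (u128)1 << 64;
    uint64_t end = (llend == two_64) ? UINT64_MAX : (uint64_t)llend;

    return end - 1;
}
```

For a region ending at 2^64 the exact version yields UINT64_MAX as the last address, while the clamped version yields UINT64_MAX - 1; for any smaller end the two agree.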

Re: [Qemu-devel] [PATCH RFC 1/6] virtio-bus: common ioeventfd infrastructure

2016-03-21 Thread Fam Zheng

On Thu, 03/17 11:01, Cornelia Huck wrote:
> Introduce a set of ioeventfd callbacks on the virtio-bus level
> that can be implemented by the individual transports. At the
> virtio-bus level, do common handling for host notifiers (which
> is actually most of it).
> 
> Two things of note:
> - We always iterate over all possible virtio queues, even though
> ccw (currently) has a lower limit. It does not really matter in
> this place.
> - We allow for the virtio-bus caller to pass an "assign" argument
> down when stopping ioeventfd, which the old interface did not allow.
> 
> Signed-off-by: Cornelia Huck 
> ---
>  hw/virtio/virtio-bus.c | 108 
> +
>  include/hw/virtio/virtio-bus.h |  14 ++
>  2 files changed, 122 insertions(+)
> 
> diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c
> index 574f0e2..501300f 100644
> --- a/hw/virtio/virtio-bus.c
> +++ b/hw/virtio/virtio-bus.c
> @@ -146,6 +146,114 @@ void virtio_bus_set_vdev_config(VirtioBusState *bus, 
> uint8_t *config)
>  }
>  }
>  
> +static int set_host_notifier_internal(DeviceState *proxy, VirtioBusState 
> *bus,
> +  int n, bool assign, bool set_handler)
> +{
> +VirtIODevice *vdev = virtio_bus_get_device(bus);
> +VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(bus);
> +VirtQueue *vq = virtio_get_queue(vdev, n);
> +EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
> +int r = 0;
> +
> +if (assign) {
> +r = event_notifier_init(notifier, 1);
> +if (r < 0) {
> +error_report("%s: unable to init event notifier: %d", __func__, 
> r);
> +return r;
> +}
> +virtio_queue_set_host_notifier_fd_handler(vq, true, set_handler);
> +r = k->ioeventfd_assign(proxy, notifier, n, assign);
> +if (r < 0) {
> +error_report("%s: unable to assign ioeventfd: %d", __func__, r);
> +virtio_queue_set_host_notifier_fd_handler(vq, false, false);
> +event_notifier_cleanup(notifier);
> +return r;
> +}
> +} else {
> +virtio_queue_set_host_notifier_fd_handler(vq, false, false);
> +k->ioeventfd_assign(proxy, notifier, n, assign);
> +event_notifier_cleanup(notifier);
> +}
> +return r;
> +}
> +
> +void virtio_bus_start_ioeventfd(VirtioBusState *bus)
> +{
> +VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(bus);
> +DeviceState *proxy = DEVICE(BUS(bus)->parent);
> +VirtIODevice *vdev;
> +int n, r;
> +
> +if (!k->ioeventfd_started || k->ioeventfd_started(proxy)) {
> +return;
> +}
> +if (!k->ioeventfd_disabled(proxy)) {
> +return;
> +}
> +vdev = virtio_bus_get_device(bus);
> +for (n = 0; n < VIRTIO_QUEUE_MAX; n++) {
> +if (!virtio_queue_get_num(vdev, n)) {
> +continue;
> +}
> +r = set_host_notifier_internal(proxy, bus, n, true, true);
> +if (r < 0) {
> +goto assign_error;
> +}
> +}
> +k->ioeventfd_set_started(proxy, true, false);
> +return;
> +
> +assign_error:
> +while (--n >= 0) {
> +if (!virtio_queue_get_num(vdev, n)) {
> +continue;
> +}
> +
> +r = set_host_notifier_internal(proxy, bus, n, false, false);
> +assert(r >= 0);
> +}
> +k->ioeventfd_set_started(proxy, false, true);
> +error_report("%s: failed. Fallback to userspace (slower).", __func__);
> +}
> +
> +void virtio_bus_stop_ioeventfd(VirtioBusState *bus, bool assign)
> +{
> +VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(bus);
> +DeviceState *proxy = DEVICE(BUS(bus)->parent);
> +VirtIODevice *vdev;
> +int n, r;
> +
> +if (!k->ioeventfd_started || !k->ioeventfd_started(proxy)) {
> +return;
> +}
> +vdev = virtio_bus_get_device(bus);
> +for (n = 0; n < VIRTIO_QUEUE_MAX; n++) {
> +if (!virtio_queue_get_num(vdev, n)) {
> +continue;
> +}
> +r = set_host_notifier_internal(proxy, bus, n, assign, false);
> +assert(r >= 0);
> +}
> +k->ioeventfd_set_started(proxy, false, false);
> +}
> +
> +int virtio_bus_set_host_notifier(VirtioBusState *bus, int n, bool assign)
> +{
> +VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(bus);
> +DeviceState *proxy = DEVICE(BUS(bus)->parent);
> +
> +if (!k->ioeventfd_started) {
> +return -ENOSYS;
> +}
> +/* Stop using the generic ioeventfd, we are doing eventfd handling
> + * ourselves below */
> +k->ioeventfd_set_disabled(proxy, assign);
> +if (assign) {
> +virtio_bus_stop_ioeventfd(bus, assign);
> +}
> +return set_host_notifier_internal(proxy, bus, n, assign, false);
> +}
> +
>  static char *virtio_bus_get_dev_path(DeviceState *dev)
>  {
>  BusState *bus = qdev_get_parent_bus(dev);
> diff --git a/include/hw/virtio/virtio-bus.h b/include/hw/virtio/virtio-bus.h
> index 3f2c136..0281cbf

Re: [Qemu-devel] [RFC PATCH v2 0/9] Core based CPU hotplug for PowerPC sPAPR

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 11:43:34AM +0100, Igor Mammedov wrote:
> On Mon, 21 Mar 2016 14:57:53 +1100
> David Gibson  wrote:
> 
> > On Fri, Mar 18, 2016 at 08:59:32AM +0530, Bharata B Rao wrote:
> > > On Thu, Mar 17, 2016 at 09:03:43PM +1100, David Gibson wrote:  
> > > > On Wed, Mar 16, 2016 at 04:48:50PM +0100, Igor Mammedov wrote:  
> > > > > On Wed, 16 Mar 2016 09:18:03 +0530
> > > > > Bharata B Rao  wrote:
> > > > >   
> > > > > > On Mon, Mar 14, 2016 at 10:47:28AM +0100, Igor Mammedov wrote:  
> > > > > > > On Fri, 11 Mar 2016 10:24:29 +0530
> > > > > > > Bharata B Rao  wrote:
> > > > > > > 
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > This is the next version of "Core based CPU hotplug for PowerPC 
> > > > > > > > sPAPR" that
> > > > > > > > was posted at
> > > > > > > > https://lists.gnu.org/archive/html/qemu-ppc/2016-03/msg00081.html
> > > > > > > > 
> > > > > > > > device_add semantics
> > > > > > > > 
> > > > > > > > For -smp 16,sockets=1,cores=2,threads=8,maxcpus=32
> > > > > > > > (qemu) device_add 
> > > > > > > > spapr-cpu-core,id=core2,core=16,cpu_model=host[,threads=8]
> > > > > > > do you plan to allow user to hotplug different cpu_models?
> > > > > > > If not it would be better to hide cpu_model from user
> > > > > > > and set it from machine pre_plug handler.
> > > > > > 
> > > > > > > In my earlier implementations I derived cpu model from -cpu and
> > > > > > > threads from -smp,threads= commandline options and never exposed
> > > > > > > them to device_add command.
> > > > > > > 
> > > > > > > Though we don't support heterogeneous systems (different cpu models
> > > > > > > and/or threads) now, it was felt that it should be easy enough to
> > > > > > > support such systems if required in future, that's how cpu_model and
> > > > > > > threads became options for device_add.
> > > > > > > 
> > > > > > > One of the things that David felt was missing from my earlier QMP
> > > > > > > query command (and which is true in your QMP query implementation
> > > > > > > also) is that we aren't exporting cpu_model at all, at least for
> > > > > > > not-yet-plugged cores. So should we include that or let management
> > > > > > > figure that out since it would already know about the CPU model.
> > > > > 1.
> > > > > so since you are not planning supporting heterogeneous setup yet,
> > > > > I'd suggest to refrain from making user to provide cpu_model at
> > > > > device_add time. Instead make machine code to set it for cores it
> > > > > creates before core.realize() (yet another use for pre_plug()).
> > > > > 
> > > > > That way mgmt doesn't have to figure out what cpu_model to set at
> > > > > device_add time and doesn't have find out what property to use for 
> > > > > it.  
> > > > 
> > > > Yes.. of course you could also do the same thing for nr_threads, so
> > > > I'm wondering whether there's a good argument to keep one in
> > > > pre_plug() and one in query-hotpluggable-cpus.  
> > > 
> > > Right, so what should be the way forward ? Should we keep cpu_model= and
> > > threads= options with device_add or just threads=  or neither ?  
> > 
> > I'm inclined to keep them both in device_add - I like the idea of
> > having an example on day 0 of advertising extra properties (beyond
> > nr_threads and location) to set from query-hotpluggable-cpus.
> > 
> > But, I'd probably change my mind if Igor or someone has a stronger
> > opinion.
> I don't have a strong opinion on this, but you have to keep in mind
> that once you make it ABI you would probably have to maintain it forever.
> 
> So far 'threads' and 'cpu_model' look like constant values,
> fixed at start-up time for every core.
> Taking into account that the user is not supposed to change them at
> hotplug time and that they are the same for every core,
> I'd go for the conservative route and hide them in pre_plug() for now.
> You always can expose them later if needed.

Hm, yes, I see your point.  Hiding them in pre-plug does also make
life more convenient for someone (not libvirt) manually experimenting
with this - less to type on their device-add command.

> > If we advertise cpu_model, however, it should probably be changed to
> > cpu thread class name, since IIUC that's an existing advertised part
> > of the QOM interface, but cpu_model isn't.
> I still think that spapr-cpu-core should be an abstract type
> with a concrete set of derived cores types per each thread type.
> But this question is not related to hotplug, but rather to
> start-up of QEMU from scratch with -device and supported types
> discovery. So I'd postpone question for later and that's yet another
> reason why I'd like to hide cpu_model from user for now.

Ah.. yes, I think you're probably right, and we should have derived
types.  It's a little bit awkward for spapr, but we can

Re: [Qemu-devel] [RFC PATCH 1/2] target-ppc: migrate interrupt vectors address for spapr VM

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 05:51:22PM +0100, Cédric Le Goater wrote:
> On 03/21/2016 05:18 PM, Thomas Huth wrote:
> > On 21.03.2016 15:02, Cédric Le Goater wrote:
> >> This address is changed by the linux kernel using the H_SET_MODE hcall
> >> and needs to be migrated in order to restart a spapr VM running in
> >> TCG. Other platforms should not be affected.
> >>
> >> Signed-off-by: Cédric Le Goater 
> >> ---
> >>  target-ppc/machine.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/target-ppc/machine.c b/target-ppc/machine.c
> >> index 692121e98319..a418d463db83 100644
> >> --- a/target-ppc/machine.c
> >> +++ b/target-ppc/machine.c
> >> @@ -553,6 +553,9 @@ const VMStateDescription vmstate_ppc_cpu = {
> >>  VMSTATE_UINTTL(env.hflags_nmsr, PowerPCCPU),
> >>  /* FIXME: access_type? */
> >>  
> >> +/* Effective Address of interrupt vectors */
> >> +VMSTATE_UINTTL(env.excp_prefix, PowerPCCPU),
> >> +
> >>  /* Sanity checking */
> >>  VMSTATE_UINTTL_EQUAL(env.msr_mask, PowerPCCPU),
> >>  VMSTATE_UINT64_EQUAL(env.insns_flags, PowerPCCPU),
> > 
> > I'm really no expert with all this migration stuff, but don't you have
> > to bump the version_id when you add new fields to the vmstate?
> > ... and/or use VMSTATE_UINTTL_V() so that migration from older versions
> > of QEMU to the current one also still works with KVM? For example, is it
> > still possible to migrate from QEMU 2.5 to QEMU 2.6 in KVM if you only
> > use VMSTATE_UINTTL without the _V suffix?
> 
> Yes. You are right. I think we need something like below.
> 
> Thanks,
> 
> C.
> 
> 
> target-ppc: migrate interrupt vectors address for spapr VM
> 
> This address is changed by the linux kernel using the H_SET_MODE hcall
> and needs to be migrated in order to restart a spapr VM running in
> TCG. Other platforms should not be affected.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  target-ppc/machine.c |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> Index: qemu-dgibson-for-2.6.git/target-ppc/machine.c
> ===
> --- qemu-dgibson-for-2.6.git.orig/target-ppc/machine.c
> +++ qemu-dgibson-for-2.6.git/target-ppc/machine.c
> @@ -522,7 +522,7 @@ static const VMStateDescription vmstate_
>  
>  const VMStateDescription vmstate_ppc_cpu = {
>  .name = "cpu",
> -.version_id = 5,
> +.version_id = 6,
>  .minimum_version_id = 5,
>  .minimum_version_id_old = 4,
>  .load_state_old = cpu_load_old,
> @@ -553,6 +553,9 @@ const VMStateDescription vmstate_ppc_cpu
>  VMSTATE_UINTTL(env.hflags_nmsr, PowerPCCPU),
>  /* FIXME: access_type? */
>  
> +/* Effective Address of interrupt vectors */
> +VMSTATE_UINTTL_V(env.excp_prefix, PowerPCCPU, 6),


So, I dislike putting what's essentially emulator internal state (as
opposed to architected state) into the migration stream if we can
possibly avoid it.

I think recalculating excp_prefix from the MSR on incoming migration
is the correct approach here - I see that there are bugs with that in
the other patch, but so far I'm not seeing a reason to migrate
excp_prefix itself.
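
A minimal sketch of what recalculating on the incoming side could look like. The bit position and prefix constant below are assumptions chosen for illustration, and the function name is hypothetical; none of this is taken from the patch, and it presumes the migrated MSR value actually carries the EP bit.

```c
#include <stdint.h>

/* Assumed values for illustration only. */
#define EXAMPLE_MSR_EP      6                     /* assumed EP bit position */
#define EXAMPLE_EXCP_PREFIX UINT64_C(0xFFF00000)  /* assumed high-vector prefix */

/*
 * Derive the exception prefix from the architected MSR value instead of
 * carrying the derived field in the migration stream.
 */
static uint64_t example_excp_prefix_from_msr(uint64_t msr)
{
    return ((msr >> EXAMPLE_MSR_EP) & 1) * EXAMPLE_EXCP_PREFIX;
}
```

A post-load hook would call this once the MSR has been restored, so the stream format itself stays unchanged.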

>  /* Sanity checking */
>  VMSTATE_UINTTL_EQUAL(env.msr_mask, PowerPCCPU),
>  VMSTATE_UINT64_EQUAL(env.insns_flags, PowerPCCPU),
> 
> 
>  
> >  Thomas
> > 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH v2 0/3] hw/net/spapr_llan: Fix bad RX performance of the spapr-vlan device

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 05:25:21PM +0100, Thomas Huth wrote:
> These patches fix the bad receive performance of the spapr-vlan device
> by introducing proper receive buffer pools of different sizes. Details
> can be found in the patch description of the second patch.

Applied to ppc-for-2.6, thanks.

> 
> v2:
> - Added "Reviewed-by"s to patch 1 and 3
> - Fixed one remaining problem with the buffer sorting in patch 2
>   and improved one of the comments as suggested by David.
> 
> Thomas Huth (3):
>   hw/net/spapr_llan: Extract rx buffer code into separate functions
>   hw/net/spapr_llan: Fix receive buffer handling for better performance
>   hw/net/spapr_llan: Enable the RX buffer pools by default for new
> machines
> 
>  hw/net/spapr_llan.c | 320 
> ++--
>  hw/ppc/spapr.c  |   7 +-
>  2 files changed, 290 insertions(+), 37 deletions(-)
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC PATCH 2/2] target-ppc: fix interrupt vectors address migration

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 03:02:08PM +0100, Cédric Le Goater wrote:
> commit 2360b6e84f78 ("target-ppc: force update of msr bits in
> cpu_post_load") introduced a change to restore env->excp_prefix of a
> guest which could have altered its MSR_EP. To do this, cpu_post_load()
> invalidates msr and then calls ppc_store_msr() with the expected value
> in argument.
> 
> The problem is that ppc_store_msr() relies on a 'valid' current msr
> before changing its value. The MSR_HVB and MSR_TGPR bits are excluded
> from the msr reset to keep the checks valid but the MSR_IR, MSR_DR,
> MSR_EP bits which are also used through the msr_{ir,dr,ep} macros, are
> reset.
> 
> This is an issue for CPUs not using MSR_EP, on the spapr platform for
> instance but all book3s are impacted. If excp_prefix is restored to
> some value, it will be reset by this call, causing an ISEG exception
> on spapr guests.
> 
> This patch proposal is to test the msr_mask before actually testing
> the MSR_EP bit and protect excp_prefix.
> 
> Signed-off-by: Cédric Le Goater 
> ---
> 
>  Should we just move the test in cpu_post_load() and not reset MSR_EP
>  if it is not present in msr_mask ? like this is done for MSR_HVB and
>  MSR_TGPR. I think this is making assumptions on what ppc_store_msr()
>  is up to though.
> 
>  Maybe we could add a POWERPC_FLAGS_ for this purpose ? or test the
>  excp_model ?
>  
>  There is room for improvement in ppc_store_msr(). It might need a new
>  helper like ppc_restore_msr() ?
> 
>  Suggestions welcomed.

So, IIUC, in spapr MSR[EP] can't be set with mtmsr or rfid, but can be
set with H_SET_MODE?

I think what we need to do is to make sure the full MSR value is
migrated then, even if EP is not in the msr_mask.  Once that's done,
we should be able to correctly calculate excp_prefix from the MSR
value after migration.

Or am I missing something?

>  target-ppc/helper_regs.h | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/target-ppc/helper_regs.h b/target-ppc/helper_regs.h
> index 271fddf17f0a..2a72e000ed83 100644
> --- a/target-ppc/helper_regs.h
> +++ b/target-ppc/helper_regs.h
> @@ -92,9 +92,11 @@ static inline int hreg_store_msr(CPUPPCState *env, 
> target_ulong value,
>  /* Swap temporary saved registers with GPRs */
>  hreg_swap_gpr_tgpr(env);
>  }
> -if (unlikely((value >> MSR_EP) & 1) != msr_ep) {
> -/* Change the exception prefix on PowerPC 601 */
> -env->excp_prefix = ((value >> MSR_EP) & 1) * 0xFFF0;
> +if ((env->msr_mask >> MSR_EP) & 1) {
> +if (unlikely((value >> MSR_EP) & 1) != msr_ep) {
> +/* Change the exception prefix on PowerPC 601 */
> +env->excp_prefix = ((value >> MSR_EP) & 1) * 0xFFF0;
> +}
>  }
>  #endif
>  env->msr = value;

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-21 Thread Bandan Das

Alex Williamson  writes:

> On Mon, 21 Mar 2016 18:00:50 -0400
> Bandan Das  wrote:
>
>> vfio_listener_region_add for a iommu mr results in
>> an overflow assert since emulated iommu memory region is initialized
>> with UINT64_MAX. Add a check just like memory_region_size()
>> does.
>> 
>> Signed-off-by: Bandan Das 
>> ---
>>  hw/vfio/common.c | 7 ++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index fb588d8..269244b 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -349,7 +349,12 @@ static void vfio_listener_region_add(MemoryListener 
>> *listener,
>>  if (int128_ge(int128_make64(iova), llend)) {
>>  return;
>>  }
>> -end = int128_get64(llend);
>> +
>> +if (int128_eq(llend, int128_2_64())) {
>> +end = UINT64_MAX;
>> +} else {
>> +end = int128_get64(llend);
>> +}
>>  
>>  if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
>>  error_report("vfio: IOMMU container %p can't map guest IOVA region"
>
> But now all the calculations where we use end-1 are wrong.  See the
> discussion with Pierre Morel in the January qemu-devel archives.
> There's a solution in there, but I never saw a follow-up from Pierre
> with a revised patch.  Thanks,

I am missing something. When end < UINT64_MAX, end - 1 calculations are valid
because the patch doesn't change that behavior. When end is UINT64_MAX,
int128_get64() doesn't know how to calculate this value and we are just
feeding it manually. The patch is just the opposite of what
memory_region_init() did to init the mem region in the first place:
   mr->size = int128_make64(size);
   if (size == UINT64_MAX) {
  mr->size = int128_2_64();
   }
So, end - 1 is still valid for end = UINT64_MAX, no ?

> Alex
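
To make the arithmetic in this exchange concrete, here is a minimal standalone
sketch. It is not QEMU code: U128 and its helpers are hypothetical stand-ins
for Int128, written only for this illustration. A region covering the whole
64-bit space has an exclusive end of 2^64, so its last byte is UINT64_MAX.
Clamping the end to UINT64_MAX first and then computing end - 1 in 64 bits
gives UINT64_MAX - 1 instead, which appears to be the off-by-one Alex is
pointing at.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU's Int128; only what this example needs. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} U128;

static inline U128 u128_make64(uint64_t v)
{
    U128 r = { v, 0 };
    return r;
}

/* The value 2^64, which does not fit in a uint64_t. */
static inline U128 u128_2_64(void)
{
    U128 r = { 0, 1 };
    return r;
}

static inline int u128_eq(U128 a, U128 b)
{
    return a.lo == b.lo && a.hi == b.hi;
}

/*
 * Last address of a region whose exclusive end is llend (must be non-zero):
 * subtract 1 in 128 bits, then narrow to 64 bits.
 */
static inline uint64_t last_addr_wide(U128 llend)
{
    if (llend.lo == 0) {
        llend.hi--;          /* borrow from the high word */
    }
    llend.lo--;
    assert(llend.hi == 0);   /* the result must fit in 64 bits */
    return llend.lo;
}

/* What the patch does: clamp the end to UINT64_MAX, then use end - 1. */
static inline uint64_t last_addr_clamped(U128 llend)
{
    uint64_t end = u128_eq(llend, u128_2_64()) ? UINT64_MAX : llend.lo;
    return end - 1;
}
```

For any end below 2^64 both helpers agree; they differ only for the full
64-bit region, which is exactly the case the patch is trying to handle.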

Re: [Qemu-devel] [PATCH v3 04/10] ppc: Create cpu_ppc_set_papr() helper

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 01:52:34PM +0100, Cédric Le Goater wrote:
> From: Benjamin Herrenschmidt 
> 
> And move the code adjusting the MSR mask and calling kvmppc_set_papr()
> to it. This allows us to add a few more things such as disabling setting
> of MSR:HV and appropriate LPCR bits which will be used when fixing
> the exception model.
> 
> Signed-off-by: Benjamin Herrenschmidt 
> Reviewed-by: David Gibson 
> [clg: removed LPCR setting ]
> Signed-off-by: Cédric Le Goater 

Nothing wrong with the patch, but your mailer seems to have done
something really weird with the headers:

> Content-Type: text/plain; charset=a

Oddly enough, git am had some trouble with charset "a".

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH v3 00/10] ppc: preparing pnv landing

2016-03-21 Thread David Gibson

On Mon, Mar 21, 2016 at 01:52:30PM +0100, Cédric Le Goater wrote:
> Hello,
> 
> This is a first mini-series of patches adding support for new ppc SPRs.
> They were taken from Ben's larger patchset adding the ppc powernv
> platform and they should already be useful for the pseries guest
> migration.
> 
> Initial patches come from :
> 
>   https://github.com/ozbenh/qemu/commits/powernv
> 
> The changes are mostly due to the rebase on Dave's 2.6 branch:
> 
>   https://github.com/dgibson/qemu/commits/ppc-for-2.6 ppc-for-2.6-20160316
> 
> A couple more are bisect and checkpatch fixes and finally some patches
> were merged to reduce the noise.

Applied to ppc-for-2.6, thanks.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC v1 01/11] tcg: move tb_find_fast outside the tb_lock critical section

2016-03-21 Thread Emilio G. Cota

On Mon, Mar 21, 2016 at 22:08:06 +0000, Peter Maydell wrote:
> It is not _necessary_, but it is a performance optimization to
> speed up the "missed in the TLB" case. (A TLB flush will wipe
> the tb_jmp_cache table.) From the thread where the move-to-front-of-list
> behaviour was added in 2010, benefits cited:

(snip)
> I think what's happening here is that for guest CPUs where TLB
> invalidation happens fairly frequently (notably ARM, because
> we don't model ASIDs in the QEMU TLB and thus have to flush
> the TLB on any context switch) the case of "we didn't hit in
> the TLB but we do have this TB and it was used really recently"
> happens often enough to make it worthwhile for the
> tb_find_physical() code to keep its hash buckets in LRU order.
> 
> Obviously that's all five year old data now, so a pinch of
> salt may be indicated, but I'd rather we didn't just remove
> the optimisation without some benchmarking to check that it's
> not significant. A 2x difference is huge.

Good point. Most of my tests have been on x86-on-x86, and the
difference there (for many CPU-intensive benchmarks such as SPEC) was
negligible.

Just tested the current master booting Alex' debian ARM image, without
LRU, and I see a 20% increase in boot time.

I'll add per-bucket locks to keep the same behaviour without hurting
scalability.

Thanks,

Emilio
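
As a rough illustration of the per-bucket locking idea mentioned above, here
is a small self-contained sketch. It is not the QEMU implementation: the
table, Node, and the spinlock built on C11 atomic_flag are all hypothetical
names chosen for the example. Each bucket carries its own lock, so a lookup
that moves the hit to the front of its chain only serializes against other
users of the same bucket.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define N_BUCKETS 64

typedef struct Node {
    uint64_t key;
    struct Node *next;
} Node;

typedef struct {
    atomic_flag lock;   /* per-bucket spinlock */
    Node *head;
} Bucket;

static Bucket table[N_BUCKETS];

static inline void bucket_lock(Bucket *b)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire)) {
        /* spin until the current holder releases this bucket */
    }
}

static inline void bucket_unlock(Bucket *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

static inline void table_init(void)
{
    for (size_t i = 0; i < N_BUCKETS; i++) {
        atomic_flag_clear(&table[i].lock);
        table[i].head = NULL;
    }
}

static inline void table_insert(Node *n)
{
    Bucket *b = &table[n->key % N_BUCKETS];

    bucket_lock(b);
    n->next = b->head;
    b->head = n;
    bucket_unlock(b);
}

/* Look up key; on a hit, move the node to the front of its bucket. */
static inline Node *table_find(uint64_t key)
{
    Bucket *b = &table[key % N_BUCKETS];
    Node **link, *n;

    bucket_lock(b);
    for (link = &b->head; (n = *link) != NULL; link = &n->next) {
        if (n->key == key) {
            *link = n->next;     /* unlink from the current position */
            n->next = b->head;   /* relink at the head of the chain */
            b->head = n;
            break;
        }
    }
    bucket_unlock(b);
    return n;                    /* NULL if the key was not found */
}
```

The trade-off is that every lookup now takes a lock, but contention is limited
to threads that hash to the same bucket, so the LRU ordering is kept without a
single global lock.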

Re: [Qemu-devel] [PATCH] slirp: Allow to disable IPv4 or IPv6

2016-03-21 Thread Samuel Thibault

Markus Armbruster, on Mon 21 Mar 2016 08:33:52 +0100, wrote:
> Samuel Thibault  writes:
> > Make net=0.0.0.0 disable IPv4 and ip6-net=:: disable IPv6, so the user can
> > set up IPv4-only and IPv6-only network environments.
> 
> Do "net=" and "ip6-net=" mean anything useful?  If not, wouldn't that be
> a more natural way to switch off than abusing the wildcard address?

An empty parameter looks odd to me.  0.0.0.0 is used e.g. by ifconfig to
disable an interface, that's why I thought about it.  Perhaps an even
better way would be net=none and ip6-net=none?

> > @@ -2427,7 +2427,7 @@
> >  #
> >  # @ip: #optional legacy parameter, use net= instead
> >  #
> > -# @net: #optional IP address and optional netmask
> > +# @net: #optional IP address and optional netmask. Set to 0.0.0.0 to disable IPv4 completely
> 
> Long line.
> 
> Syntax?  Default value?

Well, that's what was there :)

But yes I can add that along the way.  I'm however now wondering
what difference is supposed to exist between the documentation in
qapi-schema.json and in qemu-options.hx?  (I know they are separate
software layers, thus the two documentations, but does it make sense to
have differing documentations when the qapi schema and the CLI options
work the same?)

Samuel

[Qemu-devel] How to determine Q-id in VHOST_USER_SET_LOG_BASE in a Multi-Q setup ?

2016-03-21 Thread shesha Sreenivasamurthy (shesha)

Hi All,
I'm implementing VM migration support for open-VPP, an open source Vector
Packet Processing (VPP) technology (https://wiki.fd.io/view/VPP) and a Linux
Foundation project. While doing so, I have hit an issue and I need some
clarification.

In QEMU's vhost-user implementation, each queue is treated as a vhost-net
device, and during migration vhost_user_set_log_base is invoked per device
(queue). However, there is no information about the queue index in the API.
How should the slave determine which queue the master is referring to?

For example, if I have configured my guest with 4 queues,
VHOST_USER_SET_LOG_BASE is invoked 4 times with different SHMFDs. How do I map
an SHMFD to a queue ID?

--
- Thanks
char * (*shesha) (uint64_t cache, uint8_t F00D)
{ return 0xC0DE; }

Re: [Qemu-devel] [PULL] slirp: Fix memory leak on small incoming ipv4 packet

2016-03-21 Thread Samuel Thibault

Hello,

Peter Maydell, on Mon 21 Mar 2016 09:48:48 +0000, wrote:
> Generally the
> process for QEMU is that first patches are sent as normal [PATCH] mails,
> for code review. Patches should only be put into pull requests once
> they've been through the review process. (And then you can batch them
> up so you don't have to send me two pulls for one patch each.)

Ah OK. Since they were one-liners, I thought they didn't need the review
step.  Now resent for review.

Samuel

[Qemu-devel] [PATCH] slirp: send icmp6 errors when UDP send failed

2016-03-21 Thread Samuel Thibault

Signed-off-by: Samuel Thibault 
---
 slirp/udp6.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/slirp/udp6.c b/slirp/udp6.c
index 60a91c9..a23026f 100644
--- a/slirp/udp6.c
+++ b/slirp/udp6.c
@@ -113,8 +113,7 @@ void udp6_input(struct mbuf *m)
 m->m_data -= iphlen;
 *ip = save_ip;
 DEBUG_MISC((dfd, "udp tx errno = %d-%s\n", errno, strerror(errno)));
-/* TODO: ICMPv6 error */
-/*icmp_error(m, ICMP_UNREACH,ICMP_UNREACH_NET, 0,strerror(errno));*/
+icmp6_send_error(m, ICMP6_UNREACH, ICMP6_UNREACH_NO_ROUTE);
 goto bad;
 }
 
-- 
2.7.0

[Qemu-devel] [PATCH] slirp: Fix memory leak on small incoming ipv4 packet

2016-03-21 Thread Samuel Thibault

Signed-off-by: Samuel Thibault 
---
 slirp/ip_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/slirp/ip_input.c b/slirp/ip_input.c
index 12f173d..b464f6b 100644
--- a/slirp/ip_input.c
+++ b/slirp/ip_input.c
@@ -85,7 +85,7 @@ ip_input(struct mbuf *m)
DEBUG_ARG("m_len = %d", m->m_len);
 
if (m->m_len < sizeof (struct ip)) {
-   return;
+   goto bad;
}
 
ip = mtod(m, struct ip *);
-- 
2.7.0
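
Since the patch carries no description, a short sketch of the idiom behind the
one-line change may help. This is an illustration only: Buf, buf_new(),
buf_free(), MIN_HEADER_LEN and input() are hypothetical stand-ins, not slirp's
real mbuf API. The function owns the buffer it is handed, so every exit path
has to release it; an early "return" skips the cleanup label and leaks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical packet buffer, standing in for slirp's struct mbuf. */
typedef struct {
    size_t len;
    unsigned char *data;
} Buf;

#define MIN_HEADER_LEN 20   /* assumed minimum header size for the example */

static int live_bufs;       /* number of buffers allocated and not yet freed */

static inline Buf *buf_new(size_t len)
{
    Buf *b = malloc(sizeof(*b));

    if (!b) {
        return NULL;
    }
    b->data = calloc(len ? len : 1, 1);
    if (!b->data) {
        free(b);
        return NULL;
    }
    b->len = len;
    live_bufs++;
    return b;
}

static inline void buf_free(Buf *b)
{
    free(b->data);
    free(b);
    live_bufs--;
}

/* Takes ownership of b. Returns 0 if processed, -1 if dropped. */
static inline int input(Buf *b)
{
    if (b->len < MIN_HEADER_LEN) {
        goto bad;           /* a bare "return" here would leak b */
    }

    /* ... real code would parse and deliver the packet here ... */
    buf_free(b);            /* in this sketch, delivery just consumes b */
    return 0;

bad:
    buf_free(b);
    return -1;
}
```

With the short-packet check routed through the cleanup label, a too-small
buffer is released like any other rejected packet.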

Re: [Qemu-devel] [PATCH 0/4] Tweaks around virtio-blk start/stop

2016-03-21 Thread Fam Zheng

On Mon, 03/21 14:02, Cornelia Huck wrote:
> On Mon, 21 Mar 2016 20:45:27 +0800
> Fam Zheng  wrote:
> 
> > On Mon, 03/21 12:15, Cornelia Huck wrote:
> > > On Mon, 21 Mar 2016 18:57:18 +0800
> > > Fam Zheng  wrote:
> > > 
> > > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > > > index 08275a9..47f8043 100644
> > > > --- a/hw/virtio/virtio.c
> > > > +++ b/hw/virtio/virtio.c
> > > > @@ -1098,7 +1098,14 @@ void virtio_queue_notify_vq(VirtQueue *vq)
> > > > 
> > > >  void virtio_queue_notify(VirtIODevice *vdev, int n)
> > > >  {
> > > > -virtio_queue_notify_vq(&vdev->vq[n]);
> > > > +VirtQueue *vq = &vdev->vq[n];
> > > > +EventNotifier *n;
> > > > +n = virtio_queue_get_host_notifier(vq);
> > > > +if (n) {
> > > 
> > > Isn't that always true, even if the notifier has not been setup?
> > 
> > You are right, this doesn't make a correct fix. But we can still do a quick
> > test with this as the else branch should never be used with ioeventfd=on.
> > Am I right?
> > 
> > Fam
> 
> Won't we come through here for the very first kick, when we haven't
> registered the ioeventfd with the kernel yet?
> 

The ioeventfd in virtio-ccw is registered in the main loop when
VIRTIO_CONFIG_S_DRIVER_OK is set, so I think the first kick is okay.

Fam

Re: [Qemu-devel] [PATCH v3 1/2] QMP: add query-hotpluggable-cpus

2016-03-21 Thread David Gibson

On Mon, 21 Mar 2016 11:53:23 +0100
Igor Mammedov  wrote:

> On Fri, 18 Mar 2016 16:26:28 -0300
> Eduardo Habkost  wrote:
> 
> > On Tue, Mar 15, 2016 at 02:24:07PM +0100, Igor Mammedov wrote:
> > [...]  
> > > diff --git a/stubs/qmp_query_hotpluggable_cpus.c b/stubs/qmp_query_hotpluggable_cpus.c
> > > new file mode 100644
> > > index 0000000..21a75a3
> > > --- /dev/null
> > > +++ b/stubs/qmp_query_hotpluggable_cpus.c
> > > @@ -0,0 +1,9 @@
> > > +#include "qemu/osdep.h"
> > > +#include "qapi/qmp/qerror.h"
> > > +#include "qmp-commands.h"
> > > +
> > > +HotpluggableCPUList *qmp_query_hotpluggable_cpus(Error **errp)
> > > +{
> > > +error_setg(errp, QERR_FEATURE_DISABLED, "query-hotpluggable-cpus");
> > > +return NULL;
> > > +}
> > 
> > Sorry if this was discussed in previous threads that I haven't
> > read, but: isn't this supposed to be a MachineClass method?  I
> > remember David saying once that we have the habit of assuming
> > that a single QEMU binary can run only one family of machines
> > that are very similar (like x86), but that's not always true.  
> Stub approach works for current qemu with one target per binary
> but it won't for multi-target binary.

This approach won't work even now.  We have draft implementations of
the hook for spapr, but those are absolutely wrong for mac99 or the
many other ppc machine classes.
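
To illustrate the difference being discussed, here is a minimal sketch of a
per-machine-class hook. Everything in it is hypothetical and heavily
simplified: MachineClassSketch, spapr_query() and query_cpus() are invented
for the example and are not QEMU's real types or functions. The point is only
that the query dispatches on the machine actually in use, so two machine types
built into the same binary can answer differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in; not QEMU's real MachineClass. */
typedef struct MachineClassSketch {
    const char *name;
    /* Optional hook: NULL means this machine type has no CPU hotplug. */
    const char *(*query_hotpluggable_cpus)(void);
} MachineClassSketch;

static inline const char *spapr_query(void)
{
    return "spapr: list of hotpluggable cores";
}

static const MachineClassSketch spapr_class = { "pseries", spapr_query };
static const MachineClassSketch mac99_class = { "mac99", NULL };

/* Dispatch on the running machine's class, not on the binary's target. */
static inline const char *query_cpus(const MachineClassSketch *mc,
                                     const char **err)
{
    *err = NULL;
    if (!mc->query_hotpluggable_cpus) {
        *err = "query-hotpluggable-cpus is not supported by this machine";
        return NULL;
    }
    return mc->query_hotpluggable_cpus();
}
```

A link-time stub, by contrast, can only give one answer per binary, which is
why it breaks down as soon as a target has more than one family of machines.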

> I've been trying not to clutter MachineClass with hooks
> that are not must-haves right now, but I don't have a strong opinion
> on this, so if a MachineClass method is the preferred way,
> I can rewrite this patch to use it on respin.
> 
> 


-- 
David Gibson 
Senior Software Engineer, Virtualization, Red Hat



Re: [Qemu-devel] [PATCH v3 03/10] qom: support arbitrary non-scalar properties with -object

2016-03-21 Thread Eric Blake

On 03/10/2016 11:59 AM, Daniel P. Berrange wrote:
> The current -object command line syntax only allows for
> creation of objects with scalar properties, or a list
> with a fixed scalar element type. Objects which have
> properties that are represented as structs in the QAPI
> schema cannot be created using -object.
> 
> This is a design limitation of the way the OptsVisitor
> is written. It simply iterates over the QemuOpts values
> as a flat list. The support for lists is enabled by
> allowing the same key to be repeated in the opts string.
> 
> It is not practical to extend the OptsVisitor to support
> more complex data structures while also maintaining
> the existing list handling behaviour that is relied upon
> by other areas of QEMU.

Zoltán Kővágó tried earlier with his GSoC patches for the audio
subsystem last year, but those got stalled waiting for qapi enhancements
to go in.  But I think your approach is indeed a bit nicer (rather than
making the warty OptsVisitor even wartier, just avoid it).

> 
> Fortunately there is no existing object that implements
> the UserCreatable interface that relies on the list
> handling behaviour, so it is possible to swap out the
> OptsVisitor for a different visitor implementation, so
> -object supports non-scalar properties, thus leaving
> other users of OptsVisitor unaffected.
> 
> The previously added qdict_crumple() method is able to
> take a qdict containing a flat set of properties and
> turn that into an arbitrarily nested set of dicts and
> lists. By combining qemu_opts_to_qdict and qdict_crumple()
> together, we can turn the opt string into a data structure
> that is practically identical to that passed over QMP
> when defining an object. The only difference is that all
> the scalar values are represented as strings, rather than
> strings, ints and bools. This is sufficient to let us
> replace the OptsVisitor with the QMPInputVisitor for
> use with -object.

Indeed, nice replacement.

> 
> Thus -object can now support non-scalar properties,
> for example the QMP object
> 
>   {
> "execute": "object-add",
> "arguments": {
>   "qom-type": "demo",
>   "id": "demo0",
>   "parameters": {
> "foo": [
> { "bar": "one", "wizz": "1" },
> { "bar": "two", "wizz": "2" }
> ]
>   }
> }
>   }
> 
> Would be creatable via the CLI now using
> 
> $QEMU \
>   -object demo,id=demo0,\
>   foo.0.bar=one,foo.0.wizz=1,\
>   foo.1.bar=two,foo.1.wizz=2
> 
> This is also wired up to work for the 'object_add' command
> in the HMP monitor with the same syntax.
> 
>   (hmp) object_add demo,id=demo0,\
>foo.0.bar=one,foo.0.wizz=1,\
>  foo.1.bar=two,foo.1.wizz=2

Maybe mention that the indentation is not actually present in the real
command lines typed.

> 
> Signed-off-by: Daniel P. Berrange 
> ---
>  hmp.c  |  18 +--
>  qom/object_interfaces.c|  20 ++-
>  tests/check-qom-proplist.c | 295 -
>  3 files changed, 313 insertions(+), 20 deletions(-)
> 

> @@ -120,6 +120,7 @@ Object *user_creatable_add_type(const char *type, const char *id,
>  obj = object_new(type);
>  if (qdict) {
>  for (e = qdict_first(qdict); e; e = qdict_next(qdict, e)) {
> +
>  object_property_set(obj, v, e->key, &local_err);
>  if (local_err) {
>  goto out;

Spurious hunk?


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




Re: [Qemu-devel] [PATCH v3 02/10] qapi: allow QmpInputVisitor to auto-cast types

2016-03-21 Thread Eric Blake

On 03/10/2016 11:59 AM, Daniel P. Berrange wrote:
> Currently the QmpInputVisitor assumes that all scalar
> values are directly represented as their final types.
> ie it assumes an 'int' is using QInt, and a 'bool' is
> using QBool.
> 
> This extends it so that QString is optionally permitted
> for any of the non-string scalar types. This behaviour
> is turned on by requesting the 'autocast' flag in the
> constructor.
> 
> This makes it possible to use QmpInputVisitor with a
> QDict produced from QemuOpts, where everything is in
> string format.
> 
> Signed-off-by: Daniel P. Berrange 
> ---
>  include/qapi/qmp-input-visitor.h |   3 +
>  qapi/qmp-input-visitor.c |  96 +++-
>  tests/test-qmp-input-visitor.c   | 115 ++-
>  3 files changed, 196 insertions(+), 18 deletions(-)
> 
> diff --git a/include/qapi/qmp-input-visitor.h b/include/qapi/qmp-input-visitor.h
> index 3ed499c..c25cb7c 100644
> --- a/include/qapi/qmp-input-visitor.h
> +++ b/include/qapi/qmp-input-visitor.h
> @@ -21,6 +21,9 @@ typedef struct QmpInputVisitor QmpInputVisitor;
>  
>  QmpInputVisitor *qmp_input_visitor_new(QObject *obj);
>  QmpInputVisitor *qmp_input_visitor_new_strict(QObject *obj);
> +QmpInputVisitor *qmp_input_visitor_new_full(QObject *obj,
> +bool strict,
> +bool autocast);

We have so few uses of qmp_input_visitor_new* that it might be worth
just having a single prototype, and maybe using an 'int flags' instead
of a string of bool.  But not a show-stopper for this patch (rather, an
idea for a future patch).


> -*obj = qint_get_int(qint);
> +qstr = qobject_to_qstring(qobj);
> +if (qstr && qstr->string && qiv->autocast) {
> +errno = 0;

Dead setting of errno, since...

> +if (qemu_strtoll(qstr->string, NULL, 10, obj) == 0) {

qemu_strtoll() handles it on your behalf, and you aren't using
error_setg_errno().
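
To show why the caller's "errno = 0" is redundant with a wrapper of this kind,
here is a small sketch. parse_int64() is a hypothetical helper written for
this example and only loosely modelled on the behaviour described above; it is
not QEMU's qemu_strtoll(). The wrapper resets errno itself and reports
failures through its return value, so callers never need to touch errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical wrapper: returns 0 on success, or a negative errno value
 * on failure. The whole string must be consumed for the parse to succeed.
 */
static inline int parse_int64(const char *s, int base, int64_t *result)
{
    char *end;
    long long v;

    if (!s) {
        return -EINVAL;
    }
    errno = 0;                  /* the wrapper resets errno on the caller's behalf */
    v = strtoll(s, &end, base);
    if (errno) {
        return -errno;          /* for example -ERANGE on overflow */
    }
    if (end == s || *end != '\0') {
        return -EINVAL;         /* no digits at all, or trailing junk */
    }
    *result = v;
    return 0;
}
```

A caller then only checks the return value, for example
"if (parse_int64(str, 10, &val) == 0)", regardless of what errno held before.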

> @@ -233,30 +245,61 @@ static void qmp_input_type_uint64(Visitor *v, const char *name, uint64_t *obj,
>  {
>  /* FIXME: qobject_to_qint mishandles values over INT64_MAX */
>  QmpInputVisitor *qiv = to_qiv(v);
> -QInt *qint = qobject_to_qint(qmp_input_get_object(qiv, name, true));
> +QObject *qobj = qmp_input_get_object(qiv, name, true);
> +QInt *qint;
> +QString *qstr;
>  
> -if (!qint) {
> -error_setg(errp, QERR_INVALID_PARAMETER_TYPE, name ? name : "null",
> -   "integer");
> +qint = qobject_to_qint(qobj);
> +if (qint) {
> +*obj = qint_get_int(qint);
>  return;
>  }
>  
> -*obj = qint_get_int(qint);
> +qstr = qobject_to_qstring(qobj);
> +if (qstr && qstr->string && qiv->autocast) {
> +errno = 0;
> +if (qemu_strtoull(qstr->string, NULL, 10, obj) == 0) {

And again.

Hmm.  Do we need to worry about partial asymmetry?  That is,
qint_get_int() returns a signed number, but qemu_strtoull() parses
unsigned; if the original conversion from JSON to qint uses a different
parser, then we could have funny results where we get different results
for things like:
 "key1":9223372036854775807, "key2":"9223372036854775807",
even though the same string of digits is being parsed, based on whether
the different parsers handle numbers larger than INT64_MAX differently.

[Ultimately, I'd like QInt to be enhanced to track whether the input was
signed or unsigned, and automatically make the output match the input
when converting back to string; that is, track 65 bits of information
instead of 64; but that's no sooner than 2.7 material]
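
The asymmetry is easy to reproduce with the standard C parsers alone. The
sketch below uses only strtoll() and strtoull(); the two helper names are made
up for the example, and it assumes a platform where long long is 64 bits. The
same digit string parses identically up to INT64_MAX, but one past that the
signed parser saturates with ERANGE while the unsigned parser succeeds, and
the unsigned parser also silently accepts a leading minus sign.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse with the signed parser; *err receives errno after the call. */
static inline long long parse_signed(const char *s, int *err)
{
    long long v;

    errno = 0;
    v = strtoll(s, NULL, 10);
    *err = errno;
    return v;
}

/* Parse with the unsigned parser; *err receives errno after the call. */
static inline unsigned long long parse_unsigned(const char *s, int *err)
{
    unsigned long long v;

    errno = 0;
    v = strtoull(s, NULL, 10);
    *err = errno;
    return v;
}
```

So whether a quoted and an unquoted copy of the same number agree depends on
which parser each path ends up in once the value exceeds INT64_MAX.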


>  static void qmp_input_type_bool(Visitor *v, const char *name, bool *obj,
>  Error **errp)
>  {
>  QmpInputVisitor *qiv = to_qiv(v);
> -QBool *qbool = qobject_to_qbool(qmp_input_get_object(qiv, name, true));
> +QObject *qobj = qmp_input_get_object(qiv, name, true);
> +QBool *qbool;
> +QString *qstr;
>  
> -if (!qbool) {
> -error_setg(errp, QERR_INVALID_PARAMETER_TYPE, name ? name : "null",
> -   "boolean");
> +qbool = qobject_to_qbool(qobj);
> +if (qbool) {
> +*obj = qbool_get_bool(qbool);
>  return;
>  }
>  
> -*obj = qbool_get_bool(qbool);
> +
> +qstr = qobject_to_qstring(qobj);
> +if (qstr && qstr->string && qiv->autocast) {
> +if (!strcasecmp(qstr->string, "on") ||
> +!strcasecmp(qstr->string, "yes") ||
> +!strcasecmp(qstr->string, "true")) {
> +*obj = true;
> +return;
> +}

Do we also want to allow "0"/"1" for true/false?

Overall, I'm a big fan of this patch.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




Re: [Qemu-devel] [PATCH v3 01/10] qdict: implement a qdict_crumple method for un-flattening a dict

2016-03-21 Thread Eric Blake

On 03/10/2016 11:59 AM, Daniel P. Berrange wrote:
> The qdict_flatten() method will take a dict whose elements are
> further nested dicts/lists and flatten them by concatenating
> keys.
> 
> The qdict_crumple() method aims to do the reverse, taking a flat
> qdict, and turning it into a set of nested dicts/lists. It will
> apply nesting based on the key name, with a '.' indicating a
> new level in the hierarchy. If the keys in the nested structure
> are all numeric, it will create a list, otherwise it will create
> a dict.
> 

> 
> will get turned into a dict with one element 'foo' whose
> value is a list. The list elements will each in turn be
> dicts.
> 
>  {
>'foo' => [

s/=>/:/

>  { 'bar': 'one', 'wizz': '1' }

s/$/,/

>  { 'bar': 'two', 'wizz': '2' }
>],
>  }
> 

> The intent of this function is that it allows a set of QemuOpts
> to be turned into a nested data structure that mirrors the nested

s/the nested/the nesting/

> used when the same object is defined over QMP.
> 
> Signed-off-by: Daniel P. Berrange 
> ---
>  include/qapi/qmp/qdict.h |   1 +
>  qobject/qdict.c  | 267 +++
>  tests/check-qdict.c  | 143 +
>  3 files changed, 411 insertions(+)
> 
> +
> +/**
> + * qdict_split_flat_key:
> + *
> + * Given a flattened key such as 'foo.0.bar', split it
> + * into two parts at the first '.' separator. Allows
> + * double dot ('..') to escape the normal separator.
> + *
> + * eg
> + *'foo.0.bar' -> prefix='foo' and suffix='0.bar'
> + *'foo..0.bar' -> prefix='foo.0' and suffix='bar'
> + *
> + * The '..' sequence will be unescaped in the returned
> + * 'prefix' string. The 'suffix' string will be left
> + * in escaped format, so it can be fed back into the
> + * qdict_split_flat_key() key as the input later.
> + */

Might be worth mentioning that prefix and suffix must both be non-NULL,
and that the caller must g_free() the two resulting strings.

> +static void qdict_split_flat_key(const char *key, char **prefix, char **suffix)
> +{
> +const char *separator;
> +size_t i, j;
> +
> +/* Find first '.' separator, but if there is a pair '..'
> + * that acts as an escape, so skip over '..' */
> +separator = NULL;
> +do {
> +if (separator) {
> +separator += 2;
> +} else {
> +separator = key;
> +}
> +separator = strchr(separator, '.');
> +} while (separator && *(separator + 1) == '.');

I'd probably have written separator[1] == '.', but your approach is
synonymous.

> +
> +if (separator) {
> +*prefix = g_strndup(key,
> +separator - key);
> +*suffix = g_strdup(separator + 1);
> +} else {
> +*prefix = g_strdup(key);
> +*suffix = NULL;
> +}
> +
> +/* Unescape the '..' sequence into '.' */
> +for (i = 0, j = 0; (*prefix)[i] != '\0'; i++, j++) {
> +if ((*prefix)[i] == '.' &&
> +(*prefix)[i + 1] == '.') {

Technically, if (*prefix)[i] == '.', we could assert((*prefix)[i + 1] ==
'.'), since the only way to get a '.' in prefix is via escaping.  For
that matter, you could short-circuit (part of) the loop by doing a
strchr for '.' (if not found, the loop is not needed; if found, start
the reduction at that point rather on the bytes leading up to that point).

> +i++;
> +}
> +(*prefix)[j] = (*prefix)[i];
> +}
> +(*prefix)[j] = '\0';
> +}
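
For readers who want to try the splitting rule outside of QEMU, here is a
standalone adaptation of the function above in plain C. It is a sketch, not
the patch itself: dup_range() is a hypothetical helper that replaces the glib
calls (and, like them, aborts on allocation failure), and the separator search
is written as a single while loop. The caller frees both returned strings
with free().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy the first n bytes of s into a fresh NUL-terminated string. */
static inline char *dup_range(const char *s, size_t n)
{
    char *p = malloc(n + 1);

    if (!p) {
        abort();            /* mirror glib: allocation failure is fatal */
    }
    memcpy(p, s, n);
    p[n] = '\0';
    return p;
}

/*
 * Split key at the first unescaped '.'; '..' is an escaped literal dot.
 * *prefix is returned unescaped, *suffix is left escaped (or NULL if
 * there is no separator).
 */
static inline void split_flat_key(const char *key, char **prefix, char **suffix)
{
    const char *sep = key;
    size_t i, j;

    /* Find the first '.' that is not part of an escaping '..' pair. */
    while ((sep = strchr(sep, '.')) != NULL && sep[1] == '.') {
        sep += 2;
    }

    if (sep) {
        *prefix = dup_range(key, (size_t)(sep - key));
        *suffix = dup_range(sep + 1, strlen(sep + 1));
    } else {
        *prefix = dup_range(key, strlen(key));
        *suffix = NULL;
    }

    /* Collapse each '..' in the prefix into a single '.'. */
    for (i = 0, j = 0; (*prefix)[i] != '\0'; i++, j++) {
        if ((*prefix)[i] == '.' && (*prefix)[i + 1] == '.') {
            i++;
        }
        (*prefix)[j] = (*prefix)[i];
    }
    (*prefix)[j] = '\0';
}
```

Feeding the returned suffix back into the same function peels off one level
of nesting at a time, which is how the doc comment describes the intended use.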
> +
> +
> +/**
> + * qdict_list_size:
> + * @maybe_List: dict that may be only list elements

s/List/list/

> + *
> + * Determine whether all keys in @maybe_list are
> + * valid list elements. They they are all valid,

s/They they/If they/

> + * then this returns the number of elements. If
> + * they all look like non-numeric keys, then returns
> + * zero. If there is a mix of numeric and non-numeric
> + * keys, then an error is set as it is both a list
> + * and a dict at once.
> + *
> + * Returns: number of list elemets, 0 if a dict, -1 on error

s/elemets/elements/

> + */
> +static ssize_t qdict_list_size(QDict *maybe_list, Error **errp)
> +{
> +const QDictEntry *entry, *next;
> +ssize_t len = 0;
> +ssize_t max = -1;
> +int is_list = -1;
> +int64_t val;
> +
> +entry = qdict_first(maybe_list);
> +while (entry != NULL) {
> +next = qdict_next(maybe_list, entry);
> +
> +if (qemu_strtoll(entry->key, NULL, 10, &val) == 0) {
> +if (is_list == -1) {
> +is_list = 1;
> +} else if (!is_list) {
> +error_setg(errp,
> +   "Key '%s' is for a list, but previous key is "
> +   "for a dict", entry->key);

Keys are unsorted, so it's a bit hard to call it "previous key".  Maybe
a better error message would be along the lines of "cannot crumple
dictionary because of a mix of list and non-list keys"?  I dunno...

> +

Re: [Qemu-devel] [PATCH 06/22] hbitmap: load/store

2016-03-21 Thread John Snow



On 03/15/2016 04:04 PM, Vladimir Sementsov-Ogievskiy wrote:
> Add functions for load/store HBitmap to BDS, using clusters table:
> Last level of the bitmap is split into chunks of 'cluster_size'
> size. Each cell of the table contains offset in bds, to load/store
> corresponding chunk.
> 
> Also,
> 0 in cell means all-zeroes-chunk (should not be saved)
> 1 in cell means all-ones-chunk (should not be saved)
> hbitmap_prepare_store() fills table with
>   0 for all-zeroes chunks
>   1 for all-ones chunks
>   2 for others
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/dirty-bitmap.c |  23 +
>  include/block/dirty-bitmap.h |  11 +++
>  include/qemu/hbitmap.h   |  12 +++
>  util/hbitmap.c   | 209 +++
>  4 files changed, 255 insertions(+)
> 
> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> index e68c177..816c6ee 100644
> --- a/block/dirty-bitmap.c
> +++ b/block/dirty-bitmap.c
> @@ -396,3 +396,26 @@ int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap)
>  {
>  return hbitmap_count(bitmap->bitmap);
>  }
> +
> +int bdrv_dirty_bitmap_load(BdrvDirtyBitmap *bitmap, BlockDriverState *bs,
> +   const uint64_t *table, uint32_t table_size,
> +   uint32_t cluster_size)
> +{
> +return hbitmap_load(bitmap->bitmap, bs, table, table_size, cluster_size);
> +}
> +
> +int bdrv_dirty_bitmap_prepare_store(const BdrvDirtyBitmap *bitmap,
> +uint32_t cluster_size,
> +uint64_t *table,
> +uint32_t *table_size)
> +{
> +return hbitmap_prepare_store(bitmap->bitmap, cluster_size,
> + table, table_size);
> +}
> +
> +int bdrv_dirty_bitmap_store(const BdrvDirtyBitmap *bitmap, BlockDriverState *bs,
> +const uint64_t *table, uint32_t table_size,
> +uint32_t cluster_size)
> +{
> +return hbitmap_store(bitmap->bitmap, bs, table, table_size, cluster_size);
> +}
> diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
> index 27515af..20cb540 100644
> --- a/include/block/dirty-bitmap.h
> +++ b/include/block/dirty-bitmap.h
> @@ -43,4 +43,15 @@ void bdrv_set_dirty_iter(struct HBitmapIter *hbi, int64_t offset);
>  int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
>  void bdrv_dirty_bitmap_truncate(BlockDriverState *bs);
>  
> +int bdrv_dirty_bitmap_load(BdrvDirtyBitmap *bitmap, BlockDriverState *bs,
> +   const uint64_t *table, uint32_t table_size,
> +   uint32_t cluster_size);
> +int bdrv_dirty_bitmap_prepare_store(const BdrvDirtyBitmap *bitmap,
> +uint32_t cluster_size,
> +uint64_t *table,
> +uint32_t *table_size);
> +int bdrv_dirty_bitmap_store(const BdrvDirtyBitmap *bitmap, BlockDriverState *bs,
> +const uint64_t *table, uint32_t table_size,
> +uint32_t cluster_size);
> +
>  #endif
> diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
> index 6d1da4d..d83bb79 100644
> --- a/include/qemu/hbitmap.h
> +++ b/include/qemu/hbitmap.h
> @@ -241,5 +241,17 @@ static inline size_t hbitmap_iter_next_word(HBitmapIter *hbi, unsigned long *p_c
>  return hbi->pos;
>  }
>  
> +typedef struct BlockDriverState BlockDriverState;
> +
> +int hbitmap_load(HBitmap *bitmap, BlockDriverState *bs,
> + const uint64_t *table, uint32_t table_size,
> + uint32_t cluster_size);
> +
> +int hbitmap_prepare_store(const HBitmap *bitmap, uint32_t cluster_size,
> +  uint64_t *table, uint32_t *table_size);
> +
> +int hbitmap_store(HBitmap *bitmap, BlockDriverState *bs,
> +  const uint64_t *table, uint32_t table_size,
> +  uint32_t cluster_size);
>  
>  #endif
> diff --git a/util/hbitmap.c b/util/hbitmap.c
> index 28595fb..1960e4f 100644
> --- a/util/hbitmap.c
> +++ b/util/hbitmap.c
> @@ -15,6 +15,8 @@
>  #include "qemu/host-utils.h"
>  #include "trace.h"
>  
> +#include "block/block.h"
> +

This is a bit of a red flag -- we shouldn't need block layer specifics
in the subcomponent-agnostic HBitmap utility.

Further, by relying on these facilities here in hbitmap.c, "make check"
no longer can compile the relevant hbitmap tests.

Make sure that each intermediate commit here passes these necessary
tests, test-hbitmap in particular for each, and a "make check" overall
at the end of your series.

--js
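
One possible shape for keeping the utility free of block-layer types, offered
only as a sketch: do the pure bitmap work (deciding which chunks are
all-zeroes, all-ones, or need storing, as the commit message describes) on a
plain byte buffer, and leave the actual I/O to the caller in the block layer.
classify_chunks() and the CHUNK_* names are hypothetical and are not part of
the series; chunk_size must be non-zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Table values as described in the commit message. */
enum {
    CHUNK_ZEROES = 0,   /* all-zeroes chunk, nothing to save */
    CHUNK_ONES   = 1,   /* all-ones chunk, nothing to save */
    CHUNK_DATA   = 2    /* mixed chunk, must be written out */
};

/*
 * Classify each chunk_size-byte chunk of bits[0..len) and record the
 * result in table[]. Returns the number of table entries written.
 * No block-layer type is needed for this step.
 */
static inline size_t classify_chunks(const uint8_t *bits, size_t len,
                                     size_t chunk_size, uint64_t *table)
{
    size_t n = 0;

    for (size_t off = 0; off < len; off += chunk_size, n++) {
        size_t end = off + chunk_size < len ? off + chunk_size : len;
        int all_zero = 1;
        int all_one = 1;

        for (size_t i = off; i < end; i++) {
            all_zero &= (bits[i] == 0x00);
            all_one &= (bits[i] == 0xff);
        }
        table[n] = all_zero ? CHUNK_ZEROES
                 : all_one ? CHUNK_ONES
                 : CHUNK_DATA;
    }
    return n;
}
```

A helper like this can be exercised from a unit test without linking any
block code, which is the property the hbitmap tests depend on.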

>  /* HBitmaps provides an array of bits.  The bits are stored as usual in an
>   * array of unsigned longs, but HBitmap is also optimized to provide fast
>   * iteration over set bits; going from one bit to the next is O(logB n)
> @@ -499,3

Re: [Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-21 Thread Alex Williamson

On Mon, 21 Mar 2016 18:00:50 -0400
Bandan Das  wrote:

> vfio_listener_region_add for an iommu mr results in
> an overflow assert since emulated iommu memory region is initialized
> with UINT64_MAX. Add a check just like memory_region_size()
> does.
> 
> Signed-off-by: Bandan Das 
> ---
>  hw/vfio/common.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index fb588d8..269244b 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -349,7 +349,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  if (int128_ge(int128_make64(iova), llend)) {
>  return;
>  }
> -end = int128_get64(llend);
> +
> +if (int128_eq(llend, int128_2_64())) {
> +end = UINT64_MAX;
> +} else {
> +end = int128_get64(llend);
> +}
>  
>  if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
>  error_report("vfio: IOMMU container %p can't map guest IOVA region"

But now all the calculations where we use end-1 are wrong.  See the
discussion with Pierre Morel in the January qemu-devel archives.
There's a solution in there, but I never saw a follow-up from Pierre
with a revised patch.  Thanks,

Alex

Re: [Qemu-devel] [RFC v1 03/11] tcg: comment on which functions have to be called with tb_lock held

2016-03-21 Thread Paolo Bonzini

On 21/03/2016 22:50, Emilio G. Cota wrote:
> The problem with this approach is that the "point TCG to second buffer"
> is not just a question of pointing code_gen_buffer to a new address;
> we'd have to create a new tcg_ctx struct, since tcg_ctx has quite a few
> elements that are dependent on code_gen_buffer (e.g. s->code_ptr,
> s->code_buf). 

Are these (or other fields similarly dependent on code_gen_buffer) ever
read outside tb_lock?  A quick "git grep -wl" suggests that they are
only used from tcg/, which should only run while tb_lock is held.

If not it would be enough to call tcg_prologue_init from tb_flush.

Paolo

Re: [Qemu-devel] [RFC v1 01/11] tcg: move tb_find_fast outside the tb_lock critical section

2016-03-21 Thread Peter Maydell

On 21 March 2016 at 21:50, Emilio G. Cota  wrote:
> This function, as is, doesn't really just "find"; two concurrent "finders"
> could race here by *writing* to the head of the list at the same time.
>
> The fix is to get rid of this write entirely; moving the just-found TB to
> the head of the list is not really that necessary thanks to the CPU's
> tb_jmp_cache table. This fix would make the function read-only, which
> is what the function's name implies.

It is not _necessary_, but it is a performance optimization to
speed up the "missed in the TLB" case. (A TLB flush will wipe
the tb_jmp_cache table.) From the thread where the move-to-front-of-list
behaviour was added in 2010, benefits cited:

# The exact numbers depend on complexity of guest system.
# - For basic Debian system (no X-server) on versatilepb we observed
# 25% decrease of boot time.
# - For to-be released Samsung LIMO platform on S5PC110 board we
# observed 2x (for older version) and 3x (for newer version)
# decrease of boot time.
# - Small CPU-intensive benchmarks are not affected because they are
# completely handled by 'tb_find_fast'.
#
# We also noticed better response time for heavyweight GUI applications,
# but I do not know how to measure it accurately.
(https://lists.gnu.org/archive/html/qemu-devel/2010-12/msg00380.html)

I think what's happening here is that for guest CPUs where TLB
invalidation happens fairly frequently (notably ARM, because
we don't model ASIDs in the QEMU TLB and thus have to flush
the TLB on any context switch) the case of "we didn't hit in
the TLB but we do have this TB and it was used really recently"
happens often enough to make it worthwhile for the
tb_find_physical() code to keep its hash buckets in LRU order.

Obviously that's all five year old data now, so a pinch of
salt may be indicated, but I'd rather we didn't just remove
the optimisation without some benchmarking to check that it's
not significant. A 2x difference is huge.

thanks
-- PMM

[Qemu-devel] [PATCH] vfio: add check for memory region overflow condition

2016-03-21 Thread Bandan Das


vfio_listener_region_add for an iommu mr results in
an overflow assert since the emulated iommu memory region is initialized
with UINT64_MAX. Add a check just like memory_region_size()
does.

Signed-off-by: Bandan Das 
---
 hw/vfio/common.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index fb588d8..269244b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -349,7 +349,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
 if (int128_ge(int128_make64(iova), llend)) {
 return;
 }
-end = int128_get64(llend);
+
+if (int128_eq(llend, int128_2_64())) {
+end = UINT64_MAX;
+} else {
+end = int128_get64(llend);
+}
 
 if ((iova < container->min_iova) || ((end - 1) > container->max_iova)) {
 error_report("vfio: IOMMU container %p can't map guest IOVA region"
-- 
2.7.0
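
For readers unfamiliar with the overflow being guarded against, the following is a rough standalone sketch of the clamp. U128 and its helpers are hypothetical; they only stand in for QEMU's Int128 API and are not the functions used in the patch. The point is that an end address of exactly 2^64 does not fit in a uint64_t, so it is special-cased before narrowing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit value; stands in for QEMU's Int128 in this sketch. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} U128;

/* 128-bit addition with carry from the low half into the high half. */
static U128 u128_add(U128 a, U128 b)
{
    U128 r;

    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo ? 1 : 0);
    return r;
}

/* True when v is exactly 2^64, the one value the patch special-cases. */
static bool u128_is_2_64(U128 v)
{
    return v.hi == 1 && v.lo == 0;
}

/* Narrow a region end to 64 bits, clamping 2^64 to UINT64_MAX. */
static uint64_t region_end_clamped(U128 llend)
{
    if (u128_is_2_64(llend)) {
        return UINT64_MAX;
    }
    assert(llend.hi == 0); /* anything larger cannot be represented */
    return llend.lo;
}
```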

Re: [Qemu-devel] [PATCH] block: Remove bdrv_make_anon()

2016-03-21 Thread Eric Blake

On 03/18/2016 04:31 AM, Kevin Wolf wrote:
> The call in hmp_drive_del() is dead code because blk_remove_bs() is
> called a few lines above. The only other remaining user is
> bdrv_delete(), which only abuses bdrv_make_anon() to remove it from the
> named nodes list. This path inlines the list entry removal into
> bdrv_delete() and removes bdrv_make_anon().
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block.c   | 15 +++
>  blockdev.c|  3 ---
>  include/block/block.h |  1 -
>  3 files changed, 3 insertions(+), 16 deletions(-)

Nice diffstat, as well.

Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org




Re: [Qemu-devel] [RFC v1 03/11] tcg: comment on which functions have to be called with tb_lock held

2016-03-21 Thread Emilio G. Cota

On Fri, Mar 18, 2016 at 17:59:46 +0100, Paolo Bonzini wrote:
> On 18/03/2016 17:18, Alex Bennée wrote:
> > +
> > +/* Protected by tb_lock.  */
> 
> Only writes are protected by tb_lock.  Read happen outside the lock.
> 
> Reads are not quite thread safe yet, because of tb_flush.  In order to
> fix that, there's either the async_safe_run() mechanism from Fred or
> preferably the code generation buffer could be moved under RCU.

A third approach (which I prefer) is to protect tb_jmp_cache with
a seqlock. That way invalidates (via tlb_flush from other CPUs, or
via tb_flush) are picked up if they're racing with concurrent reads.
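
As a rough illustration of the seqlock idea (not the RFC code, and not QEMU's own seqlock helpers), here is a sketch using C11 atomics. SeqCache, seqcache_write() and seqcache_try_read() are hypothetical names, and the sketch assumes writers are already serialized by some outer lock such as tb_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache entry guarded by a sequence counter. */
typedef struct {
    atomic_uint seq;        /* odd while a write is in progress */
    _Atomic uintptr_t slot; /* the protected value */
} SeqCache;

/* Writer side; assumes writers are serialized by an outer lock. */
static void seqcache_write(SeqCache *c, uintptr_t v)
{
    unsigned s = atomic_load_explicit(&c->seq, memory_order_relaxed);

    assert((s & 1) == 0); /* no other write may be in flight */
    atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&c->slot, v, memory_order_relaxed);
    atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Reader side; returns false if it raced with a write and must retry. */
static bool seqcache_try_read(SeqCache *c, uintptr_t *out)
{
    unsigned s1 = atomic_load_explicit(&c->seq, memory_order_acquire);
    uintptr_t v;
    unsigned s2;

    if (s1 & 1) {
        return false; /* write in progress */
    }
    v = atomic_load_explicit(&c->slot, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    s2 = atomic_load_explicit(&c->seq, memory_order_relaxed);
    if (s1 != s2) {
        return false; /* a write completed while we were reading */
    }
    *out = v;
    return true;
}
```

A reader that gets false would typically retry or fall back to the slow lookup path; which of the two fits tb_jmp_cache better is left open here.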

> Because tb_flush is really rare, my suggestion is simply to allocate two
> code generation buffers and do something like
> 
> static int which_buffer_is_in_use_bit_mask = 1;
> ...
> 
>/* in tb_flush */
>assert (which_buffer_is_in_use_bit_mask != 3);
>if (which_buffer_is_in_use_bit_mask == 1) {
>which_buffer_is_in_use_bit_mask |= 2;
>call_rcu(function doing which_buffer_is_in_use_bit_mask &= ~1);
>point TCG to second buffer
> } else if (which_buffer_is_in_use_bit_mask == 2) {
>which_buffer_is_in_use_bit_mask |= 1;
>call_rcu(function doing which_buffer_is_in_use_bit_mask &= ~2);
>point TCG to first buffer
> }
> 
> Basically, we just assert that call_rcu makes at least one pass between
> two tb_flushes.
> 
> All this is also a prerequisite for patch 1.

The problem with this approach is that the "point TCG to second buffer"
is not just a question of pointing code_gen_buffer to a new address;
we'd have to create a new tcg_ctx struct, since tcg_ctx has quite a few
elements that are dependent on code_gen_buffer (e.g. s->code_ptr,
s->code_buf). And this could end up with readers reading a partially
up-to-date (i.e. corrupt) tcg_ctx.

I know you're not enthusiastic about it, but I think a mechanism to "stop
all CPUs and wait until they have indeed stopped" is in this case justified.

I'm preparing an RFC with these two changes (seqlock and stop all cpus
mechanism) on top of these base patches.

Thanks,

Emilio

Re: [Qemu-devel] [RFC v1 01/11] tcg: move tb_find_fast outside the tb_lock critical section

2016-03-21 Thread Emilio G. Cota

On Fri, Mar 18, 2016 at 16:18:42 +, Alex Bennée wrote:
> From: KONRAD Frederic 
> 
> Signed-off-by: KONRAD Frederic 
> Signed-off-by: Paolo Bonzini 
> [AJB: minor checkpatch fixes]
> Signed-off-by: Alex Bennée 
> 
> ---
> v1(ajb)
>   - checkpatch fixes
> ---
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 07545aa..52f25de 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -225,8 +225,9 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
>  phys_page1 = phys_pc & TARGET_PAGE_MASK;
>  h = tb_phys_hash_func(phys_pc);
>  for (ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> - (tb = *ptb1) != NULL;
> + (tb = atomic_read(ptb1)) != NULL;
>   ptb1 = &tb->phys_hash_next) {
> +smp_read_barrier_depends();
>  if (tb->pc != pc ||
>  tb->page_addr[0] != phys_page1 ||
>  tb->cs_base != cs_base ||
> @@ -254,7 +255,18 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
[ Adding this missing line to the diff for clarity ]
   /* Move the TB to the head of the list */
>  *ptb1 = tb->phys_hash_next;
>  tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
>  tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;

This function, as is, doesn't really just "find"; two concurrent "finders"
could race here by *writing* to the head of the list at the same time.

The fix is to get rid of this write entirely; moving the just-found TB to
the head of the list is not really that necessary thanks to the CPU's
tb_jmp_cache table. This fix would make the function read-only, which
is what the function's name implies.

Further, I'd like to see tb_phys_hash to use the RCU queue primitives; it
makes everything easier to understand (and we avoid sprinkling the code
base with smp_barrier_depends).

I have these two changes queued up as part of my upcoming series, which I'm
basing on your patchset.

Thanks for putting these changes together!

Emilio

Re: [Qemu-devel] [PATCH v4 07/10] vfio: add check aer functionality for hotplug device

2016-03-21 Thread Alex Williamson

On Mon, 21 Mar 2016 18:08:43 +0800
Cao jin  wrote:

> From: Chen Fan 
> 
> because we make the vfio functions are combined
> in the same way as on the host for aer, so we can
> do the aer check when the function 0 was hotplugged.

Suggestion:

  PCI hotplug requires that function 0 is added last to close the
  slot.  Since we require that the VM bus contains the same set of
  devices as the host bus to support AER, we can perform an AER
  validation test whenever a function 0 in the VM is hot-added.

> 
> Signed-off-by: Chen Fan 
> ---
>  hw/vfio/pci.c | 45 +
>  1 file changed, 45 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index dce3b6d..9902c87 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2030,6 +2030,35 @@ out:
>  return;
>  }
>  
> +static void vfio_bus_check_aer_functions(PCIDevice *pdev, Error **errp)
> +{
> +VFIOPCIDevice *vdev;
> +PCIDevice *dev;
> +Error *local_err = NULL;
> +int devfn;
> +
> +for (devfn = 0; devfn < 8; devfn++) {

ARI question again.  Perhaps always use 0-255?

> +dev = pci_find_device(pdev->bus, pci_bus_num(pdev->bus),
> +  PCI_DEVFN(PCI_SLOT(pdev->devfn), devfn));
> +if (!dev) {
> +continue;
> +}
> +if (!object_dynamic_cast(OBJECT(dev), "vfio-pci")) {
> +continue;
> +}
> +vdev = DO_UPCAST(VFIOPCIDevice, pdev, dev);
> +if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
> +vfio_check_hot_bus_reset(vdev, &local_err);
> +if (local_err) {
> +error_propagate(errp, local_err);
> +return;
> +}
> +}
> +}
> +
> +return;
> +}
> +
>  static void vfio_aer_check_host_bus_reset(Error **errp)
>  {
>  VFIOGroup *group;
> @@ -2982,6 +3011,22 @@ static int vfio_initfn(PCIDevice *pdev)
>  }
>  }
>  
> +/*
> + *  If this function is func 0, indicate the closure of the slot.
> + *  we get the chance to check aer-enabled devices whether support
> + *  hot bus reset.
> + */
> +if (DEVICE(pdev)->hotplugged &&
> +pdev == pci_get_function_0(pdev)) {
> +Error *local_err = NULL;
> +
> +vfio_bus_check_aer_functions(pdev, &local_err);
> +if (local_err) {
> +error_report_err(local_err);
> +goto out_teardown;
> +}
> +}
> +
>  vfio_register_err_notifier(vdev);
>  vfio_register_req_notifier(vdev);
>  vfio_setup_resetfn_quirk(vdev);

Re: [Qemu-devel] [PATCH v4 08/10] vfio: vote the function 0 to do host bus reset when aer occurred

2016-03-21 Thread Alex Williamson

On Mon, 21 Mar 2016 18:08:44 +0800
Cao jin  wrote:

> From: Chen Fan 
> 
> Due to all devices assigned to VM on the same way as host if enable
> aer, so we can easily do the hot reset by selecting the function #0
> to do the hot reset.
> 
> Signed-off-by: Chen Fan 
> ---
>  hw/vfio/pci.c | 50 ++
>  hw/vfio/pci.h |  2 ++
>  2 files changed, 52 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 9902c87..718cde7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1900,6 +1900,8 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, Error **errp)
>  /* List all affected devices by bus reset */
>  devices = &info->devices[0];
>  
> +vdev->single_depend_dev = (info->count == 1);
> +
>  /* Verify that we have all the groups required */
>  for (i = 0; i < info->count; i++) {
>  PCIHostDeviceAddress host;
> @@ -2608,11 +2610,36 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
>  static void vfio_err_notifier_handler(void *opaque)
>  {
>  VFIOPCIDevice *vdev = opaque;
> +PCIDevice *pdev = &vdev->pdev;
>  
>  if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>  return;
>  }
>  
> +if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
> +VFIOPCIDevice *tmp;
> +PCIDevice *dev;
> +int devfn;
> +
> +/*
> + * If one device has aer capability on a bus, when aer occurred,
> + * we should notify all devices on the bus there was an aer arrived,
> + * then we are able to vote the device #0 to do host bus reset.
> + */
> +for (devfn = 0; devfn < 8; devfn++) {

ARI?

> +dev = pci_find_device(pdev->bus, pci_bus_num(pdev->bus),
> +  PCI_DEVFN(PCI_SLOT(pdev->devfn), devfn));
> +if (!dev) {
> +continue;
> +}
> +if (!object_dynamic_cast(OBJECT(dev), "vfio-pci")) {
> +continue;
> +}
> +tmp = DO_UPCAST(VFIOPCIDevice, pdev, dev);
> +tmp->aer_occurred = true;
> +}
> +}
> +
>  /*
>   * TBD. Retrieve the error details and decide what action
>   * needs to be taken. One of the actions could be to pass
> @@ -3075,6 +3102,29 @@ static void vfio_pci_reset(DeviceState *dev)
>  
>  trace_vfio_pci_reset(vdev->vbasedev.name);
>  
> +if (vdev->aer_occurred) {
> +PCIDevice *br = pci_bridge_get_device(pdev->bus);
> +
> +if (br &&
> +(pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
> + PCI_BRIDGE_CTL_BUS_RESET)) {
> +/* simply voting the function 0 to do hot bus reset */
> +if (pci_get_function_0(pdev) == pdev) {
> +if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
> +vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> +} else {
> +/* if this device has not AER capability, code
> + * coming here indicates there is another function
> + * on the bus has AER capability.
> + * */

This shouldn't be possible, right?

> +vfio_pci_hot_reset(vdev, false);
> +}
> +}
> +vdev->aer_occurred = false;
> +return;
> +}
> +}

Why do we care that an AER occurred now?  Can't we simply test:

if (vdev->features & VFIO_FEATURE_ENABLE_AER &&
    pci_get_function_0(pdev) == pdev) {
    PCIDevice *br = pci_bridge_get_device(pdev->bus);

    if (pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
        PCI_BRIDGE_CTL_BUS_RESET) {

        vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
        return;
    }
}

> +
>  vfio_pci_pre_reset(vdev);
>  
>  if (vdev->resetfn && !vdev->resetfn(vdev)) {
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index db7c6d5..17c75b8 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -143,6 +143,8 @@ typedef struct VFIOPCIDevice {
>  bool no_kvm_intx;
>  bool no_kvm_msi;
>  bool no_kvm_msix;
> +bool aer_occurred;
> +bool single_depend_dev;
>  } VFIOPCIDevice;
>  
>  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);

Re: [Qemu-devel] [PATCH v4 05/10] vfio: extending function vfio_pci_host_match to support mask func number

2016-03-21 Thread Alex Williamson

On Mon, 21 Mar 2016 18:08:41 +0800
Cao jin  wrote:

> From: Chen Fan 
> 
> Signed-off-by: Chen Fan 
> ---
>  hw/vfio/pci.c | 29 -
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 0516d94..8842b7f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2060,14 +2060,25 @@ static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
>  vfio_intx_enable(vdev);
>  }
>  
> -static bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name)
> +#define HOST_CMP_FUNC_MASK   (1 << 0)
> +static bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name,
> +uint8_t mask)
>  {
> -char tmp[13];
> +PCIHostDeviceAddress tmp;
>  
> -sprintf(tmp, "%04x:%02x:%02x.%1x", addr->domain,
> -addr->bus, addr->slot, addr->function);
> +if (strlen(name) != 12) {
> +return false;
> +}
> +
> +if (sscanf(name, "%04x:%02x:%02x.%1x", &tmp.domain,
> +   &tmp.bus, &tmp.slot, &tmp.function) != 4) {
> +return false;
> +}
>  
> -return (strcmp(tmp, name) == 0);
> +return (tmp.domain == addr->domain && tmp.bus == addr->bus &&
> +tmp.slot == addr->slot &&
> +((mask & HOST_CMP_FUNC_MASK) ?
> +1 : (tmp.function == addr->function)));
>  }

I'd probably go for something like:

static int vfio_pci_name_to_addr(const char *name, PCIHostDeviceAddress *addr)
{
    if (strlen(name) != 12 ||
        sscanf(name, "%04x:%02x:%02x.%1x", &addr->domain,
               &addr->bus, &addr->slot, &addr->function) != 4) {
        return -EINVAL;
    }

    return 0;
}

static bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name)
{
    PCIHostDeviceAddress tmp;

    if (vfio_pci_name_to_addr(name, &tmp)) {
        return false;
    }

    return (tmp.domain == addr->domain && tmp.bus == addr->bus &&
            tmp.slot == addr->slot && tmp.function == addr->function);
}

Then a _slot version that skips the function comparison.  The
mask argument doesn't make much sense for such a simple function when
the code duplication is trivial.
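
For illustration only, a standalone sketch of that split might look like the following. HostAddr, name_to_addr(), host_match() and host_match_slot() are hypothetical names that mirror the suggestion without depending on QEMU headers. The format string uses plain field widths, on the assumption that the '0' flag has no meaning for scanf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for QEMU's PCIHostDeviceAddress. */
typedef struct {
    unsigned int domain, bus, slot, function;
} HostAddr;

/* Parse "dddd:bb:ss.f" (12 characters, hex fields); 0 on success. */
static int name_to_addr(const char *name, HostAddr *addr)
{
    assert(name != NULL && addr != NULL);
    if (strlen(name) != 12 ||
        sscanf(name, "%4x:%2x:%2x.%1x", &addr->domain, &addr->bus,
               &addr->slot, &addr->function) != 4) {
        return -1;
    }
    return 0;
}

/* Compare domain, bus and slot only; the function number is ignored. */
static bool host_match_slot(const HostAddr *addr, const char *name)
{
    HostAddr tmp;

    if (name_to_addr(name, &tmp)) {
        return false;
    }
    return tmp.domain == addr->domain && tmp.bus == addr->bus &&
           tmp.slot == addr->slot;
}

/* Full comparison, including the function number. */
static bool host_match(const HostAddr *addr, const char *name)
{
    HostAddr tmp;

    if (name_to_addr(name, &tmp)) {
        return false;
    }
    return tmp.domain == addr->domain && tmp.bus == addr->bus &&
           tmp.slot == addr->slot && tmp.function == addr->function;
}
```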

Re: [Qemu-devel] [PATCH v4 06/10] vfio: add check host bus reset is support or not

2016-03-21 Thread Alex Williamson

On Mon, 21 Mar 2016 18:08:42 +0800
Cao jin  wrote:

> From: Chen Fan 
> 
> when boot up a VM that assigning vfio devices with aer enabled, we
> must check the vfio device whether support host bus reset. because
> when one error occur. OS driver always recover the device by do a
> bus reset, in order to recover the vfio device, qemu must able to do
> a host bus reset to recover the device to default status. and for all
> affected devices by the bus reset. we must check them whether all
> are assigned to the VM and on the same virtual bus. meanwhile, for
> simply done, the devices which don't affected by the host bus reset
> are not allowed to assign to the same virtual bus.

Rewording/expansion suggestion:

  When assigning a vfio device with AER enabled, we must check whether
  the device supports a host bus reset (ie. hot reset) as this may be
  used by the guest OS in order to recover the device from an AER
  error.  QEMU must therefore have the ability to perform a physical
  host bus reset using the existing vfio APIs in response to a virtual
  bus reset in the VM.  A physical bus reset affects all of the devices
  on the host bus, therefore we place a few simplifying configuration
  restrictions on the VM:

   - All physical devices affected by a bus reset must be assigned to
 the VM with AER enabled on each and be configured on the same
 virtual bus in the VM.

   - No devices unaffected by the bus reset, be they physical, emulated,
 or paravirtual may be configured on the same virtual bus as a
 device supporting AER signaling through vfio.

  In other words users wishing to enable AER on a multifunction device
  need to assign all functions of the device to the same virtual bus
  and enable AER support for each device.  The easiest way to
  accomplish this is to identity map the physical functions to virtual
  functions with multifunction enabled on the virtual device.

> 
> Signed-off-by: Chen Fan 
> ---
>  hw/vfio/pci.c | 205 +-
>  hw/vfio/pci.h |   1 +
>  2 files changed, 205 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 8842b7f..dce3b6d 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -38,6 +38,10 @@
>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>  
> +#define HOST_CMP_FUNC_MASK   (1 << 0)
> +static bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name,
> +uint8_t mask);
> +

This would of course go back to a _slot version of the function as
you've had in previous versions.

>  /*
>   * Disabling BAR mmaping can be slow, but toggling it around INTx can
>   * also be a huge overhead.  We try to get the best of both worlds by
> @@ -1877,6 +1881,185 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos)
>  return 0;
>  }
>  
> +static void vfio_check_hot_bus_reset(VFIOPCIDevice *vdev, Error **errp)
> +{
> +PCIBus *bus = vdev->pdev.bus;
> +struct vfio_pci_hot_reset_info *info = NULL;
> +struct vfio_pci_dependent_device *devices;
> +VFIOGroup *group;
> +int ret, i, devfn;
> +
> +ret = vfio_get_hot_reset_info(vdev, &info);
> +if (ret) {
> +error_setg(errp, "vfio: Cannot enable AER for device %s,"
> +   " device does not support hot reset.",
> +   vdev->vbasedev.name);
> +return;
> +}
> +
> +/* List all affected devices by bus reset */
> +devices = &info->devices[0];
> +
> +/* Verify that we have all the groups required */
> +for (i = 0; i < info->count; i++) {
> +PCIHostDeviceAddress host;
> +VFIOPCIDevice *tmp;
> +VFIODevice *vbasedev_iter;
> +bool found = false;
> +
> +host.domain = devices[i].segment;
> +host.bus = devices[i].bus;
> +host.slot = PCI_SLOT(devices[i].devfn);
> +host.function = PCI_FUNC(devices[i].devfn);
> +
> +/* Skip the current device */
> +if (vfio_pci_host_match(&host, vdev->vbasedev.name, 0)) {
> +continue;
> +}
> +
> +/* Ensure we own the group of the affected device */
> +QLIST_FOREACH(group, &vfio_group_list, next) {
> +if (group->groupid == devices[i].group_id) {
> +break;
> +}
> +}
> +
> +if (!group) {
> +error_setg(errp, "vfio: Cannot enable AER for device %s, "
> +   "depends on group %d which is not owned.",
> +   vdev->vbasedev.name, devices[i].group_id);
> +goto out;
> +}
> +
> +/* Ensure affected devices for reset on the same bus */
> +QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
> +if (vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
> +continue;
>

Re: [Qemu-devel] [PATCH 1/2] block/qapi: make two printf() formats literal

2016-03-21 Thread Eric Blake

On 03/09/2016 06:46 PM, Peter Xu wrote:
> On Wed, Mar 09, 2016 at 03:14:03PM -0700, Eric Blake wrote:
>>> +func_fprintf(f, "%*s[%i]:%c", indentation * 4, "", i,
>>> + composite ? '\n' : ' ');
>>
>> [The nerd in me wants to point out that you could avoid the ternary by
>> writing '"\n "[composite]', but that's too ugly to use outside of IOCCC
>> submissions, and I wouldn't be surprised if it (rightfully) triggers
>> clang warnings]
> 
> Do you mean something like:
> 
> int i = 0;
> printf("%c", '"\n "[i]');

You mean:

printf("%c", "\n "[i]);

(no '').  But with your declaration of 'i' as int, it is only defined if
i <= 2 (whereas in my example above, "\n "[composite] is always defined
because composite is bool rather than int).

> 
> Is this a grammar btw?

Yes, C has an ugly grammar, because [] is just syntactic sugar for
dereferencing pointer addition with nicer operator precedence.  Quoting
C99 6.5.2.1:

"The definition of the subscript operator [] is that E1[E2] is identical
to (*((E1)+(E2))).  Because of the conversion rules that apply to the
binary + operator, if E1 is an array object (equivalently, a pointer to
the initial element of an array object) and E2 is an integer, E1[E2]
designates the E2-th element of E1 (counting from zero)."

And a string literal is just a fancy way of writing the address of an
array of characters (where the address is chosen by the compiler).

Thus, it IS valid to dereference the addition of an integer offset with
the address implied by a string literal in order to obtain a character
within the string.  And since the [] operator is commutative (even
though no one in their right mind commutes the operands), you can also
write the even-uglier:

composite["\n "]
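
A small sketch of the three spellings, for anyone who wants to try it. The function names are made up for this example. One detail worth checking when doing so: index 0 corresponds to false, so as far as I can tell the literal needs the space first, " \n", to select the same characters as the ternary in the quoted patch (newline for true, space for false).

```c
#include <stdbool.h>

/* The form used in the quoted patch. */
static char sep_ternary(bool composite)
{
    return composite ? '\n' : ' ';
}

/* Subscripting a string literal: false selects [0], true selects [1]. */
static char sep_index(bool composite)
{
    return " \n"[composite];
}

/* E1[E2] is *((E1)+(E2)), so the operands may be swapped. */
static char sep_commuted(bool composite)
{
    return composite[" \n"];
}
```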

But now we've gone far astray from the original patch review :)

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org


Re: [Qemu-devel] (no subject)

2016-03-21 Thread Peter Maydell

On 21 March 2016 at 18:00, John Snow  wrote:
> Looks like one of your libraries is outdated, for me
> 'IBV_LINK_LAYER_INFINIBAND' is defined in
> /usr/include/infiniband/verbs.h; provided by
> libibverbs-devel-1.1.8-3.fc22.x86_64.
>
> Maybe your libibverbs is too old.

We should probably add a suitable configure test.

thanks
-- PMM

[Qemu-devel] [PULL v2 26/40] ivshmem: Propagate errors through ivshmem_recv_setup()

2016-03-21 Thread Markus Armbruster

This kills off the funny state described in the previous commit.

Simplify ivshmem_io_read() accordingly, and update documentation.

Signed-off-by: Markus Armbruster 
Message-Id: <1458066895-20632-27-git-send-email-arm...@redhat.com>
Reviewed-by: Marc-André Lureau 
---
 docs/specs/ivshmem-spec.txt |  20 +++
 hw/misc/ivshmem.c   | 129 
 qemu-doc.texi   |   9 +---
 3 files changed, 95 insertions(+), 63 deletions(-)

diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
index 0cd63ad..4c33973 100644
--- a/docs/specs/ivshmem-spec.txt
+++ b/docs/specs/ivshmem-spec.txt
@@ -62,11 +62,11 @@ There are two ways to use this device:
   likely want to write a kernel driver to handle interrupts.  Requires
   the device to be configured for interrupts, obviously.
 
-If the device is configured for interrupts, BAR2 is initially invalid.
-It becomes safely accessible only after the ivshmem server provided
-the shared memory.  Guest software should wait for the IVPosition
-register (described below) to become non-negative before accessing
-BAR2.
+Before QEMU 2.6.0, BAR2 can initially be invalid if the device is
+configured for interrupts.  It becomes safely accessible only after
+the ivshmem server provided the shared memory.  Guest software should
+wait for the IVPosition register (described below) to become
+non-negative before accessing BAR2.
 
 The device is not capable to tell guest software whether it is
 configured for interrupts.
@@ -82,7 +82,7 @@ BAR 0 contains the following registers:
 4 4   read/write0   Interrupt Status
 bit 0: peer interrupt
 bit 1..31: reserved
-8 4   read-only   0 or -1   IVPosition
+8 4   read-only   0 or ID   IVPosition
12 4   write-only  N/A   Doorbell
 bit 0..15: vector
 bit 16..31: peer ID
@@ -100,12 +100,14 @@ when an interrupt request from a peer is received.  Reading the
 register clears it.
 
 IVPosition Register: if the device is not configured for interrupts,
-this is zero.  Else, it's -1 for a short while after reset, then
-changes to the device's ID (between 0 and 65535).
+this is zero.  Else, it is the device's ID (between 0 and 65535).
+
+Before QEMU 2.6.0, the register may read -1 for a short while after
+reset.
 
 There is no good way for software to find out whether the device is
 configured for interrupts.  A positive IVPosition means interrupts,
-but zero could be either.  The initial -1 cannot be reliably observed.
+but zero could be either.
 
 Doorbell Register: writing this register requests to interrupt a peer.
 The written value's high 16 bits are the ID of the peer to interrupt,
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index ad16828..7f439c3 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -234,12 +234,7 @@ static uint64_t ivshmem_io_read(void *opaque, hwaddr addr,
 break;
 
 case IVPOSITION:
-/* return my VM ID if the memory is mapped */
-if (memory_region_is_mapped(&s->ivshmem)) {
-ret = s->vm_id;
-} else {
-ret = -1;
-}
+ret = s->vm_id;
 break;
 
 default:
@@ -511,7 +506,8 @@ static bool fifo_update_and_get_i64(IVShmemState *s,
 return false;
 }
 
-static int ivshmem_add_kvm_msi_virq(IVShmemState *s, int vector)
+static void ivshmem_add_kvm_msi_virq(IVShmemState *s, int vector,
+ Error **errp)
 {
 PCIDevice *pdev = PCI_DEVICE(s);
 MSIMessage msg = msix_get_message(pdev, vector);
@@ -522,22 +518,21 @@ static int ivshmem_add_kvm_msi_virq(IVShmemState *s, int vector)
 
 ret = kvm_irqchip_add_msi_route(kvm_state, msg, pdev);
 if (ret < 0) {
-error_report("ivshmem: kvm_irqchip_add_msi_route failed");
-return -1;
+error_setg(errp, "kvm_irqchip_add_msi_route failed");
+return;
 }
 
 s->msi_vectors[vector].virq = ret;
 s->msi_vectors[vector].pdev = pdev;
-
-return 0;
 }
 
-static void setup_interrupt(IVShmemState *s, int vector)
+static void setup_interrupt(IVShmemState *s, int vector, Error **errp)
 {
 EventNotifier *n = &s->peers[s->vm_id].eventfds[vector];
 bool with_irqfd = kvm_msi_via_irqfd_enabled() &&
 ivshmem_has_feature(s, IVSHMEM_MSI);
 PCIDevice *pdev = PCI_DEVICE(s);
+Error *err = NULL;
 
 IVSHMEM_DPRINTF("setting up interrupt for vector: %d\n", vector);
 
@@ -546,13 +541,16 @@ static void setup_interrupt(IVShmemState *s, int vector)
 watch_vector_notifier(s, n, vector);
 } else if (msix_enabled(pdev)) {
 IVSHMEM_DPRINTF("with irqfd\n");
-if (ivshmem_add_kvm_msi_virq(s, vector) < 0) {
+

[Qemu-devel] [PULL v2 10/40] ivshmem: Rewrite specification document

2016-03-21 Thread Markus Armbruster

This started as an attempt to update ivshmem_device_spec.txt for
clarity, accuracy and completeness while working on its code, and
quickly became a full rewrite.  Since the diff would be useless
anyway, I'm using the opportunity to rename the file to
ivshmem-spec.txt.

I tried hard to ensure the new text contradicts neither the old text
nor the code.  If the new text contradicts the old text but not the
code, it's probably a bug in the old text.  If the new text
contradicts both, it's probably a bug in the new text.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-11-git-send-email-arm...@redhat.com>
---
 docs/specs/ivshmem-spec.txt| 243 +
 docs/specs/ivshmem_device_spec.txt | 161 
 2 files changed, 243 insertions(+), 161 deletions(-)
 create mode 100644 docs/specs/ivshmem-spec.txt
 delete mode 100644 docs/specs/ivshmem_device_spec.txt

diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
new file mode 100644
index 000..0e9185a
--- /dev/null
+++ b/docs/specs/ivshmem-spec.txt
@@ -0,0 +1,243 @@
+= Device Specification for Inter-VM shared memory device =
+
+The Inter-VM shared memory device (ivshmem) is designed to share a
+memory region between multiple QEMU processes running different guests
+and the host.  In order for all guests to be able to pick up the
+shared memory area, it is modeled by QEMU as a PCI device exposing
+said memory to the guest as a PCI BAR.
+
+The device can use a shared memory object on the host directly, or it
+can obtain one from an ivshmem server.
+
+In the latter case, the device can additionally interrupt its peers, and
+get interrupted by its peers.
+
+
+== Configuring the ivshmem PCI device ==
+
+There are two basic configurations:
+
+- Just shared memory: -device ivshmem,shm=NAME,...
+
+  This uses shared memory object NAME.
+
+- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
+
+  An ivshmem server must already be running on the host.  The device
+  connects to the server's UNIX domain socket via character device
+  CHR.
+
+  Each peer gets assigned a unique ID by the server.  IDs must be
+  between 0 and 65535.
+
+  Interrupts are message-signaled by default (MSI-X).  With msi=off
+  the device has no MSI-X capability, and uses legacy INTx instead.
+  vectors=N configures the number of vectors to use.
+
+For more details on ivshmem device properties, see The QEMU Emulator
+User Documentation (qemu-doc.*).
+
+
+== The ivshmem PCI device's guest interface ==
+
+The device has vendor ID 1af4, device ID 1110, revision 0.
+
+=== PCI BARs ===
+
+The ivshmem PCI device has two or three BARs:
+
+- BAR0 holds device registers (256 Byte MMIO)
+- BAR1 holds MSI-X table and PBA (only when using MSI-X)
+- BAR2 maps the shared memory object
+
+There are two ways to use this device:
+
+- If you only need the shared memory part, BAR2 suffices.  This way,
+  you have access to the shared memory in the guest and can use it as
+  you see fit.  Memnic, for example, uses ivshmem this way from guest
+  user space (see http://dpdk.org/browse/memnic).
+
+- If you additionally need the capability for peers to interrupt each
+  other, you need BAR0 and, if using MSI-X, BAR1.  You will most
+  likely want to write a kernel driver to handle interrupts.  Requires
+  the device to be configured for interrupts, obviously.
+
+If the device is configured for interrupts, BAR2 is initially invalid.
+It becomes safely accessible only after the ivshmem server provided
+the shared memory.  Guest software should wait for the IVPosition
+register (described below) to become non-negative before accessing
+BAR2.
+
+The device is not capable to tell guest software whether it is
+configured for interrupts.
+
+=== PCI device registers ===
+
+BAR 0 contains the following registers:
+
+Offset  Size  Access  On reset  Function
+0 4   read/write0   Interrupt Mask
+bit 0: peer interrupt
+bit 1..31: reserved
+4 4   read/write0   Interrupt Status
+bit 0: peer interrupt
+bit 1..31: reserved
+8 4   read-only   0 or -1   IVPosition
+   12 4   write-only  N/A   Doorbell
+bit 0..15: vector
+bit 16..31: peer ID
+   16   240   noneN/A   reserved
+
+Software should only access the registers as specified in column
+"Access".  Reserved bits should be ignored on read, and preserved on
+write.
+
+Interrupt Status and Mask Register together control the legacy INTx
+interrupt when the device has no MSI-X capability: INTx is asserted
+when the bit-wise AND of Status and Mask is non-zero and the device
+has no
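
The register layout in the table above can be sketched in C as follows. The enum names and helper functions are hypothetical and only restate what the quoted specification text says: the Doorbell value carries the peer ID in its high 16 bits and the vector in its low 16 bits, and INTx depends on the AND of Status and Mask when MSI-X is not in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* BAR 0 register offsets, taken from the table in the specification. */
enum {
    REG_INTR_MASK   = 0,
    REG_INTR_STATUS = 4,
    REG_IVPOSITION  = 8,
    REG_DOORBELL    = 12
};

/* Build a Doorbell value: bits 16..31 peer ID, bits 0..15 vector. */
static uint32_t doorbell_value(unsigned peer_id, unsigned vector)
{
    assert(peer_id <= 0xffff); /* IDs are between 0 and 65535 */
    assert(vector <= 0xffff);
    return ((uint32_t)peer_id << 16) | (uint32_t)vector;
}

/* INTx is asserted only without MSI-X and when Status AND Mask is set. */
static bool intx_asserted(uint32_t status, uint32_t mask, bool has_msix)
{
    return !has_msix && (status & mask) != 0;
}
```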

[Qemu-devel] [PULL v2 33/40] ivshmem: Inline check_shm_size() into its only caller

2016-03-21 Thread Markus Armbruster

Improve the error messages while there.

Signed-off-by: Markus Armbruster 
Message-Id: <1458066895-20632-34-git-send-email-arm...@redhat.com>
Reviewed-by: Marc-André Lureau 
---
 hw/misc/ivshmem.c | 37 +++--
 1 file changed, 11 insertions(+), 26 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 1b1de65..e6282ab 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -343,29 +343,6 @@ static void watch_vector_notifier(IVShmemState *s, EventNotifier *n,
 NULL, &s->msi_vectors[vector]);
 }
 
-static int check_shm_size(IVShmemState *s, int fd, Error **errp)
-{
-/* check that the guest isn't going to try and map more memory than the
- * the object has allocated return -1 to indicate error */
-
-struct stat buf;
-
-if (fstat(fd, &buf) < 0) {
-error_setg(errp, "exiting: fstat on fd %d failed: %s",
-   fd, strerror(errno));
-return -1;
-}
-
-if (s->ivshmem_size > buf.st_size) {
-error_setg(errp, "Requested memory size greater"
-   " than shared object size (%zu > %" PRIu64")",
-   s->ivshmem_size, (uint64_t)buf.st_size);
-return -1;
-} else {
-return 0;
-}
-}
-
 static void ivshmem_add_eventfd(IVShmemState *s, int posn, int i)
 {
 memory_region_add_eventfd(&s->ivshmem_mmio,
@@ -480,7 +457,7 @@ static void setup_interrupt(IVShmemState *s, int vector, Error **errp)
 
 static void process_msg_shmem(IVShmemState *s, int fd, Error **errp)
 {
-Error *err = NULL;
+struct stat buf;
 void *ptr;
 
 if (s->ivshmem_bar2) {
@@ -489,8 +466,16 @@ static void process_msg_shmem(IVShmemState *s, int fd, Error **errp)
 return;
 }
 
-if (check_shm_size(s, fd, &err) == -1) {
-error_propagate(errp, err);
+if (fstat(fd, &buf) < 0) {
+error_setg_errno(errp, errno,
+"can't determine size of shared memory sent by server");
+close(fd);
+return;
+}
+
+if (s->ivshmem_size > buf.st_size) {
+error_setg(errp, "server sent only %zd bytes of shared memory",
+   (size_t)buf.st_size);
 close(fd);
 return;
 }
-- 
2.4.3
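
The size check that this patch inlines can be sketched outside QEMU with plain POSIX calls. This is only an illustration under stated assumptions: check_shared_size() and its message buffer are hypothetical stand-ins for the error_setg_errno()/error_setg() calls in the patch, and the message wording is not QEMU's.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not QEMU code.  Returns 0 when the object behind
 * fd holds at least `wanted` bytes; otherwise returns -1 and writes a
 * human-readable reason into msg[0..msglen).
 */
static int check_shared_size(int fd, size_t wanted, char *msg, size_t msglen)
{
    struct stat buf;

    assert(msg != NULL && msglen > 0);

    if (fstat(fd, &buf) < 0) {
        /* The size is unknown: report why, including the errno text. */
        snprintf(msg, msglen, "can't determine size of shared memory: %s",
                 strerror(errno));
        return -1;
    }

    /* st_size is a signed off_t, so reject negatives before comparing. */
    if (buf.st_size < 0 || (uintmax_t)buf.st_size < wanted) {
        snprintf(msg, msglen, "got only %jd bytes of shared memory",
                 (intmax_t)buf.st_size);
        return -1;
    }

    msg[0] = '\0';
    return 0;
}
```

As in the patch, the caller stays responsible for closing fd on failure.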

[Qemu-devel] [PULL v2 31/40] ivshmem: Implement shm=... with a memory backend

2016-03-21 Thread Markus Armbruster

ivshmem has its very own code to create and map shared memory.
Replace that with an implicitly created memory backend.  Reduces the
number of ways we create BAR 2 from three to two.

The memory-backend-file is currently available only with CONFIG_LINUX,
so this adds a second Linuxism to ivshmem (the other one is eventfd).
Should we ever need to make it portable to systems where
memory-backend-file can't be made to serve, we could create a
memory-backend-shmem that allocates memory with shm_open().

Bonus fix: shared memory files are now created with permissions 0655
instead of 0777.

Signed-off-by: Markus Armbruster 
Reviewed-by: Paolo Bonzini 
Message-Id: <1458066895-20632-32-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 79 ---
 1 file changed, 23 insertions(+), 56 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 66c713e..138ae9d 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -26,6 +26,7 @@
 #include "migration/migration.h"
 #include "qemu/error-report.h"
 #include "qemu/event_notifier.h"
+#include "qom/object_interfaces.h"
 #include "sysemu/char.h"
 #include "sysemu/hostmem.h"
 #include "qapi/visitor.h"
@@ -369,31 +370,6 @@ static int check_shm_size(IVShmemState *s, int fd, Error 
**errp)
 }
 }
 
-/* create the shared memory BAR when we are not using the server, so we can
- * create the BAR and map the memory immediately */
-static int create_shared_memory_BAR(IVShmemState *s, int fd, uint8_t attr,
-Error **errp)
-{
-void * ptr;
-
-ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
-if (ptr == MAP_FAILED) {
-error_setg_errno(errp, errno, "Failed to mmap shared memory");
-return -1;
-}
-
-memory_region_init_ram_ptr(&s->ivshmem, OBJECT(s), "ivshmem.bar2",
-   s->ivshmem_size, ptr);
-qemu_set_ram_fd(memory_region_get_ram_addr(&s->ivshmem), fd);
-vmstate_register_ram(&s->ivshmem, DEVICE(s));
-memory_region_add_subregion(&s->bar, 0, &s->ivshmem);
-
-/* region for shared memory */
-pci_register_bar(PCI_DEVICE(s), 2, attr, &s->bar);
-
-return 0;
-}
-
 static void ivshmem_add_eventfd(IVShmemState *s, int posn, int i)
 {
 memory_region_add_eventfd(&s->ivshmem_mmio,
@@ -837,6 +813,23 @@ static void ivshmem_write_config(PCIDevice *pdev, uint32_t 
address,
 }
 }
 
+static void desugar_shm(IVShmemState *s)
+{
+Object *obj;
+char *path;
+
+obj = object_new("memory-backend-file");
+path = g_strdup_printf("/dev/shm/%s", s->shmobj);
+object_property_set_str(obj, path, "mem-path", &error_abort);
+g_free(path);
+object_property_set_int(obj, s->ivshmem_size, "size", &error_abort);
+object_property_set_bool(obj, true, "share", &error_abort);
+object_property_add_child(OBJECT(s), "internal-shm-backend", obj,
+  &error_abort);
+user_creatable_complete(obj, &error_abort);
+s->hostmem = MEMORY_BACKEND(obj);
+}
+
 static void pci_ivshmem_realize(PCIDevice *dev, Error **errp)
 {
 IVShmemState *s = IVSHMEM(dev);
@@ -915,6 +908,10 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 attr |= PCI_BASE_ADDRESS_MEM_TYPE_64;
 }
 
+if (s->shmobj) {
+desugar_shm(s);
+}
+
 if (s->hostmem != NULL) {
 MemoryRegion *mr;
 
@@ -925,7 +922,7 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 vmstate_register_ram(mr, DEVICE(s));
 memory_region_add_subregion(&s->bar, 0, mr);
 pci_register_bar(PCI_DEVICE(s), 2, attr, &s->bar);
-} else if (s->server_chr != NULL) {
+} else {
 IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
 s->server_chr->filename);
 
@@ -952,36 +949,6 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 error_setg(errp, "failed to initialize interrupts");
 return;
 }
-} else {
-/* just map the file immediately, we're not using a server */
-int fd;
-
-IVSHMEM_DPRINTF("using shm_open (shm object = %s)\n", s->shmobj);
-
-/* try opening with O_EXCL and if it succeeds zero the memory
- * by truncating to 0 */
-if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR|O_EXCL,
-S_IRWXU|S_IRWXG|S_IRWXO)) > 0) {
-   /* truncate file to length PCI device's memory */
-if (ftruncate(fd, s->ivshmem_size) != 0) {
-error_report("could not truncate shared file");
-}
-
-} else if ((fd = shm_open(s->shmobj, O_CREAT|O_RDWR,
-S_IRWXU|S_IRWXG|S_IRWXO)) < 0) {
-error_setg(errp, "could not open shared file");
-return;
-}
-
-if (check_shm_size(s, fd, errp) == -1) {
-return;
-}
-
-create_shared_memory_BAR(s, fd, attr, &err);
-if (err) {
-

Re: [Qemu-devel] [PATCH] qdict: fix unbounded stack for qdict_array_entries

2016-03-21 Thread Eric Blake

On 03/09/2016 06:36 PM, Peter Xu wrote:
> Sorry, I forgot to CC Eric/Markus/Kevin.
> 
> The patch title is not correct; it should be:
> 
> "Fix unbounded stack warning for qdict_array_entries"

Keep the 'qdict:' prefix, but yes, adding "warning" helps the commit
message.

> 
> Do I need to re-send with the same content?

For just the title adjustment, it's up to the maintainer.  Often, a
maintainer will make small changes like that before sending a pull request.

> 
> I'm using g_strdup_printf() here because it is the most convenient and
> safe option, and it is only called rarely, when a quorum device
> opens.

On the other hand, this information might have been useful...

> 
> Thanks.
> Peter
> 
> On Wed, Mar 09, 2016 at 02:03:38PM +0800, Peter Xu wrote:
>> Signed-off-by: Peter Xu 

...in the commit body proper (explaining why you are always allocating,
because it is not a hot path).  So a v2 might indeed be easier.

>> +++ b/qobject/qdict.c
>> @@ -704,19 +704,16 @@ int qdict_array_entries(QDict *src, const char 
>> *subqdict)
>>  for (i = 0; i < INT_MAX; i++) {
>>  QObject *subqobj;
>>  int subqdict_entries;
>> -size_t slen = 32 + subqdict_len;
>> -char indexstr[slen], prefix[slen];
>> -size_t snprintf_ret;
>> +char *prefix = g_strdup_printf("%s%u.", subqdict, i);

If we were worried that this could be a hot path, you could add a %n and
 here...

>>  
>> -snprintf_ret = snprintf(indexstr, slen, "%s%u", subqdict, i);
>> -assert(snprintf_ret < slen);
>> +subqdict_entries = qdict_count_prefixed_entries(src, prefix);
>>  
>> -subqobj = qdict_get(src, indexstr);
>> +/* Remove ending "." */
>> +prefix[strlen(prefix) - 1] = 0x00;

...to avoid the strlen() call here.  But this is not a hot path, and %n
always makes me worry about security, so I'm fine with your approach.

However, 0x00 is a rather verbose way of writing 0 (and even if you want
verbosity, '\0' is more idiomatic than 0x00).

At this point, if you send a v2 with s/0x00/0/ and the improved commit
message, you can also include:
Reviewed-by: Eric Blake 

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org


[Qemu-devel] [PULL v2 34/40] qdev: New DEFINE_PROP_ON_OFF_AUTO

2016-03-21 Thread Markus Armbruster

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-35-git-send-email-arm...@redhat.com>
---
 hw/core/qdev-properties.c| 10 ++
 include/hw/qdev-properties.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index bc89800..d2f5a08 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -516,6 +516,16 @@ PropertyInfo qdev_prop_macaddr = {
 .set   = set_mac,
 };
 
+/* --- on/off/auto --- */
+
+PropertyInfo qdev_prop_on_off_auto = {
+.name = "OnOffAuto",
+.description = "on/off/auto",
+.enum_table = OnOffAuto_lookup,
+.get = get_enum,
+.set = set_enum,
+};
+
 /* --- lost tick policy --- */
 
 QEMU_BUILD_BUG_ON(sizeof(LostTickPolicy) != sizeof(int));
diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
index 03a1b91..0586cac 100644
--- a/include/hw/qdev-properties.h
+++ b/include/hw/qdev-properties.h
@@ -18,6 +18,7 @@ extern PropertyInfo qdev_prop_string;
 extern PropertyInfo qdev_prop_chr;
 extern PropertyInfo qdev_prop_ptr;
 extern PropertyInfo qdev_prop_macaddr;
+extern PropertyInfo qdev_prop_on_off_auto;
 extern PropertyInfo qdev_prop_losttickpolicy;
 extern PropertyInfo qdev_prop_bios_chs_trans;
 extern PropertyInfo qdev_prop_fdc_drive_type;
@@ -155,6 +156,8 @@ extern PropertyInfo qdev_prop_arraylen;
 DEFINE_PROP(_n, _s, _f, qdev_prop_drive, BlockBackend *)
 #define DEFINE_PROP_MACADDR(_n, _s, _f) \
 DEFINE_PROP(_n, _s, _f, qdev_prop_macaddr, MACAddr)
+#define DEFINE_PROP_ON_OFF_AUTO(_n, _s, _f, _d) \
+DEFINE_PROP_DEFAULT(_n, _s, _f, _d, qdev_prop_on_off_auto, OnOffAuto)
 #define DEFINE_PROP_LOSTTICKPOLICY(_n, _s, _f, _d) \
 DEFINE_PROP_DEFAULT(_n, _s, _f, _d, qdev_prop_losttickpolicy, \
 LostTickPolicy)
-- 
2.4.3
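
The new property relies on an enum lookup table plus generic enum getters and setters. A minimal sketch of that idea in standalone C follows; TriState, tri_state_lookup and both functions are hypothetical names chosen for illustration and do not correspond to QEMU's generated OnOffAuto code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a three-valued on/off/auto setting. */
typedef enum { TRI_AUTO, TRI_ON, TRI_OFF, TRI_MAX } TriState;

/* Lookup table indexed by enum value, used for both directions. */
static const char *const tri_state_lookup[TRI_MAX] = {
    [TRI_AUTO] = "auto",
    [TRI_ON]   = "on",
    [TRI_OFF]  = "off",
};

/* String to enum: returns false when str names no known value. */
static bool tri_state_parse(const char *str, TriState *out)
{
    assert(str != NULL && out != NULL);
    for (int i = 0; i < TRI_MAX; i++) {
        if (strcmp(str, tri_state_lookup[i]) == 0) {
            *out = (TriState)i;
            return true;
        }
    }
    return false;
}

/* Enum to string, for printing the current setting. */
static const char *tri_state_name(TriState v)
{
    assert((unsigned)v < TRI_MAX);
    return tri_state_lookup[v];
}
```

Keeping a single table for parsing and printing means the two directions cannot drift apart.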

[Qemu-devel] [PULL v2 36/40] ivshmem: Split ivshmem-plain, ivshmem-doorbell off ivshmem

2016-03-21 Thread Markus Armbruster

ivshmem can be configured with and without interrupt capability
(a.k.a. "doorbell").  The two configurations have largely disjoint
options, which makes for a confusing (and badly checked) user
interface.  Moreover, the device can't tell the guest whether its
doorbell is enabled.

Create two new device models ivshmem-plain and ivshmem-doorbell, and
deprecate the old one.

Changes from ivshmem:

* PCI revision is 1 instead of 0.  The new revision is fully backwards
  compatible for guests.  Guests may elect to require at least
  revision 1 to make sure they're not exposed to the funny "no shared
  memory, yet" state.

* Property "role" replaced by "master".  role=master becomes
  master=on, role=peer becomes master=off.  Default is off instead of
  auto.

* Property "use64" is gone.  The new devices always have 64 bit BARs.

Changes from ivshmem to ivshmem-plain:

* The Interrupt Pin register in PCI config space is zero (does not use
  an interrupt pin) instead of one (uses INTA).

* Property "x-memdev" is renamed to "memdev".

* Properties "shm" and "size" are gone.  Use property "memdev"
  instead.

* Property "msi" is gone.  The new device can't have MSI-X capability.
  It can't interrupt anyway.

* Properties "ioeventfd" and "vectors" are gone.  They're meaningless
  without interrupts anyway.

Changes from ivshmem to ivshmem-doorbell:

* Property "msi" is gone.  The new device always has MSI-X capability.

* Property "ioeventfd" defaults to on instead of off.

* Property "size" is gone.  The new device can only map all the shared
  memory received from the server.

Guests can easily find out whether the device is configured for
interrupts by checking for MSI-X capability.
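
As an illustration of that check, a guest could walk the standard PCI capability list looking for the MSI-X capability ID. The sketch below assumes the guest already holds a copy of the device's 256-byte configuration space; config_has_msix() and the macro names are hypothetical, while the offsets and the capability ID are the standard PCI ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard PCI configuration space offsets and values. */
#define EX_PCI_STATUS          0x06  /* low byte of the Status register */
#define EX_PCI_STATUS_CAP_LIST 0x10  /* Status bit 4: capability list present */
#define EX_PCI_CAP_POINTER     0x34  /* offset of the first capability */
#define EX_PCI_CAP_ID_MSIX     0x11  /* MSI-X capability ID */
#define EX_PCI_CONFIG_BYTES    256

/*
 * Hypothetical guest-side helper: cfg is a copy of the device's
 * configuration space.  Returns true if an MSI-X capability is listed.
 */
static bool config_has_msix(const uint8_t cfg[EX_PCI_CONFIG_BYTES])
{
    uint8_t pos;
    int guard = 48; /* bound the walk in case the list is malformed */

    assert(cfg != NULL);
    if (!(cfg[EX_PCI_STATUS] & EX_PCI_STATUS_CAP_LIST)) {
        return false;
    }
    pos = cfg[EX_PCI_CAP_POINTER] & 0xfc; /* pointers are 4-byte aligned */
    while (pos >= 0x40 && guard-- > 0) {
        if (cfg[pos] == EX_PCI_CAP_ID_MSIX) {
            return true;
        }
        pos = cfg[pos + 1] & 0xfc; /* follow the next-capability pointer */
    }
    return false;
}
```

A real driver would normally use its operating system's PCI helpers instead of parsing raw configuration bytes.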

Note: some code added in sub-optimal places to make the diff easier to
review.  The next commit will move it to more sensible places.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-37-git-send-email-arm...@redhat.com>
---
 docs/specs/ivshmem-spec.txt |  66 -
 hw/misc/ivshmem.c   | 329 
 qemu-doc.texi   |  33 +++--
 tests/ivshmem-test.c|  12 +-
 4 files changed, 304 insertions(+), 136 deletions(-)

diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
index 4c33973..f3912c0 100644
--- a/docs/specs/ivshmem-spec.txt
+++ b/docs/specs/ivshmem-spec.txt
@@ -17,9 +17,10 @@ get interrupted by its peers.
 
 There are two basic configurations:
 
-- Just shared memory: -device ivshmem,shm=NAME,...
+- Just shared memory: -device ivshmem-plain,memdev=HMB,...
 
-  This uses shared memory object NAME.
+  This uses host memory backend HMB.  It should have option "share"
+  set.
 
 - Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...
 
@@ -30,9 +31,8 @@ There are two basic configurations:
   Each peer gets assigned a unique ID by the server.  IDs must be
   between 0 and 65535.
 
-  Interrupts are message-signaled by default (MSI-X).  With msi=off
-  the device has no MSI-X capability, and uses legacy INTx instead.
-  vectors=N configures the number of vectors to use.
+  Interrupts are message-signaled (MSI-X).  vectors=N configures the
+  number of vectors to use.
 
 For more details on ivshmem device properties, see The QEMU Emulator
 User Documentation (qemu-doc.*).
@@ -40,14 +40,15 @@ User Documentation (qemu-doc.*).
 
 == The ivshmem PCI device's guest interface ==
 
-The device has vendor ID 1af4, device ID 1110, revision 0.
+The device has vendor ID 1af4, device ID 1110, revision 1.  Before
+QEMU 2.6.0, it had revision 0.
 
 === PCI BARs ===
 
 The ivshmem PCI device has two or three BARs:
 
 - BAR0 holds device registers (256 Byte MMIO)
-- BAR1 holds MSI-X table and PBA (only when using MSI-X)
+- BAR1 holds MSI-X table and PBA (only ivshmem-doorbell)
 - BAR2 maps the shared memory object
 
 There are two ways to use this device:
@@ -58,18 +59,19 @@ There are two ways to use this device:
   user space (see http://dpdk.org/browse/memnic).
 
 - If you additionally need the capability for peers to interrupt each
-  other, you need BAR0 and, if using MSI-X, BAR1.  You will most
-  likely want to write a kernel driver to handle interrupts.  Requires
-  the device to be configured for interrupts, obviously.
+  other, you need BAR0 and BAR1.  You will most likely want to write a
+  kernel driver to handle interrupts.  Requires the device to be
+  configured for interrupts, obviously.
 
 Before QEMU 2.6.0, BAR2 can initially be invalid if the device is
 configured for interrupts.  It becomes safely accessible only after
-the ivshmem server provided the shared memory.  Guest software should
-wait for the IVPosition register (described below) to become
-non-negative before accessing BAR2.
+the ivshmem server provided the shared memory.  These devices have PCI
+revision 0 rather than 1.  Guest software should wait for the
+IVPosition register

[Qemu-devel] [PULL v2 16/40] ivshmem: Fix harmless misuse of Error

2016-03-21 Thread Markus Armbruster

We reuse errp after passing it to host_memory_backend_get_memory().  If
both host_memory_backend_get_memory() and the reuse set an error, the
reuse will fail the assertion in error_setv().  Fortunately,
host_memory_backend_get_memory() can't fail.

Pass it &error_abort to make our assumption explicit, and to get the
assertion failure in the right place should it become invalid.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-17-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 0ac0238..299cf5b 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -842,7 +842,7 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 g_warning("size argument ignored with hostmem");
 }
 
-mr = host_memory_backend_get_memory(s->hostmem, errp);
+mr = host_memory_backend_get_memory(s->hostmem, &error_abort);
 s->ivshmem_size = memory_region_size(mr);
 } else if (s->sizearg == NULL) {
 s->ivshmem_size = 4 << 20; /* 4 MB default */
@@ -907,7 +907,8 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 
 IVSHMEM_DPRINTF("using hostmem\n");
 
-mr = host_memory_backend_get_memory(MEMORY_BACKEND(s->hostmem), errp);
+mr = host_memory_backend_get_memory(MEMORY_BACKEND(s->hostmem),
+&error_abort);
 vmstate_register_ram(mr, DEVICE(s));
 memory_region_add_subregion(&s->bar, 0, mr);
 pci_register_bar(PCI_DEVICE(s), 2, attr, &s->bar);
@@ -1134,7 +1135,7 @@ static void ivshmem_check_memdev_is_busy(Object *obj, 
const char *name,
 {
 MemoryRegion *mr;
 
-mr = host_memory_backend_get_memory(MEMORY_BACKEND(val), errp);
+mr = host_memory_backend_get_memory(MEMORY_BACKEND(val), &error_abort);
 if (memory_region_is_mapped(mr)) {
 char *path = object_get_canonical_path_component(val);
 error_setg(errp, "can't use already busy memdev: %s", path);
-- 
2.4.3
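
The hazard this patch addresses can be shown with a toy model of the error convention. Everything below is a hypothetical stand-in written for illustration (ExampleError, example_error_set(), example_error_abort and the two callers); it only mimics the rule that an error destination may be set once, and that a dedicated sentinel turns any error into an immediate abort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct ExampleError { const char *msg; } ExampleError;

/* Sentinel: passing &example_error_abort means "this call must not fail". */
static ExampleError *example_error_abort;

static void example_error_set(ExampleError **errp, const char *msg)
{
    static ExampleError slot; /* one static slot keeps the sketch short */

    if (errp == &example_error_abort) {
        abort();              /* the "cannot fail" assumption was wrong */
    }
    if (!errp) {
        return;               /* caller ignores errors */
    }
    assert(*errp == NULL);    /* a second error must not overwrite the first */
    slot.msg = msg;
    *errp = &slot;
}

/* A callee that is assumed never to fail. */
static int example_get_size(ExampleError **errp)
{
    (void)errp;
    return 4096;
}

static void example_realize(int wanted, ExampleError **errp)
{
    /* Not errp: if this call ever failed, the set below would assert. */
    int size = example_get_size(&example_error_abort);

    if (wanted > size) {
        example_error_set(errp, "requested size too large");
    }
}
```

Passing the sentinel documents the assumption at the call site and makes a violation fail at the call that broke it, not at a later, unrelated error report.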

[Qemu-devel] [PULL v2 27/40] ivshmem: Rely on server sending the ID right after the version

2016-03-21 Thread Markus Armbruster

The protocol specification (ivshmem-spec.txt, formerly
ivshmem_device_spec.txt) has always required the ID message to be sent
right at the beginning, and ivshmem-server has always complied.  The
device, however, accepts it out of order.  If an interrupt setup
arrived before it, though, it would be misinterpreted as connect
notification.  Fix the latent bug by relying on the spec and
ivshmem-server's actual behavior.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-28-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 27 ---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 7f439c3..da32a74 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -653,8 +653,6 @@ static void process_msg(IVShmemState *s, int64_t msg, int 
fd, Error **errp)
 
 if (fd >= 0) {
 process_msg_connect(s, msg, fd, errp);
-} else if (s->vm_id == -1) {
-s->vm_id = msg;
 } else {
 process_msg_disconnect(s, msg, errp);
 }
@@ -722,6 +720,30 @@ static void ivshmem_recv_setup(IVShmemState *s, Error 
**errp)
 }
 
 /*
+ * ivshmem-server sends the remaining initial messages in a fixed
+ * order, but the device has always accepted them in any order.
+ * Stay as compatible as practical, just in case people use
+ * servers that behave differently.
+ */
+
+/*
+ * ivshmem_device_spec.txt has always required the ID message
+ * right here, and ivshmem-server has always complied.  However,
+ * older versions of the device accepted it out of order, but
+ * broke when an interrupt setup message arrived before it.
+ */
+msg = ivshmem_recv_msg(s, &fd, &err);
+if (err) {
+error_propagate(errp, err);
+return;
+}
+if (fd != -1 || msg < 0 || msg > IVSHMEM_MAX_PEERS) {
+error_setg(errp, "server sent invalid ID message");
+return;
+}
+s->vm_id = msg;
+
+/*
  * Receive more messages until we got shared memory.
  */
 do {
@@ -956,7 +978,6 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 
 /* we allocate enough space for 16 peers and grow as needed */
 resize_peers(s, 16);
-s->vm_id = -1;
 
 pci_register_bar(dev, 2, attr, &s->bar);
 
-- 
2.4.3
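
The validation added for the ID message can be isolated into a small predicate. The sketch below is an illustration under assumptions: EXAMPLE_MAX_PEER_ID is taken from the spec's statement that IDs lie between 0 and 65535, and ExampleDevice with example_recv_id() are hypothetical names, not QEMU's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: peer IDs range over 0..65535, as the ivshmem spec states. */
#define EXAMPLE_MAX_PEER_ID 65535

/* Hypothetical device state holding only the field this sketch needs. */
typedef struct { int vm_id; } ExampleDevice;

/* An ID message carries a small non-negative number and no descriptor. */
static bool id_message_is_valid(int64_t msg, int fd)
{
    if (fd != -1) {
        return false; /* a file descriptor means this is another message */
    }
    return msg >= 0 && msg <= EXAMPLE_MAX_PEER_ID;
}

/* Accept the first message after the version only if it is a valid ID. */
static bool example_recv_id(ExampleDevice *dev, int64_t msg, int fd)
{
    assert(dev != NULL);
    if (!id_message_is_valid(msg, fd)) {
        return false;
    }
    dev->vm_id = (int)msg;
    return true;
}
```

Because the ID is consumed at a fixed position, a later message without a file descriptor can only be a disconnect notification.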

[Qemu-devel] [PULL v2 37/40] ivshmem: Clean up after the previous commit

2016-03-21 Thread Markus Armbruster

Move code to more sensible places.  Use the opportunity to reorder and
document IVShmemState members.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-38-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 400 +++---
 1 file changed, 203 insertions(+), 197 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 89076e4..527d636 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -85,26 +85,31 @@ typedef struct IVShmemState {
 PCIDevice parent_obj;
 /*< public >*/
 
-HostMemoryBackend *hostmem;
+uint32_t features;
+
+/* exactly one of these two may be set */
+HostMemoryBackend *hostmem; /* without interrupts */
+CharDriverState *server_chr; /* with interrupts */
+
+/* registers */
 uint32_t intrmask;
 uint32_t intrstatus;
+int vm_id;
 
-CharDriverState *server_chr;
-MemoryRegion ivshmem_mmio;
-
+/* BARs */
+MemoryRegion ivshmem_mmio;  /* BAR 0 (registers) */
 MemoryRegion *ivshmem_bar2; /* BAR 2 (shared memory) */
 MemoryRegion server_bar2;   /* used with server_chr */
 
+/* interrupt support */
 Peer *peers;
 int nb_peers;   /* space in @peers[] */
-
-int vm_id;
 uint32_t vectors;
-uint32_t features;
 MSIVector *msi_vectors;
 uint64_t msg_buf;   /* buffer for receiving server messages */
 int msg_buffered_bytes; /* #bytes in @msg_buf */
 
+/* migration stuff */
 OnOffAuto master;
 Error *migration_blocker;
 
@@ -830,23 +835,6 @@ static void ivshmem_write_config(PCIDevice *pdev, uint32_t 
address,
 }
 }
 
-static void desugar_shm(IVShmemState *s)
-{
-Object *obj;
-char *path;
-
-obj = object_new("memory-backend-file");
-path = g_strdup_printf("/dev/shm/%s", s->shmobj);
-object_property_set_str(obj, path, "mem-path", &error_abort);
-g_free(path);
-object_property_set_int(obj, s->legacy_size, "size", &error_abort);
-object_property_set_bool(obj, true, "share", &error_abort);
-object_property_add_child(OBJECT(s), "internal-shm-backend", obj,
-  &error_abort);
-user_creatable_complete(obj, &error_abort);
-s->hostmem = MEMORY_BACKEND(obj);
-}
-
 static void ivshmem_common_realize(PCIDevice *dev, Error **errp)
 {
 IVShmemState *s = IVSHMEM_COMMON(dev);
@@ -922,65 +910,6 @@ static void ivshmem_common_realize(PCIDevice *dev, Error 
**errp)
 }
 }
 
-static void ivshmem_realize(PCIDevice *dev, Error **errp)
-{
-IVShmemState *s = IVSHMEM_COMMON(dev);
-
-if (!qtest_enabled()) {
-error_report("ivshmem is deprecated, please use ivshmem-plain"
- " or ivshmem-doorbell instead");
-}
-
-if (!!s->server_chr + !!s->shmobj + !!s->hostmem != 1) {
-error_setg(errp,
-   "You must specify either 'shm', 'chardev' or 'x-memdev'");
-return;
-}
-
-if (s->hostmem) {
-if (s->sizearg) {
-g_warning("size argument ignored with hostmem");
-}
-} else if (s->sizearg == NULL) {
-s->legacy_size = 4 << 20; /* 4 MB default */
-} else {
-char *end;
-int64_t size = qemu_strtosz(s->sizearg, &end);
-if (size < 0 || (size_t)size != size || *end != '\0'
-|| !is_power_of_2(size)) {
-error_setg(errp, "Invalid size %s", s->sizearg);
-return;
-}
-s->legacy_size = size;
-}
-
-/* check that role is reasonable */
-if (s->role) {
-if (strncmp(s->role, "peer", 5) == 0) {
-s->master = ON_OFF_AUTO_OFF;
-} else if (strncmp(s->role, "master", 7) == 0) {
-s->master = ON_OFF_AUTO_ON;
-} else {
-error_setg(errp, "'role' must be 'peer' or 'master'");
-return;
-}
-} else {
-s->master = ON_OFF_AUTO_AUTO;
-}
-
-if (s->shmobj) {
-desugar_shm(s);
-}
-
-/*
- * Note: we don't use INTx with IVSHMEM_MSI at all, so this is a
- * bald-faced lie then.  But it's a backwards compatible lie.
- */
-pci_config_set_interrupt_pin(dev->config, 1);
-
-ivshmem_common_realize(dev, errp);
-}
-
 static void ivshmem_exit(PCIDevice *dev)
 {
 IVShmemState *s = IVSHMEM_COMMON(dev);
@@ -1022,18 +951,6 @@ static void ivshmem_exit(PCIDevice *dev)
 g_free(s->msi_vectors);
 }
 
-static bool test_msix(void *opaque, int version_id)
-{
-IVShmemState *s = opaque;
-
-return ivshmem_has_feature(s, IVSHMEM_MSI);
-}
-
-static bool test_no_msix(void *opaque, int version_id)
-{
-return !test_msix(opaque, version_id);
-}
-
 static int ivshmem_pre_load(void *opaque)
 {
 IVShmemState *s = opaque;
@@ -1056,70 +973,6 @@ static int ivshmem_post_load(void *opaque, int version_id)
 return 0;
 }
 
-static int ivshmem_load_old(QEMUFile *f, void *opaque, int version_id)
-{
-IVShmemState *s =

[Qemu-devel] [PULL v2 40/40] contrib/ivshmem-server: Print "not for production" warning

2016-03-21 Thread Markus Armbruster

The code is okay for illustrating how things work and for testing, but
its error handling makes it unfit for production use.  Print a warning
to protect the innocent.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-41-git-send-email-arm...@redhat.com>
---
 contrib/ivshmem-server/main.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/contrib/ivshmem-server/main.c b/contrib/ivshmem-server/main.c
index 5afa8ee..dc64a18 100644
--- a/contrib/ivshmem-server/main.c
+++ b/contrib/ivshmem-server/main.c
@@ -200,6 +200,12 @@ main(int argc, char *argv[])
 };
 int ret = 1;
 
+/*
+ * Do not remove this notice without adding proper error handling!
+ * Start with handling ivshmem_server_send_one_msg() failure.
+ */
+printf("*** Example code, do not use in production ***\n");
+
 /* parse arguments, will exit on error */
 ivshmem_server_parse_args(&args, argc, argv);
 
-- 
2.4.3

[Qemu-devel] [PULL v2 35/40] ivshmem: Replace int role_val by OnOffAuto master

2016-03-21 Thread Markus Armbruster

In preparation of making it a qdev property.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-36-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index e6282ab..f903fae 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -43,9 +43,6 @@
 #define IVSHMEM_IOEVENTFD   0
 #define IVSHMEM_MSI 1
 
-#define IVSHMEM_PEER    0
-#define IVSHMEM_MASTER  1
-
 #define IVSHMEM_REG_BAR_SIZE 0x100
 
 #define IVSHMEM_DEBUG 0
@@ -97,12 +94,12 @@ typedef struct IVShmemState {
 uint64_t msg_buf;   /* buffer for receiving server messages */
 int msg_buffered_bytes; /* #bytes in @msg_buf */
 
+OnOffAuto master;
 Error *migration_blocker;
 
 char * shmobj;
 char * sizearg;
 char * role;
-int role_val;   /* scalar to avoid multiple string comparisons */
 } IVShmemState;
 
 /* registers for the Inter-VM shared memory device */
@@ -118,6 +115,12 @@ static inline uint32_t ivshmem_has_feature(IVShmemState 
*ivs,
 return (ivs->features & (1 << feature));
 }
 
+static inline bool ivshmem_is_master(IVShmemState *s)
+{
+assert(s->master != ON_OFF_AUTO_AUTO);
+return s->master == ON_OFF_AUTO_ON;
+}
+
 static void ivshmem_update_irq(IVShmemState *s)
 {
 PCIDevice *d = PCI_DEVICE(s);
@@ -856,15 +859,15 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 /* check that role is reasonable */
 if (s->role) {
 if (strncmp(s->role, "peer", 5) == 0) {
-s->role_val = IVSHMEM_PEER;
+s->master = ON_OFF_AUTO_OFF;
 } else if (strncmp(s->role, "master", 7) == 0) {
-s->role_val = IVSHMEM_MASTER;
+s->master = ON_OFF_AUTO_ON;
 } else {
 error_setg(errp, "'role' must be 'peer' or 'master'");
 return;
 }
 } else {
-s->role_val = IVSHMEM_MASTER; /* default */
+s->master = ON_OFF_AUTO_AUTO;
 }
 
 pci_conf = dev->config;
@@ -926,7 +929,11 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 vmstate_register_ram(s->ivshmem_bar2, DEVICE(s));
 pci_register_bar(PCI_DEVICE(s), 2, attr, s->ivshmem_bar2);
 
-if (s->role_val == IVSHMEM_PEER) {
+if (s->master == ON_OFF_AUTO_AUTO) {
+s->master = s->vm_id == 0 ? ON_OFF_AUTO_ON : ON_OFF_AUTO_OFF;
+}
+
+if (!ivshmem_is_master(s)) {
 error_setg(&s->migration_blocker,
"Migration is disabled when using feature 'peer mode' in 
device 'ivshmem'");
 migrate_add_blocker(s->migration_blocker);
@@ -990,7 +997,7 @@ static int ivshmem_pre_load(void *opaque)
 {
 IVShmemState *s = opaque;
 
-if (s->role_val == IVSHMEM_PEER) {
+if (!ivshmem_is_master(s)) {
 error_report("'peer' devices are not migratable");
 return -EINVAL;
 }
@@ -1020,9 +1027,9 @@ static int ivshmem_load_old(QEMUFile *f, void *opaque, 
int version_id)
 return -EINVAL;
 }
 
-if (s->role_val == IVSHMEM_PEER) {
-error_report("'peer' devices are not migratable");
-return -EINVAL;
+ret = ivshmem_pre_load(s);
+if (ret) {
+return ret;
 }
 
 ret = pci_device_load(pdev, f);
-- 
2.4.3
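
The role handling after this patch can be summarized as: map the legacy "role" string to a three-valued setting, resolve "auto" once the peer ID is known, and only then ask whether the device is the master. A minimal standalone sketch follows; ExampleMaster and the three functions are hypothetical names, with the mapping and the vm_id == 0 rule taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the three-valued master setting. */
typedef enum { EX_MASTER_AUTO, EX_MASTER_ON, EX_MASTER_OFF } ExampleMaster;

/* Map the legacy "role" string; NULL (not given) means auto. */
static bool master_from_role(const char *role, ExampleMaster *out)
{
    assert(out != NULL);
    if (role == NULL) {
        *out = EX_MASTER_AUTO;
    } else if (strcmp(role, "peer") == 0) {
        *out = EX_MASTER_OFF;
    } else if (strcmp(role, "master") == 0) {
        *out = EX_MASTER_ON;
    } else {
        return false; /* 'role' must be 'peer' or 'master' */
    }
    return true;
}

/* Resolve auto once the peer ID is known: ID 0 becomes the master. */
static ExampleMaster resolve_master(ExampleMaster configured, int vm_id)
{
    if (configured != EX_MASTER_AUTO) {
        return configured;
    }
    return vm_id == 0 ? EX_MASTER_ON : EX_MASTER_OFF;
}

/* Only meaningful after resolution, hence the assertion. */
static bool is_master(ExampleMaster resolved)
{
    assert(resolved != EX_MASTER_AUTO);
    return resolved == EX_MASTER_ON;
}
```

The assertion in is_master() mirrors the one in the patch: querying the role before auto has been resolved is treated as a programming error.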

[Qemu-devel] [PULL v2 38/40] ivshmem: Drop ivshmem property x-memdev

2016-03-21 Thread Markus Armbruster

Use ivshmem-plain instead.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-39-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 23 +++
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 527d636..4552060 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -1210,17 +1210,12 @@ static void ivshmem_realize(PCIDevice *dev, Error 
**errp)
  " or ivshmem-doorbell instead");
 }
 
-if (!!s->server_chr + !!s->shmobj + !!s->hostmem != 1) {
-error_setg(errp,
-   "You must specify either 'shm', 'chardev' or 'x-memdev'");
+if (!!s->server_chr + !!s->shmobj != 1) {
+error_setg(errp, "You must specify either 'shm' or 'chardev'");
 return;
 }
 
-if (s->hostmem) {
-if (s->sizearg) {
-g_warning("size argument ignored with hostmem");
-}
-} else if (s->sizearg == NULL) {
+if (s->sizearg == NULL) {
 s->legacy_size = 4 << 20; /* 4 MB default */
 } else {
 char *end;
@@ -1260,17 +1255,6 @@ static void ivshmem_realize(PCIDevice *dev, Error **errp)
 ivshmem_common_realize(dev, errp);
 }
 
-static void ivshmem_init(Object *obj)
-{
-IVShmemState *s = IVSHMEM(obj);
-
-object_property_add_link(obj, "x-memdev", TYPE_MEMORY_BACKEND,
- (Object **)&s->hostmem,
- ivshmem_check_memdev_is_busy,
- OBJ_PROP_LINK_UNREF_ON_RELEASE,
- &error_abort);
-}
-
 static void ivshmem_class_init(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
@@ -1287,7 +1271,6 @@ static const TypeInfo ivshmem_info = {
 .name  = TYPE_IVSHMEM,
 .parent= TYPE_IVSHMEM_COMMON,
 .instance_size = sizeof(IVShmemState),
-.instance_init = ivshmem_init,
 .class_init= ivshmem_class_init,
 };
 
-- 
2.4.3

[Qemu-devel] [PULL v2 21/40] ivshmem: Assert interrupts are set up once

2016-03-21 Thread Markus Armbruster

An interrupt is set up when the interrupt's file descriptor is
received.  Each message applies to the next interrupt vector.
Therefore, each vector cannot be set up more than once.

ivshmem_add_kvm_msi_virq() half-heartedly tries not to rely on this by
doing nothing then, but that's not going to recover from this error
should it become possible in the future.  watch_vector_notifier()
doesn't even try.

Simply assert what is the case, so we get alerted if we ever screw it
up.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-22-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 65e3a76..61e21cd 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -349,7 +349,7 @@ static void watch_vector_notifier(IVShmemState *s, 
EventNotifier *n,
 {
 int eventfd = event_notifier_get_fd(n);
 
-/* if MSI is supported we need multiple interrupts */
+assert(!s->msi_vectors[vector].pdev);
 s->msi_vectors[vector].pdev = PCI_DEVICE(s);
 
 qemu_set_fd_handler(eventfd, ivshmem_vector_notify,
@@ -535,10 +535,7 @@ static int ivshmem_add_kvm_msi_virq(IVShmemState *s, int 
vector)
 int ret;
 
 IVSHMEM_DPRINTF("ivshmem_add_kvm_msi_virq vector:%d\n", vector);
-
-if (s->msi_vectors[vector].pdev != NULL) {
-return 0;
-}
+assert(!s->msi_vectors[vector].pdev);
 
 ret = kvm_irqchip_add_msi_route(kvm_state, msg, pdev);
 if (ret < 0) {
-- 
2.4.3

[Qemu-devel] [PULL v2 32/40] ivshmem: Simplify memory regions for BAR 2 (shared memory)

2016-03-21 Thread Markus Armbruster

ivshmem_realize() puts the shared memory region in a container region.
Used to be necessary to permit delayed mapping of the shared memory.
However, we recently moved to synchronous mapping, in "ivshmem:
Receive shared memory synchronously in realize()" and the commit
following it.  The container is redundant since then.  Drop it.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Reviewed-by: Paolo Bonzini 
Message-Id: <1458066895-20632-33-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 47 +--
 1 file changed, 17 insertions(+), 30 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 138ae9d..1b1de65 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -82,12 +82,8 @@ typedef struct IVShmemState {
 CharDriverState *server_chr;
 MemoryRegion ivshmem_mmio;
 
-/* We might need to register the BAR before we actually have the memory.
- * So prepare a container MemoryRegion for the BAR immediately and
- * add a subregion when we have the memory.
- */
-MemoryRegion bar;
-MemoryRegion ivshmem;
+MemoryRegion *ivshmem_bar2; /* BAR 2 (shared memory) */
+MemoryRegion server_bar2;   /* used with server_chr */
 size_t ivshmem_size; /* size of shared memory region */
 uint32_t ivshmem_64bit;
 
@@ -487,7 +483,7 @@ static void process_msg_shmem(IVShmemState *s, int fd, 
Error **errp)
 Error *err = NULL;
 void *ptr;
 
-if (memory_region_is_mapped(&s->ivshmem)) {
+if (s->ivshmem_bar2) {
 error_setg(errp, "server sent unexpected shared memory message");
 close(fd);
 return;
@@ -506,11 +502,10 @@ static void process_msg_shmem(IVShmemState *s, int fd, 
Error **errp)
 close(fd);
 return;
 }
-memory_region_init_ram_ptr(&s->ivshmem, OBJECT(s),
+memory_region_init_ram_ptr(&s->server_bar2, OBJECT(s),
 "ivshmem.bar2", s->ivshmem_size, ptr);
-qemu_set_ram_fd(memory_region_get_ram_addr(&s->ivshmem), fd);
-vmstate_register_ram(&s->ivshmem, DEVICE(s));
-memory_region_add_subregion(&s->bar, 0, &s->ivshmem);
+qemu_set_ram_fd(memory_region_get_ram_addr(&s->server_bar2), fd);
+s->ivshmem_bar2 = &s->server_bar2;
 }
 
 static void process_msg_disconnect(IVShmemState *s, uint16_t posn,
@@ -702,7 +697,7 @@ static void ivshmem_recv_setup(IVShmemState *s, Error 
**errp)
  * successfully processed the server's shared memory message.
  * Assert that actually mapped the shared memory:
  */
-assert(memory_region_is_mapped(&s->ivshmem));
+assert(s->ivshmem_bar2);
 }
 
 /* Select the MSI-X vectors used by device.
@@ -903,7 +898,6 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 pci_register_bar(dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
  &s->ivshmem_mmio);
 
-memory_region_init(&s->bar, OBJECT(s), "ivshmem-bar2-container", s->ivshmem_size);
 if (s->ivshmem_64bit) {
 attr |= PCI_BASE_ADDRESS_MEM_TYPE_64;
 }
@@ -913,15 +907,10 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 }
 
 if (s->hostmem != NULL) {
-MemoryRegion *mr;
-
 IVSHMEM_DPRINTF("using hostmem\n");
 
-mr = host_memory_backend_get_memory(MEMORY_BACKEND(s->hostmem),
-&error_abort);
-vmstate_register_ram(mr, DEVICE(s));
-memory_region_add_subregion(&s->bar, 0, mr);
-pci_register_bar(PCI_DEVICE(s), 2, attr, &s->bar);
+s->ivshmem_bar2 = host_memory_backend_get_memory(s->hostmem,
+ &error_abort);
 } else {
 IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
 s->server_chr->filename);
@@ -929,8 +918,6 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 /* we allocate enough space for 16 peers and grow as needed */
 resize_peers(s, 16);
 
-pci_register_bar(dev, 2, attr, &s->bar);
-
 /*
  * Receive setup messages from server synchronously.
  * Older versions did it asynchronously, but that creates a
@@ -951,6 +938,9 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 }
 }
 
+vmstate_register_ram(s->ivshmem_bar2, DEVICE(s));
+pci_register_bar(PCI_DEVICE(s), 2, attr, s->ivshmem_bar2);
+
 if (s->role_val == IVSHMEM_PEER) {
 error_setg(&s->migration_blocker,
"Migration is disabled when using feature 'peer mode' in 
device 'ivshmem'");
@@ -968,9 +958,9 @@ static void pci_ivshmem_exit(PCIDevice *dev)
 error_free(s->migration_blocker);
 }
 
-if (memory_region_is_mapped(&s->ivshmem)) {
+if (memory_region_is_mapped(s->ivshmem_bar2)) {
 if (!s->hostmem) {
-void *addr = memory_region_get_ram_ptr(&s->ivshmem);
+void *addr = memory_region_get_ram_ptr(s->ivshmem_bar2);

[Qemu-devel] [PULL v2 25/40] ivshmem: Receive shared memory synchronously in realize()

2016-03-21 Thread Markus Armbruster

When configured for interrupts (property "chardev" given), we receive
the shared memory from an ivshmem server.  We do so asynchronously
after realize() completes, by setting up callbacks with
qemu_chr_add_handlers().

Keeping server I/O out of realize() that way avoids delays due to a
slow server.  This is probably relevant only for hot plug.

However, this funny "no shared memory, yet" state of the device also
causes a raft of issues that are hard or impossible to work around:

* The guest is exposed to this state: when we enter and leave it, its
  shared memory contents is abruptly replaced, and device register
  IVPosition changes.

  This is a known issue.  We document that guests should not access
  the shared memory after device initialization until the IVPosition
  register becomes non-negative.

  For cold plug, the funny state is unlikely to be visible in
  practice, because we normally receive the shared memory long before
  the guest gets around to mess with the device.

  For hot plug, the timing is tighter, but the relative slowness of
  PCI device configuration has a good chance to hide the funny state.

  In either case, guests complying with the documented procedure are
  safe.

* Migration becomes racy.

  If migration completes before the shared memory setup completes on
  the source, shared memory contents is silently lost.  Fortunately,
  migration is rather unlikely to win this race.

  If the shared memory's ramblock arrives at the destination before
  shared memory setup completes, migration fails.

  There is no known way for a management application to wait for
  shared memory setup to complete.

  All you can do is retry failed migration.  You can improve your
  chances by leaving more time between running the destination QEMU
  and the migrate command.

  To mitigate silent memory loss, you need to ensure the server
  initializes shared memory exactly the same on source and
  destination.

  These issues are entirely undocumented so far.

I'd expect the server to be almost always fast enough to hide these
issues.  But then rare catastrophic races are in a way the worst kind.

This is way more trouble than I'm willing to take from any device.
Kill the funny state by receiving shared memory synchronously in
realize().  If your hot plug hangs, go kill your ivshmem server.

For easier review, this commit only makes the receive synchronous, it
doesn't add the necessary error propagation.  Without that, the funny
state persists.  The next commit will do that, and kill it off for
real.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-26-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c| 68 
 tests/ivshmem-test.c | 26 ++--
 2 files changed, 55 insertions(+), 39 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index c6d5dd5..ad16828 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -675,27 +675,45 @@ static void ivshmem_read(void *opaque, const uint8_t 
*buf, int size)
 process_msg(s, msg, fd);
 }
 
-static void ivshmem_check_version(void *opaque, const uint8_t * buf, int size)
+static int64_t ivshmem_recv_msg(IVShmemState *s, int *pfd)
 {
-IVShmemState *s = opaque;
-int tmp;
-int64_t version;
+int64_t msg;
+int n, ret;
 
-if (!fifo_update_and_get_i64(s, buf, size, &version)) {
-return;
-}
+n = 0;
+do {
+ret = qemu_chr_fe_read_all(s->server_chr, (uint8_t *)&msg + n,
+ sizeof(msg) - n);
+if (ret < 0 && ret != -EINTR) {
+/* TODO error handling */
+return INT64_MIN;
+}
+n += ret;
+} while (n < sizeof(msg));
 
-tmp = qemu_chr_fe_get_msgfd(s->server_chr);
-if (tmp != -1 || version != IVSHMEM_PROTOCOL_VERSION) {
+*pfd = qemu_chr_fe_get_msgfd(s->server_chr);
+return msg;
+}
+
+static void ivshmem_recv_setup(IVShmemState *s)
+{
+int64_t msg;
+int fd;
+
+msg = ivshmem_recv_msg(s, &fd);
+if (fd != -1 || msg != IVSHMEM_PROTOCOL_VERSION) {
 fprintf(stderr, "incompatible version, you are connecting to a 
ivshmem-"
 "server using a different protocol please check your setup\n");
-qemu_chr_add_handlers(s->server_chr, NULL, NULL, NULL, s);
 return;
 }
 
-IVSHMEM_DPRINTF("version check ok, switch to real chardev handler\n");
-qemu_chr_add_handlers(s->server_chr, ivshmem_can_receive, ivshmem_read,
-  NULL, s);
+/*
+ * Receive more messages until we got shared memory.
+ */
+do {
+msg = ivshmem_recv_msg(s, &fd);
+process_msg(s, msg, fd);
+} while (msg != -1);
 }
 
 /* Select the MSI-X vectors used by device.
@@ -900,19 +918,29 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 IVSHMEM_DPRINTF("using shared memory server (socket
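
The synchronous receive described in this commit message boils down to a blocking loop that keeps reading until a full eight-byte message has arrived. Below is a minimal stand-alone sketch of that idea, assuming a plain POSIX file descriptor instead of a QEMU chardev. The name model_recv_msg() is hypothetical and not part of the patch, and unlike the patch this sketch retries on EINTR without touching the byte count.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper, not QEMU code: read one 8-byte little-endian
 * message from @fd, coping with EINTR and short reads.
 * Returns 0 on success, -1 on error or end of file.
 */
static int model_recv_msg(int fd, int64_t *msg)
{
    uint8_t buf[8];
    size_t n = 0;
    uint64_t v = 0;

    while (n < sizeof(buf)) {
        ssize_t ret = read(fd, buf + n, sizeof(buf) - n);

        if (ret < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted: retry, count unchanged */
            }
            return -1;
        }
        if (ret == 0) {
            return -1;          /* other side closed the connection */
        }
        n += (size_t)ret;
    }

    /* Decode little-endian independently of host byte order. */
    for (n = 0; n < sizeof(buf); n++) {
        v |= (uint64_t)buf[n] << (8 * n);
    }
    *msg = (int64_t)v;
    return 0;
}
```

A caller blocks in this function until the server has sent all eight bytes, which is why a hung server hangs the caller too.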

[Qemu-devel] [PULL v2 24/40] ivshmem: Plug leaks on unplug, fix peer disconnect

2016-03-21 Thread Markus Armbruster

close_peer_eventfds() cleans up three things: ioeventfd triggers if
they exist, eventfds, and the array to store them.

Commit 98609cd (v1.2.0) fixed it not to clean up ioeventfd triggers
when they don't exist (property ioeventfd=off, which is the default).
Unfortunately, the fix also made it skip cleanup of the eventfds and
the array then.  This is a memory and file descriptor leak on unplug.

Additionally, the reset of nb_eventfds is skipped.  Doesn't matter on
unplug.  On peer disconnect, however, this permanently wedges the
interrupt vectors used for that peer's ID.  The eventfds stay behind,
but aren't connected to a peer anymore.  When the ID gets recycled for
a new peer, the new peer's eventfds get assigned to vectors after the
old ones.  Commonly, the device's number of vectors matches the
server's, so the new ones get dropped with a "Too many eventfd
received" message.  Interrupts either don't work (common case) or go
to the wrong vector.

Fix by narrowing the conditional to just the ioeventfd trigger
cleanup.

While there, move the "invalid" peer check to the only caller where it
can actually happen, and tighten it to reject own ID.

Cc: Paolo Bonzini 
Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-25-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index d8d363e..c6d5dd5 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -428,21 +428,17 @@ static void close_peer_eventfds(IVShmemState *s, int posn)
 {
 int i, n;
 
-if (!ivshmem_has_feature(s, IVSHMEM_IOEVENTFD)) {
-return;
-}
-if (posn < 0 || posn >= s->nb_peers) {
-error_report("invalid peer %d", posn);
-return;
-}
-
+assert(posn >= 0 && posn < s->nb_peers);
 n = s->peers[posn].nb_eventfds;
 
-memory_region_transaction_begin();
-for (i = 0; i < n; i++) {
-ivshmem_del_eventfd(s, posn, i);
+if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD)) {
+memory_region_transaction_begin();
+for (i = 0; i < n; i++) {
+ivshmem_del_eventfd(s, posn, i);
+}
+memory_region_transaction_commit();
 }
-memory_region_transaction_commit();
+
 for (i = 0; i < n; i++) {
 event_notifier_cleanup(&s->peers[posn].eventfds[i]);
 }
@@ -598,6 +594,10 @@ static void process_msg_shmem(IVShmemState *s, int fd)
 static void process_msg_disconnect(IVShmemState *s, uint16_t posn)
 {
 IVSHMEM_DPRINTF("posn %d has gone away\n", posn);
+if (posn >= s->nb_peers || posn == s->vm_id) {
+error_report("invalid peer %d", posn);
+return;
+}
 close_peer_eventfds(s, posn);
 }
 
-- 
2.4.3
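
For readers skimming the diff, the control-flow change in close_peer_eventfds() can be modelled in isolation. The sketch below uses hypothetical stand-in types (ModelPeer, ModelDev) and counters in place of real eventfds. It only illustrates which part of the cleanup is conditional after this patch and is not the device code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the device state; not QEMU's structures. */
typedef struct {
    int *eventfds;       /* one slot per vector */
    int nb_eventfds;     /* slots currently in use */
} ModelPeer;

typedef struct {
    bool has_ioeventfd;  /* models ivshmem_has_feature(s, IVSHMEM_IOEVENTFD) */
    int triggers_removed;
    int eventfds_closed;
} ModelDev;

static void model_close_peer_eventfds(ModelDev *d, ModelPeer *p)
{
    int i, n = p->nb_eventfds;

    /* Only the ioeventfd trigger removal depends on the feature bit. */
    if (d->has_ioeventfd) {
        for (i = 0; i < n; i++) {
            d->triggers_removed++;
        }
    }

    /* The rest of the cleanup now runs unconditionally. */
    for (i = 0; i < n; i++) {
        p->eventfds[i] = -1;
        d->eventfds_closed++;
    }
    p->nb_eventfds = 0;  /* a recycled ID starts again at vector 0 */
}
```

With the feature off, the model still closes every eventfd and resets the count, which is the behaviour the early return used to skip.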

[Qemu-devel] [PULL v2 29/40] ivshmem: Simplify how we cope with short reads from server

2016-03-21 Thread Markus Armbruster

Short reads from a UNIX domain socket are exceedingly unlikely when
the other side always sends eight bytes and we always read eight
bytes.  We cope with them anyway.  However, the code doing that is
rather convoluted.  Dumb it down radically.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-30-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 75 ---
 1 file changed, 16 insertions(+), 59 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index c1a75db..7b9e769 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -26,7 +26,6 @@
 #include "migration/migration.h"
 #include "qemu/error-report.h"
 #include "qemu/event_notifier.h"
-#include "qemu/fifo8.h"
 #include "sysemu/char.h"
 #include "sysemu/hostmem.h"
 #include "qapi/visitor.h"
@@ -80,7 +79,6 @@ typedef struct IVShmemState {
 uint32_t intrstatus;
 
 CharDriverState *server_chr;
-Fifo8 incoming_fifo;
 MemoryRegion ivshmem_mmio;
 
 /* We might need to register the BAR before we actually have the memory.
@@ -99,6 +97,8 @@ typedef struct IVShmemState {
 uint32_t vectors;
 uint32_t features;
 MSIVector *msi_vectors;
+uint64_t msg_buf;   /* buffer for receiving server messages */
+int msg_buffered_bytes; /* #bytes in @msg_buf */
 
 Error *migration_blocker;
 
@@ -255,11 +255,6 @@ static const MemoryRegionOps ivshmem_mmio_ops = {
 },
 };
 
-static int ivshmem_can_receive(void * opaque)
-{
-return sizeof(int64_t);
-}
-
 static void ivshmem_vector_notify(void *opaque)
 {
 MSIVector *entry = opaque;
@@ -459,53 +454,6 @@ static void resize_peers(IVShmemState *s, int nb_peers)
 }
 }
 
-static bool fifo_update_and_get(IVShmemState *s, const uint8_t *buf, int size,
-void *data, size_t len)
-{
-const uint8_t *p;
-uint32_t num;
-
-assert(len <= sizeof(int64_t)); /* limitation of the fifo */
-if (fifo8_is_empty(&s->incoming_fifo) && size == len) {
-memcpy(data, buf, size);
-return true;
-}
-
-IVSHMEM_DPRINTF("short read of %d bytes\n", size);
-
-num = MIN(size, sizeof(int64_t) - fifo8_num_used(&s->incoming_fifo));
-fifo8_push_all(&s->incoming_fifo, buf, num);
-
-if (fifo8_num_used(&s->incoming_fifo) < len) {
-assert(num == 0);
-return false;
-}
-
-size -= num;
-buf += num;
-p = fifo8_pop_buf(&s->incoming_fifo, len, &num);
-assert(num == len);
-
-memcpy(data, p, len);
-
-if (size > 0) {
-fifo8_push_all(&s->incoming_fifo, buf, size);
-}
-
-return true;
-}
-
-static bool fifo_update_and_get_i64(IVShmemState *s,
-const uint8_t *buf, int size, int64_t *i64)
-{
-if (fifo_update_and_get(s, buf, size, i64, sizeof(*i64))) {
-*i64 = GINT64_FROM_LE(*i64);
-return true;
-}
-
-return false;
-}
-
 static void ivshmem_add_kvm_msi_virq(IVShmemState *s, int vector,
  Error **errp)
 {
@@ -658,6 +606,14 @@ static void process_msg(IVShmemState *s, int64_t msg, int 
fd, Error **errp)
 }
 }
 
+static int ivshmem_can_receive(void *opaque)
+{
+IVShmemState *s = opaque;
+
+assert(s->msg_buffered_bytes < sizeof(s->msg_buf));
+return sizeof(s->msg_buf) - s->msg_buffered_bytes;
+}
+
 static void ivshmem_read(void *opaque, const uint8_t *buf, int size)
 {
 IVShmemState *s = opaque;
@@ -665,9 +621,14 @@ static void ivshmem_read(void *opaque, const uint8_t *buf, 
int size)
 int fd;
 int64_t msg;
 
-if (!fifo_update_and_get_i64(s, buf, size, &msg)) {
+assert(size >= 0 && s->msg_buffered_bytes + size <= sizeof(s->msg_buf));
+memcpy((unsigned char *)&s->msg_buf + s->msg_buffered_bytes, buf, size);
+s->msg_buffered_bytes += size;
+if (s->msg_buffered_bytes < sizeof(s->msg_buf)) {
 return;
 }
+msg = le64_to_cpu(s->msg_buf);
+s->msg_buffered_bytes = 0;
 
 fd = qemu_chr_fe_get_msgfd(s->server_chr);
 IVSHMEM_DPRINTF("posn is %" PRId64 ", fd is %d\n", msg, fd);
@@ -1022,8 +983,6 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 }
 }
 
-fifo8_create(&s->incoming_fifo, sizeof(int64_t));
-
 if (s->role_val == IVSHMEM_PEER) {
 error_setg(&s->migration_blocker,
"Migration is disabled when using feature 'peer mode' in 
device 'ivshmem'");
@@ -1036,8 +995,6 @@ static void pci_ivshmem_exit(PCIDevice *dev)
 IVShmemState *s = IVSHMEM(dev);
 int i;
 
-fifo8_destroy(&s->incoming_fifo);
-
 if (s->migration_blocker) {
 migrate_del_blocker(s->migration_blocker);
 error_free(s->migration_blocker);
-- 
2.4.3
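
The replacement for the FIFO is small enough to model outside QEMU. The following sketch assumes a hypothetical MsgAccum structure that plays the role of msg_buf and msg_buffered_bytes. It decodes the little-endian value byte by byte rather than with le64_to_cpu(), so it does not depend on QEMU headers. Treat it as an illustration of the buffering scheme, not as the device code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical receiver state, mirroring msg_buf / msg_buffered_bytes. */
typedef struct {
    uint8_t buf[8];
    size_t buffered;
} MsgAccum;

/* How many more bytes the accumulator accepts (cf. ivshmem_can_receive). */
static size_t accum_can_receive(const MsgAccum *a)
{
    return sizeof(a->buf) - a->buffered;
}

/*
 * Feed @size bytes.  Returns true and stores the decoded message in
 * *msg once eight bytes have been collected, false otherwise.
 */
static bool accum_feed(MsgAccum *a, const uint8_t *data, size_t size,
                       int64_t *msg)
{
    uint64_t v = 0;
    size_t i;

    assert(size <= accum_can_receive(a));
    memcpy(a->buf + a->buffered, data, size);
    a->buffered += size;
    if (a->buffered < sizeof(a->buf)) {
        return false;           /* short read: wait for the rest */
    }

    /* Decode little-endian without relying on host byte order. */
    for (i = 0; i < sizeof(a->buf); i++) {
        v |= (uint64_t)a->buf[i] << (8 * i);
    }
    *msg = (int64_t)v;
    a->buffered = 0;            /* ready for the next message */
    return true;
}
```

Because the can-receive callback never advertises more than the space left in the buffer, a single feed can never straddle two messages, which is what makes the simple memcpy sufficient.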

[Qemu-devel] [PULL v2 20/40] ivshmem: Leave INTx alone when using MSI-X

2016-03-21 Thread Markus Armbruster

The ivshmem device can either use MSI-X or legacy INTx for interrupts.

With MSI-X enabled, peer interrupt events trigger an MSI as they
should.  But software can still raise INTx via interrupt status and
mask register in BAR 0.  This is explicitly prohibited by PCI Local
Bus Specification Revision 3.0, section 6.8.3.3:

While enabled for MSI or MSI-X operation, a function is prohibited
from using its INTx# pin (if implemented) to request service (MSI,
MSI-X, and INTx# are mutually exclusive).

Fix the device model to leave INTx alone when using MSI-X.

Document that we claim to use INTx in config space even when we don't.
Unlike other devices, ivshmem does *not* use INTx when configured for
MSI-X and MSI-X isn't enabled by software.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Reviewed-by: Paolo Bonzini 
Message-Id: <1458066895-20632-21-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index abcb1c1..65e3a76 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -126,6 +126,11 @@ static void ivshmem_update_irq(IVShmemState *s)
 PCIDevice *d = PCI_DEVICE(s);
 uint32_t isr = s->intrstatus & s->intrmask;
 
+/* No INTx with msi=on, whether the guest enabled MSI-X or not */
+if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+return;
+}
+
 /* don't print ISR resets */
 if (isr) {
 IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
@@ -873,6 +878,10 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 pci_conf = dev->config;
 pci_conf[PCI_COMMAND] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
 
+/*
+ * Note: we don't use INTx with IVSHMEM_MSI at all, so this is a
+ * bald-faced lie then.  But it's a backwards compatible lie.
+ */
 pci_config_set_interrupt_pin(pci_conf, 1);
 
 memory_region_init_io(&s->ivshmem_mmio, OBJECT(s), &ivshmem_mmio_ops, s,
-- 
2.4.3
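
The rule this patch enforces can be stated as a pure function. The sketch below is a hypothetical model, not the device code: MODEL_FEATURE_MSI mirrors how the driver tests IVSHMEM_MSI (bit 1 of the features word), and the return value -1 is an invented convention meaning "leave INTx alone".

```c
#include <stdint.h>

/* Hypothetical constant: models the bit tested for IVSHMEM_MSI. */
#define MODEL_FEATURE_MSI (1u << 1)

/*
 * Returns the INTx level to drive (0 or 1), or -1 when INTx must not
 * be touched because the device is configured for MSI-X.
 */
static int model_intx_level(uint32_t features, uint32_t intrstatus,
                            uint32_t intrmask)
{
    if (features & MODEL_FEATURE_MSI) {
        return -1;   /* msi=on: no INTx, whether or not the guest enabled MSI-X */
    }
    return (intrstatus & intrmask) != 0;
}
```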

[Qemu-devel] [PULL v2 39/40] ivshmem: Require master to have ID zero

2016-03-21 Thread Markus Armbruster

Migration with ivshmem needs to be carefully orchestrated to work.
Exactly one peer (the "master") migrates to the destination, all other
peers need to unplug (and disconnect), migrate, plug back (and
reconnect).  This is sort of documented in qemu-doc.

If peers connect on the destination before migration completes, the
shared memory can get messed up.  This isn't documented anywhere.  Fix
that in qemu-doc.

To avoid messing up register IVPosition on migration, the server must
assign the same ID on source and destination.  ivshmem-spec.txt leaves
ID assignment unspecified, however.

Amend ivshmem-spec.txt to require the first client to receive ID zero.
The example ivshmem-server complies: it always assigns the first
unused ID.

For a bit of additional safety, enforce ID zero for the master.  This
does nothing when we're not using a server, because the ID is zero for
all peers then.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-40-git-send-email-arm...@redhat.com>
---
 docs/specs/ivshmem-spec.txt | 2 ++
 hw/misc/ivshmem.c   | 6 ++
 qemu-doc.texi   | 5 +
 3 files changed, 13 insertions(+)

diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
index f3912c0..a1f5499 100644
--- a/docs/specs/ivshmem-spec.txt
+++ b/docs/specs/ivshmem-spec.txt
@@ -164,6 +164,8 @@ For each new client that connects to the server, the server
 - sends interrupt setup messages to the new client (these contain file
   descriptors for receiving interrupts).
 
+The first client to connect to the server receives ID zero.
+
 When a client disconnects from the server, the server sends disconnect
 notifications to the other clients.
 
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 4552060..132387f 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -887,6 +887,12 @@ static void ivshmem_common_realize(PCIDevice *dev, Error 
**errp)
 return;
 }
 
+if (s->master == ON_OFF_AUTO_ON && s->vm_id != 0) {
+error_setg(errp,
+   "master must connect to the server before any peers");
+return;
+}
+
 qemu_chr_add_handlers(s->server_chr, ivshmem_can_receive,
   ivshmem_read, NULL, s);
 
diff --git a/qemu-doc.texi b/qemu-doc.texi
index 0dd01c7..79141d3 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -1295,12 +1295,17 @@ When using the server, the guest will be assigned a VM 
ID (>=0) that allows gues
 using the same server to communicate via interrupts.  Guests can read their
 VM ID from a device register (see ivshmem-spec.txt).
 
+@subsubsection Migration with ivshmem
+
 With device property @option{master=on}, the guest will copy the shared
 memory on migration to the destination host.  With @option{master=off},
 the guest will not be able to migrate with the device attached.  In the
 latter case, the device should be detached and then reattached after
 migration using the PCI hotplug support.
 
+At most one of the devices sharing the same memory can be master.  The
+master must complete migration before you plug back the other devices.
+
 @subsubsection ivshmem and hugepages
 
 Instead of specifying the <shm size> using POSIX shm, you may specify
-- 
2.4.3
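
The two halves of this change, a server that hands out the first unused ID and a device that insists the master holds ID zero, can be sketched together. Everything below is a hypothetical model: MODEL_MAX_IDS is an arbitrary small limit chosen for illustration, and neither function is taken from ivshmem-server or the device.

```c
#include <stdbool.h>

#define MODEL_MAX_IDS 8   /* hypothetical small limit, for illustration only */

/* Returns the lowest ID not in use, or -1 when all are taken. */
static int model_first_unused_id(const bool in_use[MODEL_MAX_IDS])
{
    int id;

    for (id = 0; id < MODEL_MAX_IDS; id++) {
        if (!in_use[id]) {
            return id;
        }
    }
    return -1;
}

/* Models the new realize() check: a master must have ID zero. */
static bool model_master_id_ok(bool is_master, int vm_id)
{
    return !is_master || vm_id == 0;
}
```

Under this policy the first client to connect to a fresh server gets ID zero, so a master that connects first passes the check, while one that connects after another peer is rejected.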

[Qemu-devel] [PULL v2 23/40] ivshmem: Disentangle ivshmem_read()

2016-03-21 Thread Markus Armbruster

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-24-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 173 +++---
 1 file changed, 87 insertions(+), 86 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 703b3bf..d8d363e 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -564,116 +564,117 @@ static void setup_interrupt(IVShmemState *s, int vector)
 }
 }
 
-static void ivshmem_read(void *opaque, const uint8_t *buf, int size)
+static void process_msg_shmem(IVShmemState *s, int fd)
 {
-IVShmemState *s = opaque;
-int incoming_fd;
-int new_eventfd;
-int64_t incoming_posn;
 Error *err = NULL;
-Peer *peer;
+void *ptr;
 
-if (!fifo_update_and_get_i64(s, buf, size, &incoming_posn)) {
+if (memory_region_is_mapped(&s->ivshmem)) {
+error_report("shm already initialized");
+close(fd);
 return;
 }
 
-incoming_fd = qemu_chr_fe_get_msgfd(s->server_chr);
-IVSHMEM_DPRINTF("posn is %" PRId64 ", fd is %d\n",
-incoming_posn, incoming_fd);
-
-if (incoming_posn < -1 || incoming_posn > IVSHMEM_MAX_PEERS) {
-error_report("server sent invalid message %" PRId64,
- incoming_posn);
-if (incoming_fd != -1) {
-close(incoming_fd);
-}
-return;
-}
-
-if (incoming_posn >= s->nb_peers) {
-resize_peers(s, incoming_posn + 1);
-}
-
-peer = &s->peers[incoming_posn];
-
-if (incoming_fd == -1) {
-/* if posn is positive and unseen before then this is our posn*/
-if (incoming_posn >= 0 && s->vm_id == -1) {
-/* receive our posn */
-s->vm_id = incoming_posn;
-} else {
-/* otherwise an fd == -1 means an existing peer has gone away */
-IVSHMEM_DPRINTF("posn %" PRId64 " has gone away\n", incoming_posn);
-close_peer_eventfds(s, incoming_posn);
-}
+if (check_shm_size(s, fd, &err) == -1) {
+error_report_err(err);
+close(fd);
 return;
 }
 
-/* if the position is -1, then it's shared memory region fd */
-if (incoming_posn == -1) {
-void * map_ptr;
-
-if (memory_region_is_mapped(&s->ivshmem)) {
-error_report("shm already initialized");
-close(incoming_fd);
-return;
-}
-
-if (check_shm_size(s, incoming_fd, &err) == -1) {
-error_report_err(err);
-close(incoming_fd);
-return;
-}
-
-/* mmap the region and map into the BAR2 */
-map_ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED,
-incoming_fd, 0);
-if (map_ptr == MAP_FAILED) {
-error_report("Failed to mmap shared memory %s", strerror(errno));
-close(incoming_fd);
-return;
-}
-memory_region_init_ram_ptr(&s->ivshmem, OBJECT(s),
-   "ivshmem.bar2", s->ivshmem_size, map_ptr);
-qemu_set_ram_fd(memory_region_get_ram_addr(&s->ivshmem),
-incoming_fd);
-vmstate_register_ram(&s->ivshmem, DEVICE(s));
-
-IVSHMEM_DPRINTF("guest h/w addr = %p, size = %" PRIu64 "\n",
-map_ptr, s->ivshmem_size);
-
-memory_region_add_subregion(&s->bar, 0, &s->ivshmem);
-
+/* mmap the region and map into the BAR2 */
+ptr = mmap(0, s->ivshmem_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+if (ptr == MAP_FAILED) {
+error_report("Failed to mmap shared memory %s", strerror(errno));
+close(fd);
 return;
 }
-
-/* each peer has an associated array of eventfds, and we keep
- * track of how many eventfds received so far */
-/* get a new eventfd: */
+memory_region_init_ram_ptr(&s->ivshmem, OBJECT(s),
+   "ivshmem.bar2", s->ivshmem_size, ptr);
+qemu_set_ram_fd(memory_region_get_ram_addr(&s->ivshmem), fd);
+vmstate_register_ram(&s->ivshmem, DEVICE(s));
+memory_region_add_subregion(&s->bar, 0, &s->ivshmem);
+}
+
+static void process_msg_disconnect(IVShmemState *s, uint16_t posn)
+{
+IVSHMEM_DPRINTF("posn %d has gone away\n", posn);
+close_peer_eventfds(s, posn);
+}
+
+static void process_msg_connect(IVShmemState *s, uint16_t posn, int fd)
+{
+Peer *peer = &s->peers[posn];
+int vector;
+
+/*
+ * The N-th connect message for this peer comes with the file
+ * descriptor for vector N-1.  Count messages to find the vector.
+ */
 if (peer->nb_eventfds >= s->vectors) {
 error_report("Too many eventfd received, device has %d vectors",
  s->vectors);
-close(incoming_fd);
+close(fd);
 return;
 }
+vector = peer->nb_eventfds++;
 
-new_eventfd = peer->nb_eventfds++;
+

[Qemu-devel] [PULL v2 28/40] ivshmem: Drop the hackish test for UNIX domain chardev

2016-03-21 Thread Markus Armbruster

The chardev must be capable of transmitting SCM_RIGHTS ancillary
messages.  We check it by comparing CharDriverState member filename to
"unix:".  That's almost as brittle as it is disgusting.

When the actual transmission all happened asynchronously, this check
was all we could do in realize(), and thus better than nothing.  But
now we receive at least one SCM_RIGHTS synchronously in realize(),
it's not worth its keep anymore.  Drop it.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-29-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index da32a74..c1a75db 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -964,15 +964,6 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 memory_region_add_subregion(&s->bar, 0, mr);
 pci_register_bar(PCI_DEVICE(s), 2, attr, &s->bar);
 } else if (s->server_chr != NULL) {
-/* FIXME do not rely on what chr drivers put into filename */
-if (strncmp(s->server_chr->filename, "unix:", 5)) {
-error_setg(errp, "chardev is not a unix client socket");
-return;
-}
-
-/* if we get a UNIX socket as the parameter we will talk
- * to the ivshmem server to receive the memory region */
-
 IVSHMEM_DPRINTF("using shared memory server (socket = %s)\n",
 s->server_chr->filename);
 
-- 
2.4.3

[Qemu-devel] [PULL v2 15/40] ivshmem: Don't destroy the chardev on version mismatch

2016-03-21 Thread Markus Armbruster

Yes, the chardev is commonly useless after we read a bad version from
it, but destroying it is inappropriate anyway: the user created it, so
the user should be able to hold on to it as long as he likes.  We
don't destroy it on other errors.  Screwed up in commit 5105b1d.

Stop reading instead.

Also note QEMU's behavior in ivshmem-spec.txt.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-16-git-send-email-arm...@redhat.com>
---
 docs/specs/ivshmem-spec.txt | 3 +++
 hw/misc/ivshmem.c   | 3 +--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.txt
index 0e9185a..0cd63ad 100644
--- a/docs/specs/ivshmem-spec.txt
+++ b/docs/specs/ivshmem-spec.txt
@@ -187,6 +187,9 @@ Each message consists of a single 8 byte little-endian 
signed number,
 and may be accompanied by a file descriptor via SCM_RIGHTS.  Both
 client and server close the connection on error.
 
+Note: QEMU currently doesn't close the connection right on error, but
+only when the character device is destroyed.
+
 On connect, the server sends the following messages in order:
 
 1. The protocol version number, currently zero.  The client should
diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 8356399..0ac0238 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -708,8 +708,7 @@ static void ivshmem_check_version(void *opaque, const 
uint8_t * buf, int size)
 if (tmp != -1 || version != IVSHMEM_PROTOCOL_VERSION) {
 fprintf(stderr, "incompatible version, you are connecting to a 
ivshmem-"
 "server using a different protocol please check your setup\n");
-qemu_chr_delete(s->server_chr);
-s->server_chr = NULL;
+qemu_chr_add_handlers(s->server_chr, NULL, NULL, NULL, s);
 return;
 }
 
-- 
2.4.3

[Qemu-devel] [PULL v2 22/40] ivshmem: Simplify rejection of invalid peer ID from server

2016-03-21 Thread Markus Armbruster

ivshmem_read() processes server messages.  These are 64 bit signed
integers.  -1 is shared memory setup, 16 bit unsigned is a peer ID,
anything else is invalid.

ivshmem_read() rejects invalid negative messages right away, silently.

Invalid positive messages get rejected only in resize_peers(), and
ivshmem_read() then prints the rather cryptic message "failed to
resize peers array".

Extend the first check to cover all invalid messages, make it report
"server sent invalid message", and drop the second check.

Now resize_peers() can't fail anymore; simplify.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-23-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 61 ---
 1 file changed, 22 insertions(+), 39 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 61e21cd..703b3bf 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -39,7 +39,7 @@
 #define PCI_VENDOR_ID_IVSHMEM   PCI_VENDOR_ID_REDHAT_QUMRANET
 #define PCI_DEVICE_ID_IVSHMEM   0x1110
 
-#define IVSHMEM_MAX_PEERS G_MAXUINT16
+#define IVSHMEM_MAX_PEERS UINT16_MAX
 #define IVSHMEM_IOEVENTFD   0
 #define IVSHMEM_MSI 1
 
@@ -93,7 +93,7 @@ typedef struct IVShmemState {
 uint32_t ivshmem_64bit;
 
 Peer *peers;
-int nb_peers; /* how many peers we have space for */
+int nb_peers;   /* space in @peers[] */
 
 int vm_id;
 uint32_t vectors;
@@ -451,34 +451,21 @@ static void close_peer_eventfds(IVShmemState *s, int posn)
 s->peers[posn].nb_eventfds = 0;
 }
 
-/* this function increase the dynamic storage need to store data about other
- * peers */
-static int resize_peers(IVShmemState *s, int new_min_size)
+static void resize_peers(IVShmemState *s, int nb_peers)
 {
+int old_nb_peers = s->nb_peers;
+int i;
 
-int j, old_size;
+assert(nb_peers > old_nb_peers);
+IVSHMEM_DPRINTF("bumping storage to %d peers\n", nb_peers);
 
-/* limit number of max peers */
-if (new_min_size <= 0 || new_min_size > IVSHMEM_MAX_PEERS) {
-return -1;
-}
-if (new_min_size <= s->nb_peers) {
-return 0;
-}
-
-old_size = s->nb_peers;
-s->nb_peers = new_min_size;
+s->peers = g_realloc(s->peers, nb_peers * sizeof(Peer));
+s->nb_peers = nb_peers;
 
-IVSHMEM_DPRINTF("bumping storage to %d peers\n", s->nb_peers);
-
-s->peers = g_realloc(s->peers, s->nb_peers * sizeof(Peer));
-
-for (j = old_size; j < s->nb_peers; j++) {
-s->peers[j].eventfds = g_new0(EventNotifier, s->vectors);
-s->peers[j].nb_eventfds = 0;
+for (i = old_nb_peers; i < nb_peers; i++) {
+s->peers[i].eventfds = g_new0(EventNotifier, s->vectors);
+s->peers[i].nb_eventfds = 0;
 }
-
-return 0;
 }
 
 static bool fifo_update_and_get(IVShmemState *s, const uint8_t *buf, int size,
@@ -590,25 +577,21 @@ static void ivshmem_read(void *opaque, const uint8_t 
*buf, int size)
 return;
 }
 
-if (incoming_posn < -1) {
-IVSHMEM_DPRINTF("invalid incoming_posn %" PRId64 "\n", incoming_posn);
-return;
-}
-
-/* pick off s->server_chr->msgfd and store it, posn should accompany msg */
 incoming_fd = qemu_chr_fe_get_msgfd(s->server_chr);
 IVSHMEM_DPRINTF("posn is %" PRId64 ", fd is %d\n",
 incoming_posn, incoming_fd);
 
-/* make sure we have enough space for this peer */
+if (incoming_posn < -1 || incoming_posn > IVSHMEM_MAX_PEERS) {
+error_report("server sent invalid message %" PRId64,
+ incoming_posn);
+if (incoming_fd != -1) {
+close(incoming_fd);
+}
+return;
+}
+
 if (incoming_posn >= s->nb_peers) {
-if (resize_peers(s, incoming_posn + 1) < 0) {
-error_report("failed to resize peers array");
-if (incoming_fd != -1) {
-close(incoming_fd);
-}
-return;
-}
+resize_peers(s, incoming_posn + 1);
 }
 
 peer = &s->peers[incoming_posn];
-- 
2.4.3
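
The message taxonomy in this commit message (-1 is shared memory setup, a 16 bit unsigned value is a peer ID, anything else is invalid) can be written as one classification function. The enum and function names below are hypothetical. Only the ranges come from the patch, which bounds peer IDs with IVSHMEM_MAX_PEERS defined as UINT16_MAX.

```c
#include <stdint.h>

/* Hypothetical classification of a server message; not QEMU code. */
enum model_msg_kind {
    MODEL_MSG_SHMEM,     /* -1: shared memory setup */
    MODEL_MSG_PEER,      /* 0..UINT16_MAX: peer ID */
    MODEL_MSG_INVALID    /* everything else */
};

static enum model_msg_kind model_classify_msg(int64_t msg)
{
    if (msg == -1) {
        return MODEL_MSG_SHMEM;
    }
    if (msg >= 0 && msg <= UINT16_MAX) {
        return MODEL_MSG_PEER;
    }
    return MODEL_MSG_INVALID;
}
```

Doing the whole range check in one place is what lets resize_peers() drop its own validation and become infallible.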

[Qemu-devel] [PULL v2 30/40] ivshmem: Tighten check of property "size"

2016-03-21 Thread Markus Armbruster

If size_t is narrower than 64 bits, passing uint64_t ivshmem_size to
mmap() truncates.  Reject such sizes.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-31-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 7b9e769..66c713e 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -87,7 +87,7 @@ typedef struct IVShmemState {
  */
 MemoryRegion bar;
 MemoryRegion ivshmem;
-uint64_t ivshmem_size; /* size of shared memory region */
+size_t ivshmem_size; /* size of shared memory region */
 uint32_t ivshmem_64bit;
 
 Peer *peers;
@@ -361,7 +361,7 @@ static int check_shm_size(IVShmemState *s, int fd, Error 
**errp)
 
 if (s->ivshmem_size > buf.st_size) {
 error_setg(errp, "Requested memory size greater"
-   " than shared object size (%" PRIu64 " > %" PRIu64")",
+   " than shared object size (%zu > %" PRIu64")",
s->ivshmem_size, (uint64_t)buf.st_size);
 return -1;
 } else {
@@ -865,7 +865,8 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error 
**errp)
 } else {
 char *end;
 int64_t size = qemu_strtosz(s->sizearg, &end);
-if (size < 0 || *end != '\0' || !is_power_of_2(size)) {
+if (size < 0 || (size_t)size != size || *end != '\0'
+|| !is_power_of_2(size)) {
 error_setg(errp, "Invalid size %s", s->sizearg);
 return;
 }
-- 
2.4.3

[Qemu-devel] [PULL v2 18/40] ivshmem: Clean up register callbacks

2016-03-21 Thread Markus Armbruster

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-19-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 51ad255..1debce3 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -121,12 +121,10 @@ static inline uint32_t ivshmem_has_feature(IVShmemState *ivs,
 return (ivs->features & (1 << feature));
 }
 
-/* accessing registers - based on rtl8139 */
 static void ivshmem_update_irq(IVShmemState *s)
 {
 PCIDevice *d = PCI_DEVICE(s);
-int isr;
-isr = (s->intrstatus & s->intrmask) & 0xffffffff;
+uint32_t isr = s->intrstatus & s->intrmask;
 
 /* don't print ISR resets */
 if (isr) {
@@ -134,7 +132,7 @@ static void ivshmem_update_irq(IVShmemState *s)
 isr ? 1 : 0, s->intrstatus, s->intrmask);
 }
 
-pci_set_irq(d, (isr != 0));
+pci_set_irq(d, isr != 0);
 }
 
 static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
@@ -142,7 +140,6 @@ static void ivshmem_IntrMask_write(IVShmemState *s, uint32_t val)
 IVSHMEM_DPRINTF("IntrMask write(w) val = 0x%04x\n", val);
 
 s->intrmask = val;
-
 ivshmem_update_irq(s);
 }
 
@@ -151,7 +148,6 @@ static uint32_t ivshmem_IntrMask_read(IVShmemState *s)
 uint32_t ret = s->intrmask;
 
 IVSHMEM_DPRINTF("intrmask read(w) val = 0x%04x\n", ret);
-
 return ret;
 }
 
@@ -160,7 +156,6 @@ static void ivshmem_IntrStatus_write(IVShmemState *s, uint32_t val)
 IVSHMEM_DPRINTF("IntrStatus write(w) val = 0x%04x\n", val);
 
 s->intrstatus = val;
-
 ivshmem_update_irq(s);
 }
 
@@ -170,9 +165,7 @@ static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
 
 /* reading ISR clears all interrupts */
 s->intrstatus = 0;
-
 ivshmem_update_irq(s);
-
 return ret;
 }
 
-- 
2.4.3

[Qemu-devel] [PULL v2 04/40] qemu-doc: Fix ivshmem huge page example

2016-03-21 Thread Markus Armbruster

Option parameter "share" is missing.  Without it, you get a *private*
mmap(), which defeats ivshmem's purpose pretty thoroughly ;)

While there, switch to the conventional hugetlbfs mountpoint,
/dev/hugepages.

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Reviewed-by: Paolo Bonzini 
Message-Id: <1458066895-20632-5-git-send-email-arm...@redhat.com>
---
 qemu-doc.texi | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qemu-doc.texi b/qemu-doc.texi
index bc9dd13..65f3b29 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -1311,7 +1311,7 @@ Instead of specifying the <shm size> using POSIX shm, you may specify
 a memory backend that has hugepage support:
 
 @example
-qemu-system-i386 -object memory-backend-file,size=1G,mem-path=/mnt/hugepages/my-shmem-file,id=mb1
+qemu-system-i386 -object memory-backend-file,size=1G,mem-path=/dev/hugepages/my-shmem-file,share,id=mb1
  -device ivshmem,x-memdev=mb1
 @end example
 
-- 
2.4.3

[Qemu-devel] [PULL v2 06/40] tests/libqos/pci-pc: Fix qpci_pc_iomap() to map BARs aligned

2016-03-21 Thread Markus Armbruster

qpci_pc_iomap() maps BARs one after the other, without padding.  This
is wrong.  PCI Local Bus Specification Revision 3.0, 6.2.5.1. Address
Maps: "all address spaces used are a power of two in size and are
naturally aligned".  That's because the size of a BAR is given by the
number of address bits the device decodes, and the BAR needs to be
mapped at a multiple of that size to ensure the address decoding
works.

Fix qpci_pc_iomap() accordingly.  This takes care of a FIXME in
ivshmem-test.
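
The allocation rule described above can be sketched outside of libqos. This is an illustration only: HoleAllocator, align_up() and hole_map_bar() are hypothetical names, and align_up() is a power-of-two stand-in for QEMU_ALIGN_UP:

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next multiple of m; m must be a power of two. */
uint64_t align_up(uint64_t n, uint64_t m)
{
    return (n + m - 1) & ~(m - 1);
}

/* Hypothetical bump allocator over one PCI hole. */
typedef struct {
    uint64_t start;  /* base address of the hole */
    uint64_t size;   /* total bytes available in the hole */
    uint64_t alloc;  /* bytes handed out so far, padding included */
} HoleAllocator;

/* Map a BAR of bar_size bytes at an offset that is a multiple of bar_size. */
uint64_t hole_map_bar(HoleAllocator *h, uint64_t bar_size)
{
    uint64_t off = align_up(h->alloc, bar_size);

    /* The padded allocation must still fit into the hole. */
    assert(off + bar_size <= h->size);
    h->alloc = off + bar_size;
    return h->start + off;
}
```

In this sketch a 256-byte BAR followed by a 1 MiB BAR ends up at offsets 0 and 0x100000, not 0 and 0x100. Only the offset within the hole is aligned, so the resulting address is naturally aligned only if the hole's start address is itself aligned to the BAR size.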

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-7-git-send-email-arm...@redhat.com>
---
 tests/ivshmem-test.c  | 17 ++++++++---------
 tests/libqos/pci-pc.c |  8 ++++++--
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/tests/ivshmem-test.c b/tests/ivshmem-test.c
index 4efa433..da6ca0d 100644
--- a/tests/ivshmem-test.c
+++ b/tests/ivshmem-test.c
@@ -110,19 +110,18 @@ static void setup_vm_cmd(IVState *s, const char *cmd, bool msix)
 s->pcibus = qpci_init_pc();
 s->dev = get_device(s->pcibus);
 
-/* FIXME: other bar order fails, mappings changes */
-s->mem_base = qpci_iomap(s->dev, 2, &barsize);
-g_assert_nonnull(s->mem_base);
-g_assert_cmpuint(barsize, ==, TMPSHMSIZE);
-
-if (msix) {
-qpci_msix_enable(s->dev);
-}
-
 s->reg_base = qpci_iomap(s->dev, 0, &barsize);
 g_assert_nonnull(s->reg_base);
 g_assert_cmpuint(barsize, ==, 256);
 
+if (msix) {
+qpci_msix_enable(s->dev);
+}
+
+s->mem_base = qpci_iomap(s->dev, 2, &barsize);
+g_assert_nonnull(s->mem_base);
+g_assert_cmpuint(barsize, ==, TMPSHMSIZE);
+
 qpci_device_enable(s->dev);
 }
 
diff --git a/tests/libqos/pci-pc.c b/tests/libqos/pci-pc.c
index 08167c0..77f15e5 100644
--- a/tests/libqos/pci-pc.c
+++ b/tests/libqos/pci-pc.c
@@ -184,7 +184,9 @@ static void *qpci_pc_iomap(QPCIBus *bus, QPCIDevice *dev, int barno, uint64_t *s
 if (io_type == PCI_BASE_ADDRESS_SPACE_IO) {
 uint16_t loc;
 
-g_assert((s->pci_iohole_alloc + size) <= s->pci_iohole_size);
+g_assert(QEMU_ALIGN_UP(s->pci_iohole_alloc, size) + size
+ <= s->pci_iohole_size);
+s->pci_iohole_alloc = QEMU_ALIGN_UP(s->pci_iohole_alloc, size);
 loc = s->pci_iohole_start + s->pci_iohole_alloc;
 s->pci_iohole_alloc += size;
 
@@ -194,7 +196,9 @@ static void *qpci_pc_iomap(QPCIBus *bus, QPCIDevice *dev, int barno, uint64_t *s
 } else {
 uint64_t loc;
 
-g_assert((s->pci_hole_alloc + size) <= s->pci_hole_size);
+g_assert(QEMU_ALIGN_UP(s->pci_hole_alloc, size) + size
+ <= s->pci_hole_size);
+s->pci_hole_alloc = QEMU_ALIGN_UP(s->pci_hole_alloc, size);
 loc = s->pci_hole_start + s->pci_hole_alloc;
 s->pci_hole_alloc += size;
 
-- 
2.4.3

[Qemu-devel] [PULL v2 08/40] ivshmem-test: Clean up wait for devices to become operational

2016-03-21 Thread Markus Armbruster

test_ivshmem_server() waits until the first byte in BAR 2 contains the
0x42 we put into shared memory.  Works because the byte reads zero
until the device maps the shared memory gotten from the server.

Check the IVPosition register instead: it's initially -1, and becomes
non-negative right when the device maps the shared memory.  Behavior is
unchanged; this is just cleaner, because it's what guest software is
supposed to do.
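
The polling pattern can be sketched independently of the qtest harness. Everything named here (read_reg_fn, wait_for_peer_id(), FakeDevice) is hypothetical and stands in for in_reg(s, IVPOSITION) and the real device; the sketch bounds the wait by a poll count where the test uses a deadline:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register accessor, standing in for in_reg(s, IVPOSITION). */
typedef uint32_t (*read_reg_fn)(void *opaque);

/*
 * Poll IVPosition up to max_polls times.  The register reads as -1
 * (all ones) until the device is operational, then holds the peer ID.
 */
bool wait_for_peer_id(read_reg_fn read_ivposition, void *opaque,
                      int max_polls, int32_t *id_out)
{
    for (int i = 0; i < max_polls; i++) {
        int32_t pos = (int32_t)read_ivposition(opaque);

        if (pos >= 0) {
            *id_out = pos;
            return true;
        }
    }
    return false;
}

/* Fake device for illustration: becomes ready after a few reads. */
typedef struct {
    int polls_until_ready;
    uint32_t id;
} FakeDevice;

uint32_t fake_read_ivposition(void *opaque)
{
    FakeDevice *d = opaque;

    if (d->polls_until_ready > 0) {
        d->polls_until_ready--;
        return 0xffffffffu;  /* not operational yet */
    }
    return d->id;
}
```

A real caller would sleep between polls, as the test does with g_usleep(1000).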

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-9-git-send-email-arm...@redhat.com>
---
 tests/ivshmem-test.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/tests/ivshmem-test.c b/tests/ivshmem-test.c
index a48dc49..bbea8cd 100644
--- a/tests/ivshmem-test.c
+++ b/tests/ivshmem-test.c
@@ -301,7 +301,6 @@ static void test_ivshmem_server(bool msi)
 int nvectors = 2;
 guint64 end_time = g_get_monotonic_time() + 5 * G_TIME_SPAN_SECOND;
 
-memset(tmpshmem, 0x42, TMPSHMSIZE);
 ret = ivshmem_server_init(&server, tmpserver, tmpshm, true,
   TMPSHMSIZE, nvectors,
   g_test_verbose());
@@ -315,9 +314,9 @@ static void test_ivshmem_server(bool msi)
 setup_vm_with_server(&state2, nvectors, msi);
 s2 = &state2;
 
+/* check state before server sends stuff */
 g_assert_cmpuint(in_reg(s1, IVPOSITION), ==, 0xffffffff);
 g_assert_cmpuint(in_reg(s2, IVPOSITION), ==, 0xffffffff);
-
 g_assert_cmpuint(qtest_readb(s1->qtest, (uintptr_t)s1->mem_base), ==, 0x00);
 
 thread.server = &server;
@@ -326,12 +325,11 @@ static void test_ivshmem_server(bool msi)
 thread.thread = g_thread_new("ivshmem-server", server_thread, &thread);
 g_assert(thread.thread != NULL);
 
-/* waiting until mapping is done */
+/* waiting for devices to become operational */
 while (g_get_monotonic_time() < end_time) {
 g_usleep(1000);
-
-if (qtest_readb(s1->qtest, (uintptr_t)s1->mem_base) == 0x42 &&
-qtest_readb(s2->qtest, (uintptr_t)s2->mem_base) == 0x42) {
+if ((int)in_reg(s1, IVPOSITION) >= 0 &&
+(int)in_reg(s2, IVPOSITION) >= 0) {
 break;
 }
 }
-- 
2.4.3

[Qemu-devel] [PULL v2 19/40] ivshmem: Clean up MSI-X conditions

2016-03-21 Thread Markus Armbruster

There are three predicates related to MSI-X:

* ivshmem_has_feature(s, IVSHMEM_MSI) is true unless the non-MSI-X
  variant of the device is selected with msi=off.

* msix_present() is true when the device has the PCI capability MSI-X.
  It's initially false, and becomes true during successful realize of
  the MSI-X variant of the device.  Thus, it's the same as
  ivshmem_has_feature(s, IVSHMEM_MSI) for realized devices.

* msix_enabled() is true when msix_present() is true and guest software
  has enabled MSI-X.

Code that differs between the non-MSI-X and the MSI-X variant of the
device needs to be guarded by ivshmem_has_feature(s, IVSHMEM_MSI) or
by msix_present(), except the latter works only for realized devices.

Code that depends on whether MSI-X is in use needs to be guarded with
msix_enabled().

Code review led me to two minor messes:

* ivshmem_vector_notify() calls msix_notify() even when
  !msix_enabled(), unlike most other MSI-X-capable devices.  As far as
  I can tell, msix_notify() does nothing when !msix_enabled().  Add
  the guard anyway.

* Most callers of ivshmem_use_msix() guard it with
  ivshmem_has_feature(s, IVSHMEM_MSI).  Not necessary, because
  ivshmem_use_msix() does nothing when !msix_present().  That's
  ivshmem's only use of msix_present(), though.  Guard it
  consistently, and drop the now redundant msix_present() check.
  While there, rename ivshmem_use_msix() to ivshmem_msix_vector_use().

Signed-off-by: Markus Armbruster 
Message-Id: <1458066895-20632-20-git-send-email-arm...@redhat.com>
Reviewed-by: Marc-André Lureau 
---
 hw/misc/ivshmem.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 1debce3..abcb1c1 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -274,7 +274,9 @@ static void ivshmem_vector_notify(void *opaque)
 
 IVSHMEM_DPRINTF("interrupt on vector %p %d\n", pdev, vector);
 if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
-msix_notify(pdev, vector);
+if (msix_enabled(pdev)) {
+msix_notify(pdev, vector);
+}
 } else {
 ivshmem_IntrStatus_write(s, 1);
 }
@@ -713,16 +715,11 @@ static void ivshmem_check_version(void *opaque, const uint8_t * buf, int size)
 /* Select the MSI-X vectors used by device.
  * ivshmem maps events to vectors statically, so
  * we just enable all vectors on init and after reset. */
-static void ivshmem_use_msix(IVShmemState * s)
+static void ivshmem_msix_vector_use(IVShmemState *s)
 {
 PCIDevice *d = PCI_DEVICE(s);
 int i;
 
-IVSHMEM_DPRINTF("%s, msix present: %d\n", __func__, msix_present(d));
-if (!msix_present(d)) {
-return;
-}
-
 for (i = 0; i < s->vectors; i++) {
 msix_vector_use(d, i);
 }
@@ -734,7 +731,9 @@ static void ivshmem_reset(DeviceState *d)
 
 s->intrstatus = 0;
 s->intrmask = 0;
-ivshmem_use_msix(s);
+if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
+ivshmem_msix_vector_use(s);
+}
 }
 
 static int ivshmem_setup_interrupts(IVShmemState *s)
@@ -748,7 +747,7 @@ static int ivshmem_setup_interrupts(IVShmemState *s)
 }
 
 IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
-ivshmem_use_msix(s);
+ivshmem_msix_vector_use(s);
 }
 
 return 0;
@@ -1040,9 +1039,8 @@ static int ivshmem_post_load(void *opaque, int version_id)
 IVShmemState *s = opaque;
 
 if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
-ivshmem_use_msix(s);
+ivshmem_msix_vector_use(s);
 }
-
 return 0;
 }
 
@@ -1070,7 +1068,7 @@ static int ivshmem_load_old(QEMUFile *f, void *opaque, int version_id)
 
 if (ivshmem_has_feature(s, IVSHMEM_MSI)) {
 msix_load(pdev, f);
-ivshmem_use_msix(s);
+ivshmem_msix_vector_use(s);
 } else {
 s->intrstatus = qemu_get_be32(f);
 s->intrmask = qemu_get_be32(f);
-- 
2.4.3

[Qemu-devel] [PULL v2 11/40] ivshmem: Add missing newlines to debug printfs

2016-03-21 Thread Markus Armbruster

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-12-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 1838bc8..11cbc03 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -568,10 +568,10 @@ static void setup_interrupt(IVShmemState *s, int vector)
 IVSHMEM_DPRINTF("setting up interrupt for vector: %d\n", vector);
 
 if (!with_irqfd) {
-IVSHMEM_DPRINTF("with eventfd");
+IVSHMEM_DPRINTF("with eventfd\n");
 watch_vector_notifier(s, n, vector);
 } else if (msix_enabled(pdev)) {
-IVSHMEM_DPRINTF("with irqfd");
+IVSHMEM_DPRINTF("with irqfd\n");
 if (ivshmem_add_kvm_msi_virq(s, vector) < 0) {
 return;
 }
@@ -582,7 +582,7 @@ static void setup_interrupt(IVShmemState *s, int vector)
 }
 } else {
 /* it will be delayed until msix is enabled, in write_config */
-IVSHMEM_DPRINTF("with irqfd, delayed until msix enabled");
+IVSHMEM_DPRINTF("with irqfd, delayed until msix enabled\n");
 }
 }
 
-- 
2.4.3

[Qemu-devel] [PULL v2 17/40] ivshmem: Failed realize() can leave migration blocker behind

2016-03-21 Thread Markus Armbruster

If pci_ivshmem_realize() fails after it created its migration blocker,
the blocker is left in place.  Fix that by creating it last.

Likewise, if it fails after it called fifo8_create(), it leaks fifo
memory.  Fix that the same way.
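
The ordering principle behind both fixes can be shown with a toy model. This is a sketch under stated assumptions, not the device code: Device, device_realize() and the setup_fails flag are hypothetical, with a malloc'ed buffer standing in for the fifo and a boolean standing in for the migration blocker:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the device state. */
typedef struct {
    void *fifo;        /* stands in for the incoming fifo buffer */
    bool blocker_set;  /* stands in for the migration blocker */
} Device;

/* Returns 0 on success, -1 on failure. */
int device_realize(Device *d, bool setup_fails)
{
    /* Everything that can fail runs first ... */
    if (setup_fails) {
        return -1;  /* nothing allocated or registered yet */
    }

    /* ... and resources that need cleanup are acquired last. */
    d->fifo = malloc(sizeof(long long));
    if (!d->fifo) {
        return -1;
    }
    d->blocker_set = true;
    return 0;
}
```

Because the early return happens before any allocation or registration, the failure path needs no cleanup code.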

Signed-off-by: Markus Armbruster 
Reviewed-by: Marc-André Lureau 
Message-Id: <1458066895-20632-18-git-send-email-arm...@redhat.com>
---
 hw/misc/ivshmem.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/hw/misc/ivshmem.c b/hw/misc/ivshmem.c
index 299cf5b..51ad255 100644
--- a/hw/misc/ivshmem.c
+++ b/hw/misc/ivshmem.c
@@ -825,6 +825,7 @@ static void ivshmem_write_config(PCIDevice *pdev, uint32_t address,
 static void pci_ivshmem_realize(PCIDevice *dev, Error **errp)
 {
 IVShmemState *s = IVSHMEM(dev);
+Error *err = NULL;
 uint8_t *pci_conf;
 uint8_t attr = PCI_BASE_ADDRESS_SPACE_MEMORY |
 PCI_BASE_ADDRESS_MEM_PREFETCH;
@@ -856,8 +857,6 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error **errp)
 s->ivshmem_size = size;
 }
 
-fifo8_create(&s->incoming_fifo, sizeof(int64_t));
-
 /* IRQFD requires MSI */
 if (ivshmem_has_feature(s, IVSHMEM_IOEVENTFD) &&
 !ivshmem_has_feature(s, IVSHMEM_MSI)) {
@@ -879,12 +878,6 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error **errp)
 s->role_val = IVSHMEM_MASTER; /* default */
 }
 
-if (s->role_val == IVSHMEM_PEER) {
-error_setg(&s->migration_blocker,
-   "Migration is disabled when using feature 'peer mode' in device 'ivshmem'");
-migrate_add_blocker(s->migration_blocker);
-}
-
 pci_conf = dev->config;
 pci_conf[PCI_COMMAND] = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
 
@@ -963,7 +956,19 @@ static void pci_ivshmem_realize(PCIDevice *dev, Error **errp)
 return;
 }
 
-create_shared_memory_BAR(s, fd, attr, errp);
+create_shared_memory_BAR(s, fd, attr, &err);
+if (err) {
+error_propagate(errp, err);
+return;
+}
+}
+
+fifo8_create(&s->incoming_fifo, sizeof(int64_t));
+
+if (s->role_val == IVSHMEM_PEER) {
+error_setg(&s->migration_blocker,
+   "Migration is disabled when using feature 'peer mode' in device 'ivshmem'");
+migrate_add_blocker(s->migration_blocker);
 }
 }
 
-- 
2.4.3

[Qemu-devel] [PULL v2 00/40] ivshmem: Fixes, cleanups, device model split

2016-03-21 Thread Markus Armbruster

Major issues addressed by this series:

* The specification document is incomplete and vague.  Rewritten.

* When a peer goes away, and its ID gets reused for another one,
  interrupts don't work.

* When configured for interrupts, we receive shared memory from the
  server some time after realize().  This creates a (usually
  short-lived) "no shared memory, yet" state.  If the guest wins the
  race, it is exposed to this state (known issue, if you count burying
  in docs/specs/ as "known").  If migration wins the race, it fails or
  corrupts memory.

* Interrupts are unreliable in a (usually small) time window after the
  destination peer connects.  I believe fixing this will require
  changing the client/server protocol, so just document it for now.

* The device cannot tell guest software whether it is
  configured for interrupts.  Fix that in a new, backwards-compatible
  revision of the guest ABI, and bump the PCI revision.  Deprecate the
  old revision.

* The device properties are a confusing mess and badly checked.
  Clean that up.

* Migration with interrupts relies on server behavior not guaranteed
  by the specification.  Tighten the specification.

v2:
* PATCH 05: Include ivshmem-test only in configurations that include
  the device
* PATCH 36: Fix ivshmem-plain not to assert its nonexistent INTx

Markus Armbruster (40):
  target-ppc: Document TOCTTOU in hugepage support
  ivshmem-server: Fix and clean up command line help
  ivshmem-server: Don't overload POSIX shmem and file name
  qemu-doc: Fix ivshmem huge page example
  event_notifier: Make event_notifier_init_fd() #ifdef CONFIG_EVENTFD
  tests/libqos/pci-pc: Fix qpci_pc_iomap() to map BARs aligned
  ivshmem-test: Improve test case /ivshmem/single
  ivshmem-test: Clean up wait for devices to become operational
  ivshmem-test: Improve test cases /ivshmem/server-*
  ivshmem: Rewrite specification document
  ivshmem: Add missing newlines to debug printfs
  ivshmem: Compile debug prints unconditionally to prevent bit-rot
  ivshmem: Clean up after commit 9940c32
  ivshmem: Drop ivshmem_event() stub
  ivshmem: Don't destroy the chardev on version mismatch
  ivshmem: Fix harmless misuse of Error
  ivshmem: Failed realize() can leave migration blocker behind
  ivshmem: Clean up register callbacks
  ivshmem: Clean up MSI-X conditions
  ivshmem: Leave INTx alone when using MSI-X
  ivshmem: Assert interrupts are set up once
  ivshmem: Simplify rejection of invalid peer ID from server
  ivshmem: Disentangle ivshmem_read()
  ivshmem: Plug leaks on unplug, fix peer disconnect
  ivshmem: Receive shared memory synchronously in realize()
  ivshmem: Propagate errors through ivshmem_recv_setup()
  ivshmem: Rely on server sending the ID right after the version
  ivshmem: Drop the hackish test for UNIX domain chardev
  ivshmem: Simplify how we cope with short reads from server
  ivshmem: Tighten check of property "size"
  ivshmem: Implement shm=... with a memory backend
  ivshmem: Simplify memory regions for BAR 2 (shared memory)
  ivshmem: Inline check_shm_size() into its only caller
  qdev: New DEFINE_PROP_ON_OFF_AUTO
  ivshmem: Replace int role_val by OnOffAuto master
  ivshmem: Split ivshmem-plain, ivshmem-doorbell off ivshmem
  ivshmem: Clean up after the previous commit
  ivshmem: Drop ivshmem property x-memdev
  ivshmem: Require master to have ID zero
  contrib/ivshmem-server: Print "not for production" warning

 contrib/ivshmem-server/ivshmem-server.c |   56 +-
 contrib/ivshmem-server/ivshmem-server.h |4 +-
 contrib/ivshmem-server/main.c   |   98 +--
 default-configs/pci.mak |2 +-
 docs/specs/ivshmem-spec.txt |  254 +++
 docs/specs/ivshmem_device_spec.txt  |  161 -
 hw/core/qdev-properties.c   |   10 +
 hw/misc/ivshmem.c   | 1104 +--
 include/hw/qdev-properties.h|3 +
 qemu-doc.texi   |   47 +-
 target-ppc/kvm.c|6 +
 tests/Makefile  |2 +-
 tests/ivshmem-test.c|   99 +--
 tests/libqos/pci-pc.c   |8 +-
 util/event_notifier-posix.c |6 +
 15 files changed, 1031 insertions(+), 829 deletions(-)
 create mode 100644 docs/specs/ivshmem-spec.txt
 delete mode 100644 docs/specs/ivshmem_device_spec.txt

-- 
2.4.3
