Re: [Qemu-devel] [RFC PATCH] vring: calculate descriptor address directly
Am 27.11.2014 um 17:49 schrieb Paolo Bonzini: On 27/11/2014 11:15, Peter Lieven wrote: vring_map causes a huge overhead by calling memory_region_find every time. vring_map is already executed on vring_setup, and the memory region is also referenced there. Signed-off-by: Peter Lieven p...@kamp.de vring_map/unmap is going to disappear and be replaced by address_space_map/unmap, so for now I'd rather keep it... the issue with vring_map/unmap is that it is called at every vring_pop/push to calculate the descriptor ptr. It's OK to call it in vring_setup/destroy, where it doesn't hurt. Looking at address_space_map/unmap, this would be expensive as well if it were called on every vring_pop/push. Peter
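The caching Peter describes can be sketched in plain C. This is a hypothetical, simplified model (the names and the 16-byte descriptor layout are illustrative, not QEMU's actual vring code): the expensive lookup happens once at setup time, and the pop/push hot path reduces to pointer arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte descriptor, loosely modelled on struct vring_desc. */
typedef struct {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
} Desc;

typedef struct {
    Desc *desc_base;    /* host pointer, cached once at setup time */
    unsigned num;
} Vring;

/* The equivalent of vring_setup(): do the one expensive lookup here. */
static void vring_cache_desc_base(Vring *vr, void *mapped, unsigned num)
{
    vr->desc_base = mapped;    /* result of the one-time address lookup */
    vr->num = num;
}

/* Hot path (vring_pop/push): plain arithmetic, no lookup per request. */
static Desc *vring_desc_ptr(Vring *vr, unsigned i)
{
    assert(i < vr->num);
    return &vr->desc_base[i];
}
```

The trade-off Paolo alludes to is that a cached mapping must be invalidated if the guest memory layout changes, which is what the address_space_map/unmap conversion is meant to handle properly.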
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
On 11/28/14, Markus Armbruster arm...@redhat.com wrote: Ming Lei ming@canonical.com writes: Hi Kevin, On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf kw...@redhat.com wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. I am not sure it is a good way for Linux AIO optimization: - for a raw image with some constraints, the coroutine can be avoided since io_submit() won't sleep most of the time - handling one coroutine takes more time than handling malloc, memset and free on a small buffer, per the following test data: -- 241ns per coroutine What do you mean by coroutine here? Create + destroy? Yield? Please see perf_cost() in tests/test-coroutine.c Thanks, Ming Lei
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 27.11.2014 um 17:40 schrieb Paolo Bonzini: On 27/11/2014 11:27, Peter Lieven wrote: +static __thread struct CoRoutinePool { +Coroutine *ptrs[POOL_MAX_SIZE]; +unsigned int size; +unsigned int nextfree; +} CoPool; The per-thread ring unfortunately didn't work well last time it was tested. Devices that do not use ioeventfd (not just the slow ones, even decently performing ones like ahci, nvme or megasas) will create the coroutine in the VCPU thread, and destroy it in the iothread. The result is that coroutines cannot be reused. Can you check if this is still the case? I already tested, at least for IDE and for ioeventfd=off: the coroutine is created in the vCPU thread and destroyed in the I/O thread. I also have a more complicated version which sets the per-thread coroutine pool only for dataplane, avoiding the lock for dedicated iothreads. For those who want to take a look: https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e Peter
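For reference, the thread-local pool idea in the patch fragment above can be sketched as a minimal runnable model (simplified: void pointers instead of real Coroutine objects, and no global release pool). It also illustrates Paolo's objection: a pointer freed on thread B never lands back in thread A's pool.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MAX_SIZE 64

/* One pool per thread: a fixed-size stack of recycled pointers. */
static __thread struct {
    void *ptrs[POOL_MAX_SIZE];
    unsigned int size;
} co_pool;

/* NULL means this thread's pool is empty and a fresh coroutine
 * must be created the slow way. */
static void *pool_get(void)
{
    return co_pool.size ? co_pool.ptrs[--co_pool.size] : NULL;
}

/* Returns 0 when recycled into this thread's pool, -1 when the pool
 * is full and the caller must really destroy the coroutine. */
static int pool_put(void *co)
{
    if (co_pool.size == POOL_MAX_SIZE) {
        return -1;
    }
    co_pool.ptrs[co_pool.size++] = co;
    return 0;
}
```

Because `co_pool` is `__thread`, a `pool_put()` in the iothread fills a pool the vCPU thread's `pool_get()` never sees, which is exactly the create-in-one-thread/destroy-in-another reuse failure being discussed.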
Re: [Qemu-devel] [PATCH v2 1/6] geometry: add bdrv functions for geometry and blocksize
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: On 11/27/2014 05:55 PM, Markus Armbruster wrote: I'm sorry for the delay. I got the flu and have difficulties thinking straight for longer than a few minutes. Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: Add driver functions for geometry and blocksize detection Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- block.c | 26 ++ include/block/block.h | 20 include/block/block_int.h | 3 +++ 3 files changed, 49 insertions(+) diff --git a/block.c b/block.c index a612594..5df35cf 100644 --- a/block.c +++ b/block.c @@ -548,6 +548,32 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp) } } +struct ProbeBlockSize bdrv_probe_blocksizes(BlockDriverState *bs) +{ +BlockDriver *drv = bs->drv; +struct ProbeBlockSize err_geo = { .rc = -1 }; + +assert(drv != NULL); +if (drv->bdrv_probe_blocksizes) { +return drv->bdrv_probe_blocksizes(bs); +} + +return err_geo; +} I'm not sure about probe. Naming is hard. get? There was already bdrv_get_geometry, and I wanted the _geometry and _blocksize functions to use the same naming convention. Fair enough. bdrv_get_geometry() is a silly wrapper around bdrv_nb_sectors() that maps failure to zero sectors. I hope to kill it some time. Doesn't help you now. I thought probe might be more suitable, since we do not get the value for sure. maybe detect? Feel free to stick to probe. Squashing status and actual payload into a single struct to use as return type isn't wrong, but unusual. When the payload can't represent failure conveniently, we usually return status, and write the payload to a buffer provided by the caller, like this: int bdrv_get_blocksizes(BlockDriverState *bs, uint16_t *physical_blk_sz, uint16_t *logical_blk_sz) Or, with a struct to hold both sizes: int bdrv_get_blocksizes(BlockDriverState *bs, BlockSizes *bsz) Do you insist on changing that?
Returning a struct via the stack seemed useful to me, since there was less probability of the caller allocating a buffer of an incorrect size or something like that. You'd have to do crazy stuff to defeat the static type checker. Please stick to the common technique. [...]
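The two conventions under discussion can be contrasted in a small self-contained sketch (the names and the 4096/512 sizes are placeholders, not QEMU's actual implementation):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t phys; uint16_t log; } BlockSizes;

/* Variant under discussion: status and payload squashed into one struct. */
struct ProbeBlockSize { int rc; BlockSizes size; };

static struct ProbeBlockSize probe_struct_return(int have_driver)
{
    struct ProbeBlockSize r = { .rc = -1 };   /* failure by default */
    if (have_driver) {
        r.rc = 0;
        r.size.phys = 4096;
        r.size.log = 512;
    }
    return r;
}

/* Common QEMU convention: return status, write payload to the caller's
 * buffer.  The payload is left untouched on failure. */
static int probe_out_param(int have_driver, BlockSizes *bsz)
{
    if (!have_driver) {
        return -1;
    }
    bsz->phys = 4096;
    bsz->log = 512;
    return 0;
}
```

As Markus notes, both are type-safe; the out-parameter form merely matches the surrounding codebase, which is the argument for preferring it.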
Re: [Qemu-devel] [PATCH 0/2] virtfs-proxy-helper: Fix Coverity warning
On 27/11/2014 11:29, Gonglei wrote: On 2014/11/13 20:17, Gonglei (Arei) wrote: From: Gonglei arei.gong...@huawei.com Gonglei (2): virtfs-proxy-helper: Fix possible socket leak. virtfs-proxy-helper: Fix handle leak to make Coverity happy fsdev/virtfs-proxy-helper.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) Ping... Hi Gonglei---thanks, I had missed this and Michael is probably not sending qemu-trivial patches because we're so close to the release. I'll pick up patch 1. Paolo
Re: [Qemu-devel] [PATCH 0/2] virtfs-proxy-helper: Fix Coverity warning
On 2014/11/28 16:31, Paolo Bonzini wrote: On 27/11/2014 11:29, Gonglei wrote: On 2014/11/13 20:17, Gonglei (Arei) wrote: From: Gonglei arei.gong...@huawei.com Gonglei (2): virtfs-proxy-helper: Fix possible socket leak. virtfs-proxy-helper: Fix handle leak to make Coverity happy fsdev/virtfs-proxy-helper.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) Ping... Hi Gonglei---thanks, I had missed this and Michael is probably not sending qemu-trivial patches because we're so close to the release. I'll pick up patch 1. Paolo Thanks. Per Markus's suggestion, we drop patch 2. :) Regards, -Gonglei
Re: [Qemu-devel] MinGW build
On 28 November 2014 at 07:14, Stefan Weil s...@weilnetz.de wrote: The libvixl code is correct, but the C++ compiler would need to be fixed. Here are some examples: disas/libvixl/a64/disasm-a64.cc:1340:57: warning: unknown conversion type character ‘l’ in format [-Wformat] disas/libvixl/a64/disasm-a64.cc:1340:57: warning: too many arguments for format [-Wformat-extra-args] disas/libvixl/a64/disasm-a64.cc:1492:42: warning: unknown conversion type character ‘l’ in format [-Wformat] That code uses PRIx64, so the format specifier is %llx, which is correct. Obviously the C++ compiler ignores the fact that QEMU uses ANSI format specifiers (compiler option -D__USE_MINGW_ANSI_STDIO=1) instead of the MS-specific ones. Lovely. A simple workaround for QEMU would be to just suppress -Wformat and -Wformat-extra-args for C++ code. I'd prefer not to lose those (just because once you turn off a warning it's never going to come back, practically speaking). We could have a configure test that checked whether the C++ compiler understood %llx, maybe? -- PMM
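The configure test Peter floats could boil down to compiling and running a probe like the following with the C++ compiler under test (a sketch, not an actual QEMU configure fragment; on a MinGW build it only passes when -D__USE_MINGW_ANSI_STDIO=1 is honoured):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the toolchain's printf honours the PRIx64 specifier,
 * i.e. the ANSI %llx path rather than the MS %I64x-only one. */
static int prix64_works(void)
{
    char buf[32];
    uint64_t v = 0x123456789abcdef0ULL;

    snprintf(buf, sizeof(buf), "%" PRIx64, v);
    return strcmp(buf, "123456789abcdef0") == 0;
}
```

A compile-only variant (checking that -Wformat stays quiet with -Werror) would fit configure's usual pattern better than a run test, since QEMU is frequently cross-compiled for Windows.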
Re: [Qemu-devel] [PATCH v2 3/6] geometry: Add driver methods to probe blocksizes and geometry
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: This patch introduces driver methods for detecting disk blocksizes (physical and logical) and hard drive geometry. The method is only implemented for host_device. For raw devices, the driver calls the child's method. The detection will only work for DASD devices. In order to check for that, a local CheckForDASD function was introduced, which calls the BIODASDINFO2 ioctl and returns 0 only if it succeeds. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- block/raw-posix.c | 65 +++ block/raw_bsd.c | 12 ++ 2 files changed, 77 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index 45f1d79..274ceda 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -56,6 +56,7 @@ #include <linux/cdrom.h> #include <linux/fd.h> #include <linux/fs.h> +#include <linux/hdreg.h> #ifndef FS_NOCOW_FL #define FS_NOCOW_FL 0x00800000 /* Do not cow file */ #endif @@ -93,6 +94,10 @@ #include <xfs/xfs.h> #endif +#ifdef __s390__ +#include <asm/dasd.h> +#endif + //#define DEBUG_FLOPPY //#define DEBUG_BLOCK @@ -678,6 +683,64 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) bs->bl.opt_mem_alignment = s->buf_align; } +static int CheckForDASD(int fd) As Christian said, no CamelCase for functions. I guess your function returns either 0 or -1. Can't say for sure without looking up BIODASDINFO2. I'd do a slightly higher level is_dasd() returning bool. Your choice. +{ +#ifdef BIODASDINFO2 +struct dasd_information2_t info = {0}; + +return ioctl(fd, BIODASDINFO2, &info); +#endif +return -1; +} + +static struct ProbeBlockSize hdev_probe_blocksizes(BlockDriverState *bs) +{ +BDRVRawState *s = bs->opaque; +struct ProbeBlockSize block_sizes = {0}; + +block_sizes.rc = CheckForDASD(s->fd); +/* If DASD, get blocksizes */ +if (block_sizes.rc == 0) { +block_sizes.size.log = probe_logical_blocksize(bs, s->fd); +block_sizes.size.phys = probe_physical_blocksize(bs, s->fd); +} + +return block_sizes; +} Fails unless DASD. Why?
The block size concept applies not just to DASD, and the probe_*_blocksize() functions should just work, shouldn't they? + +static struct ProbeGeometry hdev_probe_geometry(BlockDriverState *bs) +{ +BDRVRawState *s = bs->opaque; +struct ProbeGeometry geometry = {0}; +struct hd_geometry ioctl_geo = {0}; Is this initializer really necessary? Because I like my local variable names short & sweet, I'd call this one geo. Your choice. + +geometry.rc = CheckForDASD(s->fd); +if (geometry.rc != 0) { Works if your function really returns either 0 or -1. If it can return a positive value, it breaks the callback's contract. +goto done; +} +/* If DASD, get it's geometry */ its +geometry.rc = ioctl(s->fd, HDIO_GETGEO, &ioctl_geo); +/* Do not return a geometry for partition */ +if (ioctl_geo.start != 0) { +geometry.rc = -1; +goto done; +} +/* HDIO_GETGEO may return success even though geo contains zeros + (e.g. certain multipath setups) */ +if (!ioctl_geo.heads || !ioctl_geo.sectors || !ioctl_geo.cylinders) { +geometry.rc = -1; +goto done; +} +if (geometry.rc == 0) { +geometry.geo.heads = (uint32_t)ioctl_geo.heads; +geometry.geo.sectors = (uint32_t)ioctl_geo.sectors; +geometry.geo.cylinders = (uint32_t)ioctl_geo.cylinders; The LHS is uint32_t, the RHS is unsigned char or unsigned short, thus the type casts are superfluous clutter. 8-bit head * 8-bit sectors * 16-bit cylinders can't represent today's disk sizes, so ioctl_geo.cylinders is generally useless. The common advice is to ignore it, and do something like geometry.geo.cylinders = nb_sectors / (ioctl_geo.heads * ioctl_geo.sectors) A possible alternative is to explicitly document that the returned cylinders value can't be trusted. Another one is not to return the number of cylinders :) +geometry.geo.start = (uint32_t)ioctl_geo.start; Always assigns zero (we checked above). Why is member geo_start useful? +} +done: + return geometry; +} Also fails unless DASD.
The geometry concept applies pretty much only to disks older than thirty years or so, and to software designed for them, like HDIO_GETGEO. + static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb) { int ret; @@ -2128,6 +2191,8 @@ static BlockDriver bdrv_host_device = { .bdrv_get_info = raw_get_info, .bdrv_get_allocated_file_size = raw_get_allocated_file_size, +.bdrv_probe_blocksizes = hdev_probe_blocksizes, +.bdrv_probe_geometry = hdev_probe_geometry, .bdrv_detach_aio_context = raw_detach_aio_context, .bdrv_attach_aio_context = raw_attach_aio_context, diff --git
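The higher-level is_dasd() helper Markus suggests could look like the following sketch (hedged: the guard structure mirrors the patch's #ifdef __s390__ / #ifdef BIODASDINFO2 split, since BIODASDINFO2 is only defined after including asm/dasd.h; this is not code from the thread):

```c
#include <assert.h>
#include <stdbool.h>
#ifdef __s390__
#include <sys/ioctl.h>
#include <asm/dasd.h>
#endif

/* bool wrapper around the BIODASDINFO2 ioctl; compiles to a stub
 * returning false on hosts without DASD support. */
static bool is_dasd(int fd)
{
#ifdef BIODASDINFO2
    struct dasd_information2_t info = { 0 };

    return ioctl(fd, BIODASDINFO2, &info) == 0;
#else
    (void)fd;
    return false;
#endif
}
```

Returning bool sidesteps the 0-or-minus-1 contract question raised in the review: callers simply branch on the result instead of interpreting a status code.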
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Ming Lei ming@canonical.com writes: On 11/28/14, Markus Armbruster arm...@redhat.com wrote: Ming Lei ming@canonical.com writes: Hi Kevin, On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf kw...@redhat.com wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. I am not sure it is a good way for Linux AIO optimization: - for a raw image with some constraints, the coroutine can be avoided since io_submit() won't sleep most of the time - handling one coroutine takes more time than handling malloc, memset and free on a small buffer, per the following test data: -- 241ns per coroutine What do you mean by coroutine here? Create + destroy? Yield? Please see perf_cost() in tests/test-coroutine.c static __attribute__((noinline)) void perf_cost_func(void *opaque) { qemu_coroutine_yield(); } static void perf_cost(void) { const unsigned long maxcycles = 40000000; unsigned long i = 0; double duration; unsigned long ops; Coroutine *co; g_test_timer_start(); while (i++ < maxcycles) { co = qemu_coroutine_create(perf_cost_func); qemu_coroutine_enter(co, &i); qemu_coroutine_enter(co, NULL); } duration = g_test_timer_elapsed(); ops = (long)(maxcycles / (duration * 1000)); g_test_message("Run operation %lu iterations %f s, %luK operations/s, %luns per coroutine", maxcycles, duration, ops, (unsigned long)(1000000000.0 * duration / maxcycles)); } This tests create, enter, yield, reenter, terminate, destroy. The cost of create + destroy may well dominate. If we create and destroy coroutines for each AIO request, we're doing it wrong. I doubt Kevin's doing it *that* wrong ;) Anyway, let's benchmark the real code instead of putting undue trust in tests/test-coroutine.c micro-benchmarks.
Re: [Qemu-devel] [PATCH] Fix for crash after migration in virtio-rng on bi-endian targets
On 27 November 2014 at 09:26, Markus Armbruster arm...@redhat.com wrote: David Gibson da...@gibson.dropbear.id.au writes: VirtIO devices now remember which endianness they're operating in, in order to support targets which may have guests of either endianness, such as powerpc. This endianness state is transferred in a subsection of the virtio device's information. With virtio-rng this can lead to an abort after a loadvm hitting the assert() in virtio_is_big_endian(). This can be reproduced by doing a migrate and load from file on a bi-endian target with a virtio-rng device. The actual guest state isn't particularly important to triggering this. The cause is that virtio_rng_load_device() calls virtio_rng_process(), which accesses the ring and thus needs the endianness. However, virtio_rng_process() is called via virtio_load() before it loads the subsections. Essentially the ->load callback in VirtioDeviceClass should only be used for actually reading the device state from the stream, not for post-load re-initialization. This patch fixes the bug by moving the virtio_rng_process() after the call to virtio_load(). Better yet would be to convert virtio to use vmsd and have the virtio_rng_process() as a post_load callback, but that's a bigger project for another day. This is a bugfix, and should be considered for the 2.2 branch. [PATCH for-2.2] would have been a good idea then. Next time :) So do you want this patch in 2.2? I was planning to put in the virtio-vs-xen fixes today and tag rc4, so it's not too late if you're confident this patch is good. Let me know if you think it should go in, and I can apply it to master directly. -- PMM
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
On 11/28/14, Markus Armbruster arm...@redhat.com wrote: Ming Lei ming@canonical.com writes: On 11/28/14, Markus Armbruster arm...@redhat.com wrote: Ming Lei ming@canonical.com writes: Hi Kevin, On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf kw...@redhat.com wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. I am not sure it is a good way for Linux AIO optimization: - for a raw image with some constraints, the coroutine can be avoided since io_submit() won't sleep most of the time - handling one coroutine takes more time than handling malloc, memset and free on a small buffer, per the following test data: -- 241ns per coroutine What do you mean by coroutine here? Create + destroy? Yield? Please see perf_cost() in tests/test-coroutine.c static __attribute__((noinline)) void perf_cost_func(void *opaque) { qemu_coroutine_yield(); } static void perf_cost(void) { const unsigned long maxcycles = 40000000; unsigned long i = 0; double duration; unsigned long ops; Coroutine *co; g_test_timer_start(); while (i++ < maxcycles) { co = qemu_coroutine_create(perf_cost_func); qemu_coroutine_enter(co, &i); qemu_coroutine_enter(co, NULL); } duration = g_test_timer_elapsed(); ops = (long)(maxcycles / (duration * 1000)); g_test_message("Run operation %lu iterations %f s, %luK operations/s, %luns per coroutine", maxcycles, duration, ops, (unsigned long)(1000000000.0 * duration / maxcycles)); } This tests create, enter, yield, reenter, terminate, destroy. The cost of create + destroy may well dominate. Actually there shouldn't be much cost from create and destroy, thanks to the coroutine pool. If we create and destroy coroutines for each AIO request, we're doing it wrong. I doubt Kevin's doing it *that* wrong ;) Anyway, let's benchmark the real code instead of putting undue trust in tests/test-coroutine.c micro-benchmarks. I don't think the trust in the micro-benchmark is undue.
That is the direct cost from the coroutine, and that cost can't be avoided at all, not to mention the cost of switching stacks. If you google some test data posted by me previously, it shows that bypassing the coroutine can increase throughput by ~50% for a raw image in the case of Linux AIO; that is a real test case, not a micro-benchmark. Thanks, Ming Lei
[Qemu-devel] [PATCH] vhost: Fix vhostfd leak in error branch
From: Gonglei arei.gong...@huawei.com Signed-off-by: Gonglei arei.gong...@huawei.com --- hw/scsi/vhost-scsi.c | 1 + hw/virtio/vhost.c| 2 ++ 2 files changed, 3 insertions(+) diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c index 308b393..dcb2bc5 100644 --- a/hw/scsi/vhost-scsi.c +++ b/hw/scsi/vhost-scsi.c @@ -233,6 +233,7 @@ static void vhost_scsi_realize(DeviceState *dev, Error **errp) vhost_dummy_handle_output); if (err != NULL) { error_propagate(errp, err); +close(vhostfd); return; } diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 5d7c40a..5a12861 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -817,10 +817,12 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque, int i, r; if (vhost_set_backend_type(hdev, backend_type) < 0) { +close((uintptr_t)opaque); return -1; } if (hdev->vhost_ops->vhost_backend_init(hdev, opaque) < 0) { +close((uintptr_t)opaque); return -errno; } -- 1.7.12.4
Re: [Qemu-devel] [RFC PATCH v5 00/31] Deterministic replay and reverse execution
On Thu, Nov 27, 2014 at 6:45 PM, Paolo Bonzini pbonz...@redhat.com wrote: On 27/11/2014 10:53, Artyom Tarasenko wrote: Deterministic replay has the following features: * Deterministically replays whole-system execution and all contents of the memory, the state of the hardware devices, clocks, and screen of the VM. * Writes the execution log into a file for later replaying, multiple times and on different machines. * Supports i386, x86_64, and ARM hardware platforms. This patch series is really impressive, hats off! I wonder what would have to be done to add support for SPARC-64? Since it has a PCI bus, I guess the support in atapi and usb devices is already there. Does anything beyond translate.c have to be adjusted? Does -icount work reliably on SPARC? No. Currently using -icount 1 or -icount 2 reliably produces the "Bad clock read" message, and using -icount 20 reliably crashes qemu with the "qemu: fatal: Raised interrupt while not in I/O function" message. So, I guess I see the starting point... Artyom -- Regards, Artyom Tarasenko SPARC and PPC PReP under qemu blog: http://tyom.blogspot.com/search/label/qemu
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
On 28/11/2014 03:59, Ming Lei wrote: Hi Kevin, On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf kw...@redhat.com wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. I am not sure it is a good way for Linux AIO optimization: - for a raw image with some constraints, the coroutine can be avoided since io_submit() won't sleep most of the time - handling one coroutine takes more time than handling malloc, memset and free on a small buffer, per the following test data: -- 241ns per coroutine -- 61ns per (malloc, memset, free for 128 bytes) I still think we should figure out a fast path to avoid coroutines for linux-aio with raw images, otherwise it can't scale well for high-IOPS devices. sigsetjmp/siglongjmp are just ~60 instructions, so they cannot account for 180ns (600 clock cycles). The cost of creating and destroying the coroutine must come from somewhere else. Let's just try something else. Let's remove the pool mutex, as suggested by Peter, but in a way that works even with non-ioeventfd backends. I still believe we will end up with some kind of coroutine bypass scheme (even coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv directly can help), but hey, it cannot hurt to optimize hot code. The patch below has a single pool where coroutines are placed on destruction, and a per-thread allocation pool. Whenever the destruction pool is big enough, the next thread that runs out of coroutines will steal from it instead of making a new coroutine. If this works, it would be beautiful in two ways: 1) the allocation does not have to do any atomic operation in the fast path, it's entirely using thread-local storage. Once every POOL_BATCH_SIZE allocations it will do a single atomic_xchg. Release does an atomic_cmpxchg loop, that hopefully doesn't cause any starvation, and an atomic_inc. 2) in theory this should be completely adaptive.
The number of coroutines around should be a little more than POOL_BATCH_SIZE * number of allocating threads; so this also removes qemu_coroutine_adjust_pool_size. (The previous pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit more generous). The patch below is very raw, and untested beyond tests/test-coroutine. There may be some premature optimization (not using GPrivate, even though I need it to run the per-thread destructor) but it was easy enough. Ming, Kevin, can you benchmark it? Related to this, we can see if __thread beats GPrivate in coroutine-ucontext.c. Every clock cycle counts (600 clock cycles are a 2% improvement at 3 GHz and 100 kiops) so we can see what we get. Paolo diff --git a/include/qemu/queue.h b/include/qemu/queue.h index d433b90..6a01e2f 100644 --- a/include/qemu/queue.h +++ b/include/qemu/queue.h @@ -191,6 +191,17 @@ struct { \ #define QSLIST_INSERT_HEAD(head, elm, field) do {\ (elm)->field.sle_next = (head)->slh_first; \ (head)->slh_first = (elm); \ +} while (/*CONSTCOND*/0) + +#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do { \ + do { \ + (elm)->field.sle_next = (head)->slh_first; \ + } while (atomic_cmpxchg(&(head)->slh_first, (elm)->field.sle_next, \ + (elm)) != (elm)->field.sle_next);\ +} while (/*CONSTCOND*/0) + +#define QSLIST_MOVE_ATOMIC(dest, src) do { \ + (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);\ +} while (/*CONSTCOND*/0) #define QSLIST_REMOVE_HEAD(head, field) do { \ diff --git a/qemu-coroutine.c b/qemu-coroutine.c index bd574aa..60d761f 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -15,35 +15,52 @@ #include "trace.h" #include "qemu-common.h" #include "qemu/thread.h" +#include "qemu/atomic.h" #include "block/coroutine.h" #include "block/coroutine_int.h" enum { -POOL_DEFAULT_SIZE = 64, +POOL_BATCH_SIZE = 64, }; /** Free list to speed up creation */ -static QemuMutex pool_lock; -static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool); +static QSLIST_HEAD(, Coroutine) release_pool =
QSLIST_HEAD_INITIALIZER(pool); static unsigned int pool_size; -static unsigned int pool_max_size = POOL_DEFAULT_SIZE; +static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); + +/* The GPrivate is only used to invoke coroutine_pool_cleanup. */ +static void coroutine_pool_cleanup(void *value); +static GPrivate dummy_key = G_PRIVATE_INIT(coroutine_pool_cleanup); Coroutine *qemu_coroutine_create(CoroutineEntry *entry) { Coroutine *co = NULL; if (CONFIG_COROUTINE_POOL) { -qemu_mutex_lock(&pool_lock); -co
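The two lock-free primitives the patch adds to queue.h can be sketched as a self-contained model using C11 atomics instead of QEMU's atomic_* wrappers (hypothetical names; a simplification, not the QSLIST macros themselves): push retries a compare-and-swap until it installs itself as the head, and the "steal" side takes the whole list with one exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node {
    struct Node *next;
} Node;

typedef struct {
    _Atomic(Node *) head;
} List;

/* Like QSLIST_INSERT_HEAD_ATOMIC: CAS loop until we are the new head. */
static void insert_head_atomic(List *l, Node *n)
{
    Node *old = atomic_load(&l->head);
    do {
        n->next = old;    /* 'old' is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak(&l->head, &old, n));
}

/* Like QSLIST_MOVE_ATOMIC: steal the whole list in one xchg,
 * leaving the source empty. */
static Node *move_atomic(List *src)
{
    return atomic_exchange(&src->head, NULL);
}
```

This is why the release path is an atomic_cmpxchg loop while allocation only pays a single atomic_xchg once per batch, matching Paolo's point 1) above.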
Re: [Qemu-devel] [PATCH RFC v4 00/16] qemu: towards virtio-1 host support
On Thu, 27 Nov 2014 18:35:40 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Thu, Nov 27, 2014 at 05:28:42PM +0100, Cornelia Huck wrote: On Thu, 27 Nov 2014 18:18:25 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Thu, Nov 27, 2014 at 05:06:51PM +0100, Cornelia Huck wrote: So we should have a per-device callback into the transport layer, say check_legacy()? I would just have 2 masks: legacy_features and features. But these belong to the device type and the transport just needs to trigger usage of the right one, right? Yes. For ccw, this would check for the negotiated revision; for mmio, it could check a device property configured with the device; and for pci, whatever the mechanism is there :) A transport not implementing this callback is simply considered legacy-only. I dislike callbacks. Let's just give all info to core, and have it DTRT. Have an is_legacy flag in the vdev that is initialized to 1, and transports can unset it when the revision is negotiated or during init? I would say have modern_features, legacy_features, and set host_features correctly. Started digging through the code a bit: Currently the only time where the transport is actually mucking with the feature bits is when the device is plugged, when it adds its feature bits and calls virtio_bus_get_vdev_features() to get the device-specific bits. Only mmio knows at this point in time whether it has a v1.0 or a legacy device. Other transports will need to call this function again when the guest has actually set the revision. My plan is the following: - have virtio_bus_get_vdev_features_rev() that takes a virtio-rev parameter (0: legacy, 1: v1.0) - unconverted transports keep calling virtio_bus_get_vdev_features() which calls virtio_bus_get_vdev_features_rev(..., 0) - the rev parameter gets passed to the device, which hands over the right feature set (I don't think that we can use static feature tables, as the device may need to calculate the desired feature set dynamically) Sounds reasonable?
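The two-mask idea being discussed reduces to a tiny selection function; the sketch below is a hypothetical illustration (the type and function names are invented, not QEMU's virtio code): the device owns both feature sets, and the transport merely passes the negotiated revision.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device feature sets: the transport does not own them,
 * it only selects which one applies. */
typedef struct {
    uint64_t legacy_features;   /* pre-1.0 transitional feature bits */
    uint64_t modern_features;   /* virtio 1.0 feature bits */
} VirtIODevFeatures;

/* rev follows the convention in Cornelia's plan: 0 = legacy, 1 = v1.0.
 * Unconverted (legacy-only) transports would always pass 0. */
static uint64_t get_vdev_features_rev(const VirtIODevFeatures *f, int rev)
{
    return rev == 0 ? f->legacy_features : f->modern_features;
}
```

This matches Michael's "give all info to core" preference: no callback into the transport is needed, only the revision value it already knows.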
Re: [Qemu-devel] [PATCH RFC v4 00/16] qemu: towards virtio-1 host support
On Thu, 27 Nov 2014 18:38:59 +0200 Michael S. Tsirkin m...@redhat.com wrote: Also, let's focus on one device, e.g. -net for now. Yes, I'm currently looking at net. Then probably virtio scsi. That's because blk needs to be reworked to support ANY_LAYOUT. I had focused on blk because that's the other device used on every s390 image :) But the ANY_LAYOUT stuff looks like a bit of work :/
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Am 28.11.2014 um 03:59 hat Ming Lei geschrieben: Hi Kevin, On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf kw...@redhat.com wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. I am not sure it is a good way for Linux AIO optimization: - for a raw image with some constraints, the coroutine can be avoided since io_submit() won't sleep most of the time - handling one coroutine takes more time than handling malloc, memset and free on a small buffer, per the following test data: -- 241ns per coroutine -- 61ns per (malloc, memset, free for 128 bytes) Please finally stop making comparisons between completely unrelated things and trying to make a case against coroutines out of it. It simply doesn't make any sense. The truth is that in the 'qemu-img bench' case, as well as in the highest performing VM setup for Peter and me, the practically existing coroutine-based git branches perform better than the practically existing bypass branches. If you think that theoretically the bypass branches must be better, show us the patches and benchmarks. If you can't, let's merge the coroutine improvements (which improve more than just the case of raw images using no block layer features, including cases that benefit the average user) and be done. I still think we should figure out a fast path to avoid coroutines for linux-aio with raw images, otherwise it can't scale well for high-IOPS devices. Also, we can use a simple buf pool to avoid the dynamic allocation easily, can't we? Yes, the change to g_slice_alloc() was a bad move performance-wise. As a side effect, the codepath taken by aio=threads is changed to use paio_submit_co(). This doesn't change the performance at this point.
Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:

          |        aio=native        |        aio=threads
          | before   | with patch    | before   | with patch
----------+----------+---------------+----------+------------
run 1     | 29.921s  | 26.932s       | 35.286s  | 35.447s
run 2     | 29.793s  | 26.252s       | 35.276s  | 35.111s
run 3     | 30.186s  | 27.114s       | 35.042s  | 34.921s
run 4     | 30.425s  | 26.600s       | 35.169s  | 34.968s
run 5     | 30.041s  | 26.263s       | 35.224s  | 35.000s

TODO: Do some more serious benchmarking in VMs with less variance. Results of a quick fio run are vaguely positive. I will do the test with Paolo's fast path approach under a VM I/O situation. Currently, the best thing to compare it against is probably Peter's git branch at https://github.com/plieven/qemu.git perf_master2. This patch is only a first step in a whole series of possible optimisations. Kevin
Re: [Qemu-devel] [PATCH v2 5/6] geometry: Call backend function to detect geometry and blocksize
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: hd_geometry_guess function autodetects the drive geometry. This patch adds a block backend call, that probes the backing device geometry. If the inner driver method is implemented and succeeds (currently only DASDs), the blkconf_geometry will pass-through the backing device geometry. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- hw/block/block.c | 11 +++ hw/block/hd-geometry.c | 9 + hw/block/virtio-blk.c| 1 + include/hw/block/block.h | 1 + 4 files changed, 22 insertions(+) diff --git a/hw/block/block.c b/hw/block/block.c index a625773..f1d29bc 100644 --- a/hw/block/block.c +++ b/hw/block/block.c @@ -25,6 +25,17 @@ void blkconf_serial(BlockConf *conf, char **serial) } } +void blkconf_blocksizes(BlockConf *conf) +{ +BlockBackend *blk = conf->blk; +struct ProbeBlockSize blocksize = blk_probe_blocksizes(blk); + +if (blocksize.rc == 0) { +conf->physical_block_size = blocksize.size.phys; +conf->logical_block_size = blocksize.size.log; +} +} + Uh, doesn't this overwrite the user's geometry? The existing blkconf_ functions do nothing when the user supplied the configuration. In other words, they only provide defaults. Why should this one be different? void blkconf_geometry(BlockConf *conf, int *ptrans, unsigned cyls_max, unsigned heads_max, unsigned secs_max, Error **errp) diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c index 6fcf74d..4972114 100644 --- a/hw/block/hd-geometry.c +++ b/hw/block/hd-geometry.c @@ -121,6 +121,14 @@ void hd_geometry_guess(BlockBackend *blk, int *ptrans) { int cylinders, heads, secs, translation; +struct ProbeGeometry geometry = blk_probe_geometry(blk); + +if (geometry.rc == 0) { +*pcyls = geometry.geo.cylinders; +*psecs = geometry.geo.sectors; +*pheads = geometry.geo.heads; +goto done; +} hd_geometry_guess() is called by blkconf_geometry() to conjure up a default geometry when the user didn't pick one. Fairly elaborate guesswork.
blkconf_geometry() runs for IDE, SCSI and virtio-blk only. Your patch makes it pick the backend's geometry as reported by blk_probe_geometry() before anything else. This is an incompatible change for backends where blk_probe_geometry() succeeds. Currently, it succeeds only for DASD, and there you *want* the incompatible change. My point is: if we rely on blk_probe_geometry() failing except for DASD, then we should call it something like blk_dasd_geometry() to make that obvious. The commit message needs to spell out the incompatible change. if (guess_disk_lchs(blk, &cylinders, &heads, &secs) < 0) { /* no LCHS guess: use a standard physical disk geometry */ @@ -143,6 +151,7 @@ void hd_geometry_guess(BlockBackend *blk, the logical geometry */ translation = BIOS_ATA_TRANSLATION_NONE; } +done: if (ptrans) { *ptrans = translation; } diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index b19b102..6f01565 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -745,6 +745,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp) error_propagate(errp, err); return; } +blkconf_blocksizes(conf->conf); virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, sizeof(struct virtio_blk_config)); Why only virtio-blk, and not the other device models sporting configurable block sizes (nvme, IDE, SCSI, usb-storage)? This is an incompatible change for backends where blk_probe_blocksizes() succeeds and returns something other than (512, 512). Currently, it succeeds only for DASD. Is that what we want? The commit message needs to spell out the incompatible change. Perhaps this patch should be split up into one for geometry and one for block sizes.
diff --git a/include/hw/block/block.h b/include/hw/block/block.h index 0d0ce9a..856bf75 100644 --- a/include/hw/block/block.h +++ b/include/hw/block/block.h @@ -63,6 +63,7 @@ void blkconf_serial(BlockConf *conf, char **serial); void blkconf_geometry(BlockConf *conf, int *trans, unsigned cyls_max, unsigned heads_max, unsigned secs_max, Error **errp); +void blkconf_blocksizes(BlockConf *conf); /* Hard disk geometry */
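Markus's "defaults only" objection can be sketched in a few lines of C. The struct and probe types below are simplified stand-ins invented for illustration (QEMU's real BlockConf and probe API differ); the point is only that a blkconf-style helper should fill a field when the user left it at its zero default, and leave it alone otherwise:

```c
#include <assert.h>

/* Simplified stand-ins for BlockConf and the probe result. */
struct FakeBlockConf {
    unsigned physical_block_size;   /* 0 means "user did not set it" */
    unsigned logical_block_size;
};

struct FakeProbeResult {
    int rc;                         /* 0 on success, negative otherwise */
    unsigned phys;
    unsigned log;
};

/* Provide defaults only: never overwrite a user-supplied value. */
static void blkconf_blocksizes_defaults(struct FakeBlockConf *conf,
                                        struct FakeProbeResult probed)
{
    if (probed.rc != 0) {
        return;                     /* probing failed: keep whatever we have */
    }
    if (conf->physical_block_size == 0) {
        conf->physical_block_size = probed.phys;
    }
    if (conf->logical_block_size == 0) {
        conf->logical_block_size = probed.log;
    }
}
```

Compared with the hunk under review, the only difference is the two `== 0` guards, which make the helper consistent with the other blkconf_ functions.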
Re: [Qemu-devel] [PATCH v2 5/6] geometry: Call backend function to detect geometry and blocksize
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: hd_geometry_guess function autodetects the drive geometry. This patch adds a block backend call, that probes the backing device geometry. If the inner driver method is implemented and succeeds (currently only DASDs), the blkconf_geometry will pass-through the backing device geometry. [...] diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c index 6fcf74d..4972114 100644 --- a/hw/block/hd-geometry.c +++ b/hw/block/hd-geometry.c @@ -121,6 +121,14 @@ void hd_geometry_guess(BlockBackend *blk, int *ptrans) { int cylinders, heads, secs, translation; +struct ProbeGeometry geometry = blk_probe_geometry(blk); + +if (geometry.rc == 0) { +*pcyls = geometry.geo.cylinders; +*psecs = geometry.geo.sectors; +*pheads = geometry.geo.heads; Missing: translation = ... I'm not sure what value to assign. +goto done; +} if (guess_disk_lchs(blk, cylinders, heads, secs) 0) { /* no LCHS guess: use a standard physical disk geometry */ [...]
Re: [Qemu-devel] [PATCH v6] qcow2: Buffer L1 table in snapshot refcount update
On Tue, Nov 11, 2014 at 04:27:51PM +0100, Max Reitz wrote: From: Zhang Haoyu zhan...@sangfor.com Buffer the active L1 table in qcow2_update_snapshot_refcount() in order to prevent in-place conversion of the L1 table buffer in the BDRVQcowState to big endian and back, which would lead to data corruption if that buffer was accessed concurrently. This should not happen, but better safe than sorry. Signed-off-by: Zhang Haoyu zhan...@sangfor.com Signed-off-by: Max Reitz mre...@redhat.com --- v6 for snapshot: use local variable to bdrv_pwrite_sync L1 table (I changed the commit message wording to make it clearer what this patch does and why we want it). Changes in v6: - Only copy the local buffer back into s->l1_table if we are indeed accessing the local L1 table - Use qemu_vfree() instead of g_free() --- block/qcow2-refcount.c | 30 ++ 1 file changed, 14 insertions(+), 16 deletions(-) If there is a code path where the L1 table is accessed while qcow2_update_snapshot_refcount() is blocked, this patch does not fix the bug. It trades an L1 table entry corruption (due to endianness mismatch on little-endian hosts) for a race condition where a stale L1 table is accessed or L1 changes are overwritten when qcow2_update_snapshot_refcount() memcpys back to s->l1_table. Please identify the root cause and fix that. 
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 9afdb40..c0c4a50 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -877,14 +877,18 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, { BDRVQcowState *s = bs-opaque; uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2; -bool l1_allocated = false; +bool active_l1 = false; int64_t old_offset, old_l2_offset; int i, j, l1_modified = 0, nb_csectors, refcount; int ret; l2_table = NULL; -l1_table = NULL; l1_size2 = l1_size * sizeof(uint64_t); +l1_table = qemu_try_blockalign(bs-file, l1_size2); +if (l1_table == NULL) { +ret = -ENOMEM; +goto fail; +} s-cache_discards = true; @@ -892,13 +896,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, * l1_table_offset when it is the current s-l1_table_offset! Be careful * when changing this! */ if (l1_table_offset != s-l1_table_offset) { -l1_table = g_try_malloc0(align_offset(l1_size2, 512)); -if (l1_size2 l1_table == NULL) { -ret = -ENOMEM; -goto fail; -} -l1_allocated = true; - ret = bdrv_pread(bs-file, l1_table_offset, l1_table, l1_size2); if (ret 0) { goto fail; @@ -908,8 +905,8 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, be64_to_cpus(l1_table[i]); } else { assert(l1_size == s-l1_size); -l1_table = s-l1_table; -l1_allocated = false; +memcpy(l1_table, s-l1_table, l1_size2); +active_l1 = true; } for(i = 0; i l1_size; i++) { @@ -1051,13 +1048,14 @@ fail: } ret = bdrv_pwrite_sync(bs-file, l1_table_offset, l1_table, l1_size2); - -for (i = 0; i l1_size; i++) { -be64_to_cpus(l1_table[i]); +if (active_l1 ret == 0) { +for (i = 0; i l1_size; i++) { +be64_to_cpus(l1_table[i]); +} +memcpy(s-l1_table, l1_table, l1_size2); } } -if (l1_allocated) -g_free(l1_table); +qemu_vfree(l1_table); return ret; } -- 1.9.3 pgpu0ifzdtI68.pgp Description: PGP signature
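The endianness hazard Stefan describes can be made concrete with a small sketch: instead of byte-swapping the live table in place, serialize a copy for writing. This is an illustration of the general technique, not the actual qcow2 code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write each 64-bit L1 entry into `out` in big-endian byte order,
 * leaving the live (host-endian) table untouched.  A concurrent
 * reader of `live` therefore never sees half-swapped entries. */
static void encode_l1_be(const uint64_t *live, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (int b = 0; b < 8; b++) {
            out[i * 8 + b] = (uint8_t)(live[i] >> (56 - 8 * b));
        }
    }
}
```

Serializing byte by byte also sidesteps host endianness entirely, which an in-place cpu_to_be64s()/be64_to_cpus() round trip does not.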
Re: [Qemu-devel] [PATCH v2 0/6] Geometry and blocksize support for backing devices
On Wed, Nov 26, 2014 at 01:16:48PM +0300, Ekaterina Tumanova wrote: On 11/25/2014 04:01 PM, Stefan Hajnoczi wrote: On Wed, Nov 19, 2014 at 11:17:50AM +0100, Ekaterina Tumanova wrote: 3. Why does s390 need to customize hd_geometry_guess()? Since hd_geometry_guess contains semantics of x86-specific LBA translation, we have to modify it not to get in the way of z architecture If the behavior is x86-specific, it should be made x86-specific. z, ppc, sparc, arm, etc should share the non-x86-specific code path. It's weird to single out z here, this seems like the wrong way around. Is the x86-specific behavior a problem in practice? No one seemed to mind for the other targets. pgpT85luBvEdh.pgp Description: PGP signature
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 11:28 schrieb Paolo Bonzini: On 28/11/2014 09:13, Peter Lieven wrote: Am 27.11.2014 um 17:40 schrieb Paolo Bonzini: On 27/11/2014 11:27, Peter Lieven wrote: +static __thread struct CoRoutinePool { +Coroutine *ptrs[POOL_MAX_SIZE]; +unsigned int size; +unsigned int nextfree; +} CoPool; The per-thread ring unfortunately didn't work well last time it was tested. Devices that do not use ioeventfd (not just the slow ones, even decently performing ones like ahci, nvme or megasas) will create the coroutine in the VCPU thread, and destroy it in the iothread. The result is that coroutines cannot be reused. Can you check if this is still the case? I already tested at least for IDE and for ioeventfd=off. The coroutine is created in the vCPU thread and destroyed in the I/O thread. I also have a more complicated version which sets a per-thread coroutine pool only for dataplane, avoiding the lock for dedicated iothreads. For those who want to take a look: https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e Can you test it against the patch I just sent in Kevin's linux-aio coroutine thread? Was already doing it ;-) At least with test-coroutine.c master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine plieven/perf_master2: Run operation 4000 iterations 9.013785 s, 4437K operations/s, 225ns per coroutine plieven/perf_master: Run operation 4000 iterations 11.072883 s, 3612K operations/s, 276ns per coroutine However, perf_master and perf_master2 have a regression regarding nesting, as it seems. @Kevin: Could that be the reason why they perform badly in some scenarios? Regarding the bypass that is being discussed: if it is not just a benchmark thing but really necessary for some people's use cases, why not add a new aio mode like bypass and use it only then. 
If the performance is really needed, the user might trade it in for lost features like I/O throttling, filters, etc. Peter
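For reference, the per-thread pool under discussion boils down to a thread-local LIFO stack of free coroutines. The sketch below uses invented names and omits the actual coroutine allocation and teardown; it also illustrates Paolo's concern, since each thread sees only its own `co_pool` and reuse therefore requires get and put to happen in the same thread:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative names, not QEMU's actual API. */
#define POOL_MAX_SIZE 64

typedef struct Coroutine Coroutine;   /* opaque for this sketch */

static __thread struct {
    Coroutine *ptrs[POOL_MAX_SIZE];
    unsigned int size;
} co_pool;

/* Take a pooled coroutine, or NULL if the caller must allocate one. */
static Coroutine *pool_get(void)
{
    if (co_pool.size == 0) {
        return NULL;
    }
    return co_pool.ptrs[--co_pool.size];
}

/* Return a coroutine to the pool; 0 means full, caller must destroy it. */
static int pool_put(Coroutine *co)
{
    if (co_pool.size == POOL_MAX_SIZE) {
        return 0;
    }
    co_pool.ptrs[co_pool.size++] = co;
    return 1;
}
```

The LIFO order means the most recently freed coroutine (whose stack is likely still cache-warm) is handed out first, which is where the ns-per-coroutine savings in the benchmark numbers above come from.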
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to virt board
On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the virt board. The mmio register block of fw_cfg is advertized in the device tree. As base address we pick 0x0902, which conforms to the comment preceding a15memmap: it falls in the miscellaneous device I/O range 128MB..256MB, and it is aligned at 64KB. fw_cfg automatically exports a number of files to the guest; for example, bootorder (see fw_cfg_machine_reset()). Signed-off-by: Laszlo Ersek ler...@redhat.com --- hw/arm/virt.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 314e55b..070bd34 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -68,6 +68,7 @@ enum { VIRT_UART, VIRT_MMIO, VIRT_RTC, +VIRT_FW_CFG, }; typedef struct MemMapEntry { @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, [VIRT_UART] = { 0x0900, 0x1000 }, [VIRT_RTC] ={ 0x0901, 0x1000 }, +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 
0x4000 reserved for PCI */ @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) g_free(nodename); } +static void create_fw_cfg(const VirtBoardInfo *vbi) +{ +hwaddr base = vbi-memmap[VIRT_FW_CFG].base; +char *nodename; + +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); + +nodename = g_strdup_printf(/fw-cfg@% PRIx64, base); +qemu_fdt_add_subnode(vbi-fdt, nodename); +qemu_fdt_setprop_string(vbi-fdt, nodename, +compatible, fw-cfg,mmio); +qemu_fdt_setprop_sized_cells(vbi-fdt, nodename, reg, + 2, base, 2, FW_CFG_SIZE, + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. +g_free(nodename); +} + static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size) { const VirtBoardInfo *board = (const VirtBoardInfo *)binfo; @@ -604,6 +623,8 @@ static void machvirt_init(MachineState *machine) */ create_virtio_devices(vbi, pic); +create_fw_cfg(vbi); + vbi-bootinfo.ram_size = machine-ram_size; vbi-bootinfo.kernel_filename = machine-kernel_filename; vbi-bootinfo.kernel_cmdline = machine-kernel_cmdline; -- 1.8.3.1
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to virt board
On 11/28/14 11:38, Andrew Jones wrote: On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the virt board. The mmio register block of fw_cfg is advertized in the device tree. As base address we pick 0x0902, which conforms to the comment preceding a15memmap: it falls in the miscellaneous device I/O range 128MB..256MB, and it is aligned at 64KB. fw_cfg automatically exports a number of files to the guest; for example, bootorder (see fw_cfg_machine_reset()). Signed-off-by: Laszlo Ersek ler...@redhat.com --- hw/arm/virt.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 314e55b..070bd34 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -68,6 +68,7 @@ enum { VIRT_UART, VIRT_MMIO, VIRT_RTC, +VIRT_FW_CFG, }; typedef struct MemMapEntry { @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, [VIRT_UART] = { 0x0900, 0x1000 }, [VIRT_RTC] ={ 0x0901, 0x1000 }, +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 
0x4000 reserved for PCI */ @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) g_free(nodename); } +static void create_fw_cfg(const VirtBoardInfo *vbi) +{ +hwaddr base = vbi-memmap[VIRT_FW_CFG].base; +char *nodename; + +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); + +nodename = g_strdup_printf(/fw-cfg@% PRIx64, base); +qemu_fdt_add_subnode(vbi-fdt, nodename); +qemu_fdt_setprop_string(vbi-fdt, nodename, +compatible, fw-cfg,mmio); +qemu_fdt_setprop_sized_cells(vbi-fdt, nodename, reg, + 2, base, 2, FW_CFG_SIZE, + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. Why does it need to be aligned? The selector register is aligned at a 64KB boundary (for independent, strict reasons). The data register is not aligned at all, and -- AFAICS -- it need not be, because it's 1 byte wide. (In fact the ARM-specific Mmio(Read|Write)XX functions in edk2 enforce natural alignment, and the above layout passes without problems.) The full register block is 3 bytes wide. Is that a problem? Thanks Laszlo +g_free(nodename); +} + static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size) { const VirtBoardInfo *board = (const VirtBoardInfo *)binfo; @@ -604,6 +623,8 @@ static void machvirt_init(MachineState *machine) */ create_virtio_devices(vbi, pic); +create_fw_cfg(vbi); + vbi-bootinfo.ram_size = machine-ram_size; vbi-bootinfo.kernel_filename = machine-kernel_filename; vbi-bootinfo.kernel_cmdline = machine-kernel_cmdline; -- 1.8.3.1
Re: [Qemu-devel] [PATCH v2 0/6] Geometry and blocksize support for backing devices
On Wed, Nov 26, 2014 at 01:16:48PM +0300, Ekaterina Tumanova wrote: On 11/25/2014 04:01 PM, Stefan Hajnoczi wrote: On Wed, Nov 19, 2014 at 11:17:50AM +0100, Ekaterina Tumanova wrote: 4. Please use scripts/checkpatch.pl to check coding style. I did :) You are right, checkpatch.pl doesn't complain. That's weird, I thought there were a few instances that don't follow the QEMU coding style. Either I was wrong or checkpatch.pl is incomplete. Stefan pgp_ZYDPqDTK1.pgp Description: PGP signature
Re: [Qemu-devel] [PATCH v2 6/6] geometry: Target specific hook for s390x in geometry guessing
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: For target-s390x, the behaviour is chosen as follows: If no geo could be guessed from backing device, guess_disk_lchs (partition table guessing) is called. If this is not successful (e.g. image files) geometry is derived from the size of the disk (as before). We have no translation on s390, so we simply set in to NONE. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- hw/block/Makefile.objs | 6 +- hw/block/hd-geometry.c | 36 +++- 2 files changed, 36 insertions(+), 6 deletions(-) diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs index d4c3ab7..1f7da7a 100644 --- a/hw/block/Makefile.objs +++ b/hw/block/Makefile.objs @@ -1,4 +1,4 @@ -common-obj-y += block.o cdrom.o hd-geometry.o +common-obj-y += block.o cdrom.o common-obj-$(CONFIG_FDC) += fdc.o common-obj-$(CONFIG_SSI_M25P80) += m25p80.o common-obj-$(CONFIG_NAND) += nand.o @@ -13,3 +13,7 @@ obj-$(CONFIG_SH4) += tc58128.o obj-$(CONFIG_VIRTIO) += virtio-blk.o obj-$(CONFIG_VIRTIO) += dataplane/ + +# geometry is target/architecture dependent and therefore needs to be built +# for every target platform +obj-y += hd-geometry.o diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c index 4972114..b462225 100644 --- a/hw/block/hd-geometry.c +++ b/hw/block/hd-geometry.c @@ -49,11 +49,12 @@ struct partition { /* try to guess the disk logical geometry from the MSDOS partition table. 
Return 0 if OK, -1 if could not guess */ -static int guess_disk_lchs(BlockBackend *blk, - int *pcylinders, int *pheads, int *psectors) +static int guess_disk_lchs(BlockBackend *blk, uint32_t *pcylinders, + uint32_t *pheads, uint32_t *psectors) { uint8_t buf[BDRV_SECTOR_SIZE]; -int i, heads, sectors, cylinders; +int i; +uint32_t heads, sectors, cylinders; struct partition *p; uint32_t nr_sects; uint64_t nb_sectors; @@ -116,11 +117,11 @@ static void guess_chs_for_size(BlockBackend *blk, *psecs = 63; } +#ifdef TARGET_S390X void hd_geometry_guess(BlockBackend *blk, uint32_t *pcyls, uint32_t *pheads, uint32_t *psecs, int *ptrans) { -int cylinders, heads, secs, translation; struct ProbeGeometry geometry = blk_probe_geometry(blk); if (geometry.rc == 0) { @@ -129,7 +130,32 @@ void hd_geometry_guess(BlockBackend *blk, *pheads = geometry.geo.heads; goto done; } +if (guess_disk_lchs(blk, pcyls, pheads, psecs) == 0) { +goto done; +} +guess_chs_for_size(blk, pcyls, pheads, psecs); +done: +if (ptrans) { +*ptrans = BIOS_ATA_TRANSLATION_NONE; +} The goto looks awkward... What about if (guess_disk_lchs(blk, pcyls, pheads, psecs) 0) { guess_chs_for_size(blk, pcyls, pheads, psecs); } +trace_hd_geometry_guess(blk, *pcyls, *pheads, *psecs, +BIOS_ATA_TRANSLATION_NONE); +} +#else +void hd_geometry_guess(BlockBackend *blk, + uint32_t *pcyls, uint32_t *pheads, uint32_t *psecs, + int *ptrans) +{ +int cylinders, heads, secs, translation; +struct ProbeGeometry geometry = blk_probe_geometry(blk); + +if (geometry.rc == 0) { +*pcyls = geometry.geo.cylinders; +*psecs = geometry.geo.sectors; +*pheads = geometry.geo.heads; +goto done; +} if (guess_disk_lchs(blk, cylinders, heads, secs) 0) { /* no LCHS guess: use a standard physical disk geometry */ guess_chs_for_size(blk, pcyls, pheads, psecs); @@ -157,7 +183,7 @@ done: } trace_hd_geometry_guess(blk, *pcyls, *pheads, *psecs, translation); } - +#endif Please put the blank line after the #endif. 
int hd_bios_chs_auto_trans(uint32_t cyls, uint32_t heads, uint32_t secs) { return cyls <= 1024 && heads <= 16 && secs <= 63

hd_geometry_guess() behavior before this series:

    If we can't guess LCHS from the MSDOS partition table
        guess geometry on size
        translation is NONE for very small disks, else LBA
    Else if the LCHS guess has heads > 16
        BIOS LBA translation must be active
        guess geometry on size
        translation is LARGE for sufficiently small disks, else LBA
    Else
        use the LCHS guess
        translation is NONE

Afterwards, targets other than s390x:

    If the backend has a geometry (currently only DASD has)
        use the backend geometry
        translation is NONE
    Else
        behave like before this series

Incompatible change. Why do we want it?

Afterwards, target s390x:

    If the backend has a geometry (currently only DASD has)
        use the backend geometry
        translation is NONE
    Else if we can't guess LCHS from the MSDOS partition table
        guess geometry on size
    Else
        use the LCHS guess
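Markus's suggested goto-free restructuring of the s390x variant can be sketched with stubbed-out helpers. The stubs, their return values, and the translation constant's value are invented for illustration; in this scenario the backend probe fails and the partition-table guess succeeds, so the fallback is never taken:

```c
#include <assert.h>
#include <stdint.h>

#define BIOS_ATA_TRANSLATION_NONE 1   /* value assumed for the sketch */

static int stub_probe_geometry(uint32_t *c, uint32_t *h, uint32_t *s)
{
    (void)c; (void)h; (void)s;
    return -1;                        /* not a DASD: backend probe fails */
}

static int stub_guess_disk_lchs(uint32_t *c, uint32_t *h, uint32_t *s)
{
    *c = 1024; *h = 16; *s = 63;      /* found an MSDOS partition table */
    return 0;                         /* 0 = success, -1 = no guess */
}

static void stub_guess_chs_for_size(uint32_t *c, uint32_t *h, uint32_t *s)
{
    *c = 16383; *h = 16; *s = 63;     /* size-based fallback */
}

static void s390x_geometry_guess(uint32_t *pcyls, uint32_t *pheads,
                                 uint32_t *psecs, int *ptrans)
{
    /* backend geometry first; else LCHS guess; else size-based fallback */
    if (stub_probe_geometry(pcyls, pheads, psecs) != 0 &&
        stub_guess_disk_lchs(pcyls, pheads, psecs) < 0) {
        stub_guess_chs_for_size(pcyls, pheads, psecs);
    }
    if (ptrans) {
        *ptrans = BIOS_ATA_TRANSLATION_NONE;  /* no translation on s390 */
    }
}
```

The short-circuit `&&` chain expresses the three-way fallback without any goto, matching the review suggestion.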
Re: [Qemu-devel] [PATCH] geometry: fix i386 compilation
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- hw/block/hd-geometry.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c index b462225..905d2c6 100644 --- a/hw/block/hd-geometry.c +++ b/hw/block/hd-geometry.c @@ -147,7 +147,8 @@ void hd_geometry_guess(BlockBackend *blk, uint32_t *pcyls, uint32_t *pheads, uint32_t *psecs, int *ptrans) { -int cylinders, heads, secs, translation; +uint32_t cylinders, heads, secs; +int translation = BIOS_ATA_TRANSLATION_NONE; struct ProbeGeometry geometry = blk_probe_geometry(blk); if (geometry.rc == 0) { @@ -173,9 +174,6 @@ void hd_geometry_guess(BlockBackend *blk, *pcyls = cylinders; *pheads = heads; *psecs = secs; -/* disable any translation to be in sync with - the logical geometry */ -translation = BIOS_ATA_TRANSLATION_NONE; } Actually broken in PATCH 5, so it needs to be fixed there, and not in PATCH 6 (which this one fixes up). Moreover, your fixup makes the code less clear. Please add the missing translation = ... instead. done: if (ptrans) {
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to virt board
On 11/28/14 11:43, Laszlo Ersek wrote: On 11/28/14 11:38, Andrew Jones wrote: On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the virt board. The mmio register block of fw_cfg is advertized in the device tree. As base address we pick 0x0902, which conforms to the comment preceding a15memmap: it falls in the miscellaneous device I/O range 128MB..256MB, and it is aligned at 64KB. fw_cfg automatically exports a number of files to the guest; for example, bootorder (see fw_cfg_machine_reset()). Signed-off-by: Laszlo Ersek ler...@redhat.com --- hw/arm/virt.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 314e55b..070bd34 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -68,6 +68,7 @@ enum { VIRT_UART, VIRT_MMIO, VIRT_RTC, +VIRT_FW_CFG, }; typedef struct MemMapEntry { @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, [VIRT_UART] = { 0x0900, 0x1000 }, [VIRT_RTC] ={ 0x0901, 0x1000 }, +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 
0x4000 reserved for PCI */ @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) g_free(nodename); } +static void create_fw_cfg(const VirtBoardInfo *vbi) +{ +hwaddr base = vbi-memmap[VIRT_FW_CFG].base; +char *nodename; + +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); + +nodename = g_strdup_printf(/fw-cfg@% PRIx64, base); +qemu_fdt_add_subnode(vbi-fdt, nodename); +qemu_fdt_setprop_string(vbi-fdt, nodename, +compatible, fw-cfg,mmio); +qemu_fdt_setprop_sized_cells(vbi-fdt, nodename, reg, + 2, base, 2, FW_CFG_SIZE, + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. Why does it need to be aligned? The selector register is aligned at a 64KB boundary (for independent, strict reasons). The data register is not aligned at all, and -- AFAICS -- it need not be, because it's 1 byte wide. (In fact the ARM-specific Mmio(Read|Write)XX functions in edk2 enforce natural alignment, and the above layout passes without problems.) The full register block is 3 bytes wide. Is that a problem? Hm, I think I get it now. If FW_CFG_DATA_SIZE were to increase, then its alignment would have to increase as well, and whatever alignment FW_CFG_SIZE provides might not suffice. So, you'd calculate the natural alignment, but wouldn't increase it beyond 4. I do think this is a bit overkill :) but I can do it. Let's wait for more review comments first. Thanks! Laszlo
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to virt board
On Fri, Nov 28, 2014 at 11:43:32AM +0100, Laszlo Ersek wrote: On 11/28/14 11:38, Andrew Jones wrote: On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the virt board. The mmio register block of fw_cfg is advertized in the device tree. As base address we pick 0x0902, which conforms to the comment preceding a15memmap: it falls in the miscellaneous device I/O range 128MB..256MB, and it is aligned at 64KB. fw_cfg automatically exports a number of files to the guest; for example, bootorder (see fw_cfg_machine_reset()). Signed-off-by: Laszlo Ersek ler...@redhat.com --- hw/arm/virt.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 314e55b..070bd34 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -68,6 +68,7 @@ enum { VIRT_UART, VIRT_MMIO, VIRT_RTC, +VIRT_FW_CFG, }; typedef struct MemMapEntry { @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, [VIRT_UART] = { 0x0900, 0x1000 }, [VIRT_RTC] ={ 0x0901, 0x1000 }, +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 
0x4000 reserved for PCI */ @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) g_free(nodename); } +static void create_fw_cfg(const VirtBoardInfo *vbi) +{ +hwaddr base = vbi-memmap[VIRT_FW_CFG].base; +char *nodename; + +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); + +nodename = g_strdup_printf(/fw-cfg@% PRIx64, base); +qemu_fdt_add_subnode(vbi-fdt, nodename); +qemu_fdt_setprop_string(vbi-fdt, nodename, +compatible, fw-cfg,mmio); +qemu_fdt_setprop_sized_cells(vbi-fdt, nodename, reg, + 2, base, 2, FW_CFG_SIZE, + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. Why does it need to be aligned? Natural alignment is more efficient. The selector register is aligned at a 64KB boundary (for independent, strict reasons). The data register is not aligned at all, and -- AFAICS -- it need not be, because it's 1 byte wide. (In fact the ARM-specific Mmio(Read|Write)XX functions in edk2 enforce natural alignment, and the above layout passes without problems.) Right. As FW_CFG_DATA_SIZE is currently 1 byte, it's already naturally aligned, and the macro definition I have above actually doesn't change anything (which is why I gave the overkill alert). However if FW_CFG_DATA_SIZE was to change, then the natural alignment could be lost. The full register block is 3 bytes wide. Is that a problem? No, it's fine as is, and the FW_CFG_SIZE_ALIGNED would leave it 3 bytes wide too. FW_CFG_SIZE_ALIGNED only adds future-proofing. 
However it really is overkill as the chance that FW_CFG_DATA_SIZE will change is nil. Thanks Laszlo +g_free(nodename); +} + static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size) { const VirtBoardInfo *board = (const VirtBoardInfo *)binfo; @@ -604,6 +623,8 @@ static void machvirt_init(MachineState *machine) */ create_virtio_devices(vbi, pic); +create_fw_cfg(vbi); + vbi-bootinfo.ram_size = machine-ram_size; vbi-bootinfo.kernel_filename = machine-kernel_filename; vbi-bootinfo.kernel_cmdline = machine-kernel_cmdline; -- 1.8.3.1
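Andrew's proposed macro can be checked in isolation. The register sizes below are assumptions consistent with the thread (a 2-byte selector block and a 1-byte data register, 3 bytes total); with those values the aligned size degenerates to FW_CFG_SIZE itself, i.e. the macro only matters if FW_CFG_DATA_SIZE ever grows:

```c
#include <assert.h>

/* Assumed register sizes, matching the 3-byte block described above. */
#define FW_CFG_SIZE       2     /* selector register block */
#define FW_CFG_DATA_SIZE  1     /* data register */

/* Round n up to a multiple of m (m > 0). */
#define QEMU_ALIGN_UP(n, m)  (((n) + (m) - 1) / (m) * (m))
#define MIN(a, b)            ((a) < (b) ? (a) : (b))

/* Align the selector block to the data register size, capped at 4,
 * so the data register stays naturally aligned if its size changes. */
#define FW_CFG_SIZE_ALIGNED \
    MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \
        QEMU_ALIGN_UP(FW_CFG_SIZE, 4))
```

With the current 1-byte data register, FW_CFG_SIZE_ALIGNED evaluates to 2, so the memmap entry and the fw-cfg-data base address are unchanged; the future-proofing is free.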
Re: [Qemu-devel] [PATCH v2 5/6] geometry: Call backend function to detect geometry and blocksize
On 11/28/2014 01:10 PM, Markus Armbruster wrote: Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: hd_geometry_guess function autodetects the drive geometry. This patch adds a block backend call, that probes the backing device geometry. If the inner driver method is implemented and succeeds (currently only DASDs), the blkconf_geometry will pass-through the backing device geometry. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- hw/block/block.c | 11 +++ hw/block/hd-geometry.c | 9 + hw/block/virtio-blk.c| 1 + include/hw/block/block.h | 1 + 4 files changed, 22 insertions(+) diff --git a/hw/block/block.c b/hw/block/block.c index a625773..f1d29bc 100644 --- a/hw/block/block.c +++ b/hw/block/block.c @@ -25,6 +25,17 @@ void blkconf_serial(BlockConf *conf, char **serial) } } +void blkconf_blocksizes(BlockConf *conf) +{ +BlockBackend *blk = conf-blk; +struct ProbeBlockSize blocksize = blk_probe_blocksizes(blk); + +if (blocksize.rc == 0) { +conf-physical_block_size = blocksize.size.phys; +conf-logical_block_size = blocksize.size.log; +} +} + Uh, doesn't this overwrite the user's geometry? The existing blkconf_ functions do nothing when the user supplied the configuration. In other words, they only provide defaults. Why should this one be different? 
this will be fixed in the next version void blkconf_geometry(BlockConf *conf, int *ptrans, unsigned cyls_max, unsigned heads_max, unsigned secs_max, Error **errp) diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c index 6fcf74d..4972114 100644 --- a/hw/block/hd-geometry.c +++ b/hw/block/hd-geometry.c @@ -121,6 +121,14 @@ void hd_geometry_guess(BlockBackend *blk, int *ptrans) { int cylinders, heads, secs, translation; +struct ProbeGeometry geometry = blk_probe_geometry(blk); + +if (geometry.rc == 0) { +*pcyls = geometry.geo.cylinders; +*psecs = geometry.geo.sectors; +*pheads = geometry.geo.heads; +goto done; +} hd_geometry_guess() is called by blkconf_geometry() to conjure up a default geometry when the user didn't pick one. Fairly elaborate guesswork. blkconf_geometry() runs for IDE, SCSI and virtio-blk only. Your patch makes it pick the backend's geometry as reported by blk_probe_geometry() before anything else. This is an incompatible change for backends where blk_probe_geometry() succeeds. Currently, it succeeds only for DASD, and there you *want* the incompatible change. My point is: if we rely on blk_probe_geometry() failing except for DASD, then we should call it something like blk_dasd_geometry() to make that obvious. This function itself is not DASD specific. It calls the inner method for host_device driver, which currently only succeeds for DASDs. But in future someone can define methods for other drivers or even modify the host_device driver. The commit message needs to spell out the incompatible change. 
if (guess_disk_lchs(blk, cylinders, heads, secs) 0) { /* no LCHS guess: use a standard physical disk geometry */ @@ -143,6 +151,7 @@ void hd_geometry_guess(BlockBackend *blk, the logical geometry */ translation = BIOS_ATA_TRANSLATION_NONE; } +done: if (ptrans) { *ptrans = translation; } diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index b19b102..6f01565 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -745,6 +745,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp) error_propagate(errp, err); return; } +blkconf_blocksizes(conf-conf); virtio_init(vdev, virtio-blk, VIRTIO_ID_BLOCK, sizeof(struct virtio_blk_config)); Why only virtio-blk, and not the other device models sporting configurable block sizes (nvme, IDE, SCSI, usb-storage)? This is an incompatible change for backends where blk_probe_blocksizes() succeeds and returns something other than (512, 512). Currently, it succeeds only for DASD. Is that what we want? The commit message needs to spell out the incompatible change. Perhaps this patch should be split up into one for geometry and one for block sizes. diff --git a/include/hw/block/block.h b/include/hw/block/block.h index 0d0ce9a..856bf75 100644 --- a/include/hw/block/block.h +++ b/include/hw/block/block.h @@ -63,6 +63,7 @@ void blkconf_serial(BlockConf *conf, char **serial); void blkconf_geometry(BlockConf *conf, int *trans, unsigned cyls_max, unsigned heads_max, unsigned secs_max, Error **errp); +void blkconf_blocksizes(BlockConf *conf); /* Hard disk geometry */
Re: [Qemu-devel] [PATCH v2 0/6] Geometry and blocksize support for backing devices
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: On 11/25/2014 04:01 PM, Stefan Hajnoczi wrote: [...] 3. Why does s390 need to customize hd_geometry_guess()? Since hd_geometry_guess contains semantics of x86-specific LBA translation, we have to modify it not to get in the way of the z architecture I already explained this in my review, but it's buried among other stuff, so let me repeat it here: The only caller that passes a non-null ptrans to hd_geometry_guess() is ide_dev_initfn(), and the only user of the value that gets stored there is pc_cmos_init_late(). We store a highly PC-specific value there. That's okay, it's used only by PC machines. The whole concept of BIOS translation makes no sense for other machines. Heck, it barely makes sense even on PCs.
Re: [Qemu-devel] [PATCH v2 2/6] geometry: Detect blocksize via ioctls in separate static functions
Suggest function comment /** * Return logical block size, or zero if we can't figure it out */ { -BDRVRawState *s = bs->opaque; -char *buf; -unsigned int sector_size; - -/* For /dev/sg devices the alignment is not really used. - With buffered I/O, we don't have any restrictions. */ -if (bs->sg || !s->needs_alignment) { -bs->request_alignment = 1; -s->buf_align = 1; -return; -} +unsigned int sector_size = 0; Pointless initialization. If I do not initialize the sector_size, and the ioctl fails, I will return garbage as a blocksize to the caller. /* Try a few ioctls to get the right size */ -bs->request_alignment = 0; -s->buf_align = 0; - #ifdef BLKSSZGET if (ioctl(fd, BLKSSZGET, &sector_size) >= 0) { -bs->request_alignment = sector_size; +return sector_size; } #endif #ifdef DKIOCGETBLOCKSIZE if (ioctl(fd, DKIOCGETBLOCKSIZE, &sector_size) >= 0) { -bs->request_alignment = sector_size; +return sector_size; } #endif #ifdef DIOCGSECTORSIZE if (ioctl(fd, DIOCGSECTORSIZE, &sector_size) >= 0) { -bs->request_alignment = sector_size; +return sector_size; } #endif #ifdef CONFIG_XFS if (s->is_xfs) { struct dioattr da; if (xfsctl(NULL, fd, XFS_IOC_DIOINFO, &da) >= 0) { -bs->request_alignment = da.d_miniosz; +sector_size = da.d_miniosz; /* The kernel returns wrong information for d_mem */ /* s->buf_align = da.d_mem; */ Since you keep the enabled assignments to s->buf_align out of this function, you should keep out this disabled one, too. +return sector_size; } } #endif +return 0; +} + +static unsigned int probe_physical_blocksize(BlockDriverState *bs, int fd) Parameter bs is unused, let's drop it. +{ +unsigned int blk_size = 0; Pointless initialization. Same here. +#ifdef BLKPBSZGET +if (ioctl(fd, BLKPBSZGET, &blk_size) >= 0) { +return blk_size; +} +#endif + +return 0; +} + +static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) +{ +BDRVRawState *s = bs->opaque; +char *buf; + +/* For /dev/sg devices the alignment is not really used. + With buffered I/O, we don't have any restrictions. 
*/ +if (bs->sg || !s->needs_alignment) { +bs->request_alignment = 1; +s->buf_align = 1; +return; +} + +s->buf_align = 0; +/* Let's try to use the logical blocksize for the alignment. */ +bs->request_alignment = probe_logical_blocksize(bs, fd); + /* If we could not get the sizes so far, we can only guess them */ if (!s->buf_align) { size_t align;
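The pattern under review above — try several size probes in order, take the first one that succeeds, and return zero so the caller can fall back to guessing — can be sketched in plain C. This is an illustrative sketch with stub probes standing in for the BLKSSZGET/DKIOCGETBLOCKSIZE/DIOCGSECTORSIZE ioctls, not the actual raw-posix.c code:

```c
#include <assert.h>
#include <stddef.h>

/* Each probe returns a positive block size on success, 0 on failure. */
typedef unsigned int (*blocksize_probe_fn)(int fd);

/* Walk the probe chain; the first successful probe wins. */
static unsigned int probe_blocksize(const blocksize_probe_fn *probes,
                                    size_t n, int fd)
{
    for (size_t i = 0; i < n; i++) {
        unsigned int size = probes[i](fd);
        if (size > 0) {
            return size;
        }
    }
    return 0;   /* nothing worked: caller must guess */
}

/* Stub probes for illustration (the real ones issue ioctls on fd): */
static unsigned int probe_fails(int fd) { (void)fd; return 0; }
static unsigned int probe_512(int fd)   { (void)fd; return 512; }
```

Returning 0 instead of leaving the output variable untouched is exactly why the "pointless initialization" comment matters: the caller can test the return value directly instead of trusting an out-parameter that an ioctl may never have written.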
Re: [Qemu-devel] [PATCH v3 07/22] qcow2: Helper function for refcount modification
On Thu, Nov 20, 2014 at 06:06:23PM +0100, Max Reitz wrote: @@ -1711,7 +1746,7 @@ static int calculate_refcounts(BlockDriverState *bs, BdrvCheckResult *res, static void compare_refcounts(BlockDriverState *bs, BdrvCheckResult *res, BdrvCheckMode fix, bool *rebuild, int64_t *highest_cluster, - uint16_t *refcount_table, int64_t nb_clusters) + void *refcount_table, int64_t nb_clusters) { BDRVQcowState *s = bs->opaque; int64_t i, refcount1, refcount2; An -ERANGE qcow2_get_refcount() return value is treated as a check error here and we won't be able to rebuild the refcount blocks: refcount1 = qcow2_get_refcount(bs, i); if (refcount1 < 0) { fprintf(stderr, "Can't get refcount for cluster %" PRId64 ": %s\n", i, strerror(-refcount1)); res->check_errors++; continue; } We should allow rebuilding refcount blocks, -ERANGE is just another corrupted refcount.
Re: [Qemu-devel] [PATCH v3 06/22] qcow2: Helper for refcount array reallocation
On Thu, Nov 27, 2014 at 04:11:12PM +0100, Max Reitz wrote: On 2014-11-27 at 16:09, Stefan Hajnoczi wrote: On Thu, Nov 20, 2014 at 06:06:22PM +0100, Max Reitz wrote: +/** + * Reallocates *array so that it can hold new_size entries. *size must contain + * the current number of entries in *array. If the reallocation fails, *array + * and *size will not be modified and -errno will be returned. If the + * reallocation is successful, *array will be set to the new buffer and *size + * will be set to new_size. The size of the reallocated refcount array buffer + * will be aligned to a cluster boundary, and the newly allocated area will be + * zeroed. + */ +static int realloc_refcount_array(BDRVQcowState *s, uint16_t **array, + int64_t *size, int64_t new_size) +{ +/* Round to clusters so the array can be directly written to disk */ +size_t old_byte_size = ROUND_UP(refcount_array_byte_size(s, *size), +s->cluster_size); +size_t new_byte_size = ROUND_UP(refcount_array_byte_size(s, new_size), +s->cluster_size); +uint16_t *new_ptr; + +if (new_byte_size <= old_byte_size) { +*size = new_size; +return 0; +} Why not realloc the array to the new smaller size? ... Because such a call will actually never happen. I could replace this if () by assert(new_byte_size >= old_byte_size); if (new_byte_size == old_byte_size), but as I said before, I'm not a friend of assertions when the code can deal perfectly well with the unsupported case. It is simpler to put an if statement around the memset. Then the function actually frees unused memory and readers don't wonder why you decided not to shrink the array. Less code and slightly better behavior. Stefan
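The scheme Stefan suggests — let both grow and shrink go through realloc, round to a cluster boundary, and guard only the zeroing memset with an if — can be sketched in a simplified form. This is an illustrative sketch (entries are single bytes and the cluster size is a made-up constant; the real code uses s->cluster_size and refcount_array_byte_size()):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CLUSTER 8   /* illustrative cluster size in bytes */
#define ROUND_UP_(n, m) (((n) + (m) - 1) / (m) * (m))

/* Reallocate *array (holding *size one-byte entries) so that it can hold
 * new_size entries, rounding the buffer to a cluster boundary and zeroing
 * any newly allocated bytes. Shrinking goes through the same path. */
static int realloc_zeroed(uint8_t **array, int64_t *size, int64_t new_size)
{
    size_t old_byte_size = ROUND_UP_((size_t)*size, CLUSTER);
    size_t new_byte_size = ROUND_UP_((size_t)new_size, CLUSTER);
    uint8_t *new_ptr;

    if (new_byte_size == old_byte_size) {
        *size = new_size;       /* buffer already the right size */
        return 0;
    }

    new_ptr = realloc(*array, new_byte_size);
    if (!new_ptr) {
        return -ENOMEM;         /* *array and *size untouched */
    }
    if (new_byte_size > old_byte_size) {
        /* the if around the memset: only zero when we actually grew */
        memset(new_ptr + old_byte_size, 0, new_byte_size - old_byte_size);
    }
    *array = new_ptr;
    *size = new_size;
    return 0;
}
```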
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). Paolo
Re: [Qemu-devel] [PATCH v2 0/6] Geometry and blocksize support for backing devices
On 11/28/2014 01:35 PM, Stefan Hajnoczi wrote: On Wed, Nov 26, 2014 at 01:16:48PM +0300, Ekaterina Tumanova wrote: On 11/25/2014 04:01 PM, Stefan Hajnoczi wrote: On Wed, Nov 19, 2014 at 11:17:50AM +0100, Ekaterina Tumanova wrote: 3. Why does s390 need to customize hd_geometry_guess()? Since hd_geometry_guess contains the semantics of x86-specific LBA translation, we have to modify it not to get in the way of the z architecture If the behavior is x86-specific, it should be made x86-specific. z, ppc, sparc, arm, etc should share the non-x86-specific code path. It's weird to single out z here, this seems like the wrong way around. Is the x86-specific behavior a problem in practice? No one seemed to mind for the other targets. On the s390 arch this adds support for FCP attached SCSI disks that do not yet have a partition table. Without this patch, fdisk -l on the host would return different results than fdisk -l in the guest. If you guys don't like the way the patch goes, I can exclude it from the patch series and we can discuss it later. But I thought that since it's related to the hard drive geometry, and since we change hd_geometry_guess in this patchset anyway, why not get rid of this problem as well. Kate.
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to virt board
On Fri, Nov 28, 2014 at 11:49:48AM +0100, Laszlo Ersek wrote: On 11/28/14 11:43, Laszlo Ersek wrote: On 11/28/14 11:38, Andrew Jones wrote: On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the virt board. The mmio register block of fw_cfg is advertised in the device tree. As base address we pick 0x0902, which conforms to the comment preceding a15memmap: it falls in the miscellaneous device I/O range 128MB..256MB, and it is aligned at 64KB. fw_cfg automatically exports a number of files to the guest; for example, bootorder (see fw_cfg_machine_reset()). Signed-off-by: Laszlo Ersek ler...@redhat.com --- hw/arm/virt.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 314e55b..070bd34 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -68,6 +68,7 @@ enum { VIRT_UART, VIRT_MMIO, VIRT_RTC, +VIRT_FW_CFG, }; typedef struct MemMapEntry { @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_CPU] = { 0x0801, 0x0001 }, [VIRT_UART] = { 0x0900, 0x1000 }, [VIRT_RTC] = { 0x0901, 0x1000 }, +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 
0x4000 reserved for PCI */ @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) g_free(nodename); } +static void create_fw_cfg(const VirtBoardInfo *vbi) +{ +hwaddr base = vbi->memmap[VIRT_FW_CFG].base; +char *nodename; + +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); + +nodename = g_strdup_printf("/fw-cfg@%" PRIx64, base); +qemu_fdt_add_subnode(vbi->fdt, nodename); +qemu_fdt_setprop_string(vbi->fdt, nodename, +"compatible", "fw-cfg,mmio"); +qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", + 2, base, 2, FW_CFG_SIZE, + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. Why does it need to be aligned? The selector register is aligned at a 64KB boundary (for independent, strict reasons). The data register is not aligned at all, and -- AFAICS -- it need not be, because it's 1 byte wide. (In fact the ARM-specific Mmio(Read|Write)XX functions in edk2 enforce natural alignment, and the above layout passes without problems.) The full register block is 3 bytes wide. Is that a problem? Hm, I think I get it now. If FW_CFG_DATA_SIZE were to increase, then its alignment would have to increase as well, and whatever alignment FW_CFG_SIZE provides might not suffice. So, you'd calculate the natural alignment, but wouldn't increase it beyond 4. I do think this is a bit overkill :) but I can do it. Let's wait for more review comments first. Actually, on second thought, completely scratch my overkill suggestion. It's actually wrong to be concerned with it anyway. 
FW_CFG_DATA_SIZE doesn't dictate how we should access the data port, the fw-cfg protocol does, and that says we should access exactly one byte. So, regardless of the fw-cfg-data size, we'll never have to worry about the data port's alignment, as we'll never access more than one byte from it. Thanks! Laszlo
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Paolo
Re: [Qemu-devel] [PATCH v6 2/3] linux-aio: handling -EAGAIN for !s-io_q.plugged case
Am 28.11.2014 um 03:27 hat Ming Lei geschrieben: Hi Kevin, On Wed, Nov 26, 2014 at 7:27 PM, Kevin Wolf kw...@redhat.com wrote: Am 25.11.2014 um 08:23 hat Ming Lei geschrieben: Previously -EAGAIN is simply ignored for the !s->io_q.plugged case, and sometimes it is easy to cause -EIO to the VM, such as with an NVMe device. This patch handles -EAGAIN via the io queue for the !s->io_q.plugged case, and it will be retried in the following aio completion cb. Reviewed-by: Paolo Bonzini pbonz...@redhat.com Suggested-by: Paolo Bonzini pbonz...@redhat.com Signed-off-by: Ming Lei ming@canonical.com --- block/linux-aio.c | 24 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/block/linux-aio.c b/block/linux-aio.c index 11ac828..ac25722 100644 --- a/block/linux-aio.c +++ b/block/linux-aio.c @@ -282,8 +282,13 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb) s->io_q.iocbs[idx++] = iocb; s->io_q.idx = idx; -/* submit immediately if queue depth is above 2/3 */ -if (idx > s->io_q.size * 2 / 3) { +/* + * This is reached in two cases: queue not plugged but io_submit + * returned -EAGAIN, or queue plugged. In the latter case, start + * submitting some I/O if the queue is getting too full. In the + * former case, instead, wait until an I/O operation is completed. + */ Are we guaranteed that an I/O operation is in flight when we get -EAGAIN? The manpage of io_submit isn't very clear on this; insufficient resources could mean any reason. That is a good question. From fs/aio.c in the Linux kernel, io_submit_one() returns -EAGAIN when either there aren't enough requests reserved in io_setup(), or kmem_cache_alloc(GFP_KERNEL) returns NULL. In the former case, it means an I/O operation is in flight. In the latter case, it should be very difficult to trigger since a GFP_KERNEL allocation will wait for memory reclaim. So most of the time, it is reasonable to resubmit on completion for -EAGAIN. 
When there is no pending I/O, we can still handle the very unlikely case either by returning failure to the caller or by trying to submit in a BH. Does that make sense to you? I think returning an error is fine in this case. Kevin
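The strategy the thread converges on — on -EAGAIN, park the request and resubmit it from the completion callback if something is in flight, otherwise fail it to the caller — can be sketched with a stubbed submission path. All names here are illustrative; the real logic lives in block/linux-aio.c around io_submit():

```c
#include <assert.h>
#include <stdbool.h>

#define MY_EAGAIN (-11)   /* stand-in for -EAGAIN */

typedef struct {
    int in_flight;   /* requests the kernel has accepted */
    int queued;      /* requests parked after -EAGAIN */
} SubmitState;

/* Submit one request; kernel_full simulates io_submit() hitting -EAGAIN.
 * Returns 0 on success (or successful queueing), MY_EAGAIN on hard failure. */
static int submit_one(SubmitState *s, bool kernel_full)
{
    if (kernel_full) {
        if (s->in_flight > 0) {
            s->queued++;         /* a completion will retry this later */
            return 0;
        }
        return MY_EAGAIN;        /* nothing in flight: report the failure */
    }
    s->in_flight++;
    return 0;
}

/* Completion callback: one request finished, so the kernel has a free slot;
 * resubmit one parked request (assumed to succeed now). */
static void complete_one(SubmitState *s)
{
    s->in_flight--;
    if (s->queued > 0) {
        s->queued--;
        s->in_flight++;
    }
}
```

The key invariant is that a parked request is only ever parked while at least one request is in flight, so there is always a future completion event to drive the retry.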
Re: [Qemu-devel] [PATCH v3 07/22] qcow2: Helper function for refcount modification
On Thu, Nov 27, 2014 at 04:32:52PM +0100, Max Reitz wrote: On 2014-11-27 at 16:21, Stefan Hajnoczi wrote: On Thu, Nov 20, 2014 at 06:06:23PM +0100, Max Reitz wrote: @@ -583,7 +608,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, /* we can update the count and save it */ block_index = cluster_index & (s->refcount_block_size - 1); -refcount = be16_to_cpu(refcount_block[block_index]); +refcount = s->get_refcount(refcount_block, block_index); +if (refcount < 0) { +ret = -ERANGE; +goto fail; +} Here again. @@ -1206,11 +1236,14 @@ static int inc_refcounts(BlockDriverState *bs, } } -if (++(*refcount_table)[k] == 0) { +refcount = s->get_refcount(*refcount_table, k); Here the refcount < 0 check is missing. That's why it would be simpler to eliminate the refcount < 0 case entirely. It's not missing. This is part of the refcount check, as are all the following (in other places). The refcount check builds a refcount array in memory all by itself, so it knows for sure there are no overflows. The line which you omitted directly after this clips the refcount values against s->refcount_max, which is INT64_MAX for 64-bit refcounts. Therefore, no overflows are possible in the refcount checking functions, because the refcount checking functions don't introduce the overflows themselves. The other overflow checks are only in place to reject faulty images provided from the outside. I see, that makes sense. @@ -1726,7 +1761,7 @@ static void compare_refcounts(BlockDriverState *bs, BdrvCheckResult *res, continue; } -refcount2 = refcount_table[i]; +refcount2 = s->get_refcount(refcount_table, i); Missing here too and in other places. +typedef uint64_t Qcow2GetRefcountFunc(const void *refcount_array, + uint64_t index); Should the return type be int64_t? No. If it was, we'd have to check for errors every time we call it, but it cannot return errors (well, if we let it return int64_t, it might return -ERANGE, but that's exactly what I don't want). 
Therefore, let it return uint64_t so we know this function cannot fail. +typedef void Qcow2SetRefcountFunc(void *refcount_array, + uint64_t index, uint64_t value); Should value's type be int64_t? Just because the type is unsigned doesn't make (uint64_t)-1ULL a valid value. Actually, it does. It's just that the implementation provided here does not support it. Since there is an assertion against that case in the 64-bit implementation for this function, I don't have a problem with using int64_t here, though. But that would break symmetry with Qcow2GetRefcountFunc(), and I do have a reason there not to return a signed value, as explained above. Okay, so uint64_t is really a different type - it's the qcow2 spec 64-bit refcount. int64_t is the QEMU implementation's internal representation. This seems error-prone to me. Maybe comments would have helped, but it's best to eliminate the problem entirely. Why not bite the bullet and fix up qcow2? There are two external function prototypes that need to be changed: int64_t qcow2_get_refcount(BlockDriverState *bs, int64_t cluster_index); int64_t qcow2_update_cluster_refcount(BlockDriverState *bs, int64_t cluster_index, int addend, enum qcow2_discard_type type); /* Returns 0 on success, -errno on failure */ int qcow2_get_refcount(BlockDriverState *bs, int64_t cluster_index, uint64_t *refcount); int qcow2_update_cluster_refcount(BlockDriverState *bs, int64_t cluster_index, int addend, enum qcow2_discard_type type, uint64_t *refcount); Have I missed a fundamental reason why the implementation's internal refcount type cannot be changed from int64_t to uint64_t? It would keep the code complexity down and reduce errors.
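The width-specific accessor pairs under discussion can be sketched for the two easiest cases, 16-bit and 64-bit entries. This is only an illustration of the function-pointer interface (the actual patch series also covers sub-byte refcount widths and the big-endian on-disk layout, both omitted here):

```c
#include <assert.h>
#include <stdint.h>

/* The interface from the patch: a getter that cannot fail, and a setter. */
typedef uint64_t Qcow2GetRefcountFunc(const void *refcount_array,
                                      uint64_t index);
typedef void Qcow2SetRefcountFunc(void *refcount_array,
                                  uint64_t index, uint64_t value);

static uint64_t get_refcount16(const void *array, uint64_t index)
{
    return ((const uint16_t *)array)[index];
}

static void set_refcount16(void *array, uint64_t index, uint64_t value)
{
    assert(value <= UINT16_MAX);   /* overflow is the caller's bug */
    ((uint16_t *)array)[index] = (uint16_t)value;
}

static uint64_t get_refcount64(const void *array, uint64_t index)
{
    return ((const uint64_t *)array)[index];
}

static void set_refcount64(void *array, uint64_t index, uint64_t value)
{
    ((uint64_t *)array)[index] = value;
}
```

The 16-bit setter's assertion mirrors the point made above: the unsigned interface can represent any 64-bit value, but a narrower implementation must reject values it cannot store rather than silently truncate them.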
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... You mean an algorithm similar to perf_master2 or to the current implementation? The ring buffer seems to have a drawback when it comes to excessive coroutine nesting. My idea was that you do not throw away hot coroutines when the pool is full. However, I do not know if this is really a problem since the pool is only full when there is not much I/O. Or is this assumption too simple? Peter
Re: [Qemu-devel] [RFC PATCH v5 22/31] timer: introduce new QEMU_CLOCK_VIRTUAL_RT clock
From: Paolo Bonzini [mailto:pbonz...@redhat.com] On 26/11/2014 11:40, Pavel Dovgalyuk wrote: This patch introduces new QEMU_CLOCK_VIRTUAL_RT clock, which should be used for icount warping. Separate timer is needed for replaying the execution, because warping callbacks should be deterministic. We cannot make realtime clock deterministic because it is used for screen updates and other simulator-specific actions. That is why we added new clock which is recorded and replayed when needed. Signed-off-by: Pavel Dovgalyuk pavel.dovga...@ispras.ru --- include/qemu/timer.h |7 +++ qemu-timer.c |2 ++ replay/replay.h |4 +++- 3 files changed, 12 insertions(+), 1 deletions(-) diff --git a/include/qemu/timer.h b/include/qemu/timer.h index 7b43331..df27157 100644 --- a/include/qemu/timer.h +++ b/include/qemu/timer.h @@ -37,12 +37,19 @@ * is suspended, and it will reflect system time changes the host may * undergo (e.g. due to NTP). The host clock has the same precision as * the virtual clock. + * + * @QEMU_CLOCK_VIRTUAL_RT: realtime clock used for icount warp + * + * This clock runs as a realtime clock, but is used for icount warp + * and thus should be traced with record/replay to make warp function + * behave deterministically. */ I think it should also stop/restart across stop and cont commands, similar to QEMU_CLOCK_VIRTUAL. This is as simple as changing get_clock() to cpu_get_clock(). Not so easy :) cpu_get_clock() checks vm_clock_seqlock which is locked in icount_warp_rt(). And after locking it requests the value of QEMU_CLOCK_VIRTUAL_RT: seqlock_write_lock(timers_state.vm_clock_seqlock); if (runstate_is_running()) { int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL_RT); Pavel Dovgalyuk
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Why do you set pool_size = 0 in the create path? When I do the following: diff --git a/qemu-coroutine.c b/qemu-coroutine.c index 6bee354..c79ee78 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) * and the actual size of alloc_pool. But it is just a heuristic, * it does not need to be perfect. */ -pool_size = 0; +atomic_dec(&pool_size); QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); co = QSLIST_FIRST(&alloc_pool); I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine
Re: [Qemu-devel] [PATCH v2 0/3] block: Don't probe for unknown backing file format
On Tue, Nov 25, 2014 at 06:12:39PM +0100, Kevin Wolf wrote: v2: - Added two patches to make the test case actually work properly and not just by accident on my laptop. [Max] - Fixed comment in test case [Max] Kevin Wolf (3): qcow2: Fix header extension size check qcow2.py: Add required padding for header extensions block: Don't probe for unknown backing file format block.c | 7 +++--- block/qcow2.c | 2 +- tests/qemu-iotests/080 | 2 ++ tests/qemu-iotests/080.out | 2 ++ tests/qemu-iotests/114 | 61 + tests/qemu-iotests/114.out | 13 ++ tests/qemu-iotests/group| 1 + tests/qemu-iotests/qcow2.py | 4 +++ 8 files changed, 87 insertions(+), 5 deletions(-) create mode 100755 tests/qemu-iotests/114 create mode 100644 tests/qemu-iotests/114.out -- 1.8.3.1 Thanks, applied to my block-next tree: https://github.com/stefanha/qemu/commits/block-next Stefan
Re: [Qemu-devel] [PATCH] block: do not use get_clock()
On Wed, Nov 26, 2014 at 03:01:02PM +0100, Paolo Bonzini wrote: Use the external qemu-timer API instead. Cc: kw...@redhat.com Cc: stefa...@redhat.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- block/accounting.c | 6 -- block/raw-posix.c | 8 2 files changed, 8 insertions(+), 6 deletions(-) Thanks, applied to my block-next tree: https://github.com/stefanha/qemu/commits/block-next Stefan
Re: [Qemu-devel] [PATCH RFC for-2.2] virtio-blk: force 1st s/g to match header
On Fri, Nov 28, 2014 at 7:05 AM, Jason Wang jasow...@redhat.com wrote: On Fri, Nov 28, 2014 at 9:16 AM, Fam Zheng f...@redhat.com wrote: On Thu, 11/27 23:13, Michael S. Tsirkin wrote: On Thu, Nov 27, 2014 at 07:21:35PM +, Stefan Hajnoczi wrote: On Thu, Nov 27, 2014 at 4:33 PM, Michael S. Tsirkin m...@redhat.com wrote: We leak cpu mappings when 1st s/g is not exactly the header. As we don't set ANY_LAYOUT, we can at this point simply assert the correct length. This will have to be fixed once ANY_LAYOUT is set. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Untested: posting for early feedback. Reviewed-by: Stefan Hajnoczi stefa...@redhat.com Stefan, you are going to merge this for 2.2? Peter, if Stefan is merging this one, we can take Jason's patch for -net and that should be enough for 2.2. My test bot saw a failure of make check: qemu-system-x86_64: virtio-blk request outhdr too long Broken pipe GTester: last random seed: R02S44c5a09d74c0603f0091ed20d1395121 qemu-system-x86_64: virtio-blk request outhdr too long Broken pipe GTester: last random seed: R02Sfdf270c8e1ed75c1f2047e3f0fb89b88 qemu-system-x86_64: virtio-blk request outhdr too long Broken pipe GTester: last random seed: R02Sd292c540929b963ed9d7d155f3db5452 qemu-system-x86_64: virtio-blk request outhdr too long Broken pipe GTester: last random seed: R02Sfc775a34a844c39f376aa478eda7573f [vmxnet3][WR][vmxnet3_peer_has_vnet_hdr]: Peer has no virtio extension. Task offloads will be emulated. make: *** [check-qtest-x86_64] Error 1 Fam This is probably because of the test itself. But anyway all kinds of guests (especially Windows drivers) need to be tested. Right, the test case explicitly tests different descriptor layouts, even though virtio-blk-pci does not set the ANY_LAYOUT feature bit. Either the test case needs to check ANY_LAYOUT before using the 2-descriptor layout or it needs to expect QEMU to refuse (in this case exit(1), which is not very graceful). 
The quick fix is to skip the 2-descriptor layout tests and re-enable them once virtio-blk actually supports ANY_LAYOUT. Any objections? Stefan
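The gating logic proposed above — only exercise a merged-header descriptor layout when the device advertises ANY_LAYOUT — comes down to a feature-bit test. A minimal sketch (the helper name is illustrative; VIRTIO_F_ANY_LAYOUT is feature bit 27 per the virtio spec):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_ANY_LAYOUT 27   /* legacy virtio feature bit */

/* A test case would call this on the negotiated feature word before
 * building a layout where the request header shares a descriptor with
 * the data buffer. */
static bool can_use_merged_layout(uint64_t device_features)
{
    return device_features & ((uint64_t)1 << VIRTIO_F_ANY_LAYOUT);
}
```

Without ANY_LAYOUT, the driver must put the virtio-blk request header in its own descriptor, which is exactly what the failing qtest layouts violate.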
Re: [Qemu-devel] [PATCH] Fix for crash after migration in virtio-rng on bi-endian targets
On Fri, Nov 28, 2014 at 09:14:46AM +, Peter Maydell wrote: On 27 November 2014 at 09:26, Markus Armbruster arm...@redhat.com wrote: David Gibson da...@gibson.dropbear.id.au writes: VirtIO devices now remember which endianness they're operating in in order to support targets which may have guests of either endianness, such as powerpc. This endianness state is transferred in a subsection of the virtio device's information. With virtio-rng this can lead to an abort after a loadvm hitting the assert() in virtio_is_big_endian(). This can be reproduced by doing a migrate and load from file on a bi-endian target with a virtio-rng device. The actual guest state isn't particularly important to triggering this. The cause is that virtio_rng_load_device() calls virtio_rng_process() which accesses the ring and thus needs the endianness. However, virtio_rng_process() is called via virtio_load() before it loads the subsections. Essentially the ->load callback in VirtioDeviceClass should only be used for actually reading the device state from the stream, not for post-load re-initialization. This patch fixes the bug by moving the virtio_rng_process() after the call to virtio_load(). Better yet would be to convert virtio to use vmsd and have the virtio_rng_process() as a post_load callback, but that's a bigger project for another day. This is bugfix, and should be considered for the 2.2 branch. [PATCH for-2.2] would have been a good idea then. Next time :) So do you want this patch in 2.2? I was planning to put in the virtio-vs-xen fixes today and tag rc4, so it's not too late if you're confident this patch is good. Let me know if you think it should go in, and I can apply it to master directly. Yes, I think it should be applied. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 12:32 schrieb Peter Lieven: Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Why do you set pool_size = 0 in the create path? When I do the following: diff --git a/qemu-coroutine.c b/qemu-coroutine.c index 6bee354..c79ee78 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) * and the actual size of alloc_pool. But it is just a heuristic, * it does not need to be perfect. */ -pool_size = 0; +atomic_dec(&pool_size); QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); co = QSLIST_FIRST(&alloc_pool); I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Peter
Re: [Qemu-devel] [PATCH] Fix for crash after migration in virtio-rng on bi-endian targets
On Fri, 28 Nov 2014 09:14:46 + Peter Maydell peter.mayd...@linaro.org wrote: On 27 November 2014 at 09:26, Markus Armbruster arm...@redhat.com wrote: David Gibson da...@gibson.dropbear.id.au writes: VirtIO devices now remember which endianness they're operating in in order to support targets which may have guests of either endianness, such as powerpc. This endianness state is transferred in a subsection of the virtio device's information. With virtio-rng this can lead to an abort after a loadvm hitting the assert() in virtio_is_big_endian(). This can be reproduced by doing a migrate and load from file on a bi-endian target with a virtio-rng device. The actual guest state isn't particularly important to triggering this. The cause is that virtio_rng_load_device() calls virtio_rng_process() which accesses the ring and thus needs the endianness. However, virtio_rng_process() is called via virtio_load() before it loads the subsections. Essentially the -load callback in VirtioDeviceClass should only be used for actually reading the device state from the stream, not for post-load re-initialization. This patch fixes the bug by moving the virtio_rng_process() after the call to virtio_load(). Better yet would be to convert virtio to use vmsd and have the virtio_rng_process() as a post_load callback, but that's a bigger project for another day. This is bugfix, and should be considered for the 2.2 branch. [PATCH for-2.2] would have been a good idea then. Next time :) So do you want this patch in 2.2? I was planning to put in the virtio-vs-xen fixes today and tag rc4, so it's not too late if you're confident this patch is good. Let me know if you think it should go in, and I can apply it to master directly. -- PMM Peter, FWIW I think it should. Commit 3902d49e13c2428bd6381cfdf183103ca4477c1f is clearly bad: virtio-rng does not need the .load callback obviously... and the fact it breaks migration makes it even worse... :( Please apply to 2.2. Cheers. -- Greg
Re: [Qemu-devel] [RFC PATCH 1/3] qemu-img bench
On Wed, Nov 26, 2014 at 03:46:42PM +0100, Kevin Wolf wrote: +while (data.n > 0) { +main_loop_wait(false); +} Why is this false (non-blocking)? This is why you get the main loop spun warning message. Using true (blocking) seems like the right thing. data.n changes as part of the callback, which is invoked from the main loop. There is no need to be non-blocking. Stefan
Re: [Qemu-devel] [RFC PATCH 1/3] qemu-img bench
Am 28.11.2014 um 12:49 hat Stefan Hajnoczi geschrieben: On Wed, Nov 26, 2014 at 03:46:42PM +0100, Kevin Wolf wrote:

+    while (data.n > 0) {
+        main_loop_wait(false);
+    }

Why is this false (non-blocking)? This is why you get the main loop spun warning message. Using true (blocking) seems like the right thing. data.n changes as part of the callback, which is invoked from the main loop. There is no need to be non-blocking. I think the parameter has exactly the opposite meaning as what you describe:

    int main_loop_wait(int nonblocking)

If it were true, you would get timeout = 0. qemu-io and qemu-nbd also pass false here. Kevin
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On 28/11/2014 12:32, Peter Lieven wrote: Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try "coroutine: Use __thread …" together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Why do you set pool_size = 0 in the create path? When I do the following:

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..c79ee78 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
      * and the actual size of alloc_pool. But it is just a heuristic,
      * it does not need to be perfect.
      */
-    pool_size = 0;
+    atomic_dec(&pool_size);
     QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
     co = QSLIST_FIRST(&alloc_pool);

I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Because pool_size is the (approximate) number of coroutines in the pool. It is zero after QSLIST_MOVE_ATOMIC has NULL-ed out release_pool.slh_first. Paolo
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 13:21 schrieb Paolo Bonzini: On 28/11/2014 12:32, Peter Lieven wrote: Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try "coroutine: Use __thread …" together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Why do you set pool_size = 0 in the create path? When I do the following:

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..c79ee78 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
      * and the actual size of alloc_pool. But it is just a heuristic,
      * it does not need to be perfect.
      */
-    pool_size = 0;
+    atomic_dec(&pool_size);
     QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
     co = QSLIST_FIRST(&alloc_pool);

I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Because pool_size is the (approximate) number of coroutines in the pool. It is zero after QSLIST_MOVE_ATOMIC has NULL-ed out release_pool.slh_first. Got it meanwhile. And it's not as bad as I thought, since you only steal the release_pool if your alloc_pool is empty. Right? Peter
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On 28/11/2014 12:46, Peter Lieven wrote: I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth. Paolo
[Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
Peter Maydell suggested that we describe new devices / DTB nodes in the kernel Documentation tree that we expose to arm virt guests in QEMU. Although the kernel is not required to access the fw_cfg interface, Documentation/devicetree/bindings/arm is probably the best central spot to keep the fw_cfg description in.

Suggested-by: Peter Maydell peter.mayd...@linaro.org
Signed-off-by: Laszlo Ersek ler...@redhat.com
---
 Documentation/devicetree/bindings/arm/fw-cfg.txt | 47 
 1 file changed, 47 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/fw-cfg.txt

diff --git a/Documentation/devicetree/bindings/arm/fw-cfg.txt b/Documentation/devicetree/bindings/arm/fw-cfg.txt
new file mode 100644
index 000..d131453
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/fw-cfg.txt
@@ -0,0 +1,47 @@
+* QEMU Firmware Configuration bindings for ARM
+
+QEMU's arm-softmmu and aarch64-softmmu emulation / virtualization targets
+provide the following Firmware Configuration interface on the virt machine
+type:
+
+- A write-only, 16-bit wide selector (or control) register,
+- a read-write, 8-bit wide data register.
+
+The guest writes a selector value (a key) to the selector register, and then
+can read the corresponding data (produced by QEMU) via the data port. If the
+selected entry is writable, the guest can rewrite it through the data port. The
+authoritative registry of the valid selector values and their meanings is the
+QEMU source code.
+
+QEMU exposes the control and data register to x86 guests at fixed IO ports. ARM
+guests can access them as memory mapped registers, and their location is
+communicated to the guest's UEFI firmware in the DTB that QEMU places at the
+bottom of the guest's DRAM.
+
+The guest kernel is not expected to use these registers (although it is
+certainly allowed to); the device tree bindings are documented here because
+this is where device tree bindings reside in general.
+
+The addresses and sizes of the Firmware Configuration registers are given by
+the /fw-cfg node. The virt board invariably uses 2 as #size-cells and
+#address-cells in the context of this node.
+
+The reg property is therefore an array of four uint64_t elements (eight
+uint32_t cells in total), the first uint64_t pair describing the address and
+size of the control register, the second uint64_t pair describing the address
+and size of the data register. (See
+http://devicetree.org/Device_Tree_Usage#How_Addressing_Works.)
+
+The first string in the compatible property is "fw-cfg,mmio".
+
+Example:
+
+/ {
+	#size-cells = <0x2>;
+	#address-cells = <0x2>;
+
+	fw-cfg@902 {
+		reg = <0x0 0x902 0x0 0x2 0x0 0x9020002 0x0 0x1>;
+		compatible = "fw-cfg,mmio";
+	};
+};
-- 
1.8.3.1
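The selector/data protocol described in the binding is easy to model in a few lines of C. The sketch below is a hypothetical in-memory stand-in, not QEMU code: `FWCfgModel`, `fw_cfg_select` and `fw_cfg_data_read` are illustrative names, and a real guest would perform MMIO accesses at the addresses taken from the /fw-cfg DTB node instead.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory model of the fw_cfg protocol described above:
 * a 16-bit selector register picks an entry, and consecutive reads of
 * the 8-bit data register stream that entry's bytes. */

#define FW_CFG_ENTRIES 4
#define FW_CFG_ENTRY_SIZE 16

typedef struct FWCfgModel {
    uint8_t blobs[FW_CFG_ENTRIES][FW_CFG_ENTRY_SIZE]; /* entries keyed by selector */
    uint16_t selector;   /* last key written to the control register */
    unsigned cursor;     /* read position within the selected entry */
} FWCfgModel;

/* Write the selector (control) register: choose an entry, reset the cursor. */
static void fw_cfg_select(FWCfgModel *f, uint16_t key)
{
    f->selector = key;
    f->cursor = 0;
}

/* Read the 8-bit data register: return the next byte of the selected entry. */
static uint8_t fw_cfg_data_read(FWCfgModel *f)
{
    if (f->selector >= FW_CFG_ENTRIES || f->cursor >= FW_CFG_ENTRY_SIZE) {
        return 0;
    }
    return f->blobs[f->selector][f->cursor++];
}
```

Re-writing the selector rewinds the cursor, which is why a guest can re-read an entry simply by selecting it again.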
[Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. To let the guest run smoothly and not be unnecessarily paused, oVirt sets a disk usage threshold (so called 'high water mark') based on the occupation of the device, and automatically extends the image once the threshold is reached or exceeded. In order to detect the crossing of the threshold, oVirt has no choice but aggressively polling the QEMU monitor using the query-blockstats command. This leads to unnecessary system load, and is made even worse under scale: deployments with hundreds of VMs are no longer rare. To fix this, this patch adds: * A new monitor command to set a mark for a given block device. * A new event to report if a block device usage exceeds the threshold. This will allow the managing application to drop the polling altogether and just wait for a watermark crossing event.

Signed-off-by: Francesco Romani from...@redhat.com
---
 block/Makefile.objs             |   1 +
 block/qapi.c                    |   3 +
 block/usage-threshold.c         | 124 
 include/block/block_int.h       |   4 ++
 include/block/usage-threshold.h |  62 
 qapi/block-core.json            |  48 +++-
 qmp-commands.hx                 |  28 +
 tests/Makefile                  |   3 +
 tests/test-usage-threshold.c    | 101 
 9 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h
 create mode 100644 tests/test-usage-threshold.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..43e381d 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += usage-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index a87a34a..c85abb0 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 #include "block/qapi.h"
 #include "block/block_int.h"
+#include "block/usage-threshold.h"
 #include "qmp-commands.h"
 #include "qapi-visit.h"
 #include "qapi/qmp-output-visitor.h"
@@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
         info->iops_size = cfg.op_size;
     }
 
+    info->write_threshold = bdrv_usage_threshold_get(bs);
+
     return info;
 }
diff --git a/block/usage-threshold.c b/block/usage-threshold.c
new file mode 100644
index 000..9faf5e6
--- /dev/null
+++ b/block/usage-threshold.c
@@ -0,0 +1,124 @@
+/*
+ * QEMU System Emulator block usage threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani from...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "block/block_int.h"
+#include "block/coroutine.h"
+#include "block/usage-threshold.h"
+#include "qemu/notify.h"
+#include "qapi-event.h"
+#include "qmp-commands.h"
+
+
+uint64_t bdrv_usage_threshold_get(const BlockDriverState *bs)
+{
+    return bs->write_threshold_offset;
+}
+
+bool bdrv_usage_threshold_is_set(const BlockDriverState *bs)
+{
+    return !!(bs->write_threshold_offset > 0);
+}
+
+static void usage_threshold_disable(BlockDriverState *bs)
+{
+    if (bdrv_usage_threshold_is_set(bs)) {
+        notifier_with_return_remove(&bs->write_threshold_notifier);
+        bs->write_threshold_offset = 0;
+    }
+}
+
+uint64_t bdrv_usage_threshold_exceeded(const BlockDriverState *bs,
+                                       const BdrvTrackedRequest *req)
+{
+    if (bdrv_usage_threshold_is_set(bs)) {
+        if (req->offset > bs->write_threshold_offset) {
+            return (req->offset - bs->write_threshold_offset) + req->bytes;
+        }
+        if ((req->offset + req->bytes) > bs->write_threshold_offset) {
+            return (req->offset + req->bytes) - bs->write_threshold_offset;
+        }
+    }
+    return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+                                            void *opaque)
+{
+    BdrvTrackedRequest *req = opaque;
+    BlockDriverState *bs = req->bs;
+    uint64_t amount = 0;
+
+    amount = bdrv_usage_threshold_exceeded(bs, req);
+    if (amount > 0) {
+        qapi_event_send_block_usage_threshold(
+            bs->node_name,
+            amount,
+            bs->write_threshold_offset,
+            &error_abort);
+
+        /* autodisable to avoid to flood the monitor */
+        usage_threshold_disable(bs);
+    }
+
+    return 0; /* should always let other notifiers run */
+}
+
+static void usage_threshold_register_notifier(BlockDriverState *bs)
+{
+    bs->write_threshold_notifier.notify = before_write_notify;
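The arithmetic in bdrv_usage_threshold_exceeded() above can be exercised in isolation. This is a standalone sketch with plain integers; `threshold_exceeded` is an illustrative name, not part of the patch, and a threshold of 0 stands in for "not set":

```c
#include <stdint.h>

/* Given a write request [offset, offset + bytes) and a configured
 * threshold, return by how many bytes the request goes past the
 * threshold (0 if it doesn't, or if no threshold is set).  Mirrors the
 * two branches of bdrv_usage_threshold_exceeded() in the patch. */
static uint64_t threshold_exceeded(uint64_t threshold,
                                   uint64_t offset, uint64_t bytes)
{
    if (threshold == 0) {
        return 0;            /* threshold disabled */
    }
    if (offset > threshold) {
        /* the whole request lies beyond the threshold */
        return (offset - threshold) + bytes;
    }
    if (offset + bytes > threshold) {
        /* only the tail of the request crosses it */
        return (offset + bytes) - threshold;
    }
    return 0;
}
```

The first branch matters for sparse writes that start past the threshold; the second covers a request that merely straddles it.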
[Qemu-devel] [PATCH v3] add write threshold reporting for block devices
v2 was: with RFC. Since the last review round I dropped the tag because I think now all the open points are addressed. v1 was: add watermark reporting for block devices, but watermark is incorrectly unused. Hence the change in subject.

Changes since v2:
- addressed reviewers comments:
  - use node name to find the block driver state to be checked
  - report node name in the event notification
- fixed signed vs unsigned integer: use uint64 everywhere, to deal with integer overflows (more) gracefully
- fixed pending style issues
- renamed and made public a few functions to make them testable
- added very basic initial unit tests

Changes since v1:
- addressed reviewers comments. Highlights:
  - fixed terminology: watermark -> usage threshold
  - threshold is expressed in bytes
  - made the event trigger only once when the threshold is crossed
  - configured threshold visible in 'query-block' output
- fixed bugs

Open issues: Not all node names show up in the 'query-block' output, but I'll start a different thread to discuss this.

Followup: patches I have on my queue and I'd like to post as followup:
- some more unit testing
- add support to set the threshold at device creation

Rationale and context from v1:
I'm one of the oVirt developers (http://www.ovirt.org); oVirt is a virtualization management application built around qemu/kvm, so it is nice to get in touch :) We have begun a big scalability improvement effort, aiming to support without problems hundreds of VMs per host, with plans to support thousands in a not so distant future. In doing so, we are reviewing our usage flows. One of them is thin-provisioned storage, which is used quite extensively, with block devices (ISCSI for example) and COW images.

When using thin provisioning, oVirt tries hard to hide this fact from the guest OS, and to do so watches closely the usage of the device and resizes it when its usage exceeds a configured threshold (the high water mark), in order to avoid the guest OS being paused due to space exhaustion. To do the watching, we poll the devices using libvirt [1], which in turn uses query-blockstats. This is suboptimal with just one VM, but with hundreds of them, let alone thousands, it doesn't scale and it is quite a resource hog. It would be great to have this watermark concept supported in qemu, with a new event to be raised when the limit is crossed. To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957 Moreover, I had the chance to take a look at the QEMU sources and come up with this tentative patch which I'd also like to submit.

Notes:
[0]: https://lists.gnu.org/archive/html/qemu-devel/2014-07/msg01348.html
[1]: http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=ebb0c19c48690f0598de954f8e0e9d4d29d48b85

Francesco Romani (1):
  block: add event when disk usage exceeds threshold

 block/Makefile.objs             |   1 +
 block/qapi.c                    |   3 +
 block/usage-threshold.c         | 124 
 include/block/block_int.h       |   4 ++
 include/block/usage-threshold.h |  62 
 qapi/block-core.json            |  48 +++-
 qmp-commands.hx                 |  28 +
 tests/Makefile                  |   3 +
 tests/test-usage-threshold.c    | 101 
 9 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h
 create mode 100644 tests/test-usage-threshold.c

-- 
1.9.3
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: On 28/11/2014 12:46, Peter Lieven wrote: I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth. Yes, cool. The atomic operations are a bit tricky at first glance ;-) Question: Why is the pool_size increment atomic and the set to zero not? Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-)

Run operation 4000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..edea162 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -25,8 +25,9 @@ enum {
 
 /** Free list to speed up creation */
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
+static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;
 
 /* The GPrivate is only used to invoke coroutine_pool_cleanup. */
 static void coroutine_pool_cleanup(void *value);
@@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (pool_size > POOL_BATCH_SIZE) {
-                /* This is not exact; there could be a little skew between pool_size
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* This is not exact; there could be a little skew between release_pool_size
                  * and the actual size of alloc_pool. But it is just a heuristic,
                  * it does not need to be perfect.
                  */
-                pool_size = 0;
+                alloc_pool_size = atomic_fetch_and(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
@@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
                  */
                 g_private_set(&dummy_key, &dummy_key);
             }
+        } else {
+            alloc_pool_size--;
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
@@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        if (pool_size < POOL_BATCH_SIZE * 2) {
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
             co->caller = NULL;
             QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&pool_size);
+            atomic_inc(&release_pool_size);
+            return;
+        } else if (alloc_pool_size < POOL_BATCH_SIZE) {
+            co->caller = NULL;
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
             return;
         }
     }

Bug?: The release_pool is not cleaned up on termination, I think. Peter
Re: [Qemu-devel] [RFC PATCH v5 22/31] timer: introduce new QEMU_CLOCK_VIRTUAL_RT clock
On 28/11/2014 12:28, Pavel Dovgaluk wrote: Not so easy :) cpu_get_clock() checks vm_clock_seqlock which is locked in icount_warp_rt(). And after locking it requests the value of QEMU_CLOCK_VIRTUAL_RT:

    seqlock_write_lock(&timers_state.vm_clock_seqlock);
    if (runstate_is_running()) {
        int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL_RT);

Not so hard :D You can use cpu_get_clock_locked() there. In fact, cpu_get_clock_locked() is already used below in the if, so we can reuse clock instead of the other variable cur_time. Paolo
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On Thu, Nov 27, 2014 at 11:27:06AM +0100, Peter Lieven wrote:

diff --git a/iothread.c b/iothread.c
index 342a23f..b53529b 100644
--- a/iothread.c
+++ b/iothread.c
@@ -15,6 +15,7 @@
 #include "qom/object_interfaces.h"
 #include "qemu/module.h"
 #include "block/aio.h"
+#include "block/coroutine.h"
 #include "sysemu/iothread.h"
 #include "qmp-commands.h"
 #include "qemu/error-report.h"
@@ -47,6 +48,8 @@ static void *iothread_run(void *opaque)
         }
         aio_context_release(iothread->ctx);
     }
+
+    coroutine_pool_cleanup();
     return NULL;
 }

The assumption here is that iothread_run() is the only thread function that uses coroutines. If another thread uses coroutines the pool will leak :(. This is one of the challenges of thread-local storage.
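One generic way around the leak Stefan points out is a per-thread destructor, which is essentially what the GPrivate trick elsewhere in the series approximates with GLib. Below is a minimal POSIX sketch under stated assumptions: `pool_register_thread`, `pool_cleanup` and friends are hypothetical names, and the cleanup fires automatically at thread exit regardless of which thread function was used.

```c
#include <pthread.h>

static pthread_key_t pool_key;
static pthread_once_t pool_key_once = PTHREAD_ONCE_INIT;

static int cleanups_run;   /* observable side effect for the example */

/* Destructor invoked by pthreads at thread exit for every thread whose
 * key value is non-NULL; a real version would free the thread-local
 * coroutine pool here. */
static void pool_cleanup(void *value)
{
    (void)value;
    __sync_fetch_and_add(&cleanups_run, 1);
}

static void pool_key_init(void)
{
    pthread_key_create(&pool_key, pool_cleanup);
}

/* Call this whenever a thread first touches its pool. */
static void pool_register_thread(void)
{
    pthread_once(&pool_key_once, pool_key_init);
    /* the value just needs to be non-NULL for the destructor to fire */
    pthread_setspecific(pool_key, (void *)1);
}

static void *worker(void *arg)
{
    (void)arg;
    pool_register_thread();
    return NULL;   /* pool_cleanup runs automatically on thread exit */
}
```

The advantage over an explicit call at the end of iothread_run() is exactly the failure mode discussed: a thread that forgets to clean up no longer leaks its pool.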
Re: [Qemu-devel] [RFC PATCH 1/3] Revert coroutine: make pool size dynamic
On Thu, Nov 27, 2014 at 11:27:04AM +0100, Peter Lieven wrote: This reverts commit ac2662a913ee5854b1269256adbdc14e57ba480a. Justification? Won't we hit the same problem as the global pool: either pool size is too large or too small. If you exceed the pool size performance drops dramatically. Stefan
Re: [Qemu-devel] [RFC 0/6] memory: make dirty_memory[] accesses atomic
On Thu, Nov 27, 2014 at 01:21:54PM +, Peter Maydell wrote: On 27 November 2014 at 12:29, Stefan Hajnoczi stefa...@redhat.com wrote: 1. Convert all cpu_physical_memory_*_dirty() callers to use the API atomically. There are TCG callers who do things along the lines of:

    if (!cpu_physical_memory_get_dirty(addr)) {
        cpu_physical_memory_set_dirty(addr); /* not atomic! */
    }

Which bit of code is this? Note that for the TCG DIRTY_MEMORY_CODE flag you have bigger problems than just whether the bitmap updates are atomic, because the sequence is:

    write to memory
    if (!dirty) {
        flush generated code tbs;
        set dirty;
    }

and what you care about is that the existence of cached translations for this area of memory should be in sync with the state of the dirty bit, so the whole operation of "flush affected translations and set the dirty bit" needs to be thread-safe, I think. This is an example of what I mean. I'm not going to work on making TCG thread-safe in this series, and there is no dangerous race condition in this code if we leave it as is. But I'm not 100% sure of all cases, so I'll audit them again. Stefan
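The get-then-set race in the quoted snippet can in principle be closed with a single atomic read-modify-write: one fetch-or both sets the bit and reports whether it was already set, so a slow path (such as flushing TBs) runs exactly once per clean-to-dirty transition. The sketch below is a generic illustration using a GCC `__atomic` builtin, not QEMU's actual bitmap code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD 64

/* Atomically set the dirty bit for `bit` and return whether it was
 * already set.  Callers can hang the "flush translations" slow path
 * off a false return value without racing against other writers. */
static inline bool test_and_set_dirty(uint64_t *bitmap, size_t bit)
{
    uint64_t mask = 1ULL << (bit % BITS_PER_WORD);
    uint64_t old = __atomic_fetch_or(&bitmap[bit / BITS_PER_WORD], mask,
                                     __ATOMIC_SEQ_CST);
    return (old & mask) != 0;   /* true if it was already dirty */
}
```

As Peter notes, this alone does not make the DIRTY_MEMORY_CODE case safe: the bit update and the TB flush still have to be ordered as one logical operation.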
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On 28/11/2014 13:39, Peter Lieven wrote: Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: On 28/11/2014 12:46, Peter Lieven wrote: I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth. Yes, cool. The atomic operations are a bit tricky at first glance ;-) Question: Why is the pool_size increment atomic and the set to zero not? Because the set to zero is not a read-modify-write operation, so it is always atomic. It's just not sequentially-consistent (see docs/atomics.txt for some info on what that means). Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) Because you can only waste 64 coroutines per thread. But numbers cannot be sneezed at, so it's worth doing it as a separate patch.

Run operation 4000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..edea162 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -25,8 +25,9 @@ enum {
 
 /** Free list to speed up creation */
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
+static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;
 
 /* The GPrivate is only used to invoke coroutine_pool_cleanup. */
 static void coroutine_pool_cleanup(void *value);
@@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (pool_size > POOL_BATCH_SIZE) {
-                /* This is not exact; there could be a little skew between pool_size
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* This is not exact; there could be a little skew between release_pool_size
                  * and the actual size of alloc_pool. But it is just a heuristic,
                  * it does not need to be perfect.
                  */
-                pool_size = 0;
+                alloc_pool_size = atomic_fetch_and(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
@@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
                  */
                 g_private_set(&dummy_key, &dummy_key);
             }
+        } else {
+            alloc_pool_size--;
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
@@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        if (pool_size < POOL_BATCH_SIZE * 2) {
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
             co->caller = NULL;
             QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&pool_size);
+            atomic_inc(&release_pool_size);
+            return;
+        } else if (alloc_pool_size < POOL_BATCH_SIZE) {
+            co->caller = NULL;
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
             return;
         }
     }

Bug?: The release_pool is not cleaned up on termination, I think. That's not necessary, it is global. Paolo
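The counter handoff being discussed can be sketched standalone: the release side increments an approximate shared counter atomically, and the alloc side drains it to zero in one read-modify-write, taking the old value as its private count. This uses GCC `__atomic` builtins in place of QEMU's atomic.h macros; `atomic_fetch_and(&x, 0)` works exactly as in Peter's patch because AND with 0 zeroes the counter and returns the previous value.

```c
static unsigned int release_pool_size;

/* Release side: one coroutine was pushed onto the global release pool. */
static void pool_release_one(void)
{
    __atomic_fetch_add(&release_pool_size, 1, __ATOMIC_SEQ_CST);
}

/* Alloc side: steal the whole release pool's count for the calling
 * thread.  Zeroing and reading happen in a single atomic step, so no
 * concurrent increment can be lost between the two. */
static unsigned int pool_steal_count(void)
{
    return __atomic_fetch_and(&release_pool_size, 0, __ATOMIC_SEQ_CST);
}
```

The plain `= 0` discussed above would also store atomically, but it could not return the old value in the same step, which is what the fetch-and variant buys.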
Re: [Qemu-devel] [RFC PATCH 1/3] Revert coroutine: make pool size dynamic
Am 28.11.2014 um 13:42 schrieb Stefan Hajnoczi: On Thu, Nov 27, 2014 at 11:27:04AM +0100, Peter Lieven wrote: This reverts commit ac2662a913ee5854b1269256adbdc14e57ba480a. Justification? Won't we hit the same problem as the global pool: either pool size is too large or too small. If you exceed the pool size performance drops dramatically. Each thread has its own pool. But you can ignore this series I think. The idea Paolo came up with seems to be better. With my last improvement it seems to have equal performance without the drawback of different threads creating and destroying the coroutines. Peter
Re: [Qemu-devel] Qemu coroutine behaviour on blocking send(3)
On Fri, Nov 28, 2014 at 01:55:00PM +0700, Iwan Budi Kusnanto wrote: I meant, will the coroutine yield internally when it gets blocked on send(3)? No. In general, QEMU will use non-blocking file descriptors so the blocking case does not apply. (You can't use blocking file descriptors in an event loop without a risk of blocking the entire event loop.) There is qemu_co_send(), which attempts the non-blocking send(2) and yields on EAGAIN. block/sheepdog.c and nbd.c both use this function. It's a little ugly because the caller must add/remove the socket write fd handler function so that the coroutine is re-entered when the fd becomes writable again. Stefan
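A minimal sketch of the pattern behind qemu_co_send(): try the non-blocking send(2) and treat EAGAIN as the point where a coroutine would yield. The helper names below are illustrative, not QEMU API, and the "yield" is simply reported to the caller as a return value of 0.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to send on a non-blocking fd.  Returns bytes sent, 0 when the
 * socket buffer is full (the point where a coroutine-based caller
 * would qemu_coroutine_yield() until the fd is writable), or -1 on a
 * real error. */
static ssize_t try_send(int fd, const void *buf, size_t len)
{
    ssize_t ret = send(fd, buf, len, 0);
    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return 0;
    }
    return ret;
}

/* Make a non-blocking AF_UNIX socketpair for experiments. */
static int make_nonblocking_pair(int fds[2])
{
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
        return -1;
    }
    fcntl(fds[0], F_SETFL, O_NONBLOCK);
    fcntl(fds[1], F_SETFL, O_NONBLOCK);
    return 0;
}
```

The ugly part Stefan mentions lives outside this sketch: on the 0 return, the caller must register a write handler for the fd and re-enter the coroutine from it, then remove the handler again.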
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 13:45 schrieb Paolo Bonzini: On 28/11/2014 13:39, Peter Lieven wrote: Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: On 28/11/2014 12:46, Peter Lieven wrote: I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth. Yes, cool. The atomic operations are a bit tricky at first glance ;-) Question: Why is the pool_size increment atomic and the set to zero not? Because the set to zero is not a read-modify-write operation, so it is always atomic. It's just not sequentially-consistent (see docs/atomics.txt for some info on what that means). Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) Because you can only waste 64 coroutines per thread. But numbers cannot be sneezed at, so it's worth doing it as a separate patch. What do you mean by that? If I use dataplane, I will fill the global pool and never use it; I then use thread-local storage only. So I get the same numbers as in my thread-local-storage-only version. Maybe it is an idea to tweak the POOL_BATCH_SIZE * 2 according to what is really attached. If we have only dataplane or ioeventfd it can be POOL_BATCH_SIZE * 0 and we won't even waste those coroutines oxidating in the global pool. Peter
Re: [Qemu-devel] [PATCH v2 2/6] geometry: Detect blocksize via ioctls in separate static functions
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes:

Suggest function comment:

/**
 * Return logical block size, or zero if we can't figure it out
 */

 {
-    BDRVRawState *s = bs->opaque;
-    char *buf;
-    unsigned int sector_size;
-
-    /* For /dev/sg devices the alignment is not really used.
-       With buffered I/O, we don't have any restrictions. */
-    if (bs->sg || !s->needs_alignment) {
-        bs->request_alignment = 1;
-        s->buf_align = 1;
-        return;
-    }
+    unsigned int sector_size = 0;

Pointless initialization. If I do not initialize the sector_size, and the ioctl fails, I will return garbage as a blocksize to the caller. Where? As far as I can see, we return it only after ioctl() succeeded.

     /* Try a few ioctls to get the right size */
-    bs->request_alignment = 0;
-    s->buf_align = 0;
-
 #ifdef BLKSSZGET
     if (ioctl(fd, BLKSSZGET, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef DKIOCGETBLOCKSIZE
     if (ioctl(fd, DKIOCGETBLOCKSIZE, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef DIOCGSECTORSIZE
     if (ioctl(fd, DIOCGSECTORSIZE, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef CONFIG_XFS
     if (s->is_xfs) {
         struct dioattr da;
         if (xfsctl(NULL, fd, XFS_IOC_DIOINFO, &da) >= 0) {
-            bs->request_alignment = da.d_miniosz;
+            sector_size = da.d_miniosz;
             /* The kernel returns wrong information for d_mem */
             /* s->buf_align = da.d_mem; */

Since you keep the enabled assignments to s->buf_align out of this function, you should keep out this disabled one, too.

+            return sector_size;
         }
     }
 #endif
+    return 0;
+}
+
+static unsigned int probe_physical_blocksize(BlockDriverState *bs, int fd)

Parameter bs is unused, let's drop it.

+{
+    unsigned int blk_size = 0;

Pointless initialization. Same here. Again, we return it only after ioctl() succeeded.

+#ifdef BLKPBSZGET
+    if (ioctl(fd, BLKPBSZGET, &blk_size) >= 0) {
+        return blk_size;
+    }
+#endif
+
+    return 0;
+}
+
+static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
+{
+    BDRVRawState *s = bs->opaque;
+    char *buf;
+
+    /* For /dev/sg devices the alignment is not really used.
+       With buffered I/O, we don't have any restrictions. */
+    if (bs->sg || !s->needs_alignment) {
+        bs->request_alignment = 1;
+        s->buf_align = 1;
+        return;
+    }
+
+    s->buf_align = 0;
+    /* Let's try to use the logical blocksize for the alignment. */
+    bs->request_alignment = probe_logical_blocksize(bs, fd);
+
     /* If we could not get the sizes so far, we can only guess them */
     if (!s->buf_align) {
         size_t align;
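For reference, the Linux branch of the probing helper under review can be exercised standalone. This sketch keeps only the BLKSSZGET path (the real patch also tries DKIOCGETBLOCKSIZE, DIOCGSECTORSIZE and the XFS ioctl) and, as discussed in the review, returns 0 when no ioctl succeeds, such as on a regular file where BLKSSZGET fails with ENOTTY.

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/fs.h>   /* BLKSSZGET */
#endif

/* Return the logical block size of fd, or zero if we can't figure it
 * out.  sector_size is deliberately left uninitialized: per the review
 * discussion above, it is only returned after the ioctl has filled it
 * in. */
static unsigned int probe_logical_blocksize(int fd)
{
    unsigned int sector_size;

#ifdef BLKSSZGET
    if (ioctl(fd, BLKSSZGET, &sector_size) >= 0) {
        return sector_size;
    }
#endif
    (void)fd;
    return 0;   /* caller falls back to guessing the alignment */
}
```

Run against an actual block device (e.g. an fd on /dev/sda) this would typically report 512 or 4096; against anything else it reports 0, which is the fallback case raw_probe_alignment() handles.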
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On 28/11/2014 13:49, Peter Lieven wrote: Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) Because you can only waste 64 coroutines per thread. But numbers cannot s/only// be sneezed at, so it's worth doing it as a separate patch. What do you mean by that? If I use dataplane I will fill the global pool and never use it okay, but then I use thread local storage only. So I get the same numbers as in my thread local storage only version. Right. I didn't want to waste the coroutines. But it's not 64 coroutines per VCPU thread, it's just 64 coroutines for the global iothread because all the dataplane threads are guaranteed to use ioeventfd. Let's do it. :) Can I add your Signed-off-by to the patch? Paolo Maybe it is an idea to tweak the POOL_BATCH_SIZE * 2 according to what is really attached. If we have only dataplane or ioeventfd it can be POOL_BATCH_SIZE * 0 and we even won't waste those coroutines oxidating in the global pool.
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Am 27.11.2014 um 10:50 hat Peter Lieven geschrieben: On 26.11.2014 15:46, Kevin Wolf wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. As a side effect, the codepath taken by aio=threads is changed to use paio_submit_co(). This doesn't change the performance at this point.

Results of qemu-img bench -t none -c 1000000 [-n] /dev/loop0:

        |       aio=native       |      aio=threads
        | before  | with patch   | before  | with patch
  ------+---------+--------------+---------+-----------
  run 1 | 29.921s | 26.932s      | 35.286s | 35.447s
  run 2 | 29.793s | 26.252s      | 35.276s | 35.111s
  run 3 | 30.186s | 27.114s      | 35.042s | 34.921s
  run 4 | 30.425s | 26.600s      | 35.169s | 34.968s
  run 5 | 30.041s | 26.263s      | 35.224s | 35.000s

TODO: Do some more serious benchmarking in VMs with less variance. Results of a quick fio run are vaguely positive.

I still see the main-loop spun warnings with these patches applied to master. It wasn't there with the original patch from August.

~/git/qemu$ ./qemu-img bench -t none -c 1000000 -n /dev/ram1
Sending 1000000 requests, 4096 bytes each, 64 in parallel
main-loop: WARNING: I/O thread spun for 1000 iterations
Run completed in 31.947 seconds.

Yes, I still need to bisect that. The 'qemu-img bench' numbers above are actually also from August; we have regressed meanwhile by about a second, and I also haven't found the reason for that yet.

Kevin
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On Friday 28 November 2014 13:26:44 Laszlo Ersek wrote:

+Example:
+
+/ {
+	#size-cells = <0x2>;
+	#address-cells = <0x2>;
+
+	fw-cfg@9020000 {
+		reg = <0x0 0x9020000 0x0 0x2 0x0 0x9020002 0x0 0x1>;
+		compatible = "fw-cfg,mmio";
+	};
+};

"fw-cfg" is not a valid vendor string. Are you emulating an actual piece of hardware here or is this something that comes from qemu? How about using "qemu,fwcfg" as the compatible string, and documenting "qemu" as an official vendor string in Documentation/devicetree/bindings/vendor-prefixes.txt?

We don't normally list contiguous registers separately. Maybe just round up to one page and make the register property

	reg = <0x0 0x9020000 0x0 0x1000>;

Finally, isn't there also an interrupt associated with this virtual device?

	Arnd
Re: [Qemu-devel] [PATCH v3 11/22] iotests: Prepare for refcount_width option
On Thu, Nov 20, 2014 at 06:06:27PM +0100, Max Reitz wrote: Some tests do not work well with certain refcount widths (i.e. you cannot create internal snapshots with refcount_width=1), so make those widths unsupported. Furthermore, add another filter to _filter_img_create in common.filter which filters out the refcount_width value. This is necessary for test 079, which does actually work with any refcount width, but invoking qemu-img directly leads to the refcount_width value being visible in the output; use _make_test_img instead which will filter it out. Signed-off-by: Max Reitz mre...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- tests/qemu-iotests/007 | 3 +++ tests/qemu-iotests/015 | 2 ++ tests/qemu-iotests/026 | 7 +++ tests/qemu-iotests/029 | 2 ++ tests/qemu-iotests/051 | 3 +++ tests/qemu-iotests/058 | 2 ++ tests/qemu-iotests/067 | 2 ++ tests/qemu-iotests/079 | 10 ++ tests/qemu-iotests/079.out | 38 ++ tests/qemu-iotests/080 | 2 ++ tests/qemu-iotests/089 | 2 ++ tests/qemu-iotests/108 | 2 ++ tests/qemu-iotests/common.filter | 3 ++- 13 files changed, 41 insertions(+), 37 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgp3BrjG_oLC_.pgp Description: PGP signature
[Qemu-devel] [PATCH] target-arm: ARM64: Adding EL1 AARCH32 guest support for KVM.
In KVM ARM64 one can choose to run the guest in 32-bit mode, i.e. EL1 in AARCH32 mode. This patch adds qemu support for running guest EL1 in AARCH32 mode with virt as a machine model. This patch also adds support for running Image (along with zImage) for arm32. A 32-bit kernel Image can be specified using the -cpu host,el1_aarch32 argument, e.g.

./qemu/aarch64-softmmu/qemu-system-aarch64 -nographic -display none \
-serial stdio -kernel ./Image -m 512 -M virt -cpu host,el1_aarch32 \
-initrd rootfs.img -append "console=ttyAMA0 root=/dev/ram" -enable-kvm

Signed-off-by: Pranavkumar Sawargaonkar pranavku...@linaro.org
---
 hw/arm/boot.c      | 44 ++++++++++++++++
 hw/arm/virt.c      | 30 +++++++++-
 target-arm/cpu.c   |  5 ++--
 target-arm/cpu.h   |  2 ++
 target-arm/kvm64.c | 73 ++++++++++++++++++++++++
 5 files changed, 146 insertions(+), 8 deletions(-)

diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 0014c34..da8cdc8 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -476,6 +476,32 @@ static void do_cpu_reset(void *opaque)
     }
 }

+static int check_load_zimage(const char *filename)
+{
+    int fd;
+    uint8_t buf[40];
+    uint32_t *p = (uint32_t *) &buf[36];
+
+    fd = open(filename, O_RDONLY | O_BINARY);
+    if (fd < 0) {
+        perror(filename);
+        return -1;
+    }
+
+    memset(buf, 0, sizeof(buf));
+    if (read(fd, buf, sizeof(buf)) != sizeof(buf)) {
+        close(fd);
+        return -1;
+    }
+
+    /* Check for zImage magic number */
+    if (*p == 0x016F2818) {
+        return 1;
+    }
+
+    return 0;
+}
+
 void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info)
 {
     CPUState *cs;
@@ -515,15 +541,23 @@ void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info)
         return;
     }

-    if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64)) {
+    if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64) &&
+        (!cpu->env.el1_aarch32)) {
         primary_loader = bootloader_aarch64;
         kernel_load_offset = KERNEL64_LOAD_ADDR;
         elf_machine = EM_AARCH64;
     } else {
-        primary_loader = bootloader;
-        kernel_load_offset = KERNEL_LOAD_ADDR;
-        elf_machine = EM_ARM;
-    }
+        if (check_load_zimage(info->kernel_filename)) {
+            primary_loader = bootloader;
+            kernel_load_offset = KERNEL_LOAD_ADDR;
+            elf_machine = EM_ARM;
+        } else {
+            primary_loader = bootloader;
+            /* Assuming we are loading Image hence aligning it to 0x8000 */
+            kernel_load_offset = KERNEL_LOAD_ADDR - 0x8000;
+            elf_machine = EM_ARM;
+        }
+    }

     info->dtb_filename = qemu_opt_get(qemu_get_machine_opts(), "dtb");
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 314e55b..64213e6 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -204,7 +204,8 @@ static void fdt_add_psci_node(const VirtBoardInfo *vbi)
     qemu_fdt_setprop(fdt, "/psci", "compatible", comp, sizeof(comp));

     cpu_off_fn = QEMU_PSCI_0_2_FN_CPU_OFF;
-    if (arm_feature(&armcpu->env, ARM_FEATURE_AARCH64)) {
+    if (arm_feature(&armcpu->env, ARM_FEATURE_AARCH64) &&
+        (!armcpu->env.el1_aarch32)) {
         cpu_suspend_fn = QEMU_PSCI_0_2_FN64_CPU_SUSPEND;
         cpu_on_fn = QEMU_PSCI_0_2_FN64_CPU_ON;
         migrate_fn = QEMU_PSCI_0_2_FN64_MIGRATE;
@@ -527,6 +528,24 @@ static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size)
     return board->fdt;
 }

+#if defined(TARGET_AARCH64) && !defined(CONFIG_USER_ONLY)
+static void check_special_cpu_model_flags(const char *cpu_model,
+                                          Object *cpuobj)
+{
+    ARMCPU *cpu = ARM_CPU(cpuobj);
+
+    if (!cpu) {
+        return;
+    }
+
+    if (strcmp(cpu_model, "host,el1_aarch32") == 0) {
+        cpu->env.el1_aarch32 = 1;
+    } else {
+        cpu->env.el1_aarch32 = 0;
+    }
+}
+#endif
+
 static void machvirt_init(MachineState *machine)
 {
     qemu_irq pic[NUM_IRQS];
@@ -540,6 +559,12 @@ static void machvirt_init(MachineState *machine)
         cpu_model = "cortex-a15";
     }

+#if defined(TARGET_AARCH64) && !defined(CONFIG_USER_ONLY)
+    if (strcmp(cpu_model, "host,el1_aarch32") == 0) {
+        cpu_model = "host";
+    }
+#endif
+
     vbi = find_machine_info(cpu_model);
     if (!vbi) {
@@ -578,6 +603,9 @@ static void machvirt_init(MachineState *machine)
         object_property_set_int(cpuobj, vbi->memmap[VIRT_CPUPERIPHS].base,
                                 "reset-cbar", &error_abort);
     }
+#if defined(TARGET_AARCH64) && !defined(CONFIG_USER_ONLY)
+    check_special_cpu_model_flags(machine->cpu_model, cpuobj);
+#endif
     object_property_set_bool(cpuobj, true, "realized", NULL);
 }
 diff --git
a/target-arm/cpu.c b/target-arm/cpu.c index 5ce7350..37dfe30 100644 --- a/target-arm/cpu.c +++ b/target-arm/cpu.c @@ -103,7 +103,7 @@ static void
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 13:39 schrieb Peter Lieven:
Am 28.11.2014 um 13:26 schrieb Paolo Bonzini:
On 28/11/2014 12:46, Peter Lieven wrote:

I get: Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine

Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines?

Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth.

Yes, cool. The atomic operations are a bit tricky at first glance ;-)

Question: Why is the pool_size increment atomic and the set to zero not?

Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-)

Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..edea162 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -25,8 +25,9 @@ enum {

 /** Free list to speed up creation */
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
+static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;

 /* The GPrivate is only used to invoke coroutine_pool_cleanup. */
 static void coroutine_pool_cleanup(void *value);
@@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (pool_size > POOL_BATCH_SIZE) {
-                /* This is not exact; there could be a little skew between pool_size
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* This is not exact; there could be a little skew between release_pool_size
                  * and the actual size of alloc_pool.  But it is just a heuristic,
                  * it does not need to be perfect.
                  */
-                pool_size = 0;
+                alloc_pool_size = atomic_fetch_and(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
@@ -53,6 +54,8 @@
                  */
                 g_private_set(&dummy_key, &dummy_key);
             }
+        } else {
+            alloc_pool_size--;
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
@@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        if (pool_size < POOL_BATCH_SIZE * 2) {
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
             co->caller = NULL;
             QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&pool_size);
+            atomic_inc(&release_pool_size);
+            return;
+        } else if (alloc_pool_size < POOL_BATCH_SIZE) {
+            co->caller = NULL;
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
             return;
         }
     }

Signed-off-by: Peter Lieven p...@kamp.de
Re: [Qemu-devel] [PATCH v3 12/22] qcow2: Allow creation with refcount order != 4
On Thu, Nov 20, 2014 at 06:06:28PM +0100, Max Reitz wrote: Add a creation option to qcow2 for setting the refcount order of images to be created, and respect that option's value. This breaks some test outputs, fix them. Signed-off-by: Max Reitz mre...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- block/qcow2.c | 20 include/block/block_int.h | 1 + tests/qemu-iotests/049.out | 112 ++--- tests/qemu-iotests/082.out | 41 ++--- tests/qemu-iotests/085.out | 38 +++ 5 files changed, 130 insertions(+), 82 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpm5fxW_MQG8.pgp Description: PGP signature
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 13:45 schrieb Paolo Bonzini:
On 28/11/2014 13:39, Peter Lieven wrote:
Am 28.11.2014 um 13:26 schrieb Paolo Bonzini:
On 28/11/2014 12:46, Peter Lieven wrote:

I get: Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine

Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines?

Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth.

Yes, cool. The atomic operations are a bit tricky at first glance ;-)

Question: Why is the pool_size increment atomic and the set to zero not?

Because the set to zero is not a read-modify-write operation, so it is always atomic. It's just not sequentially-consistent (see docs/atomics.txt for some info on what that means).

Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-)

Because you can only waste 64 coroutines per thread. But numbers cannot be sneezed at, so it's worth doing it as a separate patch.

Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..edea162 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -25,8 +25,9 @@ enum {

 /** Free list to speed up creation */
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
+static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;

 /* The GPrivate is only used to invoke coroutine_pool_cleanup. */
 static void coroutine_pool_cleanup(void *value);
@@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (pool_size > POOL_BATCH_SIZE) {
-                /* This is not exact; there could be a little skew between pool_size
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* This is not exact; there could be a little skew between release_pool_size
                  * and the actual size of alloc_pool.  But it is just a heuristic,
                  * it does not need to be perfect.
                  */
-                pool_size = 0;
+                alloc_pool_size = atomic_fetch_and(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
@@ -53,6 +54,8 @@
                  */
                 g_private_set(&dummy_key, &dummy_key);
             }
+        } else {
+            alloc_pool_size--;
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
@@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        if (pool_size < POOL_BATCH_SIZE * 2) {
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
             co->caller = NULL;
             QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&pool_size);
+            atomic_inc(&release_pool_size);
+            return;
+        } else if (alloc_pool_size < POOL_BATCH_SIZE) {
+            co->caller = NULL;
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
             return;
         }
     }

Bug?: The release_pool is not cleaned up on termination I think.

That's not necessary, it is global.

I don't see where you iterate over release_pool and destroy all coroutines? Maybe just add back the old destructor with s/pool/release_pool/g

Peter

Peter
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
Hi,

On Fri, Nov 28, 2014 at 12:26:44PM +0000, Laszlo Ersek wrote:

Peter Maydell suggested that we describe new devices / DTB nodes in the kernel Documentation tree that we expose to arm "virt" guests in QEMU. Although the kernel is not required to access the fw_cfg interface, Documentation/devicetree/bindings/arm is probably the best central spot to keep the fw_cfg description in.

Suggested-by: Peter Maydell peter.mayd...@linaro.org
Signed-off-by: Laszlo Ersek ler...@redhat.com
---
 Documentation/devicetree/bindings/arm/fw-cfg.txt | 47
 1 file changed, 47 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/fw-cfg.txt

diff --git a/Documentation/devicetree/bindings/arm/fw-cfg.txt b/Documentation/devicetree/bindings/arm/fw-cfg.txt
new file mode 100644
index 0000000..d131453
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/fw-cfg.txt
@@ -0,0 +1,47 @@
+* QEMU Firmware Configuration bindings for ARM
+
+QEMU's arm-softmmu and aarch64-softmmu emulation / virtualization targets
+provide the following Firmware Configuration interface on the "virt" machine
+type:
+
+- A write-only, 16-bit wide selector (or control) register,
+- a read-write, 8-bit wide data register.
+
+The guest writes a selector value (a key) to the selector register, and then
+can read the corresponding data (produced by QEMU) via the data port. If the
+selected entry is writable, the guest can rewrite it through the data port. The
+authoritative registry of the valid selector values and their meanings is the
+QEMU source code.
+
+QEMU exposes the control and data register to x86 guests at fixed IO ports. ARM
+guests can access them as memory mapped registers, and their location is
+communicated to the guest's UEFI firmware in the DTB that QEMU places at the
+bottom of the guest's DRAM.

What exactly is this interface used for? And what's the protocol that goes over the read/write sequences described above? Is there any documentation of that we can refer to here?

Is version information probeable from this, or does that need to be described?

+The guest kernel is not expected to use these registers (although it is
+certainly allowed to); the device tree bindings are documented here because
+this is where device tree bindings reside in general.
+
+The addresses and sizes of the Firmware Configuration registers are given by
+the /fw-cfg node. The "virt" board invariably uses 2 as #size-cells and
+#address-cells in the context of this node.

This paragraph can go -- #address-cells and #size-cells are well-understood common properties.

+
+The reg property is therefore an array of four uint64_t elements (eight
+uint32_t cells in total), the first uint64_t pair describing the address and
+size of the control register, the second uint64_t pair describing the address
+and size of the data register. (See
+http://devicetree.org/Device_Tree_Usage#How_Addressing_Works.)

The reg property is a list of address and size tuples. The size of the entries depends on #address-cells and #size-cells in the parent node, but they are certainly not guaranteed to be uint64_t values (if nothing else, they're encoded big-endian in the DTB). We just need to describe what the entries are w.r.t. the device, not the particulars of the DTB format.

Are these registers always contiguous? I assume so, and if so you can cover them with a single reg entry (read it as 'region' rather than 'register').

+
+The first string in the compatible property is "fw-cfg,mmio".

"fw-cfg" is not a vendor prefix. I guess we need a "qemu" vendor prefix? Then we'd have something like "qemu,fw-cfg-mmio" as the compatible string.

Also, the string should not be required to be the first entry in the list, as that breaks the fallback semantics of the compatible list. The string should just need to be present in the list.

Please try to format the binding something like:

---8<---
Required properties:
- compatible: should contain "qemu,fw-cfg-mmio".
- reg: The MMIO regions used by the device:
  * The first entry describes the control register.
  * The second entry describes the data register.
---8<---

Though I hope we can just use a single reg entry to cover all the registers the device has.

+
+Example:
+
+/ {
+	#size-cells = <0x2>;
+	#address-cells = <0x2>;
+
+	fw-cfg@9020000 {
+		reg = <0x0 0x9020000 0x0 0x2 0x0 0x9020002 0x0 0x1>;

Please bracket list entries individually, e.g.

	reg = <0x0 0x9020000 0x0 0x2>, <0x0 0x9020002 0x0 0x1>;

+		compatible = "fw-cfg,mmio";

And please place the compatible property before the reg (it makes things easier to read).

Thanks,
Mark.
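Pulling the review comments together, the node might end up looking like the fragment below. This is only a sketch of where the thread seems to be heading, not the merged binding: the "qemu,fw-cfg-mmio" compatible string and the single rounded-up region are suggestions from this thread, and the 0x9020000 base address is simply taken from the example under review.

```dts
/ {
	#address-cells = <0x2>;
	#size-cells = <0x2>;

	fw-cfg@9020000 {
		compatible = "qemu,fw-cfg-mmio";
		/* one region covering both the control and data registers */
		reg = <0x0 0x9020000 0x0 0x1000>;
	};
};
```

Whether the registers may be covered by one region depends on the outcome of the contiguity question raised above.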
Re: [Qemu-devel] [PATCH v3 15/22] qcow2: Use error_report() in qcow2_amend_options()
On Thu, Nov 20, 2014 at 06:06:31PM +0100, Max Reitz wrote: Signed-off-by: Max Reitz mre...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- block/qcow2.c | 14 ++ tests/qemu-iotests/061.out | 14 +++--- 2 files changed, 13 insertions(+), 15 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpTavcLTeR5f.pgp Description: PGP signature
Re: [Qemu-devel] [PATCH v3 16/22] qcow2: Use abort() instead of assert(false)
On Thu, Nov 20, 2014 at 06:06:32PM +0100, Max Reitz wrote: Signed-off-by: Max Reitz mre...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- block/qcow2.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpwibsjgMgJ0.pgp Description: PGP signature
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On 11/28/14 13:59, Arnd Bergmann wrote:
On Friday 28 November 2014 13:26:44 Laszlo Ersek wrote:

+Example:
+
+/ {
+	#size-cells = <0x2>;
+	#address-cells = <0x2>;
+
+	fw-cfg@9020000 {
+		reg = <0x0 0x9020000 0x0 0x2 0x0 0x9020002 0x0 0x1>;
+		compatible = "fw-cfg,mmio";
+	};
+};

"fw-cfg" is not a valid vendor string. Are you emulating an actual piece of hardware here or is this something that comes from qemu?

It's a made up name.

How about using "qemu,fwcfg" as the compatible string, and documenting "qemu" as an official vendor string in Documentation/devicetree/bindings/vendor-prefixes.txt

I can do that, certainly, but I'm not sure if that implies other QEMU devices should follow suit. For example, the virtio-mmio transports that qemu advertises have the compatible property "virtio,mmio". "virtio" is not a vendor prefix either (according to the above text file). Here's the full list of compatible strings from the generated DTB:

compatible = "arm,cortex-a15";
compatible = "arm,armv7-timer";
compatible = "arm,cortex-a15-gic";
compatible = "arm,pl011", "arm,primecell";
compatible = "arm,pl031", "arm,primecell";
compatible = "arm,psci-0.2", "arm,psci";
compatible = "cfi-flash";
compatible = "fixed-clock";
compatible = "fw-cfg,mmio";
compatible = "virtio,mmio";
compatible = "linux,dummy-virt";

According to http://devicetree.org/Device_Tree_Usage#Understanding_the_compatible_Property, "cfi-flash" and "fixed-clock" don't even have the correct format (there's no comma separating manufacturer from model).

We don't normally list contiguous registers separately. Maybe just round up to one page and make the register property

	reg = <0x0 0x9020000 0x0 0x1000>;

The registers are not necessarily placed shoulder to shoulder (it's not a requirement in the QEMU source).

Finally, isn't there also an interrupt associated with this virtual device?

No, there is not.

Thanks
Laszlo
Re: [Qemu-devel] [PATCH v3 17/22] qcow2: Split upgrade/downgrade paths for amend
On Thu, Nov 20, 2014 at 06:06:33PM +0100, Max Reitz wrote: If the image version should be upgraded, that is the first we should do; if it should be downgraded, that is the last we should do. So split the version change block into an upgrade part at the start and a downgrade part at the end. Signed-off-by: Max Reitz mre...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- block/qcow2.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpzAIZQ6Upn9.pgp Description: PGP signature
Re: [Qemu-devel] [PATCH v2 2/6] geometry: Detect blocksize via ioctls in separate static functions
On 11/28/2014 03:52 PM, Markus Armbruster wrote:
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes:

Suggest function comment:

/**
 * Return logical block size, or zero if we can't figure it out
 */

 {
-    BDRVRawState *s = bs->opaque;
-    char *buf;
-    unsigned int sector_size;
-
-    /* For /dev/sg devices the alignment is not really used.
-       With buffered I/O, we don't have any restrictions. */
-    if (bs->sg || !s->needs_alignment) {
-        bs->request_alignment = 1;
-        s->buf_align = 1;
-        return;
-    }
+    unsigned int sector_size = 0;

Pointless initialization.

If I do not initialize the sector_size, and the ioctl fails, I will return garbage as a blocksize to the caller.

Where? As far as I can see, we return it only after ioctl() succeeded.

Sorry, you're absolutely right. I kept thinking that I always returned the sector_size variable. ::facepalm::

     /* Try a few ioctls to get the right size */
-    bs->request_alignment = 0;
-    s->buf_align = 0;
-
 #ifdef BLKSSZGET
     if (ioctl(fd, BLKSSZGET, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef DKIOCGETBLOCKSIZE
     if (ioctl(fd, DKIOCGETBLOCKSIZE, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef DIOCGSECTORSIZE
     if (ioctl(fd, DIOCGSECTORSIZE, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef CONFIG_XFS
     if (s->is_xfs) {
         struct dioattr da;
         if (xfsctl(NULL, fd, XFS_IOC_DIOINFO, &da) >= 0) {
-            bs->request_alignment = da.d_miniosz;
+            sector_size = da.d_miniosz;
             /* The kernel returns wrong information for d_mem */
             /* s->buf_align = da.d_mem; */

Since you keep the enabled assignments to s->buf_align out of this function, you should keep out this disabled one, too.

+            return sector_size;
         }
     }
 #endif
+    return 0;
+}
+
+static unsigned int probe_physical_blocksize(BlockDriverState *bs, int fd)

Parameter bs is unused, let's drop it.

+{
+    unsigned int blk_size = 0;

Pointless initialization.

Same here. Again, we return it only after ioctl() succeeded.

+#ifdef BLKPBSZGET
+    if (ioctl(fd, BLKPBSZGET, &blk_size) >= 0) {
+        return blk_size;
+    }
+#endif
+
+    return 0;
+}
+
+static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
+{
+    BDRVRawState *s = bs->opaque;
+    char *buf;
+
+    /* For /dev/sg devices the alignment is not really used.
+       With buffered I/O, we don't have any restrictions. */
+    if (bs->sg || !s->needs_alignment) {
+        bs->request_alignment = 1;
+        s->buf_align = 1;
+        return;
+    }
+
+    s->buf_align = 0;
+    /* Let's try to use the logical blocksize for the alignment. */
+    bs->request_alignment = probe_logical_blocksize(bs, fd);
+
     /* If we could not get the sizes so far, we can only guess them */
     if (!s->buf_align) {
         size_t align;
Re: [Qemu-devel] [PATCH v2] persistent dirty bitmap: add QDB file spec.
On 21.11.2014 19:55, Stefan Hajnoczi wrote:

Active dirty bitmaps should migrate too. I'm thinking now that the appropriate thing is to add live migration of dirty bitmaps to QEMU (regardless of whether they are active or not).

I think we should migrate named dirty bitmaps which are not used now. If some external mechanism uses the bitmap (for example, backup), we actually can't migrate this process, because we would need to restore the whole backup structure including a pointer to the bitmap, which is too hard and involves more than just bitmap migration. So, if a named bitmap is enabled but not used (only bdrv_aligned_pwritev writes to it), it can be migrated. For this I see the following solutions:

1) Just save all corresponding pieces of named bitmaps with every migrated block. The block size is 1mb, so the overhead for additionally migrating a bitmap with 64kb granularity would be 2b, and it would be 256b for a bitmap with 512b granularity. This approach needs additional fields in BlkMigBlock for saving bitmap pieces.

2) Add a DIRTY flag to the migrated block flags, to distinguish blocks which became dirty while migrating. Save all the bitmaps separately, and also update them on block_load when we receive a block with the DIRTY flag on. Some information will be lost; migrated dirty bitmaps may be more dirty than the original ones. This approach needs an additional field "bool dirty" in BlkMigBlock, and saving this flag in blk_send.

These solutions don't depend on the persistence of dirty bitmaps or the persistent bitmap file format.

Best regards,
Vladimir

On 21.11.2014 19:55, Stefan Hajnoczi wrote: On Fri, Nov 21, 2014 at 01:27:40PM +0300, Vladimir Sementsov-Ogievskiy wrote:

There is a constraint if we want to get live migration for free: The bitmap contents must be accessible with bdrv_read() and bdrv_get_block_status() to skip zero regions.

Hm. I'm afraid it still will not be free. If the bitmap is active, its current version is in memory. To migrate the bitmap file like a disk image, we would have to start syncing it with every write to the corresponding disk, doubling the number of I/Os.

It would be possible to drive-mirror the persistent dirty bitmap and then flush it like all drives when the guest vCPUs are paused for migration. After thinking more about it though, this approach places more I/O into the critical guest downtime phase. In other words, slow disk I/O could lead to long guest downtimes while QEMU tries to write out the dirty bitmap.

Moreover, we have normal dirty bitmaps, which have no name/file; do we migrate them? If, for example, the migration occurs while a backup is in progress?

Active bitmaps should be migrated in the same way for persistent/named/normal bitmaps.

I can't find it in the qemu source; is there bitmap migration?

bs->dirty_bitmaps is not migrated; in fact none of BlockDriverState is migrated. QEMU only migrates emulated device state (e.g. the hardware registers and associated state). It does not migrate host state that the guest cannot see, like the dirty bitmap.

Or are you talking about migrating disabled bitmaps? Hm. We should sync the bitmap file on bitmap_disable. A disabled persistent bitmap is just a static file of ~30mb; we can easily migrate it without a common procedure with cow or something like this.

Active dirty bitmaps should migrate too. I'm thinking now that the appropriate thing is to add live migration of dirty bitmaps to QEMU (regardless of whether they are active or not).

Stefan
Re: [Qemu-devel] [PATCH v3 13/22] progress: Allow regressing progress
On Thu, Nov 20, 2014 at 06:06:29PM +0100, Max Reitz wrote: Progress may regress; this should be displayed correctly by qemu_progress_print(). Signed-off-by: Max Reitz mre...@redhat.com --- util/qemu-progress.c | 1 + 1 file changed, 1 insertion(+) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpPiR00qqiIx.pgp Description: PGP signature
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On Fri, Nov 28, 2014 at 01:22:29PM +, Laszlo Ersek wrote: On 11/28/14 13:59, Arnd Bergmann wrote: On Friday 28 November 2014 13:26:44 Laszlo Ersek wrote: +Example: + +/ { + #size-cells = 0x2; + #address-cells = 0x2; + + fw-cfg@902 { + reg = 0x0 0x902 0x0 0x2 0x0 0x9020002 0x0 0x1; + compatible = fw-cfg,mmio; + }; +}; fw-cfg is not a valid vendor string. Are you emulating an actual piece of hardware here or is this something that comes from qemu? It's a made up name. How about using qemu,fwcfg as the compatible string, and documenting qemu as an official vendor string in Documentation/devicetree/bindings/vendor-prefixes.txt I can do that, certainly, but I'm not sure if that implies other QEMU devices should follow suit. For example, the virtio-mmio transports that qemu advertises have the compatible property virtio,mmio. virtio is not a vendor prefix either (according to the above text file). True, but it's a documented standard. It's obvious that 'virtio' has something to do with the virtio standard, while it's not obvious that 'fw-cfg' as a prefix is for a qemu-specific device. We should probably add 'virtio' as a vendor prefix. Here's the full list of compatible strings from the generated DTB: compatible = arm,cortex-a15; compatible = arm,armv7-timer; compatible = arm,cortex-a15-gic; compatible = arm,pl011, arm,primecell; compatible = arm,pl031, arm,primecell; compatible = arm,psci-0.2, arm,psci; All of these are emulated real devices or standards from ARM, so that makes sense. compatible = cfi-flash; compatible = fixed-clock; These are generic bindings, arguably they should have been 'linux,' prefixed but it's too late now. compatible = fw-cfg,mmio; compatible = virtio,mmio; compatible = linux,dummy-virt; According to http://devicetree.org/Device_Tree_Usage#Understanding_the_compatible_Property, cfi-flash and fixed-clock don't even have the correct format (there's no comma separating manufacturer from model). 
We don't normally list contiguous registers separately. Maybe just round up to one page and make the register property reg = <0x0 0x9020000 0x0 0x1000>; The registers are not necessarily placed shoulder to shoulder (it's not a requirement in the QEMU source). It might not be at the moment, but if no-one separates them it could be a requirement for using this binding. Is there any benefit to having them separated? Thanks, Mark.
Re: [Qemu-devel] master: intermittent acpi-test failures
On 27 May 2014 at 22:38, Peter Maydell peter.mayd...@linaro.org wrote: I'm seeing this test failure intermittently on 'make check': ERROR:/root/qemu/tests/acpi-test.c:618:test_acpi_one: assertion failed (signature == SIGNATURE): (0x == 0xdead) GTester: last random seed: R02S8d0d60963e4442ce284a81d20ce32053 (32 bit ARM host, in case that makes a difference.) Any ideas? It looks from the test as if this may just be that the test is coded to assume a faster machine, which is a bit unfortunate. These failures are back after a long period of not being a problem :-( -- PMM
Re: [Qemu-devel] [PATCH v2 0/6] Geometry and blocksize support for backing devices
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: On 11/28/2014 01:35 PM, Stefan Hajnoczi wrote: On Wed, Nov 26, 2014 at 01:16:48PM +0300, Ekaterina Tumanova wrote: On 11/25/2014 04:01 PM, Stefan Hajnoczi wrote: On Wed, Nov 19, 2014 at 11:17:50AM +0100, Ekaterina Tumanova wrote: 3. Why does s390 need to customize hd_geometry_guess()? Since hd_geometry_guess contains semantics of x86-specific LBA translation, we have to modify it not to get in the way of z architecture. If the behavior is x86-specific, it should be made x86-specific. z, ppc, sparc, arm, etc. should share the non-x86-specific code path. It's weird to single out z here; this seems like the wrong way around. Is the x86-specific behavior a problem in practice? No one seemed to mind for the other targets. On the s390 arch this adds support for FCP-attached SCSI disks that do not yet have a partition table. Without this patch, fdisk -l on the host would return different results than fdisk -l in the guest. Isn't it like this for all targets?

If we can't guess LCHS from MSDOS partition table
    guess geometry based on size
    translation is NONE for very small disks, else LBA
Else if LCHS guess has heads > 16
    BIOS LBA translation must be active
    guess geometry based on size
    translation is LARGE for sufficiently small disks, else LBA
Else
    use LCHS guess
    translation is NONE

Before writing a partition table, the default geometry is based on size. Afterwards, it may be based on the partition table instead. Yes, writing an MS-DOS partition table can change the default geometry. Horrible misfeature, independent of target. Every time I try to kill a horrible misfeature, Kevin yells at me ;) Mitigating factor 1: no change when the partition table looks like LBA is active. Should be the common case nowadays. Mitigating factor 2: most guest software is blissfully unaware of geometry. If you guys don't like the way the patch goes, I can exclude it from the patch series and we can discuss it later.
But I thought that since it's related to the hard drive geometry, and since we change hd_geometry_guess in this patchset anyway, why not get rid of this problem as well. I think we should discuss what we actually want, then let you implement it. Perhaps the stupidest solution that could possibly work is the state after your PATCH 5:

If backend has a geometry (currently only DASD has; may change)
    // Incompatible change!
    use backend geometry
    translation is NONE
Else if we can't guess LCHS from MSDOS partition table
    guess geometry based on size
    translation is NONE for very small disks, else LBA
Else if LCHS guess has heads > 16
    BIOS LBA translation must be active
    guess geometry based on size
    translation is LARGE for sufficiently small disks, else LBA
Else
    use LCHS guess
    translation is NONE

Note that translation is relevant only for a PC machine's IDE disks. Everything else ignores it.
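The decision tree above can be written out as a tiny C function to make the branches explicit. This is an illustrative sketch only: the function name, the boolean inputs, and the "very small disk" threshold abstracted into a parameter are hypothetical, not QEMU's actual hd_geometry_guess() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the geometry/translation decision tree discussed
 * above (without the backend-geometry branch added by PATCH 5). */
enum Translation { TRANS_NONE, TRANS_LBA, TRANS_LARGE };

static enum Translation guess_translation(bool lchs_from_ptable, int heads,
                                          bool very_small_disk)
{
    if (!lchs_from_ptable) {
        /* No usable LCHS in the MS-DOS partition table: guess from size. */
        return very_small_disk ? TRANS_NONE : TRANS_LBA;
    }
    if (heads > 16) {
        /* More than 16 heads: BIOS LBA translation must be active. */
        return very_small_disk ? TRANS_LARGE : TRANS_LBA;
    }
    /* Plausible LCHS guess: use it directly, no translation. */
    return TRANS_NONE;
}
```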
Re: [Qemu-devel] [PATCH] block: do not use get_clock()
Stefan Hajnoczi stefa...@gmail.com writes: On Wed, Nov 26, 2014 at 03:01:02PM +0100, Paolo Bonzini wrote: Use the external qemu-timer API instead. Cc: kw...@redhat.com Cc: stefa...@redhat.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- block/accounting.c | 6 -- block/raw-posix.c | 8 2 files changed, 8 insertions(+), 6 deletions(-) Thanks, applied to my block-next tree: https://github.com/stefanha/qemu/commits/block-next Please wait for Paolo's v2 with rationale in the commit message.
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Am 28.11.2014 um 13:57 hat Kevin Wolf geschrieben: Am 27.11.2014 um 10:50 hat Peter Lieven geschrieben: On 26.11.2014 15:46, Kevin Wolf wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. As a side effect, the codepath taken by aio=threads is changed to use paio_submit_co(). This doesn't change the performance at this point. Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:

          |     aio=native       |     aio=threads
          | before  | with patch | before  | with patch
----------+---------+------------+---------+------------
run 1     | 29.921s | 26.932s    | 35.286s | 35.447s
run 2     | 29.793s | 26.252s    | 35.276s | 35.111s
run 3     | 30.186s | 27.114s    | 35.042s | 34.921s
run 4     | 30.425s | 26.600s    | 35.169s | 34.968s
run 5     | 30.041s | 26.263s    | 35.224s | 35.000s

TODO: Do some more serious benchmarking in VMs with less variance. Results of a quick fio run are vaguely positive. I still see the main-loop spun warnings with these patches applied to master. It wasn't there with the original patch from August.

~/git/qemu$ ./qemu-img bench -t none -c 1000 -n /dev/ram1
Sending 1000 requests, 4096 bytes each, 64 in parallel
main-loop: WARNING: I/O thread spun for 1000 iterations
Run completed in 31.947 seconds.

Yes, I still need to bisect that. The 'qemu-img bench' numbers above are actually also from August; we have regressed meanwhile by about a second, and I also haven't found the reason for that yet. Did the first part of this now. The commit that introduced the spun message is 2cdff7f6 ('linux-aio: avoid deadlock in nested aio_poll() calls'). The following patch doesn't make it go away completely, but I only see it sometime during like every other run now, instead of immediately after starting qemu-img bench. It's probably a (very) minor performance optimisation, too.
Kevin

diff --git a/block/linux-aio.c b/block/linux-aio.c
index fd8f0e4..1a0ec62 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -136,6 +136,8 @@ static void qemu_laio_completion_bh(void *opaque)
         qemu_laio_process_completion(s, laiocb);
     }
+
+    qemu_bh_cancel(s->completion_bh);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -143,7 +145,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
     struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
 
     if (event_notifier_test_and_clear(&s->e)) {
-        qemu_bh_schedule(s->completion_bh);
+        qemu_laio_completion_bh(s);
     }
 }
Re: [Qemu-devel] Qemu coroutine behaviour on blocking send(3)
On Fri, Nov 28, 2014 at 7:47 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Nov 28, 2014 at 01:55:00PM +0700, Iwan Budi Kusnanto wrote: I meant, will the coroutine yield internally when it gets blocked in send(3)? No. In general, QEMU will use non-blocking file descriptors so the blocking case does not apply. (You can't use blocking file descriptors in an event loop without a risk of blocking the entire event loop.) There is qemu_co_send(), which attempts the non-blocking send(2) and yields on EAGAIN. block/sheepdog.c and nbd.c both use this function. It's a little ugly because the caller must add/remove the socket write fd handler function so that the coroutine is re-entered when the fd becomes writable again. Stefan Thanks Stefan, it really helps. -- Iwan Budi Kusnanto
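The yield-on-EAGAIN pattern Stefan describes can be sketched with stand-ins for the socket and the coroutine machinery. Here fake_send() and coroutine_yield() are toy stubs (the fake socket "fills up" twice before accepting the data); this is not QEMU's qemu_co_send() implementation, just the shape of the loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Toy model: a non-blocking send that returns EAGAIN twice, and a yield
 * stub that only counts how often we would have suspended.  In real code,
 * the coroutine is re-entered once the fd becomes writable again. */
static int attempts;
static int yields;

static ssize_t fake_send(const char *buf, size_t len)
{
    (void)buf;
    if (++attempts < 3) {        /* socket buffer "full" twice */
        errno = EAGAIN;
        return -1;
    }
    return (ssize_t)len;         /* then the whole buffer goes through */
}

static void coroutine_yield(void)
{
    yields++;                    /* real code would suspend the coroutine */
}

static ssize_t co_send_all(const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = fake_send(buf + done, len - done);
        if (n < 0) {
            if (errno == EAGAIN) {
                coroutine_yield();   /* wait until the fd is writable */
                continue;
            }
            return -1;               /* real error */
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```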
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On 11/28/14 14:17, Mark Rutland wrote: Hi, On Fri, Nov 28, 2014 at 12:26:44PM +0000, Laszlo Ersek wrote: Peter Maydell suggested that we describe new devices / DTB nodes in the kernel Documentation tree that we expose to arm virt guests in QEMU. Although the kernel is not required to access the fw_cfg interface, Documentation/devicetree/bindings/arm is probably the best central spot to keep the fw_cfg description in.

Suggested-by: Peter Maydell peter.mayd...@linaro.org
Signed-off-by: Laszlo Ersek ler...@redhat.com
---
 Documentation/devicetree/bindings/arm/fw-cfg.txt | 47
 1 file changed, 47 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/fw-cfg.txt

diff --git a/Documentation/devicetree/bindings/arm/fw-cfg.txt b/Documentation/devicetree/bindings/arm/fw-cfg.txt
new file mode 100644
index 0000000..d131453
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/fw-cfg.txt
@@ -0,0 +1,47 @@
+* QEMU Firmware Configuration bindings for ARM
+
+QEMU's arm-softmmu and aarch64-softmmu emulation / virtualization targets
+provide the following Firmware Configuration interface on the virt machine
+type:
+
+- A write-only, 16-bit wide selector (or control) register,
+- a read-write, 8-bit wide data register.
+
+The guest writes a selector value (a key) to the selector register, and then
+can read the corresponding data (produced by QEMU) via the data port. If the
+selected entry is writable, the guest can rewrite it through the data port. The
+authoritative registry of the valid selector values and their meanings is the
+QEMU source code.
+
+QEMU exposes the control and data register to x86 guests at fixed IO ports. ARM
+guests can access them as memory mapped registers, and their location is
+communicated to the guest's UEFI firmware in the DTB that QEMU places at the
+bottom of the guest's DRAM.

What exactly is this interface used for? The firmware can collect bits of information about the virtual hardware.
Normally a (physical) platform comes with firmware that has intimate knowledge about the properties of the platform. For example, it knows what ACPI tables it should install for the OS that will match the hardware. In virt space, qemu owns the virtual hardware (and now it actually owns the ACPI table generation too), and the firmware projects (SeaBIOS and the virtual UEFI targets in edk2) are separate projects. So the fw_cfg interface allows qemu to parametrize the firmware code. There's a long list of keys that can be written to the selectors. Some of those are guest architecture invariant, some only defined (and redefined) for different architectures. I started to document them at one point (more than two years ago, for my own understanding), but it's a moving target, and my on-and-off effort could never catch up with the code. If such a document existed, it should exist in the docs/ subdir of the qemu tree. And what's the protocol that goes over the read/write sequences described above? It's a simple protocol on the outside: write a selector, read a number of data bytes (you must know the count in advance, unless the first thing in the returned blob is a header that advertises the returned blob's size). There's a special key that returns a directory of files; basically (name, key) pairs. Then you can request named blobs too, by doing a lookup by name in the directory first, and then using the corresponding key to select a blob. Beyond that, each key has complete freedom to define the structure of the associated blob. Also, some of the blobs can be rewritten (write selector, write a number of data bytes). You must know the number of bytes to write. Once fully rewritten, qemu can run a callback function. (Actually, a somewhat recent qemu feature is that even reads can trigger callbacks, which allows qemu to generate the data on-the-fly.) Is there any documentation of that we can refer to here? None that I know of. As I wrote, it's all in the QEMU source.
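The select-then-read protocol described above can be modeled with a few lines of C. This simulates the device side in memory rather than touching real registers; the only real-world detail assumed is that key 0x0000 conventionally returns the "QEMU" signature string, and even that is hedged behind a FAKE_ prefix here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy in-memory model of the fw_cfg register protocol: write a key to the
 * selector register, then read the selected blob byte by byte through the
 * data register.  Key 0x0000 is assumed to return the "QEMU" signature,
 * mirroring the conventional signature key; anything else reads as 0. */
#define FAKE_KEY_SIGNATURE 0x0000

static uint16_t selector;
static size_t data_pos;
static const char signature_blob[4] = "QEMU";

static void fw_cfg_select(uint16_t key)
{
    selector = key;
    data_pos = 0;              /* selecting a key rewinds the data stream */
}

static uint8_t fw_cfg_data_read(void)
{
    if (selector == FAKE_KEY_SIGNATURE && data_pos < sizeof(signature_blob)) {
        return (uint8_t)signature_blob[data_pos++];
    }
    return 0;                  /* reads past the blob return nothing */
}
```

Note how the reader must know the blob length in advance (four bytes here), exactly as described for the real interface.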
Is version information probeable from this, or does that need to be described? The outermost protocol (the behavior of the selector and data register, and a number of predefined selector values) have ossified. Most of the newer blobs are not associated with direct keys, but with fw_cfg file names (see the directory above). There is a revision (FW_CFG_ID) key, but it seems to be constant 1, so it's not actually used for versioning. In practice the set of available keys, files, and the structure of the individual files, have been a constant pain point wrt. versioning. The current practice is, as far as I can tell, to keep the name-contents associations, and the contents' structure, stable. If something new is needed, introduce a new name. And guests can probe for the existence of file names in the directory. +The guest kernel is not expected to use these registers (although it is +certainly allowed to); the device tree bindings are
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On 11/28/14 14:33, Mark Rutland wrote: On Fri, Nov 28, 2014 at 01:22:29PM +, Laszlo Ersek wrote: On 11/28/14 13:59, Arnd Bergmann wrote: We don't normally list contiguous registers separately. Maybe just round up to one page and make the register property reg = 0x0 0x902 0x0 0x1000; The registers are not necessarily placed shoulder to shoulder (it's not a requirement in the QEMU source). It might not be at the moment, but if no-one separates them it could be a requirement for using this binding. Is there any benefit to having them separated? I couldn't name any. Thanks Laszlo
[Qemu-devel] async block driver questions
Hi all, I have some questions about developing an async network-protocol-based block driver for qemu. The network file server has an async client library, and I think it would be good if I can use it. 1. Is it OK to use libevent inside a block driver? What makes me worried is that qemu itself is said to be an event-driven program with its own main loop. 2. Any example of how to create many coroutines and wait for their completion? I thought I could just do qemu_coroutine_create followed by qemu_coroutine_enter, but things become complex because qemu_coroutine_yield returns control back to the caller. I also found CoQueue, but I'm not sure how to use it. 3. Do we need to make our bdrv_read and bdrv_write thread safe? Thanks in advance. -- Iwan Budi Kusnanto
Re: [Qemu-devel] [PATCH RFC for-2.2] virtio-blk: force 1st s/g to match header
On Fri, 28 Nov 2014 11:43:59 + Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Nov 28, 2014 at 7:05 AM, Jason Wang jasow...@redhat.com wrote: On Fri, Nov 28, 2014 at 9:16 AM, Fam Zheng f...@redhat.com wrote: On Thu, 11/27 23:13, Michael S. Tsirkin wrote: On Thu, Nov 27, 2014 at 07:21:35PM +, Stefan Hajnoczi wrote: On Thu, Nov 27, 2014 at 4:33 PM, Michael S. Tsirkin m...@redhat.com wrote: We leak cpu mappings when the 1st s/g is not exactly the header. As we don't set ANY_LAYOUT, we can at this point simply assert the correct length. This will have to be fixed once ANY_LAYOUT is set. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Untested: posting for early feedback. Reviewed-by: Stefan Hajnoczi stefa...@redhat.com Stefan, are you going to merge this for 2.2? Peter, if Stefan is merging this one, we can take Jason's patch for -net and that should be enough for 2.2. My test bot saw a failure of make check:

qemu-system-x86_64: virtio-blk request outhdr too long
Broken pipe
GTester: last random seed: R02S44c5a09d74c0603f0091ed20d1395121
qemu-system-x86_64: virtio-blk request outhdr too long
Broken pipe
GTester: last random seed: R02Sfdf270c8e1ed75c1f2047e3f0fb89b88
qemu-system-x86_64: virtio-blk request outhdr too long
Broken pipe
GTester: last random seed: R02Sd292c540929b963ed9d7d155f3db5452
qemu-system-x86_64: virtio-blk request outhdr too long
Broken pipe
GTester: last random seed: R02Sfc775a34a844c39f376aa478eda7573f
[vmxnet3][WR][vmxnet3_peer_has_vnet_hdr]: Peer has no virtio extension. Task offloads will be emulated.
make: *** [check-qtest-x86_64] Error 1

Fam This is probably because of the test itself. But anyway all kinds of guests (especially Windows drivers) need to be tested. Right, the test case explicitly tests different descriptor layouts, even though virtio-blk-pci does not set the ANY_LAYOUT feature bit.
Either the test case needs to check ANY_LAYOUT before using the 2-descriptor layout or it needs to expect QEMU to refuse (in this case exit(1), which is not very graceful). The quick fix is to skip the 2-descriptor layout tests and re-enable them once virtio-blk actually supports ANY_LAYOUT. Any objections? Stefan To be compliant with the specs, the test should check for ANY_LAYOUT feature before testing with 2 descriptor layout. I'll add it with the other changes pending for the test. Marc
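The guard the test case needs is a simple feature-bit check before using the 2-descriptor layout. VIRTIO_F_ANY_LAYOUT is bit 27 in the legacy virtio feature word; the helper name below is made up for illustration, it is not the qtest API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Legacy virtio feature bit: device accepts arbitrary descriptor layouts. */
#define VIRTIO_F_ANY_LAYOUT 27

/* Illustrative helper: should the test use the 2-descriptor layout? */
static bool virtio_has_any_layout(uint32_t host_features)
{
    return (host_features & (1u << VIRTIO_F_ANY_LAYOUT)) != 0;
}
```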
[Qemu-devel] [PATCH 0/7] coroutine: optimizations
As discussed in the other thread, this brings speedups from dropping the coroutine mutex (which serializes multiple iothreads, too) and using ELF thread-local storage. The speedup in perf/cost is about 30% (190 -> 145). Windows port tested with tests/test-coroutine.exe under Wine.

Paolo

Paolo Bonzini (7):
  coroutine-ucontext: use __thread
  qemu-thread: add per-thread atexit functions
  test-coroutine: avoid overflow on 32-bit systems
  QSLIST: add lock-free operations
  coroutine: rewrite pool to avoid mutex
  coroutine: drop qemu_coroutine_adjust_pool_size
  coroutine: try harder not to delete coroutines

 block/block-backend.c     |   4 --
 coroutine-ucontext.c      |  64 +++-
 include/block/coroutine.h |  10 -
 include/qemu/queue.h      |  15 ++-
 include/qemu/thread.h     |   4 ++
 qemu-coroutine.c          | 104 ++
 tests/test-coroutine.c    |   2 +-
 util/qemu-thread-posix.c  |  37 +
 util/qemu-thread-win32.c  |  48 -
 9 files changed, 157 insertions(+), 131 deletions(-)

-- 
2.1.0
[Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems
unsigned long is not large enough to represent 1000000000 * duration there. Just use floating point.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 tests/test-coroutine.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
index e22fae1..27d1b6f 100644
--- a/tests/test-coroutine.c
+++ b/tests/test-coroutine.c
@@ -337,7 +337,7 @@ static void perf_cost(void)
                    "%luns per coroutine",
                    maxcycles,
                    duration, ops,
-                   (unsigned long)(1000000000 * duration) / maxcycles);
+                   (unsigned long)(1000000000.0 * duration / maxcycles));
 }
 
 int main(int argc, char **argv)
-- 
2.1.0
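To see why the cast has to move, consider that on a 32-bit host unsigned long is 32 bits (max about 4.29e9), while 1000000000 * duration is around 1e10 for a ten-second run. Doing the whole computation in double and truncating only the small per-operation quotient stays in range. The sketch below uses uint32_t to model a 32-bit unsigned long; the sample numbers are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed form of the computation: multiply and divide entirely in double,
 * then truncate only the final, small quotient.  uint32_t stands in for
 * a 32-bit host's unsigned long. */
static uint32_t ns_per_coroutine(double duration_s, double maxcycles)
{
    double total_ns = 1000000000.0 * duration_s;  /* may exceed UINT32_MAX */
    return (uint32_t)(total_ns / maxcycles);      /* quotient is small */
}
```

Truncating total_ns first, as the old code effectively did, would wrap (or worse, invoke undefined behavior for an out-of-range double-to-integer conversion).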
[Qemu-devel] [PATCH 2/7] qemu-thread: add per-thread atexit functions
Destructors are the main additional feature of pthread TLS compared to __thread. If we were using C++ (hint, hint!) we could have used thread-local objects with a destructor. Since we are not, instead, we add a simple Notifier-based API. Note that the notifier must be per-thread as well. We can add a global list as well later, perhaps.

The Win32 implementation has some complications because a) detached threads used not to have a QemuThreadData; b) the main thread does not go through win32_start_routine, so we have to use atexit too.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 include/qemu/thread.h    |  4
 util/qemu-thread-posix.c | 37 +
 util/qemu-thread-win32.c | 48 +---
 3 files changed, 78 insertions(+), 11 deletions(-)

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index f7e3b9b..e89fdc9 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -61,4 +61,8 @@ bool qemu_thread_is_self(QemuThread *thread);
 void qemu_thread_exit(void *retval);
 void qemu_thread_naming(bool enable);
 
+struct Notifier;
+void qemu_thread_atexit_add(struct Notifier *notifier);
+void qemu_thread_atexit_remove(struct Notifier *notifier);
+
 #endif
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index d05a649..41cb23d 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -26,6 +26,7 @@
 #endif
 #include "qemu/thread.h"
 #include "qemu/atomic.h"
+#include "qemu/notify.h"
 
 static bool name_threads;
 
@@ -401,6 +402,42 @@ void qemu_event_wait(QemuEvent *ev)
     }
 }
 
+static pthread_key_t exit_key;
+
+union NotifierThreadData {
+    void *ptr;
+    NotifierList list;
+};
+QEMU_BUILD_BUG_ON(sizeof(union NotifierThreadData) != sizeof(void *));
+
+void qemu_thread_atexit_add(Notifier *notifier)
+{
+    union NotifierThreadData ntd;
+    ntd.ptr = pthread_getspecific(exit_key);
+    notifier_list_add(&ntd.list, notifier);
+    pthread_setspecific(exit_key, ntd.ptr);
+}
+
+void qemu_thread_atexit_remove(Notifier *notifier)
+{
+    union NotifierThreadData ntd;
+    ntd.ptr = pthread_getspecific(exit_key);
+    notifier_remove(notifier);
+    pthread_setspecific(exit_key, ntd.ptr);
+}
+
+static void qemu_thread_atexit_run(void *arg)
+{
+    union NotifierThreadData ntd = { .ptr = arg };
+    notifier_list_notify(&ntd.list, NULL);
+}
+
+static void __attribute__((constructor)) qemu_thread_atexit_init(void)
+{
+    pthread_key_create(&exit_key, qemu_thread_atexit_run);
+}
+
+
 /* Attempt to set the threads name; note that this is for debug, so
  * we're not going to fail if we can't set it.
  */
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index c405c9b..7bda85b 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -12,6 +12,7 @@
  */
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/notify.h"
 #include <process.h>
 #include <assert.h>
 #include <limits.h>
@@ -268,6 +269,7 @@ struct QemuThreadData {
     void *(*start_routine)(void *);
     void *arg;
     short mode;
+    NotifierList exit;
 
     /* Only used for joinable threads. */
     bool exited;
@@ -275,18 +277,40 @@ struct QemuThreadData {
     CRITICAL_SECTION cs;
 };
 
+static bool atexit_registered;
+static NotifierList main_thread_exit;
+
 static __thread QemuThreadData *qemu_thread_data;
 
+static void run_main_iothread_exit(void)
+{
+    notifier_list_notify(&main_thread_exit, NULL);
+}
+
+void qemu_thread_atexit_add(Notifier *notifier)
+{
+    if (!qemu_thread_data) {
+        if (!atexit_registered) {
+            atexit_registered = true;
+            atexit(run_main_iothread_exit);
+        }
+        notifier_list_add(&main_thread_exit, notifier);
+    } else {
+        notifier_list_add(&qemu_thread_data->exit, notifier);
+    }
+}
+
+void qemu_thread_atexit_remove(Notifier *notifier)
+{
+    notifier_remove(notifier);
+}
+
 static unsigned __stdcall win32_start_routine(void *arg)
 {
     QemuThreadData *data = (QemuThreadData *) arg;
     void *(*start_routine)(void *) = data->start_routine;
     void *thread_arg = data->arg;
 
-    if (data->mode == QEMU_THREAD_DETACHED) {
-        g_free(data);
-        data = NULL;
-    }
     qemu_thread_data = data;
     qemu_thread_exit(start_routine(thread_arg));
     abort();
@@ -296,12 +320,14 @@ void qemu_thread_exit(void *arg)
 {
     QemuThreadData *data = qemu_thread_data;
 
-    if (data) {
-        assert(data->mode != QEMU_THREAD_DETACHED);
+    notifier_list_notify(&data->exit, NULL);
+    if (data->mode == QEMU_THREAD_JOINABLE) {
         data->ret = arg;
         EnterCriticalSection(&data->cs);
         data->exited = true;
         LeaveCriticalSection(&data->cs);
+    } else {
+        g_free(data);
     }
     _endthreadex(0);
 }
@@ -313,9 +339,10 @@ void *qemu_thread_join(QemuThread *thread)
     HANDLE handle;
 
     data = thread->data;
-    if (!data) {
+    if (data->mode ==
[Qemu-devel] [PATCH 4/7] QSLIST: add lock-free operations
These operations are trivial to implement and do not have ABA problems. They are enough to implement simple multiple-producer, single-consumer lock-free lists or, as in the next patch, the multiple consumers can steal a whole batch of elements and process them at their leisure.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 include/qemu/queue.h | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index d433b90..6a01e2f 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -191,8 +191,19 @@ struct {                                                \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_INSERT_HEAD(head, elm, field) do {                        \
-    (elm)->field.sle_next = (head)->slh_first;                           \
-    (head)->slh_first = (elm);                                           \
+    (elm)->field.sle_next = (head)->slh_first;                           \
+    (head)->slh_first = (elm);                                           \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do {                 \
+    do {                                                                 \
+        (elm)->field.sle_next = (head)->slh_first;                       \
+    } while (atomic_cmpxchg(&(head)->slh_first, (elm)->field.sle_next,   \
+                            (elm)) != (elm)->field.sle_next);            \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_MOVE_ATOMIC(dest, src) do {                               \
+    (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);            \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_REMOVE_HEAD(head, field) do {                             \
-- 
2.1.0
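The two primitives can be modeled in plain C11 atomics: a push that retries a compare-and-exchange until it wins, and a "move" that detaches the entire list with one exchange. This is a simplified, non-macro sketch of the same idea, not the qemu/queue.h code itself; the single-shot steal is exactly why there is no ABA problem here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal model of QSLIST_INSERT_HEAD_ATOMIC / QSLIST_MOVE_ATOMIC. */
struct node {
    struct node *next;
    int value;
};

static _Atomic(struct node *) head;

static void push_atomic(struct node *n)
{
    struct node *old = atomic_load(&head);
    do {
        n->next = old;           /* link to the current head... */
    } while (!atomic_compare_exchange_weak(&head, &old, n));
    /* ...and retry if another producer raced us (old is reloaded). */
}

static struct node *move_atomic(void)
{
    /* Steal the whole list in one shot, leaving it empty.  Consumers never
     * pop individual elements, so no ABA hazard can arise. */
    return atomic_exchange(&head, NULL);
}
```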
[Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of the pool. Lock-free stacks and queues are not hard, but they can suffer from the ABA problem, so they are better avoided unless you have some deferred reclamation scheme like RCU. Otherwise you have to stick with adding to a list, and emptying it completely. This is what this patch does, by coupling a lock-free global list of available coroutines with per-CPU lists that are actually used on coroutine creation. Whenever the destruction pool is big enough, the next thread that runs out of coroutines will steal the whole destruction pool. This is positive in two ways: 1) the allocation does not have to do any atomic operation in the fast path, it's entirely using thread-local storage. Once every POOL_BATCH_SIZE allocations it will do a single atomic_xchg. Release does an atomic_cmpxchg loop, that hopefully doesn't cause any starvation, and an atomic_inc. 2) in theory this should be completely adaptive. The number of coroutines around should be a little more than POOL_BATCH_SIZE * number of allocating threads; so this also removes qemu_coroutine_adjust_pool_size. (The previous pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit more generous. But if you actually have many high-iodepth disks, it's better to put them in different iothreads, which will also use separate thread pools and aio file descriptors). This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33. I still believe we will end up with some kind of coroutine bypass scheme (even coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv directly can help), but hey it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 qemu-coroutine.c | 93 +---
 1 file changed, 42 insertions(+), 51 deletions(-)

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index bd574aa..aee1017 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -15,31 +15,57 @@
 #include "trace.h"
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/atomic.h"
 #include "block/coroutine.h"
 #include "block/coroutine_int.h"
 
 enum {
-    POOL_DEFAULT_SIZE = 64,
+    POOL_BATCH_SIZE = 64,
 };
 
 /** Free list to speed up creation */
-static QemuMutex pool_lock;
-static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
-static unsigned int pool_max_size = POOL_DEFAULT_SIZE;
+static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
+static unsigned int release_pool_size;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread Notifier coroutine_pool_cleanup_notifier;
+
+static void coroutine_pool_cleanup(Notifier *n, void *value)
+{
+    Coroutine *co;
+    Coroutine *tmp;
+
+    QSLIST_FOREACH_SAFE(co, &alloc_pool, pool_next, tmp) {
+        QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
+        qemu_coroutine_delete(co);
+    }
+}
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 {
     Coroutine *co = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        co = QSLIST_FIRST(&pool);
+        co = QSLIST_FIRST(&alloc_pool);
+        if (!co) {
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* Slow path; a good place to register the destructor, too.  */
+                if (!coroutine_pool_cleanup_notifier.notify) {
+                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+                }
+
+                /* This is not exact; there could be a little skew between
+                 * release_pool_size and the actual size of release_pool.  But
+                 * it is just a heuristic, it does not need to be perfect.
+                 */
+                release_pool_size = 0;
+                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+                co = QSLIST_FIRST(&alloc_pool);
+            }
+        }
         if (co) {
-            QSLIST_REMOVE_HEAD(&pool, pool_next);
-            pool_size--;
+            QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
         }
-        qemu_mutex_unlock(&pool_lock);
     }
 
     if (!co) {
@@ -53,39 +80,19 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 
 static void coroutine_delete(Coroutine *co)
 {
+    co->caller = NULL;
+
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        if (pool_size < pool_max_size) {
-            QSLIST_INSERT_HEAD(&pool, co, pool_next);
-            co->caller = NULL;
-            pool_size++;
-            qemu_mutex_unlock(&pool_lock);
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
+            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
+            atomic_inc(&release_pool_size);
             return;
         }
-        qemu_mutex_unlock(&pool_lock);
     }
[Qemu-devel] [PATCH 6/7] coroutine: drop qemu_coroutine_adjust_pool_size
This is not needed anymore. The new TLS-based algorithm is adaptive.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 block/block-backend.c     |  4
 include/block/coroutine.h | 10 --
 qemu-coroutine.c          |  4
 3 files changed, 18 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d0692b1..abf0cd1 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -260,9 +260,6 @@ int blk_attach_dev(BlockBackend *blk, void *dev)
     blk_ref(blk);
     blk->dev = dev;
     bdrv_iostatus_reset(blk->bs);
-
-    /* We're expecting I/O from the device so bump up coroutine pool size */
-    qemu_coroutine_adjust_pool_size(COROUTINE_POOL_RESERVATION);
     return 0;
 }
 
@@ -290,7 +287,6 @@ void blk_detach_dev(BlockBackend *blk, void *dev)
     blk->dev_ops = NULL;
     blk->dev_opaque = NULL;
     bdrv_set_guest_block_size(blk->bs, 512);
-    qemu_coroutine_adjust_pool_size(-COROUTINE_POOL_RESERVATION);
     blk_unref(blk);
 }
 
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 793df0e..20c027a 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -216,14 +216,4 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
  */
 void coroutine_fn yield_until_fd_readable(int fd);
 
-/**
- * Add or subtract from the coroutine pool size
- *
- * The coroutine implementation keeps a pool of coroutines to be reused by
- * qemu_coroutine_create().  This makes coroutine creation cheap.  Heavy
- * coroutine users should call this to reserve pool space.  Call it again with
- * a negative number to release pool space.
- */
-void qemu_coroutine_adjust_pool_size(int n);
-
 #endif /* QEMU_COROUTINE_H */
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index aee1017..ca40f4f 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -144,7 +144,3 @@ void coroutine_fn qemu_coroutine_yield(void)
     self->caller = NULL;
     coroutine_swap(self, to);
 }
-
-void qemu_coroutine_adjust_pool_size(int n)
-{
-}
-- 
2.1.0