Re: [Qemu-devel] [RFC PATCH] vring: calculate descriptor address directly
Am 27.11.2014 um 17:49 schrieb Paolo Bonzini: On 27/11/2014 11:15, Peter Lieven wrote: vring_map causes a huge overhead by calling memory_region_find every time. vring_map is already executed on vring_setup, and the memory region is also referenced there. Signed-off-by: Peter Lieven p...@kamp.de vring_map/unmap is going to disappear and be replaced by address_space_map/unmap, so for now I'd rather keep it... the issue with vring_map/unmap is that it is called at every vring_pop/push to calculate the descriptor ptr. It's OK to call it in vring_setup/destroy, where it doesn't hurt. Looking at address_space_map/unmap, this would be expensive as well if it were called on every vring_pop/push. Peter
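The caching Peter describes can be sketched in plain C. This is a hypothetical, simplified model (the names and the 16-byte descriptor layout are illustrative, not QEMU's actual vring code): the expensive lookup happens once at setup time, and the pop/push hot path reduces to pointer arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte descriptor, loosely modelled on struct vring_desc. */
typedef struct {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
} Desc;

typedef struct {
    Desc *desc_base;    /* host pointer, cached once at setup time */
    unsigned num;
} Vring;

/* The equivalent of vring_setup(): do the one expensive lookup here. */
static void vring_cache_desc_base(Vring *vr, void *mapped, unsigned num)
{
    vr->desc_base = mapped;    /* result of the one-time address lookup */
    vr->num = num;
}

/* Hot path (vring_pop/push): plain arithmetic, no lookup per request. */
static Desc *vring_desc_ptr(Vring *vr, unsigned i)
{
    assert(i < vr->num);
    return &vr->desc_base[i];
}
```

The trade-off Paolo alludes to is that a cached mapping must be invalidated if the guest memory layout changes, which is what the address_space_map/unmap conversion is meant to handle properly.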
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
On 11/28/14, Markus Armbruster arm...@redhat.com wrote: Ming Lei ming@canonical.com writes: Hi Kevin, On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf kw...@redhat.com wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. I am not sure it is a good way for Linux AIO optimization: - for a raw image with some constraints, the coroutine can be avoided since io_submit() won't sleep most of the time - handling one coroutine takes more time than handling malloc, memset and free on a small buffer, per the following test data: -- 241ns per coroutine What do you mean by coroutine here? Create + destroy? Yield? Please see perf_cost() in tests/test-coroutine.c Thanks, Ming Lei
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 27.11.2014 um 17:40 schrieb Paolo Bonzini: On 27/11/2014 11:27, Peter Lieven wrote: +static __thread struct CoRoutinePool { +Coroutine *ptrs[POOL_MAX_SIZE]; +unsigned int size; +unsigned int nextfree; +} CoPool; The per-thread ring unfortunately didn't work well last time it was tested. Devices that do not use ioeventfd (not just the slow ones, even decently performing ones like ahci, nvme or megasas) will create the coroutine in the VCPU thread, and destroy it in the iothread. The result is that coroutines cannot be reused. Can you check if this is still the case? I already tested, at least for IDE and for ioeventfd=off: the coroutine is created in the vCPU thread and destroyed in the I/O thread. I also have a more complicated version which sets the per-thread coroutine pool only for dataplane, avoiding the lock for dedicated iothreads. For those who want to take a look: https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e Peter
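For reference, the thread-local pool idea in the patch fragment above can be sketched as a minimal runnable model (simplified: void pointers instead of real Coroutine objects, and no global release pool). It also illustrates Paolo's objection: a pointer freed on thread B never lands back in thread A's pool.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MAX_SIZE 64

/* One pool per thread: a fixed-size stack of recycled pointers. */
static __thread struct {
    void *ptrs[POOL_MAX_SIZE];
    unsigned int size;
} co_pool;

/* NULL means this thread's pool is empty and a fresh coroutine
 * must be created the slow way. */
static void *pool_get(void)
{
    return co_pool.size ? co_pool.ptrs[--co_pool.size] : NULL;
}

/* Returns 0 when recycled into this thread's pool, -1 when the pool
 * is full and the caller must really destroy the coroutine. */
static int pool_put(void *co)
{
    if (co_pool.size == POOL_MAX_SIZE) {
        return -1;
    }
    co_pool.ptrs[co_pool.size++] = co;
    return 0;
}
```

Because `co_pool` is `__thread`, a `pool_put()` in the iothread fills a pool the vCPU thread's `pool_get()` never sees, which is exactly the create-in-one-thread/destroy-in-another reuse failure being discussed.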
Re: [Qemu-devel] [PATCH v2 1/6] geometry: add bdrv functions for geometry and blocksize
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: On 11/27/2014 05:55 PM, Markus Armbruster wrote: I'm sorry for the delay. I got the flu and have difficulties thinking straight for longer than a few minutes. Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: Add driver functions for geometry and blocksize detection Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- block.c | 26 ++ include/block/block.h | 20 include/block/block_int.h | 3 +++ 3 files changed, 49 insertions(+) diff --git a/block.c b/block.c index a612594..5df35cf 100644 --- a/block.c +++ b/block.c @@ -548,6 +548,32 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp) } } +struct ProbeBlockSize bdrv_probe_blocksizes(BlockDriverState *bs) +{ +BlockDriver *drv = bs->drv; +struct ProbeBlockSize err_geo = { .rc = -1 }; + +assert(drv != NULL); +if (drv->bdrv_probe_blocksizes) { +return drv->bdrv_probe_blocksizes(bs); +} + +return err_geo; +} I'm not sure about probe. Naming is hard. get? There was already bdrv_get_geometry, and I wanted the _geometry and _blocksize functions to use the same naming convention. Fair enough. bdrv_get_geometry() is a silly wrapper around bdrv_nb_sectors() that maps failure to zero sectors. I hope to kill it some time. Doesn't help you now. I thought probe might be more suitable, since we do not get the value for sure. maybe detect? Feel free to stick to probe. Squashing status and actual payload into a single struct to use as return type isn't wrong, but unusual. When the payload can't represent failure conveniently, we usually return status, and write the payload to a buffer provided by the caller, like this: int bdrv_get_blocksizes(BlockDriverState *bs, uint16_t *physical_blk_sz, uint16_t *logical_blk_sz) Or, with a struct to hold both sizes: int bdrv_get_blocksizes(BlockDriverState *bs, BlockSizes *bsz) Do you insist on changing that?
Returning a struct via the stack seemed useful to me, since there was less probability of the caller allocating a buffer of an incorrect size or something like that. You'd have to do crazy stuff to defeat the static type checker. Please stick to the common technique. [...]
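The two conventions under discussion can be contrasted in a small self-contained sketch (the names and the 4096/512 sizes are placeholders, not QEMU's actual implementation):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint16_t phys; uint16_t log; } BlockSizes;

/* Variant under discussion: status and payload squashed into one struct. */
struct ProbeBlockSize { int rc; BlockSizes size; };

static struct ProbeBlockSize probe_struct_return(int have_driver)
{
    struct ProbeBlockSize r = { .rc = -1 };   /* failure by default */
    if (have_driver) {
        r.rc = 0;
        r.size.phys = 4096;
        r.size.log = 512;
    }
    return r;
}

/* Common QEMU convention: return status, write payload to the caller's
 * buffer.  The payload is left untouched on failure. */
static int probe_out_param(int have_driver, BlockSizes *bsz)
{
    if (!have_driver) {
        return -1;
    }
    bsz->phys = 4096;
    bsz->log = 512;
    return 0;
}
```

As Markus notes, both are type-safe; the out-parameter form merely matches the surrounding codebase, which is the argument for preferring it.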
Re: [Qemu-devel] [PATCH 0/2] virtfs-proxy-helper: Fix Coverity warning
On 27/11/2014 11:29, Gonglei wrote: On 2014/11/13 20:17, Gonglei (Arei) wrote: From: Gonglei arei.gong...@huawei.com Gonglei (2): virtfs-proxy-helper: Fix possible socket leak. virtfs-proxy-helper: Fix handle leak to make Coverity happy fsdev/virtfs-proxy-helper.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) Ping... Hi Gonglei---thanks, I had missed this and Michael is probably not sending qemu-trivial patches because we're so close to the release. I'll pick up patch 1. Paolo
Re: [Qemu-devel] [PATCH 0/2] virtfs-proxy-helper: Fix Coverity warning
On 2014/11/28 16:31, Paolo Bonzini wrote: On 27/11/2014 11:29, Gonglei wrote: On 2014/11/13 20:17, Gonglei (Arei) wrote: From: Gonglei arei.gong...@huawei.com Gonglei (2): virtfs-proxy-helper: Fix possible socket leak. virtfs-proxy-helper: Fix handle leak to make Coverity happy fsdev/virtfs-proxy-helper.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) Ping... Hi Gonglei---thanks, I had missed this and Michael is probably not sending qemu-trivial patches because we're so close to the release. I'll pick up patch 1. Paolo Thanks. Per Markus's suggestion, we drop patch 2. :) Regards, -Gonglei
Re: [Qemu-devel] MinGW build
On 28 November 2014 at 07:14, Stefan Weil s...@weilnetz.de wrote: The libvixl code is correct, but the C++ compiler would need to be fixed. Here are some examples: disas/libvixl/a64/disasm-a64.cc:1340:57: warning: unknown conversion type character ‘l’ in format [-Wformat] disas/libvixl/a64/disasm-a64.cc:1340:57: warning: too many arguments for format [-Wformat-extra-args] disas/libvixl/a64/disasm-a64.cc:1492:42: warning: unknown conversion type character ‘l’ in format [-Wformat] That code uses PRIx64, so the format specifier is %llx, which is correct. Obviously the C++ compiler ignores the fact that QEMU uses ANSI format specifiers (compiler option -D__USE_MINGW_ANSI_STDIO=1) instead of the MS-specific ones. Lovely. A simple workaround for QEMU would be to just suppress -Wformat and -Wformat-extra-args for C++ code. I'd prefer not to lose those (just because once you turn off a warning it's never going to come back, practically speaking). We could have a configure test that checked whether the C++ compiler understood %llx, maybe? -- PMM
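The configure test Peter floats could boil down to compiling and running a probe like the following with the C++ compiler under test (a sketch, not an actual QEMU configure fragment; on a MinGW build it only passes when -D__USE_MINGW_ANSI_STDIO=1 is honoured):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the toolchain's printf honours the PRIx64 specifier,
 * i.e. the ANSI %llx path rather than the MS %I64x-only one. */
static int prix64_works(void)
{
    char buf[32];
    uint64_t v = 0x123456789abcdef0ULL;

    snprintf(buf, sizeof(buf), "%" PRIx64, v);
    return strcmp(buf, "123456789abcdef0") == 0;
}
```

A compile-only variant (checking that -Wformat stays quiet with -Werror) would fit configure's usual pattern better than a run test, since QEMU is frequently cross-compiled for Windows.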
Re: [Qemu-devel] [PATCH v2 3/6] geometry: Add driver methods to probe blocksizes and geometry
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: This patch introduces driver methods for detecting disk blocksizes (physical and logical) and hard drive geometry. The method is only implemented for host_device. For raw devices, the driver calls the child's method. The detection will only work for DASD devices. In order to check for that, a local CheckForDASD function was introduced, which calls the BIODASDINFO2 ioctl and returns 0 only if it succeeds. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- block/raw-posix.c | 65 +++ block/raw_bsd.c | 12 ++ 2 files changed, 77 insertions(+) diff --git a/block/raw-posix.c b/block/raw-posix.c index 45f1d79..274ceda 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -56,6 +56,7 @@ #include <linux/cdrom.h> #include <linux/fd.h> #include <linux/fs.h> +#include <linux/hdreg.h> #ifndef FS_NOCOW_FL #define FS_NOCOW_FL 0x00800000 /* Do not cow file */ #endif @@ -93,6 +94,10 @@ #include <xfs/xfs.h> #endif +#ifdef __s390__ +#include <asm/dasd.h> +#endif + //#define DEBUG_FLOPPY //#define DEBUG_BLOCK @@ -678,6 +683,64 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp) bs->bl.opt_mem_alignment = s->buf_align; } +static int CheckForDASD(int fd) As Christian said, no CamelCase for functions. I guess your function returns either 0 or -1. Can't say for sure without looking up BIODASDINFO2. I'd do a slightly higher level is_dasd() returning bool. Your choice. +{ +#ifdef BIODASDINFO2 +struct dasd_information2_t info = {0}; + +return ioctl(fd, BIODASDINFO2, &info); +#endif +return -1; +} + +static struct ProbeBlockSize hdev_probe_blocksizes(BlockDriverState *bs) +{ +BDRVRawState *s = bs->opaque; +struct ProbeBlockSize block_sizes = {0}; + +block_sizes.rc = CheckForDASD(s->fd); +/* If DASD, get blocksizes */ +if (block_sizes.rc == 0) { +block_sizes.size.log = probe_logical_blocksize(bs, s->fd); +block_sizes.size.phys = probe_physical_blocksize(bs, s->fd); +} + +return block_sizes; +} Fails unless DASD. Why?
The block size concept applies not just to DASD, and the probe_*_blocksize() functions should just work, shouldn't they? + +static struct ProbeGeometry hdev_probe_geometry(BlockDriverState *bs) +{ +BDRVRawState *s = bs->opaque; +struct ProbeGeometry geometry = {0}; +struct hd_geometry ioctl_geo = {0}; Is this initializer really necessary? Because I like my local variable names short & sweet, I'd call this one geo. Your choice. + +geometry.rc = CheckForDASD(s->fd); +if (geometry.rc != 0) { Works if your function really returns either 0 or -1. If it can return a positive value, it breaks the callback's contract. +goto done; +} +/* If DASD, get it's geometry */ its +geometry.rc = ioctl(s->fd, HDIO_GETGEO, &ioctl_geo); +/* Do not return a geometry for partition */ +if (ioctl_geo.start != 0) { +geometry.rc = -1; +goto done; +} +/* HDIO_GETGEO may return success even though geo contains zeros + (e.g. certain multipath setups) */ +if (!ioctl_geo.heads || !ioctl_geo.sectors || !ioctl_geo.cylinders) { +geometry.rc = -1; +goto done; +} +if (geometry.rc == 0) { +geometry.geo.heads = (uint32_t)ioctl_geo.heads; +geometry.geo.sectors = (uint32_t)ioctl_geo.sectors; +geometry.geo.cylinders = (uint32_t)ioctl_geo.cylinders; The LHS is uint32_t, the RHS is unsigned char or unsigned short, thus the type casts are superfluous clutter. 8-bit head * 8-bit sectors * 16-bit cylinders can't represent today's disk sizes, so ioctl_geo.cylinders is generally useless. The common advice is to ignore it, and do something like geometry.geo.cylinders = nb_sectors / (ioctl_geo.heads * ioctl_geo.sectors) A possible alternative is to explicitly document that the returned cylinders value can't be trusted. Another one is not to return the number of cylinders :) +geometry.geo.start = (uint32_t)ioctl_geo.start; Always assigns zero (we checked above). Why is member geo_start useful? +} +done: + return geometry; +} Also fails unless DASD.
The geometry concept applies pretty much only to disks older than thirty years or so, and to software designed for them, like HDIO_GETGEO. + static ssize_t handle_aiocb_ioctl(RawPosixAIOData *aiocb) { int ret; @@ -2128,6 +2191,8 @@ static BlockDriver bdrv_host_device = { .bdrv_get_info = raw_get_info, .bdrv_get_allocated_file_size = raw_get_allocated_file_size, +.bdrv_probe_blocksizes = hdev_probe_blocksizes, +.bdrv_probe_geometry = hdev_probe_geometry, .bdrv_detach_aio_context = raw_detach_aio_context, .bdrv_attach_aio_context = raw_attach_aio_context, diff --git
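The higher-level is_dasd() helper Markus suggests could look like the following sketch (hedged: the guard structure mirrors the patch's #ifdef __s390__ / #ifdef BIODASDINFO2 split, since BIODASDINFO2 is only defined after including asm/dasd.h; this is not code from the thread):

```c
#include <assert.h>
#include <stdbool.h>
#ifdef __s390__
#include <sys/ioctl.h>
#include <asm/dasd.h>
#endif

/* bool wrapper around the BIODASDINFO2 ioctl; compiles to a stub
 * returning false on hosts without DASD support. */
static bool is_dasd(int fd)
{
#ifdef BIODASDINFO2
    struct dasd_information2_t info = { 0 };

    return ioctl(fd, BIODASDINFO2, &info) == 0;
#else
    (void)fd;
    return false;
#endif
}
```

Returning bool sidesteps the 0-or-minus-1 contract question raised in the review: callers simply branch on the result instead of interpreting a status code.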
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Ming Lei ming@canonical.com writes: On 11/28/14, Markus Armbruster arm...@redhat.com wrote: Ming Lei ming@canonical.com writes: Hi Kevin, On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf kw...@redhat.com wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. I am not sure it is a good way for Linux AIO optimization: - for a raw image with some constraints, the coroutine can be avoided since io_submit() won't sleep most of the time - handling one coroutine takes more time than handling malloc, memset and free on a small buffer, per the following test data: -- 241ns per coroutine What do you mean by coroutine here? Create + destroy? Yield? Please see perf_cost() in tests/test-coroutine.c static __attribute__((noinline)) void perf_cost_func(void *opaque) { qemu_coroutine_yield(); } static void perf_cost(void) { const unsigned long maxcycles = 40000000; unsigned long i = 0; double duration; unsigned long ops; Coroutine *co; g_test_timer_start(); while (i++ < maxcycles) { co = qemu_coroutine_create(perf_cost_func); qemu_coroutine_enter(co, &i); qemu_coroutine_enter(co, NULL); } duration = g_test_timer_elapsed(); ops = (long)(maxcycles / (duration * 1000)); g_test_message("Run operation %lu iterations %f s, %luK operations/s, %luns per coroutine", maxcycles, duration, ops, (unsigned long)(1000000000.0 * duration / maxcycles)); } This tests create, enter, yield, reenter, terminate, destroy. The cost of create + destroy may well dominate. If we create and destroy coroutines for each AIO request, we're doing it wrong. I doubt Kevin's doing it *that* wrong ;) Anyway, let's benchmark the real code instead of putting undue trust in tests/test-coroutine.c micro-benchmarks.
Re: [Qemu-devel] [PATCH] Fix for crash after migration in virtio-rng on bi-endian targets
On 27 November 2014 at 09:26, Markus Armbruster arm...@redhat.com wrote: David Gibson da...@gibson.dropbear.id.au writes: VirtIO devices now remember which endianness they're operating in, in order to support targets which may have guests of either endianness, such as powerpc. This endianness state is transferred in a subsection of the virtio device's information. With virtio-rng this can lead to an abort after a loadvm hitting the assert() in virtio_is_big_endian(). This can be reproduced by doing a migrate and load from file on a bi-endian target with a virtio-rng device. The actual guest state isn't particularly important to triggering this. The cause is that virtio_rng_load_device() calls virtio_rng_process(), which accesses the ring and thus needs the endianness. However, virtio_rng_process() is called via virtio_load() before it loads the subsections. Essentially the ->load callback in VirtioDeviceClass should only be used for actually reading the device state from the stream, not for post-load re-initialization. This patch fixes the bug by moving the virtio_rng_process() after the call to virtio_load(). Better yet would be to convert virtio to use vmsd and have the virtio_rng_process() as a post_load callback, but that's a bigger project for another day. This is a bugfix, and should be considered for the 2.2 branch. [PATCH for-2.2] would have been a good idea then. Next time :) So do you want this patch in 2.2? I was planning to put in the virtio-vs-xen fixes today and tag rc4, so it's not too late if you're confident this patch is good. Let me know if you think it should go in, and I can apply it to master directly. -- PMM
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
On 11/28/14, Markus Armbruster arm...@redhat.com wrote: Ming Lei ming@canonical.com writes: On 11/28/14, Markus Armbruster arm...@redhat.com wrote: Ming Lei ming@canonical.com writes: Hi Kevin, On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf kw...@redhat.com wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. I am not sure it is a good way for Linux AIO optimization: - for a raw image with some constraints, the coroutine can be avoided since io_submit() won't sleep most of the time - handling one coroutine takes more time than handling malloc, memset and free on a small buffer, per the following test data: -- 241ns per coroutine What do you mean by coroutine here? Create + destroy? Yield? Please see perf_cost() in tests/test-coroutine.c static __attribute__((noinline)) void perf_cost_func(void *opaque) { qemu_coroutine_yield(); } static void perf_cost(void) { const unsigned long maxcycles = 40000000; unsigned long i = 0; double duration; unsigned long ops; Coroutine *co; g_test_timer_start(); while (i++ < maxcycles) { co = qemu_coroutine_create(perf_cost_func); qemu_coroutine_enter(co, &i); qemu_coroutine_enter(co, NULL); } duration = g_test_timer_elapsed(); ops = (long)(maxcycles / (duration * 1000)); g_test_message("Run operation %lu iterations %f s, %luK operations/s, %luns per coroutine", maxcycles, duration, ops, (unsigned long)(1000000000.0 * duration / maxcycles)); } This tests create, enter, yield, reenter, terminate, destroy. The cost of create + destroy may well dominate. Actually there shouldn't be much cost from create and destroy, thanks to the coroutine pool. If we create and destroy coroutines for each AIO request, we're doing it wrong. I doubt Kevin's doing it *that* wrong ;) Anyway, let's benchmark the real code instead of putting undue trust in tests/test-coroutine.c micro-benchmarks. I don't think the trust in the micro-benchmark is undue.
That is the direct cost from the coroutine, and that cost can't be avoided at all, not to mention the cost of switching stacks. If you google some test data posted by me previously, it shows that bypassing the coroutine can increase throughput by ~50% for a raw image in the case of Linux AIO; that is a real test case, not a micro-benchmark. Thanks, Ming Lei
[Qemu-devel] [PATCH] vhost: Fix vhostfd leak in error branch
From: Gonglei arei.gong...@huawei.com Signed-off-by: Gonglei arei.gong...@huawei.com --- hw/scsi/vhost-scsi.c | 1 + hw/virtio/vhost.c| 2 ++ 2 files changed, 3 insertions(+) diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c index 308b393..dcb2bc5 100644 --- a/hw/scsi/vhost-scsi.c +++ b/hw/scsi/vhost-scsi.c @@ -233,6 +233,7 @@ static void vhost_scsi_realize(DeviceState *dev, Error **errp) vhost_dummy_handle_output); if (err != NULL) { error_propagate(errp, err); +close(vhostfd); return; } diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index 5d7c40a..5a12861 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -817,10 +817,12 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque, int i, r; if (vhost_set_backend_type(hdev, backend_type) < 0) { +close((uintptr_t)opaque); return -1; } if (hdev->vhost_ops->vhost_backend_init(hdev, opaque) < 0) { +close((uintptr_t)opaque); return -errno; } -- 1.7.12.4
Re: [Qemu-devel] [RFC PATCH v5 00/31] Deterministic replay and reverse execution
On Thu, Nov 27, 2014 at 6:45 PM, Paolo Bonzini pbonz...@redhat.com wrote: On 27/11/2014 10:53, Artyom Tarasenko wrote: Deterministic replay has the following features: * Deterministically replays whole-system execution and all contents of the memory, the state of the hardware devices, clocks, and screen of the VM. * Writes the execution log into a file for later replaying, multiple times and on different machines. * Supports i386, x86_64, and ARM hardware platforms. This patch series is really impressive, hats off! I wonder what would have to be done to add support for SPARC-64? Since it has a PCI bus, I guess the support in atapi and usb devices is already there. Does anything beyond translate.c have to be adjusted? Does -icount work reliably on SPARC? No. Currently using -icount 1 or -icount 2 reliably produces the "Bad clock read" message, and using -icount 20 reliably crashes qemu with the "qemu: fatal: Raised interrupt while not in I/O function" message. So, I guess I see the starting point... Artyom -- Regards, Artyom Tarasenko SPARC and PPC PReP under qemu blog: http://tyom.blogspot.com/search/label/qemu
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
On 28/11/2014 03:59, Ming Lei wrote: Hi Kevin, On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf kw...@redhat.com wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. I am not sure it is a good way for Linux AIO optimization: - for a raw image with some constraints, the coroutine can be avoided since io_submit() won't sleep most of the time - handling one coroutine takes more time than handling malloc, memset and free on a small buffer, per the following test data: -- 241ns per coroutine -- 61ns per (malloc, memset, free for 128 bytes) I still think we should figure out a fast path to avoid coroutines for linux-aio with raw images, otherwise it can't scale well for high-IOPS devices. sigsetjmp/siglongjmp are just ~60 instructions, so they cannot account for 180ns (600 clock cycles). The cost of creating and destroying the coroutine must come from somewhere else. Let's just try something else. Let's remove the pool mutex, as suggested by Peter, but in a way that works even with non-ioeventfd backends. I still believe we will end up with some kind of coroutine bypass scheme (even coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv directly can help), but hey, it cannot hurt to optimize hot code. The patch below has a single pool where coroutines are placed on destruction, and a per-thread allocation pool. Whenever the destruction pool is big enough, the next thread that runs out of coroutines will steal from it instead of making a new coroutine. If this works, it would be beautiful in two ways: 1) the allocation does not have to do any atomic operation in the fast path, it's entirely using thread-local storage. Once every POOL_BATCH_SIZE allocations it will do a single atomic_xchg. Release does an atomic_cmpxchg loop, that hopefully doesn't cause any starvation, and an atomic_inc. 2) in theory this should be completely adaptive.
The number of coroutines around should be a little more than POOL_BATCH_SIZE * number of allocating threads; so this also removes qemu_coroutine_adjust_pool_size. (The previous pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit more generous). The patch below is very raw, and untested beyond tests/test-coroutine. There may be some premature optimization (not using GPrivate, even though I need it to run the per-thread destructor) but it was easy enough. Ming, Kevin, can you benchmark it? Related to this, we can see if __thread beats GPrivate in coroutine-ucontext.c. Every clock cycle counts (600 clock cycles are a 2% improvement at 3 GHz and 100 kiops) so we can see what we get. Paolo diff --git a/include/qemu/queue.h b/include/qemu/queue.h index d433b90..6a01e2f 100644 --- a/include/qemu/queue.h +++ b/include/qemu/queue.h @@ -191,6 +191,17 @@ struct { \ #define QSLIST_INSERT_HEAD(head, elm, field) do {\ (elm)->field.sle_next = (head)->slh_first; \ (head)->slh_first = (elm); \ +} while (/*CONSTCOND*/0) + +#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do { \ + do { \ + (elm)->field.sle_next = (head)->slh_first; \ + } while (atomic_cmpxchg(&(head)->slh_first, (elm)->field.sle_next, \ + (elm)) != (elm)->field.sle_next);\ +} while (/*CONSTCOND*/0) + +#define QSLIST_MOVE_ATOMIC(dest, src) do { \ + (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);\ +} while (/*CONSTCOND*/0) #define QSLIST_REMOVE_HEAD(head, field) do { \ diff --git a/qemu-coroutine.c b/qemu-coroutine.c index bd574aa..60d761f 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -15,35 +15,52 @@ #include "trace.h" #include "qemu-common.h" #include "qemu/thread.h" +#include "qemu/atomic.h" #include "block/coroutine.h" #include "block/coroutine_int.h" enum { -POOL_DEFAULT_SIZE = 64, +POOL_BATCH_SIZE = 64, }; /** Free list to speed up creation */ -static QemuMutex pool_lock; -static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool); +static QSLIST_HEAD(, Coroutine) release_pool =
QSLIST_HEAD_INITIALIZER(pool); static unsigned int pool_size; -static unsigned int pool_max_size = POOL_DEFAULT_SIZE; +static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); + +/* The GPrivate is only used to invoke coroutine_pool_cleanup. */ +static void coroutine_pool_cleanup(void *value); +static GPrivate dummy_key = G_PRIVATE_INIT(coroutine_pool_cleanup); Coroutine *qemu_coroutine_create(CoroutineEntry *entry) { Coroutine *co = NULL; if (CONFIG_COROUTINE_POOL) { -qemu_mutex_lock(&pool_lock); -co
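The two lock-free primitives the patch adds to queue.h can be sketched as a self-contained model using C11 atomics instead of QEMU's atomic_* wrappers (hypothetical names; a simplification, not the QSLIST macros themselves): push retries a compare-and-swap until it installs itself as the head, and the "steal" side takes the whole list with one exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node {
    struct Node *next;
} Node;

typedef struct {
    _Atomic(Node *) head;
} List;

/* Like QSLIST_INSERT_HEAD_ATOMIC: CAS loop until we are the new head. */
static void insert_head_atomic(List *l, Node *n)
{
    Node *old = atomic_load(&l->head);
    do {
        n->next = old;    /* 'old' is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak(&l->head, &old, n));
}

/* Like QSLIST_MOVE_ATOMIC: steal the whole list in one xchg,
 * leaving the source empty. */
static Node *move_atomic(List *src)
{
    return atomic_exchange(&src->head, NULL);
}
```

This is why the release path is an atomic_cmpxchg loop while allocation only pays a single atomic_xchg once per batch, matching Paolo's point 1) above.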
Re: [Qemu-devel] [PATCH RFC v4 00/16] qemu: towards virtio-1 host support
On Thu, 27 Nov 2014 18:35:40 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Thu, Nov 27, 2014 at 05:28:42PM +0100, Cornelia Huck wrote: On Thu, 27 Nov 2014 18:18:25 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Thu, Nov 27, 2014 at 05:06:51PM +0100, Cornelia Huck wrote: So we should have a per-device callback into the transport layer, say check_legacy()? I would just have 2 masks: legacy_features and features. But these belong to the device type and the transport just needs to trigger usage of the right one, right? Yes. For ccw, this would check for the negotiated revision; for mmio, it could check a device property configured with the device; and for pci, whatever the mechanism is there :) A transport not implementing this callback is simply considered legacy-only. I dislike callbacks. Let's just give all info to core, and have it DTRT. Have an is_legacy flag in the vdev that is initialized to 1, and transports can unset it when the revision is negotiated or during init? I would say have modern_features, legacy_features, and set host_features correctly. Started digging through the code a bit: Currently the only time where the transport is actually mucking with the feature bits is when the device is plugged, when it adds its feature bits and calls virtio_bus_get_vdev_features() to get the device-specific bits. Only mmio knows at this point in time whether it has a v1.0 or a legacy device. Other transports will need to call this function again when the guest has actually set the revision. My plan is the following: - have virtio_bus_get_vdev_features_rev() that takes a virtio-rev parameter (0: legacy, 1: v1.0) - unconverted transports keep calling virtio_bus_get_vdev_features() which calls virtio_bus_get_vdev_features_rev(..., 0) - the rev parameter gets passed to the device, which hands over the right feature set (I don't think that we can use static feature tables, as the device may need to calculate the desired feature set dynamically) Sounds reasonable?
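The two-mask idea being discussed reduces to a tiny selection function; the sketch below is a hypothetical illustration (the type and function names are invented, not QEMU's virtio code): the device owns both feature sets, and the transport merely passes the negotiated revision.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device feature sets: the transport does not own them,
 * it only selects which one applies. */
typedef struct {
    uint64_t legacy_features;   /* pre-1.0 transitional feature bits */
    uint64_t modern_features;   /* virtio 1.0 feature bits */
} VirtIODevFeatures;

/* rev follows the convention in Cornelia's plan: 0 = legacy, 1 = v1.0.
 * Unconverted (legacy-only) transports would always pass 0. */
static uint64_t get_vdev_features_rev(const VirtIODevFeatures *f, int rev)
{
    return rev == 0 ? f->legacy_features : f->modern_features;
}
```

This matches Michael's "give all info to core" preference: no callback into the transport is needed, only the revision value it already knows.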
Re: [Qemu-devel] [PATCH RFC v4 00/16] qemu: towards virtio-1 host support
On Thu, 27 Nov 2014 18:38:59 +0200 Michael S. Tsirkin m...@redhat.com wrote: Also, let's focus on one device, e.g. -net for now. Yes, I'm currently looking at net. Then probably virtio scsi. That's because blk needs to be reworked to support ANY_LAYOUT. I had focused on blk because that's the other device used on every s390 image :) But the ANY_LAYOUT stuff looks like a bit of work :/
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Am 28.11.2014 um 03:59 hat Ming Lei geschrieben: Hi Kevin, On Wed, Nov 26, 2014 at 10:46 PM, Kevin Wolf kw...@redhat.com wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. I am not sure it is a good way for Linux AIO optimization: - for a raw image with some constraints, the coroutine can be avoided since io_submit() won't sleep most of the time - handling one coroutine takes more time than handling malloc, memset and free on a small buffer, per the following test data: -- 241ns per coroutine -- 61ns per (malloc, memset, free for 128 bytes) Please finally stop making comparisons between completely unrelated things and trying to make a case against coroutines out of it. It simply doesn't make any sense. The truth is that in the 'qemu-img bench' case, as well as in the highest performing VM setup for Peter and me, the practically existing coroutine-based git branches perform better than the practically existing bypass branches. If you think that theoretically the bypass branches must be better, show us the patches and benchmarks. If you can't, let's merge the coroutine improvements (which improve more than just the case of raw images using no block layer features, including cases that benefit the average user) and be done. I still think we should figure out a fast path to avoid coroutines for linux-aio with raw images, otherwise it can't scale well for high-IOPS devices. Also, we can use a simple buf pool to avoid the dynamic allocation easily, can't we? Yes, the change to g_slice_alloc() was a bad move performance-wise. As a side effect, the codepath taken by aio=threads is changed to use paio_submit_co(). This doesn't change the performance at this point.
Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:

          |        aio=native        |        aio=threads
          | before   | with patch    | before   | with patch
----------+----------+---------------+----------+------------
run 1     | 29.921s  | 26.932s       | 35.286s  | 35.447s
run 2     | 29.793s  | 26.252s       | 35.276s  | 35.111s
run 3     | 30.186s  | 27.114s       | 35.042s  | 34.921s
run 4     | 30.425s  | 26.600s       | 35.169s  | 34.968s
run 5     | 30.041s  | 26.263s       | 35.224s  | 35.000s

TODO: Do some more serious benchmarking in VMs with less variance. Results of a quick fio run are vaguely positive. I will do the test with Paolo's fast path approach under a VM I/O situation. Currently, the best thing to compare it against is probably Peter's git branch at https://github.com/plieven/qemu.git perf_master2. This patch is only a first step in a whole series of possible optimisations. Kevin
Re: [Qemu-devel] [PATCH v2 5/6] geometry: Call backend function to detect geometry and blocksize
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: hd_geometry_guess function autodetects the drive geometry. This patch adds a block backend call, that probes the backing device geometry. If the inner driver method is implemented and succeeds (currently only DASDs), the blkconf_geometry will pass-through the backing device geometry. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- hw/block/block.c | 11 +++ hw/block/hd-geometry.c | 9 + hw/block/virtio-blk.c| 1 + include/hw/block/block.h | 1 + 4 files changed, 22 insertions(+) diff --git a/hw/block/block.c b/hw/block/block.c index a625773..f1d29bc 100644 --- a/hw/block/block.c +++ b/hw/block/block.c @@ -25,6 +25,17 @@ void blkconf_serial(BlockConf *conf, char **serial) } } +void blkconf_blocksizes(BlockConf *conf) +{ +BlockBackend *blk = conf->blk; +struct ProbeBlockSize blocksize = blk_probe_blocksizes(blk); + +if (blocksize.rc == 0) { +conf->physical_block_size = blocksize.size.phys; +conf->logical_block_size = blocksize.size.log; +} +} + Uh, doesn't this overwrite the user's geometry? The existing blkconf_ functions do nothing when the user supplied the configuration. In other words, they only provide defaults. Why should this one be different? void blkconf_geometry(BlockConf *conf, int *ptrans, unsigned cyls_max, unsigned heads_max, unsigned secs_max, Error **errp) diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c index 6fcf74d..4972114 100644 --- a/hw/block/hd-geometry.c +++ b/hw/block/hd-geometry.c @@ -121,6 +121,14 @@ void hd_geometry_guess(BlockBackend *blk, int *ptrans) { int cylinders, heads, secs, translation; +struct ProbeGeometry geometry = blk_probe_geometry(blk); + +if (geometry.rc == 0) { +*pcyls = geometry.geo.cylinders; +*psecs = geometry.geo.sectors; +*pheads = geometry.geo.heads; +goto done; +} hd_geometry_guess() is called by blkconf_geometry() to conjure up a default geometry when the user didn't pick one. Fairly elaborate guesswork.
blkconf_geometry() runs for IDE, SCSI and virtio-blk only. Your patch makes it pick the backend's geometry as reported by blk_probe_geometry() before anything else. This is an incompatible change for backends where blk_probe_geometry() succeeds. Currently, it succeeds only for DASD, and there you *want* the incompatible change. My point is: if we rely on blk_probe_geometry() failing except for DASD, then we should call it something like blk_dasd_geometry() to make that obvious. The commit message needs to spell out the incompatible change. if (guess_disk_lchs(blk, &cylinders, &heads, &secs) < 0) { /* no LCHS guess: use a standard physical disk geometry */ @@ -143,6 +151,7 @@ void hd_geometry_guess(BlockBackend *blk, the logical geometry */ translation = BIOS_ATA_TRANSLATION_NONE; } +done: if (ptrans) { *ptrans = translation; } diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index b19b102..6f01565 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -745,6 +745,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp) error_propagate(errp, err); return; } +blkconf_blocksizes(conf->conf); virtio_init(vdev, "virtio-blk", VIRTIO_ID_BLOCK, sizeof(struct virtio_blk_config)); Why only virtio-blk, and not the other device models sporting configurable block sizes (nvme, IDE, SCSI, usb-storage)? This is an incompatible change for backends where blk_probe_blocksizes() succeeds and returns something other than (512, 512). Currently, it succeeds only for DASD. Is that what we want? The commit message needs to spell out the incompatible change. Perhaps this patch should be split up into one for geometry and one for block sizes.
diff --git a/include/hw/block/block.h b/include/hw/block/block.h index 0d0ce9a..856bf75 100644 --- a/include/hw/block/block.h +++ b/include/hw/block/block.h @@ -63,6 +63,7 @@ void blkconf_serial(BlockConf *conf, char **serial); void blkconf_geometry(BlockConf *conf, int *trans, unsigned cyls_max, unsigned heads_max, unsigned secs_max, Error **errp); +void blkconf_blocksizes(BlockConf *conf); /* Hard disk geometry */
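Markus's "defaults only" objection can be sketched in a few lines of C. The struct and probe types below are simplified stand-ins invented for illustration (QEMU's real BlockConf and probe API differ); the point is only that a blkconf-style helper should fill a field when the user left it at its zero default, and leave it alone otherwise:

```c
#include <assert.h>

/* Simplified stand-ins for BlockConf and the probe result. */
struct FakeBlockConf {
    unsigned physical_block_size;   /* 0 means "user did not set it" */
    unsigned logical_block_size;
};

struct FakeProbeResult {
    int rc;                         /* 0 on success, negative otherwise */
    unsigned phys;
    unsigned log;
};

/* Provide defaults only: never overwrite a user-supplied value. */
static void blkconf_blocksizes_defaults(struct FakeBlockConf *conf,
                                        struct FakeProbeResult probed)
{
    if (probed.rc != 0) {
        return;                     /* probing failed: keep whatever we have */
    }
    if (conf->physical_block_size == 0) {
        conf->physical_block_size = probed.phys;
    }
    if (conf->logical_block_size == 0) {
        conf->logical_block_size = probed.log;
    }
}
```

Compared with the hunk under review, the only difference is the two `== 0` guards, which make the helper consistent with the other blkconf_ functions.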
Re: [Qemu-devel] [PATCH v2 5/6] geometry: Call backend function to detect geometry and blocksize
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: hd_geometry_guess function autodetects the drive geometry. This patch adds a block backend call, that probes the backing device geometry. If the inner driver method is implemented and succeeds (currently only DASDs), the blkconf_geometry will pass-through the backing device geometry. [...] diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c index 6fcf74d..4972114 100644 --- a/hw/block/hd-geometry.c +++ b/hw/block/hd-geometry.c @@ -121,6 +121,14 @@ void hd_geometry_guess(BlockBackend *blk, int *ptrans) { int cylinders, heads, secs, translation; +struct ProbeGeometry geometry = blk_probe_geometry(blk); + +if (geometry.rc == 0) { +*pcyls = geometry.geo.cylinders; +*psecs = geometry.geo.sectors; +*pheads = geometry.geo.heads; Missing: translation = ... I'm not sure what value to assign. +goto done; +} if (guess_disk_lchs(blk, cylinders, heads, secs) 0) { /* no LCHS guess: use a standard physical disk geometry */ [...]
Re: [Qemu-devel] [PATCH v6] qcow2: Buffer L1 table in snapshot refcount update
On Tue, Nov 11, 2014 at 04:27:51PM +0100, Max Reitz wrote: From: Zhang Haoyu zhan...@sangfor.com Buffer the active L1 table in qcow2_update_snapshot_refcount() in order to prevent in-place conversion of the L1 table buffer in the BDRVQcowState to big endian and back, which would lead to data corruption if that buffer was accessed concurrently. This should not happen, but better safe than sorry. Signed-off-by: Zhang Haoyu zhan...@sangfor.com Signed-off-by: Max Reitz mre...@redhat.com --- v6 for snapshot: use local variable to bdrv_pwrite_sync L1 table (I changed the commit message wording to make it clearer what this patch does and why we want it). Changes in v6: - Only copy the local buffer back into s->l1_table if we are indeed accessing the local L1 table - Use qemu_vfree() instead of g_free() --- block/qcow2-refcount.c | 30 ++ 1 file changed, 14 insertions(+), 16 deletions(-) If there is a code path where the L1 table is accessed while qcow2_update_snapshot_refcount() is blocked, this patch does not fix the bug. It trades an L1 table entry corruption (due to endianness mismatch on little-endian hosts) for a race condition where a stale L1 table is accessed or L1 changes are overwritten when qcow2_update_snapshot_refcount() memcpys back to s->l1_table. Please identify the root cause and fix that. 
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 9afdb40..c0c4a50 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -877,14 +877,18 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, { BDRVQcowState *s = bs-opaque; uint64_t *l1_table, *l2_table, l2_offset, offset, l1_size2; -bool l1_allocated = false; +bool active_l1 = false; int64_t old_offset, old_l2_offset; int i, j, l1_modified = 0, nb_csectors, refcount; int ret; l2_table = NULL; -l1_table = NULL; l1_size2 = l1_size * sizeof(uint64_t); +l1_table = qemu_try_blockalign(bs-file, l1_size2); +if (l1_table == NULL) { +ret = -ENOMEM; +goto fail; +} s-cache_discards = true; @@ -892,13 +896,6 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, * l1_table_offset when it is the current s-l1_table_offset! Be careful * when changing this! */ if (l1_table_offset != s-l1_table_offset) { -l1_table = g_try_malloc0(align_offset(l1_size2, 512)); -if (l1_size2 l1_table == NULL) { -ret = -ENOMEM; -goto fail; -} -l1_allocated = true; - ret = bdrv_pread(bs-file, l1_table_offset, l1_table, l1_size2); if (ret 0) { goto fail; @@ -908,8 +905,8 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, be64_to_cpus(l1_table[i]); } else { assert(l1_size == s-l1_size); -l1_table = s-l1_table; -l1_allocated = false; +memcpy(l1_table, s-l1_table, l1_size2); +active_l1 = true; } for(i = 0; i l1_size; i++) { @@ -1051,13 +1048,14 @@ fail: } ret = bdrv_pwrite_sync(bs-file, l1_table_offset, l1_table, l1_size2); - -for (i = 0; i l1_size; i++) { -be64_to_cpus(l1_table[i]); +if (active_l1 ret == 0) { +for (i = 0; i l1_size; i++) { +be64_to_cpus(l1_table[i]); +} +memcpy(s-l1_table, l1_table, l1_size2); } } -if (l1_allocated) -g_free(l1_table); +qemu_vfree(l1_table); return ret; } -- 1.9.3 pgpu0ifzdtI68.pgp Description: PGP signature
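The endianness hazard Stefan describes can be made concrete with a small sketch: instead of byte-swapping the live table in place, serialize a copy for writing. This is an illustration of the general technique, not the actual qcow2 code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write each 64-bit L1 entry into `out` in big-endian byte order,
 * leaving the live (host-endian) table untouched.  A concurrent
 * reader of `live` therefore never sees half-swapped entries. */
static void encode_l1_be(const uint64_t *live, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (int b = 0; b < 8; b++) {
            out[i * 8 + b] = (uint8_t)(live[i] >> (56 - 8 * b));
        }
    }
}
```

Serializing byte by byte also sidesteps host endianness entirely, which an in-place cpu_to_be64s()/be64_to_cpus() round trip does not.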
Re: [Qemu-devel] [PATCH v2 0/6] Geometry and blocksize support for backing devices
On Wed, Nov 26, 2014 at 01:16:48PM +0300, Ekaterina Tumanova wrote: On 11/25/2014 04:01 PM, Stefan Hajnoczi wrote: On Wed, Nov 19, 2014 at 11:17:50AM +0100, Ekaterina Tumanova wrote: 3. Why does s390 need to customize hd_geometry_guess()? Since hd_geometry_guess contains semantics of x86-specific LBA translation, we have to modify it not to get in the way of z architecture If the behavior is x86-specific, it should be made x86-specific. z, ppc, sparc, arm, etc should share the non-x86-specific code path. It's weird to single out z here, this seems like the wrong way around. Is the x86-specific behavior a problem in practice? No one seemed to mind for the other targets. pgpT85luBvEdh.pgp Description: PGP signature
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 11:28 schrieb Paolo Bonzini: On 28/11/2014 09:13, Peter Lieven wrote: Am 27.11.2014 um 17:40 schrieb Paolo Bonzini: On 27/11/2014 11:27, Peter Lieven wrote: +static __thread struct CoRoutinePool { +Coroutine *ptrs[POOL_MAX_SIZE]; +unsigned int size; +unsigned int nextfree; +} CoPool; The per-thread ring unfortunately didn't work well last time it was tested. Devices that do not use ioeventfd (not just the slow ones, even decently performing ones like ahci, nvme or megasas) will create the coroutine in the VCPU thread, and destroy it in the iothread. The result is that coroutines cannot be reused. Can you check if this is still the case? I already tested at least for IDE and for ioeventfd=off. The coroutine is created in the vCPU thread and destroyed in the I/O thread. I also have a more complicated version which sets a per-thread coroutine pool only for dataplane, avoiding the lock for dedicated iothreads. For those who want to take a look: https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e Can you test it against the patch I just sent in Kevin's linux-aio coroutine thread? Was already doing it ;-) At least with test-coroutine.c master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine plieven/perf_master2: Run operation 4000 iterations 9.013785 s, 4437K operations/s, 225ns per coroutine plieven/perf_master: Run operation 4000 iterations 11.072883 s, 3612K operations/s, 276ns per coroutine However, perf_master and perf_master2 have a regression regarding nesting, as it seems. @Kevin: Could that be the reason why they perform badly in some scenarios? Regarding the bypass that is being discussed: if it is not just a benchmark thing but really necessary for some people's use cases, why not add a new aio mode like bypass and use it only then. 
If the performance is really needed, the user might trade it in for lost features like I/O throttling, filters, etc. Peter
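For reference, the per-thread pool under discussion boils down to a thread-local LIFO stack of free coroutines. The sketch below uses invented names and omits the actual coroutine allocation and teardown; it also illustrates Paolo's concern, since each thread sees only its own `co_pool` and reuse therefore requires get and put to happen in the same thread:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative names, not QEMU's actual API. */
#define POOL_MAX_SIZE 64

typedef struct Coroutine Coroutine;   /* opaque for this sketch */

static __thread struct {
    Coroutine *ptrs[POOL_MAX_SIZE];
    unsigned int size;
} co_pool;

/* Take a pooled coroutine, or NULL if the caller must allocate one. */
static Coroutine *pool_get(void)
{
    if (co_pool.size == 0) {
        return NULL;
    }
    return co_pool.ptrs[--co_pool.size];
}

/* Return a coroutine to the pool; 0 means full, caller must destroy it. */
static int pool_put(Coroutine *co)
{
    if (co_pool.size == POOL_MAX_SIZE) {
        return 0;
    }
    co_pool.ptrs[co_pool.size++] = co;
    return 1;
}
```

The LIFO order means the most recently freed coroutine (whose stack is likely still cache-warm) is handed out first, which is where the ns-per-coroutine savings in the benchmark numbers above come from.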
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to virt board
On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the virt board. The mmio register block of fw_cfg is advertized in the device tree. As base address we pick 0x0902, which conforms to the comment preceding a15memmap: it falls in the miscellaneous device I/O range 128MB..256MB, and it is aligned at 64KB. fw_cfg automatically exports a number of files to the guest; for example, bootorder (see fw_cfg_machine_reset()). Signed-off-by: Laszlo Ersek ler...@redhat.com --- hw/arm/virt.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 314e55b..070bd34 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -68,6 +68,7 @@ enum { VIRT_UART, VIRT_MMIO, VIRT_RTC, +VIRT_FW_CFG, }; typedef struct MemMapEntry { @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, [VIRT_UART] = { 0x0900, 0x1000 }, [VIRT_RTC] ={ 0x0901, 0x1000 }, +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 
0x4000 reserved for PCI */ @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) g_free(nodename); } +static void create_fw_cfg(const VirtBoardInfo *vbi) +{ +hwaddr base = vbi-memmap[VIRT_FW_CFG].base; +char *nodename; + +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); + +nodename = g_strdup_printf(/fw-cfg@% PRIx64, base); +qemu_fdt_add_subnode(vbi-fdt, nodename); +qemu_fdt_setprop_string(vbi-fdt, nodename, +compatible, fw-cfg,mmio); +qemu_fdt_setprop_sized_cells(vbi-fdt, nodename, reg, + 2, base, 2, FW_CFG_SIZE, + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. +g_free(nodename); +} + static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size) { const VirtBoardInfo *board = (const VirtBoardInfo *)binfo; @@ -604,6 +623,8 @@ static void machvirt_init(MachineState *machine) */ create_virtio_devices(vbi, pic); +create_fw_cfg(vbi); + vbi-bootinfo.ram_size = machine-ram_size; vbi-bootinfo.kernel_filename = machine-kernel_filename; vbi-bootinfo.kernel_cmdline = machine-kernel_cmdline; -- 1.8.3.1
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to virt board
On 11/28/14 11:38, Andrew Jones wrote: On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the virt board. The mmio register block of fw_cfg is advertized in the device tree. As base address we pick 0x0902, which conforms to the comment preceding a15memmap: it falls in the miscellaneous device I/O range 128MB..256MB, and it is aligned at 64KB. fw_cfg automatically exports a number of files to the guest; for example, bootorder (see fw_cfg_machine_reset()). Signed-off-by: Laszlo Ersek ler...@redhat.com --- hw/arm/virt.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 314e55b..070bd34 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -68,6 +68,7 @@ enum { VIRT_UART, VIRT_MMIO, VIRT_RTC, +VIRT_FW_CFG, }; typedef struct MemMapEntry { @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, [VIRT_UART] = { 0x0900, 0x1000 }, [VIRT_RTC] ={ 0x0901, 0x1000 }, +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 
0x4000 reserved for PCI */ @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) g_free(nodename); } +static void create_fw_cfg(const VirtBoardInfo *vbi) +{ +hwaddr base = vbi-memmap[VIRT_FW_CFG].base; +char *nodename; + +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); + +nodename = g_strdup_printf(/fw-cfg@% PRIx64, base); +qemu_fdt_add_subnode(vbi-fdt, nodename); +qemu_fdt_setprop_string(vbi-fdt, nodename, +compatible, fw-cfg,mmio); +qemu_fdt_setprop_sized_cells(vbi-fdt, nodename, reg, + 2, base, 2, FW_CFG_SIZE, + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. Why does it need to be aligned? The selector register is aligned at a 64KB boundary (for independent, strict reasons). The data register is not aligned at all, and -- AFAICS -- it need not be, because it's 1 byte wide. (In fact the ARM-specific Mmio(Read|Write)XX functions in edk2 enforce natural alignment, and the above layout passes without problems.) The full register block is 3 bytes wide. Is that a problem? Thanks Laszlo +g_free(nodename); +} + static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size) { const VirtBoardInfo *board = (const VirtBoardInfo *)binfo; @@ -604,6 +623,8 @@ static void machvirt_init(MachineState *machine) */ create_virtio_devices(vbi, pic); +create_fw_cfg(vbi); + vbi-bootinfo.ram_size = machine-ram_size; vbi-bootinfo.kernel_filename = machine-kernel_filename; vbi-bootinfo.kernel_cmdline = machine-kernel_cmdline; -- 1.8.3.1
Re: [Qemu-devel] [PATCH v2 0/6] Geometry and blocksize support for backing devices
On Wed, Nov 26, 2014 at 01:16:48PM +0300, Ekaterina Tumanova wrote: On 11/25/2014 04:01 PM, Stefan Hajnoczi wrote: On Wed, Nov 19, 2014 at 11:17:50AM +0100, Ekaterina Tumanova wrote: 4. Please use scripts/checkpatch.pl to check coding style. I did :) You are right, checkpatch.pl doesn't complain. That's weird, I thought there were a few instances that don't follow the QEMU coding style. Either I was wrong or checkpatch.pl is incomplete. Stefan pgp_ZYDPqDTK1.pgp Description: PGP signature
Re: [Qemu-devel] [PATCH v2 6/6] geometry: Target specific hook for s390x in geometry guessing
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: For target-s390x, the behaviour is chosen as follows: If no geo could be guessed from backing device, guess_disk_lchs (partition table guessing) is called. If this is not successful (e.g. image files) geometry is derived from the size of the disk (as before). We have no translation on s390, so we simply set in to NONE. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- hw/block/Makefile.objs | 6 +- hw/block/hd-geometry.c | 36 +++- 2 files changed, 36 insertions(+), 6 deletions(-) diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs index d4c3ab7..1f7da7a 100644 --- a/hw/block/Makefile.objs +++ b/hw/block/Makefile.objs @@ -1,4 +1,4 @@ -common-obj-y += block.o cdrom.o hd-geometry.o +common-obj-y += block.o cdrom.o common-obj-$(CONFIG_FDC) += fdc.o common-obj-$(CONFIG_SSI_M25P80) += m25p80.o common-obj-$(CONFIG_NAND) += nand.o @@ -13,3 +13,7 @@ obj-$(CONFIG_SH4) += tc58128.o obj-$(CONFIG_VIRTIO) += virtio-blk.o obj-$(CONFIG_VIRTIO) += dataplane/ + +# geometry is target/architecture dependent and therefore needs to be built +# for every target platform +obj-y += hd-geometry.o diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c index 4972114..b462225 100644 --- a/hw/block/hd-geometry.c +++ b/hw/block/hd-geometry.c @@ -49,11 +49,12 @@ struct partition { /* try to guess the disk logical geometry from the MSDOS partition table. 
Return 0 if OK, -1 if could not guess */ -static int guess_disk_lchs(BlockBackend *blk, - int *pcylinders, int *pheads, int *psectors) +static int guess_disk_lchs(BlockBackend *blk, uint32_t *pcylinders, + uint32_t *pheads, uint32_t *psectors) { uint8_t buf[BDRV_SECTOR_SIZE]; -int i, heads, sectors, cylinders; +int i; +uint32_t heads, sectors, cylinders; struct partition *p; uint32_t nr_sects; uint64_t nb_sectors; @@ -116,11 +117,11 @@ static void guess_chs_for_size(BlockBackend *blk, *psecs = 63; } +#ifdef TARGET_S390X void hd_geometry_guess(BlockBackend *blk, uint32_t *pcyls, uint32_t *pheads, uint32_t *psecs, int *ptrans) { -int cylinders, heads, secs, translation; struct ProbeGeometry geometry = blk_probe_geometry(blk); if (geometry.rc == 0) { @@ -129,7 +130,32 @@ void hd_geometry_guess(BlockBackend *blk, *pheads = geometry.geo.heads; goto done; } +if (guess_disk_lchs(blk, pcyls, pheads, psecs) == 0) { +goto done; +} +guess_chs_for_size(blk, pcyls, pheads, psecs); +done: +if (ptrans) { +*ptrans = BIOS_ATA_TRANSLATION_NONE; +} The goto looks awkward... What about if (guess_disk_lchs(blk, pcyls, pheads, psecs) 0) { guess_chs_for_size(blk, pcyls, pheads, psecs); } +trace_hd_geometry_guess(blk, *pcyls, *pheads, *psecs, +BIOS_ATA_TRANSLATION_NONE); +} +#else +void hd_geometry_guess(BlockBackend *blk, + uint32_t *pcyls, uint32_t *pheads, uint32_t *psecs, + int *ptrans) +{ +int cylinders, heads, secs, translation; +struct ProbeGeometry geometry = blk_probe_geometry(blk); + +if (geometry.rc == 0) { +*pcyls = geometry.geo.cylinders; +*psecs = geometry.geo.sectors; +*pheads = geometry.geo.heads; +goto done; +} if (guess_disk_lchs(blk, cylinders, heads, secs) 0) { /* no LCHS guess: use a standard physical disk geometry */ guess_chs_for_size(blk, pcyls, pheads, psecs); @@ -157,7 +183,7 @@ done: } trace_hd_geometry_guess(blk, *pcyls, *pheads, *psecs, translation); } - +#endif Please put the blank line after the #endif. 
int hd_bios_chs_auto_trans(uint32_t cyls, uint32_t heads, uint32_t secs) { return cyls <= 1024 && heads <= 16 && secs <= 63

hd_geometry_guess() behavior before this series:

    If we can't guess LCHS from the MSDOS partition table
        guess geometry on size
        translation is NONE for very small disks, else LBA
    Else if the LCHS guess has heads > 16
        BIOS LBA translation must be active
        guess geometry on size
        translation is LARGE for sufficiently small disks, else LBA
    Else
        use the LCHS guess
        translation is NONE

Afterwards, targets other than s390x:

    If the backend has a geometry (currently only DASD has)
        use the backend geometry
        translation is NONE
    Else
        behave like before this series

Incompatible change. Why do we want it?

Afterwards, target s390x:

    If the backend has a geometry (currently only DASD has)
        use the backend geometry
        translation is NONE
    Else if we can't guess LCHS from the MSDOS partition table
        guess geometry on size
    Else
        use the LCHS guess
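Markus's suggested goto-free restructuring of the s390x variant can be sketched with stubbed-out helpers. The stubs, their return values, and the translation constant's value are invented for illustration; in this scenario the backend probe fails and the partition-table guess succeeds, so the fallback is never taken:

```c
#include <assert.h>
#include <stdint.h>

#define BIOS_ATA_TRANSLATION_NONE 1   /* value assumed for the sketch */

static int stub_probe_geometry(uint32_t *c, uint32_t *h, uint32_t *s)
{
    (void)c; (void)h; (void)s;
    return -1;                        /* not a DASD: backend probe fails */
}

static int stub_guess_disk_lchs(uint32_t *c, uint32_t *h, uint32_t *s)
{
    *c = 1024; *h = 16; *s = 63;      /* found an MSDOS partition table */
    return 0;                         /* 0 = success, -1 = no guess */
}

static void stub_guess_chs_for_size(uint32_t *c, uint32_t *h, uint32_t *s)
{
    *c = 16383; *h = 16; *s = 63;     /* size-based fallback */
}

static void s390x_geometry_guess(uint32_t *pcyls, uint32_t *pheads,
                                 uint32_t *psecs, int *ptrans)
{
    /* backend geometry first; else LCHS guess; else size-based fallback */
    if (stub_probe_geometry(pcyls, pheads, psecs) != 0 &&
        stub_guess_disk_lchs(pcyls, pheads, psecs) < 0) {
        stub_guess_chs_for_size(pcyls, pheads, psecs);
    }
    if (ptrans) {
        *ptrans = BIOS_ATA_TRANSLATION_NONE;  /* no translation on s390 */
    }
}
```

The short-circuit `&&` chain expresses the three-way fallback without any goto, matching the review suggestion.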
Re: [Qemu-devel] [PATCH] geometry: fix i386 compilation
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- hw/block/hd-geometry.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c index b462225..905d2c6 100644 --- a/hw/block/hd-geometry.c +++ b/hw/block/hd-geometry.c @@ -147,7 +147,8 @@ void hd_geometry_guess(BlockBackend *blk, uint32_t *pcyls, uint32_t *pheads, uint32_t *psecs, int *ptrans) { -int cylinders, heads, secs, translation; +uint32_t cylinders, heads, secs; +int translation = BIOS_ATA_TRANSLATION_NONE; struct ProbeGeometry geometry = blk_probe_geometry(blk); if (geometry.rc == 0) { @@ -173,9 +174,6 @@ void hd_geometry_guess(BlockBackend *blk, *pcyls = cylinders; *pheads = heads; *psecs = secs; -/* disable any translation to be in sync with - the logical geometry */ -translation = BIOS_ATA_TRANSLATION_NONE; } Actually broken in PATCH 5, so it needs to be fixed there, and not in PATCH 6 (which this one fixes up). Moreover, your fixup makes the code less clear. Please add the missing translation = ... instead. done: if (ptrans) {
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to virt board
On 11/28/14 11:43, Laszlo Ersek wrote: On 11/28/14 11:38, Andrew Jones wrote: On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the virt board. The mmio register block of fw_cfg is advertized in the device tree. As base address we pick 0x0902, which conforms to the comment preceding a15memmap: it falls in the miscellaneous device I/O range 128MB..256MB, and it is aligned at 64KB. fw_cfg automatically exports a number of files to the guest; for example, bootorder (see fw_cfg_machine_reset()). Signed-off-by: Laszlo Ersek ler...@redhat.com --- hw/arm/virt.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 314e55b..070bd34 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -68,6 +68,7 @@ enum { VIRT_UART, VIRT_MMIO, VIRT_RTC, +VIRT_FW_CFG, }; typedef struct MemMapEntry { @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, [VIRT_UART] = { 0x0900, 0x1000 }, [VIRT_RTC] ={ 0x0901, 0x1000 }, +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 
0x4000 reserved for PCI */ @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) g_free(nodename); } +static void create_fw_cfg(const VirtBoardInfo *vbi) +{ +hwaddr base = vbi-memmap[VIRT_FW_CFG].base; +char *nodename; + +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); + +nodename = g_strdup_printf(/fw-cfg@% PRIx64, base); +qemu_fdt_add_subnode(vbi-fdt, nodename); +qemu_fdt_setprop_string(vbi-fdt, nodename, +compatible, fw-cfg,mmio); +qemu_fdt_setprop_sized_cells(vbi-fdt, nodename, reg, + 2, base, 2, FW_CFG_SIZE, + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. Why does it need to be aligned? The selector register is aligned at a 64KB boundary (for independent, strict reasons). The data register is not aligned at all, and -- AFAICS -- it need not be, because it's 1 byte wide. (In fact the ARM-specific Mmio(Read|Write)XX functions in edk2 enforce natural alignment, and the above layout passes without problems.) The full register block is 3 bytes wide. Is that a problem? Hm, I think I get it now. If FW_CFG_DATA_SIZE were to increase, then its alignment would have to increase as well, and whatever alignment FW_CFG_SIZE provides might not suffice. So, you'd calculate the natural alignment, but wouldn't increase it beyond 4. I do think this is a bit overkill :) but I can do it. Let's wait for more review comments first. Thanks! Laszlo
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to virt board
On Fri, Nov 28, 2014 at 11:43:32AM +0100, Laszlo Ersek wrote: On 11/28/14 11:38, Andrew Jones wrote: On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the virt board. The mmio register block of fw_cfg is advertized in the device tree. As base address we pick 0x0902, which conforms to the comment preceding a15memmap: it falls in the miscellaneous device I/O range 128MB..256MB, and it is aligned at 64KB. fw_cfg automatically exports a number of files to the guest; for example, bootorder (see fw_cfg_machine_reset()). Signed-off-by: Laszlo Ersek ler...@redhat.com --- hw/arm/virt.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 314e55b..070bd34 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -68,6 +68,7 @@ enum { VIRT_UART, VIRT_MMIO, VIRT_RTC, +VIRT_FW_CFG, }; typedef struct MemMapEntry { @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_CPU] ={ 0x0801, 0x0001 }, [VIRT_UART] = { 0x0900, 0x1000 }, [VIRT_RTC] ={ 0x0901, 0x1000 }, +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 
0x4000 reserved for PCI */ @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) g_free(nodename); } +static void create_fw_cfg(const VirtBoardInfo *vbi) +{ +hwaddr base = vbi-memmap[VIRT_FW_CFG].base; +char *nodename; + +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); + +nodename = g_strdup_printf(/fw-cfg@% PRIx64, base); +qemu_fdt_add_subnode(vbi-fdt, nodename); +qemu_fdt_setprop_string(vbi-fdt, nodename, +compatible, fw-cfg,mmio); +qemu_fdt_setprop_sized_cells(vbi-fdt, nodename, reg, + 2, base, 2, FW_CFG_SIZE, + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. Why does it need to be aligned? Natural alignment is more efficient. The selector register is aligned at a 64KB boundary (for independent, strict reasons). The data register is not aligned at all, and -- AFAICS -- it need not be, because it's 1 byte wide. (In fact the ARM-specific Mmio(Read|Write)XX functions in edk2 enforce natural alignment, and the above layout passes without problems.) Right. As FW_CFG_DATA_SIZE is currently 1 byte, it's already naturally aligned, and the macro definition I have above actually doesn't change anything (which is why I gave the overkill alert). However if FW_CFG_DATA_SIZE was to change, then the natural alignment could be lost. The full register block is 3 bytes wide. Is that a problem? No, it's fine as is, and the FW_CFG_SIZE_ALIGNED would leave it 3 bytes wide too. FW_CFG_SIZE_ALIGNED only adds future-proofing. 
However it really is overkill as the chance that FW_CFG_DATA_SIZE will change is nil. Thanks Laszlo +g_free(nodename); +} + static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size) { const VirtBoardInfo *board = (const VirtBoardInfo *)binfo; @@ -604,6 +623,8 @@ static void machvirt_init(MachineState *machine) */ create_virtio_devices(vbi, pic); +create_fw_cfg(vbi); + vbi-bootinfo.ram_size = machine-ram_size; vbi-bootinfo.kernel_filename = machine-kernel_filename; vbi-bootinfo.kernel_cmdline = machine-kernel_cmdline; -- 1.8.3.1
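Andrew's proposed macro can be checked in isolation. The register sizes below are assumptions consistent with the thread (a 2-byte selector block and a 1-byte data register, 3 bytes total); with those values the aligned size degenerates to FW_CFG_SIZE itself, i.e. the macro only matters if FW_CFG_DATA_SIZE ever grows:

```c
#include <assert.h>

/* Assumed register sizes, matching the 3-byte block described above. */
#define FW_CFG_SIZE       2     /* selector register block */
#define FW_CFG_DATA_SIZE  1     /* data register */

/* Round n up to a multiple of m (m > 0). */
#define QEMU_ALIGN_UP(n, m)  (((n) + (m) - 1) / (m) * (m))
#define MIN(a, b)            ((a) < (b) ? (a) : (b))

/* Align the selector block to the data register size, capped at 4,
 * so the data register stays naturally aligned if its size changes. */
#define FW_CFG_SIZE_ALIGNED \
    MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \
        QEMU_ALIGN_UP(FW_CFG_SIZE, 4))
```

With the current 1-byte data register, FW_CFG_SIZE_ALIGNED evaluates to 2, so the memmap entry and the fw-cfg-data base address are unchanged; the future-proofing is free.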
Re: [Qemu-devel] [PATCH v2 5/6] geometry: Call backend function to detect geometry and blocksize
On 11/28/2014 01:10 PM, Markus Armbruster wrote: Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: hd_geometry_guess function autodetects the drive geometry. This patch adds a block backend call, that probes the backing device geometry. If the inner driver method is implemented and succeeds (currently only DASDs), the blkconf_geometry will pass-through the backing device geometry. Signed-off-by: Ekaterina Tumanova tuman...@linux.vnet.ibm.com --- hw/block/block.c | 11 +++ hw/block/hd-geometry.c | 9 + hw/block/virtio-blk.c| 1 + include/hw/block/block.h | 1 + 4 files changed, 22 insertions(+) diff --git a/hw/block/block.c b/hw/block/block.c index a625773..f1d29bc 100644 --- a/hw/block/block.c +++ b/hw/block/block.c @@ -25,6 +25,17 @@ void blkconf_serial(BlockConf *conf, char **serial) } } +void blkconf_blocksizes(BlockConf *conf) +{ +BlockBackend *blk = conf-blk; +struct ProbeBlockSize blocksize = blk_probe_blocksizes(blk); + +if (blocksize.rc == 0) { +conf-physical_block_size = blocksize.size.phys; +conf-logical_block_size = blocksize.size.log; +} +} + Uh, doesn't this overwrite the user's geometry? The existing blkconf_ functions do nothing when the user supplied the configuration. In other words, they only provide defaults. Why should this one be different? 
this will be fixed in the next version void blkconf_geometry(BlockConf *conf, int *ptrans, unsigned cyls_max, unsigned heads_max, unsigned secs_max, Error **errp) diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c index 6fcf74d..4972114 100644 --- a/hw/block/hd-geometry.c +++ b/hw/block/hd-geometry.c @@ -121,6 +121,14 @@ void hd_geometry_guess(BlockBackend *blk, int *ptrans) { int cylinders, heads, secs, translation; +struct ProbeGeometry geometry = blk_probe_geometry(blk); + +if (geometry.rc == 0) { +*pcyls = geometry.geo.cylinders; +*psecs = geometry.geo.sectors; +*pheads = geometry.geo.heads; +goto done; +} hd_geometry_guess() is called by blkconf_geometry() to conjure up a default geometry when the user didn't pick one. Fairly elaborate guesswork. blkconf_geometry() runs for IDE, SCSI and virtio-blk only. Your patch makes it pick the backend's geometry as reported by blk_probe_geometry() before anything else. This is an incompatible change for backends where blk_probe_geometry() succeeds. Currently, it succeeds only for DASD, and there you *want* the incompatible change. My point is: if we rely on blk_probe_geometry() failing except for DASD, then we should call it something like blk_dasd_geometry() to make that obvious. This function itself is not DASD specific. It calls the inner method for host_device driver, which currently only succeeds for DASDs. But in future someone can define methods for other drivers or even modify the host_device driver. The commit message needs to spell out the incompatible change. 
if (guess_disk_lchs(blk, cylinders, heads, secs) 0) { /* no LCHS guess: use a standard physical disk geometry */ @@ -143,6 +151,7 @@ void hd_geometry_guess(BlockBackend *blk, the logical geometry */ translation = BIOS_ATA_TRANSLATION_NONE; } +done: if (ptrans) { *ptrans = translation; } diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index b19b102..6f01565 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -745,6 +745,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp) error_propagate(errp, err); return; } +blkconf_blocksizes(conf-conf); virtio_init(vdev, virtio-blk, VIRTIO_ID_BLOCK, sizeof(struct virtio_blk_config)); Why only virtio-blk, and not the other device models sporting configurable block sizes (nvme, IDE, SCSI, usb-storage)? This is an incompatible change for backends where blk_probe_blocksizes() succeeds and returns something other than (512, 512). Currently, it succeeds only for DASD. Is that what we want? The commit message needs to spell out the incompatible change. Perhaps this patch should be split up into one for geometry and one for block sizes. diff --git a/include/hw/block/block.h b/include/hw/block/block.h index 0d0ce9a..856bf75 100644 --- a/include/hw/block/block.h +++ b/include/hw/block/block.h @@ -63,6 +63,7 @@ void blkconf_serial(BlockConf *conf, char **serial); void blkconf_geometry(BlockConf *conf, int *trans, unsigned cyls_max, unsigned heads_max, unsigned secs_max, Error **errp); +void blkconf_blocksizes(BlockConf *conf); /* Hard disk geometry */
Re: [Qemu-devel] [PATCH v2 0/6] Geometry and blocksize support for backing devices
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: On 11/25/2014 04:01 PM, Stefan Hajnoczi wrote: [...] 3. Why does s390 need to customize hd_geometry_guess()? Since hd_geometry_guess contains semantics of x86-specific LBA translation, we have to modify it not to get in the way of the z architecture I already explained this in my review, but it's buried among other stuff, so let me repeat it here: The only caller that passes a non-null ptrans to hd_geometry_guess() is ide_dev_initfn(), and the only user of the value that gets stored there is pc_cmos_init_late(). We store a highly PC-specific value there. That's okay, it's used only by PC machines. The whole concept of BIOS translation makes no sense for other machines. Heck, it barely makes sense even on PCs.
Re: [Qemu-devel] [PATCH v2 2/6] geometry: Detect blocksize via ioctls in separate static functions
Suggest function comment /** * Return logical block size, or zero if we can't figure it out */ { -BDRVRawState *s = bs->opaque; -char *buf; -unsigned int sector_size; - -/* For /dev/sg devices the alignment is not really used. - With buffered I/O, we don't have any restrictions. */ -if (bs->sg || !s->needs_alignment) { -bs->request_alignment = 1; -s->buf_align = 1; -return; -} +unsigned int sector_size = 0; Pointless initialization. If I do not initialize the sector_size, and the ioctl fails, I will return garbage as a blocksize to the caller. /* Try a few ioctls to get the right size */ -bs->request_alignment = 0; -s->buf_align = 0; - #ifdef BLKSSZGET if (ioctl(fd, BLKSSZGET, &sector_size) >= 0) { -bs->request_alignment = sector_size; +return sector_size; } #endif #ifdef DKIOCGETBLOCKSIZE if (ioctl(fd, DKIOCGETBLOCKSIZE, &sector_size) >= 0) { -bs->request_alignment = sector_size; +return sector_size; } #endif #ifdef DIOCGSECTORSIZE if (ioctl(fd, DIOCGSECTORSIZE, &sector_size) >= 0) { -bs->request_alignment = sector_size; +return sector_size; } #endif #ifdef CONFIG_XFS if (s->is_xfs) { struct dioattr da; if (xfsctl(NULL, fd, XFS_IOC_DIOINFO, &da) >= 0) { -bs->request_alignment = da.d_miniosz; +sector_size = da.d_miniosz; /* The kernel returns wrong information for d_mem */ /* s->buf_align = da.d_mem; */ Since you keep the enabled assignments to s->buf_align out of this function, you should keep out this disabled one, too. +return sector_size; } } #endif +return 0; +} + +static unsigned int probe_physical_blocksize(BlockDriverState *bs, int fd) Parameter bs is unused, let's drop it. +{ +unsigned int blk_size = 0; Pointless initialization. Same here. +#ifdef BLKPBSZGET +if (ioctl(fd, BLKPBSZGET, &blk_size) >= 0) { +return blk_size; +} +#endif + +return 0; +} + +static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp) +{ +BDRVRawState *s = bs->opaque; +char *buf; + +/* For /dev/sg devices the alignment is not really used. + With buffered I/O, we don't have any restrictions. 
*/ +if (bs->sg || !s->needs_alignment) { +bs->request_alignment = 1; +s->buf_align = 1; +return; +} + +s->buf_align = 0; +/* Let's try to use the logical blocksize for the alignment. */ +bs->request_alignment = probe_logical_blocksize(bs, fd); + /* If we could not get the sizes so far, we can only guess them */ if (!s->buf_align) { size_t align;
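The pattern under review above — try several size probes in order, take the first one that succeeds, and return zero so the caller can fall back to guessing — can be sketched in plain C. This is an illustrative sketch with stub probes standing in for the BLKSSZGET/DKIOCGETBLOCKSIZE/DIOCGSECTORSIZE ioctls, not the actual raw-posix.c code:

```c
#include <assert.h>
#include <stddef.h>

/* Each probe returns a positive block size on success, 0 on failure. */
typedef unsigned int (*blocksize_probe_fn)(int fd);

/* Walk the probe chain; the first successful probe wins. */
static unsigned int probe_blocksize(const blocksize_probe_fn *probes,
                                    size_t n, int fd)
{
    for (size_t i = 0; i < n; i++) {
        unsigned int size = probes[i](fd);
        if (size > 0) {
            return size;
        }
    }
    return 0;   /* nothing worked: caller must guess */
}

/* Stub probes for illustration (the real ones issue ioctls on fd): */
static unsigned int probe_fails(int fd) { (void)fd; return 0; }
static unsigned int probe_512(int fd)   { (void)fd; return 512; }
```

Returning 0 instead of leaving the output variable untouched is exactly why the "pointless initialization" comment matters: the caller can test the return value directly instead of trusting an out-parameter that an ioctl may never have written.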
Re: [Qemu-devel] [PATCH v3 07/22] qcow2: Helper function for refcount modification
On Thu, Nov 20, 2014 at 06:06:23PM +0100, Max Reitz wrote: @@ -1711,7 +1746,7 @@ static int calculate_refcounts(BlockDriverState *bs, BdrvCheckResult *res, static void compare_refcounts(BlockDriverState *bs, BdrvCheckResult *res, BdrvCheckMode fix, bool *rebuild, int64_t *highest_cluster, - uint16_t *refcount_table, int64_t nb_clusters) + void *refcount_table, int64_t nb_clusters) { BDRVQcowState *s = bs->opaque; int64_t i, refcount1, refcount2; An -ERANGE qcow2_get_refcount() return value is treated as a check error here and we won't be able to rebuild the refcount blocks: refcount1 = qcow2_get_refcount(bs, i); if (refcount1 < 0) { fprintf(stderr, "Can't get refcount for cluster %" PRId64 ": %s\n", i, strerror(-refcount1)); res->check_errors++; continue; } We should allow rebuilding refcount blocks, -ERANGE is just another corrupted refcount.
Re: [Qemu-devel] [PATCH v3 06/22] qcow2: Helper for refcount array reallocation
On Thu, Nov 27, 2014 at 04:11:12PM +0100, Max Reitz wrote: On 2014-11-27 at 16:09, Stefan Hajnoczi wrote: On Thu, Nov 20, 2014 at 06:06:22PM +0100, Max Reitz wrote: +/** + * Reallocates *array so that it can hold new_size entries. *size must contain + * the current number of entries in *array. If the reallocation fails, *array + * and *size will not be modified and -errno will be returned. If the + * reallocation is successful, *array will be set to the new buffer and *size + * will be set to new_size. The size of the reallocated refcount array buffer + * will be aligned to a cluster boundary, and the newly allocated area will be + * zeroed. + */ +static int realloc_refcount_array(BDRVQcowState *s, uint16_t **array, + int64_t *size, int64_t new_size) +{ +/* Round to clusters so the array can be directly written to disk */ +size_t old_byte_size = ROUND_UP(refcount_array_byte_size(s, *size), +s->cluster_size); +size_t new_byte_size = ROUND_UP(refcount_array_byte_size(s, new_size), +s->cluster_size); +uint16_t *new_ptr; + +if (new_byte_size <= old_byte_size) { +*size = new_size; +return 0; +} Why not realloc the array to the new smaller size? ... Because such a call will actually never happen. I could replace this if () by assert(new_byte_size >= old_byte_size); if (new_byte_size == old_byte_size), but as I said before, I'm not a friend of assertions when the code can deal perfectly well with the unsupported case. It is simpler to put an if statement around the memset. Then the function actually frees unused memory and readers don't wonder why you decided not to shrink the array. Less code and slightly better behavior. Stefan
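The scheme Stefan suggests — let both grow and shrink go through realloc, round to a cluster boundary, and guard only the zeroing memset with an if — can be sketched in a simplified form. This is an illustrative sketch (entries are single bytes and the cluster size is a made-up constant; the real code uses s->cluster_size and refcount_array_byte_size()):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CLUSTER 8   /* illustrative cluster size in bytes */
#define ROUND_UP_(n, m) (((n) + (m) - 1) / (m) * (m))

/* Reallocate *array (holding *size one-byte entries) so that it can hold
 * new_size entries, rounding the buffer to a cluster boundary and zeroing
 * any newly allocated bytes. Shrinking goes through the same path. */
static int realloc_zeroed(uint8_t **array, int64_t *size, int64_t new_size)
{
    size_t old_byte_size = ROUND_UP_((size_t)*size, CLUSTER);
    size_t new_byte_size = ROUND_UP_((size_t)new_size, CLUSTER);
    uint8_t *new_ptr;

    if (new_byte_size == old_byte_size) {
        *size = new_size;       /* buffer already the right size */
        return 0;
    }

    new_ptr = realloc(*array, new_byte_size);
    if (!new_ptr) {
        return -ENOMEM;         /* *array and *size untouched */
    }
    if (new_byte_size > old_byte_size) {
        /* the if around the memset: only zero when we actually grew */
        memset(new_ptr + old_byte_size, 0, new_byte_size - old_byte_size);
    }
    *array = new_ptr;
    *size = new_size;
    return 0;
}
```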
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). Paolo
Re: [Qemu-devel] [PATCH v2 0/6] Geometry and blocksize support for backing devices
On 11/28/2014 01:35 PM, Stefan Hajnoczi wrote: On Wed, Nov 26, 2014 at 01:16:48PM +0300, Ekaterina Tumanova wrote: On 11/25/2014 04:01 PM, Stefan Hajnoczi wrote: On Wed, Nov 19, 2014 at 11:17:50AM +0100, Ekaterina Tumanova wrote: 3. Why does s390 need to customize hd_geometry_guess()? Since hd_geometry_guess contains the semantics of x86-specific LBA translation, we have to modify it not to get in the way of the z architecture If the behavior is x86-specific, it should be made x86-specific. z, ppc, sparc, arm, etc should share the non-x86-specific code path. It's weird to single out z here, this seems like the wrong way around. Is the x86-specific behavior a problem in practice? No one seemed to mind for the other targets. On the s390 arch this adds support for FCP attached SCSI disks that do not yet have a partition table. Without this patch, fdisk -l on the host would return different results than fdisk -l in the guest. If you guys don't like the way the patch goes, I can exclude it from the patch series and we can discuss it later. But I thought that since it's related to the hard drive geometry, and since we change hd_geometry_guess in this patchset anyway, why not get rid of this problem as well. Kate.
Re: [Qemu-devel] [qemu PATCH 2/2] arm: add fw_cfg to virt board
On Fri, Nov 28, 2014 at 11:49:48AM +0100, Laszlo Ersek wrote: On 11/28/14 11:43, Laszlo Ersek wrote: On 11/28/14 11:38, Andrew Jones wrote: On Fri, Nov 28, 2014 at 12:18:27AM +0100, Laszlo Ersek wrote: fw_cfg already supports exposure over MMIO (used in ppc/mac_newworld.c, ppc/mac_oldworld.c, sparc/sun4m.c); we can easily add it to the virt board. The mmio register block of fw_cfg is advertised in the device tree. As base address we pick 0x0902, which conforms to the comment preceding a15memmap: it falls in the miscellaneous device I/O range 128MB..256MB, and it is aligned at 64KB. fw_cfg automatically exports a number of files to the guest; for example, bootorder (see fw_cfg_machine_reset()). Signed-off-by: Laszlo Ersek ler...@redhat.com --- hw/arm/virt.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 314e55b..070bd34 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -68,6 +68,7 @@ enum { VIRT_UART, VIRT_MMIO, VIRT_RTC, +VIRT_FW_CFG, }; typedef struct MemMapEntry { @@ -107,6 +108,7 @@ static const MemMapEntry a15memmap[] = { [VIRT_GIC_CPU] = { 0x0801, 0x0001 }, [VIRT_UART] = { 0x0900, 0x1000 }, [VIRT_RTC] = { 0x0901, 0x1000 }, +[VIRT_FW_CFG] = { 0x0902, FW_CFG_SIZE + FW_CFG_DATA_SIZE }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */ /* 0x1000 .. 
0x4000 reserved for PCI */ @@ -519,6 +521,23 @@ static void create_flash(const VirtBoardInfo *vbi) g_free(nodename); } +static void create_fw_cfg(const VirtBoardInfo *vbi) +{ +hwaddr base = vbi->memmap[VIRT_FW_CFG].base; +char *nodename; + +fw_cfg_init(0, 0, base, base + FW_CFG_SIZE); + +nodename = g_strdup_printf("/fw-cfg@%" PRIx64, base); +qemu_fdt_add_subnode(vbi->fdt, nodename); +qemu_fdt_setprop_string(vbi->fdt, nodename, +"compatible", "fw-cfg,mmio"); +qemu_fdt_setprop_sized_cells(vbi->fdt, nodename, "reg", + 2, base, 2, FW_CFG_SIZE, + 2, base + FW_CFG_SIZE, 2, FW_CFG_DATA_SIZE); Overkill suggestion alert, but how about defining something like #define FW_CFG_SIZE_ALIGNED \ MIN(QEMU_ALIGN_UP(FW_CFG_SIZE, FW_CFG_DATA_SIZE), \ QEMU_ALIGN_UP(FW_CFG_SIZE, 4)) and then using that in your memmap size calculation and fw-cfg-data base address calculation. The only reason I suggest this is because it's hard to tell that fw-cfg-data's address will be naturally aligned without hunting down the definition of FW_CFG_DATA_SIZE. And, if it were to change (which it probably never will), then it may not be. Why does it need to be aligned? The selector register is aligned at a 64KB boundary (for independent, strict reasons). The data register is not aligned at all, and -- AFAICS -- it need not be, because it's 1 byte wide. (In fact the ARM-specific Mmio(Read|Write)XX functions in edk2 enforce natural alignment, and the above layout passes without problems.) The full register block is 3 bytes wide. Is that a problem? Hm, I think I get it now. If FW_CFG_DATA_SIZE were to increase, then its alignment would have to increase as well, and whatever alignment FW_CFG_SIZE provides might not suffice. So, you'd calculate the natural alignment, but wouldn't increase it beyond 4. I do think this is a bit overkill :) but I can do it. Let's wait for more review comments first. Actually, on second thought, completely scratch my overkill suggestion. It's actually wrong to be concerned with it anyway. 
FW_CFG_DATA_SIZE doesn't dictate how we should access the data port, the fw-cfg protocol does, and that says we should access exactly one byte. So, regardless of the fw-cfg-data size, we'll never have to worry about the data port's alignment, as we'll never access more than one byte from it. Thanks! Laszlo
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Paolo
Re: [Qemu-devel] [PATCH v6 2/3] linux-aio: handling -EAGAIN for !s-io_q.plugged case
Am 28.11.2014 um 03:27 hat Ming Lei geschrieben: Hi Kevin, On Wed, Nov 26, 2014 at 7:27 PM, Kevin Wolf kw...@redhat.com wrote: Am 25.11.2014 um 08:23 hat Ming Lei geschrieben: Previously -EAGAIN is simply ignored for the !s->io_q.plugged case, and sometimes it is easy to cause -EIO to the VM, such as with an NVMe device. This patch handles -EAGAIN via the io queue for the !s->io_q.plugged case, and it will be retried in the following aio completion cb. Reviewed-by: Paolo Bonzini pbonz...@redhat.com Suggested-by: Paolo Bonzini pbonz...@redhat.com Signed-off-by: Ming Lei ming@canonical.com --- block/linux-aio.c | 24 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/block/linux-aio.c b/block/linux-aio.c index 11ac828..ac25722 100644 --- a/block/linux-aio.c +++ b/block/linux-aio.c @@ -282,8 +282,13 @@ static int ioq_enqueue(struct qemu_laio_state *s, struct iocb *iocb) s->io_q.iocbs[idx++] = iocb; s->io_q.idx = idx; -/* submit immediately if queue depth is above 2/3 */ -if (idx > s->io_q.size * 2 / 3) { +/* + * This is reached in two cases: queue not plugged but io_submit + * returned -EAGAIN, or queue plugged. In the latter case, start + * submitting some I/O if the queue is getting too full. In the + * former case, instead, wait until an I/O operation is completed. + */ Are we guaranteed that an I/O operation is in flight when we get -EAGAIN? The manpage of io_submit isn't very clear on this; insufficient resources could mean any reason. That is a good question. From fs/aio.c in the Linux kernel, io_submit_one() returns -EAGAIN when either there aren't enough requests reserved in io_setup(), or kmem_cache_alloc(GFP_KERNEL) returns NULL. In the former case, it means an I/O operation is in flight. In the latter case, it should be very difficult to trigger since a GFP_KERNEL allocation will wait for memory reclaim. So most of the time, it is reasonable to resubmit on completion for -EAGAIN. 
When there is no pending I/O, we can still handle the very unlikely case either by returning failure to the caller or by trying to submit in a BH. Does that make sense to you? I think returning an error is fine in this case. Kevin
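The strategy the thread converges on — on -EAGAIN, park the request and resubmit it from the completion callback if something is in flight, otherwise fail it to the caller — can be sketched with a stubbed submission path. All names here are illustrative; the real logic lives in block/linux-aio.c around io_submit():

```c
#include <assert.h>
#include <stdbool.h>

#define MY_EAGAIN (-11)   /* stand-in for -EAGAIN */

typedef struct {
    int in_flight;   /* requests the kernel has accepted */
    int queued;      /* requests parked after -EAGAIN */
} SubmitState;

/* Submit one request; kernel_full simulates io_submit() hitting -EAGAIN.
 * Returns 0 on success (or successful queueing), MY_EAGAIN on hard failure. */
static int submit_one(SubmitState *s, bool kernel_full)
{
    if (kernel_full) {
        if (s->in_flight > 0) {
            s->queued++;         /* a completion will retry this later */
            return 0;
        }
        return MY_EAGAIN;        /* nothing in flight: report the failure */
    }
    s->in_flight++;
    return 0;
}

/* Completion callback: one request finished, so the kernel has a free slot;
 * resubmit one parked request (assumed to succeed now). */
static void complete_one(SubmitState *s)
{
    s->in_flight--;
    if (s->queued > 0) {
        s->queued--;
        s->in_flight++;
    }
}
```

The key invariant is that a parked request is only ever parked while at least one request is in flight, so there is always a future completion event to drive the retry.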
Re: [Qemu-devel] [PATCH v3 07/22] qcow2: Helper function for refcount modification
On Thu, Nov 27, 2014 at 04:32:52PM +0100, Max Reitz wrote: On 2014-11-27 at 16:21, Stefan Hajnoczi wrote: On Thu, Nov 20, 2014 at 06:06:23PM +0100, Max Reitz wrote: @@ -583,7 +608,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, /* we can update the count and save it */ block_index = cluster_index & (s->refcount_block_size - 1); -refcount = be16_to_cpu(refcount_block[block_index]); +refcount = s->get_refcount(refcount_block, block_index); +if (refcount < 0) { +ret = -ERANGE; +goto fail; +} Here again. @@ -1206,11 +1236,14 @@ static int inc_refcounts(BlockDriverState *bs, } } -if (++(*refcount_table)[k] == 0) { +refcount = s->get_refcount(*refcount_table, k); Here the refcount < 0 check is missing. That's why it would be simpler to eliminate the refcount < 0 case entirely. It's not missing. This is part of the refcount check, as are all the following (in other places). The refcount check builds a refcount array in memory all by itself, so it knows for sure there are no overflows. The line which you omitted directly after this clips the refcount values against s->refcount_max, which is INT64_MAX for 64-bit refcounts. Therefore, no overflows are possible in the refcount checking functions, because the refcount checking functions don't introduce the overflows themselves. The other overflow checks are only in place to reject faulty images provided from the outside. I see, that makes sense. @@ -1726,7 +1761,7 @@ static void compare_refcounts(BlockDriverState *bs, BdrvCheckResult *res, continue; } -refcount2 = refcount_table[i]; +refcount2 = s->get_refcount(refcount_table, i); Missing here too and in other places. +typedef uint64_t Qcow2GetRefcountFunc(const void *refcount_array, + uint64_t index); Should the return type be int64_t? No. If it was, we'd have to check for errors every time we call it, but it cannot return errors (well, if we let it return int64_t, it might return -ERANGE, but that's exactly what I don't want). 
Therefore, let it return uint64_t so we know this function cannot fail. +typedef void Qcow2SetRefcountFunc(void *refcount_array, + uint64_t index, uint64_t value); Should value's type be int64_t? Just because the type is unsigned doesn't make (uint64_t)-1ULL a valid value. Actually, it does. It's just that the implementation provided here does not support it. Since there is an assertion against that case in the 64-bit implementation for this function, I don't have a problem with using int64_t here, though. But that would break symmetry with Qcow2GetRefcountFunc(), and I do have a reason there not to return a signed value, as explained above. Okay, so uint64_t is really a different type - it's the qcow2 spec 64-bit refcount. int64_t is the QEMU implementation's internal representation. This seems error-prone to me. Maybe comments would have helped, but it's best to eliminate the problem entirely. Why not bite the bullet and fix up qcow2? There are two external function prototypes that need to be changed: int64_t qcow2_get_refcount(BlockDriverState *bs, int64_t cluster_index); int64_t qcow2_update_cluster_refcount(BlockDriverState *bs, int64_t cluster_index, int addend, enum qcow2_discard_type type); /* Returns 0 on success, -errno on failure */ int qcow2_get_refcount(BlockDriverState *bs, int64_t cluster_index, uint64_t *refcount); int qcow2_update_cluster_refcount(BlockDriverState *bs, int64_t cluster_index, int addend, enum qcow2_discard_type type, uint64_t *refcount); Have I missed a fundamental reason why the implementation's internal refcount type cannot be changed from int64_t to uint64_t? It would keep the code complexity down and reduce errors.
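The width-specific accessor pairs under discussion can be sketched for the two easiest cases, 16-bit and 64-bit entries. This is only an illustration of the function-pointer interface (the actual patch series also covers sub-byte refcount widths and the big-endian on-disk layout, both omitted here):

```c
#include <assert.h>
#include <stdint.h>

/* The interface from the patch: a getter that cannot fail, and a setter. */
typedef uint64_t Qcow2GetRefcountFunc(const void *refcount_array,
                                      uint64_t index);
typedef void Qcow2SetRefcountFunc(void *refcount_array,
                                  uint64_t index, uint64_t value);

static uint64_t get_refcount16(const void *array, uint64_t index)
{
    return ((const uint16_t *)array)[index];
}

static void set_refcount16(void *array, uint64_t index, uint64_t value)
{
    assert(value <= UINT16_MAX);   /* overflow is the caller's bug */
    ((uint16_t *)array)[index] = (uint16_t)value;
}

static uint64_t get_refcount64(const void *array, uint64_t index)
{
    return ((const uint64_t *)array)[index];
}

static void set_refcount64(void *array, uint64_t index, uint64_t value)
{
    ((uint64_t *)array)[index] = value;
}
```

The 16-bit setter's assertion mirrors the point made above: the unsigned interface can represent any 64-bit value, but a narrower implementation must reject values it cannot store rather than silently truncate them.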
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... You mean an algorithm similar to perf_master2 or to the current implementation? The ring buffer seems to have a drawback when it comes to excessive coroutine nesting. My idea was that you do not throw away hot coroutines when the pool is full. However, I do not know if this is really a problem since the pool is only full when there is not much I/O. Or is this assumption too simple? Peter
Re: [Qemu-devel] [RFC PATCH v5 22/31] timer: introduce new QEMU_CLOCK_VIRTUAL_RT clock
From: Paolo Bonzini [mailto:pbonz...@redhat.com] On 26/11/2014 11:40, Pavel Dovgalyuk wrote: This patch introduces new QEMU_CLOCK_VIRTUAL_RT clock, which should be used for icount warping. Separate timer is needed for replaying the execution, because warping callbacks should be deterministic. We cannot make realtime clock deterministic because it is used for screen updates and other simulator-specific actions. That is why we added new clock which is recorded and replayed when needed. Signed-off-by: Pavel Dovgalyuk pavel.dovga...@ispras.ru --- include/qemu/timer.h |7 +++ qemu-timer.c |2 ++ replay/replay.h |4 +++- 3 files changed, 12 insertions(+), 1 deletions(-) diff --git a/include/qemu/timer.h b/include/qemu/timer.h index 7b43331..df27157 100644 --- a/include/qemu/timer.h +++ b/include/qemu/timer.h @@ -37,12 +37,19 @@ * is suspended, and it will reflect system time changes the host may * undergo (e.g. due to NTP). The host clock has the same precision as * the virtual clock. + * + * @QEMU_CLOCK_VIRTUAL_RT: realtime clock used for icount warp + * + * This clock runs as a realtime clock, but is used for icount warp + * and thus should be traced with record/replay to make warp function + * behave deterministically. */ I think it should also stop/restart across stop and cont commands, similar to QEMU_CLOCK_VIRTUAL. This is as simple as changing get_clock() to cpu_get_clock(). Not so easy :) cpu_get_clock() checks vm_clock_seqlock which is locked in icount_warp_rt(). And after locking it requests the value of QEMU_CLOCK_VIRTUAL_RT: seqlock_write_lock(timers_state.vm_clock_seqlock); if (runstate_is_running()) { int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL_RT); Pavel Dovgalyuk
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Why do you set pool_size = 0 in the create path? When I do the following: diff --git a/qemu-coroutine.c b/qemu-coroutine.c index 6bee354..c79ee78 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) * and the actual size of alloc_pool. But it is just a heuristic, * it does not need to be perfect. */ -pool_size = 0; +atomic_dec(&pool_size); QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); co = QSLIST_FIRST(&alloc_pool); I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine
Re: [Qemu-devel] [PATCH v2 0/3] block: Don't probe for unknown backing file format
On Tue, Nov 25, 2014 at 06:12:39PM +0100, Kevin Wolf wrote: v2: - Added two patches to make the test case actually work properly and not just by accident on my laptop. [Max] - Fixed comment in test case [Max] Kevin Wolf (3): qcow2: Fix header extension size check qcow2.py: Add required padding for header extensions block: Don't probe for unknown backing file format block.c | 7 +++--- block/qcow2.c | 2 +- tests/qemu-iotests/080 | 2 ++ tests/qemu-iotests/080.out | 2 ++ tests/qemu-iotests/114 | 61 + tests/qemu-iotests/114.out | 13 ++ tests/qemu-iotests/group| 1 + tests/qemu-iotests/qcow2.py | 4 +++ 8 files changed, 87 insertions(+), 5 deletions(-) create mode 100755 tests/qemu-iotests/114 create mode 100644 tests/qemu-iotests/114.out -- 1.8.3.1 Thanks, applied to my block-next tree: https://github.com/stefanha/qemu/commits/block-next Stefan
Re: [Qemu-devel] [PATCH] block: do not use get_clock()
On Wed, Nov 26, 2014 at 03:01:02PM +0100, Paolo Bonzini wrote: Use the external qemu-timer API instead. Cc: kw...@redhat.com Cc: stefa...@redhat.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- block/accounting.c | 6 -- block/raw-posix.c | 8 2 files changed, 8 insertions(+), 6 deletions(-) Thanks, applied to my block-next tree: https://github.com/stefanha/qemu/commits/block-next Stefan
Re: [Qemu-devel] [PATCH RFC for-2.2] virtio-blk: force 1st s/g to match header
On Fri, Nov 28, 2014 at 7:05 AM, Jason Wang jasow...@redhat.com wrote: On Fri, Nov 28, 2014 at 9:16 AM, Fam Zheng f...@redhat.com wrote: On Thu, 11/27 23:13, Michael S. Tsirkin wrote: On Thu, Nov 27, 2014 at 07:21:35PM +, Stefan Hajnoczi wrote: On Thu, Nov 27, 2014 at 4:33 PM, Michael S. Tsirkin m...@redhat.com wrote: We leak cpu mappings when 1st s/g is not exactly the header. As we don't set ANY_LAYOUT, we can at this point simply assert the correct length. This will have to be fixed once ANY_LAYOUT is set. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Untested: posting for early feedback. Reviewed-by: Stefan Hajnoczi stefa...@redhat.com Stefan, you are going to merge this for 2.2? Peter, if Stefan is merging this one, we can take Jason's patch for -net and that should be enough for 2.2. My test bot saw a failure of make check: qemu-system-x86_64: virtio-blk request outhdr too long Broken pipe GTester: last random seed: R02S44c5a09d74c0603f0091ed20d1395121 qemu-system-x86_64: virtio-blk request outhdr too long Broken pipe GTester: last random seed: R02Sfdf270c8e1ed75c1f2047e3f0fb89b88 qemu-system-x86_64: virtio-blk request outhdr too long Broken pipe GTester: last random seed: R02Sd292c540929b963ed9d7d155f3db5452 qemu-system-x86_64: virtio-blk request outhdr too long Broken pipe GTester: last random seed: R02Sfc775a34a844c39f376aa478eda7573f [vmxnet3][WR][vmxnet3_peer_has_vnet_hdr]: Peer has no virtio extension. Task offloads will be emulated. make: *** [check-qtest-x86_64] Error 1 Fam This is probably because of the test itself. But anyway all kinds of guests (especially Windows drivers) need to be tested. Right, the test case explicitly tests different descriptor layouts, even though virtio-blk-pci does not set the ANY_LAYOUT feature bit. Either the test case needs to check ANY_LAYOUT before using the 2-descriptor layout or it needs to expect QEMU to refuse (in this case exit(1), which is not very graceful). 
The quick fix is to skip the 2-descriptor layout tests and re-enable them once virtio-blk actually supports ANY_LAYOUT. Any objections? Stefan
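The gating logic proposed above — only exercise a merged-header descriptor layout when the device advertises ANY_LAYOUT — comes down to a feature-bit test. A minimal sketch (the helper name is illustrative; VIRTIO_F_ANY_LAYOUT is feature bit 27 per the virtio spec):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_ANY_LAYOUT 27   /* legacy virtio feature bit */

/* A test case would call this on the negotiated feature word before
 * building a layout where the request header shares a descriptor with
 * the data buffer. */
static bool can_use_merged_layout(uint64_t device_features)
{
    return device_features & ((uint64_t)1 << VIRTIO_F_ANY_LAYOUT);
}
```

Without ANY_LAYOUT, the driver must put the virtio-blk request header in its own descriptor, which is exactly what the failing qtest layouts violate.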
Re: [Qemu-devel] [PATCH] Fix for crash after migration in virtio-rng on bi-endian targets
On Fri, Nov 28, 2014 at 09:14:46AM +, Peter Maydell wrote: On 27 November 2014 at 09:26, Markus Armbruster arm...@redhat.com wrote: David Gibson da...@gibson.dropbear.id.au writes: VirtIO devices now remember which endianness they're operating in in order to support targets which may have guests of either endianness, such as powerpc. This endianness state is transferred in a subsection of the virtio device's information. With virtio-rng this can lead to an abort after a loadvm hitting the assert() in virtio_is_big_endian(). This can be reproduced by doing a migrate and load from file on a bi-endian target with a virtio-rng device. The actual guest state isn't particularly important to triggering this. The cause is that virtio_rng_load_device() calls virtio_rng_process() which accesses the ring and thus needs the endianness. However, virtio_rng_process() is called via virtio_load() before it loads the subsections. Essentially the ->load callback in VirtioDeviceClass should only be used for actually reading the device state from the stream, not for post-load re-initialization. This patch fixes the bug by moving the virtio_rng_process() after the call to virtio_load(). Better yet would be to convert virtio to use vmsd and have the virtio_rng_process() as a post_load callback, but that's a bigger project for another day. This is bugfix, and should be considered for the 2.2 branch. [PATCH for-2.2] would have been a good idea then. Next time :) So do you want this patch in 2.2? I was planning to put in the virtio-vs-xen fixes today and tag rc4, so it's not too late if you're confident this patch is good. Let me know if you think it should go in, and I can apply it to master directly. Yes, I think it should be applied. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 12:32 schrieb Peter Lieven: Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try coroutine: Use __thread … together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Why do you set pool_size = 0 in the create path? When I do the following: diff --git a/qemu-coroutine.c b/qemu-coroutine.c index 6bee354..c79ee78 100644 --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry) * and the actual size of alloc_pool. But it is just a heuristic, * it does not need to be perfect. */ -pool_size = 0; +atomic_dec(&pool_size); QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); co = QSLIST_FIRST(&alloc_pool); I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Peter
Re: [Qemu-devel] [PATCH] Fix for crash after migration in virtio-rng on bi-endian targets
On Fri, 28 Nov 2014 09:14:46 + Peter Maydell peter.mayd...@linaro.org wrote: On 27 November 2014 at 09:26, Markus Armbruster arm...@redhat.com wrote: David Gibson da...@gibson.dropbear.id.au writes: VirtIO devices now remember which endianness they're operating in in order to support targets which may have guests of either endianness, such as powerpc. This endianness state is transferred in a subsection of the virtio device's information. With virtio-rng this can lead to an abort after a loadvm hitting the assert() in virtio_is_big_endian(). This can be reproduced by doing a migrate and load from file on a bi-endian target with a virtio-rng device. The actual guest state isn't particularly important to triggering this. The cause is that virtio_rng_load_device() calls virtio_rng_process() which accesses the ring and thus needs the endianness. However, virtio_rng_process() is called via virtio_load() before it loads the subsections. Essentially the -load callback in VirtioDeviceClass should only be used for actually reading the device state from the stream, not for post-load re-initialization. This patch fixes the bug by moving the virtio_rng_process() after the call to virtio_load(). Better yet would be to convert virtio to use vmsd and have the virtio_rng_process() as a post_load callback, but that's a bigger project for another day. This is bugfix, and should be considered for the 2.2 branch. [PATCH for-2.2] would have been a good idea then. Next time :) So do you want this patch in 2.2? I was planning to put in the virtio-vs-xen fixes today and tag rc4, so it's not too late if you're confident this patch is good. Let me know if you think it should go in, and I can apply it to master directly. -- PMM Peter, FWIW I think it should. Commit 3902d49e13c2428bd6381cfdf183103ca4477c1f is clearly bad: virtio-rng does not need the .load callback obviously... and the fact it breaks migration makes it even worse... :( Please apply to 2.2. Cheers. -- Greg
Re: [Qemu-devel] [RFC PATCH 1/3] qemu-img bench
On Wed, Nov 26, 2014 at 03:46:42PM +0100, Kevin Wolf wrote: +while (data.n > 0) { +main_loop_wait(false); +} Why is this false (non-blocking)? This is why you get the main loop spun warning message. Using true (blocking) seems like the right thing. data.n changes as part of the callback, which is invoked from the main loop. There is no need to be non-blocking. Stefan
Re: [Qemu-devel] [RFC PATCH 1/3] qemu-img bench
Am 28.11.2014 um 12:49 hat Stefan Hajnoczi geschrieben: On Wed, Nov 26, 2014 at 03:46:42PM +0100, Kevin Wolf wrote:

+    while (data.n > 0) {
+        main_loop_wait(false);
+    }

Why is this false (non-blocking)? This is why you get the main loop spun warning message. Using true (blocking) seems like the right thing. data.n changes as part of the callback, which is invoked from the main loop. There is no need to be non-blocking. I think the parameter has exactly the opposite meaning as what you describe:

    int main_loop_wait(int nonblocking)

If it were true, you would get timeout = 0. qemu-io and qemu-nbd also pass false here. Kevin
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On 28/11/2014 12:32, Peter Lieven wrote: Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try "coroutine: Use __thread …" together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Why do you set pool_size = 0 in the create path? When I do the following:

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..c79ee78 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
      * and the actual size of alloc_pool. But it is just a heuristic,
      * it does not need to be perfect.
      */
-    pool_size = 0;
+    atomic_dec(&pool_size);
     QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
     co = QSLIST_FIRST(&alloc_pool);

I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Because pool_size is the (approximate) number of coroutines in the pool. It is zero after QSLIST_MOVE_ATOMIC has NULL-ed out release_pool.slh_first. Paolo
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 13:21 schrieb Paolo Bonzini: On 28/11/2014 12:32, Peter Lieven wrote: Am 28.11.2014 um 12:23 schrieb Paolo Bonzini: On 28/11/2014 12:21, Peter Lieven wrote: Am 28.11.2014 um 12:14 schrieb Paolo Bonzini: master: Run operation 4000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine paolo: Run operation 4000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine Nice. :) Can you please try "coroutine: Use __thread …" together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145). indeed: Run operation 4000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists... Why do you set pool_size = 0 in the create path? When I do the following:

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..c79ee78 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -44,7 +44,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
      * and the actual size of alloc_pool. But it is just a heuristic,
      * it does not need to be perfect.
      */
-    pool_size = 0;
+    atomic_dec(&pool_size);
     QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
     co = QSLIST_FIRST(&alloc_pool);

I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Because pool_size is the (approximate) number of coroutines in the pool. It is zero after QSLIST_MOVE_ATOMIC has NULL-ed out release_pool.slh_first. Got it meanwhile. And it's not as bad as I thought, since you only steal the release_pool if your alloc_pool is empty. Right? Peter
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On 28/11/2014 12:46, Peter Lieven wrote: I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth. Paolo
[Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
Peter Maydell suggested that we describe new devices / DTB nodes in the kernel Documentation tree that we expose to arm virt guests in QEMU. Although the kernel is not required to access the fw_cfg interface, Documentation/devicetree/bindings/arm is probably the best central spot to keep the fw_cfg description in.

Suggested-by: Peter Maydell peter.mayd...@linaro.org
Signed-off-by: Laszlo Ersek ler...@redhat.com
---
 Documentation/devicetree/bindings/arm/fw-cfg.txt | 47 
 1 file changed, 47 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/fw-cfg.txt

diff --git a/Documentation/devicetree/bindings/arm/fw-cfg.txt b/Documentation/devicetree/bindings/arm/fw-cfg.txt
new file mode 100644
index 000..d131453
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/fw-cfg.txt
@@ -0,0 +1,47 @@
+* QEMU Firmware Configuration bindings for ARM
+
+QEMU's arm-softmmu and aarch64-softmmu emulation / virtualization targets
+provide the following Firmware Configuration interface on the virt machine
+type:
+
+- A write-only, 16-bit wide selector (or control) register,
+- a read-write, 8-bit wide data register.
+
+The guest writes a selector value (a key) to the selector register, and then
+can read the corresponding data (produced by QEMU) via the data port. If the
+selected entry is writable, the guest can rewrite it through the data port. The
+authoritative registry of the valid selector values and their meanings is the
+QEMU source code.
+
+QEMU exposes the control and data register to x86 guests at fixed IO ports. ARM
+guests can access them as memory mapped registers, and their location is
+communicated to the guest's UEFI firmware in the DTB that QEMU places at the
+bottom of the guest's DRAM.
+
+The guest kernel is not expected to use these registers (although it is
+certainly allowed to); the device tree bindings are documented here because
+this is where device tree bindings reside in general.
+
+The addresses and sizes of the Firmware Configuration registers are given by
+the /fw-cfg node. The virt board invariably uses 2 as #size-cells and
+#address-cells in the context of this node.
+
+The reg property is therefore an array of four uint64_t elements (eight
+uint32_t cells in total), the first uint64_t pair describing the address and
+size of the control register, the second uint64_t pair describing the address
+and size of the data register. (See
+http://devicetree.org/Device_Tree_Usage#How_Addressing_Works.)
+
+The first string in the compatible property is "fw-cfg,mmio".
+
+Example:
+
+/ {
+	#size-cells = <0x2>;
+	#address-cells = <0x2>;
+
+	fw-cfg@902 {
+		reg = <0x0 0x902 0x0 0x2 0x0 0x9020002 0x0 0x1>;
+		compatible = "fw-cfg,mmio";
+	};
+};
-- 
1.8.3.1
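The selector/data protocol described in the binding is easy to model in a few lines of C. The sketch below is a hypothetical in-memory stand-in, not QEMU code: `FWCfgModel`, `fw_cfg_select` and `fw_cfg_data_read` are illustrative names, and a real guest would perform MMIO accesses at the addresses taken from the /fw-cfg DTB node instead.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory model of the fw_cfg protocol described above:
 * a 16-bit selector register picks an entry, and consecutive reads of
 * the 8-bit data register stream that entry's bytes. */

#define FW_CFG_ENTRIES 4
#define FW_CFG_ENTRY_SIZE 16

typedef struct FWCfgModel {
    uint8_t blobs[FW_CFG_ENTRIES][FW_CFG_ENTRY_SIZE]; /* entries keyed by selector */
    uint16_t selector;   /* last key written to the control register */
    unsigned cursor;     /* read position within the selected entry */
} FWCfgModel;

/* Write the selector (control) register: choose an entry, reset the cursor. */
static void fw_cfg_select(FWCfgModel *f, uint16_t key)
{
    f->selector = key;
    f->cursor = 0;
}

/* Read the 8-bit data register: return the next byte of the selected entry. */
static uint8_t fw_cfg_data_read(FWCfgModel *f)
{
    if (f->selector >= FW_CFG_ENTRIES || f->cursor >= FW_CFG_ENTRY_SIZE) {
        return 0;
    }
    return f->blobs[f->selector][f->cursor++];
}
```

Re-writing the selector rewinds the cursor, which is why a guest can re-read an entry simply by selecting it again.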
[Qemu-devel] [PATCHv3] block: add event when disk usage exceeds threshold
Managing applications, like oVirt (http://www.ovirt.org), make extensive use of thin-provisioned disk images. To let the guest run smoothly and not be unnecessarily paused, oVirt sets a disk usage threshold (so called 'high water mark') based on the occupation of the device, and automatically extends the image once the threshold is reached or exceeded. In order to detect the crossing of the threshold, oVirt has no choice but aggressively polling the QEMU monitor using the query-blockstats command. This leads to unnecessary system load, and is made even worse under scale: deployments with hundreds of VMs are no longer rare. To fix this, this patch adds: * A new monitor command to set a mark for a given block device. * A new event to report if a block device usage exceeds the threshold. This will allow the managing application to drop the polling altogether and just wait for a watermark crossing event.

Signed-off-by: Francesco Romani from...@redhat.com
---
 block/Makefile.objs             |   1 +
 block/qapi.c                    |   3 +
 block/usage-threshold.c         | 124 
 include/block/block_int.h       |   4 ++
 include/block/usage-threshold.h |  62 
 qapi/block-core.json            |  48 +++-
 qmp-commands.hx                 |  28 +
 tests/Makefile                  |   3 +
 tests/test-usage-threshold.c    | 101 
 9 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h
 create mode 100644 tests/test-usage-threshold.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..43e381d 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += usage-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index a87a34a..c85abb0 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@
 #include "block/qapi.h"
 #include "block/block_int.h"
+#include "block/usage-threshold.h"
 #include "qmp-commands.h"
 #include "qapi-visit.h"
 #include "qapi/qmp-output-visitor.h"
@@ -82,6 +83,8 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
         info->iops_size = cfg.op_size;
     }
 
+    info->write_threshold = bdrv_usage_threshold_get(bs);
+
     return info;
 }
diff --git a/block/usage-threshold.c b/block/usage-threshold.c
new file mode 100644
index 000..9faf5e6
--- /dev/null
+++ b/block/usage-threshold.c
@@ -0,0 +1,124 @@
+/*
+ * QEMU System Emulator block usage threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani from...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "block/block_int.h"
+#include "block/coroutine.h"
+#include "block/usage-threshold.h"
+#include "qemu/notify.h"
+#include "qapi-event.h"
+#include "qmp-commands.h"
+
+
+uint64_t bdrv_usage_threshold_get(const BlockDriverState *bs)
+{
+    return bs->write_threshold_offset;
+}
+
+bool bdrv_usage_threshold_is_set(const BlockDriverState *bs)
+{
+    return !!(bs->write_threshold_offset > 0);
+}
+
+static void usage_threshold_disable(BlockDriverState *bs)
+{
+    if (bdrv_usage_threshold_is_set(bs)) {
+        notifier_with_return_remove(&bs->write_threshold_notifier);
+        bs->write_threshold_offset = 0;
+    }
+}
+
+uint64_t bdrv_usage_threshold_exceeded(const BlockDriverState *bs,
+                                       const BdrvTrackedRequest *req)
+{
+    if (bdrv_usage_threshold_is_set(bs)) {
+        if (req->offset > bs->write_threshold_offset) {
+            return (req->offset - bs->write_threshold_offset) + req->bytes;
+        }
+        if ((req->offset + req->bytes) > bs->write_threshold_offset) {
+            return (req->offset + req->bytes) - bs->write_threshold_offset;
+        }
+    }
+    return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+                                            void *opaque)
+{
+    BdrvTrackedRequest *req = opaque;
+    BlockDriverState *bs = req->bs;
+    uint64_t amount = 0;
+
+    amount = bdrv_usage_threshold_exceeded(bs, req);
+    if (amount > 0) {
+        qapi_event_send_block_usage_threshold(
+            bs->node_name,
+            amount,
+            bs->write_threshold_offset,
+            &error_abort);
+
+        /* autodisable to avoid to flood the monitor */
+        usage_threshold_disable(bs);
+    }
+
+    return 0; /* should always let other notifiers run */
+}
+
+static void usage_threshold_register_notifier(BlockDriverState *bs)
+{
+    bs->write_threshold_notifier.notify = before_write_notify;
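The arithmetic in bdrv_usage_threshold_exceeded() above can be exercised in isolation. This is a standalone sketch with plain integers; `threshold_exceeded` is an illustrative name, not part of the patch, and a threshold of 0 stands in for "not set":

```c
#include <stdint.h>

/* Given a write request [offset, offset + bytes) and a configured
 * threshold, return by how many bytes the request goes past the
 * threshold (0 if it doesn't, or if no threshold is set).  Mirrors the
 * two branches of bdrv_usage_threshold_exceeded() in the patch. */
static uint64_t threshold_exceeded(uint64_t threshold,
                                   uint64_t offset, uint64_t bytes)
{
    if (threshold == 0) {
        return 0;            /* threshold disabled */
    }
    if (offset > threshold) {
        /* the whole request lies beyond the threshold */
        return (offset - threshold) + bytes;
    }
    if (offset + bytes > threshold) {
        /* only the tail of the request crosses it */
        return (offset + bytes) - threshold;
    }
    return 0;
}
```

The first branch matters for sparse writes that start past the threshold; the second covers a request that merely straddles it.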
[Qemu-devel] [PATCH v3] add write threshold reporting for block devices
v2 was: with RFC. Since the last review round I dropped the tag because I think now all the open points are addressed. v1 was: add watermark reporting for block devices, but watermark is incorrectly unused. Hence the change in subject.

Changes since v2:
- addressed reviewers comments:
  - use node name to find the block driver state to be checked
  - report node name in the event notification
- fixed signed vs unsigned integer: use uint64 everywhere, to deal with integer overflows (more) gracefully
- fixed pending style issues
- renamed and made public a few functions to make them testable
- added very basic initial unit tests

Changes since v1:
- addressed reviewers comments. Highlights:
  - fixed terminology: watermark -> usage threshold
  - threshold is expressed in bytes
  - made the event trigger only once when the threshold is crossed
  - configured threshold visible in 'query-block' output
- fixed bugs

Open issues: Not all node names show up in the 'query-block' output, but I'll start a different thread to discuss this.

Followup: patches I have on my queue and I'd like to post as followup:
- some more unit testing
- add support to set the threshold at device creation

Rationale and context from v1:
I'm one of the oVirt developers (http://www.ovirt.org); oVirt is a virtualization management application built around qemu/kvm, so it is nice to get in touch :) We have begun a big scalability improvement effort, aiming to support without problems hundreds of VMs per host, with plans to support thousands in a not so distant future. In doing so, we are reviewing our usage flows. One of them is thin-provisioned storage, which is used quite extensively, with block devices (ISCSI for example) and COW images.

When using thin provisioning, oVirt tries hard to hide this fact from the guest OS, and to do so watches closely the usage of the device and resizes it when its usage exceeds a configured threshold (the high water mark), in order to avoid the guest OS being paused due to space exhaustion. To do the watching, we poll the devices using libvirt [1], which in turn uses query-blockstats. This is suboptimal with just one VM, but with hundreds of them, let alone thousands, it doesn't scale and it is quite a resource hog. It would be great to have this watermark concept supported in qemu, with a new event to be raised when the limit is crossed. To track this RFE I filed https://bugs.launchpad.net/qemu/+bug/1338957 Moreover, I had the chance to take a look at the QEMU sources and come up with this tentative patch which I'd also like to submit.

Notes:
[0]: https://lists.gnu.org/archive/html/qemu-devel/2014-07/msg01348.html
[1]: http://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=ebb0c19c48690f0598de954f8e0e9d4d29d48b85

Francesco Romani (1):
  block: add event when disk usage exceeds threshold

 block/Makefile.objs             |   1 +
 block/qapi.c                    |   3 +
 block/usage-threshold.c         | 124 
 include/block/block_int.h       |   4 ++
 include/block/usage-threshold.h |  62 
 qapi/block-core.json            |  48 +++-
 qmp-commands.hx                 |  28 +
 tests/Makefile                  |   3 +
 tests/test-usage-threshold.c    | 101 
 9 files changed, 373 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h
 create mode 100644 tests/test-usage-threshold.c

-- 
1.9.3
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: On 28/11/2014 12:46, Peter Lieven wrote: I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth. Yes, cool. The atomic operations are a bit tricky at first glance ;-) Question: Why is the pool_size increment atomic and the set to zero not? Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-)

Run operation 4000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..edea162 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -25,8 +25,9 @@ enum {
 
 /** Free list to speed up creation */
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
+static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;
 
 /* The GPrivate is only used to invoke coroutine_pool_cleanup. */
 static void coroutine_pool_cleanup(void *value);
@@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (pool_size > POOL_BATCH_SIZE) {
-                /* This is not exact; there could be a little skew between pool_size
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* This is not exact; there could be a little skew between release_pool_size
                  * and the actual size of alloc_pool. But it is just a heuristic,
                  * it does not need to be perfect.
                  */
-                pool_size = 0;
+                alloc_pool_size = atomic_fetch_and(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
@@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
                  */
                 g_private_set(&dummy_key, &dummy_key);
             }
+        } else {
+            alloc_pool_size--;
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
@@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        if (pool_size < POOL_BATCH_SIZE * 2) {
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
             co->caller = NULL;
             QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&pool_size);
+            atomic_inc(&release_pool_size);
+            return;
+        } else if (alloc_pool_size < POOL_BATCH_SIZE) {
+            co->caller = NULL;
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
             return;
         }
     }

Bug?: The release_pool is not cleaned up on termination, I think. Peter
Re: [Qemu-devel] [RFC PATCH v5 22/31] timer: introduce new QEMU_CLOCK_VIRTUAL_RT clock
On 28/11/2014 12:28, Pavel Dovgaluk wrote: Not so easy :) cpu_get_clock() checks vm_clock_seqlock which is locked in icount_warp_rt(). And after locking it requests the value of QEMU_CLOCK_VIRTUAL_RT:

    seqlock_write_lock(&timers_state.vm_clock_seqlock);
    if (runstate_is_running()) {
        int64_t clock = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL_RT);

Not so hard :D You can use cpu_get_clock_locked() there. In fact, cpu_get_clock_locked() is already used below in the if, so we can reuse clock instead of the other variable cur_time. Paolo
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On Thu, Nov 27, 2014 at 11:27:06AM +0100, Peter Lieven wrote:

diff --git a/iothread.c b/iothread.c
index 342a23f..b53529b 100644
--- a/iothread.c
+++ b/iothread.c
@@ -15,6 +15,7 @@
 #include "qom/object_interfaces.h"
 #include "qemu/module.h"
 #include "block/aio.h"
+#include "block/coroutine.h"
 #include "sysemu/iothread.h"
 #include "qmp-commands.h"
 #include "qemu/error-report.h"
@@ -47,6 +48,8 @@ static void *iothread_run(void *opaque)
         }
         aio_context_release(iothread->ctx);
     }
+
+    coroutine_pool_cleanup();
     return NULL;
 }

The assumption here is that iothread_run() is the only thread function that uses coroutines. If another thread uses coroutines the pool will leak :(. This is one of the challenges of thread-local storage.
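One generic way around the leak Stefan points out is a per-thread destructor, which is essentially what the GPrivate trick elsewhere in the series approximates with GLib. Below is a minimal POSIX sketch under stated assumptions: `pool_register_thread`, `pool_cleanup` and friends are hypothetical names, and the cleanup fires automatically at thread exit regardless of which thread function was used.

```c
#include <pthread.h>

static pthread_key_t pool_key;
static pthread_once_t pool_key_once = PTHREAD_ONCE_INIT;

static int cleanups_run;   /* observable side effect for the example */

/* Destructor invoked by pthreads at thread exit for every thread whose
 * key value is non-NULL; a real version would free the thread-local
 * coroutine pool here. */
static void pool_cleanup(void *value)
{
    (void)value;
    __sync_fetch_and_add(&cleanups_run, 1);
}

static void pool_key_init(void)
{
    pthread_key_create(&pool_key, pool_cleanup);
}

/* Call this whenever a thread first touches its pool. */
static void pool_register_thread(void)
{
    pthread_once(&pool_key_once, pool_key_init);
    /* the value just needs to be non-NULL for the destructor to fire */
    pthread_setspecific(pool_key, (void *)1);
}

static void *worker(void *arg)
{
    (void)arg;
    pool_register_thread();
    return NULL;   /* pool_cleanup runs automatically on thread exit */
}
```

The advantage over an explicit call at the end of iothread_run() is exactly the failure mode discussed: a thread that forgets to clean up no longer leaks its pool.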
Re: [Qemu-devel] [RFC PATCH 1/3] Revert coroutine: make pool size dynamic
On Thu, Nov 27, 2014 at 11:27:04AM +0100, Peter Lieven wrote: This reverts commit ac2662a913ee5854b1269256adbdc14e57ba480a. Justification? Won't we hit the same problem as the global pool: either pool size is too large or too small. If you exceed the pool size performance drops dramatically. Stefan
Re: [Qemu-devel] [RFC 0/6] memory: make dirty_memory[] accesses atomic
On Thu, Nov 27, 2014 at 01:21:54PM +, Peter Maydell wrote: On 27 November 2014 at 12:29, Stefan Hajnoczi stefa...@redhat.com wrote: 1. Convert all cpu_physical_memory_*_dirty() callers to use the API atomically. There are TCG callers who do things along the lines of:

    if (!cpu_physical_memory_get_dirty(addr)) {
        cpu_physical_memory_set_dirty(addr); /* not atomic! */
    }

Which bit of code is this? Note that for the TCG DIRTY_MEMORY_CODE flag you have bigger problems than just whether the bitmap updates are atomic, because the sequence is:

    write to memory
    if (!dirty) {
        flush generated code tbs;
        set dirty;
    }

and what you care about is that the existence of cached translations for this area of memory should be in sync with the state of the dirty bit, so the whole operation of "flush affected translations and set the dirty bit" needs to be thread-safe, I think. This is an example of what I mean. I'm not going to work on making TCG thread-safe in this series, and there is no dangerous race condition in this code if we leave it as is. But I'm not 100% sure of all cases, so I'll audit them again. Stefan
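The get-then-set race in the quoted snippet can in principle be closed with a single atomic read-modify-write: one fetch-or both sets the bit and reports whether it was already set, so a slow path (such as flushing TBs) runs exactly once per clean-to-dirty transition. The sketch below is a generic illustration using a GCC `__atomic` builtin, not QEMU's actual bitmap code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD 64

/* Atomically set the dirty bit for `bit` and return whether it was
 * already set.  Callers can hang the "flush translations" slow path
 * off a false return value without racing against other writers. */
static inline bool test_and_set_dirty(uint64_t *bitmap, size_t bit)
{
    uint64_t mask = 1ULL << (bit % BITS_PER_WORD);
    uint64_t old = __atomic_fetch_or(&bitmap[bit / BITS_PER_WORD], mask,
                                     __ATOMIC_SEQ_CST);
    return (old & mask) != 0;   /* true if it was already dirty */
}
```

As Peter notes, this alone does not make the DIRTY_MEMORY_CODE case safe: the bit update and the TB flush still have to be ordered as one logical operation.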
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On 28/11/2014 13:39, Peter Lieven wrote: Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: On 28/11/2014 12:46, Peter Lieven wrote: I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth. Yes, cool. The atomic operations are a bit tricky at first glance ;-) Question: Why is the pool_size increment atomic and the set to zero not? Because the set to zero is not a read-modify-write operation, so it is always atomic. It's just not sequentially-consistent (see docs/atomics.txt for some info on what that means). Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) Because you can only waste 64 coroutines per thread. But numbers cannot be sneezed at, so it's worth doing it as a separate patch.

Run operation 4000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..edea162 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -25,8 +25,9 @@ enum {
 
 /** Free list to speed up creation */
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
+static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;
 
 /* The GPrivate is only used to invoke coroutine_pool_cleanup. */
 static void coroutine_pool_cleanup(void *value);
@@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (pool_size > POOL_BATCH_SIZE) {
-                /* This is not exact; there could be a little skew between pool_size
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* This is not exact; there could be a little skew between release_pool_size
                  * and the actual size of alloc_pool. But it is just a heuristic,
                  * it does not need to be perfect.
                  */
-                pool_size = 0;
+                alloc_pool_size = atomic_fetch_and(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
@@ -53,6 +54,8 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
                  */
                 g_private_set(&dummy_key, &dummy_key);
             }
+        } else {
+            alloc_pool_size--;
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
@@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        if (pool_size < POOL_BATCH_SIZE * 2) {
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
             co->caller = NULL;
             QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&pool_size);
+            atomic_inc(&release_pool_size);
+            return;
+        } else if (alloc_pool_size < POOL_BATCH_SIZE) {
+            co->caller = NULL;
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
             return;
         }
     }

Bug?: The release_pool is not cleaned up on termination, I think. That's not necessary, it is global. Paolo
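The counter handoff being discussed can be sketched standalone: the release side increments an approximate shared counter atomically, and the alloc side drains it to zero in one read-modify-write, taking the old value as its private count. This uses GCC `__atomic` builtins in place of QEMU's atomic.h macros; `atomic_fetch_and(&x, 0)` works exactly as in Peter's patch because AND with 0 zeroes the counter and returns the previous value.

```c
static unsigned int release_pool_size;

/* Release side: one coroutine was pushed onto the global release pool. */
static void pool_release_one(void)
{
    __atomic_fetch_add(&release_pool_size, 1, __ATOMIC_SEQ_CST);
}

/* Alloc side: steal the whole release pool's count for the calling
 * thread.  Zeroing and reading happen in a single atomic step, so no
 * concurrent increment can be lost between the two. */
static unsigned int pool_steal_count(void)
{
    return __atomic_fetch_and(&release_pool_size, 0, __ATOMIC_SEQ_CST);
}
```

The plain `= 0` discussed above would also store atomically, but it could not return the old value in the same step, which is what the fetch-and variant buys.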
Re: [Qemu-devel] [RFC PATCH 1/3] Revert coroutine: make pool size dynamic
Am 28.11.2014 um 13:42 schrieb Stefan Hajnoczi: On Thu, Nov 27, 2014 at 11:27:04AM +0100, Peter Lieven wrote: This reverts commit ac2662a913ee5854b1269256adbdc14e57ba480a. Justification? Won't we hit the same problem as the global pool: either pool size is too large or too small. If you exceed the pool size performance drops dramatically. Each thread has its own pool. But you can ignore this series I think. The idea Paolo came up with seems to be better. With my last improvement it seems to have equal performance without the drawback of different threads creating and destroying the coroutines. Peter
Re: [Qemu-devel] Qemu coroutine behaviour on blocking send(3)
On Fri, Nov 28, 2014 at 01:55:00PM +0700, Iwan Budi Kusnanto wrote: I meant, will the coroutine yield internally when it gets blocked on send(3)? No. In general, QEMU will use non-blocking file descriptors so the blocking case does not apply. (You can't use blocking file descriptors in an event loop without a risk of blocking the entire event loop.) There is qemu_co_send(), which attempts the non-blocking send(2) and yields on EAGAIN. block/sheepdog.c and nbd.c both use this function. It's a little ugly because the caller must add/remove the socket write fd handler function so that the coroutine is re-entered when the fd becomes writable again. Stefan
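A minimal sketch of the pattern behind qemu_co_send(): try the non-blocking send(2) and treat EAGAIN as the point where a coroutine would yield. The helper names below are illustrative, not QEMU API, and the "yield" is simply reported to the caller as a return value of 0.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to send on a non-blocking fd.  Returns bytes sent, 0 when the
 * socket buffer is full (the point where a coroutine-based caller
 * would qemu_coroutine_yield() until the fd is writable), or -1 on a
 * real error. */
static ssize_t try_send(int fd, const void *buf, size_t len)
{
    ssize_t ret = send(fd, buf, len, 0);
    if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return 0;
    }
    return ret;
}

/* Make a non-blocking AF_UNIX socketpair for experiments. */
static int make_nonblocking_pair(int fds[2])
{
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
        return -1;
    }
    fcntl(fds[0], F_SETFL, O_NONBLOCK);
    fcntl(fds[1], F_SETFL, O_NONBLOCK);
    return 0;
}
```

The ugly part Stefan mentions lives outside this sketch: on the 0 return, the caller must register a write handler for the fd and re-enter the coroutine from it, then remove the handler again.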
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 13:45 schrieb Paolo Bonzini: On 28/11/2014 13:39, Peter Lieven wrote: Am 28.11.2014 um 13:26 schrieb Paolo Bonzini: On 28/11/2014 12:46, Peter Lieven wrote: I get: Run operation 4000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines? Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth. Yes, cool. The atomic operations are a bit tricky at first glance ;-) Question: Why is the pool_size increment atomic and the set to zero not? Because the set to zero is not a read-modify-write operation, so it is always atomic. It's just not sequentially-consistent (see docs/atomics.txt for some info on what that means). Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) Because you can only waste 64 coroutines per thread. But numbers cannot be sneezed at, so it's worth doing it as a separate patch. What do you mean by that? If I use dataplane, I will fill the global pool and never use it; I then use thread-local storage only. So I get the same numbers as in my thread-local-storage-only version. Maybe it is an idea to tweak the POOL_BATCH_SIZE * 2 according to what is really attached. If we have only dataplane or ioeventfd it can be POOL_BATCH_SIZE * 0 and we won't even waste those coroutines oxidating in the global pool. Peter
Re: [Qemu-devel] [PATCH v2 2/6] geometry: Detect blocksize via ioctls in separate static functions
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes:

Suggest function comment:

/**
 * Return logical block size, or zero if we can't figure it out
 */

 {
-    BDRVRawState *s = bs->opaque;
-    char *buf;
-    unsigned int sector_size;
-
-    /* For /dev/sg devices the alignment is not really used.
-       With buffered I/O, we don't have any restrictions. */
-    if (bs->sg || !s->needs_alignment) {
-        bs->request_alignment = 1;
-        s->buf_align = 1;
-        return;
-    }
+    unsigned int sector_size = 0;

Pointless initialization. If I do not initialize the sector_size, and the ioctl fails, I will return garbage as a blocksize to the caller. Where? As far as I can see, we return it only after ioctl() succeeded.

     /* Try a few ioctls to get the right size */
-    bs->request_alignment = 0;
-    s->buf_align = 0;
-
 #ifdef BLKSSZGET
     if (ioctl(fd, BLKSSZGET, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef DKIOCGETBLOCKSIZE
     if (ioctl(fd, DKIOCGETBLOCKSIZE, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef DIOCGSECTORSIZE
     if (ioctl(fd, DIOCGSECTORSIZE, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef CONFIG_XFS
     if (s->is_xfs) {
         struct dioattr da;
         if (xfsctl(NULL, fd, XFS_IOC_DIOINFO, &da) >= 0) {
-            bs->request_alignment = da.d_miniosz;
+            sector_size = da.d_miniosz;
             /* The kernel returns wrong information for d_mem */
             /* s->buf_align = da.d_mem; */

Since you keep the enabled assignments to s->buf_align out of this function, you should keep out this disabled one, too.

+            return sector_size;
         }
     }
 #endif
+    return 0;
+}
+
+static unsigned int probe_physical_blocksize(BlockDriverState *bs, int fd)

Parameter bs is unused, let's drop it.

+{
+    unsigned int blk_size = 0;

Pointless initialization. Same here. Again, we return it only after ioctl() succeeded.

+#ifdef BLKPBSZGET
+    if (ioctl(fd, BLKPBSZGET, &blk_size) >= 0) {
+        return blk_size;
+    }
+#endif
+
+    return 0;
+}
+
+static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
+{
+    BDRVRawState *s = bs->opaque;
+    char *buf;
+
+    /* For /dev/sg devices the alignment is not really used.
+       With buffered I/O, we don't have any restrictions. */
+    if (bs->sg || !s->needs_alignment) {
+        bs->request_alignment = 1;
+        s->buf_align = 1;
+        return;
+    }
+
+    s->buf_align = 0;
+    /* Let's try to use the logical blocksize for the alignment. */
+    bs->request_alignment = probe_logical_blocksize(bs, fd);
+
     /* If we could not get the sizes so far, we can only guess them */
     if (!s->buf_align) {
         size_t align;
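For reference, the Linux branch of the probing helper under review can be exercised standalone. This sketch keeps only the BLKSSZGET path (the real patch also tries DKIOCGETBLOCKSIZE, DIOCGSECTORSIZE and the XFS ioctl) and, as discussed in the review, returns 0 when no ioctl succeeds, such as on a regular file where BLKSSZGET fails with ENOTTY.

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/fs.h>   /* BLKSSZGET */
#endif

/* Return the logical block size of fd, or zero if we can't figure it
 * out.  sector_size is deliberately left uninitialized: per the review
 * discussion above, it is only returned after the ioctl has filled it
 * in. */
static unsigned int probe_logical_blocksize(int fd)
{
    unsigned int sector_size;

#ifdef BLKSSZGET
    if (ioctl(fd, BLKSSZGET, &sector_size) >= 0) {
        return sector_size;
    }
#endif
    (void)fd;
    return 0;   /* caller falls back to guessing the alignment */
}
```

Run against an actual block device (e.g. an fd on /dev/sda) this would typically report 512 or 4096; against anything else it reports 0, which is the fallback case raw_probe_alignment() handles.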
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
On 28/11/2014 13:49, Peter Lieven wrote: Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-) Because you can only waste 64 coroutines per thread. But numbers cannot s/only// be sneezed at, so it's worth doing it as a separate patch. What do you mean by that? If I use dataplane I will fill the global pool and never use it okay, but then I use thread local storage only. So I get the same numbers as in my thread local storage only version. Right. I didn't want to waste the coroutines. But it's not 64 coroutines per VCPU thread, it's just 64 coroutines for the global iothread because all the dataplane threads are guaranteed to use ioeventfd. Let's do it. :) Can I add your Signed-off-by to the patch? Paolo Maybe it is an idea to tweak the POOL_BATCH_SIZE * 2 according to what is really attached. If we have only dataplane or ioeventfd it can be POOL_BATCH_SIZE * 0 and we even won't waste those coroutines oxidating in the global pool.
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Am 27.11.2014 um 10:50 hat Peter Lieven geschrieben: On 26.11.2014 15:46, Kevin Wolf wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. As a side effect, the codepath taken by aio=threads is changed to use paio_submit_co(). This doesn't change the performance at this point.

Results of qemu-img bench -t none -c 1000000 [-n] /dev/loop0:

        |       aio=native       |      aio=threads
        | before  | with patch   | before  | with patch
  ------+---------+--------------+---------+-----------
  run 1 | 29.921s | 26.932s      | 35.286s | 35.447s
  run 2 | 29.793s | 26.252s      | 35.276s | 35.111s
  run 3 | 30.186s | 27.114s      | 35.042s | 34.921s
  run 4 | 30.425s | 26.600s      | 35.169s | 34.968s
  run 5 | 30.041s | 26.263s      | 35.224s | 35.000s

TODO: Do some more serious benchmarking in VMs with less variance. Results of a quick fio run are vaguely positive.

I still see the main-loop spun warnings with these patches applied to master. It wasn't there with the original patch from August.

~/git/qemu$ ./qemu-img bench -t none -c 1000000 -n /dev/ram1
Sending 1000000 requests, 4096 bytes each, 64 in parallel
main-loop: WARNING: I/O thread spun for 1000 iterations
Run completed in 31.947 seconds.

Yes, I still need to bisect that. The 'qemu-img bench' numbers above are actually also from August; we have regressed meanwhile by about a second, and I also haven't found the reason for that yet.

Kevin
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On Friday 28 November 2014 13:26:44 Laszlo Ersek wrote:

+Example:
+
+/ {
+	#size-cells = <0x2>;
+	#address-cells = <0x2>;
+
+	fw-cfg@9020000 {
+		reg = <0x0 0x9020000 0x0 0x2 0x0 0x9020002 0x0 0x1>;
+		compatible = "fw-cfg,mmio";
+	};
+};

"fw-cfg" is not a valid vendor string. Are you emulating an actual piece of hardware here or is this something that comes from qemu? How about using "qemu,fwcfg" as the compatible string, and documenting "qemu" as an official vendor string in Documentation/devicetree/bindings/vendor-prefixes.txt?

We don't normally list contiguous registers separately. Maybe just round up to one page and make the register property

	reg = <0x0 0x9020000 0x0 0x1000>;

Finally, isn't there also an interrupt associated with this virtual device?

	Arnd
Re: [Qemu-devel] [PATCH v3 11/22] iotests: Prepare for refcount_width option
On Thu, Nov 20, 2014 at 06:06:27PM +0100, Max Reitz wrote: Some tests do not work well with certain refcount widths (i.e. you cannot create internal snapshots with refcount_width=1), so make those widths unsupported. Furthermore, add another filter to _filter_img_create in common.filter which filters out the refcount_width value. This is necessary for test 079, which does actually work with any refcount width, but invoking qemu-img directly leads to the refcount_width value being visible in the output; use _make_test_img instead which will filter it out. Signed-off-by: Max Reitz mre...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- tests/qemu-iotests/007 | 3 +++ tests/qemu-iotests/015 | 2 ++ tests/qemu-iotests/026 | 7 +++ tests/qemu-iotests/029 | 2 ++ tests/qemu-iotests/051 | 3 +++ tests/qemu-iotests/058 | 2 ++ tests/qemu-iotests/067 | 2 ++ tests/qemu-iotests/079 | 10 ++ tests/qemu-iotests/079.out | 38 ++ tests/qemu-iotests/080 | 2 ++ tests/qemu-iotests/089 | 2 ++ tests/qemu-iotests/108 | 2 ++ tests/qemu-iotests/common.filter | 3 ++- 13 files changed, 41 insertions(+), 37 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgp3BrjG_oLC_.pgp Description: PGP signature
[Qemu-devel] [PATCH] target-arm: ARM64: Adding EL1 AARCH32 guest support for KVM.
In KVM ARM64 one can choose to run the guest in 32-bit mode, i.e. EL1 in AARCH32 mode. This patch adds qemu support for running guest EL1 in AARCH32 mode with virt as a machine model. This patch also adds support for running Image (along with zImage) for arm32. A 32-bit kernel Image can be specified using the -cpu host,el1_aarch32 argument, e.g.

./qemu/aarch64-softmmu/qemu-system-aarch64 -nographic -display none \
-serial stdio -kernel ./Image -m 512 -M virt -cpu host,el1_aarch32 \
-initrd rootfs.img -append "console=ttyAMA0 root=/dev/ram" -enable-kvm

Signed-off-by: Pranavkumar Sawargaonkar pranavku...@linaro.org
---
 hw/arm/boot.c      | 44 ++++++++++++++++
 hw/arm/virt.c      | 30 +++++++++-
 target-arm/cpu.c   |  5 ++--
 target-arm/cpu.h   |  2 ++
 target-arm/kvm64.c | 73 ++++++++++++++++++++++++
 5 files changed, 146 insertions(+), 8 deletions(-)

diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 0014c34..da8cdc8 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -476,6 +476,32 @@ static void do_cpu_reset(void *opaque)
     }
 }

+static int check_load_zimage(const char *filename)
+{
+    int fd;
+    uint8_t buf[40];
+    uint32_t *p = (uint32_t *) &buf[36];
+
+    fd = open(filename, O_RDONLY | O_BINARY);
+    if (fd < 0) {
+        perror(filename);
+        return -1;
+    }
+
+    memset(buf, 0, sizeof(buf));
+    if (read(fd, buf, sizeof(buf)) != sizeof(buf)) {
+        close(fd);
+        return -1;
+    }
+
+    /* Check for zImage magic number */
+    if (*p == 0x016F2818) {
+        return 1;
+    }
+
+    return 0;
+}
+
 void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info)
 {
     CPUState *cs;
@@ -515,15 +541,23 @@ void arm_load_kernel(ARMCPU *cpu, struct arm_boot_info *info)
         return;
     }

-    if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64)) {
+    if (arm_feature(&cpu->env, ARM_FEATURE_AARCH64) &&
+        (!cpu->env.el1_aarch32)) {
         primary_loader = bootloader_aarch64;
         kernel_load_offset = KERNEL64_LOAD_ADDR;
         elf_machine = EM_AARCH64;
     } else {
-        primary_loader = bootloader;
-        kernel_load_offset = KERNEL_LOAD_ADDR;
-        elf_machine = EM_ARM;
-    }
+        if (check_load_zimage(info->kernel_filename)) {
+            primary_loader = bootloader;
+            kernel_load_offset = KERNEL_LOAD_ADDR;
+            elf_machine = EM_ARM;
+        } else {
+            primary_loader = bootloader;
+            /* Assuming we are loading Image hence aligning it to 0x8000 */
+            kernel_load_offset = KERNEL_LOAD_ADDR - 0x8000;
+            elf_machine = EM_ARM;
+        }
+    }

     info->dtb_filename = qemu_opt_get(qemu_get_machine_opts(), "dtb");
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 314e55b..64213e6 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -204,7 +204,8 @@ static void fdt_add_psci_node(const VirtBoardInfo *vbi)
     qemu_fdt_setprop(fdt, "/psci", "compatible", comp, sizeof(comp));

     cpu_off_fn = QEMU_PSCI_0_2_FN_CPU_OFF;
-    if (arm_feature(&armcpu->env, ARM_FEATURE_AARCH64)) {
+    if (arm_feature(&armcpu->env, ARM_FEATURE_AARCH64) &&
+        (!armcpu->env.el1_aarch32)) {
         cpu_suspend_fn = QEMU_PSCI_0_2_FN64_CPU_SUSPEND;
         cpu_on_fn = QEMU_PSCI_0_2_FN64_CPU_ON;
         migrate_fn = QEMU_PSCI_0_2_FN64_MIGRATE;
@@ -527,6 +528,24 @@ static void *machvirt_dtb(const struct arm_boot_info *binfo, int *fdt_size)
     return board->fdt;
 }

+#if defined(TARGET_AARCH64) && !defined(CONFIG_USER_ONLY)
+static void check_special_cpu_model_flags(const char *cpu_model,
+                                          Object *cpuobj)
+{
+    ARMCPU *cpu = ARM_CPU(cpuobj);
+
+    if (!cpu) {
+        return;
+    }
+
+    if (strcmp(cpu_model, "host,el1_aarch32") == 0) {
+        cpu->env.el1_aarch32 = 1;
+    } else {
+        cpu->env.el1_aarch32 = 0;
+    }
+}
+#endif
+
 static void machvirt_init(MachineState *machine)
 {
     qemu_irq pic[NUM_IRQS];
@@ -540,6 +559,12 @@ static void machvirt_init(MachineState *machine)
         cpu_model = "cortex-a15";
     }

+#if defined(TARGET_AARCH64) && !defined(CONFIG_USER_ONLY)
+    if (strcmp(cpu_model, "host,el1_aarch32") == 0) {
+        cpu_model = "host";
+    }
+#endif
+
     vbi = find_machine_info(cpu_model);
     if (!vbi) {
@@ -578,6 +603,9 @@ static void machvirt_init(MachineState *machine)
         object_property_set_int(cpuobj, vbi->memmap[VIRT_CPUPERIPHS].base,
                                 "reset-cbar", &error_abort);
     }
+#if defined(TARGET_AARCH64) && !defined(CONFIG_USER_ONLY)
+    check_special_cpu_model_flags(machine->cpu_model, cpuobj);
+#endif
     object_property_set_bool(cpuobj, true, "realized", NULL);
 }
 diff --git
a/target-arm/cpu.c b/target-arm/cpu.c index 5ce7350..37dfe30 100644 --- a/target-arm/cpu.c +++ b/target-arm/cpu.c @@ -103,7 +103,7 @@ static void
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 13:39 schrieb Peter Lieven:
Am 28.11.2014 um 13:26 schrieb Paolo Bonzini:
On 28/11/2014 12:46, Peter Lieven wrote:

I get: Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine

Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines?

Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth.

Yes, cool. The atomic operations are a bit tricky at first glance ;-)

Question: Why is the pool_size increment atomic and the set to zero not?

Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-)

Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..edea162 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -25,8 +25,9 @@ enum {

 /** Free list to speed up creation */
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
+static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;

 /* The GPrivate is only used to invoke coroutine_pool_cleanup. */
 static void coroutine_pool_cleanup(void *value);
@@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (pool_size > POOL_BATCH_SIZE) {
-                /* This is not exact; there could be a little skew between pool_size
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* This is not exact; there could be a little skew between release_pool_size
                  * and the actual size of alloc_pool.  But it is just a heuristic,
                  * it does not need to be perfect.
                  */
-                pool_size = 0;
+                alloc_pool_size = atomic_fetch_and(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
@@ -53,6 +54,8 @@
                  */
                 g_private_set(&dummy_key, &dummy_key);
             }
+        } else {
+            alloc_pool_size--;
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
@@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        if (pool_size < POOL_BATCH_SIZE * 2) {
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
             co->caller = NULL;
             QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&pool_size);
+            atomic_inc(&release_pool_size);
+            return;
+        } else if (alloc_pool_size < POOL_BATCH_SIZE) {
+            co->caller = NULL;
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
             return;
         }
     }

Signed-off-by: Peter Lieven p...@kamp.de
Re: [Qemu-devel] [PATCH v3 12/22] qcow2: Allow creation with refcount order != 4
On Thu, Nov 20, 2014 at 06:06:28PM +0100, Max Reitz wrote: Add a creation option to qcow2 for setting the refcount order of images to be created, and respect that option's value. This breaks some test outputs, fix them. Signed-off-by: Max Reitz mre...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- block/qcow2.c | 20 include/block/block_int.h | 1 + tests/qemu-iotests/049.out | 112 ++--- tests/qemu-iotests/082.out | 41 ++--- tests/qemu-iotests/085.out | 38 +++ 5 files changed, 130 insertions(+), 82 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpm5fxW_MQG8.pgp Description: PGP signature
Re: [Qemu-devel] [RFC PATCH 3/3] qemu-coroutine: use a ring per thread for the pool
Am 28.11.2014 um 13:45 schrieb Paolo Bonzini:
On 28/11/2014 13:39, Peter Lieven wrote:
Am 28.11.2014 um 13:26 schrieb Paolo Bonzini:
On 28/11/2014 12:46, Peter Lieven wrote:

I get: Run operation 40000000 iterations 9.883958 s, 4046K operations/s, 247ns per coroutine

Ok, understood, it steals the whole pool, right? Isn't that bad if we have more than one thread in need of a lot of coroutines?

Overall the algorithm is expected to adapt. The N threads contribute to the global release pool, so the pool will fill up N times faster than if you had only one thread. There can be some variance, which is why the maximum size of the pool is twice the threshold (and probably could be tuned better). Benchmarks are needed on real I/O too, of course, especially with high queue depth.

Yes, cool. The atomic operations are a bit tricky at first glance ;-)

Question: Why is the pool_size increment atomic and the set to zero not?

Because the set to zero is not a read-modify-write operation, so it is always atomic. It's just not sequentially-consistent (see docs/atomics.txt for some info on what that means).

Idea: If the release_pool is full why not put the coroutine in the thread alloc_pool instead of throwing it away? :-)

Because you can only waste 64 coroutines per thread. But numbers cannot be sneezed at, so it's worth doing it as a separate patch.

Run operation 40000000 iterations 9.057805 s, 4416K operations/s, 226ns per coroutine

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 6bee354..edea162 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -25,8 +25,9 @@ enum {

 /** Free list to speed up creation */
 static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
+static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread unsigned int alloc_pool_size;

 /* The GPrivate is only used to invoke coroutine_pool_cleanup. */
 static void coroutine_pool_cleanup(void *value);
@@ -39,12 +40,12 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (pool_size > POOL_BATCH_SIZE) {
-                /* This is not exact; there could be a little skew between pool_size
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* This is not exact; there could be a little skew between release_pool_size
                  * and the actual size of alloc_pool.  But it is just a heuristic,
                  * it does not need to be perfect.
                  */
-                pool_size = 0;
+                alloc_pool_size = atomic_fetch_and(&release_pool_size, 0);
                 QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                 co = QSLIST_FIRST(&alloc_pool);
@@ -53,6 +54,8 @@
                  */
                 g_private_set(&dummy_key, &dummy_key);
             }
+        } else {
+            alloc_pool_size--;
         }
         if (co) {
             QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
@@ -71,10 +74,15 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        if (pool_size < POOL_BATCH_SIZE * 2) {
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
             co->caller = NULL;
             QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&pool_size);
+            atomic_inc(&release_pool_size);
+            return;
+        } else if (alloc_pool_size < POOL_BATCH_SIZE) {
+            co->caller = NULL;
+            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
+            alloc_pool_size++;
             return;
         }
     }

Bug?: The release_pool is not cleaned up on termination I think.

That's not necessary, it is global.

I don't see where you iterate over release_pool and destroy all coroutines? Maybe just add back the old destructor with s/pool/release_pool/g

Peter

Peter
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
Hi,

On Fri, Nov 28, 2014 at 12:26:44PM +0000, Laszlo Ersek wrote:

Peter Maydell suggested that we describe new devices / DTB nodes in the kernel Documentation tree that we expose to arm "virt" guests in QEMU. Although the kernel is not required to access the fw_cfg interface, Documentation/devicetree/bindings/arm is probably the best central spot to keep the fw_cfg description in.

Suggested-by: Peter Maydell peter.mayd...@linaro.org
Signed-off-by: Laszlo Ersek ler...@redhat.com
---
 Documentation/devicetree/bindings/arm/fw-cfg.txt | 47
 1 file changed, 47 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/fw-cfg.txt

diff --git a/Documentation/devicetree/bindings/arm/fw-cfg.txt b/Documentation/devicetree/bindings/arm/fw-cfg.txt
new file mode 100644
index 0000000..d131453
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/fw-cfg.txt
@@ -0,0 +1,47 @@
+* QEMU Firmware Configuration bindings for ARM
+
+QEMU's arm-softmmu and aarch64-softmmu emulation / virtualization targets
+provide the following Firmware Configuration interface on the "virt" machine
+type:
+
+- A write-only, 16-bit wide selector (or control) register,
+- a read-write, 8-bit wide data register.
+
+The guest writes a selector value (a key) to the selector register, and then
+can read the corresponding data (produced by QEMU) via the data port. If the
+selected entry is writable, the guest can rewrite it through the data port. The
+authoritative registry of the valid selector values and their meanings is the
+QEMU source code.
+
+QEMU exposes the control and data register to x86 guests at fixed IO ports. ARM
+guests can access them as memory mapped registers, and their location is
+communicated to the guest's UEFI firmware in the DTB that QEMU places at the
+bottom of the guest's DRAM.

What exactly is this interface used for? And what's the protocol that goes over the read/write sequences described above? Is there any documentation of that we can refer to here?

Is version information probeable from this, or does that need to be described?

+The guest kernel is not expected to use these registers (although it is
+certainly allowed to); the device tree bindings are documented here because
+this is where device tree bindings reside in general.
+
+The addresses and sizes of the Firmware Configuration registers are given by
+the /fw-cfg node. The "virt" board invariably uses 2 as #size-cells and
+#address-cells in the context of this node.

This paragraph can go -- #address-cells and #size-cells are well-understood common properties.

+
+The reg property is therefore an array of four uint64_t elements (eight
+uint32_t cells in total), the first uint64_t pair describing the address and
+size of the control register, the second uint64_t pair describing the address
+and size of the data register. (See
+http://devicetree.org/Device_Tree_Usage#How_Addressing_Works.)

The reg property is a list of address and size tuples. The size of the entries depends on #address-cells and #size-cells in the parent node, but they are certainly not guaranteed to be uint64_t values (if nothing else, they're encoded big-endian in the DTB). We just need to describe what the entries are w.r.t. the device, not the particulars of the DTB format.

Are these registers always contiguous? I assume so, and if so you can cover them with a single reg entry (read it as 'region' rather than 'register').

+
+The first string in the compatible property is "fw-cfg,mmio".

"fw-cfg" is not a vendor prefix. I guess we need a "qemu" vendor prefix? Then we'd have something like "qemu,fw-cfg-mmio" as the compatible string.

Also, the string should not be required to be the first entry in the list, as that breaks the fallback semantics of the compatible list. The string should just need to be present in the list.

Please try to format the binding something like:

---8<---
Required properties:
- compatible: should contain "qemu,fw-cfg-mmio".
- reg: The MMIO regions used by the device:
  * The first entry describes the control register.
  * The second entry describes the data register.
---8<---

Though I hope we can just use a single reg entry to cover all the registers the device has.

+
+Example:
+
+/ {
+	#size-cells = <0x2>;
+	#address-cells = <0x2>;
+
+	fw-cfg@9020000 {
+		reg = <0x0 0x9020000 0x0 0x2 0x0 0x9020002 0x0 0x1>;

Please bracket list entries individually, e.g.

	reg = <0x0 0x9020000 0x0 0x2>, <0x0 0x9020002 0x0 0x1>;

+		compatible = "fw-cfg,mmio";

And please place the compatible property before the reg (it makes things easier to read).

Thanks,
Mark.
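Pulling the review comments together, the node might end up looking like the fragment below. This is only a sketch of where the thread seems to be heading, not the merged binding: the "qemu,fw-cfg-mmio" compatible string and the single rounded-up region are suggestions from this thread, and the 0x9020000 base address is simply taken from the example under review.

```dts
/ {
	#address-cells = <0x2>;
	#size-cells = <0x2>;

	fw-cfg@9020000 {
		compatible = "qemu,fw-cfg-mmio";
		/* one region covering both the control and data registers */
		reg = <0x0 0x9020000 0x0 0x1000>;
	};
};
```

Whether the registers may be covered by one region depends on the outcome of the contiguity question raised above.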
Re: [Qemu-devel] [PATCH v3 15/22] qcow2: Use error_report() in qcow2_amend_options()
On Thu, Nov 20, 2014 at 06:06:31PM +0100, Max Reitz wrote: Signed-off-by: Max Reitz mre...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- block/qcow2.c | 14 ++ tests/qemu-iotests/061.out | 14 +++--- 2 files changed, 13 insertions(+), 15 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpTavcLTeR5f.pgp Description: PGP signature
Re: [Qemu-devel] [PATCH v3 16/22] qcow2: Use abort() instead of assert(false)
On Thu, Nov 20, 2014 at 06:06:32PM +0100, Max Reitz wrote: Signed-off-by: Max Reitz mre...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- block/qcow2.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpwibsjgMgJ0.pgp Description: PGP signature
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On 11/28/14 13:59, Arnd Bergmann wrote:
On Friday 28 November 2014 13:26:44 Laszlo Ersek wrote:

+Example:
+
+/ {
+	#size-cells = <0x2>;
+	#address-cells = <0x2>;
+
+	fw-cfg@9020000 {
+		reg = <0x0 0x9020000 0x0 0x2 0x0 0x9020002 0x0 0x1>;
+		compatible = "fw-cfg,mmio";
+	};
+};

"fw-cfg" is not a valid vendor string. Are you emulating an actual piece of hardware here or is this something that comes from qemu?

It's a made up name.

How about using "qemu,fwcfg" as the compatible string, and documenting "qemu" as an official vendor string in Documentation/devicetree/bindings/vendor-prefixes.txt

I can do that, certainly, but I'm not sure if that implies other QEMU devices should follow suit. For example, the virtio-mmio transports that qemu advertises have the compatible property "virtio,mmio". "virtio" is not a vendor prefix either (according to the above text file). Here's the full list of compatible strings from the generated DTB:

compatible = "arm,cortex-a15";
compatible = "arm,armv7-timer";
compatible = "arm,cortex-a15-gic";
compatible = "arm,pl011", "arm,primecell";
compatible = "arm,pl031", "arm,primecell";
compatible = "arm,psci-0.2", "arm,psci";
compatible = "cfi-flash";
compatible = "fixed-clock";
compatible = "fw-cfg,mmio";
compatible = "virtio,mmio";
compatible = "linux,dummy-virt";

According to http://devicetree.org/Device_Tree_Usage#Understanding_the_compatible_Property, "cfi-flash" and "fixed-clock" don't even have the correct format (there's no comma separating manufacturer from model).

We don't normally list contiguous registers separately. Maybe just round up to one page and make the register property

	reg = <0x0 0x9020000 0x0 0x1000>;

The registers are not necessarily placed shoulder to shoulder (it's not a requirement in the QEMU source).

Finally, isn't there also an interrupt associated with this virtual device?

No, there is not.

Thanks
Laszlo
Re: [Qemu-devel] [PATCH v3 17/22] qcow2: Split upgrade/downgrade paths for amend
On Thu, Nov 20, 2014 at 06:06:33PM +0100, Max Reitz wrote: If the image version should be upgraded, that is the first we should do; if it should be downgraded, that is the last we should do. So split the version change block into an upgrade part at the start and a downgrade part at the end. Signed-off-by: Max Reitz mre...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com --- block/qcow2.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpzAIZQ6Upn9.pgp Description: PGP signature
Re: [Qemu-devel] [PATCH v2 2/6] geometry: Detect blocksize via ioctls in separate static functions
On 11/28/2014 03:52 PM, Markus Armbruster wrote:
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes:

Suggest function comment:

/**
 * Return logical block size, or zero if we can't figure it out
 */

 {
-    BDRVRawState *s = bs->opaque;
-    char *buf;
-    unsigned int sector_size;
-
-    /* For /dev/sg devices the alignment is not really used.
-       With buffered I/O, we don't have any restrictions. */
-    if (bs->sg || !s->needs_alignment) {
-        bs->request_alignment = 1;
-        s->buf_align = 1;
-        return;
-    }
+    unsigned int sector_size = 0;

Pointless initialization.

If I do not initialize the sector_size, and the ioctl fails, I will return garbage as a blocksize to the caller.

Where? As far as I can see, we return it only after ioctl() succeeded.

Sorry, you're absolutely right. I kept thinking that I always returned the sector_size variable. ::facepalm::

     /* Try a few ioctls to get the right size */
-    bs->request_alignment = 0;
-    s->buf_align = 0;
-
 #ifdef BLKSSZGET
     if (ioctl(fd, BLKSSZGET, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef DKIOCGETBLOCKSIZE
     if (ioctl(fd, DKIOCGETBLOCKSIZE, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef DIOCGSECTORSIZE
     if (ioctl(fd, DIOCGSECTORSIZE, &sector_size) >= 0) {
-        bs->request_alignment = sector_size;
+        return sector_size;
     }
 #endif
 #ifdef CONFIG_XFS
     if (s->is_xfs) {
         struct dioattr da;
         if (xfsctl(NULL, fd, XFS_IOC_DIOINFO, &da) >= 0) {
-            bs->request_alignment = da.d_miniosz;
+            sector_size = da.d_miniosz;
             /* The kernel returns wrong information for d_mem */
             /* s->buf_align = da.d_mem; */

Since you keep the enabled assignments to s->buf_align out of this function, you should keep out this disabled one, too.

+            return sector_size;
         }
     }
 #endif
+    return 0;
+}
+
+static unsigned int probe_physical_blocksize(BlockDriverState *bs, int fd)

Parameter bs is unused, let's drop it.

+{
+    unsigned int blk_size = 0;

Pointless initialization.

Same here. Again, we return it only after ioctl() succeeded.

+#ifdef BLKPBSZGET
+    if (ioctl(fd, BLKPBSZGET, &blk_size) >= 0) {
+        return blk_size;
+    }
+#endif
+
+    return 0;
+}
+
+static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
+{
+    BDRVRawState *s = bs->opaque;
+    char *buf;
+
+    /* For /dev/sg devices the alignment is not really used.
+       With buffered I/O, we don't have any restrictions. */
+    if (bs->sg || !s->needs_alignment) {
+        bs->request_alignment = 1;
+        s->buf_align = 1;
+        return;
+    }
+
+    s->buf_align = 0;
+    /* Let's try to use the logical blocksize for the alignment. */
+    bs->request_alignment = probe_logical_blocksize(bs, fd);
+
     /* If we could not get the sizes so far, we can only guess them */
     if (!s->buf_align) {
         size_t align;
Re: [Qemu-devel] [PATCH v2] persistent dirty bitmap: add QDB file spec.
On 21.11.2014 19:55, Stefan Hajnoczi wrote:

Active dirty bitmaps should migrate too. I'm thinking now that the appropriate thing is to add live migration of dirty bitmaps to QEMU (regardless of whether they are active or not).

I think we should migrate named dirty bitmaps which are not used now. If some external mechanism uses the bitmap (for example, backup), we actually can't migrate this process, because we would need to restore the whole backup structure including a pointer to the bitmap, which is too hard and involves more than just bitmap migration. So, if a named bitmap is enabled but not used (only bdrv_aligned_pwritev writes to it), it can be migrated. For this I see the following solutions:

1) Just save all corresponding pieces of named bitmaps with every migrated block. The block size is 1mb, so the overhead for additionally migrating a bitmap with 64kb granularity would be 2b, and it would be 256b for a bitmap with 512b granularity. This approach needs additional fields in BlkMigBlock for saving bitmap pieces.

2) Add a DIRTY flag to the migrated block flags, to distinguish blocks which became dirty while migrating. Save all the bitmaps separately, and also update them on block_load when we receive a block with the DIRTY flag on. Some information will be lost; migrated dirty bitmaps may be more dirty than the original ones. This approach needs an additional field "bool dirty" in BlkMigBlock, and saving this flag in blk_send.

These solutions don't depend on the persistence of dirty bitmaps or the persistent bitmap file format.

Best regards,
Vladimir

On 21.11.2014 19:55, Stefan Hajnoczi wrote: On Fri, Nov 21, 2014 at 01:27:40PM +0300, Vladimir Sementsov-Ogievskiy wrote:

There is a constraint if we want to get live migration for free: The bitmap contents must be accessible with bdrv_read() and bdrv_get_block_status() to skip zero regions.

Hm. I'm afraid it still will not be free. If the bitmap is active, its current version is in memory. To migrate the bitmap file like a disk image, we would have to start syncing it with every write to the corresponding disk, doubling the number of I/Os.

It would be possible to drive-mirror the persistent dirty bitmap and then flush it like all drives when the guest vCPUs are paused for migration. After thinking more about it though, this approach places more I/O into the critical guest downtime phase. In other words, slow disk I/O could lead to long guest downtimes while QEMU tries to write out the dirty bitmap.

Moreover, we have normal dirty bitmaps, which have no name/file; do we migrate them? If, for example, the migration occurs while a backup is in progress?

Active bitmaps should be migrated in the same way for persistent/named/normal bitmaps.

I can't find it in the qemu source; is there bitmap migration?

bs->dirty_bitmaps is not migrated; in fact none of BlockDriverState is migrated. QEMU only migrates emulated device state (e.g. the hardware registers and associated state). It does not migrate host state that the guest cannot see, like the dirty bitmap.

Or are you talking about migrating disabled bitmaps? Hm. We should sync the bitmap file on bitmap_disable. A disabled persistent bitmap is just a static file of ~30mb; we can easily migrate it without a common procedure with cow or something like this.

Active dirty bitmaps should migrate too. I'm thinking now that the appropriate thing is to add live migration of dirty bitmaps to QEMU (regardless of whether they are active or not).

Stefan
Re: [Qemu-devel] [PATCH v3 13/22] progress: Allow regressing progress
On Thu, Nov 20, 2014 at 06:06:29PM +0100, Max Reitz wrote: Progress may regress; this should be displayed correctly by qemu_progress_print(). Signed-off-by: Max Reitz mre...@redhat.com --- util/qemu-progress.c | 1 + 1 file changed, 1 insertion(+) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com pgpPiR00qqiIx.pgp Description: PGP signature
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On Fri, Nov 28, 2014 at 01:22:29PM +, Laszlo Ersek wrote: On 11/28/14 13:59, Arnd Bergmann wrote: On Friday 28 November 2014 13:26:44 Laszlo Ersek wrote: +Example: + +/ { + #size-cells = 0x2; + #address-cells = 0x2; + + fw-cfg@902 { + reg = 0x0 0x902 0x0 0x2 0x0 0x9020002 0x0 0x1; + compatible = fw-cfg,mmio; + }; +}; fw-cfg is not a valid vendor string. Are you emulating an actual piece of hardware here or is this something that comes from qemu? It's a made up name. How about using qemu,fwcfg as the compatible string, and documenting qemu as an official vendor string in Documentation/devicetree/bindings/vendor-prefixes.txt I can do that, certainly, but I'm not sure if that implies other QEMU devices should follow suit. For example, the virtio-mmio transports that qemu advertises have the compatible property virtio,mmio. virtio is not a vendor prefix either (according to the above text file). True, but it's a documented standard. It's obvious that 'virtio' has something to do with the virtio standard, while it's not obvious that 'fw-cfg' as a prefix is for a qemu-specific device. We should probably add 'virtio' as a vendor prefix. Here's the full list of compatible strings from the generated DTB: compatible = arm,cortex-a15; compatible = arm,armv7-timer; compatible = arm,cortex-a15-gic; compatible = arm,pl011, arm,primecell; compatible = arm,pl031, arm,primecell; compatible = arm,psci-0.2, arm,psci; All of these are emulated real devices or standards from ARM, so that makes sense. compatible = cfi-flash; compatible = fixed-clock; These are generic bindings, arguably they should have been 'linux,' prefixed but it's too late now. compatible = fw-cfg,mmio; compatible = virtio,mmio; compatible = linux,dummy-virt; According to http://devicetree.org/Device_Tree_Usage#Understanding_the_compatible_Property, cfi-flash and fixed-clock don't even have the correct format (there's no comma separating manufacturer from model). 
We don't normally list contiguous registers separately. Maybe just round up to one page and make the register property reg = <0x0 0x9020000 0x0 0x1000>; The registers are not necessarily placed shoulder to shoulder (it's not a requirement in the QEMU source). It might not be at the moment, but if no-one separates them it could be a requirement for using this binding. Is there any benefit to having them separated? Thanks, Mark.
Re: [Qemu-devel] master: intermittent acpi-test failures
On 27 May 2014 at 22:38, Peter Maydell peter.mayd...@linaro.org wrote: I'm seeing this test failure intermittently on 'make check': ERROR:/root/qemu/tests/acpi-test.c:618:test_acpi_one: assertion failed (signature == SIGNATURE): (0x == 0xdead) GTester: last random seed: R02S8d0d60963e4442ce284a81d20ce32053 (32 bit ARM host, in case that makes a difference.) Any ideas? It looks from the test as if this may just be that the test is coded to assume a faster machine, which is a bit unfortunate. These failures are back after a long period of not being a problem :-( -- PMM
Re: [Qemu-devel] [PATCH v2 0/6] Geometry and blocksize support for backing devices
Ekaterina Tumanova tuman...@linux.vnet.ibm.com writes: On 11/28/2014 01:35 PM, Stefan Hajnoczi wrote: On Wed, Nov 26, 2014 at 01:16:48PM +0300, Ekaterina Tumanova wrote: On 11/25/2014 04:01 PM, Stefan Hajnoczi wrote: On Wed, Nov 19, 2014 at 11:17:50AM +0100, Ekaterina Tumanova wrote: 3. Why does s390 need to customize hd_geometry_guess()? Since hd_geometry_guess contains semantics of x86-specific LBA translation, we have to modify it not to get in the way of z architecture. If the behavior is x86-specific, it should be made x86-specific. z, ppc, sparc, arm, etc. should share the non-x86-specific code path. It's weird to single out z here; this seems like the wrong way around. Is the x86-specific behavior a problem in practice? No one seemed to mind for the other targets. On the s390 arch this adds support for FCP-attached SCSI disks that do not yet have a partition table. Without this patch, fdisk -l on the host would return different results than fdisk -l in the guest. Isn't it like this for all targets?

If we can't guess LCHS from MSDOS partition table
    guess geometry based on size
    translation is NONE for very small disks, else LBA
Else if LCHS guess has heads > 16
    BIOS LBA translation must be active
    guess geometry based on size
    translation is LARGE for sufficiently small disks, else LBA
Else
    use LCHS guess
    translation is NONE

Before writing a partition table, the default geometry is based on size. Afterwards, it may be based on the partition table instead. Yes, writing an MS-DOS partition table can change the default geometry. Horrible misfeature, independent of target. Every time I try to kill a horrible misfeature, Kevin yells at me ;) Mitigating factor 1: no change when the partition table looks like LBA is active. Should be the common case nowadays. Mitigating factor 2: most guest software is blissfully unaware of geometry. If you guys don't like the way the patch goes, I can exclude it from the patch series and we can discuss it later.
But I thought that since it's related to the hard drive geometry, and since we change hd_geometry_guess in this patchset anyway, why not get rid of this problem as well. I think we should discuss what we actually want, then let you implement it. Perhaps the stupidest solution that could possibly work is the state after your PATCH 5:

If backend has a geometry (currently only DASD has; may change)
    // Incompatible change!
    use backend geometry
    translation is NONE
Else if we can't guess LCHS from MSDOS partition table
    guess geometry based on size
    translation is NONE for very small disks, else LBA
Else if LCHS guess has heads > 16
    BIOS LBA translation must be active
    guess geometry based on size
    translation is LARGE for sufficiently small disks, else LBA
Else
    use LCHS guess
    translation is NONE

Note that translation is relevant only for a PC machine's IDE disks. Everything else ignores it.
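The decision tree above can be written out as a tiny C function to make the branches explicit. This is an illustrative sketch only: the function name, the boolean inputs, and the "very small disk" threshold abstracted into a parameter are hypothetical, not QEMU's actual hd_geometry_guess() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the geometry/translation decision tree discussed
 * above (without the backend-geometry branch added by PATCH 5). */
enum Translation { TRANS_NONE, TRANS_LBA, TRANS_LARGE };

static enum Translation guess_translation(bool lchs_from_ptable, int heads,
                                          bool very_small_disk)
{
    if (!lchs_from_ptable) {
        /* No usable LCHS in the MS-DOS partition table: guess from size. */
        return very_small_disk ? TRANS_NONE : TRANS_LBA;
    }
    if (heads > 16) {
        /* More than 16 heads: BIOS LBA translation must be active. */
        return very_small_disk ? TRANS_LARGE : TRANS_LBA;
    }
    /* Plausible LCHS guess: use it directly, no translation. */
    return TRANS_NONE;
}
```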
Re: [Qemu-devel] [PATCH] block: do not use get_clock()
Stefan Hajnoczi stefa...@gmail.com writes: On Wed, Nov 26, 2014 at 03:01:02PM +0100, Paolo Bonzini wrote: Use the external qemu-timer API instead. Cc: kw...@redhat.com Cc: stefa...@redhat.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- block/accounting.c | 6 -- block/raw-posix.c | 8 2 files changed, 8 insertions(+), 6 deletions(-) Thanks, applied to my block-next tree: https://github.com/stefanha/qemu/commits/block-next Please wait for Paolo's v2 with rationale in the commit message.
Re: [Qemu-devel] [RFC PATCH 2/3] raw-posix: Convert Linux AIO submission to coroutines
Am 28.11.2014 um 13:57 hat Kevin Wolf geschrieben: Am 27.11.2014 um 10:50 hat Peter Lieven geschrieben: On 26.11.2014 15:46, Kevin Wolf wrote: This improves the performance of requests because an ACB doesn't need to be allocated on the heap any more. It also makes the code nicer and smaller. As a side effect, the codepath taken by aio=threads is changed to use paio_submit_co(). This doesn't change the performance at this point. Results of qemu-img bench -t none -c 1000 [-n] /dev/loop0:

          |     aio=native       |     aio=threads
          | before  | with patch | before  | with patch
----------+---------+------------+---------+------------
run 1     | 29.921s | 26.932s    | 35.286s | 35.447s
run 2     | 29.793s | 26.252s    | 35.276s | 35.111s
run 3     | 30.186s | 27.114s    | 35.042s | 34.921s
run 4     | 30.425s | 26.600s    | 35.169s | 34.968s
run 5     | 30.041s | 26.263s    | 35.224s | 35.000s

TODO: Do some more serious benchmarking in VMs with less variance. Results of a quick fio run are vaguely positive. I still see the main-loop spun warnings with these patches applied to master. It wasn't there with the original patch from August.

~/git/qemu$ ./qemu-img bench -t none -c 1000 -n /dev/ram1
Sending 1000 requests, 4096 bytes each, 64 in parallel
main-loop: WARNING: I/O thread spun for 1000 iterations
Run completed in 31.947 seconds.

Yes, I still need to bisect that. The 'qemu-img bench' numbers above are actually also from August; we have regressed meanwhile by about a second, and I also haven't found the reason for that yet. Did the first part of this now. The commit that introduced the spun message is 2cdff7f6 ('linux-aio: avoid deadlock in nested aio_poll() calls'). The following patch doesn't make it go away completely, but I only see it sometime during like every other run now, instead of immediately after starting qemu-img bench. It's probably a (very) minor performance optimisation, too.
Kevin

diff --git a/block/linux-aio.c b/block/linux-aio.c
index fd8f0e4..1a0ec62 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -136,6 +136,8 @@ static void qemu_laio_completion_bh(void *opaque)
         qemu_laio_process_completion(s, laiocb);
     }
+
+    qemu_bh_cancel(s->completion_bh);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -143,7 +145,7 @@ static void qemu_laio_completion_cb(EventNotifier *e)
     struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
 
     if (event_notifier_test_and_clear(&s->e)) {
-        qemu_bh_schedule(s->completion_bh);
+        qemu_laio_completion_bh(s);
     }
 }
Re: [Qemu-devel] Qemu coroutine behaviour on blocking send(3)
On Fri, Nov 28, 2014 at 7:47 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Nov 28, 2014 at 01:55:00PM +0700, Iwan Budi Kusnanto wrote: I meant, will the coroutine yield internally when it gets blocked in send(3)? No. In general, QEMU will use non-blocking file descriptors so the blocking case does not apply. (You can't use blocking file descriptors in an event loop without a risk of blocking the entire event loop.) There is qemu_co_send(), which attempts the non-blocking send(2) and yields on EAGAIN. block/sheepdog.c and nbd.c both use this function. It's a little ugly because the caller must add/remove the socket write fd handler function so that the coroutine is re-entered when the fd becomes writable again. Stefan Thanks Stefan, it really helps. -- Iwan Budi Kusnanto
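The yield-on-EAGAIN pattern Stefan describes can be sketched with stand-ins for the socket and the coroutine machinery. Here fake_send() and coroutine_yield() are toy stubs (the fake socket "fills up" twice before accepting the data); this is not QEMU's qemu_co_send() implementation, just the shape of the loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Toy model: a non-blocking send that returns EAGAIN twice, and a yield
 * stub that only counts how often we would have suspended.  In real code,
 * the coroutine is re-entered once the fd becomes writable again. */
static int attempts;
static int yields;

static ssize_t fake_send(const char *buf, size_t len)
{
    (void)buf;
    if (++attempts < 3) {        /* socket buffer "full" twice */
        errno = EAGAIN;
        return -1;
    }
    return (ssize_t)len;         /* then the whole buffer goes through */
}

static void coroutine_yield(void)
{
    yields++;                    /* real code would suspend the coroutine */
}

static ssize_t co_send_all(const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = fake_send(buf + done, len - done);
        if (n < 0) {
            if (errno == EAGAIN) {
                coroutine_yield();   /* wait until the fd is writable */
                continue;
            }
            return -1;               /* real error */
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```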
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On 11/28/14 14:17, Mark Rutland wrote: Hi, On Fri, Nov 28, 2014 at 12:26:44PM +0000, Laszlo Ersek wrote: Peter Maydell suggested that we describe new devices / DTB nodes in the kernel Documentation tree that we expose to arm virt guests in QEMU. Although the kernel is not required to access the fw_cfg interface, Documentation/devicetree/bindings/arm is probably the best central spot to keep the fw_cfg description in.

Suggested-by: Peter Maydell peter.mayd...@linaro.org
Signed-off-by: Laszlo Ersek ler...@redhat.com
---
 Documentation/devicetree/bindings/arm/fw-cfg.txt | 47
 1 file changed, 47 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/fw-cfg.txt

diff --git a/Documentation/devicetree/bindings/arm/fw-cfg.txt b/Documentation/devicetree/bindings/arm/fw-cfg.txt
new file mode 100644
index 0000000..d131453
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/fw-cfg.txt
@@ -0,0 +1,47 @@
+* QEMU Firmware Configuration bindings for ARM
+
+QEMU's arm-softmmu and aarch64-softmmu emulation / virtualization targets
+provide the following Firmware Configuration interface on the virt machine
+type:
+
+- A write-only, 16-bit wide selector (or control) register,
+- a read-write, 8-bit wide data register.
+
+The guest writes a selector value (a key) to the selector register, and then
+can read the corresponding data (produced by QEMU) via the data port. If the
+selected entry is writable, the guest can rewrite it through the data port. The
+authoritative registry of the valid selector values and their meanings is the
+QEMU source code.
+
+QEMU exposes the control and data register to x86 guests at fixed IO ports. ARM
+guests can access them as memory mapped registers, and their location is
+communicated to the guest's UEFI firmware in the DTB that QEMU places at the
+bottom of the guest's DRAM.

What exactly is this interface used for? The firmware can collect bits of information about the virtual hardware.
Normally a (physical) platform comes with firmware that has intimate knowledge about the properties of the platform. For example, it knows what ACPI tables it should install for the OS that will match the hardware. In virt space, qemu owns the virtual hardware (and now it actually owns the ACPI table generation too), and the firmware projects (SeaBIOS and the virtual UEFI targets in edk2) are separate projects. So the fw_cfg interface allows qemu to parametrize the firmware code. There's a long list of keys that can be written to the selectors. Some of those are guest architecture invariant, some only defined (and redefined) for different architectures. I started to document them at one point (more than two years ago, for my own understanding), but it's a moving target, and my on-and-off effort could never catch up with the code. If such a document existed, it should exist in the docs/ subdir of the qemu tree. And what's the protocol that goes over the read/write sequences described above? It's a simple protocol on the outside: write a selector, read a number of data bytes (you must know the count in advance, unless the first thing in the returned blob is a header that advertises the returned blob's size). There's a special key that returns a directory of files; basically (name, key) pairs. Then you can request named blobs too, by doing a lookup by name in the directory first, and then using the corresponding key to select a blob. Beyond that, each key has complete freedom to define the structure of the associated blob. Also, some of the blobs can be rewritten (write selector, write a number of data bytes). You must know the number of bytes to write. Once fully rewritten, qemu can run a callback function. (Actually, a somewhat recent qemu feature is that even reads can trigger callbacks, which allows qemu to generate the data on-the-fly.) Is there any documentation of that we can refer to here? None that I know of. As I wrote, it's all in the QEMU source.
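The select-then-read protocol described above can be modeled with a few lines of C. This simulates the device side in memory rather than touching real registers; the only real-world detail assumed is that key 0x0000 conventionally returns the "QEMU" signature string, and even that is hedged behind a FAKE_ prefix here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy in-memory model of the fw_cfg register protocol: write a key to the
 * selector register, then read the selected blob byte by byte through the
 * data register.  Key 0x0000 is assumed to return the "QEMU" signature,
 * mirroring the conventional signature key; anything else reads as 0. */
#define FAKE_KEY_SIGNATURE 0x0000

static uint16_t selector;
static size_t data_pos;
static const char signature_blob[4] = "QEMU";

static void fw_cfg_select(uint16_t key)
{
    selector = key;
    data_pos = 0;              /* selecting a key rewinds the data stream */
}

static uint8_t fw_cfg_data_read(void)
{
    if (selector == FAKE_KEY_SIGNATURE && data_pos < sizeof(signature_blob)) {
        return (uint8_t)signature_blob[data_pos++];
    }
    return 0;                  /* reads past the blob return nothing */
}
```

Note how the reader must know the blob length in advance (four bytes here), exactly as described for the real interface.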
Is version information probeable from this, or does that need to be described? The outermost protocol (the behavior of the selector and data register, and a number of predefined selector values) have ossified. Most of the newer blobs are not associated with direct keys, but with fw_cfg file names (see the directory above). There is a revision (FW_CFG_ID) key, but it seems to be constant 1, so it's not actually used for versioning. In practice the set of available keys, files, and the structure of the individual files, have been a constant pain point wrt. versioning. The current practice is, as far as I can tell, to keep the name-contents associations, and the contents' structure, stable. If something new is needed, introduce a new name. And guests can probe for the existence of file names in the directory. +The guest kernel is not expected to use these registers (although it is +certainly allowed to); the device tree bindings are
Re: [Qemu-devel] [kernel PATCH] devicetree: document ARM bindings for QEMU's Firmware Config interface
On 11/28/14 14:33, Mark Rutland wrote: On Fri, Nov 28, 2014 at 01:22:29PM +, Laszlo Ersek wrote: On 11/28/14 13:59, Arnd Bergmann wrote: We don't normally list contiguous registers separately. Maybe just round up to one page and make the register property reg = 0x0 0x902 0x0 0x1000; The registers are not necessarily placed shoulder to shoulder (it's not a requirement in the QEMU source). It might not be at the moment, but if no-one separates them it could be a requirement for using this binding. Is there any benefit to having them separated? I couldn't name any. Thanks Laszlo
[Qemu-devel] async block driver questions
Hi all, I have some questions about developing an async network-protocol-based block driver for qemu. The network file server has an async client library, and I think it would be good if I can use it. 1. Is it OK to use libevent inside a block driver? What makes me worried is that qemu itself is said to be an event-driven program with its own main loop. 2. Any example of how to create many coroutines and wait for their completion? I thought I could just do qemu_coroutine_create followed by qemu_coroutine_enter, but things become complex because qemu_coroutine_yield returns control back to the caller. I also found CoQueue, but I'm not sure how to use it. 3. Do we need to make our bdrv_read and bdrv_write thread safe? Thanks in advance. -- Iwan Budi Kusnanto
Re: [Qemu-devel] [PATCH RFC for-2.2] virtio-blk: force 1st s/g to match header
On Fri, 28 Nov 2014 11:43:59 + Stefan Hajnoczi stefa...@gmail.com wrote: On Fri, Nov 28, 2014 at 7:05 AM, Jason Wang jasow...@redhat.com wrote: On Fri, Nov 28, 2014 at 9:16 AM, Fam Zheng f...@redhat.com wrote: On Thu, 11/27 23:13, Michael S. Tsirkin wrote: On Thu, Nov 27, 2014 at 07:21:35PM +, Stefan Hajnoczi wrote: On Thu, Nov 27, 2014 at 4:33 PM, Michael S. Tsirkin m...@redhat.com wrote: We leak cpu mappings when the 1st s/g is not exactly the header. As we don't set ANY_LAYOUT, we can at this point simply assert the correct length. This will have to be fixed once ANY_LAYOUT is set. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Untested: posting for early feedback. Reviewed-by: Stefan Hajnoczi stefa...@redhat.com Stefan, are you going to merge this for 2.2? Peter, if Stefan is merging this one, we can take Jason's patch for -net and that should be enough for 2.2. My test bot saw a failure of make check:

qemu-system-x86_64: virtio-blk request outhdr too long
Broken pipe
GTester: last random seed: R02S44c5a09d74c0603f0091ed20d1395121
qemu-system-x86_64: virtio-blk request outhdr too long
Broken pipe
GTester: last random seed: R02Sfdf270c8e1ed75c1f2047e3f0fb89b88
qemu-system-x86_64: virtio-blk request outhdr too long
Broken pipe
GTester: last random seed: R02Sd292c540929b963ed9d7d155f3db5452
qemu-system-x86_64: virtio-blk request outhdr too long
Broken pipe
GTester: last random seed: R02Sfc775a34a844c39f376aa478eda7573f
[vmxnet3][WR][vmxnet3_peer_has_vnet_hdr]: Peer has no virtio extension. Task offloads will be emulated.
make: *** [check-qtest-x86_64] Error 1

Fam This is probably because of the test itself. But anyway all kinds of guests (especially Windows drivers) need to be tested. Right, the test case explicitly tests different descriptor layouts, even though virtio-blk-pci does not set the ANY_LAYOUT feature bit.
Either the test case needs to check ANY_LAYOUT before using the 2-descriptor layout or it needs to expect QEMU to refuse (in this case exit(1), which is not very graceful). The quick fix is to skip the 2-descriptor layout tests and re-enable them once virtio-blk actually supports ANY_LAYOUT. Any objections? Stefan To be compliant with the specs, the test should check for ANY_LAYOUT feature before testing with 2 descriptor layout. I'll add it with the other changes pending for the test. Marc
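The guard the test case needs is a simple feature-bit check before using the 2-descriptor layout. VIRTIO_F_ANY_LAYOUT is bit 27 in the legacy virtio feature word; the helper name below is made up for illustration, it is not the qtest API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Legacy virtio feature bit: device accepts arbitrary descriptor layouts. */
#define VIRTIO_F_ANY_LAYOUT 27

/* Illustrative helper: should the test use the 2-descriptor layout? */
static bool virtio_has_any_layout(uint32_t host_features)
{
    return (host_features & (1u << VIRTIO_F_ANY_LAYOUT)) != 0;
}
```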
[Qemu-devel] [PATCH 0/7] coroutine: optimizations
As discussed in the other thread, this brings speedups from dropping the coroutine mutex (which serializes multiple iothreads, too) and using ELF thread-local storage. The speedup in perf/cost is about 30% (190 -> 145). Windows port tested with tests/test-coroutine.exe under Wine.

Paolo

Paolo Bonzini (7):
  coroutine-ucontext: use __thread
  qemu-thread: add per-thread atexit functions
  test-coroutine: avoid overflow on 32-bit systems
  QSLIST: add lock-free operations
  coroutine: rewrite pool to avoid mutex
  coroutine: drop qemu_coroutine_adjust_pool_size
  coroutine: try harder not to delete coroutines

 block/block-backend.c     |   4 --
 coroutine-ucontext.c      |  64 +++-
 include/block/coroutine.h |  10 -
 include/qemu/queue.h      |  15 ++-
 include/qemu/thread.h     |   4 ++
 qemu-coroutine.c          | 104 ++
 tests/test-coroutine.c    |   2 +-
 util/qemu-thread-posix.c  |  37 +
 util/qemu-thread-win32.c  |  48 -
 9 files changed, 157 insertions(+), 131 deletions(-)

-- 
2.1.0
[Qemu-devel] [PATCH 3/7] test-coroutine: avoid overflow on 32-bit systems
unsigned long is not large enough to represent 1000000000 * duration there. Just use floating point.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 tests/test-coroutine.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
index e22fae1..27d1b6f 100644
--- a/tests/test-coroutine.c
+++ b/tests/test-coroutine.c
@@ -337,7 +337,7 @@ static void perf_cost(void)
                    "%luns per coroutine",
                    maxcycles,
                    duration, ops,
-                   (unsigned long)(1000000000 * duration) / maxcycles);
+                   (unsigned long)(1000000000.0 * duration / maxcycles));
 }
 
 int main(int argc, char **argv)
-- 
2.1.0
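To see why the cast has to move, consider that on a 32-bit host unsigned long is 32 bits (max about 4.29e9), while 1000000000 * duration is around 1e10 for a ten-second run. Doing the whole computation in double and truncating only the small per-operation quotient stays in range. The sketch below uses uint32_t to model a 32-bit unsigned long; the sample numbers are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed form of the computation: multiply and divide entirely in double,
 * then truncate only the final, small quotient.  uint32_t stands in for
 * a 32-bit host's unsigned long. */
static uint32_t ns_per_coroutine(double duration_s, double maxcycles)
{
    double total_ns = 1000000000.0 * duration_s;  /* may exceed UINT32_MAX */
    return (uint32_t)(total_ns / maxcycles);      /* quotient is small */
}
```

Truncating total_ns first, as the old code effectively did, would wrap (or worse, invoke undefined behavior for an out-of-range double-to-integer conversion).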
[Qemu-devel] [PATCH 2/7] qemu-thread: add per-thread atexit functions
Destructors are the main additional feature of pthread TLS compared to __thread. If we were using C++ (hint, hint!) we could have used thread-local objects with a destructor. Since we are not, instead, we add a simple Notifier-based API. Note that the notifier must be per-thread as well. We can add a global list as well later, perhaps.

The Win32 implementation has some complications because a) detached threads used not to have a QemuThreadData; b) the main thread does not go through win32_start_routine, so we have to use atexit too.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 include/qemu/thread.h    |  4
 util/qemu-thread-posix.c | 37 +
 util/qemu-thread-win32.c | 48 +---
 3 files changed, 78 insertions(+), 11 deletions(-)

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index f7e3b9b..e89fdc9 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -61,4 +61,8 @@ bool qemu_thread_is_self(QemuThread *thread);
 void qemu_thread_exit(void *retval);
 void qemu_thread_naming(bool enable);
 
+struct Notifier;
+void qemu_thread_atexit_add(struct Notifier *notifier);
+void qemu_thread_atexit_remove(struct Notifier *notifier);
+
 #endif
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index d05a649..41cb23d 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -26,6 +26,7 @@
 #endif
 #include "qemu/thread.h"
 #include "qemu/atomic.h"
+#include "qemu/notify.h"
 
 static bool name_threads;
 
@@ -401,6 +402,42 @@ void qemu_event_wait(QemuEvent *ev)
     }
 }
 
+static pthread_key_t exit_key;
+
+union NotifierThreadData {
+    void *ptr;
+    NotifierList list;
+};
+QEMU_BUILD_BUG_ON(sizeof(union NotifierThreadData) != sizeof(void *));
+
+void qemu_thread_atexit_add(Notifier *notifier)
+{
+    union NotifierThreadData ntd;
+    ntd.ptr = pthread_getspecific(exit_key);
+    notifier_list_add(&ntd.list, notifier);
+    pthread_setspecific(exit_key, ntd.ptr);
+}
+
+void qemu_thread_atexit_remove(Notifier *notifier)
+{
+    union NotifierThreadData ntd;
+    ntd.ptr = pthread_getspecific(exit_key);
+    notifier_remove(notifier);
+    pthread_setspecific(exit_key, ntd.ptr);
+}
+
+static void qemu_thread_atexit_run(void *arg)
+{
+    union NotifierThreadData ntd = { .ptr = arg };
+    notifier_list_notify(&ntd.list, NULL);
+}
+
+static void __attribute__((constructor)) qemu_thread_atexit_init(void)
+{
+    pthread_key_create(&exit_key, qemu_thread_atexit_run);
+}
+
+
 /* Attempt to set the threads name; note that this is for debug, so
  * we're not going to fail if we can't set it.
  */
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index c405c9b..7bda85b 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -12,6 +12,7 @@
  */
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/notify.h"
 #include <process.h>
 #include <assert.h>
 #include <limits.h>
@@ -268,6 +269,7 @@ struct QemuThreadData {
     void *(*start_routine)(void *);
     void *arg;
     short mode;
+    NotifierList exit;
 
     /* Only used for joinable threads. */
     bool exited;
@@ -275,18 +277,40 @@ struct QemuThreadData {
     CRITICAL_SECTION cs;
 };
 
+static bool atexit_registered;
+static NotifierList main_thread_exit;
+
 static __thread QemuThreadData *qemu_thread_data;
 
+static void run_main_iothread_exit(void)
+{
+    notifier_list_notify(&main_thread_exit, NULL);
+}
+
+void qemu_thread_atexit_add(Notifier *notifier)
+{
+    if (!qemu_thread_data) {
+        if (!atexit_registered) {
+            atexit_registered = true;
+            atexit(run_main_iothread_exit);
+        }
+        notifier_list_add(&main_thread_exit, notifier);
+    } else {
+        notifier_list_add(&qemu_thread_data->exit, notifier);
+    }
+}
+
+void qemu_thread_atexit_remove(Notifier *notifier)
+{
+    notifier_remove(notifier);
+}
+
 static unsigned __stdcall win32_start_routine(void *arg)
 {
     QemuThreadData *data = (QemuThreadData *) arg;
     void *(*start_routine)(void *) = data->start_routine;
     void *thread_arg = data->arg;
 
-    if (data->mode == QEMU_THREAD_DETACHED) {
-        g_free(data);
-        data = NULL;
-    }
     qemu_thread_data = data;
     qemu_thread_exit(start_routine(thread_arg));
     abort();
@@ -296,12 +320,14 @@ void qemu_thread_exit(void *arg)
 {
     QemuThreadData *data = qemu_thread_data;
 
-    if (data) {
-        assert(data->mode != QEMU_THREAD_DETACHED);
+    notifier_list_notify(&data->exit, NULL);
+    if (data->mode == QEMU_THREAD_JOINABLE) {
         data->ret = arg;
         EnterCriticalSection(&data->cs);
         data->exited = true;
         LeaveCriticalSection(&data->cs);
+    } else {
+        g_free(data);
     }
     _endthreadex(0);
 }
@@ -313,9 +339,10 @@ void *qemu_thread_join(QemuThread *thread)
     HANDLE handle;
 
     data = thread->data;
-    if (!data) {
+    if (data->mode ==
[Qemu-devel] [PATCH 4/7] QSLIST: add lock-free operations
These operations are trivial to implement and do not have ABA problems. They are enough to implement simple multiple-producer, single-consumer lock-free lists or, as in the next patch, the multiple consumers can steal a whole batch of elements and process them at their leisure.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 include/qemu/queue.h | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index d433b90..6a01e2f 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -191,8 +191,19 @@ struct {                                                \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_INSERT_HEAD(head, elm, field) do {                        \
-    (elm)->field.sle_next = (head)->slh_first;                           \
-    (head)->slh_first = (elm);                                           \
+    (elm)->field.sle_next = (head)->slh_first;                           \
+    (head)->slh_first = (elm);                                           \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_INSERT_HEAD_ATOMIC(head, elm, field) do {                 \
+    do {                                                                 \
+        (elm)->field.sle_next = (head)->slh_first;                       \
+    } while (atomic_cmpxchg(&(head)->slh_first, (elm)->field.sle_next,   \
+                            (elm)) != (elm)->field.sle_next);            \
+} while (/*CONSTCOND*/0)
+
+#define QSLIST_MOVE_ATOMIC(dest, src) do {                               \
+    (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);            \
 } while (/*CONSTCOND*/0)
 
 #define QSLIST_REMOVE_HEAD(head, field) do {                             \
-- 
2.1.0
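The two primitives can be modeled in plain C11 atomics: a push that retries a compare-and-exchange until it wins, and a "move" that detaches the entire list with one exchange. This is a simplified, non-macro sketch of the same idea, not the qemu/queue.h code itself; the single-shot steal is exactly why there is no ABA problem here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal model of QSLIST_INSERT_HEAD_ATOMIC / QSLIST_MOVE_ATOMIC. */
struct node {
    struct node *next;
    int value;
};

static _Atomic(struct node *) head;

static void push_atomic(struct node *n)
{
    struct node *old = atomic_load(&head);
    do {
        n->next = old;           /* link to the current head... */
    } while (!atomic_compare_exchange_weak(&head, &old, n));
    /* ...and retry if another producer raced us (old is reloaded). */
}

static struct node *move_atomic(void)
{
    /* Steal the whole list in one shot, leaving it empty.  Consumers never
     * pop individual elements, so no ABA hazard can arise. */
    return atomic_exchange(&head, NULL);
}
```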
[Qemu-devel] [PATCH 5/7] coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of the pool. Lock-free stacks and queues are not hard, but they can suffer from the ABA problem, so they are better avoided unless you have some deferred reclamation scheme like RCU. Otherwise you have to stick with adding to a list, and emptying it completely. This is what this patch does, by coupling a lock-free global list of available coroutines with per-CPU lists that are actually used on coroutine creation. Whenever the destruction pool is big enough, the next thread that runs out of coroutines will steal the whole destruction pool. This is positive in two ways: 1) the allocation does not have to do any atomic operation in the fast path, it's entirely using thread-local storage. Once every POOL_BATCH_SIZE allocations it will do a single atomic_xchg. Release does an atomic_cmpxchg loop, that hopefully doesn't cause any starvation, and an atomic_inc. 2) in theory this should be completely adaptive. The number of coroutines around should be a little more than POOL_BATCH_SIZE * number of allocating threads; so this also removes qemu_coroutine_adjust_pool_size. (The previous pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit more generous. But if you actually have many high-iodepth disks, it's better to put them in different iothreads, which will also use separate thread pools and aio file descriptors). This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33. I still believe we will end up with some kind of coroutine bypass scheme (even coroutines _do_ allocate an AIOCB, so calling bdrv_aio_readv directly can help), but hey it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 qemu-coroutine.c | 93 +---
 1 file changed, 42 insertions(+), 51 deletions(-)

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index bd574aa..aee1017 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -15,31 +15,57 @@
 #include "trace.h"
 #include "qemu-common.h"
 #include "qemu/thread.h"
+#include "qemu/atomic.h"
 #include "block/coroutine.h"
 #include "block/coroutine_int.h"
 
 enum {
-    POOL_DEFAULT_SIZE = 64,
+    POOL_BATCH_SIZE = 64,
 };
 
 /** Free list to speed up creation */
-static QemuMutex pool_lock;
-static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
-static unsigned int pool_max_size = POOL_DEFAULT_SIZE;
+static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
+static unsigned int release_pool_size;
+static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
+static __thread Notifier coroutine_pool_cleanup_notifier;
+
+static void coroutine_pool_cleanup(Notifier *n, void *value)
+{
+    Coroutine *co;
+    Coroutine *tmp;
+
+    QSLIST_FOREACH_SAFE(co, &alloc_pool, pool_next, tmp) {
+        QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
+        qemu_coroutine_delete(co);
+    }
+}
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 {
     Coroutine *co = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        co = QSLIST_FIRST(&pool);
+        co = QSLIST_FIRST(&alloc_pool);
+        if (!co) {
+            if (release_pool_size > POOL_BATCH_SIZE) {
+                /* Slow path; a good place to register the destructor, too.  */
+                if (!coroutine_pool_cleanup_notifier.notify) {
+                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
+                }
+
+                /* This is not exact; there could be a little skew between
+                 * release_pool_size and the actual size of release_pool.  But
+                 * it is just a heuristic, it does not need to be perfect.
+                 */
+                release_pool_size = 0;
+                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
+                co = QSLIST_FIRST(&alloc_pool);
+            }
+        }
         if (co) {
-            QSLIST_REMOVE_HEAD(&pool, pool_next);
-            pool_size--;
+            QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
         }
-        qemu_mutex_unlock(&pool_lock);
     }
 
     if (!co) {
@@ -53,39 +80,19 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 
 static void coroutine_delete(Coroutine *co)
 {
+    co->caller = NULL;
+
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        if (pool_size < pool_max_size) {
-            QSLIST_INSERT_HEAD(&pool, co, pool_next);
-            co->caller = NULL;
-            pool_size++;
-            qemu_mutex_unlock(&pool_lock);
+        if (release_pool_size < POOL_BATCH_SIZE * 2) {
+            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
+            atomic_inc(&release_pool_size);
             return;
         }
-        qemu_mutex_unlock(&pool_lock);
     }
[Qemu-devel] [PATCH 6/7] coroutine: drop qemu_coroutine_adjust_pool_size
This is not needed anymore. The new TLS-based algorithm is adaptive.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 block/block-backend.c     |  4
 include/block/coroutine.h | 10 --
 qemu-coroutine.c          |  4
 3 files changed, 18 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index d0692b1..abf0cd1 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -260,9 +260,6 @@ int blk_attach_dev(BlockBackend *blk, void *dev)
     blk_ref(blk);
     blk->dev = dev;
     bdrv_iostatus_reset(blk->bs);
-
-    /* We're expecting I/O from the device so bump up coroutine pool size */
-    qemu_coroutine_adjust_pool_size(COROUTINE_POOL_RESERVATION);
     return 0;
 }
 
@@ -290,7 +287,6 @@ void blk_detach_dev(BlockBackend *blk, void *dev)
     blk->dev_ops = NULL;
     blk->dev_opaque = NULL;
     bdrv_set_guest_block_size(blk->bs, 512);
-    qemu_coroutine_adjust_pool_size(-COROUTINE_POOL_RESERVATION);
     blk_unref(blk);
 }
 
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 793df0e..20c027a 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -216,14 +216,4 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
  */
 void coroutine_fn yield_until_fd_readable(int fd);
 
-/**
- * Add or subtract from the coroutine pool size
- *
- * The coroutine implementation keeps a pool of coroutines to be reused by
- * qemu_coroutine_create().  This makes coroutine creation cheap.  Heavy
- * coroutine users should call this to reserve pool space.  Call it again with
- * a negative number to release pool space.
- */
-void qemu_coroutine_adjust_pool_size(int n);
-
 #endif /* QEMU_COROUTINE_H */
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index aee1017..ca40f4f 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -144,7 +144,3 @@ void coroutine_fn qemu_coroutine_yield(void)
     self->caller = NULL;
     coroutine_swap(self, to);
 }
-
-void qemu_coroutine_adjust_pool_size(int n)
-{
-}
-- 
2.1.0