Re: [Qemu-block] [PATCH v6 2/4] qcow2: add option to clean unused cache entries after some time
Am 05.06.2015 um 18:27 hat Alberto Garcia geschrieben: This adds a new 'cache-clean-interval' option that cleans all qcow2 cache entries that haven't been used in a certain interval, given in seconds. This allows setting a large L2 cache size so it can handle scenarios with lots of I/O and at the same time use little memory during periods of inactivity. This feature currently relies on MADV_DONTNEED to free that memory, so it is not useful in systems that don't follow that behavior. Signed-off-by: Alberto Garcia be...@igalia.com Reviewed-by: Max Reitz mre...@redhat.com This patch doesn't apply cleanly any more. Can you please rebase and change the QAPI documentation to say since 2.5? Kevin
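For context, a sketch of how the option from this series is meant to be used on the command line. The option names (l2-cache-size, cache-clean-interval) come from the patch series itself; the exact values and drive syntax here are illustrative assumptions, not taken from the thread.

```shell
# Allow a large L2 cache for I/O bursts, but drop entries that have been
# idle for more than 900 seconds so memory is returned to the host
# (relies on MADV_DONTNEED, as the commit message notes).
qemu-system-x86_64 \
    -drive file=disk.qcow2,format=qcow2,l2-cache-size=16M,cache-clean-interval=900
```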
Re: [Qemu-block] [PATCH 1/2] block: vpc - prevent overflow if max_table_entries >= 0x40000000
Am 26.06.2015 um 11:57 hat Stefan Hajnoczi geschrieben: On Thu, Jun 25, 2015 at 11:05:20AM -0400, Jeff Cody wrote: On Thu, Jun 25, 2015 at 03:28:35PM +0100, Stefan Hajnoczi wrote: On Wed, Jun 24, 2015 at 03:54:27PM -0400, Jeff Cody wrote: @@ -269,7 +270,9 @@ static int vpc_open(BlockDriverState *bs, QDict *options, int flags, goto fail; } -s->pagetable = qemu_try_blockalign(bs->file, s->max_table_entries * 4); +pagetable_size = (size_t) s->max_table_entries * 4; + +s->pagetable = qemu_try_blockalign(bs->file, pagetable_size); On 32-bit hosts size_t is 32-bit so the overflow hasn't been solved. Does it make sense to impose a limit on pagetable_size? Good point. Yes, it does. The VHD spec says that the Max Table Entries field should be equal to the disk size / block size. I don't know if there are images out there that treat that as >= disk size / block size rather than ==, however. But if we assume max size of 2TB for a VHD disk, and a minimal block size of 512 bytes, that would give us a max_table_entries of 0x100000000, which exceeds 32 bits by itself. For pagetable_size to fit in a 32-bit size_t, that means to support 2TB on a 32-bit host in the current implementation, the minimum block size is 4096. We could check during open / create that (disk_size / block_size) * 4 <= SIZE_MAX, and refuse to open if this is not true (and also validate max_table_entries to fit in the same). Sounds good. Jeff, I couldn't find a v2 anywhere. Are you still planning to send it? Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH v2] block/curl: Don't lose original error when a connection fails.
On Wed, Jul 08, 2015 at 02:01:30PM +0200, Kevin Wolf wrote: The guest can't cause it, but once the connection is down, I expect every request to fail. You don't have to have a malicious guest for filling up the log file, it just needs to be careless enough to continue trying new requests instead of offlining the device. How about something along these lines? [I'm not clear if atomic is necessary here, nor if there is already some mechanism for suppressing log messages - a cursory grep of the qemu source didn't find anything.] diff --git a/block/curl.c b/block/curl.c index 2fd7c06..33c14d8 100644 --- a/block/curl.c +++ b/block/curl.c @@ -299,11 +299,20 @@ static void curl_multi_check_completion(BDRVCURLState *s) /* ACBs for successful messages get completed in curl_read_cb */ if (msg->data.result != CURLE_OK) { int i; +static int errcount = 100; +int current_errcount; /* Don't lose the original error message from curl, since * it contains extra data. */ -error_report("curl: %s", state->errmsg); +current_errcount = atomic_fetch_add(&errcount, 0); +if (current_errcount > 0) { +error_report("curl: %s", state->errmsg); +if (current_errcount == 1) { +error_report("curl: further errors suppressed"); +} +atomic_dec(&errcount); +} for (i = 0; i < CURL_NUM_ACB; i++) { CURLAIOCB *acb = state->acb[i]; Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com libguestfs lets you edit virtual machines. Supports shell scripting, bindings from many languages. http://libguestfs.org
Re: [Qemu-block] [Qemu-devel] [PATCH RFC 4/4] aio-posix: Use epoll in aio_poll
On Wed, Jul 08, 2015 at 09:01:27AM +0800, Fam Zheng wrote: On Tue, 07/07 16:08, Stefan Hajnoczi wrote: +#define EPOLL_BATCH 128 +static bool aio_poll_epoll(AioContext *ctx, bool blocking) +{ +AioHandler *node; +bool was_dispatching; +int i, ret; +bool progress; +int64_t timeout; +struct epoll_event events[EPOLL_BATCH]; + +aio_context_acquire(ctx); +was_dispatching = ctx->dispatching; +progress = false; + +/* aio_notify can avoid the expensive event_notifier_set if + * everything (file descriptors, bottom halves, timers) will + * be re-evaluated before the next blocking poll(). This is + * already true when aio_poll is called with blocking == false; + * if blocking == true, it is only true after poll() returns. + * + * If we're in a nested event loop, ctx->dispatching might be true. + * In that case we can restore it just before returning, but we + * have to clear it now. + */ +aio_set_dispatching(ctx, !blocking); + +ctx->walking_handlers++; + +timeout = blocking ? aio_compute_timeout(ctx) : 0; + +if (timeout > 0) { +timeout = DIV_ROUND_UP(timeout, 100); +} I think you already posted the timerfd code in an earlier series. Why degrade to millisecond precision? It needs to be fixed up anyway if the main loop uses aio_poll() in the future. Because of a little complication: timeout here is always -1 for iothread, and what is interesting is that -1 actually requires an explicit timerfd_settime(timerfd, flags, &(struct itimerspec){{0, 0}}, NULL) to disable timerfd for this aio_poll(), which costs something. Passing -1 to epoll_wait() without this doesn't work because the timerfd is already added to the epollfd and may have an unexpected timeout set before. Of course we can cache the state and optimize, but I've not reasoned about what if another thread happens to call aio_poll() when we're in epoll_wait(), for example when the first aio_poll() has a positive timeout but the second one has -1.
I'm not sure I understand the threads scenario since aio_poll_epoll() has a big aio_context_acquire()/release() region that protects it, but I guess the nested aio_poll() case is similar. Care needs to be taken so the extra timerfd state is consistent. The optimization can be added later unless the timerfd_settime() syscall is so expensive that it defeats the advantage of epoll().
Re: [Qemu-block] [PATCH] raw-posix.c: remove raw device access for cdrom
Am 08.07.2015 um 12:47 hat Laurent Vivier geschrieben: On 08/07/2015 12:31, Kevin Wolf wrote: Am 02.07.2015 um 16:18 hat Laurent Vivier geschrieben: On 02/07/2015 16:03, Paolo Bonzini wrote: On 02/07/2015 15:58, Laurent Vivier wrote: Since any /dev entry can be treated as a raw disk image, it is worth noting which devices can be accessed when and how. /dev/rdisk nodes are character-special devices, but are raw in the BSD sense and force block-aligned I/O. They are closer to the physical disk than the buffer cache. /dev/disk nodes, on the other hand, are buffered block-special devices and are used primarily by the kernel's filesystem code. So the right thing to do would not be just to set need_alignment, but to probe it like we do on Linux for BDRV_O_NO_CACHE. I'm okay with doing the simple thing, but it needs a comment for non-BSDers. So, what we have to do, in our case, for MacOS X cdrom, is something like: ... GetBSDPath ... ... if (flags & BDRV_O_NOCACHE) { strcat(bsdPath, "r"); } ... I would avoid such magic. What we could do is rejecting /dev/rdisk nodes without BDRV_O_NOCACHE. It's not how it works... Look in hdev_open(). If user provides /dev/cdrom on the command line, in the case of MacOS X, QEMU searches for a cdrom drive in the system and sets filename to /dev/rdiskX according to the result. Oh, we're already playing such games... I guess you're right then. It even seems to be not only for '/dev/cdrom', but for everything starting with this string. Does anyone know what's the reason for that? Also, I guess before doing strcat() on bsdPath, we should check the buffer length... Perhaps this part should be removed. But if we just want to correct the bug, we must not set filename to /dev/rdiskX if NOCACHE is not set but to /dev/diskX It's the aim of this change. Yes, that looks right. Kevin
Re: [Qemu-block] [PATCH RFC 3/4] aio: Introduce aio_context_setup
On Wed, Jul 08, 2015 at 09:15:57AM +0800, Fam Zheng wrote: On Tue, 07/07 15:35, Stefan Hajnoczi wrote: On Tue, Jun 30, 2015 at 09:19:44PM +0800, Fam Zheng wrote: diff --git a/async.c b/async.c index 06971f4..1d70cfd 100644 --- a/async.c +++ b/async.c @@ -290,12 +290,17 @@ AioContext *aio_context_new(Error **errp) { int ret; AioContext *ctx; +Error *local_err = NULL; + ctx = (AioContext *) g_source_new(aio_source_funcs, sizeof(AioContext)); +aio_context_setup(ctx, local_err); +if (local_err) { +error_propagate(errp, local_err); Is there any reason to introduce local_err? errp can be passed directly into aio_context_setup(). It's used for catching failure of aio_context_setup, because the convention is errp can be NULL. You are right, I missed that aio_context_setup() has a void return type.
Re: [Qemu-block] [PATCH] raw-posix.c: remove raw device access for cdrom
Am 02.07.2015 um 16:18 hat Laurent Vivier geschrieben: On 02/07/2015 16:03, Paolo Bonzini wrote: On 02/07/2015 15:58, Laurent Vivier wrote: Since any /dev entry can be treated as a raw disk image, it is worth noting which devices can be accessed when and how. /dev/rdisk nodes are character-special devices, but are raw in the BSD sense and force block-aligned I/O. They are closer to the physical disk than the buffer cache. /dev/disk nodes, on the other hand, are buffered block-special devices and are used primarily by the kernel's filesystem code. So the right thing to do would not be just to set need_alignment, but to probe it like we do on Linux for BDRV_O_NO_CACHE. I'm okay with doing the simple thing, but it needs a comment for non-BSDers. So, what we have to do, in our case, for MacOS X cdrom, is something like: ... GetBSDPath ... ... if (flags & BDRV_O_NOCACHE) { strcat(bsdPath, "r"); } ... I would avoid such magic. What we could do is rejecting /dev/rdisk nodes without BDRV_O_NOCACHE. Kevin
[Qemu-block] [PULL v2 02/16] Revert dataplane: allow virtio-1 devices
From: Cornelia Huck cornelia.h...@de.ibm.com This reverts commit f5a5628cf0b65b223fa0c9031714578dfac4cf04. This was an old patch that had already been superseded by b0e5d90eb (dataplane: endianness-aware accesses). Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com Reviewed-by: Michael S. Tsirkin m...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com Acked-by: Stefan Hajnoczi stefa...@redhat.com --- hw/virtio/dataplane/vring.c | 47 - 1 file changed, 21 insertions(+), 26 deletions(-) diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c index bed9b11..07fd69c 100644 --- a/hw/virtio/dataplane/vring.c +++ b/hw/virtio/dataplane/vring.c @@ -158,18 +158,15 @@ bool vring_should_notify(VirtIODevice *vdev, Vring *vring) } -static int get_desc(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem, +static int get_desc(Vring *vring, VirtQueueElement *elem, struct vring_desc *desc) { unsigned *num; struct iovec *iov; hwaddr *addr; MemoryRegion *mr; -int is_write = virtio_tswap16(vdev, desc->flags) & VRING_DESC_F_WRITE; -uint32_t len = virtio_tswap32(vdev, desc->len); -uint64_t desc_addr = virtio_tswap64(vdev, desc->addr); -if (is_write) { +if (desc->flags & VRING_DESC_F_WRITE) { num = &elem->in_num; iov = &elem->in_sg[*num]; addr = &elem->in_addr[*num]; @@ -193,17 +190,18 @@ static int get_desc(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem, } /* TODO handle non-contiguous memory across region boundaries */ -iov->iov_base = vring_map(&mr, desc_addr, len, is_write); +iov->iov_base = vring_map(&mr, desc->addr, desc->len, + desc->flags & VRING_DESC_F_WRITE); if (!iov->iov_base) { error_report("Failed to map descriptor addr %#" PRIx64 " len %u", - (uint64_t)desc_addr, len); + (uint64_t)desc->addr, desc->len); return -EFAULT; } /* The MemoryRegion is looked up again and unref'ed later, leave the * ref in place.
*/ -iov->iov_len = len; -*addr = desc_addr; +iov->iov_len = desc->len; +*addr = desc->addr; *num += 1; return 0; } @@ -225,23 +223,21 @@ static int get_indirect(VirtIODevice *vdev, Vring *vring, struct vring_desc desc; unsigned int i = 0, count, found = 0; int ret; -uint32_t len = virtio_tswap32(vdev, indirect->len); -uint64_t addr = virtio_tswap64(vdev, indirect->addr); /* Sanity check */ -if (unlikely(len % sizeof(desc))) { +if (unlikely(indirect->len % sizeof(desc))) { error_report("Invalid length in indirect descriptor: len %#x not multiple of %#zx", - len, sizeof(desc)); + indirect->len, sizeof(desc)); vring->broken = true; return -EFAULT; } -count = len / sizeof(desc); +count = indirect->len / sizeof(desc); /* Buffers are chained via a 16 bit next field, so * we can have at most 2^16 of these. */ if (unlikely(count > USHRT_MAX + 1)) { -error_report("Indirect buffer length too big: %d", len); +error_report("Indirect buffer length too big: %d", indirect->len); vring->broken = true; return -EFAULT; } @@ -252,12 +248,12 @@ static int get_indirect(VirtIODevice *vdev, Vring *vring, /* Translate indirect descriptor */ desc_ptr = vring_map(&mr, - addr + found * sizeof(desc), + indirect->addr + found * sizeof(desc), sizeof(desc), false); if (!desc_ptr) { error_report("Failed to map indirect descriptor addr %#" PRIx64 " len %zu", - (uint64_t)addr + found * sizeof(desc), + (uint64_t)indirect->addr + found * sizeof(desc), sizeof(desc)); vring->broken = true; return -EFAULT; } @@ -275,20 +271,19 @@ static int get_indirect(VirtIODevice *vdev, Vring *vring, return -EFAULT; } -if (unlikely(virtio_tswap16(vdev, desc.flags) & - VRING_DESC_F_INDIRECT)) { +if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) { error_report("Nested indirect descriptor"); vring->broken = true; return -EFAULT; } -ret = get_desc(vdev, vring, elem, &desc); +ret = get_desc(vring, elem, &desc); if (ret < 0) { vring->broken |= (ret == -EFAULT); return ret; } -i = virtio_tswap16(vdev, desc.next); -} while (virtio_tswap16(vdev, desc.flags) &
VRING_DESC_F_NEXT); +i = desc.next; +} while (desc.flags & VRING_DESC_F_NEXT); return 0; } @@ -389,7 +384,7 @@ int vring_pop(VirtIODevice *vdev, Vring *vring, /* Ensure descriptor is loaded
Re: [Qemu-block] NVMe volatile write cache fixes
Am 11.06.2015 um 12:01 hat Christoph Hellwig geschrieben: The first patch ensures that flush actually flushes data so that data integrity is preserved, the second fixes up WCE reporting. Thanks, applied to the block branch for 2.4-rc1. Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH RFC 0/4] aio: Use epoll_wait in aio_poll
Am 08.07.2015 um 03:02 schrieb Fam Zheng: On Tue, 07/07 16:54, Christian Borntraeger wrote: Am 30.06.2015 um 15:19 schrieb Fam Zheng: epoll is more scalable than ppoll. It performs faster than ppoll when the number of polled fds is high. See patch 4 for an example of the scenario and some benchmark data. Note: it is only effective on iothread (dataplane), while the main loop cannot benefit from this yet, because the iohandler and chardev GSource's don't easily fit into this epoll interface style (that's why main loop uses qemu_poll_ns directly instead of aio_poll()). There is hardly any timer activity in iothreads for now, as a result the timeout is always 0 or -1. Therefore, timerfd, or the said nanosecond epoll_pwait1 interface, which fixes the timeout granularity deficiency is not immediately necessary at this point, but still that will be simple to add. Please review! Is there a branch somewhere, so that I could give it a spin? Here: https://github.com/famz/qemu/tree/aio-posix-epoll In file included from /home/cborntra/REPOS/qemu/include/qemu/option.h:31:0, from /home/cborntra/REPOS/qemu/include/qemu-common.h:44, from /home/cborntra/REPOS/qemu/async.c:25: /home/cborntra/REPOS/qemu/async.c: In function 'aio_context_new': /home/cborntra/REPOS/qemu/include/qapi/error.h:57:20: error: 'ret' may be used uninitialized in this function [-Werror=maybe-uninitialized] error_set_errno(errp, os_error, ERROR_CLASS_GENERIC_ERROR, \ ^ /home/cborntra/REPOS/qemu/async.c:291:9: note: 'ret' was declared here int ret; ^ cc1: all warnings being treated as errors With that fixed, it seems to work. Still looking at the performance.
[Qemu-block] [PATCH v4] block/curl: Don't lose original error when a connection fails.
Currently if qemu is connected to a curl source (eg. web server), and the web server fails / times out / dies, you always see a bogus EIO Input/output error. For example, choose a large file located on any local webserver which you control: $ qemu-img convert -p http://example.com/large.iso /tmp/test Once it starts copying the file, stop the webserver and you will see qemu-img fail with: qemu-img: error while reading sector 61440: Input/output error This patch does two things: Firstly print the actual error from curl so it doesn't get lost. Secondly, change EIO to EPROTO. EPROTO is a POSIX.1 compatible errno which more accurately reflects that there was a protocol error, rather than some kind of hardware failure. After this patch is applied, the error changes to: $ qemu-img convert -p http://example.com/large.iso /tmp/test qemu-img: curl: transfer closed with 469989 bytes remaining to read qemu-img: error while reading sector 16384: Protocol error Signed-off-by: Richard W.M. Jones rjo...@redhat.com Reviewed-by: Stefan Hajnoczi stefa...@redhat.com --- block/curl.c | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/block/curl.c b/block/curl.c index 3a2b63e..032cc8a 100644 --- a/block/curl.c +++ b/block/curl.c @@ -22,6 +22,7 @@ * THE SOFTWARE. */ #include "qemu-common.h" +#include "qemu/error-report.h" #include "block/block_int.h" #include "qapi/qmp/qbool.h" #include "qapi/qmp/qstring.h" @@ -298,6 +299,18 @@ static void curl_multi_check_completion(BDRVCURLState *s) /* ACBs for successful messages get completed in curl_read_cb */ if (msg->data.result != CURLE_OK) { int i; +static int errcount = 100; + +/* Don't lose the original error message from curl, since + * it contains extra data.
+ */ +if (errcount > 0) { +error_report("curl: %s", state->errmsg); +if (--errcount == 0) { +error_report("curl: further errors suppressed"); +} +} + for (i = 0; i < CURL_NUM_ACB; i++) { CURLAIOCB *acb = state->acb[i]; @@ -305,7 +318,7 @@ static void curl_multi_check_completion(BDRVCURLState *s) continue; } -acb->common.cb(acb->common.opaque, -EIO); +acb->common.cb(acb->common.opaque, -EPROTO); qemu_aio_unref(acb); state->acb[i] = NULL; } -- 2.3.1
[Qemu-block] [PATCH v3 2/2] block/curl: Don't lose original error when a connection fails.
Currently if qemu is connected to a curl source (eg. web server), and the web server fails / times out / dies, you always see a bogus EIO Input/output error. For example, choose a large file located on any local webserver which you control: $ qemu-img convert -p http://example.com/large.iso /tmp/test Once it starts copying the file, stop the webserver and you will see qemu-img fail with: qemu-img: error while reading sector 61440: Input/output error This patch does two things: Firstly print the actual error from curl so it doesn't get lost. Secondly, change EIO to EPROTO. EPROTO is a POSIX.1 compatible errno which more accurately reflects that there was a protocol error, rather than some kind of hardware failure. After this patch is applied, the error changes to: $ qemu-img convert -p http://example.com/large.iso /tmp/test qemu-img: curl: transfer closed with 469989 bytes remaining to read qemu-img: error while reading sector 16384: Protocol error Signed-off-by: Richard W.M. Jones rjo...@redhat.com Reviewed-by: Stefan Hajnoczi stefa...@redhat.com --- block/curl.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/block/curl.c b/block/curl.c index 3a2b63e..b72d505 100644 --- a/block/curl.c +++ b/block/curl.c @@ -22,6 +22,8 @@ * THE SOFTWARE. */ #include "qemu-common.h" +#include "qemu/error-report.h" +#include "qemu/no-flood.h" #include "block/block_int.h" #include "qapi/qmp/qbool.h" #include "qapi/qmp/qstring.h" @@ -298,6 +300,15 @@ static void curl_multi_check_completion(BDRVCURLState *s) /* ACBs for successful messages get completed in curl_read_cb */ if (msg->data.result != CURLE_OK) { int i; +FLOOD_COUNTER(errcount, 100); + +/* Don't lose the original error message from curl, since + * it contains extra data.
+ */ +NO_FLOOD(errcount, + error_report("curl: %s", state->errmsg), + error_report("curl: further errors suppressed")); + for (i = 0; i < CURL_NUM_ACB; i++) { CURLAIOCB *acb = state->acb[i]; @@ -305,7 +316,7 @@ static void curl_multi_check_completion(BDRVCURLState *s) continue; } -acb->common.cb(acb->common.opaque, -EIO); +acb->common.cb(acb->common.opaque, -EPROTO); qemu_aio_unref(acb); state->acb[i] = NULL; } -- 2.3.1
[Qemu-block] [PATCH v3 1/2] Add a simple mechanism to protect against error message floods.
You can now write code like this: FLOOD_COUNTER(errcount, 100); ... NO_FLOOD(errcount, error_report("oops, something bad happened"), error_report("further errors suppressed")); which would print the "oops, ..." error message up to 100 times, followed by the "further errors suppressed" message once, and then nothing else. Signed-off-by: Richard W.M. Jones rjo...@redhat.com --- include/qemu/no-flood.h | 34 ++ 1 file changed, 34 insertions(+) create mode 100644 include/qemu/no-flood.h diff --git a/include/qemu/no-flood.h b/include/qemu/no-flood.h new file mode 100644 index 000..90a24da --- /dev/null +++ b/include/qemu/no-flood.h @@ -0,0 +1,34 @@ +/* + * Suppress error floods. + * + * Copyright (C) 2015 Red Hat, Inc. + * + * Author: Richard W.M. Jones rjo...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + */ + +#ifndef __QEMU_NO_FLOOD_H +#define __QEMU_NO_FLOOD_H 1 + +#include "qemu/atomic.h" + +/* Suggested initial value is 100 */ +#define FLOOD_COUNTER(counter_name, initial) \ + static int counter_name = (initial) + +#define NO_FLOOD(counter_name, code, suppress_message) \ + do { \ +int t_##counter_name = atomic_fetch_add(&counter_name, 0); \ +if (t_##counter_name > 0) { \ +code; \ +if (t_##counter_name == 1) { \ +suppress_message; \ +} \ +atomic_dec(&counter_name); \ +} \ +} while (0) + +#endif -- 2.3.1
Re: [Qemu-block] [PATCH v3 1/2] Add a simple mechanism to protect against error message floods.
Am 08.07.2015 um 14:50 hat Richard W.M. Jones geschrieben: You can now write code like this: FLOOD_COUNTER(errcount, 100); ... NO_FLOOD(errcount, error_report("oops, something bad happened"), error_report("further errors suppressed")); which would print the "oops, ..." error message up to 100 times, followed by the "further errors suppressed" message once, and then nothing else. Signed-off-by: Richard W.M. Jones rjo...@redhat.com --- include/qemu/no-flood.h | 34 ++ 1 file changed, 34 insertions(+) create mode 100644 include/qemu/no-flood.h diff --git a/include/qemu/no-flood.h b/include/qemu/no-flood.h new file mode 100644 index 000..90a24da --- /dev/null +++ b/include/qemu/no-flood.h @@ -0,0 +1,34 @@ +/* + * Suppress error floods. + * + * Copyright (C) 2015 Red Hat, Inc. + * + * Author: Richard W.M. Jones rjo...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + */ + +#ifndef __QEMU_NO_FLOOD_H +#define __QEMU_NO_FLOOD_H 1 + +#include "qemu/atomic.h" + +/* Suggested initial value is 100 */ +#define FLOOD_COUNTER(counter_name, initial) \ + static int counter_name = (initial) + +#define NO_FLOOD(counter_name, code, suppress_message) \ + do { \ +int t_##counter_name = atomic_fetch_add(&counter_name, 0); \ +if (t_##counter_name > 0) { \ +code; \ +if (t_##counter_name == 1) { \ +suppress_message; \ +} \ +atomic_dec(&counter_name); \ +} \ +} while (0) + +#endif I don't think that you actually achieve thread safety with the atomic operations you use here. The counter may change between your check and the atomic_dec(). It doesn't matter for curl because we don't get concurrency there anyway, but it might confuse others if it looks as if it was thread safe when in fact it isn't. So I'd suggest that if you have an easy fix for thread safety, do it, but otherwise don't bother and just remove the atomics. Other than that, I like this series. Kevin
Re: [Qemu-block] [PATCH] raw-posix.c: remove raw device access for cdrom
On Jul 8, 2015, at 7:01 AM, Kevin Wolf wrote: Am 08.07.2015 um 12:47 hat Laurent Vivier geschrieben: On 08/07/2015 12:31, Kevin Wolf wrote: Am 02.07.2015 um 16:18 hat Laurent Vivier geschrieben: On 02/07/2015 16:03, Paolo Bonzini wrote: On 02/07/2015 15:58, Laurent Vivier wrote: Since any /dev entry can be treated as a raw disk image, it is worth noting which devices can be accessed when and how. /dev/rdisk nodes are character-special devices, but are raw in the BSD sense and force block-aligned I/O. They are closer to the physical disk than the buffer cache. /dev/disk nodes, on the other hand, are buffered block-special devices and are used primarily by the kernel's filesystem code. So the right thing to do would not be just to set need_alignment, but to probe it like we do on Linux for BDRV_O_NO_CACHE. I'm okay with doing the simple thing, but it needs a comment for non-BSDers. So, what we have to do, in our case, for MacOS X cdrom, is something like: ... GetBSDPath ... ... if (flags & BDRV_O_NOCACHE) { strcat(bsdPath, "r"); } ... I would avoid such magic. What we could do is rejecting /dev/rdisk nodes without BDRV_O_NOCACHE. It's not how it works... Look in hdev_open(). If user provides /dev/cdrom on the command line, in the case of MacOS X, QEMU searches for a cdrom drive in the system and sets filename to /dev/rdiskX according to the result. Oh, we're already playing such games... I guess you're right then. It even seems to be not only for '/dev/cdrom', but for everything starting with this string. Does anyone know what's the reason for that? Also, I guess before doing strcat() on bsdPath, we should check the buffer length... By buffer, do you mean the bsdPath variable?
[Qemu-block] [PATCH for-2.5 2/4] block-backend: add blk_get_max_iov()
Add a function to query BlockLimits.max_iov. Signed-off-by: Stefan Hajnoczi stefa...@redhat.com --- block/block-backend.c | 5 + include/sysemu/block-backend.h | 1 + 2 files changed, 6 insertions(+) diff --git a/block/block-backend.c b/block/block-backend.c index aee8a12..5108aec 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -777,6 +777,11 @@ int blk_get_max_transfer_length(BlockBackend *blk) return blk->bs->bl.max_transfer_length; } +int blk_get_max_iov(BlockBackend *blk) +{ +return blk->bs->bl.max_iov; +} + void blk_set_guest_block_size(BlockBackend *blk, int align) { bdrv_set_guest_block_size(blk->bs, align); diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h index 8fc960f..c114440 100644 --- a/include/sysemu/block-backend.h +++ b/include/sysemu/block-backend.h @@ -135,6 +135,7 @@ void blk_lock_medium(BlockBackend *blk, bool locked); void blk_eject(BlockBackend *blk, bool eject_flag); int blk_get_flags(BlockBackend *blk); int blk_get_max_transfer_length(BlockBackend *blk); +int blk_get_max_iov(BlockBackend *blk); void blk_set_guest_block_size(BlockBackend *blk, int align); void *blk_blockalign(BlockBackend *blk, size_t size); bool blk_op_is_blocked(BlockBackend *blk, BlockOpType op, Error **errp); -- 2.4.3
[Qemu-block] [PATCH for-2.5 1/4] block: add BlockLimits.max_iov field
The maximum number of struct iovec elements depends on the BlockDriverState. The raw-posix protocol has a maximum of IOV_MAX but others could have different values. Instead of assuming raw-posix and hardcoding IOV_MAX in several places, put the limit into BlockLimits. Cc: Peter Lieven p...@kamp.de Suggested-by: Kevin Wolf kw...@redhat.com Signed-off-by: Stefan Hajnoczi stefa...@redhat.com --- Peter Lieven: I think the SCSI LUN level does not have a maximum scatter-gather segments constraint. That is probably only at the HBA level. CCed you anyway in case you think block/iscsi.c should set the max_iov field. Kevin: The default is now INT_MAX. This means non-raw-posix users will now be able to merge more requests than before. They were limited to IOV_MAX previously. This could expose limits in other BlockDrivers which we weren't aware of... --- block/io.c| 3 +++ block/raw-posix.c | 1 + include/block/block_int.h | 3 +++ 3 files changed, 7 insertions(+) diff --git a/block/io.c b/block/io.c index e295992..6750de6 100644 --- a/block/io.c +++ b/block/io.c @@ -165,9 +165,11 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp) bs->bl.max_transfer_length = bs->file->bl.max_transfer_length; bs->bl.min_mem_alignment = bs->file->bl.min_mem_alignment; bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment; +bs->bl.max_iov = bs->file->bl.max_iov; } else { bs->bl.min_mem_alignment = 512; bs->bl.opt_mem_alignment = getpagesize(); +bs->bl.max_iov = INT_MAX; } if (bs->backing_hd) { @@ -188,6 +190,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp) bs->bl.min_mem_alignment = MAX(bs->bl.min_mem_alignment, bs->backing_hd->bl.min_mem_alignment); +bs->bl.max_iov = MIN(bs->bl.max_iov, bs->backing_hd->bl.max_iov); } /* Then let the driver override it */ diff --git a/block/raw-posix.c b/block/raw-posix.c index cbe6574..faa6ae0 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -735,6 +735,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
raw_probe_alignment(bs, s->fd, errp); bs->bl.min_mem_alignment = s->buf_align; bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize()); +bs->bl.max_iov = IOV_MAX; /* limit comes from preadv()/pwritev()/etc */ } static int check_for_dasd(int fd) diff --git a/include/block/block_int.h b/include/block/block_int.h index b0476fc..767e83d 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -315,6 +315,9 @@ typedef struct BlockLimits { /* memory alignment for bounce buffer */ size_t opt_mem_alignment; + +/* maximum number of iovec elements */ +int max_iov; } BlockLimits; typedef struct BdrvOpBlocker BdrvOpBlocker; -- 2.4.3
[Qemu-block] [PATCH for-2.5 4/4] block/mirror: replace IOV_MAX with blk_get_max_iov()
Use blk_get_max_iov() instead of hardcoding IOV_MAX, which may not apply to all BlockDrivers. Signed-off-by: Stefan Hajnoczi stefa...@redhat.com --- block/mirror.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 985ad00..0f4445c 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -160,11 +160,13 @@ static void mirror_read_complete(void *opaque, int ret) static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) { BlockDriverState *source = s->common.bs; -int nb_sectors, sectors_per_chunk, nb_chunks; +int nb_sectors, sectors_per_chunk, nb_chunks, max_iov; int64_t end, sector_num, next_chunk, next_sector, hbitmap_next_sector; uint64_t delay_ns = 0; MirrorOp *op; +max_iov = MIN(source->bl.max_iov, s->target->bl.max_iov); + s->sector_num = hbitmap_iter_next(&s->hbi); if (s->sector_num < 0) { bdrv_dirty_iter_init(s->dirty_bitmap, &s->hbi); @@ -241,7 +243,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s) trace_mirror_break_buf_busy(s, nb_chunks, s->in_flight); break; } -if (IOV_MAX < nb_chunks + added_chunks) { +if (max_iov < nb_chunks + added_chunks) { trace_mirror_break_iov_max(s, nb_chunks, added_chunks); break; } -- 2.4.3
[Qemu-block] [PATCH for-2.5 3/4] block: replace IOV_MAX with BlockLimits.max_iov
Request merging must not result in a huge request that exceeds the maximum number of iovec elements. Use BlockLimits.max_iov instead of hardcoding IOV_MAX. Signed-off-by: Stefan Hajnoczi stefa...@redhat.com --- block/io.c| 3 ++- hw/block/virtio-blk.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/block/io.c b/block/io.c index 6750de6..baffd02 100644 --- a/block/io.c +++ b/block/io.c @@ -1818,7 +1818,8 @@ static int multiwrite_merge(BlockDriverState *bs, BlockRequest *reqs, merge = 1; } -if (reqs[outidx].qiov->niov + reqs[i].qiov->niov + 1 > IOV_MAX) { +if (reqs[outidx].qiov->niov + reqs[i].qiov->niov + 1 > +bs->bl.max_iov) { merge = 0; } diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index 6aefda4..a186956 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -407,7 +407,7 @@ void virtio_blk_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb) bool merge = true; /* merge would exceed maximum number of IOVs */ -if (niov + req->qiov.niov > IOV_MAX) { +if (niov + req->qiov.niov > blk_get_max_iov(blk)) { merge = false; } -- 2.4.3
[Qemu-block] [PATCH 1/2] ignore bdrv_flush operation when no qcow2 cache item is dirty
qcow2_cache_flush() writes dirty cache entries to disk and invokes bdrv_flush() to make the data durable. But even if there is no dirty cache entry, qcow2_cache_flush() still invokes bdrv_flush(). bdrv_flush() in turn invokes fdatasync(), which is an expensive operation. This patch skips bdrv_flush() if there is no dirty cache entry. The reason I modify the return value of qcow2_cache_flush() is that qcow2_co_flush_to_os() needs to know whether the flush operation was actually performed. Following is the patch:

From 23f9f83da4178e8fbb53d2cffe128f5a2d3a239a Mon Sep 17 00:00:00 2001
From: Qingshu Chen qingshu.chen...@gmail.com
Date: Wed, 1 Jul 2015 14:45:23 +0800
Subject: [PATCH 1/2] ignore bdrv_flush operation when no qcow2 cache item is dirty

Signed-off-by: Qingshu Chen qingshu.chen...@gmail.com
---
 block/qcow2-cache.c | 9 ++++++++-
 block/qcow2.c       | 2 ++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index ed92a09..57c0601 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -174,6 +174,7 @@ int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c)
     int result = 0;
     int ret;
     int i;
+    int flag = 0;
 
     trace_qcow2_cache_flush(qemu_coroutine_self(), c == s->l2_table_cache);
 
@@ -182,15 +183,21 @@ int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c)
         if (ret < 0 && result != -ENOSPC) {
             result = ret;
         }
+        if (c->entries[i].dirty && c->entries[i].offset) {
+            flag = 1;
+        }
     }
 
-    if (result == 0) {
+    if (result == 0 && flag == 1) {
         ret = bdrv_flush(bs->file);
         if (ret < 0) {
             result = ret;
         }
     }
 
+    if (flag == 0 && result >= 0) {
+        result = 1;
+    }
+
     return result;
 }

diff --git a/block/qcow2.c b/block/qcow2.c
index d522ec7..ab4613a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2504,6 +2504,8 @@ static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
         if (ret < 0) {
             qemu_co_mutex_unlock(&s->lock);
             return ret;
+        } else if (ret == 1) {
+            bdrv_flush(bs->file);
         }
 
         if (qcow2_need_accurate_refcounts(s)) {
-- 
1.9.1

-- 
Qingshu Chen, Institute of Parallel and Distributed Systems (IPADS), School of Software, Shanghai Jiao Tong University
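The idea behind the patch above can be modeled outside QEMU. The following is a hypothetical sketch (all names invented; a global counter stands in for the expensive bdrv_flush()/fdatasync() call): only issue the flush when at least one cache entry was dirty, and report via the return value whether the flush was skipped.

```c
#include <assert.h>
#include <stdbool.h>

#define CACHE_SIZE 4

/* Hypothetical model of the patch: skip the (expensive) flush of the
 * underlying file when no cache entry was dirty, and return 1 so the
 * caller knows the flush did not happen. */
struct entry { bool dirty; };

static int flushes_issued;   /* stands in for bdrv_flush()/fdatasync() */

static int cache_flush(struct entry *entries, int n)
{
    bool any_dirty = false;
    for (int i = 0; i < n; i++) {
        if (entries[i].dirty) {
            entries[i].dirty = false;  /* write the entry back */
            any_dirty = true;
        }
    }
    if (any_dirty) {
        flushes_issued++;              /* corresponds to bdrv_flush(bs->file) */
        return 0;
    }
    return 1;                          /* nothing was dirty: flush skipped */
}
```

Note that overloading the return value this way forces every caller to distinguish "flushed" from "skipped", which is the part of the patch the caller-side hunk in qcow2_co_flush_to_os() has to handle.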
[Qemu-block] [PATCH for-2.4 4/5] block: Reorder cleanups in bdrv_close()
Block drivers may still want to access their child nodes in their .bdrv_close handler. If they unref and/or detach a child by themselves, this should not result in a double free.

There is additional code for backing files, which are just a special case of child nodes. The same applies for them.

Signed-off-by: Kevin Wolf kw...@redhat.com
---
 block.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index b723cf2..d5c9f03 100644
--- a/block.c
+++ b/block.c
@@ -1901,6 +1901,14 @@ void bdrv_close(BlockDriverState *bs)
     if (bs->drv) {
         BdrvChild *child, *next;
 
+        bs->drv->bdrv_close(bs);
+
+        if (bs->backing_hd) {
+            BlockDriverState *backing_hd = bs->backing_hd;
+            bdrv_set_backing_hd(bs, NULL);
+            bdrv_unref(backing_hd);
+        }
+
         QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
             /* TODO Remove bdrv_unref() from drivers' close function and use
              * bdrv_unref_child() here */
@@ -1910,12 +1918,6 @@ void bdrv_close(BlockDriverState *bs)
             bdrv_detach_child(child);
         }
 
-        if (bs->backing_hd) {
-            BlockDriverState *backing_hd = bs->backing_hd;
-            bdrv_set_backing_hd(bs, NULL);
-            bdrv_unref(backing_hd);
-        }
-
-        bs->drv->bdrv_close(bs);
         g_free(bs->opaque);
         bs->opaque = NULL;
         bs->drv = NULL;
-- 
1.8.3.1
[Qemu-block] [PATCH for-2.4 5/5] block: Fix backing file child when modifying graph
This patch moves bdrv_attach_child() from the individual places that add a backing file to a BDS to bdrv_set_backing_hd(), which is called by all of them. It also adds bdrv_detach_child() there.

For normal operation (starting with one backing file chain and not changing it until the topmost image is closed) and live snapshots, this constitutes no change in behaviour. For all other cases, this is a fix for the bug that the old backing file was still referenced as a child, and the new one wasn't referenced.

Signed-off-by: Kevin Wolf kw...@redhat.com
---
 block.c                   | 5 +++--
 include/block/block_int.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index d5c9f03..d088ee0 100644
--- a/block.c
+++ b/block.c
@@ -1141,6 +1141,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
     if (bs->backing_hd) {
         assert(bs->backing_blocker);
         bdrv_op_unblock_all(bs->backing_hd, bs->backing_blocker);
+        bdrv_detach_child(bs->backing_child);
     } else if (backing_hd) {
         error_setg(&bs->backing_blocker,
                    "node is used as backing hd of '%s'",
@@ -1151,8 +1152,10 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
     if (!backing_hd) {
         error_free(bs->backing_blocker);
         bs->backing_blocker = NULL;
+        bs->backing_child = NULL;
         goto out;
     }
+    bs->backing_child = bdrv_attach_child(bs, backing_hd, &child_backing);
 
     bs->open_flags &= ~BDRV_O_NO_BACKING;
     pstrcpy(bs->backing_file, sizeof(bs->backing_file), backing_hd->filename);
     pstrcpy(bs->backing_format, sizeof(bs->backing_format),
@@ -1236,7 +1239,6 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp)
         goto free_exit;
     }
 
-    bdrv_attach_child(bs, backing_hd, &child_backing);
     bdrv_set_backing_hd(bs, backing_hd);
 
 free_exit:
@@ -2171,7 +2173,6 @@ void bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top)
     /* The contents of 'tmp' will become bs_top, as we are
      * swapping bs_new and bs_top contents. */
     bdrv_set_backing_hd(bs_top, bs_new);
-    bdrv_attach_child(bs_top, bs_new, &child_backing);
 }
 
 static void bdrv_delete(BlockDriverState *bs)
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 8996baf..2cc973c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -379,6 +379,7 @@ struct BlockDriverState {
     char exact_filename[PATH_MAX];
 
     BlockDriverState *backing_hd;
+    BdrvChild *backing_child;
     BlockDriverState *file;
 
     NotifierList close_notifiers;
-- 
1.8.3.1
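The invariant the patch above establishes is that whoever changes backing_hd is also responsible for the child reference, so the two can never go out of sync. A hypothetical, heavily simplified model of that bookkeeping (all types and names invented; the real BdrvChild lifecycle involves reference counting this sketch omits):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the fix: set_backing_hd() itself detaches the old
 * child and attaches the new one, so callers can no longer leave a stale
 * child reference behind when the backing file changes. */
struct bds;
struct child { struct bds *bs; };

struct bds {
    struct bds *backing_hd;
    struct child *backing_child;
};

static struct child pool[16];
static int pool_used;

static struct child *attach_child(struct bds *parent, struct bds *child_bs)
{
    struct child *c = &pool[pool_used++];  /* toy allocator */
    c->bs = child_bs;
    (void)parent;
    return c;
}

static void set_backing_hd(struct bds *bs, struct bds *backing)
{
    if (bs->backing_hd) {
        bs->backing_child = NULL;          /* detach the old child */
    }
    bs->backing_hd = backing;
    if (!backing) {
        bs->backing_child = NULL;
        return;
    }
    bs->backing_child = attach_child(bs, backing);
}
```

The bug being fixed is visible in this model: if a caller attached the child itself and then swapped backing_hd directly, backing_child would still point at the old node.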
[Qemu-block] [PATCH 3/3] mirror: Speed up bitmap initial scanning
Limiting to sectors_per_chunk for each bdrv_is_allocated_above() call is slow, because the underlying protocol driver would issue many more queries than necessary. We should coalesce the queries.

Signed-off-by: Fam Zheng f...@redhat.com
---
 block/mirror.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index ca55578..e8cb592 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -371,7 +371,7 @@ static void coroutine_fn mirror_run(void *opaque)
     MirrorBlockJob *s = opaque;
     MirrorExitData *data;
     BlockDriverState *bs = s->common.bs;
-    int64_t sector_num, end, sectors_per_chunk, length;
+    int64_t sector_num, end, length;
     uint64_t last_pause_ns;
     BlockDriverInfo bdi;
     char backing_filename[2]; /* we only need 2 characters because we are only
@@ -425,7 +425,6 @@ static void coroutine_fn mirror_run(void *opaque)
         goto immediate_exit;
     }
 
-    sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
     mirror_free_init(s);
 
     last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -433,7 +432,9 @@ static void coroutine_fn mirror_run(void *opaque)
         /* First part, loop on the sectors and initialize the dirty bitmap. */
         BlockDriverState *base = s->base;
         for (sector_num = 0; sector_num < end; ) {
-            int64_t next = (sector_num | (sectors_per_chunk - 1)) + 1;
+            /* Just to make sure we are not exceeding int limit. */
+            int nb_sectors = MIN(INT_MAX >> BDRV_SECTOR_BITS,
+                                 end - sector_num);
             int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
 
             if (now - last_pause_ns > SLICE_TIME) {
@@ -444,9 +445,7 @@ static void coroutine_fn mirror_run(void *opaque)
             if (block_job_is_cancelled(&s->common)) {
                 goto immediate_exit;
             }
-
-            ret = bdrv_is_allocated_above(bs, base,
-                                          sector_num, next - sector_num, &n);
+            ret = bdrv_is_allocated_above(bs, base, sector_num, nb_sectors, &n);
 
             if (ret < 0) {
                 goto immediate_exit;
@@ -455,10 +454,8 @@ static void coroutine_fn mirror_run(void *opaque)
             assert(n > 0);
             if (ret == 1) {
                 bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n);
-                sector_num = next;
-            } else {
-                sector_num += n;
             }
+            sector_num += n;
         }
     }
-- 
2.4.3
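The clamp introduced above exists because the sector count is multiplied by the 512-byte sector size further down the stack, and that byte count must fit in an int. A standalone sketch of the same arithmetic (function name invented for illustration):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define BDRV_SECTOR_BITS 9   /* 512-byte sectors, as in QEMU */

/* Hypothetical helper mirroring the MIN(INT_MAX >> BDRV_SECTOR_BITS,
 * end - sector_num) clamp: the largest sector count whose byte size
 * still fits in an int. */
static int clamp_nb_sectors(int64_t sector_num, int64_t end)
{
    int64_t remaining = end - sector_num;
    int64_t max = INT_MAX >> BDRV_SECTOR_BITS;
    return remaining < max ? (int)remaining : (int)max;
}
```

With 512-byte sectors the cap works out to INT_MAX >> 9 = 4194303 sectors, i.e. just under 2 GiB per query, which is far larger than one chunk and is what makes the coalescing effective.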
[Qemu-block] [PATCH 2/3] mirror: Use block_job_relax_cpu during bitmap scanning
Sleeping for 0 seconds may not be as effective as we want; use block_job_relax_cpu.

Signed-off-by: Fam Zheng f...@redhat.com
---
 block/mirror.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 62db031..ca55578 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -438,7 +438,7 @@ static void coroutine_fn mirror_run(void *opaque)
 
             if (now - last_pause_ns > SLICE_TIME) {
                 last_pause_ns = now;
-                block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, 0);
+                block_job_relax_cpu(&s->common);
             }
 
             if (block_job_is_cancelled(&s->common)) {
-- 
2.4.3
Re: [Qemu-block] [PATCH COLO-BLOCK v8 09/18] Backup: clear all bitmap when doing block checkpoint
* Wen Congyang (we...@cn.fujitsu.com) wrote:

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: zhanghailiang zhang.zhanghaili...@huawei.com
Signed-off-by: Gonglei arei.gong...@huawei.com
Cc: Jeff Cody jc...@redhat.com
---
 block/backup.c           | 13 +++++++++++++
 blockjob.c               | 10 ++++++++++
 include/block/blockjob.h | 12 ++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index d3c7d9f..ebb8a88 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -211,11 +211,24 @@ static void backup_iostatus_reset(BlockJob *job)
     bdrv_iostatus_reset(s->target);
 }
 
+static void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+    BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+    if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+        error_setg(errp, "this feature or command is not currently supported");

You use this text in a few different errors in the block code; we've currently got one test machine which is producing it and we haven't yet figured out which one of the exit paths is doing it. Please make each one unique and state an identifier; e.g. "checkpoint on block backup in sync ..." "for blockjob ..." (I'm not sure what exactly makes sense for block jobs - but something like that).

Dave

+        return;
+    }
+
+    hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
     .instance_size  = sizeof(BackupBlockJob),
     .job_type       = BLOCK_JOB_TYPE_BACKUP,
     .set_speed      = backup_set_speed,
     .iostatus_reset = backup_iostatus_reset,
+    .do_checkpoint  = backup_do_checkpoint,
 };
 
 static BlockErrorAction backup_error_action(BackupBlockJob *job,
diff --git a/blockjob.c b/blockjob.c
index ec46fad..cb412d1 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -400,3 +400,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
 
     qemu_bh_schedule(data->bh);
 }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+    if (!job->driver->do_checkpoint) {
+        error_setg(errp, "this feature or command is not currently supported");
+        return;
+    }
+
+    job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 57d8ef1..b832dc3 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -50,6 +50,9 @@ typedef struct BlockJobDriver {
      * manually. */
     void (*complete)(BlockJob *job, Error **errp);
+
+    /** Optional callback for job types that support checkpoint. */
+    void (*do_checkpoint)(BlockJob *job, Error **errp);
 } BlockJobDriver;
 
 /**
@@ -348,4 +351,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
                                   BlockJobDeferToMainLoopFn *fn,
                                   void *opaque);
 
+/**
+ * block_job_do_checkpoint:
+ * @job: The job.
+ * @errp: Error object.
+ *
+ * Do block checkpoint on the specified job.
+ */
+void block_job_do_checkpoint(BlockJob *job, Error **errp);
+
 #endif
-- 
2.4.3

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
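The block_job_do_checkpoint() wrapper discussed above follows a common optional-callback dispatch pattern: the generic layer checks whether the driver implements the hook and fails with an error otherwise. A hypothetical, self-contained model (all types and names invented; error strings stand in for QEMU's Error objects):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of optional-callback dispatch: the generic layer
 * reports a distinct error when a driver does not implement the hook,
 * which is also Dave's review point about unique error messages. */
struct driver { void (*do_checkpoint)(const char **errp); };
struct job { const struct driver *drv; };

static void job_do_checkpoint(struct job *job, const char **errp)
{
    if (!job->drv->do_checkpoint) {
        *errp = "checkpoint is not supported by this job type";
        return;
    }
    job->drv->do_checkpoint(errp);
}

static void backup_checkpoint(const char **errp)
{
    *errp = NULL;  /* success: the dirty bitmap would be cleared here */
}
```

Making the unsupported-hook message distinct per call site, as requested in the review, is what lets a log line be traced back to the exact exit path.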
Re: [Qemu-block] [Qemu-devel] [PATCH COLO-BLOCK v7 00/17] Block replication for continuous checkpoints
* Wen Congyang (we...@cn.fujitsu.com) wrote:
On 07/08/2015 11:49 PM, Michael R. Hines wrote:
On 07/07/2015 08:38 PM, Wen Congyang wrote:
On 07/08/2015 12:56 AM, Michael R. Hines wrote:
On 07/07/2015 04:23 AM, Paolo Bonzini wrote:
On 07/07/2015 11:13, Dr. David Alan Gilbert wrote:

This log is very strange. The NBD client connects to the NBD server, and the NBD server wants to read data from the NBD client, but reading fails. It seems that the connection is closed unexpectedly. Can you give me more log and how do you use it?

That was the same failure I was getting. I think it's that the NBD server and client are in different modes, with one of them expecting the export. nbd_server_add always expects the export.

Paolo

OK, Wen, so your wiki finally does reflect this, but now we're back to the export not found error. Again, here's the exact command line:

1. First on the secondary VM:

qemu-system-x86_64 .snip... -drive if=none,driver=qcow2,file=foo.qcow2,id=mc1,cache=none,aio=native -drive if=virtio,driver=replication,mode=secondary,throttling.bps-total-max=7000,file.file.filename=active_disk.qcow2,file.driver=qcow2,file.backing.file.filename=hidden_disk.qcow2,file.backing.driver=qcow2,file.backing.allow-write-backing-file=on,file.backing.backing.backing_reference=mc1

2. Then the HMP commands:

nbd_server_start 0:6262
nbd_server_add -w mc1

3. Then the primary VM:

qemu-system-x86_64 .snip... -drive if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on

With the error: Server requires an export name

*but*, your wiki has no export name on the primary VM side, so I added the export name back which is on your old wiki:

qemu-system-x86_64 .snip... -drive if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.export=mc1,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on: Failed to read export length

Hmm, I think you use the v7 version. Is that right? In this version, the correct usage for the primary qemu is:

1. qemu-system-x86_64 .snip... -drive if=virtio,driver=quorum,read-pattern=fifo,id=disk1, children.0.file.filename=bar.qcow2,children.0.driver=qcow2,

Then run the hmp monitor command (it should be run after you run nbd_server_add in the secondary qemu):

child_add disk1 child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on

Thanks
Wen Congyang

And the server now says:

nbd.c:nbd_handle_export_name():L416: export not found
nbd.c:nbd_send_negotiate():L562: option negotiation failed

OK, I'm totally confused at this point. =) Maybe it would be easier to just wait for your next clean patchset + documentation which is all consistent with each other.

I have sent the v8. But the usage is not changed. You can set up the environment according to the wiki. When we open the nbd client, we need to connect to the nbd server. So I introduce a new command child_add to add the NBD client as a quorum child when the nbd server is ready. The nbd server is ready after you run the following commands:

nbd_server_start 0:6262  # the secondary qemu will listen to host:port
nbd_server_add -w mc1    # the NBD server will know this disk is used as NBD server. The export name is its id mc1.
                         # -w means we allow writes to this disk.

Then you can run the following command in the primary qemu:

child_add disk1 child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on

After this monitor command, the nbd client has connected to the nbd server.

Ah! The 'child.file.export=mc1' wasn't there previously; I see Yang added that to the wiki yesterday; that probably explains the problem that we've been having.

Dave

Thanks
Wen Congyang

- Michael

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
[Qemu-block] [PATCH 0/3] mirror: Fix guest responsiveness during bitmap scan
This supersedes: http://patchwork.ozlabs.org/patch/491415/ and [1], which is currently in Jeff's tree.

Although [1] fixed the QMP responsiveness, Alexandre DERUMIER reported that guest responsiveness still suffers when we are busy in the initial dirty bitmap scanning loop of the mirror job. That is because 1) we issue too many lseeks; 2) we only sleep for 0 ns, which turns out quite ineffective in yielding the BQL to vcpu threads. Both are fixed.

To reproduce: start a guest, attach a 10G raw image, then mirror it. Your guest will immediately start to stutter (with patch [1], testing on a local ext4 raw image, and "while echo -n .; do sleep 0.05; done" in the guest console).

This series adds block_job_relax_cpu as suggested by Stefan Hajnoczi and uses it in the mirror job; and lets bdrv_is_allocated_above return a larger p_num as suggested by Paolo Bonzini (although it's done without changing the API).

[1]: http://patchwork.ozlabs.org/patch/471656/
     block/mirror: Sleep periodically during bitmap scanning

Fam Zheng (3):
  blockjob: Introduce block_job_relax_cpu
  mirror: Use block_job_relax_cpu during bitmap scanning
  mirror: Speed up bitmap initial scanning

 block/mirror.c           | 17 +++++++----------
 include/block/blockjob.h | 16 ++++++++++++++++
 2 files changed, 23 insertions(+), 10 deletions(-)

-- 
2.4.3
Re: [Qemu-block] [Qemu-devel] [PATCH COLO-BLOCK v7 00/17] Block replication for continuous checkpoints
On 07/09/2015 09:55 AM, Dr. David Alan Gilbert wrote:
* Wen Congyang (we...@cn.fujitsu.com) wrote:
On 07/08/2015 11:49 PM, Michael R. Hines wrote:
On 07/07/2015 08:38 PM, Wen Congyang wrote:
On 07/08/2015 12:56 AM, Michael R. Hines wrote:
On 07/07/2015 04:23 AM, Paolo Bonzini wrote:
On 07/07/2015 11:13, Dr. David Alan Gilbert wrote:

This log is very strange. The NBD client connects to the NBD server, and the NBD server wants to read data from the NBD client, but reading fails. It seems that the connection is closed unexpectedly. Can you give me more log and how do you use it?

That was the same failure I was getting. I think it's that the NBD server and client are in different modes, with one of them expecting the export. nbd_server_add always expects the export.

Paolo

OK, Wen, so your wiki finally does reflect this, but now we're back to the export not found error. Again, here's the exact command line:

1. First on the secondary VM:

qemu-system-x86_64 .snip... -drive if=none,driver=qcow2,file=foo.qcow2,id=mc1,cache=none,aio=native -drive if=virtio,driver=replication,mode=secondary,throttling.bps-total-max=7000,file.file.filename=active_disk.qcow2,file.driver=qcow2,file.backing.file.filename=hidden_disk.qcow2,file.backing.driver=qcow2,file.backing.allow-write-backing-file=on,file.backing.backing.backing_reference=mc1

2. Then the HMP commands:

nbd_server_start 0:6262
nbd_server_add -w mc1

3. Then the primary VM:

qemu-system-x86_64 .snip... -drive if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on

With the error: Server requires an export name

*but*, your wiki has no export name on the primary VM side, so I added the export name back which is on your old wiki:

qemu-system-x86_64 .snip... -drive if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.export=mc1,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on: Failed to read export length

Hmm, I think you use the v7 version. Is that right? In this version, the correct usage for the primary qemu is:

1. qemu-system-x86_64 .snip... -drive if=virtio,driver=quorum,read-pattern=fifo,id=disk1, children.0.file.filename=bar.qcow2,children.0.driver=qcow2,

Then run the hmp monitor command (it should be run after you run nbd_server_add in the secondary qemu):

child_add disk1 child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on

Thanks
Wen Congyang

And the server now says:

nbd.c:nbd_handle_export_name():L416: export not found
nbd.c:nbd_send_negotiate():L562: option negotiation failed

OK, I'm totally confused at this point. =) Maybe it would be easier to just wait for your next clean patchset + documentation which is all consistent with each other.

I have sent the v8. But the usage is not changed. You can set up the environment according to the wiki. When we open the nbd client, we need to connect to the nbd server. So I introduce a new command child_add to add the NBD client as a quorum child when the nbd server is ready. The nbd server is ready after you run the following commands:

nbd_server_start 0:6262  # the secondary qemu will listen to host:port
nbd_server_add -w mc1    # the NBD server will know this disk is used as NBD server. The export name is its id mc1.
                         # -w means we allow writes to this disk.

Then you can run the following command in the primary qemu:

child_add disk1 child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on

After this monitor command, the nbd client has connected to the nbd server.

Ah! The 'child.file.export=mc1' wasn't there previously; I see Yang added that to the wiki yesterday; that probably explains the problem that we've been having.

Sorry for this mistake.

Thanks
Wen Congyang

Dave

Thanks
Wen Congyang

- Michael

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] [Qemu-devel] [PATCH COLO-BLOCK v7 00/17] Block replication for continuous checkpoints
On 07/08/2015 11:49 PM, Michael R. Hines wrote:
On 07/07/2015 08:38 PM, Wen Congyang wrote:
On 07/08/2015 12:56 AM, Michael R. Hines wrote:
On 07/07/2015 04:23 AM, Paolo Bonzini wrote:
On 07/07/2015 11:13, Dr. David Alan Gilbert wrote:

This log is very strange. The NBD client connects to the NBD server, and the NBD server wants to read data from the NBD client, but reading fails. It seems that the connection is closed unexpectedly. Can you give me more log and how do you use it?

That was the same failure I was getting. I think it's that the NBD server and client are in different modes, with one of them expecting the export. nbd_server_add always expects the export.

Paolo

OK, Wen, so your wiki finally does reflect this, but now we're back to the export not found error. Again, here's the exact command line:

1. First on the secondary VM:

qemu-system-x86_64 .snip... -drive if=none,driver=qcow2,file=foo.qcow2,id=mc1,cache=none,aio=native -drive if=virtio,driver=replication,mode=secondary,throttling.bps-total-max=7000,file.file.filename=active_disk.qcow2,file.driver=qcow2,file.backing.file.filename=hidden_disk.qcow2,file.backing.driver=qcow2,file.backing.allow-write-backing-file=on,file.backing.backing.backing_reference=mc1

2. Then the HMP commands:

nbd_server_start 0:6262
nbd_server_add -w mc1

3. Then the primary VM:

qemu-system-x86_64 .snip... -drive if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on

With the error: Server requires an export name

*but*, your wiki has no export name on the primary VM side, so I added the export name back which is on your old wiki:

qemu-system-x86_64 .snip... -drive if=virtio,driver=quorum,read-pattern=fifo,no-connect=on,children.0.file.filename=bar.qcow2,children.0.driver=qcow2,children.1.file.driver=nbd,children.1.file.export=mc1,children.1.file.host=127.0.0.1,children.1.file.port=6262,children.1.driver=replication,children.1.mode=primary,children.1.ignore-errors=on: Failed to read export length

Hmm, I think you use the v7 version. Is that right? In this version, the correct usage for the primary qemu is:

1. qemu-system-x86_64 .snip... -drive if=virtio,driver=quorum,read-pattern=fifo,id=disk1, children.0.file.filename=bar.qcow2,children.0.driver=qcow2,

Then run the hmp monitor command (it should be run after you run nbd_server_add in the secondary qemu):

child_add disk1 child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on

Thanks
Wen Congyang

And the server now says:

nbd.c:nbd_handle_export_name():L416: export not found
nbd.c:nbd_send_negotiate():L562: option negotiation failed

OK, I'm totally confused at this point. =) Maybe it would be easier to just wait for your next clean patchset + documentation which is all consistent with each other.

I have sent the v8. But the usage is not changed. You can set up the environment according to the wiki. When we open the nbd client, we need to connect to the nbd server. So I introduce a new command child_add to add the NBD client as a quorum child when the nbd server is ready. The nbd server is ready after you run the following commands:

nbd_server_start 0:6262  # the secondary qemu will listen to host:port
nbd_server_add -w mc1    # the NBD server will know this disk is used as NBD server. The export name is its id mc1.
                         # -w means we allow writes to this disk.

Then you can run the following command in the primary qemu:

child_add disk1 child.driver=replication,child.mode=primary,child.file.host=127.0.0.1,child.file.port=6262,child.file.export=mc1,child.file.driver=nbd,child.ignore-errors=on

After this monitor command, the nbd client has connected to the nbd server.

Thanks
Wen Congyang

- Michael
Re: [Qemu-block] [PATCH COLO-BLOCK v8 09/18] Backup: clear all bitmap when doing block checkpoint
On 07/09/2015 09:28 AM, Dr. David Alan Gilbert wrote:
* Wen Congyang (we...@cn.fujitsu.com) wrote:

Signed-off-by: Wen Congyang we...@cn.fujitsu.com
Signed-off-by: zhanghailiang zhang.zhanghaili...@huawei.com
Signed-off-by: Gonglei arei.gong...@huawei.com
Cc: Jeff Cody jc...@redhat.com
---
 block/backup.c           | 13 +++++++++++++
 blockjob.c               | 10 ++++++++++
 include/block/blockjob.h | 12 ++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index d3c7d9f..ebb8a88 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -211,11 +211,24 @@ static void backup_iostatus_reset(BlockJob *job)
     bdrv_iostatus_reset(s->target);
 }
 
+static void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+    BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+    if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+        error_setg(errp, "this feature or command is not currently supported");

You use this text in a few different errors in the block code; we've currently got one test machine which is producing it and we haven't yet figured out which one of the exit paths is doing it. Please make each one unique and state an identifier; e.g. "checkpoint on block backup in sync ..." "for blockjob ..." (I'm not sure what exactly makes sense for block jobs - but something like that).

OK, I will check all error messages.

Thanks
Wen Congyang

Dave

+        return;
+    }
+
+    hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
     .instance_size  = sizeof(BackupBlockJob),
     .job_type       = BLOCK_JOB_TYPE_BACKUP,
     .set_speed      = backup_set_speed,
     .iostatus_reset = backup_iostatus_reset,
+    .do_checkpoint  = backup_do_checkpoint,
 };
 
 static BlockErrorAction backup_error_action(BackupBlockJob *job,
diff --git a/blockjob.c b/blockjob.c
index ec46fad..cb412d1 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -400,3 +400,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
 
     qemu_bh_schedule(data->bh);
 }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+    if (!job->driver->do_checkpoint) {
+        error_setg(errp, "this feature or command is not currently supported");
+        return;
+    }
+
+    job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index 57d8ef1..b832dc3 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -50,6 +50,9 @@ typedef struct BlockJobDriver {
      * manually. */
     void (*complete)(BlockJob *job, Error **errp);
+
+    /** Optional callback for job types that support checkpoint. */
+    void (*do_checkpoint)(BlockJob *job, Error **errp);
 } BlockJobDriver;
 
 /**
@@ -348,4 +351,13 @@ void block_job_defer_to_main_loop(BlockJob *job,
                                   BlockJobDeferToMainLoopFn *fn,
                                   void *opaque);
 
+/**
+ * block_job_do_checkpoint:
+ * @job: The job.
+ * @errp: Error object.
+ *
+ * Do block checkpoint on the specified job.
+ */
+void block_job_do_checkpoint(BlockJob *job, Error **errp);
+
 #endif
-- 
2.4.3

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
[Qemu-block] [PATCH for-2.4 0/5] block: Fix backing file child when modifying graph
This series is extracted from my work towards removing bdrv_swap(), which is targeted for 2.5. It contains a fix for dangling pointers when modifying the BDS graph and its dependencies.

I didn't bother to split patches of which only a part is required, nor did I remove references to the future bdrv_swap removal (after all, it will happen, even if the patches will be delayed by the 2.4 freeze). Specifically, bdrv_open/unref_child() are yet unused in this series; the respective patches are included because of bdrv_attach/detach_child().

Kevin Wolf (5):
  block: Move bdrv_attach_child() calls up the call chain
  block: Introduce bdrv_open_child()
  block: Introduce bdrv_unref_child()
  block: Reorder cleanups in bdrv_close()
  block: Fix backing file child when modifying graph

 block.c                   | 144 ++++++++++++++++++++++++++++++++++-----------
 include/block/block.h     |   7 +++++
 include/block/block_int.h |   1 +
 3 files changed, 108 insertions(+), 44 deletions(-)

-- 
1.8.3.1
[Qemu-block] [PATCH for-2.4 2/5] block: Introduce bdrv_open_child()
It is the same as bdrv_open_image(), except that it doesn't only return success or failure, but the newly created BdrvChild object for the new child node.

As the BdrvChild object already contains a BlockDriverState pointer (and this is supposed to become the only pointer so that bdrv_append() and friends can just change a single pointer in BdrvChild), the pbs parameter is removed for bdrv_open_child().

Signed-off-by: Kevin Wolf kw...@redhat.com
---
 block.c               | 71 ++++++++++++++++++++++++++++++++++-----------------
 include/block/block.h |  6 ++++++
 2 files changed, 58 insertions(+), 19 deletions(-)

diff --git a/block.c b/block.c
index 0398bff..029feeb 100644
--- a/block.c
+++ b/block.c
@@ -1102,9 +1102,9 @@ static int bdrv_fill_options(QDict **options, const char **pfilename,
     return 0;
 }
 
-static void bdrv_attach_child(BlockDriverState *parent_bs,
-                              BlockDriverState *child_bs,
-                              const BdrvChildRole *child_role)
+static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+                                    BlockDriverState *child_bs,
+                                    const BdrvChildRole *child_role)
 {
     BdrvChild *child = g_new(BdrvChild, 1);
     *child = (BdrvChild) {
@@ -1113,6 +1113,8 @@ static void bdrv_attach_child(BlockDriverState *parent_bs,
     };
 
     QLIST_INSERT_HEAD(&parent_bs->children, child, next);
+
+    return child;
 }
 
 void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
@@ -1229,7 +1231,7 @@ free_exit:
  * device's options.
  *
  * If allow_none is true, no image will be opened if filename is false and no
- * BlockdevRef is given. *pbs will remain unchanged and 0 will be returned.
+ * BlockdevRef is given. NULL will be returned, but errp remains unset.
  *
  * bdrev_key specifies the key for the image's BlockdevRef in the options QDict.
  * That QDict has to be flattened; therefore, if the BlockdevRef is a QDict
@@ -1237,21 +1239,20 @@
  * BlockdevRef.
  *
  * The BlockdevRef will be removed from the options QDict.
- *
- * To conform with the behavior of bdrv_open(), *pbs has to be NULL.
  */
-int bdrv_open_image(BlockDriverState **pbs, const char *filename,
-                    QDict *options, const char *bdref_key,
-                    BlockDriverState* parent, const BdrvChildRole *child_role,
-                    bool allow_none, Error **errp)
+BdrvChild *bdrv_open_child(const char *filename,
+                           QDict *options, const char *bdref_key,
+                           BlockDriverState* parent,
+                           const BdrvChildRole *child_role,
+                           bool allow_none, Error **errp)
 {
+    BdrvChild *c = NULL;
+    BlockDriverState *bs;
     QDict *image_options;
     int ret;
     char *bdref_key_dot;
     const char *reference;
 
-    assert(pbs);
-    assert(*pbs == NULL);
     assert(child_role != NULL);
 
     bdref_key_dot = g_strdup_printf("%s.", bdref_key);
@@ -1260,28 +1261,60 @@
     reference = qdict_get_try_str(options, bdref_key);
     if (!filename && !reference && !qdict_size(image_options)) {
-        if (allow_none) {
-            ret = 0;
-        } else {
+        if (!allow_none) {
             error_setg(errp, "A block device must be specified for \"%s\"",
                        bdref_key);
-            ret = -EINVAL;
         }
         QDECREF(image_options);
         goto done;
     }
 
-    ret = bdrv_open_inherit(pbs, filename, reference, image_options, 0,
+    bs = NULL;
+    ret = bdrv_open_inherit(&bs, filename, reference, image_options, 0,
                             parent, child_role, NULL, errp);
     if (ret < 0) {
         goto done;
     }
 
-    bdrv_attach_child(parent, *pbs, child_role);
+    c = bdrv_attach_child(parent, bs, child_role);
 
 done:
     qdict_del(options, bdref_key);
-    return ret;
+    return c;
+}
+
+/*
+ * This is a version of bdrv_open_child() that returns 0/-EINVAL instead of
+ * a BdrvChild object.
+ *
+ * If allow_none is true, no image will be opened if filename is false and no
+ * BlockdevRef is given. *pbs will remain unchanged and 0 will be returned.
+ *
+ * To conform with the behavior of bdrv_open(), *pbs has to be NULL.
+ */
+int bdrv_open_image(BlockDriverState **pbs, const char *filename,
+                    QDict *options, const char *bdref_key,
+                    BlockDriverState* parent, const BdrvChildRole *child_role,
+                    bool allow_none, Error **errp)
+{
+    Error *local_err = NULL;
+    BdrvChild *c;
+
+    assert(pbs);
+    assert(*pbs == NULL);
+
+    c = bdrv_open_child(filename, options, bdref_key, parent, child_role,
+                        allow_none, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return -EINVAL;
+    }
+
+    if (c != NULL) {
+        *pbs = c->bs;
+    }
+
+    return 0;
+}
 
 int