[Qemu-block] [PATCH for-2.8] block: Let write zeroes fallback work even with small max_transfer
Commit 443668ca rewrote the write_zeroes logic to guarantee that an
unaligned request never crosses a cluster boundary. But in the rewrite,
the new code assumed that at most one iteration would be needed to get
to an alignment boundary.

However, it is easy to trigger an assertion failure: the Linux kernel
limits loopback devices to advertise a max_transfer of only 64k. Any
operation that requires falling back to writes rather than more
efficient zeroing must obey max_transfer during that fallback, which
means an unaligned head may require multiple iterations of the write
fallbacks before reaching the aligned boundaries, when layering a
format with clusters larger than 64k atop the protocol of file access
to a loopback device.

Test case:

$ qemu-img create -f qcow2 -o cluster_size=1M file 10M
$ losetup /dev/loop2 /path/to/file
$ qemu-io -f qcow2 /dev/loop2
qemu-io> w 7m 1k
qemu-io> w -z 8003584 2093056

In fairness to Denis (as the original listed author of the culprit
commit), the faulty logic for at most one iteration is probably all my
fault in reworking his idea. But the solution is to restore what was in
place prior to that commit: when dealing with an unaligned head or
tail, iterate as many times as necessary while fragmenting the
operation at max_transfer boundaries.

CC: qemu-sta...@nongnu.org
CC: Ed Swierk
CC: Denis V. Lunev
Signed-off-by: Eric Blake
---
 block/io.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index aa532a5..085ac34 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1214,6 +1214,8 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
     int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
                         bs->bl.request_alignment);
+    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);
 
     assert(alignment % bs->bl.request_alignment == 0);
     head = offset % alignment;
@@ -1229,9 +1231,12 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
      * boundaries.
      */
     if (head) {
-        /* Make a small request up to the first aligned sector. */
-        num = MIN(count, alignment - head);
-        head = 0;
+        /* Make a small request up to the first aligned sector. For
+         * convenience, limit this request to max_transfer even if
+         * we don't need to fall back to writes. */
+        num = MIN(MIN(count, max_transfer), alignment - head);
+        head = (head + num) % alignment;
+        assert(num < max_write_zeroes);
     } else if (tail && num > alignment) {
         /* Shorten the request to the last aligned sector. */
         num -= tail;
@@ -1257,8 +1262,6 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 
     if (ret == -ENOTSUP) {
         /* Fall back to bounce buffer if write zeroes is unsupported */
-        int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
-                                        MAX_WRITE_ZEROES_BOUNCE_BUFFER);
         BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
 
         if ((flags & BDRV_REQ_FUA) &&
-- 
2.7.4
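To illustrate, a minimal standalone sketch (not the QEMU code; names
are simplified and tail handling is omitted) of how an unaligned head
now shrinks across several max_transfer-sized writes before reaching
an alignment boundary, using the numbers from the test case above:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Fragmentation loop: the unaligned head may need several
     * iterations when alignment > max_transfer. */
    static void fragment_sketch(int64_t offset, int64_t count,
                                int64_t alignment, int64_t max_transfer)
    {
        int64_t head = offset % alignment;

        while (count > 0) {
            int64_t num = count;

            if (head) {
                /* Small request toward the next alignment boundary,
                 * capped at max_transfer. */
                num = MIN(MIN(count, max_transfer), alignment - head);
                head = (head + num) % alignment;
            }
            printf("write %" PRId64 " bytes at %" PRId64 "\n", num, offset);
            offset += num;
            count -= num;
        }
    }

    int main(void)
    {
        /* 1M clusters atop a loop device advertising 64k max_transfer:
         * the zero request from the test case needs six head writes
         * (5 x 64k plus one 56k) before the first aligned boundary. */
        fragment_sketch(8003584, 2093056, 1024 * 1024, 64 * 1024);
        return 0;
    }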
Re: [Qemu-block] [Qemu-devel] [PATCH v3 5/6] blockjob: refactor backup_start as backup_job_create
On Tue, Nov 08, 2016 at 10:24:50AM -0500, John Snow wrote:
> On 11/08/2016 04:11 AM, Kevin Wolf wrote:
> > Am 08.11.2016 um 06:41 hat John Snow geschrieben:
> > > On 11/03/2016 09:17 AM, Kevin Wolf wrote:
> > > > Am 02.11.2016 um 18:50 hat John Snow geschrieben:
> > > > > Refactor backup_start as backup_job_create, which only creates
> > > > > the job, but does not automatically start it. The old interface,
> > > > > 'backup_start', is not kept in favor of limiting the number of
> > > > > nearly-identical interfaces that would have to be edited to keep
> > > > > up with QAPI changes in the future.
> > > > >
> > > > > Callers that wish to synchronously start the backup_block_job can
> > > > > instead just call block_job_start immediately after calling
> > > > > backup_job_create.
> > > > >
> > > > > Transactions are updated to use the new interface, calling
> > > > > block_job_start only during the .commit phase, which helps
> > > > > prevent race conditions where jobs may finish before we even
> > > > > finish building the transaction. This may happen, for instance,
> > > > > during empty block backup jobs.
> > > > >
> > > > > Reported-by: Vladimir Sementsov-Ogievskiy
> > > > > Signed-off-by: John Snow
> > > >
> > > > > +static void drive_backup_commit(BlkActionState *common)
> > > > > +{
> > > > > +    DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
> > > > > +    if (state->job) {
> > > > > +        block_job_start(state->job);
> > > > > +    }
> > > > >  }
> > > >
> > > > How could state->job ever be NULL?
> > >
> > > Mechanical thinking. It can't. (I definitely didn't copy paste from
> > > the .abort routines. Definitely.)
> > >
> > > > Same question for abort, and for blockdev_backup_commit/abort.
> > >
> > > Abort ... we may not have created the job successfully. Abort gets
> > > called whether or not we made it to or through the matching
> > > .prepare.
> >
> > Ah, yes, I always forget about this. It's so counterintuitive (and
> > bdrv_reopen() actually works differently, it only aborts entries that
> > have successfully been prepared).
> >
> > Is there a good reason why qmp_transaction() works this way, especially
> > since we have a separate .clean function?
> >
> > Kevin
>
> We just don't track which actions have succeeded or not, so we loop
> through all actions on each phase regardless.
>
> I could add a little state enumeration (or boolean) to each action and
> I could adjust abort to only run on actions that either completed or
> failed, but in this case I think it still wouldn't change the text for
> .abort, because an action may fail before it got to creating the job,
> for instance.

As far as this part goes, couldn't we just do it without any flags, by
not inserting the state into the snap_bdrv_states list unless it was
successful (assuming _prepare cleans up itself on failure)?  E.g.:

-        QSIMPLEQ_INSERT_TAIL(&snap_bdrv_states, state, entry);
         state->ops->prepare(state, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
+            g_free(state);
             goto delete_and_fail;
         }
+        QSIMPLEQ_INSERT_TAIL(&snap_bdrv_states, state, entry);
     }

> Unless you'd propose undoing .prepare IN .prepare in failure cases, but
> why write abort code twice? I don't mind it living in .abort,
> personally.

Doing it the above way would indeed require prepare functions to clean
up after themselves on failure.  The bdrv_reopen() model does it this
way, and I think it makes sense.  With most APIs, on failure you
wouldn't have a way of knowing what has or has not been done, so it
leaves everything in a clean state.  I think this is a good model to
follow.  It is also what most QEMU block interfaces currently do, iirc
(.bdrv_open, etc.) - if it fails, it is assumed that it frees all
resources it allocated.
I guess it doesn't have to be done this way, and the complexity can
just be pushed into the _abort() function.  After all, with these
transactional models, there exists an abort function, which
differentiates it from most other APIs.

But the downfall is that we have different ways of handling essentially
the same sort of transactional model in the block layer (between
bdrv_reopen and qmp_transaction), and it trips up reviewers / authors.

(I don't think changing how qmp_transaction handles this is something
that needs to be handled in this series - but it would be nice in the
future sometime.)

Jeff
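For what it's worth, a minimal sketch of the "queue only on success"
model discussed above (all type and function names here are invented
stand-ins, not the real BlkActionState machinery): an action is queued
for commit/abort only once its .prepare has succeeded, so .abort never
runs for an action that failed to prepare.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Action Action;
    struct Action {
        bool (*prepare)(Action *a);  /* must clean up after itself on failure */
        void (*commit)(Action *a);
        void (*abort)(Action *a);
        Action *next;                /* chain of successfully prepared actions */
    };

    /* Run all prepares; on the first failure, abort only the actions
     * that were queued, i.e. those whose prepare succeeded (LIFO).
     * Commit order is simplified here; real code would keep FIFO. */
    static bool run_transaction(Action **actions, size_t n)
    {
        Action *prepared = NULL;

        for (size_t i = 0; i < n; i++) {
            if (!actions[i]->prepare(actions[i])) {
                for (Action *a = prepared; a != NULL; a = a->next) {
                    a->abort(a);
                }
                return false;
            }
            actions[i]->next = prepared;  /* queue only on success */
            prepared = actions[i];
        }

        for (Action *a = prepared; a != NULL; a = a->next) {
            a->commit(a);
        }
        return true;
    }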
Re: [Qemu-block] [Qemu-devel] BdrvDirtyBitmap and bdrv_snapshot_goto
08.11.2016 15:18, Kevin Wolf wrote:
> Am 08.11.2016 um 12:08 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > 08.11.2016 14:05, Kevin Wolf wrote:
> > > Am 07.11.2016 um 17:10 hat Max Reitz geschrieben:
> > > > On 07.11.2016 16:24, Vladimir Sementsov-Ogievskiy wrote:
> > > > > Hi all!
> > > > >
> > > > > As I can see, in bdrv_snapshot_goto, existing dirty bitmaps are
> > > > > not handled.. Is it ok? Should not they be filled with ones or
> > > > > something like this?
> > > >
> > > > Filling them with ones makes sense to me. I guess nobody noticed
> > > > because nobody was crazy enough to use block jobs alongside
> > > > loadvm...
> > >
> > > What's the use case in which ones make sense?
> > >
> > > It rather seems to me that an active dirty bitmap and snapshot
> > > switching should exclude each other because the bitmap becomes
> > > meaningless by the switch. And chances are that after switching a
> > > snapshot you don't want to "incrementally" backup everything, but
> > > that you should access a different backup.
> >
> > In other words, dirty bitmaps should be deleted on snapshot switch?
> > All? Or only named?
>
> As Max said, we should probably integrate bitmaps with snapshots. After
> reloading the old state, the bitmap becomes valid again, so throwing it
> away in the active state seems only right if we included it in the
> snapshot and can bring it back.

If we choose this way, it should firstly be done for BdrvDirtyBitmaps
without any persistence. And it is not as simple as just dropping dirty
bitmaps or filling them with ones.

The current behavior is definitely wrong: if the user creates an
incremental backup after a snapshot switch, this incremental backup
will be incorrect. I think it should be fixed now in the simpler way
(actually this fix means "for now, incremental backup is incompatible
with snapshot switch"), and in the future, if we really need this, make
them work together.

Also, I think that filling with ones is safer and more native. It
really describes what happens (with some overhead of dirty bits).

A simple improvement: instead of filling with ones,

    new_dirty_bitmap_state = old_dirty_bitmap_state
                           | old_allocated_mask
                           | new_allocated_mask,

where an allocated mask is a bitmap with the same granularity, showing
which ranges are allocated in the image.

> Kevin

-- 
Best regards,
Vladimir
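A standalone sketch of that improvement (names invented for
illustration; word-granularity bitmaps assumed): on snapshot switch,
instead of setting every bit, mark dirty only what was already dirty
plus anything allocated in either the old or the new image state.

    #include <stddef.h>
    #include <stdint.h>

    static void merge_dirty_on_snapshot_switch(uint64_t *dirty,
                                               const uint64_t *old_allocated,
                                               const uint64_t *new_allocated,
                                               size_t nwords)
    {
        for (size_t i = 0; i < nwords; i++) {
            /* Keep old dirty bits; add every cluster allocated in
             * either the old or the new state. */
            dirty[i] |= old_allocated[i] | new_allocated[i];
        }
    }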
Re: [Qemu-block] [PATCH] Added iopmem device emulation
Hey,

On 08/11/16 08:58 AM, Stefan Hajnoczi wrote:
> My concern with the current implementation is that a PCI MMIO access
> invokes a synchronous blk_*() call.  That can pause vcpu execution
> while I/O is happening and therefore leads to unresponsive guests.
> QEMU's monitor interface is also blocked during blk_*() making it
> impossible to troubleshoot QEMU if it gets stuck due to a slow/hung
> I/O operation.
>
> Device models need to use blk_aio_*() so that control is returned
> while I/O is running.  There are a few legacy devices left that use
> synchronous I/O but new devices should not use this approach.

That's fair. I wasn't aware of this and I must have copied a legacy
device. We can certainly make the change in our patch.

> Regarding the hardware design, I think the PCI BAR approach to nvdimm
> is inefficient for virtualization because each memory load/store
> requires a guest<->host transition (vmexit + vmenter).  A DMA approach
> (i.e. message passing or descriptor rings) is more efficient because
> it requires fewer vmexits.
>
> On real hardware the performance characteristics are different so it
> depends what your target market is.

The performance of the virtual device is completely unimportant. This
isn't something I'd expect anyone to use except to test drivers. On
real hardware, with real applications, DMA would almost certainly be
used -- but it would be the DMA engine in another device. eg. an IB NIC
would DMA from the PCI BAR of the iopmem device. This completely
bypasses the CPU so there would be no load/stores to be concerned
about.

Thanks,

Logan
Re: [Qemu-block] [PATCH 4/4] block: Cater to iscsi with non-power-of-2 discard
On 11/08/2016 05:03 AM, Peter Lieven wrote:
> Am 25.10.2016 um 18:12 schrieb Eric Blake:
> > On 10/25/2016 09:36 AM, Paolo Bonzini wrote:
> > > On 25/10/2016 16:35, Eric Blake wrote:
> > > > So your argument is that we should always pass down every
> > > > unaligned less-than-optimum discard request all the way to the
> > > > hardware, rather than dropping it higher in the stack, even though
> > > > discard requests are already advisory, in order to leave the
> > > > hardware as the ultimate decision on whether to ignore the
> > > > unaligned request?
> > > Yes, I agree with Peter as to this.
> > Okay, I'll work on patches. I think it counts as bug fix, so
> > appropriate even if I miss soft freeze (I'd still like to get NBD
> > write zero support into 2.8, since it already missed 2.7, but that
> > one is still awaiting review with not much time left).
>
> Hi Eric,
>
> have you had time to look at this?
> If you need help, let me know.

Still on my list. I'm not forgetting it, and it does count as a bug fix
so it is safe for inclusion, although I'm trying to get it in before
this week is out.

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Re: [Qemu-block] [PATCH for-2.8] hbitmap: Fix the serialization granularity's type
On Mon, Nov 07, 2016 at 05:39:21PM +0100, Max Reitz wrote:
> This function returns a uint64_t, so it should not truncate its result
> by performing a plain int calculation.
>
> Signed-off-by: Max Reitz
> ---
>  util/hbitmap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/util/hbitmap.c b/util/hbitmap.c
> index 5d1a21c..c57be76 100644
> --- a/util/hbitmap.c
> +++ b/util/hbitmap.c
> @@ -401,7 +401,7 @@ uint64_t hbitmap_serialization_granularity(const HBitmap *hb)
>  {
>      /* Require at least 64 bit granularity to be safe on both 64 bit and 32 bit
>       * hosts. */
> -    return 64 << hb->granularity;
> +    return UINT64_C(64) << hb->granularity;
>  }

Another instance that should be fixed:

    uint64_t start = QEMU_ALIGN_UP(num_elements, 1 << hb->granularity);
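The truncation is easy to demonstrate standalone (a sketch, not QEMU
code): with a plain int constant, the shift is performed in 32-bit
arithmetic and overflows before the result is widened to uint64_t.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned granularity = 26;  /* 64 << 26 needs more than 32 bits */

        /* 64 is an int, so this shift overflows in 32-bit arithmetic
         * (undefined behavior for a signed int; typically yields 0)
         * before the conversion to uint64_t happens. */
        uint64_t bad = (uint64_t)(64 << granularity);

        /* Forcing a 64-bit operand keeps the whole shift in uint64_t. */
        uint64_t good = UINT64_C(64) << granularity;

        printf("bad=%" PRIu64 " good=%" PRIu64 "\n", bad, good);
        return 0;
    }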
Re: [Qemu-block] [PATCH] Added iopmem device emulation
On Mon, Nov 07, 2016 at 10:09:29AM -0700, Logan Gunthorpe wrote:
> On 07/11/16 03:28 AM, Stefan Hajnoczi wrote:
> > It may be too early to merge this code into qemu.git if there is no
> > hardware spec and this is a prototype device that is subject to
> > change.
>
> Fair enough, though the interface is so simple I don't know what could
> possibly change.

My concern with the current implementation is that a PCI MMIO access
invokes a synchronous blk_*() call.  That can pause vcpu execution
while I/O is happening and therefore leads to unresponsive guests.
QEMU's monitor interface is also blocked during blk_*() making it
impossible to troubleshoot QEMU if it gets stuck due to a slow/hung I/O
operation.

Device models need to use blk_aio_*() so that control is returned while
I/O is running.  There are a few legacy devices left that use
synchronous I/O but new devices should not use this approach.

Regarding the hardware design, I think the PCI BAR approach to nvdimm
is inefficient for virtualization because each memory load/store
requires a guest<->host transition (vmexit + vmenter).  A DMA approach
(i.e. message passing or descriptor rings) is more efficient because it
requires fewer vmexits.

On real hardware the performance characteristics are different so it
depends what your target market is.

> > I'm wondering if there is a way to test or use this device if you are
> > not releasing specs and code that drives the device.
> >
> > Have you submitted patches to enable this device in Linux, DPDK, or
> > any other project?
>
> Yes, you can find patches to the Linux Kernel that were submitted to a
> couple mailing lists at the same time as the QEMU patch:
>
> http://www.mail-archive.com/linux-nvdimm@lists.01.org/msg01426.html
>
> There's been a discussion as to how best to expose these devices to
> user space and we may take a different approach in v2. But there has
> been no indication that the PCI interface would need to change at all.

Thanks, I'll check out the discussion!

Stefan
Re: [Qemu-block] [Qemu-devel] [PATCH v3 5/6] blockjob: refactor backup_start as backup_job_create
On 11/08/2016 04:11 AM, Kevin Wolf wrote:
> Am 08.11.2016 um 06:41 hat John Snow geschrieben:
> > On 11/03/2016 09:17 AM, Kevin Wolf wrote:
> > > Am 02.11.2016 um 18:50 hat John Snow geschrieben:
> > > > Refactor backup_start as backup_job_create, which only creates
> > > > the job, but does not automatically start it. The old interface,
> > > > 'backup_start', is not kept in favor of limiting the number of
> > > > nearly-identical interfaces that would have to be edited to keep
> > > > up with QAPI changes in the future.
> > > >
> > > > Callers that wish to synchronously start the backup_block_job can
> > > > instead just call block_job_start immediately after calling
> > > > backup_job_create.
> > > >
> > > > Transactions are updated to use the new interface, calling
> > > > block_job_start only during the .commit phase, which helps prevent
> > > > race conditions where jobs may finish before we even finish
> > > > building the transaction. This may happen, for instance, during
> > > > empty block backup jobs.
> > > >
> > > > Reported-by: Vladimir Sementsov-Ogievskiy
> > > > Signed-off-by: John Snow
> > >
> > > > +static void drive_backup_commit(BlkActionState *common)
> > > > +{
> > > > +    DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
> > > > +    if (state->job) {
> > > > +        block_job_start(state->job);
> > > > +    }
> > > >  }
> > >
> > > How could state->job ever be NULL?
> >
> > Mechanical thinking. It can't. (I definitely didn't copy paste from
> > the .abort routines. Definitely.)
> >
> > > Same question for abort, and for blockdev_backup_commit/abort.
> >
> > Abort ... we may not have created the job successfully. Abort gets
> > called whether or not we made it to or through the matching .prepare.
>
> Ah, yes, I always forget about this. It's so counterintuitive (and
> bdrv_reopen() actually works differently, it only aborts entries that
> have successfully been prepared).
>
> Is there a good reason why qmp_transaction() works this way, especially
> since we have a separate .clean function?
>
> Kevin

We just don't track which actions have succeeded or not, so we loop
through all actions on each phase regardless.

I could add a little state enumeration (or boolean) to each action and
I could adjust abort to only run on actions that either completed or
failed, but in this case I think it still wouldn't change the text for
.abort, because an action may fail before it got to creating the job,
for instance.

Unless you'd propose undoing .prepare IN .prepare in failure cases, but
why write abort code twice? I don't mind it living in .abort,
personally.

--js
Re: [Qemu-block] [PATCH 1/2] aio-posix: avoid NULL pointer dereference in aio_epoll_update
On Tue, 11/08 14:55, Paolo Bonzini wrote:
> aio_epoll_update dereferences parameter "node", but it could have been
> NULL if deleting an fd handler that was not registered in the first
> place.
>
> Signed-off-by: Paolo Bonzini
> ---
> Remove unnecessary assignment to node->pfd.revents.
>
>  aio-posix.c | 32 +++++++++++++++++---------------
>  1 file changed, 17 insertions(+), 15 deletions(-)
>
> diff --git a/aio-posix.c b/aio-posix.c
> index 4ef34dd..ec908f7 100644
> --- a/aio-posix.c
> +++ b/aio-posix.c
> @@ -217,21 +217,23 @@ void aio_set_fd_handler(AioContext *ctx,
>
>      /* Are we deleting the fd handler? */
>      if (!io_read && !io_write) {
> -        if (node) {
> -            g_source_remove_poll(&ctx->source, &node->pfd);
> -
> -            /* If the lock is held, just mark the node as deleted */
> -            if (ctx->walking_handlers) {
> -                node->deleted = 1;
> -                node->pfd.revents = 0;
> -            } else {
> -                /* Otherwise, delete it for real. We can't just mark it as
> -                 * deleted because deleted nodes are only cleaned up after
> -                 * releasing the walking_handlers lock.
> -                 */
> -                QLIST_REMOVE(node, node);
> -                deleted = true;
> -            }
> +        if (node == NULL) {
> +            return;
> +        }
> +
> +        g_source_remove_poll(&ctx->source, &node->pfd);
> +
> +        /* If the lock is held, just mark the node as deleted */
> +        if (ctx->walking_handlers) {
> +            node->deleted = 1;
> +            node->pfd.revents = 0;
> +        } else {
> +            /* Otherwise, delete it for real. We can't just mark it as
> +             * deleted because deleted nodes are only cleaned up after
> +             * releasing the walking_handlers lock.
> +             */
> +            QLIST_REMOVE(node, node);
> +            deleted = true;
>          }
>      } else {
>          if (node == NULL) {
> -- 
> 2.7.4

Reviewed-by: Fam Zheng
Re: [Qemu-block] [Qemu-devel] [PATCH] MAINTAINERS: Add Fam and Jsnow for Bitmap support
On Tue, 11/08 12:57, Thomas Huth wrote:
> On 07.11.2016 17:40, Max Reitz wrote:
> > On 04.08.2016 20:18, John Snow wrote:
> > > These files are currently unmaintained.
> > >
> > > I'm proposing that Fam and I co-maintain them; under the model that
> > > whomever between us isn't authoring a given series will be
> > > responsible for reviewing it.
> > >
> > > Signed-off-by: John Snow
> > > ---
> > >  MAINTAINERS | 14 ++++++++++++++
> > >  1 file changed, 14 insertions(+)
> >
> > Ping, anyone?
>
> I'm currently gathering a set of my patches that updates the
> MAINTAINERS file - and Paolo asked me to send a PULL request for that
> one, so if you like, I can also include this patch there.

Please do. Thanks!

Fam
Re: [Qemu-block] [Qemu-devel] qemu-img create doesn't always replace the existing file
On Tue, Nov 08, 2016 at 03:05:24PM +0100, Kevin Wolf wrote:
> [ Cc: qemu-block ]
>
> Am 08.11.2016 um 11:58 hat Richard W.M. Jones geschrieben:
> > When using 'qemu-img create', if the file being created already
> > exists, then qemu-img tries to read it first.  This has some
> > unexpected effects:
> >
> > $ rm test.qcow2
> > $ qemu-img create -f qcow2 -o compat=1.1,preallocation=off test.qcow2 1G
> > Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1
> > encryption=off cluster_size=65536 preallocation=off lazy_refcounts=off
> > refcount_bits=16
> > $ du -sh test.qcow2
> > 196K test.qcow2
> >
> > $ rm test.qcow2
> > $ qemu-img create -f qcow2 -o compat=1.1,preallocation=falloc test.qcow2 1G
> > Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1
> > encryption=off cluster_size=65536 preallocation=falloc lazy_refcounts=off
> > refcount_bits=16
> > $ qemu-img create -f qcow2 -o compat=1.1,preallocation=off test.qcow2 1G
> > Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1
> > encryption=off cluster_size=65536 preallocation=off lazy_refcounts=off
> > refcount_bits=16
> > $ du -sh test.qcow2
> > 256K test.qcow2    # would expect this to be the same as above
>
> For me it's actually even more:
>
> $ du -h /tmp/test.qcow2
> 448K    /tmp/test.qcow2
>
> However...
>
> $ ls -lh /tmp/test.qcow2
> -rw-r--r--. 1 kwolf kwolf 193K  8. Nov 15:00 /tmp/test.qcow2
>
> So qemu-img can't be at fault, the file has the same size as always.
>
> Are you using XFS? In my case I would have guessed that it's probably
> some preallocation thing that XFS does internally. We've seen this
> before that 'du' shows (sometimes by far) larger values than the file
> size on XFS. That space is reclaimed later, though.

Yes I am, and indeed this looks like a filesystem artifact and not a
problem with qemu-img.

Thanks, Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
Re: [Qemu-block] [Qemu-devel] qemu-img create doesn't always replace the existing file
[ Cc: qemu-block ]

Am 08.11.2016 um 11:58 hat Richard W.M. Jones geschrieben:
> When using 'qemu-img create', if the file being created already
> exists, then qemu-img tries to read it first.  This has some
> unexpected effects:
>
> $ rm test.qcow2
> $ qemu-img create -f qcow2 -o compat=1.1,preallocation=off test.qcow2 1G
> Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1
> encryption=off cluster_size=65536 preallocation=off lazy_refcounts=off
> refcount_bits=16
> $ du -sh test.qcow2
> 196K test.qcow2
>
> $ rm test.qcow2
> $ qemu-img create -f qcow2 -o compat=1.1,preallocation=falloc test.qcow2 1G
> Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1
> encryption=off cluster_size=65536 preallocation=falloc lazy_refcounts=off
> refcount_bits=16
> $ qemu-img create -f qcow2 -o compat=1.1,preallocation=off test.qcow2 1G
> Formatting 'test.qcow2', fmt=qcow2 size=1073741824 compat=1.1
> encryption=off cluster_size=65536 preallocation=off lazy_refcounts=off
> refcount_bits=16
> $ du -sh test.qcow2
> 256K test.qcow2    # would expect this to be the same as above

For me it's actually even more:

$ du -h /tmp/test.qcow2
448K    /tmp/test.qcow2

However...

$ ls -lh /tmp/test.qcow2
-rw-r--r--. 1 kwolf kwolf 193K  8. Nov 15:00 /tmp/test.qcow2

So qemu-img can't be at fault, the file has the same size as always.

Are you using XFS? In my case I would have guessed that it's probably
some preallocation thing that XFS does internally. We've seen this
before that 'du' shows (sometimes by far) larger values than the file
size on XFS. That space is reclaimed later, though.

Kevin
[Qemu-block] [PATCH for-2.8 v2 0/2] aio-posix: epoll cleanups
The first fixes a NULL-pointer dereference that was reported by
Coverity (so definitely for 2.8).  The second is a small
simplification.

Paolo Bonzini (2):
  aio-posix: avoid NULL pointer dereference in aio_epoll_update
  aio-posix: simplify aio_epoll_update

 aio-posix.c | 55 +++++++++++++++++++++++++------------------------------
 1 file changed, 25 insertions(+), 30 deletions(-)

-- 
2.7.4
[Qemu-block] [PATCH 1/2] aio-posix: avoid NULL pointer dereference in aio_epoll_update
aio_epoll_update dereferences parameter "node", but it could have been
NULL if deleting an fd handler that was not registered in the first
place.

Signed-off-by: Paolo Bonzini
---
Remove unnecessary assignment to node->pfd.revents.

 aio-posix.c | 32 +++++++++++++++++---------------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 4ef34dd..ec908f7 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -217,21 +217,23 @@ void aio_set_fd_handler(AioContext *ctx,
 
     /* Are we deleting the fd handler? */
     if (!io_read && !io_write) {
-        if (node) {
-            g_source_remove_poll(&ctx->source, &node->pfd);
-
-            /* If the lock is held, just mark the node as deleted */
-            if (ctx->walking_handlers) {
-                node->deleted = 1;
-                node->pfd.revents = 0;
-            } else {
-                /* Otherwise, delete it for real. We can't just mark it as
-                 * deleted because deleted nodes are only cleaned up after
-                 * releasing the walking_handlers lock.
-                 */
-                QLIST_REMOVE(node, node);
-                deleted = true;
-            }
+        if (node == NULL) {
+            return;
+        }
+
+        g_source_remove_poll(&ctx->source, &node->pfd);
+
+        /* If the lock is held, just mark the node as deleted */
+        if (ctx->walking_handlers) {
+            node->deleted = 1;
+            node->pfd.revents = 0;
+        } else {
+            /* Otherwise, delete it for real. We can't just mark it as
+             * deleted because deleted nodes are only cleaned up after
+             * releasing the walking_handlers lock.
+             */
+            QLIST_REMOVE(node, node);
+            deleted = true;
        }
     } else {
         if (node == NULL) {
-- 
2.7.4
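For context, a hedged sketch of the kind of call that reaches this path
with node == NULL (assuming the 2.8-era aio_set_fd_handler() signature;
the caller shown is hypothetical):

    #include "block/aio.h"

    /* Hypothetical caller: deleting the handler for an fd that was
     * never registered.  The lookup inside aio_set_fd_handler() finds
     * no AioHandler, so before this fix aio_epoll_update() could be
     * reached with node == NULL. */
    static void remove_unregistered_handler(AioContext *ctx, int fd)
    {
        aio_set_fd_handler(ctx, fd, false /* is_external */,
                           NULL /* io_read */, NULL /* io_write */,
                           NULL /* opaque */);
    }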
[Qemu-block] [PATCH 2/2] aio-posix: simplify aio_epoll_update
Extract common code out of the "if".

Reviewed-by: Fam Zheng
Signed-off-by: Paolo Bonzini
---
 aio-posix.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index ec908f7..d54553d 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -81,29 +81,22 @@ static void aio_epoll_update(AioContext *ctx, AioHandler *node, bool is_new)
 {
     struct epoll_event event;
     int r;
+    int ctl;
 
     if (!ctx->epoll_enabled) {
         return;
     }
+
     if (!node->pfd.events) {
-        r = epoll_ctl(ctx->epollfd, EPOLL_CTL_DEL, node->pfd.fd, &event);
-        if (r) {
-            aio_epoll_disable(ctx);
-        }
+        ctl = EPOLL_CTL_DEL;
     } else {
         event.data.ptr = node;
         event.events = epoll_events_from_pfd(node->pfd.events);
-        if (is_new) {
-            r = epoll_ctl(ctx->epollfd, EPOLL_CTL_ADD, node->pfd.fd, &event);
-            if (r) {
-                aio_epoll_disable(ctx);
-            }
-        } else {
-            r = epoll_ctl(ctx->epollfd, EPOLL_CTL_MOD, node->pfd.fd, &event);
-            if (r) {
-                aio_epoll_disable(ctx);
-            }
-        }
+        ctl = is_new ? EPOLL_CTL_ADD : EPOLL_CTL_MOD;
+    }
+
+    r = epoll_ctl(ctx->epollfd, ctl, node->pfd.fd, &event);
+    if (r) {
+        aio_epoll_disable(ctx);
     }
 }
-- 
2.7.4
Re: [Qemu-block] [Qemu-devel] BdrvDirtyBitmap and bdrv_snapshot_goto
Am 08.11.2016 um 12:08 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 08.11.2016 14:05, Kevin Wolf wrote:
> > Am 07.11.2016 um 17:10 hat Max Reitz geschrieben:
> > > On 07.11.2016 16:24, Vladimir Sementsov-Ogievskiy wrote:
> > > > Hi all!
> > > >
> > > > As I can see, in bdrv_snapshot_goto, existing dirty bitmaps are
> > > > not handled.. Is it ok? Should not they be filled with ones or
> > > > something like this?
> > >
> > > Filling them with ones makes sense to me. I guess nobody noticed
> > > because nobody was crazy enough to use block jobs alongside
> > > loadvm...
> >
> > What's the use case in which ones make sense?
> >
> > It rather seems to me that an active dirty bitmap and snapshot
> > switching should exclude each other because the bitmap becomes
> > meaningless by the switch. And chances are that after switching a
> > snapshot you don't want to "incrementally" backup everything, but
> > that you should access a different backup.
>
> In other words, dirty bitmaps should be deleted on snapshot switch?
> All? Or only named?

As Max said, we should probably integrate bitmaps with snapshots. After
reloading the old state, the bitmap becomes valid again, so throwing it
away in the active state seems only right if we included it in the
snapshot and can bring it back.

Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH] MAINTAINERS: Add Fam and Jsnow for Bitmap support
On 07.11.2016 17:40, Max Reitz wrote:
> On 04.08.2016 20:18, John Snow wrote:
> > These files are currently unmaintained.
> >
> > I'm proposing that Fam and I co-maintain them; under the model that
> > whomever between us isn't authoring a given series will be
> > responsible for reviewing it.
> >
> > Signed-off-by: John Snow
> > ---
> >  MAINTAINERS | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
>
> Ping, anyone?

I'm currently gathering a set of my patches that updates the
MAINTAINERS file - and Paolo asked me to send a PULL request for that
one, so if you like, I can also include this patch there.

 Thomas
Re: [Qemu-block] [Qemu-devel] BdrvDirtyBitmap and bdrv_snapshot_goto
08.11.2016 14:05, Kevin Wolf wrote:
> Am 07.11.2016 um 17:10 hat Max Reitz geschrieben:
> > On 07.11.2016 16:24, Vladimir Sementsov-Ogievskiy wrote:
> > > Hi all!
> > >
> > > As I can see, in bdrv_snapshot_goto, existing dirty bitmaps are not
> > > handled.. Is it ok? Should not they be filled with ones or
> > > something like this?
> >
> > Filling them with ones makes sense to me. I guess nobody noticed
> > because nobody was crazy enough to use block jobs alongside loadvm...
>
> What's the use case in which ones make sense?
>
> It rather seems to me that an active dirty bitmap and snapshot
> switching should exclude each other because the bitmap becomes
> meaningless by the switch. And chances are that after switching a
> snapshot you don't want to "incrementally" backup everything, but that
> you should access a different backup.
>
> Kevin

In other words, dirty bitmaps should be deleted on snapshot switch?
All? Or only named?

-- 
Best regards,
Vladimir
Re: [Qemu-block] [Qemu-devel] BdrvDirtyBitmap and bdrv_snapshot_goto
Am 07.11.2016 um 17:10 hat Max Reitz geschrieben:
> On 07.11.2016 16:24, Vladimir Sementsov-Ogievskiy wrote:
> > Hi all!
> >
> > As I can see, in bdrv_snapshot_goto, existing dirty bitmaps are not
> > handled.. Is it ok? Should not they be filled with ones or something
> > like this?
>
> Filling them with ones makes sense to me. I guess nobody noticed
> because nobody was crazy enough to use block jobs alongside loadvm...

What's the use case in which ones make sense?

It rather seems to me that an active dirty bitmap and snapshot
switching should exclude each other because the bitmap becomes
meaningless by the switch. And chances are that after switching a
snapshot you don't want to "incrementally" backup everything, but that
you should access a different backup.

Kevin
Re: [Qemu-block] [PATCH 4/4] block: Cater to iscsi with non-power-of-2 discard
Am 25.10.2016 um 18:12 schrieb Eric Blake:
> On 10/25/2016 09:36 AM, Paolo Bonzini wrote:
> > On 25/10/2016 16:35, Eric Blake wrote:
> > > So your argument is that we should always pass down every unaligned
> > > less-than-optimum discard request all the way to the hardware,
> > > rather than dropping it higher in the stack, even though discard
> > > requests are already advisory, in order to leave the hardware as
> > > the ultimate decision on whether to ignore the unaligned request?
> > Yes, I agree with Peter as to this.
> Okay, I'll work on patches. I think it counts as bug fix, so
> appropriate even if I miss soft freeze (I'd still like to get NBD
> write zero support into 2.8, since it already missed 2.7, but that one
> is still awaiting review with not much time left).

Hi Eric,

have you had time to look at this?
If you need help, let me know.

Peter
Re: [Qemu-block] [Qemu-devel] BdrvDirtyBitmap and bdrv_snapshot_goto
07.11.2016 19:10, Max Reitz wrote:
> On 07.11.2016 16:24, Vladimir Sementsov-Ogievskiy wrote:
> > Hi all!
> >
> > As I can see, in bdrv_snapshot_goto, existing dirty bitmaps are not
> > handled.. Is it ok? Should not they be filled with ones or something
> > like this?
>
> Filling them with ones makes sense to me. I guess nobody noticed
> because nobody was crazy enough to use block jobs alongside loadvm...

Using block jobs is not necessary - we just have to maintain our dirty
bitmap while qemu works, regardless of block jobs.

> > Also, when we will have persistent bitmaps in qcow2, how should they
> > be handled on snapshot switching?
>
> Good question. Since persistent bitmaps are not bound to snapshots,
> I'd fill them with ones for now, too.
>
> It would probably make sense to bind bitmaps to snapshots, though.
> This could be achieved by adding a bitmap directory pointer to each
> snapshot table entry. When switching snapshots, software (i.e. qemu)
> could then either:
>
> (1) Fill the bitmaps with ones, thus treating them as "global"
>     bitmaps.
>
> (2) Save the current bitmap directory in the old snapshot and put the
>     one from the snapshot that is being switched to into the image
>     header, thus treating them as bound to the snapshot.
>
> Of course, this could be a bitmap-specific property.
>
> Max

-- 
Best regards,
Vladimir
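To make option (2) concrete, a purely hypothetical on-disk sketch (this
is not part of the qcow2 spec; all field names are invented for
illustration):

    #include <stdint.h>

    /* Hypothetical extension to a qcow2 snapshot table entry, sketching
     * Max's idea (2): each snapshot remembers its own bitmap directory,
     * so bitmaps can be swapped in and out on snapshot switch. */
    typedef struct QCowSnapshotBitmapExt {
        uint64_t bitmap_directory_offset;  /* 0 = no bitmaps saved */
        uint64_t bitmap_directory_size;
        uint32_t nb_bitmaps;
        uint32_t flags;                    /* e.g. per-bitmap "global" vs
                                            * "bound to snapshot" */
    } QCowSnapshotBitmapExt;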
Re: [Qemu-block] [Qemu-devel] [PATCH v3 4/6] blockjob: add block_job_start
Am 08.11.2016 um 03:05 hat Jeff Cody geschrieben:
> On Mon, Nov 07, 2016 at 09:02:14PM -0500, John Snow wrote:
> > On 11/03/2016 08:17 AM, Kevin Wolf wrote:
> > > Am 02.11.2016 um 18:50 hat John Snow geschrieben:
> > > > +void block_job_start(BlockJob *job)
> > > > +{
> > > > +    assert(job && !block_job_started(job) && job->paused &&
> > > > +           !job->busy && job->driver->start);
> > > > +    job->paused = false;
> > > > +    job->busy = true;
> > > > +    job->co = qemu_coroutine_create(job->driver->start, job);
> > > > +    qemu_coroutine_enter(job->co);
> > > > +}
> > >
> > > We allow the user to pause a job while it's not started yet. You
> > > classified this as "harmless". But if we accept this, can we really
> > > unconditionally enter the coroutine even if the job has been
> > > paused? Can't a user expect that a job remains in paused state when
> > > they explicitly requested a pause and the job was already
> > > internally paused, like in this case by block_job_create()?
> >
> > What will end up happening is that we'll enter the job, and then
> > it'll pause immediately upon entrance. Is that a problem?
> >
> > If the jobs themselves are not checking their pause state
> > fastidiously, it could be (but block/backup does -- after it creates
> > a write notifier.)
> >
> > Do we want a stronger guarantee here?
> >
> > Naively I think it's OK as-is, but I could add a stronger boolean in
> > that lets us know if it's okay to start or not, and we could delay
> > the actual creation and start until the 'resume' comes in if you'd
> > like.
> >
> > I'd like to avoid the complexity if we can help it, but perhaps I'm
> > not thinking carefully enough about the existing edge cases.
>
> Is there any reason we can't just use job->pause_count here? When the
> job is created, set job->paused = true, and job->pause_count = 1. In
> block_job_start(), check the pause_count prior to
> qemu_coroutine_enter():
>
> void block_job_start(BlockJob *job)
> {
>     assert(job && !block_job_started(job) && job->paused &&
>            !job->busy && job->driver->start);
>     job->co = qemu_coroutine_create(job->driver->start, job);
>     job->paused = --job->pause_count > 0;
>     if (!job->paused) {
>         job->busy = true;
>         qemu_coroutine_enter(job->co);
>     }
> }

Yes, something like this is what I had in mind.

> > > The same probably also applies to the internal job pausing during
> > > bdrv_drain_all_begin/end, though as you know there is a larger
> > > problem with starting jobs under drain_all anyway. For now, we just
> > > need to keep in mind that we can neither create nor start a job in
> > > such sections.
> >
> > Yeah, there are deeper problems there. As long as the existing
> > critical sections don't allow us to create jobs (started or not) I
> > think we're probably already OK.

My point here was that we would like to get rid of that restriction
eventually, and if we add more and more things that depend on the
restriction, getting rid of it will only become harder. But with the
above code, I think this specific problem is solved.

Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH v3 5/6] blockjob: refactor backup_start as backup_job_create
Am 08.11.2016 um 06:41 hat John Snow geschrieben:
> On 11/03/2016 09:17 AM, Kevin Wolf wrote:
> > Am 02.11.2016 um 18:50 hat John Snow geschrieben:
> > > Refactor backup_start as backup_job_create, which only creates the
> > > job, but does not automatically start it. The old interface,
> > > 'backup_start', is not kept in favor of limiting the number of
> > > nearly-identical interfaces that would have to be edited to keep up
> > > with QAPI changes in the future.
> > >
> > > Callers that wish to synchronously start the backup_block_job can
> > > instead just call block_job_start immediately after calling
> > > backup_job_create.
> > >
> > > Transactions are updated to use the new interface, calling
> > > block_job_start only during the .commit phase, which helps prevent
> > > race conditions where jobs may finish before we even finish
> > > building the transaction. This may happen, for instance, during
> > > empty block backup jobs.
> > >
> > > Reported-by: Vladimir Sementsov-Ogievskiy
> > > Signed-off-by: John Snow
> >
> > > +static void drive_backup_commit(BlkActionState *common)
> > > +{
> > > +    DriveBackupState *state = DO_UPCAST(DriveBackupState, common, common);
> > > +    if (state->job) {
> > > +        block_job_start(state->job);
> > > +    }
> > >  }
> >
> > How could state->job ever be NULL?
>
> Mechanical thinking. It can't. (I definitely didn't copy paste from
> the .abort routines. Definitely.)
>
> > Same question for abort, and for blockdev_backup_commit/abort.
>
> Abort ... we may not have created the job successfully. Abort gets
> called whether or not we made it to or through the matching .prepare.

Ah, yes, I always forget about this. It's so counterintuitive (and
bdrv_reopen() actually works differently, it only aborts entries that
have successfully been prepared).

Is there a good reason why qmp_transaction() works this way, especially
since we have a separate .clean function?

Kevin