Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model
On 1/20/2011 1:45 PM, Stefan Hajnoczi wrote: > On Thu, Jan 20, 2011 at 9:15 PM, Venkateswararao Jujjuri (JV) > wrote: >> On 1/20/2011 12:59 AM, Stefan Hajnoczi wrote: >>> On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote: After creating a file object, its permission and ownership details are updated as per client's request for both passthrough and none security model. But with chrooted environment its not required for passthrough security model. Move all post file creation changes to none security model Signed-off-by: M. Mohan Kumar --- hw/9pfs/virtio-9p-local.c | 19 ++- 1 files changed, 6 insertions(+), 13 deletions(-) diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c index 08fd67f..d2e32e2 100644 --- a/hw/9pfs/virtio-9p-local.c +++ b/hw/9pfs/virtio-9p-local.c @@ -208,21 +208,14 @@ static int local_set_xattr(const char *path, FsCred *credp) return 0; } -static int local_post_create_passthrough(FsContext *fs_ctx, const char *path, +static int local_post_create_none(FsContext *fs_ctx, const char *path, FsCred *credp) { +int retval; if (chmod(rpath(fs_ctx, path), credp->fc_mode & 0) < 0) { return -1; } -if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) { -/* - * If we fail to change ownership and if we are - * using security model none. Ignore the error - */ -if (fs_ctx->fs_sm != SM_NONE) { -return -1; -} -} +retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid); return 0; } >>> >>> retval is unused. >>> >>> Can multiple virtio-9p requests execute at a time? chmod() and lchown() >>> after creation is a race condition if other requests can execute >>> concurrently. >> >> If some level of serialization is needed it will be done at the client/guest >> inode level. >> Are you worried about filesystem semantics? or do you see some corruption if >> they >> get executed in parallel? > > My main concern is unreliable results due to the race conditions > between creation and the fixups that are performed afterwards. > > Is virtio-9p only useful for single guest exclusive access? I thought > both guest and host could access files at the same time? What about > multiple VMs sharing a directory? These scenarios can only work if > operations are made atomic. For now, there is only one exploiter for the filesystem. The Guest/client. In the future it could be different and we 'may' support multiple exploiters/users. Note that we have two security models 1. Passthrough 2. Mapped. (3. None - can be ignored as it is intended for developer) Mapped model is advised when you have only one exploiter; Passthrough model is for more practical application/uses and it can be used for multiple exploiters (say guests). In passthrough model we don't do chmod() lchmod() after creating files. Thanks, JV > > Stefan
[Qemu-devel] Qemu signal handling
hi In QEMU code almost every signal is handled then why this warning is generated from syscall.c #elif defined(TARGET_ABI_MIPSN64) # warning signal handling not implemented
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Thu, 2011-01-20 at 18:23 -0600, Anthony Liguori wrote: > On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote: > > On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: > > > >> On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: > >> > >>> When MSI is off, each interrupt needs to be bounced through the io > >>> thread when it's set/cleared, so vhost-net causes more context switches > >>> and > >>> higher CPU utilization than userspace virtio which handles networking in > >>> the same thread. > >>> > >>> We'll need to fix this by adding level irq support in kvm irqfd, > >>> for now disable vhost-net in these configurations. > >>> > >>> Signed-off-by: Michael S. Tsirkin > >>> > >> I actually think this should be a terminal error. The user asks for > >> vhost-net, if we cannot enable it, we should exit. > >> > >> Or we should warn the user that they should expect bad performance. > >> Silently doing something that the user has explicitly asked us not > >> to do is not a good behavior. > >> > >> Regards, > >> > >> Anthony Liguori > >> > > The issue is that user has no control of the guest, and can not know > > whether the guest enables MSI. So what you ask for will just make > > some guests fail, and others fail sometimes. > > The user also has no way to know that version X of kvm does not expose a > > way to inject level interrupts with irqfd. > > > > We could have *another* flag that says "use vhost where it helps" but > > then I think this is what everyone wants to do, anyway, and libvirt > > already sets vhost=on so I prefer redefining the meaning of an existing > > flag. > > > > In the very least, there needs to be a vhost=force. > > Having some sort of friendly default policy is fine but we need to > provide a mechanism for a user to have the final say. If you want to > redefine vhost=on to really mean, use the friendly default, that's fine > by me, but only if the vhost=force option exists. > > I actually would think libvirt would want to use vhost=force. Debugging > with vhost=on is going to be a royal pain in the ass if a user reports > bad performance. Given the libvirt XML, you can't actually tell from > the guest and the XML whether or not vhost was actually in use or not. If we add a force option, let's please distinguish hotplug from VM creation time. The latter can abort. Hotplug should print an error and fail the initfn. Thanks, Alex
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote: On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says "use vhost where it helps" but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. In the very least, there needs to be a vhost=force. Having some sort of friendly default policy is fine but we need to provide a mechanism for a user to have the final say. If you want to redefine vhost=on to really mean, use the friendly default, that's fine by me, but only if the vhost=force option exists. I actually would think libvirt would want to use vhost=force. Debugging with vhost=on is going to be a royal pain in the ass if a user reports bad performance. Given the libvirt XML, you can't actually tell from the guest and the XML whether or not vhost was actually in use or not. Regards, Anthony Liguori Maybe this is best handled by a documentation update? We always said: "use vhost=on to enable experimental in kernel accelerator\n" note 'enable' not 'require'. This is similar to how we specify nvectors : you can not make guest use the feature. How about this: diff --git a/qemu-options.hx b/qemu-options.hx index 898561d..3c937c1 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -1061,6 +1061,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net, "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n" "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n" "use vhost=on to enable experimental in kernel accelerator\n" +"(note: vhost=on has no effect unless guest uses MSI-X)\n" "use 'vhostfd=h' to connect to an already opened vhost net device\n" #endif "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
[Qemu-devel] Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, 2011-01-20 at 19:47 +0200, Michael S. Tsirkin wrote: > On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote: > > On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: > > > When MSI is off, each interrupt needs to be bounced through the io > > > thread when it's set/cleared, so vhost-net causes more context switches > > > and > > > higher CPU utilization than userspace virtio which handles networking in > > > the same thread. > > > > > > We'll need to fix this by adding level irq support in kvm irqfd, > > > for now disable vhost-net in these configurations. > > > > > > Signed-off-by: Michael S. Tsirkin > > > --- > > > > > > I need to report some error from virtio-pci > > > that would be handled specially (disable but don't > > > report an error) so I wanted one that's never likely to be used by a > > > userspace ioctl. I selected ERANGE but it'd > > > be easy to switch to something else. Comments? > > > > Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED? > > > > -Sridhar > > The error is reported by virtio-pci which does not know about vhost. > I started with EVIRTIO_MSIX_DISABLED and made is shorter. > Would EVIRTIO_MSIX_DISABLED be better? I think so. This makes it more clear. -Sridhar > > > > > > > hw/vhost.c |4 +++- > > > hw/virtio-net.c |6 -- > > > hw/virtio-pci.c |3 +++ > > > hw/virtio.h |2 ++ > > > 4 files changed, 12 insertions(+), 3 deletions(-) > > > > > > diff --git a/hw/vhost.c b/hw/vhost.c > > > index 1d09ed0..c79765a 100644 > > > --- a/hw/vhost.c > > > +++ b/hw/vhost.c > > > @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, > > > VirtIODevice *vdev) > > > > > > r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true); > > > if (r < 0) { > > > -fprintf(stderr, "Error binding guest notifier: %d\n", -r); > > > + if (r != -EVIRTIO_DISABLED) { > > > + fprintf(stderr, "Error binding guest notifier: %d\n", -r); > > > + } > > > goto fail_notifiers; > > > } > > > > > > diff --git a/hw/virtio-net.c b/hw/virtio-net.c > > > index ccb3e63..5de3fee 100644 > > > --- a/hw/virtio-net.c > > > +++ b/hw/virtio-net.c > > > @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, > > > uint8_t status) > > > if (!n->vhost_started) { > > > int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), > > > &n->vdev); > > > if (r < 0) { > > > -error_report("unable to start vhost net: %d: " > > > - "falling back on userspace virtio", -r); > > > +if (r != -EVIRTIO_DISABLED) { > > > +error_report("unable to start vhost net: %d: " > > > + "falling back on userspace virtio", -r); > > > +} > > > } else { > > > n->vhost_started = 1; > > > } > > > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c > > > index dd8887a..dbf4be0 100644 > > > --- a/hw/virtio-pci.c > > > +++ b/hw/virtio-pci.c > > > @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void > > > *opaque, int n, bool assign) > > > EventNotifier *notifier = virtio_queue_get_guest_notifier(vq); > > > > > > if (assign) { > > > +if (!msix_enabled(&proxy->pci_dev)) { > > > +return -EVIRTIO_DISABLED; > > > +} > > > int r = event_notifier_init(notifier, 0); > > > if (r < 0) { > > > return r; > > > diff --git a/hw/virtio.h b/hw/virtio.h > > > index d8546d5..53bbdba 100644 > > > --- a/hw/virtio.h > > > +++ b/hw/virtio.h > > > @@ -98,6 +98,8 @@ typedef struct { > > > void (*vmstate_change)(void * opaque, bool running); > > > } VirtIOBindings; > > > > > > +#define EVIRTIO_DISABLED ERANGE > > > + > > > #define VIRTIO_PCI_QUEUE_MAX 64 > > > > > > #define VIRTIO_NO_VECTOR 0x
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-20 22:40, Blue Swirl wrote: > On Thu, Jan 20, 2011 at 9:22 PM, Jan Kiszka wrote: >> On 2011-01-20 20:27, Blue Swirl wrote: >>> On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka wrote: On 2011-01-19 20:32, Blue Swirl wrote: > On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori > wrote: >> On 01/19/2011 07:15 AM, Markus Armbruster wrote: >>> >>> So they interact with KVM (need kvm_state), and they interact with the >>> emulated PCI bus. Could you elaborate on the fundamental difference >>> between the two interactions that makes you choose the (hypothetical) >>> KVM bus over the PCI bus as device parent? >>> >> >> It's almost arbitrary, but I would say it's the direction that I/Os flow. >> >> But if the underlying observation is that the device tree is not really a >> tree, you're 100% correct. This is part of why a factory interface that >> just takes a parent bus is too simplistic. >> >> I think we ought to introduce a -pci-device option that is specifically >> for >> creating PCI devices that doesn't require a parent bus argument but >> provides >> a way to specify stable addressing (for instancing, using a linear >> index). > > I think kvm_state should not be a property of any device or bus. It > should be split to more logical pieces. > > Some parts of it could remain in CPUState, because they are associated > with a VCPU. > > Also, for example irqfd could be considered to be similar object to > char or block devices provided by QEMU to devices. Would it make sense > to introduce new host types for passing parts of kvm_state to devices? > > I'd also make coalesced MMIO stuff part of memory object. We are not > passing any state references when using cpu_physical_memory_rw(), but > that could be changed. There are currently no VCPU-specific bits remaining in kvm_state. >>> >>> I think fields vcpu_events, robust_singlestep, debugregs, >>> kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the >>> same for all VCPUs but still they are sort of CPU properties. I'm not >>> sure about fd field. >> >> They are all properties of the currently loaded KVM subsystem in the >> host kernel. They can't change while KVM's root fd is opened. >> Replicating this static information into each and every VCPU state would >> be crazy. > > Then each CPUX86State could have a pointer to common structure. That already exists. > >> In fact, services like kvm_has_vcpu_events() already encode this: they >> are static functions without any kvm_state reference that simply return >> the content of those fields. Totally inconsistent to this, we force the >> caller of kvm_check_extension to pass a handle. This is part of my >> problem with the current situation and any halfhearted steps in this >> context. Either we work towards eliminating "static KVMState *kvm_state" >> in kvm-all.c or eliminating KVMState. > > If the CPU related fields are accessible through CPUState, the handle > should be available. > It may be a good idea to introduce an arch-specific kvm_state and move related bits over. >>> >>> This should probably contain only irqchip_in_kernel, pit_in_kernel and >>> many_ioeventfds, maybe fd. >> >> fd is that root file descriptor you need for a few KVM services that are >> not bound to a specific VM - e.g. feature queries. It's not arch >> specific. Arch specific are e.g. robust_singlestep or xsave feature states. > > By arch you mean guest CPU architecture? They are not machine features. No, they are practically static host features. > >>> It may also once be feasible to carve out memory management related fields if we have proper abstractions for that, but I'm not completely sure here. >>> >>> I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region, >>> migration_log into the memory object. >> >> vmfd is the VM-scope file descriptor you need at machine-level. The rest >> logically belongs to a memory object, but I haven't looked at technical >> details yet. >> >>> Anyway, all these things are secondary. The primary topic here is how to deal with kvm_state and its fields that have VM-global scope. >>> >>> If it is an opaque blob which contains various unrelated stuff, no >>> clear place will be found. >> >> We aren't moving fields yet (and we shouldn't). We first of all need to >> establish the handle distribution (which apparently requires quite some >> work in areas beyond KVM). > > But I think this is exactly the problem. If the handle is for the > current KVMState, you'll indeed need it in various places and passing > it around will be cumbersome. By moving the fields around, the > information should be available more naturally. Yeah, if we had a MachineState or if we could agree on introducing it, I'm with you again. Improving the currently cumbersome KVM API interaction was the main
Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model
On Thu, Jan 20, 2011 at 9:15 PM, Venkateswararao Jujjuri (JV) wrote: > On 1/20/2011 12:59 AM, Stefan Hajnoczi wrote: >> On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote: >>> After creating a file object, its permission and ownership details are >>> updated >>> as per client's request for both passthrough and none security model. But >>> with >>> chrooted environment its not required for passthrough security model. Move >>> all >>> post file creation changes to none security model >>> >>> Signed-off-by: M. Mohan Kumar >>> --- >>> hw/9pfs/virtio-9p-local.c | 19 ++- >>> 1 files changed, 6 insertions(+), 13 deletions(-) >>> >>> diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c >>> index 08fd67f..d2e32e2 100644 >>> --- a/hw/9pfs/virtio-9p-local.c >>> +++ b/hw/9pfs/virtio-9p-local.c >>> @@ -208,21 +208,14 @@ static int local_set_xattr(const char *path, FsCred >>> *credp) >>> return 0; >>> } >>> >>> -static int local_post_create_passthrough(FsContext *fs_ctx, const char >>> *path, >>> +static int local_post_create_none(FsContext *fs_ctx, const char *path, >>> FsCred *credp) >>> { >>> + int retval; >>> if (chmod(rpath(fs_ctx, path), credp->fc_mode & 0) < 0) { >>> return -1; >>> } >>> - if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) { >>> - /* >>> - * If we fail to change ownership and if we are >>> - * using security model none. Ignore the error >>> - */ >>> - if (fs_ctx->fs_sm != SM_NONE) { >>> - return -1; >>> - } >>> - } >>> + retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid); >>> return 0; >>> } >> >> retval is unused. >> >> Can multiple virtio-9p requests execute at a time? chmod() and lchown() >> after creation is a race condition if other requests can execute >> concurrently. > > If some level of serialization is needed it will be done at the client/guest > inode level. > Are you worried about filesystem semantics? or do you see some corruption if > they > get executed in parallel? My main concern is unreliable results due to the race conditions between creation and the fixups that are performed afterwards. Is virtio-9p only useful for single guest exclusive access? I thought both guest and host could access files at the same time? What about multiple VMs sharing a directory? These scenarios can only work if operations are made atomic. Stefan
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-20 21:02, Blue Swirl wrote: > I think KVMState was designed to match KVM ioctl interface: all stuff > that is needed for talking to KVM or received from KVM are there. But > I think this shouldn't be a design driver. Agreed. The nice cleanup would probably be the complete assimilation of KVMState by something bigger of comparable scope. If a machine was brought up with KVM support, every device that refers to this machine (as it is supposed to become part of it) should be able to use KVM services in order to accelerate its model. > > If the only pieces of kvm_state that are needed by the devices are > irqchip_in_kernel, pit_in_kernel and many_ioeventfds, the problem of > passing kvm_state to devices becomes very different. Each of these are > just single bits, affecting only a few devices. Perhaps they could be > device properties which the board level sets when KVM is used? Forget about the static capabilities for now. The core of kvm_state are handles that enable you to use KVM services and maybe state fields that have machine scope (unless we find more local homes like a memory object). Those need to be accessible by the kvm layer when servicing requests of components that are related to that very same machine. Jan signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On Thu, Jan 20, 2011 at 9:22 PM, Jan Kiszka wrote: > On 2011-01-20 20:27, Blue Swirl wrote: >> On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka wrote: >>> On 2011-01-19 20:32, Blue Swirl wrote: On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori wrote: > On 01/19/2011 07:15 AM, Markus Armbruster wrote: >> >> So they interact with KVM (need kvm_state), and they interact with the >> emulated PCI bus. Could you elaborate on the fundamental difference >> between the two interactions that makes you choose the (hypothetical) >> KVM bus over the PCI bus as device parent? >> > > It's almost arbitrary, but I would say it's the direction that I/Os flow. > > But if the underlying observation is that the device tree is not really a > tree, you're 100% correct. This is part of why a factory interface that > just takes a parent bus is too simplistic. > > I think we ought to introduce a -pci-device option that is specifically > for > creating PCI devices that doesn't require a parent bus argument but > provides > a way to specify stable addressing (for instancing, using a linear index). I think kvm_state should not be a property of any device or bus. It should be split to more logical pieces. Some parts of it could remain in CPUState, because they are associated with a VCPU. Also, for example irqfd could be considered to be similar object to char or block devices provided by QEMU to devices. Would it make sense to introduce new host types for passing parts of kvm_state to devices? I'd also make coalesced MMIO stuff part of memory object. We are not passing any state references when using cpu_physical_memory_rw(), but that could be changed. >>> >>> There are currently no VCPU-specific bits remaining in kvm_state. >> >> I think fields vcpu_events, robust_singlestep, debugregs, >> kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the >> same for all VCPUs but still they are sort of CPU properties. I'm not >> sure about fd field. > > They are all properties of the currently loaded KVM subsystem in the > host kernel. They can't change while KVM's root fd is opened. > Replicating this static information into each and every VCPU state would > be crazy. Then each CPUX86State could have a pointer to common structure. > In fact, services like kvm_has_vcpu_events() already encode this: they > are static functions without any kvm_state reference that simply return > the content of those fields. Totally inconsistent to this, we force the > caller of kvm_check_extension to pass a handle. This is part of my > problem with the current situation and any halfhearted steps in this > context. Either we work towards eliminating "static KVMState *kvm_state" > in kvm-all.c or eliminating KVMState. If the CPU related fields are accessible through CPUState, the handle should be available. >>> It may >>> be a good idea to introduce an arch-specific kvm_state and move related >>> bits over. >> >> This should probably contain only irqchip_in_kernel, pit_in_kernel and >> many_ioeventfds, maybe fd. > > fd is that root file descriptor you need for a few KVM services that are > not bound to a specific VM - e.g. feature queries. It's not arch > specific. Arch specific are e.g. robust_singlestep or xsave feature states. By arch you mean guest CPU architecture? They are not machine features. >> >>> It may also once be feasible to carve out memory management >>> related fields if we have proper abstractions for that, but I'm not >>> completely sure here. >> >> I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region, >> migration_log into the memory object. > > vmfd is the VM-scope file descriptor you need at machine-level. The rest > logically belongs to a memory object, but I haven't looked at technical > details yet. > >> >>> Anyway, all these things are secondary. The primary topic here is how to >>> deal with kvm_state and its fields that have VM-global scope. >> >> If it is an opaque blob which contains various unrelated stuff, no >> clear place will be found. > > We aren't moving fields yet (and we shouldn't). We first of all need to > establish the handle distribution (which apparently requires quite some > work in areas beyond KVM). But I think this is exactly the problem. If the handle is for the current KVMState, you'll indeed need it in various places and passing it around will be cumbersome. By moving the fields around, the information should be available more naturally. >> By the way, we don't have a QEMUState but instead use globals. Perhaps >> this should be reorganized as well. For fd field, maybe even using a >> global variable could be justified, since it is used for direct access >> to kernel, not unlike a system call. > > The fd field is part of this discussion. Making it global (but local to > the kvm subsystem) was an implicit part of my original sugges
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-20 20:37, Anthony Liguori wrote: > On 01/20/2011 03:33 AM, Jan Kiszka wrote: >> On 2011-01-19 20:32, Blue Swirl wrote: >> >>> On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori >>> wrote: >>> On 01/19/2011 07:15 AM, Markus Armbruster wrote: > So they interact with KVM (need kvm_state), and they interact with the > emulated PCI bus. Could you elaborate on the fundamental difference > between the two interactions that makes you choose the (hypothetical) > KVM bus over the PCI bus as device parent? > > It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instancing, using a linear index). >>> I think kvm_state should not be a property of any device or bus. It >>> should be split to more logical pieces. >>> >>> Some parts of it could remain in CPUState, because they are associated >>> with a VCPU. >>> >>> Also, for example irqfd could be considered to be similar object to >>> char or block devices provided by QEMU to devices. Would it make sense >>> to introduce new host types for passing parts of kvm_state to devices? >>> >>> I'd also make coalesced MMIO stuff part of memory object. We are not >>> passing any state references when using cpu_physical_memory_rw(), but >>> that could be changed. >>> >> There are currently no VCPU-specific bits remaining in kvm_state. It may >> be a good idea to introduce an arch-specific kvm_state and move related >> bits over. It may also once be feasible to carve out memory management >> related fields if we have proper abstractions for that, but I'm not >> completely sure here. >> >> Anyway, all these things are secondary. The primary topic here is how to >> deal with kvm_state and its fields that have VM-global scope. >> > > The debate is really: > > 1) should we remove all passing of kvm_state and just assume it's static > > 2) deal with a couple places in the code where we need to figure out how > to get at kvm_state > > I think we've only identified 1 real instance of (2) and it's resulted > in some good discussions about how to model KVM devices vs. emulated > devices. Honestly, (1) just stinks. I see absolutely no advantage to > it at all. In the very worst case scenario, the thing we need to do is > just reference an extern variable in a few places. That completely > avoids all of the modelling discussions for now (while leaving for > placeholder FIXMEs so the problem can be tackled later). The PCI bus discussion is surely an interesting outcome, but now almost completely off-topic to the original, way less critical issue (as we were discussing internals). > > I don't understand the resistance here. IMHO, most suggestions on the table are still over-designed (like a KVMBus that only passes a kvm_state - or do you have more features for it in mind?). The idea I love most so far is establishing a machine state that also carries those few KVM bits which correspond to the KVM extension of CPUState. But in the end I want an implementable consensus that helps moving forward with main topic: the overdue KVM upstream merge. I just do not have a clear picture yet. Jan signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-20 20:27, Blue Swirl wrote: > On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka wrote: >> On 2011-01-19 20:32, Blue Swirl wrote: >>> On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori >>> wrote: On 01/19/2011 07:15 AM, Markus Armbruster wrote: > > So they interact with KVM (need kvm_state), and they interact with the > emulated PCI bus. Could you elaborate on the fundamental difference > between the two interactions that makes you choose the (hypothetical) > KVM bus over the PCI bus as device parent? > It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instancing, using a linear index). >>> >>> I think kvm_state should not be a property of any device or bus. It >>> should be split to more logical pieces. >>> >>> Some parts of it could remain in CPUState, because they are associated >>> with a VCPU. >>> >>> Also, for example irqfd could be considered to be similar object to >>> char or block devices provided by QEMU to devices. Would it make sense >>> to introduce new host types for passing parts of kvm_state to devices? >>> >>> I'd also make coalesced MMIO stuff part of memory object. We are not >>> passing any state references when using cpu_physical_memory_rw(), but >>> that could be changed. >> >> There are currently no VCPU-specific bits remaining in kvm_state. > > I think fields vcpu_events, robust_singlestep, debugregs, > kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the > same for all VCPUs but still they are sort of CPU properties. I'm not > sure about fd field. They are all properties of the currently loaded KVM subsystem in the host kernel. They can't change while KVM's root fd is opened. Replicating this static information into each and every VCPU state would be crazy. In fact, services like kvm_has_vcpu_events() already encode this: they are static functions without any kvm_state reference that simply return the content of those fields. Totally inconsistent to this, we force the caller of kvm_check_extension to pass a handle. This is part of my problem with the current situation and any halfhearted steps in this context. Either we work towards eliminating "static KVMState *kvm_state" in kvm-all.c or eliminating KVMState. > >> It may >> be a good idea to introduce an arch-specific kvm_state and move related >> bits over. > > This should probably contain only irqchip_in_kernel, pit_in_kernel and > many_ioeventfds, maybe fd. fd is that root file descriptor you need for a few KVM services that are not bound to a specific VM - e.g. feature queries. It's not arch specific. Arch specific are e.g. robust_singlestep or xsave feature states. > >> It may also once be feasible to carve out memory management >> related fields if we have proper abstractions for that, but I'm not >> completely sure here. > > I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region, > migration_log into the memory object. vmfd is the VM-scope file descriptor you need at machine-level. The rest logically belongs to a memory object, but I haven't looked at technical details yet. > >> Anyway, all these things are secondary. The primary topic here is how to >> deal with kvm_state and its fields that have VM-global scope. > > If it is an opaque blob which contains various unrelated stuff, no > clear place will be found. We aren't moving fields yet (and we shouldn't). We first of all need to establish the handle distribution (which apparently requires quite some work in areas beyond KVM). > > By the way, we don't have a QEMUState but instead use globals. Perhaps > this should be reorganized as well. For fd field, maybe even using a > global variable could be justified, since it is used for direct access > to kernel, not unlike a system call. The fd field is part of this discussion. Making it global (but local to the kvm subsystem) was an implicit part of my original suggestion. I've no problem with something like a QEMUState, or better a MachineState that would also include a few KVM-specific fields like the vmfd - just like we already do for CPUstate (or should we better introduce a KVM CPU bus... ;) ). Jan signature.asc Description: OpenPGP digital signature
Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model
On Thu, Jan 20, 2011 at 2:48 PM, Daniel P. Berrange wrote: > On Thu, Jan 20, 2011 at 08:11:27PM +0530, M. Mohan Kumar wrote: >> On Thursday 20 January 2011 2:29:54 pm Stefan Hajnoczi wrote: >> > On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote: >> >> > > - if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) { >> > > - /* >> > > - * If we fail to change ownership and if we are >> > > - * using security model none. Ignore the error >> > > - */ >> > > - if (fs_ctx->fs_sm != SM_NONE) { >> > > - return -1; >> > > - } >> > > - } >> > > + retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid); >> > > >> > > return 0; >> > > >> > > } >> > >> > retval is unused. >> > >> >> That was used to disable the warning message "error: ignoring return value of >> ‘lchown’, declared with attribute warn_unused_result" >> >> Otherwise I have to use >> if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid)) { >> ; >> } >> >> > Can multiple virtio-9p requests execute at a time? chmod() and lchown() >> > after creation is a race condition if other requests can execute >> > concurrently. >> > >> >> We can't implement file creation with requested user credentials and >> permission >> bits in the none security model atomically. Its expected behaviour only > > Well you could do the nasty trick of forking a child process > and doing setuid/gid in that and then creating the file before > letting the parent continue. > > if ((pid = fork()) == 0) { > setuid(fc_uid); > setgid(fc_gid); > fd =open("foo", O_CREAT); > close(fd); > } else { > waitpid(pid); > } > > This kind of approach is in fact required if you want to > be able to create files with a special uid/gid on a root > squashing NFS server, because otherwise your QEMU running > as root will have its files squashed to 'nobody' when initially > created, and lchown will fail with EPERM. You might decide > that root squashing NFS is too painful to care about supporting > though :-) I was thinking about this approach and it's similar to the chroot helper process, but this time you have a helper process that does umask/setgid/setuid as necessary. Performance will be bad but there's really no way around this. Either implement something that works 90% of the time only but runs a bit faster or implement something that works all the time but runs slow. It's not a nice trade-off. Stefan
Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model
On 1/20/2011 12:59 AM, Stefan Hajnoczi wrote: > On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote: >> After creating a file object, its permission and ownership details are >> updated >> as per client's request for both passthrough and none security model. But >> with >> chrooted environment its not required for passthrough security model. Move >> all >> post file creation changes to none security model >> >> Signed-off-by: M. Mohan Kumar >> --- >> hw/9pfs/virtio-9p-local.c | 19 ++- >> 1 files changed, 6 insertions(+), 13 deletions(-) >> >> diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c >> index 08fd67f..d2e32e2 100644 >> --- a/hw/9pfs/virtio-9p-local.c >> +++ b/hw/9pfs/virtio-9p-local.c >> @@ -208,21 +208,14 @@ static int local_set_xattr(const char *path, FsCred >> *credp) >> return 0; >> } >> >> -static int local_post_create_passthrough(FsContext *fs_ctx, const char >> *path, >> +static int local_post_create_none(FsContext *fs_ctx, const char *path, >> FsCred *credp) >> { >> +int retval; >> if (chmod(rpath(fs_ctx, path), credp->fc_mode & 0) < 0) { >> return -1; >> } >> -if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) { >> -/* >> - * If we fail to change ownership and if we are >> - * using security model none. Ignore the error >> - */ >> -if (fs_ctx->fs_sm != SM_NONE) { >> -return -1; >> -} >> -} >> +retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid); >> return 0; >> } > > retval is unused. > > Can multiple virtio-9p requests execute at a time? chmod() and lchown() > after creation is a race condition if other requests can execute > concurrently. If some level of serialization is needed it will be done at the client/guest inode level. Are you worried about filesystem semantics? or do you see some corruption if they get executed in parallel? JV > > Stefan >
Re: [Spice-devel] [Qemu-devel] usb redirection status report
> H > i, > > On 01/20/2011 12:27 PM, Christoph Hellwig wrote: > > On Wed, Jan 19, 2011 at 07:15:47PM +0100, Hans de Goede wrote: > >> Hi All, > >> > >> As most of you know I'm working on usb redirection (making client usb > >> devices > >> accessible in guests over the network). > >> > >> I thought it would be a good idea to write a short status report. > >> > >> So fat the following has been done: > >> > >> * written and posted a network protocol for usb redir. over the network, > >> and send this to the list. > > As another late-comer, hate to ask a possibly old question but will this added protocol allow usb devices from arbitrary "remote" hosts to be redirected or only from the host running the spice client? I recently tried to use the native qemu usb functionality to pass-through a webcam and fell flat on my face due to what others described as "timing issues", and was wondering if it would be possible to pass a USB device plugged in to the vm-host to the vm-guest using your new functionality, even though a third (and unrelated) host is running the spice client. I guess this is moot if the new approach will not fix the aforementioned "timing issues" (i.e. perhaps it is not possible in a general case to drive an arbitrary USB device non-locally). Thanks, David
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On Thu, Jan 20, 2011 at 7:37 PM, Anthony Liguori wrote: > On 01/20/2011 03:33 AM, Jan Kiszka wrote: >> >> On 2011-01-19 20:32, Blue Swirl wrote: >> >>> >>> On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori >>> wrote: >>> On 01/19/2011 07:15 AM, Markus Armbruster wrote: > > So they interact with KVM (need kvm_state), and they interact with the > emulated PCI bus. Could you elaborate on the fundamental difference > between the two interactions that makes you choose the (hypothetical) > KVM bus over the PCI bus as device parent? > > It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instancing, using a linear index). >>> >>> I think kvm_state should not be a property of any device or bus. It >>> should be split to more logical pieces. >>> >>> Some parts of it could remain in CPUState, because they are associated >>> with a VCPU. >>> >>> Also, for example irqfd could be considered to be similar object to >>> char or block devices provided by QEMU to devices. Would it make sense >>> to introduce new host types for passing parts of kvm_state to devices? >>> >>> I'd also make coalesced MMIO stuff part of memory object. We are not >>> passing any state references when using cpu_physical_memory_rw(), but >>> that could be changed. >>> >> >> There are currently no VCPU-specific bits remaining in kvm_state. It may >> be a good idea to introduce an arch-specific kvm_state and move related >> bits over. It may also once be feasible to carve out memory management >> related fields if we have proper abstractions for that, but I'm not >> completely sure here. >> >> Anyway, all these things are secondary. The primary topic here is how to >> deal with kvm_state and its fields that have VM-global scope. >> > > The debate is really: > > 1) should we remove all passing of kvm_state and just assume it's static > > 2) deal with a couple places in the code where we need to figure out how to > get at kvm_state > > I think we've only identified 1 real instance of (2) and it's resulted in > some good discussions about how to model KVM devices vs. emulated devices. > Honestly, (1) just stinks. I see absolutely no advantage to it at all. Fully agree. > In the very worst case scenario, the thing we need to do is just reference > an extern variable in a few places. That completely avoids all of the > modelling discussions for now (while leaving for placeholder FIXMEs so the > problem can be tackled later). I think KVMState was designed to match KVM ioctl interface: all stuff that is needed for talking to KVM or received from KVM are there. But I think this shouldn't be a design driver. If the only pieces of kvm_state that are needed by the devices are irqchip_in_kernel, pit_in_kernel and many_ioeventfds, the problem of passing kvm_state to devices becomes very different. Each of these are just single bits, affecting only a few devices. Perhaps they could be device properties which the board level sets when KVM is used?
Re: [Qemu-devel] [RFC PATCH] Fake machine for scalability testing
On 01/20/2011 11:12 AM, Markus Armbruster wrote: Anthony Liguori writes: On 01/18/2011 02:16 PM, Markus Armbruster wrote: The problem: you want to do serious scalability testing (1000s of VMs) of your management stack. If each guest eats up a few 100MiB and competes for CPU, that requires a serious host machine. Which you don't have. You also don't want to modify the management stack at all, if you can help it. The solution: a perfectly normal-looking QEMU that uses minimal resources. Ability to execute any guest code is strictly optional ;) New option -fake-machine creates a fake machine incapable of running guest code. Completely compiled out by default, enable with configure --enable-fake-machine. With -fake-machine, CPU use is negligible, and memory use is rather modest. Non-fake VM running F-14 live, right after boot: UIDPID PPID CSZ RSS PSR STIME TTY TIME CMD armbru 15707 2558 53 191837 414388 1 21:05 pts/300:00:29 [...] Same VM -fake-machine, after similar time elapsed: UIDPID PPID CSZ RSS PSR STIME TTY TIME CMD armbru 15742 2558 0 85129 9412 0 21:07 pts/300:00:00 [...] We're using a very similar patch for RHEL scalability testing. Interesting, but: 9432 anthony 20 0 153m 14m 5384 S0 0.2 0:00.22 qemu-system-x86 That's qemu-system-x86 -m 4 Sure you ran qemu-system-x86 -fake-machine? No, I didn't try it. My point was that -m 4 is already pretty small. In terms of memory overhead, the largest source is not really going to be addressed by -fake-machine (l1_phys_map and phys_ram_dirty). git-grep phys_ram_dirty finds nothing. Yeah, it's now ram_list[i].phys_dirty. l1_phys_map is (sizeof(void *) + sizeof(PhysPageDesc)) * mem_size_in_pages phys_dirty is mem_size_in_pages bytes. I don't really understand the point of not creating a VCPU with KVM. Is there some type of overhead in doing that? I briefly looked at both main loops, TCG's was the first one I happened to crack, and I didn't feel like doing both then. If the general approach is okay, I'll gladly investigate how to do it with KVM. I guess what I don't understand is why do you need to not run guest code? Specifically, if you remove the following, is it any less useful? diff --git a/cpu-exec.c b/cpu-exec.c index 8c9fb8b..cd1259a 100644 --- a/cpu-exec.c +++ b/cpu-exec.c @@ -230,7 +230,7 @@ int cpu_exec(CPUState *env1) uint8_t *tc_ptr; unsigned long next_tb; -if (cpu_halted(env1) == EXCP_HALTED) +if (fake_machine || cpu_halted(env1) == EXCP_HALTED) return EXCP_HALTED; Regards, Anthony Liguori
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/20/2011 04:33 AM, Daniel P. Berrange wrote: On Thu, Jan 20, 2011 at 09:44:05AM +0100, Gerd Hoffmann wrote: Hi, For (2), you cannot use bus=X,addr=Y because it makes assumptions about the PCI topology which may change in newer -M pc's. Why should the PCI topology for 'pc' ever change? We'll probably get q35 support some day, but when this lands I expect we'll see a new machine type 'q35', so '-m q35' will pick the ich9 chipset (which will have a different pci topology of course) and '-m pc' will pick the existing piix chipset (which will continue to look like it looks today). If the topology does ever change (eg in the kind of way anthony suggests, first bus only has the graphics card), I think libvirt is going to need a little work to adapt to the new topology, regardless of whether we currently specify a bus= arg to -device or not. I'm not sure there's anything we could do to future proof us to that kind of change. I assume that libvirt today assumes that it can use a set of PCI slots in bus 0, correct? Probably in the range 3-31? Such assumptions are very likely to break. Regards, Anthony Liguori Regards, Daniel
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/20/2011 02:44 AM, Gerd Hoffmann wrote: Hi, For (2), you cannot use bus=X,addr=Y because it makes assumptions about the PCI topology which may change in newer -M pc's. Why should the PCI topology for 'pc' ever change? We'll probably get q35 support some day, but when this lands I expect we'll see a new machine type 'q35', so '-m q35' will pick the ich9 chipset (which will have a different pci topology of course) and '-m pc' will pick the existing piix chipset (which will continue to look like it looks today). But then what's the default machine type? When I say -M pc, I really mean the default machine. At some point, "qemu-system-x86_64 -device virtio-net-pci,addr=2.0" Is not going to be a reliable way to invoke qemu because there's no way we can guarantee that slot 2 isn't occupied by a chipset device or some other default device. Regards, Anthony Liguori cheers, Gerd
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 01/20/2011 03:33 AM, Jan Kiszka wrote: On 2011-01-19 20:32, Blue Swirl wrote: On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori wrote: On 01/19/2011 07:15 AM, Markus Armbruster wrote: So they interact with KVM (need kvm_state), and they interact with the emulated PCI bus. Could you elaborate on the fundamental difference between the two interactions that makes you choose the (hypothetical) KVM bus over the PCI bus as device parent? It's almost arbitrary, but I would say it's the direction that I/Os flow. But if the underlying observation is that the device tree is not really a tree, you're 100% correct. This is part of why a factory interface that just takes a parent bus is too simplistic. I think we ought to introduce a -pci-device option that is specifically for creating PCI devices that doesn't require a parent bus argument but provides a way to specify stable addressing (for instancing, using a linear index). I think kvm_state should not be a property of any device or bus. It should be split to more logical pieces. Some parts of it could remain in CPUState, because they are associated with a VCPU. Also, for example irqfd could be considered to be similar object to char or block devices provided by QEMU to devices. Would it make sense to introduce new host types for passing parts of kvm_state to devices? I'd also make coalesced MMIO stuff part of memory object. We are not passing any state references when using cpu_physical_memory_rw(), but that could be changed. There are currently no VCPU-specific bits remaining in kvm_state. It may be a good idea to introduce an arch-specific kvm_state and move related bits over. It may also once be feasible to carve out memory management related fields if we have proper abstractions for that, but I'm not completely sure here. Anyway, all these things are secondary. The primary topic here is how to deal with kvm_state and its fields that have VM-global scope. The debate is really: 1) should we remove all passing of kvm_state and just assume it's static 2) deal with a couple places in the code where we need to figure out how to get at kvm_state I think we've only identified 1 real instance of (2) and it's resulted in some good discussions about how to model KVM devices vs. emulated devices. Honestly, (1) just stinks. I see absolutely no advantage to it at all. In the very worst case scenario, the thing we need to do is just reference an extern variable in a few places. That completely avoids all of the modelling discussions for now (while leaving for placeholder FIXMEs so the problem can be tackled later). I don't understand the resistance here. Regards, Anthony Liguori Jan
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka wrote: > On 2011-01-19 20:32, Blue Swirl wrote: >> On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori >> wrote: >>> On 01/19/2011 07:15 AM, Markus Armbruster wrote: So they interact with KVM (need kvm_state), and they interact with the emulated PCI bus. Could you elaborate on the fundamental difference between the two interactions that makes you choose the (hypothetical) KVM bus over the PCI bus as device parent? >>> >>> It's almost arbitrary, but I would say it's the direction that I/Os flow. >>> >>> But if the underlying observation is that the device tree is not really a >>> tree, you're 100% correct. This is part of why a factory interface that >>> just takes a parent bus is too simplistic. >>> >>> I think we ought to introduce a -pci-device option that is specifically for >>> creating PCI devices that doesn't require a parent bus argument but provides >>> a way to specify stable addressing (for instancing, using a linear index). >> >> I think kvm_state should not be a property of any device or bus. It >> should be split to more logical pieces. >> >> Some parts of it could remain in CPUState, because they are associated >> with a VCPU. >> >> Also, for example irqfd could be considered to be similar object to >> char or block devices provided by QEMU to devices. Would it make sense >> to introduce new host types for passing parts of kvm_state to devices? >> >> I'd also make coalesced MMIO stuff part of memory object. We are not >> passing any state references when using cpu_physical_memory_rw(), but >> that could be changed. > > There are currently no VCPU-specific bits remaining in kvm_state. I think fields vcpu_events, robust_singlestep, debugregs, kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the same for all VCPUs but still they are sort of CPU properties. I'm not sure about fd field. > It may > be a good idea to introduce an arch-specific kvm_state and move related > bits over. This should probably contain only irqchip_in_kernel, pit_in_kernel and many_ioeventfds, maybe fd. > It may also once be feasible to carve out memory management > related fields if we have proper abstractions for that, but I'm not > completely sure here. I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region, migration_log into the memory object. > Anyway, all these things are secondary. The primary topic here is how to > deal with kvm_state and its fields that have VM-global scope. If it is an opaque blob which contains various unrelated stuff, no clear place will be found. By the way, we don't have a QEMUState but instead use globals. Perhaps this should be reorganized as well. For fd field, maybe even using a global variable could be justified, since it is used for direct access to kernel, not unlike a system call.
Re: [Qemu-devel] usb redirection status report
Hi, On 01/20/2011 12:27 PM, Christoph Hellwig wrote: On Wed, Jan 19, 2011 at 07:15:47PM +0100, Hans de Goede wrote: Hi All, As most of you know I'm working on usb redirection (making client usb devices accessible in guests over the network). I thought it would be a good idea to write a short status report. So fat the following has been done: * written and posted a network protocol for usb redir. over the network, and send this to the list. How does this relate to the various existing usb over ip protocols, e.g. the one in drivers/staging/usbip/ in the Linux kernel tree? See the previous discussion about this on the qemu-devel list, specifically: http://lists.gnu.org/archive/html/qemu-devel/2010-12/msg8.html Regards, Hans
[Qemu-devel] Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: > When MSI is off, each interrupt needs to be bounced through the io > thread when it's set/cleared, so vhost-net causes more context switches and > higher CPU utilization than userspace virtio which handles networking in > the same thread. > > We'll need to fix this by adding level irq support in kvm irqfd, > for now disable vhost-net in these configurations. We need this if we want avoid bouncing vfio INtx through qemu too. > Signed-off-by: Michael S. Tsirkin > --- > > I need to report some error from virtio-pci > that would be handled specially (disable but don't > report an error) so I wanted one that's never likely to be used by a > userspace ioctl. I selected ERANGE but it'd > be easy to switch to something else. Comments? > > hw/vhost.c |4 +++- > hw/virtio-net.c |6 -- > hw/virtio-pci.c |3 +++ > hw/virtio.h |2 ++ > 4 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/hw/vhost.c b/hw/vhost.c > index 1d09ed0..c79765a 100644 > --- a/hw/vhost.c > +++ b/hw/vhost.c > @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice > *vdev) > > r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true); > if (r < 0) { > -fprintf(stderr, "Error binding guest notifier: %d\n", -r); > + if (r != -EVIRTIO_DISABLED) { > + fprintf(stderr, "Error binding guest notifier: %d\n", -r); > + } style - the above is tab indented. > goto fail_notifiers; > } > > diff --git a/hw/virtio-net.c b/hw/virtio-net.c > index ccb3e63..5de3fee 100644 > --- a/hw/virtio-net.c > +++ b/hw/virtio-net.c > @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, > uint8_t status) > if (!n->vhost_started) { > int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), > &n->vdev); > if (r < 0) { > -error_report("unable to start vhost net: %d: " > - "falling back on userspace virtio", -r); > +if (r != -EVIRTIO_DISABLED) { > +error_report("unable to start vhost net: %d: " > + "falling back on userspace virtio", -r); > +} > } else { > n->vhost_started = 1; > } > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c > index dd8887a..dbf4be0 100644 > --- a/hw/virtio-pci.c > +++ b/hw/virtio-pci.c > @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, > int n, bool assign) > EventNotifier *notifier = virtio_queue_get_guest_notifier(vq); > > if (assign) { > +if (!msix_enabled(&proxy->pci_dev)) { > +return -EVIRTIO_DISABLED; > +} > int r = event_notifier_init(notifier, 0); > if (r < 0) { > return r; > diff --git a/hw/virtio.h b/hw/virtio.h > index d8546d5..53bbdba 100644 > --- a/hw/virtio.h > +++ b/hw/virtio.h > @@ -98,6 +98,8 @@ typedef struct { > void (*vmstate_change)(void * opaque, bool running); > } VirtIOBindings; > > +#define EVIRTIO_DISABLED ERANGE > + > #define VIRTIO_PCI_QUEUE_MAX 64 > > #define VIRTIO_NO_VECTOR 0x I'm not a fan of having this special return value. Why doesn't virtio-pci only setup the set_guest_notifiers function pointer when msix is enabled? If not that, it could at least expose some virtio_foo_enabled() type feature that vhost could check. Thanks, Alex
Re: [Qemu-devel] [PATCH] target-arm: Set the right overflow bit for neon 32 and 64 bit saturating add/sub.
On 20 January 2011 17:16, Christophe Lyon wrote: > Set the right overflow bit for neon 32 and 64 bit saturating add/sub. > > Also move the neon 64 bit saturating add/sub helpers to neon_helper.c > for consistency with the 32 bits versions. > > There is probably still room for code commonalization though. > > Peter, this patch is based upon your patch 6f83e7d and adds the 64 bits case. > > Signed-off-by: Christophe Lyon > Signed-off-by: Peter Maydell You shouldn't really leave my sign-off in there if you've changed the patch. (Actually a lot of the changes in the tree I pointed you at have my sign-off and really oughtn't -- all I was doing was splitting and rearranging existing meego commits, so those are all unreviewed / untested; so you should just remove it anyway.) I'll review this tomorrow. -- PMM
[Qemu-devel] Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote: > On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: > > When MSI is off, each interrupt needs to be bounced through the io > > thread when it's set/cleared, so vhost-net causes more context switches and > > higher CPU utilization than userspace virtio which handles networking in > > the same thread. > > > > We'll need to fix this by adding level irq support in kvm irqfd, > > for now disable vhost-net in these configurations. > > > > Signed-off-by: Michael S. Tsirkin > > --- > > > > I need to report some error from virtio-pci > > that would be handled specially (disable but don't > > report an error) so I wanted one that's never likely to be used by a > > userspace ioctl. I selected ERANGE but it'd > > be easy to switch to something else. Comments? > > Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED? > > -Sridhar The error is reported by virtio-pci which does not know about vhost. I started with EVIRTIO_MSIX_DISABLED and made is shorter. Would EVIRTIO_MSIX_DISABLED be better? > > > > hw/vhost.c |4 +++- > > hw/virtio-net.c |6 -- > > hw/virtio-pci.c |3 +++ > > hw/virtio.h |2 ++ > > 4 files changed, 12 insertions(+), 3 deletions(-) > > > > diff --git a/hw/vhost.c b/hw/vhost.c > > index 1d09ed0..c79765a 100644 > > --- a/hw/vhost.c > > +++ b/hw/vhost.c > > @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, > > VirtIODevice *vdev) > > > > r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true); > > if (r < 0) { > > -fprintf(stderr, "Error binding guest notifier: %d\n", -r); > > + if (r != -EVIRTIO_DISABLED) { > > + fprintf(stderr, "Error binding guest notifier: %d\n", -r); > > + } > > goto fail_notifiers; > > } > > > > diff --git a/hw/virtio-net.c b/hw/virtio-net.c > > index ccb3e63..5de3fee 100644 > > --- a/hw/virtio-net.c > > +++ b/hw/virtio-net.c > > @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, > > uint8_t status) > > if (!n->vhost_started) { > > int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), > > &n->vdev); > > if (r < 0) { > > -error_report("unable to start vhost net: %d: " > > - "falling back on userspace virtio", -r); > > +if (r != -EVIRTIO_DISABLED) { > > +error_report("unable to start vhost net: %d: " > > + "falling back on userspace virtio", -r); > > +} > > } else { > > n->vhost_started = 1; > > } > > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c > > index dd8887a..dbf4be0 100644 > > --- a/hw/virtio-pci.c > > +++ b/hw/virtio-pci.c > > @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, > > int n, bool assign) > > EventNotifier *notifier = virtio_queue_get_guest_notifier(vq); > > > > if (assign) { > > +if (!msix_enabled(&proxy->pci_dev)) { > > +return -EVIRTIO_DISABLED; > > +} > > int r = event_notifier_init(notifier, 0); > > if (r < 0) { > > return r; > > diff --git a/hw/virtio.h b/hw/virtio.h > > index d8546d5..53bbdba 100644 > > --- a/hw/virtio.h > > +++ b/hw/virtio.h > > @@ -98,6 +98,8 @@ typedef struct { > > void (*vmstate_change)(void * opaque, bool running); > > } VirtIOBindings; > > > > +#define EVIRTIO_DISABLED ERANGE > > + > > #define VIRTIO_PCI_QUEUE_MAX 64 > > > > #define VIRTIO_NO_VECTOR 0x
[Qemu-devel] [PATCH] target-arm: Set the right overflow bit for neon 32 and 64 bit saturating add/sub.
Set the right overflow bit for neon 32 and 64 bit saturating add/sub. Also move the neon 64 bit saturating add/sub helpers to neon_helper.c for consistency with the 32 bits versions. There is probably still room for code commonalization though. Peter, this patch is based upon your patch 6f83e7d and adds the 64 bits case. Signed-off-by: Christophe Lyon Signed-off-by: Peter Maydell --- target-arm/helpers.h | 12 -- target-arm/neon_helper.c | 89 ++ target-arm/op_helper.c | 49 - target-arm/translate.c | 18 - 4 files changed, 105 insertions(+), 63 deletions(-) diff --git a/target-arm/helpers.h b/target-arm/helpers.h index b88ebae..8a2564e 100644 --- a/target-arm/helpers.h +++ b/target-arm/helpers.h @@ -137,10 +137,6 @@ DEF_HELPER_2(rsqrte_f32, f32, f32, env) DEF_HELPER_2(recpe_u32, i32, i32, env) DEF_HELPER_2(rsqrte_u32, i32, i32, env) DEF_HELPER_4(neon_tbl, i32, i32, i32, i32, i32) -DEF_HELPER_2(neon_add_saturate_u64, i64, i64, i64) -DEF_HELPER_2(neon_add_saturate_s64, i64, i64, i64) -DEF_HELPER_2(neon_sub_saturate_u64, i64, i64, i64) -DEF_HELPER_2(neon_sub_saturate_s64, i64, i64, i64) DEF_HELPER_2(add_cc, i32, i32, i32) DEF_HELPER_2(adc_cc, i32, i32, i32) @@ -160,10 +156,18 @@ DEF_HELPER_3(neon_qadd_u8, i32, env, i32, i32) DEF_HELPER_3(neon_qadd_s8, i32, env, i32, i32) DEF_HELPER_3(neon_qadd_u16, i32, env, i32, i32) DEF_HELPER_3(neon_qadd_s16, i32, env, i32, i32) +DEF_HELPER_3(neon_qadd_u32, i32, env, i32, i32) +DEF_HELPER_3(neon_qadd_s32, i32, env, i32, i32) DEF_HELPER_3(neon_qsub_u8, i32, env, i32, i32) DEF_HELPER_3(neon_qsub_s8, i32, env, i32, i32) DEF_HELPER_3(neon_qsub_u16, i32, env, i32, i32) DEF_HELPER_3(neon_qsub_s16, i32, env, i32, i32) +DEF_HELPER_3(neon_qsub_u32, i32, env, i32, i32) +DEF_HELPER_3(neon_qsub_s32, i32, env, i32, i32) +DEF_HELPER_3(neon_qadd_u64, i64, env, i64, i64) +DEF_HELPER_3(neon_qadd_s64, i64, env, i64, i64) +DEF_HELPER_3(neon_qsub_u64, i64, env, i64, i64) +DEF_HELPER_3(neon_qsub_s64, i64, env, i64, i64) DEF_HELPER_2(neon_hadd_s8, i32, i32, i32) DEF_HELPER_2(neon_hadd_u8, i32, i32, i32) diff --git a/target-arm/neon_helper.c b/target-arm/neon_helper.c index 20f3c16..c1619c0 100644 --- a/target-arm/neon_helper.c +++ b/target-arm/neon_helper.c @@ -198,6 +198,28 @@ NEON_VOP_ENV(qadd_u16, neon_u16, 2) #undef NEON_FN #undef NEON_USAT +uint32_t HELPER(neon_qadd_u32)(CPUState *env, uint32_t a, uint32_t b) +{ +uint32_t res = a + b; +if (res < a) { +SET_QC(); +res = ~0; +} +return res; +} + +uint64_t HELPER(neon_qadd_u64)(CPUState *env, uint64_t src1, uint64_t src2) +{ + uint64_t res; + + res = src1 + src2; + if (res < src1) { +SET_QC(); +res = ~(uint64_t)0; + } + return res; +} + #define NEON_SSAT(dest, src1, src2, type) do { \ int32_t tmp = (uint32_t)src1 + (uint32_t)src2; \ if (tmp != (type)tmp) { \ @@ -218,6 +240,28 @@ NEON_VOP_ENV(qadd_s16, neon_s16, 2) #undef NEON_FN #undef NEON_SSAT +uint32_t HELPER(neon_qadd_s32)(CPUState *env, uint32_t a, uint32_t b) +{ +uint32_t res = a + b; +if (((res ^ a) & SIGNBIT) && !((a ^ b) & SIGNBIT)) { +SET_QC(); +res = ~(((int32_t)a >> 31) ^ SIGNBIT); +} +return res; +} + +uint64_t HELPER(neon_qadd_s64)(CPUState *env, uint64_t src1, uint64_t src2) +{ + uint64_t res; + + res = src1 + src2; + if (((res ^ src1) & SIGNBIT64) && !((src1 ^ src2) & SIGNBIT64)) { +SET_QC(); +res = ((int64_t)src1 >> 63) ^ ~SIGNBIT64; + } + return res; +} + #define NEON_USAT(dest, src1, src2, type) do { \ uint32_t tmp = (uint32_t)src1 - (uint32_t)src2; \ if (tmp != (type)tmp) { \ @@ -234,6 +278,29 @@ NEON_VOP_ENV(qsub_u16, neon_u16, 2) #undef NEON_FN #undef NEON_USAT +uint32_t HELPER(neon_qsub_u32)(CPUState *env, uint32_t a, uint32_t b) +{ +uint32_t res = a - b; +if (res > a) { +SET_QC(); +res = 0; +} +return res; +} + +uint64_t HELPER(neon_qsub_u64)(CPUState *env, uint64_t src1, uint64_t src2) +{ + uint64_t res; + + if (src1 < src2) { +SET_QC(); +res = 0; + } else { +res = src1 - src2; + } + return res; +} + #define NEON_SSAT(dest, src1, src2, type) do { \ int32_t tmp = (uint32_t)src1 - (uint32_t)src2; \ if (tmp != (type)tmp) { \ @@ -254,6 +321,28 @@ NEON_VOP_ENV(qsub_s16, neon_s16, 2) #undef NEON_FN #undef NEON_SSAT +uint32_t HELPER(neon_qsub_s32)(CPUState *env, uint32_t a, uint32_t b) +{ +uint32_t res = a - b; +if (((res ^ a) & SIGNBIT) && ((a ^ b) & SIGNBIT)) { +SET_QC(); +res = ~(((int32_t)a >> 31) ^ SIGNBIT); +} +return res; +} + +uint64_t HELPER(neon_qsub_s64)(CPUState *env, uint64_t src1, uint64_t src2) +{ + uint64_t res; + + res = src1 - src2; + if (((res ^ src1) & SIGNBIT64) && ((src1 ^ src2) & SIGNBIT64)) { +SET_QC(); +res = ((int64_t)src1 >> 63) ^ ~SIGNBIT64; + } + return res; +} + #define NEON_FN(dest, src1, src2)
Re: [Qemu-devel] [RFC PATCH] Fake machine for scalability testing
Anthony Liguori writes: > On 01/18/2011 02:16 PM, Markus Armbruster wrote: >> The problem: you want to do serious scalability testing (1000s of VMs) >> of your management stack. If each guest eats up a few 100MiB and >> competes for CPU, that requires a serious host machine. Which you don't >> have. You also don't want to modify the management stack at all, if you >> can help it. >> >> The solution: a perfectly normal-looking QEMU that uses minimal >> resources. Ability to execute any guest code is strictly optional ;) >> >> New option -fake-machine creates a fake machine incapable of running >> guest code. Completely compiled out by default, enable with configure >> --enable-fake-machine. >> >> With -fake-machine, CPU use is negligible, and memory use is rather >> modest. >> >> Non-fake VM running F-14 live, right after boot: >> UIDPID PPID CSZ RSS PSR STIME TTY TIME CMD >> armbru 15707 2558 53 191837 414388 1 21:05 pts/300:00:29 [...] >> >> Same VM -fake-machine, after similar time elapsed: >> UIDPID PPID CSZ RSS PSR STIME TTY TIME CMD >> armbru 15742 2558 0 85129 9412 0 21:07 pts/300:00:00 [...] >> >> We're using a very similar patch for RHEL scalability testing. >> > > Interesting, but: > > 9432 anthony 20 0 153m 14m 5384 S0 0.2 0:00.22 > qemu-system-x86 > > That's qemu-system-x86 -m 4 Sure you ran qemu-system-x86 -fake-machine? > In terms of memory overhead, the largest source is not really going to > be addressed by -fake-machine (l1_phys_map and phys_ram_dirty). git-grep phys_ram_dirty finds nothing. > I don't really understand the point of not creating a VCPU with KVM. > Is there some type of overhead in doing that? I briefly looked at both main loops, TCG's was the first one I happened to crack, and I didn't feel like doing both then. If the general approach is okay, I'll gladly investigate how to do it with KVM.
Re: [Qemu-devel] [PATCH 1/5] Fast Virtual Disk (FVD) Proposal Part 1
Am 20.01.2011 15:49, schrieb Chunqiang Tang: Please try to split the patches into logical parts, and use descriptive subject lines for each patch. E.g. adding the new sim command to qemu-io could be one patch, adding the img_update (why not just update?) command to qemu-img another, moving code into qemu-tool-time.c one more, etc. Will do and thank you for the detailed instructions. - + Please do not introduce random whitespace changes in patches. Stefan Weil previously suggested removing spaces at the end of a line, and I used a script to do that. It seems that in this example, the old code has multiple spaces on an empty line, which were automatically removed by the script. Yes, that's a problem with some parts of the old code. For files which you want to modify, you could remove the spaces with your script before applying your other modifications and create a separate patch which only removes the superfluous spaces. So your patch series would start with patches which only remove spaces at line endings (and say so in the patch descriptions). Then these changes are no longer random whitespace changes. Regards, Stefan Weil
[Qemu-devel] [PATCH v4 3/3] qcow2: Batch flushes for COW
qcow2 calls bdrv_flush() after performing COW in order to ensure that the L2 table change is never written before the copy is safe on disk. Now that the L2 table is cached, we can wait with flushing until we write out the next L2 table. Signed-off-by: Kevin Wolf --- block/qcow2-cache.c | 20 +--- block/qcow2-cluster.c |2 +- block/qcow2.h |1 + 3 files changed, 19 insertions(+), 4 deletions(-) diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c index 5f1740b..8f2955b 100644 --- a/block/qcow2-cache.c +++ b/block/qcow2-cache.c @@ -38,6 +38,7 @@ struct Qcow2Cache { int size; Qcow2CachedTable* entries; struct Qcow2Cache* depends; +booldepends_on_flush; boolwritethrough; }; @@ -85,13 +86,15 @@ static int qcow2_cache_flush_dependency(BlockDriverState *bs, Qcow2Cache *c) } c->depends = NULL; +c->depends_on_flush = false; + return 0; } static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i) { BDRVQcowState *s = bs->opaque; -int ret; +int ret = 0; if (!c->entries[i].dirty || !c->entries[i].offset) { return 0; @@ -99,11 +102,17 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i) if (c->depends) { ret = qcow2_cache_flush_dependency(bs, c); -if (ret < 0) { -return ret; +} else if (c->depends_on_flush) { +ret = bdrv_flush(bs->file); +if (ret >= 0) { +c->depends_on_flush = false; } } +if (ret < 0) { +return ret; +} + if (c == s->refcount_block_cache) { BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE_PART); } else if (c == s->l2_table_cache) { @@ -167,6 +176,11 @@ int qcow2_cache_set_dependency(BlockDriverState *bs, Qcow2Cache *c, return 0; } +void qcow2_cache_depends_on_flush(Qcow2Cache *c) +{ +c->depends_on_flush = true; +} + static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c) { int i; diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 7fa4fa1..716562f 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -637,7 +637,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m) * handled. */ if (cow) { -bdrv_flush(bs->file); +qcow2_cache_depends_on_flush(s->l2_table_cache); } qcow2_cache_set_dependency(bs, s->l2_table_cache, s->refcount_block_cache); diff --git a/block/qcow2.h b/block/qcow2.h index 11cbce3..6d80120 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -229,6 +229,7 @@ void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table); int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c); int qcow2_cache_set_dependency(BlockDriverState *bs, Qcow2Cache *c, Qcow2Cache *dependency); +void qcow2_cache_depends_on_flush(Qcow2Cache *c); int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, void **table); -- 1.7.2.3
[Qemu-devel] [PATCH v4 2/3] qcow2: Use QcowCache
Use the new functions of qcow2-cache.c for everything that works on refcount block and L2 tables. Signed-off-by: Kevin Wolf --- block/qcow2-cache.c| 10 ++ block/qcow2-cluster.c | 207 +- block/qcow2-refcount.c | 258 block/qcow2.c | 48 - block/qcow2.h | 12 ++- 5 files changed, 239 insertions(+), 296 deletions(-) diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c index f7c4e2a..5f1740b 100644 --- a/block/qcow2-cache.c +++ b/block/qcow2-cache.c @@ -104,6 +104,12 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i) } } +if (c == s->refcount_block_cache) { +BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE_PART); +} else if (c == s->l2_table_cache) { +BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE); +} + ret = bdrv_pwrite(bs->file, c->entries[i].offset, c->entries[i].table, s->cluster_size); if (ret < 0) { @@ -218,6 +224,10 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c, c->entries[i].offset = 0; if (read_from_disk) { +if (c == s->l2_table_cache) { +BLKDBG_EVENT(bs->file, BLKDBG_L2_LOAD); +} + ret = bdrv_pread(bs->file, offset, c->entries[i].table, s->cluster_size); if (ret < 0) { return ret; diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index c3ef550..7fa4fa1 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -67,7 +67,11 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size, bool exact_size) qemu_free(new_l1_table); return new_l1_table_offset; } -bdrv_flush(bs->file); + +ret = qcow2_cache_flush(bs, s->refcount_block_cache); +if (ret < 0) { +return ret; +} BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_WRITE_TABLE); for(i = 0; i < s->l1_size; i++) @@ -98,63 +102,6 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size, bool exact_size) return ret; } -void qcow2_l2_cache_reset(BlockDriverState *bs) -{ -BDRVQcowState *s = bs->opaque; - -memset(s->l2_cache, 0, s->l2_size * L2_CACHE_SIZE * sizeof(uint64_t)); -memset(s->l2_cache_offsets, 0, L2_CACHE_SIZE * sizeof(uint64_t)); -memset(s->l2_cache_counts, 0, L2_CACHE_SIZE * sizeof(uint32_t)); -} - -static inline int l2_cache_new_entry(BlockDriverState *bs) -{ -BDRVQcowState *s = bs->opaque; -uint32_t min_count; -int min_index, i; - -/* find a new entry in the least used one */ -min_index = 0; -min_count = 0x; -for(i = 0; i < L2_CACHE_SIZE; i++) { -if (s->l2_cache_counts[i] < min_count) { -min_count = s->l2_cache_counts[i]; -min_index = i; -} -} -return min_index; -} - -/* - * seek_l2_table - * - * seek l2_offset in the l2_cache table - * if not found, return NULL, - * if found, - * increments the l2 cache hit count of the entry, - * if counter overflow, divide by two all counters - * return the pointer to the l2 cache entry - * - */ - -static uint64_t *seek_l2_table(BDRVQcowState *s, uint64_t l2_offset) -{ -int i, j; - -for(i = 0; i < L2_CACHE_SIZE; i++) { -if (l2_offset == s->l2_cache_offsets[i]) { -/* increment the hit count */ -if (++s->l2_cache_counts[i] == 0x) { -for(j = 0; j < L2_CACHE_SIZE; j++) { -s->l2_cache_counts[j] >>= 1; -} -} -return s->l2_cache + (i << s->l2_bits); -} -} -return NULL; -} - /* * l2_load * @@ -169,33 +116,11 @@ static int l2_load(BlockDriverState *bs, uint64_t l2_offset, uint64_t **l2_table) { BDRVQcowState *s = bs->opaque; -int min_index; int ret; -/* seek if the table for the given offset is in the cache */ - -*l2_table = seek_l2_table(s, l2_offset); -if (*l2_table != NULL) { -return 0; -} - -/* not found: load a new entry in the least used one */ - -min_index = l2_cache_new_entry(bs); -*l2_table = s->l2_cache + (min_index << s->l2_bits); - -BLKDBG_EVENT(bs->file, BLKDBG_L2_LOAD); -ret = bdrv_pread(bs->file, l2_offset, *l2_table, -s->l2_size * sizeof(uint64_t)); -if (ret < 0) { -qcow2_l2_cache_reset(bs); -return ret; -} +ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset, (void**) l2_table); -s->l2_cache_offsets[min_index] = l2_offset; -s->l2_cache_counts[min_index] = 1; - -return 0; +return ret; } /* @@ -238,7 +163,6 @@ static int write_l1_entry(BlockDriverState *bs, int l1_index) static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table) { BDRVQcowState *s = bs->opaque; -int min_index; uint64_t old_l2_offset; uint64_t *l2_table; int64_t l2_offset; @@ -252,29 +176,48 @@ static int l2_allocate(BlockDriverState *
[Qemu-devel] [PATCH v4 1/3] qcow2: Add QcowCache
This adds some new cache functions to qcow2 which can be used for caching refcount blocks and L2 tables. When used with cache=writethrough they work like the old caching code which is spread all over qcow2, so for this case we have merely a cleanup. The interesting case is with writeback caching (this includes cache=none) where data isn't written to disk immediately but only kept in cache initially. This leads to some form of metadata write batching which avoids the current "write to refcount block, flush, write to L2 table" pattern for each single request when a lot of cluster allocations happen. Instead, cache entries are only written out if its required to maintain the right order. In the pure cluster allocation case this means that all metadata updates for requests are done in memory initially and on sync, first the refcount blocks are written to disk, then fsync, then L2 tables. This improves performance of scenarios with lots of cluster allocations noticably (e.g. installation or after taking a snapshot). Signed-off-by: Kevin Wolf --- Makefile.objs |2 +- block/qcow2-cache.c | 290 +++ block/qcow2.h | 19 3 files changed, 310 insertions(+), 1 deletions(-) create mode 100644 block/qcow2-cache.c diff --git a/Makefile.objs b/Makefile.objs index c3e52c5..65889c9 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -19,7 +19,7 @@ block-obj-$(CONFIG_POSIX) += posix-aio-compat.o block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o -block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o +block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-nested-y += qed-check.o block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c new file mode 100644 index 000..f7c4e2a --- /dev/null +++ b/block/qcow2-cache.c @@ -0,0 +1,290 @@ +/* + * L2/refcount table cache for the QCOW2 format + * + * Copyright (c) 2010 Kevin Wolf + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the "Software"), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include "block_int.h" +#include "qemu-common.h" +#include "qcow2.h" + +typedef struct Qcow2CachedTable { +void* table; +int64_t offset; +booldirty; +int cache_hits; +int ref; +} Qcow2CachedTable; + +struct Qcow2Cache { +int size; +Qcow2CachedTable* entries; +struct Qcow2Cache* depends; +boolwritethrough; +}; + +Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables, +bool writethrough) +{ +BDRVQcowState *s = bs->opaque; +Qcow2Cache *c; +int i; + +c = qemu_mallocz(sizeof(*c)); +c->size = num_tables; +c->entries = qemu_mallocz(sizeof(*c->entries) * num_tables); +c->writethrough = writethrough; + +for (i = 0; i < c->size; i++) { +c->entries[i].table = qemu_blockalign(bs, s->cluster_size); +} + +return c; +} + +int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c) +{ +int i; + +for (i = 0; i < c->size; i++) { +assert(c->entries[i].ref == 0); +qemu_vfree(c->entries[i].table); +} + +qemu_free(c->entries); +qemu_free(c); + +return 0; +} + +static int qcow2_cache_flush_dependency(BlockDriverState *bs, Qcow2Cache *c) +{ +int ret; + +ret = qcow2_cache_flush(bs, c->depends); +if (ret < 0) { +return ret; +} + +c->depends = NULL; +return 0; +} + +static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i) +{ +BDRVQcowState *s = bs->opaque; +int ret; + +if (!c->entries[i].dirty || !c->entries[i].offset) { +return 0; +} + +if (c->depends) { +ret = qcow2_cache_flush_dependency(
[Qemu-devel] [PATCH v4 0/3] qcow2 metadata cache
block-queue turned out to be too big effort to be useful for quickly fixing the performance problems that qcow2 got since we introduced the metadata flushes. While I still think the idea is right, it needs more time and qcow2 doesn't have more time. Let's come back to block-queue later when the most urgent qcow2 problems are fixed. So this is the idea of block-queue applied to the very specific case of qcow2. Whereas block-queue tried to be a generic solution for all kind of things and tried to make all writes asynchronous at the same time, this is only about batching writes to refcount blocks and L2 tables in qcow2 and getting the dependencies right. (Yes, the L1 table and refcount table is left alone. They are almost never written to anyway.) This should be much easier to understand and review, and I myself feel a bit more confident about it than with block-queue, too. v1: - Don't read newly allocated tables from the disk before memsetting them to zero v2: - Addressed Stefan's review comments - Added patch 3 to avoid an unnecessary bdrv_flush after COW v3: - Some error path fixes (esp. missing or double qcow2_cache_put) v4: - Pay attention to make writethrough performance not even worse in update_refcount - Add some blkdebug events back so that "failures" in qemu-iotests are kept minimal. Some diff between master and this series is left in test 026, but it's harmless. Kevin Wolf (3): qcow2: Add QcowCache qcow2: Use QcowCache qcow2: Batch flushes for COW Makefile.objs |2 +- block/qcow2-cache.c| 314 block/qcow2-cluster.c | 207 +++- block/qcow2-refcount.c | 258 --- block/qcow2.c | 48 +++- block/qcow2.h | 32 - 6 files changed, 564 insertions(+), 297 deletions(-) create mode 100644 block/qcow2-cache.c -- 1.7.2.3
Re: [Qemu-devel] [PATCH] target-arm: Fix garbage collection of temporaries in Neon emulation.
> I would personally prefer slightly less terse commit messages > (for instance it might be nice to list the affected instructions in > this case). The convention is also to preface the summary line > with the file or directory affected, ie "target-arm: Fix garbage > collection of temporaries in Neon emulation". > OK, so here is the same patch with an updated description: target-arm: Fix garbage collection of temporaries in Neon emulation of MULL and friends (VMLAL, VQDMLAL, VMLSL, VQDMLSL, VMULL, VQDMULL) as well as (VSHRN, VRSHRN, VQSHRN, VQRSHRN). Signed-off-by: Christophe Lyon --- target-arm/translate.c | 18 +- 1 files changed, 13 insertions(+), 5 deletions(-) diff --git a/target-arm/translate.c b/target-arm/translate.c index 57664bc..b3e3d70 100644 --- a/target-arm/translate.c +++ b/target-arm/translate.c @@ -4176,6 +4176,13 @@ static inline void gen_neon_mull(TCGv_i64 dest, TCGv a, TCGv b, int size, int u) break; default: abort(); } + +/* gen_helper_neon_mull_[su]{8|16} do not free their parameters. + Don't forget to clean them now. */ +if (size < 2) { + dead_tmp(a); + dead_tmp(b); +} } /* Translate a NEON data processing instruction. Return nonzero if the @@ -4840,7 +4847,7 @@ static int disas_neon_data_insn(CPUState * env, DisasContext *s, uint32_t insn) if (size == 3) { tcg_temp_free_i64(tmp64); } else { -dead_tmp(tmp2); +tcg_temp_free_i32(tmp2); } } else if (op == 10) { /* VSHLL */ @@ -5076,8 +5083,6 @@ static int disas_neon_data_insn(CPUState * env, DisasContext *s, uint32_t insn) case 8: case 9: case 10: case 11: case 12: case 13: /* VMLAL, VQDMLAL, VMLSL, VQDMLSL, VMULL, VQDMULL */ gen_neon_mull(cpu_V0, tmp, tmp2, size, u); -dead_tmp(tmp2); -dead_tmp(tmp); break; case 14: /* Polynomial VMULL */ cpu_abort(env, "Polynomial VMULL not implemented"); @@ -5228,6 +5233,10 @@ static int disas_neon_data_insn(CPUState * env, DisasContext *s, uint32_t insn) return 1; tmp2 = neon_get_scalar(size, rm); +/* We need a copy of tmp2 because gen_neon_mull + * deletes it during pass 0. */ +tmp4 = new_tmp(); +tcg_gen_mov_i32(tmp4, tmp2); tmp3 = neon_load_reg(rn, 1); for (pass = 0; pass < 2; pass++) { @@ -5235,9 +5244,9 @@ static int disas_neon_data_insn(CPUState * env, DisasContext *s, uint32_t insn) tmp = neon_load_reg(rn, 0); } else { tmp = tmp3; +tmp2 = tmp4; } gen_neon_mull(cpu_V0, tmp, tmp2, size, u); -dead_tmp(tmp); if (op == 6 || op == 7) { gen_neon_negl(cpu_V0, size); } @@ -5264,7 +5273,6 @@ static int disas_neon_data_insn(CPUState * env, DisasContext *s, uint32_t insn) neon_store_reg64(cpu_V0, rd + pass); } -dead_tmp(tmp2); break; default: /* 14 and 15 are RESERVED */ -- 1.7.2.3
[Qemu-devel] Re: [Xen-devel] [PATCH 2/5] qemu and qemu-xen: fix segfault on con_disconnect
Stefano Stabellini writes ("[Xen-devel] [PATCH 2/5] qemu and qemu-xen: fix segfault on con_disconnect"): > The test on xendev->gnttabdev in con_disconnect is wrong: what > differentiates the consoles is the dev number. Thanks, I have applied this patch to qemu-xen-unstable.git. Ian.
Re: [Qemu-devel] [RFC PATCH] Fake machine for scalability testing
On 01/18/2011 02:16 PM, Markus Armbruster wrote: The problem: you want to do serious scalability testing (1000s of VMs) of your management stack. If each guest eats up a few 100MiB and competes for CPU, that requires a serious host machine. Which you don't have. You also don't want to modify the management stack at all, if you can help it. The solution: a perfectly normal-looking QEMU that uses minimal resources. Ability to execute any guest code is strictly optional ;) New option -fake-machine creates a fake machine incapable of running guest code. Completely compiled out by default, enable with configure --enable-fake-machine. With -fake-machine, CPU use is negligible, and memory use is rather modest. Non-fake VM running F-14 live, right after boot: UIDPID PPID CSZ RSS PSR STIME TTY TIME CMD armbru 15707 2558 53 191837 414388 1 21:05 pts/300:00:29 [...] Same VM -fake-machine, after similar time elapsed: UIDPID PPID CSZ RSS PSR STIME TTY TIME CMD armbru 15742 2558 0 85129 9412 0 21:07 pts/300:00:00 [...] We're using a very similar patch for RHEL scalability testing. Interesting, but: 9432 anthony 20 0 153m 14m 5384 S0 0.2 0:00.22 qemu-system-x86 That's qemu-system-x86 -m 4 In terms of memory overhead, the largest source is not really going to be addressed by -fake-machine (l1_phys_map and phys_ram_dirty). I don't really understand the point of not creating a VCPU with KVM. Is there some type of overhead in doing that? Regards, Anthony Liguori HACK ALERT: Works by hacking the main loop so it never executes any guest code. Not implemented for KVM's main loop at this time, thus -fake-machine needs to force KVM off. It also replaces guest RAM by a token amount (pc machine only at this time), and forces -vga none, because VGA eats too much memory. Note the TODO and FIXME comments. Dan Berrange explored a different solution a while ago: a new do-nothing target, patterned after i386, and a new do-nothing machine, patterned after pc. His patch works. But it duplicates much target and machine code --- adds more than ten times as many lines as this patch. Keeping the duplicated code reasonably in sync would be bothersome. I didn't like that, talked it over with Dan, and we came up with this idea instead. Comments? Better ideas? --- configure | 12 cpu-exec.c |2 +- cpus.c |3 +++ hw/pc.c | 30 -- qemu-options.hx |7 +++ targphys.h |7 +++ vl.c| 21 + 7 files changed, 71 insertions(+), 11 deletions(-) diff --git a/configure b/configure index d68f862..98b0a5f 100755 --- a/configure +++ b/configure @@ -174,6 +174,7 @@ trace_backend="nop" trace_file="trace" spice="" rbd="" +fake_machine="no" # parse CC options first for opt do @@ -719,6 +720,10 @@ for opt do ;; --enable-rbd) rbd="yes" ;; + --disable-fake-machine) fake_machine="no" + ;; + --enable-fake-machine) fake_machine="yes" + ;; *) echo "ERROR: unknown option $opt"; show_help="yes" ;; esac @@ -913,6 +918,8 @@ echo " Default:trace-" echo " --disable-spice disable spice" echo " --enable-spice enable spice" echo " --enable-rbd enable building the rados block device (rbd)" +echo " --disable-fake-machine disable -fake-machine option" +echo " --enable-fake-machineenable -fake-machine option" echo "" echo "NOTE: The object files are built at the place where configure is launched" exit 1 @@ -2455,6 +2462,7 @@ echo "Trace output file $trace_file-" echo "spice support $spice" echo "rbd support $rbd" echo "xfsctl support$xfs" +echo "-fake-machine $fake_machine" if test $sdl_too_old = "yes"; then echo "-> Your SDL version is too old - please upgrade to have SDL support" @@ -2727,6 +2735,10 @@ if test "$spice" = "yes" ; then echo "CONFIG_SPICE=y">> $config_host_mak fi +if test $fake_machine = "yes" ; then + echo "CONFIG_FAKE_MACHINE=y">> $config_host_mak +fi + # XXX: suppress that if [ "$bsd" = "yes" ] ; then echo "CONFIG_BSD=y">> $config_host_mak diff --git a/cpu-exec.c b/cpu-exec.c index 8c9fb8b..cd1259a 100644 --- a/cpu-exec.c +++ b/cpu-exec.c @@ -230,7 +230,7 @@ int cpu_exec(CPUState *env1) uint8_t *tc_ptr; unsigned long next_tb; -if (cpu_halted(env1) == EXCP_HALTED) +if (fake_machine || cpu_halted(env1) == EXCP_HALTED) return EXCP_HALTED; cpu_single_env = env1; diff --git a/cpus.c b/cpus.c index 0309189..91e708f 100644 --- a/cpus.c +++ b/cpus.c @@ -128,6 +128,9 @@ static int cpu_can_run(CPUState *env) static int cpu_has_work(CPUState *env) { +if (fake_machine) { +return 0; +} if (env->stop) return 1; if (env->queued_work_first) diff --git a/hw/pc.c b/hw/pc.c i
[Qemu-devel] Re: [PATCH] vhost: force vhost off for non-MSI guests
On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote: > When MSI is off, each interrupt needs to be bounced through the io > thread when it's set/cleared, so vhost-net causes more context switches and > higher CPU utilization than userspace virtio which handles networking in > the same thread. > > We'll need to fix this by adding level irq support in kvm irqfd, > for now disable vhost-net in these configurations. > > Signed-off-by: Michael S. Tsirkin > --- > > I need to report some error from virtio-pci > that would be handled specially (disable but don't > report an error) so I wanted one that's never likely to be used by a > userspace ioctl. I selected ERANGE but it'd > be easy to switch to something else. Comments? Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED? -Sridhar > > hw/vhost.c |4 +++- > hw/virtio-net.c |6 -- > hw/virtio-pci.c |3 +++ > hw/virtio.h |2 ++ > 4 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/hw/vhost.c b/hw/vhost.c > index 1d09ed0..c79765a 100644 > --- a/hw/vhost.c > +++ b/hw/vhost.c > @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice > *vdev) > > r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true); > if (r < 0) { > -fprintf(stderr, "Error binding guest notifier: %d\n", -r); > + if (r != -EVIRTIO_DISABLED) { > + fprintf(stderr, "Error binding guest notifier: %d\n", -r); > + } > goto fail_notifiers; > } > > diff --git a/hw/virtio-net.c b/hw/virtio-net.c > index ccb3e63..5de3fee 100644 > --- a/hw/virtio-net.c > +++ b/hw/virtio-net.c > @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, > uint8_t status) > if (!n->vhost_started) { > int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), > &n->vdev); > if (r < 0) { > -error_report("unable to start vhost net: %d: " > - "falling back on userspace virtio", -r); > +if (r != -EVIRTIO_DISABLED) { > +error_report("unable to start vhost net: %d: " > + "falling back on userspace virtio", -r); > +} > } else { > n->vhost_started = 1; > } > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c > index dd8887a..dbf4be0 100644 > --- a/hw/virtio-pci.c > +++ b/hw/virtio-pci.c > @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, > int n, bool assign) > EventNotifier *notifier = virtio_queue_get_guest_notifier(vq); > > if (assign) { > +if (!msix_enabled(&proxy->pci_dev)) { > +return -EVIRTIO_DISABLED; > +} > int r = event_notifier_init(notifier, 0); > if (r < 0) { > return r; > diff --git a/hw/virtio.h b/hw/virtio.h > index d8546d5..53bbdba 100644 > --- a/hw/virtio.h > +++ b/hw/virtio.h > @@ -98,6 +98,8 @@ typedef struct { > void (*vmstate_change)(void * opaque, bool running); > } VirtIOBindings; > > +#define EVIRTIO_DISABLED ERANGE > + > #define VIRTIO_PCI_QUEUE_MAX 64 > > #define VIRTIO_NO_VECTOR 0x
[Qemu-devel] Re: [PATCH] hw/pl190.c: Fix writing of default vector address
On Thu, Jan 20, 2011 at 04:04:52PM +, Peter Maydell wrote: > The PL190 implementation keeps the default vector address > in vect_addr[16], but we weren't using this for writes to > the DEFVECTADDR register. As a result of this fix the > default_addr structure member is unused and we can delete it. > > Reported-by: Himanshu Chauhan > Signed-off-by: Peter Maydell > --- > (This patch addresses Aurelien's comments on Himanshu's > patch from November: http://patchwork.ozlabs.org/patch/69768/) > > hw/pl190.c |4 +--- > 1 files changed, 1 insertions(+), 3 deletions(-) > > diff --git a/hw/pl190.c b/hw/pl190.c > index 17c279b..75f2ba1 100644 > --- a/hw/pl190.c > +++ b/hw/pl190.c > @@ -21,7 +21,6 @@ typedef struct { > uint32_t soft_level; > uint32_t irq_enable; > uint32_t fiq_select; > -uint32_t default_addr; > uint8_t vect_control[16]; > uint32_t vect_addr[PL190_NUM_PRIO]; > /* Mask containing interrupts with higher priority than this one. */ > @@ -186,7 +185,7 @@ static void pl190_write(void *opaque, target_phys_addr_t > offset, uint32_t val) > s->priority = s->prev_prio[s->priority]; > break; > case 13: /* DEFVECTADDR */ > -s->default_addr = val; > +s->vect_addr[16] = val; > break; > case 0xc0: /* ITCR */ > if (val) { > @@ -252,7 +251,6 @@ static const VMStateDescription vmstate_pl190 = { > VMSTATE_UINT32(soft_level, pl190_state), > VMSTATE_UINT32(irq_enable, pl190_state), > VMSTATE_UINT32(fiq_select, pl190_state), > -VMSTATE_UINT32(default_addr, pl190_state), > VMSTATE_UINT8_ARRAY(vect_control, pl190_state, 16), > VMSTATE_UINT32_ARRAY(vect_addr, pl190_state, PL190_NUM_PRIO), > VMSTATE_UINT32_ARRAY(prio_mask, pl190_state, PL190_NUM_PRIO+1), > -- > 1.6.3.3 > Thanks, applied. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH] Fix block migration when the device size is not a multiple of 1 MB
2011/1/20 Pierre Riteau : > On 20 janv. 2011, at 03:06, Yoshiaki Tamura wrote: > >> 2011/1/19 Pierre Riteau : >>> b02bea3a85cc939f09aa674a3f1e4f36d418c007 added a check on the return >>> value of bdrv_write and aborts migration when it fails. However, if the >>> size of the block device to migrate is not a multiple of BLOCK_SIZE >>> (currently 1 MB), the last bdrv_write will fail with -EIO. >>> >>> Fixed by calling bdrv_write with the correct size of the last block. >>> --- >>> block-migration.c | 16 +++- >>> 1 files changed, 15 insertions(+), 1 deletions(-) >>> >>> diff --git a/block-migration.c b/block-migration.c >>> index 1475325..eeb9c62 100644 >>> --- a/block-migration.c >>> +++ b/block-migration.c >>> @@ -635,6 +635,8 @@ static int block_load(QEMUFile *f, void *opaque, int >>> version_id) >>> int64_t addr; >>> BlockDriverState *bs; >>> uint8_t *buf; >>> + int64_t total_sectors; >>> + int nr_sectors; >>> >>> do { >>> addr = qemu_get_be64(f); >>> @@ -656,10 +658,22 @@ static int block_load(QEMUFile *f, void *opaque, int >>> version_id) >>> return -EINVAL; >>> } >>> >>> + total_sectors = bdrv_getlength(bs) >> BDRV_SECTOR_BITS; >>> + if (total_sectors <= 0) { >>> + fprintf(stderr, "Error getting length of block device >>> %s\n", device_name); >>> + return -EINVAL; >>> + } >>> + >>> + if (total_sectors - addr < BDRV_SECTORS_PER_DIRTY_CHUNK) { >>> + nr_sectors = total_sectors - addr; >>> + } else { >>> + nr_sectors = BDRV_SECTORS_PER_DIRTY_CHUNK; >>> + } >>> + >>> buf = qemu_malloc(BLOCK_SIZE); >>> >>> qemu_get_buffer(f, buf, BLOCK_SIZE); >>> - ret = bdrv_write(bs, addr, buf, BDRV_SECTORS_PER_DIRTY_CHUNK); >>> + ret = bdrv_write(bs, addr, buf, nr_sectors); >>> >>> qemu_free(buf); >>> if (ret < 0) { >>> -- >>> 1.7.3.5 >>> >>> >>> >> >> Hi Pierre, >> >> I don't think the fix above is correct. If you have a file which >> isn't aliened with BLOCK_SIZE, you won't get an error with the >> patch. However, the receiver doesn't know how much sectors which >> the sender wants to be written, so the guest may fail after >> migration because some data may not be written. IIUC, although >> changing bytestream should be prevented as much as possible, we >> should save/load total_sectors to check appropriate file is >> allocated on the receiver side. > > Isn't the guest supposed to be started using a file with the correct size? I personally don't like that; It's insisting too much to the user. Can't we expand the image on the fly? We can just abort if expanding failed anyway. > But I guess changing the protocol would be best as it would avoid headaches > to people who mistakenly created a file that is too small. We should think carefully before changing the protocol. Kevin? > >> BTW, you should use error_report instead of fprintf(stderr, ...). > > I didn't know that, I followed what was used in this file. Thank you. > > -- > Pierre Riteau -- PhD student, Myriads team, IRISA, Rennes, France > http://perso.univ-rennes1.fr/pierre.riteau/ > > >
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote: > On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: > >When MSI is off, each interrupt needs to be bounced through the io > >thread when it's set/cleared, so vhost-net causes more context switches and > >higher CPU utilization than userspace virtio which handles networking in > >the same thread. > > > >We'll need to fix this by adding level irq support in kvm irqfd, > >for now disable vhost-net in these configurations. > > > >Signed-off-by: Michael S. Tsirkin > > I actually think this should be a terminal error. The user asks for > vhost-net, if we cannot enable it, we should exit. > > Or we should warn the user that they should expect bad performance. > Silently doing something that the user has explicitly asked us not > to do is not a good behavior. > > Regards, > > Anthony Liguori The issue is that user has no control of the guest, and can not know whether the guest enables MSI. So what you ask for will just make some guests fail, and others fail sometimes. The user also has no way to know that version X of kvm does not expose a way to inject level interrupts with irqfd. We could have *another* flag that says "use vhost where it helps" but then I think this is what everyone wants to do, anyway, and libvirt already sets vhost=on so I prefer redefining the meaning of an existing flag. Maybe this is best handled by a documentation update? We always said: "use vhost=on to enable experimental in kernel accelerator\n" note 'enable' not 'require'. This is similar to how we specify nvectors : you can not make guest use the feature. How about this: diff --git a/qemu-options.hx b/qemu-options.hx index 898561d..3c937c1 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -1061,6 +1061,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net, "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n" "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n" "use vhost=on to enable experimental in kernel accelerator\n" +"(note: vhost=on has no effect unless guest uses MSI-X)\n" "use 'vhostfd=h' to connect to an already opened vhost net device\n" #endif "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
[Qemu-devel] [PATCH] hw/pl190.c: Fix writing of default vector address
The PL190 implementation keeps the default vector address in vect_addr[16], but we weren't using this for writes to the DEFVECTADDR register. As a result of this fix the default_addr structure member is unused and we can delete it. Reported-by: Himanshu Chauhan Signed-off-by: Peter Maydell --- (This patch addresses Aurelien's comments on Himanshu's patch from November: http://patchwork.ozlabs.org/patch/69768/) hw/pl190.c |4 +--- 1 files changed, 1 insertions(+), 3 deletions(-) diff --git a/hw/pl190.c b/hw/pl190.c index 17c279b..75f2ba1 100644 --- a/hw/pl190.c +++ b/hw/pl190.c @@ -21,7 +21,6 @@ typedef struct { uint32_t soft_level; uint32_t irq_enable; uint32_t fiq_select; -uint32_t default_addr; uint8_t vect_control[16]; uint32_t vect_addr[PL190_NUM_PRIO]; /* Mask containing interrupts with higher priority than this one. */ @@ -186,7 +185,7 @@ static void pl190_write(void *opaque, target_phys_addr_t offset, uint32_t val) s->priority = s->prev_prio[s->priority]; break; case 13: /* DEFVECTADDR */ -s->default_addr = val; +s->vect_addr[16] = val; break; case 0xc0: /* ITCR */ if (val) { @@ -252,7 +251,6 @@ static const VMStateDescription vmstate_pl190 = { VMSTATE_UINT32(soft_level, pl190_state), VMSTATE_UINT32(irq_enable, pl190_state), VMSTATE_UINT32(fiq_select, pl190_state), -VMSTATE_UINT32(default_addr, pl190_state), VMSTATE_UINT8_ARRAY(vect_control, pl190_state, 16), VMSTATE_UINT32_ARRAY(vect_addr, pl190_state, PL190_NUM_PRIO), VMSTATE_UINT32_ARRAY(prio_mask, pl190_state, PL190_NUM_PRIO+1), -- 1.6.3.3
[Qemu-devel] [PATCH 2/5] Use vmstate to save/load spitz-lcdtg and corgi-ssp state
Signed-off-by: Dmitry Eremin-Solenikov --- hw/spitz.c | 74 ++- 1 files changed, 28 insertions(+), 46 deletions(-) diff --git a/hw/spitz.c b/hw/spitz.c index 3667618..87731d3 100644 --- a/hw/spitz.c +++ b/hw/spitz.c @@ -527,8 +527,8 @@ static void spitz_keyboard_register(PXA2xxState *cpu) typedef struct { SSISlave ssidev; -int bl_intensity; -int bl_power; +uint32_t bl_intensity; +uint32_t bl_power; } SpitzLCDTG; static void spitz_bl_update(SpitzLCDTG *s) @@ -592,21 +592,6 @@ static uint32_t spitz_lcdtg_transfer(SSISlave *dev, uint32_t value) return 0; } -static void spitz_lcdtg_save(QEMUFile *f, void *opaque) -{ -SpitzLCDTG *s = (SpitzLCDTG *)opaque; -qemu_put_be32(f, s->bl_intensity); -qemu_put_be32(f, s->bl_power); -} - -static int spitz_lcdtg_load(QEMUFile *f, void *opaque, int version_id) -{ -SpitzLCDTG *s = (SpitzLCDTG *)opaque; -s->bl_intensity = qemu_get_be32(f); -s->bl_power = qemu_get_be32(f); -return 0; -} - static int spitz_lcdtg_init(SSISlave *dev) { SpitzLCDTG *s = FROM_SSI_SLAVE(SpitzLCDTG, dev); @@ -615,8 +600,6 @@ static int spitz_lcdtg_init(SSISlave *dev) s->bl_power = 0; s->bl_intensity = 0x20; -register_savevm(&dev->qdev, "spitz-lcdtg", -1, 1, -spitz_lcdtg_save, spitz_lcdtg_load, s); return 0; } @@ -635,7 +618,7 @@ static DeviceState *max; typedef struct { SSISlave ssidev; SSIBus *bus[3]; -int enable[3]; +uint32_t enable[3]; } CorgiSSPState; static uint32_t corgi_ssp_transfer(SSISlave *dev, uint32_t value) @@ -677,30 +660,6 @@ static void spitz_adc_temp_on(void *opaque, int line, int level) max111x_set_input(max, MAX_BATT_TEMP, 0); } -static void spitz_ssp_save(QEMUFile *f, void *opaque) -{ -CorgiSSPState *s = (CorgiSSPState *)opaque; -int i; - -for (i = 0; i < 3; i++) { -qemu_put_be32(f, s->enable[i]); -} -} - -static int spitz_ssp_load(QEMUFile *f, void *opaque, int version_id) -{ -CorgiSSPState *s = (CorgiSSPState *)opaque; -int i; - -if (version_id != 1) { -return -EINVAL; -} -for (i = 0; i < 3; i++) { -s->enable[i] = qemu_get_be32(f); -} -return 0; -} - static int corgi_ssp_init(SSISlave *dev) { CorgiSSPState *s = FROM_SSI_SLAVE(CorgiSSPState, dev); @@ -710,8 +669,6 @@ static int corgi_ssp_init(SSISlave *dev) s->bus[1] = ssi_create_bus(&dev->qdev, "ssi1"); s->bus[2] = ssi_create_bus(&dev->qdev, "ssi2"); -register_savevm(&dev->qdev, "spitz_ssp", -1, 1, -spitz_ssp_save, spitz_ssp_load, s); return 0; } @@ -1070,16 +1027,41 @@ static void spitz_machine_init(void) machine_init(spitz_machine_init); +static const VMStateDescription vmstate_corgi_ssp_regs = { +.name = "corgi-ssp", +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT32_ARRAY(enable, CorgiSSPState, 3), +VMSTATE_END_OF_LIST(), +} +}; + static SSISlaveInfo corgi_ssp_info = { .qdev.name = "corgi-ssp", .qdev.size = sizeof(CorgiSSPState), +.qdev.vmsd = &vmstate_corgi_ssp_regs, .init = corgi_ssp_init, .transfer = corgi_ssp_transfer }; +static const VMStateDescription vmstate_spitz_lcdtg_regs = { +.name = "spitz-lcdtg", +.version_id = 1, +.minimum_version_id = 1, +.minimum_version_id_old = 1, +.fields = (VMStateField []) { +VMSTATE_UINT32(bl_intensity, SpitzLCDTG), +VMSTATE_UINT32(bl_power, SpitzLCDTG), +VMSTATE_END_OF_LIST(), +} +}; + static SSISlaveInfo spitz_lcdtg_info = { .qdev.name = "spitz-lcdtg", .qdev.size = sizeof(SpitzLCDTG), +.qdev.vmsd = &vmstate_spitz_lcdtg_regs, .init = spitz_lcdtg_init, .transfer = spitz_lcdtg_transfer }; -- 1.7.2.3
[Qemu-devel] [PATCH 1/5] SharpSL scoop device - convert to qdev
Convert SharpSL scoop device to qdev, remove lots of supporting code, as lot of init and gpio related things can now be done automagically. Signed-off-by: Dmitry Eremin-Solenikov --- hw/sharpsl.h |7 hw/spitz.c | 23 ++-- hw/tosa.c| 23 ++-- hw/zaurus.c | 111 ++--- 4 files changed, 75 insertions(+), 89 deletions(-) diff --git a/hw/sharpsl.h b/hw/sharpsl.h index c5ccf79..0b3a774 100644 --- a/hw/sharpsl.h +++ b/hw/sharpsl.h @@ -10,13 +10,6 @@ fprintf(stderr, "%s: " format, __FUNCTION__, ##__VA_ARGS__) /* zaurus.c */ -typedef struct ScoopInfo ScoopInfo; -ScoopInfo *scoop_init(PXA2xxState *cpu, -int instance, target_phys_addr_t target_base); -void scoop_gpio_set(void *opaque, int line, int level); -qemu_irq *scoop_gpio_in_get(ScoopInfo *s); -void scoop_gpio_out_set(ScoopInfo *s, int line, -qemu_irq handler); #define SL_PXA_PARAM_BASE 0xaa00 void sl_bootparam_write(target_phys_addr_t ptr); diff --git a/hw/spitz.c b/hw/spitz.c index 092bb64..3667618 100644 --- a/hw/spitz.c +++ b/hw/spitz.c @@ -23,6 +23,7 @@ #include "audio/audio.h" #include "boards.h" #include "blockdev.h" +#include "sysbus.h" #undef REG_FMT #define REG_FMT"0x%02lx" @@ -851,21 +852,21 @@ static void spitz_out_switch(void *opaque, int line, int level) #define SPITZ_SCP2_MIC_BIAS9 static void spitz_scoop_gpio_setup(PXA2xxState *cpu, -ScoopInfo *scp0, ScoopInfo *scp1) +DeviceState *scp0, DeviceState *scp1) { qemu_irq *outsignals = qemu_allocate_irqs(spitz_out_switch, cpu, 8); -scoop_gpio_out_set(scp0, SPITZ_SCP_CHRG_ON, outsignals[0]); -scoop_gpio_out_set(scp0, SPITZ_SCP_JK_B, outsignals[1]); -scoop_gpio_out_set(scp0, SPITZ_SCP_LED_GREEN, outsignals[2]); -scoop_gpio_out_set(scp0, SPITZ_SCP_LED_ORANGE, outsignals[3]); +qdev_connect_gpio_out(scp0, SPITZ_SCP_CHRG_ON, outsignals[0]); +qdev_connect_gpio_out(scp0, SPITZ_SCP_JK_B, outsignals[1]); +qdev_connect_gpio_out(scp0, SPITZ_SCP_LED_GREEN, outsignals[2]); +qdev_connect_gpio_out(scp0, SPITZ_SCP_LED_ORANGE, outsignals[3]); if (scp1) { -scoop_gpio_out_set(scp1, SPITZ_SCP2_BACKLIGHT_CONT, outsignals[4]); -scoop_gpio_out_set(scp1, SPITZ_SCP2_BACKLIGHT_ON, outsignals[5]); +qdev_connect_gpio_out(scp1, SPITZ_SCP2_BACKLIGHT_CONT, outsignals[4]); +qdev_connect_gpio_out(scp1, SPITZ_SCP2_BACKLIGHT_ON, outsignals[5]); } -scoop_gpio_out_set(scp0, SPITZ_SCP_ADC_TEMP_ON, outsignals[6]); +qdev_connect_gpio_out(scp0, SPITZ_SCP_ADC_TEMP_ON, outsignals[6]); } #define SPITZ_GPIO_HSYNC 22 @@ -952,7 +953,7 @@ static void spitz_common_init(ram_addr_t ram_size, const char *cpu_model, enum spitz_model_e model, int arm_id) { PXA2xxState *cpu; -ScoopInfo *scp0, *scp1 = NULL; +DeviceState *scp0, *scp1 = NULL; if (!cpu_model) cpu_model = (model == terrier) ? "pxa270-c5" : "pxa270-c0"; @@ -970,9 +971,9 @@ static void spitz_common_init(ram_addr_t ram_size, spitz_ssp_attach(cpu); -scp0 = scoop_init(cpu, 0, 0x1080); +scp0 = sysbus_create_simple("scoop", 0x1080, NULL); if (model != akita) { - scp1 = scoop_init(cpu, 1, 0x08800040); + scp1 = sysbus_create_simple("scoop", 0x08800040, NULL); } spitz_scoop_gpio_setup(cpu, scp0, scp1); diff --git a/hw/tosa.c b/hw/tosa.c index cc8ce6d..18e3be5 100644 --- a/hw/tosa.c +++ b/hw/tosa.c @@ -20,6 +20,7 @@ #include "i2c.h" #include "ssi.h" #include "blockdev.h" +#include "sysbus.h" #define TOSA_RAM0x0400 #define TOSA_ROM 0x0080 @@ -86,14 +87,14 @@ static void tosa_out_switch(void *opaque, int line, int level) static void tosa_gpio_setup(PXA2xxState *cpu, -ScoopInfo *scp0, -ScoopInfo *scp1, +DeviceState *scp0, +DeviceState *scp1, TC6393xbState *tmio) { qemu_irq *outsignals = qemu_allocate_irqs(tosa_out_switch, cpu, 4); /* MMC/SD host */ pxa2xx_mmci_handlers(cpu->mmc, -scoop_gpio_in_get(scp0)[TOSA_GPIO_SD_WP], +qdev_get_gpio_in(scp0, TOSA_GPIO_SD_WP), qemu_irq_invert(pxa2xx_gpio_in_get(cpu->gpio)[TOSA_GPIO_nSD_DETECT])); /* Handle reset */ @@ -108,12 +109,12 @@ static void tosa_gpio_setup(PXA2xxState *cpu, pxa2xx_gpio_in_get(cpu->gpio)[TOSA_GPIO_JC_CF_IRQ], NULL); -scoop_gpio_out_set(scp1, TOSA_GPIO_BT_LED, outsignals[0]); -scoop_gpio_out_set(scp1, TOSA_GPIO_NOTE_LED, outsignals[1]); -scoop_gpio_out_set(scp1, TOSA_GPIO_CHRG_ERR_LED, outsignals[2]); -scoop_gpio_out_set(scp1, TOSA_GPIO_WLAN_LED, outsignals[3]); +qdev_connect_gpio_out(scp1, TOSA_GPIO_BT_LED, outsignals[0]); +qdev_connect_gpio_ou
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
2011/1/20 Kevin Wolf : > Am 20.01.2011 14:50, schrieb Yoshiaki Tamura: >> 2011/1/20 Kevin Wolf : >>> Am 20.01.2011 11:39, schrieb Yoshiaki Tamura: 2011/1/20 Kevin Wolf : > Am 20.01.2011 06:19, schrieb Yoshiaki Tamura: >> + return; >> + } >> + >> + bdrv_aio_writev(bs, blk_req->reqs[0].sector, >> blk_req->reqs[0].qiov, >> + blk_req->reqs[0].nb_sectors, >> blk_req->reqs[0].cb, >> + blk_req->reqs[0].opaque); > > Same here. > >> + bdrv_flush(bs); > > This looks really strange. What is this supposed to do? > > One point is that you write it immediately after bdrv_aio_write, so > you > get an fsync for which you don't know if it includes the current write > request or if it doesn't. Which data do you want to get flushed to > the disk? I was expecting to flush the aio request that was just initiated. Am I misunderstanding the function? >>> >>> Seems so. The function names don't use really clear terminology either, >>> so you're not the first one to fall in this trap. Basically we have: >>> >>> * qemu_aio_flush() waits for all AIO requests to complete. I think you >>> wanted to have exactly this, but only for a single block device. Such a >>> function doesn't exist yet. >>> >>> * bdrv_flush() makes sure that all successfully completed requests are >>> written to disk (by calling fsync) >>> >>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run >>> the fsync in the thread pool >> >> Then what I wanted to do is, call qemu_aio_flush first, then >> bdrv_flush. It should be like live migration. > > Okay, that makes sense. :-) > > The other thing is that you introduce a bdrv_flush for each request, > basically forcing everyone to something very similar to writethrough > mode. I'm sure this will have a big impact on performance. The reason is to avoid inversion of queued requests. Although processing one-by-one is heavy, wouldn't having requests flushed to disk out of order break the disk image? >>> >>> No, that's fine. If a guest issues two requests at the same time, they >>> may complete in any order. You just need to make sure that you don't >>> call the completion callback before the request really has completed. >> >> We need to flush requests, meaning aio and fsync, before sending >> the final state of the guests, to make sure we can switch to the >> secondary safely. > > In theory I think you could just re-submit the requests on the secondary > if they had not completed yet. > > But you're right, let's keep things simple for the start. > >>> I'm just starting to wonder if the guest won't timeout the requests if >>> they are queued for too long. Even more, with IDE, it can only handle >>> one request at a time, so not completing requests doesn't sound like a >>> good idea at all. In what intervals is the event-tap queue flushed? >> >> The requests are flushed once each transaction completes. So >> it's not with specific intervals. > > Right. So when is a transaction completed? This is the time that a > single request will take. The transaction is completed when the vm state is sent to the secondary, and the primary receives the ack to it. Please let me know if the answer is too vague. What I can tell is that it can't be super fast. >>> On the other hand, if you complete before actually writing out, you >>> don't get timeouts, but you signal success to the guest when the request >>> could still fail. What would you do in this case? With a writeback cache >>> mode we're fine, we can just fail the next flush (until then nothing is >>> guaranteed to be on disk and order doesn't matter either), but with >>> cache=writethrough we're in serious trouble. >>> >>> Have you thought about this problem? Maybe we end up having to flush the >>> event-tap queue for each single write in writethrough mode. >> >> Yes, and that's what I'm trying to do at this point. > > Oh, I must have missed that code. Which patch/function should I look at? Maybe I miss-answered to your question. The device may receive timeouts. >>> >>> We should pay attention that the guest does not see timeouts. I'm not >>> expecting that I/O will be super fast, and as long as it is only a >>> performance problem we can live with it. >>> >>> However, as soon as the guest gets timeouts it reports I/O errors and >>> eventually offlines the block device. At this point it's not a >>> performance problem any more, but also a correctness problem. >>> >>> This is why I sugges
Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote: When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin I actually think this should be a terminal error. The user asks for vhost-net, if we cannot enable it, we should exit. Or we should warn the user that they should expect bad performance. Silently doing something that the user has explicitly asked us not to do is not a good behavior. Regards, Anthony Liguori --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? hw/vhost.c |4 +++- hw/virtio-net.c |6 -- hw/virtio-pci.c |3 +++ hw/virtio.h |2 ++ 4 files changed, 12 insertions(+), 3 deletions(-) diff --git a/hw/vhost.c b/hw/vhost.c index 1d09ed0..c79765a 100644 --- a/hw/vhost.c +++ b/hw/vhost.c @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev) r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true); if (r< 0) { -fprintf(stderr, "Error binding guest notifier: %d\n", -r); + if (r != -EVIRTIO_DISABLED) { + fprintf(stderr, "Error binding guest notifier: %d\n", -r); + } goto fail_notifiers; } diff --git a/hw/virtio-net.c b/hw/virtio-net.c index ccb3e63..5de3fee 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status) if (!n->vhost_started) { int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer),&n->vdev); if (r< 0) { -error_report("unable to start vhost net: %d: " - "falling back on userspace virtio", -r); +if (r != -EVIRTIO_DISABLED) { +error_report("unable to start vhost net: %d: " + "falling back on userspace virtio", -r); +} } else { n->vhost_started = 1; } diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index dd8887a..dbf4be0 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign) EventNotifier *notifier = virtio_queue_get_guest_notifier(vq); if (assign) { +if (!msix_enabled(&proxy->pci_dev)) { +return -EVIRTIO_DISABLED; +} int r = event_notifier_init(notifier, 0); if (r< 0) { return r; diff --git a/hw/virtio.h b/hw/virtio.h index d8546d5..53bbdba 100644 --- a/hw/virtio.h +++ b/hw/virtio.h @@ -98,6 +98,8 @@ typedef struct { void (*vmstate_change)(void * opaque, bool running); } VirtIOBindings; +#define EVIRTIO_DISABLED ERANGE + #define VIRTIO_PCI_QUEUE_MAX 64 #define VIRTIO_NO_VECTOR 0x
Re: [Qemu-devel] [PULL] virtio, pci, migration, hotplug
On 01/20/2011 08:54 AM, Michael S. Tsirkin wrote: This has a fix for a crash with multiple block devices. Probably makes sense to pull ASAP. The following changes since commit d788b57051ee91aa39de67cff8d8e15953bc100c: target-ppc: fix wrong NaN tests (2011-01-20 15:11:14 +0100) Pulled. Thanks. Regards, Anthony Liguori are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu.git for_anthony Alex Williamson (1): savevm: Fix no_migrate Amit Shah (1): virtio-serial-bus: bump up control vq size to 32 Isaku Yamahata (4): pci: deassert intx on reset. msi: simplify write config a bit. msix: simplify write config pci: use qemu_malloc() in pcibus_get_dev_path() Marcelo Tosatti (2): document QEMU<->ACPIBIOS PCI hotplug interface acpi_piix4: expose no_hotplug attribute via i/o port Michael S. Tsirkin (2): Merge remote branch 'origin/master' into pci pci: fix device paths cris-dis.c |9 ++- dis-asm.h |1 + disas.c |9 ++- docs/specs/acpi_pci_hotplug.txt | 37 +++ hw/acpi_piix4.c | 37 +++ hw/msi.c|5 +--- hw/msix.c |5 +--- hw/pci.c| 27 hw/pci.h|2 + hw/virtio-serial-bus.c | 10 +++- migration.c |4 +++ savevm.c| 40 - sysemu.h|1 + target-cris/translate.c | 41 +++--- target-cris/translate_v10.c | 10 +--- 15 files changed, 189 insertions(+), 49 deletions(-) create mode 100644 docs/specs/acpi_pci_hotplug.txt
[Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests
When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread. We'll need to fix this by adding level irq support in kvm irqfd, for now disable vhost-net in these configurations. Signed-off-by: Michael S. Tsirkin --- I need to report some error from virtio-pci that would be handled specially (disable but don't report an error) so I wanted one that's never likely to be used by a userspace ioctl. I selected ERANGE but it'd be easy to switch to something else. Comments? hw/vhost.c |4 +++- hw/virtio-net.c |6 -- hw/virtio-pci.c |3 +++ hw/virtio.h |2 ++ 4 files changed, 12 insertions(+), 3 deletions(-) diff --git a/hw/vhost.c b/hw/vhost.c index 1d09ed0..c79765a 100644 --- a/hw/vhost.c +++ b/hw/vhost.c @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev) r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true); if (r < 0) { -fprintf(stderr, "Error binding guest notifier: %d\n", -r); + if (r != -EVIRTIO_DISABLED) { + fprintf(stderr, "Error binding guest notifier: %d\n", -r); + } goto fail_notifiers; } diff --git a/hw/virtio-net.c b/hw/virtio-net.c index ccb3e63..5de3fee 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status) if (!n->vhost_started) { int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev); if (r < 0) { -error_report("unable to start vhost net: %d: " - "falling back on userspace virtio", -r); +if (r != -EVIRTIO_DISABLED) { +error_report("unable to start vhost net: %d: " + "falling back on userspace virtio", -r); +} } else { n->vhost_started = 1; } diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index dd8887a..dbf4be0 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign) EventNotifier *notifier = virtio_queue_get_guest_notifier(vq); if (assign) { +if (!msix_enabled(&proxy->pci_dev)) { +return -EVIRTIO_DISABLED; +} int r = event_notifier_init(notifier, 0); if (r < 0) { return r; diff --git a/hw/virtio.h b/hw/virtio.h index d8546d5..53bbdba 100644 --- a/hw/virtio.h +++ b/hw/virtio.h @@ -98,6 +98,8 @@ typedef struct { void (*vmstate_change)(void * opaque, bool running); } VirtIOBindings; +#define EVIRTIO_DISABLED ERANGE + #define VIRTIO_PCI_QUEUE_MAX 64 #define VIRTIO_NO_VECTOR 0x -- 1.7.3.2.91.g446ac
[Qemu-devel] [PULL] virtio, pci, migration, hotplug
This has a fix for a crash with multiple block devices. Probably makes sense to pull ASAP. The following changes since commit d788b57051ee91aa39de67cff8d8e15953bc100c: target-ppc: fix wrong NaN tests (2011-01-20 15:11:14 +0100) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu.git for_anthony Alex Williamson (1): savevm: Fix no_migrate Amit Shah (1): virtio-serial-bus: bump up control vq size to 32 Isaku Yamahata (4): pci: deassert intx on reset. msi: simplify write config a bit. msix: simplify write config pci: use qemu_malloc() in pcibus_get_dev_path() Marcelo Tosatti (2): document QEMU<->ACPIBIOS PCI hotplug interface acpi_piix4: expose no_hotplug attribute via i/o port Michael S. Tsirkin (2): Merge remote branch 'origin/master' into pci pci: fix device paths cris-dis.c |9 ++- dis-asm.h |1 + disas.c |9 ++- docs/specs/acpi_pci_hotplug.txt | 37 +++ hw/acpi_piix4.c | 37 +++ hw/msi.c|5 +--- hw/msix.c |5 +--- hw/pci.c| 27 hw/pci.h|2 + hw/virtio-serial-bus.c | 10 +++- migration.c |4 +++ savevm.c| 40 - sysemu.h|1 + target-cris/translate.c | 41 +++--- target-cris/translate_v10.c | 10 +--- 15 files changed, 189 insertions(+), 49 deletions(-) create mode 100644 docs/specs/acpi_pci_hotplug.txt
Re: [Qemu-devel] [PATCH 1/5] Fast Virtual Disk (FVD) Proposal Part 1
> Please try to split the patches into logical parts, and use descriptive > subject lines for each patch. > E.g. adding the new sim command to qemu-io could be one patch, adding > the img_update (why not just update?) command to qemu-img another, > moving code into qemu-tool-time.c one more, etc. Will do and thank you for the detailed instructions. > > - > > + > Please do not introduce random whitespace changes in patches. Stefan Weil previously suggested removing spaces at the end of a line, and I used a script to do that. It seems that in this example, the old code has multiple spaces on an empty line, which were automatically removed by the script.
Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model
On Thu, Jan 20, 2011 at 08:11:27PM +0530, M. Mohan Kumar wrote: > On Thursday 20 January 2011 2:29:54 pm Stefan Hajnoczi wrote: > > On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote: > > > > -if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) { > > > -/* > > > - * If we fail to change ownership and if we are > > > - * using security model none. Ignore the error > > > - */ > > > -if (fs_ctx->fs_sm != SM_NONE) { > > > -return -1; > > > -} > > > -} > > > +retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid); > > > > > > return 0; > > > > > > } > > > > retval is unused. > > > > That was used to disable the warning message "error: ignoring return value of > ‘lchown’, declared with attribute warn_unused_result" > > Otherwise I have to use > if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid)) { > ; > } > > > Can multiple virtio-9p requests execute at a time? chmod() and lchown() > > after creation is a race condition if other requests can execute > > concurrently. > > > > We can't implement file creation with requested user credentials and > permission > bits in the none security model atomically. Its expected behaviour only Well you could do the nasty trick of forking a child process and doing setuid/gid in that and then creating the file before letting the parent continue. if ((pid = fork()) == 0) { setuid(fc_uid); setgid(fc_gid); fd =open("foo", O_CREAT); close(fd); } else { waitpid(pid); } This kind of approach is in fact required if you want to be able to create files with a special uid/gid on a root squashing NFS server, because otherwise your QEMU running as root will have its files squashed to 'nobody' when initially created, and lchown will fail with EPERM. You might decide that root squashing NFS is too painful to care about supporting though :-) Regards, Daniel
Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model
On Thursday 20 January 2011 2:29:54 pm Stefan Hajnoczi wrote: > On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote: > > -if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) { > > -/* > > - * If we fail to change ownership and if we are > > - * using security model none. Ignore the error > > - */ > > -if (fs_ctx->fs_sm != SM_NONE) { > > -return -1; > > -} > > -} > > +retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid); > > > > return 0; > > > > } > > retval is unused. > That was used to disable the warning message "error: ignoring return value of ‘lchown’, declared with attribute warn_unused_result" Otherwise I have to use if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid)) { ; } > Can multiple virtio-9p requests execute at a time? chmod() and lchown() > after creation is a race condition if other requests can execute > concurrently. > We can't implement file creation with requested user credentials and permission bits in the none security model atomically. Its expected behaviour only M. Mohan Kumar
[Qemu-devel] Re: [PATCH] pci: fix pcibus_get_dev_path()
On Thu, Jan 20, 2011 at 06:31:59PM +0800, TeLeMan wrote: > The commit 6a7005d14b3c32d4864a718fb1cb19c789f58a5 used snprintf() > incorrectly. > > Signed-off-by: TeLeMan This won't work for nested bridges as \n by the first sprintf overwrites the rest of the string. I sent a fix titled [PATCH] pci: fix device paths look it up. > --- > hw/pci.c |4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/pci.c b/hw/pci.c > index 8d0e3df..9f8800d 100644 > --- a/hw/pci.c > +++ b/hw/pci.c > @@ -2050,14 +2050,14 @@ static char *pcibus_get_dev_path(DeviceState *dev) > path[path_len] = '\0'; > > /* First field is the domain. */ > -snprintf(path, domain_len, "%04x:00", pci_find_domain(d->bus)); > +snprintf(path, domain_len + 1, "%04x:00", pci_find_domain(d->bus)); > > /* Fill in slot numbers. We walk up from device to root, so need to print > * them in the reverse order, last to first. */ > p = path + path_len; > for (t = d; t; t = t->bus->parent_dev) { > p -= slot_len; > -snprintf(p, slot_len, ":%02x.%x", PCI_SLOT(t->devfn), > PCI_FUNC(d->devfn)); > +snprintf(p, slot_len + 1, ":%02x.%x", PCI_SLOT(t->devfn), > PCI_FUNC(d->devfn)); > } > > return path; > -- > 1.7.3.1.msysgit.0
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Am 20.01.2011 14:50, schrieb Yoshiaki Tamura: > 2011/1/20 Kevin Wolf : >> Am 20.01.2011 11:39, schrieb Yoshiaki Tamura: >>> 2011/1/20 Kevin Wolf : Am 20.01.2011 06:19, schrieb Yoshiaki Tamura: > +return; > +} > + > +bdrv_aio_writev(bs, blk_req->reqs[0].sector, > blk_req->reqs[0].qiov, > +blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb, > +blk_req->reqs[0].opaque); Same here. > +bdrv_flush(bs); This looks really strange. What is this supposed to do? One point is that you write it immediately after bdrv_aio_write, so you get an fsync for which you don't know if it includes the current write request or if it doesn't. Which data do you want to get flushed to the disk? >>> >>> I was expecting to flush the aio request that was just initiated. >>> Am I misunderstanding the function? >> >> Seems so. The function names don't use really clear terminology either, >> so you're not the first one to fall in this trap. Basically we have: >> >> * qemu_aio_flush() waits for all AIO requests to complete. I think you >> wanted to have exactly this, but only for a single block device. Such a >> function doesn't exist yet. >> >> * bdrv_flush() makes sure that all successfully completed requests are >> written to disk (by calling fsync) >> >> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run >> the fsync in the thread pool > > Then what I wanted to do is, call qemu_aio_flush first, then > bdrv_flush. It should be like live migration. Okay, that makes sense. :-) The other thing is that you introduce a bdrv_flush for each request, basically forcing everyone to something very similar to writethrough mode. I'm sure this will have a big impact on performance. >>> >>> The reason is to avoid inversion of queued requests. Although >>> processing one-by-one is heavy, wouldn't having requests flushed >>> to disk out of order break the disk image? >> >> No, that's fine. If a guest issues two requests at the same time, they >> may complete in any order. You just need to make sure that you don't >> call the completion callback before the request really has completed. > > We need to flush requests, meaning aio and fsync, before sending > the final state of the guests, to make sure we can switch to the > secondary safely. In theory I think you could just re-submit the requests on the secondary if they had not completed yet. But you're right, let's keep things simple for the start. >> I'm just starting to wonder if the guest won't timeout the requests if >> they are queued for too long. Even more, with IDE, it can only handle >> one request at a time, so not completing requests doesn't sound like a >> good idea at all. In what intervals is the event-tap queue flushed? > > The requests are flushed once each transaction completes. So > it's not with specific intervals. Right. So when is a transaction completed? This is the time that a single request will take. >>> >>> The transaction is completed when the vm state is sent to the >>> secondary, and the primary receives the ack to it. Please let me >>> know if the answer is too vague. What I can tell is that it >>> can't be super fast. >>> >> On the other hand, if you complete before actually writing out, you >> don't get timeouts, but you signal success to the guest when the request >> could still fail. What would you do in this case? With a writeback cache >> mode we're fine, we can just fail the next flush (until then nothing is >> guaranteed to be on disk and order doesn't matter either), but with >> cache=writethrough we're in serious trouble. >> >> Have you thought about this problem? Maybe we end up having to flush the >> event-tap queue for each single write in writethrough mode. > > Yes, and that's what I'm trying to do at this point. Oh, I must have missed that code. Which patch/function should I look at? >>> >>> Maybe I miss-answered to your question. The device may receive >>> timeouts. >> >> We should pay attention that the guest does not see timeouts. I'm not >> expecting that I/O will be super fast, and as long as it is only a >> performance problem we can live with it. >> >> However, as soon as the guest gets timeouts it reports I/O errors and >> eventually offlines the block device. At this point it's not a >> performance problem any more, but also a correctness problem. >> >> This is why I suggested that we flush the event-tap queue (i.e. complete >> the transaction) immediately after an I/O request has been issued >> instead of waiting for other e
[Qemu-devel] Re: [PATCH] pci/pcie: make pci_find_device() ARI aware.
On Thu, Jan 20, 2011 at 03:57:39PM +0900, Isaku Yamahata wrote: > make pci_find_device() ARI aware. > > Signed-off-by: Isaku Yamahata > --- > hw/pci.c |7 +++ > 1 files changed, 7 insertions(+), 0 deletions(-) > > diff --git a/hw/pci.c b/hw/pci.c > index 8d0e3df..851f350 100644 > --- a/hw/pci.c > +++ b/hw/pci.c > @@ -1596,11 +1596,18 @@ PCIBus *pci_find_bus(PCIBus *bus, int bus_num) > > PCIDevice *pci_find_device(PCIBus *bus, int bus_num, int slot, int function) > { > +PCIDevice *d; > bus = pci_find_bus(bus, bus_num); > > if (!bus) > return NULL; > > +d = bus->parent_dev; > +if (d && pci_is_express(d) && > +pcie_cap_get_type(d) == PCI_EXP_TYPE_DOWNSTREAM && > +!pcie_cap_is_ari_enabled(d) && slot > 0) { > +return NULL; > +} > return bus->devices[PCI_DEVFN(slot, function)]; > } I'd like to split this express-specific code out in some way. Further, the downstream port always has a single slot. Maybe it shouldn't be a bus at all, but this needs some thought. How about we put flag in PCIBus that says that a single slot is supported? Downstream ports would just set it. > -- > 1.7.1.1
Re: [Qemu-devel] Missing roms/seabios/Makefile and roms/vgabios/Makefile
On 20/01/11 10:58, Stefan Hajnoczi wrote: On Thu, Jan 20, 2011 at 10:51:17AM +, Mateusz Loskot wrote: On 19/01/11 18:07, Blue Swirl wrote: On Wed, Jan 19, 2011 at 12:44 PM, Mateusz Loskot wrote: Hi, Running ./configure (under MinGW/MSYS) and symlink gives up trying to create links to non-existing Makefiles in roms/seabios/Makefile roms/vgabios/Makefile Those directiories are actually git submodules, when you run 'git submodule update', they get populated. Something is not quite working and I don't get anything populated. Here I tried under MinGW mloskot@dog /g/src/qemu/_git/master $ git pull Already up-to-date. mloskot@dog /g/src/qemu/_git/master $ git submodule update mloskot@dog /g/src/qemu/_git/master $ ls roms/seabios/ config.mak Make sure roms/{vgabios,seabios} are empty directories. $ git submodule init $ git submodule update This should clone the vgabios and seabios repos and checkout the correct commit. After this completes successfully you should have source trees under roms/{vgabios,seabios}. I forgot about the "init" command. Works now. Thanks! Best regards, -- Mateusz Loskot, http://mateusz.loskot.net Charter Member of OSGeo, http://osgeo.org Member of ACCU, http://accu.org
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
2011/1/20 Kevin Wolf : > Am 20.01.2011 11:39, schrieb Yoshiaki Tamura: >> 2011/1/20 Kevin Wolf : >>> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura: + return; + } + + bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov, + blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb, + blk_req->reqs[0].opaque); >>> >>> Same here. >>> + bdrv_flush(bs); >>> >>> This looks really strange. What is this supposed to do? >>> >>> One point is that you write it immediately after bdrv_aio_write, so you >>> get an fsync for which you don't know if it includes the current write >>> request or if it doesn't. Which data do you want to get flushed to the >>> disk? >> >> I was expecting to flush the aio request that was just initiated. >> Am I misunderstanding the function? > > Seems so. The function names don't use really clear terminology either, > so you're not the first one to fall in this trap. Basically we have: > > * qemu_aio_flush() waits for all AIO requests to complete. I think you > wanted to have exactly this, but only for a single block device. Such a > function doesn't exist yet. > > * bdrv_flush() makes sure that all successfully completed requests are > written to disk (by calling fsync) > > * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run > the fsync in the thread pool Then what I wanted to do is, call qemu_aio_flush first, then bdrv_flush. It should be like live migration. >>> >>> Okay, that makes sense. :-) >>> >>> The other thing is that you introduce a bdrv_flush for each request, >>> basically forcing everyone to something very similar to writethrough >>> mode. I'm sure this will have a big impact on performance. >> >> The reason is to avoid inversion of queued requests. Although >> processing one-by-one is heavy, wouldn't having requests flushed >> to disk out of order break the disk image? > > No, that's fine. If a guest issues two requests at the same time, they > may complete in any order. You just need to make sure that you don't > call the completion callback before the request really has completed. We need to flush requests, meaning aio and fsync, before sending the final state of the guests, to make sure we can switch to the secondary safely. >>> >>> In theory I think you could just re-submit the requests on the secondary >>> if they had not completed yet. >>> >>> But you're right, let's keep things simple for the start. >>> > I'm just starting to wonder if the guest won't timeout the requests if > they are queued for too long. Even more, with IDE, it can only handle > one request at a time, so not completing requests doesn't sound like a > good idea at all. In what intervals is the event-tap queue flushed? The requests are flushed once each transaction completes. So it's not with specific intervals. >>> >>> Right. So when is a transaction completed? This is the time that a >>> single request will take. >> >> The transaction is completed when the vm state is sent to the >> secondary, and the primary receives the ack to it. Please let me >> know if the answer is too vague. What I can tell is that it >> can't be super fast. >> > On the other hand, if you complete before actually writing out, you > don't get timeouts, but you signal success to the guest when the request > could still fail. What would you do in this case? With a writeback cache > mode we're fine, we can just fail the next flush (until then nothing is > guaranteed to be on disk and order doesn't matter either), but with > cache=writethrough we're in serious trouble. > > Have you thought about this problem? Maybe we end up having to flush the > event-tap queue for each single write in writethrough mode. Yes, and that's what I'm trying to do at this point. >>> >>> Oh, I must have missed that code. Which patch/function should I look at? >> >> Maybe I miss-answered to your question. The device may receive >> timeouts. > > We should pay attention that the guest does not see timeouts. I'm not > expecting that I/O will be super fast, and as long as it is only a > performance problem we can live with it. > > However, as soon as the guest gets timeouts it reports I/O errors and > eventually offlines the block device. At this point it's not a > performance problem any more, but also a correctness problem. > > This is why I suggested that we flush the event-tap queue (i.e. complete > the transaction) immediately after an I/O request has been issued > instead of waiting for other events that would complete the transaction. Right. event-tap doesn't queue at specific interval. It'll schedule the transaction as bh once events are tapp
[Qemu-devel] Re: [PATCH] pci: use qemu_malloc() in pcibus_get_dev_path()
On Thu, Jan 20, 2011 at 03:57:49PM +0900, Isaku Yamahata wrote: > use qemu_malloc() instead of direct use of malloc(). > > Signed-off-by: Isaku Yamahata > --- > hw/pci.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) Thanks, applied. > diff --git a/hw/pci.c b/hw/pci.c > index 851f350..86af0ee 100644 > --- a/hw/pci.c > +++ b/hw/pci.c > @@ -2053,7 +2053,7 @@ static char *pcibus_get_dev_path(DeviceState *dev) > path_len = domain_len + slot_len * slot_depth; > > /* Allocate memory, fill in the terminating null byte. */ > -path = malloc(path_len + 1 /* For '\0' */); > +path = qemu_malloc(path_len + 1 /* For '\0' */); > path[path_len] = '\0'; > > /* First field is the domain. */ > -- > 1.7.1.1
[Qemu-devel] Re: [PATCH] pci: make pci_bus_new() aware of pci domain
On Thu, Jan 20, 2011 at 03:57:56PM +0900, Isaku Yamahata wrote: > This patch makes pci bus creation aware of pci domain. > > Signed-off-by: Isaku Yamahata IMO domain support needs some thought, before we jump into it. Specifically: I am not sure a simple number is enough to address a domain. In linux, what seems to happen is that the pci segment reported by pci is used. What happens without acpi? I'll have to look - do you know? What will we do for acpi? We can load firmware or add a hardware interface to get the domains. How will migration work? Thanks, > --- > hw/pci.c | 19 ++- > hw/pci.h |7 --- > hw/piix_pci.c |2 +- > 3 files changed, 19 insertions(+), 9 deletions(-) > > diff --git a/hw/pci.c b/hw/pci.c > index 86af0ee..e1e7b25 100644 > --- a/hw/pci.c > +++ b/hw/pci.c > @@ -246,8 +246,11 @@ int pci_find_domain(const PCIBus *bus) > return -1; > } > > +/* create root pci bus. > + * If secondary pci bus is wanted, use pci_bridge_initfn() > + */ > void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent, > - const char *name, int devfn_min) > + const char *name, int domain, int devfn_min) > { > qbus_create_inplace(&bus->qbus, &pci_bus_info, parent, name); > assert(PCI_FUNC(devfn_min) == 0); > @@ -255,18 +258,19 @@ void pci_bus_new_inplace(PCIBus *bus, DeviceState > *parent, > > /* host bridge */ > QLIST_INIT(&bus->child); > -pci_host_bus_register(0, bus); /* for now only pci domain 0 is supported > */ > +pci_host_bus_register(domain, bus); > > vmstate_register(NULL, -1, &vmstate_pcibus, bus); > } > > -PCIBus *pci_bus_new(DeviceState *parent, const char *name, int devfn_min) > +PCIBus *pci_bus_new(DeviceState *parent, const char *name, > +int domain, int devfn_min) > { > PCIBus *bus; > > bus = qemu_mallocz(sizeof(*bus)); > bus->qbus.qdev_allocated = 1; > -pci_bus_new_inplace(bus, parent, name, devfn_min); > +pci_bus_new_inplace(bus, parent, name, domain, devfn_min); > return bus; > } > > @@ -292,13 +296,18 @@ void pci_bus_set_mem_base(PCIBus *bus, > target_phys_addr_t base) > bus->mem_base = base; > } > > +/* deprecated: kept for compatility of existing codes. > + * pci_bus_new() and pci_bus_irqs() should be used for root pci bus > + * like i440fx_init(). > + * pci_bridge_initfn() should be used for secondary pci bus > + */ > PCIBus *pci_register_bus(DeviceState *parent, const char *name, > pci_set_irq_fn set_irq, pci_map_irq_fn map_irq, > void *irq_opaque, int devfn_min, int nirq) > { > PCIBus *bus; > > -bus = pci_bus_new(parent, name, devfn_min); > +bus = pci_bus_new(parent, name, 0 /* domain = 0 for compat */, > devfn_min); > pci_bus_irqs(bus, set_irq, map_irq, irq_opaque, nirq); > return bus; > } > diff --git a/hw/pci.h b/hw/pci.h > index bc8d5bb..a0fd953 100644 > --- a/hw/pci.h > +++ b/hw/pci.h > @@ -229,14 +229,15 @@ typedef enum { > typedef int (*pci_hotplug_fn)(DeviceState *qdev, PCIDevice *pci_dev, >PCIHotplugState state); > void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent, > - const char *name, int devfn_min); > -PCIBus *pci_bus_new(DeviceState *parent, const char *name, int devfn_min); > + const char *name, int domain, int devfn_min); > +PCIBus *pci_bus_new(DeviceState *parent, const char *name, > +int domain, int devfn_min); > void pci_bus_irqs(PCIBus *bus, pci_set_irq_fn set_irq, pci_map_irq_fn > map_irq, >void *irq_opaque, int nirq); > void pci_bus_hotplug(PCIBus *bus, pci_hotplug_fn hotplug, DeviceState *dev); > PCIBus *pci_register_bus(DeviceState *parent, const char *name, > pci_set_irq_fn set_irq, pci_map_irq_fn map_irq, > - void *irq_opaque, int devfn_min, int nirq); > + void *irq_opaque, int devfn_min, int nirq); /* > deprecated */ > void pci_device_reset(PCIDevice *dev); > void pci_bus_reset(PCIBus *bus); > > diff --git a/hw/piix_pci.c b/hw/piix_pci.c > index 358da58..718983d 100644 > --- a/hw/piix_pci.c > +++ b/hw/piix_pci.c > @@ -226,7 +226,7 @@ PCIBus *i440fx_init(PCII440FXState **pi440fx_state, int > *piix3_devfn, qemu_irq * > > dev = qdev_create(NULL, "i440FX-pcihost"); > s = FROM_SYSBUS(I440FXState, sysbus_from_qdev(dev)); > -b = pci_bus_new(&s->busdev.qdev, NULL, 0); > +b = pci_bus_new(&s->busdev.qdev, NULL, 0, 0); > s->bus = b; > qdev_init_nofail(dev); > > -- > 1.7.1.1
Re: [Qemu-devel] [PATCH 1/5] Fast Virtual Disk (FVD) Proposal Part 1
On Wed, Jan 19, 2011 at 05:04:44PM -0500, Chunqiang Tang wrote: > Part 1 of the block device driver for the proposed FVD image format. > Multiple patches are used in order to manage the size of each patch. > This patch includes existing files that are modified by FVD. Please try to split the patches into logical parts, and use descriptive subject lines for each patch. E.g. adding the new sim command to qemu-io could be one patch, adding the img_update (why not just update?) command to qemu-img another, moving code into qemu-tool-time.c one more, etc. > - > + Please do not introduce random whitespace changes in patches.
[Qemu-devel] [PATCH 12/12] Threadlets: Add documentation
Signed-off-by: Arun R Bharadwaj Signed-off-by: Gautham R Shenoy Reviewed-by: Stefan Hajnoczi --- docs/async-support.txt | 141 1 files changed, 141 insertions(+), 0 deletions(-) create mode 100644 docs/async-support.txt diff --git a/docs/async-support.txt b/docs/async-support.txt new file mode 100644 index 000..9f22b9a --- /dev/null +++ b/docs/async-support.txt @@ -0,0 +1,141 @@ +== How to use the threadlets infrastructure supported in Qemu == + +== Threadlets == + +Q.1: What are threadlets ? +A.1: Threadlets is an infrastructure within QEMU that allows other subsystems + to offload possibly blocking work to a queue to be processed by a pool + of threads asynchronously. + +Q.2: When would one want to use threadlets ? +A.2: Threadlets are useful when there are operations that can be performed + outside the context of the VCPU/IO threads inorder to free these latter + to service any other guest requests. + +Q.3: I have some work that can be executed in an asynchronous context. How + should I go about it ? +A.3: One could follow the steps listed below: + + - Define a function which would do the asynchronous work. + static void my_threadlet_func(ThreadletWork *work) + { + } + + - Declare an object of type ThreadletWork; + ThreadletWork work; + + + - Assign a value to the "func" member of ThreadletWork object. + work.func = my_threadlet_func; + + - Submit the threadlet to the global queue. + submit_threadletwork(&work); + + - Continue servicing some other guest operations. + +Q.4: I want to my_threadlet_func to access some non-global data. How do I do + that ? +A.4: Suppose you want my_threadlet_func to access some non-global data-object + of type myPrivateData. In that case one could follow the following steps. + + - Define a member of the type ThreadletWork within myPrivateData. + typedef struct MyPrivateData { + ...; + ...; + ...; + ThreadletWork work; + } MyPrivateData; + + MyPrivateData my_data; + + - Initialize myData.work as described in A.3 + myData.work.func = my_threadlet_func; + submit_threadletwork(&myData.work); + + - Access the myData object inside my_threadlet_func() using container_of + primitive + static void my_threadlet_func(ThreadletWork *work) + { + myPrivateData *mydata_ptr; + mydata_ptr = container_of(work, myPrivateData, work); + + /* mydata_ptr now points to myData object */ + } + +Q.5: Are there any precautions one must take while sharing data with the + Asynchronous thread-pool ? +A.5: Yes, make sure that the helper function of the type my_threadlet_func() + does not access/modify data when it can be accessed or modified in the + context of VCPU thread or IO thread. This is because the asynchronous + threads in the pool can run in parallel with the VCPU/IOThreads as shown + in the figure. + + A typical workflow is as follows: + + VCPU/IOThread + | + | (1) + | + V +Offload work (2) + |---> to threadlets -> Helper thread + || | + || | + || (3) | (4) + || | + | Handle other Guest requests| + || | + || V + || (3) Signal the I/O Thread + |(6) | | + || / + || / + |V/ + | Do the post <-/ + | processing (5) + || + || (6) + |V + |-Yes-- More async work? + | + | (7) + No + | + | + . + . + +Hence one needs to make sure that in the steps (3) and (4) which run in +parallel, any global data is accessed within only one context. + +Q.6: I have queued a threadlet which I want to cancel. How do I do that ? +A.6: Threadlets framework provides the API cancel_threadlet: + - int cancel_threadletwork(ThreadletWork *work) + + The API scans the ThreadletQueue to see if (wor
[Qemu-devel] [PATCH 10/12] Move threadlet code to qemu-threadlets.c
This patch moves the threadlet queue API code to qemu-threadlets.c where these APIs can be used by other subsystems. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- posix-aio-compat.c | 137 qemu-threadlet.c | 115 qemu-threadlet.h | 25 + 3 files changed, 141 insertions(+), 136 deletions(-) diff --git a/posix-aio-compat.c b/posix-aio-compat.c index 96e28db..4fc9581 100644 --- a/posix-aio-compat.c +++ b/posix-aio-compat.c @@ -26,33 +26,13 @@ #include "qemu-common.h" #include "trace.h" #include "block_int.h" -#include "qemu-thread.h" +#include "qemu-threadlet.h" #include "block/raw-posix-aio.h" -#define MAX_GLOBAL_THREADS 64 -#define MIN_GLOBAL_THREADS 8 - static QemuMutex aiocb_mutex; static QemuCond aiocb_completion; -typedef struct ThreadletQueue -{ -QemuMutex lock; -QemuCond cond; -int max_threads; -int min_threads; -int cur_threads; -int idle_threads; -QTAILQ_HEAD(, ThreadletWork) request_list; -} ThreadletQueue; - -typedef struct ThreadletWork -{ -QTAILQ_ENTRY(ThreadletWork) node; -void (*func)(struct ThreadletWork *work); -} ThreadletWork; - struct qemu_paiocb { BlockDriverAIOCB common; int aio_fildes; @@ -79,10 +59,6 @@ typedef struct PosixAioState { struct qemu_paiocb *first_aio; } PosixAioState; -/* Default ThreadletQueue */ -static ThreadletQueue globalqueue; -static int globalqueue_init; - #ifdef CONFIG_PREADV static int preadv_present = 1; #else @@ -283,50 +259,6 @@ static ssize_t handle_aiocb_rw(struct qemu_paiocb *aiocb) return nbytes; } -static void *threadlet_worker(void *data) -{ -ThreadletQueue *queue = data; - -qemu_mutex_lock(&queue->lock); -while (1) { -ThreadletWork *work; -int ret = 0; - -while (QTAILQ_EMPTY(&queue->request_list) && - (ret != ETIMEDOUT)) { -/* wait for cond to be signalled or broadcast for 1000s */ -ret = qemu_cond_timedwait((&queue->cond), - &(queue->lock), 10*10); -} - -assert(queue->idle_threads != 0); -if (QTAILQ_EMPTY(&queue->request_list)) { -if (queue->cur_threads > queue->min_threads) { -/* We retain the minimum number of threads */ -break; -} -} else { -work = QTAILQ_FIRST(&queue->request_list); -QTAILQ_REMOVE(&queue->request_list, work, node); - -queue->idle_threads--; -qemu_mutex_unlock(&queue->lock); - -/* execute the work function */ -work->func(work); - -qemu_mutex_lock(&queue->lock); -queue->idle_threads++; -} -} - -queue->idle_threads--; -queue->cur_threads--; -qemu_mutex_unlock(&queue->lock); - -return NULL; -} - static PosixAioState *posix_aio_state; static void handle_work(ThreadletWork *work) @@ -371,48 +303,6 @@ static void handle_work(ThreadletWork *work) } } -static void spawn_threadlet(ThreadletQueue *queue) -{ -QemuThread thread; - -queue->cur_threads++; -queue->idle_threads++; - -qemu_thread_create(&thread, threadlet_worker, queue); -} - -/** - * submit_work: Submit to the global queue a new task to be executed - * asynchronously. - * @work: Contains information about the task that needs to be submitted. - */ -static void submit_work(ThreadletWork *work) -{ -qemu_mutex_lock(&globalqueue.lock); - -if (!globalqueue_init) { -globalqueue.cur_threads = 0; -globalqueue.idle_threads = 0; -globalqueue.max_threads = MAX_GLOBAL_THREADS; -globalqueue.min_threads = MIN_GLOBAL_THREADS; -QTAILQ_INIT(&globalqueue.request_list); -qemu_mutex_init(&globalqueue.lock); -qemu_cond_init(&globalqueue.cond); - -globalqueue_init = 1; -} - -if (globalqueue.idle_threads == 0 && -globalqueue.cur_threads < globalqueue.max_threads) { -spawn_threadlet(&globalqueue); - -} else { -qemu_cond_signal(&globalqueue.cond); -} -QTAILQ_INSERT_TAIL(&globalqueue.request_list, work, node); -qemu_mutex_unlock(&globalqueue.lock); -} - static ssize_t qemu_paio_return(struct qemu_paiocb *aiocb) { ssize_t ret; @@ -545,31 +435,6 @@ static void paio_remove(struct qemu_paiocb *acb) } } -/** - * dequeue_work: Cancel a task queued on the global queue. - * @work: Contains the information of the task that needs to be cancelled. - * - * Returns:0 if successfully dequeued work. - * 1 otherwise. - */ -static int dequeue_work(ThreadletWork *work) -{ -int ret = 1; -ThreadletWork *ret_work; - -qemu_mutex_lock(&globalqueue.lock); -QTAILQ_FOREACH(ret_work, &(globalqueue.request_list), node) { -if (ret_work == work) { -QTAILQ_REMOVE(&globalqu
[Qemu-devel] [PATCH 02/12] Introduce work concept in posix-aio-compat.c
This patch introduces work concept by introducing the ThreadletWork structure. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- posix-aio-compat.c | 20 ++-- 1 files changed, 14 insertions(+), 6 deletions(-) diff --git a/posix-aio-compat.c b/posix-aio-compat.c index 82862ec..ddf42f5 100644 --- a/posix-aio-compat.c +++ b/posix-aio-compat.c @@ -33,6 +33,10 @@ static QemuMutex aiocb_mutex; static QemuCond aiocb_completion; +typedef struct ThreadletWork +{ +QTAILQ_ENTRY(ThreadletWork) node; +} ThreadletWork; struct qemu_paiocb { BlockDriverAIOCB common; @@ -47,13 +51,13 @@ struct qemu_paiocb { int ev_signo; off_t aio_offset; -QTAILQ_ENTRY(qemu_paiocb) node; int aio_type; ssize_t ret; int active; struct qemu_paiocb *next; int async_context_id; +ThreadletWork work; }; typedef struct PosixAioState { @@ -69,7 +73,7 @@ static pthread_attr_t attr; static int max_threads = 64; static int cur_threads = 0; static int idle_threads = 0; -static QTAILQ_HEAD(, qemu_paiocb) request_list; +static QTAILQ_HEAD(, ThreadletWork) request_list; #ifdef CONFIG_PREADV static int preadv_present = 1; @@ -307,6 +311,7 @@ static ssize_t handle_aiocb_rw(struct qemu_paiocb *aiocb) static void *aio_thread(void *unused) { pid_t pid; +ThreadletWork *work; pid = getpid(); @@ -330,8 +335,11 @@ static void *aio_thread(void *unused) if (QTAILQ_EMPTY(&request_list)) break; -aiocb = QTAILQ_FIRST(&request_list); -QTAILQ_REMOVE(&request_list, aiocb, node); +work = QTAILQ_FIRST(&request_list); +QTAILQ_REMOVE(&request_list, work, node); + +aiocb = container_of(work, struct qemu_paiocb, work); + aiocb->active = 1; idle_threads--; mutex_unlock(&lock); @@ -398,7 +406,7 @@ static void qemu_paio_submit(struct qemu_paiocb *aiocb) mutex_lock(&lock); if (idle_threads == 0 && cur_threads < max_threads) spawn_thread(); -QTAILQ_INSERT_TAIL(&request_list, aiocb, node); +QTAILQ_INSERT_TAIL(&request_list, &aiocb->work, node); mutex_unlock(&lock); cond_signal(&cond); } @@ -548,7 +556,7 @@ static void paio_cancel(BlockDriverAIOCB *blockacb) qemu_mutex_lock(&aiocb_mutex); if (!acb->active) { -QTAILQ_REMOVE(&request_list, acb, node); +QTAILQ_REMOVE(&request_list, &acb->work, node); acb->ret = -ECANCELED; } else if (acb->ret == -EINPROGRESS) { active = 1;
[Qemu-devel] [PATCH 06/12] Threadlet: Add dequeue_work threadlet API and remove active field.
This patch adds dequeue_work threadlet API and shows how the paio_cancel changes to dequeue_work. The active field in the qemu_aiocb structure is now useless. Remove it. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- posix-aio-compat.c | 48 ++-- 1 files changed, 26 insertions(+), 22 deletions(-) diff --git a/posix-aio-compat.c b/posix-aio-compat.c index 62e32e3..95ba805 100644 --- a/posix-aio-compat.c +++ b/posix-aio-compat.c @@ -69,7 +69,6 @@ struct qemu_paiocb { int aio_type; ssize_t ret; -int active; struct qemu_paiocb *next; int async_context_id; @@ -345,9 +344,6 @@ static void handle_work(ThreadletWork *work) pid = getpid(); aiocb = container_of(work, struct qemu_paiocb, work); -qemu_mutex_lock(&aiocb_mutex); -aiocb->active = 1; -qemu_mutex_unlock(&aiocb_mutex); switch (aiocb->aio_type & QEMU_AIO_TYPE_MASK) { case QEMU_AIO_READ: @@ -452,7 +448,6 @@ static void qemu_paio_submit(struct qemu_paiocb *aiocb) { qemu_mutex_lock(&aiocb_mutex); aiocb->ret = -EINPROGRESS; -aiocb->active = 0; qemu_mutex_unlock(&aiocb_mutex); aiocb->work.func = handle_work; @@ -574,33 +569,42 @@ static void paio_remove(struct qemu_paiocb *acb) } } -static void paio_cancel(BlockDriverAIOCB *blockacb) +/** + * dequeue_work: Cancel a task queued on the global queue. + * @work: Contains the information of the task that needs to be cancelled. + * + * Returns:0 if successfully dequeued work. + * 1 otherwise. + */ +static int dequeue_work(ThreadletWork *work) { -struct qemu_paiocb *acb = (struct qemu_paiocb *)blockacb; -int active = 0; +int ret = 1; +ThreadletWork *ret_work; -qemu_mutex_lock(&aiocb_mutex); qemu_mutex_lock(&globalqueue.lock); -if (!acb->active) { -QTAILQ_REMOVE(&globalqueue.request_list, &acb->work, node); -acb->ret = -ECANCELED; -} else if (acb->ret == -EINPROGRESS) { -active = 1; +QTAILQ_FOREACH(ret_work, &(globalqueue.request_list), node) { +if (ret_work == work) { +QTAILQ_REMOVE(&globalqueue.request_list, ret_work, node); +ret = 0; +break; +} } qemu_mutex_unlock(&globalqueue.lock); -if (!active) { -acb->ret = -ECANCELED; -} else { +return ret; +} + +static void paio_cancel(BlockDriverAIOCB *blockacb) +{ +struct qemu_paiocb *acb = (struct qemu_paiocb *)blockacb; +if (dequeue_work(&acb->work) != 0) { +/* Wait for running work item to complete */ +qemu_mutex_lock(&aiocb_mutex); while (acb->ret == -EINPROGRESS) { -/* - * fail safe: if the aio could not be canceled, - * we wait for it - */ qemu_cond_wait(&aiocb_completion, &aiocb_mutex); } +qemu_mutex_unlock(&aiocb_mutex); } -qemu_mutex_unlock(&aiocb_mutex); paio_remove(acb); }
[Qemu-devel] [PATCH 09/12] Remove all instances of CONFIG_THREAD
Remove all instances of CONFIG_THREAD, which is not used anymore. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- configure |2 -- 1 files changed, 0 insertions(+), 2 deletions(-) diff --git a/configure b/configure index a079a49..addf733 100755 --- a/configure +++ b/configure @@ -2456,7 +2456,6 @@ if test "$vnc_png" != "no" ; then fi if test "$vnc_thread" != "no" ; then echo "CONFIG_VNC_THREAD=y" >> $config_host_mak - echo "CONFIG_THREAD=y" >> $config_host_mak fi if test "$fnmatch" = "yes" ; then echo "CONFIG_FNMATCH=y" >> $config_host_mak @@ -2534,7 +2533,6 @@ if test "$xen" = "yes" ; then fi if test "$io_thread" = "yes" ; then echo "CONFIG_IOTHREAD=y" >> $config_host_mak - echo "CONFIG_THREAD=y" >> $config_host_mak fi if test "$linux_aio" = "yes" ; then echo "CONFIG_LINUX_AIO=y" >> $config_host_mak
[Qemu-devel] [PATCH 11/12] Threadlets: Add functionality to create private queues.
This patch allows subsystems to create their own private queues with associated pools of threads. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- qemu-threadlet.c | 85 +- qemu-threadlet.h |4 +++ 2 files changed, 63 insertions(+), 26 deletions(-) diff --git a/qemu-threadlet.c b/qemu-threadlet.c index 6523f08..4bad8c6 100644 --- a/qemu-threadlet.c +++ b/qemu-threadlet.c @@ -99,6 +99,25 @@ static void spawn_threadlet(ThreadletQueue *queue) } /** + * submit_work_to_queue: Submit a new task to a private queue to be + *executed asynchronously. + * @queue: Per-subsystem private queue to which the new task needs + * to be submitted. + * @work: Contains information about the task that needs to be submitted. + */ +void submit_work_to_queue(ThreadletQueue *queue, ThreadletWork *work) +{ +qemu_mutex_lock(&queue->lock); +if (queue->idle_threads == 0 && queue->cur_threads < queue->max_threads) { +spawn_threadlet(queue); +} else { +qemu_cond_signal(&queue->cond); +} +QTAILQ_INSERT_TAIL(&queue->request_list, work, node); +qemu_mutex_unlock(&queue->lock); +} + +/** * submit_work: Submit to the global queue a new task to be executed * asynchronously. * @work: Contains information about the task that needs to be submitted. @@ -106,49 +125,63 @@ static void spawn_threadlet(ThreadletQueue *queue) void submit_work(ThreadletWork *work) { if (!globalqueue_init) { -globalqueue.cur_threads = 0; -globalqueue.idle_threads = 0; -globalqueue.max_threads = MAX_GLOBAL_THREADS; -globalqueue.min_threads = MIN_GLOBAL_THREADS; -QTAILQ_INIT(&globalqueue.request_list); -qemu_mutex_init(&globalqueue.lock); -qemu_cond_init(&globalqueue.cond); - +threadlet_queue_init(&globalqueue, MAX_GLOBAL_THREADS, +MIN_GLOBAL_THREADS); globalqueue_init = 1; } -qemu_mutex_lock(&globalqueue.lock); -if (globalqueue.idle_threads == 0 && -globalqueue.cur_threads < globalqueue.max_threads) { -spawn_threadlet(&globalqueue); -} else { -qemu_cond_signal(&globalqueue.cond); -} -QTAILQ_INSERT_TAIL(&globalqueue.request_list, work, node); -qemu_mutex_unlock(&globalqueue.lock); +submit_work_to_queue(&globalqueue, work); } /** - * dequeue_work: Cancel a task queued on the global queue. + * dequeue_work_on_queue: Cancel a task queued on a Queue. + * @queue: The queue containing the task to be cancelled. * @work: Contains the information of the task that needs to be cancelled. - * - * Returns: 0 if successfully dequeued work. - * 1 otherwise. */ -int dequeue_work(ThreadletWork *work) +int dequeue_work_on_queue(ThreadletQueue *queue, ThreadletWork *work) { int ret = 1; ThreadletWork *ret_work; -qemu_mutex_lock(&globalqueue.lock); -QTAILQ_FOREACH(ret_work, &(globalqueue.request_list), node) { +qemu_mutex_lock(&queue->lock); +QTAILQ_FOREACH(ret_work, &(queue->request_list), node) { if (ret_work == work) { -QTAILQ_REMOVE(&globalqueue.request_list, ret_work, node); +QTAILQ_REMOVE(&queue->request_list, work, node); ret = 0; break; } } -qemu_mutex_unlock(&globalqueue.lock); +qemu_mutex_unlock(&queue->lock); return ret; } + +/** + * dequeue_work: Cancel a task queued on the global queue. + * @work: Contains the information of the task that needs to be cancelled. + * + * Returns: 0 if successfully dequeued work. + * 1 otherwise. + */ +int dequeue_work(ThreadletWork *work) +{ +return dequeue_work_on_queue(&globalqueue, work); +} + +/** + * threadlet_queue_init: Initialize a threadlet queue. + * @queue: The threadlet queue to be initialized. + * @max_threads: Maximum number of threads processing the queue. + * @min_threads: Minimum number of threads processing the queue. + */ +void threadlet_queue_init(ThreadletQueue *queue, +int max_threads, int min_threads) +{ +queue->cur_threads = 0; +queue->idle_threads = 0; +queue->max_threads = max_threads; +queue->min_threads = min_threads; +QTAILQ_INIT(&queue->request_list); +qemu_mutex_init(&queue->lock); +qemu_cond_init(&queue->cond); +} diff --git a/qemu-threadlet.h b/qemu-threadlet.h index 6b11c86..25e77f2 100644 --- a/qemu-threadlet.h +++ b/qemu-threadlet.h @@ -38,7 +38,11 @@ typedef struct ThreadletWork void (*func)(struct ThreadletWork *work); } ThreadletWork; +void submit_work_to_queue(ThreadletQueue *queue, ThreadletWork *work); void submit_work(ThreadletWork *work); +int dequeue_work_on_queue(ThreadletQueue *queue, ThreadletWork *work); int dequeue_work(ThreadletWork *work); void threadlet_init(void); +void threadlet_queue_init(ThreadletQueue *queu
[Qemu-devel] [PATCH 05/12] Threadlet: Add submit_work threadlet API.
This patch adds submit work threadlet API and shows how the qemu_paio_submit changes to submit_work. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- posix-aio-compat.c | 33 ++--- 1 files changed, 22 insertions(+), 11 deletions(-) diff --git a/posix-aio-compat.c b/posix-aio-compat.c index 011633f..62e32e3 100644 --- a/posix-aio-compat.c +++ b/posix-aio-compat.c @@ -393,13 +393,13 @@ static void spawn_threadlet(ThreadletQueue *queue) if (sigprocmask(SIG_SETMASK, &oldset, NULL)) die("sigprocmask restore"); } -static void qemu_paio_submit(struct qemu_paiocb *aiocb) +/** + * submit_work: Submit to the global queue a new task to be executed + * asynchronously. + * @work: Contains information about the task that needs to be submitted. + */ +static void submit_work(ThreadletWork *work) { -qemu_mutex_lock(&aiocb_mutex); -aiocb->ret = -EINPROGRESS; -aiocb->active = 0; -qemu_mutex_unlock(&aiocb_mutex); - qemu_mutex_lock(&globalqueue.lock); if (!globalqueue_init) { @@ -415,13 +415,13 @@ static void qemu_paio_submit(struct qemu_paiocb *aiocb) } if (globalqueue.idle_threads == 0 && -globalqueue.cur_threads < globalqueue.max_threads) +globalqueue.cur_threads < globalqueue.max_threads) { spawn_threadlet(&globalqueue); -aiocb->work.func = handle_work; - -QTAILQ_INSERT_TAIL(&globalqueue.request_list, &aiocb->work, node); -qemu_cond_signal(&globalqueue.cond); +} else { +qemu_cond_signal(&globalqueue.cond); +} +QTAILQ_INSERT_TAIL(&globalqueue.request_list, work, node); qemu_mutex_unlock(&globalqueue.lock); } @@ -448,6 +448,17 @@ static int qemu_paio_error(struct qemu_paiocb *aiocb) return ret; } +static void qemu_paio_submit(struct qemu_paiocb *aiocb) +{ +qemu_mutex_lock(&aiocb_mutex); +aiocb->ret = -EINPROGRESS; +aiocb->active = 0; +qemu_mutex_unlock(&aiocb_mutex); + +aiocb->work.func = handle_work; +submit_work(&aiocb->work); +} + static int posix_aio_process_queue(void *opaque) { PosixAioState *s = opaque;
[Qemu-devel] [PATCH 04/12] Add ThreadletQueue.
This patch adds a global queue of type ThreadletQueue and removes the earlier usage of request_list queue. We want to create the thread on the first submit. Hence we need to track whether the globalqueue is initialized or not. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- posix-aio-compat.c | 149 +++- 1 files changed, 76 insertions(+), 73 deletions(-) diff --git a/posix-aio-compat.c b/posix-aio-compat.c index 2792201..011633f 100644 --- a/posix-aio-compat.c +++ b/posix-aio-compat.c @@ -31,8 +31,23 @@ #include "block/raw-posix-aio.h" +#define MAX_GLOBAL_THREADS 64 +#define MIN_GLOBAL_THREADS 8 + static QemuMutex aiocb_mutex; static QemuCond aiocb_completion; + +typedef struct ThreadletQueue +{ +QemuMutex lock; +QemuCond cond; +int max_threads; +int min_threads; +int cur_threads; +int idle_threads; +QTAILQ_HEAD(, ThreadletWork) request_list; +} ThreadletQueue; + typedef struct ThreadletWork { QTAILQ_ENTRY(ThreadletWork) node; @@ -66,15 +81,10 @@ typedef struct PosixAioState { struct qemu_paiocb *first_aio; } PosixAioState; - -static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; -static pthread_cond_t cond = PTHREAD_COND_INITIALIZER; -static pthread_t thread_id; +/* Default ThreadletQueue */ +static ThreadletQueue globalqueue; +static int globalqueue_init; static pthread_attr_t attr; -static int max_threads = 64; -static int cur_threads = 0; -static int idle_threads = 0; -static QTAILQ_HEAD(, ThreadletWork) request_list; #ifdef CONFIG_PREADV static int preadv_present = 1; @@ -93,32 +103,6 @@ static void die(const char *what) die2(errno, what); } -static void mutex_lock(pthread_mutex_t *mutex) -{ -int ret = pthread_mutex_lock(mutex); -if (ret) die2(ret, "pthread_mutex_lock"); -} - -static void mutex_unlock(pthread_mutex_t *mutex) -{ -int ret = pthread_mutex_unlock(mutex); -if (ret) die2(ret, "pthread_mutex_unlock"); -} - -static int cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex, - struct timespec *ts) -{ -int ret = pthread_cond_timedwait(cond, mutex, ts); -if (ret && ret != ETIMEDOUT) die2(ret, "pthread_cond_timedwait"); -return ret; -} - -static void cond_signal(pthread_cond_t *cond) -{ -int ret = pthread_cond_signal(cond); -if (ret) die2(ret, "pthread_cond_signal"); -} - static void thread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine)(void*), void *arg) { @@ -311,42 +295,45 @@ static ssize_t handle_aiocb_rw(struct qemu_paiocb *aiocb) static void *threadlet_worker(void *data) { +ThreadletQueue *queue = data; +qemu_mutex_lock(&queue->lock); while (1) { -ssize_t ret = 0; -qemu_timeval tv; -struct timespec ts; ThreadletWork *work; +int ret = 0; -qemu_gettimeofday(&tv); -ts.tv_sec = tv.tv_sec + 10; -ts.tv_nsec = 0; - -mutex_lock(&lock); - -while (QTAILQ_EMPTY(&request_list) && - !(ret == ETIMEDOUT)) { -ret = cond_timedwait(&cond, &lock, &ts); +while (QTAILQ_EMPTY(&queue->request_list) && + (ret != ETIMEDOUT)) { +/* wait for cond to be signalled or broadcast for 1000s */ +ret = qemu_cond_timedwait((&queue->cond), + &(queue->lock), 10*10); } -if (QTAILQ_EMPTY(&request_list)) { -idle_threads--; -cur_threads--; -mutex_unlock(&lock); -break; -} -work = QTAILQ_FIRST(&request_list); -QTAILQ_REMOVE(&request_list, work, node); -idle_threads--; -mutex_unlock(&lock); +assert(queue->idle_threads != 0); +if (QTAILQ_EMPTY(&queue->request_list)) { +if (queue->cur_threads > queue->min_threads) { +/* We retain the minimum number of threads */ +break; +} +} else { +work = QTAILQ_FIRST(&queue->request_list); +QTAILQ_REMOVE(&queue->request_list, work, node); + +queue->idle_threads--; +qemu_mutex_unlock(&queue->lock); -work->func(work); -mutex_lock(&lock); -idle_threads++; -mutex_unlock(&lock); +/* execute the work function */ +work->func(work); +qemu_mutex_lock(&queue->lock); +queue->idle_threads++; +} } +queue->idle_threads--; +queue->cur_threads--; +qemu_mutex_unlock(&queue->lock); + return NULL; } @@ -389,18 +376,19 @@ static void handle_work(ThreadletWork *work) } } -static void spawn_thread(void) +static void spawn_threadlet(ThreadletQueue *queue) { +pthread_t thread_id; sigset_t set, oldset; -cur_threads++; -idle_threads++; +queue->cur_threads++; +queue->idle_thre
[Qemu-devel] [PATCH 07/12] Remove thread_create routine.
Remove thread_create and use qemu_thread_create instead. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- posix-aio-compat.c | 29 ++--- 1 files changed, 2 insertions(+), 27 deletions(-) diff --git a/posix-aio-compat.c b/posix-aio-compat.c index 95ba805..a7625d8 100644 --- a/posix-aio-compat.c +++ b/posix-aio-compat.c @@ -13,7 +13,6 @@ #include #include -#include #include #include #include @@ -83,7 +82,6 @@ typedef struct PosixAioState { /* Default ThreadletQueue */ static ThreadletQueue globalqueue; static int globalqueue_init; -static pthread_attr_t attr; #ifdef CONFIG_PREADV static int preadv_present = 1; @@ -102,13 +100,6 @@ static void die(const char *what) die2(errno, what); } -static void thread_create(pthread_t *thread, pthread_attr_t *attr, - void *(*start_routine)(void*), void *arg) -{ -int ret = pthread_create(thread, attr, start_routine, arg); -if (ret) die2(ret, "pthread_create"); -} - static ssize_t handle_aiocb_ioctl(struct qemu_paiocb *aiocb) { int ret; @@ -374,19 +365,12 @@ static void handle_work(ThreadletWork *work) static void spawn_threadlet(ThreadletQueue *queue) { -pthread_t thread_id; -sigset_t set, oldset; +QemuThread thread; queue->cur_threads++; queue->idle_threads++; -/* block all signals */ -if (sigfillset(&set)) die("sigfillset"); -if (sigprocmask(SIG_SETMASK, &set, &oldset)) die("sigprocmask"); - -thread_create(&thread_id, &attr, threadlet_worker, queue); - -if (sigprocmask(SIG_SETMASK, &oldset, NULL)) die("sigprocmask restore"); +qemu_thread_create(&thread, threadlet_worker, queue); } /** @@ -671,7 +655,6 @@ int paio_init(void) struct sigaction act; PosixAioState *s; int fds[2]; -int ret; if (posix_aio_state) return 0; @@ -701,14 +684,6 @@ int paio_init(void) qemu_aio_set_fd_handler(s->rfd, posix_aio_read, NULL, posix_aio_flush, posix_aio_process_queue, s); -ret = pthread_attr_init(&attr); -if (ret) -die2(ret, "pthread_attr_init"); - -ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); -if (ret) -die2(ret, "pthread_attr_setdetachstate"); - posix_aio_state = s; return 0; }
[Qemu-devel] [PATCH 08/12] Threadlet: Add aio_signal_handler threadlet API
This patch adds aio_signal_handler threadlet API. Earler posix-aio-compat.c had its own signal handler code. Now abstract this, in the later patch it is moved to a generic code so that it can be used by other subsystems. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- Makefile.objs |1 + posix-aio-compat.c | 30 -- qemu-threadlet.c | 39 +++ qemu-threadlet.h | 19 +++ vl.c |3 +++ 5 files changed, 70 insertions(+), 22 deletions(-) create mode 100644 qemu-threadlet.c create mode 100644 qemu-threadlet.h diff --git a/Makefile.objs b/Makefile.objs index 3b7ec27..3b11dd0 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -10,6 +10,7 @@ qobject-obj-y += qerror.o block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o block-obj-$(CONFIG_POSIX) += qemu-thread.o +block-obj-$(CONFIG_POSIX) += qemu-threadlet.o block-obj-$(CONFIG_POSIX) += posix-aio-compat.o block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o diff --git a/posix-aio-compat.c b/posix-aio-compat.c index a7625d8..96e28db 100644 --- a/posix-aio-compat.c +++ b/posix-aio-compat.c @@ -327,11 +327,14 @@ static void *threadlet_worker(void *data) return NULL; } +static PosixAioState *posix_aio_state; + static void handle_work(ThreadletWork *work) { pid_t pid; ssize_t ret = 0; struct qemu_paiocb *aiocb; +char byte = 0; pid = getpid(); aiocb = container_of(work, struct qemu_paiocb, work); @@ -358,6 +361,11 @@ static void handle_work(ThreadletWork *work) qemu_cond_broadcast(&aiocb_completion); qemu_mutex_unlock(&aiocb_mutex); +ret = write(posix_aio_state->wfd, &byte, sizeof(byte)); +if (ret < 0 && errno != EAGAIN) { +die("write()"); +} + if (kill(pid, aiocb->ev_signo)) { die("kill failed"); } @@ -518,22 +526,6 @@ static int posix_aio_flush(void *opaque) return !!s->first_aio; } -static PosixAioState *posix_aio_state; - -static void aio_signal_handler(int signum) -{ -if (posix_aio_state) { -char byte = 0; -ssize_t ret; - -ret = write(posix_aio_state->wfd, &byte, sizeof(byte)); -if (ret < 0 && errno != EAGAIN) -die("write()"); -} - -qemu_service_io(); -} - static void paio_remove(struct qemu_paiocb *acb) { struct qemu_paiocb **pacb; @@ -652,7 +644,6 @@ BlockDriverAIOCB *paio_ioctl(BlockDriverState *bs, int fd, int paio_init(void) { -struct sigaction act; PosixAioState *s; int fds[2]; @@ -664,11 +655,6 @@ int paio_init(void) s = qemu_malloc(sizeof(PosixAioState)); -sigfillset(&act.sa_mask); -act.sa_flags = 0; /* do not restart syscalls to interrupt select() */ -act.sa_handler = aio_signal_handler; -sigaction(SIGUSR2, &act, NULL); - s->first_aio = NULL; if (qemu_pipe(fds) == -1) { fprintf(stderr, "failed to create pipe\n"); diff --git a/qemu-threadlet.c b/qemu-threadlet.c new file mode 100644 index 000..857d08d --- /dev/null +++ b/qemu-threadlet.c @@ -0,0 +1,39 @@ +/* + * Threadlet support for offloading tasks to be executed asynchronously + * + * Copyright IBM, Corp. 2008 + * Copyright IBM, Corp. 2010 + * + * Authors: + * Anthony Liguori + * Aneesh Kumar K.V + * Gautham R Shenoy + * Arun R Bharadwaj + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include + +#include "qemu-threadlet.h" +#include "osdep.h" + +static void threadlet_io_completion_signal_handler(int signum) +{ +qemu_service_io(); +} + +static void threadlet_register_signal_handler(void) +{ +struct sigaction act; +sigfillset(&act.sa_mask); +act.sa_flags = 0; /* do not restart syscalls to interrupt select() */ +act.sa_handler = threadlet_io_completion_signal_handler; +sigaction(SIGUSR2, &act, NULL); +} + +void threadlet_init(void) +{ +threadlet_register_signal_handler(); +} diff --git a/qemu-threadlet.h b/qemu-threadlet.h new file mode 100644 index 000..2c225d6 --- /dev/null +++ b/qemu-threadlet.h @@ -0,0 +1,19 @@ +/* + * Threadlet support for offloading tasks to be executed asynchronously + * + * Copyright IBM, Corp. 2008 + * Copyright IBM, Corp. 2010 + * + * Authors: + * Anthony Liguori + * Aneesh Kumar K.V + * Gautham R Shenoy + * Arun R Bharadwaj + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include "qemu-common.h" + +void threadlet_init(void); diff --git a/vl.c b/vl.c index df414ef..d04fc7a 100644 --- a/vl.c +++ b/vl.c @@ -148,6 +148,7 @@ int main(int argc, char **argv) #include "qemu-config.h" #include "qemu-objects.h" #include "qemu-options.h" +#include "qemu-threadlet.h" #ifdef CONFIG_VIRTFS #include "fs
[Qemu-devel] [PATCH 03/12] Add callback function to ThreadletWork structure.
This patch adds the callback function to the ThreadletWork structure and moves aio handler as a callback function. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- posix-aio-compat.c | 89 +--- 1 files changed, 50 insertions(+), 39 deletions(-) diff --git a/posix-aio-compat.c b/posix-aio-compat.c index ddf42f5..2792201 100644 --- a/posix-aio-compat.c +++ b/posix-aio-compat.c @@ -36,6 +36,7 @@ static QemuCond aiocb_completion; typedef struct ThreadletWork { QTAILQ_ENTRY(ThreadletWork) node; +void (*func)(struct ThreadletWork *work); } ThreadletWork; struct qemu_paiocb { @@ -308,18 +309,14 @@ static ssize_t handle_aiocb_rw(struct qemu_paiocb *aiocb) return nbytes; } -static void *aio_thread(void *unused) +static void *threadlet_worker(void *data) { -pid_t pid; -ThreadletWork *work; - -pid = getpid(); while (1) { -struct qemu_paiocb *aiocb; ssize_t ret = 0; qemu_timeval tv; struct timespec ts; +ThreadletWork *work; qemu_gettimeofday(&tv); ts.tv_sec = tv.tv_sec + 10; @@ -332,52 +329,64 @@ static void *aio_thread(void *unused) ret = cond_timedwait(&cond, &lock, &ts); } -if (QTAILQ_EMPTY(&request_list)) +if (QTAILQ_EMPTY(&request_list)) { +idle_threads--; +cur_threads--; +mutex_unlock(&lock); break; - +} work = QTAILQ_FIRST(&request_list); QTAILQ_REMOVE(&request_list, work, node); - -aiocb = container_of(work, struct qemu_paiocb, work); - -aiocb->active = 1; idle_threads--; mutex_unlock(&lock); -switch (aiocb->aio_type & QEMU_AIO_TYPE_MASK) { -case QEMU_AIO_READ: -case QEMU_AIO_WRITE: -ret = handle_aiocb_rw(aiocb); -break; -case QEMU_AIO_FLUSH: -ret = handle_aiocb_flush(aiocb); -break; -case QEMU_AIO_IOCTL: -ret = handle_aiocb_ioctl(aiocb); -break; -default: -fprintf(stderr, "invalid aio request (0x%x)\n", aiocb->aio_type); -ret = -EINVAL; -break; -} - +work->func(work); mutex_lock(&lock); idle_threads++; mutex_unlock(&lock); -qemu_mutex_lock(&aiocb_mutex); -aiocb->ret = ret; -qemu_cond_broadcast(&aiocb_completion); -qemu_mutex_unlock(&aiocb_mutex); +} + +return NULL; +} + +static void handle_work(ThreadletWork *work) +{ +pid_t pid; +ssize_t ret = 0; +struct qemu_paiocb *aiocb; -if (kill(pid, aiocb->ev_signo)) die("kill failed"); +pid = getpid(); +aiocb = container_of(work, struct qemu_paiocb, work); +qemu_mutex_lock(&aiocb_mutex); +aiocb->active = 1; +qemu_mutex_unlock(&aiocb_mutex); + +switch (aiocb->aio_type & QEMU_AIO_TYPE_MASK) { +case QEMU_AIO_READ: +case QEMU_AIO_WRITE: +ret = handle_aiocb_rw(aiocb); +break; +case QEMU_AIO_FLUSH: +ret = handle_aiocb_flush(aiocb); +break; +case QEMU_AIO_IOCTL: +ret = handle_aiocb_ioctl(aiocb); +break; +default: +fprintf(stderr, "invalid aio request (0x%x)\n", aiocb->aio_type); +ret = -EINVAL; +break; } -idle_threads--; -cur_threads--; -mutex_unlock(&lock); +qemu_mutex_lock(&aiocb_mutex); +aiocb->ret = ret; +qemu_cond_broadcast(&aiocb_completion); +qemu_mutex_unlock(&aiocb_mutex); -return NULL; +if (kill(pid, aiocb->ev_signo)) { +die("kill failed"); +} } static void spawn_thread(void) @@ -391,7 +400,7 @@ static void spawn_thread(void) if (sigfillset(&set)) die("sigfillset"); if (sigprocmask(SIG_SETMASK, &set, &oldset)) die("sigprocmask"); -thread_create(&thread_id, &attr, aio_thread, NULL); +thread_create(&thread_id, &attr, threadlet_worker, NULL); if (sigprocmask(SIG_SETMASK, &oldset, NULL)) die("sigprocmask restore"); } @@ -406,6 +415,8 @@ static void qemu_paio_submit(struct qemu_paiocb *aiocb) mutex_lock(&lock); if (idle_threads == 0 && cur_threads < max_threads) spawn_thread(); + +aiocb->work.func = handle_work; QTAILQ_INSERT_TAIL(&request_list, &aiocb->work, node); mutex_unlock(&lock); cond_signal(&cond);
[Qemu-devel] [PATCH 01/12] Add aiocb_mutex and aiocb_completion.
This patch adds the aiocb_mutex to protect aiocb. This patch also removes the infinite loop present in paio_cancel. Since there can only be one cancellation at a time, we need to introduce a condition variable. For this, we need a global aiocb_completion condition variable. This patch also adds the Makefile entry to compile qemu-thread.c when CONFIG_POSIX is set, instead of the unused CONFIG_THREAD. Signed-off-by: Arun R Bharadwaj Reviewed-by: Stefan Hajnoczi --- Makefile.objs |2 +- posix-aio-compat.c | 39 --- 2 files changed, 29 insertions(+), 12 deletions(-) diff --git a/Makefile.objs b/Makefile.objs index cd5a24b..3b7ec27 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -9,6 +9,7 @@ qobject-obj-y += qerror.o block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o +block-obj-$(CONFIG_POSIX) += qemu-thread.o block-obj-$(CONFIG_POSIX) += posix-aio-compat.o block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o @@ -124,7 +125,6 @@ endif common-obj-y += $(addprefix ui/, $(ui-obj-y)) common-obj-y += iov.o acl.o -common-obj-$(CONFIG_THREAD) += qemu-thread.o common-obj-y += notify.o event_notifier.o common-obj-y += qemu-timer.o diff --git a/posix-aio-compat.c b/posix-aio-compat.c index 7b862b5..82862ec 100644 --- a/posix-aio-compat.c +++ b/posix-aio-compat.c @@ -27,9 +27,12 @@ #include "qemu-common.h" #include "trace.h" #include "block_int.h" +#include "qemu-thread.h" #include "block/raw-posix-aio.h" +static QemuMutex aiocb_mutex; +static QemuCond aiocb_completion; struct qemu_paiocb { BlockDriverAIOCB common; @@ -351,10 +354,14 @@ static void *aio_thread(void *unused) } mutex_lock(&lock); -aiocb->ret = ret; idle_threads++; mutex_unlock(&lock); +qemu_mutex_lock(&aiocb_mutex); +aiocb->ret = ret; +qemu_cond_broadcast(&aiocb_completion); +qemu_mutex_unlock(&aiocb_mutex); + if (kill(pid, aiocb->ev_signo)) die("kill failed"); } @@ -383,8 +390,11 @@ static void spawn_thread(void) static void qemu_paio_submit(struct qemu_paiocb *aiocb) { +qemu_mutex_lock(&aiocb_mutex); aiocb->ret = -EINPROGRESS; aiocb->active = 0; +qemu_mutex_unlock(&aiocb_mutex); + mutex_lock(&lock); if (idle_threads == 0 && cur_threads < max_threads) spawn_thread(); @@ -397,9 +407,9 @@ static ssize_t qemu_paio_return(struct qemu_paiocb *aiocb) { ssize_t ret; -mutex_lock(&lock); +qemu_mutex_lock(&aiocb_mutex); ret = aiocb->ret; -mutex_unlock(&lock); +qemu_mutex_unlock(&aiocb_mutex); return ret; } @@ -536,22 +546,26 @@ static void paio_cancel(BlockDriverAIOCB *blockacb) struct qemu_paiocb *acb = (struct qemu_paiocb *)blockacb; int active = 0; -mutex_lock(&lock); +qemu_mutex_lock(&aiocb_mutex); if (!acb->active) { QTAILQ_REMOVE(&request_list, acb, node); acb->ret = -ECANCELED; } else if (acb->ret == -EINPROGRESS) { active = 1; } -mutex_unlock(&lock); -if (active) { -/* fail safe: if the aio could not be canceled, we wait for - it */ -while (qemu_paio_error(acb) == EINPROGRESS) -; +if (!active) { +acb->ret = -ECANCELED; +} else { +while (acb->ret == -EINPROGRESS) { +/* + * fail safe: if the aio could not be canceled, + * we wait for it + */ +qemu_cond_wait(&aiocb_completion, &aiocb_mutex); +} } - +qemu_mutex_unlock(&aiocb_mutex); paio_remove(acb); } @@ -623,6 +637,9 @@ int paio_init(void) if (posix_aio_state) return 0; +qemu_mutex_init(&aiocb_mutex); +qemu_cond_init(&aiocb_completion); + s = qemu_malloc(sizeof(PosixAioState)); sigfillset(&act.sa_mask);
[Qemu-devel] [PATCH 00/12] Threadlet Infrastructure
Hi, This patch series introduces threadlet framework in qemu. Changelog: * Modified the dequeue_work function logic which was earlier wrong. It now returns 0 on success and 1 on failure. * removed the locking for container_of which is not needed in handle_work() * changed qemu-threadlets.[c|h] to qemu-threadlet.[c|h] * introduced signal handler APIs in qemu-threadlet.c directly instead of in posix-aio-compat.c Testing: * I have tested this series using fsstress. Detailed testing information has been mentioned in the previous iteration. The following series implements... --- Arun R Bharadwaj (12): Add aiocb_mutex and aiocb_completion. Introduce work concept in posix-aio-compat.c Add callback function to ThreadletWork structure. Add ThreadletQueue. Threadlet: Add submit_work threadlet API. Threadlet: Add dequeue_work threadlet API and remove active field. Remove thread_create routine. Threadlet: Add aio_signal_handler threadlet API Remove all instances of CONFIG_THREAD Move threadlet code to qemu-threadlets.c Threadlets: Add functionality to create private queues. Threadlets: Add documentation Makefile.objs |3 - configure |2 docs/async-support.txt | 141 posix-aio-compat.c | 238 qemu-threadlet.c | 187 ++ qemu-threadlet.h | 48 ++ vl.c |3 + 7 files changed, 440 insertions(+), 182 deletions(-) create mode 100644 docs/async-support.txt create mode 100644 qemu-threadlet.c create mode 100644 qemu-threadlet.h -- arun
[Qemu-devel] Re: [PATCH] target-arm: Fix loading of scalar value for Neon multiply-by-scalar
On 19.01.2011 20:29, Peter Maydell wrote: > Fix the register and part of register we get the scalar from in > the various "multiply vector by scalar" ops (VMUL by scalar > and friends). > I do confirm that with this patch, I can see many improvements in my tests. Thanks.
Re: [Qemu-devel] [PATCH] Support saturation with shift=0.
On 20 January 2011 12:06, Christophe Lyon wrote: > On 19.01.2011 17:51, Peter Maydell wrote: >> On 19 January 2011 16:10, Christophe Lyon wrote: >>> >>> This patch fixes corner-case saturations, when the target range is >>> zero. It merely removes the guard against (sh == 0), and makes: >>> __ssat(0x87654321, 1) return 0x and set the saturation flag >> >> did you mean __ssat(0x87654321, 0) here? (they give the same >> result, of course, but it's the sh==0 case the patch is changing...) > > Well... the ARM ARM says that the position for saturation is in the > range 1 to 32, so I think the assembler encodes 1 less than what the > user actually wrote. Hence at user level we use '1', but '0' is encoded > and then parsed by qemu. Am I wrong? No, you're right, there's a "+1" in the SSAT decoding instructions which I hadn't noticed. -- PMM
Re: [Qemu-devel] [PATCH] Support saturation with shift=0.
On 19.01.2011 17:51, Peter Maydell wrote: > On 19 January 2011 16:10, Christophe Lyon wrote: >> >> This patch fixes corner-case saturations, when the target range is >> zero. It merely removes the guard against (sh == 0), and makes: >> __ssat(0x87654321, 1) return 0x and set the saturation flag > > did you mean __ssat(0x87654321, 0) here? (they give the same > result, of course, but it's the sh==0 case the patch is changing...) Well... the ARM ARM says that the position for saturation is in the range 1 to 32, so I think the assembler encodes 1 less than what the user actually wrote. Hence at user level we use '1', but '0' is encoded and then parsed by qemu. Am I wrong? Obviously, I can rephrase the commit message :-) > >> __usat(0x87654321, 0) return 0 and set the saturation flag >> >> Signed-off-by: Christophe Lyon > > Checked against the ARM ARM and tested by > random-instruction-sequence generation. > Thanks.
Re: [Qemu-devel] [PATCH] pxa2xx_lcd: restore updating of display
On Tue, Jan 18, 2011 at 07:11:33PM +0300, Dmitry Eremin-Solenikov wrote: > Recently PXA2xx lcd have stopped to be updated incrementally (picture > frozen). This patch fixes that by passing non min/max x/y, but rather > (correctly) x/y and w/h. > > Signed-off-by: Dmitry Eremin-Solenikov > --- > hw/pxa2xx_lcd.c |4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/pxa2xx_lcd.c b/hw/pxa2xx_lcd.c > index 1f2211a..5b2b07e 100644 > --- a/hw/pxa2xx_lcd.c > +++ b/hw/pxa2xx_lcd.c > @@ -796,9 +796,9 @@ static void pxa2xx_update_display(void *opaque) > > if (miny >= 0) { > if (s->orientation) > -dpy_update(s->ds, miny, 0, maxy, s->xres); > +dpy_update(s->ds, miny, 0, maxy - miny, s->xres); > else > -dpy_update(s->ds, 0, miny, s->xres, maxy); > +dpy_update(s->ds, 0, miny, s->xres, maxy - miny); > } > pxa2xx_lcdc_int_update(s); > > -- > 1.7.2.3 > Thanks, applied. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH] ioport: Improve error output
On Mon, Dec 27, 2010 at 01:00:56AM +0100, Andreas Färber wrote: > When failing due to conflicting I/O port registrations, > include the offending I/O port address in the message. > > Signed-off-by: Andreas Färber > --- > ioport.c |4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/ioport.c b/ioport.c > index aa4188a..349915f 100644 > --- a/ioport.c > +++ b/ioport.c > @@ -149,7 +149,7 @@ int register_ioport_read(pio_addr_t start, int length, > int size, > for(i = start; i < start + length; i += size) { > ioport_read_table[bsize][i] = func; > if (ioport_opaque[i] != NULL && ioport_opaque[i] != opaque) > -hw_error("register_ioport_read: invalid opaque"); > +hw_error("register_ioport_read: invalid opaque for 0x%x", i); > ioport_opaque[i] = opaque; > } > return 0; > @@ -168,7 +168,7 @@ int register_ioport_write(pio_addr_t start, int length, > int size, > for(i = start; i < start + length; i += size) { > ioport_write_table[bsize][i] = func; > if (ioport_opaque[i] != NULL && ioport_opaque[i] != opaque) > -hw_error("register_ioport_write: invalid opaque"); > +hw_error("register_ioport_write: invalid opaque for 0x%x", i); > ioport_opaque[i] = opaque; > } > return 0; The idea looks good, but "for 0x%x" looks strange. Maybe something like "for address 0x%x"? -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH] add bepo (french dvorak) keyboard layout
On Sun, Jan 09, 2011 at 02:24:59PM +0100, Fred Boiteux wrote: > Hello, > > It's my first message here, and I'm not a git user, so I hope the > patch format will be good (I've done a "git diff -cached" in the qemu > tree). > I'm using the Qemu program with VNC I/O, and I had some problems with > my keyboard layout, so I've prepared a definition to be included in > Qemu, built from Xorg description. I've tested here, it works for me. > > If you have any remark/question about that, tell me. > > Fred. > > > > Signed-off-by: Frédéric Boiteux > > --- Thanks, applied. > diff --git a/pc-bios/keymaps/bepo b/pc-bios/keymaps/bepo > new file mode 100644 > index 000..d40041a > --- /dev/null > +++ b/pc-bios/keymaps/bepo > @@ -0,0 +1,333 @@ > +include common > + > +# Bépo : Improved ergonomic french keymap using Dvorak method. > +# Built by community on 'Dvorak Fr / Bépo' : > +# see http://www.clavier-dvorak.org/wiki/ to join and help. > +# > +# Bépo layout (1.0rc2 version) for a pc105 keyboard (french) : > +# ┌┐ > +# │ S A│ S = Shift, A = AltGr + Shift > +# │ s a│ s = normal, a = AltGr > +# └┘ > +# > +# > ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┲━┓ > +# │ # ¶ │ 1 „ │ 2 “ │ 3 ” │ 4 ≤ │ 5 ≥ │ 6 │ 7 ¬ │ 8 ¼ │ 9 ½ │ 0 ¾ │ ° ′ │ > ` ″ ┃ ⌫ Retour┃ > +# │ $ – │ " — │ « < │ » > │ ( [ │ ) ] │ @ ^ │ + ± │ - − │ / ÷ │ * × │ = ≠ │ > % ‰ ┃ arrière┃ > +# > ┢━┷━┱───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┺━┳━━━┫ > +# ┃ ┃ B ¦ │ É ˝ │ P § │ O Œ │ È ` │ ! │ V │ D Ð │ L │ J IJ │ Z Ə > │ W ┃Entrée ┃ > +# ┃Tab ↹ ┃ b | │ é ˊ │ p & │ o œ │ è ` │ ˆ ¡ │ v ˇ │ d ð │ l / │ j ij │ z ə > │ w ̆ ┃ ⏎ ┃ > +# > ┣━━━┻┱┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┺┓ > ┃ > +# ┃┃ A Æ │ U Ù │ I ˙ │ E ¤ │ ; ̛ │ C ſ │ T Þ │ S ẞ │ R ™ │ N │ M º > │ Ç , ┃ ┃ > +# ┃Maj ⇬ ┃ a æ │ u ù │ i ̈ │ e € │ , ’ │ c © │ t þ │ s ß │ r ® │ n ˜ │ m ¯ > │ ç ¸ ┃ ┃ > +# > ┣━━━┳┹┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┲┷━┻━━┫ > +# ┃ ┃ Ê │ À │ Y ‘ │ X ’ │ : · │ K │ ? ̉ │ Q ̣ │ G │ H ‡ │ F ª > ┃ ┃ > +# ┃Shift ⇧┃ ê / │ à \ │ y { │ x } │ . … │ k ~ │ ' ¿ │ q ˚ │ g µ │ h † │ f ˛ > ┃Shift ⇧ ┃ > +# > ┣━━━╋━┷━┳━━━┷━━━┱─┴─┴─┴─┴─┴─┴───┲━┷━╈━┻━┳━━━┳━━━┛ > +# ┃ ┃ ┃ ┃ Espace inséc. Espace inséc. fin ┃ ┃ > ┃ ┃ > +# ┃Ctrl ┃Meta ┃Alt┃ ␣ (Espace) _ ␣ ┃AltGr ⇮┃Menu > ┃Ctrl ┃ > +# > ┗━━━┻━━━┻━━━┹───┺━━━┻━━━┻━━━┛ > + > + > +# First row > +## keycode 41 = dollar numbersign U+2013 U+00b6 > +dollar0x29 > +numbersign0x29 shift > +U2013 0x29altgr > +U00b6 0x29 shift altgr > + > +## keycode 2 = +quotedbl +one U+2014 U+201e > +quotedbl 0x2 > +one 0x2 shift > +U2014 0x2altgr > +U201e 0x2 shift altgr > + > +## keycode 3 = +guillemotleft +two lessU+201c > +guillemotleft 0x3 > +two 0x3 shift > +less 0x3altgr > +U201c 0x3 shift altgr > + > +## keycode 4 = +guillemotright +three greater U+201d > +guillemotright 0x4 > +three 0x4 shift > +greater 0x4altgr > +U201d 0x4 shift altgr > + > +## keycode 5 = +parenleft +fourbracketleft U+2264 > +parenleft 0x5 > +four 0x5 shift > +bracketleft 0x5altgr > +U2264 0x5 shift altgr > + > +## keycode 6 = +parenright +five bracketright U+2265 > +parenright0x6 > +five 0x6 shift > +bracketright 0x6altgr > +U2265 0x6 shift altgr > + > +## keycode 7 = +at +six asciicircum > +at0x7 > +six 0x7 shift > +asciicircum 0x7altgr > + > +## keycode 8 = +plus +sevenU+00b1 U+00ac > +plus 0x8 > +seven 0x8 shift > +U00b1 0x8altgr > +U00ac 0x8 shift altgr > + > +## keycode 9 = +minus+eightU+2212 U+00bc > +minus 0x9 > +eight 0x9 shift > +U2212 0x9altgr > +U00bc 0x9 shift altgr > + > +## keycode 10 = +slash+nine U+00f7 U+00bd > +slash 0xa > +nine 0xa shift > +U00f7 0xaaltgr > +U00bd 0xa shift altgr > + > +## keycode 11 = +asterisk +zero U+00d7 U+00be > +asterisk 0xb > +zero 0xb shift > +U00d7 0xbaltgr > +U00be 0xb shift altgr > + > +## keycode 12 = equal U+00b0U+2260 U+2032 > +equal 0xc > +U00b0 0xc shift > +U2260 0xcaltgr > +U2032 0xc shift altgr > + > +## keycode 13 = percent grave U+2030 U+2033 > +percent 0xd > +grave 0xd shift > +U2030 0xd
Re: [Qemu-devel] [PATCH 1/3] mainstone: fix name of the allocated memory for roms
On Thu, Jan 13, 2011 at 06:37:10PM +0300, Dmitry Eremin-Solenikov wrote: > Mainstone board has two flash chips (emulated by two ram regions), however > currently code tries to allocate them with the same name, which fails. > Fix that to make mainstone emulation work again. > > Signed-off-by: Dmitry Eremin-Solenikov > --- > hw/mainstone.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/hw/mainstone.c b/hw/mainstone.c > index efa2959..d8e41cf 100644 > --- a/hw/mainstone.c > +++ b/hw/mainstone.c > @@ -106,7 +106,8 @@ static void mainstone_common_init(ram_addr_t ram_size, > } > > if (!pflash_cfi01_register(mainstone_flash_base[i], > - qemu_ram_alloc(NULL, "mainstone.flash", > + qemu_ram_alloc(NULL, i ? > "mainstone.flash1" : > + "mainstone.flash0", >MAINSTONE_FLASH), > dinfo->bdrv, sector_len, > MAINSTONE_FLASH / sector_len, 4, 0, 0, 0, > 0, > -- > 1.7.2.3 > Thanks, all three patches applied. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Am 20.01.2011 11:39, schrieb Yoshiaki Tamura: > 2011/1/20 Kevin Wolf : >> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura: >>> +return; >>> +} >>> + >>> +bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov, >>> +blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb, >>> +blk_req->reqs[0].opaque); >> >> Same here. >> >>> +bdrv_flush(bs); >> >> This looks really strange. What is this supposed to do? >> >> One point is that you write it immediately after bdrv_aio_write, so you >> get an fsync for which you don't know if it includes the current write >> request or if it doesn't. Which data do you want to get flushed to the >> disk? > > I was expecting to flush the aio request that was just initiated. > Am I misunderstanding the function? Seems so. The function names don't use really clear terminology either, so you're not the first one to fall in this trap. Basically we have: * qemu_aio_flush() waits for all AIO requests to complete. I think you wanted to have exactly this, but only for a single block device. Such a function doesn't exist yet. * bdrv_flush() makes sure that all successfully completed requests are written to disk (by calling fsync) * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run the fsync in the thread pool >>> >>> Then what I wanted to do is, call qemu_aio_flush first, then >>> bdrv_flush. It should be like live migration. >> >> Okay, that makes sense. :-) >> >> The other thing is that you introduce a bdrv_flush for each request, >> basically forcing everyone to something very similar to writethrough >> mode. I'm sure this will have a big impact on performance. > > The reason is to avoid inversion of queued requests. Although > processing one-by-one is heavy, wouldn't having requests flushed > to disk out of order break the disk image? No, that's fine. If a guest issues two requests at the same time, they may complete in any order. You just need to make sure that you don't call the completion callback before the request really has completed. >>> >>> We need to flush requests, meaning aio and fsync, before sending >>> the final state of the guests, to make sure we can switch to the >>> secondary safely. >> >> In theory I think you could just re-submit the requests on the secondary >> if they had not completed yet. >> >> But you're right, let's keep things simple for the start. >> I'm just starting to wonder if the guest won't timeout the requests if they are queued for too long. Even more, with IDE, it can only handle one request at a time, so not completing requests doesn't sound like a good idea at all. In what intervals is the event-tap queue flushed? >>> >>> The requests are flushed once each transaction completes. So >>> it's not with specific intervals. >> >> Right. So when is a transaction completed? This is the time that a >> single request will take. > > The transaction is completed when the vm state is sent to the > secondary, and the primary receives the ack to it. Please let me > know if the answer is too vague. What I can tell is that it > can't be super fast. > On the other hand, if you complete before actually writing out, you don't get timeouts, but you signal success to the guest when the request could still fail. What would you do in this case? With a writeback cache mode we're fine, we can just fail the next flush (until then nothing is guaranteed to be on disk and order doesn't matter either), but with cache=writethrough we're in serious trouble. Have you thought about this problem? Maybe we end up having to flush the event-tap queue for each single write in writethrough mode. >>> >>> Yes, and that's what I'm trying to do at this point. >> >> Oh, I must have missed that code. Which patch/function should I look at? > > Maybe I miss-answered to your question. The device may receive > timeouts. We should pay attention that the guest does not see timeouts. I'm not expecting that I/O will be super fast, and as long as it is only a performance problem we can live with it. However, as soon as the guest gets timeouts it reports I/O errors and eventually offlines the block device. At this point it's not a performance problem any more, but also a correctness problem. This is why I suggested that we flush the event-tap queue (i.e. complete the transaction) immediately after an I/O request has been issued instead of waiting for other events that would complete the transaction. > If timeouts didn't happen, the requests are flushed > one-by-one in writethrough because we're calling qemu_aio_flush > and bdrv_flush together. I think this is what we must do. Kevin
Re: [Qemu-devel] [PATCH 0/8] Add save/restore support to ARM versatilepb devices
On Thu, Dec 23, 2010 at 05:19:50PM +, Peter Maydell wrote: > This patchset adds save/restore support to the devices used by the ARM > versatilepb board which didn't already support it. > > I did this in line with docs/migration.txt, and it seems to work OK. I'd > appreciate some review from somebody with a better grasp of the APIs, > though :-) In particular, am I saving the uint8_t data[NUM_PACKETS][2048] > array in stc91c111 in the right way? > > Peter Maydell (8): > pl190: Implement save/restore > vpb_sic: Implement save/restore > arm_sysctl: Implement save/restore > pl050: Implement save/restore > pl031: Implement save/restore > pl110: Implement save/restore > pl080: Implement save/restore > stc91c111: Implement save/restore > > hw/arm_sysctl.c | 17 +++ > hw/pl031.c | 25 ++- > hw/pl050.c | 35 +++ > hw/pl080.c | 58 +++-- > hw/pl110.c | 44 +--- > hw/pl190.c | 36 +--- > hw/smc91c111.c | 30 +++ > hw/versatilepb.c | 24 +++-- > 8 files changed, 249 insertions(+), 20 deletions(-) > I don't really know this part of QEMU, but given nobody really wants to review these patches and given they look ok, I have applied them. It's better than leaving them bitrotting on the mailing list. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] [PATCH 0/7] Define "deposit" tcg operation, v2
On Mon, Jan 10, 2011 at 07:23:41PM -0800, Richard Henderson wrote: > Changes since v1: > * No attempt to pack pos+len into one operand. Updated backends > to match this change. > > * Example in the README is a bit more complex. > > * Define an official tcg_scratch_alloc routine, used by the i386 > target for the case in which we need a scratch register. I had > said that triggering this wasn't possible with mainline, but I > was wrong. The rlwimi example I posted in the other thread hits > this case. > > Aurelien suggested using a shorter name like "dep", but I think that > would be more confusing than not. This opcode is not really used > enough to warrent cropping 4 characters, in my opinion. Thanks Richard. In agreement with Aurelien I've applied patch 1 and 6 as a first step. I've also applied a follow-up patch to fix the copy-pastos in the README (loc -> len). Hopefully as more acks come in, we'll apply the rest. Cheers
Re: [Qemu-devel] usb redirection status report
On Wed, Jan 19, 2011 at 07:15:47PM +0100, Hans de Goede wrote: > Hi All, > > As most of you know I'm working on usb redirection (making client usb > devices > accessible in guests over the network). > > I thought it would be a good idea to write a short status report. > > So fat the following has been done: > > * written and posted a network protocol for usb redir. over the network, > and send this to the list. How does this relate to the various existing usb over ip protocols, e.g. the one in drivers/staging/usbip/ in the Linux kernel tree?
Re: [Qemu-devel] [PATCH 6/7] target-i386: Use deposit operation.
On Mon, Jan 10, 2011 at 07:23:47PM -0800, Richard Henderson wrote: > Use this for assignment to the low byte or low word of a register. > > Signed-off-by: Richard Henderson > --- > target-i386/translate.c | 34 ++ > 1 files changed, 6 insertions(+), 28 deletions(-) Acked-by: Edgar E. Iglesias > diff --git a/target-i386/translate.c b/target-i386/translate.c > index 7b6e3c2..c008450 100644 > --- a/target-i386/translate.c > +++ b/target-i386/translate.c > @@ -274,28 +274,16 @@ static inline void gen_op_andl_A0_(void) > > static inline void gen_op_mov_reg_v(int ot, int reg, TCGv t0) > { > -TCGv tmp; > - > switch(ot) { > case OT_BYTE: > -tmp = tcg_temp_new(); > -tcg_gen_ext8u_tl(tmp, t0); > if (reg < 4 X86_64_DEF( || reg >= 8 || x86_64_hregs)) { > -tcg_gen_andi_tl(cpu_regs[reg], cpu_regs[reg], ~0xff); > -tcg_gen_or_tl(cpu_regs[reg], cpu_regs[reg], tmp); > +tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], t0, 0, 8); > } else { > -tcg_gen_shli_tl(tmp, tmp, 8); > -tcg_gen_andi_tl(cpu_regs[reg - 4], cpu_regs[reg - 4], ~0xff00); > -tcg_gen_or_tl(cpu_regs[reg - 4], cpu_regs[reg - 4], tmp); > +tcg_gen_deposit_tl(cpu_regs[reg - 4], cpu_regs[reg - 4], t0, 8, > 8); > } > -tcg_temp_free(tmp); > break; > case OT_WORD: > -tmp = tcg_temp_new(); > -tcg_gen_ext16u_tl(tmp, t0); > -tcg_gen_andi_tl(cpu_regs[reg], cpu_regs[reg], ~0x); > -tcg_gen_or_tl(cpu_regs[reg], cpu_regs[reg], tmp); > -tcg_temp_free(tmp); > +tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], t0, 0, 16); > break; > default: /* XXX this shouldn't be reached; abort? */ > case OT_LONG: > @@ -323,15 +311,9 @@ static inline void gen_op_mov_reg_T1(int ot, int reg) > > static inline void gen_op_mov_reg_A0(int size, int reg) > { > -TCGv tmp; > - > switch(size) { > case 0: > -tmp = tcg_temp_new(); > -tcg_gen_ext16u_tl(tmp, cpu_A0); > -tcg_gen_andi_tl(cpu_regs[reg], cpu_regs[reg], ~0x); > -tcg_gen_or_tl(cpu_regs[reg], cpu_regs[reg], tmp); > -tcg_temp_free(tmp); > +tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], cpu_A0, 0, 16); > break; > default: /* XXX this shouldn't be reached; abort? */ > case 1: > @@ -415,9 +397,7 @@ static inline void gen_op_add_reg_im(int size, int reg, > int32_t val) > switch(size) { > case 0: > tcg_gen_addi_tl(cpu_tmp0, cpu_regs[reg], val); > -tcg_gen_ext16u_tl(cpu_tmp0, cpu_tmp0); > -tcg_gen_andi_tl(cpu_regs[reg], cpu_regs[reg], ~0x); > -tcg_gen_or_tl(cpu_regs[reg], cpu_regs[reg], cpu_tmp0); > +tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], cpu_tmp0, 0, 16); > break; > case 1: > tcg_gen_addi_tl(cpu_tmp0, cpu_regs[reg], val); > @@ -439,9 +419,7 @@ static inline void gen_op_add_reg_T0(int size, int reg) > switch(size) { > case 0: > tcg_gen_add_tl(cpu_tmp0, cpu_regs[reg], cpu_T[0]); > -tcg_gen_ext16u_tl(cpu_tmp0, cpu_tmp0); > -tcg_gen_andi_tl(cpu_regs[reg], cpu_regs[reg], ~0x); > -tcg_gen_or_tl(cpu_regs[reg], cpu_regs[reg], cpu_tmp0); > +tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], cpu_tmp0, 0, 16); > break; > case 1: > tcg_gen_add_tl(cpu_tmp0, cpu_regs[reg], cpu_T[0]); > -- > 1.7.2.3 > > > -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net
Re: [Qemu-devel] Missing roms/seabios/Makefile and roms/vgabios/Makefile
On Thu, Jan 20, 2011 at 10:51:17AM +, Mateusz Loskot wrote: > On 19/01/11 18:07, Blue Swirl wrote: > >On Wed, Jan 19, 2011 at 12:44 PM, Mateusz Loskot wrote: > >>Hi, > >> > >>Running ./configure (under MinGW/MSYS) and symlink gives up trying to > >>create links to non-existing Makefiles in > >>roms/seabios/Makefile > >>roms/vgabios/Makefile > > > >Those directiories are actually git submodules, when you run 'git > >submodule update', they get populated. > > Something is not quite working and I don't get anything populated. > Here I tried under MinGW > > mloskot@dog /g/src/qemu/_git/master > $ git pull > Already up-to-date. > > mloskot@dog /g/src/qemu/_git/master > $ git submodule update > > mloskot@dog /g/src/qemu/_git/master > $ ls roms/seabios/ > config.mak Make sure roms/{vgabios,seabios} are empty directories. $ git submodule init $ git submodule update This should clone the vgabios and seabios repos and checkout the correct commit. After this completes successfully you should have source trees under roms/{vgabios,seabios}. Stefan
Re: [Qemu-devel] Missing roms/seabios/Makefile and roms/vgabios/Makefile
On 19/01/11 18:07, Blue Swirl wrote: On Wed, Jan 19, 2011 at 12:44 PM, Mateusz Loskot wrote: Hi, Running ./configure (under MinGW/MSYS) and symlink gives up trying to create links to non-existing Makefiles in roms/seabios/Makefile roms/vgabios/Makefile Those directiories are actually git submodules, when you run 'git submodule update', they get populated. Something is not quite working and I don't get anything populated. Here I tried under MinGW mloskot@dog /g/src/qemu/_git/master $ git pull Already up-to-date. mloskot@dog /g/src/qemu/_git/master $ git submodule update mloskot@dog /g/src/qemu/_git/master $ ls roms/seabios/ config.mak I also tested on Linux, same results, nothing pulled. Maybe configure should check if the directories are OK and disables building ROMs if not. Generally, if they are optional, I think it's a good idea. Specifically, I'm nearly completely green about qemu internals, so can't tell. Best regards, -- Mateusz Loskot, http://mateusz.loskot.net Charter Member of OSGeo, http://osgeo.org Member of ACCU, http://accu.org
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
2011/1/20 Kevin Wolf : > Am 20.01.2011 06:19, schrieb Yoshiaki Tamura: >> + return; >> + } >> + >> + bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov, >> + blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb, >> + blk_req->reqs[0].opaque); > > Same here. > >> + bdrv_flush(bs); > > This looks really strange. What is this supposed to do? > > One point is that you write it immediately after bdrv_aio_write, so you > get an fsync for which you don't know if it includes the current write > request or if it doesn't. Which data do you want to get flushed to the > disk? I was expecting to flush the aio request that was just initiated. Am I misunderstanding the function? >>> >>> Seems so. The function names don't use really clear terminology either, >>> so you're not the first one to fall in this trap. Basically we have: >>> >>> * qemu_aio_flush() waits for all AIO requests to complete. I think you >>> wanted to have exactly this, but only for a single block device. Such a >>> function doesn't exist yet. >>> >>> * bdrv_flush() makes sure that all successfully completed requests are >>> written to disk (by calling fsync) >>> >>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run >>> the fsync in the thread pool >> >> Then what I wanted to do is, call qemu_aio_flush first, then >> bdrv_flush. It should be like live migration. > > Okay, that makes sense. :-) > > The other thing is that you introduce a bdrv_flush for each request, > basically forcing everyone to something very similar to writethrough > mode. I'm sure this will have a big impact on performance. The reason is to avoid inversion of queued requests. Although processing one-by-one is heavy, wouldn't having requests flushed to disk out of order break the disk image? >>> >>> No, that's fine. If a guest issues two requests at the same time, they >>> may complete in any order. You just need to make sure that you don't >>> call the completion callback before the request really has completed. >> >> We need to flush requests, meaning aio and fsync, before sending >> the final state of the guests, to make sure we can switch to the >> secondary safely. > > In theory I think you could just re-submit the requests on the secondary > if they had not completed yet. > > But you're right, let's keep things simple for the start. > >>> I'm just starting to wonder if the guest won't timeout the requests if >>> they are queued for too long. Even more, with IDE, it can only handle >>> one request at a time, so not completing requests doesn't sound like a >>> good idea at all. In what intervals is the event-tap queue flushed? >> >> The requests are flushed once each transaction completes. So >> it's not with specific intervals. > > Right. So when is a transaction completed? This is the time that a > single request will take. The transaction is completed when the vm state is sent to the secondary, and the primary receives the ack to it. Please let me know if the answer is too vague. What I can tell is that it can't be super fast. >>> On the other hand, if you complete before actually writing out, you >>> don't get timeouts, but you signal success to the guest when the request >>> could still fail. What would you do in this case? With a writeback cache >>> mode we're fine, we can just fail the next flush (until then nothing is >>> guaranteed to be on disk and order doesn't matter either), but with >>> cache=writethrough we're in serious trouble. >>> >>> Have you thought about this problem? Maybe we end up having to flush the >>> event-tap queue for each single write in writethrough mode. >> >> Yes, and that's what I'm trying to do at this point. > > Oh, I must have missed that code. Which patch/function should I look at? Maybe I miss-answered to your question. The device may receive timeouts. If timeouts didn't happen, the requests are flushed one-by-one in writethrough because we're calling qemu_aio_flush and bdrv_flush together. Yoshi >> I know that >> performance matters a lot, but sacrificing reliability over >> performance now isn't a good idea. I first want to lay the >> ground, and then focus on optimization. Note that without dirty >> bitmap optimization, Kemari suffers a lot in sending rams. >> Anthony and I discussed to take this approach at KVM Forum. > > I agree, starting simple makes sense. > > Kevin > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On Thu, Jan 20, 2011 at 09:44:05AM +0100, Gerd Hoffmann wrote: > Hi, > > >For (2), you cannot use bus=X,addr=Y because it makes assumptions about > >the PCI topology which may change in newer -M pc's. > > Why should the PCI topology for 'pc' ever change? > > We'll probably get q35 support some day, but when this lands I > expect we'll see a new machine type 'q35', so '-m q35' will pick the > ich9 chipset (which will have a different pci topology of course) > and '-m pc' will pick the existing piix chipset (which will continue > to look like it looks today). If the topology does ever change (eg in the kind of way anthony suggests, first bus only has the graphics card), I think libvirt is going to need a little work to adapt to the new topology, regardless of whether we currently specify a bus= arg to -device or not. I'm not sure there's anything we could do to future proof us to that kind of change. Regards, Daniel
[Qemu-devel] [PATCH] pci: fix pcibus_get_dev_path()
The commit 6a7005d14b3c32d4864a718fb1cb19c789f58a5 used snprintf() incorrectly. Signed-off-by: TeLeMan --- hw/pci.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 8d0e3df..9f8800d 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -2050,14 +2050,14 @@ static char *pcibus_get_dev_path(DeviceState *dev) path[path_len] = '\0'; /* First field is the domain. */ -snprintf(path, domain_len, "%04x:00", pci_find_domain(d->bus)); +snprintf(path, domain_len + 1, "%04x:00", pci_find_domain(d->bus)); /* Fill in slot numbers. We walk up from device to root, so need to print * them in the reverse order, last to first. */ p = path + path_len; for (t = d; t; t = t->bus->parent_dev) { p -= slot_len; -snprintf(p, slot_len, ":%02x.%x", PCI_SLOT(t->devfn), PCI_FUNC(d->devfn)); +snprintf(p, slot_len + 1, ":%02x.%x", PCI_SLOT(t->devfn), PCI_FUNC(d->devfn)); } return path; -- 1.7.3.1.msysgit.0
Re: [Qemu-devel] Re: [PATCH] savevm: fix corruption in vmstate_subsection_load().
2011/1/20 Paolo Bonzini : > On 01/20/2011 09:57 AM, Yoshiaki Tamura wrote: >> >> 2011/1/20 Paolo Bonzini: >>> >>> On 12/14/2010 10:07 AM, Yoshiaki Tamura wrote: Although it's rare to happen in live migration, when the head of a byte stream contains 0x05 >>> >>> IIUC, this happens if you have VMS_STRUCT and the field after the >>> VMS_STRUCT >>> starts with 0x5. >>> >>> I think you should also add this in vmstate_subsection_load: >>> >>> sub_vmsd = vmstate_get_subsection(sub, idstr); >>> if (sub_vmsd == NULL) { >>> return -ENOENT; >>> } >>> + assert (!sub_vmsd->subsections); >>> ret = vmstate_load_state(f, sub_vmsd, opaque, version_id); >>> >>> and this in vmstate_load_state: >>> >>> if (field->flags& VMS_STRUCT) { >>> + assert (!vmsd->subsections); >>> ret = vmstate_load_state(f, field->vmsd, addr, >>> field->vmsd->version_id); >>> } >> >> Hi Paolo, >> >> You mean, having subsection nested and under VMS_STRUCT are >> violations? > > I believe so, because the protocol doesn't allow you to distinguish: > > - in the case of nested subsections, whether 2 consecutive subsections are > siblings, or the second is nested into the first. In fact, your patch also > fixes a latent bug in case a device supports more than one subsection, and > both are present in the data stream. When vmstate_subsection_load is called > for the first subsection, it would see a 0x5 byte (the beginning of the > second subsection), eat it and then fail with ENOENT. The second subsection > would then fail to load. > > - in the case of VMS_STRUCT, whether a 0x5 byte after the VMS_STRUCT is a > subsection or part of the parent data stream. This is, I believe, the > source of your bug. Thank you for the explanation. It's very helpful because I didn't know the background of subsection. Kemari is kind of stress test of live migration. > I don't think it is possible to fix these problems in the file format while > preserving backwards compatibility with pre-subsection QEMU (which was a > fundamental requirement of subsections). So, I think your patch is correct > and fixes the practical bugs. However, we can be even stronger and assert > that the problematic vmstate descriptions are not used. > > Even better, asserts matching the ones above could be added to > vmstate_subsection_save and vmstate_save_state, as well. I see. Let me fold the assertion you pointed to the original patch for now. Because I'm not an expert in subsection, I would like to leave further improvements to the others. Yoshi > > Paolo > >
Re: [Qemu-devel] [PATCH 2/3] block: tell drivers about an image resize
On Thu, Jan 20, 2011 at 09:14:17AM +, Stefan Hajnoczi wrote: > Cool, this is much nicer than stashing away state like > bs->media_changed = 1. I just took a peek to see if we can remove > that field completely but IDE seems to use it internally. I think we could get rid of it in generic code, but let's keep this series simple.
Re: [Qemu-devel] [PATCH 1/3] block: add resize monitor command
On Thu, Jan 20, 2011 at 08:56:07AM +, Stefan Hajnoczi wrote: > eject, change, block_passwd, and others call the device name argument > "device" instead of "id". In the interest of a consistent external API > it would be nice to use "device" for the block_resize command too. Sure, I can do that with the next respin.
[Qemu-devel] Re: [PATCH 0/3 v2] allow online resizing of block devices
On Wed, Jan 19, 2011 at 06:26:49PM +0100, Juan Quintela wrote: > Does this handle the change of the rtc data? I can't see how/where. > (Not that I know if it is important at all until reboot). It doesn't. > cmos_init_hb() uses the size to guess the right geometry of the drives, > if we change the size of the drive, should we change that one? I don't think changing geometry on a live system is a good idea. But as mentioned at least for Linux guests there's no way to find out about the new size of an IDE disk right now. Although I might send a patch to allow a manual rescan, similar to SCSI.
[Qemu-devel] Re: [PATCH] pci: fix device paths
On Wed, Jan 19, 2011 at 09:24:10PM +0200, Michael S. Tsirkin wrote: > Patch a6a7005d14b3c32d4864a718fb1cb19c789f58a5 generated > broken device paths. We snprintf with a length shorter > than the output, so the last character is discarded and replaced > by the null byte. Fix it up by snprintf to a buffer > which is larger by 1 byte and then memcpy the data (without > the null byte) to where we need it. This fixed the boot for me.
Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state
On 2011-01-19 20:32, Blue Swirl wrote: > On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori > wrote: >> On 01/19/2011 07:15 AM, Markus Armbruster wrote: >>> >>> So they interact with KVM (need kvm_state), and they interact with the >>> emulated PCI bus. Could you elaborate on the fundamental difference >>> between the two interactions that makes you choose the (hypothetical) >>> KVM bus over the PCI bus as device parent? >>> >> >> It's almost arbitrary, but I would say it's the direction that I/Os flow. >> >> But if the underlying observation is that the device tree is not really a >> tree, you're 100% correct. This is part of why a factory interface that >> just takes a parent bus is too simplistic. >> >> I think we ought to introduce a -pci-device option that is specifically for >> creating PCI devices that doesn't require a parent bus argument but provides >> a way to specify stable addressing (for instancing, using a linear index). > > I think kvm_state should not be a property of any device or bus. It > should be split to more logical pieces. > > Some parts of it could remain in CPUState, because they are associated > with a VCPU. > > Also, for example irqfd could be considered to be similar object to > char or block devices provided by QEMU to devices. Would it make sense > to introduce new host types for passing parts of kvm_state to devices? > > I'd also make coalesced MMIO stuff part of memory object. We are not > passing any state references when using cpu_physical_memory_rw(), but > that could be changed. There are currently no VCPU-specific bits remaining in kvm_state. It may be a good idea to introduce an arch-specific kvm_state and move related bits over. It may also once be feasible to carve out memory management related fields if we have proper abstractions for that, but I'm not completely sure here. Anyway, all these things are secondary. The primary topic here is how to deal with kvm_state and its fields that have VM-global scope. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.
Am 20.01.2011 06:19, schrieb Yoshiaki Tamura: > +return; > +} > + > +bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov, > +blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb, > +blk_req->reqs[0].opaque); Same here. > +bdrv_flush(bs); This looks really strange. What is this supposed to do? One point is that you write it immediately after bdrv_aio_write, so you get an fsync for which you don't know if it includes the current write request or if it doesn't. Which data do you want to get flushed to the disk? >>> >>> I was expecting to flush the aio request that was just initiated. >>> Am I misunderstanding the function? >> >> Seems so. The function names don't use really clear terminology either, >> so you're not the first one to fall in this trap. Basically we have: >> >> * qemu_aio_flush() waits for all AIO requests to complete. I think you >> wanted to have exactly this, but only for a single block device. Such a >> function doesn't exist yet. >> >> * bdrv_flush() makes sure that all successfully completed requests are >> written to disk (by calling fsync) >> >> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run >> the fsync in the thread pool > > Then what I wanted to do is, call qemu_aio_flush first, then > bdrv_flush. It should be like live migration. Okay, that makes sense. :-) The other thing is that you introduce a bdrv_flush for each request, basically forcing everyone to something very similar to writethrough mode. I'm sure this will have a big impact on performance. >>> >>> The reason is to avoid inversion of queued requests. Although >>> processing one-by-one is heavy, wouldn't having requests flushed >>> to disk out of order break the disk image? >> >> No, that's fine. If a guest issues two requests at the same time, they >> may complete in any order. You just need to make sure that you don't >> call the completion callback before the request really has completed. > > We need to flush requests, meaning aio and fsync, before sending > the final state of the guests, to make sure we can switch to the > secondary safely. In theory I think you could just re-submit the requests on the secondary if they had not completed yet. But you're right, let's keep things simple for the start. >> I'm just starting to wonder if the guest won't timeout the requests if >> they are queued for too long. Even more, with IDE, it can only handle >> one request at a time, so not completing requests doesn't sound like a >> good idea at all. In what intervals is the event-tap queue flushed? > > The requests are flushed once each transaction completes. So > it's not with specific intervals. Right. So when is a transaction completed? This is the time that a single request will take. >> On the other hand, if you complete before actually writing out, you >> don't get timeouts, but you signal success to the guest when the request >> could still fail. What would you do in this case? With a writeback cache >> mode we're fine, we can just fail the next flush (until then nothing is >> guaranteed to be on disk and order doesn't matter either), but with >> cache=writethrough we're in serious trouble. >> >> Have you thought about this problem? Maybe we end up having to flush the >> event-tap queue for each single write in writethrough mode. > > Yes, and that's what I'm trying to do at this point. Oh, I must have missed that code. Which patch/function should I look at? > I know that > performance matters a lot, but sacrificing reliability over > performance now isn't a good idea. I first want to lay the > ground, and then focus on optimization. Note that without dirty > bitmap optimization, Kemari suffers a lot in sending rams. > Anthony and I discussed to take this approach at KVM Forum. I agree, starting simple makes sense. Kevin
Re: [Qemu-devel] [PATCH 2/3] block: tell drivers about an image resize
On Wed, Jan 19, 2011 at 5:02 PM, Christoph Hellwig wrote: > Extend the change_cb callback with a reason argument, and use it > to tell drivers about size changes. > > Signed-off-by: Christoph Hellwig > > Index: qemu/block.c > === > --- qemu.orig/block.c 2011-01-18 20:54:45.246021572 +0100 > +++ qemu/block.c 2011-01-18 20:56:54.117255612 +0100 > @@ -645,7 +645,7 @@ int bdrv_open(BlockDriverState *bs, cons > /* call the change callback */ > bs->media_changed = 1; > if (bs->change_cb) > - bs->change_cb(bs->change_opaque); > + bs->change_cb(bs->change_opaque, CHANGE_MEDIA); Cool, this is much nicer than stashing away state like bs->media_changed = 1. I just took a peek to see if we can remove that field completely but IDE seems to use it internally. Stefan
[Qemu-devel] Re: [PATCH 0/3] pci: disable intx on flr/bus reset
On Thu, Jan 20, 2011 at 04:21:37PM +0900, Isaku Yamahata wrote: > So far pci_device_reset() is used for system reset. > In that case, interrupt controller is also reset so that > all irq is are deasserted. > But now pci bus reset/flr is supported, and in that case irq needs to be > disabled explicitly. Simply->simplify Otherwise looks good. Applied, thanks! > Isaku Yamahata (3): > pci: deassert intx on reset. > msi: simply write config a bit. > msix: simply write config > > hw/msi.c |5 + > hw/msix.c |5 + > hw/pci.c |9 + > hw/pci.h |2 ++ > 4 files changed, 13 insertions(+), 8 deletions(-)
Re: [Qemu-devel] Re: [PATCH] savevm: fix corruption in vmstate_subsection_load().
On 01/20/2011 09:57 AM, Yoshiaki Tamura wrote: 2011/1/20 Paolo Bonzini: On 12/14/2010 10:07 AM, Yoshiaki Tamura wrote: Although it's rare to happen in live migration, when the head of a byte stream contains 0x05 IIUC, this happens if you have VMS_STRUCT and the field after the VMS_STRUCT starts with 0x5. I think you should also add this in vmstate_subsection_load: sub_vmsd = vmstate_get_subsection(sub, idstr); if (sub_vmsd == NULL) { return -ENOENT; } + assert (!sub_vmsd->subsections); ret = vmstate_load_state(f, sub_vmsd, opaque, version_id); and this in vmstate_load_state: if (field->flags& VMS_STRUCT) { + assert (!vmsd->subsections); ret = vmstate_load_state(f, field->vmsd, addr, field->vmsd->version_id); } Hi Paolo, You mean, having subsection nested and under VMS_STRUCT are violations? I believe so, because the protocol doesn't allow you to distinguish: - in the case of nested subsections, whether 2 consecutive subsections are siblings, or the second is nested into the first. In fact, your patch also fixes a latent bug in case a device supports more than one subsection, and both are present in the data stream. When vmstate_subsection_load is called for the first subsection, it would see a 0x5 byte (the beginning of the second subsection), eat it and then fail with ENOENT. The second subsection would then fail to load. - in the case of VMS_STRUCT, whether a 0x5 byte after the VMS_STRUCT is a subsection or part of the parent data stream. This is, I believe, the source of your bug. I don't think it is possible to fix these problems in the file format while preserving backwards compatibility with pre-subsection QEMU (which was a fundamental requirement of subsections). So, I think your patch is correct and fixes the practical bugs. However, we can be even stronger and assert that the problematic vmstate descriptions are not used. Even better, asserts matching the ones above could be added to vmstate_subsection_save and vmstate_save_state, as well. Paolo
Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model
On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote: > After creating a file object, its permission and ownership details are updated > as per client's request for both passthrough and none security model. But with > chrooted environment its not required for passthrough security model. Move all > post file creation changes to none security model > > Signed-off-by: M. Mohan Kumar > --- > hw/9pfs/virtio-9p-local.c | 19 ++- > 1 files changed, 6 insertions(+), 13 deletions(-) > > diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c > index 08fd67f..d2e32e2 100644 > --- a/hw/9pfs/virtio-9p-local.c > +++ b/hw/9pfs/virtio-9p-local.c > @@ -208,21 +208,14 @@ static int local_set_xattr(const char *path, FsCred > *credp) > return 0; > } > > -static int local_post_create_passthrough(FsContext *fs_ctx, const char *path, > +static int local_post_create_none(FsContext *fs_ctx, const char *path, > FsCred *credp) > { > +int retval; > if (chmod(rpath(fs_ctx, path), credp->fc_mode & 0) < 0) { > return -1; > } > -if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) { > -/* > - * If we fail to change ownership and if we are > - * using security model none. Ignore the error > - */ > -if (fs_ctx->fs_sm != SM_NONE) { > -return -1; > -} > -} > +retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid); > return 0; > } retval is unused. Can multiple virtio-9p requests execute at a time? chmod() and lchown() after creation is a race condition if other requests can execute concurrently. Stefan
Re: [Qemu-devel] Re: [PATCH] savevm: fix corruption in vmstate_subsection_load().
2011/1/20 Paolo Bonzini : > On 12/14/2010 10:07 AM, Yoshiaki Tamura wrote: >> >> Although it's rare to happen in live migration, when the head of a >> byte stream contains 0x05 > > IIUC, this happens if you have VMS_STRUCT and the field after the VMS_STRUCT > starts with 0x5. > > I think you should also add this in vmstate_subsection_load: > > sub_vmsd = vmstate_get_subsection(sub, idstr); > if (sub_vmsd == NULL) { > return -ENOENT; > } > + assert (!sub_vmsd->subsections); > ret = vmstate_load_state(f, sub_vmsd, opaque, version_id); > > and this in vmstate_load_state: > > if (field->flags & VMS_STRUCT) { > + assert (!vmsd->subsections); > ret = vmstate_load_state(f, field->vmsd, addr, > field->vmsd->version_id); > } Hi Paolo, You mean, having subsection nested and under VMS_STRUCT are violations? Yoshi > > Paolo > >
Re: [Qemu-devel] [PATCH 1/3] block: add resize monitor command
On Wed, Jan 19, 2011 at 06:02:48PM +0100, Christoph Hellwig wrote: > Index: qemu/hmp-commands.hx > === > --- qemu.orig/hmp-commands.hx 2011-01-19 17:47:10.444004409 +0100 > +++ qemu/hmp-commands.hx 2011-01-19 17:49:51.673254095 +0100 > @@ -53,6 +53,25 @@ Quit the emulator. > ETEXI > > { > +.name = "resize", > +.args_type = "id:s,size:o", > +.params = "device size", > +.help = "resize a block image", > +.user_print = monitor_user_noop, > +.mhandler.cmd_new = do_resize, > +}, > + > +STEXI > +@item resize > +@findex resize > +Resize a block image while a guest is running. Usually requires guest > +action to see the updated size. Resize to a lower size is supported, > +but should be used with extreme caution. Note that this command only > +resizes image files, it can not resize block devices like LVM volumes. > +ETEXI > + > + > +{ > .name = "eject", > .args_type = "force:-f,device:B", > .params = "[-f] device", [...] > Index: qemu/qmp-commands.hx > === > --- qemu.orig/qmp-commands.hx 2011-01-19 17:47:10.478012371 +0100 > +++ qemu/qmp-commands.hx 2011-01-19 17:50:07.406016841 +0100 > @@ -601,6 +601,34 @@ Example: > -> { "execute": "netdev_del", "arguments": { "id": "netdev1" } } > <- { "return": {} } > > + > +EQMP > + > +{ > +.name = "block_resize", > +.args_type = "id:s,size:o", > +.params = "id size", > +.help = "resize a block image", > +.user_print = monitor_user_noop, > +.mhandler.cmd_new = do_resize, > +}, > + > +SQMP > +block_resize > + > + > +Resize a block image while a guest is running. > + > +Arguments: > + > +- "id": the device's ID, must be unique (json-string) > +- "size": new size > + > +Example: > + > +-> { "execute": "block_resize", "arguments": { "id": "scratch", "size": > 1073741824 } } > +<- { "return": {} } > + > EQMP eject, change, block_passwd, and others call the device name argument "device" instead of "id". In the interest of a consistent external API it would be nice to use "device" for the block_resize command too. Stefan