Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model

2011-01-20 Thread Venkateswararao Jujjuri (JV)
On 1/20/2011 1:45 PM, Stefan Hajnoczi wrote:
> On Thu, Jan 20, 2011 at 9:15 PM, Venkateswararao Jujjuri (JV)
>  wrote:
>> On 1/20/2011 12:59 AM, Stefan Hajnoczi wrote:
>>> On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote:
 After creating a file object, its permission and ownership details are updated
 as per the client's request for both the passthrough and none security models.
 But with a chrooted environment it is not required for the passthrough
 security model. Move all post-file-creation changes to the none security model.

 Signed-off-by: M. Mohan Kumar 
 ---
  hw/9pfs/virtio-9p-local.c |   19 ++-
  1 files changed, 6 insertions(+), 13 deletions(-)

 diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
 index 08fd67f..d2e32e2 100644
 --- a/hw/9pfs/virtio-9p-local.c
 +++ b/hw/9pfs/virtio-9p-local.c
 @@ -208,21 +208,14 @@ static int local_set_xattr(const char *path, FsCred *credp)
      return 0;
  }

 -static int local_post_create_passthrough(FsContext *fs_ctx, const char *path,
 +static int local_post_create_none(FsContext *fs_ctx, const char *path,
          FsCred *credp)
  {
 +    int retval;
      if (chmod(rpath(fs_ctx, path), credp->fc_mode & 0) < 0) {
          return -1;
      }
 -    if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) {
 -        /*
 -         * If we fail to change ownership and if we are
 -         * using security model none. Ignore the error
 -         */
 -        if (fs_ctx->fs_sm != SM_NONE) {
 -            return -1;
 -        }
 -    }
 +    retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid);
      return 0;
  }
>>>
>>> retval is unused.
>>>
>>> Can multiple virtio-9p requests execute at a time?  chmod() and lchown()
>>> after creation is a race condition if other requests can execute
>>> concurrently.
>>
>> If some level of serialization is needed, it will be done at the client/guest
>> inode level.
>> Are you worried about filesystem semantics, or do you see some corruption if
>> they get executed in parallel?
> 
> My main concern is unreliable results due to the race conditions
> between creation and the fixups that are performed afterwards.
> 
> Is virtio-9p only useful for single guest exclusive access?  I thought
> both guest and host could access files at the same time?  What about
> multiple VMs sharing a directory?  These scenarios can only work if
> operations are made atomic.

For now, there is only one exploiter of the filesystem: the guest/client.

In the future it could be different, and we 'may' support multiple
exploiters/users.
Note that we have two security models:
1. Passthrough 2. Mapped. (3. None - can be ignored, as it is intended for
developers.)

The mapped model is advised when you have only one exploiter; the passthrough
model is for more practical applications/uses, and it can be used by multiple
exploiters (say, guests).

In the passthrough model we don't do chmod()/lchown() after creating files.

Thanks,
JV
> 
> Stefan





[Qemu-devel] Qemu signal handling

2011-01-20 Thread maheen butt
Hi,
In the QEMU code almost every signal is handled, so why is this warning
generated from syscall.c?

#elif defined(TARGET_ABI_MIPSN64)
# warning signal handling not implemented


  

Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Alex Williamson
On Thu, 2011-01-20 at 18:23 -0600, Anthony Liguori wrote:
> On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote:
> > On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote:
> >
> >> On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote:
> >>  
> >>> When MSI is off, each interrupt needs to be bounced through the io
> >>> thread when it's set/cleared, so vhost-net causes more context switches and
> >>> higher CPU utilization than userspace virtio which handles networking in
> >>> the same thread.
> >>>
> >>> We'll need to fix this by adding level irq support in kvm irqfd,
> >>> for now disable vhost-net in these configurations.
> >>>
> >>> Signed-off-by: Michael S. Tsirkin
> >>>
> >> I actually think this should be a terminal error.  The user asks for
> >> vhost-net; if we cannot enable it, we should exit.
> >>
> >> Or we should warn the user that they should expect bad performance.
> >> Silently doing something that the user has explicitly asked us not
> >> to do is not a good behavior.
> >>
> >> Regards,
> >>
> >> Anthony Liguori
> >>  
> > The issue is that user has no control of the guest, and can not know
> > whether the guest enables MSI. So what you ask for will just make
> > some guests fail, and others fail sometimes.
> > The user also has no way to know that version X of kvm does not expose a
> > way to inject level interrupts with irqfd.
> >
> > We could have *another* flag that says "use vhost where it helps" but
> > then I think this is what everyone wants to do, anyway, and libvirt
> > already sets vhost=on so I prefer redefining the meaning of an existing
> > flag.
> >
> 
> In the very least, there needs to be a vhost=force.
> 
> Having some sort of friendly default policy is fine but we need to 
> provide a mechanism for a user to have the final say.  If you want to 
> redefine vhost=on to really mean, use the friendly default, that's fine 
> by me, but only if the vhost=force option exists.
> 
> I actually would think libvirt would want to use vhost=force.  Debugging 
> with vhost=on is going to be a royal pain in the ass if a user reports 
> bad performance.  Given the libvirt XML, you can't actually tell from 
> the guest and the XML whether or not vhost was actually in use or not.

If we add a force option, let's please distinguish hotplug from VM
creation time.  The latter can abort.  Hotplug should print an error and
fail the initfn.  Thanks,

Alex




Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Anthony Liguori

On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote:

On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote:
   

On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote:
 

When MSI is off, each interrupt needs to be bounced through the io
thread when it's set/cleared, so vhost-net causes more context switches and
higher CPU utilization than userspace virtio which handles networking in
the same thread.

We'll need to fix this by adding level irq support in kvm irqfd,
for now disable vhost-net in these configurations.

Signed-off-by: Michael S. Tsirkin
   

I actually think this should be a terminal error.  The user asks for
vhost-net; if we cannot enable it, we should exit.

Or we should warn the user that they should expect bad performance.
Silently doing something that the user has explicitly asked us not
to do is not a good behavior.

Regards,

Anthony Liguori
 

The issue is that user has no control of the guest, and can not know
whether the guest enables MSI. So what you ask for will just make
some guests fail, and others fail sometimes.
The user also has no way to know that version X of kvm does not expose a
way to inject level interrupts with irqfd.

We could have *another* flag that says "use vhost where it helps" but
then I think this is what everyone wants to do, anyway, and libvirt
already sets vhost=on so I prefer redefining the meaning of an existing
flag.
   


In the very least, there needs to be a vhost=force.

Having some sort of friendly default policy is fine but we need to 
provide a mechanism for a user to have the final say.  If you want to 
redefine vhost=on to really mean, use the friendly default, that's fine 
by me, but only if the vhost=force option exists.


I actually would think libvirt would want to use vhost=force.  Debugging 
with vhost=on is going to be a royal pain in the ass if a user reports 
bad performance.  Given the libvirt XML, you can't actually tell from 
the guest and the XML whether or not vhost was actually in use or not.


Regards,

Anthony Liguori


Maybe this is best handled by a documentation update?

We always said:
"use vhost=on to enable experimental in kernel 
accelerator\n"

note 'enable' not 'require'. This is similar to how we specify
nvectors : you can not make guest use the feature.

How about this:

diff --git a/qemu-options.hx b/qemu-options.hx
index 898561d..3c937c1 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1061,6 +1061,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
     "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
     "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
     "use vhost=on to enable experimental in kernel accelerator\n"
+    "(note: vhost=on has no effect unless guest uses MSI-X)\n"
     "use 'vhostfd=h' to connect to an already opened vhost net device\n"
 #endif
     "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"


   





[Qemu-devel] Re: [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Sridhar Samudrala
On Thu, 2011-01-20 at 19:47 +0200, Michael S. Tsirkin wrote:
> On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote:
> > On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote:
> > > When MSI is off, each interrupt needs to be bounced through the io
> > > thread when it's set/cleared, so vhost-net causes more context switches and
> > > higher CPU utilization than userspace virtio which handles networking in
> > > the same thread.
> > > 
> > > We'll need to fix this by adding level irq support in kvm irqfd,
> > > for now disable vhost-net in these configurations.
> > > 
> > > Signed-off-by: Michael S. Tsirkin 
> > > ---
> > > 
> > > I need to report some error from virtio-pci
> > > that would be handled specially (disable but don't
> > > report an error) so I wanted one that's never likely to be used by a
> > > userspace ioctl. I selected ERANGE but it'd
> > > be easy to switch to something else. Comments?
> > 
> > Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED?
> > 
> > -Sridhar
> 
> The error is reported by virtio-pci which does not know about vhost.
> I started with EVIRTIO_MSIX_DISABLED and made it shorter.
> Would EVIRTIO_MSIX_DISABLED be better?

I think so. This makes it more clear.
-Sridhar
> 
> > > 
> > >  hw/vhost.c  |4 +++-
> > >  hw/virtio-net.c |6 --
> > >  hw/virtio-pci.c |3 +++
> > >  hw/virtio.h |2 ++
> > >  4 files changed, 12 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/hw/vhost.c b/hw/vhost.c
> > > index 1d09ed0..c79765a 100644
> > > --- a/hw/vhost.c
> > > +++ b/hw/vhost.c
> > > @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, 
> > > VirtIODevice *vdev)
> > >  
> > >  r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
> > >  if (r < 0) {
> > > -fprintf(stderr, "Error binding guest notifier: %d\n", -r);
> > > + if (r != -EVIRTIO_DISABLED) {
> > > + fprintf(stderr, "Error binding guest notifier: %d\n", -r);
> > > + }
> > >  goto fail_notifiers;
> > >  }
> > >  
> > > diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> > > index ccb3e63..5de3fee 100644
> > > --- a/hw/virtio-net.c
> > > +++ b/hw/virtio-net.c
> > > @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, 
> > > uint8_t status)
> > >  if (!n->vhost_started) {
> > >  int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), 
> > > &n->vdev);
> > >  if (r < 0) {
> > > -error_report("unable to start vhost net: %d: "
> > > - "falling back on userspace virtio", -r);
> > > +if (r != -EVIRTIO_DISABLED) {
> > > +error_report("unable to start vhost net: %d: "
> > > + "falling back on userspace virtio", -r);
> > > +}
> > >  } else {
> > >  n->vhost_started = 1;
> > >  }
> > > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> > > index dd8887a..dbf4be0 100644
> > > --- a/hw/virtio-pci.c
> > > +++ b/hw/virtio-pci.c
> > > @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void 
> > > *opaque, int n, bool assign)
> > >  EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
> > >  
> > >  if (assign) {
> > > +if (!msix_enabled(&proxy->pci_dev)) {
> > > +return -EVIRTIO_DISABLED;
> > > +}
> > >  int r = event_notifier_init(notifier, 0);
> > >  if (r < 0) {
> > >  return r;
> > > diff --git a/hw/virtio.h b/hw/virtio.h
> > > index d8546d5..53bbdba 100644
> > > --- a/hw/virtio.h
> > > +++ b/hw/virtio.h
> > > @@ -98,6 +98,8 @@ typedef struct {
> > >  void (*vmstate_change)(void * opaque, bool running);
> > >  } VirtIOBindings;
> > >  
> > > +#define EVIRTIO_DISABLED ERANGE
> > > +
> > >  #define VIRTIO_PCI_QUEUE_MAX 64
> > >  
> > >  #define VIRTIO_NO_VECTOR 0x




Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Jan Kiszka
On 2011-01-20 22:40, Blue Swirl wrote:
> On Thu, Jan 20, 2011 at 9:22 PM, Jan Kiszka  wrote:
>> On 2011-01-20 20:27, Blue Swirl wrote:
>>> On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka  wrote:
 On 2011-01-19 20:32, Blue Swirl wrote:
> On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
>  wrote:
>> On 01/19/2011 07:15 AM, Markus Armbruster wrote:
>>>
>>> So they interact with KVM (need kvm_state), and they interact with the
>>> emulated PCI bus.  Could you elaborate on the fundamental difference
>>> between the two interactions that makes you choose the (hypothetical)
>>> KVM bus over the PCI bus as device parent?
>>>
>>
>> It's almost arbitrary, but I would say it's the direction that I/Os flow.
>>
>> But if the underlying observation is that the device tree is not really a
>> tree, you're 100% correct.  This is part of why a factory interface that
>> just takes a parent bus is too simplistic.
>>
>> I think we ought to introduce a -pci-device option that is specifically for
>> creating PCI devices that doesn't require a parent bus argument but provides
>> a way to specify stable addressing (for instancing, using a linear index).
>
> I think kvm_state should not be a property of any device or bus. It
> should be split to more logical pieces.
>
> Some parts of it could remain in CPUState, because they are associated
> with a VCPU.
>
> Also, for example irqfd could be considered to be similar object to
> char or block devices provided by QEMU to devices. Would it make sense
> to introduce new host types for passing parts of kvm_state to devices?
>
> I'd also make coalesced MMIO stuff part of memory object. We are not
> passing any state references when using cpu_physical_memory_rw(), but
> that could be changed.

 There are currently no VCPU-specific bits remaining in kvm_state.
>>>
>>> I think fields vcpu_events, robust_singlestep, debugregs,
>>> kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the
>>> same for all VCPUs but still they are sort of CPU properties. I'm not
>>> sure about fd field.
>>
>> They are all properties of the currently loaded KVM subsystem in the
>> host kernel. They can't change while KVM's root fd is opened.
>> Replicating this static information into each and every VCPU state would
>> be crazy.
> 
> Then each CPUX86State could have a pointer to common structure.

That already exists.

> 
>> In fact, services like kvm_has_vcpu_events() already encode this: they
>> are static functions without any kvm_state reference that simply return
>> the content of those fields. Totally inconsistent to this, we force the
>> caller of kvm_check_extension to pass a handle. This is part of my
>> problem with the current situation and any halfhearted steps in this
>> context. Either we work towards eliminating "static KVMState *kvm_state"
>> in kvm-all.c or eliminating KVMState.
> 
> If the CPU related fields are accessible through CPUState, the handle
> should be available.
> 
 It may
 be a good idea to introduce an arch-specific kvm_state and move related
 bits over.
>>>
>>> This should probably contain only irqchip_in_kernel, pit_in_kernel and
>>> many_ioeventfds, maybe fd.
>>
>> fd is that root file descriptor you need for a few KVM services that are
>> not bound to a specific VM - e.g. feature queries. It's not arch
>> specific. Arch specific are e.g. robust_singlestep or xsave feature states.
> 
> By arch you mean guest CPU architecture? They are not machine features.

No, they are practically static host features.

> 
>>>
 It may also once be feasible to carve out memory management
 related fields if we have proper abstractions for that, but I'm not
 completely sure here.
>>>
>>> I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region,
>>> migration_log into the memory object.
>>
>> vmfd is the VM-scope file descriptor you need at machine-level. The rest
>> logically belongs to a memory object, but I haven't looked at technical
>> details yet.
>>
>>>
 Anyway, all these things are secondary. The primary topic here is how to
 deal with kvm_state and its fields that have VM-global scope.
>>>
>>> If it is an opaque blob which contains various unrelated stuff, no
>>> clear place will be found.
>>
>> We aren't moving fields yet (and we shouldn't). We first of all need to
>> establish the handle distribution (which apparently requires quite some
>> work in areas beyond KVM).
> 
> But I think this is exactly the problem. If the handle is for the
> current KVMState, you'll indeed need it in various places and passing
> it around will be cumbersome. By moving the fields around, the
> information should be available more naturally.

Yeah, if we had a MachineState or if we could agree on introducing it,
I'm with you again. Improving the currently cumbersome KVM API
interaction was the main 

Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model

2011-01-20 Thread Stefan Hajnoczi
On Thu, Jan 20, 2011 at 9:15 PM, Venkateswararao Jujjuri (JV)
 wrote:
> On 1/20/2011 12:59 AM, Stefan Hajnoczi wrote:
>> On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote:
>>> After creating a file object, its permission and ownership details are updated
>>> as per the client's request for both the passthrough and none security models.
>>> But with a chrooted environment it is not required for the passthrough
>>> security model. Move all post-file-creation changes to the none security model.
>>>
>>> Signed-off-by: M. Mohan Kumar 
>>> ---
>>>  hw/9pfs/virtio-9p-local.c |   19 ++-
>>>  1 files changed, 6 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
>>> index 08fd67f..d2e32e2 100644
>>> --- a/hw/9pfs/virtio-9p-local.c
>>> +++ b/hw/9pfs/virtio-9p-local.c
>>> @@ -208,21 +208,14 @@ static int local_set_xattr(const char *path, FsCred *credp)
>>>      return 0;
>>>  }
>>>
>>> -static int local_post_create_passthrough(FsContext *fs_ctx, const char *path,
>>> +static int local_post_create_none(FsContext *fs_ctx, const char *path,
>>>          FsCred *credp)
>>>  {
>>> +    int retval;
>>>      if (chmod(rpath(fs_ctx, path), credp->fc_mode & 0) < 0) {
>>>          return -1;
>>>      }
>>> -    if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) {
>>> -        /*
>>> -         * If we fail to change ownership and if we are
>>> -         * using security model none. Ignore the error
>>> -         */
>>> -        if (fs_ctx->fs_sm != SM_NONE) {
>>> -            return -1;
>>> -        }
>>> -    }
>>> +    retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid);
>>>      return 0;
>>>  }
>>
>> retval is unused.
>>
>> Can multiple virtio-9p requests execute at a time?  chmod() and lchown()
>> after creation is a race condition if other requests can execute
>> concurrently.
>
> If some level of serialization is needed, it will be done at the client/guest
> inode level.
> Are you worried about filesystem semantics, or do you see some corruption if
> they get executed in parallel?

My main concern is unreliable results due to the race conditions
between creation and the fixups that are performed afterwards.

Is virtio-9p only useful for single guest exclusive access?  I thought
both guest and host could access files at the same time?  What about
multiple VMs sharing a directory?  These scenarios can only work if
operations are made atomic.

Stefan



Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Jan Kiszka
On 2011-01-20 21:02, Blue Swirl wrote:
> I think KVMState was designed to match KVM ioctl interface: all stuff
> that is needed for talking to KVM or received from KVM are there. But
> I think this shouldn't be a design driver.

Agreed. The nice cleanup would probably be the complete assimilation of
KVMState by something bigger of comparable scope.

If a machine was brought up with KVM support, every device that refers
to this machine (as it is supposed to become part of it) should be able
to use KVM services in order to accelerate its model.

> 
> If the only pieces of kvm_state that are needed by the devices are
> irqchip_in_kernel, pit_in_kernel and many_ioeventfds, the problem of
> passing kvm_state to devices becomes very different. Each of these are
> just single bits, affecting only a few devices. Perhaps they could be
> device properties which the board level sets when KVM is used?

Forget about the static capabilities for now. The core of kvm_state are
handles that enable you to use KVM services and maybe state fields that
have machine scope (unless we find more local homes like a memory
object). Those need to be accessible by the kvm layer when servicing
requests of components that are related to that very same machine.

Jan





Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Blue Swirl
On Thu, Jan 20, 2011 at 9:22 PM, Jan Kiszka  wrote:
> On 2011-01-20 20:27, Blue Swirl wrote:
>> On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka  wrote:
>>> On 2011-01-19 20:32, Blue Swirl wrote:
 On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
  wrote:
> On 01/19/2011 07:15 AM, Markus Armbruster wrote:
>>
>> So they interact with KVM (need kvm_state), and they interact with the
>> emulated PCI bus.  Could you elaborate on the fundamental difference
>> between the two interactions that makes you choose the (hypothetical)
>> KVM bus over the PCI bus as device parent?
>>
>
> It's almost arbitrary, but I would say it's the direction that I/Os flow.
>
> But if the underlying observation is that the device tree is not really a
> tree, you're 100% correct.  This is part of why a factory interface that
> just takes a parent bus is too simplistic.
>
> I think we ought to introduce a -pci-device option that is specifically for
> creating PCI devices that doesn't require a parent bus argument but provides
> a way to specify stable addressing (for instancing, using a linear index).

 I think kvm_state should not be a property of any device or bus. It
 should be split to more logical pieces.

 Some parts of it could remain in CPUState, because they are associated
 with a VCPU.

 Also, for example irqfd could be considered to be similar object to
 char or block devices provided by QEMU to devices. Would it make sense
 to introduce new host types for passing parts of kvm_state to devices?

 I'd also make coalesced MMIO stuff part of memory object. We are not
 passing any state references when using cpu_physical_memory_rw(), but
 that could be changed.
>>>
>>> There are currently no VCPU-specific bits remaining in kvm_state.
>>
>> I think fields vcpu_events, robust_singlestep, debugregs,
>> kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the
>> same for all VCPUs but still they are sort of CPU properties. I'm not
>> sure about fd field.
>
> They are all properties of the currently loaded KVM subsystem in the
> host kernel. They can't change while KVM's root fd is opened.
> Replicating this static information into each and every VCPU state would
> be crazy.

Then each CPUX86State could have a pointer to common structure.

> In fact, services like kvm_has_vcpu_events() already encode this: they
> are static functions without any kvm_state reference that simply return
> the content of those fields. Totally inconsistent to this, we force the
> caller of kvm_check_extension to pass a handle. This is part of my
> problem with the current situation and any halfhearted steps in this
> context. Either we work towards eliminating "static KVMState *kvm_state"
> in kvm-all.c or eliminating KVMState.

If the CPU related fields are accessible through CPUState, the handle
should be available.

>>> It may
>>> be a good idea to introduce an arch-specific kvm_state and move related
>>> bits over.
>>
>> This should probably contain only irqchip_in_kernel, pit_in_kernel and
>> many_ioeventfds, maybe fd.
>
> fd is that root file descriptor you need for a few KVM services that are
> not bound to a specific VM - e.g. feature queries. It's not arch
> specific. Arch specific are e.g. robust_singlestep or xsave feature states.

By arch you mean guest CPU architecture? They are not machine features.

>>
>>> It may also once be feasible to carve out memory management
>>> related fields if we have proper abstractions for that, but I'm not
>>> completely sure here.
>>
>> I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region,
>> migration_log into the memory object.
>
> vmfd is the VM-scope file descriptor you need at machine-level. The rest
> logically belongs to a memory object, but I haven't looked at technical
> details yet.
>
>>
>>> Anyway, all these things are secondary. The primary topic here is how to
>>> deal with kvm_state and its fields that have VM-global scope.
>>
>> If it is an opaque blob which contains various unrelated stuff, no
>> clear place will be found.
>
> We aren't moving fields yet (and we shouldn't). We first of all need to
> establish the handle distribution (which apparently requires quite some
> work in areas beyond KVM).

But I think this is exactly the problem. If the handle is for the
current KVMState, you'll indeed need it in various places and passing
it around will be cumbersome. By moving the fields around, the
information should be available more naturally.

>> By the way, we don't have a QEMUState but instead use globals. Perhaps
>> this should be reorganized as well. For fd field, maybe even using a
>> global variable could be justified, since it is used for direct access
>> to kernel, not unlike a system call.
>
> The fd field is part of this discussion. Making it global (but local to
> the kvm subsystem) was an implicit part of my original suggestion.

Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Jan Kiszka
On 2011-01-20 20:37, Anthony Liguori wrote:
> On 01/20/2011 03:33 AM, Jan Kiszka wrote:
>> On 2011-01-19 20:32, Blue Swirl wrote:
>>   
>>> On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
>>>   wrote:
>>> 
 On 01/19/2011 07:15 AM, Markus Armbruster wrote:
   
> So they interact with KVM (need kvm_state), and they interact with the
> emulated PCI bus.  Could you elaborate on the fundamental difference
> between the two interactions that makes you choose the (hypothetical)
> KVM bus over the PCI bus as device parent?
>
>  
 It's almost arbitrary, but I would say it's the direction that I/Os
 flow.

 But if the underlying observation is that the device tree is not
 really a
 tree, you're 100% correct.  This is part of why a factory interface
 that
 just takes a parent bus is too simplistic.

 I think we ought to introduce a -pci-device option that is
 specifically for
 creating PCI devices that doesn't require a parent bus argument but
 provides
 a way to specify stable addressing (for instancing, using a linear
 index).

>>> I think kvm_state should not be a property of any device or bus. It
>>> should be split to more logical pieces.
>>>
>>> Some parts of it could remain in CPUState, because they are associated
>>> with a VCPU.
>>>
>>> Also, for example irqfd could be considered to be similar object to
>>> char or block devices provided by QEMU to devices. Would it make sense
>>> to introduce new host types for passing parts of kvm_state to devices?
>>>
>>> I'd also make coalesced MMIO stuff part of memory object. We are not
>>> passing any state references when using cpu_physical_memory_rw(), but
>>> that could be changed.
>>>  
>> There are currently no VCPU-specific bits remaining in kvm_state. It may
>> be a good idea to introduce an arch-specific kvm_state and move related
>> bits over. It may also once be feasible to carve out memory management
>> related fields if we have proper abstractions for that, but I'm not
>> completely sure here.
>>
>> Anyway, all these things are secondary. The primary topic here is how to
>> deal with kvm_state and its fields that have VM-global scope.
>>
> 
> The debate is really:
> 
> 1) should we remove all passing of kvm_state and just assume it's static
> 
> 2) deal with a couple places in the code where we need to figure out how
> to get at kvm_state
> 
> I think we've only identified 1 real instance of (2) and it's resulted
> in some good discussions about how to model KVM devices vs. emulated
> devices.  Honestly, (1) just stinks.  I see absolutely no advantage to
> it at all.   In the very worst case scenario, the thing we need to do is
> just reference an extern variable in a few places.  That completely
> avoids all of the modelling discussions for now (while leaving for
> placeholder FIXMEs so the problem can be tackled later).

The PCI bus discussion is surely an interesting outcome, but now almost
completely off-topic to the original, way less critical issue (as we
were discussing internals).

> 
> I don't understand the resistance here.

IMHO, most suggestions on the table are still over-designed (like a
KVMBus that only passes a kvm_state - or do you have more features for
it in mind?). The idea I love most so far is establishing a machine
state that also carries those few KVM bits which correspond to the KVM
extension of CPUState.

But in the end I want an implementable consensus that helps moving
forward with main topic: the overdue KVM upstream merge. I just do not
have a clear picture yet.

Jan





Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Jan Kiszka
On 2011-01-20 20:27, Blue Swirl wrote:
> On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka  wrote:
>> On 2011-01-19 20:32, Blue Swirl wrote:
>>> On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
>>>  wrote:
 On 01/19/2011 07:15 AM, Markus Armbruster wrote:
>
> So they interact with KVM (need kvm_state), and they interact with the
> emulated PCI bus.  Could you elaborate on the fundamental difference
> between the two interactions that makes you choose the (hypothetical)
> KVM bus over the PCI bus as device parent?
>

 It's almost arbitrary, but I would say it's the direction that I/Os flow.

 But if the underlying observation is that the device tree is not really a
 tree, you're 100% correct.  This is part of why a factory interface that
 just takes a parent bus is too simplistic.

 I think we ought to introduce a -pci-device option that is specifically for
 creating PCI devices that doesn't require a parent bus argument but provides
 a way to specify stable addressing (for instancing, using a linear index).
>>>
>>> I think kvm_state should not be a property of any device or bus. It
>>> should be split to more logical pieces.
>>>
>>> Some parts of it could remain in CPUState, because they are associated
>>> with a VCPU.
>>>
>>> Also, for example irqfd could be considered to be similar object to
>>> char or block devices provided by QEMU to devices. Would it make sense
>>> to introduce new host types for passing parts of kvm_state to devices?
>>>
>>> I'd also make coalesced MMIO stuff part of memory object. We are not
>>> passing any state references when using cpu_physical_memory_rw(), but
>>> that could be changed.
>>
>> There are currently no VCPU-specific bits remaining in kvm_state.
> 
> I think fields vcpu_events, robust_singlestep, debugregs,
> kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the
> same for all VCPUs but still they are sort of CPU properties. I'm not
> sure about fd field.

They are all properties of the currently loaded KVM subsystem in the
host kernel. They can't change while KVM's root fd is opened.
Replicating this static information into each and every VCPU state would
be crazy.

In fact, services like kvm_has_vcpu_events() already encode this: they
are static functions without any kvm_state reference that simply return
the content of those fields. Totally inconsistent with this, we force the
caller of kvm_check_extension to pass a handle. This is part of my
problem with the current situation and any halfhearted steps in this
context. Either we work towards eliminating "static KVMState *kvm_state"
in kvm-all.c or eliminating KVMState.

> 
>> It may
>> be a good idea to introduce an arch-specific kvm_state and move related
>> bits over.
> 
> This should probably contain only irqchip_in_kernel, pit_in_kernel and
> many_ioeventfds, maybe fd.

fd is that root file descriptor you need for a few KVM services that are
not bound to a specific VM - e.g. feature queries. It's not arch
specific. Arch specific are e.g. robust_singlestep or xsave feature states.

> 
>> It may also once be feasible to carve out memory management
>> related fields if we have proper abstractions for that, but I'm not
>> completely sure here.
> 
> I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region,
> migration_log into the memory object.

vmfd is the VM-scope file descriptor you need at machine-level. The rest
logically belongs to a memory object, but I haven't looked at technical
details yet.

> 
>> Anyway, all these things are secondary. The primary topic here is how to
>> deal with kvm_state and its fields that have VM-global scope.
> 
> If it is an opaque blob which contains various unrelated stuff, no
> clear place will be found.

We aren't moving fields yet (and we shouldn't). We first of all need to
establish the handle distribution (which apparently requires quite some
work in areas beyond KVM).

> 
> By the way, we don't have a QEMUState but instead use globals. Perhaps
> this should be reorganized as well. For fd field, maybe even using a
> global variable could be justified, since it is used for direct access
> to kernel, not unlike a system call.

The fd field is part of this discussion. Making it global (but local to
the kvm subsystem) was an implicit part of my original suggestion.

I've no problem with something like a QEMUState, or better a
MachineState that would also include a few KVM-specific fields like the
vmfd - just like we already do for CPUState (or should we better
introduce a KVM CPU bus... ;) ).

Jan





Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model

2011-01-20 Thread Stefan Hajnoczi
On Thu, Jan 20, 2011 at 2:48 PM, Daniel P. Berrange  wrote:
> On Thu, Jan 20, 2011 at 08:11:27PM +0530, M. Mohan Kumar wrote:
>> On Thursday 20 January 2011 2:29:54 pm Stefan Hajnoczi wrote:
>> > On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote:
>>
>> > > -    if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) {
>> > > -        /*
>> > > -         * If we fail to change ownership and if we are
>> > > -         * using security model none. Ignore the error
>> > > -         */
>> > > -        if (fs_ctx->fs_sm != SM_NONE) {
>> > > -            return -1;
>> > > -        }
>> > > -    }
>> > > +    retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid);
>> > >
>> > >      return 0;
>> > >
>> > >  }
>> >
>> > retval is unused.
>> >
>>
>> That was used to disable the warning message "error: ignoring return value of
>> ‘lchown’, declared with attribute warn_unused_result"
>>
>> Otherwise I have to use
>> if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid)) {
>>       ;
>> }
>>
>> > Can multiple virtio-9p requests execute at a time?  chmod() and lchown()
>> > after creation is a race condition if other requests can execute
>> > concurrently.
>> >
>>
>> We can't implement file creation with requested user credentials and 
>> permission
>> bits in the none security model atomically. It's expected behaviour only
>
> Well you could do the nasty trick of forking a child process
> and doing setuid/gid in that and then creating the file before
> letting the parent continue.
>
>  if ((pid = fork()) == 0) {
>     setuid(fc_uid);
>     setgid(fc_gid);
>     fd =open("foo", O_CREAT);
>     close(fd);
>  } else {
>     waitpid(pid);
>  }
>
> This kind of approach is in fact required if you want to
> be able to create files with a special uid/gid on a root
> squashing NFS server, because otherwise your QEMU running
> as root will have its files squashed to 'nobody' when initially
> created, and lchown will fail with EPERM.  You might decide
> that root squashing NFS is too painful to care about supporting
> though :-)

I was thinking about this approach and it's similar to the chroot
helper process, but this time you have a helper process that does
umask/setgid/setuid as necessary.  Performance will be bad but there's
really no way around this.

Either implement something that works 90% of the time only but runs a
bit faster or implement something that works all the time but runs
slow.  It's not a nice trade-off.

Stefan



Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model

2011-01-20 Thread Venkateswararao Jujjuri (JV)
On 1/20/2011 12:59 AM, Stefan Hajnoczi wrote:
> On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote:
>> After creating a file object, its permission and ownership details are 
>> updated
>> as per client's request for both passthrough and none security model. But 
>> with
>> chrooted environment its not required for passthrough security model. Move 
>> all
>> post file creation changes to none security model
>>
>> Signed-off-by: M. Mohan Kumar 
>> ---
>>  hw/9pfs/virtio-9p-local.c |   19 ++-
>>  1 files changed, 6 insertions(+), 13 deletions(-)
>>
>> diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
>> index 08fd67f..d2e32e2 100644
>> --- a/hw/9pfs/virtio-9p-local.c
>> +++ b/hw/9pfs/virtio-9p-local.c
>> @@ -208,21 +208,14 @@ static int local_set_xattr(const char *path, FsCred 
>> *credp)
>>  return 0;
>>  }
>>  
>> -static int local_post_create_passthrough(FsContext *fs_ctx, const char 
>> *path,
>> +static int local_post_create_none(FsContext *fs_ctx, const char *path,
>>  FsCred *credp)
>>  {
>> +int retval;
>>  if (chmod(rpath(fs_ctx, path), credp->fc_mode & 0) < 0) {
>>  return -1;
>>  }
>> -if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) {
>> -/*
>> - * If we fail to change ownership and if we are
>> - * using security model none. Ignore the error
>> - */
>> -if (fs_ctx->fs_sm != SM_NONE) {
>> -return -1;
>> -}
>> -}
>> +retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid);
>>  return 0;
>>  }
> 
> retval is unused.
> 
> Can multiple virtio-9p requests execute at a time?  chmod() and lchown()
> after creation is a race condition if other requests can execute
> concurrently.

If some level of serialization is needed it will be done at the client/guest
inode level.
Are you worried about filesystem semantics? Or do you see some corruption if 
they
get executed in parallel?

JV

> 
> Stefan
> 





Re: [Spice-devel] [Qemu-devel] usb redirection status report

2011-01-20 Thread David Mansfield
> Hi,
> 
> On 01/20/2011 12:27 PM, Christoph Hellwig wrote:
> > On Wed, Jan 19, 2011 at 07:15:47PM +0100, Hans de Goede wrote:
> >> Hi All,
> >>
> >> As most of you know I'm working on usb redirection (making client usb
> >> devices
> >> accessible in guests over the network).
> >>
> >> I thought it would be a good idea to write a short status report.
> >>
> >> So far the following has been done:
> >>
> >> * written and posted a network protocol for usb redir. over the network,
> >> and send this to the list.
> >

As another late-comer, hate to ask a possibly old question but will this
added protocol allow usb devices from arbitrary "remote" hosts to be
redirected or only from the host running the spice client?

I recently tried to use the native qemu usb functionality to
pass-through a webcam and fell flat on my face due to what others
described as "timing issues", and was wondering if it would be possible
to pass a USB device plugged in to the vm-host to the vm-guest using
your new functionality, even though a third (and unrelated) host is
running the spice client.

I guess this is moot if the new approach will not fix the aforementioned
"timing issues" (i.e. perhaps it is not possible in a general case to
drive an arbitrary USB device non-locally).

Thanks,
David 






Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Blue Swirl
On Thu, Jan 20, 2011 at 7:37 PM, Anthony Liguori
 wrote:
> On 01/20/2011 03:33 AM, Jan Kiszka wrote:
>>
>> On 2011-01-19 20:32, Blue Swirl wrote:
>>
>>>
>>> On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
>>>   wrote:
>>>

 On 01/19/2011 07:15 AM, Markus Armbruster wrote:

>
> So they interact with KVM (need kvm_state), and they interact with the
> emulated PCI bus.  Could you elaborate on the fundamental difference
> between the two interactions that makes you choose the (hypothetical)
> KVM bus over the PCI bus as device parent?
>
>

 It's almost arbitrary, but I would say it's the direction that I/Os
 flow.

 But if the underlying observation is that the device tree is not really
 a
 tree, you're 100% correct.  This is part of why a factory interface that
 just takes a parent bus is too simplistic.

 I think we ought to introduce a -pci-device option that is specifically
 for
 creating PCI devices that doesn't require a parent bus argument but
 provides
 a way to specify stable addressing (for instance, using a linear
 index).

>>>
>>> I think kvm_state should not be a property of any device or bus. It
>>> should be split to more logical pieces.
>>>
>>> Some parts of it could remain in CPUState, because they are associated
>>> with a VCPU.
>>>
>>> Also, for example irqfd could be considered to be similar object to
>>> char or block devices provided by QEMU to devices. Would it make sense
>>> to introduce new host types for passing parts of kvm_state to devices?
>>>
>>> I'd also make coalesced MMIO stuff part of memory object. We are not
>>> passing any state references when using cpu_physical_memory_rw(), but
>>> that could be changed.
>>>
>>
>> There are currently no VCPU-specific bits remaining in kvm_state. It may
>> be a good idea to introduce an arch-specific kvm_state and move related
>> bits over. It may also once be feasible to carve out memory management
>> related fields if we have proper abstractions for that, but I'm not
>> completely sure here.
>>
>> Anyway, all these things are secondary. The primary topic here is how to
>> deal with kvm_state and its fields that have VM-global scope.
>>
>
> The debate is really:
>
> 1) should we remove all passing of kvm_state and just assume it's static
>
> 2) deal with a couple places in the code where we need to figure out how to
> get at kvm_state
>
> I think we've only identified 1 real instance of (2) and it's resulted in
> some good discussions about how to model KVM devices vs. emulated devices.
>  Honestly, (1) just stinks.  I see absolutely no advantage to it at all.

Fully agree.

> In the very worst case scenario, the thing we need to do is just reference
> an extern variable in a few places.  That completely avoids all of the
> modelling discussions for now (while leaving for placeholder FIXMEs so the
> problem can be tackled later).

I think KVMState was designed to match KVM ioctl interface: all stuff
that is needed for talking to KVM or received from KVM are there. But
I think this shouldn't be a design driver.

If the only pieces of kvm_state that are needed by the devices are
irqchip_in_kernel, pit_in_kernel and many_ioeventfds, the problem of
passing kvm_state to devices becomes very different. Each of these are
just single bits, affecting only a few devices. Perhaps they could be
device properties which the board level sets when KVM is used?



Re: [Qemu-devel] [RFC PATCH] Fake machine for scalability testing

2011-01-20 Thread Anthony Liguori

On 01/20/2011 11:12 AM, Markus Armbruster wrote:

Anthony Liguori  writes:

   

On 01/18/2011 02:16 PM, Markus Armbruster wrote:
 

The problem: you want to do serious scalability testing (1000s of VMs)
of your management stack.  If each guest eats up a few 100MiB and
competes for CPU, that requires a serious host machine.  Which you don't
have.  You also don't want to modify the management stack at all, if you
can help it.

The solution: a perfectly normal-looking QEMU that uses minimal
resources.  Ability to execute any guest code is strictly optional ;)

New option -fake-machine creates a fake machine incapable of running
guest code.  Completely compiled out by default, enable with configure
--enable-fake-machine.

With -fake-machine, CPU use is negligible, and memory use is rather
modest.

Non-fake VM running F-14 live, right after boot:
UID        PID  PPID  C    SZ    RSS PSR STIME TTY      TIME CMD
armbru   15707  2558 53 191837 414388  1 21:05 pts/3    00:00:29 [...]

Same VM -fake-machine, after similar time elapsed:
UID        PID  PPID  C    SZ    RSS PSR STIME TTY      TIME CMD
armbru   15742  2558  0 85129   9412  0 21:07 pts/3    00:00:00 [...]

We're using a very similar patch for RHEL scalability testing.

   

Interesting, but:

  9432 anthony   20   0  153m  14m 5384 S0  0.2   0:00.22
qemu-system-x86

That's qemu-system-x86 -m 4
 

Sure you ran qemu-system-x86 -fake-machine?
   


No, I didn't try it.  My point was that -m 4 is already pretty small.


In terms of memory overhead, the largest source is not really going to
be addressed by -fake-machine (l1_phys_map and phys_ram_dirty).
 

git-grep phys_ram_dirty finds nothing.
   


Yeah, it's now ram_list[i].phys_dirty.

l1_phys_map is (sizeof(void *) + sizeof(PhysPageDesc)) * mem_size_in_pages

phys_dirty is mem_size_in_pages bytes.


I don't really understand the point of not creating a VCPU with KVM.
Is there some type of overhead in doing that?
 

I briefly looked at both main loops, TCG's was the first one I happened
to crack, and I didn't feel like doing both then.  If the general
approach is okay, I'll gladly investigate how to do it with KVM.
   


I guess what I don't understand is why do you need to not run guest 
code?  Specifically, if you remove the following, is it any less useful?


diff --git a/cpu-exec.c b/cpu-exec.c
index 8c9fb8b..cd1259a 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -230,7 +230,7 @@ int cpu_exec(CPUState *env1)
 uint8_t *tc_ptr;
 unsigned long next_tb;

-if (cpu_halted(env1) == EXCP_HALTED)
+if (fake_machine || cpu_halted(env1) == EXCP_HALTED)

 return EXCP_HALTED;


Regards,

Anthony Liguori




Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Anthony Liguori

On 01/20/2011 04:33 AM, Daniel P. Berrange wrote:

On Thu, Jan 20, 2011 at 09:44:05AM +0100, Gerd Hoffmann wrote:
   

   Hi,

 

For (2), you cannot use bus=X,addr=Y because it makes assumptions about
the PCI topology which may change in newer -M pc's.
   

Why should the PCI topology for 'pc' ever change?

We'll probably get q35 support some day, but when this lands I
expect we'll see a new machine type 'q35', so '-m q35' will pick the
ich9 chipset (which will have a different pci topology of course)
and '-m pc' will pick the existing piix chipset (which will continue
to look like it looks today).
 

If the topology does ever change (eg in the kind of way anthony
suggests, first bus only has the graphics card), I think libvirt
is going to need a little work to adapt to the new topology,
regardless of whether we currently specify a bus= arg to -device
or not. I'm not sure there's anything we could do to future proof
us to that kind of change.
   


I assume that libvirt today assumes that it can use a set of PCI slots 
in bus 0, correct?  Probably in the range 3-31?  Such assumptions are 
very likely to break.


Regards,

Anthony Liguori


Regards,
Daniel
   





Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Anthony Liguori

On 01/20/2011 02:44 AM, Gerd Hoffmann wrote:

  Hi,


For (2), you cannot use bus=X,addr=Y because it makes assumptions about
the PCI topology which may change in newer -M pc's.


Why should the PCI topology for 'pc' ever change?

We'll probably get q35 support some day, but when this lands I expect 
we'll see a new machine type 'q35', so '-m q35' will pick the ich9 
chipset (which will have a different pci topology of course) and '-m 
pc' will pick the existing piix chipset (which will continue to look 
like it looks today).


But then what's the default machine type?  When I say -M pc, I really 
mean the default machine.


At some point, "qemu-system-x86_64 -device virtio-net-pci,addr=2.0"

Is not going to be a reliable way to invoke qemu because there's no way 
we can guarantee that slot 2 isn't occupied by a chipset device or some 
other default device.


Regards,

Anthony Liguori


cheers,
  Gerd





Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Anthony Liguori

On 01/20/2011 03:33 AM, Jan Kiszka wrote:

On 2011-01-19 20:32, Blue Swirl wrote:
   

On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
  wrote:
 

On 01/19/2011 07:15 AM, Markus Armbruster wrote:
   

So they interact with KVM (need kvm_state), and they interact with the
emulated PCI bus.  Could you elaborate on the fundamental difference
between the two interactions that makes you choose the (hypothetical)
KVM bus over the PCI bus as device parent?

 

It's almost arbitrary, but I would say it's the direction that I/Os flow.

But if the underlying observation is that the device tree is not really a
tree, you're 100% correct.  This is part of why a factory interface that
just takes a parent bus is too simplistic.

I think we ought to introduce a -pci-device option that is specifically for
creating PCI devices that doesn't require a parent bus argument but provides
a way to specify stable addressing (for instance, using a linear index).
   

I think kvm_state should not be a property of any device or bus. It
should be split to more logical pieces.

Some parts of it could remain in CPUState, because they are associated
with a VCPU.

Also, for example irqfd could be considered to be similar object to
char or block devices provided by QEMU to devices. Would it make sense
to introduce new host types for passing parts of kvm_state to devices?

I'd also make coalesced MMIO stuff part of memory object. We are not
passing any state references when using cpu_physical_memory_rw(), but
that could be changed.
 

There are currently no VCPU-specific bits remaining in kvm_state. It may
be a good idea to introduce an arch-specific kvm_state and move related
bits over. It may also once be feasible to carve out memory management
related fields if we have proper abstractions for that, but I'm not
completely sure here.

Anyway, all these things are secondary. The primary topic here is how to
deal with kvm_state and its fields that have VM-global scope.
   


The debate is really:

1) should we remove all passing of kvm_state and just assume it's static

2) deal with a couple places in the code where we need to figure out how 
to get at kvm_state


I think we've only identified 1 real instance of (2) and it's resulted 
in some good discussions about how to model KVM devices vs. emulated 
devices.  Honestly, (1) just stinks.  I see absolutely no advantage to 
it at all.   In the very worst case scenario, the thing we need to do is 
just reference an extern variable in a few places.  That completely 
avoids all of the modelling discussions for now (while leaving for 
placeholder FIXMEs so the problem can be tackled later).


I don't understand the resistance here.

Regards,

Anthony Liguori


Jan

   





Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Blue Swirl
On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka  wrote:
> On 2011-01-19 20:32, Blue Swirl wrote:
>> On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
>>  wrote:
>>> On 01/19/2011 07:15 AM, Markus Armbruster wrote:

 So they interact with KVM (need kvm_state), and they interact with the
 emulated PCI bus.  Could you elaborate on the fundamental difference
 between the two interactions that makes you choose the (hypothetical)
 KVM bus over the PCI bus as device parent?

>>>
>>> It's almost arbitrary, but I would say it's the direction that I/Os flow.
>>>
>>> But if the underlying observation is that the device tree is not really a
>>> tree, you're 100% correct.  This is part of why a factory interface that
>>> just takes a parent bus is too simplistic.
>>>
>>> I think we ought to introduce a -pci-device option that is specifically for
>>> creating PCI devices that doesn't require a parent bus argument but provides
>>> a way to specify stable addressing (for instance, using a linear index).
>>
>> I think kvm_state should not be a property of any device or bus. It
>> should be split to more logical pieces.
>>
>> Some parts of it could remain in CPUState, because they are associated
>> with a VCPU.
>>
>> Also, for example irqfd could be considered to be similar object to
>> char or block devices provided by QEMU to devices. Would it make sense
>> to introduce new host types for passing parts of kvm_state to devices?
>>
>> I'd also make coalesced MMIO stuff part of memory object. We are not
>> passing any state references when using cpu_physical_memory_rw(), but
>> that could be changed.
>
> There are currently no VCPU-specific bits remaining in kvm_state.

I think fields vcpu_events, robust_singlestep, debugregs,
kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the
same for all VCPUs but still they are sort of CPU properties. I'm not
sure about fd field.

> It may
> be a good idea to introduce an arch-specific kvm_state and move related
> bits over.

This should probably contain only irqchip_in_kernel, pit_in_kernel and
many_ioeventfds, maybe fd.

> It may also once be feasible to carve out memory management
> related fields if we have proper abstractions for that, but I'm not
> completely sure here.

I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region,
migration_log into the memory object.

> Anyway, all these things are secondary. The primary topic here is how to
> deal with kvm_state and its fields that have VM-global scope.

If it is an opaque blob which contains various unrelated stuff, no
clear place will be found.

By the way, we don't have a QEMUState but instead use globals. Perhaps
this should be reorganized as well. For fd field, maybe even using a
global variable could be justified, since it is used for direct access
to kernel, not unlike a system call.



Re: [Qemu-devel] usb redirection status report

2011-01-20 Thread Hans de Goede

Hi,

On 01/20/2011 12:27 PM, Christoph Hellwig wrote:

On Wed, Jan 19, 2011 at 07:15:47PM +0100, Hans de Goede wrote:

Hi All,

As most of you know I'm working on usb redirection (making client usb
devices
accessible in guests over the network).

I thought it would be a good idea to write a short status report.

So far the following has been done:

* written and posted a network protocol for usb redir. over the network,
and send this to the list.


How does this relate to the various existing usb over ip protocols, e.g.
the one in drivers/staging/usbip/ in the Linux kernel tree?



See the previous discussion about this on the qemu-devel list, specifically:
http://lists.gnu.org/archive/html/qemu-devel/2010-12/msg8.html

Regards,

Hans



[Qemu-devel] Re: [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Alex Williamson
On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote:
> When MSI is off, each interrupt needs to be bounced through the io
> thread when it's set/cleared, so vhost-net causes more context switches and
> higher CPU utilization than userspace virtio which handles networking in
> the same thread.
> 
> We'll need to fix this by adding level irq support in kvm irqfd,
> for now disable vhost-net in these configurations.

We need this if we want avoid bouncing vfio INtx through qemu too.

> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> I need to report some error from virtio-pci
> that would be handled specially (disable but don't
> report an error) so I wanted one that's never likely to be used by a
> userspace ioctl. I selected ERANGE but it'd
> be easy to switch to something else. Comments?
> 
>  hw/vhost.c  |4 +++-
>  hw/virtio-net.c |6 --
>  hw/virtio-pci.c |3 +++
>  hw/virtio.h |2 ++
>  4 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/vhost.c b/hw/vhost.c
> index 1d09ed0..c79765a 100644
> --- a/hw/vhost.c
> +++ b/hw/vhost.c
> @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
> *vdev)
>  
>  r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
>  if (r < 0) {
> -fprintf(stderr, "Error binding guest notifier: %d\n", -r);
> + if (r != -EVIRTIO_DISABLED) {
> + fprintf(stderr, "Error binding guest notifier: %d\n", -r);
> + }

style - the above is tab indented.

>  goto fail_notifiers;
>  }
>  
> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> index ccb3e63..5de3fee 100644
> --- a/hw/virtio-net.c
> +++ b/hw/virtio-net.c
> @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, 
> uint8_t status)
>  if (!n->vhost_started) {
>  int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), 
> &n->vdev);
>  if (r < 0) {
> -error_report("unable to start vhost net: %d: "
> - "falling back on userspace virtio", -r);
> +if (r != -EVIRTIO_DISABLED) {
> +error_report("unable to start vhost net: %d: "
> + "falling back on userspace virtio", -r);
> +}
>  } else {
>  n->vhost_started = 1;
>  }
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index dd8887a..dbf4be0 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, 
> int n, bool assign)
>  EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
>  
>  if (assign) {
> +if (!msix_enabled(&proxy->pci_dev)) {
> +return -EVIRTIO_DISABLED;
> +}
>  int r = event_notifier_init(notifier, 0);
>  if (r < 0) {
>  return r;
> diff --git a/hw/virtio.h b/hw/virtio.h
> index d8546d5..53bbdba 100644
> --- a/hw/virtio.h
> +++ b/hw/virtio.h
> @@ -98,6 +98,8 @@ typedef struct {
>  void (*vmstate_change)(void * opaque, bool running);
>  } VirtIOBindings;
>  
> +#define EVIRTIO_DISABLED ERANGE
> +
>  #define VIRTIO_PCI_QUEUE_MAX 64
>  
>  #define VIRTIO_NO_VECTOR 0x

I'm not a fan of having this special return value.  Why doesn't
virtio-pci only setup the set_guest_notifiers function pointer when msix
is enabled?  If not that, it could at least expose some
virtio_foo_enabled() type feature that vhost could check.  Thanks,

Alex




Re: [Qemu-devel] [PATCH] target-arm: Set the right overflow bit for neon 32 and 64 bit saturating add/sub.

2011-01-20 Thread Peter Maydell
On 20 January 2011 17:16, Christophe Lyon  wrote:
> Set the right overflow bit for neon 32 and 64 bit saturating add/sub.
>
> Also move the neon 64 bit saturating add/sub helpers to neon_helper.c
> for consistency with the 32 bits versions.
>
> There is probably still room for code commonalization though.
>
> Peter, this patch is based upon your patch 6f83e7d and adds the 64 bits case.
>
> Signed-off-by: Christophe Lyon 
> Signed-off-by: Peter Maydell 

You shouldn't really leave my sign-off in there if you've changed the
patch. (Actually a lot of the changes in the tree I pointed you at have
my sign-off and really oughtn't -- all I was doing was splitting and
rearranging existing meego commits, so those are all unreviewed /
untested; so you should just remove it anyway.)

I'll review this tomorrow.

-- PMM



[Qemu-devel] Re: [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Michael S. Tsirkin
On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote:
> On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote:
> > When MSI is off, each interrupt needs to be bounced through the io
> > thread when it's set/cleared, so vhost-net causes more context switches and
> > higher CPU utilization than userspace virtio which handles networking in
> > the same thread.
> > 
> > We'll need to fix this by adding level irq support in kvm irqfd,
> > for now disable vhost-net in these configurations.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> > 
> > I need to report some error from virtio-pci
> > that would be handled specially (disable but don't
> > report an error) so I wanted one that's never likely to be used by a
> > userspace ioctl. I selected ERANGE but it'd
> > be easy to switch to something else. Comments?
> 
> Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED?
> 
> -Sridhar

The error is reported by virtio-pci which does not know about vhost.
I started with EVIRTIO_MSIX_DISABLED and made it shorter.
Would EVIRTIO_MSIX_DISABLED be better?

> > 
> >  hw/vhost.c  |4 +++-
> >  hw/virtio-net.c |6 --
> >  hw/virtio-pci.c |3 +++
> >  hw/virtio.h |2 ++
> >  4 files changed, 12 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/vhost.c b/hw/vhost.c
> > index 1d09ed0..c79765a 100644
> > --- a/hw/vhost.c
> > +++ b/hw/vhost.c
> > @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, 
> > VirtIODevice *vdev)
> >  
> >  r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
> >  if (r < 0) {
> > -fprintf(stderr, "Error binding guest notifier: %d\n", -r);
> > +   if (r != -EVIRTIO_DISABLED) {
> > +   fprintf(stderr, "Error binding guest notifier: %d\n", -r);
> > +   }
> >  goto fail_notifiers;
> >  }
> >  
> > diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> > index ccb3e63..5de3fee 100644
> > --- a/hw/virtio-net.c
> > +++ b/hw/virtio-net.c
> > @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, 
> > uint8_t status)
> >  if (!n->vhost_started) {
> >  int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), 
> > &n->vdev);
> >  if (r < 0) {
> > -error_report("unable to start vhost net: %d: "
> > - "falling back on userspace virtio", -r);
> > +if (r != -EVIRTIO_DISABLED) {
> > +error_report("unable to start vhost net: %d: "
> > + "falling back on userspace virtio", -r);
> > +}
> >  } else {
> >  n->vhost_started = 1;
> >  }
> > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> > index dd8887a..dbf4be0 100644
> > --- a/hw/virtio-pci.c
> > +++ b/hw/virtio-pci.c
> > @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, 
> > int n, bool assign)
> >  EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
> >  
> >  if (assign) {
> > +if (!msix_enabled(&proxy->pci_dev)) {
> > +return -EVIRTIO_DISABLED;
> > +}
> >  int r = event_notifier_init(notifier, 0);
> >  if (r < 0) {
> >  return r;
> > diff --git a/hw/virtio.h b/hw/virtio.h
> > index d8546d5..53bbdba 100644
> > --- a/hw/virtio.h
> > +++ b/hw/virtio.h
> > @@ -98,6 +98,8 @@ typedef struct {
> >  void (*vmstate_change)(void * opaque, bool running);
> >  } VirtIOBindings;
> >  
> > +#define EVIRTIO_DISABLED ERANGE
> > +
> >  #define VIRTIO_PCI_QUEUE_MAX 64
> >  
> >  #define VIRTIO_NO_VECTOR 0x



[Qemu-devel] [PATCH] target-arm: Set the right overflow bit for neon 32 and 64 bit saturating add/sub.

2011-01-20 Thread Christophe Lyon
Set the right overflow bit for neon 32 and 64 bit saturating add/sub.

Also move the neon 64-bit saturating add/sub helpers to neon_helper.c
for consistency with the 32-bit versions.

There is probably still room to factor out common code, though.

Peter, this patch is based upon your patch 6f83e7d and adds the 64-bit case.

Signed-off-by: Christophe Lyon 
Signed-off-by: Peter Maydell 
---
 target-arm/helpers.h |   12 --
 target-arm/neon_helper.c |   89 ++
 target-arm/op_helper.c   |   49 -
 target-arm/translate.c   |   18 -
 4 files changed, 105 insertions(+), 63 deletions(-)

diff --git a/target-arm/helpers.h b/target-arm/helpers.h
index b88ebae..8a2564e 100644
--- a/target-arm/helpers.h
+++ b/target-arm/helpers.h
@@ -137,10 +137,6 @@ DEF_HELPER_2(rsqrte_f32, f32, f32, env)
 DEF_HELPER_2(recpe_u32, i32, i32, env)
 DEF_HELPER_2(rsqrte_u32, i32, i32, env)
 DEF_HELPER_4(neon_tbl, i32, i32, i32, i32, i32)
-DEF_HELPER_2(neon_add_saturate_u64, i64, i64, i64)
-DEF_HELPER_2(neon_add_saturate_s64, i64, i64, i64)
-DEF_HELPER_2(neon_sub_saturate_u64, i64, i64, i64)
-DEF_HELPER_2(neon_sub_saturate_s64, i64, i64, i64)
 
 DEF_HELPER_2(add_cc, i32, i32, i32)
 DEF_HELPER_2(adc_cc, i32, i32, i32)
@@ -160,10 +156,18 @@ DEF_HELPER_3(neon_qadd_u8, i32, env, i32, i32)
 DEF_HELPER_3(neon_qadd_s8, i32, env, i32, i32)
 DEF_HELPER_3(neon_qadd_u16, i32, env, i32, i32)
 DEF_HELPER_3(neon_qadd_s16, i32, env, i32, i32)
+DEF_HELPER_3(neon_qadd_u32, i32, env, i32, i32)
+DEF_HELPER_3(neon_qadd_s32, i32, env, i32, i32)
 DEF_HELPER_3(neon_qsub_u8, i32, env, i32, i32)
 DEF_HELPER_3(neon_qsub_s8, i32, env, i32, i32)
 DEF_HELPER_3(neon_qsub_u16, i32, env, i32, i32)
 DEF_HELPER_3(neon_qsub_s16, i32, env, i32, i32)
+DEF_HELPER_3(neon_qsub_u32, i32, env, i32, i32)
+DEF_HELPER_3(neon_qsub_s32, i32, env, i32, i32)
+DEF_HELPER_3(neon_qadd_u64, i64, env, i64, i64)
+DEF_HELPER_3(neon_qadd_s64, i64, env, i64, i64)
+DEF_HELPER_3(neon_qsub_u64, i64, env, i64, i64)
+DEF_HELPER_3(neon_qsub_s64, i64, env, i64, i64)
 
 DEF_HELPER_2(neon_hadd_s8, i32, i32, i32)
 DEF_HELPER_2(neon_hadd_u8, i32, i32, i32)
diff --git a/target-arm/neon_helper.c b/target-arm/neon_helper.c
index 20f3c16..c1619c0 100644
--- a/target-arm/neon_helper.c
+++ b/target-arm/neon_helper.c
@@ -198,6 +198,28 @@ NEON_VOP_ENV(qadd_u16, neon_u16, 2)
 #undef NEON_FN
 #undef NEON_USAT
 
+uint32_t HELPER(neon_qadd_u32)(CPUState *env, uint32_t a, uint32_t b)
+{
+uint32_t res = a + b;
+if (res < a) {
+SET_QC();
+res = ~0;
+}
+return res;
+}
+
+uint64_t HELPER(neon_qadd_u64)(CPUState *env, uint64_t src1, uint64_t src2)
+{
+  uint64_t res;
+
+  res = src1 + src2;
+  if (res < src1) {
+SET_QC();
+res = ~(uint64_t)0;
+  }
+  return res;
+}
+
 #define NEON_SSAT(dest, src1, src2, type) do { \
 int32_t tmp = (uint32_t)src1 + (uint32_t)src2; \
 if (tmp != (type)tmp) { \
@@ -218,6 +240,28 @@ NEON_VOP_ENV(qadd_s16, neon_s16, 2)
 #undef NEON_FN
 #undef NEON_SSAT
 
+uint32_t HELPER(neon_qadd_s32)(CPUState *env, uint32_t a, uint32_t b)
+{
+uint32_t res = a + b;
+if (((res ^ a) & SIGNBIT) && !((a ^ b) & SIGNBIT)) {
+SET_QC();
+res = ~(((int32_t)a >> 31) ^ SIGNBIT);
+}
+return res;
+}
+
+uint64_t HELPER(neon_qadd_s64)(CPUState *env, uint64_t src1, uint64_t src2)
+{
+  uint64_t res;
+
+  res = src1 + src2;
+  if (((res ^ src1) & SIGNBIT64) && !((src1 ^ src2) & SIGNBIT64)) {
+SET_QC();
+res = ((int64_t)src1 >> 63) ^ ~SIGNBIT64;
+  }
+  return res;
+}
+
 #define NEON_USAT(dest, src1, src2, type) do { \
 uint32_t tmp = (uint32_t)src1 - (uint32_t)src2; \
 if (tmp != (type)tmp) { \
@@ -234,6 +278,29 @@ NEON_VOP_ENV(qsub_u16, neon_u16, 2)
 #undef NEON_FN
 #undef NEON_USAT
 
+uint32_t HELPER(neon_qsub_u32)(CPUState *env, uint32_t a, uint32_t b)
+{
+uint32_t res = a - b;
+if (res > a) {
+SET_QC();
+res = 0;
+}
+return res;
+}
+
+uint64_t HELPER(neon_qsub_u64)(CPUState *env, uint64_t src1, uint64_t src2)
+{
+  uint64_t res;
+
+  if (src1 < src2) {
+SET_QC();
+res = 0;
+  } else {
+res = src1 - src2;
+  }
+  return res;
+}
+
 #define NEON_SSAT(dest, src1, src2, type) do { \
 int32_t tmp = (uint32_t)src1 - (uint32_t)src2; \
 if (tmp != (type)tmp) { \
@@ -254,6 +321,28 @@ NEON_VOP_ENV(qsub_s16, neon_s16, 2)
 #undef NEON_FN
 #undef NEON_SSAT
 
+uint32_t HELPER(neon_qsub_s32)(CPUState *env, uint32_t a, uint32_t b)
+{
+uint32_t res = a - b;
+if (((res ^ a) & SIGNBIT) && ((a ^ b) & SIGNBIT)) {
+SET_QC();
+res = ~(((int32_t)a >> 31) ^ SIGNBIT);
+}
+return res;
+}
+
+uint64_t HELPER(neon_qsub_s64)(CPUState *env, uint64_t src1, uint64_t src2)
+{
+  uint64_t res;
+
+  res = src1 - src2;
+  if (((res ^ src1) & SIGNBIT64) && ((src1 ^ src2) & SIGNBIT64)) {
+SET_QC();
+res = ((int64_t)src1 >> 63) ^ ~SIGNBIT64;
+  }
+  return res;
+}
+
 #define NEON_FN(dest, src1, src2) 

Re: [Qemu-devel] [RFC PATCH] Fake machine for scalability testing

2011-01-20 Thread Markus Armbruster
Anthony Liguori  writes:

> On 01/18/2011 02:16 PM, Markus Armbruster wrote:
>> The problem: you want to do serious scalability testing (1000s of VMs)
>> of your management stack.  If each guest eats up a few 100MiB and
>> competes for CPU, that requires a serious host machine.  Which you don't
>> have.  You also don't want to modify the management stack at all, if you
>> can help it.
>>
>> The solution: a perfectly normal-looking QEMU that uses minimal
>> resources.  Ability to execute any guest code is strictly optional ;)
>>
>> New option -fake-machine creates a fake machine incapable of running
>> guest code.  Completely compiled out by default, enable with configure
>> --enable-fake-machine.
>>
>> With -fake-machine, CPU use is negligible, and memory use is rather
>> modest.
>>
>> Non-fake VM running F-14 live, right after boot:
>> UID        PID  PPID  C    SZ    RSS PSR STIME TTY      TIME CMD
>> armbru   15707  2558 53 191837 414388   1 21:05 pts/3    00:00:29 [...]
>>
>> Same VM -fake-machine, after similar time elapsed:
>> UID        PID  PPID  C    SZ    RSS PSR STIME TTY      TIME CMD
>> armbru   15742  2558  0 85129  9412    0 21:07 pts/3    00:00:00 [...]
>>
>> We're using a very similar patch for RHEL scalability testing.
>>
>
> Interesting, but:
>
>  9432 anthony   20   0  153m  14m 5384 S  0  0.2   0:00.22
> qemu-system-x86
>
> That's qemu-system-x86 -m 4

Sure you ran qemu-system-x86 -fake-machine?

> In terms of memory overhead, the largest source is not really going to
> be addressed by -fake-machine (l1_phys_map and phys_ram_dirty).

git-grep phys_ram_dirty finds nothing.

> I don't really understand the point of not creating a VCPU with KVM.
> Is there some type of overhead in doing that?

I briefly looked at both main loops, TCG's was the first one I happened
to crack, and I didn't feel like doing both then.  If the general
approach is okay, I'll gladly investigate how to do it with KVM.



Re: [Qemu-devel] [PATCH 1/5] Fast Virtual Disk (FVD) Proposal Part 1

2011-01-20 Thread Stefan Weil

On 20.01.2011 15:49, Chunqiang Tang wrote:

Please try to split the patches into logical parts, and use descriptive
subject lines for each patch.
E.g. adding the new sim command to qemu-io could be one patch, adding
the img_update (why not just update?) command to qemu-img another,
moving code into qemu-tool-time.c one more, etc.


Will do and thank you for the detailed instructions.


-
+

Please do not introduce random whitespace changes in patches.


Stefan Weil previously suggested removing spaces at the end of a line, and
I used a script to do that. It seems that in this example, the old code
has multiple spaces on an empty line, which were automatically removed by
the script.


Yes, that's a problem with some parts of the old code.
For files which you want to modify, you could remove
the spaces with your script before applying your other
modifications and create a separate patch which only
removes the superfluous spaces.

So your patch series would start with patches which
only remove spaces at line endings (and say so in the
patch descriptions). Then these changes are no longer
random whitespace changes.

Regards,
Stefan Weil




[Qemu-devel] [PATCH v4 3/3] qcow2: Batch flushes for COW

2011-01-20 Thread Kevin Wolf
qcow2 calls bdrv_flush() after performing COW in order to ensure that the
L2 table change is never written before the copy is safe on disk. Now that the
L2 table is cached, we can delay the flush until we write out the next L2
table.

Signed-off-by: Kevin Wolf 
---
 block/qcow2-cache.c   |   20 +---
 block/qcow2-cluster.c |2 +-
 block/qcow2.h |1 +
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 5f1740b..8f2955b 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -38,6 +38,7 @@ struct Qcow2Cache {
 int size;
 Qcow2CachedTable*   entries;
 struct Qcow2Cache*  depends;
+booldepends_on_flush;
 boolwritethrough;
 };
 
@@ -85,13 +86,15 @@ static int qcow2_cache_flush_dependency(BlockDriverState 
*bs, Qcow2Cache *c)
 }
 
 c->depends = NULL;
+c->depends_on_flush = false;
+
 return 0;
 }
 
 static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i)
 {
 BDRVQcowState *s = bs->opaque;
-int ret;
+int ret = 0;
 
 if (!c->entries[i].dirty || !c->entries[i].offset) {
 return 0;
@@ -99,11 +102,17 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, 
Qcow2Cache *c, int i)
 
 if (c->depends) {
 ret = qcow2_cache_flush_dependency(bs, c);
-if (ret < 0) {
-return ret;
+} else if (c->depends_on_flush) {
+ret = bdrv_flush(bs->file);
+if (ret >= 0) {
+c->depends_on_flush = false;
 }
 }
 
+if (ret < 0) {
+return ret;
+}
+
 if (c == s->refcount_block_cache) {
 BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE_PART);
 } else if (c == s->l2_table_cache) {
@@ -167,6 +176,11 @@ int qcow2_cache_set_dependency(BlockDriverState *bs, 
Qcow2Cache *c,
 return 0;
 }
 
+void qcow2_cache_depends_on_flush(Qcow2Cache *c)
+{
+c->depends_on_flush = true;
+}
+
 static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c)
 {
 int i;
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 7fa4fa1..716562f 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -637,7 +637,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, 
QCowL2Meta *m)
  * handled.
  */
 if (cow) {
-bdrv_flush(bs->file);
+qcow2_cache_depends_on_flush(s->l2_table_cache);
 }
 
 qcow2_cache_set_dependency(bs, s->l2_table_cache, s->refcount_block_cache);
diff --git a/block/qcow2.h b/block/qcow2.h
index 11cbce3..6d80120 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -229,6 +229,7 @@ void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void 
*table);
 int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c);
 int qcow2_cache_set_dependency(BlockDriverState *bs, Qcow2Cache *c,
 Qcow2Cache *dependency);
+void qcow2_cache_depends_on_flush(Qcow2Cache *c);
 
 int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
 void **table);
-- 
1.7.2.3




[Qemu-devel] [PATCH v4 2/3] qcow2: Use QcowCache

2011-01-20 Thread Kevin Wolf
Use the new functions of qcow2-cache.c for everything that works on refcount
block and L2 tables.

Signed-off-by: Kevin Wolf 
---
 block/qcow2-cache.c|   10 ++
 block/qcow2-cluster.c  |  207 +-
 block/qcow2-refcount.c |  258 
 block/qcow2.c  |   48 -
 block/qcow2.h  |   12 ++-
 5 files changed, 239 insertions(+), 296 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index f7c4e2a..5f1740b 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -104,6 +104,12 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, 
Qcow2Cache *c, int i)
 }
 }
 
+if (c == s->refcount_block_cache) {
+BLKDBG_EVENT(bs->file, BLKDBG_REFBLOCK_UPDATE_PART);
+} else if (c == s->l2_table_cache) {
+BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE);
+}
+
 ret = bdrv_pwrite(bs->file, c->entries[i].offset, c->entries[i].table,
 s->cluster_size);
 if (ret < 0) {
@@ -218,6 +224,10 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
Qcow2Cache *c,
 
 c->entries[i].offset = 0;
 if (read_from_disk) {
+if (c == s->l2_table_cache) {
+BLKDBG_EVENT(bs->file, BLKDBG_L2_LOAD);
+}
+
 ret = bdrv_pread(bs->file, offset, c->entries[i].table, 
s->cluster_size);
 if (ret < 0) {
 return ret;
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index c3ef550..7fa4fa1 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -67,7 +67,11 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size, 
bool exact_size)
 qemu_free(new_l1_table);
 return new_l1_table_offset;
 }
-bdrv_flush(bs->file);
+
+ret = qcow2_cache_flush(bs, s->refcount_block_cache);
+if (ret < 0) {
+return ret;
+}
 
 BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_WRITE_TABLE);
 for(i = 0; i < s->l1_size; i++)
@@ -98,63 +102,6 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size, 
bool exact_size)
 return ret;
 }
 
-void qcow2_l2_cache_reset(BlockDriverState *bs)
-{
-BDRVQcowState *s = bs->opaque;
-
-memset(s->l2_cache, 0, s->l2_size * L2_CACHE_SIZE * sizeof(uint64_t));
-memset(s->l2_cache_offsets, 0, L2_CACHE_SIZE * sizeof(uint64_t));
-memset(s->l2_cache_counts, 0, L2_CACHE_SIZE * sizeof(uint32_t));
-}
-
-static inline int l2_cache_new_entry(BlockDriverState *bs)
-{
-BDRVQcowState *s = bs->opaque;
-uint32_t min_count;
-int min_index, i;
-
-/* find a new entry in the least used one */
-min_index = 0;
-min_count = 0x;
-for(i = 0; i < L2_CACHE_SIZE; i++) {
-if (s->l2_cache_counts[i] < min_count) {
-min_count = s->l2_cache_counts[i];
-min_index = i;
-}
-}
-return min_index;
-}
-
-/*
- * seek_l2_table
- *
- * seek l2_offset in the l2_cache table
- * if not found, return NULL,
- * if found,
- *   increments the l2 cache hit count of the entry,
- *   if counter overflow, divide by two all counters
- *   return the pointer to the l2 cache entry
- *
- */
-
-static uint64_t *seek_l2_table(BDRVQcowState *s, uint64_t l2_offset)
-{
-int i, j;
-
-for(i = 0; i < L2_CACHE_SIZE; i++) {
-if (l2_offset == s->l2_cache_offsets[i]) {
-/* increment the hit count */
-if (++s->l2_cache_counts[i] == 0x) {
-for(j = 0; j < L2_CACHE_SIZE; j++) {
-s->l2_cache_counts[j] >>= 1;
-}
-}
-return s->l2_cache + (i << s->l2_bits);
-}
-}
-return NULL;
-}
-
 /*
  * l2_load
  *
@@ -169,33 +116,11 @@ static int l2_load(BlockDriverState *bs, uint64_t 
l2_offset,
 uint64_t **l2_table)
 {
 BDRVQcowState *s = bs->opaque;
-int min_index;
 int ret;
 
-/* seek if the table for the given offset is in the cache */
-
-*l2_table = seek_l2_table(s, l2_offset);
-if (*l2_table != NULL) {
-return 0;
-}
-
-/* not found: load a new entry in the least used one */
-
-min_index = l2_cache_new_entry(bs);
-*l2_table = s->l2_cache + (min_index << s->l2_bits);
-
-BLKDBG_EVENT(bs->file, BLKDBG_L2_LOAD);
-ret = bdrv_pread(bs->file, l2_offset, *l2_table,
-s->l2_size * sizeof(uint64_t));
-if (ret < 0) {
-qcow2_l2_cache_reset(bs);
-return ret;
-}
+ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset, (void**) l2_table);
 
-s->l2_cache_offsets[min_index] = l2_offset;
-s->l2_cache_counts[min_index] = 1;
-
-return 0;
+return ret;
 }
 
 /*
@@ -238,7 +163,6 @@ static int write_l1_entry(BlockDriverState *bs, int 
l1_index)
 static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table)
 {
 BDRVQcowState *s = bs->opaque;
-int min_index;
 uint64_t old_l2_offset;
 uint64_t *l2_table;
 int64_t l2_offset;
@@ -252,29 +176,48 @@ static int l2_allocate(BlockDriverState *

[Qemu-devel] [PATCH v4 1/3] qcow2: Add QcowCache

2011-01-20 Thread Kevin Wolf
This adds some new cache functions to qcow2 which can be used for caching
refcount blocks and L2 tables. When used with cache=writethrough they work
like the old caching code which is spread all over qcow2, so for this case we
have merely a cleanup.

The interesting case is with writeback caching (this includes cache=none) where
data isn't written to disk immediately but only kept in cache initially. This
leads to some form of metadata write batching which avoids the current "write
to refcount block, flush, write to L2 table" pattern for each single request
when a lot of cluster allocations happen. Instead, cache entries are only
written out if it's required to maintain the right order. In the pure cluster
allocation case this means that all metadata updates for requests are done in
memory initially and on sync, first the refcount blocks are written to disk,
then fsync, then L2 tables.

This improves performance of scenarios with lots of cluster allocations
noticeably (e.g. installation or after taking a snapshot).

Signed-off-by: Kevin Wolf 
---
 Makefile.objs   |2 +-
 block/qcow2-cache.c |  290 +++
 block/qcow2.h   |   19 
 3 files changed, 310 insertions(+), 1 deletions(-)
 create mode 100644 block/qcow2-cache.c

diff --git a/Makefile.objs b/Makefile.objs
index c3e52c5..65889c9 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -19,7 +19,7 @@ block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o 
vvfat.o
-block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
+block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-nested-y += qed-check.o
 block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
new file mode 100644
index 000..f7c4e2a
--- /dev/null
+++ b/block/qcow2-cache.c
@@ -0,0 +1,290 @@
+/*
+ * L2/refcount table cache for the QCOW2 format
+ *
+ * Copyright (c) 2010 Kevin Wolf 
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "block_int.h"
+#include "qemu-common.h"
+#include "qcow2.h"
+
+typedef struct Qcow2CachedTable {
+void*   table;
+int64_t offset;
+booldirty;
+int cache_hits;
+int ref;
+} Qcow2CachedTable;
+
+struct Qcow2Cache {
+int size;
+Qcow2CachedTable*   entries;
+struct Qcow2Cache*  depends;
+boolwritethrough;
+};
+
+Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables,
+bool writethrough)
+{
+BDRVQcowState *s = bs->opaque;
+Qcow2Cache *c;
+int i;
+
+c = qemu_mallocz(sizeof(*c));
+c->size = num_tables;
+c->entries = qemu_mallocz(sizeof(*c->entries) * num_tables);
+c->writethrough = writethrough;
+
+for (i = 0; i < c->size; i++) {
+c->entries[i].table = qemu_blockalign(bs, s->cluster_size);
+}
+
+return c;
+}
+
+int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c)
+{
+int i;
+
+for (i = 0; i < c->size; i++) {
+assert(c->entries[i].ref == 0);
+qemu_vfree(c->entries[i].table);
+}
+
+qemu_free(c->entries);
+qemu_free(c);
+
+return 0;
+}
+
+static int qcow2_cache_flush_dependency(BlockDriverState *bs, Qcow2Cache *c)
+{
+int ret;
+
+ret = qcow2_cache_flush(bs, c->depends);
+if (ret < 0) {
+return ret;
+}
+
+c->depends = NULL;
+return 0;
+}
+
+static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i)
+{
+BDRVQcowState *s = bs->opaque;
+int ret;
+
+if (!c->entries[i].dirty || !c->entries[i].offset) {
+return 0;
+}
+
+if (c->depends) {
+ret = qcow2_cache_flush_dependency(

[Qemu-devel] [PATCH v4 0/3] qcow2 metadata cache

2011-01-20 Thread Kevin Wolf
block-queue turned out to be too big an effort to be useful for quickly fixing the
performance problems that qcow2 got since we introduced the metadata flushes.
While I still think the idea is right, it needs more time and qcow2 doesn't
have more time. Let's come back to block-queue later when the most urgent qcow2
problems are fixed.

So this is the idea of block-queue applied to the very specific case of qcow2.
Whereas block-queue tried to be a generic solution for all kinds of things and
tried to make all writes asynchronous at the same time, this is only about
batching writes to refcount blocks and L2 tables in qcow2 and getting the
dependencies right. (Yes, the L1 table and refcount table are left alone. They
are almost never written to anyway.)

This should be much easier to understand and review, and I myself feel a bit
more confident about it than with block-queue, too.

v1:
- Don't read newly allocated tables from the disk before memsetting them to
  zero

v2:
- Addressed Stefan's review comments
- Added patch 3 to avoid an unnecessary bdrv_flush after COW

v3:
- Some error path fixes (esp. missing or double qcow2_cache_put)

v4:
- Pay attention to make writethrough performance not even worse in
  update_refcount
- Add some blkdebug events back so that "failures" in qemu-iotests are kept
  minimal. Some diff between master and this series is left in test 026, but
  it's harmless.

Kevin Wolf (3):
  qcow2: Add QcowCache
  qcow2: Use QcowCache
  qcow2: Batch flushes for COW

 Makefile.objs  |2 +-
 block/qcow2-cache.c|  314 
 block/qcow2-cluster.c  |  207 +++-
 block/qcow2-refcount.c |  258 ---
 block/qcow2.c  |   48 +++-
 block/qcow2.h  |   32 -
 6 files changed, 564 insertions(+), 297 deletions(-)
 create mode 100644 block/qcow2-cache.c

-- 
1.7.2.3




Re: [Qemu-devel] [PATCH] target-arm: Fix garbage collection of temporaries in Neon emulation.

2011-01-20 Thread Christophe Lyon

> I would personally prefer slightly less terse commit messages
> (for instance it might be nice to list the affected instructions in
> this case). The convention is also to preface the summary line
> with the file or directory affected, ie "target-arm: Fix garbage
> collection of temporaries in Neon emulation".
> 


OK, so here is the same patch with an updated description:

target-arm: Fix garbage collection of temporaries in Neon emulation of
MULL and friends (VMLAL, VQDMLAL, VMLSL, VQDMLSL, VMULL, VQDMULL) as
well as VSHRN, VRSHRN, VQSHRN and VQRSHRN.

Signed-off-by: Christophe Lyon 
---
 target-arm/translate.c |   18 +-
 1 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 57664bc..b3e3d70 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -4176,6 +4176,13 @@ static inline void gen_neon_mull(TCGv_i64 dest, TCGv a, 
TCGv b, int size, int u)
 break;
 default: abort();
 }
+
+/* gen_helper_neon_mull_[su]{8|16} do not free their parameters.
+   Don't forget to clean them now.  */
+if (size < 2) {
+  dead_tmp(a);
+  dead_tmp(b);
+}
 }
 
 /* Translate a NEON data processing instruction.  Return nonzero if the
@@ -4840,7 +4847,7 @@ static int disas_neon_data_insn(CPUState * env, 
DisasContext *s, uint32_t insn)
 if (size == 3) {
 tcg_temp_free_i64(tmp64);
 } else {
-dead_tmp(tmp2);
+tcg_temp_free_i32(tmp2);
 }
 } else if (op == 10) {
 /* VSHLL */
@@ -5076,8 +5083,6 @@ static int disas_neon_data_insn(CPUState * env, 
DisasContext *s, uint32_t insn)
 case 8: case 9: case 10: case 11: case 12: case 13:
 /* VMLAL, VQDMLAL, VMLSL, VQDMLSL, VMULL, VQDMULL */
 gen_neon_mull(cpu_V0, tmp, tmp2, size, u);
-dead_tmp(tmp2);
-dead_tmp(tmp);
 break;
 case 14: /* Polynomial VMULL */
 cpu_abort(env, "Polynomial VMULL not implemented");
@@ -5228,6 +5233,10 @@ static int disas_neon_data_insn(CPUState * env, 
DisasContext *s, uint32_t insn)
 return 1;
 
 tmp2 = neon_get_scalar(size, rm);
+/* We need a copy of tmp2 because gen_neon_mull
+ * deletes it during pass 0.  */
+tmp4 = new_tmp();
+tcg_gen_mov_i32(tmp4, tmp2);
 tmp3 = neon_load_reg(rn, 1);
 
 for (pass = 0; pass < 2; pass++) {
@@ -5235,9 +5244,9 @@ static int disas_neon_data_insn(CPUState * env, 
DisasContext *s, uint32_t insn)
 tmp = neon_load_reg(rn, 0);
 } else {
 tmp = tmp3;
+tmp2 = tmp4;
 }
 gen_neon_mull(cpu_V0, tmp, tmp2, size, u);
-dead_tmp(tmp);
 if (op == 6 || op == 7) {
 gen_neon_negl(cpu_V0, size);
 }
@@ -5264,7 +5273,6 @@ static int disas_neon_data_insn(CPUState * env, 
DisasContext *s, uint32_t insn)
 neon_store_reg64(cpu_V0, rd + pass);
 }
 
-dead_tmp(tmp2);
 
 break;
 default: /* 14 and 15 are RESERVED */
-- 
1.7.2.3




[Qemu-devel] Re: [Xen-devel] [PATCH 2/5] qemu and qemu-xen: fix segfault on con_disconnect

2011-01-20 Thread Ian Jackson
Stefano Stabellini writes ("[Xen-devel] [PATCH 2/5] qemu and qemu-xen: fix 
segfault on con_disconnect"):
> The test on xendev->gnttabdev in con_disconnect is wrong: what
> differentiates the consoles is the dev number.

Thanks, I have applied this patch to qemu-xen-unstable.git.

Ian.



Re: [Qemu-devel] [RFC PATCH] Fake machine for scalability testing

2011-01-20 Thread Anthony Liguori

On 01/18/2011 02:16 PM, Markus Armbruster wrote:

The problem: you want to do serious scalability testing (1000s of VMs)
of your management stack.  If each guest eats up a few 100MiB and
competes for CPU, that requires a serious host machine.  Which you don't
have.  You also don't want to modify the management stack at all, if you
can help it.

The solution: a perfectly normal-looking QEMU that uses minimal
resources.  Ability to execute any guest code is strictly optional ;)

New option -fake-machine creates a fake machine incapable of running
guest code.  Completely compiled out by default, enable with configure
--enable-fake-machine.

With -fake-machine, CPU use is negligible, and memory use is rather
modest.

Non-fake VM running F-14 live, right after boot:
UID        PID  PPID  C    SZ    RSS PSR STIME TTY      TIME CMD
armbru   15707  2558 53 191837 414388   1 21:05 pts/3    00:00:29 [...]

Same VM -fake-machine, after similar time elapsed:
UID        PID  PPID  C    SZ    RSS PSR STIME TTY      TIME CMD
armbru   15742  2558  0 85129  9412    0 21:07 pts/3    00:00:00 [...]

We're using a very similar patch for RHEL scalability testing.
   


Interesting, but:

 9432 anthony   20   0  153m  14m 5384 S  0  0.2   0:00.22
qemu-system-x86


That's qemu-system-x86 -m 4

In terms of memory overhead, the largest source is not really going to 
be addressed by -fake-machine (l1_phys_map and phys_ram_dirty).


I don't really understand the point of not creating a VCPU with KVM.  Is 
there some type of overhead in doing that?


Regards,

Anthony Liguori


HACK ALERT: Works by hacking the main loop so it never executes any
guest code.  Not implemented for KVM's main loop at this time, thus
-fake-machine needs to force KVM off.  It also replaces guest RAM by a
token amount (pc machine only at this time), and forces -vga none,
because VGA eats too much memory.

Note the TODO and FIXME comments.

Dan Berrange explored a different solution a while ago: a new do-nothing
target, patterned after i386, and a new do-nothing machine, patterned
after pc.  His patch works.  But it duplicates much target and machine
code --- adds more than ten times as many lines as this patch.  Keeping
the duplicated code reasonably in sync would be bothersome.  I didn't
like that, talked it over with Dan, and we came up with this idea
instead.

Comments?  Better ideas?
---
  configure   |   12 
  cpu-exec.c  |2 +-
  cpus.c  |3 +++
  hw/pc.c |   30 --
  qemu-options.hx |7 +++
  targphys.h  |7 +++
  vl.c|   21 +
  7 files changed, 71 insertions(+), 11 deletions(-)

diff --git a/configure b/configure
index d68f862..98b0a5f 100755
--- a/configure
+++ b/configure
@@ -174,6 +174,7 @@ trace_backend="nop"
  trace_file="trace"
  spice=""
  rbd=""
+fake_machine="no"

  # parse CC options first
  for opt do
@@ -719,6 +720,10 @@ for opt do
;;
--enable-rbd) rbd="yes"
;;
+  --disable-fake-machine) fake_machine="no"
+  ;;
+  --enable-fake-machine) fake_machine="yes"
+  ;;
*) echo "ERROR: unknown option $opt"; show_help="yes"
;;
esac
@@ -913,6 +918,8 @@ echo "   Default:trace-"
  echo "  --disable-spice  disable spice"
  echo "  --enable-spice   enable spice"
  echo "  --enable-rbd enable building the rados block device (rbd)"
+echo "  --disable-fake-machine   disable -fake-machine option"
+echo "  --enable-fake-machineenable -fake-machine option"
  echo ""
  echo "NOTE: The object files are built at the place where configure is 
launched"
  exit 1
@@ -2455,6 +2462,7 @@ echo "Trace output file $trace_file-"
  echo "spice support $spice"
  echo "rbd support   $rbd"
  echo "xfsctl support$xfs"
+echo "-fake-machine $fake_machine"

  if test $sdl_too_old = "yes"; then
  echo "->  Your SDL version is too old - please upgrade to have SDL support"
@@ -2727,6 +2735,10 @@ if test "$spice" = "yes" ; then
echo "CONFIG_SPICE=y">>  $config_host_mak
  fi

+if test $fake_machine = "yes" ; then
+  echo "CONFIG_FAKE_MACHINE=y">>  $config_host_mak
+fi
+
  # XXX: suppress that
  if [ "$bsd" = "yes" ] ; then
echo "CONFIG_BSD=y">>  $config_host_mak
diff --git a/cpu-exec.c b/cpu-exec.c
index 8c9fb8b..cd1259a 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -230,7 +230,7 @@ int cpu_exec(CPUState *env1)
  uint8_t *tc_ptr;
  unsigned long next_tb;

-if (cpu_halted(env1) == EXCP_HALTED)
+if (fake_machine || cpu_halted(env1) == EXCP_HALTED)
  return EXCP_HALTED;

  cpu_single_env = env1;
diff --git a/cpus.c b/cpus.c
index 0309189..91e708f 100644
--- a/cpus.c
+++ b/cpus.c
@@ -128,6 +128,9 @@ static int cpu_can_run(CPUState *env)

  static int cpu_has_work(CPUState *env)
  {
+if (fake_machine) {
+return 0;
+}
  if (env->stop)
  return 1;
  if (env->queued_work_first)
diff --git a/hw/pc.c b/hw/pc.c
i

[Qemu-devel] Re: [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Sridhar Samudrala
On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote:
> When MSI is off, each interrupt needs to be bounced through the io
> thread when it's set/cleared, so vhost-net causes more context switches and
> higher CPU utilization than userspace virtio which handles networking in
> the same thread.
> 
> We'll need to fix this by adding level irq support in kvm irqfd,
> for now disable vhost-net in these configurations.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> I need to report some error from virtio-pci
> that would be handled specially (disable but don't
> report an error) so I wanted one that's never likely to be used by a
> userspace ioctl. I selected ERANGE but it'd
> be easy to switch to something else. Comments?

Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED?

-Sridhar
> 
>  hw/vhost.c  |4 +++-
>  hw/virtio-net.c |6 --
>  hw/virtio-pci.c |3 +++
>  hw/virtio.h |2 ++
>  4 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/vhost.c b/hw/vhost.c
> index 1d09ed0..c79765a 100644
> --- a/hw/vhost.c
> +++ b/hw/vhost.c
> @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
> *vdev)
>  
>  r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
>  if (r < 0) {
> -fprintf(stderr, "Error binding guest notifier: %d\n", -r);
> + if (r != -EVIRTIO_DISABLED) {
> + fprintf(stderr, "Error binding guest notifier: %d\n", -r);
> + }
>  goto fail_notifiers;
>  }
>  
> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> index ccb3e63..5de3fee 100644
> --- a/hw/virtio-net.c
> +++ b/hw/virtio-net.c
> @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, 
> uint8_t status)
>  if (!n->vhost_started) {
>  int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), 
> &n->vdev);
>  if (r < 0) {
> -error_report("unable to start vhost net: %d: "
> - "falling back on userspace virtio", -r);
> +if (r != -EVIRTIO_DISABLED) {
> +error_report("unable to start vhost net: %d: "
> + "falling back on userspace virtio", -r);
> +}
>  } else {
>  n->vhost_started = 1;
>  }
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index dd8887a..dbf4be0 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, 
> int n, bool assign)
>  EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
>  
>  if (assign) {
> +if (!msix_enabled(&proxy->pci_dev)) {
> +return -EVIRTIO_DISABLED;
> +}
>  int r = event_notifier_init(notifier, 0);
>  if (r < 0) {
>  return r;
> diff --git a/hw/virtio.h b/hw/virtio.h
> index d8546d5..53bbdba 100644
> --- a/hw/virtio.h
> +++ b/hw/virtio.h
> @@ -98,6 +98,8 @@ typedef struct {
>  void (*vmstate_change)(void * opaque, bool running);
>  } VirtIOBindings;
>  
> +#define EVIRTIO_DISABLED ERANGE
> +
>  #define VIRTIO_PCI_QUEUE_MAX 64
>  
>  #define VIRTIO_NO_VECTOR 0x




[Qemu-devel] Re: [PATCH] hw/pl190.c: Fix writing of default vector address

2011-01-20 Thread Aurelien Jarno
On Thu, Jan 20, 2011 at 04:04:52PM +, Peter Maydell wrote:
> The PL190 implementation keeps the default vector address
> in vect_addr[16], but we weren't using this for writes to
> the DEFVECTADDR register. As a result of this fix the
> default_addr structure member is unused and we can delete it.
> 
> Reported-by: Himanshu Chauhan 
> Signed-off-by: Peter Maydell 
> ---
> (This patch addresses Aurelien's comments on Himanshu's
> patch from November: http://patchwork.ozlabs.org/patch/69768/)
> 
>  hw/pl190.c |4 +---
>  1 files changed, 1 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/pl190.c b/hw/pl190.c
> index 17c279b..75f2ba1 100644
> --- a/hw/pl190.c
> +++ b/hw/pl190.c
> @@ -21,7 +21,6 @@ typedef struct {
>  uint32_t soft_level;
>  uint32_t irq_enable;
>  uint32_t fiq_select;
> -uint32_t default_addr;
>  uint8_t vect_control[16];
>  uint32_t vect_addr[PL190_NUM_PRIO];
>  /* Mask containing interrupts with higher priority than this one.  */
> @@ -186,7 +185,7 @@ static void pl190_write(void *opaque, target_phys_addr_t 
> offset, uint32_t val)
>  s->priority = s->prev_prio[s->priority];
>  break;
>  case 13: /* DEFVECTADDR */
> -s->default_addr = val;
> +s->vect_addr[16] = val;
>  break;
>  case 0xc0: /* ITCR */
>  if (val) {
> @@ -252,7 +251,6 @@ static const VMStateDescription vmstate_pl190 = {
>  VMSTATE_UINT32(soft_level, pl190_state),
>  VMSTATE_UINT32(irq_enable, pl190_state),
>  VMSTATE_UINT32(fiq_select, pl190_state),
> -VMSTATE_UINT32(default_addr, pl190_state),
>  VMSTATE_UINT8_ARRAY(vect_control, pl190_state, 16),
>  VMSTATE_UINT32_ARRAY(vect_addr, pl190_state, PL190_NUM_PRIO),
>  VMSTATE_UINT32_ARRAY(prio_mask, pl190_state, PL190_NUM_PRIO+1),
> -- 
> 1.6.3.3
> 

Thanks, applied.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH] Fix block migration when the device size is not a multiple of 1 MB

2011-01-20 Thread Yoshiaki Tamura
2011/1/20 Pierre Riteau :
> On 20 janv. 2011, at 03:06, Yoshiaki Tamura wrote:
>
>> 2011/1/19 Pierre Riteau :
>>> b02bea3a85cc939f09aa674a3f1e4f36d418c007 added a check on the return
>>> value of bdrv_write and aborts migration when it fails. However, if the
>>> size of the block device to migrate is not a multiple of BLOCK_SIZE
>>> (currently 1 MB), the last bdrv_write will fail with -EIO.
>>>
>>> Fixed by calling bdrv_write with the correct size of the last block.
>>> ---
>>>  block-migration.c |   16 +++-
>>>  1 files changed, 15 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/block-migration.c b/block-migration.c
>>> index 1475325..eeb9c62 100644
>>> --- a/block-migration.c
>>> +++ b/block-migration.c
>>> @@ -635,6 +635,8 @@ static int block_load(QEMUFile *f, void *opaque, int 
>>> version_id)
>>>     int64_t addr;
>>>     BlockDriverState *bs;
>>>     uint8_t *buf;
>>> +    int64_t total_sectors;
>>> +    int nr_sectors;
>>>
>>>     do {
>>>         addr = qemu_get_be64(f);
>>> @@ -656,10 +658,22 @@ static int block_load(QEMUFile *f, void *opaque, int 
>>> version_id)
>>>                 return -EINVAL;
>>>             }
>>>
>>> +            total_sectors = bdrv_getlength(bs) >> BDRV_SECTOR_BITS;
>>> +            if (total_sectors <= 0) {
>>> +                fprintf(stderr, "Error getting length of block device 
>>> %s\n", device_name);
>>> +                return -EINVAL;
>>> +            }
>>> +
>>> +            if (total_sectors - addr < BDRV_SECTORS_PER_DIRTY_CHUNK) {
>>> +                nr_sectors = total_sectors - addr;
>>> +            } else {
>>> +                nr_sectors = BDRV_SECTORS_PER_DIRTY_CHUNK;
>>> +            }
>>> +
>>>             buf = qemu_malloc(BLOCK_SIZE);
>>>
>>>             qemu_get_buffer(f, buf, BLOCK_SIZE);
>>> -            ret = bdrv_write(bs, addr, buf, BDRV_SECTORS_PER_DIRTY_CHUNK);
>>> +            ret = bdrv_write(bs, addr, buf, nr_sectors);
>>>
>>>             qemu_free(buf);
>>>             if (ret < 0) {
>>> --
>>> 1.7.3.5
>>>
>>>
>>>
>>
>> Hi Pierre,
>>
>> I don't think the fix above is correct.  If you have a file which
>> isn't aligned with BLOCK_SIZE, you won't get an error with the
>> patch.  However, the receiver doesn't know how many sectors the
>> sender wants written, so the guest may fail after migration
>> because some data may not be written.  IIUC, although changing the
>> bytestream should be prevented as much as possible, we should
>> save/load total_sectors to check that an appropriately sized file
>> is allocated on the receiver side.
>
> Isn't the guest supposed to be started using a file with the correct size?

I personally don't like that; it demands too much of the user.
Can't we expand the image on the fly?  We can just abort if expanding
fails anyway.

> But I guess changing the protocol would be best as it would avoid headaches 
> to people who mistakenly created a file that is too small.

We should think carefully before changing the protocol.

Kevin?

>
>> BTW, you should use error_report instead of fprintf(stderr, ...).
>
> I didn't know that, I followed what was used in this file. Thank you.
>
> --
> Pierre Riteau -- PhD student, Myriads team, IRISA, Rennes, France
> http://perso.univ-rennes1.fr/pierre.riteau/
>
>
>



Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Michael S. Tsirkin
On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote:
> On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote:
> >When MSI is off, each interrupt needs to be bounced through the io
> >thread when it's set/cleared, so vhost-net causes more context switches and
> >higher CPU utilization than userspace virtio which handles networking in
> >the same thread.
> >
> >We'll need to fix this by adding level irq support in kvm irqfd,
> >for now disable vhost-net in these configurations.
> >
> >Signed-off-by: Michael S. Tsirkin
> 
> I actually think this should be a terminal error.  The user asks for
> vhost-net, if we cannot enable it, we should exit.
> 
> Or we should warn the user that they should expect bad performance.
> Silently doing something that the user has explicitly asked us not
> to do is not a good behavior.
> 
> Regards,
> 
> Anthony Liguori

The issue is that the user has no control over the guest, and cannot know
whether the guest enables MSI. So what you ask for will just make
some guests fail, and others fail sometimes.
The user also has no way to know that version X of kvm does not expose a
way to inject level interrupts with irqfd.

We could have *another* flag that says "use vhost where it helps" but
then I think this is what everyone wants to do, anyway, and libvirt
already sets vhost=on so I prefer redefining the meaning of an existing
flag.

Maybe this is best handled by a documentation update?

We always said:
"use vhost=on to enable experimental in kernel 
accelerator\n"

note 'enable', not 'require'. This is similar to how we specify
nvectors: you cannot make the guest use the feature.

How about this:

diff --git a/qemu-options.hx b/qemu-options.hx
index 898561d..3c937c1 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1061,6 +1061,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
 "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap 
flag\n"
 "use vnet_hdr=on to make the lack of IFF_VNET_HDR support 
an error condition\n"
 "use vhost=on to enable experimental in kernel 
accelerator\n"
+"(note: vhost=on has no effect unless guest uses MSI-X)\n"
 "use 'vhostfd=h' to connect to an already opened vhost net 
device\n"
 #endif
 "-net 
socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"




[Qemu-devel] [PATCH] hw/pl190.c: Fix writing of default vector address

2011-01-20 Thread Peter Maydell
The PL190 implementation keeps the default vector address
in vect_addr[16], but we weren't using this for writes to
the DEFVECTADDR register. As a result of this fix the
default_addr structure member is unused and we can delete it.

Reported-by: Himanshu Chauhan 
Signed-off-by: Peter Maydell 
---
(This patch addresses Aurelien's comments on Himanshu's
patch from November: http://patchwork.ozlabs.org/patch/69768/)

 hw/pl190.c |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/hw/pl190.c b/hw/pl190.c
index 17c279b..75f2ba1 100644
--- a/hw/pl190.c
+++ b/hw/pl190.c
@@ -21,7 +21,6 @@ typedef struct {
 uint32_t soft_level;
 uint32_t irq_enable;
 uint32_t fiq_select;
-uint32_t default_addr;
 uint8_t vect_control[16];
 uint32_t vect_addr[PL190_NUM_PRIO];
 /* Mask containing interrupts with higher priority than this one.  */
@@ -186,7 +185,7 @@ static void pl190_write(void *opaque, target_phys_addr_t 
offset, uint32_t val)
 s->priority = s->prev_prio[s->priority];
 break;
 case 13: /* DEFVECTADDR */
-s->default_addr = val;
+s->vect_addr[16] = val;
 break;
 case 0xc0: /* ITCR */
 if (val) {
@@ -252,7 +251,6 @@ static const VMStateDescription vmstate_pl190 = {
 VMSTATE_UINT32(soft_level, pl190_state),
 VMSTATE_UINT32(irq_enable, pl190_state),
 VMSTATE_UINT32(fiq_select, pl190_state),
-VMSTATE_UINT32(default_addr, pl190_state),
 VMSTATE_UINT8_ARRAY(vect_control, pl190_state, 16),
 VMSTATE_UINT32_ARRAY(vect_addr, pl190_state, PL190_NUM_PRIO),
 VMSTATE_UINT32_ARRAY(prio_mask, pl190_state, PL190_NUM_PRIO+1),
-- 
1.6.3.3




[Qemu-devel] [PATCH 2/5] Use vmstate to save/load spitz-lcdtg and corgi-ssp state

2011-01-20 Thread Dmitry Eremin-Solenikov
Signed-off-by: Dmitry Eremin-Solenikov 
---
 hw/spitz.c |   74 ++-
 1 files changed, 28 insertions(+), 46 deletions(-)

diff --git a/hw/spitz.c b/hw/spitz.c
index 3667618..87731d3 100644
--- a/hw/spitz.c
+++ b/hw/spitz.c
@@ -527,8 +527,8 @@ static void spitz_keyboard_register(PXA2xxState *cpu)
 
 typedef struct {
 SSISlave ssidev;
-int bl_intensity;
-int bl_power;
+uint32_t bl_intensity;
+uint32_t bl_power;
 } SpitzLCDTG;
 
 static void spitz_bl_update(SpitzLCDTG *s)
@@ -592,21 +592,6 @@ static uint32_t spitz_lcdtg_transfer(SSISlave *dev, 
uint32_t value)
 return 0;
 }
 
-static void spitz_lcdtg_save(QEMUFile *f, void *opaque)
-{
-SpitzLCDTG *s = (SpitzLCDTG *)opaque;
-qemu_put_be32(f, s->bl_intensity);
-qemu_put_be32(f, s->bl_power);
-}
-
-static int spitz_lcdtg_load(QEMUFile *f, void *opaque, int version_id)
-{
-SpitzLCDTG *s = (SpitzLCDTG *)opaque;
-s->bl_intensity = qemu_get_be32(f);
-s->bl_power = qemu_get_be32(f);
-return 0;
-}
-
 static int spitz_lcdtg_init(SSISlave *dev)
 {
 SpitzLCDTG *s = FROM_SSI_SLAVE(SpitzLCDTG, dev);
@@ -615,8 +600,6 @@ static int spitz_lcdtg_init(SSISlave *dev)
 s->bl_power = 0;
 s->bl_intensity = 0x20;
 
-register_savevm(&dev->qdev, "spitz-lcdtg", -1, 1,
-spitz_lcdtg_save, spitz_lcdtg_load, s);
 return 0;
 }
 
@@ -635,7 +618,7 @@ static DeviceState *max;
 typedef struct {
 SSISlave ssidev;
 SSIBus *bus[3];
-int enable[3];
+uint32_t enable[3];
 } CorgiSSPState;
 
 static uint32_t corgi_ssp_transfer(SSISlave *dev, uint32_t value)
@@ -677,30 +660,6 @@ static void spitz_adc_temp_on(void *opaque, int line, int 
level)
 max111x_set_input(max, MAX_BATT_TEMP, 0);
 }
 
-static void spitz_ssp_save(QEMUFile *f, void *opaque)
-{
-CorgiSSPState *s = (CorgiSSPState *)opaque;
-int i;
-
-for (i = 0; i < 3; i++) {
-qemu_put_be32(f, s->enable[i]);
-}
-}
-
-static int spitz_ssp_load(QEMUFile *f, void *opaque, int version_id)
-{
-CorgiSSPState *s = (CorgiSSPState *)opaque;
-int i;
-
-if (version_id != 1) {
-return -EINVAL;
-}
-for (i = 0; i < 3; i++) {
-s->enable[i] = qemu_get_be32(f);
-}
-return 0;
-}
-
 static int corgi_ssp_init(SSISlave *dev)
 {
 CorgiSSPState *s = FROM_SSI_SLAVE(CorgiSSPState, dev);
@@ -710,8 +669,6 @@ static int corgi_ssp_init(SSISlave *dev)
 s->bus[1] = ssi_create_bus(&dev->qdev, "ssi1");
 s->bus[2] = ssi_create_bus(&dev->qdev, "ssi2");
 
-register_savevm(&dev->qdev, "spitz_ssp", -1, 1,
-spitz_ssp_save, spitz_ssp_load, s);
 return 0;
 }
 
@@ -1070,16 +1027,41 @@ static void spitz_machine_init(void)
 
 machine_init(spitz_machine_init);
 
+static const VMStateDescription vmstate_corgi_ssp_regs = {
+.name = "corgi-ssp",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields = (VMStateField []) {
+VMSTATE_UINT32_ARRAY(enable, CorgiSSPState, 3),
+VMSTATE_END_OF_LIST(),
+}
+};
+
 static SSISlaveInfo corgi_ssp_info = {
 .qdev.name = "corgi-ssp",
 .qdev.size = sizeof(CorgiSSPState),
+.qdev.vmsd = &vmstate_corgi_ssp_regs,
 .init = corgi_ssp_init,
 .transfer = corgi_ssp_transfer
 };
 
+static const VMStateDescription vmstate_spitz_lcdtg_regs = {
+.name = "spitz-lcdtg",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields = (VMStateField []) {
+VMSTATE_UINT32(bl_intensity, SpitzLCDTG),
+VMSTATE_UINT32(bl_power, SpitzLCDTG),
+VMSTATE_END_OF_LIST(),
+}
+};
+
 static SSISlaveInfo spitz_lcdtg_info = {
 .qdev.name = "spitz-lcdtg",
 .qdev.size = sizeof(SpitzLCDTG),
+.qdev.vmsd = &vmstate_spitz_lcdtg_regs,
 .init = spitz_lcdtg_init,
 .transfer = spitz_lcdtg_transfer
 };
-- 
1.7.2.3




[Qemu-devel] [PATCH 1/5] SharpSL scoop device - convert to qdev

2011-01-20 Thread Dmitry Eremin-Solenikov
Convert SharpSL scoop device to qdev, remove lots of supporting code, as
lot of init and gpio related things can now be done automagically.

Signed-off-by: Dmitry Eremin-Solenikov 
---
 hw/sharpsl.h |7 
 hw/spitz.c   |   23 ++--
 hw/tosa.c|   23 ++--
 hw/zaurus.c  |  111 ++---
 4 files changed, 75 insertions(+), 89 deletions(-)

diff --git a/hw/sharpsl.h b/hw/sharpsl.h
index c5ccf79..0b3a774 100644
--- a/hw/sharpsl.h
+++ b/hw/sharpsl.h
@@ -10,13 +10,6 @@
 fprintf(stderr, "%s: " format, __FUNCTION__, ##__VA_ARGS__)
 
 /* zaurus.c */
-typedef struct ScoopInfo ScoopInfo;
-ScoopInfo *scoop_init(PXA2xxState *cpu,
-int instance, target_phys_addr_t target_base);
-void scoop_gpio_set(void *opaque, int line, int level);
-qemu_irq *scoop_gpio_in_get(ScoopInfo *s);
-void scoop_gpio_out_set(ScoopInfo *s, int line,
-qemu_irq handler);
 
 #define SL_PXA_PARAM_BASE  0xaa00
 void sl_bootparam_write(target_phys_addr_t ptr);
diff --git a/hw/spitz.c b/hw/spitz.c
index 092bb64..3667618 100644
--- a/hw/spitz.c
+++ b/hw/spitz.c
@@ -23,6 +23,7 @@
 #include "audio/audio.h"
 #include "boards.h"
 #include "blockdev.h"
+#include "sysbus.h"
 
 #undef REG_FMT
 #define REG_FMT"0x%02lx"
@@ -851,21 +852,21 @@ static void spitz_out_switch(void *opaque, int line, int 
level)
 #define SPITZ_SCP2_MIC_BIAS9
 
 static void spitz_scoop_gpio_setup(PXA2xxState *cpu,
-ScoopInfo *scp0, ScoopInfo *scp1)
+DeviceState *scp0, DeviceState *scp1)
 {
 qemu_irq *outsignals = qemu_allocate_irqs(spitz_out_switch, cpu, 8);
 
-scoop_gpio_out_set(scp0, SPITZ_SCP_CHRG_ON, outsignals[0]);
-scoop_gpio_out_set(scp0, SPITZ_SCP_JK_B, outsignals[1]);
-scoop_gpio_out_set(scp0, SPITZ_SCP_LED_GREEN, outsignals[2]);
-scoop_gpio_out_set(scp0, SPITZ_SCP_LED_ORANGE, outsignals[3]);
+qdev_connect_gpio_out(scp0, SPITZ_SCP_CHRG_ON, outsignals[0]);
+qdev_connect_gpio_out(scp0, SPITZ_SCP_JK_B, outsignals[1]);
+qdev_connect_gpio_out(scp0, SPITZ_SCP_LED_GREEN, outsignals[2]);
+qdev_connect_gpio_out(scp0, SPITZ_SCP_LED_ORANGE, outsignals[3]);
 
 if (scp1) {
-scoop_gpio_out_set(scp1, SPITZ_SCP2_BACKLIGHT_CONT, outsignals[4]);
-scoop_gpio_out_set(scp1, SPITZ_SCP2_BACKLIGHT_ON, outsignals[5]);
+qdev_connect_gpio_out(scp1, SPITZ_SCP2_BACKLIGHT_CONT, outsignals[4]);
+qdev_connect_gpio_out(scp1, SPITZ_SCP2_BACKLIGHT_ON, outsignals[5]);
 }
 
-scoop_gpio_out_set(scp0, SPITZ_SCP_ADC_TEMP_ON, outsignals[6]);
+qdev_connect_gpio_out(scp0, SPITZ_SCP_ADC_TEMP_ON, outsignals[6]);
 }
 
 #define SPITZ_GPIO_HSYNC   22
@@ -952,7 +953,7 @@ static void spitz_common_init(ram_addr_t ram_size,
 const char *cpu_model, enum spitz_model_e model, int arm_id)
 {
 PXA2xxState *cpu;
-ScoopInfo *scp0, *scp1 = NULL;
+DeviceState *scp0, *scp1 = NULL;
 
 if (!cpu_model)
 cpu_model = (model == terrier) ? "pxa270-c5" : "pxa270-c0";
@@ -970,9 +971,9 @@ static void spitz_common_init(ram_addr_t ram_size,
 
 spitz_ssp_attach(cpu);
 
-scp0 = scoop_init(cpu, 0, 0x1080);
+scp0 = sysbus_create_simple("scoop", 0x1080, NULL);
 if (model != akita) {
-   scp1 = scoop_init(cpu, 1, 0x08800040);
+   scp1 = sysbus_create_simple("scoop", 0x08800040, NULL);
 }
 
 spitz_scoop_gpio_setup(cpu, scp0, scp1);
diff --git a/hw/tosa.c b/hw/tosa.c
index cc8ce6d..18e3be5 100644
--- a/hw/tosa.c
+++ b/hw/tosa.c
@@ -20,6 +20,7 @@
 #include "i2c.h"
 #include "ssi.h"
 #include "blockdev.h"
+#include "sysbus.h"
 
 #define TOSA_RAM0x0400
 #define TOSA_ROM   0x0080
@@ -86,14 +87,14 @@ static void tosa_out_switch(void *opaque, int line, int 
level)
 
 
 static void tosa_gpio_setup(PXA2xxState *cpu,
-ScoopInfo *scp0,
-ScoopInfo *scp1,
+DeviceState *scp0,
+DeviceState *scp1,
 TC6393xbState *tmio)
 {
 qemu_irq *outsignals = qemu_allocate_irqs(tosa_out_switch, cpu, 4);
 /* MMC/SD host */
 pxa2xx_mmci_handlers(cpu->mmc,
-scoop_gpio_in_get(scp0)[TOSA_GPIO_SD_WP],
+qdev_get_gpio_in(scp0, TOSA_GPIO_SD_WP),
 
qemu_irq_invert(pxa2xx_gpio_in_get(cpu->gpio)[TOSA_GPIO_nSD_DETECT]));
 
 /* Handle reset */
@@ -108,12 +109,12 @@ static void tosa_gpio_setup(PXA2xxState *cpu,
 pxa2xx_gpio_in_get(cpu->gpio)[TOSA_GPIO_JC_CF_IRQ],
 NULL);
 
-scoop_gpio_out_set(scp1, TOSA_GPIO_BT_LED, outsignals[0]);
-scoop_gpio_out_set(scp1, TOSA_GPIO_NOTE_LED, outsignals[1]);
-scoop_gpio_out_set(scp1, TOSA_GPIO_CHRG_ERR_LED, outsignals[2]);
-scoop_gpio_out_set(scp1, TOSA_GPIO_WLAN_LED, outsignals[3]);
+qdev_connect_gpio_out(scp1, TOSA_GPIO_BT_LED, outsignals[0]);
+qdev_connect_gpio_ou

Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Yoshiaki Tamura
2011/1/20 Kevin Wolf :
> Am 20.01.2011 14:50, schrieb Yoshiaki Tamura:
>> 2011/1/20 Kevin Wolf :
>>> Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
 2011/1/20 Kevin Wolf :
> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
>> +        return;
>> +    }
>> +
>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, 
>> blk_req->reqs[0].qiov,
>> +                    blk_req->reqs[0].nb_sectors, 
>> blk_req->reqs[0].cb,
>> +                    blk_req->reqs[0].opaque);
>
> Same here.
>
>> +    bdrv_flush(bs);
>
> This looks really strange. What is this supposed to do?
>
> One point is that you write it immediately after bdrv_aio_write, so 
> you
> get an fsync for which you don't know if it includes the current write
> request or if it doesn't. Which data do you want to get flushed to 
> the disk?

 I was expecting to flush the aio request that was just initiated.
 Am I misunderstanding the function?
>>>
>>> Seems so. The function names don't use really clear terminology either,
>>> so you're not the first one to fall in this trap. Basically we have:
>>>
>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>> wanted to have exactly this, but only for a single block device. Such a
>>> function doesn't exist yet.
>>>
>>> * bdrv_flush() makes sure that all successfully completed requests are
>>> written to disk (by calling fsync)
>>>
>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
>>> the fsync in the thread pool
>>
>> Then what I wanted to do is, call qemu_aio_flush first, then
>> bdrv_flush.  It should be like live migration.
>
> Okay, that makes sense. :-)
>
> The other thing is that you introduce a bdrv_flush for each request,
> basically forcing everyone to something very similar to writethrough
> mode. I'm sure this will have a big impact on performance.

 The reason is to avoid inversion of queued requests.  Although
 processing one-by-one is heavy, wouldn't having requests flushed
 to disk out of order break the disk image?
>>>
>>> No, that's fine. If a guest issues two requests at the same time, they
>>> may complete in any order. You just need to make sure that you don't
>>> call the completion callback before the request really has completed.
>>
>> We need to flush requests, meaning aio and fsync, before sending
>> the final state of the guests, to make sure we can switch to the
>> secondary safely.
>
> In theory I think you could just re-submit the requests on the secondary
> if they had not completed yet.
>
> But you're right, let's keep things simple for the start.
>
>>> I'm just starting to wonder if the guest won't timeout the requests if
>>> they are queued for too long. Even more, with IDE, it can only handle
>>> one request at a time, so not completing requests doesn't sound like a
>>> good idea at all. In what intervals is the event-tap queue flushed?
>>
>> The requests are flushed once each transaction completes.  So
>> it's not with specific intervals.
>
> Right. So when is a transaction completed? This is the time that a
> single request will take.

 The transaction is completed when the vm state is sent to the
 secondary, and the primary receives the ack to it.  Please let me
 know if the answer is too vague.  What I can tell is that it
 can't be super fast.

>>> On the other hand, if you complete before actually writing out, you
>>> don't get timeouts, but you signal success to the guest when the request
>>> could still fail. What would you do in this case? With a writeback cache
>>> mode we're fine, we can just fail the next flush (until then nothing is
>>> guaranteed to be on disk and order doesn't matter either), but with
>>> cache=writethrough we're in serious trouble.
>>>
>>> Have you thought about this problem? Maybe we end up having to flush the
>>> event-tap queue for each single write in writethrough mode.
>>
>> Yes, and that's what I'm trying to do at this point.
>
> Oh, I must have missed that code. Which patch/function should I look at?

 Maybe I miss-answered to your question.  The device may receive
 timeouts.
>>>
>>> We should pay attention that the guest does not see timeouts. I'm not
>>> expecting that I/O will be super fast, and as long as it is only a
>>> performance problem we can live with it.
>>>
>>> However, as soon as the guest gets timeouts it reports I/O errors and
>>> eventually offlines the block device. At this point it's not a
>>> performance problem any more, but also a correctness problem.
>>>
>>> This is why I sugges

Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Anthony Liguori

On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io
thread when it's set/cleared, so vhost-net causes more context switches and
higher CPU utilization than userspace virtio which handles networking in
the same thread.

We'll need to fix this by adding level irq support in kvm irqfd,
for now disable vhost-net in these configurations.

Signed-off-by: Michael S. Tsirkin
   


I actually think this should be a terminal error.  The user asks for 
vhost-net, if we cannot enable it, we should exit.


Or we should warn the user that they should expect bad performance.  
Silently doing something that the user has explicitly asked us not to do 
is not a good behavior.


Regards,

Anthony Liguori


---

I need to report some error from virtio-pci
that would be handled specially (disable but don't
report an error) so I wanted one that's never likely to be used by a
userspace ioctl. I selected ERANGE but it'd
be easy to switch to something else. Comments?

  hw/vhost.c  |4 +++-
  hw/virtio-net.c |6 --
  hw/virtio-pci.c |3 +++
  hw/virtio.h |2 ++
  4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1d09ed0..c79765a 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev)

  r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
  if (r<  0) {
-fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+   if (r != -EVIRTIO_DISABLED) {
+   fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+   }
  goto fail_notifiers;
  }

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ccb3e63..5de3fee 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t 
status)
  if (!n->vhost_started) {
  int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer),&n->vdev);
  if (r<  0) {
-error_report("unable to start vhost net: %d: "
- "falling back on userspace virtio", -r);
+if (r != -EVIRTIO_DISABLED) {
+error_report("unable to start vhost net: %d: "
+ "falling back on userspace virtio", -r);
+}
  } else {
  n->vhost_started = 1;
  }
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index dd8887a..dbf4be0 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int 
n, bool assign)
  EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);

  if (assign) {
+if (!msix_enabled(&proxy->pci_dev)) {
+return -EVIRTIO_DISABLED;
+}
  int r = event_notifier_init(notifier, 0);
  if (r<  0) {
  return r;
diff --git a/hw/virtio.h b/hw/virtio.h
index d8546d5..53bbdba 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -98,6 +98,8 @@ typedef struct {
  void (*vmstate_change)(void * opaque, bool running);
  } VirtIOBindings;

+#define EVIRTIO_DISABLED ERANGE
+
  #define VIRTIO_PCI_QUEUE_MAX 64

  #define VIRTIO_NO_VECTOR 0x
   





Re: [Qemu-devel] [PULL] virtio, pci, migration, hotplug

2011-01-20 Thread Anthony Liguori

On 01/20/2011 08:54 AM, Michael S. Tsirkin wrote:

This has a fix for a crash with multiple block devices.
Probably makes sense to pull ASAP.

The following changes since commit d788b57051ee91aa39de67cff8d8e15953bc100c:

   target-ppc: fix wrong NaN tests (2011-01-20 15:11:14 +0100)
   


Pulled.  Thanks.

Regards,

Anthony Liguori


are available in the git repository at:
   git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu.git for_anthony

Alex Williamson (1):
   savevm: Fix no_migrate

Amit Shah (1):
   virtio-serial-bus: bump up control vq size to 32

Isaku Yamahata (4):
   pci: deassert intx on reset.
   msi: simplify write config a bit.
   msix: simplify write config
   pci: use qemu_malloc() in pcibus_get_dev_path()

Marcelo Tosatti (2):
   document QEMU<->ACPIBIOS PCI hotplug interface
   acpi_piix4: expose no_hotplug attribute via i/o port

Michael S. Tsirkin (2):
   Merge remote branch 'origin/master' into pci
   pci: fix device paths

  cris-dis.c  |9 ++-
  dis-asm.h   |1 +
  disas.c |9 ++-
  docs/specs/acpi_pci_hotplug.txt |   37 +++
  hw/acpi_piix4.c |   37 +++
  hw/msi.c|5 +---
  hw/msix.c   |5 +---
  hw/pci.c|   27 
  hw/pci.h|2 +
  hw/virtio-serial-bus.c  |   10 +++-
  migration.c |4 +++
  savevm.c|   40 -
  sysemu.h|1 +
  target-cris/translate.c |   41 +++---
  target-cris/translate_v10.c |   10 +---
  15 files changed, 189 insertions(+), 49 deletions(-)
  create mode 100644 docs/specs/acpi_pci_hotplug.txt


   





[Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Michael S. Tsirkin
When MSI is off, each interrupt needs to be bounced through the io
thread when it's set/cleared, so vhost-net causes more context switches and
higher CPU utilization than userspace virtio which handles networking in
the same thread.

We'll need to fix this by adding level irq support in kvm irqfd,
for now disable vhost-net in these configurations.

Signed-off-by: Michael S. Tsirkin 
---

I need to report some error from virtio-pci
that would be handled specially (disable but don't
report an error) so I wanted one that's never likely to be used by a
userspace ioctl. I selected ERANGE but it'd
be easy to switch to something else. Comments?

 hw/vhost.c  |4 +++-
 hw/virtio-net.c |6 --
 hw/virtio-pci.c |3 +++
 hw/virtio.h |2 ++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1d09ed0..c79765a 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice 
*vdev)
 
 r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
 if (r < 0) {
-fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+   if (r != -EVIRTIO_DISABLED) {
+   fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+   }
 goto fail_notifiers;
 }
 
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ccb3e63..5de3fee 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t 
status)
 if (!n->vhost_started) {
 int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
 if (r < 0) {
-error_report("unable to start vhost net: %d: "
- "falling back on userspace virtio", -r);
+if (r != -EVIRTIO_DISABLED) {
+error_report("unable to start vhost net: %d: "
+ "falling back on userspace virtio", -r);
+}
 } else {
 n->vhost_started = 1;
 }
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index dd8887a..dbf4be0 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int 
n, bool assign)
 EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
 
 if (assign) {
+if (!msix_enabled(&proxy->pci_dev)) {
+return -EVIRTIO_DISABLED;
+}
 int r = event_notifier_init(notifier, 0);
 if (r < 0) {
 return r;
diff --git a/hw/virtio.h b/hw/virtio.h
index d8546d5..53bbdba 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -98,6 +98,8 @@ typedef struct {
 void (*vmstate_change)(void * opaque, bool running);
 } VirtIOBindings;
 
+#define EVIRTIO_DISABLED ERANGE
+
 #define VIRTIO_PCI_QUEUE_MAX 64
 
 #define VIRTIO_NO_VECTOR 0x
-- 
1.7.3.2.91.g446ac



[Qemu-devel] [PULL] virtio, pci, migration, hotplug

2011-01-20 Thread Michael S. Tsirkin
This has a fix for a crash with multiple block devices.
Probably makes sense to pull ASAP.

The following changes since commit d788b57051ee91aa39de67cff8d8e15953bc100c:

  target-ppc: fix wrong NaN tests (2011-01-20 15:11:14 +0100)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu.git for_anthony

Alex Williamson (1):
  savevm: Fix no_migrate

Amit Shah (1):
  virtio-serial-bus: bump up control vq size to 32

Isaku Yamahata (4):
  pci: deassert intx on reset.
  msi: simplify write config a bit.
  msix: simplify write config
  pci: use qemu_malloc() in pcibus_get_dev_path()

Marcelo Tosatti (2):
  document QEMU<->ACPIBIOS PCI hotplug interface
  acpi_piix4: expose no_hotplug attribute via i/o port

Michael S. Tsirkin (2):
  Merge remote branch 'origin/master' into pci
  pci: fix device paths

 cris-dis.c  |9 ++-
 dis-asm.h   |1 +
 disas.c |9 ++-
 docs/specs/acpi_pci_hotplug.txt |   37 +++
 hw/acpi_piix4.c |   37 +++
 hw/msi.c|5 +---
 hw/msix.c   |5 +---
 hw/pci.c|   27 
 hw/pci.h|2 +
 hw/virtio-serial-bus.c  |   10 +++-
 migration.c |4 +++
 savevm.c|   40 -
 sysemu.h|1 +
 target-cris/translate.c |   41 +++---
 target-cris/translate_v10.c |   10 +---
 15 files changed, 189 insertions(+), 49 deletions(-)
 create mode 100644 docs/specs/acpi_pci_hotplug.txt



Re: [Qemu-devel] [PATCH 1/5] Fast Virtual Disk (FVD) Proposal Part 1

2011-01-20 Thread Chunqiang Tang
> Please try to split the patches into logical parts, and use descriptive
> subject lines for each patch.
> E.g. adding the new sim command to qemu-io could be one patch, adding
> the img_update (why not just update?) command to qemu-img another,
> moving code into qemu-tool-time.c one more, etc.

Will do and thank you for the detailed instructions. 
 
> > - 
> > +
> Please do not introduce random whitespace changes in patches.

Stefan Weil previously suggested removing spaces at the end of a line, and 
I used a script to do that. It seems that in this example, the old code 
has multiple spaces on an empty line, which were automatically removed by 
the script.




Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model

2011-01-20 Thread Daniel P. Berrange
On Thu, Jan 20, 2011 at 08:11:27PM +0530, M. Mohan Kumar wrote:
> On Thursday 20 January 2011 2:29:54 pm Stefan Hajnoczi wrote:
> > On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote:
> 
> > > -if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) {
> > > -/*
> > > - * If we fail to change ownership and if we are
> > > - * using security model none. Ignore the error
> > > - */
> > > -if (fs_ctx->fs_sm != SM_NONE) {
> > > -return -1;
> > > -}
> > > -}
> > > +retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid);
> > > 
> > >  return 0;
> > >  
> > >  }
> > 
> > retval is unused.
> > 
> 
> That was used to disable the warning message "error: ignoring return value of 
> ‘lchown’, declared with attribute warn_unused_result"
> 
> Otherwise I have to use
> if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid)) {
>   ;
> }
> 
> > Can multiple virtio-9p requests execute at a time?  chmod() and lchown()
> > after creation is a race condition if other requests can execute
> > concurrently.
> > 
> 
> We can't implement file creation with requested user credentials and 
> permission 
> bits in the none security model atomically. It's the expected behaviour.

Well you could do the nasty trick of forking a child process
and doing setuid/gid in that and then creating the file before
letting the parent continue.

  if ((pid = fork()) == 0) {
 setgid(fc_gid);   /* drop group first, then uid */
 setuid(fc_uid);
 fd = open("foo", O_CREAT, fc_mode);
 close(fd);
 _exit(0);
  } else {
 waitpid(pid, NULL, 0);
  }

This kind of approach is in fact required if you want to
be able to create files with a special uid/gid on a root
squashing NFS server, because otherwise your QEMU running
as root will have its files squashed to 'nobody' when initially
created, and lchown will fail with EPERM.  You might decide
that root squashing NFS is too painful to care about supporting
though :-)

Regards,
Daniel



Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model

2011-01-20 Thread M. Mohan Kumar
On Thursday 20 January 2011 2:29:54 pm Stefan Hajnoczi wrote:
> On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote:

> > -if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) {
> > -/*
> > - * If we fail to change ownership and if we are
> > - * using security model none. Ignore the error
> > - */
> > -if (fs_ctx->fs_sm != SM_NONE) {
> > -return -1;
> > -}
> > -}
> > +retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid);
> > 
> >  return 0;
> >  
> >  }
> 
> retval is unused.
> 

That was used to disable the warning message "error: ignoring return value of 
‘lchown’, declared with attribute warn_unused_result"

Otherwise I have to use
if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid)) {
;
}

> Can multiple virtio-9p requests execute at a time?  chmod() and lchown()
> after creation is a race condition if other requests can execute
> concurrently.
> 

We can't implement file creation with requested user credentials and permission 
bits in the none security model atomically. It's the expected behaviour.


M. Mohan Kumar



[Qemu-devel] Re: [PATCH] pci: fix pcibus_get_dev_path()

2011-01-20 Thread Michael S. Tsirkin
On Thu, Jan 20, 2011 at 06:31:59PM +0800, TeLeMan wrote:
> The commit 6a7005d14b3c32d4864a718fb1cb19c789f58a5 used snprintf() 
> incorrectly.
> 
> Signed-off-by: TeLeMan 

This won't work for nested bridges, as the '\0' written by the first
snprintf overwrites the rest of the string.
I sent a fix titled
[PATCH] pci: fix device paths
look it up.

> ---
>  hw/pci.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/pci.c b/hw/pci.c
> index 8d0e3df..9f8800d 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -2050,14 +2050,14 @@ static char *pcibus_get_dev_path(DeviceState *dev)
>  path[path_len] = '\0';
> 
>  /* First field is the domain. */
> -snprintf(path, domain_len, "%04x:00", pci_find_domain(d->bus));
> +snprintf(path, domain_len + 1, "%04x:00", pci_find_domain(d->bus));
> 
>  /* Fill in slot numbers. We walk up from device to root, so need to print
>   * them in the reverse order, last to first. */
>  p = path + path_len;
>  for (t = d; t; t = t->bus->parent_dev) {
>  p -= slot_len;
> -snprintf(p, slot_len, ":%02x.%x", PCI_SLOT(t->devfn),
> PCI_FUNC(d->devfn));
> +snprintf(p, slot_len + 1, ":%02x.%x", PCI_SLOT(t->devfn),
> PCI_FUNC(d->devfn));
>  }
> 
>  return path;
> -- 
> 1.7.3.1.msysgit.0



Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Kevin Wolf
Am 20.01.2011 14:50, schrieb Yoshiaki Tamura:
> 2011/1/20 Kevin Wolf :
>> Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
>>> 2011/1/20 Kevin Wolf :
 Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
> +return;
> +}
> +
> +bdrv_aio_writev(bs, blk_req->reqs[0].sector, 
> blk_req->reqs[0].qiov,
> +blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
> +blk_req->reqs[0].opaque);

 Same here.

> +bdrv_flush(bs);

 This looks really strange. What is this supposed to do?

 One point is that you write it immediately after bdrv_aio_write, so you
 get an fsync for which you don't know if it includes the current write
 request or if it doesn't. Which data do you want to get flushed to the 
 disk?
>>>
>>> I was expecting to flush the aio request that was just initiated.
>>> Am I misunderstanding the function?
>>
>> Seems so. The function names don't use really clear terminology either,
>> so you're not the first one to fall in this trap. Basically we have:
>>
>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>> wanted to have exactly this, but only for a single block device. Such a
>> function doesn't exist yet.
>>
>> * bdrv_flush() makes sure that all successfully completed requests are
>> written to disk (by calling fsync)
>>
>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
>> the fsync in the thread pool
>
> Then what I wanted to do is, call qemu_aio_flush first, then
> bdrv_flush.  It should be like live migration.

 Okay, that makes sense. :-)

 The other thing is that you introduce a bdrv_flush for each request,
 basically forcing everyone to something very similar to writethrough
 mode. I'm sure this will have a big impact on performance.
>>>
>>> The reason is to avoid inversion of queued requests.  Although
>>> processing one-by-one is heavy, wouldn't having requests flushed
>>> to disk out of order break the disk image?
>>
>> No, that's fine. If a guest issues two requests at the same time, they
>> may complete in any order. You just need to make sure that you don't
>> call the completion callback before the request really has completed.
>
> We need to flush requests, meaning aio and fsync, before sending
> the final state of the guests, to make sure we can switch to the
> secondary safely.

 In theory I think you could just re-submit the requests on the secondary
 if they had not completed yet.

 But you're right, let's keep things simple for the start.

>> I'm just starting to wonder if the guest won't timeout the requests if
>> they are queued for too long. Even more, with IDE, it can only handle
>> one request at a time, so not completing requests doesn't sound like a
>> good idea at all. In what intervals is the event-tap queue flushed?
>
> The requests are flushed once each transaction completes.  So
> it's not with specific intervals.

 Right. So when is a transaction completed? This is the time that a
 single request will take.
>>>
>>> The transaction is completed when the vm state is sent to the
>>> secondary, and the primary receives the ack to it.  Please let me
>>> know if the answer is too vague.  What I can tell is that it
>>> can't be super fast.
>>>
>> On the other hand, if you complete before actually writing out, you
>> don't get timeouts, but you signal success to the guest when the request
>> could still fail. What would you do in this case? With a writeback cache
>> mode we're fine, we can just fail the next flush (until then nothing is
>> guaranteed to be on disk and order doesn't matter either), but with
>> cache=writethrough we're in serious trouble.
>>
>> Have you thought about this problem? Maybe we end up having to flush the
>> event-tap queue for each single write in writethrough mode.
>
> Yes, and that's what I'm trying to do at this point.

 Oh, I must have missed that code. Which patch/function should I look at?
>>>
>>> Maybe I miss-answered to your question.  The device may receive
>>> timeouts.
>>
>> We should pay attention that the guest does not see timeouts. I'm not
>> expecting that I/O will be super fast, and as long as it is only a
>> performance problem we can live with it.
>>
>> However, as soon as the guest gets timeouts it reports I/O errors and
>> eventually offlines the block device. At this point it's not a
>> performance problem any more, but also a correctness problem.
>>
>> This is why I suggested that we flush the event-tap queue (i.e. complete
>> the transaction) immediately after an I/O request has been issued
>> instead of waiting for other events that would complete the transaction.

[Qemu-devel] Re: [PATCH] pci/pcie: make pci_find_device() ARI aware.

2011-01-20 Thread Michael S. Tsirkin
On Thu, Jan 20, 2011 at 03:57:39PM +0900, Isaku Yamahata wrote:
> make pci_find_device() ARI aware.
> 
> Signed-off-by: Isaku Yamahata 
> ---
>  hw/pci.c |7 +++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/hw/pci.c b/hw/pci.c
> index 8d0e3df..851f350 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -1596,11 +1596,18 @@ PCIBus *pci_find_bus(PCIBus *bus, int bus_num)
>  
>  PCIDevice *pci_find_device(PCIBus *bus, int bus_num, int slot, int function)
>  {
> +PCIDevice *d;
>  bus = pci_find_bus(bus, bus_num);
>  
>  if (!bus)
>  return NULL;
>  
> +d = bus->parent_dev;
> +if (d && pci_is_express(d) &&
> +pcie_cap_get_type(d) == PCI_EXP_TYPE_DOWNSTREAM &&
> +!pcie_cap_is_ari_enabled(d) && slot > 0) {
> +return NULL;
> +}
>  return bus->devices[PCI_DEVFN(slot, function)];
>  }

I'd like to split this express-specific code out in some way.
Further, the downstream port always has a single slot.
Maybe it shouldn't be a bus at all, but this needs some thought.
How about we put flag in PCIBus that says that a single
slot is supported? Downstream ports would just set it.

> -- 
> 1.7.1.1



Re: [Qemu-devel] Missing roms/seabios/Makefile and roms/vgabios/Makefile

2011-01-20 Thread Mateusz Loskot

On 20/01/11 10:58, Stefan Hajnoczi wrote:

On Thu, Jan 20, 2011 at 10:51:17AM +, Mateusz Loskot wrote:

On 19/01/11 18:07, Blue Swirl wrote:

On Wed, Jan 19, 2011 at 12:44 PM, Mateusz Loskot   wrote:

Hi,

Running ./configure (under MinGW/MSYS) and symlink gives up trying to
create links to non-existing Makefiles in
roms/seabios/Makefile
roms/vgabios/Makefile


Those directiories are actually git submodules, when you run 'git
submodule update', they get populated.


Something is not quite working and I don't get anything populated.
Here I tried under MinGW

mloskot@dog /g/src/qemu/_git/master
$ git pull
Already up-to-date.

mloskot@dog /g/src/qemu/_git/master
$ git submodule update

mloskot@dog /g/src/qemu/_git/master
$ ls roms/seabios/
config.mak


Make sure roms/{vgabios,seabios} are empty directories.

$ git submodule init
$ git submodule update

This should clone the vgabios and seabios repos and checkout the correct
commit.  After this completes successfully you should have source trees
under roms/{vgabios,seabios}.



I forgot about the "init" command. Works now. Thanks!

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Charter Member of OSGeo, http://osgeo.org
Member of ACCU, http://accu.org



Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Yoshiaki Tamura
2011/1/20 Kevin Wolf :
> Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
>> 2011/1/20 Kevin Wolf :
>>> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
 +        return;
 +    }
 +
 +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, 
 blk_req->reqs[0].qiov,
 +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
 +                    blk_req->reqs[0].opaque);
>>>
>>> Same here.
>>>
 +    bdrv_flush(bs);
>>>
>>> This looks really strange. What is this supposed to do?
>>>
>>> One point is that you write it immediately after bdrv_aio_write, so you
>>> get an fsync for which you don't know if it includes the current write
>>> request or if it doesn't. Which data do you want to get flushed to the 
>>> disk?
>>
>> I was expecting to flush the aio request that was just initiated.
>> Am I misunderstanding the function?
>
> Seems so. The function names don't use really clear terminology either,
> so you're not the first one to fall in this trap. Basically we have:
>
> * qemu_aio_flush() waits for all AIO requests to complete. I think you
> wanted to have exactly this, but only for a single block device. Such a
> function doesn't exist yet.
>
> * bdrv_flush() makes sure that all successfully completed requests are
> written to disk (by calling fsync)
>
> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
> the fsync in the thread pool

 Then what I wanted to do is, call qemu_aio_flush first, then
 bdrv_flush.  It should be like live migration.
>>>
>>> Okay, that makes sense. :-)
>>>
>>> The other thing is that you introduce a bdrv_flush for each request,
>>> basically forcing everyone to something very similar to writethrough
>>> mode. I'm sure this will have a big impact on performance.
>>
>> The reason is to avoid inversion of queued requests.  Although
>> processing one-by-one is heavy, wouldn't having requests flushed
>> to disk out of order break the disk image?
>
> No, that's fine. If a guest issues two requests at the same time, they
> may complete in any order. You just need to make sure that you don't
> call the completion callback before the request really has completed.

 We need to flush requests, meaning aio and fsync, before sending
 the final state of the guests, to make sure we can switch to the
 secondary safely.
>>>
>>> In theory I think you could just re-submit the requests on the secondary
>>> if they had not completed yet.
>>>
>>> But you're right, let's keep things simple for the start.
>>>
> I'm just starting to wonder if the guest won't timeout the requests if
> they are queued for too long. Even more, with IDE, it can only handle
> one request at a time, so not completing requests doesn't sound like a
> good idea at all. In what intervals is the event-tap queue flushed?

 The requests are flushed once each transaction completes.  So
 it's not with specific intervals.
>>>
>>> Right. So when is a transaction completed? This is the time that a
>>> single request will take.
>>
>> The transaction is completed when the vm state is sent to the
>> secondary, and the primary receives the ack to it.  Please let me
>> know if the answer is too vague.  What I can tell is that it
>> can't be super fast.
>>
> On the other hand, if you complete before actually writing out, you
> don't get timeouts, but you signal success to the guest when the request
> could still fail. What would you do in this case? With a writeback cache
> mode we're fine, we can just fail the next flush (until then nothing is
> guaranteed to be on disk and order doesn't matter either), but with
> cache=writethrough we're in serious trouble.
>
> Have you thought about this problem? Maybe we end up having to flush the
> event-tap queue for each single write in writethrough mode.

 Yes, and that's what I'm trying to do at this point.
>>>
>>> Oh, I must have missed that code. Which patch/function should I look at?
>>
>> Maybe I miss-answered to your question.  The device may receive
>> timeouts.
>
> We should pay attention that the guest does not see timeouts. I'm not
> expecting that I/O will be super fast, and as long as it is only a
> performance problem we can live with it.
>
> However, as soon as the guest gets timeouts it reports I/O errors and
> eventually offlines the block device. At this point it's not a
> performance problem any more, but also a correctness problem.
>
> This is why I suggested that we flush the event-tap queue (i.e. complete
> the transaction) immediately after an I/O request has been issued
> instead of waiting for other events that would complete the transaction.

Right.  event-tap doesn't queue at specific intervals.  It'll
schedule the transaction as a bh once events are tapped.

[Qemu-devel] Re: [PATCH] pci: use qemu_malloc() in pcibus_get_dev_path()

2011-01-20 Thread Michael S. Tsirkin
On Thu, Jan 20, 2011 at 03:57:49PM +0900, Isaku Yamahata wrote:
> use qemu_malloc() instead of direct use of malloc().
> 
> Signed-off-by: Isaku Yamahata 
> ---
>  hw/pci.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)

Thanks, applied.

> diff --git a/hw/pci.c b/hw/pci.c
> index 851f350..86af0ee 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -2053,7 +2053,7 @@ static char *pcibus_get_dev_path(DeviceState *dev)
>  path_len = domain_len + slot_len * slot_depth;
>  
>  /* Allocate memory, fill in the terminating null byte. */
> -path = malloc(path_len + 1 /* For '\0' */);
> +path = qemu_malloc(path_len + 1 /* For '\0' */);
>  path[path_len] = '\0';
>  
>  /* First field is the domain. */
> -- 
> 1.7.1.1



[Qemu-devel] Re: [PATCH] pci: make pci_bus_new() aware of pci domain

2011-01-20 Thread Michael S. Tsirkin
On Thu, Jan 20, 2011 at 03:57:56PM +0900, Isaku Yamahata wrote:
> This patch makes pci bus creation aware of pci domain.
> 
> Signed-off-by: Isaku Yamahata 

IMO domain support needs some thought,
before we jump into it. Specifically:

I am not sure a simple number is enough to address a domain. In linux,
what seems to happen is that the pci segment reported by pci is used.
What happens without acpi? I'll have to look - do you know? What will we
do for acpi? We can load firmware or add a hardware interface to get the 
domains.
How will migration work?

Thanks,

> ---
>  hw/pci.c  |   19 ++-
>  hw/pci.h  |7 ---
>  hw/piix_pci.c |2 +-
>  3 files changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/pci.c b/hw/pci.c
> index 86af0ee..e1e7b25 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -246,8 +246,11 @@ int pci_find_domain(const PCIBus *bus)
>  return -1;
>  }
>  
> +/* create root pci bus.
> + * If secondary pci bus is wanted, use pci_bridge_initfn()
> + */
>  void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent,
> - const char *name, int devfn_min)
> + const char *name, int domain, int devfn_min)
>  {
>  qbus_create_inplace(&bus->qbus, &pci_bus_info, parent, name);
>  assert(PCI_FUNC(devfn_min) == 0);
> @@ -255,18 +258,19 @@ void pci_bus_new_inplace(PCIBus *bus, DeviceState 
> *parent,
>  
>  /* host bridge */
>  QLIST_INIT(&bus->child);
> -pci_host_bus_register(0, bus); /* for now only pci domain 0 is supported 
> */
> +pci_host_bus_register(domain, bus);
>  
>  vmstate_register(NULL, -1, &vmstate_pcibus, bus);
>  }
>  
> -PCIBus *pci_bus_new(DeviceState *parent, const char *name, int devfn_min)
> +PCIBus *pci_bus_new(DeviceState *parent, const char *name,
> +int domain, int devfn_min)
>  {
>  PCIBus *bus;
>  
>  bus = qemu_mallocz(sizeof(*bus));
>  bus->qbus.qdev_allocated = 1;
> -pci_bus_new_inplace(bus, parent, name, devfn_min);
> +pci_bus_new_inplace(bus, parent, name, domain, devfn_min);
>  return bus;
>  }
>  
> @@ -292,13 +296,18 @@ void pci_bus_set_mem_base(PCIBus *bus, 
> target_phys_addr_t base)
>  bus->mem_base = base;
>  }
>  
> +/* deprecated: kept for compatility of existing codes.
> + * pci_bus_new() and pci_bus_irqs() should be used for root pci bus
> + * like i440fx_init().
> + * pci_bridge_initfn() should be used for secondary pci bus
> + */
>  PCIBus *pci_register_bus(DeviceState *parent, const char *name,
>   pci_set_irq_fn set_irq, pci_map_irq_fn map_irq,
>   void *irq_opaque, int devfn_min, int nirq)
>  {
>  PCIBus *bus;
>  
> -bus = pci_bus_new(parent, name, devfn_min);
> +bus = pci_bus_new(parent, name, 0 /* domain = 0 for compat */, 
> devfn_min);
>  pci_bus_irqs(bus, set_irq, map_irq, irq_opaque, nirq);
>  return bus;
>  }
> diff --git a/hw/pci.h b/hw/pci.h
> index bc8d5bb..a0fd953 100644
> --- a/hw/pci.h
> +++ b/hw/pci.h
> @@ -229,14 +229,15 @@ typedef enum {
>  typedef int (*pci_hotplug_fn)(DeviceState *qdev, PCIDevice *pci_dev,
>PCIHotplugState state);
>  void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent,
> - const char *name, int devfn_min);
> -PCIBus *pci_bus_new(DeviceState *parent, const char *name, int devfn_min);
> + const char *name, int domain, int devfn_min);
> +PCIBus *pci_bus_new(DeviceState *parent, const char *name,
> +int domain, int devfn_min);
>  void pci_bus_irqs(PCIBus *bus, pci_set_irq_fn set_irq, pci_map_irq_fn 
> map_irq,
>void *irq_opaque, int nirq);
>  void pci_bus_hotplug(PCIBus *bus, pci_hotplug_fn hotplug, DeviceState *dev);
>  PCIBus *pci_register_bus(DeviceState *parent, const char *name,
>   pci_set_irq_fn set_irq, pci_map_irq_fn map_irq,
> - void *irq_opaque, int devfn_min, int nirq);
> + void *irq_opaque, int devfn_min, int nirq); /* 
> deprecated */
>  void pci_device_reset(PCIDevice *dev);
>  void pci_bus_reset(PCIBus *bus);
>  
> diff --git a/hw/piix_pci.c b/hw/piix_pci.c
> index 358da58..718983d 100644
> --- a/hw/piix_pci.c
> +++ b/hw/piix_pci.c
> @@ -226,7 +226,7 @@ PCIBus *i440fx_init(PCII440FXState **pi440fx_state, int 
> *piix3_devfn, qemu_irq *
>  
>  dev = qdev_create(NULL, "i440FX-pcihost");
>  s = FROM_SYSBUS(I440FXState, sysbus_from_qdev(dev));
> -b = pci_bus_new(&s->busdev.qdev, NULL, 0);
> +b = pci_bus_new(&s->busdev.qdev, NULL, 0, 0);
>  s->bus = b;
>  qdev_init_nofail(dev);
>  
> -- 
> 1.7.1.1



Re: [Qemu-devel] [PATCH 1/5] Fast Virtual Disk (FVD) Proposal Part 1

2011-01-20 Thread Christoph Hellwig
On Wed, Jan 19, 2011 at 05:04:44PM -0500, Chunqiang Tang wrote:
> Part 1 of the block device driver for the proposed FVD image format.
> Multiple patches are used in order to manage the size of each patch.
> This patch includes existing files that are modified by FVD.

Please try to split the patches into logical parts, and use descriptive
subject lines for each patch.

E.g. adding the new sim command to qemu-io could be one patch, adding
the img_update (why not just update?) command to qemu-img another,
moving code into qemu-tool-time.c one more, etc.


> -
> +

Please do not introduce random whitespace changes in patches.




[Qemu-devel] [PATCH 12/12] Threadlets: Add documentation

2011-01-20 Thread Arun R Bharadwaj
Signed-off-by: Arun R Bharadwaj 
Signed-off-by: Gautham R Shenoy 
Reviewed-by: Stefan Hajnoczi 
---
 docs/async-support.txt |  141 
 1 files changed, 141 insertions(+), 0 deletions(-)
 create mode 100644 docs/async-support.txt

diff --git a/docs/async-support.txt b/docs/async-support.txt
new file mode 100644
index 000..9f22b9a
--- /dev/null
+++ b/docs/async-support.txt
@@ -0,0 +1,141 @@
+== How to use the threadlets infrastructure supported in Qemu ==
+
+== Threadlets ==
+
+Q.1: What are threadlets ?
+A.1: Threadlets is an infrastructure within QEMU that allows other subsystems
+ to offload possibly blocking work to a queue to be processed by a pool
+ of threads asynchronously.
+
+Q.2: When would one want to use threadlets ?
+A.2: Threadlets are useful when there are operations that can be performed
+ outside the context of the VCPU/IO threads in order to free the latter
+ to service any other guest requests.
+
+Q.3: I have some work that can be executed in an asynchronous context. How
+ should I go about it ?
+A.3: One could follow the steps listed below:
+
+ - Define a function which would do the asynchronous work.
+   static void my_threadlet_func(ThreadletWork *work)
+   {
+   }
+
+ - Declare an object of type ThreadletWork;
+   ThreadletWork work;
+
+
+ - Assign a value to the "func" member of ThreadletWork object.
+   work.func = my_threadlet_func;
+
+ - Submit the threadlet to the global queue.
+   submit_threadletwork(&work);
+
+ - Continue servicing some other guest operations.
+
+Q.4: I want my_threadlet_func to access some non-global data. How do I do
+ that ?
+A.4: Suppose you want my_threadlet_func to access some non-global data-object
+ of type MyPrivateData. In that case, one could follow these steps.
+
+ - Define a member of the type ThreadletWork within myPrivateData.
+   typedef struct MyPrivateData {
+   ...;
+   ...;
+   ...;
+   ThreadletWork work;
+   } MyPrivateData;
+
+   MyPrivateData my_data;
+
+ - Initialize my_data.work as described in A.3
+   my_data.work.func = my_threadlet_func;
+   submit_threadletwork(&my_data.work);
+
+ - Access the my_data object inside my_threadlet_func() using the
+   container_of primitive
+   static void my_threadlet_func(ThreadletWork *work)
+   {
+   MyPrivateData *mydata_ptr;
+   mydata_ptr = container_of(work, MyPrivateData, work);
+
+   /* mydata_ptr now points to the my_data object */
+   }
+
+Q.5: Are there any precautions one must take while sharing data with the
+ Asynchronous thread-pool ?
+A.5: Yes, make sure that the helper function of the type my_threadlet_func()
+ does not access/modify data when it can be accessed or modified in the
+ context of VCPU thread or IO thread. This is because the asynchronous
+ threads in the pool can run in parallel with the VCPU/IOThreads as shown
+ in the figure.
+
+ A typical workflow is as follows:
+
+  VCPU/IOThread
+   |
+   | (1)
+   |
+   V
+Offload work  (2)
+  |---> to threadlets -> Helper thread
+  ||   |
+  ||   |
+  || (3)   | (4)
+  ||   |
+  | Handle other Guest requests|
+  ||   |
+  ||   V
+  || (3)  Signal the I/O Thread
+  |(6) |   |
+  ||  /
+  || /
+  |V/
+  |  Do the post <-/
+  |  processing   (5)
+  ||
+  || (6)
+  |V
+  |-Yes-- More async work?
+   |
+   | (7)
+  No
+   |
+   |
+   .
+   .
+
+Hence one needs to make sure that in the steps (3) and (4) which run in
+parallel, any global data is accessed within only one context.
+
+Q.6: I have queued a threadlet which I want to cancel. How do I do that ?
+A.6: Threadlets framework provides the API cancel_threadlet:
+   - int cancel_threadletwork(ThreadletWork *work)
+
+ The API scans the ThreadletQueue to see if the work item is still
+ queued. If found, it is dequeued and 0 is returned; otherwise 1 is
+ returned.

[Qemu-devel] [PATCH 10/12] Move threadlet code to qemu-threadlets.c

2011-01-20 Thread Arun R Bharadwaj
This patch moves the threadlet queue API code to
qemu-threadlets.c where these APIs can be used by
other subsystems.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 posix-aio-compat.c |  137 
 qemu-threadlet.c   |  115 
 qemu-threadlet.h   |   25 +
 3 files changed, 141 insertions(+), 136 deletions(-)

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 96e28db..4fc9581 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -26,33 +26,13 @@
 #include "qemu-common.h"
 #include "trace.h"
 #include "block_int.h"
-#include "qemu-thread.h"
+#include "qemu-threadlet.h"
 
 #include "block/raw-posix-aio.h"
 
-#define MAX_GLOBAL_THREADS  64
-#define MIN_GLOBAL_THREADS   8
-
 static QemuMutex aiocb_mutex;
 static QemuCond aiocb_completion;
 
-typedef struct ThreadletQueue
-{
-QemuMutex lock;
-QemuCond cond;
-int max_threads;
-int min_threads;
-int cur_threads;
-int idle_threads;
-QTAILQ_HEAD(, ThreadletWork) request_list;
-} ThreadletQueue;
-
-typedef struct ThreadletWork
-{
-QTAILQ_ENTRY(ThreadletWork) node;
-void (*func)(struct ThreadletWork *work);
-} ThreadletWork;
-
 struct qemu_paiocb {
 BlockDriverAIOCB common;
 int aio_fildes;
@@ -79,10 +59,6 @@ typedef struct PosixAioState {
 struct qemu_paiocb *first_aio;
 } PosixAioState;
 
-/* Default ThreadletQueue */
-static ThreadletQueue globalqueue;
-static int globalqueue_init;
-
 #ifdef CONFIG_PREADV
 static int preadv_present = 1;
 #else
@@ -283,50 +259,6 @@ static ssize_t handle_aiocb_rw(struct qemu_paiocb *aiocb)
 return nbytes;
 }
 
-static void *threadlet_worker(void *data)
-{
-ThreadletQueue *queue = data;
-
-qemu_mutex_lock(&queue->lock);
-while (1) {
-ThreadletWork *work;
-int ret = 0;
-
-while (QTAILQ_EMPTY(&queue->request_list) &&
-   (ret != ETIMEDOUT)) {
-/* wait for cond to be signalled or broadcast for 1000s */
-ret = qemu_cond_timedwait((&queue->cond),
-  &(queue->lock), 10*10);
-}
-
-assert(queue->idle_threads != 0);
-if (QTAILQ_EMPTY(&queue->request_list)) {
-if (queue->cur_threads > queue->min_threads) {
-/* We retain the minimum number of threads */
-break;
-}
-} else {
-work = QTAILQ_FIRST(&queue->request_list);
-QTAILQ_REMOVE(&queue->request_list, work, node);
-
-queue->idle_threads--;
-qemu_mutex_unlock(&queue->lock);
-
-/* execute the work function */
-work->func(work);
-
-qemu_mutex_lock(&queue->lock);
-queue->idle_threads++;
-}
-}
-
-queue->idle_threads--;
-queue->cur_threads--;
-qemu_mutex_unlock(&queue->lock);
-
-return NULL;
-}
-
 static PosixAioState *posix_aio_state;
 
 static void handle_work(ThreadletWork *work)
@@ -371,48 +303,6 @@ static void handle_work(ThreadletWork *work)
 }
 }
 
-static void spawn_threadlet(ThreadletQueue *queue)
-{
-QemuThread thread;
-
-queue->cur_threads++;
-queue->idle_threads++;
-
-qemu_thread_create(&thread, threadlet_worker, queue);
-}
-
-/**
- * submit_work: Submit to the global queue a new task to be executed
- *   asynchronously.
- * @work: Contains information about the task that needs to be submitted.
- */
-static void submit_work(ThreadletWork *work)
-{
-qemu_mutex_lock(&globalqueue.lock);
-
-if (!globalqueue_init) {
-globalqueue.cur_threads  = 0;
-globalqueue.idle_threads = 0;
-globalqueue.max_threads = MAX_GLOBAL_THREADS;
-globalqueue.min_threads = MIN_GLOBAL_THREADS;
-QTAILQ_INIT(&globalqueue.request_list);
-qemu_mutex_init(&globalqueue.lock);
-qemu_cond_init(&globalqueue.cond);
-
-globalqueue_init = 1;
-}
-
-if (globalqueue.idle_threads == 0 &&
-globalqueue.cur_threads < globalqueue.max_threads) {
-spawn_threadlet(&globalqueue);
-
-} else {
-qemu_cond_signal(&globalqueue.cond);
-}
-QTAILQ_INSERT_TAIL(&globalqueue.request_list, work, node);
-qemu_mutex_unlock(&globalqueue.lock);
-}
-
 static ssize_t qemu_paio_return(struct qemu_paiocb *aiocb)
 {
 ssize_t ret;
@@ -545,31 +435,6 @@ static void paio_remove(struct qemu_paiocb *acb)
 }
 }
 
-/**
- * dequeue_work: Cancel a task queued on the global queue.
- * @work: Contains the information of the task that needs to be cancelled.
- *
- * Returns:0 if successfully dequeued work.
- * 1 otherwise.
- */
-static int dequeue_work(ThreadletWork *work)
-{
-int ret = 1;
-ThreadletWork *ret_work;
-
-qemu_mutex_lock(&globalqueue.lock);
-QTAILQ_FOREACH(ret_work, &(globalqueue.request_list), node) {
-if (ret_work == work) {
-QTAILQ_REMOVE(&globalqu

[Qemu-devel] [PATCH 02/12] Introduce work concept in posix-aio-compat.c

2011-01-20 Thread Arun R Bharadwaj
This patch introduces work concept by introducing the
ThreadletWork structure.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 posix-aio-compat.c |   20 ++--
 1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 82862ec..ddf42f5 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -33,6 +33,10 @@
 
 static QemuMutex aiocb_mutex;
 static QemuCond aiocb_completion;
+typedef struct ThreadletWork
+{
+QTAILQ_ENTRY(ThreadletWork) node;
+} ThreadletWork;
 
 struct qemu_paiocb {
 BlockDriverAIOCB common;
@@ -47,13 +51,13 @@ struct qemu_paiocb {
 int ev_signo;
 off_t aio_offset;
 
-QTAILQ_ENTRY(qemu_paiocb) node;
 int aio_type;
 ssize_t ret;
 int active;
 struct qemu_paiocb *next;
 
 int async_context_id;
+ThreadletWork work;
 };
 
 typedef struct PosixAioState {
@@ -69,7 +73,7 @@ static pthread_attr_t attr;
 static int max_threads = 64;
 static int cur_threads = 0;
 static int idle_threads = 0;
-static QTAILQ_HEAD(, qemu_paiocb) request_list;
+static QTAILQ_HEAD(, ThreadletWork) request_list;
 
 #ifdef CONFIG_PREADV
 static int preadv_present = 1;
@@ -307,6 +311,7 @@ static ssize_t handle_aiocb_rw(struct qemu_paiocb *aiocb)
 static void *aio_thread(void *unused)
 {
 pid_t pid;
+ThreadletWork *work;
 
 pid = getpid();
 
@@ -330,8 +335,11 @@ static void *aio_thread(void *unused)
 if (QTAILQ_EMPTY(&request_list))
 break;
 
-aiocb = QTAILQ_FIRST(&request_list);
-QTAILQ_REMOVE(&request_list, aiocb, node);
+work = QTAILQ_FIRST(&request_list);
+QTAILQ_REMOVE(&request_list, work, node);
+
+aiocb = container_of(work, struct qemu_paiocb, work);
+
 aiocb->active = 1;
 idle_threads--;
 mutex_unlock(&lock);
@@ -398,7 +406,7 @@ static void qemu_paio_submit(struct qemu_paiocb *aiocb)
 mutex_lock(&lock);
 if (idle_threads == 0 && cur_threads < max_threads)
 spawn_thread();
-QTAILQ_INSERT_TAIL(&request_list, aiocb, node);
+QTAILQ_INSERT_TAIL(&request_list, &aiocb->work, node);
 mutex_unlock(&lock);
 cond_signal(&cond);
 }
@@ -548,7 +556,7 @@ static void paio_cancel(BlockDriverAIOCB *blockacb)
 
 qemu_mutex_lock(&aiocb_mutex);
 if (!acb->active) {
-QTAILQ_REMOVE(&request_list, acb, node);
+QTAILQ_REMOVE(&request_list, &acb->work, node);
 acb->ret = -ECANCELED;
 } else if (acb->ret == -EINPROGRESS) {
 active = 1;
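For readers unfamiliar with the idiom above: the patch embeds a small ThreadletWork node inside the larger qemu_paiocb and later recovers the enclosing structure with container_of. A standalone sketch of that pattern (not QEMU code; the struct names and the list linkage are simplified stand-ins):

```c
#include <stddef.h>

typedef struct ThreadletWork {
    struct ThreadletWork *next;   /* stands in for the QTAILQ_ENTRY linkage */
} ThreadletWork;

struct paiocb_like {
    int aio_type;
    ThreadletWork work;           /* embedded work node, as in the patch */
};

/* Recover a pointer to the enclosing structure from a member pointer. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Given the work node a worker thread dequeued, get the aiocb back. */
static struct paiocb_like *aiocb_from_work(ThreadletWork *w)
{
    return container_of(w, struct paiocb_like, work);
}
```

This is why the request_list can hold generic ThreadletWork pointers while the worker still reaches the full qemu_paiocb.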




[Qemu-devel] [PATCH 06/12] Threadlet: Add dequeue_work threadlet API and remove active field.

2011-01-20 Thread Arun R Bharadwaj
This patch adds the dequeue_work threadlet API and shows how
paio_cancel changes to use dequeue_work.

The active field in the qemu_paiocb structure is now unused.
Remove it.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 posix-aio-compat.c |   48 ++--
 1 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 62e32e3..95ba805 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -69,7 +69,6 @@ struct qemu_paiocb {
 
 int aio_type;
 ssize_t ret;
-int active;
 struct qemu_paiocb *next;
 
 int async_context_id;
@@ -345,9 +344,6 @@ static void handle_work(ThreadletWork *work)
 
 pid = getpid();
 aiocb = container_of(work, struct qemu_paiocb, work);
-qemu_mutex_lock(&aiocb_mutex);
-aiocb->active = 1;
-qemu_mutex_unlock(&aiocb_mutex);
 
 switch (aiocb->aio_type & QEMU_AIO_TYPE_MASK) {
 case QEMU_AIO_READ:
@@ -452,7 +448,6 @@ static void qemu_paio_submit(struct qemu_paiocb *aiocb)
 {
 qemu_mutex_lock(&aiocb_mutex);
 aiocb->ret = -EINPROGRESS;
-aiocb->active = 0;
 qemu_mutex_unlock(&aiocb_mutex);
 
 aiocb->work.func = handle_work;
@@ -574,33 +569,42 @@ static void paio_remove(struct qemu_paiocb *acb)
 }
 }
 
-static void paio_cancel(BlockDriverAIOCB *blockacb)
+/**
+ * dequeue_work: Cancel a task queued on the global queue.
+ * @work: Contains the information of the task that needs to be cancelled.
+ *
+ * Returns:0 if successfully dequeued work.
+ * 1 otherwise.
+ */
+static int dequeue_work(ThreadletWork *work)
 {
-struct qemu_paiocb *acb = (struct qemu_paiocb *)blockacb;
-int active = 0;
+int ret = 1;
+ThreadletWork *ret_work;
 
-qemu_mutex_lock(&aiocb_mutex);
 qemu_mutex_lock(&globalqueue.lock);
-if (!acb->active) {
-QTAILQ_REMOVE(&globalqueue.request_list, &acb->work, node);
-acb->ret = -ECANCELED;
-} else if (acb->ret == -EINPROGRESS) {
-active = 1;
+QTAILQ_FOREACH(ret_work, &(globalqueue.request_list), node) {
+if (ret_work == work) {
+QTAILQ_REMOVE(&globalqueue.request_list, ret_work, node);
+ret = 0;
+break;
+}
 }
 qemu_mutex_unlock(&globalqueue.lock);
 
-if (!active) {
-acb->ret = -ECANCELED;
-} else {
+return ret;
+}
+
+static void paio_cancel(BlockDriverAIOCB *blockacb)
+{
+struct qemu_paiocb *acb = (struct qemu_paiocb *)blockacb;
+if (dequeue_work(&acb->work) != 0) {
+/* Wait for running work item to complete */
+qemu_mutex_lock(&aiocb_mutex);
 while (acb->ret == -EINPROGRESS) {
-/*
- * fail safe: if the aio could not be canceled,
- * we wait for it
- */
 qemu_cond_wait(&aiocb_completion, &aiocb_mutex);
 }
+qemu_mutex_unlock(&aiocb_mutex);
 }
-qemu_mutex_unlock(&aiocb_mutex);
 paio_remove(acb);
 }
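The cancellation semantics above reduce to: scan the pending list for the work item; return 0 if it was still queued (and unlink it), 1 if a thread has already picked it up, in which case the caller must wait for completion. A minimal standalone model of that dequeue step (a plain singly-linked list instead of QEMU's QTAILQ, and no locking):

```c
#include <stddef.h>

typedef struct Work {
    struct Work *next;
} Work;

typedef struct {
    Work *head;
} WorkList;

static void list_push(WorkList *l, Work *w)
{
    w->next = l->head;
    l->head = w;
}

/* dequeue_work in miniature: 0 = still queued, now removed (cancelled);
 * 1 = no longer on the list, i.e. a worker already took it. */
static int dequeue_work_model(WorkList *l, Work *w)
{
    Work **pp;
    for (pp = &l->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == w) {
            *pp = w->next;   /* unlink: cancellation succeeded */
            return 0;
        }
    }
    return 1;                /* already running: caller must wait */
}
```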
 




[Qemu-devel] [PATCH 09/12] Remove all instances of CONFIG_THREAD

2011-01-20 Thread Arun R Bharadwaj
Remove all instances of CONFIG_THREAD, which is not used
anymore.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 configure |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index a079a49..addf733 100755
--- a/configure
+++ b/configure
@@ -2456,7 +2456,6 @@ if test "$vnc_png" != "no" ; then
 fi
 if test "$vnc_thread" != "no" ; then
   echo "CONFIG_VNC_THREAD=y" >> $config_host_mak
-  echo "CONFIG_THREAD=y" >> $config_host_mak
 fi
 if test "$fnmatch" = "yes" ; then
   echo "CONFIG_FNMATCH=y" >> $config_host_mak
@@ -2534,7 +2533,6 @@ if test "$xen" = "yes" ; then
 fi
 if test "$io_thread" = "yes" ; then
   echo "CONFIG_IOTHREAD=y" >> $config_host_mak
-  echo "CONFIG_THREAD=y" >> $config_host_mak
 fi
 if test "$linux_aio" = "yes" ; then
   echo "CONFIG_LINUX_AIO=y" >> $config_host_mak




[Qemu-devel] [PATCH 11/12] Threadlets: Add functionality to create private queues.

2011-01-20 Thread Arun R Bharadwaj
This patch allows subsystems to create their own private
queues with associated pools of threads.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 qemu-threadlet.c |   85 +-
 qemu-threadlet.h |4 +++
 2 files changed, 63 insertions(+), 26 deletions(-)

diff --git a/qemu-threadlet.c b/qemu-threadlet.c
index 6523f08..4bad8c6 100644
--- a/qemu-threadlet.c
+++ b/qemu-threadlet.c
@@ -99,6 +99,25 @@ static void spawn_threadlet(ThreadletQueue *queue)
 }
 
 /**
+ * submit_work_to_queue: Submit a new task to a private queue to be
+ *executed asynchronously.
+ * @queue: Per-subsystem private queue to which the new task needs
+ * to be submitted.
+ * @work: Contains information about the task that needs to be submitted.
+ */
+void submit_work_to_queue(ThreadletQueue *queue, ThreadletWork *work)
+{
+qemu_mutex_lock(&queue->lock);
+if (queue->idle_threads == 0 && queue->cur_threads < queue->max_threads) {
+spawn_threadlet(queue);
+} else {
+qemu_cond_signal(&queue->cond);
+}
+QTAILQ_INSERT_TAIL(&queue->request_list, work, node);
+qemu_mutex_unlock(&queue->lock);
+}
+
+/**
  * submit_work: Submit to the global queue a new task to be executed
  *   asynchronously.
  * @work: Contains information about the task that needs to be submitted.
@@ -106,49 +125,63 @@ static void spawn_threadlet(ThreadletQueue *queue)
 void submit_work(ThreadletWork *work)
 {
 if (!globalqueue_init) {
-globalqueue.cur_threads  = 0;
-globalqueue.idle_threads = 0;
-globalqueue.max_threads = MAX_GLOBAL_THREADS;
-globalqueue.min_threads = MIN_GLOBAL_THREADS;
-QTAILQ_INIT(&globalqueue.request_list);
-qemu_mutex_init(&globalqueue.lock);
-qemu_cond_init(&globalqueue.cond);
-
+threadlet_queue_init(&globalqueue, MAX_GLOBAL_THREADS,
+MIN_GLOBAL_THREADS);
 globalqueue_init = 1;
 }
 
-qemu_mutex_lock(&globalqueue.lock);
-if (globalqueue.idle_threads == 0 &&
-globalqueue.cur_threads < globalqueue.max_threads) {
-spawn_threadlet(&globalqueue);
-} else {
-qemu_cond_signal(&globalqueue.cond);
-}
-QTAILQ_INSERT_TAIL(&globalqueue.request_list, work, node);
-qemu_mutex_unlock(&globalqueue.lock);
+submit_work_to_queue(&globalqueue, work);
 }
 
 /**
- * dequeue_work: Cancel a task queued on the global queue.
+ * dequeue_work_on_queue: Cancel a task queued on a Queue.
+ * @queue: The queue containing the task to be cancelled.
  * @work: Contains the information of the task that needs to be cancelled.
- *
- * Returns: 0 if successfully dequeued work.
- *  1 otherwise.
  */
-int dequeue_work(ThreadletWork *work)
+int dequeue_work_on_queue(ThreadletQueue *queue, ThreadletWork *work)
 {
 int ret = 1;
 ThreadletWork *ret_work;
 
-qemu_mutex_lock(&globalqueue.lock);
-QTAILQ_FOREACH(ret_work, &(globalqueue.request_list), node) {
+qemu_mutex_lock(&queue->lock);
+QTAILQ_FOREACH(ret_work, &(queue->request_list), node) {
 if (ret_work == work) {
-QTAILQ_REMOVE(&globalqueue.request_list, ret_work, node);
+QTAILQ_REMOVE(&queue->request_list, work, node);
 ret = 0;
 break;
 }
 }
-qemu_mutex_unlock(&globalqueue.lock);
+qemu_mutex_unlock(&queue->lock);
 
 return ret;
 }
+
+/**
+ * dequeue_work: Cancel a task queued on the global queue.
+ * @work: Contains the information of the task that needs to be cancelled.
+ *
+ * Returns: 0 if successfully dequeued work.
+ *  1 otherwise.
+ */
+int dequeue_work(ThreadletWork *work)
+{
+return dequeue_work_on_queue(&globalqueue, work);
+}
+
+/**
+ * threadlet_queue_init: Initialize a threadlet queue.
+ * @queue: The threadlet queue to be initialized.
+ * @max_threads: Maximum number of threads processing the queue.
+ * @min_threads: Minimum number of threads processing the queue.
+ */
+void threadlet_queue_init(ThreadletQueue *queue,
+int max_threads, int min_threads)
+{
+queue->cur_threads  = 0;
+queue->idle_threads = 0;
+queue->max_threads  = max_threads;
+queue->min_threads  = min_threads;
+QTAILQ_INIT(&queue->request_list);
+qemu_mutex_init(&queue->lock);
+qemu_cond_init(&queue->cond);
+}
diff --git a/qemu-threadlet.h b/qemu-threadlet.h
index 6b11c86..25e77f2 100644
--- a/qemu-threadlet.h
+++ b/qemu-threadlet.h
@@ -38,7 +38,11 @@ typedef struct ThreadletWork
 void (*func)(struct ThreadletWork *work);
 } ThreadletWork;
 
+void submit_work_to_queue(ThreadletQueue *queue, ThreadletWork *work);
 void submit_work(ThreadletWork *work);
+int dequeue_work_on_queue(ThreadletQueue *queue, ThreadletWork *work);
 int dequeue_work(ThreadletWork *work);
 void threadlet_init(void);
+void threadlet_queue_init(ThreadletQueue *queu
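The shape of the new private-queue API can be illustrated with a reduced model of ThreadletQueue. This is a sketch, not QEMU code: the QTAILQ request list and the QemuMutex/QemuCond members are omitted so it stands alone, but the counters and the spawn-or-signal decision follow the patch.

```c
/* Reduced model of ThreadletQueue (list and lock/cond omitted). */
typedef struct ThreadletQueueModel {
    int max_threads;
    int min_threads;
    int cur_threads;
    int idle_threads;
} ThreadletQueueModel;

/* Mirrors threadlet_queue_init() from the patch, minus lock/cond init. */
static void queue_init(ThreadletQueueModel *q, int max_threads, int min_threads)
{
    q->cur_threads  = 0;
    q->idle_threads = 0;
    q->max_threads  = max_threads;
    q->min_threads  = min_threads;
}

/* The decision submit_work_to_queue() makes: spawn a new threadlet only
 * when nobody is idle and the pool is below its cap; otherwise just
 * signal an idle thread. */
static int should_spawn(const ThreadletQueueModel *q)
{
    return q->idle_threads == 0 && q->cur_threads < q->max_threads;
}
```

A subsystem-private queue is then just a ThreadletQueue initialized with its own max/min thread counts, submitted to with submit_work_to_queue() instead of submit_work().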

[Qemu-devel] [PATCH 05/12] Threadlet: Add submit_work threadlet API.

2011-01-20 Thread Arun R Bharadwaj
This patch adds the submit_work threadlet API and shows how
qemu_paio_submit changes to use submit_work.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 posix-aio-compat.c |   33 ++---
 1 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 011633f..62e32e3 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -393,13 +393,13 @@ static void spawn_threadlet(ThreadletQueue *queue)
 if (sigprocmask(SIG_SETMASK, &oldset, NULL)) die("sigprocmask restore");
 }
 
-static void qemu_paio_submit(struct qemu_paiocb *aiocb)
+/**
+ * submit_work: Submit to the global queue a new task to be executed
+ *   asynchronously.
+ * @work: Contains information about the task that needs to be submitted.
+ */
+static void submit_work(ThreadletWork *work)
 {
-qemu_mutex_lock(&aiocb_mutex);
-aiocb->ret = -EINPROGRESS;
-aiocb->active = 0;
-qemu_mutex_unlock(&aiocb_mutex);
-
 qemu_mutex_lock(&globalqueue.lock);
 
 if (!globalqueue_init) {
@@ -415,13 +415,13 @@ static void qemu_paio_submit(struct qemu_paiocb *aiocb)
 }
 
 if (globalqueue.idle_threads == 0 &&
-globalqueue.cur_threads < globalqueue.max_threads)
+globalqueue.cur_threads < globalqueue.max_threads) {
 spawn_threadlet(&globalqueue);
 
-aiocb->work.func = handle_work;
-
-QTAILQ_INSERT_TAIL(&globalqueue.request_list, &aiocb->work, node);
-qemu_cond_signal(&globalqueue.cond);
+} else {
+qemu_cond_signal(&globalqueue.cond);
+}
+QTAILQ_INSERT_TAIL(&globalqueue.request_list, work, node);
 qemu_mutex_unlock(&globalqueue.lock);
 }
 
@@ -448,6 +448,17 @@ static int qemu_paio_error(struct qemu_paiocb *aiocb)
 return ret;
 }
 
+static void qemu_paio_submit(struct qemu_paiocb *aiocb)
+{
+qemu_mutex_lock(&aiocb_mutex);
+aiocb->ret = -EINPROGRESS;
+aiocb->active = 0;
+qemu_mutex_unlock(&aiocb_mutex);
+
+aiocb->work.func = handle_work;
+submit_work(&aiocb->work);
+}
+
 static int posix_aio_process_queue(void *opaque)
 {
 PosixAioState *s = opaque;




[Qemu-devel] [PATCH 04/12] Add ThreadletQueue.

2011-01-20 Thread Arun R Bharadwaj
This patch adds a global queue of type ThreadletQueue and
removes the earlier usage of request_list queue.

We want to create the thread on the first submit. Hence we
need to track whether the globalqueue is initialized or not.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 posix-aio-compat.c |  149 +++-
 1 files changed, 76 insertions(+), 73 deletions(-)

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 2792201..011633f 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -31,8 +31,23 @@
 
 #include "block/raw-posix-aio.h"
 
+#define MAX_GLOBAL_THREADS  64
+#define MIN_GLOBAL_THREADS   8
+
 static QemuMutex aiocb_mutex;
 static QemuCond aiocb_completion;
+
+typedef struct ThreadletQueue
+{
+QemuMutex lock;
+QemuCond cond;
+int max_threads;
+int min_threads;
+int cur_threads;
+int idle_threads;
+QTAILQ_HEAD(, ThreadletWork) request_list;
+} ThreadletQueue;
+
 typedef struct ThreadletWork
 {
 QTAILQ_ENTRY(ThreadletWork) node;
@@ -66,15 +81,10 @@ typedef struct PosixAioState {
 struct qemu_paiocb *first_aio;
 } PosixAioState;
 
-
-static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
-static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
-static pthread_t thread_id;
+/* Default ThreadletQueue */
+static ThreadletQueue globalqueue;
+static int globalqueue_init;
 static pthread_attr_t attr;
-static int max_threads = 64;
-static int cur_threads = 0;
-static int idle_threads = 0;
-static QTAILQ_HEAD(, ThreadletWork) request_list;
 
 #ifdef CONFIG_PREADV
 static int preadv_present = 1;
@@ -93,32 +103,6 @@ static void die(const char *what)
 die2(errno, what);
 }
 
-static void mutex_lock(pthread_mutex_t *mutex)
-{
-int ret = pthread_mutex_lock(mutex);
-if (ret) die2(ret, "pthread_mutex_lock");
-}
-
-static void mutex_unlock(pthread_mutex_t *mutex)
-{
-int ret = pthread_mutex_unlock(mutex);
-if (ret) die2(ret, "pthread_mutex_unlock");
-}
-
-static int cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex,
-   struct timespec *ts)
-{
-int ret = pthread_cond_timedwait(cond, mutex, ts);
-if (ret && ret != ETIMEDOUT) die2(ret, "pthread_cond_timedwait");
-return ret;
-}
-
-static void cond_signal(pthread_cond_t *cond)
-{
-int ret = pthread_cond_signal(cond);
-if (ret) die2(ret, "pthread_cond_signal");
-}
-
 static void thread_create(pthread_t *thread, pthread_attr_t *attr,
   void *(*start_routine)(void*), void *arg)
 {
@@ -311,42 +295,45 @@ static ssize_t handle_aiocb_rw(struct qemu_paiocb *aiocb)
 
 static void *threadlet_worker(void *data)
 {
+ThreadletQueue *queue = data;
 
+qemu_mutex_lock(&queue->lock);
 while (1) {
-ssize_t ret = 0;
-qemu_timeval tv;
-struct timespec ts;
 ThreadletWork *work;
+int ret = 0;
 
-qemu_gettimeofday(&tv);
-ts.tv_sec = tv.tv_sec + 10;
-ts.tv_nsec = 0;
-
-mutex_lock(&lock);
-
-while (QTAILQ_EMPTY(&request_list) &&
-   !(ret == ETIMEDOUT)) {
-ret = cond_timedwait(&cond, &lock, &ts);
+while (QTAILQ_EMPTY(&queue->request_list) &&
+   (ret != ETIMEDOUT)) {
+/* wait for cond to be signalled or broadcast for 1000s */
+ret = qemu_cond_timedwait((&queue->cond),
+  &(queue->lock), 10*10);
 }
 
-if (QTAILQ_EMPTY(&request_list)) {
-idle_threads--;
-cur_threads--;
-mutex_unlock(&lock);
-break;
-}
-work = QTAILQ_FIRST(&request_list);
-QTAILQ_REMOVE(&request_list, work, node);
-idle_threads--;
-mutex_unlock(&lock);
+assert(queue->idle_threads != 0);
+if (QTAILQ_EMPTY(&queue->request_list)) {
+if (queue->cur_threads > queue->min_threads) {
+/* We retain the minimum number of threads */
+break;
+}
+} else {
+work = QTAILQ_FIRST(&queue->request_list);
+QTAILQ_REMOVE(&queue->request_list, work, node);
+
+queue->idle_threads--;
+qemu_mutex_unlock(&queue->lock);
 
-work->func(work);
-mutex_lock(&lock);
-idle_threads++;
-mutex_unlock(&lock);
+/* execute the work function */
+work->func(work);
 
+qemu_mutex_lock(&queue->lock);
+queue->idle_threads++;
+}
 }
 
+queue->idle_threads--;
+queue->cur_threads--;
+qemu_mutex_unlock(&queue->lock);
+
 return NULL;
 }
 
@@ -389,18 +376,19 @@ static void handle_work(ThreadletWork *work)
 }
 }
 
-static void spawn_thread(void)
+static void spawn_threadlet(ThreadletQueue *queue)
 {
+pthread_t thread_id;
 sigset_t set, oldset;
 
-cur_threads++;
-idle_threads++;
+queue->cur_threads++;
+queue->idle_thre

[Qemu-devel] [PATCH 07/12] Remove thread_create routine.

2011-01-20 Thread Arun R Bharadwaj
Remove thread_create and use qemu_thread_create instead.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 posix-aio-compat.c |   29 ++---
 1 files changed, 2 insertions(+), 27 deletions(-)

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 95ba805..a7625d8 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -13,7 +13,6 @@
 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -83,7 +82,6 @@ typedef struct PosixAioState {
 /* Default ThreadletQueue */
 static ThreadletQueue globalqueue;
 static int globalqueue_init;
-static pthread_attr_t attr;
 
 #ifdef CONFIG_PREADV
 static int preadv_present = 1;
@@ -102,13 +100,6 @@ static void die(const char *what)
 die2(errno, what);
 }
 
-static void thread_create(pthread_t *thread, pthread_attr_t *attr,
-  void *(*start_routine)(void*), void *arg)
-{
-int ret = pthread_create(thread, attr, start_routine, arg);
-if (ret) die2(ret, "pthread_create");
-}
-
 static ssize_t handle_aiocb_ioctl(struct qemu_paiocb *aiocb)
 {
 int ret;
@@ -374,19 +365,12 @@ static void handle_work(ThreadletWork *work)
 
 static void spawn_threadlet(ThreadletQueue *queue)
 {
-pthread_t thread_id;
-sigset_t set, oldset;
+QemuThread thread;
 
 queue->cur_threads++;
 queue->idle_threads++;
 
-/* block all signals */
-if (sigfillset(&set)) die("sigfillset");
-if (sigprocmask(SIG_SETMASK, &set, &oldset)) die("sigprocmask");
-
-thread_create(&thread_id, &attr, threadlet_worker, queue);
-
-if (sigprocmask(SIG_SETMASK, &oldset, NULL)) die("sigprocmask restore");
+qemu_thread_create(&thread, threadlet_worker, queue);
 }
 
 /**
@@ -671,7 +655,6 @@ int paio_init(void)
 struct sigaction act;
 PosixAioState *s;
 int fds[2];
-int ret;
 
 if (posix_aio_state)
 return 0;
@@ -701,14 +684,6 @@ int paio_init(void)
 qemu_aio_set_fd_handler(s->rfd, posix_aio_read, NULL, posix_aio_flush,
 posix_aio_process_queue, s);
 
-ret = pthread_attr_init(&attr);
-if (ret)
-die2(ret, "pthread_attr_init");
-
-ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
-if (ret)
-die2(ret, "pthread_attr_setdetachstate");
-
 posix_aio_state = s;
 return 0;
 }




[Qemu-devel] [PATCH 08/12] Threadlet: Add aio_signal_handler threadlet API

2011-01-20 Thread Arun R Bharadwaj
This patch adds the aio_signal_handler threadlet API. Earlier,
posix-aio-compat.c had its own signal handler code. This patch
abstracts it; a later patch moves it into generic code so that
it can be used by other subsystems.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 Makefile.objs  |1 +
 posix-aio-compat.c |   30 --
 qemu-threadlet.c   |   39 +++
 qemu-threadlet.h   |   19 +++
 vl.c   |3 +++
 5 files changed, 70 insertions(+), 22 deletions(-)
 create mode 100644 qemu-threadlet.c
 create mode 100644 qemu-threadlet.h

diff --git a/Makefile.objs b/Makefile.objs
index 3b7ec27..3b11dd0 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -10,6 +10,7 @@ qobject-obj-y += qerror.o
 block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
 block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
 block-obj-$(CONFIG_POSIX) += qemu-thread.o
+block-obj-$(CONFIG_POSIX) += qemu-threadlet.o
 block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index a7625d8..96e28db 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -327,11 +327,14 @@ static void *threadlet_worker(void *data)
 return NULL;
 }
 
+static PosixAioState *posix_aio_state;
+
 static void handle_work(ThreadletWork *work)
 {
 pid_t pid;
 ssize_t ret = 0;
 struct qemu_paiocb *aiocb;
+char byte = 0;
 
 pid = getpid();
 aiocb = container_of(work, struct qemu_paiocb, work);
@@ -358,6 +361,11 @@ static void handle_work(ThreadletWork *work)
 qemu_cond_broadcast(&aiocb_completion);
 qemu_mutex_unlock(&aiocb_mutex);
 
+ret = write(posix_aio_state->wfd, &byte, sizeof(byte));
+if (ret < 0 && errno != EAGAIN) {
+die("write()");
+}
+
 if (kill(pid, aiocb->ev_signo)) {
 die("kill failed");
 }
@@ -518,22 +526,6 @@ static int posix_aio_flush(void *opaque)
 return !!s->first_aio;
 }
 
-static PosixAioState *posix_aio_state;
-
-static void aio_signal_handler(int signum)
-{
-if (posix_aio_state) {
-char byte = 0;
-ssize_t ret;
-
-ret = write(posix_aio_state->wfd, &byte, sizeof(byte));
-if (ret < 0 && errno != EAGAIN)
-die("write()");
-}
-
-qemu_service_io();
-}
-
 static void paio_remove(struct qemu_paiocb *acb)
 {
 struct qemu_paiocb **pacb;
@@ -652,7 +644,6 @@ BlockDriverAIOCB *paio_ioctl(BlockDriverState *bs, int fd,
 
 int paio_init(void)
 {
-struct sigaction act;
 PosixAioState *s;
 int fds[2];
 
@@ -664,11 +655,6 @@ int paio_init(void)
 
 s = qemu_malloc(sizeof(PosixAioState));
 
-sigfillset(&act.sa_mask);
-act.sa_flags = 0; /* do not restart syscalls to interrupt select() */
-act.sa_handler = aio_signal_handler;
-sigaction(SIGUSR2, &act, NULL);
-
 s->first_aio = NULL;
 if (qemu_pipe(fds) == -1) {
 fprintf(stderr, "failed to create pipe\n");
diff --git a/qemu-threadlet.c b/qemu-threadlet.c
new file mode 100644
index 000..857d08d
--- /dev/null
+++ b/qemu-threadlet.c
@@ -0,0 +1,39 @@
+/*
+ * Threadlet support for offloading tasks to be executed asynchronously
+ *
+ * Copyright IBM, Corp. 2008
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ *  Anthony Liguori 
+ *  Aneesh Kumar K.V
+ *  Gautham R Shenoy
+ *  Arun R Bharadwaj
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include 
+
+#include "qemu-threadlet.h"
+#include "osdep.h"
+
+static void threadlet_io_completion_signal_handler(int signum)
+{
+qemu_service_io();
+}
+
+static void threadlet_register_signal_handler(void)
+{
+struct sigaction act;
+sigfillset(&act.sa_mask);
+act.sa_flags = 0; /* do not restart syscalls to interrupt select() */
+act.sa_handler = threadlet_io_completion_signal_handler;
+sigaction(SIGUSR2, &act, NULL);
+}
+
+void threadlet_init(void)
+{
+threadlet_register_signal_handler();
+}
diff --git a/qemu-threadlet.h b/qemu-threadlet.h
new file mode 100644
index 000..2c225d6
--- /dev/null
+++ b/qemu-threadlet.h
@@ -0,0 +1,19 @@
+/*
+ * Threadlet support for offloading tasks to be executed asynchronously
+ *
+ * Copyright IBM, Corp. 2008
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ *  Anthony Liguori 
+ *  Aneesh Kumar K.V
+ *  Gautham R Shenoy
+ *  Arun R Bharadwaj
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+
+void threadlet_init(void);
diff --git a/vl.c b/vl.c
index df414ef..d04fc7a 100644
--- a/vl.c
+++ b/vl.c
@@ -148,6 +148,7 @@ int main(int argc, char **argv)
 #include "qemu-config.h"
 #include "qemu-objects.h"
 #include "qemu-options.h"
+#include "qemu-threadlet.h"
 #ifdef CONFIG_VIRTFS
 #include "fs

[Qemu-devel] [PATCH 03/12] Add callback function to ThreadletWork structure.

2011-01-20 Thread Arun R Bharadwaj
This patch adds the callback function to the ThreadletWork
structure and turns the aio handler into a callback function.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 posix-aio-compat.c |   89 +---
 1 files changed, 50 insertions(+), 39 deletions(-)

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index ddf42f5..2792201 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -36,6 +36,7 @@ static QemuCond aiocb_completion;
 typedef struct ThreadletWork
 {
 QTAILQ_ENTRY(ThreadletWork) node;
+void (*func)(struct ThreadletWork *work);
 } ThreadletWork;
 
 struct qemu_paiocb {
@@ -308,18 +309,14 @@ static ssize_t handle_aiocb_rw(struct qemu_paiocb *aiocb)
 return nbytes;
 }
 
-static void *aio_thread(void *unused)
+static void *threadlet_worker(void *data)
 {
-pid_t pid;
-ThreadletWork *work;
-
-pid = getpid();
 
 while (1) {
-struct qemu_paiocb *aiocb;
 ssize_t ret = 0;
 qemu_timeval tv;
 struct timespec ts;
+ThreadletWork *work;
 
 qemu_gettimeofday(&tv);
 ts.tv_sec = tv.tv_sec + 10;
@@ -332,52 +329,64 @@ static void *aio_thread(void *unused)
 ret = cond_timedwait(&cond, &lock, &ts);
 }
 
-if (QTAILQ_EMPTY(&request_list))
+if (QTAILQ_EMPTY(&request_list)) {
+idle_threads--;
+cur_threads--;
+mutex_unlock(&lock);
 break;
-
+}
 work = QTAILQ_FIRST(&request_list);
 QTAILQ_REMOVE(&request_list, work, node);
-
-aiocb = container_of(work, struct qemu_paiocb, work);
-
-aiocb->active = 1;
 idle_threads--;
 mutex_unlock(&lock);
 
-switch (aiocb->aio_type & QEMU_AIO_TYPE_MASK) {
-case QEMU_AIO_READ:
-case QEMU_AIO_WRITE:
-ret = handle_aiocb_rw(aiocb);
-break;
-case QEMU_AIO_FLUSH:
-ret = handle_aiocb_flush(aiocb);
-break;
-case QEMU_AIO_IOCTL:
-ret = handle_aiocb_ioctl(aiocb);
-break;
-default:
-fprintf(stderr, "invalid aio request (0x%x)\n", aiocb->aio_type);
-ret = -EINVAL;
-break;
-}
-
+work->func(work);
 mutex_lock(&lock);
 idle_threads++;
 mutex_unlock(&lock);
 
-qemu_mutex_lock(&aiocb_mutex);
-aiocb->ret = ret;
-qemu_cond_broadcast(&aiocb_completion);
-qemu_mutex_unlock(&aiocb_mutex);
+}
+
+return NULL;
+}
+
+static void handle_work(ThreadletWork *work)
+{
+pid_t pid;
+ssize_t ret = 0;
+struct qemu_paiocb *aiocb;
 
-if (kill(pid, aiocb->ev_signo)) die("kill failed");
+pid = getpid();
+aiocb = container_of(work, struct qemu_paiocb, work);
+qemu_mutex_lock(&aiocb_mutex);
+aiocb->active = 1;
+qemu_mutex_unlock(&aiocb_mutex);
+
+switch (aiocb->aio_type & QEMU_AIO_TYPE_MASK) {
+case QEMU_AIO_READ:
+case QEMU_AIO_WRITE:
+ret = handle_aiocb_rw(aiocb);
+break;
+case QEMU_AIO_FLUSH:
+ret = handle_aiocb_flush(aiocb);
+break;
+case QEMU_AIO_IOCTL:
+ret = handle_aiocb_ioctl(aiocb);
+break;
+default:
+fprintf(stderr, "invalid aio request (0x%x)\n", aiocb->aio_type);
+ret = -EINVAL;
+break;
 }
 
-idle_threads--;
-cur_threads--;
-mutex_unlock(&lock);
+qemu_mutex_lock(&aiocb_mutex);
+aiocb->ret = ret;
+qemu_cond_broadcast(&aiocb_completion);
+qemu_mutex_unlock(&aiocb_mutex);
 
-return NULL;
+if (kill(pid, aiocb->ev_signo)) {
+die("kill failed");
+}
 }
 
 static void spawn_thread(void)
@@ -391,7 +400,7 @@ static void spawn_thread(void)
 if (sigfillset(&set)) die("sigfillset");
 if (sigprocmask(SIG_SETMASK, &set, &oldset)) die("sigprocmask");
 
-thread_create(&thread_id, &attr, aio_thread, NULL);
+thread_create(&thread_id, &attr, threadlet_worker, NULL);
 
 if (sigprocmask(SIG_SETMASK, &oldset, NULL)) die("sigprocmask restore");
 }
@@ -406,6 +415,8 @@ static void qemu_paio_submit(struct qemu_paiocb *aiocb)
 mutex_lock(&lock);
 if (idle_threads == 0 && cur_threads < max_threads)
 spawn_thread();
+
+aiocb->work.func = handle_work;
 QTAILQ_INSERT_TAIL(&request_list, &aiocb->work, node);
 mutex_unlock(&lock);
 cond_signal(&cond);
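The point of the work->func member added above is that the worker loop no longer switches on the aio type; it just invokes a per-item callback. A minimal standalone illustration of that function-pointer pattern (names here are hypothetical, not QEMU's):

```c
typedef struct WorkItem WorkItem;
struct WorkItem {
    void (*func)(WorkItem *w);   /* per-item callback, like work->func */
    int result;
};

/* An example work function; the real one is handle_work(). */
static void double_it(WorkItem *w)
{
    w->result *= 2;
}

/* The generic worker: it knows nothing about the work's type. */
static void run_work(WorkItem *w)
{
    w->func(w);
}
```

This decoupling is what later lets the threadlet code move out of posix-aio-compat.c entirely.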




[Qemu-devel] [PATCH 01/12] Add aiocb_mutex and aiocb_completion.

2011-01-20 Thread Arun R Bharadwaj
This patch adds the aiocb_mutex to protect aiocb.
This patch also removes the infinite loop present in paio_cancel.

Since there can only be one cancellation at a time, we need a
condition variable to wait on. For this, we introduce a global
aiocb_completion condition variable.

This patch also adds the Makefile entry to compile qemu-thread.c
when CONFIG_POSIX is set, instead of the unused CONFIG_THREAD.

Signed-off-by: Arun R Bharadwaj 
Reviewed-by: Stefan Hajnoczi 
---
 Makefile.objs  |2 +-
 posix-aio-compat.c |   39 ---
 2 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/Makefile.objs b/Makefile.objs
index cd5a24b..3b7ec27 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -9,6 +9,7 @@ qobject-obj-y += qerror.o
 
 block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
 block-obj-y += nbd.o block.o aio.o aes.o osdep.o qemu-config.o
+block-obj-$(CONFIG_POSIX) += qemu-thread.o
 block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
@@ -124,7 +125,6 @@ endif
 common-obj-y += $(addprefix ui/, $(ui-obj-y))
 
 common-obj-y += iov.o acl.o
-common-obj-$(CONFIG_THREAD) += qemu-thread.o
 common-obj-y += notify.o event_notifier.o
 common-obj-y += qemu-timer.o
 
diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 7b862b5..82862ec 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -27,9 +27,12 @@
 #include "qemu-common.h"
 #include "trace.h"
 #include "block_int.h"
+#include "qemu-thread.h"
 
 #include "block/raw-posix-aio.h"
 
+static QemuMutex aiocb_mutex;
+static QemuCond aiocb_completion;
 
 struct qemu_paiocb {
 BlockDriverAIOCB common;
@@ -351,10 +354,14 @@ static void *aio_thread(void *unused)
 }
 
 mutex_lock(&lock);
-aiocb->ret = ret;
 idle_threads++;
 mutex_unlock(&lock);
 
+qemu_mutex_lock(&aiocb_mutex);
+aiocb->ret = ret;
+qemu_cond_broadcast(&aiocb_completion);
+qemu_mutex_unlock(&aiocb_mutex);
+
 if (kill(pid, aiocb->ev_signo)) die("kill failed");
 }
 
@@ -383,8 +390,11 @@ static void spawn_thread(void)
 
 static void qemu_paio_submit(struct qemu_paiocb *aiocb)
 {
+qemu_mutex_lock(&aiocb_mutex);
 aiocb->ret = -EINPROGRESS;
 aiocb->active = 0;
+qemu_mutex_unlock(&aiocb_mutex);
+
 mutex_lock(&lock);
 if (idle_threads == 0 && cur_threads < max_threads)
 spawn_thread();
@@ -397,9 +407,9 @@ static ssize_t qemu_paio_return(struct qemu_paiocb *aiocb)
 {
 ssize_t ret;
 
-mutex_lock(&lock);
+qemu_mutex_lock(&aiocb_mutex);
 ret = aiocb->ret;
-mutex_unlock(&lock);
+qemu_mutex_unlock(&aiocb_mutex);
 
 return ret;
 }
@@ -536,22 +546,26 @@ static void paio_cancel(BlockDriverAIOCB *blockacb)
 struct qemu_paiocb *acb = (struct qemu_paiocb *)blockacb;
 int active = 0;
 
-mutex_lock(&lock);
+qemu_mutex_lock(&aiocb_mutex);
 if (!acb->active) {
 QTAILQ_REMOVE(&request_list, acb, node);
 acb->ret = -ECANCELED;
 } else if (acb->ret == -EINPROGRESS) {
 active = 1;
 }
-mutex_unlock(&lock);
 
-if (active) {
-/* fail safe: if the aio could not be canceled, we wait for
-   it */
-while (qemu_paio_error(acb) == EINPROGRESS)
-;
+if (!active) {
+acb->ret = -ECANCELED;
+} else {
+while (acb->ret == -EINPROGRESS) {
+/*
+ * fail safe: if the aio could not be canceled,
+ * we wait for it
+ */
+qemu_cond_wait(&aiocb_completion, &aiocb_mutex);
+}
 }
-
+qemu_mutex_unlock(&aiocb_mutex);
 paio_remove(acb);
 }
 
@@ -623,6 +637,9 @@ int paio_init(void)
 if (posix_aio_state)
 return 0;
 
+qemu_mutex_init(&aiocb_mutex);
+qemu_cond_init(&aiocb_completion);
+
 s = qemu_malloc(sizeof(PosixAioState));
 
 sigfillset(&act.sa_mask);
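The completion protocol this patch introduces (publish the result under aiocb_mutex, broadcast aiocb_completion, and have the canceller sleep while the result is still -EINPROGRESS) replaces the old busy-wait loop. A standalone sketch of that protocol, with the worker reduced to immediately completing (illustrative only, not QEMU code):

```c
#include <pthread.h>
#include <errno.h>

static pthread_mutex_t aiocb_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t aiocb_completion = PTHREAD_COND_INITIALIZER;
static int aiocb_ret = -EINPROGRESS;

/* Worker side: publish the result and wake any waiter. */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&aiocb_mutex);
    aiocb_ret = 0;                         /* the request completed */
    pthread_cond_broadcast(&aiocb_completion);
    pthread_mutex_unlock(&aiocb_mutex);
    return NULL;
}

/* Canceller side: sleep on the cond var instead of spinning. */
static int wait_for_completion(void)
{
    int ret;
    pthread_mutex_lock(&aiocb_mutex);
    while (aiocb_ret == -EINPROGRESS) {
        pthread_cond_wait(&aiocb_completion, &aiocb_mutex);
    }
    ret = aiocb_ret;
    pthread_mutex_unlock(&aiocb_mutex);
    return ret;
}
```

The while loop (rather than a single wait) guards against spurious wakeups, which is why the broadcast/predicate pairing is safe.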




[Qemu-devel] [PATCH 00/12] Threadlet Infrastructure

2011-01-20 Thread Arun R Bharadwaj
Hi,

This patch series introduces threadlet framework
in qemu.

Changelog:

* Modified the dequeue_work function logic, which was
  previously wrong. It now returns 0 on success and 1 on
  failure.

* removed the locking for container_of which is not 
  needed in handle_work()

* changed qemu-threadlets.[c|h] to qemu-threadlet.[c|h]

* introduced signal handler APIs in qemu-threadlet.c
  directly instead of in posix-aio-compat.c

Testing:

* I have tested this series using fsstress.
  Detailed testing information has been mentioned in the
  previous iteration.

The following series implements...

---

Arun R Bharadwaj (12):
  Add aiocb_mutex and aiocb_completion.
  Introduce work concept in posix-aio-compat.c
  Add callback function to ThreadletWork structure.
  Add ThreadletQueue.
  Threadlet: Add submit_work threadlet API.
  Threadlet: Add dequeue_work threadlet API and remove active field.
  Remove thread_create routine.
  Threadlet: Add aio_signal_handler threadlet API
  Remove all instances of CONFIG_THREAD
  Move threadlet code to qemu-threadlets.c
  Threadlets: Add functionality to create private queues.
  Threadlets: Add documentation


 Makefile.objs  |3 -
 configure  |2 
 docs/async-support.txt |  141 
 posix-aio-compat.c |  238 
 qemu-threadlet.c   |  187 ++
 qemu-threadlet.h   |   48 ++
 vl.c   |3 +
 7 files changed, 440 insertions(+), 182 deletions(-)
 create mode 100644 docs/async-support.txt
 create mode 100644 qemu-threadlet.c
 create mode 100644 qemu-threadlet.h

-- 
arun



[Qemu-devel] Re: [PATCH] target-arm: Fix loading of scalar value for Neon multiply-by-scalar

2011-01-20 Thread Christophe Lyon
On 19.01.2011 20:29, Peter Maydell wrote:
> Fix the register and part of register we get the scalar from in
> the various "multiply vector by scalar" ops (VMUL by scalar
> and friends).
> 


I do confirm that with this patch, I can see many improvements in my tests.

Thanks.



Re: [Qemu-devel] [PATCH] Support saturation with shift=0.

2011-01-20 Thread Peter Maydell
On 20 January 2011 12:06, Christophe Lyon  wrote:
> On 19.01.2011 17:51, Peter Maydell wrote:
>> On 19 January 2011 16:10, Christophe Lyon  wrote:
>>>
>>> This patch fixes corner-case saturations, when the target range is
>>> zero. It merely removes the guard against (sh == 0), and makes:
>>> __ssat(0x87654321, 1) return 0x and set the saturation flag
>>
>> did you mean __ssat(0x87654321, 0) here? (they give the same
>> result, of course, but it's the sh==0 case the patch is changing...)
>
> Well... the ARM ARM says that the position for saturation is in the
> range 1 to 32, so I think the assembler encodes 1 less than what the
> user actually wrote. Hence at user level we use '1', but '0' is encoded
> and then parsed by qemu. Am I wrong?

No, you're right, there's a "+1" in the SSAT decoding instructions
which I hadn't noticed.

-- PMM



Re: [Qemu-devel] [PATCH] Support saturation with shift=0.

2011-01-20 Thread Christophe Lyon
On 19.01.2011 17:51, Peter Maydell wrote:
> On 19 January 2011 16:10, Christophe Lyon  wrote:
>>
>> This patch fixes corner-case saturations, when the target range is
>> zero. It merely removes the guard against (sh == 0), and makes:
>> __ssat(0x87654321, 1) return 0x and set the saturation flag
> 
> did you mean __ssat(0x87654321, 0) here? (they give the same
> result, of course, but it's the sh==0 case the patch is changing...)

Well... the ARM ARM says that the position for saturation is in the range 1 to 
32, so I think the assembler encodes 1 less than what the user actually wrote. 
Hence at user level we use '1', but '0' is encoded and then parsed by qemu. Am 
I wrong?

Obviously, I can rephrase the commit message :-)

> 
>> __usat(0x87654321, 0) return 0 and set the saturation flag
>>
>> Signed-off-by: Christophe Lyon 
> 
> Checked against the ARM ARM and tested by
> random-instruction-sequence generation.
> 

Thanks.
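For reference, the saturation semantics under discussion can be sketched in plain C (an illustrative model of the ARM ARM pseudocode, not QEMU's helpers; valid for decoded sat values below 31). With the decoded sat == 0, signed saturation clamps to the range [-1, 0] and unsigned saturation clamps to 0 -- exactly the corner case the patch enables:

```c
#include <assert.h>
#include <stdint.h>

/* "sat" is the decoded immediate, i.e. user-level position minus 1 for SSAT. */
static uint32_t do_ssat(int32_t val, int sat, int *q)
{
    int32_t max = (1 << sat) - 1;      /* sat == 0 -> max = 0  */
    int32_t min = -(1 << sat);         /* sat == 0 -> min = -1 */
    if (val > max) { *q = 1; return max; }
    if (val < min) { *q = 1; return min; }
    return val;
}

static uint32_t do_usat(int32_t val, int sat, int *q)
{
    uint32_t max = (1u << sat) - 1;    /* sat == 0 -> max = 0 */
    if (val > (int32_t)max) { *q = 1; return max; }
    if (val < 0) { *q = 1; return 0; }
    return val;
}
```

With this model, saturating the negative value 0x87654321 at decoded position 0 yields 0xffffffff (signed) or 0 (unsigned), setting the Q flag in both cases.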



Re: [Qemu-devel] [PATCH] pxa2xx_lcd: restore updating of display

2011-01-20 Thread Aurelien Jarno
On Tue, Jan 18, 2011 at 07:11:33PM +0300, Dmitry Eremin-Solenikov wrote:
> Recently the PXA2xx LCD stopped being updated incrementally (picture
> frozen). This patch fixes that by passing not min/max x/y, but rather
> (correctly) x/y and w/h.
> 
> Signed-off-by: Dmitry Eremin-Solenikov 
> ---
>  hw/pxa2xx_lcd.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/pxa2xx_lcd.c b/hw/pxa2xx_lcd.c
> index 1f2211a..5b2b07e 100644
> --- a/hw/pxa2xx_lcd.c
> +++ b/hw/pxa2xx_lcd.c
> @@ -796,9 +796,9 @@ static void pxa2xx_update_display(void *opaque)
>  
>  if (miny >= 0) {
>  if (s->orientation)
> -dpy_update(s->ds, miny, 0, maxy, s->xres);
> +dpy_update(s->ds, miny, 0, maxy - miny, s->xres);
>  else
> -dpy_update(s->ds, 0, miny, s->xres, maxy);
> +dpy_update(s->ds, 0, miny, s->xres, maxy - miny);
>  }
>  pxa2xx_lcdc_int_update(s);
>  
> -- 
> 1.7.2.3
> 

Thanks, applied.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net
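The fix above hinges on dpy_update() taking (x, y, width, height) rather than two corner points. A tiny illustrative helper (not QEMU code) for converting a min/max bounding box into that form:

```c
#include <assert.h>

struct rect { int x, y, w, h; };

/* Convert an inclusive-min/exclusive-max bounding box into the
 * (x, y, width, height) form that a dpy_update-style API expects. */
static struct rect bbox_to_rect(int minx, int miny, int maxx, int maxy)
{
    struct rect r = { minx, miny, maxx - minx, maxy - miny };
    return r;
}
```

Passing maxy where the height belongs, as the old code did, updates a region roughly twice as tall as intended (or clips it), which is why the display stopped refreshing correctly.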



Re: [Qemu-devel] [PATCH] ioport: Improve error output

2011-01-20 Thread Aurelien Jarno
On Mon, Dec 27, 2010 at 01:00:56AM +0100, Andreas Färber wrote:
> When failing due to conflicting I/O port registrations,
> include the offending I/O port address in the message.
> 
> Signed-off-by: Andreas Färber 
> ---
>  ioport.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/ioport.c b/ioport.c
> index aa4188a..349915f 100644
> --- a/ioport.c
> +++ b/ioport.c
> @@ -149,7 +149,7 @@ int register_ioport_read(pio_addr_t start, int length, 
> int size,
>  for(i = start; i < start + length; i += size) {
>  ioport_read_table[bsize][i] = func;
>  if (ioport_opaque[i] != NULL && ioport_opaque[i] != opaque)
> -hw_error("register_ioport_read: invalid opaque");
> +hw_error("register_ioport_read: invalid opaque for 0x%x", i);
>  ioport_opaque[i] = opaque;
>  }
>  return 0;
> @@ -168,7 +168,7 @@ int register_ioport_write(pio_addr_t start, int length, 
> int size,
>  for(i = start; i < start + length; i += size) {
>  ioport_write_table[bsize][i] = func;
>  if (ioport_opaque[i] != NULL && ioport_opaque[i] != opaque)
> -hw_error("register_ioport_write: invalid opaque");
> +hw_error("register_ioport_write: invalid opaque for 0x%x", i);
>  ioport_opaque[i] = opaque;
>  }
>  return 0;

The idea looks good, but "for 0x%x" looks strange. Maybe something like
"for address 0x%x"?

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH] add bepo (french dvorak) keyboard layout

2011-01-20 Thread Aurelien Jarno
On Sun, Jan 09, 2011 at 02:24:59PM +0100, Fred Boiteux wrote:
>   Hello,
> 
>   It's my first message here, and I'm not a git user, so I hope the
> patch format will be good (I've done a "git diff --cached" in the qemu
> tree).
>   I'm using the Qemu program with VNC I/O, and I had some problems with
> my keyboard layout, so I've prepared a definition to be included in
> Qemu, built from Xorg description. I've tested here, it works for me.
> 
>   If you have any remark/question about that, tell me.
> 
>   Fred.
> 
> 
> 
> Signed-off-by: Frédéric Boiteux 
> 
> ---

Thanks, applied.

> diff --git a/pc-bios/keymaps/bepo b/pc-bios/keymaps/bepo
> new file mode 100644
> index 000..d40041a
> --- /dev/null
> +++ b/pc-bios/keymaps/bepo
> @@ -0,0 +1,333 @@
> +include common
> +
> +# Bépo : Improved ergonomic french keymap using Dvorak method.
> +# Built by community on 'Dvorak Fr / Bépo' :
> +# see http://www.clavier-dvorak.org/wiki/ to join and help.
> +#
> +# Bépo layout (1.0rc2 version) for a pc105 keyboard (french) :
> +# ┌┐
> +# │ S A│   S = Shift,  A = AltGr + Shift
> +# │ s a│   s = normal, a = AltGr
> +# └┘
> +#
> +# 
> ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┲━┓
> +# │ # ¶ │ 1 „ │ 2 “ │ 3 ” │ 4 ≤ │ 5 ≥ │ 6   │ 7 ¬ │ 8 ¼ │ 9 ½ │ 0 ¾ │ ° ′ │ 
> ` ″ ┃ ⌫ Retour┃
> +# │ $ – │ " — │ « < │ » > │ ( [ │ ) ] │ @ ^ │ + ± │ - − │ / ÷ │ * × │ = ≠ │ 
> % ‰ ┃  arrière┃
> +# 
> ┢━┷━┱───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┴─┬───┺━┳━━━┫
> +# ┃   ┃ B ¦ │ É ˝ │ P § │ O Œ │ È ` │ !   │ V   │ D Ð │ L   │ J IJ │ Z Ə 
> │ W   ┃Entrée ┃
> +# ┃Tab ↹  ┃ b | │ é ˊ │ p & │ o œ │ è ` │ ˆ ¡ │ v ˇ │ d ð │ l / │ j ij │ z ə 
> │ w ̆ ┃   ⏎   ┃
> +# 
> ┣━━━┻┱┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┺┓
>   ┃
> +# ┃┃ A Æ │ U Ù │ I ˙ │ E ¤ │ ; ̛ │ C ſ │ T Þ │ S ẞ │ R ™ │ N   │ M º 
> │ Ç , ┃  ┃
> +# ┃Maj ⇬   ┃ a æ │ u ù │ i ̈ │ e € │ , ’ │ c © │ t þ │ s ß │ r ® │ n ˜ │ m ¯ 
> │ ç ¸ ┃  ┃
> +# 
> ┣━━━┳┹┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┬┴┲┷━┻━━┫
> +# ┃   ┃ Ê   │ À   │ Y ‘ │ X ’ │ : · │ K   │ ? ̉ │ Q ̣ │ G   │ H ‡ │ F ª 
> ┃ ┃
> +# ┃Shift ⇧┃ ê / │ à \ │ y { │ x } │ . … │ k ~ │ ' ¿ │ q ˚ │ g µ │ h † │ f ˛ 
> ┃Shift ⇧  ┃
> +# 
> ┣━━━╋━┷━┳━━━┷━━━┱─┴─┴─┴─┴─┴─┴───┲━┷━╈━┻━┳━━━┳━━━┛
> +# ┃   ┃   ┃   ┃ Espace inséc.   Espace inséc. fin ┃   ┃  
>  ┃   ┃
> +# ┃Ctrl   ┃Meta   ┃Alt┃ ␣ (Espace)  _   ␣ ┃AltGr ⇮┃Menu  
>  ┃Ctrl   ┃
> +# 
> ┗━━━┻━━━┻━━━┹───┺━━━┻━━━┻━━━┛
> +
> +
> +# First row
> +## keycode  41 = dollar numbersign   U+2013  U+00b6
> +dollar0x29
> +numbersign0x29  shift
> +U2013 0x29altgr
> +U00b6 0x29  shift altgr
> +
> +## keycode   2 = +quotedbl +one  U+2014  U+201e
> +quotedbl  0x2
> +one   0x2  shift
> +U2014 0x2altgr
> +U201e 0x2  shift altgr
> +
> +## keycode   3 = +guillemotleft  +two lessU+201c
> +guillemotleft  0x3
> +two   0x3  shift
> +less  0x3altgr
> +U201c 0x3  shift altgr
> +
> +## keycode   4 = +guillemotright +three  greater U+201d
> +guillemotright  0x4
> +three 0x4  shift
> +greater   0x4altgr
> +U201d 0x4  shift altgr
> +
> +## keycode   5 = +parenleft +fourbracketleft  U+2264
> +parenleft 0x5
> +four  0x5  shift
> +bracketleft   0x5altgr
> +U2264 0x5  shift altgr
> +
> +## keycode   6 = +parenright +five   bracketright  U+2265
> +parenright0x6
> +five  0x6  shift
> +bracketright  0x6altgr
> +U2265 0x6  shift altgr
> +
> +## keycode   7 = +at   +six  asciicircum
> +at0x7
> +six   0x7  shift
> +asciicircum   0x7altgr
> +
> +## keycode   8 = +plus +sevenU+00b1  U+00ac
> +plus  0x8
> +seven 0x8  shift
> +U00b1 0x8altgr
> +U00ac 0x8  shift altgr
> +
> +## keycode   9 = +minus+eightU+2212  U+00bc
> +minus 0x9
> +eight 0x9  shift
> +U2212 0x9altgr
> +U00bc 0x9  shift altgr
> +
> +## keycode  10 = +slash+nine U+00f7  U+00bd
> +slash 0xa
> +nine  0xa  shift
> +U00f7 0xaaltgr
> +U00bd 0xa  shift altgr
> +
> +## keycode  11 = +asterisk +zero U+00d7  U+00be
> +asterisk  0xb
> +zero  0xb  shift
> +U00d7 0xbaltgr
> +U00be 0xb  shift altgr
> +
> +## keycode  12 = equal U+00b0U+2260  U+2032
> +equal 0xc
> +U00b0 0xc  shift
> +U2260 0xcaltgr
> +U2032 0xc  shift altgr
> +
> +## keycode  13 = percent   grave U+2030  U+2033
> +percent   0xd
> +grave 0xd  shift
> +U2030 0xd

Re: [Qemu-devel] [PATCH 1/3] mainstone: fix name of the allocated memory for roms

2011-01-20 Thread Aurelien Jarno
On Thu, Jan 13, 2011 at 06:37:10PM +0300, Dmitry Eremin-Solenikov wrote:
> Mainstone board has two flash chips (emulated by two ram regions), however
> currently code tries to allocate them with the same name, which fails.
> Fix that to make mainstone emulation work again.
> 
> Signed-off-by: Dmitry Eremin-Solenikov 
> ---
>  hw/mainstone.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/hw/mainstone.c b/hw/mainstone.c
> index efa2959..d8e41cf 100644
> --- a/hw/mainstone.c
> +++ b/hw/mainstone.c
> @@ -106,7 +106,8 @@ static void mainstone_common_init(ram_addr_t ram_size,
>  }
>  
>  if (!pflash_cfi01_register(mainstone_flash_base[i],
> -   qemu_ram_alloc(NULL, "mainstone.flash",
> +   qemu_ram_alloc(NULL, i ? 
> "mainstone.flash1" :
> +  "mainstone.flash0",
>MAINSTONE_FLASH),
> dinfo->bdrv, sector_len,
> MAINSTONE_FLASH / sector_len, 4, 0, 0, 0, 
> 0,
> -- 
> 1.7.2.3
> 

Thanks, all three patches applied.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Kevin Wolf
Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
> 2011/1/20 Kevin Wolf :
>> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
>>> +return;
>>> +}
>>> +
>>> +bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>>> +blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>>> +blk_req->reqs[0].opaque);
>>
>> Same here.
>>
>>> +bdrv_flush(bs);
>>
>> This looks really strange. What is this supposed to do?
>>
>> One point is that you write it immediately after bdrv_aio_write, so you
>> get an fsync for which you don't know if it includes the current write
>> request or if it doesn't. Which data do you want to get flushed to the 
>> disk?
>
> I was expecting to flush the aio request that was just initiated.
> Am I misunderstanding the function?

 Seems so. The function names don't use really clear terminology either,
 so you're not the first one to fall in this trap. Basically we have:

 * qemu_aio_flush() waits for all AIO requests to complete. I think you
 wanted to have exactly this, but only for a single block device. Such a
 function doesn't exist yet.

 * bdrv_flush() makes sure that all successfully completed requests are
 written to disk (by calling fsync)

 * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
 the fsync in the thread pool
>>>
>>> Then what I wanted to do is, call qemu_aio_flush first, then
>>> bdrv_flush.  It should be like live migration.
>>
>> Okay, that makes sense. :-)
>>
>> The other thing is that you introduce a bdrv_flush for each request,
>> basically forcing everyone to something very similar to writethrough
>> mode. I'm sure this will have a big impact on performance.
>
> The reason is to avoid inversion of queued requests.  Although
> processing one-by-one is heavy, wouldn't having requests flushed
> to disk out of order break the disk image?

 No, that's fine. If a guest issues two requests at the same time, they
 may complete in any order. You just need to make sure that you don't
 call the completion callback before the request really has completed.
>>>
>>> We need to flush requests, meaning aio and fsync, before sending
>>> the final state of the guests, to make sure we can switch to the
>>> secondary safely.
>>
>> In theory I think you could just re-submit the requests on the secondary
>> if they had not completed yet.
>>
>> But you're right, let's keep things simple for the start.
>>
 I'm just starting to wonder if the guest won't timeout the requests if
 they are queued for too long. Even more, with IDE, it can only handle
 one request at a time, so not completing requests doesn't sound like a
 good idea at all. In what intervals is the event-tap queue flushed?
>>>
>>> The requests are flushed once each transaction completes.  So
>>> it's not with specific intervals.
>>
>> Right. So when is a transaction completed? This is the time that a
>> single request will take.
> 
> The transaction is completed when the vm state is sent to the
> secondary, and the primary receives the ack to it.  Please let me
> know if the answer is too vague.  What I can tell is that it
> can't be super fast.
> 
 On the other hand, if you complete before actually writing out, you
 don't get timeouts, but you signal success to the guest when the request
 could still fail. What would you do in this case? With a writeback cache
 mode we're fine, we can just fail the next flush (until then nothing is
 guaranteed to be on disk and order doesn't matter either), but with
 cache=writethrough we're in serious trouble.

 Have you thought about this problem? Maybe we end up having to flush the
 event-tap queue for each single write in writethrough mode.
>>>
>>> Yes, and that's what I'm trying to do at this point.
>>
>> Oh, I must have missed that code. Which patch/function should I look at?
> 
> Maybe I mis-answered your question.  The device may receive
> timeouts.  

We should pay attention that the guest does not see timeouts. I'm not
expecting that I/O will be super fast, and as long as it is only a
performance problem we can live with it.

However, as soon as the guest gets timeouts it reports I/O errors and
eventually offlines the block device. At this point it's not a
performance problem any more, but also a correctness problem.

This is why I suggested that we flush the event-tap queue (i.e. complete
the transaction) immediately after an I/O request has been issued
instead of waiting for other events that would complete the transaction.

> If timeouts didn't happen, the requests are flushed
> one-by-one in writethrough because we're calling qemu_aio_flush
> and bdrv_flush together.

I think this is what we must do.

Kevin
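The agreed ordering -- drain all in-flight AIO first (the qemu_aio_flush step), then fsync what has completed (the bdrv_flush step) -- can be modelled with a toy state machine (purely illustrative; none of these names are QEMU APIs):

```c
#include <assert.h>

struct toy_disk {
    int in_flight;      /* submitted but not yet completed requests */
    int completed;      /* completed, possibly still in the page cache */
    int on_disk;        /* durably written */
};

/* ~ qemu_aio_flush(): wait until every in-flight request has completed. */
static void toy_complete_all(struct toy_disk *d)
{
    d->completed += d->in_flight;
    d->in_flight = 0;
}

/* ~ bdrv_flush(): fsync -- only *completed* requests reach the disk. */
static void toy_fsync(struct toy_disk *d)
{
    d->on_disk = d->completed;
}

/* The ordering needed before switching to the secondary. */
static void toy_drain_and_flush(struct toy_disk *d)
{
    toy_complete_all(d);
    toy_fsync(d);
}
```

Calling only the fsync step would miss whatever was still in flight; draining first ensures everything the guest submitted is durable before the switch.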



Re: [Qemu-devel] [PATCH 0/8] Add save/restore support to ARM versatilepb devices

2011-01-20 Thread Aurelien Jarno
On Thu, Dec 23, 2010 at 05:19:50PM +, Peter Maydell wrote:
> This patchset adds save/restore support to the devices used by the ARM
> versatilepb board which didn't already support it.
> 
> I did this in line with docs/migration.txt, and it seems to work OK. I'd
> appreciate some review from somebody with a better grasp of the APIs,
> though :-) In particular, am I saving the uint8_t data[NUM_PACKETS][2048]
> array in smc91c111 in the right way?
> 
> Peter Maydell (8):
>   pl190: Implement save/restore
>   vpb_sic: Implement save/restore
>   arm_sysctl: Implement save/restore
>   pl050: Implement save/restore
>   pl031: Implement save/restore
>   pl110: Implement save/restore
>   pl080: Implement save/restore
>   smc91c111: Implement save/restore
> 
>  hw/arm_sysctl.c  |   17 +++
>  hw/pl031.c   |   25 ++-
>  hw/pl050.c   |   35 +++
>  hw/pl080.c   |   58 +++--
>  hw/pl110.c   |   44 +---
>  hw/pl190.c   |   36 +---
>  hw/smc91c111.c   |   30 +++
>  hw/versatilepb.c |   24 +++--
>  8 files changed, 249 insertions(+), 20 deletions(-)
> 

I don't really know this part of QEMU, but given nobody really wants to
review these patches and given they look ok, I have applied them. It's 
better than leaving them bitrotting on the mailing list.

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net



Re: [Qemu-devel] [PATCH 0/7] Define "deposit" tcg operation, v2

2011-01-20 Thread Edgar E. Iglesias
On Mon, Jan 10, 2011 at 07:23:41PM -0800, Richard Henderson wrote:
> Changes since v1:
>   * No attempt to pack pos+len into one operand.  Updated backends
> to match this change.
> 
>   * Example in the README is a bit more complex.
> 
>   * Define an official tcg_scratch_alloc routine, used by the i386
> target for the case in which we need a scratch register.  I had
> said that triggering this wasn't possible with mainline, but I
> was wrong.  The rlwimi example I posted in the other thread hits
> this case.
> 
> Aurelien suggested using a shorter name like "dep", but I think that
> would be more confusing than not.  This opcode is not really used
enough to warrant cropping 4 characters, in my opinion.

Thanks Richard. In agreement with Aurelien I've applied patch 1 and 6
as a first step. I've also applied a follow-up patch to fix the
copy-pastos in the README (loc -> len).

Hopefully as more acks come in, we'll apply the rest.

Cheers



Re: [Qemu-devel] usb redirection status report

2011-01-20 Thread Christoph Hellwig
On Wed, Jan 19, 2011 at 07:15:47PM +0100, Hans de Goede wrote:
> Hi All,
> 
> As most of you know I'm working on usb redirection (making client usb 
> devices
> accessible in guests over the network).
> 
> I thought it would be a good idea to write a short status report.
> 
> So far the following has been done:
> 
> * written and posted a network protocol for usb redir. over the network,
> and sent this to the list.

How does this relate to the various existing usb over ip protocols, e.g.
the one in drivers/staging/usbip/ in the Linux kernel tree?




Re: [Qemu-devel] [PATCH 6/7] target-i386: Use deposit operation.

2011-01-20 Thread Aurelien Jarno
On Mon, Jan 10, 2011 at 07:23:47PM -0800, Richard Henderson wrote:
> Use this for assignment to the low byte or low word of a register.
> 
> Signed-off-by: Richard Henderson 
> ---
>  target-i386/translate.c |   34 ++
>  1 files changed, 6 insertions(+), 28 deletions(-)

Acked-by: Edgar E. Iglesias 

> diff --git a/target-i386/translate.c b/target-i386/translate.c
> index 7b6e3c2..c008450 100644
> --- a/target-i386/translate.c
> +++ b/target-i386/translate.c
> @@ -274,28 +274,16 @@ static inline void gen_op_andl_A0_(void)
>  
>  static inline void gen_op_mov_reg_v(int ot, int reg, TCGv t0)
>  {
> -TCGv tmp;
> -
>  switch(ot) {
>  case OT_BYTE:
> -tmp = tcg_temp_new();
> -tcg_gen_ext8u_tl(tmp, t0);
>  if (reg < 4 X86_64_DEF( || reg >= 8 || x86_64_hregs)) {
> -tcg_gen_andi_tl(cpu_regs[reg], cpu_regs[reg], ~0xff);
> -tcg_gen_or_tl(cpu_regs[reg], cpu_regs[reg], tmp);
> +tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], t0, 0, 8);
>  } else {
> -tcg_gen_shli_tl(tmp, tmp, 8);
> -tcg_gen_andi_tl(cpu_regs[reg - 4], cpu_regs[reg - 4], ~0xff00);
> -tcg_gen_or_tl(cpu_regs[reg - 4], cpu_regs[reg - 4], tmp);
> +tcg_gen_deposit_tl(cpu_regs[reg - 4], cpu_regs[reg - 4], t0, 8, 
> 8);
>  }
> -tcg_temp_free(tmp);
>  break;
>  case OT_WORD:
> -tmp = tcg_temp_new();
> -tcg_gen_ext16u_tl(tmp, t0);
> -tcg_gen_andi_tl(cpu_regs[reg], cpu_regs[reg], ~0x);
> -tcg_gen_or_tl(cpu_regs[reg], cpu_regs[reg], tmp);
> -tcg_temp_free(tmp);
> +tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], t0, 0, 16);
>  break;
>  default: /* XXX this shouldn't be reached;  abort? */
>  case OT_LONG:
> @@ -323,15 +311,9 @@ static inline void gen_op_mov_reg_T1(int ot, int reg)
>  
>  static inline void gen_op_mov_reg_A0(int size, int reg)
>  {
> -TCGv tmp;
> -
>  switch(size) {
>  case 0:
> -tmp = tcg_temp_new();
> -tcg_gen_ext16u_tl(tmp, cpu_A0);
> -tcg_gen_andi_tl(cpu_regs[reg], cpu_regs[reg], ~0x);
> -tcg_gen_or_tl(cpu_regs[reg], cpu_regs[reg], tmp);
> -tcg_temp_free(tmp);
> +tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], cpu_A0, 0, 16);
>  break;
>  default: /* XXX this shouldn't be reached;  abort? */
>  case 1:
> @@ -415,9 +397,7 @@ static inline void gen_op_add_reg_im(int size, int reg, 
> int32_t val)
>  switch(size) {
>  case 0:
>  tcg_gen_addi_tl(cpu_tmp0, cpu_regs[reg], val);
> -tcg_gen_ext16u_tl(cpu_tmp0, cpu_tmp0);
> -tcg_gen_andi_tl(cpu_regs[reg], cpu_regs[reg], ~0x);
> -tcg_gen_or_tl(cpu_regs[reg], cpu_regs[reg], cpu_tmp0);
> +tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], cpu_tmp0, 0, 16);
>  break;
>  case 1:
>  tcg_gen_addi_tl(cpu_tmp0, cpu_regs[reg], val);
> @@ -439,9 +419,7 @@ static inline void gen_op_add_reg_T0(int size, int reg)
>  switch(size) {
>  case 0:
>  tcg_gen_add_tl(cpu_tmp0, cpu_regs[reg], cpu_T[0]);
> -tcg_gen_ext16u_tl(cpu_tmp0, cpu_tmp0);
> -tcg_gen_andi_tl(cpu_regs[reg], cpu_regs[reg], ~0x);
> -tcg_gen_or_tl(cpu_regs[reg], cpu_regs[reg], cpu_tmp0);
> +tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], cpu_tmp0, 0, 16);
>  break;
>  case 1:
>  tcg_gen_add_tl(cpu_tmp0, cpu_regs[reg], cpu_T[0]);
> -- 
> 1.7.2.3
> 
> 
> 

-- 
Aurelien Jarno  GPG: 1024D/F1BCDB73
aurel...@aurel32.net http://www.aurel32.net
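For reference, the deposit operation that replaces the ext/andi/or sequences above has these bit-level semantics (a standalone sketch, not the TCG implementation): insert the low `len` bits of `src` into `dst` at bit position `pos`, leaving the other bits of `dst` intact.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t deposit32(uint32_t dst, uint32_t src, int pos, int len)
{
    /* len == 32 needs special casing: 1u << 32 is undefined in C. */
    uint32_t mask = (len == 32) ? ~0u : ((1u << len) - 1);
    return (dst & ~(mask << pos)) | ((src & mask) << pos);
}
```

On a 32-bit target, `tcg_gen_deposit_tl(cpu_regs[reg], cpu_regs[reg], t0, 0, 8)` therefore corresponds to `deposit32(reg, t0, 0, 8)`, i.e. an x86 low-byte register write.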



Re: [Qemu-devel] Missing roms/seabios/Makefile and roms/vgabios/Makefile

2011-01-20 Thread Stefan Hajnoczi
On Thu, Jan 20, 2011 at 10:51:17AM +, Mateusz Loskot wrote:
> On 19/01/11 18:07, Blue Swirl wrote:
> >On Wed, Jan 19, 2011 at 12:44 PM, Mateusz Loskot  wrote:
> >>Hi,
> >>
> >>Running ./configure (under MinGW/MSYS) and symlink gives up trying to
> >>create links to non-existing Makefiles in
> >>roms/seabios/Makefile
> >>roms/vgabios/Makefile
> >
> >Those directories are actually git submodules; when you run 'git
> >submodule update', they get populated.
> 
> Something is not quite working and I don't get anything populated.
> Here I tried under MinGW
> 
> mloskot@dog /g/src/qemu/_git/master
> $ git pull
> Already up-to-date.
> 
> mloskot@dog /g/src/qemu/_git/master
> $ git submodule update
> 
> mloskot@dog /g/src/qemu/_git/master
> $ ls roms/seabios/
> config.mak

Make sure roms/{vgabios,seabios} are empty directories.

$ git submodule init
$ git submodule update

This should clone the vgabios and seabios repos and checkout the correct
commit.  After this completes successfully you should have source trees
under roms/{vgabios,seabios}.

Stefan



Re: [Qemu-devel] Missing roms/seabios/Makefile and roms/vgabios/Makefile

2011-01-20 Thread Mateusz Loskot

On 19/01/11 18:07, Blue Swirl wrote:

On Wed, Jan 19, 2011 at 12:44 PM, Mateusz Loskot  wrote:

Hi,

Running ./configure (under MinGW/MSYS) and symlink gives up trying to
create links to non-existing Makefiles in
roms/seabios/Makefile
roms/vgabios/Makefile


Those directories are actually git submodules; when you run 'git
submodule update', they get populated.


Something is not quite working and I don't get anything populated.
Here I tried under MinGW

mloskot@dog /g/src/qemu/_git/master
$ git pull
Already up-to-date.

mloskot@dog /g/src/qemu/_git/master
$ git submodule update

mloskot@dog /g/src/qemu/_git/master
$ ls roms/seabios/
config.mak

I also tested on Linux, same results, nothing pulled.


Maybe configure should check if the directories are OK and disable
building ROMs if not.


Generally, if they are optional, I think it's a good idea.
Specifically, I'm nearly completely green about qemu internals, so can't 
tell.


Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net
Charter Member of OSGeo, http://osgeo.org
Member of ACCU, http://accu.org



Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Yoshiaki Tamura
2011/1/20 Kevin Wolf :
> Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
>> +        return;
>> +    }
>> +
>> +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
>> +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
>> +                    blk_req->reqs[0].opaque);
>
> Same here.
>
>> +    bdrv_flush(bs);
>
> This looks really strange. What is this supposed to do?
>
> One point is that you write it immediately after bdrv_aio_write, so you
> get an fsync for which you don't know if it includes the current write
> request or if it doesn't. Which data do you want to get flushed to the 
> disk?

 I was expecting to flush the aio request that was just initiated.
 Am I misunderstanding the function?
>>>
>>> Seems so. The function names don't use really clear terminology either,
>>> so you're not the first one to fall in this trap. Basically we have:
>>>
>>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>>> wanted to have exactly this, but only for a single block device. Such a
>>> function doesn't exist yet.
>>>
>>> * bdrv_flush() makes sure that all successfully completed requests are
>>> written to disk (by calling fsync)
>>>
>>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
>>> the fsync in the thread pool
>>
>> Then what I wanted to do is, call qemu_aio_flush first, then
>> bdrv_flush.  It should be like live migration.
>
> Okay, that makes sense. :-)
>
> The other thing is that you introduce a bdrv_flush for each request,
> basically forcing everyone to something very similar to writethrough
> mode. I'm sure this will have a big impact on performance.

 The reason is to avoid inversion of queued requests.  Although
 processing one-by-one is heavy, wouldn't having requests flushed
 to disk out of order break the disk image?
>>>
>>> No, that's fine. If a guest issues two requests at the same time, they
>>> may complete in any order. You just need to make sure that you don't
>>> call the completion callback before the request really has completed.
>>
>> We need to flush requests, meaning aio and fsync, before sending
>> the final state of the guests, to make sure we can switch to the
>> secondary safely.
>
> In theory I think you could just re-submit the requests on the secondary
> if they had not completed yet.
>
> But you're right, let's keep things simple for the start.
>
>>> I'm just starting to wonder if the guest won't timeout the requests if
>>> they are queued for too long. Even more, with IDE, it can only handle
>>> one request at a time, so not completing requests doesn't sound like a
>>> good idea at all. In what intervals is the event-tap queue flushed?
>>
>> The requests are flushed once each transaction completes.  So
>> it's not with specific intervals.
>
> Right. So when is a transaction completed? This is the time that a
> single request will take.

The transaction is completed when the vm state is sent to the
secondary, and the primary receives the ack to it.  Please let me
know if the answer is too vague.  What I can tell is that it
can't be super fast.

>>> On the other hand, if you complete before actually writing out, you
>>> don't get timeouts, but you signal success to the guest when the request
>>> could still fail. What would you do in this case? With a writeback cache
>>> mode we're fine, we can just fail the next flush (until then nothing is
>>> guaranteed to be on disk and order doesn't matter either), but with
>>> cache=writethrough we're in serious trouble.
>>>
>>> Have you thought about this problem? Maybe we end up having to flush the
>>> event-tap queue for each single write in writethrough mode.
>>
>> Yes, and that's what I'm trying to do at this point.
>
> Oh, I must have missed that code. Which patch/function should I look at?

Maybe I mis-answered your question.  The device may receive
timeouts.  If timeouts didn't happen, the requests are flushed
one-by-one in writethrough because we're calling qemu_aio_flush
and bdrv_flush together.

Yoshi

>> I know that
>> performance matters a lot, but sacrificing reliability over
>> performance now isn't a good idea.  I first want to lay the
>> ground, and then focus on optimization.  Note that without dirty
>> bitmap optimization, Kemari suffers a lot in sending rams.
>> Anthony and I discussed to take this approach at KVM Forum.
>
> I agree, starting simple makes sense.
>
> Kevin
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Daniel P. Berrange
On Thu, Jan 20, 2011 at 09:44:05AM +0100, Gerd Hoffmann wrote:
>   Hi,
> 
> >For (2), you cannot use bus=X,addr=Y because it makes assumptions about
> >the PCI topology which may change in newer -M pc's.
> 
> Why should the PCI topology for 'pc' ever change?
> 
> We'll probably get q35 support some day, but when this lands I
> expect we'll see a new machine type 'q35', so '-m q35' will pick the
> ich9 chipset (which will have a different pci topology of course)
> and '-m pc' will pick the existing piix chipset (which will continue
> to look like it looks today).

If the topology does ever change (eg in the kind of way anthony
suggests, first bus only has the graphics card), I think libvirt
is going to need a little work to adapt to the new topology,
regardless of whether we currently specify a bus= arg to -device
or not. I'm not sure there's anything we could do to future proof
us to that kind of change.

Regards,
Daniel



[Qemu-devel] [PATCH] pci: fix pcibus_get_dev_path()

2011-01-20 Thread TeLeMan
The commit 6a7005d14b3c32d4864a718fb1cb19c789f58a5 used snprintf() incorrectly.

Signed-off-by: TeLeMan 
---
 hw/pci.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 8d0e3df..9f8800d 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -2050,14 +2050,14 @@ static char *pcibus_get_dev_path(DeviceState *dev)
 path[path_len] = '\0';

 /* First field is the domain. */
-snprintf(path, domain_len, "%04x:00", pci_find_domain(d->bus));
+snprintf(path, domain_len + 1, "%04x:00", pci_find_domain(d->bus));

 /* Fill in slot numbers. We walk up from device to root, so need to print
  * them in the reverse order, last to first. */
 p = path + path_len;
 for (t = d; t; t = t->bus->parent_dev) {
 p -= slot_len;
-snprintf(p, slot_len, ":%02x.%x", PCI_SLOT(t->devfn),
PCI_FUNC(d->devfn));
+snprintf(p, slot_len + 1, ":%02x.%x", PCI_SLOT(t->devfn),
PCI_FUNC(d->devfn));
 }

 return path;
-- 
1.7.3.1.msysgit.0
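The underlying gotcha: snprintf()'s size argument counts the terminating NUL, so passing the exact number of visible characters truncates the last one. A small demonstration (the format string mirrors the patch; the helper name is invented):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a PCI-style "dddd:00" domain prefix (7 visible characters)
 * into buf, honouring the snprintf size limit. */
static char *fmt_domain(char *buf, size_t size, unsigned domain)
{
    snprintf(buf, size, "%04x:00", domain);
    return buf;
}
```

With size == 7 (the visible length), snprintf writes only 6 characters plus the NUL, dropping the final '0' -- hence the `domain_len + 1` / `slot_len + 1` in the patch.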



Re: [Qemu-devel] Re: [PATCH] savevm: fix corruption in vmstate_subsection_load().

2011-01-20 Thread Yoshiaki Tamura
2011/1/20 Paolo Bonzini :
> On 01/20/2011 09:57 AM, Yoshiaki Tamura wrote:
>>
>> 2011/1/20 Paolo Bonzini:
>>>
>>> On 12/14/2010 10:07 AM, Yoshiaki Tamura wrote:

 Although it's rare to happen in live migration, when the head of a
 byte stream contains 0x05
>>>
>>> IIUC, this happens if you have VMS_STRUCT and the field after the
>>> VMS_STRUCT
>>> starts with 0x5.
>>>
>>> I think you should also add this in vmstate_subsection_load:
>>>
>>>    sub_vmsd = vmstate_get_subsection(sub, idstr);
>>>    if (sub_vmsd == NULL) {
>>>        return -ENOENT;
>>>    }
>>> +   assert (!sub_vmsd->subsections);
>>>    ret = vmstate_load_state(f, sub_vmsd, opaque, version_id);
>>>
>>> and this in vmstate_load_state:
>>>
>>>    if (field->flags&  VMS_STRUCT) {
>>> +       assert (!vmsd->subsections);
>>>        ret = vmstate_load_state(f, field->vmsd, addr,
>>>                                 field->vmsd->version_id);
>>>    }
>>
>> Hi Paolo,
>>
>> You mean, having subsection nested and under VMS_STRUCT are
>> violations?
>
> I believe so, because the protocol doesn't allow you to distinguish:
>
> - in the case of nested subsections, whether 2 consecutive subsections are
> siblings, or the second is nested into the first.  In fact, your patch also
> fixes a latent bug in case a device supports more than one subsection, and
> both are present in the data stream.  When vmstate_subsection_load is called
> for the first subsection, it would see a 0x5 byte (the beginning of the
> second subsection), eat it and then fail with ENOENT.  The second subsection
> would then fail to load.
>
> - in the case of VMS_STRUCT, whether a 0x5 byte after the VMS_STRUCT is a
> subsection or part of the parent data stream.  This is, I believe, the
> source of your bug.

Thank you for the explanation.  It's very helpful because I
didn't know the background of subsections.  Kemari is effectively
a stress test for live migration.

> I don't think it is possible to fix these problems in the file format while
> preserving backwards compatibility with pre-subsection QEMU (which was a
> fundamental requirement of subsections).  So, I think your patch is correct
> and fixes the practical bugs.  However, we can be even stronger and assert
> that the problematic vmstate descriptions are not used.
>
> Even better, asserts matching the ones above could be added to
> vmstate_subsection_save and vmstate_save_state, as well.

I see.  Let me fold the assertions you pointed out into the original
patch for now.  Because I'm not an expert in subsections, I would
like to leave further improvements to others.

Yoshi

>
> Paolo
>
>



Re: [Qemu-devel] [PATCH 2/3] block: tell drivers about an image resize

2011-01-20 Thread Christoph Hellwig
On Thu, Jan 20, 2011 at 09:14:17AM +, Stefan Hajnoczi wrote:
> Cool, this is much nicer than stashing away state like
> bs->media_changed = 1.  I just took a peek to see if we can remove
> that field completely but IDE seems to use it internally.

I think we could get rid of it in generic code, but let's keep this
series simple.




Re: [Qemu-devel] [PATCH 1/3] block: add resize monitor command

2011-01-20 Thread Christoph Hellwig
On Thu, Jan 20, 2011 at 08:56:07AM +, Stefan Hajnoczi wrote:
> eject, change, block_passwd, and others call the device name argument
> "device" instead of "id".  In the interest of a consistent external API
> it would be nice to use "device" for the block_resize command too.

Sure, I can do that with the next respin.




[Qemu-devel] Re: [PATCH 0/3 v2] allow online resizing of block devices

2011-01-20 Thread Christoph Hellwig
On Wed, Jan 19, 2011 at 06:26:49PM +0100, Juan Quintela wrote:
> Does this handle the change of the rtc data?  I can't see how/where.
> (Not that I know if it is important at all until reboot).

It doesn't.

> cmos_init_hb() uses the size to guess the right geometry of the drives,
> if we change the size of the drive, should we change that one?

I don't think changing geometry on a live system is a good idea.  But
as mentioned, at least for Linux guests there's currently no way to
find out about the new size of an IDE disk, although I might send a
patch to allow a manual rescan, similar to SCSI.




[Qemu-devel] Re: [PATCH] pci: fix device paths

2011-01-20 Thread Christoph Hellwig
On Wed, Jan 19, 2011 at 09:24:10PM +0200, Michael S. Tsirkin wrote:
> Patch a6a7005d14b3c32d4864a718fb1cb19c789f58a5 generated
> broken device paths. We snprintf with a length shorter
> than the output, so the last character is discarded and replaced
> by the null byte. Fix it up by snprintf to a buffer
> which is larger by 1 byte and then memcpy the data (without
> the null byte) to where we need it.

This fixed the boot for me.




Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Jan Kiszka
On 2011-01-19 20:32, Blue Swirl wrote:
> On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
>  wrote:
>> On 01/19/2011 07:15 AM, Markus Armbruster wrote:
>>>
>>> So they interact with KVM (need kvm_state), and they interact with the
>>> emulated PCI bus.  Could you elaborate on the fundamental difference
>>> between the two interactions that makes you choose the (hypothetical)
>>> KVM bus over the PCI bus as device parent?
>>>
>>
>> It's almost arbitrary, but I would say it's the direction that I/Os flow.
>>
>> But if the underlying observation is that the device tree is not really a
>> tree, you're 100% correct.  This is part of why a factory interface that
>> just takes a parent bus is too simplistic.
>>
>> I think we ought to introduce a -pci-device option that is specifically for
>> creating PCI devices that doesn't require a parent bus argument but provides
>> a way to specify stable addressing (for instance, using a linear index).
> 
> I think kvm_state should not be a property of any device or bus. It
> should be split to more logical pieces.
> 
> Some parts of it could remain in CPUState, because they are associated
> with a VCPU.
> 
> Also, for example irqfd could be considered to be similar object to
> char or block devices provided by QEMU to devices. Would it make sense
> to introduce new host types for passing parts of kvm_state to devices?
> 
> I'd also make coalesced MMIO stuff part of memory object. We are not
> passing any state references when using cpu_physical_memory_rw(), but
> that could be changed.

There are currently no VCPU-specific bits remaining in kvm_state. It may
be a good idea to introduce an arch-specific kvm_state and move related
bits over. It may also eventually be feasible to carve out memory management
related fields if we have proper abstractions for that, but I'm not
completely sure here.

Anyway, all these things are secondary. The primary topic here is how to
deal with kvm_state and its fields that have VM-global scope.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux



Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Kevin Wolf
On 20.01.2011 06:19, Yoshiaki Tamura wrote:
> +return;
> +}
> +
> +bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
> +blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
> +blk_req->reqs[0].opaque);

 Same here.

> +bdrv_flush(bs);

 This looks really strange. What is this supposed to do?

 One point is that you write it immediately after bdrv_aio_write, so you
 get an fsync for which you don't know if it includes the current write
 request or if it doesn't. Which data do you want to get flushed to the 
 disk?
>>>
>>> I was expecting to flush the aio request that was just initiated.
>>> Am I misunderstanding the function?
>>
>> Seems so. The function names don't use really clear terminology either,
>> so you're not the first one to fall in this trap. Basically we have:
>>
>> * qemu_aio_flush() waits for all AIO requests to complete. I think you
>> wanted to have exactly this, but only for a single block device. Such a
>> function doesn't exist yet.
>>
>> * bdrv_flush() makes sure that all successfully completed requests are
>> written to disk (by calling fsync)
>>
>> * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
>> the fsync in the thread pool
> 
> Then what I wanted to do is, call qemu_aio_flush first, then
> bdrv_flush.  It should be like live migration.

Okay, that makes sense. :-)

 The other thing is that you introduce a bdrv_flush for each request,
 basically forcing everyone to something very similar to writethrough
 mode. I'm sure this will have a big impact on performance.
>>>
>>> The reason is to avoid inversion of queued requests.  Although
>>> processing one-by-one is heavy, wouldn't having requests flushed
>>> to disk out of order break the disk image?
>>
>> No, that's fine. If a guest issues two requests at the same time, they
>> may complete in any order. You just need to make sure that you don't
>> call the completion callback before the request really has completed.
> 
> We need to flush requests, meaning aio and fsync, before sending
> the final state of the guests, to make sure we can switch to the
> secondary safely.

In theory I think you could just re-submit the requests on the secondary
if they had not completed yet.

But you're right, let's keep things simple for the start.

>> I'm just starting to wonder if the guest won't timeout the requests if
>> they are queued for too long. Even more, with IDE, it can only handle
>> one request at a time, so not completing requests doesn't sound like a
>> good idea at all. In what intervals is the event-tap queue flushed?
> 
> The requests are flushed once each transaction completes.  So
> it's not with specific intervals.

Right. So when is a transaction completed? This is the time that a
single request will take.

>> On the other hand, if you complete before actually writing out, you
>> don't get timeouts, but you signal success to the guest when the request
>> could still fail. What would you do in this case? With a writeback cache
>> mode we're fine, we can just fail the next flush (until then nothing is
>> guaranteed to be on disk and order doesn't matter either), but with
>> cache=writethrough we're in serious trouble.
>>
>> Have you thought about this problem? Maybe we end up having to flush the
>> event-tap queue for each single write in writethrough mode.
> 
> Yes, and that's what I'm trying to do at this point.  

Oh, I must have missed that code. Which patch/function should I look at?

> I know that
> performance matters a lot, but sacrificing reliability over
> performance now isn't a good idea.  I first want to lay the
> ground, and then focus on optimization.  Note that without dirty
> bitmap optimization, Kemari suffers a lot in sending rams.
> Anthony and I discussed to take this approach at KVM Forum.

I agree, starting simple makes sense.

Kevin



Re: [Qemu-devel] [PATCH 2/3] block: tell drivers about an image resize

2011-01-20 Thread Stefan Hajnoczi
On Wed, Jan 19, 2011 at 5:02 PM, Christoph Hellwig  wrote:
> Extend the change_cb callback with a reason argument, and use it
> to tell drivers about size changes.
>
> Signed-off-by: Christoph Hellwig 
>
> Index: qemu/block.c
> ===
> --- qemu.orig/block.c   2011-01-18 20:54:45.246021572 +0100
> +++ qemu/block.c        2011-01-18 20:56:54.117255612 +0100
> @@ -645,7 +645,7 @@ int bdrv_open(BlockDriverState *bs, cons
>         /* call the change callback */
>         bs->media_changed = 1;
>         if (bs->change_cb)
> -            bs->change_cb(bs->change_opaque);
> +            bs->change_cb(bs->change_opaque, CHANGE_MEDIA);

Cool, this is much nicer than stashing away state like
bs->media_changed = 1.  I just took a peek to see if we can remove
that field completely but IDE seems to use it internally.

Stefan



[Qemu-devel] Re: [PATCH 0/3] pci: disable intx on flr/bus reset

2011-01-20 Thread Michael S. Tsirkin
On Thu, Jan 20, 2011 at 04:21:37PM +0900, Isaku Yamahata wrote:
> So far pci_device_reset() is used for system reset.
> In that case, interrupt controller is also reset so that
> all irq is are deasserted.
> But now pci bus reset/flr is supported, and in that case irq needs to be
> disabled explicitly.

Simply->simplify
Otherwise looks good. Applied, thanks!

> Isaku Yamahata (3):
>   pci: deassert intx on reset.
>   msi: simply write config a bit.
>   msix: simply write config
> 
>  hw/msi.c  |5 +
>  hw/msix.c |5 +
>  hw/pci.c  |9 +
>  hw/pci.h  |2 ++
>  4 files changed, 13 insertions(+), 8 deletions(-)



Re: [Qemu-devel] Re: [PATCH] savevm: fix corruption in vmstate_subsection_load().

2011-01-20 Thread Paolo Bonzini

On 01/20/2011 09:57 AM, Yoshiaki Tamura wrote:

2011/1/20 Paolo Bonzini:

On 12/14/2010 10:07 AM, Yoshiaki Tamura wrote:


Although it's rare to happen in live migration, when the head of a
byte stream contains 0x05


IIUC, this happens if you have VMS_STRUCT and the field after the VMS_STRUCT
starts with 0x5.

I think you should also add this in vmstate_subsection_load:

sub_vmsd = vmstate_get_subsection(sub, idstr);
if (sub_vmsd == NULL) {
return -ENOENT;
}
+   assert (!sub_vmsd->subsections);
ret = vmstate_load_state(f, sub_vmsd, opaque, version_id);

and this in vmstate_load_state:

if (field->flags&  VMS_STRUCT) {
+   assert (!vmsd->subsections);
ret = vmstate_load_state(f, field->vmsd, addr,
 field->vmsd->version_id);
}


Hi Paolo,

You mean that having subsections nested, or under VMS_STRUCT, is a
violation?


I believe so, because the protocol doesn't allow you to distinguish:

- in the case of nested subsections, whether 2 consecutive subsections 
are siblings, or the second is nested into the first.  In fact, your 
patch also fixes a latent bug in case a device supports more than one 
subsection, and both are present in the data stream.  When 
vmstate_subsection_load is called for the first subsection, it would see 
a 0x5 byte (the beginning of the second subsection), eat it and then 
fail with ENOENT.  The second subsection would then fail to load.


- in the case of VMS_STRUCT, whether a 0x5 byte after the VMS_STRUCT is 
a subsection or part of the parent data stream.  This is, I believe, the 
source of your bug.


I don't think it is possible to fix these problems in the file format 
while preserving backwards compatibility with pre-subsection QEMU (which 
was a fundamental requirement of subsections).  So, I think your patch 
is correct and fixes the practical bugs.  However, we can be even 
stronger and assert that the problematic vmstate descriptions are not used.


Even better, asserts matching the ones above could be added to 
vmstate_subsection_save and vmstate_save_state, as well.


Paolo



Re: [Qemu-devel] [V3 PATCH 7/8] virtio-9p: Move file post creation changes to none security model

2011-01-20 Thread Stefan Hajnoczi
On Tue, Jan 18, 2011 at 01:54:16PM +0530, M. Mohan Kumar wrote:
> After creating a file object, its permission and ownership details are updated
> as per the client's request for both the passthrough and none security models. But with
> a chrooted environment it's not required for the passthrough security model. Move all
> post-file-creation changes to the none security model
> 
> Signed-off-by: M. Mohan Kumar 
> ---
>  hw/9pfs/virtio-9p-local.c |   19 ++-
>  1 files changed, 6 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
> index 08fd67f..d2e32e2 100644
> --- a/hw/9pfs/virtio-9p-local.c
> +++ b/hw/9pfs/virtio-9p-local.c
> @@ -208,21 +208,14 @@ static int local_set_xattr(const char *path, FsCred 
> *credp)
>  return 0;
>  }
>  
> -static int local_post_create_passthrough(FsContext *fs_ctx, const char *path,
> +static int local_post_create_none(FsContext *fs_ctx, const char *path,
>  FsCred *credp)
>  {
> +int retval;
>  if (chmod(rpath(fs_ctx, path), credp->fc_mode & 07777) < 0) {
>  return -1;
>  }
> -if (lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid) < 0) {
> -/*
> - * If we fail to change ownership and if we are
> - * using security model none. Ignore the error
> - */
> -if (fs_ctx->fs_sm != SM_NONE) {
> -return -1;
> -}
> -}
> +retval = lchown(rpath(fs_ctx, path), credp->fc_uid, credp->fc_gid);
>  return 0;
>  }

retval is unused.

Can multiple virtio-9p requests execute at a time?  chmod() and lchown()
after creation is a race condition if other requests can execute
concurrently.

Stefan



Re: [Qemu-devel] Re: [PATCH] savevm: fix corruption in vmstate_subsection_load().

2011-01-20 Thread Yoshiaki Tamura
2011/1/20 Paolo Bonzini :
> On 12/14/2010 10:07 AM, Yoshiaki Tamura wrote:
>>
>> Although it's rare to happen in live migration, when the head of a
>> byte stream contains 0x05
>
> IIUC, this happens if you have VMS_STRUCT and the field after the VMS_STRUCT
> starts with 0x5.
>
> I think you should also add this in vmstate_subsection_load:
>
>    sub_vmsd = vmstate_get_subsection(sub, idstr);
>    if (sub_vmsd == NULL) {
>        return -ENOENT;
>    }
> +   assert (!sub_vmsd->subsections);
>    ret = vmstate_load_state(f, sub_vmsd, opaque, version_id);
>
> and this in vmstate_load_state:
>
>    if (field->flags & VMS_STRUCT) {
> +       assert (!vmsd->subsections);
>        ret = vmstate_load_state(f, field->vmsd, addr,
>                                 field->vmsd->version_id);
>    }

Hi Paolo,

You mean that having subsections nested, or under VMS_STRUCT, is a
violation?

Yoshi

>
> Paolo
>
>



Re: [Qemu-devel] [PATCH 1/3] block: add resize monitor command

2011-01-20 Thread Stefan Hajnoczi
On Wed, Jan 19, 2011 at 06:02:48PM +0100, Christoph Hellwig wrote:
> Index: qemu/hmp-commands.hx
> ===
> --- qemu.orig/hmp-commands.hx 2011-01-19 17:47:10.444004409 +0100
> +++ qemu/hmp-commands.hx  2011-01-19 17:49:51.673254095 +0100
> @@ -53,6 +53,25 @@ Quit the emulator.
>  ETEXI
>  
>  {
> +.name   = "resize",
> +.args_type  = "id:s,size:o",
> +.params = "device size",
> +.help   = "resize a block image",
> +.user_print = monitor_user_noop,
> +.mhandler.cmd_new = do_resize,
> +},
> +
> +STEXI
> +@item resize
> +@findex resize
> +Resize a block image while a guest is running.  Usually requires guest
> +action to see the updated size.  Resizing to a smaller size is supported,
> +but should be used with extreme caution.  Note that this command only
> +resizes image files; it cannot resize block devices like LVM volumes.
> +ETEXI
> +
> +
> +{
>  .name   = "eject",
>  .args_type  = "force:-f,device:B",
>  .params = "[-f] device",
[...]
> Index: qemu/qmp-commands.hx
> ===
> --- qemu.orig/qmp-commands.hx 2011-01-19 17:47:10.478012371 +0100
> +++ qemu/qmp-commands.hx  2011-01-19 17:50:07.406016841 +0100
> @@ -601,6 +601,34 @@ Example:
>  -> { "execute": "netdev_del", "arguments": { "id": "netdev1" } }
>  <- { "return": {} }
>  
> +
> +EQMP
> +
> +{
> +.name   = "block_resize",
> +.args_type  = "id:s,size:o",
> +.params = "id size",
> +.help   = "resize a block image",
> +.user_print = monitor_user_noop,
> +.mhandler.cmd_new = do_resize,
> +},
> +
> +SQMP
> +block_resize
> +
> +
> +Resize a block image while a guest is running.
> +
> +Arguments:
> +
> +- "id": the device's ID, must be unique (json-string)
> +- "size": new size
> +
> +Example:
> +
> +-> { "execute": "block_resize", "arguments": { "id": "scratch", "size": 1073741824 } }
> +<- { "return": {} }
> +
>  EQMP

eject, change, block_passwd, and others call the device name argument
"device" instead of "id".  In the interest of a consistent external API
it would be nice to use "device" for the block_resize command too.

Stefan


