Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010/12/2 Michael S. Tsirkin : > On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote: >> 2010/11/28 Michael S. Tsirkin : >> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote: >> >> 2010/11/28 Michael S. Tsirkin : >> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote: >> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert >> >> >> last_avail_idx with inuse if there are outstanding emulation. >> >> >> >> >> >> Signed-off-by: Yoshiaki Tamura >> >> > >> >> > This changes migration format, so it will break compatibility with >> >> > existing drivers. More generally, I think migrating internal >> >> > state that is not guest visible is always a mistake >> >> > as it ties migration format to an internal implementation >> >> > (yes, I know we do this sometimes, but we should at least >> >> > try not to add such cases). I think the right thing to do in this case >> >> > is to flush outstanding >> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0. >> >> > I sent patches that do this for virtio net and block. >> >> >> >> Could you give me the link of your patches? I'd like to test >> >> whether they work with Kemari upon failover. If they do, I'm >> >> happy to drop this patch. >> >> >> >> Yoshi >> > >> > Look for this: >> > stable migration image on a stopped vm >> > sent on: >> > Wed, 24 Nov 2010 17:52:49 +0200 >> >> Thanks for the info. >> >> However, The patch series above didn't solve the issue. In >> case of Kemari, inuse is mostly > 0 because it queues the >> output, and while last_avail_idx gets incremented >> immediately, not sending inuse makes the state inconsistent >> between Primary and Secondary. > > Hmm. Can we simply avoid incrementing last_avail_idx? I think we can calculate or prepare an internal last_avail_idx, and update the external when inuse is decremented. I'll try whether it work w/ w/o Kemari. > >> I'm wondering why >> last_avail_idx is OK to send but not inuse. 
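Tamura's idea of preparing an internal last_avail_idx and only updating the external (guest-visible, migrated) one as inuse is decremented can be sketched as a toy model — this is not the actual QEMU code, and the field names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of last_avail_idx: an internal cursor advanced
 * when a request is popped from the avail ring, and an external index
 * advanced only when the request completes. Migration would send the
 * external index, so in-flight requests are re-processed on the
 * destination instead of being lost. Assumes in-order completion. */
struct vq_state {
    uint16_t internal_avail_idx; /* advanced on pop */
    uint16_t external_avail_idx; /* advanced on completion */
    uint16_t inuse;              /* popped but not yet completed */
};

static void vq_pop(struct vq_state *vq)
{
    vq->internal_avail_idx++;
    vq->inuse++;
}

static void vq_complete(struct vq_state *vq)
{
    vq->inuse--;
    vq->external_avail_idx++;
}

/* What save would emit: always the external index. */
static uint16_t vq_save_idx(const struct vq_state *vq)
{
    return vq->external_avail_idx;
}
```

With this split, save never needs to know about inuse at all; the external index is consistent by construction.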
> > last_avail_idx is at some level a mistake, it exposes part of > our internal implementation, but it does *also* express > a guest observable state. > > Here's the problem that it solves: just looking at the rings in virtio > there is no way to detect that a specific request has already been > completed. And the protocol forbids completing the same request twice. > > Our implementation always starts processing the requests > in order, and since we flush outstanding requests > before save, it works to just tell the remote 'process only requests > after this place'. > > But there's no such requirement in the virtio protocol, > so to be really generic we could add a bitmask of valid avail > ring entries that did not complete yet. This would be > the exact representation of the guest observable state. > In practice we have rings of up to 512 entries. > That's 64 byte per ring, not a lot at all. > > However, if we ever do change the protocol to send the bitmask, > we would need some code to resubmit requests > out of order, so it's not trivial. > > Another minor mistake with last_avail_idx is that it has > some redundancy: the high bits in the index > (> vq size) are not necessary as they can be > got from avail idx. There's a consistency check > in load but we really should try to use formats > that are always consistent. > >> The following patch does the same thing as original, yet >> keeps the format of the virtio. It shouldn't break live >> migration either because inuse should be 0. >> >> Yoshi > > Question is, can you flush to make inuse 0 in kemari too? > And if not, how do you handle the fact that some requests > are in flight on the primary? Although we try flushing requests one by one making inuse 0, there are cases when it failovers to the secondary when inuse isn't 0. We handle these in flight request on the primary by replaying on the secondary. 
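The bitmask Tsirkin describes — an exact representation of which avail ring entries are outstanding, 64 bytes for a 512-entry ring — could look roughly like this (a sketch only; real code would wire it into virtio_save/virtio_load and would need the out-of-order resubmission logic he mentions):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VQ_MAX_SIZE  512                 /* rings of up to 512 entries */
#define BITMAP_BYTES (VQ_MAX_SIZE / 8)   /* 64 bytes per ring */

/* One bit per avail ring slot: set when the request is popped,
 * cleared when it completes. Migrating this bitmap tells the remote
 * exactly which requests must be resubmitted, with no ordering
 * assumption at all. */
struct inflight_map {
    uint8_t bits[BITMAP_BYTES];
};

static void inflight_set(struct inflight_map *m, uint16_t slot)
{
    m->bits[slot / 8] |= 1u << (slot % 8);
}

static void inflight_clear(struct inflight_map *m, uint16_t slot)
{
    m->bits[slot / 8] &= ~(1u << (slot % 8));
}

static int inflight_test(const struct inflight_map *m, uint16_t slot)
{
    return (m->bits[slot / 8] >> (slot % 8)) & 1;
}
```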
> >> diff --git a/hw/virtio.c b/hw/virtio.c >> index c8a0fc6..875c7ca 100644 >> --- a/hw/virtio.c >> +++ b/hw/virtio.c >> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f) >> qemu_put_be32(f, i); >> >> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) { >> + uint16_t last_avail_idx; >> + >> if (vdev->vq[i].vring.num == 0) >> break; >> >> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse; >> + >> qemu_put_be32(f, vdev->vq[i].vring.num); >> qemu_put_be64(f, vdev->vq[i].pa); >> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx); >> + qemu_put_be16s(f, &last_avail_idx); >> if (vdev->binding->save_queue) >> vdev->binding->save_queue(vdev->binding_opaque, i, f); >> } >> >> > > This looks wrong to me. Requests can complete in any order, can they > not? So if request 0 did not complete and request 1 did not, > you send avail - inuse and on the secondary you will process and > complete request 1 the second time, crashing the guest. In case of Kemari, no.
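Tsirkin's objection can be made concrete with a toy model: rewinding by inuse assumes the *oldest* inuse requests are the unfinished ones. If request 0 completed but request 1 did not (inuse == 1), the rewound index 1 is correct; if instead request 1 completed and request 0 did not, the rewind still yields 1, and the destination would complete request 1 a second time. The uint16_t subtraction itself is wrap-safe, as the ring index is:

```c
#include <assert.h>
#include <stdint.h>

/* Rewind the saved avail index by the number of in-flight requests,
 * as the patch above does. Correct only under in-order completion
 * (which holds for Kemari's use). uint16_t arithmetic wraps
 * naturally, matching the ring index semantics. */
static uint16_t rewind_idx(uint16_t last_avail_idx, uint16_t inuse)
{
    return (uint16_t)(last_avail_idx - inuse);
}
```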
Re: [PATCHv6 00/16] boot order specification
On Thu, Dec 02, 2010 at 09:01:25PM -0500, Kevin O'Connor wrote: > On Thu, Dec 02, 2010 at 02:30:42PM +0200, Gleb Natapov wrote: > > On Wed, Dec 01, 2010 at 09:25:40PM -0500, Kevin O'Connor wrote: > > > You're thinking in terms of which device to boot, which does make this > > > difficult. However, it's equally valid to think in terms of which > > > boot method to invoke, which makes this easy. > > It is not. Because boot methods are enumerated by a guest at runtime. > > Qemu knows absolutely nothing about them. I am thinking in terms of > > devices because this is the only thing I have in qemu. > > As before, a safe heuristic would be to request a rom boot for any > device with a pci rom that the user requests to boot from. > > Note, your original proposal included a boot method: > /r...@genroms/linuxboot.bin > I'm asking to extend this to be able to include roms on PCI devices. > I already posted here code that maps PCI rom to PCI device path. Why do you still insist on artificial path like /pci/r...@4? > > > We could tell the coreboot user to edit the "bootorder" file and add > > > "/p...@i0cf8/r...@4" (second rom on 4th pci device - the exact syntax > > > of the name is not important). > > > > > But how user should knows that second rom (I think you mean "second BCV") > > on pci device 4.0 will boot from the new scsi cdrom that he just connected? > > How can he tell that it should put second BCV there and not third or fifth > > without running Seabios first and looking at F12 menu? > > Exactly - look at the F12 menu. (Or, for bonus points, one could > write a program that scans roms on the booted coreboot system, > presents the user with a menu, and then writes the requested boot > method to "bootorder".) > If you propose that then you probably misunderstand the problem I am trying to tackle. Qemu should be able to specify boot order for newly created machine with arbitrary HW configuration without making user to look at F12 menu. 
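For illustration, a boot-method path of the kind discussed (a PCI device plus an option-rom/BCV instance) could be formatted like this — the thread itself stresses that the exact syntax is not important, so this spelling is purely an example:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build an OpenFirmware-style boot method path, e.g.
 * "/pci@i0cf8/pci@4/rom@2" for the second BCV on PCI device 4.
 * Illustrative syntax only, not a spec. */
static int rom_boot_path(char *buf, size_t len,
                         unsigned dev, unsigned rom_instance)
{
    return snprintf(buf, len, "/pci@i0cf8/pci@%x/rom@%u",
                    dev, rom_instance);
}
```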
Actually at this point there is not human user at all. All the process from machine creation till OS installation is completely unattended. Looking at the BBS spec in several places they say that if NV ram is corrupted bios should use some predefined default boot order. This "predefined default boot order" is what qemu tries to pass to Seabios to use correct boot device without user's input. To make it clear "unattended" part is absolute requirement. > Being able to specify which boot method is a requirement for me. It > doesn't have to be pretty, but it does have to be possible. For you problem solution is very easy and described in BBS spec. You just need two numbers IPL to boot from and which BCV is first. No need for device paths at all. In fact Seabios already tracks IPL order this way the only thing missing is to track BCV order. > > > > >BTW to create proper EDD entry for SCSI boot device BIOS also > > > > needs too map BCV to id:lun. How it can be done? > > > > > > It's the responsibility of the rom to build the EDD info. I don't > > > know if all roms do this - I don't believe it's possible to get at the > > > EDD info until after the drive has been mapped (ie, too late to use it > > > for boot ordering). > > How can we get to EDD info after device is mapped? > > Raise int 0x13 ah=0x48 - once the drive is mapped it will hook that > the 0x13 irq and handle the request (or jump to the bios handler for > drives it doesn't know about). > > >Can we use "disconnect vector" > > to connect device temporarily get EDD and then disconnect? > > No. > And is it possible to call BCV and then restore int13 vector? So to find out what device each BCV belongs too we do: foreach(bcv) call_bcv(bcv) int 1348 restor(int13) After that we call call_bcv(bcv) but in correct order. > > > I understand. However, we'll still need to support arbitrary rom > > > based BEVs and BCVs, so the use case is still important. > > > > > We can't do something that is impossible. 
> > You've used this word "impossible" a few times - I'm afraid I don't > know what it means. It means that I do not have a solution and you do not propose one either :) It seems we were trying to solve differed problems though. I hope now it is much more clear what I am trying to achieve with proposed bootorder patches. > > >For coreboot Seabios should > > implement what BBS spec says i.e enumerate all BCVs, present boot menu > > to the user, record number of BCVs and user's choice on non-volatile > > storage (CMOS). > > Bleh - there's no need for that. Much more reliable is to record the > device path for builtin devices or the boot method (device path of > rom, plus bev/bcv instance) for rom based boots. > Why is this much more reliable? Suppose user has scsi card with 3 devices id3 id5 id6. Three BCVs was created for them and user chose to boot from BCV3(id6). He added /pci/s...@4/b...@3 to his coreboot's bootorder file. Now he adds one more device to this scsi bus with id1. On the ne
Re: [RFC PATCH 2/3] sched: add yield_to function
On Thu, 2010-12-02 at 14:44 -0500, Rik van Riel wrote: > +#ifdef CONFIG_SCHED_HRTICK > +/* > + * Yield the CPU, giving the remainder of our time slice to task p. > + * Typically used to hand CPU time to another thread inside the same > + * process, eg. when p holds a resource other threads are waiting for. > + * Giving priority to p may help get that resource released sooner. > + */ > +void yield_to(struct task_struct *p) > +{ > + unsigned long flags; > + struct sched_entity *se = &p->se; > + struct rq *rq; > + struct cfs_rq *cfs_rq; > + u64 remain = slice_remain(current); That "slice remaining" only shows the distance to last preempt, however brief. It shows nothing wrt tree position, the yielding task may well already be right of the task it wants to yield to, having been a buddy. > cfs_rq = cfs_rq_of(se); > + se->vruntime -= remain; > + if (se->vruntime < cfs_rq->min_vruntime) > + se->vruntime = cfs_rq->min_vruntime; (This is usually done using max_vruntime()) If the recipient was already left of the fair stick (min_vruntime), clipping advances it's vruntime, vaporizing entitlement from both donor and recipient. What if a task tries to yield to another not on the same cpu, and/or in the same task group? This would munge min_vruntime of other queues. I think you'd have to restrict this to same cpu, same group. If tasks can donate cross cfs_rq, (say) pinned task A cpu A running solo could donate vruntime to selected tasks pinned to cpu B, for as long as minuscule preemptions can resupply ammo. Would suck to not be the favored child. Maybe you could exchange vruntimes cooperatively (iff same cfs_rq) between threads, but I don't think donations with clipping works. -Mike -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
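Mike's clipping remark — "usually done using max_vruntime()" — in standalone form. The donated vruntime is clamped so the recipient can never move left of min_vruntime, which is exactly how part of the donation can vaporize:

```c
#include <assert.h>

typedef unsigned long long u64;
typedef long long s64;

/* As in kernel/sched_fair.c: pick the later of two vruntimes, using
 * a signed difference so wrapped u64 values still order correctly. */
static u64 max_vruntime(u64 max, u64 vruntime)
{
    if ((s64)(vruntime - max) > 0)
        return vruntime;
    return max;
}

/* Donate 'remain' to a recipient the way the patch under review
 * does, clipping at min_vruntime. Returns the recipient's new
 * vruntime; anything clipped is entitlement lost by both donor
 * and recipient, which is Mike's objection. */
static u64 donate_clipped(u64 vruntime, u64 remain, u64 min_vruntime)
{
    return max_vruntime(min_vruntime, vruntime - remain);
}
```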
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Anthony Liguori (anth...@codemonkey.ws) wrote: > On 12/02/2010 08:42 PM, Chris Wright wrote: > >OK, let's say a single PCPU == 12 Compute Units. > > > >If the guest is the first to migrate to a newly added unused host, and > >we are using either non-trapping hlt or Marcelo's non-yielding trapping > >hlt, then that guest is going to get more CPU than it expected unless > >there is some throttling mechanism. Specifically, it will get 12CU > >instead of 1-3CU. > > > >Do you agree with that? > > Yes. > > There's definitely a use-case to have a hard cap. OK, good, just wanted to be clear. Because this started as a discussion of hard caps, and it began to sound as if you were no longer advocating for them. > But I think another common use-case is really just performance > isolation. If over the course of a day, you go from 12CU, to 6CU, > to 4CU, that might not be that bad of a thing. I guess it depends on your SLA. We don't have to do anything to give varying CU based on host load. That's the one thing CFS will do for us quite well ;) > If the environment is designed correctly, of N nodes, N-1 will > always be at capacity so it's really just a single node hat is under > utilized. Many clouds do a variation on Small, Medium, Large sizing. So depending on the scheduler (best fit, rr...) even the notion of at capacity may change from node to node and during the time of day. thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 08:42 PM, Chris Wright wrote: OK, let's say a single PCPU == 12 Compute Units. If the guest is the first to migrate to a newly added unused host, and we are using either non-trapping hlt or Marcelo's non-yielding trapping hlt, then that guest is going to get more CPU than it expected unless there is some throttling mechanism. Specifically, it will get 12CU instead of 1-3CU. Do you agree with that? Yes. There's definitely a use-case to have a hard cap. But I think another common use-case is really just performance isolation. If over the course of a day, you go from 12CU, to 6CU, to 4CU, that might not be that bad of a thing. If the environment is designed correctly, of N nodes, N-1 will always be at capacity so it's really just a single node hat is under utilized. Regards, Anthony Liguori thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mask bit support's API
On Friday 03 December 2010 00:55:03 Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 10:54:24PM +0800, Sheng Yang wrote: > > On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin wrote: > > > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: > > >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > > >> >> Which case? the readl() doesn't need access to the routing table, > > >> >> just the entry. > > >> > > > >> >One thing that read should do is flush in the outstanding > > >> >interrupts and flush out the mask bit writes. > > >> > > >> The mask bit writes are synchronous. > > >> > > >> wrt interrupts, we can deal with assigned devices, and can poll > > >> irqfds. But we can't force vhost-net to issue an interrupt (and I > > >> don't think it's required). > > > > > > To clarify: > > > > > >mask write > > >read > > > > > > it is safe for guest to assume no more interrupts > > > > > > where as with a simple > > >mask write > > > > > > an interrupt might be in flight and get delivered shortly afterwards. > > > > I think it's already contained in the current patchset. > > > > >> >> Oh, I think there is a terminology problem, I was talking about > > >> >> kvm's irq routing table, you were talking about the msix entries. > > >> >> > > >> >> I think treating it as a cache causes more problems, since there > > >> >> are now two paths for reads (in cache and not in cache) and more > > >> >> things for writes to manage. > > >> >> > > >> >> Here's my proposed API: > > >> >> > > >> >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > > >> >> pending_bitmap_base_gpa) > > >> >> > > >> >> - called when the guest enables msix > > >> > > > >> >I would add virtual addresses so that we can use swappable memory to > > >> >store the state. > > >> > > >> Right. > > > > Do we need synchronization between kernel and userspace? Any recommended > > method? 
> > > > >> >If we do, maybe we can just keep the table there and then > > >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? > > >> > > >> Still need to to let userspace know it needs to reprogram the irqfd > > >> or whatever it uses to inject the interrupt. > > > > > > Why do we need to reprogram irqfd? I thought irqfd would map to an > > > entry within the table instead of address/data as now. > > > Could you clarify please? > > > > > >> >> KVM_REMOVE_MSIX_TABLE(table_id) > > >> >> > > >> >>- called when the guest disables msix > > >> >> > > >> >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > > >> >> > > >> >>- called when the guest enables msix (to initialize it), or > > >> >> after live migration > > >> > > > >> >What is entry_id here? > > >> > > >> Entry within the table. > > > > > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is > > > enabled (note: it can not be called at boot anyway since pa > > > depends on BAR assigned by BIOS). > > > > Don't agree. MMIO can be write regardless of if MSIX is enabled. If > > you handle MMIO to kernel, them handle them all. > > Hmm, does this mean we would need ioctls to enable/disable MSIX? I > think passing control from userspace to kernel when MSIX is enabled is > not too bad. Will think about it. Anyway we need ioctls to do so because kernel knows nothing about PCI configuration space... And we already have that for assigned device. > > > I suppose qemu still > > got control of BAR? > > Yes. So e.g. if memory is disabled in the device then we must > disable this as well, and not claim these transactions > as mmio. > > > Then leave it in the current place should be fine. > > current place? assigned_dev_iomem_map(). > > At the moment qemu does not notify devices when > memory is disabled, only when it is enabled. You mean bit 2 of Command register? I think suppose device would only disable memory when shut down should be acceptable. Also we can hook at disable point as well. 
> So handling msix enable/disable is more straight-forward. Don't agree. Then you got duplicate between kernel and userspace. Also the semantic of MSI-X MMIO has no relationship with MSIX enable/disable. > > > >> >> Michael? I think that should work for virtio and vfio assigned > > >> >> devices? Not sure about pending bits. > > >> > > > >> >Pending bits must be tracked in kernel, but I don't see > > >> >how we can support polling mode if we don't exit to userspace > > >> >on pending bit reads. > > >> > > > >> >This does mean that some reads will be fast and some will be > > >> >slow, and it's a bit sad that we seem to be optimizing > > >> >for specific guests, but I just can't come up with > > >> >anything better. > > >> > > >> If the pending bits live in userspace memory, the device model can > > >> update them directly? > > > > > > Note that these are updated on an interrupt, so updating them > > > in userspace would need get_user_page etc trickery, > > > and add the overhead of atomics. > > > > > > Further I th
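The mask/pending interplay being argued about follows the PCI MSI-X rules: an interrupt raised while the entry is masked latches the pending bit instead of being delivered, and clearing the mask with pending set fires it. A minimal model of just that table behavior (the KVM_DEFINE_MSIX_TABLE etc. ioctl names in this thread are only a proposal, so none of them appear here):

```c
#include <assert.h>
#include <stdbool.h>

struct msix_entry_state {
    bool masked;
    bool pending;
    int fired; /* interrupts actually delivered to the guest */
};

/* Device side: raise the vector. Deliver if unmasked, else latch. */
static void msix_trigger(struct msix_entry_state *e)
{
    if (e->masked)
        e->pending = true;
    else
        e->fired++;
}

/* Guest side: write the per-vector mask bit. Unmasking with a
 * pending interrupt flushes it out - this is why a mask write
 * followed by a read must be synchronous with injection. */
static void msix_set_mask(struct msix_entry_state *e, bool masked)
{
    e->masked = masked;
    if (!masked && e->pending) {
        e->pending = false;
        e->fired++;
    }
}
```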
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Anthony Liguori (anth...@codemonkey.ws) wrote: > On 12/02/2010 03:07 PM, Chris Wright wrote: > >Like non-trapping hlt, that too will guarantee that the guest is preempted > >by timeslice exhaustion (and is simpler than non-trapping hlt). So it > >may well be the simplest for the case where we are perfectly committed > >(i.e. the vcpu fractional core count totals the pcpu count). But once > >we are undercommitted we still need some extra logic to handle the hard > >cap and something to kick the running guest off the cpu and suck up the > >extra cycles in a power conserving way. > > I'm not entirely sure TBH. > > If you think of a cloud's per-VCPU capacity in terms of Compute > Units, having a model where a VCPU maps to 1-3 units depending on > total load is potentially interesting particularly if the VCPU's > capacity only changes in discrete amounts, that the expected > capacity is communicated to the guest, and that the capacity only > changes periodically. OK, let's say a single PCPU == 12 Compute Units. If the guest is the first to migrate to a newly added unused host, and we are using either non-trapping hlt or Marcelo's non-yielding trapping hlt, then that guest is going to get more CPU than it expected unless there is some throttling mechanism. Specifically, it will get 12CU instead of 1-3CU. Do you agree with that? thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
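The capacity arithmetic in Chris's example, made explicit (using the thread's assumption of 12 Compute Units per PCPU; a hard cap for an N-CU guest then translates to a CPU-share percentage):

```c
#include <assert.h>

#define CU_PER_PCPU 12

/* Percentage of one PCPU a hard cap must enforce so that a VCPU
 * sold as 'cu' compute units gets exactly that even on an
 * otherwise idle host. Without such a cap, the first guest on a
 * fresh host gets the full 12 CU instead of its 1-3 CU. */
static int hard_cap_percent(int cu)
{
    return cu * 100 / CU_PER_PCPU;
}
```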
Re: [RFC PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
* Rik van Riel (r...@redhat.com) wrote: > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1880,18 +1880,53 @@ void kvm_resched(struct kvm_vcpu *vcpu) > } > EXPORT_SYMBOL_GPL(kvm_resched); > > -void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu) > +void kvm_vcpu_on_spin(struct kvm_vcpu *me) > { > - ktime_t expires; > - DEFINE_WAIT(wait); > + struct kvm *kvm = me->kvm; > + struct kvm_vcpu *vcpu; > + int last_boosted_vcpu = me->kvm->last_boosted_vcpu; s/me->// > + int first_round = 1; > + int i; > > - prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE); > + me->spinning = 1; > + > + /* > + * We boost the priority of a VCPU that is runnable but not > + * currently running, because it got preempted by something > + * else and called schedule in __vcpu_run. Hopefully that > + * VCPU is holding the lock that we need and will release it. > + * We approximate round-robin by starting at the last boosted VCPU. > + */ > + again: > + kvm_for_each_vcpu(i, vcpu, kvm) { > + struct task_struct *task = vcpu->task; > + if (first_round && i < last_boosted_vcpu) { > + i = last_boosted_vcpu; > + continue; > + } else if (!first_round && i > last_boosted_vcpu) > + break; > + if (vcpu == me) > + continue; > + if (vcpu->spinning) > + continue; > + if (!task) > + continue; > + if (waitqueue_active(&vcpu->wq)) > + continue; > + if (task->flags & PF_VCPU) > + continue; I wonder if you set vcpu->task in sched_out and then NULL it in sched_in if you'd get what you want you could simplify the checks. Basically that would be only the preempted runnable vcpus. > + kvm->last_boosted_vcpu = i; > + yield_to(task); Just trying to think of ways to be sure this doesn't become just yield() (although I thnk PF_VCPU is enough to catch that). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
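The two-pass scan in the patch (first round starts just after last_boosted_vcpu, the "again" round wraps back to cover the rest) can be pulled out as a pure selection routine — a simplification in which all the per-vcpu eligibility checks (not me, not spinning, not running, no waiters) are collapsed into one flag array:

```c
#include <assert.h>
#include <stdbool.h>

/* Approximate round-robin: scan (last_boosted, n) first, then wrap
 * to [0, last_boosted], mirroring the first_round logic in
 * kvm_vcpu_on_spin. Returns the chosen vcpu index, or -1 if no
 * vcpu is eligible for boosting. */
static int pick_boost_target(const bool *eligible, int n, int last_boosted)
{
    int i;

    for (i = last_boosted + 1; i < n; i++)
        if (eligible[i])
            return i;
    for (i = 0; i <= last_boosted && i < n; i++)
        if (eligible[i])
            return i;
    return -1;
}
```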
Re: [PATCHv6 00/16] boot order specification
On Thu, Dec 02, 2010 at 02:30:42PM +0200, Gleb Natapov wrote: > On Wed, Dec 01, 2010 at 09:25:40PM -0500, Kevin O'Connor wrote: > > You're thinking in terms of which device to boot, which does make this > > difficult. However, it's equally valid to think in terms of which > > boot method to invoke, which makes this easy. > It is not. Because boot methods are enumerated by a guest at runtime. > Qemu knows absolutely nothing about them. I am thinking in terms of > devices because this is the only thing I have in qemu. As before, a safe heuristic would be to request a rom boot for any device with a pci rom that the user requests to boot from. Note, your original proposal included a boot method: /r...@genroms/linuxboot.bin I'm asking to extend this to be able to include roms on PCI devices. > > We could tell the coreboot user to edit the "bootorder" file and add > > "/p...@i0cf8/r...@4" (second rom on 4th pci device - the exact syntax > > of the name is not important). > > > But how user should knows that second rom (I think you mean "second BCV") > on pci device 4.0 will boot from the new scsi cdrom that he just connected? > How can he tell that it should put second BCV there and not third or fifth > without running Seabios first and looking at F12 menu? Exactly - look at the F12 menu. (Or, for bonus points, one could write a program that scans roms on the booted coreboot system, presents the user with a menu, and then writes the requested boot method to "bootorder".) Being able to specify which boot method is a requirement for me. It doesn't have to be pretty, but it does have to be possible. > > >BTW to create proper EDD entry for SCSI boot device BIOS also > > > needs too map BCV to id:lun. How it can be done? > > > > It's the responsibility of the rom to build the EDD info. I don't > > know if all roms do this - I don't believe it's possible to get at the > > EDD info until after the drive has been mapped (ie, too late to use it > > for boot ordering). 
> How can we get to EDD info after device is mapped? Raise int 0x13 ah=0x48 - once the drive is mapped it will hook that the 0x13 irq and handle the request (or jump to the bios handler for drives it doesn't know about). >Can we use "disconnect vector" > to connect device temporarily get EDD and then disconnect? No. > > I understand. However, we'll still need to support arbitrary rom > > based BEVs and BCVs, so the use case is still important. > > > We can't do something that is impossible. You've used this word "impossible" a few times - I'm afraid I don't know what it means. >For coreboot Seabios should > implement what BBS spec says i.e enumerate all BCVs, present boot menu > to the user, record number of BCVs and user's choice on non-volatile > storage (CMOS). Bleh - there's no need for that. Much more reliable is to record the device path for builtin devices or the boot method (device path of rom, plus bev/bcv instance) for rom based boots. -Kevin -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu
* Rik van Riel (r...@redhat.com) wrote: > Keep track of which task is running a KVM vcpu. This helps us > figure out later what task to wake up if we want to boost a > vcpu that got preempted. > > Unfortunately there are no guarantees that the same task > always keeps the same vcpu, so we can only track the task > across a single "run" of the vcpu. So shouldn't it confine to KVM_RUN? The other vcpu_load callers aren't always a vcpu in a useful runnable state. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/3] sched: add yield_to function
* Rik van Riel (r...@redhat.com) wrote: > Add a yield_to function to the scheduler code, allowing us to > give the remainder of our timeslice to another thread. > > We may want to use this to provide a sys_yield_to system call > one day. > > Signed-off-by: Rik van Riel > Signed-off-by: Marcelo Tosatti > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index c5f926c..4f3cce9 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1985,6 +1985,7 @@ extern void set_user_nice(struct task_struct *p, long > nice); > extern int task_prio(const struct task_struct *p); > extern int task_nice(const struct task_struct *p); > extern int can_nice(const struct task_struct *p, const int nice); > +extern void requeue_task(struct rq *rq, struct task_struct *p); > extern int task_curr(const struct task_struct *p); > extern int idle_cpu(int cpu); > extern int sched_setscheduler(struct task_struct *, int, struct sched_param > *); > @@ -2058,6 +2059,14 @@ extern int wake_up_state(struct task_struct *tsk, > unsigned int state); > extern int wake_up_process(struct task_struct *tsk); > extern void wake_up_new_task(struct task_struct *tsk, > unsigned long clone_flags); > + > +#ifdef CONFIG_SCHED_HRTICK > +extern u64 slice_remain(struct task_struct *); > +extern void yield_to(struct task_struct *); > +#else > +static inline void yield_to(struct task_struct *p) yield() Missing {}'s ? > +#endif > + > #ifdef CONFIG_SMP > extern void kick_process(struct task_struct *tsk); > #else > diff --git a/kernel/sched.c b/kernel/sched.c > index f8e5a25..ef088cd 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -1909,6 +1909,26 @@ static void dequeue_task(struct rq *rq, struct > task_struct *p, int sleep) > p->se.on_rq = 0; > } > > +/** > + * requeue_task - requeue a task which priority got changed by yield_to > + * @rq: the tasks's runqueue > + * @p: the task in question > + * Must be called with the runqueue lock held. 
Will cause the CPU to > + * reschedule if p is now at the head of the runqueue. > + */ > +void requeue_task(struct rq *rq, struct task_struct *p) > +{ > + assert_spin_locked(&rq->lock); > + > + if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p)) > + return; already checked task_running(rq, p) || task_has_rt_policy(p) w/ rq lock held. > + > + dequeue_task(rq, p, 0); > + enqueue_task(rq, p, 0); seems like you could condense to save an update_rq_clock() call at least, don't know if the info_queued, info_dequeued need to be updated > + resched_task(p); > +} > + > /* > * __normal_prio - return the priority that is based on the static prio > */ > @@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, > unsigned int, len, > return ret; > } > > +#ifdef CONFIG_SCHED_HRTICK > +/* > + * Yield the CPU, giving the remainder of our time slice to task p. > + * Typically used to hand CPU time to another thread inside the same > + * process, eg. when p holds a resource other threads are waiting for. > + * Giving priority to p may help get that resource released sooner. > + */ > +void yield_to(struct task_struct *p) > +{ > + unsigned long flags; > + struct sched_entity *se = &p->se; > + struct rq *rq; > + struct cfs_rq *cfs_rq; > + u64 remain = slice_remain(current); > + > + rq = task_rq_lock(p, &flags); > + if (task_running(rq, p) || task_has_rt_policy(p)) > + goto out; > + cfs_rq = cfs_rq_of(se); > + se->vruntime -= remain; > + if (se->vruntime < cfs_rq->min_vruntime) > + se->vruntime = cfs_rq->min_vruntime; Should these details all be in sched_fair? Seems like the wrong layer here. And would that condition go the other way? If new vruntime is smaller than min, then it becomes new cfs_rq->min_vruntime? > + requeue_task(rq, p); > + out: > + task_rq_unlock(rq, &flags); > + yield(); > +} > +EXPORT_SYMBOL(yield_to); > +#endif > + > /** > * sys_sched_yield - yield the current processor to other threads. 
> * > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c > index 5119b08..2a0a595 100644 > --- a/kernel/sched_fair.c > +++ b/kernel/sched_fair.c > @@ -974,6 +974,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity > *curr, int queued) > */ > > #ifdef CONFIG_SCHED_HRTICK > +u64 slice_remain(struct task_struct *p) > +{ > + unsigned long flags; > + struct sched_entity *se = &p->se; > + struct cfs_rq *cfs_rq; > + struct rq *rq; > + u64 slice, ran; > + s64 delta; > + > + rq = task_rq_lock(p, &flags); > + cfs_rq = cfs_rq_of(se); > + slice = sched_slice(cfs_rq, se); > + ran = se->sum_exec_runtime - se->prev_sum_exec_runtime; > + delta = slice - ran; > + task_rq_unlock(rq, &flags); > + > + return max(delta, 0LL); Can delta go negative? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to
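On the closing question — yes, delta can go negative: preemption is not exact, so a task can have run past its computed sched_slice() before slice_remain() is called, which is what the max(delta, 0LL) clamp handles. A standalone version of the calculation (values in nanoseconds, purely illustrative):

```c
#include <assert.h>

typedef unsigned long long u64;
typedef long long s64;

/* Model of slice_remain(): how much of the computed slice the task
 * has not yet consumed since it last started running, clamped at
 * zero for tasks that overran their slice before being preempted. */
static u64 slice_remain_model(u64 slice, u64 sum_exec, u64 prev_sum_exec)
{
    u64 ran = sum_exec - prev_sum_exec;
    s64 delta = (s64)(slice - ran);

    return delta > 0 ? (u64)delta : 0;
}
```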
Re: [PATCH 1/2] vhost test module
On Fri, Dec 03, 2010 at 01:18:18AM +0200, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 03:13:03PM -0800, Paul E. McKenney wrote: > > On Thu, Dec 02, 2010 at 09:47:09PM +0200, Michael S. Tsirkin wrote: > > > On Thu, Dec 02, 2010 at 11:26:16AM -0800, Paul E. McKenney wrote: > > > > On Thu, Dec 02, 2010 at 09:11:30PM +0200, Michael S. Tsirkin wrote: > > > > > On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > > > > > > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > > > > > > This adds a test module for vhost infrastructure. > > > > > > > Intentionally not tied to kbuild to prevent people > > > > > > > from installing and loading it accidentally. > > > > > > > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > > > > > > > On question below. > > > > > > > > > > > > > --- > > > > > > > > > > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > > > > > > new file mode 100644 > > > > > > > index 000..099f302 > > > > > > > --- /dev/null > > > > > > > +++ b/drivers/vhost/test.c > > > > > > > @@ -0,0 +1,320 @@ > > > > > > > +/* Copyright (C) 2009 Red Hat, Inc. > > > > > > > + * Author: Michael S. Tsirkin > > > > > > > + * > > > > > > > + * This work is licensed under the terms of the GNU GPL, version > > > > > > > 2. > > > > > > > + * > > > > > > > + * test virtio server in host kernel. > > > > > > > + */ > > > > > > > + > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > + > > > > > > > +#include "test.h" > > > > > > > +#include "vhost.c" > > > > > > > + > > > > > > > +/* Max number of bytes transferred before requeueing the job. > > > > > > > + * Using this limit prevents one virtqueue from starving others. 
> > > > > > > */ > > > > > > > +#define VHOST_TEST_WEIGHT 0x8 > > > > > > > + > > > > > > > +enum { > > > > > > > + VHOST_TEST_VQ = 0, > > > > > > > + VHOST_TEST_VQ_MAX = 1, > > > > > > > +}; > > > > > > > + > > > > > > > +struct vhost_test { > > > > > > > + struct vhost_dev dev; > > > > > > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > > > > > > +}; > > > > > > > + > > > > > > > +/* Expects to be always run from workqueue - which acts as > > > > > > > + * read-size critical section for our kind of RCU. */ > > > > > > > +static void handle_vq(struct vhost_test *n) > > > > > > > +{ > > > > > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > > > > > > + unsigned out, in; > > > > > > > + int head; > > > > > > > + size_t len, total_len = 0; > > > > > > > + void *private; > > > > > > > + > > > > > > > + private = rcu_dereference_check(vq->private_data, 1); > > > > > > > > > > > > Any chance of a check for running in a workqueue? If I remember > > > > > > correctly, > > > > > > the ->lockdep_map field in the work_struct structure allows you to > > > > > > create > > > > > > the required lockdep expression. > > > > > > > > > > We moved away from using the workqueue to a custom kernel thread > > > > > implementation though. > > > > > > > > OK, then could you please add a check for "current == > > > > custom_kernel_thread" > > > > or some such? > > > > > > > > Thanx, Paul > > > > > > It's a bit tricky. The way we flush out things is by an analog of > > > flush_work. So just checking that we run from workqueue isn't > > > right need to check that we are running from one of the specific work > > > structures that we are careful to flush. > > > > > > I can do this by passing the work structure pointer on to relevant > > > functions but I think this will add (very minor) overhead even when rcu > > > checks are disabled. Does it matter? Thoughts? 
> > Would it be possible to set up separate lockdep maps, in effect passing
> > the needed information via lockdep rather than via the function arguments?
>
> Possibly, I don't know enough about this but will check.
> Any examples to look at?

I would suggest the workqueue example in include/linux/workqueue.h and
kernel/workqueue.c. You will need a struct lockdep_map for each of the
specific work structures, sort of like the one in struct workqueue_struct.
You can then use lock_map_acquire() and lock_map_release() to flag when
your kernel thread is running and when it is not, and then you can pass
lock_is_held() to rcu_dereference_check(). Each of lock_map_acquire(),
lock_map_release(), and lock_is_held() takes a struct lockdep_map as its
sole argument. The rcu_lock_map definition shows an example of a global
lockdep_map variable. If you need to do runtime initialization of your
lockdep_map structure, you can use lockdep_init_map().

Seem reasonable?

							Thanx, Paul
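Paul's recipe maps onto the vhost worker roughly as follows. This is an illustrative, untested kernel-context sketch only: the `vhost_work_map` name and its placement are assumptions made for the example, not the actual vhost code; the lockdep API names are the ones discussed above, with STATIC_LOCKDEP_MAP_INIT used the way the rcu_lock_map definition uses it.

```c
/* Illustrative sketch (kernel context, untested).
 * A global lockdep map, initialized as rcu_lock_map is: */
static struct lockdep_map vhost_work_map =	/* name is hypothetical */
	STATIC_LOCKDEP_MAP_INIT("vhost_work", &vhost_work_map);

/* In the vhost worker thread, bracket execution of each flushed work
 * item so lockdep knows "we are inside vhost work": */
lock_map_acquire(&vhost_work_map);
work->fn(work);
lock_map_release(&vhost_work_map);

/* The dereference site can then assert the real condition instead of
 * the unconditional 1: */
private = rcu_dereference_check(vq->private_data,
				lock_is_held(&vhost_work_map));
```

This keeps the check out of the function arguments entirely, so there is no overhead on the work-execution path beyond the lockdep annotations themselves, which compile away when lockdep is disabled.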
Re: [PATCH 1/2] vhost test module
On Thu, Dec 02, 2010 at 03:13:03PM -0800, Paul E. McKenney wrote: > On Thu, Dec 02, 2010 at 09:47:09PM +0200, Michael S. Tsirkin wrote: > > On Thu, Dec 02, 2010 at 11:26:16AM -0800, Paul E. McKenney wrote: > > > On Thu, Dec 02, 2010 at 09:11:30PM +0200, Michael S. Tsirkin wrote: > > > > On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > > > > > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > > > > > This adds a test module for vhost infrastructure. > > > > > > Intentionally not tied to kbuild to prevent people > > > > > > from installing and loading it accidentally. > > > > > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > > > > > On question below. > > > > > > > > > > > --- > > > > > > > > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > > > > > new file mode 100644 > > > > > > index 000..099f302 > > > > > > --- /dev/null > > > > > > +++ b/drivers/vhost/test.c > > > > > > @@ -0,0 +1,320 @@ > > > > > > +/* Copyright (C) 2009 Red Hat, Inc. > > > > > > + * Author: Michael S. Tsirkin > > > > > > + * > > > > > > + * This work is licensed under the terms of the GNU GPL, version 2. > > > > > > + * > > > > > > + * test virtio server in host kernel. > > > > > > + */ > > > > > > + > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > + > > > > > > +#include "test.h" > > > > > > +#include "vhost.c" > > > > > > + > > > > > > +/* Max number of bytes transferred before requeueing the job. > > > > > > + * Using this limit prevents one virtqueue from starving others. 
*/ > > > > > > +#define VHOST_TEST_WEIGHT 0x8 > > > > > > + > > > > > > +enum { > > > > > > + VHOST_TEST_VQ = 0, > > > > > > + VHOST_TEST_VQ_MAX = 1, > > > > > > +}; > > > > > > + > > > > > > +struct vhost_test { > > > > > > + struct vhost_dev dev; > > > > > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > > > > > +}; > > > > > > + > > > > > > +/* Expects to be always run from workqueue - which acts as > > > > > > + * read-size critical section for our kind of RCU. */ > > > > > > +static void handle_vq(struct vhost_test *n) > > > > > > +{ > > > > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > > > > > + unsigned out, in; > > > > > > + int head; > > > > > > + size_t len, total_len = 0; > > > > > > + void *private; > > > > > > + > > > > > > + private = rcu_dereference_check(vq->private_data, 1); > > > > > > > > > > Any chance of a check for running in a workqueue? If I remember > > > > > correctly, > > > > > the ->lockdep_map field in the work_struct structure allows you to > > > > > create > > > > > the required lockdep expression. > > > > > > > > We moved away from using the workqueue to a custom kernel thread > > > > implementation though. > > > > > > OK, then could you please add a check for "current == > > > custom_kernel_thread" > > > or some such? > > > > > > Thanx, Paul > > > > It's a bit tricky. The way we flush out things is by an analog of > > flush_work. So just checking that we run from workqueue isn't > > right need to check that we are running from one of the specific work > > structures that we are careful to flush. > > > > I can do this by passing the work structure pointer on to relevant > > functions but I think this will add (very minor) overhead even when rcu > > checks are disabled. Does it matter? Thoughts? > > Would it be possible to set up separate lockdep maps, in effect passing > the needed information via lockdep rather than via the function arguments? 
> 							Thanx, Paul

Possibly, I don't know enough about this but will check.
Any examples to look at?

--
mST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] vhost test module
On Thu, Dec 02, 2010 at 09:47:09PM +0200, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 11:26:16AM -0800, Paul E. McKenney wrote: > > On Thu, Dec 02, 2010 at 09:11:30PM +0200, Michael S. Tsirkin wrote: > > > On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > > > > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > > > > This adds a test module for vhost infrastructure. > > > > > Intentionally not tied to kbuild to prevent people > > > > > from installing and loading it accidentally. > > > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > > > On question below. > > > > > > > > > --- > > > > > > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > > > > new file mode 100644 > > > > > index 000..099f302 > > > > > --- /dev/null > > > > > +++ b/drivers/vhost/test.c > > > > > @@ -0,0 +1,320 @@ > > > > > +/* Copyright (C) 2009 Red Hat, Inc. > > > > > + * Author: Michael S. Tsirkin > > > > > + * > > > > > + * This work is licensed under the terms of the GNU GPL, version 2. > > > > > + * > > > > > + * test virtio server in host kernel. > > > > > + */ > > > > > + > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > + > > > > > +#include "test.h" > > > > > +#include "vhost.c" > > > > > + > > > > > +/* Max number of bytes transferred before requeueing the job. > > > > > + * Using this limit prevents one virtqueue from starving others. 
*/ > > > > > +#define VHOST_TEST_WEIGHT 0x8 > > > > > + > > > > > +enum { > > > > > + VHOST_TEST_VQ = 0, > > > > > + VHOST_TEST_VQ_MAX = 1, > > > > > +}; > > > > > + > > > > > +struct vhost_test { > > > > > + struct vhost_dev dev; > > > > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > > > > +}; > > > > > + > > > > > +/* Expects to be always run from workqueue - which acts as > > > > > + * read-size critical section for our kind of RCU. */ > > > > > +static void handle_vq(struct vhost_test *n) > > > > > +{ > > > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > > > > + unsigned out, in; > > > > > + int head; > > > > > + size_t len, total_len = 0; > > > > > + void *private; > > > > > + > > > > > + private = rcu_dereference_check(vq->private_data, 1); > > > > > > > > Any chance of a check for running in a workqueue? If I remember > > > > correctly, > > > > the ->lockdep_map field in the work_struct structure allows you to > > > > create > > > > the required lockdep expression. > > > > > > We moved away from using the workqueue to a custom kernel thread > > > implementation though. > > > > OK, then could you please add a check for "current == custom_kernel_thread" > > or some such? > > > > Thanx, Paul > > It's a bit tricky. The way we flush out things is by an analog of > flush_work. So just checking that we run from workqueue isn't > right need to check that we are running from one of the specific work > structures that we are careful to flush. > > I can do this by passing the work structure pointer on to relevant > functions but I think this will add (very minor) overhead even when rcu > checks are disabled. Does it matter? Thoughts? Would it be possible to set up separate lockdep maps, in effect passing the needed information via lockdep rather than via the function arguments? 
Thanx, Paul > > > > > + if (!private) > > > > > + return; > > > > > + > > > > > + mutex_lock(&vq->mutex); > > > > > + vhost_disable_notify(vq); > > > > > + > > > > > + for (;;) { > > > > > + head = vhost_get_vq_desc(&n->dev, vq, vq->iov, > > > > > + ARRAY_SIZE(vq->iov), > > > > > + &out, &in, > > > > > + NULL, NULL); > > > > > + /* On error, stop handling until the next kick. */ > > > > > + if (unlikely(head < 0)) > > > > > + break; > > > > > + /* Nothing new? Wait for eventfd to tell us they > > > > > refilled. */ > > > > > + if (head == vq->num) { > > > > > + if (unlikely(vhost_enable_notify(vq))) { > > > > > + vhost_disable_notify(vq); > > > > > + continue; > > > > > + } > > > > > + break; > > > > > + } > > > > > + if (in) { > > > > > + vq_err(vq, "Unexpected descriptor format for > > > > > TX: " > > > > > +"out %d, int %d\n", out, in); > > > > > + break; > > > > > + } > > > > > + len = iov_length(vq->iov, out); > > > > > + /* S
Problems on qemu-kvm unittests
We are getting failures when executing apic.flat on our periodic upstream tests:

12/02 18:40:59 DEBUG|kvm_vm:0664| Running qemu command: /usr/local/autotest/tests/kvm/qemu -name 'vm1' -monitor unix:'/tmp/monitor-humanmonitor1-20101202-184059-9EnX',server,nowait -serial unix:'/tmp/serial-20101202-184059-9EnX',server,nowait -m 512 -smp 2 -kernel '/usr/local/autotest/tests/kvm/unittests/apic.flat' -vnc :0 -chardev file,id=testlog,path=/tmp/testlog-20101202-184059-9EnX -device testdev,chardev=testlog -S -cpu qemu64,+x2apic
12/02 18:40:59 DEBUG|kvm_subpro:0700| (qemu) Cannot load x86-64 image, give a 32bit one.
12/02 18:40:59 DEBUG|kvm_subpro:0700| (qemu) (Process terminated with status 1)

Relevant git commits:

12/02 18:20:16 INFO | kvm_utils:0407| Commit hash for git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git is 9ee00410d82a7c5cab5ae347d97fbf8a95c55506 (tag v2.6.32-56688-g9ee0041)
12/02 18:39:11 INFO | kvm_utils:0407| Commit hash for git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git is 53b6d3d5c2522e881c8d194f122de3114f6f76eb (tag kvm-88-6325-g53b6d3d)
12/02 18:39:16 INFO | kvm_utils:0407| Commit hash for git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git is f2d2b7c74355523b90019427224577b6f0ff1b8a (no tag found)

Please advise!

Lucas
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
* Rik van Riel (r...@redhat.com) wrote:
> When running SMP virtual machines, it is possible for one VCPU to be
> spinning on a spinlock, while the VCPU that holds the spinlock is not
> currently running, because the host scheduler preempted it to run
> something else.
>
> Both Intel and AMD CPUs have a feature that detects when a virtual
> CPU is spinning on a lock and will trap to the host.
>
> The current KVM code sleeps for a bit whenever that happens, which
> results in eg. a 64 VCPU Windows guest taking forever and a bit to
> boot up. This is because the VCPU holding the lock is actually
> running and not sleeping, so the pause is counter-productive.

Seems like simply increasing the spin window would help in that case? Or
is it just too contended a lock (I think they use mcs locks, so I can see
a single wrong sleep causing real contention problems).

> In other workloads a pause can also be counter-productive, with
> spinlock detection resulting in one guest giving up its CPU time
> to the others. Instead of spinning, it ends up simply not running
> much at all.
>
> This patch series aims to fix that, by having a VCPU that spins
> give the remainder of its timeslice to another VCPU in the same
> guest before yielding the CPU - one that is runnable but got
> preempted, hopefully the lock holder.
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 03:07 PM, Chris Wright wrote:
> > But you agree this is no KVM business.
>
> Like non-trapping hlt, that too will guarantee that the guest is
> preempted by timeslice exhaustion (and is simpler than non-trapping
> hlt). So it may well be the simplest for the case where we are
> perfectly committed (i.e. the vcpu fractional core count totals the
> pcpu count). But once we are undercommitted we still need some extra
> logic to handle the hard cap and something to kick the running guest
> off the cpu and suck up the extra cycles in a power conserving way.

I'm not entirely sure TBH. If you think of a cloud's per-VCPU capacity in
terms of Compute Units, having a model where a VCPU maps to 1-3 units
depending on total load is potentially interesting, particularly if the
VCPU's capacity only changes in discrete amounts, the expected capacity
is communicated to the guest, and the capacity only changes periodically.

Regards,

Anthony Liguori

> thanks,
> -chris
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 02:40 PM, Marcelo Tosatti wrote:
> Consuming the timeslice outside guest mode is less intrusive and easier
> to replace. Something like this should work?
>
> 	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
> 		while (!need_resched())
> 			default_idle();
> 	}
>
> But you agree this is no KVM business.

My initial inclination is that this would be inappropriate for KVM but I
think I'm slowly convincing myself otherwise.

Ultimately, hard limits and deterministic scheduling are related goals but
not quite the same. A hard limit is very specific: you want each VCPU to
receive no more than an exact amount of CPU time. With deterministic
scheduling, you want to make sure that a set of VMs are not influenced by
each other's behavior.

You want hard limits when you want to hide the density/capacity of a node
from the end customer. You want determinism when you simply want to
isolate the performance of each customer from the other customers. That
is, the only thing that should affect the performance graph of a VM is how
many neighbors it has (which is controlled by management software) rather
than what its neighbors are doing.

If you have hard limits, you can approximate deterministic scheduling but
it's complex in the face of changing numbers of guests. Additionally, hard
limits present issues with directed yield that don't exist with a
deterministic scheduling approach. You can still donate your time slice to
another VCPU because the VCPUs are not actually capped. That may mean that
an individual VCPU gets more PCPU time than an exact division, but for the
VM overall, it won't get more than its total share. So the principle of
performance isolation for the guest isn't impacted.

Regards,

Anthony Liguori
Re: [SeaBIOS] [PATCHv6 00/16] boot order specification
Gleb Natapov wrote:
> How can we get to EDD info after the device is mapped? Looking at the
> SeaBIOS implementation, it builds the EDD table on the fly when int_1348
> is called, and it does it only for internal devices. Can we use the
> "disconnect vector" to connect the device temporarily, get EDD, and then
> disconnect?

From BIOS Boot Specification 1.01:

"6.4.2 Disconnect Vector

Originally, it was thought that the DV would be called by the BIOS if the
device's BCV was called and subsequent booting from the device failed.
However, it was later discovered that current PnP option ROMs are more
well behaved by checking during the BCV call if their device will properly
respond to INT 13h calls or not, and simply not hooking INT 13h if those
calls would fail. Because of this, the DV is not called by a BIOS
supporting the BIOS Boot Specification, nor is it necessary to have a
valid DV in the PnP Expansion Header. The DV should be NULL and can't be
used for other storage."

Sebastian
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Marcelo Tosatti (mtosa...@redhat.com) wrote: > On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote: > > * Anthony Liguori (aligu...@us.ibm.com) wrote: > > > In certain use-cases, we want to allocate guests fixed time slices where > > > idle > > > guest cycles leave the machine idling. There are many approaches to > > > achieve > > > this but the most direct is to simply avoid trapping the HLT instruction > > > which > > > lets the guest directly execute the instruction putting the processor to > > > sleep. > > > > I like the idea, esp to keep from burning power. > > > > > Introduce this as a module-level option for kvm-vmx.ko since if you do > > > this > > > for one guest, you probably want to do it for all. A similar option is > > > possible > > > for AMD but I don't have easy access to AMD test hardware. > > > > Perhaps it should be a VM level option. And then invert the notion. > > Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu > > (pin in place probably). And have that VM do nothing other than hlt. > > Then it's always runnable according to scheduler, and can "consume" the > > extra work that CFS wants to give away. > > > > What do you think? > > > > thanks, > > -chris > > Consuming the timeslice outside guest mode is less intrusive and easier > to replace. Something like this should work? > > if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) { > while (!need_resched()) > default_idle(); > } > > But you agree this is no KVM business. Like non-trapping hlt, that too will guarantee that the guest is preempted by timeslice exhaustion (and is simpler than non-trapping hlt). So it may well be the simplest for the case where we are perfectly committed (i.e. the vcpu fractional core count totals the pcpu count). But once we are undercommitted we still need some extra logic to handle the hard cap and something to kick the running guest off the cpu and suck up the extra cycles in a power conserving way. 
thanks,
-chris
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 02:12 PM, Marcelo Tosatti wrote:
> > > > 	opt = CPU_BASED_TPR_SHADOW |
> > > > 	      CPU_BASED_USE_MSR_BITMAPS |
> > > > 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> > > > --
> > > > 1.7.0.4
> > >
> > > Breaks async PF (see "checks on guest state"),
> >
> > Sorry, I don't follow what you mean here. Can you elaborate?
>
> VCPU in HLT state only allows injection of certain events that would be
> delivered on HLT. #PF is not one of them.

But you can't inject an exception into a guest while the VMCS is active,
can you? So the guest takes an exit while in the hlt instruction but
that's no different than if the guest has been interrupted because of hlt
exiting.

> You'd have to handle this situation on event injection, vmentry fails
> otherwise. Or perhaps clear HLT state on vmexit and vmentry.

So this works today because on a hlt exit, emulate_halt() will clear the
HLT state, which then puts the vcpu into a state where it can receive an
exception injection?

Regards,

Anthony Liguori

> > > timer reinjection probably.
> >
> > Timer reinjection will continue to work as expected. If a guest is
> > halting and an external interrupt is delivered (by a timer), the guest
> > will still exit as expected.
> >
> > I can't think of anything that would be functionally correct and still
> > depend on getting hlt exits because ultimately, a guest never actually
> > has to do a hlt (and certainly there are guests that won't).
>
> LAPIC pending timer events will be reinjected on entry path, if
> accumulated. So they depend on any exit. If you disable HLT-exiting,
> delay will increase. OK, maybe thats irrelevant.
>
> > > It should be possible to achieve determinism with a scheduler policy?
> >
> > If the ultimate desire is to have the guests be scheduled in a
> > non-work-conserving fashion, I can't see a more direct approach than to
> > simply not have the guests yield (which is ultimately what hlt trapping
> > does).
> >
> > Anything the scheduler would do is after the fact and probably based
> > on inference about why the yield happened.
>
> Another issue is you ignore the host's idea of the best way to sleep
> (ACPI, or whatever).
> And handling inactive HLT state (which was never enabled) can be painful.
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote:
> * Anthony Liguori (aligu...@us.ibm.com) wrote:
> > In certain use-cases, we want to allocate guests fixed time slices
> > where idle guest cycles leave the machine idling. There are many
> > approaches to achieve this but the most direct is to simply avoid
> > trapping the HLT instruction which lets the guest directly execute
> > the instruction putting the processor to sleep.
>
> I like the idea, esp to keep from burning power.
>
> > Introduce this as a module-level option for kvm-vmx.ko since if you do
> > this for one guest, you probably want to do it for all. A similar
> > option is possible for AMD but I don't have easy access to AMD test
> > hardware.
>
> Perhaps it should be a VM level option. And then invert the notion.
> Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu
> (pin in place probably). And have that VM do nothing other than hlt.
> Then it's always runnable according to scheduler, and can "consume" the
> extra work that CFS wants to give away.
>
> What do you think?
>
> thanks,
> -chris

Consuming the timeslice outside guest mode is less intrusive and easier
to replace. Something like this should work?

	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
		while (!need_resched())
			default_idle();
	}

But you agree this is no KVM business.
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
> >> 	opt = CPU_BASED_TPR_SHADOW |
> >> 	      CPU_BASED_USE_MSR_BITMAPS |
> >> 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> >> --
> >> 1.7.0.4
> >
> > Breaks async PF (see "checks on guest state"),
>
> Sorry, I don't follow what you mean here. Can you elaborate?

VCPU in HLT state only allows injection of certain events that would be
delivered on HLT. #PF is not one of them. You'd have to handle this
situation on event injection, vmentry fails otherwise. Or perhaps clear
HLT state on vmexit and vmentry.

> > timer reinjection probably.
>
> Timer reinjection will continue to work as expected. If a guest is
> halting and an external interrupt is delivered (by a timer), the guest
> will still exit as expected.
>
> I can't think of anything that would be functionally correct and still
> depend on getting hlt exits because ultimately, a guest never actually
> has to do a hlt (and certainly there are guests that won't).

LAPIC pending timer events will be reinjected on entry path, if
accumulated. So they depend on any exit. If you disable HLT-exiting,
delay will increase. OK, maybe thats irrelevant.

> > It should be possible to achieve determinism with a scheduler policy?
>
> If the ultimate desire is to have the guests be scheduled in a
> non-work-conserving fashion, I can't see a more direct approach than to
> simply not have the guests yield (which is ultimately what hlt trapping
> does).
>
> Anything the scheduler would do is after the fact and probably based
> on inference about why the yield happened.

Another issue is you ignore the host's idea of the best way to sleep
(ACPI, or whatever). And handling inactive HLT state (which was never
enabled) can be painful.
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Anthony Liguori (anth...@codemonkey.ws) wrote:
> On 12/02/2010 01:14 PM, Chris Wright wrote:
> > * Anthony Liguori (aligu...@us.ibm.com) wrote:
> > > In certain use-cases, we want to allocate guests fixed time slices
> > > where idle guest cycles leave the machine idling. There are many
> > > approaches to achieve this but the most direct is to simply avoid
> > > trapping the HLT instruction which lets the guest directly execute
> > > the instruction putting the processor to sleep.
> >
> > I like the idea, esp to keep from burning power.
> >
> > > Introduce this as a module-level option for kvm-vmx.ko since if you
> > > do this for one guest, you probably want to do it for all. A similar
> > > option is possible for AMD but I don't have easy access to AMD test
> > > hardware.
> >
> > Perhaps it should be a VM level option. And then invert the notion.
> > Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu
> > (pin in place probably). And have that VM do nothing other than hlt.
> > Then it's always runnable according to scheduler, and can "consume"
> > the extra work that CFS wants to give away.
>
> That's an interesting idea. I think Vatsa had some ideas about how
> to do this with existing mechanisms.

Yeah, should Just Work (TM) w/ smth like evilcap.

> I'm interested in comparing behavior with fixed allocation because
> one thing the above relies upon is that the filler VM loses its
> time when one of the non-filler VCPUs needs to run.

Priorities?

> This may all work correctly but I think it's easier to rationalize
> about having each non-filler VCPU have a fixed (long) time slice.
> If a VCPU needs to wake up to become non-idle, it can do so
> immediately because it already has the PCPU.

The flipside... don't have to worry about the issues that Marcelo brought
up. Should be pretty easy to compare though.
thanks,
-chris
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 01:14 PM, Chris Wright wrote:
> * Anthony Liguori (aligu...@us.ibm.com) wrote:
> > In certain use-cases, we want to allocate guests fixed time slices
> > where idle guest cycles leave the machine idling. There are many
> > approaches to achieve this but the most direct is to simply avoid
> > trapping the HLT instruction which lets the guest directly execute
> > the instruction putting the processor to sleep.
>
> I like the idea, esp to keep from burning power.
>
> > Introduce this as a module-level option for kvm-vmx.ko since if you do
> > this for one guest, you probably want to do it for all. A similar
> > option is possible for AMD but I don't have easy access to AMD test
> > hardware.
>
> Perhaps it should be a VM level option. And then invert the notion.
> Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu
> (pin in place probably). And have that VM do nothing other than hlt.
> Then it's always runnable according to scheduler, and can "consume" the
> extra work that CFS wants to give away.

That's an interesting idea. I think Vatsa had some ideas about how to do
this with existing mechanisms.

I'm interested in comparing behavior with fixed allocation because one
thing the above relies upon is that the filler VM loses its time when one
of the non-filler VCPUs needs to run. This may all work correctly but I
think it's easier to rationalize about having each non-filler VCPU have a
fixed (long) time slice. If a VCPU needs to wake up to become non-idle,
it can do so immediately because it already has the PCPU.

Regards,

Anthony Liguori

> What do you think?
>
> thanks,
> -chris
Re: [PATCH 1/2] vhost test module
On Thu, Dec 02, 2010 at 11:26:16AM -0800, Paul E. McKenney wrote: > On Thu, Dec 02, 2010 at 09:11:30PM +0200, Michael S. Tsirkin wrote: > > On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > > > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > > > This adds a test module for vhost infrastructure. > > > > Intentionally not tied to kbuild to prevent people > > > > from installing and loading it accidentally. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > On question below. > > > > > > > --- > > > > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > > > new file mode 100644 > > > > index 000..099f302 > > > > --- /dev/null > > > > +++ b/drivers/vhost/test.c > > > > @@ -0,0 +1,320 @@ > > > > +/* Copyright (C) 2009 Red Hat, Inc. > > > > + * Author: Michael S. Tsirkin > > > > + * > > > > + * This work is licensed under the terms of the GNU GPL, version 2. > > > > + * > > > > + * test virtio server in host kernel. > > > > + */ > > > > + > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > + > > > > +#include "test.h" > > > > +#include "vhost.c" > > > > + > > > > +/* Max number of bytes transferred before requeueing the job. > > > > + * Using this limit prevents one virtqueue from starving others. */ > > > > +#define VHOST_TEST_WEIGHT 0x8 > > > > + > > > > +enum { > > > > + VHOST_TEST_VQ = 0, > > > > + VHOST_TEST_VQ_MAX = 1, > > > > +}; > > > > + > > > > +struct vhost_test { > > > > + struct vhost_dev dev; > > > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > > > +}; > > > > + > > > > +/* Expects to be always run from workqueue - which acts as > > > > + * read-size critical section for our kind of RCU. 
*/ > > > > +static void handle_vq(struct vhost_test *n) > > > > +{ > > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > > > + unsigned out, in; > > > > + int head; > > > > + size_t len, total_len = 0; > > > > + void *private; > > > > + > > > > + private = rcu_dereference_check(vq->private_data, 1); > > > > > > Any chance of a check for running in a workqueue? If I remember > > > correctly, > > > the ->lockdep_map field in the work_struct structure allows you to create > > > the required lockdep expression. > > > > We moved away from using the workqueue to a custom kernel thread > > implementation though. > > OK, then could you please add a check for "current == custom_kernel_thread" > or some such? > > Thanx, Paul It's a bit tricky. The way we flush out things is by an analog of flush_work. So just checking that we run from workqueue isn't right need to check that we are running from one of the specific work structures that we are careful to flush. I can do this by passing the work structure pointer on to relevant functions but I think this will add (very minor) overhead even when rcu checks are disabled. Does it matter? Thoughts? > > > > + if (!private) > > > > + return; > > > > + > > > > + mutex_lock(&vq->mutex); > > > > + vhost_disable_notify(vq); > > > > + > > > > + for (;;) { > > > > + head = vhost_get_vq_desc(&n->dev, vq, vq->iov, > > > > +ARRAY_SIZE(vq->iov), > > > > +&out, &in, > > > > +NULL, NULL); > > > > + /* On error, stop handling until the next kick. */ > > > > + if (unlikely(head < 0)) > > > > + break; > > > > + /* Nothing new? Wait for eventfd to tell us they > > > > refilled. 
*/ > > > > + if (head == vq->num) { > > > > + if (unlikely(vhost_enable_notify(vq))) { > > > > + vhost_disable_notify(vq); > > > > + continue; > > > > + } > > > > + break; > > > > + } > > > > + if (in) { > > > > + vq_err(vq, "Unexpected descriptor format for > > > > TX: " > > > > + "out %d, int %d\n", out, in); > > > > + break; > > > > + } > > > > + len = iov_length(vq->iov, out); > > > > + /* Sanity check */ > > > > + if (!len) { > > > > + vq_err(vq, "Unexpected 0 len for TX\n"); > > > > + break; > > > > + } > > > > + vhost_add_used_and_signal(&n->dev, vq, head, 0); > > > > + total_len += len; > > > > + if (unlikely(total_len >= VHOST_TEST_WEIGHT)) { > > > > + vhost_poll_queue(&vq->poll); > > > > +
[RFC PATCH 2/3] sched: add yield_to function
Add a yield_to function to the scheduler code, allowing us to give the remainder of our timeslice to another thread. We may want to use this to provide a sys_yield_to system call one day. Signed-off-by: Rik van Riel Signed-off-by: Marcelo Tosatti diff --git a/include/linux/sched.h b/include/linux/sched.h index c5f926c..4f3cce9 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1985,6 +1985,7 @@ extern void set_user_nice(struct task_struct *p, long nice); extern int task_prio(const struct task_struct *p); extern int task_nice(const struct task_struct *p); extern int can_nice(const struct task_struct *p, const int nice); +extern void requeue_task(struct rq *rq, struct task_struct *p); extern int task_curr(const struct task_struct *p); extern int idle_cpu(int cpu); extern int sched_setscheduler(struct task_struct *, int, struct sched_param *); @@ -2058,6 +2059,14 @@ extern int wake_up_state(struct task_struct *tsk, unsigned int state); extern int wake_up_process(struct task_struct *tsk); extern void wake_up_new_task(struct task_struct *tsk, unsigned long clone_flags); + +#ifdef CONFIG_SCHED_HRTICK +extern u64 slice_remain(struct task_struct *); +extern void yield_to(struct task_struct *); +#else +static inline void yield_to(struct task_struct *p) yield() +#endif + #ifdef CONFIG_SMP extern void kick_process(struct task_struct *tsk); #else diff --git a/kernel/sched.c b/kernel/sched.c index f8e5a25..ef088cd 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1909,6 +1909,26 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep) p->se.on_rq = 0; } +/** + * requeue_task - requeue a task which priority got changed by yield_to + * @rq: the tasks's runqueue + * @p: the task in question + * Must be called with the runqueue lock held. Will cause the CPU to + * reschedule if p is now at the head of the runqueue. 
+ */ +void requeue_task(struct rq *rq, struct task_struct *p) +{ + assert_spin_locked(&rq->lock); + + if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p)) + return; + + dequeue_task(rq, p, 0); + enqueue_task(rq, p, 0); + + resched_task(p); +} + /* * __normal_prio - return the priority that is based on the static prio */ @@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, return ret; } +#ifdef CONFIG_SCHED_HRTICK +/* + * Yield the CPU, giving the remainder of our time slice to task p. + * Typically used to hand CPU time to another thread inside the same + * process, eg. when p holds a resource other threads are waiting for. + * Giving priority to p may help get that resource released sooner. + */ +void yield_to(struct task_struct *p) +{ + unsigned long flags; + struct sched_entity *se = &p->se; + struct rq *rq; + struct cfs_rq *cfs_rq; + u64 remain = slice_remain(current); + + rq = task_rq_lock(p, &flags); + if (task_running(rq, p) || task_has_rt_policy(p)) + goto out; + cfs_rq = cfs_rq_of(se); + se->vruntime -= remain; + if (se->vruntime < cfs_rq->min_vruntime) + se->vruntime = cfs_rq->min_vruntime; + requeue_task(rq, p); + out: + task_rq_unlock(rq, &flags); + yield(); +} +EXPORT_SYMBOL(yield_to); +#endif + /** * sys_sched_yield - yield the current processor to other threads. 
* diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 5119b08..2a0a595 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -974,6 +974,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) */ #ifdef CONFIG_SCHED_HRTICK +u64 slice_remain(struct task_struct *p) +{ + unsigned long flags; + struct sched_entity *se = &p->se; + struct cfs_rq *cfs_rq; + struct rq *rq; + u64 slice, ran; + s64 delta; + + rq = task_rq_lock(p, &flags); + cfs_rq = cfs_rq_of(se); + slice = sched_slice(cfs_rq, se); + ran = se->sum_exec_runtime - se->prev_sum_exec_runtime; + delta = slice - ran; + task_rq_unlock(rq, &flags); + + return max(delta, 0LL); +} + static void hrtick_start_fair(struct rq *rq, struct task_struct *p) { struct sched_entity *se = &p->se; -- All rights reversed.
[RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu
Keep track of which task is running a KVM vcpu. This helps us figure out later what task to wake up if we want to boost a vcpu that got preempted. Unfortunately there are no guarantees that the same task always keeps the same vcpu, so we can only track the task across a single "run" of the vcpu. Signed-off-by: Rik van Riel diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 5fbdb55..cb73a73 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -79,6 +79,7 @@ struct kvm_vcpu { #endif int vcpu_id; struct mutex mutex; + struct task_struct *task; int cpu; struct kvm_run *run; unsigned long requests; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 050577a..2bffa47 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -741,6 +741,7 @@ void vcpu_load(struct kvm_vcpu *vcpu) int cpu; mutex_lock(&vcpu->mutex); + vcpu->task = current; cpu = get_cpu(); preempt_notifier_register(&vcpu->preempt_notifier); kvm_arch_vcpu_load(vcpu, cpu); @@ -749,6 +750,7 @@ void vcpu_put(struct kvm_vcpu *vcpu) { + vcpu->task = NULL; preempt_disable(); kvm_arch_vcpu_put(vcpu); preempt_notifier_unregister(&vcpu->preempt_notifier); -- All rights reversed.
[RFC PATCH 0/3] directed yield for Pause Loop Exiting
When running SMP virtual machines, it is possible for one VCPU to be spinning on a spinlock, while the VCPU that holds the spinlock is not currently running, because the host scheduler preempted it to run something else. Both Intel and AMD CPUs have a feature that detects when a virtual CPU is spinning on a lock and will trap to the host. The current KVM code sleeps for a bit whenever that happens, which results in eg. a 64 VCPU Windows guest taking forever and a bit to boot up. This is because the VCPU holding the lock is actually running and not sleeping, so the pause is counter-productive. In other workloads a pause can also be counter-productive, with spinlock detection resulting in one guest giving up its CPU time to the others. Instead of spinning, it ends up simply not running much at all. This patch series aims to fix that, by having a VCPU that spins give the remainder of its timeslice to another VCPU in the same guest before yielding the CPU - one that is runnable but got preempted, hopefully the lock holder. Scheduler people, please flame me with anything I may have done wrong, so I can do it right for a next version :) -- All rights reversed.
[RFC PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic slowdowns of certain workloads, we instead use yield_to to hand the rest of our timeslice to another vcpu in the same KVM guest. Signed-off-by: Rik van Riel Signed-off-by: Marcelo Tosatti diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c011ba3..7637dd3 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -90,6 +90,7 @@ struct kvm_vcpu { int fpu_active; int guest_fpu_loaded; wait_queue_head_t wq; + int spinning; int sigset_active; sigset_t sigset; struct kvm_vcpu_stat stat; @@ -185,6 +186,7 @@ struct kvm { #endif struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; atomic_t online_vcpus; + int last_boosted_vcpu; struct list_head vm_list; struct mutex lock; struct kvm_io_bus *buses[KVM_NR_BUSES]; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 80f17db..a6eeafc 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1880,18 +1880,53 @@ void kvm_resched(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_resched); -void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu) +void kvm_vcpu_on_spin(struct kvm_vcpu *me) { - ktime_t expires; - DEFINE_WAIT(wait); + struct kvm *kvm = me->kvm; + struct kvm_vcpu *vcpu; + int last_boosted_vcpu = me->kvm->last_boosted_vcpu; + int first_round = 1; + int i; - prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE); + me->spinning = 1; + + /* +* We boost the priority of a VCPU that is runnable but not +* currently running, because it got preempted by something +* else and called schedule in __vcpu_run. Hopefully that +* VCPU is holding the lock that we need and will release it. +* We approximate round-robin by starting at the last boosted VCPU. 
+*/ + again: + kvm_for_each_vcpu(i, vcpu, kvm) { + struct task_struct *task = vcpu->task; + if (first_round && i < last_boosted_vcpu) { + i = last_boosted_vcpu; + continue; + } else if (!first_round && i > last_boosted_vcpu) + break; + if (vcpu == me) + continue; + if (vcpu->spinning) + continue; + if (!task) + continue; + if (waitqueue_active(&vcpu->wq)) + continue; + if (task->flags & PF_VCPU) + continue; + kvm->last_boosted_vcpu = i; + yield_to(task); + break; + } - /* Sleep for 100 us, and hope lock-holder got scheduled */ - expires = ktime_add_ns(ktime_get(), 10UL); - schedule_hrtimeout(&expires, HRTIMER_MODE_ABS); + if (first_round && last_boosted_vcpu == kvm->last_boosted_vcpu) { + /* We have not found anyone yet. */ + first_round = 0; + goto again; + } - finish_wait(&vcpu->wq, &wait); + me->spinning = 0; } EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin); -- All rights reversed.
Re: [PATCH 1/2] vhost test module
On Thu, Dec 02, 2010 at 09:11:30PM +0200, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > > This adds a test module for vhost infrastructure. > > > Intentionally not tied to kbuild to prevent people > > > from installing and loading it accidentally. > > > > > > Signed-off-by: Michael S. Tsirkin > > > > On question below. > > > > > --- > > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > > new file mode 100644 > > > index 000..099f302 > > > --- /dev/null > > > +++ b/drivers/vhost/test.c > > > @@ -0,0 +1,320 @@ > > > +/* Copyright (C) 2009 Red Hat, Inc. > > > + * Author: Michael S. Tsirkin > > > + * > > > + * This work is licensed under the terms of the GNU GPL, version 2. > > > + * > > > + * test virtio server in host kernel. > > > + */ > > > + > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > + > > > +#include "test.h" > > > +#include "vhost.c" > > > + > > > +/* Max number of bytes transferred before requeueing the job. > > > + * Using this limit prevents one virtqueue from starving others. */ > > > +#define VHOST_TEST_WEIGHT 0x8 > > > + > > > +enum { > > > + VHOST_TEST_VQ = 0, > > > + VHOST_TEST_VQ_MAX = 1, > > > +}; > > > + > > > +struct vhost_test { > > > + struct vhost_dev dev; > > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > > +}; > > > + > > > +/* Expects to be always run from workqueue - which acts as > > > + * read-size critical section for our kind of RCU. 
*/ > > > +static void handle_vq(struct vhost_test *n) > > > +{ > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > > + unsigned out, in; > > > + int head; > > > + size_t len, total_len = 0; > > > + void *private; > > > + > > > + private = rcu_dereference_check(vq->private_data, 1); > > > > Any chance of a check for running in a workqueue? If I remember correctly, > > the ->lockdep_map field in the work_struct structure allows you to create > > the required lockdep expression. > > We moved away from using the workqueue to a custom kernel thread > implementation though. OK, then could you please add a check for "current == custom_kernel_thread" or some such? Thanx, Paul > > > + if (!private) > > > + return; > > > + > > > + mutex_lock(&vq->mutex); > > > + vhost_disable_notify(vq); > > > + > > > + for (;;) { > > > + head = vhost_get_vq_desc(&n->dev, vq, vq->iov, > > > + ARRAY_SIZE(vq->iov), > > > + &out, &in, > > > + NULL, NULL); > > > + /* On error, stop handling until the next kick. */ > > > + if (unlikely(head < 0)) > > > + break; > > > + /* Nothing new? Wait for eventfd to tell us they refilled. 
*/ > > > + if (head == vq->num) { > > > + if (unlikely(vhost_enable_notify(vq))) { > > > + vhost_disable_notify(vq); > > > + continue; > > > + } > > > + break; > > > + } > > > + if (in) { > > > + vq_err(vq, "Unexpected descriptor format for TX: " > > > +"out %d, int %d\n", out, in); > > > + break; > > > + } > > > + len = iov_length(vq->iov, out); > > > + /* Sanity check */ > > > + if (!len) { > > > + vq_err(vq, "Unexpected 0 len for TX\n"); > > > + break; > > > + } > > > + vhost_add_used_and_signal(&n->dev, vq, head, 0); > > > + total_len += len; > > > + if (unlikely(total_len >= VHOST_TEST_WEIGHT)) { > > > + vhost_poll_queue(&vq->poll); > > > + break; > > > + } > > > + } > > > + > > > + mutex_unlock(&vq->mutex); > > > +} > > > + > > > +static void handle_vq_kick(struct vhost_work *work) > > > +{ > > > + struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, > > > + poll.work); > > > + struct vhost_test *n = container_of(vq->dev, struct vhost_test, dev); > > > + > > > + handle_vq(n); > > > +} > > > + > > > +static int vhost_test_open(struct inode *inode, struct file *f) > > > +{ > > > + struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL); > > > + struct vhost_dev *dev; > > > + int r; > > > + > > > + if (!n) > > > + return -ENOMEM; > > > + > > > + dev = &n->dev; > > > + n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick; > > > + r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX); > > > + if (r < 0) { > > > + kfree(n); > > > + return r; > > > + } > > > + > > > + f->private_data = n; > > > + > > > + return 0; > > > +} > > > + > > > +static void *vhost
Re: [PATCH 1/2] vhost test module
On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > This adds a test module for vhost infrastructure. > Intentionally not tied to kbuild to prevent people > from installing and loading it accidentally. > > Signed-off-by: Michael S. Tsirkin On question below. > --- > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > new file mode 100644 > index 000..099f302 > --- /dev/null > +++ b/drivers/vhost/test.c > @@ -0,0 +1,320 @@ > +/* Copyright (C) 2009 Red Hat, Inc. > + * Author: Michael S. Tsirkin > + * > + * This work is licensed under the terms of the GNU GPL, version 2. > + * > + * test virtio server in host kernel. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "test.h" > +#include "vhost.c" > + > +/* Max number of bytes transferred before requeueing the job. > + * Using this limit prevents one virtqueue from starving others. */ > +#define VHOST_TEST_WEIGHT 0x8 > + > +enum { > + VHOST_TEST_VQ = 0, > + VHOST_TEST_VQ_MAX = 1, > +}; > + > +struct vhost_test { > + struct vhost_dev dev; > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > +}; > + > +/* Expects to be always run from workqueue - which acts as > + * read-size critical section for our kind of RCU. */ > +static void handle_vq(struct vhost_test *n) > +{ > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > + unsigned out, in; > + int head; > + size_t len, total_len = 0; > + void *private; > + > + private = rcu_dereference_check(vq->private_data, 1); Any chance of a check for running in a workqueue? If I remember correctly, the ->lockdep_map field in the work_struct structure allows you to create the required lockdep expression. 
Thanx, Paul > + if (!private) > + return; > + > + mutex_lock(&vq->mutex); > + vhost_disable_notify(vq); > + > + for (;;) { > + head = vhost_get_vq_desc(&n->dev, vq, vq->iov, > + ARRAY_SIZE(vq->iov), > + &out, &in, > + NULL, NULL); > + /* On error, stop handling until the next kick. */ > + if (unlikely(head < 0)) > + break; > + /* Nothing new? Wait for eventfd to tell us they refilled. */ > + if (head == vq->num) { > + if (unlikely(vhost_enable_notify(vq))) { > + vhost_disable_notify(vq); > + continue; > + } > + break; > + } > + if (in) { > + vq_err(vq, "Unexpected descriptor format for TX: " > +"out %d, int %d\n", out, in); > + break; > + } > + len = iov_length(vq->iov, out); > + /* Sanity check */ > + if (!len) { > + vq_err(vq, "Unexpected 0 len for TX\n"); > + break; > + } > + vhost_add_used_and_signal(&n->dev, vq, head, 0); > + total_len += len; > + if (unlikely(total_len >= VHOST_TEST_WEIGHT)) { > + vhost_poll_queue(&vq->poll); > + break; > + } > + } > + > + mutex_unlock(&vq->mutex); > +} > + > +static void handle_vq_kick(struct vhost_work *work) > +{ > + struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, > + poll.work); > + struct vhost_test *n = container_of(vq->dev, struct vhost_test, dev); > + > + handle_vq(n); > +} > + > +static int vhost_test_open(struct inode *inode, struct file *f) > +{ > + struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL); > + struct vhost_dev *dev; > + int r; > + > + if (!n) > + return -ENOMEM; > + > + dev = &n->dev; > + n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick; > + r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX); > + if (r < 0) { > + kfree(n); > + return r; > + } > + > + f->private_data = n; > + > + return 0; > +} > + > +static void *vhost_test_stop_vq(struct vhost_test *n, > + struct vhost_virtqueue *vq) > +{ > + void *private; > + > + mutex_lock(&vq->mutex); > + private = rcu_dereference_protected(vq->private_data, > + lockdep_is_held(&vq->mutex)); > + rcu_assign_pointer(vq->private_data, 
NULL); > + mutex_unlock(&vq->mutex); > + return private; > +} > + > +static void vhost_test_stop(struct vhost_test *n, void **privatep) > +{ > + *privatep = vhost_test_stop_vq(n, n->vqs + VHOST_TEST_VQ); > +} > + > +static void vhost_test_flush_vq(struct vhost_test *n, int index)
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Anthony Liguori (aligu...@us.ibm.com) wrote: > In certain use-cases, we want to allocate guests fixed time slices where idle > guest cycles leave the machine idling. There are many approaches to achieve > this but the most direct is to simply avoid trapping the HLT instruction which > lets the guest directly execute the instruction putting the processor to > sleep. I like the idea, esp to keep from burning power. > Introduce this as a module-level option for kvm-vmx.ko since if you do this > for one guest, you probably want to do it for all. A similar option is > possible > for AMD but I don't have easy access to AMD test hardware. Perhaps it should be a VM level option. And then invert the notion. Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu (pin in place probably). And have that VM do nothing other than hlt. Then it's always runnable according to scheduler, and can "consume" the extra work that CFS wants to give away. What do you think? thanks, -chris
Re: [PATCH 1/2] vhost test module
On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > This adds a test module for vhost infrastructure. > > Intentionally not tied to kbuild to prevent people > > from installing and loading it accidentally. > > > > Signed-off-by: Michael S. Tsirkin > > On question below. > > > --- > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > new file mode 100644 > > index 000..099f302 > > --- /dev/null > > +++ b/drivers/vhost/test.c > > @@ -0,0 +1,320 @@ > > +/* Copyright (C) 2009 Red Hat, Inc. > > + * Author: Michael S. Tsirkin > > + * > > + * This work is licensed under the terms of the GNU GPL, version 2. > > + * > > + * test virtio server in host kernel. > > + */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include "test.h" > > +#include "vhost.c" > > + > > +/* Max number of bytes transferred before requeueing the job. > > + * Using this limit prevents one virtqueue from starving others. */ > > +#define VHOST_TEST_WEIGHT 0x8 > > + > > +enum { > > + VHOST_TEST_VQ = 0, > > + VHOST_TEST_VQ_MAX = 1, > > +}; > > + > > +struct vhost_test { > > + struct vhost_dev dev; > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > +}; > > + > > +/* Expects to be always run from workqueue - which acts as > > + * read-size critical section for our kind of RCU. */ > > +static void handle_vq(struct vhost_test *n) > > +{ > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > + unsigned out, in; > > + int head; > > + size_t len, total_len = 0; > > + void *private; > > + > > + private = rcu_dereference_check(vq->private_data, 1); > > Any chance of a check for running in a workqueue? If I remember correctly, > the ->lockdep_map field in the work_struct structure allows you to create > the required lockdep expression. 
> > Thanx, Paul We moved away from using the workqueue to a custom kernel thread implementation though. > > + if (!private) > > + return; > > + > > + mutex_lock(&vq->mutex); > > + vhost_disable_notify(vq); > > + > > + for (;;) { > > + head = vhost_get_vq_desc(&n->dev, vq, vq->iov, > > +ARRAY_SIZE(vq->iov), > > +&out, &in, > > +NULL, NULL); > > + /* On error, stop handling until the next kick. */ > > + if (unlikely(head < 0)) > > + break; > > + /* Nothing new? Wait for eventfd to tell us they refilled. */ > > + if (head == vq->num) { > > + if (unlikely(vhost_enable_notify(vq))) { > > + vhost_disable_notify(vq); > > + continue; > > + } > > + break; > > + } > > + if (in) { > > + vq_err(vq, "Unexpected descriptor format for TX: " > > + "out %d, int %d\n", out, in); > > + break; > > + } > > + len = iov_length(vq->iov, out); > > + /* Sanity check */ > > + if (!len) { > > + vq_err(vq, "Unexpected 0 len for TX\n"); > > + break; > > + } > > + vhost_add_used_and_signal(&n->dev, vq, head, 0); > > + total_len += len; > > + if (unlikely(total_len >= VHOST_TEST_WEIGHT)) { > > + vhost_poll_queue(&vq->poll); > > + break; > > + } > > + } > > + > > + mutex_unlock(&vq->mutex); > > +} > > + > > +static void handle_vq_kick(struct vhost_work *work) > > +{ > > + struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, > > + poll.work); > > + struct vhost_test *n = container_of(vq->dev, struct vhost_test, dev); > > + > > + handle_vq(n); > > +} > > + > > +static int vhost_test_open(struct inode *inode, struct file *f) > > +{ > > + struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL); > > + struct vhost_dev *dev; > > + int r; > > + > > + if (!n) > > + return -ENOMEM; > > + > > + dev = &n->dev; > > + n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick; > > + r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX); > > + if (r < 0) { > > + kfree(n); > > + return r; > > + } > > + > > + f->private_data = n; > > + > > + return 0; > > +} > > + > > +static void *vhost_test_stop_vq(struct 
vhost_test *n, > > + struct vhost_virtqueue *vq) > > +{ > > + void *private; > > + > > + mutex_lock(&vq->mutex); > > + private = rcu_dereference_protected(vq->private_data, > > +lockdep_is_held(&vq->mutex)); > > + rcu_assign_po
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 11:37 AM, Marcelo Tosatti wrote: On Thu, Dec 02, 2010 at 07:59:17AM -0600, Anthony Liguori wrote: In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. There are many approaches to achieve this but the most direct is to simply avoid trapping the HLT instruction which lets the guest directly execute the instruction putting the processor to sleep. Introduce this as a module-level option for kvm-vmx.ko since if you do this for one guest, you probably want to do it for all. A similar option is possible for AMD but I don't have easy access to AMD test hardware. Signed-off-by: Anthony Liguori --- v1 -> v2 - Rename parameter to yield_on_hlt - Remove __read_mostly diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index caa967e..d8310e4 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO); static int __read_mostly vmm_exclusive = 1; module_param(vmm_exclusive, bool, S_IRUGO); +static int yield_on_hlt = 1; +module_param(yield_on_hlt, bool, S_IRUGO); + #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \ (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) #define KVM_GUEST_CR0_MASK\ @@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) &_pin_based_exec_control)< 0) return -EIO; - min = CPU_BASED_HLT_EXITING | + min = #ifdef CONFIG_X86_64 CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING | @@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING | CPU_BASED_INVLPG_EXITING; + + if (yield_on_hlt) + min |= CPU_BASED_HLT_EXITING; + opt = CPU_BASED_TPR_SHADOW | CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; -- 1.7.0.4 Breaks async PF (see "checks on guest state"), Sorry, I don't follow what you mean here. Can you elaborate? timer reinjection probably. 
Timer reinjection will continue to work as expected. If a guest is halted and an external interrupt is delivered (by a timer), the guest will still exit as expected. I can't think of anything that would be functionally correct and still depend on getting hlt exits, because ultimately a guest never actually has to do a hlt (and certainly there are guests that won't). It should be possible to achieve determinism with a scheduler policy? If the ultimate desire is to have the guests be scheduled in a non-work-conserving fashion, I can't see a more direct approach than to simply not have the guests yield (which is ultimately what hlt trapping does). Anything the scheduler would do is after the fact and probably based on inference about why the yield happened. Regards, Anthony Liguori
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On Thu, Dec 02, 2010 at 07:59:17AM -0600, Anthony Liguori wrote: > In certain use-cases, we want to allocate guests fixed time slices where idle > guest cycles leave the machine idling. There are many approaches to achieve > this but the most direct is to simply avoid trapping the HLT instruction which > lets the guest directly execute the instruction putting the processor to > sleep. > > Introduce this as a module-level option for kvm-vmx.ko since if you do this > for one guest, you probably want to do it for all. A similar option is > possible > for AMD but I don't have easy access to AMD test hardware. > > Signed-off-by: Anthony Liguori > --- > v1 -> v2 > - Rename parameter to yield_on_hlt > - Remove __read_mostly > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index caa967e..d8310e4 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO); > static int __read_mostly vmm_exclusive = 1; > module_param(vmm_exclusive, bool, S_IRUGO); > > +static int yield_on_hlt = 1; > +module_param(yield_on_hlt, bool, S_IRUGO); > + > #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST > \ > (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) > #define KVM_GUEST_CR0_MASK \ > @@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config > *vmcs_conf) > &_pin_based_exec_control) < 0) > return -EIO; > > - min = CPU_BASED_HLT_EXITING | > + min = > #ifdef CONFIG_X86_64 > CPU_BASED_CR8_LOAD_EXITING | > CPU_BASED_CR8_STORE_EXITING | > @@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config > *vmcs_conf) > CPU_BASED_MWAIT_EXITING | > CPU_BASED_MONITOR_EXITING | > CPU_BASED_INVLPG_EXITING; > + > + if (yield_on_hlt) > + min |= CPU_BASED_HLT_EXITING; > + > opt = CPU_BASED_TPR_SHADOW | > CPU_BASED_USE_MSR_BITMAPS | > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; > -- > 1.7.0.4 Breaks async PF (see "checks on guest state"), timer reinjection probably. 
It should be possible to achieve determinism with a scheduler policy?
Re: [SeaBIOS] [PATCHv6 00/16] boot order specification
On Thu, Dec 02, 2010 at 04:07:16PM +0100, Peter Stuge wrote: > Gleb Natapov wrote: > > BBS specification is broken since it doesn't provide a way for > > discovered boot method (BCV) to be linked back to a device it will > > boot from. Nothing we can do to fix this except moving to EFI (and > > hope the problem is fixed there). > > There is that option, or there could be some simple improvement of > our own, which works in QEMU and maybe even adds value to coreboot. > But then there would be a bit of novel development in firmware - that > can't be a good thing, right? > I am all for novel development in firmware, but unfortunately I do not see what we can do in SeaBIOS + qemu to fix this shortcoming. The problem should be fixed in each and every option rom. An option rom may set the device address somewhere in its pnp header, for instance. > > > Spec says that in that case user probably will want to adjust boot > > order anyway and will enter boot menu by itself. Sorry excuse for > > failing to provide proper functionality if you ask me :) > > I agree. I can not believe the absolute resistance to innovation in > this field. > Interested parties want everyone to move to EFI I guess. > Isn't the scope of BBS logic limited to boot time? (There are calls > to do settings, but that's no problem.) > > Maybe it would be possible for SeaBIOS to provide what looks like BBS > to the guest, but on the other side there is something more > intelligent going on, be it together with QEMU or coreboot? > > I don't see how it can be done without cooperation with option roms. -- Gleb.
Re: Mask bit support's API
On Thu, Dec 02, 2010 at 10:54:24PM +0800, Sheng Yang wrote: > On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin wrote: > > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: > >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > >> >> > >> >> Which case? the readl() doesn't need access to the routing table, > >> >> just the entry. > >> > > >> >One thing that read should do is flush in the outstanding > >> >interrupts and flush out the mask bit writes. > >> > >> The mask bit writes are synchronous. > >> > >> wrt interrupts, we can deal with assigned devices, and can poll > >> irqfds. But we can't force vhost-net to issue an interrupt (and I > >> don't think it's required). > > > > To clarify: > > > > mask write > > read > > > > it is safe for guest to assume no more interrupts > > > > where as with a simple > > mask write > > > > an interrupt might be in flight and get delivered shortly afterwards. > > I think it's already contained in the current patchset. > > > > > >> >> Oh, I think there is a terminology problem, I was talking about > >> >> kvm's irq routing table, you were talking about the msix entries. > >> >> > >> >> I think treating it as a cache causes more problems, since there are > >> >> now two paths for reads (in cache and not in cache) and more things > >> >> for writes to manage. > >> >> > >> >> Here's my proposed API: > >> >> > >> >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > >> >> pending_bitmap_base_gpa) > >> >> > >> >> - called when the guest enables msix > >> > > >> >I would add virtual addresses so that we can use swappable memory to > >> >store the state. > >> > >> Right. > > Do we need synchronization between kernel and userspace? Any recommended > method? > >> > >> >If we do, maybe we can just keep the table there and then > >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? > >> > >> Still need to to let userspace know it needs to reprogram the irqfd > >> or whatever it uses to inject the interrupt. 
> > > > Why do we need to reprogram irqfd? I thought irqfd would map to an > > entry within the table instead of address/data as now. > > Could you clarify please? > > > > > >> >> KVM_REMOVE_MSIX_TABLE(table_id) > >> >> > >> >> - called when the guest disables msix > >> >> > >> >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > >> >> > >> >> - called when the guest enables msix (to initialize it), or after > >> >> live migration > >> > > >> >What is entry_id here? > >> > >> Entry within the table. > > > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is > > enabled (note: it can not be called at boot anyway since pa > > depends on BAR assigned by BIOS). > > Don't agree. MMIO can be write regardless of if MSIX is enabled. If > you handle MMIO to kernel, them handle them all. Hmm, does this mean we would need ioctls to enable/disable MSIX? I think passing control from userspace to kernel when MSIX is enabled is not too bad. Will think about it. > I suppose qemu still > got control of BAR? Yes. So e.g. if memory is disabled in the device then we must disable this as well, and not claim these transactions as mmio. > Then leave it in the current place should be fine. current place? At the moment qemu does not notify devices when memory is disabled, only when it is enabled. So handling msix enable/disable is more straight-forward. > > > >> >> Michael? I think that should work for virtio and vfio assigned > >> >> devices? Not sure about pending bits. > >> > > >> >Pending bits must be tracked in kernel, but I don't see > >> >how we can support polling mode if we don't exit to userspace > >> >on pending bit reads. > >> > > >> >This does mean that some reads will be fast and some will be > >> >slow, and it's a bit sad that we seem to be optimizing > >> >for specific guests, but I just can't come up with > >> >anything better. > >> > > >> > >> If the pending bits live in userspace memory, the device model can > >> update them directly? 
> > > > Note that these are updated on an interrupt, so updating them > > in userspace would need get_user_page etc trickery, > > and add the overhead of atomics. > > > > Further I think it's important to avoid the overhead of updating them > > all the time, and only do this when an interrupt is > > masked or on pending bits read. Since userspace does not know > > when interrupts are masked, this means do update on each read. > > In fact qemu's accessing to MMIO should be quite rare after moving all > the things to the kernel. Using IOCTL is also fine with me. > > And how to "do update on each read"? When read of pending bits is detected, we could forward it up to qemu. Qemu could check internal device state and clear bits that are no longer relevant. > -- > regards, > Yang, Sheng > > > >> >So maybe just add an ioctl to get and to clear pending bits. > >> >Maybe set for symmetry. > >> > >> For live migration too. But if they live in memory,
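The table/entry API being debated above can be made concrete with a sketch. Nothing below exists in KVM; the structure names, the field layout, and the user-virtual-address fields (added so the state can live in swappable memory, as suggested in the thread) are all hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical argument layout for the proposed KVM_DEFINE_MSIX_TABLE
 * and KVM_SET_MSIX_ENTRY ioctls.  Not a real KVM ABI. */
struct kvm_msix_table_def {
    uint32_t table_id;
    uint32_t nr_entries;
    uint64_t msix_base_uva;      /* user virtual address of the entries */
    uint64_t pending_bitmap_uva; /* user virtual address of the PBA */
};

struct kvm_msix_entry {
    uint32_t table_id;
    uint32_t entry_id;  /* index within the table */
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t data;
    uint32_t control;   /* bit 0: per-vector mask */
};
```

Using virtual addresses instead of guest physical addresses is what would let the kernel keep the table in swappable user memory; userspace would still need an exit or ioctl to learn when an entry changed so it can reprogram its irqfd routing, which is the open question in the thread.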
Re: [PATCH 2/6] KVM: SVM: Add manipulation functions for CRx intercepts
On Tue, Nov 30, 2010 at 06:03:57PM +0100, Joerg Roedel wrote: > This patch wraps changes to the CRx intercepts of SVM into > separate functions to abstract nested-svm better and prepare > the implementation of the vmcb-clean-bits feature. > > Signed-off-by: Joerg Roedel > --- > arch/x86/include/asm/svm.h | 15 +++-- > arch/x86/kvm/svm.c | 119 > +++- > 2 files changed, 72 insertions(+), 62 deletions(-) > > - control->intercept_cr_read =INTERCEPT_CR0_MASK | > - INTERCEPT_CR3_MASK | > - INTERCEPT_CR4_MASK; > - > - control->intercept_cr_write = INTERCEPT_CR0_MASK | > - INTERCEPT_CR3_MASK | > - INTERCEPT_CR4_MASK | > - INTERCEPT_CR8_MASK; > + set_cr_intercept(svm, INTERCEPT_CR0_READ); > + set_cr_intercept(svm, INTERCEPT_CR3_READ); > + set_cr_intercept(svm, INTERCEPT_CR4_READ); > + set_cr_intercept(svm, INTERCEPT_CR0_WRITE); > + set_cr_intercept(svm, INTERCEPT_CR3_WRITE); > + set_cr_intercept(svm, INTERCEPT_CR4_WRITE); > + set_cr_intercept(svm, INTERCEPT_CR8_WRITE); Should clear hflags before using is_guest_mode().
Re: [PATCH 2/3] Add "broadcast" option for mce command.
On Tue, Nov 30, 2010 at 05:14:13PM +0900, Jin Dongming wrote: > When the following test case is injected with mce command, maybe user could > not > get the expected result. > DATA >command cpu bank status mcg_status addr misc > (qemu) mce 1 10xbd00 0x050x1234 0x8c > > Expected Result >panic type: "Fatal Machine check" > > That is because each mce command can only inject the given cpu and could not > inject mce interrupt to other cpus. So user will get the following result: > panic type: "Fatal machine check on current CPU" > > "broadcast" option is used for injecting dummy data into other cpus. Injecting > mce with this option the expected result could be gotten. > > Signed-off-by: Jin Dongming > --- > cpu-all.h |3 ++- > hmp-commands.hx |6 +++--- > monitor.c | 18 -- > target-i386/helper.c | 17 +++-- > target-i386/kvm.c | 12 ++-- > target-i386/kvm_x86.h |2 +- > 6 files changed, 47 insertions(+), 11 deletions(-) > > diff --git a/cpu-all.h b/cpu-all.h > index 11edddc..a3b3cd8 100644 > --- a/cpu-all.h > +++ b/cpu-all.h > @@ -967,6 +967,7 @@ int cpu_memory_rw_debug(CPUState *env, target_ulong addr, > uint8_t *buf, int len, int is_write); > > void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, > -uint64_t mcg_status, uint64_t addr, uint64_t misc); > +uint64_t mcg_status, uint64_t addr, uint64_t misc, > +int broadcast); > > #endif /* CPU_ALL_H */ > diff --git a/hmp-commands.hx b/hmp-commands.hx > index ba6de28..44ad571 100644 > --- a/hmp-commands.hx > +++ b/hmp-commands.hx > @@ -1053,9 +1053,9 @@ ETEXI > > { > .name = "mce", > -.args_type = > "cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l", > -.params = "cpu bank status mcgstatus addr misc", > -.help = "inject a MCE on the given CPU", > +.args_type = > "cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l,broadcast:s?", > +.params = "cpu bank status mcgstatus addr misc [broadcast|b]", > +.help = "inject a MCE on the given CPU [and broadcast to other > CPUs]", > .mhandler.cmd = do_inject_mce, > }, 
> > diff --git a/monitor.c b/monitor.c > index 66d6acd..e316eb6 100644 > --- a/monitor.c > +++ b/monitor.c > @@ -2263,12 +2263,26 @@ static void do_inject_mce(Monitor *mon, const QDict > *qdict) > uint64_t mcg_status = qdict_get_int(qdict, "mcg_status"); > uint64_t addr = qdict_get_int(qdict, "addr"); > uint64_t misc = qdict_get_int(qdict, "misc"); > +const char *b = qdict_get_try_str(qdict, "broadcast"); qdict_get_try_bool is easier. > +int broadcast = 0; > > -for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu) > +if (b) { > +if (!strncmp(b, "broadcast", sizeof("broadcast")) || > +!strncmp(b, "b", sizeof("b"))) { > +broadcast = 1; > +} else { > +monitor_printf(mon, "unexpected option: %s\n", b); > +return; > +} > +} > + > +for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu) { > if (cenv->cpu_index == cpu_index && cenv->mcg_cap) { > -cpu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc); > +cpu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, > + broadcast); > break; > } > +} > } > #endif > > diff --git a/target-i386/helper.c b/target-i386/helper.c > index e384fdc..7e07ebd 100644 > --- a/target-i386/helper.c > +++ b/target-i386/helper.c > @@ -1071,17 +1071,30 @@ static void qemu_inject_x86_mce(CPUState *cenv, int > bank, uint64_t status, > } > > void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, > -uint64_t mcg_status, uint64_t addr, uint64_t misc) > +uint64_t mcg_status, uint64_t addr, uint64_t misc, > +int broadcast) > { > unsigned bank_num = cenv->mcg_cap & 0xff; > +CPUState *env; > +int flag = 0; > > if (bank >= bank_num || !(status & MCI_STATUS_VAL)) > return; > > if (kvm_enabled()) { > -kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, 0); > +if (broadcast) > +flag |= 0x02; /* bit 1: 1(broadcast); 0(not broadcast) */ > +kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, flag); > } else { > qemu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc); > +if (broadcast) { > +for (env 
= first_cpu; env != NULL; env = env->next_cpu) { > +if (cenv == env) > +continue; > + > +qemu_inject_x86_mce(env, 1, 0xa000, 0, 0,
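The patch above folds the broadcast option into bit 1 of the flag passed to kvm_inject_x86_mce(). A minimal sketch of that encoding (the helper names are illustrative, not from the patch):

```c
#include <assert.h>

/* Bit 1 of the kvm_inject_x86_mce() flag argument, as defined in the
 * patch above: 1 = broadcast the MCE to all CPUs, 0 = single CPU. */
#define MCE_BROADCAST_FLAG 0x02

static int mce_make_flag(int broadcast)
{
    return broadcast ? MCE_BROADCAST_FLAG : 0;
}

static int mce_is_broadcast(int flag)
{
    return (flag & MCE_BROADCAST_FLAG) != 0;
}
```

With this encoding the existing zero-flag callers keep their old (non-broadcast) behaviour, and bit 0 remains free for whatever it already carried.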
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
Actually CCing Rik now! On Thu, Dec 02, 2010 at 08:57:16PM +0530, Srivatsa Vaddagiri wrote: > On Thu, Dec 02, 2010 at 03:49:44PM +0200, Avi Kivity wrote: > > On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote: > > >On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote: > > >> >> What I'd like to see in directed yield is donating exactly the > > >> >> amount of vruntime that's needed to make the target thread run. > > >> > > > >> >I presume this requires the target vcpu to move left in rb-tree to run > > >> >earlier than scheduled currently and that it doesn't involve any > > >> >change to the sched_period() of target vcpu? > > >> > > > >> >Just was wondering how this would work in case of buggy guests. Lets > > >> say that a > > >> >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > > >> >currently), while VCPU spins on lock A (held by VCPU0 currently). Both > > >> keep > > >> >boosting each other's vruntime, potentially affecting fairtime for > > >> other guests > > >> >(to the point of starving them perhaps)? > > >> > > >> We preserve vruntime overall. If you give vruntime to someone, it > > >> comes at your own expense. Overall vruntime is preserved. > > > > > >Hmm ..so I presume that this means we don't affect target thread's > > >position in > > >rb-tree upon donation, rather we influence its sched_period() to include > > >donated time? IOW donation has no effect on causing the target thread to > > >run > > >"immediately", rather it will have the effect of causing it run "longer" > > >whenever it runs next? > > > > No. The intent (at least mine, maybe Rik has other ideas) is to > > CCing Rik now .. > > > move some vruntime from current to target such that target would be > > placed before current in the timeline. > > Well ok, then this is what I had presumed earlier (about shifting target > towards > left in rb-tree). 
> > > >Even that would require some precaution in directed yield to ensure that it > > >doesn't unduly inflate vruntime of target, hurting fairness for other > > >guests on > > >same cpu as target (example guest code that can lead to this situation > > >below): > > > > > >vcpu0: vcpu1: > > > > > > spinlock(A); > > > > > >spinlock(A); > > > > > > while(1) > > > ; > > > > > > spin_unlock(A); > > > > directed yield should preserve the invariant that sum(vruntime) does > > not change. > > Hmm don't think I understand this invariant sum() part. Let's take a simple > example as below: > > > p0-> A0 B0 A1 > > p1-> B1 C0 C1 > > A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus > > Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace > deliberately). When A0 spins and exits (due to PLE) what does its directed > yield > do? Going by your statement, it can put target before current, leading to > perhaps this arrangement in runqueue: > > p0-> A1 B0 A0 > > Now A1 spins and wants to do a directed yield back to A0, leading to: > > p0-> A0 B0 A1 > > This can go back and forth, starving B0 (iow leading to some sort of DoS > attack). > > Where does the "invariant sum" part of directed yield kick in to avoid such > nastiness? > > - vatsa
[PATCH 2/2] KVM: VMX: Return 0 from a failed VMREAD
If we execute VMREAD during reboot we'll just skip over it. Instead of returning garbage, return 0, which has a much smaller chance of confusing the code. Otherwise we risk a flood of debug printk()s which block the reboot process if a serial console or netconsole is enabled. Signed-off-by: Avi Kivity --- arch/x86/kvm/vmx.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index caa967e..72cfdb7 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -565,10 +565,10 @@ static inline void ept_sync_individual_addr(u64 eptp, gpa_t gpa) static unsigned long vmcs_readl(unsigned long field) { - unsigned long value; + unsigned long value = 0; asm volatile (__ex(ASM_VMX_VMREAD_RDX_RAX) - : "=a"(value) : "d"(field) : "cc"); + : "+a"(value) : "d"(field) : "cc"); return value; } -- 1.7.1
[PATCH 1/2] KVM: Don't spin on virt instruction faults during reboot
Since vmx blocks INIT signals, we disable virtualization extensions during reboot. This leads to virtualization instructions faulting; we trap these faults and spin while the reboot continues. Unfortunately spinning on a non-preemptible kernel may block a task that reboot depends on; this causes the reboot to hang. Fix by skipping over the instruction and hoping for the best. Signed-off-by: Avi Kivity --- arch/x86/include/asm/kvm_host.h |8 ++-- virt/kvm/kvm_main.c | 13 - 2 files changed, 10 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 0e64a39..3987e1f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -790,14 +790,18 @@ enum { * reboot turns off virtualization while processes are running. * Trap the fault and ignore the instruction if that happens. */ -asmlinkage void kvm_handle_fault_on_reboot(void); +asmlinkage void kvm_spurious_fault(void); +extern bool kvm_rebooting; #define __kvm_handle_fault_on_reboot(insn) \ "666: " insn "\n\t" \ + "668: \n\t" \ ".pushsection .fixup, \"ax\" \n" \ "667: \n\t" \ + "cmpb $0, kvm_rebooting \n\t" \ + "jne 668b \n\t" \ __ASM_SIZE(push) " $666b \n\t"\ - "jmp kvm_handle_fault_on_reboot \n\t" \ + "call kvm_spurious_fault \n\t"\ ".popsection \n\t" \ ".pushsection __ex_table, \"a\" \n\t" \ _ASM_PTR " 666b, 667b \n\t" \ diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index c4ee364..83f5bf6 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -90,7 +90,8 @@ static void hardware_disable_all(void); static void kvm_io_bus_destroy(struct kvm_io_bus *bus); -static bool kvm_rebooting; +bool kvm_rebooting; +EXPORT_SYMBOL_GPL(kvm_rebooting); static bool largepages_enabled = true; @@ -2179,18 +2180,12 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val, } -asmlinkage void kvm_handle_fault_on_reboot(void) +asmlinkage void kvm_spurious_fault(void) { - if (kvm_rebooting) { - /* spin while reset goes 
on */ - local_irq_enable(); - while (true) - cpu_relax(); - } /* Fault while not rebooting. We want the trace. */ BUG(); } -EXPORT_SYMBOL_GPL(kvm_handle_fault_on_reboot); +EXPORT_SYMBOL_GPL(kvm_spurious_fault); static int kvm_reboot(struct notifier_block *notifier, unsigned long val, void *v) -- 1.7.1
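The control flow that the rewritten fixup implements can be summarized in plain C. This is only an illustration of the asm above (the cmpb/jne test against kvm_rebooting), not kernel code; the enum and function names are invented for the sketch:

```c
#include <assert.h>

enum fault_action { FAULT_SKIP_INSN, FAULT_BUG };

/* What the fixup section decides when a virtualization instruction
 * faults: during reboot VMX has already been turned off, so the
 * fault is expected and execution resumes after the instruction
 * (the "jne 668b" path); any other fault is a genuine bug and
 * kvm_spurious_fault() hits BUG(). */
static enum fault_action handle_virt_fault(int rebooting)
{
    return rebooting ? FAULT_SKIP_INSN : FAULT_BUG;
}
```

Compared to the old handler, no path spins anymore, which is exactly what unblocks reboot on a non-preemptible kernel.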
[PATCH 0/2] Fix reboot on non-preemptible kernels
Reboot with guests running, on Intel hosts, with a non-preemptible host kernel is broken. This patchset fixes the issue. Avi Kivity (2): KVM: Don't spin on virt instruction faults during reboot KVM: VMX: Return 0 from a failed VMREAD arch/x86/include/asm/kvm_host.h |8 ++-- arch/x86/kvm/vmx.c |4 ++-- virt/kvm/kvm_main.c | 13 - 3 files changed, 12 insertions(+), 13 deletions(-)
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 05:33:40PM +0200, Avi Kivity wrote: > A0 and A1's vruntime will keep growing, eventually B will become > leftmost and become runnable (assuming leftmost == min vruntime, not > sure what the terminology is). Donation (in directed yield) will cause vruntime to drop as well (that's the only way target can get to run ahead of its scheduled time), so I still think there are nasty issues involved here. Anyway, I am curious to see the directed yield implementation that Rik has - I can comment more after I have seen that! - vatsa
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/02/2010 05:27 PM, Srivatsa Vaddagiri wrote: > >Even that would require some precaution in directed yield to ensure that it > >doesn't unduly inflate vruntime of target, hurting fairness for other guests on > >same cpu as target (example guest code that can lead to this situation > >below): > > > >vcpu0:vcpu1: > > > > spinlock(A); > > > >spinlock(A); > > > > while(1) > > ; > > > > spin_unlock(A); > > directed yield should preserve the invariant that sum(vruntime) does > not change. Hmm don't think I understand this invariant sum() part. Let's take a simple example as below: p0 -> A0 B0 A1 p1 -> B1 C0 C1 A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace deliberately). When A0 spins and exits (due to PLE) what does its directed yield do? Going by your statement, it can put target before current, leading to perhaps this arrangement in runqueue: p0 -> A1 B0 A0 Now A1 spins and wants to do a directed yield back to A0, leading to: p0 -> A0 B0 A1 This can go back and forth, starving B0 (iow leading to some sort of DoS attack). Where does the "invariant sum" part of directed yield kick in to avoid such nastiness? A0 and A1's vruntime will keep growing, eventually B will become leftmost and become runnable (assuming leftmost == min vruntime, not sure what the terminology is). -- error compiling committee.c: too many arguments to function
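The "sum(vruntime) does not change" invariant under discussion can be sketched as a plain transfer: the yielding vcpu takes on extra vruntime and the target gives some up, so the target moves left in the timeline while the runqueue total is preserved. A toy model, not the CFS implementation (names invented for the sketch):

```c
#include <assert.h>

struct toy_vcpu {
    unsigned long long vruntime; /* lower = scheduled earlier */
};

/* Directed yield as described in the thread: donation comes at the
 * yielder's expense, so the sum of vruntime over the runqueue is
 * unchanged by the yield itself. */
static void directed_yield(struct toy_vcpu *from, struct toy_vcpu *to,
                           unsigned long long delta)
{
    from->vruntime += delta; /* yielder is pushed right in the timeline */
    to->vruntime   -= delta; /* target is pulled left, runs sooner */
}
```

The model also shows why the deadlock worry is bounded: two spinning vcpus can trade delta back and forth, but they still accrue vruntime while actually running, so a third thread's position eventually becomes leftmost, which is the point Avi makes about A0/A1 growing until B runs.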
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 03:49:44PM +0200, Avi Kivity wrote: > On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote: > >On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote: > >> >> What I'd like to see in directed yield is donating exactly the > >> >> amount of vruntime that's needed to make the target thread run. > >> > > >> >I presume this requires the target vcpu to move left in rb-tree to run > >> >earlier than scheduled currently and that it doesn't involve any > >> >change to the sched_period() of target vcpu? > >> > > >> >Just was wondering how this would work in case of buggy guests. Lets say > >> that a > >> >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > >> >currently), while VCPU spins on lock A (held by VCPU0 currently). Both > >> keep > >> >boosting each other's vruntime, potentially affecting fairtime for other > >> guests > >> >(to the point of starving them perhaps)? > >> > >> We preserve vruntime overall. If you give vruntime to someone, it > >> comes at your own expense. Overall vruntime is preserved. > > > >Hmm ..so I presume that this means we don't affect target thread's position > >in > >rb-tree upon donation, rather we influence its sched_period() to include > >donated time? IOW donation has no effect on causing the target thread to run > >"immediately", rather it will have the effect of causing it run "longer" > >whenever it runs next? > > No. The intent (at least mine, maybe Rik has other ideas) is to CCing Rik now .. > move some vruntime from current to target such that target would be > placed before current in the timeline. Well ok, then this is what I had presumed earlier (about shifting target towards left in rb-tree). 
> >Even that would require some precaution in directed yield to ensure that it > >doesn't unduly inflate vruntime of target, hurting fairness for other guests > >on > >same cpu as target (example guest code that can lead to this situation > >below): > > > >vcpu0: vcpu1: > > > > spinlock(A); > > > >spinlock(A); > > > > while(1) > > ; > > > > spin_unlock(A); > > directed yield should preserve the invariant that sum(vruntime) does > not change. Hmm don't think I understand this invariant sum() part. Let's take a simple example as below: p0 -> A0 B0 A1 p1 -> B1 C0 C1 A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace deliberately). When A0 spins and exits (due to PLE) what does its directed yield do? Going by your statement, it can put target before current, leading to perhaps this arrangement in runqueue: p0 -> A1 B0 A0 Now A1 spins and wants to do a directed yield back to A0, leading to: p0 -> A0 B0 A1 This can go back and forth, starving B0 (iow leading to some sort of DoS attack). Where does the "invariant sum" part of directed yield kick in to avoid such nastiness? - vatsa
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 08:39 AM, lidong chen wrote: In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. i could not understand why need this? can you tell more detailedly? If you run 4 guests on a CPU, and they're all trying to consume 100% CPU, all things being equal, you'll get ~25% CPU for each guest. However, if one guest is idle, you'll get something like 1% 32% 33% 32%. This characteristic is usually desirable because it increases aggregate throughput, but in some circumstances, determinism is more desirable than aggregate throughput. This patch essentially makes guest execution non-work-conserving by making it appear to the scheduler that each guest wants 100% CPU even though they may be idling. That means that regardless of what each guest is doing, if you have four guests on one CPU, each will get ~25% CPU[1]. [1] there are corner cases around things like forced sleep due to PFs and the like. The goal is not 100% determinism, but to at least obtain significantly more determinism than we have now. Regards, Anthony Liguori thanks. 2010/12/2 Anthony Liguori: In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. There are many approaches to achieve this but the most direct is to simply avoid trapping the HLT instruction which lets the guest directly execute the instruction putting the processor to sleep. Introduce this as a module-level option for kvm-vmx.ko since if you do this for one guest, you probably want to do it for all. A similar option is possible for AMD but I don't have easy access to AMD test hardware.
Signed-off-by: Anthony Liguori --- v1 -> v2 - Rename parameter to yield_on_hlt - Remove __read_mostly diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index caa967e..d8310e4 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO); static int __read_mostly vmm_exclusive = 1; module_param(vmm_exclusive, bool, S_IRUGO); +static int yield_on_hlt = 1; +module_param(yield_on_hlt, bool, S_IRUGO); + #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \ (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) #define KVM_GUEST_CR0_MASK \ @@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) &_pin_based_exec_control)< 0) return -EIO; - min = CPU_BASED_HLT_EXITING | + min = #ifdef CONFIG_X86_64 CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING | @@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING | CPU_BASED_INVLPG_EXITING; + + if (yield_on_hlt) + min |= CPU_BASED_HLT_EXITING; + opt = CPU_BASED_TPR_SHADOW | CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; -- 1.7.0.4
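The "~25% each" versus "1% 32% 33% 32%" arithmetic in the explanation above can be sketched directly. The helper names are illustrative only; this is not scheduler code:

```c
#include <assert.h>

/* Non-work-conserving (yield_on_hlt = 0): every guest looks 100%
 * busy to the scheduler, so each of N guests keeps a fixed 1/N
 * slice whether or not it idles. */
static int fixed_share_pct(int nr_guests)
{
    return 100 / nr_guests;
}

/* Work-conserving (the default): cycles an idle guest gives up are
 * redistributed among the guests that still want to run. */
static int work_conserving_share_pct(int nr_busy)
{
    return nr_busy ? 100 / nr_busy : 0;
}
```

With four guests and one idling, the three busy guests each get roughly 100/3 ≈ 33% under the default, but a fixed 25% with yield_on_hlt disabled, which is the determinism the patch trades throughput for.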
Re: [SeaBIOS] [PATCHv6 00/16] boot order specification
Gleb Natapov wrote: > BBS specification is broken since it doesn't provide a way for > discovered boot method (BCV) to be linked back to a device it will > boot from. Nothing we can do to fix this except moving to EFI (and > hope the problem is fixed there). There is that option, or there could be some simple improvement of our own, which works in QEMU and maybe even adds value to coreboot. But then there would be a bit of novel development in firmware - that can't be a good thing, right? > Spec says that in that case user probably will want to adjust boot > order anyway and will enter boot menu by itself. Sorry excuse for > failing to provide proper functionality if you ask me :) I agree. I can not believe the absolute resistance to innovation in this field. Isn't the scope of BBS logic limited to boot time? (There are calls to do settings, but that's no problem.) Maybe it would be possible for SeaBIOS to provide what looks like BBS to the guest, but on the other side there is something more intelligent going on, be it together with QEMU or coreboot? //Peter
Re: Mask bit support's API
On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: >> >> >> >> Which case? the readl() doesn't need access to the routing table, >> >> just the entry. >> > >> >One thing that read should do is flush in the outstanding >> >interrupts and flush out the mask bit writes. >> >> The mask bit writes are synchronous. >> >> wrt interrupts, we can deal with assigned devices, and can poll >> irqfds. But we can't force vhost-net to issue an interrupt (and I >> don't think it's required). > > To clarify: > > mask write > read > > it is safe for guest to assume no more interrupts > > where as with a simple > mask write > > an interrupt might be in flight and get delivered shortly afterwards. I think it's already contained in the current patchset. > > >> >> Oh, I think there is a terminology problem, I was talking about >> >> kvm's irq routing table, you were talking about the msix entries. >> >> >> >> I think treating it as a cache causes more problems, since there are >> >> now two paths for reads (in cache and not in cache) and more things >> >> for writes to manage. >> >> >> >> Here's my proposed API: >> >> >> >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, >> >> pending_bitmap_base_gpa) >> >> >> >> - called when the guest enables msix >> > >> >I would add virtual addresses so that we can use swappable memory to >> >store the state. >> >> Right. Do we need synchronization between kernel and userspace? Any recommended method? >> >> >If we do, maybe we can just keep the table there and then >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? >> >> Still need to to let userspace know it needs to reprogram the irqfd >> or whatever it uses to inject the interrupt. > > Why do we need to reprogram irqfd? I thought irqfd would map to an > entry within the table instead of address/data as now. > Could you clarify please? 
> > >> >> KVM_REMOVE_MSIX_TABLE(table_id) >> >> >> >> - called when the guest disables msix >> >> >> >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) >> >> >> >> - called when the guest enables msix (to initialize it), or after >> >> live migration >> > >> >What is entry_id here? >> >> Entry within the table. > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is > enabled (note: it can not be called at boot anyway since pa > depends on BAR assigned by BIOS). Don't agree. MMIO can be written regardless of whether MSI-X is enabled. If you handle MMIO in the kernel, then handle it all. I suppose qemu still has control of the BAR? Then leaving it in its current place should be fine. > >> >> Michael? I think that should work for virtio and vfio assigned >> >> devices? Not sure about pending bits. >> > >> >Pending bits must be tracked in kernel, but I don't see >> >how we can support polling mode if we don't exit to userspace >> >on pending bit reads. >> > >> >This does mean that some reads will be fast and some will be >> >slow, and it's a bit sad that we seem to be optimizing >> >for specific guests, but I just can't come up with >> >anything better. >> > >> >> If the pending bits live in userspace memory, the device model can >> update them directly? > > Note that these are updated on an interrupt, so updating them > in userspace would need get_user_page etc trickery, > and add the overhead of atomics. > > Further I think it's important to avoid the overhead of updating them > all the time, and only do this when an interrupt is > masked or on pending bits read. Since userspace does not know > when interrupts are masked, this means do update on each read. In fact qemu's accesses to MMIO should be quite rare after moving everything to the kernel. Using an ioctl is also fine with me. And how would we "do update on each read"? -- regards, Yang, Sheng > >> >So maybe just add an ioctl to get and to clear pending bits. >> >Maybe set for symmetry. >> >> For live migration too. 
But if they live in memory, no need for >> get/set, just specify the address. >> >> -- >> error compiling committee.c: too many arguments to function
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. I could not understand why this is needed. Can you explain in more detail? Thanks. 2010/12/2 Anthony Liguori : > In certain use-cases, we want to allocate guests fixed time slices where idle > guest cycles leave the machine idling. There are many approaches to achieve > this but the most direct is to simply avoid trapping the HLT instruction which > lets the guest directly execute the instruction putting the processor to > sleep. > > Introduce this as a module-level option for kvm-vmx.ko since if you do this > for one guest, you probably want to do it for all. A similar option is > possible > for AMD but I don't have easy access to AMD test hardware. > > Signed-off-by: Anthony Liguori > --- > v1 -> v2 > - Rename parameter to yield_on_hlt > - Remove __read_mostly > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index caa967e..d8310e4 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO); > static int __read_mostly vmm_exclusive = 1; > module_param(vmm_exclusive, bool, S_IRUGO); > > +static int yield_on_hlt = 1; > +module_param(yield_on_hlt, bool, S_IRUGO); > + > #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \ > (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) > #define KVM_GUEST_CR0_MASK \ > @@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config > *vmcs_conf) > &_pin_based_exec_control) < 0) > return -EIO; > > - min = CPU_BASED_HLT_EXITING | > + min = > #ifdef CONFIG_X86_64 > CPU_BASED_CR8_LOAD_EXITING | > CPU_BASED_CR8_STORE_EXITING | > @@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config > *vmcs_conf) > CPU_BASED_MWAIT_EXITING | > CPU_BASED_MONITOR_EXITING | > CPU_BASED_INVLPG_EXITING; > + > + if (yield_on_hlt) > + min |= CPU_BASED_HLT_EXITING; > + > opt = CPU_BASED_TPR_SHADOW | > CPU_BASED_USE_MSR_BITMAPS | > 
CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; > -- > 1.7.0.4
Re: Mask bit support's API
On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: > On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > >> > >> Which case? the readl() doesn't need access to the routing table, > >> just the entry. > > > >One thing that read should do is flush in the outstanding > >interrupts and flush out the mask bit writes. > > The mask bit writes are synchronous. > > wrt interrupts, we can deal with assigned devices, and can poll > irqfds. But we can't force vhost-net to issue an interrupt (and I > don't think it's required). To clarify: mask write read it is safe for guest to assume no more interrupts where as with a simple mask write an interrupt might be in flight and get delivered shortly afterwards. > >> Oh, I think there is a terminology problem, I was talking about > >> kvm's irq routing table, you were talking about the msix entries. > >> > >> I think treating it as a cache causes more problems, since there are > >> now two paths for reads (in cache and not in cache) and more things > >> for writes to manage. > >> > >> Here's my proposed API: > >> > >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > >> pending_bitmap_base_gpa) > >> > >> - called when the guest enables msix > > > >I would add virtual addresses so that we can use swappable memory to > >store the state. > > Right. > > >If we do, maybe we can just keep the table there and then > >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? > > Still need to to let userspace know it needs to reprogram the irqfd > or whatever it uses to inject the interrupt. Why do we need to reprogram irqfd? I thought irqfd would map to an entry within the table instead of address/data as now. Could you clarify please? > >> KVM_REMOVE_MSIX_TABLE(table_id) > >> > >>- called when the guest disables msix > >> > >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > >> > >>- called when the guest enables msix (to initialize it), or after > >> live migration > > > >What is entry_id here? 
> > Entry within the table. So I think KVM_DEFINE_MSIX_TABLE should be called when msix is enabled (note: it can not be called at boot anyway since pa depends on BAR assigned by BIOS). > >> Michael? I think that should work for virtio and vfio assigned > >> devices? Not sure about pending bits. > > > >Pending bits must be tracked in kernel, but I don't see > >how we can support polling mode if we don't exit to userspace > >on pending bit reads. > > > >This does mean that some reads will be fast and some will be > >slow, and it's a bit sad that we seem to be optimizing > >for specific guests, but I just can't come up with > >anything better. > > > > If the pending bits live in userspace memory, the device model can > update them directly? Note that these are updated on an interrupt, so updating them in userspace would need get_user_page etc trickery, and add the overhead of atomics. Further I think it's important to avoid the overhead of updating them all the time, and only do this when an interrupt is masked or on pending bits read. Since userspace does not know when interrupts are masked, this means do update on each read. > >So maybe just add an ioctl to get and to clear pending bits. > >Maybe set for symmetry. > > For live migration too. But if they live in memory, no need for > get/set, just specify the address. > > -- > error compiling committee.c: too many arguments to function
Re: Performance test result between virtio_pci MSI-X disable and enable
I applied the patch correctly. The addr is not in the mmio range, because kvm_io_bus_write tests the addr against each device:

/* kvm_io_bus_write - called under kvm->slots_lock */
int kvm_io_bus_write(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
		     int len, const void *val)
{
	int i;
	struct kvm_io_bus *bus = rcu_dereference(kvm->buses[bus_idx]);

	for (i = 0; i < bus->dev_count; i++)
		if (!kvm_iodevice_write(bus->devs[i], addr, len, val))
			return 0;
	return -EOPNOTSUPP;
}

the test result (cpu rate) before applying the patch:
       host os   guest os
CPU0   15.85     58.05
CPU1    2.97     70.56
CPU2    3.25     69.00
CPU3    3.31     68.59
CPU4    5.11     47.46
CPU5    4.70     29.28
CPU6    6.00      5.96
CPU7    4.75      2.58

the test result (cpu rate) after applying the patch:
       host os   guest os
CPU0   11.89     60.92
CPU1    2.18     68.07
CPU2    2.61     66.18
CPU3    2.44     66.46
CPU4    2.67     44.98
CPU5    2.31     26.81
CPU6    1.93      5.65
CPU7    1.79      2.14

The total cpu rate is reduced from 397% to 369%. I pinned the vcpus as below, and only vcpu0 handles the msix interrupt:

virsh vcpupin brd2vm1sriov 0 0
virsh vcpupin brd2vm1sriov 1 1
virsh vcpupin brd2vm1sriov 2 2
virsh vcpupin brd2vm1sriov 3 3
virsh vcpupin brd2vm1sriov 4 4
virsh vcpupin brd2vm1sriov 5 5
virsh vcpupin brd2vm1sriov 6 6
virsh vcpupin brd2vm1sriov 7 7

I will test virtio_net with msix enabled later, and compare it with msix disabled, to see which is better. 2010/12/2 Michael S. Tsirkin : > On Thu, Dec 02, 2010 at 07:52:00PM +0800, Sheng Yang wrote: >> On Thu, Dec 2, 2010 at 5:49 PM, Michael S. Tsirkin wrote: >> > On Thu, Dec 02, 2010 at 09:13:28AM +0800, Yang, Sheng wrote: >> >> On Wednesday 01 December 2010 22:03:58 Michael S. Tsirkin wrote: >> >> > On Wed, Dec 01, 2010 at 04:41:38PM +0800, lidong chen wrote: >> >> > > I used sr-iov, give each vm 2 vf. >> >> > > after apply the patch, and i found performence is the same. >> >> > > >> >> > > the reason is in function msix_mmio_write, mostly addr is not in mmio >> >> > > range. 
>> >> > > >> >> > > static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int >> >> > > len, >> >> > > >> >> > > const void *val) >> >> > > >> >> > > { >> >> > > >> >> > > struct kvm_assigned_dev_kernel *adev = >> >> > > >> >> > > container_of(this, struct kvm_assigned_dev_kernel, >> >> > > >> >> > > msix_mmio_dev); >> >> > > >> >> > > int idx, r = 0; >> >> > > unsigned long new_val = *(unsigned long *)val; >> >> > > >> >> > > mutex_lock(&adev->kvm->lock); >> >> > > if (!msix_mmio_in_range(adev, addr, len)) { >> >> > > >> >> > > // return here. >> >> > > >> >> > > r = -EOPNOTSUPP; >> >> > > >> >> > > goto out; >> >> > > >> >> > > } >> >> > > >> >> > > i printk the value: >> >> > > addr start end len >> >> > > F004C00C F0044000 F0044030 4 >> >> > > >> >> > > 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed >> >> > > (rev >> >> > > 01) >> >> > > >> >> > > Subsystem: Intel Corporation Unknown device 000c >> >> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> >> > > >> >> > > Stepping- SERR- FastB2B- >> >> > > >> >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> >> > > >> >> > > SERR- > >> > > >> >> > > Latency: 0 >> >> > > Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K] >> >> > > Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K] >> >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 >> >> > > >> >> > > Vector table: BAR=3 offset= >> >> > > PBA: BAR=3 offset=2000 >> >> > > >> >> > > 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed >> >> > > (rev >> >> > > 01) >> >> > > >> >> > > Subsystem: Intel Corporation Unknown device 000c >> >> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> >> > > >> >> > > Stepping- SERR- 
FastB2B- >> >> > > >> >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> >> > > >> >> > > SERR- > >> > > >> >> > > Latency: 0 >> >> > > Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K] >> >> > > Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K] >> >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 >> >> > > >> >> > > Vector table: BAR=3 offset= >> >> > > PBA: BAR=3 offset=2000 >> >> > > >> >> > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev, >> >> > > + gpa_t addr, int len) >> >> > > +{ >> >> > > + gpa_t start, end; >> >> > > + >> >> > > + BUG_ON(adev->msix_mmio_base == 0); >> >
[PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. There are many approaches to achieve this but the most direct is to simply avoid trapping the HLT instruction which lets the guest directly execute the instruction putting the processor to sleep. Introduce this as a module-level option for kvm-vmx.ko since if you do this for one guest, you probably want to do it for all. A similar option is possible for AMD but I don't have easy access to AMD test hardware. Signed-off-by: Anthony Liguori --- v1 -> v2 - Rename parameter to yield_on_hlt - Remove __read_mostly diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index caa967e..d8310e4 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO); static int __read_mostly vmm_exclusive = 1; module_param(vmm_exclusive, bool, S_IRUGO); +static int yield_on_hlt = 1; +module_param(yield_on_hlt, bool, S_IRUGO); + #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \ (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) #define KVM_GUEST_CR0_MASK \ @@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) &_pin_based_exec_control) < 0) return -EIO; - min = CPU_BASED_HLT_EXITING | + min = #ifdef CONFIG_X86_64 CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING | @@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING | CPU_BASED_INVLPG_EXITING; + + if (yield_on_hlt) + min |= CPU_BASED_HLT_EXITING; + opt = CPU_BASED_TPR_SHADOW | CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; -- 1.7.0.4
Re: Mask bit support's API
On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > > Which case? the readl() doesn't need access to the routing table, > just the entry. One thing that read should do is flush in the outstanding interrupts and flush out the mask bit writes. The mask bit writes are synchronous. wrt interrupts, we can deal with assigned devices, and can poll irqfds. But we can't force vhost-net to issue an interrupt (and I don't think it's required). > Oh, I think there is a terminology problem, I was talking about > kvm's irq routing table, you were talking about the msix entries. > > I think treating it as a cache causes more problems, since there are > now two paths for reads (in cache and not in cache) and more things > for writes to manage. > > Here's my proposed API: > > KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > pending_bitmap_base_gpa) > > - called when the guest enables msix I would add virtual addresses so that we can use swappable memory to store the state. Right. If we do, maybe we can just keep the table there and then KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? Still need to let userspace know it needs to reprogram the irqfd or whatever it uses to inject the interrupt. > KVM_REMOVE_MSIX_TABLE(table_id) > >- called when the guest disables msix > > KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > >- called when the guest enables msix (to initialize it), or after > live migration What is entry_id here? Entry within the table. > Michael? I think that should work for virtio and vfio assigned > devices? Not sure about pending bits. Pending bits must be tracked in kernel, but I don't see how we can support polling mode if we don't exit to userspace on pending bit reads. This does mean that some reads will be fast and some will be slow, and it's a bit sad that we seem to be optimizing for specific guests, but I just can't come up with anything better. If the pending bits live in userspace memory, the device model can update them directly? 
So maybe just add an ioctl to get and to clear pending bits. Maybe set for symmetry. For live migration too. But if they live in memory, no need for get/set, just specify the address. -- error compiling committee.c: too many arguments to function
Re: Introduce a C++ wrapper for the kvm APIs
On 24/11/10 18:10 +0200, Avi Kivity wrote: > On 11/24/2010 05:50 PM, Anthony Liguori wrote: > My answer is that C++ is the only language that allows you to evolve > away from C, with mixed C/C++ source (not just linkage level > compatibility). If there are others, I want to know about them. Objective C is good (I like it better than C++) if you can live with the message-passing architecture. Mike
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote: On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote: > >> What I'd like to see in directed yield is donating exactly the > >> amount of vruntime that's needed to make the target thread run. > > > >I presume this requires the target vcpu to move left in rb-tree to run > >earlier than scheduled currently and that it doesn't involve any > >change to the sched_period() of target vcpu? > > > >Just was wondering how this would work in case of buggy guests. Lets say that a > >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep > >boosting each other's vruntime, potentially affecting fairtime for other guests > >(to the point of starving them perhaps)? > > We preserve vruntime overall. If you give vruntime to someone, it > comes at your own expense. Overall vruntime is preserved. Hmm ..so I presume that this means we don't affect target thread's position in rb-tree upon donation, rather we influence its sched_period() to include donated time? IOW donation has no effect on causing the target thread to run "immediately", rather it will have the effect of causing it run "longer" whenever it runs next? No. The intent (at least mine, maybe Rik has other ideas) is to move some vruntime from current to target such that target would be placed before current in the timeline. Even that would require some precaution in directed yield to ensure that it doesn't unduly inflate vruntime of target, hurting fairness for other guests on same cpu as target (example guest code that can lead to this situation below): vcpu0: vcpu1: spinlock(A); spinlock(A); while(1) ; spin_unlock(A); directed yield should preserve the invariant that sum(vruntime) does not change. 
-- error compiling committee.c: too many arguments to function
Re: Mask bit support's API
On Thu, Dec 02, 2010 at 03:09:43PM +0200, Avi Kivity wrote: > On 12/01/2010 04:36 AM, Yang, Sheng wrote: > >On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote: > >> On 11/26/2010 04:35 AM, Yang, Sheng wrote: > >> > > > Shouldn't kvm also service reads from the pending bitmask? > >> > > > >> > > Of course KVM should service reading from pending bitmask. For > >> > > assigned device, it's kernel who would set the pending bit; but I > >> am > >> > > not sure for virtio. This interface is GET_ENTRY, so reading is > >> fine > >> > > with it. > >> > >> The kernel should manage it in the same way. Virtio raises irq (via > >> KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit. > >> > >> Note we need to be able to read and write the pending bitmask for live > >> migration. > > > >Then seems we still need to an writing interface for it. And I think we can > >work > >on it later since it got no user now. > > I'm not sure. Suppose a guest starts using pending bits. Does it > mean it will not work with kernel msix acceleration? > > If that is the case, then we need pending bit support now. I don't > want to knowingly merge something that doesn't conform to the spec, > forcing users to choose whether they want to enable kernel msix > acceleration or not. > > > > >> > >> Because we have an interface where you get an exit if (addr % 4)< 3 and > >> don't get an exit if (addr % 4) == 3. There is a gpa range which is > >> partially maintained by the kernel and partially in userspace. It's a > >> confusing interface. Things like 64-bit reads or writes need to be > >> broken up and serviced in two different places. > >> > >> We already need to support this (for unaligned writes which hit two > >> regions), but let's at least make a contiguous region behave sanely. > > > >Oh, I didn't mean to handle this kind of unaligned writing in userspace. > >They're > >illegal according to the PCI spec(otherwise the result is undefined > >according to > >the spec). 
I would cover all contiguous writing(32-bit and 64-bit) in the > >next > >version, and discard all illegal writing. And 64-bit accessing would be > >broken up > >in qemu as you said, as they do currently. > > > >In fact I think we can handle all data for 64-bit to qemu, because it should > >still > >sync the mask bit with kernel, which make the maskbit writing in userspace > >useless > >and can be ignored. > > What about reads? Split those as well? > > >> The reason is to keep a sane interface. Like we emulate instructions > >> and msrs in the kernel and don't do half a job. I don't think there's a > >> real need to accelerate the first three words of an msi-x entry. > > > >Here is the case we've observed with Xen. It can only be reproduced by large > >scale > >testing. When the interrupt intensity is very high, even new kernels would > >try to > >make it lower, then it would access mask bit very frequently. And in the > >kernel, > >msi_set_mask_bit() is like this: > > > >static void msi_set_mask_bit(struct irq_data *data, u32 flag) > >{ > > struct msi_desc *desc = irq_data_get_msi(data); > > > > if (desc->msi_attrib.is_msix) { > > msix_mask_irq(desc, flag); > > readl(desc->mask_base); /* Flush write to device */ > > } else { > > unsigned offset = data->irq - desc->dev->irq; > > msi_mask_irq(desc, 1<< offset, flag<< offset); > > } > >} > > > >That flush by readl() would complied with each MSI-X mask writing. That is > >the only > >place we want to accelerate, otherwise it would still fall back to qemu each > >time > >when guest want to mask the MSI-X entry. > > So it seems we do want to accelerate reads to the entire entry. > This seems to give more weight to making the kernel own the entire > entry. > > >> > >> > And BTW, we can take routing table as a kind of *cache*, if the content > >> > is in the cache, then we can fetch it from the cache, otherwise we need > >> > to go back to fetch it from memory(userspace). 
> >> > >> If it's guaranteed by the spec that addr/data pairs are always > >> interpreted in the same way, sure. But there no reason to do it, > >> really, it isn't a fast path. > > > >I am not quite understand the first sentence... But it's a fast path, in the > >case I > >showed above. > > Which case? the readl() doesn't need access to the routing table, > just the entry. One thing that read should do is flush in the outstanding interrupts and flush out the mask bit writes. > Oh, I think there is a terminology problem, I was talking about > kvm's irq routing table, you were talking about the msix entries. > > I think treating it as a cache causes more problems, since there are > now two paths for reads (in cache and not in cache) and more things > for writes to manage. > > Here's my proposed API: > > KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, >
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote: > >> What I'd like to see in directed yield is donating exactly the > >> amount of vruntime that's needed to make the target thread run. > > > >I presume this requires the target vcpu to move left in rb-tree to run > >earlier than scheduled currently and that it doesn't involve any > >change to the sched_period() of target vcpu? > > > >Just was wondering how this would work in case of buggy guests. Lets say > >that a > >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep > >boosting each other's vruntime, potentially affecting fairtime for other > >guests > >(to the point of starving them perhaps)? > > We preserve vruntime overall. If you give vruntime to someone, it > comes at your own expense. Overall vruntime is preserved. Hmm ..so I presume that this means we don't affect target thread's position in rb-tree upon donation, rather we influence its sched_period() to include donated time? IOW donation has no effect on causing the target thread to run "immediately", rather it will have the effect of causing it run "longer" whenever it runs next? Even that would require some precaution in directed yield to ensure that it doesn't unduly inflate vruntime of target, hurting fairness for other guests on same cpu as target (example guest code that can lead to this situation below): vcpu0: vcpu1: spinlock(A); spinlock(A); while(1) ; spin_unlock(A); - vatsa
Re: Mask bit support's API
On 12/01/2010 04:36 AM, Yang, Sheng wrote: On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote: > On 11/26/2010 04:35 AM, Yang, Sheng wrote: > > > > Shouldn't kvm also service reads from the pending bitmask? > > > > > > Of course KVM should service reading from pending bitmask. For > > > assigned device, it's kernel who would set the pending bit; but I am > > > not sure for virtio. This interface is GET_ENTRY, so reading is fine > > > with it. > > The kernel should manage it in the same way. Virtio raises irq (via > KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit. > > Note we need to be able to read and write the pending bitmask for live > migration. Then it seems we still need a writing interface for it. And I think we can work on it later since it has no user now. I'm not sure. Suppose a guest starts using pending bits. Does it mean it will not work with kernel msix acceleration? If that is the case, then we need pending bit support now. I don't want to knowingly merge something that doesn't conform to the spec, forcing users to choose whether they want to enable kernel msix acceleration or not. > > Because we have an interface where you get an exit if (addr % 4)< 3 and > don't get an exit if (addr % 4) == 3. There is a gpa range which is > partially maintained by the kernel and partially in userspace. It's a > confusing interface. Things like 64-bit reads or writes need to be > broken up and serviced in two different places. > > We already need to support this (for unaligned writes which hit two > regions), but let's at least make a contiguous region behave sanely. Oh, I didn't mean to handle this kind of unaligned writing in userspace. They're illegal according to the PCI spec(otherwise the result is undefined according to the spec). I would cover all contiguous writing(32-bit and 64-bit) in the next version, and discard all illegal writing. And 64-bit accessing would be broken up in qemu as you said, as they do currently. 
In fact I think we can handle all data for 64-bit to qemu, because it should still sync the mask bit with kernel, which makes the mask-bit writing in userspace useless, so it can be ignored. What about reads? Split those as well? > The reason is to keep a sane interface. Like we emulate instructions > and msrs in the kernel and don't do half a job. I don't think there's a > real need to accelerate the first three words of an msi-x entry. Here is the case we've observed with Xen. It can only be reproduced by large scale testing. When the interrupt intensity is very high, even new kernels would try to make it lower, then it would access the mask bit very frequently. And in the kernel, msi_set_mask_bit() is like this:

static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
	struct msi_desc *desc = irq_data_get_msi(data);

	if (desc->msi_attrib.is_msix) {
		msix_mask_irq(desc, flag);
		readl(desc->mask_base);	/* Flush write to device */
	} else {
		unsigned offset = data->irq - desc->dev->irq;
		msi_mask_irq(desc, 1 << offset, flag << offset);
	}
}

That flush by readl() comes with each MSI-X mask write. That is the only place we want to accelerate, otherwise it would still fall back to qemu each time the guest wants to mask the MSI-X entry. So it seems we do want to accelerate reads to the entire entry. This seems to give more weight to making the kernel own the entire entry. > > > And BTW, we can take routing table as a kind of *cache*, if the content > > is in the cache, then we can fetch it from the cache, otherwise we need > > to go back to fetch it from memory(userspace). > > If it's guaranteed by the spec that addr/data pairs are always > interpreted in the same way, sure. But there no reason to do it, > really, it isn't a fast path. I don't quite understand the first sentence... But it's a fast path, in the case I showed above. Which case? the readl() doesn't need access to the routing table, just the entry. 
Oh, I think there is a terminology problem: I was talking about kvm's irq routing table, you were talking about the msix entries.

I think treating it as a cache causes more problems, since there are now two paths for reads (in cache and not in cache) and more things for writes to manage.

Here's my proposed API:

KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, pending_bitmap_base_gpa)
 - called when the guest enables msix

KVM_REMOVE_MSIX_TABLE(table_id)
 - called when the guest disables msix

KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)
 - called when the guest enables msix (to initialize it), or after live migration

KVM_GET_MSIX_ENTRY(table_id, entry_id, contents)
 - called when the guest updates an msix table (triggered by KVM_EXIT_MSIX_ENTRY_UPDATED)

KVM_RUN -> KVM_EXIT_MSIX_ENTRY_UPDATED
 - returned from KV
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/02/2010 02:19 PM, Srivatsa Vaddagiri wrote: On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > What I'd like to see in directed yield is donating exactly the > amount of vruntime that's needed to make the target thread run. The How would that work well with hard-limits? The target thread would have been rate limited and no amount of vruntime donation would make it work "immediately" .. At least it would stop the donator from running. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/02/2010 01:47 PM, Srivatsa Vaddagiri wrote: On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > On 12/01/2010 09:09 PM, Peter Zijlstra wrote: > >> > >> We are dealing with just one task here (the task that is yielding). > >> After recording how much timeslice we are "giving up" in current->donate_time > >> (donate_time is perhaps not the right name to use), we adjust the yielding > >> task's vruntime as per existing logic (for ex: to make it go to back of > >> runqueue). When the yielding task gets to run again, the lock is hopefully > >> available for it to grab, and we let it run longer than the default sched_slice() > >> to compensate for the time it gave up previously to other threads in the same > >> runqueue. This ensures that because of yielding upon lock contention, we are not > >> leaking bandwidth in favor of other guests. Again I don't know how much of a > >> fairness issue this is in practice, so unless we see some numbers I'd prefer > >> sticking to plain yield() upon lock-contention (for unmodified guests that is). > > > >No, that won't work. Once you've given up time you cannot add it back > >without destroying fairness. Over shorter intervals perhaps. Over a longer interval (a few seconds to a couple of minutes), fairness should not be affected by this feedback? In any case, don't we have similar issues with directed yield as well? Directed yield works by donating vruntime to another thread. If you don't have vruntime, you can't donate it (well, you're guaranteed to have some, since you're running, but if all you have is a microsecond's worth, that's what you can donate). > > What I'd like to see in directed yield is donating exactly the > amount of vruntime that's needed to make the target thread run. I presume this requires the target vcpu to move left in the rb-tree to run earlier than currently scheduled, and that it doesn't involve any change to the sched_period() of the target vcpu? I was just wondering how this would work in case of buggy guests. 
Let's say that a guest ran into an AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 currently), while VCPU1 spins on lock A (held by VCPU0 currently). Both keep boosting each other's vruntime, potentially affecting fairness for other guests (to the point of starving them, perhaps)? We preserve vruntime overall. If you give vruntime to someone, it comes at your own expense. Overall vruntime is preserved. -- error compiling committee.c: too many arguments to function
Re: [PATCHv6 00/16] boot order specification
On Wed, Dec 01, 2010 at 09:25:40PM -0500, Kevin O'Connor wrote: > On Wed, Dec 01, 2010 at 02:27:40PM +0200, Gleb Natapov wrote: > > On Tue, Nov 30, 2010 at 09:53:32PM -0500, Kevin O'Connor wrote: > > > BTW, what's the plan for handling SCSI adapters? Let's say a user has > > > a scsi card with three drives (lun 1, lun 3, lun 5) that show up as 3 > > > bcvs (lun1, lun3, lun5 in that order) and the user wishes to boot from > > > lun3. I understand this use case may not be important for qemu, but > > > I'd like to use the same code on coreboot. Being able to boot from > > > any drive is important - it doesn't have to autodetect, but it should > > > be possible. > > > > > We can't. Option rom does not give us back enough information to do so. > > From the device path we know exactly what id:lun boot should be done from, > > but pnp_data does not have information to map BCV back to id:lun. I do > > not see how coreboot can provide better information to Seabios than > > qemu here. > > You're thinking in terms of which device to boot, which does make this > difficult. However, it's equally valid to think in terms of which > boot method to invoke, which makes this easy. It is not. Because boot methods are enumerated by a guest at runtime. Qemu knows absolutely nothing about them. I am thinking in terms of devices because this is the only thing I have in qemu. The BBS specification is broken since it doesn't provide a way for a discovered boot method (BCV) to be linked back to the device it will boot from. Nothing we can do to fix this except moving to EFI (and hope the problem is fixed there). > > We could tell the coreboot user to edit the "bootorder" file and add > "/p...@i0cf8/r...@4" (second rom on 4th pci device - the exact syntax > of the name is not important). > But how should the user know that the second rom (I think you mean "second BCV") on pci device 4.0 will boot from the new scsi cdrom that he just connected? 
How can he tell that it should put the second BCV there and not the third or fifth without running Seabios first and looking at the F12 menu? Depending on the option rom implementation, even the F12 menu may not have enough information for the user to guess which boot entry he should use to boot from the device, because the product name that is used to distinguish between different boot entries should be unique, but need not provide useful information about the device according to the spec. (Both gpxe and lsi 8xx_64.rom provide a device address as part of the product name; gpxe puts the pci address there and 8xx_64.rom puts ID LUN.) Another important point is that the BCV number can change when a new device is added to the scsi bus, so the next boot will mysteriously fail. An important property of the device path is that it is stable and does not depend on the HW configuration. > >BTW to create a proper EDD entry for a SCSI boot device the BIOS also > > needs to map BCV to id:lun. How can it be done? > > It's the responsibility of the rom to build the EDD info. I don't > know if all roms do this - I don't believe it's possible to get at the > EDD info until after the drive has been mapped (ie, too late to use it > for boot ordering). How can we get to the EDD info after the device is mapped? Looking at the Seabios implementation, it builds the EDD table on the fly when int_1348 is called, and it does it only for internal devices. Can we use the "disconnect vector" to connect the device temporarily, get the EDD, and then disconnect? > > > SCSI _is_ important for qemu. Not the HBA we have now, but something supported > > by recent operating systems out of the box. When such an HBA is emulated > > in qemu we will add native Seabios support for it. > > I understand. However, we'll still need to support arbitrary rom > based BEVs and BCVs, so the use case is still important. > We can't do something that is impossible. 
For coreboot, Seabios should implement what the BBS spec says, i.e. enumerate all BCVs, present a boot menu to the user, and record the number of BCVs and the user's choice in non-volatile storage (CMOS). On the next boot, use the recorded choice if the number of BCVs is unchanged, otherwise present the boot menu again. This is of course broken too, since the number of BCVs may not change even if the HW configuration changes substantially. The spec says that in that case the user will probably want to adjust the boot order anyway and will enter the boot menu himself. A sorry excuse for failing to provide proper functionality, if you ask me :) -- Gleb.
Re: Performance test result between virtio_pci MSI-X disable and enable
On Thu, Dec 02, 2010 at 07:52:00PM +0800, Sheng Yang wrote: > On Thu, Dec 2, 2010 at 5:49 PM, Michael S. Tsirkin wrote: > > On Thu, Dec 02, 2010 at 09:13:28AM +0800, Yang, Sheng wrote: > >> On Wednesday 01 December 2010 22:03:58 Michael S. Tsirkin wrote: > >> > On Wed, Dec 01, 2010 at 04:41:38PM +0800, lidong chen wrote: > >> > > I used sr-iov, give each vm 2 vf. > >> > > after apply the patch, and i found performence is the same. > >> > > > >> > > the reason is in function msix_mmio_write, mostly addr is not in mmio > >> > > range. > >> > > > >> > > static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int > >> > > len, > >> > > > >> > > const void *val) > >> > > > >> > > { > >> > > > >> > > struct kvm_assigned_dev_kernel *adev = > >> > > > >> > > container_of(this, struct kvm_assigned_dev_kernel, > >> > > > >> > > msix_mmio_dev); > >> > > > >> > > int idx, r = 0; > >> > > unsigned long new_val = *(unsigned long *)val; > >> > > > >> > > mutex_lock(&adev->kvm->lock); > >> > > if (!msix_mmio_in_range(adev, addr, len)) { > >> > > > >> > > // return here. 
> >> > > > >> > > r = -EOPNOTSUPP; > >> > > > >> > > goto out; > >> > > > >> > > } > >> > > > >> > > i printk the value: > >> > > addr start end len > >> > > F004C00C F0044000 F0044030 4 > >> > > > >> > > 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev > >> > > 01) > >> > > > >> > > Subsystem: Intel Corporation Unknown device 000c > >> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > >> > > > >> > > Stepping- SERR- FastB2B- > >> > > > >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > >> > > > >> > > SERR- >> > > > >> > > Latency: 0 > >> > > Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K] > >> > > Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K] > >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 > >> > > > >> > > Vector table: BAR=3 offset= > >> > > PBA: BAR=3 offset=2000 > >> > > > >> > > 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev > >> > > 01) > >> > > > >> > > Subsystem: Intel Corporation Unknown device 000c > >> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > >> > > > >> > > Stepping- SERR- FastB2B- > >> > > > >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > >> > > > >> > > SERR- >> > > > >> > > Latency: 0 > >> > > Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K] > >> > > Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K] > >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 > >> > > > >> > > Vector table: BAR=3 offset= > >> > > PBA: BAR=3 offset=2000 > >> > > > >> > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev, > >> > > + gpa_t addr, int len) > >> > > +{ > >> > > + gpa_t start, end; > >> > > + > >> > > + BUG_ON(adev->msix_mmio_base == 0); > >> > > + start = adev->msix_mmio_base; > >> > > + end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE * > >> > > + adev->msix_max_entries_nr; > >> > > + if (addr >= start && 
addr + len <= end) > >> > > + return true; > >> > > + > >> > > + return false; > >> > > +} > >> > > >> > Hmm, this check looks wrong to me: there's no guarantee > >> > that the guest uses the first N entries in the table. > >> > E.g. it could use a single entry, but only the last one. > >> > >> Please check the PCI spec. > > > > > > This is pretty explicit in the spec: the last paragraph in the below: > > > > IMPLEMENTATION NOTE > > Handling MSI-X Vector Shortages > > > > For the case where fewer vectors are allocated to a function than desired, > > You may not notice the premise here. I noticed it. > Also a check for "Table Size" would > help I think. It would help if msix_max_entries_nr was the MSI-X Table Size N (encoded in PCI config space as N - 1). Is this what it is? Maybe add this in some comment. > -- > regards, > Yang, Sheng > > software-controlled aliasing as enabled by MSI-X is one approach for handling the > > situation. For > > example, if a function supports five queues, each with an associated MSI-X > > table entry, but > > only three vectors are allocated, the function could be designed for > > software still to configure > > all five table entries, assigning one or more vectors to multiple table > > entries. Software could > > assign the three vectors {A,B,C} to the five entries as ABCCC, ABBCC, > > ABCBA, or other > > similar combinations. > > > > > > Alternatively, the function could be designed for software to configure it > > (using a device-specific mechanism) to use only three queues and three MSI-X table entries. Soft
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 05:17:00PM +0530, Srivatsa Vaddagiri wrote: > I was just wondering how this would work in case of buggy guests. Let's say that > a guest ran into an AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > currently), while VCPU1 spins on lock A (held by VCPU0 currently). Both keep > boosting each other's vruntime, potentially affecting fairness for other > guests (to the point of starving them, perhaps)? Guests that exhibit strong spinlock contention can cause similar symptoms as well? - vatsa
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > What I'd like to see in directed yield is donating exactly the > amount of vruntime that's needed to make the target thread run. The How would that work well with hard-limits? The target thread would have been rate limited and no amount of vruntime donation would make it work "immediately" .. - vatsa
Possiblle misleading information of motherboards supporting VT-d
Hello,

I am having a little trouble figuring out which motherboards have support for VT-d. I am following the link mentioned on the KVM site: http://wiki.xensource.com/xenwiki/VTdHowTo

The link says the following desktop boards have VT-d support:

* Intel DQ35JO
* Intel DQ35MP
* Intel DX38BT
* Intel DX48BT2
* Intel DQ45CB (BIOS 0061 required, previous versions are known to cause problems)
* Intel DQ45EK
* Intel DX58SO

I checked on the Intel site to verify that these motherboards have VT-d support, but it seems only a few of them do. The following boards do not have VT-d support:

Intel® Desktop Board DX38BT
http://ark.intel.com/Product.aspx?id=30272

Intel® Desktop Board DX48BT2
http://ark.intel.com/Product.aspx?id=33394

Intel® Desktop Board DX58SO
http://ark.intel.com/Product.aspx?id=36888&code=Intel%C2%AE+Desktop+Board+DX58SO

I was going to buy the Intel DX38BT board; just before placing the order I had a second thought and checked the Intel site. Can someone confirm?

Thanks and Regards,
Prasad Joshi (Student ID: 19015523)
MRes Systems Engineering
Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote: > 2010/11/28 Michael S. Tsirkin : > > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote: > >> 2010/11/28 Michael S. Tsirkin : > >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote: > >> >> Modify inuse type to uint16_t, let save/load to handle, and revert > >> >> last_avail_idx with inuse if there are outstanding emulation. > >> >> > >> >> Signed-off-by: Yoshiaki Tamura > >> > > >> > This changes migration format, so it will break compatibility with > >> > existing drivers. More generally, I think migrating internal > >> > state that is not guest visible is always a mistake > >> > as it ties migration format to an internal implementation > >> > (yes, I know we do this sometimes, but we should at least > >> > try not to add such cases). I think the right thing to do in this case > >> > is to flush outstanding > >> > work when vm is stopped. Then, we are guaranteed that inuse is 0. > >> > I sent patches that do this for virtio net and block. > >> > >> Could you give me the link of your patches? I'd like to test > >> whether they work with Kemari upon failover. If they do, I'm > >> happy to drop this patch. > >> > >> Yoshi > > > > Look for this: > > stable migration image on a stopped vm > > sent on: > > Wed, 24 Nov 2010 17:52:49 +0200 > > Thanks for the info. > > However, The patch series above didn't solve the issue. In > case of Kemari, inuse is mostly > 0 because it queues the > output, and while last_avail_idx gets incremented > immediately, not sending inuse makes the state inconsistent > between Primary and Secondary. Hmm. Can we simply avoid incrementing last_avail_idx? > I'm wondering why > last_avail_idx is OK to send but not inuse. last_avail_idx is at some level a mistake, it exposes part of our internal implementation, but it does *also* express a guest observable state. 
Here's the problem that it solves: just looking at the rings in virtio, there is no way to detect that a specific request has already been completed. And the protocol forbids completing the same request twice.

Our implementation always starts processing the requests in order, and since we flush outstanding requests before save, it works to just tell the remote 'process only requests after this place'. But there's no such requirement in the virtio protocol, so to be really generic we could add a bitmask of valid avail ring entries that did not complete yet. This would be the exact representation of the guest observable state. In practice we have rings of up to 512 entries. That's 64 bytes per ring, not a lot at all. However, if we ever do change the protocol to send the bitmask, we would need some code to resubmit requests out of order, so it's not trivial.

Another minor mistake with last_avail_idx is that it has some redundancy: the high bits in the index (> vq size) are not necessary as they can be derived from the avail idx. There's a consistency check in load, but we really should try to use formats that are always consistent.

> The following patch does the same thing as the original, yet
> keeps the format of the virtio. It shouldn't break live
> migration either because inuse should be 0.
>
> Yoshi

The question is, can you flush to make inuse 0 in kemari too? And if not, how do you handle the fact that some requests are in flight on the primary? 
> diff --git a/hw/virtio.c b/hw/virtio.c
> index c8a0fc6..875c7ca 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>      qemu_put_be32(f, i);
>
>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> +        uint16_t last_avail_idx;
> +
>          if (vdev->vq[i].vring.num == 0)
>              break;
>
> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> +
>          qemu_put_be32(f, vdev->vq[i].vring.num);
>          qemu_put_be64(f, vdev->vq[i].pa);
> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> +        qemu_put_be16s(f, &last_avail_idx);
>          if (vdev->binding->save_queue)
>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>      }

This looks wrong to me. Requests can complete in any order, can they not? So if request 0 did not complete and request 1 did, you send avail - inuse, and on the secondary you will process and complete request 1 a second time, crashing the guest.

> >> >> ---
> >> >>  hw/virtio.c |    8 +++-
> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >> >>
> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> index 849a60f..5509644 100644
> >> >> --- a/hw/virtio.c
> >> >> +++ b/hw/virtio.c
> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >>      VRing vring;
> >> >>      target_phys_addr_t pa;
> >> >>      uint16_t last_avail_idx;
> >> >> -    int inuse;
> >> >> +    uint16_t inuse;
> >> >>      uint16_t vector;
> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >
Re: Performance test result between virtio_pci MSI-X disable and enable
On Thu, Dec 2, 2010 at 5:49 PM, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 09:13:28AM +0800, Yang, Sheng wrote: >> On Wednesday 01 December 2010 22:03:58 Michael S. Tsirkin wrote: >> > On Wed, Dec 01, 2010 at 04:41:38PM +0800, lidong chen wrote: >> > > I used sr-iov, give each vm 2 vf. >> > > after apply the patch, and i found performence is the same. >> > > >> > > the reason is in function msix_mmio_write, mostly addr is not in mmio >> > > range. >> > > >> > > static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int >> > > len, >> > > >> > > const void *val) >> > > >> > > { >> > > >> > > struct kvm_assigned_dev_kernel *adev = >> > > >> > > container_of(this, struct kvm_assigned_dev_kernel, >> > > >> > > msix_mmio_dev); >> > > >> > > int idx, r = 0; >> > > unsigned long new_val = *(unsigned long *)val; >> > > >> > > mutex_lock(&adev->kvm->lock); >> > > if (!msix_mmio_in_range(adev, addr, len)) { >> > > >> > > // return here. >> > > >> > > r = -EOPNOTSUPP; >> > > >> > > goto out; >> > > >> > > } >> > > >> > > i printk the value: >> > > addr start end len >> > > F004C00C F0044000 F0044030 4 >> > > >> > > 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev >> > > 01) >> > > >> > > Subsystem: Intel Corporation Unknown device 000c >> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> > > >> > > Stepping- SERR- FastB2B- >> > > >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> > > >> > > SERR- > > > >> > > Latency: 0 >> > > Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K] >> > > Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K] >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 >> > > >> > > Vector table: BAR=3 offset= >> > > PBA: BAR=3 offset=2000 >> > > >> > > 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev >> > > 01) >> > > >> > > Subsystem: Intel Corporation Unknown device 000c >> > > Control: I/O+ Mem+ BusMaster+ 
SpecCycle- MemWINV- VGASnoop- ParErr- >> > > >> > > Stepping- SERR- FastB2B- >> > > >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> > > >> > > SERR- > > > >> > > Latency: 0 >> > > Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K] >> > > Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K] >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 >> > > >> > > Vector table: BAR=3 offset= >> > > PBA: BAR=3 offset=2000 >> > > >> > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev, >> > > + gpa_t addr, int len) >> > > +{ >> > > + gpa_t start, end; >> > > + >> > > + BUG_ON(adev->msix_mmio_base == 0); >> > > + start = adev->msix_mmio_base; >> > > + end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE * >> > > + adev->msix_max_entries_nr; >> > > + if (addr >= start && addr + len <= end) >> > > + return true; >> > > + >> > > + return false; >> > > +} >> > >> > Hmm, this check looks wrong to me: there's no guarantee >> > that guest uses the first N entries in the table. >> > E.g. it could use a single entry, but only the last one. >> >> Please check the PCI spec. > > > This is pretty explicit in the spec: the the last paragraph in the below: > > IMPLEMENTATION NOTE > Handling MSI-X Vector Shortages > > Handling MSI-X Vector Shortages > For the case where fewer vectors are allocated to a function than desired, You may not notice the premise here. Also check for "Table Size" would help I think. -- regards, Yang, Sheng software- > controlled aliasing as enabled by MSI-X is one approach for handling the > situation. For > example, if a function supports five queues, each with an associated MSI-X > table entry, but > only three vectors are allocated, the function could be designed for software > still to configure > all five table entries, assigning one or more vectors to multiple table > entries. 
Software could > assign the three vectors {A,B,C} to the five entries as ABCCC, ABBCC, ABCBA, > or other > similar combinations. > > > Alternatively, the function could be designed for software to configure it > (using a device- > specific mechanism) to use only three queues and three MSI-X table entries. > Software could > assign the three vectors {A,B,C} to the five entries as ABC--, A-B-C, A--CB, > or other similar > combinations. > > > >> -- >> regards >> Yang, Sheng >> >> >> > > 2010/11/30 Yang, Sheng : >> > > > On Tuesday 30 November 2010 17:10:11 lidong chen wrote: >> > > >> sr-iov also meet this problem, MSIX mask waste a lot of cpu resource. >> > > >> >> > > >> I test kvm with sriov, which the vf driver could not disable msix. >> > > >> so the host os waste a lot of cpu. cpu rate of host os i
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > On 12/01/2010 09:09 PM, Peter Zijlstra wrote: > >> > >> We are dealing with just one task here (the task that is yielding). > >> After recording how much timeslice we are "giving up" in > >> current->donate_time > >> (donate_time is perhaps not the right name to use), we adjust the yielding > >> task's vruntime as per existing logic (for ex: to make it go to back of > >> runqueue). When the yielding task gets to run again, the lock is hopefully > >> available for it to grab, and we let it run longer than the default > >> sched_slice() > >> to compensate for the time it gave up previously to other threads in the same > >> runqueue. This ensures that because of yielding upon lock contention, we > >> are not > >> leaking bandwidth in favor of other guests. Again I don't know how much of a > >> fairness issue this is in practice, so unless we see some numbers I'd > >> prefer > >> sticking to plain yield() upon lock-contention (for unmodified guests > >> that is). > > > >No, that won't work. Once you've given up time you cannot add it back > >without destroying fairness. Over shorter intervals perhaps. Over a longer interval (a few seconds to a couple of minutes), fairness should not be affected by this feedback? In any case, don't we have similar issues with directed yield as well? > >You can limit the unfairness by limiting the amount of feedback, but I > >really dislike such 'yield' semantics. > > Agreed. > > What I'd like to see in directed yield is donating exactly the > amount of vruntime that's needed to make the target thread run. I presume this requires the target vcpu to move left in the rb-tree to run earlier than currently scheduled, and that it doesn't involve any change to the sched_period() of the target vcpu? I was just wondering how this would work in case of buggy guests. Let's say that a guest ran into an AB<->BA deadlock. 
VCPU0 spins on lock B (held by VCPU1 currently), while VCPU1 spins on lock A (held by VCPU0 currently). Both keep boosting each other's vruntime, potentially affecting fairness for other guests (to the point of starving them, perhaps)? - vatsa > The donating thread won't get its vruntime back, unless the other thread > hits contention itself and does a directed yield back. So even if > your lock is ping-ponged around, the guest doesn't lose vruntime > compared to other processes on the host.
[ kvm-Bugs-1808970 ] CDROM eject doesn't work via Qemu Monitor
Bugs item #1808970, was opened at 2007-10-07 16:42 Message generated for change (Comment added) made by jessorensen You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1808970&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None >Status: Closed >Resolution: Works For Me Priority: 5 Private: No Submitted By: Technologov (technologov) Assigned to: Nobody/Anonymous (nobody) Summary: CDROM eject doesn't work via Qemu Monitor Initial Comment: When ejecting CD/DVD via Qemu monitor, then inserting another CD, the first CD is always shown, instead of the newly inserted one, unless I eject the CD from the guest. This bug prevents setup of Windows XP Tablet Edition, because this OS requires 2 CDs to install, and since this is Windows there are no workarounds. Guest: any Windows, especially Windows XP Tablet Edition. Host: Fedora7, 64-bit, Intel CPU, KVM-45. Other KVM versions and -no-kvm didn't help. -Alexey Technologov. 7.oct.2007. -- >Comment By: Jes Sorensen (jessorensen) Date: 2010-12-02 11:30 Message: Works fine here under Fedora 14: A change like this was detected instantly: (qemu) eject drive-ide0-1-0 (qemu) change drive-ide0-1-0 en_winxp_sp1a_tablet_pc_cd1.iso Closing Jes -- Comment By: Wilco (paranoidt) Date: 2008-03-07 17:52 Message: Logged In: YES user_id=902376 Originator: NO It works for me (with some quirks) if I first eject the old CD: eject ide1-cd0 Then change to the new one: change ide1-cd0 /path/to/newiso.iso This does not always seem to work. Sometimes you have to access your CDROM (from the installation) so windows finds no cd. Then repeat these steps. Hope this helps. 
-- Comment By: Technologov (technologov) Date: 2007-10-12 16:58 Message: Logged In: YES user_id=1839746 Originator: YES chenghuan_jia: I have tried that too, with the exception that I used ISO instead of real CD-ROM. -Technologov -- Comment By: Neo Jia (chenghuan_jia) Date: 2007-10-12 07:10 Message: Logged In: YES user_id=446034 Originator: NO Can you try "change cdrom /dev/scd0" after you insert the 2nd disc? Thanks, Neo -- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1808970&group_id=180599 
Re: Performance test result between virtio_pci MSI-X disable and enable
On Thu, Dec 02, 2010 at 09:13:28AM +0800, Yang, Sheng wrote:
> On Wednesday 01 December 2010 22:03:58 Michael S. Tsirkin wrote:
> > On Wed, Dec 01, 2010 at 04:41:38PM +0800, lidong chen wrote:
> > > I used sr-iov, giving each vm 2 vf.
> > > After applying the patch, I found performance is the same.
> > >
> > > The reason is that in function msix_mmio_write, addr is mostly not
> > > in mmio range:
> > >
> > > static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr,
> > >                            int len, const void *val)
> > > {
> > >         struct kvm_assigned_dev_kernel *adev =
> > >                 container_of(this, struct kvm_assigned_dev_kernel,
> > >                              msix_mmio_dev);
> > >         int idx, r = 0;
> > >         unsigned long new_val = *(unsigned long *)val;
> > >
> > >         mutex_lock(&adev->kvm->lock);
> > >         if (!msix_mmio_in_range(adev, addr, len)) {
> > >                 // return here.
> > >                 r = -EOPNOTSUPP;
> > >                 goto out;
> > >         }
> > >
> > > I printed the values:
> > > addr     start    end      len
> > > F004C00C F0044000 F0044030 4
> > >
> > > 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
> > >         Subsystem: Intel Corporation Unknown device 000c
> > >         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
> > >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> > >         Latency: 0
> > >         Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K]
> > >         Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K]
> > >         Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
> > >                 Vector table: BAR=3 offset=
> > >                 PBA: BAR=3 offset=2000
> > >
> > > 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
> > >         Subsystem: Intel Corporation Unknown device 000c
> > >         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
> > >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> > >         Latency: 0
> > >         Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K]
> > >         Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K]
> > >         Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
> > >                 Vector table: BAR=3 offset=
> > >                 PBA: BAR=3 offset=2000
> > >
> > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > > +                              gpa_t addr, int len)
> > > +{
> > > +       gpa_t start, end;
> > > +
> > > +       BUG_ON(adev->msix_mmio_base == 0);
> > > +       start = adev->msix_mmio_base;
> > > +       end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> > > +               adev->msix_max_entries_nr;
> > > +       if (addr >= start && addr + len <= end)
> > > +               return true;
> > > +
> > > +       return false;
> > > +}
> >
> > Hmm, this check looks wrong to me: there's no guarantee
> > that the guest uses the first N entries in the table.
> > E.g. it could use a single entry, but only the last one.
> > Please check the PCI spec.
>
> This is pretty explicit in the spec; see the last paragraph of the
> implementation note below:
>
> IMPLEMENTATION NOTE
> Handling MSI-X Vector Shortages
>
> For the case where fewer vectors are allocated to a function than
> desired, software-controlled aliasing as enabled by MSI-X is one
> approach for handling the situation. For example, if a function
> supports five queues, each with an associated MSI-X table entry, but
> only three vectors are allocated, the function could be designed for
> software still to configure all five table entries, assigning one or
> more vectors to multiple table entries. Software could assign the
> three vectors {A,B,C} to the five entries as ABCCC, ABBCC, ABCBA, or
> other similar combinations.
>
> Alternatively, the function could be designed for software to
> configure it (using a device-specific mechanism) to use only three
> queues and three MSI-X table entries. Software could assign the three
> vectors {A,B,C} to the five entries as ABC--, A-B-C, A--CB, or other
> similar combinations.
>
> --
> regards
> Yang, Sheng
>
> > > 2010/11/30 Yang, Sheng :
> > > > On Tuesday 30 November 2010 17:10:11 lidong chen wrote:
> > > >> sr-iov also meets this problem: MSIX mask wastes a lot of cpu
> > > >> resource.
> > > >>
> > > >> I tested kvm with sriov, where the vf driver could not disable
> > > >> msix, so the host os wasted a lot of cpu. The cpu rate of the
> > > >> host os was 90%.
> > > >>
> > > >> Then I tested xen with sriov; there are also a lot of vm exits
> > > >> caused by MSIX mask, but the cpu rate of xen and domain0 is less
> > > >> than kvm: 60%.
> > > >>
> > > >> without
buildbot failure in qemu-kvm on default_i386_debian_5_0
The Buildbot has detected a new failure of default_i386_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/default_i386_debian_5_0/builds/670 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
buildbot failure in qemu-kvm on default_i386_out_of_tree
The Buildbot has detected a new failure of default_i386_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/default_i386_out_of_tree/builds/607 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
[PATCH v4 3/3] KVM: MMU: retry #PF for softmmu
Retry #PF for softmmu only when the current vcpu has the same cr3 as the time when #PF occurs Changelog: Just compare cr3 value since It's harmless to instantiate an spte for an unused translation from Marcelo's comment Signed-off-by: Xiao Guangrong --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/mmu.c |2 ++ arch/x86/kvm/paging_tmpl.h | 34 +++--- arch/x86/kvm/x86.c |6 +- 4 files changed, 31 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index a0c066e..1e876e5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -602,6 +602,7 @@ struct kvm_x86_ops { struct kvm_arch_async_pf { u32 token; gfn_t gfn; + unsigned long cr3; bool direct_map; }; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index c6bb449..3f0d9a0 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2607,9 +2607,11 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva, static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn) { struct kvm_arch_async_pf arch; + arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id; arch.gfn = gfn; arch.direct_map = vcpu->arch.mmu.direct_map; + arch.cr3 = vcpu->arch.mmu.get_cr3(vcpu); return kvm_setup_async_pf(vcpu, gva, gfn, &arch); } diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 23275d0..437e11a 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -116,7 +116,7 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte) */ static int FNAME(walk_addr_generic)(struct guest_walker *walker, struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - gva_t addr, u32 access) + gva_t addr, u32 access, bool prefault) { pt_element_t pte; gfn_t table_gfn; @@ -194,6 +194,13 @@ walk: #endif if (!eperm && !rsvd_fault && !(pte & PT_ACCESSED_MASK)) { + /* +* Don't set gpte accessed bit if it's on +* speculative path. 
+*/ + if (prefault) + goto error; + trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte)); if (FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, @@ -287,10 +294,11 @@ error: } static int FNAME(walk_addr)(struct guest_walker *walker, - struct kvm_vcpu *vcpu, gva_t addr, u32 access) + struct kvm_vcpu *vcpu, gva_t addr, + u32 access, bool prefault) { return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.mmu, addr, - access); + access, prefault); } static int FNAME(walk_addr_nested)(struct guest_walker *walker, @@ -298,7 +306,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker, u32 access) { return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu, - addr, access); + addr, access, false); } static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu, @@ -438,7 +446,8 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, struct guest_walker *gw, int user_fault, int write_fault, int hlevel, -int *ptwrite, pfn_t pfn, bool map_writable) +int *ptwrite, pfn_t pfn, bool map_writable, +bool prefault) { unsigned access = gw->pt_access; struct kvm_mmu_page *sp = NULL; @@ -512,7 +521,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, mmu_set_spte(vcpu, it.sptep, access, gw->pte_access & access, user_fault, write_fault, dirty, ptwrite, it.level, -gw->gfn, pfn, false, map_writable); +gw->gfn, pfn, prefault, map_writable); FNAME(pte_prefetch)(vcpu, gw, it.sptep); return it.sptep; @@ -561,15 +570,18 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, /* * Look up the guest pte for the faulting address. */ - r = FNAME(walk_addr)(&walker, vcpu, addr, error_code); + r = FNAME(walk_addr)(&walker, vcpu, addr, error_code, prefault); /* * The page is not mapped by the guest. Let the guest handle it. */ if (!r) {
buildbot failure in qemu-kvm on default_x86_64_out_of_tree
The Buildbot has detected a new failure of default_x86_64_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/default_x86_64_out_of_tree/builds/609 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_1 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in qemu-kvm on default_x86_64_debian_5_0
The Buildbot has detected a new failure of default_x86_64_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/default_x86_64_debian_5_0/builds/668 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_1 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in qemu-kvm on disable_kvm_i386_out_of_tree
The Buildbot has detected a new failure of disable_kvm_i386_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_i386_out_of_tree/builds/607 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in qemu-kvm on disable_kvm_x86_64_out_of_tree
The Buildbot has detected a new failure of disable_kvm_x86_64_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_x86_64_out_of_tree/builds/607 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_1 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in qemu-kvm on disable_kvm_i386_debian_5_0
The Buildbot has detected a new failure of disable_kvm_i386_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_i386_debian_5_0/builds/659 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in qemu-kvm on disable_kvm_x86_64_debian_5_0
The Buildbot has detected a new failure of disable_kvm_x86_64_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_x86_64_debian_5_0/builds/658 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_1 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
[PATCH v4 1/3] KVM: MMU: rename 'no_apf' to 'prefault'
It's the speculative path if 'no_apf = 1' and we will specially handle this speculative path in the later patch, so 'prefault' is better to fit the sense Signed-off-by: Xiao Guangrong --- arch/x86/include/asm/kvm_host.h |3 ++- arch/x86/kvm/mmu.c | 18 +- arch/x86/kvm/paging_tmpl.h |4 ++-- 3 files changed, 13 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index e91c692..a0c066e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -241,7 +241,8 @@ struct kvm_mmu { void (*new_cr3)(struct kvm_vcpu *vcpu); void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root); unsigned long (*get_cr3)(struct kvm_vcpu *vcpu); - int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool no_apf); + int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, + bool prefault); void (*inject_page_fault)(struct kvm_vcpu *vcpu); void (*free)(struct kvm_vcpu *vcpu); gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access, diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index e111b79..010736e 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2284,11 +2284,11 @@ static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn) return 1; } -static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, +static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, gva_t gva, pfn_t *pfn, bool write, bool *writable); static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn, -bool no_apf) +bool prefault) { int r; int level; @@ -2310,7 +2310,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn, mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (try_async_pf(vcpu, no_apf, gfn, v, &pfn, write, &map_writable)) + if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable)) return 0; /* mmio */ @@ -2583,7 +2583,7 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t 
vaddr, } static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva, - u32 error_code, bool no_apf) + u32 error_code, bool prefault) { gfn_t gfn; int r; @@ -2599,7 +2599,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva, gfn = gva >> PAGE_SHIFT; return nonpaging_map(vcpu, gva & PAGE_MASK, -error_code & PFERR_WRITE_MASK, gfn, no_apf); +error_code & PFERR_WRITE_MASK, gfn, prefault); } static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn) @@ -2621,7 +2621,7 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu) return kvm_x86_ops->interrupt_allowed(vcpu); } -static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, +static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, gva_t gva, pfn_t *pfn, bool write, bool *writable) { bool async; @@ -2633,7 +2633,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, put_page(pfn_to_page(*pfn)); - if (!no_apf && can_do_async_pf(vcpu)) { + if (!prefault && can_do_async_pf(vcpu)) { trace_kvm_try_async_get_page(gva, gfn); if (kvm_find_async_pf_gfn(vcpu, gfn)) { trace_kvm_async_pf_doublefault(gva, gfn); @@ -2649,7 +2649,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, } static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code, - bool no_apf) + bool prefault) { pfn_t pfn; int r; @@ -2673,7 +2673,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code, mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn, write, &map_writable)) + if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable)) return 0; /* mmio */ diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index ad5a5a2..23275d0 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -539,7 +539,7 @@ out_gpte_changed: * a negative value on error. 
*/ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, -bool no_apf) +bool prefault) { int write_fault = error_code & PFERR_WRITE_MASK; int user_fault = error_code & PFERR_USER_MASK; @@ -581,7 +581,7 @@ static int FNAME(page_fault)(struct kvm_vcpu
[PATCH v4 2/3] KVM: MMU: fix accessed bit set on prefault path
Retry #PF is the speculative path, so don't set the accessed bit Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 10 ++ 1 files changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 010736e..c6bb449 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2214,7 +2214,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep) } static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write, - int map_writable, int level, gfn_t gfn, pfn_t pfn) + int map_writable, int level, gfn_t gfn, pfn_t pfn, + bool prefault) { struct kvm_shadow_walk_iterator iterator; struct kvm_mmu_page *sp; @@ -2229,7 +2230,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write, pte_access &= ~ACC_WRITE_MASK; mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access, 0, write, 1, &pt_write, -level, gfn, pfn, false, map_writable); +level, gfn, pfn, prefault, map_writable); direct_pte_prefetch(vcpu, iterator.sptep); ++vcpu->stat.pf_fixed; break; @@ -2321,7 +2322,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn, if (mmu_notifier_retry(vcpu, mmu_seq)) goto out_unlock; kvm_mmu_free_some_pages(vcpu); - r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn); + r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn, +prefault); spin_unlock(&vcpu->kvm->mmu_lock); @@ -2684,7 +2686,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code, goto out_unlock; kvm_mmu_free_some_pages(vcpu); r = __direct_map(vcpu, gpa, write, map_writable, -level, gfn, pfn); +level, gfn, pfn, prefault); spin_unlock(&vcpu->kvm->mmu_lock); return r; -- 1.7.0.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/01/2010 09:09 PM, Peter Zijlstra wrote: > > We are dealing with just one task here (the task that is yielding). > After recording how much timeslice we are "giving up" in current->donate_time > (donate_time is perhaps not the right name to use), we adjust the yielding > task's vruntime as per existing logic (for ex: to make it go to back of > runqueue). When the yielding tasks gets to run again, lock is hopefully > available for it to grab, we let it run longer than the default sched_slice() > to compensate for what time it gave up previously to other threads in same > runqueue. This ensures that because of yielding upon lock contention, we are not > leaking bandwidth in favor of other guests. Again I don't know how much of > fairness issue this is in practice, so unless we see some numbers I'd prefer > sticking to plain yield() upon lock-contention (for unmodified guests that is). No, that won't work. Once you've given up time you cannot add it back without destroying fairness. You can limit the unfairness by limiting the amount of feedback, but I really dislike such 'yield' semantics. Agreed. What I'd like to see in directed yield is donating exactly the amount of vruntime that's needed to make the target thread run. The donating thread won't get its vruntime back, unless the other thread hits contention itself and does a directed yield back. So even if your lock is ping-ponged around, the guest doesn't lose vruntime compared to other processes on the host. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/01/2010 07:29 PM, Srivatsa Vaddagiri wrote: > > > A plain yield (ignoring no-opiness on Linux) will penalize the > > > running guest wrt other guests. We need to maintain fairness. Avi, any idea how much penalty are we talking of here in using plain yield? If that is acceptable in practice, I'd prefer we use the plain yield rather than add any more sophistication to it .. The first issue with plain yield is that it's a no-op unless some ioctl is enabled. It's simply not very meaningful for CFS. That's why the current PLE implementation uses sleep() (which is incredibly sucky in another way). Even if we do get yield() to work, giving up a timeslice will cause a contending guest to lose way more than its fair share. Consider a lock that is held for 100us every 500us; if each time we detect contention we give up a 3ms timeslice, the guest will grind down to a halt while other guests will happily use its timeslice. With directed yield, the guest continues to run; and any excess time we donated to a vcpu is preserved. > The Xen paravirt spinlock solution is relatively sane, use that. > Unmodified guests suck anyway, there's really nothing much sane you can > do there as you don't know who owns what lock. Hopefully we don't have to deal with unmodified guests for too long. Till that time, plain yield() upon lock-contention seems like the best option we have for such unmodified guests. We will have to deal with unmodified guests forever. Both older Linux and Windows. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/01/2010 09:07 PM, Peter Zijlstra wrote: > > The pause loop exiting& directed yield patches I am working on > preserve inter-vcpu fairness by round robining among the vcpus > inside one KVM guest. I don't necessarily think that's enough. Suppose you've got 4 vcpus, one is holding a lock and 3 are spinning. They'll end up all three donating some time to the 4th. The only way to make that fair again is if due to future contention the 4th cpu donates an equal amount of time back to the resp. cpus it got time from. Guest lock patterns and host scheduling don't provide this guarantee. That is fine. Fairness is pointless if it there's no work to be done for the other three threads. Just like if you have four processes, one of which if processing, the others sleeping, cpu is not divided equally. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions
On 12/01/2010 08:03 PM, Anthony Liguori wrote: In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. There are many approaches to achieve this but the most direct is to simply avoid trapping the HLT instruction which lets the guest directly execute the instruction putting the processor to sleep. Introduce this as a module-level option for kvm-vmx.ko since if you do this for one guest, you probably want to do it for all. A similar option is possible for AMD but I don't have easy access to AMD test hardware. Looks reasonable. +static int __read_mostly enable_yield_on_guest_hlt = 1; +module_param(enable_yield_on_guest_hlt, bool, S_IRUGO); + Please rename to yield_on_hlt. Also, should not be __read_mostly (__read_mostly is for variables that are also __read_often). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/10] Exit loop if we have been there too long
On 12/02/2010 03:31 AM, Takuya Yoshikawa wrote:
> Thanks for the answers Avi, Juan,
>
> Some FYI, (not about the bottleneck)
>
> On Wed, 01 Dec 2010 14:35:57 +0200 Avi Kivity wrote:
> > > > - how many dirty pages do we have to care?
> > >
> > > default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
> > > dirty pages to have only 30ms of downtime.
> >
> > 1Gb/s * 30ms = 100 MB/s * 30 ms = 3 MB.
> > 3MB / 4KB/page = 750 pages.
>
> Then, KVM side processing is near the theoretical goal!
>
> In my framebuffer test, I tested the nr_dirty_pages/npages = 576/4096
> case at a rate of 20 updates/s (1 update/50ms). Using the rmap
> optimization, write protection only took 46,718 tsc time.

Yes, using rmap to drive write protection with sparse dirty bitmaps really helps.

> Bitmap copy was not a problem of course. The display was working anyway
> at this rate! My guess is that within 1,000 dirty pages,
> kvm_vm_ioctl_get_dirty_log() can be processed within 200us or so even
> for a large RAM slot.
>
> - the rmap optimization depends mainly on nr_dirty_pages, not npages.
>
> Avi, can you guess the properties of O(1) write protection? I want to
> test the rmap optimization taking these issues into account.

I think we should use O(1) write protection only if there is a large number of dirty pages. With a small number, using rmap guided by the previous dirty bitmap is faster.

So, under normal operation where only the framebuffer is logged, we'd use rmap write protection; when enabling logging for live migration we'd use O(1) write protection, and after a few iterations, when the number of dirty pages drops, we'd switch back to rmap write protection.

> Of course, Kemari has to continue synchronization, and may see more
> dirty pages. This will be a future task!

There's yet another option, of using dirty bits instead of write protection. Or maybe using write protection in the upper page tables and dirty bits in the lowest level.
-- error compiling committee.c: too many arguments to function