Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010/12/2 Michael S. Tsirkin : > On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote: >> 2010/11/28 Michael S. Tsirkin : >> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote: >> >> 2010/11/28 Michael S. Tsirkin : >> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote: >> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert >> >> >> last_avail_idx with inuse if there are outstanding emulation. >> >> >> >> >> >> Signed-off-by: Yoshiaki Tamura >> >> > >> >> > This changes migration format, so it will break compatibility with >> >> > existing drivers. More generally, I think migrating internal >> >> > state that is not guest visible is always a mistake >> >> > as it ties migration format to an internal implementation >> >> > (yes, I know we do this sometimes, but we should at least >> >> > try not to add such cases). I think the right thing to do in this case >> >> > is to flush outstanding >> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0. >> >> > I sent patches that do this for virtio net and block. >> >> >> >> Could you give me the link of your patches? I'd like to test >> >> whether they work with Kemari upon failover. If they do, I'm >> >> happy to drop this patch. >> >> >> >> Yoshi >> > >> > Look for this: >> > stable migration image on a stopped vm >> > sent on: >> > Wed, 24 Nov 2010 17:52:49 +0200 >> >> Thanks for the info. >> >> However, The patch series above didn't solve the issue. In >> case of Kemari, inuse is mostly > 0 because it queues the >> output, and while last_avail_idx gets incremented >> immediately, not sending inuse makes the state inconsistent >> between Primary and Secondary. > > Hmm. Can we simply avoid incrementing last_avail_idx? I think we can calculate or prepare an internal last_avail_idx, and update the external when inuse is decremented. I'll try whether it work w/ w/o Kemari. > >> I'm wondering why >> last_avail_idx is OK to send but not inuse. 
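Tamura's idea of preparing an internal last_avail_idx and only updating the external (guest-visible, migrated) one as inuse is decremented can be sketched as a toy model — this is not the actual QEMU code, and the field names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of last_avail_idx: an internal cursor advanced
 * when a request is popped from the avail ring, and an external index
 * advanced only when the request completes. Migration would send the
 * external index, so in-flight requests are re-processed on the
 * destination instead of being lost. Assumes in-order completion. */
struct vq_state {
    uint16_t internal_avail_idx; /* advanced on pop */
    uint16_t external_avail_idx; /* advanced on completion */
    uint16_t inuse;              /* popped but not yet completed */
};

static void vq_pop(struct vq_state *vq)
{
    vq->internal_avail_idx++;
    vq->inuse++;
}

static void vq_complete(struct vq_state *vq)
{
    vq->inuse--;
    vq->external_avail_idx++;
}

/* What save would emit: always the external index. */
static uint16_t vq_save_idx(const struct vq_state *vq)
{
    return vq->external_avail_idx;
}
```

With this split, save never needs to know about inuse at all; the external index is consistent by construction.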
> > last_avail_idx is at some level a mistake, it exposes part of > our internal implementation, but it does *also* express > a guest observable state. > > Here's the problem that it solves: just looking at the rings in virtio > there is no way to detect that a specific request has already been > completed. And the protocol forbids completing the same request twice. > > Our implementation always starts processing the requests > in order, and since we flush outstanding requests > before save, it works to just tell the remote 'process only requests > after this place'. > > But there's no such requirement in the virtio protocol, > so to be really generic we could add a bitmask of valid avail > ring entries that did not complete yet. This would be > the exact representation of the guest observable state. > In practice we have rings of up to 512 entries. > That's 64 byte per ring, not a lot at all. > > However, if we ever do change the protocol to send the bitmask, > we would need some code to resubmit requests > out of order, so it's not trivial. > > Another minor mistake with last_avail_idx is that it has > some redundancy: the high bits in the index > (> vq size) are not necessary as they can be > got from avail idx. There's a consistency check > in load but we really should try to use formats > that are always consistent. > >> The following patch does the same thing as original, yet >> keeps the format of the virtio. It shouldn't break live >> migration either because inuse should be 0. >> >> Yoshi > > Question is, can you flush to make inuse 0 in kemari too? > And if not, how do you handle the fact that some requests > are in flight on the primary? Although we try flushing requests one by one making inuse 0, there are cases when it failovers to the secondary when inuse isn't 0. We handle these in flight request on the primary by replaying on the secondary. 
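The bitmask Tsirkin describes — an exact representation of which avail ring entries are outstanding, 64 bytes for a 512-entry ring — could look roughly like this (a sketch only; real code would wire it into virtio_save/virtio_load and would need the out-of-order resubmission logic he mentions):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VQ_MAX_SIZE  512                 /* rings of up to 512 entries */
#define BITMAP_BYTES (VQ_MAX_SIZE / 8)   /* 64 bytes per ring */

/* One bit per avail ring slot: set when the request is popped,
 * cleared when it completes. Migrating this bitmap tells the remote
 * exactly which requests must be resubmitted, with no ordering
 * assumption at all. */
struct inflight_map {
    uint8_t bits[BITMAP_BYTES];
};

static void inflight_set(struct inflight_map *m, uint16_t slot)
{
    m->bits[slot / 8] |= 1u << (slot % 8);
}

static void inflight_clear(struct inflight_map *m, uint16_t slot)
{
    m->bits[slot / 8] &= ~(1u << (slot % 8));
}

static int inflight_test(const struct inflight_map *m, uint16_t slot)
{
    return (m->bits[slot / 8] >> (slot % 8)) & 1;
}
```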
> >> diff --git a/hw/virtio.c b/hw/virtio.c >> index c8a0fc6..875c7ca 100644 >> --- a/hw/virtio.c >> +++ b/hw/virtio.c >> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f) >> qemu_put_be32(f, i); >> >> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) { >> + uint16_t last_avail_idx; >> + >> if (vdev->vq[i].vring.num == 0) >> break; >> >> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse; >> + >> qemu_put_be32(f, vdev->vq[i].vring.num); >> qemu_put_be64(f, vdev->vq[i].pa); >> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx); >> + qemu_put_be16s(f, &last_avail_idx); >> if (vdev->binding->save_queue) >> vdev->binding->save_queue(vdev->binding_opaque, i, f); >> } >> >> > > This looks wrong to me. Requests can complete in any order, can they > not? So if request 0 did not complete and request 1 did not, > you send avail - inuse and on the secondary you will process and > complete request 1 the second time, crashing the guest. In case of Kemari, no.
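Tsirkin's objection can be made concrete with a toy model: rewinding by inuse assumes the *oldest* inuse requests are the unfinished ones. If request 0 completed but request 1 did not (inuse == 1), the rewound index 1 is correct; if instead request 1 completed and request 0 did not, the rewind still yields 1, and the destination would complete request 1 a second time. The uint16_t subtraction itself is wrap-safe, as the ring index is:

```c
#include <assert.h>
#include <stdint.h>

/* Rewind the saved avail index by the number of in-flight requests,
 * as the patch above does. Correct only under in-order completion
 * (which holds for Kemari's use). uint16_t arithmetic wraps
 * naturally, matching the ring index semantics. */
static uint16_t rewind_idx(uint16_t last_avail_idx, uint16_t inuse)
{
    return (uint16_t)(last_avail_idx - inuse);
}
```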
Re: [PATCHv6 00/16] boot order specification
On Thu, Dec 02, 2010 at 09:01:25PM -0500, Kevin O'Connor wrote: > On Thu, Dec 02, 2010 at 02:30:42PM +0200, Gleb Natapov wrote: > > On Wed, Dec 01, 2010 at 09:25:40PM -0500, Kevin O'Connor wrote: > > > You're thinking in terms of which device to boot, which does make this > > > difficult. However, it's equally valid to think in terms of which > > > boot method to invoke, which makes this easy. > > It is not. Because boot methods are enumerated by a guest at runtime. > > Qemu knows absolutely nothing about them. I am thinking in terms of > > devices because this is the only thing I have in qemu. > > As before, a safe heuristic would be to request a rom boot for any > device with a pci rom that the user requests to boot from. > > Note, your original proposal included a boot method: > /r...@genroms/linuxboot.bin > I'm asking to extend this to be able to include roms on PCI devices. > I already posted here code that maps PCI rom to PCI device path. Why do you still insist on artificial path like /pci/r...@4? > > > We could tell the coreboot user to edit the "bootorder" file and add > > > "/p...@i0cf8/r...@4" (second rom on 4th pci device - the exact syntax > > > of the name is not important). > > > > > But how user should knows that second rom (I think you mean "second BCV") > > on pci device 4.0 will boot from the new scsi cdrom that he just connected? > > How can he tell that it should put second BCV there and not third or fifth > > without running Seabios first and looking at F12 menu? > > Exactly - look at the F12 menu. (Or, for bonus points, one could > write a program that scans roms on the booted coreboot system, > presents the user with a menu, and then writes the requested boot > method to "bootorder".) > If you propose that then you probably misunderstand the problem I am trying to tackle. Qemu should be able to specify boot order for newly created machine with arbitrary HW configuration without making user to look at F12 menu. 
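For illustration, a boot-method path of the kind discussed (a PCI device plus an option-rom/BCV instance) could be formatted like this — the thread itself stresses that the exact syntax is not important, so this spelling is purely an example:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build an OpenFirmware-style boot method path, e.g.
 * "/pci@i0cf8/pci@4/rom@2" for the second BCV on PCI device 4.
 * Illustrative syntax only, not a spec. */
static int rom_boot_path(char *buf, size_t len,
                         unsigned dev, unsigned rom_instance)
{
    return snprintf(buf, len, "/pci@i0cf8/pci@%x/rom@%u",
                    dev, rom_instance);
}
```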
Actually at this point there is not human user at all. All the process from machine creation till OS installation is completely unattended. Looking at the BBS spec in several places they say that if NV ram is corrupted bios should use some predefined default boot order. This "predefined default boot order" is what qemu tries to pass to Seabios to use correct boot device without user's input. To make it clear "unattended" part is absolute requirement. > Being able to specify which boot method is a requirement for me. It > doesn't have to be pretty, but it does have to be possible. For you problem solution is very easy and described in BBS spec. You just need two numbers IPL to boot from and which BCV is first. No need for device paths at all. In fact Seabios already tracks IPL order this way the only thing missing is to track BCV order. > > > > >BTW to create proper EDD entry for SCSI boot device BIOS also > > > > needs too map BCV to id:lun. How it can be done? > > > > > > It's the responsibility of the rom to build the EDD info. I don't > > > know if all roms do this - I don't believe it's possible to get at the > > > EDD info until after the drive has been mapped (ie, too late to use it > > > for boot ordering). > > How can we get to EDD info after device is mapped? > > Raise int 0x13 ah=0x48 - once the drive is mapped it will hook that > the 0x13 irq and handle the request (or jump to the bios handler for > drives it doesn't know about). > > >Can we use "disconnect vector" > > to connect device temporarily get EDD and then disconnect? > > No. > And is it possible to call BCV and then restore int13 vector? So to find out what device each BCV belongs too we do: foreach(bcv) call_bcv(bcv) int 1348 restor(int13) After that we call call_bcv(bcv) but in correct order. > > > I understand. However, we'll still need to support arbitrary rom > > > based BEVs and BCVs, so the use case is still important. > > > > > We can't do something that is impossible. 
> > You've used this word "impossible" a few times - I'm afraid I don't > know what it means. It means that I do not have a solution and you do not propose one either :) It seems we were trying to solve differed problems though. I hope now it is much more clear what I am trying to achieve with proposed bootorder patches. > > >For coreboot Seabios should > > implement what BBS spec says i.e enumerate all BCVs, present boot menu > > to the user, record number of BCVs and user's choice on non-volatile > > storage (CMOS). > > Bleh - there's no need for that. Much more reliable is to record the > device path for builtin devices or the boot method (device path of > rom, plus bev/bcv instance) for rom based boots. > Why is this much more reliable? Suppose user has scsi card with 3 devices id3 id5 id6. Three BCVs was created for them and user chose to boot from BCV3(id6). He added /pci/s...@4/b...@3 to his coreboot's bootorder file. Now he adds one more device to this scsi bus with id1. On the ne
Re: [RFC PATCH 2/3] sched: add yield_to function
On Thu, 2010-12-02 at 14:44 -0500, Rik van Riel wrote: > +#ifdef CONFIG_SCHED_HRTICK > +/* > + * Yield the CPU, giving the remainder of our time slice to task p. > + * Typically used to hand CPU time to another thread inside the same > + * process, eg. when p holds a resource other threads are waiting for. > + * Giving priority to p may help get that resource released sooner. > + */ > +void yield_to(struct task_struct *p) > +{ > + unsigned long flags; > + struct sched_entity *se = &p->se; > + struct rq *rq; > + struct cfs_rq *cfs_rq; > + u64 remain = slice_remain(current); That "slice remaining" only shows the distance to last preempt, however brief. It shows nothing wrt tree position, the yielding task may well already be right of the task it wants to yield to, having been a buddy. > cfs_rq = cfs_rq_of(se); > + se->vruntime -= remain; > + if (se->vruntime < cfs_rq->min_vruntime) > + se->vruntime = cfs_rq->min_vruntime; (This is usually done using max_vruntime()) If the recipient was already left of the fair stick (min_vruntime), clipping advances it's vruntime, vaporizing entitlement from both donor and recipient. What if a task tries to yield to another not on the same cpu, and/or in the same task group? This would munge min_vruntime of other queues. I think you'd have to restrict this to same cpu, same group. If tasks can donate cross cfs_rq, (say) pinned task A cpu A running solo could donate vruntime to selected tasks pinned to cpu B, for as long as minuscule preemptions can resupply ammo. Would suck to not be the favored child. Maybe you could exchange vruntimes cooperatively (iff same cfs_rq) between threads, but I don't think donations with clipping works. -Mike -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
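Mike's clipping remark — "usually done using max_vruntime()" — in standalone form. The donated vruntime is clamped so the recipient can never move left of min_vruntime, which is exactly how part of the donation can vaporize:

```c
#include <assert.h>

typedef unsigned long long u64;
typedef long long s64;

/* As in kernel/sched_fair.c: pick the later of two vruntimes, using
 * a signed difference so wrapped u64 values still order correctly. */
static u64 max_vruntime(u64 max, u64 vruntime)
{
    if ((s64)(vruntime - max) > 0)
        return vruntime;
    return max;
}

/* Donate 'remain' to a recipient the way the patch under review
 * does, clipping at min_vruntime. Returns the recipient's new
 * vruntime; anything clipped is entitlement lost by both donor
 * and recipient, which is Mike's objection. */
static u64 donate_clipped(u64 vruntime, u64 remain, u64 min_vruntime)
{
    return max_vruntime(min_vruntime, vruntime - remain);
}
```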
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Anthony Liguori (anth...@codemonkey.ws) wrote: > On 12/02/2010 08:42 PM, Chris Wright wrote: > >OK, let's say a single PCPU == 12 Compute Units. > > > >If the guest is the first to migrate to a newly added unused host, and > >we are using either non-trapping hlt or Marcelo's non-yielding trapping > >hlt, then that guest is going to get more CPU than it expected unless > >there is some throttling mechanism. Specifically, it will get 12CU > >instead of 1-3CU. > > > >Do you agree with that? > > Yes. > > There's definitely a use-case to have a hard cap. OK, good, just wanted to be clear. Because this started as a discussion of hard caps, and it began to sound as if you were no longer advocating for them. > But I think another common use-case is really just performance > isolation. If over the course of a day, you go from 12CU, to 6CU, > to 4CU, that might not be that bad of a thing. I guess it depends on your SLA. We don't have to do anything to give varying CU based on host load. That's the one thing CFS will do for us quite well ;) > If the environment is designed correctly, of N nodes, N-1 will > always be at capacity so it's really just a single node hat is under > utilized. Many clouds do a variation on Small, Medium, Large sizing. So depending on the scheduler (best fit, rr...) even the notion of at capacity may change from node to node and during the time of day. thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 08:42 PM, Chris Wright wrote: OK, let's say a single PCPU == 12 Compute Units. If the guest is the first to migrate to a newly added unused host, and we are using either non-trapping hlt or Marcelo's non-yielding trapping hlt, then that guest is going to get more CPU than it expected unless there is some throttling mechanism. Specifically, it will get 12CU instead of 1-3CU. Do you agree with that? Yes. There's definitely a use-case to have a hard cap. But I think another common use-case is really just performance isolation. If over the course of a day, you go from 12CU, to 6CU, to 4CU, that might not be that bad of a thing. If the environment is designed correctly, of N nodes, N-1 will always be at capacity so it's really just a single node hat is under utilized. Regards, Anthony Liguori thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mask bit support's API
On Friday 03 December 2010 00:55:03 Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 10:54:24PM +0800, Sheng Yang wrote: > > On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin wrote: > > > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: > > >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > > >> >> Which case? the readl() doesn't need access to the routing table, > > >> >> just the entry. > > >> > > > >> >One thing that read should do is flush in the outstanding > > >> >interrupts and flush out the mask bit writes. > > >> > > >> The mask bit writes are synchronous. > > >> > > >> wrt interrupts, we can deal with assigned devices, and can poll > > >> irqfds. But we can't force vhost-net to issue an interrupt (and I > > >> don't think it's required). > > > > > > To clarify: > > > > > >mask write > > >read > > > > > > it is safe for guest to assume no more interrupts > > > > > > where as with a simple > > >mask write > > > > > > an interrupt might be in flight and get delivered shortly afterwards. > > > > I think it's already contained in the current patchset. > > > > >> >> Oh, I think there is a terminology problem, I was talking about > > >> >> kvm's irq routing table, you were talking about the msix entries. > > >> >> > > >> >> I think treating it as a cache causes more problems, since there > > >> >> are now two paths for reads (in cache and not in cache) and more > > >> >> things for writes to manage. > > >> >> > > >> >> Here's my proposed API: > > >> >> > > >> >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > > >> >> pending_bitmap_base_gpa) > > >> >> > > >> >> - called when the guest enables msix > > >> > > > >> >I would add virtual addresses so that we can use swappable memory to > > >> >store the state. > > >> > > >> Right. > > > > Do we need synchronization between kernel and userspace? Any recommended > > method? 
> > > > >> >If we do, maybe we can just keep the table there and then > > >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? > > >> > > >> Still need to to let userspace know it needs to reprogram the irqfd > > >> or whatever it uses to inject the interrupt. > > > > > > Why do we need to reprogram irqfd? I thought irqfd would map to an > > > entry within the table instead of address/data as now. > > > Could you clarify please? > > > > > >> >> KVM_REMOVE_MSIX_TABLE(table_id) > > >> >> > > >> >>- called when the guest disables msix > > >> >> > > >> >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > > >> >> > > >> >>- called when the guest enables msix (to initialize it), or > > >> >> after live migration > > >> > > > >> >What is entry_id here? > > >> > > >> Entry within the table. > > > > > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is > > > enabled (note: it can not be called at boot anyway since pa > > > depends on BAR assigned by BIOS). > > > > Don't agree. MMIO can be write regardless of if MSIX is enabled. If > > you handle MMIO to kernel, them handle them all. > > Hmm, does this mean we would need ioctls to enable/disable MSIX? I > think passing control from userspace to kernel when MSIX is enabled is > not too bad. Will think about it. Anyway we need ioctls to do so because kernel knows nothing about PCI configuration space... And we already have that for assigned device. > > > I suppose qemu still > > got control of BAR? > > Yes. So e.g. if memory is disabled in the device then we must > disable this as well, and not claim these transactions > as mmio. > > > Then leave it in the current place should be fine. > > current place? assigned_dev_iomem_map(). > > At the moment qemu does not notify devices when > memory is disabled, only when it is enabled. You mean bit 2 of Command register? I think suppose device would only disable memory when shut down should be acceptable. Also we can hook at disable point as well. 
> So handling msix enable/disable is more straight-forward. Don't agree. Then you got duplicate between kernel and userspace. Also the semantic of MSI-X MMIO has no relationship with MSIX enable/disable. > > > >> >> Michael? I think that should work for virtio and vfio assigned > > >> >> devices? Not sure about pending bits. > > >> > > > >> >Pending bits must be tracked in kernel, but I don't see > > >> >how we can support polling mode if we don't exit to userspace > > >> >on pending bit reads. > > >> > > > >> >This does mean that some reads will be fast and some will be > > >> >slow, and it's a bit sad that we seem to be optimizing > > >> >for specific guests, but I just can't come up with > > >> >anything better. > > >> > > >> If the pending bits live in userspace memory, the device model can > > >> update them directly? > > > > > > Note that these are updated on an interrupt, so updating them > > > in userspace would need get_user_page etc trickery, > > > and add the overhead of atomics. > > > > > > Further I th
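The mask/pending interplay being argued about follows the PCI MSI-X rules: an interrupt raised while the entry is masked latches the pending bit instead of being delivered, and clearing the mask with pending set fires it. A minimal model of just that table behavior (the KVM_DEFINE_MSIX_TABLE etc. ioctl names in this thread are only a proposal, so none of them appear here):

```c
#include <assert.h>
#include <stdbool.h>

struct msix_entry_state {
    bool masked;
    bool pending;
    int fired; /* interrupts actually delivered to the guest */
};

/* Device side: raise the vector. Deliver if unmasked, else latch. */
static void msix_trigger(struct msix_entry_state *e)
{
    if (e->masked)
        e->pending = true;
    else
        e->fired++;
}

/* Guest side: write the per-vector mask bit. Unmasking with a
 * pending interrupt flushes it out - this is why a mask write
 * followed by a read must be synchronous with injection. */
static void msix_set_mask(struct msix_entry_state *e, bool masked)
{
    e->masked = masked;
    if (!masked && e->pending) {
        e->pending = false;
        e->fired++;
    }
}
```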
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Anthony Liguori (anth...@codemonkey.ws) wrote: > On 12/02/2010 03:07 PM, Chris Wright wrote: > >Like non-trapping hlt, that too will guarantee that the guest is preempted > >by timeslice exhaustion (and is simpler than non-trapping hlt). So it > >may well be the simplest for the case where we are perfectly committed > >(i.e. the vcpu fractional core count totals the pcpu count). But once > >we are undercommitted we still need some extra logic to handle the hard > >cap and something to kick the running guest off the cpu and suck up the > >extra cycles in a power conserving way. > > I'm not entirely sure TBH. > > If you think of a cloud's per-VCPU capacity in terms of Compute > Units, having a model where a VCPU maps to 1-3 units depending on > total load is potentially interesting particularly if the VCPU's > capacity only changes in discrete amounts, that the expected > capacity is communicated to the guest, and that the capacity only > changes periodically. OK, let's say a single PCPU == 12 Compute Units. If the guest is the first to migrate to a newly added unused host, and we are using either non-trapping hlt or Marcelo's non-yielding trapping hlt, then that guest is going to get more CPU than it expected unless there is some throttling mechanism. Specifically, it will get 12CU instead of 1-3CU. Do you agree with that? thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
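The capacity arithmetic in Chris's example, made explicit (using the thread's assumption of 12 Compute Units per PCPU; a hard cap for an N-CU guest then translates to a CPU-share percentage):

```c
#include <assert.h>

#define CU_PER_PCPU 12

/* Percentage of one PCPU a hard cap must enforce so that a VCPU
 * sold as 'cu' compute units gets exactly that even on an
 * otherwise idle host. Without such a cap, the first guest on a
 * fresh host gets the full 12 CU instead of its 1-3 CU. */
static int hard_cap_percent(int cu)
{
    return cu * 100 / CU_PER_PCPU;
}
```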
Re: [RFC PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
* Rik van Riel (r...@redhat.com) wrote: > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1880,18 +1880,53 @@ void kvm_resched(struct kvm_vcpu *vcpu) > } > EXPORT_SYMBOL_GPL(kvm_resched); > > -void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu) > +void kvm_vcpu_on_spin(struct kvm_vcpu *me) > { > - ktime_t expires; > - DEFINE_WAIT(wait); > + struct kvm *kvm = me->kvm; > + struct kvm_vcpu *vcpu; > + int last_boosted_vcpu = me->kvm->last_boosted_vcpu; s/me->// > + int first_round = 1; > + int i; > > - prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE); > + me->spinning = 1; > + > + /* > + * We boost the priority of a VCPU that is runnable but not > + * currently running, because it got preempted by something > + * else and called schedule in __vcpu_run. Hopefully that > + * VCPU is holding the lock that we need and will release it. > + * We approximate round-robin by starting at the last boosted VCPU. > + */ > + again: > + kvm_for_each_vcpu(i, vcpu, kvm) { > + struct task_struct *task = vcpu->task; > + if (first_round && i < last_boosted_vcpu) { > + i = last_boosted_vcpu; > + continue; > + } else if (!first_round && i > last_boosted_vcpu) > + break; > + if (vcpu == me) > + continue; > + if (vcpu->spinning) > + continue; > + if (!task) > + continue; > + if (waitqueue_active(&vcpu->wq)) > + continue; > + if (task->flags & PF_VCPU) > + continue; I wonder if you set vcpu->task in sched_out and then NULL it in sched_in if you'd get what you want you could simplify the checks. Basically that would be only the preempted runnable vcpus. > + kvm->last_boosted_vcpu = i; > + yield_to(task); Just trying to think of ways to be sure this doesn't become just yield() (although I thnk PF_VCPU is enough to catch that). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
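The two-pass scan in the patch (first round starts just after last_boosted_vcpu, the "again" round wraps back to cover the rest) can be pulled out as a pure selection routine — a simplification in which all the per-vcpu eligibility checks (not me, not spinning, not running, no waiters) are collapsed into one flag array:

```c
#include <assert.h>
#include <stdbool.h>

/* Approximate round-robin: scan (last_boosted, n) first, then wrap
 * to [0, last_boosted], mirroring the first_round logic in
 * kvm_vcpu_on_spin. Returns the chosen vcpu index, or -1 if no
 * vcpu is eligible for boosting. */
static int pick_boost_target(const bool *eligible, int n, int last_boosted)
{
    int i;

    for (i = last_boosted + 1; i < n; i++)
        if (eligible[i])
            return i;
    for (i = 0; i <= last_boosted && i < n; i++)
        if (eligible[i])
            return i;
    return -1;
}
```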
Re: [PATCHv6 00/16] boot order specification
On Thu, Dec 02, 2010 at 02:30:42PM +0200, Gleb Natapov wrote: > On Wed, Dec 01, 2010 at 09:25:40PM -0500, Kevin O'Connor wrote: > > You're thinking in terms of which device to boot, which does make this > > difficult. However, it's equally valid to think in terms of which > > boot method to invoke, which makes this easy. > It is not. Because boot methods are enumerated by a guest at runtime. > Qemu knows absolutely nothing about them. I am thinking in terms of > devices because this is the only thing I have in qemu. As before, a safe heuristic would be to request a rom boot for any device with a pci rom that the user requests to boot from. Note, your original proposal included a boot method: /r...@genroms/linuxboot.bin I'm asking to extend this to be able to include roms on PCI devices. > > We could tell the coreboot user to edit the "bootorder" file and add > > "/p...@i0cf8/r...@4" (second rom on 4th pci device - the exact syntax > > of the name is not important). > > > But how user should knows that second rom (I think you mean "second BCV") > on pci device 4.0 will boot from the new scsi cdrom that he just connected? > How can he tell that it should put second BCV there and not third or fifth > without running Seabios first and looking at F12 menu? Exactly - look at the F12 menu. (Or, for bonus points, one could write a program that scans roms on the booted coreboot system, presents the user with a menu, and then writes the requested boot method to "bootorder".) Being able to specify which boot method is a requirement for me. It doesn't have to be pretty, but it does have to be possible. > > >BTW to create proper EDD entry for SCSI boot device BIOS also > > > needs too map BCV to id:lun. How it can be done? > > > > It's the responsibility of the rom to build the EDD info. I don't > > know if all roms do this - I don't believe it's possible to get at the > > EDD info until after the drive has been mapped (ie, too late to use it > > for boot ordering). 
> How can we get to EDD info after device is mapped? Raise int 0x13 ah=0x48 - once the drive is mapped it will hook that the 0x13 irq and handle the request (or jump to the bios handler for drives it doesn't know about). >Can we use "disconnect vector" > to connect device temporarily get EDD and then disconnect? No. > > I understand. However, we'll still need to support arbitrary rom > > based BEVs and BCVs, so the use case is still important. > > > We can't do something that is impossible. You've used this word "impossible" a few times - I'm afraid I don't know what it means. >For coreboot Seabios should > implement what BBS spec says i.e enumerate all BCVs, present boot menu > to the user, record number of BCVs and user's choice on non-volatile > storage (CMOS). Bleh - there's no need for that. Much more reliable is to record the device path for builtin devices or the boot method (device path of rom, plus bev/bcv instance) for rom based boots. -Kevin -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu
* Rik van Riel (r...@redhat.com) wrote: > Keep track of which task is running a KVM vcpu. This helps us > figure out later what task to wake up if we want to boost a > vcpu that got preempted. > > Unfortunately there are no guarantees that the same task > always keeps the same vcpu, so we can only track the task > across a single "run" of the vcpu. So shouldn't it confine to KVM_RUN? The other vcpu_load callers aren't always a vcpu in a useful runnable state. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/3] sched: add yield_to function
* Rik van Riel (r...@redhat.com) wrote: > Add a yield_to function to the scheduler code, allowing us to > give the remainder of our timeslice to another thread. > > We may want to use this to provide a sys_yield_to system call > one day. > > Signed-off-by: Rik van Riel > Signed-off-by: Marcelo Tosatti > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index c5f926c..4f3cce9 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1985,6 +1985,7 @@ extern void set_user_nice(struct task_struct *p, long > nice); > extern int task_prio(const struct task_struct *p); > extern int task_nice(const struct task_struct *p); > extern int can_nice(const struct task_struct *p, const int nice); > +extern void requeue_task(struct rq *rq, struct task_struct *p); > extern int task_curr(const struct task_struct *p); > extern int idle_cpu(int cpu); > extern int sched_setscheduler(struct task_struct *, int, struct sched_param > *); > @@ -2058,6 +2059,14 @@ extern int wake_up_state(struct task_struct *tsk, > unsigned int state); > extern int wake_up_process(struct task_struct *tsk); > extern void wake_up_new_task(struct task_struct *tsk, > unsigned long clone_flags); > + > +#ifdef CONFIG_SCHED_HRTICK > +extern u64 slice_remain(struct task_struct *); > +extern void yield_to(struct task_struct *); > +#else > +static inline void yield_to(struct task_struct *p) yield() Missing {}'s ? > +#endif > + > #ifdef CONFIG_SMP > extern void kick_process(struct task_struct *tsk); > #else > diff --git a/kernel/sched.c b/kernel/sched.c > index f8e5a25..ef088cd 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -1909,6 +1909,26 @@ static void dequeue_task(struct rq *rq, struct > task_struct *p, int sleep) > p->se.on_rq = 0; > } > > +/** > + * requeue_task - requeue a task which priority got changed by yield_to > + * @rq: the tasks's runqueue > + * @p: the task in question > + * Must be called with the runqueue lock held. 
Will cause the CPU to > + * reschedule if p is now at the head of the runqueue. > + */ > +void requeue_task(struct rq *rq, struct task_struct *p) > +{ > + assert_spin_locked(&rq->lock); > + > + if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p)) > + return; already checked task_running(rq, p) || task_has_rt_policy(p) w/ rq lock held. > + > + dequeue_task(rq, p, 0); > + enqueue_task(rq, p, 0); seems like you could condense to save an update_rq_clock() call at least, don't know if the info_queued, info_dequeued need to be updated > + resched_task(p); > +} > + > /* > * __normal_prio - return the priority that is based on the static prio > */ > @@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, > unsigned int, len, > return ret; > } > > +#ifdef CONFIG_SCHED_HRTICK > +/* > + * Yield the CPU, giving the remainder of our time slice to task p. > + * Typically used to hand CPU time to another thread inside the same > + * process, eg. when p holds a resource other threads are waiting for. > + * Giving priority to p may help get that resource released sooner. > + */ > +void yield_to(struct task_struct *p) > +{ > + unsigned long flags; > + struct sched_entity *se = &p->se; > + struct rq *rq; > + struct cfs_rq *cfs_rq; > + u64 remain = slice_remain(current); > + > + rq = task_rq_lock(p, &flags); > + if (task_running(rq, p) || task_has_rt_policy(p)) > + goto out; > + cfs_rq = cfs_rq_of(se); > + se->vruntime -= remain; > + if (se->vruntime < cfs_rq->min_vruntime) > + se->vruntime = cfs_rq->min_vruntime; Should these details all be in sched_fair? Seems like the wrong layer here. And would that condition go the other way? If new vruntime is smaller than min, then it becomes new cfs_rq->min_vruntime? > + requeue_task(rq, p); > + out: > + task_rq_unlock(rq, &flags); > + yield(); > +} > +EXPORT_SYMBOL(yield_to); > +#endif > + > /** > * sys_sched_yield - yield the current processor to other threads. 
> * > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c > index 5119b08..2a0a595 100644 > --- a/kernel/sched_fair.c > +++ b/kernel/sched_fair.c > @@ -974,6 +974,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity > *curr, int queued) > */ > > #ifdef CONFIG_SCHED_HRTICK > +u64 slice_remain(struct task_struct *p) > +{ > + unsigned long flags; > + struct sched_entity *se = &p->se; > + struct cfs_rq *cfs_rq; > + struct rq *rq; > + u64 slice, ran; > + s64 delta; > + > + rq = task_rq_lock(p, &flags); > + cfs_rq = cfs_rq_of(se); > + slice = sched_slice(cfs_rq, se); > + ran = se->sum_exec_runtime - se->prev_sum_exec_runtime; > + delta = slice - ran; > + task_rq_unlock(rq, &flags); > + > + return max(delta, 0LL); Can delta go negative? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to
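On the closing question — yes, delta can go negative: preemption is not exact, so a task can have run past its computed sched_slice() before slice_remain() is called, which is what the max(delta, 0LL) clamp handles. A standalone version of the calculation (values in nanoseconds, purely illustrative):

```c
#include <assert.h>

typedef unsigned long long u64;
typedef long long s64;

/* Model of slice_remain(): how much of the computed slice the task
 * has not yet consumed since it last started running, clamped at
 * zero for tasks that overran their slice before being preempted. */
static u64 slice_remain_model(u64 slice, u64 sum_exec, u64 prev_sum_exec)
{
    u64 ran = sum_exec - prev_sum_exec;
    s64 delta = (s64)(slice - ran);

    return delta > 0 ? (u64)delta : 0;
}
```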
Re: [PATCH 1/2] vhost test module
On Fri, Dec 03, 2010 at 01:18:18AM +0200, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 03:13:03PM -0800, Paul E. McKenney wrote: > > On Thu, Dec 02, 2010 at 09:47:09PM +0200, Michael S. Tsirkin wrote: > > > On Thu, Dec 02, 2010 at 11:26:16AM -0800, Paul E. McKenney wrote: > > > > On Thu, Dec 02, 2010 at 09:11:30PM +0200, Michael S. Tsirkin wrote: > > > > > On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > > > > > > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > > > > > > This adds a test module for vhost infrastructure. > > > > > > > Intentionally not tied to kbuild to prevent people > > > > > > > from installing and loading it accidentally. > > > > > > > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > > > > > > > On question below. > > > > > > > > > > > > > --- > > > > > > > > > > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > > > > > > new file mode 100644 > > > > > > > index 000..099f302 > > > > > > > --- /dev/null > > > > > > > +++ b/drivers/vhost/test.c > > > > > > > @@ -0,0 +1,320 @@ > > > > > > > +/* Copyright (C) 2009 Red Hat, Inc. > > > > > > > + * Author: Michael S. Tsirkin > > > > > > > + * > > > > > > > + * This work is licensed under the terms of the GNU GPL, version > > > > > > > 2. > > > > > > > + * > > > > > > > + * test virtio server in host kernel. > > > > > > > + */ > > > > > > > + > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > +#include > > > > > > > + > > > > > > > +#include "test.h" > > > > > > > +#include "vhost.c" > > > > > > > + > > > > > > > +/* Max number of bytes transferred before requeueing the job. > > > > > > > + * Using this limit prevents one virtqueue from starving others. 
> > > > > > > */ > > > > > > > +#define VHOST_TEST_WEIGHT 0x8 > > > > > > > + > > > > > > > +enum { > > > > > > > + VHOST_TEST_VQ = 0, > > > > > > > + VHOST_TEST_VQ_MAX = 1, > > > > > > > +}; > > > > > > > + > > > > > > > +struct vhost_test { > > > > > > > + struct vhost_dev dev; > > > > > > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > > > > > > +}; > > > > > > > + > > > > > > > +/* Expects to be always run from workqueue - which acts as > > > > > > > + * read-size critical section for our kind of RCU. */ > > > > > > > +static void handle_vq(struct vhost_test *n) > > > > > > > +{ > > > > > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > > > > > > + unsigned out, in; > > > > > > > + int head; > > > > > > > + size_t len, total_len = 0; > > > > > > > + void *private; > > > > > > > + > > > > > > > + private = rcu_dereference_check(vq->private_data, 1); > > > > > > > > > > > > Any chance of a check for running in a workqueue? If I remember > > > > > > correctly, > > > > > > the ->lockdep_map field in the work_struct structure allows you to > > > > > > create > > > > > > the required lockdep expression. > > > > > > > > > > We moved away from using the workqueue to a custom kernel thread > > > > > implementation though. > > > > > > > > OK, then could you please add a check for "current == > > > > custom_kernel_thread" > > > > or some such? > > > > > > > > Thanx, Paul > > > > > > It's a bit tricky. The way we flush out things is by an analog of > > > flush_work. So just checking that we run from workqueue isn't > > > right need to check that we are running from one of the specific work > > > structures that we are careful to flush. > > > > > > I can do this by passing the work structure pointer on to relevant > > > functions but I think this will add (very minor) overhead even when rcu > > > checks are disabled. Does it matter? Thoughts? 
> > Would it be possible to set up separate lockdep maps, in effect passing
> > the needed information via lockdep rather than via the function arguments?
>
> Possibly, I don't know enough about this but will check.
> Any examples to look at?

I would suggest the workqueue example in include/linux/workqueue.h and
kernel/workqueue.c. You will need a struct lockdep_map for each of the
specific work structures, sort of like the one in struct workqueue_struct.
You can then use lock_map_acquire() and lock_map_release() to flag when
your kernel thread is running and when it is not, and then you can pass
lock_is_held() to rcu_dereference_check(). Each of lock_map_acquire(),
lock_map_release(), and lock_is_held() takes a struct lockdep_map as its
sole argument. The rcu_lock_map definition shows an example of a global
lockdep_map variable. If you need to do runtime initialization of your
lockdep_map structure, you can use lockdep_init_map().

Seem reasonable?

							Thanx, Paul
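Paul's recipe maps onto the vhost worker roughly as follows. This is an illustrative, untested kernel-context sketch only: the `vhost_work_map` name and its placement are assumptions made for the example, not the actual vhost code; the lockdep API names are the ones discussed above, with STATIC_LOCKDEP_MAP_INIT used the way the rcu_lock_map definition uses it.

```c
/* Illustrative sketch (kernel context, untested).
 * A global lockdep map, initialized as rcu_lock_map is: */
static struct lockdep_map vhost_work_map =	/* name is hypothetical */
	STATIC_LOCKDEP_MAP_INIT("vhost_work", &vhost_work_map);

/* In the vhost worker thread, bracket execution of each flushed work
 * item so lockdep knows "we are inside vhost work": */
lock_map_acquire(&vhost_work_map);
work->fn(work);
lock_map_release(&vhost_work_map);

/* The dereference site can then assert the real condition instead of
 * the unconditional 1: */
private = rcu_dereference_check(vq->private_data,
				lock_is_held(&vhost_work_map));
```

This keeps the check out of the function arguments entirely, so there is no overhead on the work-execution path beyond the lockdep annotations themselves, which compile away when lockdep is disabled.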
Re: [PATCH 1/2] vhost test module
On Thu, Dec 02, 2010 at 03:13:03PM -0800, Paul E. McKenney wrote: > On Thu, Dec 02, 2010 at 09:47:09PM +0200, Michael S. Tsirkin wrote: > > On Thu, Dec 02, 2010 at 11:26:16AM -0800, Paul E. McKenney wrote: > > > On Thu, Dec 02, 2010 at 09:11:30PM +0200, Michael S. Tsirkin wrote: > > > > On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > > > > > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > > > > > This adds a test module for vhost infrastructure. > > > > > > Intentionally not tied to kbuild to prevent people > > > > > > from installing and loading it accidentally. > > > > > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > > > > > On question below. > > > > > > > > > > > --- > > > > > > > > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > > > > > new file mode 100644 > > > > > > index 000..099f302 > > > > > > --- /dev/null > > > > > > +++ b/drivers/vhost/test.c > > > > > > @@ -0,0 +1,320 @@ > > > > > > +/* Copyright (C) 2009 Red Hat, Inc. > > > > > > + * Author: Michael S. Tsirkin > > > > > > + * > > > > > > + * This work is licensed under the terms of the GNU GPL, version 2. > > > > > > + * > > > > > > + * test virtio server in host kernel. > > > > > > + */ > > > > > > + > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > +#include > > > > > > + > > > > > > +#include "test.h" > > > > > > +#include "vhost.c" > > > > > > + > > > > > > +/* Max number of bytes transferred before requeueing the job. > > > > > > + * Using this limit prevents one virtqueue from starving others. 
*/ > > > > > > +#define VHOST_TEST_WEIGHT 0x8 > > > > > > + > > > > > > +enum { > > > > > > + VHOST_TEST_VQ = 0, > > > > > > + VHOST_TEST_VQ_MAX = 1, > > > > > > +}; > > > > > > + > > > > > > +struct vhost_test { > > > > > > + struct vhost_dev dev; > > > > > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > > > > > +}; > > > > > > + > > > > > > +/* Expects to be always run from workqueue - which acts as > > > > > > + * read-size critical section for our kind of RCU. */ > > > > > > +static void handle_vq(struct vhost_test *n) > > > > > > +{ > > > > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > > > > > + unsigned out, in; > > > > > > + int head; > > > > > > + size_t len, total_len = 0; > > > > > > + void *private; > > > > > > + > > > > > > + private = rcu_dereference_check(vq->private_data, 1); > > > > > > > > > > Any chance of a check for running in a workqueue? If I remember > > > > > correctly, > > > > > the ->lockdep_map field in the work_struct structure allows you to > > > > > create > > > > > the required lockdep expression. > > > > > > > > We moved away from using the workqueue to a custom kernel thread > > > > implementation though. > > > > > > OK, then could you please add a check for "current == > > > custom_kernel_thread" > > > or some such? > > > > > > Thanx, Paul > > > > It's a bit tricky. The way we flush out things is by an analog of > > flush_work. So just checking that we run from workqueue isn't > > right need to check that we are running from one of the specific work > > structures that we are careful to flush. > > > > I can do this by passing the work structure pointer on to relevant > > functions but I think this will add (very minor) overhead even when rcu > > checks are disabled. Does it matter? Thoughts? > > Would it be possible to set up separate lockdep maps, in effect passing > the needed information via lockdep rather than via the function arguments? 
> 							Thanx, Paul

Possibly, I don't know enough about this but will check.
Any examples to look at?

--
mST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] vhost test module
On Thu, Dec 02, 2010 at 09:47:09PM +0200, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 11:26:16AM -0800, Paul E. McKenney wrote: > > On Thu, Dec 02, 2010 at 09:11:30PM +0200, Michael S. Tsirkin wrote: > > > On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > > > > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > > > > This adds a test module for vhost infrastructure. > > > > > Intentionally not tied to kbuild to prevent people > > > > > from installing and loading it accidentally. > > > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > > > On question below. > > > > > > > > > --- > > > > > > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > > > > new file mode 100644 > > > > > index 000..099f302 > > > > > --- /dev/null > > > > > +++ b/drivers/vhost/test.c > > > > > @@ -0,0 +1,320 @@ > > > > > +/* Copyright (C) 2009 Red Hat, Inc. > > > > > + * Author: Michael S. Tsirkin > > > > > + * > > > > > + * This work is licensed under the terms of the GNU GPL, version 2. > > > > > + * > > > > > + * test virtio server in host kernel. > > > > > + */ > > > > > + > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > +#include > > > > > + > > > > > +#include "test.h" > > > > > +#include "vhost.c" > > > > > + > > > > > +/* Max number of bytes transferred before requeueing the job. > > > > > + * Using this limit prevents one virtqueue from starving others. 
*/ > > > > > +#define VHOST_TEST_WEIGHT 0x8 > > > > > + > > > > > +enum { > > > > > + VHOST_TEST_VQ = 0, > > > > > + VHOST_TEST_VQ_MAX = 1, > > > > > +}; > > > > > + > > > > > +struct vhost_test { > > > > > + struct vhost_dev dev; > > > > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > > > > +}; > > > > > + > > > > > +/* Expects to be always run from workqueue - which acts as > > > > > + * read-size critical section for our kind of RCU. */ > > > > > +static void handle_vq(struct vhost_test *n) > > > > > +{ > > > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > > > > + unsigned out, in; > > > > > + int head; > > > > > + size_t len, total_len = 0; > > > > > + void *private; > > > > > + > > > > > + private = rcu_dereference_check(vq->private_data, 1); > > > > > > > > Any chance of a check for running in a workqueue? If I remember > > > > correctly, > > > > the ->lockdep_map field in the work_struct structure allows you to > > > > create > > > > the required lockdep expression. > > > > > > We moved away from using the workqueue to a custom kernel thread > > > implementation though. > > > > OK, then could you please add a check for "current == custom_kernel_thread" > > or some such? > > > > Thanx, Paul > > It's a bit tricky. The way we flush out things is by an analog of > flush_work. So just checking that we run from workqueue isn't > right need to check that we are running from one of the specific work > structures that we are careful to flush. > > I can do this by passing the work structure pointer on to relevant > functions but I think this will add (very minor) overhead even when rcu > checks are disabled. Does it matter? Thoughts? Would it be possible to set up separate lockdep maps, in effect passing the needed information via lockdep rather than via the function arguments? 
Thanx, Paul > > > > > + if (!private) > > > > > + return; > > > > > + > > > > > + mutex_lock(&vq->mutex); > > > > > + vhost_disable_notify(vq); > > > > > + > > > > > + for (;;) { > > > > > + head = vhost_get_vq_desc(&n->dev, vq, vq->iov, > > > > > + ARRAY_SIZE(vq->iov), > > > > > + &out, &in, > > > > > + NULL, NULL); > > > > > + /* On error, stop handling until the next kick. */ > > > > > + if (unlikely(head < 0)) > > > > > + break; > > > > > + /* Nothing new? Wait for eventfd to tell us they > > > > > refilled. */ > > > > > + if (head == vq->num) { > > > > > + if (unlikely(vhost_enable_notify(vq))) { > > > > > + vhost_disable_notify(vq); > > > > > + continue; > > > > > + } > > > > > + break; > > > > > + } > > > > > + if (in) { > > > > > + vq_err(vq, "Unexpected descriptor format for > > > > > TX: " > > > > > +"out %d, int %d\n", out, in); > > > > > + break; > > > > > + } > > > > > + len = iov_length(vq->iov, out); > > > > > + /* S
Problems on qemu-kvm unittests
We are getting failures when executing apic.flat on our periodic upstream tests:

12/02 18:40:59 DEBUG|kvm_vm:0664| Running qemu command: /usr/local/autotest/tests/kvm/qemu -name 'vm1' -monitor unix:'/tmp/monitor-humanmonitor1-20101202-184059-9EnX',server,nowait -serial unix:'/tmp/serial-20101202-184059-9EnX',server,nowait -m 512 -smp 2 -kernel '/usr/local/autotest/tests/kvm/unittests/apic.flat' -vnc :0 -chardev file,id=testlog,path=/tmp/testlog-20101202-184059-9EnX -device testdev,chardev=testlog -S -cpu qemu64,+x2apic
12/02 18:40:59 DEBUG|kvm_subpro:0700| (qemu) Cannot load x86-64 image, give a 32bit one.
12/02 18:40:59 DEBUG|kvm_subpro:0700| (qemu) (Process terminated with status 1)

Relevant git commits:

12/02 18:20:16 INFO | kvm_utils:0407| Commit hash for git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git is 9ee00410d82a7c5cab5ae347d97fbf8a95c55506 (tag v2.6.32-56688-g9ee0041)
12/02 18:39:11 INFO | kvm_utils:0407| Commit hash for git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git is 53b6d3d5c2522e881c8d194f122de3114f6f76eb (tag kvm-88-6325-g53b6d3d)
12/02 18:39:16 INFO | kvm_utils:0407| Commit hash for git://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git is f2d2b7c74355523b90019427224577b6f0ff1b8a (no tag found)

Please advise!

Lucas
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
* Rik van Riel (r...@redhat.com) wrote:
> When running SMP virtual machines, it is possible for one VCPU to be
> spinning on a spinlock, while the VCPU that holds the spinlock is not
> currently running, because the host scheduler preempted it to run
> something else.
>
> Both Intel and AMD CPUs have a feature that detects when a virtual
> CPU is spinning on a lock and will trap to the host.
>
> The current KVM code sleeps for a bit whenever that happens, which
> results in eg. a 64 VCPU Windows guest taking forever and a bit to
> boot up. This is because the VCPU holding the lock is actually
> running and not sleeping, so the pause is counter-productive.

Seems like simply increasing the spin window would help in that case? Or
is it just too contended a lock (I think they use mcs locks, so I can see
a single wrong sleep causing real contention problems).

> In other workloads a pause can also be counter-productive, with
> spinlock detection resulting in one guest giving up its CPU time
> to the others. Instead of spinning, it ends up simply not running
> much at all.
>
> This patch series aims to fix that, by having a VCPU that spins
> give the remainder of its timeslice to another VCPU in the same
> guest before yielding the CPU - one that is runnable but got
> preempted, hopefully the lock holder.
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 03:07 PM, Chris Wright wrote:
> > But you agree this is no KVM business.
>
> Like non-trapping hlt, that too will guarantee that the guest is
> preempted by timeslice exhaustion (and is simpler than non-trapping
> hlt). So it may well be the simplest for the case where we are
> perfectly committed (i.e. the vcpu fractional core count totals the
> pcpu count). But once we are undercommitted we still need some extra
> logic to handle the hard cap and something to kick the running guest
> off the cpu and suck up the extra cycles in a power conserving way.

I'm not entirely sure TBH. If you think of a cloud's per-VCPU capacity in
terms of Compute Units, having a model where a VCPU maps to 1-3 units
depending on total load is potentially interesting, particularly if the
VCPU's capacity only changes in discrete amounts, the expected capacity
is communicated to the guest, and the capacity only changes periodically.

Regards,

Anthony Liguori

> thanks,
> -chris
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 02:40 PM, Marcelo Tosatti wrote:
> Consuming the timeslice outside guest mode is less intrusive and easier
> to replace. Something like this should work?
>
> 	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
> 		while (!need_resched())
> 			default_idle();
> 	}
>
> But you agree this is no KVM business.

My initial inclination is that this would be inappropriate for KVM but I
think I'm slowly convincing myself otherwise.

Ultimately, hard limits and deterministic scheduling are related goals but
not quite the same. A hard limit is very specific: you want each VCPU to
receive no more than an exact amount of CPU time. With deterministic
scheduling, you want to make sure that a set of VMs are not influenced by
each other's behavior.

You want hard limits when you want to hide the density/capacity of a node
from the end customer. You want determinism when you simply want to
isolate the performance of each customer from the other customers. That
is, the only thing that should affect the performance graph of a VM is how
many neighbors it has (which is controlled by management software) rather
than what its neighbors are doing.

If you have hard limits, you can approximate deterministic scheduling but
it's complex in the face of changing numbers of guests. Additionally, hard
limits present issues with directed yield that don't exist with a
deterministic scheduling approach. You can still donate your time slice to
another VCPU because the VCPUs are not actually capped. That may mean that
an individual VCPU gets more PCPU time than an exact division, but for the
VM overall, it won't get more than its total share. So the principle of
performance isolation for the guest isn't impacted.

Regards,

Anthony Liguori
Re: [SeaBIOS] [PATCHv6 00/16] boot order specification
Gleb Natapov wrote:
> How can we get to EDD info after the device is mapped? Looking at the
> SeaBIOS implementation, it builds the EDD table on the fly when int_1348
> is called, and it does it only for internal devices. Can we use the
> "disconnect vector" to connect the device temporarily, get EDD, and then
> disconnect?

From BIOS Boot Specification 1.01:

"6.4.2 Disconnect Vector

Originally, it was thought that the DV would be called by the BIOS if the
device's BCV was called and subsequent booting from the device failed.
However, it was later discovered that current PnP option ROMs are more
well behaved by checking during the BCV call if their device will properly
respond to INT 13h calls or not, and simply not hooking INT 13h if those
calls would fail. Because of this, the DV is not called by a BIOS
supporting the BIOS Boot Specification, nor is it necessary to have a
valid DV in the PnP Expansion Header. The DV should be NULL and can't be
used for other storage."

Sebastian
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Marcelo Tosatti (mtosa...@redhat.com) wrote: > On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote: > > * Anthony Liguori (aligu...@us.ibm.com) wrote: > > > In certain use-cases, we want to allocate guests fixed time slices where > > > idle > > > guest cycles leave the machine idling. There are many approaches to > > > achieve > > > this but the most direct is to simply avoid trapping the HLT instruction > > > which > > > lets the guest directly execute the instruction putting the processor to > > > sleep. > > > > I like the idea, esp to keep from burning power. > > > > > Introduce this as a module-level option for kvm-vmx.ko since if you do > > > this > > > for one guest, you probably want to do it for all. A similar option is > > > possible > > > for AMD but I don't have easy access to AMD test hardware. > > > > Perhaps it should be a VM level option. And then invert the notion. > > Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu > > (pin in place probably). And have that VM do nothing other than hlt. > > Then it's always runnable according to scheduler, and can "consume" the > > extra work that CFS wants to give away. > > > > What do you think? > > > > thanks, > > -chris > > Consuming the timeslice outside guest mode is less intrusive and easier > to replace. Something like this should work? > > if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) { > while (!need_resched()) > default_idle(); > } > > But you agree this is no KVM business. Like non-trapping hlt, that too will guarantee that the guest is preempted by timeslice exhaustion (and is simpler than non-trapping hlt). So it may well be the simplest for the case where we are perfectly committed (i.e. the vcpu fractional core count totals the pcpu count). But once we are undercommitted we still need some extra logic to handle the hard cap and something to kick the running guest off the cpu and suck up the extra cycles in a power conserving way. 
thanks,
-chris
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 02:12 PM, Marcelo Tosatti wrote:
> > > > 	opt = CPU_BASED_TPR_SHADOW |
> > > > 	      CPU_BASED_USE_MSR_BITMAPS |
> > > > 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> > > > --
> > > > 1.7.0.4
> > >
> > > Breaks async PF (see "checks on guest state"),
> >
> > Sorry, I don't follow what you mean here. Can you elaborate?
>
> VCPU in HLT state only allows injection of certain events that would be
> delivered on HLT. #PF is not one of them.

But you can't inject an exception into a guest while the VMCS is active,
can you? So the guest takes an exit while in the hlt instruction but
that's no different than if the guest has been interrupted because of hlt
exiting.

> You'd have to handle this situation on event injection, vmentry fails
> otherwise. Or perhaps clear HLT state on vmexit and vmentry.

So this works today because on a hlt exit, emulate_halt() will clear the
HLT state, which then puts the vcpu into a state where it can receive an
exception injection?

Regards,

Anthony Liguori

> > > timer reinjection probably.
> >
> > Timer reinjection will continue to work as expected. If a guest is
> > halting and an external interrupt is delivered (by a timer), the guest
> > will still exit as expected.
> >
> > I can't think of anything that would be functionally correct and still
> > depend on getting hlt exits because ultimately, a guest never actually
> > has to do a hlt (and certainly there are guests that won't).
>
> LAPIC pending timer events will be reinjected on entry path, if
> accumulated. So they depend on any exit. If you disable HLT-exiting,
> delay will increase. OK, maybe thats irrelevant.
>
> > > It should be possible to achieve determinism with a scheduler policy?
> >
> > If the ultimate desire is to have the guests be scheduled in a
> > non-work-conserving fashion, I can't see a more direct approach than to
> > simply not have the guests yield (which is ultimately what hlt trapping
> > does).
> >
> > Anything the scheduler would do is after the fact and probably based
> > on inference about why the yield happened.
>
> Another issue is you ignore the host's idea of the best way to sleep
> (ACPI, or whatever).
> And handling inactive HLT state (which was never enabled) can be painful.
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On Thu, Dec 02, 2010 at 11:14:16AM -0800, Chris Wright wrote:
> * Anthony Liguori (aligu...@us.ibm.com) wrote:
> > In certain use-cases, we want to allocate guests fixed time slices
> > where idle guest cycles leave the machine idling. There are many
> > approaches to achieve this but the most direct is to simply avoid
> > trapping the HLT instruction which lets the guest directly execute
> > the instruction putting the processor to sleep.
>
> I like the idea, esp to keep from burning power.
>
> > Introduce this as a module-level option for kvm-vmx.ko since if you do
> > this for one guest, you probably want to do it for all. A similar
> > option is possible for AMD but I don't have easy access to AMD test
> > hardware.
>
> Perhaps it should be a VM level option. And then invert the notion.
> Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu
> (pin in place probably). And have that VM do nothing other than hlt.
> Then it's always runnable according to scheduler, and can "consume" the
> extra work that CFS wants to give away.
>
> What do you think?
>
> thanks,
> -chris

Consuming the timeslice outside guest mode is less intrusive and easier
to replace. Something like this should work?

	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) {
		while (!need_resched())
			default_idle();
	}

But you agree this is no KVM business.
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
> >> 	opt = CPU_BASED_TPR_SHADOW |
> >> 	      CPU_BASED_USE_MSR_BITMAPS |
> >> 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> >> --
> >> 1.7.0.4
> >
> > Breaks async PF (see "checks on guest state"),
>
> Sorry, I don't follow what you mean here. Can you elaborate?

VCPU in HLT state only allows injection of certain events that would be
delivered on HLT. #PF is not one of them. You'd have to handle this
situation on event injection, vmentry fails otherwise. Or perhaps clear
HLT state on vmexit and vmentry.

> > timer reinjection probably.
>
> Timer reinjection will continue to work as expected. If a guest is
> halting and an external interrupt is delivered (by a timer), the guest
> will still exit as expected.
>
> I can't think of anything that would be functionally correct and still
> depend on getting hlt exits because ultimately, a guest never actually
> has to do a hlt (and certainly there are guests that won't).

LAPIC pending timer events will be reinjected on entry path, if
accumulated. So they depend on any exit. If you disable HLT-exiting,
delay will increase. OK, maybe thats irrelevant.

> > It should be possible to achieve determinism with a scheduler policy?
>
> If the ultimate desire is to have the guests be scheduled in a
> non-work-conserving fashion, I can't see a more direct approach than to
> simply not have the guests yield (which is ultimately what hlt trapping
> does).
>
> Anything the scheduler would do is after the fact and probably based
> on inference about why the yield happened.

Another issue is you ignore the host's idea of the best way to sleep
(ACPI, or whatever). And handling inactive HLT state (which was never
enabled) can be painful.
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Anthony Liguori (anth...@codemonkey.ws) wrote:
> On 12/02/2010 01:14 PM, Chris Wright wrote:
> > * Anthony Liguori (aligu...@us.ibm.com) wrote:
> > > In certain use-cases, we want to allocate guests fixed time slices
> > > where idle guest cycles leave the machine idling. There are many
> > > approaches to achieve this but the most direct is to simply avoid
> > > trapping the HLT instruction which lets the guest directly execute
> > > the instruction putting the processor to sleep.
> >
> > I like the idea, esp to keep from burning power.
> >
> > > Introduce this as a module-level option for kvm-vmx.ko since if you
> > > do this for one guest, you probably want to do it for all. A similar
> > > option is possible for AMD but I don't have easy access to AMD test
> > > hardware.
> >
> > Perhaps it should be a VM level option. And then invert the notion.
> > Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu
> > (pin in place probably). And have that VM do nothing other than hlt.
> > Then it's always runnable according to scheduler, and can "consume"
> > the extra work that CFS wants to give away.
>
> That's an interesting idea. I think Vatsa had some ideas about how
> to do this with existing mechanisms.

Yeah, should Just Work (TM) w/ smth like evilcap.

> I'm interested in comparing behavior with fixed allocation because
> one thing the above relies upon is that the filler VM loses its
> time when one of the non-filler VCPUs needs to run.

Priorities?

> This may all work correctly but I think it's easier to rationalize
> about having each non-filler VCPU have a fixed (long) time slice.
> If a VCPU needs to wake up to become non-idle, it can do so
> immediately because it already has the PCPU.

The flipside... don't have to worry about the issues that Marcelo brought
up. Should be pretty easy to compare though.
thanks,
-chris
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 01:14 PM, Chris Wright wrote:
> * Anthony Liguori (aligu...@us.ibm.com) wrote:
> > In certain use-cases, we want to allocate guests fixed time slices
> > where idle guest cycles leave the machine idling. There are many
> > approaches to achieve this but the most direct is to simply avoid
> > trapping the HLT instruction which lets the guest directly execute
> > the instruction putting the processor to sleep.
>
> I like the idea, esp to keep from burning power.
>
> > Introduce this as a module-level option for kvm-vmx.ko since if you do
> > this for one guest, you probably want to do it for all. A similar
> > option is possible for AMD but I don't have easy access to AMD test
> > hardware.
>
> Perhaps it should be a VM level option. And then invert the notion.
> Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu
> (pin in place probably). And have that VM do nothing other than hlt.
> Then it's always runnable according to scheduler, and can "consume" the
> extra work that CFS wants to give away.

That's an interesting idea. I think Vatsa had some ideas about how to do
this with existing mechanisms.

I'm interested in comparing behavior with fixed allocation because one
thing the above relies upon is that the filler VM loses its time when one
of the non-filler VCPUs needs to run. This may all work correctly but I
think it's easier to rationalize about having each non-filler VCPU have a
fixed (long) time slice. If a VCPU needs to wake up to become non-idle,
it can do so immediately because it already has the PCPU.

Regards,

Anthony Liguori

> What do you think?
>
> thanks,
> -chris
Re: [PATCH 1/2] vhost test module
On Thu, Dec 02, 2010 at 11:26:16AM -0800, Paul E. McKenney wrote: > On Thu, Dec 02, 2010 at 09:11:30PM +0200, Michael S. Tsirkin wrote: > > On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > > > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > > > This adds a test module for vhost infrastructure. > > > > Intentionally not tied to kbuild to prevent people > > > > from installing and loading it accidentally. > > > > > > > > Signed-off-by: Michael S. Tsirkin > > > > > > On question below. > > > > > > > --- > > > > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > > > new file mode 100644 > > > > index 000..099f302 > > > > --- /dev/null > > > > +++ b/drivers/vhost/test.c > > > > @@ -0,0 +1,320 @@ > > > > +/* Copyright (C) 2009 Red Hat, Inc. > > > > + * Author: Michael S. Tsirkin > > > > + * > > > > + * This work is licensed under the terms of the GNU GPL, version 2. > > > > + * > > > > + * test virtio server in host kernel. > > > > + */ > > > > + > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > + > > > > +#include "test.h" > > > > +#include "vhost.c" > > > > + > > > > +/* Max number of bytes transferred before requeueing the job. > > > > + * Using this limit prevents one virtqueue from starving others. */ > > > > +#define VHOST_TEST_WEIGHT 0x8 > > > > + > > > > +enum { > > > > + VHOST_TEST_VQ = 0, > > > > + VHOST_TEST_VQ_MAX = 1, > > > > +}; > > > > + > > > > +struct vhost_test { > > > > + struct vhost_dev dev; > > > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > > > +}; > > > > + > > > > +/* Expects to be always run from workqueue - which acts as > > > > + * read-size critical section for our kind of RCU. 
*/ > > > > +static void handle_vq(struct vhost_test *n) > > > > +{ > > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > > > + unsigned out, in; > > > > + int head; > > > > + size_t len, total_len = 0; > > > > + void *private; > > > > + > > > > + private = rcu_dereference_check(vq->private_data, 1); > > > > > > Any chance of a check for running in a workqueue? If I remember > > > correctly, > > > the ->lockdep_map field in the work_struct structure allows you to create > > > the required lockdep expression. > > > > We moved away from using the workqueue to a custom kernel thread > > implementation though. > > OK, then could you please add a check for "current == custom_kernel_thread" > or some such? > > Thanx, Paul It's a bit tricky. The way we flush out things is by an analog of flush_work. So just checking that we run from workqueue isn't right need to check that we are running from one of the specific work structures that we are careful to flush. I can do this by passing the work structure pointer on to relevant functions but I think this will add (very minor) overhead even when rcu checks are disabled. Does it matter? Thoughts? > > > > + if (!private) > > > > + return; > > > > + > > > > + mutex_lock(&vq->mutex); > > > > + vhost_disable_notify(vq); > > > > + > > > > + for (;;) { > > > > + head = vhost_get_vq_desc(&n->dev, vq, vq->iov, > > > > +ARRAY_SIZE(vq->iov), > > > > +&out, &in, > > > > +NULL, NULL); > > > > + /* On error, stop handling until the next kick. */ > > > > + if (unlikely(head < 0)) > > > > + break; > > > > + /* Nothing new? Wait for eventfd to tell us they > > > > refilled. 
*/ > > > > + if (head == vq->num) { > > > > + if (unlikely(vhost_enable_notify(vq))) { > > > > + vhost_disable_notify(vq); > > > > + continue; > > > > + } > > > > + break; > > > > + } > > > > + if (in) { > > > > + vq_err(vq, "Unexpected descriptor format for > > > > TX: " > > > > + "out %d, int %d\n", out, in); > > > > + break; > > > > + } > > > > + len = iov_length(vq->iov, out); > > > > + /* Sanity check */ > > > > + if (!len) { > > > > + vq_err(vq, "Unexpected 0 len for TX\n"); > > > > + break; > > > > + } > > > > + vhost_add_used_and_signal(&n->dev, vq, head, 0); > > > > + total_len += len; > > > > + if (unlikely(total_len >= VHOST_TEST_WEIGHT)) { > > > > + vhost_poll_queue(&vq->poll); > > > > +
[RFC PATCH 2/3] sched: add yield_to function
Add a yield_to function to the scheduler code, allowing us to give the remainder of our timeslice to another thread. We may want to use this to provide a sys_yield_to system call one day. Signed-off-by: Rik van Riel Signed-off-by: Marcelo Tosatti diff --git a/include/linux/sched.h b/include/linux/sched.h index c5f926c..4f3cce9 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1985,6 +1985,7 @@ extern void set_user_nice(struct task_struct *p, long nice); extern int task_prio(const struct task_struct *p); extern int task_nice(const struct task_struct *p); extern int can_nice(const struct task_struct *p, const int nice); +extern void requeue_task(struct rq *rq, struct task_struct *p); extern int task_curr(const struct task_struct *p); extern int idle_cpu(int cpu); extern int sched_setscheduler(struct task_struct *, int, struct sched_param *); @@ -2058,6 +2059,14 @@ extern int wake_up_state(struct task_struct *tsk, unsigned int state); extern int wake_up_process(struct task_struct *tsk); extern void wake_up_new_task(struct task_struct *tsk, unsigned long clone_flags); + +#ifdef CONFIG_SCHED_HRTICK +extern u64 slice_remain(struct task_struct *); +extern void yield_to(struct task_struct *); +#else +static inline void yield_to(struct task_struct *p) yield() +#endif + #ifdef CONFIG_SMP extern void kick_process(struct task_struct *tsk); #else diff --git a/kernel/sched.c b/kernel/sched.c index f8e5a25..ef088cd 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1909,6 +1909,26 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep) p->se.on_rq = 0; } +/** + * requeue_task - requeue a task which priority got changed by yield_to + * @rq: the tasks's runqueue + * @p: the task in question + * Must be called with the runqueue lock held. Will cause the CPU to + * reschedule if p is now at the head of the runqueue. 
+ */ +void requeue_task(struct rq *rq, struct task_struct *p) +{ + assert_spin_locked(&rq->lock); + + if (!p->se.on_rq || task_running(rq, p) || task_has_rt_policy(p)) + return; + + dequeue_task(rq, p, 0); + enqueue_task(rq, p, 0); + + resched_task(p); +} + /* * __normal_prio - return the priority that is based on the static prio */ @@ -6797,6 +6817,36 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len, return ret; } +#ifdef CONFIG_SCHED_HRTICK +/* + * Yield the CPU, giving the remainder of our time slice to task p. + * Typically used to hand CPU time to another thread inside the same + * process, eg. when p holds a resource other threads are waiting for. + * Giving priority to p may help get that resource released sooner. + */ +void yield_to(struct task_struct *p) +{ + unsigned long flags; + struct sched_entity *se = &p->se; + struct rq *rq; + struct cfs_rq *cfs_rq; + u64 remain = slice_remain(current); + + rq = task_rq_lock(p, &flags); + if (task_running(rq, p) || task_has_rt_policy(p)) + goto out; + cfs_rq = cfs_rq_of(se); + se->vruntime -= remain; + if (se->vruntime < cfs_rq->min_vruntime) + se->vruntime = cfs_rq->min_vruntime; + requeue_task(rq, p); + out: + task_rq_unlock(rq, &flags); + yield(); +} +EXPORT_SYMBOL(yield_to); +#endif + /** * sys_sched_yield - yield the current processor to other threads. 
* diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 5119b08..2a0a595 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -974,6 +974,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) */ #ifdef CONFIG_SCHED_HRTICK +u64 slice_remain(struct task_struct *p) +{ + unsigned long flags; + struct sched_entity *se = &p->se; + struct cfs_rq *cfs_rq; + struct rq *rq; + u64 slice, ran; + s64 delta; + + rq = task_rq_lock(p, &flags); + cfs_rq = cfs_rq_of(se); + slice = sched_slice(cfs_rq, se); + ran = se->sum_exec_runtime - se->prev_sum_exec_runtime; + delta = slice - ran; + task_rq_unlock(rq, &flags); + + return max(delta, 0LL); +} + static void hrtick_start_fair(struct rq *rq, struct task_struct *p) { struct sched_entity *se = &p->se; -- All rights reversed.
[RFC PATCH 1/3] kvm: keep track of which task is running a KVM vcpu
Keep track of which task is running a KVM vcpu. This helps us figure out later what task to wake up if we want to boost a vcpu that got preempted. Unfortunately there are no guarantees that the same task always keeps the same vcpu, so we can only track the task across a single "run" of the vcpu. Signed-off-by: Rik van Riel diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 5fbdb55..cb73a73 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -79,6 +79,7 @@ struct kvm_vcpu { #endif int vcpu_id; struct mutex mutex; + struct task_struct *task; int cpu; struct kvm_run *run; unsigned long requests; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 050577a..2bffa47 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -741,6 +741,7 @@ void vcpu_load(struct kvm_vcpu *vcpu) int cpu; mutex_lock(&vcpu->mutex); + vcpu->task = current; cpu = get_cpu(); preempt_notifier_register(&vcpu->preempt_notifier); kvm_arch_vcpu_load(vcpu, cpu); @@ -749,6 +750,7 @@ void vcpu_put(struct kvm_vcpu *vcpu) { + vcpu->task = NULL; preempt_disable(); kvm_arch_vcpu_put(vcpu); preempt_notifier_unregister(&vcpu->preempt_notifier); -- All rights reversed.
[RFC PATCH 0/3] directed yield for Pause Loop Exiting
When running SMP virtual machines, it is possible for one VCPU to be spinning on a spinlock, while the VCPU that holds the spinlock is not currently running, because the host scheduler preempted it to run something else. Both Intel and AMD CPUs have a feature that detects when a virtual CPU is spinning on a lock and will trap to the host. The current KVM code sleeps for a bit whenever that happens, which results in eg. a 64 VCPU Windows guest taking forever and a bit to boot up. This is because the VCPU holding the lock is actually running and not sleeping, so the pause is counter-productive. In other workloads a pause can also be counter-productive, with spinlock detection resulting in one guest giving up its CPU time to the others. Instead of spinning, it ends up simply not running much at all. This patch series aims to fix that, by having a VCPU that spins give the remainder of its timeslice to another VCPU in the same guest before yielding the CPU - one that is runnable but got preempted, hopefully the lock holder. Scheduler people, please flame me with anything I may have done wrong, so I can do it right for a next version :) -- All rights reversed.
[RFC PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic slowdowns of certain workloads, we instead use yield_to to hand the rest of our timeslice to another vcpu in the same KVM guest. Signed-off-by: Rik van Riel Signed-off-by: Marcelo Tosatti diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c011ba3..7637dd3 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -90,6 +90,7 @@ struct kvm_vcpu { int fpu_active; int guest_fpu_loaded; wait_queue_head_t wq; + int spinning; int sigset_active; sigset_t sigset; struct kvm_vcpu_stat stat; @@ -185,6 +186,7 @@ struct kvm { #endif struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; atomic_t online_vcpus; + int last_boosted_vcpu; struct list_head vm_list; struct mutex lock; struct kvm_io_bus *buses[KVM_NR_BUSES]; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 80f17db..a6eeafc 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1880,18 +1880,53 @@ void kvm_resched(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_resched); -void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu) +void kvm_vcpu_on_spin(struct kvm_vcpu *me) { - ktime_t expires; - DEFINE_WAIT(wait); + struct kvm *kvm = me->kvm; + struct kvm_vcpu *vcpu; + int last_boosted_vcpu = me->kvm->last_boosted_vcpu; + int first_round = 1; + int i; - prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE); + me->spinning = 1; + + /* +* We boost the priority of a VCPU that is runnable but not +* currently running, because it got preempted by something +* else and called schedule in __vcpu_run. Hopefully that +* VCPU is holding the lock that we need and will release it. +* We approximate round-robin by starting at the last boosted VCPU. 
+*/ + again: + kvm_for_each_vcpu(i, vcpu, kvm) { + struct task_struct *task = vcpu->task; + if (first_round && i < last_boosted_vcpu) { + i = last_boosted_vcpu; + continue; + } else if (!first_round && i > last_boosted_vcpu) + break; + if (vcpu == me) + continue; + if (vcpu->spinning) + continue; + if (!task) + continue; + if (waitqueue_active(&vcpu->wq)) + continue; + if (task->flags & PF_VCPU) + continue; + kvm->last_boosted_vcpu = i; + yield_to(task); + break; + } - /* Sleep for 100 us, and hope lock-holder got scheduled */ - expires = ktime_add_ns(ktime_get(), 10UL); - schedule_hrtimeout(&expires, HRTIMER_MODE_ABS); + if (first_round && last_boosted_vcpu == kvm->last_boosted_vcpu) { + /* We have not found anyone yet. */ + first_round = 0; + goto again; + } - finish_wait(&vcpu->wq, &wait); + me->spinning = 0; } EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin); -- All rights reversed.
Re: [PATCH 1/2] vhost test module
On Thu, Dec 02, 2010 at 09:11:30PM +0200, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > > This adds a test module for vhost infrastructure. > > > Intentionally not tied to kbuild to prevent people > > > from installing and loading it accidentally. > > > > > > Signed-off-by: Michael S. Tsirkin > > > > On question below. > > > > > --- > > > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > > new file mode 100644 > > > index 000..099f302 > > > --- /dev/null > > > +++ b/drivers/vhost/test.c > > > @@ -0,0 +1,320 @@ > > > +/* Copyright (C) 2009 Red Hat, Inc. > > > + * Author: Michael S. Tsirkin > > > + * > > > + * This work is licensed under the terms of the GNU GPL, version 2. > > > + * > > > + * test virtio server in host kernel. > > > + */ > > > + > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > + > > > +#include "test.h" > > > +#include "vhost.c" > > > + > > > +/* Max number of bytes transferred before requeueing the job. > > > + * Using this limit prevents one virtqueue from starving others. */ > > > +#define VHOST_TEST_WEIGHT 0x8 > > > + > > > +enum { > > > + VHOST_TEST_VQ = 0, > > > + VHOST_TEST_VQ_MAX = 1, > > > +}; > > > + > > > +struct vhost_test { > > > + struct vhost_dev dev; > > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > > +}; > > > + > > > +/* Expects to be always run from workqueue - which acts as > > > + * read-size critical section for our kind of RCU. 
*/ > > > +static void handle_vq(struct vhost_test *n) > > > +{ > > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > > + unsigned out, in; > > > + int head; > > > + size_t len, total_len = 0; > > > + void *private; > > > + > > > + private = rcu_dereference_check(vq->private_data, 1); > > > > Any chance of a check for running in a workqueue? If I remember correctly, > > the ->lockdep_map field in the work_struct structure allows you to create > > the required lockdep expression. > > We moved away from using the workqueue to a custom kernel thread > implementation though. OK, then could you please add a check for "current == custom_kernel_thread" or some such? Thanx, Paul > > > + if (!private) > > > + return; > > > + > > > + mutex_lock(&vq->mutex); > > > + vhost_disable_notify(vq); > > > + > > > + for (;;) { > > > + head = vhost_get_vq_desc(&n->dev, vq, vq->iov, > > > + ARRAY_SIZE(vq->iov), > > > + &out, &in, > > > + NULL, NULL); > > > + /* On error, stop handling until the next kick. */ > > > + if (unlikely(head < 0)) > > > + break; > > > + /* Nothing new? Wait for eventfd to tell us they refilled. 
*/ > > > + if (head == vq->num) { > > > + if (unlikely(vhost_enable_notify(vq))) { > > > + vhost_disable_notify(vq); > > > + continue; > > > + } > > > + break; > > > + } > > > + if (in) { > > > + vq_err(vq, "Unexpected descriptor format for TX: " > > > +"out %d, int %d\n", out, in); > > > + break; > > > + } > > > + len = iov_length(vq->iov, out); > > > + /* Sanity check */ > > > + if (!len) { > > > + vq_err(vq, "Unexpected 0 len for TX\n"); > > > + break; > > > + } > > > + vhost_add_used_and_signal(&n->dev, vq, head, 0); > > > + total_len += len; > > > + if (unlikely(total_len >= VHOST_TEST_WEIGHT)) { > > > + vhost_poll_queue(&vq->poll); > > > + break; > > > + } > > > + } > > > + > > > + mutex_unlock(&vq->mutex); > > > +} > > > + > > > +static void handle_vq_kick(struct vhost_work *work) > > > +{ > > > + struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, > > > + poll.work); > > > + struct vhost_test *n = container_of(vq->dev, struct vhost_test, dev); > > > + > > > + handle_vq(n); > > > +} > > > + > > > +static int vhost_test_open(struct inode *inode, struct file *f) > > > +{ > > > + struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL); > > > + struct vhost_dev *dev; > > > + int r; > > > + > > > + if (!n) > > > + return -ENOMEM; > > > + > > > + dev = &n->dev; > > > + n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick; > > > + r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX); > > > + if (r < 0) { > > > + kfree(n); > > > + return r; > > > + } > > > + > > > + f->private_data = n; > > > + > > > + return 0; > > > +} > > > + > > > +static void *vhost
Re: [PATCH 1/2] vhost test module
On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > This adds a test module for vhost infrastructure. > Intentionally not tied to kbuild to prevent people > from installing and loading it accidentally. > > Signed-off-by: Michael S. Tsirkin On question below. > --- > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > new file mode 100644 > index 000..099f302 > --- /dev/null > +++ b/drivers/vhost/test.c > @@ -0,0 +1,320 @@ > +/* Copyright (C) 2009 Red Hat, Inc. > + * Author: Michael S. Tsirkin > + * > + * This work is licensed under the terms of the GNU GPL, version 2. > + * > + * test virtio server in host kernel. > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "test.h" > +#include "vhost.c" > + > +/* Max number of bytes transferred before requeueing the job. > + * Using this limit prevents one virtqueue from starving others. */ > +#define VHOST_TEST_WEIGHT 0x8 > + > +enum { > + VHOST_TEST_VQ = 0, > + VHOST_TEST_VQ_MAX = 1, > +}; > + > +struct vhost_test { > + struct vhost_dev dev; > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > +}; > + > +/* Expects to be always run from workqueue - which acts as > + * read-size critical section for our kind of RCU. */ > +static void handle_vq(struct vhost_test *n) > +{ > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > + unsigned out, in; > + int head; > + size_t len, total_len = 0; > + void *private; > + > + private = rcu_dereference_check(vq->private_data, 1); Any chance of a check for running in a workqueue? If I remember correctly, the ->lockdep_map field in the work_struct structure allows you to create the required lockdep expression. 
Thanx, Paul > + if (!private) > + return; > + > + mutex_lock(&vq->mutex); > + vhost_disable_notify(vq); > + > + for (;;) { > + head = vhost_get_vq_desc(&n->dev, vq, vq->iov, > + ARRAY_SIZE(vq->iov), > + &out, &in, > + NULL, NULL); > + /* On error, stop handling until the next kick. */ > + if (unlikely(head < 0)) > + break; > + /* Nothing new? Wait for eventfd to tell us they refilled. */ > + if (head == vq->num) { > + if (unlikely(vhost_enable_notify(vq))) { > + vhost_disable_notify(vq); > + continue; > + } > + break; > + } > + if (in) { > + vq_err(vq, "Unexpected descriptor format for TX: " > +"out %d, int %d\n", out, in); > + break; > + } > + len = iov_length(vq->iov, out); > + /* Sanity check */ > + if (!len) { > + vq_err(vq, "Unexpected 0 len for TX\n"); > + break; > + } > + vhost_add_used_and_signal(&n->dev, vq, head, 0); > + total_len += len; > + if (unlikely(total_len >= VHOST_TEST_WEIGHT)) { > + vhost_poll_queue(&vq->poll); > + break; > + } > + } > + > + mutex_unlock(&vq->mutex); > +} > + > +static void handle_vq_kick(struct vhost_work *work) > +{ > + struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, > + poll.work); > + struct vhost_test *n = container_of(vq->dev, struct vhost_test, dev); > + > + handle_vq(n); > +} > + > +static int vhost_test_open(struct inode *inode, struct file *f) > +{ > + struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL); > + struct vhost_dev *dev; > + int r; > + > + if (!n) > + return -ENOMEM; > + > + dev = &n->dev; > + n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick; > + r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX); > + if (r < 0) { > + kfree(n); > + return r; > + } > + > + f->private_data = n; > + > + return 0; > +} > + > +static void *vhost_test_stop_vq(struct vhost_test *n, > + struct vhost_virtqueue *vq) > +{ > + void *private; > + > + mutex_lock(&vq->mutex); > + private = rcu_dereference_protected(vq->private_data, > + lockdep_is_held(&vq->mutex)); > + rcu_assign_pointer(vq->private_data, 
NULL); > + mutex_unlock(&vq->mutex); > + return private; > +} > + > +static void vhost_test_stop(struct vhost_test *n, void **privatep) > +{ > + *privatep = vhost_test_stop_vq(n, n->vqs + VHOST_TEST_VQ); > +} > + > +static void vhost_test_flush_vq(struct vhost_test *n, int index)
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
* Anthony Liguori (aligu...@us.ibm.com) wrote: > In certain use-cases, we want to allocate guests fixed time slices where idle > guest cycles leave the machine idling. There are many approaches to achieve > this but the most direct is to simply avoid trapping the HLT instruction which > lets the guest directly execute the instruction putting the processor to > sleep. I like the idea, esp to keep from burning power. > Introduce this as a module-level option for kvm-vmx.ko since if you do this > for one guest, you probably want to do it for all. A similar option is > possible > for AMD but I don't have easy access to AMD test hardware. Perhaps it should be a VM level option. And then invert the notion. Create one idle domain w/out hlt trap. Give that VM a vcpu per pcpu (pin in place probably). And have that VM do nothing other than hlt. Then it's always runnable according to scheduler, and can "consume" the extra work that CFS wants to give away. What do you think? thanks, -chris
Re: [PATCH 1/2] vhost test module
On Thu, Dec 02, 2010 at 11:00:37AM -0800, Paul E. McKenney wrote: > On Mon, Nov 29, 2010 at 07:09:01PM +0200, Michael S. Tsirkin wrote: > > This adds a test module for vhost infrastructure. > > Intentionally not tied to kbuild to prevent people > > from installing and loading it accidentally. > > > > Signed-off-by: Michael S. Tsirkin > > On question below. > > > --- > > > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c > > new file mode 100644 > > index 000..099f302 > > --- /dev/null > > +++ b/drivers/vhost/test.c > > @@ -0,0 +1,320 @@ > > +/* Copyright (C) 2009 Red Hat, Inc. > > + * Author: Michael S. Tsirkin > > + * > > + * This work is licensed under the terms of the GNU GPL, version 2. > > + * > > + * test virtio server in host kernel. > > + */ > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include "test.h" > > +#include "vhost.c" > > + > > +/* Max number of bytes transferred before requeueing the job. > > + * Using this limit prevents one virtqueue from starving others. */ > > +#define VHOST_TEST_WEIGHT 0x8 > > + > > +enum { > > + VHOST_TEST_VQ = 0, > > + VHOST_TEST_VQ_MAX = 1, > > +}; > > + > > +struct vhost_test { > > + struct vhost_dev dev; > > + struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX]; > > +}; > > + > > +/* Expects to be always run from workqueue - which acts as > > + * read-size critical section for our kind of RCU. */ > > +static void handle_vq(struct vhost_test *n) > > +{ > > + struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ]; > > + unsigned out, in; > > + int head; > > + size_t len, total_len = 0; > > + void *private; > > + > > + private = rcu_dereference_check(vq->private_data, 1); > > Any chance of a check for running in a workqueue? If I remember correctly, > the ->lockdep_map field in the work_struct structure allows you to create > the required lockdep expression. 
> > Thanx, Paul We moved away from using the workqueue to a custom kernel thread implementation though. > > + if (!private) > > + return; > > + > > + mutex_lock(&vq->mutex); > > + vhost_disable_notify(vq); > > + > > + for (;;) { > > + head = vhost_get_vq_desc(&n->dev, vq, vq->iov, > > +ARRAY_SIZE(vq->iov), > > +&out, &in, > > +NULL, NULL); > > + /* On error, stop handling until the next kick. */ > > + if (unlikely(head < 0)) > > + break; > > + /* Nothing new? Wait for eventfd to tell us they refilled. */ > > + if (head == vq->num) { > > + if (unlikely(vhost_enable_notify(vq))) { > > + vhost_disable_notify(vq); > > + continue; > > + } > > + break; > > + } > > + if (in) { > > + vq_err(vq, "Unexpected descriptor format for TX: " > > + "out %d, int %d\n", out, in); > > + break; > > + } > > + len = iov_length(vq->iov, out); > > + /* Sanity check */ > > + if (!len) { > > + vq_err(vq, "Unexpected 0 len for TX\n"); > > + break; > > + } > > + vhost_add_used_and_signal(&n->dev, vq, head, 0); > > + total_len += len; > > + if (unlikely(total_len >= VHOST_TEST_WEIGHT)) { > > + vhost_poll_queue(&vq->poll); > > + break; > > + } > > + } > > + > > + mutex_unlock(&vq->mutex); > > +} > > + > > +static void handle_vq_kick(struct vhost_work *work) > > +{ > > + struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue, > > + poll.work); > > + struct vhost_test *n = container_of(vq->dev, struct vhost_test, dev); > > + > > + handle_vq(n); > > +} > > + > > +static int vhost_test_open(struct inode *inode, struct file *f) > > +{ > > + struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL); > > + struct vhost_dev *dev; > > + int r; > > + > > + if (!n) > > + return -ENOMEM; > > + > > + dev = &n->dev; > > + n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick; > > + r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX); > > + if (r < 0) { > > + kfree(n); > > + return r; > > + } > > + > > + f->private_data = n; > > + > > + return 0; > > +} > > + > > +static void *vhost_test_stop_vq(struct 
vhost_test *n, > > + struct vhost_virtqueue *vq) > > +{ > > + void *private; > > + > > + mutex_lock(&vq->mutex); > > + private = rcu_dereference_protected(vq->private_data, > > +lockdep_is_held(&vq->mutex)); > > + rcu_assign_po
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 11:37 AM, Marcelo Tosatti wrote: On Thu, Dec 02, 2010 at 07:59:17AM -0600, Anthony Liguori wrote: In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. There are many approaches to achieve this but the most direct is to simply avoid trapping the HLT instruction which lets the guest directly execute the instruction putting the processor to sleep. Introduce this as a module-level option for kvm-vmx.ko since if you do this for one guest, you probably want to do it for all. A similar option is possible for AMD but I don't have easy access to AMD test hardware. Signed-off-by: Anthony Liguori --- v1 -> v2 - Rename parameter to yield_on_hlt - Remove __read_mostly diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index caa967e..d8310e4 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO); static int __read_mostly vmm_exclusive = 1; module_param(vmm_exclusive, bool, S_IRUGO); +static int yield_on_hlt = 1; +module_param(yield_on_hlt, bool, S_IRUGO); + #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \ (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) #define KVM_GUEST_CR0_MASK\ @@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) &_pin_based_exec_control)< 0) return -EIO; - min = CPU_BASED_HLT_EXITING | + min = #ifdef CONFIG_X86_64 CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING | @@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING | CPU_BASED_INVLPG_EXITING; + + if (yield_on_hlt) + min |= CPU_BASED_HLT_EXITING; + opt = CPU_BASED_TPR_SHADOW | CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; -- 1.7.0.4 Breaks async PF (see "checks on guest state"), Sorry, I don't follow what you mean here. Can you elaborate? timer reinjection probably. 
Timer reinjection will continue to work as expected. If a guest is halted and an external interrupt is delivered (by a timer), the guest will still exit as expected. I can't think of anything that would be functionally correct and still depend on getting hlt exits, because ultimately a guest never actually has to do a hlt (and certainly there are guests that won't). It should be possible to achieve determinism with a scheduler policy? If the ultimate desire is to have the guests be scheduled in a non-work-conserving fashion, I can't see a more direct approach than to simply not have the guests yield (which is ultimately what hlt trapping does). Anything the scheduler would do is after the fact and probably based on inference about why the yield happened. Regards, Anthony Liguori
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On Thu, Dec 02, 2010 at 07:59:17AM -0600, Anthony Liguori wrote: > In certain use-cases, we want to allocate guests fixed time slices where idle > guest cycles leave the machine idling. There are many approaches to achieve > this but the most direct is to simply avoid trapping the HLT instruction which > lets the guest directly execute the instruction putting the processor to > sleep. > > Introduce this as a module-level option for kvm-vmx.ko since if you do this > for one guest, you probably want to do it for all. A similar option is > possible > for AMD but I don't have easy access to AMD test hardware. > > Signed-off-by: Anthony Liguori > --- > v1 -> v2 > - Rename parameter to yield_on_hlt > - Remove __read_mostly > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index caa967e..d8310e4 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO); > static int __read_mostly vmm_exclusive = 1; > module_param(vmm_exclusive, bool, S_IRUGO); > > +static int yield_on_hlt = 1; > +module_param(yield_on_hlt, bool, S_IRUGO); > + > #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST > \ > (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) > #define KVM_GUEST_CR0_MASK \ > @@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config > *vmcs_conf) > &_pin_based_exec_control) < 0) > return -EIO; > > - min = CPU_BASED_HLT_EXITING | > + min = > #ifdef CONFIG_X86_64 > CPU_BASED_CR8_LOAD_EXITING | > CPU_BASED_CR8_STORE_EXITING | > @@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config > *vmcs_conf) > CPU_BASED_MWAIT_EXITING | > CPU_BASED_MONITOR_EXITING | > CPU_BASED_INVLPG_EXITING; > + > + if (yield_on_hlt) > + min |= CPU_BASED_HLT_EXITING; > + > opt = CPU_BASED_TPR_SHADOW | > CPU_BASED_USE_MSR_BITMAPS | > CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; > -- > 1.7.0.4 Breaks async PF (see "checks on guest state"), timer reinjection probably. 
It should be possible to achieve determinism with a scheduler policy?
Re: [SeaBIOS] [PATCHv6 00/16] boot order specification
On Thu, Dec 02, 2010 at 04:07:16PM +0100, Peter Stuge wrote: > Gleb Natapov wrote: > > BBS specification is broken since it doesn't provide a way for > > discovered boot method (BCV) to be linked back to a device it will > > boot from. Nothing we can do to fix this except moving to EFI (and > > hope the problem is fixed there). > > There is that option, or there could be some simple improvement of > our own, which works in QEMU and maybe even adds value to coreboot. > But then there would be a bit of novel development in firmware - that > can't be a good thing, right? > I am all for novel development in firmware, but unfortunately I do not see what we can do in SeaBIOS + qemu to fix this shortcoming. The problem should be fixed in each and every option rom. An option rom may set the device address somewhere in its pnp header, for instance. > > > Spec says that in that case user probably will want to adjust boot > > order anyway and will enter boot menu by itself. Sorry excuse for > > failing to provide proper functionality if you ask me :) > > I agree. I can not believe the absolute resistance to innovation in > this field. > Interested parties want everyone to move to EFI I guess. > Isn't the scope of BBS logic limited to boot time? (There are calls > to do settings, but that's no problem.) > > Maybe it would be possible for SeaBIOS to provide what looks like BBS > to the guest, but on the other side there is something more > intelligent going on, be it together with QEMU or coreboot? > > I don't see how it can be done without cooperation with option roms. -- Gleb.
Re: Mask bit support's API
On Thu, Dec 02, 2010 at 10:54:24PM +0800, Sheng Yang wrote: > On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin wrote: > > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: > >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > >> >> > >> >> Which case? the readl() doesn't need access to the routing table, > >> >> just the entry. > >> > > >> >One thing that read should do is flush in the outstanding > >> >interrupts and flush out the mask bit writes. > >> > >> The mask bit writes are synchronous. > >> > >> wrt interrupts, we can deal with assigned devices, and can poll > >> irqfds. But we can't force vhost-net to issue an interrupt (and I > >> don't think it's required). > > > > To clarify: > > > > mask write > > read > > > > it is safe for guest to assume no more interrupts > > > > where as with a simple > > mask write > > > > an interrupt might be in flight and get delivered shortly afterwards. > > I think it's already contained in the current patchset. > > > > > >> >> Oh, I think there is a terminology problem, I was talking about > >> >> kvm's irq routing table, you were talking about the msix entries. > >> >> > >> >> I think treating it as a cache causes more problems, since there are > >> >> now two paths for reads (in cache and not in cache) and more things > >> >> for writes to manage. > >> >> > >> >> Here's my proposed API: > >> >> > >> >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > >> >> pending_bitmap_base_gpa) > >> >> > >> >> - called when the guest enables msix > >> > > >> >I would add virtual addresses so that we can use swappable memory to > >> >store the state. > >> > >> Right. > > Do we need synchronization between kernel and userspace? Any recommended > method? > >> > >> >If we do, maybe we can just keep the table there and then > >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? > >> > >> Still need to to let userspace know it needs to reprogram the irqfd > >> or whatever it uses to inject the interrupt. 
> > > > Why do we need to reprogram irqfd? I thought irqfd would map to an > > entry within the table instead of address/data as now. > > Could you clarify please? > > > > > >> >> KVM_REMOVE_MSIX_TABLE(table_id) > >> >> > >> >> - called when the guest disables msix > >> >> > >> >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > >> >> > >> >> - called when the guest enables msix (to initialize it), or after > >> >> live migration > >> > > >> >What is entry_id here? > >> > >> Entry within the table. > > > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is > > enabled (note: it can not be called at boot anyway since pa > > depends on BAR assigned by BIOS). > > Don't agree. MMIO can be write regardless of if MSIX is enabled. If > you handle MMIO to kernel, them handle them all. Hmm, does this mean we would need ioctls to enable/disable MSIX? I think passing control from userspace to kernel when MSIX is enabled is not too bad. Will think about it. > I suppose qemu still > got control of BAR? Yes. So e.g. if memory is disabled in the device then we must disable this as well, and not claim these transactions as mmio. > Then leave it in the current place should be fine. current place? At the moment qemu does not notify devices when memory is disabled, only when it is enabled. So handling msix enable/disable is more straight-forward. > > > >> >> Michael? I think that should work for virtio and vfio assigned > >> >> devices? Not sure about pending bits. > >> > > >> >Pending bits must be tracked in kernel, but I don't see > >> >how we can support polling mode if we don't exit to userspace > >> >on pending bit reads. > >> > > >> >This does mean that some reads will be fast and some will be > >> >slow, and it's a bit sad that we seem to be optimizing > >> >for specific guests, but I just can't come up with > >> >anything better. > >> > > >> > >> If the pending bits live in userspace memory, the device model can > >> update them directly? 
> > > > Note that these are updated on an interrupt, so updating them > > in userspace would need get_user_page etc trickery, > > and add the overhead of atomics. > > > > Further I think it's important to avoid the overhead of updating them > > all the time, and only do this when an interrupt is > > masked or on pending bits read. Since userspace does not know > > when interrupts are masked, this means do update on each read. > > In fact qemu's accessing to MMIO should be quite rare after moving all > the things to the kernel. Using IOCTL is also fine with me. > > And how to "do update on each read"? When read of pending bits is detected, we could forward it up to qemu. Qemu could check internal device state and clear bits that are no longer relevant. > -- > regards, > Yang, Sheng > > > >> >So maybe just add an ioctl to get and to clear pending bits. > >> >Maybe set for symmetry. > >> > >> For live migration too. But if they live in memory,
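The table/entry API being debated above can be made concrete with a sketch. Nothing below exists in KVM; the structure names, the field layout, and the user-virtual-address fields (added so the state can live in swappable memory, as suggested in the thread) are all hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical argument layout for the proposed KVM_DEFINE_MSIX_TABLE
 * and KVM_SET_MSIX_ENTRY ioctls.  Not a real KVM ABI. */
struct kvm_msix_table_def {
    uint32_t table_id;
    uint32_t nr_entries;
    uint64_t msix_base_uva;      /* user virtual address of the entries */
    uint64_t pending_bitmap_uva; /* user virtual address of the PBA */
};

struct kvm_msix_entry {
    uint32_t table_id;
    uint32_t entry_id;  /* index within the table */
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t data;
    uint32_t control;   /* bit 0: per-vector mask */
};
```

Using virtual addresses instead of guest physical addresses is what would let the kernel keep the table in swappable user memory; userspace would still need an exit or ioctl to learn when an entry changed so it can reprogram its irqfd routing, which is the open question in the thread.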
Re: [PATCH 2/6] KVM: SVM: Add manipulation functions for CRx intercepts
On Tue, Nov 30, 2010 at 06:03:57PM +0100, Joerg Roedel wrote: > This patch wraps changes to the CRx intercepts of SVM into > separate functions to abstract nested-svm better and prepare > the implementation of the vmcb-clean-bits feature. > > Signed-off-by: Joerg Roedel > --- > arch/x86/include/asm/svm.h | 15 +++-- > arch/x86/kvm/svm.c | 119 > +++- > 2 files changed, 72 insertions(+), 62 deletions(-) > > - control->intercept_cr_read =INTERCEPT_CR0_MASK | > - INTERCEPT_CR3_MASK | > - INTERCEPT_CR4_MASK; > - > - control->intercept_cr_write = INTERCEPT_CR0_MASK | > - INTERCEPT_CR3_MASK | > - INTERCEPT_CR4_MASK | > - INTERCEPT_CR8_MASK; > + set_cr_intercept(svm, INTERCEPT_CR0_READ); > + set_cr_intercept(svm, INTERCEPT_CR3_READ); > + set_cr_intercept(svm, INTERCEPT_CR4_READ); > + set_cr_intercept(svm, INTERCEPT_CR0_WRITE); > + set_cr_intercept(svm, INTERCEPT_CR3_WRITE); > + set_cr_intercept(svm, INTERCEPT_CR4_WRITE); > + set_cr_intercept(svm, INTERCEPT_CR8_WRITE); Should clear hflags before using is_guest_mode().
Re: [PATCH 2/3] Add "broadcast" option for mce command.
On Tue, Nov 30, 2010 at 05:14:13PM +0900, Jin Dongming wrote: > When the following test case is injected with mce command, maybe user could > not > get the expected result. > DATA >command cpu bank status mcg_status addr misc > (qemu) mce 1 10xbd00 0x050x1234 0x8c > > Expected Result >panic type: "Fatal Machine check" > > That is because each mce command can only inject the given cpu and could not > inject mce interrupt to other cpus. So user will get the following result: > panic type: "Fatal machine check on current CPU" > > "broadcast" option is used for injecting dummy data into other cpus. Injecting > mce with this option the expected result could be gotten. > > Signed-off-by: Jin Dongming > --- > cpu-all.h |3 ++- > hmp-commands.hx |6 +++--- > monitor.c | 18 -- > target-i386/helper.c | 17 +++-- > target-i386/kvm.c | 12 ++-- > target-i386/kvm_x86.h |2 +- > 6 files changed, 47 insertions(+), 11 deletions(-) > > diff --git a/cpu-all.h b/cpu-all.h > index 11edddc..a3b3cd8 100644 > --- a/cpu-all.h > +++ b/cpu-all.h > @@ -967,6 +967,7 @@ int cpu_memory_rw_debug(CPUState *env, target_ulong addr, > uint8_t *buf, int len, int is_write); > > void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, > -uint64_t mcg_status, uint64_t addr, uint64_t misc); > +uint64_t mcg_status, uint64_t addr, uint64_t misc, > +int broadcast); > > #endif /* CPU_ALL_H */ > diff --git a/hmp-commands.hx b/hmp-commands.hx > index ba6de28..44ad571 100644 > --- a/hmp-commands.hx > +++ b/hmp-commands.hx > @@ -1053,9 +1053,9 @@ ETEXI > > { > .name = "mce", > -.args_type = > "cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l", > -.params = "cpu bank status mcgstatus addr misc", > -.help = "inject a MCE on the given CPU", > +.args_type = > "cpu_index:i,bank:i,status:l,mcg_status:l,addr:l,misc:l,broadcast:s?", > +.params = "cpu bank status mcgstatus addr misc [broadcast|b]", > +.help = "inject a MCE on the given CPU [and broadcast to other > CPUs]", > .mhandler.cmd = do_inject_mce, > }, 
> > diff --git a/monitor.c b/monitor.c > index 66d6acd..e316eb6 100644 > --- a/monitor.c > +++ b/monitor.c > @@ -2263,12 +2263,26 @@ static void do_inject_mce(Monitor *mon, const QDict > *qdict) > uint64_t mcg_status = qdict_get_int(qdict, "mcg_status"); > uint64_t addr = qdict_get_int(qdict, "addr"); > uint64_t misc = qdict_get_int(qdict, "misc"); > +const char *b = qdict_get_try_str(qdict, "broadcast"); qdict_get_try_bool is easier. > +int broadcast = 0; > > -for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu) > +if (b) { > +if (!strncmp(b, "broadcast", sizeof("broadcast")) || > +!strncmp(b, "b", sizeof("b"))) { > +broadcast = 1; > +} else { > +monitor_printf(mon, "unexpected option: %s\n", b); > +return; > +} > +} > + > +for (cenv = first_cpu; cenv != NULL; cenv = cenv->next_cpu) { > if (cenv->cpu_index == cpu_index && cenv->mcg_cap) { > -cpu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc); > +cpu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, > + broadcast); > break; > } > +} > } > #endif > > diff --git a/target-i386/helper.c b/target-i386/helper.c > index e384fdc..7e07ebd 100644 > --- a/target-i386/helper.c > +++ b/target-i386/helper.c > @@ -1071,17 +1071,30 @@ static void qemu_inject_x86_mce(CPUState *cenv, int > bank, uint64_t status, > } > > void cpu_inject_x86_mce(CPUState *cenv, int bank, uint64_t status, > -uint64_t mcg_status, uint64_t addr, uint64_t misc) > +uint64_t mcg_status, uint64_t addr, uint64_t misc, > +int broadcast) > { > unsigned bank_num = cenv->mcg_cap & 0xff; > +CPUState *env; > +int flag = 0; > > if (bank >= bank_num || !(status & MCI_STATUS_VAL)) > return; > > if (kvm_enabled()) { > -kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, 0); > +if (broadcast) > +flag |= 0x02; /* bit 1: 1(broadcast); 0(not broadcast) */ > +kvm_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc, flag); > } else { > qemu_inject_x86_mce(cenv, bank, status, mcg_status, addr, misc); > +if (broadcast) { > +for (env 
= first_cpu; env != NULL; env = env->next_cpu) { > +if (cenv == env) > +continue; > + > +qemu_inject_x86_mce(env, 1, 0xa000, 0, 0,
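The patch above folds the broadcast option into bit 1 of the flag passed to kvm_inject_x86_mce(). A minimal sketch of that encoding (the helper names are illustrative, not from the patch):

```c
#include <assert.h>

/* Bit 1 of the kvm_inject_x86_mce() flag argument, as defined in the
 * patch above: 1 = broadcast the MCE to all CPUs, 0 = single CPU. */
#define MCE_BROADCAST_FLAG 0x02

static int mce_make_flag(int broadcast)
{
    return broadcast ? MCE_BROADCAST_FLAG : 0;
}

static int mce_is_broadcast(int flag)
{
    return (flag & MCE_BROADCAST_FLAG) != 0;
}
```

With this encoding the existing zero-flag callers keep their old (non-broadcast) behaviour, and bit 0 remains free for whatever it already carried.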
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
Actually CCing Rik now! On Thu, Dec 02, 2010 at 08:57:16PM +0530, Srivatsa Vaddagiri wrote: > On Thu, Dec 02, 2010 at 03:49:44PM +0200, Avi Kivity wrote: > > On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote: > > >On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote: > > >> >> What I'd like to see in directed yield is donating exactly the > > >> >> amount of vruntime that's needed to make the target thread run. > > >> > > > >> >I presume this requires the target vcpu to move left in rb-tree to run > > >> >earlier than scheduled currently and that it doesn't involve any > > >> >change to the sched_period() of target vcpu? > > >> > > > >> >Just was wondering how this would work in case of buggy guests. Lets > > >> say that a > > >> >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > > >> >currently), while VCPU spins on lock A (held by VCPU0 currently). Both > > >> keep > > >> >boosting each other's vruntime, potentially affecting fairtime for > > >> other guests > > >> >(to the point of starving them perhaps)? > > >> > > >> We preserve vruntime overall. If you give vruntime to someone, it > > >> comes at your own expense. Overall vruntime is preserved. > > > > > >Hmm ..so I presume that this means we don't affect target thread's > > >position in > > >rb-tree upon donation, rather we influence its sched_period() to include > > >donated time? IOW donation has no effect on causing the target thread to > > >run > > >"immediately", rather it will have the effect of causing it run "longer" > > >whenever it runs next? > > > > No. The intent (at least mine, maybe Rik has other ideas) is to > > CCing Rik now .. > > > move some vruntime from current to target such that target would be > > placed before current in the timeline. > > Well ok, then this is what I had presumed earlier (about shifting target > towards > left in rb-tree). 
> > > >Even that would require some precaution in directed yield to ensure that it > > >doesn't unduly inflate vruntime of target, hurting fairness for other > > >guests on > > >same cpu as target (example guest code that can lead to this situation > > >below): > > > > > >vcpu0: vcpu1: > > > > > > spinlock(A); > > > > > >spinlock(A); > > > > > > while(1) > > > ; > > > > > > spin_unlock(A); > > > > directed yield should preserve the invariant that sum(vruntime) does > > not change. > > Hmm don't think I understand this invariant sum() part. Let's take a simple > example as below: > > > p0-> A0 B0 A1 > > p1-> B1 C0 C1 > > A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus > > Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace > deliberately). When A0 spins and exits (due to PLE) what does its directed > yield > do? Going by your statement, it can put target before current, leading to > perhaps this arrangement in runqueue: > > p0-> A1 B0 A0 > > Now A1 spins and wants to do a directed yield back to A0, leading to: > > p0-> A0 B0 A1 > > This can go back and forth, starving B0 (iow leading to some sort of DoS > attack). > > Where does the "invariant sum" part of directed yield kick in to avoid such > nastiness? > > - vatsa
[PATCH 2/2] KVM: VMX: Return 0 from a failed VMREAD
If we execute VMREAD during reboot we'll just skip over it. Instead of returning garbage, return 0, which has a much smaller chance of confusing the code. Otherwise we risk a flood of debug printk()s which block the reboot process if a serial console or netconsole is enabled. Signed-off-by: Avi Kivity --- arch/x86/kvm/vmx.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index caa967e..72cfdb7 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -565,10 +565,10 @@ static inline void ept_sync_individual_addr(u64 eptp, gpa_t gpa) static unsigned long vmcs_readl(unsigned long field) { - unsigned long value; + unsigned long value = 0; asm volatile (__ex(ASM_VMX_VMREAD_RDX_RAX) - : "=a"(value) : "d"(field) : "cc"); + : "+a"(value) : "d"(field) : "cc"); return value; } -- 1.7.1
[PATCH 1/2] KVM: Don't spin on virt instruction faults during reboot
Since vmx blocks INIT signals, we disable virtualization extensions during reboot. This leads to virtualization instructions faulting; we trap these faults and spin while the reboot continues. Unfortunately spinning on a non-preemptible kernel may block a task that reboot depends on; this causes the reboot to hang. Fix by skipping over the instruction and hoping for the best. Signed-off-by: Avi Kivity --- arch/x86/include/asm/kvm_host.h |8 ++-- virt/kvm/kvm_main.c | 13 - 2 files changed, 10 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 0e64a39..3987e1f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -790,14 +790,18 @@ enum { * reboot turns off virtualization while processes are running. * Trap the fault and ignore the instruction if that happens. */ -asmlinkage void kvm_handle_fault_on_reboot(void); +asmlinkage void kvm_spurious_fault(void); +extern bool kvm_rebooting; #define __kvm_handle_fault_on_reboot(insn) \ "666: " insn "\n\t" \ + "668: \n\t" \ ".pushsection .fixup, \"ax\" \n" \ "667: \n\t" \ + "cmpb $0, kvm_rebooting \n\t" \ + "jne 668b \n\t" \ __ASM_SIZE(push) " $666b \n\t"\ - "jmp kvm_handle_fault_on_reboot \n\t" \ + "call kvm_spurious_fault \n\t"\ ".popsection \n\t" \ ".pushsection __ex_table, \"a\" \n\t" \ _ASM_PTR " 666b, 667b \n\t" \ diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index c4ee364..83f5bf6 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -90,7 +90,8 @@ static void hardware_disable_all(void); static void kvm_io_bus_destroy(struct kvm_io_bus *bus); -static bool kvm_rebooting; +bool kvm_rebooting; +EXPORT_SYMBOL_GPL(kvm_rebooting); static bool largepages_enabled = true; @@ -2179,18 +2180,12 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val, } -asmlinkage void kvm_handle_fault_on_reboot(void) +asmlinkage void kvm_spurious_fault(void) { - if (kvm_rebooting) { - /* spin while reset goes 
on */ - local_irq_enable(); - while (true) - cpu_relax(); - } /* Fault while not rebooting. We want the trace. */ BUG(); } -EXPORT_SYMBOL_GPL(kvm_handle_fault_on_reboot); +EXPORT_SYMBOL_GPL(kvm_spurious_fault); static int kvm_reboot(struct notifier_block *notifier, unsigned long val, void *v) -- 1.7.1
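The control flow that the rewritten fixup implements can be summarized in plain C. This is only an illustration of the asm above (the cmpb/jne test against kvm_rebooting), not kernel code; the enum and function names are invented for the sketch:

```c
#include <assert.h>

enum fault_action { FAULT_SKIP_INSN, FAULT_BUG };

/* What the fixup section decides when a virtualization instruction
 * faults: during reboot VMX has already been turned off, so the
 * fault is expected and execution resumes after the instruction
 * (the "jne 668b" path); any other fault is a genuine bug and
 * kvm_spurious_fault() hits BUG(). */
static enum fault_action handle_virt_fault(int rebooting)
{
    return rebooting ? FAULT_SKIP_INSN : FAULT_BUG;
}
```

Compared to the old handler, no path spins anymore, which is exactly what unblocks reboot on a non-preemptible kernel.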
[PATCH 0/2] Fix reboot on non-preemptible kernels
Reboot with guests running, on Intel hosts, with a non-preemptible host kernel is broken. This patchset fixes the issue. Avi Kivity (2): KVM: Don't spin on virt instruction faults during reboot KVM: VMX: Return 0 from a failed VMREAD arch/x86/include/asm/kvm_host.h |8 ++-- arch/x86/kvm/vmx.c |4 ++-- virt/kvm/kvm_main.c | 13 - 3 files changed, 12 insertions(+), 13 deletions(-)
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 05:33:40PM +0200, Avi Kivity wrote: > A0 and A1's vruntime will keep growing, eventually B will become > leftmost and become runnable (assuming leftmost == min vruntime, not > sure what the terminology is). Donation (in directed yield) will cause vruntime to drop as well (that's the only way target can get to run ahead of its scheduled time), so I still think there are nasty issues involved here. Anyway, I am curious to see the directed yield implementation that Rik has - I can comment more after I have seen that! - vatsa
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/02/2010 05:27 PM, Srivatsa Vaddagiri wrote: > >Even that would require some precaution in directed yield to ensure that it > >doesn't unduly inflate vruntime of target, hurting fairness for other guests on > >same cpu as target (example guest code that can lead to this situation > >below): > > > >vcpu0:vcpu1: > > > > spinlock(A); > > > >spinlock(A); > > > > while(1) > > ; > > > > spin_unlock(A); > > directed yield should preserve the invariant that sum(vruntime) does > not change. Hmm don't think I understand this invariant sum() part. Let's take a simple example as below: p0 -> A0 B0 A1 p1 -> B1 C0 C1 A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace deliberately). When A0 spins and exits (due to PLE) what does its directed yield do? Going by your statement, it can put target before current, leading to perhaps this arrangement in runqueue: p0 -> A1 B0 A0 Now A1 spins and wants to do a directed yield back to A0, leading to: p0 -> A0 B0 A1 This can go back and forth, starving B0 (iow leading to some sort of DoS attack). Where does the "invariant sum" part of directed yield kick in to avoid such nastiness? A0 and A1's vruntime will keep growing, eventually B will become leftmost and become runnable (assuming leftmost == min vruntime, not sure what the terminology is). -- error compiling committee.c: too many arguments to function
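The "sum(vruntime) does not change" invariant under discussion can be sketched as a plain transfer: the yielding vcpu takes on extra vruntime and the target gives some up, so the target moves left in the timeline while the runqueue total is preserved. A toy model, not the CFS implementation (names invented for the sketch):

```c
#include <assert.h>

struct toy_vcpu {
    unsigned long long vruntime; /* lower = scheduled earlier */
};

/* Directed yield as described in the thread: donation comes at the
 * yielder's expense, so the sum of vruntime over the runqueue is
 * unchanged by the yield itself. */
static void directed_yield(struct toy_vcpu *from, struct toy_vcpu *to,
                           unsigned long long delta)
{
    from->vruntime += delta; /* yielder is pushed right in the timeline */
    to->vruntime   -= delta; /* target is pulled left, runs sooner */
}
```

The model also shows why the deadlock worry is bounded: two spinning vcpus can trade delta back and forth, but they still accrue vruntime while actually running, so a third thread's position eventually becomes leftmost, which is the point Avi makes about A0/A1 growing until B runs.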
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 03:49:44PM +0200, Avi Kivity wrote: > On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote: > >On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote: > >> >> What I'd like to see in directed yield is donating exactly the > >> >> amount of vruntime that's needed to make the target thread run. > >> > > >> >I presume this requires the target vcpu to move left in rb-tree to run > >> >earlier than scheduled currently and that it doesn't involve any > >> >change to the sched_period() of target vcpu? > >> > > >> >Just was wondering how this would work in case of buggy guests. Lets say > >> that a > >> >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > >> >currently), while VCPU spins on lock A (held by VCPU0 currently). Both > >> keep > >> >boosting each other's vruntime, potentially affecting fairtime for other > >> guests > >> >(to the point of starving them perhaps)? > >> > >> We preserve vruntime overall. If you give vruntime to someone, it > >> comes at your own expense. Overall vruntime is preserved. > > > >Hmm ..so I presume that this means we don't affect target thread's position > >in > >rb-tree upon donation, rather we influence its sched_period() to include > >donated time? IOW donation has no effect on causing the target thread to run > >"immediately", rather it will have the effect of causing it run "longer" > >whenever it runs next? > > No. The intent (at least mine, maybe Rik has other ideas) is to CCing Rik now .. > move some vruntime from current to target such that target would be > placed before current in the timeline. Well ok, then this is what I had presumed earlier (about shifting target towards left in rb-tree). 
> >Even that would require some precaution in directed yield to ensure that it > >doesn't unduly inflate vruntime of target, hurting fairness for other guests > >on > >same cpu as target (example guest code that can lead to this situation > >below): > > > >vcpu0: vcpu1: > > > > spinlock(A); > > > >spinlock(A); > > > > while(1) > > ; > > > > spin_unlock(A); > > directed yield should preserve the invariant that sum(vruntime) does > not change. Hmm don't think I understand this invariant sum() part. Let's take a simple example as below: p0 -> A0 B0 A1 p1 -> B1 C0 C1 A/B/C are VMs and A0 etc are virtual cpus. p0/1 are physical cpus Let's say A0/A1 hit AB-BA spin-deadlock (which one can write in userspace deliberately). When A0 spins and exits (due to PLE) what does its directed yield do? Going by your statement, it can put target before current, leading to perhaps this arrangement in runqueue: p0 -> A1 B0 A0 Now A1 spins and wants to do a directed yield back to A0, leading to: p0 -> A0 B0 A1 This can go back and forth, starving B0 (iow leading to some sort of DoS attack). Where does the "invariant sum" part of directed yield kick in to avoid such nastiness? - vatsa
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
On 12/02/2010 08:39 AM, lidong chen wrote: In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. i could not understand why need this? can you tell more detailedly? If you run 4 guests on a CPU, and they're all trying to consume 100% CPU, all things being equal, you'll get ~25% CPU for each guest. However, if one guest is idle, you'll get something like 1% 32% 33% 32%. This characteristic is usually desirable because it increases aggregate throughput, but in some circumstances, determinism is more desirable than aggregate throughput. This patch essentially makes guest execution non-work-conserving by making it appear to the scheduler that each guest wants 100% CPU even though they may be idling. That means that regardless of what each guest is doing, if you have four guests on one CPU, each will get ~25% CPU[1]. [1] there are corner cases around things like forced sleep due to PFs and the like. The goal is not 100% determinism, but to at least obtain significantly more determinism than we have now. Regards, Anthony Liguori thanks. 2010/12/2 Anthony Liguori: In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. There are many approaches to achieve this but the most direct is to simply avoid trapping the HLT instruction which lets the guest directly execute the instruction putting the processor to sleep. Introduce this as a module-level option for kvm-vmx.ko since if you do this for one guest, you probably want to do it for all. A similar option is possible for AMD but I don't have easy access to AMD test hardware.
Signed-off-by: Anthony Liguori --- v1 -> v2 - Rename parameter to yield_on_hlt - Remove __read_mostly diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index caa967e..d8310e4 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO); static int __read_mostly vmm_exclusive = 1; module_param(vmm_exclusive, bool, S_IRUGO); +static int yield_on_hlt = 1; +module_param(yield_on_hlt, bool, S_IRUGO); + #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \ (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) #define KVM_GUEST_CR0_MASK \ @@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) &_pin_based_exec_control)< 0) return -EIO; - min = CPU_BASED_HLT_EXITING | + min = #ifdef CONFIG_X86_64 CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING | @@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING | CPU_BASED_INVLPG_EXITING; + + if (yield_on_hlt) + min |= CPU_BASED_HLT_EXITING; + opt = CPU_BASED_TPR_SHADOW | CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; -- 1.7.0.4
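The "~25% each" versus "1% 32% 33% 32%" arithmetic in the explanation above can be sketched directly. The helper names are illustrative only; this is not scheduler code:

```c
#include <assert.h>

/* Non-work-conserving (yield_on_hlt = 0): every guest looks 100%
 * busy to the scheduler, so each of N guests keeps a fixed 1/N
 * slice whether or not it idles. */
static int fixed_share_pct(int nr_guests)
{
    return 100 / nr_guests;
}

/* Work-conserving (the default): cycles an idle guest gives up are
 * redistributed among the guests that still want to run. */
static int work_conserving_share_pct(int nr_busy)
{
    return nr_busy ? 100 / nr_busy : 0;
}
```

With four guests and one idling, the three busy guests each get roughly 100/3 ≈ 33% under the default, but a fixed 25% with yield_on_hlt disabled, which is the determinism the patch trades throughput for.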
Re: [SeaBIOS] [PATCHv6 00/16] boot order specification
Gleb Natapov wrote: > BBS specification is broken since it doesn't provide a way for > discovered boot method (BCV) to be linked back to a device it will > boot from. Nothing we can do to fix this except moving to EFI (and > hope the problem is fixed there). There is that option, or there could be some simple improvement of our own, which works in QEMU and maybe even adds value to coreboot. But then there would be a bit of novel development in firmware - that can't be a good thing, right? > Spec says that in that case user probably will want to adjust boot > order anyway and will enter boot menu by itself. Sorry excuse for > failing to provide proper functionality if you ask me :) I agree. I can not believe the absolute resistance to innovation in this field. Isn't the scope of BBS logic limited to boot time? (There are calls to do settings, but that's no problem.) Maybe it would be possible for SeaBIOS to provide what looks like BBS to the guest, but on the other side there is something more intelligent going on, be it together with QEMU or coreboot? //Peter
Re: Mask bit support's API
On Thu, Dec 2, 2010 at 10:26 PM, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: >> On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: >> >> >> >> Which case? the readl() doesn't need access to the routing table, >> >> just the entry. >> > >> >One thing that read should do is flush in the outstanding >> >interrupts and flush out the mask bit writes. >> >> The mask bit writes are synchronous. >> >> wrt interrupts, we can deal with assigned devices, and can poll >> irqfds. But we can't force vhost-net to issue an interrupt (and I >> don't think it's required). > > To clarify: > > mask write > read > > it is safe for guest to assume no more interrupts > > where as with a simple > mask write > > an interrupt might be in flight and get delivered shortly afterwards. I think it's already contained in the current patchset. > > >> >> Oh, I think there is a terminology problem, I was talking about >> >> kvm's irq routing table, you were talking about the msix entries. >> >> >> >> I think treating it as a cache causes more problems, since there are >> >> now two paths for reads (in cache and not in cache) and more things >> >> for writes to manage. >> >> >> >> Here's my proposed API: >> >> >> >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, >> >> pending_bitmap_base_gpa) >> >> >> >> - called when the guest enables msix >> > >> >I would add virtual addresses so that we can use swappable memory to >> >store the state. >> >> Right. Do we need synchronization between kernel and userspace? Any recommended method? >> >> >If we do, maybe we can just keep the table there and then >> >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? >> >> Still need to to let userspace know it needs to reprogram the irqfd >> or whatever it uses to inject the interrupt. > > Why do we need to reprogram irqfd? I thought irqfd would map to an > entry within the table instead of address/data as now. > Could you clarify please? 
> > >> >> KVM_REMOVE_MSIX_TABLE(table_id) >> >> >> >> - called when the guest disables msix >> >> >> >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) >> >> >> >> - called when the guest enables msix (to initialize it), or after >> >> live migration >> > >> >What is entry_id here? >> >> Entry within the table. > > So I think KVM_DEFINE_MSIX_TABLE should be called when msix is > enabled (note: it can not be called at boot anyway since pa > depends on BAR assigned by BIOS). Don't agree. MMIO can be written regardless of whether MSI-X is enabled. If you handle MMIO in the kernel, then handle it all. I suppose qemu still has control of the BAR? Then leaving it in its current place should be fine. > >> >> Michael? I think that should work for virtio and vfio assigned >> >> devices? Not sure about pending bits. >> > >> >Pending bits must be tracked in kernel, but I don't see >> >how we can support polling mode if we don't exit to userspace >> >on pending bit reads. >> > >> >This does mean that some reads will be fast and some will be >> >slow, and it's a bit sad that we seem to be optimizing >> >for specific guests, but I just can't come up with >> >anything better. >> > >> >> If the pending bits live in userspace memory, the device model can >> update them directly? > > Note that these are updated on an interrupt, so updating them > in userspace would need get_user_page etc trickery, > and add the overhead of atomics. > > Further I think it's important to avoid the overhead of updating them > all the time, and only do this when an interrupt is > masked or on pending bits read. Since userspace does not know > when interrupts are masked, this means do update on each read. In fact qemu's accesses to MMIO should be quite rare after moving everything to the kernel. Using an ioctl is also fine with me. And how would we "do update on each read"? -- regards, Yang, Sheng > >> >So maybe just add an ioctl to get and to clear pending bits. >> >Maybe set for symmetry. >> >> For live migration too. 
But if they live in memory, no need for >> get/set, just specify the address. >> >> -- >> error compiling committee.c: too many arguments to function
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. I could not understand why this is needed. Can you explain in more detail? Thanks. 2010/12/2 Anthony Liguori : > In certain use-cases, we want to allocate guests fixed time slices where idle > guest cycles leave the machine idling. There are many approaches to achieve > this but the most direct is to simply avoid trapping the HLT instruction which > lets the guest directly execute the instruction putting the processor to > sleep. > > Introduce this as a module-level option for kvm-vmx.ko since if you do this > for one guest, you probably want to do it for all. A similar option is > possible > for AMD but I don't have easy access to AMD test hardware. > > Signed-off-by: Anthony Liguori > --- > v1 -> v2 > - Rename parameter to yield_on_hlt > - Remove __read_mostly > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index caa967e..d8310e4 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO); > static int __read_mostly vmm_exclusive = 1; > module_param(vmm_exclusive, bool, S_IRUGO); > > +static int yield_on_hlt = 1; > +module_param(yield_on_hlt, bool, S_IRUGO); > + > #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \ > (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) > #define KVM_GUEST_CR0_MASK \ > @@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config > *vmcs_conf) > &_pin_based_exec_control) < 0) > return -EIO; > > - min = CPU_BASED_HLT_EXITING | > + min = > #ifdef CONFIG_X86_64 > CPU_BASED_CR8_LOAD_EXITING | > CPU_BASED_CR8_STORE_EXITING | > @@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config > *vmcs_conf) > CPU_BASED_MWAIT_EXITING | > CPU_BASED_MONITOR_EXITING | > CPU_BASED_INVLPG_EXITING; > + > + if (yield_on_hlt) > + min |= CPU_BASED_HLT_EXITING; > + > opt = CPU_BASED_TPR_SHADOW | > CPU_BASED_USE_MSR_BITMAPS | > 
CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; > -- > 1.7.0.4
Re: Mask bit support's API
On Thu, Dec 02, 2010 at 03:56:52PM +0200, Avi Kivity wrote: > On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > >> > >> Which case? the readl() doesn't need access to the routing table, > >> just the entry. > > > >One thing that read should do is flush in the outstanding > >interrupts and flush out the mask bit writes. > > The mask bit writes are synchronous. > > wrt interrupts, we can deal with assigned devices, and can poll > irqfds. But we can't force vhost-net to issue an interrupt (and I > don't think it's required). To clarify: mask write read it is safe for guest to assume no more interrupts where as with a simple mask write an interrupt might be in flight and get delivered shortly afterwards. > >> Oh, I think there is a terminology problem, I was talking about > >> kvm's irq routing table, you were talking about the msix entries. > >> > >> I think treating it as a cache causes more problems, since there are > >> now two paths for reads (in cache and not in cache) and more things > >> for writes to manage. > >> > >> Here's my proposed API: > >> > >> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > >> pending_bitmap_base_gpa) > >> > >> - called when the guest enables msix > > > >I would add virtual addresses so that we can use swappable memory to > >store the state. > > Right. > > >If we do, maybe we can just keep the table there and then > >KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? > > Still need to to let userspace know it needs to reprogram the irqfd > or whatever it uses to inject the interrupt. Why do we need to reprogram irqfd? I thought irqfd would map to an entry within the table instead of address/data as now. Could you clarify please? > >> KVM_REMOVE_MSIX_TABLE(table_id) > >> > >>- called when the guest disables msix > >> > >> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > >> > >>- called when the guest enables msix (to initialize it), or after > >> live migration > > > >What is entry_id here? 
> > Entry within the table. So I think KVM_DEFINE_MSIX_TABLE should be called when msix is enabled (note: it can not be called at boot anyway since pa depends on BAR assigned by BIOS). > >> Michael? I think that should work for virtio and vfio assigned > >> devices? Not sure about pending bits. > > > >Pending bits must be tracked in kernel, but I don't see > >how we can support polling mode if we don't exit to userspace > >on pending bit reads. > > > >This does mean that some reads will be fast and some will be > >slow, and it's a bit sad that we seem to be optimizing > >for specific guests, but I just can't come up with > >anything better. > > > > If the pending bits live in userspace memory, the device model can > update them directly? Note that these are updated on an interrupt, so updating them in userspace would need get_user_page etc trickery, and add the overhead of atomics. Further I think it's important to avoid the overhead of updating them all the time, and only do this when an interrupt is masked or on pending bits read. Since userspace does not know when interrupts are masked, this means do update on each read. > >So maybe just add an ioctl to get and to clear pending bits. > >Maybe set for symmetry. > > For live migration too. But if they live in memory, no need for > get/set, just specify the address. > > -- > error compiling committee.c: too many arguments to function
Re: Performance test result between virtio_pci MSI-X disable and enable
I applied the patch correctly. The addr is not in the mmio range, because kvm_io_bus_write tests the addr against each device:

/* kvm_io_bus_write - called under kvm->slots_lock */
int kvm_io_bus_write(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
		     int len, const void *val)
{
	int i;
	struct kvm_io_bus *bus = rcu_dereference(kvm->buses[bus_idx]);

	for (i = 0; i < bus->dev_count; i++)
		if (!kvm_iodevice_write(bus->devs[i], addr, len, val))
			return 0;
	return -EOPNOTSUPP;
}

the test result (cpu rate) before applying the patch:
       host os   guest os
CPU0   15.85     58.05
CPU1    2.97     70.56
CPU2    3.25     69.00
CPU3    3.31     68.59
CPU4    5.11     47.46
CPU5    4.70     29.28
CPU6    6.00      5.96
CPU7    4.75      2.58

the test result (cpu rate) after applying the patch:
       host os   guest os
CPU0   11.89     60.92
CPU1    2.18     68.07
CPU2    2.61     66.18
CPU3    2.44     66.46
CPU4    2.67     44.98
CPU5    2.31     26.81
CPU6    1.93      5.65
CPU7    1.79      2.14

The total cpu rate is reduced from 397% to 369%. I pinned the vcpus as below, and only vcpu0 handles the msix interrupt:

virsh vcpupin brd2vm1sriov 0 0
virsh vcpupin brd2vm1sriov 1 1
virsh vcpupin brd2vm1sriov 2 2
virsh vcpupin brd2vm1sriov 3 3
virsh vcpupin brd2vm1sriov 4 4
virsh vcpupin brd2vm1sriov 5 5
virsh vcpupin brd2vm1sriov 6 6
virsh vcpupin brd2vm1sriov 7 7

I will test virtio_net with msix enabled later, and compare it with msix disabled, to see which is better. 2010/12/2 Michael S. Tsirkin : > On Thu, Dec 02, 2010 at 07:52:00PM +0800, Sheng Yang wrote: >> On Thu, Dec 2, 2010 at 5:49 PM, Michael S. Tsirkin wrote: >> > On Thu, Dec 02, 2010 at 09:13:28AM +0800, Yang, Sheng wrote: >> >> On Wednesday 01 December 2010 22:03:58 Michael S. Tsirkin wrote: >> >> > On Wed, Dec 01, 2010 at 04:41:38PM +0800, lidong chen wrote: >> >> > > I used sr-iov, give each vm 2 vf. >> >> > > after apply the patch, and i found performence is the same. >> >> > > >> >> > > the reason is in function msix_mmio_write, mostly addr is not in mmio >> >> > > range. 
>> >> > > >> >> > > static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int >> >> > > len, >> >> > > >> >> > > const void *val) >> >> > > >> >> > > { >> >> > > >> >> > > struct kvm_assigned_dev_kernel *adev = >> >> > > >> >> > > container_of(this, struct kvm_assigned_dev_kernel, >> >> > > >> >> > > msix_mmio_dev); >> >> > > >> >> > > int idx, r = 0; >> >> > > unsigned long new_val = *(unsigned long *)val; >> >> > > >> >> > > mutex_lock(&adev->kvm->lock); >> >> > > if (!msix_mmio_in_range(adev, addr, len)) { >> >> > > >> >> > > // return here. >> >> > > >> >> > > r = -EOPNOTSUPP; >> >> > > >> >> > > goto out; >> >> > > >> >> > > } >> >> > > >> >> > > i printk the value: >> >> > > addr start end len >> >> > > F004C00C F0044000 F0044030 4 >> >> > > >> >> > > 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed >> >> > > (rev >> >> > > 01) >> >> > > >> >> > > Subsystem: Intel Corporation Unknown device 000c >> >> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> >> > > >> >> > > Stepping- SERR- FastB2B- >> >> > > >> >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> >> > > >> >> > > SERR- > >> > > >> >> > > Latency: 0 >> >> > > Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K] >> >> > > Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K] >> >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 >> >> > > >> >> > > Vector table: BAR=3 offset= >> >> > > PBA: BAR=3 offset=2000 >> >> > > >> >> > > 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed >> >> > > (rev >> >> > > 01) >> >> > > >> >> > > Subsystem: Intel Corporation Unknown device 000c >> >> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> >> > > >> >> > > Stepping- SERR- 
FastB2B- >> >> > > >> >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> >> > > >> >> > > SERR- > >> > > >> >> > > Latency: 0 >> >> > > Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K] >> >> > > Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K] >> >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 >> >> > > >> >> > > Vector table: BAR=3 offset= >> >> > > PBA: BAR=3 offset=2000 >> >> > > >> >> > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev, >> >> > > + gpa_t addr, int len) >> >> > > +{ >> >> > > + gpa_t start, end; >> >> > > + >> >> > > + BUG_ON(adev->msix_mmio_base == 0); >> >
[PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions (v2)
In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. There are many approaches to achieve this but the most direct is to simply avoid trapping the HLT instruction which lets the guest directly execute the instruction putting the processor to sleep. Introduce this as a module-level option for kvm-vmx.ko since if you do this for one guest, you probably want to do it for all. A similar option is possible for AMD but I don't have easy access to AMD test hardware. Signed-off-by: Anthony Liguori --- v1 -> v2 - Rename parameter to yield_on_hlt - Remove __read_mostly diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index caa967e..d8310e4 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO); static int __read_mostly vmm_exclusive = 1; module_param(vmm_exclusive, bool, S_IRUGO); +static int yield_on_hlt = 1; +module_param(yield_on_hlt, bool, S_IRUGO); + #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST \ (X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD) #define KVM_GUEST_CR0_MASK \ @@ -1419,7 +1422,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) &_pin_based_exec_control) < 0) return -EIO; - min = CPU_BASED_HLT_EXITING | + min = #ifdef CONFIG_X86_64 CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING | @@ -1432,6 +1435,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING | CPU_BASED_INVLPG_EXITING; + + if (yield_on_hlt) + min |= CPU_BASED_HLT_EXITING; + opt = CPU_BASED_TPR_SHADOW | CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; -- 1.7.0.4
Re: Mask bit support's API
On 12/02/2010 03:47 PM, Michael S. Tsirkin wrote: > > Which case? the readl() doesn't need access to the routing table, > just the entry. One thing that read should do is flush in the outstanding interrupts and flush out the mask bit writes. The mask bit writes are synchronous. wrt interrupts, we can deal with assigned devices, and can poll irqfds. But we can't force vhost-net to issue an interrupt (and I don't think it's required). > Oh, I think there is a terminology problem, I was talking about > kvm's irq routing table, you were talking about the msix entries. > > I think treating it as a cache causes more problems, since there are > now two paths for reads (in cache and not in cache) and more things > for writes to manage. > > Here's my proposed API: > > KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, > pending_bitmap_base_gpa) > > - called when the guest enables msix I would add virtual addresses so that we can use swappable memory to store the state. Right. If we do, maybe we can just keep the table there and then KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed? Still need to let userspace know it needs to reprogram the irqfd or whatever it uses to inject the interrupt. > KVM_REMOVE_MSIX_TABLE(table_id) > >- called when the guest disables msix > > KVM_SET_MSIX_ENTRY(table_id, entry_id, contents) > >- called when the guest enables msix (to initialize it), or after > live migration What is entry_id here? Entry within the table. > Michael? I think that should work for virtio and vfio assigned > devices? Not sure about pending bits. Pending bits must be tracked in kernel, but I don't see how we can support polling mode if we don't exit to userspace on pending bit reads. This does mean that some reads will be fast and some will be slow, and it's a bit sad that we seem to be optimizing for specific guests, but I just can't come up with anything better. If the pending bits live in userspace memory, the device model can update them directly? 
So maybe just add an ioctl to get and to clear pending bits. Maybe set for symmetry. For live migration too. But if they live in memory, no need for get/set, just specify the address. -- error compiling committee.c: too many arguments to function
Re: Introduce a C++ wrapper for the kvm APIs
On 24/11/10 18:10 +0200, Avi Kivity wrote: > On 11/24/2010 05:50 PM, Anthony Liguori wrote: > My answer is that C++ is the only language that allows you to evolve > away from C, with mixed C/C++ source (not just linkage level > compatibility). If there are others, I want to know about them. Objective C is good (I like it better than C++) if you can live with the message-passing architecture. Mike
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/02/2010 03:13 PM, Srivatsa Vaddagiri wrote: On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote: > >> What I'd like to see in directed yield is donating exactly the > >> amount of vruntime that's needed to make the target thread run. > > > >I presume this requires the target vcpu to move left in rb-tree to run > >earlier than scheduled currently and that it doesn't involve any > >change to the sched_period() of target vcpu? > > > >Just was wondering how this would work in case of buggy guests. Lets say that a > >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep > >boosting each other's vruntime, potentially affecting fairtime for other guests > >(to the point of starving them perhaps)? > > We preserve vruntime overall. If you give vruntime to someone, it > comes at your own expense. Overall vruntime is preserved. Hmm ..so I presume that this means we don't affect target thread's position in rb-tree upon donation, rather we influence its sched_period() to include donated time? IOW donation has no effect on causing the target thread to run "immediately", rather it will have the effect of causing it run "longer" whenever it runs next? No. The intent (at least mine, maybe Rik has other ideas) is to move some vruntime from current to target such that target would be placed before current in the timeline. Even that would require some precaution in directed yield to ensure that it doesn't unduly inflate vruntime of target, hurting fairness for other guests on same cpu as target (example guest code that can lead to this situation below): vcpu0: vcpu1: spinlock(A); spinlock(A); while(1) ; spin_unlock(A); directed yield should preserve the invariant that sum(vruntime) does not change. 
-- error compiling committee.c: too many arguments to function
Re: Mask bit support's API
On Thu, Dec 02, 2010 at 03:09:43PM +0200, Avi Kivity wrote: > On 12/01/2010 04:36 AM, Yang, Sheng wrote: > >On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote: > >> On 11/26/2010 04:35 AM, Yang, Sheng wrote: > >> > > > Shouldn't kvm also service reads from the pending bitmask? > >> > > > >> > > Of course KVM should service reading from pending bitmask. For > >> > > assigned device, it's kernel who would set the pending bit; but I > >> am > >> > > not sure for virtio. This interface is GET_ENTRY, so reading is > >> fine > >> > > with it. > >> > >> The kernel should manage it in the same way. Virtio raises irq (via > >> KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit. > >> > >> Note we need to be able to read and write the pending bitmask for live > >> migration. > > > >Then seems we still need to an writing interface for it. And I think we can > >work > >on it later since it got no user now. > > I'm not sure. Suppose a guest starts using pending bits. Does it > mean it will not work with kernel msix acceleration? > > If that is the case, then we need pending bit support now. I don't > want to knowingly merge something that doesn't conform to the spec, > forcing users to choose whether they want to enable kernel msix > acceleration or not. > > > > >> > >> Because we have an interface where you get an exit if (addr % 4)< 3 and > >> don't get an exit if (addr % 4) == 3. There is a gpa range which is > >> partially maintained by the kernel and partially in userspace. It's a > >> confusing interface. Things like 64-bit reads or writes need to be > >> broken up and serviced in two different places. > >> > >> We already need to support this (for unaligned writes which hit two > >> regions), but let's at least make a contiguous region behave sanely. > > > >Oh, I didn't mean to handle this kind of unaligned writing in userspace. > >They're > >illegal according to the PCI spec(otherwise the result is undefined > >according to > >the spec). 
I would cover all contiguous writing(32-bit and 64-bit) in the > >next > >version, and discard all illegal writing. And 64-bit accessing would be > >broken up > >in qemu as you said, as they do currently. > > > >In fact I think we can handle all data for 64-bit to qemu, because it should > >still > >sync the mask bit with kernel, which make the maskbit writing in userspace > >useless > >and can be ignored. > > What about reads? Split those as well? > > >> The reason is to keep a sane interface. Like we emulate instructions > >> and msrs in the kernel and don't do half a job. I don't think there's a > >> real need to accelerate the first three words of an msi-x entry. > > > >Here is the case we've observed with Xen. It can only be reproduced by large > >scale > >testing. When the interrupt intensity is very high, even new kernels would > >try to > >make it lower, then it would access mask bit very frequently. And in the > >kernel, > >msi_set_mask_bit() is like this: > > > >static void msi_set_mask_bit(struct irq_data *data, u32 flag) > >{ > > struct msi_desc *desc = irq_data_get_msi(data); > > > > if (desc->msi_attrib.is_msix) { > > msix_mask_irq(desc, flag); > > readl(desc->mask_base); /* Flush write to device */ > > } else { > > unsigned offset = data->irq - desc->dev->irq; > > msi_mask_irq(desc, 1<< offset, flag<< offset); > > } > >} > > > >That flush by readl() would complied with each MSI-X mask writing. That is > >the only > >place we want to accelerate, otherwise it would still fall back to qemu each > >time > >when guest want to mask the MSI-X entry. > > So it seems we do want to accelerate reads to the entire entry. > This seems to give more weight to making the kernel own the entire > entry. > > >> > >> > And BTW, we can take routing table as a kind of *cache*, if the content > >> > is in the cache, then we can fetch it from the cache, otherwise we need > >> > to go back to fetch it from memory(userspace). 
> >> > >> If it's guaranteed by the spec that addr/data pairs are always > >> interpreted in the same way, sure. But there no reason to do it, > >> really, it isn't a fast path. > > > >I am not quite understand the first sentence... But it's a fast path, in the > >case I > >showed above. > > Which case? the readl() doesn't need access to the routing table, > just the entry. One thing that read should do is flush in the outstanding interrupts and flush out the mask bit writes. > Oh, I think there is a terminology problem, I was talking about > kvm's irq routing table, you were talking about the msix entries. > > I think treating it as a cache causes more problems, since there are > now two paths for reads (in cache and not in cache) and more things > for writes to manage. > > Here's my proposed API: > > KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, >
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 02:41:35PM +0200, Avi Kivity wrote: > >> What I'd like to see in directed yield is donating exactly the > >> amount of vruntime that's needed to make the target thread run. > > > >I presume this requires the target vcpu to move left in rb-tree to run > >earlier than scheduled currently and that it doesn't involve any > >change to the sched_period() of target vcpu? > > > >Just was wondering how this would work in case of buggy guests. Lets say > >that a > >guest ran into a AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > >currently), while VCPU spins on lock A (held by VCPU0 currently). Both keep > >boosting each other's vruntime, potentially affecting fairtime for other > >guests > >(to the point of starving them perhaps)? > > We preserve vruntime overall. If you give vruntime to someone, it > comes at your own expense. Overall vruntime is preserved. Hmm ..so I presume that this means we don't affect target thread's position in rb-tree upon donation, rather we influence its sched_period() to include donated time? IOW donation has no effect on causing the target thread to run "immediately", rather it will have the effect of causing it run "longer" whenever it runs next? Even that would require some precaution in directed yield to ensure that it doesn't unduly inflate vruntime of target, hurting fairness for other guests on same cpu as target (example guest code that can lead to this situation below): vcpu0: vcpu1: spinlock(A); spinlock(A); while(1) ; spin_unlock(A); - vatsa
Re: Mask bit support's API
On 12/01/2010 04:36 AM, Yang, Sheng wrote: On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote: > On 11/26/2010 04:35 AM, Yang, Sheng wrote: > > > > Shouldn't kvm also service reads from the pending bitmask? > > > > > > Of course KVM should service reading from pending bitmask. For > > > assigned device, it's kernel who would set the pending bit; but I am > > > not sure for virtio. This interface is GET_ENTRY, so reading is fine > > > with it. > > The kernel should manage it in the same way. Virtio raises irq (via > KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit. > > Note we need to be able to read and write the pending bitmask for live > migration. Then it seems we still need a writing interface for it. And I think we can work on it later since it has no user now. I'm not sure. Suppose a guest starts using pending bits. Does it mean it will not work with kernel msix acceleration? If that is the case, then we need pending bit support now. I don't want to knowingly merge something that doesn't conform to the spec, forcing users to choose whether they want to enable kernel msix acceleration or not. > > Because we have an interface where you get an exit if (addr % 4)< 3 and > don't get an exit if (addr % 4) == 3. There is a gpa range which is > partially maintained by the kernel and partially in userspace. It's a > confusing interface. Things like 64-bit reads or writes need to be > broken up and serviced in two different places. > > We already need to support this (for unaligned writes which hit two > regions), but let's at least make a contiguous region behave sanely. Oh, I didn't mean to handle this kind of unaligned writing in userspace. They're illegal according to the PCI spec(otherwise the result is undefined according to the spec). I would cover all contiguous writing(32-bit and 64-bit) in the next version, and discard all illegal writing. And 64-bit accessing would be broken up in qemu as you said, as they do currently. 
In fact I think we can handle all data for 64-bit to qemu, because it should still sync the mask bit with kernel, which makes the mask-bit writing in userspace useless, so it can be ignored. What about reads? Split those as well? > The reason is to keep a sane interface. Like we emulate instructions > and msrs in the kernel and don't do half a job. I don't think there's a > real need to accelerate the first three words of an msi-x entry. Here is the case we've observed with Xen. It can only be reproduced by large scale testing. When the interrupt intensity is very high, even new kernels would try to make it lower, then it would access the mask bit very frequently. And in the kernel, msi_set_mask_bit() is like this:

static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
	struct msi_desc *desc = irq_data_get_msi(data);

	if (desc->msi_attrib.is_msix) {
		msix_mask_irq(desc, flag);
		readl(desc->mask_base);	/* Flush write to device */
	} else {
		unsigned offset = data->irq - desc->dev->irq;
		msi_mask_irq(desc, 1 << offset, flag << offset);
	}
}

That flush by readl() comes with each MSI-X mask write. That is the only place we want to accelerate, otherwise it would still fall back to qemu each time the guest wants to mask the MSI-X entry. So it seems we do want to accelerate reads to the entire entry. This seems to give more weight to making the kernel own the entire entry. > > > And BTW, we can take routing table as a kind of *cache*, if the content > > is in the cache, then we can fetch it from the cache, otherwise we need > > to go back to fetch it from memory(userspace). > > If it's guaranteed by the spec that addr/data pairs are always > interpreted in the same way, sure. But there no reason to do it, > really, it isn't a fast path. I don't quite understand the first sentence... But it's a fast path, in the case I showed above. Which case? the readl() doesn't need access to the routing table, just the entry. 
Oh, I think there is a terminology problem: I was talking about kvm's irq routing table, you were talking about the msix entries.

I think treating it as a cache causes more problems, since there are now two paths for reads (in cache and not in cache) and more things for writes to manage.

Here's my proposed API:

KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, pending_bitmap_base_gpa)
 - called when the guest enables msix

KVM_REMOVE_MSIX_TABLE(table_id)
 - called when the guest disables msix

KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)
 - called when the guest enables msix (to initialize it), or after live migration

KVM_GET_MSIX_ENTRY(table_id, entry_id, contents)
 - called when the guest updates an msix table (triggered by KVM_EXIT_MSIX_ENTRY_UPDATED)

KVM_RUN -> KVM_EXIT_MSIX_ENTRY_UPDATED
 - returned from KV
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/02/2010 02:19 PM, Srivatsa Vaddagiri wrote: On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > What I'd like to see in directed yield is donating exactly the > amount of vruntime that's needed to make the target thread run. The How would that work well with hard-limits? The target thread would have been rate limited and no amount of vruntime donation would make it work "immediately" .. At least it would stop the donator from running. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/02/2010 01:47 PM, Srivatsa Vaddagiri wrote: On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > On 12/01/2010 09:09 PM, Peter Zijlstra wrote: > >> > >> We are dealing with just one task here (the task that is yielding). > >> After recording how much timeslice we are "giving up" in current->donate_time > >> (donate_time is perhaps not the right name to use), we adjust the yielding > >> task's vruntime as per existing logic (for ex: to make it go to back of > >> runqueue). When the yielding task gets to run again, the lock is hopefully > >> available for it to grab, and we let it run longer than the default sched_slice() > >> to compensate for the time it gave up previously to other threads in the same > >> runqueue. This ensures that because of yielding upon lock contention, we are not > >> leaking bandwidth in favor of other guests. Again I don't know how much of a > >> fairness issue this is in practice, so unless we see some numbers I'd prefer > >> sticking to plain yield() upon lock-contention (for unmodified guests that is). > > > >No, that won't work. Once you've given up time you cannot add it back > >without destroying fairness. Over shorter intervals perhaps. Over a longer interval (a few seconds to a couple of minutes), fairness should not be affected by this feedback? In any case, don't we have similar issues with directed yield as well? Directed yield works by donating vruntime to another thread. If you don't have vruntime, you can't donate it (well, you're guaranteed to have some, since you're running, but if all you have is a microsecond's worth, that's what you can donate). > > What I'd like to see in directed yield is donating exactly the > amount of vruntime that's needed to make the target thread run. I presume this requires the target vcpu to move left in the rb-tree to run earlier than currently scheduled, and that it doesn't involve any change to the sched_period() of the target vcpu? I was just wondering how this would work in case of buggy guests. 
Let's say that a guest ran into an AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 currently), while VCPU1 spins on lock A (held by VCPU0 currently). Both keep boosting each other's vruntime, potentially affecting fairness for other guests (to the point of starving them, perhaps)? We preserve vruntime overall. If you give vruntime to someone, it comes at your own expense. Overall vruntime is preserved. -- error compiling committee.c: too many arguments to function
Re: [PATCHv6 00/16] boot order specification
On Wed, Dec 01, 2010 at 09:25:40PM -0500, Kevin O'Connor wrote: > On Wed, Dec 01, 2010 at 02:27:40PM +0200, Gleb Natapov wrote: > > On Tue, Nov 30, 2010 at 09:53:32PM -0500, Kevin O'Connor wrote: > > > BTW, what's the plan for handling SCSI adapters? Let's say a user has > > > a scsi card with three drives (lun 1, lun 3, lun 5) that show up as 3 > > > bcvs (lun1, lun3, lun5 in that order) and the user wishes to boot from > > > lun3. I understand this use case may not be important for qemu, but > > > I'd like to use the same code on coreboot. Being able to boot from > > > any drive is important - it doesn't have to autodetect, but it should > > > be possible. > > > > > We can't. Option rom does not give us back enough information to do so. > > From the device path we know exactly what id:lun boot should be done from, > > but pnp_data does not have information to map BCV back to id:lun. I do > > not see how coreboot can provide better information to Seabios than > > qemu here. > > You're thinking in terms of which device to boot, which does make this > difficult. However, it's equally valid to think in terms of which > boot method to invoke, which makes this easy. It is not. Because boot methods are enumerated by a guest at runtime. Qemu knows absolutely nothing about them. I am thinking in terms of devices because this is the only thing I have in qemu. The BBS specification is broken since it doesn't provide a way for a discovered boot method (BCV) to be linked back to the device it will boot from. Nothing we can do to fix this except moving to EFI (and hope the problem is fixed there). > > We could tell the coreboot user to edit the "bootorder" file and add > "/p...@i0cf8/r...@4" (second rom on 4th pci device - the exact syntax > of the name is not important). > But how should the user know that the second rom (I think you mean "second BCV") on pci device 4.0 will boot from the new scsi cdrom that he just connected? 
How can he tell that it should put the second BCV there and not the third or fifth without running Seabios first and looking at the F12 menu? Depending on the option rom implementation, even the F12 menu may not have enough information for the user to guess which boot entry he should use to boot from the device, because the product name that is used to distinguish between different boot entries should be unique, but need not provide useful information about the device according to the spec. (Both gpxe and lsi 8xx_64.rom provide a device address as part of the product name; gpxe puts the pci address there and 8xx_64.rom puts ID LUN.) Another important point is that the BCV number can change when a new device is added to the scsi bus, so the next boot will mysteriously fail. An important property of the device path is that it is stable and does not depend on the HW configuration. > >BTW to create a proper EDD entry for a SCSI boot device the BIOS also > > needs to map BCV to id:lun. How can it be done? > > It's the responsibility of the rom to build the EDD info. I don't > know if all roms do this - I don't believe it's possible to get at the > EDD info until after the drive has been mapped (ie, too late to use it > for boot ordering). How can we get to the EDD info after the device is mapped? Looking at the Seabios implementation, it builds the EDD table on the fly when int_1348 is called, and it does it only for internal devices. Can we use the "disconnect vector" to connect the device temporarily, get the EDD, and then disconnect? > > > SCSI _is_ important for qemu. Not the HBA we have now, but something supported > > by recent operating systems out of the box. When such an HBA is emulated > > in qemu we will add native Seabios support for it. > > I understand. However, we'll still need to support arbitrary rom > based BEVs and BCVs, so the use case is still important. > We can't do something that is impossible. 
For coreboot, Seabios should implement what the BBS spec says, i.e. enumerate all BCVs, present a boot menu to the user, and record the number of BCVs and the user's choice in non-volatile storage (CMOS). On the next boot, use the recorded choice if the number of BCVs is unchanged, otherwise present the boot menu again. This is of course broken too, since the number of BCVs may not change even if the HW configuration changes substantially. The spec says that in that case the user will probably want to adjust the boot order anyway and will enter the boot menu himself. A sorry excuse for failing to provide proper functionality, if you ask me :) -- Gleb.
Re: Performance test result between virtio_pci MSI-X disable and enable
On Thu, Dec 02, 2010 at 07:52:00PM +0800, Sheng Yang wrote: > On Thu, Dec 2, 2010 at 5:49 PM, Michael S. Tsirkin wrote: > > On Thu, Dec 02, 2010 at 09:13:28AM +0800, Yang, Sheng wrote: > >> On Wednesday 01 December 2010 22:03:58 Michael S. Tsirkin wrote: > >> > On Wed, Dec 01, 2010 at 04:41:38PM +0800, lidong chen wrote: > >> > > I used sr-iov, give each vm 2 vf. > >> > > after apply the patch, and i found performence is the same. > >> > > > >> > > the reason is in function msix_mmio_write, mostly addr is not in mmio > >> > > range. > >> > > > >> > > static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int > >> > > len, > >> > > > >> > > const void *val) > >> > > > >> > > { > >> > > > >> > > struct kvm_assigned_dev_kernel *adev = > >> > > > >> > > container_of(this, struct kvm_assigned_dev_kernel, > >> > > > >> > > msix_mmio_dev); > >> > > > >> > > int idx, r = 0; > >> > > unsigned long new_val = *(unsigned long *)val; > >> > > > >> > > mutex_lock(&adev->kvm->lock); > >> > > if (!msix_mmio_in_range(adev, addr, len)) { > >> > > > >> > > // return here. 
> >> > > > >> > > r = -EOPNOTSUPP; > >> > > > >> > > goto out; > >> > > > >> > > } > >> > > > >> > > i printk the value: > >> > > addr start end len > >> > > F004C00C F0044000 F0044030 4 > >> > > > >> > > 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev > >> > > 01) > >> > > > >> > > Subsystem: Intel Corporation Unknown device 000c > >> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > >> > > > >> > > Stepping- SERR- FastB2B- > >> > > > >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > >> > > > >> > > SERR- >> > > > >> > > Latency: 0 > >> > > Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K] > >> > > Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K] > >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 > >> > > > >> > > Vector table: BAR=3 offset= > >> > > PBA: BAR=3 offset=2000 > >> > > > >> > > 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev > >> > > 01) > >> > > > >> > > Subsystem: Intel Corporation Unknown device 000c > >> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > >> > > > >> > > Stepping- SERR- FastB2B- > >> > > > >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > >> > > > >> > > SERR- >> > > > >> > > Latency: 0 > >> > > Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K] > >> > > Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K] > >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 > >> > > > >> > > Vector table: BAR=3 offset= > >> > > PBA: BAR=3 offset=2000 > >> > > > >> > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev, > >> > > + gpa_t addr, int len) > >> > > +{ > >> > > + gpa_t start, end; > >> > > + > >> > > + BUG_ON(adev->msix_mmio_base == 0); > >> > > + start = adev->msix_mmio_base; > >> > > + end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE * > >> > > + adev->msix_max_entries_nr; > >> > > + if (addr >= start && 
addr + len <= end) > >> > > + return true; > >> > > + > >> > > + return false; > >> > > +} > >> > > >> > Hmm, this check looks wrong to me: there's no guarantee > >> > that the guest uses the first N entries in the table. > >> > E.g. it could use a single entry, but only the last one. > >> > >> Please check the PCI spec. > > > > > > This is pretty explicit in the spec: the last paragraph in the below: > > > > IMPLEMENTATION NOTE > > Handling MSI-X Vector Shortages > > > > For the case where fewer vectors are allocated to a function than desired, > > You may not notice the premise here. I noticed it. > Also a check for "Table Size" would > help I think. It would help if msix_max_entries_nr was the MSI-X Table Size N (encoded in PCI config space as N - 1). Is this what it is? Maybe add this in some comment. > -- > regards, > Yang, Sheng > > software-controlled aliasing as enabled by MSI-X is one approach for handling the > > situation. For > > example, if a function supports five queues, each with an associated MSI-X > > table entry, but > > only three vectors are allocated, the function could be designed for > > software still to configure > > all five table entries, assigning one or more vectors to multiple table > > entries. Software could > > assign the three vectors {A,B,C} to the five entries as ABCCC, ABBCC, > > ABCBA, or other > > similar combinations. > > > > > > Alternatively, the function could be designed for software to configure it > > (using a device-specific mechanism) to use only three queues and three MSI-X table entries. Soft
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 05:17:00PM +0530, Srivatsa Vaddagiri wrote: > I was just wondering how this would work in case of buggy guests. Let's say that > a guest ran into an AB<->BA deadlock. VCPU0 spins on lock B (held by VCPU1 > currently), while VCPU1 spins on lock A (held by VCPU0 currently). Both keep > boosting each other's vruntime, potentially affecting fairness for other > guests (to the point of starving them, perhaps)? Guests that exhibit strong spinlock contention can cause similar symptoms as well? - vatsa
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > What I'd like to see in directed yield is donating exactly the > amount of vruntime that's needed to make the target thread run. The How would that work well with hard-limits? The target thread would have been rate limited and no amount of vruntime donation would make it work "immediately" .. - vatsa
Possiblle misleading information of motherboards supporting VT-d
Hello,

I am having a little trouble figuring out which motherboards have support for VT-d. I am following the link mentioned on the KVM site: http://wiki.xensource.com/xenwiki/VTdHowTo

The link says the following desktop boards have VT-d support:

* Intel DQ35JO
* Intel DQ35MP
* Intel DX38BT
* Intel DX48BT2
* Intel DQ45CB (BIOS 0061 required, previous versions are known to cause problems)
* Intel DQ45EK
* Intel DX58SO

I checked on the Intel site to verify that these motherboards have VT-d support, but it seems only a few of them do. The following boards do not have VT-d support:

Intel® Desktop Board DX38BT
http://ark.intel.com/Product.aspx?id=30272

Intel® Desktop Board DX48BT2
http://ark.intel.com/Product.aspx?id=33394

Intel® Desktop Board DX58SO
http://ark.intel.com/Product.aspx?id=36888&code=Intel%C2%AE+Desktop+Board+DX58SO

I was going to buy the Intel DX38BT board; just before placing the order I had a second thought and checked the Intel site. Can someone confirm?

Thanks and Regards,
Prasad Joshi (Student ID: 19015523)
MRes Systems Engineering
Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote: > 2010/11/28 Michael S. Tsirkin : > > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote: > >> 2010/11/28 Michael S. Tsirkin : > >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote: > >> >> Modify inuse type to uint16_t, let save/load to handle, and revert > >> >> last_avail_idx with inuse if there are outstanding emulation. > >> >> > >> >> Signed-off-by: Yoshiaki Tamura > >> > > >> > This changes migration format, so it will break compatibility with > >> > existing drivers. More generally, I think migrating internal > >> > state that is not guest visible is always a mistake > >> > as it ties migration format to an internal implementation > >> > (yes, I know we do this sometimes, but we should at least > >> > try not to add such cases). I think the right thing to do in this case > >> > is to flush outstanding > >> > work when vm is stopped. Then, we are guaranteed that inuse is 0. > >> > I sent patches that do this for virtio net and block. > >> > >> Could you give me the link of your patches? I'd like to test > >> whether they work with Kemari upon failover. If they do, I'm > >> happy to drop this patch. > >> > >> Yoshi > > > > Look for this: > > stable migration image on a stopped vm > > sent on: > > Wed, 24 Nov 2010 17:52:49 +0200 > > Thanks for the info. > > However, The patch series above didn't solve the issue. In > case of Kemari, inuse is mostly > 0 because it queues the > output, and while last_avail_idx gets incremented > immediately, not sending inuse makes the state inconsistent > between Primary and Secondary. Hmm. Can we simply avoid incrementing last_avail_idx? > I'm wondering why > last_avail_idx is OK to send but not inuse. last_avail_idx is at some level a mistake, it exposes part of our internal implementation, but it does *also* express a guest observable state. 
Here's the problem that it solves: just looking at the rings in virtio, there is no way to detect that a specific request has already been completed. And the protocol forbids completing the same request twice.

Our implementation always starts processing the requests in order, and since we flush outstanding requests before save, it works to just tell the remote 'process only requests after this place'. But there's no such requirement in the virtio protocol, so to be really generic we could add a bitmask of valid avail ring entries that did not complete yet. This would be the exact representation of the guest observable state. In practice we have rings of up to 512 entries. That's 64 bytes per ring, not a lot at all. However, if we ever do change the protocol to send the bitmask, we would need some code to resubmit requests out of order, so it's not trivial.

Another minor mistake with last_avail_idx is that it has some redundancy: the high bits in the index (> vq size) are not necessary as they can be derived from the avail idx. There's a consistency check in load, but we really should try to use formats that are always consistent.

> The following patch does the same thing as the original, yet
> keeps the format of the virtio. It shouldn't break live
> migration either because inuse should be 0.
>
> Yoshi

The question is, can you flush to make inuse 0 in kemari too? And if not, how do you handle the fact that some requests are in flight on the primary? 
> diff --git a/hw/virtio.c b/hw/virtio.c
> index c8a0fc6..875c7ca 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>      qemu_put_be32(f, i);
>
>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> +        uint16_t last_avail_idx;
> +
>          if (vdev->vq[i].vring.num == 0)
>              break;
>
> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> +
>          qemu_put_be32(f, vdev->vq[i].vring.num);
>          qemu_put_be64(f, vdev->vq[i].pa);
> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> +        qemu_put_be16s(f, &last_avail_idx);
>          if (vdev->binding->save_queue)
>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>      }

This looks wrong to me. Requests can complete in any order, can they not? So if request 0 did not complete and request 1 did, you send avail - inuse, and on the secondary you will process and complete request 1 a second time, crashing the guest.

> >> >> ---
> >> >>  hw/virtio.c |    8 +++-
> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >> >>
> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> index 849a60f..5509644 100644
> >> >> --- a/hw/virtio.c
> >> >> +++ b/hw/virtio.c
> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >>      VRing vring;
> >> >>      target_phys_addr_t pa;
> >> >>      uint16_t last_avail_idx;
> >> >> -    int inuse;
> >> >> +    uint16_t inuse;
> >> >>      uint16_t vector;
> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >
Re: Performance test result between virtio_pci MSI-X disable and enable
On Thu, Dec 2, 2010 at 5:49 PM, Michael S. Tsirkin wrote: > On Thu, Dec 02, 2010 at 09:13:28AM +0800, Yang, Sheng wrote: >> On Wednesday 01 December 2010 22:03:58 Michael S. Tsirkin wrote: >> > On Wed, Dec 01, 2010 at 04:41:38PM +0800, lidong chen wrote: >> > > I used sr-iov, give each vm 2 vf. >> > > after apply the patch, and i found performence is the same. >> > > >> > > the reason is in function msix_mmio_write, mostly addr is not in mmio >> > > range. >> > > >> > > static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int >> > > len, >> > > >> > > const void *val) >> > > >> > > { >> > > >> > > struct kvm_assigned_dev_kernel *adev = >> > > >> > > container_of(this, struct kvm_assigned_dev_kernel, >> > > >> > > msix_mmio_dev); >> > > >> > > int idx, r = 0; >> > > unsigned long new_val = *(unsigned long *)val; >> > > >> > > mutex_lock(&adev->kvm->lock); >> > > if (!msix_mmio_in_range(adev, addr, len)) { >> > > >> > > // return here. >> > > >> > > r = -EOPNOTSUPP; >> > > >> > > goto out; >> > > >> > > } >> > > >> > > i printk the value: >> > > addr start end len >> > > F004C00C F0044000 F0044030 4 >> > > >> > > 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev >> > > 01) >> > > >> > > Subsystem: Intel Corporation Unknown device 000c >> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- >> > > >> > > Stepping- SERR- FastB2B- >> > > >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> > > >> > > SERR- > > > >> > > Latency: 0 >> > > Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K] >> > > Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K] >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 >> > > >> > > Vector table: BAR=3 offset= >> > > PBA: BAR=3 offset=2000 >> > > >> > > 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev >> > > 01) >> > > >> > > Subsystem: Intel Corporation Unknown device 000c >> > > Control: I/O+ Mem+ BusMaster+ 
SpecCycle- MemWINV- VGASnoop- ParErr- >> > > >> > > Stepping- SERR- FastB2B- >> > > >> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> > > >> > > SERR- > > > >> > > Latency: 0 >> > > Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K] >> > > Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K] >> > > Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3 >> > > >> > > Vector table: BAR=3 offset= >> > > PBA: BAR=3 offset=2000 >> > > >> > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev, >> > > + gpa_t addr, int len) >> > > +{ >> > > + gpa_t start, end; >> > > + >> > > + BUG_ON(adev->msix_mmio_base == 0); >> > > + start = adev->msix_mmio_base; >> > > + end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE * >> > > + adev->msix_max_entries_nr; >> > > + if (addr >= start && addr + len <= end) >> > > + return true; >> > > + >> > > + return false; >> > > +} >> > >> > Hmm, this check looks wrong to me: there's no guarantee >> > that guest uses the first N entries in the table. >> > E.g. it could use a single entry, but only the last one. >> >> Please check the PCI spec. > > > This is pretty explicit in the spec: the the last paragraph in the below: > > IMPLEMENTATION NOTE > Handling MSI-X Vector Shortages > > Handling MSI-X Vector Shortages > For the case where fewer vectors are allocated to a function than desired, You may not notice the premise here. Also check for "Table Size" would help I think. -- regards, Yang, Sheng software- > controlled aliasing as enabled by MSI-X is one approach for handling the > situation. For > example, if a function supports five queues, each with an associated MSI-X > table entry, but > only three vectors are allocated, the function could be designed for software > still to configure > all five table entries, assigning one or more vectors to multiple table > entries. 
Software could > assign the three vectors {A,B,C} to the five entries as ABCCC, ABBCC, ABCBA, > or other > similar combinations. > > > Alternatively, the function could be designed for software to configure it > (using a device- > specific mechanism) to use only three queues and three MSI-X table entries. > Software could > assign the three vectors {A,B,C} to the five entries as ABC--, A-B-C, A--CB, > or other similar > combinations. > > > >> -- >> regards >> Yang, Sheng >> >> >> > > 2010/11/30 Yang, Sheng : >> > > > On Tuesday 30 November 2010 17:10:11 lidong chen wrote: >> > > >> sr-iov also meet this problem, MSIX mask waste a lot of cpu resource. >> > > >> >> > > >> I test kvm with sriov, which the vf driver could not disable msix. >> > > >> so the host os waste a lot of cpu. cpu rate of host os i
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On Thu, Dec 02, 2010 at 11:17:52AM +0200, Avi Kivity wrote: > On 12/01/2010 09:09 PM, Peter Zijlstra wrote: > >> > >> We are dealing with just one task here (the task that is yielding). > >> After recording how much timeslice we are "giving up" in > >> current->donate_time > >> (donate_time is perhaps not the right name to use), we adjust the yielding > >> task's vruntime as per existing logic (for ex: to make it go to back of > >> runqueue). When the yielding task gets to run again, the lock is hopefully > >> available for it to grab, and we let it run longer than the default > >> sched_slice() > >> to compensate for the time it gave up previously to other threads in the same > >> runqueue. This ensures that because of yielding upon lock contention, we > >> are not > >> leaking bandwidth in favor of other guests. Again I don't know how much of a > >> fairness issue this is in practice, so unless we see some numbers I'd > >> prefer > >> sticking to plain yield() upon lock-contention (for unmodified guests > >> that is). > > > >No, that won't work. Once you've given up time you cannot add it back > >without destroying fairness. Over shorter intervals perhaps. Over a longer interval (a few seconds to a couple of minutes), fairness should not be affected by this feedback? In any case, don't we have similar issues with directed yield as well? > >You can limit the unfairness by limiting the amount of feedback, but I > >really dislike such 'yield' semantics. > > Agreed. > > What I'd like to see in directed yield is donating exactly the > amount of vruntime that's needed to make the target thread run. I presume this requires the target vcpu to move left in the rb-tree to run earlier than currently scheduled, and that it doesn't involve any change to the sched_period() of the target vcpu? I was just wondering how this would work in case of buggy guests. Let's say that a guest ran into an AB<->BA deadlock. 
VCPU0 spins on lock B (held by VCPU1 currently), while VCPU1 spins on lock A (held by VCPU0 currently). Both keep boosting each other's vruntime, potentially affecting fairness for other guests (to the point of starving them, perhaps)? - vatsa > The donating thread won't get its vruntime back, unless the other thread > hits contention itself and does a directed yield back. So even if > your lock is ping-ponged around, the guest doesn't lose vruntime > compared to other processes on the host.
[ kvm-Bugs-1808970 ] CDROM eject doesn't work via Qemu Monitor
Bugs item #1808970, was opened at 2007-10-07 16:42 Message generated for change (Comment added) made by jessorensen You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1808970&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None >Status: Closed >Resolution: Works For Me Priority: 5 Private: No Submitted By: Technologov (technologov) Assigned to: Nobody/Anonymous (nobody) Summary: CDROM eject doesn't work via Qemu Monitor Initial Comment: When ejecting CD/DVD via Qemu monitor, then inserting another CD, the first CD is always shown, instead of the newly inserted one, unless I eject the CD from the guest. This bug prevents setup of Windows XP Tablet Edition, because this OS requires 2 CDs to install, and since this is Windows there are no workarounds. Guest: any Windows, especially Windows XP Tablet Edition. Host: Fedora7, 64-bit, Intel CPU, KVM-45. Other KVM versions and -no-kvm didn't help. -Alexey Technologov. 7.oct.2007. -- >Comment By: Jes Sorensen (jessorensen) Date: 2010-12-02 11:30 Message: Works fine here under Fedora 14: A change like this was detected instantly: (qemu) eject drive-ide0-1-0 (qemu) change drive-ide0-1-0 en_winxp_sp1a_tablet_pc_cd1.iso Closing Jes -- Comment By: Wilco (paranoidt) Date: 2008-03-07 17:52 Message: Logged In: YES user_id=902376 Originator: NO It works for me (with some quirks) if I first eject the old CD: eject ide1-cd0 Then change to the new one: change ide1-cd0 /path/to/newiso.iso This does not always seem to work. Sometimes you have to access your CDROM (from the installation) so windows finds no cd. Then repeat these steps. Hope this helps. 
-- Comment By: Technologov (technologov) Date: 2007-10-12 16:58 Message: Logged In: YES user_id=1839746 Originator: YES chenghuan_jia: I have tried that too, with the exception that I used ISO instead of real CD-ROM. -Technologov -- Comment By: Neo Jia (chenghuan_jia) Date: 2007-10-12 07:10 Message: Logged In: YES user_id=446034 Originator: NO Can you try "change cdrom /dev/scd0" after you insert the 2nd disc? Thanks, Neo -- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1808970&group_id=180599 
Re: Performance test result between virtio_pci MSI-X disable and enable
On Thu, Dec 02, 2010 at 09:13:28AM +0800, Yang, Sheng wrote:
> On Wednesday 01 December 2010 22:03:58 Michael S. Tsirkin wrote:
> > On Wed, Dec 01, 2010 at 04:41:38PM +0800, lidong chen wrote:
> > > I used sr-iov, giving each vm 2 vf.
> > > After applying the patch, I found performance is the same.
> > >
> > > The reason is that in function msix_mmio_write, addr is mostly not
> > > in mmio range:
> > >
> > > static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr,
> > >                            int len, const void *val)
> > > {
> > >         struct kvm_assigned_dev_kernel *adev =
> > >                 container_of(this, struct kvm_assigned_dev_kernel,
> > >                              msix_mmio_dev);
> > >         int idx, r = 0;
> > >         unsigned long new_val = *(unsigned long *)val;
> > >
> > >         mutex_lock(&adev->kvm->lock);
> > >         if (!msix_mmio_in_range(adev, addr, len)) {
> > >                 // return here.
> > >                 r = -EOPNOTSUPP;
> > >                 goto out;
> > >         }
> > >
> > > I printed the values:
> > > addr     start    end      len
> > > F004C00C F0044000 F0044030 4
> > >
> > > 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
> > >         Subsystem: Intel Corporation Unknown device 000c
> > >         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
> > >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> > >         Latency: 0
> > >         Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K]
> > >         Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K]
> > >         Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
> > >                 Vector table: BAR=3 offset=
> > >                 PBA: BAR=3 offset=2000
> > >
> > > 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
> > >         Subsystem: Intel Corporation Unknown device 000c
> > >         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
> > >         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> > >         Latency: 0
> > >         Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K]
> > >         Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K]
> > >         Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
> > >                 Vector table: BAR=3 offset=
> > >                 PBA: BAR=3 offset=2000
> > >
> > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > > +                              gpa_t addr, int len)
> > > +{
> > > +       gpa_t start, end;
> > > +
> > > +       BUG_ON(adev->msix_mmio_base == 0);
> > > +       start = adev->msix_mmio_base;
> > > +       end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> > > +               adev->msix_max_entries_nr;
> > > +       if (addr >= start && addr + len <= end)
> > > +               return true;
> > > +
> > > +       return false;
> > > +}
> >
> > Hmm, this check looks wrong to me: there's no guarantee
> > that the guest uses the first N entries in the table.
> > E.g. it could use a single entry, but only the last one.
> > Please check the PCI spec.
>
> This is pretty explicit in the spec; see the last paragraph of the
> implementation note below:
>
> IMPLEMENTATION NOTE
> Handling MSI-X Vector Shortages
>
> For the case where fewer vectors are allocated to a function than
> desired, software-controlled aliasing as enabled by MSI-X is one
> approach for handling the situation. For example, if a function
> supports five queues, each with an associated MSI-X table entry, but
> only three vectors are allocated, the function could be designed for
> software still to configure all five table entries, assigning one or
> more vectors to multiple table entries. Software could assign the
> three vectors {A,B,C} to the five entries as ABCCC, ABBCC, ABCBA, or
> other similar combinations.
>
> Alternatively, the function could be designed for software to
> configure it (using a device-specific mechanism) to use only three
> queues and three MSI-X table entries. Software could assign the three
> vectors {A,B,C} to the five entries as ABC--, A-B-C, A--CB, or other
> similar combinations.
>
> --
> regards
> Yang, Sheng
>
> > > 2010/11/30 Yang, Sheng :
> > > > On Tuesday 30 November 2010 17:10:11 lidong chen wrote:
> > > >> sr-iov also meets this problem: MSIX mask wastes a lot of cpu
> > > >> resource.
> > > >>
> > > >> I tested kvm with sriov, where the vf driver could not disable
> > > >> msix, so the host os wasted a lot of cpu. The cpu rate of the
> > > >> host os was 90%.
> > > >>
> > > >> Then I tested xen with sriov; there are also a lot of vm exits
> > > >> caused by MSIX mask, but the cpu rate of xen and domain0 is less
> > > >> than kvm: 60%.
> > > >>
> > > >> without
buildbot failure in qemu-kvm on default_i386_debian_5_0
The Buildbot has detected a new failure of default_i386_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/default_i386_debian_5_0/builds/670 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
buildbot failure in qemu-kvm on default_i386_out_of_tree
The Buildbot has detected a new failure of default_i386_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/default_i386_out_of_tree/builds/607 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
[PATCH v4 3/3] KVM: MMU: retry #PF for softmmu
Retry #PF for softmmu only when the current vcpu has the same cr3 as the time when #PF occurs Changelog: Just compare cr3 value since It's harmless to instantiate an spte for an unused translation from Marcelo's comment Signed-off-by: Xiao Guangrong --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/mmu.c |2 ++ arch/x86/kvm/paging_tmpl.h | 34 +++--- arch/x86/kvm/x86.c |6 +- 4 files changed, 31 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index a0c066e..1e876e5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -602,6 +602,7 @@ struct kvm_x86_ops { struct kvm_arch_async_pf { u32 token; gfn_t gfn; + unsigned long cr3; bool direct_map; }; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index c6bb449..3f0d9a0 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2607,9 +2607,11 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva, static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn) { struct kvm_arch_async_pf arch; + arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id; arch.gfn = gfn; arch.direct_map = vcpu->arch.mmu.direct_map; + arch.cr3 = vcpu->arch.mmu.get_cr3(vcpu); return kvm_setup_async_pf(vcpu, gva, gfn, &arch); } diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 23275d0..437e11a 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -116,7 +116,7 @@ static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte) */ static int FNAME(walk_addr_generic)(struct guest_walker *walker, struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, - gva_t addr, u32 access) + gva_t addr, u32 access, bool prefault) { pt_element_t pte; gfn_t table_gfn; @@ -194,6 +194,13 @@ walk: #endif if (!eperm && !rsvd_fault && !(pte & PT_ACCESSED_MASK)) { + /* +* Don't set gpte accessed bit if it's on +* speculative path. 
+*/ + if (prefault) + goto error; + trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte)); if (FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, @@ -287,10 +294,11 @@ error: } static int FNAME(walk_addr)(struct guest_walker *walker, - struct kvm_vcpu *vcpu, gva_t addr, u32 access) + struct kvm_vcpu *vcpu, gva_t addr, + u32 access, bool prefault) { return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.mmu, addr, - access); + access, prefault); } static int FNAME(walk_addr_nested)(struct guest_walker *walker, @@ -298,7 +306,7 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker, u32 access) { return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.nested_mmu, - addr, access); + addr, access, false); } static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu, @@ -438,7 +446,8 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, struct guest_walker *gw, int user_fault, int write_fault, int hlevel, -int *ptwrite, pfn_t pfn, bool map_writable) +int *ptwrite, pfn_t pfn, bool map_writable, +bool prefault) { unsigned access = gw->pt_access; struct kvm_mmu_page *sp = NULL; @@ -512,7 +521,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, mmu_set_spte(vcpu, it.sptep, access, gw->pte_access & access, user_fault, write_fault, dirty, ptwrite, it.level, -gw->gfn, pfn, false, map_writable); +gw->gfn, pfn, prefault, map_writable); FNAME(pte_prefetch)(vcpu, gw, it.sptep); return it.sptep; @@ -561,15 +570,18 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, /* * Look up the guest pte for the faulting address. */ - r = FNAME(walk_addr)(&walker, vcpu, addr, error_code); + r = FNAME(walk_addr)(&walker, vcpu, addr, error_code, prefault); /* * The page is not mapped by the guest. Let the guest handle it. */ if (!r) {
buildbot failure in qemu-kvm on default_x86_64_out_of_tree
The Buildbot has detected a new failure of default_x86_64_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/default_x86_64_out_of_tree/builds/609 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_1 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in qemu-kvm on default_x86_64_debian_5_0
The Buildbot has detected a new failure of default_x86_64_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/default_x86_64_debian_5_0/builds/668 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_1 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in qemu-kvm on disable_kvm_i386_out_of_tree
The Buildbot has detected a new failure of disable_kvm_i386_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_i386_out_of_tree/builds/607 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in qemu-kvm on disable_kvm_x86_64_out_of_tree
The Buildbot has detected a new failure of disable_kvm_x86_64_out_of_tree on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_x86_64_out_of_tree/builds/607 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_1 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in qemu-kvm on disable_kvm_i386_debian_5_0
The Buildbot has detected a new failure of disable_kvm_i386_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_i386_debian_5_0/builds/659 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_2 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in qemu-kvm on disable_kvm_x86_64_debian_5_0
The Buildbot has detected a new failure of disable_kvm_x86_64_debian_5_0 on qemu-kvm. Full details are available at: http://buildbot.b1-systems.de/qemu-kvm/builders/disable_kvm_x86_64_debian_5_0/builds/658 Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/ Buildslave for this Build: b1_qemu_kvm_1 Build Reason: Build Source Stamp: [branch next] HEAD Blamelist: Avi Kivity ,Joerg Roedel BUILD FAILED: failed git sincerely, -The Buildbot
[PATCH v4 1/3] KVM: MMU: rename 'no_apf' to 'prefault'
It's the speculative path if 'no_apf = 1' and we will specially handle this speculative path in the later patch, so 'prefault' is better to fit the sense Signed-off-by: Xiao Guangrong --- arch/x86/include/asm/kvm_host.h |3 ++- arch/x86/kvm/mmu.c | 18 +- arch/x86/kvm/paging_tmpl.h |4 ++-- 3 files changed, 13 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index e91c692..a0c066e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -241,7 +241,8 @@ struct kvm_mmu { void (*new_cr3)(struct kvm_vcpu *vcpu); void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root); unsigned long (*get_cr3)(struct kvm_vcpu *vcpu); - int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool no_apf); + int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, + bool prefault); void (*inject_page_fault)(struct kvm_vcpu *vcpu); void (*free)(struct kvm_vcpu *vcpu); gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access, diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index e111b79..010736e 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2284,11 +2284,11 @@ static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn) return 1; } -static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, +static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, gva_t gva, pfn_t *pfn, bool write, bool *writable); static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn, -bool no_apf) +bool prefault) { int r; int level; @@ -2310,7 +2310,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn, mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (try_async_pf(vcpu, no_apf, gfn, v, &pfn, write, &map_writable)) + if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable)) return 0; /* mmio */ @@ -2583,7 +2583,7 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t 
vaddr, } static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva, - u32 error_code, bool no_apf) + u32 error_code, bool prefault) { gfn_t gfn; int r; @@ -2599,7 +2599,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva, gfn = gva >> PAGE_SHIFT; return nonpaging_map(vcpu, gva & PAGE_MASK, -error_code & PFERR_WRITE_MASK, gfn, no_apf); +error_code & PFERR_WRITE_MASK, gfn, prefault); } static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn) @@ -2621,7 +2621,7 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu) return kvm_x86_ops->interrupt_allowed(vcpu); } -static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, +static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, gva_t gva, pfn_t *pfn, bool write, bool *writable) { bool async; @@ -2633,7 +2633,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, put_page(pfn_to_page(*pfn)); - if (!no_apf && can_do_async_pf(vcpu)) { + if (!prefault && can_do_async_pf(vcpu)) { trace_kvm_try_async_get_page(gva, gfn); if (kvm_find_async_pf_gfn(vcpu, gfn)) { trace_kvm_async_pf_doublefault(gva, gfn); @@ -2649,7 +2649,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn, } static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code, - bool no_apf) + bool prefault) { pfn_t pfn; int r; @@ -2673,7 +2673,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code, mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn, write, &map_writable)) + if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable)) return 0; /* mmio */ diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index ad5a5a2..23275d0 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -539,7 +539,7 @@ out_gpte_changed: * a negative value on error. 
*/ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, -bool no_apf) +bool prefault) { int write_fault = error_code & PFERR_WRITE_MASK; int user_fault = error_code & PFERR_USER_MASK; @@ -581,7 +581,7 @@ static int FNAME(page_fault)(struct kvm_vcpu
[PATCH v4 2/3] KVM: MMU: fix accessed bit set on prefault path
Retry #PF is the speculative path, so don't set the accessed bit Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 10 ++ 1 files changed, 6 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 010736e..c6bb449 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2214,7 +2214,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep) } static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write, - int map_writable, int level, gfn_t gfn, pfn_t pfn) + int map_writable, int level, gfn_t gfn, pfn_t pfn, + bool prefault) { struct kvm_shadow_walk_iterator iterator; struct kvm_mmu_page *sp; @@ -2229,7 +2230,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write, pte_access &= ~ACC_WRITE_MASK; mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access, 0, write, 1, &pt_write, -level, gfn, pfn, false, map_writable); +level, gfn, pfn, prefault, map_writable); direct_pte_prefetch(vcpu, iterator.sptep); ++vcpu->stat.pf_fixed; break; @@ -2321,7 +2322,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn, if (mmu_notifier_retry(vcpu, mmu_seq)) goto out_unlock; kvm_mmu_free_some_pages(vcpu); - r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn); + r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn, +prefault); spin_unlock(&vcpu->kvm->mmu_lock); @@ -2684,7 +2686,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code, goto out_unlock; kvm_mmu_free_some_pages(vcpu); r = __direct_map(vcpu, gpa, write, map_writable, -level, gfn, pfn); +level, gfn, pfn, prefault); spin_unlock(&vcpu->kvm->mmu_lock); return r; -- 1.7.0.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/01/2010 09:09 PM, Peter Zijlstra wrote: > > We are dealing with just one task here (the task that is yielding). > After recording how much timeslice we are "giving up" in current->donate_time > (donate_time is perhaps not the right name to use), we adjust the yielding > task's vruntime as per existing logic (for ex: to make it go to back of > runqueue). When the yielding tasks gets to run again, lock is hopefully > available for it to grab, we let it run longer than the default sched_slice() > to compensate for what time it gave up previously to other threads in same > runqueue. This ensures that because of yielding upon lock contention, we are not > leaking bandwidth in favor of other guests. Again I don't know how much of > fairness issue this is in practice, so unless we see some numbers I'd prefer > sticking to plain yield() upon lock-contention (for unmodified guests that is). No, that won't work. Once you've given up time you cannot add it back without destroying fairness. You can limit the unfairness by limiting the amount of feedback, but I really dislike such 'yield' semantics. Agreed. What I'd like to see in directed yield is donating exactly the amount of vruntime that's needed to make the target thread run. The donating thread won't get its vruntime back, unless the other thread hits contention itself and does a directed yield back. So even if your lock is ping-ponged around, the guest doesn't lose vruntime compared to other processes on the host. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/01/2010 07:29 PM, Srivatsa Vaddagiri wrote: > > > A plain yield (ignoring no-opiness on Linux) will penalize the > > > running guest wrt other guests. We need to maintain fairness. Avi, any idea how much penalty are we talking of here in using plain yield? If that is acceptable in practice, I'd prefer we use the plain yield rather than add any more sophistication to it .. The first issue with plain yield is that it's a no-op unless some ioctl is enabled. It's simply not very meaningful for CFS. That's why the current PLE implementation uses sleep() (which is incredibly sucky in another way). Even if we do get yield() to work, giving up a timeslice will cause a contending guest to lose way more than its fair share. Consider a lock that is held for 100us every 500us; if each time we detect contention we give up a 3ms timeslice, the guest will grind down to a halt while other guests will happily use its timeslice. With directed yield, the guest continues to run; and any excess time we donated to a vcpu is preserved. > The Xen paravirt spinlock solution is relatively sane, use that. > Unmodified guests suck anyway, there's really nothing much sane you can > do there as you don't know who owns what lock. Hopefully we don't have to deal with unmodified guests for too long. Till that time, plain yield() upon lock-contention seems like the best option we have for such unmodified guests. We will have to deal with unmodified guests forever. Both older Linux and Windows. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: response to SIGUSR1 to start/stop a VCPU (v2)
On 12/01/2010 09:07 PM, Peter Zijlstra wrote: > > The pause loop exiting& directed yield patches I am working on > preserve inter-vcpu fairness by round robining among the vcpus > inside one KVM guest. I don't necessarily think that's enough. Suppose you've got 4 vcpus, one is holding a lock and 3 are spinning. They'll end up all three donating some time to the 4th. The only way to make that fair again is if due to future contention the 4th cpu donates an equal amount of time back to the resp. cpus it got time from. Guest lock patterns and host scheduling don't provide this guarantee. That is fine. Fairness is pointless if it there's no work to be done for the other three threads. Just like if you have four processes, one of which if processing, the others sleeping, cpu is not divided equally. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm-vmx: add module parameter to avoid trapping HLT instructions
On 12/01/2010 08:03 PM, Anthony Liguori wrote: In certain use-cases, we want to allocate guests fixed time slices where idle guest cycles leave the machine idling. There are many approaches to achieve this but the most direct is to simply avoid trapping the HLT instruction which lets the guest directly execute the instruction putting the processor to sleep. Introduce this as a module-level option for kvm-vmx.ko since if you do this for one guest, you probably want to do it for all. A similar option is possible for AMD but I don't have easy access to AMD test hardware. Looks reasonable. +static int __read_mostly enable_yield_on_guest_hlt = 1; +module_param(enable_yield_on_guest_hlt, bool, S_IRUGO); + Please rename to yield_on_hlt. Also, should not be __read_mostly (__read_mostly is for variables that are also __read_often). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/10] Exit loop if we have been there too long
On 12/02/2010 03:31 AM, Takuya Yoshikawa wrote:
> Thanks for the answers Avi, Juan,
>
> Some FYI, (not about the bottleneck)
>
> On Wed, 01 Dec 2010 14:35:57 +0200 Avi Kivity wrote:
> > > > - how many dirty pages do we have to care?
> > >
> > > default values and assuming 1Gigabit ethernet for ourselves ~9.5MB of
> > > dirty pages to have only 30ms of downtime.
> >
> > 1Gb/s * 30ms = 100 MB/s * 30 ms = 3 MB.
> > 3MB / 4KB/page = 750 pages.
>
> Then, KVM side processing is near the theoretical goal!
>
> In my framebuffer test, I tested the nr_dirty_pages/npages = 576/4096
> case at a rate of 20 updates/s (1 update/50ms). Using the rmap
> optimization, write protection only took 46,718 tsc time.

Yes, using rmap to drive write protection with sparse dirty bitmaps really helps.

> Bitmap copy was not a problem of course. The display was working anyway
> at this rate! My guess is that within 1,000 dirty pages,
> kvm_vm_ioctl_get_dirty_log() can be processed within 200us or so even
> for a large RAM slot.
>
> - the rmap optimization depends mainly on nr_dirty_pages, not npages.
>
> Avi, can you guess the properties of O(1) write protection? I want to
> test the rmap optimization taking these issues into account.

I think we should use O(1) write protection only if there is a large number of dirty pages. With a small number, using rmap guided by the previous dirty bitmap is faster.

So, under normal operation where only the framebuffer is logged, we'd use rmap write protection; when enabling logging for live migration we'd use O(1) write protection, and after a few iterations, when the number of dirty pages drops, we'd switch back to rmap write protection.

> Of course, Kemari has to continue synchronization, and may see more
> dirty pages. This will be a future task!

There's yet another option, of using dirty bits instead of write protection. Or maybe using write protection in the upper page tables and dirty bits in the lowest level.
-- error compiling committee.c: too many arguments to function