Re: vhost + multiqueue + RSS question.

2014-11-18 Thread Gleb Natapov
On Mon, Nov 17, 2014 at 01:58:20PM +0200, Michael S. Tsirkin wrote:
> On Mon, Nov 17, 2014 at 01:22:07PM +0200, Gleb Natapov wrote:
> > On Mon, Nov 17, 2014 at 12:38:16PM +0200, Michael S. Tsirkin wrote:
> > > On Mon, Nov 17, 2014 at 09:44:23AM +0200, Gleb Natapov wrote:
> > > > On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. Tsirkin wrote:
> > > > > On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
> > > > > > Hi Michael,
> > > > > > 
> > > > > >  I am playing with vhost multiqueue capability and have a question 
> > > > > > about
> > > > > > vhost multiqueue and RSS (receive side steering). My setup has 
> > > > > > Mellanox
> > > > > > ConnectX-3 NIC which supports multiqueue and RSS. Network related
> > > > > > parameters for qemu are:
> > > > > > 
> > > > > >-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
> > > > > >-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
> > > > > > 
> > > > > > In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.
> > > > > > 
> > > > > > I am running one tcp stream into the guest using iperf. Since there
> > > > > > is only one tcp stream I expect it to be handled by one queue only,
> > > > > > but this seems not to be the case. ethtool -S on the host shows that
> > > > > > the stream is handled by one queue in the NIC, just as I would
> > > > > > expect, but in the guest all 4 virtio-input interrupts are
> > > > > > incremented. Am I missing any configuration?
> > > > > 
> > > > > I don't see anything obviously wrong with what you describe.
> > > > > Maybe, somehow, same irqfd got bound to multiple MSI vectors?
> > > > It does not look like this is what is happening, judging by the way
> > > > interrupts are distributed between queues. They are not distributed
> > > > uniformly; often I see one queue get most of the interrupts and the
> > > > others get many fewer, and then it changes.
> > > 
> > > Weird. It would happen if you transmitted from multiple CPUs.
> > > You did pin iperf to a single CPU within guest, did you not?
> > > 
> > No, I didn't, because I didn't expect it to matter for input interrupts.
> > When I run iperf on the host, the rx queue that receives all the packets
> > depends only on the connection itself, not on the cpu iperf is running on
> > (I tested that).
> 
> This really depends on the type of networking card you have
> on the host, and how it's configured.
> 
> I think you will get something more closely resembling this
> behaviour if you enable RFS on the host.
> 
> > When I pin iperf in the guest I do indeed see that all interrupts
> > arrive at the same irq vector. Is the number after virtio-input
> > in /proc/interrupts any indication of the queue a packet arrived on (on
> > the host I can use ethtool -S to check which queue receives packets, but
> > unfortunately this does not work for a virtio nic in the guest)?
> 
> I think it is.
> 
> > Because if
> > it is, the way RSS works in virtio is not how it works on the host and not
> > what I would expect after reading about RSS. The queue a packet arrives
> > on should be calculated by hashing fields from the packet header only.
> 
> Yes, what virtio has is not RSS - it's an accelerated RFS really.
> 
OK, if what virtio has is RFS and not RSS, my test results make sense.
Thanks!

--
Gleb.


Re: vhost + multiqueue + RSS question.

2014-11-18 Thread Gleb Natapov
On Tue, Nov 18, 2014 at 11:41:11AM +0800, Jason Wang wrote:
> On 11/18/2014 09:37 AM, Zhang Haoyu wrote:
> >> On Mon, Nov 17, 2014 at 01:58:20PM +0200, Michael S. Tsirkin wrote:
> >>> On Mon, Nov 17, 2014 at 01:22:07PM +0200, Gleb Natapov wrote:
> >>>> On Mon, Nov 17, 2014 at 12:38:16PM +0200, Michael S. Tsirkin wrote:
> >>>>> On Mon, Nov 17, 2014 at 09:44:23AM +0200, Gleb Natapov wrote:
> >>>>>> On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. Tsirkin wrote:
> >>>>>>> On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
> >>>>>>>> Hi Michael,
> >>>>>>>>
> >>>>>>>>  I am playing with vhost multiqueue capability and have a question 
> >>>>>>>> about
> >>>>>>>> vhost multiqueue and RSS (receive side steering). My setup has 
> >>>>>>>> Mellanox
> >>>>>>>> ConnectX-3 NIC which supports multiqueue and RSS. Network related
> >>>>>>>> parameters for qemu are:
> >>>>>>>>
> >>>>>>>>-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
> >>>>>>>>-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
> >>>>>>>>
> >>>>>>>> In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.
> >>>>>>>>
> >>>>>>>> I am running one tcp stream into the guest using iperf. Since there 
> >>>>>>>> is
> >>>>>>>> only one tcp stream I expect it to be handled by one queue only but
> >>>>>>>> this seems not to be the case. ethtool -S on the host shows that the
> >>>>>>>> stream is handled by one queue in the NIC, just as I would expect,
> >>>>>>>> but in the guest all 4 virtio-input interrupts are incremented. Am I
> >>>>>>>> missing any configuration?
> >>>>>>> I don't see anything obviously wrong with what you describe.
> >>>>>>> Maybe, somehow, same irqfd got bound to multiple MSI vectors?
> >>>>>> It does not look like this is what is happening judging by the way
> >>>>>> interrupts are distributed between queues. They are not distributed
> >>>>>> uniformly and often I see one queue gets most interrupt and others get
> >>>>>> much less and then it changes.
> >>>>> Weird. It would happen if you transmitted from multiple CPUs.
> >>>>> You did pin iperf to a single CPU within guest, did you not?
> >>>>>
> >>>> No, I didn't because I didn't expect it to matter for input interrupts.
> >>>> When I run iperf on a host rx queue that receives all packets depends
> >>>> only on a connection itself, not on a cpu iperf is running on (I tested
> >>>> that).
> >>> This really depends on the type of networking card you have
> >>> on the host, and how it's configured.
> >>>
> >>> I think you will get something more closely resembling this
> >>> behaviour if you enable RFS in host.
> >>>
> >>>> When I pin iperf in a guest I do indeed see that all interrupts
> >>>> are arriving to the same irq vector. Is a number after virtio-input
> >>>> in /proc/interrupt any indication of a queue a packet arrived to (on
> >>>> a host I can use ethtool -S to check what queue receives packets, but
> >>>> unfortunately this does not work for virtio nic in a guest)?
> >>> I think it is.
> >>>
> >>>> Because if
> >>>> it is the way RSS works in virtio is not how it works on a host and not
> >>>> what I would expect after reading about RSS. The queue a packets arrives
> >>>> to should be calculated by hashing fields from a packet header only.
> >>> Yes, what virtio has is not RSS - it's an accelerated RFS really.
> >>>
> >> OK, if what virtio has is RFS and not RSS my test results make sense.
> >> Thanks!
> > I think the RSS emulation for the virtio-mq NIC is implemented in
> > tun_select_queue();
> > am I missing something?
> >
> > Thanks,
> > Zhang Haoyu
> >
> 
> Yes, if RSS is short for Receive Side Steering, which is a generic
> technology. But RSS is usually short for Receive Side Scaling, which is
> a technology commonly used by Windows; it is implemented through an
> indirection table in the card, which is obviously not supported in tun
> currently.
Hmm, I had the impression that "Receive Side Steering" and "Receive Side
Scaling" were interchangeable. The software implementation of RSS is called
"Receive Packet Steering" according to Documentation/networking/scaling.txt,
not "Receive Packet Scaling". Those damn TLAs are confusing.
 
--
Gleb.


Re: vhost + multiqueue + RSS question.

2014-11-18 Thread Gleb Natapov
On Mon, Nov 17, 2014 at 12:38:16PM +0200, Michael S. Tsirkin wrote:
> On Mon, Nov 17, 2014 at 09:44:23AM +0200, Gleb Natapov wrote:
> > On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. Tsirkin wrote:
> > > On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
> > > > Hi Michael,
> > > > 
> > > >  I am playing with vhost multiqueue capability and have a question about
> > > > vhost multiqueue and RSS (receive side steering). My setup has Mellanox
> > > > ConnectX-3 NIC which supports multiqueue and RSS. Network related
> > > > parameters for qemu are:
> > > > 
> > > >-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
> > > >-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
> > > > 
> > > > In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.
> > > > 
> > > > I am running one tcp stream into the guest using iperf. Since there is
> > > > only one tcp stream I expect it to be handled by one queue only but
> > > > this seems not to be the case. ethtool -S on the host shows that the
> > > > stream is handled by one queue in the NIC, just as I would expect,
> > > > but in the guest all 4 virtio-input interrupts are incremented. Am I
> > > > missing any configuration?
> > > 
> > > I don't see anything obviously wrong with what you describe.
> > > Maybe, somehow, same irqfd got bound to multiple MSI vectors?
> > It does not look like this is what is happening, judging by the way
> > interrupts are distributed between queues. They are not distributed
> > uniformly; often I see one queue get most of the interrupts and the
> > others get many fewer, and then it changes.
> 
> Weird. It would happen if you transmitted from multiple CPUs.
> You did pin iperf to a single CPU within guest, did you not?
> 
No, I didn't, because I didn't expect it to matter for input interrupts.
When I run iperf on the host, the rx queue that receives all the packets
depends only on the connection itself, not on the cpu iperf is running on
(I tested that). When I pin iperf in the guest I do indeed see that all
interrupts arrive at the same irq vector. Is the number after virtio-input
in /proc/interrupts any indication of the queue a packet arrived on (on
the host I can use ethtool -S to check which queue receives packets, but
unfortunately this does not work for a virtio nic in the guest)? Because
if it is, the way RSS works in virtio is not how it works on the host and
not what I would expect after reading about RSS. The queue a packet
arrives on should be calculated by hashing fields from the packet header
only.
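
For comparison, the behaviour I expected (plain RSS) is a function of the
packet header fields only -- something like this toy sketch using the
kernel's jhash helper (illustrative only, not the virtio or tun code):

    #include <linux/jhash.h>

    /* Queue selection from the TCP/IPv4 4-tuple only: the same stream
     * always hashes to the same queue, regardless of the receiving CPU. */
    static u16 rss_queue_for_packet(__be32 saddr, __be32 daddr,
                                    __be16 sport, __be16 dport, u16 numqueues)
    {
            u32 hash = jhash_3words((__force u32)saddr, (__force u32)daddr,
                                    ((u32)(__force u16)sport << 16) |
                                    (__force u16)dport, 0 /* seed */);

            return hash % numqueues;
    }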

--
Gleb.


Re: vhost + multiqueue + RSS question.

2014-11-18 Thread Gleb Natapov
On Sun, Nov 16, 2014 at 08:56:04PM +0200, Michael S. Tsirkin wrote:
> On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
> > Hi Michael,
> > 
> >  I am playing with vhost multiqueue capability and have a question about
> > vhost multiqueue and RSS (receive side steering). My setup has Mellanox
> > ConnectX-3 NIC which supports multiqueue and RSS. Network related
> > parameters for qemu are:
> > 
> >-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
> >-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
> > 
> > In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.
> > 
> > I am running one tcp stream into the guest using iperf. Since there is
> > only one tcp stream I expect it to be handled by one queue only but
> > this seems not to be the case. ethtool -S on the host shows that the
> > stream is handled by one queue in the NIC, just as I would expect,
> > but in the guest all 4 virtio-input interrupts are incremented. Am I
> > missing any configuration?
> 
> I don't see anything obviously wrong with what you describe.
> Maybe, somehow, same irqfd got bound to multiple MSI vectors?
It does not look like this is what is happening, judging by the way
interrupts are distributed between queues. They are not distributed
uniformly; often I see one queue get most of the interrupts and the
others get many fewer, and then it changes.

--
Gleb.


Re: vhost + multiqueue + RSS question.

2014-11-18 Thread Gleb Natapov
On Mon, Nov 17, 2014 at 01:30:06PM +0800, Jason Wang wrote:
> On 11/17/2014 02:56 AM, Michael S. Tsirkin wrote:
> > On Sun, Nov 16, 2014 at 06:18:18PM +0200, Gleb Natapov wrote:
> >> Hi Michael,
> >>
> >>  I am playing with vhost multiqueue capability and have a question about
> >> vhost multiqueue and RSS (receive side steering). My setup has Mellanox
> >> ConnectX-3 NIC which supports multiqueue and RSS. Network related
> >> parameters for qemu are:
> >>
> >>-netdev tap,id=hn0,script=qemu-ifup.sh,vhost=on,queues=4
> >>-device virtio-net-pci,netdev=hn0,id=nic1,mq=on,vectors=10
> >>
> >> In a guest I ran "ethtool -L eth0 combined 4" to enable multiqueue.
> >>
> >> I am running one tcp stream into the guest using iperf. Since there is
> >> only one tcp stream I expect it to be handled by one queue only but
> >> this seems not to be the case. ethtool -S on the host shows that the
> >> stream is handled by one queue in the NIC, just as I would expect,
> >> but in the guest all 4 virtio-input interrupts are incremented. Am I
> >> missing any configuration?
> > I don't see anything obviously wrong with what you describe.
> > Maybe, somehow, same irqfd got bound to multiple MSI vectors?
> > To see, can you try dumping struct kvm_irqfd that's passed to kvm?
> >
> >
> >> --
> >>Gleb.
> 
> This sounds like a regression, which kernel/qemu version did you use?
Sorry, I should have mentioned it from the start. The host is Fedora 20 with
kernel 3.16.6-200.fc20.x86_64 and qemu-system-x86-1.6.2-9.fc20.x86_64.
The guest is also Fedora 20, but with an older kernel, 3.11.10-301.

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 11:20:49AM -0700, Andy Lutomirski wrote:
> [cc: Alok Kataria at VMware]
> 
> On Fri, Sep 19, 2014 at 11:12 AM, Gleb Natapov  wrote:
> > On Fri, Sep 19, 2014 at 11:02:38AM -0700, Andy Lutomirski wrote:
> >> On Fri, Sep 19, 2014 at 10:49 AM, Gleb Natapov  wrote:
> >> > On Fri, Sep 19, 2014 at 10:18:37AM -0700, H. Peter Anvin wrote:
> >> >> On 09/19/2014 10:15 AM, Gleb Natapov wrote:
> >> >> > On Fri, Sep 19, 2014 at 10:08:20AM -0700, H. Peter Anvin wrote:
> >> >> >> On 09/19/2014 09:53 AM, Gleb Natapov wrote:
> >> >> >>> On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> >> >> >>>> On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> >> >> >>>>>
> >> >> >>>>> Linux detects what hypervior it runs on very early
> >> >> >>>>
> >> >> >>>> Not anywhere close to early enough.  We're talking for uses like 
> >> >> >>>> kASLR.
> >> >> >>>>
> >> >> >>> Still to early to do:
> >> >> >>>
> >> >> >>>h = cpuid(HYPERVIOR_SIGNATURE)
> >> >> >>>if (h == KVMKVMKVM) {
> >> >> >>>   if (cpuid(kvm_features) & kvm_rnd)
> >> >> >>>  rdmsr(kvm_rnd)
> >> >> >>>else (h == HyperV) {
> >> >> >>>   if (cpuid(hv_features) & hv_rnd)
> >> >> >>> rdmsr(hv_rnd)
> >> >> >>>else (h == XenXenXen) {
> >> >> >>>   if (cpuid(xen_features) & xen_rnd)
> >> >> >>> rdmsr(xen_rnd)
> >> >> >>>   }
> >> >> >>>
> >> >> >>
> >> >> >> If we need to do chase loops, especially not so...
> >> >> >>
> >> >> > What loops exactly? As a non native English speaker I fail to 
> >> >> > understand
> >> >> > if your answer is "yes" or "no" ;)
> >> >> >
> >> >>
> >> >> The above isn't actually the full algorithm used.
> >> >>
> >> > What part of actually algorithm cannot be implemented? Loop that searches
> >> > for KVM leaf in case KVM pretend to be HyperV (is this what you called
> >> > "chase loops"?)? First of all there is no need to implement it, if KVM
> >> > pretends to be HyperV use HyperV's way to obtain RNG, but what is the
> >> > problem with the loop?
> >> >
> >>
> >> It can be implemented, and I've done it.  But it's a mess.  Almost the
> >> very first thing we do in boot (even before decompressing the kernel)
> >> will be to scan a bunch of cpuid leaves looking for a hypervisor with
> >> an rng source that we can use for kASLR.  And we'll have to update
> >> that code and make it bigger every time another hypervisor adds
> >> exactly the same feature.
> > IMO implementing this feature is in hypervisor's best interest, so the task
> > of updating the code will scale by virtue of hypervisor's developers each
> > adding it for hypervisor he cares about.
> 
> I assume that you mean guest, not hypervisor.
> 
Yes, I mean guest support for the hypervisor they care about.

> >
> >>
> >> And then we have another copy of almost exactly the same code in the
> >> normal post-boot part of the kernel.
> >>
> >> We can certainly do this, but I'd much rather solve the problem once
> >> and let all of the hypervisors and guests opt in and immediately be
> >> compatible with each other.
> >>
> >> > I "forgot" VMware because I do not see VMware people to be CCed. They may
> >> > be even less excited about them being told _how_ this feature need to be
> >> > implemented (e.g implement HyperV leafs for the feature detection). I
> >> > do not want to and cannot speak for VMware, but my guess is that for
> >> > them it would be much easier to add an else clause for VMware in above
> >> > "if" then to coordinate with all hypervisor developers about MSR/cpuid
> >> > details. And since this is security feature implementing it for Linux
> >> > is in their best interest.
> >>
> >> Do you know any of them who should be cc'd?
> >>
> > No, not anyone in particular. git log arch/x86/kernel/cpu/vmware.c may help.
> >
> > But VMware is an elephant in the room here. There are other hypervisors out 
> > there.
> > VirtualBox, bhyve...
> 
> Exactly.  The amount of effort to get everything to be compatible with
> everything scales quadratically in the number of hypervisors, and the
> probability that some combination is broken also increases.
> 
The effort is distributed equally among hypervisor developers. If they
want Linux to be more secure on their hypervisor, they contribute guest
code. They need to write the hypervisor part anyway. On cpus with the
RDRAND instruction this MSR is not even needed, and some hypervisors may
decide that support for old cpus is not worth the effort. A unified
interface does not help if the hypervisor does not implement it.
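
By way of illustration only (the MSR index and the hypervisor-detection
helper are hypothetical placeholders, not a proposed ABI):

    /* Prefer RDRAND when the CPU has it; only fall back to a
     * hypervisor-provided seed MSR on older hardware. */
    static unsigned long boot_rng_seed(void)
    {
            unsigned long seed = 0;

            if (arch_get_random_long(&seed))        /* RDRAND-backed helper */
                    return seed;
            if (hypervisor_has_seed_msr())          /* per-hypervisor probe, hypothetical */
                    rdmsrl(MSR_HV_RNG_SEED, seed);  /* hypothetical MSR index */
            return seed;
    }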

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 11:02:38AM -0700, Andy Lutomirski wrote:
> On Fri, Sep 19, 2014 at 10:49 AM, Gleb Natapov  wrote:
> > On Fri, Sep 19, 2014 at 10:18:37AM -0700, H. Peter Anvin wrote:
> >> On 09/19/2014 10:15 AM, Gleb Natapov wrote:
> >> > On Fri, Sep 19, 2014 at 10:08:20AM -0700, H. Peter Anvin wrote:
> >> >> On 09/19/2014 09:53 AM, Gleb Natapov wrote:
> >> >>> On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> >> >>>> On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> >> >>>>>
> >> >>>>> Linux detects what hypervior it runs on very early
> >> >>>>
> >> >>>> Not anywhere close to early enough.  We're talking for uses like 
> >> >>>> kASLR.
> >> >>>>
> >> >>> Still to early to do:
> >> >>>
> >> >>>h = cpuid(HYPERVIOR_SIGNATURE)
> >> >>>if (h == KVMKVMKVM) {
> >> >>>   if (cpuid(kvm_features) & kvm_rnd)
> >> >>>  rdmsr(kvm_rnd)
> >> >>>else (h == HyperV) {
> >> >>>   if (cpuid(hv_features) & hv_rnd)
> >> >>> rdmsr(hv_rnd)
> >> >>>else (h == XenXenXen) {
> >> >>>   if (cpuid(xen_features) & xen_rnd)
> >> >>> rdmsr(xen_rnd)
> >> >>>   }
> >> >>>
> >> >>
> >> >> If we need to do chase loops, especially not so...
> >> >>
> >> > What loops exactly? As a non native English speaker I fail to understand
> >> > if your answer is "yes" or "no" ;)
> >> >
> >>
> >> The above isn't actually the full algorithm used.
> >>
> > What part of actually algorithm cannot be implemented? Loop that searches
> > for KVM leaf in case KVM pretend to be HyperV (is this what you called
> > "chase loops"?)? First of all there is no need to implement it, if KVM
> > pretends to be HyperV use HyperV's way to obtain RNG, but what is the
> > problem with the loop?
> >
> 
> It can be implemented, and I've done it.  But it's a mess.  Almost the
> very first thing we do in boot (even before decompressing the kernel)
> will be to scan a bunch of cpuid leaves looking for a hypervisor with
> an rng source that we can use for kASLR.  And we'll have to update
> that code and make it bigger every time another hypervisor adds
> exactly the same feature.
IMO implementing this feature is in each hypervisor's best interest, so the
task of updating the code will scale by virtue of hypervisor developers each
adding it for the hypervisor they care about.

> 
> And then we have another copy of almost exactly the same code in the
> normal post-boot part of the kernel.
> 
> We can certainly do this, but I'd much rather solve the problem once
> and let all of the hypervisors and guests opt in and immediately be
> compatible with each other.
> 
> > I "forgot" VMware because I do not see VMware people to be CCed. They may
> > be even less excited about them being told _how_ this feature need to be
> > implemented (e.g implement HyperV leafs for the feature detection). I
> > do not want to and cannot speak for VMware, but my guess is that for
> > them it would be much easier to add an else clause for VMware in above
> > "if" then to coordinate with all hypervisor developers about MSR/cpuid
> > details. And since this is security feature implementing it for Linux
> > is in their best interest.
> 
> Do you know any of them who should be cc'd?
> 
No, not anyone in particular. git log arch/x86/kernel/cpu/vmware.c may help.

But VMware is an elephant in the room here. There are other hypervisors out
there: VirtualBox, bhyve...

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 10:21:27AM -0700, Andy Lutomirski wrote:
> On Sep 19, 2014 9:53 AM, "Gleb Natapov"  wrote:
> >
> > On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> > > On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> > > >
> > > > Linux detects what hypervior it runs on very early
> > >
> > > Not anywhere close to early enough.  We're talking for uses like kASLR.
> > >
> > Still to early to do:
> >
> >h = cpuid(HYPERVIOR_SIGNATURE)
> >if (h == KVMKVMKVM) {
> >   if (cpuid(kvm_features) & kvm_rnd)
> >  rdmsr(kvm_rnd)
> >else (h == HyperV) {
> >   if (cpuid(hv_features) & hv_rnd)
> > rdmsr(hv_rnd)
> >else (h == XenXenXen) {
> >   if (cpuid(xen_features) & xen_rnd)
> > rdmsr(xen_rnd)
> >   }
> >
> 
> I think that there's a lot of value in having each guest
> implementation be automatically compatible with all hypervisors.  For
> example, you forgot VMware, and VMware might be less excited about
> implementing this feature if all the guests won't immediately start
> using it.
> 
I "forgot" VMware because I do not see VMware people to be CCed. They may
be even less excited about them being told _how_ this feature need to be
implemented (e.g implement HyperV leafs for the feature detection). I
do not want to and cannot speak for VMware, but my guess is that for
them it would be much easier to add an else clause for VMware in above
"if" then to coordinate with all hypervisor developers about MSR/cpuid
details. And since this is security feature implementing it for Linux
is in their best interest.

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 10:18:37AM -0700, H. Peter Anvin wrote:
> On 09/19/2014 10:15 AM, Gleb Natapov wrote:
> > On Fri, Sep 19, 2014 at 10:08:20AM -0700, H. Peter Anvin wrote:
> >> On 09/19/2014 09:53 AM, Gleb Natapov wrote:
> >>> On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> >>>> On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> >>>>>
> >>>>> Linux detects what hypervior it runs on very early
> >>>>
> >>>> Not anywhere close to early enough.  We're talking for uses like kASLR.
> >>>>
> >>> Still to early to do:
> >>>
> >>>h = cpuid(HYPERVIOR_SIGNATURE)
> >>>if (h == KVMKVMKVM) {
> >>>   if (cpuid(kvm_features) & kvm_rnd)
> >>>  rdmsr(kvm_rnd)
> >>>else (h == HyperV) {
> >>>   if (cpuid(hv_features) & hv_rnd)
> >>> rdmsr(hv_rnd) 
> >>>else (h == XenXenXen) {
> >>>   if (cpuid(xen_features) & xen_rnd)
> >>> rdmsr(xen_rnd)
> >>>   }
> >>>
> >>
> >> If we need to do chase loops, especially not so...
> >>
> > What loops exactly? As a non native English speaker I fail to understand
> > if your answer is "yes" or "no" ;)
> > 
> 
> The above isn't actually the full algorithm used.
> 
What part of the actual algorithm cannot be implemented? The loop that
searches for the KVM leaf in case KVM pretends to be HyperV (is this what
you called "chase loops"?)? First of all there is no need to implement it:
if KVM pretends to be HyperV, use HyperV's way to obtain the RNG seed. But
what is the problem with the loop?
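
For reference, the loop in question is small -- roughly what the guest's
kvm_cpuid_base() helper already does (sketch only; a kernel-style cpuid()
helper is assumed):

    /* Scan the hypervisor CPUID leaf blocks for the KVM signature.  When
     * KVM pretends to be HyperV, "KVMKVMKVM" moves from 0x40000000 up to
     * a higher block, so walk the range in 0x100 steps. */
    static u32 find_kvm_cpuid_base(void)
    {
            u32 base, eax, signature[3];

            for (base = 0x40000000; base < 0x40010000; base += 0x100) {
                    cpuid(base, &eax, &signature[0], &signature[1],
                          &signature[2]);
                    if (!memcmp("KVMKVMKVM\0\0\0", signature, 12))
                            return base;
            }
            return 0;   /* not running on KVM */
    }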

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 10:08:20AM -0700, H. Peter Anvin wrote:
> On 09/19/2014 09:53 AM, Gleb Natapov wrote:
> > On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> >> On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> >>>
> >>> Linux detects what hypervior it runs on very early
> >>
> >> Not anywhere close to early enough.  We're talking for uses like kASLR.
> >>
> > Still to early to do:
> > 
> >h = cpuid(HYPERVIOR_SIGNATURE)
> >if (h == KVMKVMKVM) {
> >   if (cpuid(kvm_features) & kvm_rnd)
> >  rdmsr(kvm_rnd)
> >else (h == HyperV) {
> >   if (cpuid(hv_features) & hv_rnd)
> > rdmsr(hv_rnd) 
> >else (h == XenXenXen) {
> >   if (cpuid(xen_features) & xen_rnd)
> > rdmsr(xen_rnd)
> >   }
> > 
> 
> If we need to do chase loops, especially not so...
> 
What loops exactly? As a non-native English speaker I fail to understand
whether your answer is "yes" or "no" ;)

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Fri, Sep 19, 2014 at 09:40:07AM -0700, H. Peter Anvin wrote:
> On 09/19/2014 09:37 AM, Gleb Natapov wrote:
> >
> > Linux detects what hypervior it runs on very early
> 
> Not anywhere close to early enough.  We're talking for uses like kASLR.
> 
Still too early to do:

   h = cpuid(HYPERVISOR_SIGNATURE)
   if (h == KVMKVMKVM) {
      if (cpuid(kvm_features) & kvm_rnd)
         rdmsr(kvm_rnd)
   } else if (h == HyperV) {
      if (cpuid(hv_features) & hv_rnd)
         rdmsr(hv_rnd)
   } else if (h == XenXenXen) {
      if (cpuid(xen_features) & xen_rnd)
         rdmsr(xen_rnd)
   }

?

--
Gleb.


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Gleb Natapov
On Thu, Sep 18, 2014 at 03:00:05PM -0700, Andy Lutomirski wrote:
> On Thu, Sep 18, 2014 at 2:46 PM, David Hepkin  wrote:
> > I suggest we come to consensus on a specific CPUID leaf where an OS needs 
> > to look to determine if a hypervisor supports this capability.  We could 
> > define a new CPUID leaf range at a well-defined location, or we could just 
> > use one of the existing CPUID leaf ranges implemented by an existing 
> > hypervisor.  I'm not familiar with the KVM CPUID leaf range, but in the 
> > case of Hyper-V, the Hyper-V CPUID leaf range was architected to allow for 
> > other hypervisors to implement it and just show through specific 
> > capabilities supported by the hypervisor.  So, we could define a bit in the 
> > Hyper-V CPUID leaf range (since Xen and KVM also implement this range), but 
> > that would require Linux to look in that range on boot to discover this 
> > capability.
> 
> I also don't know whether QEMU and KVM would be okay with implementing
> the host side of the Hyper-V mechanism by default.  They would have to
> implement at least leaves 0x4001 and 0x402, plus correctly
> reporting zeros through whatever leaf is used for this new feature.
> Gleb?  Paolo?
> 
KVM and every other hypervisor out there already implement a capability
detection mechanism in the 0x40000000 CPUID range, and of course all of
them do it differently. Linux detects which hypervisor it runs on very
early and switches to the corresponding code to handle each hypervisor.
Quite frankly I do not see what problem you are trying to fix by
standardizing an MSR to get a random seed and a detection mechanism for
this MSR. The RNG MSR is in no way unique here. There are other mechanisms
that are virtually identical between hypervisors but have different
guest/hypervisor interfaces and are detected differently on different
hypervisors. Examples are pvclock and pveoi; there may be others.

--
Gleb.


Re: GET_RNG_SEED hypercall ABI? (Re: [PATCH v5 0/5] random,x86,kvm: Rework arch RNG seeds and get some from kvm)

2014-08-28 Thread Gleb Natapov
On Tue, Aug 26, 2014 at 04:58:34PM -0700, Andy Lutomirski wrote:
> hpa pointed out that the ABI that I chose (an MSR from the KVM range
> and a KVM cpuid bit) is unnecessarily KVM-specific.  It would be nice
> to allocate an MSR that everyone involved can agree on and, rather
> than relying on a cpuid bit, just have the guest probe for the MSR.
> 
The CPUID part allows the feature to be disabled for machine compatibility
purposes during migration. Of course the interface can explicitly state that
one successful use of the MSR does not mean that the next use will not result
in a #GP, but that doesn't sound very elegant and is different from any other
MSR out there.
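
To make the trade-off concrete, a guest-side sketch of the two probing styles
(the MSR index and feature bit below are hypothetical placeholders, not an
agreed ABI):

    /* Style 1: trust an explicit CPUID feature bit, which the hypervisor
     * (or management tool) can clear for migration compatibility.
     * Style 2: just probe the MSR and rely on catching the #GP. */
    #define MSR_GET_RNG_SEED        0x4b564dff      /* hypothetical */
    #define KVM_FEATURE_RNG_SEED    10              /* hypothetical */

    static bool have_rng_seed_msr(void)
    {
            u64 dummy;

            if (cpuid_eax(KVM_CPUID_FEATURES) & (1 << KVM_FEATURE_RNG_SEED))
                    return true;                    /* advertised explicitly */
            return rdmsrl_safe(MSR_GET_RNG_SEED, &dummy) == 0; /* #GP-safe probe */
    }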

--
Gleb.


Re: [PATCH V13 0/4] Paravirtualized ticket spinlocks for KVM host

2013-08-26 Thread Gleb Natapov
On Mon, Aug 26, 2013 at 02:18:32PM +0530, Raghavendra K T wrote:
> 
>  This series forms the kvm host part of paravirtual spinlock
>  based against kvm tree.
> 
>  Please refer to https://lkml.org/lkml/2013/8/9/265 for
>  kvm guest and Xen, x86 part merged to -tip spinlocks.
> 
>  Please note that:
>  kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi is a common patch
>  for both guest and host.
> 
Thanks, applied. The patchset is not against the kvm.git queue branch, though,
so I had to fix one minor conflict manually.

>  Changes since V12:
>   fold the patch 3 into patch 2 for bisection. (Eric Northup)
> 
> Raghavendra K T (3):
>   kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi
>   kvm hypervisor: Simplify kvm_for_each_vcpu with
> kvm_irq_delivery_to_apic
>   Documentation/kvm : Add documentation on Hypercalls and features used
> for PV spinlock
> 
> Srivatsa Vaddagiri (1):
>   kvm hypervisor : Add a hypercall to KVM hypervisor to support
> pv-ticketlocks
> 
>  Documentation/virtual/kvm/cpuid.txt  |  4 
>  Documentation/virtual/kvm/hypercalls.txt | 14 ++
>  arch/x86/include/asm/kvm_host.h  |  5 +
>  arch/x86/include/uapi/asm/kvm_para.h |  1 +
>  arch/x86/kvm/cpuid.c |  3 ++-
>  arch/x86/kvm/lapic.c |  5 -
>  arch/x86/kvm/x86.c   | 31 ++-
>  include/uapi/linux/kvm_para.h|  1 +
>  8 files changed, 61 insertions(+), 3 deletions(-)
> 
> -- 
> 1.7.11.7

--
Gleb.


Re: [PATCH V12 3/5] kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration

2013-08-25 Thread Gleb Natapov
On Wed, Aug 07, 2013 at 12:40:36AM +0530, Raghavendra K T wrote:
> On 08/07/2013 12:02 AM, Eric Northup wrote:
> >
> >If this is about migration correctness, could it get folded into the
> >previous patch 2/5, so that there's not a broken commit which could
> >hurt bisection?
> 
> Yes. It could be. Only reason I maintained like that was,
> original author in the previous patch is different (Srivatsa) and I did
> not want to merge this hunk when the patch series got evolved to mix
> the sign-offs.
> 
> Gleb, Paolo please let me know.
> 
Yes please, do so.

--
Gleb.


Re: [PATCH V12 0/14] Paravirtualized ticket spinlocks

2013-08-07 Thread Gleb Natapov
On Wed, Aug 07, 2013 at 08:50:12PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Aug 07, 2013 at 12:15:21PM +0530, Raghavendra K T wrote:
> > On 08/07/2013 10:18 AM, H. Peter Anvin wrote:
> > >>Please let me know, if I should rebase again.
> > >>
> > >
> > >tip:master is not a stable branch; it is more like linux-next.  We need
> > >to figure out which topic branches are dependencies for this set.
> > 
> > Okay. I 'll start looking at the branches that would get affected.
> > (Xen, kvm are obvious ones).
> > Please do let me know the branches I might have to check for.
> 
> >From the Xen standpoint anything past v3.11-rc4 would work.
> 
For KVM, anything as early as v3.11-rc1 would be OK.

--
Gleb.


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-02 Thread Gleb Natapov
On Fri, Aug 02, 2013 at 11:25:39AM +0200, Ingo Molnar wrote:
> > Ingo,
> > 
> > Do you have any concerns reg this series? please let me know if this 
> > looks good now to you.
> 
> I'm inclined to NAK it for excessive quotation - who knows how many people 
> left the discussion in disgust? Was it done to drive away as many 
> reviewers as possible?
> 
> Anyway, see my other reply, the measurement results seem hard to interpret 
> and inconclusive at the moment.
> 
That result was only for patch 18 of the series, not pvspinlock in
general.

--
Gleb.


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-08-01 Thread Gleb Natapov
On Thu, Aug 01, 2013 at 01:08:47PM +0530, Raghavendra K T wrote:
> On 07/31/2013 11:54 AM, Gleb Natapov wrote:
> >On Tue, Jul 30, 2013 at 10:13:12PM +0530, Raghavendra K T wrote:
> >>On 07/25/2013 03:08 PM, Raghavendra K T wrote:
> >>>On 07/25/2013 02:45 PM, Gleb Natapov wrote:
> >>>>On Thu, Jul 25, 2013 at 02:47:37PM +0530, Raghavendra K T wrote:
> >>>>>On 07/24/2013 06:06 PM, Raghavendra K T wrote:
> >>>>>>On 07/24/2013 05:36 PM, Gleb Natapov wrote:
> >>>>>>>On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote:
> >>>>>>>>On 07/24/2013 04:09 PM, Gleb Natapov wrote:
> >>>>>>>>>On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:
> >>>>>>>>>>On 07/23/2013 08:37 PM, Gleb Natapov wrote:
> >>>>>>>>>>>On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:
> >>>>>>>>>>>>+static void kvm_lock_spinning(struct arch_spinlock *lock,
> >>>>>>>>>>>>__ticket_t want)
> >>>>>>>>>>[...]
> >>>>>>>>>>>>+
> >>>>>>>>>>>>+/*
> >>>>>>>>>>>>+ * halt until it's our turn and kicked. Note that we do safe
> >>>>>>>>>>>>halt
> >>>>>>>>>>>>+ * for irq enabled case to avoid hang when lock info is
> >>>>>>>>>>>>overwritten
> >>>>>>>>>>>>+ * in irq spinlock slowpath and no spurious interrupt occur
> >>>>>>>>>>>>to save us.
> >>>>>>>>>>>>+ */
> >>>>>>>>>>>>+if (arch_irqs_disabled_flags(flags))
> >>>>>>>>>>>>+halt();
> >>>>>>>>>>>>+else
> >>>>>>>>>>>>+safe_halt();
> >>>>>>>>>>>>+
> >>>>>>>>>>>>+out:
> >>>>>>>>>>>So here now interrupts can be either disabled or enabled. Previous
> >>>>>>>>>>>version disabled interrupts here, so are we sure it is safe to
> >>>>>>>>>>>have them
> >>>>>>>>>>>enabled at this point? I do not see any problem yet, will keep
> >>>>>>>>>>>thinking.
> >>>>>>>>>>
> >>>>>>>>>>If we enable interrupt here, then
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>>+cpumask_clear_cpu(cpu, &waiting_cpus);
> >>>>>>>>>>
> >>>>>>>>>>and if we start serving lock for an interrupt that came here,
> >>>>>>>>>>cpumask clear and w->lock=null may not happen atomically.
> >>>>>>>>>>if irq spinlock does not take slow path we would have non null
> >>>>>>>>>>value
> >>>>>>>>>>for lock, but with no information in waitingcpu.
> >>>>>>>>>>
> >>>>>>>>>>I am still thinking what would be problem with that.
> >>>>>>>>>>
> >>>>>>>>>Exactly, for kicker waiting_cpus and w->lock updates are
> >>>>>>>>>non atomic anyway.
> >>>>>>>>>
> >>>>>>>>>>>>+w->lock = NULL;
> >>>>>>>>>>>>+local_irq_restore(flags);
> >>>>>>>>>>>>+spin_time_accum_blocked(start);
> >>>>>>>>>>>>+}
> >>>>>>>>>>>>+PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
> >>>>>>>>>>>>+
> >>>>>>>>>>>>+/* Kick vcpu waiting on @lock->head to reach value @ticket */
> >>>>>>>>>>>>+static void kvm_unlock_kick(struct arch_spinlock *lock,
> >>>>>>>>>>>>__ticket_t ticket)
> >>>>>>>>>>>>+{
> >>>>>>>>>>>>+int cpu;
> >>>>&

Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-30 Thread Gleb Natapov
On Tue, Jul 30, 2013 at 10:13:12PM +0530, Raghavendra K T wrote:
> On 07/25/2013 03:08 PM, Raghavendra K T wrote:
> >On 07/25/2013 02:45 PM, Gleb Natapov wrote:
> >>On Thu, Jul 25, 2013 at 02:47:37PM +0530, Raghavendra K T wrote:
> >>>On 07/24/2013 06:06 PM, Raghavendra K T wrote:
> >>>>On 07/24/2013 05:36 PM, Gleb Natapov wrote:
> >>>>>On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote:
> >>>>>>On 07/24/2013 04:09 PM, Gleb Natapov wrote:
> >>>>>>>On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:
> >>>>>>>>On 07/23/2013 08:37 PM, Gleb Natapov wrote:
> >>>>>>>>>On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:
> >>>>>>>>>>+static void kvm_lock_spinning(struct arch_spinlock *lock,
> >>>>>>>>>>__ticket_t want)
> >>>>>>>>[...]
> >>>>>>>>>>+
> >>>>>>>>>>+/*
> >>>>>>>>>>+ * halt until it's our turn and kicked. Note that we do safe
> >>>>>>>>>>halt
> >>>>>>>>>>+ * for irq enabled case to avoid hang when lock info is
> >>>>>>>>>>overwritten
> >>>>>>>>>>+ * in irq spinlock slowpath and no spurious interrupt occur
> >>>>>>>>>>to save us.
> >>>>>>>>>>+ */
> >>>>>>>>>>+if (arch_irqs_disabled_flags(flags))
> >>>>>>>>>>+halt();
> >>>>>>>>>>+else
> >>>>>>>>>>+safe_halt();
> >>>>>>>>>>+
> >>>>>>>>>>+out:
> >>>>>>>>>So here now interrupts can be either disabled or enabled. Previous
> >>>>>>>>>version disabled interrupts here, so are we sure it is safe to
> >>>>>>>>>have them
> >>>>>>>>>enabled at this point? I do not see any problem yet, will keep
> >>>>>>>>>thinking.
> >>>>>>>>
> >>>>>>>>If we enable interrupt here, then
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>>+cpumask_clear_cpu(cpu, &waiting_cpus);
> >>>>>>>>
> >>>>>>>>and if we start serving lock for an interrupt that came here,
> >>>>>>>>cpumask clear and w->lock=null may not happen atomically.
> >>>>>>>>if irq spinlock does not take slow path we would have non null
> >>>>>>>>value
> >>>>>>>>for lock, but with no information in waitingcpu.
> >>>>>>>>
> >>>>>>>>I am still thinking what would be problem with that.
> >>>>>>>>
> >>>>>>>Exactly, for kicker waiting_cpus and w->lock updates are
> >>>>>>>non atomic anyway.
> >>>>>>>
> >>>>>>>>>>+w->lock = NULL;
> >>>>>>>>>>+local_irq_restore(flags);
> >>>>>>>>>>+spin_time_accum_blocked(start);
> >>>>>>>>>>+}
> >>>>>>>>>>+PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
> >>>>>>>>>>+
> >>>>>>>>>>+/* Kick vcpu waiting on @lock->head to reach value @ticket */
> >>>>>>>>>>+static void kvm_unlock_kick(struct arch_spinlock *lock,
> >>>>>>>>>>__ticket_t ticket)
> >>>>>>>>>>+{
> >>>>>>>>>>+int cpu;
> >>>>>>>>>>+
> >>>>>>>>>>+add_stats(RELEASED_SLOW, 1);
> >>>>>>>>>>+for_each_cpu(cpu, &waiting_cpus) {
> >>>>>>>>>>+const struct kvm_lock_waiting *w =
> >>>>>>>>>>&per_cpu(lock_waiting, cpu);
> >>>>>>>>>>+if (ACCESS_ONCE(w->lock) == lock &&
> >>>>>>>>>>+ACCESS_ONCE(w->want) == ticket) {
> >>>>>>>>>>+add_

Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-25 Thread Gleb Natapov
On Thu, Jul 25, 2013 at 02:47:37PM +0530, Raghavendra K T wrote:
> On 07/24/2013 06:06 PM, Raghavendra K T wrote:
> >On 07/24/2013 05:36 PM, Gleb Natapov wrote:
> >>On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote:
> >>>On 07/24/2013 04:09 PM, Gleb Natapov wrote:
> >>>>On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:
> >>>>>On 07/23/2013 08:37 PM, Gleb Natapov wrote:
> >>>>>>On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:
> >>>>>>>+static void kvm_lock_spinning(struct arch_spinlock *lock,
> >>>>>>>__ticket_t want)
> >>>>>[...]
> >>>>>>>+
> >>>>>>>+/*
> >>>>>>>+ * halt until it's our turn and kicked. Note that we do safe
> >>>>>>>halt
> >>>>>>>+ * for irq enabled case to avoid hang when lock info is
> >>>>>>>overwritten
> >>>>>>>+ * in irq spinlock slowpath and no spurious interrupt occur
> >>>>>>>to save us.
> >>>>>>>+ */
> >>>>>>>+if (arch_irqs_disabled_flags(flags))
> >>>>>>>+halt();
> >>>>>>>+else
> >>>>>>>+safe_halt();
> >>>>>>>+
> >>>>>>>+out:
> >>>>>>So here now interrupts can be either disabled or enabled. Previous
> >>>>>>version disabled interrupts here, so are we sure it is safe to
> >>>>>>have them
> >>>>>>enabled at this point? I do not see any problem yet, will keep
> >>>>>>thinking.
> >>>>>
> >>>>>If we enable interrupt here, then
> >>>>>
> >>>>>
> >>>>>>>+cpumask_clear_cpu(cpu, &waiting_cpus);
> >>>>>
> >>>>>and if we start serving lock for an interrupt that came here,
> >>>>>cpumask clear and w->lock=null may not happen atomically.
> >>>>>if irq spinlock does not take slow path we would have non null value
> >>>>>for lock, but with no information in waitingcpu.
> >>>>>
> >>>>>I am still thinking what would be problem with that.
> >>>>>
> >>>>Exactly, for kicker waiting_cpus and w->lock updates are
> >>>>non atomic anyway.
> >>>>
> >>>>>>>+w->lock = NULL;
> >>>>>>>+local_irq_restore(flags);
> >>>>>>>+spin_time_accum_blocked(start);
> >>>>>>>+}
> >>>>>>>+PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
> >>>>>>>+
> >>>>>>>+/* Kick vcpu waiting on @lock->head to reach value @ticket */
> >>>>>>>+static void kvm_unlock_kick(struct arch_spinlock *lock,
> >>>>>>>__ticket_t ticket)
> >>>>>>>+{
> >>>>>>>+int cpu;
> >>>>>>>+
> >>>>>>>+add_stats(RELEASED_SLOW, 1);
> >>>>>>>+for_each_cpu(cpu, &waiting_cpus) {
> >>>>>>>+const struct kvm_lock_waiting *w =
> >>>>>>>&per_cpu(lock_waiting, cpu);
> >>>>>>>+if (ACCESS_ONCE(w->lock) == lock &&
> >>>>>>>+ACCESS_ONCE(w->want) == ticket) {
> >>>>>>>+add_stats(RELEASED_SLOW_KICKED, 1);
> >>>>>>>+kvm_kick_cpu(cpu);
> >>>>>>What about using NMI to wake sleepers? I think it was discussed, but
> >>>>>>forgot why it was dismissed.
> >>>>>
> >>>>>I think I have missed that discussion. 'll go back and check. so
> >>>>>what is the idea here? we can easily wake up the halted vcpus that
> >>>>>have interrupt disabled?
> >>>>We can of course. IIRC the objection was that NMI handling path is very
> >>>>fragile and handling NMI on each wakeup will be more expensive then
> >>>>waking up a guest without injecting an event, but it is still
> >>>>interesting
> >>>>to see the numbers.
> >>>>
> >>>
> >>>Haam, now I remember, We had tried request based mechanism. (new
> >>>request like REQ_UNHALT) and process that. It had worked, but had some
> >>>complex hacks in vcpu_enter_guest to avoid guest hang in case of
> >>>request cleared.  So had left it there..
> >>>
> >>>https://lkml.org/lkml/2012/4/30/67
> >>>
> >>>But I do not remember performance impact though.
> >>No, this is something different. Wakeup with NMI does not need KVM
> >>changes at
> >>all. Instead of kvm_kick_cpu(cpu) in kvm_unlock_kick you send NMI IPI.
> >>
> >
> >True. It was not NMI.
> >just to confirm, are you talking about something like this to be tried ?
> >
> >apic->send_IPI_mask(cpumask_of(cpu), APIC_DM_NMI);
> 
> When I started benchmark, I started seeing
> "Dazed and confused, but trying to continue" from unknown nmi error
> handling.
> Did I miss anything (because we did not register any NMI handler)? or
> is it that spurious NMIs are trouble because we could get spurious NMIs
> if next waiter already acquired the lock.
There is a default NMI handler that tries to detect the reason why the NMI
happened (which is not so easy on x86) and prints this message if it
fails. You need to add logic there to detect the spinlock slow path; check
the bit in waiting_cpus, for instance.
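
Something along these lines, for instance (sketch only; the handler name is
made up, and waiting_cpus is the mask from your patch):

    #include <asm/nmi.h>

    /* Recognise the pv-spinlock wakeup NMI by checking whether this cpu
     * is parked in the slow path, so it is not reported as unknown. */
    static int spinlock_wakeup_nmi(unsigned int type, struct pt_regs *regs)
    {
            if (cpumask_test_cpu(smp_processor_id(), &waiting_cpus))
                    return NMI_HANDLED;     /* it was our kick, swallow it */
            return NMI_DONE;                /* let the default handler run */
    }

    /* registered from kvm_spinlock_init(), e.g.:
     *   register_nmi_handler(NMI_UNKNOWN, spinlock_wakeup_nmi, 0, "pv_spinlock");
     */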

> 
> (note: I tried sending APIC_DM_REMRD IPI directly, which worked fine
> but hypercall way of handling still performed well from the results I
> saw).
You mean better? That is strange. Have you run the guest with x2apic?

--
Gleb.


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-24 Thread Gleb Natapov
On Wed, Jul 24, 2013 at 05:30:20PM +0530, Raghavendra K T wrote:
> On 07/24/2013 04:09 PM, Gleb Natapov wrote:
> >On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:
> >>On 07/23/2013 08:37 PM, Gleb Natapov wrote:
> >>>On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:
> >>>>+static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t 
> >>>>want)
> >>[...]
> >>>>+
> >>>>+ /*
> >>>>+  * halt until it's our turn and kicked. Note that we do safe halt
> >>>>+  * for irq enabled case to avoid hang when lock info is overwritten
> >>>>+  * in irq spinlock slowpath and no spurious interrupt occur to save us.
> >>>>+  */
> >>>>+ if (arch_irqs_disabled_flags(flags))
> >>>>+ halt();
> >>>>+ else
> >>>>+ safe_halt();
> >>>>+
> >>>>+out:
> >>>So here now interrupts can be either disabled or enabled. Previous
> >>>version disabled interrupts here, so are we sure it is safe to have them
> >>>enabled at this point? I do not see any problem yet, will keep thinking.
> >>
> >>If we enable interrupt here, then
> >>
> >>
> >>>>+ cpumask_clear_cpu(cpu, &waiting_cpus);
> >>
> >>and if we start serving lock for an interrupt that came here,
> >>cpumask clear and w->lock=null may not happen atomically.
> >>if irq spinlock does not take slow path we would have non null value
> >>for lock, but with no information in waitingcpu.
> >>
> >>I am still thinking what would be problem with that.
> >>
> >Exactly, for kicker waiting_cpus and w->lock updates are
> >non atomic anyway.
> >
> >>>>+ w->lock = NULL;
> >>>>+ local_irq_restore(flags);
> >>>>+ spin_time_accum_blocked(start);
> >>>>+}
> >>>>+PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
> >>>>+
> >>>>+/* Kick vcpu waiting on @lock->head to reach value @ticket */
> >>>>+static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t 
> >>>>ticket)
> >>>>+{
> >>>>+ int cpu;
> >>>>+
> >>>>+ add_stats(RELEASED_SLOW, 1);
> >>>>+ for_each_cpu(cpu, &waiting_cpus) {
> >>>>+ const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu);
> >>>>+ if (ACCESS_ONCE(w->lock) == lock &&
> >>>>+ ACCESS_ONCE(w->want) == ticket) {
> >>>>+ add_stats(RELEASED_SLOW_KICKED, 1);
> >>>>+ kvm_kick_cpu(cpu);
> >>>What about using NMI to wake sleepers? I think it was discussed, but
> >>>forgot why it was dismissed.
> >>
> >>I think I have missed that discussion. 'll go back and check. so
> >>what is the idea here? we can easily wake up the halted vcpus that
> >>have interrupt disabled?
> >We can of course. IIRC the objection was that NMI handling path is very
> >fragile and handling NMI on each wakeup will be more expensive then
> >waking up a guest without injecting an event, but it is still interesting
> >to see the numbers.
> >
> 
> Haam, now I remember, We had tried request based mechanism. (new
> request like REQ_UNHALT) and process that. It had worked, but had some
> complex hacks in vcpu_enter_guest to avoid guest hang in case of
> request cleared.  So had left it there..
> 
> https://lkml.org/lkml/2012/4/30/67
> 
> But I do not remember performance impact though.
No, this is something different. Wakeup with NMI does not need KVM changes at
all. Instead of kvm_kick_cpu(cpu) in kvm_unlock_kick you send NMI IPI.
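
I.e. in kvm_unlock_kick(), instead of kvm_kick_cpu(cpu), roughly (this is
the same line Raghavendra quotes in the follow-up, a sketch rather than a
tested change):

    apic->send_IPI_mask(cpumask_of(cpu), APIC_DM_NMI);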

--
Gleb.


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-24 Thread Gleb Natapov
On Wed, Jul 24, 2013 at 03:15:50PM +0530, Raghavendra K T wrote:
> On 07/23/2013 08:37 PM, Gleb Natapov wrote:
> >On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:
> >>+static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
> [...]
> >>+
> >>+   /*
> >>+* halt until it's our turn and kicked. Note that we do safe halt
> >>+* for irq enabled case to avoid hang when lock info is overwritten
> >>+* in irq spinlock slowpath and no spurious interrupt occur to save us.
> >>+*/
> >>+   if (arch_irqs_disabled_flags(flags))
> >>+   halt();
> >>+   else
> >>+   safe_halt();
> >>+
> >>+out:
> >So here now interrupts can be either disabled or enabled. Previous
> >version disabled interrupts here, so are we sure it is safe to have them
> >enabled at this point? I do not see any problem yet, will keep thinking.
> 
> If we enable interrupt here, then
> 
> 
> >>+   cpumask_clear_cpu(cpu, &waiting_cpus);
> 
> and if we start serving lock for an interrupt that came here,
> cpumask clear and w->lock=null may not happen atomically.
> if irq spinlock does not take slow path we would have non null value
> for lock, but with no information in waitingcpu.
> 
> I am still thinking what would be problem with that.
> 
Exactly; for the kicker, the waiting_cpus and w->lock updates are
non-atomic anyway.

> >>+   w->lock = NULL;
> >>+   local_irq_restore(flags);
> >>+   spin_time_accum_blocked(start);
> >>+}
> >>+PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
> >>+
> >>+/* Kick vcpu waiting on @lock->head to reach value @ticket */
> >>+static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
> >>+{
> >>+   int cpu;
> >>+
> >>+   add_stats(RELEASED_SLOW, 1);
> >>+   for_each_cpu(cpu, &waiting_cpus) {
> >>+   const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu);
> >>+   if (ACCESS_ONCE(w->lock) == lock &&
> >>+   ACCESS_ONCE(w->want) == ticket) {
> >>+   add_stats(RELEASED_SLOW_KICKED, 1);
> >>+   kvm_kick_cpu(cpu);
> >What about using NMI to wake sleepers? I think it was discussed, but
> >forgot why it was dismissed.
> 
> I think I have missed that discussion. 'll go back and check. so
> what is the idea here? we can easily wake up the halted vcpus that
> have interrupt disabled?
We can of course. IIRC the objection was that the NMI handling path is very
fragile and handling an NMI on each wakeup will be more expensive than
waking up a guest without injecting an event, but it is still interesting
to see the numbers.

--
Gleb.


Re: [PATCH RFC V11 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-23 Thread Gleb Natapov
On Mon, Jul 22, 2013 at 11:50:16AM +0530, Raghavendra K T wrote:
> +static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
> +{
> + struct kvm_lock_waiting *w;
> + int cpu;
> + u64 start;
> + unsigned long flags;
> +
Why don't you bail out here if in NMI, like we discussed?

> + w = &__get_cpu_var(lock_waiting);
> + cpu = smp_processor_id();
> + start = spin_time_start();
> +
> + /*
> +  * Make sure an interrupt handler can't upset things in a
> +  * partially setup state.
> +  */
> + local_irq_save(flags);
> +
> + /*
> +  * The ordering protocol on this is that the "lock" pointer
> +  * may only be set non-NULL if the "want" ticket is correct.
> +  * If we're updating "want", we must first clear "lock".
> +  */
> + w->lock = NULL;
> + smp_wmb();
> + w->want = want;
> + smp_wmb();
> + w->lock = lock;
> +
> + add_stats(TAKEN_SLOW, 1);
> +
> + /*
> +  * This uses set_bit, which is atomic but we should not rely on its
> +  * reordering gurantees. So barrier is needed after this call.
> +  */
> + cpumask_set_cpu(cpu, &waiting_cpus);
> +
> + barrier();
> +
> + /*
> +  * Mark entry to slowpath before doing the pickup test to make
> +  * sure we don't deadlock with an unlocker.
> +  */
> + __ticket_enter_slowpath(lock);
> +
> + /*
> +  * check again make sure it didn't become free while
> +  * we weren't looking.
> +  */
> + if (ACCESS_ONCE(lock->tickets.head) == want) {
> + add_stats(TAKEN_SLOW_PICKUP, 1);
> + goto out;
> + }
> +
> + /*
> +  * halt until it's our turn and kicked. Note that we do safe halt
> +  * for irq enabled case to avoid hang when lock info is overwritten
> +  * in irq spinlock slowpath and no spurious interrupt occur to save us.
> +  */
> + if (arch_irqs_disabled_flags(flags))
> + halt();
> + else
> + safe_halt();
> +
> +out:
So here interrupts can now be either disabled or enabled. The previous
version disabled interrupts here, so are we sure it is safe to have them
enabled at this point? I do not see any problem yet; will keep thinking.

> + cpumask_clear_cpu(cpu, &waiting_cpus);
> + w->lock = NULL;
> + local_irq_restore(flags);
> + spin_time_accum_blocked(start);
> +}
> +PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
> +
> +/* Kick vcpu waiting on @lock->head to reach value @ticket */
> +static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
> +{
> + int cpu;
> +
> + add_stats(RELEASED_SLOW, 1);
> + for_each_cpu(cpu, &waiting_cpus) {
> + const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu);
> + if (ACCESS_ONCE(w->lock) == lock &&
> + ACCESS_ONCE(w->want) == ticket) {
> + add_stats(RELEASED_SLOW_KICKED, 1);
> + kvm_kick_cpu(cpu);
What about using an NMI to wake sleepers? I think it was discussed, but I
forgot why it was dismissed.

> + break;
> + }
> + }
> +}
> +
> +/*
> + * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
> + */
> +void __init kvm_spinlock_init(void)
> +{
> + if (!kvm_para_available())
> + return;
> + /* Does host kernel support KVM_FEATURE_PV_UNHALT? */
> + if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
> + return;
> +
> + printk(KERN_INFO "KVM setup paravirtual spinlock\n");
> +
> + static_key_slow_inc(&paravirt_ticketlocks_enabled);
> +
> + pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
> + pv_lock_ops.unlock_kick = kvm_unlock_kick;
> +}
> +#endif   /* CONFIG_PARAVIRT_SPINLOCKS */
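
Regarding the in-NMI bailout mentioned at the top of my reply, what I have in
mind is just a guard at the start of kvm_lock_spinning() (sketch; the helper
name is made up):

    #include <linux/hardirq.h>      /* in_nmi() */

    /* Called first thing in kvm_lock_spinning().  If we got here from NMI
     * context, returning immediately makes the caller keep spinning in the
     * fastpath instead of touching the per-cpu lock_waiting entry that the
     * NMI may have interrupted half-way through an update. */
    static inline bool kvm_lock_spinning_forbidden(void)
    {
            return in_nmi();
    }

i.e. with "if (kvm_lock_spinning_forbidden()) return;" at the top of
kvm_lock_spinning().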

--
Gleb.


Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-17 Thread Gleb Natapov
On Wed, Jul 17, 2013 at 08:25:19PM +0530, Raghavendra K T wrote:
> On 07/17/2013 08:14 PM, Gleb Natapov wrote:
> >On Wed, Jul 17, 2013 at 07:43:01PM +0530, Raghavendra K T wrote:
> >>On 07/17/2013 06:55 PM, Gleb Natapov wrote:
> >>>On Wed, Jul 17, 2013 at 06:25:05PM +0530, Raghavendra K T wrote:
> >>>>On 07/17/2013 06:15 PM, Gleb Natapov wrote:
> >>>>>On Wed, Jul 17, 2013 at 03:35:37PM +0530, Raghavendra K T wrote:
> >>>>>>>>Instead of halt we started with a sleep hypercall in those
> >>>>>>>>  versions. Changed to halt() once Avi suggested to reuse existing 
> >>>>>>>> sleep.
> >>>>>>>>
> >>>>>>>>If we use older hypercall with few changes like below:
> >>>>>>>>
> >>>>>>>>kvm_pv_wait_for_kick_op(flags, vcpu, w->lock )
> >>>>>>>>{
> >>>>>>>>  // a0 reserved for flags
> >>>>>>>>if (!w->lock)
> >>>>>>>>return;
> >>>>>>>>DEFINE_WAIT
> >>>>>>>>...
> >>>>>>>>end_wait
> >>>>>>>>}
> >>>>>>>>
> >>>>>>>How would this help if NMI takes lock in critical section. The thing
> >>>>>>>that may happen is that lock_waiting->want may have NMI lock value, but
> >>>>>>>lock_waiting->lock will point to non NMI lock. Setting of want and lock
> >>>>>>>have to be atomic.
> >>>>>>
> >>>>>>True. so we are here
> >>>>>>
> >>>>>> non NMI lock(a)
> >>>>>> w->lock = NULL;
> >>>>>> smp_wmb();
> >>>>>> w->want = want;
> >>>>>>NMI
> >>>>>>  <-
> >>>>>>   NMI lock(b)
> >>>>>>   w->lock = NULL;
> >>>>>>   smp_wmb();
> >>>>>>   w->want = want;
> >>>>>>   smp_wmb();
> >>>>>>   w->lock = lock;
> >>>>>>  -->
> >>>>>> smp_wmb();
> >>>>>> w->lock = lock;
> >>>>>>
> >>>>>>so how about fixing like this?
> >>>>>>
> >>>>>>again:
> >>>>>> w->lock = NULL;
> >>>>>> smp_wmb();
> >>>>>> w->want = want;
> >>>>>> smp_wmb();
> >>>>>> w->lock = lock;
> >>>>>>
> >>>>>>if (!lock || w->want != want) goto again;
> >>>>>>
> >>>>>NMI can happen after the if() but before halt and the same situation
> >>>>>we are trying to prevent with IRQs will occur.
> >>>>
> >>>>True, we can not fix that. I thought to fix the inconsistency of
> >>>>lock,want pair.
> >>>>But NMI could happen after the first OR condition also.
> >>>>/me thinks again
> >>>>
> >>>lock_spinning() can check that it is called in nmi context and bail out.
> >>
> >>Good point.
> >>I think we can check for even irq context and bailout so that in irq
> >>context we continue spinning instead of slowpath. no ?
> >>
> >That will happen much more often and irq context is no a problem anyway.
> >
> 
> Yes. It is not a problem. But my idea was to not to enter slowpath lock
> during irq processing. Do you think that is a good idea?
> 
Why would we disable it if its purpose is to improve handling of
contended locks? NMI is only special because it is impossible to handle
and should not happen anyway.
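
For illustration, a minimal sketch of such a bail-out (not part of the posted
patch; it assumes in_nmi() from <linux/hardirq.h> and leaves the rest of the
slowpath exactly as posted):

	static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
	{
		/*
		 * Never enter the PV slowpath from NMI context: a nested NMI
		 * lock would overwrite the per-cpu lock_waiting entry and the
		 * halted waiter would never be kicked. Returning here simply
		 * keeps spinning natively for this acquisition.
		 */
		if (in_nmi())
			return;

		/* ... rest of the slowpath as in the posted patch ... */
	}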

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-17 Thread Gleb Natapov
On Wed, Jul 17, 2013 at 07:43:01PM +0530, Raghavendra K T wrote:
> On 07/17/2013 06:55 PM, Gleb Natapov wrote:
> >On Wed, Jul 17, 2013 at 06:25:05PM +0530, Raghavendra K T wrote:
> >>On 07/17/2013 06:15 PM, Gleb Natapov wrote:
> >>>On Wed, Jul 17, 2013 at 03:35:37PM +0530, Raghavendra K T wrote:
> >>>>>>Instead of halt we started with a sleep hypercall in those
> >>>>>>  versions. Changed to halt() once Avi suggested to reuse existing 
> >>>>>> sleep.
> >>>>>>
> >>>>>>If we use older hypercall with few changes like below:
> >>>>>>
> >>>>>>kvm_pv_wait_for_kick_op(flags, vcpu, w->lock )
> >>>>>>{
> >>>>>>  // a0 reserved for flags
> >>>>>>if (!w->lock)
> >>>>>>return;
> >>>>>>DEFINE_WAIT
> >>>>>>...
> >>>>>>end_wait
> >>>>>>}
> >>>>>>
> >>>>>How would this help if NMI takes lock in critical section. The thing
> >>>>>that may happen is that lock_waiting->want may have NMI lock value, but
> >>>>>lock_waiting->lock will point to non NMI lock. Setting of want and lock
> >>>>>have to be atomic.
> >>>>
> >>>>True. so we are here
> >>>>
> >>>> non NMI lock(a)
> >>>> w->lock = NULL;
> >>>> smp_wmb();
> >>>> w->want = want;
> >>>>NMI
> >>>>  <-
> >>>>   NMI lock(b)
> >>>>   w->lock = NULL;
> >>>>   smp_wmb();
> >>>>   w->want = want;
> >>>>   smp_wmb();
> >>>>   w->lock = lock;
> >>>>  -->
> >>>> smp_wmb();
> >>>> w->lock = lock;
> >>>>
> >>>>so how about fixing like this?
> >>>>
> >>>>again:
> >>>> w->lock = NULL;
> >>>> smp_wmb();
> >>>> w->want = want;
> >>>> smp_wmb();
> >>>> w->lock = lock;
> >>>>
> >>>>if (!lock || w->want != want) goto again;
> >>>>
> >>>NMI can happen after the if() but before halt and the same situation
> >>>we are trying to prevent with IRQs will occur.
> >>
> >>True, we can not fix that. I thought to fix the inconsistency of
> >>lock,want pair.
> >>But NMI could happen after the first OR condition also.
> >>/me thinks again
> >>
> >lock_spinning() can check that it is called in nmi context and bail out.
> 
> Good point.
> I think we can check for even irq context and bailout so that in irq
> context we continue spinning instead of slowpath. no ?
> 
That will happen much more often, and irq context is not a problem anyway.

> >How often this will happens anyway.
> >
> 
> I know NMIs occur frequently with watchdogs. or used by sysrq-trigger
> etc.. But I am not an expert how frequent it is otherwise. But even
> then if they do not use spinlock, we have no problem as already pointed.
> 
> I can measure with debugfs counter how often it happens.
> 
When you run perf you will see a lot of NMIs, but those should not take
any locks.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-17 Thread Gleb Natapov
On Wed, Jul 17, 2013 at 06:25:05PM +0530, Raghavendra K T wrote:
> On 07/17/2013 06:15 PM, Gleb Natapov wrote:
> >On Wed, Jul 17, 2013 at 03:35:37PM +0530, Raghavendra K T wrote:
> >>>>Instead of halt we started with a sleep hypercall in those
> >>>>  versions. Changed to halt() once Avi suggested to reuse existing sleep.
> >>>>
> >>>>If we use older hypercall with few changes like below:
> >>>>
> >>>>kvm_pv_wait_for_kick_op(flags, vcpu, w->lock )
> >>>>{
> >>>>  // a0 reserved for flags
> >>>>if (!w->lock)
> >>>>return;
> >>>>DEFINE_WAIT
> >>>>...
> >>>>end_wait
> >>>>}
> >>>>
> >>>How would this help if NMI takes lock in critical section. The thing
> >>>that may happen is that lock_waiting->want may have NMI lock value, but
> >>>lock_waiting->lock will point to non NMI lock. Setting of want and lock
> >>>have to be atomic.
> >>
> >>True. so we are here
> >>
> >> non NMI lock(a)
> >> w->lock = NULL;
> >> smp_wmb();
> >> w->want = want;
> >>NMI
> >>  <-
> >>   NMI lock(b)
> >>   w->lock = NULL;
> >>   smp_wmb();
> >>   w->want = want;
> >>   smp_wmb();
> >>   w->lock = lock;
> >>  -->
> >> smp_wmb();
> >> w->lock = lock;
> >>
> >>so how about fixing like this?
> >>
> >>again:
> >> w->lock = NULL;
> >> smp_wmb();
> >> w->want = want;
> >> smp_wmb();
> >> w->lock = lock;
> >>
> >>if (!lock || w->want != want) goto again;
> >>
> >NMI can happen after the if() but before halt and the same situation
> >we are trying to prevent with IRQs will occur.
> 
> True, we can not fix that. I thought to fix the inconsistency of
> lock,want pair.
> But NMI could happen after the first OR condition also.
> /me thinks again
> 
lock_spinning() can check that it is called in nmi context and bail out.
How often will this happen anyway?

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-17 Thread Gleb Natapov
On Wed, Jul 17, 2013 at 03:35:37PM +0530, Raghavendra K T wrote:
> >>Instead of halt we started with a sleep hypercall in those
> >>  versions. Changed to halt() once Avi suggested to reuse existing sleep.
> >>
> >>If we use older hypercall with few changes like below:
> >>
> >>kvm_pv_wait_for_kick_op(flags, vcpu, w->lock )
> >>{
> >>  // a0 reserved for flags
> >>if (!w->lock)
> >>return;
> >>DEFINE_WAIT
> >>...
> >>end_wait
> >>}
> >>
> >How would this help if NMI takes lock in critical section. The thing
> >that may happen is that lock_waiting->want may have NMI lock value, but
> >lock_waiting->lock will point to non NMI lock. Setting of want and lock
> >have to be atomic.
> 
> True. so we are here
> 
> non NMI lock(a)
> w->lock = NULL;
> smp_wmb();
> w->want = want;
>NMI
>  <-
>   NMI lock(b)
>   w->lock = NULL;
>   smp_wmb();
>   w->want = want;
>   smp_wmb();
>   w->lock = lock;
>  -->
> smp_wmb();
> w->lock = lock;
> 
> so how about fixing like this?
> 
> again:
> w->lock = NULL;
> smp_wmb();
> w->want = want;
> smp_wmb();
> w->lock = lock;
> 
> if (!lock || w->want != want) goto again;
> 
NMI can happen after the if() but before halt, and the same situation we
are trying to prevent with IRQs will occur. But if NMI handlers do not take
locks we shouldn't worry.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-17 Thread Gleb Natapov
On Wed, Jul 17, 2013 at 12:12:35AM +0530, Raghavendra K T wrote:
>> I do not think it is very rare to get interrupt between
>> local_irq_restore() and halt() under load since any interrupt that
>> occurs between local_irq_save() and local_irq_restore() will be
>> delivered
>> immediately after local_irq_restore(). Of course the chance of no
>> other
>> random interrupt waking lock waiter is very low, but waiter can sleep
>> for much longer then needed and this will be noticeable in
>> performance.
>
>Yes, I meant the entire thing. I did infact turned WARN on
>w->lock==null before halt() [ though we can potentially have irq right
>after that ], but did not hit so far.
Depends on your workload of course. To hit that you not only need to get an
interrupt in there, but the interrupt handler also needs to take a contended
spinlock.

>
> >BTW can NMI handler take spinlocks? If it can what happens if NMI is
> >delivered in a section protected by local_irq_save()/local_irq_restore()?
> >
> 
> Had another idea if NMI, halts are causing problem until I saw
> PeterZ's reply similar to V2 of pvspinlock posted  here:
> 
>  https://lkml.org/lkml/2011/10/23/211
> 
> Instead of halt we started with a sleep hypercall in those
>  versions. Changed to halt() once Avi suggested to reuse existing sleep.
> 
> If we use older hypercall with few changes like below:
> 
> kvm_pv_wait_for_kick_op(flags, vcpu, w->lock )
> {
>  // a0 reserved for flags
> if (!w->lock)
> return;
> DEFINE_WAIT
> ...
> end_wait
> }
> 
How would this help if an NMI takes a lock in a critical section? What may
happen is that lock_waiting->want holds the NMI lock's value while
lock_waiting->lock still points to the non-NMI lock. Setting want and lock
has to be atomic.

kvm_pv_wait_for_kick_op() is incorrect in other ways. It will spuriously
return to the guest, since not all events that wake up the vcpu thread
correspond to work for the guest to do.

> Only question is how to retry immediately with lock_spinning in
> w->lock=null cases.
> 
> /me need to experiment that again perhaps to see if we get some benefit.
> 
> >>
> >>So I am,
> >>1. trying to artificially reproduce this.
> >>
> >>2. I replaced the halt with below code,
> >>if (arch_irqs_disabled())
> >> halt();
> >>
> >>and ran benchmarks.
> >>But this results in degradation because, it means we again go back
> >>and spin in irq enabled case.
> >>
> >Yes, this is not what I proposed.
> 
> True.
> 
> >
> >>3. Now I am analyzing the performance overhead of safe_halt in irq
> >>enabled case.
> >>   if (arch_irqs_disabled())
> >>halt();
> >>   else
> >>safe_halt();
> >Use of arch_irqs_disabled() is incorrect here.
> 
> Oops! sill me.
> 
> If you are doing it before
> >local_irq_restore() it will always be false since you disabled interrupt
> >yourself,
> 
> This was not the case. but latter is the one I missed.
> 
>  if you do it after then it is to late since interrupt can come
> >between local_irq_restore() and halt() so enabling interrupt and halt
> >are still not atomic.  You should drop local_irq_restore() and do
> >
> >   if (arch_irqs_disabled_flags(flags))
> > halt();
> >   else
> > safe_halt();
> >
> >instead.
> >
> 
> Yes, I tested with below as suggested:
> 
> //local_irq_restore(flags);
> 
> /* halt until it's our turn and kicked. */
> if (arch_irqs_disabled_flags(flags))
> halt();
> else
> safe_halt();
> 
>  //local_irq_save(flags);
> I am seeing only a slight overhead, but want to give a full run to
> check the performance.
Without compiling and checking myself, the difference between the previous
code and this one should be a couple of asm instructions. I would be
surprised if you could measure it, especially as the vcpu is going to halt
(and do an expensive vmexit in the process) anyway.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-16 Thread Gleb Natapov
On Tue, Jul 16, 2013 at 05:48:52PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 16, 2013 at 09:02:15AM +0300, Gleb Natapov wrote:
> > BTW can NMI handler take spinlocks? 
> 
> No -- that is, yes you can using trylock, but you still shouldn't.
> 
Great news for this code. Thanks.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-15 Thread Gleb Natapov
On Tue, Jul 16, 2013 at 09:07:53AM +0530, Raghavendra K T wrote:
> On 07/15/2013 04:06 PM, Gleb Natapov wrote:
> >On Mon, Jul 15, 2013 at 03:20:06PM +0530, Raghavendra K T wrote:
> >>On 07/14/2013 06:42 PM, Gleb Natapov wrote:
> >>>On Mon, Jun 24, 2013 at 06:13:42PM +0530, Raghavendra K T wrote:
> >>>>kvm : Paravirtual ticketlocks support for linux guests running on KVM 
> >>>>hypervisor
> >>>>
> >>>>From: Srivatsa Vaddagiri 
> >>>>
> >>trimming
> >>[...]
> >>>>+
> >>>>+static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t 
> >>>>want)
> >>>>+{
> >>>>+ struct kvm_lock_waiting *w;
> >>>>+ int cpu;
> >>>>+ u64 start;
> >>>>+ unsigned long flags;
> >>>>+
> >>>>+ w = &__get_cpu_var(lock_waiting);
> >>>>+ cpu = smp_processor_id();
> >>>>+ start = spin_time_start();
> >>>>+
> >>>>+ /*
> >>>>+  * Make sure an interrupt handler can't upset things in a
> >>>>+  * partially setup state.
> >>>>+  */
> >>>>+ local_irq_save(flags);
> >>>>+
> >>>>+ /*
> >>>>+  * The ordering protocol on this is that the "lock" pointer
> >>>>+  * may only be set non-NULL if the "want" ticket is correct.
> >>>>+  * If we're updating "want", we must first clear "lock".
> >>>>+  */
> >>>>+ w->lock = NULL;
> >>>>+ smp_wmb();
> >>>>+ w->want = want;
> >>>>+ smp_wmb();
> >>>>+ w->lock = lock;
> >>>>+
> >>>>+ add_stats(TAKEN_SLOW, 1);
> >>>>+
> >>>>+ /*
> >>>>+  * This uses set_bit, which is atomic but we should not rely on its
> >>>>+  * reordering gurantees. So barrier is needed after this call.
> >>>>+  */
> >>>>+ cpumask_set_cpu(cpu, &waiting_cpus);
> >>>>+
> >>>>+ barrier();
> >>>>+
> >>>>+ /*
> >>>>+  * Mark entry to slowpath before doing the pickup test to make
> >>>>+  * sure we don't deadlock with an unlocker.
> >>>>+  */
> >>>>+ __ticket_enter_slowpath(lock);
> >>>>+
> >>>>+ /*
> >>>>+  * check again make sure it didn't become free while
> >>>>+  * we weren't looking.
> >>>>+  */
> >>>>+ if (ACCESS_ONCE(lock->tickets.head) == want) {
> >>>>+ add_stats(TAKEN_SLOW_PICKUP, 1);
> >>>>+ goto out;
> >>>>+ }
> >>>>+
> >>>>+ /* Allow interrupts while blocked */
> >>>>+ local_irq_restore(flags);
> >>>>+
> >>>So what happens if an interrupt comes here and an interrupt handler
> >>>takes another spinlock that goes into the slow path? As far as I see
> >>>lock_waiting will become overwritten and cpu will be cleared from
> >>>waiting_cpus bitmap by nested kvm_lock_spinning(), so when halt is
> >>>called here after returning from the interrupt handler nobody is going
> >>>to wake this lock holder. Next random interrupt will "fix" it, but it
> >>>may be several milliseconds away, or never. We should probably check
> >>>if interrupt were enabled and call native_safe_halt() here.
> >>>
> >>
> >>Okay you mean something like below should be done.
> >>if irq_enabled()
> >>   native_safe_halt()
> >>else
> >>   halt()
> >>
> >>It is been a complex stuff for analysis for me.
> >>
> >>So in our discussion stack would looking like this.
> >>
> >>spinlock()
> >>   kvm_lock_spinning()
> >>   <-- interrupt here
> >>   halt()
> >>
> >>
> >> From the halt if we trace
> >>
> >It is to early to trace the halt since it was not executed yet. Guest
> >stack trace will look something like this:
> >
> >spinlock(a)
> >   kvm_lock_spinning(a)
> >lock_waiting = a
> >set bit in waiting_cpus
> > <-- interrupt here
> > spinlock(b)
> >   kvm_lock_spinning(b)
> > lock_waiting = b
> > set bit in waiting_c

Re: [PATCH RFC V10 16/18] kvm hypervisor : Simplify kvm_for_each_vcpu with kvm_irq_delivery_to_apic

2013-07-15 Thread Gleb Natapov
On Mon, Jul 15, 2013 at 09:06:13PM +0530, Raghavendra K T wrote:
> On 07/14/2013 06:54 PM, Gleb Natapov wrote:
> >On Mon, Jun 24, 2013 at 06:13:53PM +0530, Raghavendra K T wrote:
> >>Simplify kvm_for_each_vcpu with kvm_irq_delivery_to_apic
> >>
> >>From: Raghavendra K T 
> >>
> >>Note that we are using APIC_DM_REMRD which has reserved usage.
> >>In future if APIC_DM_REMRD usage is standardized, then we should
> >>find some other way or go back to old method.
> >>
> >>Suggested-by: Gleb Natapov 
> >>Signed-off-by: Raghavendra K T 
> >>---
> >>  arch/x86/kvm/lapic.c |5 -
> >>  arch/x86/kvm/x86.c   |   25 ++---
> >>  2 files changed, 10 insertions(+), 20 deletions(-)
> >>
> >>diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >>index e1adbb4..3f5f82e 100644
> >>--- a/arch/x86/kvm/lapic.c
> >>+++ b/arch/x86/kvm/lapic.c
> >>@@ -706,7 +706,10 @@ out:
> >>break;
> >>
> >>case APIC_DM_REMRD:
> >>-   apic_debug("Ignoring delivery mode 3\n");
> >>+   result = 1;
> >>+   vcpu->arch.pv.pv_unhalted = 1;
> >>+   kvm_make_request(KVM_REQ_EVENT, vcpu);
> >>+   kvm_vcpu_kick(vcpu);
> >>break;
> >>
> >>case APIC_DM_SMI:
> >>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>index 92a9932..b963c86 100644
> >>--- a/arch/x86/kvm/x86.c
> >>+++ b/arch/x86/kvm/x86.c
> >>@@ -5456,27 +5456,14 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
> >>   */
> >>  static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
> >>  {
> >>-   struct kvm_vcpu *vcpu = NULL;
> >>-   int i;
> >>+   struct kvm_lapic_irq lapic_irq;
> >>
> >>-   kvm_for_each_vcpu(i, vcpu, kvm) {
> >>-   if (!kvm_apic_present(vcpu))
> >>-   continue;
> >>+   lapic_irq.shorthand = 0;
> >>+   lapic_irq.dest_mode = 0;
> >>+   lapic_irq.dest_id = apicid;
> >>
> >>-   if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0))
> >>-   break;
> >>-   }
> >>-   if (vcpu) {
> >>-   /*
> >>-* Setting unhalt flag here can result in spurious runnable
> >>-* state when unhalt reset does not happen in vcpu_block.
> >>-* But that is harmless since that should soon result in halt.
> >>-*/
> >>-   vcpu->arch.pv.pv_unhalted = true;
> >>-   /* We need everybody see unhalt before vcpu unblocks */
> >>-   smp_wmb();
> >>-   kvm_vcpu_kick(vcpu);
> >>-   }
> >>+   lapic_irq.delivery_mode = APIC_DM_REMRD;
> >Need to make sure that delivery_mode cannot be set to APIC_DM_REMRD
> >from MSI/IOAPIC/IPI path.
> 
> I Gleb,
> I need your help here since I do not know much about apic.
> 
> so are you saying explicitly checking that, kvm_set_msi_irq,
> apic_send_ipi, native_setup_ioapic_entry, does not have
> APIC_DM_REMRD as delivery_mode set?
> 
Yes, but on second thought, what bad can happen if we do not check it?
If a guest configures an irq to inject an APIC_DM_REMRD interrupt, it may
needlessly wake up a sleeping vcpu, which will then try to acquire the
spinlock again, so it is only a performance problem, not a correctness
problem. If that is the case, let's leave it as is for now.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-15 Thread Gleb Natapov
On Mon, Jul 15, 2013 at 03:20:06PM +0530, Raghavendra K T wrote:
> On 07/14/2013 06:42 PM, Gleb Natapov wrote:
> >On Mon, Jun 24, 2013 at 06:13:42PM +0530, Raghavendra K T wrote:
> >>kvm : Paravirtual ticketlocks support for linux guests running on KVM 
> >>hypervisor
> >>
> >>From: Srivatsa Vaddagiri 
> >>
> trimming
> [...]
> >>+
> >>+static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
> >>+{
> >>+   struct kvm_lock_waiting *w;
> >>+   int cpu;
> >>+   u64 start;
> >>+   unsigned long flags;
> >>+
> >>+   w = &__get_cpu_var(lock_waiting);
> >>+   cpu = smp_processor_id();
> >>+   start = spin_time_start();
> >>+
> >>+   /*
> >>+* Make sure an interrupt handler can't upset things in a
> >>+* partially setup state.
> >>+*/
> >>+   local_irq_save(flags);
> >>+
> >>+   /*
> >>+* The ordering protocol on this is that the "lock" pointer
> >>+* may only be set non-NULL if the "want" ticket is correct.
> >>+* If we're updating "want", we must first clear "lock".
> >>+*/
> >>+   w->lock = NULL;
> >>+   smp_wmb();
> >>+   w->want = want;
> >>+   smp_wmb();
> >>+   w->lock = lock;
> >>+
> >>+   add_stats(TAKEN_SLOW, 1);
> >>+
> >>+   /*
> >>+* This uses set_bit, which is atomic but we should not rely on its
> >>+* reordering gurantees. So barrier is needed after this call.
> >>+*/
> >>+   cpumask_set_cpu(cpu, &waiting_cpus);
> >>+
> >>+   barrier();
> >>+
> >>+   /*
> >>+* Mark entry to slowpath before doing the pickup test to make
> >>+* sure we don't deadlock with an unlocker.
> >>+*/
> >>+   __ticket_enter_slowpath(lock);
> >>+
> >>+   /*
> >>+* check again make sure it didn't become free while
> >>+* we weren't looking.
> >>+*/
> >>+   if (ACCESS_ONCE(lock->tickets.head) == want) {
> >>+   add_stats(TAKEN_SLOW_PICKUP, 1);
> >>+   goto out;
> >>+   }
> >>+
> >>+   /* Allow interrupts while blocked */
> >>+   local_irq_restore(flags);
> >>+
> >So what happens if an interrupt comes here and an interrupt handler
> >takes another spinlock that goes into the slow path? As far as I see
> >lock_waiting will become overwritten and cpu will be cleared from
> >waiting_cpus bitmap by nested kvm_lock_spinning(), so when halt is
> >called here after returning from the interrupt handler nobody is going
> >to wake this lock holder. Next random interrupt will "fix" it, but it
> >may be several milliseconds away, or never. We should probably check
> >if interrupt were enabled and call native_safe_halt() here.
> >
> 
> Okay you mean something like below should be done.
> if irq_enabled()
>   native_safe_halt()
> else
>   halt()
> 
> It is been a complex stuff for analysis for me.
> 
> So in our discussion stack would looking like this.
> 
> spinlock()
>   kvm_lock_spinning()
>   <-- interrupt here
>   halt()
> 
> 
> From the halt if we trace
> 
It is too early to trace the halt since it has not been executed yet. The
guest stack trace will look something like this:

spinlock(a)
  kvm_lock_spinning(a)
   lock_waiting = a
   set bit in waiting_cpus
<-- interrupt here
spinlock(b)
  kvm_lock_spinning(b)
lock_waiting = b
set bit in waiting_cpus
halt()
unset bit in waiting_cpus
lock_waiting = NULL
 --> ret from interrupt
   halt()

Now, at the time of the last halt above, lock_waiting == NULL, waiting_cpus
is empty, and no interrupt is pending, so who will unhalt the waiter?
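
For reference, a minimal sketch of the fix the later patch revision uses (the
same halt()/safe_halt() selection quoted earlier in this archive): keep
interrupts disabled across the pickup test and pick the halt flavour from the
saved flags, so an interrupt can no longer slip in between re-enabling IRQs
and halting:

	/* 'flags' were saved by local_irq_save() at slowpath entry. */
	if (arch_irqs_disabled_flags(flags))
		halt();		/* caller had IRQs off; the kick wakes us */
	else
		safe_halt();	/* atomically enable IRQs and halt */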

>   halt()
> kvm_vcpu_block()
>kvm_arch_vcpu_runnable())
>kvm_make_request(KVM_REQ_UNHALT)
>   
> This would drive us out of halt handler, and we are fine when it
> happens since we would revisit kvm_lock_spinning.
> 
> But I see that
> 
> kvm_arch_vcpu_runnable() has this condition
>  (kvm_arch_interrupt_allowed(vcpu) &&  kvm_cpu_has_interrupt(vcpu));
> 
> which means that if we going process the interrupt here we would set
> KVM_REQ_UNHALT. and we are fine.
> 
> But if we are in the situation kvm_arch_interrupt_allowe

Re: [PATCH RFC V10 18/18] kvm hypervisor: Add directed yield in vcpu block path

2013-07-14 Thread Gleb Natapov
On Mon, Jun 24, 2013 at 06:14:15PM +0530, Raghavendra K T wrote:
> kvm hypervisor: Add directed yield in vcpu block path
> 
> From: Raghavendra K T 
> 
> We use the improved PLE handler logic in vcpu block patch for
> scheduling rather than plain schedule, so that we can make
> intelligent decisions.
> 
What kind of improvement does this provide? Doesn't it screw up our
pause-exit heuristics?

> Signed-off-by: Raghavendra K T 
> ---
>  arch/ia64/include/asm/kvm_host.h|5 +
>  arch/powerpc/include/asm/kvm_host.h |5 +
>  arch/s390/include/asm/kvm_host.h|5 +
>  arch/x86/include/asm/kvm_host.h |2 +-
>  arch/x86/kvm/x86.c  |8 
>  include/linux/kvm_host.h|2 +-
>  virt/kvm/kvm_main.c |6 --
This misses some arches.

>  7 files changed, 29 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/ia64/include/asm/kvm_host.h 
> b/arch/ia64/include/asm/kvm_host.h
> index 989dd3f..999ab15 100644
> --- a/arch/ia64/include/asm/kvm_host.h
> +++ b/arch/ia64/include/asm/kvm_host.h
> @@ -595,6 +595,11 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu);
>  int kvm_pal_emul(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run);
>  void kvm_sal_emul(struct kvm_vcpu *vcpu);
>  
> +static inline void kvm_do_schedule(struct kvm_vcpu *vcpu)
> +{
> + schedule();
> +}
> +
>  #define __KVM_HAVE_ARCH_VM_ALLOC 1
>  struct kvm *kvm_arch_alloc_vm(void);
>  void kvm_arch_free_vm(struct kvm *kvm);
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index af326cd..1aeecc0 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -628,4 +628,9 @@ struct kvm_vcpu_arch {
>  #define __KVM_HAVE_ARCH_WQP
>  #define __KVM_HAVE_CREATE_DEVICE
>  
> +static inline void kvm_do_schedule(struct kvm_vcpu *vcpu)
> +{
> + schedule();
> +}
> +
>  #endif /* __POWERPC_KVM_HOST_H__ */
> diff --git a/arch/s390/include/asm/kvm_host.h 
> b/arch/s390/include/asm/kvm_host.h
> index 16bd5d1..db09a56 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -266,4 +266,9 @@ struct kvm_arch{
>  };
>  
>  extern int sie64a(struct kvm_s390_sie_block *, u64 *);
> +static inline void kvm_do_schedule(struct kvm_vcpu *vcpu)
> +{
> + schedule();
> +}
> +
>  #endif
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 95702de..72ff791 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1042,5 +1042,5 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info);
>  int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
>  void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
>  void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
> -
> +void kvm_do_schedule(struct kvm_vcpu *vcpu);
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b963c86..84a4eb2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7281,6 +7281,14 @@ bool kvm_arch_can_inject_async_page_present(struct 
> kvm_vcpu *vcpu)
>   kvm_x86_ops->interrupt_allowed(vcpu);
>  }
>  
> +void kvm_do_schedule(struct kvm_vcpu *vcpu)
> +{
> + /* We try to yield to a kicked vcpu else do a schedule */
> + if (kvm_vcpu_on_spin(vcpu) <= 0)
> + schedule();
> +}
> +EXPORT_SYMBOL_GPL(kvm_do_schedule);
> +
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f0eea07..39efc18 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -565,7 +565,7 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct 
> kvm_memory_slot *memslot,
>  void kvm_vcpu_block(struct kvm_vcpu *vcpu);
>  void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
>  bool kvm_vcpu_yield_to(struct kvm_vcpu *target);
> -void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
> +bool kvm_vcpu_on_spin(struct kvm_vcpu *vcpu);
>  void kvm_resched(struct kvm_vcpu *vcpu);
>  void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
>  void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 302681c..8387247 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1685,7 +1685,7 @@ void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>   if (signal_pending(current))
>   break;
>  
> - schedule();
> + kvm_do_schedule(vcpu);
>   }
>  
>   finish_wait(&vcpu->wq, &wait);
> @@ -1786,7 +1786,7 @@ bool kvm_vcpu_eligible_for_directed_yield(struct 
> kvm_vcpu *vcpu)
>  }
>  #endif
>  
> -void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> +bool kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  {
>   struct kvm *kvm = me->kvm;
>   struct kvm_vcpu *vcpu;
> @@ -1835,6 +1835,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  

Re: [PATCH RFC V10 12/18] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2013-07-14 Thread Gleb Natapov
On Mon, Jun 24, 2013 at 06:13:04PM +0530, Raghavendra K T wrote:
> kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
> 
> From: Srivatsa Vaddagiri 
> 
> kvm_hc_kick_cpu allows the calling vcpu to kick another vcpu out of halt 
> state.
> the presence of these hypercalls is indicated to guest via
> kvm_feature_pv_unhalt.
> 
> Signed-off-by: Srivatsa Vaddagiri 
> Signed-off-by: Suzuki Poulose 
> [Raghu: Apic related changes, folding pvunhalted into vcpu_runnable]
> Signed-off-by: Raghavendra K T 
> ---
>  arch/x86/include/asm/kvm_host.h  |5 +
>  arch/x86/include/uapi/asm/kvm_para.h |1 +
>  arch/x86/kvm/cpuid.c |3 ++-
>  arch/x86/kvm/x86.c   |   37 
> ++
>  include/uapi/linux/kvm_para.h|1 +
>  5 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3741c65..95702de 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -503,6 +503,11 @@ struct kvm_vcpu_arch {
>* instruction.
>*/
>   bool write_fault_to_shadow_pgtable;
> +
> + /* pv related host specific info */
> + struct {
> + bool pv_unhalted;
> + } pv;
>  };
>  
>  struct kvm_lpage_info {
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
> b/arch/x86/include/uapi/asm/kvm_para.h
> index 06fdbd9..94dc8ca 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -23,6 +23,7 @@
>  #define KVM_FEATURE_ASYNC_PF 4
>  #define KVM_FEATURE_STEAL_TIME   5
>  #define KVM_FEATURE_PV_EOI   6
> +#define KVM_FEATURE_PV_UNHALT7
>  
>  /* The last 8 bits are used to indicate how to interpret the flags field
>   * in pvclock structure. If no bits are set, all flags are ignored.
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index a20ecb5..b110fe6 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -413,7 +413,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
> u32 function,
>(1 << KVM_FEATURE_CLOCKSOURCE2) |
>(1 << KVM_FEATURE_ASYNC_PF) |
>(1 << KVM_FEATURE_PV_EOI) |
> -  (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> +  (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> +  (1 << KVM_FEATURE_PV_UNHALT);
>  
>   if (sched_info_on())
>   entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 094b5d9..f8bea30 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5449,6 +5449,36 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>   return 1;
>  }
>  
> +/*
> + * kvm_pv_kick_cpu_op:  Kick a vcpu.
> + *
> + * @apicid - apicid of vcpu to be kicked.
> + */
> +static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
> +{
> + struct kvm_vcpu *vcpu = NULL;
> + int i;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (!kvm_apic_present(vcpu))
> + continue;
> +
> + if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0))
> + break;
> + }
> + if (vcpu) {
> + /*
> +  * Setting unhalt flag here can result in spurious runnable
> +  * state when unhalt reset does not happen in vcpu_block.
> +  * But that is harmless since that should soon result in halt.
> +  */
> + vcpu->arch.pv.pv_unhalted = true;
> + /* We need everybody see unhalt before vcpu unblocks */
> + smp_wmb();
> + kvm_vcpu_kick(vcpu);
> + }
> +}
> +
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  {
>   unsigned long nr, a0, a1, a2, a3, ret;
> @@ -5482,6 +5512,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   case KVM_HC_VAPIC_POLL_IRQ:
>   ret = 0;
>   break;
> + case KVM_HC_KICK_CPU:
> + kvm_pv_kick_cpu_op(vcpu->kvm, a0);
Let's make it a hypercall with two parameters:
a0 - flags
a1 - apic id of the cpu to kick

a0 has to be zero for now. This will make it more extensible.
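
For illustration, a sketch of the guest side with that calling convention
(a0 carries the flags and must be zero for now, a1 the APIC id;
kvm_hypercall2() and x86_cpu_to_apicid already exist, the rest mirrors the
posted kick code):

	/* Kick a cpu by its apicid, reserving a0 for future flags. */
	static void kvm_kick_cpu(int cpu)
	{
		int apicid = per_cpu(x86_cpu_to_apicid, cpu);

		kvm_hypercall2(KVM_HC_KICK_CPU, 0, apicid);
	}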

> + ret = 0;
> + break;
>   default:
>   ret = -KVM_ENOSYS;
>   break;
> @@ -5909,6 +5943,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>   kvm_apic_accept_events(vcpu);
>   switch(vcpu->arch.mp_state) {
>   case KVM_MP_STATE_HALTED:
> + vcpu->arch.pv.pv_unhalted = false;
>   vcpu->arch.mp_state =
>   KVM_MP_STATE_RUNNABLE;
>   case KVM_MP_STATE_RUNNABLE:
> @@ -6729,6 +6764,7 @@ int kv

Re: [PATCH RFC V10 16/18] kvm hypervisor : Simplify kvm_for_each_vcpu with kvm_irq_delivery_to_apic

2013-07-14 Thread Gleb Natapov
On Mon, Jun 24, 2013 at 06:13:53PM +0530, Raghavendra K T wrote:
> Simplify kvm_for_each_vcpu with kvm_irq_delivery_to_apic
> 
> From: Raghavendra K T 
> 
> Note that we are using APIC_DM_REMRD which has reserved usage.
> In future if APIC_DM_REMRD usage is standardized, then we should
> find some other way or go back to old method.
> 
> Suggested-by: Gleb Natapov 
> Signed-off-by: Raghavendra K T 
> ---
>  arch/x86/kvm/lapic.c |5 -
>  arch/x86/kvm/x86.c   |   25 ++---
>  2 files changed, 10 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index e1adbb4..3f5f82e 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -706,7 +706,10 @@ out:
>   break;
>  
>   case APIC_DM_REMRD:
> - apic_debug("Ignoring delivery mode 3\n");
> + result = 1;
> + vcpu->arch.pv.pv_unhalted = 1;
> + kvm_make_request(KVM_REQ_EVENT, vcpu);
> + kvm_vcpu_kick(vcpu);
>   break;
>  
>   case APIC_DM_SMI:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 92a9932..b963c86 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5456,27 +5456,14 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>   */
>  static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
>  {
> - struct kvm_vcpu *vcpu = NULL;
> - int i;
> + struct kvm_lapic_irq lapic_irq;
>  
> - kvm_for_each_vcpu(i, vcpu, kvm) {
> - if (!kvm_apic_present(vcpu))
> - continue;
> + lapic_irq.shorthand = 0;
> + lapic_irq.dest_mode = 0;
> + lapic_irq.dest_id = apicid;
>  
> - if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0))
> - break;
> - }
> - if (vcpu) {
> - /*
> -  * Setting unhalt flag here can result in spurious runnable
> -  * state when unhalt reset does not happen in vcpu_block.
> -  * But that is harmless since that should soon result in halt.
> -  */
> - vcpu->arch.pv.pv_unhalted = true;
> - /* We need everybody see unhalt before vcpu unblocks */
> - smp_wmb();
> - kvm_vcpu_kick(vcpu);
> - }
> + lapic_irq.delivery_mode = APIC_DM_REMRD;
Need to make sure that delivery_mode cannot be set to APIC_DM_REMRD
from the MSI/IOAPIC/IPI paths.

> + kvm_irq_delivery_to_apic(kvm, 0, &lapic_irq, NULL);
>  }
>  
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V10 15/18] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2013-07-14 Thread Gleb Natapov
On Mon, Jun 24, 2013 at 06:13:42PM +0530, Raghavendra K T wrote:
> kvm : Paravirtual ticketlocks support for linux guests running on KVM 
> hypervisor
> 
> From: Srivatsa Vaddagiri 
> 
> During smp_boot_cpus  paravirtualied KVM guest detects if the hypervisor has
> required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If so,
>  support for pv-ticketlocks is registered via pv_lock_ops.
> 
> Use KVM_HC_KICK_CPU hypercall to wakeup waiting/halted vcpu.
> 
> Signed-off-by: Srivatsa Vaddagiri 
> Signed-off-by: Suzuki Poulose 
> [Raghu: check_zero race fix, enum for kvm_contention_stat
> jumplabel related changes ]
> Signed-off-by: Raghavendra K T 
> ---
>  arch/x86/include/asm/kvm_para.h |   14 ++
>  arch/x86/kernel/kvm.c   |  255 
> +++
>  2 files changed, 267 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 695399f..427afcb 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -118,10 +118,20 @@ void kvm_async_pf_task_wait(u32 token);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
>  extern void kvm_disable_steal_time(void);
> -#else
> -#define kvm_guest_init() do { } while (0)
> +
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +void __init kvm_spinlock_init(void);
> +#else /* !CONFIG_PARAVIRT_SPINLOCKS */
> +static inline void kvm_spinlock_init(void)
> +{
> +}
> +#endif /* CONFIG_PARAVIRT_SPINLOCKS */
> +
> +#else /* CONFIG_KVM_GUEST */
> +#define kvm_guest_init() do {} while (0)
>  #define kvm_async_pf_task_wait(T) do {} while(0)
>  #define kvm_async_pf_task_wake(T) do {} while(0)
> +
>  static inline u32 kvm_read_and_reset_pf_reason(void)
>  {
>   return 0;
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index cd6d9a5..97ade5a 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -419,6 +420,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
>   WARN_ON(kvm_register_clock("primary cpu clock"));
>   kvm_guest_cpu_init();
>   native_smp_prepare_boot_cpu();
> + kvm_spinlock_init();
>  }
>  
>  static void __cpuinit kvm_guest_cpu_online(void *dummy)
> @@ -523,3 +525,256 @@ static __init int activate_jump_labels(void)
>   return 0;
>  }
>  arch_initcall(activate_jump_labels);
> +
> +/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
> +void kvm_kick_cpu(int cpu)
> +{
> + int apicid;
> +
> + apicid = per_cpu(x86_cpu_to_apicid, cpu);
> + kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
> +}
> +
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> +
> +enum kvm_contention_stat {
> + TAKEN_SLOW,
> + TAKEN_SLOW_PICKUP,
> + RELEASED_SLOW,
> + RELEASED_SLOW_KICKED,
> + NR_CONTENTION_STATS
> +};
> +
> +#ifdef CONFIG_KVM_DEBUG_FS
> +#define HISTO_BUCKETS30
> +
> +static struct kvm_spinlock_stats
> +{
> + u32 contention_stats[NR_CONTENTION_STATS];
> + u32 histo_spin_blocked[HISTO_BUCKETS+1];
> + u64 time_blocked;
> +} spinlock_stats;
> +
> +static u8 zero_stats;
> +
> +static inline void check_zero(void)
> +{
> + u8 ret;
> + u8 old;
> +
> + old = ACCESS_ONCE(zero_stats);
> + if (unlikely(old)) {
> + ret = cmpxchg(&zero_stats, old, 0);
> + /* This ensures only one fellow resets the stat */
> + if (ret == old)
> + memset(&spinlock_stats, 0, sizeof(spinlock_stats));
> + }
> +}
> +
> +static inline void add_stats(enum kvm_contention_stat var, u32 val)
> +{
> + check_zero();
> + spinlock_stats.contention_stats[var] += val;
> +}
> +
> +
> +static inline u64 spin_time_start(void)
> +{
> + return sched_clock();
> +}
> +
> +static void __spin_time_accum(u64 delta, u32 *array)
> +{
> + unsigned index;
> +
> + index = ilog2(delta);
> + check_zero();
> +
> + if (index < HISTO_BUCKETS)
> + array[index]++;
> + else
> + array[HISTO_BUCKETS]++;
> +}
> +
> +static inline void spin_time_accum_blocked(u64 start)
> +{
> + u32 delta;
> +
> + delta = sched_clock() - start;
> + __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
> + spinlock_stats.time_blocked += delta;
> +}
> +
> +static struct dentry *d_spin_debug;
> +static struct dentry *d_kvm_debug;
> +
> +struct dentry *kvm_init_debugfs(void)
> +{
> + d_kvm_debug = debugfs_create_dir("kvm", NULL);
> + if (!d_kvm_debug)
> + printk(KERN_WARNING "Could not create 'kvm' debugfs 
> directory\n");
> +
> + return d_kvm_debug;
> +}
> +
> +static int __init kvm_spinlock_debugfs(void)
> +{
> + struct dentry *d_kvm;
> +
> + d_kvm = kvm_init_debugfs();
> + if (d_kvm == NULL)
> + return -ENOMEM;
> +
> + d_spin_debug = debugfs_create_dir("spinlocks", d_kvm);
> +
> + debugfs_create_u8("zero

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-11 Thread Gleb Natapov
On Thu, Jul 11, 2013 at 04:23:58PM +0530, Raghavendra K T wrote:
> On 07/11/2013 03:41 PM, Gleb Natapov wrote:
> >On Thu, Jul 11, 2013 at 03:40:38PM +0530, Raghavendra K T wrote:
> >>>>>>Gleb,
> >>>>>>Can you elaborate little more on what you have in mind regarding per
> >>>>>>VM ple_window. (maintaining part of it as a per vm variable is clear
> >>>>>>to
> >>>>>>  me), but is it that we have to load that every time of guest entry?
> >>>>>>
> >>>>>Only when it changes, shouldn't be to often no?
> >>>>
> >>>>Ok. Thinking how to do. read the register and writeback if there need
> >>>>to be a change during guest entry?
> >>>>
> >>>Why not do it like in the patch you've linked? When value changes write it
> >>>to VMCS of the current vcpu.
> >>>
> >>
> >>Yes. can be done. So the running vcpu's ple_window gets updated only
> >>after next pl-exit. right?
> >I am not sure what you mean. You cannot change vcpu's ple_window while
> >vcpu is in a guest mode.
> >
> 
> I agree with that. Both of us are on the same page.
>  What I meant is,
> suppose the per VM ple_window changes when a vcpu x of that VM  was running,
> it will get its ple_window updated during next run.
Ah, I think "per VM" is what confuses me. Why do you want to have "per
VM" ple_windows and not "per vcpu" one? With per vcpu one ple_windows
cannot change while vcpu is running.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-11 Thread Gleb Natapov
On Thu, Jul 11, 2013 at 03:40:38PM +0530, Raghavendra K T wrote:
> Gleb,
> Can you elaborate little more on what you have in mind regarding per
> VM ple_window. (maintaining part of it as a per vm variable is clear
> to
>   me), but is it that we have to load that every time of guest entry?
> 
> >>>Only when it changes, shouldn't be to often no?
> >>
> >>Ok. Thinking how to do. read the register and writeback if there need
> >>to be a change during guest entry?
> >>
> >Why not do it like in the patch you've linked? When value changes write it
> >to VMCS of the current vcpu.
> >
> 
> Yes. can be done. So the running vcpu's ple_window gets updated only
> after next pl-exit. right?
I am not sure what you mean. You cannot change a vcpu's ple_window while the
vcpu is in guest mode.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-11 Thread Gleb Natapov
On Thu, Jul 11, 2013 at 02:43:03PM +0530, Raghavendra K T wrote:
> On 07/10/2013 04:03 PM, Gleb Natapov wrote:
> [...] trimmed
> 
> >>>Yes. you are right. dynamic ple window was an attempt to solve it.
> >>>
> >>>Probelm is, reducing the SPIN_THRESHOLD is resulting in excess halt
> >>>exits in under-commits and increasing ple_window may be sometimes
> >>>counter productive as it affects other busy-wait constructs such as
> >>>flush_tlb AFAIK.
> >>>So if we could have had a dynamically changing SPIN_THRESHOLD too, that
> >>>would be nice.
> >>>
> >>
> >>Gleb, Andrew,
> >>I tested with the global ple window change (similar to what I posted
> >>here https://lkml.org/lkml/2012/11/11/14 ),
> >This does not look global. It changes PLE per vcpu.
> 
> Okay. Got it. I was thinking it would change the global value. But IIRC
>  It is changing global sysfs value and per vcpu ple_window.
> Sorry. I missed this part yesterday.
> 
Yes, it changes the sysfs value, but this does not affect already-created
vcpus.

> >
> >>But did not see good result. May be it is good to go with per VM
> >>ple_window.
> >>
> >>Gleb,
> >>Can you elaborate little more on what you have in mind regarding per
> >>VM ple_window. (maintaining part of it as a per vm variable is clear
> >>to
> >>  me), but is it that we have to load that every time of guest entry?
> >>
> >Only when it changes, shouldn't be to often no?
> 
> Ok. Thinking how to do. read the register and writeback if there need
> to be a change during guest entry?
> 
Why not do it like in the patch you've linked? When the value changes, write
it to the VMCS of the current vcpu.
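
As a rough sketch of that idea (hypothetical helper and per-vcpu field, not
part of the posted series; vmcs_write32() and the PLE_WINDOW VMCS field are
the existing VMX pieces):

	/*
	 * Cache the per-vcpu value and push it to the VMCS only when it
	 * actually changes; this runs while the vcpu is not in guest mode.
	 */
	static void vmx_update_ple_window(struct vcpu_vmx *vmx, u32 new_window)
	{
		if (vmx->ple_window == new_window)	/* hypothetical field */
			return;

		vmx->ple_window = new_window;
		vmcs_write32(PLE_WINDOW, new_window);
	}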

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 11:03:15AM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote:
> > On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
> > > 
> > > Here's an idea, trim the damn email ;-) -- not only directed at gleb.
> > > 
> > Good idea.
> > 
> > > > > Ingo, Gleb,
> > > > > 
> > > > > From the results perspective, Andrew Theurer, Vinod's test results are
> > > > > pro-pvspinlock.
> > > > > Could you please help me to know what will make it a mergeable
> > > > > candidate?.
> > > > > 
> > > > I need to spend more time reviewing it :) The problem with PV interfaces
> > > > is that they are easy to add but hard to get rid of if better solution
> > > > (HW or otherwise) appears.
> > > 
> > > How so? Just make sure the registration for the PV interface is optional; 
> > > that
> > > is, allow it to fail. A guest that fails the PV setup will either have to 
> > > try
> > > another PV interface or fall back to 'native'.
> > > 
> > We have to carry PV around for live migration purposes. PV interface
> > cannot disappear under a running guest.
> 
> Why can't it? This is the same as handling say XSAVE operations. Some hosts
> might have it - some might not. It is the job of the toolstack to make sure
> to not migrate to the hosts which don't have it. Or bound the guest to the
> lowest interface (so don't enable the PV interface if the other hosts in the
> cluster can't support this flag)?
XSAVE is a HW feature and it is not going to disappear under you after a
software upgrade. Upgrading the kernel on part of your hosts and no longer
being able to migrate to them is not something people who use live migration
expect. In practice it means that updating all hosts in a datacenter to a
newer kernel is no longer possible without rebooting VMs.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 04:54:12PM +0530, Raghavendra K T wrote:
> >>Ingo, Gleb,
> >>
> >> From the results perspective, Andrew Theurer, Vinod's test results are
> >>pro-pvspinlock.
> >>Could you please help me to know what will make it a mergeable
> >>candidate?.
> >>
> >I need to spend more time reviewing it :) The problem with PV interfaces
> >is that they are easy to add but hard to get rid of if better solution
> >(HW or otherwise) appears.
> 
> Infact Avi had acked the whole V8 series, but delayed for seeing how
> PLE improvement would affect it.
> 
I see that Ingo was happy with it too.

> The only addition from that series has been
> 1. tuning the SPIN_THRESHOLD to 32k (from 2k)
> and
> 2. the halt handler now calls vcpu_on_spin to take the advantage of
> PLE improvements. (this can also go as an independent patch into
> kvm)
> 
> The rationale for making SPIN_THERSHOLD 32k needs big explanation.
> Before PLE improvements, as you know,
> kvm undercommit scenario was very worse in ple enabled cases.
> (compared to ple disabled cases).
> pvspinlock patches behaved equally bad in undercommit. Both had
> similar reason so at the end there was no degradation w.r.t base.
> 
> The reason for bad performance in PLE case was unneeded vcpu
> iteration in ple handler resulting in high yield_to calls and double
> run queue locks.
> With pvspinlock applied, same villain role was played by excessive
> halt exits.
> 
> But after ple handler improved, we needed to throttle unnecessary halts
> in undercommit for pvspinlock to be on par with 1x result.
> 
Makes sense. I will review it ASAP. BTW, the latest version is V10, right?

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 04:58:29PM +0530, Raghavendra K T wrote:
> On 07/10/2013 04:17 PM, Gleb Natapov wrote:
> >On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
> >>On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
> >>
> >>Here's an idea, trim the damn email ;-) -- not only directed at gleb.
> >>
> >Good idea.
> >
> >>>>Ingo, Gleb,
> >>>>
> >>>> From the results perspective, Andrew Theurer, Vinod's test results are
> >>>>pro-pvspinlock.
> >>>>Could you please help me to know what will make it a mergeable
> >>>>candidate?.
> >>>>
> >>>I need to spend more time reviewing it :) The problem with PV interfaces
> >>>is that they are easy to add but hard to get rid of if better solution
> >>>(HW or otherwise) appears.
> >>
> >>How so? Just make sure the registration for the PV interface is optional; 
> >>that
> >>is, allow it to fail. A guest that fails the PV setup will either have to 
> >>try
> >>another PV interface or fall back to 'native'.
> >>
> >We have to carry PV around for live migration purposes. PV interface
> >cannot disappear under a running guest.
> >
> 
> IIRC, The only requirement was running state of the vcpu to be retained.
> This was addressed by
> [PATCH RFC V10 13/18] kvm : Fold pv_unhalt flag into GET_MP_STATE
> ioctl to aid migration.
> 
> I would have to know more if I missed something here.
> 
I was not talking about the state that has to be migrated, but about the
HV<->guest interface that has to be preserved after migration.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
> 
> Here's an idea, trim the damn email ;-) -- not only directed at gleb.
> 
Good idea.

> > > Ingo, Gleb,
> > > 
> > > From the results perspective, Andrew Theurer, Vinod's test results are
> > > pro-pvspinlock.
> > > Could you please help me to know what will make it a mergeable
> > > candidate?.
> > > 
> > I need to spend more time reviewing it :) The problem with PV interfaces
> > is that they are easy to add but hard to get rid of if better solution
> > (HW or otherwise) appears.
> 
> How so? Just make sure the registration for the PV interface is optional; that
> is, allow it to fail. A guest that fails the PV setup will either have to try
> another PV interface or fall back to 'native'.
> 
We have to carry PV around for live migration purposes. PV interface
cannot disappear under a running guest.

> > > I agree that Jiannan's Preemptable Lock idea is promising and we could
> > > evaluate that  approach, and make the best one get into kernel and also
> > > will carry on discussion with Jiannan to improve that patch.
> > That would be great. The work is stalled from what I can tell.
> 
> I absolutely hated that stuff because it wrecked the native code.
Yes, the idea was to hide it from native code behind PV hooks.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Tue, Jul 09, 2013 at 02:41:30PM +0530, Raghavendra K T wrote:
> On 06/26/2013 11:24 PM, Raghavendra K T wrote:
> >On 06/26/2013 09:41 PM, Gleb Natapov wrote:
> >>On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote:
> >>>On 06/26/2013 06:22 PM, Gleb Natapov wrote:
> >>>>On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
> >>>>>On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
> >>>>>>On 06/25/2013 08:20 PM, Andrew Theurer wrote:
> >>>>>>>On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
> >>>>>>>>This series replaces the existing paravirtualized spinlock
> >>>>>>>>mechanism
> >>>>>>>>with a paravirtualized ticketlock mechanism. The series provides
> >>>>>>>>implementation for both Xen and KVM.
> >>>>>>>>
> >>>>>>>>Changes in V9:
> >>>>>>>>- Changed spin_threshold to 32k to avoid excess halt exits that are
> >>>>>>>>causing undercommit degradation (after PLE handler
> >>>>>>>>improvement).
> >>>>>>>>- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
> >>>>>>>>- Optimized halt exit path to use PLE handler
> >>>>>>>>
> >>>>>>>>V8 of PVspinlock was posted last year. After Avi's suggestions
> >>>>>>>>to look
> >>>>>>>>at PLE handler's improvements, various optimizations in PLE
> >>>>>>>>handling
> >>>>>>>>have been tried.
> >>>>>>>
> >>>>>>>Sorry for not posting this sooner.  I have tested the v9
> >>>>>>>pv-ticketlock
> >>>>>>>patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I
> >>>>>>>have
> >>>>>>>tested these patches with and without PLE, as PLE is still not
> >>>>>>>scalable
> >>>>>>>with large VMs.
> >>>>>>>
> >>>>>>
> >>>>>>Hi Andrew,
> >>>>>>
> >>>>>>Thanks for testing.
> >>>>>>
> >>>>>>>System: x3850X5, 40 cores, 80 threads
> >>>>>>>
> >>>>>>>
> >>>>>>>1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
> >>>>>>>--
> >>>>>>>Configuration            Total Throughput(MB/s)   Notes
> >>>>>>>
> >>>>>>>3.10-default-ple_on      22945    5% CPU in host kernel, 2% spin_lock in guests
> >>>>>>>3.10-default-ple_off     23184    5% CPU in host kernel, 2% spin_lock in guests
> >>>>>>>3.10-pvticket-ple_on     22895    5% CPU in host kernel, 2% spin_lock in guests
> >>>>>>>3.10-pvticket-ple_off    23051    5% CPU in host kernel, 2% spin_lock in guests
> >>>>>>>[all 1x results look good here]
> >>>>>>
> >>>>>>Yes. The 1x results look too close
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
> >>>>>>>---
> >>>>>>>Configuration            Total Throughput   Notes
> >>>>>>>
> >>>>>>>3.10-default-ple_on      6287     55% CPU in host kernel, 17% spin_lock in guests
> >>>>>>>3.10-default-ple_off     1849     2% CPU in host kernel, 95% spin_lock in guests
> >>>>>>>3.10-pvticket-ple_on     6691     50% CPU in host kernel, 15% spin_lock in guests
> >>>>>>>3.10-pvticket-ple_off    16464    8% CPU in host kernel, 33% spin_lock in guests

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Gleb Natapov
On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote:
> On 06/26/2013 06:22 PM, Gleb Natapov wrote:
> >On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
> >>On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
> >>>On 06/25/2013 08:20 PM, Andrew Theurer wrote:
> >>>>On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
> >>>>>This series replaces the existing paravirtualized spinlock mechanism
> >>>>>with a paravirtualized ticketlock mechanism. The series provides
> >>>>>implementation for both Xen and KVM.
> >>>>>
> >>>>>Changes in V9:
> >>>>>- Changed spin_threshold to 32k to avoid excess halt exits that are
> >>>>>causing undercommit degradation (after PLE handler improvement).
> >>>>>- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
> >>>>>- Optimized halt exit path to use PLE handler
> >>>>>
> >>>>>V8 of PVspinlock was posted last year. After Avi's suggestions to look
> >>>>>at PLE handler's improvements, various optimizations in PLE handling
> >>>>>have been tried.
> >>>>
> >>>>Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
> >>>>patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
> >>>>tested these patches with and without PLE, as PLE is still not scalable
> >>>>with large VMs.
> >>>>
> >>>
> >>>Hi Andrew,
> >>>
> >>>Thanks for testing.
> >>>
> >>>>System: x3850X5, 40 cores, 80 threads
> >>>>
> >>>>
> >>>>1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
> >>>>--
> >>>>Configuration          Total Throughput(MB/s)   Notes
> >>>>
> >>>>3.10-default-ple_on    22945    5% CPU in host kernel, 2% spin_lock in guests
> >>>>3.10-default-ple_off   23184    5% CPU in host kernel, 2% spin_lock in guests
> >>>>3.10-pvticket-ple_on   22895    5% CPU in host kernel, 2% spin_lock in guests
> >>>>3.10-pvticket-ple_off  23051    5% CPU in host kernel, 2% spin_lock in guests
> >>>>[all 1x results look good here]
> >>>
> >>>Yes. The 1x results look too close
> >>>
> >>>>
> >>>>
> >>>>2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
> >>>>---
> >>>>  Total
> >>>>Configuration Throughput  Notes
> >>>>
> >>>>3.10-default-ple_on6287   55% CPU 
> >>>> host kernel, 17% spin_lock in guests
> >>>>3.10-default-ple_off   1849   2% CPU 
> >>>>in host kernel, 95% spin_lock in guests
> >>>>3.10-pvticket-ple_on   6691   50% CPU 
> >>>>in host kernel, 15% spin_lock in guests
> >>>>3.10-pvticket-ple_off 16464   8% CPU 
> >>>>in host kernel, 33% spin_lock in guests
> >>>
> >>>I see 6.426% improvement with ple_on
> >>>and 161.87% improvement with ple_off. I think this is a very good sign
> >>>  for the patches
> >>>
> >>>>[PLE hinders pv-ticket improvements, but even with PLE off,
> >>>>  we are still off from ideal throughput (somewhere >2)]
> >>>>
> >>>
> >>>Okay, the ideal throughput you are referring to is getting around at least
> >>>80% of 1x throughput for over-commit. Yes we are still far away from
> >>>there.
> >>>
> >>>>
> >>>>1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
> >>>>--
> >>>>  Total
> >>>>Configuration Throughput   

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Gleb Natapov
On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
> On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
> > On 06/25/2013 08:20 PM, Andrew Theurer wrote:
> > >On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
> > >>This series replaces the existing paravirtualized spinlock mechanism
> > >>with a paravirtualized ticketlock mechanism. The series provides
> > >>implementation for both Xen and KVM.
> > >>
> > >>Changes in V9:
> > >>- Changed spin_threshold to 32k to avoid excess halt exits that are
> > >>causing undercommit degradation (after PLE handler improvement).
> > >>- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
> > >>- Optimized halt exit path to use PLE handler
> > >>
> > >>V8 of PVspinlock was posted last year. After Avi's suggestions to look
> > >>at PLE handler's improvements, various optimizations in PLE handling
> > >>have been tried.
> > >
> > >Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
> > >patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
> > >tested these patches with and without PLE, as PLE is still not scalable
> > >with large VMs.
> > >
> > 
> > Hi Andrew,
> > 
> > Thanks for testing.
> > 
> > >System: x3850X5, 40 cores, 80 threads
> > >
> > >
> > >1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
> > >--
> > >   Total
> > >Configuration  Throughput(MB/s)Notes
> > >
> > >3.10-default-ple_on22945   5% CPU 
> > >in host kernel, 2% spin_lock in guests
> > >3.10-default-ple_off   23184   5% CPU 
> > >in host kernel, 2% spin_lock in guests
> > >3.10-pvticket-ple_on   22895   5% CPU 
> > >in host kernel, 2% spin_lock in guests
> > >3.10-pvticket-ple_off  23051   5% CPU 
> > >in host kernel, 2% spin_lock in guests
> > >[all 1x results look good here]
> > 
> > Yes. The 1x results look too close
> > 
> > >
> > >
> > >2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
> > >---
> > >   Total
> > >Configuration  Throughput  Notes
> > >
> > >3.10-default-ple_on 6287   55% CPU 
> > > host kernel, 17% spin_lock in guests
> > >3.10-default-ple_off1849   2% CPU 
> > >in host kernel, 95% spin_lock in guests
> > >3.10-pvticket-ple_on6691   50% CPU 
> > >in host kernel, 15% spin_lock in guests
> > >3.10-pvticket-ple_off  16464   8% CPU 
> > >in host kernel, 33% spin_lock in guests
> > 
> > I see 6.426% improvement with ple_on
> > and 161.87% improvement with ple_off. I think this is a very good sign
> >  for the patches
> > 
> > >[PLE hinders pv-ticket improvements, but even with PLE off,
> > >  we are still off from ideal throughput (somewhere >2)]
> > >
> > 
> > Okay, the ideal throughput you are referring to is getting around at least
> > 80% of 1x throughput for over-commit. Yes we are still far away from
> > there.
> > 
> > >
> > >1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
> > >--
> > >   Total
> > >Configuration  Throughput  Notes
> > >
> > >3.10-default-ple_on22736   6% CPU 
> > >in host kernel, 3% spin_lock in guests
> > >3.10-default-ple_off   23377   5% CPU 
> > >in host kernel, 3% spin_lock in guests
> > >3.10-pvticket-ple_on   22471   6% CPU 
> > >in host kernel, 3% spin_lock in guests
> > >3.10-pvticket-ple_off  23445   5% CPU 
> > >in host kernel, 3% spin_lock in guests
> > >[1x looking fine here]
> > >
> > 
> > I see ple_off is little better here.
> > 
> > >
> > >2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
> > >--
> > >   Total
> > >Configuration  Throughput  Notes
> > >
> > >3.10-default-ple_on 1965   70% CPU 
> > >in host kernel, 34% spin_lock in guests 
> > >3.10-default-ple_off 226   2% CPU 
> > >in host kernel, 94% spin_lock in guests
> > >3.10-pvticket-ple_on1942   70% CPU 
> > >in host kernel, 35% spin_lock in guests
> > >3.10-pvticket-ple_off   8003   11% CPU 
> > >in host kernel, 70% spin_lock in guests
> > >[qu

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-02 Thread Gleb Natapov
On Sun, Jun 02, 2013 at 12:51:25AM +0530, Raghavendra K T wrote:
> 
> This series replaces the existing paravirtualized spinlock mechanism
> with a paravirtualized ticketlock mechanism. The series provides
> implementation for both Xen and KVM.
> 
High level question here. We have high hopes that the "Preemptable Ticket
Spinlock" patch series by Jiannan Ouyang will solve most, if not all, of the
ticket spinlock problems in overcommit scenarios without the need for PV.
So how does this patch series compare with his patches on PLE-enabled processors?

> Changes in V9:
> - Changed spin_threshold to 32k to avoid excess halt exits that are
>causing undercommit degradation (after PLE handler improvement).
> - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
> - Optimized halt exit path to use PLE handler
> 
> V8 of PVspinlock was posted last year. After Avi's suggestions to look
> at PLE handler's improvements, various optimizations in PLE handling
> have been tried.
> 
> With this series we see that we could get little more improvements on top
> of that. 
> 
> Ticket locks have an inherent problem in a virtualized case, because
> the vCPUs are scheduled rather than running concurrently (ignoring
> gang scheduled vCPUs).  This can result in catastrophic performance
> collapses when the vCPU scheduler doesn't schedule the correct "next"
> vCPU, and ends up scheduling a vCPU which burns its entire timeslice
> spinning.  (Note that this is not the same problem as lock-holder
> preemption, which this series also addresses; that's also a problem,
> but not catastrophic).
> 
> (See Thomas Friebel's talk "Prevent Guests from Spinning Around"
> http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)
> 
> Currently we deal with this by having PV spinlocks, which adds a layer
> of indirection in front of all the spinlock functions, and defining a
> completely new implementation for Xen (and for other pvops users, but
> there are none at present).
> 
> PV ticketlocks keeps the existing ticketlock implementation
> (fastpath) as-is, but adds a couple of pvops for the slow paths:
> 
> - If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
>   iterations, then call out to the __ticket_lock_spinning() pvop,
>   which allows a backend to block the vCPU rather than spinning.  This
>   pvop can set the lock into "slowpath state".
> 
> - When releasing a lock, if it is in "slowpath state", the call
>   __ticket_unlock_kick() to kick the next vCPU in line awake.  If the
>   lock is no longer in contention, it also clears the slowpath flag.
> 
> The "slowpath state" is stored in the LSB of the within the lock tail
> ticket.  This has the effect of reducing the max number of CPUs by
> half (so, a "small ticket" can deal with 128 CPUs, and "large ticket"
> 32768).
> 
> For KVM, one hypercall is introduced in the hypervisor that allows a vcpu to kick
> another vcpu out of halt state.
> The blocking of vcpu is done using halt() in (lock_spinning) slowpath.
> 
> Overall, it results in a large reduction in code, it makes the native
> and virtualized cases closer, and it removes a layer of indirection
> around all the spinlock functions.
> 
> The fast path (taking an uncontended lock which isn't in "slowpath"
> state) is optimal, identical to the non-paravirtualized case.
> 
> The inner part of ticket lock code becomes:
>   inc = xadd(&lock->tickets, inc);
>   inc.tail &= ~TICKET_SLOWPATH_FLAG;
> 
>   if (likely(inc.head == inc.tail))
>   goto out;
>   for (;;) {
>   unsigned count = SPIN_THRESHOLD;
>   do {
>   if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
>   goto out;
>   cpu_relax();
>   } while (--count);
>   __ticket_lock_spinning(lock, inc.tail);
>   }
> out:  barrier();
> which results in:
>   push   %rbp
>   mov%rsp,%rbp
> 
>   mov$0x200,%eax
>   lock xadd %ax,(%rdi)
>   movzbl %ah,%edx
>   cmp%al,%dl
>   jne1f   # Slowpath if lock in contention
> 
>   pop%rbp
>   retq   
> 
>   ### SLOWPATH START
> 1:and$-2,%edx
>   movzbl %dl,%esi
> 
> 2:mov$0x800,%eax
>   jmp4f
> 
> 3:pause  
>   sub$0x1,%eax
>   je 5f
> 
> 4:movzbl (%rdi),%ecx
>   cmp%cl,%dl
>   jne3b
> 
>   pop%rbp
>   retq   
> 
> 5:callq  *__ticket_lock_spinning
>   jmp2b
>   ### SLOWPATH END
> 
> with CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where
> the fastpath case is straight through (taking the lock without
> contention), and the spin loop is out of line:
> 
>   push   %rbp
>   mov%rsp,%rbp
> 
>   mov$0x100,%eax
>   lock xadd %ax,(%rdi)
>   movzbl %ah,%edx
>   cmp%al,%dl
>   jne1f
> 
>   pop%rbp
>   retq   
> 
>   ### SLOWPATH START
> 1:pause  
>   movzbl (%rdi),%eax
>
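
The unlock side described in the cover letter above can be illustrated with a
minimal, self-contained C sketch. Only the idea of TICKET_SLOWPATH_FLAG in the
tail LSB and the __ticket_unlock_kick() name come from the series quoted here;
the struct, the increment constant and the kick body are stand-ins, and the
real code also clears the flag again once the lock is no longer contended:

    #include <stdio.h>

    #define TICKET_SLOWPATH_FLAG 0x1   /* LSB of the tail ticket, per the cover letter */
    #define TICKET_LOCK_INC      0x2   /* tickets advance by 2 since the LSB is reserved */

    struct ticket_lock {
            unsigned char head;        /* ticket currently being served */
            unsigned char tail;        /* next ticket to hand out, LSB = slowpath flag */
    };

    static void __ticket_unlock_kick(struct ticket_lock *lock, unsigned char next)
    {
            /* in the real series this is a pvop that wakes the halted vCPU */
            printf("kick waiter holding ticket %u\n", (unsigned)next);
    }

    static void ticket_unlock_sketch(struct ticket_lock *lock)
    {
            lock->head += TICKET_LOCK_INC;            /* serve the next ticket */
            if (lock->tail & TICKET_SLOWPATH_FLAG)    /* somebody halted in the slow path */
                    __ticket_unlock_kick(lock, lock->head);
    }

    int main(void)
    {
            /* holder has ticket 0, one sleeping waiter took ticket 2 */
            struct ticket_lock lock = { .head = 0, .tail = 4 | TICKET_SLOWPATH_FLAG };

            ticket_unlock_sketch(&lock);              /* prints: kick waiter holding ticket 2 */
            return 0;
    }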

Re: [PATCH v2] KVM: Fix kvm_irqfd_init initialization

2013-05-08 Thread Gleb Natapov
On Wed, May 08, 2013 at 10:57:29AM +0800, Asias He wrote:
> In commit a0f155e96 'KVM: Initialize irqfd from kvm_init()', when
> kvm_init() is called the second time (e.g kvm-amd.ko and kvm-intel.ko),
> kvm_arch_init() will fail with -EEXIST, then kvm_irqfd_exit() will be
> called on the error handling path. This way, the kvm_irqfd system will
> not be ready.
> 
> This patch fix the following:
> 
Applied, thanks.

> BUG: unable to handle kernel NULL pointer dereference at   (null)
> IP: [] _raw_spin_lock+0xe/0x30
> PGD 0
> Oops: 0002 [#1] SMP
> Modules linked in: vhost_net
> CPU 6
> Pid: 4257, comm: qemu-system-x86 Not tainted 3.9.0-rc3+ #757 Dell Inc. 
> OptiPlex 790/0V5HMK
> RIP: 0010:[]  [] _raw_spin_lock+0xe/0x30
> RSP: 0018:880221721cc8  EFLAGS: 00010046
> RAX: 0100 RBX: 88022dcc003f RCX: 880221734950
> RDX: 8802208f6ca8 RSI: 7fff RDI: 
> RBP: 880221721cc8 R08: 0002 R09: 0002
> R10: 7f7fd01087e0 R11: 0246 R12: 8802208f6ca8
> R13: 0080 R14: 880223e2a900 R15: 
> FS:  7f7fd38488e0() GS:88022dcc() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2:  CR3: 00022309f000 CR4: 000427e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process qemu-system-x86 (pid: 4257, threadinfo 88022172, task 
> 880222bd5640)
> Stack:
>  880221721d08 810ac5c5 88022431dc00 0086
>  0080 880223e2a900 8802208f6ca8 
>  880221721d48 810ac8fe  880221734000
> Call Trace:
>  [] __queue_work+0x45/0x2d0
>  [] queue_work_on+0x8e/0xa0
>  [] queue_work+0x19/0x20
>  [] irqfd_deactivate+0x4b/0x60
>  [] kvm_irqfd+0x39d/0x580
>  [] kvm_vm_ioctl+0x207/0x5b0
>  [] ? update_curr+0xf5/0x180
>  [] do_vfs_ioctl+0x98/0x550
>  [] ? finish_task_switch+0x4e/0xe0
>  [] ? __schedule+0x2ea/0x710
>  [] sys_ioctl+0x57/0x90
>  [] ? trace_hardirqs_on_thunk+0x3a/0x3c
>  [] system_call_fastpath+0x16/0x1b
> Code: c1 ea 08 38 c2 74 0f 66 0f 1f 44 00 00 f3 90 0f b6 03 38 c2 75 f7 48 83 
> c4 08 5b c9 c3 55 48 89 e5 66 66 66 66 90 b8 00 01 00 00  66 0f c1 07 89 
> c2 66 c1 ea 08 38 c2 74 0c 0f 1f 00 f3 90 0f
> RIP  [] _raw_spin_lock+0xe/0x30
> RSP 
> CR2: 
> ---[ end trace 13fb1e4b6e5ab21f ]---
> 
> Signed-off-by: Asias He 
> ---
>  virt/kvm/kvm_main.c | 18 +-
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8fd325a..85b93d2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3078,13 +3078,21 @@ int kvm_init(void *opaque, unsigned vcpu_size, 
> unsigned vcpu_align,
>   int r;
>   int cpu;
>  
> - r = kvm_irqfd_init();
> - if (r)
> - goto out_irqfd;
>   r = kvm_arch_init(opaque);
>   if (r)
>   goto out_fail;
>  
> + /*
> +  * kvm_arch_init makes sure there's at most one caller
> +  * for architectures that support multiple implementations,
> +  * like intel and amd on x86.
> +  * kvm_arch_init must be called before kvm_irqfd_init to avoid creating
> +  * conflicts in case kvm is already setup for another implementation.
> +  */
> + r = kvm_irqfd_init();
> + if (r)
> + goto out_irqfd;
> +
>   if (!zalloc_cpumask_var(&cpus_hardware_enabled, GFP_KERNEL)) {
>   r = -ENOMEM;
>   goto out_free_0;
> @@ -3159,10 +3167,10 @@ out_free_1:
>  out_free_0a:
>   free_cpumask_var(cpus_hardware_enabled);
>  out_free_0:
> - kvm_arch_exit();
> -out_fail:
>   kvm_irqfd_exit();
>  out_irqfd:
> + kvm_arch_exit();
> +out_fail:
>   return r;
>  }
>  EXPORT_SYMBOL_GPL(kvm_init);
> -- 
> 1.8.1.4

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
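
To make the ordering issue concrete, here is a compilable toy model of the fix,
not kernel code: arch_init()/irqfd_init() stand in for kvm_arch_init() and
kvm_irqfd_init(), later failure steps of the real function are omitted, and the
only point is that the second caller must fail before any shared irqfd state is
touched, with the error path unwinding in reverse order of initialization:

    #include <stdio.h>

    static int arch_ready;    /* stands in for kvm_x86_ops being set */
    static int irqfd_ready;   /* stands in for the irqfd workqueue */

    static int arch_init(void)
    {
            if (arch_ready)
                    return -17;        /* -EEXIST: the other module got there first */
            arch_ready = 1;
            return 0;
    }

    static void arch_exit(void) { arch_ready = 0; }
    static int irqfd_init(void) { irqfd_ready = 1; return 0; }

    static int kvm_init_sketch(void)
    {
            int r;

            r = arch_init();           /* now runs first, as in the patch */
            if (r)
                    goto out_fail;     /* -EEXIST: irqfd state is left untouched */

            r = irqfd_init();
            if (r)
                    goto out_irqfd;
            return 0;

    out_irqfd:
            arch_exit();               /* unwind in reverse order of initialization */
    out_fail:
            return r;
    }

    int main(void)
    {
            kvm_init_sketch();         /* e.g. kvm-intel loads first */
            kvm_init_sketch();         /* e.g. kvm-amd then fails with -EEXIST */
            printf("irqfd still usable: %d\n", irqfd_ready);   /* prints 1 */
            return 0;
    }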


Re: [PATCH] KVM: Fix kvm_irqfd_init initialization

2013-05-07 Thread Gleb Natapov
On Tue, May 07, 2013 at 06:24:48PM +0300, Michael S. Tsirkin wrote:
> On Tue, May 07, 2013 at 06:16:09PM +0300, Gleb Natapov wrote:
> > On Tue, May 07, 2013 at 06:11:33PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, May 07, 2013 at 11:05:12PM +0800, Asias He wrote:
> > > > On Tue, May 07, 2013 at 05:59:38PM +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, May 07, 2013 at 10:54:16PM +0800, Asias He wrote:
> > > > > > In commit a0f155e96 'KVM: Initialize irqfd from kvm_init()', when
> > > > > > kvm_init() is called the second time (e.g kvm-amd.ko and 
> > > > > > kvm-intel.ko),
> > > > > > kvm_arch_init() will fail with -EEXIST,
> > > > > 
> > > > > Wow. Is this intentional?
> > > > 
> > > > I think it is. You can not be amd and intel at the same time ;-)
> > > > 
> > > > kvm_arch_init
> > > > 
> > > > if (kvm_x86_ops) {
> > > > printk(KERN_ERR "kvm: already loaded the other module\n");
> > > > r = -EEXIST;
> > > > goto out;
> > > > }   
> > > > 
> > > 
> > > Interesting. So we check it with
> > >   if (kvm_x86_ops)
> > >  and later we do
> > > kvm_x86_ops = ops;
> > > 
> > > 
> > > This looks racy - or is something serializing
> > > module loading?
> > > 
> > I think module loading is serialized.
> >
> 
> Hmm then I don't understand ... both kvm-intel
> and kvm-amd *can't* be loaded at the same time:
> module loading fails for one of them.
> 
Why would it fail?

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] KVM: Fix kvm_irqfd_init initialization

2013-05-07 Thread Gleb Natapov
On Tue, May 07, 2013 at 06:11:33PM +0300, Michael S. Tsirkin wrote:
> On Tue, May 07, 2013 at 11:05:12PM +0800, Asias He wrote:
> > On Tue, May 07, 2013 at 05:59:38PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, May 07, 2013 at 10:54:16PM +0800, Asias He wrote:
> > > > In commit a0f155e96 'KVM: Initialize irqfd from kvm_init()', when
> > > > kvm_init() is called the second time (e.g kvm-amd.ko and kvm-intel.ko),
> > > > kvm_arch_init() will fail with -EEXIST,
> > > 
> > > Wow. Is this intentional?
> > 
> > I think it is. You can not be amd and intel at the same time ;-)
> > 
> > kvm_arch_init
> > 
> > if (kvm_x86_ops) {
> > printk(KERN_ERR "kvm: already loaded the other module\n");
> > r = -EEXIST;
> > goto out;
> > }   
> > 
> 
> Interesting. So we check it with
>   if (kvm_x86_ops)
>  and later we do
> kvm_x86_ops = ops;
> 
> 
> This looks racy - or is something serializing
> module loading?
> 
I think module loading is serialized.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-07 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 04:14:57PM +0300, Gleb Natapov wrote:
> > > 
> > >>> is to move to MMIO only when PIO address space is exhausted. For PCI it
> > >>> will be never, for PCI-e it will be after ~16 devices.
> > >> 
> > >> Ok, let's go back a step here. Are you actually able to measure any 
> > >> speed in performance with this patch applied and without when going 
> > >> through MMIO kicks?
> > >> 
> > >> 
> > > That's the question for MST. I think he did only micro benchmarks till
> > > now and he already posted his result here:
> > > 
> > > mmio-wildcard-eventfd:pci-mem 3529
> > > mmio-pv-eventfd:pci-mem 1878
> > > portio-wildcard-eventfd:pci-io 1846
> > > 
> > > So the patch speedup mmio by almost 100% and it is almost the same as PIO.
> > 
> > Those numbers don't align at all with what I measured.
> I am trying to run vmexit test on AMD now, but something does not work
> there. Next week I'll fix it and see how AMD differs, but on Intel those are 
> the
> numbers.
> 
The numbers are:
vmcall 1921
inl_from_kernel 4227
outl_to_kernel 2345

outl is specifically optimized to not go through the emulator since it
is used for virtio kick. mmio-pv-eventfd is the same kind of
optimization but for mmio.
 
--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
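
For context, the virtio kick referred to above is, for a legacy virtio-pci
device, a single 16-bit port write of the queue index to the QUEUE_NOTIFY
register in the device's I/O BAR; that is the PIO exit the host can route to an
ioeventfd without running the instruction emulator. A bare sketch follows (x86
only, and it needs I/O privilege; in the guest kernel this is an iowrite16(),
with ioaddr taken from the device's BAR):

    #include <stdint.h>

    #define VIRTIO_PCI_QUEUE_NOTIFY 16   /* notify register offset in the legacy I/O BAR */

    static inline void outw(uint16_t value, uint16_t port)
    {
            __asm__ __volatile__("outw %0, %1" : : "a"(value), "Nd"(port));
    }

    /* ioaddr: base of the device's legacy I/O BAR, qidx: virtqueue index */
    void virtio_pci_kick(uint16_t ioaddr, uint16_t qidx)
    {
            /* this write is the PIO exit handled via ioeventfd on the host */
            outw(qidx, ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);
    }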


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 06:36:30PM +0300, Michael S. Tsirkin wrote:
> > processor   : 0
> > vendor_id   : AuthenticAMD
> > cpu family  : 16
> > model   : 8
> > model name  : Six-Core AMD Opteron(tm) Processor 8435
> > stepping: 0
> > cpu MHz : 800.000
> > cache size  : 512 KB
> > physical id : 0
> > siblings: 6
> > core id : 0
> > cpu cores   : 6
> > apicid  : 8
> > initial apicid  : 0
> > fpu : yes
> > fpu_exception   : yes
> > cpuid level : 5
> > wp  : yes
> > flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> > mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> > pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc 
> > extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
> > cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv 
> > svm_lock nrip_save pausefilter
> > bogomips: 5199.87
> > TLB size: 1024 4K pages
> > clflush size: 64
> > cache_alignment : 64
> > address sizes   : 48 bits physical, 48 bits virtual
> > power management: ts ttp tm stc 100mhzsteps hwpstate
> 
> Hmm, svm code seems less optimized for MMIO, but PIO
> is almost identical. Gleb says the unittest is broken
> on AMD so I'll wait until it's fixed to test.
> 
It's not that the unittest is broken, it's my environment that is broken :)

> Did you do PIO reads by chance?
> 
> > > 
> > > Or could be different software, this is on top of 3.9.0-rc5, what
> > > did you try?
> > 
> > 3.0 plus kvm-kmod of whatever was current back in autumn :).
> > 
> > > 
> > >> MST, could you please do a real world latency benchmark with virtio-net 
> > >> and
> > >> 
> > >>  * normal ioeventfd
> > >>  * mmio-pv eventfd
> > >>  * hcall eventfd
> > > 
> > > I can't do this right away, sorry.  For MMIO we are discussing the new
> > > layout on the virtio mailing list, guest and qemu need a patch for this
> > > too.  My hcall patches are stale and would have to be brought up to
> > > date.
> > > 
> > > 
> > >> to give us some idea how much performance we would gain from each 
> > >> approach? Thoughput should be completely unaffected anyway, since virtio 
> > >> just coalesces kicks internally.
> > > 
> > > Latency is dominated by the scheduling latency.
> > > This means virtio-net is not the best benchmark.
> > 
> > So what is a good benchmark?
> 
> E.g. ping pong stress will do but need to look at CPU utilization,
> that's what is affected, not latency.
> 
> > Is there any difference in speed at all? I strongly doubt it. One of 
> > virtio's main points is to reduce the number of kicks.
> 
> For this stage of the project I think microbenchmarks are more appropriate.
> Doubling the price of exit is likely to be measureable. 30 cycles likely
> not ...
> 
> > > 
> > >> I'm also slightly puzzled why the wildcard eventfd mechanism is so 
> > >> significantly slower, while it was only a few percent on my test system. 
> > >> What are the numbers you're listing above? Cycles? How many cycles do 
> > >> you execute in a second?
> > >> 
> > >> 
> > >> Alex
> > > 
> > > 
> > > It's the TSC divided by number of iterations.  kvm unittest this value, 
> > > here's
> > > what it does (removed some dead code):
> > > 
> > > #define GOAL (1ull << 30)
> > > 
> > >do {
> > >iterations *= 2;
> > >t1 = rdtsc();
> > > 
> > >for (i = 0; i < iterations; ++i)
> > >func();
> > >t2 = rdtsc();
> > >} while ((t2 - t1) < GOAL);
> > >printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
> > 
> > So it's the number of cycles per run.
> > 
> > That means translated my numbers are:
> > 
> >   MMIO: 4307
> >   PIO: 3658
> >   HCALL: 1756
> > 
> > MMIO - PIO = 649
> > 
> > which aligns roughly with your PV MMIO callback.
> > 
> > My MMIO benchmark was to poke the LAPIC version register. That does go 
> > through instruction emulation, no?
> > 
> > 
> > Alex
> 
> Why wouldn't it?
> 
Intel decodes access to apic page, but we use it only for fast eoi.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 05:36:40PM +0200, Alexander Graf wrote:
> > 
> > #define GOAL (1ull << 30)
> > 
> >do {
> >iterations *= 2;
> >t1 = rdtsc();
> > 
> >for (i = 0; i < iterations; ++i)
> >func();
> >t2 = rdtsc();
> >} while ((t2 - t1) < GOAL);
> >printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
> 
> So it's the number of cycles per run.
> 
> That means translated my numbers are:
> 
>   MMIO: 4307
>   PIO: 3658
>   HCALL: 1756
> 
> MMIO - PIO = 649
> 
> which aligns roughly with your PV MMIO callback.
> 
> My MMIO benchmark was to poke the LAPIC version register. That does go 
> through instruction emulation, no?
> 
It should, and PIO 'in' or string I/O also goes through the emulator.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:56, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >>>> 
> >>>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >>>> 
> >>>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >>>>>> 
> >>>>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>>>>> 
> >>>>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>>>>> 
> >>>>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>>>>> 
> >>>>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>>>>>> know the address from the VMCS so if the address is unique, we can 
> >>>>>>>>> look
> >>>>>>>>> up the eventfd directly, bypassing emulation.
> >>>>>>>>> 
> >>>>>>>>> Add an interface for userspace to specify this per-address, we can
> >>>>>>>>> use this e.g. for virtio.
> >>>>>>>>> 
> >>>>>>>>> The implementation adds a separate bus internally. This serves two
> >>>>>>>>> purposes:
> >>>>>>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>>>>>> - minimize disruption in other code (since we don't know the length,
> >>>>>>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>>>>>> way we don't need to touch all devices to teach them handle
> >>>>>>>>> an dinvalid length)
> >>>>>>>>> 
> >>>>>>>>> At the moment, this optimization is only supported for EPT on x86 
> >>>>>>>>> and
> >>>>>>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>>>>>> slowly.
> >>>>>>>>> 
> >>>>>>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>>>>> 
> >>>>>>>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>>>>>>>> pre-review and suggestions.
> >>>>>>>>> 
> >>>>>>>>> Signed-off-by: Michael S. Tsirkin 
> >>>>>>>> 
> >>>>>>>> This still uses page fault intercepts which are orders of magnitudes 
> >>>>>>>> slower than hypercalls. Why don't you just create a PV MMIO 
> >>>>>>>> hypercall that the guest can use to invoke MMIO accesses towards the 
> >>>>>>>> host based on physical addresses with explicit length encodings?
> >>>>>>>> 
> >>>>>>> It is slower, but not an order of magnitude slower. It become faster
> >>>>>>> with newer HW.
> >>>>>>> 
> >>>>>>>> That way you simplify and speed up all code paths, exceeding the 
> >>>>>>>> speed of PIO exits even. It should also be quite easily portable, as 
> >>>>>>>> all other platforms have hypercalls available as well.
> >>>>>>>> 
> >>>>>>> We are trying to avoid PV as much as possible (well this is also PV,
> >>>>>>> but not guest visible
> >>>>>> 
> >>>>>> Also, how is this not guest visible? Who sets 
> >>>>>> KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
> >>>>>> that the guest does so, so it is guest visible.
> >>>>>> 
> >>>>> QEMU sets it.
> >>>> 
> >>>> How does QEMU know?
> >>>> 
> >>> Knows what? When to create such eventfd? virtio device knows.
> >> 
> >> Where does it know from?
> >> 
> > It does it always.

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >>>> 
> >>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>>> 
> >>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>>> 
> >>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>>> 
> >>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>>>> know the address from the VMCS so if the address is unique, we can 
> >>>>>>> look
> >>>>>>> up the eventfd directly, bypassing emulation.
> >>>>>>> 
> >>>>>>> Add an interface for userspace to specify this per-address, we can
> >>>>>>> use this e.g. for virtio.
> >>>>>>> 
> >>>>>>> The implementation adds a separate bus internally. This serves two
> >>>>>>> purposes:
> >>>>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>>>> - minimize disruption in other code (since we don't know the length,
> >>>>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>>>> way we don't need to touch all devices to teach them handle
> >>>>>>> an dinvalid length)
> >>>>>>> 
> >>>>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>>>> slowly.
> >>>>>>> 
> >>>>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>>> 
> >>>>>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>>>>>> pre-review and suggestions.
> >>>>>>> 
> >>>>>>> Signed-off-by: Michael S. Tsirkin 
> >>>>>> 
> >>>>>> This still uses page fault intercepts which are orders of magnitudes 
> >>>>>> slower than hypercalls. Why don't you just create a PV MMIO hypercall 
> >>>>>> that the guest can use to invoke MMIO accesses towards the host based 
> >>>>>> on physical addresses with explicit length encodings?
> >>>>>> 
> >>>>> It is slower, but not an order of magnitude slower. It become faster
> >>>>> with newer HW.
> >>>>> 
> >>>>>> That way you simplify and speed up all code paths, exceeding the speed 
> >>>>>> of PIO exits even. It should also be quite easily portable, as all 
> >>>>>> other platforms have hypercalls available as well.
> >>>>>> 
> >>>>> We are trying to avoid PV as much as possible (well this is also PV,
> >>>>> but not guest visible
> >>>> 
> >>>> Also, how is this not guest visible? Who sets 
> >>>> KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
> >>>> that the guest does so, so it is guest visible.
> >>>> 
> >>> QEMU sets it.
> >> 
> >> How does QEMU know?
> >> 
> > Knows what? When to create such eventfd? virtio device knows.
> 
> Where does it know from?
> 
It does it always.

> > 
> >>> 
> >>>> +/*
> >>>> + * PV_MMIO - Guest can promise us that all accesses touching this 
> >>>> address
> >>>> + * are writes of specified length, starting at the specified address.
> >>>> + * If not - it's a Guest bug.
> >>>> + * Can not be used together with either PIO or DATAMATCH.
> >>>> + */
> >>>> 
> >>> Virtio spec will state that access to a kick register needs to be of
> >>> specific length. This is reasonable thing for HW to ask.
> >> 
> >> This is a spec change. So the guest would have to indicate that it adheres 
> >> to a newer spec. Thus it's a guest visible change.
> >> 
> > There is not virtio spec that has kick register in MMIO. The spec is in
> > the works AFAIK. Actually PIO will not be deprecated and my suggestion
> 
> So the guest would indicate that it supports a newer revision of the spec (in 
> your case, that it supports MMIO). How is that any different from exposing 
> that it supports a PV MMIO hcall?
> 
Guest will indicate nothing. A new driver will use MMIO if the PIO BAR is
not configured. Old drivers will not work for virtio devices that have an MMIO
BAR but no PIO BAR.

> > is to move to MMIO only when PIO address space is exhausted. For PCI it
> > will be never, for PCI-e it will be after ~16 devices.
> 
> Ok, let's go back a step here. Are you actually able to measure any speed in 
> performance with this patch applied and without when going through MMIO kicks?
> 
> 
That's the question for MST. I think he did only micro benchmarks till
now and he already posted his result here:

mmio-wildcard-eventfd:pci-mem 3529
mmio-pv-eventfd:pci-mem 1878
portio-wildcard-eventfd:pci-io 1846

So the patch speedup mmio by almost 100% and it is almost the same as PIO.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>> 
> >>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>> 
> >>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>> know the address from the VMCS so if the address is unique, we can look
> >>>>> up the eventfd directly, bypassing emulation.
> >>>>> 
> >>>>> Add an interface for userspace to specify this per-address, we can
> >>>>> use this e.g. for virtio.
> >>>>> 
> >>>>> The implementation adds a separate bus internally. This serves two
> >>>>> purposes:
> >>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>> - minimize disruption in other code (since we don't know the length,
> >>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>> way we don't need to touch all devices to teach them handle
> >>>>> an dinvalid length)
> >>>>> 
> >>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>> slowly.
> >>>>> 
> >>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>> 
> >>>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>>>> pre-review and suggestions.
> >>>>> 
> >>>>> Signed-off-by: Michael S. Tsirkin 
> >>>> 
> >>>> This still uses page fault intercepts which are orders of magnitudes 
> >>>> slower than hypercalls. Why don't you just create a PV MMIO hypercall 
> >>>> that the guest can use to invoke MMIO accesses towards the host based on 
> >>>> physical addresses with explicit length encodings?
> >>>> 
> >>> It is slower, but not an order of magnitude slower. It become faster
> >>> with newer HW.
> >>> 
> >>>> That way you simplify and speed up all code paths, exceeding the speed 
> >>>> of PIO exits even. It should also be quite easily portable, as all other 
> >>>> platforms have hypercalls available as well.
> >>>> 
> >>> We are trying to avoid PV as much as possible (well this is also PV,
> >>> but not guest visible
> >> 
> >> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? 
> >> The comment above its definition indicates that the guest does so, so it 
> >> is guest visible.
> >> 
> > QEMU sets it.
> 
> How does QEMU know?
> 
Knows what? When to create such eventfd? virtio device knows.

> > 
> >> +/*
> >> + * PV_MMIO - Guest can promise us that all accesses touching this address
> >> + * are writes of specified length, starting at the specified address.
> >> + * If not - it's a Guest bug.
> >> + * Can not be used together with either PIO or DATAMATCH.
> >> + */
> >> 
> > Virtio spec will state that access to a kick register needs to be of
> > specific length. This is reasonable thing for HW to ask.
> 
> This is a spec change. So the guest would have to indicate that it adheres to 
> a newer spec. Thus it's a guest visible change.
> 
There is not virtio spec that has kick register in MMIO. The spec is in
the works AFAIK. Actually PIO will not be deprecated and my suggestion
is to move to MMIO only when PIO address space is exhausted. For PCI it
will be never, for PCI-e it will be after ~16 devices.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes 
> >> slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
> >> the guest can use to invoke MMIO accesses towards the host based on 
> >> physical addresses with explicit length encodings?
> >> 
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> > 
> >> That way you simplify and speed up all code paths, exceeding the speed of 
> >> PIO exits even. It should also be quite easily portable, as all other 
> >> platforms have hypercalls available as well.
> >> 
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible
> 
> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The 
> comment above its definition indicates that the guest does so, so it is guest 
> visible.
> 
QEMU sets it.

> +/*
> + * PV_MMIO - Guest can promise us that all accesses touching this address
> + * are writes of specified length, starting at the specified address.
> + * If not - it's a Guest bug.
> + * Can not be used together with either PIO or DATAMATCH.
> + */
> 
Virtio spec will state that access to a kick register needs to be of
specific length. This is reasonable thing for HW to ask.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
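
For readers unfamiliar with the ioeventfd side, a hedged userspace sketch of
how QEMU could register the kick register under this RFC follows. The
KVM_IOEVENTFD ioctl and struct kvm_ioeventfd are existing KVM ABI;
KVM_IOEVENTFD_FLAG_PV_MMIO is only the flag proposed by the RFC and is not in
mainline headers, its numeric value below is a placeholder, and the address and
length are made-up examples:

    #include <stdint.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    #ifndef KVM_IOEVENTFD_FLAG_PV_MMIO
    #define KVM_IOEVENTFD_FLAG_PV_MMIO (1 << 4)     /* placeholder value, per the RFC only */
    #endif

    /* vm_fd: fd from KVM_CREATE_VM, gpa: guest-physical address of the kick register */
    int register_pv_mmio_kick(int vm_fd, uint64_t gpa, uint32_t len)
    {
            struct kvm_ioeventfd args;
            int efd = eventfd(0, 0);

            if (efd < 0)
                    return -1;

            memset(&args, 0, sizeof(args));
            args.addr  = gpa;                        /* e.g. the virtio notify doorbell */
            args.len   = len;                        /* guest promises writes of exactly this length */
            args.fd    = efd;
            args.flags = KVM_IOEVENTFD_FLAG_PV_MMIO; /* no DATAMATCH, no PIO, as the RFC requires */

            if (ioctl(vm_fd, KVM_IOEVENTFD, &args) < 0)
                    return -1;

            return efd;                              /* vhost or a poll loop then waits on this fd */
    }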


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes 
> >> slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
> >> the guest can use to invoke MMIO accesses towards the host based on 
> >> physical addresses with explicit length encodings?
> >> 
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> > 
> >> That way you simplify and speed up all code paths, exceeding the speed of 
> >> PIO exits even. It should also be quite easily portable, as all other 
> >> platforms have hypercalls available as well.
> >> 
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible). We haven't replaced PIO with hypercall for the
> > same reason. My hope is that future HW will provide us with instruction
> > decode for basic mov instruction at which point this optimisation can be
> > dropped.
> 
> The same applies to an MMIO hypercall. Once the PV interface becomes 
> obsolete, we can drop the capability we expose to the guest.
> 
Disabling it on newer HW is easy, but it is not that simple to get rid of the 
guest code.

> > And hypercall has its own set of problems with Windows guests.
> > When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
> > hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
> 
> Can't we simply register a hypercall ID range with Microsoft?
Doubt it. This is not only about the range though. The calling convention
is completely different.

> 
> > may also affect WHQLing Windows drivers since driver will talk to HW
> > bypassing Windows interfaces.
> 
> Then the WHQL'ed driver doesn't support the PV MMIO hcall?
> 
All Windows drivers have to be WHQL'ed; saying that is like saying that
we do not care about Windows guest virtio speed.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes
> >> slower than hypercalls.
> > 
> > Not really. Here's a test:
> > compare vmcall to portio:
> > 
> > vmcall 1519
> > ...
> > outl_to_kernel 1745
> > 
> > compare portio to mmio:
> > 
> > mmio-wildcard-eventfd:pci-mem 3529
> > mmio-pv-eventfd:pci-mem 1878
> > portio-wildcard-eventfd:pci-io 1846
> > 
> > So not orders of magnitude.
> 
> https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
> 
> Check out page 41. Higher is better (number is number of loop cycles in a 
> second). My test system was an AMD Istanbul based box.
> 
Have you bypassed instruction emulation in your testing?

> > 
> >> Why don't you just create a PV MMIO hypercall
> >> that the guest can use to invoke MMIO accesses towards the host based
> >> on physical addresses with explicit length encodings?
> >> That way you simplify and speed up all code paths, exceeding the speed
> >> of PIO exits even. It should also be quite easily portable, as all
> >> other platforms have hypercalls available as well.
> >> 
> >> 
> >> Alex
> > 
> > I sent such a patch, but maintainers seem reluctant to add hypercalls.
> > Gleb, could you comment please?
> > 
> > A fast way to do MMIO is probably useful in any case ...
> 
> Yes, but at least according to my numbers optimizing anything that is not 
> hcalls is a waste of time.
> 
> 
> Alex

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> 
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can look
> > up the eventfd directly, bypassing emulation.
> > 
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> > 
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> >  devices on the MMIO bus only get a valid address in write, this
> >  way we don't need to touch all devices to teach them handle
> >  an dinvalid length)
> > 
> > At the moment, this optimization is only supported for EPT on x86 and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> > 
> > TODO: NPT, MMU and non x86 architectures.
> > 
> > The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > pre-review and suggestions.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> 
> This still uses page fault intercepts which are orders of magnitudes slower 
> than hypercalls. Why don't you just create a PV MMIO hypercall that the guest 
> can use to invoke MMIO accesses towards the host based on physical addresses 
> with explicit length encodings?
> 
It is slower, but not an order of magnitude slower. It becomes faster
with newer HW.

> That way you simplify and speed up all code paths, exceeding the speed of PIO 
> exits even. It should also be quite easily portable, as all other platforms 
> have hypercalls available as well.
> 
We are trying to avoid PV as much as possible (well this is also PV,
but not guest visible). We haven't replaced PIO with hypercall for the
same reason. My hope is that future HW will provide us with instruction
decode for basic mov instruction at which point this optimisation can be
dropped. And hypercall has its own set of problems with Windows guests.
When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
may also affect WHQLing Windows drivers since driver will talk to HW
bypassing Windows interfaces.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC 0/2] virtio: provide a way for host to monitor critical events in the device

2012-07-24 Thread Gleb Natapov
On Tue, Jul 24, 2012 at 02:31:25PM +0200, Sasha Levin wrote:
> On 07/24/2012 02:28 PM, Gleb Natapov wrote:
> > On Tue, Jul 24, 2012 at 02:26:33PM +0200, Sasha Levin wrote:
> >> On 07/24/2012 09:44 AM, Gleb Natapov wrote:
> >>> On Mon, Jul 23, 2012 at 10:32:39PM +0200, Sasha Levin wrote:
> >>>> As it was discussed recently, there's currently no way for the guest to 
> >>>> notify
> >>>> the host about panics. Further more, there's no reasonable way to notify 
> >>>> the
> >>>> host of other critical events such as an OOM kill.
> >>>>
> >>>> This short patch series introduces a new device named virtio-notifier 
> >>>> which
> >>>> does two simple things:
> >>>>
> >>>>  1. Provide a simple interface for the guest to notify the host of 
> >>>> critical
> >>> To get early OOPSes virtio will have to be compiled into the kernel. If
> >>> your are so keen on using virtio for this though, why not just use
> >>> dedicated virtio serial channel?
> >>
> >> Let's separate between having log for these events and receiving 
> >> notifications about them.
> >>
> >> For the log part, I can already run a simple serial console to dump 
> >> everything somewhere. I'm more concerned about having notifications about 
> >> something critical happening when the guest is already up and running.
> >>
> > I am talking about notifications. Run your notification protocol over
> > dedicated virtio-serial channel. Logs goes to virtio-console as you've
> > said.
> 
> Ah, so just add another channel into virtio-serial to pass these 
> notifications? Good idea - I'll look into it.
> 
Yes, that's what I mean. This solution still has the disadvantage of not
being able to catch early boot oopses without compiling the whole virtio
into a kernel image.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
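
A minimal userspace sketch of the suggestion: the guest runs a small agent that
pushes notification messages over a dedicated virtio-serial port, and the host
reads them from the matching chardev. The port name and message format below
are arbitrary examples, and how the agent learns about events (OOM kills and
the like) is out of scope here:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NOTIFY_PORT "/dev/virtio-ports/org.example.notify.0"  /* example port name */

    int notify_host(const char *event)
    {
            char msg[256];
            int fd = open(NOTIFY_PORT, O_WRONLY);

            if (fd < 0)
                    return -1;                       /* port not present or not configured */

            snprintf(msg, sizeof(msg), "EVENT %s\n", event);
            if (write(fd, msg, strlen(msg)) < 0) {
                    close(fd);
                    return -1;
            }
            close(fd);
            return 0;
    }

    int main(void)
    {
            /* made-up payload; the host side just reads lines from the chardev */
            return notify_host("oom-kill pid=1234 comm=example");
    }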


Re: [RFC 0/2] virtio: provide a way for host to monitor critical events in the device

2012-07-24 Thread Gleb Natapov
On Tue, Jul 24, 2012 at 02:26:33PM +0200, Sasha Levin wrote:
> On 07/24/2012 09:44 AM, Gleb Natapov wrote:
> > On Mon, Jul 23, 2012 at 10:32:39PM +0200, Sasha Levin wrote:
> >> As it was discussed recently, there's currently no way for the guest to 
> >> notify
> >> the host about panics. Further more, there's no reasonable way to notify 
> >> the
> >> host of other critical events such as an OOM kill.
> >>
> >> This short patch series introduces a new device named virtio-notifier which
> >> does two simple things:
> >>
> >>  1. Provide a simple interface for the guest to notify the host of critical
> > To get early OOPSes virtio will have to be compiled into the kernel. If
> > your are so keen on using virtio for this though, why not just use
> > dedicated virtio serial channel?
> 
> Let's separate between having log for these events and receiving 
> notifications about them.
> 
> For the log part, I can already run a simple serial console to dump 
> everything somewhere. I'm more concerned about having notifications about 
> something critical happening when the guest is already up and running.
> 
I am talking about notifications. Run your notification protocol over
a dedicated virtio-serial channel. Logs go to virtio-console as you've
said.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC 0/2] virtio: provide a way for host to monitor critical events in the device

2012-07-24 Thread Gleb Natapov
On Mon, Jul 23, 2012 at 10:32:39PM +0200, Sasha Levin wrote:
> As it was discussed recently, there's currently no way for the guest to notify
> the host about panics. Further more, there's no reasonable way to notify the
> host of other critical events such as an OOM kill.
> 
> This short patch series introduces a new device named virtio-notifier which
> does two simple things:
> 
>  1. Provide a simple interface for the guest to notify the host of critical
To get early OOPSes virtio will have to be compiled into the kernel. If
you are so keen on using virtio for this though, why not just use
a dedicated virtio serial channel?

>  events. This is easily expandible to add support for any events we may find
>  interesting for the host to know about.
> 
>  2. Provide an "echo" interface for the host to ping the guest. This allows
>  the host to ping the guest at intervals chosen by the host, and act
>  accordingly if no response has been received.
> 
> Sasha Levin (2):
>   virtio: Introduce virtio-notifier
>   kvm tools: support virtio-notifier
> 
>  drivers/virtio/Kconfig  |   11 ++
>  drivers/virtio/Makefile |1 +
>  drivers/virtio/virtio_notifier.c|  135 
>  include/linux/virtio_ids.h  |1 +
>  include/linux/virtio_notifier.h |   15 +++
>  tools/kvm/Makefile  |1 +
>  tools/kvm/builtin-run.c |6 +
>  tools/kvm/include/kvm/virtio-notifier.h |9 ++
>  tools/kvm/include/kvm/virtio-pci-dev.h  |1 +
>  tools/kvm/virtio/notifier.c |  203 
> +++
>  10 files changed, 383 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/virtio/virtio_notifier.c
>  create mode 100644 include/linux/virtio_notifier.h
>  create mode 100644 tools/kvm/include/kvm/virtio-notifier.h
>  create mode 100644 tools/kvm/virtio/notifier.c
> 
> -- 
> 1.7.8.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCHv2] x86info: dump kvm cpuid's

2012-05-01 Thread Gleb Natapov
On Tue, May 01, 2012 at 01:27:05PM +0100, Ian Campbell wrote:
> On Mon, 2012-04-30 at 17:38 +0300, Michael S. Tsirkin wrote:
> > The following makes 'x86info -r' dump hypervisor leaf cpu ids
> > (for kvm this is signature+features) when running in a vm.
> > 
> > On the guest we see the signature and the features:
> > eax in: 0x40000000, eax = 00000000 ebx = 4b4d564b ecx = 564b4d56 edx = 
> > 0000004d
> > eax in: 0x40000001, eax = 0000017b ebx = 00000000 ecx = 00000000 edx = 
> > 00000000
> > 
> > Hypervisor flag is checked to avoid output changes when not
> > running on a VM.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > 
> > Changes from v1:
> > Make work on non KVM hypervisors (only KVM was tested).
> > Avi Kivity said kvm will in the future report
> > max HV leaf in eax. For now it reports eax = 0
> > so add a work around for that.
> > 
> > ---
> > 
> > diff --git a/identify.c b/identify.c
> > index 33f35de..a4a3763 100644
> > --- a/identify.c
> > +++ b/identify.c
> > @@ -9,8 +9,8 @@
> >  
> >  void get_cpu_info_basics(struct cpudata *cpu)
> >  {
> > -   unsigned int maxi, maxei, vendor, address_bits;
> > -   unsigned int eax;
> > +   unsigned int maxi, maxei, maxhv, vendor, address_bits;
> > +   unsigned int eax, ebx, ecx;
> >  
> > cpuid(cpu->number, 0, &maxi, &vendor, NULL, NULL);
> > maxi &= 0xffff; /* The high-order word is non-zero on some 
> > Cyrix CPUs */
> > @@ -19,7 +19,7 @@ void get_cpu_info_basics(struct cpudata *cpu)
> > return;
> >  
> > /* Everything that supports cpuid supports these. */
> > -   cpuid(cpu->number, 1, &eax, NULL, NULL, NULL);
> > +   cpuid(cpu->number, 1, &eax, &ebx, &ecx, NULL);
> 
> You probably want to check ebx, ecx, edx for the signatures of the
> hypervisor's you are willing to support and which you know do something
> sane with eax? Also it would be something worth reporting in its own
> right?
> 
Yes, but this is for the -r option which only dumps raw cpuid info.
Decoding hypervisor-specific info is a useful thing, but should not be
a prerequisite for that patch.

> BTW, according to arch/x86/include/asm/kvm_para.h unsurprisingly KVM has
> a signature too 'KVMKVMKVM'.
> 
> > cpu->stepping = eax & 0xf;
> > cpu->model = (eax >> 4) & 0xf;
> > cpu->family = (eax >> 8) & 0xf;
> > @@ -29,6 +29,19 @@ void get_cpu_info_basics(struct cpudata *cpu)
> >  
> > cpuid(cpu->number, 0xC0000000, &maxei, NULL, NULL, NULL);
> > cpu->maxei2 = maxei;
> > +   if (ecx & 0x80000000) {
> > +   cpuid(cpu->number, 0x40000000, &maxhv, NULL, NULL, NULL);
> > +   /*
> > +* KVM up to linux 3.4 reports 0 as the max hypervisor leaf,
> > +* where it really means 0x40000001.
> 
> This is something where I definitely think you want to check the
> signature first.
In theory yes, but in practice what will this break?

> 
> Ian.
> 
> > +* Most (all?) hypervisors have at least one CPUID besides
> > +* the vendor ID so assume that.
> > +*/
> > +   cpu->maxhv = maxhv ? maxhv : 0x40000001;
> > +   } else {
> > +   /* Suppress hypervisor cpuid unless running on a hypervisor */
> > +   cpu->maxhv = 0;
> > +   }
> >  
> > cpuid(cpu->number, 0x80000008,&address_bits, NULL, NULL, NULL);
> > cpu->phyaddr_bits = address_bits & 0xFF;
> > diff --git a/x86info.c b/x86info.c
> > index 22c4734..80cae36 100644
> > --- a/x86info.c
> > +++ b/x86info.c
> > @@ -44,6 +44,10 @@ static void display_detailed_info(struct cpudata *cpu)
> >  
> > if (cpu->maxei2 >=0xC0000000)
> > dump_raw_cpuid(cpu->number, 0xC0000000, cpu->maxei2);
> > +
> > +   if (cpu->maxhv >= 0x40000000)
> > +   dump_raw_cpuid(cpu->number, 0x40000000, cpu->maxhv);
> > +
> > }
> >  
> > if (show_cacheinfo) {
> > diff --git a/x86info.h b/x86info.h
> > index 7d2a455..c4f5d81 100644
> > --- a/x86info.h
> > +++ b/x86info.h
> > @@ -84,7 +84,7 @@ struct cpudata {
> > unsigned int cachesize_trace;
> > unsigned int phyaddr_bits;
> > unsigned int viraddr_bits;
> > -   unsigned int cpuid_level, maxei, maxei2;
> > +   unsigned int cpuid_level, maxei, maxei2, maxhv;
> > char name[CPU_NAME_LEN];
> > enum connector connector;
> > unsigned int flags_ecx;
> > ___
> > Virtualization mailing list
> > Virtualization@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/virtualization
> > 
> 
> -- 
> Ian Campbell
> 
> Your own qualities will help prevent your advancement in the world.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCHv2] x86info: dump kvm cpuid's

2012-04-30 Thread Gleb Natapov
On Mon, Apr 30, 2012 at 05:38:35PM +0300, Michael S. Tsirkin wrote:
> The following makes 'x86info -r' dump hypervisor leaf cpu ids
> (for kvm this is signature+features) when running in a vm.
> 
> On the guest we see the signature and the features:
> eax in: 0x40000000, eax = 00000000 ebx = 4b4d564b ecx = 564b4d56 edx = 
> 0000004d
> eax in: 0x40000001, eax = 0000017b ebx = 00000000 ecx = 00000000 edx = 
> 00000000
> 
> Hypervisor flag is checked to avoid output changes when not
> running on a VM.
> 
> Signed-off-by: Michael S. Tsirkin 
> 
Looks good to me.

> Changes from v1:
>   Make work on non KVM hypervisors (only KVM was tested).
>   Avi Kivity said kvm will in the future report
>   max HV leaf in eax. For now it reports eax = 0
> so add a work around for that.
> 
> ---
> 
> diff --git a/identify.c b/identify.c
> index 33f35de..a4a3763 100644
> --- a/identify.c
> +++ b/identify.c
> @@ -9,8 +9,8 @@
>  
>  void get_cpu_info_basics(struct cpudata *cpu)
>  {
> - unsigned int maxi, maxei, vendor, address_bits;
> - unsigned int eax;
> + unsigned int maxi, maxei, maxhv, vendor, address_bits;
> + unsigned int eax, ebx, ecx;
>  
>   cpuid(cpu->number, 0, &maxi, &vendor, NULL, NULL);
>   maxi &= 0xffff; /* The high-order word is non-zero on some 
> Cyrix CPUs */
> @@ -19,7 +19,7 @@ void get_cpu_info_basics(struct cpudata *cpu)
>   return;
>  
>   /* Everything that supports cpuid supports these. */
> - cpuid(cpu->number, 1, &eax, NULL, NULL, NULL);
> + cpuid(cpu->number, 1, &eax, &ebx, &ecx, NULL);
>   cpu->stepping = eax & 0xf;
>   cpu->model = (eax >> 4) & 0xf;
>   cpu->family = (eax >> 8) & 0xf;
> @@ -29,6 +29,19 @@ void get_cpu_info_basics(struct cpudata *cpu)
>  
>   cpuid(cpu->number, 0xC0000000, &maxei, NULL, NULL, NULL);
>   cpu->maxei2 = maxei;
> + if (ecx & 0x80000000) {
> + cpuid(cpu->number, 0x40000000, &maxhv, NULL, NULL, NULL);
> + /*
> +  * KVM up to linux 3.4 reports 0 as the max hypervisor leaf,
> +  * where it really means 0x40000001.
> +  * Most (all?) hypervisors have at least one CPUID besides
> +  * the vendor ID so assume that.
> +  */
> + cpu->maxhv = maxhv ? maxhv : 0x40000001;
> + } else {
> + /* Suppress hypervisor cpuid unless running on a hypervisor */
> + cpu->maxhv = 0;
> + }
>  
>   cpuid(cpu->number, 0x80000008,&address_bits, NULL, NULL, NULL);
>   cpu->phyaddr_bits = address_bits & 0xFF;
> diff --git a/x86info.c b/x86info.c
> index 22c4734..80cae36 100644
> --- a/x86info.c
> +++ b/x86info.c
> @@ -44,6 +44,10 @@ static void display_detailed_info(struct cpudata *cpu)
>  
>   if (cpu->maxei2 >=0xC0000000)
>   dump_raw_cpuid(cpu->number, 0xC0000000, cpu->maxei2);
> +
> + if (cpu->maxhv >= 0x40000000)
> + dump_raw_cpuid(cpu->number, 0x40000000, cpu->maxhv);
> +
>   }
>  
>   if (show_cacheinfo) {
> diff --git a/x86info.h b/x86info.h
> index 7d2a455..c4f5d81 100644
> --- a/x86info.h
> +++ b/x86info.h
> @@ -84,7 +84,7 @@ struct cpudata {
>   unsigned int cachesize_trace;
>   unsigned int phyaddr_bits;
>   unsigned int viraddr_bits;
> - unsigned int cpuid_level, maxei, maxei2;
> + unsigned int cpuid_level, maxei, maxei2, maxhv;
>   char name[CPU_NAME_LEN];
>   enum connector connector;
>   unsigned int flags_ecx;

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: x86info: dump kvm cpuid's

2012-04-30 Thread Gleb Natapov
On Mon, Apr 30, 2012 at 12:38:13PM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 30, 2012 at 11:43:19AM +0300, Gleb Natapov wrote:
> > On Sun, Apr 29, 2012 at 01:10:21PM +0300, Michael S. Tsirkin wrote:
> > > The following makes 'x86info -r' dump kvm cpu ids
> > > (signature+features) when running in a vm.
> > > 
> > > On the guest we see the signature and the features:
> > > eax in: 0x40000000, eax = 00000000 ebx = 4b4d564b ecx = 564b4d56 edx = 
> > > 0000004d
> > > eax in: 0x40000001, eax = 0000017b ebx = 00000000 ecx = 00000000 edx = 
> > > 00000000
> > > 
> > > On the host it just adds a couple of zero lines:
> > > eax in: 0x40000000, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 
> > > 00000000
> > > eax in: 0x40000001, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 
> > > 00000000
> > > 
> > This is too KVM specific.
> 
> That's what I have. I scratch my own itch.
> 
> > Other hypervisors may use more cpuid leafs.
> 
> But not less so no harm's done.
> 
> > As far as I see Hyper-V uses 5 and uses cpuid.0x40000000.eax as the max cpuid
> > leaf available. Haven't checked Xen or VMWare.
> 
> I don't think guessing at the CPU behaviour from linux source
> is the right thing to do.
> 
That is guessing from the Hyper-V specification. The best kind of guess.

http://msdn.microsoft.com/en-us/library/windows/hardware/ff542700%28v=vs.85%29.aspx

> I Cc'd some addresses found in MAINTAINERS in the linux
> kernel. This will give more people the opportunity
> to ask for their stuff to be added, if they care.
> 
> > > Signed-off-by: Michael S. Tsirkin 
> > > 
> > > ---
> > > 
> > > Dave - not sure whether there's a mailing list for x86info.
> > > The patch is on top of the master branch in
> > > git://git.codemonkey.org.uk/x86info.git
> > > 
> > > Thanks!
> > > 
> > > diff --git a/x86info.c b/x86info.c
> > > index 22c4734..dee5ed1 100644
> > > --- a/x86info.c
> > > +++ b/x86info.c
> > > @@ -44,6 +44,7 @@ static void display_detailed_info(struct cpudata *cpu)
> > >  
> > >   if (cpu->maxei2 >=0xC000)
> > >   dump_raw_cpuid(cpu->number, 0xC000, cpu->maxei2);
> > > + dump_raw_cpuid(cpu->number, 0x40000000, 0x40000001);
> > >   }
> > >  
> > >   if (show_cacheinfo) {
> > 
> > --
> > Gleb.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V6 1/5] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2012-04-30 Thread Gleb Natapov
On Mon, Apr 30, 2012 at 11:22:34AM +0300, Avi Kivity wrote:
> On 04/29/2012 04:52 PM, Gleb Natapov wrote:
> > On Sun, Apr 29, 2012 at 04:26:21PM +0300, Avi Kivity wrote:
> > > On 04/29/2012 04:20 PM, Gleb Natapov wrote:
> > > > > > This is too similar to kvm_irq_delivery_to_apic(). Why not reuse 
> > > > > > it. We
> > > > > > can use one of reserved delivery modes as PV delivery mode. We will
> > > > > > disallow guest to trigger it through apic interface, so this will 
> > > > > > not be
> > > > > > part of ABI and can be changed at will.
> > > > > >
> > > > > 
> > > > > I'm not thrilled about this.  Those delivery modes will eventually
> > > > > become unreserved.  We can have a kvm_lookup_apic_id() that is shared
> > > > > among implementations.
> > > > > 
> > > > This is only internal implementation. If they become unreserved we will
> > > > use something else.
> > > >
> > > 
> > > Yeah, I'm thinking of that time.  Why do something temporary and fragile?
> > > 
> > Why is it fragile? Just by unreserving the value Intel will not break
> > KVM. Only when KVM will implement apic feature that unreserves the value
> > we will have to change internal implementation and use another value,
> > but this will be done by the same patch that does unreserving. The
> > unreserving may even never happen. 
> 
> Some remains of that may leak somewhere.
I do not see where. The APIC code should #GP if a guest attempts to set
reserved values through the APIC interface, or at least ignore them.

>   Why not add an extra
> parameter?
Yes, we can add an extra parameter to "struct kvm_lapic_irq" and propagate it to
__apic_accept_irq(). We can do that now, or when Intel unreserves all the
reserved values. I prefer the latter since I do not believe it will
happen in the observable future.

>  Or do something like
> 
> kvm_for_each_apic_dest(vcpu, apic_destination) {
>   ...
> }
> 
> That can be reused in both the apic code and pv kick.
> 
That's exactly what kvm_irq_delivery_to_apic() is.

> > Meanwhile kvm_irq_delivery_to_apic()
> > will likely be optimized to use hash for unicast delivery and unhalt
> > hypercall will benefit from it immediately.
> 
> Overloading delivery mode is not the only way to achieve sharing.
> 
It is the simplest and most straightforward approach, with no demonstrable drawbacks :)
Adding a parameter to "struct kvm_lapic_irq" is the next best thing.
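
(Purely to illustrate the "extra parameter" alternative -- the field name and
placement below are made up, this is not code from any posted patch:)

	struct kvm_lapic_irq {
		...
		u32 dest_id;
		bool unhalt;	/* hypothetical flag instead of a reserved delivery mode */
	};

	/* and somewhere in the delivery path, roughly: */
	if (irq->unhalt) {
		vcpu->arch.pv.pv_unhalted = 1;
		kvm_vcpu_kick(vcpu);
	}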

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V6 1/5] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2012-04-29 Thread Gleb Natapov
On Sun, Apr 29, 2012 at 04:26:21PM +0300, Avi Kivity wrote:
> On 04/29/2012 04:20 PM, Gleb Natapov wrote:
> > > > This is too similar to kvm_irq_delivery_to_apic(). Why not reuse it. We
> > > > can use one of reserved delivery modes as PV delivery mode. We will
> > > > disallow guest to trigger it through apic interface, so this will not be
> > > > part of ABI and can be changed at will.
> > > >
> > > 
> > > I'm not thrilled about this.  Those delivery modes will eventually
> > > become unreserved.  We can have a kvm_lookup_apic_id() that is shared
> > > among implementations.
> > > 
> > This is only internal implementation. If they become unreserved we will
> > use something else.
> >
> 
> Yeah, I'm thinking of that time.  Why do something temporary and fragile?
> 
Why is it fragile? Just by unreserving the value Intel will not break
KVM. Only when KVM will implement apic feature that unreserves the value
we will have to change internal implementation and use another value,
but this will be done by the same patch that does unreserving. The
unreserving may even never happen. Meanwhile kvm_irq_delivery_to_apic()
will likely be optimized to use hash for unicast delivery and unhalt
hypercall will benefit from it immediately.
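
(To give an idea of the kind of optimization meant here -- only a sketch, none
of these names exist in KVM at the moment:)

	/* apic id -> vcpu lookup table; assumes plain xAPIC physical ids */
	struct kvm_apic_map {
		struct kvm_vcpu *phys_map[256];
	};

	static struct kvm_vcpu *lookup_unicast(struct kvm_apic_map *map, u8 apicid)
	{
		/* O(1) instead of walking every vcpu with kvm_for_each_vcpu() */
		return map->phys_map[apicid];
	}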

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V6 1/5] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2012-04-29 Thread Gleb Natapov
On Sun, Apr 29, 2012 at 04:18:03PM +0300, Avi Kivity wrote:
> On 04/24/2012 12:59 PM, Gleb Natapov wrote:
> > >  
> > > +/*
> > > + * kvm_pv_kick_cpu_op:  Kick a vcpu.
> > > + *
> > > + * @apicid - apicid of vcpu to be kicked.
> > > + */
> > > +static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
> > > +{
> > > + struct kvm_vcpu *vcpu = NULL;
> > > + int i;
> > > +
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + if (!kvm_apic_present(vcpu))
> > > + continue;
> > > +
> > > + if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0))
> > > + break;
> > > + }
> > > + if (vcpu) {
> > > + /*
> > > +  * Setting unhalt flag here can result in spurious runnable
> > > +  * state when unhalt reset does not happen in vcpu_block.
> > > +  * But that is harmless since that should soon result in halt.
> > > +  */
> > > + vcpu->arch.pv.pv_unhalted = 1;
> > > + /* We need everybody see unhalt before vcpu unblocks */
> > > + smp_mb();
> > > + kvm_vcpu_kick(vcpu);
> > > + }
> > > +}
> > This is too similar to kvm_irq_delivery_to_apic(). Why not reuse it. We
> > can use one of reserved delivery modes as PV delivery mode. We will
> > disallow guest to trigger it through apic interface, so this will not be
> > part of ABI and can be changed at will.
> >
> 
> I'm not thrilled about this.  Those delivery modes will eventually
> become unreserved.  We can have a kvm_lookup_apic_id() that is shared
> among implementations.
> 
This is only internal implementation. If they become unreserved we will
use something else.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V6 1/5] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2012-04-27 Thread Gleb Natapov
On Fri, Apr 27, 2012 at 04:15:35PM +0530, Raghavendra K T wrote:
> On 04/24/2012 03:29 PM, Gleb Natapov wrote:
> >On Mon, Apr 23, 2012 at 03:29:47PM +0530, Raghavendra K T wrote:
> >>From: Srivatsa Vaddagiri
> >>
> >>KVM_HC_KICK_CPU allows the calling vcpu to kick another vcpu out of halt 
> >>state.
> >>
> >>The presence of these hypercalls is indicated to guest via
> >>KVM_FEATURE_PV_UNHALT/KVM_CAP_PV_UNHALT.
> >>
> >>Signed-off-by: Srivatsa Vaddagiri
> >>Signed-off-by: Suzuki Poulose
> >>Signed-off-by: Raghavendra K T
> >>---
> [...]
> >>+/*
> >>+ * kvm_pv_kick_cpu_op:  Kick a vcpu.
> >>+ *
> >>+ * @apicid - apicid of vcpu to be kicked.
> >>+ */
> >>+static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
> >>+{
> >>+   struct kvm_vcpu *vcpu = NULL;
> >>+   int i;
> >>+
> >>+   kvm_for_each_vcpu(i, vcpu, kvm) {
> >>+   if (!kvm_apic_present(vcpu))
> >>+   continue;
> >>+
> >>+   if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0))
> >>+   break;
> >>+   }
> >>+   if (vcpu) {
> >>+   /*
> >>+* Setting unhalt flag here can result in spurious runnable
> >>+* state when unhalt reset does not happen in vcpu_block.
> >>+* But that is harmless since that should soon result in halt.
> >>+*/
> >>+   vcpu->arch.pv.pv_unhalted = 1;
> >>+   /* We need everybody see unhalt before vcpu unblocks */
> >>+   smp_mb();
> >>+   kvm_vcpu_kick(vcpu);
> >>+   }
> >>+}
> >This is too similar to kvm_irq_delivery_to_apic(). Why not reuse it. We
> >can use one of reserved delivery modes as PV delivery mode. We will
> >disallow guest to trigger it through apic interface, so this will not be
> >part of ABI and can be changed at will.
> >
> 
> I think it is interesting ( Perhaps more reasonable way of doing it).
> I am not too familiar with lapic source. So, pardon me if my
> questions are stupid.
> 
> Something like below is what I deciphered from your suggestion which
> is working.
> 
> kvm/x86.c
> =
> kvm_pv_kick_cpu_op()
> {
> 
>  struct kvm_lapic_irq lapic_irq;
> 
>  lapic_irq.shorthand = 0;
>  lapic_irq.dest_mode = 0;
>  lapic_irq.dest_id = apicid;
> 
>  lapic_irq.delivery_mode = PV_DELIVERY_MODE;
>  kvm_irq_delivery_to_apic(kvm, 0, &lapic_irq );
> 
> }
> 
> kvm/lapic.c
> ==
> _apic_accept_irq()
> {
> ...
> case APIC_DM_REMRD:
> result = 1;
> vcpu->pv_unhalted = 1
> smp_mb();
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> kvm_vcpu_kick(vcpu);
> break;
> 
> ...
> }
> 
> here using PV_DELIVERY_MODE = APIC_DM_REMRD, which was unused.
> 
Yes, this is what I mean, except that PV_DELIVERY_MODE should be a
number defined as reserved by the Intel spec.

> OR
> 1) Are you asking to remove vcpu->pv_unhalted flag and replace with an irq?
I would like to remove vcpu->pv_unhalted, but I do not see how to do that,
so I am not asking for that :)
 
> 2) are you talking about some reserved fields in struct local_apic instead
> of APIC_DM_REMRD what I have used above?
Delivery modes 011b and 111b are reserved. We can use one of them.

> [ Honestly, arch/x86/include/asm/apicdef.h had too much of info to
> digest :( ]
> 
> 3) I am not sure about: disallow guest to trigger it through apic
> interface part also.(mean howto?)
I mean we should disallow the guest from setting the delivery mode to reserved
values through the APIC interface.

> 4) And one more question, So you think it takes care of migration part
>   (in case we are removing pv_unhalted flag)?
No, since I am not asking to remove the pv_unhalted flag. I want to reuse
the code that we already have to deliver the unhalt event.

> 
> It would be helpful if you can give little more explanation/ pointer
> to Documentation.
> 
> Ingo is keen to see whole ticketlock/Xen/KVM patch in one go.
> and anyhow since this does not cause any ABI change, I hope you
> don't mind if
> I do only the vcpu->pv_unhalted change you suggested now [ having
> pv_unhalted reset  in vcpu_run if
> you meant something else than code I have above ], so that whole
> series get fair amount time for testing.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V6 1/5] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2012-04-24 Thread Gleb Natapov
On Mon, Apr 23, 2012 at 03:29:47PM +0530, Raghavendra K T wrote:
> From: Srivatsa Vaddagiri 
> 
> KVM_HC_KICK_CPU allows the calling vcpu to kick another vcpu out of halt 
> state.
> 
> The presence of these hypercalls is indicated to guest via
> KVM_FEATURE_PV_UNHALT/KVM_CAP_PV_UNHALT.
> 
> Signed-off-by: Srivatsa Vaddagiri 
> Signed-off-by: Suzuki Poulose 
> Signed-off-by: Raghavendra K T 
> ---
> diff --git a/arch/ia64/include/asm/kvm_host.h 
> b/arch/ia64/include/asm/kvm_host.h
> index e35b3a8..3252339 100644
> --- a/arch/ia64/include/asm/kvm_host.h
> +++ b/arch/ia64/include/asm/kvm_host.h
> @@ -597,6 +597,9 @@ void kvm_sal_emul(struct kvm_vcpu *vcpu);
>  struct kvm *kvm_arch_alloc_vm(void);
>  void kvm_arch_free_vm(struct kvm *kvm);
>  
> +static inline void kvm_arch_vcpu_reset_pv_unhalted(struct kvm_vcpu *vcpu)
> +{
> +}
>  #endif /* __ASSEMBLY__*/
>  
>  #endif
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index 52eb9c1..28446de 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -498,4 +498,8 @@ struct kvm_vcpu_arch {
>  #define KVM_MMIO_REG_QPR 0x0040
>  #define KVM_MMIO_REG_FQPR 0x0060
>  
> +static inline void kvm_arch_vcpu_reset_pv_unhalted(struct kvm_vcpu *vcpu)
> +{
> +}
> +
>  #endif /* __POWERPC_KVM_HOST_H__ */
> diff --git a/arch/s390/include/asm/kvm_host.h 
> b/arch/s390/include/asm/kvm_host.h
> index 7343872..c47f874 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -256,4 +256,8 @@ struct kvm_arch{
>  };
>  
>  extern int sie64a(struct kvm_s390_sie_block *, u64 *);
> +static inline void kvm_arch_vcpu_reset_pv_unhalted(struct kvm_vcpu *vcpu)
> +{
> +}
> +
>  #endif
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e216ba0..dad475b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -481,6 +481,10 @@ struct kvm_vcpu_arch {
>   u64 length;
>   u64 status;
>   } osvw;
> + /* pv related host specific info */
> + struct {
> + int pv_unhalted;
> + } pv;
>  };
>  
>  struct kvm_lpage_info {
> @@ -967,4 +971,6 @@ int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, 
> u64 *data);
>  void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
>  void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
>  
> +void kvm_arch_vcpu_reset_pv_unhalted(struct kvm_vcpu *vcpu);
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 734c376..5b647ea 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -16,12 +16,14 @@
>  #define KVM_FEATURE_CLOCKSOURCE  0
>  #define KVM_FEATURE_NOP_IO_DELAY 1
>  #define KVM_FEATURE_MMU_OP   2
> +
>  /* This indicates that the new set of kvmclock msrs
>   * are available. The use of 0x11 and 0x12 is deprecated
>   */
>  #define KVM_FEATURE_CLOCKSOURCE2 3
>  #define KVM_FEATURE_ASYNC_PF 4
>  #define KVM_FEATURE_STEAL_TIME   5
> +#define KVM_FEATURE_PV_UNHALT 6
>  
>  /* The last 8 bits are used to indicate how to interpret the flags field
>   * in pvclock structure. If no bits are set, all flags are ignored.
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 9fed5be..7c93806 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -408,7 +408,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
> u32 function,
>(1 << KVM_FEATURE_NOP_IO_DELAY) |
>(1 << KVM_FEATURE_CLOCKSOURCE2) |
>(1 << KVM_FEATURE_ASYNC_PF) |
> -  (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> +  (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> +  (1 << KVM_FEATURE_PV_UNHALT);
>  
>   if (sched_info_on())
>   entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4044ce0..7fc9be6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2147,6 +2147,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>   case KVM_CAP_ASYNC_PF:
>   case KVM_CAP_GET_TSC_KHZ:
>   case KVM_CAP_PCI_2_3:
> + case KVM_CAP_PV_UNHALT:
>   r = 1;
>   break;
>   case KVM_CAP_COALESCED_MMIO:
> @@ -4993,6 +4994,36 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>   return 1;
>  }
>  
> +/*
> + * kvm_pv_kick_cpu_op:  Kick a vcpu.
> + *
> + * @apicid - apicid of vcpu to be kicked.
> + */
> +static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
> +{
> + struct kvm_vcpu *vcpu = NULL;
> + int i;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (!kvm_apic_present(vcpu))
> + continue;
> +
> + if (kvm_apic_match_dest(vcpu, 0, 0, 

Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock

2012-01-17 Thread Gleb Natapov
On Tue, Jan 17, 2012 at 07:58:18PM +0530, Srivatsa Vaddagiri wrote:
> * Gleb Natapov  [2012-01-17 15:20:51]:
> 
> > > Having the hypercall makes the intent of vcpu (to sleep on a kick) clear 
> > > to 
> > > hypervisor vs assuming that because of a trapped HLT instruction (which
> > > will anyway won't work when yield_on_hlt=0).
> > > 
> > The purpose of yield_on_hlt=0 is to allow VCPU to occupy CPU for the
> > entire time slice no matter what. I do not think disabling yield on HLT
> > even makes sense in a CPU oversubscription scenario.
> 
> Yes, so is there any real use for yield_on_hlt=0? I believe Anthony
> initially added it as a way to implement CPU bandwidth capping for VMs,
> which would ensure that busy VMs don't eat into cycles meant for a idle
> VM. Now that we have proper support in scheduler for CPU bandwidth capping, 
> is 
> there any real world use for yield_on_hlt=0? If not, deprecate it?
> 
I was against adding it in the first place, so if IBM no longer needs it
I am for removing it ASAP.

> > Now if you'll call
> > KVM_HC_WAIT_FOR_KICK instead of HLT you will effectively ignore
> > yield_on_hlt=0 setting.
> 
> I guess that depends on what we do in KVM_HC_WAIT_FOR_KICK. If we do
> yield_to() rather than sleep, it should minimize how much cycles vcpu gives 
> away
> to a competing VM (which seems to be the biggest purpose why one may
> want to set yield_on_hlt=0).
> 
> > This is like having PV HLT that does not obey
> > VMX exit control setting.
> 
> - vatsa

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock

2012-01-17 Thread Gleb Natapov
On Tue, Jan 17, 2012 at 06:41:03PM +0530, Srivatsa Vaddagiri wrote:
> * Gleb Natapov  [2012-01-17 14:51:26]:
> 
> > On Tue, Jan 17, 2012 at 05:56:50PM +0530, Srivatsa Vaddagiri wrote:
> > > * Gleb Natapov  [2012-01-17 11:14:13]:
> > > 
> > > > > The problem case I was thinking of was when guest VCPU would have 
> > > > > issued
> > > > > HLT with interrupts disabled. I guess one option is to inject an NMI,
> > > > > and have the guest kernel NMI handler recognize this and make
> > > > > adjustments such that the vcpu avoids going back to HLT instruction.
> > > > > 
> > > > Just kick vcpu out of a guest mode and adjust rip to point after HLT on
> > > > next re-entry. Don't forget to call vmx_clear_hlt().
> > > 
> > > Looks bit hackish to me compared to having another hypercall to yield!
> > > 
> > Do not see anything hackish about it. But what you described above (the
> > part I replied to) is not another hypercall, but yet another NMI source
> > and special handling in a guest.
> 
> True, which I didn't exactly like and hence was suggesting we use
> another hypercall to let spinning vcpu sleep.
> 
Ah, sorry. Missed that.

> > So what hypercall do you mean?
> 
> The hypercall is described below:
> 
> > > > > Having another hypercall to do yield/sleep (rather than effecting that
> > > > > via HLT) seems like an alternate clean solution here ..
> 
> and was implemented in an earlier version of this patch (v2) as
> KVM_HC_WAIT_FOR_KICK hypercall:
> 
> https://lkml.org/lkml/2011/10/23/211
> 
> Having the hypercall makes the intent of vcpu (to sleep on a kick) clear to 
> hypervisor vs assuming that because of a trapped HLT instruction (which
> will anyway won't work when yield_on_hlt=0).
> 
The purpose of yield_on_hlt=0 is to allow VCPU to occupy CPU for the
entire time slice no matter what. I do not think disabling yield on HLT
even makes sense in a CPU oversubscription scenario. Now if you'll call
KVM_HC_WAIT_FOR_KICK instead of HLT you will effectively ignore
yield_on_hlt=0 setting. This is like having PV HLT that does not obey
VMX exit control setting.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock

2012-01-17 Thread Gleb Natapov
On Tue, Jan 17, 2012 at 05:56:50PM +0530, Srivatsa Vaddagiri wrote:
> * Gleb Natapov  [2012-01-17 11:14:13]:
> 
> > > The problem case I was thinking of was when guest VCPU would have issued
> > > HLT with interrupts disabled. I guess one option is to inject an NMI,
> > > and have the guest kernel NMI handler recognize this and make
> > > adjustments such that the vcpu avoids going back to HLT instruction.
> > > 
> > Just kick vcpu out of a guest mode and adjust rip to point after HLT on
> > next re-entry. Don't forget to call vmx_clear_hlt().
> 
> Looks bit hackish to me compared to having another hypercall to yield!
> 
Do not see anything hackish about it. But what you described above (the
part I replied to) is not another hypercall, but yet another NMI source
and special handling in a guest. So what hypercall do you mean?

> > > Having another hypercall to do yield/sleep (rather than effecting that
> > > via HLT) seems like an alternate clean solution here ..
> 
> - vatsa

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC V4 5/5] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock

2012-01-17 Thread Gleb Natapov
On Mon, Jan 16, 2012 at 07:41:17PM +0530, Srivatsa Vaddagiri wrote:
> * Avi Kivity  [2012-01-16 12:14:27]:
> 
> > > One option is to make the kick hypercall available only when
> > > yield_on_hlt=1?
> > 
> > It's not a good idea to tie various options together.  Features should
> > be orthogonal.
> > 
> > Can't we make it work?  Just have different handling for
> > KVM_REQ_PVLOCK_KICK (let 's rename it, and the hypercall, PV_UNHALT,
> > since we can use it for non-locks too).
> 
> The problem case I was thinking of was when guest VCPU would have issued
> HLT with interrupts disabled. I guess one option is to inject an NMI,
> and have the guest kernel NMI handler recognize this and make
> adjustments such that the vcpu avoids going back to HLT instruction.
> 
Just kick vcpu out of a guest mode and adjust rip to point after HLT on
next re-entry. Don't forget to call vmx_clear_hlt().
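
(Roughly, and assuming the vcpu is parked on the one-byte HLT opcode --
kvm_rip_read()/kvm_rip_write() are the existing register cache helpers and
vmx_clear_hlt() is the VMX helper mentioned above:)

	/* skip the HLT opcode so the guest does not immediately halt again */
	kvm_rip_write(vcpu, kvm_rip_read(vcpu) + 1);
	/* leave the HLT activity state before the next VM entry */
	vmx_clear_hlt(vcpu);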

> Having another hypercall to do yield/sleep (rather than effecting that
> via HLT) seems like an alternate clean solution here ..
> 
> - vatsa

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 00/12] virtio: s4 support

2011-12-07 Thread Gleb Natapov
On Wed, Dec 07, 2011 at 05:54:29PM +1030, Rusty Russell wrote:
> On Wed,  7 Dec 2011 01:18:38 +0530, Amit Shah  wrote:
> > Hi,
> > 
> > These patches add support for S4 to virtio (pci) and all drivers.
> 
> Dumb meta-question: why do we want to hibernate virtual machines?
> 
For instance, you want to hibernate your laptop while it is running a virtual
machine. You can pause the VM of course, but then time will be incorrect
inside the guest after resume.

> I figure there's a reason, but it seems a bit weird :)
> 
> Thanks,
> Rusty.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: Make virtio vq size configurable by a guest.

2011-06-22 Thread Gleb Natapov
On Tue, Jun 21, 2011 at 11:32:04AM +0930, Rusty Russell wrote:
> On Mon, 20 Jun 2011 16:16:24 +0300, Gleb Natapov  wrote:
> > Hi,
> > 
> > Currently in virtio the host dictates the size and layout of the vq that
> > should be used. To talk to a device that has one vq with 128 elements the
> > guest needs to allocate at least 2 pages. Usually this is not a problem, but
> > sometimes the guest runs in a resource-restricted environment and then it may
> > not have enough memory to initialize all virtio devices present in the
> > system. One such environment is a BIOS. SeaBIOS currently has virtio block
> > support. Since the BIOS should be able to access the disk even after the OS
> > is launched, the vq should be allocated from a special memory region that
> > will be marked as unavailable to the OS, but such memory is scarce. Because
> > the vq is so huge only a couple of virtio disks can be initialized by the BIOS.
> > 
> > It would be nice if a guest were able to tell the host what vq size
> > should be used instead. The BIOS issues only one request at a time anyway,
> > so it needs only one element in the vq. It does not care too much about
> > performance either, so it can tell the host not to align the used index to a
> > page boundary. This way a vq of one element shouldn't take more than a
> > couple hundred bytes.
> 
> Unfortunately, a virtqueue *always* takes at least 2 pages.  That's
> because we split the host/guest part on page boundaries.  (2 pages per
> disk is "huge"?  Really?)
> 
"Huge" only for a BIOS environment, not when normal guest is running. The
problem is those two pages should be allocated from a memory that is
marked as reserved in e820 map since the queue has to be active after
BIOS passes control to an OS. Currently Seabios reserve up to 64K for this
memory pool (memory that is not actually allocated is not marked as
reserved in e820).

> So really, you want to negotiate the ring size and the 'align'
> parameter.  A new feature could allow this, but there may be valid
> reasons for a host to want to place an upper limit, too.
> 
What kind of reason could there be other than performance? AFAIK KVM does not
have such restrictions.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Make virtio vq size configurable by a guest.

2011-06-20 Thread Gleb Natapov
Hi,

Currently in virtio the host dictates the size and layout of the vq that
should be used. To talk to a device that has one vq with 128 elements the
guest needs to allocate at least 2 pages. Usually this is not a problem, but
sometimes the guest runs in a resource-restricted environment and then it may
not have enough memory to initialize all virtio devices present in the
system. One such environment is a BIOS. SeaBIOS currently has virtio block
support. Since the BIOS should be able to access the disk even after the OS
is launched, the vq should be allocated from a special memory region that
will be marked as unavailable to the OS, but such memory is scarce. Because
the vq is so huge only a couple of virtio disks can be initialized by the BIOS.

It would be nice if a guest were able to tell the host what vq size
should be used instead. The BIOS issues only one request at a time anyway,
so it needs only one element in the vq. It does not care too much about
performance either, so it can tell the host not to align the used index to a
page boundary. This way a vq of one element shouldn't take more than a
couple hundred bytes.
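
(To put numbers on this -- a sketch following the legacy vring_size() formula
from virtio_ring.h, with 16-byte descriptors, 2-byte avail entries and 8-byte
used entries:)

	unsigned vring_bytes(unsigned num, unsigned long align)
	{
		unsigned head = 16 * num + 2 * (3 + num);	/* descriptors + avail */
		unsigned used = 2 * 3 + 8 * num;		/* used ring */

		return ((head + align - 1) & ~(align - 1)) + used;
	}

	/* vring_bytes(128, 4096) = 5126 -> needs two pages today            */
	/* vring_bytes(1, 4)      = 38   -> tens of bytes for a one-element, */
	/*                                  byte-aligned ring as proposed    */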

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Bug in kvm_set_irq

2011-02-25 Thread Gleb Natapov
CCing Michael in case he missed this.

On Fri, Feb 25, 2011 at 10:07:22AM +0100, Jean-Philippe Menil wrote:
> Hi,
> 
> Each time i try tou use vhost_net, i'm facing a kernel bug.
> I do a "modprobe vhost_net", and start guest whith vhost=on.
> 
> Following is a trace with a kernel 2.6.37, but  i had the same
> problem with 2.6.36 (cf https://lkml.org/lkml/2010/11/30/29).
> 
> The bug only occurs whith vhost_net charged, so i don't know if this
> is a bug in kvm module code or in the vhost_net code.
> 
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243100] BUG: unable to handle kernel paging request at
> 2458
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243250] IP: [] kvm_set_irq+0x2a/0x130 [kvm]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243378] PGD 45d363067 PUD 45e77a067 PMD 0
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243556] Oops:  [#1] SMP
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243692] last sysfs file:
> /sys/devices/pci:00/:00:0d.0/:05:00.0/:06:00.0/irq
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.243777] CPU 0
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.243820] Modules linked in: vhost_net macvtap macvlan tun
> powernow_k8 mperf cpufreq_userspace cpufreq_stats cpufreq_powersave
> cpufreq_ondemand fre
> q_table cpufreq_conservative fuse xt_physdev ip6t_LOG
> ip6table_filter ip6_tables ipt_LOG xt_multiport xt_limit xt_tcpudp
> xt_state iptable_filter ip_tables x_tables nf_conntrack_tftp
> nf_conntrack_ftp nf_connt
> rack_ipv4 nf_defrag_ipv4 8021q bridge stp ext2 mbcache
> dm_round_robin dm_multipath nf_conntrack_ipv6 nf_conntrack
> nf_defrag_ipv6 kvm_amd kvm ipv6 snd_pcm snd_timer snd soundcore
> snd_page_alloc tpm_tis tpm ps
> mouse dcdbas tpm_bios processor i2c_nforce2 shpchp pcspkr ghes
> serio_raw joydev evdev pci_hotplug i2c_core hed button thermal_sys
> xfs exportfs dm_mod sg sr_mod cdrom usbhid hid usb_storage ses
> sd_mod enclosu
> re megaraid_sas ohci_hcd lpfc scsi_transport_fc scsi_tgt bnx2
> scsi_mod ehci_hcd [last unloaded: scsi_wait_scan]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.246123]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] Pid: 10, comm: kworker/0:1 Not tainted
> 2.6.37-dsiun-110105 #17 0K543T/PowerEdge M605
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RIP: 0010:[]  []
> kvm_set_irq+0x2a/0x130 [kvm]
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RSP: 0018:88045fc89d30  EFLAGS: 00010246
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RAX:  RBX: 001a RCX:
> 0001
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RDX:  RSI:  RDI:
> 
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] RBP:  R08: 0001 R09:
> 880856a91e48
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] R10:  R11:  R12:
> 
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] R13: 0001 R14:  R15:
> 
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] FS:  7f617986c710() GS:88007f80()
> knlGS:
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] CS:  0010 DS:  ES:  CR0: 8005003b
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] CR2: 2458 CR3: 00045d197000 CR4:
> 06f0
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] DR0:  DR1:  DR2:
> 
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] DR3:  DR6: 0ff0 DR7:
> 0400
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] Process kworker/0:1 (pid: 10, threadinfo
> 88045fc88000, task 88085fc53c30)
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [  685.246123] Stack:
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123]  88045fc89fd8 000119c0 88045fc88010
> 88085fc53ee8
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123]  88045fc89fd8 88085fc53ee0 88085fc53c30
> 000119c0
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123]  000119c0 8137f7ce 88007f80df40
> 
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123] Call Trace:
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123]  [] ? common_interrupt+0xe/0x13
> Feb 23 13:56:19 ayrshire.u06.univ-nantes.prive kernel: [
> 685.246123]  [] ? irqfd_inject+0x0/

Re: [PATCH] virtio: remove virtio-pci root device

2011-01-10 Thread Gleb Natapov
On Fri, Jan 07, 2011 at 02:55:06AM -0600, Milton Miller wrote:
> We sometimes need to map between the virtio device and
> the given pci device. One such use is OS installer that
> gets the boot pci device from BIOS and needs to
> find the relevant block device. Since it can't,
> installation fails.
> 
> Instead of creating a top-level devices/virtio-pci
> directory, create each device under the corresponding
> pci device node.  Symlinks to all virtio-pci
> devices can be found under the pci driver link in
> bus/pci/drivers/virtio-pci/devices, and all virtio
> devices under drivers/bus/virtio/devices.
> 
> Signed-off-by: Milton Miller 
Acked-by: Gleb Natapov 

> ---
> 
> This is an alternative to the patch by Michael S. Tsirkin
> titled "virtio-pci: add softlinks between virtio and pci"
> https://patchwork.kernel.org/patch/454581/
> 
> It creates simpler code, uses less memory, and should
> be even easier use by the installer as it won't have to
> know a virtio symlink to follow (just follow none).
> 
> Compile tested only as I don't have kvm setup.
> 
> 
> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> index ef8d9d5..4fb5b2b 100644
> --- a/drivers/virtio/virtio_pci.c
> +++ b/drivers/virtio/virtio_pci.c
> @@ -96,11 +96,6 @@ static struct pci_device_id virtio_pci_id_table[] = {
>  
>  MODULE_DEVICE_TABLE(pci, virtio_pci_id_table);
>  
> -/* A PCI device has it's own struct device and so does a virtio device so
> - * we create a place for the virtio devices to show up in sysfs.  I think it
> - * would make more sense for virtio to not insist on having it's own device. 
> */
> -static struct device *virtio_pci_root;
> -
>  /* Convert a generic virtio device to our structure */
>  static struct virtio_pci_device *to_vp_device(struct virtio_device *vdev)
>  {
> @@ -629,7 +624,7 @@ static int __devinit virtio_pci_probe(struct pci_dev 
> *pci_dev,
>   if (vp_dev == NULL)
>   return -ENOMEM;
>  
> - vp_dev->vdev.dev.parent = virtio_pci_root;
> + vp_dev->vdev.dev.parent = &pci_dev->dev;
>   vp_dev->vdev.dev.release = virtio_pci_release_dev;
>   vp_dev->vdev.config = &virtio_pci_config_ops;
>   vp_dev->pci_dev = pci_dev;
> @@ -717,17 +712,7 @@ static struct pci_driver virtio_pci_driver = {
>  
>  static int __init virtio_pci_init(void)
>  {
> - int err;
> -
> - virtio_pci_root = root_device_register("virtio-pci");
> - if (IS_ERR(virtio_pci_root))
> - return PTR_ERR(virtio_pci_root);
> -
> - err = pci_register_driver(&virtio_pci_driver);
> - if (err)
> - root_device_unregister(virtio_pci_root);
> -
> - return err;
> + return pci_register_driver(&virtio_pci_driver);
>  }
>  
>  module_init(virtio_pci_init);
> @@ -735,7 +720,6 @@ module_init(virtio_pci_init);
>  static void __exit virtio_pci_exit(void)
>  {
>   pci_unregister_driver(&virtio_pci_driver);
> - root_device_unregister(virtio_pci_root);
>  }
>  
>  module_exit(virtio_pci_exit);

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: remove virtio-pci root device

2011-01-10 Thread Gleb Natapov
On Mon, Jan 10, 2011 at 03:31:40PM +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 10, 2011 at 02:50:11PM +0200, Gleb Natapov wrote:
> > On Mon, Jan 10, 2011 at 02:08:08PM +0200, Michael S. Tsirkin wrote:
> > > On Mon, Jan 10, 2011 at 01:42:19PM +0200, Gleb Natapov wrote:
> > > > On Mon, Jan 10, 2011 at 01:27:39PM +0200, Gleb Natapov wrote:
> > > > > On Mon, Jan 10, 2011 at 01:22:05PM +0200, Michael S. Tsirkin wrote:
> > > > > > On Mon, Jan 10, 2011 at 01:18:59PM +0200, Gleb Natapov wrote:
> > > > > > > On Sun, Jan 09, 2011 at 05:18:21PM +0200, Michael S. Tsirkin 
> > > > > > > wrote:
> > > > > > > > On Fri, Jan 07, 2011 at 02:55:06AM -0600, Milton Miller wrote:
> > > > > > > > > We sometimes need to map between the virtio device and
> > > > > > > > > the given pci device. One such use is OS installer that
> > > > > > > > > gets the boot pci device from BIOS and needs to
> > > > > > > > > find the relevant block device. Since it can't,
> > > > > > > > > installation fails.
> > > > > > > > > 
> > > > > > > > > Instead of creating a top-level devices/virtio-pci
> > > > > > > > > directory, create each device under the corresponding
> > > > > > > > > pci device node.  Symlinks to all virtio-pci
> > > > > > > > > devices can be found under the pci driver link in
> > > > > > > > > bus/pci/drivers/virtio-pci/devices, and all virtio
> > > > > > > > > devices under drivers/bus/virtio/devices.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Milton Miller 
> > > > > > > > 
> > > > > > > > OK, this works fine for me.  I played with options to add compat
> > > > > > > > softlinks under devices/virtio-pci but we still don't get 
> > > > > > > > exactly the
> > > > > > > > same layout and since I don't think anyone actually uses them, 
> > > > > > > > it's
> > > > > > > > probably ok to just to the simple thing.
> > > > > > > > 
> > > > > > > > Tested/Acked-by: Michael S. Tsirkin 
> > > > > > > > 
> > > > > > > > Rusty, since this help fix at least one user, any chance this 
> > > > > > > > can be put
> > > > > > > > in 2.6.38? OK to backport to -stable?
> > > > > > > > 
> > > > > > > > Gleb, could you try this out too?
> > > > > > > > 
> > > > > > > With this patch if I have 3 virtio disks for a VM I get:
> > > > > > > /sys/devices/pci:00/:00:04.0/virtio0/block/vda
> > > > > > > /sys/devices/pci:00/:00:05.0/virtio1/block/vdb
> > > > > > > /sys/devices/pci:00/:00:06.0/virtio2/block/vdc
> > > > > > > 
> > > > > > > Number after virtio has no much sense. It either should be 
> > > > > > > dropped at all
> > > > > > > or be always zero in case we will support more then one virtio 
> > > > > > > controller
> > > > > > > per pci card. In that case each virtio controller will have 
> > > > > > > directories
> > > > > > > virtio0/virtio1/virtio2... under same pci device directory.
> > > > > > 
> > > > > > Yes. But this is the bus name.  It must be unique - all devices
> > > > > > also appear under /sys/bus/virtio/devices.
> > > > > > 
> > > > > It is very strange king of bus that is spread over several PCI 
> > > > > devices :)
> > > > > It doesn't make much sense IMHO, but I can leave with it.
> > > > > 
> > > > I can't "leave" with it, but I can "live" with it.
> > > 
> > > The virtio bus is an attempt to make as many applications as
> > > possible work transparently on any virtio system,
> > > be it lguest, s390 or pci. Arbitrary IDs is just a hint to the
> > > applications 'don't rely on the name at all'.
> > > 
> > Shouldn't the sysfs directory reflect the real device topology?
> 
> What exactly is proposed here? Are you acking or nacking this patch?
> Prefer the older version? Can send your own?
> 
What is proposed is to get rid of the random number in the virtio
directory name. If this is complicated then, as I said above, I can live
with it as is.

> > There are other buses AFAIK that are connected differently on different
> > HW. How is virtio special?
> > What applications does this allow to work transparently?
> 
> I don't know which applications use the current sysfs topology
> but the assumption that there are none and we can break it
> arbitrarily seems a bit drastic.
> 
> > --
> > Gleb.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: remove virtio-pci root device

2011-01-10 Thread Gleb Natapov
On Mon, Jan 10, 2011 at 02:08:08PM +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 10, 2011 at 01:42:19PM +0200, Gleb Natapov wrote:
> > On Mon, Jan 10, 2011 at 01:27:39PM +0200, Gleb Natapov wrote:
> > > On Mon, Jan 10, 2011 at 01:22:05PM +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 10, 2011 at 01:18:59PM +0200, Gleb Natapov wrote:
> > > > > On Sun, Jan 09, 2011 at 05:18:21PM +0200, Michael S. Tsirkin wrote:
> > > > > > On Fri, Jan 07, 2011 at 02:55:06AM -0600, Milton Miller wrote:
> > > > > > > We sometimes need to map between the virtio device and
> > > > > > > the given pci device. One such use is OS installer that
> > > > > > > gets the boot pci device from BIOS and needs to
> > > > > > > find the relevant block device. Since it can't,
> > > > > > > installation fails.
> > > > > > > 
> > > > > > > Instead of creating a top-level devices/virtio-pci
> > > > > > > directory, create each device under the corresponding
> > > > > > > pci device node.  Symlinks to all virtio-pci
> > > > > > > devices can be found under the pci driver link in
> > > > > > > bus/pci/drivers/virtio-pci/devices, and all virtio
> > > > > > > devices under drivers/bus/virtio/devices.
> > > > > > > 
> > > > > > > Signed-off-by: Milton Miller 
> > > > > > 
> > > > > > OK, this works fine for me.  I played with options to add compat
> > > > > > softlinks under devices/virtio-pci but we still don't get exactly 
> > > > > > the
> > > > > > same layout and since I don't think anyone actually uses them, it's
> > > > > > probably ok to just to the simple thing.
> > > > > > 
> > > > > > Tested/Acked-by: Michael S. Tsirkin 
> > > > > > 
> > > > > > Rusty, since this help fix at least one user, any chance this can 
> > > > > > be put
> > > > > > in 2.6.38? OK to backport to -stable?
> > > > > > 
> > > > > > Gleb, could you try this out too?
> > > > > > 
> > > > > With this patch if I have 3 virtio disks for a VM I get:
> > > > > /sys/devices/pci:00/:00:04.0/virtio0/block/vda
> > > > > /sys/devices/pci:00/:00:05.0/virtio1/block/vdb
> > > > > /sys/devices/pci:00/:00:06.0/virtio2/block/vdc
> > > > > 
> > > > > Number after virtio has no much sense. It either should be dropped at 
> > > > > all
> > > > > or be always zero in case we will support more then one virtio 
> > > > > controller
> > > > > per pci card. In that case each virtio controller will have 
> > > > > directories
> > > > > virtio0/virtio1/virtio2... under same pci device directory.
> > > > 
> > > > Yes. But this is the bus name.  It must be unique - all devices
> > > > also appear under /sys/bus/virtio/devices.
> > > > 
> > > It is a very strange kind of bus that is spread over several PCI devices :)
> > > It doesn't make much sense IMHO, but I can leave with it.
> > > 
> > I can't "leave" with it, but I can "live" with it.
> 
> The virtio bus is an attempt to make as many applications as
> possible work transparently on any virtio system,
> be it lguest, s390 or pci. Arbitrary IDs is just a hint to the
> applications 'don't rely on the name at all'.
> 
Shouldn't the sysfs directory reflect the real device topology? There are
other buses AFAIK that are connected differently on different HW. How is
virtio special? What applications does this allow to work transparently?

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: remove virtio-pci root device

2011-01-10 Thread Gleb Natapov
On Mon, Jan 10, 2011 at 01:27:39PM +0200, Gleb Natapov wrote:
> On Mon, Jan 10, 2011 at 01:22:05PM +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 10, 2011 at 01:18:59PM +0200, Gleb Natapov wrote:
> > > On Sun, Jan 09, 2011 at 05:18:21PM +0200, Michael S. Tsirkin wrote:
> > > > On Fri, Jan 07, 2011 at 02:55:06AM -0600, Milton Miller wrote:
> > > > > We sometimes need to map between the virtio device and
> > > > > the given pci device. One such use is OS installer that
> > > > > gets the boot pci device from BIOS and needs to
> > > > > find the relevant block device. Since it can't,
> > > > > installation fails.
> > > > > 
> > > > > Instead of creating a top-level devices/virtio-pci
> > > > > directory, create each device under the corresponding
> > > > > pci device node.  Symlinks to all virtio-pci
> > > > > devices can be found under the pci driver link in
> > > > > bus/pci/drivers/virtio-pci/devices, and all virtio
> > > > > devices under drivers/bus/virtio/devices.
> > > > > 
> > > > > Signed-off-by: Milton Miller 
> > > > 
> > > > OK, this works fine for me.  I played with options to add compat
> > > > softlinks under devices/virtio-pci but we still don't get exactly the
> > > > same layout and since I don't think anyone actually uses them, it's
> > > > probably ok to just to the simple thing.
> > > > 
> > > > Tested/Acked-by: Michael S. Tsirkin 
> > > > 
> > > > Rusty, since this help fix at least one user, any chance this can be put
> > > > in 2.6.38? OK to backport to -stable?
> > > > 
> > > > Gleb, could you try this out too?
> > > > 
> > > With this patch if I have 3 virtio disks for a VM I get:
> > > /sys/devices/pci:00/:00:04.0/virtio0/block/vda
> > > /sys/devices/pci:00/:00:05.0/virtio1/block/vdb
> > > /sys/devices/pci:00/:00:06.0/virtio2/block/vdc
> > > 
> > > Number after virtio has no much sense. It either should be dropped at all
> > > or be always zero in case we will support more then one virtio controller
> > > per pci card. In that case each virtio controller will have directories
> > > virtio0/virtio1/virtio2... under same pci device directory.
> > 
> > Yes. But this is the bus name.  It must be unique - all devices
> > also appear under /sys/bus/virtio/devices.
> > 
> It is a very strange kind of bus that is spread over several PCI devices :)
> It doesn't make much sense IMHO, but I can leave with it.
> 
I can't "leave" with it, but I can "live" with it.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: remove virtio-pci root device

2011-01-10 Thread Gleb Natapov
On Mon, Jan 10, 2011 at 01:22:05PM +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 10, 2011 at 01:18:59PM +0200, Gleb Natapov wrote:
> > On Sun, Jan 09, 2011 at 05:18:21PM +0200, Michael S. Tsirkin wrote:
> > > On Fri, Jan 07, 2011 at 02:55:06AM -0600, Milton Miller wrote:
> > > > We sometimes need to map between the virtio device and
> > > > the given pci device. One such use is OS installer that
> > > > gets the boot pci device from BIOS and needs to
> > > > find the relevant block device. Since it can't,
> > > > installation fails.
> > > > 
> > > > Instead of creating a top-level devices/virtio-pci
> > > > directory, create each device under the corresponding
> > > > pci device node.  Symlinks to all virtio-pci
> > > > devices can be found under the pci driver link in
> > > > bus/pci/drivers/virtio-pci/devices, and all virtio
> > > > devices under drivers/bus/virtio/devices.
> > > > 
> > > > Signed-off-by: Milton Miller 
> > > 
> > > OK, this works fine for me.  I played with options to add compat
> > > softlinks under devices/virtio-pci but we still don't get exactly the
> > > same layout and since I don't think anyone actually uses them, it's
> > > probably ok to just to the simple thing.
> > > 
> > > Tested/Acked-by: Michael S. Tsirkin 
> > > 
> > > Rusty, since this help fix at least one user, any chance this can be put
> > > in 2.6.38? OK to backport to -stable?
> > > 
> > > Gleb, could you try this out too?
> > > 
> > With this patch if I have 3 virtio disks for a VM I get:
> > /sys/devices/pci:00/:00:04.0/virtio0/block/vda
> > /sys/devices/pci:00/:00:05.0/virtio1/block/vdb
> > /sys/devices/pci:00/:00:06.0/virtio2/block/vdc
> > 
> > Number after virtio has no much sense. It either should be dropped at all
> > or be always zero in case we will support more then one virtio controller
> > per pci card. In that case each virtio controller will have directories
> > virtio0/virtio1/virtio2... under same pci device directory.
> 
> Yes. But this is the bus name.  It must be unique - all devices
> also appear under /sys/bus/virtio/devices.
> 
It is a very strange kind of bus that is spread over several PCI devices :)
It doesn't make much sense IMHO, but I can leave with it.

> > > > ---
> > > > 
> > > > This is an alternative to the patch by Michael S. Tsirkin
> > > > titled "virtio-pci: add softlinks between virtio and pci"
> > > > https://patchwork.kernel.org/patch/454581/
> > > > 
> > > > It creates simpler code, uses less memory, and should
> > > > be even easier use by the installer as it won't have to
> > > > know a virtio symlink to follow (just follow none).
> > > > 
> > > > Compile tested only as I don't have kvm setup.
> > > > 
> > > > 
> > > > diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> > > > index ef8d9d5..4fb5b2b 100644
> > > > --- a/drivers/virtio/virtio_pci.c
> > > > +++ b/drivers/virtio/virtio_pci.c
> > > > @@ -96,11 +96,6 @@ static struct pci_device_id virtio_pci_id_table[] = {
> > > >  
> > > >  MODULE_DEVICE_TABLE(pci, virtio_pci_id_table);
> > > >  
> > > > -/* A PCI device has it's own struct device and so does a virtio device 
> > > > so
> > > > - * we create a place for the virtio devices to show up in sysfs.  I 
> > > > think it
> > > > - * would make more sense for virtio to not insist on having it's own 
> > > > device. */
> > > > -static struct device *virtio_pci_root;
> > > > -
> > > >  /* Convert a generic virtio device to our structure */
> > > >  static struct virtio_pci_device *to_vp_device(struct virtio_device 
> > > > *vdev)
> > > >  {
> > > > @@ -629,7 +624,7 @@ static int __devinit virtio_pci_probe(struct 
> > > > pci_dev *pci_dev,
> > > > if (vp_dev == NULL)
> > > > return -ENOMEM;
> > > >  
> > > > -   vp_dev->vdev.dev.parent = virtio_pci_root;
> > > > +   vp_dev->vdev.dev.parent = &pci_dev->dev;
> > > > vp_dev->vdev.dev.release = virtio_pci_release_dev;
> > > > vp_dev->vdev.config = &virtio_pci_config_ops;

Re: [PATCH] virtio: remove virtio-pci root device

2011-01-10 Thread Gleb Natapov
On Sun, Jan 09, 2011 at 05:18:21PM +0200, Michael S. Tsirkin wrote:
> On Fri, Jan 07, 2011 at 02:55:06AM -0600, Milton Miller wrote:
> > We sometimes need to map between the virtio device and
> > the given pci device. One such use is OS installer that
> > gets the boot pci device from BIOS and needs to
> > find the relevant block device. Since it can't,
> > installation fails.
> > 
> > Instead of creating a top-level devices/virtio-pci
> > directory, create each device under the corresponding
> > pci device node.  Symlinks to all virtio-pci
> > devices can be found under the pci driver link in
> > bus/pci/drivers/virtio-pci/devices, and all virtio
> > devices under drivers/bus/virtio/devices.
> > 
> > Signed-off-by: Milton Miller 
> 
> OK, this works fine for me.  I played with options to add compat
> softlinks under devices/virtio-pci but we still don't get exactly the
> same layout and since I don't think anyone actually uses them, it's
> probably ok to just to the simple thing.
> 
> Tested/Acked-by: Michael S. Tsirkin 
> 
> Rusty, since this help fix at least one user, any chance this can be put
> in 2.6.38? OK to backport to -stable?
> 
> Gleb, could you try this out too?
> 
With this patch if I have 3 virtio disks for a VM I get:
/sys/devices/pci:00/:00:04.0/virtio0/block/vda
/sys/devices/pci:00/:00:05.0/virtio1/block/vdb
/sys/devices/pci:00/:00:06.0/virtio2/block/vdc

Number after virtio has no much sense. It either should be dropped at all
or be always zero in case we will support more then one virtio controller
per pci card. In that case each virtio controller will have directories
virtio0/virtio1/virtio2... under same pci device directory.
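
(As an aside, a sketch of what an installer could now do with this layout --
a hypothetical helper, not from any patch: starting from the PCI device's
sysfs directory, e.g. the EDD boot device, find the virtioN child and read
the block device name under it:)

	#include <dirent.h>
	#include <stdio.h>
	#include <string.h>

	/* returns 0 and fills name[] (e.g. "vda") on success, -1 otherwise */
	static int virtio_block_name(const char *pci_dir, char *name, size_t len)
	{
		char path[512];
		struct dirent *e;
		DIR *d = opendir(pci_dir);
		int ret = -1;

		if (!d)
			return -1;
		while ((e = readdir(d)) != NULL)
			if (!strncmp(e->d_name, "virtio", 6))
				break;			/* virtio0, virtio1, ... */
		if (e) {
			snprintf(path, sizeof(path), "%s/%s/block", pci_dir, e->d_name);
			closedir(d);
			d = opendir(path);		/* holds a single vdX entry */
			while (d && (e = readdir(d)) != NULL)
				if (e->d_name[0] != '.') {
					snprintf(name, len, "%s", e->d_name);
					ret = 0;
					break;
				}
		}
		if (d)
			closedir(d);
		return ret;
	}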


> > ---
> > 
> > This is an alternative to the patch by Michael S. Tsirkin
> > titled "virtio-pci: add softlinks between virtio and pci"
> > https://patchwork.kernel.org/patch/454581/
> > 
> > It creates simpler code, uses less memory, and should
> > be even easier use by the installer as it won't have to
> > know a virtio symlink to follow (just follow none).
> > 
> > Compile tested only as I don't have kvm setup.
> > 
> > 
> > diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> > index ef8d9d5..4fb5b2b 100644
> > --- a/drivers/virtio/virtio_pci.c
> > +++ b/drivers/virtio/virtio_pci.c
> > @@ -96,11 +96,6 @@ static struct pci_device_id virtio_pci_id_table[] = {
> >  
> >  MODULE_DEVICE_TABLE(pci, virtio_pci_id_table);
> >  
> > -/* A PCI device has it's own struct device and so does a virtio device so
> > - * we create a place for the virtio devices to show up in sysfs.  I think 
> > it
> > - * would make more sense for virtio to not insist on having it's own 
> > device. */
> > -static struct device *virtio_pci_root;
> > -
> >  /* Convert a generic virtio device to our structure */
> >  static struct virtio_pci_device *to_vp_device(struct virtio_device *vdev)
> >  {
> > @@ -629,7 +624,7 @@ static int __devinit virtio_pci_probe(struct pci_dev 
> > *pci_dev,
> > if (vp_dev == NULL)
> > return -ENOMEM;
> >  
> > -   vp_dev->vdev.dev.parent = virtio_pci_root;
> > +   vp_dev->vdev.dev.parent = &pci_dev->dev;
> > vp_dev->vdev.dev.release = virtio_pci_release_dev;
> > vp_dev->vdev.config = &virtio_pci_config_ops;
> > vp_dev->pci_dev = pci_dev;
> > @@ -717,17 +712,7 @@ static struct pci_driver virtio_pci_driver = {
> >  
> >  static int __init virtio_pci_init(void)
> >  {
> > -   int err;
> > -
> > -   virtio_pci_root = root_device_register("virtio-pci");
> > -   if (IS_ERR(virtio_pci_root))
> > -   return PTR_ERR(virtio_pci_root);
> > -
> > -   err = pci_register_driver(&virtio_pci_driver);
> > -   if (err)
> > -   root_device_unregister(virtio_pci_root);
> > -
> > -   return err;
> > +   return pci_register_driver(&virtio_pci_driver);
> >  }
> >  
> >  module_init(virtio_pci_init);
> > @@ -735,7 +720,6 @@ module_init(virtio_pci_init);
> >  static void __exit virtio_pci_exit(void)
> >  {
> > pci_unregister_driver(&virtio_pci_driver);
> > -   root_device_unregister(virtio_pci_root);
> >  }
> >  
> >  module_exit(virtio_pci_exit);

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio-pci: add softlinks between virtio and pci

2011-01-06 Thread Gleb Natapov
On Wed, Jan 05, 2011 at 09:17:11PM +0200, Michael S. Tsirkin wrote:
> We sometimes need to map between the virtio device and
> the given pci device. One such use is OS installer that
> gets the boot pci device from BIOS and needs to
> find the relevant block device. Since it can't,
> installation fails.
> 
> Supply softlinks between these to make it possible.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> Gleb, could you please ack that this patch below
> will be enough to fix the installer issue that
> you see?
> 
ACK. With this patch, given the PCI address from EDD, I can find the device
node by looking at /sys/devices/pci0000:00/0000:edd_bdf/virtio_device/block/
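
For completeness, a minimal userspace sketch of that lookup (the helper name
and the error handling are mine and only illustrate the idea, they are not
part of the patch); it assumes the "virtio_device" link added below and the
usual /sys/bus/pci/devices layout:

#include <dirent.h>
#include <stdio.h>

/* Given the boot device's PCI address from EDD (e.g. "0000:00:04.0"),
 * follow the virtio_device softlink and scan block/ to find the disk. */
static int find_virtio_block_dev(const char *bdf, char *out, size_t len)
{
	char path[256];
	struct dirent *de;
	DIR *dir;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/virtio_device/block", bdf);
	dir = opendir(path);
	if (!dir)
		return -1;	/* no such device or no virtio_device link */
	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		snprintf(out, len, "/dev/%s", de->d_name); /* e.g. /dev/vda */
		closedir(dir);
		return 0;
	}
	closedir(dir);
	return -1;
}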

>  drivers/virtio/virtio_pci.c |   18 +-
>  1 files changed, 17 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> index ef8d9d5..06eb2f8 100644
> --- a/drivers/virtio/virtio_pci.c
> +++ b/drivers/virtio/virtio_pci.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  MODULE_AUTHOR("Anthony Liguori ");
>  MODULE_DESCRIPTION("virtio-pci");
> @@ -667,8 +668,21 @@ static int __devinit virtio_pci_probe(struct pci_dev 
> *pci_dev,
>   if (err)
>   goto out_set_drvdata;
>  
> - return 0;
> + err = sysfs_create_link(&pci_dev->dev.kobj, &vp_dev->vdev.dev.kobj,
> + "virtio_device");
> + if (err)
> + goto out_register_device;
> +
> + err = sysfs_create_link(&vp_dev->vdev.dev.kobj, &pci_dev->dev.kobj,
> + "bus_device");
> + if (err)
> + goto out_create_link;
>  
> + return 0;
> +out_create_link:
> + sysfs_remove_link(&pci_dev->dev.kobj, "virtio_device");
> +out_register_device:
> + unregister_virtio_device(&vp_dev->vdev);
>  out_set_drvdata:
>   pci_set_drvdata(pci_dev, NULL);
>   pci_iounmap(pci_dev, vp_dev->ioaddr);
> @@ -685,6 +699,8 @@ static void __devexit virtio_pci_remove(struct pci_dev 
> *pci_dev)
>  {
>   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
>  
> + sysfs_remove_link(&vp_dev->vdev.dev.kobj, "bus_device");
> + sysfs_remove_link(&pci_dev->dev.kobj, "virtio_device");
>   unregister_virtio_device(&vp_dev->vdev);
>  }
>  
> -- 
> 1.7.3.2.91.g446ac

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-05-06 Thread Gleb Natapov
On Wed, May 05, 2010 at 10:47:10AM -0700, Pankaj Thakkar wrote:
> 
> 
> > -Original Message-
> > From: Christoph Hellwig [mailto:h...@infradead.org]
> > Sent: Wednesday, May 05, 2010 10:40 AM
> > To: Dmitry Torokhov
> > Cc: Christoph Hellwig; pv-driv...@vmware.com; Pankaj Thakkar;
> > net...@vger.kernel.org; linux-ker...@vger.kernel.org;
> > virtualization@lists.linux-foundation.org
> > Subject: Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for
> > vmxnet3
> > 
> > On Wed, May 05, 2010 at 10:35:28AM -0700, Dmitry Torokhov wrote:
> > > Yes, with the exception that the only body of code that will be
> > > accepted by the shell should be GPL-licensed and thus open and
> > available
> > > for examining. This is not different from having a standard kernel
> > > module that is loaded normally and plugs into a certain subsystem.
> > > The difference is that the binary resides not on guest filesystem
> > > but elsewhere.
> > 
> > Forget about the licensing.  Loading binary blobs written to a shim
> > layer is a complete pain in the ass and totally unsupportable, and
> > also uninteresting because of the overhead.
> 
> [PT] Why do you think it is unsupportable? How different is it from any module
> written against a well maintained interface? What overhead are you talking 
> about?
> 
The overhead of interpreting the bytecode the plugin is written in. Or are you
saying the plugin is x86 assembly (32-bit or 64-bit, btw?) and other arches
will have to carry an in-kernel x86 emulator to use the plugin (like some of
them had for vgabios)?

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/3] vhost: fix get_user_pages_fast error handling

2010-02-23 Thread Gleb Natapov
On Tue, Feb 23, 2010 at 07:39:08PM +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 23, 2010 at 07:39:52PM +0200, Gleb Natapov wrote:
> > On Tue, Feb 23, 2010 at 07:32:58PM +0200, Michael S. Tsirkin wrote:
> > > On Tue, Feb 23, 2010 at 07:34:34PM +0200, Gleb Natapov wrote:
> > > > On Tue, Feb 23, 2010 at 06:57:58PM +0200, Michael S. Tsirkin wrote:
> > > > > get_user_pages_fast returns number of pages on success, negative value
> > > > > on failure, but never 0. Fix vhost code to match this logic.
> > > > > 
> > > > > Signed-off-by: Michael S. Tsirkin 
> > > > > ---
> > > > >  drivers/vhost/vhost.c |3 ++-
> > > > >  1 files changed, 2 insertions(+), 1 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > index d4f8fdf..d003504 100644
> > > > > --- a/drivers/vhost/vhost.c
> > > > > +++ b/drivers/vhost/vhost.c
> > > > > @@ -646,8 +646,9 @@ static int set_bit_to_user(int nr, void __user 
> > > > > *addr)
> > > > >   int bit = nr + (log % PAGE_SIZE) * 8;
> > > > >   int r;
> > > > >   r = get_user_pages_fast(log, 1, 1, &page);
> > > > > - if (r)
> > > > > + if (r < 0)
> > > > >   return r;
> > > > > + BUG_ON(r != 1);
> > > > Can't this be easily triggered from user space?
> > > 
> > > I think no. get_user_pages_fast always returns number of pages
> > > pinned (in this case always 1) or an error (< 0).
> > > Anything else is a kernel bug.
> > > 
> > But what if page is unmapped from userspace?
> 
> Then we get -EFAULT
> 
Ah correct.
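
For the archive, the contract boils down to the sketch below (the helper name
is made up and this only restates the patch above, it is not new vhost code):
get_user_pages_fast() returns the number of pages it pinned, here always 1,
or a negative errno such as -EFAULT when the address is unmapped; it never
returns 0.

#include <linux/mm.h>	/* get_user_pages_fast() */
#include <linux/bug.h>	/* BUG_ON() */

static int pin_one_user_page(unsigned long addr, struct page **pagep)
{
	int r = get_user_pages_fast(addr, 1, 1, pagep);

	if (r < 0)		/* e.g. -EFAULT when addr is not mapped */
		return r;
	BUG_ON(r != 1);		/* anything other than 1 page is a kernel bug */
	return 0;
}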

> > > > >   base = kmap_atomic(page, KM_USER0);
> > > > >   set_bit(bit, base);
> > > > >   kunmap_atomic(base, KM_USER0);
> > > > > -- 
> > > > > 1.7.0.18.g0d53a5
> > > > 
> > > > --
> > > > Gleb.
> > 
> > --
> > Gleb.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/3] vhost: fix get_user_pages_fast error handling

2010-02-23 Thread Gleb Natapov
On Tue, Feb 23, 2010 at 07:32:58PM +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 23, 2010 at 07:34:34PM +0200, Gleb Natapov wrote:
> > On Tue, Feb 23, 2010 at 06:57:58PM +0200, Michael S. Tsirkin wrote:
> > > get_user_pages_fast returns number of pages on success, negative value
> > > on failure, but never 0. Fix vhost code to match this logic.
> > > 
> > > Signed-off-by: Michael S. Tsirkin 
> > > ---
> > >  drivers/vhost/vhost.c |3 ++-
> > >  1 files changed, 2 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index d4f8fdf..d003504 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -646,8 +646,9 @@ static int set_bit_to_user(int nr, void __user *addr)
> > >   int bit = nr + (log % PAGE_SIZE) * 8;
> > >   int r;
> > >   r = get_user_pages_fast(log, 1, 1, &page);
> > > - if (r)
> > > + if (r < 0)
> > >   return r;
> > > + BUG_ON(r != 1);
> > Can't this be easily triggered from user space?
> 
> I think no. get_user_pages_fast always returns number of pages
> pinned (in this case always 1) or an error (< 0).
> Anything else is a kernel bug.
> 
But what if page is unmapped from userspace?

> > >   base = kmap_atomic(page, KM_USER0);
> > >   set_bit(bit, base);
> > >   kunmap_atomic(base, KM_USER0);
> > > -- 
> > > 1.7.0.18.g0d53a5
> > 
> > --
> > Gleb.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/3] vhost: fix get_user_pages_fast error handling

2010-02-23 Thread Gleb Natapov
On Tue, Feb 23, 2010 at 06:57:58PM +0200, Michael S. Tsirkin wrote:
> get_user_pages_fast returns number of pages on success, negative value
> on failure, but never 0. Fix vhost code to match this logic.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/vhost/vhost.c |3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index d4f8fdf..d003504 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -646,8 +646,9 @@ static int set_bit_to_user(int nr, void __user *addr)
>   int bit = nr + (log % PAGE_SIZE) * 8;
>   int r;
>   r = get_user_pages_fast(log, 1, 1, &page);
> - if (r)
> + if (r < 0)
>   return r;
> + BUG_ON(r != 1);
Can't this be easily triggered from user space?

>   base = kmap_atomic(page, KM_USER0);
>   set_bit(bit, base);
>   kunmap_atomic(base, KM_USER0);
> -- 
> 1.7.0.18.g0d53a5

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


virtio-blk + suspend to disk

2009-09-30 Thread Gleb Natapov
Hi,

Does anybody know whether the subject (virtio-blk + suspend to disk) is
supported? It doesn't work, and it seems that the virtio-pci device has no way
to know that a suspend/resume happened.
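
To make the question concrete, this is roughly the shape of what appears to be
missing in drivers/virtio/virtio_pci.c. It is only a sketch: the hook bodies
and the virtio re-initialization step are my assumptions, nothing like this
exists in the driver today.

#ifdef CONFIG_PM
static int virtio_pci_suspend(struct pci_dev *pci_dev, pm_message_t state)
{
	pci_save_state(pci_dev);
	pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state));
	return 0;
}

static int virtio_pci_resume(struct pci_dev *pci_dev)
{
	pci_set_power_state(pci_dev, PCI_D0);
	pci_restore_state(pci_dev);
	/* The virtio core would also have to reset the device, renegotiate
	 * features and re-kick the virtqueues here; that is the part that is
	 * missing and, I suspect, why suspend to disk breaks. */
	return 0;
}
#endif

static struct pci_driver virtio_pci_driver = {
	.name		= "virtio-pci",
	.id_table	= virtio_pci_id_table,
	.probe		= virtio_pci_probe,
	.remove		= __devexit_p(virtio_pci_remove),
#ifdef CONFIG_PM
	.suspend	= virtio_pci_suspend,
	.resume		= virtio_pci_resume,
#endif
};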

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCHv3 2/2] vhost_net: a kernel-level virtio server

2009-08-21 Thread Gleb Natapov
On Thu, Aug 20, 2009 at 04:38:17PM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 20, 2009 at 03:10:54PM +0200, Arnd Bergmann wrote:
> > On Thursday 20 August 2009, Michael S. Tsirkin wrote:
> > > On Wed, Aug 19, 2009 at 05:27:07PM +0200, Arnd Bergmann wrote:
> > > > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > > > On Wed, Aug 19, 2009 at 03:46:44PM +0200, Arnd Bergmann wrote:
> > > > > > On Wednesday 19 August 2009, Michael S. Tsirkin wrote:
> > > > > >
> > > > > > Leaving that aside for now, you could replace VHOST_NET_SET_SOCKET,
> > > > > > VHOST_SET_OWNER, VHOST_RESET_OWNER
> > > > > 
> > > > > SET/RESET OWNER is still needed: otherwise if you share a descriptor
> > > > > with another process, it can corrupt your memory.
> > > > 
> > > > How? The point of using user threads is that you only ever access the
> > > > address space of the thread that called the ioctl.
> > > 
> > > Think about this example with processes A and B sharing an fd:
> > > A does SET_USED_ADDRESS
> > > B does SET_USED_ADDRESS
> > > A does VHOST_NET_SPLICE
> > > See how stuff gets written into a random place in memory of A?
> > 
> > Yes, I didn't think of that. It doesn't seem like a big problem
> > though, because it's a clear misuse of the API (I guess your
> > current code returns an error for one of the SET_USED_ADDRESS
> > ioctls), so I would see it as a classic garbage-in garbage-out
> > case.
> > 
> > It may even work in the case that the sharing of the fd resulted
> > from a fork, where the address contains the same buffer in both
> > processes. I can't think of a reason why you would want to use
> > it like that though.
> 
> It doesn't matter that I don't want this: allowing 1 process corrupt
> another's memory is a security issue.  Once you get an fd, you want to
> be able to use it without worrying that a bug in another process will
> crash yours.
> 
If B's SET_USED_ADDRESS fails, how can one process corrupt the memory of
another process?

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-07-29 Thread Gleb Natapov
On Wed, Jul 29, 2009 at 01:14:18PM +0530, Amit Shah wrote:
> On (Tue) Jul 28 2009 [08:42:32], Anthony Liguori wrote:
> > Amit Shah wrote:
> >> Right; use virtio just as the transport and all the interesting
> >> activity happens in userspaces. That was the basis with which I started.
> >> I can imagine dbus doing the copy/paste, lock screen, etc. actions.
> >>
> >> However for libguestfs, dbus isn't an option and they already have some
> >> predefined agents for each port. So libguestfs is an example for a
> >> multi-port usecase for virtio-serial.
> >>   
> >
> > Or don't use dbus and use something that libguestfs is able to embed.   
> > The fact that libguestfs doesn't want dbus in the guest is not an  
> > argument for using a higher level kernel interface especially one that  
> > doesn't meet the requirements of the interface.
> 
> But why do we want to limit the device to only one port? It's not too
> complex supporting additional ones.
> 
> As I see it qemu and the kernel should provide the basic abstraction for
> the userspace to go do its job. Why create unnecessary barriers?
> 
I agree. If userspace wants to, it may use only one channel and demultiplex
messages by itself, but we shouldn't force it to. Also, one of the
requirements for virtio-serial is to have connect/disconnect notifications,
which is not possible with demultiplexing in userspace.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: kvm (on core 2 duo) freezes shortly after startup

2009-07-27 Thread Gleb Natapov
On Tue, Jul 28, 2009 at 11:47:43AM +0530, Haneef Syed wrote:
> Hi,
> 
> Description:
> kvm (on core 2 duo) freezes shortly after startup.
> 
> Additional info:
> * package version(s)
> kernel 2.6.21
> qemu 0.9.1-1
> 
Is this the kvm that comes with 2.6.21, or kvm-88 compiled for kernel 2.6.21?

> * config and/or log files etc.
> [redhat as guest]
> dmesg:
> kvm: guest NX capability removed
> 
> Steps to reproduce:
> modprobe kvm-intel
> 
> [redhat]
> qemu-kvm -hda new.qcow2 -m 512 -cdrom rhel-server-5.3-i386-dvd.iso -boot d
> 
> just after issuing above command, kvm (on core 2 duo) freezes shortly 
> after startup, please suggest me on this problem..?
> 
> Thanks & Regards, 
> 
> Haneef Syed 
> 
> "Unless you try to do something beyond what 
>   you have already mastered, 
>   you will never grow." 
> 
> 
> __

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: KVM-88 Crash with Linux-2.6.25 kernel

2009-07-23 Thread Gleb Natapov
On Fri, Jul 24, 2009 at 07:49:56AM +0530, Haneef Syed wrote:
> Yes.. Host kernel oop like below.. Both Host and guest goes to deadlock 
> condition.
> 
> 
This oops doesn't look KVM-specific. Try running with -net none.

> 
> 
> Gleb Natapov  
> Sent by: kvm-ow...@vger.kernel.org
> 07/23/2009 09:02 PM
> 
> To
> Haneef Syed 
> cc
> a...@redhat.com, k...@vger.kernel.org, 
> virtualization@lists.linux-foundation.org
> Subject
> Re: KVM-88 Crash with Linux-2.6.25 kernel
> 
> 
> 
> 
> 
> 
> On Thu, Jul 23, 2009 at 07:17:51PM +0530, Haneef Syed wrote:
> > Hi Gleb,
> > 
> > After installing kvm-88 kernel package and issuing modprobe kvm-intel 
> and 
> > in dmesg it displays as follows
> > 
> > "loaded kvm module (kvm-kmod-devel-86)" 
> > 
> > it confirms that kvm-88 kernel modules are installed properly but still 
> i 
> > m unable to fix installation bug..?
> > 
> What is the bug? Host kernel oops like below?
> 
> > 
> > 
> > 
> > 
> > Gleb Natapov  
> > 07/23/2009 04:20 PM
> > 
> > To
> > Haneef Syed 
> > cc
> > a...@redhat.com, k...@vger.kernel.org, 
> > virtualization@lists.linux-foundation.org
> > Subject
> > Re: KVM-88 Crash with Linux-2.6.25 kernel
> > 
> > 
> > 
> > 
> > 
> > 
> > On Thu, Jul 23, 2009 at 03:13:25PM +0530, Haneef Syed wrote:
> > > If i use kvm-88 kernel package then system does not crash
> > > 
> > > but it uses qemu emulation but not kvm. I want kvm emulation so
> > > 
> > If it uses qemu emulation then you haven't installed kvm-88 kernel 
> modules
> > properly.
> > 
> > > can u tell me how to achieve that.
> > > 
> > > 
> > > 
> > > 
> > > Gleb Natapov  
> > > 07/23/2009 03:09 PM
> > > 
> > > To
> > > Haneef Syed 
> > > cc
> > > k...@vger.kernel.org, virtualization@lists.linux-foundation.org, 
> > > a...@redhat.com
> > > Subject
> > > Re: KVM-88 Crash with Linux-2.6.25 kernel
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > On Thu, Jul 23, 2009 at 03:06:07PM +0530, Haneef Syed wrote:
> > > > Hi All,
> > > > 
> > > > I am using kvm-88 user space package with linux-2.6.25 kernel. I f i 
> 
> > run 
> > > 
> > > > following command system gets hang.
> > > > 
> > > Please use kvm-88 kernel package.
> > > 
> > > > ./qemu-system-x86_64 -hda new.qcow2 -cdrom 
> > rhel-server-5.3-i386-dvd.iso 
> > > > -boot d -m 512
> > > > 
> > > > When i print dmesg it displays as follows:
> > > > 
> > > > 
> > > > BUG: unable to handle kernel NULL pointer dereference at virtual 
> > address 
> > > 
> > > > 0024
> > > >  printing eip:
> > > > c0425fb0
> > > > *pde = 3963c067
> > > > Oops:  [#1]
> > > > SMP
> > > > Modules linked in: ipt_MASQUERADE iptable_nat nf_nat 
> nf_conntrack_ipv4 
> > 
> > > > xt_state nf_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter 
> > > > ip_tables x_tables bridge rfcomm l2cap bluetooth autofs4 sunrpc 
> > ib_iser 
> > > > rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi 
> > > > scsi_transport_iscsi cpufreq_ondemand acpi_cpufreq loop dm_multipath 
> 
> > > ipv6 
> > > > kvm_intel kvm snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss 
> > > > snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss 
> > > > snd_pcm snd_timer snd soundcore via_rhine i2c_i801 parport_pc 
> > > > snd_page_alloc mii parport i2c_core serio_raw pcspkr sg ata_generic 
> > > > dm_snapshot dm_zero dm_mirror dm_mod ahci libata sd_mod scsi_mod 
> ext3 
> > > jbd 
> > > > mbcache uhci_hcd ohci_hcd ehci_hcd
> > > > CPU:0
> > > > EIP:0060:[]Not tainted VLI
> > > > EFLAGS: 00010246   (2.6.21.kvm-25-21 #14)
> > > > EIP is at do_exit+0x2ca/0x3c8
> > > > eax:    ebx: f7c51dc0   ecx: 0001   edx: f7bc80f0
> > > > esi: f7bc80f0   edi:    ebp: 0100   esp: f5372f84
> > > > ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
> > > > Process qemu-system-x86 (pid: 3668[#0], ti=f5372000 task=f7bc80f0 
> > > > task.ti=f5372000)
> > > > Stack:

Re: KVM-88 Crash with Linux-2.6.25 kernel

2009-07-23 Thread Gleb Natapov
On Thu, Jul 23, 2009 at 07:17:51PM +0530, Haneef Syed wrote:
> Hi Gleb,
> 
> After installing kvm-88 kernel package and issuing modprobe kvm-intel and 
> in dmesg it displays as follows
> 
> "loaded kvm module (kvm-kmod-devel-86)" 
> 
> it confirms that kvm-88 kernel modules are installed properly but still i 
> m unable to fix installation bug..?
> 
What is the bug? Host kernel oops like below?

> 
> 
> 
> 
> Gleb Natapov  
> 07/23/2009 04:20 PM
> 
> To
> Haneef Syed 
> cc
> a...@redhat.com, k...@vger.kernel.org, 
> virtualization@lists.linux-foundation.org
> Subject
> Re: KVM-88 Crash with Linux-2.6.25 kernel
> 
> 
> 
> 
> 
> 
> On Thu, Jul 23, 2009 at 03:13:25PM +0530, Haneef Syed wrote:
> > If i use kvm-88 kernel package then system does not crash
> > 
> > but it uses qemu emulation but not kvm. I want kvm emulation so
> > 
> If it uses qemu emulation then you haven't installed kvm-88 kernel modules
> properly.
> 
> > can u tell me how to achieve that.
> > 
> > 
> > 
> > 
> > Gleb Natapov  
> > 07/23/2009 03:09 PM
> > 
> > To
> > Haneef Syed 
> > cc
> > k...@vger.kernel.org, virtualization@lists.linux-foundation.org, 
> > a...@redhat.com
> > Subject
> > Re: KVM-88 Crash with Linux-2.6.25 kernel
> > 
> > 
> > 
> > 
> > 
> > 
> > On Thu, Jul 23, 2009 at 03:06:07PM +0530, Haneef Syed wrote:
> > > Hi All,
> > > 
> > > I am using kvm-88 user space package with linux-2.6.25 kernel. I f i 
> run 
> > 
> > > following command system gets hang.
> > > 
> > Please use kvm-88 kernel package.
> > 
> > > ./qemu-system-x86_64 -hda new.qcow2 -cdrom 
> rhel-server-5.3-i386-dvd.iso 
> > > -boot d -m 512
> > > 
> > > When i print dmesg it displays as follows:
> > > 
> > > 
> > > BUG: unable to handle kernel NULL pointer dereference at virtual 
> address 
> > 
> > > 0024
> > >  printing eip:
> > > c0425fb0
> > > *pde = 3963c067
> > > Oops:  [#1]
> > > SMP
> > > Modules linked in: ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 
> 
> > > xt_state nf_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter 
> > > ip_tables x_tables bridge rfcomm l2cap bluetooth autofs4 sunrpc 
> ib_iser 
> > > rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi 
> > > scsi_transport_iscsi cpufreq_ondemand acpi_cpufreq loop dm_multipath 
> > ipv6 
> > > kvm_intel kvm snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss 
> > > snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss 
> > > snd_pcm snd_timer snd soundcore via_rhine i2c_i801 parport_pc 
> > > snd_page_alloc mii parport i2c_core serio_raw pcspkr sg ata_generic 
> > > dm_snapshot dm_zero dm_mirror dm_mod ahci libata sd_mod scsi_mod ext3 
> > jbd 
> > > mbcache uhci_hcd ohci_hcd ehci_hcd
> > > CPU:0
> > > EIP:0060:[]Not tainted VLI
> > > EFLAGS: 00010246   (2.6.21.kvm-25-21 #14)
> > > EIP is at do_exit+0x2ca/0x3c8
> > > eax:    ebx: f7c51dc0   ecx: 0001   edx: f7bc80f0
> > > esi: f7bc80f0   edi:    ebp: 0100   esp: f5372f84
> > > ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
> > > Process qemu-system-x86 (pid: 3668[#0], ti=f5372000 task=f7bc80f0 
> > > task.ti=f5372000)
> > > Stack: f7bc80f0 0100 f5372000 c042cdf8 f7bb1580 f7a15ac0 0100 
> > > f5372000
> > >c0426144 0001 00d55274 00d55274 c0404db0 0001 b77956d0 
> > > 
> > >00d55274 00d55274 b7795738 00fc 007b c040007b c060 
> > > 00fc
> > > Call Trace:
> > >  [] signal_wake_up+0x1e/0x2c
> > >  [] sys_exit_group+0x0/0xd
> > >  [] syscall_call+0x7/0xb
> > >  [] packet_setsockopt+0x13f/0x170
> > >  ===
> > > Code: ab 04 00 e8 f7 d4 fd ff 85 ff 74 19 8b 86 6c 04 00 00 83 b8 b0 
> 00 
> > 00 
> > > 00 00 74 0a b8 01 00 00 00 e8 7e 64 10 00 8b 46 04 8b 40 04 <8b> 40 24 
> 
> > e8 
> > > 94 0d 02 00 8b 86 a4 00 00 00 85 c0 74 08 8b 40 04
> > > EIP: [] do_exit+0x2ca/0x3c8 SS:ESP 0068:f5372f84
> > > Fixing recursive fault but reboot is needed!
> > > 
> > > Please suggest me how to fix this bug..
> > > 
> > > Thanks in advance :-))
> > > 
> > > 
> > > __

Re: KVM-88 Crash with Linux-2.6.25 kernel

2009-07-23 Thread Gleb Natapov
On Thu, Jul 23, 2009 at 03:13:25PM +0530, Haneef Syed wrote:
> If i use kvm-88 kernel package then system does not crash
> 
> but it uses qemu emulation but not kvm. I want kvm emulation so
> 
If it uses qemu emulation then you haven't installed kvm-88 kernel modules
properly.

> can u tell me how to achieve that.
> 
> 
> 
> 
> Gleb Natapov  
> 07/23/2009 03:09 PM
> 
> To
> Haneef Syed 
> cc
> k...@vger.kernel.org, virtualization@lists.linux-foundation.org, 
> a...@redhat.com
> Subject
> Re: KVM-88 Crash with Linux-2.6.25 kernel
> 
> 
> 
> 
> 
> 
> On Thu, Jul 23, 2009 at 03:06:07PM +0530, Haneef Syed wrote:
> > Hi All,
> > 
> > I am using kvm-88 user space package with linux-2.6.25 kernel. I f i run 
> 
> > following command system gets hang.
> > 
> Please use kvm-88 kernel package.
> 
> > ./qemu-system-x86_64 -hda new.qcow2 -cdrom rhel-server-5.3-i386-dvd.iso 
> > -boot d -m 512
> > 
> > When i print dmesg it displays as follows:
> > 
> > 
> > BUG: unable to handle kernel NULL pointer dereference at virtual address 
> 
> > 0024
> >  printing eip:
> > c0425fb0
> > *pde = 3963c067
> > Oops:  [#1]
> > SMP
> > Modules linked in: ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 
> > xt_state nf_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter 
> > ip_tables x_tables bridge rfcomm l2cap bluetooth autofs4 sunrpc ib_iser 
> > rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi 
> > scsi_transport_iscsi cpufreq_ondemand acpi_cpufreq loop dm_multipath 
> ipv6 
> > kvm_intel kvm snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss 
> > snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss 
> > snd_pcm snd_timer snd soundcore via_rhine i2c_i801 parport_pc 
> > snd_page_alloc mii parport i2c_core serio_raw pcspkr sg ata_generic 
> > dm_snapshot dm_zero dm_mirror dm_mod ahci libata sd_mod scsi_mod ext3 
> jbd 
> > mbcache uhci_hcd ohci_hcd ehci_hcd
> > CPU:0
> > EIP:0060:[]Not tainted VLI
> > EFLAGS: 00010246   (2.6.21.kvm-25-21 #14)
> > EIP is at do_exit+0x2ca/0x3c8
> > eax:    ebx: f7c51dc0   ecx: 0001   edx: f7bc80f0
> > esi: f7bc80f0   edi:    ebp: 0100   esp: f5372f84
> > ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
> > Process qemu-system-x86 (pid: 3668[#0], ti=f5372000 task=f7bc80f0 
> > task.ti=f5372000)
> > Stack: f7bc80f0 0100 f5372000 c042cdf8 f7bb1580 f7a15ac0 0100 
> > f5372000
> >c0426144 0001 00d55274 00d55274 c0404db0 0001 b77956d0 
> > 
> >00d55274 00d55274 b7795738 00fc 007b c040007b c060 
> > 00fc
> > Call Trace:
> >  [] signal_wake_up+0x1e/0x2c
> >  [] sys_exit_group+0x0/0xd
> >  [] syscall_call+0x7/0xb
> >  [] packet_setsockopt+0x13f/0x170
> >  ===
> > Code: ab 04 00 e8 f7 d4 fd ff 85 ff 74 19 8b 86 6c 04 00 00 83 b8 b0 00 
> 00 
> > 00 00 74 0a b8 01 00 00 00 e8 7e 64 10 00 8b 46 04 8b 40 04 <8b> 40 24 
> e8 
> > 94 0d 02 00 8b 86 a4 00 00 00 85 c0 74 08 8b 40 04
> > EIP: [] do_exit+0x2ca/0x3c8 SS:ESP 0068:f5372f84
> > Fixing recursive fault but reboot is needed!
> > 
> > Please suggest me how to fix this bug..
> > 
> > Thanks in advance :-))
> > 
> > 
> > __
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
>  Gleb.
> 
> __
> 
> 
> 
> __

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: KVM-88 Crash with Linux-2.6.25 kernel

2009-07-23 Thread Gleb Natapov
On Thu, Jul 23, 2009 at 03:06:07PM +0530, Haneef Syed wrote:
> Hi All,
> 
> I am using kvm-88 user space package with linux-2.6.25 kernel. I f i run 
> following command system gets hang.
> 
Please use kvm-88 kernel package.

> ./qemu-system-x86_64 -hda new.qcow2 -cdrom rhel-server-5.3-i386-dvd.iso 
> -boot d -m 512
> 
> When i print dmesg it displays as follows:
> 
> 
> BUG: unable to handle kernel NULL pointer dereference at virtual address 
> 0024
>  printing eip:
> c0425fb0
> *pde = 3963c067
> Oops:  [#1]
> SMP
> Modules linked in: ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 
> xt_state nf_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter 
> ip_tables x_tables bridge rfcomm l2cap bluetooth autofs4 sunrpc ib_iser 
> rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi 
> scsi_transport_iscsi cpufreq_ondemand acpi_cpufreq loop dm_multipath ipv6 
> kvm_intel kvm snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss 
> snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss 
> snd_pcm snd_timer snd soundcore via_rhine i2c_i801 parport_pc 
> snd_page_alloc mii parport i2c_core serio_raw pcspkr sg ata_generic 
> dm_snapshot dm_zero dm_mirror dm_mod ahci libata sd_mod scsi_mod ext3 jbd 
> mbcache uhci_hcd ohci_hcd ehci_hcd
> CPU:0
> EIP:0060:[]Not tainted VLI
> EFLAGS: 00010246   (2.6.21.kvm-25-21 #14)
> EIP is at do_exit+0x2ca/0x3c8
> eax:    ebx: f7c51dc0   ecx: 0001   edx: f7bc80f0
> esi: f7bc80f0   edi:    ebp: 0100   esp: f5372f84
> ds: 007b   es: 007b   fs: 00d8  gs:   ss: 0068
> Process qemu-system-x86 (pid: 3668[#0], ti=f5372000 task=f7bc80f0 
> task.ti=f5372000)
> Stack: f7bc80f0 0100 f5372000 c042cdf8 f7bb1580 f7a15ac0 0100 
> f5372000
>c0426144 0001 00d55274 00d55274 c0404db0 0001 b77956d0 
> 
>00d55274 00d55274 b7795738 00fc 007b c040007b c060 
> 00fc
> Call Trace:
>  [] signal_wake_up+0x1e/0x2c
>  [] sys_exit_group+0x0/0xd
>  [] syscall_call+0x7/0xb
>  [] packet_setsockopt+0x13f/0x170
>  ===
> Code: ab 04 00 e8 f7 d4 fd ff 85 ff 74 19 8b 86 6c 04 00 00 83 b8 b0 00 00 
> 00 00 74 0a b8 01 00 00 00 e8 7e 64 10 00 8b 46 04 8b 40 04 <8b> 40 24 e8 
> 94 0d 02 00 8b 86 a4 00 00 00 85 c0 74 08 8b 40 04
> EIP: [] do_exit+0x2ca/0x3c8 SS:ESP 0068:f5372f84
> Fixing recursive fault but reboot is needed!
> 
> Please suggest me how to fix this bug..
> 
> Thanks in advance :-))
> 
> 
> __
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-15 Thread Gleb Natapov
On Mon, Jun 15, 2009 at 02:07:53PM +0300, Michael S. Tsirkin wrote:
> On Mon, Jun 15, 2009 at 01:52:13PM +0300, Gleb Natapov wrote:
> > On Mon, Jun 15, 2009 at 01:46:53PM +0300, Michael S. Tsirkin wrote:
> > > On Mon, Jun 15, 2009 at 01:44:56PM +0300, Gleb Natapov wrote:
> > > > On Mon, Jun 15, 2009 at 01:32:49PM +0300, Michael S. Tsirkin wrote:
> > > > > > You do need to export available slot numbers from qemu.
> > > > > 
> > > > > Why would a slot be unavailable?
> > > > > 
> > > > Because it does not exist?
> > > 
> > > We can create a slot with any number, can't we?
> > What do you mean? If the mobo has 4 slots you can't create fifth.
> > KVM describes 32 slots in the BIOS.
> 
> Do you mean the KVM kernel module here? I don't know much about the
No, I don't mean the KVM kernel module here.

> BIOS. Can't qemu control the number of slots declared?
> 
QEMU represents the HW and the BIOS drives that HW. They should be in sync on
such important issues as PCI slot configuration. Even if QEMU could control
the number of slots declared (which it can't easily do), it would be able to
do so only at startup (before the BIOS runs). The way to get a dynamic number
of slots may be PCI bridge emulation; I'm not sure what is needed from the
BIOS for that.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: Configuration vs. compat hints [was Re: [Qemu-devel] [PATCHv3 03/13] qemu: add routines to manage PCI capabilities]

2009-06-15 Thread Gleb Natapov
On Mon, Jun 15, 2009 at 01:46:53PM +0300, Michael S. Tsirkin wrote:
> On Mon, Jun 15, 2009 at 01:44:56PM +0300, Gleb Natapov wrote:
> > On Mon, Jun 15, 2009 at 01:32:49PM +0300, Michael S. Tsirkin wrote:
> > > > You do need to export available slot numbers from qemu.
> > > 
> > > Why would a slot be unavailable?
> > > 
> > Because it does not exist?
> 
> We can create a slot with any number, can't we?
What do you mean? If the mobo has 4 slots you can't create a fifth.
KVM describes 32 slots in the BIOS.

--
Gleb.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

