[RFC] Next gen kvm api

2012-02-02 Thread Avi Kivity
The kvm api has been accumulating cruft for several years now.  This is
due to feature creep, fixing mistakes, experience gained by the
maintainers and developers on how to do things, ports to new
architectures, and simply as a side effect of a code base that is
developed slowly and incrementally.

While I don't think we can justify a complete revamp of the API now, I'm
writing this as a thought experiment to see where a from-scratch API can
take us.  Of course, if we do implement this, the new and old APIs will
have to be supported side by side for several years.

Syscalls
--------
kvm currently uses the much-loved ioctl() system call as its entry
point.  While this made it easy to add kvm to the kernel unintrusively,
it does have downsides:

- overhead in the entry path, for the ioctl dispatch path and vcpu mutex
(low but measurable)
- semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
a vm to be tied to an mm_struct, but the current API ties them to file
descriptors, which can move between threads and processes.  We check
that they don't, but we don't want to.

Moving to syscalls avoids these problems, but introduces new ones:

- adding new syscalls is generally frowned upon, and kvm will need several
- syscalls into modules are harder and rarer than into core kernel code
- will need to add a vcpu pointer to task_struct, and a kvm pointer to
mm_struct

Syscalls that operate on the entire guest will pick it up implicitly
from the mm_struct, and syscalls that operate on a vcpu will pick it up
from current.
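
To make the implicit binding concrete, here is a purely hypothetical
userspace sketch; the syscall names, numbers, and signatures below are
invented for illustration only and do not exist anywhere:

/* Hypothetical sketch: every __NR_kvm_* number and signature here is
 * made up; the point is only that no vm/vcpu fd is ever passed. */
#include <stdio.h>
#include <unistd.h>

#define __NR_kvm_create_vm    1001   /* invented number */
#define __NR_kvm_create_vcpu  1002   /* invented number */
#define __NR_kvm_enter_guest  1003   /* invented number */

int main(void)
{
	/* The vm would be tied to this process's mm_struct... */
	if (syscall(__NR_kvm_create_vm) < 0) {
		perror("kvm_create_vm");
		return 1;
	}

	/* ...and the vcpu to the calling thread (current). */
	if (syscall(__NR_kvm_create_vcpu, 0 /* vcpu id */) < 0) {
		perror("kvm_create_vcpu");
		return 1;
	}

	for (;;) {
		/* No fd argument: the vcpu comes from current, the
		 * guest from current->mm. */
		long r = syscall(__NR_kvm_enter_guest, NULL);
		if (r < 0)
			break;
		/* handle the exit reported in r ... */
	}
	return 0;
}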

State accessors
---------------
Currently vcpu state is read and written by a bunch of ioctls that
access register sets that were added (or discovered) over the years.
Some state is stored in the vcpu mmap area.  These will be replaced by a
pair of syscalls that read or write the entire state, or a subset of the
state, in a tag/value format.  A register will be described by a tuple
(one possible encoding is sketched below the list):

  set: the register set to which it belongs; either a real set (GPR,
x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
eflags/rip/IDT/interrupt shadow/pending exception/etc.)
  number: register number within a set
  size: for self-description, and to allow expanding registers like
SSE->AVX or eax->rax
  attributes: read-write, read-only, read-only for guest but read-write
for host
  value
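
One possible C encoding of such a tuple, purely as an illustration (the
struct and constant names below are assumptions, not a proposed ABI):

/* Illustrative only; all names and the layout are invented. */
#include <stdint.h>

enum kvm_reg_set {                 /* "real" and "fake" register sets */
	KVM_REG_SET_GPR,
	KVM_REG_SET_X87,
	KVM_REG_SET_SSE_AVX,
	KVM_REG_SET_SEGMENT,
	KVM_REG_SET_CPUID,
	KVM_REG_SET_MSR,
	KVM_REG_SET_SPECIAL,       /* eflags/rip/IDT/interrupt shadow/... */
};

#define KVM_REG_ATTR_RW         (1u << 0)
#define KVM_REG_ATTR_RO         (1u << 1)
#define KVM_REG_ATTR_HOST_RW    (1u << 2)  /* read-only for the guest */

struct kvm_reg {
	uint32_t set;        /* enum kvm_reg_set */
	uint32_t number;     /* register number within the set */
	uint32_t size;       /* bytes; lets eax grow to rax, SSE to AVX */
	uint32_t attributes; /* KVM_REG_ATTR_* */
	uint8_t  value[];    /* 'size' bytes of register contents */
};

The read/write syscalls would then take an array of such descriptors and
fill in (or consume) the value fields.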

Device model
------------
Currently kvm virtualizes or emulates a set of x86 cores, with or
without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
PCI devices assigned from the host.  The API allows emulating the local
APICs in userspace.

The new API will do away with the IOAPIC/PIC/PIT emulation and defer
them to userspace.  Note: this may cause a regression for older guests
that don't support MSI or kvmclock.  Device assignment will be done
using VFIO, that is, without direct kvm involvement.

Local APICs will be mandatory, but it will be possible to hide them from
the guest.  This means that it will no longer be possible to emulate an
APIC in userspace, but it will be possible to virtualize an APIC-less
core - userspace will play with the LINT0/LINT1 inputs (configured as
EXTINT and NMI) to queue interrupts and NMIs.

The communications between the local APIC and the IOAPIC/PIC will be
done over a socketpair, emulating the APIC bus protocol.
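
As a minimal sketch of the transport (the message layout here is invented
and only stands in for whatever the APIC bus protocol ends up being):

/* Sketch: an "APIC bus" message carried over a SOCK_SEQPACKET socketpair.
 * The message format is invented for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>

struct apic_bus_msg {
	uint8_t dest;          /* destination APIC id (0xff = broadcast) */
	uint8_t dest_mode;
	uint8_t delivery_mode; /* fixed, NMI, EOI broadcast, ... */
	uint8_t vector;
};

int main(void)
{
	int sv[2];
	struct apic_bus_msg msg = { .dest = 0xff, .vector = 0x30 };
	struct apic_bus_msg in;

	/* One end held by kvm (local APIC side), the other by the
	 * userspace IOAPIC/PIC model. */
	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0) {
		perror("socketpair");
		return 1;
	}

	send(sv[1], &msg, sizeof(msg), 0);  /* IOAPIC raises an interrupt */
	recv(sv[0], &in, sizeof(in), 0);    /* local APIC side receives it */
	printf("delivered vector 0x%x\n", in.vector);
	return 0;
}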

Ioeventfd/irqfd
---------------
As the ioeventfd/irqfd mechanism has been quite successful, it will be
retained, and perhaps supplemented with a way to assign an mmio region
to a socketpair carrying transactions.  This allows a device model to be
implemented out-of-process.  The socketpair can also be used to
implement a replacement for coalesced mmio, by not waiting for responses
on write transactions when enabled.  Synchronization of coalesced mmio
will be implemented in the kernel, not userspace as now: when a
non-coalesced mmio is needed, the kernel will first flush the coalesced
mmio queue(s).
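
For reference, the retained ioeventfd mechanism is registered today roughly
like this with the existing ioctl API ('vm_fd' is assumed to be an
already-created VM descriptor); the socketpair-backed mmio region would be
a new registration alongside it:

/* Signal an eventfd on every 4-byte guest write to a doorbell address,
 * without a userspace exit on the vcpu thread. */
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int add_doorbell(int vm_fd, uint64_t gpa)
{
	struct kvm_ioeventfd ioev;
	int efd = eventfd(0, 0);

	if (efd < 0)
		return -1;

	memset(&ioev, 0, sizeof(ioev));
	ioev.addr = gpa;   /* guest physical address of the doorbell */
	ioev.len  = 4;     /* match 4-byte writes, any value */
	ioev.fd   = efd;

	if (ioctl(vm_fd, KVM_IOEVENTFD, &ioev) < 0)
		return -1;
	return efd;        /* a device thread poll()s/read()s this fd */
}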

Guest memory management
-----------------------
Instead of managing each memory slot individually, a single API will be
provided that replaces the entire guest physical memory map atomically. 
This matches the implementation (using RCU) and plugs holes in the
current API, where you lose the dirty log in the window between the last
call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
that removes the slot.

Slot-based dirty logging will be replaced by range-based and work-based
dirty logging; that is "what pages are dirty in this range, which may be
smaller than a slot" and "don't return more than N pages".

We may want to place the log in user memory instead of kernel memory, to
reduce pinned memory and increase flexibility.
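
A hypothetical shape for the two calls, purely to illustrate the direction
(every struct and function name below is invented, not a proposed ABI):

/* Illustrative sketch only: nothing here exists. */
#include <stdint.h>

struct kvm_mem_slot {
	uint64_t guest_phys_addr;  /* start of the region in guest space */
	uint64_t size;             /* bytes */
	uint64_t userspace_addr;   /* backing host virtual address */
	uint32_t flags;            /* e.g. dirty logging, read-only */
	uint32_t pad;
};

struct kvm_memory_map {
	uint32_t nr_slots;
	uint32_t pad;
	struct kvm_mem_slot slots[];  /* the *entire* guest physical map */
};

/* Replace the whole map atomically (RCU underneath); the dirty state of
 * any removed slots would be handed back so no log is lost. */
long sys_kvm_set_memory_map(const struct kvm_memory_map *map,
			    uint64_t *final_dirty_bitmap);

/* Range- and work-based dirty logging: "which pages are dirty between
 * gpa and gpa + len, but return at most max_pages of them". */
long sys_kvm_get_dirty_pages(uint64_t gpa, uint64_t len,
			     uint64_t *dirty_gpas, uint64_t max_pages);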

vcpu fd mmap area
-----------------
Currently we mmap() a few pages of the vcpu fd for fast user/kernel
communications.  This will be replaced by a more orthodox pointer
parameter to sys_kvm_enter_guest(), that will be accessed using
get_user() and put_user().  This is slower than the current situation,
but better for things like strace.
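
A hypothetical illustration of the shared structure and call (names and
fields are invented; the point is only that the kernel touches the state
with get_user()/put_user() instead of a pinned, mmap()ed page):

/* Sketch only: invented names, not an ABI. */
#include <stdint.h>

struct kvm_vcpu_comm {
	uint32_t exit_reason;    /* written by the kernel via put_user() */
	uint32_t request_irq;    /* read by the kernel via get_user() */
	uint64_t mmio_addr;      /* details of an mmio exit, if any */
	uint8_t  mmio_data[8];
	uint32_t mmio_len;
	uint32_t mmio_is_write;
};

/* An ordinary pointer argument, so strace can decode it. */
long sys_kvm_enter_guest(struct kvm_vcpu_comm *comm);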

Re: [RFC] Next gen kvm api

2012-02-03 Thread Eric Northup
On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity  wrote:
[...]
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
- Lost a good place to put access control (permissions on /dev/kvm)
for which user-mode processes can use KVM.

How would the ability to use sys_kvm_* be regulated?


Re: [RFC] Next gen kvm api

2012-02-05 Thread Gleb Natapov
On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> Device model
> 
> Currently kvm virtualizes or emulates a set of x86 cores, with or
> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> PCI devices assigned from the host.  The API allows emulating the local
> APICs in userspace.
> 
> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> them to userspace.  Note: this may cause a regression for older guests
> that don't support MSI or kvmclock.  Device assignment will be done
> using VFIO, that is, without direct kvm involvement.
> 
So are we officially saying that KVM is only for modern guest
virtualization? Also my not so old host kernel uses MSI only for NIC.
SATA and USB are using IOAPIC (though this is probably more HW related
than kernel version related).

--
Gleb.


Re: [RFC] Next gen kvm api

2012-02-05 Thread Avi Kivity
On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > Device model
> > 
> > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > PCI devices assigned from the host.  The API allows emulating the local
> > APICs in userspace.
> > 
> > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > them to userspace.  Note: this may cause a regression for older guests
> > that don't support MSI or kvmclock.  Device assignment will be done
> > using VFIO, that is, without direct kvm involvement.
> > 
> So are we officially saying that KVM is only for modern guest
> virtualization? 

No, but older guests may have reduced performance in some workloads
(e.g. RHEL4 gettimeofday() intensive workloads).

> Also my not so old host kernel uses MSI only for NIC.
> SATA and USB are using IOAPIC (though this is probably more HW related
> than kernel version related).

For devices emulated in userspace, it doesn't matter where the IOAPIC
is.  It only matters for kernel provided devices (PIT, assigned devices,
vhost-net).

-- 
error compiling committee.c: too many arguments to function



Re: [RFC] Next gen kvm api

2012-02-05 Thread Gleb Natapov
On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> > On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > > Device model
> > > 
> > > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > > PCI devices assigned from the host.  The API allows emulating the local
> > > APICs in userspace.
> > > 
> > > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > > them to userspace.  Note: this may cause a regression for older guests
> > > that don't support MSI or kvmclock.  Device assignment will be done
> > > using VFIO, that is, without direct kvm involvement.
> > > 
> > So are we officially saying that KVM is only for modern guest
> > virtualization? 
> 
> No, but older guests may have reduced performance in some workloads
> (e.g. RHEL4 gettimeofday() intensive workloads).
> 
Reduced performance is what I mean. Obviously old guests will continue working.

> > Also my not so old host kernel uses MSI only for NIC.
> > SATA and USB are using IOAPIC (though this is probably more HW related
> > than kernel version related).
> 
> For devices emulated in userspace, it doesn't matter where the IOAPIC
> is.  It only matters for kernel provided devices (PIT, assigned devices,
> vhost-net).
> 
What about EOI that will have to do additional exit to userspace for each
interrupt delivered?

--
Gleb.


Re: [RFC] Next gen kvm api

2012-02-05 Thread Avi Kivity
On 02/05/2012 11:51 AM, Gleb Natapov wrote:
> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
> > On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> > > On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > > > Device model
> > > > 
> > > > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > > > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > > > PCI devices assigned from the host.  The API allows emulating the local
> > > > APICs in userspace.
> > > > 
> > > > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > > > them to userspace.  Note: this may cause a regression for older guests
> > > > that don't support MSI or kvmclock.  Device assignment will be done
> > > > using VFIO, that is, without direct kvm involvement.
> > > > 
> > > So are we officially saying that KVM is only for modern guest
> > > virtualization? 
> > 
> > No, but older guests may have reduced performance in some workloads
> > (e.g. RHEL4 gettimeofday() intensive workloads).
> > 
> Reduced performance is what I mean. Obviously old guests will continue 
> working.

I'm not happy about it either.

> > > Also my not so old host kernel uses MSI only for NIC.
> > > SATA and USB are using IOAPIC (though this is probably more HW related
> > > than kernel version related).
> > 
> > For devices emulated in userspace, it doesn't matter where the IOAPIC
> > is.  It only matters for kernel provided devices (PIT, assigned devices,
> > vhost-net).
> > 
> What about EOI that will have to do additional exit to userspace for each
> interrupt delivered?

I think the ioapic EOI is asynchronous wrt the core, yes?  So the vcpu
can just post the EOI broadcast on the apic-bus socketpair, waking up
the thread handling the ioapic, and continue running.  This trades off
vcpu latency for using more host resources.


-- 
error compiling committee.c: too many arguments to function



Re: [RFC] Next gen kvm api

2012-02-05 Thread Gleb Natapov
On Sun, Feb 05, 2012 at 11:56:21AM +0200, Avi Kivity wrote:
> On 02/05/2012 11:51 AM, Gleb Natapov wrote:
> > On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
> > > On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> > > > On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > > > > Device model
> > > > > 
> > > > > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > > > > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > > > > PCI devices assigned from the host.  The API allows emulating the 
> > > > > local
> > > > > APICs in userspace.
> > > > > 
> > > > > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > > > > them to userspace.  Note: this may cause a regression for older guests
> > > > > that don't support MSI or kvmclock.  Device assignment will be done
> > > > > using VFIO, that is, without direct kvm involvement.
> > > > > 
> > > > So are we officially saying that KVM is only for modern guest
> > > > virtualization? 
> > > 
> > > No, but older guests may have reduced performance in some workloads
> > > (e.g. RHEL4 gettimeofday() intensive workloads).
> > > 
> > Reduced performance is what I mean. Obviously old guests will continue 
> > working.
> 
> I'm not happy about it either.
> 
It is not only about old guests either. In RHEL we pretend to not
support HPET because when some guests detect it they are accessing
its mmio frequently for certain workloads. For Linux guests we can
avoid that by using kvmclock. For Windows guests I hope we will have
enlightenment timers  + RTC, but what about other guests? *BSD? How often
they access HPET when it is available? We will probably have to move
HPET into the kernel if we want to make it usable.

So what is the criteria for device to be emulated in userspace vs kernelspace
in new API? Never? What about vhost-net then? Only if a device works in MSI
mode? This may work for HPET case, but looks like artificial limitation
since the problem with HPET is not interrupt latency, but mmio space
access. 

And BTW, what about enlightenment timers for Windows? Are we going to
implement them in userspace or kernel?
 
> > > > Also my not so old host kernel uses MSI only for NIC.
> > > > SATA and USB are using IOAPIC (though this is probably more HW related
> > > > than kernel version related).
> > > 
> > > For devices emulated in userspace, it doesn't matter where the IOAPIC
> > > is.  It only matters for kernel provided devices (PIT, assigned devices,
> > > vhost-net).
> > > 
> > What about EOI that will have to do additional exit to userspace for each
> > interrupt delivered?
> 
> I think the ioapic EOI is asynchronous wrt the core, yes?  So the vcpu
Probably; I do not see what problem async EOI may cause.

> can just post the EOI broadcast on the apic-bus socketpair, waking up
> the thread handling the ioapic, and continue running.  This trades off
> vcpu latency for using more host resources.
> 
Sounds good. This will increase IOAPIC interrupt latency though since next
interrupt (same GSI) can't be delivered until EOI is processed.

--
Gleb.


Re: [RFC] Next gen kvm api

2012-02-05 Thread Avi Kivity
On 02/05/2012 12:58 PM, Gleb Natapov wrote:
> > > > 
> > > Reduced performance is what I mean. Obviously old guests will continue 
> > > working.
> > 
> > I'm not happy about it either.
> > 
> It is not only about old guests either. In RHEL we pretend to not
> support HPET because when some guests detect it they are accessing
> its mmio frequently for certain workloads. For Linux guests we can
> avoid that by using kvmclock. For Windows guests I hope we will have
> enlightenment timers  + RTC, but what about other guests? *BSD? How often
> they access HPET when it is available? We will probably have to move
> HPET into the kernel if we want to make it usable.

If we have to, we'll do it.

> So what is the criteria for device to be emulated in userspace vs kernelspace
> in new API? Never? What about vhost-net then? Only if a device works in MSI
> mode? This may work for HPET case, but looks like artificial limitation
> since the problem with HPET is not interrupt latency, but mmio space
> access. 

The criteria is, if it's absolutely necessary.

> And BTW, what about enlightenment timers for Windows? Are we going to
> implement them in userspace or kernel?

The kernel.
-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-02 Thread Rob Earhart
(Resending as plain text to appease vger.kernel.org :-)

On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity  wrote:
>
> The kvm api has been accumulating cruft for several years now.  This is
> due to feature creep, fixing mistakes, experience gained by the
> maintainers and developers on how to do things, ports to new
> architectures, and simply as a side effect of a code base that is
> developed slowly and incrementally.
>
> While I don't think we can justify a complete revamp of the API now, I'm
> writing this as a thought experiment to see where a from-scratch API can
> take us.  Of course, if we do implement this, the new and old APIs will
> have to be supported side by side for several years.
>
> Syscalls
> 
> kvm currently uses the much-loved ioctl() system call as its entry
> point.  While this made it easy to add kvm to the kernel unintrusively,
> it does have downsides:
>
> - overhead in the entry path, for the ioctl dispatch path and vcpu mutex
> (low but measurable)
> - semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
> a vm to be tied to an mm_struct, but the current API ties them to file
> descriptors, which can move between threads and processes.  We check
> that they don't, but we don't want to.
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
>
> Syscalls that operate on the entire guest will pick it up implicitly
> from the mm_struct, and syscalls that operate on a vcpu will pick it up
> from current.
>



I like the ioctl() interface.  If the overhead matters in your hot
path, I suspect you're doing it wrong; use irq fds & ioevent fds.  You
might fix the semantic mismatch by having a notion of a "current
process's VM" and "current thread's VCPU", and just use the one
/dev/kvm file descriptor.

Or you could go the other way, and break the connection between VMs
and processes / VCPUs and threads: I don't know how easy it is to do
it in Linux, but a VCPU might be backed by a kernel thread, operated
on via ioctl()s, indicating that they've exited the guest by having
their descriptors become readable (and either use read() or mmap() to
pull off the reason why the VCPU exited).  This would allow for a
variety of different programming styles for the VMM--I'm a fan of CSP
model myself, but that's hard to do with the current API.
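
To sketch the decoupled style, assuming a hypothetical per-VCPU descriptor
that becomes readable when the guest exits (neither the descriptor
semantics nor the exit structure below exist today):

/* Hypothetical: vcpu_fds[i] becomes readable when vcpu i leaves guest
 * mode; the exit reason is then read() off the descriptor. */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

struct vcpu_exit {        /* invented layout */
	int reason;
	unsigned long long addr;
};

static void vmm_loop(int *vcpu_fds, int nr_vcpus)
{
	struct pollfd pfds[nr_vcpus];
	int i;

	for (i = 0; i < nr_vcpus; i++) {
		pfds[i].fd = vcpu_fds[i];
		pfds[i].events = POLLIN;
	}

	for (;;) {
		poll(pfds, nr_vcpus, -1);
		for (i = 0; i < nr_vcpus; i++) {
			struct vcpu_exit ev;

			if (!(pfds[i].revents & POLLIN))
				continue;
			read(vcpu_fds[i], &ev, sizeof(ev));
			printf("vcpu %d exited, reason %d\n", i, ev.reason);
			/* handle the exit, then resume the vcpu */
		}
	}
}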

It'd be nice to be able to kick a VCPU out of the guest without
messing around with signals.  One possibility would be to tie it to an
eventfd; another might be to add a pseudo-register to indicate whether
the VCPU is explicitly suspended.  (Combined with the decoupling idea,
you'd want another pseudo-register to indicate whether the VMM is
implicitly suspended due to an intercept; a single "runnable" bit is
racy if both the VMM and VCPU are setting it.)

ioevent fds are definitely useful.  It might be cute if they could
synchronously set the VIRTIO_USED_F_NOTIFY bit - the guest could do
this itself, but that'd require giving the guest write access to the
used side of the virtio queue, and I kind of like the idea that it
doesn't need write access there.  Then again, I don't have any perf
data to back up the need for this.

The rest of it sounds great.

)Rob


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-02 Thread Anthony Liguori

On 02/02/2012 10:09 AM, Avi Kivity wrote:

The kvm api has been accumulating cruft for several years now.  This is
due to feature creep, fixing mistakes, experience gained by the
maintainers and developers on how to do things, ports to new
architectures, and simply as a side effect of a code base that is
developed slowly and incrementally.

While I don't think we can justify a complete revamp of the API now, I'm
writing this as a thought experiment to see where a from-scratch API can
take us.  Of course, if we do implement this, the new and old APIs will
have to be supported side by side for several years.

Syscalls

kvm currently uses the much-loved ioctl() system call as its entry
point.  While this made it easy to add kvm to the kernel unintrusively,
it does have downsides:

- overhead in the entry path, for the ioctl dispatch path and vcpu mutex
(low but measurable)
- semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
a vm to be tied to an mm_struct, but the current API ties them to file
descriptors, which can move between threads and processes.  We check
that they don't, but we don't want to.

Moving to syscalls avoids these problems, but introduces new ones:

- adding new syscalls is generally frowned upon, and kvm will need several
- syscalls into modules are harder and rarer than into core kernel code
- will need to add a vcpu pointer to task_struct, and a kvm pointer to
mm_struct

Syscalls that operate on the entire guest will pick it up implicitly
from the mm_struct, and syscalls that operate on a vcpu will pick it up
from current.


This seems like the natural progression.


State accessors
---
Currently vcpu state is read and written by a bunch of ioctls that
access register sets that were added (or discovered) along the years.
Some state is stored in the vcpu mmap area.  These will be replaced by a
pair of syscalls that read or write the entire state, or a subset of the
state, in a tag/value format.  A register will be described by a tuple:

   set: the register set to which it belongs; either a real set (GPR,
x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
eflags/rip/IDT/interrupt shadow/pending exception/etc.)
   number: register number within a set
   size: for self-description, and to allow expanding registers like
SSE->AVX or eax->rax
   attributes: read-write, read-only, read-only for guest but read-write
for host
   value


I do like the idea a lot of being able to read one register at a time as often 
times that's all you need.




Device model

Currently kvm virtualizes or emulates a set of x86 cores, with or
without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
PCI devices assigned from the host.  The API allows emulating the local
APICs in userspace.

The new API will do away with the IOAPIC/PIC/PIT emulation and defer
them to userspace.


I'm a big fan of this.


Note: this may cause a regression for older guests
that don't support MSI or kvmclock.  Device assignment will be done
using VFIO, that is, without direct kvm involvement.

Local APICs will be mandatory, but it will be possible to hide them from
the guest.  This means that it will no longer be possible to emulate an
APIC in userspace, but it will be possible to virtualize an APIC-less
core - userspace will play with the LINT0/LINT1 inputs (configured as
EXITINT and NMI) to queue interrupts and NMIs.


I think this makes sense.  An interesting consequence of this is that it's no 
longer necessary to associate the VCPU context with an MMIO/PIO operation.  I'm 
not sure if there's an obvious benefit to that but it's interesting nonetheless.



The communications between the local APIC and the IOAPIC/PIC will be
done over a socketpair, emulating the APIC bus protocol.

Ioeventfd/irqfd
---
As the ioeventfd/irqfd mechanism has been quite successful, it will be
retained, and perhaps supplemented with a way to assign an mmio region
to a socketpair carrying transactions.  This allows a device model to be
implemented out-of-process.  The socketpair can also be used to
implement a replacement for coalesced mmio, by not waiting for responses
on write transactions when enabled.  Synchronization of coalesced mmio
will be implemented in the kernel, not userspace as now: when a
non-coalesced mmio is needed, the kernel will first flush the coalesced
mmio queue(s).

Guest memory management
---
Instead of managing each memory slot individually, a single API will be
provided that replaces the entire guest physical memory map atomically.
This matches the implementation (using RCU) and plugs holes in the
current API, where you lose the dirty log in the window between the last
call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
that removes the slot.

Slot-based dirty logging will be replaced by range-based and work-based
dirty logging; that is "what pages are dirty in this range, which may be
smaller than a slot" and "don't return more than N pages".

We may want to place the log in user memory instead of kernel memory, to
reduce pinned memory and increase flexibility.


Since we really only support 64-bit hosts, what about just pointing the kernel
at a address/size pair and rely on userspace to mmap() the range appropriately?

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-03 Thread Anthony Liguori

On 02/03/2012 12:07 PM, Eric Northup wrote:

On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity  wrote:
[...]


Moving to syscalls avoids these problems, but introduces new ones:

- adding new syscalls is generally frowned upon, and kvm will need several
- syscalls into modules are harder and rarer than into core kernel code
- will need to add a vcpu pointer to task_struct, and a kvm pointer to
mm_struct

- Lost a good place to put access control (permissions on /dev/kvm)
for which user-mode processes can use KVM.

How would the ability to use sys_kvm_* be regulated?


Why should it be regulated?

It's not a finite or privileged resource.

Regards,

Anthony Liguori







Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-03 Thread Takuya Yoshikawa
Hope to get comments from live migration developers,

Anthony Liguori  wrote:

> > Guest memory management
> > ---
> > Instead of managing each memory slot individually, a single API will be
> > provided that replaces the entire guest physical memory map atomically.
> > This matches the implementation (using RCU) and plugs holes in the
> > current API, where you lose the dirty log in the window between the last
> > call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
> > that removes the slot.
> >
> > Slot-based dirty logging will be replaced by range-based and work-based
> > dirty logging; that is "what pages are dirty in this range, which may be
> > smaller than a slot" and "don't return more than N pages".
> >
> > We may want to place the log in user memory instead of kernel memory, to
> > reduce pinned memory and increase flexibility.
> 
> Since we really only support 64-bit hosts, what about just pointing the 
> kernel 
> at a address/size pair and rely on userspace to mmap() the range 
> appropriately?
> 

Seems reasonable but the real problem is not how to set up the memory:
the problem is how to set a bit in user-space.

We need two things:
- introducing set_bit_user()
- changing mmu_lock from spin_lock to mutex_lock
  (mark_page_dirty() can be called with mmu_lock held)

The former is straightforward and I sent a patch last year.
The latter needs a fundamental change:  I heard (from Avi) that we can
change mmu_lock to mutex_lock if mmu_notifier becomes preemptible.
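
For the former, a minimal sketch of a (non-atomic) set_bit_user() built
only on the existing get_user()/put_user() primitives might look like the
following; the real patch may well differ (e.g. an atomic byte op with its
own fault handling):

/* Sketch: set bit 'nr' in a bitmap living in user memory.
 * Non-atomic; returns -EFAULT if the page is not accessible. */
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/uaccess.h>

static int set_bit_user(unsigned long nr, void __user *addr)
{
	u8 __user *byte = (u8 __user *)addr + (nr >> 3);
	u8 val;

	if (get_user(val, byte))
		return -EFAULT;
	val |= 1u << (nr & 7);
	if (put_user(val, byte))
		return -EFAULT;
	return 0;
}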

So I was planning to restart this work when Peter's
"mm: Preemptibility"
http://lkml.org/lkml/2011/4/1/141
gets finished.

But even if we cannot achieve "without pinned memory", we may also want
to let user-space know how many pages are getting dirty.

For example think about the last step of live migration.  We stop the
guest and send the remaining pages.  For this we do not need to write
protect them any more, just want to know which ones are dirty.

If user-space can read the bitmap, it does not need to do GET_DIRTY_LOG
because the guest is already stopped, so we can reduce the downtime.

Is this correct?


So I think we can do this in two steps:
1. just move the bitmap to user-space (and pin it)
2. un-pin it when the time comes

I can start 1 after "srcu-less dirty logging" gets finished.


Takuya


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-05 Thread Avi Kivity
On 02/03/2012 04:09 AM, Anthony Liguori wrote:
>
>> Note: this may cause a regression for older guests
>> that don't support MSI or kvmclock.  Device assignment will be done
>> using VFIO, that is, without direct kvm involvement.
>>
>> Local APICs will be mandatory, but it will be possible to hide them from
>> the guest.  This means that it will no longer be possible to emulate an
>> APIC in userspace, but it will be possible to virtualize an APIC-less
>> core - userspace will play with the LINT0/LINT1 inputs (configured as
>> EXITINT and NMI) to queue interrupts and NMIs.
>
> I think this makes sense.  An interesting consequence of this is that
> it's no longer necessary to associate the VCPU context with an
> MMIO/PIO operation.  I'm not sure if there's an obvious benefit to
> that but it's interesting nonetheless.

It doesn't follow (at least from the above), and it isn't allowed in
some situations (like PIO invoking synchronous SMI).  So we'll have to
retain synchronous PIO/MMIO (but we can allow to relax this for
socketpair mmio).

>
>> The communications between the local APIC and the IOAPIC/PIC will be
>> done over a socketpair, emulating the APIC bus protocol.
>>
>> Ioeventfd/irqfd
>> ---
>> As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> retained, and perhaps supplemented with a way to assign an mmio region
>> to a socketpair carrying transactions.  This allows a device model to be
>> implemented out-of-process.  The socketpair can also be used to
>> implement a replacement for coalesced mmio, by not waiting for responses
>> on write transactions when enabled.  Synchronization of coalesced mmio
>> will be implemented in the kernel, not userspace as now: when a
>> non-coalesced mmio is needed, the kernel will first flush the coalesced
>> mmio queue(s).
>>
>> Guest memory management
>> ---
>> Instead of managing each memory slot individually, a single API will be
>> provided that replaces the entire guest physical memory map atomically.
>> This matches the implementation (using RCU) and plugs holes in the
>> current API, where you lose the dirty log in the window between the last
>> call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
>> that removes the slot.
>>
>> Slot-based dirty logging will be replaced by range-based and work-based
>> dirty logging; that is "what pages are dirty in this range, which may be
>> smaller than a slot" and "don't return more than N pages".
>>
>> We may want to place the log in user memory instead of kernel memory, to
>> reduce pinned memory and increase flexibility.
>
> Since we really only support 64-bit hosts, 

We don't (Red Hat does, but that's a distro choice).  Non-x86 also needs
32-bit.

> what about just pointing the kernel at a address/size pair and rely on
> userspace to mmap() the range appropriately?

The "one large slot" approach.  Even if we ignore the 32-bit issue, we
still need some per-slot information, like per-slot dirty logging.  It's
also hard to create aliases this way (BIOS at 0xe0000 and 0xfffe0000) or
to move memory around (framebuffer BAR).

>
>> vcpu fd mmap area
>> -
>> Currently we mmap() a few pages of the vcpu fd for fast user/kernel
>> communications.  This will be replaced by a more orthodox pointer
>> parameter to sys_kvm_enter_guest(), that will be accessed using
>> get_user() and put_user().  This is slower than the current situation,
>> but better for things like strace.
>
> Look pretty interesting overall.

I'll get an actual API description for the next round.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-05 Thread Avi Kivity
On 02/03/2012 12:13 AM, Rob Earhart wrote:
> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity <a...@redhat.com> wrote:
>
> The kvm api has been accumulating cruft for several years now.
>  This is
> due to feature creep, fixing mistakes, experience gained by the
> maintainers and developers on how to do things, ports to new
> architectures, and simply as a side effect of a code base that is
> developed slowly and incrementally.
>
> While I don't think we can justify a complete revamp of the API
> now, I'm
> writing this as a thought experiment to see where a from-scratch
> API can
> take us.  Of course, if we do implement this, the new and old APIs
> will
> have to be supported side by side for several years.
>
> Syscalls
> 
> kvm currently uses the much-loved ioctl() system call as its entry
> point.  While this made it easy to add kvm to the kernel
> unintrusively,
> it does have downsides:
>
> - overhead in the entry path, for the ioctl dispatch path and vcpu
> mutex
> (low but measurable)
> - semantic mismatch: kvm really wants a vcpu to be tied to a
> thread, and
> a vm to be tied to an mm_struct, but the current API ties them to file
> descriptors, which can move between threads and processes.  We check
> that they don't, but we don't want to.
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need
> several
> - syscalls into modules are harder and rarer than into core kernel
> code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
>
> Syscalls that operate on the entire guest will pick it up implicitly
> from the mm_struct, and syscalls that operate on a vcpu will pick
> it up
> from current.
>
>
> 
>
> I like the ioctl() interface.  If the overhead matters in your hot path,

I can't say that it's a pressing problem, but it's not negligible.

> I suspect you're doing it wrong;

What am I doing wrong?

> use irq fds & ioevent fds.  You might fix the semantic mismatch by
> having a notion of a "current process's VM" and "current thread's
> VCPU", and just use the one /dev/kvm filedescriptor.
>
> Or you could go the other way, and break the connection between VMs
> and processes / VCPUs and threads: I don't know how easy it is to do
> it in Linux, but a VCPU might be backed by a kernel thread, operated
> on via ioctl()s, indicating that they've exited the guest by having
> their descriptors become readable (and either use read() or mmap() to
> pull off the reason why the VCPU exited). 

That breaks the ability to renice vcpu threads (unless you want the user
to renice kernel threads).

> This would allow for a variety of different programming styles for the
> VMM--I'm a fan of CSP model myself, but that's hard to do with the
> current API.

Just convert the synchronous API to an RPC over a pipe, in the vcpu
thread, and you have the asynchronous model you asked for.

>
> It'd be nice to be able to kick a VCPU out of the guest without
> messing around with signals.  One possibility would be to tie it to an
> eventfd;

We have to support signals in any case, supporting more mechanisms just
increases complexity.

> another might be to add a pseudo-register to indicate whether the VCPU
> is explicitly suspended.  (Combined with the decoupling idea, you'd
> want another pseudo-register to indicate whether the VMM is implicitly
> suspended due to an intercept; a single "runnable" bit is racy if both
> the VMM and VCPU are setting it.)
>
> ioevent fds are definitely useful.  It might be cute if they could
> synchronously set the VIRTIO_USED_F_NOTIFY bit - the guest could do
> this itself, but that'd require giving the guest write access to the
> used side of the virtio queue, and I kind of like the idea that it
> doesn't need write access there.  Then again, I don't have any perf
> data to back up the need for this.
>

I'd hate to tie ioeventfds into virtio specifics, they're a general
mechanism.  Especially if the guest can do it itself.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-05 Thread Anthony Liguori

On 02/05/2012 03:51 AM, Gleb Natapov wrote:

On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:

On 02/05/2012 11:37 AM, Gleb Natapov wrote:

On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:

Device model

Currently kvm virtualizes or emulates a set of x86 cores, with or
without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
PCI devices assigned from the host.  The API allows emulating the local
APICs in userspace.

The new API will do away with the IOAPIC/PIC/PIT emulation and defer
them to userspace.  Note: this may cause a regression for older guests
that don't support MSI or kvmclock.  Device assignment will be done
using VFIO, that is, without direct kvm involvement.


So are we officially saying that KVM is only for modern guest
virtualization?


No, but older guests may have reduced performance in some workloads
(e.g. RHEL4 gettimeofday() intensive workloads).


Reduced performance is what I mean. Obviously old guests will continue working.


An interesting solution to this problem would be an in-kernel device VM.

Most of the time, the hot register is just one register within a more complex 
device.  The reads are often side-effect free and trivially computed from some 
device state + host time.


If userspace had a way to upload bytecode to the kernel that was executed for a 
PIO operation, it could either pass the operation to userspace or handle it 
within the kernel when possible without taking a heavy weight exit.


If the bytecode can access variables in a shared memory area, it could be pretty 
efficient to work with.
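
Purely as a sketch of the shape this could take (no such ioctl, program
format, or struct exists; everything below is invented):

/* Hypothetical: attach a userspace-supplied program to a PIO range. */
#include <stdint.h>

struct kvm_io_filter {
	uint16_t port;          /* PIO port (an mmio variant would take a gpa) */
	uint16_t len;
	uint32_t prog_len;      /* number of bytecode instructions */
	uint64_t prog_addr;     /* userspace address of the bytecode */
	uint64_t scratch_addr;  /* shared state the program may read/write */
	uint64_t scratch_len;
};

/* The program would either complete the access in the kernel (returning
 * the value to the guest) or return a "punt" code that forces the usual
 * exit to userspace. */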


This means that the kernel never has to deal with specific in-kernel devices, but
that userspace can accelerate as many of its devices as it sees fit.


This could replace ioeventfd as a mechanism (which would allow clearing the 
notify flag before writing to an eventfd).


We could potentially just use BPF for this.

Regards,

Anthony Liguori


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Avi Kivity
On 02/05/2012 06:36 PM, Anthony Liguori wrote:
> On 02/05/2012 03:51 AM, Gleb Natapov wrote:
>> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
>>> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
 On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> Device model
> 
> Currently kvm virtualizes or emulates a set of x86 cores, with or
> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> PCI devices assigned from the host.  The API allows emulating the
> local
> APICs in userspace.
>
> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> them to userspace.  Note: this may cause a regression for older
> guests
> that don't support MSI or kvmclock.  Device assignment will be done
> using VFIO, that is, without direct kvm involvement.
>
 So are we officially saying that KVM is only for modern guest
 virtualization?
>>>
>>> No, but older guests may have reduced performance in some workloads
>>> (e.g. RHEL4 gettimeofday() intensive workloads).
>>>
>> Reduced performance is what I mean. Obviously old guests will
>> continue working.
>
> An interesting solution to this problem would be an in-kernel device VM.

It's interesting, yes, but has a very high barrier to implementation.

>
> Most of the time, the hot register is just one register within a more
> complex device.  The reads are often side-effect free and trivially
> computed from some device state + host time.

Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample. 
There are also interactions with other devices (for example the
apic/ioapic interaction via the apic bus).

>
> If userspace had a way to upload bytecode to the kernel that was
> executed for a PIO operation, it could either pass the operation to
> userspace or handle it within the kernel when possible without taking
> a heavy weight exit.
>
> If the bytecode can access variables in a shared memory area, it could
> be pretty efficient to work with.
>
> This means that the kernel never has to deal with specific in-kernel
> devices but that userspace can accelerator as many of its devices as
> it sees fit.

I would really love to have this, but the problem is that we'd need a
general purpose bytecode VM with binding to some kernel APIs.  The
bytecode VM, if made general enough to host more complicated devices,
would likely be much larger than the actual code we have in the kernel now.

>
> This could replace ioeventfd as a mechanism (which would allow
> clearing the notify flag before writing to an eventfd).
>
> We could potentially just use BPF for this.

BPF generally just computes a predicate.  We could overload the scratch
area for storing internal state and for read results, though (and have
an "mmio scratch register" for reading the time).

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Anthony Liguori

On 02/06/2012 03:34 AM, Avi Kivity wrote:

On 02/05/2012 06:36 PM, Anthony Liguori wrote:

On 02/05/2012 03:51 AM, Gleb Natapov wrote:

On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:

On 02/05/2012 11:37 AM, Gleb Natapov wrote:

On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:

Device model

Currently kvm virtualizes or emulates a set of x86 cores, with or
without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
PCI devices assigned from the host.  The API allows emulating the
local
APICs in userspace.

The new API will do away with the IOAPIC/PIC/PIT emulation and defer
them to userspace.  Note: this may cause a regression for older
guests
that don't support MSI or kvmclock.  Device assignment will be done
using VFIO, that is, without direct kvm involvement.


So are we officially saying that KVM is only for modern guest
virtualization?


No, but older guests may have reduced performance in some workloads
(e.g. RHEL4 gettimeofday() intensive workloads).


Reduced performance is what I mean. Obviously old guests will
continue working.


An interesting solution to this problem would be an in-kernel device VM.


It's interesting, yes, but has a very high barrier to implementation.



Most of the time, the hot register is just one register within a more
complex device.  The reads are often side-effect free and trivially
computed from some device state + host time.


Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
There are also interactions with other devices (for example the
apic/ioapic interaction via the apic bus).


Hrm, maybe I'm missing it, but the path that would be hot is:

if (!status_latched && !count_latched) {
   value = kpit_elapsed()
   // manipulate count based on mode
   // mask value depending on read_state
}

This path is side-effect free, and applies relatively simple math to a time 
counter.

The idea would be to allow the filter to not handle an I/O request depending on 
existing state.  Anything that's modifies state (like reading the latch counter) 
would drop to userspace.






If userspace had a way to upload bytecode to the kernel that was
executed for a PIO operation, it could either pass the operation to
userspace or handle it within the kernel when possible without taking
a heavy weight exit.

If the bytecode can access variables in a shared memory area, it could
be pretty efficient to work with.

This means that the kernel never has to deal with specific in-kernel
devices but that userspace can accelerator as many of its devices as
it sees fit.


I would really love to have this, but the problem is that we'd need a
general purpose bytecode VM with binding to some kernel APIs.  The
bytecode VM, if made general enough to host more complicated devices,
would likely be much larger than the actual code we have in the kernel now.


I think the question is whether BPF is good enough as it stands.  I'm not really 
sure.  I agree that inventing a new bytecode VM is probably not worth it.




This could replace ioeventfd as a mechanism (which would allow
clearing the notify flag before writing to an eventfd).

We could potentially just use BPF for this.


BPF generally just computes a predicate.


Can it modify a packet in place?  I think a predicate is about right (can this 
io operation be handled in the kernel or not) but the question is whether 
there's a way to produce an output as a side effect.



We could overload the scratch
area for storing internal state and for read results, though (and have
an "mmio scratch register" for reading the time).


Right.

Regards,

Anthony Liguori




Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Avi Kivity
On 02/06/2012 03:33 PM, Anthony Liguori wrote:
>> Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
>> There are also interactions with other devices (for example the
>> apic/ioapic interaction via the apic bus).
>
>
> Hrm, maybe I'm missing it, but the path that would be hot is:
>
> if (!status_latched && !count_latched) {
>value = kpit_elapsed()
>// manipulate count based on mode
>// mask value depending on read_state
> }
>
> This path is side-effect free, and applies relatively simple math to a
> time counter.

Do guests always read an unlatched counter?  Doesn't seem reasonable
since they can't get a stable count this way.

>
> The idea would be to allow the filter to not handle an I/O request
> depending on existing state.  Anything that's modifies state (like
> reading the latch counter) would drop to userspace.

This restricts us to a subset of the device which is at the mercy of the
guest.

>
>>
>>>
>>> If userspace had a way to upload bytecode to the kernel that was
>>> executed for a PIO operation, it could either pass the operation to
>>> userspace or handle it within the kernel when possible without taking
>>> a heavy weight exit.
>>>
>>> If the bytecode can access variables in a shared memory area, it could
>>> be pretty efficient to work with.
>>>
>>> This means that the kernel never has to deal with specific in-kernel
>>> devices but that userspace can accelerator as many of its devices as
>>> it sees fit.
>>
>> I would really love to have this, but the problem is that we'd need a
>> general purpose bytecode VM with binding to some kernel APIs.  The
>> bytecode VM, if made general enough to host more complicated devices,
>> would likely be much larger than the actual code we have in the
>> kernel now.
>
> I think the question is whether BPF is good enough as it stands.  I'm
> not really sure.

I think not.  It doesn't have 64-bit muldiv, required for hpet, for example.

>   I agree that inventing a new bytecode VM is probably not worth it.
>
>>>
>>> This could replace ioeventfd as a mechanism (which would allow
>>> clearing the notify flag before writing to an eventfd).
>>>
>>> We could potentially just use BPF for this.
>>
>> BPF generally just computes a predicate.
>
> Can it modify a packet in place?  I think a predicate is about right
> (can this io operation be handled in the kernel or not) but the
> question is whether there's a way produce an output as a side effect.

You can use the scratch area, and say that it's persistent.  But the VM
itself isn't rich enough.

>
>> We could overload the scratch
>> area for storing internal state and for read results, though (and have
>> an "mmio scratch register" for reading the time).
>
> Right.
>

We could define mmio registers for muldiv64, and for communicating over
the APIC bus.  But then the device model for BPF ends up more
complicated than the kernel devices we have put together.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Anthony Liguori

On 02/06/2012 07:54 AM, Avi Kivity wrote:

On 02/06/2012 03:33 PM, Anthony Liguori wrote:

Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
There are also interactions with other devices (for example the
apic/ioapic interaction via the apic bus).



Hrm, maybe I'm missing it, but the path that would be hot is:

if (!status_latched && !count_latched) {
value = kpit_elapsed()
// manipulate count based on mode
// mask value depending on read_state
}

This path is side-effect free, and applies relatively simple math to a
time counter.


Do guests always read an unlatched counter?  Doesn't seem reasonable
since they can't get a stable count this way.


Perhaps.  You could have the latching done by writing to persisted scratch 
memory but then locking becomes an issue.



The idea would be to allow the filter to not handle an I/O request
depending on existing state.  Anything that's modifies state (like
reading the latch counter) would drop to userspace.


This restricts us to a subset of the device which is at the mercy of the
guest.


Yes, but it provides an elegant solution to having a flexible way to do things 
in the fast path in a generic way without presenting additional security concerns.


A similar, albeit more complex and less elegant, approach would be to make use 
of something like the vtpm optimization to reflect certain exits back into 
injected code into the guest.  But this has the disadvantage of being very 
x86-centric and it's not clear if you can avoid double exits which would hurt 
the slow paths.



We could define mmio registers for muldiv64, and for communicating over
the APIC bus.  But then the device model for BPF ends up more
complicated than the kernel devices we have put together.


Maybe what we really need is NaCl for kernel space :-D

Regards,

Anthony Liguori




Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Avi Kivity
On 02/06/2012 04:00 PM, Anthony Liguori wrote:
>> Do guests always read an unlatched counter?  Doesn't seem reasonable
>> since they can't get a stable count this way.
>
>
> Perhaps.  You could have the latching done by writing to persisted
> scratch memory but then locking becomes an issue.

Oh, you'd certainly serialize the entire device.

>
>>> The idea would be to allow the filter to not handle an I/O request
>>> depending on existing state.  Anything that's modifies state (like
>>> reading the latch counter) would drop to userspace.
>>
>> This restricts us to a subset of the device which is at the mercy of the
>> guest.
>
> Yes, but it provides an elegant solution to having a flexible way to
> do things in the fast path in a generic way without presenting
> additional security concerns.
>
> A similar, albeit more complex and less elegant, approach would be to
> make use of something like the vtpm optimization to reflect certain
> exits back into injected code into the guest.  But this has the
> disadvantage of being very x86-centric and it's not clear if you can
> avoid double exits which would hurt the slow paths.

It's also hard to communicate with the rest of the host kernel (say for
timers).  You can't ensure that any piece of memory will be virtually
mapped, and with the correct permissions too.

>
>> We could define mmio registers for muldiv64, and for communicating over
>> the APIC bus.  But then the device model for BPF ends up more
>> complicated than the kernel devices we have put together.
>
> Maybe what we really need is NaCL for kernel space :-D

NaCl or bytecode, doesn't matter.  But we do need bindings to other
kernel and kvm services.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Rob Earhart
On Sun, Feb 5, 2012 at 5:14 AM, Avi Kivity  wrote:
> On 02/03/2012 12:13 AM, Rob Earhart wrote:
>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity <a...@redhat.com> wrote:
>>
>>     The kvm api has been accumulating cruft for several years now.
>>      This is
>>     due to feature creep, fixing mistakes, experience gained by the
>>     maintainers and developers on how to do things, ports to new
>>     architectures, and simply as a side effect of a code base that is
>>     developed slowly and incrementally.
>>
>>     While I don't think we can justify a complete revamp of the API
>>     now, I'm
>>     writing this as a thought experiment to see where a from-scratch
>>     API can
>>     take us.  Of course, if we do implement this, the new and old APIs
>>     will
>>     have to be supported side by side for several years.
>>
>>     Syscalls
>>     
>>     kvm currently uses the much-loved ioctl() system call as its entry
>>     point.  While this made it easy to add kvm to the kernel
>>     unintrusively,
>>     it does have downsides:
>>
>>     - overhead in the entry path, for the ioctl dispatch path and vcpu
>>     mutex
>>     (low but measurable)
>>     - semantic mismatch: kvm really wants a vcpu to be tied to a
>>     thread, and
>>     a vm to be tied to an mm_struct, but the current API ties them to file
>>     descriptors, which can move between threads and processes.  We check
>>     that they don't, but we don't want to.
>>
>>     Moving to syscalls avoids these problems, but introduces new ones:
>>
>>     - adding new syscalls is generally frowned upon, and kvm will need
>>     several
>>     - syscalls into modules are harder and rarer than into core kernel
>>     code
>>     - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>     mm_struct
>>
>>     Syscalls that operate on the entire guest will pick it up implicitly
>>     from the mm_struct, and syscalls that operate on a vcpu will pick
>>     it up
>>     from current.
>>
>>
>> 
>>
>> I like the ioctl() interface.  If the overhead matters in your hot path,
>
> I can't say that it's a pressing problem, but it's not negligible.
>
>> I suspect you're doing it wrong;
>
> What am I doing wrong?

"You the vmm" not "you the KVM maintainer" :-)

To be a little more precise: If a VCPU thread is going all the way out
to host usermode in its hot path, that's probably a performance
problem regardless of how fast you make the transitions between host
user and host kernel.

That's why ioctl() doesn't bother me.  I think it'd be more useful to
focus on mechanisms which don't require the VCPU thread to exit at all
in its hot paths, so the overhead of the ioctl() really becomes lost
in the noise.  irq fds and ioevent fds are great for that, and I
really like your MMIO-over-socketpair idea.
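
For reference, wiring up those fast paths with the existing ioctl interface
looks roughly like this (a minimal sketch; vm_fd, the doorbell address and
the GSI are example values, error handling omitted):

  #include <sys/eventfd.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: route a 4-byte MMIO write at an example doorbell address to an
   * eventfd, and inject an example GSI whenever another eventfd is signalled. */
  static void wire_up_fast_paths(int vm_fd)
  {
      int doorbell_fd = eventfd(0, 0);   /* device model thread polls this */
      int irq_fd = eventfd(0, 0);        /* device model thread writes this */

      struct kvm_ioeventfd ioe = {
          .addr = 0xfe001000,            /* example guest-physical address */
          .len  = 4,
          .fd   = doorbell_fd,           /* no datamatch: any write triggers */
      };
      ioctl(vm_fd, KVM_IOEVENTFD, &ioe);

      struct kvm_irqfd irq = {
          .fd  = irq_fd,
          .gsi = 5,                      /* example interrupt line */
      };
      ioctl(vm_fd, KVM_IRQFD, &irq);
  }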

>> use irq fds & ioevent fds.  You might fix the semantic mismatch by
>> having a notion of a "current process's VM" and "current thread's
>> VCPU", and just use the one /dev/kvm filedescriptor.
>>
>> Or you could go the other way, and break the connection between VMs
>> and processes / VCPUs and threads: I don't know how easy it is to do
>> it in Linux, but a VCPU might be backed by a kernel thread, operated
>> on via ioctl()s, indicating that they've exited the guest by having
>> their descriptors become readable (and either use read() or mmap() to
>> pull off the reason why the VCPU exited).
>
> That breaks the ability to renice vcpu threads (unless you want the user
> to renice kernel threads).

I think it'd be fine to have an ioctl()/syscall() to do it.  But I
don't know how well that'd compose with other tools people might use
for managing priorities.

>> This would allow for a variety of different programming styles for the
>> VMM--I'm a fan of CSP model myself, but that's hard to do with the
>> current API.
>
> Just convert the synchronous API to an RPC over a pipe, in the vcpu
> thread, and you have the asynchronous model you asked for.

Yup.  But you still get multiple threads in your process.  It's not a
disaster, though.

)Rob


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Anthony Liguori

On 02/06/2012 11:41 AM, Rob Earhart wrote:

On Sun, Feb 5, 2012 at 5:14 AM, Avi Kivity  wrote:

On 02/03/2012 12:13 AM, Rob Earhart wrote:

On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity wrote:

 The kvm api has been accumulating cruft for several years now.
  This is
 due to feature creep, fixing mistakes, experience gained by the
 maintainers and developers on how to do things, ports to new
 architectures, and simply as a side effect of a code base that is
 developed slowly and incrementally.

 While I don't think we can justify a complete revamp of the API
 now, I'm
 writing this as a thought experiment to see where a from-scratch
 API can
 take us.  Of course, if we do implement this, the new and old APIs
 will
 have to be supported side by side for several years.

 Syscalls
 
 kvm currently uses the much-loved ioctl() system call as its entry
 point.  While this made it easy to add kvm to the kernel
 unintrusively,
 it does have downsides:

 - overhead in the entry path, for the ioctl dispatch path and vcpu
 mutex
 (low but measurable)
 - semantic mismatch: kvm really wants a vcpu to be tied to a
 thread, and
 a vm to be tied to an mm_struct, but the current API ties them to file
 descriptors, which can move between threads and processes.  We check
 that they don't, but we don't want to.

 Moving to syscalls avoids these problems, but introduces new ones:

 - adding new syscalls is generally frowned upon, and kvm will need
 several
 - syscalls into modules are harder and rarer than into core kernel
 code
 - will need to add a vcpu pointer to task_struct, and a kvm pointer to
 mm_struct

 Syscalls that operate on the entire guest will pick it up implicitly
 from the mm_struct, and syscalls that operate on a vcpu will pick
 it up
 from current.




I like the ioctl() interface.  If the overhead matters in your hot path,


I can't say that it's a pressing problem, but it's not negligible.


I suspect you're doing it wrong;


What am I doing wrong?


"You the vmm" not "you the KVM maintainer" :-)

To be a little more precise: If a VCPU thread is going all the way out
to host usermode in its hot path, that's probably a performance
problem regardless of how fast you make the transitions between host
user and host kernel.

That's why ioctl() doesn't bother me.  I think it'd be more useful to
focus on mechanisms which don't require the VCPU thread to exit at all
in its hot paths, so the overhead of the ioctl() really becomes lost
in the noise.  irq fds and ioevent fds are great for that, and I
really like your MMIO-over-socketpair idea.


I'm not so sure.  ioeventfds and a future mmio-over-socketpair have to put the 
kthread to sleep while it waits for the other end to process it.  This is 
effectively equivalent to a heavy weight exit.  The difference in cost is 
dropping to userspace, which is really negligible these days (< 100 cycles).


There is some fast-path trickery to avoid heavy weight exits, but this presents 
the same basic problem of having to put all the device model stuff in the kernel.


ioeventfd to userspace is almost certainly worse for performance.  And as Avi 
mentioned, you can emulate this behavior yourself in userspace if so inclined.


Regards,

Anthony Liguori


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Scott Wood
On 02/03/2012 04:52 PM, Anthony Liguori wrote:
> On 02/03/2012 12:07 PM, Eric Northup wrote:
>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity  wrote:
>> [...]
>>>
>>> Moving to syscalls avoids these problems, but introduces new ones:
>>>
>>> - adding new syscalls is generally frowned upon, and kvm will need
>>> several
>>> - syscalls into modules are harder and rarer than into core kernel code
>>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>> mm_struct
>> - Lost a good place to put access control (permissions on /dev/kvm)
>> for which user-mode processes can use KVM.
>>
>> How would the ability to use sys_kvm_* be regulated?
> 
> Why should it be regulated?
> 
> It's not a finite or privileged resource.

You're exposing a large, complex kernel subsystem that does very
low-level things with the hardware.  It's a potential source of exploits
(from bugs in KVM or in hardware).  I can see people wanting to be
selective with access because of that.

And sometimes it is a finite resource.  I don't know how x86 does it,
but on at least some powerpc hardware we have a finite, relatively small
number of hardware partition IDs.

-Scott



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Alexander Graf

On 03.02.2012, at 03:09, Anthony Liguori wrote:

> On 02/02/2012 10:09 AM, Avi Kivity wrote:
>> The kvm api has been accumulating cruft for several years now.  This is
>> due to feature creep, fixing mistakes, experience gained by the
>> maintainers and developers on how to do things, ports to new
>> architectures, and simply as a side effect of a code base that is
>> developed slowly and incrementally.
>> 
>> While I don't think we can justify a complete revamp of the API now, I'm
>> writing this as a thought experiment to see where a from-scratch API can
>> take us.  Of course, if we do implement this, the new and old APIs will
>> have to be supported side by side for several years.
>> 
>> Syscalls
>> 
>> kvm currently uses the much-loved ioctl() system call as its entry
>> point.  While this made it easy to add kvm to the kernel unintrusively,
>> it does have downsides:
>> 
>> - overhead in the entry path, for the ioctl dispatch path and vcpu mutex
>> (low but measurable)
>> - semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
>> a vm to be tied to an mm_struct, but the current API ties them to file
>> descriptors, which can move between threads and processes.  We check
>> that they don't, but we don't want to.
>> 
>> Moving to syscalls avoids these problems, but introduces new ones:
>> 
>> - adding new syscalls is generally frowned upon, and kvm will need several
>> - syscalls into modules are harder and rarer than into core kernel code
>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>> mm_struct
>> 
>> Syscalls that operate on the entire guest will pick it up implicitly
>> from the mm_struct, and syscalls that operate on a vcpu will pick it up
>> from current.
> 
> This seems like the natural progression.

I don't like the idea too much. On s390 and ppc we can set other vcpus' 
interrupt status. How would that work in this model?

I really do like the ioctl model btw. It's easily extensible and easy to 
understand.

I can also promise you that I have no idea what other extensions we will need 
in the next few years. The non-x86 targets are just moving really fast. So 
having an interface that allows for easy extension is a must-have.

> 
>> State accessors
>> ---
>> Currently vcpu state is read and written by a bunch of ioctls that
>> access register sets that were added (or discovered) along the years.
>> Some state is stored in the vcpu mmap area.  These will be replaced by a
>> pair of syscalls that read or write the entire state, or a subset of the
>> state, in a tag/value format.  A register will be described by a tuple:
>> 
>>   set: the register set to which it belongs; either a real set (GPR,
>> x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
>> eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>>   number: register number within a set
>>   size: for self-description, and to allow expanding registers like
>> SSE->AVX or eax->rax
>>   attributes: read-write, read-only, read-only for guest but read-write
>> for host
>>   value
> 
> I do like the idea a lot of being able to read one register at a time, as 
> oftentimes that's all you need.

The framework is in KVM today. It's called ONE_REG. So far only PPC implements 
a few registers. If you like it, just throw all the x86 ones in there and you 
have everything you need.
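
For reference, a ONE_REG access is a (register id, userspace buffer) pair
passed to a vcpu ioctl; a minimal sketch (KVM_REG_PPC_HIOR is one of the few
ids defined so far, vcpu_fd is assumed, error handling omitted):

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: read a single register through ONE_REG.  The id encodes the
   * architecture, size and register number. */
  static __u64 get_one_reg(int vcpu_fd, __u64 id)
  {
      __u64 val = 0;
      struct kvm_one_reg reg = {
          .id   = id,
          .addr = (__u64)(unsigned long)&val,   /* userspace buffer */
      };
      ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
      return val;
  }

  /* usage (sketch): __u64 hior = get_one_reg(vcpu_fd, KVM_REG_PPC_HIOR); */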

> 
>> 
>> Device model
>> 
>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>> PCI devices assigned from the host.  The API allows emulating the local
>> APICs in userspace.
>> 
>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>> them to userspace.
> 
> I'm a big fan of this.
> 
>> Note: this may cause a regression for older guests
>> that don't support MSI or kvmclock.  Device assignment will be done
>> using VFIO, that is, without direct kvm involvement.
>> 
>> Local APICs will be mandatory, but it will be possible to hide them from
>> the guest.  This means that it will no longer be possible to emulate an
>> APIC in userspace, but it will be possible to virtualize an APIC-less
>> core - userspace will play with the LINT0/LINT1 inputs (configured as
>> EXITINT and NMI) to queue interrupts and NMIs.
> 
> I think this makes sense.  An interesting consequence of this is that it's no 
> longer necessary to associate the VCPU context with an MMIO/PIO operation.  
> I'm not sure if there's an obvious benefit to that but it's interesting 
> nonetheless.
> 
>> The communications between the local APIC and the IOAPIC/PIC will be
>> done over a socketpair, emulating the APIC bus protocol.

What is keeping us from moving there today?

>> 
>> Ioeventfd/irqfd
>> ---
>> As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> retained, and perhaps supplemented with a way to assign an mmio region
>>  to a socketpair carrying transactions.  This allows a device model to be
>>  implemented out-of-process.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Michael Ellerman
On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
> > On 02/03/2012 12:07 PM, Eric Northup wrote:
> >> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity  wrote:
> >> [...]
> >>>
> >>> Moving to syscalls avoids these problems, but introduces new ones:
> >>>
> >>> - adding new syscalls is generally frowned upon, and kvm will need
> >>> several
> >>> - syscalls into modules are harder and rarer than into core kernel code
> >>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> >>> mm_struct
> >> - Lost a good place to put access control (permissions on /dev/kvm)
> >> for which user-mode processes can use KVM.
> >>
> >> How would the ability to use sys_kvm_* be regulated?
> > 
> > Why should it be regulated?
> > 
> > It's not a finite or privileged resource.
> 
> You're exposing a large, complex kernel subsystem that does very
> low-level things with the hardware.  It's a potential source of exploits
> (from bugs in KVM or in hardware).  I can see people wanting to be
> selective with access because of that.

Exactly.

In a perfect world I'd agree with Anthony, but in reality I think
sysadmins are quite happy that they can prevent some users from using
KVM.

You could presumably achieve something similar with capabilities or
whatever, but a node in /dev is much simpler.

cheers




Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Alexander Graf

On 07.02.2012, at 07:58, Michael Ellerman wrote:

> On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
>> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
>>> On 02/03/2012 12:07 PM, Eric Northup wrote:
 On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity  wrote:
 [...]
> 
> Moving to syscalls avoids these problems, but introduces new ones:
> 
> - adding new syscalls is generally frowned upon, and kvm will need
> several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
 - Lost a good place to put access control (permissions on /dev/kvm)
 for which user-mode processes can use KVM.
 
 How would the ability to use sys_kvm_* be regulated?
>>> 
>>> Why should it be regulated?
>>> 
>>> It's not a finite or privileged resource.
>> 
>> You're exposing a large, complex kernel subsystem that does very
>> low-level things with the hardware.  It's a potential source of exploits
>> (from bugs in KVM or in hardware).  I can see people wanting to be
>> selective with access because of that.
> 
> Exactly.
> 
> In a perfect world I'd agree with Anthony, but in reality I think
> sysadmins are quite happy that they can prevent some users from using
> KVM.
> 
> You could presumably achieve something similar with capabilities or
> whatever, but a node in /dev is much simpler.

Well, you could still keep the /dev/kvm node and then have syscalls operate on 
the fd.

But again, I don't see the problem with the ioctl interface. It's nice, 
extensible and works great for us.


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/06/2012 07:41 PM, Rob Earhart wrote:

>>
>>  I like the ioctl() interface.  If the overhead matters in your hot path,
>
>  I can't say that it's a pressing problem, but it's not negligible.
>
>>  I suspect you're doing it wrong;
>
>  What am I doing wrong?

"You the vmm" not "you the KVM maintainer" :-)

To be a little more precise: If a VCPU thread is going all the way out
to host usermode in its hot path, that's probably a performance
problem regardless of how fast you make the transitions between host
user and host kernel.


Why?


That's why ioctl() doesn't bother me.  I think it'd be more useful to
focus on mechanisms which don't require the VCPU thread to exit at all
in its hot paths, so the overhead of the ioctl() really becomes lost
in the noise.  irq fds and ioevent fds are great for that, and I
really like your MMIO-over-socketpair idea.


I like them too, but they're not suitable for all cases.

An ioeventfd, or an unordered write-over-mmio-socketpair, can take one of 
two paths:


 - waking up an idle mmio service thread on a different core, involving 
a double context switch on that remote core
 - scheduling the idle mmio service thread on the current core, 
involving both a double context switch and a heavyweight exit


An ordered write-over-mmio-socketpair, or a read-over-mmio-socketpair, 
can also take one of two paths:
 - waking up an idle mmio service thread on a different core, involving 
a double context switch on that remote core, and also two context 
switches on the current core (while we wait for a reply); if the 
current core schedules a user task we might also have a heavyweight exit
 - scheduling the idle mmio service thread on the current core, 
involving both a double context switch and a heavyweight exit


As you can see, the actual work is greater for threaded io handlers than 
the synchronous ones.  The real advantage is that you can perform more 
work in parallel if you have the spare cores (not a given in 
consolidation environments) and if you actually have a lot of work to do 
(like virtio-net in a throughput load).  It doesn't quite fit a "read 
hpet register" load.






>>  This would allow for a variety of different programming styles for the
>>  VMM--I'm a fan of CSP model myself, but that's hard to do with the
>>  current API.
>
>  Just convert the synchronous API to an RPC over a pipe, in the vcpu
>  thread, and you have the asynchronous model you asked for.

Yup.  But you still get multiple threads in your process.  It's not a
disaster, though.



You have multiple threads anyway, even if it's the kernel that creates them.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/06/2012 09:11 PM, Anthony Liguori wrote:


I'm not so sure.  ioeventfds and a future mmio-over-socketpair have to 
put the kthread to sleep while it waits for the other end to process 
it.  This is effectively equivalent to a heavy weight exit.  The 
difference in cost is dropping to userspace which is really neglible 
these days (< 100 cycles).


On what machine did you measure these wonderful numbers?

But I agree a heavyweight exit is probably faster than a double context 
switch on a remote core.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/07/2012 03:08 AM, Alexander Graf wrote:

I don't like the idea too much. On s390 and ppc we can set other vcpu's 
interrupt status. How would that work in this model?


It would be a "vm-wide syscall".  You can also do that on x86 (through 
KVM_IRQ_LINE).
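
For reference, a minimal sketch of that existing x86 path (vm_fd and the GSI
number are example values, error handling omitted):

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: assert and deassert a guest interrupt line from any thread. */
  static void pulse_irq(int vm_fd, unsigned int gsi)
  {
      struct kvm_irq_level line = { .irq = gsi, .level = 1 };

      ioctl(vm_fd, KVM_IRQ_LINE, &line);    /* assert */
      line.level = 0;
      ioctl(vm_fd, KVM_IRQ_LINE, &line);    /* deassert */
  }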




I really do like the ioctl model btw. It's easily extensible and easy to 
understand.

I can also promise you that I have no idea what other extensions we will need 
in the next few years. The non-x86 targets are just really very moving. So 
having an interface that allows for easy extension is a must-have.


Good point.  If we ever go through with it, it will only be after we see 
the interface has stabilized.




>
>>  State accessors
>>  ---
>>  Currently vcpu state is read and written by a bunch of ioctls that
>>  access register sets that were added (or discovered) along the years.
>>  Some state is stored in the vcpu mmap area.  These will be replaced by a
>>  pair of syscalls that read or write the entire state, or a subset of the
>>  state, in a tag/value format.  A register will be described by a tuple:
>>
>>set: the register set to which it belongs; either a real set (GPR,
>>  x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
>>  eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>>number: register number within a set
>>size: for self-description, and to allow expanding registers like
>>  SSE->AVX or eax->rax
>>attributes: read-write, read-only, read-only for guest but read-write
>>  for host
>>value
>
>  I do like the idea a lot of being able to read one register at a time as 
often times that's all you need.

The framework is in KVM today. It's called ONE_REG. So far only PPC implements 
a few registers. If you like it, just throw all the x86 ones in there and you 
have everything you need.


This is more like MANY_REG, where you scatter/gather a list of registers 
in userspace to the kernel or vice versa.




>>  The communications between the local APIC and the IOAPIC/PIC will be
>>  done over a socketpair, emulating the APIC bus protocol.

What is keeping us from moving there today?


The biggest problem with this proposal is that what we have today works 
reasonably well.  Nothing is keeping us from moving there, except the 
fear of performance regressions and lack of strong motivation.




>>
>>  Ioeventfd/irqfd
>>  ---
>>  As the ioeventfd/irqfd mechanism has been quite successful, it will be
>>  retained, and perhaps supplemented with a way to assign an mmio region
>>  to a socketpair carrying transactions.  This allows a device model to be
>>  implemented out-of-process.  The socketpair can also be used to
>>  implement a replacement for coalesced mmio, by not waiting for responses
>>  on write transactions when enabled.  Synchronization of coalesced mmio
>>  will be implemented in the kernel, not userspace as now: when a
>>  non-coalesced mmio is needed, the kernel will first flush the coalesced
>>  mmio queue(s).

I would vote for completely deprecating coalesced MMIO. It is a generic 
framework that nobody except for VGA really needs.


It's actually used by e1000 too; I don't remember what the performance 
benefits are.  Of course, few people use e1000.
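
For context, coalesced mmio today is a per-vm zone registration; writes into
the zone are queued in a ring that userspace drains on the next exit.  A rough
sketch of the registration, using an example VGA-like range:

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: coalesce writes to a framebuffer-like range instead of exiting
   * on each one.  Address and size are example values. */
  static void coalesce_vga_window(int vm_fd)
  {
      struct kvm_coalesced_mmio_zone zone = {
          .addr = 0xa0000,       /* legacy VGA window, as an example */
          .size = 0x20000,
      };
      ioctl(vm_fd, KVM_REGISTER_COALESCED_MMIO, &zone);
      /* queued writes are later drained from the ring page mapped from the
       * vcpu fd, at the offset reported by KVM_CAP_COALESCED_MMIO */
  }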



Better make something that accelerates read and write paths thanks to more 
specific knowledge of the interface.

One thing I'm thinking of here is IDE. There's no need to PIO callback into 
user space for all the status ports. We only really care about a callback on 
write to 7 (cmd). All the others are basically registers that the kernel could 
just read and write from shared memory.

I'm sure the VGA text stuff could use similar acceleration with well-known 
interfaces.


This goes back to the discussion about a kernel bytecode vm for 
accelerating mmio.  The problem is that we need something really general.



To me, coalesced mmio has proven that it's a generalization where it doesn't 
belong.


But you want to generalize it even more?

There's no way a patch with 'VGA' in it would be accepted.



>>
>>  Guest memory management
>>  ---
>>  Instead of managing each memory slot individually, a single API will be
>>  provided that replaces the entire guest physical memory map atomically.
>>  This matches the implementation (using RCU) and plugs holes in the
>>  current API, where you lose the dirty log in the window between the last
>>  call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
>>  that removes the slot.

So we render the actual slot logic invisible? That's a very good idea.


No, slots still exist.  Only the API is "replace slot list" instead of 
"add slot" and "remove slot".




>>
>>  Slot-based dirty logging will be replaced by range-based and work-based
>>  dirty logging; that is "what pages are dirty in this range, which may be
>>  smaller than a slot" and "don't return more than N pages".
>>
>>  We may want to place the log in user memory instead of kernel memory

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Anthony Liguori

On 02/06/2012 01:46 PM, Scott Wood wrote:

On 02/03/2012 04:52 PM, Anthony Liguori wrote:

On 02/03/2012 12:07 PM, Eric Northup wrote:

On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity   wrote:
[...]


Moving to syscalls avoids these problems, but introduces new ones:

- adding new syscalls is generally frowned upon, and kvm will need
several
- syscalls into modules are harder and rarer than into core kernel code
- will need to add a vcpu pointer to task_struct, and a kvm pointer to
mm_struct

- Lost a good place to put access control (permissions on /dev/kvm)
for which user-mode processes can use KVM.

How would the ability to use sys_kvm_* be regulated?


Why should it be regulated?

It's not a finite or privileged resource.


You're exposing a large, complex kernel subsystem that does very
low-level things with the hardware.


As does the rest of the kernel.


 It's a potential source of exploits
(from bugs in KVM or in hardware).  I can see people wanting to be
selective with access because of that.


As is true of the rest of the kernel.

If you want finer grain access control, that's exactly why we have things like 
LSM and SELinux.  You can add the appropriate LSM hooks into the KVM 
infrastructure and setup default SELinux policies appropriately.



And sometimes it is a finite resource.  I don't know how x86 does it,
but on at least some powerpc hardware we have a finite, relatively small
number of hardware partition IDs.


But presumably this is per-core, right?  And they're recycled, right?  IOW, 
there isn't a limit of number of guests <= number of hardware partition IDs; 
it just impacts performance.


Regards,

Anthony Liguori



-Scott






Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/07/2012 02:28 PM, Anthony Liguori wrote:



 It's a potential source of exploits
(from bugs in KVM or in hardware).  I can see people wanting to be
selective with access because of that.


As is true of the rest of the kernel.

If you want finer grain access control, that's exactly why we have 
things like LSM and SELinux.  You can add the appropriate LSM hooks 
into the KVM infrastructure and setup default SELinux policies 
appropriately.


LSMs protect objects, not syscalls.  There isn't an object to protect 
here (except the fake /dev/kvm object).


In theory, kvm is exactly the same as other syscalls, but in practice, 
it is used by only very few user programs, so there may be many 
unexercised paths.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Anthony Liguori

On 02/07/2012 06:40 AM, Avi Kivity wrote:

On 02/07/2012 02:28 PM, Anthony Liguori wrote:



It's a potential source of exploits
(from bugs in KVM or in hardware). I can see people wanting to be
selective with access because of that.


As is true of the rest of the kernel.

If you want finer grain access control, that's exactly why we have things like
LSM and SELinux. You can add the appropriate LSM hooks into the KVM
infrastructure and setup default SELinux policies appropriately.


LSMs protect objects, not syscalls. There isn't an object to protect here
(except the fake /dev/kvm object).


A VM can be an object.

Regards,

Anthony Liguori


In theory, kvm is exactly the same as other syscalls, but in practice, it is
used by only very few user programs, so there may be many unexercised paths.





Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Alexander Graf

On 07.02.2012, at 13:24, Avi Kivity wrote:

> On 02/07/2012 03:08 AM, Alexander Graf wrote:
>> I don't like the idea too much. On s390 and ppc we can set other vcpu's 
>> interrupt status. How would that work in this model?
> 
> It would be a "vm-wide syscall".  You can also do that on x86 (through 
> KVM_IRQ_LINE).
> 
>> 
>> I really do like the ioctl model btw. It's easily extensible and easy to 
>> understand.
>> 
>> I can also promise you that I have no idea what other extensions we will 
>> need in the next few years. The non-x86 targets are just really very moving. 
>> So having an interface that allows for easy extension is a must-have.
> 
> Good point.  If we ever go through with it, it will only be after we see the 
> interface has stabilized.

Not sure we'll ever get there. For PPC, it will probably take another 1-2 years 
until we get the 32-bit targets stabilized. By then we will have new 64-bit 
support though. And then the next gen will come out giving us even more new 
constraints.

The same goes for ARM, where we will get v7 support for now, but very soon we 
will also want to get v8. Stabilizing a target so far takes ~1-2 years from 
what I've seen, and that's stabilizing to the point where we don't find major 
ABI issues anymore.

> 
>> 
>> >
>> >>  State accessors
>> >>  ---
>> >>  Currently vcpu state is read and written by a bunch of ioctls that
>> >>  access register sets that were added (or discovered) along the years.
>> >>  Some state is stored in the vcpu mmap area.  These will be replaced by a
>> >>  pair of syscalls that read or write the entire state, or a subset of the
>> >>  state, in a tag/value format.  A register will be described by a tuple:
>> >>
>> >>set: the register set to which it belongs; either a real set (GPR,
>> >>  x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
>> >>  eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>> >>number: register number within a set
>> >>size: for self-description, and to allow expanding registers like
>> >>  SSE->AVX or eax->rax
>> >>attributes: read-write, read-only, read-only for guest but read-write
>> >>  for host
>> >>value
>> >
>> >  I do like the idea a lot of being able to read one register at a time as 
>> > often times that's all you need.
>> 
>> The framework is in KVM today. It's called ONE_REG. So far only PPC 
>> implements a few registers. If you like it, just throw all the x86 ones in 
>> there and you have everything you need.
> 
> This is more like MANY_REG, where you scatter/gather a list of registers in 
> userspace to the kernel or vice versa.

Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was to 
give every register a unique identifier that can be used to access it. Taking 
that logic to an array is trivial.

> 
>> 
>> >>  The communications between the local APIC and the IOAPIC/PIC will be
>> >>  done over a socketpair, emulating the APIC bus protocol.
>> 
>> What is keeping us from moving there today?
> 
> The biggest problem with this proposal is that what we have today works 
> reasonably well.  Nothing is keeping us from moving there, except the fear of 
> performance regressions and lack of strong motivation.

So why bring it up in the "next-gen" api discussion?

> 
>> 
>> >>
>> >>  Ioeventfd/irqfd
>> >>  ---
>> >>  As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> >>  retained, and perhaps supplemented with a way to assign an mmio region
>> >>  to a socketpair carrying transactions.  This allows a device model to be
>> >>  implemented out-of-process.  The socketpair can also be used to
>> >>  implement a replacement for coalesced mmio, by not waiting for responses
>> >>  on write transactions when enabled.  Synchronization of coalesced mmio
>> >>  will be implemented in the kernel, not userspace as now: when a
>> >>  non-coalesced mmio is needed, the kernel will first flush the coalesced
>> >>  mmio queue(s).
>> 
>> I would vote for completely deprecating coalesced MMIO. It is a generic 
>> framework that nobody except for VGA really needs.
> 
> It's actually used by e1000 too, don't remember what the performance benefits 
> are.  Of course, few people use e1000.

And for e1000 it's only used for nvram, which could actually benefit from a more 
clever "this is backed by ram" logic. Coalesced mmio is not a great fit here.

> 
>> Better make something that accelerates read and write paths thanks to more 
>> specific knowledge of the interface.
>> 
>> One thing I'm thinking of here is IDE. There's no need to PIO callback into 
>> user space for all the status ports. We only really care about a callback on 
>> write to 7 (cmd). All the others are basically registers that the kernel 
>> could just read and write from shared memory.
>> 
>> I'm sure the VGA text stuff could use similar acceleration with well-known 
>> interfaces.
> 
> This goes back to the discussion about a kernel bytecode vm for accelerating 
>

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/07/2012 02:51 PM, Alexander Graf wrote:

On 07.02.2012, at 13:24, Avi Kivity wrote:

>  On 02/07/2012 03:08 AM, Alexander Graf wrote:
>>  I don't like the idea too much. On s390 and ppc we can set other vcpu's 
interrupt status. How would that work in this model?
>
>  It would be a "vm-wide syscall".  You can also do that on x86 (through 
KVM_IRQ_LINE).
>
>>
>>  I really do like the ioctl model btw. It's easily extensible and easy to 
understand.
>>
>>  I can also promise you that I have no idea what other extensions we will 
need in the next few years. The non-x86 targets are just really very moving. So 
having an interface that allows for easy extension is a must-have.
>
>  Good point.  If we ever go through with it, it will only be after we see the 
interface has stabilized.

Not sure we'll ever get there. For PPC, it will probably take another 1-2 years 
until we get the 32-bit targets stabilized. By then we will have new 64-bit 
support though. And then the next gen will come out giving us even more new 
constraints.


I would expect that newer archs have fewer constraints, not more.


The same goes for ARM, where we will get v7 support for now, but very soon we 
will also want to get v8. Stabilizing a target so far takes ~1-2 years from 
what I've seen. And that stabilizing to a point where we don't find major ABI 
issues anymore.


The trick is to get the ABI to be flexible, like a generalized ABI for 
state.  But it's true that it's really hard to nail it down.




>>
>>  The framework is in KVM today. It's called ONE_REG. So far only PPC 
implements a few registers. If you like it, just throw all the x86 ones in there and 
you have everything you need.
>
>  This is more like MANY_REG, where you scatter/gather a list of registers in 
userspace to the kernel or vice versa.

Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was to 
give every register a unique identifier that can be used to access it. Taking 
that logic to an array is trivial.


Definitely easy to extend.



>
>>
>>  >>   The communications between the local APIC and the IOAPIC/PIC will be
>>  >>   done over a socketpair, emulating the APIC bus protocol.
>>
>>  What is keeping us from moving there today?
>
>  The biggest problem with this proposal is that what we have today works 
reasonably well.  Nothing is keeping us from moving there, except the fear of 
performance regressions and lack of strong motivation.

So why bring it up in the "next-gen" api discussion?


One reason is to try to shape future changes to the current ABI in the 
same direction.  Another is that maybe someone will convince us that it 
is needed.



>
>  There's no way a patch with 'VGA' in it would be accepted.

Why not? I think the natural step forward is hybrid acceleration. Take a 
minimal subset of device emulation into kernel land, keep the rest in user 
space.



When a device is fully in the kernel, we have a good specification of 
the ABI: it just implements the spec, and the ABI provides the interface 
from the device to the rest of the world.  Partially accelerated devices 
mean a much greater effort in specifying exactly what they do.  They're 
also vulnerable to changes in how the guest uses the device.



Similar to how vhost works, where we keep device enumeration and configuration 
in user space, but ring processing in kernel space.


vhost-net was a massive effort, I hope we don't have to replicate it.



Good candidates for in-kernel acceleration are:

   - HPET


Yes


   - VGA
   - IDE


Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
virtio-scsi).



I'm not sure how easy it would be to only partially accelerate the hot paths of 
the IO-APIC. I'm not too familiar with its details.


Pretty hard.



We will run into the same thing with the MPIC though. On e500v2, IPIs are done 
through the MPIC. So if we want any SMP performance on those, we need to shove 
that part into the kernel. I don't really want to have all of the MPIC code in 
there however. So a hybrid approach sounds like a great fit.


Pointer to the qemu code?


The problem with in-kernel device emulation the way we have it today is that 
it's an all-or-nothing choice. Either we push the device into kernel space or 
we keep it in user space. That adds a lot of code in kernel land where it 
doesn't belong.


Like I mentioned, I see that as a good thing.


>
>  No, slots still exist.  Only the API is "replace slot list" instead of "add slot" and 
"remove slot".

Why?


Physical memory is discontiguous, and includes aliases (two gpas 
referencing the same backing page).  How else would you describe it?



On PPC we walk the slots on every fault (incl. mmio), so fast lookup times 
there would be great. I was thinking of something page table like here.


We can certainly convert the slots to a tree internally.  I'm doing the 
same thing for qemu now, maybe we can do it for kvm too.  No need to 
involve the ABI at all.
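
Purely as an illustration of such an internal lookup (independent of the ABI),
a sorted-array binary search over slots might look like this; the struct is a
simplified stand-in, not the real kvm_memory_slot:

  #include <linux/types.h>

  /* Illustration only: binary search over slots kept sorted by base gfn,
   * instead of a linear walk on every fault. */
  struct slot {
      __u64 base_gfn;
      __u64 npages;
  };

  static struct slot *find_slot(struct slot *slots, int nr_slots, __u64 gfn)
  {
      int lo = 0, hi = nr_slots - 1;

      while (lo <= hi) {
          int mid = lo + (hi - lo) / 2;

          if (gfn < slots[mid].base_gfn)
              hi = mid - 1;
          else if (gfn >= slots[mid].base_gfn + slots[mid].npages)
              lo = mid + 1;
          else
              return &slots[mid];
      }
      return NULL;    /* no slot covers this gfn */
  }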


Slot searching

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/07/2012 02:51 PM, Anthony Liguori wrote:

On 02/07/2012 06:40 AM, Avi Kivity wrote:

On 02/07/2012 02:28 PM, Anthony Liguori wrote:



It's a potential source of exploits
(from bugs in KVM or in hardware). I can see people wanting to be
selective with access because of that.


As is true of the rest of the kernel.

If you want finer grain access control, that's exactly why we have 
things like

LSM and SELinux. You can add the appropriate LSM hooks into the KVM
infrastructure and setup default SELinux policies appropriately.


LSMs protect objects, not syscalls. There isn't an object to protect 
here

(except the fake /dev/kvm object).


A VM can be an object.



Not really, it's not accessible in a namespace.  How would you label it?

Maybe we can reuse the process label/context (not sure what the right 
term is for a process).


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Alexander Graf

On 07.02.2012, at 14:16, Avi Kivity wrote:

> On 02/07/2012 02:51 PM, Alexander Graf wrote:
>> On 07.02.2012, at 13:24, Avi Kivity wrote:
>> 
>> >  On 02/07/2012 03:08 AM, Alexander Graf wrote:
>> >>  I don't like the idea too much. On s390 and ppc we can set other vcpu's 
>> >> interrupt status. How would that work in this model?
>> >
>> >  It would be a "vm-wide syscall".  You can also do that on x86 (through 
>> > KVM_IRQ_LINE).
>> >
>> >>
>> >>  I really do like the ioctl model btw. It's easily extensible and easy to 
>> >> understand.
>> >>
>> >>  I can also promise you that I have no idea what other extensions we will 
>> >> need in the next few years. The non-x86 targets are just really very 
>> >> moving. So having an interface that allows for easy extension is a 
>> >> must-have.
>> >
>> >  Good point.  If we ever go through with it, it will only be after we see 
>> > the interface has stabilized.
>> 
>> Not sure we'll ever get there. For PPC, it will probably take another 1-2 
>> years until we get the 32-bit targets stabilized. By then we will have new 
>> 64-bit support though. And then the next gen will come out giving us even 
>> more new constraints.
> 
> I would expect that newer archs have less constraints, not more.

Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid 
out stuff wrong before?

I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
completely new architecture.

And what if MIPS comes along? I hear they're also working on hw accelerated 
virtualization.

> 
>> The same goes for ARM, where we will get v7 support for now, but very soon 
>> we will also want to get v8. Stabilizing a target so far takes ~1-2 years 
>> from what I've seen. And that stabilizing to a point where we don't find 
>> major ABI issues anymore.
> 
> The trick is to get the ABI to be flexible, like a generalized ABI for state. 
>  But it's true that it's really hard to nail it down.

Yup, and I think what we have today is a pretty good approach to this. I'm 
trying to mostly add "generalized" ioctls whenever I see that something can be 
handled generically, like ONE_REG or ENABLE_CAP. If we keep moving that 
direction, we are extensible with a reasonably stable ABI. Even without 
syscalls.

> 
> 
>> >>
>> >>  The framework is in KVM today. It's called ONE_REG. So far only PPC 
>> >> implements a few registers. If you like it, just throw all the x86 ones 
>> >> in there and you have everything you need.
>> >
>> >  This is more like MANY_REG, where you scatter/gather a list of registers 
>> > in userspace to the kernel or vice versa.
>> 
>> Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was 
>> to give every register a unique identifier that can be used to access it. 
>> Taking that logic to an array is trivial.
> 
> Definitely easy to extend.
> 
> 
>> >
>> >>
>> >>  >>   The communications between the local APIC and the IOAPIC/PIC will be
>> >>  >>   done over a socketpair, emulating the APIC bus protocol.
>> >>
>> >>  What is keeping us from moving there today?
>> >
>> >  The biggest problem with this proposal is that what we have today works 
>> > reasonably well.  Nothing is keeping us from moving there, except the fear 
>> > of performance regressions and lack of strong motivation.
>> 
>> So why bring it up in the "next-gen" api discussion?
> 
> One reason is to try to shape future changes to the current ABI in the same 
> direction.  Another is that maybe someone will convince us that it is needed.
> 
>> >
>> >  There's no way a patch with 'VGA' in it would be accepted.
>> 
>> Why not? I think the natural step forward is hybrid acceleration. Take a 
>> minimal subset of device emulation into kernel land, keep the rest in user 
>> space.
> 
> 
> When a device is fully in the kernel, we have a good specification of the 
> ABI: it just implements the spec, and the ABI provides the interface from the 
> device to the rest of the world.  Partially accelerated devices means a much 
> greater effort in specifying exactly what it does.  It's also vulnerable to 
> changes in how the guest uses the device.

Why? For the HPET timer register for example, we could have a simple MMIO hook 
that says

  on_read:
      return read_current_time() - shared_page.offset;
  on_write:
      handle_in_user_space();

For IDE, it would be as simple as

  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
      register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
      register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }

and we should have reduced overhead of IDE by quite a bit already. All the 
other 2k LOC in hw/ide/core.c don't matter for us really.

> 
>> Similar to how vhost works, where we keep device enumeration and 
>> configuration in user space, but ring processing in kernel space.
> 
> vhost-net was a massive effort, I hope we don

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/07/2012 03:40 PM, Alexander Graf wrote:

>>
>>  Not sure we'll ever get there. For PPC, it will probably take another 1-2 
years until we get the 32-bit targets stabilized. By then we will have new 64-bit 
support though. And then the next gen will come out giving us even more new 
constraints.
>
>  I would expect that newer archs have less constraints, not more.

Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid 
out stuff wrong before?


That's not what I mean by constraints.  It's easy to accommodate 
different register layouts.  Constraints (for me) are like requiring 
gang scheduling.  But you introduced the subject - what did you mean?


Let's take for example the software-controlled TLB on some ppc.  It's 
tempting to call them all "registers" and use the register interface to 
access them.  Is it workable?


Or let's look at SMM on x86.  To implement it, memory slots need an 
additional attribute "SMM/non-SMM/either".  These sorts of things, if you 
don't think of them beforehand, break your interface.




I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
completely new architecture.

And what if MIPS comes along? I hear they also work on hw accelerated 
virtualization.


If it's just a matter of different register names and sizes, no 
problem.  From what I've seen of v8, it doesn't introduce new weirdnesses.




>
>>  The same goes for ARM, where we will get v7 support for now, but very soon 
we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what 
I've seen. And that stabilizing to a point where we don't find major ABI issues 
anymore.
>
>  The trick is to get the ABI to be flexible, like a generalized ABI for 
state.  But it's true that it's really hard to nail it down.

Yup, and I think what we have today is a pretty good approach to this. I'm trying to 
mostly add "generalized" ioctls whenever I see that something can be handled 
generically, like ONE_REG or ENABLE_CAP. If we keep moving that direction, we are 
extensible with a reasonably stable ABI. Even without syscalls.


Syscalls are orthogonal to that - they're to avoid the fget_light() and 
to tighten the vcpu/thread and vm/process relationship.



, keep the rest in user space.
>
>
>  When a device is fully in the kernel, we have a good specification of the 
ABI: it just implements the spec, and the ABI provides the interface from the 
device to the rest of the world.  Partially accelerated devices means a much 
greater effort in specifying exactly what it does.  It's also vulnerable to 
changes in how the guest uses the device.

Why? For the HPET timer register for example, we could have a simple MMIO hook 
that says

   on_read:
 return read_current_time() - shared_page.offset;
   on_write:
 handle_in_user_space();


It works for the really simple cases, yes, but if the guest wants to set 
up one-shot timers, it fails.  Also look at the PIT which latches on read.




For IDE, it would be as simple as

   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
   for (i = 1; i<  7; i++) {
 register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
 register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
   }

and we should have reduced overhead of IDE by quite a bit already. All the 
other 2k LOC in hw/ide/core.c don't matter for us really.



Just use virtio.



>
>>  Similar to how vhost works, where we keep device enumeration and 
configuration in user space, but ring processing in kernel space.
>
>  vhost-net was a massive effort, I hope we don't have to replicate it.

Was it harder than the in-kernel io-apic?


Much, much harder.



>
>>
>>  Good candidates for in-kernel acceleration are:
>>
>>- HPET
>
>  Yes
>
>>- VGA
>>- IDE
>
>  Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
virtio-scsi).

Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI 
needs 3rd party drivers on w2k3 and wxp. I'm pretty sure non-Linux non-Windows 
systems won't get QXL drivers. Same for virtio.

Please don't make the Xen mistake again of claiming that all we care about is 
Linux as a guest.


Rest easy, there's no chance of that.  But if a guest is important 
enough, virtio drivers will get written.  IDE has no chance in hell of 
approaching virtio-blk performance, no matter how much effort we put 
into it.



KVM's strength has always been its close resemblance to hardware.


This will remain.  But we can't optimize everything.


>
>>
>>  We will run into the same thing with the MPIC though. On e500v2, IPIs are 
done through the MPIC. So if we want any SMP performance on those, we need to shove 
that part into the kernel. I don't really want to have all of the MPIC code in there 
however. So a hybrid approach sounds like a great fit.
>
>  Pointer to the qemu code?

hw/openpic.c


I see what you mean.



>
>>  T

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Alexander Graf

On 07.02.2012, at 15:21, Avi Kivity wrote:

> On 02/07/2012 03:40 PM, Alexander Graf wrote:
>> >>
>> >>  Not sure we'll ever get there. For PPC, it will probably take another 
>> >> 1-2 years until we get the 32-bit targets stabilized. By then we will 
>> >> have new 64-bit support though. And then the next gen will come out 
>> >> giving us even more new constraints.
>> >
>> >  I would expect that newer archs have less constraints, not more.
>> 
>> Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
>> today on 32-bit, but extends a bunch of registers to 64-bit. So what if we 
>> laid out stuff wrong before?
> 
> That's not what I mean by constraints.  It's easy to accommodate different 
> register layouts.  Constraints (for me) are like requiring gang scheduling.  
> But you introduced the subject - what did you mean?

New extensions to architectures give us new challenges. Newer booke for example 
implements page tables in parallel to soft TLBs. We need to model that. My 
point was more that I can't predict the future :).

> Let's take for example the software-controlled TLB on some ppc.  It's 
> tempting to call them all "registers" and use the register interface to 
> access them.  Is it workable?

Workable, yes. Fast? No. Right now we share them between kernel and user space 
to have very fast access to them. That way we don't have to sync anything at 
all.

> Or let's look at SMM on x86.  To implement it memory slots need an additional 
> attribute "SMM/non-SMM/either".  These sort of things, if you don't think of 
> them beforehand, break your interface.

Yup. And we will never think of all the cases.

> 
>> 
>> I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
>> completely new architecture.
>> 
>> And what if MIPS comes along? I hear they also work on hw accelerated 
>> virtualization.
> 
> If it's just a matter of different register names and sizes, no problem.  
> From what I've seen of v8, it doesn't introduce new wierdnesses.

I haven't seen anything real yet, since the spec isn't out. So far only generic 
architecture documentation is available.

> 
>> 
>> >
>> >>  The same goes for ARM, where we will get v7 support for now, but very 
>> >> soon we will also want to get v8. Stabilizing a target so far takes ~1-2 
>> >> years from what I've seen. And that stabilizing to a point where we don't 
>> >> find major ABI issues anymore.
>> >
>> >  The trick is to get the ABI to be flexible, like a generalized ABI for 
>> > state.  But it's true that it's really hard to nail it down.
>> 
>> Yup, and I think what we have today is a pretty good approach to this. I'm 
>> trying to mostly add "generalized" ioctls whenever I see that something can 
>> be handled generically, like ONE_REG or ENABLE_CAP. If we keep moving that 
>> direction, we are extensible with a reasonably stable ABI. Even without 
>> syscalls.
> 
> Syscalls are orthogonal to that - they're to avoid the fget_light() and to 
> tighten the vcpu/thread and vm/process relationship.

How about keeping the ioctl interface but moving vcpu_run to a syscall then? 
That should really be the only thing that belongs in the fast path, right? 
Every time we do a register sync in user space, we do something wrong. Instead, 
user space should either

  a) have wrappers around register accesses, so it can directly ask for 
specific registers that it needs
or
  b) keep everything that would be requested by the register synchronization in 
shared memory
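
A purely hypothetical sketch of what (b) plus a vcpu_run syscall could look
like; none of these names or layouts exist today:

  #include <linux/types.h>

  /* Hypothetical only: neither sys_kvm_vcpu_run() nor this layout exists.
   * The point is that the hot path never copies registers: they live in a
   * page shared between kernel and userspace, like kvm_run does today. */
  struct kvm_vcpu_shared {
      __u64 gprs[32];            /* kept current by the kernel */
      __u32 exit_reason;
      /* ... exit-specific data ... */
  };

  /*
   *   struct kvm_vcpu_shared *shared = mmap(... vcpu area ...);
   *
   *   for (;;) {
   *       sys_kvm_vcpu_run();        // vcpu implied by the current thread
   *       handle_exit(shared);       // read/patch registers in place
   *   }
   */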

> 
>> , keep the rest in user space.
>> >
>> >
>> >  When a device is fully in the kernel, we have a good specification of the 
>> > ABI: it just implements the spec, and the ABI provides the interface from 
>> > the device to the rest of the world.  Partially accelerated devices means 
>> > a much greater effort in specifying exactly what it does.  It's also 
>> > vulnerable to changes in how the guest uses the device.
>> 
>> Why? For the HPET timer register for example, we could have a simple MMIO 
>> hook that says
>> 
>>   on_read:
>> return read_current_time() - shared_page.offset;
>>   on_write:
>> handle_in_user_space();
> 
> It works for the really simple cases, yes, but if the guest wants to set up 
> one-shot timers, it fails.  

I don't understand. Why would anything fail here? Once the logic that's 
implemented by the kernel accelerator doesn't fit anymore, unregister it.

> Also look at the PIT which latches on read.
> 
>> 
>> For IDE, it would be as simple as
>> 
>>   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
>>   for (i = 1; i<  7; i++) {
>> register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>> register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>   }
>> 
>> and we should have reduced overhead of IDE by quite a bit already. All the 
>> other 2k LOC in hw/ide/core.c don't matter for us really.
> 
> 
> Just use virtio.

Just use xenbus. Seriously, this is not an answer.

> 
>> 
>> >
>> >>  Simila

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Anthony Liguori

On 02/07/2012 07:18 AM, Avi Kivity wrote:

On 02/07/2012 02:51 PM, Anthony Liguori wrote:

On 02/07/2012 06:40 AM, Avi Kivity wrote:

On 02/07/2012 02:28 PM, Anthony Liguori wrote:



It's a potential source of exploits
(from bugs in KVM or in hardware). I can see people wanting to be
selective with access because of that.


As is true of the rest of the kernel.

If you want finer grain access control, that's exactly why we have things like
LSM and SELinux. You can add the appropriate LSM hooks into the KVM
infrastructure and setup default SELinux policies appropriately.


LSMs protect objects, not syscalls. There isn't an object to protect here
(except the fake /dev/kvm object).


A VM can be an object.



Not really, it's not accessible in a namespace. How would you label it?


Labels can originate from userspace, IIUC, so I think it's possible for QEMU (or 
whatever the userspace is) to set the label for the VM while it's creating it. 
I think this is how most of the labeling for X and things of that nature works.


Maybe Chris can set me straight.


Maybe we can reuse the process label/context (not sure what the right term is
for a process).


Regards,

Anthony Liguori







Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Anthony Liguori

On 02/07/2012 06:03 AM, Avi Kivity wrote:

On 02/06/2012 09:11 PM, Anthony Liguori wrote:


I'm not so sure. ioeventfds and a future mmio-over-socketpair have to put the
kthread to sleep while it waits for the other end to process it. This is
effectively equivalent to a heavy weight exit. The difference in cost is
dropping to userspace which is really neglible these days (< 100 cycles).


On what machine did you measure these wonderful numbers?


A syscall is what I mean by "dropping to userspace", not the cost of a heavy 
weight exit.  I think a heavy weight exit is still around a few thousand cycles.


Any nehalem class or better processor should have a syscall cost of around that 
unless I'm wildly mistaken.




But I agree a heavyweight exit is probably faster than a double context switch
on a remote core.


I meant, if you already need to take a heavyweight exit (and you do to schedule 
something else on the core), then the only additional cost is taking a syscall 
return to userspace *first* before scheduling another process.  That overhead is 
pretty low.


Regards,

Anthony Liguori








Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Anthony Liguori

On 02/07/2012 07:40 AM, Alexander Graf wrote:


Why? For the HPET timer register for example, we could have a simple MMIO hook 
that says

   on_read:
 return read_current_time() - shared_page.offset;
   on_write:
 handle_in_user_space();

For IDE, it would be as simple as

   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
   for (i = 1; i < 7; i++) {
     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
   }


You can't easily serialize updates to that address with the kernel since two 
threads are likely going to be accessing it at the same time.  That either means 
an expensive sync operation or a reliance on atomic instructions.


But not all architectures offer non-word sized atomic instructions so it gets 
fairly nasty in practice.


Regards,

Anthony Liguori


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Alexander Graf

On 07.02.2012, at 16:23, Anthony Liguori wrote:

> On 02/07/2012 07:40 AM, Alexander Graf wrote:
>> 
>> Why? For the HPET timer register for example, we could have a simple MMIO 
>> hook that says
>> 
>>   on_read:
>> return read_current_time() - shared_page.offset;
>>   on_write:
>> handle_in_user_space();
>> 
>> For IDE, it would be as simple as
>> 
>>   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
>>   for (i = 1; i < 7; i++) {
>>     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>>     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>>   }
> 
> You can't easily serialize updates to that address with the kernel since two 
> threads are likely going to be accessing it at the same time.  That either 
> means an expensive sync operation or a reliance on atomic instructions.

Yes. Essentially we want a mutex for them.

> But not all architectures offer non-word sized atomic instructions so it gets 
> fairly nasty in practice.

Well, we can always require fields to be word sized.


Alex
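
For concreteness, here is a minimal sketch of what such a hook interface could 
look like. It is purely illustrative: register_pio_hook_ptr_r()/_w() do not 
exist in KVM today, and the registry below is a userspace model rather than 
kernel code.

  /*
   * Illustrative only: models the hypothetical register_pio_hook_ptr_*()
   * API discussed in this thread; nothing here is existing KVM code.
   */
  #include <stdint.h>
  #include <string.h>

  #define SIZE_BYTE      1
  #define MAX_PIO_HOOKS  64

  struct pio_hook {
          uint16_t port;     /* I/O port covered by this hook       */
          uint8_t  size;     /* access size, e.g. SIZE_BYTE         */
          void    *backing;  /* device-model state backing the port */
  };

  static struct pio_hook read_hooks[MAX_PIO_HOOKS];
  static int nr_read_hooks;

  /* Register a read hook: a guest IN from 'port' is then satisfied
   * directly from 'backing' instead of taking a heavyweight exit. */
  static int register_pio_hook_ptr_r(uint16_t port, uint8_t size, void *backing)
  {
          if (nr_read_hooks == MAX_PIO_HOOKS)
                  return -1;
          read_hooks[nr_read_hooks++] = (struct pio_hook){
                  .port = port, .size = size, .backing = backing,
          };
          return 0;
  }

  /* Fast path on a guest port read: returns 1 if a hook handled it,
   * 0 to fall back to the normal exit to the userspace device model. */
  static int try_pio_read(uint16_t port, uint8_t size, void *out)
  {
          int i;

          for (i = 0; i < nr_read_hooks; i++) {
                  if (read_hooks[i].port == port && read_hooks[i].size == size) {
                          memcpy(out, read_hooks[i].backing, size);
                          return 1;
                  }
          }
          return 0;
  }

Requiring the backing field to be naturally aligned and word sized, or giving 
each hook its own lock, would be one way to address the serialization concern 
raised above.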



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/07/2012 05:17 PM, Anthony Liguori wrote:

On 02/07/2012 06:03 AM, Avi Kivity wrote:

On 02/06/2012 09:11 PM, Anthony Liguori wrote:


I'm not so sure. ioeventfds and a future mmio-over-socketpair have 
to put the
kthread to sleep while it waits for the other end to process it. 
This is
effectively equivalent to a heavy weight exit. The difference in 
cost is
dropping to userspace which is really negligible these days (< 100 
cycles).


On what machine did you measure these wonderful numbers?


A syscall is what I mean by "dropping to userspace", not the cost of a 
heavy weight exit. 


Ah.  But then ioeventfd has that as well, unless the other end is in the 
kernel too.



I think a heavy weight exit is still around a few thousand cycles.

Any nehalem class or better processor should have a syscall cost of 
around that unless I'm wildly mistaken.




That's what I remember too.



But I agree a heavyweight exit is probably faster than a double 
context switch

on a remote core.


I meant, if you already need to take a heavyweight exit (and you do to 
schedule something else on the core), then the only additional cost is 
taking a syscall return to userspace *first* before scheduling another 
process.  That overhead is pretty low.


Yeah.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Jan Kiszka
On 2012-02-07 17:02, Avi Kivity wrote:
> On 02/07/2012 05:17 PM, Anthony Liguori wrote:
>> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:

 I'm not so sure. ioeventfds and a future mmio-over-socketpair have
 to put the
 kthread to sleep while it waits for the other end to process it.
 This is
 effectively equivalent to a heavy weight exit. The difference in
 cost is
 dropping to userspace which is really negligible these days (< 100
 cycles).
>>>
>>> On what machine did you measure these wonderful numbers?
>>
>> A syscall is what I mean by "dropping to userspace", not the cost of a
>> heavy weight exit. 
> 
> Ah.  But then ioeventfd has that as well, unless the other end is in the
> kernel too.
> 
>> I think a heavy weight exit is still around a few thousand cycles.
>>
>> Any nehalem class or better processor should have a syscall cost of
>> around that unless I'm wildly mistaken.
>>
> 
> That's what I remember too.
> 
>>>
>>> But I agree a heavyweight exit is probably faster than a double
>>> context switch
>>> on a remote core.
>>
>> I meant, if you already need to take a heavyweight exit (and you do to
>> schedule something else on the core), then the only additional cost is
>> taking a syscall return to userspace *first* before scheduling another
>> process.  That overhead is pretty low.
> 
> Yeah.
> 

Isn't there another level in between just scheduling and full syscall
return if the user return notifier has some real work to do?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Anthony Liguori

On 02/07/2012 10:02 AM, Avi Kivity wrote:

On 02/07/2012 05:17 PM, Anthony Liguori wrote:

On 02/07/2012 06:03 AM, Avi Kivity wrote:

On 02/06/2012 09:11 PM, Anthony Liguori wrote:


I'm not so sure. ioeventfds and a future mmio-over-socketpair have to put the
kthread to sleep while it waits for the other end to process it. This is
effectively equivalent to a heavy weight exit. The difference in cost is
dropping to userspace which is really negligible these days (< 100 cycles).


On what machine did you measure these wonderful numbers?


A syscall is what I mean by "dropping to userspace", not the cost of a heavy
weight exit.


Ah. But then ioeventfd has that as well, unless the other end is in the kernel 
too.


Yes, that was my point exactly :-)

ioeventfd/mmio-over-socketpair to a different thread is not faster than a 
synchronous KVM_RUN + writing to an eventfd in userspace modulo a couple of 
cheap syscalls.


The exception is when the other end is in the kernel and there are magic 
optimizations (like there are today with ioeventfd).


Regards,

Anthony Liguori




I think a heavy weight exit is still around a few thousand cycles.

Any nehalem class or better processor should have a syscall cost of around
that unless I'm wildly mistaken.



That's what I remember too.



But I agree a heavyweight exit is probably faster than a double context switch
on a remote core.


I meant, if you already need to take a heavyweight exit (and you do to
schedule something else on the core), then the only additional cost is taking
a syscall return to userspace *first* before scheduling another process. That
overhead is pretty low.


Yeah.





Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Anthony Liguori

On 02/07/2012 10:18 AM, Jan Kiszka wrote:

On 2012-02-07 17:02, Avi Kivity wrote:

On 02/07/2012 05:17 PM, Anthony Liguori wrote:

On 02/07/2012 06:03 AM, Avi Kivity wrote:

On 02/06/2012 09:11 PM, Anthony Liguori wrote:


I'm not so sure. ioeventfds and a future mmio-over-socketpair have
to put the
kthread to sleep while it waits for the other end to process it.
This is
effectively equivalent to a heavy weight exit. The difference in
cost is
dropping to userspace which is really negligible these days (< 100
cycles).


On what machine did you measure these wonderful numbers?


A syscall is what I mean by "dropping to userspace", not the cost of a
heavy weight exit.


Ah.  But then ioeventfd has that as well, unless the other end is in the
kernel too.


I think a heavy weight exit is still around a few thousand cycles.

Any nehalem class or better processor should have a syscall cost of
around that unless I'm wildly mistaken.



That's what I remember too.



But I agree a heavyweight exit is probably faster than a double
context switch
on a remote core.


I meant, if you already need to take a heavyweight exit (and you do to
schedule something else on the core), then the only additional cost is
taking a syscall return to userspace *first* before scheduling another
process.  That overhead is pretty low.


Yeah.



Isn't there another level in between just scheduling and full syscall
return if the user return notifier has some real work to do?


Depends on whether you're scheduling a kthread or a userspace process, no?  If 
you're eventually going to end up in userspace, you have to do the full heavy 
weight exit.


If you're scheduling to a kthread, it's better to do the type of trickery that 
ioeventfd does and just turn it into a function call.


Regards,

Anthony Liguori



Jan





Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Jan Kiszka
On 2012-02-07 17:21, Anthony Liguori wrote:
> On 02/07/2012 10:18 AM, Jan Kiszka wrote:
>> On 2012-02-07 17:02, Avi Kivity wrote:
>>> On 02/07/2012 05:17 PM, Anthony Liguori wrote:
 On 02/07/2012 06:03 AM, Avi Kivity wrote:
> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>
>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have
>> to put the
>> kthread to sleep while it waits for the other end to process it.
>> This is
>> effectively equivalent to a heavy weight exit. The difference in
>> cost is
>> dropping to userspace which is really negligible these days (< 100
>> cycles).
>
> On what machine did you measure these wonderful numbers?

 A syscall is what I mean by "dropping to userspace", not the cost of a
 heavy weight exit.
>>>
>>> Ah.  But then ioeventfd has that as well, unless the other end is in the
>>> kernel too.
>>>
 I think a heavy weight exit is still around a few thousand cycles.

 Any nehalem class or better processor should have a syscall cost of
 around that unless I'm wildly mistaken.

>>>
>>> That's what I remember too.
>>>
>
> But I agree a heavyweight exit is probably faster than a double
> context switch
> on a remote core.

 I meant, if you already need to take a heavyweight exit (and you do to
 schedule something else on the core), then the only additional cost is
 taking a syscall return to userspace *first* before scheduling another
 process.  That overhead is pretty low.
>>>
>>> Yeah.
>>>
>>
>> Isn't there another level in between just scheduling and full syscall
>> return if the user return notifier has some real work to do?
> 
> Depends on whether you're scheduling a kthread or a userspace process, no?  
> If 

Kthreads can't return, of course. User space threads /may/ do so. And
then there needs to be a difference between host and guest in the
tracked MSRs. I seem to recall it's a question of another few hundred
cycles.

Jan

> you're eventually going to end up in userspace, you have to do the full heavy 
> weight exit.
> 
> If you're scheduling to a kthread, it's better to do the type of trickery 
> that 
> ioeventfd does and just turn it into a function call.
> 
> Regards,
> 
> Anthony Liguori
> 
>>
>> Jan
>>
> 

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Chris Wright
* Anthony Liguori (anth...@codemonkey.ws) wrote:
> On 02/07/2012 07:18 AM, Avi Kivity wrote:
> >On 02/07/2012 02:51 PM, Anthony Liguori wrote:
> >>On 02/07/2012 06:40 AM, Avi Kivity wrote:
> >>>On 02/07/2012 02:28 PM, Anthony Liguori wrote:
> 
> >It's a potential source of exploits
> >(from bugs in KVM or in hardware). I can see people wanting to be
> >selective with access because of that.
> 
> As is true of the rest of the kernel.
> 
> If you want finer grain access control, that's exactly why we have things 
> like
> LSM and SELinux. You can add the appropriate LSM hooks into the KVM
> infrastructure and setup default SELinux policies appropriately.
> >>>
> >>>LSMs protect objects, not syscalls. There isn't an object to protect here
> >>>(except the fake /dev/kvm object).
> >>
> >>A VM can be an object.
> >
> >Not really, it's not accessible in a namespace. How would you label it?

A VM, vcpu, etc are all objects.  The labelling can be implicit based on
the security context of the process creating the object.  You could create
simplistic rules such as a process may have the ability KVM__VM_CREATE
(this is roughly analogous to the PROC__EXECMEM policy control that
allows some processes to create executable writable memory mappings, or
SHM__CREATE for a process that can create a shared memory segment).
Adding some label mgmt to the object (add ->security and some callbacks to
do ->alloc/init/free), and then checks on the object itself would allow
for finer grained protection.  If there was any VM lookup (although the
original example explicitly ties a process to a vm and a thread to a
vcpu) the finer grained check would certainly be useful to verify that
the process can access the VM.

> Labels can originate from userspace, IIUC, so I think it's possible for QEMU
> (or whatever the userspace is) to set the label for the VM while it's
> creating it. I think this is how most of the labeling for X and things of
> that nature works.

For X, the policy enforcement is done in the X server.  There is
assistance from the kernel for doing policy server queries (can foo do
bar?), but it's up to the X server to actually care enough to ask and
then fail a request that doesn't comply.  I'm not sure that's the model
here.

thanks,
-chris
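
To make the labeling idea concrete, here is a rough, self-contained sketch; 
the types and function names are invented for illustration, and a real 
implementation would go through LSM hooks and policy queries rather than a 
string compare.

  /*
   * Purely illustrative: models the point that a VM object can carry a
   * security label inherited from the creating process, with a check on
   * later lookups.  All names here are invented.
   */
  #include <string.h>

  struct security_label {
          char context[64];              /* e.g. an SELinux context string */
  };

  struct vm_object {
          struct security_label label;   /* implicit label from the creator */
          /* ... the rest of the VM state ... */
  };

  /* On create: inherit the label from the creating process
   * (roughly the KVM__VM_CREATE style check described above). */
  static void vm_label_init(struct vm_object *vm,
                            const struct security_label *creator)
  {
          vm->label = *creator;
  }

  /* On any later VM lookup: finer grained check that the caller may
   * access this particular VM. */
  static int vm_label_allows(const struct vm_object *vm,
                             const struct security_label *caller)
  {
          return strcmp(vm->label.context, caller->context) == 0;
  }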


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Rusty Russell
On Mon, 06 Feb 2012 11:34:01 +0200, Avi Kivity  wrote:
> On 02/05/2012 06:36 PM, Anthony Liguori wrote:
> > If userspace had a way to upload bytecode to the kernel that was
> > executed for a PIO operation, it could either pass the operation to
> > userspace or handle it within the kernel when possible without taking
> > a heavy weight exit.
> >
> > If the bytecode can access variables in a shared memory area, it could
> > be pretty efficient to work with.
> >
> > This means that the kernel never has to deal with specific in-kernel
> > devices but that userspace can accelerator as many of its devices as
> > it sees fit.
> 
> I would really love to have this, but the problem is that we'd need a
> general purpose bytecode VM with binding to some kernel APIs.  The
> bytecode VM, if made general enough to host more complicated devices,
> would likely be much larger than the actual code we have in the kernel now.

We have the ability to upload bytecode into the kernel already.  It's in
a great bytecode interpreted by the CPU itself.

If every user were emulating different machines, an LPF-style approach would make
sense.  Are they?  Or should we write those helpers once, in C, and
provide that for them.

Cheers,
Rusty.


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-08 Thread Scott Wood
On 02/07/2012 06:28 AM, Anthony Liguori wrote:
> On 02/06/2012 01:46 PM, Scott Wood wrote:
>> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
>>> On 02/03/2012 12:07 PM, Eric Northup wrote:
 How would the ability to use sys_kvm_* be regulated?
>>>
>>> Why should it be regulated?
>>>
>>> It's not a finite or privileged resource.
>>
>> You're exposing a large, complex kernel subsystem that does very
>> low-level things with the hardware.
> 
> As does the rest of the kernel.

Just because other parts of the kernel made this mistake (e.g.
networking) doesn't mean that KVM should as well.

> If you want finer grain access control, that's exactly why we have
> things like LSM and SELinux.  You can add the appropriate LSM hooks into
> the KVM infrastructure and setup default SELinux policies appropriately.

Needing to use such bandaids is more complicated (or at least less
familiar to many) than setting permissions on a filesystem object.

>> And sometimes it is a finite resource.  I don't know how x86 does it,
>> but on at least some powerpc hardware we have a finite, relatively small
>> number of hardware partition IDs.
> 
> But presumably this is per-core, right?

Not currently.

I can't speak for the IBM stuff, but our hardware is designed with the
idea that a partition has a permanent system-wide LPID (partition ID).
We *may* be able to do dynamic LPID on e500mc, but it is likely to be a
problem in the future with things like LPID-based direct-to-guest
interrupt delivery.  There's also a question of prioritizing effort --
there's enough other stuff that needs work first.

> And they're recycled, right? 

Not currently (other than when a guest is destroyed, of course).

What are the advantages of getting rid of the file descriptor that
warrant this?  What is performance sensitive enough that an fd lookup is
unacceptable but the other overhead of going out to qemu is fine?

Is that fd lookup any heavier than "appropriate LSM hooks"?

If the fd overhead really is a problem, perhaps the fd could be retained
for setup operations, and omitted only on calls that require a vcpu to
have been already set up on the current thread?

-Scott



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-08 Thread Alan Cox
> If the fd overhead really is a problem, perhaps the fd could be retained
> for setup operations, and omitted only on calls that require a vcpu to
> have been already set up on the current thread?

Quite frankly I'd like to have an fd because it means you've got a
meaningful way of ensuring that id reuse problems go away. You open a
given id and keep a handle to it, if the id gets reused then your handle
will be tied to the old one so you can fail the requests.

Without an fd it's near impossible to get this right. The Unix/Linux
model is open an object, use it, close it. I see no reason not to do that.

Also the LSM hooks apply to file objects mostly, so it's a natural fit on
top *IF* you choose to use them.

Finally you can pass file handles around between processes - do that any
other way 8)

Alan


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-08 Thread Alan Cox
> > register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
> > for (i = 1; i < 7; i++) {
> >   register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> >   register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> > }
> 
> You can't easily serialize updates to that address with the kernel since two 
> threads are likely going to be accessing it at the same time.  That either 
> means 
> an expensive sync operation or a reliance on atomic instructions.

Who cares

If your API is right this isn't a problem (and for IDE, if you guess that it
won't happen, you will win 99.999% of the time).

In fact IDE you can do even better in many cases because you'll get a
single rep outsw you can trap and shortcut.

> But not all architectures offer non-word sized atomic instructions so it gets 
> fairly nasty in practice.

That's their problem. We don't screw up the fast paths because some
hardware vendor screwed up that bit of their implementation. That's
*their* problem, not everyone else's.

So on x86 IDE should be about 10 outb traps that can be predicted, a rep
outsw which can be shortcut and a completion set of inb/inw ops that can
be predicted.

You should hit userspace about once per IDE operation. Fix the hot paths
with good design and the noise doesn't matter.

Alan


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-09 Thread Jamie Lokier
Anthony Liguori wrote:
> >The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> >them to userspace.
> 
> I'm a big fan of this.

I agree with getting rid of unnecessary emulations.
(Why were those things emulated in the first place?)

But it would be good to retain some way to "plugin" device emulations
in the kernel, separate from KVM core with a well-defined API boundary.

Then it wouldn't matter to the KVM core whether there's PIT emulation
or whatever; that would just be a separate module.  Perhaps even with
its own /dev device and maybe not tightly bound to KVM.

> >Note: this may cause a regression for older guests that don't
> >support MSI or kvmclock.  Device assignment will be done using
> >VFIO, that is, without direct kvm involvement.

I don't like the sound of regressions.

I tend to think of a VM as something that needs to have consistent
behaviour over a long time, for keeping working systems running for
years despite changing hardware, or reviving old systems to test
software and make patches for things in long-term maintenance etc.

But I haven't noticed problems from upgrading kernelspace-KVM yet,
only upgrading the userspace parts.  If a kernel upgrade is risky,
that makes upgrading host kernels difficult and "all or nothing" for
all the guests within.

However it looks like you mean only the performance characteristics
will change because of moving things back to userspace?

> >Local APICs will be mandatory, but it will be possible to hide them from
> >the guest.  This means that it will no longer be possible to emulate an
> >APIC in userspace, but it will be possible to virtualize an APIC-less
> >core - userspace will play with the LINT0/LINT1 inputs (configured as
> >EXITINT and NMI) to queue interrupts and NMIs.
> 
> I think this makes sense.  An interesting consequence of this is
> that it's no longer necessary to associate the VCPU context with an
> MMIO/PIO operation.  I'm not sure if there's an obvious benefit to
> that but it's interesting nonetheless.

Would that be useful for using VCPUs to run sandboxed userspace code
with ability to trap and control the whole environment (as opposed to
guest OSes, or ptrace which is rather incomplete and unsuitable for
sandboxing code meant for other OSes)?

Thanks,
-- Jamie


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-11 Thread Takuya Yoshikawa
Avi Kivity  wrote:

> > >  Slot searching is quite fast since there's a small number of slots, and 
> > > we sort the larger ones to be in the front, so positive lookups are fast. 
> > >  We cache negative lookups in the shadow page tables (an spte can be 
> > > either "not mapped", "mapped to RAM", or "not mapped and known to be 
> > > mmio") so we rarely need to walk the entire list.
> >
> > Well, we don't always have shadow page tables. Having hints for unmapped 
> > guest memory like this is pretty tricky.
> > We're currently running into issues with device assignment though, where we 
> > get a lot of small slots mapped to real hardware. I'm sure that will hit us 
> > on x86 sooner or later too.
> 
> For x86 that's not a problem, since once you map a page, it stays mapped 
> (on modern hardware).
> 

I was once thinking about how to search a slot reasonably fast for every case,
even when we do not have mmio-spte cache.

One possible way I thought up was to sort slots according to their base_gfn.
Then the problem would become:  "find the first slot whose base_gfn + npages
is greater than this gfn."

Since we can do binary search, the search cost is O(log(# of slots)).

But I guess that most of the time was wasted on reading many memslots just to
know their base_gfn and npages.

So the most practically effective thing is to make a separate array which holds
just their base_gfn.  This will make the task a simple, and cache friendly,
search on an integer array: probably faster than using a *-tree data structure.

If needed, we should make cmp_memslot() architecture specific in the end?

Takuya
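
For illustration, a minimal sketch of that scheme; the field names mirror the 
memslot structure, everything else is made up.

  /*
   * Illustrative sketch, not kernel code: slots are kept sorted by
   * base_gfn, and a separate array of base_gfns keeps the binary search
   * cache friendly.
   */
  #include <stddef.h>
  #include <stdint.h>

  typedef uint64_t gfn_t;

  struct slot {
          gfn_t    base_gfn;
          uint64_t npages;
  };

  struct slot_array {
          int         nslots;
          gfn_t       base_gfns[128];    /* base_gfns[i] == slots[i].base_gfn   */
          struct slot slots[128];        /* sorted by base_gfn, non-overlapping */
  };

  /* Find the slot containing gfn in O(log nslots), or NULL if unmapped:
   * binary search for the last slot with base_gfn <= gfn, then check npages. */
  static struct slot *find_slot(struct slot_array *a, gfn_t gfn)
  {
          int lo = 0, hi = a->nslots;
          struct slot *s;

          while (lo < hi) {
                  int mid = lo + (hi - lo) / 2;

                  if (a->base_gfns[mid] <= gfn)
                          lo = mid + 1;
                  else
                          hi = mid;
          }
          if (lo == 0)
                  return NULL;                   /* below the first slot */
          s = &a->slots[lo - 1];
          return gfn < s->base_gfn + s->npages ? s : NULL;
  }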


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 04:39 PM, Alexander Graf wrote:
> > 
> > Syscalls are orthogonal to that - they're to avoid the fget_light() and to 
> > tighten the vcpu/thread and vm/process relationship.
>
> How about keeping the ioctl interface but moving vcpu_run to a syscall then?

I dislike half-and-half interfaces even more.  And it's not like the
fget_light() is really painful - it's just that I see it occasionally in
perf top so it annoys me.

>  That should really be the only thing that belongs into the fast path, right? 
> Every time we do a register sync in user space, we do something wrong. 
> Instead, user space should either
>
>   a) have wrappers around register accesses, so it can directly ask for 
> specific registers that it needs
> or
>   b) keep everything that would be requested by the register synchronization 
> in shared memory

Always-synced shared memory is a liability, since newer hardware might
introduce on-chip caches for that state, making synchronization
expensive.  Or we may choose to keep some of the registers loaded, if we
have a way to trap on their use from userspace - for example we can
return to userspace with the guest fpu loaded, and trap if userspace
tries to use it.

Is an extra syscall for copying TLB entries to user space prohibitively
expensive?

> > 
> >> , keep the rest in user space.
> >> >
> >> >
> >> >  When a device is fully in the kernel, we have a good specification of 
> >> > the ABI: it just implements the spec, and the ABI provides the interface 
> >> > from the device to the rest of the world.  Partially accelerated devices 
> >> > means a much greater effort in specifying exactly what it does.  It's 
> >> > also vulnerable to changes in how the guest uses the device.
> >> 
> >> Why? For the HPET timer register for example, we could have a simple MMIO 
> >> hook that says
> >> 
> >>   on_read:
> >> return read_current_time() - shared_page.offset;
> >>   on_write:
> >> handle_in_user_space();
> > 
> > It works for the really simple cases, yes, but if the guest wants to set up 
> > one-shot timers, it fails.  
>
> I don't understand. Why would anything fail here? 

It fails to provide a benefit, I didn't mean it causes guest failures.

You also have to make sure the kernel part and the user part use exactly
the same time bases.

> Once the logic that's implemented by the kernel accelerator doesn't fit 
> anymore, unregister it.

Yeah.

>
> > Also look at the PIT which latches on read.
> > 
> >> 
> >> For IDE, it would be as simple as
> >> 
> >>   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
> >>   for (i = 1; i < 7; i++) {
> >>     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> >>     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> >>   }
> >> 
> >> and we should have reduced overhead of IDE by quite a bit already. All the 
> >> other 2k LOC in hw/ide/core.c don't matter for us really.
> > 
> > 
> > Just use virtio.
>
> Just use xenbus. Seriously, this is not an answer.

Why not?  We invested effort in making it as fast as possible, and in
writing the drivers.  IDE will never, ever, get anything close to virtio
performance, even if we put all of it in the kernel.

However, after these examples, I'm more open to partial acceleration
now.  I won't ever like it though.

> >> >
> >> >>- VGA
> >> >>- IDE
> >> >
> >> >  Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
> >> > virtio-scsi).
> >> 
> >> Because not every guest supports them. Virtio-blk needs 3rd party drivers. 
> >> AHCI needs 3rd party drivers on w2k3 and wxp. 

3rd party drivers are a way of life for Windows users; and the
incremental benefits of IDE acceleration are still far behind virtio.

> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 

Cirrus or vesa should be okay for them, I don't see what we could do for
them in the kernel, or why.

> Same for virtio.
> >> 
> >> Please don't do the Xen mistake again of claiming that all we care about 
> >> is Linux as a guest.
> > 
> > Rest easy, there's no chance of that.  But if a guest is important enough, 
> > virtio drivers will get written.  IDE has no chance in hell of approaching 
> > virtio-blk performance, no matter how much effort we put into it.
>
> Ever used VMware? They basically get virtio-blk performance out of ordinary 
> IDE for linear workloads.

For linear loads, so should we, perhaps with greater cpu utilization.

If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
shouldn't matter.
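
(Spelled out: 64 KB / (128 MB/s) = 0.5 ms per transaction, so a 30 usec
heavyweight exit adds roughly 6% per transaction.)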

> > 
> >> KVM's strength has always been its close resemblance to hardware.
> > 
> > This will remain.  But we can't optimize everything.
>
> That's my point. Let's optimize the hot paths and be good. As long as we 
> default to IDE for disk, we should have that be fast, no?

We should make sure that we don't default to IDE.  Qemu has no knowledge
of the guest, so it can't default to virtio, but higher level tools can and should.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 12:18, Avi Kivity wrote:

> On 02/07/2012 04:39 PM, Alexander Graf wrote:
>>> 
>>> Syscalls are orthogonal to that - they're to avoid the fget_light() and to 
>>> tighten the vcpu/thread and vm/process relationship.
>> 
>> How about keeping the ioctl interface but moving vcpu_run to a syscall then?
> 
> I dislike half-and-half interfaces even more.  And it's not like the
> fget_light() is really painful - it's just that I see it occasionally in
> perf top so it annoys me.
> 
>> That should really be the only thing that belongs into the fast path, right? 
>> Every time we do a register sync in user space, we do something wrong. 
>> Instead, user space should either
>> 
>>  a) have wrappers around register accesses, so it can directly ask for 
>> specific registers that it needs
>> or
>>  b) keep everything that would be requested by the register synchronization 
>> in shared memory
> 
> Always-synced shared memory is a liability, since newer hardware might
> introduce on-chip caches for that state, making synchronization
> expensive.  Or we may choose to keep some of the registers loaded, if we
> have a way to trap on their use from userspace - for example we can
> return to userspace with the guest fpu loaded, and trap if userspace
> tries to use it.
> 
> Is an extra syscall for copying TLB entries to user space prohibitively
> expensive?

The copying can be very expensive, yes. We want to have the possibility of 
exposing a very large TLB to the guest, in the order of multiple kentries. 
Every entry is a struct of 24 bytes.

> 
>>> 
 , keep the rest in user space.
> 
> 
> When a device is fully in the kernel, we have a good specification of the 
> ABI: it just implements the spec, and the ABI provides the interface from 
> the device to the rest of the world.  Partially accelerated devices means 
> a much greater effort in specifying exactly what it does.  It's also 
> vulnerable to changes in how the guest uses the device.
 
 Why? For the HPET timer register for example, we could have a simple MMIO 
 hook that says
 
  on_read:
return read_current_time() - shared_page.offset;
  on_write:
handle_in_user_space();
>>> 
>>> It works for the really simple cases, yes, but if the guest wants to set up 
>>> one-shot timers, it fails.  
>> 
>> I don't understand. Why would anything fail here? 
> 
> It fails to provide a benefit, I didn't mean it causes guest failures.
> 
> You also have to make sure the kernel part and the user part use exactly
> the same time bases.

Right. It's an optional performance accelerator. If anything doesn't align, 
don't use it. But if you happen to have a system where everything's cool, 
you're faster. Sounds like a good deal to me ;).

> 
>> Once the logic that's implemented by the kernel accelerator doesn't fit 
>> anymore, unregister it.
> 
> Yeah.
> 
>> 
>>> Also look at the PIT which latches on read.
>>> 
 
 For IDE, it would be as simple as
 
  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
    register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
    register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }
 
 and we should have reduced overhead of IDE by quite a bit already. All the 
 other 2k LOC in hw/ide/core.c don't matter for us really.
>>> 
>>> 
>>> Just use virtio.
>> 
>> Just use xenbus. Seriously, this is not an answer.
> 
> Why not?  We invested effort in making it as fast as possible, and in
> writing the drivers.  IDE will never, ever, get anything close to virtio
> performance, even if we put all of it in the kernel.
> 
> However, after these examples, I'm more open to partial acceleration
> now.  I won't ever like it though.
> 
> 
>>   - VGA
>>   - IDE
> 
> Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
> virtio-scsi).
 
 Because not every guest supports them. Virtio-blk needs 3rd party drivers. 
 AHCI needs 3rd party drivers on w2k3 and wxp. 
> 
> 3rd party drivers are a way of life for Windows users; and the
> incremental benefits of IDE acceleration are still far behind virtio.

The typical way of life for Windows users is all-included drivers, which is 
the case for AHCI, where we're getting awesome performance for Vista and above 
guests. The IDE thing was just an idea for legacy ones.

It'd be great to simply try and see how fast we could get by handling a few 
special registers in kernel space vs heavyweight exiting to QEMU. If it's only 
10%, I wouldn't even bother with creating an interface for it. I'd bet the 
benefits are a lot bigger though.

And the main point was that specific partial device emulation buys us more than 
pseudo-generic accelerators like coalesced mmio, which are also only used by 1 
or 2 devices.

> 
>> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
> 
> Cirrus or vesa should be okay for them, I don't see what we could do for
> them in the kernel, or why.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/15/2012 01:57 PM, Alexander Graf wrote:
> > 
> > Is an extra syscall for copying TLB entries to user space prohibitively
> > expensive?
>
> The copying can be very expensive, yes. We want to have the possibility of 
> exposing a very large TLB to the guest, in the order of multiple kentries. 
> Every entry is a struct of 24 bytes.

You don't need to copy the entire TLB, just the way that maps the
address you're interested in.

btw, why are you interested in virtual addresses in userspace at all?

> >>> 
> >>> It works for the really simple cases, yes, but if the guest wants to set 
> >>> up one-shot timers, it fails.  
> >> 
> >> I don't understand. Why would anything fail here? 
> > 
> > It fails to provide a benefit, I didn't mean it causes guest failures.
> > 
> > You also have to make sure the kernel part and the user part use exactly
> > the same time bases.
>
> Right. It's an optional performance accelerator. If anything doesn't align, 
> don't use it. But if you happen to have a system where everything's cool, 
> you're faster. Sounds like a good deal to me ;).

Depends on how much the alignment relies on guest knowledge.  I guess
with a simple device like HPET, it's simple, but with a complex device,
different guests (or different versions of the same guest) could drive
it very differently.

>  
>  Because not every guest supports them. Virtio-blk needs 3rd party 
>  drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
> > 
> > 3rd party drivers are a way of life for Windows users; and the
> > incremental benefits of IDE acceleration are still far behind virtio.
>
> The typical way of life for Windows users are all-included drivers. Which is 
> the case for AHCI, where we're getting awesome performance for Vista and 
> above guests. The IDE thing was just an idea for legacy ones.
>
> It'd be great to simply try and see how fast we could get by handling a few 
> special registers in kernel space vs heavyweight exiting to QEMU. If it's 
> only 10%, I wouldn't even bother with creating an interface for it. I'd bet 
> the benefits are a lot bigger though.
>
> And the main point was that specific partial device emulation buys us more 
> than pseudo-generic accelerators like coalesced mmio, which are also only 
> used by 1 or 2 devices.

Ok.

> > 
> >> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
> > 
> > Cirrus or vesa should be okay for them, I don't see what we could do for
> > them in the kernel, or why.
>
> That's my point. You need fast emulation of standard devices to get a good 
> baseline. Do PV on top, but keep the baseline as fast as is reasonable.
>
> > 
> >> Same for virtio.
>  
>  Please don't do the Xen mistake again of claiming that all we care about 
>  is Linux as a guest.
> >>> 
> >>> Rest easy, there's no chance of that.  But if a guest is important 
> >>> enough, virtio drivers will get written.  IDE has no chance in hell of 
> >>> approaching virtio-blk performance, no matter how much effort we put into 
> >>> it.
> >> 
> >> Ever used VMware? They basically get virtio-blk performance out of 
> >> ordinary IDE for linear workloads.
> > 
> > For linear loads, so should we, perhaps with greater cpu utilization.
> > 
> > If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
> > means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
> > shouldn't matter.
>
> *shrug* last time I checked we were a lot slower. But maybe there's more 
> stuff making things slow than the exit path ;).

One thing that's different is that virtio offloads itself to a thread
very quickly, while IDE does a lot of work in vcpu thread context.

> > 
> >>> 
>  KVM's strength has always been its close resemblance to hardware.
> >>> 
> >>> This will remain.  But we can't optimize everything.
> >> 
> >> That's my point. Let's optimize the hot paths and be good. As long as we 
> >> default to IDE for disk, we should have that be fast, no?
> > 
> > We should make sure that we don't default to IDE.  Qemu has no knowledge
> > of the guest, so it can't default to virtio, but higher level tools can
> > and should.
>
> You can only default to virtio on recent Linux. Windows, BSD, etc don't 
> include drivers, so you can't assume it working. You can default to AHCI for 
> basically any recent guest, but that still won't work for XP and the likes :(.

The all-knowing management tool can provide a virtio driver disk, or
even slip-stream the driver into the installation CD.


>  
> >> Ah, because you're on NPT and you can have MMIO hints in the nested page 
> >> table. Nifty. Yeah, we don't have that luxury :).
> > 
> > Well the real reason is we have an extra bit reported by page faults
> > that we can control.  Can't you set up a hashed pte that is configured
> > in a way that it will fault, no matter what type of access the guest
> > does, and see it in your page fault handler?
>
> I might be able to synthesize a PTE that is !readab

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/12/2012 09:10 AM, Takuya Yoshikawa wrote:
> Avi Kivity  wrote:
>
> > > >  Slot searching is quite fast since there's a small number of slots, 
> > > > and we sort the larger ones to be in the front, so positive lookups are 
> > > > fast.  We cache negative lookups in the shadow page tables (an spte can 
> > > > be either "not mapped", "mapped to RAM", or "not mapped and known to be 
> > > > mmio") so we rarely need to walk the entire list.
> > >
> > > Well, we don't always have shadow page tables. Having hints for unmapped 
> > > guest memory like this is pretty tricky.
> > > We're currently running into issues with device assignment though, where 
> > > we get a lot of small slots mapped to real hardware. I'm sure that will 
> > > hit us on x86 sooner or later too.
> > 
> > For x86 that's not a problem, since once you map a page, it stays mapped 
> > (on modern hardware).
> > 
>
> I was once thinking about how to search a slot reasonably fast for every case,
> even when we do not have mmio-spte cache.
>
> One possible way I thought up was to sort slots according to their base_gfn.
> Then the problem would become:  "find the first slot whose base_gfn + npages
> is greater than this gfn."
>
> Since we can do binary search, the search cost is O(log(# of slots)).
>
> But I guess that most of the time was wasted on reading many memslots just to
> know their base_gfn and npages.
>
> So the most practically effective thing is to make a separate array which 
> holds
> just their base_gfn.  This will make the task a simple, and cache friendly,
> search on an integer array:  probably faster than using *-tree data structure.

This assumes that there is equal probability for matching any slot.  But
that's not true, even if you have hundreds of slots, the probability is
much greater for the two main memory slots, or if you're playing with
the framebuffer, the framebuffer slot.  Everything else is loaded
quickly into shadow and forgotten.

> If needed, we should make cmp_memslot() architecture specific in the end?

We could, but why is it needed?  This logic holds for all architectures.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 05:23 PM, Anthony Liguori wrote:
> On 02/07/2012 07:40 AM, Alexander Graf wrote:
>>
>> Why? For the HPET timer register for example, we could have a simple
>> MMIO hook that says
>>
>>on_read:
>>  return read_current_time() - shared_page.offset;
>>on_write:
>>  handle_in_user_space();
>>
>> For IDE, it would be as simple as
>>
>> register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
>> for (i = 1; i < 7; i++) {
>>   register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>>   register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>> }
>
> You can't easily serialize updates to that address with the kernel
> since two threads are likely going to be accessing it at the same
> time.  That either means an expensive sync operation or a reliance on
> atomic instructions.
>
> But not all architectures offer non-word sized atomic instructions so
> it gets fairly nasty in practice.
>

I doubt that any guest accesses IDE registers from two threads in
parallel.  The guest will have some lock, so we could have a lock as
well and be assured that there will never be contention.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 14:29, Avi Kivity wrote:

> On 02/15/2012 01:57 PM, Alexander Graf wrote:
>>> 
>>> Is an extra syscall for copying TLB entries to user space prohibitively
>>> expensive?
>> 
>> The copying can be very expensive, yes. We want to have the possibility of 
>> exposing a very large TLB to the guest, in the order of multiple kentries. 
>> Every entry is a struct of 24 bytes.
> 
> You don't need to copy the entire TLB, just the way that maps the
> address you're interested in.

Yeah, unless we do migration in which case we need to introduce another special 
case to fetch the whole thing :(.

> btw, why are you interested in virtual addresses in userspace at all?

We need them for gdb and monitor introspection.

> 
> 
> It works for the really simple cases, yes, but if the guest wants to set 
> up one-shot timers, it fails.  
 
 I don't understand. Why would anything fail here? 
>>> 
>>> It fails to provide a benefit, I didn't mean it causes guest failures.
>>> 
>>> You also have to make sure the kernel part and the user part use exactly
>>> the same time bases.
>> 
>> Right. It's an optional performance accelerator. If anything doesn't align, 
>> don't use it. But if you happen to have a system where everything's cool, 
>> you're faster. Sounds like a good deal to me ;).
> 
> Depends on how much the alignment relies on guest knowledge.  I guess
> with a simple device like HPET, it's simple, but with a complex device,
> different guests (or different versions of the same guest) could drive
> it very differently.

Right. But accelerating simple devices > not accelerating any devices. No? :)

> 
>> 
>> Because not every guest supports them. Virtio-blk needs 3rd party 
>> drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
>>> 
>>> 3rd party drivers are a way of life for Windows users; and the
>>> incremental benefits of IDE acceleration are still far behind virtio.
>> 
>> The typical way of life for Windows users are all-included drivers. Which is 
>> the case for AHCI, where we're getting awesome performance for Vista and 
>> above guests. The IDE thing was just an idea for legacy ones.
>> 
>> It'd be great to simply try and see how fast we could get by handling a few 
>> special registers in kernel space vs heavyweight exiting to QEMU. If it's 
>> only 10%, I wouldn't even bother with creating an interface for it. I'd bet 
>> the benefits are a lot bigger though.
>> 
>> And the main point was that specific partial device emulation buys us more 
>> than pseudo-generic accelerators like coalesced mmio, which are also only 
>> used by 1 or 2 devices.
> 
> Ok.
> 
>>> 
 I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
>>> 
>>> Cirrus or vesa should be okay for them, I don't see what we could do for
>>> them in the kernel, or why.
>> 
>> That's my point. You need fast emulation of standard devices to get a good 
>> baseline. Do PV on top, but keep the baseline as fast as is reasonable.
>> 
>>> 
 Same for virtio.
>> 
>> Please don't do the Xen mistake again of claiming that all we care about 
>> is Linux as a guest.
> 
> Rest easy, there's no chance of that.  But if a guest is important 
> enough, virtio drivers will get written.  IDE has no chance in hell of 
> approaching virtio-blk performance, no matter how much effort we put into 
> it.
 
 Ever used VMware? They basically get virtio-blk performance out of 
 ordinary IDE for linear workloads.
>>> 
>>> For linear loads, so should we, perhaps with greater cpu utilization.
>>> 
>>> If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
>>> means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
>>> shouldn't matter.
>> 
>> *shrug* last time I checked we were a lot slower. But maybe there's more 
>> stuff making things slow than the exit path ;).
> 
> One thing that's different is that virtio offloads itself to a thread
> very quickly, while IDE does a lot of work in vcpu thread context.

So it's all about latencies again, which could be reduced at least a fair bit 
with the scheme I described above. But really, this needs to be prototyped and 
benchmarked to actually give us data on how fast it would get us.

> 
>>> 
> 
>> KVM's strength has always been its close resemblance to hardware.
> 
> This will remain.  But we can't optimize everything.
 
 That's my point. Let's optimize the hot paths and be good. As long as we 
 default to IDE for disk, we should have that be fast, no?
>>> 
>>> We should make sure that we don't default to IDE.  Qemu has no knowledge
>>> of the guest, so it can't default to virtio, but higher level tools can
>>> and should.
>> 
>> You can only default to virtio on recent Linux. Windows, BSD, etc don't 
>> include drivers, so you can't assume it working. You can default to AHCI for 
>> basically any recent guest, but that still won't work for XP and the likes 

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 08:12 PM, Rusty Russell wrote:
> > I would really love to have this, but the problem is that we'd need a
> > general purpose bytecode VM with binding to some kernel APIs.  The
> > bytecode VM, if made general enough to host more complicated devices,
> > would likely be much larger than the actual code we have in the kernel now.
>
> We have the ability to upload bytecode into the kernel already.  It's in
> a great bytecode interpreted by the CPU itself.

Unfortunately it's inflexible (has to come with the kernel) and open to
security vulnerabilities.

> If every user were emulating different machines, an LPF-style approach would make
> sense.  Are they?  

They aren't.

> Or should we write those helpers once, in C, and
> provide that for them.

There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
them are quite complicated.  However implementing them in bytecode
amounts to exposing a stable kernel ABI, since they use such a vast
range of kernel services.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 06:29 PM, Jan Kiszka wrote:
> >>>
> >>
> >> Isn't there another level in between just scheduling and full syscall
> >> return if the user return notifier has some real work to do?
> > 
> > Depends on whether you're scheduling a kthread or a userspace process, no?  
> > If 
>
> Kthreads can't return, of course. User space threads /may/ do so. And
> then there needs to be a difference between host and guest in the
> tracked MSRs. 

Right - until we randomize kernel virtual addresses (what happened to
that?), and then there will always be a difference, even if you run the
same kernel in the host and guest.

> I seem to recall it's a question of another few hundred
> cycles.

Right.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 06:19 PM, Anthony Liguori wrote:
>> Ah. But then ioeventfd has that as well, unless the other end is in
>> the kernel too.
>
>
> Yes, that was my point exactly :-)
>
> ioeventfd/mmio-over-socketpair to a different thread is not faster than
> a synchronous KVM_RUN + writing to an eventfd in userspace modulo a
> couple of cheap syscalls.
>
> The exception is when the other end is in the kernel and there are
> magic optimizations (like there are today with ioeventfd).

vhost seems to schedule a workqueue item unconditionally.

irqfd does have magic optimizations to avoid an extra schedule.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/15/2012 03:37 PM, Alexander Graf wrote:
> On 15.02.2012, at 14:29, Avi Kivity wrote:
>
> > On 02/15/2012 01:57 PM, Alexander Graf wrote:
> >>> 
> >>> Is an extra syscall for copying TLB entries to user space prohibitively
> >>> expensive?
> >> 
> >> The copying can be very expensive, yes. We want to have the possibility of 
> >> exposing a very large TLB to the guest, in the order of multiple kentries. 
> >> Every entry is a struct of 24 bytes.
> > 
> > You don't need to copy the entire TLB, just the way that maps the
> > address you're interested in.
>
> Yeah, unless we do migration in which case we need to introduce another 
> special case to fetch the whole thing :(.

Well, the scatter/gather registers I proposed will give you just one
register or all of them.

> > btw, why are you interested in virtual addresses in userspace at all?
>
> We need them for gdb and monitor introspection.

Hardly fast paths that justify shared memory.  I should be much harder
on you.

> >> 
> >> Right. It's an optional performance accelerator. If anything doesn't 
> >> align, don't use it. But if you happen to have a system where everything's 
> >> cool, you're faster. Sounds like a good deal to me ;).
> > 
> > Depends on how much the alignment relies on guest knowledge.  I guess
> > with a simple device like HPET, it's simple, but with a complex device,
> > different guests (or different versions of the same guest) could drive
> > it very differently.
>
> Right. But accelerating simple devices > not accelerating any devices. No? :)

Yes.  But introducing bugs and vulns < not introducing them.  It's a
tradeoff.  Even an unexploited vulnerability can be a lot more pain,
just because you need to update your entire cluster, than a simple
device that is accelerated for a guest which has maybe 3% utilization. 
Performance is just one parameter we optimize for.  It's easy to overdo
it because it's an easily measurable and sexy parameter, but it's a mistake.

> > 
> > One thing that's different is that virtio offloads itself to a thread
> > very quickly, while IDE does a lot of work in vcpu thread context.
>
> So it's all about latencies again, which could be reduced at least a fair bit 
> with the scheme I described above. But really, this needs to be prototyped 
> and benchmarked to actually give us data on how fast it would get us.

Simply making qemu issue the request from a thread would be way better. 
Something like socketpair mmio, configured for not waiting for the
writes to be seen (posted writes) will also help by buffering writes in
the socket buffer.
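
As a sketch of what posted writes over such a socketpair could look like (the 
transaction layout and helper names are invented for illustration):

  /*
   * Illustrative only.  The point is that a posted write is a nonblocking
   * send: the vcpu thread queues the write in the socket buffer and keeps
   * running, instead of sleeping until the device model has consumed it.
   */
  #include <stdint.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>

  struct mmio_txn {
          uint64_t addr;       /* guest physical address       */
          uint32_t len;        /* access size in bytes         */
          uint8_t  is_write;   /* 1 = write (posted), 0 = read */
          uint8_t  data[8];    /* payload for writes           */
  };

  /* One end goes to kvm, the other to the device model thread/process.
   * SOCK_SEQPACKET preserves transaction boundaries. */
  static int make_mmio_channel(int sv[2])
  {
          return socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv);
  }

  /* Posted MMIO write: fire and forget. */
  static int post_mmio_write(int fd, uint64_t addr, const void *val, uint32_t len)
  {
          struct mmio_txn t = { .addr = addr, .len = len, .is_write = 1 };

          if (len > sizeof(t.data))
                  return -1;
          memcpy(t.data, val, len);
          return send(fd, &t, sizeof(t), MSG_DONTWAIT) == (ssize_t)sizeof(t) ? 0 : -1;
  }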

> > 
> > The all-knowing management tool can provide a virtio driver disk, or
> > even slip-stream the driver into the installation CD.
>
> One management tool might do that, another one might not. We can't assume 
> that all management tools are all-knowing. Some times you also want to run 
> guest OSs that the management tool doesn't know (yet).

That is true, but we have to leave some work for the management guys.

>  
> >> So for MMIO reads, I can assume that this is an MMIO because I would never 
> >> write a non-readable entry. For writes, I'm overloading the bit that also 
> >> means "guest entry is not readable" so there I'd have to walk the guest 
> >> PTEs/TLBs and check if I find a read-only entry. Right now I can just 
> >> forward write faults to the guest. Since COW is probably a hotter path for 
> >> the guest than MMIO, this might end up being ineffective.
> > 
> > COWs usually happen from guest userspace, while mmio is usually from the
> > guest kernel, so you can switch on that, maybe.
>
> Hrm, nice idea. That might fall apart with user space drivers that we might 
> eventually have once vfio turns out to work well, but for the time being it's 
> a nice hack :).

Or nested virt...
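
A rough sketch of the heuristic being suggested, purely for illustration
(the helper name and the shared-page MSR access are assumptions, not
existing kvm code):

#include <linux/kvm_host.h>

/* Illustrative only: guess whether a guest write fault is more likely MMIO
 * than COW by checking whether the guest was in supervisor or user mode. */
static bool fault_is_probably_mmio(struct kvm_vcpu *vcpu)
{
	/* MSR[PR] set => guest problem (user) state => more likely a COW. */
	return !(vcpu->arch.shared->msr & MSR_PR);
}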



-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 14:57, Avi Kivity wrote:

> On 02/15/2012 03:37 PM, Alexander Graf wrote:
>> On 15.02.2012, at 14:29, Avi Kivity wrote:
>> 
>>> On 02/15/2012 01:57 PM, Alexander Graf wrote:
> 
> Is an extra syscall for copying TLB entries to user space prohibitively
> expensive?
 
 The copying can be very expensive, yes. We want to have the possibility of 
 exposing a very large TLB to the guest, in the order of multiple kentries. 
 Every entry is a struct of 24 bytes.
>>> 
>>> You don't need to copy the entire TLB, just the way that maps the
>>> address you're interested in.
>> 
>> Yeah, unless we do migration in which case we need to introduce another 
>> special case to fetch the whole thing :(.
> 
> Well, the scatter/gather registers I proposed will give you just one
> register or all of them.

One register is hardly any use. We either need all ways of a respective address 
to do a full fledged lookup or all of them. By sharing the same data structures 
between qemu and kvm, we actually managed to reuse all of the tcg code for 
lookups, just like you do for x86. On x86 you also have shared memory for page 
tables, it's just guest visible, hence in guest memory. The concept is the same.

> 
>>> btw, why are you interested in virtual addresses in userspace at all?
>> 
>> We need them for gdb and monitor introspection.
> 
> Hardly fast paths that justify shared memory.  I should be much harder
> on you.

It was a tradeoff on speed and complexity. This way we have the least amount of 
complexity IMHO. All KVM code paths just magically fit in with the TCG code. 
There are essentially no if(kvm_enabled)'s in our MMU walking code, because the 
tables are just there. Makes everything a lot easier (without dragging down 
performance).

> 
 
 Right. It's an optional performance accelerator. If anything doesn't 
 align, don't use it. But if you happen to have a system where everything's 
 cool, you're faster. Sounds like a good deal to me ;).
>>> 
>>> Depends on how much the alignment relies on guest knowledge.  I guess
>>> with a simple device like HPET, it's simple, but with a complex device,
>>> different guests (or different versions of the same guest) could drive
>>> it very differently.
>> 
>> Right. But accelerating simple devices > not accelerating any devices. No? :)
> 
> Yes.  But introducing bugs and vulns < not introducing them.  It's a
> tradeoff.  Even an unexploited vulnerability can be a lot more pain,
> just because you need to update your entire cluster, than a simple
> device that is accelerated for a guest which has maybe 3% utilization. 
> Performance is just one parameter we optimize for.  It's easy to overdo
> it because it's an easily measurable and sexy parameter, but it's a mistake.

Yeah, I agree. That's why I was trying to make AHCI the default storage 
adapter for a while, because I think the same way. However, Anthony believes 
that XP/w2k3 is still a major chunk of the guests running on QEMU, so we can't 
do that :(.

I'm mostly trying to think of ways to accelerate the obvious low hanging 
fruits, without overengineering any interfaces.

> 
>>> 
>>> One thing that's different is that virtio offloads itself to a thread
>>> very quickly, while IDE does a lot of work in vcpu thread context.
>> 
>> So it's all about latencies again, which could be reduced at least a fair 
>> bit with the scheme I described above. But really, this needs to be 
>> prototyped and benchmarked to actually give us data on how fast it would get 
>> us.
> 
> Simply making qemu issue the request from a thread would be way better. 
> Something like socketpair mmio, configured for not waiting for the
> writes to be seen (posted writes) will also help by buffering writes in
> the socket buffer.

Yup, nice idea. That only works when all parts of a device are actually 
implemented through the same socket though. Otherwise you could run out of 
order. So if you have a PCI device with a PIO and an MMIO BAR region, they 
would both have to be handled through the same socket.

> 
>>> 
>>> The all-knowing management tool can provide a virtio driver disk, or
>>> even slip-stream the driver into the installation CD.
>> 
>> One management tool might do that, another one might not. We can't assume 
>> that all management tools are all-knowing. Sometimes you also want to run 
>> guest OSs that the management tool doesn't know (yet).
> 
> That is true, but we have to leave some work for the management guys.

The easier the management stack is, the happier I am ;).

> 
>> 
 So for MMIO reads, I can assume that this is an MMIO because I would never 
 write a non-readable entry. For writes, I'm overloading the bit that also 
 means "guest entry is not readable" so there I'd have to walk the guest 
 PTEs/TLBs and check if I find a read-only entry. Right now I can just 
 forward write faults to the guest. Since COW is probably a hotter path for 
 the guest than MMIO, this might end up being ineffective.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Scott Wood
On 02/15/2012 05:57 AM, Alexander Graf wrote:
> 
> On 15.02.2012, at 12:18, Avi Kivity wrote:
> 
>> Well the real reason is we have an extra bit reported by page faults
>> that we can control.  Can't you set up a hashed pte that is configured
>> in a way that it will fault, no matter what type of access the guest
>> does, and see it in your page fault handler?
> 
> I might be able to synthesize a PTE that is !readable and might throw
> a permission exception instead of a miss exception. I might be able
> to synthesize something similar for booke. I don't however get any
> indication on why things failed.

On booke with ISA 2.06 hypervisor extensions, there's MAS8[VF] that will
trigger a DSI that gets sent to the hypervisor even if normal DSIs go
directly to the guest.  You'll still need to zero out the execute
permission bits.

For other booke, you could use one of the user bits in MAS3 (along with
zeroing out all the permission bits), which you could get to by doing a
tlbsx.
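
A sketch of what that looks like in terms of the shadow TLB entry fields;
the struct and macro names are assumed to follow the kernel's book3e
definitions, so treat this as untested pseudocode:

#include <linux/kvm_host.h>

/* Make a shadow TLB entry trap to the hypervisor on any guest access. */
static void make_entry_trap(struct kvm_book3e_206_tlb_entry *e, bool has_hv_ext)
{
	if (has_hv_ext) {
		/* ISA 2.06 hypervisor extensions: VF sends the DSI to the
		 * hypervisor even if normal DSIs go straight to the guest. */
		e->mas8 |= MAS8_VF;
		/* VF covers data accesses only; execute must still fault. */
		e->mas7_3 &= ~(MAS3_UX | MAS3_SX);
	} else {
		/* Other booke: clear every permission bit so all accesses
		 * fault, and tag the entry with a MAS3 user bit so the host
		 * fault handler can recognize it after a tlbsx. */
		e->mas7_3 &= ~(MAS3_UX | MAS3_SX | MAS3_UW | MAS3_SW |
			       MAS3_UR | MAS3_SR);
		e->mas7_3 |= MAS3_U0;
	}
}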

-Scott



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Anthony Liguori

On 02/15/2012 07:39 AM, Avi Kivity wrote:
> On 02/07/2012 08:12 PM, Rusty Russell wrote:
>>> I would really love to have this, but the problem is that we'd need a
>>> general purpose bytecode VM with binding to some kernel APIs.  The
>>> bytecode VM, if made general enough to host more complicated devices,
>>> would likely be much larger than the actual code we have in the kernel now.
>>
>> We have the ability to upload bytecode into the kernel already.  It's in
>> a great bytecode interpreted by the CPU itself.
>
> Unfortunately it's inflexible (has to come with the kernel) and open to
> security vulnerabilities.

I wonder if there's any reasonable way to run device emulation within the
context of the guest.  Could we effectively do something like SMM?

For a given set of traps, reflect back into the guest quickly changing the
visibility of the VGA region. It may require installing a new CR3 but maybe
that wouldn't be so bad with VPIDs.

Then you could implement the PIT as guest firmware using kvmclock as the
time base.

Once you're back in the guest, you could install the old CR3.  Perhaps just
hide a portion of the physical address space with the e820.

Regards,

Anthony Liguori

>> If every user were emulating different machines, LPF this would make
>> sense.  Are they?
>
> They aren't.
>
>> Or should we write those helpers once, in C, and
>> provide that for them.
>
> There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
> stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
> them are quite complicated.  However implementing them in bytecode
> amounts to exposing a stable kernel ABI, since they use such a vast
> range of kernel services.





Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Arnd Bergmann
On Tuesday 07 February 2012, Alexander Graf wrote:
> On 07.02.2012, at 07:58, Michael Ellerman wrote:
> 
> > On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
> >> You're exposing a large, complex kernel subsystem that does very
> >> low-level things with the hardware.  It's a potential source of exploits
> >> (from bugs in KVM or in hardware).  I can see people wanting to be
> >> selective with access because of that.
> > 
> > Exactly.
> > 
> > In a perfect world I'd agree with Anthony, but in reality I think
> > sysadmins are quite happy that they can prevent some users from using
> > KVM.
> > 
> > You could presumably achieve something similar with capabilities or
> > whatever, but a node in /dev is much simpler.
> 
> Well, you could still keep the /dev/kvm node and then have syscalls operate 
> on the fd.
> 
> But again, I don't see the problem with the ioctl interface. It's nice, 
> extensible and works great for us.
> 

ioctl is good for hardware devices and stuff that you want to enumerate
and/or control permissions on. For something like KVM that is really a
core kernel service, a syscall makes much more sense.

I would certainly never mix the two concepts: If you use a chardev to get
a file descriptor, use ioctl to do operations on it, and if you use a 
syscall to get the file descriptor then use other syscalls to do operations
on it.

I don't really have a good recommendation whether or not to change from an
ioctl based interface to syscall for KVM now. On the one hand I believe it
would be significantly cleaner, on the other hand we cannot remove the
chardev interface any more since there are many existing users.

Arnd


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Arnd Bergmann
On Tuesday 07 February 2012, Alexander Graf wrote:
> >> 
> >> Not sure we'll ever get there. For PPC, it will probably take another 1-2 
> >> years until we get the 32-bit targets stabilized. By then we will have new 
> >> 64-bit support though. And then the next gen will come out giving us even 
> >> more new constraints.
> > 
> > I would expect that newer archs have less constraints, not more.
> 
> Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
> today on 32-bit, but extends a
> bunch of registers to 64-bit. So what if we laid out stuff wrong before?
> 
> I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
> completely new architecture.
> 

I have not seen the source but I'm pretty sure that v7 and v8 look very
similar regarding virtualization support because they were designed together,
including the concept that on v8 you can run either a v7 compatible 32 bit
hypervisor with 32 bit guests or a 64 bit hypervisor with a combination of
32 and 64 bit guests. Also, the page table layout in v7-LPAE is identical
to the v8 one. The main difference is the instruction set, but then ARMv7
already has four of these (ARM, Thumb, Thumb2, ThumbEE).

Arnd



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Michael Ellerman
On Wed, 2012-02-15 at 22:21 +, Arnd Bergmann wrote:
> On Tuesday 07 February 2012, Alexander Graf wrote:
> > On 07.02.2012, at 07:58, Michael Ellerman wrote:
> > 
> > > On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
> > >> You're exposing a large, complex kernel subsystem that does very
> > >> low-level things with the hardware.  It's a potential source of exploits
> > >> (from bugs in KVM or in hardware).  I can see people wanting to be
> > >> selective with access because of that.
> > > 
> > > Exactly.
> > > 
> > > In a perfect world I'd agree with Anthony, but in reality I think
> > > sysadmins are quite happy that they can prevent some users from using
> > > KVM.
> > > 
> > > You could presumably achieve something similar with capabilities or
> > > whatever, but a node in /dev is much simpler.
> > 
> > Well, you could still keep the /dev/kvm node and then have syscalls operate 
> > on the fd.
> > 
> > But again, I don't see the problem with the ioctl interface. It's nice, 
> > extensible and works great for us.
> > 
> 
> ioctl is good for hardware devices and stuff that you want to enumerate
> and/or control permissions on. For something like KVM that is really a
> core kernel service, a syscall makes much more sense.

Yeah maybe. That distinction is at least in part just historical.

The first problem I see with using a syscall is that you don't need one
syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
multiplexed syscall like epoll_ctl() - or probably several
(vm/vcpu/etc).

Secondly you still need a handle/context for those syscalls, and I think
the most sane thing to use for that is an fd.

At that point you've basically reinvented ioctl :)

I also think it is an advantage that you have a node in /dev for
permissions. I know other "core kernel" interfaces don't use a /dev
node, but arguably that is their loss.

> I would certainly never mix the two concepts: If you use a chardev to get
> a file descriptor, use ioctl to do operations on it, and if you use a 
> syscall to get the file descriptor then use other syscalls to do operations
> on it.

Sure, we use a syscall to get the fd (open) and then other syscalls to
do operations on it, ioctl and kvm_vcpu_run. ;)

But seriously, I guess that makes sense. Though it's a bit of a pity
because if you want a syscall for any of it, eg. vcpu_run(), then you
have to basically reinvent ioctl for all the other little operations.

cheers




Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Rusty Russell
On Wed, 15 Feb 2012 15:39:41 +0200, Avi Kivity  wrote:
> On 02/07/2012 08:12 PM, Rusty Russell wrote:
> > > I would really love to have this, but the problem is that we'd need a
> > > general purpose bytecode VM with binding to some kernel APIs.  The
> > > bytecode VM, if made general enough to host more complicated devices,
> > > would likely be much larger than the actual code we have in the kernel 
> > > now.
> >
> > We have the ability to upload bytecode into the kernel already.  It's in
> > a great bytecode interpreted by the CPU itself.
> 
> Unfortunately it's inflexible (has to come with the kernel) and open to
> security vulnerabilities.

It doesn't have to come with the kernel, but it does require privs.  And
while the bytecode itself might be invulnerable, the services it will call
won't be, so it's not clear it'll be a win, given the reduced
auditability.

The grass is not really greener, and getting there involves many fences.

> > If every user were emulating different machines, LPF this would make
> > sense.  Are they?  
> 
> They aren't.
> 
> > Or should we write those helpers once, in C, and
> > provide that for them.
> 
> There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
> stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
> them are quite complicated.  However implementing them in bytecode
> amounts to exposing a stable kernel ABI, since they use such a vast
> range of kernel services.

We could think about regularizing and enumerating the various in-kernel
helpers, and give userspace a generic mechanism for wiring them up.
That would surely be the first step towards bytecode anyway.

But the current device assignment ioctls make me think that this
wouldn't be simple or neat.

Cheers,
Rusty.


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Gleb Natapov
On Wed, Feb 15, 2012 at 03:59:33PM -0600, Anthony Liguori wrote:
> On 02/15/2012 07:39 AM, Avi Kivity wrote:
> >On 02/07/2012 08:12 PM, Rusty Russell wrote:
> >>>I would really love to have this, but the problem is that we'd need a
> >>>general purpose bytecode VM with binding to some kernel APIs.  The
> >>>bytecode VM, if made general enough to host more complicated devices,
> >>>would likely be much larger than the actual code we have in the kernel now.
> >>
> >>We have the ability to upload bytecode into the kernel already.  It's in
> >>a great bytecode interpreted by the CPU itself.
> >
> >Unfortunately it's inflexible (has to come with the kernel) and open to
> >security vulnerabilities.
> 
> I wonder if there's any reasonable way to run device emulation
> within the context of the guest.  Could we effectively do something
> like SMM?
> 
> For a given set of traps, reflect back into the guest quickly
> changing the visibility of the VGA region. It may require installing
> a new CR3 but maybe that wouldn't be so bad with VPIDs.
> 
What will it buy us? Surely not speed. Entering a guest is not much
(if at all) faster than exiting to userspace and any non trivial
operation will require exit to userspace anyway, so we just added one
more guest entry/exit operation on the way to userspace.

> Then you could implement the PIT as guest firmware using kvmclock as the time 
> base.
> 
> Once you're back in the guest, you could install the old CR3.
> Perhaps just hide a portion of the physical address space with the
> e820.
> 
> Regards,
> 
> Anthony Liguori
> 
> >>If every user were emulating different machines, LPF this would make
> >>sense.  Are they?
> >
> >They aren't.
> >
> >>Or should we write those helpers once, in C, and
> >>provide that for them.
> >
> >There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
> >stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
> >them are quite complicated.  However implementing them in bytecode
> >amounts to exposing a stable kernel ABI, since they use such a vast
> >range of kernel services.
> >

--
Gleb.


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Avi Kivity
On 02/16/2012 12:21 AM, Arnd Bergmann wrote:
> ioctl is good for hardware devices and stuff that you want to enumerate
> and/or control permissions on. For something like KVM that is really a
> core kernel service, a syscall makes much more sense.
>
> I would certainly never mix the two concepts: If you use a chardev to get
> a file descriptor, use ioctl to do operations on it, and if you use a 
> syscall to get the file descriptor then use other syscalls to do operations
> on it.
>
> I don't really have a good recommendation whether or not to change from an
> ioctl based interface to syscall for KVM now. On the one hand I believe it
> would be significantly cleaner, on the other hand we cannot remove the
> chardev interface any more since there are many existing users.
>

This sums up my feelings exactly.  Moving to syscalls would be an
improvement, but not so much an improvement as to warrant the thrashing
and the pain from having to maintain the old interface for a long while.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Anthony Liguori

On 02/16/2012 02:57 AM, Gleb Natapov wrote:
> On Wed, Feb 15, 2012 at 03:59:33PM -0600, Anthony Liguori wrote:
>> On 02/15/2012 07:39 AM, Avi Kivity wrote:
>>> On 02/07/2012 08:12 PM, Rusty Russell wrote:
>>>>> I would really love to have this, but the problem is that we'd need a
>>>>> general purpose bytecode VM with binding to some kernel APIs.  The
>>>>> bytecode VM, if made general enough to host more complicated devices,
>>>>> would likely be much larger than the actual code we have in the kernel now.
>>>>
>>>> We have the ability to upload bytecode into the kernel already.  It's in
>>>> a great bytecode interpreted by the CPU itself.
>>>
>>> Unfortunately it's inflexible (has to come with the kernel) and open to
>>> security vulnerabilities.
>>
>> I wonder if there's any reasonable way to run device emulation
>> within the context of the guest.  Could we effectively do something
>> like SMM?
>>
>> For a given set of traps, reflect back into the guest quickly
>> changing the visibility of the VGA region. It may require installing
>> a new CR3 but maybe that wouldn't be so bad with VPIDs.
>
> What will it buy us? Surely not speed. Entering a guest is not much
> (if at all) faster than exiting to userspace and any non trivial
> operation will require exit to userspace anyway,

You can emulate the PIT/RTC entirely within the guest using kvmclock which
doesn't require an additional exit to get the current time base.

So instead of:

1) guest -> host kernel
2) host kernel -> userspace
3) implement logic using rdtscp via VDSO
4) userspace -> host kernel
5) host kernel -> guest

You go:

1) guest -> host kernel
2) host kernel -> guest (with special CR3)
3) implement logic using rdtscp + kvmclock page
4) change CR3 within guest and RETI to VMEXIT source RIP

Same basic concept as PS/2 emulation with SMM.

Regards,

Anthony Liguori
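
Step 3 of the reflected in-guest path above is essentially the standard
kvmclock read; a minimal sketch, assuming the usual pvclock layout (the
64-bit multiply is a simplification of the real 128-bit intermediate):

#include <stdint.h>
#include <x86intrin.h>		/* __rdtsc() */

/* Per-vcpu kvmclock time info page (pvclock ABI layout). */
struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  flags;
	uint8_t  pad[2];
};

/* Derive guest time in nanoseconds from the TSC and the kvmclock page,
 * without any exit to the host. */
static uint64_t kvmclock_read_ns(volatile struct pvclock_vcpu_time_info *ti)
{
	uint32_t version;
	uint64_t delta, ns;

	do {
		version = ti->version;		/* odd => update in progress */
		__sync_synchronize();
		delta = __rdtsc() - ti->tsc_timestamp;
		if (ti->tsc_shift >= 0)
			delta <<= ti->tsc_shift;
		else
			delta >>= -ti->tsc_shift;
		ns = ti->system_time + ((delta * ti->tsc_to_system_mul) >> 32);
		__sync_synchronize();
	} while ((version & 1) || version != ti->version);

	return ns;
}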


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Avi Kivity
On 02/15/2012 04:08 PM, Alexander Graf wrote:
> > 
> > Well, the scatter/gather registers I proposed will give you just one
> > register or all of them.
>
> One register is hardly any use. We either need all ways of a respective 
> address to do a full fledged lookup or all of them. 

I should have said, just one register, or all of them, or anything in
between.

> By sharing the same data structures between qemu and kvm, we actually managed 
> to reuse all of the tcg code for lookups, just like you do for x86.

Sharing the data structures is not needed.  Simply synchronize them before
lookup, like we do for ordinary registers.

>  On x86 you also have shared memory for page tables, it's just guest visible, 
> hence in guest memory. The concept is the same.

But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
on every exit.  And you're risking the same thing if your hardware gets
cleverer.

> > 
> >>> btw, why are you interested in virtual addresses in userspace at all?
> >> 
> >> We need them for gdb and monitor introspection.
> > 
> > Hardly fast paths that justify shared memory.  I should be much harder
> > on you.
>
> It was a tradeoff on speed and complexity. This way we have the least amount 
> of complexity IMHO. All KVM code paths just magically fit in with the TCG 
> code. 

It's too magical, fitting a random version of a random userspace
component.  Now you can't change this tcg code (and still keep the magic).

Some complexity is part of keeping software as separate components.

> There are essentially no if(kvm_enabled)'s in our MMU walking code, because 
> the tables are just there. Makes everything a lot easier (without dragging 
> down performance).

We have the same issue with registers.  There we call
cpu_synchronize_state() before every access.  No magic, but we get to
reuse the code just the same.

> > 
> >>> 
> >>> One thing that's different is that virtio offloads itself to a thread
> >>> very quickly, while IDE does a lot of work in vcpu thread context.
> >> 
> >> So it's all about latencies again, which could be reduced at least a fair 
> >> bit with the scheme I described above. But really, this needs to be 
> >> prototyped and benchmarked to actually give us data on how fast it would 
> >> get us.
> > 
> > Simply making qemu issue the request from a thread would be way better. 
> > Something like socketpair mmio, configured for not waiting for the
> > writes to be seen (posted writes) will also help by buffering writes in
> > the socket buffer.
>
> Yup, nice idea. That only works when all parts of a device are actually 
> implemented through the same socket though. 

Right, but that's not an issue.

> Otherwise you could run out of order. So if you have a PCI device with a PIO 
> and an MMIO BAR region, they would both have to be handled through the same 
> socket.

I'm more worried about interactions between hotplug and a device, and
between people issuing unrelated PCI reads to flush writes (not sure
what the hardware semantics are there).  It's easy to get this wrong.

> >>> 
> >>> COWs usually happen from guest userspace, while mmio is usually from the
> >>> guest kernel, so you can switch on that, maybe.
> >> 
> >> Hrm, nice idea. That might fall apart with user space drivers that we 
> >> might eventually have once vfio turns out to work well, but for the time 
> >> being it's a nice hack :).
> > 
> > Or nested virt...
>
> Nested virt on ppc with device assignment? And here I thought I was the crazy 
> one of the two of us :)

I don't mind being crazy on somebody else's arch.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Avi Kivity
On 02/16/2012 03:04 AM, Michael Ellerman wrote:
> > 
> > ioctl is good for hardware devices and stuff that you want to enumerate
> > and/or control permissions on. For something like KVM that is really a
> > core kernel service, a syscall makes much more sense.
>
> Yeah maybe. That distinction is at least in part just historical.
>
> The first problem I see with using a syscall is that you don't need one
> syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
> multiplexed syscall like epoll_ctl() - or probably several
> (vm/vcpu/etc).

No.  Many of our ioctls are for state save/restore - we reduce that to
two.  Many others are due to the with/without irqchip support - we slash
that as well.  The device assignment stuff is relegated to vfio.

I still have to draw up a concrete proposal, but I think we'll end up
with 10-15.

>
> Secondly you still need a handle/context for those syscalls, and I think
> the most sane thing to use for that is an fd.

The context is the process (for vm-wide calls) and thread (for vcpu
local calls).

>
> At that point you've basically reinvented ioctl :)
>
> I also think it is an advantage that you have a node in /dev for
> permissions. I know other "core kernel" interfaces don't use a /dev
> node, but arguably that is their loss.

Have to agree with that.  Theoretically we don't need permissions for
/dev/kvm, but in practice we do.
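
Purely as an illustration of that reduction -- state save/restore folded
into a get/set pair, and the vm/vcpu context taken implicitly from the
calling process and thread -- a hypothetical sketch of the syscall surface
(every name and prototype here is invented):

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

struct kvm_reg_list;	/* opaque list of register descriptors */

/* Hypothetical syscall wrappers -- none of these exist. */
int kvm_create_vm(unsigned int flags);		/* VM bound to the calling process */
int kvm_create_vcpu(unsigned int id);		/* vcpu bound to the calling thread */
int kvm_vcpu_run(void);
ssize_t kvm_get_state(struct kvm_reg_list *regs, size_t len);
ssize_t kvm_set_state(const struct kvm_reg_list *regs, size_t len);
int kvm_set_memory_region(uint64_t gpa, void *userspace_addr, size_t len,
			  unsigned int flags);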


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Alexander Graf

On 16.02.2012, at 20:24, Avi Kivity wrote:

> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>>> 
>>> Well, the scatter/gather registers I proposed will give you just one
>>> register or all of them.
>> 
>> One register is hardly any use. We either need all ways of a respective 
>> address to do a full fledged lookup or all of them. 
> 
> I should have said, just one register, or all of them, or anything in
> between.
> 
>> By sharing the same data structures between qemu and kvm, we actually 
>> managed to reuse all of the tcg code for lookups, just like you do for x86.
> 
> Sharing the data structures is not needed.  Simply synchronize them before
> lookup, like we do for ordinary registers.

Ordinary registers are a few bytes. We're talking of dozens of kbytes here.

> 
>> On x86 you also have shared memory for page tables, it's just guest visible, 
>> hence in guest memory. The concept is the same.
> 
> But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
> on every exit.  And you're risking the same thing if your hardware gets
> cleverer.

Yes, we do. When that day comes, we forget the CAP and do it another way. Which 
way we will find out by the time that day of more clever hardware comes :).

> 
>>> 
> btw, why are you interested in virtual addresses in userspace at all?
 
 We need them for gdb and monitor introspection.
>>> 
>>> Hardly fast paths that justify shared memory.  I should be much harder
>>> on you.
>> 
>> It was a tradeoff on speed and complexity. This way we have the least amount 
>> of complexity IMHO. All KVM code paths just magically fit in with the TCG 
>> code. 
> 
> It's too magical, fitting a random version of a random userspace
> component.  Now you can't change this tcg code (and still keep the magic).
> 
> Some complexity is part of keeping software as separate components.

Why? If another user space wants to use this, they can

a) do the slow copy path
or
b) simply use our struct definitions

The whole copy thing really only makes sense when you have existing code in 
user space that you don't want to touch, but easily add on KVM to it. If KVM is 
part of your whole design, then integrating things makes a lot more sense.

> 
>> There are essentially no if(kvm_enabled)'s in our MMU walking code, because 
>> the tables are just there. Makes everything a lot easier (without dragging 
>> down performance).
> 
> We have the same issue with registers.  There we call
> cpu_synchronize_state() before every access.  No magic, but we get to
> reuse the code just the same.

Yes, and for those few bytes it's ok to do so - most of the time. On s390, even 
those get shared by now. And it makes sense to do so - if we synchronize it 
every time anyways, why not do so implicitly?


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Avi Kivity
On 02/16/2012 04:46 PM, Anthony Liguori wrote:
>> What will it buy us? Surely not speed. Entering a guest is not much
>> (if at all) faster than exiting to userspace and any non trivial
>> operation will require exit to userspace anyway,
>
>
> You can emulate the PIT/RTC entirely within the guest using kvmclock
> which doesn't require an additional exit to get the current time base.
>
> So instead of:
>
> 1) guest -> host kernel
> 2) host kernel -> userspace
> 3) implement logic using rdtscp via VDSO
> 4) userspace -> host kernel
> 5) host kernel -> guest
>
> You go:
>
> 1) guest -> host kernel
> 2) host kernel -> guest (with special CR3)
> 3) implement logic using rdtscp + kvmclock page
> 4) change CR3 within guest and RETI to VMEXIT source RIP
>
> Same basic concept as PS/2 emulation with SMM.

Interesting, but unimplementable in practice.  SMM requires a VMEXIT for
RSM, and anything non-SMM wants a virtual address mapping (and some RAM)
which you can't get without guest cooperation.  There are other
complications like an NMI interrupting hypervisor-provided code and
finding unexpected addresses on its stack (SMM at least blocks NMIs).

Tangentially related, Intel introduced a VMFUNC that allows you to
change the guest's physical memory map to a pre-set alternative provided
by the host, without a VMEXIT.  Seems similar to SMM but requires guest
cooperation.  I guess it's for unintrusive virus scanners and the like.
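
For reference, a bare sketch of how a guest would invoke that VM function
(leaf 0 is EPTP switching; the index selects one of the host-provided EPT
pointers):

/* Guest-side: switch to the host-provided EPTP list entry `index` without
 * a VMEXIT.  EAX selects the VM function (0 = EPTP switching), ECX the
 * index into the EPTP list set up by the hypervisor. */
static inline void vmfunc_switch_eptp(unsigned int index)
{
	asm volatile("vmfunc"
		     : /* no outputs */
		     : "a" (0), "c" (index)
		     : "memory");
}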

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Avi Kivity
On 02/16/2012 09:34 PM, Alexander Graf wrote:
> On 16.02.2012, at 20:24, Avi Kivity wrote:
>
> > On 02/15/2012 04:08 PM, Alexander Graf wrote:
> >>> 
> >>> Well, the scatter/gather registers I proposed will give you just one
> >>> register or all of them.
> >> 
> >> One register is hardly any use. We either need all ways of a respective 
> >> address to do a full fledged lookup or all of them. 
> > 
> > I should have said, just one register, or all of them, or anything in
> > between.
> > 
> >> By sharing the same data structures between qemu and kvm, we actually 
> >> managed to reuse all of the tcg code for lookups, just like you do for x86.
> > 
> > Sharing the data structures is not needed.  Simply synchronize them before
> > lookup, like we do for ordinary registers.
>
> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.

A TLB way is a few dozen bytes, no?

> > 
> >> On x86 you also have shared memory for page tables, it's just guest 
> >> visible, hence in guest memory. The concept is the same.
> > 
> > But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
> > on every exit.  And you're risking the same thing if your hardware gets
> > cleverer.
>
> Yes, we do. When that day comes, we forget the CAP and do it another way. 
> Which way we will find out by the time that day of more clever hardware comes 
> :).

Or we try to be less clever unless we have a really compelling reason. 
qemu monitor and gdb support aren't compelling reasons to optimize.

> > 
> > It's too magical, fitting a random version of a random userspace
> > component.  Now you can't change this tcg code (and still keep the magic).
> > 
> > Some complexity is part of keeping software as separate components.
>
> Why? If another user space wants to use this, they can
>
> a) do the slow copy path
> or
> b) simply use our struct definitions
>
> The whole copy thing really only makes sense when you have existing code in 
> user space that you don't want to touch, but easily add on KVM to it. If KVM 
> is part of your whole design, then integrating things makes a lot more sense.

Yeah, I guess.

>
> > 
> >> There are essentially no if(kvm_enabled)'s in our MMU walking code, 
> >> because the tables are just there. Makes everything a lot easier (without 
> >> dragging down performance).
> > 
> > We have the same issue with registers.  There we call
> > cpu_synchronize_state() before every access.  No magic, but we get to
> > reuse the code just the same.
>
> Yes, and for those few bytes it's ok to do so - most of the time. On s390, 
> even those get shared by now. And it makes sense to do so - if we synchronize 
> it every time anyways, why not do so implicitly?
>

At least on x86, we synchronize only rarely.



-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Scott Wood
On 02/16/2012 01:38 PM, Avi Kivity wrote:
> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>>
>>> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>
> Well, the scatter/gather registers I proposed will give you just one
> register or all of them.

 One register is hardly any use. We either need all ways of a respective 
 address to do a full fledged lookup or all of them. 
>>>
>>> I should have said, just one register, or all of them, or anything in
>>> between.
>>>
 By sharing the same data structures between qemu and kvm, we actually 
 managed to reuse all of the tcg code for lookups, just like you do for x86.
>>>
>>> Sharing the data structures is not needed.  Simply synchronize them before
>>> lookup, like we do for ordinary registers.
>>
>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
> 
> A TLB way is a few dozen bytes, no?

I think you mean a TLB set... but the TLB (or part of it) may be fully
associative.

On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
a set of TLB0 plus all 64 entries in TLB1.  So (4 + 64) * 24 = 1632 bytes total.

Then we'd need to deal with tracking whether we synchronized one or more
specific sets, or everything (for migration or debug TLB dump).  The
request to synchronize would have to come from within the QEMU MMU code,
since that's the point where we know what to ask for (unless we
duplicate the logic elsewhere).  I'm not sure that reusing the standard
QEMU MMU code for individual debug address translation is really
simplifying things...

And yes, we do have fancier hardware coming fairly soon for which this
breaks (TLB0 entries can be loaded without host involvement, as long as
there's a translation from guest physical to physical in a separate
hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
it as invalidated), but not for debug since that may be where the
translation we're interested in resides.

-Scott



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Michael Ellerman
On Thu, 2012-02-16 at 21:28 +0200, Avi Kivity wrote:
> On 02/16/2012 03:04 AM, Michael Ellerman wrote:
> > > 
> > > ioctl is good for hardware devices and stuff that you want to enumerate
> > > and/or control permissions on. For something like KVM that is really a
> > > core kernel service, a syscall makes much more sense.
> >
> > Yeah maybe. That distinction is at least in part just historical.
> >
> > The first problem I see with using a syscall is that you don't need one
> > syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
> > multiplexed syscall like epoll_ctl() - or probably several
> > (vm/vcpu/etc).
> 
> No.  Many of our ioctls are for state save/restore - we reduce that to
> two.  Many others are due to the with/without irqchip support - we slash
> that as well.  The device assignment stuff is relegated to vfio.
> 
> I still have to draw up a concrete proposal, but I think we'll end up
> with 10-15.

That's true, you certainly could reduce it, though by how much I'm not
sure. On powerpc I'm working on moving the irq controller emulation into
the kernel, and some associated firmware emulation, so that's at least
one new ioctl. And there will always be more, whatever scheme you have
must be easily extensible - ie. not requiring new syscalls for each new
weird platform.

> > Secondly you still need a handle/context for those syscalls, and I think
> > the most sane thing to use for that is an fd.
> 
> The context is the process (for vm-wide calls) and thread (for vcpu
> local calls).

Yeah OK I forgot you'd mentioned that. But isn't that change basically
orthogonal to how you get into the kernel? ie. we could have the
kvm/vcpu pointers in mm_struct/task_struct today?

I guess it wouldn't win you much though because you still have the fd
and ioctl overhead as well.

cheers




Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Alexander Graf

On 16.02.2012, at 20:38, Avi Kivity wrote:

> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>> 
>>> On 02/15/2012 04:08 PM, Alexander Graf wrote:
> 
> Well, the scatter/gather registers I proposed will give you just one
> register or all of them.
 
 One register is hardly any use. We either need all ways of a respective 
 address to do a full fledged lookup or all of them. 
>>> 
>>> I should have said, just one register, or all of them, or anything in
>>> between.
>>> 
 By sharing the same data structures between qemu and kvm, we actually 
 managed to reuse all of the tcg code for lookups, just like you do for x86.
>>> 
>>> Sharing the data structures is not needed.  Simply synchronize them before
>>> lookup, like we do for ordinary registers.
>> 
>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
> 
> A TLB way is a few dozen bytes, no?
> 
>>> 
 On x86 you also have shared memory for page tables, it's just guest 
 visible, hence in guest memory. The concept is the same.
>>> 
>>> But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
>>> on every exit.  And you're risking the same thing if your hardware gets
>>> cleverer.
>> 
>> Yes, we do. When that day comes, we forget the CAP and do it another way. 
>> Which way we will find out by the time that day of more clever hardware 
>> comes :).
> 
> Or we try to be less clever unless we have a really compelling reason. 
> qemu monitor and gdb support aren't compelling reasons to optimize.

The goal here was simplicity with a grain of performance concerns.

So what would you be envisioning? Should we make all of the MMU walker code in 
target-ppc KVM aware so it fetches that single way it actually cares about on 
demand from the kernel? That is pretty intrusive and goes against the general 
nicely fitting in principle of how KVM integrates today.

Also, we need to store the guest TLB somewhere. With this model, we can just 
store it in user space memory, so we keep only a single copy around, reducing 
memory footprint. If we had to copy it, we would need more than a single copy.

> 
>>> 
>>> It's too magical, fitting a random version of a random userspace
>>> component.  Now you can't change this tcg code (and still keep the magic).
>>> 
>>> Some complexity is part of keeping software as separate components.
>> 
>> Why? If another user space wants to use this, they can
>> 
>> a) do the slow copy path
>> or
>> b) simply use our struct definitions
>> 
>> The whole copy thing really only makes sense when you have existing code in 
>> user space that you don't want to touch, but easily add on KVM to it. If KVM 
>> is part of your whole design, then integrating things makes a lot more sense.
> 
> Yeah, I guess.
> 
>> 
>>> 
 There are essentially no if(kvm_enabled)'s in our MMU walking code, 
 because the tables are just there. Makes everything a lot easier (without 
 dragging down performance).
>>> 
>>> We have the same issue with registers.  There we call
>>> cpu_synchronize_state() before every access.  No magic, but we get to
>>> reuse the code just the same.
>> 
>> Yes, and for those few bytes it's ok to do so - most of the time. On s390, 
>> even those get shared by now. And it makes sense to do so - if we 
>> synchronize it every time anyways, why not do so implicitly?
>> 
> 
> At least on x86, we synchronize only rarely.

Yeah, on s390 we only know which registers actually contain the information we 
need for traps / hypercalls when in user space, since that's where the decoding 
happens. So we better have all GPRs available to read from and write to.


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Alexander Graf

On 16.02.2012, at 21:41, Scott Wood wrote:

> On 02/16/2012 01:38 PM, Avi Kivity wrote:
>> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>>> 
 On 02/15/2012 04:08 PM, Alexander Graf wrote:
>> 
>> Well, the scatter/gather registers I proposed will give you just one
>> register or all of them.
> 
> One register is hardly any use. We either need all ways of a respective 
> address to do a full fledged lookup or all of them. 
 
 I should have said, just one register, or all of them, or anything in
 between.
 
> By sharing the same data structures between qemu and kvm, we actually 
> managed to reuse all of the tcg code for lookups, just like you do for 
> x86.
 
 Sharing the data structures is not needed.  Simply synchronize them before
 lookup, like we do for ordinary registers.
>>> 
>>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
>> 
>> A TLB way is a few dozen bytes, no?
> 
> I think you mean a TLB set... but the TLB (or part of it) may be fully
> associative.
> 
> On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
> a set of TLB0, and all 64 entries in TLB1.  So 1632 bytes total.
> 
> Then we'd need to deal with tracking whether we synchronized one or more
> specific sets, or everything (for migration or debug TLB dump).  The
> request to synchronize would have to come from within the QEMU MMU code,
> since that's the point where we know what to ask for (unless we
> duplicate the logic elsewhere).  I'm not sure that reusing the standard
> QEMU MMU code for individual debug address translation is really
> simplifying things...
> 
> And yes, we do have fancier hardware coming fairly soon for which this
> breaks (TLB0 entries can be loaded without host involvement, as long as
> there's a translation from guest physical to physical in a separate
> hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
> it as invalidated), but not for debug since that may be where the
> translation we're interested in resides.

Could we maybe add an ioctl that forces kvm to read out the current tlb0 
contents and push them to memory? How slow would that be?


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-17 Thread Scott Wood
On 02/16/2012 06:23 PM, Alexander Graf wrote:
> On 16.02.2012, at 21:41, Scott Wood wrote:
>> And yes, we do have fancier hardware coming fairly soon for which this
>> breaks (TLB0 entries can be loaded without host involvement, as long as
>> there's a translation from guest physical to physical in a separate
>> hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
>> it as invalidated), but not for debug since that may be where the
>> translation we're interested in resides.
> 
> Could we maybe add an ioctl that forces kvm to read out the current tlb0 
> contents and push them to memory? How slow would that be?

Yes, I was thinking something like that.  We'd just have to remove (make
conditional on MMU type) the statement that this is synchronized
implicitly on return from vcpu_run.

Performance shouldn't be a problem -- we'd only need to sync once and
then can do all the repeated debug accesses we want.  So should be no
need to mess around with partial sync.

-Scott



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-18 Thread Avi Kivity
On 02/16/2012 10:41 PM, Scott Wood wrote:
> >>> Sharing the data structures is not need.  Simply synchronize them before
> >>> lookup, like we do for ordinary registers.
> >>
> >> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
> > 
> > A TLB way is a few dozen bytes, no?
>
> I think you mean a TLB set... 

Yes, thanks.

> but the TLB (or part of it) may be fully
> associative.

A fully associative TLB has to be very small.

> On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
> a set of TLB0, and all 64 entries in TLB1.  So 1632 bytes total.

Syncing this every time you need a translation (for gdb or the monitor)
is trivial in terms of performance.

> Then we'd need to deal with tracking whether we synchronized one or more
> specific sets, or everything (for migration or debug TLB dump).  The
> request to synchronize would have to come from within the QEMU MMU code,
> since that's the point where we know what to ask for (unless we
> duplicate the logic elsewhere).  I'm not sure that reusing the standard
> QEMU MMU code for individual debug address translation is really
> simplifying things...
>
> And yes, we do have fancier hardware coming fairly soon for which this
> breaks (TLB0 entries can be loaded without host involvement, as long as
> there's a translation from guest physical to physical in a separate
> hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
> it as invalidated), but not for debug since that may be where the
> translation we're interested in resides.
>

So with this new hardware, the always-sync API breaks.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-18 Thread Avi Kivity
On 02/17/2012 02:19 AM, Alexander Graf wrote:
> > 
> > Or we try to be less clever unless we have a really compelling reason. 
> > qemu monitor and gdb support aren't compelling reasons to optimize.
>
> The goal here was simplicity with a grain of performance concerns.
>

Shared memory is simple in one way, but in other ways it is more
complicated since it takes away the kernel's freedom in how it manages
the data, how it's laid out, and whether it can lazify things or not.

> So what would you be envisioning? Should we make all of the MMU walker code 
> in target-ppc KVM aware so it fetches that single way it actually cares about 
> on demand from the kernel? That is pretty intrusive and goes against the 
> general nicely fitting in principle of how KVM integrates today.

First, it's trivial: when you access a set you call
cpu_synchronize_tlb(set), just like you access the registers when
you want them.

Second, and more important, how a random version of qemu works is
totally immaterial to the kvm userspace interface.  qemu could change in
15 different ways and so could the kernel, and other users exist. 
Fitting into qemu's current model is not a goal (if qemu happens to have
a good model, use it by all means; and clashing with qemu is likely an
indication that something is wrong -- but the two projects need to be
decoupled).
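
A sketch of that lazy-sync accessor, mirroring how cpu_synchronize_state()
is used for registers; the type, the validity flag and kvm_get_tlb_set()
are hypothetical names:

#include <stdbool.h>

#define NUM_TLB_SETS 128			/* illustrative size */

typedef struct {
	bool tlb_set_valid[NUM_TLB_SETS];	/* hypothetical per-set tracking */
	/* ... guest TLB storage ... */
} CPUStateSketch;

void kvm_get_tlb_set(CPUStateSketch *env, int set);	/* hypothetical KVM call */

/* Fetch a TLB set from the kernel only when gdb, the monitor or migration
 * actually asks for it. */
static void cpu_synchronize_tlb(CPUStateSketch *env, int set)
{
	if (!env->tlb_set_valid[set]) {
		kvm_get_tlb_set(env, set);
		env->tlb_set_valid[set] = true;
	}
}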

> Also, we need to store the guest TLB somewhere. With this model, we can just 
> store it in user space memory, so we keep only a single copy around, reducing 
> memory footprint. If we had to copy it, we would need more than a single copy.

That's the whole point.  You could store it on the cpu hardware, if the
cpu allows it.  Forcing it into always-synchronized shared memory takes
that ability away from you.

>  
> > 
> > At least on x86, we synchronize only rarely.
>
> Yeah, on s390 we only know which registers actually contain the information 
> we need for traps / hypercalls when in user space, since that's where the 
> decoding happens. So we better have all GPRs available to read from and write 
> to.
>

Ok.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-18 Thread Avi Kivity
On 02/17/2012 02:09 AM, Michael Ellerman wrote:
> On Thu, 2012-02-16 at 21:28 +0200, Avi Kivity wrote:
> > On 02/16/2012 03:04 AM, Michael Ellerman wrote:
> > > > 
> > > > ioctl is good for hardware devices and stuff that you want to enumerate
> > > > and/or control permissions on. For something like KVM that is really a
> > > > core kernel service, a syscall makes much more sense.
> > >
> > > Yeah maybe. That distinction is at least in part just historical.
> > >
> > > The first problem I see with using a syscall is that you don't need one
> > > syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
> > > multiplexed syscall like epoll_ctl() - or probably several
> > > (vm/vcpu/etc).
> > 
> > No.  Many of our ioctls are for state save/restore - we reduce that to
> > two.  Many others are due to the with/without irqchip support - we slash
> > that as well.  The device assignment stuff is relegated to vfio.
> > 
> > I still have to draw up a concrete proposal, but I think we'll end up
> > with 10-15.
>
> That's true, you certainly could reduce it, though by how much I'm not
> sure. On powerpc I'm working on moving the irq controller emulation into
> the kernel, and some associated firmware emulation, so that's at least
> one new ioctl. And there will always be more, whatever scheme you have
> must be easily extensible - ie. not requiring new syscalls for each new
> weird platform.

Most of it falls into read/write state, which is covered by two
syscalls.  There's probably a need for configuration (wiring etc.); we
could call that pseudo-state with fake registers but I don't like that
very much.


> > > Secondly you still need a handle/context for those syscalls, and I think
> > > the most sane thing to use for that is an fd.
> > 
> > The context is the process (for vm-wide calls) and thread (for vcpu
> > local calls).
>
> Yeah OK I forgot you'd mentioned that. But isn't that change basically
> orthogonal to how you get into the kernel? ie. we could have the
> kvm/vcpu pointers in mm_struct/task_struct today?
>
> I guess it wouldn't win you much though because you still have the fd
> and ioctl overhead as well.
>

Yes.  I also dislike bypassing ioctl semantics (though we already do
that by requiring vcpus to stay on the same thread and vms on the same
process).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-18 Thread Alexander Graf


On 18.02.2012, at 11:00, Avi Kivity  wrote:

> On 02/17/2012 02:19 AM, Alexander Graf wrote:
>>> 
>>> Or we try to be less clever unless we have a really compelling reason. 
>>> qemu monitor and gdb support aren't compelling reasons to optimize.
>> 
>> The goal here was simplicity with a grain of performance concerns.
>> 
> 
> Shared memory is simple in one way, but in other ways it is more
> complicated since it takes away the kernel's freedom in how it manages
> the data, how it's laid out, and whether it can lazify things or not.

Yes and no. Shared memory is a means of transferring data. Whether it's 
implemented by copying internally or by implicit synchronization is orthogonal 
to that.

With the interface as is, we can now on newer CPUs (which need changes to user 
space to work anyway) take the current interface and add a new CAP + ioctl 
that allows us to force-flush the TLB into the shared buffer. That way we 
maintain backwards compatibility, memory savings, no in-kernel vmalloc clutter 
etc. on all CPUs, but get the checkpoint to actually have useful contents for 
new CPUs.

I don't see the problem really. The data is the architected layout of the TLB. 
It contains all the data that can possibly make up a TLB entry according to the 
booke spec. If we wanted to copy different data, we'd need a different ioctl 
too.
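
A hypothetical sketch of that CAP + ioctl combination; the capability name,
ioctl number and semantics below are invented for illustration:

#include <sys/ioctl.h>
#include <linux/kvm.h>		/* KVMIO */

#define KVM_CAP_SW_TLB_FLUSH	200			/* hypothetical capability */
#define KVM_FLUSH_SW_TLB	_IO(KVMIO, 0xb0)	/* hypothetical ioctl */

/* Ask the kernel to write the live guest TLB back into the shared
 * user-space array before reading it for migration or debug. */
static int sync_guest_tlb(int vcpu_fd)
{
	return ioctl(vcpu_fd, KVM_FLUSH_SW_TLB, 0);
}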

> 
>> So what would you be envisioning? Should we make all of the MMU walker code 
>> in target-ppc KVM aware so it fetches that single way it actually cares 
>> about on demand from the kernel? That is pretty intrusive and goes against 
>> the general nicely fitting in principle of how KVM integrates today.
> 
> First, it's trivial, when you access a set you call
> cpu_synchronize_tlb(set), just like how you access the registers when
> you want them.

Yes, which is reasonably intrusive and going to be necessary with LRAT.

> 
> Second, and more important, how a random version of qemu works is
> totally immaterial to the kvm userspace interface.  qemu could change in
> 15 different ways and so could the kernel, and other users exist. 
> Fitting into qemu's current model is not a goal (if qemu happens to have
> a good model, use it by all means; and clashing with qemu is likely an
> indication the something is wrong -- but the two projects need to be
> decoupled).

Sure. In fact, in this case, the two were developed together. QEMU didn't have 
support for this specific TLB type, so we combined the development efforts. 
This way any new user space has a very easy time to implement it too, because 
we didn't model the KVM parts after QEMU, but the QEMU parts after KVM.

I still think it holds true that the KVM interface is very easy to plug in to 
any random emulation project. And to achieve that, the interface should be as 
unintrusive as possible wrt its requirements. The one we have seemed to 
fit that pretty well. Sure, we need a special flush command for newer CPUs, but 
at least we don't have to always copy. We only copy when we need to.

> 
>> Also, we need to store the guest TLB somewhere. With this model, we can just 
>> store it in user space memory, so we keep only a single copy around, 
>> reducing memory footprint. If we had to copy it, we would need more than a 
>> single copy.
> 
> That's the whole point.  You could store it on the cpu hardware, if the
> cpu allows it.  Forcing it into always-synchronized shared memory takes
> that ability away from you.

Yup. So the correct comment to make would be "don't make the shared TLB always 
synchronized", which I agree with today. I still think that the whole idea of 
passing KVM user space memory to work on is great. It reduces vmalloc 
footprint, it reduces copying, and it keeps the data in one place, reducing the 
chances to mess up.

Having it defined to always be in sync was a mistake, but one we can easily 
fix. That's why the CAP and ioctl interfaces are so awesome ;). I strongly 
believe that I can't predict the future, so designing an interface that holds 
stable for the next 10 years is close to impossible. With an easily extensible 
interface, however, it becomes almost trivial to fix earlier mess-ups ;).


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-22 Thread Peter Zijlstra
On Sat, 2012-02-04 at 11:08 +0900, Takuya Yoshikawa wrote:
> The latter needs a fundamental change:  I heard (from Avi) that we can
> change mmu_lock to mutex_lock if mmu_notifier becomes preemptible.
> 
> So I was planning to restart this work when Peter's
> "mm: Preemptibility"
> http://lkml.org/lkml/2011/4/1/141
> gets finished. 


That got merged a while ago:

# git describe --contains d16dfc550f5326a4000f3322582a7c05dec91d7a --match "v*"
v3.0-rc1~275

While I still need to get back to unifying mmu_gather across
architectures the whole thing is currently preemptible.