Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-18 Thread Alexander Graf


On 18.02.2012, at 11:00, Avi Kivity  wrote:

> On 02/17/2012 02:19 AM, Alexander Graf wrote:
>>> 
>>> Or we try to be less clever unless we have a really compelling reason. 
>>> qemu monitor and gdb support aren't compelling reasons to optimize.
>> 
>> The goal here was simplicity with a grain of performance concerns.
>> 
> 
> Shared memory is simple in one way, but in other ways it is more
> complicated since it takes away the kernel's freedom in how it manages
> the data, how it's laid out, and whether it can lazify things or not.

Yes and no. Shared memory is a means of transferring data. Whether it's 
implemented by copying internally or by implicit synchronization is orthogonal 
to that.

With the interface as is, we can now on newer CPUs (which need changes to user 
space to work anyways) take the current interface and add a new CAP + ioctl 
that allows us to force flush the TLB into the shared buffer. That way we 
maintain backwards compatibility, memory savings, no in kernel vmalloc 
cluttering etc. on all CPUs, but get the checkpoint to actually have useful 
contents for new CPUs.
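
Roughly, userspace could then do something like this (a minimal sketch; 
KVM_CAP_PPC_FLUSH_TLB and KVM_PPC_FLUSH_TLB are made-up names for 
illustration, not an existing API):

#include <sys/ioctl.h>
#include <linux/kvm.h>

#define KVM_CAP_PPC_FLUSH_TLB 0x7f             /* hypothetical CAP number */
#define KVM_PPC_FLUSH_TLB     _IO(KVMIO, 0xfe) /* hypothetical ioctl */

static int sync_guest_tlb(int kvm_fd, int vcpu_fd)
{
    /* CAP absent: old interface, the shared buffer is always in sync. */
    if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_FLUSH_TLB) <= 0)
        return 0;

    /* CAP present: newer CPUs may hold entries in hardware, so ask the
     * kernel to write the architected TLB into the shared buffer. */
    return ioctl(vcpu_fd, KVM_PPC_FLUSH_TLB, 0);
}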

I don't see the problem really. The data is the architected layout of the TLB. 
It contains all the data that can possibly make up a TLB entry according to the 
booke spec. If we wanted to copy different data, we'd need a different ioctl 
too.

> 
>> So what would you be envisioning? Should we make all of the MMU walker code 
>> in target-ppc KVM aware so it fetches that single way it actually cares 
>> about on demand from the kernel? That is pretty intrusive and goes against 
>> the general 'nicely fitting in' principle of how KVM integrates today.
> 
> First, it's trivial, when you access a set you call
> cpu_synchronize_tlb(set), just like how you access the registers when
> you want them.

Yes, which is reasonably intrusive and going to be necessary with LRAT.

> 
> Second, and more important, how a random version of qemu works is
> totally immaterial to the kvm userspace interface.  qemu could change in
> 15 different ways and so could the kernel, and other users exist. 
> Fitting into qemu's current model is not a goal (if qemu happens to have
> a good model, use it by all means; and clashing with qemu is likely an
> indication that something is wrong -- but the two projects need to be
> decoupled).

Sure. In fact, in this case, the two were developed together. QEMU didn't have 
support for this specific TLB type, so we combined the development efforts. 
This way any new user space has a very easy time implementing it too, because 
we didn't model the KVM parts after QEMU, but the QEMU parts after KVM.

I still think it holds true that the KVM interface is very easy to plug in to 
any random emulation project. And to achieve that, the interface should be as 
unintrusive as possible with respect to its requirements. The one we have seemed to 
fit that pretty well. Sure, we need a special flush command for newer CPUs, but 
at least we don't have to always copy. We only copy when we need to.

> 
>> Also, we need to store the guest TLB somewhere. With this model, we can just 
>> store it in user space memory, so we keep only a single copy around, 
>> reducing memory footprint. If we had to copy it, we would need more than a 
>> single copy.
> 
> That's the whole point.  You could store it on the cpu hardware, if the
> cpu allows it.  Forcing it into always-synchronized shared memory takes
> that ability away from you.

Yup. So the correct comment to make would be "don't make the shared TLB always 
synchronized", which I agree with today. I still think that the whole idea of 
passing kvm user space memory to work on is great. It reduces vmalloc 
footprint, it reduces copying, and it keeps data in one place, reducing the 
chances of messing up.

Having it defined to always be in sync was a mistake, but one we can easily 
fix. That's why the CAP and ioctl interfaces are so awesome ;). I strongly 
believe that I can't predict the future, so designing an interface that holds 
stable for the next 10 years is close to impossible. With an easily extensible 
interface, however, it becomes almost trivial to fix earlier mess-ups ;).


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-18 Thread Avi Kivity
On 02/17/2012 02:19 AM, Alexander Graf wrote:
> > 
> > Or we try to be less clever unless we have a really compelling reason. 
> > qemu monitor and gdb support aren't compelling reasons to optimize.
>
> The goal here was simplicity with a grain of performance concerns.
>

Shared memory is simple in one way, but in other ways it is more
complicated since it takes away the kernel's freedom in how it manages
the data, how it's laid out, and whether it can lazify things or not.

> So what would you be envisioning? Should we make all of the MMU walker code 
> in target-ppc KVM aware so it fetches that single way it actually cares about 
> on demand from the kernel? That is pretty intrusive and goes against the 
> general 'nicely fitting in' principle of how KVM integrates today.

First, it's trivial, when you access a set you call
cpu_synchronize_tlb(set), just like how you access the registers when
you want them.
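
In code, the on-demand pattern could look like this (a self-contained 
sketch; every name and size here is invented for illustration, none of it 
is existing QEMU or KVM API):

#include <stdbool.h>
#include <stdint.h>

#define TLB0_SETS 128
#define TLB0_WAYS 4

struct tlb_entry { uint32_t mas1; uint64_t mas2; uint64_t mas7_3; };

struct vcpu_tlb {
    struct tlb_entry tlb0[TLB0_SETS][TLB0_WAYS];
    bool set_valid[TLB0_SETS];  /* which sets are in sync with the kernel */
};

/* Stand-in for the kernel call that would copy one set to userspace. */
static void kvm_get_tlb_set(struct vcpu_tlb *t, int set) { (void)t; (void)set; }

static void cpu_synchronize_tlb(struct vcpu_tlb *t, int set)
{
    if (!t->set_valid[set]) {
        kvm_get_tlb_set(t, set);    /* copies only TLB0_WAYS entries */
        t->set_valid[set] = true;
    }
}

/* A gdb/monitor translation touches exactly one set per lookup. */
static const struct tlb_entry *tlb_lookup(struct vcpu_tlb *t, uint64_t ea)
{
    int set = (ea >> 12) % TLB0_SETS;   /* simplified set selection */
    cpu_synchronize_tlb(t, set);
    for (int w = 0; w < TLB0_WAYS; w++)
        if (t->tlb0[set][w].mas1 >> 31) /* valid bit, simplified */
            return &t->tlb0[set][w];    /* (EPN match check elided) */
    return 0;
}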

Second, and more important, how a random version of qemu works is
totally immaterial to the kvm userspace interface.  qemu could change in
15 different ways and so could the kernel, and other users exist. 
Fitting into qemu's current model is not a goal (if qemu happens to have
a good model, use it by all means; and clashing with qemu is likely an
indication that something is wrong -- but the two projects need to be
decoupled).

> Also, we need to store the guest TLB somewhere. With this model, we can just 
> store it in user space memory, so we keep only a single copy around, reducing 
> memory footprint. If we had to copy it, we would need more than a single copy.

That's the whole point.  You could store it on the cpu hardware, if the
cpu allows it.  Forcing it into always-synchronized shared memory takes
that ability away from you.

>  
> > 
> > At least on x86, we synchronize only rarely.
>
> Yeah, on s390 we only know which registers actually contain the information 
> we need for traps / hypercalls when in user space, since that's where the 
> decoding happens. So we better have all GPRs available to read from and write 
> to.
>

Ok.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-18 Thread Avi Kivity
On 02/16/2012 10:41 PM, Scott Wood wrote:
> >>> Sharing the data structures is not needed.  Simply synchronize them before
> >>> lookup, like we do for ordinary registers.
> >>
> >> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
> > 
> > A TLB way is a few dozen bytes, no?
>
> I think you mean a TLB set... 

Yes, thanks.

> but the TLB (or part of it) may be fully
> associative.

A fully associative TLB has to be very small.

> On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
> a set of TLB0, and all 64 entries in TLB1.  So 1632 bytes total.

Syncing this every time you need a translation (for gdb or the monitor)
is trivial in terms of performance.

> Then we'd need to deal with tracking whether we synchronized one or more
> specific sets, or everything (for migration or debug TLB dump).  The
> request to synchronize would have to come from within the QEMU MMU code,
> since that's the point where we know what to ask for (unless we
> duplicate the logic elsewhere).  I'm not sure that reusing the standard
> QEMU MMU code for individual debug address translation is really
> simplifying things...
>
> And yes, we do have fancier hardware coming fairly soon for which this
> breaks (TLB0 entries can be loaded without host involvement, as long as
> there's a translation from guest physical to physical in a separate
> hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
> it as invalidated), but not for debug since that may be where the
> translation we're interested in resides.
>

So with this new hardware, the always-sync API breaks.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-17 Thread Scott Wood
On 02/16/2012 06:23 PM, Alexander Graf wrote:
> On 16.02.2012, at 21:41, Scott Wood wrote:
>> And yes, we do have fancier hardware coming fairly soon for which this
>> breaks (TLB0 entries can be loaded without host involvement, as long as
>> there's a translation from guest physical to physical in a separate
>> hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
>> it as invalidated), but not for debug since that may be where the
>> translation we're interested in resides.
> 
> Could we maybe add an ioctl that forces kvm to read out the current tlb0 
> contents and push them to memory? How slow would that be?

Yes, I was thinking something like that.  We'd just have to remove (make
conditional on MMU type) the statement that this is synchronized
implicitly on return from vcpu_run.

Performance shouldn't be a problem -- we'd only need to sync once and
then can do all the repeated debug accesses we want.  So should be no
need to mess around with partial sync.

-Scott



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Alexander Graf

On 16.02.2012, at 21:41, Scott Wood wrote:

> On 02/16/2012 01:38 PM, Avi Kivity wrote:
>> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>>> 
 On 02/15/2012 04:08 PM, Alexander Graf wrote:
>> 
>> Well, the scatter/gather registers I proposed will give you just one
>> register or all of them.
> 
> One register is hardly any use. We either need all ways of a respective 
> address to do a full fledged lookup or all of them. 
 
 I should have said, just one register, or all of them, or anything in
 between.
 
> By sharing the same data structures between qemu and kvm, we actually 
> managed to reuse all of the tcg code for lookups, just like you do for 
> x86.
 
 Sharing the data structures is not needed.  Simply synchronize them before
 lookup, like we do for ordinary registers.
>>> 
>>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
>> 
>> A TLB way is a few dozen bytes, no?
> 
> I think you mean a TLB set... but the TLB (or part of it) may be fully
> associative.
> 
> On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
> a set of TLB0, and all 64 entries in TLB1.  So 1632 bytes total.
> 
> Then we'd need to deal with tracking whether we synchronized one or more
> specific sets, or everything (for migration or debug TLB dump).  The
> request to synchronize would have to come from within the QEMU MMU code,
> since that's the point where we know what to ask for (unless we
> duplicate the logic elsewhere).  I'm not sure that reusing the standard
> QEMU MMU code for individual debug address translation is really
> simplifying things...
> 
> And yes, we do have fancier hardware coming fairly soon for which this
> breaks (TLB0 entries can be loaded without host involvement, as long as
> there's a translation from guest physical to physical in a separate
> hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
> it as invalidated), but not for debug since that may be where the
> translation we're interested in resides.

Could we maybe add an ioctl that forces kvm to read out the current tlb0 
contents and push them to memory? How slow would that be?


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Alexander Graf

On 16.02.2012, at 20:38, Avi Kivity wrote:

> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>> 
>>> On 02/15/2012 04:08 PM, Alexander Graf wrote:
> 
> Well, the scatter/gather registers I proposed will give you just one
> register or all of them.
 
 One register is hardly any use. We either need all ways of a respective 
 address to do a full fledged lookup or all of them. 
>>> 
>>> I should have said, just one register, or all of them, or anything in
>>> between.
>>> 
 By sharing the same data structures between qemu and kvm, we actually 
 managed to reuse all of the tcg code for lookups, just like you do for x86.
>>> 
>>> Sharing the data structures is not needed.  Simply synchronize them before
>>> lookup, like we do for ordinary registers.
>> 
>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
> 
> A TLB way is a few dozen bytes, no?
> 
>>> 
 On x86 you also have shared memory for page tables, it's just guest 
 visible, hence in guest memory. The concept is the same.
>>> 
>>> But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
>>> on every exit.  And you're risking the same thing if your hardware gets
>>> cleverer.
>> 
>> Yes, we do. When that day comes, we forget the CAP and do it another way. 
>> Which way we will find out by the time that day of more clever hardware 
>> comes :).
> 
> Or we try to be less clever unless we have a really compelling reason. 
> qemu monitor and gdb support aren't compelling reasons to optimize.

The goal here was simplicity with a grain of performance concerns.

So what would you be envisioning? Should we make all of the MMU walker code in 
target-ppc KVM aware so it fetches that single way it actually cares about on 
demand from the kernel? That is pretty intrusive and goes against the general 
'nicely fitting in' principle of how KVM integrates today.

Also, we need to store the guest TLB somewhere. With this model, we can just 
store it in user space memory, so we keep only a single copy around, reducing 
memory footprint. If we had to copy it, we would need more than a single copy.

> 
>>> 
>>> It's too magical, fitting a random version of a random userspace
>>> component.  Now you can't change this tcg code (and still keep the magic).
>>> 
>>> Some complexity is part of keeping software as separate components.
>> 
>> Why? If another user space wants to use this, they can
>> 
>> a) do the slow copy path
>> or
>> b) simply use our struct definitions
>> 
>> The whole copy thing really only makes sense when you have existing code in 
>> user space that you don't want to touch, but easily add on KVM to it. If KVM 
>> is part of your whole design, then integrating things makes a lot more sense.
> 
> Yeah, I guess.
> 
>> 
>>> 
 There are essentially no if(kvm_enabled)'s in our MMU walking code, 
 because the tables are just there. Makes everything a lot easier (without 
 dragging down performance).
>>> 
>>> We have the same issue with registers.  There we call
>>> cpu_synchronize_state() before every access.  No magic, but we get to
>>> reuse the code just the same.
>> 
>> Yes, and for those few bytes it's ok to do so - most of the time. On s390, 
>> even those get shared by now. And it makes sense to do so - if we 
>> synchronize it every time anyways, why not do so implicitly?
>> 
> 
> At least on x86, we synchronize only rarely.

Yeah, on s390 we only know which registers actually contain the information we 
need for traps / hypercalls when in user space, since that's where the decoding 
happens. So we better have all GPRs available to read from and write to.


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Scott Wood
On 02/16/2012 01:38 PM, Avi Kivity wrote:
> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>>
>>> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>
> Well, the scatter/gather registers I proposed will give you just one
> register or all of them.

 One register is hardly any use. We either need all ways of a respective 
 address to do a full fledged lookup or all of them. 
>>>
>>> I should have said, just one register, or all of them, or anything in
>>> between.
>>>
 By sharing the same data structures between qemu and kvm, we actually 
 managed to reuse all of the tcg code for lookups, just like you do for x86.
>>>
>>> Sharing the data structures is not needed.  Simply synchronize them before
>>> lookup, like we do for ordinary registers.
>>
>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
> 
> A TLB way is a few dozen bytes, no?

I think you mean a TLB set... but the TLB (or part of it) may be fully
associative.

On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
a set of TLB0, and all 64 entries in TLB1.  So 1632 bytes total.
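
To make the arithmetic concrete (the struct below is an assumption modeled 
loosely on the MAS register layout, not a specific kernel header):

#include <assert.h>
#include <stdint.h>

struct tlb_entry {
    uint32_t mas8;    /* hypervisor attributes */
    uint32_t mas1;    /* valid bit, TID, page size */
    uint64_t mas2;    /* EPN, WIMGE */
    uint64_t mas7_3;  /* RPN, permission bits */
};

int main(void)
{
    _Static_assert(sizeof(struct tlb_entry) == 24, "24 bytes per entry");
    /* One TLB0 set (4 ways) plus the fully associative TLB1 (64 entries):
     * (4 + 64) * 24 = 1632 bytes per debug translation. */
    assert((4 + 64) * sizeof(struct tlb_entry) == 1632);
    return 0;
}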

Then we'd need to deal with tracking whether we synchronized one or more
specific sets, or everything (for migration or debug TLB dump).  The
request to synchronize would have to come from within the QEMU MMU code,
since that's the point where we know what to ask for (unless we
duplicate the logic elsewhere).  I'm not sure that reusing the standard
QEMU MMU code for individual debug address translation is really
simplifying things...

And yes, we do have fancier hardware coming fairly soon for which this
breaks (TLB0 entries can be loaded without host involvement, as long as
there's a translation from guest physical to physical in a separate
hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
it as invalidated), but not for debug since that may be where the
translation we're interested in resides.

-Scott



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Avi Kivity
On 02/16/2012 09:34 PM, Alexander Graf wrote:
> On 16.02.2012, at 20:24, Avi Kivity wrote:
>
> > On 02/15/2012 04:08 PM, Alexander Graf wrote:
> >>> 
> >>> Well, the scatter/gather registers I proposed will give you just one
> >>> register or all of them.
> >> 
> >> One register is hardly any use. We either need all ways of a respective 
> >> address to do a full fledged lookup or all of them. 
> > 
> > I should have said, just one register, or all of them, or anything in
> > between.
> > 
> >> By sharing the same data structures between qemu and kvm, we actually 
> >> managed to reuse all of the tcg code for lookups, just like you do for x86.
> > 
> > Sharing the data structures is not needed.  Simply synchronize them before
> > lookup, like we do for ordinary registers.
>
> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.

A TLB way is a few dozen bytes, no?

> > 
> >> On x86 you also have shared memory for page tables, it's just guest 
> >> visible, hence in guest memory. The concept is the same.
> > 
> > But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
> > on every exit.  And you're risking the same thing if your hardware gets
> > cleverer.
>
> Yes, we do. When that day comes, we forget the CAP and do it another way. 
> Which way we will find out by the time that day of more clever hardware comes 
> :).

Or we try to be less clever unless we have a really compelling reason. 
qemu monitor and gdb support aren't compelling reasons to optimize.

> > 
> > It's too magical, fitting a random version of a random userspace
> > component.  Now you can't change this tcg code (and still keep the magic).
> > 
> > Some complexity is part of keeping software as separate components.
>
> Why? If another user space wants to use this, they can
>
> a) do the slow copy path
> or
> b) simply use our struct definitions
>
> The whole copy thing really only makes sense when you have existing code in 
> user space that you don't want to touch, but easily add on KVM to it. If KVM 
> is part of your whole design, then integrating things makes a lot more sense.

Yeah, I guess.

>
> > 
> >> There are essentially no if(kvm_enabled)'s in our MMU walking code, 
> >> because the tables are just there. Makes everything a lot easier (without 
> >> dragging down performance).
> > 
> > We have the same issue with registers.  There we call
> > cpu_synchronize_state() before every access.  No magic, but we get to
> > reuse the code just the same.
>
> Yes, and for those few bytes it's ok to do so - most of the time. On s390, 
> even those get shared by now. And it makes sense to do so - if we synchronize 
> it every time anyways, why not do so implicitly?
>

At least on x86, we synchronize only rarely.



-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Alexander Graf

On 16.02.2012, at 20:24, Avi Kivity wrote:

> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>>> 
>>> Well, the scatter/gather registers I proposed will give you just one
>>> register or all of them.
>> 
>> One register is hardly any use. We either need all ways of a respective 
>> address to do a full fledged lookup or all of them. 
> 
> I should have said, just one register, or all of them, or anything in
> between.
> 
>> By sharing the same data structures between qemu and kvm, we actually 
>> managed to reuse all of the tcg code for lookups, just like you do for x86.
> 
> Sharing the data structures is not needed.  Simply synchronize them before
> lookup, like we do for ordinary registers.

Ordinary registers are a few bytes. We're talking of dozens of kbytes here.

> 
>> On x86 you also have shared memory for page tables, it's just guest visible, 
>> hence in guest memory. The concept is the same.
> 
> But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
> on every exit.  And you're risking the same thing if your hardware gets
> cleverer.

Yes, we do. When that day comes, we forget the CAP and do it another way. Which 
way we will find out by the time that day of more clever hardware comes :).

> 
>>> 
> btw, why are you interested in virtual addresses in userspace at all?
 
 We need them for gdb and monitor introspection.
>>> 
>>> Hardly fast paths that justify shared memory.  I should be much harder
>>> on you.
>> 
>> It was a tradeoff on speed and complexity. This way we have the least amount 
>> of complexity IMHO. All KVM code paths just magically fit in with the TCG 
>> code. 
> 
> It's too magical, fitting a random version of a random userspace
> component.  Now you can't change this tcg code (and still keep the magic).
> 
> Some complexity is part of keeping software as separate components.

Why? If another user space wants to use this, they can

a) do the slow copy path
or
b) simply use our struct definitions

The whole copy thing really only makes sense when you have existing code in 
user space that you don't want to touch, but easily add on KVM to it. If KVM is 
part of your whole design, then integrating things makes a lot more sense.

> 
>> There are essentially no if(kvm_enabled)'s in our MMU walking code, because 
>> the tables are just there. Makes everything a lot easier (without dragging 
>> down performance).
> 
> We have the same issue with registers.  There we call
> cpu_synchronize_state() before every access.  No magic, but we get to
> reuse the code just the same.

Yes, and for those few bytes it's ok to do so - most of the time. On s390, even 
those get shared by now. And it makes sense to do so - if we synchronize it 
every time anyways, why not do so implicitly?


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-16 Thread Avi Kivity
On 02/15/2012 04:08 PM, Alexander Graf wrote:
> > 
> > Well, the scatter/gather registers I proposed will give you just one
> > register or all of them.
>
> One register is hardly any use. We either need all ways of a respective 
> address to do a full fledged lookup or all of them. 

I should have said, just one register, or all of them, or anything in
between.

> By sharing the same data structures between qemu and kvm, we actually managed 
> to reuse all of the tcg code for lookups, just like you do for x86.

Sharing the data structures is not needed.  Simply synchronize them before
lookup, like we do for ordinary registers.

>  On x86 you also have shared memory for page tables, it's just guest visible, 
> hence in guest memory. The concept is the same.

But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
on every exit.  And you're risking the same thing if your hardware gets
cleverer.

> > 
> >>> btw, why are you interested in virtual addresses in userspace at all?
> >> 
> >> We need them for gdb and monitor introspection.
> > 
> > Hardly fast paths that justify shared memory.  I should be much harder
> > on you.
>
> It was a tradeoff on speed and complexity. This way we have the least amount 
> of complexity IMHO. All KVM code paths just magically fit in with the TCG 
> code. 

It's too magical, fitting a random version of a random userspace
component.  Now you can't change this tcg code (and still keep the magic).

Some complexity is part of keeping software as separate components.

> There are essentially no if(kvm_enabled)'s in our MMU walking code, because 
> the tables are just there. Makes everything a lot easier (without dragging 
> down performance).

We have the same issue with registers.  There we call
cpu_synchronize_state() before every access.  No magic, but we get to
reuse the code just the same.

> > 
> >>> 
> >>> One thing that's different is that virtio offloads itself to a thread
> >>> very quickly, while IDE does a lot of work in vcpu thread context.
> >> 
> >> So it's all about latencies again, which could be reduced at least a fair 
> >> bit with the scheme I described above. But really, this needs to be 
> >> prototyped and benchmarked to actually give us data on how fast it would 
> >> get us.
> > 
> > Simply making qemu issue the request from a thread would be way better. 
> > Something like socketpair mmio, configured for not waiting for the
> > writes to be seen (posted writes) will also help by buffering writes in
> > the socket buffer.
>
> Yup, nice idea. That only works when all parts of a device are actually 
> implemented through the same socket though. 

Right, but that's not an issue.

> Otherwise you could run out of order. So if you have a PCI device with a PIO 
> and an MMIO BAR region, they would both have to be handled through the same 
> socket.

I'm more worried about interactions between hotplug and a device, and
between people issuing unrelated PCI reads to flush writes (not sure
what the hardware semantics are there).  It's easy to get this wrong.

> >>> 
> >>> COWs usually happen from guest userspace, while mmio is usually from the
> >>> guest kernel, so you can switch on that, maybe.
> >> 
> >> Hrm, nice idea. That might fall apart with user space drivers that we 
> >> might eventually have once vfio turns out to work well, but for the time 
> >> being it's a nice hack :).
> > 
> > Or nested virt...
>
> Nested virt on ppc with device assignment? And here I thought I was the crazy 
> one of the two of us :)

I don't mind being crazy on somebody else's arch.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Arnd Bergmann
On Tuesday 07 February 2012, Alexander Graf wrote:
> >> 
> >> Not sure we'll ever get there. For PPC, it will probably take another 1-2 
> >> years until we get the 32-bit targets stabilized. By then we will have new 
> >> 64-bit support though. And then the next gen will come out giving us even 
> >> more new constraints.
> > 
> > I would expect that newer archs have less constraints, not more.
> 
> Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
> today on 32-bit, but extends a
> bunch of registers to 64-bit. So what if we laid out stuff wrong before?
> 
> I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
> completely new architecture.
> 

I have not seen the source but I'm pretty sure that v7 and v8 look very
similar regarding virtualization support because they were designed together,
including the concept that on v8 you can run either a v7 compatible 32 bit
hypervisor with 32 bit guests or a 64 bit hypervisor with a combination of
32 and 64 bit guests. Also, the page table layout in v7-LPAE is identical
to the v8 one. The main difference is the instruction set, but then ARMv7
already has four of these (ARM, Thumb, Thumb2, ThumbEE).

Arnd



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Scott Wood
On 02/15/2012 05:57 AM, Alexander Graf wrote:
> 
> On 15.02.2012, at 12:18, Avi Kivity wrote:
> 
>> Well the real reason is we have an extra bit reported by page faults
>> that we can control.  Can't you set up a hashed pte that is configured
>> in a way that it will fault, no matter what type of access the guest
>> does, and see it in your page fault handler?
> 
> I might be able to synthesize a PTE that is !readable and might throw
> a permission exception instead of a miss exception. I might be able
> to synthesize something similar for booke. I don't however get any
> indication on why things failed.

On booke with ISA 2.06 hypervisor extensions, there's MAS8[VF] that will
trigger a DSI that gets sent to the hypervisor even if normal DSIs go
directly to the guest.  You'll still need to zero out the execute
permission bits.

For other booke, you could use one of the user bits in MAS3 (along with
zeroing out all the permission bits), which you could get to by doing a
tlbsx.
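
In rough code (the bit values are assumptions modeled on Linux's book3e 
MMU definitions and should be checked against the ISA before use):

#include <stdint.h>

#define MAS8_VF   0x40000000u  /* virtualization fault (assumed value) */
#define MAS3_PERM 0x0000003fu  /* UX/SX/UW/SW/UR/SR (assumed values) */
#define MAS3_UX   0x00000020u
#define MAS3_SX   0x00000010u
#define MAS3_U0   0x00000200u  /* software-usable bit (assumed value) */

/* ISA 2.06 HV: VF makes every guest access fault to the hypervisor;
 * the execute permissions still need to be cleared separately. */
static void mark_always_fault_hv(uint32_t *mas8, uint32_t *mas3)
{
    *mas8 |= MAS8_VF;
    *mas3 &= ~(MAS3_UX | MAS3_SX);
}

/* Other booke: strip all permissions so any access faults, and tag the
 * entry with a user bit we can recognize after a tlbsx in the handler. */
static void mark_always_fault(uint32_t *mas3)
{
    *mas3 = (*mas3 & ~MAS3_PERM) | MAS3_U0;
}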

-Scott



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 14:57, Avi Kivity wrote:

> On 02/15/2012 03:37 PM, Alexander Graf wrote:
>> On 15.02.2012, at 14:29, Avi Kivity wrote:
>> 
>>> On 02/15/2012 01:57 PM, Alexander Graf wrote:
> 
> Is an extra syscall for copying TLB entries to user space prohibitively
> expensive?
 
 The copying can be very expensive, yes. We want to have the possibility of 
 exposing a very large TLB to the guest, in the order of multiple kentries. 
 Every entry is a struct of 24 bytes.
>>> 
>>> You don't need to copy the entire TLB, just the way that maps the
>>> address you're interested in.
>> 
>> Yeah, unless we do migration in which case we need to introduce another 
>> special case to fetch the whole thing :(.
> 
> Well, the scatter/gather registers I proposed will give you just one
> register or all of them.

One register is hardly any use. We either need all ways of a respective address 
to do a full fledged lookup or all of them. By sharing the same data structures 
between qemu and kvm, we actually managed to reuse all of the tcg code for 
lookups, just like you do for x86. On x86 you also have shared memory for page 
tables, it's just guest visible, hence in guest memory. The concept is the same.

> 
>>> btw, why are you interested in virtual addresses in userspace at all?
>> 
>> We need them for gdb and monitor introspection.
> 
> Hardly fast paths that justify shared memory.  I should be much harder
> on you.

It was a tradeoff on speed and complexity. This way we have the least amount of 
complexity IMHO. All KVM code paths just magically fit in with the TCG code. 
There are essentially no if(kvm_enabled)'s in our MMU walking code, because the 
tables are just there. Makes everything a lot easier (without dragging down 
performance).

> 
 
 Right. It's an optional performance accelerator. If anything doesn't 
 align, don't use it. But if you happen to have a system where everything's 
 cool, you're faster. Sounds like a good deal to me ;).
>>> 
>>> Depends on how much the alignment relies on guest knowledge.  I guess
>>> with a simple device like HPET, it's simple, but with a complex device,
>>> different guests (or different versions of the same guest) could drive
>>> it very differently.
>> 
>> Right. But accelerating simple devices > not accelerating any devices. No? :)
> 
> Yes.  But introducing bugs and vulns < not introducing them.  It's a
> tradeoff.  Even an unexploited vulnerability can be a lot more pain,
> just because you need to update your entire cluster, than a simple
> device that is accelerated for a guest which has maybe 3% utilization. 
> Performance is just one parameter we optimize for.  It's easy to overdo
> it because it's an easily measurable and sexy parameter, but it's a mistake.

Yeah, I agree. That's why I was trying to get AHCI to be the default storage 
adapter for a while, because I think the same. However, Anthony believes that 
XP/w2k3 is still a major chunk of the guests running on QEMU, so we can't do 
that :(.

I'm mostly trying to think of ways to accelerate the obvious low hanging 
fruits, without overengineering any interfaces.

> 
>>> 
>>> One thing that's different is that virtio offloads itself to a thread
>>> very quickly, while IDE does a lot of work in vcpu thread context.
>> 
>> So it's all about latencies again, which could be reduced at least a fair 
>> bit with the scheme I described above. But really, this needs to be 
>> prototyped and benchmarked to actually give us data on how fast it would get 
>> us.
> 
> Simply making qemu issue the request from a thread would be way better. 
> Something like socketpair mmio, configured for not waiting for the
> writes to be seen (posted writes) will also help by buffering writes in
> the socket buffer.

Yup, nice idea. That only works when all parts of a device are actually 
implemented through the same socket though. Otherwise you could run out of 
order. So if you have a PCI device with a PIO and an MMIO BAR region, they 
would both have to be handled through the same socket.

> 
>>> 
>>> The all-knowing management tool can provide a virtio driver disk, or
>>> even slip-stream the driver into the installation CD.
>> 
>> One management tool might do that, another one might not. We can't assume 
>> that all management tools are all-knowing. Some times you also want to run 
>> guest OSs that the management tool doesn't know (yet).
> 
> That is true, but we have to leave some work for the management guys.

The easier the management stack is, the happier I am ;).

> 
>> 
 So for MMIO reads, I can assume that this is an MMIO because I would never 
 write a non-readable entry. For writes, I'm overloading the bit that also 
 means "guest entry is not readable" so there I'd have to walk the guest 
 PTEs/TLBs and check if I find a read-only entry. Right now I can just 
 forward write faults to the guest. Since COW is probably a hotter path for 
 the guest than MMIO, this might end up being ineffective.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/15/2012 03:37 PM, Alexander Graf wrote:
> On 15.02.2012, at 14:29, Avi Kivity wrote:
>
> > On 02/15/2012 01:57 PM, Alexander Graf wrote:
> >>> 
> >>> Is an extra syscall for copying TLB entries to user space prohibitively
> >>> expensive?
> >> 
> >> The copying can be very expensive, yes. We want to have the possibility of 
> >> exposing a very large TLB to the guest, in the order of multiple kentries. 
> >> Every entry is a struct of 24 bytes.
> > 
> > You don't need to copy the entire TLB, just the way that maps the
> > address you're interested in.
>
> Yeah, unless we do migration in which case we need to introduce another 
> special case to fetch the whole thing :(.

Well, the scatter/gather registers I proposed will give you just one
register or all of them.

> > btw, why are you interested in virtual addresses in userspace at all?
>
> We need them for gdb and monitor introspection.

Hardly fast paths that justify shared memory.  I should be much harder
on you.

> >> 
> >> Right. It's an optional performance accelerator. If anything doesn't 
> >> align, don't use it. But if you happen to have a system where everything's 
> >> cool, you're faster. Sounds like a good deal to me ;).
> > 
> > Depends on how much the alignment relies on guest knowledge.  I guess
> > with a simple device like HPET, it's simple, but with a complex device,
> > different guests (or different versions of the same guest) could drive
> > it very differently.
>
> Right. But accelerating simple devices > not accelerating any devices. No? :)

Yes.  But introducing bugs and vulns < not introducing them.  It's a
tradeoff.  Even an unexploited vulnerability can be a lot more pain,
just because you need to update your entire cluster, than a simple
device that is accelerated for a guest which has maybe 3% utilization. 
Performance is just one parameter we optimize for.  It's easy to overdo
it because it's an easily measurable and sexy parameter, but it's a mistake.

> > 
> > One thing that's different is that virtio offloads itself to a thread
> > very quickly, while IDE does a lot of work in vcpu thread context.
>
> So it's all about latencies again, which could be reduced at least a fair bit 
> with the scheme I described above. But really, this needs to be prototyped 
> and benchmarked to actually give us data on how fast it would get us.

Simply making qemu issue the request from a thread would be way better. 
Something like socketpair mmio, configured for not waiting for the
writes to be seen (posted writes) will also help by buffering writes in
the socket buffer.
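
A toy sketch of that shape (illustrative only; not an existing KVM or QEMU 
interface):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

struct mmio_req { uint64_t addr; uint64_t data; uint32_t len; };

static int sock[2];  /* [0]: vcpu side, [1]: device side */

/* vcpu thread: a posted write just buffers in the socket and returns,
 * so the vcpu never waits for the device to observe it. */
static void mmio_post_write(uint64_t addr, uint64_t data, uint32_t len)
{
    struct mmio_req r = { addr, data, len };
    (void)write(sock[0], &r, sizeof(r));   /* error handling elided */
}

/* device thread: requests arrive in issue order; routing the PIO and
 * MMIO BARs of one device through the same socket keeps them ordered. */
static void *device_thread(void *arg)
{
    struct mmio_req r;
    (void)arg;
    while (read(sock[1], &r, sizeof(r)) == (ssize_t)sizeof(r))
        printf("write addr=%#llx data=%#llx len=%u\n",
               (unsigned long long)r.addr, (unsigned long long)r.data, r.len);
    return NULL;
}

int main(void)
{
    pthread_t t;
    socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sock);
    pthread_create(&t, NULL, device_thread, NULL);
    mmio_post_write(0xfed00000, 1, 4);     /* example posted write */
    close(sock[0]);                        /* EOF ends the device loop */
    pthread_join(t, NULL);
    return 0;
}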

> > 
> > The all-knowing management tool can provide a virtio driver disk, or
> > even slip-stream the driver into the installation CD.
>
> One management tool might do that, another one might not. We can't assume 
> that all management tools are all-knowing. Some times you also want to run 
> guest OSs that the management tool doesn't know (yet).

That is true, but we have to leave some work for the management guys.

>  
> >> So for MMIO reads, I can assume that this is an MMIO because I would never 
> >> write a non-readable entry. For writes, I'm overloading the bit that also 
> >> means "guest entry is not readable" so there I'd have to walk the guest 
> >> PTEs/TLBs and check if I find a read-only entry. Right now I can just 
> >> forward write faults to the guest. Since COW is probably a hotter path for 
> >> the guest than MMIO, this might end up being ineffective.
> > 
> > COWs usually happen from guest userspace, while mmio is usually from the
> > guest kernel, so you can switch on that, maybe.
>
> Hrm, nice idea. That might fall apart with user space drivers that we might 
> eventually have once vfio turns out to work well, but for the time being it's 
> a nice hack :).

Or nested virt...



-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 14:29, Avi Kivity wrote:

> On 02/15/2012 01:57 PM, Alexander Graf wrote:
>>> 
>>> Is an extra syscall for copying TLB entries to user space prohibitively
>>> expensive?
>> 
>> The copying can be very expensive, yes. We want to have the possibility of 
>> exposing a very large TLB to the guest, in the order of multiple kentries. 
>> Every entry is a struct of 24 bytes.
> 
> You don't need to copy the entire TLB, just the way that maps the
> address you're interested in.

Yeah, unless we do migration in which case we need to introduce another special 
case to fetch the whole thing :(.

> btw, why are you interested in virtual addresses in userspace at all?

We need them for gdb and monitor introspection.

> 
> 
> It works for the really simple cases, yes, but if the guest wants to set 
> up one-shot timers, it fails.  
 
 I don't understand. Why would anything fail here? 
>>> 
>>> It fails to provide a benefit, I didn't mean it causes guest failures.
>>> 
>>> You also have to make sure the kernel part and the user part use exactly
>>> the same time bases.
>> 
>> Right. It's an optional performance accelerator. If anything doesn't align, 
>> don't use it. But if you happen to have a system where everything's cool, 
>> you're faster. Sounds like a good deal to me ;).
> 
> Depends on how much the alignment relies on guest knowledge.  I guess
> with a simple device like HPET, it's simple, but with a complex device,
> different guests (or different versions of the same guest) could drive
> it very differently.

Right. But accelerating simple devices > not accelerating any devices. No? :)

> 
>> 
>> Because not every guest supports them. Virtio-blk needs 3rd party 
>> drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
>>> 
>>> 3rd party drivers are a way of life for Windows users; and the
>>> incremental benefits of IDE acceleration are still far behind virtio.
>> 
>> The typical way of life for Windows users is all-included drivers. Which is 
>> the case for AHCI, where we're getting awesome performance for Vista and 
>> above guests. The IDE thing was just an idea for legacy ones.
>> 
>> It'd be great to simply try and see how fast we could get by handling a few 
>> special registers in kernel space vs heavyweight exiting to QEMU. If it's 
>> only 10%, I wouldn't even bother with creating an interface for it. I'd bet 
>> the benefits are a lot bigger though.
>> 
>> And the main point was that specific partial device emulation buys us more 
>> than pseudo-generic accelerators like coalesced mmio, which are also only 
>> used by 1 or 2 devices.
> 
> Ok.
> 
>>> 
 I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
>>> 
>>> Cirrus or vesa should be okay for them, I don't see what we could do for
>>> them in the kernel, or why.
>> 
>> That's my point. You need fast emulation of standard devices to get a good 
>> baseline. Do PV on top, but keep the baseline as fast as is reasonable.
>> 
>>> 
 Same for virtio.
>> 
>> Please don't do the Xen mistake again of claiming that all we care about 
>> is Linux as a guest.
> 
> Rest easy, there's no chance of that.  But if a guest is important 
> enough, virtio drivers will get written.  IDE has no chance in hell of 
> approaching virtio-blk performance, no matter how much effort we put into 
> it.
 
 Ever used VMware? They basically get virtio-blk performance out of 
 ordinary IDE for linear workloads.
>>> 
>>> For linear loads, so should we, perhaps with greater cpu utliization.
>>> 
>>> If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
>>> means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
>>> shouldn't matter.
>> 
>> *shrug* last time I checked we were a lot slower. But maybe there's more 
>> stuff making things slow than the exit path ;).
> 
> One thing that's different is that virtio offloads itself to a thread
> very quickly, while IDE does a lot of work in vcpu thread context.

So it's all about latencies again, which could be reduced at least a fair bit 
with the scheme I described above. But really, this needs to be prototyped and 
benchmarked to actually give us data on how fast it would get us.

> 
>>> 
> 
>> KVM's strength has always been its close resemblance to hardware.
> 
> This will remain.  But we can't optimize everything.
 
 That's my point. Let's optimize the hot paths and be good. As long as we 
 default to IDE for disk, we should have that be fast, no?
>>> 
>>> We should make sure that we don't default to IDE.  Qemu has no knowledge
>>> of the guest, so it can't default to virtio, but higher level tools can
>>> and should.
>> 
>> You can only default to virtio on recent Linux. Windows, BSD, etc don't 
>> include drivers, so you can't assume it working. You can default to AHCI for 
>> basically any recent guest, but that still won't work for XP and the likes :(.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 05:23 PM, Anthony Liguori wrote:
> On 02/07/2012 07:40 AM, Alexander Graf wrote:
>>
>> Why? For the HPET timer register for example, we could have a simple
>> MMIO hook that says
>>
>>on_read:
>>  return read_current_time() - shared_page.offset;
>>on_write:
>>  handle_in_user_space();
>>
>> For IDE, it would be as simple as
>>
>>register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
>>for (i = 1; i<  7; i++) {
>>  register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>  register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>}
>
> You can't easily serialize updates to that address with the kernel
> since two threads are likely going to be accessing it at the same
> time.  That either means an expensive sync operation or a reliance on
> atomic instructions.
>
> But not all architectures offer non-word sized atomic instructions so
> it gets fairly nasty in practice.
>

I doubt that any guest accesses IDE registers from two threads in
parallel.  The guest will have some lock, so we could have a lock as
well and be assured that there will never be contention.
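
Spelled out slightly, with the lock added (register_pio_hook_ptr_r/w are 
the interface proposed in the quoted mail, not an existing API):

#include <pthread.h>
#include <stdint.h>

#define PIO_IDE   0x1f0
#define SIZE_BYTE 1

struct ide_state {
    pthread_mutex_t lock;  /* taken by both sides; the guest already
                              serializes, so it should never be contended */
    uint8_t cmd[8];        /* task-file registers shared with the kernel */
};

/* Hypothetical registration calls: the kernel would read/write *ptr
 * directly on guest PIO access, holding the device lock. */
void register_pio_hook_ptr_r(uint16_t port, int size, uint8_t *ptr);
void register_pio_hook_ptr_w(uint16_t port, int size, uint8_t *ptr);

static void ide_register_fast_paths(struct ide_state *s)
{
    register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
    for (int i = 1; i < 7; i++) {
        register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
        register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
    }
}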

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/12/2012 09:10 AM, Takuya Yoshikawa wrote:
> Avi Kivity  wrote:
>
> > > >  Slot searching is quite fast since there's a small number of slots, 
> > > > and we sort the larger ones to be in the front, so positive lookups are 
> > > > fast.  We cache negative lookups in the shadow page tables (an spte can 
> > > > be either "not mapped", "mapped to RAM", or "not mapped and known to be 
> > > > mmio") so we rarely need to walk the entire list.
> > >
> > > Well, we don't always have shadow page tables. Having hints for unmapped 
> > > guest memory like this is pretty tricky.
> > > We're currently running into issues with device assignment though, where 
> > > we get a lot of small slots mapped to real hardware. I'm sure that will 
> > > hit us on x86 sooner or later too.
> > 
> > For x86 that's not a problem, since once you map a page, it stays mapped 
> > (on modern hardware).
> > 
>
> I was once thinking about how to search a slot reasonably fast for every case,
> even when we do not have mmio-spte cache.
>
> One possible way I thought up was to sort slots according to their base_gfn.
> Then the problem would become:  "find the first slot whose base_gfn + npages
> is greater than this gfn."
>
> Since we can do binary search, the search cost is O(log(# of slots)).
>
> But I guess that most of the time was wasted on reading many memslots just to
> know their base_gfn and npages.
>
> So the most practically effective thing is to make a separate array which 
> holds
> just their base_gfn.  This will make the task a simple, and cache friendly,
> search on an integer array:  probably faster than using *-tree data structure.

This assumes that there is equal probability for matching any slot.  But
that's not true, even if you have hundreds of slots, the probability is
much greater for the two main memory slots, or if you're playing with
the framebuffer, the framebuffer slot.  Everything else is loaded
quickly into shadow and forgotten.

> If needed, we should make cmp_memslot() architecture specific in the end?

We could, but why is it needed?  This logic holds for all architectures.
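
For reference, the lookup Takuya describes boils down to this (a sketch; 
the slot layout is invented for illustration):

#include <stdint.h>

typedef uint64_t gfn_t;

struct memslot { gfn_t base_gfn; uint64_t npages; /* ... */ };

/*
 * Slots sorted by base_gfn, plus a separate base_gfn array so the binary
 * search stays cache friendly.  O(log nslots).
 */
static struct memslot *find_slot(struct memslot *slots, const gfn_t *bases,
                                 int nslots, gfn_t gfn)
{
    int lo = 0, hi = nslots;  /* find the first slot with base_gfn > gfn */

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (bases[mid] <= gfn)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0;  /* gfn is below the lowest slot */

    /* Only now touch the full slot to check the upper bound. */
    struct memslot *s = &slots[lo - 1];
    return gfn < s->base_gfn + s->npages ? s : 0;
}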

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/15/2012 01:57 PM, Alexander Graf wrote:
> > 
> > Is an extra syscall for copying TLB entries to user space prohibitively
> > expensive?
>
> The copying can be very expensive, yes. We want to have the possibility of 
> exposing a very large TLB to the guest, in the order of multiple kentries. 
> Every entry is a struct of 24 bytes.

You don't need to copy the entire TLB, just the way that maps the
address you're interested in.

btw, why are you interested in virtual addresses in userspace at all?

> >>> 
> >>> It works for the really simple cases, yes, but if the guest wants to set 
> >>> up one-shot timers, it fails.  
> >> 
> >> I don't understand. Why would anything fail here? 
> > 
> > It fails to provide a benefit, I didn't mean it causes guest failures.
> > 
> > You also have to make sure the kernel part and the user part use exactly
> > the same time bases.
>
> Right. It's an optional performance accelerator. If anything doesn't align, 
> don't use it. But if you happen to have a system where everything's cool, 
> you're faster. Sounds like a good deal to me ;).

Depends on how much the alignment relies on guest knowledge.  I guess
with a simple device like HPET, it's simple, but with a complex device,
different guests (or different versions of the same guest) could drive
it very differently.

>  
>  Because not every guest supports them. Virtio-blk needs 3rd party 
>  drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
> > 
> > 3rd party drivers are a way of life for Windows users; and the
> > incremental benefits of IDE acceleration are still far behind virtio.
>
> The typical way of life for Windows users is all-included drivers. Which is 
> the case for AHCI, where we're getting awesome performance for Vista and 
> above guests. The IDE thing was just an idea for legacy ones.
>
> It'd be great to simply try and see how fast we could get by handling a few 
> special registers in kernel space vs heavyweight exiting to QEMU. If it's 
> only 10%, I wouldn't even bother with creating an interface for it. I'd bet 
> the benefits are a lot bigger though.
>
> And the main point was that specific partial device emulation buys us more 
> than pseudo-generic accelerators like coalesced mmio, which are also only 
> used by 1 or 2 devices.

Ok.

> > 
> >> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
> > 
> > Cirrus or vesa should be okay for them, I don't see what we could do for
> > them in the kernel, or why.
>
> That's my point. You need fast emulation of standard devices to get a good 
> baseline. Do PV on top, but keep the baseline as fast as is reasonable.
>
> > 
> >> Same for virtio.
>  
>  Please don't do the Xen mistake again of claiming that all we care about 
>  is Linux as a guest.
> >>> 
> >>> Rest easy, there's no chance of that.  But if a guest is important 
> >>> enough, virtio drivers will get written.  IDE has no chance in hell of 
> >>> approaching virtio-blk performance, no matter how much effort we put into 
> >>> it.
> >> 
> >> Ever used VMware? They basically get virtio-blk performance out of 
> >> ordinary IDE for linear workloads.
> > 
> > For linear loads, so should we, perhaps with greater cpu utliization.
> > 
> > If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
> > means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
> > shouldn't matter.
>
> *shrug* last time I checked we were a lot slower. But maybe there's more 
> stuff making things slow than the exit path ;).

One thing that's different is that virtio offloads itself to a thread
very quickly, while IDE does a lot of work in vcpu thread context.

> > 
> >>> 
>  KVM's strength has always been its close resemblance to hardware.
> >>> 
> >>> This will remain.  But we can't optimize everything.
> >> 
> >> That's my point. Let's optimize the hot paths and be good. As long as we 
> >> default to IDE for disk, we should have that be fast, no?
> > 
> > We should make sure that we don't default to IDE.  Qemu has no knowledge
> > of the guest, so it can't default to virtio, but higher level tools can
> > and should.
>
> You can only default to virtio on recent Linux. Windows, BSD, etc don't 
> include drivers, so you can't assume it working. You can default to AHCI for 
> basically any recent guest, but that still won't work for XP and the likes :(.

The all-knowing management tool can provide a virtio driver disk, or
even slip-stream the driver into the installation CD.


>  
> >> Ah, because you're on NPT and you can have MMIO hints in the nested page 
> >> table. Nifty. Yeah, we don't have that luxury :).
> > 
> > Well the real reason is we have an extra bit reported by page faults
> > that we can control.  Can't you set up a hashed pte that is configured
> > in a way that it will fault, no matter what type of access the guest
> > does, and see it in your page fault handler?
>
> I might be able to synthesize a PTE that is !readable and might throw
> a permission exception instead of a miss exception. I might be able
> to synthesize something similar for booke. I don't however get any
> indication on why things failed.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 12:18, Avi Kivity wrote:

> On 02/07/2012 04:39 PM, Alexander Graf wrote:
>>> 
>>> Syscalls are orthogonal to that - they're to avoid the fget_light() and to 
>>> tighten the vcpu/thread and vm/process relationship.
>> 
>> How about keeping the ioctl interface but moving vcpu_run to a syscall then?
> 
> I dislike half-and-half interfaces even more.  And it's not like the
> fget_light() is really painful - it's just that I see it occasionally in
> perf top so it annoys me.
> 
>> That should really be the only thing that belongs into the fast path, right? 
>> Every time we do a register sync in user space, we do something wrong. 
>> Instead, user space should either
>> 
>>  a) have wrappers around register accesses, so it can directly ask for 
>> specific registers that it needs
>> or
>>  b) keep everything that would be requested by the register synchronization 
>> in shared memory
> 
> Always-synced shared memory is a liability, since newer hardware might
> introduce on-chip caches for that state, making synchronization
> expensive.  Or we may choose to keep some of the registers loaded, if we
> have a way to trap on their use from userspace - for example we can
> return to userspace with the guest fpu loaded, and trap if userspace
> tries to use it.
> 
> Is an extra syscall for copying TLB entries to user space prohibitively
> expensive?

The copying can be very expensive, yes. We want to have the possibility of 
exposing a very large TLB to the guest, in the order of multiple kentries. 
Every entry is a struct of 24 bytes.
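
For scale: if "multiple kentries" means, say, a 16k-entry TLB (an assumed 
size), a full copy moves 16384 * 24 bytes = 384 KiB per synchronization.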

> 
>>> 
 , keep the rest in user space.
> 
> 
> When a device is fully in the kernel, we have a good specification of the 
> ABI: it just implements the spec, and the ABI provides the interface from 
> the device to the rest of the world.  Partially accelerated devices means 
> a much greater effort in specifying exactly what it does.  It's also 
> vulnerable to changes in how the guest uses the device.
 
 Why? For the HPET timer register for example, we could have a simple MMIO 
 hook that says
 
  on_read:
    return read_current_time() - shared_page.offset;
  on_write:
    handle_in_user_space();
>>> 
>>> It works for the really simple cases, yes, but if the guest wants to set up 
>>> one-shot timers, it fails.  
>> 
>> I don't understand. Why would anything fail here? 
> 
> It fails to provide a benefit, I didn't mean it causes guest failures.
> 
> You also have to make sure the kernel part and the user part use exactly
> the same time bases.

Right. It's an optional performance accelerator. If anything doesn't align, 
don't use it. But if you happen to have a system where everything's cool, 
you're faster. Sounds like a good deal to me ;).

> 
>> Once the logic that's implemented by the kernel accelerator doesn't fit 
>> anymore, unregister it.
> 
> Yeah.
> 
>> 
>>> Also look at the PIT which latches on read.
>>> 
 
 For IDE, it would be as simple as
 
  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
    register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
    register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }
 
 and we should have reduced overhead of IDE by quite a bit already. All the 
 other 2k LOC in hw/ide/core.c don't matter for us really.
>>> 
>>> 
>>> Just use virtio.
>> 
>> Just use xenbus. Seriously, this is not an answer.
> 
> Why not?  We invested effort in making it as fast as possible, and in
> writing the drivers.  IDE will never, ever, get anything close to virtio
> performance, even if we put all of it in the kernel.
> 
> However, after these examples, I'm more open to partial acceleration
> now.  I won't ever like it though.
> 
> 
>>   - VGA
>>   - IDE
> 
> Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
> virtio-scsi).
 
 Because not every guest supports them. Virtio-blk needs 3rd party drivers. 
 AHCI needs 3rd party drivers on w2k3 and wxp. 
> 
> 3rd party drivers are a way of life for Windows users; and the
> incremental benefits of IDE acceleration are still far behind virtio.

The typical way of life for Windows users is all-included drivers, which is 
the case for AHCI, where we're getting awesome performance for Vista and above 
guests. The IDE thing was just an idea for legacy ones.

It'd be great to simply try and see how fast we could get by handling a few 
special registers in kernel space vs heavyweight exiting to QEMU. If it's only 
10%, I wouldn't even bother with creating an interface for it. I'd bet the 
benefits are a lot bigger though.

And the main point was that specific partial device emulation buys us more than 
pseudo-generic accelerators like coalesced mmio, which are also only used by 1 
or 2 devices.

> 
>> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
> 
Cirrus or vesa should be okay for them, I don't see what we could do for
them in the kernel, or why.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 04:39 PM, Alexander Graf wrote:
> > 
> > Syscalls are orthogonal to that - they're to avoid the fget_light() and to 
> > tighten the vcpu/thread and vm/process relationship.
>
> How about keeping the ioctl interface but moving vcpu_run to a syscall then?

I dislike half-and-half interfaces even more.  And it's not like the
fget_light() is really painful - it's just that I see it occasionally in
perf top so it annoys me.

>  That should really be the only thing that belongs into the fast path, right? 
> Every time we do a register sync in user space, we do something wrong. 
> Instead, user space should either
>
>   a) have wrappers around register accesses, so it can directly ask for 
> specific registers that it needs
> or
>   b) keep everything that would be requested by the register synchronization 
> in shared memory

Always-synced shared memory is a liability, since newer hardware might
introduce on-chip caches for that state, making synchronization
expensive.  Or we may choose to keep some of the registers loaded, if we
have a way to trap on their use from userspace - for example we can
return to userspace with the guest fpu loaded, and trap if userspace
tries to use it.

Is an extra syscall for copying TLB entries to user space prohibitively
expensive?

> > 
> >> , keep the rest in user space.
> >> >
> >> >
> >> >  When a device is fully in the kernel, we have a good specification of 
> >> > the ABI: it just implements the spec, and the ABI provides the interface 
> >> > from the device to the rest of the world.  Partially accelerated devices 
> >> > means a much greater effort in specifying exactly what it does.  It's 
> >> > also vulnerable to changes in how the guest uses the device.
> >> 
> >> Why? For the HPET timer register for example, we could have a simple MMIO 
> >> hook that says
> >> 
> >>   on_read:
> >>     return read_current_time() - shared_page.offset;
> >>   on_write:
> >>     handle_in_user_space();
> > 
> > It works for the really simple cases, yes, but if the guest wants to set up 
> > one-shot timers, it fails.  
>
> I don't understand. Why would anything fail here? 

It fails to provide a benefit, I didn't mean it causes guest failures.

You also have to make sure the kernel part and the user part use exactly
the same time bases.

> Once the logic that's implemented by the kernel accelerator doesn't fit 
> anymore, unregister it.

Yeah.

>
> > Also look at the PIT which latches on read.
> > 
> >> 
> >> For IDE, it would be as simple as
> >> 
> >>   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
> >>   for (i = 1; i < 7; i++) {
> >>     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> >>     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> >>   }
> >> 
> >> and we should have reduced overhead of IDE by quite a bit already. All the 
> >> other 2k LOC in hw/ide/core.c don't matter for us really.
> > 
> > 
> > Just use virtio.
>
> Just use xenbus. Seriously, this is not an answer.

Why not?  We invested effort in making it as fast as possible, and in
writing the drivers.  IDE will never, ever, get anything close to virtio
performance, even if we put all of it in the kernel.

However, after these examples, I'm more open to partial acceleration
now.  I won't ever like it though.

> >> >
> >> >>- VGA
> >> >>- IDE
> >> >
> >> >  Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
> >> > virtio-scsi).
> >> 
> >> Because not every guest supports them. Virtio-blk needs 3rd party drivers. 
> >> AHCI needs 3rd party drivers on w2k3 and wxp. 

3rd party drivers are a way of life for Windows users; and the
incremental benefits of IDE acceleration are still far behind virtio.

> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 

Cirrus or vesa should be okay for them, I don't see what we could do for
them in the kernel, or why.

> Same for virtio.
> >> 
> >> Please don't make the Xen mistake again of claiming that all we care about 
> >> is Linux as a guest.
> > 
> > Rest easy, there's no chance of that.  But if a guest is important enough, 
> > virtio drivers will get written.  IDE has no chance in hell of approaching 
> > virtio-blk performance, no matter how much effort we put into it.
>
> Ever used VMware? They basically get virtio-blk performance out of ordinary 
> IDE for linear workloads.

For linear loads, so should we, perhaps with greater cpu utilization.

If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
shouldn't matter.
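
Spelling that estimate out:

  64 KiB per DMA / (128 MiB/s)   = ~0.5 ms per transaction
  30 us of exits / 0.5 ms        = ~6% worst-case overhead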

> > 
> >> KVM's strength has always been its close resemblance to hardware.
> > 
> > This will remain.  But we can't optimize everything.
>
> That's my point. Let's optimize the hot paths and be good. As long as we 
> default to IDE for disk, we should have that be fast, no?

We should make sure that we don't default to IDE.  Qemu has no knowledge
of the guest, so it can't default to virtio, but higher level tools can and
should.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-11 Thread Takuya Yoshikawa
Avi Kivity  wrote:

> > >  Slot searching is quite fast since there's a small number of slots, and 
> > > we sort the larger ones to be in the front, so positive lookups are fast. 
> > >  We cache negative lookups in the shadow page tables (an spte can be 
> > > either "not mapped", "mapped to RAM", or "not mapped and known to be 
> > > mmio") so we rarely need to walk the entire list.
> >
> > Well, we don't always have shadow page tables. Having hints for unmapped 
> > guest memory like this is pretty tricky.
> > We're currently running into issues with device assignment though, where we 
> > get a lot of small slots mapped to real hardware. I'm sure that will hit us 
> > on x86 sooner or later too.
> 
> For x86 that's not a problem, since once you map a page, it stays mapped 
> (on modern hardware).
> 

I was once thinking about how to search a slot reasonably fast for every case,
even when we do not have mmio-spte cache.

One possible way I thought up was to sort slots according to their base_gfn.
Then the problem would become:  "find the first slot whose base_gfn + npages
is greater than this gfn."

Since we can do binary search, the search cost is O(log(# of slots)).

But I guess that most of the time was wasted on reading many memslots just to
know their base_gfn and npages.

So the most practically effective thing is to make a separate array which holds
just their base_gfn.  This turns the task into a simple, cache-friendly
search on an integer array:  probably faster than using a *-tree data structure.
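
A minimal sketch of that search, assuming non-overlapping slots sorted by
base_gfn (the names are illustrative, not the real kvm structures):

  #include <stddef.h>
  #include <stdint.h>

  struct slot {
          uint64_t base_gfn;
          uint64_t npages;
          /* ... rest of the memslot ... */
  };

  /* find the slot containing gfn, or NULL; O(log nslots) */
  static struct slot *find_slot(struct slot *slots,
                                const uint64_t *base_gfns,
                                int nslots, uint64_t gfn)
  {
          int lo = 0, hi = nslots;

          /* the binary search touches only the compact base_gfn array */
          while (lo < hi) {
                  int mid = lo + (hi - lo) / 2;
                  if (base_gfns[mid] <= gfn)
                          lo = mid + 1;
                  else
                          hi = mid;
          }
          /* slots[lo - 1] is the last slot with base_gfn <= gfn, if any */
          if (lo > 0 && gfn - slots[lo - 1].base_gfn < slots[lo - 1].npages)
                  return &slots[lo - 1];
          return NULL;
  }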

If needed, we should make cmp_memslot() architecture specific in the end?

Takuya


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-08 Thread Alan Cox
> >    register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
> >    for (i = 1; i < 7; i++) {
> >      register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> >      register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
> >    }
> 
> You can't easily serialize updates to that address with the kernel since two 
> threads are likely going to be accessing it at the same time.  That either 
> means 
> an expensive sync operation or a reliance on atomic instructions.

Who cares

If your API is right this isn't a problem (and for IDE, if you guess that it
won't happen, you will win 99.999% of the time).

In fact for IDE you can do even better in many cases, because you'll get a
single rep outsw you can trap and shortcut.
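
A hedged sketch of that shortcut, with an invented vcpu struct and buffer
(the real thing would live in the kernel's PIO exit path):

  #include <stdint.h>
  #include <string.h>

  #define IDE_DATA_PORT 0x1f0
  #define IDE_BUF_SIZE  65536

  struct vcpu {
          uint8_t ide_buf[IDE_BUF_SIZE];
          size_t  ide_buf_len;
  };

  /* called once for the whole rep outsw instead of once per word */
  static int shortcut_rep_outsw(struct vcpu *v, uint16_t port,
                                const void *guest_src, size_t words)
  {
          size_t bytes = words * 2;       /* outsw moves 16-bit words */

          if (port != IDE_DATA_PORT ||
              bytes > IDE_BUF_SIZE - v->ide_buf_len)
                  return -1;              /* punt to the slow path */

          memcpy(v->ide_buf + v->ide_buf_len, guest_src, bytes);
          v->ide_buf_len += bytes;
          return 0;                       /* handled with a single trap */
  }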

> But not all architectures offer non-word sized atomic instructions so it gets 
> fairly nasty in practice.

That's their problem. We don't screw up the fast paths because some
hardware vendor screwed up that bit of their implementation. That's
*their* problem, not everyone else's.

So on x86 IDE should be about 10 outb traps that can be predicted, a rep
outsw which can be shortcut and a completion set of inb/inw ops that can
be predicted.

You should hit userspace about once per IDE operation. Fix the hot paths
with good design and the noise doesn't matter.

Alan


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Alexander Graf

On 07.02.2012, at 16:23, Anthony Liguori wrote:

> On 02/07/2012 07:40 AM, Alexander Graf wrote:
>> 
>> Why? For the HPET timer register for example, we could have a simple MMIO 
>> hook that says
>> 
>>   on_read:
>> return read_current_time() - shared_page.offset;
>>   on_write:
>> handle_in_user_space();
>> 
>> For IDE, it would be as simple as
>> 
>>   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
>>   for (i = 1; i < 7; i++) {
>>     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>>     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>>   }
> 
> You can't easily serialize updates to that address with the kernel since two 
> threads are likely going to be accessing it at the same time.  That either 
> means an expensive sync operation or a reliance on atomic instructions.

Yes. Essentially we want a mutex for them.

> But not all architectures offer non-word sized atomic instructions so it gets 
> fairly nasty in practice.

Well, we can always require fields to be word sized.


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Anthony Liguori

On 02/07/2012 07:40 AM, Alexander Graf wrote:


Why? For the HPET timer register for example, we could have a simple MMIO hook 
that says

   on_read:
     return read_current_time() - shared_page.offset;
   on_write:
     handle_in_user_space();

For IDE, it would be as simple as

   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
   for (i = 1; i < 7; i++) {
     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
   }


You can't easily serialize updates to that address with the kernel since two 
threads are likely going to be accessing it at the same time.  That either means 
an expensive sync operation or a reliance on atomic instructions.


But not all architectures offer non-word sized atomic instructions so it gets 
fairly nasty in practice.


Regards,

Anthony Liguori


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Alexander Graf

On 07.02.2012, at 15:21, Avi Kivity wrote:

> On 02/07/2012 03:40 PM, Alexander Graf wrote:
>> >>
>> >>  Not sure we'll ever get there. For PPC, it will probably take another 
>> >> 1-2 years until we get the 32-bit targets stabilized. By then we will 
>> >> have new 64-bit support though. And then the next gen will come out 
>> >> giving us even more new constraints.
>> >
>> >  I would expect that newer archs have less constraints, not more.
>> 
>> Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
>> today on 32-bit, but extends a bunch of registers to 64-bit. So what if we 
>> laid out stuff wrong before?
> 
> That's not what I mean by constraints.  It's easy to accommodate different 
> register layouts.  Constraints (for me) are like requiring gang scheduling.  
> But you introduced the subject - what did you mean?

New extensions to architectures give us new challenges. Newer booke for example 
implements page tables in parallel to soft TLBs. We need to model that. My 
point was more that I can't predict the future :).

> Let's take for example the software-controlled TLB on some ppc.  It's 
> tempting to call them all "registers" and use the register interface to 
> access them.  Is it workable?

Workable, yes. Fast? No. Right now we share them between kernel and user space 
to have very fast access to them. That way we don't have to sync anything at 
all.

> Or let's look at SMM on x86.  To implement it memory slots need an additional 
> attribute "SMM/non-SMM/either".  These sort of things, if you don't think of 
> them beforehand, break your interface.

Yup. And we will never think of all the cases.

> 
>> 
>> I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
>> completely new architecture.
>> 
>> And what if MIPS comes along? I hear they also work on hw accelerated 
>> virtualization.
> 
> If it's just a matter of different register names and sizes, no problem.  
> From what I've seen of v8, it doesn't introduce new weirdnesses.

I haven't seen anything real yet, since the spec isn't out. So far only generic 
architecture documentation is available.

> 
>> 
>> >
>> >>  The same goes for ARM, where we will get v7 support for now, but very 
>> >> soon we will also want to get v8. Stabilizing a target so far takes ~1-2 
>> >> years from what I've seen. And that's stabilizing to a point where we don't 
>> >> find major ABI issues anymore.
>> >
>> >  The trick is to get the ABI to be flexible, like a generalized ABI for 
>> > state.  But it's true that it's really hard to nail it down.
>> 
>> Yup, and I think what we have today is a pretty good approach to this. I'm 
>> trying to mostly add "generalized" ioctls whenever I see that something can 
>> be handled generically, like ONE_REG or ENABLE_CAP. If we keep moving that 
>> direction, we are extensible with a reasonably stable ABI. Even without 
>> syscalls.
> 
> Syscalls are orthogonal to that - they're to avoid the fget_light() and to 
> tighten the vcpu/thread and vm/process relationship.

How about keeping the ioctl interface but moving vcpu_run to a syscall then? 
That should really be the only thing that belongs into the fast path, right? 
Every time we do a register sync in user space, we do something wrong. Instead, 
user space should either

  a) have wrappers around register accesses, so it can directly ask for 
specific registers that it needs
or
  b) keep everything that would be requested by the register synchronization in 
shared memory
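
A minimal userspace sketch of option (a), using ONE_REG as the transport
(NR_REGS and reg_to_one_reg_id() are made up for illustration):

  #include <stdbool.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  #define NR_REGS 64

  static uint64_t reg_cache[NR_REGS];
  static bool     reg_valid[NR_REGS];

  extern uint64_t reg_to_one_reg_id(int reg);   /* illustrative mapping */

  uint64_t get_reg(int vcpu_fd, int reg)
  {
          if (!reg_valid[reg]) {
                  struct kvm_one_reg r = {
                          .id   = reg_to_one_reg_id(reg),
                          .addr = (uintptr_t)&reg_cache[reg],
                  };
                  ioctl(vcpu_fd, KVM_GET_ONE_REG, &r);  /* just this one */
                  reg_valid[reg] = true;
          }
          return reg_cache[reg];
  }

  /* invalidate the cache before each KVM_RUN so stale values never leak */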

> 
>> , keep the rest in user space.
>> >
>> >
>> >  When a device is fully in the kernel, we have a good specification of the 
>> > ABI: it just implements the spec, and the ABI provides the interface from 
>> > the device to the rest of the world.  Partially accelerated devices means 
>> > a much greater effort in specifying exactly what it does.  It's also 
>> > vulnerable to changes in how the guest uses the device.
>> 
>> Why? For the HPET timer register for example, we could have a simple MMIO 
>> hook that says
>> 
>>   on_read:
>>     return read_current_time() - shared_page.offset;
>>   on_write:
>>     handle_in_user_space();
> 
> It works for the really simple cases, yes, but if the guest wants to set up 
> one-shot timers, it fails.  

I don't understand. Why would anything fail here? Once the logic that's 
implemented by the kernel accelerator doesn't fit anymore, unregister it.

> Also look at the PIT which latches on read.
> 
>> 
>> For IDE, it would be as simple as
>> 
>>   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
>>   for (i = 1; i < 7; i++) {
>>     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>>     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
>>   }
>> 
>> and we should have reduced overhead of IDE by quite a bit already. All the 
>> other 2k LOC in hw/ide/core.c don't matter for us really.
> 
> 
> Just use virtio.

Just use xenbus. Seriously, this is not an answer.

> 
>> 
>> >
>> >>  Similar to how vhost works, where we keep device enumeration and 
>> >> configuration in user space, but ring processing in kernel space.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/07/2012 03:40 PM, Alexander Graf wrote:

>>
>>  Not sure we'll ever get there. For PPC, it will probably take another 1-2 
years until we get the 32-bit targets stabilized. By then we will have new 64-bit 
support though. And then the next gen will come out giving us even more new 
constraints.
>
>  I would expect that newer archs have less constraints, not more.

Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid 
out stuff wrong before?


That's not what I mean by constraints.  It's easy to accommodate 
different register layouts.  Constraints (for me) are like requiring 
gang scheduling.  But you introduced the subject - what did you mean?


Let's take for example the software-controlled TLB on some ppc.  It's 
tempting to call them all "registers" and use the register interface to 
access them.  Is it workable?


Or let's look at SMM on x86.  To implement it memory slots need an 
additional attribute "SMM/non-SMM/either".  These sort of things, if you 
don't think of them beforehand, break your interface.




I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
completely new architecture.

And what if MIPS comes along? I hear they also work on hw accelerated 
virtualization.


If it's just a matter of different register names and sizes, no 
problem.  From what I've seen of v8, it doesn't introduce new weirdnesses.




>
>>  The same goes for ARM, where we will get v7 support for now, but very soon 
we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what 
I've seen. And that's stabilizing to a point where we don't find major ABI issues 
anymore.
>
>  The trick is to get the ABI to be flexible, like a generalized ABI for 
state.  But it's true that it's really hard to nail it down.

Yup, and I think what we have today is a pretty good approach to this. I'm trying to 
mostly add "generalized" ioctls whenever I see that something can be handled 
generically, like ONE_REG or ENABLE_CAP. If we keep moving that direction, we are 
extensible with a reasonably stable ABI. Even without syscalls.


Syscalls are orthogonal to that - they're to avoid the fget_light() and 
to tighten the vcpu/thread and vm/process relationship.



, keep the rest in user space.
>
>
>  When a device is fully in the kernel, we have a good specification of the 
ABI: it just implements the spec, and the ABI provides the interface from the 
device to the rest of the world.  Partially accelerated devices means a much 
greater effort in specifying exactly what it does.  It's also vulnerable to 
changes in how the guest uses the device.

Why? For the HPET timer register for example, we could have a simple MMIO hook 
that says

   on_read:
     return read_current_time() - shared_page.offset;
   on_write:
     handle_in_user_space();


It works for the really simple cases, yes, but if the guest wants to set 
up one-shot timers, it fails.  Also look at the PIT which latches on read.




For IDE, it would be as simple as

   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
   for (i = 1; i < 7; i++) {
     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
   }

and we should have reduced overhead of IDE by quite a bit already. All the 
other 2k LOC in hw/ide/core.c don't matter for us really.



Just use virtio.



>
>>  Similar to how vhost works, where we keep device enumeration and 
configuration in user space, but ring processing in kernel space.
>
>  vhost-net was a massive effort, I hope we don't have to replicate it.

Was it harder than the in-kernel io-apic?


Much, much harder.



>
>>
>>  Good candidates for in-kernel acceleration are:
>>
>>- HPET
>
>  Yes
>
>>- VGA
>>- IDE
>
>  Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
virtio-scsi).

Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI 
needs 3rd party drivers on w2k3 and wxp. I'm pretty sure non-Linux non-Windows 
systems won't get QXL drivers. Same for virtio.

Please don't make the Xen mistake again of claiming that all we care about is 
Linux as a guest.


Rest easy, there's no chance of that.  But if a guest is important 
enough, virtio drivers will get written.  IDE has no chance in hell of 
approaching virtio-blk performance, no matter how much effort we put 
into it.



KVM's strength has always been its close resemblance to hardware.


This will remain.  But we can't optimize everything.


>
>>
>>  We will run into the same thing with the MPIC though. On e500v2, IPIs are 
done through the MPIC. So if we want any SMP performance on those, we need to shove 
that part into the kernel. I don't really want to have all of the MPIC code in there 
however. So a hybrid approach sounds like a great fit.
>
>  Pointer to the qemu code?

hw/openpic.c


I see what you mean.



>
>>  T

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Alexander Graf

On 07.02.2012, at 14:16, Avi Kivity wrote:

> On 02/07/2012 02:51 PM, Alexander Graf wrote:
>> On 07.02.2012, at 13:24, Avi Kivity wrote:
>> 
>> >  On 02/07/2012 03:08 AM, Alexander Graf wrote:
>> >>  I don't like the idea too much. On s390 and ppc we can set other vcpu's 
>> >> interrupt status. How would that work in this model?
>> >
>> >  It would be a "vm-wide syscall".  You can also do that on x86 (through 
>> > KVM_IRQ_LINE).
>> >
>> >>
>> >>  I really do like the ioctl model btw. It's easily extensible and easy to 
>> >> understand.
>> >>
>> >>  I can also promise you that I have no idea what other extensions we will 
>> >> need in the next few years. The non-x86 targets are just moving really 
>> >> fast. So having an interface that allows for easy extension is a 
>> >> must-have.
>> >
>> >  Good point.  If we ever go through with it, it will only be after we see 
>> > the interface has stabilized.
>> 
>> Not sure we'll ever get there. For PPC, it will probably take another 1-2 
>> years until we get the 32-bit targets stabilized. By then we will have new 
>> 64-bit support though. And then the next gen will come out giving us even 
>> more new constraints.
> 
> I would expect that newer archs have less constraints, not more.

Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid 
out stuff wrong before?

I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
completely new architecture.

And what if MIPS comes along? I hear they also work on hw accelerated 
virtualization.

> 
>> The same goes for ARM, where we will get v7 support for now, but very soon 
>> we will also want to get v8. Stabilizing a target so far takes ~1-2 years 
>> from what I've seen. And that's stabilizing to a point where we don't find 
>> major ABI issues anymore.
> 
> The trick is to get the ABI to be flexible, like a generalized ABI for state. 
>  But it's true that it's really hard to nail it down.

Yup, and I think what we have today is a pretty good approach to this. I'm 
trying to mostly add "generalized" ioctls whenever I see that something can be 
handled generically, like ONE_REG or ENABLE_CAP. If we keep moving that 
direction, we are extensible with a reasonably stable ABI. Even without 
syscalls.

> 
> 
>> >>
>> >>  The framework is in KVM today. It's called ONE_REG. So far only PPC 
>> >> implements a few registers. If you like it, just throw all the x86 ones 
>> >> in there and you have everything you need.
>> >
>> >  This is more like MANY_REG, where you scatter/gather a list of registers 
>> > in userspace to the kernel or vice versa.
>> 
>> Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was 
>> to give every register a unique identifier that can be used to access it. 
>> Taking that logic to an array is trivial.
> 
> Definitely easy to extend.
> 
> 
>> >
>> >>
>> >>  >>   The communications between the local APIC and the IOAPIC/PIC will be
>> >>  >>   done over a socketpair, emulating the APIC bus protocol.
>> >>
>> >>  What is keeping us from moving there today?
>> >
>> >  The biggest problem with this proposal is that what we have today works 
>> > reasonably well.  Nothing is keeping us from moving there, except the fear 
>> > of performance regressions and lack of strong motivation.
>> 
>> So why bring it up in the "next-gen" api discussion?
> 
> One reason is to try to shape future changes to the current ABI in the same 
> direction.  Another is that maybe someone will convince us that it is needed.
> 
>> >
>> >  There's no way a patch with 'VGA' in it would be accepted.
>> 
>> Why not? I think the natural step forward is hybrid acceleration. Take a 
>> minimal subset of device emulation into kernel land, keep the rest in user 
>> space.
> 
> 
> When a device is fully in the kernel, we have a good specification of the 
> ABI: it just implements the spec, and the ABI provides the interface from the 
> device to the rest of the world.  Partially accelerated devices means a much 
> greater effort in specifying exactly what it does.  It's also vulnerable to 
> changes in how the guest uses the device.

Why? For the HPET timer register for example, we could have a simple MMIO hook 
that says

  on_read:
return read_current_time() - shared_page.offset;
  on_write:
handle_in_user_space();

For IDE, it would be as simple as

  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }

and we should have reduced overhead of IDE by quite a bit already. All the 
other 2k LOC in hw/ide/core.c don't matter for us really.
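
A hedged sketch of what the kernel side of those hooks could look like (no
such API exists in KVM today; everything below besides the names already
used above is invented):

  #include <stdint.h>

  #define MAX_PIO_HOOKS 32

  struct pio_hook {
          uint16_t port;
          uint8_t *backing;   /* points into device state, e.g. &s->cmd[i] */
  };

  static struct pio_hook read_hooks[MAX_PIO_HOOKS];
  static int nr_read_hooks;

  void register_pio_hook_ptr_r(uint16_t port, int size, uint8_t *ptr)
  {
          (void)size;         /* only SIZE_BYTE handled in this sketch */
          read_hooks[nr_read_hooks++] = (struct pio_hook){ port, ptr };
  }

  /* in the PIO exit path: 0 if handled in-kernel, -1 to exit to userspace */
  int try_pio_read(uint16_t port, uint8_t *val)
  {
          for (int i = 0; i < nr_read_hooks; i++) {
                  if (read_hooks[i].port == port) {
                          *val = *read_hooks[i].backing;
                          return 0;
                  }
          }
          return -1;
  }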

> 
>> Similar to how vhost works, where we keep device enumeration and 
>> configuration in user space, but ring processing in kernel space.
> 
> vhost-net was a massive effort, I hope we don't have to replicate it.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/07/2012 02:51 PM, Alexander Graf wrote:

On 07.02.2012, at 13:24, Avi Kivity wrote:

>  On 02/07/2012 03:08 AM, Alexander Graf wrote:
>>  I don't like the idea too much. On s390 and ppc we can set other vcpu's 
interrupt status. How would that work in this model?
>
>  It would be a "vm-wide syscall".  You can also do that on x86 (through 
KVM_IRQ_LINE).
>
>>
>>  I really do like the ioctl model btw. It's easily extensible and easy to 
understand.
>>
>>  I can also promise you that I have no idea what other extensions we will 
need in the next few years. The non-x86 targets are just moving really fast. So 
having an interface that allows for easy extension is a must-have.
>
>  Good point.  If we ever go through with it, it will only be after we see the 
interface has stabilized.

Not sure we'll ever get there. For PPC, it will probably take another 1-2 years 
until we get the 32-bit targets stabilized. By then we will have new 64-bit 
support though. And then the next gen will come out giving us even more new 
constraints.


I would expect that newer archs have less constraints, not more.


The same goes for ARM, where we will get v7 support for now, but very soon we 
will also want to get v8. Stabilizing a target so far takes ~1-2 years from 
what I've seen. And that's stabilizing to a point where we don't find major ABI 
issues anymore.


The trick is to get the ABI to be flexible, like a generalized ABI for 
state.  But it's true that it's really hard to nail it down.




>>
>>  The framework is in KVM today. It's called ONE_REG. So far only PPC 
implements a few registers. If you like it, just throw all the x86 ones in there and 
you have everything you need.
>
>  This is more like MANY_REG, where you scatter/gather a list of registers in 
userspace to the kernel or vice versa.

Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was to 
give every register a unique identifier that can be used to access it. Taking 
that logic to an array is trivial.


Definitely easy to extend.



>
>>
>>  >>   The communications between the local APIC and the IOAPIC/PIC will be
>>  >>   done over a socketpair, emulating the APIC bus protocol.
>>
>>  What is keeping us from moving there today?
>
>  The biggest problem with this proposal is that what we have today works 
reasonably well.  Nothing is keeping us from moving there, except the fear of 
performance regressions and lack of strong motivation.

So why bring it up in the "next-gen" api discussion?


One reason is to try to shape future changes to the current ABI in the 
same direction.  Another is that maybe someone will convince us that it 
is needed.



>
>  There's no way a patch with 'VGA' in it would be accepted.

Why not? I think the natural step forward is hybrid acceleration. Take a 
minimal subset of device emulation into kernel land, keep the rest in user 
space.



When a device is fully in the kernel, we have a good specification of 
the ABI: it just implements the spec, and the ABI provides the interface 
from the device to the rest of the world.  Partially accelerated devices 
means a much greater effort in specifying exactly what it does.  It's 
also vulnerable to changes in how the guest uses the device.



Similar to how vhost works, where we keep device enumeration and configuration 
in user space, but ring processing in kernel space.


vhost-net was a massive effort, I hope we don't have to replicate it.



Good candidates for in-kernel acceleration are:

   - HPET


Yes


   - VGA
   - IDE


Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
virtio-scsi).



I'm not sure how easy it would be to only partially accelerate the hot paths of 
the IO-APIC. I'm not too familiar with its details.


Pretty hard.



We will run into the same thing with the MPIC though. On e500v2, IPIs are done 
through the MPIC. So if we want any SMP performance on those, we need to shove 
that part into the kernel. I don't really want to have all of the MPIC code in 
there however. So a hybrid approach sounds like a great fit.


Pointer to the qemu code?


The problem with in-kernel device emulation the way we have it today is that 
it's an all-or-nothing choice. Either we push the device into kernel space or 
we keep it in user space. That adds a lot of code in kernel land where it 
doesn't belong.


Like I mentioned, I see that as a good thing.


>
>  No, slots still exist.  Only the API is "replace slot list" instead of "add slot" and 
"remove slot".

Why?


Physical memory is discontiguous, and includes aliases (two gpas 
referencing the same backing page).  How else would you describe it?



On PPC we walk the slots on every fault (incl. mmio), so fast lookup times 
there would be great. I was thinking of something page-table-like here.


We can certainly convert the slots to a tree internally.  I'm doing the 
same thing for qemu now, maybe we can do it for kvm too.  No need to 
involve the ABI at all.


Slot searching

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Alexander Graf

On 07.02.2012, at 13:24, Avi Kivity wrote:

> On 02/07/2012 03:08 AM, Alexander Graf wrote:
>> I don't like the idea too much. On s390 and ppc we can set other vcpu's 
>> interrupt status. How would that work in this model?
> 
> It would be a "vm-wide syscall".  You can also do that on x86 (through 
> KVM_IRQ_LINE).
> 
>> 
>> I really do like the ioctl model btw. It's easily extensible and easy to 
>> understand.
>> 
>> I can also promise you that I have no idea what other extensions we will 
>> need in the next few years. The non-x86 targets are just moving really fast. 
>> So having an interface that allows for easy extension is a must-have.
> 
> Good point.  If we ever go through with it, it will only be after we see the 
> interface has stabilized.

Not sure we'll ever get there. For PPC, it will probably take another 1-2 years 
until we get the 32-bit targets stabilized. By then we will have new 64-bit 
support though. And then the next gen will come out giving us even more new 
constraints.

The same goes for ARM, where we will get v7 support for now, but very soon we 
will also want to get v8. Stabilizing a target so far takes ~1-2 years from 
what I've seen. And that's stabilizing to a point where we don't find major ABI 
issues anymore.

> 
>> 
>> >
>> >>  State accessors
>> >>  ---
>> >>  Currently vcpu state is read and written by a bunch of ioctls that
>> >>  access register sets that were added (or discovered) along the years.
>> >>  Some state is stored in the vcpu mmap area.  These will be replaced by a
>> >>  pair of syscalls that read or write the entire state, or a subset of the
>> >>  state, in a tag/value format.  A register will be described by a tuple:
>> >>
>> >>set: the register set to which it belongs; either a real set (GPR,
>> >>  x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
>> >>  eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>> >>number: register number within a set
>> >>size: for self-description, and to allow expanding registers like
>> >>  SSE->AVX or eax->rax
>> >>attributes: read-write, read-only, read-only for guest but read-write
>> >>  for host
>> >>value
>> >
>> >  I do like the idea a lot of being able to read one register at a time as 
>> > often times that's all you need.
>> 
>> The framework is in KVM today. It's called ONE_REG. So far only PPC 
>> implements a few registers. If you like it, just throw all the x86 ones in 
>> there and you have everything you need.
> 
> This is more like MANY_REG, where you scatter/gather a list of registers in 
> userspace to the kernel or vice versa.

Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was to 
give every register a unique identifier that can be used to access it. Taking 
that logic to an array is trivial.
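
A sketch of what the array form might look like, reusing the existing
struct kvm_one_reg (the MANY_REG ioctl and struct below are hypothetical):

  #include <linux/kvm.h>

  struct kvm_many_regs {
          __u32 nregs;
          __u32 pad;
          struct kvm_one_reg regs[];  /* each: { 64-bit id, user address } */
  };

  /* one vcpu ioctl would then scatter/gather all listed registers at once */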

> 
>> 
>> >>  The communications between the local APIC and the IOAPIC/PIC will be
>> >>  done over a socketpair, emulating the APIC bus protocol.
>> 
>> What is keeping us from moving there today?
> 
> The biggest problem with this proposal is that what we have today works 
> reasonably well.  Nothing is keeping us from moving there, except the fear of 
> performance regressions and lack of strong motivation.

So why bring it up in the "next-gen" api discussion?

> 
>> 
>> >>
>> >>  Ioeventfd/irqfd
>> >>  ---
>> >>  As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> >>  retained, and perhaps supplemented with a way to assign an mmio region
>> >>  to a socketpair carrying transactions.  This allows a device model to be
>> >>  implemented out-of-process.  The socketpair can also be used to
>> >>  implement a replacement for coalesced mmio, by not waiting for responses
>> >>  on write transactions when enabled.  Synchronization of coalesced mmio
>> >>  will be implemented in the kernel, not userspace as now: when a
>> >>  non-coalesced mmio is needed, the kernel will first flush the coalesced
>> >>  mmio queue(s).
>> 
>> I would vote for completely deprecating coalesced MMIO. It is a generic 
>> framework that nobody except for VGA really needs.
> 
> It's actually used by e1000 too, don't remember what the performance benefits 
> are.  Of course, few people use e1000.

And for e1000 it's only used for nvram which actually could benefit from a more 
clever "this is backed by ram" logic. Coalesced mmio is not a great fit here.

> 
>> Better make something that accelerates read and write paths thanks to more 
>> specific knowledge of the interface.
>> 
>> One thing I'm thinking of here is IDE. There's no need to PIO callback into 
>> user space for all the status ports. We only really care about a callback on 
>> write to 7 (cmd). All the others are basically registers that the kernel 
>> could just read and write from shared memory.
>> 
>> I'm sure the VGA text stuff could use similar acceleration with well-known 
>> interfaces.
> 
> This goes back to the discussion about a kernel bytecode vm for accelerating 
>

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-07 Thread Avi Kivity

On 02/07/2012 03:08 AM, Alexander Graf wrote:

I don't like the idea too much. On s390 and ppc we can set other vcpu's 
interrupt status. How would that work in this model?


It would be a "vm-wide syscall".  You can also do that on x86 (through 
KVM_IRQ_LINE).




I really do like the ioctl model btw. It's easily extensible and easy to 
understand.

I can also promise you that I have no idea what other extensions we will need 
in the next few years. The non-x86 targets are just moving really fast. So 
having an interface that allows for easy extension is a must-have.


Good point.  If we ever go through with it, it will only be after we see 
the interface has stabilized.




>
>>  State accessors
>>  ---
>>  Currently vcpu state is read and written by a bunch of ioctls that
>>  access register sets that were added (or discovered) along the years.
>>  Some state is stored in the vcpu mmap area.  These will be replaced by a
>>  pair of syscalls that read or write the entire state, or a subset of the
>>  state, in a tag/value format.  A register will be described by a tuple:
>>
>>set: the register set to which it belongs; either a real set (GPR,
>>  x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
>>  eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>>number: register number within a set
>>size: for self-description, and to allow expanding registers like
>>  SSE->AVX or eax->rax
>>attributes: read-write, read-only, read-only for guest but read-write
>>  for host
>>value
>
>  I do like the idea a lot of being able to read one register at a time as 
often times that's all you need.

The framework is in KVM today. It's called ONE_REG. So far only PPC implements 
a few registers. If you like it, just throw all the x86 ones in there and you 
have everything you need.


This is more like MANY_REG, where you scatter/gather a list of registers 
in userspace to the kernel or vice versa.




>>  The communications between the local APIC and the IOAPIC/PIC will be
>>  done over a socketpair, emulating the APIC bus protocol.

What is keeping us from moving there today?


The biggest problem with this proposal is that what we have today works 
reasonably well.  Nothing is keeping us from moving there, except the 
fear of performance regressions and lack of strong motivation.




>>
>>  Ioeventfd/irqfd
>>  ---
>>  As the ioeventfd/irqfd mechanism has been quite successful, it will be
>>  retained, and perhaps supplemented with a way to assign an mmio region
>>  to a socketpair carrying transactions.  This allows a device model to be
>>  implemented out-of-process.  The socketpair can also be used to
>>  implement a replacement for coalesced mmio, by not waiting for responses
>>  on write transactions when enabled.  Synchronization of coalesced mmio
>>  will be implemented in the kernel, not userspace as now: when a
>>  non-coalesced mmio is needed, the kernel will first flush the coalesced
>>  mmio queue(s).

I would vote for completely deprecating coalesced MMIO. It is a generic 
framework that nobody except for VGA really needs.


It's actually used by e1000 too, don't remember what the performance 
benefits are.  Of course, few people use e1000.



Better make something that accelerates read and write paths thanks to more 
specific knowledge of the interface.

One thing I'm thinking of here is IDE. There's no need to PIO callback into 
user space for all the status ports. We only really care about a callback on 
write to 7 (cmd). All the others are basically registers that the kernel could 
just read and write from shared memory.

I'm sure the VGA text stuff could use similar acceleration with well-known 
interfaces.


This goes back to the discussion about a kernel bytecode vm for 
accelerating mmio.  The problem is that we need something really general.



To me, coalesced mmio has proven that it's generalization where it doesn't 
belong.


But you want to generalize it even more?

There's no way a patch with 'VGA' in it would be accepted.



>>
>>  Guest memory management
>>  ---
>>  Instead of managing each memory slot individually, a single API will be
>>  provided that replaces the entire guest physical memory map atomically.
>>  This matches the implementation (using RCU) and plugs holes in the
>>  current API, where you lose the dirty log in the window between the last
>>  call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
>>  that removes the slot.

So we render the actual slot logic invisible? That's a very good idea.


No, slots still exist.  Only the API is "replace slot list" instead of 
"add slot" and "remove slot".




>>
>>  Slot-based dirty logging will be replaced by range-based and work-based
>>  dirty logging; that is "what pages are dirty in this range, which may be
>>  smaller than a slot" and "don't return more than N pages".
>>
>>  We may want to place the log in user memory instead of kernel memory

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-06 Thread Alexander Graf

On 03.02.2012, at 03:09, Anthony Liguori wrote:

> On 02/02/2012 10:09 AM, Avi Kivity wrote:
>> The kvm api has been accumulating cruft for several years now.  This is
>> due to feature creep, fixing mistakes, experience gained by the
>> maintainers and developers on how to do things, ports to new
>> architectures, and simply as a side effect of a code base that is
>> developed slowly and incrementally.
>> 
>> While I don't think we can justify a complete revamp of the API now, I'm
>> writing this as a thought experiment to see where a from-scratch API can
>> take us.  Of course, if we do implement this, the new and old APIs will
>> have to be supported side by side for several years.
>> 
>> Syscalls
>> 
>> kvm currently uses the much-loved ioctl() system call as its entry
>> point.  While this made it easy to add kvm to the kernel unintrusively,
>> it does have downsides:
>> 
>> - overhead in the entry path, for the ioctl dispatch path and vcpu mutex
>> (low but measurable)
>> - semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
>> a vm to be tied to an mm_struct, but the current API ties them to file
>> descriptors, which can move between threads and processes.  We check
>> that they don't, but we don't want to.
>> 
>> Moving to syscalls avoids these problems, but introduces new ones:
>> 
>> - adding new syscalls is generally frowned upon, and kvm will need several
>> - syscalls into modules are harder and rarer than into core kernel code
>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>> mm_struct
>> 
>> Syscalls that operate on the entire guest will pick it up implicitly
>> from the mm_struct, and syscalls that operate on a vcpu will pick it up
>> from current.
> 
> This seems like the natural progression.

I don't like the idea too much. On s390 and ppc we can set other vcpu's 
interrupt status. How would that work in this model?

I really do like the ioctl model btw. It's easily extensible and easy to 
understand.

I can also promise you that I have no idea what other extensions we will need 
in the next few years. The non-x86 targets are just moving really fast. So 
having an interface that allows for easy extension is a must-have.

> 
>> State accessors
>> ---
>> Currently vcpu state is read and written by a bunch of ioctls that
>> access register sets that were added (or discovered) along the years.
>> Some state is stored in the vcpu mmap area.  These will be replaced by a
>> pair of syscalls that read or write the entire state, or a subset of the
>> state, in a tag/value format.  A register will be described by a tuple:
>> 
>>   set: the register set to which it belongs; either a real set (GPR,
>> x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
>> eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>>   number: register number within a set
>>   size: for self-description, and to allow expanding registers like
>> SSE->AVX or eax->rax
>>   attributes: read-write, read-only, read-only for guest but read-write
>> for host
>>   value
> 
> I do like the idea a lot of being able to read one register at a time as 
> often times that's all you need.

The framework is in KVM today. It's called ONE_REG. So far only PPC implements 
a few registers. If you like it, just throw all the x86 ones in there and you 
have everything you need.
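
For concreteness, a minimal ONE_REG access looks like this (struct
kvm_one_reg and KVM_GET_ONE_REG are the real interface; KVM_REG_PPC_HIOR is
one of the early PPC registers, so this assumes a ppc build):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  int get_hior(int vcpu_fd, uint64_t *val)
  {
          struct kvm_one_reg reg = {
                  .id   = KVM_REG_PPC_HIOR,   /* set/size/number in one id */
                  .addr = (uintptr_t)val,     /* userspace destination */
          };
          return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
  }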

> 
>> 
>> Device model
>> 
>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>> PCI devices assigned from the host.  The API allows emulating the local
>> APICs in userspace.
>> 
>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>> them to userspace.
> 
> I'm a big fan of this.
> 
>> Note: this may cause a regression for older guests
>> that don't support MSI or kvmclock.  Device assignment will be done
>> using VFIO, that is, without direct kvm involvement.
>> 
>> Local APICs will be mandatory, but it will be possible to hide them from
>> the guest.  This means that it will no longer be possible to emulate an
>> APIC in userspace, but it will be possible to virtualize an APIC-less
>> core - userspace will play with the LINT0/LINT1 inputs (configured as
>> EXITINT and NMI) to queue interrupts and NMIs.
> 
> I think this makes sense.  An interesting consequence of this is that it's no 
> longer necessary to associate the VCPU context with an MMIO/PIO operation.  
> I'm not sure if there's an obvious benefit to that but it's interesting 
> nonetheless.
> 
>> The communications between the local APIC and the IOAPIC/PIC will be
>> done over a socketpair, emulating the APIC bus protocol.

What is keeping us from moving there today?

>> 
>> Ioeventfd/irqfd
>> ---
>> As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> retained, and perhaps supplemented with a way to assign an mmio region
>> to a socketpair carrying transactions.  This allows a device model to be
>> implemented out-of-process.