Re: [RFC PATCH 0/3] generic hypercall support

2009-05-13 Thread Gregory Haskins
Anthony Liguori wrote:
> Gregory Haskins wrote:
>> I specifically generalized my statement above because #1 I assume
>> everyone here is smart enough to convert that nice round unit into the
>> relevant figure.  And #2, there are multiple potential latency sources
>> at play which we need to factor in when looking at the big picture.  For
>> instance, the difference between PF exit, and an IO exit (2.58us on x86,
>> to be precise).  Or whether you need to take a heavy-weight exit.  Or a
>> context switch to qemu, then the kernel, back to qemu, and back to the
>> vcpu.  Or acquire a mutex.  Or get head-of-lined on the VGA model's
>> IO. I know you wish that this whole discussion would just go away,
>> but these
>> little "300ns here, 1600ns there" really add up in aggregate despite
>> your dismissive attitude towards them.  And it doesn't take much to
>> affect the results in a measurable way.  As stated, each 1us costs
>> ~4%. My motivation is to reduce as many of these sources as possible.
>>
>> So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
>> improvement.  So what?  It's still an improvement.  If that improvement
>> were for free, would you object?  And we all know that this change isn't
>> "free" because we have to change some code (+128/-0, to be exact).  But
>> what is it specifically you are objecting to in the first place?  Adding
>> hypercall support as a pv_ops primitive isn't exactly hard or complex,
>> or even very much code.
>>   
>
> Where does 25us come from?  The numbers you post below are 33us and
> 66us.  This is part of what's frustrating me in this thread.  Things
> are way too theoretical.  Saying that "if packet latency was 25us,
> then it would be a 1.4% improvement" is close to misleading.
[ answered in the last reply ]

> The numbers you've posted are also measuring on-box speeds.  What
> really matters are off-box latencies and that's just going to exaggerate.

I'm not 100% clear on what you mean with on-box vs off-box.  These
figures were gathered between two real machines connected via 10GE
cross-over cable.  The 5.8Gb/s and 33us (25us) values were gathered
sending real data between these hosts.  This sounds "off-box" to me, but
I am not sure I truly understand your assertion.

>
>
> IIUC, if you switched vbus to using PIO today, you would go from 66us
> to 65.65us, which you'd round to 66us for on-box latencies.  Even if
> you didn't round, it's a 0.5% improvement in latency.

I think part of what you are missing is that in order to create vbus, I
needed to _create_ an in-kernel hook from scratch since there were no
existing methods. Since I measured HC to be superior in performance (if
by only a little), I wasn't going to choose the slower way if there
wasn't a reason, and at the time I didn't see one.  Now after community
review, perhaps we do have a reason, but that is the point of the review
process.  So now we can push something like iofd as a PIO hook instead. 
But either way, something needed to be created.

>
>
> Adding hypercall support as a pv_ops primitive is adding a fair bit of
> complexity.  You need a hypercall fd mechanism to plumb this down to
> userspace; otherwise you can't support migration from in-kernel
> backend to non in-kernel backend.

I respectfully disagree.   This is orthogonal to the simple issue of the
IO type for the exit.  Where you *do* have a point is that the bigger
benefit comes from in-kernel termination (like the iofd stuff I posted
yesterday).  However, in-kernel termination is not strictly necessary to
exploit some reduction in overhead in the IO latency.  In either case we
know we can shave off about 2.56us from an MMIO.

Since I formally measured MMIO rtt to userspace yesterday, we now know
that we can do qemu-mmio in about 110k IOPS, 9.09us rtt.  Switching to
pv_io_ops->mmio() alone would be a boost to approximately 153k IOPS,
6.53us rtt.  This would have a tangible benefit to all models without
any hypercall plumbing screwing up migration.  Therefore I still stand
by the assertion that the hypercall discussion alone doesn't add very
much complexity.
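
(As a back-of-the-envelope check of those figures, assuming rtt is simply
1/IOPS and reusing the ~2.56us exit-cost delta from above:)

/* rtt <-> IOPS conversion used in the paragraph above.  Assumes one
 * outstanding request at a time, so rtt = 1 / IOPS; the 2.56us is the
 * approximate MMIO-exit savings quoted earlier in this thread.
 */
#include <stdio.h>

int main(void)
{
	double mmio_rtt_us = 9.09;      /* measured qemu-mmio round trip */
	double exit_savings_us = 2.56;  /* approximate exit-cost delta   */
	double pv_rtt_us = mmio_rtt_us - exit_savings_us;

	printf("qemu-mmio: %.0fk IOPS\n", 1000.0 / mmio_rtt_us);   /* ~110k          */
	printf("pv mmio:   %.2fus rtt, %.0fk IOPS\n",
	       pv_rtt_us, 1000.0 / pv_rtt_us);                     /* 6.53us, ~153k  */
	return 0;
}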

>   You need some way to allocate hypercalls to particular devices which
> so far, has been completely ignored.

I'm sorry, but that's not true.  Vbus already handles this mapping.


>   I've already mentioned why hypercalls are also unfortunate from a
> guest perspective.  They require kernel patching and this is almost
> certainly going to break at least Vista as a guest.  Certainly Windows 7.

Yes, you have a point here.

>
> So it's not at all fair to trivialize the complexity introduced here. 
> I'm simply asking for justification to introduce this complexity.  I
> don't see why this is unfair for me to ask.

In summary, I don't think there is really much complexity being added
because this stuff really doesn't depend on the hypercallfd (iofd)
interface in order to have some benefit, contrary to what you assert above.  The
hypercall page is a good point for attestation, but that is

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-13 Thread Gregory Haskins
Anthony Liguori wrote:
> Gregory Haskins wrote:
>>
>> So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
>> improvement.  So what?  It's still an improvement.  If that improvement
>> were for free, would you object?  And we all know that this change isn't
>> "free" because we have to change some code (+128/-0, to be exact).  But
>> what is it specifically you are objecting to in the first place?  Adding
>> hypercall support as a pv_ops primitive isn't exactly hard or complex,
>> or even very much code.
>>   
>
> Where does 25us come from?  The numbers you post below are 33us and 66us.



The 25us is approximately the max from an in-kernel harness strapped
directly to the driver gathered informally during testing.  The 33us is
from formally averaging multiple runs of a userspace socket app in
preparation for publishing.  I consider the 25us the "target goal" since
there is obviously overhead that a socket application deals with that
theoretically a guest bypasses with the tap-device.  Note that the
socket application often sees < 30us itself...this was just a
particularly "slow" set of runs that day.

Note that this is why I express the impact as "approximately" (e.g.
"~4%").  Sorry for the confusion.

-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Anthony Liguori

Avi Kivity wrote:

Anthony Liguori wrote:


It's a question of cost vs. benefit.  It's clear the benefit is low 
(but that doesn't mean it's not worth having).  The cost initially 
appeared to be very low, until the nested virtualization wrench was 
thrown into the works.  Not that nested virtualization is a reality 
-- even on svm where it is implemented it is not yet production 
quality and is disabled by default.


Now nested virtualization is beginning to look interesting, with 
Windows 7's XP mode requiring virtualization extensions.  Desktop 
virtualization is also something likely to use device assignment 
(though you probably won't assign a virtio device to the XP instance 
inside Windows 7).


Maybe we should revisit the mmio hypercall idea again, it might be 
workable if we find a way to let the guest know if it should use the 
hypercall or not for a given memory range.


mmio hypercall is nice because
- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and 
get a nice speed win (or rather, less speed loss)


If it's a PCI device, then we can also have an interrupt which we 
currently lack with vmcall-based hypercalls.  This would give us 
guestcalls, upcalls, or whatever we've previously decided to call them.


Sorry, I totally failed to understand this.  Please explain.


I totally missed what you meant by MMIO hypercall.

In what cases do you think MMIO hypercall would result in a net benefit?

I think the difference in MMIO vs hcall will be overshadowed by the 
heavy weight transition to userspace.  The only thing I can think of 
where it may matter is for in-kernel devices like the APIC but that's a 
totally different path in Linux.


Regards,

Anthony Liguori


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Avi Kivity
Anthony Liguori wrote: 


It's a question of cost vs. benefit.  It's clear the benefit is low 
(but that doesn't mean it's not worth having).  The cost initially 
appeared to be very low, until the nested virtualization wrench was 
thrown into the works.  Not that nested virtualization is a reality 
-- even on svm where it is implemented it is not yet production 
quality and is disabled by default.


Now nested virtualization is beginning to look interesting, with 
Windows 7's XP mode requiring virtualization extensions.  Desktop 
virtualization is also something likely to use device assignment 
(though you probably won't assign a virtio device to the XP instance 
inside Windows 7).


Maybe we should revisit the mmio hypercall idea again, it might be 
workable if we find a way to let the guest know if it should use the 
hypercall or not for a given memory range.


mmio hypercall is nice because
- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and 
get a nice speed win (or rather, less speed loss)


If it's a PCI device, then we can also have an interrupt which we 
currently lack with vmcall-based hypercalls.  This would give us 
guestcalls, upcalls, or whatever we've previously decided to call them.


Sorry, I totally failed to understand this.  Please explain.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Avi Kivity

Gregory Haskins wrote:

Avi Kivity wrote:
  

Hollis Blanchard wrote:


I haven't been following this conversation at all. With that in mind...

AFAICS, a hypercall is clearly the higher-performing option, since you
don't need the additional memory load (which could even cause a page
fault in some circumstances) and instruction decode. That said, I'm
willing to agree that this overhead is probably negligible compared to
the IOp itself... Amdahl's Law again.
  
  

It's a question of cost vs. benefit.  It's clear the benefit is low
(but that doesn't mean it's not worth having).  The cost initially
appeared to be very low, until the nested virtualization wrench was
thrown into the works.  Not that nested virtualization is a reality --
even on svm where it is implemented it is not yet production quality
and is disabled by default.

Now nested virtualization is beginning to look interesting, with
Windows 7's XP mode requiring virtualization extensions.  Desktop
virtualization is also something likely to use device assignment
(though you probably won't assign a virtio device to the XP instance
inside Windows 7).

Maybe we should revisit the mmio hypercall idea again, it might be
workable if we find a way to let the guest know if it should use the
hypercall or not for a given memory range.

mmio hypercall is nice because
- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and
get a nice speed win (or rather, less speed loss)



Yeah, I agree with all this.  I am still wrestling with how to deal with
the device-assignment problem w.r.t. shunting io requests into a
hypercall vs letting them PF.  Are you saying we could simply ignore
this case by disabling "MMIOoHC" when assignment is enabled?  That would
certainly make the problem much easier to solve.
  


No, we need to deal with hotplug.  Something like IO_COND that Chris 
mentioned, but how to avoid turning this into a doctoral thesis.


(On the other hand, device assignment requires the iommu, and I think 
you have to specify that up front?)




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Anthony Liguori

Hollis Blanchard wrote:

On Sun, 2009-05-10 at 13:38 -0500, Anthony Liguori wrote:
  

Gregory Haskins wrote:


Can you back up your claim that PPC has no difference in performance
with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
like instructions, but clearly there are ways to cause a trap, so
presumably we can measure the difference between a PF exit and something
more explicit).
  
First, the PPC that KVM supports performs very poorly relatively 
speaking because it receives no hardware assistance; this is not the 
right place to focus wrt optimizations.


And because there's no hardware assistance, there simply isn't a 
hypercall instruction.  Are PFs the fastest type of exits?  Probably not 
but I honestly have no idea.  I'm sure Hollis does though.



Memory load from the guest context (for instruction decoding) is a
*very* poorly performing path on most PowerPC, even considering server
PowerPC with hardware virtualization support. No, I don't have any data
for you, but switching the hardware MMU contexts requires some
heavyweight synchronization instructions.
  


For current ppcemb, you would have to do a memory load no matter what, 
right?  I guess you could have a dedicated interrupt vector or something...


For future ppcemb's, do you know if there is an equivalent of a PF exit 
type?  Does the hardware squirrel away the faulting address somewhere 
and set PC to the start of the instruction?  If so, no guest memory load 
should be required.


Regards,

Anthony Liguori


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Anthony Liguori

Avi Kivity wrote:

Hollis Blanchard wrote:

I haven't been following this conversation at all. With that in mind...

AFAICS, a hypercall is clearly the higher-performing option, since you
don't need the additional memory load (which could even cause a page
fault in some circumstances) and instruction decode. That said, I'm
willing to agree that this overhead is probably negligible compared to
the IOp itself... Amdahl's Law again.
  


It's a question of cost vs. benefit.  It's clear the benefit is low 
(but that doesn't mean it's not worth having).  The cost initially 
appeared to be very low, until the nested virtualization wrench was 
thrown into the works.  Not that nested virtualization is a reality -- 
even on svm where it is implemented it is not yet production quality 
and is disabled by default.


Now nested virtualization is beginning to look interesting, with 
Windows 7's XP mode requiring virtualization extensions.  Desktop 
virtualization is also something likely to use device assignment 
(though you probably won't assign a virtio device to the XP instance 
inside Windows 7).


Maybe we should revisit the mmio hypercall idea again, it might be 
workable if we find a way to let the guest know if it should use the 
hypercall or not for a given memory range.


mmio hypercall is nice because
- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and 
get a nice speed win (or rather, less speed loss)


If it's a PCI device, then we can also have an interrupt which we 
currently lack with vmcall-based hypercalls.  This would give us 
guestcalls, upcalls, or whatever we've previously decided to call them.


Regards,

Anthony Liguori


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Anthony Liguori

Gregory Haskins wrote:

I specifically generalized my statement above because #1 I assume
everyone here is smart enough to convert that nice round unit into the
relevant figure.  And #2, there are multiple potential latency sources
at play which we need to factor in when looking at the big picture.  For
instance, the difference between PF exit, and an IO exit (2.58us on x86,
to be precise).  Or whether you need to take a heavy-weight exit.  Or a
context switch to qemu, then the kernel, back to qemu, and back to the
vcpu.  Or acquire a mutex.  Or get head-of-lined on the VGA model's IO. 
I know you wish that this whole discussion would just go away, but these

little "300ns here, 1600ns there" really add up in aggregate despite
your dismissive attitude towards them.  And it doesn't take much to
affect the results in a measurable way.  As stated, each 1us costs ~4%. 
My motivation is to reduce as many of these sources as possible.


So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
improvement.  So what?  It's still an improvement.  If that improvement
were for free, would you object?  And we all know that this change isn't
"free" because we have to change some code (+128/-0, to be exact).  But
what is it specifically you are objecting to in the first place?  Adding
hypercall support as a pv_ops primitive isn't exactly hard or complex,
or even very much code.
  


Where does 25us come from?  The numbers you post below are 33us and 
66us.  This is part of what's frustrating me in this thread.  Things are 
way too theoretical.  Saying that "if packet latency was 25us, then it 
would be a 1.4% improvement" is close to misleading.  The numbers you've 
posted are also measuring on-box speeds.  What really matters are 
off-box latencies and that's just going to exaggerate.


IIUC, if you switched vbus to using PIO today, you would go from 66us to 
65.65us, which you'd round to 66us for on-box latencies.  Even if you 
didn't round, it's a 0.5% improvement in latency.


Adding hypercall support as a pv_ops primitive is adding a fair bit of 
complexity.  You need a hypercall fd mechanism to plumb this down to 
userspace; otherwise you can't support migration from in-kernel backend 
to non in-kernel backend.  You need some way to allocate hypercalls to 
particular devices which so far, has been completely ignored.  I've 
already mentioned why hypercalls are also unfortunate from a guest 
perspective.  They require kernel patching and this is almost certainly 
going to break at least Vista as a guest.  Certainly Windows 7.


So it's not at all fair to trivialize the complexity introduced here.  
I'm simply asking for justification to introduce this complexity.  I 
don't see why this is unfair for me to ask.



As a more general observation, we need numbers to justify an
optimization, not to justify not including an optimization.

In other words, the burden is on you to present a scenario where this
optimization would result in a measurable improvement in a real world
work load.



I have already done this.  You seem to have chosen to ignore my
statements and results, but if you insist on rehashing:

I started this project by analyzing system traces and finding some of
the various bottlenecks in comparison to a native host.  Throughput was
already pretty decent, but latency was pretty bad (and recently got
*really* bad, but I know you already have a handle on what's causing
that).  I digress...one of the conclusions of the research was that  I
wanted to focus on building an IO subsystem designed to minimize the
quantity of exits, minimize the cost of each exit, and shorten the
end-to-end signaling path to achieve optimal performance.  I also wanted
to build a system that was extensible enough to work with a variety of
client types, on a variety of architectures, etc, so we would only need
to solve these problems "once".  The end result was vbus, and the first
working example was venet.  The measured performance data of this work
was as follows:

802.x network, 9000 byte MTU,  2 8-core x86_64s connected back to back
with Chelsio T3 10GE via crossover.

Bare metal: tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
Virtio-net (PCI): tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
Venet  (VBUS): tput = 5802Mb/s, round-trip = 15127pps (66us rtt)

For more details:  http://lkml.org/lkml/2009/4/21/408
  


Sending out a massive infrastructure change that does things wildly 
differently from how they're done today without any indication of why 
those changes were necessary is disruptive.


If you could characterize all of the changes that vbus makes that are 
different from virtio, demonstrating at each stage why the change 
mattered and what benefit it brought, then we'd be having a completely 
different discussion.  I have no problem throwing away virtio today if 
there's something else better.


That's not what you've done though.  You wrote a bunch of code without 
understanding why virt

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Gregory Haskins
Avi Kivity wrote:
> Hollis Blanchard wrote:
>> I haven't been following this conversation at all. With that in mind...
>>
>> AFAICS, a hypercall is clearly the higher-performing option, since you
>> don't need the additional memory load (which could even cause a page
>> fault in some circumstances) and instruction decode. That said, I'm
>> willing to agree that this overhead is probably negligible compared to
>> the IOp itself... Amdahl's Law again.
>>   
>
> It's a question of cost vs. benefit.  It's clear the benefit is low
> (but that doesn't mean it's not worth having).  The cost initially
> appeared to be very low, until the nested virtualization wrench was
> thrown into the works.  Not that nested virtualization is a reality --
> even on svm where it is implemented it is not yet production quality
> and is disabled by default.
>
> Now nested virtualization is beginning to look interesting, with
> Windows 7's XP mode requiring virtualization extensions.  Desktop
> virtualization is also something likely to use device assignment
> (though you probably won't assign a virtio device to the XP instance
> inside Windows 7).
>
> Maybe we should revisit the mmio hypercall idea again, it might be
> workable if we find a way to let the guest know if it should use the
> hypercall or not for a given memory range.
>
> mmio hypercall is nice because
> - it falls back nicely to pure mmio
> - it optimizes an existing slow path, not just new device models
> - it has preexisting semantics, so we have less ABI to screw up
> - for nested virtualization + device assignment, we can drop it and
> get a nice speed win (or rather, less speed loss)
>
Yeah, I agree with all this.  I am still wrestling with how to deal with
the device-assignment problem w.r.t. shunting io requests into a
hypercall vs letting them PF.  Are you saying we could simply ignore
this case by disabling "MMIOoHC" when assignment is enabled?  That would
certainly make the problem much easier to solve.

-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Avi Kivity

Hollis Blanchard wrote:

I haven't been following this conversation at all. With that in mind...

AFAICS, a hypercall is clearly the higher-performing option, since you
don't need the additional memory load (which could even cause a page
fault in some circumstances) and instruction decode. That said, I'm
willing to agree that this overhead is probably negligible compared to
the IOp itself... Amdahl's Law again.
  


It's a question of cost vs. benefit.  It's clear the benefit is low (but 
that doesn't mean it's not worth having).  The cost initially appeared 
to be very low, until the nested virtualization wrench was thrown into 
the works.  Not that nested virtualization is a reality -- even on svm 
where it is implemented it is not yet production quality and is disabled 
by default.


Now nested virtualization is beginning to look interesting, with Windows 
7's XP mode requiring virtualization extensions.  Desktop virtualization 
is also something likely to use device assignment (though you probably 
won't assign a virtio device to the XP instance inside Windows 7).


Maybe we should revisit the mmio hypercall idea again, it might be 
workable if we find a way to let the guest know if it should use the 
hypercall or not for a given memory range.


mmio hypercall is nice because
- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and get 
a nice speed win (or rather, less speed loss)




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Hollis Blanchard
On Sun, 2009-05-10 at 13:38 -0500, Anthony Liguori wrote:
> Gregory Haskins wrote:
> >
> > Can you back up your claim that PPC has no difference in performance
> > with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
> > like instructions, but clearly there are ways to cause a trap, so
> > presumably we can measure the difference between a PF exit and something
> > more explicit).
> 
> First, the PPC that KVM supports performs very poorly relatively 
> speaking because it receives no hardware assistance; this is not the 
> right place to focus wrt optimizations.
> 
> And because there's no hardware assistance, there simply isn't a 
> hypercall instruction.  Are PFs the fastest type of exits?  Probably not 
> but I honestly have no idea.  I'm sure Hollis does though.

Memory load from the guest context (for instruction decoding) is a
*very* poorly performing path on most PowerPC, even considering server
PowerPC with hardware virtualization support. No, I don't have any data
for you, but switching the hardware MMU contexts requires some
heavyweight synchronization instructions.

> Page faults are going to have tremendously different performance 
> characteristics on PPC too because it's a software managed TLB. There's 
> no page table lookup like there is on x86.

To clarify, software-managed TLBs are only found in embedded PowerPC.
Server and classic PowerPC use hash tables, which are a third MMU type.

-- 
Hollis Blanchard
IBM Linux Technology Center



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Jeremy Fitzhardinge

Anthony Liguori wrote:
Yes, I misunderstood that they actually emulated it like that.  
However, ia64 has no paravirtualization support today, so surely we 
aren't going to be justifying this via ia64, right?




Someone is actively putting a pvops infrastructure into the ia64 port, 
along with a Xen port.  I think pieces of it got merged this last window.


   J


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Hollis Blanchard
On Mon, 2009-05-11 at 09:14 -0400, Gregory Haskins wrote:
> 
> >> for request-response, this is generally for *every* packet since
> you
> >> cannot exploit buffering/deferring.
> >>
> >> Can you back up your claim that PPC has no difference in
> performance
> >> with an MMIO exit and a "hypercall" (yes, I understand PPC has no
> "VT"
> >> like instructions, but clearly there are ways to cause a trap, so
> >> presumably we can measure the difference between a PF exit and
> something
> >> more explicit).
> >>   
> >
> > First, the PPC that KVM supports performs very poorly relatively
> > speaking because it receives no hardware assistance
> 
> So wouldn't that be making the case that it could use as much help as
> possible?

I think he's referencing Amdahl's Law here. While I'd agree, this is
relevant only for the current KVM PowerPC implementations. I think it
would be short-sighted to design an IO architecture around that.

> >   this is not the right place to focus wrt optimizations.
> 
> Odd choice of words.  I am advocating the opposite (broad solution to
> many arches and many platforms (i.e. hypervisors)) and therefore I am
> not "focused" on it (or really any one arch) at all per se.  I am
> _worried_ however, that we could be overlooking PPC (as an example) if
> we ignore the disparity between MMIO and HC since other higher
> performance options are not available like PIO.  The goal on this
> particular thread is to come up with an IO interface that works
> reasonably well across as many hypervisors as possible.  MMIO/PIO do
> not appear to fit that bill (at least not without tunneling them over
> HCs)

I haven't been following this conversation at all. With that in mind...

AFAICS, a hypercall is clearly the higher-performing option, since you
don't need the additional memory load (which could even cause a page
fault in some circumstances) and instruction decode. That said, I'm
willing to agree that this overhead is probably negligible compared to
the IOp itself... Amdahl's Law again.

-- 
Hollis Blanchard
IBM Linux Technology Center



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Gregory Haskins
Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Anthony Liguori wrote:
>>  
>>>   I'm surprised so much effort is going into this, is there any
>>> indication that this is even close to a bottleneck in any circumstance?
>>> 
>>
>> Yes.  Each 1us of overhead is a 4% regression in something as trivial as
>> a 25us UDP/ICMP rtt "ping".
>
> It wasn't 1us, it was 350ns or something around there (i.e. ~1%).

I wasn't referring to "it".  I chose my words carefully.

Let me rephrase for your clarity: *each* 1us of overhead introduced into
the signaling path is a ~4% latency regression for a round trip on a
high speed network (note that this can also affect throughput at some
level, too).  I believe this point has been lost on you from the very
beginning of the vbus discussions.

I specifically generalized my statement above because #1 I assume
everyone here is smart enough to convert that nice round unit into the
relevant figure.  And #2, there are multiple potential latency sources
at play which we need to factor in when looking at the big picture.  For
instance, the difference between PF exit, and an IO exit (2.58us on x86,
to be precise).  Or whether you need to take a heavy-weight exit.  Or a
context switch to qemu, then the kernel, back to qemu, and back to the
vcpu.  Or acquire a mutex.  Or get head-of-lined on the VGA model's IO. 
I know you wish that this whole discussion would just go away, but these
little "300ns here, 1600ns there" really add up in aggregate despite
your dismissive attitude towards them.  And it doesn't take much to
affect the results in a measurable way.  As stated, each 1us costs ~4%. 
My motivation is to reduce as many of these sources as possible.

So, yes, the delta from PIO to HC is 350ns.  Yes, this is a ~1.4%
improvement.  So what?  It's still an improvement.  If that improvement
were for free, would you object?  And we all know that this change isn't
"free" because we have to change some code (+128/-0, to be exact).  But
what is it specifically you are objecting to in the first place?  Adding
hypercall support as a pv_ops primitive isn't exactly hard or complex,
or even very much code.
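
(To make "not very much code" concrete, here is a minimal sketch of what such
a guest-side hook could look like.  The names below -- pv_io_ops,
kvm_pv_hypercall, pv_hypercall3 -- are invented for illustration and are not
the actual patch; the vmcall register convention mirrors the existing
kvm_hypercall helpers.)

struct pv_io_ops {
	long (*hypercall)(unsigned long nr, unsigned long a0,
			  unsigned long a1, unsigned long a2);
};

/* KVM backend: issue a vmcall with the usual a/b/c/d register layout. */
static long kvm_pv_hypercall(unsigned long nr, unsigned long a0,
			     unsigned long a1, unsigned long a2)
{
	long ret;

	asm volatile("vmcall"
		     : "=a" (ret)
		     : "a" (nr), "b" (a0), "c" (a1), "d" (a2)
		     : "memory");
	return ret;
}

static struct pv_io_ops pv_io_ops = {
	.hypercall = kvm_pv_hypercall,
};

/* What a driver would call to kick its in-kernel backend. */
static inline long pv_hypercall3(unsigned long nr, unsigned long a0,
				 unsigned long a1, unsigned long a2)
{
	return pv_io_ops.hypercall(nr, a0, a1, a2);
}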

Besides, I've already clearly stated multiple times (including in this
very thread) that I agree that I am not yet sure if the 350ns/1.4%
improvement alone is enough to justify a change.  So if you are somehow
trying to make me feel silly by pointing out the "~1%" above, you are
being ridiculous.

Rather, I was simply answering your question as to whether these latency
sources are a real issue.  The answer is "yes" (assuming you care about
latency) and I gave you a specific example and a method to quantify the
impact.

It is duly noted that you do not care about this type of performance,
but you also need to realize that your "blessing" or
acknowledgment/denial of the problem domain has _zero_ bearing on
whether the domain exists, or if there are others out there that do care
about it.  Sorry.

>
>> for request-response, this is generally for *every* packet since you
>> cannot exploit buffering/deferring.
>>
>> Can you back up your claim that PPC has no difference in performance
>> with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
>> like instructions, but clearly there are ways to cause a trap, so
>> presumably we can measure the difference between a PF exit and something
>> more explicit).
>>   
>
> First, the PPC that KVM supports performs very poorly relatively
> speaking because it receives no hardware assistance

So wouldn't that be making the case that it could use as much help as
possible?

>   this is not the right place to focus wrt optimizations.

Odd choice of words.  I am advocating the opposite (broad solution to
many arches and many platforms (i.e. hypervisors)) and therefore I am
not "focused" on it (or really any one arch) at all per se.  I am
_worried_ however, that we could be overlooking PPC (as an example) if
we ignore the disparity between MMIO and HC since other higher
performance options are not available like PIO.  The goal on this
particular thread is to come up with an IO interface that works
reasonably well across as many hypervisors as possible.  MMIO/PIO do not
appear to fit that bill (at least not without tunneling them over HCs)

If I am guilty of focusing anywhere too much it would be x86 since that
is the only development platform I have readily available.


>
>
> And because there's no hardware assistance, there simply isn't a
> hypercall instruction.  Are PFs the fastest type of exits?  Probably
> not but I honestly have no idea.  I'm sure Hollis does though.
>
> Page faults are going to have tremendously different performance
> characteristics on PPC too because it's a software managed TLB.
> There's no page table lookup like there is on x86.

The difference between MMIO and "HC", and whether it is cause for
concern will continue to be pure speculation until we can find someone
with a PPC box willing to run some numbers.  I will 

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Gregory Haskins
Arnd Bergmann wrote:
> On Saturday 09 May 2009, Benjamin Herrenschmidt wrote:
>   
>> This was shot down by a vast majority of people, with the outcome being
>> an agreement that for IORESOURCE_MEM, pci_iomap and friends must return
>> something that is strictly interchangeable with what ioremap would have
>> returned.
>>
>> That means that readl and writel must work on the output of pci_iomap()
>> and similar, but I don't see why __raw_writel would be excluded there, I
>> think it's in there too.
>> 
>
> One of the ideas was to change pci_iomap to return a special token
> in case of virtual devices that causes iowrite32() to do an hcall,
> and to just define writel() to do iowrite32().
>
> Unfortunately, there is no __raw_iowrite32(), although I guess we
> could add this generically if necessary.
>
>   
>> Direct dereference is illegal in all cases though.
>> 
>
> right.
>  
>   
>> The token returned by pci_iomap for other type of resources (IO for
>> example) is also only supported for use by iomap access functions
>> (ioreadXX/iowriteXX) , and IO ports cannot be passed directly to those
>> neither.
>> 
>
> That still leaves the option to let drivers pass the IORESOURCE_PVIO
> for its own resources under some conditions, meaning that we will
> only use hcalls for I/O on these drivers but not on others, as Chris
> explained earlier.
>   

Between this, and Avi's "nesting" point, this is the direction I am
leaning in right now.

-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-11 Thread Arnd Bergmann
On Saturday 09 May 2009, Benjamin Herrenschmidt wrote:
> This was shot down by a vast majority of people, with the outcome being
> an agreement that for IORESOURCE_MEM, pci_iomap and friends must return
> something that is strictly interchangeable with what ioremap would have
> returned.
> 
> That means that readl and writel must work on the output of pci_iomap()
> and similar, but I don't see why __raw_writel would be excluded there, I
> think it's in there too.

One of the ideas was to change pci_iomap to return a special token
in case of virtual devices that causes iowrite32() to do an hcall,
and to just define writel() to do iowrite32().

Unfortunately, there is no __raw_iowrite32(), although I guess we
could add this generically if necessary.

> Direct dereference is illegal in all cases though.

right.
 
> The token returned by pci_iomap for other type of resources (IO for
> example) is also only supported for use by iomap access functions
> (ioreadXX/iowriteXX) , and IO ports cannot be passed directly to those
> neither.

That still leaves the option to let drivers pass the IORESOURCE_PVIO
for its own resources under some conditions, meaning that we will
only use hcalls for I/O on these drivers but not on others, as Chris
explained earlier.
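
(A rough sketch of how that could look on the guest side.  Everything here --
the token range, HC_MMIO_WRITE, pv_hypercall3(), native_iowrite32() -- is
invented for illustration, not an existing interface:)

/* pci_iomap() would hand back an address inside a reserved token range for
 * resources the driver flagged as paravirt-capable (e.g. IORESOURCE_PVIO),
 * and iowrite32() dispatches on it; writel() is then defined as iowrite32().
 */
#define PVIO_TOKEN_BASE	((void __iomem *)0xffffc90000000000UL)	/* arbitrary */
#define PVIO_TOKEN_SIZE	(1UL << 24)

static inline bool is_pvio_token(void __iomem *addr)
{
	return addr >= PVIO_TOKEN_BASE &&
	       addr <  PVIO_TOKEN_BASE + PVIO_TOKEN_SIZE;
}

void iowrite32(u32 val, void __iomem *addr)
{
	if (is_pvio_token(addr)) {
		/* Hypercall number and argument layout are placeholders. */
		pv_hypercall3(HC_MMIO_WRITE,
			      (unsigned long)(addr - PVIO_TOKEN_BASE),
			      val, sizeof(val));
		return;
	}

	/* Otherwise fall back to the normal MMIO/PIO path. */
	native_iowrite32(val, addr);
}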

Arnd <><


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-10 Thread Anthony Liguori

Gregory Haskins wrote:

Anthony Liguori wrote:
  


  I'm surprised so much effort is going into this, is there any
indication that this is even close to a bottleneck in any circumstance?



Yes.  Each 1us of overhead is a 4% regression in something as trivial as
a 25us UDP/ICMP rtt "ping".
  


It wasn't 1us, it was 350ns or something around there (i.e. ~1%).


for request-response, this is generally for *every* packet since you
cannot exploit buffering/deferring.

Can you back up your claim that PPC has no difference in performance
with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
like instructions, but clearly there are ways to cause a trap, so
presumably we can measure the difference between a PF exit and something
more explicit).
  


First, the PPC that KVM supports performs very poorly relatively 
speaking because it receives no hardware assistance; this is not the 
right place to focus wrt optimizations.


And because there's no hardware assistance, there simply isn't a 
hypercall instruction.  Are PFs the fastest type of exits?  Probably not 
but I honestly have no idea.  I'm sure Hollis does though.


Page faults are going to have tremendously different performance 
characteristics on PPC too because it's a software managed TLB. There's 
no page table lookup like there is on x86.


As a more general observation, we need numbers to justify an 
optimization, not to justify not including an optimization.


In other words, the burden is on you to present a scenario where this 
optimization would result in a measurable improvement in a real world 
work load.



Regards,

Anthony Liguori

We need numbers before we can really decide to abandon this
optimization.  If PPC mmio has no penalty over hypercall, I am not sure
the 350ns on x86 is worth this effort (especially if I can shrink this
with some RCU fixes).  Otherwise, the margin is quite a bit larger.

-Greg



  




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-10 Thread Avi Kivity

Gregory Haskins wrote:


That only works if the device exposes a pio port, and the hypervisor
exposes HC_PIO.  If the device exposes the hypercall, things break
once you assign it.



Well, true.  But normally I would think you would resurface the device
from G1 to G2 anyway, so any relevant transform would also be reflected
in the resurfaced device config.  


We do, but the G1 hypervisor cannot be expected to understand the config 
option that exposes the hypercall.



I suppose if you had a hard
requirement that, say, even the pci-config space was pass-through, this
would be a problem.  I am not sure if that is a realistic environment,
though. 
  


You must pass through the config space, as some of it is device 
specific.  The hypervisor will trap config space accesses, but unless it 
understands them, it cannot modify them.




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-09 Thread Avi Kivity

David S. Ahern wrote:

kvm_stat shows same approximate numbers as with the TSC-->ops/sec
conversions. Interestingly, MMIO writes are not showing up as mmio_exits
in kvm_stat; they are showing up as insn_emulation.
  


That's a bug, mmio_exits ignores mmios that are handled in the kernel.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-09 Thread David S. Ahern


Gregory Haskins wrote:
> Avi Kivity wrote:
>> David S. Ahern wrote:
>>> I ran another test case with SMT disabled, and while I was at it
>>> converted TSC delta to operations/sec. The results without SMT are
>>> confusing -- to me anyways. I'm hoping someone can explain it.
>>> Basically, using a count of 10,000,000 (per your web page) with SMT
>>> disabled the guest detected a soft lockup on the CPU. So, I dropped the
>>> count down to 1,000,000. So, for 1e6 iterations:
>>>
>>> without SMT, with EPT:
>>> HC:   259,455 ops/sec
>>> PIO:  226,937 ops/sec
>>> MMIO: 113,180 ops/sec
>>>
>>> without SMT, without EPT:
>>> HC:   274,825 ops/sec
>>> PIO:  247,910 ops/sec
>>> MMIO: 111,535 ops/sec
>>>
>>> Converting the prior TSC deltas:
>>>
>>> with SMT, with EPT:
>>> HC:994,655 ops/sec
>>> PIO:   875,116 ops/sec
>>> MMIO:  439,738 ops/sec
>>>
>>> with SMT, without EPT:
>>> HC:994,304 ops/sec
>>> PIO:   903,057 ops/sec
>>> MMIO:  423,244 ops/sec
>>>
>>> Running the tests repeatedly I did notice a fair variability (as much as
>>> -10% down from these numbers).
>>>
>>> Also, just to make sure I converted the delta to ops/sec, the formula I
>>> used was cpu_freq / dTSC * count = operations/sec
>>>
>>>   
>> The only thing I can think of is cpu frequency scaling lying about the
>> cpu frequency.  Really the test needs to use time and not the time
>> stamp counter.
>>
>> Are the results expressed in cycles/op more reasonable?
> 
> FWIW: I always used kvm_stat instead of my tsc printk
> 

kvm_stat shows same approximate numbers as with the TSC-->ops/sec
conversions. Interestingly, MMIO writes are not showing up as mmio_exits
in kvm_stat; they are showing up as insn_emulation.

david
> 


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-09 Thread David S. Ahern


Avi Kivity wrote:
> David S. Ahern wrote:
>> I ran another test case with SMT disabled, and while I was at it
>> converted TSC delta to operations/sec. The results without SMT are
>> confusing -- to me anyways. I'm hoping someone can explain it.
>> Basically, using a count of 10,000,000 (per your web page) with SMT
>> disabled the guest detected a soft lockup on the CPU. So, I dropped the
>> count down to 1,000,000. So, for 1e6 iterations:
>>
>> without SMT, with EPT:
>> HC:   259,455 ops/sec
>> PIO:  226,937 ops/sec
>> MMIO: 113,180 ops/sec
>>
>> without SMT, without EPT:
>> HC:   274,825 ops/sec
>> PIO:  247,910 ops/sec
>> MMIO: 111,535 ops/sec
>>
>> Converting the prior TSC deltas:
>>
>> with SMT, with EPT:
>> HC:994,655 ops/sec
>> PIO:   875,116 ops/sec
>> MMIO:  439,738 ops/sec
>>
>> with SMT, without EPT:
>> HC:994,304 ops/sec
>> PIO:   903,057 ops/sec
>> MMIO:  423,244 ops/sec
>>
>> Running the tests repeatedly I did notice a fair variability (as much as
>> -10% down from these numbers).
>>
>> Also, just to make sure I converted the delta to ops/sec, the formula I
>> used was cpu_freq / dTSC * count = operations/sec
>>
>>   
> 
> The only thing I can think of is cpu frequency scaling lying about the
> cpu frequency.  Really the test needs to use time and not the time stamp
> counter.
> 
> Are the results expressed in cycles/op more reasonable?
> 

Power settings seem to be the root cause. With this HP server the SMT
mode must be disabling or overriding a power setting that is enabled in
the bios. I found one power-based knob that gets non-SMT performance
close to SMT numbers. Not very intuitive that SMT/non-SMT can differ so
dramatically.

david


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-09 Thread Gregory Haskins
Anthony Liguori wrote:
> Avi Kivity wrote:
>>
>> Hmm, reminds me of something I thought of a while back.
>>
>> We could implement an 'mmio hypercall' that does mmio reads/writes
>> via a hypercall instead of an mmio operation.  That will speed up
>> mmio for emulated devices (say, e1000).  It's easy to hook into Linux
>> (readl/writel), is pci-friendly, non-x86 friendly, etc.
>
> By the time you get down to userspace for an emulated device, that 2us
> difference between mmio and hypercalls is simply not going to make a
> difference.

I don't care about this path for emulated devices.  I am interested in
in-kernel vbus devices.

>   I'm surprised so much effort is going into this, is there any
> indication that this is even close to a bottleneck in any circumstance?

Yes.  Each 1us of overhead is a 4% regression in something as trivial as
a 25us UDP/ICMP rtt "ping".
>
>
> We have much, much lower hanging fruit to attack.  The basic fact that
> we still copy data multiple times in the networking drivers is clearly
> more significant than a few hundred nanoseconds that should occur less
> than once per packet.
for request-response, this is generally for *every* packet since you
cannot exploit buffering/deferring.

Can you back up your claim that PPC has no difference in performance
with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT"
like instructions, but clearly there are ways to cause a trap, so
presumably we can measure the difference between a PF exit and something
more explicit).

We need numbers before we can really decide to abandon this
optimization.  If PPC mmio has no penalty over hypercall, I am not sure
the 350ns on x86 is worth this effort (especially if I can shrink this
with some RCU fixes).  Otherwise, the margin is quite a bit larger.

-Greg







Re: [RFC PATCH 0/3] generic hypercall support

2009-05-09 Thread Gregory Haskins
Avi Kivity wrote:
> David S. Ahern wrote:
>> I ran another test case with SMT disabled, and while I was at it
>> converted TSC delta to operations/sec. The results without SMT are
>> confusing -- to me anyways. I'm hoping someone can explain it.
>> Basically, using a count of 10,000,000 (per your web page) with SMT
>> disabled the guest detected a soft lockup on the CPU. So, I dropped the
>> count down to 1,000,000. So, for 1e6 iterations:
>>
>> without SMT, with EPT:
>> HC:   259,455 ops/sec
>> PIO:  226,937 ops/sec
>> MMIO: 113,180 ops/sec
>>
>> without SMT, without EPT:
>> HC:   274,825 ops/sec
>> PIO:  247,910 ops/sec
>> MMIO: 111,535 ops/sec
>>
>> Converting the prior TSC deltas:
>>
>> with SMT, with EPT:
>> HC:994,655 ops/sec
>> PIO:   875,116 ops/sec
>> MMIO:  439,738 ops/sec
>>
>> with SMT, without EPT:
>> HC:994,304 ops/sec
>> PIO:   903,057 ops/sec
>> MMIO:  423,244 ops/sec
>>
>> Running the tests repeatedly I did notice a fair variability (as much as
>> -10% down from these numbers).
>>
>> Also, just to make sure I converted the delta to ops/sec, the formula I
>> used was cpu_freq / dTSC * count = operations/sec
>>
>>   
>
> The only thing I can think of is cpu frequency scaling lying about the
> cpu frequency.  Really the test needs to use time and not the time
> stamp counter.
>
> Are the results expressed in cycles/op more reasonable?

FWIW: I always used kvm_stat instead of my tsc printk






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-09 Thread Avi Kivity

David S. Ahern wrote:

I ran another test case with SMT disabled, and while I was at it
converted TSC delta to operations/sec. The results without SMT are
confusing -- to me anyways. I'm hoping someone can explain it.
Basically, using a count of 10,000,000 (per your web page) with SMT
disabled the guest detected a soft lockup on the CPU. So, I dropped the
count down to 1,000,000. So, for 1e6 iterations:

without SMT, with EPT:
HC:   259,455 ops/sec
PIO:  226,937 ops/sec
MMIO: 113,180 ops/sec

without SMT, without EPT:
HC:   274,825 ops/sec
PIO:  247,910 ops/sec
MMIO: 111,535 ops/sec

Converting the prior TSC deltas:

with SMT, with EPT:
HC:994,655 ops/sec
PIO:   875,116 ops/sec
MMIO:  439,738 ops/sec

with SMT, without EPT:
HC:994,304 ops/sec
PIO:   903,057 ops/sec
MMIO:  423,244 ops/sec

Running the tests repeatedly I did notice a fair variability (as much as
-10% down from these numbers).

Also, just to make sure I converted the delta to ops/sec, the formula I
used was cpu_freq / dTSC * count = operations/sec

  


The only thing I can think of is cpu frequency scaling lying about the 
cpu frequency.  Really the test needs to use time and not the time stamp 
counter.


Are the results expressed in cycles/op more reasonable?



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread David S. Ahern


Gregory Haskins wrote:
> David S. Ahern wrote:
>> Marcelo Tosatti wrote:
>>   
>>> On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote:
>>> 
 Marcelo Tosatti wrote:
   
> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
>   
> 
>> Marcelo Tosatti wrote:
>> 
>>   
>>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
>>> it probably sucks much less than what you're seeing.
>>>   
>>>   
>>> 
>> Why would NPT improve mmio?  If anything, it would be worse, since the  
>> processor has to do the nested walk.
>>
>> Of course, these are newer machines, so the absolute results as well as  
>> the difference will be smaller.
>> 
>>   
> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:
>
> NPT enabled:
> test 0: 3088633284634 - 3059375712321 = 29257572313
> test 1: 3121754636397 - 3088633419760 = 33121216637
> test 2: 3204666462763 - 3121754668573 = 82911794190
>
> NPT disabled:
> test 0: 3638061646250 - 3609416811687 = 28644834563
> test 1: 3669413430258 - 3638061771291 = 31351658967
> test 2: 3736287253287 - 3669413463506 = 66873789781
>
>   
> 
 Thanks for running that.  It's interesting to see that NPT was in fact
 worse as Avi predicted.

 Would you mind if I graphed the result and added this data to my wiki? 
 If so, could you adjust the tsc result into IOPs using the proper
 time-base and the test_count you ran with?   I can show a graph with the
 data as is and the relative differences will properly surface..but it
 would be nice to have apples to apples in terms of IOPS units with my
 other run.

 -Greg
   
>>> Please, that'll be nice.
>>>
>>> Quad-Core AMD Opteron(tm) Processor 2358 SE
>>>
>>> host: 2.6.30-rc2
>>> guest: 2.6.29.1-102.fc11.x86_64
>>>
>>> test_count=100, tsc freq=2402882804 Hz
>>>
>>> NPT disabled:
>>>
>>> test 0 = 2771200766
>>> test 1 = 3018726738
>>> test 2 = 6414705418
>>> test 3 = 2890332864
>>>
>>> NPT enabled:
>>>
>>> test 0 = 2908604045
>>> test 1 = 3174687394
>>> test 2 = 7912464804
>>> test 3 = 3046085805
>>>
>>> 
>> DL380 G6, 1-E5540, 6 GB RAM, SMT enabled:
>> host: 2.6.30-rc3
>> guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64
>>
>> with EPT
>> test 0: 543617607291 - 518146439877 = 25471167414
>> test 1: 572568176856 - 543617703004 = 28950473852
>> test 2: 630182158139 - 572568269792 = 57613888347
>>
>>
>> without EPT
>> test 0: 1383532195307 - 1358052032086 = 25480163221
>> test 1: 1411587055210 - 1383532318617 = 28054736593
>> test 2: 1471446356172 - 1411587194600 = 59859161572
>>
>>
>>   
> 
> Thank you kindly, David.
> 
> -Greg

I ran another test case with SMT disabled, and while I was at it
converted TSC delta to operations/sec. The results without SMT are
confusing -- to me anyways. I'm hoping someone can explain it.
Basically, using a count of 10,000,000 (per your web page) with SMT
disabled the guest detected a soft lockup on the CPU. So, I dropped the
count down to 1,000,000. So, for 1e6 iterations:

without SMT, with EPT:
HC:   259,455 ops/sec
PIO:  226,937 ops/sec
MMIO: 113,180 ops/sec

without SMT, without EPT:
HC:   274,825 ops/sec
PIO:  247,910 ops/sec
MMIO: 111,535 ops/sec

Converting the prior TSC deltas:

with SMT, with EPT:
HC:994,655 ops/sec
PIO:   875,116 ops/sec
MMIO:  439,738 ops/sec

with SMT, without EPT:
HC:994,304 ops/sec
PIO:   903,057 ops/sec
MMIO:  423,244 ops/sec

Running the tests repeatedly I did notice a fair variability (as much as
-10% down from these numbers).

Also, just to make sure I converted the delta to ops/sec, the formula I
used was cpu_freq / dTSC * count = operations/sec
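
(Spelled out with illustrative numbers -- the TSC delta below is a made-up
example, not one of the measurements above:)

/* ops/sec = cpu_freq / dTSC * count */
#include <stdio.h>

int main(void)
{
	double cpu_freq = 2.4e9;   /* Hz, nominal clock                    */
	double count    = 1.0e6;   /* iterations in the test               */
	double dtsc     = 9.25e9;  /* example TSC delta over the whole run */

	printf("%.0f ops/sec\n", cpu_freq / dtsc * count);   /* ~259k */
	return 0;
}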


david


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Benjamin Herrenschmidt
On Fri, 2009-05-08 at 00:11 +0200, Arnd Bergmann wrote:
> On Thursday 07 May 2009, Chris Wright wrote:
> > 
> > > Chris, is that issue with the non ioread/iowrite access of a mangled
> > > pointer still an issue here?  I would think so, but I am a bit fuzzy on
> > > whether there is still an issue of non-wrapped MMIO ever occuring.
> > 
> > Arnd was saying it's a bug for other reasons, so perhaps it would work
> > out fine.
> 
> Well, maybe. I only said that __raw_writel and pointer dereference is
> bad, but not writel.

I have only vague recollection of that stuff, but basically, it boiled
down to me attempting to use annotations to differentiate old style
"ioremap" vs. new style iomap and effectively forbid mixing ioremap with
iomap in either direction.

This was shot down by a vast majority of people, with the outcome being
an agreement that for IORESOURCE_MEM, pci_iomap and friends must return
something that is strictly interchangeable with what ioremap would have
returned.

That means that readl and writel must work on the output of pci_iomap()
and similar.  I don't see why __raw_writel would be excluded there; I
think it's included too.

Direct dereference is illegal in all cases though.

The token returned by pci_iomap for other types of resources (IO, for
example) is also only supported for use by the iomap access functions
(ioreadXX/iowriteXX), and IO ports cannot be passed directly to those
either.
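
To make those rules concrete, here is a minimal sketch (an illustration
only, assuming a PCI device with a MEM BAR 0 and an IO BAR 1; setup such
as pci_enable_device() is omitted):

    #include <linux/pci.h>
    #include <linux/io.h>

    static int example_map(struct pci_dev *pdev)
    {
        void __iomem *mem, *io;

        mem = pci_iomap(pdev, 0, 0);        /* IORESOURCE_MEM BAR */
        if (!mem)
            return -ENOMEM;
        io = pci_iomap(pdev, 1, 0);         /* IORESOURCE_IO BAR  */
        if (!io) {
            pci_iounmap(pdev, mem);
            return -ENOMEM;
        }

        writel(1, mem);                     /* ok: interchangeable with ioremap output  */
        iowrite32(1, mem);                  /* ok: iomap accessors work too             */
        /* *(u32 __force *)mem = 1; */      /* wrong: direct dereference is illegal     */

        iowrite32(1, io);                   /* ok: IO-space token, iomap accessors only */
        /* outl(1, (unsigned long)io); */   /* wrong: the token is not a raw port       */

        pci_iounmap(pdev, io);
        pci_iounmap(pdev, mem);
        return 0;
    }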

Cheers,
Ben.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Gregory Haskins
David S. Ahern wrote:
> Marcelo Tosatti wrote:
>   
>> On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote:
>> 
>>> Marcelo Tosatti wrote:
>>>   
 On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
   
 
> Marcelo Tosatti wrote:
> 
>   
>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
>> it probably sucks much less than what you're seeing.
>>   
>>   
>> 
> Why would NPT improve mmio?  If anything, it would be worse, since the  
> processor has to do the nested walk.
>
> Of course, these are newer machines, so the absolute results as well as  
> the difference will be smaller.
> 
>   
 Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:

 NPT enabled:
 test 0: 3088633284634 - 3059375712321 = 29257572313
 test 1: 3121754636397 - 3088633419760 = 33121216637
 test 2: 3204666462763 - 3121754668573 = 82911794190

 NPT disabled:
 test 0: 3638061646250 - 3609416811687 = 28644834563
 test 1: 3669413430258 - 3638061771291 = 31351658967
 test 2: 3736287253287 - 3669413463506 = 66873789781

   
 
>>> Thanks for running that.  Its interesting to see that NPT was in fact
>>> worse as Avi predicted.
>>>
>>> Would you mind if I graphed the result and added this data to my wiki? 
>>> If so, could you adjust the tsc result into IOPs using the proper
>>> time-base and the test_count you ran with?   I can show a graph with the
>>> data as is and the relative differences will properly surface..but it
>>> would be nice to have apples to apples in terms of IOPS units with my
>>> other run.
>>>
>>> -Greg
>>>   
>> Please, that'll be nice.
>>
>> Quad-Core AMD Opteron(tm) Processor 2358 SE
>>
>> host: 2.6.30-rc2
>> guest: 2.6.29.1-102.fc11.x86_64
>>
>> test_count=100, tsc freq=2402882804 Hz
>>
>> NPT disabled:
>>
>> test 0 = 2771200766
>> test 1 = 3018726738
>> test 2 = 6414705418
>> test 3 = 2890332864
>>
>> NPT enabled:
>>
>> test 0 = 2908604045
>> test 1 = 3174687394
>> test 2 = 7912464804
>> test 3 = 3046085805
>>
>> 
>
> DL380 G6, 1-E5540, 6 GB RAM, SMT enabled:
> host: 2.6.30-rc3
> guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64
>
> with EPT
> test 0: 543617607291 - 518146439877 = 25471167414
> test 1: 572568176856 - 543617703004 = 28950473852
> test 2: 630182158139 - 572568269792 = 57613888347
>
>
> without EPT
> test 0: 1383532195307 - 1358052032086 = 25480163221
> test 1: 1411587055210 - 1383532318617 = 28054736593
> test 2: 1471446356172 - 1411587194600 = 59859161572
>
>
>   

Thank you kindly, David.

-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
 And likewise, in both cases, G1 would (should?) know what to do with
 that "address" as it relates to G2, just as it would need to know what
 the PIO address is for.  Typically this would result in some kind of
 translation of that "address", but I suppose even this is completely
 arbitrary and only G1 knows for sure.  E.g. it might translate from
 hypercall vector X to Y similar to your PIO example, it might
 completely
 change transports, or it might terminate locally (e.g. emulated device
 in G1).   IOW: G2 might be using hypercalls to talk to G1, and G1
 might
 be using MMIO to talk to H.  I don't think it matters from a topology
 perspective (though it might from a performance perspective).
 
>>> How can you translate a hypercall?  G1's and H's hypercall mechanisms
>>> can be completely different.
>>> 
>>
>> Well, what I mean is that the hypercall ABI is specific to G2->G1, but
>> the path really looks like G2->(H)->G1 transparently since H gets all
>> the initial exits coming from G2.  But all H has to do is blindly
>> reinject the exit with all the same parameters (e.g. registers,
>> primarily) to the G1-root context.
>>
>> So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT,
>> and does its thing according to the ABI.  Perhaps the ABI for that
>> particular HC-id is a PIOoHC, so it turns around and does a
>> ioread/iowrite PIO, trapping us back to H.
>>
>> So this transform of the HC-id "X" to PIO("Y") is the translation I was
>> referring to.  It could really be anything, though (e.g. HC "X" to HC
>> "Z", if thats what G1s handler for X told it to do)
>>   
>
> That only works if the device exposes a pio port, and the hypervisor
> exposes HC_PIO.  If the device exposes the hypercall, things break
> once you assign it.

Well, true.  But normally I would think you would resurface the device
from G1 to G2 anyway, so any relevant transform would also be reflected
in the resurfaced device config.  I suppose if you had a hard
requirement that, say, even the PCI config space be pass-through, this
would be a problem.  I am not sure that is a realistic environment,
though.
>
>>>   Of course mmio is faster in this case since it traps directly.
>>>
>>> btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a
>>> 0.4us difference will buy us 0.4% reduction in cpu load, so let's see
>>> what's the potential gain here.
>>> 
>>
>> Its more of an issue of execution latency (which translates to IO
>> latency, since "execution" is usually for the specific goal of doing
>> some IO).  In fact, per my own design claims, I try to avoid exits like
>> the plague and generally succeed at making very few of them. ;)
>>
>> So its not really the .4% reduction of cpu use that allures me.  Its the
>> 16% reduction in latency.  Time/discussion will tell if its worth the
>> trouble to use HC or just try to shave more off of PIO.  If we went that
>> route, I am concerned about falling back to MMIO, but Anthony seems to
>> think this is not a real issue.
>>   
>
> You need to use absolute numbers, not percentages off the smallest
> component.  If you want to reduce latency, keep things on the same
> core (IPIs, cache bounces are more expensive than the 200ns we're
> seeing here).
>
>
Ok, so there is no shortage of IO cards that can perform operations on
the order of 10us-15us.  Therefore a 350ns latency (the delta between
PIO and HC) turns into a 2%-3.5% overhead compared to bare-metal (350ns
against a 15us or 10us operation, respectively).  I am not really at
liberty to talk about most of the kinds of applications that might care.
A trivial example might be PTPd clock distribution.

But this is going nowhere.  Based on Anthony's assertion that the MMIO
fallback worry is unfounded, and all the controversy this is causing,
perhaps we should just move on and forget the whole thing.  If I have
to, I will patch the HC code in my own tree.  For now, I will submit a
few patches to clean up the locking on the io_bus.  That may help narrow
the gap without all this stuff, anyway.

-Greg







Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread David S. Ahern


Marcelo Tosatti wrote:
> On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote:
>> Marcelo Tosatti wrote:
>>> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
>>>   
 Marcelo Tosatti wrote:
 
> Also it would be interesting to see the MMIO comparison with EPT/NPT,
> it probably sucks much less than what you're seeing.
>   
>   
 Why would NPT improve mmio?  If anything, it would be worse, since the  
 processor has to do the nested walk.

 Of course, these are newer machines, so the absolute results as well as  
 the difference will be smaller.
 
>>> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:
>>>
>>> NPT enabled:
>>> test 0: 3088633284634 - 3059375712321 = 29257572313
>>> test 1: 3121754636397 - 3088633419760 = 33121216637
>>> test 2: 3204666462763 - 3121754668573 = 82911794190
>>>
>>> NPT disabled:
>>> test 0: 3638061646250 - 3609416811687 = 28644834563
>>> test 1: 3669413430258 - 3638061771291 = 31351658967
>>> test 2: 3736287253287 - 3669413463506 = 66873789781
>>>
>>>   
>> Thanks for running that.  Its interesting to see that NPT was in fact
>> worse as Avi predicted.
>>
>> Would you mind if I graphed the result and added this data to my wiki? 
>> If so, could you adjust the tsc result into IOPs using the proper
>> time-base and the test_count you ran with?   I can show a graph with the
>> data as is and the relative differences will properly surface..but it
>> would be nice to have apples to apples in terms of IOPS units with my
>> other run.
>>
>> -Greg
> 
> Please, that'll be nice.
> 
> Quad-Core AMD Opteron(tm) Processor 2358 SE
> 
> host: 2.6.30-rc2
> guest: 2.6.29.1-102.fc11.x86_64
> 
> test_count=100, tsc freq=2402882804 Hz
> 
> NPT disabled:
> 
> test 0 = 2771200766
> test 1 = 3018726738
> test 2 = 6414705418
> test 3 = 2890332864
> 
> NPT enabled:
> 
> test 0 = 2908604045
> test 1 = 3174687394
> test 2 = 7912464804
> test 3 = 3046085805
> 

DL380 G6, 1-E5540, 6 GB RAM, SMT enabled:
host: 2.6.30-rc3
guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64

with EPT
test 0: 543617607291 - 518146439877 = 25471167414
test 1: 572568176856 - 543617703004 = 28950473852
test 2: 630182158139 - 572568269792 = 57613888347


without EPT
test 0: 1383532195307 - 1358052032086 = 25480163221
test 1: 1411587055210 - 1383532318617 = 28054736593
test 2: 1471446356172 - 1411587194600 = 59859161572


david




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Gregory Haskins
Paul E. McKenney wrote:
> On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote:
>   
>> Marcelo Tosatti wrote:
>> 
>>> On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote:
>>>   
>>>   
 Marcelo Tosatti wrote:
 
 
> I think comparison is not entirely fair. You're using
> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
> (on Intel) to only one register read:
>
> nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>
> Whereas in a real hypercall for (say) PIO you would need the address,
> size, direction and data.
>   
>   
>   
 Well, that's probably one of the reasons pio is slower, as the cpu has  
 to set these up, and the kernel has to read them.

 
 
> Also for PIO/MMIO you're adding this unoptimized lookup to the  
> measurement:
>
> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
> if (pio_dev) {
> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
> complete_pio(vcpu); return 1;
> }
>   
>   
>   
 Since there are only one or two elements in the list, I don't see how it  
 could be optimized.
 
 
>>> speaker_ioport, pit_ioport, pic_ioport and plus nulldev ioport. nulldev 
>>> is probably the last in the io_bus list.
>>>
>>> Not sure if this one matters very much. Point is you should measure the
>>> exit time only, not the pio path vs hypercall path in kvm. 
>>>   
>>>   
>> The problem is the exit time in of itself isnt all that interesting to
>> me.  What I am interested in measuring is how long it takes KVM to
>> process the request and realize that I want to execute function "X". 
>> Ultimately that is what matters in terms of execution latency and is
>> thus the more interesting data.  I think the exit time is possibly an
>> interesting 5th data point, but its more of a side-bar IMO.   In any
>> case, I suspect that both exits will be approximately the same at the
>> VT/SVM level.
>>
>> OTOH: If there is a patch out there to improve KVMs code (say
>> specifically the PIO handling logic), that is fair-game here and we
>> should benchmark it.  For instance, if you have ideas on ways to improve
>> the find_pio_dev performance, etc   One item may be to replace the
>> kvm->lock on the bus scan with an RCU or something (though PIOs are
>> very frequent and the constant re-entry to an an RCU read-side CS may
>> effectively cause a perpetual grace-period and may be too prohibitive). 
>> CC'ing pmck.
>> 
>
> Hello, Greg!
>
> Not a problem.  ;-)
>
> A grace period only needs to wait on RCU read-side critical sections that
> started before the grace period started.  As soon as those pre-existing
> RCU read-side critical get done, the grace period can end, regardless
> of how many RCU read-side critical sections might have started after
> the grace period started.
>
> If you find a situation where huge numbers of RCU read-side critical
> sections do indefinitely delay a grace period, then that is a bug in
> RCU that I need to fix.
>
> Of course, if you have a single RCU read-side critical section that
> runs for a very long time, that -will- delay a grace period.  As long
> as you don't do it too often, this is not a problem, though if running
> a single RCU read-side critical section for more than a few milliseconds
> is probably not a good thing.  Not as bad as holding a heavily contended
> spinlock for a few milliseconds, but still not a good thing.
>   

Hey Paul,
  This makes sense, and it clears up a misconception I had about RCU. 
So thanks for that.

Based on what Paul said, I think we can get some gains in the PIO and
PIOoHC stats from converting to RCU.  I will do this next.
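
A minimal sketch of what such a conversion could look like (an
illustration with made-up names, not the actual kvm io_bus code;
updaters would still serialize on a mutex and call synchronize_rcu()
before freeing a device):

    #include <linux/list.h>
    #include <linux/rculist.h>
    #include <linux/rcupdate.h>

    struct bus_dev {
        unsigned long     addr;
        struct list_head  list;
        void            (*write)(struct bus_dev *dev, unsigned long val);
    };

    static LIST_HEAD(bus_devs);              /* writers: mutex + list_*_rcu() */

    static struct bus_dev *bus_find(unsigned long addr)
    {
        struct bus_dev *dev;

        list_for_each_entry_rcu(dev, &bus_devs, list)
            if (dev->addr == addr)
                return dev;
        return NULL;
    }

    static void bus_write(unsigned long addr, unsigned long val)
    {
        struct bus_dev *dev;

        rcu_read_lock();                     /* cheap; does not block updaters */
        dev = bus_find(addr);
        if (dev)
            dev->write(dev, val);
        rcu_read_unlock();
    }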

-Greg







Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Avi Kivity

Gregory Haskins wrote:

And likewise, in both cases, G1 would (should?) know what to do with
that "address" as it relates to G2, just as it would need to know what
the PIO address is for.  Typically this would result in some kind of
translation of that "address", but I suppose even this is completely
arbitrary and only G1 knows for sure.  E.g. it might translate from
hypercall vector X to Y similar to your PIO example, it might completely
change transports, or it might terminate locally (e.g. emulated device
in G1).   IOW: G2 might be using hypercalls to talk to G1, and G1 might
be using MMIO to talk to H.  I don't think it matters from a topology
perspective (though it might from a performance perspective).
  
  

How can you translate a hypercall?  G1's and H's hypercall mechanisms
can be completely different.



Well, what I mean is that the hypercall ABI is specific to G2->G1, but
the path really looks like G2->(H)->G1 transparently since H gets all
the initial exits coming from G2.  But all H has to do is blindly
reinject the exit with all the same parameters (e.g. registers,
primarily) to the G1-root context.

So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT,
and does its thing according to the ABI.  Perhaps the ABI for that
particular HC-id is a PIOoHC, so it turns around and does a
ioread/iowrite PIO, trapping us back to H.

So this transform of the HC-id "X" to PIO("Y") is the translation I was
referring to.  It could really be anything, though (e.g. HC "X" to HC
"Z", if thats what G1s handler for X told it to do)
  


That only works if the device exposes a pio port, and the hypervisor 
exposes HC_PIO.  If the device exposes the hypercall, things break once 
you assign it.



  Of course mmio is faster in this case since it traps directly.

btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a
0.4us difference will buy us 0.4% reduction in cpu load, so let's see
what's the potential gain here.



Its more of an issue of execution latency (which translates to IO
latency, since "execution" is usually for the specific goal of doing
some IO).  In fact, per my own design claims, I try to avoid exits like
the plague and generally succeed at making very few of them. ;)

So its not really the .4% reduction of cpu use that allures me.  Its the
16% reduction in latency.  Time/discussion will tell if its worth the
trouble to use HC or just try to shave more off of PIO.  If we went that
route, I am concerned about falling back to MMIO, but Anthony seems to
think this is not a real issue.
  


You need to use absolute numbers, not percentages off the smallest 
component.  If you want to reduce latency, keep things on the same core 
(IPIs, cache bounces are more expensive than the 200ns we're seeing here).




--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Avi Kivity

Anthony Liguori wrote:
ia64 uses mmio to emulate pio, so the cost may be different.  I agree 
on x86 it's almost negligible.


Yes, I misunderstood that they actually emulated it like that.  
However, ia64 has no paravirtualization support today so surely, we 
aren't going to be justifying this via ia64, right?




Right.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Anthony Liguori

Gregory Haskins wrote:

Its more of an issue of execution latency (which translates to IO
latency, since "execution" is usually for the specific goal of doing
some IO).  In fact, per my own design claims, I try to avoid exits like
the plague and generally succeed at making very few of them. ;)

So its not really the .4% reduction of cpu use that allures me.  Its the
16% reduction in latency.  Time/discussion will tell if its worth the
trouble to use HC or just try to shave more off of PIO.  If we went that
route, I am concerned about falling back to MMIO, but Anthony seems to
think this is not a real issue.
  


It's only a 16% reduction in latency if your workload is entirely 
dependent on the latency of a hypercall.  What is that workload?  I 
don't think it exists.


For a network driver, I have a hard time believing that anyone cares 
that much about 210ns of latency.  We're getting close to the cost of a 
few dozen instructions here.


Regards,

Anthony Liguori


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Avi Kivity

Marcelo Tosatti wrote:

On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote:
  

The problem is the exit time in of itself isnt all that interesting to
me.  What I am interested in measuring is how long it takes KVM to
process the request and realize that I want to execute function "X". 
Ultimately that is what matters in terms of execution latency and is

thus the more interesting data.  I think the exit time is possibly an
interesting 5th data point, but its more of a side-bar IMO.   In any
case, I suspect that both exits will be approximately the same at the
VT/SVM level.

OTOH: If there is a patch out there to improve KVMs code (say
specifically the PIO handling logic), that is fair-game here and we
should benchmark it.  For instance, if you have ideas on ways to improve
the find_pio_dev performance, etc   





One easy thing to try is to cache the last successful lookup on a
pointer, to improve patterns where there's "device locality" (like
nullio test).
  


We should do that everywhere, memory slots, pio slots, etc.  Or even 
keep statistics on accesses and sort by that.





  


I'd leave it on if I were you.


One item may be to replace the kvm->lock on the bus scan with an RCU
or something (though PIOs are very frequent and the constant
re-entry to an an RCU read-side CS may effectively cause a perpetual
grace-period and may be too prohibitive). CC'ing pmck.



Yes, locking improvements are needed there badly (think for eg the cache
bouncing of kvm->lock _and_ bouncing of kvm->slots_lock on 4-way SMP
guests).
  


There's no reason for kvm->lock on pio.  We should push the locking to 
devices.


I'm going to rename slots_lock as 
slots_lock_please_reimplement_me_using_rcu, this keeps coming up.
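
A minimal sketch of "pushing the locking to devices" (illustrative names
only, not the real kvm_io_device API):

    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct io_dev {
        spinlock_t  lock;        /* protects only this device's state    */
        u32         reg;
    };

    static void io_dev_write(struct io_dev *dev, u32 val)
    {
        spin_lock(&dev->lock);   /* no global kvm->lock on the fast path */
        dev->reg = val;
        spin_unlock(&dev->lock);
    }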



FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that
140 can possibly be recouped.  I currently suspect the lock acquisition
in the iobus-scan is the bulk of that time, but that is admittedly a
guess.  The remaining 200-250ns is elsewhere in the PIO decode.



vmcs_read is significantly expensive
(http://www.mail-archive.com/kvm@vger.kernel.org/msg00840.html,
likely that my measurements were foobar, Avi mentioned 50 cycles for
vmcs_write).
  


IIRC vmcs reads are pretty fast, and are being improved.


See for eg how vmx.c reads VM_EXIT_INTR_INFO twice on every exit.
  


Ugh.

Also this one looks pretty bad for a 32-bit PAE guest (and you can 
get away with the unconditional GUEST_CR3 read too).


/* Access CR3 don't cause VMExit in paging mode, so we need
 * to sync with guest real CR3. */
if (enable_ept && is_paging(vcpu)) {
vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
ept_load_pdptrs(vcpu);
}

  


We should use an accessor here just like with registers and segment 
registers.
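
A minimal sketch of such an accessor, in the style of the existing
register cache (the VCPU_EXREG_CR3 flag is an assumed name; this is a
vmx.c-context snippet, not a complete patch):

    static unsigned long vmx_get_guest_cr3(struct kvm_vcpu *vcpu)
    {
        if (!test_bit(VCPU_EXREG_CR3,
                      (unsigned long *)&vcpu->arch.regs_avail)) {
            vcpu->arch.cr3 = vmcs_readl(GUEST_CR3); /* read once, on demand */
            __set_bit(VCPU_EXREG_CR3,
                      (unsigned long *)&vcpu->arch.regs_avail);
        }
        return vcpu->arch.cr3;
    }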


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Anthony Liguori

Avi Kivity wrote:

Anthony Liguori wrote:


And we're now getting close to the point where the difference is 
virtually meaningless.


At .14us, in order to see 1% CPU overhead added from PIO vs HC, you 
need 71429 exits.




If I read things correctly, you want the difference between PIO and 
PIOoHC, which is 210ns.  But your point stands, 50,000 exits/sec will 
add 1% cpu overhead.


Right, the basic math still stands.



The non-x86 architecture argument isn't valid because other 
architectures either 1) don't use PCI at all (s390) and are already 
using hypercalls 2) use PCI, but do not have a dedicated hypercall 
instruction (PPC emb) or 3) have PIO (ia64).


ia64 uses mmio to emulate pio, so the cost may be different.  I agree 
on x86 it's almost negligible.


Yes, I misunderstood that they actually emulated it like that.  However, 
ia64 has no paravirtualization support today so surely, we aren't going 
to be justifying this via ia64, right?


Regards,

Anthony Liguori



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> Consider nested virtualization where the host (H) runs a guest (G1)
>>> which is itself a hypervisor, running a guest (G2).  The host exposes
>>> a set of virtio (V1..Vn) devices for guest G1.  Guest G1, rather than
>>> creating a new virtio devices and bridging it to one of V1..Vn,
>>> assigns virtio device V1 to guest G2, and prays.
>>>
>>> Now guest G2 issues a hypercall.  Host H traps the hypercall, sees it
>>> originated in G1 while in guest mode, so it injects it into G1.  G1
>>> examines the parameters but can't make any sense of them, so it
>>> returns an error to G2.
>>>
>>> If this were done using mmio or pio, it would have just worked.  With
>>> pio, H would have reflected the pio into G1, G1 would have done the
>>> conversion from G2's port number into G1's port number and reissued
>>> the pio, finally trapped by H and used to issue the I/O. 
>>
>> I might be missing something, but I am not seeing the difference
>> here. We have an "address" (in this case the HC-id) and a context (in
>> this
>> case G1 running in non-root mode).   Whether the  trap to H is a HC or a
>> PIO, the context tells us that it needs to re-inject the same trap to G1
>> for proper handling.  So the "address" is re-injected from H to G1 as an
>> emulated trap to G1s root-mode, and we continue (just like the PIO).
>>   
>
> So far, so good (though in fact mmio can short-circuit G2->H directly).

Yeah, that is a nice trick.  Even though MMIO suffers about 50%
degradation versus an equivalent PIO/HC trap, the PIO/HC side would be
hard-pressed to keep that advantage once you account for all the nested
reinjection on its side of the coin.  I think MMIO would be a fairly
easy win with one level of nesting, and would absolutely trounce
anything deeper.

>
>> And likewise, in both cases, G1 would (should?) know what to do with
>> that "address" as it relates to G2, just as it would need to know what
>> the PIO address is for.  Typically this would result in some kind of
>> translation of that "address", but I suppose even this is completely
>> arbitrary and only G1 knows for sure.  E.g. it might translate from
>> hypercall vector X to Y similar to your PIO example, it might completely
>> change transports, or it might terminate locally (e.g. emulated device
>> in G1).   IOW: G2 might be using hypercalls to talk to G1, and G1 might
>> be using MMIO to talk to H.  I don't think it matters from a topology
>> perspective (though it might from a performance perspective).
>>   
>
> How can you translate a hypercall?  G1's and H's hypercall mechanisms
> can be completely different.

Well, what I mean is that the hypercall ABI is specific to G2->G1, but
the path really looks like G2->(H)->G1 transparently since H gets all
the initial exits coming from G2.  But all H has to do is blindly
reinject the exit with all the same parameters (e.g. registers,
primarily) to the G1-root context.

So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT,
and does its thing according to the ABI.  Perhaps the ABI for that
particular HC-id is a PIOoHC, so it turns around and does a
ioread/iowrite PIO, trapping us back to H.

So this transform of the HC-id "X" to PIO("Y") is the translation I was
referring to.  It could really be anything, though (e.g. HC "X" to HC
"Z", if thats what G1s handler for X told it to do)

>
>
>
>>> So the upshoot is that hypercalls for devices must not be the primary
>>> method of communications; they're fine as an optimization, but we
>>> should always be able to fall back on something else.  We also need to
>>> figure out how G1 can stop V1 from advertising hypercall support.
>>> 
>> I agree it would be desirable to be able to control this exposure.
>> However, I am not currently convinced its strictly necessary because of
>> the reason you mentioned above.  And also note that I am not currently
>> convinced its even possible to control it.
>>
>> For instance, what if G1 is an old KVM, or (dare I say) a completely
>> different hypervisor?  You could control things like whether G1 can see
>> the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who
>> is to say what G1 will expose to G2?  G1 may very well advertise a HC
>> feature bit to G2 which may allow G2 to try to make a VMCALL.  How do
>> you stop that?
>>   
>
> I don't see any way.
>
> If, instead of a hypercall we go through the pio hypercall route, then
> it all resolves itself.  G2 issues a pio hypercall, H bounces it to
> G1, G1 either issues a pio or a pio hypercall depending on what the H
> and G1 negotiated.

Actually, I don't even think it matters what the HC payload is.  It's
governed by the ABI between G1 and G2; H will simply reflect the trap,
so the HC could be of any type, really.

>   Of course mmio is faster in this case since it traps directly.
>
> btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a
> 0.4us difference will buy us 0.4% reduction in cpu load, so let's see
> what's the potential gain here.

Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Avi Kivity

Anthony Liguori wrote:


And we're now getting close to the point where the difference is 
virtually meaningless.


At .14us, in order to see 1% CPU overhead added from PIO vs HC, you 
need 71429 exits.




If I read things correctly, you want the difference between PIO and 
PIOoHC, which is 210ns.  But your point stands, 50,000 exits/sec will 
add 1% cpu overhead.




The non-x86 architecture argument isn't valid because other 
architectures either 1) don't use PCI at all (s390) and are already 
using hypercalls 2) use PCI, but do not have a dedicated hypercall 
instruction (PPC emb) or 3) have PIO (ia64).


ia64 uses mmio to emulate pio, so the cost may be different.  I agree on 
x86 it's almost negligible.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Avi Kivity

Gregory Haskins wrote:

Consider nested virtualization where the host (H) runs a guest (G1)
which is itself a hypervisor, running a guest (G2).  The host exposes
a set of virtio (V1..Vn) devices for guest G1.  Guest G1, rather than
creating a new virtio devices and bridging it to one of V1..Vn,
assigns virtio device V1 to guest G2, and prays.

Now guest G2 issues a hypercall.  Host H traps the hypercall, sees it
originated in G1 while in guest mode, so it injects it into G1.  G1
examines the parameters but can't make any sense of them, so it
returns an error to G2.

If this were done using mmio or pio, it would have just worked.  With
pio, H would have reflected the pio into G1, G1 would have done the
conversion from G2's port number into G1's port number and reissued
the pio, finally trapped by H and used to issue the I/O. 



I might be missing something, but I am not seeing the difference here. 
We have an "address" (in this case the HC-id) and a context (in this

case G1 running in non-root mode).   Whether the  trap to H is a HC or a
PIO, the context tells us that it needs to re-inject the same trap to G1
for proper handling.  So the "address" is re-injected from H to G1 as an
emulated trap to G1s root-mode, and we continue (just like the PIO).
  


So far, so good (though in fact mmio can short-circuit G2->H directly).


And likewise, in both cases, G1 would (should?) know what to do with
that "address" as it relates to G2, just as it would need to know what
the PIO address is for.  Typically this would result in some kind of
translation of that "address", but I suppose even this is completely
arbitrary and only G1 knows for sure.  E.g. it might translate from
hypercall vector X to Y similar to your PIO example, it might completely
change transports, or it might terminate locally (e.g. emulated device
in G1).   IOW: G2 might be using hypercalls to talk to G1, and G1 might
be using MMIO to talk to H.  I don't think it matters from a topology
perspective (though it might from a performance perspective).
  


How can you translate a hypercall?  G1's and H's hypercall mechanisms 
can be completely different.




So the upshot is that hypercalls for devices must not be the primary
method of communications; they're fine as an optimization, but we
should always be able to fall back on something else.  We also need to
figure out how G1 can stop V1 from advertising hypercall support.

I agree it would be desirable to be able to control this exposure. 
However, I am not currently convinced its strictly necessary because of

the reason you mentioned above.  And also note that I am not currently
convinced its even possible to control it.

For instance, what if G1 is an old KVM, or (dare I say) a completely
different hypervisor?  You could control things like whether G1 can see
the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who
is to say what G1 will expose to G2?  G1 may very well advertise a HC
feature bit to G2 which may allow G2 to try to make a VMCALL.  How do
you stop that?
  


I don't see any way.

If, instead of a hypercall, we go through the pio hypercall route, then
it all resolves itself.  G2 issues a pio hypercall, H bounces it to G1,
and G1 either issues a pio or a pio hypercall depending on what H and G1
negotiated.  Of course mmio is faster in this case since it traps directly.


btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a 
0.4us difference will buy us 0.4% reduction in cpu load, so let's see 
what's the potential gain here.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Paul E. McKenney
On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote:
> Marcelo Tosatti wrote:
> > On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote:
> >   
> >> Marcelo Tosatti wrote:
> >> 
> >>> I think comparison is not entirely fair. You're using
> >>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
> >>> (on Intel) to only one register read:
> >>>
> >>> nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
> >>>
> >>> Whereas in a real hypercall for (say) PIO you would need the address,
> >>> size, direction and data.
> >>>   
> >>>   
> >> Well, that's probably one of the reasons pio is slower, as the cpu has  
> >> to set these up, and the kernel has to read them.
> >>
> >> 
> >>> Also for PIO/MMIO you're adding this unoptimized lookup to the  
> >>> measurement:
> >>>
> >>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
> >>> if (pio_dev) {
> >>> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
> >>> complete_pio(vcpu); return 1;
> >>> }
> >>>   
> >>>   
> >> Since there are only one or two elements in the list, I don't see how it  
> >> could be optimized.
> >> 
> >
> > speaker_ioport, pit_ioport, pic_ioport and plus nulldev ioport. nulldev 
> > is probably the last in the io_bus list.
> >
> > Not sure if this one matters very much. Point is you should measure the
> > exit time only, not the pio path vs hypercall path in kvm. 
> >   
> 
> The problem is the exit time in of itself isnt all that interesting to
> me.  What I am interested in measuring is how long it takes KVM to
> process the request and realize that I want to execute function "X". 
> Ultimately that is what matters in terms of execution latency and is
> thus the more interesting data.  I think the exit time is possibly an
> interesting 5th data point, but its more of a side-bar IMO.   In any
> case, I suspect that both exits will be approximately the same at the
> VT/SVM level.
> 
> OTOH: If there is a patch out there to improve KVMs code (say
> specifically the PIO handling logic), that is fair-game here and we
> should benchmark it.  For instance, if you have ideas on ways to improve
> the find_pio_dev performance, etc   One item may be to replace the
> kvm->lock on the bus scan with an RCU or something (though PIOs are
> very frequent and the constant re-entry to an an RCU read-side CS may
> effectively cause a perpetual grace-period and may be too prohibitive). 
> CC'ing pmck.

Hello, Greg!

Not a problem.  ;-)

A grace period only needs to wait on RCU read-side critical sections that
started before the grace period started.  As soon as those pre-existing
RCU read-side critical sections get done, the grace period can end, regardless
of how many RCU read-side critical sections might have started after
the grace period started.

If you find a situation where huge numbers of RCU read-side critical
sections do indefinitely delay a grace period, then that is a bug in
RCU that I need to fix.

Of course, if you have a single RCU read-side critical section that
runs for a very long time, that -will- delay a grace period.  As long
as you don't do it too often, this is not a problem, though running
a single RCU read-side critical section for more than a few milliseconds
is probably not a good thing.  Not as bad as holding a heavily contended
spinlock for a few milliseconds, but still not a good thing.

Thanx, Paul

> FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that
> 140 can possibly be recouped.  I currently suspect the lock acquisition
> in the iobus-scan is the bulk of that time, but that is admittedly a
> guess.  The remaining 200-250ns is elsewhere in the PIO decode.
> 
> -Greg
> 
> 
> 
> 




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Marcelo Tosatti
On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote:
> Marcelo Tosatti wrote:
> > On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
> >   
> >> Marcelo Tosatti wrote:
> >> 
> >>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
> >>> it probably sucks much less than what you're seeing.
> >>>   
> >>>   
> >> Why would NPT improve mmio?  If anything, it would be worse, since the  
> >> processor has to do the nested walk.
> >>
> >> Of course, these are newer machines, so the absolute results as well as  
> >> the difference will be smaller.
> >> 
> >
> > Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:
> >
> > NPT enabled:
> > test 0: 3088633284634 - 3059375712321 = 29257572313
> > test 1: 3121754636397 - 3088633419760 = 33121216637
> > test 2: 3204666462763 - 3121754668573 = 82911794190
> >
> > NPT disabled:
> > test 0: 3638061646250 - 3609416811687 = 28644834563
> > test 1: 3669413430258 - 3638061771291 = 31351658967
> > test 2: 3736287253287 - 3669413463506 = 66873789781
> >
> >   
> Thanks for running that.  Its interesting to see that NPT was in fact
> worse as Avi predicted.
> 
> Would you mind if I graphed the result and added this data to my wiki? 
> If so, could you adjust the tsc result into IOPs using the proper
> time-base and the test_count you ran with?   I can show a graph with the
> data as is and the relative differences will properly surface..but it
> would be nice to have apples to apples in terms of IOPS units with my
> other run.
> 
> -Greg

Please, that'll be nice.

Quad-Core AMD Opteron(tm) Processor 2358 SE

host: 2.6.30-rc2
guest: 2.6.29.1-102.fc11.x86_64

test_count=100, tsc freq=2402882804 Hz

NPT disabled:

test 0 = 2771200766
test 1 = 3018726738
test 2 = 6414705418
test 3 = 2890332864

NPT enabled:

test 0 = 2908604045
test 1 = 3174687394
test 2 = 7912464804
test 3 = 3046085805



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Marcelo Tosatti
On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote:
> The problem is the exit time in of itself isnt all that interesting to
> me.  What I am interested in measuring is how long it takes KVM to
> process the request and realize that I want to execute function "X". 
> Ultimately that is what matters in terms of execution latency and is
> thus the more interesting data.  I think the exit time is possibly an
> interesting 5th data point, but its more of a side-bar IMO.   In any
> case, I suspect that both exits will be approximately the same at the
> VT/SVM level.
> 
> OTOH: If there is a patch out there to improve KVMs code (say
> specifically the PIO handling logic), that is fair-game here and we
> should benchmark it.  For instance, if you have ideas on ways to improve
> the find_pio_dev performance, etc   



One easy thing to try is to cache the last successful lookup on a
pointer, to improve patterns where there's "device locality" (like
nullio test).
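
A minimal sketch of that single-entry cache (an illustration with
made-up names; it assumes the existing bus lock is still held around the
scan, so the cached pointer cannot go stale):

    #include <linux/types.h>

    struct io_dev {
        unsigned long base, len;
    };

    struct io_bus {
        int             dev_count;
        struct io_dev  *devs[8];
        struct io_dev  *last_hit;             /* single-entry lookup cache */
    };

    static int io_dev_match(struct io_dev *dev, unsigned long addr)
    {
        return addr >= dev->base && addr < dev->base + dev->len;
    }

    static struct io_dev *bus_find_dev(struct io_bus *bus, unsigned long addr)
    {
        int i;

        if (bus->last_hit && io_dev_match(bus->last_hit, addr))
            return bus->last_hit;             /* fast path: "device locality" */

        for (i = 0; i < bus->dev_count; i++) {
            if (io_dev_match(bus->devs[i], addr)) {
                bus->last_hit = bus->devs[i];
                return bus->last_hit;
            }
        }
        return NULL;
    }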



> One item may be to replace the kvm->lock on the bus scan with an RCU
> or something (though PIOs are very frequent and the constant
> re-entry to an an RCU read-side CS may effectively cause a perpetual
> grace-period and may be too prohibitive). CC'ing pmck.

Yes, locking improvements are needed there badly (think for eg the cache
bouncing of kvm->lock _and_ bouncing of kvm->slots_lock on 4-way SMP
guests).

> FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that
> 140 can possibly be recouped.  I currently suspect the lock acquisition
> in the iobus-scan is the bulk of that time, but that is admittedly a
> guess.  The remaining 200-250ns is elsewhere in the PIO decode.

vmcs_read is significantly expensive
(http://www.mail-archive.com/kvm@vger.kernel.org/msg00840.html --
though it's likely my measurements there were foobar; Avi mentioned 50
cycles for vmcs_write).

See, for example, how vmx.c reads VM_EXIT_INTR_INFO twice on every exit.
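
The obvious fix is to read it once per exit and reuse the value (a
sketch only; the exit_intr_info cache field is an assumption, not an
existing vmx.c member at the time of this thread):

    static void vmx_cache_exit_info(struct vcpu_vmx *vmx)
    {
        /* single vmcs read per exit */
        vmx->exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
    }

    /* ...later consumers use vmx->exit_intr_info instead of issuing a
     * second vmcs_read32(VM_EXIT_INTR_INFO). */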

Also, this one looks pretty bad for a 32-bit PAE guest (and you could
do away with the unconditional GUEST_CR3 read too).

/* Access CR3 don't cause VMExit in paging mode, so we need
 * to sync with guest real CR3. */
if (enable_ept && is_paging(vcpu)) {
vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
ept_load_pdptrs(vcpu);
}



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Anthony Liguori wrote:
>>  
>>> Gregory Haskins wrote:
>>>
 Today, there is no equivelent of a platform agnostic "iowrite32()" for
 hypercalls so the driver would look like the pseudocode above except
 substitute with kvm_hypercall(), lguest_hypercall(), etc.  The
 proposal
 is to allow the hypervisor to assign a dynamic vector to resources in
 the backend and convey this vector to the guest (such as in PCI
 config-space as mentioned in my example use-case).  The provides the
 "address negotiation" function that would normally be done for
 something
 like a pio port-address.   The hypervisor agnostic driver can then use
 this globally recognized address-token coupled with other
 device-private
 ABI parameters to communicate with the device.  This can all occur
 without the core hypervisor needing to understand the details
 beyond the
 addressing.
 
>>> PCI already provide a hypervisor agnostic interface (via IO
>>> regions). You have a mechanism for devices to discover which regions
>>> they have
>>> allocated and to request remappings.  It's supported by Linux and
>>> Windows.  It works on the vast majority of architectures out there
>>> today.
>>>
>>> Why reinvent the wheel?
>>> 
>>
>> I suspect the current wheel is square.  And the air is out.  Plus its
>> pulling to the left when I accelerate, but to be fair that may be my
>> alignment
>
> No, your wheel is slightly faster on the highway, but doesn't work at
> all off-road.

Heh..

>
> Consider nested virtualization where the host (H) runs a guest (G1)
> which is itself a hypervisor, running a guest (G2).  The host exposes
> a set of virtio (V1..Vn) devices for guest G1.  Guest G1, rather than
> creating a new virtio devices and bridging it to one of V1..Vn,
> assigns virtio device V1 to guest G2, and prays.
>
> Now guest G2 issues a hypercall.  Host H traps the hypercall, sees it
> originated in G1 while in guest mode, so it injects it into G1.  G1
> examines the parameters but can't make any sense of them, so it
> returns an error to G2.
>
> If this were done using mmio or pio, it would have just worked.  With
> pio, H would have reflected the pio into G1, G1 would have done the
> conversion from G2's port number into G1's port number and reissued
> the pio, finally trapped by H and used to issue the I/O. 

I might be missing something, but I am not seeing the difference here.
We have an "address" (in this case the HC-id) and a context (in this
case G1 running in non-root mode).  Whether the trap to H is an HC or a
PIO, the context tells us that it needs to re-inject the same trap to G1
for proper handling.  So the "address" is re-injected from H to G1 as an
emulated trap to G1's root mode, and we continue (just like the PIO).

And likewise, in both cases, G1 would (should?) know what to do with
that "address" as it relates to G2, just as it would need to know what
the PIO address is for.  Typically this would result in some kind of
translation of that "address", but I suppose even this is completely
arbitrary and only G1 knows for sure.  E.g. it might translate from
hypercall vector X to Y similar to your PIO example, it might completely
change transports, or it might terminate locally (e.g. emulated device
in G1).   IOW: G2 might be using hypercalls to talk to G1, and G1 might
be using MMIO to talk to H.  I don't think it matters from a topology
perspective (though it might from a performance perspective).

> With mmio, G1 would have set up G2's page tables to point directly at
> the addresses set up by H, so we would actually have a direct G2->H
> path.  Of course we'd need an emulated iommu so all the memory
> references actually resolve to G2's context.

/me head explodes

>
> So the upshoot is that hypercalls for devices must not be the primary
> method of communications; they're fine as an optimization, but we
> should always be able to fall back on something else.  We also need to
> figure out how G1 can stop V1 from advertising hypercall support.
I agree it would be desirable to be able to control this exposure.
However, I am not currently convinced it's strictly necessary, for the
reason you mentioned above.  I am also not currently convinced it's even
possible to control it.

For instance, what if G1 is an old KVM, or (dare I say) a completely
different hypervisor?  You could control things like whether G1 can see
the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who
is to say what G1 will expose to G2?  G1 may very well advertise an HC
feature bit to G2 which may allow G2 to try to make a VMCALL.  How do
you stop that?

-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Anthony Liguori

Avi Kivity wrote:


Hmm, reminds me of something I thought of a while back.

We could implement an 'mmio hypercall' that does mmio reads/writes via 
a hypercall instead of an mmio operation.  That will speed up mmio for 
emulated devices (say, e1000).  It's easy to hook into Linux 
(readl/writel), is pci-friendly, non-x86 friendly, etc.
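
A minimal sketch of how such a hook might look on the guest side (an
illustration only; HC_MMIO_WRITE and the enable check are assumed names,
and a real version would need to pass a physical/bus address rather than
the __iomem cookie, which is glossed over here):

    static inline void pv_mmio_writel(u32 val, volatile void __iomem *addr)
    {
        if (mmio_hypercall_enabled())         /* hypothetical feature check */
            kvm_hypercall3(HC_MMIO_WRITE,     /* hypothetical hypercall nr  */
                           (unsigned long)addr, val, sizeof(u32));
        else
            writel(val, addr);                /* normal trapped mmio store  */
    }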


By the time you get down to userspace for an emulated device, that 2us 
difference between mmio and hypercalls is simply not going to make a 
difference.  I'm surprised so much effort is going into this, is there 
any indication that this is even close to a bottleneck in any circumstance?


We have much, much lower hanging fruit to attack.  The basic fact that 
we still copy data multiple times in the networking drivers is clearly 
more significant than a few hundred nanoseconds that should occur less 
than once per packet.


Regards,

Anthony Liguori


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Anthony Liguori

Gregory Haskins wrote:

Greg,

I think comparison is not entirely fair.





FYI: I've update the test/wiki to (hopefully) address your concerns.

http://developer.novell.com/wiki/index.php/WhyHypercalls
  


And we're now getting close to the point where the difference is 
virtually meaningless.


At a .14us delta, in order to see 1% CPU overhead added from PIO vs. HC,
you need 71,429 exits per second.


If you have this many exits, the sheer cost of the base vmexit overhead
is going to result in about 15% CPU overhead.  To put this another way,
if your workload were entirely bound by vmexits (which is virtually
impossible), then when you were saturating your CPU at 100%, only 7% of
that would be the cost of PIO exits vs. HC.
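
A back-of-the-envelope check of those figures (a sketch; the 2us base
vmexit cost is an assumed round number, not a measurement from this
thread):

    #include <stdio.h>

    int main(void)
    {
        double delta     = 140e-9;            /* PIO vs HC delta (~.14us)      */
        double exit_cost = 2e-6;              /* assumed base cost per vmexit  */
        double exits     = 0.01 / delta;      /* exits/sec that add 1% CPU     */

        printf("break-even rate: %.0f exits/sec\n", exits);        /* ~71429 */
        printf("base exit overhead at that rate: %.1f%%\n",
               exits * exit_cost * 100);                           /* ~14%   */
        return 0;
    }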


In real-life workloads, if you're paying 15% overhead just for the cost
of exits (not including the cost of heavy-weight exits or post-exit
processing), you're toast.  I think it's going to be very difficult to
construct a real scenario where you'll see a measurable (i.e. > 1%)
performance overhead from using PIO vs. HC.


And in the absence of that, I don't see the justification for adding 
additional infrastructure to Linux to support this.


The non-x86 architecture argument isn't valid because other 
architectures either 1) don't use PCI at all (s390) and are already 
using hypercalls 2) use PCI, but do not have a dedicated hypercall 
instruction (PPC emb) or 3) have PIO (ia64).


Regards,

Anthony Liguori


Regards,
-Greg


  




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Gregory Haskins
Marcelo Tosatti wrote:
> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
>   
>> Marcelo Tosatti wrote:
>> 
>>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
>>> it probably sucks much less than what you're seeing.
>>>   
>>>   
>> Why would NPT improve mmio?  If anything, it would be worse, since the  
>> processor has to do the nested walk.
>>
>> Of course, these are newer machines, so the absolute results as well as  
>> the difference will be smaller.
>> 
>
> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:
>
> NPT enabled:
> test 0: 3088633284634 - 3059375712321 = 29257572313
> test 1: 3121754636397 - 3088633419760 = 33121216637
> test 2: 3204666462763 - 3121754668573 = 82911794190
>
> NPT disabled:
> test 0: 3638061646250 - 3609416811687 = 28644834563
> test 1: 3669413430258 - 3638061771291 = 31351658967
> test 2: 3736287253287 - 3669413463506 = 66873789781
>
>   
Thanks for running that.  It's interesting to see that NPT was in fact
worse, as Avi predicted.

Would you mind if I graphed the result and added this data to my wiki?
If so, could you convert the tsc results into IOPS using the proper
time-base and the test_count you ran with?  I can show a graph with the
data as is, and the relative differences will surface properly, but it
would be nice to have apples to apples, in terms of IOPS units, with my
other run.

-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Marcelo Tosatti
On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> Also it would be interesting to see the MMIO comparison with EPT/NPT,
>> it probably sucks much less than what you're seeing.
>>   
>
> Why would NPT improve mmio?  If anything, it would be worse, since the  
> processor has to do the nested walk.
>
> Of course, these are newer machines, so the absolute results as well as  
> the difference will be smaller.

Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz:

NPT enabled:
test 0: 3088633284634 - 3059375712321 = 29257572313
test 1: 3121754636397 - 3088633419760 = 33121216637
test 2: 3204666462763 - 3121754668573 = 82911794190

NPT disabled:
test 0: 3638061646250 - 3609416811687 = 28644834563
test 1: 3669413430258 - 3638061771291 = 31351658967
test 2: 3736287253287 - 3669413463506 = 66873789781



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Gregory Haskins
Marcelo Tosatti wrote:
> On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote:
>   
>> Chris Wright wrote:
>> 
>>> * Gregory Haskins (ghask...@novell.com) wrote:
>>>   
>>>   
 Chris Wright wrote:
 
 
> VF drivers can also have this issue (and typically use mmio).
> I at least have a better idea what your proposal is, thanks for
> explanation.  Are you able to demonstrate concrete benefit with it yet
> (improved latency numbers for example)?
>   
>   
 I had a test-harness/numbers for this kind of thing, but its a bit
 crufty since its from ~1.5 years ago.  I will dig it up, update it, and
 generate/post new numbers.
 
 
>>> That would be useful, because I keep coming back to pio and shared
>>> page(s) when think of why not to do this.  Seems I'm not alone in that.
>>>
>>> thanks,
>>> -chris
>>>   
>>>   
>> I completed the resurrection of the test and wrote up a little wiki on
>> the subject, which you can find here:
>>
>> http://developer.novell.com/wiki/index.php/WhyHypercalls
>>
>> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
>> reinvent the wheel?" questions.
>>
>> I will include this information when I publish the updated v2 series
>> with the s/hypercall/dynhc changes.
>>
>> Let me know if you have any questions.
>> 
>
> Greg,
>
> I think comparison is not entirely fair.



FYI: I've updated the test/wiki to (hopefully) address your concerns.

http://developer.novell.com/wiki/index.php/WhyHypercalls

Regards,
-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Gregory Haskins
Marcelo Tosatti wrote:
> On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote:
>   
>> Marcelo Tosatti wrote:
>> 
>>> I think comparison is not entirely fair. You're using
>>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
>>> (on Intel) to only one register read:
>>>
>>> nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>>>
>>> Whereas in a real hypercall for (say) PIO you would need the address,
>>> size, direction and data.
>>>   
>>>   
>> Well, that's probably one of the reasons pio is slower, as the cpu has  
>> to set these up, and the kernel has to read them.
>>
>> 
>>> Also for PIO/MMIO you're adding this unoptimized lookup to the  
>>> measurement:
>>>
>>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>>> if (pio_dev) {
>>> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>>> complete_pio(vcpu); return 1;
>>> }
>>>   
>>>   
>> Since there are only one or two elements in the list, I don't see how it  
>> could be optimized.
>> 
>
> speaker_ioport, pit_ioport, pic_ioport and plus nulldev ioport. nulldev 
> is probably the last in the io_bus list.
>
> Not sure if this one matters very much. Point is you should measure the
> exit time only, not the pio path vs hypercall path in kvm. 
>   

The problem is that the exit time in and of itself isn't all that
interesting to me.  What I am interested in measuring is how long it
takes KVM to process the request and realize that I want to execute
function "X".  Ultimately that is what matters in terms of execution
latency and is thus the more interesting data.  I think the exit time is
possibly an interesting 5th data point, but it's more of a side-bar,
IMO.  In any case, I suspect that both exits will be approximately the
same at the VT/SVM level.

OTOH: if there is a patch out there to improve KVM's code (say,
specifically the PIO handling logic), that is fair game here and we
should benchmark it.  For instance, if you have ideas on ways to improve
the find_pio_dev performance, etc.  One item may be to replace the
kvm->lock on the bus scan with RCU or something (though PIOs are very
frequent, and the constant re-entry into an RCU read-side CS may
effectively cause a perpetual grace period and may be too prohibitive).
CC'ing pmck.

FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that
140ns can possibly be recouped.  I currently suspect the lock acquisition
in the io_bus scan is the bulk of that time, but that is admittedly a
guess.  The remaining 200-250ns is elsewhere in the PIO decode.

-Greg








Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Avi Kivity

Marcelo Tosatti wrote:

Also it would be interesting to see the MMIO comparison with EPT/NPT,
it probably sucks much less than what you're seeing.
  
  
Why would NPT improve mmio?  If anything, it would be worse, since the  
processor has to do the nested walk.



I suppose the hardware is much more efficient than walk_addr? There's
all this kmalloc, spinlock, etc overhead in the fault path.
  


mmio still has to do a walk_addr, even with npt.  We don't take the mmu 
lock during walk_addr.





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>
>   
 Ack.  I hope when its all said and done I can convince you that the
 framework to code up those virtio backends in the kernel is vbus ;)
   
>>> If vbus doesn't bring significant performance advantages, I'll prefer
>>> virtio because of existing investment.
>>> 
>>
>> Just to clarify: vbus is just the container/framework for the in-kernel
>> models.  You can implement and deploy virtio devices inside the
>> container (tho I haven't had a chance to sit down and implement one
>> yet).  Note that I did publish a virtio transport in the last few series
>> to demonstrate how that might work, so its just ripe for the picking if
>> someone is so inclined.
>>
>>   
>
> Yeah I keep getting confused over this.
>
>> So really the question is whether you implement the in-kernel virtio
>> backend in vbus, in some other framework, or just do it standalone.
>>   
>
> I prefer the standalone model.  Keep the glue in userspace.

Just to keep the facts straight: "glue in userspace" and "standalone vs
vbus" are independent variables.  E.g. you can have the glue in
userspace for vbus, too.  It's not written that way today for KVM, but
it's moving in that direction as we work through these subtopics like
irqfd, dynhc, etc.

What vbus buys you as a core technology is that you can write one
backend that works "everywhere" (you only need a glue layer for each
environment you want to support).  You might say "I can make my backends
work everywhere too", and to that I would say "by the time you get it to
work, you will have duplicated almost my exact effort on vbus" ;).  Of
course, you may also say "I don't care if it works anywhere else but
KVM", which is a perfectly valid (if not unfortunate) position to take.

I think the confusion is possibly a result of the name "vbus".  The
vbus core isn't really a true bus in the traditional sense.  It's just a
host-side, kernel-based container for these device models.  That is all
I am talking about here.  There is, of course, also an LDM "bus" for
rendering vbus devices in the guest as a function of the current
kvm-specific glue layer I've written.  Note that this glue layer could
render them as PCI in the future, TBD.

-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Gregory Haskins
Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> I think comparison is not entirely fair. You're using
>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
>> (on Intel) to only one register read:
>>
>> nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>>
>> Whereas in a real hypercall for (say) PIO you would need the address,
>> size, direction and data.
>>   
>
> Well, that's probably one of the reasons pio is slower, as the cpu has
> to set these up, and the kernel has to read them.

Right, that was the point I was trying to make.  It's a real-world
measurement of how long it takes KVM to go round-trip for each of the
respective trap types.

>
>> Also for PIO/MMIO you're adding this unoptimized lookup to the
>> measurement:
>>
>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>> if (pio_dev) {
>> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>> complete_pio(vcpu); return 1;
>> }
>>   
>
> Since there are only one or two elements in the list, I don't see how
> it could be optimized.

To Marcelo's point, I think he was more taking exception to the fact
that the HC path was potentially optimized out completely if GCC was
clever about the switch(nr) statement hitting the null vector.  In
theory, both the io_bus lookup and the switch(nr) are about equivalent
in algorithmic complexity (and depth, I should say), which is why I
think the test is "fair" in general.  IOW, each represents the
real-world decode-cycle function for its transport.

However, if one side was artificially optimized simply due to the
triviality of my NULLIO test, that is not fair, and that is the point I
believe he was making.  In any case, I just wrote a new version of the
test which hopefully addresses this by forcing GCC to leave it as a more
real-world decode.  (FYI: I saw no difference in the results.)  I will
update the tarball/wiki shortly.

>
>> Whereas for hypercall measurement you don't. I believe a fair comparison
>> would be have a shared guest/host memory area where you store guest/host
>> TSC values and then do, on guest:
>>
>> rdtscll(&shared_area->guest_tsc);
>> pio/mmio/hypercall
>> ... back to host
>> rdtscll(&shared_area->host_tsc);
>>
>> And then calculate the difference (minus guests TSC_OFFSET of course)?
>>   
>
> I don't understand why you want host tsc?  We're interested in
> round-trip latency, so you want guest tsc all the time.

Yeah, I agree.  My take is that he was just trying to introduce a real
workload so GCC wouldn't do that potential "cheater decode" in the HC
path.  After thinking about it, however, I realized we could accomplish
the same thing with a simple "state++" operation, so the new test does
this in each of the various tests' "execute" cycles.  The timing
calculation remains unchanged.
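
A minimal sketch of what that looks like (names are illustrative; the
actual harness code lives in the tarball referenced on the wiki):

/*
 * Give each transport a cheap but observable "execute" stage so GCC
 * cannot collapse the null path.  The counter is read back at the end
 * of the run, so the increment cannot be optimized away.
 */
struct nulltest {
        unsigned long state;
};

static void nulltest_execute(struct nulltest *t)
{
        t->state++;     /* trivial dynamic work per round trip */
}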

-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Avi Kivity

Gregory Haskins wrote:

   


Ack.  I hope when its all said and done I can convince you that the
framework to code up those virtio backends in the kernel is vbus ;)
  

If vbus doesn't bring significant performance advantages, I'll prefer
virtio because of existing investment.



Just to clarify: vbus is just the container/framework for the in-kernel
models.  You can implement and deploy virtio devices inside the
container (tho I haven't had a chance to sit down and implement one
yet).  Note that I did publish a virtio transport in the last few series
to demonstrate how that might work, so its just ripe for the picking if
someone is so inclined.

  


Yeah I keep getting confused over this.


So really the question is whether you implement the in-kernel virtio
backend in vbus, in some other framework, or just do it standalone.
  


I prefer the standalone model.  Keep the glue in userspace.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Avi Kivity

Gregory Haskins wrote:

Anthony Liguori wrote:
  

Gregory Haskins wrote:


Today, there is no equivelent of a platform agnostic "iowrite32()" for
hypercalls so the driver would look like the pseudocode above except
substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
is to allow the hypervisor to assign a dynamic vector to resources in
the backend and convey this vector to the guest (such as in PCI
config-space as mentioned in my example use-case).  The provides the
"address negotiation" function that would normally be done for something
like a pio port-address.   The hypervisor agnostic driver can then use
this globally recognized address-token coupled with other device-private
ABI parameters to communicate with the device.  This can all occur
without the core hypervisor needing to understand the details beyond the
addressing.
  
  
PCI already provide a hypervisor agnostic interface (via IO regions). 
You have a mechanism for devices to discover which regions they have

allocated and to request remappings.  It's supported by Linux and
Windows.  It works on the vast majority of architectures out there today.

Why reinvent the wheel?



I suspect the current wheel is square.  And the air is out.  Plus its
pulling to the left when I accelerate, but to be fair that may be my
alignment


No, your wheel is slightly faster on the highway, but doesn't work at 
all off-road.


Consider nested virtualization where the host (H) runs a guest (G1) 
which is itself a hypervisor, running a guest (G2).  The host exposes a 
set of virtio (V1..Vn) devices for guest G1.  Guest G1, rather than 
creating a new virtio devices and bridging it to one of V1..Vn, assigns 
virtio device V1 to guest G2, and prays.


Now guest G2 issues a hypercall.  Host H traps the hypercall, sees it 
originated in G1 while in guest mode, so it injects it into G1.  G1 
examines the parameters but can't make any sense of them, so it returns 
an error to G2.


If this were done using mmio or pio, it would have just worked.  With 
pio, H would have reflected the pio into G1, G1 would have done the 
conversion from G2's port number into G1's port number and reissued the 
pio, finally trapped by H and used to issue the I/O.  With mmio, G1 
would have set up G2's page tables to point directly at the addresses 
set up by H, so we would actually have a direct G2->H path.  Of course 
we'd need an emulated iommu so all the memory references actually 
resolve to G2's context.


So the upshot is that hypercalls for devices must not be the primary
method of communication; they're fine as an optimization, but we should
always be able to fall back on something else.  We also need to figure
out how G1 can stop V1 from advertising hypercall support.




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Avi Kivity

Marcelo Tosatti wrote:

I think comparison is not entirely fair. You're using
KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
(on Intel) to only one register read:

nr = kvm_register_read(vcpu, VCPU_REGS_RAX);

Whereas in a real hypercall for (say) PIO you would need the address,
size, direction and data.
  


Well, that's probably one of the reasons pio is slower, as the cpu has 
to set these up, and the kernel has to read them.


Also for PIO/MMIO you're adding this unoptimized lookup to the 
measurement:


pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
if (pio_dev) {
kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
complete_pio(vcpu); 
return 1;

}
  


Since there are only one or two elements in the list, I don't see how it 
could be optimized.



Whereas for hypercall measurement you don't. I believe a fair comparison
would be have a shared guest/host memory area where you store guest/host
TSC values and then do, on guest:

rdtscll(&shared_area->guest_tsc);
pio/mmio/hypercall
... back to host
rdtscll(&shared_area->host_tsc);

And then calculate the difference (minus guests TSC_OFFSET of course)?
  


I don't understand why you want host tsc?  We're interested in 
round-trip latency, so you want guest tsc all the time.




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-08 Thread Avi Kivity

Marcelo Tosatti wrote:

Also it would be interesting to see the MMIO comparison with EPT/NPT,
it probably sucks much less than what you're seeing.
  


Why would NPT improve mmio?  If anything, it would be worse, since the 
processor has to do the nested walk.


Of course, these are newer machines, so the absolute results as well as 
the difference will be smaller.




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Marcelo Tosatti wrote:
> On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote:
>   
>> Also for PIO/MMIO you're adding this unoptimized lookup to the 
>> measurement:
>>
>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
>> if (pio_dev) {
>> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
>> complete_pio(vcpu); 
>> return 1;
>> }
>>
>> Whereas for hypercall measurement you don't. I believe a fair comparison
>> would be have a shared guest/host memory area where you store guest/host
>> TSC values and then do, on guest:
>>
>>  rdtscll(&shared_area->guest_tsc);
>>  pio/mmio/hypercall
>>  ... back to host
>>  rdtscll(&shared_area->host_tsc);
>>
>> And then calculate the difference (minus guests TSC_OFFSET of course)?
>> 
>
> Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest"
> Core2 Xeon 5130 @2.00Ghz, 4GB RAM.
>
> Also it would be interesting to see the MMIO comparison with EPT/NPT,
> it probably sucks much less than what you're seeing.
>
>   

Agreed.  If you or someone on this thread has such a beast, please fire
up my test and post the numbers.

-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Marcelo Tosatti wrote:
> On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote:
>   
>> Chris Wright wrote:
>> 
>>> * Gregory Haskins (ghask...@novell.com) wrote:
>>>   
>>>   
 Chris Wright wrote:
 
 
> VF drivers can also have this issue (and typically use mmio).
> I at least have a better idea what your proposal is, thanks for
> explanation.  Are you able to demonstrate concrete benefit with it yet
> (improved latency numbers for example)?
>   
>   
 I had a test-harness/numbers for this kind of thing, but its a bit
 crufty since its from ~1.5 years ago.  I will dig it up, update it, and
 generate/post new numbers.
 
 
>>> That would be useful, because I keep coming back to pio and shared
>>> page(s) when think of why not to do this.  Seems I'm not alone in that.
>>>
>>> thanks,
>>> -chris
>>>   
>>>   
>> I completed the resurrection of the test and wrote up a little wiki on
>> the subject, which you can find here:
>>
>> http://developer.novell.com/wiki/index.php/WhyHypercalls
>>
>> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
>> reinvent the wheel?" questions.
>>
>> I will include this information when I publish the updated v2 series
>> with the s/hypercall/dynhc changes.
>>
>> Let me know if you have any questions.
>> 
>
> Greg,
>
> I think comparison is not entirely fair. You're using
> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
> (on Intel) to only one register read:
>
> nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
>
> Whereas in a real hypercall for (say) PIO you would need the address,
> size, direction and data.
>   

Hi Marcelo,
   I'll have to respectfully disagree with you here.  What you are
proposing is actually a different test: a 4th type I would call "PIO
over HC" (PIOoHC).  It is distinctly different from the existing MMIO,
PIO, and HC tests already present.

I assert that the current HC test remains valid because for pure
hypercalls, the "nr" *is* the address.  It identifies the function to be
executed (e.g. VAPIC_POLL_IRQ = null), just like the PIO address of my
nullio device identifies the function to be executed (i.e.
nullio_write() = null).

My argument is that the HC test emulates the "dynhc()" concept I have
been talking about, whereas the PIOoHC is more like the
pv_io_ops->iowrite approach.
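
To make the analogy concrete, here are the two decode stages side by
side as fragments (neither is verbatim KVM code; the PIO snippet is the
one quoted earlier in this thread):

/* HC path: 'nr' selects the function, exactly like a port address */
switch (nr) {
case KVM_HC_VAPIC_POLL_IRQ:     /* the "null" vector */
        ret = 0;
        break;
default:
        ret = -KVM_ENOSYS;
        break;
}

/* PIO path: the port selects the device, whose handler is then run */
pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
if (pio_dev)
        kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);  /* e.g. nullio_write() */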

That said, your 4th test type would actually be a very interesting
data-point to add to the suite (especially since we are still kicking
around the notion of doing something like this).  I will update the
patches.


> Also for PIO/MMIO you're adding this unoptimized lookup to the 
> measurement:
>
> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
> if (pio_dev) {
> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
> complete_pio(vcpu); 
> return 1;
> }
>
> Whereas for hypercall measurement you don't.

In theory they should both share about the same algorithmic complexity
in the decode-stage, but due to the possible optimization you mention
you may have a point.  I need to take some steps to ensure the HC path
isn't artificially simplified by GCC (like making the execute stage do
some trivial work like you mention below).

>  I believe a fair comparison
> would be have a shared guest/host memory area where you store guest/host
> TSC values and then do, on guest:
>
>   rdtscll(&shared_area->guest_tsc);
>   pio/mmio/hypercall
>   ... back to host
>   rdtscll(&shared_area->host_tsc);
>
> And then calculate the difference (minus guests TSC_OFFSET of course)?
>
>   
I'm not sure I need that much complexity.  I can probably just change
the test harness to generate an ioread32(), and have the functions
return the TSC value as a return parameter for all test types.  The
important thing is that we pick something extremely cheap (yet dynamic)
to compute so the execution time doesn't invalidate the measurement
granularity with a large constant.
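
A guest-side sketch of that idea (function names are placeholders for
whatever each transport's accessor ends up being):

/*
 * Measure the average round-trip cost of one trap type.  The accessor
 * under test is assumed to return a cheap dynamic value from the host
 * (e.g. a state counter) so the "execute" stage can't be elided.
 */
static inline unsigned long long rdtsc_cycles(void)
{
        unsigned int lo, hi;

        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
}

unsigned long long measure_roundtrip(unsigned int iters)
{
        unsigned long long start, end;
        unsigned int i;

        start = rdtsc_cycles();
        for (i = 0; i < iters; i++)
                (void)nullio_ioread32();        /* pio/mmio/hc under test */
        end = rdtsc_cycles();

        return (end - start) / iters;           /* cycles per round trip */
}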

Regards,
-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Marcelo Tosatti
On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote:
> Also for PIO/MMIO you're adding this unoptimized lookup to the 
> measurement:
> 
> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
> if (pio_dev) {
> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
> complete_pio(vcpu); 
> return 1;
> }
> 
> Whereas for hypercall measurement you don't. I believe a fair comparison
> would be have a shared guest/host memory area where you store guest/host
> TSC values and then do, on guest:
> 
>   rdtscll(&shared_area->guest_tsc);
>   pio/mmio/hypercall
>   ... back to host
>   rdtscll(&shared_area->host_tsc);
> 
> And then calculate the difference (minus guests TSC_OFFSET of course)?

Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest"
Core2 Xeon 5130 @2.00Ghz, 4GB RAM.

Also it would be interesting to see the MMIO comparison with EPT/NPT,
it probably sucks much less than what you're seeing.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Marcelo Tosatti
On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote:
> Chris Wright wrote:
> > * Gregory Haskins (ghask...@novell.com) wrote:
> >   
> >> Chris Wright wrote:
> >> 
> >>> VF drivers can also have this issue (and typically use mmio).
> >>> I at least have a better idea what your proposal is, thanks for
> >>> explanation.  Are you able to demonstrate concrete benefit with it yet
> >>> (improved latency numbers for example)?
> >>>   
> >> I had a test-harness/numbers for this kind of thing, but its a bit
> >> crufty since its from ~1.5 years ago.  I will dig it up, update it, and
> >> generate/post new numbers.
> >> 
> >
> > That would be useful, because I keep coming back to pio and shared
> > page(s) when think of why not to do this.  Seems I'm not alone in that.
> >
> > thanks,
> > -chris
> >   
> 
> I completed the resurrection of the test and wrote up a little wiki on
> the subject, which you can find here:
> 
> http://developer.novell.com/wiki/index.php/WhyHypercalls
> 
> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
> reinvent the wheel?" questions.
> 
> I will include this information when I publish the updated v2 series
> with the s/hypercall/dynhc changes.
> 
> Let me know if you have any questions.

Greg,

I think comparison is not entirely fair. You're using
KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that
(on Intel) to only one register read:

nr = kvm_register_read(vcpu, VCPU_REGS_RAX);

Whereas in a real hypercall for (say) PIO you would need the address,
size, direction and data.

Also for PIO/MMIO you're adding this unoptimized lookup to the 
measurement:

pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
if (pio_dev) {
kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
complete_pio(vcpu); 
return 1;
}

Whereas for hypercall measurement you don't. I believe a fair comparison
would be have a shared guest/host memory area where you store guest/host
TSC values and then do, on guest:

rdtscll(&shared_area->guest_tsc);
pio/mmio/hypercall
... back to host
rdtscll(&shared_area->host_tsc);

And then calculate the difference (minus guests TSC_OFFSET of course)?



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann
On Thursday 07 May 2009, Chris Wright wrote:
> 
> > Chris, is that issue with the non ioread/iowrite access of a mangled
> > pointer still an issue here?  I would think so, but I am a bit fuzzy on
> > whether there is still an issue of non-wrapped MMIO ever occuring.
> 
> Arnd was saying it's a bug for other reasons, so perhaps it would work
> out fine.

Well, maybe. I only said that __raw_writel and pointer dereference is
bad, but not writel.

IIRC when we had that discussion about io-workarounds on powerpc,
the outcome was that passing an IORESOURCE_MEM resource into pci_iomap
must still result in something that can be passed into writel in addition
to iowrite32, while an IORESOURCE_IO resource may or may not be valid for
writel and/or outl.

Unfortunately, this means that either readl/writel needs to be adapted
in some way (e.g. the address also ioremapped to the mangled pointer)
or the mechanism will be limited to I/O space accesses.

Maybe BenH remembers the details better than me.

Arnd <><


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright
* Gregory Haskins (gregory.hask...@gmail.com) wrote:
> Arnd Bergmann wrote:
> > pci_iomap could look at the bus device that the PCI function sits on.
> > If it detects a PCI bridge that has a certain property (config space
> > setting, vendor/device ID, ...), it assumes that the device itself
> > will be emulated and it should set the address flag for IO_COND.
> >
> > This implies that all pass-through devices need to be on a different
> > PCI bridge from the emulated devices, which should be fairly
> > straightforward to enforce.

Hmm, this gets to the grey area of the ABI.  I think this would mean an
upgrade of the host would suddenly break when the mgmt tool does:

(qemu) pci_add pci_addr=0:6 host host=01:10.0

> Thats actually a pretty good idea.
> 
> Chris, is that issue with the non ioread/iowrite access of a mangled
> pointer still an issue here?  I would think so, but I am a bit fuzzy on
> whether there is still an issue of non-wrapped MMIO ever occuring.

Arnd was saying it's a bug for other reasons, so perhaps it would work
out fine.

thanks,
-chris


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Arnd Bergmann wrote:
> On Thursday 07 May 2009, Gregory Haskins wrote:
>   
>> What I am not clear on is how you would know to flag the address to
>> begin with.
>> 
>
> pci_iomap could look at the bus device that the PCI function sits on.
> If it detects a PCI bridge that has a certain property (config space
> setting, vendor/device ID, ...), it assumes that the device itself
> will be emulated and it should set the address flag for IO_COND.
>
> This implies that all pass-through devices need to be on a different
> PCI bridge from the emulated devices, which should be fairly
> straightforward to enforce.
>   

That's actually a pretty good idea.

Chris, is that issue with the non ioread/iowrite access of a mangled
pointer still an issue here?  I would think so, but I am a bit fuzzy on
whether there is still an issue of non-wrapped MMIO ever occurring.

-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann
On Thursday 07 May 2009, Gregory Haskins wrote:
> What I am not clear on is how you would know to flag the address to
> begin with.

pci_iomap could look at the bus device that the PCI function sits on.
If it detects a PCI bridge that has a certain property (config space
setting, vendor/device ID, ...), it assumes that the device itself
will be emulated and it should set the address flag for IO_COND.

This implies that all pass-through devices need to be on a different
PCI bridge from the emulated devices, which should be fairly
straightforward to enforce.
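
A sketch of the detection half of that idea (the bridge IDs and the
IO_COND-style flagging are placeholders, not an existing ABI):

/*
 * Hypothetical check inside pci_iomap(): if the function sits below a
 * special "paravirtual" bridge, mangle the returned cookie so that
 * ioread/iowrite can divert accesses later.
 */
static bool pci_parent_is_pv_bridge(struct pci_dev *dev)
{
        struct pci_dev *bridge = dev->bus->self;

        return bridge &&
               bridge->vendor == PCI_VENDOR_ID_PVBRIDGE &&   /* made up */
               bridge->device == PCI_DEVICE_ID_PVBRIDGE;     /* made up */
}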

Arnd <><


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright
* Gregory Haskins (gregory.hask...@gmail.com) wrote:
> After posting my numbers today, what I *can* tell you definitively that
> its significantly slower to VMEXIT via MMIO.  I guess I do not really
> know the reason for sure. :)

there's certainly more work, including insn decoding


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann
On Thursday 07 May 2009, Arnd Bergmann wrote:
> An easy way to deal with the pass-through case might be to actually use
> __raw_writel there. In guest-to-guest communication, the two sides are
> known to have the same endianess (I assume) and you can still add the
> appropriate smp_mb() and such into the code.

Ok, that was nonsense. I thought you meant pass-through to a memory range
on the host that is potentially shared with other processes or guests.
For pass-through to a real device, it obviously would not work.

Arnd <><


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann
On Thursday 07 May 2009, Gregory Haskins wrote:
> Arnd Bergmann wrote:
> > An mmio that goes through a PF is a bug, it's certainly broken on
> > a number of platforms, so performance should not be an issue there.
> >   
> 
> This may be my own ignorance, but I thought a VMEXIT of type "PF" was
> how MMIO worked in VT/SVM. 

You are right that all MMIOs (and PIO on most non-x86 architectures)
are handled this way in the end. What I meant was that an MMIO that
traps because of a simple pointer dereference as in __raw_writel
is a bug, while any actual writel() call could be diverted to
do an hcall and therefore not cause a PF once the infrastructure
is there.

> I guess the problem that was later pointed out is that we cannot discern
> which devices might be pass-through and therefore should not be
> revectored through a HC.  But I am even less knowledgeable about how
> pass-through works than I am about the MMIO traps, so I might be
> completely off here.

An easy way to deal with the pass-through case might be to actually use
__raw_writel there. In guest-to-guest communication, the two sides are
known to have the same endianess (I assume) and you can still add the
appropriate smp_mb() and such into the code.

Arnd <><


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Chris Wright wrote:
> * Gregory Haskins (gregory.hask...@gmail.com) wrote:
>   
>> What I am not clear on is how you would know to flag the address to
>> begin with.
>> 
>
> That's why I mentioned pv_io_ops->iomap() earlier.  Something I'd expect
> would get called on IORESOURCE_PVIO type.

Yeah, this wasn't clear at the time, but I totally get what you meant
now in retrospect.

-Greg







Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Arnd Bergmann wrote:
> On Thursday 07 May 2009, Gregory Haskins wrote:
>   
>> I guess technically mmio can just be a simple access of the page which
>> would be problematic to trap locally without a PF.  However it seems
>> that most mmio always passes through a ioread()/iowrite() call so this
>> is perhaps the hook point.  If we set the stake in the ground that mmios
>> that go through some other mechanism like PFs can just hit the "slow
>> path" are an acceptable casualty, I think we can make that work.
>>
>> Thoughts?
>> 
>
> An mmio that goes through a PF is a bug, it's certainly broken on
> a number of platforms, so performance should not be an issue there.
>   

This may be my own ignorance, but I thought a VMEXIT of type "PF" was
how MMIO worked in VT/SVM.  I didn't mean to imply that either the guest
or the host takes a traditional PF exception in its respective IDT, if
that is what you thought I meant here.  Rather, the mmio region is
unmapped in the guest MMU, an access causes a VMEXIT to host-side KVM of
type PF, and the host-side code then consults the guest page-table to
see whether it's an MMIO or not.  I could very well be mistaken, as I
have only a cursory understanding of what happens in KVM today with this
path.

After posting my numbers today, what I *can* tell you definitively is
that it's significantly slower to VMEXIT via MMIO.  I guess I do not
really know the reason for sure. :)
> Note that are four commonly used interface classes for PIO/MMIO:
>
> 1. readl/writel: little-endian MMIO
> 2. inl/outl: little-endian PIO
> 3. ioread32/iowrite32: converged little-endian PIO/MMIO
> 4. __raw_readl/__raw_writel: native-endian MMIO without checks
>
> You don't need to worry about the __raw_* stuff, as this should never
> be used in device drivers.
>
> As a simplification, you could mandate that all drivers that want to
> use this get converted to the ioread/iowrite class of interfaces and
> leave the others slow.
>   

I guess the problem that was later pointed out is that we cannot discern
which devices might be pass-through and therefore should not be
revectored through a HC.  But I am even less knowledgeable about how
pass-through works than I am about the MMIO traps, so I might be
completely off here.

In any case, thank you kindly for the suggestions.

Regards,
-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> Oh yes.  But don't call it dynhc - like Chris says it's the wrong
>>> semantic.
>>>
>>> Since we want to connect it to an eventfd, call it HC_NOTIFY or
>>> HC_EVENT or something along these lines.  You won't be able to pass
>>> any data, but that's fine.  Registers are saved to memory anyway.
>>> 
>> Ok, but how would you access the registers since you would presumably
>> only be getting a waitq::func callback on the eventfd.  Or were you
>> saying that more data, if required, is saved in a side-band memory
>> location?  I can see the latter working. 
>
> Yeah.  You basically have that side-band in vbus shmem (or the virtio
> ring).

Ok, got it.
>
>>  I can't wrap my head around
>> the former.
>>   
>
> I only meant that registers aren't faster than memory, since they are
> just another memory location.
>
> In fact registers are accessed through a function call (not that that
> takes any time these days).
>
>
>>> Just to make sure we have everything plumbed down, here's how I see
>>> things working out (using qemu and virtio, use sed to taste):
>>>
>>> 1. qemu starts up, sets up the VM
>>> 2. qemu creates virtio-net-server
>>> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx
>>> ring, one set for rx ring)
>>> 4. qemu connects the six eventfd to the data-available,
>>> data-not-available, and kick ports of virtio-net-server
>>> 5. the guest starts up and configures virtio-net in pci pin mode
>>> 6. qemu notices and decides it will manage interrupts in user space
>>> since this is complicated (shared level triggered interrupts)
>>> 7. the guest OS boots, loads device driver
>>> 8. device driver switches virtio-net to msix mode
>>> 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the
>>> notify fds as notifyfd
>>> 10. look ma, no hands.
>>>
>>> Under the hood, the following takes place.
>>>
>>> kvm wires the irqfds to schedule a work item which fires the
>>> interrupt.  One day the kvm developers get their act together and
>>> change it to inject the interrupt directly when the irqfd is signalled
>>> (which could be from the net softirq or somewhere similarly nasty).
>>>
>>> virtio-net-server wires notifyfd according to its liking.  It may
>>> schedule a thread, or it may execute directly.
>>>
>>> And they all lived happily ever after.
>>> 
>>
>> Ack.  I hope when its all said and done I can convince you that the
>> framework to code up those virtio backends in the kernel is vbus ;)
>
> If vbus doesn't bring significant performance advantages, I'll prefer
> virtio because of existing investment.

Just to clarify: vbus is just the container/framework for the in-kernel
models.  You can implement and deploy virtio devices inside the
container (tho I haven't had a chance to sit down and implement one
yet).  Note that I did publish a virtio transport in the last few series
to demonstrate how that might work, so its just ripe for the picking if
someone is so inclined.

So really the question is whether you implement the in-kernel virtio
backend in vbus, in some other framework, or just do it standalone.

-Greg







Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright
* Gregory Haskins (gregory.hask...@gmail.com) wrote:
> What I am not clear on is how you would know to flag the address to
> begin with.

That's why I mentioned pv_io_ops->iomap() earlier.  Something I'd expect
would get called on IORESOURCE_PVIO type.  This isn't really transparent
though (only virtio devices basically), kind of like you're saying below.

> Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it
> would never be a real device).  This means we route any io requests from
> virtio-pci though pv_io_ops->mmio(), but not unflagged addresses.  This
> is not as slick as boosting *everyones* mmio speed as Avi's original
> idea would have, but it is perhaps a good tradeoff between the entirely
> new namespace created by my original dynhc() proposal and leaving them
> all PF based.
>
> This way, its just like using my dynhc() proposal except the mmio-addr
> is the substitute address-token (instead of the dynhc-vector). 
> Additionally, if you do not PV the kernel the IO_COND/pv_io_op is
> ignored and it just slow-paths through the PF as it does today.  Dynhc()
> would be dependent  on pv_ops.
> 
> Thoughts?


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

Oh yes.  But don't call it dynhc - like Chris says it's the wrong
semantic.

Since we want to connect it to an eventfd, call it HC_NOTIFY or
HC_EVENT or something along these lines.  You won't be able to pass
any data, but that's fine.  Registers are saved to memory anyway.


Ok, but how would you access the registers since you would presumably
only be getting a waitq::func callback on the eventfd.  Or were you
saying that more data, if required, is saved in a side-band memory
location?  I can see the latter working. 


Yeah.  You basically have that side-band in vbus shmem (or the virtio ring).


 I can't wrap my head around
the former.
  


I only meant that registers aren't faster than memory, since they are 
just another memory location.


In fact registers are accessed through a function call (not that that 
takes any time these days).




Just to make sure we have everything plumbed down, here's how I see
things working out (using qemu and virtio, use sed to taste):

1. qemu starts up, sets up the VM
2. qemu creates virtio-net-server
3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx
ring, one set for rx ring)
4. qemu connects the six eventfd to the data-available,
data-not-available, and kick ports of virtio-net-server
5. the guest starts up and configures virtio-net in pci pin mode
6. qemu notices and decides it will manage interrupts in user space
since this is complicated (shared level triggered interrupts)
7. the guest OS boots, loads device driver
8. device driver switches virtio-net to msix mode
9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the
notify fds as notifyfd
10. look ma, no hands.

Under the hood, the following takes place.

kvm wires the irqfds to schedule a work item which fires the
interrupt.  One day the kvm developers get their act together and
change it to inject the interrupt directly when the irqfd is signalled
(which could be from the net softirq or somewhere similarly nasty).

virtio-net-server wires notifyfd according to its liking.  It may
schedule a thread, or it may execute directly.

And they all lived happily ever after.



Ack.  I hope when its all said and done I can convince you that the
framework to code up those virtio backends in the kernel is vbus ;)


If vbus doesn't bring significant performance advantages, I'll prefer 
virtio because of existing investment.



  But
even if not, this should provide enough plumbing that we can all coexist
together peacefully.
  


Yes, vbus and virtio can compete on their merits without bias from some 
maintainer getting in the way.




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> Don't - it's broken.  It will also catch device assignment mmio and
>>> hypercall them.
>>>
>>> 
>> Ah.  Crap.
>>
>> Would you be conducive if I continue along with the dynhc() approach
>> then?
>>   
>
> Oh yes.  But don't call it dynhc - like Chris says it's the wrong
> semantic.
>
> Since we want to connect it to an eventfd, call it HC_NOTIFY or
> HC_EVENT or something along these lines.  You won't be able to pass
> any data, but that's fine.  Registers are saved to memory anyway.
Ok, but how would you access the registers, since you would presumably
only be getting a waitq::func callback on the eventfd?  Or were you
saying that more data, if required, is saved in a side-band memory
location?  I can see the latter working.  I can't wrap my head around
the former.
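
For illustration, the guest side of such a notify-only hypercall could
look roughly like this (HC_NOTIFY and the channel layout are
hypothetical, per the discussion above):

/*
 * All payload lives in the side-band area (vbus shmem or the virtio
 * ring); the hypercall carries only a token identifying which
 * eventfd/channel to kick on the host.
 */
struct notify_channel {
        u32   token;    /* assigned by the host at setup time */
        void *shmem;    /* side-band data area */
};

static void channel_kick(struct notify_channel *chan)
{
        /* descriptor/state updates to chan->shmem, plus barriers, go here */
        kvm_hypercall1(HC_NOTIFY, chan->token);
}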

>
> And btw, given that eventfd and the underlying infrastructure are so
> flexible, it's probably better to go back to your original "irqfd gets
> fd from userspace" just to be consistent everywhere.
>
> (no, I'm not deliberately making you rewrite that patch again and
> again... it's going to be a key piece of infrastructure so I want to
> get it right)

Ok, np.  Actually, now that Davide showed me the waitq::func trick, the
fd technically doesn't even need to be an eventfd per se.  We can just
plain-old "fget()" it and attach via f_op->poll() as I do in v5.  I'll
submit this later today.
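
Roughly, the pattern that trick enables looks like this (a sketch of the
plumbing, not the actual v5 patch):

/*
 * Attach to an arbitrary fd's wait queue via f_op->poll() so that a
 * signal on the fd invokes our callback directly.  Names are
 * illustrative.
 */
struct kick_hook {
        wait_queue_t wait;
        poll_table   pt;
};

static int kick_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
        struct kick_hook *hook = container_of(wait, struct kick_hook, wait);

        /* inject the interrupt / kick the backend using 'hook' here */
        return 0;
}

static void kick_ptable_queue(struct file *file, wait_queue_head_t *wqh,
                              poll_table *pt)
{
        struct kick_hook *hook = container_of(pt, struct kick_hook, pt);

        add_wait_queue(wqh, &hook->wait);
}

static int kick_attach(struct kick_hook *hook, int fd)
{
        struct file *file = fget(fd);

        if (!file)
                return -EBADF;

        init_waitqueue_func_entry(&hook->wait, kick_wakeup);
        init_poll_funcptr(&hook->pt, kick_ptable_queue);
        file->f_op->poll(file, &hook->pt);      /* registers our waitq entry */
        return 0;       /* teardown: remove_wait_queue() + fput() */
}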

>
>
> Just to make sure we have everything plumbed down, here's how I see
> things working out (using qemu and virtio, use sed to taste):
>
> 1. qemu starts up, sets up the VM
> 2. qemu creates virtio-net-server
> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx
> ring, one set for rx ring)
> 4. qemu connects the six eventfd to the data-available,
> data-not-available, and kick ports of virtio-net-server
> 5. the guest starts up and configures virtio-net in pci pin mode
> 6. qemu notices and decides it will manage interrupts in user space
> since this is complicated (shared level triggered interrupts)
> 7. the guest OS boots, loads device driver
> 8. device driver switches virtio-net to msix mode
> 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the
> notify fds as notifyfd
> 10. look ma, no hands.
>
> Under the hood, the following takes place.
>
> kvm wires the irqfds to schedule a work item which fires the
> interrupt.  One day the kvm developers get their act together and
> change it to inject the interrupt directly when the irqfd is signalled
> (which could be from the net softirq or somewhere similarly nasty).
>
> virtio-net-server wires notifyfd according to its liking.  It may
> schedule a thread, or it may execute directly.
>
> And they all lived happily ever after.

Ack.  I hope when it's all said and done I can convince you that the
framework to code up those virtio backends in the kernel is vbus ;)  But
even if not, this should provide enough plumbing that we can all coexist
together peacefully.

Thanks,
-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Arnd Bergmann
On Thursday 07 May 2009, Gregory Haskins wrote:
> I guess technically mmio can just be a simple access of the page which
> would be problematic to trap locally without a PF.  However it seems
> that most mmio always passes through a ioread()/iowrite() call so this
> is perhaps the hook point.  If we set the stake in the ground that mmios
> that go through some other mechanism like PFs can just hit the "slow
> path" are an acceptable casualty, I think we can make that work.
> 
> Thoughts?

An mmio that goes through a PF is a bug, it's certainly broken on
a number of platforms, so performance should not be an issue there.

Note that there are four commonly used interface classes for PIO/MMIO:

1. readl/writel: little-endian MMIO
2. inl/outl: little-endian PIO
3. ioread32/iowrite32: converged little-endian PIO/MMIO
4. __raw_readl/__raw_writel: native-endian MMIO without checks

You don't need to worry about the __raw_* stuff, as this should never
be used in device drivers.

As a simplification, you could mandate that all drivers that want to
use this get converted to the ioread/iowrite class of interfaces and
leave the others slow.
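
As an example of that conversion, a driver that currently dereferences
an ioremap()'d BAR directly would move to something like this (device
specifics are placeholders):

#include <linux/pci.h>
#include <linux/io.h>

static void __iomem *regs;

static int mydev_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int err = pci_enable_device(pdev);

        if (err)
                return err;

        regs = pci_iomap(pdev, 0, 0);   /* BAR 0, full length */
        if (!regs)
                return -ENOMEM;

        iowrite32(0x1, regs + 0x10);    /* e.g. a doorbell register */
        return 0;
}

Only accesses funneled through ioread32()/iowrite32() like this could be
diverted to a fast path; everything else would stay on the slow
trap-and-emulate path.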

Arnd <><


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Avi Kivity wrote:


I think we just past the "too complicated" threshold.



And the "can't spel" threshold in the same sentence.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

Don't - it's broken.  It will also catch device assignment mmio and
hypercall them.



Ah.  Crap.

Would you be conducive if I continue along with the dynhc() approach then?
  


Oh yes.  But don't call it dynhc - like Chris says it's the wrong semantic.

Since we want to connect it to an eventfd, call it HC_NOTIFY or HC_EVENT 
or something along these lines.  You won't be able to pass any data, but 
that's fine.  Registers are saved to memory anyway.


And btw, given that eventfd and the underlying infrastructure are so 
flexible, it's probably better to go back to your original "irqfd gets 
fd from userspace" just to be consistent everywhere.


(no, I'm not deliberately making you rewrite that patch again and 
again... it's going to be a key piece of infrastructure so I want to get 
it right)


Just to make sure we have everything plumbed down, here's how I see 
things working out (using qemu and virtio, use sed to taste):


1. qemu starts up, sets up the VM
2. qemu creates virtio-net-server
3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx 
ring, one set for rx ring)
4. qemu connects the six eventfd to the data-available, 
data-not-available, and kick ports of virtio-net-server

5. the guest starts up and configures virtio-net in pci pin mode
6. qemu notices and decides it will manage interrupts in user space 
since this is complicated (shared level triggered interrupts)

7. the guest OS boots, loads device driver
8. device driver switches virtio-net to msix mode
9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the 
notify fds as notifyfd

10. look ma, no hands.

Under the hood, the following takes place.

kvm wires the irqfds to schedule a work item which fires the interrupt.  
One day the kvm developers get their act together and change it to 
inject the interrupt directly when the irqfd is signalled (which could 
be from the net softirq or somewhere similarly nasty).


virtio-net-server wires notifyfd according to its liking.  It may 
schedule a thread, or it may execute directly.


And they all lived happily ever after.
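
A userspace-side sketch of steps 3, 4 and 9 using eventfd(2); the
kvm/virtio-server hooks are placeholders, since the real irqfd/notifyfd
ABI is exactly what is being designed in this thread (the stopirq fd is
omitted for brevity):

#include <sys/eventfd.h>

/* hypothetical plumbing calls -- stand-ins for whatever ioctls land */
extern int kvm_assign_irqfd(int vm_fd, int irqfd, int msix_vector);
extern int vnet_server_set_notifyfd(int vnet_fd, int notifyfd);

struct ring_fds {
        int irqfd;      /* backend -> guest interrupt */
        int notifyfd;   /* guest kick -> backend */
};

static int setup_ring(int vm_fd, int vnet_fd, int msix_vector,
                      struct ring_fds *r)
{
        r->irqfd    = eventfd(0, 0);
        r->notifyfd = eventfd(0, 0);
        if (r->irqfd < 0 || r->notifyfd < 0)
                return -1;

        if (kvm_assign_irqfd(vm_fd, r->irqfd, msix_vector) < 0)
                return -1;

        return vnet_server_set_notifyfd(vnet_fd, r->notifyfd);
}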



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Chris Wright wrote:
> * Gregory Haskins (ghask...@novell.com) wrote:
>   
>> Chris Wright wrote:
>> 
>>> * Avi Kivity (a...@redhat.com) wrote:
>>>   
 Gregory Haskins wrote:
 
> Cool,  I will code this up and submit it.  While Im at it, Ill run it
> through the "nullio" ringer, too. ;)  It would be cool to see the
> pv-mmio hit that 2.07us number.  I can't think of any reason why this
> will not be the case.
>   
>   
 Don't - it's broken.  It will also catch device assignment mmio and  
 hypercall them.
 
>>> Not necessarily.  It just needs to be creative w/ IO_COND
>>>   
>> Hi Chris,
>>Could you elaborate?  How would you know which pages to hypercall and
>> which to let PF?
>> 
>
> Was just thinking of some ugly mangling of the addr (I'm not entirely
> sure what would work best).
>   
Right, I get the part about flagging the address and then keying off
that flag in IO_COND (like we do for PIO vs MMIO).

What I am not clear on is how you would know to flag the address to
begin with.

Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it
would never be a real device).  This means we route any io requests from
virtio-pci through pv_io_ops->mmio(), but not unflagged addresses.  This
is not as slick as boosting *everyone's* mmio speed as Avi's original
idea would have, but it is perhaps a good tradeoff between the entirely
new namespace created by my original dynhc() proposal and leaving them
all PF based.

This way, it's just like using my dynhc() proposal except the mmio-addr
is the substitute address-token (instead of the dynhc-vector).
Additionally, if you do not PV the kernel, the IO_COND/pv_io_op is
ignored and it just slow-paths through the PF as it does today.  Dynhc()
would be dependent on pv_ops.

Thoughts?
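
A sketch of how that routing might look on the guest side (pv_io_ops,
the flag bit, and the fallback are all placeholders per the proposal
above):

/*
 * Hypothetical iowrite32 wrapper: flagged (paravirtual) addresses go
 * through a pv_op that can issue a hypercall; unflagged addresses keep
 * the existing PF-based slow path.
 */
#define PVIO_ADDR_FLAG  (1UL << 0)      /* IO_COND-style cookie mangling */

struct pv_io_ops {
        void (*mmio_write32)(u32 val, void __iomem *addr);
};

extern struct pv_io_ops pv_io_ops;      /* native ops do a plain writel() */

static inline void pv_iowrite32(u32 val, void __iomem *addr)
{
        if ((unsigned long)addr & PVIO_ADDR_FLAG)
                pv_io_ops.mmio_write32(val, addr);      /* e.g. hypercall */
        else
                iowrite32(val, addr);                   /* unchanged path */
}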

-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Chris Wright wrote:

* Gregory Haskins (ghask...@novell.com) wrote:
  

Chris Wright wrote:


* Avi Kivity (a...@redhat.com) wrote:
  

Gregory Haskins wrote:


Cool,  I will code this up and submit it.  While Im at it, Ill run it
through the "nullio" ringer, too. ;)  It would be cool to see the
pv-mmio hit that 2.07us number.  I can't think of any reason why this
will not be the case.
  
  
Don't - it's broken.  It will also catch device assignment mmio and  
hypercall them.


Not necessarily.  It just needs to be creative w/ IO_COND
  

Hi Chris,
   Could you elaborate?  How would you know which pages to hypercall and
which to let PF?



Was just thinking of some ugly mangling of the addr (I'm not entirely
sure what would work best).
  


I think we just past the "too complicated" threshold.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright
* Gregory Haskins (ghask...@novell.com) wrote:
> Chris Wright wrote:
> > * Avi Kivity (a...@redhat.com) wrote:
> >> Gregory Haskins wrote:
> >>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
> >>> through the "nullio" ringer, too. ;)  It would be cool to see the
> >>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
> >>> will not be the case.
> >>>   
> >> Don't - it's broken.  It will also catch device assignment mmio and  
> >> hypercall them.
> >
> > Not necessarily.  It just needs to be creative w/ IO_COND
> 
> Hi Chris,
>Could you elaborate?  How would you know which pages to hypercall and
> which to let PF?

Was just thinking of some ugly mangling of the addr (I'm not entirely
sure what would work best).


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Chris Wright wrote:
> * Avi Kivity (a...@redhat.com) wrote:
>   
>> Gregory Haskins wrote:
>> 
>>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>>> through the "nullio" ringer, too. ;)  It would be cool to see the
>>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>>> will not be the case.
>>>   
>> Don't - it's broken.  It will also catch device assignment mmio and  
>> hypercall them.
>> 
>
> Not necessarily.  It just needs to be creative w/ IO_COND
>   

Hi Chris,
   Could you elaborate?  How would you know which pages to hypercall and
which to let PF?

-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Chris Wright
* Avi Kivity (a...@redhat.com) wrote:
> Gregory Haskins wrote:
>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>> through the "nullio" ringer, too. ;)  It would be cool to see the
>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>> will not be the case.
>
> Don't - it's broken.  It will also catch device assignment mmio and  
> hypercall them.

Not necessarily.  It just needs to be creative w/ IO_COND


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>
 I guess technically mmio can just be a simple access of the page which
 would be problematic to trap locally without a PF.  However it seems
 that most mmio always passes through a ioread()/iowrite() call so this
 is perhaps the hook point.  If we set the stake in the ground that
 mmios
 that go through some other mechanism like PFs can just hit the "slow
 path" are an acceptable casualty, I think we can make that work.
 
>>> That's my thinking exactly.
>>> 
>>
>> Cool,  I will code this up and submit it.  While Im at it, Ill run it
>> through the "nullio" ringer, too. ;)  It would be cool to see the
>> pv-mmio hit that 2.07us number.  I can't think of any reason why this
>> will not be the case.
>>   
>
> Don't - it's broken.  It will also catch device assignment mmio and
> hypercall them.
>
Ah.  Crap.

Would you be conducive if I continue along with the dynhc() approach then?

-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:

Avi Kivity wrote:
  

Gregory Haskins wrote:


I guess technically mmio can just be a simple access of the page which
would be problematic to trap locally without a PF.  However it seems
that most mmio always passes through a ioread()/iowrite() call so this
is perhaps the hook point.  If we set the stake in the ground that mmios
that go through some other mechanism like PFs can just hit the "slow
path" are an acceptable casualty, I think we can make that work.
  
  

That's my thinking exactly.



Cool,  I will code this up and submit it.  While Im at it, Ill run it
through the "nullio" ringer, too. ;)  It would be cool to see the
pv-mmio hit that 2.07us number.  I can't think of any reason why this
will not be the case.
  


Don't - it's broken.  It will also catch device assignment mmio and 
hypercall them.




Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> I guess technically mmio can just be a simple access of the page which
>> would be problematic to trap locally without a PF.  However it seems
>> that most mmio always passes through a ioread()/iowrite() call so this
>> is perhaps the hook point.  If we set the stake in the ground that mmios
>> that go through some other mechanism like PFs can just hit the "slow
>> path" are an acceptable casualty, I think we can make that work.
>>   
>
> That's my thinking exactly.

Cool, I will code this up and submit it.  While I'm at it, I'll run it
through the "nullio" wringer, too. ;)  It would be cool to see the
pv-mmio hit that 2.07us number.  I can't think of any reason why this
will not be the case.

>
> Note we can cheat further.  kvm already has a "coalesced mmio" feature
> where side-effect-free mmios are collected in the kernel and passed to
> userspace only when some other significant event happens.  We could
> pass those addresses to the guest and let it queue those writes
> itself, avoiding the hypercall completely.
>
> Though it's probably pointless: if the guest is paravirtualized enough
> to have the mmio hypercall, then it shouldn't be using e1000.

Yeah... plus, at least for my vbus purposes, all my guest->host
transitions are explicitly intended to cause side-effects, or I wouldn't
be doing them in the first place ;)  I suspect virtio-pci is exactly the
same.  I.e. the coalescing has already been done at a higher layer for
platforms running "PV" code.

Still a cool feature, tho.

-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:
> I guess technically mmio can just be a simple access of the page which
> would be problematic to trap locally without a PF.  However it seems
> that most mmio always passes through a ioread()/iowrite() call so this
> is perhaps the hook point.  If we set the stake in the ground that mmios
> which go through some other mechanism (like PFs) just hit the "slow path",
> and that this is an acceptable casualty, I think we can make that work.

That's my thinking exactly.

Note we can cheat further.  kvm already has a "coalesced mmio" feature 
where side-effect-free mmios are collected in the kernel and passed to 
userspace only when some other significant event happens.  We could pass 
those addresses to the guest and let it queue those writes itself, 
avoiding the hypercall completely.
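
To make that concrete, this is roughly how userspace opts a region into
the existing coalescing today -- a minimal sketch, assuming vm_fd is an
already-open kvm VM descriptor and with error handling omitted:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Mark a guest-physical range as side-effect-free so kvm batches
 * writes to it in the kernel instead of exiting on every access. */
static int register_coalesced_zone(int vm_fd, __u64 gpa, __u32 len)
{
        struct kvm_coalesced_mmio_zone zone = {
                .addr = gpa,    /* guest-physical start of the region */
                .size = len,    /* length in bytes */
        };

        return ioctl(vm_fd, KVM_REGISTER_COALESCED_MMIO, &zone);
}

The guest-side variant described above would push that queueing into the
guest itself, so the write would not need even the batched exit.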


Though it's probably pointless: if the guest is paravirtualized enough 
to have the mmio hypercall, then it shouldn't be using e1000.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> What do you think of my mmio hypercall?  That will speed up all mmio
>>> to be as fast as a hypercall, and then we can use ordinary mmio/pio
>>> writes to trigger things.
>>>
>>> 
>> I like it!
>>
>> Bigger question is what kind of work goes into making mmio a pv_op (or
>> is this already done)?
>>
>>   
>
> Looks like it isn't there.  But it isn't any different than set_pte -
> convert a write into a hypercall.
>
>

I guess technically mmio can just be a simple access of the page which
would be problematic to trap locally without a PF.  However it seems
that most mmio always passes through a ioread()/iowrite() call so this
is perhaps the hook point.  If we set the stake in the ground that mmios
which go through some other mechanism (like PFs) just hit the "slow path",
and that this is an acceptable casualty, I think we can make that work.

Thoughts?

-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:
>> What do you think of my mmio hypercall?  That will speed up all mmio
>> to be as fast as a hypercall, and then we can use ordinary mmio/pio
>> writes to trigger things.
>
> I like it!
>
> Bigger question is what kind of work goes into making mmio a pv_op (or
> is this already done)?


Looks like it isn't there.  But it isn't any different than set_pte - 
convert a write into a hypercall.



--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> I completed the resurrection of the test and wrote up a little wiki on
>> the subject, which you can find here:
>>
>> http://developer.novell.com/wiki/index.php/WhyHypercalls
>>
>> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
>> reinvent the wheel?" questions.
>>
>> I will include this information when I publish the updated v2 series
>> with the s/hypercall/dynhc changes.
>>
>> Let me know if you have any questions.
>>   
>
> Well, 420 ns is not to be sneezed at.
>
> What do you think of my mmio hypercall?  That will speed up all mmio
> to be as fast as a hypercall, and then we can use ordinary mmio/pio
> writes to trigger things.
>
I like it!

Bigger question is what kind of work goes into making mmio a pv_op (or
is this already done)?

-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:
> I completed the resurrection of the test and wrote up a little wiki on
> the subject, which you can find here:
>
> http://developer.novell.com/wiki/index.php/WhyHypercalls
>
> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
> reinvent the wheel?" questions.
>
> I will include this information when I publish the updated v2 series
> with the s/hypercall/dynhc changes.
>
> Let me know if you have any questions.


Well, 420 ns is not to be sneezed at.

What do you think of my mmio hypercall?  That will speed up all mmio to 
be as fast as a hypercall, and then we can use ordinary mmio/pio writes 
to trigger things.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Gregory Haskins
Chris Wright wrote:
> * Gregory Haskins (ghask...@novell.com) wrote:
>   
>> Chris Wright wrote:
>> 
>>> VF drivers can also have this issue (and typically use mmio).
>>> I at least have a better idea what your proposal is, thanks for
>>> explanation.  Are you able to demonstrate concrete benefit with it yet
>>> (improved latency numbers for example)?
>>>   
>> I had a test-harness/numbers for this kind of thing, but it's a bit
>> crufty since it's from ~1.5 years ago.  I will dig it up, update it, and
>> generate/post new numbers.
>> 
>
> That would be useful, because I keep coming back to pio and shared
> page(s) when I think of why not to do this.  Seems I'm not alone in that.
>
> thanks,
> -chris
>   

I completed the resurrection of the test and wrote up a little wiki on
the subject, which you can find here:

http://developer.novell.com/wiki/index.php/WhyHypercalls

Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
reinvent the wheel?" questions.

I will include this information when I publish the updated v2 series
with the s/hypercall/dynhc changes.

Let me know if you have any questions.

-Greg





Re: [RFC PATCH 0/3] generic hypercall support

2009-05-07 Thread Avi Kivity

Gregory Haskins wrote:
> Chris Wright wrote:
>> * Gregory Haskins (ghask...@novell.com) wrote:
>>> Chris Wright wrote:
>>>> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
>>>> means hypercall number and arg list must be the same in order for code
>>>> to call hypercall() in a hypervisor agnostic way.
>>>
>>> Yes, and that is exactly the intention.  I think it's perhaps the point
>>> you are missing.
>>
>> Yes, I was reading this as purely any hypercall, but it seems a bit
>> more like:
>>  pv_io_ops->iomap()
>>  pv_io_ops->ioread()
>>  pv_io_ops->iowrite()
>
> Right.


Hmm, reminds me of something I thought of a while back.

We could implement an 'mmio hypercall' that does mmio reads/writes via a 
hypercall instead of an mmio operation.  That will speed up mmio for 
emulated devices (say, e1000).  It's easy to hook into Linux 
(readl/writel), is pci-friendly, non-x86 friendly, etc.


It also makes the device work when hypercall support is not available 
(qemu/tcg); you simply fall back on mmio.
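
A rough sketch of the guest-side hook, purely for illustration --
KVM_HC_MMIO and the availability flag are made-up names, and real code
would key the decision off something like a cpuid feature bit:

#include <linux/types.h>
#include <linux/io.h>
#include <asm/kvm_para.h>

#define KVM_HC_MMIO 100                 /* placeholder; no such number is allocated */

static bool mmio_hypercall_available;   /* hypothetical; set at boot */

/* One write path: prefer the (hypothetical) mmio hypercall, and degrade
 * to an ordinary trapped store when the host lacks it (e.g. qemu/tcg). */
static void pv_mmio_write32(u32 val, volatile void __iomem *addr)
{
        if (mmio_hypercall_available)
                kvm_hypercall3(KVM_HC_MMIO, (unsigned long)addr,
                               val, sizeof(val));
        else
                writel(val, addr);
}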


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-06 Thread Chris Wright
* Gregory Haskins (ghask...@novell.com) wrote:
> Chris Wright wrote:
> > VF drivers can also have this issue (and typically use mmio).
> > I at least have a better idea what your proposal is, thanks for
> > explanation.  Are you able to demonstrate concrete benefit with it yet
> > (improved latency numbers for example)?
> 
> I had a test-harness/numbers for this kind of thing, but it's a bit
> crufty since it's from ~1.5 years ago.  I will dig it up, update it, and
> generate/post new numbers.

That would be useful, because I keep coming back to pio and shared
page(s) when I think of why not to do this.  Seems I'm not alone in that.

thanks,
-chris


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-06 Thread Gregory Haskins
Anthony Liguori wrote:
> Gregory Haskins wrote:
>>
>> Today, there is no equivalent of a platform agnostic "iowrite32()" for
>> hypercalls so the driver would look like the pseudocode above except
>> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
>> is to allow the hypervisor to assign a dynamic vector to resources in
>> the backend and convey this vector to the guest (such as in PCI
>> config-space as mentioned in my example use-case).  This provides the
>> "address negotiation" function that would normally be done for something
>> like a pio port-address.   The hypervisor agnostic driver can then use
>> this globally recognized address-token coupled with other device-private
>> ABI parameters to communicate with the device.  This can all occur
>> without the core hypervisor needing to understand the details beyond the
>> addressing.
>>   
>
> PCI already provides a hypervisor agnostic interface (via IO regions). 
> You have a mechanism for devices to discover which regions they have
> allocated and to request remappings.  It's supported by Linux and
> Windows.  It works on the vast majority of architectures out there today.
>
> Why reinvent the wheel?

I suspect the current wheel is square.  And the air is out.  Plus it's
pulling to the left when I accelerate, but to be fair that may be my
alignment :)

But I digress.  See: http://patchwork.kernel.org/patch/21865/

To give PCI proper respect, I think its greatest value add here is the
inherent IRQ routing (which is a huge/difficult component, as I
experienced with dynirq in vbus v1).  Beyond that, however, I think we
can do better.

HTH
-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-06 Thread Anthony Liguori

Gregory Haskins wrote:
> Chris Wright wrote:
>> * Gregory Haskins (gregory.hask...@gmail.com) wrote:
>>> So you would never have someone making a generic
>>> hypercall(KVM_HC_MMU_OP).  I agree.
>>
>> Which is why I think the interface proposal you've made is wrong.
>
> I respectfully disagree.  It's only wrong in that the name chosen for the
> interface was perhaps too broad/vague.  I still believe the concept is
> sound, and the general layering is appropriate.
>
>>   There's
>> already hypercall interfaces w/ specific ABI and semantic meaning (which
>> are typically called directly/indirectly from an existing pv op hook).
>
> Yes, these are different, thus the new interface.
>
>> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
>> means hypercall number and arg list must be the same in order for code
>> to call hypercall() in a hypervisor agnostic way.
>
> Yes, and that is exactly the intention.  I think it's perhaps the point
> you are missing.
>
> I am well aware that historically the things we do over a hypercall
> interface would inherently have meaning only to a specific hypervisor
> (e.g. KVM_HC_MMU_OPS (vector 2) via kvm_hypercall()).  However, this
> doesn't in any way imply that it is the only use for the general
> concept.  It's just the only way they have been exploited to date.
>
> While I acknowledge that the hypervisor certainly must be coordinated
> with their use, in their essence hypercalls are just another form of IO
> joining the ranks of things like MMIO and PIO.  This is an attempt to
> bring them out of the bowels of CONFIG_PARAVIRT to make them a first
> class citizen.
>
> The thing I am building here is really not a general hypercall in the
> broad sense.  Rather, it's a subset of the hypercall vector namespace.
> It is designed specifically for dynamically binding a synchronous call()
> interface to things like virtual devices, and it is therefore these
> virtual device models that define the particular ABI within that
> namespace.  Thus the ABI in question is explicitly independent of the
> underlying hypervisor.  I therefore stand by the proposed design to have
> this interface described above the hypervisor support layer (i.e.
> pv_ops) (albeit with perhaps a better name like "dynamic hypercall" as
> per my later discussion with Avi).
>
> Consider PIO: The hypervisor (or hardware) and OS negotiate a port
> address, but the two end-points are the driver and the device-model (or
> real device).  The driver doesn't have to say:
>
> if (kvm)
>    kvm_iowrite32(addr, ..);
> else if (lguest)
>    lguest_iowrite32(addr, ...);
> else
>    native_iowrite32(addr, ...);
>
> Instead, it just says "iowrite32(addr, ...);" and the address is used to
> route the message appropriately by the platform.  The ABI of that
> message, however, is specific to the driver/device and is not
> interpreted by kvm/lguest/native-hw infrastructure on the way.
>
> Today, there is no equivalent of a platform agnostic "iowrite32()" for
> hypercalls so the driver would look like the pseudocode above except
> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
> is to allow the hypervisor to assign a dynamic vector to resources in
> the backend and convey this vector to the guest (such as in PCI
> config-space as mentioned in my example use-case).  This provides the
> "address negotiation" function that would normally be done for something
> like a pio port-address.   The hypervisor agnostic driver can then use
> this globally recognized address-token coupled with other device-private
> ABI parameters to communicate with the device.  This can all occur
> without the core hypervisor needing to understand the details beyond the
> addressing.


PCI already provides a hypervisor agnostic interface (via IO regions).  
You have a mechanism for devices to discover which regions they have 
allocated and to request remappings.  It's supported by Linux and 
Windows.  It works on the vast majority of architectures out there today.


Why reinvent the wheel?

Regards,

Anthony Liguori


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-06 Thread Gregory Haskins
Chris Wright wrote:
> * Gregory Haskins (ghask...@novell.com) wrote:
>   
>> Chris Wright wrote:
>> 
>>> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t 
>>> count)
>>> means hypercall number and arg list must be the same in order for code
>>> to call hypercall() in a hypervisor agnostic way.
>>>   
>> Yes, and that is exactly the intention.  I think it's perhaps the point
>> you are missing.
>> 
>
> Yes, I was reading this as purely any hypercall, but it seems a bit
> more like:
>  pv_io_ops->iomap()
>  pv_io_ops->ioread()
>  pv_io_ops->iowrite()
>   

Right.

> 
>   
>> Today, there is no equivalent of a platform agnostic "iowrite32()" for
>> hypercalls so the driver would look like the pseudocode above except
>> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
>> is to allow the hypervisor to assign a dynamic vector to resources in
>> the backend and convey this vector to the guest (such as in PCI
>> config-space as mentioned in my example use-case).  This provides the
>> "address negotiation" function that would normally be done for something
>> like a pio port-address.   The hypervisor agnostic driver can then use
>> this globally recognized address-token coupled with other device-private
>> ABI parameters to communicate with the device.  This can all occur
>> without the core hypervisor needing to understand the details beyond the
>> addressing.
>> 
>
> VF drivers can also have this issue (and typically use mmio).
> I at least have a better idea what your proposal is, thanks for
> explanation.  Are you able to demonstrate concrete benefit with it yet
> (improved latency numbers for example)?
>   

I had a test-harness/numbers for this kind of thing, but it's a bit
crufty since it's from ~1.5 years ago.  I will dig it up, update it, and
generate/post new numbers.

Thanks Chris,
-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-06 Thread Chris Wright
* Gregory Haskins (ghask...@novell.com) wrote:
> Chris Wright wrote:
> > But a free-form hypercall(unsigned long nr, unsigned long *args, size_t 
> > count)
> > means hypercall number and arg list must be the same in order for code
> > to call hypercall() in a hypervisor agnostic way.
> 
> Yes, and that is exactly the intention.  I think it's perhaps the point
> you are missing.

Yes, I was reading this as purely any hypercall, but it seems a bit
more like:
 pv_io_ops->iomap()
 pv_io_ops->ioread()
 pv_io_ops->iowrite()
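
Purely as a sketch of that shape (illustrative only -- no pv_io_ops
table exists in the kernel today; it would just mirror how the other
pv_ops tables are laid out):

#include <linux/types.h>
#include <linux/io.h>

struct pv_io_ops {
        void __iomem *(*iomap)(resource_size_t addr, unsigned long size);
        u32  (*ioread)(const void __iomem *addr, int len);
        void (*iowrite)(u32 val, void __iomem *addr, int len);
};

/* The native backend would keep today's ioread32()/iowrite32()
 * behaviour; a kvm or lguest backend would install functions that
 * turn the same calls into hypercalls. */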


> Today, there is no equivalent of a platform agnostic "iowrite32()" for
> hypercalls so the driver would look like the pseudocode above except
> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
> is to allow the hypervisor to assign a dynamic vector to resources in
> the backend and convey this vector to the guest (such as in PCI
> config-space as mentioned in my example use-case).  This provides the
> "address negotiation" function that would normally be done for something
> like a pio port-address.   The hypervisor agnostic driver can then use
> this globally recognized address-token coupled with other device-private
> ABI parameters to communicate with the device.  This can all occur
> without the core hypervisor needing to understand the details beyond the
> addressing.

VF drivers can also have this issue (and typically use mmio).
I at least have a better idea what your proposal is, thanks for
explanation.  Are you able to demonstrate concrete benefit with it yet
(improved latency numbers for example)?

thanks,
-chris


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-05 Thread Gregory Haskins
Chris Wright wrote:
> * Gregory Haskins (gregory.hask...@gmail.com) wrote:
>   
>> So you would never have someone making a generic
>> hypercall(KVM_HC_MMU_OP).  I agree.
>> 
>
> Which is why I think the interface proposal you've made is wrong.

I respectfully disagree.  It's only wrong in that the name chosen for the
interface was perhaps too broad/vague.  I still believe the concept is
sound, and the general layering is appropriate. 

>   There's
> already hypercall interfaces w/ specific ABI and semantic meaning (which
> are typically called directly/indirectly from an existing pv op hook).
>   

Yes, these are different, thus the new interface.

> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
> means hypercall number and arg list must be the same in order for code
> to call hypercall() in a hypervisor agnostic way.
>   

Yes, and that is exactly the intention.  I think it's perhaps the point
you are missing.

I am well aware that historically the things we do over a hypercall
interface would inherently have meaning only to a specific hypervisor
(e.g. KVM_HC_MMU_OPS (vector 2) via kvm_hypercall()).  However, this
doesn't in any way imply that it is the only use for the general
concept.  It's just the only way they have been exploited to date.

While I acknowledge that the hypervisor certainly must be coordinated
with their use, in their essence hypercalls are just another form of IO
joining the ranks of things like MMIO and PIO.  This is an attempt to
bring them out of the bowels of CONFIG_PARAVIRT to make them a first
class citizen. 

The thing I am building here is really not a general hypercall in the
broad sense.  Rather, it's a subset of the hypercall vector namespace.
It is designed specifically for dynamically binding a synchronous call()
interface to things like virtual devices, and it is therefore these
virtual device models that define the particular ABI within that
namespace.  Thus the ABI in question is explicitly independent of the
underlying hypervisor.  I therefore stand by the proposed design to have
this interface described above the hypervisor support layer (i.e.
pv_ops) (albeit with perhaps a better name like "dynamic hypercall" as
per my later discussion with Avi).

Consider PIO: The hypervisor (or hardware) and OS negotiate a port
address, but the two end-points are the driver and the device-model (or
real device).  The driver doesn't have to say:

if (kvm)
   kvm_iowrite32(addr, ..);
else if (lguest)
   lguest_iowrite32(addr, ...);
else
   native_iowrite32(addr, ...);

Instead, it just says "iowrite32(addr, ...);" and the address is used to
route the message appropriately by the platform.  The ABI of that
message, however, is specific to the driver/device and is not
interpreted by kvm/lguest/native-hw infrastructure on the way.

Today, there is no equivalent of a platform agnostic "iowrite32()" for
hypercalls so the driver would look like the pseudocode above except
substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
is to allow the hypervisor to assign a dynamic vector to resources in
the backend and convey this vector to the guest (such as in PCI
config-space as mentioned in my example use-case).  This provides the
"address negotiation" function that would normally be done for something
like a pio port-address.   The hypervisor agnostic driver can then use
this globally recognized address-token coupled with other device-private
ABI parameters to communicate with the device.  This can all occur
without the core hypervisor needing to understand the details beyond the
addressing.

What this means to our interface design is that the only thing the
hypervisor really cares about is the first "nr" parameter.  This acts as
our address-token.  The optional/variable list of args is just payload
as far as the core infrastructure is concerned and are coupled only to
our device ABI.  They were chosen to be an array of ulongs (vs something
like varargs) to reflect the fact that hypercalls are typically passed by
packing registers.
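
To make the driver side concrete, a hypothetical use could look like
this -- the config-space offset and helper names are invented for
illustration, and only the nr/args/count shape comes from the proposal:

#include <linux/pci.h>
#include <linux/kernel.h>

/* The proposed hypervisor-agnostic call (see this series). */
extern long dynhc(unsigned long nr, unsigned long *args, size_t count);

#define MYDEV_CFG_HC_VECTOR  0x40   /* hypothetical config-space slot */

static long mydev_call(struct pci_dev *pdev, unsigned long op,
                       unsigned long param)
{
        u32 vector;
        unsigned long args[2] = { op, param };

        /* The backend advertised its dynamically assigned vector here. */
        pci_read_config_dword(pdev, MYDEV_CFG_HC_VECTOR, &vector);

        /* The hypervisor routes purely on 'vector'; the args are
         * private ABI between this driver and its device model. */
        return dynhc(vector, args, ARRAY_SIZE(args));
}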

Hope this helps,
-Greg






Re: [RFC PATCH 0/3] generic hypercall support

2009-05-05 Thread Chris Wright
* Gregory Haskins (gregory.hask...@gmail.com) wrote:
> So you would never have someone making a generic
> hypercall(KVM_HC_MMU_OP).  I agree.

Which is why I think the interface proposal you've made is wrong.  There's
already hypercall interfaces w/ specific ABI and semantic meaning (which
are typically called directly/indirectly from an existing pv op hook).

But a free-form hypercall(unsigned long nr, unsigned long *args, size_t count)
means hypercall number and arg list must be the same in order for code
to call hypercall() in a hypervisor agnostic way.

The pv_ops level needs to have semantic meaning, not be a free-form
hypercall multiplexor.

thanks,
-chris


Re: [RFC PATCH 0/3] generic hypercall support

2009-05-05 Thread Avi Kivity

Gregory Haskins wrote:
> I see.  I had designed it slightly differently, where KVM could assign any
> top level vector it wanted and thus that drove the guest-side interface
> you see here to be more "generic hypercall".  However, I think your
> proposal is perfectly fine too and it makes sense to more narrowly focus
> these calls as specifically "dynamic"...as those are the only vectors
> we could technically use like this anyway.
>
> So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC"
> to kvm_para.h, and I will change the interface to follow suit (something
> like s/hypercall/dynhc).  Sound good?


Yeah.

Another couple of points:

- on the host side, we'd rig this to hit an eventfd.  Nothing stops us
from rigging pio to hit an eventfd as well, giving us kernel handling
for pio trigger points (a rough sketch of that wiring follows at the end
of this message).
- pio actually has an advantage over hypercalls with nested guests.  
Since hypercalls don't have an associated port number, the lowermost 
hypervisor must interpret a hypercall as going to a guest's hypervisor, 
and not any lower-level hypervisors.  What it boils down to is that you 
cannot use device assignment to give a guest access to a virtio/vbus 
device from a lower level hypervisor.


(Bah, that's totally unreadable.  What I want is

instead of

  hypervisor[eth0/virtio-server] > intermediate[virtio-driver/virtio-server] > guest[virtio-driver]

do

  hypervisor[eth0/virtio-server] > intermediate[assign virtio device] > guest[virtio-driver]

well, it's probably still unreadable)
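
To sketch the host-side eventfd wiring from the first point above (the
struct and ioctl below are invented names, shown only to give the shape;
nothing like them exists yet):

#include <linux/types.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

struct kvm_trigger_eventfd {
        __u64 addr;     /* pio port or dynhc vector to match on */
        __u32 flags;    /* e.g. distinguish pio vs. dynhc triggers */
        __s32 fd;       /* eventfd to signal when the guest hits it */
};

#define KVM_SET_TRIGGER_EVENTFD 0       /* placeholder, not a real ioctl */

static int wire_trigger(int vm_fd, __u64 addr, __u32 flags)
{
        struct kvm_trigger_eventfd req;
        int efd = eventfd(0, 0);        /* counter the kernel can signal */

        req.addr  = addr;
        req.flags = flags;
        req.fd    = efd;

        /* Guest accesses to 'addr' now bump 'efd' instead of forcing a
         * heavyweight exit to userspace; an in-kernel backend (or a
         * userspace thread) simply waits on the fd. */
        ioctl(vm_fd, KVM_SET_TRIGGER_EVENTFD, &req);
        return efd;
}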

--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 0/3] generic hypercall support

2009-05-05 Thread Gregory Haskins
Gregory Haskins wrote:
> So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC"
> to kvm_para.h, and I will change the interface to follow suit (something
> like s/hypercall/dynhc).  Sound good?
>   

A small ramification of this change will be that I will need to do
something like add a feature-bit to cpuid for detecting if HC_DYNAMIC is
supported on the backend or not.  The current v1 design doesn't suffer
from this requirement because the presence of the dynamic vector itself
is enough to know it's supported.  I like Avi's proposal enough to say
that it's worth this minor inconvenience, but FYI I will have to
additionally submit a userspace patch for v2 if we go this route.
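
A sketch of what the guest-side check could look like -- KVM_FEATURE_DYNHC
is a hypothetical bit in the KVM_CPUID_FEATURES leaf, which is exactly why
the extra userspace patch is needed:

#include <asm/kvm_para.h>

#define KVM_FEATURE_DYNHC 7     /* placeholder bit, chosen only for illustration */

static bool dynhc_available(void)
{
        if (!kvm_para_available())
                return false;   /* not running under kvm at all */

        /* Reads the KVM_CPUID_FEATURES leaf exposed by the host. */
        return kvm_para_has_feature(KVM_FEATURE_DYNHC);
}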

-Greg







Re: [RFC PATCH 0/3] generic hypercall support

2009-05-05 Thread Gregory Haskins
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>  
>>> Gregory Haskins wrote:
>>>
>>>> (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a)
>>>>
>>>> Please see patch 1/3 for a description.  This has been tested with
>>>> a KVM guest on x86_64 and appears to work properly.  Comments, please.
>>>>
>>> What about the hypercalls in include/asm/kvm_para.h?
>>>
>>> In general, hypercalls cannot be generic since each hypervisor
>>> implements its own ABI.
>>> 
>> Please see the prologue to 1/3.  It's all described there, including a
>> use case which I think answers your questions.  If there is still
>> ambiguity, let me know.
>>
>>   
>
> Yeah, sorry.
>
>>>   The abstraction needs to be at a higher level (pv_ops is such a
>>> level).
>>> 
>> Yep, agreed.  That's exactly what this series is doing, actually.
>>   
>
> No, it doesn't.  It makes "making hypercalls" a pv_op, but hypervisors
> don't implement the same ABI.
Yes, that is true, but I think the issue right now is more of
semantics.  I think we are on the same page.

So you would never have someone making a generic
hypercall(KVM_HC_MMU_OP).  I agree.  What I am proposing here is more
akin to PIO-BAR + iowrite()/ioread().  E.g. the infrastructure sets up
the "addressing" (where in PIO this is literally an address, and for
hypercalls this is a vector), but the "device" defines the ABI at that
address.  So its really the "device end-point" that is defining the ABI
here, not the hypervisor (per se), and that's why I thought it's ok to
declare these "generic".  But to your point below...

>
> pv_ops all _use_ hypercalls to implement higher level operations, like
> set_pte (probably the only place set_pte can be considered a high
> level operation).
>
> In this case, the higher level event could be
> hypervisor_dynamic_event(number); each pv_ops implementation would use
> its own hypercalls to implement that.

I see.  I had designed it slightly differently, where KVM could assign any
top level vector it wanted and thus that drove the guest-side interface
you see here to be more "generic hypercall".  However, I think your
proposal is perfectly fine too and it makes sense to more narrowly focus
these calls as specifically "dynamic"...as those are the only vectors
we could technically use like this anyway.

So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC"
to kvm_para.h, and I will change the interface to follow suit (something
like s/hypercall/dynhc).  Sound good?
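
For concreteness, the shape could be something like the following sketch
(pv_hcall_ops and the register-packing limit are illustrative; only
KVM_HC_DYNAMIC itself is what is being proposed here):

#include <linux/types.h>
#include <linux/errno.h>
#include <asm/kvm_para.h>

#define KVM_HC_DYNAMIC 4        /* placeholder; real value to be allocated in kvm_para.h */

/* Hypothetical pv hook; each hypervisor supplies its own backend. */
struct pv_hcall_ops {
        long (*dynamic_event)(unsigned long nr,
                              unsigned long *args, size_t count);
};

/* kvm backend: pack the args into registers behind KVM_HC_DYNAMIC. */
static long kvm_dynamic_event(unsigned long nr,
                              unsigned long *args, size_t count)
{
        switch (count) {
        case 0: return kvm_hypercall1(KVM_HC_DYNAMIC, nr);
        case 1: return kvm_hypercall2(KVM_HC_DYNAMIC, nr, args[0]);
        case 2: return kvm_hypercall3(KVM_HC_DYNAMIC, nr, args[0], args[1]);
        default: return -EINVAL;
        }
}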

Thanks, Avi,
-Greg





