Re: [RFC PATCH 0/3] generic hypercall support
Anthony Liguori wrote:
> Gregory Haskins wrote:
>> I specifically generalized my statement above because #1 I assume everyone here is smart enough to convert that nice round unit into the relevant figure. And #2, there are multiple potential latency sources at play which we need to factor in when looking at the big picture. For instance, the difference between a PF exit and an IO exit (2.58us on x86, to be precise). Or whether you need to take a heavy-weight exit. Or a context switch to qemu, then the kernel, back to qemu, and back to the vcpu. Or acquire a mutex. Or get head-of-lined on the VGA model's IO. I know you wish that this whole discussion would just go away, but these little "300ns here, 1600ns there" really add up in aggregate despite your dismissive attitude towards them. And it doesn't take much to affect the results in a measurable way. As stated, each 1us costs ~4%. My motivation is to reduce as many of these sources as possible.
>>
>> So, yes, the delta from PIO to HC is 350ns. Yes, this is a ~1.4% improvement. So what? It's still an improvement. If that improvement were for free, would you object? And we all know that this change isn't "free" because we have to change some code (+128/-0, to be exact). But what is it specifically you are objecting to in the first place? Adding hypercall support as a pv_ops primitive isn't exactly hard or complex, or even very much code.
>
> Where does 25us come from? The numbers you post below are 33us and 66us. This is part of what's frustrating me in this thread. Things are way too theoretical. Saying that "if packet latency was 25us, then it would be a 1.4% improvement" is close to misleading.

[ answered in the last reply ]

> The numbers you've posted are also measuring on-box speeds. What really matters are off-box latencies and that's just going to exaggerate.

I'm not 100% clear on what you mean with on-box vs off-box.
These figures were gathered between two real machines connected via 10GE cross-over cable. The 5.8Gb/s and 33us (25us) values were gathered sending real data between these hosts. This sounds "off-box" to me, but I am not sure I truly understand your assertion.

> IIUC, if you switched vbus to using PIO today, you would go from 66us to 65.65us, which you'd round to 66us for on-box latencies. Even if you didn't round, it's a 0.5% improvement in latency.

I think part of what you are missing is that in order to create vbus, I needed to _create_ an in-kernel hook from scratch since there were no existing methods. Since I measured HC to be superior in performance (if by only a little), I wasn't going to choose the slower way if there wasn't a reason, and at the time I didn't see one. Now after community review, perhaps we do have a reason, but that is the point of the review process. So now we can push something like iofd as a PIO hook instead. But either way, something needed to be created.

> Adding hypercall support as a pv_ops primitive is adding a fair bit of complexity. You need a hypercall fd mechanism to plumb this down to userspace otherwise, you can't support migration from in-kernel backend to non in-kernel backend.

I respectfully disagree. This is orthogonal to the simple issue of the IO type for the exit. Where you *do* have a point is that the bigger benefit comes from in-kernel termination (like the iofd stuff I posted yesterday). However, in-kernel termination is not strictly necessary to exploit some reduction in overhead in the IO latency. In either case we know we can shave off about 2.56us from an MMIO. Since I formally measured MMIO rtt to userspace yesterday, we now know that we can do qemu-mmio at about 110k IOPS, 9.09us rtt. Switching to pv_io_ops->mmio() alone would be a boost to approximately 153k IOPS, 6.53us rtt. This would have a tangible benefit to all models without any hypercall plumbing screwing up migration.
Therefore I still stand by the assertion that the hypercall discussion alone doesn't add very much complexity.

> You need some way to allocate hypercalls to particular devices which so far, has been completely ignored.

I'm sorry, but that's not true. Vbus already handles this mapping.

> I've already mentioned why hypercalls are also unfortunate from a guest perspective. They require kernel patching and this is almost certainly going to break at least Vista as a guest. Certainly Windows 7.

Yes, you have a point here.

> So it's not at all fair to trivialize the complexity introduced here. I'm simply asking for justification to introduce this complexity. I don't see why this is unfair for me to ask.

In summary, I don't think there is really much complexity being added because this stuff really doesn't depend on the hypercallfd (iofd) interface in order to have some benefit, as you assert above. The hypercall page is a good point for attestation, but that is
Re: [RFC PATCH 0/3] generic hypercall support
Anthony Liguori wrote:
> Gregory Haskins wrote:
>> So, yes, the delta from PIO to HC is 350ns. Yes, this is a ~1.4% improvement. So what? It's still an improvement. If that improvement were for free, would you object? And we all know that this change isn't "free" because we have to change some code (+128/-0, to be exact). But what is it specifically you are objecting to in the first place? Adding hypercall support as a pv_ops primitive isn't exactly hard or complex, or even very much code.
>
> Where does 25us come from? The numbers you post below are 33us and 66us.

The 25us is approximately the max from an in-kernel harness strapped directly to the driver, gathered informally during testing. The 33us is from formally averaging multiple runs of a userspace socket app in preparation for publishing. I consider the 25us the "target goal" since there is obviously overhead that a socket application deals with that theoretically a guest bypasses with the tap-device. Note that the socket application itself often sees < 30us itself...this was just a particularly "slow" set of runs that day.

Note that this is why I express the impact as "approximately" (e.g. "~4%"). Sorry for the confusion.

-Greg
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote:
Anthony Liguori wrote:

It's a question of cost vs. benefit. It's clear the benefit is low (but that doesn't mean it's not worth having). The cost initially appeared to be very low, until the nested virtualization wrench was thrown into the works. Not that nested virtualization is a reality -- even on svm where it is implemented it is not yet production quality and is disabled by default.

Now nested virtualization is beginning to look interesting, with Windows 7's XP mode requiring virtualization extensions. Desktop virtualization is also something likely to use device assignment (though you probably won't assign a virtio device to the XP instance inside Windows 7).

Maybe we should revisit the mmio hypercall idea again, it might be workable if we find a way to let the guest know if it should use the hypercall or not for a given memory range. mmio hypercall is nice because

- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and get a nice speed win (or rather, less speed loss)

If it's a PCI device, then we can also have an interrupt which we currently lack with vmcall-based hypercalls. This would give us guestcalls, upcalls, or whatever we've previously decided to call them.

Sorry, I totally failed to understand this. Please explain.

I totally missed what you meant by MMIO hypercall. In what cases do you think MMIO hypercall would result in a net benefit? I think the difference in MMIO vs hcall will be overshadowed by the heavy-weight transition to userspace. The only thing I can think of where it may matter is for in-kernel devices like the APIC but that's a totally different path in Linux.
Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Anthony Liguori wrote:

It's a question of cost vs. benefit. It's clear the benefit is low (but that doesn't mean it's not worth having). The cost initially appeared to be very low, until the nested virtualization wrench was thrown into the works. Not that nested virtualization is a reality -- even on svm where it is implemented it is not yet production quality and is disabled by default.

Now nested virtualization is beginning to look interesting, with Windows 7's XP mode requiring virtualization extensions. Desktop virtualization is also something likely to use device assignment (though you probably won't assign a virtio device to the XP instance inside Windows 7).

Maybe we should revisit the mmio hypercall idea again, it might be workable if we find a way to let the guest know if it should use the hypercall or not for a given memory range. mmio hypercall is nice because

- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and get a nice speed win (or rather, less speed loss)

If it's a PCI device, then we can also have an interrupt which we currently lack with vmcall-based hypercalls. This would give us guestcalls, upcalls, or whatever we've previously decided to call them.

Sorry, I totally failed to understand this. Please explain.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote:
Avi Kivity wrote:
Hollis Blanchard wrote:

I haven't been following this conversation at all. With that in mind...

AFAICS, a hypercall is clearly the higher-performing option, since you don't need the additional memory load (which could even cause a page fault in some circumstances) and instruction decode. That said, I'm willing to agree that this overhead is probably negligible compared to the IOp itself... Amdahl's Law again.

It's a question of cost vs. benefit. It's clear the benefit is low (but that doesn't mean it's not worth having). The cost initially appeared to be very low, until the nested virtualization wrench was thrown into the works. Not that nested virtualization is a reality -- even on svm where it is implemented it is not yet production quality and is disabled by default.

Now nested virtualization is beginning to look interesting, with Windows 7's XP mode requiring virtualization extensions. Desktop virtualization is also something likely to use device assignment (though you probably won't assign a virtio device to the XP instance inside Windows 7).

Maybe we should revisit the mmio hypercall idea again, it might be workable if we find a way to let the guest know if it should use the hypercall or not for a given memory range. mmio hypercall is nice because

- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and get a nice speed win (or rather, less speed loss)

Yeah, I agree with all this. I am still wrestling with how to deal with the device-assignment problem w.r.t. shunting io requests into a hypercall vs letting them PF. Are you saying we could simply ignore this case by disabling "MMIOoHC" when assignment is enabled? That would certainly make the problem much easier to solve.

No, we need to deal with hotplug.
Something like IO_COND that Chris mentioned, but how to avoid turning this into a doctoral thesis. (On the other hand, device assignment requires the iommu, and I think you have to specify that up front?)

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC PATCH 0/3] generic hypercall support
Hollis Blanchard wrote:
On Sun, 2009-05-10 at 13:38 -0500, Anthony Liguori wrote:
Gregory Haskins wrote:

Can you back up your claim that PPC has no difference in performance with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" like instructions, but clearly there are ways to cause a trap, so presumably we can measure the difference between a PF exit and something more explicit).

First, the PPC that KVM supports performs very poorly relatively speaking because it receives no hardware assistance; this is not the right place to focus wrt optimizations. And because there's no hardware assistance, there simply isn't a hypercall instruction. Are PFs the fastest type of exits? Probably not but I honestly have no idea. I'm sure Hollis does though.

Memory load from the guest context (for instruction decoding) is a *very* poorly performing path on most PowerPC, even considering server PowerPC with hardware virtualization support. No, I don't have any data for you, but switching the hardware MMU contexts requires some heavyweight synchronization instructions.

For current ppcemb, you would have to do a memory load no matter what, right? I guess you could have a dedicated interrupt vector or something...

For future ppcembs, do you know if there is an equivalent of a PF exit type? Does the hardware squirrel away the faulting address somewhere and set the PC to the start of the instruction? If so, no guest memory load should be required.

Regards,

Anthony Liguori
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote:
Hollis Blanchard wrote:

I haven't been following this conversation at all. With that in mind...

AFAICS, a hypercall is clearly the higher-performing option, since you don't need the additional memory load (which could even cause a page fault in some circumstances) and instruction decode. That said, I'm willing to agree that this overhead is probably negligible compared to the IOp itself... Amdahl's Law again.

It's a question of cost vs. benefit. It's clear the benefit is low (but that doesn't mean it's not worth having). The cost initially appeared to be very low, until the nested virtualization wrench was thrown into the works. Not that nested virtualization is a reality -- even on svm where it is implemented it is not yet production quality and is disabled by default.

Now nested virtualization is beginning to look interesting, with Windows 7's XP mode requiring virtualization extensions. Desktop virtualization is also something likely to use device assignment (though you probably won't assign a virtio device to the XP instance inside Windows 7).

Maybe we should revisit the mmio hypercall idea again, it might be workable if we find a way to let the guest know if it should use the hypercall or not for a given memory range. mmio hypercall is nice because

- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and get a nice speed win (or rather, less speed loss)

If it's a PCI device, then we can also have an interrupt which we currently lack with vmcall-based hypercalls. This would give us guestcalls, upcalls, or whatever we've previously decided to call them.

Regards,

Anthony Liguori
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote:

I specifically generalized my statement above because #1 I assume everyone here is smart enough to convert that nice round unit into the relevant figure. And #2, there are multiple potential latency sources at play which we need to factor in when looking at the big picture. For instance, the difference between a PF exit and an IO exit (2.58us on x86, to be precise). Or whether you need to take a heavy-weight exit. Or a context switch to qemu, then the kernel, back to qemu, and back to the vcpu. Or acquire a mutex. Or get head-of-lined on the VGA model's IO. I know you wish that this whole discussion would just go away, but these little "300ns here, 1600ns there" really add up in aggregate despite your dismissive attitude towards them. And it doesn't take much to affect the results in a measurable way. As stated, each 1us costs ~4%. My motivation is to reduce as many of these sources as possible.

So, yes, the delta from PIO to HC is 350ns. Yes, this is a ~1.4% improvement. So what? It's still an improvement. If that improvement were for free, would you object? And we all know that this change isn't "free" because we have to change some code (+128/-0, to be exact). But what is it specifically you are objecting to in the first place? Adding hypercall support as a pv_ops primitive isn't exactly hard or complex, or even very much code.

Where does 25us come from? The numbers you post below are 33us and 66us. This is part of what's frustrating me in this thread. Things are way too theoretical. Saying that "if packet latency was 25us, then it would be a 1.4% improvement" is close to misleading.

The numbers you've posted are also measuring on-box speeds. What really matters are off-box latencies and that's just going to exaggerate.

IIUC, if you switched vbus to using PIO today, you would go from 66us to 65.65us, which you'd round to 66us for on-box latencies. Even if you didn't round, it's a 0.5% improvement in latency.
Adding hypercall support as a pv_ops primitive is adding a fair bit of complexity. You need a hypercall fd mechanism to plumb this down to userspace otherwise, you can't support migration from in-kernel backend to non in-kernel backend. You need some way to allocate hypercalls to particular devices which so far, has been completely ignored. I've already mentioned why hypercalls are also unfortunate from a guest perspective. They require kernel patching and this is almost certainly going to break at least Vista as a guest. Certainly Windows 7.

So it's not at all fair to trivialize the complexity introduced here. I'm simply asking for justification to introduce this complexity. I don't see why this is unfair for me to ask.

As a more general observation, we need numbers to justify an optimization, not to justify not including an optimization. In other words, the burden is on you to present a scenario where this optimization would result in a measurable improvement in a real world work load.

I have already done this. You seem to have chosen to ignore my statements and results, but if you insist on rehashing:

I started this project by analyzing system traces and finding some of the various bottlenecks in comparison to a native host. Throughput was already pretty decent, but latency was pretty bad (and recently got *really* bad, but I know you already have a handle on what's causing that). I digress...one of the conclusions of the research was that I wanted to focus on building an IO subsystem designed to minimize the quantity of exits, minimize the cost of each exit, and shorten the end-to-end signaling path to achieve optimal performance. I also wanted to build a system that was extensible enough to work with a variety of client types, on a variety of architectures, etc, so we would only need to solve these problems "once". The end result was vbus, and the first working example was venet.
The measured performance data of this work was as follows (802.x network, 9000 byte MTU, two 8-core x86_64s connected back to back with Chelsio T3 10GE via crossover):

- Bare metal: tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
- Virtio-net (PCI): tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
- Venet (VBUS): tput = 5802Mb/s, round-trip = 15127pps (66us rtt)

For more details: http://lkml.org/lkml/2009/4/21/408

Sending out a massive infrastructure change that does things wildly differently from how they're done today without any indication of why those changes were necessary is disruptive. If you could characterize all of the changes that vbus makes that are different from virtio, demonstrating at each stage why the change mattered and what benefit it brought, then we'd be having a completely different discussion. I have no problem throwing away virtio today if there's something else better. That's not what you've done though. You wrote a bunch of code without understanding why virt
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Hollis Blanchard wrote: >> I haven't been following this conversation at all. With that in mind... >> >> AFAICS, a hypercall is clearly the higher-performing option, since you >> don't need the additional memory load (which could even cause a page >> fault in some circumstances) and instruction decode. That said, I'm >> willing to agree that this overhead is probably negligible compared to >> the IOp itself... Ahmdal's Law again. >> > > It's a question of cost vs. benefit. It's clear the benefit is low > (but that doesn't mean it's not worth having). The cost initially > appeared to be very low, until the nested virtualization wrench was > thrown into the works. Not that nested virtualization is a reality -- > even on svm where it is implemented it is not yet production quality > and is disabled by default. > > Now nested virtualization is beginning to look interesting, with > Windows 7's XP mode requiring virtualization extensions. Desktop > virtualization is also something likely to use device assignment > (though you probably won't assign a virtio device to the XP instance > inside Windows 7). > > Maybe we should revisit the mmio hypercall idea again, it might be > workable if we find a way to let the guest know if it should use the > hypercall or not for a given memory range. > > mmio hypercall is nice because > - it falls back nicely to pure mmio > - it optimizes an existing slow path, not just new device models > - it has preexisting semantics, so we have less ABI to screw up > - for nested virtualization + device assignment, we can drop it and > get a nice speed win (or rather, less speed loss) > Yeah, I agree with all this. I am still wrestling with how to deal with the device-assignment problem w.r.t. shunting io requests into a hypercall vs letting them PF. Are you saying we could simply ignore this case by disabling "MMIOoHC" when assignment is enabled? That would certainly make the problem much easier to solve. 
-Greg
Re: [RFC PATCH 0/3] generic hypercall support
Hollis Blanchard wrote:

I haven't been following this conversation at all. With that in mind...

AFAICS, a hypercall is clearly the higher-performing option, since you don't need the additional memory load (which could even cause a page fault in some circumstances) and instruction decode. That said, I'm willing to agree that this overhead is probably negligible compared to the IOp itself... Amdahl's Law again.

It's a question of cost vs. benefit. It's clear the benefit is low (but that doesn't mean it's not worth having). The cost initially appeared to be very low, until the nested virtualization wrench was thrown into the works. Not that nested virtualization is a reality -- even on svm where it is implemented it is not yet production quality and is disabled by default.

Now nested virtualization is beginning to look interesting, with Windows 7's XP mode requiring virtualization extensions. Desktop virtualization is also something likely to use device assignment (though you probably won't assign a virtio device to the XP instance inside Windows 7).

Maybe we should revisit the mmio hypercall idea again, it might be workable if we find a way to let the guest know if it should use the hypercall or not for a given memory range. mmio hypercall is nice because

- it falls back nicely to pure mmio
- it optimizes an existing slow path, not just new device models
- it has preexisting semantics, so we have less ABI to screw up
- for nested virtualization + device assignment, we can drop it and get a nice speed win (or rather, less speed loss)

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC PATCH 0/3] generic hypercall support
On Sun, 2009-05-10 at 13:38 -0500, Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Can you back up your claim that PPC has no difference in performance with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" like instructions, but clearly there are ways to cause a trap, so presumably we can measure the difference between a PF exit and something more explicit).
>
> First, the PPC that KVM supports performs very poorly relatively speaking because it receives no hardware assistance; this is not the right place to focus wrt optimizations.
>
> And because there's no hardware assistance, there simply isn't a hypercall instruction. Are PFs the fastest type of exits? Probably not but I honestly have no idea. I'm sure Hollis does though.

Memory load from the guest context (for instruction decoding) is a *very* poorly performing path on most PowerPC, even considering server PowerPC with hardware virtualization support. No, I don't have any data for you, but switching the hardware MMU contexts requires some heavyweight synchronization instructions.

> Page faults are going to have tremendously different performance characteristics on PPC too because it's a software managed TLB. There's no page table lookup like there is on x86.

To clarify, software-managed TLBs are only found in embedded PowerPC. Server and classic PowerPC use hash tables, which are a third MMU type.

-- 
Hollis Blanchard
IBM Linux Technology Center
Re: [RFC PATCH 0/3] generic hypercall support
Anthony Liguori wrote:

Yes, I misunderstood that they actually emulated it like that. However, ia64 has no paravirtualization support today so surely, we aren't going to be justifying this via ia64, right?

Someone is actively putting a pvops infrastructure into the ia64 port, along with a Xen port. I think pieces of it got merged this last window.

J
Re: [RFC PATCH 0/3] generic hypercall support
On Mon, 2009-05-11 at 09:14 -0400, Gregory Haskins wrote:
> >> for request-response, this is generally for *every* packet since you cannot exploit buffering/deferring.
> >>
> >> Can you back up your claim that PPC has no difference in performance with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" like instructions, but clearly there are ways to cause a trap, so presumably we can measure the difference between a PF exit and something more explicit).
> >
> > First, the PPC that KVM supports performs very poorly relatively speaking because it receives no hardware assistance
>
> So wouldn't that be making the case that it could use as much help as possible?

I think he's referencing Amdahl's Law here. While I'd agree, this is relevant only for the current KVM PowerPC implementations. I think it would be short-sighted to design an IO architecture around that.

> > this is not the right place to focus wrt optimizations.
>
> Odd choice of words. I am advocating the opposite (broad solution to many arches and many platforms (i.e. hypervisors)) and therefore I am not "focused" on it (or really any one arch) at all per se. I am _worried_ however, that we could be overlooking PPC (as an example) if we ignore the disparity between MMIO and HC since other higher performance options are not available like PIO. The goal on this particular thread is to come up with an IO interface that works reasonably well across as many hypervisors as possible. MMIO/PIO do not appear to fit that bill (at least not without tunneling them over HCs)

I haven't been following this conversation at all. With that in mind...

AFAICS, a hypercall is clearly the higher-performing option, since you don't need the additional memory load (which could even cause a page fault in some circumstances) and instruction decode. That said, I'm willing to agree that this overhead is probably negligible compared to the IOp itself...
Amdahl's Law again.

-- 
Hollis Blanchard
IBM Linux Technology Center
Re: [RFC PATCH 0/3] generic hypercall support
Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Anthony Liguori wrote:
>>> I'm surprised so much effort is going into this, is there any indication that this is even close to a bottleneck in any circumstance?
>>
>> Yes. Each 1us of overhead is a 4% regression in something as trivial as a 25us UDP/ICMP rtt "ping".
>
> It wasn't 1us, it was 350ns or something around there (i.e. ~1%).

I wasn't referring to "it". I chose my words carefully. Let me rephrase for your clarity: *each* 1us of overhead introduced into the signaling path is a ~4% latency regression for a round trip on a high speed network (note that this can also affect throughput at some level, too). I believe this point has been lost on you from the very beginning of the vbus discussions.

I specifically generalized my statement above because #1 I assume everyone here is smart enough to convert that nice round unit into the relevant figure. And #2, there are multiple potential latency sources at play which we need to factor in when looking at the big picture. For instance, the difference between a PF exit and an IO exit (2.58us on x86, to be precise). Or whether you need to take a heavy-weight exit. Or a context switch to qemu, then the kernel, back to qemu, and back to the vcpu. Or acquire a mutex. Or get head-of-lined on the VGA model's IO. I know you wish that this whole discussion would just go away, but these little "300ns here, 1600ns there" really add up in aggregate despite your dismissive attitude towards them. And it doesn't take much to affect the results in a measurable way. As stated, each 1us costs ~4%. My motivation is to reduce as many of these sources as possible.

So, yes, the delta from PIO to HC is 350ns. Yes, this is a ~1.4% improvement. So what? It's still an improvement. If that improvement were for free, would you object? And we all know that this change isn't "free" because we have to change some code (+128/-0, to be exact).
But what is it specifically you are objecting to in the first place? Adding hypercall support as a pv_ops primitive isn't exactly hard or complex, or even very much code.

Besides, I've already clearly stated multiple times (including in this very thread) that I agree that I am not yet sure if the 350ns/1.4% improvement alone is enough to justify a change. So if you are somehow trying to make me feel silly by pointing out the "~1%" above, you are being ridiculous. Rather, I was simply answering your question as to whether these latency sources are a real issue. The answer is "yes" (assuming you care about latency) and I gave you a specific example and a method to quantify the impact. It is duly noted that you do not care about this type of performance, but you also need to realize that your "blessing" or acknowledgment/denial of the problem domain has _zero_ bearing on whether the domain exists, or if there are others out there that do care about it. Sorry.

>> for request-response, this is generally for *every* packet since you cannot exploit buffering/deferring.
>>
>> Can you back up your claim that PPC has no difference in performance with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" like instructions, but clearly there are ways to cause a trap, so presumably we can measure the difference between a PF exit and something more explicit).
>
> First, the PPC that KVM supports performs very poorly relatively speaking because it receives no hardware assistance

So wouldn't that be making the case that it could use as much help as possible?

> this is not the right place to focus wrt optimizations.

Odd choice of words. I am advocating the opposite (broad solution to many arches and many platforms (i.e. hypervisors)) and therefore I am not "focused" on it (or really any one arch) at all per se.
I am _worried_, however, that we could be overlooking PPC (as an example) if we ignore the disparity between MMIO and HC, since other higher-performance options like PIO are not available there. The goal on this particular thread is to come up with an IO interface that works reasonably well across as many hypervisors as possible. MMIO/PIO do not appear to fit that bill (at least not without tunneling them over HCs). If I am guilty of focusing anywhere too much, it would be x86, since that is the only development platform I have readily available. > > > And because there's no hardware assistance, there simply isn't a > hypercall instruction. Are PFs the fastest type of exits? Probably > not but I honestly have no idea. I'm sure Hollis does though. > > Page faults are going to have tremendously different performance > characteristics on PPC too because it's a software managed TLB. > There's no page table lookup like there is on x86. The difference between MMIO and "HC", and whether it is cause for concern, will continue to be pure speculation until we can find someone with a PPC box willing to run some numbers. I will
Re: [RFC PATCH 0/3] generic hypercall support
Arnd Bergmann wrote: > On Saturday 09 May 2009, Benjamin Herrenschmidt wrote: > >> This was shot down by a vast majority of people, with the outcome being >> an agreement that for IORESOURCE_MEM, pci_iomap and friends must return >> something that is strictly interchangeable with what ioremap would have >> returned. >> >> That means that readl and writel must work on the output of pci_iomap() >> and similar, but I don't see why __raw_writel would be excluded there, I >> think it's in there too. >> > > One of the ideas was to change pci_iomap to return a special token > in case of virtual devices that causes iowrite32() to do an hcall, > and to just define writel() to do iowrite32(). > > Unfortunately, there is no __raw_iowrite32(), although I guess we > could add this generically if necessary. > > >> Direct dereference is illegal in all cases though. >> > > right. > > >> The token returned by pci_iomap for other type of resources (IO for >> example) is also only supported for use by iomap access functions >> (ioreadXX/iowriteXX) , and IO ports cannot be passed directly to those >> neither. >> > > That still leaves the option to let drivers pass the IORESOURCE_PVIO > for its own resources under some conditions, meaning that we will > only use hcalls for I/O on these drivers but not on others, as Chris > explained earlier. > Between this, and Avi's "nesting" point, this is the direction I am leaning in right now. -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
On Saturday 09 May 2009, Benjamin Herrenschmidt wrote: > This was shot down by a vast majority of people, with the outcome being > an agreement that for IORESOURCE_MEM, pci_iomap and friends must return > something that is strictly interchangeable with what ioremap would have > returned. > > That means that readl and writel must work on the output of pci_iomap() > and similar, but I don't see why __raw_writel would be excluded there, I > think it's in there too. One of the ideas was to change pci_iomap to return a special token in case of virtual devices that causes iowrite32() to do an hcall, and to just define writel() to do iowrite32(). Unfortunately, there is no __raw_iowrite32(), although I guess we could add this generically if necessary. > Direct dereference is illegal in all cases though. right. > The token returned by pci_iomap for other type of resources (IO for > example) is also only supported for use by iomap access functions > (ioreadXX/iowriteXX) , and IO ports cannot be passed directly to those > neither. That still leaves the option to let drivers pass the IORESOURCE_PVIO for its own resources under some conditions, meaning that we will only use hcalls for I/O on these drivers but not on others, as Chris explained earlier. Arnd <>< -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Anthony Liguori wrote: I'm surprised so much effort is going into this, is there any indication that this is even close to a bottleneck in any circumstance? Yes. Each 1us of overhead is a 4% regression in something as trivial as a 25us UDP/ICMP rtt "ping". It wasn't 1us, it was 350ns or something around there (i.e. ~1%). for request-response, this is generally for *every* packet since you cannot exploit buffering/deferring. Can you back up your claim that PPC has no difference in performance with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" like instructions, but clearly there are ways to cause a trap, so presumably we can measure the difference between a PF exit and something more explicit). First, the PPC that KVM supports performs very poorly relatively speaking because it receives no hardware assistance; this is not the right place to focus wrt optimizations. And because there's no hardware assistance, there simply isn't a hypercall instruction. Are PFs the fastest type of exits? Probably not but I honestly have no idea. I'm sure Hollis does though. Page faults are going to have tremendously different performance characteristics on PPC too because it's a software managed TLB. There's no page table lookup like there is on x86. As a more general observation, we need numbers to justify an optimization, not to justify not including an optimization. In other words, the burden is on you to present a scenario where this optimization would result in a measurable improvement in a real-world workload. Regards, Anthony Liguori We need numbers before we can really decide to abandon this optimization. If PPC mmio has no penalty over hypercall, I am not sure the 350ns on x86 is worth this effort (especially if I can shrink this with some RCU fixes). Otherwise, the margin is quite a bit larger. 
-Greg
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: That only works if the device exposes a pio port, and the hypervisor exposes HC_PIO. If the device exposes the hypercall, things break once you assign it. Well, true. But normally I would think you would resurface the device from G1 to G2 anyway, so any relevant transform would also be reflected in the resurfaced device config. We do, but the G1 hypervisor cannot be expected to understand the config option that exposes the hypercall. I suppose if you had a hard requirement that, say, even the pci-config space was pass-through, this would be a problem. I am not sure if that is a realistic environment, though. You must pass through the config space, as some of it is device specific. The hypervisor will trap config space accesses, but unless it understands them, it cannot modify them. -- error compiling committee.c: too many arguments to function
Re: [RFC PATCH 0/3] generic hypercall support
David S. Ahern wrote: kvm_stat shows same approximate numbers as with the TSC-->ops/sec conversions. Interestingly, MMIO writes are not showing up as mmio_exits in kvm_stat; they are showing up as insn_emulation. That's a bug, mmio_exits ignores mmios that are handled in the kernel. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: > Avi Kivity wrote: >> David S. Ahern wrote: >>> I ran another test case with SMT disabled, and while I was at it >>> converted TSC delta to operations/sec. The results without SMT are >>> confusing -- to me anyways. I'm hoping someone can explain it. >>> Basically, using a count of 10,000,000 (per your web page) with SMT >>> disabled the guest detected a soft lockup on the CPU. So, I dropped the >>> count down to 1,000,000. So, for 1e6 iterations: >>> >>> without SMT, with EPT: >>> HC: 259,455 ops/sec >>> PIO: 226,937 ops/sec >>> MMIO: 113,180 ops/sec >>> >>> without SMT, without EPT: >>> HC: 274,825 ops/sec >>> PIO: 247,910 ops/sec >>> MMIO: 111,535 ops/sec >>> >>> Converting the prior TSC deltas: >>> >>> with SMT, with EPT: >>> HC:994,655 ops/sec >>> PIO: 875,116 ops/sec >>> MMIO: 439,738 ops/sec >>> >>> with SMT, without EPT: >>> HC:994,304 ops/sec >>> PIO: 903,057 ops/sec >>> MMIO: 423,244 ops/sec >>> >>> Running the tests repeatedly I did notice a fair variability (as much as >>> -10% down from these numbers). >>> >>> Also, just to make sure I converted the delta to ops/sec, the formula I >>> used was cpu_freq / dTSC * count = operations/sec >>> >>> >> The only thing I can think of is cpu frequency scaling lying about the >> cpu frequency. Really the test needs to use time and not the time >> stamp counter. >> >> Are the results expressed in cycles/op more reasonable? > > FWIW: I always used kvm_stat instead of my tsc printk > kvm_stat shows same approximate numbers as with the TSC-->ops/sec conversions. Interestingly, MMIO writes are not showing up as mmio_exits in kvm_stat; they are showing up as insn_emulation. david
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > David S. Ahern wrote: >> I ran another test case with SMT disabled, and while I was at it >> converted TSC delta to operations/sec. The results without SMT are >> confusing -- to me anyways. I'm hoping someone can explain it. >> Basically, using a count of 10,000,000 (per your web page) with SMT >> disabled the guest detected a soft lockup on the CPU. So, I dropped the >> count down to 1,000,000. So, for 1e6 iterations: >> >> without SMT, with EPT: >> HC: 259,455 ops/sec >> PIO: 226,937 ops/sec >> MMIO: 113,180 ops/sec >> >> without SMT, without EPT: >> HC: 274,825 ops/sec >> PIO: 247,910 ops/sec >> MMIO: 111,535 ops/sec >> >> Converting the prior TSC deltas: >> >> with SMT, with EPT: >> HC:994,655 ops/sec >> PIO: 875,116 ops/sec >> MMIO: 439,738 ops/sec >> >> with SMT, without EPT: >> HC:994,304 ops/sec >> PIO: 903,057 ops/sec >> MMIO: 423,244 ops/sec >> >> Running the tests repeatedly I did notice a fair variability (as much as >> -10% down from these numbers). >> >> Also, just to make sure I converted the delta to ops/sec, the formula I >> used was cpu_freq / dTSC * count = operations/sec >> >> > > The only thing I can think of is cpu frequency scaling lying about the > cpu frequency. Really the test needs to use time and not the time stamp > counter. > > Are the results expressed in cycles/op more reasonable? > Power settings seem to be the root cause. With this HP server the SMT mode must be disabling or overriding a power setting that is enabled in the bios. I found one power-based knob that gets non-SMT performance close to SMT numbers. Not very intuitive that SMT/non-SMT can differ so dramatically. david
Re: [RFC PATCH 0/3] generic hypercall support
Anthony Liguori wrote: > Avi Kivity wrote: >> >> Hmm, reminds me of something I thought of a while back. >> >> We could implement an 'mmio hypercall' that does mmio reads/writes >> via a hypercall instead of an mmio operation. That will speed up >> mmio for emulated devices (say, e1000). It's easy to hook into Linux >> (readl/writel), is pci-friendly, non-x86 friendly, etc. > > By the time you get down to userspace for an emulated device, that 2us > difference between mmio and hypercalls is simply not going to make a > difference. I don't care about this path for emulated devices. I am interested in in-kernel vbus devices. > I'm surprised so much effort is going into this, is there any > indication that this is even close to a bottleneck in any circumstance? Yes. Each 1us of overhead is a 4% regression in something as trivial as a 25us UDP/ICMP rtt "ping". > > > We have much, much lower hanging fruit to attack. The basic fact that > we still copy data multiple times in the networking drivers is clearly > more significant than a few hundred nanoseconds that should occur less > than once per packet. for request-response, this is generally for *every* packet since you cannot exploit buffering/deferring. Can you back up your claim that PPC has no difference in performance with an MMIO exit and a "hypercall" (yes, I understand PPC has no "VT" like instructions, but clearly there are ways to cause a trap, so presumably we can measure the difference between a PF exit and something more explicit). We need numbers before we can really decide to abandon this optimization. If PPC mmio has no penalty over hypercall, I am not sure the 350ns on x86 is worth this effort (especially if I can shrink this with some RCU fixes). Otherwise, the margin is quite a bit larger. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > David S. Ahern wrote: >> I ran another test case with SMT disabled, and while I was at it >> converted TSC delta to operations/sec. The results without SMT are >> confusing -- to me anyways. I'm hoping someone can explain it. >> Basically, using a count of 10,000,000 (per your web page) with SMT >> disabled the guest detected a soft lockup on the CPU. So, I dropped the >> count down to 1,000,000. So, for 1e6 iterations: >> >> without SMT, with EPT: >> HC: 259,455 ops/sec >> PIO: 226,937 ops/sec >> MMIO: 113,180 ops/sec >> >> without SMT, without EPT: >> HC: 274,825 ops/sec >> PIO: 247,910 ops/sec >> MMIO: 111,535 ops/sec >> >> Converting the prior TSC deltas: >> >> with SMT, with EPT: >> HC:994,655 ops/sec >> PIO: 875,116 ops/sec >> MMIO: 439,738 ops/sec >> >> with SMT, without EPT: >> HC:994,304 ops/sec >> PIO: 903,057 ops/sec >> MMIO: 423,244 ops/sec >> >> Running the tests repeatedly I did notice a fair variability (as much as >> -10% down from these numbers). >> >> Also, just to make sure I converted the delta to ops/sec, the formula I >> used was cpu_freq / dTSC * count = operations/sec >> >> > > The only thing I can think of is cpu frequency scaling lying about the > cpu frequency. Really the test needs to use time and not the time > stamp counter. > > Are the results expressed in cycles/op more reasonable? FWIW: I always used kvm_stat instead of my tsc printk
Re: [RFC PATCH 0/3] generic hypercall support
David S. Ahern wrote: I ran another test case with SMT disabled, and while I was at it converted TSC delta to operations/sec. The results without SMT are confusing -- to me anyways. I'm hoping someone can explain it. Basically, using a count of 10,000,000 (per your web page) with SMT disabled the guest detected a soft lockup on the CPU. So, I dropped the count down to 1,000,000. So, for 1e6 iterations: without SMT, with EPT: HC: 259,455 ops/sec PIO: 226,937 ops/sec MMIO: 113,180 ops/sec without SMT, without EPT: HC: 274,825 ops/sec PIO: 247,910 ops/sec MMIO: 111,535 ops/sec Converting the prior TSC deltas: with SMT, with EPT: HC:994,655 ops/sec PIO: 875,116 ops/sec MMIO: 439,738 ops/sec with SMT, without EPT: HC:994,304 ops/sec PIO: 903,057 ops/sec MMIO: 423,244 ops/sec Running the tests repeatedly I did notice a fair variability (as much as -10% down from these numbers). Also, just to make sure I converted the delta to ops/sec, the formula I used was cpu_freq / dTSC * count = operations/sec The only thing I can think of is cpu frequency scaling lying about the cpu frequency. Really the test needs to use time and not the time stamp counter. Are the results expressed in cycles/op more reasonable? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: > David S. Ahern wrote: >> Marcelo Tosatti wrote: >> >>> On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote: >>> Marcelo Tosatti wrote: > On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: > > >> Marcelo Tosatti wrote: >> >> >>> Also it would be interesting to see the MMIO comparison with EPT/NPT, >>> it probably sucks much less than what you're seeing. >>> >>> >>> >> Why would NPT improve mmio? If anything, it would be worse, since the >> processor has to do the nested walk. >> >> Of course, these are newer machines, so the absolute results as well as >> the difference will be smaller. >> >> > Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: > > NPT enabled: > test 0: 3088633284634 - 3059375712321 = 29257572313 > test 1: 3121754636397 - 3088633419760 = 33121216637 > test 2: 3204666462763 - 3121754668573 = 82911794190 > > NPT disabled: > test 0: 3638061646250 - 3609416811687 = 28644834563 > test 1: 3669413430258 - 3638061771291 = 31351658967 > test 2: 3736287253287 - 3669413463506 = 66873789781 > > > Thanks for running that. Its interesting to see that NPT was in fact worse as Avi predicted. Would you mind if I graphed the result and added this data to my wiki? If so, could you adjust the tsc result into IOPs using the proper time-base and the test_count you ran with? I can show a graph with the data as is and the relative differences will properly surface..but it would be nice to have apples to apples in terms of IOPS units with my other run. -Greg >>> Please, that'll be nice. 
>>> >>> Quad-Core AMD Opteron(tm) Processor 2358 SE >>> >>> host: 2.6.30-rc2 >>> guest: 2.6.29.1-102.fc11.x86_64 >>> >>> test_count=100, tsc freq=2402882804 Hz >>> >>> NPT disabled: >>> >>> test 0 = 2771200766 >>> test 1 = 3018726738 >>> test 2 = 6414705418 >>> test 3 = 2890332864 >>> >>> NPT enabled: >>> >>> test 0 = 2908604045 >>> test 1 = 3174687394 >>> test 2 = 7912464804 >>> test 3 = 3046085805 >>> >>> >> DL380 G6, 1-E5540, 6 GB RAM, SMT enabled: >> host: 2.6.30-rc3 >> guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64 >> >> with EPT >> test 0: 543617607291 - 518146439877 = 25471167414 >> test 1: 572568176856 - 543617703004 = 28950473852 >> test 2: 630182158139 - 572568269792 = 57613888347 >> >> >> without EPT >> test 0: 1383532195307 - 1358052032086 = 25480163221 >> test 1: 1411587055210 - 1383532318617 = 28054736593 >> test 2: 1471446356172 - 1411587194600 = 59859161572 >> >> >> > > Thank you kindly, David. > > -Greg I ran another test case with SMT disabled, and while I was at it converted TSC delta to operations/sec. The results without SMT are confusing -- to me anyways. I'm hoping someone can explain it. Basically, using a count of 10,000,000 (per your web page) with SMT disabled the guest detected a soft lockup on the CPU. So, I dropped the count down to 1,000,000. So, for 1e6 iterations: without SMT, with EPT: HC: 259,455 ops/sec PIO: 226,937 ops/sec MMIO: 113,180 ops/sec without SMT, without EPT: HC: 274,825 ops/sec PIO: 247,910 ops/sec MMIO: 111,535 ops/sec Converting the prior TSC deltas: with SMT, with EPT: HC:994,655 ops/sec PIO: 875,116 ops/sec MMIO: 439,738 ops/sec with SMT, without EPT: HC:994,304 ops/sec PIO: 903,057 ops/sec MMIO: 423,244 ops/sec Running the tests repeatedly I did notice a fair variability (as much as -10% down from these numbers). 
Also, just to make sure I converted the delta to ops/sec, the formula I used was cpu_freq / dTSC * count = operations/sec david
Re: [RFC PATCH 0/3] generic hypercall support
On Fri, 2009-05-08 at 00:11 +0200, Arnd Bergmann wrote: > On Thursday 07 May 2009, Chris Wright wrote: > > > > > Chris, is that issue with the non ioread/iowrite access of a mangled > > > pointer still an issue here? I would think so, but I am a bit fuzzy on > > > whether there is still an issue of non-wrapped MMIO ever occurring. > > > > Arnd was saying it's a bug for other reasons, so perhaps it would work > > out fine. > > Well, maybe. I only said that __raw_writel and pointer dereference are > bad, but not writel. I have only vague recollection of that stuff, but basically, it boiled down to me attempting to use annotations to differentiate old-style "ioremap" vs. new-style iomap and effectively forbid mixing ioremap with iomap in either direction. This was shot down by a vast majority of people, with the outcome being an agreement that for IORESOURCE_MEM, pci_iomap and friends must return something that is strictly interchangeable with what ioremap would have returned. That means that readl and writel must work on the output of pci_iomap() and similar, but I don't see why __raw_writel would be excluded there, I think it's in there too. Direct dereference is illegal in all cases though. The token returned by pci_iomap for other types of resources (IO for example) is also only supported for use by iomap access functions (ioreadXX/iowriteXX), and IO ports cannot be passed directly to those either. Cheers, Ben.
Re: [RFC PATCH 0/3] generic hypercall support
David S. Ahern wrote: > Marcelo Tosatti wrote: > >> On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote: >> >>> Marcelo Tosatti wrote: >>> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: > Marcelo Tosatti wrote: > > >> Also it would be interesting to see the MMIO comparison with EPT/NPT, >> it probably sucks much less than what you're seeing. >> >> >> > Why would NPT improve mmio? If anything, it would be worse, since the > processor has to do the nested walk. > > Of course, these are newer machines, so the absolute results as well as > the difference will be smaller. > > Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: NPT enabled: test 0: 3088633284634 - 3059375712321 = 29257572313 test 1: 3121754636397 - 3088633419760 = 33121216637 test 2: 3204666462763 - 3121754668573 = 82911794190 NPT disabled: test 0: 3638061646250 - 3609416811687 = 28644834563 test 1: 3669413430258 - 3638061771291 = 31351658967 test 2: 3736287253287 - 3669413463506 = 66873789781 >>> Thanks for running that. Its interesting to see that NPT was in fact >>> worse as Avi predicted. >>> >>> Would you mind if I graphed the result and added this data to my wiki? >>> If so, could you adjust the tsc result into IOPs using the proper >>> time-base and the test_count you ran with? I can show a graph with the >>> data as is and the relative differences will properly surface..but it >>> would be nice to have apples to apples in terms of IOPS units with my >>> other run. >>> >>> -Greg >>> >> Please, that'll be nice. 
>> >> Quad-Core AMD Opteron(tm) Processor 2358 SE >> >> host: 2.6.30-rc2 >> guest: 2.6.29.1-102.fc11.x86_64 >> >> test_count=100, tsc freq=2402882804 Hz >> >> NPT disabled: >> >> test 0 = 2771200766 >> test 1 = 3018726738 >> test 2 = 6414705418 >> test 3 = 2890332864 >> >> NPT enabled: >> >> test 0 = 2908604045 >> test 1 = 3174687394 >> test 2 = 7912464804 >> test 3 = 3046085805 >> >> > > DL380 G6, 1-E5540, 6 GB RAM, SMT enabled: > host: 2.6.30-rc3 > guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64 > > with EPT > test 0: 543617607291 - 518146439877 = 25471167414 > test 1: 572568176856 - 543617703004 = 28950473852 > test 2: 630182158139 - 572568269792 = 57613888347 > > > without EPT > test 0: 1383532195307 - 1358052032086 = 25480163221 > test 1: 1411587055210 - 1383532318617 = 28054736593 > test 2: 1471446356172 - 1411587194600 = 59859161572 > > > Thank you kindly, David. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: And likewise, in both cases, G1 would (should?) know what to do with that "address" as it relates to G2, just as it would need to know what the PIO address is for. Typically this would result in some kind of translation of that "address", but I suppose even this is completely arbitrary and only G1 knows for sure. E.g. it might translate from hypercall vector X to Y similar to your PIO example, it might completely change transports, or it might terminate locally (e.g. emulated device in G1). IOW: G2 might be using hypercalls to talk to G1, and G1 might be using MMIO to talk to H. I don't think it matters from a topology perspective (though it might from a performance perspective). >>> How can you translate a hypercall? G1's and H's hypercall mechanisms >>> can be completely different. >>> >> >> Well, what I mean is that the hypercall ABI is specific to G2->G1, but >> the path really looks like G2->(H)->G1 transparently since H gets all >> the initial exits coming from G2. But all H has to do is blindly >> reinject the exit with all the same parameters (e.g. registers, >> primarily) to the G1-root context. >> >> So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT, >> and does its thing according to the ABI. Perhaps the ABI for that >> particular HC-id is a PIOoHC, so it turns around and does a >> ioread/iowrite PIO, trapping us back to H. >> >> So this transform of the HC-id "X" to PIO("Y") is the translation I was >> referring to. It could really be anything, though (e.g. HC "X" to HC >> "Z", if thats what G1s handler for X told it to do) >> > > That only works if the device exposes a pio port, and the hypervisor > exposes HC_PIO. If the device exposes the hypercall, things break > once you assign it. Well, true. But normally I would think you would resurface the device from G1 to G2 anyway, so any relevant transform would also be reflected in the resurfaced device config. 
I suppose if you had a hard requirement that, say, even the pci-config space was pass-through, this would be a problem. I am not sure if that is a realistic environment, though. > >>> Of course mmio is faster in this case since it traps directly. >>> >>> btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a >>> 0.4us difference will buy us 0.4% reduction in cpu load, so let's see >>> what's the potential gain here. >>> >> >> Its more of an issue of execution latency (which translates to IO >> latency, since "execution" is usually for the specific goal of doing >> some IO). In fact, per my own design claims, I try to avoid exits like >> the plague and generally succeed at making very few of them. ;) >> >> So its not really the .4% reduction of cpu use that allures me. Its the >> 16% reduction in latency. Time/discussion will tell if its worth the >> trouble to use HC or just try to shave more off of PIO. If we went that >> route, I am concerned about falling back to MMIO, but Anthony seems to >> think this is not a real issue. >> > > You need to use absolute numbers, not percentages off the smallest > component. If you want to reduce latency, keep things on the same > core (IPIs, cache bounces are more expensive than the 200ns we're > seeing here). > > Ok, so there is no shortage of IO cards that can perform operations on the order of 10us-15us. Therefore a 350ns latency (the delta between PIO and HC) turns into a 2%-3.5% overhead when compared to bare-metal. I am not really at liberty to talk about most of the kinds of applications that might care. A trivial example might be PTPd clock distribution. But this is going nowhere. Based on Anthony's assertion that the MMIO fallback worry is unfounded, and all the controversy this is causing, perhaps we should just move on and forget the whole thing. If I have to I will patch the HC code in my own tree. For now, I will submit a few patches to clean up the locking on the io_bus. 
That may help narrow the gap without all this stuff, anyway. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: > On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote: >> Marcelo Tosatti wrote: >>> On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: >>> Marcelo Tosatti wrote: > Also it would be interesting to see the MMIO comparison with EPT/NPT, > it probably sucks much less than what you're seeing. > > Why would NPT improve mmio? If anything, it would be worse, since the processor has to do the nested walk. Of course, these are newer machines, so the absolute results as well as the difference will be smaller. >>> Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: >>> >>> NPT enabled: >>> test 0: 3088633284634 - 3059375712321 = 29257572313 >>> test 1: 3121754636397 - 3088633419760 = 33121216637 >>> test 2: 3204666462763 - 3121754668573 = 82911794190 >>> >>> NPT disabled: >>> test 0: 3638061646250 - 3609416811687 = 28644834563 >>> test 1: 3669413430258 - 3638061771291 = 31351658967 >>> test 2: 3736287253287 - 3669413463506 = 66873789781 >>> >>> >> Thanks for running that. Its interesting to see that NPT was in fact >> worse as Avi predicted. >> >> Would you mind if I graphed the result and added this data to my wiki? >> If so, could you adjust the tsc result into IOPs using the proper >> time-base and the test_count you ran with? I can show a graph with the >> data as is and the relative differences will properly surface..but it >> would be nice to have apples to apples in terms of IOPS units with my >> other run. >> >> -Greg > > Please, that'll be nice. 
> > Quad-Core AMD Opteron(tm) Processor 2358 SE > > host: 2.6.30-rc2 > guest: 2.6.29.1-102.fc11.x86_64 > > test_count=100, tsc freq=2402882804 Hz > > NPT disabled: > > test 0 = 2771200766 > test 1 = 3018726738 > test 2 = 6414705418 > test 3 = 2890332864 > > NPT enabled: > > test 0 = 2908604045 > test 1 = 3174687394 > test 2 = 7912464804 > test 3 = 3046085805 > DL380 G6, 1-E5540, 6 GB RAM, SMT enabled: host: 2.6.30-rc3 guest: fedora 9, 2.6.27.21-78.2.41.fc9.x86_64 with EPT test 0: 543617607291 - 518146439877 = 25471167414 test 1: 572568176856 - 543617703004 = 28950473852 test 2: 630182158139 - 572568269792 = 57613888347 without EPT test 0: 1383532195307 - 1358052032086 = 25480163221 test 1: 1411587055210 - 1383532318617 = 28054736593 test 2: 1471446356172 - 1411587194600 = 59859161572 david
Re: [RFC PATCH 0/3] generic hypercall support
Paul E. McKenney wrote: > On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote: > >> Marcelo Tosatti wrote: >> >>> On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote: >>> >>> Marcelo Tosatti wrote: > I think comparison is not entirely fair. You're using > KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that > (on Intel) to only one register read: > > nr = kvm_register_read(vcpu, VCPU_REGS_RAX); > > Whereas in a real hypercall for (say) PIO you would need the address, > size, direction and data. > > > Well, that's probably one of the reasons pio is slower, as the cpu has to set these up, and the kernel has to read them. > Also for PIO/MMIO you're adding this unoptimized lookup to the > measurement: > > pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); > if (pio_dev) { > kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); > complete_pio(vcpu); return 1; > } > > > Since there are only one or two elements in the list, I don't see how it could be optimized. >>> speaker_ioport, pit_ioport, pic_ioport and plus nulldev ioport. nulldev >>> is probably the last in the io_bus list. >>> >>> Not sure if this one matters very much. Point is you should measure the >>> exit time only, not the pio path vs hypercall path in kvm. >>> >>> >> The problem is the exit time in of itself isnt all that interesting to >> me. What I am interested in measuring is how long it takes KVM to >> process the request and realize that I want to execute function "X". >> Ultimately that is what matters in terms of execution latency and is >> thus the more interesting data. I think the exit time is possibly an >> interesting 5th data point, but its more of a side-bar IMO. In any >> case, I suspect that both exits will be approximately the same at the >> VT/SVM level. >> >> OTOH: If there is a patch out there to improve KVMs code (say >> specifically the PIO handling logic), that is fair-game here and we >> should benchmark it. 
For instance, if you have ideas on ways to improve >> the find_pio_dev performance, etc. One item may be to replace the >> kvm->lock on the bus scan with an RCU or something (though PIOs are >> very frequent and the constant re-entry to an RCU read-side CS may >> effectively cause a perpetual grace-period and may be too prohibitive). >> CC'ing pmck. >> > > Hello, Greg! > > Not a problem. ;-) > > A grace period only needs to wait on RCU read-side critical sections that > started before the grace period started. As soon as those pre-existing > RCU read-side critical sections get done, the grace period can end, regardless > of how many RCU read-side critical sections might have started after > the grace period started. > > If you find a situation where huge numbers of RCU read-side critical > sections do indefinitely delay a grace period, then that is a bug in > RCU that I need to fix. > > Of course, if you have a single RCU read-side critical section that > runs for a very long time, that -will- delay a grace period. As long > as you don't do it too often, this is not a problem, though running > a single RCU read-side critical section for more than a few milliseconds > is probably not a good thing. Not as bad as holding a heavily contended > spinlock for a few milliseconds, but still not a good thing. > Hey Paul, This makes sense, and it clears up a misconception I had about RCU. So thanks for that. Based on what Paul said, I think we can get some amount of gains in the PIO and PIOoHC stats from converting to RCU. I will do this next. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: And likewise, in both cases, G1 would (should?) know what to do with that "address" as it relates to G2, just as it would need to know what the PIO address is for. Typically this would result in some kind of translation of that "address", but I suppose even this is completely arbitrary and only G1 knows for sure. E.g. it might translate from hypercall vector X to Y similar to your PIO example, it might completely change transports, or it might terminate locally (e.g. emulated device in G1). IOW: G2 might be using hypercalls to talk to G1, and G1 might be using MMIO to talk to H. I don't think it matters from a topology perspective (though it might from a performance perspective). How can you translate a hypercall? G1's and H's hypercall mechanisms can be completely different. Well, what I mean is that the hypercall ABI is specific to G2->G1, but the path really looks like G2->(H)->G1 transparently since H gets all the initial exits coming from G2. But all H has to do is blindly reinject the exit with all the same parameters (e.g. registers, primarily) to the G1-root context. So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT, and does its thing according to the ABI. Perhaps the ABI for that particular HC-id is a PIOoHC, so it turns around and does an ioread/iowrite PIO, trapping us back to H. So this transform of the HC-id "X" to PIO("Y") is the translation I was referring to. It could really be anything, though (e.g. HC "X" to HC "Z", if that's what G1's handler for X told it to do) That only works if the device exposes a pio port, and the hypervisor exposes HC_PIO. If the device exposes the hypercall, things break once you assign it. Of course mmio is faster in this case since it traps directly. btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a 0.4us difference will buy us 0.4% reduction in cpu load, so let's see what's the potential gain here.
It's more of an issue of execution latency (which translates to IO latency, since "execution" is usually for the specific goal of doing some IO). In fact, per my own design claims, I try to avoid exits like the plague and generally succeed at making very few of them. ;) So it's not really the .4% reduction in cpu use that allures me. It's the 16% reduction in latency. Time/discussion will tell if it's worth the trouble to use HC or just try to shave more off of PIO. If we went that route, I am concerned about falling back to MMIO, but Anthony seems to think this is not a real issue. You need to use absolute numbers, not percentages off the smallest component. If you want to reduce latency, keep things on the same core (IPIs, cache bounces are more expensive than the 200ns we're seeing here). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Anthony Liguori wrote: ia64 uses mmio to emulate pio, so the cost may be different. I agree on x86 it's almost negligible. Yes, I misunderstood that they actually emulated it like that. However, ia64 has no paravirtualization support today so surely, we aren't going to be justifying this via ia64, right? Right.
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: It's more of an issue of execution latency (which translates to IO latency, since "execution" is usually for the specific goal of doing some IO). In fact, per my own design claims, I try to avoid exits like the plague and generally succeed at making very few of them. ;) So it's not really the .4% reduction in cpu use that allures me. It's the 16% reduction in latency. Time/discussion will tell if it's worth the trouble to use HC or just try to shave more off of PIO. If we went that route, I am concerned about falling back to MMIO, but Anthony seems to think this is not a real issue. It's only a 16% reduction in latency if your workload is entirely dependent on the latency of a hypercall. What is that workload? I don't think it exists. For a network driver, I have a hard time believing that anyone cares that much about 210ns of latency. We're getting close to the cost of a few dozen instructions here. Regards, Anthony Liguori
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote: The problem is the exit time in and of itself isn't all that interesting to me. What I am interested in measuring is how long it takes KVM to process the request and realize that I want to execute function "X". Ultimately that is what matters in terms of execution latency and is thus the more interesting data. I think the exit time is possibly an interesting 5th data point, but it's more of a side-bar IMO. In any case, I suspect that both exits will be approximately the same at the VT/SVM level. OTOH: If there is a patch out there to improve KVM's code (say specifically the PIO handling logic), that is fair-game here and we should benchmark it. For instance, if you have ideas on ways to improve the find_pio_dev performance, etc. One easy thing to try is to cache the last successful lookup on a pointer, to improve patterns where there's "device locality" (like the nullio test). We should do that everywhere, memory slots, pio slots, etc. Or even keep statistics on accesses and sort by that. I'd leave it on if I were you. One item may be to replace the kvm->lock on the bus scan with an RCU or something (though PIOs are very frequent and the constant re-entry to an RCU read-side CS may effectively cause a perpetual grace-period and may be too prohibitive). CC'ing pmck. Yes, locking improvements are needed there badly (think e.g. of the cache bouncing of kvm->lock _and_ bouncing of kvm->slots_lock on 4-way SMP guests). There's no reason for kvm->lock on pio. We should push the locking to devices. I'm going to rename slots_lock as slots_lock_please_reimplement_me_using_rcu, this keeps coming up. FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that 140 can possibly be recouped. I currently suspect the lock acquisition in the iobus-scan is the bulk of that time, but that is admittedly a guess. The remaining 200-250ns is elsewhere in the PIO decode.
vmcs_read is significantly expensive (http://www.mail-archive.com/kvm@vger.kernel.org/msg00840.html, likely that my measurements were foobar, Avi mentioned 50 cycles for vmcs_write). IIRC vmcs reads are pretty fast, and are being improved. See e.g. how vmx.c reads VM_EXIT_INTR_INFO twice on every exit. Ugh. Also this one looks pretty bad for a 32-bit PAE guest (and you can get away with the unconditional GUEST_CR3 read too).

/* Access CR3 don't cause VMExit in paging mode, so we need
 * to sync with guest real CR3. */
if (enable_ept && is_paging(vcpu)) {
        vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
        ept_load_pdptrs(vcpu);
}

We should use an accessor here just like with registers and segment registers.
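The "cache the last successful lookup" idea suggested above can be sketched as a small userspace model. All of the names (find_io_dev, the port layout, NDEVS) are illustrative stand-ins, not the real kvm io_bus API; the point is just that repeated hits on the same port (as in the nullio test) skip the linear scan entirely:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a one-entry lookup cache in front of a linear
 * io-bus scan.  Device list mirrors the one named in the thread:
 * speaker, pit, pic, with nulldev last. */
struct io_dev {
    int base;
    int len;
};

#define NDEVS 4
static struct io_dev bus[NDEVS] = {
    { 0x61, 1 },  /* speaker */
    { 0x40, 4 },  /* pit */
    { 0x20, 2 },  /* pic */
    { 0xf0, 1 },  /* nulldev, last in the list */
};

static struct io_dev *last_hit;  /* one-entry cache */
static int probes;               /* counts per-device probes, for inspection */

static int dev_match(const struct io_dev *dev, int port)
{
    return port >= dev->base && port < dev->base + dev->len;
}

struct io_dev *find_io_dev(int port)
{
    int i;

    if (last_hit && dev_match(last_hit, port))
        return last_hit;         /* fast path: "device locality" */
    for (i = 0; i < NDEVS; i++) {
        probes++;
        if (dev_match(&bus[i], port)) {
            last_hit = &bus[i];
            return last_hit;
        }
    }
    return NULL;
}
```

With nulldev last in the list, the first lookup pays the full scan and every subsequent lookup on the same port is a single comparison.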
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: Anthony Liguori wrote: And we're now getting close to the point where the difference is virtually meaningless. At .14us, in order to see 1% CPU overhead added from PIO vs HC, you need 71429 exits. If I read things correctly, you want the difference between PIO and PIOoHC, which is 210ns. But your point stands, 50,000 exits/sec will add 1% cpu overhead. Right, the basic math still stands. The non-x86 architecture argument isn't valid because other architectures either 1) don't use PCI at all (s390) and are already using hypercalls 2) use PCI, but do not have a dedicated hypercall instruction (PPC emb) or 3) have PIO (ia64). ia64 uses mmio to emulate pio, so the cost may be different. I agree on x86 it's almost negligible. Yes, I misunderstood that they actually emulated it like that. However, ia64 has no paravirtualization support today so surely, we aren't going to be justifying this via ia64, right? Regards, Anthony Liguori
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >>> Consider nested virtualization where the host (H) runs a guest (G1) >>> which is itself a hypervisor, running a guest (G2). The host exposes >>> a set of virtio (V1..Vn) devices for guest G1. Guest G1, rather than >>> creating a new virtio device and bridging it to one of V1..Vn, >>> assigns virtio device V1 to guest G2, and prays. >>> >>> Now guest G2 issues a hypercall. Host H traps the hypercall, sees it >>> originated in G1 while in guest mode, so it injects it into G1. G1 >>> examines the parameters but can't make any sense of them, so it >>> returns an error to G2. >>> >>> If this were done using mmio or pio, it would have just worked. With >>> pio, H would have reflected the pio into G1, G1 would have done the >>> conversion from G2's port number into G1's port number and reissued >>> the pio, finally trapped by H and used to issue the I/O. >> >> I might be missing something, but I am not seeing the difference >> here. We have an "address" (in this case the HC-id) and a context (in >> this >> case G1 running in non-root mode). Whether the trap to H is a HC or a >> PIO, the context tells us that it needs to re-inject the same trap to G1 >> for proper handling. So the "address" is re-injected from H to G1 as an >> emulated trap to G1's root mode, and we continue (just like the PIO). >> > > So far, so good (though in fact mmio can short-circuit G2->H directly). Yeah, that is a nice trick. Despite the fact that MMIOs have about 50% degradation over an equivalent PIO/HC trap, you would be hard-pressed to make that up again with all the nested reinjection going on on the PIO/HC side of the coin. I think MMIO would be a fairly easy win with one level of nesting, and absolutely trounce anything that happens to be deeper. > >> And likewise, in both cases, G1 would (should?) know what to do with >> that "address" as it relates to G2, just as it would need to know what >> the PIO address is for. 
Typically this would result in some kind of >> translation of that "address", but I suppose even this is completely >> arbitrary and only G1 knows for sure. E.g. it might translate from >> hypercall vector X to Y similar to your PIO example, it might completely >> change transports, or it might terminate locally (e.g. emulated device >> in G1). IOW: G2 might be using hypercalls to talk to G1, and G1 might >> be using MMIO to talk to H. I don't think it matters from a topology >> perspective (though it might from a performance perspective). >> > > How can you translate a hypercall? G1's and H's hypercall mechanisms > can be completely different. Well, what I mean is that the hypercall ABI is specific to G2->G1, but the path really looks like G2->(H)->G1 transparently since H gets all the initial exits coming from G2. But all H has to do is blindly reinject the exit with all the same parameters (e.g. registers, primarily) to the G1-root context. So when the trap is injected to G1, G1 sees it as a normal HC-VMEXIT, and does its thing according to the ABI. Perhaps the ABI for that particular HC-id is a PIOoHC, so it turns around and does an ioread/iowrite PIO, trapping us back to H. So this transform of the HC-id "X" to PIO("Y") is the translation I was referring to. It could really be anything, though (e.g. HC "X" to HC "Z", if that's what G1's handler for X told it to do) > > > >>> So the upshot is that hypercalls for devices must not be the primary >>> method of communications; they're fine as an optimization, but we >>> should always be able to fall back on something else. We also need to >>> figure out how G1 can stop V1 from advertising hypercall support. >>> >> I agree it would be desirable to be able to control this exposure. >> However, I am not currently convinced it's strictly necessary because of >> the reason you mentioned above. And also note that I am not currently >> convinced it's even possible to control it. 
>> >> For instance, what if G1 is an old KVM, or (dare I say) a completely >> different hypervisor? You could control things like whether G1 can see >> the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who >> is to say what G1 will expose to G2? G1 may very well advertise a HC >> feature bit to G2 which may allow G2 to try to make a VMCALL. How do >> you stop that? >> > > I don't see any way. > > If, instead of a hypercall we go through the pio hypercall route, then > it all resolves itself. G2 issues a pio hypercall, H bounces it to > G1, G1 either issues a pio or a pio hypercall depending on what the H > and G1 negotiated. Actually I don't even think it matters what the HC payload is. It's governed by the ABI between G1 and G2. H will simply reflect the trap, so the HC could be of any type, really. > Of course mmio is faster in this case since it traps directly. > > btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a > 0.4us difference will buy us 0.4% reduction in cpu load, so let's see what's the potential gain here.
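The reflection scheme being discussed, where H blindly bounces the exit into G1 and only G1's handler table knows what each id means, can be modelled as a toy dispatch table. Everything here (the enum, g1_table, the port and id values) is invented for illustration and is not real kvm code:

```c
#include <assert.h>

/* H's side: it never interprets the hypercall id; it only decides, from
 * context, whether to reflect the trap into G1 or handle it itself. */
enum route { HANDLE_IN_H, REFLECT_TO_G1 };

enum route h_route(int origin_is_nested_guest)
{
    return origin_is_nested_guest ? REFLECT_TO_G1 : HANDLE_IN_H;
}

/* G1's side: a per-id ABI table.  Only G1 knows whether an id means
 * "terminate locally", "reissue as PIO" (PIO-over-HC, trapping back to
 * H), or "translate HC X to HC Z". */
enum action { LOCAL, AS_PIO, AS_HC };

struct hc_handler {
    enum action act;
    int target;            /* translated PIO port or HC id */
};

static struct hc_handler g1_table[] = {
    [0] = { LOCAL,  0 },       /* emulated device in G1 */
    [1] = { AS_PIO, 0xc100 },  /* PIOoHC: reissue as PIO on port 0xc100 */
    [2] = { AS_HC,  7 },       /* HC "X" -> HC "Z" */
};
```

The key property the thread is arguing about is visible here: h_route() works for any payload, but g1_table only resolves ids that G1's ABI actually defines; an id G1 has never heard of (the nested-assignment case) has no entry and must fail.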
Re: [RFC PATCH 0/3] generic hypercall support
Anthony Liguori wrote: And we're now getting close to the point where the difference is virtually meaningless. At .14us, in order to see 1% CPU overhead added from PIO vs HC, you need 71429 exits. If I read things correctly, you want the difference between PIO and PIOoHC, which is 210ns. But your point stands, 50,000 exits/sec will add 1% cpu overhead. The non-x86 architecture argument isn't valid because other architectures either 1) don't use PCI at all (s390) and are already using hypercalls 2) use PCI, but do not have a dedicated hypercall instruction (PPC emb) or 3) have PIO (ia64). ia64 uses mmio to emulate pio, so the cost may be different. I agree on x86 it's almost negligible.
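The "50,000 exits/sec will add 1% cpu overhead" figure above is just rate times per-exit delta; a one-line helper makes the arithmetic explicit (the function name is ours, not from the thread):

```c
#include <assert.h>

/* CPU fraction consumed by a fixed per-exit cost delta:
 * fraction = exits_per_sec * delta_sec. */
static double overhead_fraction(double exits_per_sec, double delta_sec)
{
    return exits_per_sec * delta_sec;
}
```

Plugging in the 210ns PIO-vs-PIOoHC delta: 50,000 * 210ns = 1.05%, which is the ~1% quoted.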
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Consider nested virtualization where the host (H) runs a guest (G1) which is itself a hypervisor, running a guest (G2). The host exposes a set of virtio (V1..Vn) devices for guest G1. Guest G1, rather than creating a new virtio device and bridging it to one of V1..Vn, assigns virtio device V1 to guest G2, and prays. Now guest G2 issues a hypercall. Host H traps the hypercall, sees it originated in G1 while in guest mode, so it injects it into G1. G1 examines the parameters but can't make any sense of them, so it returns an error to G2. If this were done using mmio or pio, it would have just worked. With pio, H would have reflected the pio into G1, G1 would have done the conversion from G2's port number into G1's port number and reissued the pio, finally trapped by H and used to issue the I/O. I might be missing something, but I am not seeing the difference here. We have an "address" (in this case the HC-id) and a context (in this case G1 running in non-root mode). Whether the trap to H is a HC or a PIO, the context tells us that it needs to re-inject the same trap to G1 for proper handling. So the "address" is re-injected from H to G1 as an emulated trap to G1's root mode, and we continue (just like the PIO). So far, so good (though in fact mmio can short-circuit G2->H directly). And likewise, in both cases, G1 would (should?) know what to do with that "address" as it relates to G2, just as it would need to know what the PIO address is for. Typically this would result in some kind of translation of that "address", but I suppose even this is completely arbitrary and only G1 knows for sure. E.g. it might translate from hypercall vector X to Y similar to your PIO example, it might completely change transports, or it might terminate locally (e.g. emulated device in G1). IOW: G2 might be using hypercalls to talk to G1, and G1 might be using MMIO to talk to H. 
I don't think it matters from a topology perspective (though it might from a performance perspective). How can you translate a hypercall? G1's and H's hypercall mechanisms can be completely different. So the upshot is that hypercalls for devices must not be the primary method of communications; they're fine as an optimization, but we should always be able to fall back on something else. We also need to figure out how G1 can stop V1 from advertising hypercall support. I agree it would be desirable to be able to control this exposure. However, I am not currently convinced it's strictly necessary because of the reason you mentioned above. And also note that I am not currently convinced it's even possible to control it. For instance, what if G1 is an old KVM, or (dare I say) a completely different hypervisor? You could control things like whether G1 can see the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who is to say what G1 will expose to G2? G1 may very well advertise a HC feature bit to G2 which may allow G2 to try to make a VMCALL. How do you stop that? I don't see any way. If, instead of a hypercall we go through the pio hypercall route, then it all resolves itself. G2 issues a pio hypercall, H bounces it to G1, G1 either issues a pio or a pio hypercall depending on what the H and G1 negotiated. Of course mmio is faster in this case since it traps directly. btw, what's the hypercall rate you're seeing? at 10K hypercalls/sec, a 0.4us difference will buy us 0.4% reduction in cpu load, so let's see what's the potential gain here. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC PATCH 0/3] generic hypercall support
On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote: > Marcelo Tosatti wrote: > > On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote: > > > >> Marcelo Tosatti wrote: > >> > >>> I think the comparison is not entirely fair. You're using > >>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that > >>> (on Intel) to only one register read: > >>> > >>> nr = kvm_register_read(vcpu, VCPU_REGS_RAX); > >>> > >>> Whereas in a real hypercall for (say) PIO you would need the address, > >>> size, direction and data. > >>> > >>> > >> Well, that's probably one of the reasons pio is slower, as the cpu has > >> to set these up, and the kernel has to read them. > >> > >> > >>> Also for PIO/MMIO you're adding this unoptimized lookup to the > >>> measurement: > >>>
> >>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in);
> >>> if (pio_dev) {
> >>>         kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data);
> >>>         complete_pio(vcpu);
> >>>         return 1;
> >>> }
> >>> > >>> > >> Since there are only one or two elements in the list, I don't see how it > >> could be optimized. > >> > > > > speaker_ioport, pit_ioport, pic_ioport plus the nulldev ioport. nulldev > > is probably the last in the io_bus list. > > > > Not sure if this one matters very much. Point is you should measure the > > exit time only, not the pio path vs hypercall path in kvm. > > > > The problem is the exit time in and of itself isn't all that interesting to > me. What I am interested in measuring is how long it takes KVM to > process the request and realize that I want to execute function "X". > Ultimately that is what matters in terms of execution latency and is > thus the more interesting data. I think the exit time is possibly an > interesting 5th data point, but it's more of a side-bar IMO. In any > case, I suspect that both exits will be approximately the same at the > VT/SVM level. 
> > OTOH: If there is a patch out there to improve KVM's code (say > specifically the PIO handling logic), that is fair-game here and we > should benchmark it. For instance, if you have ideas on ways to improve > the find_pio_dev performance, etc. One item may be to replace the > kvm->lock on the bus scan with an RCU or something (though PIOs are > very frequent and the constant re-entry to an RCU read-side CS may > effectively cause a perpetual grace-period and may be too prohibitive). > CC'ing pmck. Hello, Greg! Not a problem. ;-) A grace period only needs to wait on RCU read-side critical sections that started before the grace period started. As soon as those pre-existing RCU read-side critical sections get done, the grace period can end, regardless of how many RCU read-side critical sections might have started after the grace period started. If you find a situation where huge numbers of RCU read-side critical sections do indefinitely delay a grace period, then that is a bug in RCU that I need to fix. Of course, if you have a single RCU read-side critical section that runs for a very long time, that -will- delay a grace period. As long as you don't do it too often, this is not a problem, though running a single RCU read-side critical section for more than a few milliseconds is probably not a good thing. Not as bad as holding a heavily contended spinlock for a few milliseconds, but still not a good thing. Thanx, Paul > FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that > 140 can possibly be recouped. I currently suspect the lock acquisition > in the iobus-scan is the bulk of that time, but that is admittedly a > guess. The remaining 200-250ns is elsewhere in the PIO decode. > > -Greg
Re: [RFC PATCH 0/3] generic hypercall support
On Fri, May 08, 2009 at 10:45:52AM -0400, Gregory Haskins wrote: > Marcelo Tosatti wrote: > > On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: > > > >> Marcelo Tosatti wrote: > >> > >>> Also it would be interesting to see the MMIO comparison with EPT/NPT, > >>> it probably sucks much less than what you're seeing. > >>> > >>> > >> Why would NPT improve mmio? If anything, it would be worse, since the > >> processor has to do the nested walk. > >> > >> Of course, these are newer machines, so the absolute results as well as > >> the difference will be smaller. > >> > > > > Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: > > > > NPT enabled: > > test 0: 3088633284634 - 3059375712321 = 29257572313 > > test 1: 3121754636397 - 3088633419760 = 33121216637 > > test 2: 3204666462763 - 3121754668573 = 82911794190 > > > > NPT disabled: > > test 0: 3638061646250 - 3609416811687 = 28644834563 > > test 1: 3669413430258 - 3638061771291 = 31351658967 > > test 2: 3736287253287 - 3669413463506 = 66873789781 > > > Thanks for running that. It's interesting to see that NPT was in fact > worse as Avi predicted. > > Would you mind if I graphed the result and added this data to my wiki? > If so, could you adjust the tsc result into IOPs using the proper > time-base and the test_count you ran with? I can show a graph with the > data as is and the relative differences will properly surface, but it > would be nice to have apples to apples in terms of IOPS units with my > other run. > > -Greg Please, that'll be nice. 
Quad-Core AMD Opteron(tm) Processor 2358 SE
host: 2.6.30-rc2
guest: 2.6.29.1-102.fc11.x86_64
test_count=100, tsc freq=2402882804 Hz

NPT disabled:
test 0 = 2771200766
test 1 = 3018726738
test 2 = 6414705418
test 3 = 2890332864

NPT enabled:
test 0 = 2908604045
test 1 = 3174687394
test 2 = 7912464804
test 3 = 3046085805
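Converting these raw TSC deltas into the time-base Greg asked for is a simple division by the quoted 2402882804 Hz; getting actual IOPS additionally needs the number of operations each test performed, which isn't stated here, so that stays a parameter. The helper names are ours:

```c
#include <assert.h>

/* Elapsed seconds for a TSC delta at a given TSC frequency. */
static double tsc_to_sec(unsigned long long cycles, double tsc_hz)
{
    return (double)cycles / tsc_hz;
}

/* IOPS, assuming the caller knows how many operations ran in that
 * interval (not quoted in the message above). */
static double iops(unsigned long long cycles, double tsc_hz,
                   unsigned long long ops)
{
    return (double)ops / tsc_to_sec(cycles, tsc_hz);
}
```

For example, the NPT-disabled test 0 delta of 2771200766 cycles corresponds to roughly 1.15 seconds of wall time on this box.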
Re: [RFC PATCH 0/3] generic hypercall support
On Fri, May 08, 2009 at 08:43:40AM -0400, Gregory Haskins wrote: > The problem is the exit time in and of itself isn't all that interesting to > me. What I am interested in measuring is how long it takes KVM to > process the request and realize that I want to execute function "X". > Ultimately that is what matters in terms of execution latency and is > thus the more interesting data. I think the exit time is possibly an > interesting 5th data point, but it's more of a side-bar IMO. In any > case, I suspect that both exits will be approximately the same at the > VT/SVM level. > > OTOH: If there is a patch out there to improve KVM's code (say > specifically the PIO handling logic), that is fair-game here and we > should benchmark it. For instance, if you have ideas on ways to improve > the find_pio_dev performance, etc. One easy thing to try is to cache the last successful lookup on a pointer, to improve patterns where there's "device locality" (like the nullio test). > One item may be to replace the kvm->lock on the bus scan with an RCU > or something (though PIOs are very frequent and the constant > re-entry to an RCU read-side CS may effectively cause a perpetual > grace-period and may be too prohibitive). CC'ing pmck. Yes, locking improvements are needed there badly (think e.g. of the cache bouncing of kvm->lock _and_ bouncing of kvm->slots_lock on 4-way SMP guests). > FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that > 140 can possibly be recouped. I currently suspect the lock acquisition > in the iobus-scan is the bulk of that time, but that is admittedly a > guess. The remaining 200-250ns is elsewhere in the PIO decode. vmcs_read is significantly expensive (http://www.mail-archive.com/kvm@vger.kernel.org/msg00840.html, likely that my measurements were foobar, Avi mentioned 50 cycles for vmcs_write). See e.g. how vmx.c reads VM_EXIT_INTR_INFO twice on every exit. 
Also this one looks pretty bad for a 32-bit PAE guest (and you can get away with the unconditional GUEST_CR3 read too).

/* Access CR3 don't cause VMExit in paging mode, so we need
 * to sync with guest real CR3. */
if (enable_ept && is_paging(vcpu)) {
        vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
        ept_load_pdptrs(vcpu);
}
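The accessor idea raised in this exchange (read an expensive VMCS field at most once per exit and serve later reads from a cache) can be modelled in userspace. This is a sketch of the caching pattern only, with invented names; the real fix would live in vmx.c alongside the register-cache machinery:

```c
#include <assert.h>

/* Stand-in for an expensive VMCS field (e.g. VM_EXIT_INTR_INFO). */
static unsigned long hw_field = 0xdeadbeefUL;
static int hw_reads;          /* counts simulated VMREADs */

static unsigned long cache_val;
static int cache_valid;

static unsigned long vmcs_readl_model(void)
{
    hw_reads++;               /* each call models one real VMREAD */
    return hw_field;
}

/* Accessor: first read per exit hits "hardware", later reads are free. */
unsigned long read_exit_info(void)
{
    if (!cache_valid) {
        cache_val = vmcs_readl_model();
        cache_valid = 1;
    }
    return cache_val;
}

/* Called once per vmexit to invalidate the cached value. */
void on_vmexit(void)
{
    cache_valid = 0;
}
```

Reading the field twice in one exit (as vmx.c did with VM_EXIT_INTR_INFO) then costs one VMREAD instead of two.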
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >> Anthony Liguori wrote: >> >>> Gregory Haskins wrote: >>> Today, there is no equivalent of a platform agnostic "iowrite32()" for hypercalls so the driver would look like the pseudocode above except substitute with kvm_hypercall(), lguest_hypercall(), etc. The proposal is to allow the hypervisor to assign a dynamic vector to resources in the backend and convey this vector to the guest (such as in PCI config-space as mentioned in my example use-case). This provides the "address negotiation" function that would normally be done for something like a pio port-address. The hypervisor agnostic driver can then use this globally recognized address-token coupled with other device-private ABI parameters to communicate with the device. This can all occur without the core hypervisor needing to understand the details beyond the addressing. >>> PCI already provides a hypervisor agnostic interface (via IO >>> regions). You have a mechanism for devices to discover which regions >>> they have >>> allocated and to request remappings. It's supported by Linux and >>> Windows. It works on the vast majority of architectures out there >>> today. >>> >>> Why reinvent the wheel? >>> >> >> I suspect the current wheel is square. And the air is out. Plus it's >> pulling to the left when I accelerate, but to be fair that may be my >> alignment > > No, your wheel is slightly faster on the highway, but doesn't work at > all off-road. Heh.. > > Consider nested virtualization where the host (H) runs a guest (G1) > which is itself a hypervisor, running a guest (G2). The host exposes > a set of virtio (V1..Vn) devices for guest G1. Guest G1, rather than > creating a new virtio device and bridging it to one of V1..Vn, > assigns virtio device V1 to guest G2, and prays. > > Now guest G2 issues a hypercall. Host H traps the hypercall, sees it > originated in G1 while in guest mode, so it injects it into G1. 
G1 > examines the parameters but can't make any sense of them, so it > returns an error to G2. > > If this were done using mmio or pio, it would have just worked. With > pio, H would have reflected the pio into G1, G1 would have done the > conversion from G2's port number into G1's port number and reissued > the pio, finally trapped by H and used to issue the I/O. I might be missing something, but I am not seeing the difference here. We have an "address" (in this case the HC-id) and a context (in this case G1 running in non-root mode). Whether the trap to H is a HC or a PIO, the context tells us that it needs to re-inject the same trap to G1 for proper handling. So the "address" is re-injected from H to G1 as an emulated trap to G1's root mode, and we continue (just like the PIO). And likewise, in both cases, G1 would (should?) know what to do with that "address" as it relates to G2, just as it would need to know what the PIO address is for. Typically this would result in some kind of translation of that "address", but I suppose even this is completely arbitrary and only G1 knows for sure. E.g. it might translate from hypercall vector X to Y similar to your PIO example, it might completely change transports, or it might terminate locally (e.g. emulated device in G1). IOW: G2 might be using hypercalls to talk to G1, and G1 might be using MMIO to talk to H. I don't think it matters from a topology perspective (though it might from a performance perspective). > With mmio, G1 would have set up G2's page tables to point directly at > the addresses set up by H, so we would actually have a direct G2->H > path. Of course we'd need an emulated iommu so all the memory > references actually resolve to G2's context. /me head explodes > > So the upshot is that hypercalls for devices must not be the primary > method of communications; they're fine as an optimization, but we > should always be able to fall back on something else. 
We also need to > figure out how G1 can stop V1 from advertising hypercall support. I agree it would be desirable to be able to control this exposure. However, I am not currently convinced it's strictly necessary because of the reason you mentioned above. And also note that I am not currently convinced it's even possible to control it. For instance, what if G1 is an old KVM, or (dare I say) a completely different hypervisor? You could control things like whether G1 can see the VMX/SVM option at a coarse level, but once you expose VMX/SVM, who is to say what G1 will expose to G2? G1 may very well advertise a HC feature bit to G2 which may allow G2 to try to make a VMCALL. How do you stop that? -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: Hmm, reminds me of something I thought of a while back. We could implement an 'mmio hypercall' that does mmio reads/writes via a hypercall instead of an mmio operation. That will speed up mmio for emulated devices (say, e1000). It's easy to hook into Linux (readl/writel), is pci-friendly, non-x86 friendly, etc. By the time you get down to userspace for an emulated device, that 2us difference between mmio and hypercalls is simply not going to make a difference. I'm surprised so much effort is going into this, is there any indication that this is even close to a bottleneck in any circumstance? We have much, much lower hanging fruit to attack. The basic fact that we still copy data multiple times in the networking drivers is clearly more significant than a few hundred nanoseconds that should occur less than once per packet. Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Greg, I think comparison is not entirely fair. FYI: I've updated the test/wiki to (hopefully) address your concerns. http://developer.novell.com/wiki/index.php/WhyHypercalls And we're now getting close to the point where the difference is virtually meaningless. At .14us, in order to see 1% CPU overhead added from PIO vs HC, you need 71429 exits. If you have this many exits, the sheer cost of the base vmexit overhead is going to result in about 15% CPU overhead. To put this another way, if your workload was entirely bound by vmexits (which is virtually impossible), then when you were saturating your CPU at 100%, only 7% of that is the cost of PIO exits vs. HC. In real life workloads, if you're paying 15% overhead just to the cost of exits (not including the cost of heavy weight or post-exit processing), you're toast. I think it's going to be very difficult to construct a real scenario where you'll have a measurable (i.e. > 1%) performance overhead from using PIO vs. HC. And in the absence of that, I don't see the justification for adding additional infrastructure to Linux to support this. The non-x86 architecture argument isn't valid because other architectures either 1) don't use PCI at all (s390) and are already using hypercalls 2) use PCI, but do not have a dedicated hypercall instruction (PPC emb) or 3) have PIO (ia64). Regards, Anthony Liguori Regards, -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: > On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: > >> Marcelo Tosatti wrote: >> >>> Also it would be interesting to see the MMIO comparison with EPT/NPT, >>> it probably sucks much less than what you're seeing. >>> >>> >> Why would NPT improve mmio? If anything, it would be worse, since the >> processor has to do the nested walk. >> >> Of course, these are newer machines, so the absolute results as well as >> the difference will be smaller. >> > > Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: > > NPT enabled: > test 0: 3088633284634 - 3059375712321 = 29257572313 > test 1: 3121754636397 - 3088633419760 = 33121216637 > test 2: 3204666462763 - 3121754668573 = 82911794190 > > NPT disabled: > test 0: 3638061646250 - 3609416811687 = 28644834563 > test 1: 3669413430258 - 3638061771291 = 31351658967 > test 2: 3736287253287 - 3669413463506 = 66873789781 > > Thanks for running that. It's interesting to see that NPT was in fact worse as Avi predicted. Would you mind if I graphed the result and added this data to my wiki? If so, could you adjust the tsc result into IOPs using the proper time-base and the test_count you ran with? I can show a graph with the data as is and the relative differences will properly surface, but it would be nice to have apples to apples in terms of IOPS units with my other run. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
On Fri, May 08, 2009 at 10:55:37AM +0300, Avi Kivity wrote: > Marcelo Tosatti wrote: >> Also it would be interesting to see the MMIO comparison with EPT/NPT, >> it probably sucks much less than what you're seeing. >> > > Why would NPT improve mmio? If anything, it would be worse, since the > processor has to do the nested walk. > > Of course, these are newer machines, so the absolute results as well as > the difference will be smaller. Quad-Core AMD Opteron(tm) Processor 2358 SE 2.4GHz: NPT enabled: test 0: 3088633284634 - 3059375712321 = 29257572313 test 1: 3121754636397 - 3088633419760 = 33121216637 test 2: 3204666462763 - 3121754668573 = 82911794190 NPT disabled: test 0: 3638061646250 - 3609416811687 = 28644834563 test 1: 3669413430258 - 3638061771291 = 31351658967 test 2: 3736287253287 - 3669413463506 = 66873789781
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: > On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote: > >> Chris Wright wrote: >> >>> * Gregory Haskins (ghask...@novell.com) wrote: >>> >>> Chris Wright wrote: > VF drivers can also have this issue (and typically use mmio). > I at least have a better idea what your proposal is, thanks for > explanation. Are you able to demonstrate concrete benefit with it yet > (improved latency numbers for example)? > > I had a test-harness/numbers for this kind of thing, but it's a bit crufty since it's from ~1.5 years ago. I will dig it up, update it, and generate/post new numbers. >>> That would be useful, because I keep coming back to pio and shared >>> page(s) when I think of why not to do this. Seems I'm not alone in that. >>> >>> thanks, >>> -chris >>> >>> >> I completed the resurrection of the test and wrote up a little wiki on >> the subject, which you can find here: >> >> http://developer.novell.com/wiki/index.php/WhyHypercalls >> >> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why >> reinvent the wheel?" questions. >> >> I will include this information when I publish the updated v2 series >> with the s/hypercall/dynhc changes. >> >> Let me know if you have any questions. >> > > Greg, > > I think comparison is not entirely fair. FYI: I've updated the test/wiki to (hopefully) address your concerns. http://developer.novell.com/wiki/index.php/WhyHypercalls Regards, -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: > On Fri, May 08, 2009 at 10:59:00AM +0300, Avi Kivity wrote: > >> Marcelo Tosatti wrote: >> >>> I think comparison is not entirely fair. You're using >>> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that >>> (on Intel) to only one register read: >>> >>> nr = kvm_register_read(vcpu, VCPU_REGS_RAX); >>> >>> Whereas in a real hypercall for (say) PIO you would need the address, >>> size, direction and data. >>> >>> >> Well, that's probably one of the reasons pio is slower, as the cpu has >> to set these up, and the kernel has to read them. >> >> >>> Also for PIO/MMIO you're adding this unoptimized lookup to the >>> measurement: >>> >>> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); >>> if (pio_dev) { >>> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); >>> complete_pio(vcpu); return 1; >>> } >>> >>> >> Since there are only one or two elements in the list, I don't see how it >> could be optimized. >> > > speaker_ioport, pit_ioport, pic_ioport, plus nulldev ioport. nulldev > is probably the last in the io_bus list. > > Not sure if this one matters very much. Point is you should measure the > exit time only, not the pio path vs hypercall path in kvm. > The problem is the exit time in and of itself isn't all that interesting to me. What I am interested in measuring is how long it takes KVM to process the request and realize that I want to execute function "X". Ultimately that is what matters in terms of execution latency and is thus the more interesting data. I think the exit time is possibly an interesting 5th data point, but it's more of a side-bar IMO. In any case, I suspect that both exits will be approximately the same at the VT/SVM level. OTOH: If there is a patch out there to improve KVM's code (say specifically the PIO handling logic), that is fair-game here and we should benchmark it. 
For instance, if you have ideas on ways to improve the find_pio_dev performance, etc. One item may be to replace the kvm->lock on the bus scan with an RCU or something (though PIOs are very frequent and the constant re-entry to an RCU read-side CS may effectively cause a perpetual grace-period and may be too prohibitive). CC'ing pmck. FWIW: the PIOoHCs were about 140ns slower than pure HC, so some of that 140 can possibly be recouped. I currently suspect the lock acquisition in the iobus-scan is the bulk of that time, but that is admittedly a guess. The remaining 200-250ns is elsewhere in the PIO decode. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: Also it would be interesting to see the MMIO comparison with EPT/NPT, it probably sucks much less than what you're seeing. Why would NPT improve mmio? If anything, it would be worse, since the processor has to do the nested walk. I suppose the hardware is much more efficient than walk_addr? There's all this kmalloc, spinlock, etc overhead in the fault path. mmio still has to do a walk_addr, even with npt. We don't take the mmu lock during walk_addr. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. 
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: > > Ack. I hope when it's all said and done I can convince you that the framework to code up those virtio backends in the kernel is vbus ;) >>> If vbus doesn't bring significant performance advantages, I'll prefer >>> virtio because of existing investment. >>> >> >> Just to clarify: vbus is just the container/framework for the in-kernel >> models. You can implement and deploy virtio devices inside the >> container (tho I haven't had a chance to sit down and implement one >> yet). Note that I did publish a virtio transport in the last few series >> to demonstrate how that might work, so it's just ripe for the picking if >> someone is so inclined. >> >> > > Yeah I keep getting confused over this. > >> So really the question is whether you implement the in-kernel virtio >> backend in vbus, in some other framework, or just do it standalone. >> > > I prefer the standalone model. Keep the glue in userspace. Just to keep the facts straight: The glue in userspace vs standalone model are independent variables. E.g. you can have the glue in userspace for vbus, too. It's not written that way today for KVM, but it's moving in that direction as we work through these subtopics like irqfd, dynhc, etc. What vbus buys you as a core technology is that you can write one backend that works "everywhere" (you only need a glue layer for each environment you want to support). You might say "I can make my backends work everywhere too", and to that I would say "by the time you get it to work, you will have duplicated almost my exact effort on vbus" ;). Of course, you may also say "I don't care if it works anywhere else but KVM", which is a perfectly valid (if not unfortunate) position to take. I think the confusion point is possibly a result of the name "vbus". The vbus core isn't really a true bus in the traditional sense. It's just a host-side kernel-based container for these device models. That is all I am talking about here. 
There is, of course, also an LDM "bus" for rendering vbus devices in the guest as a function of the current kvm-specific glue layer I've written. Note that this glue layer could render them as PCI in the future, TBD. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Marcelo Tosatti wrote: >> I think comparison is not entirely fair. You're using >> KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that >> (on Intel) to only one register read: >> >> nr = kvm_register_read(vcpu, VCPU_REGS_RAX); >> >> Whereas in a real hypercall for (say) PIO you would need the address, >> size, direction and data. >> > > Well, that's probably one of the reasons pio is slower, as the cpu has > to set these up, and the kernel has to read them. Right, that was the point I was trying to make. It's real-world overhead to measure how long it takes KVM to go round-trip in each of the respective trap types. > >> Also for PIO/MMIO you're adding this unoptimized lookup to the >> measurement: >> >> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); >> if (pio_dev) { >> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); >> complete_pio(vcpu); return 1; >> } >> > > Since there are only one or two elements in the list, I don't see how > it could be optimized. To Marcelo's point, I think he was more taking exception to the fact that the HC path was potentially completely optimized out if GCC was super-intuitive about the switch(nr) statement hitting the null vector. In theory, both the io_bus and the switch(nr) are about equivalent in algorithmic complexity (and depth, I should say) which is why I think in general the test is "fair". IOW it represents the real-world decode cycle function for each transport. However, if one side was artificially optimized simply due to the triviality of my NULLIO test, that is not fair, and that is the point I believe he was making. In any case, I just wrote a new version of the test which hopefully addresses this and forces GCC to leave it as a more real-world decode. (FYI: I saw no difference). I will update the tarball/wiki shortly. > >> Whereas for hypercall measurement you don't. 
I believe a fair comparison >> would be to have a shared guest/host memory area where you store guest/host >> TSC values and then do, on guest: >> >> rdtscll(&shared_area->guest_tsc); >> pio/mmio/hypercall >> ... back to host >> rdtscll(&shared_area->host_tsc); >> >> And then calculate the difference (minus guest's TSC_OFFSET of course)? >> > > I don't understand why you want host tsc? We're interested in > round-trip latency, so you want guest tsc all the time. Yeah, I agree. My take is he was just trying to introduce a real workload so GCC wouldn't do that potential "cheater decode" in the HC path. After thinking about it, however, I realized we could do that with a simple "state++" operation, so the new test does this in each of the various test's "execute" cycle. The timing calculation remains unchanged. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Ack. I hope when it's all said and done I can convince you that the framework to code up those virtio backends in the kernel is vbus ;) If vbus doesn't bring significant performance advantages, I'll prefer virtio because of existing investment. Just to clarify: vbus is just the container/framework for the in-kernel models. You can implement and deploy virtio devices inside the container (tho I haven't had a chance to sit down and implement one yet). Note that I did publish a virtio transport in the last few series to demonstrate how that might work, so it's just ripe for the picking if someone is so inclined. Yeah I keep getting confused over this. So really the question is whether you implement the in-kernel virtio backend in vbus, in some other framework, or just do it standalone. I prefer the standalone model. Keep the glue in userspace. 
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Anthony Liguori wrote: Gregory Haskins wrote: Today, there is no equivalent of a platform agnostic "iowrite32()" for hypercalls so the driver would look like the pseudocode above except substitute with kvm_hypercall(), lguest_hypercall(), etc. The proposal is to allow the hypervisor to assign a dynamic vector to resources in the backend and convey this vector to the guest (such as in PCI config-space as mentioned in my example use-case). This provides the "address negotiation" function that would normally be done for something like a pio port-address. The hypervisor agnostic driver can then use this globally recognized address-token coupled with other device-private ABI parameters to communicate with the device. This can all occur without the core hypervisor needing to understand the details beyond the addressing. PCI already provides a hypervisor agnostic interface (via IO regions). You have a mechanism for devices to discover which regions they have allocated and to request remappings. It's supported by Linux and Windows. It works on the vast majority of architectures out there today. Why reinvent the wheel? I suspect the current wheel is square. And the air is out. Plus it's pulling to the left when I accelerate, but to be fair that may be my alignment No, your wheel is slightly faster on the highway, but doesn't work at all off-road. Consider nested virtualization where the host (H) runs a guest (G1) which is itself a hypervisor, running a guest (G2). The host exposes a set of virtio (V1..Vn) devices for guest G1. Guest G1, rather than creating a new virtio device and bridging it to one of V1..Vn, assigns virtio device V1 to guest G2, and prays. Now guest G2 issues a hypercall. Host H traps the hypercall, sees it originated in G1 while in guest mode, so it injects it into G1. G1 examines the parameters but can't make any sense of them, so it returns an error to G2. If this were done using mmio or pio, it would have just worked. 
With pio, H would have reflected the pio into G1, G1 would have done the conversion from G2's port number into G1's port number and reissued the pio, finally trapped by H and used to issue the I/O. With mmio, G1 would have set up G2's page tables to point directly at the addresses set up by H, so we would actually have a direct G2->H path. Of course we'd need an emulated iommu so all the memory references actually resolve to G2's context. So the upshot is that hypercalls for devices must not be the primary method of communications; they're fine as an optimization, but we should always be able to fall back on something else. We also need to figure out how G1 can stop V1 from advertising hypercall support. 
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: I think comparison is not entirely fair. You're using KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that (on Intel) to only one register read: nr = kvm_register_read(vcpu, VCPU_REGS_RAX); Whereas in a real hypercall for (say) PIO you would need the address, size, direction and data. Well, that's probably one of the reasons pio is slower, as the cpu has to set these up, and the kernel has to read them. Also for PIO/MMIO you're adding this unoptimized lookup to the measurement: pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); if (pio_dev) { kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); complete_pio(vcpu); return 1; } Since there are only one or two elements in the list, I don't see how it could be optimized. Whereas for hypercall measurement you don't. I believe a fair comparison would be to have a shared guest/host memory area where you store guest/host TSC values and then do, on guest: rdtscll(&shared_area->guest_tsc); pio/mmio/hypercall ... back to host rdtscll(&shared_area->host_tsc); And then calculate the difference (minus guest's TSC_OFFSET of course)? I don't understand why you want host tsc? We're interested in round-trip latency, so you want guest tsc all the time. 
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: Also it would be interesting to see the MMIO comparison with EPT/NPT, it probably sucks much less than what you're seeing. Why would NPT improve mmio? If anything, it would be worse, since the processor has to do the nested walk. Of course, these are newer machines, so the absolute results as well as the difference will be smaller. 
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: > On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote: > >> Also for PIO/MMIO you're adding this unoptimized lookup to the >> measurement: >> >> pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); >> if (pio_dev) { >> kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); >> complete_pio(vcpu); >> return 1; >> } >> >> Whereas for hypercall measurement you don't. I believe a fair comparison >> would be to have a shared guest/host memory area where you store guest/host >> TSC values and then do, on guest: >> >> rdtscll(&shared_area->guest_tsc); >> pio/mmio/hypercall >> ... back to host >> rdtscll(&shared_area->host_tsc); >> >> And then calculate the difference (minus guest's TSC_OFFSET of course)? >> > > Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest" > Core2 Xeon 5130 @2.00Ghz, 4GB RAM. > > Also it would be interesting to see the MMIO comparison with EPT/NPT, > it probably sucks much less than what you're seeing. > > Agreed. If you or someone on this thread has such a beast, please fire up my test and post the numbers. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Marcelo Tosatti wrote: > On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote: > >> Chris Wright wrote: >> >>> * Gregory Haskins (ghask...@novell.com) wrote: >>> >>> Chris Wright wrote: > VF drivers can also have this issue (and typically use mmio). > I at least have a better idea what your proposal is, thanks for > explanation. Are you able to demonstrate concrete benefit with it yet > (improved latency numbers for example)? > > I had a test-harness/numbers for this kind of thing, but it's a bit crufty since it's from ~1.5 years ago. I will dig it up, update it, and generate/post new numbers. >>> That would be useful, because I keep coming back to pio and shared >>> page(s) when I think of why not to do this. Seems I'm not alone in that. >>> >>> thanks, >>> -chris >>> >>> >> I completed the resurrection of the test and wrote up a little wiki on >> the subject, which you can find here: >> >> http://developer.novell.com/wiki/index.php/WhyHypercalls >> >> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why >> reinvent the wheel?" questions. >> >> I will include this information when I publish the updated v2 series >> with the s/hypercall/dynhc changes. >> >> Let me know if you have any questions. >> > > Greg, > > I think comparison is not entirely fair. You're using > KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that > (on Intel) to only one register read: > > nr = kvm_register_read(vcpu, VCPU_REGS_RAX); > > Whereas in a real hypercall for (say) PIO you would need the address, > size, direction and data. > Hi Marcelo, I'll have to respectfully disagree with you here. What you are proposing is actually a different test: a 4th type I would call "PIO over HC". It is distinctly different than the existing MMIO, PIO, and HC tests already present. I assert that the current HC test remains valid because for pure hypercalls, the "nr" *is* the address. It identifies the function to be executed (e.g. 
VAPIC_POLL_IRQ = null), just like the PIO address of my nullio device identifies the function to be executed (i.e. nullio_write() = null) My argument is that the HC test emulates the "dynhc()" concept I have been talking about, whereas the PIOoHC is more like the pv_io_ops->iowrite approach. That said, your 4th test type would actually be a very interesting data-point to add to the suite (especially since we are still kicking around the notion of doing something like this). I will update the patches. > Also for PIO/MMIO you're adding this unoptimized lookup to the > measurement: > > pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); > if (pio_dev) { > kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); > complete_pio(vcpu); > return 1; > } > > Whereas for hypercall measurement you don't. In theory they should both share about the same algorithmic complexity in the decode-stage, but due to the possible optimization you mention you may have a point. I need to take some steps to ensure the HC path isn't artificially simplified by GCC (like making the execute stage do some trivial work like you mention below). > I believe a fair comparison > would be have a shared guest/host memory area where you store guest/host > TSC values and then do, on guest: > > rdtscll(&shared_area->guest_tsc); > pio/mmio/hypercall > ... back to host > rdtscll(&shared_area->host_tsc); > > And then calculate the difference (minus guests TSC_OFFSET of course)? > > I'm not sure I need that much complexity. I can probably just change the test harness to generate an ioread32(), and have the functions return the TSC value as a return parameter for all test types. The important thing is that we pick something extremely cheap (yet dynamic) to compute so the execution time doesn't invalidate the measurement granularity with a large constant. Regards, -Greg
Re: [RFC PATCH 0/3] generic hypercall support
On Thu, May 07, 2009 at 08:35:03PM -0300, Marcelo Tosatti wrote: > Also for PIO/MMIO you're adding this unoptimized lookup to the > measurement: > > pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); > if (pio_dev) { > kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); > complete_pio(vcpu); > return 1; > } > > Whereas for hypercall measurement you don't. I believe a fair comparison > would be to have a shared guest/host memory area where you store guest/host > TSC values and then do, on guest: > > rdtscll(&shared_area->guest_tsc); > pio/mmio/hypercall > ... back to host > rdtscll(&shared_area->host_tsc); > > And then calculate the difference (minus guest's TSC_OFFSET of course)? Test Machine: Dell Precision 490 - 4-way SMP (2x2) x86_64 "Woodcrest" Core2 Xeon 5130 @2.00Ghz, 4GB RAM. Also it would be interesting to see the MMIO comparison with EPT/NPT, it probably sucks much less than what you're seeing. 
Re: [RFC PATCH 0/3] generic hypercall support
On Thu, May 07, 2009 at 01:03:45PM -0400, Gregory Haskins wrote: > Chris Wright wrote: > > * Gregory Haskins (ghask...@novell.com) wrote: > > > >> Chris Wright wrote: > >> > >>> VF drivers can also have this issue (and typically use mmio). > >>> I at least have a better idea what your proposal is, thanks for > >>> explanation. Are you able to demonstrate concrete benefit with it yet > >>> (improved latency numbers for example)? > >>> > >> I had a test-harness/numbers for this kind of thing, but it's a bit > >> crufty since it's from ~1.5 years ago. I will dig it up, update it, and > >> generate/post new numbers. > >> > > > > That would be useful, because I keep coming back to pio and shared > > page(s) when I think of why not to do this. Seems I'm not alone in that. > > > > thanks, > > -chris > > > > I completed the resurrection of the test and wrote up a little wiki on > the subject, which you can find here: > > http://developer.novell.com/wiki/index.php/WhyHypercalls > > Hopefully this answers Chris' "show me the numbers" and Anthony's "Why > reinvent the wheel?" questions. > > I will include this information when I publish the updated v2 series > with the s/hypercall/dynhc changes. > > Let me know if you have any questions. Greg, I think comparison is not entirely fair. You're using KVM_HC_VAPIC_POLL_IRQ ("null" hypercall) and the compiler optimizes that (on Intel) to only one register read: nr = kvm_register_read(vcpu, VCPU_REGS_RAX); Whereas in a real hypercall for (say) PIO you would need the address, size, direction and data. Also for PIO/MMIO you're adding this unoptimized lookup to the measurement: pio_dev = vcpu_find_pio_dev(vcpu, port, size, !in); if (pio_dev) { kernel_pio(pio_dev, vcpu, vcpu->arch.pio_data); complete_pio(vcpu); return 1; } Whereas for hypercall measurement you don't. 
I believe a fair comparison would be to have a shared guest/host memory area where you store guest/host TSC values and then do, on guest: rdtscll(&shared_area->guest_tsc); pio/mmio/hypercall ... back to host rdtscll(&shared_area->host_tsc); And then calculate the difference (minus guest's TSC_OFFSET of course)? 
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Chris Wright wrote: > > > Chris, is that issue with the non ioread/iowrite access of a mangled > > pointer still an issue here? I would think so, but I am a bit fuzzy on > > whether there is still an issue of non-wrapped MMIO ever occurring. > > Arnd was saying it's a bug for other reasons, so perhaps it would work > out fine. Well, maybe. I only said that __raw_writel and pointer dereference is bad, but not writel. IIRC when we had that discussion about io-workarounds on powerpc, the outcome was that passing an IORESOURCE_MEM resource into pci_iomap must still result in something that can be passed into writel in addition to iowrite32, while an IORESOURCE_IO resource may or may not be valid for writel and/or outl. Unfortunately, this means that either readl/writel needs to be adapted in some way (e.g. the address also ioremapped to the mangled pointer) or the mechanism will be limited to I/O space accesses. Maybe BenH remembers the details better than me. Arnd <><
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (gregory.hask...@gmail.com) wrote: > Arnd Bergmann wrote: > > pci_iomap could look at the bus device that the PCI function sits on. > > If it detects a PCI bridge that has a certain property (config space > > setting, vendor/device ID, ...), it assumes that the device itself > > will be emulated and it should set the address flag for IO_COND. > > > > This implies that all pass-through devices need to be on a different > > PCI bridge from the emulated devices, which should be fairly > > straightforward to enforce. Hmm, this gets into the grey area of the ABI. I think this would mean an upgrade of the host would suddenly break when the mgmt tool does: (qemu) pci_add pci_addr=0:6 host host=01:10.0 > That's actually a pretty good idea. > > Chris, is that issue with the non ioread/iowrite access of a mangled > pointer still an issue here? I would think so, but I am a bit fuzzy on > whether there is still an issue of non-wrapped MMIO ever occurring. Arnd was saying it's a bug for other reasons, so perhaps it would work out fine. thanks, -chris
Re: [RFC PATCH 0/3] generic hypercall support
Arnd Bergmann wrote: > On Thursday 07 May 2009, Gregory Haskins wrote: > >> What I am not clear on is how you would know to flag the address to >> begin with. >> > > pci_iomap could look at the bus device that the PCI function sits on. > If it detects a PCI bridge that has a certain property (config space > setting, vendor/device ID, ...), it assumes that the device itself > will be emulated and it should set the address flag for IO_COND. > > This implies that all pass-through devices need to be on a different > PCI bridge from the emulated devices, which should be fairly > straightforward to enforce. > That's actually a pretty good idea. Chris, is that issue with the non ioread/iowrite access of a mangled pointer still an issue here? I would think so, but I am a bit fuzzy on whether there is still an issue of non-wrapped MMIO ever occurring. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Gregory Haskins wrote: > What I am not clear on is how you would know to flag the address to > begin with. pci_iomap could look at the bus device that the PCI function sits on. If it detects a PCI bridge that has a certain property (config space setting, vendor/device ID, ...), it assumes that the device itself will be emulated and it should set the address flag for IO_COND. This implies that all pass-through devices need to be on a different PCI bridge from the emulated devices, which should be fairly straightforward to enforce. Arnd <><
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (gregory.hask...@gmail.com) wrote: > After posting my numbers today, what I *can* tell you definitively is that > it's significantly slower to VMEXIT via MMIO. I guess I do not really > know the reason for sure. :) there's certainly more work, including insn decoding
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Arnd Bergmann wrote: > An easy way to deal with the pass-through case might be to actually use > __raw_writel there. In guest-to-guest communication, the two sides are > known to have the same endianness (I assume) and you can still add the > appropriate smp_mb() and such into the code. Ok, that was nonsense. I thought you meant pass-through to a memory range on the host that is potentially shared with other processes or guests. For pass-through to a real device, it obviously would not work. Arnd <><
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Gregory Haskins wrote: > Arnd Bergmann wrote: > > An mmio that goes through a PF is a bug, it's certainly broken on > > a number of platforms, so performance should not be an issue there. > > > > This may be my own ignorance, but I thought a VMEXIT of type "PF" was > how MMIO worked in VT/SVM. You are right that all MMIOs (and PIO on most non-x86 architectures) are handled this way in the end. What I meant was that an MMIO that traps because of a simple pointer dereference as in __raw_writel is a bug, while any actual writel() call could be diverted to do an hcall and therefore not cause a PF once the infrastructure is there. > I guess the problem that was later pointed out is that we cannot discern > which devices might be pass-through and therefore should not be > revectored through a HC. But I am even less knowledgeable about how > pass-through works than I am about the MMIO traps, so I might be > completely off here. An easy way to deal with the pass-through case might be to actually use __raw_writel there. In guest-to-guest communication, the two sides are known to have the same endianness (I assume) and you can still add the appropriate smp_mb() and such into the code. Arnd <><
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: > * Gregory Haskins (gregory.hask...@gmail.com) wrote: > >> What I am not clear on is how you would know to flag the address to >> begin with. >> > > That's why I mentioned pv_io_ops->iomap() earlier. Something I'd expect > would get called on IORESOURCE_PVIO type. Yeah, this wasn't clear at the time, but I totally get what you meant now in retrospect. -Greg signature.asc Description: OpenPGP digital signature
Re: [RFC PATCH 0/3] generic hypercall support
Arnd Bergmann wrote: > On Thursday 07 May 2009, Gregory Haskins wrote: > >> I guess technically mmio can just be a simple access of the page which >> would be problematic to trap locally without a PF. However it seems >> that most mmio always passes through a ioread()/iowrite() call so this >> is perhaps the hook point. If we set the stake in the ground that mmios >> that go through some other mechanism like PFs can just hit the "slow >> path" are an acceptable casualty, I think we can make that work. >> >> Thoughts? >> > > An mmio that goes through a PF is a bug, it's certainly broken on > a number of platforms, so performance should not be an issue there. > This may be my own ignorance, but I thought a VMEXIT of type "PF" was how MMIO worked in VT/SVM. I didn't mean to imply that either the guest or the host took a traditional PF exception in their respective IDT, if that is what you thought I meant here. Rather, the mmio region is unmapped in the guest MMU, access causes a VMEXIT to host-side KVM of type PF, and the host side code then consults the guest page-table to see if it's an MMIO or not. I could very well be mistaken as I have only a cursory understanding of what happens in KVM today with this path. After posting my numbers today, what I *can* tell you definitively is that it's significantly slower to VMEXIT via MMIO. I guess I do not really know the reason for sure. :) > Note that there are four commonly used interface classes for PIO/MMIO: > > 1. readl/writel: little-endian MMIO > 2. inl/outl: little-endian PIO > 3. ioread32/iowrite32: converged little-endian PIO/MMIO > 4. __raw_readl/__raw_writel: native-endian MMIO without checks > > You don't need to worry about the __raw_* stuff, as this should never > be used in device drivers. > > As a simplification, you could mandate that all drivers that want to > use this get converted to the ioread/iowrite class of interfaces and > leave the others slow.
> I guess the problem that was later pointed out is that we cannot discern which devices might be pass-through and therefore should not be revectored through a HC. But I am even less knowledgeable about how pass-through works than I am about the MMIO traps, so I might be completely off here. In any case, thank you kindly for the suggestions. Regards, -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >>> Oh yes. But don't call it dynhc - like Chris says it's the wrong >>> semantic. >>> >>> Since we want to connect it to an eventfd, call it HC_NOTIFY or >>> HC_EVENT or something along these lines. You won't be able to pass >>> any data, but that's fine. Registers are saved to memory anyway. >>> >> Ok, but how would you access the registers since you would presumably >> only be getting a waitq::func callback on the eventfd. Or were you >> saying that more data, if required, is saved in a side-band memory >> location? I can see the latter working. > > Yeah. You basically have that side-band in vbus shmem (or the virtio > ring). Ok, got it. > >> I can't wrap my head around >> the former. >> > > I only meant that registers aren't faster than memory, since they are > just another memory location. > > In fact registers are accessed through a function call (not that that > takes any time these days). > > >>> Just to make sure we have everything plumbed down, here's how I see >>> things working out (using qemu and virtio, use sed to taste): >>> >>> 1. qemu starts up, sets up the VM >>> 2. qemu creates virtio-net-server >>> 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx >>> ring, one set for rx ring) >>> 4. qemu connects the six eventfd to the data-available, >>> data-not-available, and kick ports of virtio-net-server >>> 5. the guest starts up and configures virtio-net in pci pin mode >>> 6. qemu notices and decides it will manage interrupts in user space >>> since this is complicated (shared level triggered interrupts) >>> 7. the guest OS boots, loads device driver >>> 8. device driver switches virtio-net to msix mode >>> 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the >>> notify fds as notifyfd >>> 10. look ma, no hands. >>> >>> Under the hood, the following takes place. >>> >>> kvm wires the irqfds to schedule a work item which fires the >>> interrupt. 
One day the kvm developers get their act together and >>> change it to inject the interrupt directly when the irqfd is signalled >>> (which could be from the net softirq or somewhere similarly nasty). >>> >>> virtio-net-server wires notifyfd according to its liking. It may >>> schedule a thread, or it may execute directly. >>> >>> And they all lived happily ever after. >>> >> >> Ack. I hope when it's all said and done I can convince you that the >> framework to code up those virtio backends in the kernel is vbus ;) > > If vbus doesn't bring significant performance advantages, I'll prefer > virtio because of existing investment. Just to clarify: vbus is just the container/framework for the in-kernel models. You can implement and deploy virtio devices inside the container (tho I haven't had a chance to sit down and implement one yet). Note that I did publish a virtio transport in the last few series to demonstrate how that might work, so it's just ripe for the picking if someone is so inclined. So really the question is whether you implement the in-kernel virtio backend in vbus, in some other framework, or just do it standalone. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (gregory.hask...@gmail.com) wrote: > What I am not clear on is how you would know to flag the address to > begin with. That's why I mentioned pv_io_ops->iomap() earlier. Something I'd expect would get called on IORESOURCE_PVIO type. This isn't really transparent though (only virtio devices basically), kind of like you're saying below. > Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it > would never be a real device). This means we route any io requests from > virtio-pci through pv_io_ops->mmio(), but not unflagged addresses. This > is not as slick as boosting *everyone's* mmio speed as Avi's original > idea would have, but it is perhaps a good tradeoff between the entirely > new namespace created by my original dynhc() proposal and leaving them > all PF based. > > This way, it's just like using my dynhc() proposal except the mmio-addr > is the substitute address-token (instead of the dynhc-vector). > Additionally, if you do not PV the kernel the IO_COND/pv_io_op is > ignored and it just slow-paths through the PF as it does today. Dynhc() > would be dependent on pv_ops. > > Thoughts?
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Oh yes. But don't call it dynhc - like Chris says it's the wrong semantic. Since we want to connect it to an eventfd, call it HC_NOTIFY or HC_EVENT or something along these lines. You won't be able to pass any data, but that's fine. Registers are saved to memory anyway. Ok, but how would you access the registers since you would presumably only be getting a waitq::func callback on the eventfd. Or were you saying that more data, if required, is saved in a side-band memory location? I can see the latter working. Yeah. You basically have that side-band in vbus shmem (or the virtio ring). I can't wrap my head around the former. I only meant that registers aren't faster than memory, since they are just another memory location. In fact registers are accessed through a function call (not that that takes any time these days). Just to make sure we have everything plumbed down, here's how I see things working out (using qemu and virtio, use sed to taste): 1. qemu starts up, sets up the VM 2. qemu creates virtio-net-server 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx ring, one set for rx ring) 4. qemu connects the six eventfd to the data-available, data-not-available, and kick ports of virtio-net-server 5. the guest starts up and configures virtio-net in pci pin mode 6. qemu notices and decides it will manage interrupts in user space since this is complicated (shared level triggered interrupts) 7. the guest OS boots, loads device driver 8. device driver switches virtio-net to msix mode 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the notify fds as notifyfd 10. look ma, no hands. Under the hood, the following takes place. kvm wires the irqfds to schedule a work item which fires the interrupt. One day the kvm developers get their act together and change it to inject the interrupt directly when the irqfd is signalled (which could be from the net softirq or somewhere similarly nasty). 
virtio-net-server wires notifyfd according to its liking. It may schedule a thread, or it may execute directly. And they all lived happily ever after. Ack. I hope when it's all said and done I can convince you that the framework to code up those virtio backends in the kernel is vbus ;) If vbus doesn't bring significant performance advantages, I'll prefer virtio because of existing investment. But even if not, this should provide enough plumbing that we can all coexist together peacefully. Yes, vbus and virtio can compete on their merits without bias from some maintainer getting in the way. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >>> Don't - it's broken. It will also catch device assignment mmio and >>> hypercall them. >>> >>> >> Ah. Crap. >> >> Would you be conducive if I continue along with the dynhc() approach >> then? >> > > Oh yes. But don't call it dynhc - like Chris says it's the wrong > semantic. > > Since we want to connect it to an eventfd, call it HC_NOTIFY or > HC_EVENT or something along these lines. You won't be able to pass > any data, but that's fine. Registers are saved to memory anyway. Ok, but how would you access the registers since you would presumably only be getting a waitq::func callback on the eventfd. Or were you saying that more data, if required, is saved in a side-band memory location? I can see the latter working. I can't wrap my head around the former. > > And btw, given that eventfd and the underlying infrastructure are so > flexible, it's probably better to go back to your original "irqfd gets > fd from userspace" just to be consistent everywhere. > > (no, I'm not deliberately making you rewrite that patch again and > again... it's going to be a key piece of infrastructure so I want to > get it right) Ok, np. Actually now that Davide showed me the waitq::func trick, the fd technically doesn't even need to be an eventfd per se. We can just plain-old "fget()" it and attach via the f_ops->poll() as I do in v5. I'll submit this later today. > > > Just to make sure we have everything plumbed down, here's how I see > things working out (using qemu and virtio, use sed to taste): > > 1. qemu starts up, sets up the VM > 2. qemu creates virtio-net-server > 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx > ring, one set for rx ring) > 4. qemu connects the six eventfd to the data-available, > data-not-available, and kick ports of virtio-net-server > 5. the guest starts up and configures virtio-net in pci pin mode > 6.
qemu notices and decides it will manage interrupts in user space > since this is complicated (shared level triggered interrupts) > 7. the guest OS boots, loads device driver > 8. device driver switches virtio-net to msix mode > 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the > notify fds as notifyfd > 10. look ma, no hands. > > Under the hood, the following takes place. > > kvm wires the irqfds to schedule a work item which fires the > interrupt. One day the kvm developers get their act together and > change it to inject the interrupt directly when the irqfd is signalled > (which could be from the net softirq or somewhere similarly nasty). > > virtio-net-server wires notifyfd according to its liking. It may > schedule a thread, or it may execute directly. > > And they all lived happily ever after. Ack. I hope when it's all said and done I can convince you that the framework to code up those virtio backends in the kernel is vbus ;) But even if not, this should provide enough plumbing that we can all coexist together peacefully. Thanks, -Greg
Re: [RFC PATCH 0/3] generic hypercall support
On Thursday 07 May 2009, Gregory Haskins wrote: > I guess technically mmio can just be a simple access of the page which > would be problematic to trap locally without a PF. However it seems > that most mmio always passes through a ioread()/iowrite() call so this > is perhaps the hook point. If we set the stake in the ground that mmios > that go through some other mechanism like PFs can just hit the "slow > path" are an acceptable casualty, I think we can make that work. > > Thoughts? An mmio that goes through a PF is a bug, it's certainly broken on a number of platforms, so performance should not be an issue there. Note that there are four commonly used interface classes for PIO/MMIO: 1. readl/writel: little-endian MMIO 2. inl/outl: little-endian PIO 3. ioread32/iowrite32: converged little-endian PIO/MMIO 4. __raw_readl/__raw_writel: native-endian MMIO without checks You don't need to worry about the __raw_* stuff, as this should never be used in device drivers. As a simplification, you could mandate that all drivers that want to use this get converted to the ioread/iowrite class of interfaces and leave the others slow. Arnd <><
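The ioread/iowrite class is attractive as a hook point precisely because it already dispatches on a cookie rather than a raw pointer. A hypothetical sketch of what a pv-aware iowrite32() might look like; pv_io_ops, is_pv_token(), is_pio_token() and token_to_port() are invented names for illustration, not existing kernel APIs:

```c
/* Sketch only: pv_io_ops, is_pv_token(), is_pio_token() and
 * token_to_port() are made-up names.  The idea is that iowrite32()
 * already dispatches on the token, so a token flagged as
 * paravirtual at iomap time can divert to a hypercall instead of
 * trapping through a page fault. */
void iowrite32(u32 val, void __iomem *addr)
{
        if (is_pv_token(addr))                       /* flagged by pci_iomap() */
                pv_io_ops.mmio_write(addr, val, 4);  /* single hypercall */
        else if (is_pio_token(addr))                 /* IO_COND-style PIO dispatch */
                outl(val, token_to_port(addr));
        else
                writel(val, addr);                   /* real MMIO, traps via PF */
}
```

Drivers still using readl/writel or the __raw_* variants directly would keep the slow PF path, which matches Arnd's suggestion of converting interested drivers to the ioread/iowrite interfaces.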
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: I think we just past the "too complicated" threshold. And the "can't spel" threshold in the same sentence. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Don't - it's broken. It will also catch device assignment mmio and hypercall them. Ah. Crap. Would you be conducive if I continue along with the dynhc() approach then? Oh yes. But don't call it dynhc - like Chris says it's the wrong semantic. Since we want to connect it to an eventfd, call it HC_NOTIFY or HC_EVENT or something along these lines. You won't be able to pass any data, but that's fine. Registers are saved to memory anyway. And btw, given that eventfd and the underlying infrastructure are so flexible, it's probably better to go back to your original "irqfd gets fd from userspace" just to be consistent everywhere. (no, I'm not deliberately making you rewrite that patch again and again... it's going to be a key piece of infrastructure so I want to get it right) Just to make sure we have everything plumbed down, here's how I see things working out (using qemu and virtio, use sed to taste): 1. qemu starts up, sets up the VM 2. qemu creates virtio-net-server 3. qemu allocates six eventfds: irq, stopirq, notify (one set for tx ring, one set for rx ring) 4. qemu connects the six eventfd to the data-available, data-not-available, and kick ports of virtio-net-server 5. the guest starts up and configures virtio-net in pci pin mode 6. qemu notices and decides it will manage interrupts in user space since this is complicated (shared level triggered interrupts) 7. the guest OS boots, loads device driver 8. device driver switches virtio-net to msix mode 9. qemu notices, plumbs the irq fds as msix interrupts, plumbs the notify fds as notifyfd 10. look ma, no hands. Under the hood, the following takes place. kvm wires the irqfds to schedule a work item which fires the interrupt. One day the kvm developers get their act together and change it to inject the interrupt directly when the irqfd is signalled (which could be from the net softirq or somewhere similarly nasty). virtio-net-server wires notifyfd according to its liking. 
It may schedule a thread, or it may execute directly. And they all lived happily ever after. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: > * Gregory Haskins (ghask...@novell.com) wrote: > >> Chris Wright wrote: >> >>> * Avi Kivity (a...@redhat.com) wrote: >>> Gregory Haskins wrote: > Cool, I will code this up and submit it. While I'm at it, I'll run it > through the "nullio" ringer, too. ;) It would be cool to see the > pv-mmio hit that 2.07us number. I can't think of any reason why this > will not be the case. > > Don't - it's broken. It will also catch device assignment mmio and hypercall them. >>> Not necessarily. It just needs to be creative w/ IO_COND >>> >> Hi Chris, >> Could you elaborate? How would you know which pages to hypercall and >> which to let PF? >> > > Was just thinking of some ugly mangling of the addr (I'm not entirely > sure what would work best). > Right, I get the part about flagging the address and then keying off that flag in IO_COND (like we do for PIO vs MMIO). What I am not clear on is how you would know to flag the address to begin with. Here's a thought: "PV" drivers can flag the IO (e.g. virtio-pci knows it would never be a real device). This means we route any io requests from virtio-pci through pv_io_ops->mmio(), but not unflagged addresses. This is not as slick as boosting *everyone's* mmio speed as Avi's original idea would have, but it is perhaps a good tradeoff between the entirely new namespace created by my original dynhc() proposal and leaving them all PF based. This way, it's just like using my dynhc() proposal except the mmio-addr is the substitute address-token (instead of the dynhc-vector). Additionally, if you do not PV the kernel, the IO_COND/pv_io_op is ignored and it just slow-paths through the PF as it does today. Dynhc() would be dependent on pv_ops. Thoughts? -Greg
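A hypothetical sketch of the flagging Greg describes, placed at the pci_iomap() level Chris suggested earlier; IORESOURCE_PVIO, pv_io_ops and default_pci_iomap() are illustrative names only, not existing kernel symbols:

```c
/* Sketch only: IORESOURCE_PVIO, pv_io_ops and default_pci_iomap()
 * are invented names.  Only a driver that knows it can never be a
 * physical device (e.g. virtio-pci) would set the flag, so
 * pass-through devices keep today's PF-based slow path untouched. */
void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long maxlen)
{
        struct resource *res = &dev->resource[bar];

        if (res->flags & IORESOURCE_PVIO)
                return pv_io_ops.iomap(dev, bar);   /* mangled pv token */

        return default_pci_iomap(dev, bar, maxlen); /* PIO or real MMIO */
}
```

On a non-paravirtualized kernel pv_io_ops.iomap would simply not be installed, and the flagged BAR falls back to the ordinary mapping, exactly the slow-path behavior Greg describes.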
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: * Gregory Haskins (ghask...@novell.com) wrote: Chris Wright wrote: * Avi Kivity (a...@redhat.com) wrote: Gregory Haskins wrote: Cool, I will code this up and submit it. While I'm at it, I'll run it through the "nullio" ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. Not necessarily. It just needs to be creative w/ IO_COND Hi Chris, Could you elaborate? How would you know which pages to hypercall and which to let PF? Was just thinking of some ugly mangling of the addr (I'm not entirely sure what would work best). I think we just past the "too complicated" threshold. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (ghask...@novell.com) wrote: > Chris Wright wrote: > > * Avi Kivity (a...@redhat.com) wrote: > >> Gregory Haskins wrote: > >>> Cool, I will code this up and submit it. While I'm at it, I'll run it > >>> through the "nullio" ringer, too. ;) It would be cool to see the > >>> pv-mmio hit that 2.07us number. I can't think of any reason why this > >>> will not be the case. > >>> > >> Don't - it's broken. It will also catch device assignment mmio and > >> hypercall them. > > > > Not necessarily. It just needs to be creative w/ IO_COND > > Hi Chris, > Could you elaborate? How would you know which pages to hypercall and > which to let PF? Was just thinking of some ugly mangling of the addr (I'm not entirely sure what would work best).
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote: > * Avi Kivity (a...@redhat.com) wrote: > >> Gregory Haskins wrote: >> >>> Cool, I will code this up and submit it. While I'm at it, I'll run it >>> through the "nullio" ringer, too. ;) It would be cool to see the >>> pv-mmio hit that 2.07us number. I can't think of any reason why this >>> will not be the case. >>> >> Don't - it's broken. It will also catch device assignment mmio and >> hypercall them. >> > > Not necessarily. It just needs to be creative w/ IO_COND > Hi Chris, Could you elaborate? How would you know which pages to hypercall and which to let PF? -Greg
Re: [RFC PATCH 0/3] generic hypercall support
* Avi Kivity (a...@redhat.com) wrote: > Gregory Haskins wrote: >> Cool, I will code this up and submit it. While I'm at it, I'll run it >> through the "nullio" ringer, too. ;) It would be cool to see the >> pv-mmio hit that 2.07us number. I can't think of any reason why this >> will not be the case. > > Don't - it's broken. It will also catch device assignment mmio and > hypercall them. Not necessarily. It just needs to be creative w/ IO_COND
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >> Avi Kivity wrote: >> >>> Gregory Haskins wrote: >>> I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the "slow path" are an acceptable casualty, I think we can make that work. >>> That's my thinking exactly. >>> >> >> Cool, I will code this up and submit it. While I'm at it, I'll run it >> through the "nullio" ringer, too. ;) It would be cool to see the >> pv-mmio hit that 2.07us number. I can't think of any reason why this >> will not be the case. >> > > Don't - it's broken. It will also catch device assignment mmio and > hypercall them. > Ah. Crap. Would you be conducive if I continue along with the dynhc() approach then? -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote: Avi Kivity wrote: Gregory Haskins wrote: I guess technically mmio can just be a simple access of the page which would be problematic to trap locally without a PF. However it seems that most mmio always passes through a ioread()/iowrite() call so this is perhaps the hook point. If we set the stake in the ground that mmios that go through some other mechanism like PFs can just hit the "slow path" are an acceptable casualty, I think we can make that work. That's my thinking exactly. Cool, I will code this up and submit it. While I'm at it, I'll run it through the "nullio" ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. Don't - it's broken. It will also catch device assignment mmio and hypercall them. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote: > Gregory Haskins wrote: >> I guess technically mmio can just be a simple access of the page which >> would be problematic to trap locally without a PF. However it seems >> that most mmio always passes through a ioread()/iowrite() call so this >> is perhaps the hook point. If we set the stake in the ground that mmios >> that go through some other mechanism like PFs can just hit the "slow >> path" are an acceptable casualty, I think we can make that work. >> > > That's my thinking exactly. Cool, I will code this up and submit it. While I'm at it, I'll run it through the "nullio" ringer, too. ;) It would be cool to see the pv-mmio hit that 2.07us number. I can't think of any reason why this will not be the case. > > Note we can cheat further. kvm already has a "coalesced mmio" feature > where side-effect-free mmios are collected in the kernel and passed to > userspace only when some other significant event happens. We could > pass those addresses to the guest and let it queue those writes > itself, avoiding the hypercall completely. > > Though it's probably pointless: if the guest is paravirtualized enough > to have the mmio hypercall, then it shouldn't be using e1000. Yeah...plus at least for my vbus purposes, all my guest->host transitions are explicitly to cause side-effects, or I wouldn't be doing them in the first place ;) I suspect virtio-pci is exactly the same. I.e. the coalescing has already been done at a higher layer for platforms running "PV" code. Still a cool feature, tho. -Greg
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote:
> I guess technically mmio can just be a simple access of the page which
> would be problematic to trap locally without a PF.  However it seems
> that most mmio always passes through a ioread()/iowrite() call so this
> is perhaps the hook point.  If we set the stake in the ground that mmios
> that go through some other mechanism like PFs can just hit the "slow
> path" are an acceptable casualty, I think we can make that work.

That's my thinking exactly.

Note we can cheat further.  kvm already has a "coalesced mmio" feature
where side-effect-free mmios are collected in the kernel and passed to
userspace only when some other significant event happens.  We could pass
those addresses to the guest and let it queue those writes itself,
avoiding the hypercall completely.

Though it's probably pointless: if the guest is paravirtualized enough to
have the mmio hypercall, then it shouldn't be using e1000.

--
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.
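[Editor's sketch] The "coalesced mmio" idea Avi describes above - batch
side-effect-free writes and replay them only when some significant event
forces a real exit - can be illustrated with a small userspace C sketch.
All struct and function names here are hypothetical; this is the batching
concept only, not KVM's actual coalesced-mmio ABI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative ring of pending side-effect-free MMIO writes. */
#define COALESCED_MAX 64

struct coalesced_write {
    uint64_t addr;
    uint32_t val;
};

struct coalesced_ring {
    uint32_t count;
    struct coalesced_write slot[COALESCED_MAX];
};

/* Queue a write; return 0 on success, -1 if the ring is full and the
 * caller must take the slow path (an actual exit). */
static int coalesced_push(struct coalesced_ring *r, uint64_t addr,
                          uint32_t val)
{
    if (r->count >= COALESCED_MAX)
        return -1;
    r->slot[r->count].addr = addr;
    r->slot[r->count].val = val;
    r->count++;
    return 0;
}

/* On the next real exit, replay every queued write into the device
 * model and empty the ring.  Returns how many writes were flushed. */
static uint32_t coalesced_flush(struct coalesced_ring *r,
                                void (*apply)(uint64_t addr, uint32_t val))
{
    uint32_t i, n = r->count;
    for (i = 0; i < n; i++) {
        if (apply)
            apply(r->slot[i].addr, r->slot[i].val);
    }
    r->count = 0;
    return n;
}
```

The payoff is that N queued writes cost a single exit instead of N, which
is exactly why it only works for writes with no immediate side-effects.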
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote:
> Gregory Haskins wrote:
>>> What do you think of my mmio hypercall?  That will speed up all mmio
>>> to be as fast as a hypercall, and then we can use ordinary mmio/pio
>>> writes to trigger things.
>>
>> I like it!
>>
>> Bigger question is what kind of work goes into making mmio a pv_op (or
>> is this already done)?
>
> Looks like it isn't there.  But it isn't any different than set_pte -
> convert a write into a hypercall.

I guess technically mmio can just be a simple access of the page, which
would be problematic to trap locally without a PF.  However, it seems that
most mmio always passes through an ioread()/iowrite() call, so this is
perhaps the hook point.  If we set the stake in the ground that mmios that
go through some other mechanism, like PFs, can just hit the "slow path" as
an acceptable casualty, I think we can make that work.

Thoughts?

-Greg
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote:
>> What do you think of my mmio hypercall?  That will speed up all mmio
>> to be as fast as a hypercall, and then we can use ordinary mmio/pio
>> writes to trigger things.
>
> I like it!
>
> Bigger question is what kind of work goes into making mmio a pv_op (or
> is this already done)?

Looks like it isn't there.  But it isn't any different than set_pte -
convert a write into a hypercall.
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote:
> Gregory Haskins wrote:
>> I completed the resurrection of the test and wrote up a little wiki on
>> the subject, which you can find here:
>>
>> http://developer.novell.com/wiki/index.php/WhyHypercalls
>>
>> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
>> reinvent the wheel?" questions.
>>
>> I will include this information when I publish the updated v2 series
>> with the s/hypercall/dynhc changes.
>>
>> Let me know if you have any questions.
>
> Well, 420 ns is not to be sneezed at.
>
> What do you think of my mmio hypercall?  That will speed up all mmio to
> be as fast as a hypercall, and then we can use ordinary mmio/pio writes
> to trigger things.

I like it!

Bigger question is what kind of work goes into making mmio a pv_op (or is
this already done)?

-Greg
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote:
> I completed the resurrection of the test and wrote up a little wiki on
> the subject, which you can find here:
>
> http://developer.novell.com/wiki/index.php/WhyHypercalls
>
> Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
> reinvent the wheel?" questions.
>
> I will include this information when I publish the updated v2 series
> with the s/hypercall/dynhc changes.
>
> Let me know if you have any questions.

Well, 420 ns is not to be sneezed at.

What do you think of my mmio hypercall?  That will speed up all mmio to be
as fast as a hypercall, and then we can use ordinary mmio/pio writes to
trigger things.
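[Editor's sketch] The mmio-hypercall conversion discussed above - hook the
readl/writel path so the guest hands (gpa, size, value) to the hypervisor
explicitly instead of faulting and being decoded - might look roughly like
the following, including the fallback to a plain store when no hypercall
support is present (e.g. qemu/tcg).  All names are hypothetical, not KVM's
ABI:

```c
#include <assert.h>
#include <stdint.h>

/* Filled in at boot if the hypervisor advertises the feature; left NULL
 * on bare metal or qemu/tcg, where we fall back to a real mmio store. */
static long (*mmio_hypercall)(uint64_t gpa, uint32_t len, uint64_t val);

static void pv_writel(uint32_t val, volatile uint32_t *addr)
{
    if (mmio_hypercall) {
        /* Fast path: one explicit transition, no instruction decode. */
        mmio_hypercall((uint64_t)(uintptr_t)addr, sizeof(val), val);
        return;
    }
    /* Slow/native path: an ordinary store (which would trap via #PF if
     * this were really device memory under full emulation). */
    *addr = val;
}

/* Test double standing in for the hypervisor side. */
static uint64_t last_gpa, last_val;
static long fake_hypercall(uint64_t gpa, uint32_t len, uint64_t val)
{
    (void)len;
    last_gpa = gpa;
    last_val = val;
    return 0;
}
```

The same driver binary works in both modes; only the function pointer
(standing in for a pv_ops hook) changes, which is what makes the scheme
pci-friendly and tcg-safe.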
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote:
> * Gregory Haskins (ghask...@novell.com) wrote:
>> Chris Wright wrote:
>>> VF drivers can also have this issue (and typically use mmio).  I at
>>> least have a better idea what your proposal is, thanks for the
>>> explanation.  Are you able to demonstrate concrete benefit with it yet
>>> (improved latency numbers, for example)?
>>
>> I had a test-harness/numbers for this kind of thing, but it's a bit
>> crufty since it's from ~1.5 years ago.  I will dig it up, update it,
>> and generate/post new numbers.
>
> That would be useful, because I keep coming back to pio and shared
> page(s) when thinking of why not to do this.  Seems I'm not alone in
> that.
>
> thanks,
> -chris

I completed the resurrection of the test and wrote up a little wiki on the
subject, which you can find here:

http://developer.novell.com/wiki/index.php/WhyHypercalls

Hopefully this answers Chris' "show me the numbers" and Anthony's "Why
reinvent the wheel?" questions.

I will include this information when I publish the updated v2 series with
the s/hypercall/dynhc changes.

Let me know if you have any questions.

-Greg
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote:
> Chris Wright wrote:
>> * Gregory Haskins (ghask...@novell.com) wrote:
>>> Chris Wright wrote:
>>>> But a free-form hypercall(unsigned long nr, unsigned long *args,
>>>> size_t count) means hypercall number and arg list must be the same
>>>> in order for code to call hypercall() in a hypervisor agnostic way.
>>>
>>> Yes, and that is exactly the intention.  I think it's perhaps the
>>> point you are missing.
>>
>> Yes, I was reading this as purely any hypercall, but it seems a bit
>> more like:
>>   pv_io_ops->iomap()
>>   pv_io_ops->ioread()
>>   pv_io_ops->iowrite()
>
> Right.

Hmm, reminds me of something I thought of a while back.

We could implement an "mmio hypercall" that does mmio reads/writes via a
hypercall instead of an mmio operation.  That will speed up mmio for
emulated devices (say, e1000).  It's easy to hook into Linux
(readl/writel), is pci-friendly, non-x86 friendly, etc.  It also makes the
device work when hypercall support is not available (qemu/tcg); you simply
fall back on mmio.

--
error compiling committee.c: too many arguments to function
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (ghask...@novell.com) wrote:
> Chris Wright wrote:
>> VF drivers can also have this issue (and typically use mmio).  I at
>> least have a better idea what your proposal is, thanks for the
>> explanation.  Are you able to demonstrate concrete benefit with it yet
>> (improved latency numbers, for example)?
>
> I had a test-harness/numbers for this kind of thing, but it's a bit
> crufty since it's from ~1.5 years ago.  I will dig it up, update it, and
> generate/post new numbers.

That would be useful, because I keep coming back to pio and shared page(s)
when thinking of why not to do this.  Seems I'm not alone in that.

thanks,
-chris
Re: [RFC PATCH 0/3] generic hypercall support
Anthony Liguori wrote:
> Gregory Haskins wrote:
>> Today, there is no equivalent of a platform agnostic "iowrite32()" for
>> hypercalls, so the driver would look like the pseudocode above except
>> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The
>> proposal is to allow the hypervisor to assign a dynamic vector to
>> resources in the backend and convey this vector to the guest (such as
>> in PCI config-space, as mentioned in my example use-case).  This
>> provides the "address negotiation" function that would normally be
>> done for something like a pio port-address.  The hypervisor agnostic
>> driver can then use this globally recognized address-token coupled
>> with other device-private ABI parameters to communicate with the
>> device.  This can all occur without the core hypervisor needing to
>> understand the details beyond the addressing.
>
> PCI already provides a hypervisor agnostic interface (via IO regions).
> You have a mechanism for devices to discover which regions they have
> allocated and to request remappings.  It's supported by Linux and
> Windows.  It works on the vast majority of architectures out there
> today.
>
> Why reinvent the wheel?

I suspect the current wheel is square.  And the air is out.  Plus it's
pulling to the left when I accelerate, but to be fair that may be my
alignment :)  But I digress.  See:

http://patchwork.kernel.org/patch/21865/

To give PCI proper respect, I think its greatest value-add here is the
inherent IRQ routing (which is a huge/difficult component, as I
experienced with dynirq in vbus v1).  Beyond that, however, I think we can
do better.

HTH

-Greg
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote:
> Chris Wright wrote:
>> * Gregory Haskins (gregory.hask...@gmail.com) wrote:
>>> So you would never have someone making a generic
>>> hypercall(KVM_HC_MMU_OP).  I agree.
>>
>> Which is why I think the interface proposal you've made is wrong.
>
> I respectfully disagree.  It's only wrong in that the name chosen for
> the interface was perhaps too broad/vague.  I still believe the concept
> is sound, and the general layering is appropriate.
>
>> There's already hypercall interfaces w/ specific ABI and semantic
>> meaning (which are typically called directly/indirectly from an
>> existing pv op hook).
>
> Yes, these are different, thus the new interface.
>
>> But a free-form hypercall(unsigned long nr, unsigned long *args,
>> size_t count) means hypercall number and arg list must be the same in
>> order for code to call hypercall() in a hypervisor agnostic way.
>
> Yes, and that is exactly the intention.  I think it's perhaps the point
> you are missing.
>
> I am well aware that historically the things we do over a hypercall
> interface would inherently have meaning only to a specific hypervisor
> (e.g. KVM_HC_MMU_OPS (vector 2) via kvm_hypercall()).  However, this
> doesn't in any way infer that it is the only use for the general
> concept.  It's just the only way they have been exploited to date.
>
> While I acknowledge that the hypervisor certainly must be coordinated
> with their use, in their essence hypercalls are just another form of IO
> joining the ranks of things like MMIO and PIO.  This is an attempt to
> bring them out of the bowels of CONFIG_PARAVIRT to make them a first
> class citizen.
>
> The thing I am building here is really not a general hypercall in the
> broad sense.  Rather, it's a subset of the hypercall vector namespace.
> It is designed specifically for dynamically binding a synchronous
> call() interface to things like virtual devices, and it is therefore
> these virtual device models that define the particular ABI within that
> namespace.  Thus the ABI in question is explicitly independent of the
> underlying hypervisor.
>
> I therefore stand by the proposed design to have this interface
> described above the hypervisor support layer (i.e. pv_ops), albeit with
> perhaps a better name like "dynamic hypercall" as per my later
> discussion with Avi.
>
> Consider PIO: the hypervisor (or hardware) and OS negotiate a port
> address, but the two end-points are the driver and the device-model (or
> real device).  The driver doesn't have to say:
>
>   if (kvm)
>       kvm_iowrite32(addr, ...);
>   else if (lguest)
>       lguest_iowrite32(addr, ...);
>   else
>       native_iowrite32(addr, ...);
>
> Instead, it just says "iowrite32(addr, ...);" and the address is used
> to route the message appropriately by the platform.  The ABI of that
> message, however, is specific to the driver/device and is not
> interpreted by kvm/lguest/native-hw infrastructure on the way.
>
> Today, there is no equivalent of a platform agnostic "iowrite32()" for
> hypercalls, so the driver would look like the pseudocode above except
> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
> is to allow the hypervisor to assign a dynamic vector to resources in
> the backend and convey this vector to the guest (such as in PCI
> config-space, as mentioned in my example use-case).  This provides the
> "address negotiation" function that would normally be done for
> something like a pio port-address.  The hypervisor agnostic driver can
> then use this globally recognized address-token coupled with other
> device-private ABI parameters to communicate with the device.  This can
> all occur without the core hypervisor needing to understand the details
> beyond the addressing.

PCI already provides a hypervisor agnostic interface (via IO regions).
You have a mechanism for devices to discover which regions they have
allocated and to request remappings.  It's supported by Linux and Windows.
It works on the vast majority of architectures out there today.

Why reinvent the wheel?

Regards,

Anthony Liguori
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote:
> * Gregory Haskins (ghask...@novell.com) wrote:
>> Chris Wright wrote:
>>> But a free-form hypercall(unsigned long nr, unsigned long *args,
>>> size_t count) means hypercall number and arg list must be the same in
>>> order for code to call hypercall() in a hypervisor agnostic way.
>>
>> Yes, and that is exactly the intention.  I think it's perhaps the
>> point you are missing.
>
> Yes, I was reading this as purely any hypercall, but it seems a bit
> more like:
>   pv_io_ops->iomap()
>   pv_io_ops->ioread()
>   pv_io_ops->iowrite()

Right.

>> Today, there is no equivalent of a platform agnostic "iowrite32()" for
>> hypercalls, so the driver would look like the pseudocode above except
>> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The
>> proposal is to allow the hypervisor to assign a dynamic vector to
>> resources in the backend and convey this vector to the guest (such as
>> in PCI config-space, as mentioned in my example use-case).  This
>> provides the "address negotiation" function that would normally be
>> done for something like a pio port-address.  The hypervisor agnostic
>> driver can then use this globally recognized address-token coupled
>> with other device-private ABI parameters to communicate with the
>> device.  This can all occur without the core hypervisor needing to
>> understand the details beyond the addressing.
>
> VF drivers can also have this issue (and typically use mmio).  I at
> least have a better idea what your proposal is, thanks for the
> explanation.  Are you able to demonstrate concrete benefit with it yet
> (improved latency numbers, for example)?

I had a test-harness/numbers for this kind of thing, but it's a bit crufty
since it's from ~1.5 years ago.  I will dig it up, update it, and
generate/post new numbers.

Thanks Chris,

-Greg
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (ghask...@novell.com) wrote:
> Chris Wright wrote:
>> But a free-form hypercall(unsigned long nr, unsigned long *args,
>> size_t count) means hypercall number and arg list must be the same in
>> order for code to call hypercall() in a hypervisor agnostic way.
>
> Yes, and that is exactly the intention.  I think it's perhaps the point
> you are missing.

Yes, I was reading this as purely any hypercall, but it seems a bit more
like:
  pv_io_ops->iomap()
  pv_io_ops->ioread()
  pv_io_ops->iowrite()

> Today, there is no equivalent of a platform agnostic "iowrite32()" for
> hypercalls, so the driver would look like the pseudocode above except
> substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal
> is to allow the hypervisor to assign a dynamic vector to resources in
> the backend and convey this vector to the guest (such as in PCI
> config-space, as mentioned in my example use-case).  This provides the
> "address negotiation" function that would normally be done for
> something like a pio port-address.  The hypervisor agnostic driver can
> then use this globally recognized address-token coupled with other
> device-private ABI parameters to communicate with the device.  This can
> all occur without the core hypervisor needing to understand the details
> beyond the addressing.

VF drivers can also have this issue (and typically use mmio).  I at least
have a better idea what your proposal is, thanks for the explanation.  Are
you able to demonstrate concrete benefit with it yet (improved latency
numbers, for example)?

thanks,
-chris
Re: [RFC PATCH 0/3] generic hypercall support
Chris Wright wrote:
> * Gregory Haskins (gregory.hask...@gmail.com) wrote:
>> So you would never have someone making a generic
>> hypercall(KVM_HC_MMU_OP).  I agree.
>
> Which is why I think the interface proposal you've made is wrong.

I respectfully disagree.  It's only wrong in that the name chosen for the
interface was perhaps too broad/vague.  I still believe the concept is
sound, and the general layering is appropriate.

> There's already hypercall interfaces w/ specific ABI and semantic
> meaning (which are typically called directly/indirectly from an
> existing pv op hook).

Yes, these are different, thus the new interface.

> But a free-form hypercall(unsigned long nr, unsigned long *args, size_t
> count) means hypercall number and arg list must be the same in order
> for code to call hypercall() in a hypervisor agnostic way.

Yes, and that is exactly the intention.  I think it's perhaps the point
you are missing.

I am well aware that historically the things we do over a hypercall
interface would inherently have meaning only to a specific hypervisor
(e.g. KVM_HC_MMU_OPS (vector 2) via kvm_hypercall()).  However, this
doesn't in any way infer that it is the only use for the general concept.
It's just the only way they have been exploited to date.

While I acknowledge that the hypervisor certainly must be coordinated with
their use, in their essence hypercalls are just another form of IO joining
the ranks of things like MMIO and PIO.  This is an attempt to bring them
out of the bowels of CONFIG_PARAVIRT to make them a first class citizen.

The thing I am building here is really not a general hypercall in the
broad sense.  Rather, it's a subset of the hypercall vector namespace.  It
is designed specifically for dynamically binding a synchronous call()
interface to things like virtual devices, and it is therefore these
virtual device models that define the particular ABI within that
namespace.  Thus the ABI in question is explicitly independent of the
underlying hypervisor.

I therefore stand by the proposed design to have this interface described
above the hypervisor support layer (i.e. pv_ops), albeit with perhaps a
better name like "dynamic hypercall" as per my later discussion with Avi.

Consider PIO: the hypervisor (or hardware) and OS negotiate a port
address, but the two end-points are the driver and the device-model (or
real device).  The driver doesn't have to say:

  if (kvm)
      kvm_iowrite32(addr, ...);
  else if (lguest)
      lguest_iowrite32(addr, ...);
  else
      native_iowrite32(addr, ...);

Instead, it just says "iowrite32(addr, ...);" and the address is used to
route the message appropriately by the platform.  The ABI of that message,
however, is specific to the driver/device and is not interpreted by
kvm/lguest/native-hw infrastructure on the way.

Today, there is no equivalent of a platform agnostic "iowrite32()" for
hypercalls, so the driver would look like the pseudocode above except
substitute with kvm_hypercall(), lguest_hypercall(), etc.  The proposal is
to allow the hypervisor to assign a dynamic vector to resources in the
backend and convey this vector to the guest (such as in PCI config-space,
as mentioned in my example use-case).  This provides the "address
negotiation" function that would normally be done for something like a pio
port-address.  The hypervisor agnostic driver can then use this globally
recognized address-token coupled with other device-private ABI parameters
to communicate with the device.  This can all occur without the core
hypervisor needing to understand the details beyond the addressing.

What this means to our interface design is that the only thing the
hypervisor really cares about is the first "nr" parameter.  This acts as
our address-token.  The optional/variable list of args is just payload as
far as the core infrastructure is concerned and is coupled only to our
device ABI.  They were chosen to be an array of ulongs (vs something like
varargs) to reflect the fact that hypercalls are typically passed by
packing registers.

Hope this helps,

-Greg
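[Editor's sketch] The address-token scheme described above - the core only
routes on the vector, while the argument payload stays private between
driver and device model - can be sketched as a host-side dispatch table.
Every name and size here is illustrative, not part of any real KVM
interface:

```c
#include <assert.h>
#include <stddef.h>

#define DYNHC_VECTORS 16

typedef long (*dynhc_fn)(unsigned long *args, size_t count);

static dynhc_fn dynhc_table[DYNHC_VECTORS];

/* Host side: bind a device model's handler and return the vector the
 * guest should use (the "address negotiation" step, conveyed e.g. via
 * PCI config space). */
static int dynhc_assign(dynhc_fn fn)
{
    int v;
    for (v = 0; v < DYNHC_VECTORS; v++) {
        if (!dynhc_table[v]) {
            dynhc_table[v] = fn;
            return v;
        }
    }
    return -1;
}

/* Guest->host entry: the core looks only at the vector; the arg list is
 * private ABI between the driver and its device model. */
static long dynhc(int vector, unsigned long *args, size_t count)
{
    if (vector < 0 || vector >= DYNHC_VECTORS || !dynhc_table[vector])
        return -1;
    return dynhc_table[vector](args, count);
}

/* A toy device model whose private ABI happens to be "sum the args". */
static long adder_device(unsigned long *args, size_t count)
{
    long sum = 0;
    size_t i;
    for (i = 0; i < count; i++)
        sum += (long)args[i];
    return sum;
}
```

The point of the sketch is the layering: dynhc() never interprets the
payload, so the same routing core serves any device ABI bound into the
table.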
Re: [RFC PATCH 0/3] generic hypercall support
* Gregory Haskins (gregory.hask...@gmail.com) wrote:
> So you would never have someone making a generic
> hypercall(KVM_HC_MMU_OP).  I agree.

Which is why I think the interface proposal you've made is wrong.  There's
already hypercall interfaces w/ specific ABI and semantic meaning (which
are typically called directly/indirectly from an existing pv op hook).
But a free-form hypercall(unsigned long nr, unsigned long *args, size_t
count) means hypercall number and arg list must be the same in order for
code to call hypercall() in a hypervisor agnostic way.

The pv_ops level needs to have semantic meaning, not a free-form hypercall
multiplexor.

thanks,
-chris
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote:
> I see.
>
> I had designed it slightly differently, where KVM could assign any top
> level vector it wanted, and thus that drove the guest-side interface
> you see here to be more "generic hypercall".  However, I think your
> proposal is perfectly fine too, and it makes sense to more narrowly
> focus these calls as specifically "dynamic"...as those are the only
> vectors that we could technically use like this anyway.
>
> So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC"
> to kvm_para.h, and I will change the interface to follow suit
> (something like s/hypercall/dynhc).  Sound good?

Yeah.

Another couple of points:

- on the host side, we'd rig this to hit an eventfd.  Nothing stops us
  from rigging pio to hit an eventfd as well, giving us kernel handling
  for pio trigger points.

- pio actually has an advantage over hypercalls with nested guests.
  Since hypercalls don't have an associated port number, the lowermost
  hypervisor must interpret a hypercall as going to a guest's hypervisor,
  and not any lower-level hypervisors.  What it boils down to is that you
  cannot use device assignment to give a guest access to a virtio/vbus
  device from a lower level hypervisor.

(Bah, that's totally unreadable.  What I want is, instead of

  hypervisor[eth0/virtio-server] -> intermediate[virtio-driver/virtio-server] -> guest[virtio-driver]

do

  hypervisor[eth0/virtio-server] -> intermediate[assign virtio device] -> guest[virtio-driver]

well, it's probably still unreadable)
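[Editor's sketch] Avi's first point - rigging the trigger to hit an
eventfd - leans on eventfd's counter semantics: each signal adds to a
64-bit counter, and the handler's read() drains the accumulated count in a
single wakeup.  A minimal userspace demonstration of just that primitive
(not KVM's actual ioeventfd plumbing), assuming a Linux host:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Each "kick" adds 1 to the counter; the handler's read() drains it,
 * so several kicks issued before the handler runs coalesce into one
 * wakeup whose value is the kick count. */
static uint64_t kick_and_drain(int kicks)
{
    int efd = eventfd(0, 0);
    uint64_t one = 1, total = 0;
    ssize_t rc;
    int i;

    assert(efd >= 0);
    for (i = 0; i < kicks; i++) {
        rc = write(efd, &one, sizeof(one));  /* producer (vcpu exit) */
        assert(rc == sizeof(one));
    }
    rc = read(efd, &total, sizeof(total));   /* consumer (handler) */
    assert(rc == sizeof(total));
    close(efd);
    return total;
}
```

The same file descriptor can be handed to an in-kernel producer, which is
what makes it a natural rendezvous for pio/hypercall trigger points.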
Re: [RFC PATCH 0/3] generic hypercall support
Gregory Haskins wrote:
> So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC"
> to kvm_para.h, and I will change the interface to follow suit
> (something like s/hypercall/dynhc).  Sound good?

A small ramification of this change will be that I will need to do
something like add a feature-bit to cpuid for detecting if HC_DYNAMIC is
supported on the backend or not.  The current v1 design doesn't suffer
from this requirement because the presence of the dynamic vector itself is
enough to know it's supported.  I like Avi's proposal enough to say that
it's worth this minor inconvenience, but FYI I will have to additionally
submit a userspace patch for v2 if we go this route.

-Greg
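[Editor's sketch] The feature-bit probe Gregory mentions could look
roughly like the following, modeled on KVM's paravirt feature cpuid leaf
(0x40000001).  The DYNHC bit number is purely hypothetical, and the cpuid
access sits behind a function pointer so the logic can be exercised
without real hardware:

```c
#include <assert.h>
#include <stdint.h>

#define KVM_CPUID_FEATURES 0x40000001u   /* KVM paravirt feature leaf */
#define KVM_FEATURE_DYNHC  (1u << 9)     /* hypothetical bit number */

/* Indirection standing in for the real cpuid instruction. */
static uint32_t (*cpuid_eax)(uint32_t leaf);

/* Guest-side probe: only use the dynhc vector if the backend says so. */
static int dynhc_supported(void)
{
    return (cpuid_eax(KVM_CPUID_FEATURES) & KVM_FEATURE_DYNHC) != 0;
}

/* Test double for a host that may or may not advertise the feature. */
static uint32_t fake_features;
static uint32_t fake_cpuid_eax(uint32_t leaf)
{
    return leaf == KVM_CPUID_FEATURES ? fake_features : 0;
}
```

This mirrors why the userspace patch is needed: something on the host side
has to set the bit in the cpuid leaf it exposes to the guest.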
Re: [RFC PATCH 0/3] generic hypercall support
Avi Kivity wrote:
> Gregory Haskins wrote:
>> Avi Kivity wrote:
>>> Gregory Haskins wrote:
>>>> (Applies to Linus' tree, b4348f32dae3cb6eb4bc21c7ed8f76c0b11e9d6a)
>>>>
>>>> Please see patch 1/3 for a description.
>>>>
>>>> This has been tested with a KVM guest on x86_64 and appears to work
>>>> properly.  Comments, please.
>>>
>>> What about the hypercalls in include/asm/kvm_para.h?
>>>
>>> In general, hypercalls cannot be generic since each hypervisor
>>> implements its own ABI.
>>
>> Please see the prologue to 1/3.  It's all described there, including a
>> use case which I think answers your questions.  If there is still
>> ambiguity, let me know.
>
> Yeah, sorry.
>
>>> The abstraction needs to be at a higher level (pv_ops is such a
>>> level).
>>
>> Yep, agreed.  That's exactly what this series is doing, actually.
>
> No, it doesn't.  It makes "making hypercalls" a pv_op, but hypervisors
> don't implement the same ABI.

Yes, that is true, but I think the issue right now is more of semantics.
I think we are on the same page.

So you would never have someone making a generic
hypercall(KVM_HC_MMU_OP).  I agree.  What I am proposing here is more akin
to PIO-BAR + iowrite()/ioread().  E.g. the infrastructure sets up the
"addressing" (where in PIO this is literally an address, and for
hypercalls this is a vector), but the "device" defines the ABI at that
address.  So it's really the "device end-point" that is defining the ABI
here, not the hypervisor (per se), and that's why I thought it's ok to
declare these "generic".  But to your point below...

> pv_ops all _use_ hypercalls to implement higher level operations, like
> set_pte (probably the only place set_pte can be considered a high level
> operation).
>
> In this case, the higher level event could be
> hypervisor_dynamic_event(number); each pv_ops implementation would use
> its own hypercalls to implement that.

I see.

I had designed it slightly differently, where KVM could assign any top
level vector it wanted, and thus that drove the guest-side interface you
see here to be more "generic hypercall".  However, I think your proposal
is perfectly fine too, and it makes sense to more narrowly focus these
calls as specifically "dynamic"...as those are the only vectors that we
could technically use like this anyway.

So rather than allocate a top-level vector, I will add "KVM_HC_DYNAMIC" to
kvm_para.h, and I will change the interface to follow suit (something like
s/hypercall/dynhc).  Sound good?

Thanks, Avi,

-Greg