Re: [Qemu-devel] device assignment for embedded Power
On 05.07.2011, at 20:19, Yoder Stuart-B08248 wrote: > > >> -Original Message- >> From: Benjamin Herrenschmidt [mailto:b...@kernel.crashing.org] >> Sent: Thursday, June 30, 2011 7:58 PM >> To: Yoder Stuart-B08248 >> Cc: qemu-devel@nongnu.org; Wood Scott-B07421; Alexander Graf; >> alex.william...@redhat.com; >> anth...@codemonkey.ws; d...@au1.ibm.com; joerg.roe...@amd.com; >> p...@codesourcery.com; >> blauwir...@gmail.com; arm...@redhat.com >> Subject: Re: device assignment for embedded Power >> >> On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote: >>> One feature we need for QEMU/KVM on embedded Power Architecture is the >>> ability to do passthru assignment of SoC I/O devices and memory. An >>> important use case in embedded is creating static partitions-- taking >>> physical memory and I/O devices (non-PCI) and partitioning >>> them between the host Linux and several virtual machines. Things like >>> live migration would not be needed or supported in these types of scenarios. >>> >>> SoC devices do not sit on a probeable bus and there are no identifiers >>> like 01:00.0 with PCI that we can use to identify devices-- the host >>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a >>> device tree structure passed at boot. QEMU needs to generate a >>> device tree to pass to the guest as well with all the guest's virtual >>> and physical resources. Today a number of mostly complete guest >>> device trees are kept under ./pc-bios in QEMU, but this too static and >>> inflexible. >>> >>> Some new mechanism is needed to assign SoC devices to guests, and we >>> (FSL + Alex Graf) have been discussing a few possible approaches for >>> doing this from QEMU and would like some feedback. >>> >>> Some possibilities: >>> >>> 1. Option 1. Pass the host dev tree to QEMU and assign devices >>> by device tree path >>> >>> -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000 >>> >>> /soc/i2c@3000 is the device tree path to the assigned device. 
>>> The device node 'i2c@3000' has some number of properties (e.g. >>> address, interrupt info) and possibly subnodes under >>> it. QEMU copies that node when generating the guest dev tree. >>> See snippet of entire node: http://paste2.org/p/1496460 >> >> Yuck (see below) >> >>> 2. Option 2. Pass the entire assigned device node as a string to >>> QEMU >>> >>> -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>; >>> #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c"; >>> reg = <0xffe03000 0x100>; interrupts = <43 2>; >>> interrupt-parent = <&mpic>; dfsrr;' >> >> Beuark ! (see below) >> >>> This avoids needing to pass the host device tree, but could >>> get awkward-- the i2c example above is very simple, some device >>> nodes are very large with a complex hierarchy of subnodes and >>> could be hundreds of lines of text to represent a single >>> node. >>> >>> It gets more complicated... >> >> >> So, from a qemu command line perspective, all you should have to do is pass >> qemu the device- >> tree -path- to the device you want to pass-trough (you may support passing a >> full hierarchy >> here). >> >> That is for normal MMIO mapped SoC devices. Something else (individual i2c, >> usb, ...) will use >> specific virtualization of the corresponding busses. > > Then why 'yuck' to option 1 :)? That is basically what was being proposed. Yes, and probably a good idea to go with for now. We can handle the guest device tree parts externally for now by passing in a fully populated device tree that just contains everything we need and pass qemu the configuration the way we did it in the device tree. >> Anything else sucks too much really. >> >> From there, well, there's several approach inside qemu/kvm to handle that >> path. If you want to >> do things at the qemu level you can probably parse /proc/device-tree. But >> I'd personally just >> make it a kernel thing. >> >> IE. 
>> I would have an ioctl to "instantiate" a pass-through device that takes
>> that path as an argument. I would make it return an anonymous fd which
>> you can then use to mmap the resources, etc...
>
> Regarding implementation I think there are 3 things that need
> to be set up-- 1) mmapping the device's registers, 2) getting the iommu
> set up (if there is one), 3) getting the interrupt(s) handled.

Yes :). I guess we'll just have to sit down and implement something very simple that can at least pass through MMIO regions and interrupts, and then take it from there until we hit the many walls.

Alex
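As an illustration of step 1 above (finding the registers to mmap), here is a minimal Python sketch of decoding a device tree `reg` property into (address, size) pairs. It is not QEMU or kernel code. The function name `decode_reg` is made up for this example, and the cell counts (one address cell, one size cell, taken from the parent node) are an assumption. The sample value is the `reg = <0xffe03000 0x100>` from the i2c node quoted earlier in the thread.

```python
import struct


def decode_reg(raw, address_cells=1, size_cells=1):
    """Split a raw `reg` property into (address, size) pairs.

    Device tree cells are 32-bit big-endian values. The cell counts are
    defined by the parent node; the defaults here are an assumption.
    """
    cells_per_entry = address_cells + size_cells
    entry_bytes = 4 * cells_per_entry
    if entry_bytes == 0 or len(raw) % entry_bytes != 0:
        raise ValueError("reg length does not match the cell counts")

    regions = []
    for off in range(0, len(raw), entry_bytes):
        cells = struct.unpack_from(">%dI" % cells_per_entry, raw, off)
        addr = 0
        for cell in cells[:address_cells]:
            addr = (addr << 32) | cell   # multi-cell addresses are concatenated
        size = 0
        for cell in cells[address_cells:]:
            size = (size << 32) | cell
        regions.append((addr, size))
    return regions


# reg = <0xffe03000 0x100>, as in the i2c example from the thread
raw = struct.pack(">II", 0xFFE03000, 0x100)
regions = decode_reg(raw)
```

On a host, the raw bytes could presumably be read from the node's `reg` file under /proc/device-tree, which is the approach mentioned in the quoted text. The resulting pairs are what a passthrough mechanism would need to map into the guest.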
Re: [Qemu-devel] device assignment for embedded Power
> -Original Message- > From: Benjamin Herrenschmidt [mailto:b...@kernel.crashing.org] > Sent: Thursday, June 30, 2011 7:58 PM > To: Yoder Stuart-B08248 > Cc: qemu-devel@nongnu.org; Wood Scott-B07421; Alexander Graf; > alex.william...@redhat.com; > anth...@codemonkey.ws; d...@au1.ibm.com; joerg.roe...@amd.com; > p...@codesourcery.com; > blauwir...@gmail.com; arm...@redhat.com > Subject: Re: device assignment for embedded Power > > On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote: > > One feature we need for QEMU/KVM on embedded Power Architecture is the > > ability to do passthru assignment of SoC I/O devices and memory. An > > important use case in embedded is creating static partitions-- taking > > physical memory and I/O devices (non-PCI) and partitioning > > them between the host Linux and several virtual machines. Things like > > live migration would not be needed or supported in these types of scenarios. > > > > SoC devices do not sit on a probeable bus and there are no identifiers > > like 01:00.0 with PCI that we can use to identify devices-- the host > > Linux kernel is made aware of SoC I/O devices from nodes/properties in a > > device tree structure passed at boot. QEMU needs to generate a > > device tree to pass to the guest as well with all the guest's virtual > > and physical resources. Today a number of mostly complete guest > > device trees are kept under ./pc-bios in QEMU, but this too static and > > inflexible. > > > > Some new mechanism is needed to assign SoC devices to guests, and we > > (FSL + Alex Graf) have been discussing a few possible approaches for > > doing this from QEMU and would like some feedback. > > > > Some possibilities: > > > > 1. Option 1. Pass the host dev tree to QEMU and assign devices > >by device tree path > > > > -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000 > > > >/soc/i2c@3000 is the device tree path to the assigned device. > >The device node 'i2c@3000' has some number of properties (e.g. 
> >address, interrupt info) and possibly subnodes under > >it. QEMU copies that node when generating the guest dev tree. > >See snippet of entire node: http://paste2.org/p/1496460 > > Yuck (see below) > > > 2. Option 2. Pass the entire assigned device node as a string to > >QEMU > > > > -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>; > > #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c"; > > reg = <0xffe03000 0x100>; interrupts = <43 2>; > > interrupt-parent = <&mpic>; dfsrr;' > > Beuark ! (see below) > > >This avoids needing to pass the host device tree, but could > >get awkward-- the i2c example above is very simple, some device > >nodes are very large with a complex hierarchy of subnodes and > >could be hundreds of lines of text to represent a single > >node. > > > > It gets more complicated... > > > So, from a qemu command line perspective, all you should have to do is pass > qemu the device- > tree -path- to the device you want to pass-trough (you may support passing a > full hierarchy > here). > > That is for normal MMIO mapped SoC devices. Something else (individual i2c, > usb, ...) will use > specific virtualization of the corresponding busses. Then why 'yuck' to option 1 :)? That is basically what was being proposed. > Anything else sucks too much really. > > From there, well, there's several approach inside qemu/kvm to handle that > path. If you want to > do things at the qemu level you can probably parse /proc/device-tree. But I'd > personally just > make it a kernel thing. > > IE. I would have an ioctl to "instanciate" a pass-through device, that takes > that path as an > argument. I would make it return an anonymous fd which you can then use to > mmap the resources, > etc... Regarding implementation I think there are 3 things that need to be set up-- 1) mmapping the device's registers, 2) getting the iommu set up (if there is one), 3) getting the interrupt(s) handled. 
> > In some cases, modifications to device tree nodes may be needed.
> > An example-- sometimes a device tree property references another node
> > and that relationship may not exist when assigned to a guest.
> > A "phy-handle" property may need to be deleted and a "fixed-link"
> > property added to a node representing a network device.
>
> That's fishy. Why wouldn't you give full access to the MDIO? It's shared?
> Such things are so device-specific that they would have to be handled by
> device-specific quirks, which can live either in qemu or in the kernel.

It is shared, and in this case we didn't want the phy shared. That was a super simple example to illustrate the idea. From our experience with the Freescale Embedded Hypervisor we see this as a definite requirement-- nodes in the hardware device tree may need modifications. In the P4080 device tree there are some complex relationships expressed between nodes
Re: [Qemu-devel] device assignment for embedded Power
On Fri, 1 Jul 2011 17:32:43 -0500 Anthony Liguori wrote:

> On 07/01/2011 11:43 AM, Scott Wood wrote:
> > However, we'll need to address the question of what it means to say "irq 10"
>
> It depends on what the bus is. If you're going to declare "system bus"
> which is sort of what we call ISA for the PC,

More like "arbitrary MMIO". Could be an on-chip peripheral. Could be some external custom chip. Could be an entire PCIe root complex.

> then it can map trivially to the interrupt controller's inputs.

Which interrupt controller? We might want to assign an IRQ that's on some cascaded controller.

We also have some things like MPIC IPIs and timers, that are on the main interrupt controller but aren't normal numbered interrupts. We use the ability to have multiple cells in an interrupt specifier to express these.

And while you could make up fake numbers for these to force it to be linear, someone has to come up with this mapping and get qemu, its users, and the kernel to agree on it. We already have a repository for such bindings for the device tree.

That's not to say that the device tree should be forced onto platforms that have some other reasonable way of doing it, of course -- just that it's nice to be able to refer to it when it's there.

> > -- outside of PC-land there often isn't a global IRQ numberspace that isn't
> > a fiction created by some software layer.
>
> PC's don't have a global IRQ number space FWIW. When we say:
>
> -device isa-serial,irq=4
>
> This really means, "ISA irq 4", which is mapped to the PIIX3 and then
> routed through GSI, then the APIC architecture to correspond to some
> interrupt for some physical CPU.

Well, it's been a while since I've dealt with such things on PCs... I thought there was at least some standard way of interpreting things like IRQ numbers that the BIOS wrote into PCI config space.

> > Addressing this is one of the
> > device tree's strengths.
>
> Not really. There's nothing magical about the device tree. It's just a
> guest visible description of the platform hardware that isn't probe-able
> in some bus framework. ACPI does exactly the same thing. I'll concede
> that the device tree is far nicer than ACPI but again, it's not magical :-)

I didn't say it was the only way to express it -- just that the device tree, or something like it, comes in useful here. And we're not about to do ACPI on powerpc. :-)

-Scott
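To make the point about multi-cell interrupt specifiers concrete, the following Python sketch splits an `interrupts` property into specifiers of a given width. It is only an illustration: `split_interrupts` is a made-up name, and the width of two cells is an assumption matching the `interrupts = <43 2>` example quoted elsewhere in the thread. What each cell means is defined by the interrupt controller's binding, not by this code.

```python
import struct


def split_interrupts(raw, interrupt_cells):
    """Split a raw `interrupts` property into tuples of 32-bit cells.

    interrupt_cells corresponds to the controller's #interrupt-cells
    value; it is passed in by the caller rather than looked up.
    """
    if interrupt_cells <= 0:
        raise ValueError("interrupt_cells must be positive")
    entry_bytes = 4 * interrupt_cells
    if len(raw) % entry_bytes != 0:
        raise ValueError("property length is not a whole number of specifiers")
    return [
        struct.unpack_from(">%dI" % interrupt_cells, raw, off)
        for off in range(0, len(raw), entry_bytes)
    ]


# interrupts = <43 2>, assuming a two-cell specifier
specs = split_interrupts(struct.pack(">II", 43, 2), 2)
```

A single flat "irq 10" has no room for the second cell, which is why a flat numbering needs an agreed mapping on top of it.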
Re: [Qemu-devel] device assignment for embedded Power
> So you're basically saying we should tackle these 3 issues separately:
>
>   * actually pass through a device
>   * generate interrupt links
>   * model the guest device tree dynamically based on whatever the user
>     gives us

Yes.

Paul
Re: [Qemu-devel] device assignment for embedded Power
On 02.07.2011, at 01:50, Paul Brook wrote:

>> On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
>>>> On Fri, 1 Jul 2011 18:03:01 +0100 Paul Brook wrote:
>>>>> Basically you should start by implementing full emulation of a device
>>>>> with similar characteristics to the one you want to passthrough.
>>>>
>>>> That's not going to happen.
>>>
>>> Why is your device so unique? How does it interact with the guest system
>>> and what features does it require that don't exist in any device that
>>> can be emulated?
>>
>> Do you guys only support PCI pass-through by doing full emulation of
>> all possible supported PCI devices first? :-)
>
> Absolutely not. My point is that dynamic (user-driven) device creation is
> effectively a prerequisite for a passthrough device.
>
> If you just want to make a very specific use-case then this doesn't need any
> code in qemu at all. We just make the user provide the device tree
> themselves. If it doesn't match then they lose. If you do choose an ugly
> qemu then the chances are it'll be changed/removed once we do dynamic device
> creation properly. There have already been discussions about dynamic device
> creation, so this isn't completely hypothetical.
>
> If you integrate it properly, then you need to realise that there's a fair
> chunk of infrastructure and user interface required. Most of which has
> nothing to do with device passthrough. Trying to implement both at the same
> time is just going to cause confusion and complicate things. It's already a
> hard problem, combining it with something else is just going to cause you and
> everyone else even more pain.

So you're basically saying we should tackle these 3 issues separately:

  * actually pass through a device
  * generate interrupt links
  * model the guest device tree dynamically based on whatever the user gives us

I tend to agree with that perspective. Still, the main issue stands in that we don't have a concrete answer for all three issues :). Facing them one at a time might actually help solve them, though.

Alex
Re: [Qemu-devel] device assignment for embedded Power
> On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
> > > On Fri, 1 Jul 2011 18:03:01 +0100 Paul Brook wrote:
> > > > Basically you should start by implementing full emulation of a device
> > > > with similar characteristics to the one you want to passthrough.
> > >
> > > That's not going to happen.
> >
> > Why is your device so unique? How does it interact with the guest system
> > and what features does it require that don't exist in any device that
> > can be emulated?
>
> Do you guys only support PCI pass-through by doing full emulation of
> all possible supported PCI devices first? :-)

Absolutely not. My point is that dynamic (user-driven) device creation is effectively a prerequisite for a passthrough device.

If you just want to make a very specific use-case then this doesn't need any code in qemu at all. We just make the user provide the device tree themselves. If it doesn't match then they lose. If you do choose an ugly qemu then the chances are it'll be changed/removed once we do dynamic device creation properly. There have already been discussions about dynamic device creation, so this isn't completely hypothetical.

If you integrate it properly, then you need to realise that there's a fair chunk of infrastructure and user interface required. Most of which has nothing to do with device passthrough. Trying to implement both at the same time is just going to cause confusion and complicate things. It's already a hard problem, combining it with something else is just going to cause you and everyone else even more pain.

Paul
Re: [Qemu-devel] device assignment for embedded Power
> > Why is your device so unique? How does it interact with the guest system
> > and what features does it require that don't exist in any device that
> > can be emulated?
>
> Perhaps I misunderstood what you meant by "similar characteristics". I see
> no reason to spend a bunch of time implementing full emulation for a
> device, that isn't going to be used, just because it seems like a nice
> intermediary step.

You say your device has MMIO regions, generates IRQs and initiates DMA transactions. Any device or selection of devices that between them use all those features will do the job. I'd expect most SoCs to have several. We don't care what the device actually does, only the ways it communicates with the rest of the machine.

I think you're coming at this problem from completely the wrong direction. Instead of "how do I wedge this passthrough into my machine", you should be asking "how do I create a machine without knowing the machine layout at compile time". Once you fix that, hooking up the passthrough device should be fairly trivial.

You only have a single passthrough device, and the rest of us have none at all. Anything restricted to the passthrough case is thus unlikely to be the right answer to the second question, and I'd expect it to be removed/changed/broken when we do get round to implementing dynamic device creation.

> > > We're talking about directly mapping the registers into the guest. The
> > > whole point is performance.
> >
> > That's an additional step after you get passthrough working the normal
> > way.
>
> "normal"?

Mapping an MMIO region into the guest is an additional complication, and purely a performance optimization. qemu already needs to be in the loop to handle interrupts, probably DMA setup and the non-kvm case.

> I'm not sure what the use case is for direct assignment of a device in an
> otherwise completely emulated guest, but perhaps there is one.

Typically because the host system doesn't know how to talk to it, or there isn't a sensible way to relay the functionality provided by the device from the kernel to qemu.

> > We already have mechanisms (or at least patches) for mapping file-like
> > objects into guest physical memory. That's largely independent of
> > device passthrough. It's a relatively minor tweak to how the passthrough
> > device sets up its MMIO regions.
> >
> > Mapping host device MMIO regions into guest space is entirely
> > uninteresting unless we already have some way of creating guest-host
> > passthrough devices.
>
> Isn't that what's being discussed?

It's your end goal, but I don't think it's particularly relevant to the problem you've encountered.

> > Creating guest-device passthrough devices isn't going to happen until we
> > can create arbitrary devices (within the set emulated by qemu) that
> > interact with the rest of the emulated machine in a similar way.
>
> What do you mean by "interact with the rest of the emulated machine in a
> similar way"?

See first paragraph above.

Paul
Re: [Qemu-devel] device assignment for embedded Power
On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote: > > On Fri, 1 Jul 2011 18:03:01 +0100 > > > > Paul Brook wrote: > > > Basically you should start by implementing full emulation of a device > > > with similar characteristics to the one you want to passthrough. > > > > That's not going to happen. > > Why is your device so unique? How does it interact with the guest system and > what features does it require that doen't exist in any device that can be > emulated? Do you guys only support PCI pass-through by doing full emulation of the all possible supported PCI devices first ? :-) > I'm also extremely sceptical of anything that only works in a kvm > environment. > Makes me think it's an unmaintainable hack, and almost certainly going to > cause you immense amounts of pain later. See above question... Cheers, Ben. > > > I doubt you're going to get generic passthrough of arbitrary devices > > > working in a useful way. > > > > It's usefully working for us internally -- we're just trying to find a way > > to improve it for upstream, with a better configuration mechanism. > > I don't believe that either. More likely you've got passthrough of device > hanging off your specific CPU bus, using only (or even a subset of) the > facilities provided by that bus. > > > > Basically you have to emulate everything that is different between the > > > host and guest. > > > > Directly assigning a device means you don't get to have differences between > > the actual hardware device and what the guest sees. The kind of thin > > wrapper you're suggesting might have some use cases, but it's a different > > problem from what we're trying to solve. > > That's the problem. You've skipped several steps and gone startigh for > optimization before you've even got basic functionality working. > > You've also missed the point I was making. In order to do device passthrough > you need to define a boundary allong which the emulated machine state can be > fully replicated on the host machine. 
Anything inside this boundary is (by > definition) that same on both the host and guest systems (we're effectively > using host hardware to emulate a device for us). Outside that boundary the > host and guest systems will diverge. > > For a device that merely responds to CPU initiated MMIO transfers this is > pretty simple, it's the point at which MMIO transfers are generated. So the > guest gets a proxy device that intercepts accesses to that memory region, and > the host proxies some way for qemu to poke values at the host device. > > > > Once you've done all the above, host device passthrough should be > > > relatively straightforward. Just replace the emulation bits in the > > > above device with code that pokes at a real device via the relevant > > > kernel API. > > > > That's not what we mean by direct device assignment. > > Maybe, but IMO but it's a necessary prerequisite. You're trying to run before > you can walk. > > > We're talking about directly mapping the registers into the guest. The > > whole point is performance. > > That's an additional step after you get passthrough working the normal way. > We already have mechanisms (or at least patches) for mapping file-like > objects > into guest physical memory. That's largely independent of device > passthrough. > It's a relatively minor tweak to how the passthrough device sets up its MMIO > regions. > > Mapping host device MMIO regions into guest space is entirely uninteresting > unless we already have some way of creating guest-host passthrough devices. > Creating guest-device passthrough devices isn't going to happen until the can > create arbitrary devices (within the set emulated by qemu) that interact with > the rest of the emulated machine in a similar way. > > Paul
Re: [Qemu-devel] device assignment for embedded Power
On 07/01/2011 12:03 PM, Paul Brook wrote:

>>> irq[0].guest_irq = "10"
>>>
>>> This should be independent of anything to do with device tree. This
>>> would be useful for x86 too to assign platform devices (like the HPET).
>>
>> That's fine, as long as there's something layered on top of it for the
>> case where we do want to reference something in the device tree.
>>
>> However, we'll need to address the question of what it means to say
>> "irq 10" -- outside of PC-land there often isn't a global IRQ numberspace
>> that isn't a fiction created by some software layer. Addressing this is
>> one of the device tree's strengths.
>
> That's an entirely separate problem, though probably a prerequisite.
>
> Basically you should start by implementing full emulation of a device with
> similar characteristics to the one you want to passthrough.

If you want to model interrupt remapping, you have to model device relationships. If you cannot express the bus hierarchy/relationship then you cannot sanely model interrupt remapping.

You can only really ever think about passing through an entire subtree of the device hierarchy. You can't have a partial subtree with some crazy hack logic to explain how the physical layer may remap interrupts. That's just asking for pain.

Regards,

Anthony Liguori
Re: [Qemu-devel] device assignment for embedded Power
On 07/01/2011 11:43 AM, Scott Wood wrote:

> On Fri, 1 Jul 2011 07:10:45 -0500 Anthony Liguori wrote:
>> I agree in principle but I think it should be done in a slightly
>> different way. I think we ought to support composing a device by
>> passthrough. For instance, something like:
>>
>> [physical-device "mydev"]
>> region[0].file = "/dev/mem"
>> region[0].guest_address = "0x42232000"
>> region[0].file_offset = "0x23423400"
>> region[0].size = "4096"
>> irq[0].guest_irq = "10"
>> irq[0].host_irq = "10"
>>
>> This should be independent of anything to do with device tree. This
>> would be useful for x86 too to assign platform devices (like the HPET).
>
> That's fine, as long as there's something layered on top of it for the
> case where we do want to reference something in the device tree.
>
> However, we'll need to address the question of what it means to say
> "irq 10"

It depends on what the bus is. If you're going to declare "system bus" which is sort of what we call ISA for the PC, then it can map trivially to the interrupt controller's inputs.

> -- outside of PC-land there often isn't a global IRQ numberspace that
> isn't a fiction created by some software layer.

PC's don't have a global IRQ number space FWIW. When we say:

-device isa-serial,irq=4

This really means, "ISA irq 4", which is mapped to the PIIX3 and then routed through GSI, then the APIC architecture to correspond to some interrupt for some physical CPU.

> Addressing this is one of the device tree's strengths.

Not really. There's nothing magical about the device tree. It's just a guest visible description of the platform hardware that isn't probe-able in some bus framework. ACPI does exactly the same thing. I'll concede that the device tree is far nicer than ACPI but again, it's not magical :-)

Regards,

Anthony Liguori

> -Scott
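For readers who want to see what the proposed `[physical-device "mydev"]` description would carry, here is a small Python sketch that parses exactly the example text from this message into a nested dictionary. The syntax is only a proposal in the thread, not an existing QEMU config format, and the parser (`parse_physical_devices` and its regular expressions) is hypothetical.

```python
import re

# Matches: [physical-device "name"]
SECTION = re.compile(r'^\[physical-device\s+"([^"]+)"\]$')
# Matches: region[0].file = "value" or irq[0].host_irq = "value"
ENTRY = re.compile(r'^(region|irq)\[(\d+)\]\.(\w+)\s*=\s*"([^"]*)"$')


def parse_physical_devices(text):
    """Return {device: {"region": {index: fields}, "irq": {index: fields}}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SECTION.match(line)
        if match:
            current = devices.setdefault(
                match.group(1), {"region": {}, "irq": {}})
            continue
        match = ENTRY.match(line)
        if match is None or current is None:
            raise ValueError("unrecognised line: %r" % line)
        kind, index, field, value = match.groups()
        current[kind].setdefault(int(index), {})[field] = value
    return devices


EXAMPLE = '''
[physical-device "mydev"]
region[0].file = "/dev/mem"
region[0].guest_address = "0x42232000"
region[0].file_offset = "0x23423400"
region[0].size = "4096"
irq[0].guest_irq = "10"
irq[0].host_irq = "10"
'''

devices = parse_physical_devices(EXAMPLE)
```

Values are kept as strings on purpose. How "10" should be interpreted as an interrupt is the open question debated in the thread, so the sketch does not try to resolve it.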
Re: [Qemu-devel] device assignment for embedded Power
On Fri, 1 Jul 2011 21:59:35 +0100 Paul Brook wrote: > > On Fri, 1 Jul 2011 18:03:01 +0100 > > > > Paul Brook wrote: > > > Basically you should start by implementing full emulation of a device > > > with similar characteristics to the one you want to passthrough. > > > > That's not going to happen. > > Why is your device so unique? How does it interact with the guest system and > what features does it require that doen't exist in any device that can be > emulated? Perhaps I misunderstood what you meant by "similar characteristics". I see no reason to spend a bunch of time implementing full emulation for a device, that isn't going to be used, just because it seems like a nice intermediary step. What specifically is it you're suggesting we do full emulation of? > I'm also extremely sceptical of anything that only works in a kvm > environment. > Makes me think it's an unmaintainable hack, and almost certainly going to > cause you immense amounts of pain later. I believe the only part of the device assignment stuff we've implemented so far that is KVM specific is the interrupt routing. I'm open to ways of routing the interrupts to qemu in the non-KVM case, as long as we can bypass it when KVM is used. I'm not sure what the use case is for direct assignment of a device in an otherwise completely emulated guest, but perhaps there is one. > > > I doubt you're going to get generic passthrough of arbitrary devices > > > working in a useful way. > > > > It's usefully working for us internally -- we're just trying to find a way > > to improve it for upstream, with a better configuration mechanism. > > I don't believe that either. More likely you've got passthrough of device > hanging off your specific CPU bus, using only (or even a subset of) the > facilities provided by that bus. There's nothing special about our "bus". It's MMIO, DMA, and interrupts. What specifically are you disbelieving? 
> > > Basically you have to emulate everything that is different between the > > > host and guest. > > > > Directly assigning a device means you don't get to have differences between > > the actual hardware device and what the guest sees. The kind of thin > > wrapper you're suggesting might have some use cases, but it's a different > > problem from what we're trying to solve. > > That's the problem. You've skipped several steps and gone startigh for > optimization before you've even got basic functionality working. This is the basic functionality -- assign a piece of hardware to the guest with minimal overhead. Why go through contortions to construct some intermediate phase that nobody's interested in using? > You've also missed the point I was making. In order to do device passthrough > you need to define a boundary allong which the emulated machine state can be > fully replicated on the host machine. Anything inside this boundary is (by > definition) that same on both the host and guest systems (we're effectively > using host hardware to emulate a device for us). Outside that boundary the > host and guest systems will diverge. I'm still not sure what the point is, then. By directly assigning the device the user is placing everything about the device on the "same as host" side of that boundary. We're not using host hardware to emulate a device, we're using host hardware to send and receive packets under control of the guest. Whatever hardware that is, the guest will deal with it, just as if the guest weren't running in a vm. > For a device that merely responds to CPU initiated MMIO transfers this is > pretty simple, it's the point at which MMIO transfers are generated. So the > guest gets a proxy device that intercepts accesses to that memory region, and > the host proxies some way for qemu to poke values at the host device. The point is to be faster than virtio, not slower. There would be no reason for us to do this otherwise. 
Emulating some specific device is not our goal, at all. I realize that that's a major part of what qemu does, but it's not the only thing it's used for. > > > Once you've done all the above, host device passthrough should be > > > relatively straightforward. Just replace the emulation bits in the > > > above device with code that pokes at a real device via the relevant > > > kernel API. > > > > That's not what we mean by direct device assignment. > > Maybe, but IMO but it's a necessary prerequisite. You're trying to run before > you can walk. I disagree that it is a prerequisite. It is a fundamentally different thing, for a different purpose. If it's a purpose that is important to you, and you think the proposed config mechanisms don't accommodate that, then propose something that does. > > We're talking about directly mapping the registers into the guest. The > > whole point is performance. > > That's an additional step after you get passthrough working the normal way. "normal"? > We already have mechanisms (or at least patches) for mappi
Re: [Qemu-devel] device assignment for embedded Power
> On Fri, 1 Jul 2011 18:03:01 +0100
> Paul Brook wrote:
> > Basically you should start by implementing full emulation of a device
> > with similar characteristics to the one you want to passthrough.
>
> That's not going to happen.

Why is your device so unique? How does it interact with the guest system and
what features does it require that don't exist in any device that can be
emulated?

I'm also extremely sceptical of anything that only works in a kvm environment.
Makes me think it's an unmaintainable hack, and almost certainly going to cause
you immense amounts of pain later.

> > I doubt you're going to get generic passthrough of arbitrary devices
> > working in a useful way.
>
> It's usefully working for us internally -- we're just trying to find a way
> to improve it for upstream, with a better configuration mechanism.

I don't believe that either. More likely you've got passthrough of devices
hanging off your specific CPU bus, using only (or even a subset of) the
facilities provided by that bus.

> > Basically you have to emulate everything that is different between the
> > host and guest.
>
> Directly assigning a device means you don't get to have differences between
> the actual hardware device and what the guest sees. The kind of thin
> wrapper you're suggesting might have some use cases, but it's a different
> problem from what we're trying to solve.

That's the problem. You've skipped several steps and gone straight for
optimization before you've even got basic functionality working.

You've also missed the point I was making. In order to do device passthrough
you need to define a boundary along which the emulated machine state can be
fully replicated on the host machine. Anything inside this boundary is (by
definition) the same on both the host and guest systems (we're effectively
using host hardware to emulate a device for us). Outside that boundary the
host and guest systems will diverge.

For a device that merely responds to CPU initiated MMIO transfers this is
pretty simple, it's the point at which MMIO transfers are generated. So the
guest gets a proxy device that intercepts accesses to that memory region, and
the host provides some way for qemu to poke values at the host device.

> > Once you've done all the above, host device passthrough should be
> > relatively straightforward. Just replace the emulation bits in the
> > above device with code that pokes at a real device via the relevant
> > kernel API.
>
> That's not what we mean by direct device assignment.

Maybe, but IMO it's a necessary prerequisite. You're trying to run before
you can walk.

> We're talking about directly mapping the registers into the guest. The
> whole point is performance.

That's an additional step after you get passthrough working the normal way.
We already have mechanisms (or at least patches) for mapping file-like objects
into guest physical memory. That's largely independent of device passthrough.
It's a relatively minor tweak to how the passthrough device sets up its MMIO
regions.

Mapping host device MMIO regions into guest space is entirely uninteresting
unless we already have some way of creating guest-host passthrough devices.
Creating guest-device passthrough devices isn't going to happen until the user
can create arbitrary devices (within the set emulated by qemu) that interact
with the rest of the emulated machine in a similar way.

Paul
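The proxy-device model described in this message can be sketched roughly as follows. This is an illustration in Python, not QEMU code: `HostBackend` and `MmioProxy` are hypothetical names, and the host device's register window is stood in for by a plain byte buffer rather than any real kernel API.

```python
import struct


class HostBackend:
    """Stand-in for a host device's register window (hypothetical)."""

    def __init__(self, size):
        self.regs = bytearray(size)

    def read32(self, offset):
        return struct.unpack_from(">I", self.regs, offset)[0]

    def write32(self, offset, value):
        struct.pack_into(">I", self.regs, offset, value & 0xFFFFFFFF)


class MmioProxy:
    """Guest-visible MMIO region that forwards each access to the backend."""

    def __init__(self, guest_base, size, backend):
        self.guest_base = guest_base
        self.size = size
        self.backend = backend
        self.traps = 0  # every forwarded access is one trap out of the guest

    def _offset(self, guest_addr):
        offset = guest_addr - self.guest_base
        if not 0 <= offset <= self.size - 4:
            raise ValueError("access outside the proxied region")
        return offset

    def guest_read32(self, guest_addr):
        self.traps += 1
        return self.backend.read32(self._offset(guest_addr))

    def guest_write32(self, guest_addr, value):
        self.traps += 1
        self.backend.write32(self._offset(guest_addr), value)


backend = HostBackend(0x100)
proxy = MmioProxy(0xFFE03000, 0x100, backend)
proxy.guest_write32(0xFFE03004, 0x1234)
```

The `traps` counter is there only to make the disagreement in this thread visible: in the proxy model every register access costs a trap, whereas directly mapping the registers into the guest avoids that cost.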
Re: [Qemu-devel] device assignment for embedded Power
On Fri, 1 Jul 2011 12:16:35 +0100 Paul Brook wrote: > > One feature we need for QEMU/KVM on embedded Power Architecture is the > > ability to do passthru assignment of SoC I/O devices and memory. An > > important use case in embedded is creating static partitions-- > > taking physical memory and I/O devices (non-PCI) and partitioning > > them between the host Linux and several virtual machines. Things like > > live migration would not be needed or supported in these types of > > scenarios. > > > > SoC devices do not sit on a probeable bus and there are no identifiers > > like 01:00.0 with PCI that we can use to identify devices-- the host > > Linux kernel is made aware of SoC I/O devices from nodes/properties in a > > device tree structure passed at boot. QEMU needs to generate a > > device tree to pass to the guest as well with all the guest's virtual > > and physical resources. Today a number of mostly complete guest device > > trees are kept under ./pc-bios in QEMU, but this too static and > > inflexible. > > I doubt you're going to get generic passthrough of arbitrary devices working > in a useful way. It's usefully working for us internally -- we're just trying to find a way to improve it for upstream, with a better configuration mechanism. > My expectation is that, at minimum, you'll need a bus > specific proxy device. i.e. create a virtual device in qemu that responds to > the guest, and happens poke at a host device rather than emulating things > directly. Many of these embedded devices don't sit on any sort of software-visible bus, and requiring that the I/O happen via MMIO traps would result in unacceptable overhead. > Basically you have to emulate everything that is different between the host > and guest. Directly assigning a device means you don't get to have differences between the actual hardware device and what the guest sees. 
The kind of thin wrapper you're suggesting might have some use cases, but it's a different problem from what we're trying to solve. -Scott
Re: [Qemu-devel] device assignment for embedded Power
> > irq[0].guest_irq = "10"
> >
> > This should be independent of anything to do with device tree. This
> > would be useful for x86 too to assign platform devices (like the HPET).
>
> That's fine, as long as there's something layered on top of it for the case
> where we do want to reference something in the device tree.
>
> However, we'll need to address the question of what it means to say "irq
> 10" -- outside of PC-land there often isn't a global IRQ numberspace that
> isn't a fiction created by some software layer. Addressing this is one of
> the device tree's strengths.

That's an entirely separate problem, though probably a prerequisite.

Basically you should start by implementing full emulation of a device with
similar characteristics to the one you want to passthrough.

Then fix whatever is needed to allow the user to control instantiation of
those devices. This almost certainly means using the -device commandline
option. This currently only works for a fairly simple subset of devices
(approximately PCI and USB), so you'll probably need to fix/implement the
missing bits. To do this you'll probably need to do some work on the various
bits of the qdev relating to linking devices together. See recent discussion
about sockets in the "basic support for composing sysbus devices" thread.

To expose this to the guest you'll probably also need to implement some form
of dynamic device tree assembly/manipulation. Not strictly necessary (we can
require the user supply a complete device tree that matches whatever devices
they've configured), but probably highly desirable.

Once you've done all the above, host device passthrough should be relatively
straightforward. Just replace the emulation bits in the above device with
code that pokes at a real device via the relevant kernel API.

Paul
Re: [Qemu-devel] device assignment for embedded Power
On Fri, 1 Jul 2011 18:03:01 +0100 Paul Brook wrote: > Basically you should start by implementing full emulation of a device with > similar characteristics to the one you want to passthrough. That's not going to happen. > Once you've done all the above, host device passthrough should be relatively > straightforward. Just replace the emulation bits in the above device with > code that pokes at a real device via the relevant kernel API. That's not what we mean by direct device assignment. We're talking about directly mapping the registers into the guest. The whole point is performance. -Scott
Re: [Qemu-devel] device assignment for embedded Power
On Fri, 1 Jul 2011 07:10:45 -0500 Anthony Liguori wrote: > I agree in principle but I think it should be done in a slightly > different way. > > I think we ought to support composing a device by passthrough. For > instance, something like: > > [physical-device "mydev"] > region[0].file = "/dev/mem" > region[0].guest_address = "0x42232000" > region[0].file_offset = "0x23423400" > region[0].size = "4096" > irq[0].guest_irq = "10" > irq[0].host_irq = "10" > > This should be independent of anything to do with device tree. This > would be useful for x86 too to assign platform devices (like the HPET). That's fine, as long as there's something layered on top of it for the case where we do want to reference something in the device tree. However, we'll need to address the question of what it means to say "irq 10" -- outside of PC-land there often isn't a global IRQ numberspace that isn't a fiction created by some software layer. Addressing this is one of the device tree's strengths. -Scott
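To make the proposed stanza concrete, here is a minimal Python sketch that parses it into regions and IRQs. The `physical-device` driver and its `region[N].*` / `irq[N].*` properties are only the proposal quoted above, not an existing QEMU device, and `parse_stanza` is a hypothetical helper written for this illustration.

```python
import re

STANZA = '''
[physical-device "mydev"]
region[0].file = "/dev/mem"
region[0].guest_address = "0x42232000"
region[0].file_offset = "0x23423400"
region[0].size = "4096"
irq[0].guest_irq = "10"
irq[0].host_irq = "10"
'''

HEADER = re.compile(r'\[(\S+)\s+"([^"]+)"\]')
ENTRY = re.compile(r'(\w+)\[(\d+)\]\.(\w+)\s*=\s*"([^"]*)"')


def parse_stanza(text):
    """Parse one readconfig-style stanza into a nested dict."""
    device = {"driver": None, "id": None, "region": {}, "irq": {}}
    for line in text.strip().splitlines():
        line = line.strip()
        header = HEADER.fullmatch(line)
        if header:
            device["driver"], device["id"] = header.groups()
            continue
        entry = ENTRY.fullmatch(line)
        if not entry or entry.group(1) not in ("region", "irq"):
            raise ValueError(f"unrecognised line: {line!r}")
        kind, index, field, value = entry.groups()
        if field != "file":
            value = int(value, 0)  # accepts both "0x..." and decimal
        device[kind].setdefault(int(index), {})[field] = value
    return device


dev = parse_stanza(STANZA)
```

Note that the sketch treats `guest_irq` and `host_irq` as bare integers. That is exactly the simplification questioned in this reply: on platforms without a global IRQ number space, an interrupt would more likely need to be described relative to an interrupt controller, which is what a device tree expresses.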
Re: [Qemu-devel] device assignment for embedded Power
On Fri, 1 Jul 2011 10:58:14 +1000
Benjamin Herrenschmidt wrote:
> So, from a qemu command line perspective, all you should have to do is
> pass qemu the device-tree -path- to the device you want to pass-through
> (you may support passing a full hierarchy here).
>
> That is for normal MMIO mapped SoC devices. Something else (individual
> i2c, usb, ...) will use specific virtualization of the corresponding
> busses.
>
> Anything else sucks too much really.
>
> From there, well, there's several approaches inside qemu/kvm to handle
> that path. If you want to do things at the qemu level you can probably
> parse /proc/device-tree.

That's what option 1 is, except that instead of adding code to qemu to parse
/proc/device-tree, we'd use dtc to dump /proc/device-tree into a dtb and let
qemu use libfdt to look at the tree. This is less Linux-specific, more
modular, and more flexible for doing the sort of insane hacks that are going
to happen in embedded-land whether you like them or not. :-)

> But I'd personally just make it a kernel thing.

I'd rather keep the kernel interface simple -- assign this memory region,
assign that interrupt, use this IOMMU device ID, etc. Getting the kernel
involved in preparing the guest device tree, and understanding guest
configuration, seems quite excessive.

> IE. I would have an ioctl to "instantiate" a pass-through device, that
> takes that path as an argument. I would make it return an anonymous fd
> which you can then use to mmap the resources, etc...
>
> > In some cases, modifications to device tree nodes may be needed.
> > An example-- sometimes a device tree property references another node
> > and that relationship may not exist when assigned to a guest.
> > A "phy-handle" property may need to be deleted and a "fixed-link"
> > property added to a node representing a network device.
>
> That's fishy. Why wouldn't you give full access to the MDIO ? It's
> shared ?

Yes, it's shared. Yes, it sucks.
> Such things are so device-specific that they would have to be > handled by device-specific quirks, which can live either in qemu or in > the kernel. Or in the configuration of qemu. Not all users of the device want to do the same thing. > > So in addition to assigning a device, a mechanism is needed to update > > device tree nodes. So for the above example, maybe-- > > > > -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle, > > node-update="fixed-link = <2 1 1000 0 0>" > > That's just so gross and error prone, borderline insane. Welcome to embedded. :-) Here, users are going to want to be able to mess around under the hood in a way that server or desktop users generally don't need or want to. > > The types of modifications needed-- deleting nodes, deleting properties, > > adding nodes, adding properties, adding properties that reference other > > nodes, changing properties. This device tree transformation mechanism > > needed is general enough that it could apply to any device tree based > > embedded platform (e.g. ARM, MIPS) > > > > Another complexity relates to the IOMMU. Here things get very company > > and IOMMU specific. Freescale has a proprietary IOMMU. > > Look at the work currently being done for a generic qemu iommu layer. We > need it for server power as well and from what I last saw coming from > Eduardo and David, it's not PCI specific. The problem is that our current IOMMU doesn't implement full paging (yes, the HW people have been screamed at, but we're stuck with it for current chips). You have to break things down into regions following certain alignment rules, which may require user guidance as to which memory regions actually need DMA access, especially if you're setting up discontiguous shared memory regions and such. -Scott
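As an illustration of why a region-based IOMMU can need user guidance, the Python sketch below covers an address range with windows. It assumes, purely for the example, that each window must be a power of two in size and aligned to its own size; the message above only says "certain alignment rules", so the actual hardware constraints may differ. `split_aligned` is a hypothetical helper.

```python
def split_aligned(start, size):
    """Cover [start, start + size) with power-of-two, naturally aligned windows.

    Assumed rule (not taken from the thread): a window of length L must
    start at a multiple of L, and L must be a power of two.
    """
    if start < 0 or size <= 0:
        raise ValueError("start must be >= 0 and size must be > 0")
    windows = []
    end = start + size
    while start < end:
        # Largest power of two that fits in what is left to cover.
        fit = 1 << ((end - start).bit_length() - 1)
        # Largest power of two dividing start; address 0 is aligned to anything.
        align = (start & -start) or fit
        length = min(align, fit)
        windows.append((start, length))
        start += length
    return windows


windows = split_aligned(0x1000, 0x3000)
```

Under this assumed rule, a 12 KiB range starting at 0x1000 needs two windows, and badly aligned or discontiguous ranges need more. If the hardware offers only a limited number of windows, something has to decide which memory actually needs DMA access, which is where user guidance comes in.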
Re: [Qemu-devel] device assignment for embedded Power
On 07/01/2011 07:52 AM, Paul Brook wrote:
>>> So, from a qemu command line perspective, all you should have to do is
>>> pass qemu the device-tree -path- to the device you want to pass-through
>>> (you may support passing a full hierarchy here).
>>
>> I agree in principle but I think it should be done in a slightly
>> different way.
>>
>> I think we ought to support composing a device by passthrough. For
>> instance, something like:
>>
>> [physical-device "mydev"]
>> region[0].file = "/dev/mem"
>> region[0].guest_address = "0x42232000"
>> region[0].file_offset = "0x23423400"
>> region[0].size = "4096"
>> irq[0].guest_irq = "10"
>> irq[0].host_irq = "10"
>>
>> This should be independent of anything to do with device tree. This
>> would be useful for x86 too to assign platform devices (like the HPET).
>
> I'm not quite sure what you're getting at here. IMO there should be little or
> no need for special knowledge of passthrough devices. They should just be
> another qdev device, configured in the normal way. e.g.:
>
> -device sysbus-host,hostdev=whatever,normal_mmio_and_irq_config

What I wrote about is just readconfig syntax. It's the same as:

-device physical-device,id=mydev,region[0].file=/dev/mem,

Regards,

Anthony Liguori

>> I think there should be a separate mechanism to manipulate the guest
>> device tree, just like there are mechanisms to manipulate the guest's
>> ACPI tables.
>
> I agree. Any sort of device tree (IIUC ACPI tables are in principle giving
> the same information) is, in practice, going to need to be assembled at
> runtime. This needs some mechanism for devices to describe themselves,
> probably largely independent of actual machine/device creation code.
>
> We've got away without it thus far because the only real place where we have
> nontrivial user-specified machine variants is on the PCI bus. Devices there
> are for the most part self-describing so the guest firmware/OS can probe
> hardware itself.
>
> Paul
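The equivalence between the readconfig stanza and a `-device` argument can be sketched as a simple flattening of properties. This is a Python illustration only: `to_device_arg` is a hypothetical helper, and `physical-device` with its indexed properties is the proposal from this thread rather than something QEMU is known to provide.

```python
def to_device_arg(driver, dev_id, props):
    """Flatten a driver, an id and its properties into one -device argument."""
    parts = [driver, f"id={dev_id}"]
    for key, value in props.items():
        if "," in str(value):
            # Escaping of commas inside values is not handled in this sketch.
            raise ValueError(f"value for {key!r} contains a comma")
        parts.append(f"{key}={value}")
    return "-device " + ",".join(parts)


props = {
    "region[0].file": "/dev/mem",
    "region[0].guest_address": "0x42232000",
    "region[0].file_offset": "0x23423400",
    "region[0].size": "4096",
    "irq[0].guest_irq": "10",
    "irq[0].host_irq": "10",
}
arg = to_device_arg("physical-device", "mydev", props)
```

The point of the sketch is only that both spellings carry the same key/value list; whether the device model can accept a variable number of regions and IRQs is the separate question about parameter lists raised elsewhere in this thread.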
Re: [Qemu-devel] device assignment for embedded Power
> > So, from a qemu command line perspective, all you should have to do is
> > pass qemu the device-tree -path- to the device you want to pass-through
> > (you may support passing a full hierarchy here).
>
> I agree in principle but I think it should be done in a slightly
> different way.
>
> I think we ought to support composing a device by passthrough. For
> instance, something like:
>
> [physical-device "mydev"]
> region[0].file = "/dev/mem"
> region[0].guest_address = "0x42232000"
> region[0].file_offset = "0x23423400"
> region[0].size = "4096"
> irq[0].guest_irq = "10"
> irq[0].host_irq = "10"
>
> This should be independent of anything to do with device tree. This
> would be useful for x86 too to assign platform devices (like the HPET).

I'm not quite sure what you're getting at here. IMO there should be little or
no need for special knowledge of passthrough devices. They should just be
another qdev device, configured in the normal way. e.g.:

-device sysbus-host,hostdev=whatever,normal_mmio_and_irq_config

Should work the same as adding any other device. If it doesn't then we should
fix that.

This is an example of why it's good to have device features (IRQs, MMIO
regions, sockets, or whatever we call them) registered when the device is
instantiated, not relying on pre-compiled device descriptors/property lists.
In the latter case you probably need explicit variants for different numbers
of IRQs, MMIO regions, etc.

While I'm thinking about it, we already have exactly this for USB (i.e. the
usb-host device).

> I think there should be a separate mechanism to manipulate the guest
> device tree, just like there are mechanisms to manipulate the guest's
> ACPI tables.

I agree. Any sort of device tree (IIUC ACPI tables are in principle giving
the same information) is, in practice, going to need to be assembled at
runtime. This needs some mechanism for devices to describe themselves,
probably largely independent of actual machine/device creation code.

We've got away without it thus far because the only real place where we have
nontrivial user-specified machine variants is on the PCI bus. Devices there
are for the most part self-describing so the guest firmware/OS can probe
hardware itself.

Paul
Re: [Qemu-devel] device assignment for embedded Power
On 07/01/2011 07:02 AM, Alexander Graf wrote: On 01.07.2011, at 13:55, Paul Brook wrote: But the real challenge is how to expose the device to the guest device tree. Especially when it comes to links between dt nodes, interrupt maps, etc. We basically have 3 choices there: * take the host device tree pieces and modify them * provide device tree chunks for each device (manually or through qdev parameters) * use the device tree as machine config file and base everything on it (solves the linking problem) The main question is which one would be the cleanest solution. And how would it be implemented. I don't think any of this is specific to device passthrough. It occurs as soon as you have any user-configurable parts of the machine (or even just a nontrivial selection of machine variants). My guess is the only reason you haven't hit it before is because you're only emulated a single hard-coded SoC/board. Well, the real reason we haven't hit this before is that we don't have any devices in Qemu that are generic. We only have specific device emulation. This however would be a device that can handle hundreds of different backing devices, all with different requirements. The infrastructure we have today simply isn't made for this. The question is how can we model it so that it will? :) Our infrastructure is quite capable of handling this. It has many other problems but I think the only thing really missing is the way to have lists of parameters. That seems easy to solve though. Regards, Anthony Liguori Alex
Re: [Qemu-devel] device assignment for embedded Power
On 06/30/2011 07:58 PM, Benjamin Herrenschmidt wrote: On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote: This avoids needing to pass the host device tree, but could get awkward-- the i2c example above is very simple, some device nodes are very large with a complex hierarchy of subnodes and could be hundreds of lines of text to represent a single node. It gets more complicated... So, from a qemu command line perspective, all you should have to do is pass qemu the device-tree -path- to the device you want to pass-trough (you may support passing a full hierarchy here). I agree in principle but I think it should be done in a slightly different way. I think we ought to support composing a device by passthrough. For instance, something like: [physical-device "mydev"] region[0].file = "/dev/mem" region[0].guest_address = "0x42232000" region[0].file_offset = "0x23423400" region[0].size = "4096" irq[0].guest_irq = "10" irq[0].host_irq = "10" This should be independent of anything to do with device tree. This would be useful for x86 too to assign platform devices (like the HPET). I think there should be a separate mechanism to manipulate the guest device tree, just like there are mechanisms to manipulate the guest's ACPI tables. Given these two mechanisms, there should be a simple command line like Ben has suggested that just takes a host device tree path and Just Works. It really is just a convenience interface though. With raw mechanisms like I described above, it would give you the flexibility to pass through a device with a modified host tree fragment without having an overly complicated command line interface for the more common case. Regards, Anthony Liguori
Re: [Qemu-devel] device assignment for embedded Power
On 07/01/2011 06:40 AM, Alexander Graf wrote: On 01.07.2011, at 02:58, Benjamin Herrenschmidt wrote: On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote: One feature we need for QEMU/KVM on embedded Power Architecture is the ability to do passthru assignment of SoC I/O devices and memory. An important use case in embedded is creating static partitions-- taking physical memory and I/O devices (non-PCI) and partitioning them between the host Linux and several virtual machines. Things like live migration would not be needed or supported in these types of scenarios. SoC devices do not sit on a probeable bus and there are no identifiers like 01:00.0 with PCI that we can use to identify devices-- the host Linux kernel is made aware of SoC I/O devices from nodes/properties in a device tree structure passed at boot. QEMU needs to generate a device tree to pass to the guest as well with all the guest's virtual and physical resources. Today a number of mostly complete guest device trees are kept under ./pc-bios in QEMU, but this too static and inflexible. Some new mechanism is needed to assign SoC devices to guests, and we (FSL + Alex Graf) have been discussing a few possible approaches for doing this from QEMU and would like some feedback. Some possibilities: 1. Option 1. Pass the host dev tree to QEMU and assign devices by device tree path -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000 /soc/i2c@3000 is the device tree path to the assigned device. The device node 'i2c@3000' has some number of properties (e.g. address, interrupt info) and possibly subnodes under it. QEMU copies that node when generating the guest dev tree. See snippet of entire node: http://paste2.org/p/1496460 Yuck (see below) 2. Option 2. 
Pass the entire assigned device node as a string to QEMU -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells =<1>; #size-cells =<0>; cell-index =<0>; compatible = "fsl-i2c"; reg =<0xffe03000 0x100>; interrupts =<43 2>; interrupt-parent =<&mpic>; dfsrr;' Beuark ! (see below) This avoids needing to pass the host device tree, but could get awkward-- the i2c example above is very simple, some device nodes are very large with a complex hierarchy of subnodes and could be hundreds of lines of text to represent a single node. It gets more complicated... So, from a qemu command line perspective, all you should have to do is pass qemu the device-tree -path- to the device you want to pass-trough (you may support passing a full hierarchy here). That is for normal MMIO mapped SoC devices. Something else (individual i2c, usb, ...) will use specific virtualization of the corresponding busses. Anything else sucks too much really. From there, well, there's several approach inside qemu/kvm to handle that path. If you want to do things at the qemu level you can probably parse /proc/device-tree. But I'd personally just make it a kernel thing. IE. I would have an ioctl to "instanciate" a pass-through device, that takes that path as an argument. I would make it return an anonymous fd which you can then use to mmap the resources, etc... Yeah, one idea was to use VFIO here. We could for example modify the host device tree to occupy device we want to pass through with a specific compatibility parameter. Or we could try to steal the node during runtime. But I agree, reading the device tree data from a VFIO node sounds reasonable. If it's required. That makes it very specific to systems that use device trees. To do the same for ARM platforms or x86, you would need to invent yet another mechanism. Passing through arbitrary MMIO is fairly straight forward (likewise with PIO). Passing through IRQs is a bit less straight forward and perhaps VFIO is the answer here. 
I don't see a problem with QEMU figuring out what a device's resources are and doing the assignment. Regards, Anthony Liguori
Re: [Qemu-devel] device assignment for embedded Power
> But the real challenge is how to expose the device to the guest device
> tree. Especially when it comes to links between dt nodes, interrupt maps,
> etc. We basically have 3 choices there:
>
> * take the host device tree pieces and modify them
> * provide device tree chunks for each device (manually or through qdev
>   parameters)
> * use the device tree as machine config file and base
>   everything on it (solves the linking problem)
>
> The main question is which one would be the cleanest solution. And how
> would it be implemented.

I don't think any of this is specific to device passthrough. It occurs as
soon as you have any user-configurable parts of the machine (or even just a
nontrivial selection of machine variants). My guess is the only reason you
haven't hit it before is because you've only emulated a single hard-coded
SoC/board.

Paul
Re: [Qemu-devel] device assignment for embedded Power
On 01.07.2011, at 13:55, Paul Brook wrote: > >> But the real challenge is how to expose the device to the guest device >> tree. Especially when it comes to links between dt nodes, interrupt maps, >> etc. We basically have 3 choices there: >> >> * take the host device tree pieces and modify them >> * provide device tree chunks for each device (manually or through qdev >> parameters) * use the device tree as machine config file and base >> everything on it (solves the linking problem) >> >> The main question is which one would be the cleanest solution. And how >> would it be implemented. > > I don't think any of this is specific to device passthrough. It occurs as > soon as you have any user-configurable parts of the machine (or even just a > nontrivial selection of machine variants). My guess is the only reason you > haven't hit it before is because you're only emulated a single hard-coded > SoC/board. Well, the real reason we haven't hit this before is that we don't have any devices in Qemu that are generic. We only have specific device emulation. This however would be a device that can handle hundreds of different backing devices, all with different requirements. The infrastructure we have today simply isn't made for this. The question is how can we model it so that it will? :) Alex
Re: [Qemu-devel] device assignment for embedded Power
On 01.07.2011, at 02:58, Benjamin Herrenschmidt wrote: > On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote: >> One feature we need for QEMU/KVM on embedded Power Architecture is the >> ability to do passthru assignment of SoC I/O devices and memory. An >> important use case in embedded is creating static partitions-- >> taking physical memory and I/O devices (non-PCI) and partitioning >> them between the host Linux and several virtual machines. Things like >> live migration would not be needed or supported in these types of scenarios. >> >> SoC devices do not sit on a probeable bus and there are no identifiers >> like 01:00.0 with PCI that we can use to identify devices-- the host >> Linux kernel is made aware of SoC I/O devices from nodes/properties in a >> device tree structure passed at boot. QEMU needs to generate a >> device tree to pass to the guest as well with all the guest's virtual >> and physical resources. Today a number of mostly complete guest device >> trees are kept under ./pc-bios in QEMU, but this too static and >> inflexible. >> >> Some new mechanism is needed to assign SoC devices to guests, and we >> (FSL + Alex Graf) have been discussing a few possible approaches >> for doing this from QEMU and would like some feedback. >> >> Some possibilities: >> >> 1. Option 1. Pass the host dev tree to QEMU and assign devices >> by device tree path >> >> -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000 >> >> /soc/i2c@3000 is the device tree path to the assigned device. >> The device node 'i2c@3000' has some number of properties (e.g. >> address, interrupt info) and possibly subnodes under >> it. QEMU copies that node when generating the guest dev tree. >> See snippet of entire node: http://paste2.org/p/1496460 > > Yuck (see below) > >> 2. Option 2. 
Pass the entire assigned device node as a string to >> QEMU >> >> -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>; >> #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c"; >> reg = <0xffe03000 0x100>; interrupts = <43 2>; >> interrupt-parent = <&mpic>; dfsrr;' > > Beuark ! (see below) > >> This avoids needing to pass the host device tree, but could >> get awkward-- the i2c example above is very simple, some device >> nodes are very large with a complex hierarchy of subnodes and >> could be hundreds of lines of text to represent a single >> node. >> >> It gets more complicated... > > > So, from a qemu command line perspective, all you should have to do is > pass qemu the device-tree -path- to the device you want to pass-trough > (you may support passing a full hierarchy here). > > That is for normal MMIO mapped SoC devices. Something else (individual > i2c, usb, ...) will use specific virtualization of the corresponding > busses. > > Anything else sucks too much really. > > From there, well, there's several approach inside qemu/kvm to handle > that path. If you want to do things at the qemu level you can probably > parse /proc/device-tree. But I'd personally just make it a kernel thing. > > IE. I would have an ioctl to "instanciate" a pass-through device, that > takes that path as an argument. I would make it return an anonymous fd > which you can then use to mmap the resources, etc... Yeah, one idea was to use VFIO here. We could for example modify the host device tree to occupy device we want to pass through with a specific compatibility parameter. Or we could try to steal the node during runtime. But I agree, reading the device tree data from a VFIO node sounds reasonable. If it's required. > >> In some cases, modifications to device tree nodes may be needed. >> An example-- sometimes a device tree property references another node >> and that relationship may not exist when assigned to a guest. 
>> A "phy-handle" property may need to be deleted and a "fixed-link" >> property added to a node representing a network device. > > That's fishy. Why wouldn't you give full access to the MDIO ? It's > shared ? Such things are so device-specific that they would have to be > handled by device-specific quirks, which can live either in qemu or in > the kernel. Hrm, so you'd create a separate device for MDIO which can do pass-through of those? > >> So in addition to assigning a device, a mechanism is needed to update >> device tree nodes. So for the above example, maybe-- >> >> -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle, >> node-update="fixed-link = <2 1 1000 0 0>" > > That's just so gross and error prone, borderline insane. Alternatives: * not modify the device tree (unlikely to work) * pass a full device tree chunk to qemu instead of modification commands * ? > >> The types of modifications needed-- deleting nodes, deleting properties, >> adding nodes, adding properties, adding properties that reference other >> nodes, changing properties. This device tree transforma
Re: [Qemu-devel] device assignment for embedded Power
On 01.07.2011, at 13:16, Paul Brook wrote: >> One feature we need for QEMU/KVM on embedded Power Architecture is the >> ability to do passthru assignment of SoC I/O devices and memory. An >> important use case in embedded is creating static partitions-- >> taking physical memory and I/O devices (non-PCI) and partitioning >> them between the host Linux and several virtual machines. Things like >> live migration would not be needed or supported in these types of >> scenarios. >> >> SoC devices do not sit on a probeable bus and there are no identifiers >> like 01:00.0 with PCI that we can use to identify devices-- the host >> Linux kernel is made aware of SoC I/O devices from nodes/properties in a >> device tree structure passed at boot. QEMU needs to generate a >> device tree to pass to the guest as well with all the guest's virtual >> and physical resources. Today a number of mostly complete guest device >> trees are kept under ./pc-bios in QEMU, but this too static and >> inflexible. > > I doubt you're going to get generic passthrough of arbitrary devices working > in a useful way. My expectation is that, at minimum, you'll need a bus > specific proxy device. i.e. create a virtual device in qemu that responds to > the guest, and happens poke at a host device rather than emulating things > directly. > > For busses like I2C this is fairly trivial - all communication with the > device > goes down a single well defined and easily proxied channel. For more complex > busses you end up having to emulate a lot more. Basically you have to > emulate > everything that is different between the host and guest. If that happens to > include device specific state then you loose. > > Using PCI devices as an example: The resources provided by the device are > self-describing, so proxying those is fairly straightforward, and doesn't > even > require manual configuration. 
However replicating the environment seen by > the > device is trickier as PCI devices can initiate memory accesses (i.e. bus- > master). For machines without an IOMMU this means passthrough in general > can't work, and substantial amounts of device specific knowledge is required. > You'd need to intercept and modify and/oor proxy all data relating to DMA > addresses. In practice you need to emulate an IOMMU inside qemu (so you can > determine the address space accessed by the device), and arrange for the host > IOMMU to present the same virtual address space to the real device. Well, for DMA the solution is reasonably simple. We have basically two choices: * run 1:1 mapped, so the guest physical address == host physical address, at which point DMA works, but everything is insecure * use an IOMMU We can easily limit it to those two cases. The more challenging part here (and the main reason for the email) is the question on how to configure all of that in a flexible, yet simple way. We can find the IO regions for devices from the host device tree - no problem there. But the real challenge is how to expose the device to the guest device tree. Especially when it comes to links between dt nodes, interrupt maps, etc. We basically have 3 choices there: * take the host device tree pieces and modify them * provide device tree chunks for each device (manually or through qdev parameters) * use the device tree as machine config file and base everything on it (solves the linking problem) The main question is which one would be the cleanest solution. And how would it be implemented. Alex
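The first of the three choices (take the host device tree pieces and modify them) might look roughly like the Python sketch below. The tree is modelled as nested dicts purely for illustration; a real implementation would more likely operate on a dtb through libfdt, as suggested elsewhere in this thread. The `compatible` and `reg` values on the ethernet node are placeholders, and `lookup` and `assign_node` are hypothetical helpers.

```python
import copy

# Toy host tree; property values on the ethernet node are placeholders.
host_tree = {
    "soc": {
        "ethernet@b2000": {
            "compatible": "example,ethernet",
            "reg": [0xB2000, 0x1000],
            "phy-handle": "&phy0",
        }
    }
}


def lookup(tree, path):
    """Walk a /-separated device tree path through nested dicts."""
    node = tree
    for name in path.strip("/").split("/"):
        node = node[name]
    return node


def assign_node(tree, path, delete_props=(), set_props=None):
    """Return a modified copy of a host node; the host tree is left untouched."""
    node = copy.deepcopy(lookup(tree, path))
    for prop in delete_props:
        node.pop(prop, None)
    node.update(set_props or {})
    return node


guest_node = assign_node(
    host_tree,
    "/soc/ethernet@b2000",
    delete_props=["phy-handle"],
    set_props={"fixed-link": [2, 1, 1000, 0, 0]},
)
```

This mirrors the `delete-prop=phy-handle` and `fixed-link` example quoted earlier in the thread. It does not address the harder part named above, namely properties that link to other nodes, which would need their references rewritten for the guest tree.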
Re: [Qemu-devel] device assignment for embedded Power
> One feature we need for QEMU/KVM on embedded Power Architecture is the
> ability to do passthru assignment of SoC I/O devices and memory. An
> important use case in embedded is creating static partitions--
> taking physical memory and I/O devices (non-PCI) and partitioning
> them between the host Linux and several virtual machines. Things like
> live migration would not be needed or supported in these types of
> scenarios.
>
> SoC devices do not sit on a probeable bus and there are no identifiers
> like 01:00.0 with PCI that we can use to identify devices-- the host
> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
> device tree structure passed at boot. QEMU needs to generate a
> device tree to pass to the guest as well with all the guest's virtual
> and physical resources. Today a number of mostly complete guest device
> trees are kept under ./pc-bios in QEMU, but this is too static and
> inflexible.

I doubt you're going to get generic passthrough of arbitrary devices working
in a useful way. My expectation is that, at minimum, you'll need a bus
specific proxy device. i.e. create a virtual device in qemu that responds to
the guest, and happens to poke at a host device rather than emulating things
directly.

For busses like I2C this is fairly trivial - all communication with the
device goes down a single well defined and easily proxied channel. For more
complex busses you end up having to emulate a lot more. Basically you have
to emulate everything that is different between the host and guest. If that
happens to include device specific state then you lose.

Using PCI devices as an example: The resources provided by the device are
self-describing, so proxying those is fairly straightforward, and doesn't
even require manual configuration. However replicating the environment seen
by the device is trickier as PCI devices can initiate memory accesses (i.e.
bus-master). For machines without an IOMMU this means passthrough in general
can't work, and substantial amounts of device specific knowledge is required.
You'd need to intercept and modify and/or proxy all data relating to DMA
addresses. In practice you need to emulate an IOMMU inside qemu (so you can
determine the address space accessed by the device), and arrange for the host
IOMMU to present the same virtual address space to the real device.

Paul
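As a rough illustration of the last point, the sketch below models an emulated IOMMU as a list of DMA windows and translates a guest DMA address into a host address, faulting on anything outside a window. The `DmaWindow` type, `translate` function, and the addresses used are hypothetical names and values chosen for this example; they do not correspond to an existing QEMU interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DmaWindow:
    """One contiguous guest-to-host DMA mapping (hypothetical model)."""
    guest_base: int
    host_base: int
    size: int


class DmaFault(Exception):
    """Raised when a device access falls outside every window."""


def translate(windows, guest_addr, length):
    """Return the host address for a guest DMA access of `length` bytes."""
    for window in windows:
        offset = guest_addr - window.guest_base
        # The whole access must fit inside a single window.
        if offset >= 0 and offset + length <= window.size:
            return window.host_base + offset
    raise DmaFault(f"no window covers {guest_addr:#x}+{length:#x}")
```

The same table would have to be programmed into the host IOMMU so that the real device sees the address space the emulated one describes.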
Re: [Qemu-devel] device assignment for embedded Power
On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote:
> One feature we need for QEMU/KVM on embedded Power Architecture is the
> ability to do passthru assignment of SoC I/O devices and memory. An
> important use case in embedded is creating static partitions--
> taking physical memory and I/O devices (non-PCI) and partitioning
> them between the host Linux and several virtual machines. Things like
> live migration would not be needed or supported in these types of
> scenarios.
>
> SoC devices do not sit on a probeable bus and there are no identifiers
> like 01:00.0 with PCI that we can use to identify devices-- the host
> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
> device tree structure passed at boot. QEMU needs to generate a
> device tree to pass to the guest as well with all the guest's virtual
> and physical resources. Today a number of mostly complete guest device
> trees are kept under ./pc-bios in QEMU, but this is too static and
> inflexible.
>
> Some new mechanism is needed to assign SoC devices to guests, and we
> (FSL + Alex Graf) have been discussing a few possible approaches
> for doing this from QEMU and would like some feedback.
>
> Some possibilities:
>
> 1. Option 1. Pass the host dev tree to QEMU and assign devices
>    by device tree path
>
>      -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
>
>    /soc/i2c@3000 is the device tree path to the assigned device.
>    The device node 'i2c@3000' has some number of properties (e.g.
>    address, interrupt info) and possibly subnodes under
>    it. QEMU copies that node when generating the guest dev tree.
>    See snippet of entire node: http://paste2.org/p/1496460

Yuck (see below)

> 2. Option 2. Pass the entire assigned device node as a string to
>    QEMU
>
>      -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
>        #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
>        reg = <0xffe03000 0x100>; interrupts = <43 2>;
>        interrupt-parent = <&mpic>; dfsrr;'

Beuark ! (see below)

>    This avoids needing to pass the host device tree, but could
>    get awkward-- the i2c example above is very simple, some device
>    nodes are very large with a complex hierarchy of subnodes and
>    could be hundreds of lines of text to represent a single
>    node.
>
> It gets more complicated...

So, from a qemu command line perspective, all you should have to do is
pass qemu the device-tree -path- to the device you want to pass through
(you may support passing a full hierarchy here).

That is for normal MMIO mapped SoC devices. Something else (individual
i2c, usb, ...) will use specific virtualization of the corresponding
busses.

Anything else sucks too much really.

From there, well, there are several approaches inside qemu/kvm to handle
that path. If you want to do things at the qemu level you can probably
parse /proc/device-tree. But I'd personally just make it a kernel thing.

I.e. I would have an ioctl to "instantiate" a pass-through device that
takes that path as an argument. I would make it return an anonymous fd
which you can then use to mmap the resources, etc...

> In some cases, modifications to device tree nodes may be needed.
> An example-- sometimes a device tree property references another node
> and that relationship may not exist when assigned to a guest.
> A "phy-handle" property may need to be deleted and a "fixed-link"
> property added to a node representing a network device.

That's fishy. Why wouldn't you give full access to the MDIO ? It's
shared ? Such things are so device-specific that they would have to be
handled by device-specific quirks, which can live either in qemu or in
the kernel.

> So in addition to assigning a device, a mechanism is needed to update
> device tree nodes. So for the above example, maybe--
>
>   -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
>     node-update="fixed-link = <2 1 1000 0 0>"

That's just so gross and error prone, borderline insane.
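One way to picture the "device-specific quirks" alternative to command-line node edits is a table keyed on the node's `compatible` string. The sketch below is only an illustration under stated assumptions: the `QUIRKS` table, the `"example,ethernet"` compatible string, and the function names are hypothetical and not taken from qemu or the kernel; the property edit itself mirrors the phy-handle/fixed-link example quoted above.

```python
def fixed_link_quirk(props):
    """Drop the PHY reference and describe a fixed link instead."""
    props = dict(props)              # never mutate the host's copy
    props.pop("phy-handle", None)
    props["fixed-link"] = (2, 1, 1000, 0, 0)
    return props


# Hypothetical quirk table: compatible string -> property transformation.
QUIRKS = {"example,ethernet": fixed_link_quirk}


def apply_quirks(props):
    """Return guest properties for a node, applying a quirk if one matches."""
    quirk = QUIRKS.get(props.get("compatible"))
    return quirk(props) if quirk else dict(props)
```

With this shape the user still only names the device path, and the knowledge of which properties to rewrite stays with the code that knows the device.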
> The types of modifications needed-- deleting nodes, deleting properties,
> adding nodes, adding properties, adding properties that reference other
> nodes, changing properties. This device tree transformation mechanism
> needed is general enough that it could apply to any device tree based
> embedded platform (e.g. ARM, MIPS)
>
> Another complexity relates to the IOMMU. Here things get very company
> and IOMMU specific. Freescale has a proprietary IOMMU.

Look at the work currently being done for a generic qemu iommu layer. We
need it for server power as well and from what I last saw coming from
Eduardo and David, it's not PCI specific.

> Devices have 1 or more logical I/O device numbers used to index into
> the IOMMU table. The IOMMU is limited in that it is designed to only
> support large, physically contiguous mappings per device. It does not
> support any kind of page table. The IOMMU hardware architec
[Qemu-devel] device assignment for embedded Power
One feature we need for QEMU/KVM on embedded Power Architecture is the
ability to do passthru assignment of SoC I/O devices and memory. An
important use case in embedded is creating static partitions--
taking physical memory and I/O devices (non-PCI) and partitioning
them between the host Linux and several virtual machines. Things like
live migration would not be needed or supported in these types of
scenarios.

SoC devices do not sit on a probeable bus and there are no identifiers
like 01:00.0 with PCI that we can use to identify devices-- the host
Linux kernel is made aware of SoC I/O devices from nodes/properties in a
device tree structure passed at boot. QEMU needs to generate a
device tree to pass to the guest as well with all the guest's virtual
and physical resources. Today a number of mostly complete guest device
trees are kept under ./pc-bios in QEMU, but this is too static and
inflexible.

Some new mechanism is needed to assign SoC devices to guests, and we
(FSL + Alex Graf) have been discussing a few possible approaches
for doing this from QEMU and would like some feedback.

Some possibilities:

1. Option 1. Pass the host dev tree to QEMU and assign devices
   by device tree path

     -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000

   /soc/i2c@3000 is the device tree path to the assigned device.
   The device node 'i2c@3000' has some number of properties (e.g.
   address, interrupt info) and possibly subnodes under
   it. QEMU copies that node when generating the guest dev tree.
   See snippet of entire node: http://paste2.org/p/1496460

2. Option 2. Pass the entire assigned device node as a string to
   QEMU

     -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
       #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
       reg = <0xffe03000 0x100>; interrupts = <43 2>;
       interrupt-parent = <&mpic>; dfsrr;'

   This avoids needing to pass the host device tree, but could
   get awkward-- the i2c example above is very simple, some device
   nodes are very large with a complex hierarchy of subnodes and
   could be hundreds of lines of text to represent a single
   node.

It gets more complicated...

In some cases, modifications to device tree nodes may be needed.
An example-- sometimes a device tree property references another node
and that relationship may not exist when assigned to a guest.
A "phy-handle" property may need to be deleted and a "fixed-link"
property added to a node representing a network device.

So in addition to assigning a device, a mechanism is needed to update
device tree nodes. So for the above example, maybe--

  -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
    node-update="fixed-link = <2 1 1000 0 0>"

The types of modifications needed are: deleting nodes, deleting
properties, adding nodes, adding properties, adding properties that
reference other nodes, and changing properties. This device tree
transformation mechanism is general enough that it could apply to any
device tree based embedded platform (e.g. ARM, MIPS).

Another complexity relates to the IOMMU. Here things get very company
and IOMMU specific. Freescale has a proprietary IOMMU. Devices have 1 or
more logical I/O device numbers used to index into the IOMMU table. The
IOMMU is limited in that it is designed to only support large, physically
contiguous mappings per device. It does not support any kind of page
table. The IOMMU hardware architecture assumes DMAs are typically
targeted to just a few address regions.
So, a common IOMMU setup for a device would be a device with a single
IOMMU mapping covering the guest's main memory segment. However, there
are many much more complicated IOMMU setups that are common as well,
such as doing "operation translations" where a device's write
transaction is translated to "stash" directly into CPU caches. We can't
assume that all memory slots belonging to the guest are targets of DMA.

So for Freescale we would need some very Freescale-specific configuration
mechanism to set up the IOMMU. Here I think we would need the new qcfg
approach to expressing nested structures
(http://wiki.qemu.org/Features/QCFG). Device assignment with IOMMU
set up might look like the examples below:

  # device with multiple logical i/o device numbers
  -device assigned-soc-dev,dev=/qman-portals/qman-portal@4000,
    vcpu=1,fsl,iommu.stash-mem={
      dma-window.guest-addr=0x0,
      dma-window.size=0x1,
      liodn-index=1,
      operation-mapping=0 stash-dest=1},
    fsl,iommu.stash-dqrr={
      dma-window.guest-addr=0xff420,
      dma-window.size=0x4000,
      liodn-index=0,
      operation-mapping=0 stash-dest=1}

  # assign pci-bus to a guest with multiple memory
  # regions
  #addr      size
  #0x0       512MB
  #0x2000    4KB (for MSIs)
  #0x4000    16MB (shared memory)
  #0xc000    64MB (shared memory)
  -device assigned-soc-dev,dev=/pcie@ffe09000,
    fsl,iommu={dma-window.guest-addr=0x0,
      dma-window.size=