Re: [Qemu-devel] device assignment for embedded Power

2011-07-05 Thread Alexander Graf

On 05.07.2011, at 20:19, Yoder Stuart-B08248 wrote:

> 
> 
>> -----Original Message-----
>> From: Benjamin Herrenschmidt [mailto:b...@kernel.crashing.org]
>> Sent: Thursday, June 30, 2011 7:58 PM
>> To: Yoder Stuart-B08248
>> Cc: qemu-devel@nongnu.org; Wood Scott-B07421; Alexander Graf; 
>> alex.william...@redhat.com;
>> anth...@codemonkey.ws; d...@au1.ibm.com; joerg.roe...@amd.com; 
>> p...@codesourcery.com;
>> blauwir...@gmail.com; arm...@redhat.com
>> Subject: Re: device assignment for embedded Power
>> 
>> On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote:
>>> One feature we need for QEMU/KVM on embedded Power Architecture is the
>>> ability to do passthru assignment of SoC I/O devices and memory.  An
>>> important use case in embedded is creating static partitions-- taking
>>> physical memory and I/O devices (non-PCI) and partitioning
>>> them between the host Linux and several virtual machines.   Things like
>>> live migration would not be needed or supported in these types of scenarios.
>>> 
>>> SoC devices do not sit on a probeable bus and there are no identifiers
>>> like 01:00.0 with PCI that we can use to identify devices--  the host
>>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
>>> device tree structure passed at boot.   QEMU needs to generate a
>>> device tree to pass to the guest as well with all the guest's virtual
>>> and physical resources.  Today a number of mostly complete guest
>>> device trees are kept under ./pc-bios in QEMU, but this is too static and
>>> inflexible.
>>> 
>>> Some new mechanism is needed to assign SoC devices to guests, and we
>>> (FSL + Alex Graf) have been discussing a few possible approaches for
>>> doing this from QEMU and would like some feedback.
>>> 
>>> Some possibilities:
>>> 
>>> 1. Option 1.  Pass the host dev tree to QEMU and assign devices
>>>   by device tree path
>>> 
>>> -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
>>> 
>>>   /soc/i2c@3000 is the device tree path to the assigned device.
>>>   The device node 'i2c@3000' has some number of properties (e.g.
>>>   address, interrupt info) and possibly subnodes under
>>>   it.   QEMU copies that node when generating the guest dev tree.
>>>   See snippet of entire node:  http://paste2.org/p/1496460
>> 
>> Yuck (see below)
>> 
>>> 2. Option 2.  Pass the entire assigned device node as a string to
>>>   QEMU
>>> 
>>> -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
>>>  #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
>>>  reg = <0xffe03000 0x100>; interrupts = <43 2>;
>>>  interrupt-parent = <&mpic>; dfsrr;'
>> 
>> Beuark ! (see below)
>> 
>>>   This avoids needing to pass the host device tree, but could
>>>   get awkward-- the i2c example above is very simple, some device
>>>   nodes are very large with a complex hierarchy of subnodes and
>>>   could be hundreds of lines of text to represent a single
>>>   node.
>>> 
>>> It gets more complicated...
>> 
>> 
>> So, from a qemu command line perspective, all you should have to do is pass
>> qemu the device-tree -path- to the device you want to pass through (you may
>> support passing a full hierarchy here).
>> 
>> That is for normal MMIO mapped SoC devices. Something else (individual i2c, 
>> usb, ...) will use
>> specific virtualization of the corresponding busses.
> 
> Then why 'yuck' to option 1 :)?   That is basically what was being proposed.

Yes, and probably a good idea to go with for now. We can handle the guest 
device tree parts externally for now by passing in a fully populated device 
tree that just contains everything we need and pass qemu the configuration the 
way we did it in the device tree.
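
To make the "assign by device tree path" idea (Option 1 above) concrete, here is a minimal sketch. It assumes a toy in-memory tree in which dict values are subnodes and everything else is a property; `find_node` and `assign_by_path` are hypothetical helper names, not QEMU functions. The property values are the ones quoted in the thread for the i2c node.

```python
import copy

def find_node(tree, path):
    """Return the node dict at a '/'-separated device tree path."""
    node = tree
    for name in (p for p in path.split("/") if p):
        child = node.get(name)
        if not isinstance(child, dict):
            raise KeyError(f"no node {name!r} on path {path!r}")
        node = child
    return node

def assign_by_path(host_tree, guest_tree, path):
    """Deep-copy the node at `path`, subnodes included, into the guest tree."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("refusing to assign the root node")
    source = find_node(host_tree, path)
    parent = guest_tree
    for name in parts[:-1]:
        parent = parent.setdefault(name, {})  # create missing parent nodes
    parent[parts[-1]] = copy.deepcopy(source)
    return parent[parts[-1]]

# Toy host tree (assumption): dict values are subnodes, the rest are properties.
host = {
    "soc": {
        "i2c@3000": {
            "compatible": "fsl-i2c",
            "cell-index": 0,
            "interrupts": [43, 2],
        },
    },
}
guest = {}
node = assign_by_path(host, guest, "/soc/i2c@3000")
```

The copy is deliberately deep so that later guest-side edits to the node cannot leak back into the host tree.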


>> Anything else sucks too much really.
>> 
>> From there, well, there are several approaches inside qemu/kvm to handle that
>> path. If you want to
>> do things at the qemu level you can probably parse /proc/device-tree. But 
>> I'd personally just
>> make it a kernel thing.
>> 
>> IE. I would have an ioctl to "instantiate" a pass-through device, that takes
>> that path as an
>> argument. I would make it return an anonymous fd which you can then use to 
>> mmap the resources,
>> etc...
> 
> Regarding implementation I think there are 3 things that need
> to be set up--  1) mmapping the device's registers, 2) getting the iommu
> set up (if there is one), 3) getting the interrupt(s) handled.

Yes :).

I guess we'll just have to sit down and implement something very simple that 
can at least pass through MMIO regions and interrupts and then take it from 
there until we hit the many walls.


Alex




Re: [Qemu-devel] device assignment for embedded Power

2011-07-05 Thread Yoder Stuart-B08248


> -----Original Message-----
> From: Benjamin Herrenschmidt [mailto:b...@kernel.crashing.org]
> Sent: Thursday, June 30, 2011 7:58 PM
> To: Yoder Stuart-B08248
> Cc: qemu-devel@nongnu.org; Wood Scott-B07421; Alexander Graf; 
> alex.william...@redhat.com;
> anth...@codemonkey.ws; d...@au1.ibm.com; joerg.roe...@amd.com; 
> p...@codesourcery.com;
> blauwir...@gmail.com; arm...@redhat.com
> Subject: Re: device assignment for embedded Power
> 
> On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote:
> > One feature we need for QEMU/KVM on embedded Power Architecture is the
> > ability to do passthru assignment of SoC I/O devices and memory.  An
> > important use case in embedded is creating static partitions-- taking
> > physical memory and I/O devices (non-PCI) and partitioning
> > them between the host Linux and several virtual machines.   Things like
> > live migration would not be needed or supported in these types of scenarios.
> >
> > SoC devices do not sit on a probeable bus and there are no identifiers
> > like 01:00.0 with PCI that we can use to identify devices--  the host
> > Linux kernel is made aware of SoC I/O devices from nodes/properties in a
> > device tree structure passed at boot.   QEMU needs to generate a
> > device tree to pass to the guest as well with all the guest's virtual
> > and physical resources.  Today a number of mostly complete guest
> > device trees are kept under ./pc-bios in QEMU, but this is too static and
> > inflexible.
> >
> > Some new mechanism is needed to assign SoC devices to guests, and we
> > (FSL + Alex Graf) have been discussing a few possible approaches for
> > doing this from QEMU and would like some feedback.
> >
> > Some possibilities:
> >
> > 1. Option 1.  Pass the host dev tree to QEMU and assign devices
> >by device tree path
> >
> >  -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
> >
> >/soc/i2c@3000 is the device tree path to the assigned device.
> >The device node 'i2c@3000' has some number of properties (e.g.
> >address, interrupt info) and possibly subnodes under
> >it.   QEMU copies that node when generating the guest dev tree.
> >See snippet of entire node:  http://paste2.org/p/1496460
> 
> Yuck (see below)
> 
> > 2. Option 2.  Pass the entire assigned device node as a string to
> >QEMU
> >
> >  -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
> >   #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
> >   reg = <0xffe03000 0x100>; interrupts = <43 2>;
> >   interrupt-parent = <&mpic>; dfsrr;'
> 
> Beuark ! (see below)
> 
> >This avoids needing to pass the host device tree, but could
> >get awkward-- the i2c example above is very simple, some device
> >nodes are very large with a complex hierarchy of subnodes and
> >could be hundreds of lines of text to represent a single
> >node.
> >
> > It gets more complicated...
> 
> 
> So, from a qemu command line perspective, all you should have to do is pass
> qemu the device-tree -path- to the device you want to pass through (you may
> support passing a full hierarchy here).
> 
> That is for normal MMIO mapped SoC devices. Something else (individual i2c, 
> usb, ...) will use
> specific virtualization of the corresponding busses.

Then why 'yuck' to option 1 :)?   That is basically what was being proposed.

> Anything else sucks too much really.
> 
> From there, well, there are several approaches inside qemu/kvm to handle that
> path. If you want to
> do things at the qemu level you can probably parse /proc/device-tree. But I'd 
> personally just
> make it a kernel thing.
>
> IE. I would have an ioctl to "instantiate" a pass-through device, that takes
> that path as an
> argument. I would make it return an anonymous fd which you can then use to 
> mmap the resources,
> etc...

Regarding implementation I think there are 3 things that need
to be set up--  1) mmapping the device's registers, 2) getting the iommu
set up (if there is one), 3) getting the interrupt(s) handled.
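
Of the three steps above, only the first (mapping the device's registers) is easy to show from user space. The following is a minimal sketch, assuming the register window is reachable through some mappable file; `map_region` is a hypothetical helper, and the demo uses a temporary file as a stand-in so that it runs without hardware or privileges. IOMMU setup and interrupt delivery need kernel support and are not covered.

```python
import mmap
import os
import tempfile

def map_region(path, file_offset, size, writable=False):
    """Map `size` bytes starting at `file_offset` of `path`.

    mmap offsets must be multiples of the allocation granularity, so the
    mapping starts at the aligned base and the returned view skips the
    leading padding.  Returns (mapping, view).
    """
    gran = mmap.ALLOCATIONGRANULARITY
    base = file_offset - (file_offset % gran)
    delta = file_offset - base
    fd = os.open(path, os.O_RDWR if writable else os.O_RDONLY)
    try:
        access = mmap.ACCESS_WRITE if writable else mmap.ACCESS_READ
        mapping = mmap.mmap(fd, delta + size, access=access, offset=base)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed
    return mapping, memoryview(mapping)[delta:delta + size]

# Demo against a stand-in file (assumption: real use would target a device file).
gran = mmap.ALLOCATIONGRANULARITY
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(gran + 0x100) + b"REGS" + bytes(60))
mapping, regs = map_region(tmp.name, gran + 0x100, 4)
data = bytes(regs)
regs.release()      # views must be released before the mapping is closed
mapping.close()
os.unlink(tmp.name)
```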

> > In some cases, modifications to device tree nodes may be needed.
> > An example-- sometimes a device tree property references another node
> > and that relationship may not exist when assigned to a guest.
> > A "phy-handle" property may need to be deleted and a "fixed-link"
> > property added to a node representing a network device.
> 
> That's fishy. Why wouldn't you give full access to the MDIO ? It's shared ? 
> Such things are so
> device-specific that they would have to be handled by device-specific quirks, 
> which can live
> either in qemu or in the kernel.

It is shared, and in this case we didn't want the phy shared.   That was a super
simple example to illustrate the idea.  From our experience with the Freescale
Embedded Hypervisor we see this as a definite requirement-- nodes in the
hardware device tree may need modifications.  In the P4080 device tree there
are some complex relationships expressed between nodes
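
The phy-handle / fixed-link example from this message can be sketched as a small per-node quirk. This is only an illustration: the node is a plain dict, `apply_fixed_link_quirk` is a hypothetical name, the node contents are made up, and the five fixed-link cells are placeholder values rather than settings for any real board.

```python
def apply_fixed_link_quirk(node, fixed_link):
    """Return a copy of `node` with phy-handle removed and fixed-link added."""
    patched = dict(node)
    patched.pop("phy-handle", None)          # the guest does not get the shared phy
    patched["fixed-link"] = list(fixed_link)
    return patched

# Hypothetical network device node as it might appear in the host tree.
host_enet = {
    "compatible": "example,enet",            # placeholder, not a real binding
    "phy-handle": "&phy0",
}
# Placeholder cells (assumption): not taken from any real device tree.
guest_enet = apply_fixed_link_quirk(host_enet, [1, 1, 1000, 0, 0])
```

Returning a patched copy keeps the host's view of the node untouched, which matters when the same host tree feeds several guests.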

Re: [Qemu-devel] device assignment for embedded Power

2011-07-05 Thread Scott Wood
On Fri, 1 Jul 2011 17:32:43 -0500
Anthony Liguori  wrote:

> On 07/01/2011 11:43 AM, Scott Wood wrote:
> > However, we'll need to address the question of what it means to say "irq 10"
> 
> It depends on what the bus is.  If you're going to declare "system bus" 
> which is sort of what we call ISA for the PC,

More like "arbitrary MMIO".  Could be an on-chip peripheral.  Could be some
external custom chip.  Could be an entire PCIe root complex.

> then it can map trivially to the interrupt controller's inputs.

Which interrupt controller?  We might want to assign an IRQ that's on some
cascaded controller.

We also have some things like MPIC IPIs and timers,
that are on the main interrupt controller but aren't normal numbered
interrupts.  We use the ability to have multiple cells in an interrupt
specifier to express these.  And while you could make up fake numbers for
these to force it to be linear, someone has to come up with this mapping and
get qemu, its users, and the kernel to agree on it.  We already have a
repository for such bindings for the device tree.

That's not to say that the device tree should be forced onto platforms that
have some other reasonable way of doing it, of course -- just that it's
nice to be able to refer to it when it's there.
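
To illustrate why a single "irq 10" number is not enough here, this is a minimal sketch of how multi-cell interrupt specifiers are read relative to a particular controller. The tree layout and helper names are assumptions; the `<43 2>` specifier on the MPIC comes from the i2c example earlier in the thread, while `cpld-pic` and `board-dev` are hypothetical nodes standing in for a cascaded controller.

```python
def split_specifiers(cells, interrupt_cells):
    """Split a flat `interrupts` cell list into per-interrupt specifiers."""
    if interrupt_cells < 1:
        raise ValueError("#interrupt-cells must be at least 1")
    if len(cells) % interrupt_cells:
        raise ValueError("interrupts is not a whole number of specifiers")
    return [tuple(cells[i:i + interrupt_cells])
            for i in range(0, len(cells), interrupt_cells)]

def device_interrupts(tree, name):
    """Return (controller, specifier) pairs for one device node."""
    dev = tree[name]
    parent = dev["interrupt-parent"]
    width = tree[parent]["#interrupt-cells"]
    return [(parent, spec) for spec in split_specifiers(dev["interrupts"], width)]

tree = {
    "mpic": {"#interrupt-cells": 2},
    # Hypothetical cascaded controller hanging off the MPIC.
    "cpld-pic": {"#interrupt-cells": 1,
                 "interrupt-parent": "mpic", "interrupts": [3, 1]},
    "i2c@3000": {"interrupt-parent": "mpic", "interrupts": [43, 2]},
    # Hypothetical device behind the cascaded controller.
    "board-dev": {"interrupt-parent": "cpld-pic", "interrupts": [5]},
}
i2c_irqs = device_interrupts(tree, "i2c@3000")
board_irqs = device_interrupts(tree, "board-dev")
```

An interrupt is therefore identified by a (controller, specifier) pair, and the specifier's width depends on the controller, which is the information a flat number would lose.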

> > -- outside of PC-land there often isn't a global IRQ numberspace that isn't
> > a fiction created by some software layer.
> 
> PC's don't have a global IRQ number space FWIW.  When we say:
> 
> -device isa-serial,irq=4
> 
> This really means, "ISA irq 4", which is mapped to the PIIX3 and then 
> routed through GSI, then the APIC architecture to correspond to some 
> interrupt for some physical CPU.

Well, it's been a while since I've dealt with such things on PCs...  I
thought there was at least some standard way of interpreting things like
IRQ numbers that the BIOS wrote into PCI config space.

> > Addressing this is one of the
> > device tree's strengths.
> 
> Not really.  There's nothing magical about the device tree.  It's just a 
> guest visible description of the platform hardware that isn't probe-able 
> in some bus framework.  ACPI does exactly the same thing.  I'll concede 
> that the device tree is far nicer than ACPI but again, it's not magical :-)

I didn't say it was the only way to express it -- just that the device tree,
or something like it, comes in useful here.

And we're not about to do ACPI on powerpc. :-)

-Scott




Re: [Qemu-devel] device assignment for embedded Power

2011-07-02 Thread Paul Brook
> So you're basically saying we should tackle these 3 issues separately:
> 
>   * actually pass through a device
>   * generate interrupt links
>   * model the guest device tree dynamically based on whatever the user
> gives us

Yes.

Paul



Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Alexander Graf

On 02.07.2011, at 01:50, Paul Brook wrote:

>> On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
>>>> On Fri, 1 Jul 2011 18:03:01 +0100
>>>> 
>>>> Paul Brook  wrote:
>>>>> Basically you should start by implementing full emulation of a device
>>>>> with similar characteristics to the one you want to passthrough.
>>>> 
>>>> That's not going to happen.
>>> 
>>> Why is your device so unique? How does it interact with the guest system
>>> and what features does it require that doesn't exist in any device that
>>> can be emulated?
>> 
>> Do you guys only support PCI pass-through by doing full emulation of
>> all possible supported PCI devices first ? :-)
> 
> Absolutely not.  My point is that dynamic (user-driven) device creation is 
> effectively a prerequisite for a passthrough device.
> 
> If you just want to make a very specific use-case then this doesn't need any
> code in qemu at all.  We just make the user provide the device tree
> themselves. If it doesn't match then they lose.  If you do choose an ugly
> qemu then the chances are it'll be changed/removed once we do dynamic device
> creation properly.  There have already been discussions about dynamic device
> creation, so this isn't completely hypothetical.
> 
> If you integrate it properly, then you need to realise that there's a fair
> chunk of infrastructure and user interface required.  Most of which has 
> nothing to do with device passthrough.  Trying to implement both at the same 
> time is just going to cause confusion and complicate things.  It's already a 
> hard problem, combining it with something else is just going to cause you and 
> everyone else even more pain.

So you're basically saying we should tackle these 3 issues separately:

  * actually pass through a device
  * generate interrupt links
  * model the guest device tree dynamically based on whatever the user gives us

I tend to agree with that perspective. Still, the main issue still stands in 
that we don't have a concrete answer for all three issues :). Facing them one 
at a time might help actually solving them though.


Alex




Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Paul Brook
> On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
> > > On Fri, 1 Jul 2011 18:03:01 +0100
> > > 
> > > Paul Brook  wrote:
> > > > Basically you should start by implementing full emulation of a device
> > > > with similar characteristics to the one you want to passthrough.
> > > 
> > > That's not going to happen.
> > 
> > Why is your device so unique? How does it interact with the guest system
> > and what features does it require that doesn't exist in any device that
> > can be emulated?
> 
> Do you guys only support PCI pass-through by doing full emulation of
> all possible supported PCI devices first ? :-)

Absolutely not.  My point is that dynamic (user-driven) device creation is 
effectively a prerequisite for a passthrough device.

If you just want to make a very specific use-case then this doesn't need any
code in qemu at all.  We just make the user provide the device tree
themselves. If it doesn't match then they lose.  If you do choose an ugly
qemu then the chances are it'll be changed/removed once we do dynamic device
creation properly.  There have already been discussions about dynamic device
creation, so this isn't completely hypothetical.

If you integrate it properly, then you need to realise that there's a fair
chunk of infrastructure and user interface required.  Most of which has 
nothing to do with device passthrough.  Trying to implement both at the same 
time is just going to cause confusion and complicate things.  It's already a 
hard problem, combining it with something else is just going to cause you and 
everyone else even more pain.

Paul



Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Paul Brook
> > Why is your device so unique? How does it interact with the guest system
> > and what features does it require that doesn't exist in any device that
> > can be emulated?
> 
> Perhaps I misunderstood what you meant by "similar characteristics".  I see
> no reason to spend a bunch of time implementing full emulation for a
> device, that isn't going to be used, just because it seems like a nice
> intermediary step.

You say your device has MMIO regions, generates IRQs and initiates DMA 
transactions.  Any device or selection of devices that between them use all 
those features will do the job. I'd expect most SoC to have several.  We don't 
care what the device actually does, only the ways it communicates with the 
rest of the machine.

I think you're coming at this problem from completely the wrong direction.  
Instead of "how do I wedge this passthrough into my machine", you should be 
asking "how do I create a machine without knowing the machine layout at 
compile time".  Once you fix that, hooking up the passthrough device should be 
fairly trivial.  You only have a single passthrough device, and the rest of us 
have none at all.  Anything restricted to the passthrough case is thus unlikely
to be the right answer to the second question, and I'd expect it to be 
removed/changed/broken when we do get round to implementing dynamic device 
creation.
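
As a rough illustration of "create a machine without knowing the machine layout at compile time", the sketch below builds a machine from a runtime list of device descriptions. Everything here is hypothetical (the `Machine` class, the registry, the device type names, and the UART address and IRQ); only the i2c window 0xffe03000/0x100 and interrupt 43 come from the thread. It is not how QEMU's device model works, only a picture of the shape of the problem.

```python
DEVICE_TYPES = {}

def device_type(name):
    """Register a factory under `name` so it can be created from a description."""
    def register(factory):
        DEVICE_TYPES[name] = factory
        return factory
    return register

class Machine:
    def __init__(self):
        self.mmio = {}   # base address -> (size, owner)
        self.irqs = {}   # interrupt line -> owner

    def map_mmio(self, base, size, owner):
        for other_base, (other_size, other) in self.mmio.items():
            if base < other_base + other_size and other_base < base + size:
                raise ValueError(f"{owner} overlaps {other}")
        self.mmio[base] = (size, owner)

    def connect_irq(self, line, owner):
        if line in self.irqs:
            raise ValueError(f"irq {line} already used by {self.irqs[line]}")
        self.irqs[line] = owner

@device_type("emulated-uart")
def make_uart(machine, name, base, irq):
    machine.map_mmio(base, 0x100, name)
    machine.connect_irq(irq, name)

@device_type("passthrough")
def make_passthrough(machine, name, base, size, irq):
    # Wired up exactly like an emulated device; only the backend differs.
    machine.map_mmio(base, size, name)
    machine.connect_irq(irq, name)

def build_machine(specs):
    """Create a machine purely from a runtime list of device descriptions."""
    machine = Machine()
    for spec in specs:
        spec = dict(spec)
        factory = DEVICE_TYPES[spec.pop("type")]
        factory(machine, **spec)
    return machine

machine = build_machine([
    {"type": "emulated-uart", "name": "serial0", "base": 0xffe04500, "irq": 42},
    {"type": "passthrough", "name": "i2c0",
     "base": 0xffe03000, "size": 0x100, "irq": 43},
])
```

Once devices are created from data like this, the passthrough case is one more registered type rather than a special path through the machine setup.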

> > > We're talking about directly mapping the registers into the guest.  The
> > > whole point is performance.
> > 
> > That's an additional step after you get passthrough working the normal
> > way.
> 
> "normal"?

Mapping a MMIO region into the guest is an additional complication, and purely 
a performance optimization.  qemu already needs to be in the loop to handle 
interrupts, probably DMA setup and the non-kvm case.
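
The trap-and-forward arrangement described here, with qemu in the loop for every access, can be pictured with a small proxy object. This is a sketch under assumptions: `MmioProxy` is a hypothetical class, the backing buffer is a `bytearray` standing in for a mapping of the real device, and registers are treated as 32-bit big-endian purely for the example.

```python
import struct

class MmioProxy:
    """Forward trapped guest accesses to a buffer that backs the real device."""

    def __init__(self, guest_base, backing):
        self.guest_base = guest_base
        self.backing = memoryview(backing)

    def _offset(self, addr, size):
        off = addr - self.guest_base
        if off < 0 or off + size > len(self.backing):
            raise ValueError(f"access at {addr:#x} is outside the region")
        return off

    def read32(self, addr):
        # Called by the emulator when the guest loads from the region.
        return struct.unpack_from(">I", self.backing, self._offset(addr, 4))[0]

    def write32(self, addr, value):
        # Called by the emulator when the guest stores to the region.
        struct.pack_into(">I", self.backing, self._offset(addr, 4),
                         value & 0xFFFFFFFF)

registers = bytearray(0x100)              # stand-in for the mapped device window
proxy = MmioProxy(0xffe03000, registers)
proxy.write32(0xffe03004, 0x12345678)
value = proxy.read32(0xffe03004)
```

Mapping the region straight into the guest removes `read32`/`write32` from the access path entirely, which is the performance step the two sides of this thread disagree about ordering.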

> I'm not sure what the use case is for direct assignment of a device in an
> otherwise completely emulated guest, but perhaps there is one.

Typically because the host system doesn't know how to talk to it, or there 
isn't a sensible way to relay the functionality provided by the device from 
the kernel to qemu.

> > We already have mechanisms (or at least patches) for mapping file-like
> > objects into guest physical memory.  That's largely independent of
> > device passthrough. It's a relatively minor tweak to how the passthrough
> > device sets up its MMIO regions.
> > 
> > Mapping host device MMIO regions into guest space is entirely
> > uninteresting unless we already have some way of creating guest-host
> > passthrough devices.
> 
> Isn't that what's being discussed?

It's your end goal, but I don't think it's particularly relevant to the 
problem you've encountered.

> > Creating guest-device passthrough devices isn't going to happen until we
> > can create arbitrary devices (within the set emulated by qemu) that
> > interact with the rest of the emulated machine in a similar way.
> 
> What do you mean by "interact with the rest of the emulated machine in a
> similar way"?

See first paragraph above.

Paul



Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Benjamin Herrenschmidt
On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
> > On Fri, 1 Jul 2011 18:03:01 +0100
> > 
> > Paul Brook  wrote:
> > > Basically you should start by implementing full emulation of a device
> > > with similar characteristics to the one you want to passthrough.
> > 
> > That's not going to happen.
> 
> Why is your device so unique? How does it interact with the guest system and 
> what features does it require that doesn't exist in any device that can be
> emulated?

Do you guys only support PCI pass-through by doing full emulation of
all possible supported PCI devices first ? :-)

> I'm also extremely sceptical of anything that only works in a kvm
> environment.  Makes me think it's an unmaintainable hack, and almost
> certainly going to cause you immense amounts of pain later.

See above question...

Cheers,
Ben.
 
> > > I doubt you're going to get generic passthrough of arbitrary devices
> > > working in a useful way.
> > 
> > It's usefully working for us internally -- we're just trying to find a way
> > to improve it for upstream, with a better configuration mechanism.
> 
> I don't believe that either.  More likely you've got passthrough of device 
> hanging off your specific CPU bus, using only (or even a subset of) the 
> facilities provided by that bus.
> 
> > > Basically you have to emulate  everything that is different between the
> > > host and guest.
> > 
> > Directly assigning a device means you don't get to have differences between
> > the actual hardware device and what the guest sees.  The kind of thin
> > wrapper you're suggesting might have some use cases, but it's a different
> > problem from what we're trying to solve.
> 
> That's the problem. You've skipped several steps and gone straight for
> optimization before you've even got basic functionality working.
> 
> You've also missed the point I was making.  In order to do device passthrough
> you need to define a boundary along which the emulated machine state can be
> fully replicated on the host machine.  Anything inside this boundary is (by
> definition) the same on both the host and guest systems (we're effectively
> using host hardware to emulate a device for us). Outside that boundary the
> host and guest systems will diverge.
> 
> For a device that merely responds to CPU initiated MMIO transfers this is 
> pretty simple, it's the point at which MMIO transfers are generated. So the 
> guest gets a proxy device that intercepts accesses to that memory region, and 
> the host proxies some way for qemu to poke values at the host device.
> 
> > > Once you've done all the above, host device passthrough should be
> > > relatively straightforward.  Just replace the emulation bits in the
> > > above device with code that pokes at a real device via the relevant
> > > kernel API.
> > 
> > That's not what we mean by direct device assignment.
> 
> Maybe, but IMO it's a necessary prerequisite. You're trying to run before
> you can walk.
> 
> > We're talking about directly mapping the registers into the guest.  The
> > whole point is performance.
> 
> That's an additional step after you get passthrough working the normal way.
> We already have mechanisms (or at least patches) for mapping file-like
> objects into guest physical memory.  That's largely independent of device
> passthrough.  It's a relatively minor tweak to how the passthrough device
> sets up its MMIO regions.
> 
> Mapping host device MMIO regions into guest space is entirely uninteresting
> unless we already have some way of creating guest-host passthrough devices.
> Creating guest-device passthrough devices isn't going to happen until we can
> create arbitrary devices (within the set emulated by qemu) that interact with
> the rest of the emulated machine in a similar way.
> 
> Paul





Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Anthony Liguori

On 07/01/2011 12:03 PM, Paul Brook wrote:

>>> irq[0].guest_irq = "10"
>>>
>>> This should be independent of anything to do with device tree.  This
>>> would be useful for x86 too to assign platform devices (like the HPET).
>>
>> That's fine, as long as there's something layered on top of it for the case
>> where we do want to reference something in the device tree.
>>
>> However, we'll need to address the question of what it means to say "irq
>> 10" -- outside of PC-land there often isn't a global IRQ numberspace that
>> isn't a fiction created by some software layer.  Addressing this is one of
>> the device tree's strengths.
>
> That's an entirely separate problem, though probably a prerequisite.
>
> Basically you should start by implementing full emulation of a device with
> similar characteristics to the one you want to passthrough.


If you want to model interrupt remapping, you have to model device 
relationships.  If you cannot express the bus hierarchy/relationship 
then you cannot sanely model interrupt remapping.


You can only really ever think about passing through an entire subtree 
of the device hierarchy.  You can't have a partial subtree with some 
crazy hack logic to explain how the physical layer may remap interrupts. 
 That's just asking for pain.


Regards,

Anthony Liguori



Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Anthony Liguori

On 07/01/2011 11:43 AM, Scott Wood wrote:

> On Fri, 1 Jul 2011 07:10:45 -0500
> Anthony Liguori  wrote:
>
>> I agree in principle but I think it should be done in a slightly
>> different way.
>>
>> I think we ought to support composing a device by passthrough.  For
>> instance, something like:
>>
>> [physical-device "mydev"]
>> region[0].file = "/dev/mem"
>> region[0].guest_address = "0x42232000"
>> region[0].file_offset = "0x23423400"
>> region[0].size = "4096"
>> irq[0].guest_irq = "10"
>> irq[0].host_irq = "10"
>>
>> This should be independent of anything to do with device tree.  This
>> would be useful for x86 too to assign platform devices (like the HPET).
>
> That's fine, as long as there's something layered on top of it for the case
> where we do want to reference something in the device tree.
>
> However, we'll need to address the question of what it means to say "irq 10"

It depends on what the bus is.  If you're going to declare "system bus"
which is sort of what we call ISA for the PC, then it can map trivially
to the interrupt controller's inputs.

> -- outside of PC-land there often isn't a global IRQ numberspace that isn't
> a fiction created by some software layer.

PC's don't have a global IRQ number space FWIW.  When we say:

-device isa-serial,irq=4

This really means, "ISA irq 4", which is mapped to the PIIX3 and then
routed through GSI, then the APIC architecture to correspond to some
interrupt for some physical CPU.

> Addressing this is one of the
> device tree's strengths.

Not really.  There's nothing magical about the device tree.  It's just a
guest visible description of the platform hardware that isn't probe-able
in some bus framework.  ACPI does exactly the same thing.  I'll concede
that the device tree is far nicer than ACPI but again, it's not magical :-)

Regards,

Anthony Liguori

> -Scott
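
The `[physical-device "mydev"]` stanza quoted in this message is only a proposal, so there is no existing parser for it. The sketch below shows one way such a description could be read into plain data; the function name and the returned layout are assumptions, and only the key names and values shown in the stanza are taken from the thread.

```python
import re

SECTION_RE = re.compile(r'^\[physical-device\s+"([^"]+)"\]$')
ENTRY_RE = re.compile(r'^(region|irq)\[(\d+)\]\.(\w+)\s*=\s*"([^"]*)"$')

def parse_physical_devices(text):
    """Parse physical-device stanzas into {name: {"region": {...}, "irq": {...}}}."""
    devices = {}
    current = None
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line:
            continue
        section = SECTION_RE.match(line)
        if section:
            current = devices.setdefault(section.group(1),
                                         {"region": {}, "irq": {}})
            continue
        entry = ENTRY_RE.match(line)
        if entry is None or current is None:
            raise ValueError(f"line {lineno}: cannot parse {raw!r}")
        kind, index, key, value = entry.groups()
        slot = current[kind].setdefault(int(index), {})
        # Only the file name stays a string; int(..., 0) accepts 0x prefixes.
        slot[key] = value if key == "file" else int(value, 0)
    return devices

config = '''
[physical-device "mydev"]
region[0].file = "/dev/mem"
region[0].guest_address = "0x42232000"
region[0].file_offset = "0x23423400"
region[0].size = "4096"
irq[0].guest_irq = "10"
irq[0].host_irq = "10"
'''
parsed = parse_physical_devices(config)
```

Note that the parsed `irq` entries are still bare numbers, which is exactly the ambiguity Scott raises: the description says nothing about which controller "10" refers to.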







Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Scott Wood
On Fri, 1 Jul 2011 21:59:35 +0100
Paul Brook  wrote:

> > On Fri, 1 Jul 2011 18:03:01 +0100
> > 
> > Paul Brook  wrote:
> > > Basically you should start by implementing full emulation of a device
> > > with similar characteristics to the one you want to passthrough.
> > 
> > That's not going to happen.
> 
> Why is your device so unique? How does it interact with the guest system and 
> what features does it require that doesn't exist in any device that can be
> emulated?

Perhaps I misunderstood what you meant by "similar characteristics".  I see
no reason to spend a bunch of time implementing full emulation for a device,
that isn't going to be used, just because it seems like a nice
intermediary step.

What specifically is it you're suggesting we do full emulation of?

> I'm also extremely sceptical of anything that only works in a kvm
> environment.  Makes me think it's an unmaintainable hack, and almost
> certainly going to cause you immense amounts of pain later.

I believe the only part of the device assignment stuff we've implemented so
far that is KVM specific is the interrupt routing.  I'm open to ways of
routing the interrupts to qemu in the non-KVM case, as long as we can
bypass it when KVM is used.

I'm not sure what the use case is for direct assignment of a device in an
otherwise completely emulated guest, but perhaps there is one.

> > > I doubt you're going to get generic passthrough of arbitrary devices
> > > working in a useful way.
> > 
> > It's usefully working for us internally -- we're just trying to find a way
> > to improve it for upstream, with a better configuration mechanism.
> 
> I don't believe that either.  More likely you've got passthrough of device 
> hanging off your specific CPU bus, using only (or even a subset of) the 
> facilities provided by that bus.

There's nothing special about our "bus".  It's MMIO, DMA, and interrupts.

What specifically are you disbelieving?

> > > Basically you have to emulate  everything that is different between the
> > > host and guest.
> > 
> > Directly assigning a device means you don't get to have differences between
> > the actual hardware device and what the guest sees.  The kind of thin
> > wrapper you're suggesting might have some use cases, but it's a different
> > problem from what we're trying to solve.
> 
> That's the problem. You've skipped several steps and gone straight for
> optimization before you've even got basic functionality working.

This is the basic functionality -- assign a piece of hardware to the
guest with minimal overhead.  Why go through contortions to construct some
intermediate phase that nobody's interested in using?

> You've also missed the point I was making.  In order to do device passthrough
> you need to define a boundary along which the emulated machine state can be
> fully replicated on the host machine.  Anything inside this boundary is (by
> definition) the same on both the host and guest systems (we're effectively
> using host hardware to emulate a device for us). Outside that boundary the
> host and guest systems will diverge.

I'm still not sure what the point is, then.  By directly assigning the
device the user is placing everything about the device on the "same as
host" side of that boundary.

We're not using host hardware to emulate a device, we're using host
hardware to send and receive packets under control of the guest.
Whatever hardware that is, the guest will deal with it, just as if the
guest weren't running in a vm.

> For a device that merely responds to CPU initiated MMIO transfers this is 
> pretty simple, it's the point at which MMIO transfers are generated. So the 
> guest gets a proxy device that intercepts accesses to that memory region, and 
> the host proxies some way for qemu to poke values at the host device.

The point is to be faster than virtio, not slower.  There would be no
reason for us to do this otherwise.

Emulating some specific device is not our goal, at all.  I realize that
that's a major part of what qemu does, but it's not the only thing it's
used for.

> > > Once you've done all the above, host device passthrough should be
> > > relatively straightforward.  Just replace the emulation bits in the
> > > above device with code that pokes at a real device via the relevant
> > > kernel API.
> > 
> > That's not what we mean by direct device assignment.
> 
> Maybe, but IMO it's a necessary prerequisite. You're trying to run before
> you can walk.

I disagree that it is a prerequisite.  It is a fundamentally different
thing, for a different purpose.

If it's a purpose that is important to you, and you think the proposed
config mechanisms don't accommodate that, then propose something that does.

> > We're talking about directly mapping the registers into the guest.  The
> > whole point is performance.
> 
> That's an additional step after you get passthrough working the normal way.

"normal"?

> We already have mechanisms (or at least patches) for mappi

Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Paul Brook
> On Fri, 1 Jul 2011 18:03:01 +0100
> 
> Paul Brook  wrote:
> > Basically you should start by implementing full emulation of a device
> > with similar characteristics to the one you want to passthrough.
> 
> That's not going to happen.

Why is your device so unique? How does it interact with the guest system and 
what features does it require that doesn't exist in any device that can be 
emulated?

I'm also extremely sceptical of anything that only works in a kvm environment.  
Makes me think it's an unmaintainable hack, and almost certainly going to 
cause you immense amounts of pain later.

> > I doubt you're going to get generic passthrough of arbitrary devices
> > working in a useful way.
> 
> It's usefully working for us internally -- we're just trying to find a way
> to improve it for upstream, with a better configuration mechanism.

I don't believe that either.  More likely you've got passthrough of a device 
hanging off your specific CPU bus, using only (or even a subset of) the 
facilities provided by that bus.

> > Basically you have to emulate  everything that is different between the
> > host and guest.
> 
> Directly assigning a device means you don't get to have differences between
> the actual hardware device and what the guest sees.  The kind of thin
> wrapper you're suggesting might have some use cases, but it's a different
> problem from what we're trying to solve.

That's the problem. You've skipped several steps and gone straight for 
optimization before you've even got basic functionality working.

You've also missed the point I was making.  In order to do device passthrough 
you need to define a boundary along which the emulated machine state can be 
fully replicated on the host machine.  Anything inside this boundary is (by 
definition) the same on both the host and guest systems (we're effectively 
using host hardware to emulate a device for us). Outside that boundary the 
host and guest systems will diverge.

For a device that merely responds to CPU initiated MMIO transfers this is 
pretty simple, it's the point at which MMIO transfers are generated. So the 
guest gets a proxy device that intercepts accesses to that memory region, and 
the host provides some way for qemu to poke values at the host device.

> > Once you've done all the above, host device passthrough should be
> > relatively straightforward.  Just replace the emulation bits in the
> > above device with code that pokes at a real device via the relevant
> > kernel API.
> 
> That's not what we mean by direct device assignment.

Maybe, but IMO it's a necessary prerequisite. You're trying to run before 
you can walk.

> We're talking about directly mapping the registers into the guest.  The
> whole point is performance.

That's an additional step after you get passthrough working the normal way.
We already have mechanisms (or at least patches) for mapping file-like objects 
into guest physical memory.  That's largely independent of device passthrough.  
It's a relatively minor tweak to how the passthrough device sets up its MMIO 
regions.
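
A minimal sketch of the "map a file-like object" idea, in Python purely for illustration (QEMU itself is C). A scratch file stands in for /dev/mem, and the helper names (map_region, read_reg32, write_reg32), the register offset, and the big-endian register layout are all assumptions made for the example; none of this comes from the patches mentioned above.

```python
import mmap
import os
import struct
import tempfile

# mmap offsets must be a multiple of the allocation granularity.
PAGE = mmap.ALLOCATIONGRANULARITY


def map_region(path, file_offset, size):
    """Map `size` bytes of `path` starting at the aligned `file_offset`."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, size, offset=file_offset)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


def read_reg32(region, reg_offset):
    """Read a 32-bit big-endian value at `reg_offset` inside the mapping."""
    return struct.unpack_from(">I", region, reg_offset)[0]


def write_reg32(region, reg_offset, value):
    """Write a 32-bit big-endian value at `reg_offset` inside the mapping."""
    struct.pack_into(">I", region, reg_offset, value)


# Stand-in for a device file: a zero-filled scratch file, two units long.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(bytes(2 * PAGE))
    backing = scratch.name

region = map_region(backing, PAGE, PAGE)   # map the second unit only
write_reg32(region, 0x10, 0xDEADBEEF)      # hypothetical register offset
value = read_reg32(region, 0x10)
region.close()
os.unlink(backing)
```

Once such a mapping exists, every access is a plain load or store with no trap in between, which is the property the direct-assignment side of this thread cares about.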

Mapping host device MMIO regions into guest space is entirely uninteresting 
unless we already have some way of creating guest-host passthrough devices.  
Creating guest-device passthrough devices isn't going to happen until we can 
create arbitrary devices (within the set emulated by qemu) that interact with 
the rest of the emulated machine in a similar way.

Paul



Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Scott Wood
On Fri, 1 Jul 2011 12:16:35 +0100
Paul Brook  wrote:

> > One feature we need for QEMU/KVM on embedded Power Architecture is the
> > ability to do passthru assignment of SoC I/O devices and memory.  An
> > important use case in embedded is creating static partitions--
> > taking physical memory and I/O devices (non-PCI) and partitioning
> > them between the host Linux and several virtual machines.   Things like
> > live migration would not be needed or supported in these types of
> > scenarios.
> > 
> > SoC devices do not sit on a probeable bus and there are no identifiers
> > like 01:00.0 with PCI that we can use to identify devices--  the host
> > Linux kernel is made aware of SoC I/O devices from nodes/properties in a
> > device tree structure passed at boot.   QEMU needs to generate a
> > device tree to pass to the guest as well with all the guest's virtual
> > and physical resources.  Today a number of mostly complete guest device
> > trees are kept under ./pc-bios in QEMU, but this is too static and
> > inflexible.
> 
> I doubt you're going to get generic passthrough of arbitrary devices working 
> in a useful way.

It's usefully working for us internally -- we're just trying to find a way
to improve it for upstream, with a better configuration mechanism.

> My expectation is that, at minimum, you'll need a bus 
> specific proxy device. i.e. create a virtual device in qemu that responds to 
> the guest, and happens to poke at a host device rather than emulating things 
> directly.

Many of these embedded devices don't sit on any sort of software-visible
bus, and requiring that the I/O happen via MMIO traps would result in
unacceptable overhead.

> Basically you have to emulate  everything that is different between the host 
> and guest.

Directly assigning a device means you don't get to have differences between
the actual hardware device and what the guest sees.  The kind of thin
wrapper you're suggesting might have some use cases, but it's a different
problem from what we're trying to solve.

-Scott




Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Paul Brook
> > irq[0].guest_irq = "10"
> > 
> > This should be independent of anything to do with device tree.  This
> > would be useful for x86 too to assign platform devices (like the HPET).
> 
> That's fine, as long as there's something layered on top of it for the case
> where we do want to reference something in the device tree.
> 
> However, we'll need to address the question of what it means to say "irq
> 10" -- outside of PC-land there often isn't a global IRQ numberspace that
> isn't a fiction created by some software layer.  Addressing this is one of
> the device tree's strengths.

That's an entirely separate problem, though probably a prerequisite.

Basically you should start by implementing full emulation of a device with 
similar characteristics to the one you want to passthrough.

Then fix whatever is needed to allow the user to control instantiation of those 
devices. This almost certainly means using the -device commandline option.  
This currently only works for a fairly simple subset of devices (approximately 
PCI and USB), so you'll probably need to fix/implement the missing bits.  To 
do this you'll probably need to do some work on the various bits of the qdev 
relating to linking devices together.  See recent discussion about sockets in 
the "basic support for composing sysbus devices" thread.

To expose this to the guest you'll probably also need to implement some form 
of dynamic device tree assembly/manipulation.  Not strictly necessary (we can 
require the user supply a complete device tree that matches whatever devices 
they've configured), but probably highly desirable.

Once you've done all the above, host device passthrough should be relatively 
straightforward.  Just replace the emulation bits in the above device with 
code that pokes at a real device via the relevant kernel API.

Paul



Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Scott Wood
On Fri, 1 Jul 2011 18:03:01 +0100
Paul Brook  wrote:

> Basically you should start by implementing full emulation of a device with 
> similar characteristics to the one you want to passthrough.

That's not going to happen.

> Once you've done all the above, host device passthrough should be relatively 
> straightforward.  Just replace the emulation bits in the above device with 
> code that pokes at a real device via the relevant kernel API.

That's not what we mean by direct device assignment.

We're talking about directly mapping the registers into the guest.  The
whole point is performance.

-Scott




Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Scott Wood
On Fri, 1 Jul 2011 07:10:45 -0500
Anthony Liguori  wrote:

> I agree in principle but I think it should be done in a slightly 
> different way.
> 
> I think we ought to support composing a device by passthrough.  For 
> instance, something like:
> 
> [physical-device "mydev"]
> region[0].file = "/dev/mem"
> region[0].guest_address = "0x42232000"
> region[0].file_offset = "0x23423400"
> region[0].size = "4096"
> irq[0].guest_irq = "10"
> irq[0].host_irq = "10"
> 
> This should be independent of anything to do with device tree.  This 
> would be useful for x86 too to assign platform devices (like the HPET).

That's fine, as long as there's something layered on top of it for the case
where we do want to reference something in the device tree.  

However, we'll need to address the question of what it means to say "irq 10"
-- outside of PC-land there often isn't a global IRQ numberspace that isn't
a fiction created by some software layer.  Addressing this is one of the
device tree's strengths.

-Scott




Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Scott Wood
On Fri, 1 Jul 2011 10:58:14 +1000
Benjamin Herrenschmidt  wrote:

> So, from a qemu command line perspective, all you should have to do is
> pass qemu the device-tree -path- to the device you want to pass-through
> (you may support passing a full hierarchy here).
> 
> That is for normal MMIO mapped SoC devices. Something else (individual
> i2c, usb, ...) will use specific virtualization of the corresponding
> busses.
> 
> Anything else sucks too much really.
> 
> From there, well, there are several approaches inside qemu/kvm to handle
> that path. If you want to do things at the qemu level you can probably
> parse /proc/device-tree.

That's what option 1 is, except that instead of adding code to qemu to
parse /proc/device-tree, we'd use dtc to dump /proc/device-tree into a dtb
and let qemu use libfdt to look at the tree.  This is less Linux-specific,
more modular, and more flexible for doing the sort of insane hacks that are
going to happen in embedded-land whether you like them or not. :-)
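
A rough sketch of assignment by device tree path, in Python for brevity. The tree is modelled as nested dicts rather than a real dtb read through libfdt, and the helper names (find_node, assign_by_path), the "simple-bus" string and the reg values are illustrative assumptions. Only the path /soc/i2c@3000, the "fsl-i2c" compatible string and the interrupts value come from the examples in this thread.

```python
import copy

# Hypothetical host tree: "props" holds properties, "children" holds subnodes.
HOST_TREE = {
    "props": {},
    "children": {
        "soc": {
            "props": {"compatible": "simple-bus"},  # assumed value
            "children": {
                "i2c@3000": {
                    "props": {
                        "compatible": "fsl-i2c",
                        "reg": [0x3000, 0x100],     # assumed value
                        "interrupts": [43, 2],
                    },
                    "children": {},
                },
            },
        },
    },
}


def find_node(tree, path):
    """Walk a /-separated path; raises KeyError if a component is missing."""
    node = tree
    for name in filter(None, path.split("/")):
        node = node["children"][name]
    return node


def assign_by_path(host_tree, guest_tree, path):
    """Copy the host node at `path` into the guest tree at the same path."""
    source = find_node(host_tree, path)
    parts = [p for p in path.split("/") if p]
    parent = guest_tree
    for name in parts[:-1]:
        # Create empty intermediate nodes in the guest tree as needed.
        parent = parent["children"].setdefault(
            name, {"props": {}, "children": {}})
    parent["children"][parts[-1]] = copy.deepcopy(source)
    return parent["children"][parts[-1]]


guest_tree = {"props": {}, "children": {}}
guest_i2c = assign_by_path(HOST_TREE, guest_tree, "/soc/i2c@3000")
```

The deep copy matters: the guest node can then be edited without touching the host's view of the same device.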

> But I'd personally just make it a kernel thing.

I'd rather keep the kernel interface simple -- assign this memory region,
assign that interrupt, use this IOMMU device ID, etc.  Getting the kernel
involved in preparing the guest device tree, and understanding guest
configuration, seems quite excessive.

> IE. I would have an ioctl to "instantiate" a pass-through device, that
> takes that path as an argument. I would make it return an anonymous fd
> which you can then use to mmap the resources, etc...
> 
> > In some cases, modifications to device tree nodes may be needed.
> > An example-- sometimes a device tree property references another node 
> > and that relationship may not exist when assigned to a guest.
> > A "phy-handle" property may need to be deleted and a "fixed-link"
> > property added to a node representing a network device.
> 
> That's fishy. Why wouldn't you give full access to the MDIO ? It's
> shared ? 

Yes, it's shared.  Yes, it sucks.

> Such things are so device-specific that they would have to be
> handled by device-specific quirks, which can live either in qemu or in
> the kernel.

Or in the configuration of qemu.  Not all users of the device want to do
the same thing.

> > So in addition to assigning a device, a mechanism is needed to update 
> > device tree nodes.  So for the above example, maybe--
> > 
> >  -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
> >   node-update="fixed-link = <2 1 1000 0 0>"
> 
> That's just so gross and error prone, borderline insane.

Welcome to embedded. :-)

Here, users are going to want to be able to mess around under the hood in
a way that server or desktop users generally don't need or want to.
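
To make the proposed delete-prop / node-update options concrete, here is a small Python sketch of that kind of node edit. The function name update_node and the host property values are made up for illustration; only the property names phy-handle and fixed-link and the cell values <2 1 1000 0 0> come from the example quoted above.

```python
def update_node(props, delete_props=(), set_props=None):
    """Return a copy of `props` with some properties removed and others set."""
    updated = dict(props)
    for name in delete_props:
        updated.pop(name, None)       # deleting an absent property is a no-op
    updated.update(set_props or {})
    return updated


# Hypothetical host view of the ethernet node's properties.
host_enet = {
    "compatible": "example,ethernet",  # placeholder value
    "phy-handle": "&phy0",             # placeholder reference to a PHY node
}

# Guest view: the PHY is not assigned, so describe a fixed link instead.
guest_enet = update_node(
    host_enet,
    delete_props=["phy-handle"],
    set_props={"fixed-link": [2, 1, 1000, 0, 0]},
)
```

Whether such edits belong on the command line, in a config file, or in a device-specific quirk is exactly the open question in this exchange; the sketch only shows the transformation itself.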

> > The types of modifications needed--  deleting nodes, deleting properties, 
> > adding nodes, adding properties, adding properties that reference other
> > nodes, changing properties. This device tree transformation mechanism
> > needed is general enough that it could apply to any device tree based
> > embedded platform (e.g. ARM, MIPS)
> >
> > Another complexity relates to the IOMMU.  Here things get very company 
> > and IOMMU specific. Freescale has a proprietary IOMMU.
> 
> Look at the work currently being done for a generic qemu iommu layer. We
> need it for server power as well and from what I last saw coming from
> Eduardo and David, it's not PCI specific.

The problem is that our current IOMMU doesn't implement full paging (yes,
the HW people have been screamed at, but we're stuck with it for current
chips).  You have to break things down into regions following certain
alignment rules, which may require user guidance as to which memory regions
actually need DMA access, especially if you're setting up discontiguous
shared memory regions and such.
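
As an illustration of why alignment rules force a region to be broken up, here is a Python sketch that splits a range into windows. It assumes each window must be a power of two in size and aligned to its own size; that rule is an assumption for the example, since the actual constraints are not spelled out in this message. The function name is likewise hypothetical.

```python
def split_into_windows(start, size):
    """Split [start, start + size) into power-of-two, self-aligned windows."""
    windows = []
    addr, remaining = start, size
    while remaining > 0:
        # Largest power of two that fits in what is left.
        fit = 1 << (remaining.bit_length() - 1)
        # Largest power of two that `addr` is aligned to (no limit at 0).
        align = addr & -addr if addr else fit
        chunk = min(align, fit)
        windows.append((addr, chunk))
        addr += chunk
        remaining -= chunk
    return windows


# A 12 KiB range at 0x1000 cannot be covered by a single aligned window.
windows = split_into_windows(0x1000, 0x3000)
```

Under this rule the range above needs two windows, 4 KiB at 0x1000 and 8 KiB at 0x2000. With several discontiguous shared regions the window count grows quickly, which is why user guidance on what actually needs DMA access can matter.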

-Scott




Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Anthony Liguori

On 07/01/2011 07:52 AM, Paul Brook wrote:

>>> So, from a qemu command line perspective, all you should have to do is
>>> pass qemu the device-tree -path- to the device you want to pass-through
>>> (you may support passing a full hierarchy here).
>>
>> I agree in principle but I think it should be done in a slightly
>> different way.
>>
>> I think we ought to support composing a device by passthrough.  For
>> instance, something like:
>>
>> [physical-device "mydev"]
>> region[0].file = "/dev/mem"
>> region[0].guest_address = "0x42232000"
>> region[0].file_offset = "0x23423400"
>> region[0].size = "4096"
>> irq[0].guest_irq = "10"
>> irq[0].host_irq = "10"
>>
>> This should be independent of anything to do with device tree.  This
>> would be useful for x86 too to assign platform devices (like the HPET).
>
> I'm not quite sure what you're getting at here.  IMO there should be little or
> no need for special knowledge of passthrough devices.  They should just be
> another qdev device, configured in the normal way.  e.g.:
> -device sysbus-host,hostdev=whatever,normal_mmio_and_irq_config


What I wrote about is just readconfig syntax.  It's the same as:

-device physical-device,id=mydev,region[0].file=/dev/mem,
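
A small Python sketch of that equivalence: one set of properties rendered as a -device option string. The physical-device driver and its indexed properties are only a proposal in this thread, not an existing QEMU device, and the helper name device_option is made up for the example.

```python
def device_option(driver, dev_id, props):
    """Render a driver, an id and a property dict as one -device option."""
    parts = [driver, f"id={dev_id}"]
    parts += [f"{key}={value}" for key, value in props.items()]
    return "-device " + ",".join(parts)


# Properties taken from the proposed [physical-device "mydev"] section.
props = {
    "region[0].file": "/dev/mem",
    "region[0].guest_address": "0x42232000",
    "irq[0].guest_irq": "10",
}
option = device_option("physical-device", "mydev", props)
```

The section name supplies the driver, the quoted string supplies the id, and each key/value line becomes one comma-separated property.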

Regards,

Anthony Liguori


>> I think there should be a separate mechanism to manipulate the guest
>> device tree, just like there are mechanisms to manipulate the guest's
>> ACPI tables.
>
> I agree.  Any sort of device tree (IIUC ACPI tables are in principle giving
> the same information) is, in practice, going to need to be assembled at
> runtime.  This needs some mechanism for devices to describe themselves,
> probably largely independent of actual machine/device creation code.
>
> We've got away without it thus far because the only real place where we have
> nontrivial user-specified machine variants is on the PCI bus.  Devices there
> are for the most part self-describing so the guest firmware/OS can probe
> hardware itself.
>
> Paul






Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Paul Brook
> > So, from a qemu command line perspective, all you should have to do is
> > pass qemu the device-tree -path- to the device you want to pass-through
> > (you may support passing a full hierarchy here).
> 
> I agree in principle but I think it should be done in a slightly
> different way.
> 
> I think we ought to support composing a device by passthrough.  For
> instance, something like:
> 
> [physical-device "mydev"]
> region[0].file = "/dev/mem"
> region[0].guest_address = "0x42232000"
> region[0].file_offset = "0x23423400"
> region[0].size = "4096"
> irq[0].guest_irq = "10"
> irq[0].host_irq = "10"
> 
> This should be independent of anything to do with device tree.  This
> would be useful for x86 too to assign platform devices (like the HPET).

I'm not quite sure what you're getting at here.  IMO there should be little or 
no need for special knowledge of passthrough devices.  They should just be 
another qdev device, configured in the normal way.  e.g.:
   -device sysbus-host,hostdev=whatever,normal_mmio_and_irq_config
Should work the same as adding any other device. If it doesn't then we should 
fix that.  This is an example of why it's good to have device features (IRQs, 
MMIO regions, sockets, or whatever we call them) registered when the device is 
instantiated, not relying on pre-compiled device descriptors/property lists.  
In the latter case you probably need explicit variants for different numbers of 
IRQs, MMIO regions, etc.

While I'm thinking about it, we already have exactly this for USB (i.e. the 
usb-host device).

> I think there should be a separate mechanism to manipulate the guest
> device tree, just like there are mechanisms to manipulate the guest's
> ACPI tables.

I agree.  Any sort of device tree (IIUC ACPI tables are in principle giving 
the same information) is, in practice, going to need to be assembled at 
runtime.  This needs some mechanism for devices to describe themselves, 
probably largely independent of actual machine/device creation code.

We've got away without it thus far because the only real place where we have 
nontrivial user-specified machine variants is on the PCI bus.  Devices there 
are for the most part self-describing so the guest firmware/OS can probe 
hardware itself.

Paul



Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Anthony Liguori

On 07/01/2011 07:02 AM, Alexander Graf wrote:
>
> On 01.07.2011, at 13:55, Paul Brook wrote:
>
>>> But the real challenge is how to expose the device to the guest device
>>> tree. Especially when it comes to links between dt nodes, interrupt maps,
>>> etc. We basically have 3 choices there:
>>>
>>>   * take the host device tree pieces and modify them
>>>   * provide device tree chunks for each device (manually or through qdev
>>>     parameters)
>>>   * use the device tree as machine config file and base
>>>     everything on it (solves the linking problem)
>>>
>>> The main question is which one would be the cleanest solution. And how
>>> would it be implemented.
>>
>> I don't think any of this is specific to device passthrough.  It occurs as
>> soon as you have any user-configurable parts of the machine (or even just a
>> nontrivial selection of machine variants).  My guess is the only reason you
>> haven't hit it before is because you've only emulated a single hard-coded
>> SoC/board.
>
> Well, the real reason we haven't hit this before is that we don't have any
> devices in Qemu that are generic. We only have specific device emulation. This
> however would be a device that can handle hundreds of different backing
> devices, all with different requirements.
>
> The infrastructure we have today simply isn't made for this. The question is
> how can we model it so that it will? :)


Our infrastructure is quite capable of handling this.  It has many other 
problems but I think the only thing really missing is the way to have 
lists of parameters.  That seems easy to solve though.


Regards,

Anthony Liguori




Alex






Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Anthony Liguori

On 06/30/2011 07:58 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote:
>>    This avoids needing to pass the host device tree, but could
>>    get awkward-- the i2c example above is very simple, some device
>>    nodes are very large with a complex hierarchy of subnodes and
>>    could be hundreds of lines of text to represent a single
>>    node.
>>
>> It gets more complicated...
>
> So, from a qemu command line perspective, all you should have to do is
> pass qemu the device-tree -path- to the device you want to pass-through
> (you may support passing a full hierarchy here).


I agree in principle but I think it should be done in a slightly 
different way.


I think we ought to support composing a device by passthrough.  For 
instance, something like:


[physical-device "mydev"]
region[0].file = "/dev/mem"
region[0].guest_address = "0x42232000"
region[0].file_offset = "0x23423400"
region[0].size = "4096"
irq[0].guest_irq = "10"
irq[0].host_irq = "10"

This should be independent of anything to do with device tree.  This 
would be useful for x86 too to assign platform devices (like the HPET).
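
To show what such a section would carry once parsed, here is a Python sketch that reads the fragment above into per-index region and irq records. The parser is a stand-in written for this example (QEMU's own config reader is not used), and the function name parse_physical_device is hypothetical.

```python
import re

CONFIG = '''\
[physical-device "mydev"]
region[0].file = "/dev/mem"
region[0].guest_address = "0x42232000"
region[0].file_offset = "0x23423400"
region[0].size = "4096"
irq[0].guest_irq = "10"
irq[0].host_irq = "10"
'''

SECTION = re.compile(r'^\[(\S+)\s+"([^"]*)"\]$')
ENTRY = re.compile(r'^(\w+)\[(\d+)\]\.(\w+)\s*=\s*"([^"]*)"$')


def parse_physical_device(text):
    """Parse one section into {'driver', 'id', 'lists': {kind: {index: {...}}}}."""
    device = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        section = SECTION.match(line)
        if section:
            device = {"driver": section.group(1), "id": section.group(2),
                      "lists": {}}
            continue
        entry = ENTRY.match(line)
        if entry is None or device is None:
            raise ValueError(f"unrecognised line: {raw!r}")
        kind, index, field, value = entry.groups()
        try:
            value = int(value, 0)     # accepts both "4096" and "0x42232000"
        except ValueError:
            pass                      # keep non-numeric values such as paths
        device["lists"].setdefault(kind, {}).setdefault(
            int(index), {})[field] = value
    return device


dev = parse_physical_device(CONFIG)
region0 = dev["lists"]["region"][0]
irq0 = dev["lists"]["irq"][0]
```

Each region record ends up with a backing file, a file offset, a guest address and a size, and each irq record with a host/guest number pair. Nothing in it refers to a device tree.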


I think there should be a separate mechanism to manipulate the guest 
device tree, just like there are mechanisms to manipulate the guest's 
ACPI tables.


Given these two mechanisms, there should be a simple command line like 
Ben has suggested that just takes a host device tree path and Just 
Works.  It really is just a convenience interface though.


With raw mechanisms like I described above, it would give you the 
flexibility to pass through a device with a modified host tree fragment 
without having an overly complicated command line interface for the more 
common case.


Regards,

Anthony Liguori



Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Anthony Liguori

On 07/01/2011 06:40 AM, Alexander Graf wrote:
>
> On 01.07.2011, at 02:58, Benjamin Herrenschmidt wrote:
>
>> On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote:
>>> One feature we need for QEMU/KVM on embedded Power Architecture is the
>>> ability to do passthru assignment of SoC I/O devices and memory.  An
>>> important use case in embedded is creating static partitions--
>>> taking physical memory and I/O devices (non-PCI) and partitioning
>>> them between the host Linux and several virtual machines.   Things like
>>> live migration would not be needed or supported in these types of scenarios.
>>>
>>> SoC devices do not sit on a probeable bus and there are no identifiers
>>> like 01:00.0 with PCI that we can use to identify devices--  the host
>>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
>>> device tree structure passed at boot.   QEMU needs to generate a
>>> device tree to pass to the guest as well with all the guest's virtual
>>> and physical resources.  Today a number of mostly complete guest device
>>> trees are kept under ./pc-bios in QEMU, but this is too static and
>>> inflexible.
>>>
>>> Some new mechanism is needed to assign SoC devices to guests, and we
>>> (FSL + Alex Graf) have been discussing a few possible approaches
>>> for doing this from QEMU and would like some feedback.
>>>
>>> Some possibilities:
>>>
>>> 1. Option 1.  Pass the host dev tree to QEMU and assign devices
>>>    by device tree path
>>>
>>>  -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
>>>
>>>    /soc/i2c@3000 is the device tree path to the assigned device.
>>>    The device node 'i2c@3000' has some number of properties (e.g.
>>>    address, interrupt info) and possibly subnodes under
>>>    it.   QEMU copies that node when generating the guest dev tree.
>>>    See snippet of entire node:  http://paste2.org/p/1496460
>>
>> Yuck (see below)
>>
>>> 2. Option 2.  Pass the entire assigned device node as a string to
>>>    QEMU
>>>
>>>  -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells =<1>;
>>>   #size-cells =<0>; cell-index =<0>; compatible = "fsl-i2c";
>>>   reg =<0xffe03000 0x100>; interrupts =<43 2>;
>>>   interrupt-parent =<&mpic>; dfsrr;'
>>
>> Beuark ! (see below)
>>
>>>    This avoids needing to pass the host device tree, but could
>>>    get awkward-- the i2c example above is very simple, some device
>>>    nodes are very large with a complex hierarchy of subnodes and
>>>    could be hundreds of lines of text to represent a single
>>>    node.
>>>
>>> It gets more complicated...
>>
>> So, from a qemu command line perspective, all you should have to do is
>> pass qemu the device-tree -path- to the device you want to pass-through
>> (you may support passing a full hierarchy here).
>>
>> That is for normal MMIO mapped SoC devices. Something else (individual
>> i2c, usb, ...) will use specific virtualization of the corresponding
>> busses.
>>
>> Anything else sucks too much really.
>>
>> From there, well, there are several approaches inside qemu/kvm to handle
>> that path. If you want to do things at the qemu level you can probably
>> parse /proc/device-tree. But I'd personally just make it a kernel thing.
>>
>> IE. I would have an ioctl to "instantiate" a pass-through device, that
>> takes that path as an argument. I would make it return an anonymous fd
>> which you can then use to mmap the resources, etc...
>
> Yeah, one idea was to use VFIO here. We could for example modify the host
> device tree to occupy the device we want to pass through with a specific
> compatibility parameter. Or we could try to steal the node during runtime. But
> I agree, reading the device tree data from a VFIO node sounds reasonable. If
> it's required.


That makes it very specific to systems that use device trees.

To do the same for ARM platforms or x86, you would need to invent yet 
another mechanism.


Passing through arbitrary MMIO is fairly straightforward (likewise with 
PIO).  Passing through IRQs is a bit less straightforward and perhaps 
VFIO is the answer here.


I don't see a problem with QEMU figuring out what a device's resources 
are and doing the assignment.


Regards,

Anthony Liguori



Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Paul Brook

> But the real challenge is how to expose the device to the guest device
> tree. Especially when it comes to links between dt nodes, interrupt maps,
> etc. We basically have 3 choices there:
> 
>   * take the host device tree pieces and modify them
>   * provide device tree chunks for each device (manually or through qdev
>     parameters)
>   * use the device tree as machine config file and base
>     everything on it (solves the linking problem)
> 
> The main question is which one would be the cleanest solution. And how
> would it be implemented.

I don't think any of this is specific to device passthrough.  It occurs as 
soon as you have any user-configurable parts of the machine (or even just a 
nontrivial selection of machine variants).  My guess is the only reason you 
haven't hit it before is because you've only emulated a single hard-coded 
SoC/board.

Paul



Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Alexander Graf

On 01.07.2011, at 13:55, Paul Brook wrote:

> 
>> But the real challenge is how to expose the device to the guest device
>> tree. Especially when it comes to links between dt nodes, interrupt maps,
>> etc. We basically have 3 choices there:
>> 
>>  * take the host device tree pieces and modify them
>>  * provide device tree chunks for each device (manually or through qdev
>>    parameters)
>>  * use the device tree as machine config file and base
>>    everything on it (solves the linking problem)
>> 
>> The main question is which one would be the cleanest solution. And how
>> would it be implemented.
> 
> I don't think any of this is specific to device passthrough.  It occurs as 
> soon as you have any user-configurable parts of the machine (or even just a 
> nontrivial selection of machine variants).  My guess is the only reason you 
> haven't hit it before is because you've only emulated a single hard-coded 
> SoC/board.

Well, the real reason we haven't hit this before is that we don't have any 
devices in Qemu that are generic. We only have specific device emulation. This 
however would be a device that can handle hundreds of different backing 
devices, all with different requirements.

The infrastructure we have today simply isn't made for this. The question is 
how can we model it so that it will? :)


Alex




Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Alexander Graf

On 01.07.2011, at 02:58, Benjamin Herrenschmidt wrote:

> On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote:
>> One feature we need for QEMU/KVM on embedded Power Architecture is the 
>> ability to do passthru assignment of SoC I/O devices and memory.  An 
>> important use case in embedded is creating static partitions-- 
>> taking physical memory and I/O devices (non-PCI) and partitioning
>> them between the host Linux and several virtual machines.   Things like
>> live migration would not be needed or supported in these types of scenarios.
>> 
>> SoC devices do not sit on a probeable bus and there are no identifiers 
>> like 01:00.0 with PCI that we can use to identify devices--  the host
>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a 
>> device tree structure passed at boot.   QEMU needs to generate a
>> device tree to pass to the guest as well with all the guest's virtual
>> and physical resources.  Today a number of mostly complete guest device
>> trees are kept under ./pc-bios in QEMU, but this is too static and
>> inflexible.
>> 
>> Some new mechanism is needed to assign SoC devices to guests, and we
>> (FSL + Alex Graf) have been discussing a few possible approaches
>> for doing this from QEMU and would like some feedback.
>> 
>> Some possibilities:
>> 
>> 1. Option 1.  Pass the host dev tree to QEMU and assign devices
>>   by device tree path
>> 
>> -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
>> 
>>   /soc/i2c@3000 is the device tree path to the assigned device.
>>   The device node 'i2c@3000' has some number of properties (e.g. 
>>   address, interrupt info) and possibly subnodes under
>>   it.   QEMU copies that node when generating the guest dev tree.
>>   See snippet of entire node:  http://paste2.org/p/1496460
> 
> Yuck (see below)
> 
>> 2. Option 2.  Pass the entire assigned device node as a string to
>>   QEMU
>> 
>> -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
>>  #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
>>  reg = <0xffe03000 0x100>; interrupts = <43 2>;
>>  interrupt-parent = <&mpic>; dfsrr;'
> 
> Beuark ! (see below)
> 
>>   This avoids needing to pass the host device tree, but could 
>>   get awkward-- the i2c example above is very simple, some device
>>   nodes are very large with a complex hierarchy of subnodes and 
>>   could be hundreds of lines of text to represent a single
>>   node.
>> 
>> It gets more complicated...
> 
> 
> So, from a qemu command line perspective, all you should have to do is
> pass qemu the device-tree -path- to the device you want to pass-through
> (you may support passing a full hierarchy here).
> 
> That is for normal MMIO mapped SoC devices. Something else (individual
> i2c, usb, ...) will use specific virtualization of the corresponding
> busses.
> 
> Anything else sucks too much really.
> 
> From there, well, there are several approaches inside qemu/kvm to handle
> that path. If you want to do things at the qemu level you can probably
> parse /proc/device-tree. But I'd personally just make it a kernel thing.
> 
> IE. I would have an ioctl to "instantiate" a pass-through device, that
> takes that path as an argument. I would make it return an anonymous fd
> which you can then use to mmap the resources, etc...

Yeah, one idea was to use VFIO here. We could for example modify the host 
device tree to occupy the device we want to pass through with a specific 
compatibility parameter. Or we could try to steal the node during runtime. But 
I agree, reading the device tree data from a VFIO node sounds reasonable. If 
it's required.

> 
>> In some cases, modifications to device tree nodes may be needed.
>> An example-- sometimes a device tree property references another node 
>> and that relationship may not exist when assigned to a guest.
>> A "phy-handle" property may need to be deleted and a "fixed-link"
>> property added to a node representing a network device.
> 
> That's fishy. Why wouldn't you give full access to the MDIO ? It's
> shared ? Such things are so device-specific that they would have to be
> handled by device-specific quirks, which can live either in qemu or in
> the kernel.

Hrm, so you'd create a separate device for MDIO which can do pass-through of 
those?

> 
>> So in addition to assigning a device, a mechanism is needed to update 
>> device tree nodes.  So for the above example, maybe--
>> 
>> -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
>>  node-update="fixed-link = <2 1 1000 0 0>"
> 
> That's just so gross and error prone, borderline insane.

Alternatives:

  * not modify the device tree (unlikely to work)
  * pass a full device tree chunk to qemu instead of modification commands
  * ?

> 
>> The types of modifications needed--  deleting nodes, deleting properties, 
>> adding nodes, adding properties, adding properties that reference other
>> nodes, changing properties. This device tree transforma

Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Alexander Graf

On 01.07.2011, at 13:16, Paul Brook wrote:

>> One feature we need for QEMU/KVM on embedded Power Architecture is the
>> ability to do passthru assignment of SoC I/O devices and memory.  An
>> important use case in embedded is creating static partitions--
>> taking physical memory and I/O devices (non-PCI) and partitioning
>> them between the host Linux and several virtual machines.   Things like
>> live migration would not be needed or supported in these types of
>> scenarios.
>> 
>> SoC devices do not sit on a probeable bus and there are no identifiers
>> like 01:00.0 with PCI that we can use to identify devices--  the host
>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
>> device tree structure passed at boot.   QEMU needs to generate a
>> device tree to pass to the guest as well with all the guest's virtual
>> and physical resources.  Today a number of mostly complete guest device
>> trees are kept under ./pc-bios in QEMU, but this is too static and
>> inflexible.
> 
> I doubt you're going to get generic passthrough of arbitrary devices working
> in a useful way. My expectation is that, at minimum, you'll need a bus
> specific proxy device. i.e. create a virtual device in qemu that responds to
> the guest, and happens to poke at a host device rather than emulating things
> directly.
> 
> For busses like I2C this is fairly trivial - all communication with the
> device goes down a single well defined and easily proxied channel.  For more
> complex busses you end up having to emulate a lot more.  Basically you have
> to emulate everything that is different between the host and guest.  If that
> happens to include device specific state then you lose.
> 
> Using PCI devices as an example: The resources provided by the device are
> self-describing, so proxying those is fairly straightforward, and doesn't
> even require manual configuration.  However replicating the environment seen
> by the device is trickier as PCI devices can initiate memory accesses (i.e.
> bus-master).  For machines without an IOMMU this means passthrough in general
> can't work, and substantial amounts of device specific knowledge is required.
> You'd need to intercept and modify and/or proxy all data relating to DMA
> addresses.  In practice you need to emulate an IOMMU inside qemu (so you can
> determine the address space accessed by the device), and arrange for the host
> IOMMU to present the same virtual address space to the real device.

Well, for DMA the solution is reasonably simple. We have basically two choices:

  * run 1:1 mapped, so the guest physical address == host physical address, at 
which point DMA works, but everything is insecure
  * use an IOMMU

We can easily limit it to those two cases. The more challenging part here (and 
the main reason for the email) is the question of how to configure all of that 
in a flexible, yet simple way. We can find the IO regions for devices from the 
host device tree - no problem there.

But the real challenge is how to expose the device to the guest device tree. 
Especially when it comes to links between dt nodes, interrupt maps, etc. We 
basically have 3 choices there:

  * take the host device tree pieces and modify them
  * provide device tree chunks for each device (manually or through qdev 
parameters)
  * use the device tree as machine config file and base everything on it 
(solves the linking problem)

The main question is which one would be the cleanest solution, and how it 
would be implemented.


Alex




Re: [Qemu-devel] device assignment for embedded Power

2011-07-01 Thread Paul Brook
> One feature we need for QEMU/KVM on embedded Power Architecture is the
> ability to do passthru assignment of SoC I/O devices and memory.  An
> important use case in embedded is creating static partitions--
> taking physical memory and I/O devices (non-PCI) and partitioning
> them between the host Linux and several virtual machines.   Things like
> live migration would not be needed or supported in these types of
> scenarios.
> 
> SoC devices do not sit on a probeable bus and there are no identifiers
> like 01:00.0 with PCI that we can use to identify devices--  the host
> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
> device tree structure passed at boot.   QEMU needs to generate a
> device tree to pass to the guest as well with all the guest's virtual
> and physical resources.  Today a number of mostly complete guest device
> trees are kept under ./pc-bios in QEMU, but this is too static and
> inflexible.

I doubt you're going to get generic passthrough of arbitrary devices working 
in a useful way. My expectation is that, at minimum, you'll need a bus 
specific proxy device. i.e. create a virtual device in qemu that responds to 
the guest, and happens to poke at a host device rather than emulating things 
directly.

For busses like I2C this is fairly trivial - all communication with the device 
goes down a single well defined and easily proxied channel.  For more complex 
busses you end up having to emulate a lot more.  Basically you have to emulate 
everything that is different between the host and guest.  If that happens to 
include device specific state then you lose.

Using PCI devices as an example: The resources provided by the device are 
self-describing, so proxying those is fairly straightforward, and doesn't even 
require manual configuration.  However replicating the environment seen by the 
device is trickier as PCI devices can initiate memory accesses (i.e. 
bus-master).  For machines without an IOMMU this means passthrough in general 
can't work, and substantial amounts of device specific knowledge is required. 
You'd need to intercept and modify and/or proxy all data relating to DMA 
addresses.  In practice you need to emulate an IOMMU inside qemu (so you can 
determine the address space accessed by the device), and arrange for the host 
IOMMU to present the same virtual address space to the real device.

Paul



Re: [Qemu-devel] device assignment for embedded Power

2011-06-30 Thread Benjamin Herrenschmidt
On Thu, 2011-06-30 at 15:59 +, Yoder Stuart-B08248 wrote:
> One feature we need for QEMU/KVM on embedded Power Architecture is the 
> ability to do passthru assignment of SoC I/O devices and memory.  An 
> important use case in embedded is creating static partitions-- 
> taking physical memory and I/O devices (non-PCI) and partitioning
> them between the host Linux and several virtual machines.   Things like
> live migration would not be needed or supported in these types of scenarios.
> 
> SoC devices do not sit on a probeable bus and there are no identifiers 
> like 01:00.0 with PCI that we can use to identify devices--  the host
> Linux kernel is made aware of SoC I/O devices from nodes/properties in a 
> device tree structure passed at boot.   QEMU needs to generate a
> device tree to pass to the guest as well with all the guest's virtual
> and physical resources.  Today a number of mostly complete guest device
> trees are kept under ./pc-bios in QEMU, but this is too static and
> inflexible.
> 
> Some new mechanism is needed to assign SoC devices to guests, and we
> (FSL + Alex Graf) have been discussing a few possible approaches
> for doing this from QEMU and would like some feedback.
> 
> Some possibilities:
> 
> 1. Option 1.  Pass the host dev tree to QEMU and assign devices
>    by device tree path
>
>  -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
> 
>    /soc/i2c@3000 is the device tree path to the assigned device.
>    The device node 'i2c@3000' has some number of properties (e.g.
>    address, interrupt info) and possibly subnodes under
>    it.   QEMU copies that node when generating the guest dev tree.
>    See snippet of entire node:  http://paste2.org/p/1496460

Yuck (see below)

> 2. Option 2.  Pass the entire assigned device node as a string to
>    QEMU
> 
>  -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
>   #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
>   reg = <0xffe03000 0x100>; interrupts = <43 2>;
>   interrupt-parent = <&mpic>; dfsrr;'

Beuark ! (see below)

>    This avoids needing to pass the host device tree, but could
>    get awkward-- the i2c example above is very simple, some device
>    nodes are very large with a complex hierarchy of subnodes and
>    could be hundreds of lines of text to represent a single
>    node.
> 
> It gets more complicated...


So, from a qemu command line perspective, all you should have to do is
pass qemu the device-tree -path- to the device you want to pass-through
(you may support passing a full hierarchy here).

That is for normal MMIO mapped SoC devices. Something else (individual
i2c, usb, ...) will use specific virtualization of the corresponding
busses.

Anything else sucks too much really.

From there, well, there are several approaches inside qemu/kvm to handle
that path. If you want to do things at the qemu level you can probably
parse /proc/device-tree. But I'd personally just make it a kernel thing.

I.e. I would have an ioctl to "instantiate" a pass-through device, that
takes that path as an argument. I would make it return an anonymous fd
which you can then use to mmap the resources, etc...
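
As a rough illustration of the qemu-level route (parsing /proc/device-tree), here is a minimal Python sketch that decodes the raw bytes of a "reg" property into (address, size) pairs. Device tree cells are 32-bit big-endian values; the function name and the cell counts passed in are assumptions made for the example, and the sample bytes reuse the reg = <0xffe03000 0x100> values from the Option 2 command line. A real implementation would also have to translate addresses through the parent's "ranges" property, which this sketch ignores.

```python
import struct


def decode_reg(raw, address_cells, size_cells):
    """Split a raw 'reg' property into (address, size) pairs.

    Each entry is address_cells 32-bit big-endian cells of address
    followed by size_cells cells of size.
    """
    cells_per_entry = address_cells + size_cells
    entry_bytes = 4 * cells_per_entry
    if cells_per_entry == 0 or len(raw) % entry_bytes != 0:
        raise ValueError("reg length does not match the cell counts")
    cells = struct.unpack(">%dI" % (len(raw) // 4), raw)

    def join(parts):
        # Combine consecutive 32-bit cells into one integer.
        value = 0
        for cell in parts:
            value = (value << 32) | cell
        return value

    regions = []
    for i in range(0, len(cells), cells_per_entry):
        entry = cells[i:i + cells_per_entry]
        regions.append((join(entry[:address_cells]),
                        join(entry[address_cells:])))
    return regions


# Bytes as they would appear in a 'reg' file for reg = <0xffe03000 0x100>,
# assuming one address cell and one size cell.
example = struct.pack(">II", 0xFFE03000, 0x100)
regions = decode_reg(example, address_cells=1, size_cells=1)
```

The resulting list is the set of MMIO windows that would need to be mapped into the guest, whichever side (qemu or kernel) ends up doing the mapping.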

> In some cases, modifications to device tree nodes may be needed.
> An example-- sometimes a device tree property references another node 
> and that relationship may not exist when assigned to a guest.
> A "phy-handle" property may need to be deleted and a "fixed-link"
> property added to a node representing a network device.

That's fishy. Why wouldn't you give full access to the MDIO ? It's
shared ? Such things are so device-specific that they would have to be
handled by device-specific quirks, which can live either in qemu or in
the kernel.

> So in addition to assigning a device, a mechanism is needed to update 
> device tree nodes.  So for the above example, maybe--
> 
>  -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
>   node-update="fixed-link = <2 1 1000 0 0>"

That's just so gross and error prone, borderline insane.

> The types of modifications needed are: deleting nodes, deleting properties,
> adding nodes, adding properties, adding properties that reference other
> nodes, changing properties. The device tree transformation mechanism
> needed is general enough that it could apply to any device tree based
> embedded platform (e.g. ARM, MIPS)
>
> Another complexity relates to the IOMMU.  Here things get very company 
> and IOMMU specific. Freescale has a proprietary IOMMU.

Look at the work currently being done for a generic qemu iommu layer. We
need it for server power as well and from what I last saw coming from
Eduardo and David, it's not PCI specific.

> Devices have 1 or more logical I/O device numbers used to index into 
> the IOMMU table. The IOMMU is limited in that it is designed to only 
> support large, physically contiguous mappings per device.  It does not 
> support any kind of page table.  The IOMMU hardware architec

[Qemu-devel] device assignment for embedded Power

2011-06-30 Thread Yoder Stuart-B08248
One feature we need for QEMU/KVM on embedded Power Architecture is the 
ability to do passthru assignment of SoC I/O devices and memory.  An 
important use case in embedded is creating static partitions-- 
taking physical memory and I/O devices (non-PCI) and partitioning
them between the host Linux and several virtual machines.   Things like
live migration would not be needed or supported in these types of scenarios.

SoC devices do not sit on a probeable bus and there are no identifiers 
like 01:00.0 with PCI that we can use to identify devices--  the host
Linux kernel is made aware of SoC I/O devices from nodes/properties in a 
device tree structure passed at boot.   QEMU needs to generate a
device tree to pass to the guest as well with all the guest's virtual
and physical resources.  Today a number of mostly complete guest device
trees are kept under ./pc-bios in QEMU, but this is too static and
inflexible.

Some new mechanism is needed to assign SoC devices to guests, and we
(FSL + Alex Graf) have been discussing a few possible approaches
for doing this from QEMU and would like some feedback.

Some possibilities:

1. Option 1.  Pass the host dev tree to QEMU and assign devices
   by device tree path

 -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000

   /soc/i2c@3000 is the device tree path to the assigned device.
   The device node 'i2c@3000' has some number of properties (e.g. 
   address, interrupt info) and possibly subnodes under
   it.   QEMU copies that node when generating the guest dev tree.
   See snippet of entire node:  http://paste2.org/p/1496460

2. Option 2.  Pass the entire assigned device node as a string to
   QEMU

 -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
  #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
  reg = <0xffe03000 0x100>; interrupts = <43 2>;
  interrupt-parent = <&mpic>; dfsrr;'

   This avoids needing to pass the host device tree, but could 
   get awkward-- the i2c example above is very simple, some device
   nodes are very large with a complex hierarchy of subnodes and 
   could be hundreds of lines of text to represent a single
   node.

It gets more complicated...

In some cases, modifications to device tree nodes may be needed.
An example-- sometimes a device tree property references another node 
and that relationship may not exist when assigned to a guest.
A "phy-handle" property may need to be deleted and a "fixed-link"
property added to a node representing a network device.

So in addition to assigning a device, a mechanism is needed to update 
device tree nodes.  So for the above example, maybe--

 -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
  node-update="fixed-link = <2 1 1000 0 0>"

The types of modifications needed are: deleting nodes, deleting properties,
adding nodes, adding properties, adding properties that reference other
nodes, changing properties. The device tree transformation mechanism
needed is general enough that it could apply to any device tree based
embedded platform (e.g. ARM, MIPS).
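
To make the transformation idea concrete, here is a small Python sketch that models a device tree as nested dicts and applies the delete-prop / node-update style edits from the ethernet example to a copy of the host node. The dict layout, the function names and the "&phy0" placeholder value are all assumptions made for illustration; this is not how QEMU represents device trees internally.

```python
import copy


def find_node(tree, path):
    """Walk a '/'-separated device tree path through the dict model."""
    node = tree
    for name in [part for part in path.split("/") if part]:
        node = node["children"][name]
    return node


def assign_node(host_tree, path, delete_props=(), set_props=None):
    """Return a modified copy of the node at path; the host tree is untouched."""
    node = copy.deepcopy(find_node(host_tree, path))
    for name in delete_props:
        node["props"].pop(name, None)
    node["props"].update(set_props or {})
    return node


# Hypothetical host tree fragment containing only the ethernet node.
host = {
    "props": {},
    "children": {
        "soc": {
            "props": {},
            "children": {
                "ethernet@b2000": {
                    "props": {"phy-handle": "&phy0"},  # placeholder reference
                    "children": {},
                },
            },
        },
    },
}

guest_node = assign_node(
    host,
    "/soc/ethernet@b2000",
    delete_props=["phy-handle"],
    set_props={"fixed-link": [2, 1, 1000, 0, 0]},
)
```

Working on a deep copy keeps the host view intact, so the same host node could be inspected again later without having been altered by a guest's edits.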

Another complexity relates to the IOMMU.  Here things get very company 
and IOMMU specific. Freescale has a proprietary IOMMU.
Devices have 1 or more logical I/O device numbers used to index into 
the IOMMU table. The IOMMU is limited in that it is designed to only 
support large, physically contiguous mappings per device.  It does not 
support any kind of page table.  The IOMMU hardware architecture 
assumes DMAs are typically targeted to just a few address regions.  
So, a common IOMMU setup for a device would be a device with a single 
IOMMU mapping covering the guest's main memory segment.  However, 
there are many much more complicated IOMMU setups that are common as 
well, such as doing "operation translations" where a device's write 
transaction is translated to "stash" directly into CPU caches.  We 
can't assume that all memory slots belonging to the guest are targets 
of DMA.
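
The window-based behaviour described above (a few large, contiguous mappings per logical I/O device number, no page tables) can be sketched roughly as follows in Python. This is only a model of the lookup, not of the actual hardware; the class, the function, the host address and the choice of LIODN 1 are assumptions for the example, with the 512MB window at guest address 0x0 borrowed from the PCI example further down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DmaWindow:
    guest_addr: int  # start of the window as seen by the device
    size: int        # length of the contiguous window in bytes
    host_addr: int   # start of the backing host physical range


def translate(windows, liodn, guest_addr, length):
    """Translate a DMA access, or return None if no window fully covers it."""
    for window in windows.get(liodn, ()):
        start = window.guest_addr
        end = window.guest_addr + window.size
        if start <= guest_addr and guest_addr + length <= end:
            return window.host_addr + (guest_addr - start)
    return None


# One 512MB window for LIODN 1; the host address is made up.
windows = {
    1: [DmaWindow(guest_addr=0x0, size=512 << 20, host_addr=0x40000000)],
}

inside = translate(windows, 1, 0x1000, 0x100)
straddling = translate(windows, 1, (512 << 20) - 0x10, 0x20)
unknown_liodn = translate(windows, 2, 0x0, 4)
```

An access that straddles the end of a window, or that comes from a LIODN with no windows configured, simply has no translation, which is why the set of windows has to be described explicitly per device rather than derived from the guest's memory slots.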

So for Freescale we would need some very Freescale-specific 
configuration mechanism to set up the IOMMU.  Here I think we would 
need the new qcfg approach to expressing nested
structures (http://wiki.qemu.org/Features/QCFG).   Device
assignment with IOMMU set up might look like the examples
below:

# device with multiple logical i/o device numbers

-device assigned-soc-dev,dev=/qman-portals/qman-portal@4000,
vcpu=1,fsl,iommu.stash-mem={
dma-window.guest-addr=0x0,
dma-window.size=0x1,
liodn-index=1,
operation-mapping=0,
stash-dest=1},
fsl,iommu.stash-dqrr={
dma-window.guest-addr=0xff420,
dma-window.size=0x4000,
liodn-index=0,
operation-mapping=0,
stash-dest=1}

# assign pci-bus to a guest with multiple memory regions
#   addr      size
#   0x0       512MB
#   0x2000    4KB   (for MSIs)
#   0x4000    16MB  (shared memory)
#   0xc000    64MB  (shared memory)

-device assigned-soc-dev,dev=/pcie@ffe09000,
fsl,iommu={dma-window.guest-addr=0x0,
dma-window.size=