userspace: Device Assignment: Support for assigning PCI devices to guests

Amit Shah Wed, 24 Sep 2008 10:14:29 -0700

* On Wednesday 24 Sep 2008 20:37:24 Anthony Liguori wrote:
> Amit Shah wrote:
> > * On Tuesday 23 Sep 2008 22:00:32 Anthony Liguori wrote:
> >> Amit Shah wrote:
> >>> diff --git a/qemu/Makefile.target b/qemu/Makefile.target
> >>> index 72f3db8..40eb273 100644
> >>> --- a/qemu/Makefile.target
> >>> +++ b/qemu/Makefile.target
> >>> @@ -616,6 +616,7 @@ OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o
> >>>  OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o
> >>>  OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o
> >>>  OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o
> >>> +OBJS+= device-assignment.o
> >>
> >> This needs to be conditional on at least linux hosts, but probably also
> >> kvm support.
> >
> > I didn't see any other file that's doing it. So I added this conditional
> > in vl.c by having a #if defined(__linux__). That's how usb-linux.c does
> > it as well. Is there a better way?
>
> aio and compatfd currently do it this way.  block-raw-win32 and
> block-raw-posix are this way.  We're slowly moving things away from
> #ifdef #else #endif to conditional compilation.


So is

ifdef(CONFIG_LINUX)
OBJS+=..
endif

supposed to work?
(or is it CONFIG_LINUX_USER?)

> > Not the whole functionality needs kvm support. This should be able to
> > work even without kvm (for example, when the guest is 1:1 mapped in the
> > host address space).
>
> KVM is needed for interrupt remapping though.  That's something I don't
> see happening for normal userspace any time soon.

We already have an implementation that I've not submitted. I remember at least 
one user asking to support that functionality (because he wanted to try this 
on powerpc).

> >>> + /* FIXME: Add support for emulated MMIO for non-kvm guests */
> >>> + if (kvm_enabled()) {
> >>
> >> This doesn't work at all if kvm isn't enabled right?  You should
> >> probably bail out in the init if kvm isn't enabled.  If this whole file
> >> is included conditionally based on KVM support, then you don't have to
> >> worry about using kvm_enabled() guards to conditionally compile out
> >> code.
> >
> > Non-kvm support is currently broken and should be fixed, but that can
> > happen after we get this merged.
>
> But it would take bouncing interrupts to userspace?  I don't think that
> will ever happen upstream personally.  At any rate, there's no point in
> even trying to support something like that until progress is made
> upstream on this front.

With 1:1 mapping (Andrea's patches) and --no-kvm-irqchip (or indeed --no-kvm) 
and the userspace interrupt bouncing patches, we can support this.

> > I can temporarily add a check for kvm_enabled and bail out.
> >
> >>> + sprintf(dir, "/sys/bus/pci/devices/0000:%02x:%02x.%x/",
> >>> +         r_bus, r_dev, r_func);
> >>
> >> snprintf()
> >
> > It's guarded by the %02x modifiers; so this doesn't depend on user input.
>
> strcpy or sprintf should never be used.  It doesn't matter if it's safe
> in a particular instance.  There are safer functions to use (like
> snprintf).

Hmm, qemu is littered with such strcpy()s though. I'll change this.

> All it takes is for someone to come along and change the /sys/bus path
> to be larger without adjusting the buffer size and everything goes to
> hell.  It's inherently brittle.
>
> >>> + fprintf(stderr, "Registered host PCI device %02x:%02x.%1x "
> >>> +         "(\"%s\") as guest device %02x:%02x.%1x\n",
> >>> +         r_bus, r_dev, r_func, e_dev_name,
> >>> +         pci_bus_num(e_bus), e_device, r_func);
> >>
> >> Please don't fprintf() unconditionally.
> >
> > OK; however, a vmdk file open does that so I though it was alright to do
> > it.
>
> I obviously don't use vmdk or else I would have removed that by now :-)
>
> >> A lot more checks are needed here to see if things can succeed.  We
> >> definitely should bail out if they can't.
> >
> > Bailing out is done in the out: label below. What else do  you think can
> > fail? I've taken care of all the cases that do fail IMO.
> >
> >>> + return pci_dev;
> >>> +out:
> >>> + pci_unregister_device(&pci_dev->dev);
> >>> + return NULL;
> >>> +}
> >>>
> >>>
> >>>
> >>> +/*
> >>> + * Syntax to assign device:
> >>> + *
> >>> + * -pcidevice dev=bus:dev.func,dma=dma
> >>> + *
> >>> + * Example:
> >>> + * -pcidevice host=00:13.0,dma=pvdma
> >>> + *
> >>> + * dma can currently only be 'none' to disable iommu support.
> >>
> >> Does it actually work if you disable iommu support?
> >
> > If the guest is 1:1 mapped.
>
> You mean with Andrea's reserved ram patches?

Yes.

> >>> +#include <sys/mman.h>
> >>
> >> Don't think this is needed here.
> >
> > We use mmap(), so this is needed.
>
> Ah.
>
> >>> +    /* Initialize assigned devices */
> >>> +    if (pci_enabled) {
> >>> +        int r = -1;
> >>> +        do {
> >>> +            init_assigned_device(pci_bus, &r);
> >>
> >> Why pass r by reference instead of just returning it?  At any rate, you
> >> should detect when this fails and gracefully terminate QEMU.
> >
> > 'r' is the count of the number of assigned devices -- mostly needed
> > because we have the data stored in an array. If we migrate to a list,
> > this can be relaxed.
> >
> > ATM, I start the guest without assigning the device. I haven't figured
> > out a way to gracefully terminate qemu yet.
>
> In the case of hot plug, you fail the hot plug.  If you start with
> device assignment, just doing an "exit" would be sufficient.

What about allocated memory? I'm sure there'll be lots of leaks in case of 
just exit().

> >>> +#if defined(TARGET_I386) || defined(TARGET_X86_64) ||
> >>> defined(__linux__) +          case QEMU_OPTION_pcidevice:
> >>> +         add_assigned_device(optarg);
> >>
> >> You should copy into an array, then in pc.c, iterate through the array
> >> and call into add_assigned_device.
> >
> > Is there any benefit in doing this? We're moving the iterate out of vl.c
> > to pc.c and both will happen at the same time.
>
> It's how everything else works.

But there's no particular benefit to it. On the down side, we'll need some 
array (the size of which will not be known since we don't know how many 
devices will be assigned). This current scheme is simple; can't this stay 
this way?

Amit
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/7] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests

Reply via email to