Re: [Qemu-devel] QEMU tries to register to VFIO memory that is not RAM

2019-06-14 Thread Thanos Makatos
> > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > > index 4374cc6176..d9d3b1277a 100644
> > > --- a/hw/vfio/common.c
> > > +++ b/hw/vfio/common.c
> > > @@ -430,6 +430,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
> > >      VFIOHostDMAWindow *hostwin;
> > >      bool hostwin_found;
> > >
> > > +    if (!section->mr->ram_device)
> > > +        return;
> > > +
> >
> > Nope, this would prevent IOMMU mapping of assigned device MMIO regions
> > which would prevent peer-to-peer DMA between assigned devices.  Thanks,
> 
> Understood.
> 
> Is there a strong reason why QEMU allocates memory for these address
> spaces without MAP_SHARED? In our use case it would solve our problem if
> we could make QEMU use MAP_SHARED. I understand that this isn't strictly
> correct, so would it be acceptable to enable this behavior with a command-
> line option or an #ifdef?

Ping!




[Qemu-devel] QEMU tries to register to VFIO memory that is not RAM

2019-05-31 Thread Thanos Makatos
When configuring device pass-through via VFIO to a VM, I noticed that QEMU 
tries to register (DMA_MAP) all memory regions of a guest (not only RAM). That 
includes firmware regions like "pc.rom". Would a physical device ever need 
access to those? Am I missing something?



Re: [Qemu-devel] QEMU tries to register to VFIO memory that is not RAM

2019-05-31 Thread Thanos Makatos
> > When configuring device pass-through via VFIO to a VM, I noticed that
> > QEMU tries to register (DMA_MAP) all memory regions of a guest (not
> > only RAM). That includes firmware regions like "pc.rom". Would a
> > physical device ever need access to those?
>
> Probably not, but are those things not in the address space of the
> device on a physical system?

They are. I'm wondering whether it makes sense in a virtualized environment.

>
> > Am I missing something?
>
> Does this cause a problem?

It does in my use case. We're experimenting with devices backed by another
userspace application. We can configure QEMU to allocate shared memory
(MAP_SHARED) for guest RAM (which we can register in the other process) but not
for anything else.
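Since the issue turns on MAP_SHARED vs. MAP_PRIVATE semantics, here is a minimal Linux-only sketch of the distinction (this is not QEMU code; `memfd_create` merely stands in for whatever file object would back guest RAM):

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a write through a MAP_SHARED mapping of an in-memory file
 * is visible through the file descriptor itself (and hence to any other
 * process that fd is handed to), 0 otherwise. */
int shared_write_visible(void)
{
    int fd = memfd_create("guest-ram", 0);   /* shareable backing object */
    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return 0;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 0;

    memcpy(p, "dma", 4);                     /* store through the mapping */

    char buf[4] = {0};
    int ok = pread(fd, buf, 4, 0) == 4 && strcmp(buf, "dma") == 0;
    munmap(p, 4096);
    close(fd);
    return ok;
}

/* With MAP_PRIVATE the same store would land in anonymous copy-on-write
 * pages that never reach the backing object, so no other process could
 * observe it -- which is why a MAP_PRIVATE region cannot be registered
 * with a second userspace process. */
```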

>  It's not always easy to identify regions
> that should not be mapped to a device, clearly we're not going to
> create a whitelist based on the name of the region.  Thanks,

Indeed. Could we decide whether or not to register an address space with
VFIO in a more intelligent manner? E.g. the following simplistic patch solves
our problem:

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4374cc6176..d9d3b1277a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -430,6 +430,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
     VFIOHostDMAWindow *hostwin;
     bool hostwin_found;
 
+    if (!section->mr->ram_device)
+        return;
+
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_add_skip(
             section->offset_within_address_space,



Re: [Qemu-devel] QEMU tries to register to VFIO memory that is not RAM

2019-06-04 Thread Thanos Makatos
> > Indeed. Could we decide whether or not to register an address space with
> > VFIO in a more intelligent manner? E.g. the following simplistic patch
> > solves our problem:
> >
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 4374cc6176..d9d3b1277a 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -430,6 +430,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >      VFIOHostDMAWindow *hostwin;
> >      bool hostwin_found;
> >
> > +    if (!section->mr->ram_device)
> > +        return;
> > +
> 
> Nope, this would prevent IOMMU mapping of assigned device MMIO regions
> which would prevent peer-to-peer DMA between assigned devices.  Thanks,

Understood.

Is there a strong reason why QEMU allocates memory for these address spaces 
without MAP_SHARED? In our use case it would solve our problem if we could make 
QEMU use MAP_SHARED. I understand that this isn't strictly correct, so would it 
be acceptable to enable this behavior with a command-line option or an #ifdef?



question about handling MSI-X by VFIO

2020-01-21 Thread Thanos Makatos
I'm passing through a virtual PCI device to a QEMU guest via VFIO/mdev and I
notice that MSI-X interrupts are disabled in the device (MSIXCAP.MXC.MXE is
zero) and the BARs containing the table and PBA (4 and 5 in my case) are never
accessed.  However, whenever I fire an MSI-X interrupt from the virtual device
(although I'm not supposed to do so as they're disabled), the guest seems to
correctly receive it. I've started looking at hw/vfio/pci.c and it seems that
VFIO handles MSI-X interrupts there, including masking etc?



RE: question about handling MSI-X by VFIO

2020-01-23 Thread Thanos Makatos
> > I'm passing through a virtual PCI device to a QEMU guest via VFIO/mdev and I
> > notice that MSI-X interrupts are disabled in the device (MSIXCAP.MXC.MXE is
> > zero) and the BARs containing the table and PBA (4 and 5 in my case) are never
> > accessed.  However, whenever I fire an MSI-X interrupt from the virtual device
> > (although I'm not supposed to do so as they're disabled), the guest seems to
> > correctly receive it. I've started looking at hw/vfio/pci.c and it seems that
> > VFIO handles MSI-X interrupts there, including masking etc?
> 
> Yes, the vector table and PBA are emulated in QEMU, the latter lazily
> only when vectors are masked, iirc.  The backing device vector table
> should never be directly accessed by the user (it can be, but you can
> just discard those accesses), MSI-X is configured via the
> VFIO_DEVICE_SET_IRQS ioctl, which configures the eventfd through which
> an mdev driver would trigger an MSI.  When you say that you "fire an
> MSI-X interrupt from the virtual device" does this mean that you're
> signaling via one of these eventfds?  It looks to me like emulating the
> MSI-X enable bit in the MSI-X capability is probably the responsibility
> of the mdev vendor driver.  With vfio-pci the VFIO_DEVICE_SET_IRQS ioctl
> would enable MSI-X on the physical device and the MSI-X capability seen
> by the user would reflect that.  Are you missing a bit of code that
> updates the mdev config space as part of the SET_IRQS ioctl?  Thanks,

Indeed I fire interrupts via the eventfd and it works correctly. I just
couldn't understand how it could possibly work since the table and PBA BARs 
were never accessed and the MSI-X enable bit was not set.  It makes perfect
sense now why it works since QEMU does it all.
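As a rough illustration of the eventfd path described above (both ends collapsed into a single process here; this is not the actual mdev or QEMU code):

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Sketch of the handshake: QEMU registers one eventfd per MSI-X vector
 * via the VFIO_DEVICE_SET_IRQS ioctl; the device side (mdev driver or
 * userspace server) "fires" a vector simply by writing to that eventfd.
 * The consumer sees the accumulated count on read. */
int fire_and_count(void)
{
    int efd = eventfd(0, 0);
    if (efd < 0)
        return -1;

    uint64_t one = 1;
    write(efd, &one, sizeof(one));    /* device side: trigger the interrupt */
    write(efd, &one, sizeof(one));    /* trigger again before it is consumed */

    uint64_t count = 0;
    read(efd, &count, sizeof(count)); /* consumer side: reads and resets */
    close(efd);
    return (int)count;                /* both signals coalesce into one read */
}
```

Note the coalescing: two triggers before a read show up as a single wakeup with a count of 2, which is exactly the edge-triggered behavior an interrupt consumer wants.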



Cannot use more than one MSI interrupts

2019-10-24 Thread Thanos Makatos
I have an Ubuntu VM (4.15.0-48-generic) to which I pass through a PCI device,
specifically a virtual NVMe controller. The problem I have is that only one I/O
queue is initialized, while there should be more (e.g. four). I'm using upstream
QEMU v4.1.0 configured without any additional options. Most likely something is
broken in my virtual device implementation but I can't figure out exactly what;
I was hoping to get some debugging directions.

I run QEMU as follows:

~/src/qemu/x86_64-softmmu/qemu-system-x86_64 \
-kernel bionic-server-cloudimg-amd64-vmlinuz-generic \
-smp cores=2,sockets=2 \
-nographic \
-append "console=ttyS0 root=/dev/sda1 single nvme.sgl_threshold=0 nokaslr nvme.io_queue_depth=4" \
-initrd bionic-server-cloudimg-amd64-initrd-generic \
-hda bionic-server-cloudimg-amd64.raw \
-hdb data.raw \
-m 1G \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=mem,share=yes,size=1073741824 \
-numa node,nodeid=0,cpus=0-3,memdev=ram-node0 \
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/---- \
-trace enable=vfio*,file=qemu.trace \
-net none \
-s

This is what QEMU thinks of the devices:

(qemu) info pci
  Bus  0, device   0, function 0:
Host bridge: PCI device 8086:1237
  PCI subsystem 1af4:1100
  id ""
  Bus  0, device   1, function 0:
ISA bridge: PCI device 8086:7000
  PCI subsystem 1af4:1100
  id ""
  Bus  0, device   1, function 1:
IDE controller: PCI device 8086:7010
  PCI subsystem 1af4:1100
  BAR4: I/O at 0xc000 [0xc00f].
  id ""
  Bus  0, device   1, function 3:
Bridge: PCI device 8086:7113
  PCI subsystem 1af4:1100
  IRQ 9.
  id ""
  Bus  0, device   2, function 0:
VGA controller: PCI device 1234:
  PCI subsystem 1af4:1100
  BAR0: 32 bit prefetchable memory at 0xfd00 [0xfdff].
  BAR2: 32 bit memory at 0xfebf4000 [0xfebf4fff].
  BAR6: 32 bit memory at 0x [0xfffe].
  id ""
  Bus  0, device   3, function 0:
Class 0264: PCI device 4e58:0001
  PCI subsystem :
  IRQ 11.
  BAR0: 32 bit memory at 0xfebf [0xfebf3fff].
  id ""

And this is what the guest thinks of the device in question:

root@ubuntu:~# lspci -vvv -s 00:03.0
00:03.0 Non-Volatile memory controller: Device 4e58:0001 (prog-if 02 [NVM 
Express])
Physical Slot: 3
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- [rest of lspci output lost in the archive]

In the guest kernel, MSI setup appears to bail out at the check

if (!(info->flags & MSI_FLAG_MULTI_PCI_MSI))
return 1;

because flags is 0x3b (MSI_FLAG_MULTI_PCI_MSI is 0x4). And this I think means
that MSI_FLAG_MULTI_PCI_MSI is not set for that msi_domain_info.

# grep -i msi qemu.trace
1327@1571926064.595365:vfio_msi_setup ---- PCI MSI CAP @0x48
1334@1571926073.489691:vfio_msi_enable  (----) Enabled 1 MSI vectors
1334@1571926073.501741:vfio_msi_disable  (----)
1334@1571926073.507127:vfio_msi_enable  (----) Enabled 1 MSI vectors
1327@1571926073.520840:vfio_msi_interrupt  (----) vector 0 0xfee01004/0x4023
... more vfio_msi_interrupt ...

How can I further debug this?



RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-01 Thread Thanos Makatos
> On Thu, Mar 26, 2020 at 09:47:38AM +0000, Thanos Makatos wrote:
> > Build MUSER with vfio-over-socket:
> >
> > git clone --single-branch --branch vfio-over-socket
> > git@github.com:tmakatos/muser.git
> > cd muser/
> > git submodule update --init
> > make
> >
> > Run device emulation, e.g.
> >
> > ./build/dbg/samples/gpio-pci-idio-16 -s 
> >
> > Where  is an available IOMMU group, essentially the device ID, which must
> > not previously exist in /dev/vfio/.
> >
> > Run QEMU using the vfio wrapper library and specifying the MUSER device:
> >
> > LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64 \
> > ... \
> > -device vfio-pci,sysfsdev=/dev/vfio/ \
> > -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=mem,share=yes,size=1073741824 \
> > -numa node,nodeid=0,cpus=0,memdev=ram-node0
> >
> > Bear in mind that since this is just a PoC lots of things can break, e.g. some
> > system call not intercepted etc.
> 
> Cool, I had a quick look at libvfio and how the transport integrates
> into libmuser.  The integration on the libmuser side is nice and small.
> 
> It seems likely that there will be several different implementations of
> the vfio-over-socket device side (server):
> 1. libmuser
> 2. A Rust equivalent to libmuser
> 3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
>    has been investigating this?)
> 
> In order to interoperate we'll need to maintain a protocol
> specification.  Maybe you and JJ could put that together and CC the vfio,
> rust-vmm, and QEMU communities for discussion?

Sure, I can start by drafting a design doc and share it.

> It should cover the UNIX domain socket connection semantics (does a
> listen socket only accept 1 connection at a time?  What happens when the
> client disconnects?  What happens when the server disconnects?), how
> VFIO structs are exchanged, any vfio-over-socket specific protocol
> messages, etc.  Basically everything needed to write an implementation
> (although it's not necessary to copy the VFIO struct definitions from
> the kernel headers into the spec or even document their semantics if
> they are identical to kernel VFIO).
> 
> The next step beyond the LD_PRELOAD library is a native vfio-over-socket
> client implementation in QEMU.  There is a prototype here:
> https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-user.c
> 
> If there are any volunteers for working on that then this would be a
> good time to discuss it.
> 
> Finally, has anyone looked at CrosVM's out-of-process device model?  I
> wonder if it has any features we should consider...
> 
> Looks like a great start to vfio-over-socket!



RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-20 Thread Thanos Makatos
> In order to interoperate we'll need to maintain a protocol
> specification.  Maybe you and JJ could put that together and CC the vfio,
> rust-vmm, and QEMU communities for discussion?
> 
> It should cover the UNIX domain socket connection semantics (does a
> listen socket only accept 1 connection at a time?  What happens when the
> client disconnects?  What happens when the server disconnects?), how
> VFIO structs are exchanged, any vfio-over-socket specific protocol
> messages, etc.  Basically everything needed to write an implementation
> (although it's not necessary to copy the VFIO struct definitions from
> the kernel headers into the spec or even document their semantics if
> they are identical to kernel VFIO).
> 
> The next step beyond the LD_PRELOAD library is a native vfio-over-socket
> client implementation in QEMU.  There is a prototype here:
> https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-user.c
> 
> If there are any volunteers for working on that then this would be a
> good time to discuss it.

Hi,

I've just shared with you the Google doc we've been working on with John, where
we've been drafting the protocol specification; we think it's time for some
first comments. Please feel free to comment/edit and to suggest more people to
be on the reviewers list.

You can also find the Google doc here:

https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing

If a Google doc doesn't work for you we're open to suggestions.

Thanks



RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-27 Thread Thanos Makatos
> > I've just shared with you the Google doc we've been working on with John
> > where we've been drafting the protocol specification, we think it's time
> > for some first comments. Please feel free to comment/edit and suggest
> > more people to be on the reviewers list.
> >
> > You can also find the Google doc here:
> >
> >
> https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> >
> > If a Google doc doesn't work for you we're open to suggestions.
> 
> I can't add comments to the document so I've inlined them here:
> 
> The spec assumes the reader is already familiar with VFIO and does not
> explain concepts like the device lifecycle, regions, interrupts, etc.
> We don't need to duplicate detailed VFIO information, but I think the
> device model should be explained so that anyone can start from the
> VFIO-user spec and begin working on an implementation.  Right now I
> think they would have to do some serious investigation of VFIO first in
> order to be able to write code.

I've added a high-level overview of how VFIO is used in this context.

> "only the source header files are used"
> I notice the current  header is licensed "GPL-2.0 WITH
> Linux-syscall-note".  I'm not a lawyer but I guess this means there are
> some restrictions on using this header file.  The 
> header files were explicitly licensed under the BSD license to make it
> easy to use the non __KERNEL__ parts.

My impression is that this note actually relaxes the licensing requirements, so
that proprietary software can use the system call headers and run on Linux
without being considered derived work. In any case I'll double check with our
legal team.
 
> VFIO-user Command Types: please indicate for each request type whether
> it is client->server, server->client, or both.  Also is it a "command"
> or "request"?

Will do. It's a command.

 
> vfio_user_req_type <-- is this an extension on top of ?
> Please make it clear what is part of the base  protocol
> and what is specific to vfio-user.

Correct, it's an extension over . I've clarified the two symbol
namespaces.

 
> VFIO_USER_READ/WRITE serve completely different purposes depending on
> whether they are sent client->server or server->client.  I suggest
> defining separate request type constants instead of overloading them.

Fixed.

> What is the difference between VFIO_USER_MAP_DMA and
> VFIO_USER_REG_MEM?
> They both seem to be client->server messages for setting up memory but
> I'm not sure why two request types are needed.

John will provide more information on this.

> struct vfio_user_req->data.  Is this really a union so that every
> message has the same size, regardless of how many parameters are passed
> in the data field?

Correct, it's a union so that every message has the same length.
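A hypothetical sketch of what such a fixed-size layout looks like; the field, type, and size choices below are invented for illustration and are not taken from the draft spec:

```c
#include <stdint.h>

/* Made-up request layout (names and sizes are illustrative only).
 * Because the payload is a union, every message has the same length
 * on the wire regardless of which command it carries, so a receiver
 * can always read exactly sizeof(struct fake_user_req) bytes. */
struct fake_user_req {
    uint16_t msg_id;   /* correlates replies with requests */
    uint16_t cmd;      /* which command this message carries */
    uint32_t flags;
    union {
        struct { uint64_t addr; uint64_t len; } dma_map;
        struct { uint32_t index; uint32_t count; } irq_set;
        uint8_t raw[32];   /* pads the union to its fixed size */
    } data;
};

/* The union is exactly as large as its largest member. */
_Static_assert(sizeof(((struct fake_user_req *)0)->data) == 32,
               "fixed-size payload");

unsigned req_size(void) { return sizeof(struct fake_user_req); }
```

The trade-off is the usual one: fixed-size messages make framing trivial (no length field needed to find the next message) at the cost of padding small commands up to the largest payload.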

> "a framebuffer where the guest does multiple stores to the virtual
> device."  Do you mean in SMP guests?  Or even in a single CPU guest?

@John

> Also, is there any concurrency requirement on the client and server
> side?  Can I implement a client/server that processes requests
> sequentially and completes them before moving on to the next request or
> would that deadlock for certain message types?

I believe that this might also depend on the device semantics, will need to
think about it in greater detail.

More importantly, considering:
a) Marc-André's comments about data alignment etc., and
b) the possibility to run the server on another guest or host,
we won't be able to use native VFIO types. If we do want to support that then
we'll have to redefine all data formats, similar to
https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst.

So the protocol will be more like an enhanced version of the Vhost-user protocol
than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
so we need to decide before proceeding as the request format is substantially
different.



RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-30 Thread Thanos Makatos
> > > I've just shared with you the Google doc we've been working on with John
> > > where we've been drafting the protocol specification, we think it's time
> > > for some first comments. Please feel free to comment/edit and suggest
> > > more people to be on the reviewers list.
> > >
> > > You can also find the Google doc here:
> > >
> > >
> > https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> > >
> > > If a Google doc doesn't work for you we're open to suggestions.
> >
> > I can't add comments to the document so I've inlined them here:
> >
> > The spec assumes the reader is already familiar with VFIO and does not
> > explain concepts like the device lifecycle, regions, interrupts, etc.
> > We don't need to duplicate detailed VFIO information, but I think the
> > device model should be explained so that anyone can start from the
> > VFIO-user spec and begin working on an implementation.  Right now I
> > think they would have to do some serious investigation of VFIO first in
> > order to be able to write code.
> 
> I've added a high-level overview of how VFIO is used in this context.
> 
> > "only the source header files are used"
> > I notice the current  header is licensed "GPL-2.0 WITH
> > Linux-syscall-note".  I'm not a lawyer but I guess this means there are
> > some restrictions on using this header file.  The 
> > header files were explicitly licensed under the BSD license to make it
> > easy to use the non __KERNEL__ parts.
> 
> My impression is that this note actually relaxes the licensing requirements, so
> that proprietary software can use the system call headers and run on Linux
> without being considered derived work. In any case I'll double check with our
> legal team.
> 
> > VFIO-user Command Types: please indicate for each request type whether
> > it is client->server, server->client, or both.  Also is it a "command"
> > or "request"?
> 
> Will do. It's a command.
> 
> 
> > vfio_user_req_type <-- is this an extension on top of ?
> > Please make it clear what is part of the base  protocol
> > and what is specific to vfio-user.
> 
> Correct, it's an extension over . I've clarified the two symbol
> namespaces.
> 
> 
> > VFIO_USER_READ/WRITE serve completely different purposes depending on
> > whether they are sent client->server or server->client.  I suggest
> > defining separate request type constants instead of overloading them.
> 
> Fixed.
> 
> > What is the difference between VFIO_USER_MAP_DMA and
> > VFIO_USER_REG_MEM?
> > They both seem to be client->server messages for setting up memory but
> > I'm not sure why two request types are needed.
> 
> John will provide more information on this.
> 
> > struct vfio_user_req->data.  Is this really a union so that every
> > message has the same size, regardless of how many parameters are passed
> > in the data field?
> 
> Correct, it's a union so that every message has the same length.
> 
> > "a framebuffer where the guest does multiple stores to the virtual
> > device."  Do you mean in SMP guests?  Or even in a single CPU guest?
> 
> @John
> 
> > Also, is there any concurrency requirement on the client and server
> > side?  Can I implement a client/server that processes requests
> > sequentially and completes them before moving on to the next request or
> > would that deadlock for certain message types?
> 
> I believe that this might also depend on the device semantics, will need to
> think about it in greater detail.

I've looked at this but can't provide a definitive answer yet. I believe the
safest thing to do is for the server to process requests in order.

> More importantly, considering:
> a) Marc-André's comments about data alignment etc., and
> b) the possibility to run the server on another guest or host,
> we won't be able to use native VFIO types. If we do want to support that
> then we'll have to redefine all data formats, similar to
> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst
>
> So the protocol will be more like an enhanced version of the Vhost-user
> protocol than VFIO. I'm fine with either direction (VFIO vs. enhanced
> Vhost-user), so we need to decide before proceeding as the request format
> is substantially different.

Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
support this future use case without unnecessarily complicating the protocol by
defining the C structs and stating that data alignment and endianness for the
non AF_UNIX case must be the one used by GCC on an x86_64 machine, or can be
overridden as required.

RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-04-30 Thread Thanos Makatos
> > > More importantly, considering:
> > > a) Marc-André's comments about data alignment etc., and
> > > b) the possibility to run the server on another guest or host,
> > > we won't be able to use native VFIO types. If we do want to support that
> > > then
> > > we'll have to redefine all data formats, similar to
> > > https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst
> > >
> > > So the protocol will be more like an enhanced version of the Vhost-user
> > > protocol than VFIO. I'm fine with either direction (VFIO vs. enhanced
> > > Vhost-user), so we need to decide before proceeding as the request
> > > format is substantially different.
> >
> > Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
> > support this future use case without unnecessarily complicating the
> > protocol by defining the C structs and stating that data alignment and
> > endianness for the non AF_UNIX case must be the one used by GCC on an
> > x86_64 machine, or can be overridden as required.
> 
> Defining it to be x86_64 semantics is effectively saying "we're not going
> to do anything and it is up to other arch maintainers to fix the inevitable
> portability problems that arise".

Pretty much.
 
> Since this is a new protocol should we take the opportunity to model it
> explicitly in some common standard RPC protocol language. This would have
> the benefit of allowing implementors to use off the shelf APIs for their
> wire protocol marshalling, and eliminate questions about endianness and
> alignment across architectures.

The problem is that we haven't defined the scope very well. My initial
impression was that we should use the existing VFIO structs and constants,
however that's impossible if we're to support non-AF_UNIX transports. We need
consensus on this; we're open to ideas on how to do it.




RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-03-26 Thread Thanos Makatos
I want to continue the discussion regarding using MUSER
(https://github.com/nutanix/muser) as a device offloading mechanism. The main
drawback of MUSER is that it requires a kernel module, so I've experimented
with a proof of concept of how MUSER would look if we somehow didn't need
a kernel module. I did this by implementing a wrapper library
(https://github.com/tmakatos/libpathtrap) that intercepts accesses to
VFIO-related paths and forwards them to the MUSER process providing device
emulation over a UNIX domain socket. This does not require any changes to QEMU
(4.1.0). Obviously this is a massive hack and is done only for the needs of
this PoC.

The result is a fully working PCI device in QEMU (the gpio sample explained in
https://github.com/nutanix/muser/blob/master/README.md#running-gpio-pci-idio-16),
which is as simple as possible. I've also tested with a much more complicated
device emulation, https://github.com/tmakatos/spdk, which provides NVMe device
emulation and requires accessing guest memory for DMA, allowing BAR0 to be
memory mapped into the guest, using MSI-X interrupts, etc.

The changes required in MUSER are fairly small, all that is needed is to
introduce a new concept of "transport" to receive requests from a UNIX domain
socket instead of the kernel (from a character device) and to send/receive file
descriptors for sharing memory and firing interrupts.
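The fd-passing mechanism referred to here is plain SCM_RIGHTS ancillary data on a UNIX domain socket. A self-contained sketch (not the actual MUSER transport code) looks roughly like this:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected UNIX socket. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;          /* kernel duplicates the fd for the peer */
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor sent with send_fd(); returns -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));
    return fd;
}

/* Round-trip check: pass the read end of a pipe across a socketpair and
 * read through the received descriptor. */
int fd_passing_works(void)
{
    int sv[2], pipefd[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) || pipe(pipefd))
        return 0;
    if (send_fd(sv[0], pipefd[0]))
        return 0;
    int rfd = recv_fd(sv[1]);
    write(pipefd[1], "irq", 3);
    char got[4] = {0};
    return rfd >= 0 && read(rfd, got, 3) == 3 && strcmp(got, "irq") == 0;
}
```

The same round trip carries a memfd (for shared guest memory) or an eventfd (for interrupts) just as well, since SCM_RIGHTS works for any descriptor type.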

My experience is that VFIO is so intuitive to use for offloading device
emulation from one process to another that it makes this feature quite
straightforward. There's virtually nothing specific to the kernel in the VFIO
API. Therefore I strongly agree with Stefan's suggestion to use it for device
offloading when interacting with QEMU. Using 'muser.ko' is still interesting
when QEMU is not the client, but if everyone is happy to proceed with the
vfio-over-socket alternative the kernel module can become a second-class
citizen. (QEMU is, after all, our first and most relevant client.)

Next I explain how to test the PoC.

Build MUSER with vfio-over-socket:

git clone --single-branch --branch vfio-over-socket 
git@github.com:tmakatos/muser.git
cd muser/
git submodule update --init
make

Run device emulation, e.g.

./build/dbg/samples/gpio-pci-idio-16 -s 

Where  is an available IOMMU group, essentially the device ID, which must not
previously exist in /dev/vfio/.

Run QEMU using the vfio wrapper library and specifying the MUSER device:

LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64 \
... \
-device vfio-pci,sysfsdev=/dev/vfio/ \
-object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=mem,share=yes,size=1073741824 \
-numa node,nodeid=0,cpus=0,memdev=ram-node0

Bear in mind that since this is just a PoC lots of things can break, e.g. some
system call not intercepted etc.



RE: RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-03-27 Thread Thanos Makatos
>  
> Next I explain how to test the PoC.
> 
> Build MUSER with vfio-over-socket:
> 
> git clone --single-branch --branch vfio-over-socket
> git@github.com:tmakatos/muser.git
> cd muser/
> git submodule update --init
> make

Yesterday's version had a bug where it didn't build if you didn't have an 
existing libmuser installation, I pushed a patch to fix that.



RE: [RFC v4 PATCH 00/49] Initial support of multi-process qemu - status update

2020-02-25 Thread Thanos Makatos
> > 3) Muser.ko pins the pages (in get_dma_map(), called from below)
> > (https://github.com/nutanix/muser/blob/master/kmod/muser.c#L711)
> 
> Yikes, it pins every page??  vfio_pin_pages() intends for the vendor
> driver to be much smarter than this :-\  Thanks,

We no longer have to pin pages at all. Instead we grab the fd backing the VMA
and inject it in libmuser, and then request it to mmap that file. This also
solves a few other problems and is far simpler to implement.



Re: [Qemu-devel] [PATCH] block: vpc support for ~2 TB disks

2012-11-14 Thread Thanos Makatos
We don't use qemu's VHD driver in XenServer. Instead, we use blktap2 to create 
a block device in dom0 serving the VHD file in question, and have qemu open 
that block device instead of the VHD file itself.

> -Original Message-
> From: Stefano Stabellini [mailto:stefano.stabell...@eu.citrix.com]
> Sent: 13 November 2012 10:56
> To: Paolo Bonzini
> Cc: Charles Arnold; qemu-devel@nongnu.org; Kevin Wolf;
> stefa...@redhat.com; Stefano Stabellini; xen-de...@lists.xensource.com;
> Thanos Makatos
> Subject: Re: [PATCH] block: vpc support for ~2 TB disks
> 
> On Tue, 13 Nov 2012, Paolo Bonzini wrote:
> > > On 12/11/2012 20:12, Charles Arnold wrote:
> > > Ping?
> > >
> > > Any thoughts on whether this is acceptable?
> >
> > I would like to know what is done by other platforms.  Stefano, any
> > idea about XenServer?
> >
> 
> I am not sure, but maybe Thanos, that is working on blktap3, knows, or
> at least knows who to CC.
> 
> 
> 
> > > - Charles
> > >
> > >>>> On 10/30/2012 at 08:59 PM, in message <50a0e561.5b74.009...@suse.com>,
> > >>>> Charles Arnold wrote:
> > >> The VHD specification allows for up to a 2 TB disk size. The
> > >> current implementation in qemu emulates EIDE and ATA-2 hardware
> > >> which only allows for up to 127 GB.  This disk size limitation can
> > >> be overridden by allowing up to 255 heads instead of the normal 4
> > >> bit limitation of 16.  Doing so allows disk images to be created of
> > >> up to nearly 2 TB.  This change does not violate the VHD format
> > >> specification nor does it change how smaller disks (ie, <=127GB)
> > >> are defined.
> > >>
> > >> Signed-off-by: Charles Arnold 
> > >>
> > >> diff --git a/block/vpc.c b/block/vpc.c
> > >> index b6bf52f..0c2eaf8 100644
> > >> --- a/block/vpc.c
> > >> +++ b/block/vpc.c
> > >> @@ -198,7 +198,8 @@ static int vpc_open(BlockDriverState *bs, int flags)
> > >>      bs->total_sectors = (int64_t)
> > >>          be16_to_cpu(footer->cyls) * footer->heads * footer->secs_per_cyl;
> > >>
> > >> -    if (bs->total_sectors >= 65535 * 16 * 255) {
> > >> +    /* Allow a maximum disk size of approximately 2 TB */
> > >> +    if (bs->total_sectors >= 65535LL * 255 * 255) {
> > >>          err = -EFBIG;
> > >>          goto fail;
> > >>      }
> > >> @@ -524,19 +525,27 @@ static coroutine_fn int vpc_co_write(BlockDriverState *bs, int64_t sector_num,
> > >>   * Note that the geometry doesn't always exactly match total_sectors but
> > >>   * may round it down.
> > >>   *
> > >> - * Returns 0 on success, -EFBIG if the size is larger than 127 GB
> > >> + * Returns 0 on success, -EFBIG if the size is larger than ~2 TB. Override
> > >> + * the hardware EIDE and ATA-2 limit of 16 heads (max disk size of 127 GB)
> > >> + * and instead allow up to 255 heads.
> > >>   */
> > >>  static int calculate_geometry(int64_t total_sectors, uint16_t* cyls,
> > >>                                uint8_t* heads, uint8_t* secs_per_cyl)
> > >>  {
> > >>      uint32_t cyls_times_heads;
> > >>
> > >> -    if (total_sectors > 65535 * 16 * 255)
> > >> +    /* Allow a maximum disk size of approximately 2 TB */
> > >> +    if (total_sectors > 65535LL * 255 * 255) {
> > >>          return -EFBIG;
> > >> +    }
> > >>
> > >>      if (total_sectors > 65535 * 16 * 63) {
> > >>          *secs_per_cyl = 255;
> > >> -        *heads = 16;
> > >> +        if (total_sectors > 65535 * 16 * 255) {
> > >> +            *heads = 255;
> > >> +        } else {
> > >> +            *heads = 16;
> > >> +        }
> > >>          cyls_times_heads = total_sectors / *secs_per_cyl;
> > >>      } else {
> > >>          *secs_per_cyl = 17;
> > >
> > >
> > >
> > >
> >



Re: [Qemu-devel] [PATCH] block: vpc support for ~2 TB disks

2012-11-15 Thread Thanos Makatos
I'm not sure I understand your question. In XenServer blktap2 we set CHS to 
65535*16*255 in the VHD metadata for disks larger than 127GB. We don't really 
care about these values, we just store them in the VHD metadata.

> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: 14 November 2012 16:36
> To: Thanos Makatos
> Cc: Stefano Stabellini; Charles Arnold; qemu-devel@nongnu.org; Kevin
> Wolf; stefa...@redhat.com; xen-de...@lists.xensource.com
> Subject: Re: [PATCH] block: vpc support for ~2 TB disks
> 
> Il 14/11/2012 17:25, Thanos Makatos ha scritto:
> > We don't use qemu's VHD driver in XenServer. Instead, we use blktap2
> > to create a block device in dom0 serving the VHD file in question,
> and
> > have qemu open that block device instead of the VHD file itself.
> 
> Yes, the question is how you handle disks bigger than 127GB, so that
> QEMU can do the same.
> 
> Paolo



RE: Out-of-Process Device Emulation session at KVM Forum 2020

2020-10-28 Thread Thanos Makatos
> -Original Message-
> From: Stefan Hajnoczi 
> Sent: 27 October 2020 15:14
> To: qemu-devel@nongnu.org
> Cc: Alex Bennée ; m...@redhat.com
> ; john.g.john...@oracle.com; Elena Ufimtseva
> ; kra...@redhat.com;
> jag.ra...@oracle.com; Thanos Makatos ;
> Felipe Franciosi ; Marc-André Lureau
> ; s...@redhat.com; David Gibson
> 
> Subject: Out-of-Process Device Emulation session at KVM Forum 2020
> 
> There will be a birds-of-a-feather session at KVM Forum, a chance for
> us to get together and discuss Out-of-Process Device Emulation.
> 
> Please send suggestions for the agenda!
> 
> These sessions are a good opportunity to reach agreement on topics that
> are hard to discuss via mailing lists.
> 
> Ideas:
>  * How will we decide that the protocol is stable? Can third-party
>applications like DPDK/SPDK use the protocol in the meantime?
>  * QEMU build system requirements: how to configure and build device
>emulator binaries?
>  * Common sandboxing solution shared between C and Rust-based binaries?
>minijail (https://github.com/google/minijail)? bubblewrap
>(https://github.com/containers/bubblewrap)? systemd-run?
> 
> Stefan

Here are a couple of issues we'd also like to talk about:

Fast switching from polling to interrupt-based notifications: when a single
process is emulating multiple devices then it might be more efficient to poll
instead of relying on interrupts for notifications. However, during periods when
the guests are mostly idle, polling might be unnecessary, so we'd like to be
able to switch to interrupt-based notifications at a low cost.

Device throttling during live migration: a device can easily dirty huge amounts
of guest RAM which results in live migration taking too long or making it hard
to estimate progress. Ideally, we'd like to be able to instruct an out-of-process
device emulator to make sure it won't dirty too many guest pages during a
specified window of time.



RE: Out-of-Process Device Emulation session at KVM Forum 2020

2020-10-28 Thread Thanos Makatos



> -Original Message-
> From: Qemu-devel  bounces+thanos.makatos=nutanix@nongnu.org> On Behalf Of Thanos
> Makatos
> Sent: 28 October 2020 09:32
> To: Stefan Hajnoczi ; qemu-devel@nongnu.org
> Cc: Elena Ufimtseva ;
> john.g.john...@oracle.com; m...@redhat.com ;
> jag.ra...@oracle.com; s...@redhat.com; kra...@redhat.com; Felipe
> Franciosi ; Marc-André Lureau
> ; Alex Bennée ;
> David Gibson 
> Subject: RE: Out-of-Process Device Emulation session at KVM Forum 2020
> 
> > -Original Message-
> > From: Stefan Hajnoczi 
> > Sent: 27 October 2020 15:14
> > To: qemu-devel@nongnu.org
> > Cc: Alex Bennée ; m...@redhat.com
> > ; john.g.john...@oracle.com; Elena Ufimtseva
> > ; kra...@redhat.com;
> > jag.ra...@oracle.com; Thanos Makatos ;
> > Felipe Franciosi ; Marc-André Lureau
> > ; s...@redhat.com; David Gibson
> > 
> > Subject: Out-of-Process Device Emulation session at KVM Forum 2020
> >
> > There will be a birds-of-a-feather session at KVM Forum, a chance for
> > us to get together and discuss Out-of-Process Device Emulation.
> >
> > Please send suggestions for the agenda!
> >
> > These sessions are a good opportunity to reach agreement on topics that
> > are hard to discuss via mailing lists.
> >
> > Ideas:
> >  * How will we decide that the protocol is stable? Can third-party
> >applications like DPDK/SPDK use the protocol in the meantime?
> >  * QEMU build system requirements: how to configure and build device
> >emulator binaries?
> >  * Common sandboxing solution shared between C and Rust-based binaries?
> >minijail (https://github.com/google/minijail)? bubblewrap
> >(https://github.com/containers/bubblewrap)? systemd-run?
> >
> > Stefan
> 
> Here are a couple of issues we'd also like to talk about:
> 
> Fast switching from polling to interrupt-based notifications: when a single
> process is emulating multiple devices then it might be more efficient to poll
> instead of relying on interrupts for notifications. However, during periods
> when
> the guests are mostly idle, polling might be unnecessary, so we'd like to be
> able to switch to interrupt-based notifications at a low cost.

Correction: there are no interrupts involved here, just guest to device 
notifications.

> 
> Device throttling during live migration: a device can easily dirty huge 
> amounts
> of guest RAM which results in live migration taking too long or making it hard
> to estimate progress. Ideally, we'd like to be able to instruct an out-of-
> process
> device emulator to make sure it won't dirty too many guest pages during a
> specified window of time.




[PATCH v5] introduce vfio-user protocol specification

2020-10-28 Thread Thanos Makatos
This patch introduces the vfio-user protocol specification (formerly
known as VFIO-over-socket), which is designed to allow devices to be
emulated outside QEMU, in a separate process. vfio-user reuses the
existing VFIO defines, structs and concepts.

It has been earlier discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson 
Signed-off-by: Thanos Makatos 

---

Changed since v1:
  * fix coding style issues
  * update MAINTAINERS for VFIO-over-socket
  * add vfio-over-socket to ToC

Changed since v2:
  * fix whitespace

Changed since v3:
  * rename protocol to vfio-user
  * add table of contents
  * fix Unicode problems
  * fix typos and various reStructuredText issues
  * various stylistic improvements
  * add backend program conventions
  * rewrite part of intro, drop QEMU-specific stuff
  * drop QEMU-specific paragraph about implementation
  * explain that passing of FDs isn't necessary
  * minor improvements in the VFIO section
  * various text substitutions for the sake of consistency
  * drop paragraph about client and server, already explained in intro
  * drop device ID
  * drop type from version
  * elaborate on request concurrency
  * convert some inessential paragraphs into notes
  * explain why some existing VFIO defines cannot be reused
  * explain how to make changes to the protocol
  * improve text of DMA map
  * reword comment about existing VFIO commands
  * add reference to Version section
  * reset device on disconnection
  * reword live migration section
  * replace sys/vfio.h with linux/vfio.h
  * drop reference to iovec
  * use argz the same way it is used in VFIO
  * add type field in header for clarity

Changed since v4:
  * introduce support for live migration as defined in include/uapi/linux/vfio.h
  * introduce 'max_fds' and 'migration' capabilities:
  * remove 'index' from VFIO_USER_DEVICE_GET_IRQ_INFO
  * fix minor typos and reworded some text for clarity

You can focus on v4 to v5 changes by cloning my fork
(https://github.com/tmakatos/qemu) and doing:

git diff refs/tags/vfio-user/v4 refs/heads/vfio-user/v5
---
 MAINTAINERS  |6 +
 docs/devel/index.rst |1 +
 docs/devel/vfio-user.rst | 1552 ++
 3 files changed, 1559 insertions(+)
 create mode 100644 docs/devel/vfio-user.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 7e442b5247..3611f9e365 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1754,6 +1754,12 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s3...@nongnu.org
 
+vfio-user
+M: John G Johnson 
+M: Thanos Makatos 
+S: Supported
+F: docs/devel/vfio-user.rst
+
 vhost
 M: Michael S. Tsirkin 
 S: Supported
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index 77baae5c77..7c7740a096 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -34,3 +34,4 @@ Contents:
clocks
qom
block-coroutine-wrapper
+   vfio-user
diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
new file mode 100644
index 00..d8664e864f
--- /dev/null
+++ b/docs/devel/vfio-user.rst
@@ -0,0 +1,1552 @@
+.. include:: 
+
+
+vfio-user Protocol Specification
+
+
+
+Version_ 0.1
+
+
+.. contents:: Table of Contents
+
+Introduction
+
+vfio-user is a protocol that allows a device to be emulated in a separate
+process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
+of a generic VFIO device type, living inside the VMM, which we call the client,
+and the core device implementation, living outside the VMM, which we call the
+server.
+
+The `Linux VFIO ioctl interface
+<https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_ has
+been chosen as the base for this protocol for the following reasons:
+
+1) It is a mature and stable API, backed by an extensively used framework.
+2) The existing VFIO client implementation in QEMU (qemu/hw/vfio/) can be
+   largely reused.
+
+.. Note::
+   In a proof of concept implementation it has been demonstrated that using VFIO
+   over a UNIX domain socket is a viable option. vfio-user is designed with
+   QEMU in mind, however it could be used by other client applications. The
+   vfio-user protocol does not require that QEMU's VFIO client implementation
+   is used in QEMU.
+
+None of the VFIO kernel modules are required for supporting the protocol,
+neither in the client nor the server, only the source header files are used.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
+chosen because file descriptors can be trivially sent over it, which in turn
+allows:
+
+* Sharing of client memory for DMA with the server.
+* Sharing of server memory with the client for fast MMIO.
+* Efficient sharing of eventfd's for triggering interrupts.

RE: [PATCH v5] introduce vfio-user protocol specification

2020-10-28 Thread Thanos Makatos
FYI here's v5 of the vfio-user protocol, my --cc in git send-email got messed 
up somehow

> -Original Message-
> From: Qemu-devel  bounces+thanos.makatos=nutanix@nongnu.org> On Behalf Of Thanos
> Makatos
> Sent: 28 October 2020 16:10
> To: qemu-devel@nongnu.org
> Subject: [PATCH v5] introduce vfio-user protocol specification
> 
> This patch introduces the vfio-user protocol specification (formerly
> known as VFIO-over-socket), which is designed to allow devices to be
> emulated outside QEMU, in a separate process. vfio-user reuses the
> existing VFIO defines, structs and concepts.
> 
> It has been earlier discussed as an RFC in:
> "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> 
> Signed-off-by: John G Johnson 
> Signed-off-by: Thanos Makatos 
> 
> ---
> 
> Changed since v1:
>   * fix coding style issues
>   * update MAINTAINERS for VFIO-over-socket
>   * add vfio-over-socket to ToC
> 
> Changed since v2:
>   * fix whitespace
> 
> Changed since v3:
>   * rename protocol to vfio-user
>   * add table of contents
>   * fix Unicode problems
>   * fix typos and various reStructuredText issues
>   * various stylistic improvements
>   * add backend program conventions
>   * rewrite part of intro, drop QEMU-specific stuff
>   * drop QEMU-specific paragraph about implementation
>   * explain that passing of FDs isn't necessary
>   * minor improvements in the VFIO section
>   * various text substitutions for the sake of consistency
>   * drop paragraph about client and server, already explained in intro
>   * drop device ID
>   * drop type from version
>   * elaborate on request concurrency
>   * convert some inessential paragraphs into notes
>   * explain why some existing VFIO defines cannot be reused
>   * explain how to make changes to the protocol
>   * improve text of DMA map
>   * reword comment about existing VFIO commands
>   * add reference to Version section
>   * reset device on disconnection
>   * reword live migration section
>   * replace sys/vfio.h with linux/vfio.h
>   * drop reference to iovec
>   * use argz the same way it is used in VFIO
>   * add type field in header for clarity
> 
> Changed since v4:
>   * introduce support for live migration as defined in 
> include/uapi/linux/vfio.h
>   * introduce 'max_fds' and 'migration' capabilities:
>   * remove 'index' from VFIO_USER_DEVICE_GET_IRQ_INFO
>   * fix minor typos and reworded some text for clarity
> 
> You can focus on v4 to v5 changes by cloning my fork
> (https://github.com/tmakatos/qemu) and doing:
> 
>   git diff refs/tags/vfio-user/v4 refs/heads/vfio-user/v5
> ---
>  MAINTAINERS  |6 +
>  docs/devel/index.rst |1 +
>  docs/devel/vfio-user.rst | 1552
> ++
>  3 files changed, 1559 insertions(+)
>  create mode 100644 docs/devel/vfio-user.rst
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7e442b5247..3611f9e365 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1754,6 +1754,12 @@ F: hw/vfio/ap.c
>  F: docs/system/s390x/vfio-ap.rst
>  L: qemu-s3...@nongnu.org
> 
> +vfio-user
> +M: John G Johnson 
> +M: Thanos Makatos 
> +S: Supported
> +F: docs/devel/vfio-user.rst
> +
>  vhost
>  M: Michael S. Tsirkin 
>  S: Supported
> diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> index 77baae5c77..7c7740a096 100644
> --- a/docs/devel/index.rst
> +++ b/docs/devel/index.rst
> @@ -34,3 +34,4 @@ Contents:
> clocks
> qom
> block-coroutine-wrapper
> +   vfio-user
> diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
> new file mode 100644
> index 00..d8664e864f
> --- /dev/null
> +++ b/docs/devel/vfio-user.rst
> @@ -0,0 +1,1552 @@
> +.. include:: 
> +
> +
> +vfio-user Protocol Specification
> +
> +
> +
> +Version_ 0.1
> +
> +
> +.. contents:: Table of Contents
> +
> +Introduction
> +
> +vfio-user is a protocol that allows a device to be emulated in a separate
> +process outside of a Virtual Machine Monitor (VMM). vfio-user devices
> consist
> +of a generic VFIO device type, living inside the VMM, which we call the
> client,
> +and the core device implementation, living outside the VMM, which we call
> the
>

RE: [PATCH v5] introduce vfio-user protocol specification

2020-11-02 Thread Thanos Makatos
> -Original Message-
> From: John Levon 
> Sent: 30 October 2020 17:03
> To: Thanos Makatos 
> Cc: qemu-devel@nongnu.org; benjamin.wal...@intel.com; Elena Ufimtseva
> ; tomassetti.and...@gmail.com;
> john.g.john...@oracle.com; jag.ra...@oracle.com; Swapnil Ingle
> ; james.r.har...@intel.com;
> konrad.w...@oracle.com; yuvalkash...@gmail.com; dgilb...@redhat.com;
> Raphael Norwitz ; ism...@linux.com;
> alex.william...@redhat.com; kanth.ghatr...@oracle.com; Stefan Hajnoczi
> ; Felipe Franciosi ;
> xiuchun...@intel.com; Marc-André Lureau
> ; tina.zh...@intel.com;
> changpeng@intel.com
> Subject: Re: [PATCH v5] introduce vfio-user protocol specification
> 
> On Wed, Oct 28, 2020 at 04:41:31PM +, Thanos Makatos wrote:
> 
> > FYI here's v5 of the vfio-user protocol, my --cc in git send-email got 
> > messed
> up somehow
> 
> Hi Thanos, this looks great, I just had some minor questions below.
> 
> > Command Concurrency
> > ---
> > A client may pipeline multiple commands without waiting for previous
> command
> > replies.  The server will process commands in the order they are received.
> > A consequence of this is if a client issues a command with the *No_reply*
> bit,
> > then subseqently issues a command without *No_reply*, the older
> command will
> > have been processed before the reply to the younger command is sent by
> the
> > server.  The client must be aware of the device's capability to process
> concurrent
> > commands if pipelining is used.  For example, pipelining allows multiple
> client
> > threads to concurrently access device memory; the client must ensure
> these accesses
> > obey device semantics.
> >
> > An example is a frame buffer device, where the device may allow
> concurrent access
> > to different areas of video memory, but may have indeterminate behavior
> if concurrent
> > acceses are performed to command or status registers.
> 
> Is it valid for an unrelated server->client message to appear in between a
> client->server request/reply, or not? And vice versa? Either way, seems
> useful
> for the spec to say.

Yes, it's valid. I don't see a reason why it shouldn't. I'll update the text to
make it explicit.

> 
> > || +-++ |
> > || | Bit | Definition | |
> > || +=++ |
> > || | 0-3 | Type   | |
> > || +-++ |
> > || | 4   | No_reply   | |
> > || +-++ |
> > || | 5   | Error  | |
> > || +-++ |
> > +++-+
> > | Error  | 12 | 4   |
> > +++-+
> >
> > * *Message ID* identifies the message, and is echoed in the command's
> reply message.
> 
> Is it valid to re-use an ID? When/when not?

Yes, it's valid to re-use an ID. I suppose it's also valid, though should be
discouraged, to have multiple outstanding requests with the same ID, even though
that probably doesn't make much sense and will most likely break things. The ID
belongs purely to whoever sends the request, the receiver simply echoes it
back in the response and must make no assumptions about its uniqueness. I think
it's simpler to have it this way, otherwise implementations might start abusing
it or rely on it too much. If there's no objection to these semantics I'll
update the text to clarify.


> 
> >   * *Error* in a reply message indicates the command being acknowledged
> had
> > an error. In this case, the *Error* field will be valid.
> >
> > * *Error* in a reply message is a UNIX errno value. It is reserved in a
> command message.
> 
> I'm not quite following why we need a bit flag and an error field. Do you
> anticipate a failure, but with errno==0?

Indeed, the Error bit seems redundant. John, is there a reason why we need the
error bit?

> 
> > VFIO_USER_VERSION
> > -
> >
> > +--++
> > | Message size | 16 + version length|
> 
> Terminating NUL included?

Good point, in the current libmuser implementation we do include the terminating
NUL, however it's not necessary. I don't have a strong opinion on this one, I'll
update the text to include the terminating NUL just to be correct for now,
however if there's a good argument for/against it we should definitely consider
it.

> 
> > +--++--

RE: [PATCH v5] introduce vfio-user protocol specification

2020-11-02 Thread Thanos Makatos



> -Original Message-
> From: Qemu-devel  bounces+thanos.makatos=nutanix@nongnu.org> On Behalf Of John
> Levon
> Sent: 02 November 2020 11:41
> To: Thanos Makatos 
> Cc: benjamin.wal...@intel.com; Elena Ufimtseva
> ; jag.ra...@oracle.com;
> james.r.har...@intel.com; Swapnil Ingle ;
> john.g.john...@oracle.com; yuvalkash...@gmail.com;
> konrad.w...@oracle.com; tina.zh...@intel.com; qemu-devel@nongnu.org;
> dgilb...@redhat.com; Marc-André Lureau
> ; ism...@linux.com;
> alex.william...@redhat.com; Stefan Hajnoczi ;
> Felipe Franciosi ; xiuchun...@intel.com;
> tomassetti.and...@gmail.com; changpeng@intel.com; Raphael Norwitz
> ; kanth.ghatr...@oracle.com
> Subject: Re: [PATCH v5] introduce vfio-user protocol specification
> 
> On Mon, Nov 02, 2020 at 11:29:23AM +, Thanos Makatos wrote:
> 
> > >
> +==++=
> > > ==+
> > > > | version  | object | ``{"major": , "minor": }``
> |
> > > > |  ||   
> > > > |
> > > > |  || Version supported by the sender, e.g. "0.1".  
> > > > |
> > >
> > > It seems quite unlikely but this should specify it's strings not floating 
> > > point
> > > values maybe?
> > >
> > > Definitely applies to max_fds too.
> >
> > major and minor are JSON numbers and specifically integers.
> 
> It is debatable as to whether there is such a thing as a JSON integer :)

AFAIK there isn't.

> 
> > The rationale behind this is to simplify parsing. Is specifying that
> > major/minor/max_fds should be an interger sufficient to clear any
> vagueness
> > here?
> 
> I suppose that's OK as long as we never want a 0.1.1 or whatever. I'm not
> sure
> it simplifies parsing, but maybe it does.

Now that you mention it, why preclude 0.1.1? IIUC the whole point of not
stating the version as a float is to simply have this flexibility in the future.
You're right in your earlier suggestion to explicitly state major/minor as
strings.

> 
> > > > Versioning and Feature Support
> > > > ^^
> > > > Upon accepting a connection, the server must send a
> VFIO_USER_VERSION
> > > message
> > > > proposing a protocol version and a set of capabilities. The client
> compares
> > > > these with the versions and capabilities it supports and sends a
> > > > VFIO_USER_VERSION reply according to the following rules.
> > >
> > > I'm curious if there was a specific reason it's this way around, when it
> seems
> > > more natural for the client to propose first, and the server to reply?
> >
> > I'm not aware of any specific reason.
> 
> So can we switch it now so the initial setup is a send/recv too?

I'm fine with that but would first like to hear back from John in case he 
objects.



[PATCH v4] introduce vfio-user protocol specification

2020-09-15 Thread Thanos Makatos
This patch introduces the vfio-user protocol specification (formerly
known as VFIO-over-socket), which is designed to allow devices to be
emulated outside QEMU, in a separate process. vfio-user reuses the
existing VFIO defines, structs and concepts.

It has been earlier discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson 
Signed-off-by: Thanos Makatos 

---

Changed since v1:
  * fix coding style issues
  * update MAINTAINERS for VFIO-over-socket
  * add vfio-over-socket to ToC

Changed since v2:
  * fix whitespace

Changed since v3:
  * rename protocol to vfio-user
  * add table of contents
  * fix Unicode problems
  * fix typos and various reStructuredText issues
  * various stylistic improvements
  * add backend program conventions
  * rewrite part of intro, drop QEMU-specific stuff
  * drop QEMU-specific paragraph about implementation
  * explain that passing of FDs isn't necessary
  * minor improvements in the VFIO section
  * various text substitutions for the sake of consistency
  * drop paragraph about client and server, already explained in intro
  * drop device ID
  * drop type from version
  * elaborate on request concurrency
  * convert some inessential paragraphs into notes
  * explain why some existing VFIO defines cannot be reused
  * explain how to make changes to the protocol
  * improve text of DMA map
  * reword comment about existing VFIO commands
  * add reference to Version section
  * reset device on disconnection
  * reword live migration section
  * replace sys/vfio.h with linux/vfio.h
  * drop reference to iovec
  * use argz the same way it is used in VFIO
  * add type field in header for clarity

Signed-off-by: John G Johnson 
Signed-off-by: Thanos Makatos 
---
 MAINTAINERS  |6 +
 docs/devel/index.rst |1 +
 docs/devel/vfio-user.rst | 1191 ++
 3 files changed, 1198 insertions(+)
 create mode 100644 docs/devel/vfio-user.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 030faf0..a7f4b8f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1732,6 +1732,12 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s3...@nongnu.org
 
+vfio-user
+M: John G Johnson 
+M: Thanos Makatos 
+S: Supported
+F: docs/devel/vfio-user.rst
+
 vhost
 M: Michael S. Tsirkin 
 S: Supported
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index ae6eac7..e6a89c2 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -30,3 +30,4 @@ Contents:
reset
s390-dasd-ipl
clocks
+   vfio-user
diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
new file mode 100644
index 000..0217270
--- /dev/null
+++ b/docs/devel/vfio-user.rst
@@ -0,0 +1,1191 @@
+
+vfio-user Protocol Specification
+
+
+
+Version_ 0.1
+
+
+.. contents:: Table of Contents
+
+Introduction
+
+vfio-user is a protocol that allows a device to be emulated in a separate
+process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
+of a generic VFIO device type, living inside the VMM, which we call the client,
+and the core device implementation, living outside the VMM, which we call the
+server.
+
+The `Linux VFIO ioctl interface
+<https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_ has
+been chosen as the base for this protocol for the following reasons:
+
+1) It is a mature and stable API, backed by an extensively used framework.
+2) The existing VFIO client implementation in QEMU (qemu/hw/vfio/) can be
+   largely reused.
+
+.. Note::
+   In a proof of concept implementation it has been demonstrated that using VFIO
+   over a UNIX domain socket is a viable option. vfio-user is designed with
+   QEMU in mind, however it could be used by other client applications. The
+   vfio-user protocol does not require that QEMU's VFIO client implementation
+   is used in QEMU.
+
+None of the VFIO kernel modules are required for supporting the protocol,
+neither in the client nor the server, only the source header files are used.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
+chosen because file descriptors can be trivially sent over it, which in turn
+allows:
+
+* Sharing of client memory for DMA with the server.
+* Sharing of server memory with the client for fast MMIO.
+* Efficient sharing of eventfd's for triggering interrupts.
+
+Other socket types could be used which allow the server to run in a separate
+guest in the same host (AF_VSOCK) or remotely (AF_INET). Theoretically the
+underlying transport does not necessarily have to be a socket, however we do
+not examine such alternatives. In this protocol version we focus on using a
+UNIX domain socket and introduce basic support for the other two types of
+sockets 

RE: [PATCH v4] introduce vfio-user protocol specification

2020-09-28 Thread Thanos Makatos



> -Original Message-
> From: Stefan Hajnoczi 
> Sent: 24 September 2020 09:22
> To: Thanos Makatos 
> Cc: qemu-devel@nongnu.org; Michael S. Tsirkin ;
> alex.william...@redhat.com; benjamin.wal...@intel.com;
> elena.ufimts...@oracle.com; jag.ra...@oracle.com; Swapnil Ingle
> ; james.r.har...@intel.com;
> konrad.w...@oracle.com; Raphael Norwitz ;
> marcandre.lur...@redhat.com; kanth.ghatr...@oracle.com; Felipe
> Franciosi ; tina.zh...@intel.com;
> changpeng@intel.com; dgilb...@redhat.com;
> tomassetti.and...@gmail.com; yuvalkash...@gmail.com;
> ism...@linux.com; xiuchun...@intel.com; John G Johnson
> 
> Subject: Re: [PATCH v4] introduce vfio-user protocol specification
> 
> On Tue, Sep 15, 2020 at 07:29:17AM -0700, Thanos Makatos wrote:
> > This patch introduces the vfio-user protocol specification (formerly
> > known as VFIO-over-socket), which is designed to allow devices to be
> > emulated outside QEMU, in a separate process. vfio-user reuses the
> > existing VFIO defines, structs and concepts.
> >
> > It has been earlier discussed as an RFC in:
> > "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> >
> > Signed-off-by: John G Johnson 
> > Signed-off-by: Thanos Makatos 
> 
> The approach looks promising. It's hard to know what changes will be
> required when this is implemented, so let's not worry about getting
> every detail of the spec right.
> 
> Now that there is a spec to start from, the next step is patches
> implementing --device vfio-user-pci,chardev= in
> hw/vfio-user/pci.c (mirroring hw/vfio/).
> 
> It should be accompanied by a test in tests/. PCI-level testing APIS for
> BARs, configuration space, interrupts, etc are available in
> tests/qtest/libqos/pci.h. The test case needs to include a vfio-user
> device backend interact with QEMU's vfio-user-pci implementation.

We plan to use a libmuser-based backend for testing. This, I suppose, will
make libmuser a dependency of QEMU (either as a submodule or as a library),
which for now can be disabled in the default configuration. Is this acceptable?

> 
> I think this spec can be merged in docs/devel/ now and marked as
> "subject to change" (not a stable public interface).

Great!

> 
> After the details have been proven and any necessary changes have been
> made the spec can be promoted to docs/interop/ as a stable public
> interface. This gives the freedom to make changes discovered when
> figuring out issues like disconnect/reconnect, live migration, etc that
> can be hard to get right without a working implementation.
> 
> Does this approach sound good?

Yes.

> 
> Also please let us know who is working on what so additional people can
> get involved in areas that need work!

Swapnil and I will be working on libmuser and the test in QEMU, John and
the mp-qemu folks will be working on the patches for implementing
--device vfio-user-pci.

> 
> Stefan



DMA region abruptly removed from PCI device

2020-07-06 Thread Thanos Makatos
We have an issue when using the VFIO-over-socket libmuser PoC
(https://www.mail-archive.com/qemu-devel@nongnu.org/msg692251.html) instead of
the VFIO kernel module: we notice that DMA regions used by the emulated device
can be abruptly removed while the device is still using them.

The PCI device we've implemented is an NVMe controller using SPDK, so it polls
the submission queues for new requests. We use the latest SeaBIOS where it tries
to boot from the NVMe controller. Several DMA regions are registered
(VFIO_IOMMU_MAP_DMA) and then the admin and a submission queues are created.
From this point SPDK polls both queues. Then, the DMA region where the
submission queue lies is removed (VFIO_IOMMU_UNMAP_DMA) and then re-added at the
same IOVA but at a different offset. SPDK crashes soon after as it accesses
invalid memory. There is no other event (e.g. some PCI config space or NVMe
register write) happening between unmapping and mapping the DMA region. My guess
is that this behavior is legitimate and that this is solved in the VFIO kernel
module by releasing the DMA region only after all references to it have been
released, which is handled by vfio_pin/unpin_pages, correct? If this is the case
then I suppose we need to implement the same logic in libmuser, but I just want
to make sure I'm not missing anything as this is a substantial change.



RE: Inter-VM device emulation (call on Mon 20th July 2020)

2020-07-15 Thread Thanos Makatos



> -Original Message-
> From: kvm-ow...@vger.kernel.org  On
> Behalf Of Stefan Hajnoczi
> Sent: 15 July 2020 12:24
> To: Nikos Dragazis ; Jan Kiszka
> 
> Cc: Michael S. Tsirkin ; Thanos Makatos
> ; John G. Johnson
> ; Andra-Irina Paraschiv
> ; Alexander Graf ; qemu-
> de...@nongnu.org; k...@vger.kernel.org; Maxime Coquelin
> ; Alex Bennée 
> Subject: Inter-VM device emulation (call on Mon 20th July 2020)
> 
> Hi,
> Several projects are underway to create an inter-VM device emulation
> interface:
> 
>  * ivshmem v2
>https://www.mail-archive.com/qemu-devel@nongnu.org/msg706465.html
> 
>A PCI device that provides shared-memory communication between VMs.
>This device already exists but is limited in its current form. The
>"v2" project updates IVSHMEM's capabilities and makes it suitable as
>a VIRTIO transport.
> 
>Jan Kiszka is working on this and has posted specs for review.
> 
>  * virtio-vhost-user
>https://www.mail-archive.com/virtio-dev@lists.oasis-
> open.org/msg06429.html
> 
>A VIRTIO device that transports the vhost-user protocol. Allows
>vhost-user device emulation to be implemented by another VM.
> 
>Nikos Dragazis is working on this with QEMU, DPDK, and VIRTIO patches
>posted.
> 
>  * VFIO-over-socket
>https://github.com/tmakatos/qemu/blob/master/docs/devel/vfio-over-
> socket.rst
> 
>Similar to the vhost-user protocol in spirit but for any PCI device.
>Uses the Linux VFIO ioctl API as the protocol instead of vhost.
> 
>It doesn't have a virtio-vhost-user equivalent yet, but the same
>approach could be applied to VFIO-over-socket too.
> 
>Thanos Makatos and John G. Johnson are working on this. The draft
>spec is available.
> 
> Let's have a call to figure out:
> 
> 1. What is unique about these approaches and how do they overlap?
> 2. Can we focus development and code review efforts to get something
>merged sooner?
> 
> Jan and Nikos: do you have time to join on Monday, 20th of July at 15:00
> UTC?
> https://www.timeanddate.com/worldclock/fixedtime.html?iso=20200720T1
> 500
> 
> Video call URL: https://bluejeans.com/240406010
> 
> It would be nice if Thanos and/or JJ could join the call too. Others
> welcome too (feel free to forward this email)!

Sure!

> 
> Stefan



[PATCH] introduce VFIO-over-socket protocol specification

2020-07-16 Thread Thanos Makatos
This patch introduces the VFIO-over-socket protocol specification, which
is designed to allow devices to be emulated outside QEMU, in a separate
process. VFIO-over-socket reuses the existing VFIO defines, structs and
concepts.

It has been earlier discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson 
Signed-off-by: Thanos Makatos 
---
 docs/devel/vfio-over-socket.rst | 1135 +++
 1 files changed, 1135 insertions(+), 0 deletions(-)
 create mode 100644 docs/devel/vfio-over-socket.rst

diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-socket.rst
new file mode 100644
index 000..723b944
--- /dev/null
+++ b/docs/devel/vfio-over-socket.rst
@@ -0,0 +1,1135 @@
+***
+VFIO-over-socket Protocol Specification
+***
+
+Version 0.1
+
+Introduction
+
+VFIO-over-socket, also known as vfio-user, is a protocol that allows a device
+to be virtualized in a separate process outside of QEMU. VFIO-over-socket
+devices consist of a generic VFIO device type, living inside QEMU, which we
+call the client, and the core device implementation, living outside QEMU, which
+we call the server. VFIO-over-socket can be the main transport mechanism for
+multi-process QEMU, however it can be used by other applications offering
+device virtualization. Explaining the advantages of a
+disaggregated/multi-process QEMU, and device virtualization outside QEMU in
+general, is beyond the scope of this document.
+
+This document focuses on specifying the VFIO-over-socket protocol. VFIO has
+been chosen for the following reasons:
+
+1) It is a mature and stable API, backed by an extensively used framework.
+2) The existing VFIO client implementation (qemu/hw/vfio/) can be largely
+   reused.
+
+In a proof of concept implementation it has been demonstrated that using VFIO
+over a UNIX domain socket is a viable option. VFIO-over-socket is designed with
+QEMU in mind, however it could be used by other client applications. The
+VFIO-over-socket protocol does not require that QEMU's VFIO client
+implementation is used in QEMU. None of the VFIO kernel modules are required
+for supporting the protocol, neither in the client nor the server, only the
+source header files are used.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
+chosen because we can trivially send file descriptors over it, which in turn
+allows:
+
+* Sharing of guest memory for DMA with the virtual device process.
+* Sharing of virtual device memory with the guest for fast MMIO.
+* Efficient sharing of eventfd's for triggering interrupts.
+
+However, other socket types could be used which would allow the virtual device
+process to run in a separate guest in the same host (AF_VSOCK) or remotely
+(AF_INET). Theoretically the underlying transport doesn't necessarily have to
+be a socket, however we don't examine such alternatives. In this document we
+focus on using a UNIX domain socket and introduce basic support for the other
+two types of sockets without considering performance implications.
+
+This document does not yet describe any internal details of the server-side
+implementation, however QEMU's VFIO client implementation will have to be
+adapted according to this protocol in order to support VFIO-over-socket virtual
+devices.
+
+VFIO
+
+VFIO is a framework that allows a physical device to be securely passed through
+to a user space process; the kernel does not drive the device at all.
+Typically, the user space process is a VM and the device is passed through to
+it in order to achieve high performance. VFIO provides an API and the required
+functionality in the kernel. QEMU has adopted VFIO to allow a guest virtual
+machine to directly access physical devices, instead of emulating them in
+software.
+
+VFIO-over-socket reuses the core VFIO concepts defined in its API, but
+implements them as messages to be sent over a UNIX-domain socket. It does not
+change the kernel-based VFIO in any way, in fact none of the VFIO kernel
+modules need to be loaded to use VFIO-over-socket. It is also possible for QEMU
+to concurrently use the current kernel-based VFIO for one guest device, and use
+VFIO-over-socket for another device in the same guest.
+
+VFIO Device Model
+-
+A device under VFIO presents a standard VFIO model to the user process. Many
+of the VFIO operations in the existing kernel model use the ioctl() system
+call, and references to the existing model are called the ioctl()
+implementation in this document.
+
+The following sections describe the set of messages that implement the VFIO
+device model over a UNIX domain socket. In many cases, the messages are direct
+translations of data structures used in the ioctl() implementation.

[PATCH v2] introduce VFIO-over-socket protocol specification

2020-07-17 Thread Thanos Makatos
This patch introduces the VFIO-over-socket protocol specification, which
is designed to allow devices to be emulated outside QEMU, in a separate
process. VFIO-over-socket reuses the existing VFIO defines, structs and
concepts.

It has been earlier discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson 
Signed-off-by: Thanos Makatos 

---

Changed since v1:
  * fix coding style issues
  * update MAINTAINERS for VFIO-over-socket
  * add vfio-over-socket to ToC

Regarding the build failure, I have not been able to reproduce it locally
using the docker image on my Debian 10.4 machine.
---
 MAINTAINERS |6 +
 docs/devel/index.rst|1 +
 docs/devel/vfio-over-socket.rst | 1135 +++
 3 files changed, 1142 insertions(+)
 create mode 100644 docs/devel/vfio-over-socket.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 030faf0..bb81590 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1732,6 +1732,12 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s3...@nongnu.org
 
+VFIO-over-socket
+M: John G Johnson 
+M: Thanos Makatos 
+S: Supported
+F: docs/devel/vfio-over-socket.rst
+
 vhost
 M: Michael S. Tsirkin 
 S: Supported
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index ae6eac7..0439460 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -30,3 +30,4 @@ Contents:
reset
s390-dasd-ipl
clocks
+   vfio-over-socket
diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-socket.rst
new file mode 100644
index 000..9ac0282
--- /dev/null
+++ b/docs/devel/vfio-over-socket.rst
@@ -0,0 +1,1135 @@
+***
+VFIO-over-socket Protocol Specification
+***
+
+Version 0.1
+
+Introduction
+
+VFIO-over-socket, also known as vfio-user, is a protocol that allows a device
+to be virtualized in a separate process outside of QEMU. VFIO-over-socket
+devices consist of a generic VFIO device type, living inside QEMU, which we
+call the client, and the core device implementation, living outside QEMU, which
+we call the server. VFIO-over-socket can be the main transport mechanism for
+multi-process QEMU, however it can be used by other applications offering
+device virtualization. Explaining the advantages of a
+disaggregated/multi-process QEMU, and device virtualization outside QEMU in
+general, is beyond the scope of this document.
+
+This document focuses on specifying the VFIO-over-socket protocol. VFIO has
+been chosen for the following reasons:
+
+1) It is a mature and stable API, backed by an extensively used framework.
+2) The existing VFIO client implementation (qemu/hw/vfio/) can be largely
+   reused.
+
+In a proof of concept implementation it has been demonstrated that using VFIO
+over a UNIX domain socket is a viable option. VFIO-over-socket is designed with
+QEMU in mind, however it could be used by other client applications. The
+VFIO-over-socket protocol does not require that QEMU's VFIO client
+implementation is used in QEMU. None of the VFIO kernel modules are required
+for supporting the protocol, neither in the client nor the server, only the
+source header files are used.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
+chosen because we can trivially send file descriptors over it, which in turn
+allows:
+
+* Sharing of guest memory for DMA with the virtual device process.
+* Sharing of virtual device memory with the guest for fast MMIO.
+* Efficient sharing of eventfd's for triggering interrupts.
+
+However, other socket types could be used which would allow the virtual device
+process to run in a separate guest in the same host (AF_VSOCK) or remotely
+(AF_INET). Theoretically the underlying transport doesn't necessarily have to
+be a socket, however we don't examine such alternatives. In this document we
+focus on using a UNIX domain socket and introduce basic support for the other
+two types of sockets without considering performance implications.
+
+This document does not yet describe any internal details of the server-side
+implementation, however QEMU's VFIO client implementation will have to be
+adapted according to this protocol in order to support VFIO-over-socket virtual
+devices.
+
+VFIO
+
+VFIO is a framework that allows a physical device to be securely passed through
+to a user space process; the kernel does not drive the device at all.
+Typically, the user space process is a VM and the device is passed through to
+it in order to achieve high performance. VFIO provides an API and the required
+functionality in the kernel. QEMU has adopted VFIO to allow a guest virtual
+machine to directly access physical devices, instead of emulating them in
+software.
+
+VFIO-over-socket reuses the core VFIO concepts defined in its API, but
+implements them as messages to be sent over a UNIX-domain socket. It does not
+change the kernel-based VFIO in any way, in fact none of the VFIO kernel
+modules need to be loaded to use VFIO-over-socket. It is also possible for QEMU
+to concurrently use the current kernel-based VFIO for one guest device, and use
+VFIO-over-socket for another device in the same guest.

[PATCH v3] introduce VFIO-over-socket protocol specification

2020-07-17 Thread Thanos Makatos
This patch introduces the VFIO-over-socket protocol specification, which
is designed to allow devices to be emulated outside QEMU, in a separate
process. VFIO-over-socket reuses the existing VFIO defines, structs and
concepts.

It has been earlier discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"

Signed-off-by: John G Johnson 
Signed-off-by: Thanos Makatos 

---

Changed since v1:
  * fix coding style issues
  * update MAINTAINERS for VFIO-over-socket
  * add vfio-over-socket to ToC

Changed since v2:
  * fix whitespace

Regarding the build failure, I have not been able to reproduce it locally
using the docker image on my Debian 10.4 machine.
---
 MAINTAINERS |6 +
 docs/devel/index.rst|1 +
 docs/devel/vfio-over-socket.rst | 1135 +++
 3 files changed, 1142 insertions(+)
 create mode 100644 docs/devel/vfio-over-socket.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 030faf0..bb81590 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1732,6 +1732,12 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s3...@nongnu.org
 
+VFIO-over-socket
+M: John G Johnson 
+M: Thanos Makatos 
+S: Supported
+F: docs/devel/vfio-over-socket.rst
+
 vhost
 M: Michael S. Tsirkin 
 S: Supported
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index ae6eac7..0439460 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -30,3 +30,4 @@ Contents:
reset
s390-dasd-ipl
clocks
+   vfio-over-socket
diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-socket.rst
new file mode 100644
index 000..b474f23
--- /dev/null
+++ b/docs/devel/vfio-over-socket.rst
@@ -0,0 +1,1135 @@
+***
+VFIO-over-socket Protocol Specification
+***
+
+Version 0.1
+
+Introduction
+
+VFIO-over-socket, also known as vfio-user, is a protocol that allows a device
+to be virtualized in a separate process outside of QEMU. VFIO-over-socket
+devices consist of a generic VFIO device type, living inside QEMU, which we
+call the client, and the core device implementation, living outside QEMU, which
+we call the server. VFIO-over-socket can be the main transport mechanism for
+multi-process QEMU, however it can be used by other applications offering
+device virtualization. Explaining the advantages of a
+disaggregated/multi-process QEMU, and device virtualization outside QEMU in
+general, is beyond the scope of this document.
+
+This document focuses on specifying the VFIO-over-socket protocol. VFIO has
+been chosen for the following reasons:
+
+1) It is a mature and stable API, backed by an extensively used framework.
+2) The existing VFIO client implementation (qemu/hw/vfio/) can be largely
+   reused.
+
+In a proof of concept implementation it has been demonstrated that using VFIO
+over a UNIX domain socket is a viable option. VFIO-over-socket is designed with
+QEMU in mind, however it could be used by other client applications. The
+VFIO-over-socket protocol does not require that QEMU's VFIO client
+implementation is used in QEMU. None of the VFIO kernel modules are required
+for supporting the protocol, neither in the client nor the server, only the
+source header files are used.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (AF_UNIX) is
+chosen because we can trivially send file descriptors over it, which in turn
+allows:
+
+* Sharing of guest memory for DMA with the virtual device process.
+* Sharing of virtual device memory with the guest for fast MMIO.
+* Efficient sharing of eventfd's for triggering interrupts.
+
+However, other socket types could be used which would allow the virtual device
+process to run in a separate guest in the same host (AF_VSOCK) or remotely
+(AF_INET). Theoretically the underlying transport doesn't necessarily have to
+be a socket, however we don't examine such alternatives. In this document we
+focus on using a UNIX domain socket and introduce basic support for the other
+two types of sockets without considering performance implications.
+
+This document does not yet describe any internal details of the server-side
+implementation, however QEMU's VFIO client implementation will have to be
+adapted according to this protocol in order to support VFIO-over-socket virtual
+devices.
+
+VFIO
+
+VFIO is a framework that allows a physical device to be securely passed through
+to a user space process; the kernel does not drive the device at all.
+Typically, the user space process is a VM and the device is passed through to
+it in order to achieve high performance. VFIO provides an API and the required
+functionality in the kernel. QEMU has adopted VFIO to allow a guest virtual
+machine to directly access physical devices, instead of emulating them in
+software.

RE: [PATCH] introduce VFIO-over-socket protocol specification

2020-07-22 Thread Thanos Makatos
> -Original Message-
> From: Stefan Hajnoczi 
> Sent: 17 July 2020 13:18
> To: Thanos Makatos 
> Cc: qemu-devel@nongnu.org; alex.william...@redhat.com;
> benjamin.wal...@intel.com; elena.ufimts...@oracle.com;
> jag.ra...@oracle.com; Swapnil Ingle ;
> james.r.har...@intel.com; konrad.w...@oracle.com; Raphael Norwitz
> ; marcandre.lur...@redhat.com;
> kanth.ghatr...@oracle.com; Felipe Franciosi ;
> tina.zh...@intel.com; changpeng@intel.com; dgilb...@redhat.com;
> tomassetti.and...@gmail.com; yuvalkash...@gmail.com;
> ism...@linux.com; John G Johnson 
> Subject: Re: [PATCH] introduce VFIO-over-socket protocol specification
> 
> On Thu, Jul 16, 2020 at 08:31:43AM -0700, Thanos Makatos wrote:
> > This patch introduces the VFIO-over-socket protocol specification, which
> > is designed to allow devices to be emulated outside QEMU, in a separate
> > process. VFIO-over-socket reuses the existing VFIO defines, structs and
> > concepts.
> >
> > It has been earlier discussed as an RFC in:
> > "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> >
> > Signed-off-by: John G Johnson 
> > Signed-off-by: Thanos Makatos 
> > ---
> >  docs/devel/vfio-over-socket.rst | 1135
> +++
> >  1 files changed, 1135 insertions(+), 0 deletions(-)
> >  create mode 100644 docs/devel/vfio-over-socket.rst
> 
> This is exciting! The spec is clear enough that I feel I could start
> writing a client/server. There is enough functionality here to implement
> real-world devices. Can you share links to client/server
> implementations?

Strictly speaking there's no client/server implementation yet as we're waiting
until the protocol is finalized.

The closest server implementation we have is an NVMe controller implemented in
SPDK with MUSER:
(https://github.com/tmakatos/spdk/tree/rfc-vfio-over-socket and
https://github.com/tmakatos/muser/tree/vfio-over-socket).

The closest client implementation we have is libvfio in the VFIO-over-socket
PoC.

Neither of these implementations uses the new protocol, but they're similar in
spirit.

John is working on the VFIO changes; the only things left to do are the DMA/IOMMU
changes we made in the last review round. They'll be on their GitHub site soon.


> 
> It would be useful to introduce a standard way of enumerating,
> launching, and controlling device emulation processes. That doesn't need
> to be part of this specification document though. In vhost-user there
> are conventions for command-line parameters, process lifecycle, etc that
> make it easier for management tools to run device processes (the
> "Backend program conventions" section in vhost-user.rst).

Sure, we'll come up with something similar based on those.

> 
> > diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-
> socket.rst
> > new file mode 100644
> > index 000..723b944
> > --- /dev/null
> > +++ b/docs/devel/vfio-over-socket.rst
> > @@ -0,0 +1,1135 @@
> > +***
> > +VFIO-over-socket Protocol Specification
> > +***
> > +
> > +Version 0.1
> 
> Please include a reference to the section below explaining how
> versioning works.

I'm not sure I understand, do you mean we should add something like the
following (right below "Version 0.1"):

"Refer to section 1.2.3 on how versioning works."

?

> 
> Also, are there rules about noting versions when updating the spec? For
> example:
> 
>   When making a change to this specification, the protocol version
>   number must be included:
> 
> The `foo` field contains ... Added in version 1.3.

OK, we'll add the rule as per your recommendation.

> 
> > +
> > +Introduction
> > +
> > +VFIO-over-socket, also known as vfio-user, is a protocol that allows a
> device
> 
> vfio-user is shorter. Now is the best time to start consistently using
> "vfio-user" as the name for this protocol. Want to drop the name
> VFIO-over-socket?

"vfio-user" it is.

> 
> > +to be virtualized in a separate process outside of QEMU. VFIO-over-
> socket
> 
> Is there anything QEMU-specific about this protocol?
> 
> I think the scope of this protocol is more general and it could be
> described as:
> 
>   allows device emulation in a separate process outside of a Virtual
>   Machine Monitor (VMM).
> 
> (Or "emulator" instead of VMM, if you prefer.)
> 
> > +devices consist of a generic VFIO device type, living inside QEMU, which
> we
> 
> s/QEMU/the VMM/

Correct, QEMU is simply our main use case.

RE: [PATCH v3] introduce VFIO-over-socket protocol specification

2020-07-22 Thread Thanos Makatos


> -Original Message-
> From: Nikos Dragazis 
> Sent: 21 July 2020 17:34
> To: Thanos Makatos 
> Cc: qemu-devel@nongnu.org; benjamin.wal...@intel.com;
> elena.ufimts...@oracle.com; tomassetti.and...@gmail.com; John G
> Johnson ; jag.ra...@oracle.com; Swapnil
> Ingle ; james.r.har...@intel.com;
> konrad.w...@oracle.com; yuvalkash...@gmail.com; dgilb...@redhat.com;
> Raphael Norwitz ; ism...@linux.com;
> alex.william...@redhat.com; kanth.ghatr...@oracle.com;
> stefa...@redhat.com; Felipe Franciosi ;
> marcandre.lur...@redhat.com; tina.zh...@intel.com;
> changpeng@intel.com
> Subject: Re: [PATCH v3] introduce VFIO-over-socket protocol specification
> 
> Hi Thanos,
> 
> I had a quick look on the spec. Leaving some comments inline.
> 
> On 17/7/20 2:20 μ.μ., Thanos Makatos wrote:
> 
> > This patch introduces the VFIO-over-socket protocol specification, which
> > is designed to allow devices to be emulated outside QEMU, in a separate
> > process. VFIO-over-socket reuses the existing VFIO defines, structs and
> > concepts.
> >
> > It has been earlier discussed as an RFC in:
> > "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> >
> > Signed-off-by: John G Johnson 
> > Signed-off-by: Thanos Makatos 
> >
> > ---
> >
> > Changed since v1:
> >* fix coding style issues
> >* update MAINTAINERS for VFIO-over-socket
> >* add vfio-over-socket to ToC
> >
> > Changed since v2:
> >* fix whitespace
> >
> > Regarding the build failure, I have not been able to reproduce it locally
> > using the docker image on my Debian 10.4 machine.
> > ---
> >   MAINTAINERS |6 +
> >   docs/devel/index.rst|1 +
> >   docs/devel/vfio-over-socket.rst | 1135
> +++
> >   3 files changed, 1142 insertions(+)
> >   create mode 100644 docs/devel/vfio-over-socket.rst
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 030faf0..bb81590 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -1732,6 +1732,12 @@ F: hw/vfio/ap.c
> >   F: docs/system/s390x/vfio-ap.rst
> >   L: qemu-s3...@nongnu.org
> >
> > +VFIO-over-socket
> > +M: John G Johnson 
> > +M: Thanos Makatos 
> > +S: Supported
> > +F: docs/devel/vfio-over-socket.rst
> > +
> >   vhost
> >   M: Michael S. Tsirkin 
> >   S: Supported
> > diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> > index ae6eac7..0439460 100644
> > --- a/docs/devel/index.rst
> > +++ b/docs/devel/index.rst
> > @@ -30,3 +30,4 @@ Contents:
> >  reset
> >  s390-dasd-ipl
> >  clocks
> > +   vfio-over-socket
> > diff --git a/docs/devel/vfio-over-socket.rst b/docs/devel/vfio-over-
> socket.rst
> > new file mode 100644
> > index 000..b474f23
> > --- /dev/null
> > +++ b/docs/devel/vfio-over-socket.rst
> > @@ -0,0 +1,1135 @@
> > +***
> > +VFIO-over-socket Protocol Specification
> > +***
> > +
> > +Version 0.1
> > +
> > +Introduction
> > +
> > +VFIO-over-socket, also known as vfio-user, is a protocol that allows a
> device
> 
> I think there is no point in having two names for the same protocol,
> "vfio-over-socket" and "vfio-user".

Yes, we'll use vfio-user from now on.

> 
> > +to be virtualized in a separate process outside of QEMU. VFIO-over-
> socket
> > +devices consist of a generic VFIO device type, living inside QEMU, which
> we
> > +call the client, and the core device implementation, living outside QEMU,
> which
> > +we call the server. VFIO-over-socket can be the main transport
> mechanism for
> > +multi-process QEMU, however it can be used by other applications
> offering
> > +device virtualization. Explaining the advantages of a
> > +disaggregated/multi-process QEMU, and device virtualization outside
> QEMU in
> > +general, is beyond the scope of this document.
> > +
> > +This document focuses on specifying the VFIO-over-socket protocol. VFIO
> has
> > +been chosen for the following reasons:
> > +
> > +1) It is a mature and stable API, backed by an extensively used
> framework.
> > +2) The existing VFIO client implementation (qemu/hw/vfio/) can be
> largely
> > +   reused.
> > +
> > +In a proof of concept implementation it has been demonstrated that
> using VFIO
> > +over a UNIX domain socket is a viable option.