Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-24 Thread Ira W. Snyder
On Thu, Dec 24, 2009 at 11:09:39AM -0600, Anthony Liguori wrote:
 On 12/23/2009 05:42 PM, Ira W. Snyder wrote:
 
  I've got a single PCI Host (master) with ~20 PCI slots. Physically, it
  is a backplane in a cPCI chassis, but the form factor is irrelevant. It
  is regular PCI from a software perspective.
 
  Into this backplane, I plug up to 20 PCI Agents (slaves). They are
  powerpc computers, almost identical to the Freescale MPC8349EMDS board.
  They're full-featured powerpc computers, with CPU, RAM, etc. They can
  run standalone.
 
  I want to use the PCI backplane as a data transport. Specifically, I
  want to transport ethernet over the backplane, so I can have the powerpc
  boards mount their rootfs via NFS, etc. Everyone knows how to write
  network daemons. It is a good and very well known way to transport data
  between systems.
 
  On the PCI bus, the powerpc systems expose 3 PCI BARs. The size is
  configurable, as is the memory location at which they point. What I
  cannot do is get notified when a read/write hits the BAR. There is a
  feature on the board which allows me to generate interrupts in either
  direction: agent->master (PCI INTX) and master->agent (via an MMIO
  register). The PCI vendor ID and device ID are not configurable.
 
  One thing I cannot assume is that the PCI master system is capable of
  performing DMA. In my system, it is a Pentium3 class x86 machine, which
  has no DMA engine. However, the PowerPC systems do have DMA engines. In
  virtio terms, it was suggested to make the powerpc systems the virtio
  hosts (running the backends) and make the x86 (PCI master) the virtio
  guest (running virtio-net, etc.).
 
 IMHO, virtio and vbus are both the wrong model for what you're doing. 
 The key reason why is that virtio and vbus are generally designed around 
 the concept that there is shared cache coherent memory from which you 
 can use lock-less ring queues to implement efficient I/O.
 
 In your architecture, you do not have cache coherent shared memory. 
 Instead, you have two systems connected via a PCI backplane with 
 non-coherent shared memory.
 
 You probably need to use the shared memory as a bounce buffer and 
 implement a driver on top of that.
 
  I'm not sure what you're suggesting in the paragraph above. I want to
  use virtio-net as the transport, I do not want to write my own
  virtual-network driver. Can you please clarify?
 
 virtio-net and vbus are going to be overly painful for you to use 
 because neither end can access arbitrary memory in the other end.
 

The PCI Agents (powerpc's) can access the lowest 4GB of the PCI Master's
memory. Not all at the same time, but I have a 1GB movable window into
PCI address space. My hunch is that Kyle's setup is similar.

I've proved that virtio can work via my crossed-wires driver, hooking
two virtio-net's together. With a proper in-kernel backend, I think the
issues would be gone, and things would work great.

  Hopefully that explains what I'm trying to do. I'd love someone to help
  guide me in the right direction here. I want something to fill this need
  in mainline.
 
 If I were you, I would write a custom network driver.  virtio-net is 
 awfully small (just a few hundred lines).  I'd use that as a basis but I 
 would not tie into virtio or vbus.  The paradigms don't match.
 

This is exactly what I did first. I proposed it for mainline, and David
Miller shot it down, saying "you're creating your own virtualization
scheme, use virtio instead." Arnd Bergmann is maintaining a driver
out-of-tree for some IBM Cell boards which is very similar, IIRC.

In my driver, I used the PCI Agent's PCI BARs to contain ring
descriptors. The PCI Agent actually handles all data transfer (via the
onboard DMA engine). It works great. I'll gladly post it if you'd like
to see it.

In my driver, I had to use 64K MTU to get acceptable performance. I'm
not entirely sure how to implement a driver that can handle
scatter/gather (fragmented skb's). It clearly isn't that easy to tune a
network driver for good performance. For reference, my crossed-wires
virtio drivers achieved excellent performance (10x better than my custom
driver) with 1500 byte MTU.

  I've been contacted separately by 10+ people also looking
  for a similar solution. My hunch is that most of them end up doing what I did:
  write a quick-and-dirty network driver. I've been working on this for a
  year, just to give an idea.
 
 The whole architecture of having multiple heterogeneous systems on a 
 common high speed backplane is what IBM refers to as hybrid computing. 
 It's a model that I think will become a lot more common in the 
 future.  I think there are typically two types of hybrid models 
 depending on whether the memory sharing is cache coherent or not.  If 
 you have coherent shared memory, the problem looks an awful lot like 
 virtualization.  If you don't have coherent shared memory, then the 
 shared memory basically becomes a pool to bounce into and out of.
 

Let's

Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Ira W. Snyder
On Wed, Dec 23, 2009 at 12:34:44PM -0500, Gregory Haskins wrote:
 On 12/23/09 1:15 AM, Kyle Moffett wrote:
  On Tue, Dec 22, 2009 at 12:36, Gregory Haskins
  gregory.hask...@gmail.com wrote:
  On 12/22/09 2:57 AM, Ingo Molnar wrote:
  * Gregory Haskins gregory.hask...@gmail.com wrote:
  Actually, these patches have nothing to do with the KVM folks. [...]
 
  That claim is curious to me - the AlacrityVM host
 
  It's quite simple, really.  These drivers support accessing vbus, and
  vbus is hypervisor agnostic.  In fact, vbus isn't necessarily even
  hypervisor related.  It may be used anywhere where a Linux kernel is the
  io backend, which includes hypervisors like AlacrityVM, but also
  userspace apps, and interconnected physical systems as well.
 
  The vbus-core on the backend, and the drivers on the frontend operate
  completely independent of the underlying hypervisor.  A glue piece
  called a connector ties them together, and any hypervisor specific
  details are encapsulated in the connector module.  In this case, the
  connector surfaces to the guest side as a pci-bridge, so even that is
  not hypervisor specific per se.  It will work with any pci-bridge that
  exposes a compatible ABI, which conceivably could be actual hardware.
  
  This is actually something that is of particular interest to me.  I
  have a few prototype boards right now with programmable PCI-E
  host/device links on them; one of my long-term plans is to finagle
  vbus into providing multiple virtual devices across that single
  PCI-E interface.
  
  Specifically, I want to be able to provide virtual NIC(s), serial
  ports and serial consoles, virtual block storage, and possibly other
  kinds of interfaces.  My big problem with existing virtio right now
  (although I would be happy to be proven wrong) is that it seems to
  need some sort of out-of-band communication channel for setting up
  devices, not to mention it seems to need one PCI device per virtual
  device.
  

Greg, thanks for CC'ing me.

Hello Kyle,

I've got a similar situation here. I've got many PCI agents (devices)
plugged into a PCI backplane. I want to use the network to communicate
from the agents to the PCI master (host system).

At the moment, I'm using a custom driver, heavily based on the PCINet
driver posted on the linux-netdev mailing list. David Miller rejected
this approach, and suggested I use virtio instead.

My first approach with virtio was to create a crossed-wires driver,
which connected two virtio-net drivers together. While this worked, it
doesn't support feature negotiation properly, and so it was scrapped.
You can find this posted on linux-netdev with the title
"virtio-over-PCI".

I started writing a virtio-phys layer which creates the appropriate
distinction between frontend (guest driver) and backend (kvm, qemu,
etc.). This effort has been put on hold for lack of time, and because
there is no example code which shows how to create an interface from
virtio rings to TUN/TAP. The vhost-net driver is supposed to fill this
role, but I haven't seen any test code for that either. The developers
haven't been especially helpful answering questions like: how would I
use vhost-net with a DMA engine?

(You'll quickly find that you must use DMA to transfer data across PCI.
AFAIK, CPUs cannot do burst accesses to the PCI bus. I get a 10+ times
speedup using DMA.)
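
For the record, handing a copy off to the DMA engine with the generic
dmaengine API looks roughly like this. It is an untested sketch with
made-up names, not code from my driver, and the exact calls differ
between kernel versions:

#include <linux/dmaengine.h>

/* Untested sketch: copy 'len' bytes between two bus addresses using a
 * DMA channel instead of the CPU. */
static int dma_copy_chunk(struct dma_chan *chan, dma_addr_t dst,
			  dma_addr_t src, size_t len)
{
	struct dma_async_tx_descriptor *tx;
	dma_cookie_t cookie;

	/* ask the engine to prepare a memcpy transaction */
	tx = dmaengine_prep_dma_memcpy(chan, dst, src, len,
				       DMA_PREP_INTERRUPT);
	if (!tx)
		return -ENOMEM;

	cookie = dmaengine_submit(tx);
	if (dma_submit_error(cookie))
		return -EIO;

	/* kick the hardware; block here for simplicity */
	dma_async_issue_pending(chan);
	dma_sync_wait(chan, cookie);

	return 0;
}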

The virtio-phys work is mostly lacking a backend for virtio-net. It is
still incomplete, but at least devices can be registered, etc. It is
available at:
http://www.mmarray.org/~iws/virtio-phys/

Another thing you'll notice about virtio-net (and vbus' venet) is that
they DO NOT specify endianness. This means that they cannot be used with
a big-endian guest and a little-endian host, or vice versa. This means
they will not work in certain QEMU setups today.

Another problem with virtio is that you'll need to invent your own bus
model. QEMU/KVM has their bus model, lguest uses a different one, and
s390 uses yet another, IIRC. At least vbus provides a standardized bus
model.

All in all, I've written a lot of virtio code, and it has pretty much
all been shot down. It isn't very encouraging.

  So I would love to be able to port something like vbus to my nifty PCI
  hardware and write some backend drivers... then my PCI-E connected
  systems would dynamically provide a list of highly-efficient virtual
  devices to each other, with only one 4-lane PCI-E bus.

I've written some IOQ test code, all of which is posted on the
alacrityvm-devel mailing list. If we can figure out how to make IOQ use
the proper ioread32()/iowrite32() accessors for accessing ioremap()ed
PCI BARs, then I can pretty easily write the rest of a vbus-phys
connector.
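
To be clear about what I mean by the proper accessors, here is a
bare-bones, untested sketch; the BAR number and register offsets are
made up:

#include <linux/pci.h>
#include <linux/io.h>

static void __iomem *regs;

/* Untested sketch: map BAR1 and touch it with the MMIO accessors
 * instead of dereferencing raw pointers. */
static int map_and_poke(struct pci_dev *pdev)
{
	regs = pci_iomap(pdev, 1, 0);		/* BAR1, full length */
	if (!regs)
		return -ENOMEM;

	iowrite32(0x1, regs + 0x00);		/* e.g. ring a doorbell */
	(void)ioread32(regs + 0x04);		/* e.g. read a status word */

	return 0;
}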

 
 Hi Kyle,
 
 We indeed have others that are doing something similar.  I have CC'd Ira
 who may be able to provide you more details.  I would also point you at
 the canonical example for what you would need to write to tie your
 systems together.  Its the null 

Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Ira W. Snyder
On Wed, Dec 23, 2009 at 09:09:21AM -0600, Anthony Liguori wrote:
 On 12/23/2009 12:15 AM, Kyle Moffett wrote:
  This is actually something that is of particular interest to me.  I
  have a few prototype boards right now with programmable PCI-E
  host/device links on them; one of my long-term plans is to finagle
  vbus into providing multiple virtual devices across that single
  PCI-E interface.
 
  Specifically, I want to be able to provide virtual NIC(s), serial
  ports and serial consoles, virtual block storage, and possibly other
  kinds of interfaces.  My big problem with existing virtio right now
  (although I would be happy to be proven wrong) is that it seems to
  need some sort of out-of-band communication channel for setting up
  devices, not to mention it seems to need one PCI device per virtual
  device.
 
 We've been thinking about doing a virtio-over-IP mechanism such that you 
 could remote the entire virtio bus to a separate physical machine. 
 virtio-over-IB is probably more interesting since you can make use of 
 RDMA. virtio-over-PCI-e would work just as well.
 

I didn't know you were interested in this as well. See my later reply to
Kyle for a lot of code that I've written with this in mind.

 virtio is a layered architecture.  Device enumeration/discovery sits at 
 a lower level than the actual device ABIs.  The device ABIs are 
 implemented on top of a bulk data transfer API.  The reason for this 
 layering is so that we can reuse PCI as an enumeration/discovery 
 mechanism.  This tremendously simplifies porting drivers to other OSes 
 and lets us use PCI hotplug automatically.  We get integration into all 
 the fancy userspace hotplug support for free.
 
 But both virtio-lguest and virtio-s390 use in-band enumeration and 
 discovery since they do not have support for PCI on either platform.
 

I'm interested in the same thing, just over PCI. The only PCI agent
systems I've used are not capable of manipulating the PCI configuration
space in such a way that virtio-pci is usable on them. This means
creating your own enumeration mechanism. Which sucks. See my virtio-phys
code (http://www.mmarray.org/~iws/virtio-phys/) for an example of how I
did it. It was modeled on lguest. Help is appreciated.

Ira


Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33

2009-12-23 Thread Ira W. Snyder
On Wed, Dec 23, 2009 at 04:58:37PM -0600, Anthony Liguori wrote:
 On 12/23/2009 01:54 PM, Ira W. Snyder wrote:
  On Wed, Dec 23, 2009 at 09:09:21AM -0600, Anthony Liguori wrote:
 
  I didn't know you were interested in this as well. See my later reply to
  Kyle for a lot of code that I've written with this in mind.
 
 
 BTW, in the future, please CC me or CC 
 virtualizat...@lists.linux-foundation.org.  Or certainly k...@vger.  I 
 never looked at the virtio-over-pci patchset although I've heard it 
 referenced before.
 

Will do. I wouldn't think k...@vger would be on-topic. I'm not interested
in KVM (though I do use it constantly, it is great). I'm only interested
in using virtio as a transport between physical systems. Is it a place
where discussing virtio by itself is on-topic?

  But both virtio-lguest and virtio-s390 use in-band enumeration and
  discovery since they do not have support for PCI on either platform.
 
 
  I'm interested in the same thing, just over PCI. The only PCI agent
  systems I've used are not capable of manipulating the PCI configuration
  space in such a way that virtio-pci is usable on them.
 
 virtio-pci is the wrong place to start if you want to use a PCI *device* 
 as the virtio bus. virtio-pci is meant to use the PCI bus as the virtio 
 bus.  That's a very important requirement for us because it maintains 
 the relationship of each device looking like a normal PCI device.
 
  This means
  creating your own enumeration mechanism. Which sucks.
 
 I don't think it sucks.  The idea is that we don't want to unnecessarily 
 reinvent things.
 
 Of course, the key feature of virtio is that it makes it possible for 
 you to create your own enumeration mechanism if you're so inclined.
 
  See my virtio-phys
  code (http://www.mmarray.org/~iws/virtio-phys/) for an example of how I
  did it. It was modeled on lguest. Help is appreciated.
 
 If it were me, I'd take a much different approach.  I would use a very 
 simple device with a single transmit and receive queue.  I'd create a 
 standard header, and then implement a command protocol on top of it. 
 You'll be able to support zero copy I/O (although you'll have a fixed 
 number of outstanding requests).  You would need a single large ring.
 
 But then again, I have no idea what your requirements are.  You could 
 probably get far treating the thing as a network device and just doing 
 ATAoE or something like that.
 

I've got a single PCI Host (master) with ~20 PCI slots. Physically, it
is a backplane in a cPCI chassis, but the form factor is irrelevant. It
is regular PCI from a software perspective.

Into this backplane, I plug up to 20 PCI Agents (slaves). They are
powerpc computers, almost identical to the Freescale MPC8349EMDS board.
They're full-featured powerpc computers, with CPU, RAM, etc. They can
run standalone.

I want to use the PCI backplane as a data transport. Specifically, I
want to transport ethernet over the backplane, so I can have the powerpc
boards mount their rootfs via NFS, etc. Everyone knows how to write
network daemons. It is a good and very well known way to transport data
between systems.

On the PCI bus, the powerpc systems expose 3 PCI BARs. The size is
configurable, as is the memory location at which they point. What I
cannot do is get notified when a read/write hits the BAR. There is a
feature on the board which allows me to generate interrupts in either
direction: agent->master (PCI INTX) and master->agent (via an MMIO
register). The PCI vendor ID and device ID are not configurable.
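
From the PCI master's point of view, the two directions look roughly
like this. The doorbell offset and names are made up for illustration;
the real register layout is board-specific:

#include <linux/pci.h>
#include <linux/interrupt.h>
#include <linux/io.h>

#define DOORBELL_REG	0x28	/* made-up offset of the master->agent doorbell */

/* agent->master arrives as an ordinary (shared) PCI INTx interrupt */
static irqreturn_t agent_irq(int irq, void *data)
{
	/* a real handler would check a status register to see if it's ours */
	return IRQ_HANDLED;
}

static int hook_up_irqs(struct pci_dev *pdev, void __iomem *bar0)
{
	int ret;

	ret = request_irq(pdev->irq, agent_irq, IRQF_SHARED, "pci-agent", pdev);
	if (ret)
		return ret;

	/* master->agent: poke the agent's doorbell register in its BAR */
	iowrite32(1, bar0 + DOORBELL_REG);

	return 0;
}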

One thing I cannot assume is that the PCI master system is capable of
performing DMA. In my system, it is a Pentium3 class x86 machine, which
has no DMA engine. However, the PowerPC systems do have DMA engines. In
virtio terms, it was suggested to make the powerpc systems the virtio
hosts (running the backends) and make the x86 (PCI master) the virtio
guest (running virtio-net, etc.).

I'm not sure what you're suggesting in the paragraph above. I want to
use virtio-net as the transport, I do not want to write my own
virtual-network driver. Can you please clarify?

Hopefully that explains what I'm trying to do. I'd love someone to help
guide me in the right direction here. I want something to fill this need
in mainline. I've been contacted separately by 10+ people also looking
for a similar solution. My hunch is that most of them end up doing what I did:
write a quick-and-dirty network driver. I've been working on this for a
year, just to give an idea.

PS - should I create a new thread on the two mailing lists mentioned
above? I don't want to go too far off-topic in an alacrityvm thread. :)

Ira


Re: [Alacrityvm-devel] [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests

2009-10-06 Thread Ira W. Snyder
On Tue, Oct 06, 2009 at 12:58:06PM -0400, Gregory Haskins wrote:
 Avi Kivity wrote:
  On 10/06/2009 03:31 PM, Gregory Haskins wrote:
 
  slots would be one implementation, if you can think of others then you'd
  add them.
   
  I'm more interested in *how* you'd add them more than if we would add
  them.  What I am getting at are the logistics of such a beast.
 
  
  Add alternative ioctls, or have one ioctl with a 'type' field.
  
  For instance, would I have /dev/slots-vas with ioctls for adding slots,
  and /dev/foo-vas for adding foos?  And each one would instantiate a
  different vas_struct object with its own vas_struct-ops?  Or were you
  thinking of something different.
 
  
  I think a single /dev/foo is sufficient, unless some of those address
  spaces are backed by real devices.
  
  If you can't, I think it indicates that the whole thing isn't necessary
  and we're better off with slots and virtual memory.
   
  I'm not sure if we are talking about the same thing yet, but if we are,
  there are uses of a generalized interface outside of slots/virtual
  memory (Ira's physical box being a good example).
 
  
  I'm not sure Ira's case is not best supported by virtual memory.
 
 Perhaps, but there are surely some cases where the memory is not
 pageable, but is accessible indirectly through some DMA controller.  So
 if we require it to be pageable we will limit the utility of the
 interface, though admittedly it will probably cover most cases.
 

The limitation I have is that memory made available from the host system
(PCI card) as PCI BAR1 must not be migrated around in memory. I can only
change the address decoding to hit a specific physical address. AFAIK,
this means it cannot be userspace memory (since the underlying physical
page could change, or it could be in swap), and must be allocated with
something like __get_free_pages() or dma_alloc_coherent().
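
For illustration, the dma_alloc_coherent() variant looks roughly like
this (untested; the size and names are arbitrary):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

#define WINDOW_SIZE	(1024 * 1024)

static void *window_virt;
static dma_addr_t window_bus;

/* Untested sketch: get a physically contiguous buffer that will never be
 * swapped or migrated, suitable for pointing BAR1's address decoder at. */
static int alloc_window(struct device *dev)
{
	window_virt = dma_alloc_coherent(dev, WINDOW_SIZE, &window_bus,
					 GFP_KERNEL);
	if (!window_virt)
		return -ENOMEM;

	/* window_bus is what the BAR decoding registers get programmed with */
	return 0;
}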

This is how all 83xx powerpc boards work, and I'd bet that the 85xx and
86xx boards work almost exactly the same. I can't say anything about
non-powerpc boards.

Ira


Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-25 Thread Ira W. Snyder
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
 What it is: vhost net is a character device that can be used to reduce
 the number of system calls involved in virtio networking.
 Existing virtio net code is used in the guest without modification.
 
 There's similarity with vringfd, with some differences and reduced scope
 - uses eventfd for signalling
 - structures can be moved around in memory at any time (good for migration)
 - support memory table and not just an offset (needed for kvm)
 
 common virtio related code has been put in a separate file vhost.c and
 can be made into a separate module if/when more backends appear.  I used
 Rusty's lguest.c as the source for developing this part : this supplied
 me with witty comments I wouldn't be able to write myself.
 
 What it is not: vhost net is not a bus, and not a generic new system
 call. No assumptions are made on how guest performs hypercalls.
 Userspace hypervisors are supported as well as kvm.
 
 How it works: Basically, we connect virtio frontend (configured by
 userspace) to a backend. The backend could be a network device, or a
 tun-like device. In this version I only support raw socket as a backend,
 which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
 also configured by userspace, including vlan/mac etc.
 
 Status:
 This works for me, and I haven't seen any crashes.
 I have done some light benchmarking (with v4), compared to userspace, I
 see improved latency (as I save up to 4 system calls per packet) but not
 bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
 ping benchmark (where there's no TSO) throughput is also improved.
 
 Features that I plan to look at in the future:
 - tap support
 - TSO
 - interrupt mitigation
 - zero copy
 
 Acked-by: Arnd Bergmann a...@arndb.de
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 ---
  MAINTAINERS|   10 +
  arch/x86/kvm/Kconfig   |1 +
  drivers/Makefile   |1 +
  drivers/vhost/Kconfig  |   11 +
  drivers/vhost/Makefile |2 +
  drivers/vhost/net.c|  475 ++
  drivers/vhost/vhost.c  |  688 
 
  drivers/vhost/vhost.h  |  122 
  include/linux/Kbuild   |1 +
  include/linux/miscdevice.h |1 +
  include/linux/vhost.h  |  101 +++
  11 files changed, 1413 insertions(+), 0 deletions(-)
  create mode 100644 drivers/vhost/Kconfig
  create mode 100644 drivers/vhost/Makefile
  create mode 100644 drivers/vhost/net.c
  create mode 100644 drivers/vhost/vhost.c
  create mode 100644 drivers/vhost/vhost.h
  create mode 100644 include/linux/vhost.h
 
 diff --git a/MAINTAINERS b/MAINTAINERS
 index b1114cf..de4587f 100644
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -5431,6 +5431,16 @@ S: Maintained
  F:   Documentation/filesystems/vfat.txt
  F:   fs/fat/
  
 +VIRTIO HOST (VHOST)
 +P:   Michael S. Tsirkin
 +M:   m...@redhat.com
 +L:   kvm@vger.kernel.org
 +L:   virtualizat...@lists.osdl.org
 +L:   net...@vger.kernel.org
 +S:   Maintained
 +F:   drivers/vhost/
 +F:   include/linux/vhost.h
 +
  VIA RHINE NETWORK DRIVER
  M:   Roger Luethi r...@hellgate.ch
  S:   Maintained
 diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
 index b84e571..94f44d9 100644
 --- a/arch/x86/kvm/Kconfig
 +++ b/arch/x86/kvm/Kconfig
 @@ -64,6 +64,7 @@ config KVM_AMD
  
  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
  # the virtualization menu.
 +source drivers/vhost/Kconfig
  source drivers/lguest/Kconfig
  source drivers/virtio/Kconfig
  
 diff --git a/drivers/Makefile b/drivers/Makefile
 index bc4205d..1551ae1 100644
 --- a/drivers/Makefile
 +++ b/drivers/Makefile
 @@ -105,6 +105,7 @@ obj-$(CONFIG_HID) += hid/
  obj-$(CONFIG_PPC_PS3)+= ps3/
  obj-$(CONFIG_OF) += of/
  obj-$(CONFIG_SSB)+= ssb/
 +obj-$(CONFIG_VHOST_NET)  += vhost/
  obj-$(CONFIG_VIRTIO) += virtio/
  obj-$(CONFIG_VLYNQ)  += vlynq/
  obj-$(CONFIG_STAGING)+= staging/
 diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
 new file mode 100644
 index 000..d955406
 --- /dev/null
 +++ b/drivers/vhost/Kconfig
 @@ -0,0 +1,11 @@
 +config VHOST_NET
 + tristate "Host kernel accelerator for virtio net"
 + depends on NET && EVENTFD
 + ---help---
 +   This kernel module can be loaded in host kernel to accelerate
 +   guest networking with virtio_net. Not to be confused with virtio_net
 +   module itself which needs to be loaded in guest kernel.
 +
 +   To compile this driver as a module, choose M here: the module will
 +   be called vhost_net.
 +
 diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
 new file mode 100644
 index 000..72dd020
 --- /dev/null
 +++ b/drivers/vhost/Makefile
 @@ -0,0 +1,2 @@
 +obj-$(CONFIG_VHOST_NET) += vhost_net.o
 +vhost_net-y := vhost.o net.o
 diff --git 

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-24 Thread Ira W. Snyder
On Thu, Sep 24, 2009 at 10:18:28AM +0300, Avi Kivity wrote:
 On 09/24/2009 12:15 AM, Gregory Haskins wrote:
 
  There are various aspects about designing high-performance virtual
  devices such as providing the shortest paths possible between the
  physical resources and the consumers.  Conversely, we also need to
  ensure that we meet proper isolation/protection guarantees at the same
  time.  What this means is there are various aspects to any
  high-performance PV design that require to be placed in-kernel to
  maximize the performance yet properly isolate the guest.
 
  For instance, you are required to have your signal-path (interrupts and
  hypercalls), your memory-path (gpa translation), and
  addressing/isolation model in-kernel to maximize performance.
 
 
  Exactly.  That's what vhost puts into the kernel and nothing more.
   
  Actually, no.  Generally, _KVM_ puts those things into the kernel, and
  vhost consumes them.  Without KVM (or something equivalent), vhost is
  incomplete.  One of my goals with vbus is to generalize the something
  equivalent part here.
 
 
 I don't really see how vhost and vbus are different here.  vhost expects 
 signalling to happen through a couple of eventfds and requires someone 
 to supply them and implement kernel support (if needed).  vbus requires 
 someone to write a connector to provide the signalling implementation.  
 Neither will work out-of-the-box when implementing virtio-net over 
 falling dominos, for example.
 
  Vbus accomplishes its in-kernel isolation model by providing a
  container concept, where objects are placed into this container by
  userspace.  The host kernel enforces isolation/protection by using a
  namespace to identify objects that is only relevant within a specific
  container's context (namely, a u32 dev-id).  The guest addresses the
  objects by its dev-id, and the kernel ensures that the guest can't
  access objects outside of its dev-id namespace.
 
 
  vhost manages to accomplish this without any kernel support.
   
  No, vhost manages to accomplish this because of KVMs kernel support
  (ioeventfd, etc).   Without a KVM-like in-kernel support, vhost is a
  merely a kind of tuntap-like clone signalled by eventfds.
 
 
 Without a vbus-connector-falling-dominos, vbus-venet can't do anything 
 either.  Both vhost and vbus need an interface, vhost's is just narrower 
 since it doesn't do configuration or enumeration.
 
  This goes directly to my rebuttal of your claim that vbus places too
  much in the kernel.  I state that, one way or the other, address decode
  and isolation _must_ be in the kernel for performance.  Vbus does this
  with a devid/container scheme.  vhost+virtio-pci+kvm does it with
  pci+pio+ioeventfd.
 
 
 vbus doesn't do kvm guest address decoding for the fast path.  It's 
 still done by ioeventfd.
 
The guest
  simply has no access to any vhost resources other than the guest->host
  doorbell, which is handed to the guest outside vhost (so it's somebody
  else's problem, in userspace).
   
  You mean _controlled_ by userspace, right?  Obviously, the other side of
  the kernel still needs to be programmed (ioeventfd, etc).  Otherwise,
  vhost would be pointless: e.g. just use vanilla tuntap if you don't need
  fast in-kernel decoding.
 
 
 Yes (though for something like level-triggered interrupts we're probably 
 keeping it in userspace, enjoying the benefits of vhost data path while 
 paying more for signalling).
 
  All that is required is a way to transport a message with a devid
  attribute as an address (such as DEVCALL(devid)) and the framework
  provides the rest of the decode+execute function.
 
 
  vhost avoids that.
   
  No, it doesn't avoid it.  It just doesn't specify how its done, and
  relies on something else to do it on its behalf.
 
 
 That someone else can be in userspace, apart from the actual fast path.
 
  Conversely, vbus specifies how its done, but not how to transport the
  verb across the wire.  That is the role of the vbus-connector abstraction.
 
 
 So again, vbus does everything in the kernel (since it's so easy and 
 cheap) but expects a vbus-connector.  vhost does configuration in 
 userspace (since it's so clunky and fragile) but expects a couple of 
 eventfds.
 
  Contrast this to vhost+virtio-pci (called simply vhost from here).
 
 
  It's the wrong name.  vhost implements only the data path.
   
  Understood, but vhost+virtio-pci is what I am contrasting, and I use
  vhost for short from that point on because I am too lazy to type the
  whole name over and over ;)
 
 
 If you #define A A+B+C don't expect intelligent conversation afterwards.
 
  It is not immune to requiring in-kernel addressing support either, but
  rather it just does it differently (and its not as you might expect via
  qemu).
 
  Vhost relies on QEMU to render PCI objects to the guest, which the guest
  assigns resources (such 

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-22 Thread Ira W. Snyder
On Tue, Sep 22, 2009 at 12:43:36PM +0300, Avi Kivity wrote:
 On 09/22/2009 12:43 AM, Ira W. Snyder wrote:
 
  Sure, virtio-ira and he is on his own to make a bus-model under that, or
  virtio-vbus + vbus-ira-connector to use the vbus framework.  Either
  model can work, I agree.
 
   
  Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and
  virtio-s390. It isn't especially easy. I can steal lots of code from the
  lguest bus model, but sometimes it is good to generalize, especially
  after the fourth implementation or so. I think this is what GHaskins tried
  to do.
 
 
 Yes.  vbus is more finely layered so there is less code duplication.
 
 The virtio layering was more or less dictated by Xen which doesn't have 
 shared memory (it uses grant references instead).  As a matter of fact 
 lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that 
 part is duplicated.  It's probably possible to add a virtio-shmem.ko 
 library that people who do have shared memory can reuse.
 

Seems like a nice benefit of vbus.

  I've given it some thought, and I think that running vhost-net (or
  similar) on the ppc boards, with virtio-net on the x86 crate server will
  work. The virtio-ring abstraction is almost good enough to work for this
  situation, but I had to re-invent it to work with my boards.
 
  I've exposed a 16K region of memory as PCI BAR1 from my ppc board.
  Remember that this is the host system. I used each 4K block as a
  device descriptor which contains:
 
  1) the type of device, config space, etc. for virtio
  2) the desc table (virtio memory descriptors, see virtio-ring)
  3) the avail table (available entries in the desc table)
 
 
 Won't access from x86 be slow to this memory? (On the other hand, if you 
 change it to main memory, access from ppc will be slow... really depends 
 on how your system is tuned.)
 

Writes across the bus are fast, reads across the bus are slow. These are
just the descriptor tables for memory buffers, not the physical memory
buffers themselves.

These only need to be written by the guest (x86), and read by the host
(ppc). The host never changes the tables, so we can cache a copy in the
guest, for a fast detach_buf() implementation (see virtio-ring, which
I'm copying the design from).

The only accesses are writes across the PCI bus. There is never a need
to do a read (except for slow-path configuration).
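
For reference, one of those 4K descriptor blocks looks roughly like the
struct below. The field names and sizes are made up for this sketch; the
real layout (three virtqueues per device, as noted below) is in the
virtio-over-PCI patch:

#include <linux/types.h>
#include <linux/virtio_ring.h>

#define VOP_MAX_VQS	3	/* three virtqueues per device */
#define VOP_RING_SIZE	64	/* arbitrary example size */

struct vop_vq_layout {
	struct vring_desc desc[VOP_RING_SIZE];	/* part 2: desc table  */
	__le16 avail_flags;			/* part 3: avail table */
	__le16 avail_idx;
	__le16 avail_ring[VOP_RING_SIZE];
};

struct vop_device_block {			/* one 4K block in BAR1 */
	__le32 device_type;			/* part 1: virtio device ID */
	__le32 status;
	u8     config[128];			/* virtio config space      */
	struct vop_vq_layout vq[VOP_MAX_VQS];
};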

  Parts 2 and 3 are repeated three times, to allow for a maximum of three
  virtqueues per device. This is good enough for all current drivers.
 
 
 The plan is to switch to multiqueue soon.  Will not affect you if your 
 boards are uniprocessor or small smp.
 

Everything I have is UP. I don't need extreme performance, either.
40MB/sec is the minimum I need to reach, though I'd like to have some
headroom.

For reference, using the CPU to handle data transfers, I get ~2MB/sec
transfers. Using the DMA engine, I've hit about 60MB/sec with my
crossed-wires virtio-net.

  I've gotten plenty of email about this from lots of interested
  developers. There are people who would like this kind of system to just
  work, while having to write just some glue for their device, just like a
network driver. My hunch is that most people have created some proprietary mess
  that basically works, and left it at that.
 
 
 So long as you keep the system-dependent features hookable or 
 configurable, it should work.
 
  So, here is a desperate cry for help. I'd like to make this work, and
  I'd really like to see it in mainline. I'm trying to give back to the
  community from which I've taken plenty.
 
 
 Not sure who you're crying for help to.  Once you get this working, post 
 patches.  If the patches are reasonably clean and don't impact 
 performance for the main use case, and if you can show the need, I 
 expect they'll be merged.
 

In the spirit of post early and often, I'm making my code available,
that's all. I'm asking anyone interested for some review, before I have
to re-code this for about the fifth time now. I'm trying to avoid
Haskins' situation, where he's invented and debugged a lot of new code,
and then been told to do it completely differently.

Yes, the code I posted is only compile-tested, because quite a lot of
code (kernel and userspace) must be working before anything works at
all. I hate to design the whole thing, then be told that something
fundamental about it is wrong, and have to completely re-write it.

Thanks for the comments,
Ira

 -- 
 error compiling committee.c: too many arguments to function
 


Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-21 Thread Ira W. Snyder
On Wed, Sep 16, 2009 at 11:11:57PM -0400, Gregory Haskins wrote:
 Avi Kivity wrote:
  On 09/16/2009 10:22 PM, Gregory Haskins wrote:
  Avi Kivity wrote:

  On 09/16/2009 05:10 PM, Gregory Haskins wrote:
  
  If kvm can do it, others can.
 
   
  The problem is that you seem to either hand-wave over details like
  this,
  or you give details that are pretty much exactly what vbus does
  already.
 My point is that I've already sat down and thought about these
  issues
  and solved them in a freely available GPL'ed software package.
 
 
  In the kernel.  IMO that's the wrong place for it.
   
  3) in-kernel: You can do something like virtio-net to vhost to
  potentially meet some of the requirements, but not all.
 
  In order to fully meet (3), you would need to do some of that stuff you
  mentioned in the last reply with muxing device-nr/reg-nr.  In addition,
  we need to have a facility for mapping eventfds and establishing a
  signaling mechanism (like PIO+qid), etc. KVM does this with
  IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be
  invented.
 
  
  irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted.
 
 Not per se, but it needs to be interfaced.  How do I register that
 eventfd with the fastpath in Ira's rig? How do I signal the eventfd
  (x86->ppc, and ppc->x86)?
 

Sorry to reply so late to this thread, I've been on vacation for the
past week. If you'd like to continue in another thread, please start it
and CC me.

On the PPC, I've got a hardware doorbell register which generates 30
distinguishable interrupts over the PCI bus. I have outbound and inbound
registers, which can be used to signal the other side.

I assume it isn't too much code to signal an eventfd in an interrupt
handler. I haven't gotten to this point in the code yet.
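
Roughly, I imagine something like the following (untested; the
eventfd_ctx would have to be handed to the driver earlier, e.g. via an
ioctl carrying the fd):

#include <linux/eventfd.h>
#include <linux/interrupt.h>

static struct eventfd_ctx *doorbell_evt;	/* set up at configuration time */

/* Untested sketch: doorbell interrupt handler kicking an eventfd */
static irqreturn_t doorbell_irq(int irq, void *data)
{
	/* ...read/clear the hardware doorbell status register here... */

	if (doorbell_evt)
		eventfd_signal(doorbell_evt, 1);
	return IRQ_HANDLED;
}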

 To take it to the next level, how do I organize that mechanism so that
 it works for more than one IO-stream (e.g. address the various queues
 within ethernet or a different device like the console)?  KVM has
 IOEVENTFD and IRQFD managed with MSI and PIO.  This new rig does not
 have the luxury of an established IO paradigm.
 
 Is vbus the only way to implement a solution?  No.  But it is _a_ way,
 and its one that was specifically designed to solve this very problem
 (as well as others).
 
 (As an aside, note that you generally will want an abstraction on top of
 irqfd/eventfd like shm-signal or virtqueues to do shared-memory based
 event mitigation, but I digress.  That is a separate topic).
 
  
  To meet performance, this stuff has to be in kernel and there has to be
  a way to manage it.
  
  and management belongs in userspace.
 
 vbus does not dictate where the management must be.  Its an extensible
 framework, governed by what you plug into it (ala connectors and devices).
 
 For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD
 and DEVDROP hotswap events into the interrupt stream, because they are
 simple and we already needed the interrupt stream anyway for fast-path.
 
  As another example: venet chose to put ->call(MACQUERY) config-space
  into its call namespace because it's simple, and we already need
  ->calls() for fastpath.  It therefore exports an attribute to sysfs that
 allows the management app to set it.
 
 I could likewise have designed the connector or device-model differently
 as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI
 userspace) but this seems silly to me when they are so trivial, so I didn't.
 
  
  Since vbus was designed to do exactly that, this is
  what I would advocate.  You could also reinvent these concepts and put
  your own mux and mapping code in place, in addition to all the other
  stuff that vbus does.  But I am not clear why anyone would want to.
 
  
  Maybe they like their backward compatibility and Windows support.
 
 This is really not relevant to this thread, since we are talking about
 Ira's hardware.  But if you must bring this up, then I will reiterate
 that you just design the connector to interface with QEMU+PCI and you
 have that too if that was important to you.
 
 But on that topic: Since you could consider KVM a motherboard
 manufacturer of sorts (it just happens to be virtual hardware), I don't
 know why KVM seems to consider itself the only motherboard manufacturer
 in the world that has to make everything look legacy.  If a company like
 ASUS wants to add some cutting edge IO controller/bus, they simply do
 it.  Pretty much every product release may contain a different array of
 devices, many of which are not backwards compatible with any prior
 silicon.  The guy/gal installing Windows on that system may see a ? in
 device-manager until they load a driver that supports the new chip, and
 subsequently it works.  It is certainly not a requirement to make said
 chip somehow work with existing drivers/facilities on bare metal, per
 se.  Why should virtual systems be different?
 
 So, yeah, the 

Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-08 Thread Ira W. Snyder
On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote:
 On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
  On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
   What it is: vhost net is a character device that can be used to reduce
   the number of system calls involved in virtio networking.
   Existing virtio net code is used in the guest without modification.
   
   There's similarity with vringfd, with some differences and reduced scope
   - uses eventfd for signalling
   - structures can be moved around in memory at any time (good for 
   migration)
   - support memory table and not just an offset (needed for kvm)
   
   common virtio related code has been put in a separate file vhost.c and
   can be made into a separate module if/when more backends appear.  I used
   Rusty's lguest.c as the source for developing this part : this supplied
   me with witty comments I wouldn't be able to write myself.
   
   What it is not: vhost net is not a bus, and not a generic new system
   call. No assumptions are made on how guest performs hypercalls.
   Userspace hypervisors are supported as well as kvm.
   
   How it works: Basically, we connect virtio frontend (configured by
   userspace) to a backend. The backend could be a network device, or a
   tun-like device. In this version I only support raw socket as a backend,
   which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
   also configured by userspace, including vlan/mac etc.
   
   Status:
   This works for me, and I haven't seen any crashes.
   I have done some light benchmarking (with v4), compared to userspace, I
   see improved latency (as I save up to 4 system calls per packet) but not
   bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
   ping benchmark (where there's no TSO) throughput is also improved.
   
   Features that I plan to look at in the future:
   - tap support
   - TSO
   - interrupt mitigation
   - zero copy
   
  
  Hello Michael,
  
  I've started looking at vhost with the intention of using it over PCI to
  connect physical machines together.
  
  The part that I am struggling with the most is figuring out which parts
  of the rings are in the host's memory, and which parts are in the
  guest's memory.
 
 All rings are in guest's memory, to match existing virtio code.

Ok, this makes sense.

 vhost
 assumes that the memory space of the hypervisor userspace process covers
 the whole of guest memory.

Is this necessary? Why? The assumption seems very wrong when you're
doing data transport between two physical systems via PCI.

I know vhost has not been designed for this specific situation, but it
is good to be looking toward other possible uses.

 And there's a translation table.
 Ring addresses are userspace addresses, they do not undergo translation.
 
  If I understand everything correctly, the rings are all userspace
  addresses, which means that they can be moved around in physical memory,
  and get pushed out to swap.
 
 Unless they are locked, yes.
 
  AFAIK, this is impossible to handle when
  connecting two physical systems, you'd need the rings available in IO
  memory (PCI memory), so you can ioreadXX() them instead. To the best of
  my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
  Also, having them migrate around in memory would be a bad thing.
  
  Also, I'm having trouble figuring out how the packet contents are
  actually copied from one system to the other. Could you point this out
  for me?
 
 The code in net/packet/af_packet.c does it when vhost calls sendmsg.
 

Ok. The sendmsg() implementation uses memcpy_fromiovec(). Is it possible
to make this use a DMA engine instead? I know this was suggested in an
earlier thread.

  Is there somewhere I can find the userspace code (kvm, qemu, lguest,
  etc.) code needed for interacting with the vhost misc device so I can
  get a better idea of how userspace is supposed to work?
 
 Look in the archives for k...@vger.kernel.org. The subject is "qemu-kvm: vhost net".
 
  (Features
  negotiation, etc.)
  
 
 That's not yet implemented as there are no features yet.  I'm working on
 tap support, which will add a feature bit.  Overall, qemu does an ioctl
 to query supported features, and then acks them with another ioctl.  I'm
 also trying to avoid duplicating functionality available elsewhere.  So
 that to check e.g. TSO support, you'd just look at the underlying
 hardware device you are binding to.
 

Ok. Do you have plans to support the VIRTIO_NET_F_MRG_RXBUF feature in
the future? I found that this made an enormous improvement in throughput
on my virtio-net -> virtio-net system. Perhaps it isn't needed with
vhost-net.

Thanks for replying,
Ira


Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

2009-09-03 Thread Ira W. Snyder
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
 What it is: vhost net is a character device that can be used to reduce
 the number of system calls involved in virtio networking.
 Existing virtio net code is used in the guest without modification.
 
 There's similarity with vringfd, with some differences and reduced scope
 - uses eventfd for signalling
 - structures can be moved around in memory at any time (good for migration)
 - support memory table and not just an offset (needed for kvm)
 
 common virtio related code has been put in a separate file vhost.c and
 can be made into a separate module if/when more backends appear.  I used
 Rusty's lguest.c as the source for developing this part : this supplied
 me with witty comments I wouldn't be able to write myself.
 
 What it is not: vhost net is not a bus, and not a generic new system
 call. No assumptions are made on how guest performs hypercalls.
 Userspace hypervisors are supported as well as kvm.
 
 How it works: Basically, we connect virtio frontend (configured by
 userspace) to a backend. The backend could be a network device, or a
 tun-like device. In this version I only support raw socket as a backend,
 which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
 also configured by userspace, including vlan/mac etc.
 
 Status:
 This works for me, and I haven't seen any crashes.
 I have done some light benchmarking (with v4), compared to userspace, I
 see improved latency (as I save up to 4 system calls per packet) but not
 bandwidth/CPU (as TSO and interrupt mitigation are not supported).  For
 ping benchmark (where there's no TSO) throughput is also improved.
 
 Features that I plan to look at in the future:
 - tap support
 - TSO
 - interrupt mitigation
 - zero copy
 

Hello Michael,

I've started looking at vhost with the intention of using it over PCI to
connect physical machines together.

The part that I am struggling with the most is figuring out which parts
of the rings are in the host's memory, and which parts are in the
guest's memory.

If I understand everything correctly, the rings are all userspace
addresses, which means that they can be moved around in physical memory,
and get pushed out to swap. AFAIK, this is impossible to handle when
connecting two physical systems, you'd need the rings available in IO
memory (PCI memory), so you can ioreadXX() them instead. To the best of
my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
Also, having them migrate around in memory would be a bad thing.
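
In other words, ring accesses would have to go through the IO-memory
helpers, roughly like this (untested sketch, made-up names):

#include <linux/types.h>
#include <linux/io.h>

/* pull a ring entry out of the BAR (reads across PCI are slow, but this
 * is at least correct for __iomem) */
static void pull_ring_entry(void *dst, const void __iomem *ring, size_t len)
{
	memcpy_fromio(dst, ring, len);
}

/* push a ring entry into the BAR */
static void push_ring_entry(void __iomem *ring, const void *src, size_t len)
{
	memcpy_toio(ring, src, len);
}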

Also, I'm having trouble figuring out how the packet contents are
actually copied from one system to the other. Could you point this out
for me?

Is there somewhere I can find the userspace code (kvm, qemu, lguest,
etc.) code needed for interacting with the vhost misc device so I can
get a better idea of how userspace is supposed to work? (Features
negotiation, etc.)

Thanks,
Ira

 Acked-by: Arnd Bergmann a...@arndb.de
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 ---
  MAINTAINERS|   10 +
  arch/x86/kvm/Kconfig   |1 +
  drivers/Makefile   |1 +
  drivers/vhost/Kconfig  |   11 +
  drivers/vhost/Makefile |2 +
  drivers/vhost/net.c|  475 ++
  drivers/vhost/vhost.c  |  688 
 
  drivers/vhost/vhost.h  |  122 
  include/linux/Kbuild   |1 +
  include/linux/miscdevice.h |1 +
  include/linux/vhost.h  |  101 +++
  11 files changed, 1413 insertions(+), 0 deletions(-)
  create mode 100644 drivers/vhost/Kconfig
  create mode 100644 drivers/vhost/Makefile
  create mode 100644 drivers/vhost/net.c
  create mode 100644 drivers/vhost/vhost.c
  create mode 100644 drivers/vhost/vhost.h
  create mode 100644 include/linux/vhost.h
 
 diff --git a/MAINTAINERS b/MAINTAINERS
 index b1114cf..de4587f 100644
 --- a/MAINTAINERS
 +++ b/MAINTAINERS
 @@ -5431,6 +5431,16 @@ S: Maintained
  F:   Documentation/filesystems/vfat.txt
  F:   fs/fat/
  
 +VIRTIO HOST (VHOST)
 +P:   Michael S. Tsirkin
 +M:   m...@redhat.com
 +L:   kvm@vger.kernel.org
 +L:   virtualizat...@lists.osdl.org
 +L:   net...@vger.kernel.org
 +S:   Maintained
 +F:   drivers/vhost/
 +F:   include/linux/vhost.h
 +
  VIA RHINE NETWORK DRIVER
  M:   Roger Luethi r...@hellgate.ch
  S:   Maintained
 diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
 index b84e571..94f44d9 100644
 --- a/arch/x86/kvm/Kconfig
 +++ b/arch/x86/kvm/Kconfig
 @@ -64,6 +64,7 @@ config KVM_AMD
  
  # OK, it's a little counter-intuitive to do this, but it puts it neatly under
  # the virtualization menu.
 +source drivers/vhost/Kconfig
  source drivers/lguest/Kconfig
  source drivers/virtio/Kconfig
  
 diff --git a/drivers/Makefile b/drivers/Makefile
 index bc4205d..1551ae1 100644
 --- a/drivers/Makefile
 +++ b/drivers/Makefile
 @@ -105,6 +105,7 @@ obj-$(CONFIG_HID) += hid/
  obj-$(CONFIG_PPC_PS3)

Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects

2009-08-19 Thread Ira W. Snyder
On Wed, Aug 19, 2009 at 08:40:33AM +0300, Avi Kivity wrote:
 On 08/19/2009 03:38 AM, Ira W. Snyder wrote:
 On Wed, Aug 19, 2009 at 12:26:23AM +0300, Avi Kivity wrote:

 On 08/18/2009 11:59 PM, Ira W. Snyder wrote:
  
 On a non shared-memory system (where the guest's RAM is not just a chunk
 of userspace RAM in the host system), virtio's management model seems to
 fall apart. Feature negotiation doesn't work as one would expect.


 In your case, virtio-net on the main board accesses PCI config space
 registers to perform the feature negotiation; software on your PCI cards
 needs to trap these config space accesses and respond to them according
 to virtio ABI.

  
 Is this real PCI (physical hardware) or fake PCI (software PCI
 emulation) that you are describing?



 Real PCI.

 The host (x86, PCI master) must use real PCI to actually configure the
 boards, enable bus mastering, etc. Just like any other PCI device, such
 as a network card.

 On the guests (ppc, PCI agents) I cannot add/change PCI functions (the
 last .[0-9] in the PCI address) nor can I change PCI BAR's once the
 board has started. I'm pretty sure that would violate the PCI spec,
 since the PCI master would need to re-scan the bus, and re-assign
 addresses, which is a task for the BIOS.


 Yes.  Can the boards respond to PCI config space cycles coming from the  
 host, or is the config space implemented in silicon and immutable?   
 (reading on, I see the answer is no).  virtio-pci uses the PCI config  
 space to configure the hardware.


Yes, the PCI config space is implemented in silicon. I can change a few
things (mostly PCI BAR attributes), but not much.

 (There's no real guest on your setup, right?  just a kernel running on
 an x86 system and other kernels running on the PCI cards?)

  
 Yes, the x86 (PCI master) runs Linux (booted via PXELinux). The ppc's
 (PCI agents) also run Linux (booted via U-Boot). They are independent
 Linux systems, with a physical PCI interconnect.

 The x86 has CONFIG_PCI=y, however the ppc's have CONFIG_PCI=n. Linux's
 PCI stack does bad things as a PCI agent. It always assumes it is a PCI
 master.

 It is possible for me to enable CONFIG_PCI=y on the ppc's by removing
 the PCI bus from their list of devices provided by OpenFirmware. They
 cannot access PCI via normal methods. PCI drivers cannot work on the
 ppc's, because Linux assumes it is a PCI master.

 To the best of my knowledge, I cannot trap configuration space accesses
 on the PCI agents. I haven't needed that for anything I've done thus
 far.



 Well, if you can't do that, you can't use virtio-pci on the host.   
 You'll need another virtio transport (equivalent to fake pci you  
 mentioned above).


Ok.

Is there something similar that I can study as an example? Should I look
at virtio-pci?

 This does appear to be solved by vbus, though I haven't written a
 vbus-over-PCI implementation, so I cannot be completely sure.


 Even if virtio-pci doesn't work out for some reason (though it should),
 you can write your own virtio transport and implement its config space
 however you like.

  
 This is what I did with virtio-over-PCI. The way virtio-net negotiates
 features makes this work non-intuitively.


 I think you tried to take two virtio-nets and make them talk together?   
 That won't work.  You need the code from qemu to talk to virtio-net  
 config space, and vhost-net to pump the rings.


It *is* possible to make two unmodified virtio-net's talk together. I've
done it, and it is exactly what the virtio-over-PCI patch does. Study it
and you'll see how I connected the rx/tx queues together.

The feature negotiation code also works, but in a very unintuitive
manner. I made it work in the virtio-over-PCI patch, but the devices are
hardcoded into the driver. It would be quite a bit of work to swap
virtio-net and virtio-console, for example.

 I'm not at all clear on how to get feature negotiation to work on a
 system like mine. From my study of lguest and kvm (see below) it looks
 like userspace will need to be involved, via a miscdevice.


 I don't see why.  Is the kernel on the PCI cards in full control of all
 accesses?

  
 I'm not sure what you mean by this. Could you be more specific? This is
 a normal, unmodified vanilla Linux kernel running on the PCI agents.


 I meant, does board software implement the config space accesses issued  
 from the host, and it seems the answer is no.


 In my virtio-over-PCI patch, I hooked two virtio-net's together. I wrote
 an algorithm to pair the tx/rx queues together. Since virtio-net
 pre-fills its rx queues with buffers, I was able to use the DMA engine
 to copy from the tx queue into the pre-allocated memory in the rx queue.



 Please find a name other than virtio-over-PCI since it conflicts with  
 virtio-pci.  You're tunnelling virtio config cycles (which are usually  
 done on pci config cycles) on a new protocol which

Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects

2009-08-19 Thread Ira W. Snyder
On Wed, Aug 19, 2009 at 06:37:06PM +0300, Avi Kivity wrote:
 On 08/19/2009 06:28 PM, Ira W. Snyder wrote:

 Well, if you can't do that, you can't use virtio-pci on the host.
 You'll need another virtio transport (equivalent to fake pci you
 mentioned above).

  
 Ok.

 Is there something similar that I can study as an example? Should I look
 at virtio-pci?



 There's virtio-lguest, virtio-s390, and virtio-vbus.

 I think you tried to take two virtio-nets and make them talk together?
 That won't work.  You need the code from qemu to talk to virtio-net
 config space, and vhost-net to pump the rings.

  
 It *is* possible to make two unmodified virtio-net's talk together. I've
 done it, and it is exactly what the virtio-over-PCI patch does. Study it
 and you'll see how I connected the rx/tx queues together.


 Right, crossing the cables works, but feature negotiation is screwed up,  
 and both sides think the data is in their RAM.

 vhost-net doesn't do negotiation and doesn't assume the data lives in  
 its address space.


Yes, that is exactly what I did: crossed the cables (in software).

I'll take a closer look at vhost-net now, and make sure I understand how
it works.

 Please find a name other than virtio-over-PCI since it conflicts with
 virtio-pci.  You're tunnelling virtio config cycles (which are usually
 done on pci config cycles) on a new protocol which is itself tunnelled
 over PCI shared memory.

  
 Sorry about that. Do you have suggestions for a better name?



 virtio-$yourhardware or maybe virtio-dma


How about virtio-phys?

Arnd and BenH are both looking at PPC systems (similar to mine). Grant
Likely is looking at talking to a processor core running on an FPGA,
IIRC. Most of the code can be shared, very little should need to be
board-specific, I hope.

 I called it virtio-over-PCI in my previous postings to LKML, so until a
 new patch is written and posted, I'll keep referring to it by the name
 used in the past, so people can search for it.

 When I post virtio patches, should I CC another mailing list in addition
 to LKML?


 virtualizat...@lists.linux-foundation.org is virtio's home.

 That said, I'm not sure how qemu-system-ppc running on x86 could
 possibly communicate using virtio-net. This would mean the guest is an
 emulated big-endian PPC, while the host is a little-endian x86. I
 haven't actually tested this situation, so perhaps I am wrong.


 I'm confused now.  You don't actually have any guest, do you, so why  
 would you run qemu at all?


I do not run qemu. I am just stating a problem with virtio-net that I
noticed. This is just so someone more knowledgeable can be aware of the
problem.

 The x86 side only needs to run virtio-net, which is present in RHEL 5.3.
 You'd only need to run virtio-tunnel or however it's called.  All the
 eventfd magic takes place on the PCI agents.

  
 I can upgrade the kernel to anything I want on both the x86 and ppc's.
 I'd like to avoid changing the x86 (RHEL5) userspace, though. On the
 ppc's, I have full control over the userspace environment.


 You don't need any userspace on virtio-net's side.

 Your ppc boards emulate a virtio-net device, so all you need is the  
 virtio-net module (and virtio bindings).  If you chose to emulate, say,  
 an e1000 card all you'd need is the e1000 driver.


Thanks for the replies.
Ira


Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects

2009-08-18 Thread Ira W. Snyder
On Tue, Aug 18, 2009 at 11:46:06AM +0300, Michael S. Tsirkin wrote:
 On Mon, Aug 17, 2009 at 04:17:09PM -0400, Gregory Haskins wrote:
  Michael S. Tsirkin wrote:
   On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote:
   Case in point: Take an upstream kernel and you can modprobe the
   vbus-pcibridge in and virtio devices will work over that transport
   unmodified.
  
   See http://lkml.org/lkml/2009/8/6/244 for details.
   
   The modprobe you are talking about would need
   to be done in guest kernel, correct?
  
  Yes, and your point is? unmodified (pardon the pseudo pun) modifies
  virtio, not guest.
   It means you can take an off-the-shelf kernel
  with off-the-shelf virtio (ala distro-kernel) and modprobe
  vbus-pcibridge and get alacrityvm acceleration.
 
 Heh, by that logic ksplice does not modify running kernel either :)
 
  It is not a design goal of mine to forbid the loading of a new driver,
  so I am ok with that requirement.
  
   OTOH, Michael's patch is purely targeted at improving virtio-net on kvm,
   and its likewise constrained by various limitations of that decision
   (such as its reliance of the PCI model, and the kvm memory scheme).
   
   vhost is actually not related to PCI in any way. It simply leaves all
   setup for userspace to do.  And the memory scheme was intentionally
   separated from kvm so that it can easily support e.g. lguest.
   
  
  I think you have missed my point. I mean that vhost requires a separate
  bus-model (ala qemu-pci).
 
 So? That can be in userspace, and can be anything including vbus.
 
  And no, your memory scheme is not separated,
  at least, not very well.  It still assumes memory-regions and
  copy_to_user(), which is very kvm-esque.
 
 I don't think so: works for lguest, kvm, UML and containers
 
  Vbus has people using things
  like userspace containers (no regions),
 
 vhost by default works without regions
 
  and physical hardware (dma
  controllers, so no regions or copy_to_user) so your scheme quickly falls
  apart once you get away from KVM.
 
 Someone took a driver and is building hardware for it ... so what?
 

I think Greg is referring to something like my virtio-over-PCI patch.
I'm pretty sure that vhost is completely useless for my situation. I'd
like to see vhost work for my use, so I'll try to explain what I'm
doing.

I've got a system where I have about 20 computers connected via PCI. The
PCI master is a normal x86 system, and the PCI agents are PowerPC
systems. The PCI agents act just like any other PCI card, except they
are running Linux, and have their own RAM and peripherals.

I wrote a custom driver which imitated a network interface and a serial
port. I tried to push it towards mainline, and DavidM rejected it, with
the argument, use virtio, don't add another virtualization layer to the
kernel. I think he has a decent argument, so I wrote virtio-over-PCI.

Now, there are some things about virtio that don't work over PCI.
Mainly, memory is not truly shared. It is extremely slow to access
memory that is far away, meaning across the PCI bus. This can be
worked around by using a DMA controller to transfer all data, along with
an intelligent scheme to perform only writes across the bus. If you're
careful, reads are never needed.

So, in my system, copy_(to|from)_user() is completely wrong. There is no
userspace, only a physical system. In fact, because normal x86 computers
do not have DMA controllers, the host system doesn't actually handle any
data transfer!

I used virtio-net in both the guest and host systems in my example
virtio-over-PCI patch, and succeeded in getting them to communicate.
However, the lack of any setup interface means that the devices must be
hardcoded into both drivers, when the decision could be up to userspace.
I think this is a problem that vbus could solve.

For my own selfish reasons (I don't want to maintain an out-of-tree
driver) I'd like to see *something* useful in mainline Linux. I'm happy
to answer questions about my setup, just ask.

Ira


Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects

2009-08-18 Thread Ira W. Snyder
On Tue, Aug 18, 2009 at 07:51:21PM +0300, Avi Kivity wrote:
 On 08/18/2009 06:53 PM, Ira W. Snyder wrote:
 So, in my system, copy_(to|from)_user() is completely wrong. There is no
 userspace, only a physical system. In fact, because normal x86 computers
 do not have DMA controllers, the host system doesn't actually handle any
 data transfer!


 In fact, modern x86s do have dma engines these days (google for Intel  
 I/OAT), and one of our plans for vhost-net is to allow their use for  
 packets above a certain size.  So a patch allowing vhost-net to  
 optionally use a dma engine is a good thing.


Yes, I'm aware that very modern x86 PCs have general purpose DMA
engines, even though I don't have any capable hardware. However, I think
it is better to support using any PC (with or without a DMA engine, on
any architecture) as the PCI master, and just handle all of the DMA from
the PCI agent, which is known to have a DMA engine?

 I used virtio-net in both the guest and host systems in my example
 virtio-over-PCI patch, and succeeded in getting them to communicate.
 However, the lack of any setup interface means that the devices must be
 hardcoded into both drivers, when the decision could be up to userspace.
 I think this is a problem that vbus could solve.


 Exposing a knob to userspace is not an insurmountable problem; vhost-net  
 already allows changing the memory layout, for example.


Let me explain the most obvious problem I ran into: setting the MAC
addresses used in virtio.

On the host (PCI master), I want eth0 (virtio-net) to get a random MAC
address.

On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC
address, aa:bb:cc:dd:ee:ff.

The virtio feature negotiation code handles this by looking for the
VIRTIO_NET_F_MAC feature in its configuration space. Unless BOTH drivers
have VIRTIO_NET_F_MAC set, NEITHER will use the specified MAC
address. This is because the feature negotiation code only accepts a
feature if it is offered by both sides of the connection.
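
Roughly, this is what ends up happening inside the probe routine (a
simplified sketch, not the exact code in drivers/net/virtio_net.c):

        if (virtio_has_feature(vdev, VIRTIO_NET_F_MAC))
                /* both sides offered it: take the MAC from config space */
                vdev->config->get(vdev,
                                  offsetof(struct virtio_net_config, mac),
                                  dev->dev_addr, ETH_ALEN);
        else
                /* otherwise the MAC I asked for is simply ignored */
                random_ether_addr(dev->dev_addr);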

In this case, I must have the guest generate a random MAC address and
have the host put aa:bb:cc:dd:ee:ff into the guest's configuration
space. This basically means hardcoding the MAC addresses in the Linux
drivers, which is a big no-no.

What would I expose to userspace to make this situation manageable?

Thanks for the response,
Ira


Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects

2009-08-18 Thread Ira W. Snyder
On Tue, Aug 18, 2009 at 08:47:04PM +0300, Avi Kivity wrote:
 On 08/18/2009 08:27 PM, Ira W. Snyder wrote:
 In fact, modern x86s do have dma engines these days (google for Intel
 I/OAT), and one of our plans for vhost-net is to allow their use for
 packets above a certain size.  So a patch allowing vhost-net to
 optionally use a dma engine is a good thing.
  
 Yes, I'm aware that very modern x86 PCs have general purpose DMA
 engines, even though I don't have any capable hardware. However, I think
 it is better to support using any PC (with or without a DMA engine, on
 any architecture) as the PCI master, and just handle all of the DMA from
 the PCI agent, which is known to have a DMA engine?


 Certainly; but if your PCI agent will support the DMA API, then the same  
 vhost code will work with both I/OAT and your specialized hardware.


Yes, that's true. My ppc is a Freescale MPC8349EMDS. It has a Linux
DMAEngine driver in mainline, which I've used. That's excellent.

 Exposing a knob to userspace is not an insurmountable problem; vhost-net
 already allows changing the memory layout, for example.

  
 Let me explain the most obvious problem I ran into: setting the MAC
 addresses used in virtio.

 On the host (PCI master), I want eth0 (virtio-net) to get a random MAC
 address.

 On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC
 address, aa:bb:cc:dd:ee:ff.

 The virtio feature negotiation code handles this by looking for the
 VIRTIO_NET_F_MAC feature in its configuration space. Unless BOTH drivers
 have VIRTIO_NET_F_MAC set, NEITHER will use the specified MAC
 address. This is because the feature negotiation code only accepts a
 feature if it is offered by both sides of the connection.

 In this case, I must have the guest generate a random MAC address and
 have the host put aa:bb:cc:dd:ee:ff into the guest's configuration
 space. This basically means hardcoding the MAC addresses in the Linux
 drivers, which is a big no-no.

 What would I expose to userspace to make this situation manageable?



 I think in this case you want one side to be virtio-net (I'm guessing  
 the x86) and the other side vhost-net (the ppc boards with the dma  
 engine).  virtio-net on x86 would communicate with userspace on the ppc  
 board to negotiate features and get a mac address, the fast path would  
 be between virtio-net and vhost-net (which would use the dma engine to  
 push and pull data).


Ah, that seems backwards, but it should work after vhost-net learns how
to use the DMAEngine API.

I haven't studied vhost-net very carefully yet. As soon as I saw the
copy_(to|from)_user() I stopped reading, because it seemed useless for
my case. I'll look again and try to find where vhost-net supports
setting MAC addresses and other features.

Also, in my case I'd like to boot Linux with my rootfs over NFS. Is
vhost-net capable of this?

I've had Arnd, BenH, and Grant Likely (and others, privately) contact me
about devices they are working with that would benefit from something
like virtio-over-PCI. I'd like to see vhost-net be merged with the
capability to support my use case. There are plenty of others that would
benefit, not just myself.

I'm not sure vhost-net is being written with this kind of future use in
mind. I'd hate to see it get merged, and then have to change the ABI to
support physical-device-to-device usage. It would be better to keep
future use in mind now, rather than try and hack it in later.

Thanks for the comments.
Ira


Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects

2009-08-18 Thread Ira W. Snyder
On Tue, Aug 18, 2009 at 09:52:48PM +0300, Avi Kivity wrote:
 On 08/18/2009 09:27 PM, Ira W. Snyder wrote:
 I think in this case you want one side to be virtio-net (I'm guessing
 the x86) and the other side vhost-net (the ppc boards with the dma
 engine).  virtio-net on x86 would communicate with userspace on the ppc
 board to negotiate features and get a mac address, the fast path would
 be between virtio-net and vhost-net (which would use the dma engine to
 push and pull data).

  

 Ah, that seems backwards, but it should work after vhost-net learns how
 to use the DMAEngine API.

 I haven't studied vhost-net very carefully yet. As soon as I saw the
 copy_(to|from)_user() I stopped reading, because it seemed useless for
 my case. I'll look again and try to find where vhost-net supports
 setting MAC addresses and other features.


 It doesn't; all it does is pump the rings, leaving everything else to  
 userspace.


Ok.

On a non shared-memory system (where the guest's RAM is not just a chunk
of userspace RAM in the host system), virtio's management model seems to
fall apart. Feature negotiation doesn't work as one would expect.

This does appear to be solved by vbus, though I haven't written a
vbus-over-PCI implementation, so I cannot be completely sure.

I'm not at all clear on how to get feature negotiation to work on a
system like mine. From my study of lguest and kvm (see below) it looks
like userspace will need to be involved, via a miscdevice.

 Also, in my case I'd like to boot Linux with my rootfs over NFS. Is
 vhost-net capable of this?


 It's just another network interface.  You'd need an initramfs though to  
 contain the needed userspace.


Ok. I'm using an initramfs already, so adding some more userspace to it
isn't a problem.

 I've had Arnd, BenH, and Grant Likely (and others, privately) contact me
 about devices they are working with that would benefit from something
 like virtio-over-PCI. I'd like to see vhost-net be merged with the
 capability to support my use case. There are plenty of others that would
 benefit, not just myself.

 I'm not sure vhost-net is being written with this kind of future use in
 mind. I'd hate to see it get merged, and then have to change the ABI to
 support physical-device-to-device usage. It would be better to keep
 future use in mind now, rather than try and hack it in later.


 Please review and comment then.  I'm fairly confident there won't be any  
 ABI issues since vhost-net does so little outside pumping the rings.


Ok. I thought I should at least express my concerns while we're
discussing this, rather than being too late after finding the time to
study the driver.

Off the top of my head, I would think that transporting userspace
addresses in the ring (for copy_(to|from)_user()) vs. physical addresses
(for DMAEngine) might be a problem. Pinning userspace pages into memory
for DMA is a bit of a pain, though it is possible.
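
For reference, the pinning step would look something like this (a sketch
only; the function name is made up and error handling is trimmed):

#include <linux/mm.h>
#include <linux/dma-mapping.h>

/* pin one page of a user buffer and map it so a DMA engine can write it */
static dma_addr_t pin_and_map_page(struct device *dev, unsigned long uaddr,
                                   struct page **page)
{
        get_user_pages_fast(uaddr & PAGE_MASK, 1, 1 /* write */, page);
        return dma_map_page(dev, *page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
}

/* later: dma_unmap_page(dev, addr, PAGE_SIZE, DMA_FROM_DEVICE);
 *        put_page(*page); */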

There is also the problem of different endianness between host and guest
in virtio-net. The struct virtio_net_hdr (include/linux/virtio_net.h)
defines fields in host byte order. Which totally breaks if the guest has
a different endianness. This is a virtio-net problem though, and is not
transport specific.

 Note the signalling paths go through eventfd: when vhost-net wants the  
 other side to look at its ring, it tickles an eventfd which is supposed  
 to trigger an interrupt on the other side.  Conversely, when another  
 eventfd is signalled, vhost-net will look at the ring and process any  
 data there.  You'll need to wire your signalling to those eventfds,  
 either in userspace or in the kernel.


Ok. I've never used eventfd before, so that'll take yet more studying.

I've browsed over both the kvm and lguest code, and it looks like they
each re-invent a mechanism for transporting interrupts between the host
and guest, using eventfd. They both do this by implementing a
miscdevice, which is basically their management interface.

See drivers/lguest/lguest_user.c (see write() and LHREQ_EVENTFD) and
kvm-kmod-devel-88/x86/kvm_main.c (see kvm_vm_ioctl(), called via
kvm_dev_ioctl()) for how they hook up eventfd's.
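
If I understand it correctly, wiring my doorbell interrupt into one of
those eventfds would look roughly like this (a sketch; the names here are
made up):

#include <linux/eventfd.h>
#include <linux/interrupt.h>

static struct eventfd_ctx *kick_ctx;    /* looked up via eventfd_ctx_fdget() */

static irqreturn_t doorbell_isr(int irq, void *dev_id)
{
        /* tell whoever polls the eventfd (e.g. vhost) to look at the ring */
        eventfd_signal(kick_ctx, 1);
        return IRQ_HANDLED;
}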

I can now imagine how two userspace programs (host and guest) could work
together to implement a management interface, including hotplug of
devices, etc. Of course, this would basically reinvent the vbus
management interface into a specific driver.

I think this is partly what Greg is trying to abstract out into generic
code. I haven't studied the actual data transport mechanisms in vbus,
though I have studied virtio's transport mechanism. I think a generic
management interface for virtio might be a good thing to consider,
because it seems there are at least two implementations already: kvm and
lguest.

Thanks for answering my questions. It helps to talk with someone more
familiar with the issues than I am.

Ira

Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects

2009-08-18 Thread Ira W. Snyder
On Tue, Aug 18, 2009 at 11:57:48PM +0300, Michael S. Tsirkin wrote:
 On Tue, Aug 18, 2009 at 08:53:29AM -0700, Ira W. Snyder wrote:
  I think Greg is referring to something like my virtio-over-PCI patch.
  I'm pretty sure that vhost is completely useless for my situation. I'd
  like to see vhost work for my use, so I'll try to explain what I'm
  doing.
  
  I've got a system where I have about 20 computers connected via PCI. The
  PCI master is a normal x86 system, and the PCI agents are PowerPC
  systems. The PCI agents act just like any other PCI card, except they
  are running Linux, and have their own RAM and peripherals.
  
  I wrote a custom driver which imitated a network interface and a serial
  port. I tried to push it towards mainline, and DavidM rejected it, with
  the argument, use virtio, don't add another virtualization layer to the
  kernel. I think he has a decent argument, so I wrote virtio-over-PCI.
  
  Now, there are some things about virtio that don't work over PCI.
  Mainly, memory is not truly shared. It is extremely slow to access
  memory that is far away, meaning across the PCI bus. This can be
  worked around by using a DMA controller to transfer all data, along with
  an intelligent scheme to perform only writes across the bus. If you're
  careful, reads are never needed.
  
  So, in my system, copy_(to|from)_user() is completely wrong.
  There is no userspace, only a physical system.
 
 Can guests do DMA to random host memory? Or is there some kind of IOMMU
 and DMA API involved? If the later, then note that you'll still need
 some kind of driver for your device. The question we need to ask
 ourselves then is whether this driver can reuse bits from vhost.
 

Mostly. All of my systems are 32-bit (both x86 and ppc). From the ppc
(and its DMA engine), I can see the first 1GB of host memory.

This limited view is due to address space limitations on the ppc. The
view of PCI memory must live somewhere in the ppc address space, along
with the ppc's SDRAM, flash, and other peripherals. Since this is a
32bit processor, I only have 4GB of address space to work with.

The PCI address space could be up to 4GB in size. If I tried to allow
the ppc boards to view all 4GB of PCI address space, then they would
have no address space left for their onboard SDRAM, etc.

Hopefully that makes sense.

I use dma_set_mask(dev, DMA_BIT_MASK(30)) on the host system to ensure
that when dma_map_sg() is called, it returns addresses that can be
accessed directly by the device.

The DMAEngine can access any local (ppc) memory without any restriction.

I have used the Linux DMAEngine API (include/linux/dmaengine.h) to
handle all data transfer across the PCI bus. The Intel I/OAT (and many
others) use the same API.
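
A single copy across the bus looks roughly like this (a simplified
sketch; in real code the channel is requested once at probe time and
errors are actually handled):

#include <linux/dmaengine.h>

static int pci_dma_copy(dma_addr_t dst, dma_addr_t src, size_t len)
{
        dma_cap_mask_t mask;
        struct dma_chan *chan;
        struct dma_async_tx_descriptor *tx;

        dma_cap_zero(mask);
        dma_cap_set(DMA_MEMCPY, mask);
        chan = dma_request_channel(mask, NULL, NULL);
        if (!chan)
                return -ENODEV;

        /* one address is local ppc SDRAM, the other falls inside the
         * outbound PCI window that covers the x86's RAM */
        tx = chan->device->device_prep_dma_memcpy(chan, dst, src, len,
                                                  DMA_PREP_INTERRUPT);
        tx->tx_submit(tx);
        dma_async_issue_pending(chan);
        return 0;
}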

  In fact, because normal x86 computers
  do not have DMA controllers, the host system doesn't actually handle any
  data transfer!
 
 Is it true that PPC has to initiate all DMA then? How do you
 manage not to do DMA reads then?
 

Yes, the ppc initiates all DMA. It handles all data transfer (both reads
and writes) across the PCI bus, for speed reasons. A CPU cannot create
burst transactions on the PCI bus. This is the reason that most (all?)
network cards (as a familiar example) use DMA to transfer packet
contents into RAM.

Sorry if I made a confusing statement (no reads are necessary)
earlier. What I meant to say was: If you are very careful, it is not
necessary for the CPU to do any reads over the PCI bus to maintain
state. Writes are the only necessary CPU-initiated transaction.

I implemented this in my virtio-over-PCI patch, copying as much as
possible from the virtio vring structure. The descriptors in the rings
are only changed by one side of the connection, therefore they can be
cached as they are written (via the CPU) across the PCI bus, with the
knowledge that both sides will have a consistent view.

I'm sorry, this is hard to explain via email. It is much easier in a
room with a whiteboard. :)
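
But the rough idea, in code (made-up names, just to show the write-only
trick):

#include <linux/io.h>
#include <linux/virtio_ring.h>

static struct vring_desc shadow[256];           /* local, cacheable copy */
static struct vring_desc __iomem *remote;       /* mapped from the peer's BAR */

static void publish_desc(unsigned int idx, const struct vring_desc *d)
{
        shadow[idx] = *d;                         /* remember what we sent */
        memcpy_toio(&remote[idx], d, sizeof(*d)); /* posted write over PCI */
        wmb();                                    /* order it before the kick */
}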

  I used virtio-net in both the guest and host systems in my example
  virtio-over-PCI patch, and succeeded in getting them to communicate.
  However, the lack of any setup interface means that the devices must be
  hardcoded into both drivers, when the decision could be up to userspace.
  I think this is a problem that vbus could solve.
 
 What you describe (passing setup from host to guest) seems like
 a feature that guest devices need to support. It seems unlikely that
 vbus, being a transport layer, can address this.
 

I think I explained this poorly as well.

Virtio needs two things to function:
1) a set of descriptor rings (1 or more)
2) a way to kick each ring.

With the amount of space available in the ppc's PCI BAR's (which point
at a small chunk of SDRAM), I could potentially make ~6 virtqueues + 6
kick interrupts available.
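
The kick itself is trivial: just a CPU write into the peer's doorbell
register, something like this (the register offset is made up for
illustration):

#include <linux/io.h>

#define DOORBELL_REG    0x0     /* made-up offset into the doorbell block */

static void kick_ring(void __iomem *doorbell, unsigned int ring)
{
        /* raises an interrupt on the other side of the PCI bus */
        iowrite32(1 << ring, doorbell + DOORBELL_REG);
}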

Right now, my virtio-over-PCI driver hardcoded the first and second
virtqueues to be for virtio-net only, and nothing else.

What if the user wanted 2

Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects

2009-08-18 Thread Ira W. Snyder
On Wed, Aug 19, 2009 at 12:26:23AM +0300, Avi Kivity wrote:
 On 08/18/2009 11:59 PM, Ira W. Snyder wrote:
 On a non shared-memory system (where the guest's RAM is not just a chunk
 of userspace RAM in the host system), virtio's management model seems to
 fall apart. Feature negotiation doesn't work as one would expect.


 In your case, virtio-net on the main board accesses PCI config space  
 registers to perform the feature negotiation; software on your PCI cards  
 needs to trap these config space accesses and respond to them according  
 to virtio ABI.


Is this real PCI (physical hardware) or fake PCI (software PCI
emulation) that you are describing?

The host (x86, PCI master) must use real PCI to actually configure the
boards, enable bus mastering, etc. Just like any other PCI device, such
as a network card.

On the guests (ppc, PCI agents) I cannot add/change PCI functions (the
last .[0-9] in the PCI address) nor can I change PCI BAR's once the
board has started. I'm pretty sure that would violate the PCI spec,
since the PCI master would need to re-scan the bus, and re-assign
addresses, which is a task for the BIOS.

 (There's no real guest on your setup, right?  just a kernel running on  
 and x86 system and other kernels running on the PCI cards?)


Yes, the x86 (PCI master) runs Linux (booted via PXELinux). The ppc's
(PCI agents) also run Linux (booted via U-Boot). They are independent
Linux systems, with a physical PCI interconnect.

The x86 has CONFIG_PCI=y, however the ppc's have CONFIG_PCI=n. Linux's
PCI stack does bad things as a PCI agent. It always assumes it is a PCI
master.

It is possible for me to enable CONFIG_PCI=y on the ppc's by removing
the PCI bus from their list of devices provided by OpenFirmware. They
cannot access PCI via normal methods. PCI drivers cannot work on the
ppc's, because Linux assumes it is a PCI master.

To the best of my knowledge, I cannot trap configuration space accesses
on the PCI agents. I haven't needed that for anything I've done thus
far.

 This does appear to be solved by vbus, though I haven't written a
 vbus-over-PCI implementation, so I cannot be completely sure.


 Even if virtio-pci doesn't work out for some reason (though it should),  
 you can write your own virtio transport and implement its config space  
 however you like.


This is what I did with virtio-over-PCI. The way virtio-net negotiates
features makes this work non-intuitively.

 I'm not at all clear on how to get feature negotiation to work on a
 system like mine. From my study of lguest and kvm (see below) it looks
 like userspace will need to be involved, via a miscdevice.


 I don't see why.  Is the kernel on the PCI cards in full control of all  
 accesses?


I'm not sure what you mean by this. Could you be more specific? This is
a normal, unmodified vanilla Linux kernel running on the PCI agents.

 Ok. I thought I should at least express my concerns while we're
 discussing this, rather than being too late after finding the time to
 study the driver.

 Off the top of my head, I would think that transporting userspace
 addresses in the ring (for copy_(to|from)_user()) vs. physical addresses
 (for DMAEngine) might be a problem. Pinning userspace pages into memory
 for DMA is a bit of a pain, though it is possible.


 Oh, the ring doesn't transport userspace addresses.  It transports guest  
 addresses, and it's up to vhost to do something with them.

 Currently vhost supports two translation modes:

 1. virtio address == host virtual address (using copy_to_user)
 2. virtio address == offsetted host virtual address (using copy_to_user)

 The latter mode is used for kvm guests (with multiple offsets, skipping  
 some details).

 I think you need to add a third mode, virtio address == host physical  
 address (using dma engine).  Once you do that, and wire up the  
 signalling, things should work.


Ok.

In my virtio-over-PCI patch, I hooked two virtio-net's together. I wrote
an algorithm to pair the tx/rx queues together. Since virtio-net
pre-fills its rx queues with buffers, I was able to use the DMA engine
to copy from the tx queue into the pre-allocated memory in the rx queue.
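
The core of that pairing is something like this (a very rough sketch;
every name below other than dma_chan is made up):

struct vop_dev;                         /* made-up device context */
static bool vop_peek_peer_tx(struct vop_dev *, dma_addr_t *, u32 *);
static bool vop_next_local_rx(struct vop_dev *, dma_addr_t *);
static int vop_dma_copy(struct dma_chan *, dma_addr_t, dma_addr_t, u32);

static int forward_one_packet(struct vop_dev *vop, struct dma_chan *chan)
{
        dma_addr_t src, dst;
        u32 len;

        /* next filled buffer in the peer's tx ring */
        if (!vop_peek_peer_tx(vop, &src, &len))
                return -EAGAIN;

        /* next empty buffer that virtio-net pre-filled into its rx ring */
        if (!vop_next_local_rx(vop, &dst))
                return -ENOSPC;

        /* hand the copy to the DMA engine (wrapper around dmaengine memcpy) */
        return vop_dma_copy(chan, dst, src, len);
}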

I have an intuitive idea about how I think vhost-net works in this case.

 There is also the problem of different endianness between host and guest
 in virtio-net. The struct virtio_net_hdr (include/linux/virtio_net.h)
 defines fields in host byte order. Which totally breaks if the guest has
 a different endianness. This is a virtio-net problem though, and is not
 transport specific.


 Yeah.  You'll need to add byteswaps.


I wonder if Rusty would accept a new feature:
VIRTIO_F_NET_LITTLE_ENDIAN, which would allow the virtio-net driver to
use LE for all of its multi-byte fields.

I don't think the transport should have to care about the endianness.
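
Something like this hypothetical helper is all a big-endian guest would
need if the header were defined as little-endian (a sketch only, not
existing code; sparse would want __le16 types):

#include <linux/virtio_net.h>
#include <asm/byteorder.h>

static void virtio_net_hdr_to_le(struct virtio_net_hdr *hdr)
{
        hdr->hdr_len     = cpu_to_le16(hdr->hdr_len);
        hdr->gso_size    = cpu_to_le16(hdr->gso_size);
        hdr->csum_start  = cpu_to_le16(hdr->csum_start);
        hdr->csum_offset = cpu_to_le16(hdr->csum_offset);
        /* flags and gso_type are single bytes, nothing to swap */
}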

 I've browsed over both the kvm and lguest code, and it looks like they
 each re-invent a mechanism for transporting interrupts between

Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects

2009-08-18 Thread Ira W. Snyder
On Wed, Aug 19, 2009 at 01:06:45AM +0300, Avi Kivity wrote:
 On 08/19/2009 12:26 AM, Avi Kivity wrote:

 Off the top of my head, I would think that transporting userspace
 addresses in the ring (for copy_(to|from)_user()) vs. physical addresses
 (for DMAEngine) might be a problem. Pinning userspace pages into memory
 for DMA is a bit of a pain, though it is possible.


 Oh, the ring doesn't transport userspace addresses.  It transports  
 guest addresses, and it's up to vhost to do something with them.

 Currently vhost supports two translation modes:

 1. virtio address == host virtual address (using copy_to_user)
 2. virtio address == offsetted host virtual address (using copy_to_user)

 The latter mode is used for kvm guests (with multiple offsets,  
 skipping some details).

 I think you need to add a third mode, virtio address == host physical  
 address (using dma engine).  Once you do that, and wire up the  
 signalling, things should work.


 You don't need in fact a third mode.  You can mmap the x86 address space  
 into your ppc userspace and use the second mode.  All you need then is  
 the dma engine glue and byte swapping.


Hmm, I'll have to think about that.

The ppc is a 32-bit processor, so it has 4GB of address space for
everything, including PCI, SDRAM, flash memory, and all other
peripherals.

This is exactly like 32bit x86, where you cannot have a PCI card that
exposes a 4GB PCI BAR. The system would have no address space left for
its own SDRAM.

On my x86 computers, I only have 1GB of physical RAM, and so the ppc's
have plenty of room in their address spaces to map the entire x86 RAM
into their own address space. That is exactly what I do now. Accesses to
ppc physical address 0x8000 magically hit x86 physical address
0x0.

Ira


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-12 Thread Ira W. Snyder
On Wed, Aug 12, 2009 at 07:03:22PM +0200, Arnd Bergmann wrote:
 On Monday 10 August 2009, Michael S. Tsirkin wrote:
 
  +struct workqueue_struct *vhost_workqueue;
 
 [nitpicking] This could be static. 
 
  +/* The virtqueue structure describes a queue attached to a device. */
  +struct vhost_virtqueue {
  +   struct vhost_dev *dev;
  +
  +   /* The actual ring of buffers. */
  +   struct mutex mutex;
  +   unsigned int num;
  +   struct vring_desc __user *desc;
  +   struct vring_avail __user *avail;
  +   struct vring_used __user *used;
  +   struct file *kick;
  +   struct file *call;
  +   struct file *error;
  +   struct eventfd_ctx *call_ctx;
  +   struct eventfd_ctx *error_ctx;
  +
  +   struct vhost_poll poll;
  +
  +   /* The routine to call when the Guest pings us, or timeout. */
  +   work_func_t handle_kick;
  +
  +   /* Last available index we saw. */
  +   u16 last_avail_idx;
  +
  +   /* Last index we used. */
  +   u16 last_used_idx;
  +
  +   /* Outstanding buffers */
  +   unsigned int inflight;
  +
  +   /* Is this blocked? */
  +   bool blocked;
  +
  +   struct iovec iov[VHOST_NET_MAX_SG];
  +
  +} cacheline_aligned;
 
 We discussed this before, and I still think this could be directly derived
 from struct virtqueue, in the same way that vring_virtqueue is derived from
 struct virtqueue. That would make it possible for simple device drivers
 to use the same driver in both host and guest, similar to how Ira Snyder
 used virtqueues to make virtio_net run between two hosts running the
 same code [1].
 
 Ideally, I guess you should be able to even make virtio_net work in the
 host if you do that, but that could bring other complexities.

I have no comments about the vhost code itself, I haven't reviewed it.

It might be interesting to try using a virtio-net in the host kernel to
communicate with the virtio-net running in the guest kernel. The lack of
a management interface is the biggest problem you will face (setting MAC
addresses, negotiating features, etc. doesn't work intuitively). Getting
the network interfaces talking is relatively easy.

Ira


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-12 Thread Ira W. Snyder
On Wed, Aug 12, 2009 at 08:31:04PM +0300, Michael S. Tsirkin wrote:
 On Wed, Aug 12, 2009 at 10:19:22AM -0700, Ira W. Snyder wrote:

[ snip out code ]

   
   We discussed this before, and I still think this could be directly derived
   from struct virtqueue, in the same way that vring_virtqueue is derived 
   from
   struct virtqueue. That would make it possible for simple device drivers
   to use the same driver in both host and guest, similar to how Ira Snyder
   used virtqueues to make virtio_net run between two hosts running the
   same code [1].
   
   Ideally, I guess you should be able to even make virtio_net work in the
   host if you do that, but that could bring other complexities.
  
  I have no comments about the vhost code itself, I haven't reviewed it.
  
  It might be interesting to try using a virtio-net in the host kernel to
  communicate with the virtio-net running in the guest kernel. The lack of
  a management interface is the biggest problem you will face (setting MAC
  addresses, negotiating features, etc. doesn't work intuitively).
 
 That was one of the reasons I decided to move most of code out to
 userspace. My kernel driver only handles datapath,
 it's much smaller than virtio net.
 
  Getting
  the network interfaces talking is relatively easy.
  
  Ira
 
 Tried this, but
 - guest memory isn't pinned, so copy_to_user
   to access it, errors need to be handled in a sane way
 - used/available roles are reversed
 - kick/interrupt roles are reversed
 
 So most of the code then looks like
 
   if (host) {
   } else {
   }
   return
 
 
 The only common part is walking the descriptor list,
 but that's like 10 lines of code.
 
 At which point it's better to keep host/guest code separate, IMO.
 

Ok, that makes sense. Let me see if I understand the concept of the
driver. Here's a picture of what makes sense to me:

guest system
---------------------------------
| userspace applications        |
---------------------------------
| kernel network stack          |
---------------------------------
| virtio-net                    |
---------------------------------
| transport (virtio-ring, etc.) |
---------------------------------
                |
                |
---------------------------------
| transport (virtio-ring, etc.) |
---------------------------------
| some driver (maybe vhost?)    | -- [1]
---------------------------------
| kernel network stack          |
---------------------------------
host system

From the host's network stack, packets can be forwarded out to the
physical network, or be consumed by a normal userspace application on
the host. Just as if this were any other network interface.

In my patch, [1] was the virtio-net driver, completely unmodified.

So, does this patch accomplish the above diagram? If so, why the
copy_to_user(), etc? Maybe I'm confusing this with my system, where the
guest is another physical system, separated by the PCI bus.

Ira


Re: [PATCH 0/7] AlacrityVM guest drivers

2009-08-06 Thread Ira W. Snyder
On Thu, Aug 06, 2009 at 10:29:08AM -0600, Gregory Haskins wrote:
  On 8/6/2009 at 11:40 AM, in message 200908061740.04276.a...@arndb.de, 
  Arnd
 Bergmann a...@arndb.de wrote: 
  On Thursday 06 August 2009, Gregory Haskins wrote:

[ big snip ]

  
  3. The ioq method seems to be the real core of your work that makes
  venet perform better than virtio-net with its virtqueues. I don't see
  any reason to doubt that your claim is correct. My conclusion from
  this would be to add support for ioq to virtio devices, alongside
  virtqueues, but to leave out the extra bus_type and probing method.
 
 While I appreciate the sentiment, I doubt that is actually what's helping here.
 
 There are a variety of factors that I poured into venet/vbus that I think 
 contribute to its superior performance.  However, the difference in the ring 
 design I do not think is one of them.  In fact, in many ways I think Rusty's 
 design might turn out to be faster if put side by side because he was much 
 more careful with cacheline alignment than I was.  Also note that I was 
 careful to not pick one ring vs the other ;)  They both should work.

IMO, the virtio vring design is very well thought out. I found it
relatively easy to port to a host+blade setup, and run virtio-net over a
physical PCI bus, connecting two physical CPUs.

 
 IMO, we are only looking at the tip of the iceberg when looking at this 
 purely as the difference between virtio-pci vs virtio-vbus, or venet vs 
 virtio-net.
 
 Really, the big thing I am working on here is the host side device-model.  
 The idea here was to design a bus model that was conducive to high 
 performance, software to software IO that would work in a variety of 
 environments (that may or may not have PCI).  KVM is one such environment, 
 but I also have people looking at building other types of containers, and 
 even physical systems (host+blade kind of setups).
 
 The idea is that the connector is modular, and then something like 
 virtio-net or venet just work: in kvm, in the userspace container, on the 
 blade system. 
 
 It provides a management infrastructure that (hopefully) makes sense for 
 these different types of containers, regardless of whether they have PCI, 
 QEMU, etc (e.g. things that are inherent to KVM, but not others).
 
 I hope this helps to clarify the project :)
 

I think this is the major benefit of vbus. I've only started studying
the vbus code, so I don't have lots to say yet. The overview of the
management interface makes it look pretty good.

Getting two virtio-net drivers hooked together in my virtio-over-PCI
patches was nasty. If you read the thread that followed, you'll see
the lack of a management interface as a concern of mine. It was
basically decided that it could come later. The configfs interface
vbus provides is pretty nice, IMO.

Just my two cents,
Ira