Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-10 Thread Amit Shah
On (Fri) Aug 07 2009 [09:14:43], Anthony Liguori wrote:
 Amit Shah wrote:
 On (Thu) Aug 06 2009 [18:37:40], Jamie Lokier wrote:
   
 Apart from dbus, hard-coded meanings of small N in /dev/vmchN are
 asking for trouble.  It is bound to break when widely deployed and
 

 It's no different from the way major-minor numbering works on the Linux
 kernel: they uniquely identify a device.

 Bad example.  Quite a lot of modern device drivers are using dynamic  
 major/minor numbers because they have proven to be such a pain in the  
 butt.  That's why we have more sophisticated mechanisms like udev for  
 userspace to make use of.

Let me explain how we came to this numbering: we first had support for
'naming' ports and the names were obtained by userspace programs by an
ioctl. Rusty suggested using a numbering scheme where some ports
could exist at predefined locations so that we wouldn't need the naming
and the ioctls around it.

 We'll definitely need some way to support dynamic vmchannels.  Static  
 allocation of ports is just not going to work.  If we did a userspace  
 daemon, I'd suggest using some sort of universal identifier that's easy  
 to manage in a distributed fashion.  Like a reverse fqdn.

 So for instance, I could have a com.ibm.my-awesome-channel and never  
 have to worry about conflicts.

Hm, we could have something like a class of ports instead of the
current minor-number scheme: each port exposes what class it belongs to,
like

-virtioserial unix:/tmp/foo,class=stream
-virtioserial unix:/tmp/bar,class=console

 guest/host configs don't match.  It also doesn't fit comfortably when
 you have, say, bob and alice both logged in with desktops on separate
 VTs.  Clashes are inevitable, as third-party apps pick N values for
 themselves then get distributed - unless N values can be large
 (/dev/vmch44324 == kernelstats...).
 

 Hm, so there can be one daemon on the guest handling all clipboard
 events. There's some work done already by the fast-user-switch support
 and that can be extended to daemons that talk over virtio-serial.
   

 You could have one daemon that manages all vmchannel sessions.  It can  
 then expose channels to apps via whatever mechanism is best.  It could  
 use unix domain sockets, sys v ipc, whatever floats your boat.

 And, you can build this daemon today using the existing vmchannel over  
 TCP/IP.  You could also make it support serial devices.  We could also  
 introduce a custom usb device and use libusb.  libusb is portable to  
 Windows and Linux.

There are some other problems with usb too: It's not transparent to
users. Any hotplug event could alert users and that's not desired. It's
a system-only thing and should also remain that way.

Amit
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-10 Thread Anthony Liguori
Gerd Hoffmann wrote:

 There are some other problems with usb too: It's not transparent to
 users. Any hotplug event could alert users and that's not desired. It's
 a system-only thing and should also remain that way.

 I think virtio-serial is the better way to handle vmchannel.  Unlike 
 usb virtio is designed to work nicely in a virtual environment.

 But vmchannel-over-usbserial should be easy too though in case some 
 guests lack virtio backports or something.

I think you're missing my fundamental point.  Don't use the kernel as 
the guest interface.

Introduce a userspace daemon that exposes a domain socket.  Then we can 
have a proper protocol that uses reverse fqdns for identification.

We can do the backend over TCP/IP, usb, standard serial, etc.
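
Roughly, a guest application would then just connect to a per-channel unix
domain socket named by its reverse fqdn. A minimal sketch, assuming the
/var/run/vmchannel/... naming suggested later in this thread (the path and
the daemon behind it are purely hypothetical):

/* Illustrative only: connect to a per-channel unix domain socket that a
 * vmchannel daemon is assumed to expose; the path just mirrors the
 * reverse-fqdn naming idea and is not an existing interface. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    char buf[256];
    ssize_t n;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    strncpy(addr.sun_path, "/var/run/vmchannel/com.ibm.my-awesome-channel.sock",
            sizeof(addr.sun_path) - 1);
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("vmchannel");
        return 1;
    }
    write(fd, "hello host\n", 11);      /* plain bidirectional byte stream */
    n = read(fd, buf, sizeof(buf));
    if (n > 0)
        fwrite(buf, 1, n, stdout);
    close(fd);
    return 0;
}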

Regards,

Anthony Liguori
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Fischer, Anna
 Subject: Re: [PATCH][RFC] net/bridge: add basic VEPA support
 
 On Friday 07 August 2009, Paul Congdon (UC Davis) wrote:
  As I understand the macvlan code, it currently doesn't allow two VMs on the
  same machine to communicate with one another.
 
 There are patches to do that. I think if we add that, there should be
 a way to choose the behavior between either bridging between the
 guests or VEPA.

If you implement this direct bridging capability between local VMs for 
macvlan, then would this not break existing applications that currently
use it? It would be quite a significant change to how macvlan works 
today. I guess, ideally, you would want to have macvlan work in 
separate modes, e.g. traditional macvlan, bridging, and VEPA.

 
  I could imagine a hairpin mode on the adjacent bridge making this
  possible, but the macvlan code would need to be updated to filter
  reflected frames so a source did not receive his own packet.
 
 Right, I missed this point so far. I'll follow up with a patch
 to do that.

Can you maybe point me to the missing patches for macvlan that you
have mentioned in other emails, and the one you mention above? 
E.g. enabling multicast distribution and allowing local bridging etc.
I could not find any of those in the archives. Thanks.

 
  I could imagine this being done as well, but to also
  support selective multicast usage, something similar to the bridge
  forwarding table would be needed.  I think putting VEPA into a new driver
  would cause you to implement many things the bridge code already supports.
  Given that we expect the bridge standard to ultimately include VEPA, and the
  new functions are basic forwarding operations, it seems to make most sense
  to keep this consistent with the bridge module.
 
 This is the interesting part of the discussion. The bridge and macvlan
 drivers certainly have an overlap in functionality and you can argue
 that you only need one. Then again, the bridge code is a little crufty
 and we might not want to add much more to it for functionality that can
 be implemented in a much simpler way elsewhere. My preferred way would
 be to use bridge when you really need 802.1d MAC learning, netfilter-bridge
 and STP, while we put the optimizations for stuff like VMDq, zero-copy
 and multiqueue guest adapters only into the macvlan code.

I can see this being a possible solution. 

My concern with putting VEPA into macvlan instead of the bridging code
is that there will be more work required to make it usable for other
virtualization solutions, as macvtap will only work for KVM-type setups. 
Basically, VEPA capabilities would rely on someone developing further
drivers to connect macvlan to different backend interfaces, e.g. one for
KVM (macvtap), one for Xen PV drivers, one for virtio, and whatever else
is out there, or will be there in the future. The bridging code is
already very generic in that respect, and all virtualization layers
can deal with connecting interfaces to a bridge. 

Our extensions to the bridging code to enable VEPA for the Linux kernel
are only very minimal code changes and would make VEPA available
to most virtualization solutions today.

Anna
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-10 Thread Gerd Hoffmann
On 08/10/09 15:02, Anthony Liguori wrote:

 I think you're missing my fundamental point. Don't use the kernel as the
 guest interface.

 Introduce a userspace daemon that exposes a domain socket. Then we can
 have a proper protocol that uses reverse fqdns for identification.

We need nothing but (a) bidirectional byte streams and (b) name tags for 
them.

Do we really want to design a daemon and a protocol for such a simple 
thing?  Especially as requiring a daemon for that adds a few problems 
you don't have without one.  Access control, for example: for device 
nodes you can just use standard unix permissions and acls.  You can 
easily do stuff like adding the logged-in desktop user to the 
/dev/vmchannel/org/qemu/clipboard acl using existing solutions.  With a 
daemon you have to jump through a number of hoops to achieve the same.

Can't we simply have guest apps open /dev/vmchannel/$protocol ?
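
For comparison, the guest side Gerd has in mind would look roughly like the
sketch below; the device node follows his proposed /dev/vmchannel/$protocol
naming, which does not exist today:

/* Sketch of the proposed guest-side usage.  The device node name below
 * follows the suggested /dev/vmchannel/$protocol layout and is purely
 * hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *msg = "clipboard text from the guest\n";
    int fd = open("/dev/vmchannel/org/qemu/clipboard", O_RDWR);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* plain byte stream: ordinary unix permissions/ACLs on the node
     * decide who may talk on this channel */
    write(fd, msg, strlen(msg));
    close(fd);
    return 0;
}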

cheers,
   Gerd

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-10 Thread Anthony Liguori
Amit Shah wrote:
 Let me explain how we came to this numbering: we first had support for
 'naming' ports and the names were obtained by userspace programs by an
 ioctl. Rusty suggested to use some numbering scheme where some ports
 could exist at predefined locations so that we wouldn't need the naming
 and the ioctls around it.
   

Fortunately, if you implement the naming scheme in userspace you get the 
best of both worlds ;-)

 Hm, so there can be one daemon on the guest handling all clipboard
 events. There's some work done already by the fast-user-switch support
 and that can be extended to daemons that talk over virtio-serial.
   
   
 You could have one daemon that manages all vmchannel sessions.  It can  
 then expose channels to apps via whatever mechanism is best.  It could  
 use unix domain sockets, sys v ipc, whatever floats your boat.

 And, you can build this daemon today using the existing vmchannel over  
 TCP/IP.  You could also make it support serial devices.  We could also  
 introduce a custom usb device and use libusb.  libusb is portable to  
 Windows and Linux.
 

 There are some other problems with usb too: It's not transparent to
 users. Any hotplug event could alert users and that's not desired.

I don't think this is true in practice.  Our goal is not to circumvent 
an OS's policy decisions either.

  It's
 a system-only thing and should also remain that way.
   

I don't buy this argument at all.  If you exposed a new usb device that 
no OS had a kernel driver for, and you had a daemon running that watched for 
insertions of that device, what OS would that not work transparently on?

I think my fundamental argument boils down to two points: 1) we should 
not require new guest drivers unless we absolutely have to, and 2) we should 
always do things in userspace unless we absolutely have to do things in 
the kernel.

Adding new kernel drivers breaks support for enterprise Linux distros.  
Adding a userspace daemon does not.  Windows device drivers require 
signing which is very difficult to do.  There's a huge practical 
advantage in not requiring guest drivers.

Regards,

Anthony Liguori

   Amit
   

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Arnd Bergmann
On Monday 10 August 2009, Fischer, Anna wrote:
  Subject: Re: [PATCH][RFC] net/bridge: add basic VEPA support
  
  On Friday 07 August 2009, Paul Congdon (UC Davis) wrote:
   As I understand the macvlan code, it currently doesn't allow two VMs on the
   same machine to communicate with one another.
  
  There are patches to do that. I think if we add that, there should be
  a way to choose the behavior between either bridging between the
  guests or VEPA.
 
 If you implement this direct bridging capability between local VMs for 
 macvlan, then would this not break existing applications that currently
 use it? It would be quite a significant change to how macvlan works 
 today. I guess, ideally, you would want to have macvlan work in 
 separate modes, e.g. traditional macvlan, bridging, and VEPA.

Right, that's what I meant with my sentence above. I'm not sure
if we need to differentiate traditional macvlan and VEPA though.
AFAICT, the only difference should be the handling of broadcast
and multicast frames returning from the hairpin turn. Since this
does not happen with a traditional macvlan, we can always send them
to all macvlan ports except the source port.
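
As a rough sketch of that rule (invented structures, not actual macvlan
code), the local-bridging case simply skips the source port when replicating
a broadcast:

/* Toy illustration: deliver a broadcast/multicast frame to every port
 * except the one it came from.  Ports here are made up for the example. */
#include <stdio.h>

struct port { const char *name; };

static void deliver_broadcast(struct port *ports, int nports, int src)
{
    int i;

    for (i = 0; i < nports; i++) {
        if (i == src)       /* never reflect a frame to its source */
            continue;
        printf("deliver copy to %s\n", ports[i].name);
    }
}

int main(void)
{
    struct port ports[] = { { "macvlan0" }, { "macvlan1" }, { "macvlan2" } };

    deliver_broadcast(ports, 3, 1);   /* frame received from macvlan1 */
    return 0;
}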

   I could imagine a hairpin mode on the adjacent bridge making this
   possible, but the macvlan code would need to be updated to filter
   reflected frames so a source did not receive his own packet.
  
  Right, I missed this point so far. I'll follow up with a patch
  to do that.
 
 Can you maybe point me to the missing patches for macvlan that you
 have mentioned in other emails, and the one you mention above? 
 E.g. enabling multicast distribution and allowing local bridging etc.
 I could not find any of those in the archives. Thanks.

The patch from Eric Biederman to allow macvlan to bridge between
its slave ports is at

http://kerneltrap.org/mailarchive/linux-netdev/2009/3/9/5125774

I could not find any patches for the other features (or bugs).

  This is the interesting part of the discussion. The bridge and macvlan
  drivers certainly have an overlap in functionality and you can argue
  that you only need one. Then again, the bridge code is a little crufty
  and we might not want to add much more to it for functionality that can
  be implemented in a much simpler way elsewhere. My preferred way would
  be to use bridge when you really need 802.1d MAC learning, netfilter-bridge
  and STP, while we put the optimizations for stuff like VMDq, zero-copy
  and multiqueue guest adapters only into the macvlan code.
 
 I can see this being a possible solution. 
 
 My concern with putting VEPA into macvlan instead of the bridging code
 is that there will be more work required to make it usable for other
 virtualization solution as macvtap will only work for KVM type setups.

Right, I understand.

 Basically, VEPA capabilities would rely on someone developing further
 drivers to connect macvlan to different backend interfaces, e.g. one for
 KVM (macvtap), one for Xen PV drivers, one for virtio, and whatever else
 is out there, or will be there in the future. The bridging code is
 already very generic in that respect, and all virtualization layers
 can deal with connecting interfaces to a bridge. 
 
 Our extensions to the bridging code to enable VEPA for the Linux kernel
 are only very minimal code changes and would allow to make VEPA available
 to most virtualization solutions today.

I don't object to having VEPA supported in the bridge code at all.
I think your patch is simple enough so it won't hurt in the bridge
code. If Stephen prefers to do VEPA only in one component, we should
probably make it possible for that component to act as a bridge between
1+n existing interfaces as well. You can almost do that with the regular
macvlan and the bridge driver, like

      / macvlan0 - br0 - tap0
eth0 -- macvlan1 - br1 - tap1
      \ macvlan2 - br2 - tap2

Here, you can have two guests attached to tap devices (or xen net ...)
and the macvlan driver doing the VEPA. Of course this is not how bridge
works -- you would have the same mac addresses on two sides of
the bridge.

So we could have another macvlan backend (let's call it macvbridge)
so you can do this:

      / macvlan0 - 'qemu -net raw'
eth0 -- macvtap0 - 'qemu -net tap,fd=3 3<>/dev/net/macvtap0'
      \ macvbr0 -- tap0 - 'qemu -net tap'

The macvbr driver could then be used to associate an existing
network device with a slave of a macvlan port. I'm not sure if this
has any significant advantage over your bridge patches; it does
have the obvious disadvantage that someone needs to implement
it first, while your patch is there ;-)

Arnd 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Stephen Hemminger
On Sun, 09 Aug 2009 14:19:08 +0300
Or Gerlitz ogerl...@voltaire.com wrote:

 Stephen Hemminger wrote:
  I have a patch that forwards all multicast packets, and another that does 
  proper forwarding. It should have worked that way in original macvlan, the 
  current behavior is really a bug.

 Looking in macvlan_set_multicast_list() it acts in a similar manner to
 macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I
 assume what's left is to add macvlan_hash_xxx multicast logic to
 map/unmap multicast groups to what macvlan devices want to receive them
 and this way the flooding can be removed, correct?

The device can just flood all multicast packets, since the filtering
is done on the receive path anyway.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [evb] Re: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Arnd Bergmann
On Friday 07 August 2009, Paul Congdon (UC Davis) wrote:
 
 I don't think your scheme works too well because broadcast packet coming
 from other interfaces on br0 would get replicated and sent across the wire
 to ethB multiple times.

Right, that won't work. So the bridge patch for the hairpin turn
is still the best solution. Btw, how will that interact with
the bridge-netfilter (ebtables) setup? Can you apply any filters
that work on current bridges also between two VEPA ports while
doing the hairpin turn?

Arnd 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Arnd Bergmann
On Monday 10 August 2009, Stephen Hemminger wrote:
 On Sun, 09 Aug 2009 14:19:08 +0300, Or Gerlitz ogerl...@voltaire.com wrote:
  Looking in macvlan_set_multicast_list() it acts in a similar manner to
  macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I
  assume what's left is to add macvlan_hash_xxx multicast logic to
  map/unmap multicast groups to what macvlan devices want to receive them
  and this way the flooding can be removed, correct?
 
 The device can just flood all multicast packets, since the filtering
 is done on the receive path anyway.

But we'd still have to copy the frames to user space (for both
macvtap and raw packet sockets) and exit from the guest to inject
it into its stack, right?

I guess for multicast heavy workloads, we could save a lot of cycles
by throwing the frames away as early as possible. How common are those
setups in virtual servers though?

Arnd 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-10 Thread Gerd Hoffmann
On 08/10/09 16:20, Anthony Liguori wrote:
 Gerd Hoffmann wrote:
 Do we really want to design a daemon and a protocol for such a simple thing?

 Yes, because we also need (c) the ability to write cross platform
 software that targets vmchannel.

 So having a library interface is going to be extremely desirable.

You don't need a daemon for that though.

 Also, see the previous discussion about security. How do you sanely
 delegate /dev/vmchannel/org/qemu/clipboard to the current Xorg user?

pam_console (I think that is the name of the beast).
Or is it handled by hal these days?

The piece of software which does the very same thing already for sound 
and other devices.

 Especially as requiring a daemon for that adds a few problems you
 don't have without them. Access control for example: For device nodes
 you can just use standard unix permissions and acls.

 But how do you set those permissions in the first place?

See above.  There are other devices which need that too.  There are 
existing solutions for this problem.

 You can easily do stuff like adding the logged in desktop user to the
 /dev/vmchannel/org/qemu/clipboard acl using existing solutions. With a
 daemon you have to jump through a number of hoops to achieve the same.

 Can't we simply have guest apps open /dev/vmchannel/$protocol ?

 /dev interfaces are only simple to kernel developers :-) Besides, why do
 something that can be clearly done in userspace within the kernel?

Ok, let's rip out the in-kernel ioapic code then.  It can (and has been) 
done in userspace.

 It
 just increases the possibility of kernel bugs.

The daemon increases the possibility of userspace bugs.

Seriously:  Attaching a name tag to virtio-serial devices and having them 
exposed via sysfs is probably *much* less code than a vmchannel daemon.

Also multiplexing over one device introduces a number of problems you 
have to take care of on both sides (qemu+daemon) of the connection.  For 
example:  When the communication stalls in one protocol the others 
should keep on going of course.  With one device per protocol and thus 
one virtqueue per protocol the problem doesn't exist in the first place.

 You can have a /var/run/vmchannel/$protocol.sock unix domain socket and
 it has all the same properties that you describe. It also Just Works
 with standard tools like socat.

bash: socat: command not found

 If we really want
 vmchannel to be used by application developers, then we really need a
 libvmchannel.

We need a sane solution developed and merged and not a new idea each week.

cheers,
   Gerd

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-10 Thread Gerd Hoffmann
On 08/10/09 16:27, Anthony Liguori wrote:

 I think my fundamental argument boils down to two points. 1) we should
 not require new guest drivers unless we absolutely have to

Allowing guest drivers is fine though, I guess?

 2) we should
 always do things in userspace unless we absolutely have to do things in
 the kernel.

Wrong.  There are often good reasons to do stuff in kernel, even if you 
can do it in userspace too.

 Adding new kernel drivers breaks support for enterprise Linux distros.
 Adding a userspace daemon does not. Windows device drivers require
 signing which is very difficult to do. There's a huge practical
 advantage in not requiring guest drivers.

Ok, so the virtio-serial + usbserial combo should work well then I think.

If you have guest drivers you'll go the virtio-serial route.
If you don't have guest drivers you can go the usbserial route, either 
via /dev/ttyUSB or via libusb.

We can also have a libvmchannel as abstraction layer on top of this.

cheers,
   Gerd

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Yaron Haviv
Paul,

I also think that bridge may not be the right place for VEPA, but rather a
simpler sw/hw mux, although the VEPA support may reside in multiple places
(i.e. also in the bridge).

As Arnd pointed out, Or already added an extension to qemu that allows direct
guest virtual NIC mapping to an interface device (vs. using tap). This was done
specifically to address VEPA, and results in much faster performance and lower
CPU overhead (Or and some others are planning additional meaningful performance
optimizations).

The interface multiplexing can be achieved using the macvlan driver or using an
SR-IOV capable NIC (the preferred option). macvlan may need to be extended to
support VEPA multicast handling, but this looks like a rather simple task.

It may be counter-intuitive for some, but we expect the (completed) qemu VEPA
mode + SR-IOV + certain switches with hairpin (VEPA) mode to perform faster
than using bridge+tap, even for connecting 2 VMs on the same host.


Yaron 

Sent from BlackBerry



From: e...@yahoogroups.com 
To: 'Stephen Hemminger' ; 'Fischer, Anna' 
Cc: bri...@lists.linux-foundation.org ; linux-ker...@vger.kernel.org ; 
net...@vger.kernel.org ; virtualization@lists.linux-foundation.org ; 
e...@yahoogroups.com ; da...@davemloft.net ; ka...@trash.net ; 
adobri...@gmail.com ; 'Arnd Bergmann' 
Sent: Fri Aug 07 21:58:00 2009
Subject: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support 


  

 
 After reading more about this, I am not convinced this should be part 
 of the bridge code. The bridge code really consists of two parts:
 forwarding table and optional spanning tree. Well the VEPA code short 
 circuits both of these; I can't imagine it working with STP turned 
 on. The only part of bridge code that really gets used by this are the 
 receive packet hooks and the crufty old API.
 
 So instead of adding more stuff to existing bridge code, why not have 
 a new driver for just VEPA. You could do it with a simple version of 
 macvlan type driver.

Stephen,

Thanks for your comments and questions. We do believe the bridge code is
the right place for this, so I'd like to embellish on that a bit more to
help persuade you. Sorry for the long winded response, but here are some
thoughts:

- First and foremost, VEPA is going to be a standard addition to the IEEE
802.1Q specification. The working group agreed at the last meeting to
pursue a project to augment the bridge standard with hairpin mode (aka
reflective relay) and a remote filtering service (VEPA). See for details:
http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-evbPar5C-0709-v01.pdf

- The VEPA functionality was really a pretty small change to the code with
low risk and wouldn't seem to warrant an entire new driver or module.

- There are good use cases where VMs will want to have some of their
interfaces attached to bridges and others to bridges operating in VEPA mode.
In other words, we see simultaneous operation of the bridge code and VEPA
occurring, so having as much of the underlying code as common as possible
would seem to be beneficial. 

- By augmenting the bridge code with VEPA there is a great amount of re-use
achieved. It works wherever the bridge code works and doesn't need anything
special to support KVM, XEN, and all the hooks, etc...

- The hardware vendors building SR-IOV NICs with embedded switches will be
adding VEPA mode, so by keeping the bridge module in sync would be
consistent with this trend and direction. It will be possible to extend the
hardware implementations by cascading a software bridge and/or VEPA, so
being in sync with the architecture would make this more consistent.

- The forwarding table is still needed and used on inbound traffic to
deliver frames to the correct virtual interfaces and to filter any reflected
frames. A new driver would have to basically implement an equivalent
forwarding table anyway. As I understand the current macvlan type driver,
it wouldn't filter multicast frames properly without such a table.

- It seems the hairpin mode would be needed in the bridge module whether
VEPA was added to the bridge module or a new driver. Having the associated
changes together in the same code could aid in understanding and deployment.

As I understand the macvlan code, it currently doesn't allow two VMs on the
same machine to communicate with one another. I could imagine a hairpin
mode on the adjacent bridge making this possible, but the macvlan code would
need to be updated to filter reflected frames so a source did not receive
his own packet. I could imagine this being done as well, but to also
support selective multicast usage, something similar to the bridge
forwarding table would be needed. I think putting VEPA into a new driver
would cause you to implement many things the bridge code already supports.
Given that we expect the bridge standard to ultimately include VEPA, 

RE: [evb] Re: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Fischer, Anna
 Subject: Re: [PATCH][RFC] net/bridge: add basic VEPA support
 
 On Friday 07 August 2009, Paul Congdon (UC Davis) wrote:
 
  I don't think your scheme works too well because broadcast packet
 coming
  from other interfaces on br0 would get replicated and sent across the
 wire
  to ethB multiple times.
 
 Right, that won't work. So the bridge patch for the hairpin turn
 is still the best solution. 

Yes, I think that we should separate the discussions between hairpin 
mode on the adjacent bridge and the VEPA filtering service residing
within the end-station. The hairpin feature really has to be
implemented in the bridging code.


 Btw, how will that interact with
 the bridge-netfilter (ebtables) setup? Can you apply any filters
 that work on current bridges also between two VEPA ports while
 doing the hairpin turn?

The hairpin mode is implemented on the adjacent bridge. The only 
difference for a hairpin mode port vs. a normal bridge port is that
it can pass frames back out to the same port it came from. All the
netfilter hooks are still in place.

On the VEPA filtering service side, the only change we have implemented
in the bridging code is that in VEPA mode all frames are passed to the
uplink on TX. However, frames are still passed through the netfilter 
hooks before they go out on the wire. On the inbound path, there are
no changes to the way frames are processed (except the filtering for
the original source port), so netfilter hooks work in the same way
as for a normal bridge.

If a frame is reflected back because of a hairpin turn, then of course
the incoming port is the VEPA uplink port and not the port that
originally sent the frame. So if you are trying to enforce some
packet filtering on that inbound path, then you would have to do that
based on MAC addresses and not on bridge ports. But I would assume that
you would enforce the filtering already before you send out the frame
to the adjacent bridge. Apart from that, if you enable your bridge to
behave in VEPA mode, then you would typically do packet filtering etc
on the adjacent bridge and not use the netfilter hook. You can still use
both though, if you like.
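
To make that inbound filtering concrete, here is a toy userspace sketch
(invented types, not the actual bridge patch): a frame reflected back by the
hairpin is delivered to every local port except the one whose MAC address it
was originally sent from:

/* Toy illustration of the VEPA inbound rule described above. */
#include <stdio.h>
#include <string.h>

#define ETH_ALEN 6

struct vport {
    const char *name;
    unsigned char mac[ETH_ALEN];
};

static void vepa_rx_deliver(const struct vport *ports, int nports,
                            const unsigned char *src_mac)
{
    int i;

    for (i = 0; i < nports; i++) {
        if (!memcmp(ports[i].mac, src_mac, ETH_ALEN)) {
            printf("filter reflected frame at %s\n", ports[i].name);
            continue;   /* the original sender never sees its own frame */
        }
        printf("deliver to %s\n", ports[i].name);
    }
}

int main(void)
{
    struct vport ports[] = {
        { "vm0", { 0x02, 0, 0, 0, 0, 0x10 } },
        { "vm1", { 0x02, 0, 0, 0, 0, 0x11 } },
    };
    unsigned char src[ETH_ALEN] = { 0x02, 0, 0, 0, 0, 0x10 }; /* sent by vm0 */

    vepa_rx_deliver(ports, 2, src);
    return 0;
}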

Anna
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Benny Amorsen
Fischer, Anna anna.fisc...@hp.com writes:

 If you do have a SRIOV NIC that supports VEPA, then I would think that
 you do not have QEMU or macvtap in the setup any more though. Simply
 because in that case the VM can directly access the VF on the physical
 device. That would be ideal.

I'm just trying to understand how this all works, so I'm probably asking
a stupid question:

Would a SRIOV NIC with VEPA support show up as multiple devices? I.e.
would I get e.g. eth0-eth7 for a NIC with support for 8 virtual
interfaces? Would they have different MAC addresses?


/Benny

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Paul Congdon (UC Davis)
 
 After reading more about this, I am not convinced this should be part 
 of the bridge code. The bridge code really consists of two parts:
 forwarding table and optional spanning tree. Well the VEPA code short 
 circuits both of these; I can't imagine it working with STP turned 
 on. The only part of bridge code that really gets used by this are the 
 receive packet hooks and the crufty old API.
 
 So instead of adding more stuff to existing bridge code, why not have 
 a new driver for just VEPA. You could do it with a simple version of 
 macvlan type driver.

Stephen,

Thanks for your comments and questions.  We do believe the bridge code is
the right place for this, so I'd like to embellish on that a bit more to
help persuade you.  Sorry for the long winded response, but here are some
thoughts:

- First and foremost, VEPA is going to be a standard addition to the IEEE
802.1Q specification.  The working group agreed at the last meeting to
pursue a project to augment the bridge standard with hairpin mode (aka
reflective relay) and a remote filtering service (VEPA).  See for details:
http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-evbPar5C-0709-v01.pdf

- The VEPA functionality was really a pretty small change to the code with
low risk and wouldn't seem to warrant an entire new driver or module.

- There are good use cases where VMs will want to have some of their
interfaces attached to bridges and others to bridges operating in VEPA mode.
In other words, we see simultaneous operation of the bridge code and VEPA
occurring, so having as much of the underlying code as common as possible
would seem to be beneficial. 

- By augmenting the bridge code with VEPA there is a great amount of re-use
achieved.  It works wherever the bridge code works and doesn't need anything
special to support KVM, XEN, and all the hooks, etc...

- The hardware vendors building SR-IOV NICs with embedded switches will be
adding VEPA mode, so by keeping the bridge module in sync would be
consistent with this trend and direction.  It will be possible to extend the
hardware implementations by cascading a software bridge and/or VEPA, so
being in sync with the architecture would make this more consistent.
 
- The forwarding table is still needed and used on inbound traffic to
deliver frames to the correct virtual interfaces and to filter any reflected
frames.  A new driver would have to basically implement an equivalent
forwarding table anyway.  As I understand the current macvlan type driver,
it wouldn't filter multicast frames properly without such a table.

- It seems the hairpin mode would be needed in the bridge module whether
VEPA was added to the bridge module or a new driver.  Having the associated
changes together in the same code could aid in understanding and deployment.

As I understand the macvlan code, it currently doesn't allow two VMs on the
same machine to communicate with one another.  I could imagine a hairpin
mode on the adjacent bridge making this possible, but the macvlan code would
need to be updated to filter reflected frames so a source did not receive
his own packet.  I could imagine this being done as well, but to also
support selective multicast usage, something similar to the bridge
forwarding table would be needed.  I think putting VEPA into a new driver
would cause you to implement many things the bridge code already supports.
Given that we expect the bridge standard to ultimately include VEPA, and the
new functions are basic forwarding operations, it seems to make most sense
to keep this consistent with the bridge module.

Paul

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


RE: [evb] Re: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Paul Congdon (UC Davis)
Arnd,

 

I don't think your scheme works too well because broadcast packets coming
from other interfaces on br0 would get replicated and sent across the wire
to ethB multiple times.

 

Paul

That way you should be able to do something
like:

Host A                                Host B

     /- nalvcam0 -\                   /- macvlan0 - 192.168.1.1
br0 -|             |- ethA === ethB --|
     \- nalvcam1 -/                   \- macvlan1 - 192.168.1.2

Now assuming that macvlan0 and macvlan1 are in different
network namespaces or belong to different KVM guests, these
guests would be able to communicate with each other through
the bridge on host A, which can set the policy (using ebtables)
for this communication and get interface statistics on its
nalvcam interfaces. Also, instead of having the br0, Host A could
assign IP addresses to the two nalvcam interfaces that host
B has, and use IP forwarding between the guests of host B.



 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

RE: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Paul Congdon (UC Davis)
Yaron,


The interface multiplexing can be achieved using macvlan driver or using an 
SR-IOV capable NIC (the preferred option), macvlan may need to be extended to 
support VEPA multicast handling, this looks like a rather simple task 

Agreed that the hardware solution is preferred so the macvlan implementation 
doesn't really matter.  If we are talking SR-IOV, then it is direct mapped, 
regardless of whether there is a VEB or VEPA in the hardware below, so you are 
bypassing the bridge software code also.  

I disagree that adding the multicast handling is simple - while not 
conceptually hard, it will basically require you to put an address table into 
the macvlan implementation - if you have that, then why not have just used the 
one already in the bridge code.  If you hook a VEPA up to a non-hairpin mode 
external bridge, you get the macvlan capability as well.

It also seems to me like the special macvlan interfaces for KVM don't apply to 
XEN or a non-virtualized environment?  Or more has to be written to make that 
work?  If it is in the bridge code, you get all of this re-use.

 

 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [evb] Re: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Arnd Bergmann
On Monday 10 August 2009, Fischer, Anna wrote:
 On the VEPA filtering service side, the only change we have implemented
 in the bridging code is that in VEPA mode all frames are passed to the
 uplink on TX. However, frames are still passed through the netfilter 
 hooks before they go out on the wire. On the inbound path, there are
 no changes to the way frames are processed (except the filtering for
 the original source port), so netfilter hooks work in the same way
 as for a normal bridge.

Ah, interesting. I did not realize that the hooks were still active,
although that obviously makes sense. So that would be another
important difference between our implementations.

 If a frame is reflected back because of a hairpin turn, then of course
 the incoming port is the VEPA uplink port and not the port that
 originally sent the frame. So if you are trying to enforce some
 packet filtering on that inbound path, then you would have to do that
 based on MAC addresses and not on bridge ports. But I would assume that
 you would enforce the filtering already before you send out the frame
 to the adjacent bridge. Apart from that, if you enable your bridge to
 behave in VEPA mode, then you would typically do packet filtering etc
 on the adjacent bridge and not use the netfilter hook. You can still use
 both though, if you like.

Right, that was my point. The bridge in VEPA mode would likely be
configured to be completely ignorant of the data going through it
and not do any filtering, and you do all filtering on the adjacent
bridge.

I just wasn't sure that this is possible with ebtables if the
adjacent bridge is a Linux system with the bridge in hairpin turn
mode.

Arnd 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


RE: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Fischer, Anna
 Subject: Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
 
 On Monday 10 August 2009, Stephen Hemminger wrote:
  On Sun, 09 Aug 2009 14:19:08 +0300, Or Gerlitz
 ogerl...@voltaire.com wrote:
   Looking in macvlan_set_multicast_list() it acts in a similar manner to
   macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I
   assume what's left is to add macvlan_hash_xxx multicast logic to
   map/unmap multicast groups to what macvlan devices want to receive them
   and this way the flooding can be removed, correct?
 
  The device can just flood all multicast packets, since the filtering
  is done on the receive path anyway.

Is this handled by one of the additional patches? In the current kernel tree
macvlan code it looks as if multicast filtering is only handled by the
physical device driver, but not on particular macvlan devices.
 

 But we'd still have to copy the frames to user space (for both
 macvtap and raw packet sockets) and exit from the guest to inject
 it into its stack, right?

I think it would be nice if you can implement what Or describes for 
macvlan and avoid flooding, and it doesn't sound too hard to do. 

I guess one advantage for macvlan (over the bridge) is that you can 
program in all information you have for the ports attached to it, e.g. 
MAC addresses and multicast addresses. So you could take advantage of
that whereas the bridge always floods multicast frames to all ports.
 
How would this work though, if the OS inside the guest wants to register
to a particular multicast address? Is this propagated through the backend
drivers to the macvlan/macvtap interface?

Anna

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Stephen Hemminger
On Mon, 10 Aug 2009 16:32:01 +
Fischer, Anna anna.fisc...@hp.com wrote:

  Subject: Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
  
  On Monday 10 August 2009, Stephen Hemminger wrote:
   On Sun, 09 Aug 2009 14:19:08 +0300, Or Gerlitz
  ogerl...@voltaire.com wrote:
Looking in macvlan_set_multicast_list() it acts in a similar manner to
macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I
assume what's left is to add macvlan_hash_xxx multicast logic to
map/unmap multicast groups to what macvlan devices want to receive them
and this way the flooding can be removed, correct?
  
   The device can just flood all multicast packets, since the filtering
   is done on the receive path anyway.
 
 Is this handled by one of the additional patches? In the current kernel tree
 macvlan code it looks as if multicast filtering is only handled by the
 physical device driver, but not on particular macvlan devices.
  
 
  But we'd still have to copy the frames to user space (for both
  macvtap and raw packet sockets) and exit from the guest to inject
  it into its stack, right?
 
 I think it would be nice if you can implement what Or describes for 
 macvlan and avoid flooding, and it doesn't sound too hard to do. 
 
 I guess one advantage for macvlan (over the bridge) is that you can 
 program in all information you have for the ports attached to it, e.g. 
 MAC addresses and multicast addresses. So you could take advantage of
 that whereas the bridge always floods multicast frames to all ports.
  
 How would this work though, if the OS inside the guest wants to register
 to a particular multicast address? Is this propagated through the backend
 drivers to the macvlan/macvtap interface?

Sure filtering is better, but multicast performance with a large number
of guests is really a corner case, not the real performance issue.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-10 Thread Anthony Liguori
Gerd Hoffmann wrote:
 Ok, let's rip out the in-kernel ioapic code then.  It can (and has 
 been) done in userspace.

The justification is performance although that's not really true anymore 
post TPR optimization.

But FWIW, I opposed both the in-kernel apic and the in-kernel pit when 
they were introduced.  If nothing else, I'm at least consistent :-)

 The daemon increases the possibility of userspace bugs.

How?

 Seriously:  Attaching a name tag to virtio-serial devices and have 
 them exposed via sysfs is probably *much* less code than a vmchannel 
 daemon.

I strongly doubt that.

 If we really want
 vmchannel to be used by application developers, then we really need a
 libvmchannel.

 We need a sane solution developed and merged and not a new idea each 
 week.

There is nothing sane about vmchannel.  It's just an attempt to bypass 
QEMU which is going to introduce all sorts of complexities wrt 
migration, guest compatibility, etc.

However, as I've mentioned repeatedly, the reason I won't merge 
virtio-serial is that it duplicates functionality with virtio-console.  
If the two are converged, I'm happy to merge it.  I'm not opposed to 
having more functionality.

I think it's the wrong solution for the use-case, and I always have, but 
that's independent of my willingness to merge it.

Regards,

Anthony Liguori
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-10 Thread Anthony Liguori
Anthony Liguori wrote:

 There is nothing sane about vmchannel.  It's just an attempt to bypass 
 QEMU which is going to introduce all sorts of complexities wrt 
 migration, guest compatibility, etc.

 However, as I've mentioned repeatedly, the reason I won't merge 
 virtio-serial is that it duplicates functionality with 
 virtio-console.  If the two are converged, I'm happy to merge it.  I'm 
 not opposed to having more functionality.

NB: the userspace interface for these devices should be a tty, not a new 
character device.

If you want to add a new bustype for these devices, and then have an 
entry in sysfs that had some sort of identification string, that's 
perfectly acceptable.

Also note though that this is exactly what usb-serial is today.  
/sys/bus/usbserial contains all the usb serial devices and you can get a 
vendor id/device id to uniquely identify the device type.

Using virtio vs. usb has its advantages, but the userspace interface 
model should be roughly equivalent.

Regards,

Anthony Liguori
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 0/2] vhost: a kernel-level virtio server

2009-08-10 Thread Michael S. Tsirkin
This implements vhost: a kernel-level backend for virtio.
The main motivation for this work is to reduce virtualization
overhead for virtio by removing system calls on data path,
without guest changes. For virtio-net, this removes up to
4 system calls per packet: vm exit for kick, reentry for kick,
iothread wakeup for packet, interrupt injection for packet.

Some more detailed description attached to the patch itself.

The patches are against 2.6.31-rc4.  I'd like them to go into linux-next
and down the road 2.6.32 if possible.  Please comment.

Userspace bits using this driver will be posted to k...@vger shortly.

Michael S. Tsirkin (2):
  export cpu_tlbstate to modules
  vhost_net: a kernel-level virtio server

 MAINTAINERS|   10 +
 arch/x86/kvm/Kconfig   |1 +
 arch/x86/mm/tlb.c  |1 +
 drivers/Makefile   |1 +
 drivers/block/virtio_blk.c |3 +
 drivers/vhost/Kconfig  |   11 +
 drivers/vhost/Makefile |2 +
 drivers/vhost/net.c|  462 ++
 drivers/vhost/vhost.c  |  663 
 drivers/vhost/vhost.h  |  108 +++
 include/linux/Kbuild   |1 +
 include/linux/miscdevice.h |1 +
 include/linux/vhost.h  |  100 +++
 13 files changed, 1364 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-10 Thread Michael S. Tsirkin
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- support memory table and not just an offset (needed for kvm)

Common virtio-related code has been put in a separate file vhost.c and
can be made into a separate module if/when more backends appear.  I used
Rusty's lguest.c as the source for developing this part: this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a
tun-like device. In this version I only support raw socket as a backend,
which can be bound to e.g. SR IOV, or to macvlan device.  Backend is
also configured by userspace, including vlan/mac etc.
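
For illustration only, binding a raw packet socket to such a device from
userspace looks roughly like the snippet below; the interface name is just an
example, and the vhost setup that would actually consume the descriptor is
omitted:

/* Standalone sketch of the kind of backend described above: a PF_PACKET
 * raw socket bound to a specific interface (e.g. a macvlan device). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_ll sll;
    int fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = if_nametoindex("macvlan0");   /* example device */
    if (!sll.sll_ifindex ||
        bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        perror("bind to macvlan0");
        return 1;
    }
    printf("raw socket bound; fd %d could be handed to such a backend\n", fd);
    close(fd);
    return 0;
}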

Status:
This works for me, and I haven't seen any crashes.
I have not run any benchmarks yet, compared to userspace, I expect to
see improved latency (as I save up to 4 system calls per packet) but not
yet bandwidth/CPU (as TSO and interrupt mitigation are not yet supported).

Features that I plan to look at in the future:
- TSO
- interrupt mitigation
- zero copy

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---
 MAINTAINERS|   10 +
 arch/x86/kvm/Kconfig   |1 +
 drivers/Makefile   |1 +
 drivers/block/virtio_blk.c |3 +
 drivers/vhost/Kconfig  |   11 +
 drivers/vhost/Makefile |2 +
 drivers/vhost/net.c|  462 ++
 drivers/vhost/vhost.c  |  663 
 drivers/vhost/vhost.h  |  108 +++
 include/linux/Kbuild   |1 +
 include/linux/miscdevice.h |1 +
 include/linux/vhost.h  |  100 +++
 12 files changed, 1363 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index ebc2691..eb0c1da 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6312,6 +6312,16 @@ S:   Maintained
 F: Documentation/filesystems/vfat.txt
 F: fs/fat/
 
+VIRTIO HOST (VHOST)
+P: Michael S. Tsirkin
+M: m...@redhat.com
+L: k...@vger.kernel.org
+L: virtualizat...@lists.osdl.org
+L: net...@vger.kernel.org
+S: Maintained
+F: drivers/vhost/
+F: include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 P: Roger Luethi
 M: r...@hellgate.ch
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD
 
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig
 
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@ obj-$(CONFIG_HID)   += hid/
 obj-$(CONFIG_PPC_PS3)  += ps3/
 obj-$(CONFIG_OF)   += of/
 obj-$(CONFIG_SSB)  += ssb/
+obj-$(CONFIG_VHOST_NET)+= vhost/
 obj-$(CONFIG_VIRTIO)   += virtio/
 obj-$(CONFIG_VLYNQ)+= vlynq/
 obj-$(CONFIG_STAGING)  += staging/
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index aa1a3d5..42e61b0 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -293,6 +293,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 		err = PTR_ERR(vblk->vq);
 		goto out_free_vblk;
 	}
+	printk(KERN_ERR "vblk->vq = %p\n", vblk->vq);
 
 	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
 	if (!vblk->pool) {
@@ -383,6 +384,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	if (!err)
 		blk_queue_logical_block_size(vblk->disk->queue, blk_size);
 
+	printk(KERN_ERR "virtio_config_val returned %d\n", err);
+
 	add_disk(vblk->disk);
 	return 0;
 
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
new file mode 100644
index 000..d955406
--- /dev/null
+++ b/drivers/vhost/Kconfig
@@ -0,0 +1,11 @@
+config VHOST_NET
+	tristate "Host kernel accelerator for virtio net"
+	depends on NET && EVENTFD
+	---help---
+	  This kernel module can be loaded in host kernel 

[PATCH 1/2] export cpu_tlbstate to modules

2009-08-10 Thread Michael S. Tsirkin
vhost net module wants to do copy to/from user from a kernel thread,
which needs switch_mm (like what fs/aio has).
export cpu_tlbstate to make this possible

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 arch/x86/mm/tlb.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 821e970..e33a5f0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -13,6 +13,7 @@
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate)
 		= { &init_mm, 0, };
+EXPORT_PER_CPU_SYMBOL_GPL(cpu_tlbstate);
 
 /*
  * Smarter SMP flushing macros.
-- 
1.6.2.5

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 0/3] qemu-kvm: vhost net support

2009-08-10 Thread Michael S. Tsirkin
This adds support for vhost-net virtio kernel backend.

This is RFC, but works without issues for me.

Still needs to be split up, tested and benchmarked properly,
but posting it here in case people want to test drive
the kernel bits I posted.

Michael S. Tsirkin (3):
  qemu-kvm: move virtio-pci.o to near pci.o
  virtio: move features to an inline function
  qemu-kvm: vhost-net implementation

 Makefile.hw |2 +-
 Makefile.target |3 +-
 hw/vhost_net.c  |  181 +++
 hw/vhost_net.h  |   30 +
 hw/virtio-balloon.c |2 +-
 hw/virtio-blk.c |2 +-
 hw/virtio-console.c |2 +-
 hw/virtio-net.c |   34 +-
 hw/virtio-pci.c |   43 +++-
 hw/virtio.c |   19 --
 hw/virtio.h |   38 ++-
 net.c   |8 ++-
 net.h   |1 +
 qemu-kvm.c  |8 --
 qemu-kvm.h  |9 +++
 15 files changed, 340 insertions(+), 42 deletions(-)
 create mode 100644 hw/vhost_net.c
 create mode 100644 hw/vhost_net.h
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Arnd Bergmann
On Monday 10 August 2009, Stephen Hemminger wrote:
 On Mon, 10 Aug 2009 16:32:01, Fischer, Anna anna.fisc...@hp.com wrote:
  How would this work though, if the OS inside the guest wants to register
  to a particular multicast address? Is this propagated through the backend
  drivers to the macvlan/macvtap interface?
 
 Sure filtering is better, but multicast performance with large number
 of guests is really a corner case, not the real performance issue.

Well, right now, qemu does not care at all about this, it essentially
leaves the tun device in ALLMULTI state. I should check whether macvtap
at this stage can receive multicast frames at all, but if it does,
it will get them all ;-).

If we want to implement this with kvm, we would have to start with
the qemu virtio-net implementation, to move the receive filter into
the tap device. With tun/tap that will mean less copying to user
space, with macvtap (after implementing TUNSETTXFILTER) we get already
pretty far because we no longer need to have the external interface
in ALLMULTI state. Once that is in place, we can start thinking about
filtering per virtual device.

Arnd 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 1/3] qemu: move virtio-pci.o to near pci.o

2009-08-10 Thread Michael S. Tsirkin
virtio-pci depends, and will always depend, on pci.c,
so it makes sense to keep it in the same makefile
(unlike the rest of the virtio files, which should eventually
be moved out to Makefile.hw).

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 Makefile.hw |2 +-
 Makefile.target |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Makefile.hw b/Makefile.hw
index f7a9507..6857fea 100644
--- a/Makefile.hw
+++ b/Makefile.hw
@@ -16,7 +16,7 @@ CPPFLAGS += -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE
 CPPFLAGS+=-I$(SRC_PATH)/fpu
 
 obj-y =
-obj-y += virtio.o virtio-pci.o
+obj-y += virtio.o
 obj-y += fw_cfg.o
 obj-y += watchdog.o
 obj-y += nand.o ecc.o
diff --git a/Makefile.target b/Makefile.target
index df1f32b..0dd56b1 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -538,7 +538,7 @@ obj-y = vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \
 gdbstub.o gdbstub-xml.o msix.o ioport.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
-obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o
+obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o 
virtio-pci.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 
 LIBS+=-lz
-- 
1.6.2.5

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 2/3] qemu/virtio: move features to an inline function

2009-08-10 Thread Michael S. Tsirkin
devices should have the final say over which virtio features they
support. E.g. indirect entries may or may not make sense in the context
of virtio-console.  Move the common bits from virtio-pci to an inline
function and let each device call it.

No functional changes.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 hw/virtio-balloon.c |2 +-
 hw/virtio-blk.c |2 +-
 hw/virtio-console.c |2 +-
 hw/virtio-net.c |2 +-
 hw/virtio-pci.c |3 ---
 hw/virtio.h |   10 ++
 6 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 7ca783e..15b50bb 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -127,7 +127,7 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
 
 static uint32_t virtio_balloon_get_features(VirtIODevice *vdev)
 {
-return 0;
+return virtio_common_features();
 }
 
 static ram_addr_t virtio_balloon_to_target(void *opaque, ram_addr_t target)
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 2beff52..7d33fee 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -364,7 +364,7 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
     if (strcmp(s->serial_str, "0"))
         features |= 1 << VIRTIO_BLK_F_IDENTIFY;
 
-    return features;
+    return features | virtio_common_features();
 }
 
 static void virtio_blk_save(QEMUFile *f, void *opaque)
diff --git a/hw/virtio-console.c b/hw/virtio-console.c
index 663c8b9..ac25499 100644
--- a/hw/virtio-console.c
+++ b/hw/virtio-console.c
@@ -53,7 +53,7 @@ static void virtio_console_handle_input(VirtIODevice *vdev, VirtQueue *vq)
 
 static uint32_t virtio_console_get_features(VirtIODevice *vdev)
 {
-return 0;
+return virtio_common_features();
 }
 
 static int vcon_can_read(void *opaque)
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ef9e7ff..2e51a6a 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -154,7 +154,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
 }
 #endif
 
-return features;
+return features | virtio_common_features();
 }
 
 static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 3b9bfd1..90f51be 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -232,9 +232,6 @@ static uint32_t virtio_ioport_read(VirtIOPCIProxy *proxy, uint32_t addr)
     switch (addr) {
     case VIRTIO_PCI_HOST_FEATURES:
         ret = vdev->get_features(vdev);
-        ret |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
-        ret |= (1 << VIRTIO_RING_F_INDIRECT_DESC);
-        ret |= (1 << VIRTIO_F_BAD_FEATURE);
         break;
     case VIRTIO_PCI_GUEST_FEATURES:
         ret = vdev->features;
diff --git a/hw/virtio.h b/hw/virtio.h
index aa55677..de620a7 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -166,4 +166,14 @@ VirtIODevice *virtio_net_init(DeviceState *dev);
 VirtIODevice *virtio_console_init(DeviceState *dev);
 VirtIODevice *virtio_balloon_init(DeviceState *dev);
 
+static inline uint32_t virtio_common_features(void)
+{
+    uint32_t features = 0;
+    features |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
+    features |= (1 << VIRTIO_RING_F_INDIRECT_DESC);
+    features |= (1 << VIRTIO_F_BAD_FEATURE);
+
+    return features;
+}
+
 #endif
-- 
1.6.2.5

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 3/3] qemu-kvm: vhost-net implementation

2009-08-10 Thread Michael S. Tsirkin
This adds support for the vhost-net virtio kernel backend.
To enable it (assuming device eth2):
1. enable promiscuous mode or program the guest mac in device eth2
2. disable tso, gso and lro on the card
3. add vhost=eth2 to the -net flag
4. run with CAP_NET_ADMIN privilege (e.g. as root)

This patch is an RFC, but it works without issues for me.

It still needs to be split up, tested and benchmarked properly,
but I'm posting it here in case people want to test-drive
the kernel bits I posted.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 Makefile.target |3 +-
 hw/vhost_net.c  |  181 +++
 hw/vhost_net.h  |   30 +
 hw/virtio-net.c |   32 ++-
 hw/virtio-pci.c |   40 
 hw/virtio.c |   19 --
 hw/virtio.h |   28 -
 net.c   |8 ++-
 net.h   |1 +
 qemu-kvm.c  |8 ---
 qemu-kvm.h  |9 +++
 11 files changed, 325 insertions(+), 34 deletions(-)
 create mode 100644 hw/vhost_net.c
 create mode 100644 hw/vhost_net.h

diff --git a/Makefile.target b/Makefile.target
index 0dd56b1..5763fa8 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -538,7 +538,8 @@ obj-y = vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \
 gdbstub.o gdbstub-xml.o msix.o ioport.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
-obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o
+obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o \
+	vhost_net.o
 obj-$(CONFIG_KVM) += kvm.o kvm-all.o
 
 LIBS+=-lz
diff --git a/hw/vhost_net.c b/hw/vhost_net.c
new file mode 100644
index 000..7d52de0
--- /dev/null
+++ b/hw/vhost_net.c
@@ -0,0 +1,181 @@
+#include <sys/eventfd.h>
+#include <sys/socket.h>
+#include <linux/kvm.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <linux/vhost.h>
+#include <linux/virtio_ring.h>
+#include <netpacket/packet.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+
+#include <stdio.h>
+
+#include "qemu-kvm.h"
+
+#include "vhost_net.h"
+
+const char *vhost_net_device;
+
+static int vhost_virtqueue_init(struct vhost_dev *dev,
+				struct VirtIODevice *vdev,
+				struct vhost_virtqueue *vq,
+				struct VirtQueue *q,
+				unsigned idx)
+{
+	target_phys_addr_t s, l;
+	int r;
+	struct vhost_vring_addr addr = {
+		.index = idx,
+	};
+	struct vhost_vring_file file = {
+		.index = idx,
+	};
+	struct vhost_vring_state size = {
+		.index = idx,
+	};
+
+	size.num = q->vring.num;
+	r = ioctl(dev->control, VHOST_SET_VRING_NUM, &size);
+	if (r)
+		return -errno;
+
+	file.fd = vq->kick = eventfd(0, 0);
+	r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
+	if (r)
+		return -errno;
+	file.fd = vq->call = eventfd(0, 0);
+	r = ioctl(dev->control, VHOST_SET_VRING_CALL, &file);
+	if (r)
+		return -errno;
+
+	s = l = sizeof(struct vring_desc) * q->vring.num;
+	vq->desc = cpu_physical_memory_map(q->vring.desc, &l, 0);
+	if (!vq->desc || l != s)
+		return -ENOMEM;
+	addr.user_addr = (u_int64_t)(unsigned long)vq->desc;
+	r = ioctl(dev->control, VHOST_SET_VRING_DESC, &addr);
+	if (r < 0)
+		return -errno;
+	s = l = offsetof(struct vring_avail, ring) +
+		sizeof(u_int64_t) * q->vring.num;
+	vq->avail = cpu_physical_memory_map(q->vring.avail, &l, 0);
+	if (!vq->avail || l != s)
+		return -ENOMEM;
+	addr.user_addr = (u_int64_t)(unsigned long)vq->avail;
+	r = ioctl(dev->control, VHOST_SET_VRING_AVAIL, &addr);
+	if (r < 0)
+		return -errno;
+	s = l = offsetof(struct vring_used, ring) +
+		sizeof(struct vring_used_elem) * q->vring.num;
+	vq->used = cpu_physical_memory_map(q->vring.used, &l, 1);
+	if (!vq->used || l != s)
+		return -ENOMEM;
+	addr.user_addr = (u_int64_t)(unsigned long)vq->used;
+	r = ioctl(dev->control, VHOST_SET_VRING_USED, &addr);
+	if (r < 0)
+		return -errno;
+
+    r = vdev->binding->irqfd(vdev->binding_opaque, q->vector, vq->call);
+    if (r < 0)
+        return -errno;
+
+    r = vdev->binding->queuefd(vdev->binding_opaque, idx, vq->kick);
+    if (r < 0)
+        return -errno;
+
+	return 0;
+}
+
+static int vhost_dev_init(struct vhost_dev *hdev,
+			  VirtIODevice *vdev)
+{
+	int i, r, n = 0;
+	struct vhost_memory *mem;
+	hdev->control = open("/dev/vhost-net", O_RDWR);
+	if (hdev->control < 0)
+		return -errno;
+	r = ioctl(hdev->control, VHOST_SET_OWNER, NULL);
+	if (r < 0)
+		return -errno;
+	for (i = 0; i <

[PATCH] qemu/virtio: move features to an inline function

2009-08-10 Thread Michael S. Tsirkin
devices should have the final say over which virtio features they
support. E.g. indirect entries may or may not make sense in the context
of virtio-console.  Move the common bits from virtio-pci to an inline
function and let each device call it.

No functional changes.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

This is part of my vhost work, but IMO the patch makes sense separately
as well: the common features are not related to pci at all.

 hw/virtio-balloon.c |2 +-
 hw/virtio-blk.c |2 +-
 hw/virtio-console.c |2 +-
 hw/virtio-net.c |2 +-
 hw/virtio-pci.c |3 ---
 hw/virtio.h |   10 ++
 6 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 7ca783e..15b50bb 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -127,7 +127,7 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
 
 static uint32_t virtio_balloon_get_features(VirtIODevice *vdev)
 {
-return 0;
+return virtio_common_features();
 }
 
 static ram_addr_t virtio_balloon_to_target(void *opaque, ram_addr_t target)
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index 2beff52..7d33fee 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -364,7 +364,7 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
     if (strcmp(s->serial_str, "0"))
         features |= 1 << VIRTIO_BLK_F_IDENTIFY;
 
-    return features;
+    return features | virtio_common_features();
 }
 
 static void virtio_blk_save(QEMUFile *f, void *opaque)
diff --git a/hw/virtio-console.c b/hw/virtio-console.c
index 663c8b9..ac25499 100644
--- a/hw/virtio-console.c
+++ b/hw/virtio-console.c
@@ -53,7 +53,7 @@ static void virtio_console_handle_input(VirtIODevice *vdev, VirtQueue *vq)
 
 static uint32_t virtio_console_get_features(VirtIODevice *vdev)
 {
-return 0;
+return virtio_common_features();
 }
 
 static int vcon_can_read(void *opaque)
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ef9e7ff..2e51a6a 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -154,7 +154,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
 }
 #endif
 
-return features;
+return features | virtio_common_features();
 }
 
 static uint32_t virtio_net_bad_features(VirtIODevice *vdev)
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 3b9bfd1..90f51be 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -232,9 +232,6 @@ static uint32_t virtio_ioport_read(VirtIOPCIProxy *proxy, uint32_t addr)
     switch (addr) {
     case VIRTIO_PCI_HOST_FEATURES:
         ret = vdev->get_features(vdev);
-        ret |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
-        ret |= (1 << VIRTIO_RING_F_INDIRECT_DESC);
-        ret |= (1 << VIRTIO_F_BAD_FEATURE);
         break;
     case VIRTIO_PCI_GUEST_FEATURES:
         ret = vdev->features;
diff --git a/hw/virtio.h b/hw/virtio.h
index aa55677..de620a7 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -166,4 +166,14 @@ VirtIODevice *virtio_net_init(DeviceState *dev);
 VirtIODevice *virtio_console_init(DeviceState *dev);
 VirtIODevice *virtio_balloon_init(DeviceState *dev);
 
+static inline uint32_t virtio_common_features(void)
+{
+    uint32_t features = 0;
+    features |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
+    features |= (1 << VIRTIO_RING_F_INDIRECT_DESC);
+    features |= (1 << VIRTIO_F_BAD_FEATURE);
+
+    return features;
+}
+
 #endif
-- 
1.6.2.5
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] qemu/virtio: move features to an inline function

2009-08-10 Thread Anthony Liguori
Michael S. Tsirkin wrote:
 devices should have the final say over which virtio features they
 support. E.g. indirect entries may or may not make sense in the context
 of virtio-console.  Move the common bits from virtio-pci to an inline
 function and let each device call it.
   

What drove this in vhost?

Normally, the common features are transport features, and the devices 
should have absolutely no knowledge of transport features (since they're 
transport dependent).

IOW, VIRTIO_RING_F_INDIRECT_DESC is meaningless to virtio-console 
because virtio-console has no idea what the ring implementation is that 
it sits on top of.

Regards,

Anthony Liguori
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: [PATCH] qemu/virtio: move features to an inline function

2009-08-10 Thread Michael S. Tsirkin
On Mon, Aug 10, 2009 at 02:37:20PM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:
 devices should have the final say over which virtio features they
 support. E.g. indirect entries may or may not make sense in the context
 of virtio-console.  Move the common bits from virtio-pci to an inline
 function and let each device call it.
   

 What drove this in vhost?

vhost does not support indirect ring entries and notify on empty yet.

 Normally, the common features are transport features and the devices  
 should have absolutely no knowledge of transport feature (since they're  
 transport dependent).

Good point. But

1. note that with my patch they don't. They call
virtio_get_common_features and that's all.

2. Some features may not make sense for some devices. For example, it is
quite possible that the indirect ring entries feature improves performance
on block but hurts on net, as net has a similar merged-buffers feature.
Someone should try benchmarking with it disabled, and that becomes
possible with my patch (see the sketch below).
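
A hypothetical example of such a device-side override, not taken from the
patch itself; VirtIODevice, virtio_common_features() and the feature bit
constant are assumed to be the ones from hw/virtio.h as modified here.

/* Hypothetical only: advertise the common transport features minus
 * indirect descriptors, e.g. to benchmark net against merged buffers. */
static uint32_t virtio_net_get_features_no_indirect(VirtIODevice *vdev)
{
    uint32_t features = virtio_common_features();

    features &= ~(1 << VIRTIO_RING_F_INDIRECT_DESC);
    /* ...then OR in the VIRTIO_NET_F_* bits exactly as
     * virtio_net_get_features() does today... */
    return features;
}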

 IOW, VIRTIO_RING_F_INDIRECT_DESC is meaningless to virtio-console  
 because virtio-console has no idea what the ring implementation is that  
 it sits on top of.

At some level.  Same can be said to be true for virtio-pci: it sits
below the ring implementation. However, it is not *completely*
meaningless: some devices may benefit from indirect entries, others may
not, or may benefit from disabling them.

 Regards,

 Anthony Liguori

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-10 Thread Arnd Bergmann
On Monday 10 August 2009, Michael S. Tsirkin wrote:
 What it is: vhost net is a character device that can be used to reduce
 the number of system calls involved in virtio networking.
 Existing virtio net code is used in the guest without modification.

Very nice, I loved reading it. It's getting rather late in my time
zone, so these comments cover only the network driver. I'll go through
the rest tomorrow.

 @@ -293,6 +293,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 		err = PTR_ERR(vblk->vq);
 		goto out_free_vblk;
 	}
 +	printk(KERN_ERR "vblk->vq = %p\n", vblk->vq);
 
 	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
 	if (!vblk->pool) {
 @@ -383,6 +384,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	if (!err)
 		blk_queue_logical_block_size(vblk->disk->queue, blk_size);
 
 +	printk(KERN_ERR "virtio_config_val returned %d\n", err);
 +
 	add_disk(vblk->disk);
 	return 0;

I guess you meant to remove these before submitting.

 +static void handle_tx_kick(struct work_struct *work);
 +static void handle_rx_kick(struct work_struct *work);
 +static void handle_tx_net(struct work_struct *work);
 +static void handle_rx_net(struct work_struct *work);

[style] I think the code gets more readable if you reorder it
so that you don't need forward declarations for static functions.

 +static long vhost_net_reset_owner(struct vhost_net *n)
 +{
 +	struct socket *sock = NULL;
 +	long r;
 +	mutex_lock(&n->dev.mutex);
 +	r = vhost_dev_check_owner(&n->dev);
 +	if (r)
 +		goto done;
 +	sock = vhost_net_stop(n);
 +	r = vhost_dev_reset_owner(&n->dev);
 +done:
 +	mutex_unlock(&n->dev.mutex);
 +	if (sock)
 +		fput(sock->file);
 +	return r;
 +}

what is the difference between vhost_net_reset_owner(n)
and vhost_net_set_socket(n, -1)?

 +
 +static struct file_operations vhost_net_fops = {
 + .owner  = THIS_MODULE,
 + .release= vhost_net_release,
 + .unlocked_ioctl = vhost_net_ioctl,
 + .open   = vhost_net_open,
 +};

This is missing a compat_ioctl pointer. It should simply be

static long vhost_net_compat_ioctl(struct file *f,
				   unsigned int ioctl, unsigned long arg)
{
	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
}

 +/* Bits from fs/aio.c. TODO: export and use from there? */
 +/*
 + * use_mm
 + *   Makes the calling kernel thread take on the specified
 + *   mm context.
 + *   Called by the retry thread execute retries within the
 + *   iocb issuer's mm context, so that copy_from/to_user
 + *   operations work seamlessly for aio.
 + *   (Note: this routine is intended to be called only
 + *   from a kernel thread context)
 + */
 +static void use_mm(struct mm_struct *mm)
 +{
 +	struct mm_struct *active_mm;
 +	struct task_struct *tsk = current;
 +
 +	task_lock(tsk);
 +	active_mm = tsk->active_mm;
 +	atomic_inc(&mm->mm_count);
 +	tsk->mm = mm;
 +	tsk->active_mm = mm;
 +	switch_mm(active_mm, mm, tsk);
 +	task_unlock(tsk);
 +
 +	mmdrop(active_mm);
 +}

Why do you need a kernel thread here? If the data transfer functions
all get called from a guest intercept, shouldn't you already be
in the right mm?

 +static void handle_tx(struct vhost_net *net)
 +{
 +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 +	unsigned head, out, in;
 +	struct msghdr msg = {
 +		.msg_name = NULL,
 +		.msg_namelen = 0,
 +		.msg_control = NULL,
 +		.msg_controllen = 0,
 +		.msg_iov = (struct iovec *)vq->iov + 1,
 +		.msg_flags = MSG_DONTWAIT,
 +	};
 +	size_t len;
 +	int err;
 +	struct socket *sock = rcu_dereference(net->sock);
 +	if (!sock || !sock_writeable(sock->sk))
 +		return;
 +
 +	use_mm(net->dev.mm);
 +	mutex_lock(&vq->mutex);
 +	for (;;) {
 +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
 +		if (head == vq->num)
 +			break;
 +		if (out <= 1 || in) {
 +			vq_err(vq, "Unexpected descriptor format for TX: "
 +			       "out %d, int %d\n", out, in);
 +			break;
 +		}
 +		/* Sanity check */
 +		if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
 +			vq_err(vq, "Unexpected header len for TX: "
 +			       "%ld expected %zd\n", vq->iov->iov_len,
 +			       sizeof(struct virtio_net_hdr));
 +			break;
 +		}
 +		/* Skip header. TODO: support TSO. */
 +		msg.msg_iovlen = out - 1;
 +		len = iov_length(vq->iov + 1, out - 1);
 +		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 +		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 +		if (err < 0) {
 + 

Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-10 Thread Michael S. Tsirkin
On Mon, Aug 10, 2009 at 09:51:18PM +0200, Arnd Bergmann wrote:
 On Monday 10 August 2009, Michael S. Tsirkin wrote:
  What it is: vhost net is a character device that can be used to reduce
  the number of system calls involved in virtio networking.
  Existing virtio net code is used in the guest without modification.
 
 Very nice, I loved reading it. It's getting rather late in my time
 zone, so this comments only on the network driver. I'll go through
 the rest tomorrow.
 
  @@ -293,6 +293,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
  		err = PTR_ERR(vblk->vq);
  		goto out_free_vblk;
  	}
  +	printk(KERN_ERR "vblk->vq = %p\n", vblk->vq);
   
  	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
  	if (!vblk->pool) {
  @@ -383,6 +384,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
  	if (!err)
  		blk_queue_logical_block_size(vblk->disk->queue, blk_size);
   
  +	printk(KERN_ERR "virtio_config_val returned %d\n", err);
  +
  	add_disk(vblk->disk);
  	return 0;
 
 I guess you meant to remove these before submitting.

Good catch, thanks!

  +static void handle_tx_kick(struct work_struct *work);
  +static void handle_rx_kick(struct work_struct *work);
  +static void handle_tx_net(struct work_struct *work);
  +static void handle_rx_net(struct work_struct *work);
 
 [style] I think the code gets more readable if you reorder it
 so that you don't need forward declarations for static functions.

Right.

  +static long vhost_net_reset_owner(struct vhost_net *n)
  +{
  +	struct socket *sock = NULL;
  +	long r;
  +	mutex_lock(&n->dev.mutex);
  +	r = vhost_dev_check_owner(&n->dev);
  +	if (r)
  +		goto done;
  +	sock = vhost_net_stop(n);
  +	r = vhost_dev_reset_owner(&n->dev);
  +done:
  +	mutex_unlock(&n->dev.mutex);
  +	if (sock)
  +		fput(sock->file);
  +	return r;
  +}
 
 what is the difference between vhost_net_reset_owner(n)
 and vhost_net_set_socket(n, -1)?

set socket to -1 will only stop the device.

reset owner will let another process take over the device.
It also needs to reset all parameters to make it safe for that
other process, so in particular the device is stopped.

I tried explaining this in the header vhost.h - does the comment
there help, or do I need to clarify it?

  +
  +static struct file_operations vhost_net_fops = {
  +   .owner  = THIS_MODULE,
  +   .release= vhost_net_release,
  +   .unlocked_ioctl = vhost_net_ioctl,
  +   .open   = vhost_net_open,
  +};
 
 This is missing a compat_ioctl pointer. It should simply be
 
 static long vhost_net_compat_ioctl(struct file *f,
 				   unsigned int ioctl, unsigned long arg)
 {
 	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
 }

I had the impression that if there's no compat_ioctl,
unlocked_ioctl will get called automatically. No?

  +/* Bits from fs/aio.c. TODO: export and use from there? */
  +/*
  + * use_mm
  + * Makes the calling kernel thread take on the specified
  + * mm context.
  + * Called by the retry thread execute retries within the
  + * iocb issuer's mm context, so that copy_from/to_user
  + * operations work seamlessly for aio.
  + * (Note: this routine is intended to be called only
  + * from a kernel thread context)
  + */
  +static void use_mm(struct mm_struct *mm)
  +{
  +	struct mm_struct *active_mm;
  +	struct task_struct *tsk = current;
  +
  +	task_lock(tsk);
  +	active_mm = tsk->active_mm;
  +	atomic_inc(&mm->mm_count);
  +	tsk->mm = mm;
  +	tsk->active_mm = mm;
  +	switch_mm(active_mm, mm, tsk);
  +	task_unlock(tsk);
  +
  +	mmdrop(active_mm);
  +}
 
 Why do you need a kernel thread here? If the data transfer functions
 all get called from a guest intercept, shouldn't you already be
 in the right mm?

several reasons :)
- I get called under lock, so can't block
- eventfd can be passed to another process, and I won't be in guest context at 
all
- this also gets called outside guest context from socket poll
- vcpu is blocked while it's doing i/o. it is better to free it up
  as all the packet copying might take a while

  +static void handle_tx(struct vhost_net *net)
  +{
  +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
  +	unsigned head, out, in;
  +	struct msghdr msg = {
  +		.msg_name = NULL,
  +		.msg_namelen = 0,
  +		.msg_control = NULL,
  +		.msg_controllen = 0,
  +		.msg_iov = (struct iovec *)vq->iov + 1,
  +		.msg_flags = MSG_DONTWAIT,
  +	};
  +	size_t len;
  +	int err;
  +	struct socket *sock = rcu_dereference(net->sock);
  +	if (!sock || !sock_writeable(sock->sk))
  +		return;
  +
  +	use_mm(net->dev.mm);
  +	mutex_lock(&vq->mutex);
  +	for (;;) {
  +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
  +		if (head == vq->num)
  +			break;
  +		if (out <= 1 || in) {
  +  

Re: [Qemu-devel] Re: [PATCH] qemu/virtio: move features to an inline function

2009-08-10 Thread Anthony Liguori
Michael S. Tsirkin wrote:
 Normally, the common features are transport features and the devices  
 should have absolutely no knowledge of transport feature (since they're  
 transport dependent).
 

 Good point. But

 1. note that with my patch they don't. They call
 virtio_get_common_features and that's all.

 2. some features may not make sense for some devices. For example, it is
 quite possible that indirect ring entries feature improves performance
 on block but hurts on net, as net has a similar merged buffers feature.
 Someone should try benchmarking with it disabled, and it becomes
 possible with my patch.
   

I don't necessarily disagree but I think your patch goes about it the 
wrong way.

There ought to be a way to layer qdev properties that achieves this goal 
so that when you create a virtio-pci-block device, you have the ability 
to turn off indirect sg without virtio-block having to know what that is.

For your use-case, I wonder if you're integrating at the wrong level.  
If you implement a ring in-kernel, maybe the thing to do is introduce 
more layering in qemu like we have in the kernel so that you can easily 
add a new ring backend type.  At any rate, see if you can achieve the 
same goal with qdev properties.  If you could, you should be able to 
hack something up easily to disable this for vhost without completely 
overhauling qemu's virtio implementation.

Regards,

Anthony Liguori
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] vhost_net: a kernel-level virtio server

2009-08-10 Thread Arnd Bergmann
On Monday 10 August 2009 20:10:44 Michael S. Tsirkin wrote:
 On Mon, Aug 10, 2009 at 09:51:18PM +0200, Arnd Bergmann wrote:
  what is the difference between vhost_net_reset_owner(n)
  and vhost_net_set_socket(n, -1)?
 
 set socket to -1 will only stop the device.
 
 reset owner will let another process take over the device.
 It also needs to reset all parameters to make it safe for that
 other process, so in particular the device is stopped.

ok
 
 I tried explaining this in the header vhost.h - does the comment
 there help, or do I need to clarify it?

No, I just didn't get there yet.

 I had the impression that if there's no compat_ioctl,
 unlocked_ioctl will get called automatically. No?

It will issue a kernel warning but not call unlocked_ioctl,
so you need either a compat_ioctl method or list the numbers
in fs/compat_ioctl.c, which I try to avoid.

  Why do you need a kernel thread here? If the data transfer functions
  all get called from a guest intercept, shouldn't you already be
  in the right mm?
 
 several reasons :)
 - I get called under lock, so can't block
 - eventfd can be passed to another process, and I won't be in guest context 
 at all
 - this also gets called outside guest context from socket poll
 - vcpu is blocked while it's doing i/o. it is better to free it up
   as all the packet copying might take a while

Ok.

  I guess that this is where one could plug into macvlan directly, using
  sock_alloc_send_skb/memcpy_fromiovec/dev_queue_xmit directly,
  instead of filling a msghdr for each, if we want to combine this
  with the work I did on that.
 
 quite possibly. Or one can just bind a raw socket to macvlan :)

Right, that works as well, but may get more complicated once we
try to add zero-copy or other optimizations.

Arnd 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: [PATCH] qemu/virtio: move features to an inline function

2009-08-10 Thread Michael S. Tsirkin
On Mon, Aug 10, 2009 at 03:33:59PM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin wrote:
 Normally, the common features are transport features and the devices  
 should have absolutely no knowledge of transport feature (since 
 they're  transport dependent).
 

 Good point. But

 1. note that with my patch they don't. They call
 virtio_get_common_features and that's all.

 2. some features may not make sense for some devices. For example, it is
 quite possible that indirect ring entries feature improves performance
 on block but hurts on net, as net has a similar merged buffers feature.
 Someone should try benchmarking with it disabled, and it becomes
 possible with my patch.
   

 I don't necessarily disagree but I think your patch goes about it the  
 wrong way.

 There ought to be a way to layer qdev properties that achieves this goal  
 so that when you create a virtio-pci-block device, you have the ability  
 to turn off indirect sg without virtio-block having to know what that is.

I don't understand, sorry. Why do you insist on involving pci here?
Ring layout has nothing to do with pci, does it?  With my patch,
virtio-block does not know what indirect sg is. It just does
enable/disable.

virtio-net has a device-specific feature that overlaps with
indirect entries. So insisting that devices should just ignore
the transport does not make sense to me.

 For your use-case, I wonder if you're integrating at the wrong level.   

Forget about that for now. Let's solve the generic problem.

 Regards,

 Anthony Liguori
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/2] export cpu_tlbstate to modules

2009-08-10 Thread H. Peter Anvin
On 08/10/2009 03:06 PM, Michael S. Tsirkin wrote:
 Wouldn't it be a *lot* better to move use_mm() from fs/aio.c into common
 code, and export that instead?
 
 That's easy too. What would a good place for it be?

Somewhere in mm/, presumably.  When in doubt, make it a new file...

-hpa
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: [PATCH] qemu/virtio: move features to an inline function

2009-08-10 Thread Anthony Liguori
Michael S. Tsirkin wrote:
 On Mon, Aug 10, 2009 at 03:33:59PM -0500, Anthony Liguori wrote:
   
 There ought to be a way to layer qdev properties that achieves this goal  
 so that when you create a virtio-pci-block device, you have the ability  
 to turn off indirect sg without virtio-block having to know what that is.
 

 I don't understand, sorry. Why do you insist on involving pci here?
 ring layout has nothing to do with pci, does it?

What I'm saying is that virtio-blk-pci, which is the qdev instantiation 
of virtio-pci + virtio-blk, should be able to have a set of qdev 
properties that is composed of a combination of at least two sets of 
properties: virtio-blk's qdev properties and virtio-pci's qdev properties.

Right now, all of the properties are defined in virtio-pci.c, so you 
could add a property such as DEFINE_PROP_BOOL("indirect-sg", ...), which 
you could then use to selectively enable/disable indirect sg on 
virtio-blk-pci devices without ever having to involve virtio-blk.c (sketched below).
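
A rough sketch of that layering, with the field, property and macro names
assumed for illustration rather than taken from the qemu tree: the transport
keeps a feature mask as a qdev property and applies it on top of whatever the
device returns, so virtio-blk.c never sees the transport bits. The feature bit
constants are the ones from hw/virtio.h.

/* Illustrative sketch only; names are assumptions.
 * Assumed new member of VirtIOPCIProxy, exposed as a qdev property:
 *     uint32_t host_feature_mask;   // transport feature bits to strip
 *     DEFINE_PROP_UINT32("feature-mask", VirtIOPCIProxy, host_feature_mask, 0)
 */
static uint32_t virtio_pci_host_features(uint32_t device_features,
                                         uint32_t host_feature_mask)
{
    uint32_t features = device_features;

    /* The transport owns its own bits... */
    features |= (1 << VIRTIO_F_NOTIFY_ON_EMPTY);
    features |= (1 << VIRTIO_RING_F_INDIRECT_DESC);

    /* ...and the per-device qdev property can strip them again, so
     * indirect sg can be turned off on virtio-blk-pci without
     * virtio-blk.c ever learning what it means. */
    return features & ~host_feature_mask;
}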

Ideally, we wouldn't have the properties centralized in virtio-pci.c.  
Rather, it would be nice if virtio-blk.c could have a set of properties 
and virtio-pci.c could just add those properties to its own set of 
properties.

Today, we don't have a concept of a ring abstraction.  If we did, then 
virtio-ring.c could have its own set of properties.

N.B. I expect that the in-kernel virtio-net device is going to be a 
separate qdev device from virtio-net-pci.  It can have an identical 
guest interface, but within qemu it should be a separate device.  This 
is how we handle the in-kernel PIT and it's how we should handle the 
in-kernel APIC.

Regards,

Anthony Liguori
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-10 Thread Rusty Russell
On Mon, 10 Aug 2009 07:17:54 pm Gerd Hoffmann wrote:
 On 08/10/09 08:55, Amit Shah wrote:
  Bad example.  Quite a lot of modern devices drivers are using dynamic
  major/minor numbers because they have proven to be such a pain in the
  butt.  That's why we have more sophisticated mechanisms like udev for
  userspace to make use of.
 
  Let me explain how we came to this numbering: we first had support for
  'naming' ports and the names were obtained by userspace programs by an
  ioctl. Rusty suggested to use some numbering scheme where some ports
  could exist at predefined locations so that we wouldn't need the naming
  and the ioctls around it.
 
 I think the naming is very important.

I disagree.  If you can hand out names, you can hand out numbers.  Whether
the guest chooses to put that number in sysfs or make it a minor, I don't
care.

Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication

2009-08-10 Thread Anthony Liguori
Rusty Russell wrote:
 On Mon, 10 Aug 2009 07:17:54 pm Gerd Hoffmann wrote:
   
 On 08/10/09 08:55, Amit Shah wrote:
 
 Bad example.  Quite a lot of modern devices drivers are using dynamic
 major/minor numbers because they have proven to be such a pain in the
 butt.  That's why we have more sophisticated mechanisms like udev for
 userspace to make use of.
 
 Let me explain how we came to this numbering: we first had support for
 'naming' ports and the names were obtained by userspace programs by an
 ioctl. Rusty suggested to use some numbering scheme where some ports
 could exist at predefined locations so that we wouldn't need the naming
 and the ioctls around it.
   
 I think the naming is very important.
 

 I disagree.  If you can hand out names, you can hand out numbers.

The problem with handing out names is that there has to be someone to 
hand them out.  And even if you have a good hander-outer, 
development is difficult in a distributed environment because you may 
have folks using your code before you've gotten an official assignment.

A better discovery mechanism is based on something that piggybacks on 
another authority.  For instance, reverse fully qualified domain names work 
well.  UUIDs tend to work pretty well too, although they're not perfect.

In general, even just open strings can work out okay given that people 
are responsible in how they name things.

Regards,

Anthony Liguori
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization