Re: [PATCH] AF_VMCHANNEL address family for guest-host communication.

2008-12-15 Thread Stephen Hemminger
On Mon, 15 Dec 2008 17:01:14 -0600
Anthony Liguori anth...@codemonkey.ws wrote:

 David Miller wrote:
  From: Anthony Liguori anth...@codemonkey.ws
  Date: Mon, 15 Dec 2008 14:44:26 -0600
 

  We want this communication mechanism to be simple and reliable as we
  want to implement the backends drivers in the host userspace with
  minimum mess.
  
 
  One implication of your statement here is that TCP is unreliable.
  That's absolutely not true.

 
 No, TCP falls under the not simple category because it requires the 
 backend to have access to a TCP/IP stack.
 
  Within the guest, we need the interface to be always available and
  we need an addressing scheme that is hypervisor specific.  Yes, we
  can build this all on top of TCP/IP.  We could even build it on top
  of a serial port.  Both have their down-sides wrt reliability and
  complexity.
  
 
  I don't know of any zero-copy through the hypervisor mechanisms for
  serial ports, but I know we do that with the various virtualization
  network devices.

 
 Yes, and I went down the road of using a dedicated network device and 
 using raw ethernet as the protocol.  The thing that killed that was the 
 fact that it's not reliable.  You need something like TCP to add 
 reliability.
 
 But that's a lot of work and a bit backwards.  Use a unreliable 
 transport but use TCP on top of it to get reliability.  Our link 
 (virtio) is inherently reliable so why not just expose a reliable 
 interface to userspace?
 
  Do you have another recommendation?
  
 
  I don't have to make alternative recommendations until you can
  show that what we have can't solve the problem acceptably, and
  TCP emphatically can.

 
 It can solve the problem but I don't think it's the best way to solve 
 the problem mainly because the complexity it demands on the backend.

Those who don't understand TCP are doomed to reimplement it, badly.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 34/54] Staging: hv: remove STRUCT_PACKED and STRUCT_ALIGNED defines

2009-07-24 Thread Stephen Hemminger
On Fri, 24 Jul 2009 23:32:19 +0200
Jörn Engel jo...@logfs.org wrote:

 On Tue, 21 July 2009 01:46:41 +0200, Arnd Bergmann wrote:
  On Friday 17 July 2009, Greg Kroah-Hartman wrote:
   @@ -43,7 +43,7 @@ typedef struct _RING_BUFFER {
   // volatile u32 InterruptMask;
   // Ring data starts here + RingDataStartOffset !!! DO NOT place 
   any fields below this !!!
u8 Buffer[0];
   -} STRUCT_PACKED RING_BUFFER;
   +} __attribute__((packed)) RING_BUFFER;

  
  The data structure is actually packed already, the attribute does not make 
  it better
  and could be removed. We also have __packed as a shortcut for 
  __attribute__((packed)).
 
 Honestly, I don't know how useful __packed really is.  In a shared
 kernel/userspace header, it is only defined for the kernel.
 

As I remember, gcc generates worse code for packed structures on many 
architectures
since it may have to do byte fetchs/recombining to avoid unaligned
accesses.

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-06 Thread Stephen Hemminger
On Mon, 15 Jun 2009 17:33:10 +
Fischer, Anna anna.fisc...@hp.com wrote:

 This patch adds basic Virtual Ethernet Port Aggregator (VEPA)
 capabilities to the Linux kernel Ethernet bridging code.
 
 A Virtual Ethernet Port Aggregator (VEPA) is a capability within
 a physical end station that collaborates with an adjacent, external
 bridge to provide distributed bridging support between multiple
 virtual end stations and external networks. The VEPA collaborates
 by forwarding all station-originated frames to the adjacent bridge
 for frame processing and frame relay (including so-called 'hairpin'
 forwarding) and by steering and replicating frames received from
 the VEPA uplink to the appropriate destinations. A VEPA may be
 implemented in software or in conjunction with embedded hardware.
 
 In particular, the patch extends the Linux Ethernet bridge to act as
 (1) a VEPA - for this we have added VEPA forwarding functionality and
 added a configuration option for a VEPA uplink port, or as
 (2) a bridge supporting 'hairpin' forwarding - for this we have added a
 bridge port 'hairpin' mode which allows sending frames back out
 through the port the frame was received on.
 
 Configuration of VEPA capabilities through Linux userspace bridge
 utilities is provided by an additional patch 'bridge-utils: add
 basic VEPA support'.

After reading more about this, I am not convinced this should be part
of the bridge code. The bridge code really consists of two parts:
forwarding table and optional spanning tree. Well the VEPA code
short circuits both of these; it can't imagine it working
with STP turned on. The only part of bridge code that really gets
used by this are the receive packet hooks and the crufty old
API.

So instead of adding more stuff to existing bridge code, why not
have a new driver for just VEPA. You could
do it with a simple version of macvlan type driver.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-07 Thread Stephen Hemminger
On Fri, 7 Aug 2009 14:06:58 -0700
Paul Congdon \(UC Davis\) ptcong...@ucdavis.edu wrote:

 Yaron,
 
 
 The interface multiplexing can be achieved using macvlan driver or using an 
 SR-IOV capable NIC (the preferred option), macvlan may need to be extended to 
 support VEPA multicast handling, this looks like a rather simple task 
 
 Agreed that the hardware solution is preferred so the macvlan implementation 
 doesn’t really matter.  If we are talking SR-IOV, then it is direct mapped, 
 regardless of whether there is a VEB or VEPA in the hardware below, so you 
 are bypassing the bridge software code also.  
 
 I disagree that adding the multicast handling is simple – while not 
 conceptually hard, it will basically require you to put an address table into 
 the macvlan implementation – if you have that, then why not have just used 
 the one already in the bridge code.  If you hook a VEPA up to a non-hairpin 
 mode external bridge, you get the macvlan capability as well.

I have a patch that forwards all multicast packets, and another that does
proper forwarding. It should have worked that way in original macvlan, the
current behavior is really a bug.


-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Stephen Hemminger
On Sun, 09 Aug 2009 14:19:08 +0300
Or Gerlitz ogerl...@voltaire.com wrote:

 Stephen Hemminger wrote:
  I have a patch that forwards all multicast packets, and another that does 
  proper forwarding. It should have worked that way in original macvlan, the 
  current behavior is really a bug.

 Looking in macvlan_set_multicast_list() it acts in a similar manner to
 macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I
 assume what's left is to add macvlan_hash_xxx multicast logic to
 map/unmap multicast groups to what macvlan devices want to receive them
 and this way the flooding can be removed, correct?

The device can just flood all multicast packets, since the filtering
is done on the receive path anyway.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

2009-08-10 Thread Stephen Hemminger
On Mon, 10 Aug 2009 16:32:01 +
Fischer, Anna anna.fisc...@hp.com wrote:

  Subject: Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
  
  On Monday 10 August 2009, Stephen Hemminger wrote:
   On Sun, 09 Aug 2009 14:19:08 +0300, Or Gerlitz
  ogerl...@voltaire.com wrote:
Looking in macvlan_set_multicast_list() it acts in a similar manner
  to
macvlan_set_mac_address() in the sense that it calls dev_mc_sync().
  I
assume what's left is to add macvlan_hash_xxx multicast logic to
map/unmap multicast groups to what macvlan devices want to receive
  them
and this way the flooding can be removed, correct?
  
   The device can just flood all multicast packets, since the filtering
   is done on the receive path anyway.
 
 Is this handled by one of the additional patches? In the current kernel tree
 macvlan code it looks as if multicast filtering is only handled by the
 physical device driver, but not on particular macvlan devices.
  
 
  But we'd still have to copy the frames to user space (for both
  macvtap and raw packet sockets) and exit from the guest to inject
  it into its stack, right?
 
 I think it would be nice if you can implement what Or describes for 
 macvlan and avoid flooding, and it doesn't sound too hard to do. 
 
 I guess one advantage for macvlan (over the bridge) is that you can 
 program in all information you have for the ports attached to it, e.g. 
 MAC addresses and multicast addresses. So you could take advantage of
 that whereas the bridge always floods multicast frames to all ports.
  
 How would this work though, if the OS inside the guest wants to register
 to a particular multicast address? Is this propagated through the backend
 drivers to the macvlan/macvtap interface?

Sure filtering is better, but multicast performance with large number
of guests is really a corner case, not the real performance issue.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] net/bridge: Add 'hairpin' port forwarding mode

2009-08-13 Thread Stephen Hemminger

 diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
 index eb404dc..e486f1f 100644
 --- a/net/bridge/br_if.c
 +++ b/net/bridge/br_if.c
 @@ -256,6 +256,7 @@ static struct net_bridge_port *new_nbp(struct net_bridge 
 *br,
   p-path_cost = port_cost(dev);
   p-priority = 0x8000  BR_PORT_BITS;
   p-port_no = index;
 + p-flags = 0;
   br_init_port(p);
   p-state = BR_STATE_DISABLED;
   br_stp_port_timer_init(p);

Minor nit. this is unnecessary since the structure is allocated with kzalloc.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM

2009-09-01 Thread Stephen Hemminger
On Tue, 1 Sep 2009 14:58:19 +0800
Xin, Xiaohui xiaohui@intel.com wrote:

   [RFC] Virtual Machine Device Queues (VMDq) support on KVM
 
 Network adapter with VMDq technology presents multiple pairs of tx/rx queues,
 and renders network L2 sorting mechanism based on MAC addresses and VLAN tags
 for each tx/rx queue pair. Here we present a generic framework, in which 
 network
 traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without
 any software copy.
 
 Actually this framework can apply to traditional network adapters which have
 just one tx/rx queue pair. And applications using the same user/kernel 
 interface
 can utilize this framework to send/receive network traffic directly thru a 
 tx/rx
 queue pair in a network adapter.
 
 We use virtio-net architecture to illustrate the framework.
 
 
 || pop   add_buf||
 |Qemu process|  -TX   --  | Guest Kernel   |
 ||  - --  ||
 |Virtio-net  | push  get_buf||
 |  (Backend service) |  -RX   --  |  Virtio-net|
 ||  - --  |driver  |
 || push  get_buf||
 ||  ||
|
|
| AIO (read  write) combined with Direct I/O
|   (which substitute synced file operations)
 |---|
 | Host kernel  | read: copy-less with directly mapped user  |
 |  |   space to kernel, payload directly DMAed  |
 |  |   into user space  |
 |  | write: copy-less with directly mapped user |
 |  |   space to kernel, payload directly hooked |
 |  |   to a skb |
 |  ||
 |  (a likely   ||
 |   queue pair ||
 |   instance)  ||
 |  |   ||
 | NIC driver --  TUN/TAP driver   |
 |---|
|
|
traditional adapter or a tx/rx queue pair
 
 The basic idea is to utilize the kernel Asynchronous I/O combined with Direct
 I/O to implements copy-less TUN/TAP device. AIO and Direct I/O is not new to
 kernel, we still can see it in SCSI tape driver.
 
 With traditional file operations, a copying of payload contents from/to the
 kernel DMA address to/from a user buffer is needed. That's what the copying we
 want to save.
 
 The proposed framework is like this:
 A TUN/TAP device is bound to a traditional NIC adapter or a tx/rx queue pair 
 in
 host side. KVM virto-net Backend service, the user space program submits
 asynchronous read/write I/O requests to the host kernel through TUN/TAP 
 device.
 The requests are corresponding to the vqueue elements include both 
 transmission
  receive. They can be queued in one AIO request and later, the completion 
 will
 be notified through the underlying packets tx/rx processing of the rx/tx queue
 pair.
 
 Detailed path:
 
 To guest Virtio-net driver, packets receive corresponding to asynchronous read
 I/O requests of Backend service.
 
 1) Guest Virtio-net driver provides header and payload address through the
 receive vqueue to Virtio-net backend service.
 
 2) Virtio-net backend service encapsulates multiple vqueue elements into
 multiple AIO control blocks and composes them into one AIO read request.
 
 3) Virtio-net backend service uses io_submit() syscall to pass the request to
 the TUN/TAP device.
 
 4) Virtio-net backend service uses io_getevents() syscall to check the
 completion of the request.
 
 5) The TUN/TAP driver receives packets from the queue pair of NIC, and 
 prepares
 for Direct I/O.
A modified NIC driver may render a skb which header is allocated in host
 kernel, but the payload buffer is directly mapped from user space buffer which
 are rendered through the AIO request by the Backend service. get_user_pages()
 may do this. For one AIO read request, the TUN/TAP driver maintains a list for
 the directly mapped buffers, and a NIC driver tries to get the buffers as
 payload buffer to compose the new skbs. Of course, if getting the buffers
 fails, then kernel allocated buffers are used.
 
 6) Modern NIC cards now mostly have the header split feature. The NIC queue
 pair then may directly DMA the payload 

Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM

2009-09-21 Thread Stephen Hemminger
On Mon, 21 Sep 2009 16:37:22 +0930
Rusty Russell ru...@rustcorp.com.au wrote:

   Actually this framework can apply to traditional network adapters which 
   have
   just one tx/rx queue pair. And applications using the same user/kernel 
   interface
   can utilize this framework to send/receive network traffic directly thru 
   a tx/rx
   queue pair in a network adapter.
   

More importantly, when virtualizations is used with multi-queue NIC's the 
virtio-net
NIC is a single CPU bottleneck. The virtio-net NIC should preserve the 
parallelism (lock
free) using multiple receive/transmit queues. The number of queues should equal 
the
number of CPUs.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM

2009-09-22 Thread Stephen Hemminger
On Tue, 22 Sep 2009 13:50:54 +0200
Arnd Bergmann a...@arndb.de wrote:

 On Tuesday 22 September 2009, Michael S. Tsirkin wrote:
More importantly, when virtualizations is used with multi-queue
NIC's the virtio-net NIC is a single CPU bottleneck. The virtio-net
NIC should preserve the parallelism (lock free) using multiple
receive/transmit queues. The number of queues should equal the
number of CPUs.
   
   Yup, multiqueue virtio is on todo list ;-)
   
  
  Note we'll need multiqueue tap for that to help.
 
 My idea for that was to open multiple file descriptors to the same
 macvtap device and let the kernel figure out the  right thing to
 do with that. You can do the same with raw packed sockets in case
 of vhost_net, but I wouldn't want to add more complexity to the
 tun/tap driver for this.
 
   Arnd 


Or get tap out of the way entirely. The packets should not have
to go out to user space at all (see veth)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2.6.32-rc1] net: VMware virtual Ethernet NIC driver: vmxnet3

2009-09-30 Thread Stephen Hemminger
On Wed, 30 Sep 2009 14:34:57 -0700 (PDT)
Shreyas Bhatewara sbhatew...@vmware.com wrote:

Note: your patch was linewrapped again

 +
 +
 +static void
 +vmxnet3_declare_features(struct vmxnet3_adapter *adapter, bool dma64)
 +{
 + struct net_device *netdev = adapter-netdev;
 +
 + netdev-features = NETIF_F_SG |
 + NETIF_F_HW_CSUM |
 + NETIF_F_HW_VLAN_TX |
 + NETIF_F_HW_VLAN_RX |
 + NETIF_F_HW_VLAN_FILTER |
 + NETIF_F_TSO |
 + NETIF_F_TSO6;
 +
 + printk(KERN_INFO features: sg csum vlan jf tso tsoIPv6);
 +
 + adapter-rxcsum = true;
 + adapter-jumbo_frame = true;
 +
 + if (!disable_lro) {
 + adapter-lro = true;
 + printk( lro);
 + }

Why not use NETIF_F_LRO and ethtool to control LRO support?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2.6.32-rc3] net: VMware virtual Ethernet NIC driver: vmxnet3

2009-10-09 Thread Stephen Hemminger
On Thu, 8 Oct 2009 10:59:26 -0700 (PDT)
Shreyas Bhatewara sbhatew...@vmware.com wrote:

 Hello all,
 
 I do not mean to be bothersome but this thread has been unusually silent.
 Could you please review the patch for me and reply with your comments / 
 acks ?
 
 Thanks.
 -Shreyas  


Looks fine, but just a minor style nit (can be changed after insertion in 
mainline).

The code:

static void
vmxnet3_do_poll(struct vmxnet3_adapter *adapter, int budget, int *txd_done,
int *rxd_done)
{
if (unlikely(adapter-shared-ecr))
vmxnet3_process_events(adapter);

*txd_done = vmxnet3_tq_tx_complete(adapter-tx_queue, adapter);
*rxd_done = vmxnet3_rq_rx_complete(adapter-rx_queue, adapter, budget);
}


static int
vmxnet3_poll(struct napi_struct *napi, int budget)
{
struct vmxnet3_adapter *adapter = container_of(napi,
  struct vmxnet3_adapter, napi);
int rxd_done, txd_done;

vmxnet3_do_poll(adapter, budget, txd_done, rxd_done);

if (rxd_done  budget) {
napi_complete(napi);
vmxnet3_enable_intr(adapter, 0);
}
return rxd_done;
}


Is simpler if you just have do_poll return rx done value. Probably Gcc
inline's it all anyway.

static int
vmxnet3_do_poll(struct vmxnet3_adapter *adapter, int budget)
{
if (unlikely(adapter-shared-ecr))
vmxnet3_process_events(adapter);

vmxnet3_tq_tx_complete(adapter-tx_queue, adapter);
return vmxnet3_rq_rx_complete(adapter-rx_queue, adapter, budget);
}


static int
vmxnet3_poll(struct napi_struct *napi, int budget)
{
struct vmxnet3_adapter *adapter = container_of(napi,
  struct vmxnet3_adapter, napi);
int rxd_done;

rxd_done = vmxnet3_do_poll(adapter, budget);
if (rxd_done  budget) {
napi_complete(napi);
vmxnet3_enable_intr(adapter, 0);
}
return rxd_done;
}
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH for-2.6.35] virtio_net: fix oom handling on tx

2010-06-10 Thread Stephen Hemminger
On Thu, 10 Jun 2010 10:17:07 -0700
Sridhar Samudrala s...@us.ibm.com wrote:

 On Thu, 2010-06-10 at 18:20 +0300, Michael S. Tsirkin wrote:
  virtio net will never try to overflow the TX ring, so the only reason
  add_buf may fail is out of memory. Thus, we can not stop the
  device until some request completes - there's no guarantee anything
  at all is outstanding.
  
  Make the error message clearer as well: error here does not
  indicate queue full.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  ---
   drivers/net/virtio_net.c |   15 ---
   1 files changed, 8 insertions(+), 7 deletions(-)
  
  diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
  index 85615a3..e48a06f 100644
  --- a/drivers/net/virtio_net.c
  +++ b/drivers/net/virtio_net.c
  @@ -563,7 +563,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, 
  struct net_device *dev)
  struct virtnet_info *vi = netdev_priv(dev);
  int capacity;
  
  -again:
  /* Free up any pending old buffers before queueing new ones. */
  free_old_xmit_skbs(vi);
  
  @@ -572,12 +571,14 @@ again:
  
  /* This can happen with OOM and indirect buffers. */
  if (unlikely(capacity  0)) {
  -   netif_stop_queue(dev);
  -   dev_warn(dev-dev, Unexpected full queue\n);
  -   if (unlikely(!virtqueue_enable_cb(vi-svq))) {
  -   virtqueue_disable_cb(vi-svq);
  -   netif_start_queue(dev);
  -   goto again;
  +   if (net_ratelimit()) {
  +   if (likely(capacity == -ENOMEM))
  +   dev_warn(dev-dev,
  +TX queue failure: out of memory\n);
  +   else
  +   dev_warn(dev-dev,
  +Unexpected TX queue failure: %d\n,
  +capacity);
  }
  return NETDEV_TX_BUSY;
  }
 
 It is not clear to me how xmit_skb() can return -ENOMEM.
 xmit_skb() calls virtqueue_add_buf_gfp() which can return -ENOSPC.
 Even vring_add_indirect() doesn't return -ENOMEM on kmalloc failure.

It makes more sense to have the device increment tx_droppped, 
and return NETDEV_TX_OK. Skip the message (or make it a pr_debug()). 
Network devices do not guarantee packet delivery, and if out of
resources then holding more data in the
queue is going to hurt not help the situation.

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-07-12 Thread Stephen Hemminger
On Mon, 12 Jul 2010 20:06:28 -0700
Shreyas Bhatewara sbhatew...@vmware.com wrote:

 
 On Thu, 2010-05-06 at 13:21 -0700, Christoph Hellwig wrote:
  On Wed, May 05, 2010 at 10:52:53AM -0700, Stephen Hemminger wrote:
   Let me put it bluntly. Any design that allows external code to run
   in the kernel is not going to be accepted.  Out of tree kernel modules 
   are enough
   of a pain already, why do you expect the developers to add another
   interface.
  
  Exactly.  Until our friends at VMware get this basic fact it's useless
  to continue arguing.
  
  Pankaj and Dmitry: you're fine to waste your time on this, but it's not
  going to go anywhere until you address that fundamental problem.  The
  first thing you need to fix in your archicture is to integrate the VF
  function code into the kernel tree, and we can work from there.
  
  Please post patches doing this if you want to resume the discussion.
  
  ___
  Pv-drivers mailing list
  pv-driv...@vmware.com
  http://mailman2.vmware.com/mailman/listinfo/pv-drivers
 
 
 As discussed, following is the patch to give you an idea
 about implementation of NPA for vmxnet3 driver. Although the
 patch is big, I have verified it with checkpatch.pl. It gave
 0 errors / warnings.
 
 Signed-off-by: Matthieu Bucchaineri matth...@vmware.com
 Signed-off-by: Shreyas Bhatewara sbhatew...@vmware.com

I think the concept won't fly.

But you should really at least try running checkpatch to make sure
the style conforms.


-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3

2010-07-13 Thread Stephen Hemminger
On Mon, 12 Jul 2010 20:06:28 -0700
Shreyas Bhatewara sbhatew...@vmware.com wrote:

 
 On Thu, 2010-05-06 at 13:21 -0700, Christoph Hellwig wrote:
  On Wed, May 05, 2010 at 10:52:53AM -0700, Stephen Hemminger wrote:
   Let me put it bluntly. Any design that allows external code to run
   in the kernel is not going to be accepted.  Out of tree kernel modules 
   are enough
   of a pain already, why do you expect the developers to add another
   interface.
  
  Exactly.  Until our friends at VMware get this basic fact it's useless
  to continue arguing.
  
  Pankaj and Dmitry: you're fine to waste your time on this, but it's not
  going to go anywhere until you address that fundamental problem.  The
  first thing you need to fix in your archicture is to integrate the VF
  function code into the kernel tree, and we can work from there.
  
  Please post patches doing this if you want to resume the discussion.
  
  ___
  Pv-drivers mailing list
  pv-driv...@vmware.com
  http://mailman2.vmware.com/mailman/listinfo/pv-drivers
 
 
 As discussed, following is the patch to give you an idea
 about implementation of NPA for vmxnet3 driver. Although the
 patch is big, I have verified it with checkpatch.pl. It gave
 0 errors / warnings.
 
 Signed-off-by: Matthieu Bucchaineri matth...@vmware.com
 Signed-off-by: Shreyas Bhatewara sbhatew...@vmware.com
 ---

I am surprised, the code seems to use lots of mixed case in places
that don't really follow current kernel practice.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] virtio: fix format of sysfs driver/vendor files

2010-11-09 Thread Stephen Hemminger
The sysfs files for virtio produce the wrong format and are missing
the required newline. The output for virtio bus vendor/device should
have the same format as the corresponding entries for PCI devices.

Although this technically changes the ABI for sysfs, these files were
broken to start with!

Signed-off-by: Stephen Hemminger shemmin...@vyatta.com


--- a/drivers/virtio/virtio.c   2010-11-09 21:39:31.713916249 -0800
+++ b/drivers/virtio/virtio.c   2010-11-09 21:40:16.072089461 -0800
@@ -9,19 +9,19 @@ static ssize_t device_show(struct device
   struct device_attribute *attr, char *buf)
 {
struct virtio_device *dev = container_of(_d,struct virtio_device,dev);
-   return sprintf(buf, %hu, dev-id.device);
+   return sprintf(buf, 0x%04x\n, dev-id.device);
 }
 static ssize_t vendor_show(struct device *_d,
   struct device_attribute *attr, char *buf)
 {
struct virtio_device *dev = container_of(_d,struct virtio_device,dev);
-   return sprintf(buf, %hu, dev-id.vendor);
+   return sprintf(buf, 0x%04x\n, dev-id.vendor);
 }
 static ssize_t status_show(struct device *_d,
   struct device_attribute *attr, char *buf)
 {
struct virtio_device *dev = container_of(_d,struct virtio_device,dev);
-   return sprintf(buf, 0x%08x, dev-config-get_status(dev));
+   return sprintf(buf, 0x%08x\n, dev-config-get_status(dev));
 }
 static ssize_t modalias_show(struct device *_d,
 struct device_attribute *attr, char *buf)
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH]: An implementation of HyperV KVP functionality

2010-11-11 Thread Stephen Hemminger
On Thu, 11 Nov 2010 13:03:10 -0700
Ky Srinivasan ksriniva...@novell.com wrote:

 +static char *kvp_keys[KVP_MAX_KEY] = {FullyQualifiedDomainName,
 + IntegrationServicesVersion,
 + NetworkAddressIPv4,
 + NetworkAddressIPv6,
 + OSBuildNumber,
 + OSName,
 + OSMajorVersion,
 + OSMinorVersion,
 + OSVersion,
 + ProcessorArchitecture,
 + };

Minor nit:
static const char *kvp_keys[KVP_MAX_KEY] = {
FullQualifiedDomainName,
...

+/*
+ * Global state maintained for transaction that is being processed.
+ * Note that only one transaction can be active at any point in time.
+ *
+ * This state is set when we receive a request from the host; we
+ * cleanup this state when the transaction is completed - when we respond
+ * to the host with the key value.
+ */
+
+static u8 *recv_buffer; /* the receive buffer that we allocated */
+static int recv_len; /* number of bytes received. */
+static struct vmbus_channel *recv_channel; /*chn on which we got the request*/
+static u64 recv_req_id; /* request ID. */
+static int kvp_current_index;
+

I would put all the state variables for the transaction in one
structure, 

+static void kvp_timer_func(unsigned long __data)
+{
+   u32 key = *((u32 *)__data);
+   /*
+* If the timer fires, the user-mode component has not responded;
+* process the pending transaction.
+*/
+   kvp_respond_to_host(key, Guest timed out);
+}

delayed_work is sometimes better for things like this, since it
runs in user context and can sleep.

+   case (KVP_MAX_KEY):
+   /*
+* We don't support this key
+* and any key beyond this.
+*/
+   icmsghdrp-status = HV_E_FAIL;
+   goto callback_done;
+

case labels do not need parens

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] xen: fix p2m section mismatches

2011-03-24 Thread Stephen Hemminger
On Thu, 24 Mar 2011 13:34:32 -0700
Randy Dunlap randy.dun...@oracle.com wrote:

 From: Randy Dunlap randy.dun...@oracle.com
 
 Fix section mismatch warnings:
 set_phys_range_identity() is called by __init xen_set_identity(),
 so also mark set_phys_range_identity() as __init.
 then:
 __early_alloc_p2m() is called set_phys_range_identity(), so also mark
 __early_alloc_p2m() as __init.
 
 WARNING: arch/x86/built-in.o(.text+0x7856): Section mismatch in reference 
 from the function __early_alloc_p2m() to the function .init.text:extend_brk()
 The function __early_alloc_p2m() references
 the function __init extend_brk().
 This is often because __early_alloc_p2m lacks a __init 
 annotation or the annotation of extend_brk is wrong.
 
 WARNING: arch/x86/built-in.o(.text+0x7967): Section mismatch in reference 
 from the function set_phys_range_identity() to the function 
 .init.text:extend_brk()
 The function set_phys_range_identity() references
 the function __init extend_brk().
 This is often because set_phys_range_identity lacks a __init 
 annotation or the annotation of extend_brk is wrong.
 
 Signed-off-by: Randy Dunlap randy.dun...@oracle.com
 ---
  arch/x86/xen/p2m.c |4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 --- linux-2.6.38-git13.orig/arch/x86/xen/p2m.c
 +++ linux-2.6.38-git13/arch/x86/xen/p2m.c
 @@ -497,7 +497,7 @@ static bool alloc_p2m(unsigned long pfn)
   return true;
  }
  
 -bool __early_alloc_p2m(unsigned long pfn)
 +bool __init __early_alloc_p2m(unsigned long pfn)

__early_alloc_p2m could be static as well.
-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] virtio-net: per cpu 64 bit stats

2011-06-15 Thread Stephen Hemminger
Use per-cpu variables to maintain 64 bit statistics.
Compile tested only.

Signed-off-by: Stephen Hemminger shemmin...@vyatta.com

--- a/drivers/net/virtio_net.c  2011-06-14 15:18:46.448596355 -0400
+++ b/drivers/net/virtio_net.c  2011-06-15 09:54:22.914426067 -0400
@@ -40,6 +40,15 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX2
 
+struct virtnet_stats {
+   struct u64_stats_sync syncp;
+   u64 tx_bytes;
+   u64 tx_packets;
+
+   u64 rx_bytes;
+   u64 rx_packets;
+};
+
 struct virtnet_info {
struct virtio_device *vdev;
struct virtqueue *rvq, *svq, *cvq;
@@ -56,6 +65,9 @@ struct virtnet_info {
/* Host will merge rx buffers for big packets (shake it! shake it!) */
bool mergeable_rx_bufs;
 
+   /* Active statistics */
+   struct virtnet_stats __percpu *stats;
+
/* Work struct for refilling if we run low on memory. */
struct delayed_work refill;
 
@@ -209,7 +221,6 @@ static int receive_mergeable(struct virt
skb-dev-stats.rx_length_errors++;
return -EINVAL;
}
-
page = virtqueue_get_buf(vi-rvq, len);
if (!page) {
pr_debug(%s: rx error: %d buffers missing\n,
@@ -217,6 +228,7 @@ static int receive_mergeable(struct virt
skb-dev-stats.rx_length_errors++;
return -EINVAL;
}
+
if (len  PAGE_SIZE)
len = PAGE_SIZE;
 
@@ -230,6 +242,7 @@ static int receive_mergeable(struct virt
 static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 {
struct virtnet_info *vi = netdev_priv(dev);
+   struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats);
struct sk_buff *skb;
struct page *page;
struct skb_vnet_hdr *hdr;
@@ -265,8 +278,11 @@ static void receive_buf(struct net_devic
 
hdr = skb_vnet_hdr(skb);
skb-truesize += skb-data_len;
-   dev-stats.rx_bytes += skb-len;
-   dev-stats.rx_packets++;
+
+   u64_stats_update_begin(stats-syncp);
+   stats-rx_bytes += skb-len;
+   stats-rx_packets++;
+   u64_stats_update_begin(stats-syncp);
 
if (hdr-hdr.flags  VIRTIO_NET_HDR_F_NEEDS_CSUM) {
pr_debug(Needs csum!\n);
@@ -515,11 +531,16 @@ static unsigned int free_old_xmit_skbs(s
 {
struct sk_buff *skb;
unsigned int len, tot_sgs = 0;
+   struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats);
 
while ((skb = virtqueue_get_buf(vi-svq, len)) != NULL) {
pr_debug(Sent skb %p\n, skb);
-   vi-dev-stats.tx_bytes += skb-len;
-   vi-dev-stats.tx_packets++;
+
+   u64_stats_update_begin(stats-syncp);
+   stats-tx_bytes += skb-len;
+   stats-tx_packets++;
+   u64_stats_update_begin(stats-syncp);
+
tot_sgs += skb_vnet_hdr(skb)-num_sg;
dev_kfree_skb_any(skb);
}
@@ -641,6 +662,40 @@ static int virtnet_set_mac_address(struc
return 0;
 }
 
+static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
+  struct rtnl_link_stats64 *tot)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   int cpu;
+   unsigned int start;
+
+   for_each_possible_cpu(cpu) {
+   struct virtnet_stats __percpu *stats
+   = per_cpu_ptr(vi-stats, cpu);
+   u64 tpackets, tbytes, rpackets, rbytes;
+
+   do {
+   start = u64_stats_fetch_begin(stats-syncp);
+   tpackets = stats-tx_packets;
+   tbytes   = stats-tx_bytes;
+   rpackets = stats-rx_packets;
+   rbytes   = stats-rx_bytes;
+   } while (u64_stats_fetch_retry(stats-syncp, start));
+
+   tot-rx_packets += rpackets;
+   tot-tx_packets += tpackets;
+   tot-rx_bytes   += rbytes;
+   tot-tx_bytes   += tbytes;
+   }
+
+   tot-tx_dropped = dev-stats.tx_dropped;
+   tot-rx_dropped = dev-stats.rx_dropped;
+   tot-rx_length_errors = dev-stats.rx_length_errors;
+   tot-rx_frame_errors = dev-stats.rx_frame_errors;
+
+   return tot;
+}
+
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void virtnet_netpoll(struct net_device *dev)
 {
@@ -650,6 +705,14 @@ static void virtnet_netpoll(struct net_d
 }
 #endif
 
+static void virtnet_free(struct net_device *dev)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+
+   free_percpu(vi-stats);
+   free_netdev(dev);
+}
+
 static int virtnet_open(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
@@ -835,6 +898,7 @@ static const struct net_device_ops virtn
.ndo_set_mac_address = virtnet_set_mac_address,
.ndo_set_rx_mode

Re: [PATCH] virtio-net: per cpu 64 bit stats (v2)

2011-06-15 Thread Stephen Hemminger
Use per-cpu variables to maintain 64 bit statistics.

Signed-off-by: Stephen Hemminger shemmin...@vyatta.com

--- a/drivers/net/virtio_net.c  2011-06-14 15:18:46.448596355 -0400
+++ b/drivers/net/virtio_net.c  2011-06-15 12:02:54.860667443 -0400
@@ -40,6 +40,15 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX2
 
+struct virtnet_stats {
+   struct u64_stats_sync syncp;
+   u64 tx_bytes;
+   u64 tx_packets;
+
+   u64 rx_bytes;
+   u64 rx_packets;
+};
+
 struct virtnet_info {
struct virtio_device *vdev;
struct virtqueue *rvq, *svq, *cvq;
@@ -56,6 +65,9 @@ struct virtnet_info {
/* Host will merge rx buffers for big packets (shake it! shake it!) */
bool mergeable_rx_bufs;
 
+   /* Active statistics */
+   struct virtnet_stats __percpu *stats;
+
/* Work struct for refilling if we run low on memory. */
struct delayed_work refill;
 
@@ -209,7 +221,6 @@ static int receive_mergeable(struct virt
skb-dev-stats.rx_length_errors++;
return -EINVAL;
}
-
page = virtqueue_get_buf(vi-rvq, len);
if (!page) {
pr_debug(%s: rx error: %d buffers missing\n,
@@ -217,6 +228,7 @@ static int receive_mergeable(struct virt
skb-dev-stats.rx_length_errors++;
return -EINVAL;
}
+
if (len  PAGE_SIZE)
len = PAGE_SIZE;
 
@@ -230,6 +242,7 @@ static int receive_mergeable(struct virt
 static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
 {
struct virtnet_info *vi = netdev_priv(dev);
+   struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats);
struct sk_buff *skb;
struct page *page;
struct skb_vnet_hdr *hdr;
@@ -265,8 +278,11 @@ static void receive_buf(struct net_devic
 
hdr = skb_vnet_hdr(skb);
skb-truesize += skb-data_len;
-   dev-stats.rx_bytes += skb-len;
-   dev-stats.rx_packets++;
+
+   u64_stats_update_begin(stats-syncp);
+   stats-rx_bytes += skb-len;
+   stats-rx_packets++;
+   u64_stats_update_end(stats-syncp);
 
if (hdr-hdr.flags  VIRTIO_NET_HDR_F_NEEDS_CSUM) {
pr_debug(Needs csum!\n);
@@ -515,11 +531,16 @@ static unsigned int free_old_xmit_skbs(s
 {
struct sk_buff *skb;
unsigned int len, tot_sgs = 0;
+   struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats);
 
while ((skb = virtqueue_get_buf(vi-svq, len)) != NULL) {
pr_debug(Sent skb %p\n, skb);
-   vi-dev-stats.tx_bytes += skb-len;
-   vi-dev-stats.tx_packets++;
+
+   u64_stats_update_begin(stats-syncp);
+   stats-tx_bytes += skb-len;
+   stats-tx_packets++;
+   u64_stats_update_end(stats-syncp);
+
tot_sgs += skb_vnet_hdr(skb)-num_sg;
dev_kfree_skb_any(skb);
}
@@ -641,6 +662,40 @@ static int virtnet_set_mac_address(struc
return 0;
 }
 
+static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
+  struct rtnl_link_stats64 *tot)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   int cpu;
+   unsigned int start;
+
+   for_each_possible_cpu(cpu) {
+   struct virtnet_stats __percpu *stats
+   = per_cpu_ptr(vi-stats, cpu);
+   u64 tpackets, tbytes, rpackets, rbytes;
+
+   do {
+   start = u64_stats_fetch_begin(stats-syncp);
+   tpackets = stats-tx_packets;
+   tbytes   = stats-tx_bytes;
+   rpackets = stats-rx_packets;
+   rbytes   = stats-rx_bytes;
+   } while (u64_stats_fetch_retry(stats-syncp, start));
+
+   tot-rx_packets += rpackets;
+   tot-tx_packets += tpackets;
+   tot-rx_bytes   += rbytes;
+   tot-tx_bytes   += tbytes;
+   }
+
+   tot-tx_dropped = dev-stats.tx_dropped;
+   tot-rx_dropped = dev-stats.rx_dropped;
+   tot-rx_length_errors = dev-stats.rx_length_errors;
+   tot-rx_frame_errors = dev-stats.rx_frame_errors;
+
+   return tot;
+}
+
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void virtnet_netpoll(struct net_device *dev)
 {
@@ -650,6 +705,14 @@ static void virtnet_netpoll(struct net_d
 }
 #endif
 
+static void virtnet_free(struct net_device *dev)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+
+   free_percpu(vi-stats);
+   free_netdev(dev);
+}
+
 static int virtnet_open(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
@@ -835,6 +898,7 @@ static const struct net_device_ops virtn
.ndo_set_mac_address = virtnet_set_mac_address,
.ndo_set_rx_mode = virtnet_set_rx_mode

Re: [PATCH] virtio-net: per cpu 64 bit stats

2011-06-17 Thread Stephen Hemminger
On Wed, 15 Jun 2011 21:33:16 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Jun 15, 2011 at 11:43:37AM -0400, Stephen Hemminger wrote:
  Use per-cpu variables to maintain 64 bit statistics.
  Compile tested only.
  
  Signed-off-by: Stephen Hemminger shemmin...@vyatta.com
 
 Interesting. Does this help speed at all?
 
  --- a/drivers/net/virtio_net.c  2011-06-14 15:18:46.448596355 -0400
  +++ b/drivers/net/virtio_net.c  2011-06-15 09:54:22.914426067 -0400
  @@ -40,6 +40,15 @@ module_param(gso, bool, 0444);
   
   #define VIRTNET_SEND_COMMAND_SG_MAX2
   
  +struct virtnet_stats {
  +   struct u64_stats_sync syncp;
  +   u64 tx_bytes;
  +   u64 tx_packets;
  +
  +   u64 rx_bytes;
  +   u64 rx_packets;
  +};
  +
   struct virtnet_info {
  struct virtio_device *vdev;
  struct virtqueue *rvq, *svq, *cvq;
  @@ -56,6 +65,9 @@ struct virtnet_info {
  /* Host will merge rx buffers for big packets (shake it! shake it!) */
  bool mergeable_rx_bufs;
   
  +   /* Active statistics */
  +   struct virtnet_stats __percpu *stats;
  +
  /* Work struct for refilling if we run low on memory. */
  struct delayed_work refill;
   
  @@ -209,7 +221,6 @@ static int receive_mergeable(struct virt
  skb-dev-stats.rx_length_errors++;
  return -EINVAL;
  }
  -
  page = virtqueue_get_buf(vi-rvq, len);
  if (!page) {
  pr_debug(%s: rx error: %d buffers missing\n,
  @@ -217,6 +228,7 @@ static int receive_mergeable(struct virt
  skb-dev-stats.rx_length_errors++;
  return -EINVAL;
  }
  +
  if (len  PAGE_SIZE)
  len = PAGE_SIZE;
 
 
 Let's not tweak whitespace unnecessarily.
   
  @@ -230,6 +242,7 @@ static int receive_mergeable(struct virt
   static void receive_buf(struct net_device *dev, void *buf, unsigned int 
  len)
   {
  struct virtnet_info *vi = netdev_priv(dev);
  +   struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats);
  struct sk_buff *skb;
  struct page *page;
  struct skb_vnet_hdr *hdr;
  @@ -265,8 +278,11 @@ static void receive_buf(struct net_devic
   
  hdr = skb_vnet_hdr(skb);
  skb-truesize += skb-data_len;
  -   dev-stats.rx_bytes += skb-len;
  -   dev-stats.rx_packets++;
  +
  +   u64_stats_update_begin(stats-syncp);
  +   stats-rx_bytes += skb-len;
  +   stats-rx_packets++;
  +   u64_stats_update_begin(stats-syncp);
   
  if (hdr-hdr.flags  VIRTIO_NET_HDR_F_NEEDS_CSUM) {
  pr_debug(Needs csum!\n);
  @@ -515,11 +531,16 @@ static unsigned int free_old_xmit_skbs(s
   {
  struct sk_buff *skb;
  unsigned int len, tot_sgs = 0;
  +   struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats);
   
  while ((skb = virtqueue_get_buf(vi-svq, len)) != NULL) {
  pr_debug(Sent skb %p\n, skb);
  -   vi-dev-stats.tx_bytes += skb-len;
  -   vi-dev-stats.tx_packets++;
  +
  +   u64_stats_update_begin(stats-syncp);
  +   stats-tx_bytes += skb-len;
  +   stats-tx_packets++;
  +   u64_stats_update_begin(stats-syncp);
  +
  tot_sgs += skb_vnet_hdr(skb)-num_sg;
  dev_kfree_skb_any(skb);
  }
  @@ -641,6 +662,40 @@ static int virtnet_set_mac_address(struc
  return 0;
   }
   
  +static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
  +  struct rtnl_link_stats64 *tot)
  +{
  +   struct virtnet_info *vi = netdev_priv(dev);
  +   int cpu;
  +   unsigned int start;
  +
  +   for_each_possible_cpu(cpu) {
  +   struct virtnet_stats __percpu *stats
  +   = per_cpu_ptr(vi-stats, cpu);
  +   u64 tpackets, tbytes, rpackets, rbytes;
  +
  +   do {
  +   start = u64_stats_fetch_begin(stats-syncp);
  +   tpackets = stats-tx_packets;
  +   tbytes   = stats-tx_bytes;
  +   rpackets = stats-rx_packets;
  +   rbytes   = stats-rx_bytes;
  +   } while (u64_stats_fetch_retry(stats-syncp, start));
  +
  +   tot-rx_packets += rpackets;
  +   tot-tx_packets += tpackets;
  +   tot-rx_bytes   += rbytes;
  +   tot-tx_bytes   += tbytes;
  +   }
  +
  +   tot-tx_dropped = dev-stats.tx_dropped;
  +   tot-rx_dropped = dev-stats.rx_dropped;
  +   tot-rx_length_errors = dev-stats.rx_length_errors;
  +   tot-rx_frame_errors = dev-stats.rx_frame_errors;
  +
  +   return tot;
  +}
  +
   #ifdef CONFIG_NET_POLL_CONTROLLER
   static void virtnet_netpoll(struct net_device *dev)
   {
  @@ -650,6 +705,14 @@ static void virtnet_netpoll(struct net_d
   }
   #endif
   
  +static void virtnet_free(struct net_device *dev)
  +{
  +   struct virtnet_info *vi = netdev_priv(dev);
  +
  +   free_percpu(vi-stats);
  +   free_netdev(dev);
  +}
  +
   static int virtnet_open(struct net_device *dev

Re: [PATCH 00/40] Staging: hv: Driver cleanup

2011-06-30 Thread Stephen Hemminger
On Fri, 1 Jul 2011 00:19:38 +
KY Srinivasan k...@microsoft.com wrote:

 
 
  -Original Message-
  From: Stephen Hemminger [mailto:shemmin...@vyatta.com]
  Sent: Thursday, June 30, 2011 7:48 PM
  To: KY Srinivasan
  Cc: Christoph Hellwig; de...@linuxdriverproject.org; gre...@suse.de; linux-
  ker...@vger.kernel.org; virtualizat...@lists.osdl.org
  Subject: Re: [PATCH 00/40] Staging: hv: Driver cleanup
  
  On Thu, 30 Jun 2011 23:32:34 +
  KY Srinivasan k...@microsoft.com wrote:
  
  
-Original Message-
From: Christoph Hellwig [mailto:h...@infradead.org]
Sent: Thursday, June 30, 2011 3:34 PM
To: KY Srinivasan
Cc: gre...@suse.de; linux-ker...@vger.kernel.org;
de...@linuxdriverproject.org; virtualizat...@lists.osdl.org
Subject: Re: [PATCH 00/40] Staging: hv: Driver cleanup
   
On Wed, Jun 29, 2011 at 07:38:21AM -0700, K. Y. Srinivasan wrote:
 Further cleanup of the hv drivers:

   1) Cleanup the reference counting mess for both stor and net 
 devices.
   
I really don't understand the need for reference counting on the storage
side, especially now that you only have a SCSI driver.  The SCSI
midlayer does proper counting on it's objects (Scsi_Host, scsi_device,
scsi_cmnd), so you'll get that for free given that SCSI drivers just
piggyback on the midlayer lifetime rules.
   
For now your patches should probably go in as-is, but mid-term you
should be able to completely remove that code on the storage side.
   
  
   Greg,
  
   I am thinking of  going back to my original implementation where I had 
   one scsi
  host
   per IDE device. This will certainly simply the code. Let me know what you 
   think.
  If you
   agree with this approach, please drop this patch-set, I will send you a 
   new set
  of patches.
  
  I think there ref counting on network devices is also unneeded
  as long as the unregister logic handles RCU correctly. The network layer
  calls the driver unregister routine after all packets are gone.
 On the networking side, what about incoming packets that may be racing
 with the device destruction. The current ref counting scheme deals with
 that case.

Not sure how HV driver tells hypervisor to stop sending packets. But the
destructor is not called until after all other CPU's are done processing
packets from that device.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-22 Thread Stephen Hemminger
I have been playing with userspace-rcu which has a number of neat
lockless routines for queuing and hashing. But there aren't kernel versions
and several of them may require cmpxchg to work.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first

2011-11-28 Thread Stephen Hemminger
On Mon, 28 Nov 2011 12:25:15 +0800
Jason Wang jasow...@redhat.com wrote:

 I'm using ixgbe for testing also, for host, its driver seems provide irq 
 affinity hint, so no binding or irqbalance is needed.

The hint is for irqbalance to use. You need to still do manual
affinity or use irqbalance.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] vhost-net: add module alias

2012-01-10 Thread Stephen Hemminger
By adding the a module alias, programs (or users) won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.
Choose one next to TUN since this driver is related to it.

Also, use C99 style initialization.

Signed-off-by: Stephen Hemminger shemmin...@vyatta.com

---
 drivers/vhost/net.c|8 +---
 include/linux/miscdevice.h |1 +
 2 files changed, 6 insertions(+), 3 deletions(-)

--- a/drivers/vhost/net.c   2012-01-10 10:56:58.883179194 -0800
+++ b/drivers/vhost/net.c   2012-01-10 19:48:23.650225892 -0800
@@ -856,9 +856,9 @@ static const struct file_operations vhos
 };
 
 static struct miscdevice vhost_net_misc = {
-   MISC_DYNAMIC_MINOR,
-   vhost-net,
-   vhost_net_fops,
+   .minor = VHOST_NET_MINOR,
+   .name = vhost-net,
+   .fops = vhost_net_fops,
 };
 
 static int vhost_net_init(void)
@@ -879,3 +879,5 @@ MODULE_VERSION(0.0.1);
 MODULE_LICENSE(GPL v2);
 MODULE_AUTHOR(Michael S. Tsirkin);
 MODULE_DESCRIPTION(Host kernel accelerator for virtio net);
+MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR);
+MODULE_ALIAS(devname:vhost-net);
--- a/include/linux/miscdevice.h2012-01-10 10:56:59.779189436 -0800
+++ b/include/linux/miscdevice.h2012-01-10 19:49:56.091748210 -0800
@@ -31,6 +31,7 @@
 #define I2O_MINOR  166
 #define MICROCODE_MINOR184
 #define TUN_MINOR  200
+#define VHOST_NET_MINOR201
 #define MWAVE_MINOR219 /* ACP/Mwave Modem */
 #define MPT_MINOR  220
 #define MPT2SAS_MINOR  221

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] vhost-net: add module alias

2012-01-11 Thread Stephen Hemminger
On Wed, 11 Jan 2012 15:43:42 +0800
Amos Kong kongjian...@gmail.com wrote:

 On Wed, Jan 11, 2012 at 12:54 PM, Stephen Hemminger
 shemmin...@vyatta.comwrote:
 
  By adding the a module alias, programs (or users) won't have to explicitly
  call modprobe. Vhost-net will always be available if built into the kernel.
  It does require assigning a permanent minor number for depmod to work.
  Choose one next to TUN since this driver is related to it.
 
  Also, use C99 style initialization.
 
  Signed-off-by: Stephen Hemminger shemmin...@vyatta.com
 
  ---
   drivers/vhost/net.c|8 +---
   include/linux/miscdevice.h |1 +
   2 files changed, 6 insertions(+), 3 deletions(-)
 
:
 /*
  *  These allocations are managed by dev...@lanana.org. If you use an
  *  entry that is not in assigned your entry may well be moved and
  *  reassigned, or set dynamic if a fixed value is not justified.
  */

Didn't that mailing address was ever used any more. Like many places
in kernel, the comment looked like a historical leftover.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] vhost-net: add module alias

2012-01-11 Thread Stephen Hemminger
On Wed, 11 Jan 2012 11:07:47 +0400
Michael Tokarev m...@tls.msk.ru wrote:

 On 11.01.2012 08:54, Stephen Hemminger wrote:
  By adding the a module alias, programs (or users) won't have to explicitly
  call modprobe. Vhost-net will always be available if built into the kernel.
  It does require assigning a permanent minor number for depmod to work.
  Choose one next to TUN since this driver is related to it.
 
 Why do you think a statically-allocated device number will do any good
 at all?  Static /dev is gone almost completely, at least on the systems
 where whole virt stuff makes any sense, so you don't have pre-created
 vhost-net device anymore, and hence this allocation makes no sense.
 Just IMHO anyway.

The statically allocated device number is required for the udev/module
autoloading to work. Probably the udev infrastructure needs a consistent
number to hang off of.

It looks like:
  * driver adds MODULE_ALIAS() for devname and character device
  * depmod scans modules and creates modules.devname (in /lib/modules)
  * udev uses modules.devname to autoload the module

$ /sbin/modinfo vhost_net
filename:   /lib/modules/3.2.0-net+/kernel/drivers/vhost/vhost_net.ko
alias:  devname:vhost-net
alias:  char-major-10-201
description:Host kernel accelerator for virtio net
...

See also: https://lkml.org/lkml/2010/5/21/134



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] vhost-net: add module alias (v2)

2012-01-11 Thread Stephen Hemminger
By adding the correct module alias, programs won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.
Choose one next to TUN since this driver is related to it.

Also, use C99 style initialization.

Signed-off-by: Stephen Hemminger shemmin...@vyatta.com

---
v2 - document minor number and make sure to not overlap

 Documentation/devices.txt  |2 ++
 drivers/vhost/net.c|8 +---
 include/linux/miscdevice.h |1 +
 3 files changed, 8 insertions(+), 3 deletions(-)

--- a/drivers/vhost/net.c   2012-01-10 10:56:58.883179194 -0800
+++ b/drivers/vhost/net.c   2012-01-10 19:48:23.650225892 -0800
@@ -856,9 +856,9 @@ static const struct file_operations vhos
 };
 
 static struct miscdevice vhost_net_misc = {
-   MISC_DYNAMIC_MINOR,
-   vhost-net,
-   vhost_net_fops,
+   .minor = VHOST_NET_MINOR,
+   .name = vhost-net,
+   .fops = vhost_net_fops,
 };
 
 static int vhost_net_init(void)
@@ -879,3 +879,5 @@ MODULE_VERSION(0.0.1);
 MODULE_LICENSE(GPL v2);
 MODULE_AUTHOR(Michael S. Tsirkin);
 MODULE_DESCRIPTION(Host kernel accelerator for virtio net);
+MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR);
+MODULE_ALIAS(devname:vhost-net);
--- a/include/linux/miscdevice.h2012-01-10 10:56:59.779189436 -0800
+++ b/include/linux/miscdevice.h2012-01-11 09:13:20.803694316 -0800
@@ -42,6 +42,7 @@
 #define AUTOFS_MINOR   235
 #define MAPPER_CTRL_MINOR  236
 #define LOOP_CTRL_MINOR237
+#define VHOST_NET_MINOR238
 #define MISC_DYNAMIC_MINOR 255
 
 struct device;
--- a/Documentation/devices.txt 2012-01-10 10:56:53.399116518 -0800
+++ b/Documentation/devices.txt 2012-01-11 09:12:49.251197653 -0800
@@ -447,6 +447,8 @@ Your cooperation is appreciated.
234 = /dev/btrfs-controlBtrfs control device
235 = /dev/autofs   Autofs control device
236 = /dev/mapper/control   Device-Mapper control device
+   237 = /dev/vhost-netHost kernel accelerator for virtio net
+
240-254 Reserved for local use
255 Reserved for MISC_DYNAMIC_MINOR
 


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] vhost-net: add module alias (v2.1)

2012-01-16 Thread Stephen Hemminger
On Mon, 16 Jan 2012 12:26:45 +
Alan Cox a...@linux.intel.com wrote:

   ACKs, NACKs?  What is happening here?
  
  I would like an Ack from Alan Cox who switched vhost-net
  to a dynamic minor in the first place, in commit
  79907d89c397b8bc2e05b347ec94e928ea919d33.
 
 Sorry dev...@lanana.org isn't yet back from the kernel hack incident.
 
 I don't read netdev so someone needs to summarise the issue and send me
 a copy of the patch to look at.
 
 Alan

Subject: vhost-net: add module alias (v2.1)

By adding some module aliases, programs (or users) won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.

Also:
  - use C99 style initialization.
  - add missing entry in documentation for loop-control

Signed-off-by: Stephen Hemminger shemmin...@vyatta.com

---
2.1 - add missing documentation for loop control as well

 Documentation/devices.txt  |3 +++
 drivers/vhost/net.c|8 +---
 include/linux/miscdevice.h |1 +
 3 files changed, 9 insertions(+), 3 deletions(-)

--- a/drivers/vhost/net.c   2012-01-12 14:14:25.681815487 -0800
+++ b/drivers/vhost/net.c   2012-01-12 18:09:56.810680816 -0800
@@ -856,9 +856,9 @@ static const struct file_operations vhos
 };
 
 static struct miscdevice vhost_net_misc = {
-   MISC_DYNAMIC_MINOR,
-   vhost-net,
-   vhost_net_fops,
+   .minor = VHOST_NET_MINOR,
+   .name = vhost-net,
+   .fops = vhost_net_fops,
 };
 
 static int vhost_net_init(void)
@@ -879,3 +879,5 @@ MODULE_VERSION(0.0.1);
 MODULE_LICENSE(GPL v2);
 MODULE_AUTHOR(Michael S. Tsirkin);
 MODULE_DESCRIPTION(Host kernel accelerator for virtio net);
+MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR);
+MODULE_ALIAS(devname:vhost-net);
--- a/include/linux/miscdevice.h2012-01-12 14:14:25.725815981 -0800
+++ b/include/linux/miscdevice.h2012-01-12 18:09:56.810680816 -0800
@@ -42,6 +42,7 @@
 #define AUTOFS_MINOR   235
 #define MAPPER_CTRL_MINOR  236
 #define LOOP_CTRL_MINOR237
+#define VHOST_NET_MINOR238
 #define MISC_DYNAMIC_MINOR 255
 
 struct device;
--- a/Documentation/devices.txt 2012-01-12 14:14:25.701815712 -0800
+++ b/Documentation/devices.txt 2012-01-12 18:09:56.814680860 -0800
@@ -447,6 +447,9 @@ Your cooperation is appreciated.
234 = /dev/btrfs-controlBtrfs control device
235 = /dev/autofs   Autofs control device
236 = /dev/mapper/control   Device-Mapper control device
+   237 = /dev/loop-control Loopback control device
+   238 = /dev/vhost-netHost kernel accelerator for virtio net
+
240-254 Reserved for local use
255 Reserved for MISC_DYNAMIC_MINOR
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [vmw_vmci RFC 01/11] Apply VMCI context code

2012-05-16 Thread Stephen Hemminger
On Tue, 15 May 2012 08:06:58 -0700
Andrew Stiegmann (stieg) astiegm...@vmware.com wrote:

 Context code maintains state for vmci and allows the driver
 to communicate with multiple VMs.
 
 Signed-off-by: Andrew Stiegmann (stieg) astiegm...@vmware.com

Running checkpatch reveals the usual noise, and the following that
should be addressed.

ERROR: do not use C99 // comments
#272: FILE: drivers/misc/vmw_vmci/vmci_context.c:183:
+static bool ctx_exists_locked(uint32_t cid)// IN

ERROR: foo * bar should be foo *bar
#304: FILE: drivers/misc/vmw_vmci/vmci_context.c:215:
+ uid_t * user, struct vmci_ctx **outContext)

I don't mind the C99 style comments, but the // IN convention
is pretty useless and should be removed.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio-net: fix a race on 32bit arches

2012-06-06 Thread Stephen Hemminger
On Wed, 6 Jun 2012 17:49:42 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 Sounds good, but I have a question: this realies on counters
 being atomic on 64 bit.
 Would not it be better to always use a seqlock even on 64 bit?
 This way counters would actually be correct and in sync.
 As it is if we want e.g. average packet size,
 we can not rely e.g. on it being bytes/packets.

This has not been a requirement on real physical devices; therefore
the added overhead is not really justified.

Many network cards use counters in hardware to count packets/bytes
and there is no expectation of atomic access there.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq

2012-07-06 Thread Stephen Hemminger
On Fri, 06 Jul 2012 11:20:06 +0800
Jason Wang jasow...@redhat.com wrote:

 On 07/05/2012 08:51 PM, Sasha Levin wrote:
  On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote:
  @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
   if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
   vi-has_cvq = true;
 
  +   /* Use single tx/rx queue pair as default */
  +   vi-num_queue_pairs = 1;
  +   vi-total_queue_pairs = num_queue_pairs;
  The code is using this default even if the amount of queue pairs it
  wants was specified during initialization. This basically limits any
  device to use 1 pair when starting up.
 
 
 Yes, currently the virtio-net driver would use 1 txq/txq by default 
 since multiqueue may not outperform in all kinds of workload. So it's 
 better to keep it as default and let user enable multiqueue by ethtool -L.
 

I would prefer that the driver sized number of queues based on number
of online CPU's. That is what real hardware does. What kind of workload
are you doing? If it is some DBMS benchmark then maybe the issue is that
some CPU's need to be reserved.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Question]About KVM network zero-copy feature!

2012-08-11 Thread Stephen Hemminger
On Fri, 10 Aug 2012 11:34:32 +0800
Peter Huang(Peng) peter.huangp...@huawei.com wrote:

 Hi,All
 
 I searched from git-log, and found that until now we have vhost TX zero-copy 
 experiment feature, how
 about RX zero-copy?
 
 For XEN, net-back also only has TX zero-copy, Is there any reason that RX 
 zero-copy still not implemented?
 
There is no guarantee that packet will ever be read by receiver. This means 
zero-copy could
create memory back pressure stalls.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/6] VSOCK for Linux upstreaming

2012-11-06 Thread Stephen Hemminger
On Mon, 05 Nov 2012 10:00:38 -0800
George Zhang georgezh...@vmware.com wrote:

 * * *
 This series of VSOCK linux upstreaming patches include latest udpate from
 VMware.
 
 Summary of changes:
   - Add include/linux/socket.h for AF_VSOCK.
 - Cleanup some comments.
 - Cleanup makefiles.
 
 
 
 * * *
 
 In an effort to improve the out-of-the-box experience with Linux
 kernels for VMware users, VMware is working on readying the Virtual
 Machine Communication Interface (vmw_vmci) and VMCI Sockets (VSOCK)
 (vmw_vsock) kernel modules for inclusion in the Linux kernel. The
 purpose of this post is to acquire feedback on the vmw_vsock kernel
 module. The vmw_vmci kernel module has been presented in an early post.
 
 
 * * *
 
 VMCI Sockets allows virtual machines to communicate with host kernel
 modules and the VMware hypervisors. VMCI Sockets kernel module has
 dependency on VMCI kernel module. User level applications both in
 a virtual machine and on the host can use vmw_vmci through VMCI
 Sockets API which facilitates fast and efficient communication
 between guest virtual machines and their host. A socket
 address family designed to be compatible with UDP and TCP at the
 interface level. Today, VMCI and VMCI Sockets are used by the VMware
 shared folders (HGFS) and various VMware Tools components inside the
 guest for zero-config, network-less access to VMware host services. In
 addition to this, VMware's users are using VMCI Sockets for various
 applications, where network access of the virtual machine is
 restricted or non-existent. Examples of this are VMs communicating
 with device proxies for proprietary hardware running as host
 applications and automated testing of applications running within
 virtual machines.
 
 The VMware VMCI Sockets are similar to other socket types, like
 Berkeley UNIX socket interface. The VMCI sockets module supports
 both connection-oriented stream sockets like TCP, and connectionless
 datagram sockets like UDP. The VSOCK protocol family is defined as
 AF_VSOCK and the socket operations split for SOCK_DGRAM and
 SOCK_STREAM.
 
 For additional information about the use of VMCI and in particular
 VMCI Sockets, please refer to the VMCI Socket Programming Guide
 available at https://www.vmware.com/support/developer/vmci-sdk/.
 

This should go to netdev as well since it is a new address family.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/6] VSOCK for Linux upstreaming

2012-11-06 Thread Stephen Hemminger
Never mind, mail server seemed to be overloaded today.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/6] VSOCK: vsock protocol implementation.

2012-11-06 Thread Stephen Hemminger
On Mon, 05 Nov 2012 10:00:51 -0800
George Zhang georgezh...@vmware.com wrote:

 + /* Added in 2.6.10. */
 + .owner = THIS_MODULE,

Thanks for submitting this, it will make life easier for distro's
that now have to go through extra effort to include out of mainline
support for Vmware.

You did some scrubbing of the macro's to support multiple kernel
versions, but there are still some leftovers.
This code seems to have a lot of this added in version xxx
type comments. These are probably not a good idea to include
in the mainline kernel code. 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent

2012-11-26 Thread Stephen Hemminger
On Mon, 26 Nov 2012 15:56:52 +0800
Jason Wang jasow...@redhat.com wrote:

 Some deivces do not free the old tx skbs immediately after it has been sent
 (usually in tx interrupt). One such example is virtio-net which optimizes for
 virt and only free the possible old tx skbs during the next packet sending. 
 This
 would lead the pktgen to wait forever in the refcount of the skb if no other
 pakcet will be sent afterwards.
 
 Solving this issue by introducing a new flag IFF_TX_SKB_FREE_DELAY which could
 notify the pktgen that the device does not free skb immediately after it has
 been sent and let it not to wait for the refcount to be one.
 
 Signed-off-by: Jason Wang jasow...@redhat.com

Another alternative would be using skb_orphan() and skb-destructor.
There are other cases where skb's are not freed right away.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent

2012-11-27 Thread Stephen Hemminger
On Tue, 27 Nov 2012 14:45:13 +0800
Jason Wang jasow...@redhat.com wrote:

 On 11/27/2012 01:37 AM, Stephen Hemminger wrote:
  On Mon, 26 Nov 2012 15:56:52 +0800
  Jason Wang jasow...@redhat.com wrote:
 
  Some deivces do not free the old tx skbs immediately after it has been sent
  (usually in tx interrupt). One such example is virtio-net which optimizes 
  for
  virt and only free the possible old tx skbs during the next packet 
  sending. This
  would lead the pktgen to wait forever in the refcount of the skb if no 
  other
  pakcet will be sent afterwards.
 
  Solving this issue by introducing a new flag IFF_TX_SKB_FREE_DELAY which 
  could
  notify the pktgen that the device does not free skb immediately after it 
  has
  been sent and let it not to wait for the refcount to be one.
 
  Signed-off-by: Jason Wang jasow...@redhat.com
  Another alternative would be using skb_orphan() and skb-destructor.
  There are other cases where skb's are not freed right away.
  --
  To unsubscribe from this list: send the line unsubscribe netdev in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Hi Stephen:
 
 Do you mean registering a skb-destructor for pktgen then set and check 
 bits in skb-tx_flag?

Yes. Register a destructor that does something like update a counter (number of 
packets pending),
then just spin while number of packets pending is over threshold.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent

2012-11-28 Thread Stephen Hemminger
On Wed, 28 Nov 2012 14:48:52 +0800
Jason Wang jasow...@redhat.com wrote:

 On 11/28/2012 12:49 AM, Stephen Hemminger wrote:
  On Tue, 27 Nov 2012 14:45:13 +0800
  Jason Wang jasow...@redhat.com wrote:
 
  On 11/27/2012 01:37 AM, Stephen Hemminger wrote:
  On Mon, 26 Nov 2012 15:56:52 +0800
  Jason Wang jasow...@redhat.com wrote:
 
  Some deivces do not free the old tx skbs immediately after it has been 
  sent
  (usually in tx interrupt). One such example is virtio-net which 
  optimizes for
  virt and only free the possible old tx skbs during the next packet 
  sending. This
  would lead the pktgen to wait forever in the refcount of the skb if no 
  other
  pakcet will be sent afterwards.
 
  Solving this issue by introducing a new flag IFF_TX_SKB_FREE_DELAY which 
  could
  notify the pktgen that the device does not free skb immediately after it 
  has
  been sent and let it not to wait for the refcount to be one.
 
  Signed-off-by: Jason Wang jasow...@redhat.com
  Another alternative would be using skb_orphan() and skb-destructor.
  There are other cases where skb's are not freed right away.
  --
  To unsubscribe from this list: send the line unsubscribe netdev in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
  Hi Stephen:
 
  Do you mean registering a skb-destructor for pktgen then set and check
  bits in skb-tx_flag?
  Yes. Register a destructor that does something like update a counter 
  (number of packets pending),
  then just spin while number of packets pending is over threshold.
  --
 
 Not sure this is the best method, since pktgen was used to test the tx 
 process of the device driver and NIC. If we use skb_orhpan(), we would 
 miss the test of tx completion part.

There are other places that delay freeing and your solution would mean finding
and fixing all those. Code that does that already has to use skb_orphan() to
work, and I was looking for a way that could use that. Introducing another flag
value seems like a long term burden.

Alternatively, virtio could do cleanup more aggressively. Maybe in response to
ring getting half full, or add a cleanup timer or something to avoid the 
problem.



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent

2012-12-03 Thread Stephen Hemminger
On Mon, 03 Dec 2012 14:45:46 +0800
Jason Wang jasow...@redhat.com wrote:

 On Tuesday, November 27, 2012 08:49:19 AM Stephen Hemminger wrote:
  On Tue, 27 Nov 2012 14:45:13 +0800
  
  Jason Wang jasow...@redhat.com wrote:
   On 11/27/2012 01:37 AM, Stephen Hemminger wrote:
On Mon, 26 Nov 2012 15:56:52 +0800

Jason Wang jasow...@redhat.com wrote:
Some deivces do not free the old tx skbs immediately after it has been
sent
(usually in tx interrupt). One such example is virtio-net which
optimizes for virt and only free the possible old tx skbs during the
next packet sending. This would lead the pktgen to wait forever in the
refcount of the skb if no other pakcet will be sent afterwards.

Solving this issue by introducing a new flag IFF_TX_SKB_FREE_DELAY
which could notify the pktgen that the device does not free skb
immediately after it has been sent and let it not to wait for the
refcount to be one.

Signed-off-by: Jason Wang jasow...@redhat.com

Another alternative would be using skb_orphan() and skb-destructor.
There are other cases where skb's are not freed right away.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
   
   Hi Stephen:
   
   Do you mean registering a skb-destructor for pktgen then set and check
   bits in skb-tx_flag?
  
  Yes. Register a destructor that does something like update a counter (number
  of packets pending), then just spin while number of packets pending is over
  threshold.
 
 Have some experiments on this, looks like it does not work weel when 
 clone_skb 
 is used. For driver that call skb_orphan() in ndo_start_xmit, the destructor 
 is only called when the first packet were sent, but what we need to know is 
 when the last were sent. Any thoughts on this or we can just introduce 
 another 
 flag (anyway we have something like IFF_TX_SKB_SHARING) ?
 

The SKB_SHARING flag looks like the best solution then.
Surprisingly, transmit buffer completion is a major bottleneck for 10G
devices, and I suspect more changes will come.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/3] virtio: change to_vp_device to an inlined definition

2012-12-05 Thread Stephen Hemminger
On Wed, 5 Dec 2012 15:03:27 +0800
Wanlong Gao gaowanl...@cn.fujitsu.com wrote:

 to_vp_device is worth changing to inlined definition.
 
 Signed-off-by: Wanlong Gao gaowanl...@cn.fujitsu.com
 ---
  drivers/virtio/virtio_pci.c | 6 +-
  1 file changed, 1 insertion(+), 5 deletions(-)
 
 diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
 index e3ecc94..7681fe3 100644
 --- a/drivers/virtio/virtio_pci.c
 +++ b/drivers/virtio/virtio_pci.c
 @@ -98,11 +98,7 @@ static struct pci_device_id virtio_pci_id_table[] = {
  
  MODULE_DEVICE_TABLE(pci, virtio_pci_id_table);
  
 -/* Convert a generic virtio device to our structure */
 -static struct virtio_pci_device *to_vp_device(struct virtio_device *vdev)
 -{
 - return container_of(vdev, struct virtio_pci_device, vdev);
 -}
 +#define to_vp_device(_vdev) container_of(_vdev, struct virtio_pci_device, 
 vdev)

Just mark the function as inline. A macro loses type checking.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v3 1/3] virtio-net: separate fields of sending/receiving queue from virtnet_info

2012-12-07 Thread Stephen Hemminger
Minor style issue reported by checkpatch which can be fixed after merge.
Although sizeof is actually an operator in C, it is considered correct
style to treat it as a function.


WARNING: sizeof hdr-hdr should be sizeof(hdr-hdr)
#293: FILE: drivers/net/virtio_net.c:395:
+   sg_set_buf(rq-sg, hdr-hdr, sizeof hdr-hdr);

WARNING: sizeof hdr-mhdr should be sizeof(hdr-mhdr)
#552: FILE: drivers/net/virtio_net.c:641:
+   sg_set_buf(sq-sg, hdr-mhdr, sizeof hdr-mhdr);

WARNING: sizeof hdr-hdr should be sizeof(hdr-hdr)
#555: FILE: drivers/net/virtio_net.c:643:
+   sg_set_buf(sq-sg, hdr-hdr, sizeof hdr-hdr);
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v3 0/3] Multiqueue support in virtio-net

2012-12-07 Thread Stephen Hemminger
On Fri, 07 Dec 2012 15:35:56 -0500 (EST)
David Miller da...@davemloft.net wrote:

 From: Jason Wang jasow...@redhat.com
 Date: Sat,  8 Dec 2012 01:04:54 +0800
 
  This series is an update version (hope the final version) of multiqueue
  (VIRTIO_NET_F_MQ) support in virtio-net driver. All previous comments were
  addressed, the work were based on Krishna Kumar's work to let virtio-net use
  multiple rx/tx queues to do the packets reception and transmission. 
  Performance
  test show the aggregate latency were increased greately but may get some
  regression in small packet transmission. Due to this, multiqueue were 
  disabled
  by default. If user want to benefit form the multiqueue, ethtool -L could be
  used to enable the feature.
  
  Please review and comments.
  
  A protype implementation of qemu-kvm support could by found in
  git://github.com/jasowang/qemu-kvm-mq.git. To start a guest with two 
  queues, you
  could specify the queues parameters to both tap and virtio-net like:
  
  ./qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,queues=2,...
  
  then enable the multiqueue through ethtool by:
  
  ethtool -L eth0 combined 2
 
 It seems like most, if not all, of the feedback given for this series
 has been addressed by Jason.
 
 Can I get some ACKs?

Other than the minor style nit in the first patch, I see no issues.
This is really needed by Virtual Routers.

Acked-by: Stephen Hemminger shemmin...@vyatta.com

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] virtio: make config_ops const

2013-02-05 Thread Stephen Hemminger
It is just a table of function pointers, make it const for cleanliness and 
security
reasons.

Signed-off-by: Stephen Hemminger step...@networkplumber.org

--- a/drivers/lguest/lguest_device.c2013-02-02 20:03:23.285506053 +1100
+++ b/drivers/lguest/lguest_device.c2013-02-02 20:06:15.253510658 +1100
@@ -396,7 +396,7 @@ static const char *lg_bus_name(struct vi
 }
 
 /* The ops structure which hooks everything together. */
-static struct virtio_config_ops lguest_config_ops = {
+static const struct virtio_config_ops lguest_config_ops = {
.get_features = lg_get_features,
.finalize_features = lg_finalize_features,
.get = lg_get,
--- a/drivers/remoteproc/remoteproc_virtio.c2013-02-02 20:03:23.289506053 
+1100
+++ b/drivers/remoteproc/remoteproc_virtio.c2013-02-02 20:06:10.097510520 
+1100
@@ -222,7 +222,7 @@ static void rproc_virtio_finalize_featur
rvdev-gfeatures = vdev-features[0];
 }
 
-static struct virtio_config_ops rproc_virtio_config_ops = {
+static const struct virtio_config_ops rproc_virtio_config_ops = {
.get_features   = rproc_virtio_get_features,
.finalize_features = rproc_virtio_finalize_features,
.find_vqs   = rproc_virtio_find_vqs,
--- a/drivers/s390/kvm/kvm_virtio.c 2013-01-17 16:06:04.691116116 +1100
+++ b/drivers/s390/kvm/kvm_virtio.c 2013-02-02 20:06:04.121510360 +1100
@@ -275,7 +275,7 @@ static const char *kvm_bus_name(struct v
 /*
  * The config ops structure as defined by virtio config
  */
-static struct virtio_config_ops kvm_vq_configspace_ops = {
+static const struct virtio_config_ops kvm_vq_configspace_ops = {
.get_features = kvm_get_features,
.finalize_features = kvm_finalize_features,
.get = kvm_get,
--- a/drivers/virtio/virtio_mmio.c  2013-01-09 04:04:12.010765226 +1100
+++ b/drivers/virtio/virtio_mmio.c  2013-02-02 20:05:50.937510007 +1100
@@ -423,7 +423,7 @@ static const char *vm_bus_name(struct vi
return vm_dev-pdev-name;
 }
 
-static struct virtio_config_ops virtio_mmio_config_ops = {
+static const struct virtio_config_ops virtio_mmio_config_ops = {
.get= vm_get,
.set= vm_set,
.get_status = vm_get_status,
--- a/drivers/virtio/virtio_pci.c   2013-01-09 04:04:12.010765226 +1100
+++ b/drivers/virtio/virtio_pci.c   2013-02-02 20:05:40.009509714 +1100
@@ -652,7 +652,7 @@ static int vp_set_vq_affinity(struct vir
return 0;
 }
 
-static struct virtio_config_ops virtio_pci_config_ops = {
+static const struct virtio_config_ops virtio_pci_config_ops = {
.get= vp_get,
.set= vp_set,
.get_status = vp_get_status,
--- a/include/linux/virtio.h2013-01-09 04:04:12.122765257 +1100
+++ b/include/linux/virtio.h2013-02-02 20:05:31.433509485 +1100
@@ -78,7 +78,7 @@ struct virtio_device {
int index;
struct device dev;
struct virtio_device_id id;
-   struct virtio_config_ops *config;
+   const struct virtio_config_ops *config;
struct list_head vqs;
/* Note that this is a Linux set_bit-style bitmap. */
unsigned long features[1];
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)

2013-10-25 Thread Stephen Hemminger
I/O from userspace  is used to implement usermode virtio driver(s).
This has been done independently by Intel, Brocade/Vyatta, and 6Wind.
Sorry, it has to stay.


On Mon, Oct 21, 2013 at 7:35 PM, Josh Triplett j...@joshtriplett.org wrote:
 On the vast majority of modern systems, no processes will use the
 userspsace I/O syscalls, iopl and ioperm.  Add a new config option,
 CONFIG_X86_IOPORT, to support configuring them out of the kernel
 entirely.  Since these syscalls only exist to support rare legacy
 userspace programs, X86_IOPORT does not depend on EXPERT, though it does
 still default to y.

 In addition to saving a significant amount of space, this also reduces
 the size of several major kernel data structures, drops a bit of code
 from several task-related hot paths, and reduces the attack surface of
 the kernel.

 All of the task-related code dealing with userspace I/O permissions now
 lives in process-io.h as several new inline functions, which become
 no-ops when CONFIG_X86_IOPORT.

 bloat-o-meter shows a net reduction of 17681 bytes on 32-bit and 9719
 bytes on 64-bit:

 32-bit bloat-o-meter:
 add/remove: 0/3 grow/shrink: 0/10 up/down: 0/-17681 (-17681)
 function old new   delta
 cpu_init 676 668  -8
 ioperm_active 18   7 -11
 init_task   12961284 -12
 exit_thread  179  91 -88
 ioperm_get   103  10 -93
 __switch_to_xtra 254 161 -93
 sys_iopl 127   --127
 SyS_iopl 127   --127
 copy_thread  606 446-160
 vt_ioctl41273919-208
 sys_ioperm   370   --370
 init_tss8576 384   -8192
 doublefault_tss 8576 384   -8192

 64-bit bloat-o-meter:
 add/remove: 0/4 grow/shrink: 2/9 up/down: 45/-9764 (-9719)
 function old new   delta
 cpu_init 958 995 +37
 arch_align_stack  78  86  +8
 perf_event_exit_task 525 517  -8
 ioperm_active 17   8  -9
 init_task   19681944 -24
 stub_iopl 81   - -81
 ioperm_get   111  11-100
 __switch_to_xtra 281 164-117
 exit_thread  212  92-120
 vt_ioctl44324304-128
 sys_iopl 137   --137
 SyS_iopl 137   --137
 copy_thread  694 520-174
 sys_ioperm   473   --473
 init_tss8896 640   -8256

 Signed-off-by: Josh Triplett j...@joshtriplett.org
 ---
  arch/x86/Kconfig  | 10 +
  arch/x86/include/asm/paravirt.h   |  2 +
  arch/x86/include/asm/paravirt_types.h |  2 +
  arch/x86/include/asm/processor.h  | 51 +
  arch/x86/include/asm/syscalls.h   |  3 ++
  arch/x86/kernel/Makefile  |  3 +-
  arch/x86/kernel/cpu/common.c  | 12 +-
  arch/x86/kernel/entry_64.S|  9 +++--
  arch/x86/kernel/paravirt.c|  2 +
  arch/x86/kernel/process-io.h  | 71 
 +++
  arch/x86/kernel/process.c | 34 ++---
  arch/x86/kernel/process_32.c  | 11 +-
  arch/x86/kernel/ptrace.c  |  8 
  arch/x86/xen/enlighten.c  |  4 ++
  drivers/tty/vt/vt_ioctl.c |  2 +-
  kernel/sys_ni.c   |  5 +++
  16 files changed, 168 insertions(+), 61 deletions(-)

 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index e241a19..d5b1e68 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -976,6 +976,16 @@ config VM86
   XFree86 to initialize some video cards via BIOS. Disabling this
   option saves about 6k.

 +config X86_IOPORT
 +   bool iopl and ioperm system calls
 +   default y
 +   ---help---
 + This option enables the iopl and ioperm system calls, which allow
 + privileged userspace processes to directly access I/O ports. This
 + is used by some legacy software to drive hardware directly from
 + userspace rather than via a proper kernel driver. Unless you intend
 + to run such software, you can safely say N here.
 +
  config TOSHIBA
 tristate 

[PATCH net-next 3/3] virtio_net: spelling fixes

2013-12-09 Thread Stephen Hemminger


Signed-off-by: Stephen Hemminger step...@networkplumber.org


--- a/drivers/net/virtio_net.c  2013-12-09 16:12:41.409051865 -0800
+++ b/drivers/net/virtio_net.c  2013-12-09 16:12:43.872996856 -0800
@@ -873,7 +873,7 @@ static netdev_tx_t start_xmit(struct sk_
 /*
  * Send command via the control virtqueue and check status.  Commands
  * supported by the hypervisor, as indicated by feature bits, should
- * never fail unless improperly formated.
+ * never fail unless improperly formatted.
  */
 static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 struct scatterlist *out)
@@ -1061,7 +1061,7 @@ static void virtnet_set_rx_mode(struct n
void *buf;
int i;
 
-   /* We can't dynamicaly set ndo_set_rx_mode, so return gracefully */
+   /* We can't dynamically set ndo_set_rx_mode, so return gracefully */
if (!virtio_has_feature(vi-vdev, VIRTIO_NET_F_CTRL_RX))
return;
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH net-next 1/3] virtio_net: set multicast filter list to host

2013-12-09 Thread Stephen Hemminger
The virtio_net driver never sends the multicast address list to
the host. This is because send command takes a pointer to scatter list
to send but only inserts that one entry into the outgoing scatter list.

This bug has been there since:
commit f565a7c259d71cc186753653d978c646d2354b36
Author: Alex Williamson alex.william...@hp.com
Date:   Wed Feb 4 09:02:45 2009 +

virtio_net: Add a MAC filter table

Signed-off-by: Stephen Hemminger step...@networkplumber.org

--- a/drivers/net/virtio_net.c  2013-12-09 16:12:03.897891975 -0800
+++ b/drivers/net/virtio_net.c  2013-12-09 16:12:36.353164803 -0800
@@ -893,7 +893,7 @@ static bool virtnet_send_command(struct
sg_init_one(hdr, ctrl, sizeof(ctrl));
sgs[out_num++] = hdr;
 
-   if (out)
+   for (; out; out = sg_next(out))
sgs[out_num++] = out;
if (in)
sgs[out_num + in_num++] = in;
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH net-next 2/3] virtio_net: remove unused parameter to send_command

2013-12-09 Thread Stephen Hemminger
All the code passes NULL for the last sg list (in).
Simplify by just removing it.

Signed-off-by: Stephen Hemminger step...@networkplumber.org


--- a/drivers/net/virtio_net.c  2013-12-09 16:12:36.353164803 -0800
+++ b/drivers/net/virtio_net.c  2013-12-09 16:12:41.409051865 -0800
@@ -876,13 +876,12 @@ static netdev_tx_t start_xmit(struct sk_
  * never fail unless improperly formated.
  */
 static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
-struct scatterlist *out,
-struct scatterlist *in)
+struct scatterlist *out)
 {
struct scatterlist *sgs[4], hdr, stat;
struct virtio_net_ctrl_hdr ctrl;
virtio_net_ctrl_ack status = ~0;
-   unsigned out_num = 0, in_num = 0, tmp;
+   unsigned out_num = 0, tmp;
 
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi-vdev, VIRTIO_NET_F_CTRL_VQ));
@@ -895,16 +894,13 @@ static bool virtnet_send_command(struct
 
for (; out; out = sg_next(out))
sgs[out_num++] = out;
-   if (in)
-   sgs[out_num + in_num++] = in;
 
/* Add return status. */
sg_init_one(stat, status, sizeof(status));
-   sgs[out_num + in_num++] = stat;
+   sgs[out_num] = stat;
 
-   BUG_ON(out_num + in_num  ARRAY_SIZE(sgs));
-   BUG_ON(virtqueue_add_sgs(vi-cvq, sgs, out_num, in_num, vi, GFP_ATOMIC)
-   0);
+   BUG_ON(out_num + 1  ARRAY_SIZE(sgs));
+   BUG_ON(virtqueue_add_sgs(vi-cvq, sgs, out_num, 1, vi, GFP_ATOMIC)  0);
 
if (unlikely(!virtqueue_kick(vi-cvq)))
return status == VIRTIO_NET_OK;
@@ -934,8 +930,7 @@ static int virtnet_set_mac_address(struc
if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_MAC_ADDR)) {
sg_init_one(sg, addr-sa_data, dev-addr_len);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC,
- VIRTIO_NET_CTRL_MAC_ADDR_SET,
- sg, NULL)) {
+ VIRTIO_NET_CTRL_MAC_ADDR_SET, sg)) {
dev_warn(vdev-dev,
 Failed to set mac address by vq command.\n);
return -EINVAL;
@@ -1008,7 +1003,7 @@ static void virtnet_ack_link_announce(st
 {
rtnl_lock();
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_ANNOUNCE,
- VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL, NULL))
+ VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL))
dev_warn(vi-dev-dev, Failed to ack link announce.\n);
rtnl_unlock();
 }
@@ -1026,7 +1021,7 @@ static int virtnet_set_queues(struct vir
sg_init_one(sg, s, sizeof(s));
 
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ,
- VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, sg, NULL)) {
+ VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, sg)) {
dev_warn(dev-dev, Fail to set num of queue pairs to %d\n,
 queue_pairs);
return -EINVAL;
@@ -1076,16 +1071,14 @@ static void virtnet_set_rx_mode(struct n
sg_init_one(sg, promisc, sizeof(promisc));
 
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
- VIRTIO_NET_CTRL_RX_PROMISC,
- sg, NULL))
+ VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(dev-dev, Failed to %sable promisc mode.\n,
 promisc ? en : dis);
 
sg_init_one(sg, allmulti, sizeof(allmulti));
 
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
- VIRTIO_NET_CTRL_RX_ALLMULTI,
- sg, NULL))
+ VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(dev-dev, Failed to %sable allmulti mode.\n,
 allmulti ? en : dis);
 
@@ -1121,8 +1114,7 @@ static void virtnet_set_rx_mode(struct n
   sizeof(mac_data-entries) + (mc_count * ETH_ALEN));
 
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC,
- VIRTIO_NET_CTRL_MAC_TABLE_SET,
- sg, NULL))
+ VIRTIO_NET_CTRL_MAC_TABLE_SET, sg))
dev_warn(dev-dev, Failed to set MAC filter table.\n);
 
kfree(buf);
@@ -1137,7 +1129,7 @@ static int virtnet_vlan_rx_add_vid(struc
sg_init_one(sg, vid, sizeof(vid));
 
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
- VIRTIO_NET_CTRL_VLAN_ADD, sg, NULL))
+ VIRTIO_NET_CTRL_VLAN_ADD, sg))
dev_warn(dev-dev, Failed to add VLAN ID %d.\n, vid);
return 0;
 }
@@ -1151,7 +1143,7 @@ static int virtnet_vlan_rx_kill_vid(stru
sg_init_one(sg

[PATCH net-next] virtio: change comment in transmit

2015-03-24 Thread Stephen Hemminger
The original comment was not really informative or funny
as well as sexist. Replace it with a better explanation of
why the driver does stop and what the impacts are.

Signed-off-by: Stephen Hemminger step...@networkplumber.org

--- a/drivers/net/virtio_net.c  2015-03-24 15:20:25.174671000 -0700
+++ b/drivers/net/virtio_net.c  2015-03-24 16:17:28.478525333 -0700
@@ -939,8 +939,12 @@ static netdev_tx_t start_xmit(struct sk_
skb_orphan(skb);
nf_reset(skb);
 
-   /* Apparently nice girls don't return TX_BUSY; stop the queue
-* before it gets out of hand.  Naturally, this wastes entries. */
+   /* It is better to stop queue if running out of space
+* instead of forcing queuing layer to requeue the skb
+* by returning TX_BUSY (and cause a BUG message).
+* Since most packets only take 1 or 2 ring slots
+* this means 16 slots are typically wasted.
+*/
if (sq-vq-num_free  2+MAX_SKB_FRAGS) {
netif_stop_subqueue(dev, qnum);
if (unlikely(!virtqueue_enable_cb_delayed(sq-vq))) {
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC V7 PATCH 7/7] vhost_net: add interrupt coalescing support

2015-05-26 Thread Stephen Hemminger
On Mon, 25 May 2015 01:24:04 -0400
Jason Wang jasow...@redhat.com wrote:

 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/vhost/net.c | 8 
  1 file changed, 8 insertions(+)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 7d137a4..5ee28b7 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -320,6 +320,9 @@ static void handle_tx(struct vhost_net *net)
   hdr_size = nvq-vhost_hlen;
   zcopy = nvq-ubufs;
  
 + /* Finish pending interrupts first */
 + vhost_check_coalesce_and_signal(vq-dev, vq, false);
 +
   for (;;) {
   /* Release DMAs done buffers first */
   if (zcopy)
 @@ -415,6 +418,7 @@ static void handle_tx(struct vhost_net *net)
   }
   }
  out:
 + vhost_check_coalesce_and_signal(vq-dev, vq, true);
   mutex_unlock(vq-mutex);
  }
  
 @@ -554,6 +558,9 @@ static void handle_rx(struct vhost_net *net)
   vq-log : NULL;
   mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
  
 + /* Finish pending interrupts first */
 + vhost_check_coalesce_and_signal(vq-dev, vq, false);
 +
   while ((sock_len = peek_head_len(sock-sk))) {
   sock_len += sock_hlen;
   vhost_len = sock_len + vhost_hlen;
 @@ -638,6 +645,7 @@ static void handle_rx(struct vhost_net *net)
   }
   }
  out:
 + vhost_check_coalesce_and_signal(vq-dev, vq, true);
   mutex_unlock(vq-mutex);
  }
  

Could you implement ethtool control of these coalescing parameters?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next] virtio_net: add ethtool support for set and get of settings

2016-02-03 Thread Stephen Hemminger
On Tue,  2 Feb 2016 13:51:20 +0100
Nikolay Aleksandrov  wrote:

> +static bool virtnet_validate_speed(u32 speed)
> +{
> + switch (speed) {
> + case SPEED_10:
> + case SPEED_100:
> + case SPEED_1000:
> + case SPEED_2500:
> + case SPEED_5000:
> + case SPEED_1:
> + case SPEED_2:
> + case SPEED_25000:
> + case SPEED_4:
> + case SPEED_5:
> + case SPEED_56000:
> + case SPEED_10:
> + case SPEED_UNKNOWN:
> + return true;
> + }
> +
> + return false;
> +}

Why limit to only known values. This switch() will get out of
date when some vendor introduces 64G or some other weird value.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC v2 -next 1/2] virtio: Start feature MTU support

2016-03-18 Thread Stephen Hemminger
On Tue, 15 Mar 2016 17:04:12 -0400
Aaron Conole  wrote:

> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -55,6 +55,7 @@
>  #define VIRTIO_NET_F_MQ  22  /* Device supports Receive Flow
>* Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23/* Set MAC address */
> +#define VIRTIO_NET_F_MTU 25  /* Device supports Default MTU Negotiation */
>  
>  #ifndef VIRTIO_NET_NO_LEGACY
>  #define VIRTIO_NET_F_GSO 6   /* Host handles pkts w/ any GSO type */
> @@ -73,6 +74,8 @@ struct virtio_net_config {
>* Legal values are between 1 and 0x8000
>*/
>   __u16 max_virtqueue_pairs;
> + /* Default maximum transmit unit advice */
> + __u16 mtu;
>  } __attribute__((packed));
>  
>  /*

You can't change user visible headers without breaking ABI.
This structure might be used by other user code. Also how can this
work if host is using old size of structure.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC v2 -next 1/2] virtio: Start feature MTU support

2016-03-19 Thread Stephen Hemminger
On Thu, 17 Mar 2016 17:10:55 -0400
Aaron Conole <acon...@redhat.com> wrote:

> Stephen Hemminger <step...@networkplumber.org> writes:
> 
> > On Tue, 15 Mar 2016 17:04:12 -0400
> > Aaron Conole <acon...@redhat.com> wrote:
> >
> >> --- a/include/uapi/linux/virtio_net.h
> >> +++ b/include/uapi/linux/virtio_net.h
> >> @@ -55,6 +55,7 @@
> >>  #define VIRTIO_NET_F_MQ   22  /* Device supports Receive Flow
> >> * Steering */
> >>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
> >> +#define VIRTIO_NET_F_MTU 25   /* Device supports Default MTU 
> >> Negotiation */
> >>  
> >>  #ifndef VIRTIO_NET_NO_LEGACY
> >>  #define VIRTIO_NET_F_GSO  6   /* Host handles pkts w/ any GSO type */
> >> @@ -73,6 +74,8 @@ struct virtio_net_config {
> >> * Legal values are between 1 and 0x8000
> >> */
> >>__u16 max_virtqueue_pairs;
> >> +  /* Default maximum transmit unit advice */
> >> +  __u16 mtu;
> >>  } __attribute__((packed));
> >>  
> >>  /*
> >
> > You can't change user visible headers without breaking ABI.
> > This structure might be used by other user code. Also how can this
> > work if host is using old size of structure.
> 
> How else can this field be added and remain compliant with the spec? The
> spec requires that mtu be passed in the virtio_net_config field.
> 
> As for old sizeof, I think the absence of the VIRTIO_NET_F_MTU bit being
> asserted is confirmation that mtu is not valid (at least, it is implied
> in the spec).

Michael is right as long as the code checks for MTU flag before
referencing the mtu field, everything is fine.  Actually, the structure
is never used directly only by fetching fields with offsetof

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 08/14] vmbus: put related per-cpu variable together

2017-02-01 Thread Stephen Hemminger
The hv_context structure had several arrays which were per-cpu
and was allocating small structures (tasklet_struct). Instead use
a single per-cpu array.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/hv/channel_mgmt.c |  35 -
 drivers/hv/connection.c   |  20 ---
 drivers/hv/hv.c   | 130 --
 drivers/hv/hyperv_vmbus.h |  53 +++
 drivers/hv/vmbus_drv.c|  39 --
 5 files changed, 143 insertions(+), 134 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index de90a9900fee..579ad2560a39 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -353,9 +353,10 @@ static void free_channel(struct vmbus_channel *channel)
 static void percpu_channel_enq(void *arg)
 {
struct vmbus_channel *channel = arg;
-   int cpu = smp_processor_id();
+   struct hv_per_cpu_context *hv_cpu
+   = this_cpu_ptr(hv_context.cpu_context);
 
-   list_add_tail(>percpu_list, _context.percpu_list[cpu]);
+   list_add_tail(>percpu_list, _cpu->chan_list);
 }
 
 static void percpu_channel_deq(void *arg)
@@ -379,19 +380,21 @@ static void vmbus_release_relid(u32 relid)
 
 void hv_event_tasklet_disable(struct vmbus_channel *channel)
 {
-   struct tasklet_struct *tasklet;
-   tasklet = hv_context.event_dpc[channel->target_cpu];
-   tasklet_disable(tasklet);
+   struct hv_per_cpu_context *hv_cpu;
+
+   hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu);
+   tasklet_disable(_cpu->event_dpc);
 }
 
 void hv_event_tasklet_enable(struct vmbus_channel *channel)
 {
-   struct tasklet_struct *tasklet;
-   tasklet = hv_context.event_dpc[channel->target_cpu];
-   tasklet_enable(tasklet);
+   struct hv_per_cpu_context *hv_cpu;
+
+   hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu);
+   tasklet_enable(_cpu->event_dpc);
 
/* In case there is any pending event */
-   tasklet_schedule(tasklet);
+   tasklet_schedule(_cpu->event_dpc);
 }
 
 void hv_process_channel_removal(struct vmbus_channel *channel, u32 relid)
@@ -726,9 +729,12 @@ static void vmbus_wait_for_unload(void)
break;
 
for_each_online_cpu(cpu) {
-   page_addr = hv_context.synic_message_page[cpu];
-   msg = (struct hv_message *)page_addr +
-   VMBUS_MESSAGE_SINT;
+   struct hv_per_cpu_context *hv_cpu
+   = per_cpu_ptr(hv_context.cpu_context, cpu);
+
+   page_addr = hv_cpu->synic_message_page;
+   msg = (struct hv_message *)page_addr
+   + VMBUS_MESSAGE_SINT;
 
message_type = READ_ONCE(msg->header.message_type);
if (message_type == HVMSG_NONE)
@@ -752,7 +758,10 @@ static void vmbus_wait_for_unload(void)
 * messages after we reconnect.
 */
for_each_online_cpu(cpu) {
-   page_addr = hv_context.synic_message_page[cpu];
+   struct hv_per_cpu_context *hv_cpu
+   = per_cpu_ptr(hv_context.cpu_context, cpu);
+
+   page_addr = hv_cpu->synic_message_page;
msg = (struct hv_message *)page_addr + VMBUS_MESSAGE_SINT;
msg->header.message_type = HVMSG_NONE;
}
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 1766ef03e78d..158f12823baf 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -93,12 +93,10 @@ static int vmbus_negotiate_version(struct 
vmbus_channel_msginfo *msginfo,
 * all the CPUs. This is needed for kexec to work correctly where
 * the CPU attempting to connect may not be CPU 0.
 */
-   if (version >= VERSION_WIN8_1) {
-   msg->target_vcpu = hv_context.vp_index[get_cpu()];
-   put_cpu();
-   } else {
+   if (version >= VERSION_WIN8_1)
+   msg->target_vcpu = hv_context.vp_index[smp_processor_id()];
+   else
msg->target_vcpu = 0;
-   }
 
/*
 * Add to list before we send the request since we may
@@ -269,12 +267,12 @@ void vmbus_disconnect(void)
  */
 static struct vmbus_channel *pcpu_relid2channel(u32 relid)
 {
+   struct hv_per_cpu_context *hv_cpu
+   = this_cpu_ptr(hv_context.cpu_context);
+   struct vmbus_channel *found_channel = NULL;
struct vmbus_channel *channel;
-   struct vmbus_channel *found_channel  = NULL;
-   int cpu = smp_processor_id();
-   struct list_head *pcpu_head = _context.percpu_list[cpu];
 
-   list_for_each_entry(channel, pcpu_head, percpu_list) {
+   list_for_each_entry(channel, _cpu->chan_list, percpu_list) {
if (channel->o

[PATCH 07/14] vmbus: callback is in softirq not workqueue

2017-02-01 Thread Stephen Hemminger
The callback is done via tasklet not workqueue.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 include/linux/hyperv.h | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 39d493ce550d..b30808f740f9 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -32,7 +32,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -729,9 +728,7 @@ struct vmbus_channel {
 
struct vmbus_close_msg close_msg;
 
-   /* Channel callback are invoked in this workqueue context */
-   /* HANDLE dataWorkQueue; */
-
+   /* Channel callback's invoked in softirq context */
void (*onchannel_callback)(void *context);
void *channel_callback_context;
 
-- 
2.11.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 09/14] vmbus: change to per channel tasklet

2017-02-01 Thread Stephen Hemminger
Make the event handling tasklet per channel rather than per-cpu.
This allows for better fairness when getting lots of data on the same
cpu.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/hv/channel.c  |  2 +-
 drivers/hv/channel_mgmt.c | 16 +-
 drivers/hv/connection.c   | 78 ++-
 drivers/hv/hv.c   |  2 --
 drivers/hv/hyperv_vmbus.h |  1 -
 drivers/hv/vmbus_drv.c| 58 ++-
 include/linux/hyperv.h|  3 +-
 7 files changed, 64 insertions(+), 96 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 789c75f6df26..18cc1c78260d 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -530,7 +530,7 @@ static int vmbus_close_internal(struct vmbus_channel 
*channel)
int ret;
 
/*
-* process_chn_event(), running in the tasklet, can race
+* vmbus_on_event(), running in the tasklet, can race
 * with vmbus_close_internal() in the case of SMP guest, e.g., when
 * the former is accessing channel->inbound.ring_buffer, the latter
 * could be freeing the ring_buffer pages.
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 579ad2560a39..2f6270d76b79 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -339,6 +339,9 @@ static struct vmbus_channel *alloc_channel(void)
INIT_LIST_HEAD(>sc_list);
INIT_LIST_HEAD(>percpu_list);
 
+   tasklet_init(>callback_event,
+vmbus_on_event, (unsigned long)channel);
+
return channel;
 }
 
@@ -347,6 +350,7 @@ static struct vmbus_channel *alloc_channel(void)
  */
 static void free_channel(struct vmbus_channel *channel)
 {
+   tasklet_kill(>callback_event);
kfree(channel);
 }
 
@@ -380,21 +384,15 @@ static void vmbus_release_relid(u32 relid)
 
 void hv_event_tasklet_disable(struct vmbus_channel *channel)
 {
-   struct hv_per_cpu_context *hv_cpu;
-
-   hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu);
-   tasklet_disable(_cpu->event_dpc);
+   tasklet_disable(>callback_event);
 }
 
 void hv_event_tasklet_enable(struct vmbus_channel *channel)
 {
-   struct hv_per_cpu_context *hv_cpu;
-
-   hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu);
-   tasklet_enable(_cpu->event_dpc);
+   tasklet_enable(>callback_event);
 
/* In case there is any pending event */
-   tasklet_schedule(_cpu->event_dpc);
+   tasklet_schedule(>callback_event);
 }
 
 void hv_process_channel_removal(struct vmbus_channel *channel, u32 relid)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 158f12823baf..27e72dc07e12 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -260,29 +260,6 @@ void vmbus_disconnect(void)
 }
 
 /*
- * Map the given relid to the corresponding channel based on the
- * per-cpu list of channels that have been affinitized to this CPU.
- * This will be used in the channel callback path as we can do this
- * mapping in a lock-free fashion.
- */
-static struct vmbus_channel *pcpu_relid2channel(u32 relid)
-{
-   struct hv_per_cpu_context *hv_cpu
-   = this_cpu_ptr(hv_context.cpu_context);
-   struct vmbus_channel *found_channel = NULL;
-   struct vmbus_channel *channel;
-
-   list_for_each_entry(channel, _cpu->chan_list, percpu_list) {
-   if (channel->offermsg.child_relid == relid) {
-   found_channel = channel;
-   break;
-   }
-   }
-
-   return found_channel;
-}
-
-/*
  * relid2channel - Get the channel object given its
  * child relative id (ie channel id)
  */
@@ -318,25 +295,16 @@ struct vmbus_channel *relid2channel(u32 relid)
 }
 
 /*
- * process_chn_event - Process a channel event notification
+ * vmbus_on_event - Process a channel event notification
  */
-static void process_chn_event(u32 relid)
+void vmbus_on_event(unsigned long data)
 {
-   struct vmbus_channel *channel;
+   struct vmbus_channel *channel = (void *) data;
void *arg;
bool read_state;
u32 bytes_to_read;
 
/*
-* Find the channel based on this relid and invokes the
-* channel callback to process the event
-*/
-   channel = pcpu_relid2channel(relid);
-
-   if (!channel)
-   return;
-
-   /*
 * A channel once created is persistent even when there
 * is no driver handling the device. An unloading driver
 * sets the onchannel_callback to NULL on the same CPU
@@ -344,7 +312,6 @@ static void process_chn_event(u32 relid)
 * Thus, checking and invoking the driver specific callback takes
 * care of orderly unloading of the driver.
 */
-
if (channel->onchannel_callback != NULL) {
arg = channel->chann

[PATCH 13/14] vmbus: constify parameters where possible

2017-02-01 Thread Stephen Hemminger
Functions that just query state of ring buffer can have parameters
marked const.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/hv/hyperv_vmbus.h |  6 +++---
 drivers/hv/ring_buffer.c  | 22 ++
 include/linux/hyperv.h| 12 ++--
 3 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index e15a130de3c9..884f83bba1ab 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -283,14 +283,14 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info 
*ring_info,
 void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info);
 
 int hv_ringbuffer_write(struct vmbus_channel *channel,
-   struct kvec *kv_list, u32 kv_count);
+   const struct kvec *kv_list, u32 kv_count);
 
 int hv_ringbuffer_read(struct vmbus_channel *channel,
   void *buffer, u32 buflen, u32 *buffer_actual_len,
   u64 *requestid, bool raw);
 
-void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
-   struct hv_ring_buffer_debug_info *debug_info);
+void hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info,
+struct hv_ring_buffer_debug_info *debug_info);
 
 /*
  * Maximum channels is determined by the size of the interrupt page
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 47ab69089115..ee3e488d9dee 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -96,11 +96,9 @@ hv_set_next_write_location(struct hv_ring_buffer_info 
*ring_info,
 
 /* Get the next read location for the specified ring buffer. */
 static inline u32
-hv_get_next_read_location(struct hv_ring_buffer_info *ring_info)
+hv_get_next_read_location(const struct hv_ring_buffer_info *ring_info)
 {
-   u32 next = ring_info->ring_buffer->read_index;
-
-   return next;
+   return ring_info->ring_buffer->read_index;
 }
 
 /*
@@ -108,8 +106,8 @@ hv_get_next_read_location(struct hv_ring_buffer_info 
*ring_info)
  * This allows the caller to skip.
  */
 static inline u32
-hv_get_next_readlocation_withoffset(struct hv_ring_buffer_info *ring_info,
-u32 offset)
+hv_get_next_readlocation_withoffset(const struct hv_ring_buffer_info 
*ring_info,
+   u32 offset)
 {
u32 next = ring_info->ring_buffer->read_index;
 
@@ -130,7 +128,7 @@ hv_set_next_read_location(struct hv_ring_buffer_info 
*ring_info,
 
 /* Get the size of the ring buffer. */
 static inline u32
-hv_get_ring_buffersize(struct hv_ring_buffer_info *ring_info)
+hv_get_ring_buffersize(const struct hv_ring_buffer_info *ring_info)
 {
return ring_info->ring_datasize;
 }
@@ -147,7 +145,7 @@ hv_get_ring_bufferindices(struct hv_ring_buffer_info 
*ring_info)
  * Assume there is enough room. Handles wrap-around in src case only!!
  */
 static u32 hv_copyfrom_ringbuffer(
-   struct hv_ring_buffer_info  *ring_info,
+   const struct hv_ring_buffer_info *ring_info,
void*dest,
u32 destlen,
u32 start_read_offset)
@@ -171,7 +169,7 @@ static u32 hv_copyfrom_ringbuffer(
 static u32 hv_copyto_ringbuffer(
struct hv_ring_buffer_info  *ring_info,
u32 start_write_offset,
-   void*src,
+   const void  *src,
u32 srclen)
 {
void *ring_buffer = hv_get_ring_buffer(ring_info);
@@ -186,8 +184,8 @@ static u32 hv_copyto_ringbuffer(
 }
 
 /* Get various debug metrics for the specified ring buffer. */
-void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
-   struct hv_ring_buffer_debug_info *debug_info)
+void hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info,
+struct hv_ring_buffer_debug_info *debug_info)
 {
u32 bytes_avail_towrite;
u32 bytes_avail_toread;
@@ -264,7 +262,7 @@ void hv_ringbuffer_cleanup(struct hv_ring_buffer_info 
*ring_info)
 
 /* Write to the ring buffer. */
 int hv_ringbuffer_write(struct vmbus_channel *channel,
-   struct kvec *kv_list, u32 kv_count)
+   const struct kvec *kv_list, u32 kv_count)
 {
int i = 0;
u32 bytes_avail_towrite;
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dc50997a3fba..32a9cbc66b65 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -137,8 +137,8 @@ struct hv_ring_buffer_info {
  * for the specified ring buffer
  */
 static inline void
-hv_get_ringbuffer_availbytes(struct hv_ring_buffer_info *rbi,
- u32 *read, u32 *write)
+hv_get_ringbuffer_availbytes(const stru

[PATCH 14/14] vmbus: replace modulus operation with subtraction

2017-02-01 Thread Stephen Hemminger
Takes less clock cycles to check for ring wrap and subtract than to
do a modulus instruction.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/hv/ring_buffer.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index ee3e488d9dee..8ab6298fd5ae 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -112,7 +112,8 @@ hv_get_next_readlocation_withoffset(const struct 
hv_ring_buffer_info *ring_info,
u32 next = ring_info->ring_buffer->read_index;
 
next += offset;
-   next %= ring_info->ring_datasize;
+   if (next >= ring_info->ring_datasize)
+   next -= ring_info->ring_datasize;
 
return next;
 }
@@ -156,7 +157,8 @@ static u32 hv_copyfrom_ringbuffer(
memcpy(dest, ring_buffer + start_read_offset, destlen);
 
start_read_offset += destlen;
-   start_read_offset %= ring_buffer_size;
+   if (start_read_offset >= ring_buffer_size)
+   start_read_offset -= ring_buffer_size;
 
return start_read_offset;
 }
@@ -178,7 +180,8 @@ static u32 hv_copyto_ringbuffer(
memcpy(ring_buffer + start_write_offset, src, srclen);
 
start_write_offset += srclen;
-   start_write_offset %= ring_buffer_size;
+   if (start_write_offset >= ring_buffer_size)
+   start_write_offset -= ring_buffer_size;
 
return start_write_offset;
 }
-- 
2.11.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 12/14] vmbus: expose hv_begin/end_read

2017-02-01 Thread Stephen Hemminger
In order to implement NAPI in netvsc, the driver needs access to
control host interrupt mask.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/hv/hyperv_vmbus.h |  4 
 drivers/hv/ring_buffer.c  | 20 
 include/linux/hyperv.h| 30 ++
 3 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 6a9b54677218..e15a130de3c9 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -292,10 +292,6 @@ int hv_ringbuffer_read(struct vmbus_channel *channel,
 void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info,
struct hv_ring_buffer_debug_info *debug_info);
 
-void hv_begin_read(struct hv_ring_buffer_info *rbi);
-
-u32 hv_end_read(struct hv_ring_buffer_info *rbi);
-
 /*
  * Maximum channels is determined by the size of the interrupt page
  * which is PAGE_SIZE. 1/2 of PAGE_SIZE is for send endpoint interrupt
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 146fd8ab2a2a..47ab69089115 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -32,26 +32,6 @@
 
 #include "hyperv_vmbus.h"
 
-void hv_begin_read(struct hv_ring_buffer_info *rbi)
-{
-   rbi->ring_buffer->interrupt_mask = 1;
-   virt_mb();
-}
-
-u32 hv_end_read(struct hv_ring_buffer_info *rbi)
-{
-
-   rbi->ring_buffer->interrupt_mask = 0;
-   virt_mb();
-
-   /*
-* Now check to see if the ring buffer is still empty.
-* If it is not, we raced and we need to process new
-* incoming messages.
-*/
-   return hv_get_bytes_to_read(rbi);
-}
-
 /*
  * When we write to the ring buffer, check if the host needs to
  * be signaled. Here is the details of this protocol:
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 9b0165b11c5c..dc50997a3fba 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1473,6 +1473,36 @@ static inline  void hv_signal_on_read(struct 
vmbus_channel *channel)
 }
 
 /*
+ * Mask off host interrupt callback notifications
+ */
+static inline void hv_begin_read(struct hv_ring_buffer_info *rbi)
+{
+   rbi->ring_buffer->interrupt_mask = 1;
+
+   /* make sure mask update is not reordered */
+   virt_mb();
+}
+
+/*
+ * Re-enable host callback and return number of outstanding bytes
+ */
+static inline u32 hv_end_read(struct hv_ring_buffer_info *rbi)
+{
+
+   rbi->ring_buffer->interrupt_mask = 0;
+
+   /* make sure mask update is not reordered */
+   virt_mb();
+
+   /*
+* Now check to see if the ring buffer is still empty.
+* If it is not, we raced and we need to process new
+* incoming messages.
+*/
+   return hv_get_bytes_to_read(rbi);
+}
+
+/*
  * An API to support in-place processing of incoming VMBUS packets.
  */
 #define VMBUS_PKT_TRAILER  8
-- 
2.11.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 11/14] vmbus: remove conditional locking of vmbus_write

2017-02-01 Thread Stephen Hemminger
All current usage of vmbus write uses the acquire_lock flag, therefore
having it be optional is unnecessary. This also fixes a sparse warning
since sparse doesn't like when a function has conditional locking.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/hv/channel.c  | 13 -
 drivers/hv/channel_mgmt.c |  1 -
 drivers/hv/hyperv_vmbus.h |  3 +--
 drivers/hv/ring_buffer.c  | 11 ---
 include/linux/hyperv.h| 15 ---
 5 files changed, 9 insertions(+), 34 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index 18cc1c78260d..81a80c82f1bd 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -651,7 +651,6 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, 
void *buffer,
u32 packetlen_aligned = ALIGN(packetlen, sizeof(u64));
struct kvec bufferlist[3];
u64 aligned_data = 0;
-   bool lock = channel->acquire_ring_lock;
int num_vecs = ((bufferlen != 0) ? 3 : 1);
 
 
@@ -670,7 +669,7 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, 
void *buffer,
bufferlist[2].iov_base = _data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, num_vecs, lock);
+   return hv_ringbuffer_write(channel, bufferlist, num_vecs);
 }
 EXPORT_SYMBOL(vmbus_sendpacket_ctl);
 
@@ -716,12 +715,10 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel 
*channel,
u32 packetlen_aligned;
struct kvec bufferlist[3];
u64 aligned_data = 0;
-   bool lock = channel->acquire_ring_lock;
 
if (pagecount > MAX_PAGE_BUFFER_COUNT)
return -EINVAL;
 
-
/*
 * Adjust the size down since vmbus_channel_packet_page_buffer is the
 * largest size we support
@@ -753,7 +750,7 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel 
*channel,
bufferlist[2].iov_base = _data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
+   return hv_ringbuffer_write(channel, bufferlist, 3);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_pagebuffer_ctl);
 
@@ -789,7 +786,6 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
u32 packetlen_aligned;
struct kvec bufferlist[3];
u64 aligned_data = 0;
-   bool lock = channel->acquire_ring_lock;
 
packetlen = desc_size + bufferlen;
packetlen_aligned = ALIGN(packetlen, sizeof(u64));
@@ -809,7 +805,7 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
bufferlist[2].iov_base = _data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
+   return hv_ringbuffer_write(channel, bufferlist, 3);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_mpb_desc);
 
@@ -827,7 +823,6 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel 
*channel,
u32 packetlen_aligned;
struct kvec bufferlist[3];
u64 aligned_data = 0;
-   bool lock = channel->acquire_ring_lock;
u32 pfncount = NUM_PAGES_SPANNED(multi_pagebuffer->offset,
 multi_pagebuffer->len);
 
@@ -866,7 +861,7 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel 
*channel,
bufferlist[2].iov_base = _data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
+   return hv_ringbuffer_write(channel, bufferlist, 3);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_multipagebuffer);
 
diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index b2bb5aafaa2f..f33465d78a02 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -332,7 +332,6 @@ static struct vmbus_channel *alloc_channel(void)
if (!channel)
return NULL;
 
-   channel->acquire_ring_lock = true;
spin_lock_init(>inbound_lock);
spin_lock_init(>lock);
 
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 558a798c407c..6a9b54677218 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -283,8 +283,7 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info 
*ring_info,
 void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info);
 
 int hv_ringbuffer_write(struct vmbus_channel *channel,
-   struct kvec *kv_list,
-   u32 kv_count, bool lock);
+   struct kvec *kv_list, u32 kv_count);
 
 int hv_ringbuffer_read(struct vmbus_channel *channel,
   void *buffer, u32 buflen, u32 *buffer_actual_len,
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 30ca55aefd24..146fd8ab2a2a 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -284,7 +284,7 @@ void hv_ringbuffer_cleanup(struct hv_ring_

[PATCH 10/14] vmbus: add direct isr callback mode

2017-02-01 Thread Stephen Hemminger
Change the simple boolean batched_reading into a tri-value.
For future NAPI support in netvsc driver, the callback needs to
occur directly in interrupt handler.

Batched mode is also changed to disable host interrupts immediately
in interrupt routine (to avoid unnecessary host signals), and the
tasklet is rescheduled if more data is detected.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/hv/channel_mgmt.c|  7 ---
 drivers/hv/connection.c  | 27 ---
 drivers/hv/hv_util.c |  3 +--
 drivers/hv/vmbus_drv.c   | 26 --
 drivers/uio/uio_hv_generic.c |  2 +-
 include/linux/hyperv.h   | 31 +--
 6 files changed, 55 insertions(+), 41 deletions(-)

diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
index 2f6270d76b79..b2bb5aafaa2f 100644
--- a/drivers/hv/channel_mgmt.c
+++ b/drivers/hv/channel_mgmt.c
@@ -820,13 +820,6 @@ static void vmbus_onoffer(struct 
vmbus_channel_message_header *hdr)
}
 
/*
-* By default we setup state to enable batched
-* reading. A specific service can choose to
-* disable this prior to opening the channel.
-*/
-   newchannel->batched_reading = true;
-
-   /*
 * Setup state for signalling the host.
 */
newchannel->sig_event = (struct hv_input_signal_event *)
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 27e72dc07e12..a8366fec1458 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -300,9 +300,7 @@ struct vmbus_channel *relid2channel(u32 relid)
 void vmbus_on_event(unsigned long data)
 {
struct vmbus_channel *channel = (void *) data;
-   void *arg;
-   bool read_state;
-   u32 bytes_to_read;
+   void (*callback_fn)(void *);
 
/*
 * A channel once created is persistent even when there
@@ -312,9 +310,13 @@ void vmbus_on_event(unsigned long data)
 * Thus, checking and invoking the driver specific callback takes
 * care of orderly unloading of the driver.
 */
-   if (channel->onchannel_callback != NULL) {
-   arg = channel->channel_callback_context;
-   read_state = channel->batched_reading;
+   callback_fn = READ_ONCE(channel->onchannel_callback);
+   if (unlikely(callback_fn == NULL))
+   return;
+
+   (*callback_fn)(channel->channel_callback_context);
+
+   if (channel->callback_mode == HV_CALL_BATCHED) {
/*
 * This callback reads the messages sent by the host.
 * We can optimize host to guest signaling by ensuring:
@@ -326,16 +328,11 @@ void vmbus_on_event(unsigned long data)
 *state is set we check to see if additional packets are
 *available to read. In this case we repeat the process.
 */
+   if (hv_end_read(>inbound) != 0) {
+   hv_begin_read(>inbound);
 
-   do {
-   if (read_state)
-   hv_begin_read(>inbound);
-   channel->onchannel_callback(arg);
-   if (read_state)
-   bytes_to_read = hv_end_read(>inbound);
-   else
-   bytes_to_read = 0;
-   } while (read_state && (bytes_to_read != 0));
+   tasklet_schedule(>callback_event);
+   }
}
 }
 
diff --git a/drivers/hv/hv_util.c b/drivers/hv/hv_util.c
index d42ede78a9dd..8410191b4992 100644
--- a/drivers/hv/hv_util.c
+++ b/drivers/hv/hv_util.c
@@ -409,8 +409,7 @@ static int util_probe(struct hv_device *dev,
 * Turn off batched reading for all util drivers before we open the
 * channel.
 */
-
-   set_channel_read_state(dev->channel, false);
+   set_channel_read_mode(dev->channel, HV_CALL_DIRECT);
 
hv_set_drvdata(dev, srv);
 
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index eaf1a10b0245..f7f6b9144b07 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -887,6 +887,18 @@ void vmbus_on_msg_dpc(unsigned long data)
 
 
 /*
+ * Direct callback for channels using other deferred processing
+ */
+static void vmbus_channel_isr(struct vmbus_channel *channel)
+{
+   void (*callback_fn)(void *);
+
+   callback_fn = READ_ONCE(channel->onchannel_callback);
+   if (likely(callback_fn != NULL))
+   (*callback_fn)(channel->channel_callback_context);
+}
+
+/*
  * Schedule all channels with events pending
  */
 static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu)
@@ -927,9 +939,19 @@ static void vmbus_chan_sched(struct hv_per_cpu_context 
*hv_cpu)
 
/* Find channel based on relid */
list_for_each_entry(channel,

[PATCH 03/14] vmbus: remove no longer used signal_policy

2017-02-01 Thread Stephen Hemminger
The explicit signal policy is no longer used. A different mechanism
will be added later when xmit_more is supported.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 include/linux/hyperv.h | 18 --
 1 file changed, 18 deletions(-)

diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 85b26f06e172..423fc96cc26a 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -670,11 +670,6 @@ struct hv_input_signal_event_buffer {
struct hv_input_signal_event event;
 };
 
-enum hv_signal_policy {
-   HV_SIGNAL_POLICY_DEFAULT = 0,
-   HV_SIGNAL_POLICY_EXPLICIT,
-};
-
 enum hv_numa_policy {
HV_BALANCED = 0,
HV_LOCALIZED,
@@ -837,13 +832,6 @@ struct vmbus_channel {
 */
struct list_head percpu_list;
/*
-* Host signaling policy: The default policy will be
-* based on the ring buffer state. We will also support
-* a policy where the client driver can have explicit
-* signaling control.
-*/
-   enum hv_signal_policy  signal_policy;
-   /*
 * On the channel send side, many of the VMBUS
 * device drivers explicity serialize access to the
 * outgoing ring buffer. Give more control to the
@@ -904,12 +892,6 @@ static inline bool is_hvsock_channel(const struct 
vmbus_channel *c)
  VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER);
 }
 
-static inline void set_channel_signal_state(struct vmbus_channel *c,
-   enum hv_signal_policy policy)
-{
-   c->signal_policy = policy;
-}
-
 static inline void set_channel_affinity_state(struct vmbus_channel *c,
  enum hv_numa_policy policy)
 {
-- 
2.11.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 01/14] vmbus: use kernel bitops for traversing interrupt mask

2017-02-01 Thread Stephen Hemminger
Use standard kernel operations for find first set bit to traverse
the channel bit array. This has added benefit of speeding up
lookup on 64 bit and because it uses find first set instruction.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/hv/channel.c  |  8 ++-
 drivers/hv/connection.c   | 55 +++
 drivers/hv/hyperv_vmbus.h | 16 --
 drivers/hv/vmbus_drv.c|  4 +---
 4 files changed, 29 insertions(+), 54 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index be34547cdb68..a016c5c0e472 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -47,12 +47,8 @@ void vmbus_setevent(struct vmbus_channel *channel)
 * For channels marked as in "low latency" mode
 * bypass the monitor page mechanism.
 */
-   if ((channel->offermsg.monitor_allocated) &&
-   (!channel->low_latency)) {
-   /* Each u32 represents 32 channels */
-   sync_set_bit(channel->offermsg.child_relid & 31,
-   (unsigned long *) vmbus_connection.send_int_page +
-   (channel->offermsg.child_relid >> 5));
+   if (channel->offermsg.monitor_allocated && !channel->low_latency) {
+   vmbus_send_interrupt(channel->offermsg.child_relid);
 
/* Get the child to parent monitor page */
monitorpage = vmbus_connection.monitor_pages[1];
diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
index 307a5a8937f6..1766ef03e78d 100644
--- a/drivers/hv/connection.c
+++ b/drivers/hv/connection.c
@@ -379,17 +379,11 @@ static void process_chn_event(u32 relid)
  */
 void vmbus_on_event(unsigned long data)
 {
-   u32 dword;
-   u32 maxdword;
-   int bit;
-   u32 relid;
-   u32 *recv_int_page = NULL;
-   void *page_addr;
-   int cpu = smp_processor_id();
-   union hv_synic_event_flags *event;
+   unsigned long *recv_int_page;
+   u32 maxbits, relid;
 
if (vmbus_proto_version < VERSION_WIN8) {
-   maxdword = MAX_NUM_CHANNELS_SUPPORTED >> 5;
+   maxbits = MAX_NUM_CHANNELS_SUPPORTED;
recv_int_page = vmbus_connection.recv_int_page;
} else {
/*
@@ -397,35 +391,24 @@ void vmbus_on_event(unsigned long data)
 * can be directly checked to get the id of the channel
 * that has the interrupt pending.
 */
-   maxdword = HV_EVENT_FLAGS_DWORD_COUNT;
-   page_addr = hv_context.synic_event_page[cpu];
-   event = (union hv_synic_event_flags *)page_addr +
+   int cpu = smp_processor_id();
+   void *page_addr = hv_context.synic_event_page[cpu];
+   union hv_synic_event_flags *event
+   = (union hv_synic_event_flags *)page_addr +
 VMBUS_MESSAGE_SINT;
-   recv_int_page = event->flags32;
-   }
-
 
+   maxbits = HV_EVENT_FLAGS_COUNT;
+   recv_int_page = event->flags;
+   }
 
-   /* Check events */
-   if (!recv_int_page)
+   if (unlikely(!recv_int_page))
return;
-   for (dword = 0; dword < maxdword; dword++) {
-   if (!recv_int_page[dword])
-   continue;
-   for (bit = 0; bit < 32; bit++) {
-   if (sync_test_and_clear_bit(bit,
-   (unsigned long *)_int_page[dword])) {
-   relid = (dword << 5) + bit;
-
-   if (relid == 0)
-   /*
-* Special case - vmbus
-* channel protocol msg
-*/
-   continue;
 
+   for_each_set_bit(relid, recv_int_page, maxbits) {
+   if (sync_test_and_clear_bit(relid, recv_int_page)) {
+   /* Special case - vmbus channel protocol msg */
+   if (relid != 0)
process_chn_event(relid);
-   }
}
}
 }
@@ -491,12 +474,8 @@ void vmbus_set_event(struct vmbus_channel *channel)
 {
u32 child_relid = channel->offermsg.child_relid;
 
-   if (!channel->is_dedicated_interrupt) {
-   /* Each u32 represents 32 channels */
-   sync_set_bit(child_relid & 31,
-   (unsigned long *)vmbus_connection.send_int_page +
-   (child_relid >> 5));
-   }
+   if (!channel->is_dedicated_interrupt)
+   vmbus_send_interrupt(child_relid);
 
hv_do_hypercall(HVCALL_SIGNAL_EVENT, channel->sig_event, NULL

[PATCH 05/14] netvsc: remove no longer needed receive staging buffers

2017-02-01 Thread Stephen Hemminger
Since commit aed8c164ca5199 ("Drivers: hv: ring_buffer: count on wrap
around mappings") it is no longer necessary to handle ring wrapping
by having a special receive buffer.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/net/hyperv/hyperv_net.h   |  5 ---
 drivers/net/hyperv/netvsc.c   | 83 ++-
 drivers/net/hyperv/rndis_filter.c | 11 --
 3 files changed, 11 insertions(+), 88 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 3958adade7eb..cce70ceba6d5 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -748,11 +748,6 @@ struct netvsc_device {
 
int ring_size;
 
-   /* The primary channel callback buffer */
-   unsigned char *cb_buffer;
-   /* The sub channel callback buffer */
-   unsigned char *sub_cb_buf;
-
struct multi_send_data msd[VRSS_CHANNEL_MAX];
u32 max_pkt; /* max number of pkt in one send, e.g. 8 */
u32 pkt_align; /* alignment bytes, e.g. 8 */
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index e326e68f9f6d..7487498b663c 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -67,12 +67,6 @@ static struct netvsc_device *alloc_net_device(void)
if (!net_device)
return NULL;
 
-   net_device->cb_buffer = kzalloc(NETVSC_PACKET_SIZE, GFP_KERNEL);
-   if (!net_device->cb_buffer) {
-   kfree(net_device);
-   return NULL;
-   }
-
net_device->mrc[0].buf = vzalloc(NETVSC_RECVSLOT_MAX *
 sizeof(struct recv_comp_data));
 
@@ -93,7 +87,6 @@ static void free_netvsc_device(struct netvsc_device *nvdev)
for (i = 0; i < VRSS_CHANNEL_MAX; i++)
vfree(nvdev->mrc[i].buf);
 
-   kfree(nvdev->cb_buffer);
kfree(nvdev);
 }
 
@@ -584,7 +577,6 @@ void netvsc_device_remove(struct hv_device *device)
vmbus_close(device->channel);
 
/* Release all resources */
-   vfree(net_device->sub_cb_buf);
free_netvsc_device(net_device);
 }
 
@@ -1256,16 +1248,11 @@ static void netvsc_process_raw_pkt(struct hv_device 
*device,
 
 void netvsc_channel_cb(void *context)
 {
-   int ret;
-   struct vmbus_channel *channel = (struct vmbus_channel *)context;
+   struct vmbus_channel *channel = context;
u16 q_idx = channel->offermsg.offer.sub_channel_index;
struct hv_device *device;
struct netvsc_device *net_device;
-   u32 bytes_recvd;
-   u64 request_id;
struct vmpacket_descriptor *desc;
-   unsigned char *buffer;
-   int bufferlen = NETVSC_PACKET_SIZE;
struct net_device *ndev;
bool need_to_commit = false;
 
@@ -1277,65 +1264,19 @@ void netvsc_channel_cb(void *context)
net_device = get_inbound_net_device(device);
if (!net_device)
return;
+
ndev = hv_get_drvdata(device);
-   buffer = get_per_channel_state(channel);
-
-   do {
-   desc = get_next_pkt_raw(channel);
-   if (desc != NULL) {
-   netvsc_process_raw_pkt(device,
-  channel,
-  net_device,
-  ndev,
-  desc->trans_id,
-  desc);
-
-   put_pkt_raw(channel, desc);
-   need_to_commit = true;
-   continue;
-   }
-   if (need_to_commit) {
-   need_to_commit = false;
-   commit_rd_index(channel);
-   }
 
-   ret = vmbus_recvpacket_raw(channel, buffer, bufferlen,
-  _recvd, _id);
-   if (ret == 0) {
-   if (bytes_recvd > 0) {
-   desc = (struct vmpacket_descriptor *)buffer;
-   netvsc_process_raw_pkt(device,
-  channel,
-  net_device,
-  ndev,
-  request_id,
-  desc);
-   } else {
-   /*
-* We are done for this pass.
-*/
-   break;
-   }
-
-   } else if (ret == -ENOBUFS) {
-   if (bufferlen > NETVSC_PACKET_SIZE)
-   kfree(buffer);
-   /* Handle large packet */
-   

[PATCH 00/14] hyperv: vmbus related patches

2017-02-01 Thread Stephen Hemminger
This is a rebase/resend of earlier patches. I skipped the pure
cosmetic patches for now.  Mostly this is consolidation earlier
changes, removing dead code etc.  The important part is the
change for allowing a vmbus channel to get callback directly
in interrupt mode; this is necessary for NAPI support.

Stephen Hemminger (14):
  vmbus: use kernel bitops for traversing interrupt mask
  vmbus: drop no longer used kick_q argument
  vmbus: remove no longer used signal_policy
  vmbus: remove unused kickq argument to sendpacket
  netvsc: remove no longer needed receive staging buffers
  vmbus: remove per channel state
  vmbus: callback is in softirq not workqueue
  vmbus: put related per-cpu variable together
  vmbus: change to per channel tasklet
  vmbus: add direct isr callback mode
  vmbus: remove conditional locking of vmbus_write
  vmbus: expose hv_begin/end_read
  vmbus: constify parameters where possible
  vmbus: replace modulus operation with subtraction

Starting point was top of current char-misc-next branch.

 drivers/hv/channel.c  |  47 +
 drivers/hv/channel_mgmt.c |  41 ++--
 drivers/hv/connection.c   | 134 +-
 drivers/hv/hv.c   | 124 +++
 drivers/hv/hv_util.c  |   3 +-
 drivers/hv/hyperv_vmbus.h |  80 ---
 drivers/hv/ring_buffer.c  |  66 ++-
 drivers/hv/vmbus_drv.c| 115 ++--
 drivers/net/hyperv/hyperv_net.h   |   5 --
 drivers/net/hyperv/netvsc.c   | 104 -
 drivers/net/hyperv/rndis_filter.c |  11 
 drivers/uio/uio_hv_generic.c  |   2 +-
 include/linux/hyperv.h| 134 +-
 13 files changed, 338 insertions(+), 528 deletions(-)

-- 
2.11.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 02/14] vmbus: drop no longer used kick_q argument

2017-02-01 Thread Stephen Hemminger
The flag to cause notification of host is unused after
commit a01a291a282f7c2e ("Drivers: hv: vmbus: Base host signaling
strictly on the ring state"). Therefore remove it from the ring
buffer internal API.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/hv/channel.c  | 13 -
 drivers/hv/hyperv_vmbus.h |  5 ++---
 drivers/hv/ring_buffer.c  |  8 +++-
 3 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index a016c5c0e472..e26285cde8e0 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -670,9 +670,7 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, 
void *buffer,
bufferlist[2].iov_base = _data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, num_vecs,
-  lock, kick_q);
-
+   return hv_ringbuffer_write(channel, bufferlist, num_vecs, lock);
 }
 EXPORT_SYMBOL(vmbus_sendpacket_ctl);
 
@@ -757,8 +755,7 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel 
*channel,
bufferlist[2].iov_base = _data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3,
-  lock, kick_q);
+   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_pagebuffer_ctl);
 
@@ -813,8 +810,7 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel,
bufferlist[2].iov_base = _data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3,
-  lock, true);
+   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_mpb_desc);
 
@@ -871,8 +867,7 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel 
*channel,
bufferlist[2].iov_base = _data;
bufferlist[2].iov_len = (packetlen_aligned - packetlen);
 
-   return hv_ringbuffer_write(channel, bufferlist, 3,
-  lock, true);
+   return hv_ringbuffer_write(channel, bufferlist, 3, lock);
 }
 EXPORT_SYMBOL_GPL(vmbus_sendpacket_multipagebuffer);
 
diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
index 2749a4142889..c375ec89db6f 100644
--- a/drivers/hv/hyperv_vmbus.h
+++ b/drivers/hv/hyperv_vmbus.h
@@ -275,9 +275,8 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info 
*ring_info,
 void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info);
 
 int hv_ringbuffer_write(struct vmbus_channel *channel,
-   struct kvec *kv_list,
-   u32 kv_count, bool lock,
-   bool kick_q);
+   struct kvec *kv_list,
+   u32 kv_count, bool lock);
 
 int hv_ringbuffer_read(struct vmbus_channel *channel,
   void *buffer, u32 buflen, u32 *buffer_actual_len,
diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 2cd402986858..30ca55aefd24 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -77,8 +77,7 @@ u32 hv_end_read(struct hv_ring_buffer_info *rbi)
  * host logic is fixed.
  */
 
-static void hv_signal_on_write(u32 old_write, struct vmbus_channel *channel,
-  bool kick_q)
+static void hv_signal_on_write(u32 old_write, struct vmbus_channel *channel)
 {
struct hv_ring_buffer_info *rbi = >outbound;
 
@@ -285,8 +284,7 @@ void hv_ringbuffer_cleanup(struct hv_ring_buffer_info 
*ring_info)
 
 /* Write to the ring buffer. */
 int hv_ringbuffer_write(struct vmbus_channel *channel,
-   struct kvec *kv_list, u32 kv_count, bool lock,
-   bool kick_q)
+   struct kvec *kv_list, u32 kv_count, bool lock)
 {
int i = 0;
u32 bytes_avail_towrite;
@@ -352,7 +350,7 @@ int hv_ringbuffer_write(struct vmbus_channel *channel,
if (lock)
spin_unlock_irqrestore(_info->ring_lock, flags);
 
-   hv_signal_on_write(old_write, channel, kick_q);
+   hv_signal_on_write(old_write, channel);
 
if (channel->rescind)
return -ENODEV;
-- 
2.11.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH 06/14] vmbus: remove per channel state

2017-02-01 Thread Stephen Hemminger
The netvsc no longer needs per channel state hook to track receive buffer.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 include/linux/hyperv.h | 14 --
 1 file changed, 14 deletions(-)

diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 8c6a1505b876..39d493ce550d 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -823,10 +823,6 @@ struct vmbus_channel {
 */
struct vmbus_channel *primary_channel;
/*
-* Support per-channel state for use by vmbus drivers.
-*/
-   void *per_channel_state;
-   /*
 * To support per-cpu lookup mapping of relid to channel,
 * link up channels based on their CPU affinity.
 */
@@ -903,16 +899,6 @@ static inline void set_channel_read_state(struct 
vmbus_channel *c, bool state)
c->batched_reading = state;
 }
 
-static inline void set_per_channel_state(struct vmbus_channel *c, void *s)
-{
-   c->per_channel_state = s;
-}
-
-static inline void *get_per_channel_state(struct vmbus_channel *c)
-{
-   return c->per_channel_state;
-}
-
 static inline void set_channel_pending_send_size(struct vmbus_channel *c,
 u32 size)
 {
-- 
2.11.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/2] hyperv: implement hv_get_tsc_page()

2017-02-09 Thread Stephen Hemminger
On Thu, 9 Feb 2017 21:14:25 +0100 (CET)
Thomas Gleixner <t...@linutronix.de> wrote:

> On Thu, 9 Feb 2017, Stephen Hemminger wrote:
> 
> > The actual code looks fine, but the style police will not like you.
> > { should be at start of line on functions.
> > And #else should be at start of line,
> > 
> > But maybe this was just more of exchange mangling the mail.  
> 
> Looks like.
> 
> > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) {
> > +   return tsc_pg;
> > +}
> > +  
> 
> That's how it reads in a proper mail client connected to a proper mail
> server:
> 
> > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
> > +{
> > +   return tsc_pg;
> > +}  
> 
> :)


Yup. it looks like the mail server is trying to be "helpful" by eliminating 
extra white space.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

2017-02-09 Thread Stephen Hemminger
On Thu, 9 Feb 2017 14:55:50 -0800
Andy Lutomirski <l...@amacapital.net> wrote:

> On Thu, Feb 9, 2017 at 12:45 PM, KY Srinivasan <k...@microsoft.com> wrote:
> >
> >  
> >> -Original Message-
> >> From: Thomas Gleixner [mailto:t...@linutronix.de]
> >> Sent: Thursday, February 9, 2017 9:08 AM
> >> To: Vitaly Kuznetsov <vkuzn...@redhat.com>
> >> Cc: x...@kernel.org; Andy Lutomirski <l...@amacapital.net>; Ingo Molnar
> >> <mi...@redhat.com>; H. Peter Anvin <h...@zytor.com>; KY Srinivasan
> >> <k...@microsoft.com>; Haiyang Zhang <haiya...@microsoft.com>; Stephen
> >> Hemminger <sthem...@microsoft.com>; Dexuan Cui
> >> <de...@microsoft.com>; linux-ker...@vger.kernel.org;
> >> de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org
> >> Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read
> >> method
> >>
> >> On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote:  
> >> > +#ifdef CONFIG_HYPERV_TSCPAGE
> >> > +static notrace u64 vread_hvclock(int *mode)
> >> > +{
> >> > +   const struct ms_hyperv_tsc_page *tsc_pg =
> >> > +   (const struct ms_hyperv_tsc_page *)_page;
> >> > +   u64 sequence, scale, offset, current_tick, cur_tsc;
> >> > +
> >> > +   while (1) {
> >> > +   sequence = READ_ONCE(tsc_pg->tsc_sequence);
> >> > +   if (!sequence)
> >> > +   break;
> >> > +
> >> > +   scale = READ_ONCE(tsc_pg->tsc_scale);
> >> > +   offset = READ_ONCE(tsc_pg->tsc_offset);
> >> > +   rdtscll(cur_tsc);
> >> > +
> >> > +   current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
> >> > +
> >> > +   if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
> >> > +   return current_tick;  
> >>
> >> That sequence stuff lacks still a sensible explanation. It's fundamentally
> >> different from the sequence counting we do in the kernel, so documentation
> >> for it is really required.  
> >
> > The host is updating multiple fields in this shared TSC page and the 
> > sequence number is
> > used to ensure that the guest sees a consistent set values published. If I 
> > remember
> > correctly, Xen has a similar mechanism.  
> 
> So what's the actual protocol?  When the hypervisor updates the page,
> does it freeze all guest cpus?  If not, how does it maintain
> atomicity?

The protocol looks a lot like Linux seqlock, but it has an extra protection
which is missing here.

The host needs to update sequence number twice in order to guarantee ordering.
Otherwise it is possible that Host and guest can race.

Host
Write offset
Write scale
Set tsc_sequence = N
  Guest
read sequence = N
Read scale
Write scale
Write offset

Read Offset
Check sequence == N
Set tsc_sequence = N +1

Look like the current host side protocol is wrong.

The solution that Andi Kleen invented, and I used in seqlock was for the writer 
to update
sequence at start and end of transaction. If sequence number is odd, then the 
reader knows
it is looking at stale data.
Host
Write offset
Write scale
Set tsc_sequence = N (end of 
transaction)
  Guest
read sequence = N
Spin until sequence is even (N is even)
Read scale
Set tsc_sequence += 1
Write scale
Write offset

Read Offset
Check sequence == N? (fails is N + 1)
Set tsc_sequence += 1 (end of 
transaction)
read sequence = N+2
Spin until sequence is even (ie N +2)
Read scale  
Read Offset
Check sequence == N +2? (yes ok).

Also it is faster to just read scale and offset with this loop and save
the reading of TSC and doing multiply until after scale/offset has been 
acquired.




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH net-next] net: make ndo_get_stats64 a void function

2017-01-05 Thread Stephen Hemminger
The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 drivers/net/bonding/bond_main.c  | 10 --
 drivers/net/dummy.c  |  5 ++---
 drivers/net/ethernet/alacritech/slicoss.c|  6 ++
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 10 --
 drivers/net/ethernet/amd/xgbe/xgbe-drv.c |  6 ++
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c |  4 +---
 drivers/net/ethernet/atheros/alx/main.c  |  6 ++
 drivers/net/ethernet/broadcom/b44.c  |  5 ++---
 drivers/net/ethernet/broadcom/bnx2.c |  3 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt.c|  6 ++
 drivers/net/ethernet/broadcom/tg3.c  |  8 +++-
 drivers/net/ethernet/brocade/bna/bnad.c  |  6 ++
 drivers/net/ethernet/calxeda/xgmac.c |  5 ++---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c |  5 ++---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |  7 +++
 drivers/net/ethernet/cisco/enic/enic_main.c  |  8 +++-
 drivers/net/ethernet/ec_bhf.c|  4 +---
 drivers/net/ethernet/emulex/benet/be_main.c  |  5 ++---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c   |  6 ++
 drivers/net/ethernet/hisilicon/hns/hns_enet.c|  6 ++
 drivers/net/ethernet/ibm/ehea/ehea_main.c|  5 ++---
 drivers/net/ethernet/intel/e1000e/e1000.h|  4 ++--
 drivers/net/ethernet/intel/e1000e/netdev.c   |  5 ++---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c  |  6 ++
 drivers/net/ethernet/intel/i40e/i40e.h   |  5 ++---
 drivers/net/ethernet/intel/i40e/i40e_main.c  | 18 ++
 drivers/net/ethernet/intel/igb/igb_main.c| 10 --
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c|  7 ---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c|  6 ++
 drivers/net/ethernet/marvell/mvneta.c|  4 +---
 drivers/net/ethernet/marvell/mvpp2.c |  4 +---
 drivers/net/ethernet/marvell/sky2.c  |  6 ++
 drivers/net/ethernet/mediatek/mtk_eth_soc.c  |  6 ++
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |  4 +---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c|  3 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c |  3 +--
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c   |  4 +---
 drivers/net/ethernet/mellanox/mlxsw/switchx2.c   |  3 +--
 drivers/net/ethernet/myricom/myri10ge/myri10ge.c |  9 -
 drivers/net/ethernet/neterion/vxge/vxge-main.c   |  4 +---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c  |  6 ++
 drivers/net/ethernet/nvidia/forcedeth.c  |  4 +---
 drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c | 10 --
 drivers/net/ethernet/qlogic/qede/qede_main.c |  7 ++-
 drivers/net/ethernet/qualcomm/emac/emac.c|  6 ++
 drivers/net/ethernet/realtek/8139too.c   |  9 +++--
 drivers/net/ethernet/realtek/r8169.c |  4 +---
 drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c  |  8 ++--
 drivers/net/ethernet/sfc/efx.c   |  6 ++
 drivers/net/ethernet/sfc/falcon/efx.c|  6 ++
 drivers/net/ethernet/sun/niu.c   |  6 ++
 drivers/net/ethernet/synopsys/dwc_eth_qos.c  |  4 +---
 drivers/net/ethernet/tile/tilepro.c  |  4 ++--
 drivers/net/ethernet/via/via-rhine.c |  8 +++-
 drivers/net/fjes/fjes_main.c |  7 ++-
 drivers/net/hyperv/netvsc_drv.c  |  6 ++
 drivers/net/ifb.c|  6 ++
 drivers/net/ipvlan/ipvlan_main.c |  5 ++---
 drivers/net/loopback.c   |  5 ++---
 drivers/net/macsec.c |  6 ++
 drivers/net/macvlan.c|  5 ++---
 drivers/net/nlmon.c  |  4 +---
 drivers/net/ppp/ppp_generic.c|  4 +---
 drivers/net/slip/slip.c  |  3 +--
 drivers/net/team/team.c  |  3 +--
 drivers/net/tun.c|  3 +--
 drivers/net/veth.c   |  6 ++
 drivers/net/virtio_net.c |  6 ++
 drivers/net/vmxnet3/vmxnet3_ethtool.c|  4 +---
 drivers/net/vmxnet3/vmxnet3_int.h|  4 ++--
 drive

Re: [PATCH net-next V3 3/3] tun: rx batching

2016-12-31 Thread Stephen Hemminger
On Fri, 30 Dec 2016 13:20:51 +0800
Jason Wang  wrote:

> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index cd8e02c..a268ed9 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -75,6 +75,10 @@
>  
>  #include 
>  
> +static int rx_batched;
> +module_param(rx_batched, int, 0444);
> +MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx");
> +
>  /* Uncomment to enable debugging */

I like the concept or rx batching. But controlling it via a module parameter
is one of the worst API choices.  Ethtool would be better to use because that is
how other network devices control batching.

If you do ethtool, you could even extend it to have an number of packets
and max latency value.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 2/3] x86/hyperv: move TSC reading method to asm/mshyperv.h

2017-03-03 Thread Stephen Hemminger

Minor coding comments

> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index d324dce..4ff25436 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -178,6 +178,56 @@ void hyperv_cleanup(void);
>  #endif
>  #ifdef CONFIG_HYPERV_TSCPAGE
>  struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
> +static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
> +{
> + u64 scale, offset, current_tick, cur_tsc;
> + u32 sequence;
> +
> + /*
> +  * The protocol for reading Hyper-V TSC page is specified in Hypervisor
> +  * Top-Level Functional Specification ver. 3.0 and above. To get the
> +  * reference time we must do the following:
> +  * - READ ReferenceTscSequence
> +  *   A special '0' value indicates the time source is unreliable and we
> +  *   need to use something else. The currently published specification
> +  *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
> +  *   instead of '0' as the special value, see commit c35b82ef0294.
> +  * - ReferenceTime =
> +  *((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
> +  * - READ ReferenceTscSequence again. In case its value has changed
> +  *   since our first reading we need to discard ReferenceTime and repeat
> +  *   the whole sequence as the hypervisor was updating the page in
> +  *   between.
> +  */
> + while (1) {
> + sequence = READ_ONCE(tsc_pg->tsc_sequence);
> + if (!sequence)
> + break;

It would be clearer to just return U64_MAX here (and not fall out)
since this is only case here. Also since this failure only occurs if host
clock is not available, probably should be unlikely.

> + /*
> +  * Make sure we read sequence before we read other values from
> +  * TSC page.
> +  */
> + smp_rmb();
> +
> + scale = READ_ONCE(tsc_pg->tsc_scale);
> + offset = READ_ONCE(tsc_pg->tsc_offset);
> + cur_tsc = rdtsc_ordered();

Since you already have smp_ barriers and rdtsc_ordered is a barrier,
the compiler barriers (READ_ONCE()) shouldn't be necessary.

> +
> + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
> +
> + /*
> +  * Make sure we read sequence after we read all other values
> +  * from TSC page.
> +  */
> + smp_rmb();
> +
> + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence)
> + return current_tick;
> + }

Why not make do { } while out of this.

do {
...
} while (unlikely(READ_ONCE(tsc_pg->tsc_sequence) != sequence);
return current_tick;

Also don't need to calculate tick value until have good data. As in:

static inline u32 hv_clock_sequence(const struct ms_hyperv_tsc_page *tsc_pg)
{
u32 sequence =
return sequence;
}

static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
{
u64 scale, offset, cur_tsc;
u32 start;

/*
 * The protocol for reading Hyper-V TSC page is specified in Hypervisor
 * Top-Level Functional Specification ver. 3.0 and above. To get the
 * reference time we must do the following:
 * - READ ReferenceTscSequence
 *   A special '0' value indicates the time source is unreliable and we
 *   need to use something else. The currently published specification
 *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
 *   instead of '0' as the special value, see commit c35b82ef0294.
 * - ReferenceTime =
 *((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
 * - READ ReferenceTscSequence again. In case its value has changed
 *   since our first reading we need to discard ReferenceTime and repeat
 *   the whole sequence as the hypervisor was updating the page in
 *   between.
 */
do {
start = READ_ONCE(tsc_pg->tsc_sequence);
smp_rmb();

if (unlikely(!start))
return U64_MAX;

scale = tsc_pg->tsc_scale;
offset = tsc_pg->tsc_offset;

/*
 * Make sure we read sequence after we read all other values
 * from TSC page.
 */
smp_rmb();
} while (unlikely(READ_ONCE(tsc_pg->tsc_sequence != start)));

cur_tsc = rdtsc_ordered();
return mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
}

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH net-next 4/4] mlx4: sizeof style usage

2017-08-15 Thread Stephen Hemminger
The kernel coding style is to treat sizeof as a function
(ie. with parenthesis) not as an operator.

Also use kcalloc and kmalloc_array

Signed-off-by: Stephen Hemminger <step...@networkplumber.org>
---
 drivers/net/ethernet/mellanox/mlx4/alloc.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/cmd.c   |  4 ++--
 drivers/net/ethernet/mellanox/mlx4/en_resources.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/eq.c| 20 +-
 drivers/net/ethernet/mellanox/mlx4/fw.c|  2 +-
 drivers/net/ethernet/mellanox/mlx4/icm.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/icm.h   |  4 ++--
 drivers/net/ethernet/mellanox/mlx4/intf.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/main.c  | 12 +--
 drivers/net/ethernet/mellanox/mlx4/mcg.c   | 12 +--
 drivers/net/ethernet/mellanox/mlx4/mr.c| 10 -
 drivers/net/ethernet/mellanox/mlx4/qp.c| 12 +--
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  | 24 +++---
 15 files changed, 56 insertions(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/alloc.c 
b/drivers/net/ethernet/mellanox/mlx4/alloc.c
index b651c1210555..6dabd983e7e0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/alloc.c
+++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c
@@ -186,7 +186,7 @@ int mlx4_bitmap_init(struct mlx4_bitmap *bitmap, u32 num, 
u32 mask,
bitmap->effective_len = bitmap->avail;
spin_lock_init(>lock);
bitmap->table = kzalloc(BITS_TO_LONGS(bitmap->max) *
-   sizeof (long), GFP_KERNEL);
+   sizeof(long), GFP_KERNEL);
if (!bitmap->table)
return -ENOMEM;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 674773b28b2e..97aed30ead21 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -2637,7 +2637,7 @@ int mlx4_cmd_use_events(struct mlx4_dev *dev)
int err = 0;
 
priv->cmd.context = kmalloc(priv->cmd.max_cmds *
-  sizeof (struct mlx4_cmd_context),
+  sizeof(struct mlx4_cmd_context),
   GFP_KERNEL);
if (!priv->cmd.context)
return -ENOMEM;
@@ -2695,7 +2695,7 @@ struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct 
mlx4_dev *dev)
 {
struct mlx4_cmd_mailbox *mailbox;
 
-   mailbox = kmalloc(sizeof *mailbox, GFP_KERNEL);
+   mailbox = kmalloc(sizeof(*mailbox), GFP_KERNEL);
if (!mailbox)
return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_resources.c 
b/drivers/net/ethernet/mellanox/mlx4/en_resources.c
index 86d2d42d658d..5a47f9669621 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_resources.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_resources.c
@@ -44,7 +44,7 @@ void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int 
size, int stride,
struct mlx4_en_dev *mdev = priv->mdev;
struct net_device *dev = priv->dev;
 
-   memset(context, 0, sizeof *context);
+   memset(context, 0, sizeof(*context));
context->flags = cpu_to_be32(7 << 16 | rss << MLX4_RSS_QPC_FLAG_OFFSET);
context->pd = cpu_to_be32(mdev->priv_pdn);
context->mtu_msgmax = 0xff;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index bf1638044a7a..dcb8f8f84a97 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1056,7 +1056,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv 
*priv, int qpn,
}
qp->event = mlx4_en_sqp_event;
 
-   memset(context, 0, sizeof *context);
+   memset(context, 0, sizeof(*context));
mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0,
qpn, ring->cqn, -1, context);
context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 73faa3d77921..bcf422efd3b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -643,7 +643,7 @@ static void build_inline_wqe(struct mlx4_en_tx_desc 
*tx_desc,
 void *fragptr)
 {
struct mlx4_wqe_inline_seg *inl = _desc->inl;
-   int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl;
+   int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof(*inl);
unsigned int hlen = skb_headlen(skb);
 
if (skb->len <= spc) {
diff --git a/drivers/net/ethernet/mellanox/ml

[PATCH net-next 3/4] skge: add paren around sizeof arg

2017-08-15 Thread Stephen Hemminger
Signed-off-by: Stephen Hemminger <step...@networkplumber.org>
---
 drivers/net/ethernet/marvell/skge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/skge.c 
b/drivers/net/ethernet/marvell/skge.c
index 5d7d94de4e00..8a835e82256a 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -3516,7 +3516,7 @@ static const char *skge_board_name(const struct skge_hw 
*hw)
if (skge_chips[i].id == hw->chip_id)
return skge_chips[i].name;
 
-   snprintf(buf, sizeof buf, "chipid 0x%x", hw->chip_id);
+   snprintf(buf, sizeof(buf), "chipid 0x%x", hw->chip_id);
return buf;
 }
 
-- 
2.11.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH net-next 1/4] tun/tap: use paren's with sizeof

2017-08-15 Thread Stephen Hemminger
Although sizeof is an operator in C. The kernel coding style convention
is to always use it like a function and add parenthesis.

Signed-off-by: Stephen Hemminger <step...@networkplumber.org>
---
 drivers/net/tap.c | 2 +-
 drivers/net/tun.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 0d039411e64c..21b71ae947fd 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1215,7 +1215,7 @@ int tap_queue_resize(struct tap_dev *tap)
int n = tap->numqueues;
int ret, i = 0;
 
-   arrays = kmalloc(sizeof *arrays * n, GFP_KERNEL);
+   arrays = kmalloc_array(n, sizeof(*arrays), GFP_KERNEL);
if (!arrays)
return -ENOMEM;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 5892284eb8d0..f5017121cd57 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2737,7 +2737,7 @@ static int tun_queue_resize(struct tun_struct *tun)
int n = tun->numqueues + tun->numdisabled;
int ret, i;
 
-   arrays = kmalloc(sizeof *arrays * n, GFP_KERNEL);
+   arrays = kmalloc_array(n, sizeof(*arrays), GFP_KERNEL);
if (!arrays)
return -ENOMEM;
 
-- 
2.11.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] uapi: add SPDX identifier to vm_sockets_diag.h

2017-11-24 Thread Stephen Hemminger
New file seems to have missed the SPDX license scan and update.

Signed-off-by: Stephen Hemminger <sthem...@microsoft.com>
---
 include/uapi/linux/vm_sockets_diag.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/vm_sockets_diag.h 
b/include/uapi/linux/vm_sockets_diag.h
index 14cd7dc5a187..0b4dd54f3d1e 100644
--- a/include/uapi/linux/vm_sockets_diag.h
+++ b/include/uapi/linux/vm_sockets_diag.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
 /* AF_VSOCK sock_diag(7) interface for querying open sockets */
 
 #ifndef _UAPI__VM_SOCKETS_DIAG_H__
-- 
2.11.0

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-11-29 Thread Stephen Hemminger
On Wed, 29 Nov 2017 19:51:38 -0800
Jakub Kicinski <jakub.kicin...@netronome.com> wrote:

> On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote:
> > On 2017年11月29日 03:27, Jesse Brandeburg wrote:  
> > > Hi, I'd like to get some feedback on a proposal to enhance
> > > virtio-net to ease configuration of a VM and that would enable
> > > live migration of passthrough network SR-IOV devices.
> > >
> > > Today we have SR-IOV network devices (VFs) that can be passed
> > > into a VM in order to enable high performance networking direct
> > > within the VM. The problem I am trying to address is that this
> > > configuration is generally difficult to live-migrate.  There is
> > > documentation [1] indicating that some OS/Hypervisor vendors will
> > > support live migration of a system with a direct assigned
> > > networking device.  The problem I see with these implementations
> > > is that the network configuration requirements that are passed on
> > > to the owner of the VM are quite complicated.  You have to set up
> > > bonding, you have to configure it to enslave two interfaces,
> > > those interfaces (one is virtio-net, the other is SR-IOV
> > > device/driver like ixgbevf) must support MAC address changes
> > > requested in the VM, and on and on...
> > >
> > > So, on to the proposal:
> > > Modify virtio-net driver to be a single VM network device that
> > > enslaves an SR-IOV network device (inside the VM) with the same
> > > MAC address. This would cause the virtio-net driver to appear and
> > > work like a simplified bonding/team driver.  The live migration
> > > problem would be solved just like today's bonding solution, but
> > > the VM user's networking config would be greatly simplified.
> > >
> > > At it's simplest, it would appear something like this in the VM.
> > >
> > > ==
> > > = vnet0  =
> > >   =
> > > (virtio- =   |
> > >   net)=   |
> > >   =  ==
> > >   =  = ixgbef =
> > > ==  ==
> > >
> > > (forgive the ASCII art)
> > >
> > > The fast path traffic would prefer the ixgbevf or other SR-IOV
> > > device path, and fall back to virtio's transmit/receive when
> > > migrating.
> > >
> > > Compared to today's options this proposal would
> > > 1) make virtio-net more sticky, allow fast path traffic at SR-IOV
> > > speeds
> > > 2) simplify end user configuration in the VM (most if not all of
> > > the set up to enable migration would be done in the hypervisor)
> > > 3) allow live migration via a simple link down and maybe a PCI
> > > hot-unplug of the SR-IOV device, with failover to the
> > > virtio-net driver core
> > > 4) allow vendor agnostic hardware acceleration, and live migration
> > > between vendors if the VM os has driver support for all the
> > > required SR-IOV devices.
> > >
> > > Runtime operation proposed:
> > > -  virtio-net driver loads, SR-IOV driver loads
> > > - virtio-net finds other NICs that match it's MAC address by
> > >both examining existing interfaces, and sets up a new device
> > > notifier
> > > - virtio-net enslaves the first NIC with the same MAC address
> > > - virtio-net brings up the slave, and makes it the "preferred"
> > > path
> > > - virtio-net follows the behavior of an active backup bond/team
> > > - virtio-net acts as the interface to the VM
> > > - live migration initiates
> > > - link goes down on SR-IOV, or SR-IOV device is removed
> > > - failover to virtio-net as primary path
> > > - migration continues to new host
> > > - new host is started with virio-net as primary
> > > - if no SR-IOV, virtio-net stays primary
> > > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio
> > > - virtio-net notices new NIC and starts over at enslave step above
> > >
> > > Future ideas (brainstorming):
> > > - Optimize Fast east-west by having special rules to direct
> > > east-west traffic through virtio-net traffic path
> > >
> > > Thanks for reading!
> > > Jesse
> > 
> > Cc netdev.
> > 
> > Interesting, and this method is actually used by netvsc now:
> > 
> > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e
> > Author: stephen hemminger <step...@networkplumber.org>
> > Da

Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-19 Thread Stephen Hemminger
On Tue, 19 Dec 2017 20:07:01 +0200
"Michael S. Tsirkin" <m...@redhat.com> wrote:

> On Tue, Dec 19, 2017 at 09:55:48AM -0800, Stephen Hemminger wrote:
> > On Tue, 19 Dec 2017 09:41:39 -0800
> > "Samudrala, Sridhar" <sridhar.samudr...@intel.com> wrote:
> >   
> > > On 12/19/2017 7:47 AM, Michael S. Tsirkin wrote:  
> > > > I'll need to look at this more, in particular the feature
> > > > bit is missing here. For now one question:
> > > >
> > > > On Mon, Dec 18, 2017 at 04:40:36PM -0800, Sridhar Samudrala wrote:
> > > >> @@ -56,6 +58,8 @@ module_param(napi_tx, bool, 0644);
> > > >>*/
> > > >>   DECLARE_EWMA(pkt_len, 0, 64)
> > > >>   
> > > >> +#define VF_TAKEOVER_INT   (HZ / 10)
> > > >> +
> > > >>   #define VIRTNET_DRIVER_VERSION "1.0.0"
> > > >>   
> > > >>   static const unsigned long guest_offloads[] = {
> > > > Why is this delay necessary? And why by 100ms?
> > > 
> > > This is based on netvsc implementation and here is the commit that
> > > added this delay.  Not sure if this needs to be 100ms.
> > > 
> > > commit 6123c66854c174e4982f98195100c1d990f9e5e6
> > > Author: stephen hemminger <step...@networkplumber.org>
> > > Date:   Wed Aug 9 17:46:03 2017 -0700
> > > 
> > >      netvsc: delay setup of VF device
> > > 
> > >      When VF device is discovered, delay bring it automatically up in
> > >      order to allow userspace to some simple changes (like renaming).
> > > 
> > > 
> > >   
> > 
> > could be 10ms, just enough to let udev do its renaming  
> 
> Isn't there a way not to depend on udev completing its thing within a given 
> timeframe?

Not that I know. the path is quite indirect.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-19 Thread Stephen Hemminger
On Tue, 19 Dec 2017 14:37:50 -0800
"Samudrala, Sridhar" <sridhar.samudr...@intel.com> wrote:

> On 12/19/2017 11:46 AM, Stephen Hemminger wrote:
> > On Tue, 19 Dec 2017 11:42:33 -0800
> > "Samudrala, Sridhar" <sridhar.samudr...@intel.com> wrote:
> >  
> >> On 12/19/2017 10:41 AM, Stephen Hemminger wrote:  
> >>> On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
> >>> David Miller <da...@davemloft.net> wrote:
> >>> 
> >>>> From: Stephen Hemminger <step...@networkplumber.org>
> >>>> Date: Tue, 19 Dec 2017 09:55:48 -0800
> >>>> 
> >>>>> could be 10ms, just enough to let udev do its renaming  
> >>>> Please, move to some kind of notification or event based handling of
> >>>> this problem.
> >>>>
> >>>> No delay is safe, what if userspace gets swapped out or whatever
> >>>> else might make userspace stall unexpectedly?
> >>>> 
> >>> The plan is to remove the delay and do the naming in the kernel.
> >>> This was suggested by Lennart since udev is only doing naming policy
> >>> because kernel names were not repeatable.
> >>>
> >>> This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
> >>>
> >>> Patch is pending.  
> >> Do we really need to delay the setup until the name is changed?
> >> Can't we call dev_set_mtu() and dev_open() until dev_change_name() is done?
> >>
> >> Thanks
> >> Sridhar  
> > You can call dev_set_mtu, but when dev_open is done the device name
> > can not be changed by userspace.  
> I did a quick test to remove the delay and also the dev_open() call and 
> i don't see
> any issues with virtio taking over the VF datapath.
> Only the netdev_info() messages may show old device name.
> 
> Any specific scenario where we need to explicitly call  VF's dev_open() 
> in the VF setup process?
> I tried i40evf driver loaded after virtio_net  AND  virtio_net loading 
> after i40evf.
> 
> Thanks
> Sridhar

It happens with hotplug. It is possible on Hyper-V to hotplug SR-IOV on
and off while guest is running. If SR-IOV is disabled in host then the
VF device is removed (hotplug) and the inverse. If the master device is
up then the VF device should be brought up by the master device.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-19 Thread Stephen Hemminger
On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
David Miller <da...@davemloft.net> wrote:

> From: Stephen Hemminger <step...@networkplumber.org>
> Date: Tue, 19 Dec 2017 09:55:48 -0800
> 
> > could be 10ms, just enough to let udev do its renaming  
> 
> Please, move to some kind of notification or event based handling of
> this problem.
> 
> No delay is safe, what if userspace gets swapped out or whatever
> else might make userspace stall unexpectedly?
> 

The plan is to remove the delay and do the naming in the kernel.
This was suggested by Lennart since udev is only doing naming policy
because kernel names were not repeatable.

This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.

Patch is pending.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available

2017-12-19 Thread Stephen Hemminger
On Tue, 19 Dec 2017 11:42:33 -0800
"Samudrala, Sridhar" <sridhar.samudr...@intel.com> wrote:

> On 12/19/2017 10:41 AM, Stephen Hemminger wrote:
> > On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
> > David Miller <da...@davemloft.net> wrote:
> >  
> >> From: Stephen Hemminger <step...@networkplumber.org>
> >> Date: Tue, 19 Dec 2017 09:55:48 -0800
> >>  
> >>> could be 10ms, just enough to let udev do its renaming  
> >> Please, move to some kind of notification or event based handling of
> >> this problem.
> >>
> >> No delay is safe, what if userspace gets swapped out or whatever
> >> else might make userspace stall unexpectedly?
> >>  
> > The plan is to remove the delay and do the naming in the kernel.
> > This was suggested by Lennart since udev is only doing naming policy
> > because kernel names were not repeatable.
> >
> > This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
> >
> > Patch is pending.  
> Do we really need to delay the setup until the name is changed?
> Can't we call dev_set_mtu() and dev_open() until dev_change_name() is done?
> 
> Thanks
> Sridhar

You can call dev_set_mtu, but when dev_open is done the device name
can not be changed by userspace.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-05 Thread Stephen Hemminger
On Tue, 5 Dec 2017 14:29:28 -0800
Jakub Kicinski  wrote:

> On Tue, 5 Dec 2017 11:59:17 +0200, achiad shochat wrote:
> >  I second Jacob - having a netdev of one device driver enslave a netdev
> >  of another device driver is an awkward a-symmetric model.
> >  Regardless of whether they share the same backend device.
> >  Only I am not sure the Linux Bond is the right choice.
> >  e.g one may well want to use the virtio device also when the
> >  pass-through device is available, e.g for multicasts, east-west
> >  traffic, etc.
> >  I'm not sure the Linux Bond fits that functionality.
> >  And, as I hear in this thread, it is hard to make it work out of the 
> >  box.
> >  So I think the right thing would be to write a new dedicated module
> >  for this purpose.
> > >
> > > This part I can sort of agree with. What if we were to look at
> > > providing a way to somehow advertise that the two devices were meant
> > > to be boded for virtualization purposes? For now lets call it a
> > > "virt-bond". Basically we could look at providing a means for virtio
> > > and VF drivers to advertise that they want this sort of bond. Then it
> > > would just be a matter of providing some sort of side channel to
> > > indicate where you want things like multicast/broadcast/east-west
> > > traffic to go.  
> > 
> > I like this approach.  
> 
> +1 on a separate driver, just enslaving devices to virtio may break
> existing setups.  If people are bonding from user space today, if they
> update their kernel it may surprise them how things get auto-mangled.
> 
> Is what Alex is suggesting a separate PV device that says "I would
> like to be a bond of those two interfaces"?  That would make the HV
> intent explicit and kernel decisions more understandable.

So far, in my experience it still works.
As long as the kernel slaving happens first, it will work.
The attempt to bond an already slaved device will fail and no scripts seem
to check the error return.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-03 Thread Stephen Hemminger
On Sun, 3 Dec 2017 11:14:37 +0200
achiad shochat  wrote:

> On 3 December 2017 at 07:05, Michael S. Tsirkin  wrote:
> > On Fri, Dec 01, 2017 at 12:08:59PM -0800, Shannon Nelson wrote:  
> >> On 11/30/2017 6:11 AM, Michael S. Tsirkin wrote:  
> >> > On Thu, Nov 30, 2017 at 10:08:45AM +0200, achiad shochat wrote:  
> >> > > Re. problem #2:
> >> > > Indeed the best way to address it seems to be to enslave the VF driver
> >> > > netdev under a persistent anchor netdev.
> >> > > And it's indeed desired to allow (but not enforce) PV netdev and VF
> >> > > netdev to work in conjunction.
> >> > > And it's indeed desired that this enslavement logic work out-of-the 
> >> > > box.
> >> > > But in case of PV+VF some configurable policies must be in place (and
> >> > > they'd better be generic rather than differ per PV technology).
> >> > > For example - based on which characteristics should the PV+VF coupling
> >> > > be done? netvsc uses MAC address, but that might not always be the
> >> > > desire.  
> >> >
> >> > It's a policy but not guest userspace policy.
> >> >
> >> > The hypervisor certainly knows.
> >> >
> >> > Are you concerned that someone might want to create two devices with the
> >> > same MAC for an unrelated reason?  If so, hypervisor could easily set a
> >> > flag in the virtio device to say "this is a backup, use MAC to find
> >> > another device".  
> >>
> >> This is something I was going to suggest: a flag or other configuration on
> >> the virtio device to help control how this new feature is used.  I can
> >> imagine this might be useful to control from either the hypervisor side or
> >> the VM side.
> >>
> >> The hypervisor might want to (1) disable it (force it off), (2) enable it
> >> for VM choice, or (3) force it on for the VM.  In case (2), the VM might be
> >> able to chose whether it wants to make use of the feature, or stick with 
> >> the
> >> bonding solution.
> >>
> >> Either way, the kernel is making a feature available, and the user (VM or
> >> hypervisor) is able to control it by selecting the feature based on the
> >> policy desired.
> >>
> >> sln  
> >
> > I'm not sure what's the feature that is available here.
> >
> > I saw this as a flag that says "this device shares backend with another
> > network device which can be found using MAC, and that backend should be
> > preferred".  kernel then forces configuration which uses that other
> > backend - as long as it exists.
> >
> > However, please Cc virtio-dev mailing list if we are doing this since
> > this is a spec extension.
> >
> > --
> > MST  
> 
> 
> Can someone please explain why assume a virtio device is there at all??
> I specified a case where there isn't any.
> 
> I second Jacob - having a netdev of one device driver enslave a netdev
> of another device driver is an awkward a-symmetric model.
> Regardless of whether they share the same backend device.
> Only I am not sure the Linux Bond is the right choice.
> e.g one may well want to use the virtio device also when the
> pass-through device is available, e.g for multicasts, east-west
> traffic, etc.
> I'm not sure the Linux Bond fits that functionality.
> And, as I hear in this thread, it is hard to make it work out of the box.
> So I think the right thing would be to write a new dedicated module
> for this purpose.
> 
> Re policy -
> Indeed the HV can request a policy from the guest but that's not a
> claim for the virtio device enslaving the pass-through device.
> Any policy can be queried by the upper enslaving device.
> 
> Bottom line - I do not see a single reason to have the virtio netdev
> (nor netvsc or any other PV netdev) enslave another netdev by itself.
> If we'd do it right with netvsc from the beginning we wouldn't need
> this discussion at all...

There are several issues with transparent migration.
The first is that the SR-IOV device needs to be shut off for earlier
in the migration process.
Next, the SR-IOV device in the migrated go guest environment maybe different.
It might not exist at all, it might be at a different PCI address, or it
could even be a different vendor/speed/model.
Keeping a virtual network device around allows persisting the connectivity,
during the process.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v10 2/4] net: Introduce generic failover module

2018-05-07 Thread Stephen Hemminger
On Mon,  7 May 2018 15:10:44 -0700
Sridhar Samudrala  wrote:

> +static struct net_device *net_failover_get_bymac(u8 *mac,
> +  struct net_failover_ops **ops)
> +{
> + struct net_device *failover_dev;
> + struct net_failover *failover;
> +
> + spin_lock(_failover_lock);
> + list_for_each_entry(failover, _failover_list, list) {
> + failover_dev = rtnl_dereference(failover->failover_dev);
> + if (ether_addr_equal(failover_dev->perm_addr, mac)) {
> + *ops = rtnl_dereference(failover->ops);
> + spin_unlock(_failover_lock);
> + return failover_dev;
> + }
> + }
> + spin_unlock(_failover_lock);
> + return NULL;
> +}

This is broken if non-ethernet devices such as Infiniband are present.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v10 2/4] net: Introduce generic failover module

2018-05-07 Thread Stephen Hemminger
On Mon,  7 May 2018 15:10:44 -0700
Sridhar Samudrala  wrote:

> This provides a generic interface for paravirtual drivers to listen
> for netdev register/unregister/link change events from pci ethernet
> devices with the same MAC and takeover their datapath. The notifier and
> event handling code is based on the existing netvsc implementation.
> 
> It exposes 2 sets of interfaces to the paravirtual drivers.
> 1. For paravirtual drivers like virtio_net that use 3 netdev model, the
>the failover module provides interfaces to create/destroy additional
>master netdev and all the slave events are managed internally.
>   net_failover_create()
>   net_failover_destroy()
>A failover netdev is created that acts a master device and controls 2
>slave devices. The original virtio_net netdev is registered as 'standby'
>netdev and a passthru/vf device with the same MAC gets registered as
>'primary' netdev. Both 'standby' and 'failover' netdevs are associated
>with the same 'pci' device.  The user accesses the network interface via
>'failover' netdev. The 'failover' netdev chooses 'primary' netdev as
>default for transmits when it is available with link up and running.
> 2. For existing netvsc driver that uses 2 netdev model, no master netdev
>is created. The paravirtual driver registers each instance of netvsc
>as a 'failover' netdev  along with a set of ops to manage the slave
>events. There is no 'standby' netdev in this model. A passthru/vf device
>with the same MAC gets registered as 'primary' netdev.
>   net_failover_register()
>   net_failover_unregister()
> 
> Signed-off-by: Sridhar Samudrala 

You are conflating the net_failover device (3 device model) with
the generic network failover infrastructure into one file. There should be two
seperate files net/core/failover.c and drivers/net/failover.c which splits
the work into two parts (and acts a check for the api).


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v10 2/4] net: Introduce generic failover module

2018-05-07 Thread Stephen Hemminger
On Mon,  7 May 2018 15:10:44 -0700
Sridhar Samudrala  wrote:

> + if (netif_running(failover_dev)) {
> + err = dev_open(slave_dev);
> + if (err && (err != -EBUSY)) {
> + netdev_err(failover_dev, "Opening slave %s failed 
> err:%d\n",
> +slave_dev->name, err);
> + goto err_dev_open;
> + }
> + }
> +
> + netif_addr_lock_bh(failover_dev);
> + dev_uc_sync_multiple(slave_dev, failover_dev);
> + dev_uc_sync_multiple(slave_dev, failover_dev);
> + netif_addr_unlock_bh(failover_dev);
> +

The order of these is backwards, you want to sync addresses before bringing up.
Also, doing it this way does not allow udev/systemd the chance to rename VF 
devices.

The complexity of this whole failover mechanism does not make life easier,
more reliable, or safer for netvsc. I though that was the whole reason for 
having
common code.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-25 Thread Stephen Hemminger
On Thu, 24 May 2018 09:55:14 -0700
Sridhar Samudrala  wrote:

> --- a/drivers/net/hyperv/Kconfig
> +++ b/drivers/net/hyperv/Kconfig
> @@ -2,5 +2,6 @@ config HYPERV_NET
>   tristate "Microsoft Hyper-V virtual network driver"
>   depends on HYPERV
>   select UCS2_STRING
> + select FAILOVER

When I take a working kernel config, add the patches then do
make oldconfig

It is not autoselecting FAILOVER, it prompts me for it. This means
if user says no then a non-working netvsc device is made.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v12 1/5] net: Introduce generic failover module

2018-05-25 Thread Stephen Hemminger
On Thu, 24 May 2018 09:55:13 -0700
Sridhar Samudrala  wrote:

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 03ed492c4e14..0f4ba52b641d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1421,6 +1421,8 @@ struct net_device_ops {
>   *   entity (i.e. the master device for bridged veth)
>   * @IFF_MACSEC: device is a MACsec device
>   * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
> + * @IFF_FAILOVER: device is a failover master device
> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
>   */
>  enum netdev_priv_flags {
>   IFF_802_1Q_VLAN = 1<<0,
> @@ -1450,6 +1452,8 @@ enum netdev_priv_flags {
>   IFF_PHONY_HEADROOM  = 1<<24,
>   IFF_MACSEC  = 1<<25,
>   IFF_NO_RX_HANDLER   = 1<<26,
> + IFF_FAILOVER= 1<<27,
> + IFF_FAILOVER_SLAVE  = 1<<28,
>  };

Why is FAILOVER any different than other master/slave relationships.
I don't think you need to take up precious netdev flag bits for this.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v12 1/5] net: Introduce generic failover module

2018-05-25 Thread Stephen Hemminger
On Thu, 24 May 2018 09:55:13 -0700
Sridhar Samudrala  wrote:


> + spin_lock(_lock);

Since register is not in fast path, this should be a mutex?


> +int failover_slave_unregister(struct net_device *slave_dev)
> +{
> + struct net_device *failover_dev;
> + struct failover_ops *fops;
> +
> + if (!netif_is_failover_slave(slave_dev))
> + goto done;
> +
> + ASSERT_RTNL();
> +
> + failover_dev = failover_get_bymac(slave_dev->perm_addr, );
> + if (!failover_dev)
> + goto done;

Since the slave device must have a master device set already, why not use
that instead of searching by MAC address on unregister or link change.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-25 Thread Stephen Hemminger
On Fri, 25 May 2018 16:11:47 -0700
"Samudrala, Sridhar" <sridhar.samudr...@intel.com> wrote:

> On 5/25/2018 3:34 PM, Stephen Hemminger wrote:
> > On Thu, 24 May 2018 09:55:14 -0700
> > Sridhar Samudrala <sridhar.samudr...@intel.com> wrote:
> >  
> >> --- a/drivers/net/hyperv/Kconfig
> >> +++ b/drivers/net/hyperv/Kconfig
> >> @@ -2,5 +2,6 @@ config HYPERV_NET
> >>tristate "Microsoft Hyper-V virtual network driver"
> >>depends on HYPERV
> >>select UCS2_STRING
> >> +  select FAILOVER  
> > When I take a working kernel config, add the patches then do
> > make oldconfig
> >
> > It is not autoselecting FAILOVER, it prompts me for it. This means
> > if user says no then a non-working netvsc device is made.  
> 
> I see
> Generic failover module (FAILOVER) [M/y/?] (NEW)
> 
> So the user is given an option to either build as a Module or part of the
> kernel. 'n' is not an option.

With most libraries there is no prompt at all.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-30 Thread Stephen Hemminger
On Thu, 24 May 2018 09:55:14 -0700
Sridhar Samudrala  wrote:

> Use the registration/notification framework supported by the generic
> failover infrastructure.
> 
> Signed-off-by: Sridhar Samudrala 

Why was this merged? It was never signed off by any of the netvsc maintainers,
and there were still issues unresolved.

There are also namespaces issues I am fixing and this breaks them.
Will start my patch set with a revert for this. Sorry

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v12 1/5] net: Introduce generic failover module

2018-05-30 Thread Stephen Hemminger
On Fri, 25 May 2018 16:06:58 -0700
"Samudrala, Sridhar"  wrote:

> On 5/25/2018 3:38 PM, Stephen Hemminger wrote:
> > On Thu, 24 May 2018 09:55:13 -0700
> > Sridhar Samudrala  wrote:
> >  
> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> index 03ed492c4e14..0f4ba52b641d 100644
> >> --- a/include/linux/netdevice.h
> >> +++ b/include/linux/netdevice.h
> >> @@ -1421,6 +1421,8 @@ struct net_device_ops {
> >>*   entity (i.e. the master device for bridged veth)
> >>* @IFF_MACSEC: device is a MACsec device
> >>* @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
> >> + * @IFF_FAILOVER: device is a failover master device
> >> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
> >>*/
> >>   enum netdev_priv_flags {
> >>IFF_802_1Q_VLAN = 1<<0,
> >> @@ -1450,6 +1452,8 @@ enum netdev_priv_flags {
> >>IFF_PHONY_HEADROOM  = 1<<24,
> >>IFF_MACSEC  = 1<<25,
> >>IFF_NO_RX_HANDLER   = 1<<26,
> >> +  IFF_FAILOVER= 1<<27,
> >> +  IFF_FAILOVER_SLAVE  = 1<<28,
> >>   };  
> > Why is FAILOVER any different than other master/slave relationships.
> > I don't think you need to take up precious netdev flag bits for this.  
> 
> These are netdev priv flags.
> Jiri says that IFF_MASTER/IFF_SLAVE are bonding specific flags and cannot be 
> used
> with other failover mechanisms. Team also doesn't use this flags and it has 
> its own
> priv_flags.
> 

This change breaks userspace.
We already have worked with partners to ignore devices marked as IFF_SLAVE,
and IFF_SLAVE is visible to user space API's.

NAK
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v12 1/5] net: Introduce generic failover module

2018-05-28 Thread Stephen Hemminger
On Fri, 25 May 2018 16:06:58 -0700
"Samudrala, Sridhar"  wrote:

> On 5/25/2018 3:38 PM, Stephen Hemminger wrote:
> > On Thu, 24 May 2018 09:55:13 -0700
> > Sridhar Samudrala  wrote:
> >  
> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> index 03ed492c4e14..0f4ba52b641d 100644
> >> --- a/include/linux/netdevice.h
> >> +++ b/include/linux/netdevice.h
> >> @@ -1421,6 +1421,8 @@ struct net_device_ops {
> >>*   entity (i.e. the master device for bridged veth)
> >>* @IFF_MACSEC: device is a MACsec device
> >>* @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
> >> + * @IFF_FAILOVER: device is a failover master device
> >> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
> >>*/
> >>   enum netdev_priv_flags {
> >>IFF_802_1Q_VLAN = 1<<0,
> >> @@ -1450,6 +1452,8 @@ enum netdev_priv_flags {
> >>IFF_PHONY_HEADROOM  = 1<<24,
> >>IFF_MACSEC  = 1<<25,
> >>IFF_NO_RX_HANDLER   = 1<<26,
> >> +  IFF_FAILOVER= 1<<27,
> >> +  IFF_FAILOVER_SLAVE  = 1<<28,
> >>   };  
> > Why is FAILOVER any different than other master/slave relationships.
> > I don't think you need to take up precious netdev flag bits for this.  
> 
> These are netdev priv flags.
> Jiri says that IFF_MASTER/IFF_SLAVE are bonding specific flags and cannot be 
> used
> with other failover mechanisms. Team also doesn't use this flags and it has 
> its own
> priv_flags.
> 

They are already used by bonding and team.
I don't see why this can't reuse them.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework

2018-05-31 Thread Stephen Hemminger
On Wed, 30 May 2018 20:03:11 -0700
"Samudrala, Sridhar"  wrote:

> On 5/30/2018 7:06 PM, Stephen Hemminger wrote:
> > On Thu, 24 May 2018 09:55:14 -0700
> > Sridhar Samudrala  wrote:
> >  
> >> Use the registration/notification framework supported by the generic
> >> failover infrastructure.
> >>
> >> Signed-off-by: Sridhar Samudrala   
> > Why was this merged? It was never signed off by any of the netvsc 
> > maintainers,
> > and there were still issues unresolved.
> >
> > There are also namespaces issues I am fixing and this breaks them.
> > Will start my patch set with a revert for this. Sorry  
> 
> I would appreciate if you can make the fixes on top of this patch series. I 
> tried hard
> to make sure that netvsc functionality and behavior doesn't change.
> 
> It is possible that there could be some bugs introduced, but they can be 
> fixed.
> Looks like Wei already found a bug and submitted a fix for that.
> 

Ok, but several of these may clash with what you want for virtio.
Like:
- VF should be moved to namespace of virt device
- VF should be associated based on message from host with serial # not
  registration notifier and MAC address.
- control operations should use master device reference rather than
  searching based on MAC.

As you can see these are structural changes.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v8 4/4] netvsc: refactor notifier/event handling code to use the failover framework

2018-04-26 Thread Stephen Hemminger
On Thu, 26 Apr 2018 05:30:05 +0300
"Michael S. Tsirkin" <m...@redhat.com> wrote:

> On Wed, Apr 25, 2018 at 05:08:37PM -0700, Stephen Hemminger wrote:
> > On Wed, 25 Apr 2018 16:59:28 -0700
> > Sridhar Samudrala <sridhar.samudr...@intel.com> wrote:
> >   
> > > Use the registration/notification framework supported by the generic
> > > failover infrastructure.
> > > 
> > > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>  
> > 
> > NAK unless you prove this works on legacy distributions and with DPDK 18.05
> > without modification.  
> 
> It looks like it should work. What kind of proof are you looking for?
> 


I tried this with working Ubuntu 17 on WS2016.
It boots if the failover driver is configured in (as module).
But if the configuration has:

$ grep FAILOVER .config
# CONFIG_NET_FAILOVER is not set
CONFIG_MAY_USE_NET_FAILOVER=y


The netvsc driver fails on boot with:

[0.826447] hv_vmbus: registering driver hv_netvsc
[0.829616] scsi 0:0:0:0: Direct-Access Msft Virtual Disk 1.0  
PQ: 0 ANSI: 5
[0.836291] input: Microsoft Vmbus HID-compliant Mouse as 
/devices/0006:045E:0621.0001/input/input1
[0.839139] hid-generic 0006:045E:0621.0001: input:  HID v0.01 
Mouse [Microsoft Vmbus HID-compliant Mouse] on
[0.964897] hv_vmbus: probe failed for device 
849a776e-8120-4e4a-9a36-7e3d95ac75b3 (-95)
[0.968039] hv_netvsc: probe of 849a776e-8120-4e4a-9a36-7e3d95ac75b3 failed 
with error -95
[1.112877] hv_vmbus: probe failed for device 
53557f8e-057d-425b-9265-01c0fd7e273e (-95)
[1.116064] hv_netvsc: probe of 53557f8e-057d-425b-9265-01c0fd7e273e failed 
with error -95

The system has two virtual networks. eth0 is on vswitch for management.
eth1 is on vswitch with SR-IOV for performance tests.

You probably need to just put the failover part in net/core and select it.


It is trivial to get an evaluation version of Windows Server 2016 and setup a 
Linux VM.
Please try it.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-28 Thread Stephen Hemminger
On Fri, 26 Jan 2018 18:30:03 -0800
Jakub Kicinski  wrote:

> On Fri, 26 Jan 2018 15:30:35 -0800, Samudrala, Sridhar wrote:
> > On 1/26/2018 2:47 PM, Jakub Kicinski wrote:  
> > > On Sat, 27 Jan 2018 00:14:20 +0200, Michael S. Tsirkin wrote:
> > >> On Fri, Jan 26, 2018 at 01:46:42PM -0800, Siwei Liu wrote:
> >  and the VM is not expected to do any tuning/optimizations on the VF 
> >  driver
> >  directly,
> >  i think the current patch that follows the netvsc model of 2 
> >  netdevs(virtio
> >  and vf) should
> >  work fine.
> > >>> OK. For your use case that's fine. But that's too specific scenario
> > >>> with lots of restrictions IMHO, perhaps very few users will benefit
> > >>> from it, I'm not sure. If you're unwilling to move towards it, we'd
> > >>> take this one and come back with a generic solution that is able to
> > >>> address general use cases for VF/PT live migration .
> > >> I think that's a fine approach. Scratch your own itch!  I imagine a very
> > >> generic virtio-switchdev providing host routing info to guests could
> > >> address lots of usecases. A driver could bind to that one and enslave
> > >> arbitrary other devices.  Sounds reasonable.
> > >>
> > >> But given the fundamental idea of a failover was floated at least as
> > >> early as 2013, and made 0 progress since precisely because it kept
> > >> trying to address more and more features, and given netvsc is already
> > >> using the basic solution with some success, I'm not inclined to block
> > >> this specific effort waiting for the generic one.
> > > I think there is an agreement that the extra netdev will be useful for
> > > more advanced use cases, and is generally preferable.  What is the
> > > argument for not doing that from the start?  If it was made I must have
> > > missed it.  Is it just unwillingness to write the extra 300 lines of
> > > code?  Sounds like a pretty weak argument when adding kernel ABI is at
> > > stake...
> > 
> > I am still not clear on the need for the extra netdev created by 
> > virtio_net. The only advantage i can see is that the stats can be
> > broken between VF and virtio datapaths compared to the aggregrated
> > stats on virtio netdev as seen with the 2 netdev approach.  
> 
> Maybe you're not convinced but multiple arguments were made.
> 
> > With 2 netdev model, any VM image that has a working network 
> > configuration will transparently get VF based acceleration without
> > any changes.   
> 
> Nothing happens transparently.  Things may happen automatically.  The
> VF netdev doesn't disappear with netvsc.  The PV netdev transforms into
> something it did not use to be.  And configures and reports some
> information from the PV (e.g. speed) but PV doesn't pass traffic any
> longer.
> 
> > 3 netdev model breaks this configuration starting with the creation
> > and naming of the 2 devices to udev needing to be aware of master and
> > slave virtio-net devices.  
> 
> I don't understand this comment.  There is one virtio-net device and
> one "virtio-bond" netdev.  And user space has to be aware of the special
> automatic arrangement anyway, because it can't touch the VF.  It
> doesn't make any difference whether it ignores the VF or PV and VF.
> It simply can't touch the slaves, no matter how many there are.
> 
> > Also, from a user experience point of view, loading a virtio-net with
> > BACKUP feature enabled will now show 2 virtio-net netdevs.  
> 
> One virtio-net and one virtio-bond, which represents what's happening.
> 
> > For live migration with advanced usecases that Siwei is suggesting, i 
> > think we need a new driver with a new device type that can track the
> > VF specific feature settings even when the VF driver is unloaded.  

I see no added value of the 3 netdev model, there is no need for a bond
device.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Stephen Hemminger
On Thu, 22 Feb 2018 13:30:12 -0800
Alexander Duyck  wrote:

> > Again, I undertand your motivation. Yet I don't like your solution.
> > But if the decision is made to do this in-driver bonding. I would like
> > to see it baing done some generic way:
> > 1) share the same "in-driver bonding core" code with netvsc
> >put to net/core.
> > 2) the "in-driver bonding core" will strictly limit the functionality,
> >like active-backup mode only, one vf, one backup, vf netdev type
> >check (so noone could enslave a tap or anything else)
> > If user would need something more, he should employ team/bond.  

Sharing would be good, but netvsc world would really like to only have
one visible network device.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-23 Thread Stephen Hemminger
(pruned to reduce thread)

On Wed, 21 Feb 2018 16:17:19 -0800
Alexander Duyck  wrote:

> >>> FWIW two solutions that immediately come to mind is to export "backup"
> >>> as phys_port_name of the backup virtio link and/or assign a name to the
> >>> master like you are doing already.  I think team uses team%d and bond
> >>> uses bond%d, soft naming of master devices seems quite natural in this
> >>> case.  
> >>
> >> I figured I had overlooked something like that.. Thanks for pointing
> >> this out. Okay so I think the phys_port_name approach might resolve
> >> the original issue. If I am reading things correctly what we end up
> >> with is the master showing up as "ens1" for example and the backup
> >> showing up as "ens1nbackup". Am I understanding that right?
> >>
> >> The problem with the team/bond%d approach is that it creates a new
> >> netdevice and so it would require guest configuration changes.
> >>  
> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
> >>> link is quite neat.  
> >>
> >> I agree. For non-"backup" virio_net devices would it be okay for us to
> >> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> >> behavior could be maintained although the function still exists.
> >>  
>  - When the 'active' netdev is unplugged OR not present on a destination
>    system after live migration, the user will see 2 virtio_net netdevs.  
> >>>
> >>> That's necessary and expected, all configuration applies to the master
> >>> so master must exist.  
> >>
> >> With the naming issue resolved this is the only item left outstanding.
> >> This becomes a matter of form vs function.
> >>
> >> The main complaint about the "3 netdev" solution is a bit confusing to
> >> have the 2 netdevs present if the VF isn't there. The idea is that
> >> having the extra "master" netdev there if there isn't really a bond is
> >> a bit ugly.  
> >
> > Is it this uglier in terms of user experience rather than
> > functionality? I don't want it dynamically changed between 2-netdev
> > and 3-netdev depending on the presence of VF. That gets back to my
> > original question and suggestion earlier: why not just hide the lower
> > netdevs from udev renaming and such? Which important observability
> > benefits users may get if exposing the lower netdevs?
> >
> > Thanks,
> > -Siwei  
> 
> The only real advantage to a 2 netdev solution is that it looks like
> the netvsc solution, however it doesn't behave like it since there are
> some features like XDP that may not function correctly if they are
> left enabled in the virtio_net interface.
> 
> As far as functionality the advantage of not hiding the lower devices
> is that they are free to be managed. The problem with pushing all of
> the configuration into the upper device is that you are limited to the
> intersection of the features of the lower devices. This can be
> limiting for some setups as some VFs support things like more queues,
> or better interrupt moderation options than others so trying to make
> everything work with one config would be ugly.
> 


Let's not make XDP the blocker for doing the best solution
from the end user point of view. XDP is just yet another offload
thing which needs to be handled.  The current backup device solution
used in netvsc doesn't handle the full range of offload options
(things like flow direction, DCB, etc); no one but the HW vendors
seems to care.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-01-22 Thread Stephen Hemminger
On Mon, 22 Jan 2018 15:27:40 -0800
"Samudrala, Sridhar"  wrote:

> On 1/22/2018 1:31 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote:  
> >> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin  
> >> wrote:  
> >>> On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:  
> 
>  On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:  
> > On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:  
> >> On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
> >>  wrote:  
> >>> This feature bit can be used by hypervisor to indicate virtio_net 
> >>> device to
> >>> act as a backup for another device with the same MAC address.
> >>>
> >>> Signed-off-by: Sridhar Samudrala 
> >>> ---
> >>>drivers/net/virtio_net.c| 2 +-
> >>>include/uapi/linux/virtio_net.h | 3 +++
> >>>2 files changed, 4 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>> index 12dfc5fee58e..f149a160a8c5 100644
> >>> --- a/drivers/net/virtio_net.c
> >>> +++ b/drivers/net/virtio_net.c
> >>> @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> >>>   VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >>>   VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >>>   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> >>> -   VIRTIO_NET_F_SPEED_DUPLEX
> >>> +   VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
> >>>
> >>>static unsigned int features[] = {
> >>>   VIRTNET_FEATURES,
> >>> diff --git a/include/uapi/linux/virtio_net.h 
> >>> b/include/uapi/linux/virtio_net.h
> >>> index 5de6ed37695b..c7c35fd1a5ed 100644
> >>> --- a/include/uapi/linux/virtio_net.h
> >>> +++ b/include/uapi/linux/virtio_net.h
> >>> @@ -57,6 +57,9 @@
> >>>* Steering */
> >>>#define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
> >>>
> >>> +#define VIRTIO_NET_F_BACKUP  62/* Act as backup for another 
> >>> device
> >>> +* with the same MAC.
> >>> +*/
> >>>#define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and 
> >>> duplex */
> >>>
> >>>#ifndef VIRTIO_NET_NO_LEGACY  
> >> I'm not a huge fan of the name "backup" since that implies that the
> >> Virtio interface is only used if the VF is not present, and there are
> >> multiple instances such as dealing with east/west or
> >> broadcast/multicast traffic where it may be desirable to use the
> >> para-virtual interface rather then deal with PCI overhead/bottleneck
> >> to send the packet.  
> > Right now hypervisors mostly expect that yes, only one at a time is
> > used.  E.g. if you try to do multicast sending packets on both VF and
> > virtio then you will end up with two copies of each packet.  
>  I think we want to use only 1 interface to  send out any packet. In case 
>  of
>  broadcast/multicasts it would be an optimization to send them via virtio 
>  and
>  this patch series adds that optimization.  
> >>> Right that's what I think we should rather avoid for now.
> >>>
> >>> It's *not* an optimization if there's a single VM on this host,
> >>> or if a specific multicast group does not have any VMs on same
> >>> host.  
> >> Agreed. In my mind this is something that is controlled by the
> >> pass-thru interface once it is enslaved.  
> > It would be pretty tricky to control through the PT
> > interface since a PT interface pretends to be a physical
> > device, which has no concept of VMs.
> >  
> >>> I'd rather we just sent everything out on the PT if that's
> >>> there. The reason we have virtio in the picture is just so
> >>> we can migrate without downtime.  
> >> I wasn't saying we do that in all cases. That would be something that
> >> would have to be decided by the pass-thru interface. Ideally the
> >> virtio would provide just enough information to get itself into the
> >> bond and I see this being the mechanism for it to do so. From there
> >> the complexity mostly lies in the pass-thru interface to configure the
> >> correct transmit modes if for example you have multiple pass-thru
> >> interfaces or a more complex traffic setup due to things like
> >> SwitchDev.
> >>
> >> In my mind we go the bonding route and there are few use cases for all
> >> of this. First is the backup case that is being addressed here. That
> >> becomes your basic "copy netvsc" approach for this which would be
> >> default. It is how we would handle basic pass-thru back-up paths. If
> >> the host decides to send multicast/broadcast traffic from the host up
> >> through 

Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Stephen Hemminger
On Tue, 10 Apr 2018 11:59:50 -0700
Sridhar Samudrala  wrote:

> Use the registration/notification framework supported by the generic
> bypass infrastructure.
> 
> Signed-off-by: Sridhar Samudrala 
> ---

Thanks for doing this.  Your current version has couple show stopper
issues.

First, the slave device is instantly taking over the slave.
This doesn't allow udev/systemd to do its device rename of the slave
device. Netvsc uses a delayed work to workaround this.

Secondly, the select queue needs to call queue selection in VF.
The bonding/teaming logic doesn't work well for UDP flows.
Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
fixed this performance problem.

Lastly, more indirection is bad in current climate.

I am not completely adverse to this but it needs to be fast, simple
and completely transparent.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


  1   2   >