Re: [PATCH] AF_VMCHANNEL address family for guest-host communication.
On Mon, 15 Dec 2008 17:01:14 -0600 Anthony Liguori anth...@codemonkey.ws wrote: David Miller wrote: From: Anthony Liguori anth...@codemonkey.ws Date: Mon, 15 Dec 2008 14:44:26 -0600 We want this communication mechanism to be simple and reliable as we want to implement the backends drivers in the host userspace with minimum mess. One implication of your statement here is that TCP is unreliable. That's absolutely not true. No, TCP falls under the not simple category because it requires the backend to have access to a TCP/IP stack. Within the guest, we need the interface to be always available and we need an addressing scheme that is hypervisor specific. Yes, we can build this all on top of TCP/IP. We could even build it on top of a serial port. Both have their down-sides wrt reliability and complexity. I don't know of any zero-copy through the hypervisor mechanisms for serial ports, but I know we do that with the various virtualization network devices. Yes, and I went down the road of using a dedicated network device and using raw ethernet as the protocol. The thing that killed that was the fact that it's not reliable. You need something like TCP to add reliability. But that's a lot of work and a bit backwards. Use a unreliable transport but use TCP on top of it to get reliability. Our link (virtio) is inherently reliable so why not just expose a reliable interface to userspace? Do you have another recommendation? I don't have to make alternative recommendations until you can show that what we have can't solve the problem acceptably, and TCP emphatically can. It can solve the problem but I don't think it's the best way to solve the problem mainly because the complexity it demands on the backend. Those who don't understand TCP are doomed to reimplement it, badly. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [patch 34/54] Staging: hv: remove STRUCT_PACKED and STRUCT_ALIGNED defines
On Fri, 24 Jul 2009 23:32:19 +0200 Jörn Engel jo...@logfs.org wrote: On Tue, 21 July 2009 01:46:41 +0200, Arnd Bergmann wrote: On Friday 17 July 2009, Greg Kroah-Hartman wrote: @@ -43,7 +43,7 @@ typedef struct _RING_BUFFER { // volatile u32 InterruptMask; // Ring data starts here + RingDataStartOffset !!! DO NOT place any fields below this !!! u8 Buffer[0]; -} STRUCT_PACKED RING_BUFFER; +} __attribute__((packed)) RING_BUFFER; The data structure is actually packed already, the attribute does not make it better and could be removed. We also have __packed as a shortcut for __attribute__((packed)). Honestly, I don't know how useful __packed really is. In a shared kernel/userspace header, it is only defined for the kernel. As I remember, gcc generates worse code for packed structures on many architectures since it may have to do byte fetchs/recombining to avoid unaligned accesses. -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH][RFC] net/bridge: add basic VEPA support
On Mon, 15 Jun 2009 17:33:10 + Fischer, Anna anna.fisc...@hp.com wrote: This patch adds basic Virtual Ethernet Port Aggregator (VEPA) capabilities to the Linux kernel Ethernet bridging code. A Virtual Ethernet Port Aggregator (VEPA) is a capability within a physical end station that collaborates with an adjacent, external bridge to provide distributed bridging support between multiple virtual end stations and external networks. The VEPA collaborates by forwarding all station-originated frames to the adjacent bridge for frame processing and frame relay (including so-called 'hairpin' forwarding) and by steering and replicating frames received from the VEPA uplink to the appropriate destinations. A VEPA may be implemented in software or in conjunction with embedded hardware. In particular, the patch extends the Linux Ethernet bridge to act as (1) a VEPA - for this we have added VEPA forwarding functionality and added a configuration option for a VEPA uplink port, or as (2) a bridge supporting 'hairpin' forwarding - for this we have added a bridge port 'hairpin' mode which allows sending frames back out through the port the frame was received on. Configuration of VEPA capabilities through Linux userspace bridge utilities is provided by an additional patch 'bridge-utils: add basic VEPA support'. After reading more about this, I am not convinced this should be part of the bridge code. The bridge code really consists of two parts: forwarding table and optional spanning tree. Well the VEPA code short circuits both of these; it can't imagine it working with STP turned on. The only part of bridge code that really gets used by this are the receive packet hooks and the crufty old API. So instead of adding more stuff to existing bridge code, why not have a new driver for just VEPA. You could do it with a simple version of macvlan type driver. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
On Fri, 7 Aug 2009 14:06:58 -0700 Paul Congdon \(UC Davis\) ptcong...@ucdavis.edu wrote: Yaron, The interface multiplexing can be achieved using macvlan driver or using an SR-IOV capable NIC (the preferred option), macvlan may need to be extended to support VEPA multicast handling, this looks like a rather simple task Agreed that the hardware solution is preferred so the macvlan implementation doesn’t really matter. If we are talking SR-IOV, then it is direct mapped, regardless of whether there is a VEB or VEPA in the hardware below, so you are bypassing the bridge software code also. I disagree that adding the multicast handling is simple – while not conceptually hard, it will basically require you to put an address table into the macvlan implementation – if you have that, then why not have just used the one already in the bridge code. If you hook a VEPA up to a non-hairpin mode external bridge, you get the macvlan capability as well. I have a patch that forwards all multicast packets, and another that does proper forwarding. It should have worked that way in original macvlan, the current behavior is really a bug. -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
On Sun, 09 Aug 2009 14:19:08 +0300 Or Gerlitz ogerl...@voltaire.com wrote: Stephen Hemminger wrote: I have a patch that forwards all multicast packets, and another that does proper forwarding. It should have worked that way in original macvlan, the current behavior is really a bug. Looking in macvlan_set_multicast_list() it acts in a similar manner to macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I assume what's left is to add macvlan_hash_xxx multicast logic to map/unmap multicast groups to what macvlan devices want to receive them and this way the flooding can be removed, correct? The device can just flood all multicast packets, since the filtering is done on the receive path anyway. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
On Mon, 10 Aug 2009 16:32:01 + Fischer, Anna anna.fisc...@hp.com wrote: Subject: Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support On Monday 10 August 2009, Stephen Hemminger wrote: On Sun, 09 Aug 2009 14:19:08 +0300, Or Gerlitz ogerl...@voltaire.com wrote: Looking in macvlan_set_multicast_list() it acts in a similar manner to macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I assume what's left is to add macvlan_hash_xxx multicast logic to map/unmap multicast groups to what macvlan devices want to receive them and this way the flooding can be removed, correct? The device can just flood all multicast packets, since the filtering is done on the receive path anyway. Is this handled by one of the additional patches? In the current kernel tree macvlan code it looks as if multicast filtering is only handled by the physical device driver, but not on particular macvlan devices. But we'd still have to copy the frames to user space (for both macvtap and raw packet sockets) and exit from the guest to inject it into its stack, right? I think it would be nice if you can implement what Or describes for macvlan and avoid flooding, and it doesn't sound too hard to do. I guess one advantage for macvlan (over the bridge) is that you can program in all information you have for the ports attached to it, e.g. MAC addresses and multicast addresses. So you could take advantage of that whereas the bridge always floods multicast frames to all ports. How would this work though, if the OS inside the guest wants to register to a particular multicast address? Is this propagated through the backend drivers to the macvlan/macvtap interface? Sure filtering is better, but multicast performance with large number of guests is really a corner case, not the real performance issue. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] net/bridge: Add 'hairpin' port forwarding mode
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c index eb404dc..e486f1f 100644 --- a/net/bridge/br_if.c +++ b/net/bridge/br_if.c @@ -256,6 +256,7 @@ static struct net_bridge_port *new_nbp(struct net_bridge *br, p-path_cost = port_cost(dev); p-priority = 0x8000 BR_PORT_BITS; p-port_no = index; + p-flags = 0; br_init_port(p); p-state = BR_STATE_DISABLED; br_stp_port_timer_init(p); Minor nit. this is unnecessary since the structure is allocated with kzalloc. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Tue, 1 Sep 2009 14:58:19 +0800 Xin, Xiaohui xiaohui@intel.com wrote: [RFC] Virtual Machine Device Queues (VMDq) support on KVM Network adapter with VMDq technology presents multiple pairs of tx/rx queues, and renders network L2 sorting mechanism based on MAC addresses and VLAN tags for each tx/rx queue pair. Here we present a generic framework, in which network traffic to/from a tx/rx queue pair can be directed from/to a KVM guest without any software copy. Actually this framework can apply to traditional network adapters which have just one tx/rx queue pair. And applications using the same user/kernel interface can utilize this framework to send/receive network traffic directly thru a tx/rx queue pair in a network adapter. We use virtio-net architecture to illustrate the framework. || pop add_buf|| |Qemu process| -TX -- | Guest Kernel | || - -- || |Virtio-net | push get_buf|| | (Backend service) | -RX -- | Virtio-net| || - -- |driver | || push get_buf|| || || | | | AIO (read write) combined with Direct I/O | (which substitute synced file operations) |---| | Host kernel | read: copy-less with directly mapped user | | | space to kernel, payload directly DMAed | | | into user space | | | write: copy-less with directly mapped user | | | space to kernel, payload directly hooked | | | to a skb | | || | (a likely || | queue pair || | instance) || | | || | NIC driver -- TUN/TAP driver | |---| | | traditional adapter or a tx/rx queue pair The basic idea is to utilize the kernel Asynchronous I/O combined with Direct I/O to implements copy-less TUN/TAP device. AIO and Direct I/O is not new to kernel, we still can see it in SCSI tape driver. With traditional file operations, a copying of payload contents from/to the kernel DMA address to/from a user buffer is needed. That's what the copying we want to save. The proposed framework is like this: A TUN/TAP device is bound to a traditional NIC adapter or a tx/rx queue pair in host side. KVM virto-net Backend service, the user space program submits asynchronous read/write I/O requests to the host kernel through TUN/TAP device. The requests are corresponding to the vqueue elements include both transmission receive. They can be queued in one AIO request and later, the completion will be notified through the underlying packets tx/rx processing of the rx/tx queue pair. Detailed path: To guest Virtio-net driver, packets receive corresponding to asynchronous read I/O requests of Backend service. 1) Guest Virtio-net driver provides header and payload address through the receive vqueue to Virtio-net backend service. 2) Virtio-net backend service encapsulates multiple vqueue elements into multiple AIO control blocks and composes them into one AIO read request. 3) Virtio-net backend service uses io_submit() syscall to pass the request to the TUN/TAP device. 4) Virtio-net backend service uses io_getevents() syscall to check the completion of the request. 5) The TUN/TAP driver receives packets from the queue pair of NIC, and prepares for Direct I/O. A modified NIC driver may render a skb which header is allocated in host kernel, but the payload buffer is directly mapped from user space buffer which are rendered through the AIO request by the Backend service. get_user_pages() may do this. For one AIO read request, the TUN/TAP driver maintains a list for the directly mapped buffers, and a NIC driver tries to get the buffers as payload buffer to compose the new skbs. Of course, if getting the buffers fails, then kernel allocated buffers are used. 6) Modern NIC cards now mostly have the header split feature. The NIC queue pair then may directly DMA the payload
Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Mon, 21 Sep 2009 16:37:22 +0930 Rusty Russell ru...@rustcorp.com.au wrote: Actually this framework can apply to traditional network adapters which have just one tx/rx queue pair. And applications using the same user/kernel interface can utilize this framework to send/receive network traffic directly thru a tx/rx queue pair in a network adapter. More importantly, when virtualizations is used with multi-queue NIC's the virtio-net NIC is a single CPU bottleneck. The virtio-net NIC should preserve the parallelism (lock free) using multiple receive/transmit queues. The number of queues should equal the number of CPUs. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [RFC] Virtual Machine Device Queues(VMDq) support on KVM
On Tue, 22 Sep 2009 13:50:54 +0200 Arnd Bergmann a...@arndb.de wrote: On Tuesday 22 September 2009, Michael S. Tsirkin wrote: More importantly, when virtualizations is used with multi-queue NIC's the virtio-net NIC is a single CPU bottleneck. The virtio-net NIC should preserve the parallelism (lock free) using multiple receive/transmit queues. The number of queues should equal the number of CPUs. Yup, multiqueue virtio is on todo list ;-) Note we'll need multiqueue tap for that to help. My idea for that was to open multiple file descriptors to the same macvtap device and let the kernel figure out the right thing to do with that. You can do the same with raw packed sockets in case of vhost_net, but I wouldn't want to add more complexity to the tun/tap driver for this. Arnd Or get tap out of the way entirely. The packets should not have to go out to user space at all (see veth) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2.6.32-rc1] net: VMware virtual Ethernet NIC driver: vmxnet3
On Wed, 30 Sep 2009 14:34:57 -0700 (PDT) Shreyas Bhatewara sbhatew...@vmware.com wrote: Note: your patch was linewrapped again + + +static void +vmxnet3_declare_features(struct vmxnet3_adapter *adapter, bool dma64) +{ + struct net_device *netdev = adapter-netdev; + + netdev-features = NETIF_F_SG | + NETIF_F_HW_CSUM | + NETIF_F_HW_VLAN_TX | + NETIF_F_HW_VLAN_RX | + NETIF_F_HW_VLAN_FILTER | + NETIF_F_TSO | + NETIF_F_TSO6; + + printk(KERN_INFO features: sg csum vlan jf tso tsoIPv6); + + adapter-rxcsum = true; + adapter-jumbo_frame = true; + + if (!disable_lro) { + adapter-lro = true; + printk( lro); + } Why not use NETIF_F_LRO and ethtool to control LRO support? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2.6.32-rc3] net: VMware virtual Ethernet NIC driver: vmxnet3
On Thu, 8 Oct 2009 10:59:26 -0700 (PDT) Shreyas Bhatewara sbhatew...@vmware.com wrote: Hello all, I do not mean to be bothersome but this thread has been unusually silent. Could you please review the patch for me and reply with your comments / acks ? Thanks. -Shreyas Looks fine, but just a minor style nit (can be changed after insertion in mainline). The code: static void vmxnet3_do_poll(struct vmxnet3_adapter *adapter, int budget, int *txd_done, int *rxd_done) { if (unlikely(adapter-shared-ecr)) vmxnet3_process_events(adapter); *txd_done = vmxnet3_tq_tx_complete(adapter-tx_queue, adapter); *rxd_done = vmxnet3_rq_rx_complete(adapter-rx_queue, adapter, budget); } static int vmxnet3_poll(struct napi_struct *napi, int budget) { struct vmxnet3_adapter *adapter = container_of(napi, struct vmxnet3_adapter, napi); int rxd_done, txd_done; vmxnet3_do_poll(adapter, budget, txd_done, rxd_done); if (rxd_done budget) { napi_complete(napi); vmxnet3_enable_intr(adapter, 0); } return rxd_done; } Is simpler if you just have do_poll return rx done value. Probably Gcc inline's it all anyway. static int vmxnet3_do_poll(struct vmxnet3_adapter *adapter, int budget) { if (unlikely(adapter-shared-ecr)) vmxnet3_process_events(adapter); vmxnet3_tq_tx_complete(adapter-tx_queue, adapter); return vmxnet3_rq_rx_complete(adapter-rx_queue, adapter, budget); } static int vmxnet3_poll(struct napi_struct *napi, int budget) { struct vmxnet3_adapter *adapter = container_of(napi, struct vmxnet3_adapter, napi); int rxd_done; rxd_done = vmxnet3_do_poll(adapter, budget); if (rxd_done budget) { napi_complete(napi); vmxnet3_enable_intr(adapter, 0); } return rxd_done; } ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH for-2.6.35] virtio_net: fix oom handling on tx
On Thu, 10 Jun 2010 10:17:07 -0700 Sridhar Samudrala s...@us.ibm.com wrote: On Thu, 2010-06-10 at 18:20 +0300, Michael S. Tsirkin wrote: virtio net will never try to overflow the TX ring, so the only reason add_buf may fail is out of memory. Thus, we can not stop the device until some request completes - there's no guarantee anything at all is outstanding. Make the error message clearer as well: error here does not indicate queue full. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/net/virtio_net.c | 15 --- 1 files changed, 8 insertions(+), 7 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 85615a3..e48a06f 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -563,7 +563,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) struct virtnet_info *vi = netdev_priv(dev); int capacity; -again: /* Free up any pending old buffers before queueing new ones. */ free_old_xmit_skbs(vi); @@ -572,12 +571,14 @@ again: /* This can happen with OOM and indirect buffers. */ if (unlikely(capacity 0)) { - netif_stop_queue(dev); - dev_warn(dev-dev, Unexpected full queue\n); - if (unlikely(!virtqueue_enable_cb(vi-svq))) { - virtqueue_disable_cb(vi-svq); - netif_start_queue(dev); - goto again; + if (net_ratelimit()) { + if (likely(capacity == -ENOMEM)) + dev_warn(dev-dev, +TX queue failure: out of memory\n); + else + dev_warn(dev-dev, +Unexpected TX queue failure: %d\n, +capacity); } return NETDEV_TX_BUSY; } It is not clear to me how xmit_skb() can return -ENOMEM. xmit_skb() calls virtqueue_add_buf_gfp() which can return -ENOSPC. Even vring_add_indirect() doesn't return -ENOMEM on kmalloc failure. It makes more sense to have the device increment tx_droppped, and return NETDEV_TX_OK. Skip the message (or make it a pr_debug()). Network devices do not guarantee packet delivery, and if out of resources then holding more data in the queue is going to hurt not help the situation. -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
On Mon, 12 Jul 2010 20:06:28 -0700 Shreyas Bhatewara sbhatew...@vmware.com wrote: On Thu, 2010-05-06 at 13:21 -0700, Christoph Hellwig wrote: On Wed, May 05, 2010 at 10:52:53AM -0700, Stephen Hemminger wrote: Let me put it bluntly. Any design that allows external code to run in the kernel is not going to be accepted. Out of tree kernel modules are enough of a pain already, why do you expect the developers to add another interface. Exactly. Until our friends at VMware get this basic fact it's useless to continue arguing. Pankaj and Dmitry: you're fine to waste your time on this, but it's not going to go anywhere until you address that fundamental problem. The first thing you need to fix in your archicture is to integrate the VF function code into the kernel tree, and we can work from there. Please post patches doing this if you want to resume the discussion. ___ Pv-drivers mailing list pv-driv...@vmware.com http://mailman2.vmware.com/mailman/listinfo/pv-drivers As discussed, following is the patch to give you an idea about implementation of NPA for vmxnet3 driver. Although the patch is big, I have verified it with checkpatch.pl. It gave 0 errors / warnings. Signed-off-by: Matthieu Bucchaineri matth...@vmware.com Signed-off-by: Shreyas Bhatewara sbhatew...@vmware.com I think the concept won't fly. But you should really at least try running checkpatch to make sure the style conforms. -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
On Mon, 12 Jul 2010 20:06:28 -0700 Shreyas Bhatewara sbhatew...@vmware.com wrote: On Thu, 2010-05-06 at 13:21 -0700, Christoph Hellwig wrote: On Wed, May 05, 2010 at 10:52:53AM -0700, Stephen Hemminger wrote: Let me put it bluntly. Any design that allows external code to run in the kernel is not going to be accepted. Out of tree kernel modules are enough of a pain already, why do you expect the developers to add another interface. Exactly. Until our friends at VMware get this basic fact it's useless to continue arguing. Pankaj and Dmitry: you're fine to waste your time on this, but it's not going to go anywhere until you address that fundamental problem. The first thing you need to fix in your archicture is to integrate the VF function code into the kernel tree, and we can work from there. Please post patches doing this if you want to resume the discussion. ___ Pv-drivers mailing list pv-driv...@vmware.com http://mailman2.vmware.com/mailman/listinfo/pv-drivers As discussed, following is the patch to give you an idea about implementation of NPA for vmxnet3 driver. Although the patch is big, I have verified it with checkpatch.pl. It gave 0 errors / warnings. Signed-off-by: Matthieu Bucchaineri matth...@vmware.com Signed-off-by: Shreyas Bhatewara sbhatew...@vmware.com --- I am surprised, the code seems to use lots of mixed case in places that don't really follow current kernel practice. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH] virtio: fix format of sysfs driver/vendor files
The sysfs files for virtio produce the wrong format and are missing the required newline. The output for virtio bus vendor/device should have the same format as the corresponding entries for PCI devices. Although this technically changes the ABI for sysfs, these files were broken to start with! Signed-off-by: Stephen Hemminger shemmin...@vyatta.com --- a/drivers/virtio/virtio.c 2010-11-09 21:39:31.713916249 -0800 +++ b/drivers/virtio/virtio.c 2010-11-09 21:40:16.072089461 -0800 @@ -9,19 +9,19 @@ static ssize_t device_show(struct device struct device_attribute *attr, char *buf) { struct virtio_device *dev = container_of(_d,struct virtio_device,dev); - return sprintf(buf, %hu, dev-id.device); + return sprintf(buf, 0x%04x\n, dev-id.device); } static ssize_t vendor_show(struct device *_d, struct device_attribute *attr, char *buf) { struct virtio_device *dev = container_of(_d,struct virtio_device,dev); - return sprintf(buf, %hu, dev-id.vendor); + return sprintf(buf, 0x%04x\n, dev-id.vendor); } static ssize_t status_show(struct device *_d, struct device_attribute *attr, char *buf) { struct virtio_device *dev = container_of(_d,struct virtio_device,dev); - return sprintf(buf, 0x%08x, dev-config-get_status(dev)); + return sprintf(buf, 0x%08x\n, dev-config-get_status(dev)); } static ssize_t modalias_show(struct device *_d, struct device_attribute *attr, char *buf) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH]: An implementation of HyperV KVP functionality
On Thu, 11 Nov 2010 13:03:10 -0700 Ky Srinivasan ksriniva...@novell.com wrote: +static char *kvp_keys[KVP_MAX_KEY] = {FullyQualifiedDomainName, + IntegrationServicesVersion, + NetworkAddressIPv4, + NetworkAddressIPv6, + OSBuildNumber, + OSName, + OSMajorVersion, + OSMinorVersion, + OSVersion, + ProcessorArchitecture, + }; Minor nit: static const char *kvp_keys[KVP_MAX_KEY] = { FullQualifiedDomainName, ... +/* + * Global state maintained for transaction that is being processed. + * Note that only one transaction can be active at any point in time. + * + * This state is set when we receive a request from the host; we + * cleanup this state when the transaction is completed - when we respond + * to the host with the key value. + */ + +static u8 *recv_buffer; /* the receive buffer that we allocated */ +static int recv_len; /* number of bytes received. */ +static struct vmbus_channel *recv_channel; /*chn on which we got the request*/ +static u64 recv_req_id; /* request ID. */ +static int kvp_current_index; + I would put all the state variables for the transaction in one structure, +static void kvp_timer_func(unsigned long __data) +{ + u32 key = *((u32 *)__data); + /* +* If the timer fires, the user-mode component has not responded; +* process the pending transaction. +*/ + kvp_respond_to_host(key, Guest timed out); +} delayed_work is sometimes better for things like this, since it runs in user context and can sleep. + case (KVP_MAX_KEY): + /* +* We don't support this key +* and any key beyond this. +*/ + icmsghdrp-status = HV_E_FAIL; + goto callback_done; + case labels do not need parens ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] xen: fix p2m section mismatches
On Thu, 24 Mar 2011 13:34:32 -0700 Randy Dunlap randy.dun...@oracle.com wrote: From: Randy Dunlap randy.dun...@oracle.com Fix section mismatch warnings: set_phys_range_identity() is called by __init xen_set_identity(), so also mark set_phys_range_identity() as __init. then: __early_alloc_p2m() is called set_phys_range_identity(), so also mark __early_alloc_p2m() as __init. WARNING: arch/x86/built-in.o(.text+0x7856): Section mismatch in reference from the function __early_alloc_p2m() to the function .init.text:extend_brk() The function __early_alloc_p2m() references the function __init extend_brk(). This is often because __early_alloc_p2m lacks a __init annotation or the annotation of extend_brk is wrong. WARNING: arch/x86/built-in.o(.text+0x7967): Section mismatch in reference from the function set_phys_range_identity() to the function .init.text:extend_brk() The function set_phys_range_identity() references the function __init extend_brk(). This is often because set_phys_range_identity lacks a __init annotation or the annotation of extend_brk is wrong. Signed-off-by: Randy Dunlap randy.dun...@oracle.com --- arch/x86/xen/p2m.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- linux-2.6.38-git13.orig/arch/x86/xen/p2m.c +++ linux-2.6.38-git13/arch/x86/xen/p2m.c @@ -497,7 +497,7 @@ static bool alloc_p2m(unsigned long pfn) return true; } -bool __early_alloc_p2m(unsigned long pfn) +bool __init __early_alloc_p2m(unsigned long pfn) __early_alloc_p2m could be static as well. -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH] virtio-net: per cpu 64 bit stats
Use per-cpu variables to maintain 64 bit statistics. Compile tested only. Signed-off-by: Stephen Hemminger shemmin...@vyatta.com --- a/drivers/net/virtio_net.c 2011-06-14 15:18:46.448596355 -0400 +++ b/drivers/net/virtio_net.c 2011-06-15 09:54:22.914426067 -0400 @@ -40,6 +40,15 @@ module_param(gso, bool, 0444); #define VIRTNET_SEND_COMMAND_SG_MAX2 +struct virtnet_stats { + struct u64_stats_sync syncp; + u64 tx_bytes; + u64 tx_packets; + + u64 rx_bytes; + u64 rx_packets; +}; + struct virtnet_info { struct virtio_device *vdev; struct virtqueue *rvq, *svq, *cvq; @@ -56,6 +65,9 @@ struct virtnet_info { /* Host will merge rx buffers for big packets (shake it! shake it!) */ bool mergeable_rx_bufs; + /* Active statistics */ + struct virtnet_stats __percpu *stats; + /* Work struct for refilling if we run low on memory. */ struct delayed_work refill; @@ -209,7 +221,6 @@ static int receive_mergeable(struct virt skb-dev-stats.rx_length_errors++; return -EINVAL; } - page = virtqueue_get_buf(vi-rvq, len); if (!page) { pr_debug(%s: rx error: %d buffers missing\n, @@ -217,6 +228,7 @@ static int receive_mergeable(struct virt skb-dev-stats.rx_length_errors++; return -EINVAL; } + if (len PAGE_SIZE) len = PAGE_SIZE; @@ -230,6 +242,7 @@ static int receive_mergeable(struct virt static void receive_buf(struct net_device *dev, void *buf, unsigned int len) { struct virtnet_info *vi = netdev_priv(dev); + struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats); struct sk_buff *skb; struct page *page; struct skb_vnet_hdr *hdr; @@ -265,8 +278,11 @@ static void receive_buf(struct net_devic hdr = skb_vnet_hdr(skb); skb-truesize += skb-data_len; - dev-stats.rx_bytes += skb-len; - dev-stats.rx_packets++; + + u64_stats_update_begin(stats-syncp); + stats-rx_bytes += skb-len; + stats-rx_packets++; + u64_stats_update_begin(stats-syncp); if (hdr-hdr.flags VIRTIO_NET_HDR_F_NEEDS_CSUM) { pr_debug(Needs csum!\n); @@ -515,11 +531,16 @@ static unsigned int free_old_xmit_skbs(s { struct sk_buff *skb; unsigned int len, tot_sgs = 0; + struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats); while ((skb = virtqueue_get_buf(vi-svq, len)) != NULL) { pr_debug(Sent skb %p\n, skb); - vi-dev-stats.tx_bytes += skb-len; - vi-dev-stats.tx_packets++; + + u64_stats_update_begin(stats-syncp); + stats-tx_bytes += skb-len; + stats-tx_packets++; + u64_stats_update_begin(stats-syncp); + tot_sgs += skb_vnet_hdr(skb)-num_sg; dev_kfree_skb_any(skb); } @@ -641,6 +662,40 @@ static int virtnet_set_mac_address(struc return 0; } +static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev, + struct rtnl_link_stats64 *tot) +{ + struct virtnet_info *vi = netdev_priv(dev); + int cpu; + unsigned int start; + + for_each_possible_cpu(cpu) { + struct virtnet_stats __percpu *stats + = per_cpu_ptr(vi-stats, cpu); + u64 tpackets, tbytes, rpackets, rbytes; + + do { + start = u64_stats_fetch_begin(stats-syncp); + tpackets = stats-tx_packets; + tbytes = stats-tx_bytes; + rpackets = stats-rx_packets; + rbytes = stats-rx_bytes; + } while (u64_stats_fetch_retry(stats-syncp, start)); + + tot-rx_packets += rpackets; + tot-tx_packets += tpackets; + tot-rx_bytes += rbytes; + tot-tx_bytes += tbytes; + } + + tot-tx_dropped = dev-stats.tx_dropped; + tot-rx_dropped = dev-stats.rx_dropped; + tot-rx_length_errors = dev-stats.rx_length_errors; + tot-rx_frame_errors = dev-stats.rx_frame_errors; + + return tot; +} + #ifdef CONFIG_NET_POLL_CONTROLLER static void virtnet_netpoll(struct net_device *dev) { @@ -650,6 +705,14 @@ static void virtnet_netpoll(struct net_d } #endif +static void virtnet_free(struct net_device *dev) +{ + struct virtnet_info *vi = netdev_priv(dev); + + free_percpu(vi-stats); + free_netdev(dev); +} + static int virtnet_open(struct net_device *dev) { struct virtnet_info *vi = netdev_priv(dev); @@ -835,6 +898,7 @@ static const struct net_device_ops virtn .ndo_set_mac_address = virtnet_set_mac_address, .ndo_set_rx_mode
Re: [PATCH] virtio-net: per cpu 64 bit stats (v2)
Use per-cpu variables to maintain 64 bit statistics. Signed-off-by: Stephen Hemminger shemmin...@vyatta.com --- a/drivers/net/virtio_net.c 2011-06-14 15:18:46.448596355 -0400 +++ b/drivers/net/virtio_net.c 2011-06-15 12:02:54.860667443 -0400 @@ -40,6 +40,15 @@ module_param(gso, bool, 0444); #define VIRTNET_SEND_COMMAND_SG_MAX2 +struct virtnet_stats { + struct u64_stats_sync syncp; + u64 tx_bytes; + u64 tx_packets; + + u64 rx_bytes; + u64 rx_packets; +}; + struct virtnet_info { struct virtio_device *vdev; struct virtqueue *rvq, *svq, *cvq; @@ -56,6 +65,9 @@ struct virtnet_info { /* Host will merge rx buffers for big packets (shake it! shake it!) */ bool mergeable_rx_bufs; + /* Active statistics */ + struct virtnet_stats __percpu *stats; + /* Work struct for refilling if we run low on memory. */ struct delayed_work refill; @@ -209,7 +221,6 @@ static int receive_mergeable(struct virt skb-dev-stats.rx_length_errors++; return -EINVAL; } - page = virtqueue_get_buf(vi-rvq, len); if (!page) { pr_debug(%s: rx error: %d buffers missing\n, @@ -217,6 +228,7 @@ static int receive_mergeable(struct virt skb-dev-stats.rx_length_errors++; return -EINVAL; } + if (len PAGE_SIZE) len = PAGE_SIZE; @@ -230,6 +242,7 @@ static int receive_mergeable(struct virt static void receive_buf(struct net_device *dev, void *buf, unsigned int len) { struct virtnet_info *vi = netdev_priv(dev); + struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats); struct sk_buff *skb; struct page *page; struct skb_vnet_hdr *hdr; @@ -265,8 +278,11 @@ static void receive_buf(struct net_devic hdr = skb_vnet_hdr(skb); skb-truesize += skb-data_len; - dev-stats.rx_bytes += skb-len; - dev-stats.rx_packets++; + + u64_stats_update_begin(stats-syncp); + stats-rx_bytes += skb-len; + stats-rx_packets++; + u64_stats_update_end(stats-syncp); if (hdr-hdr.flags VIRTIO_NET_HDR_F_NEEDS_CSUM) { pr_debug(Needs csum!\n); @@ -515,11 +531,16 @@ static unsigned int free_old_xmit_skbs(s { struct sk_buff *skb; unsigned int len, tot_sgs = 0; + struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats); while ((skb = virtqueue_get_buf(vi-svq, len)) != NULL) { pr_debug(Sent skb %p\n, skb); - vi-dev-stats.tx_bytes += skb-len; - vi-dev-stats.tx_packets++; + + u64_stats_update_begin(stats-syncp); + stats-tx_bytes += skb-len; + stats-tx_packets++; + u64_stats_update_end(stats-syncp); + tot_sgs += skb_vnet_hdr(skb)-num_sg; dev_kfree_skb_any(skb); } @@ -641,6 +662,40 @@ static int virtnet_set_mac_address(struc return 0; } +static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev, + struct rtnl_link_stats64 *tot) +{ + struct virtnet_info *vi = netdev_priv(dev); + int cpu; + unsigned int start; + + for_each_possible_cpu(cpu) { + struct virtnet_stats __percpu *stats + = per_cpu_ptr(vi-stats, cpu); + u64 tpackets, tbytes, rpackets, rbytes; + + do { + start = u64_stats_fetch_begin(stats-syncp); + tpackets = stats-tx_packets; + tbytes = stats-tx_bytes; + rpackets = stats-rx_packets; + rbytes = stats-rx_bytes; + } while (u64_stats_fetch_retry(stats-syncp, start)); + + tot-rx_packets += rpackets; + tot-tx_packets += tpackets; + tot-rx_bytes += rbytes; + tot-tx_bytes += tbytes; + } + + tot-tx_dropped = dev-stats.tx_dropped; + tot-rx_dropped = dev-stats.rx_dropped; + tot-rx_length_errors = dev-stats.rx_length_errors; + tot-rx_frame_errors = dev-stats.rx_frame_errors; + + return tot; +} + #ifdef CONFIG_NET_POLL_CONTROLLER static void virtnet_netpoll(struct net_device *dev) { @@ -650,6 +705,14 @@ static void virtnet_netpoll(struct net_d } #endif +static void virtnet_free(struct net_device *dev) +{ + struct virtnet_info *vi = netdev_priv(dev); + + free_percpu(vi-stats); + free_netdev(dev); +} + static int virtnet_open(struct net_device *dev) { struct virtnet_info *vi = netdev_priv(dev); @@ -835,6 +898,7 @@ static const struct net_device_ops virtn .ndo_set_mac_address = virtnet_set_mac_address, .ndo_set_rx_mode = virtnet_set_rx_mode
Re: [PATCH] virtio-net: per cpu 64 bit stats
On Wed, 15 Jun 2011 21:33:16 +0300 Michael S. Tsirkin m...@redhat.com wrote: On Wed, Jun 15, 2011 at 11:43:37AM -0400, Stephen Hemminger wrote: Use per-cpu variables to maintain 64 bit statistics. Compile tested only. Signed-off-by: Stephen Hemminger shemmin...@vyatta.com Interesting. Does this help speed at all? --- a/drivers/net/virtio_net.c 2011-06-14 15:18:46.448596355 -0400 +++ b/drivers/net/virtio_net.c 2011-06-15 09:54:22.914426067 -0400 @@ -40,6 +40,15 @@ module_param(gso, bool, 0444); #define VIRTNET_SEND_COMMAND_SG_MAX2 +struct virtnet_stats { + struct u64_stats_sync syncp; + u64 tx_bytes; + u64 tx_packets; + + u64 rx_bytes; + u64 rx_packets; +}; + struct virtnet_info { struct virtio_device *vdev; struct virtqueue *rvq, *svq, *cvq; @@ -56,6 +65,9 @@ struct virtnet_info { /* Host will merge rx buffers for big packets (shake it! shake it!) */ bool mergeable_rx_bufs; + /* Active statistics */ + struct virtnet_stats __percpu *stats; + /* Work struct for refilling if we run low on memory. */ struct delayed_work refill; @@ -209,7 +221,6 @@ static int receive_mergeable(struct virt skb-dev-stats.rx_length_errors++; return -EINVAL; } - page = virtqueue_get_buf(vi-rvq, len); if (!page) { pr_debug(%s: rx error: %d buffers missing\n, @@ -217,6 +228,7 @@ static int receive_mergeable(struct virt skb-dev-stats.rx_length_errors++; return -EINVAL; } + if (len PAGE_SIZE) len = PAGE_SIZE; Let's not tweak whitespace unnecessarily. @@ -230,6 +242,7 @@ static int receive_mergeable(struct virt static void receive_buf(struct net_device *dev, void *buf, unsigned int len) { struct virtnet_info *vi = netdev_priv(dev); + struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats); struct sk_buff *skb; struct page *page; struct skb_vnet_hdr *hdr; @@ -265,8 +278,11 @@ static void receive_buf(struct net_devic hdr = skb_vnet_hdr(skb); skb-truesize += skb-data_len; - dev-stats.rx_bytes += skb-len; - dev-stats.rx_packets++; + + u64_stats_update_begin(stats-syncp); + stats-rx_bytes += skb-len; + stats-rx_packets++; + u64_stats_update_begin(stats-syncp); if (hdr-hdr.flags VIRTIO_NET_HDR_F_NEEDS_CSUM) { pr_debug(Needs csum!\n); @@ -515,11 +531,16 @@ static unsigned int free_old_xmit_skbs(s { struct sk_buff *skb; unsigned int len, tot_sgs = 0; + struct virtnet_stats __percpu *stats = this_cpu_ptr(vi-stats); while ((skb = virtqueue_get_buf(vi-svq, len)) != NULL) { pr_debug(Sent skb %p\n, skb); - vi-dev-stats.tx_bytes += skb-len; - vi-dev-stats.tx_packets++; + + u64_stats_update_begin(stats-syncp); + stats-tx_bytes += skb-len; + stats-tx_packets++; + u64_stats_update_begin(stats-syncp); + tot_sgs += skb_vnet_hdr(skb)-num_sg; dev_kfree_skb_any(skb); } @@ -641,6 +662,40 @@ static int virtnet_set_mac_address(struc return 0; } +static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev, + struct rtnl_link_stats64 *tot) +{ + struct virtnet_info *vi = netdev_priv(dev); + int cpu; + unsigned int start; + + for_each_possible_cpu(cpu) { + struct virtnet_stats __percpu *stats + = per_cpu_ptr(vi-stats, cpu); + u64 tpackets, tbytes, rpackets, rbytes; + + do { + start = u64_stats_fetch_begin(stats-syncp); + tpackets = stats-tx_packets; + tbytes = stats-tx_bytes; + rpackets = stats-rx_packets; + rbytes = stats-rx_bytes; + } while (u64_stats_fetch_retry(stats-syncp, start)); + + tot-rx_packets += rpackets; + tot-tx_packets += tpackets; + tot-rx_bytes += rbytes; + tot-tx_bytes += tbytes; + } + + tot-tx_dropped = dev-stats.tx_dropped; + tot-rx_dropped = dev-stats.rx_dropped; + tot-rx_length_errors = dev-stats.rx_length_errors; + tot-rx_frame_errors = dev-stats.rx_frame_errors; + + return tot; +} + #ifdef CONFIG_NET_POLL_CONTROLLER static void virtnet_netpoll(struct net_device *dev) { @@ -650,6 +705,14 @@ static void virtnet_netpoll(struct net_d } #endif +static void virtnet_free(struct net_device *dev) +{ + struct virtnet_info *vi = netdev_priv(dev); + + free_percpu(vi-stats); + free_netdev(dev); +} + static int virtnet_open(struct net_device *dev
Re: [PATCH 00/40] Staging: hv: Driver cleanup
On Fri, 1 Jul 2011 00:19:38 + KY Srinivasan k...@microsoft.com wrote: -Original Message- From: Stephen Hemminger [mailto:shemmin...@vyatta.com] Sent: Thursday, June 30, 2011 7:48 PM To: KY Srinivasan Cc: Christoph Hellwig; de...@linuxdriverproject.org; gre...@suse.de; linux- ker...@vger.kernel.org; virtualizat...@lists.osdl.org Subject: Re: [PATCH 00/40] Staging: hv: Driver cleanup On Thu, 30 Jun 2011 23:32:34 + KY Srinivasan k...@microsoft.com wrote: -Original Message- From: Christoph Hellwig [mailto:h...@infradead.org] Sent: Thursday, June 30, 2011 3:34 PM To: KY Srinivasan Cc: gre...@suse.de; linux-ker...@vger.kernel.org; de...@linuxdriverproject.org; virtualizat...@lists.osdl.org Subject: Re: [PATCH 00/40] Staging: hv: Driver cleanup On Wed, Jun 29, 2011 at 07:38:21AM -0700, K. Y. Srinivasan wrote: Further cleanup of the hv drivers: 1) Cleanup the reference counting mess for both stor and net devices. I really don't understand the need for reference counting on the storage side, especially now that you only have a SCSI driver. The SCSI midlayer does proper counting on it's objects (Scsi_Host, scsi_device, scsi_cmnd), so you'll get that for free given that SCSI drivers just piggyback on the midlayer lifetime rules. For now your patches should probably go in as-is, but mid-term you should be able to completely remove that code on the storage side. Greg, I am thinking of going back to my original implementation where I had one scsi host per IDE device. This will certainly simply the code. Let me know what you think. If you agree with this approach, please drop this patch-set, I will send you a new set of patches. I think there ref counting on network devices is also unneeded as long as the unregister logic handles RCU correctly. The network layer calls the driver unregister routine after all packets are gone. On the networking side, what about incoming packets that may be racing with the device destruction. The current ref counting scheme deals with that case. Not sure how HV driver tells hypervisor to stop sending packets. But the destructor is not called until after all other CPU's are done processing packets from that device. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
I have been playing with userspace-rcu which has a number of neat lockless routines for queuing and hashing. But there aren't kernel versions and several of them may require cmpxchg to work. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] macvtap: Fix macvtap_get_queue to use rxhash first
On Mon, 28 Nov 2011 12:25:15 +0800 Jason Wang jasow...@redhat.com wrote: I'm using ixgbe for testing also, for host, its driver seems provide irq affinity hint, so no binding or irqbalance is needed. The hint is for irqbalance to use. You need to still do manual affinity or use irqbalance. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] vhost-net: add module alias
By adding the a module alias, programs (or users) won't have to explicitly call modprobe. Vhost-net will always be available if built into the kernel. It does require assigning a permanent minor number for depmod to work. Choose one next to TUN since this driver is related to it. Also, use C99 style initialization. Signed-off-by: Stephen Hemminger shemmin...@vyatta.com --- drivers/vhost/net.c|8 +--- include/linux/miscdevice.h |1 + 2 files changed, 6 insertions(+), 3 deletions(-) --- a/drivers/vhost/net.c 2012-01-10 10:56:58.883179194 -0800 +++ b/drivers/vhost/net.c 2012-01-10 19:48:23.650225892 -0800 @@ -856,9 +856,9 @@ static const struct file_operations vhos }; static struct miscdevice vhost_net_misc = { - MISC_DYNAMIC_MINOR, - vhost-net, - vhost_net_fops, + .minor = VHOST_NET_MINOR, + .name = vhost-net, + .fops = vhost_net_fops, }; static int vhost_net_init(void) @@ -879,3 +879,5 @@ MODULE_VERSION(0.0.1); MODULE_LICENSE(GPL v2); MODULE_AUTHOR(Michael S. Tsirkin); MODULE_DESCRIPTION(Host kernel accelerator for virtio net); +MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR); +MODULE_ALIAS(devname:vhost-net); --- a/include/linux/miscdevice.h2012-01-10 10:56:59.779189436 -0800 +++ b/include/linux/miscdevice.h2012-01-10 19:49:56.091748210 -0800 @@ -31,6 +31,7 @@ #define I2O_MINOR 166 #define MICROCODE_MINOR184 #define TUN_MINOR 200 +#define VHOST_NET_MINOR201 #define MWAVE_MINOR219 /* ACP/Mwave Modem */ #define MPT_MINOR 220 #define MPT2SAS_MINOR 221 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] vhost-net: add module alias
On Wed, 11 Jan 2012 15:43:42 +0800 Amos Kong kongjian...@gmail.com wrote: On Wed, Jan 11, 2012 at 12:54 PM, Stephen Hemminger shemmin...@vyatta.comwrote: By adding the a module alias, programs (or users) won't have to explicitly call modprobe. Vhost-net will always be available if built into the kernel. It does require assigning a permanent minor number for depmod to work. Choose one next to TUN since this driver is related to it. Also, use C99 style initialization. Signed-off-by: Stephen Hemminger shemmin...@vyatta.com --- drivers/vhost/net.c|8 +--- include/linux/miscdevice.h |1 + 2 files changed, 6 insertions(+), 3 deletions(-) : /* * These allocations are managed by dev...@lanana.org. If you use an * entry that is not in assigned your entry may well be moved and * reassigned, or set dynamic if a fixed value is not justified. */ Didn't that mailing address was ever used any more. Like many places in kernel, the comment looked like a historical leftover. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] vhost-net: add module alias
On Wed, 11 Jan 2012 11:07:47 +0400 Michael Tokarev m...@tls.msk.ru wrote: On 11.01.2012 08:54, Stephen Hemminger wrote: By adding the a module alias, programs (or users) won't have to explicitly call modprobe. Vhost-net will always be available if built into the kernel. It does require assigning a permanent minor number for depmod to work. Choose one next to TUN since this driver is related to it. Why do you think a statically-allocated device number will do any good at all? Static /dev is gone almost completely, at least on the systems where whole virt stuff makes any sense, so you don't have pre-created vhost-net device anymore, and hence this allocation makes no sense. Just IMHO anyway. The statically allocated device number is required for the udev/module autoloading to work. Probably the udev infrastructure needs a consistent number to hang off of. It looks like: * driver adds MODULE_ALIAS() for devname and character device * depmod scans modules and creates modules.devname (in /lib/modules) * udev uses modules.devname to autoload the module $ /sbin/modinfo vhost_net filename: /lib/modules/3.2.0-net+/kernel/drivers/vhost/vhost_net.ko alias: devname:vhost-net alias: char-major-10-201 description:Host kernel accelerator for virtio net ... See also: https://lkml.org/lkml/2010/5/21/134 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] vhost-net: add module alias (v2)
By adding the correct module alias, programs won't have to explicitly call modprobe. Vhost-net will always be available if built into the kernel. It does require assigning a permanent minor number for depmod to work. Choose one next to TUN since this driver is related to it. Also, use C99 style initialization. Signed-off-by: Stephen Hemminger shemmin...@vyatta.com --- v2 - document minor number and make sure to not overlap Documentation/devices.txt |2 ++ drivers/vhost/net.c|8 +--- include/linux/miscdevice.h |1 + 3 files changed, 8 insertions(+), 3 deletions(-) --- a/drivers/vhost/net.c 2012-01-10 10:56:58.883179194 -0800 +++ b/drivers/vhost/net.c 2012-01-10 19:48:23.650225892 -0800 @@ -856,9 +856,9 @@ static const struct file_operations vhos }; static struct miscdevice vhost_net_misc = { - MISC_DYNAMIC_MINOR, - vhost-net, - vhost_net_fops, + .minor = VHOST_NET_MINOR, + .name = vhost-net, + .fops = vhost_net_fops, }; static int vhost_net_init(void) @@ -879,3 +879,5 @@ MODULE_VERSION(0.0.1); MODULE_LICENSE(GPL v2); MODULE_AUTHOR(Michael S. Tsirkin); MODULE_DESCRIPTION(Host kernel accelerator for virtio net); +MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR); +MODULE_ALIAS(devname:vhost-net); --- a/include/linux/miscdevice.h2012-01-10 10:56:59.779189436 -0800 +++ b/include/linux/miscdevice.h2012-01-11 09:13:20.803694316 -0800 @@ -42,6 +42,7 @@ #define AUTOFS_MINOR 235 #define MAPPER_CTRL_MINOR 236 #define LOOP_CTRL_MINOR237 +#define VHOST_NET_MINOR238 #define MISC_DYNAMIC_MINOR 255 struct device; --- a/Documentation/devices.txt 2012-01-10 10:56:53.399116518 -0800 +++ b/Documentation/devices.txt 2012-01-11 09:12:49.251197653 -0800 @@ -447,6 +447,8 @@ Your cooperation is appreciated. 234 = /dev/btrfs-controlBtrfs control device 235 = /dev/autofs Autofs control device 236 = /dev/mapper/control Device-Mapper control device + 237 = /dev/vhost-netHost kernel accelerator for virtio net + 240-254 Reserved for local use 255 Reserved for MISC_DYNAMIC_MINOR ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] vhost-net: add module alias (v2.1)
On Mon, 16 Jan 2012 12:26:45 + Alan Cox a...@linux.intel.com wrote: ACKs, NACKs? What is happening here? I would like an Ack from Alan Cox who switched vhost-net to a dynamic minor in the first place, in commit 79907d89c397b8bc2e05b347ec94e928ea919d33. Sorry dev...@lanana.org isn't yet back from the kernel hack incident. I don't read netdev so someone needs to summarise the issue and send me a copy of the patch to look at. Alan Subject: vhost-net: add module alias (v2.1) By adding some module aliases, programs (or users) won't have to explicitly call modprobe. Vhost-net will always be available if built into the kernel. It does require assigning a permanent minor number for depmod to work. Also: - use C99 style initialization. - add missing entry in documentation for loop-control Signed-off-by: Stephen Hemminger shemmin...@vyatta.com --- 2.1 - add missing documentation for loop control as well Documentation/devices.txt |3 +++ drivers/vhost/net.c|8 +--- include/linux/miscdevice.h |1 + 3 files changed, 9 insertions(+), 3 deletions(-) --- a/drivers/vhost/net.c 2012-01-12 14:14:25.681815487 -0800 +++ b/drivers/vhost/net.c 2012-01-12 18:09:56.810680816 -0800 @@ -856,9 +856,9 @@ static const struct file_operations vhos }; static struct miscdevice vhost_net_misc = { - MISC_DYNAMIC_MINOR, - vhost-net, - vhost_net_fops, + .minor = VHOST_NET_MINOR, + .name = vhost-net, + .fops = vhost_net_fops, }; static int vhost_net_init(void) @@ -879,3 +879,5 @@ MODULE_VERSION(0.0.1); MODULE_LICENSE(GPL v2); MODULE_AUTHOR(Michael S. Tsirkin); MODULE_DESCRIPTION(Host kernel accelerator for virtio net); +MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR); +MODULE_ALIAS(devname:vhost-net); --- a/include/linux/miscdevice.h2012-01-12 14:14:25.725815981 -0800 +++ b/include/linux/miscdevice.h2012-01-12 18:09:56.810680816 -0800 @@ -42,6 +42,7 @@ #define AUTOFS_MINOR 235 #define MAPPER_CTRL_MINOR 236 #define LOOP_CTRL_MINOR237 +#define VHOST_NET_MINOR238 #define MISC_DYNAMIC_MINOR 255 struct device; --- a/Documentation/devices.txt 2012-01-12 14:14:25.701815712 -0800 +++ b/Documentation/devices.txt 2012-01-12 18:09:56.814680860 -0800 @@ -447,6 +447,9 @@ Your cooperation is appreciated. 234 = /dev/btrfs-controlBtrfs control device 235 = /dev/autofs Autofs control device 236 = /dev/mapper/control Device-Mapper control device + 237 = /dev/loop-control Loopback control device + 238 = /dev/vhost-netHost kernel accelerator for virtio net + 240-254 Reserved for local use 255 Reserved for MISC_DYNAMIC_MINOR ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [vmw_vmci RFC 01/11] Apply VMCI context code
On Tue, 15 May 2012 08:06:58 -0700 Andrew Stiegmann (stieg) astiegm...@vmware.com wrote: Context code maintains state for vmci and allows the driver to communicate with multiple VMs. Signed-off-by: Andrew Stiegmann (stieg) astiegm...@vmware.com Running checkpatch reveals the usual noise, and the following that should be addressed. ERROR: do not use C99 // comments #272: FILE: drivers/misc/vmw_vmci/vmci_context.c:183: +static bool ctx_exists_locked(uint32_t cid)// IN ERROR: foo * bar should be foo *bar #304: FILE: drivers/misc/vmw_vmci/vmci_context.c:215: + uid_t * user, struct vmci_ctx **outContext) I don't mind the C99 style comments, but the // IN convention is pretty useless and should be removed. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio-net: fix a race on 32bit arches
On Wed, 6 Jun 2012 17:49:42 +0300 Michael S. Tsirkin m...@redhat.com wrote: Sounds good, but I have a question: this realies on counters being atomic on 64 bit. Would not it be better to always use a seqlock even on 64 bit? This way counters would actually be correct and in sync. As it is if we want e.g. average packet size, we can not rely e.g. on it being bytes/packets. This has not been a requirement on real physical devices; therefore the added overhead is not really justified. Many network cards use counters in hardware to count packets/bytes and there is no expectation of atomic access there. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
On Fri, 06 Jul 2012 11:20:06 +0800 Jason Wang jasow...@redhat.com wrote: On 07/05/2012 08:51 PM, Sasha Levin wrote: On Thu, 2012-07-05 at 18:29 +0800, Jason Wang wrote: @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ)) vi-has_cvq = true; + /* Use single tx/rx queue pair as default */ + vi-num_queue_pairs = 1; + vi-total_queue_pairs = num_queue_pairs; The code is using this default even if the amount of queue pairs it wants was specified during initialization. This basically limits any device to use 1 pair when starting up. Yes, currently the virtio-net driver would use 1 txq/txq by default since multiqueue may not outperform in all kinds of workload. So it's better to keep it as default and let user enable multiqueue by ethtool -L. I would prefer that the driver sized number of queues based on number of online CPU's. That is what real hardware does. What kind of workload are you doing? If it is some DBMS benchmark then maybe the issue is that some CPU's need to be reserved. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Question]About KVM network zero-copy feature!
On Fri, 10 Aug 2012 11:34:32 +0800 Peter Huang(Peng) peter.huangp...@huawei.com wrote: Hi,All I searched from git-log, and found that until now we have vhost TX zero-copy experiment feature, how about RX zero-copy? For XEN, net-back also only has TX zero-copy, Is there any reason that RX zero-copy still not implemented? There is no guarantee that packet will ever be read by receiver. This means zero-copy could create memory back pressure stalls. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/6] VSOCK for Linux upstreaming
On Mon, 05 Nov 2012 10:00:38 -0800 George Zhang georgezh...@vmware.com wrote: * * * This series of VSOCK linux upstreaming patches include latest udpate from VMware. Summary of changes: - Add include/linux/socket.h for AF_VSOCK. - Cleanup some comments. - Cleanup makefiles. * * * In an effort to improve the out-of-the-box experience with Linux kernels for VMware users, VMware is working on readying the Virtual Machine Communication Interface (vmw_vmci) and VMCI Sockets (VSOCK) (vmw_vsock) kernel modules for inclusion in the Linux kernel. The purpose of this post is to acquire feedback on the vmw_vsock kernel module. The vmw_vmci kernel module has been presented in an early post. * * * VMCI Sockets allows virtual machines to communicate with host kernel modules and the VMware hypervisors. VMCI Sockets kernel module has dependency on VMCI kernel module. User level applications both in a virtual machine and on the host can use vmw_vmci through VMCI Sockets API which facilitates fast and efficient communication between guest virtual machines and their host. A socket address family designed to be compatible with UDP and TCP at the interface level. Today, VMCI and VMCI Sockets are used by the VMware shared folders (HGFS) and various VMware Tools components inside the guest for zero-config, network-less access to VMware host services. In addition to this, VMware's users are using VMCI Sockets for various applications, where network access of the virtual machine is restricted or non-existent. Examples of this are VMs communicating with device proxies for proprietary hardware running as host applications and automated testing of applications running within virtual machines. The VMware VMCI Sockets are similar to other socket types, like Berkeley UNIX socket interface. The VMCI sockets module supports both connection-oriented stream sockets like TCP, and connectionless datagram sockets like UDP. The VSOCK protocol family is defined as AF_VSOCK and the socket operations split for SOCK_DGRAM and SOCK_STREAM. For additional information about the use of VMCI and in particular VMCI Sockets, please refer to the VMCI Socket Programming Guide available at https://www.vmware.com/support/developer/vmci-sdk/. This should go to netdev as well since it is a new address family. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/6] VSOCK for Linux upstreaming
Never mind, mail server seemed to be overloaded today. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] VSOCK: vsock protocol implementation.
On Mon, 05 Nov 2012 10:00:51 -0800 George Zhang georgezh...@vmware.com wrote: + /* Added in 2.6.10. */ + .owner = THIS_MODULE, Thanks for submitting this, it will make life easier for distro's that now have to go through extra effort to include out of mainline support for Vmware. You did some scrubbing of the macro's to support multiple kernel versions, but there are still some leftovers. This code seems to have a lot of this added in version xxx type comments. These are probably not a good idea to include in the mainline kernel code. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent
On Mon, 26 Nov 2012 15:56:52 +0800 Jason Wang jasow...@redhat.com wrote: Some deivces do not free the old tx skbs immediately after it has been sent (usually in tx interrupt). One such example is virtio-net which optimizes for virt and only free the possible old tx skbs during the next packet sending. This would lead the pktgen to wait forever in the refcount of the skb if no other pakcet will be sent afterwards. Solving this issue by introducing a new flag IFF_TX_SKB_FREE_DELAY which could notify the pktgen that the device does not free skb immediately after it has been sent and let it not to wait for the refcount to be one. Signed-off-by: Jason Wang jasow...@redhat.com Another alternative would be using skb_orphan() and skb-destructor. There are other cases where skb's are not freed right away. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent
On Tue, 27 Nov 2012 14:45:13 +0800 Jason Wang jasow...@redhat.com wrote: On 11/27/2012 01:37 AM, Stephen Hemminger wrote: On Mon, 26 Nov 2012 15:56:52 +0800 Jason Wang jasow...@redhat.com wrote: Some deivces do not free the old tx skbs immediately after it has been sent (usually in tx interrupt). One such example is virtio-net which optimizes for virt and only free the possible old tx skbs during the next packet sending. This would lead the pktgen to wait forever in the refcount of the skb if no other pakcet will be sent afterwards. Solving this issue by introducing a new flag IFF_TX_SKB_FREE_DELAY which could notify the pktgen that the device does not free skb immediately after it has been sent and let it not to wait for the refcount to be one. Signed-off-by: Jason Wang jasow...@redhat.com Another alternative would be using skb_orphan() and skb-destructor. There are other cases where skb's are not freed right away. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Hi Stephen: Do you mean registering a skb-destructor for pktgen then set and check bits in skb-tx_flag? Yes. Register a destructor that does something like update a counter (number of packets pending), then just spin while number of packets pending is over threshold. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent
On Wed, 28 Nov 2012 14:48:52 +0800 Jason Wang jasow...@redhat.com wrote: On 11/28/2012 12:49 AM, Stephen Hemminger wrote: On Tue, 27 Nov 2012 14:45:13 +0800 Jason Wang jasow...@redhat.com wrote: On 11/27/2012 01:37 AM, Stephen Hemminger wrote: On Mon, 26 Nov 2012 15:56:52 +0800 Jason Wang jasow...@redhat.com wrote: Some deivces do not free the old tx skbs immediately after it has been sent (usually in tx interrupt). One such example is virtio-net which optimizes for virt and only free the possible old tx skbs during the next packet sending. This would lead the pktgen to wait forever in the refcount of the skb if no other pakcet will be sent afterwards. Solving this issue by introducing a new flag IFF_TX_SKB_FREE_DELAY which could notify the pktgen that the device does not free skb immediately after it has been sent and let it not to wait for the refcount to be one. Signed-off-by: Jason Wang jasow...@redhat.com Another alternative would be using skb_orphan() and skb-destructor. There are other cases where skb's are not freed right away. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Hi Stephen: Do you mean registering a skb-destructor for pktgen then set and check bits in skb-tx_flag? Yes. Register a destructor that does something like update a counter (number of packets pending), then just spin while number of packets pending is over threshold. -- Not sure this is the best method, since pktgen was used to test the tx process of the device driver and NIC. If we use skb_orhpan(), we would miss the test of tx completion part. There are other places that delay freeing and your solution would mean finding and fixing all those. Code that does that already has to use skb_orphan() to work, and I was looking for a way that could use that. Introducing another flag value seems like a long term burden. Alternatively, virtio could do cleanup more aggressively. Maybe in response to ring getting half full, or add a cleanup timer or something to avoid the problem. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [net-next RFC] pktgen: don't wait for the device who doesn't free skb immediately after sent
On Mon, 03 Dec 2012 14:45:46 +0800 Jason Wang jasow...@redhat.com wrote: On Tuesday, November 27, 2012 08:49:19 AM Stephen Hemminger wrote: On Tue, 27 Nov 2012 14:45:13 +0800 Jason Wang jasow...@redhat.com wrote: On 11/27/2012 01:37 AM, Stephen Hemminger wrote: On Mon, 26 Nov 2012 15:56:52 +0800 Jason Wang jasow...@redhat.com wrote: Some deivces do not free the old tx skbs immediately after it has been sent (usually in tx interrupt). One such example is virtio-net which optimizes for virt and only free the possible old tx skbs during the next packet sending. This would lead the pktgen to wait forever in the refcount of the skb if no other pakcet will be sent afterwards. Solving this issue by introducing a new flag IFF_TX_SKB_FREE_DELAY which could notify the pktgen that the device does not free skb immediately after it has been sent and let it not to wait for the refcount to be one. Signed-off-by: Jason Wang jasow...@redhat.com Another alternative would be using skb_orphan() and skb-destructor. There are other cases where skb's are not freed right away. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Hi Stephen: Do you mean registering a skb-destructor for pktgen then set and check bits in skb-tx_flag? Yes. Register a destructor that does something like update a counter (number of packets pending), then just spin while number of packets pending is over threshold. Have some experiments on this, looks like it does not work weel when clone_skb is used. For driver that call skb_orphan() in ndo_start_xmit, the destructor is only called when the first packet were sent, but what we need to know is when the last were sent. Any thoughts on this or we can just introduce another flag (anyway we have something like IFF_TX_SKB_SHARING) ? The SKB_SHARING flag looks like the best solution then. Surprisingly, transmit buffer completion is a major bottleneck for 10G devices, and I suspect more changes will come. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/3] virtio: change to_vp_device to an inlined definition
On Wed, 5 Dec 2012 15:03:27 +0800 Wanlong Gao gaowanl...@cn.fujitsu.com wrote: to_vp_device is worth changing to inlined definition. Signed-off-by: Wanlong Gao gaowanl...@cn.fujitsu.com --- drivers/virtio/virtio_pci.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index e3ecc94..7681fe3 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -98,11 +98,7 @@ static struct pci_device_id virtio_pci_id_table[] = { MODULE_DEVICE_TABLE(pci, virtio_pci_id_table); -/* Convert a generic virtio device to our structure */ -static struct virtio_pci_device *to_vp_device(struct virtio_device *vdev) -{ - return container_of(vdev, struct virtio_pci_device, vdev); -} +#define to_vp_device(_vdev) container_of(_vdev, struct virtio_pci_device, vdev) Just mark the function as inline. A macro loses type checking. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v3 1/3] virtio-net: separate fields of sending/receiving queue from virtnet_info
Minor style issue reported by checkpatch which can be fixed after merge. Although sizeof is actually an operator in C, it is considered correct style to treat it as a function. WARNING: sizeof hdr-hdr should be sizeof(hdr-hdr) #293: FILE: drivers/net/virtio_net.c:395: + sg_set_buf(rq-sg, hdr-hdr, sizeof hdr-hdr); WARNING: sizeof hdr-mhdr should be sizeof(hdr-mhdr) #552: FILE: drivers/net/virtio_net.c:641: + sg_set_buf(sq-sg, hdr-mhdr, sizeof hdr-mhdr); WARNING: sizeof hdr-hdr should be sizeof(hdr-hdr) #555: FILE: drivers/net/virtio_net.c:643: + sg_set_buf(sq-sg, hdr-hdr, sizeof hdr-hdr); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v3 0/3] Multiqueue support in virtio-net
On Fri, 07 Dec 2012 15:35:56 -0500 (EST) David Miller da...@davemloft.net wrote: From: Jason Wang jasow...@redhat.com Date: Sat, 8 Dec 2012 01:04:54 +0800 This series is an update version (hope the final version) of multiqueue (VIRTIO_NET_F_MQ) support in virtio-net driver. All previous comments were addressed, the work were based on Krishna Kumar's work to let virtio-net use multiple rx/tx queues to do the packets reception and transmission. Performance test show the aggregate latency were increased greately but may get some regression in small packet transmission. Due to this, multiqueue were disabled by default. If user want to benefit form the multiqueue, ethtool -L could be used to enable the feature. Please review and comments. A protype implementation of qemu-kvm support could by found in git://github.com/jasowang/qemu-kvm-mq.git. To start a guest with two queues, you could specify the queues parameters to both tap and virtio-net like: ./qemu-kvm -netdev tap,queues=2,... -device virtio-net-pci,queues=2,... then enable the multiqueue through ethtool by: ethtool -L eth0 combined 2 It seems like most, if not all, of the feedback given for this series has been addressed by Jason. Can I get some ACKs? Other than the minor style nit in the first patch, I see no issues. This is really needed by Virtual Routers. Acked-by: Stephen Hemminger shemmin...@vyatta.com ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] virtio: make config_ops const
It is just a table of function pointers, make it const for cleanliness and security reasons. Signed-off-by: Stephen Hemminger step...@networkplumber.org --- a/drivers/lguest/lguest_device.c2013-02-02 20:03:23.285506053 +1100 +++ b/drivers/lguest/lguest_device.c2013-02-02 20:06:15.253510658 +1100 @@ -396,7 +396,7 @@ static const char *lg_bus_name(struct vi } /* The ops structure which hooks everything together. */ -static struct virtio_config_ops lguest_config_ops = { +static const struct virtio_config_ops lguest_config_ops = { .get_features = lg_get_features, .finalize_features = lg_finalize_features, .get = lg_get, --- a/drivers/remoteproc/remoteproc_virtio.c2013-02-02 20:03:23.289506053 +1100 +++ b/drivers/remoteproc/remoteproc_virtio.c2013-02-02 20:06:10.097510520 +1100 @@ -222,7 +222,7 @@ static void rproc_virtio_finalize_featur rvdev-gfeatures = vdev-features[0]; } -static struct virtio_config_ops rproc_virtio_config_ops = { +static const struct virtio_config_ops rproc_virtio_config_ops = { .get_features = rproc_virtio_get_features, .finalize_features = rproc_virtio_finalize_features, .find_vqs = rproc_virtio_find_vqs, --- a/drivers/s390/kvm/kvm_virtio.c 2013-01-17 16:06:04.691116116 +1100 +++ b/drivers/s390/kvm/kvm_virtio.c 2013-02-02 20:06:04.121510360 +1100 @@ -275,7 +275,7 @@ static const char *kvm_bus_name(struct v /* * The config ops structure as defined by virtio config */ -static struct virtio_config_ops kvm_vq_configspace_ops = { +static const struct virtio_config_ops kvm_vq_configspace_ops = { .get_features = kvm_get_features, .finalize_features = kvm_finalize_features, .get = kvm_get, --- a/drivers/virtio/virtio_mmio.c 2013-01-09 04:04:12.010765226 +1100 +++ b/drivers/virtio/virtio_mmio.c 2013-02-02 20:05:50.937510007 +1100 @@ -423,7 +423,7 @@ static const char *vm_bus_name(struct vi return vm_dev-pdev-name; } -static struct virtio_config_ops virtio_mmio_config_ops = { +static const struct virtio_config_ops virtio_mmio_config_ops = { .get= vm_get, .set= vm_set, .get_status = vm_get_status, --- a/drivers/virtio/virtio_pci.c 2013-01-09 04:04:12.010765226 +1100 +++ b/drivers/virtio/virtio_pci.c 2013-02-02 20:05:40.009509714 +1100 @@ -652,7 +652,7 @@ static int vp_set_vq_affinity(struct vir return 0; } -static struct virtio_config_ops virtio_pci_config_ops = { +static const struct virtio_config_ops virtio_pci_config_ops = { .get= vp_get, .set= vp_set, .get_status = vp_get_status, --- a/include/linux/virtio.h2013-01-09 04:04:12.122765257 +1100 +++ b/include/linux/virtio.h2013-02-02 20:05:31.433509485 +1100 @@ -78,7 +78,7 @@ struct virtio_device { int index; struct device dev; struct virtio_device_id id; - struct virtio_config_ops *config; + const struct virtio_config_ops *config; struct list_head vqs; /* Note that this is a Linux set_bit-style bitmap. */ unsigned long features[1]; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] x86: Support compiling out userspace I/O (iopl and ioperm)
I/O from userspace is used to implement usermode virtio driver(s). This has been done independently by Intel, Brocade/Vyatta, and 6Wind. Sorry, it has to stay. On Mon, Oct 21, 2013 at 7:35 PM, Josh Triplett j...@joshtriplett.org wrote: On the vast majority of modern systems, no processes will use the userspsace I/O syscalls, iopl and ioperm. Add a new config option, CONFIG_X86_IOPORT, to support configuring them out of the kernel entirely. Since these syscalls only exist to support rare legacy userspace programs, X86_IOPORT does not depend on EXPERT, though it does still default to y. In addition to saving a significant amount of space, this also reduces the size of several major kernel data structures, drops a bit of code from several task-related hot paths, and reduces the attack surface of the kernel. All of the task-related code dealing with userspace I/O permissions now lives in process-io.h as several new inline functions, which become no-ops when CONFIG_X86_IOPORT. bloat-o-meter shows a net reduction of 17681 bytes on 32-bit and 9719 bytes on 64-bit: 32-bit bloat-o-meter: add/remove: 0/3 grow/shrink: 0/10 up/down: 0/-17681 (-17681) function old new delta cpu_init 676 668 -8 ioperm_active 18 7 -11 init_task 12961284 -12 exit_thread 179 91 -88 ioperm_get 103 10 -93 __switch_to_xtra 254 161 -93 sys_iopl 127 --127 SyS_iopl 127 --127 copy_thread 606 446-160 vt_ioctl41273919-208 sys_ioperm 370 --370 init_tss8576 384 -8192 doublefault_tss 8576 384 -8192 64-bit bloat-o-meter: add/remove: 0/4 grow/shrink: 2/9 up/down: 45/-9764 (-9719) function old new delta cpu_init 958 995 +37 arch_align_stack 78 86 +8 perf_event_exit_task 525 517 -8 ioperm_active 17 8 -9 init_task 19681944 -24 stub_iopl 81 - -81 ioperm_get 111 11-100 __switch_to_xtra 281 164-117 exit_thread 212 92-120 vt_ioctl44324304-128 sys_iopl 137 --137 SyS_iopl 137 --137 copy_thread 694 520-174 sys_ioperm 473 --473 init_tss8896 640 -8256 Signed-off-by: Josh Triplett j...@joshtriplett.org --- arch/x86/Kconfig | 10 + arch/x86/include/asm/paravirt.h | 2 + arch/x86/include/asm/paravirt_types.h | 2 + arch/x86/include/asm/processor.h | 51 + arch/x86/include/asm/syscalls.h | 3 ++ arch/x86/kernel/Makefile | 3 +- arch/x86/kernel/cpu/common.c | 12 +- arch/x86/kernel/entry_64.S| 9 +++-- arch/x86/kernel/paravirt.c| 2 + arch/x86/kernel/process-io.h | 71 +++ arch/x86/kernel/process.c | 34 ++--- arch/x86/kernel/process_32.c | 11 +- arch/x86/kernel/ptrace.c | 8 arch/x86/xen/enlighten.c | 4 ++ drivers/tty/vt/vt_ioctl.c | 2 +- kernel/sys_ni.c | 5 +++ 16 files changed, 168 insertions(+), 61 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index e241a19..d5b1e68 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -976,6 +976,16 @@ config VM86 XFree86 to initialize some video cards via BIOS. Disabling this option saves about 6k. +config X86_IOPORT + bool iopl and ioperm system calls + default y + ---help--- + This option enables the iopl and ioperm system calls, which allow + privileged userspace processes to directly access I/O ports. This + is used by some legacy software to drive hardware directly from + userspace rather than via a proper kernel driver. Unless you intend + to run such software, you can safely say N here. + config TOSHIBA tristate
[PATCH net-next 3/3] virtio_net: spelling fixes
Signed-off-by: Stephen Hemminger step...@networkplumber.org --- a/drivers/net/virtio_net.c 2013-12-09 16:12:41.409051865 -0800 +++ b/drivers/net/virtio_net.c 2013-12-09 16:12:43.872996856 -0800 @@ -873,7 +873,7 @@ static netdev_tx_t start_xmit(struct sk_ /* * Send command via the control virtqueue and check status. Commands * supported by the hypervisor, as indicated by feature bits, should - * never fail unless improperly formated. + * never fail unless improperly formatted. */ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, struct scatterlist *out) @@ -1061,7 +1061,7 @@ static void virtnet_set_rx_mode(struct n void *buf; int i; - /* We can't dynamicaly set ndo_set_rx_mode, so return gracefully */ + /* We can't dynamically set ndo_set_rx_mode, so return gracefully */ if (!virtio_has_feature(vi-vdev, VIRTIO_NET_F_CTRL_RX)) return; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next 1/3] virtio_net: set multicast filter list to host
The virtio_net driver never sends the multicast address list to the host. This is because send command takes a pointer to scatter list to send but only inserts that one entry into the outgoing scatter list. This bug has been there since: commit f565a7c259d71cc186753653d978c646d2354b36 Author: Alex Williamson alex.william...@hp.com Date: Wed Feb 4 09:02:45 2009 + virtio_net: Add a MAC filter table Signed-off-by: Stephen Hemminger step...@networkplumber.org --- a/drivers/net/virtio_net.c 2013-12-09 16:12:03.897891975 -0800 +++ b/drivers/net/virtio_net.c 2013-12-09 16:12:36.353164803 -0800 @@ -893,7 +893,7 @@ static bool virtnet_send_command(struct sg_init_one(hdr, ctrl, sizeof(ctrl)); sgs[out_num++] = hdr; - if (out) + for (; out; out = sg_next(out)) sgs[out_num++] = out; if (in) sgs[out_num + in_num++] = in; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next 2/3] virtio_net: remove unused parameter to send_command
All the code passes NULL for the last sg list (in). Simplify by just removing it. Signed-off-by: Stephen Hemminger step...@networkplumber.org --- a/drivers/net/virtio_net.c 2013-12-09 16:12:36.353164803 -0800 +++ b/drivers/net/virtio_net.c 2013-12-09 16:12:41.409051865 -0800 @@ -876,13 +876,12 @@ static netdev_tx_t start_xmit(struct sk_ * never fail unless improperly formated. */ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, -struct scatterlist *out, -struct scatterlist *in) +struct scatterlist *out) { struct scatterlist *sgs[4], hdr, stat; struct virtio_net_ctrl_hdr ctrl; virtio_net_ctrl_ack status = ~0; - unsigned out_num = 0, in_num = 0, tmp; + unsigned out_num = 0, tmp; /* Caller should know better */ BUG_ON(!virtio_has_feature(vi-vdev, VIRTIO_NET_F_CTRL_VQ)); @@ -895,16 +894,13 @@ static bool virtnet_send_command(struct for (; out; out = sg_next(out)) sgs[out_num++] = out; - if (in) - sgs[out_num + in_num++] = in; /* Add return status. */ sg_init_one(stat, status, sizeof(status)); - sgs[out_num + in_num++] = stat; + sgs[out_num] = stat; - BUG_ON(out_num + in_num ARRAY_SIZE(sgs)); - BUG_ON(virtqueue_add_sgs(vi-cvq, sgs, out_num, in_num, vi, GFP_ATOMIC) - 0); + BUG_ON(out_num + 1 ARRAY_SIZE(sgs)); + BUG_ON(virtqueue_add_sgs(vi-cvq, sgs, out_num, 1, vi, GFP_ATOMIC) 0); if (unlikely(!virtqueue_kick(vi-cvq))) return status == VIRTIO_NET_OK; @@ -934,8 +930,7 @@ static int virtnet_set_mac_address(struc if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_MAC_ADDR)) { sg_init_one(sg, addr-sa_data, dev-addr_len); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC, - VIRTIO_NET_CTRL_MAC_ADDR_SET, - sg, NULL)) { + VIRTIO_NET_CTRL_MAC_ADDR_SET, sg)) { dev_warn(vdev-dev, Failed to set mac address by vq command.\n); return -EINVAL; @@ -1008,7 +1003,7 @@ static void virtnet_ack_link_announce(st { rtnl_lock(); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_ANNOUNCE, - VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL, NULL)) + VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL)) dev_warn(vi-dev-dev, Failed to ack link announce.\n); rtnl_unlock(); } @@ -1026,7 +1021,7 @@ static int virtnet_set_queues(struct vir sg_init_one(sg, s, sizeof(s)); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ, - VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, sg, NULL)) { + VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET, sg)) { dev_warn(dev-dev, Fail to set num of queue pairs to %d\n, queue_pairs); return -EINVAL; @@ -1076,16 +1071,14 @@ static void virtnet_set_rx_mode(struct n sg_init_one(sg, promisc, sizeof(promisc)); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX, - VIRTIO_NET_CTRL_RX_PROMISC, - sg, NULL)) + VIRTIO_NET_CTRL_RX_PROMISC, sg)) dev_warn(dev-dev, Failed to %sable promisc mode.\n, promisc ? en : dis); sg_init_one(sg, allmulti, sizeof(allmulti)); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX, - VIRTIO_NET_CTRL_RX_ALLMULTI, - sg, NULL)) + VIRTIO_NET_CTRL_RX_ALLMULTI, sg)) dev_warn(dev-dev, Failed to %sable allmulti mode.\n, allmulti ? en : dis); @@ -1121,8 +1114,7 @@ static void virtnet_set_rx_mode(struct n sizeof(mac_data-entries) + (mc_count * ETH_ALEN)); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC, - VIRTIO_NET_CTRL_MAC_TABLE_SET, - sg, NULL)) + VIRTIO_NET_CTRL_MAC_TABLE_SET, sg)) dev_warn(dev-dev, Failed to set MAC filter table.\n); kfree(buf); @@ -1137,7 +1129,7 @@ static int virtnet_vlan_rx_add_vid(struc sg_init_one(sg, vid, sizeof(vid)); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN, - VIRTIO_NET_CTRL_VLAN_ADD, sg, NULL)) + VIRTIO_NET_CTRL_VLAN_ADD, sg)) dev_warn(dev-dev, Failed to add VLAN ID %d.\n, vid); return 0; } @@ -1151,7 +1143,7 @@ static int virtnet_vlan_rx_kill_vid(stru sg_init_one(sg
[PATCH net-next] virtio: change comment in transmit
The original comment was not really informative or funny as well as sexist. Replace it with a better explanation of why the driver does stop and what the impacts are. Signed-off-by: Stephen Hemminger step...@networkplumber.org --- a/drivers/net/virtio_net.c 2015-03-24 15:20:25.174671000 -0700 +++ b/drivers/net/virtio_net.c 2015-03-24 16:17:28.478525333 -0700 @@ -939,8 +939,12 @@ static netdev_tx_t start_xmit(struct sk_ skb_orphan(skb); nf_reset(skb); - /* Apparently nice girls don't return TX_BUSY; stop the queue -* before it gets out of hand. Naturally, this wastes entries. */ + /* It is better to stop queue if running out of space +* instead of forcing queuing layer to requeue the skb +* by returning TX_BUSY (and cause a BUG message). +* Since most packets only take 1 or 2 ring slots +* this means 16 slots are typically wasted. +*/ if (sq-vq-num_free 2+MAX_SKB_FRAGS) { netif_stop_subqueue(dev, qnum); if (unlikely(!virtqueue_enable_cb_delayed(sq-vq))) { ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC V7 PATCH 7/7] vhost_net: add interrupt coalescing support
On Mon, 25 May 2015 01:24:04 -0400 Jason Wang jasow...@redhat.com wrote: Signed-off-by: Jason Wang jasow...@redhat.com --- drivers/vhost/net.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 7d137a4..5ee28b7 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -320,6 +320,9 @@ static void handle_tx(struct vhost_net *net) hdr_size = nvq-vhost_hlen; zcopy = nvq-ubufs; + /* Finish pending interrupts first */ + vhost_check_coalesce_and_signal(vq-dev, vq, false); + for (;;) { /* Release DMAs done buffers first */ if (zcopy) @@ -415,6 +418,7 @@ static void handle_tx(struct vhost_net *net) } } out: + vhost_check_coalesce_and_signal(vq-dev, vq, true); mutex_unlock(vq-mutex); } @@ -554,6 +558,9 @@ static void handle_rx(struct vhost_net *net) vq-log : NULL; mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF); + /* Finish pending interrupts first */ + vhost_check_coalesce_and_signal(vq-dev, vq, false); + while ((sock_len = peek_head_len(sock-sk))) { sock_len += sock_hlen; vhost_len = sock_len + vhost_hlen; @@ -638,6 +645,7 @@ static void handle_rx(struct vhost_net *net) } } out: + vhost_check_coalesce_and_signal(vq-dev, vq, true); mutex_unlock(vq-mutex); } Could you implement ethtool control of these coalescing parameters? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next] virtio_net: add ethtool support for set and get of settings
On Tue, 2 Feb 2016 13:51:20 +0100 Nikolay Aleksandrovwrote: > +static bool virtnet_validate_speed(u32 speed) > +{ > + switch (speed) { > + case SPEED_10: > + case SPEED_100: > + case SPEED_1000: > + case SPEED_2500: > + case SPEED_5000: > + case SPEED_1: > + case SPEED_2: > + case SPEED_25000: > + case SPEED_4: > + case SPEED_5: > + case SPEED_56000: > + case SPEED_10: > + case SPEED_UNKNOWN: > + return true; > + } > + > + return false; > +} Why limit to only known values. This switch() will get out of date when some vendor introduces 64G or some other weird value. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC v2 -next 1/2] virtio: Start feature MTU support
On Tue, 15 Mar 2016 17:04:12 -0400 Aaron Conolewrote: > --- a/include/uapi/linux/virtio_net.h > +++ b/include/uapi/linux/virtio_net.h > @@ -55,6 +55,7 @@ > #define VIRTIO_NET_F_MQ 22 /* Device supports Receive Flow >* Steering */ > #define VIRTIO_NET_F_CTRL_MAC_ADDR 23/* Set MAC address */ > +#define VIRTIO_NET_F_MTU 25 /* Device supports Default MTU Negotiation */ > > #ifndef VIRTIO_NET_NO_LEGACY > #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ > @@ -73,6 +74,8 @@ struct virtio_net_config { >* Legal values are between 1 and 0x8000 >*/ > __u16 max_virtqueue_pairs; > + /* Default maximum transmit unit advice */ > + __u16 mtu; > } __attribute__((packed)); > > /* You can't change user visible headers without breaking ABI. This structure might be used by other user code. Also how can this work if host is using old size of structure. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC v2 -next 1/2] virtio: Start feature MTU support
On Thu, 17 Mar 2016 17:10:55 -0400 Aaron Conole <acon...@redhat.com> wrote: > Stephen Hemminger <step...@networkplumber.org> writes: > > > On Tue, 15 Mar 2016 17:04:12 -0400 > > Aaron Conole <acon...@redhat.com> wrote: > > > >> --- a/include/uapi/linux/virtio_net.h > >> +++ b/include/uapi/linux/virtio_net.h > >> @@ -55,6 +55,7 @@ > >> #define VIRTIO_NET_F_MQ 22 /* Device supports Receive Flow > >> * Steering */ > >> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ > >> +#define VIRTIO_NET_F_MTU 25 /* Device supports Default MTU > >> Negotiation */ > >> > >> #ifndef VIRTIO_NET_NO_LEGACY > >> #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */ > >> @@ -73,6 +74,8 @@ struct virtio_net_config { > >> * Legal values are between 1 and 0x8000 > >> */ > >>__u16 max_virtqueue_pairs; > >> + /* Default maximum transmit unit advice */ > >> + __u16 mtu; > >> } __attribute__((packed)); > >> > >> /* > > > > You can't change user visible headers without breaking ABI. > > This structure might be used by other user code. Also how can this > > work if host is using old size of structure. > > How else can this field be added and remain compliant with the spec? The > spec requires that mtu be passed in the virtio_net_config field. > > As for old sizeof, I think the absence of the VIRTIO_NET_F_MTU bit being > asserted is confirmation that mtu is not valid (at least, it is implied > in the spec). Michael is right as long as the code checks for MTU flag before referencing the mtu field, everything is fine. Actually, the structure is never used directly only by fetching fields with offsetof ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 08/14] vmbus: put related per-cpu variable together
The hv_context structure had several arrays which were per-cpu and was allocating small structures (tasklet_struct). Instead use a single per-cpu array. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/hv/channel_mgmt.c | 35 - drivers/hv/connection.c | 20 --- drivers/hv/hv.c | 130 -- drivers/hv/hyperv_vmbus.h | 53 +++ drivers/hv/vmbus_drv.c| 39 -- 5 files changed, 143 insertions(+), 134 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index de90a9900fee..579ad2560a39 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -353,9 +353,10 @@ static void free_channel(struct vmbus_channel *channel) static void percpu_channel_enq(void *arg) { struct vmbus_channel *channel = arg; - int cpu = smp_processor_id(); + struct hv_per_cpu_context *hv_cpu + = this_cpu_ptr(hv_context.cpu_context); - list_add_tail(>percpu_list, _context.percpu_list[cpu]); + list_add_tail(>percpu_list, _cpu->chan_list); } static void percpu_channel_deq(void *arg) @@ -379,19 +380,21 @@ static void vmbus_release_relid(u32 relid) void hv_event_tasklet_disable(struct vmbus_channel *channel) { - struct tasklet_struct *tasklet; - tasklet = hv_context.event_dpc[channel->target_cpu]; - tasklet_disable(tasklet); + struct hv_per_cpu_context *hv_cpu; + + hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu); + tasklet_disable(_cpu->event_dpc); } void hv_event_tasklet_enable(struct vmbus_channel *channel) { - struct tasklet_struct *tasklet; - tasklet = hv_context.event_dpc[channel->target_cpu]; - tasklet_enable(tasklet); + struct hv_per_cpu_context *hv_cpu; + + hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu); + tasklet_enable(_cpu->event_dpc); /* In case there is any pending event */ - tasklet_schedule(tasklet); + tasklet_schedule(_cpu->event_dpc); } void hv_process_channel_removal(struct vmbus_channel *channel, u32 relid) @@ -726,9 +729,12 @@ static void vmbus_wait_for_unload(void) break; for_each_online_cpu(cpu) { - page_addr = hv_context.synic_message_page[cpu]; - msg = (struct hv_message *)page_addr + - VMBUS_MESSAGE_SINT; + struct hv_per_cpu_context *hv_cpu + = per_cpu_ptr(hv_context.cpu_context, cpu); + + page_addr = hv_cpu->synic_message_page; + msg = (struct hv_message *)page_addr + + VMBUS_MESSAGE_SINT; message_type = READ_ONCE(msg->header.message_type); if (message_type == HVMSG_NONE) @@ -752,7 +758,10 @@ static void vmbus_wait_for_unload(void) * messages after we reconnect. */ for_each_online_cpu(cpu) { - page_addr = hv_context.synic_message_page[cpu]; + struct hv_per_cpu_context *hv_cpu + = per_cpu_ptr(hv_context.cpu_context, cpu); + + page_addr = hv_cpu->synic_message_page; msg = (struct hv_message *)page_addr + VMBUS_MESSAGE_SINT; msg->header.message_type = HVMSG_NONE; } diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 1766ef03e78d..158f12823baf 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -93,12 +93,10 @@ static int vmbus_negotiate_version(struct vmbus_channel_msginfo *msginfo, * all the CPUs. This is needed for kexec to work correctly where * the CPU attempting to connect may not be CPU 0. */ - if (version >= VERSION_WIN8_1) { - msg->target_vcpu = hv_context.vp_index[get_cpu()]; - put_cpu(); - } else { + if (version >= VERSION_WIN8_1) + msg->target_vcpu = hv_context.vp_index[smp_processor_id()]; + else msg->target_vcpu = 0; - } /* * Add to list before we send the request since we may @@ -269,12 +267,12 @@ void vmbus_disconnect(void) */ static struct vmbus_channel *pcpu_relid2channel(u32 relid) { + struct hv_per_cpu_context *hv_cpu + = this_cpu_ptr(hv_context.cpu_context); + struct vmbus_channel *found_channel = NULL; struct vmbus_channel *channel; - struct vmbus_channel *found_channel = NULL; - int cpu = smp_processor_id(); - struct list_head *pcpu_head = _context.percpu_list[cpu]; - list_for_each_entry(channel, pcpu_head, percpu_list) { + list_for_each_entry(channel, _cpu->chan_list, percpu_list) { if (channel->o
[PATCH 07/14] vmbus: callback is in softirq not workqueue
The callback is done via tasklet not workqueue. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- include/linux/hyperv.h | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 39d493ce550d..b30808f740f9 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -32,7 +32,6 @@ #include #include #include -#include #include #include #include @@ -729,9 +728,7 @@ struct vmbus_channel { struct vmbus_close_msg close_msg; - /* Channel callback are invoked in this workqueue context */ - /* HANDLE dataWorkQueue; */ - + /* Channel callback's invoked in softirq context */ void (*onchannel_callback)(void *context); void *channel_callback_context; -- 2.11.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 09/14] vmbus: change to per channel tasklet
Make the event handling tasklet per channel rather than per-cpu. This allows for better fairness when getting lots of data on the same cpu. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/hv/channel.c | 2 +- drivers/hv/channel_mgmt.c | 16 +- drivers/hv/connection.c | 78 ++- drivers/hv/hv.c | 2 -- drivers/hv/hyperv_vmbus.h | 1 - drivers/hv/vmbus_drv.c| 58 ++- include/linux/hyperv.h| 3 +- 7 files changed, 64 insertions(+), 96 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index 789c75f6df26..18cc1c78260d 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -530,7 +530,7 @@ static int vmbus_close_internal(struct vmbus_channel *channel) int ret; /* -* process_chn_event(), running in the tasklet, can race +* vmbus_on_event(), running in the tasklet, can race * with vmbus_close_internal() in the case of SMP guest, e.g., when * the former is accessing channel->inbound.ring_buffer, the latter * could be freeing the ring_buffer pages. diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 579ad2560a39..2f6270d76b79 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -339,6 +339,9 @@ static struct vmbus_channel *alloc_channel(void) INIT_LIST_HEAD(>sc_list); INIT_LIST_HEAD(>percpu_list); + tasklet_init(>callback_event, +vmbus_on_event, (unsigned long)channel); + return channel; } @@ -347,6 +350,7 @@ static struct vmbus_channel *alloc_channel(void) */ static void free_channel(struct vmbus_channel *channel) { + tasklet_kill(>callback_event); kfree(channel); } @@ -380,21 +384,15 @@ static void vmbus_release_relid(u32 relid) void hv_event_tasklet_disable(struct vmbus_channel *channel) { - struct hv_per_cpu_context *hv_cpu; - - hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu); - tasklet_disable(_cpu->event_dpc); + tasklet_disable(>callback_event); } void hv_event_tasklet_enable(struct vmbus_channel *channel) { - struct hv_per_cpu_context *hv_cpu; - - hv_cpu = per_cpu_ptr(hv_context.cpu_context, channel->target_cpu); - tasklet_enable(_cpu->event_dpc); + tasklet_enable(>callback_event); /* In case there is any pending event */ - tasklet_schedule(_cpu->event_dpc); + tasklet_schedule(>callback_event); } void hv_process_channel_removal(struct vmbus_channel *channel, u32 relid) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 158f12823baf..27e72dc07e12 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -260,29 +260,6 @@ void vmbus_disconnect(void) } /* - * Map the given relid to the corresponding channel based on the - * per-cpu list of channels that have been affinitized to this CPU. - * This will be used in the channel callback path as we can do this - * mapping in a lock-free fashion. - */ -static struct vmbus_channel *pcpu_relid2channel(u32 relid) -{ - struct hv_per_cpu_context *hv_cpu - = this_cpu_ptr(hv_context.cpu_context); - struct vmbus_channel *found_channel = NULL; - struct vmbus_channel *channel; - - list_for_each_entry(channel, _cpu->chan_list, percpu_list) { - if (channel->offermsg.child_relid == relid) { - found_channel = channel; - break; - } - } - - return found_channel; -} - -/* * relid2channel - Get the channel object given its * child relative id (ie channel id) */ @@ -318,25 +295,16 @@ struct vmbus_channel *relid2channel(u32 relid) } /* - * process_chn_event - Process a channel event notification + * vmbus_on_event - Process a channel event notification */ -static void process_chn_event(u32 relid) +void vmbus_on_event(unsigned long data) { - struct vmbus_channel *channel; + struct vmbus_channel *channel = (void *) data; void *arg; bool read_state; u32 bytes_to_read; /* -* Find the channel based on this relid and invokes the -* channel callback to process the event -*/ - channel = pcpu_relid2channel(relid); - - if (!channel) - return; - - /* * A channel once created is persistent even when there * is no driver handling the device. An unloading driver * sets the onchannel_callback to NULL on the same CPU @@ -344,7 +312,6 @@ static void process_chn_event(u32 relid) * Thus, checking and invoking the driver specific callback takes * care of orderly unloading of the driver. */ - if (channel->onchannel_callback != NULL) { arg = channel->chann
[PATCH 13/14] vmbus: constify parameters where possible
Functions that just query state of ring buffer can have parameters marked const. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/hv/hyperv_vmbus.h | 6 +++--- drivers/hv/ring_buffer.c | 22 ++ include/linux/hyperv.h| 12 ++-- 3 files changed, 19 insertions(+), 21 deletions(-) diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index e15a130de3c9..884f83bba1ab 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -283,14 +283,14 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info); int hv_ringbuffer_write(struct vmbus_channel *channel, - struct kvec *kv_list, u32 kv_count); + const struct kvec *kv_list, u32 kv_count); int hv_ringbuffer_read(struct vmbus_channel *channel, void *buffer, u32 buflen, u32 *buffer_actual_len, u64 *requestid, bool raw); -void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info, - struct hv_ring_buffer_debug_info *debug_info); +void hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info, +struct hv_ring_buffer_debug_info *debug_info); /* * Maximum channels is determined by the size of the interrupt page diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index 47ab69089115..ee3e488d9dee 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -96,11 +96,9 @@ hv_set_next_write_location(struct hv_ring_buffer_info *ring_info, /* Get the next read location for the specified ring buffer. */ static inline u32 -hv_get_next_read_location(struct hv_ring_buffer_info *ring_info) +hv_get_next_read_location(const struct hv_ring_buffer_info *ring_info) { - u32 next = ring_info->ring_buffer->read_index; - - return next; + return ring_info->ring_buffer->read_index; } /* @@ -108,8 +106,8 @@ hv_get_next_read_location(struct hv_ring_buffer_info *ring_info) * This allows the caller to skip. */ static inline u32 -hv_get_next_readlocation_withoffset(struct hv_ring_buffer_info *ring_info, -u32 offset) +hv_get_next_readlocation_withoffset(const struct hv_ring_buffer_info *ring_info, + u32 offset) { u32 next = ring_info->ring_buffer->read_index; @@ -130,7 +128,7 @@ hv_set_next_read_location(struct hv_ring_buffer_info *ring_info, /* Get the size of the ring buffer. */ static inline u32 -hv_get_ring_buffersize(struct hv_ring_buffer_info *ring_info) +hv_get_ring_buffersize(const struct hv_ring_buffer_info *ring_info) { return ring_info->ring_datasize; } @@ -147,7 +145,7 @@ hv_get_ring_bufferindices(struct hv_ring_buffer_info *ring_info) * Assume there is enough room. Handles wrap-around in src case only!! */ static u32 hv_copyfrom_ringbuffer( - struct hv_ring_buffer_info *ring_info, + const struct hv_ring_buffer_info *ring_info, void*dest, u32 destlen, u32 start_read_offset) @@ -171,7 +169,7 @@ static u32 hv_copyfrom_ringbuffer( static u32 hv_copyto_ringbuffer( struct hv_ring_buffer_info *ring_info, u32 start_write_offset, - void*src, + const void *src, u32 srclen) { void *ring_buffer = hv_get_ring_buffer(ring_info); @@ -186,8 +184,8 @@ static u32 hv_copyto_ringbuffer( } /* Get various debug metrics for the specified ring buffer. */ -void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info, - struct hv_ring_buffer_debug_info *debug_info) +void hv_ringbuffer_get_debuginfo(const struct hv_ring_buffer_info *ring_info, +struct hv_ring_buffer_debug_info *debug_info) { u32 bytes_avail_towrite; u32 bytes_avail_toread; @@ -264,7 +262,7 @@ void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info) /* Write to the ring buffer. */ int hv_ringbuffer_write(struct vmbus_channel *channel, - struct kvec *kv_list, u32 kv_count) + const struct kvec *kv_list, u32 kv_count) { int i = 0; u32 bytes_avail_towrite; diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index dc50997a3fba..32a9cbc66b65 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -137,8 +137,8 @@ struct hv_ring_buffer_info { * for the specified ring buffer */ static inline void -hv_get_ringbuffer_availbytes(struct hv_ring_buffer_info *rbi, - u32 *read, u32 *write) +hv_get_ringbuffer_availbytes(const stru
[PATCH 14/14] vmbus: replace modulus operation with subtraction
Takes less clock cycles to check for ring wrap and subtract than to do a modulus instruction. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/hv/ring_buffer.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index ee3e488d9dee..8ab6298fd5ae 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -112,7 +112,8 @@ hv_get_next_readlocation_withoffset(const struct hv_ring_buffer_info *ring_info, u32 next = ring_info->ring_buffer->read_index; next += offset; - next %= ring_info->ring_datasize; + if (next >= ring_info->ring_datasize) + next -= ring_info->ring_datasize; return next; } @@ -156,7 +157,8 @@ static u32 hv_copyfrom_ringbuffer( memcpy(dest, ring_buffer + start_read_offset, destlen); start_read_offset += destlen; - start_read_offset %= ring_buffer_size; + if (start_read_offset >= ring_buffer_size) + start_read_offset -= ring_buffer_size; return start_read_offset; } @@ -178,7 +180,8 @@ static u32 hv_copyto_ringbuffer( memcpy(ring_buffer + start_write_offset, src, srclen); start_write_offset += srclen; - start_write_offset %= ring_buffer_size; + if (start_write_offset >= ring_buffer_size) + start_write_offset -= ring_buffer_size; return start_write_offset; } -- 2.11.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 12/14] vmbus: expose hv_begin/end_read
In order to implement NAPI in netvsc, the driver needs access to control host interrupt mask. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/hv/hyperv_vmbus.h | 4 drivers/hv/ring_buffer.c | 20 include/linux/hyperv.h| 30 ++ 3 files changed, 30 insertions(+), 24 deletions(-) diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index 6a9b54677218..e15a130de3c9 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -292,10 +292,6 @@ int hv_ringbuffer_read(struct vmbus_channel *channel, void hv_ringbuffer_get_debuginfo(struct hv_ring_buffer_info *ring_info, struct hv_ring_buffer_debug_info *debug_info); -void hv_begin_read(struct hv_ring_buffer_info *rbi); - -u32 hv_end_read(struct hv_ring_buffer_info *rbi); - /* * Maximum channels is determined by the size of the interrupt page * which is PAGE_SIZE. 1/2 of PAGE_SIZE is for send endpoint interrupt diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index 146fd8ab2a2a..47ab69089115 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -32,26 +32,6 @@ #include "hyperv_vmbus.h" -void hv_begin_read(struct hv_ring_buffer_info *rbi) -{ - rbi->ring_buffer->interrupt_mask = 1; - virt_mb(); -} - -u32 hv_end_read(struct hv_ring_buffer_info *rbi) -{ - - rbi->ring_buffer->interrupt_mask = 0; - virt_mb(); - - /* -* Now check to see if the ring buffer is still empty. -* If it is not, we raced and we need to process new -* incoming messages. -*/ - return hv_get_bytes_to_read(rbi); -} - /* * When we write to the ring buffer, check if the host needs to * be signaled. Here is the details of this protocol: diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 9b0165b11c5c..dc50997a3fba 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -1473,6 +1473,36 @@ static inline void hv_signal_on_read(struct vmbus_channel *channel) } /* + * Mask off host interrupt callback notifications + */ +static inline void hv_begin_read(struct hv_ring_buffer_info *rbi) +{ + rbi->ring_buffer->interrupt_mask = 1; + + /* make sure mask update is not reordered */ + virt_mb(); +} + +/* + * Re-enable host callback and return number of outstanding bytes + */ +static inline u32 hv_end_read(struct hv_ring_buffer_info *rbi) +{ + + rbi->ring_buffer->interrupt_mask = 0; + + /* make sure mask update is not reordered */ + virt_mb(); + + /* +* Now check to see if the ring buffer is still empty. +* If it is not, we raced and we need to process new +* incoming messages. +*/ + return hv_get_bytes_to_read(rbi); +} + +/* * An API to support in-place processing of incoming VMBUS packets. */ #define VMBUS_PKT_TRAILER 8 -- 2.11.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 11/14] vmbus: remove conditional locking of vmbus_write
All current usage of vmbus write uses the acquire_lock flag, therefore having it be optional is unnecessary. This also fixes a sparse warning since sparse doesn't like when a function has conditional locking. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/hv/channel.c | 13 - drivers/hv/channel_mgmt.c | 1 - drivers/hv/hyperv_vmbus.h | 3 +-- drivers/hv/ring_buffer.c | 11 --- include/linux/hyperv.h| 15 --- 5 files changed, 9 insertions(+), 34 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index 18cc1c78260d..81a80c82f1bd 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -651,7 +651,6 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer, u32 packetlen_aligned = ALIGN(packetlen, sizeof(u64)); struct kvec bufferlist[3]; u64 aligned_data = 0; - bool lock = channel->acquire_ring_lock; int num_vecs = ((bufferlen != 0) ? 3 : 1); @@ -670,7 +669,7 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer, bufferlist[2].iov_base = _data; bufferlist[2].iov_len = (packetlen_aligned - packetlen); - return hv_ringbuffer_write(channel, bufferlist, num_vecs, lock); + return hv_ringbuffer_write(channel, bufferlist, num_vecs); } EXPORT_SYMBOL(vmbus_sendpacket_ctl); @@ -716,12 +715,10 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel, u32 packetlen_aligned; struct kvec bufferlist[3]; u64 aligned_data = 0; - bool lock = channel->acquire_ring_lock; if (pagecount > MAX_PAGE_BUFFER_COUNT) return -EINVAL; - /* * Adjust the size down since vmbus_channel_packet_page_buffer is the * largest size we support @@ -753,7 +750,7 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel, bufferlist[2].iov_base = _data; bufferlist[2].iov_len = (packetlen_aligned - packetlen); - return hv_ringbuffer_write(channel, bufferlist, 3, lock); + return hv_ringbuffer_write(channel, bufferlist, 3); } EXPORT_SYMBOL_GPL(vmbus_sendpacket_pagebuffer_ctl); @@ -789,7 +786,6 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel, u32 packetlen_aligned; struct kvec bufferlist[3]; u64 aligned_data = 0; - bool lock = channel->acquire_ring_lock; packetlen = desc_size + bufferlen; packetlen_aligned = ALIGN(packetlen, sizeof(u64)); @@ -809,7 +805,7 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel, bufferlist[2].iov_base = _data; bufferlist[2].iov_len = (packetlen_aligned - packetlen); - return hv_ringbuffer_write(channel, bufferlist, 3, lock); + return hv_ringbuffer_write(channel, bufferlist, 3); } EXPORT_SYMBOL_GPL(vmbus_sendpacket_mpb_desc); @@ -827,7 +823,6 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel *channel, u32 packetlen_aligned; struct kvec bufferlist[3]; u64 aligned_data = 0; - bool lock = channel->acquire_ring_lock; u32 pfncount = NUM_PAGES_SPANNED(multi_pagebuffer->offset, multi_pagebuffer->len); @@ -866,7 +861,7 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel *channel, bufferlist[2].iov_base = _data; bufferlist[2].iov_len = (packetlen_aligned - packetlen); - return hv_ringbuffer_write(channel, bufferlist, 3, lock); + return hv_ringbuffer_write(channel, bufferlist, 3); } EXPORT_SYMBOL_GPL(vmbus_sendpacket_multipagebuffer); diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index b2bb5aafaa2f..f33465d78a02 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -332,7 +332,6 @@ static struct vmbus_channel *alloc_channel(void) if (!channel) return NULL; - channel->acquire_ring_lock = true; spin_lock_init(>inbound_lock); spin_lock_init(>lock); diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index 558a798c407c..6a9b54677218 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -283,8 +283,7 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info); int hv_ringbuffer_write(struct vmbus_channel *channel, - struct kvec *kv_list, - u32 kv_count, bool lock); + struct kvec *kv_list, u32 kv_count); int hv_ringbuffer_read(struct vmbus_channel *channel, void *buffer, u32 buflen, u32 *buffer_actual_len, diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index 30ca55aefd24..146fd8ab2a2a 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -284,7 +284,7 @@ void hv_ringbuffer_cleanup(struct hv_ring_
[PATCH 10/14] vmbus: add direct isr callback mode
Change the simple boolean batched_reading into a tri-value. For future NAPI support in netvsc driver, the callback needs to occur directly in interrupt handler. Batched mode is also changed to disable host interrupts immediately in interrupt routine (to avoid unnecessary host signals), and the tasklet is rescheduled if more data is detected. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/hv/channel_mgmt.c| 7 --- drivers/hv/connection.c | 27 --- drivers/hv/hv_util.c | 3 +-- drivers/hv/vmbus_drv.c | 26 -- drivers/uio/uio_hv_generic.c | 2 +- include/linux/hyperv.h | 31 +-- 6 files changed, 55 insertions(+), 41 deletions(-) diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c index 2f6270d76b79..b2bb5aafaa2f 100644 --- a/drivers/hv/channel_mgmt.c +++ b/drivers/hv/channel_mgmt.c @@ -820,13 +820,6 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr) } /* -* By default we setup state to enable batched -* reading. A specific service can choose to -* disable this prior to opening the channel. -*/ - newchannel->batched_reading = true; - - /* * Setup state for signalling the host. */ newchannel->sig_event = (struct hv_input_signal_event *) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 27e72dc07e12..a8366fec1458 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -300,9 +300,7 @@ struct vmbus_channel *relid2channel(u32 relid) void vmbus_on_event(unsigned long data) { struct vmbus_channel *channel = (void *) data; - void *arg; - bool read_state; - u32 bytes_to_read; + void (*callback_fn)(void *); /* * A channel once created is persistent even when there @@ -312,9 +310,13 @@ void vmbus_on_event(unsigned long data) * Thus, checking and invoking the driver specific callback takes * care of orderly unloading of the driver. */ - if (channel->onchannel_callback != NULL) { - arg = channel->channel_callback_context; - read_state = channel->batched_reading; + callback_fn = READ_ONCE(channel->onchannel_callback); + if (unlikely(callback_fn == NULL)) + return; + + (*callback_fn)(channel->channel_callback_context); + + if (channel->callback_mode == HV_CALL_BATCHED) { /* * This callback reads the messages sent by the host. * We can optimize host to guest signaling by ensuring: @@ -326,16 +328,11 @@ void vmbus_on_event(unsigned long data) *state is set we check to see if additional packets are *available to read. In this case we repeat the process. */ + if (hv_end_read(>inbound) != 0) { + hv_begin_read(>inbound); - do { - if (read_state) - hv_begin_read(>inbound); - channel->onchannel_callback(arg); - if (read_state) - bytes_to_read = hv_end_read(>inbound); - else - bytes_to_read = 0; - } while (read_state && (bytes_to_read != 0)); + tasklet_schedule(>callback_event); + } } } diff --git a/drivers/hv/hv_util.c b/drivers/hv/hv_util.c index d42ede78a9dd..8410191b4992 100644 --- a/drivers/hv/hv_util.c +++ b/drivers/hv/hv_util.c @@ -409,8 +409,7 @@ static int util_probe(struct hv_device *dev, * Turn off batched reading for all util drivers before we open the * channel. */ - - set_channel_read_state(dev->channel, false); + set_channel_read_mode(dev->channel, HV_CALL_DIRECT); hv_set_drvdata(dev, srv); diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index eaf1a10b0245..f7f6b9144b07 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -887,6 +887,18 @@ void vmbus_on_msg_dpc(unsigned long data) /* + * Direct callback for channels using other deferred processing + */ +static void vmbus_channel_isr(struct vmbus_channel *channel) +{ + void (*callback_fn)(void *); + + callback_fn = READ_ONCE(channel->onchannel_callback); + if (likely(callback_fn != NULL)) + (*callback_fn)(channel->channel_callback_context); +} + +/* * Schedule all channels with events pending */ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu) @@ -927,9 +939,19 @@ static void vmbus_chan_sched(struct hv_per_cpu_context *hv_cpu) /* Find channel based on relid */ list_for_each_entry(channel,
[PATCH 03/14] vmbus: remove no longer used signal_policy
The explicit signal policy is no longer used. A different mechanism will be added later when xmit_more is supported. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- include/linux/hyperv.h | 18 -- 1 file changed, 18 deletions(-) diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 85b26f06e172..423fc96cc26a 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -670,11 +670,6 @@ struct hv_input_signal_event_buffer { struct hv_input_signal_event event; }; -enum hv_signal_policy { - HV_SIGNAL_POLICY_DEFAULT = 0, - HV_SIGNAL_POLICY_EXPLICIT, -}; - enum hv_numa_policy { HV_BALANCED = 0, HV_LOCALIZED, @@ -837,13 +832,6 @@ struct vmbus_channel { */ struct list_head percpu_list; /* -* Host signaling policy: The default policy will be -* based on the ring buffer state. We will also support -* a policy where the client driver can have explicit -* signaling control. -*/ - enum hv_signal_policy signal_policy; - /* * On the channel send side, many of the VMBUS * device drivers explicity serialize access to the * outgoing ring buffer. Give more control to the @@ -904,12 +892,6 @@ static inline bool is_hvsock_channel(const struct vmbus_channel *c) VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER); } -static inline void set_channel_signal_state(struct vmbus_channel *c, - enum hv_signal_policy policy) -{ - c->signal_policy = policy; -} - static inline void set_channel_affinity_state(struct vmbus_channel *c, enum hv_numa_policy policy) { -- 2.11.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 01/14] vmbus: use kernel bitops for traversing interrupt mask
Use standard kernel operations for find first set bit to traverse the channel bit array. This has added benefit of speeding up lookup on 64 bit and because it uses find first set instruction. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/hv/channel.c | 8 ++- drivers/hv/connection.c | 55 +++ drivers/hv/hyperv_vmbus.h | 16 -- drivers/hv/vmbus_drv.c| 4 +--- 4 files changed, 29 insertions(+), 54 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index be34547cdb68..a016c5c0e472 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -47,12 +47,8 @@ void vmbus_setevent(struct vmbus_channel *channel) * For channels marked as in "low latency" mode * bypass the monitor page mechanism. */ - if ((channel->offermsg.monitor_allocated) && - (!channel->low_latency)) { - /* Each u32 represents 32 channels */ - sync_set_bit(channel->offermsg.child_relid & 31, - (unsigned long *) vmbus_connection.send_int_page + - (channel->offermsg.child_relid >> 5)); + if (channel->offermsg.monitor_allocated && !channel->low_latency) { + vmbus_send_interrupt(channel->offermsg.child_relid); /* Get the child to parent monitor page */ monitorpage = vmbus_connection.monitor_pages[1]; diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 307a5a8937f6..1766ef03e78d 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -379,17 +379,11 @@ static void process_chn_event(u32 relid) */ void vmbus_on_event(unsigned long data) { - u32 dword; - u32 maxdword; - int bit; - u32 relid; - u32 *recv_int_page = NULL; - void *page_addr; - int cpu = smp_processor_id(); - union hv_synic_event_flags *event; + unsigned long *recv_int_page; + u32 maxbits, relid; if (vmbus_proto_version < VERSION_WIN8) { - maxdword = MAX_NUM_CHANNELS_SUPPORTED >> 5; + maxbits = MAX_NUM_CHANNELS_SUPPORTED; recv_int_page = vmbus_connection.recv_int_page; } else { /* @@ -397,35 +391,24 @@ void vmbus_on_event(unsigned long data) * can be directly checked to get the id of the channel * that has the interrupt pending. */ - maxdword = HV_EVENT_FLAGS_DWORD_COUNT; - page_addr = hv_context.synic_event_page[cpu]; - event = (union hv_synic_event_flags *)page_addr + + int cpu = smp_processor_id(); + void *page_addr = hv_context.synic_event_page[cpu]; + union hv_synic_event_flags *event + = (union hv_synic_event_flags *)page_addr + VMBUS_MESSAGE_SINT; - recv_int_page = event->flags32; - } - + maxbits = HV_EVENT_FLAGS_COUNT; + recv_int_page = event->flags; + } - /* Check events */ - if (!recv_int_page) + if (unlikely(!recv_int_page)) return; - for (dword = 0; dword < maxdword; dword++) { - if (!recv_int_page[dword]) - continue; - for (bit = 0; bit < 32; bit++) { - if (sync_test_and_clear_bit(bit, - (unsigned long *)_int_page[dword])) { - relid = (dword << 5) + bit; - - if (relid == 0) - /* -* Special case - vmbus -* channel protocol msg -*/ - continue; + for_each_set_bit(relid, recv_int_page, maxbits) { + if (sync_test_and_clear_bit(relid, recv_int_page)) { + /* Special case - vmbus channel protocol msg */ + if (relid != 0) process_chn_event(relid); - } } } } @@ -491,12 +474,8 @@ void vmbus_set_event(struct vmbus_channel *channel) { u32 child_relid = channel->offermsg.child_relid; - if (!channel->is_dedicated_interrupt) { - /* Each u32 represents 32 channels */ - sync_set_bit(child_relid & 31, - (unsigned long *)vmbus_connection.send_int_page + - (child_relid >> 5)); - } + if (!channel->is_dedicated_interrupt) + vmbus_send_interrupt(child_relid); hv_do_hypercall(HVCALL_SIGNAL_EVENT, channel->sig_event, NULL
[PATCH 05/14] netvsc: remove no longer needed receive staging buffers
Since commit aed8c164ca5199 ("Drivers: hv: ring_buffer: count on wrap around mappings") it is no longer necessary to handle ring wrapping by having a special receive buffer. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/net/hyperv/hyperv_net.h | 5 --- drivers/net/hyperv/netvsc.c | 83 ++- drivers/net/hyperv/rndis_filter.c | 11 -- 3 files changed, 11 insertions(+), 88 deletions(-) diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h index 3958adade7eb..cce70ceba6d5 100644 --- a/drivers/net/hyperv/hyperv_net.h +++ b/drivers/net/hyperv/hyperv_net.h @@ -748,11 +748,6 @@ struct netvsc_device { int ring_size; - /* The primary channel callback buffer */ - unsigned char *cb_buffer; - /* The sub channel callback buffer */ - unsigned char *sub_cb_buf; - struct multi_send_data msd[VRSS_CHANNEL_MAX]; u32 max_pkt; /* max number of pkt in one send, e.g. 8 */ u32 pkt_align; /* alignment bytes, e.g. 8 */ diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c index e326e68f9f6d..7487498b663c 100644 --- a/drivers/net/hyperv/netvsc.c +++ b/drivers/net/hyperv/netvsc.c @@ -67,12 +67,6 @@ static struct netvsc_device *alloc_net_device(void) if (!net_device) return NULL; - net_device->cb_buffer = kzalloc(NETVSC_PACKET_SIZE, GFP_KERNEL); - if (!net_device->cb_buffer) { - kfree(net_device); - return NULL; - } - net_device->mrc[0].buf = vzalloc(NETVSC_RECVSLOT_MAX * sizeof(struct recv_comp_data)); @@ -93,7 +87,6 @@ static void free_netvsc_device(struct netvsc_device *nvdev) for (i = 0; i < VRSS_CHANNEL_MAX; i++) vfree(nvdev->mrc[i].buf); - kfree(nvdev->cb_buffer); kfree(nvdev); } @@ -584,7 +577,6 @@ void netvsc_device_remove(struct hv_device *device) vmbus_close(device->channel); /* Release all resources */ - vfree(net_device->sub_cb_buf); free_netvsc_device(net_device); } @@ -1256,16 +1248,11 @@ static void netvsc_process_raw_pkt(struct hv_device *device, void netvsc_channel_cb(void *context) { - int ret; - struct vmbus_channel *channel = (struct vmbus_channel *)context; + struct vmbus_channel *channel = context; u16 q_idx = channel->offermsg.offer.sub_channel_index; struct hv_device *device; struct netvsc_device *net_device; - u32 bytes_recvd; - u64 request_id; struct vmpacket_descriptor *desc; - unsigned char *buffer; - int bufferlen = NETVSC_PACKET_SIZE; struct net_device *ndev; bool need_to_commit = false; @@ -1277,65 +1264,19 @@ void netvsc_channel_cb(void *context) net_device = get_inbound_net_device(device); if (!net_device) return; + ndev = hv_get_drvdata(device); - buffer = get_per_channel_state(channel); - - do { - desc = get_next_pkt_raw(channel); - if (desc != NULL) { - netvsc_process_raw_pkt(device, - channel, - net_device, - ndev, - desc->trans_id, - desc); - - put_pkt_raw(channel, desc); - need_to_commit = true; - continue; - } - if (need_to_commit) { - need_to_commit = false; - commit_rd_index(channel); - } - ret = vmbus_recvpacket_raw(channel, buffer, bufferlen, - _recvd, _id); - if (ret == 0) { - if (bytes_recvd > 0) { - desc = (struct vmpacket_descriptor *)buffer; - netvsc_process_raw_pkt(device, - channel, - net_device, - ndev, - request_id, - desc); - } else { - /* -* We are done for this pass. -*/ - break; - } - - } else if (ret == -ENOBUFS) { - if (bufferlen > NETVSC_PACKET_SIZE) - kfree(buffer); - /* Handle large packet */ -
[PATCH 00/14] hyperv: vmbus related patches
This is a rebase/resend of earlier patches. I skipped the pure cosmetic patches for now. Mostly this is consolidation earlier changes, removing dead code etc. The important part is the change for allowing a vmbus channel to get callback directly in interrupt mode; this is necessary for NAPI support. Stephen Hemminger (14): vmbus: use kernel bitops for traversing interrupt mask vmbus: drop no longer used kick_q argument vmbus: remove no longer used signal_policy vmbus: remove unused kickq argument to sendpacket netvsc: remove no longer needed receive staging buffers vmbus: remove per channel state vmbus: callback is in softirq not workqueue vmbus: put related per-cpu variable together vmbus: change to per channel tasklet vmbus: add direct isr callback mode vmbus: remove conditional locking of vmbus_write vmbus: expose hv_begin/end_read vmbus: constify parameters where possible vmbus: replace modulus operation with subtraction Starting point was top of current char-misc-next branch. drivers/hv/channel.c | 47 + drivers/hv/channel_mgmt.c | 41 ++-- drivers/hv/connection.c | 134 +- drivers/hv/hv.c | 124 +++ drivers/hv/hv_util.c | 3 +- drivers/hv/hyperv_vmbus.h | 80 --- drivers/hv/ring_buffer.c | 66 ++- drivers/hv/vmbus_drv.c| 115 ++-- drivers/net/hyperv/hyperv_net.h | 5 -- drivers/net/hyperv/netvsc.c | 104 - drivers/net/hyperv/rndis_filter.c | 11 drivers/uio/uio_hv_generic.c | 2 +- include/linux/hyperv.h| 134 +- 13 files changed, 338 insertions(+), 528 deletions(-) -- 2.11.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 02/14] vmbus: drop no longer used kick_q argument
The flag to cause notification of host is unused after commit a01a291a282f7c2e ("Drivers: hv: vmbus: Base host signaling strictly on the ring state"). Therefore remove it from the ring buffer internal API. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/hv/channel.c | 13 - drivers/hv/hyperv_vmbus.h | 5 ++--- drivers/hv/ring_buffer.c | 8 +++- 3 files changed, 9 insertions(+), 17 deletions(-) diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c index a016c5c0e472..e26285cde8e0 100644 --- a/drivers/hv/channel.c +++ b/drivers/hv/channel.c @@ -670,9 +670,7 @@ int vmbus_sendpacket_ctl(struct vmbus_channel *channel, void *buffer, bufferlist[2].iov_base = _data; bufferlist[2].iov_len = (packetlen_aligned - packetlen); - return hv_ringbuffer_write(channel, bufferlist, num_vecs, - lock, kick_q); - + return hv_ringbuffer_write(channel, bufferlist, num_vecs, lock); } EXPORT_SYMBOL(vmbus_sendpacket_ctl); @@ -757,8 +755,7 @@ int vmbus_sendpacket_pagebuffer_ctl(struct vmbus_channel *channel, bufferlist[2].iov_base = _data; bufferlist[2].iov_len = (packetlen_aligned - packetlen); - return hv_ringbuffer_write(channel, bufferlist, 3, - lock, kick_q); + return hv_ringbuffer_write(channel, bufferlist, 3, lock); } EXPORT_SYMBOL_GPL(vmbus_sendpacket_pagebuffer_ctl); @@ -813,8 +810,7 @@ int vmbus_sendpacket_mpb_desc(struct vmbus_channel *channel, bufferlist[2].iov_base = _data; bufferlist[2].iov_len = (packetlen_aligned - packetlen); - return hv_ringbuffer_write(channel, bufferlist, 3, - lock, true); + return hv_ringbuffer_write(channel, bufferlist, 3, lock); } EXPORT_SYMBOL_GPL(vmbus_sendpacket_mpb_desc); @@ -871,8 +867,7 @@ int vmbus_sendpacket_multipagebuffer(struct vmbus_channel *channel, bufferlist[2].iov_base = _data; bufferlist[2].iov_len = (packetlen_aligned - packetlen); - return hv_ringbuffer_write(channel, bufferlist, 3, - lock, true); + return hv_ringbuffer_write(channel, bufferlist, 3, lock); } EXPORT_SYMBOL_GPL(vmbus_sendpacket_multipagebuffer); diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h index 2749a4142889..c375ec89db6f 100644 --- a/drivers/hv/hyperv_vmbus.h +++ b/drivers/hv/hyperv_vmbus.h @@ -275,9 +275,8 @@ int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info, void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info); int hv_ringbuffer_write(struct vmbus_channel *channel, - struct kvec *kv_list, - u32 kv_count, bool lock, - bool kick_q); + struct kvec *kv_list, + u32 kv_count, bool lock); int hv_ringbuffer_read(struct vmbus_channel *channel, void *buffer, u32 buflen, u32 *buffer_actual_len, diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c index 2cd402986858..30ca55aefd24 100644 --- a/drivers/hv/ring_buffer.c +++ b/drivers/hv/ring_buffer.c @@ -77,8 +77,7 @@ u32 hv_end_read(struct hv_ring_buffer_info *rbi) * host logic is fixed. */ -static void hv_signal_on_write(u32 old_write, struct vmbus_channel *channel, - bool kick_q) +static void hv_signal_on_write(u32 old_write, struct vmbus_channel *channel) { struct hv_ring_buffer_info *rbi = >outbound; @@ -285,8 +284,7 @@ void hv_ringbuffer_cleanup(struct hv_ring_buffer_info *ring_info) /* Write to the ring buffer. */ int hv_ringbuffer_write(struct vmbus_channel *channel, - struct kvec *kv_list, u32 kv_count, bool lock, - bool kick_q) + struct kvec *kv_list, u32 kv_count, bool lock) { int i = 0; u32 bytes_avail_towrite; @@ -352,7 +350,7 @@ int hv_ringbuffer_write(struct vmbus_channel *channel, if (lock) spin_unlock_irqrestore(_info->ring_lock, flags); - hv_signal_on_write(old_write, channel, kick_q); + hv_signal_on_write(old_write, channel); if (channel->rescind) return -ENODEV; -- 2.11.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 06/14] vmbus: remove per channel state
The netvsc no longer needs per channel state hook to track receive buffer. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- include/linux/hyperv.h | 14 -- 1 file changed, 14 deletions(-) diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index 8c6a1505b876..39d493ce550d 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -823,10 +823,6 @@ struct vmbus_channel { */ struct vmbus_channel *primary_channel; /* -* Support per-channel state for use by vmbus drivers. -*/ - void *per_channel_state; - /* * To support per-cpu lookup mapping of relid to channel, * link up channels based on their CPU affinity. */ @@ -903,16 +899,6 @@ static inline void set_channel_read_state(struct vmbus_channel *c, bool state) c->batched_reading = state; } -static inline void set_per_channel_state(struct vmbus_channel *c, void *s) -{ - c->per_channel_state = s; -} - -static inline void *get_per_channel_state(struct vmbus_channel *c) -{ - return c->per_channel_state; -} - static inline void set_channel_pending_send_size(struct vmbus_channel *c, u32 size) { -- 2.11.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/2] hyperv: implement hv_get_tsc_page()
On Thu, 9 Feb 2017 21:14:25 +0100 (CET) Thomas Gleixner <t...@linutronix.de> wrote: > On Thu, 9 Feb 2017, Stephen Hemminger wrote: > > > The actual code looks fine, but the style police will not like you. > > { should be at start of line on functions. > > And #else should be at start of line, > > > > But maybe this was just more of exchange mangling the mail. > > Looks like. > > > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) { > > + return tsc_pg; > > +} > > + > > That's how it reads in a proper mail client connected to a proper mail > server: > > > +struct ms_hyperv_tsc_page *hv_get_tsc_page(void) > > +{ > > + return tsc_pg; > > +} > > :) Yup. it looks like the mail server is trying to be "helpful" by eliminating extra white space. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method
On Thu, 9 Feb 2017 14:55:50 -0800 Andy Lutomirski <l...@amacapital.net> wrote: > On Thu, Feb 9, 2017 at 12:45 PM, KY Srinivasan <k...@microsoft.com> wrote: > > > > > >> -Original Message- > >> From: Thomas Gleixner [mailto:t...@linutronix.de] > >> Sent: Thursday, February 9, 2017 9:08 AM > >> To: Vitaly Kuznetsov <vkuzn...@redhat.com> > >> Cc: x...@kernel.org; Andy Lutomirski <l...@amacapital.net>; Ingo Molnar > >> <mi...@redhat.com>; H. Peter Anvin <h...@zytor.com>; KY Srinivasan > >> <k...@microsoft.com>; Haiyang Zhang <haiya...@microsoft.com>; Stephen > >> Hemminger <sthem...@microsoft.com>; Dexuan Cui > >> <de...@microsoft.com>; linux-ker...@vger.kernel.org; > >> de...@linuxdriverproject.org; virtualization@lists.linux-foundation.org > >> Subject: Re: [PATCH 2/2] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read > >> method > >> > >> On Thu, 9 Feb 2017, Vitaly Kuznetsov wrote: > >> > +#ifdef CONFIG_HYPERV_TSCPAGE > >> > +static notrace u64 vread_hvclock(int *mode) > >> > +{ > >> > + const struct ms_hyperv_tsc_page *tsc_pg = > >> > + (const struct ms_hyperv_tsc_page *)_page; > >> > + u64 sequence, scale, offset, current_tick, cur_tsc; > >> > + > >> > + while (1) { > >> > + sequence = READ_ONCE(tsc_pg->tsc_sequence); > >> > + if (!sequence) > >> > + break; > >> > + > >> > + scale = READ_ONCE(tsc_pg->tsc_scale); > >> > + offset = READ_ONCE(tsc_pg->tsc_offset); > >> > + rdtscll(cur_tsc); > >> > + > >> > + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset; > >> > + > >> > + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence) > >> > + return current_tick; > >> > >> That sequence stuff lacks still a sensible explanation. It's fundamentally > >> different from the sequence counting we do in the kernel, so documentation > >> for it is really required. > > > > The host is updating multiple fields in this shared TSC page and the > > sequence number is > > used to ensure that the guest sees a consistent set values published. If I > > remember > > correctly, Xen has a similar mechanism. > > So what's the actual protocol? When the hypervisor updates the page, > does it freeze all guest cpus? If not, how does it maintain > atomicity? The protocol looks a lot like Linux seqlock, but it has an extra protection which is missing here. The host needs to update sequence number twice in order to guarantee ordering. Otherwise it is possible that Host and guest can race. Host Write offset Write scale Set tsc_sequence = N Guest read sequence = N Read scale Write scale Write offset Read Offset Check sequence == N Set tsc_sequence = N +1 Look like the current host side protocol is wrong. The solution that Andi Kleen invented, and I used in seqlock was for the writer to update sequence at start and end of transaction. If sequence number is odd, then the reader knows it is looking at stale data. Host Write offset Write scale Set tsc_sequence = N (end of transaction) Guest read sequence = N Spin until sequence is even (N is even) Read scale Set tsc_sequence += 1 Write scale Write offset Read Offset Check sequence == N? (fails is N + 1) Set tsc_sequence += 1 (end of transaction) read sequence = N+2 Spin until sequence is even (ie N +2) Read scale Read Offset Check sequence == N +2? (yes ok). Also it is faster to just read scale and offset with this loop and save the reading of TSC and doing multiply until after scale/offset has been acquired. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next] net: make ndo_get_stats64 a void function
The network device operation for reading statistics is only called in one place, and it ignores the return value. Having a structure return value is potentially confusing because some future driver could incorrectly assume that the return value was used. Fix all drivers with ndo_get_stats64 to have a void function. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- drivers/net/bonding/bond_main.c | 10 -- drivers/net/dummy.c | 5 ++--- drivers/net/ethernet/alacritech/slicoss.c| 6 ++ drivers/net/ethernet/amazon/ena/ena_netdev.c | 10 -- drivers/net/ethernet/amd/xgbe/xgbe-drv.c | 6 ++ drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 4 +--- drivers/net/ethernet/atheros/alx/main.c | 6 ++ drivers/net/ethernet/broadcom/b44.c | 5 ++--- drivers/net/ethernet/broadcom/bnx2.c | 3 +-- drivers/net/ethernet/broadcom/bnxt/bnxt.c| 6 ++ drivers/net/ethernet/broadcom/tg3.c | 8 +++- drivers/net/ethernet/brocade/bna/bnad.c | 6 ++ drivers/net/ethernet/calxeda/xgmac.c | 5 ++--- drivers/net/ethernet/cavium/thunder/nicvf_main.c | 5 ++--- drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 7 +++ drivers/net/ethernet/cisco/enic/enic_main.c | 8 +++- drivers/net/ethernet/ec_bhf.c| 4 +--- drivers/net/ethernet/emulex/benet/be_main.c | 5 ++--- drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 6 ++ drivers/net/ethernet/hisilicon/hns/hns_enet.c| 6 ++ drivers/net/ethernet/ibm/ehea/ehea_main.c| 5 ++--- drivers/net/ethernet/intel/e1000e/e1000.h| 4 ++-- drivers/net/ethernet/intel/e1000e/netdev.c | 5 ++--- drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 6 ++ drivers/net/ethernet/intel/i40e/i40e.h | 5 ++--- drivers/net/ethernet/intel/i40e/i40e_main.c | 18 ++ drivers/net/ethernet/intel/igb/igb_main.c| 10 -- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c| 7 --- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c| 6 ++ drivers/net/ethernet/marvell/mvneta.c| 4 +--- drivers/net/ethernet/marvell/mvpp2.c | 4 +--- drivers/net/ethernet/marvell/sky2.c | 6 ++ drivers/net/ethernet/mediatek/mtk_eth_soc.c | 6 ++ drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 4 +--- drivers/net/ethernet/mellanox/mlx5/core/en_main.c| 3 +-- drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 3 +-- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 4 +--- drivers/net/ethernet/mellanox/mlxsw/switchx2.c | 3 +-- drivers/net/ethernet/myricom/myri10ge/myri10ge.c | 9 - drivers/net/ethernet/neterion/vxge/vxge-main.c | 4 +--- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 6 ++ drivers/net/ethernet/nvidia/forcedeth.c | 4 +--- drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c | 10 -- drivers/net/ethernet/qlogic/qede/qede_main.c | 7 ++- drivers/net/ethernet/qualcomm/emac/emac.c| 6 ++ drivers/net/ethernet/realtek/8139too.c | 9 +++-- drivers/net/ethernet/realtek/r8169.c | 4 +--- drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c | 8 ++-- drivers/net/ethernet/sfc/efx.c | 6 ++ drivers/net/ethernet/sfc/falcon/efx.c| 6 ++ drivers/net/ethernet/sun/niu.c | 6 ++ drivers/net/ethernet/synopsys/dwc_eth_qos.c | 4 +--- drivers/net/ethernet/tile/tilepro.c | 4 ++-- drivers/net/ethernet/via/via-rhine.c | 8 +++- drivers/net/fjes/fjes_main.c | 7 ++- drivers/net/hyperv/netvsc_drv.c | 6 ++ drivers/net/ifb.c| 6 ++ drivers/net/ipvlan/ipvlan_main.c | 5 ++--- drivers/net/loopback.c | 5 ++--- drivers/net/macsec.c | 6 ++ drivers/net/macvlan.c| 5 ++--- drivers/net/nlmon.c | 4 +--- drivers/net/ppp/ppp_generic.c| 4 +--- drivers/net/slip/slip.c | 3 +-- drivers/net/team/team.c | 3 +-- drivers/net/tun.c| 3 +-- drivers/net/veth.c | 6 ++ drivers/net/virtio_net.c | 6 ++ drivers/net/vmxnet3/vmxnet3_ethtool.c| 4 +--- drivers/net/vmxnet3/vmxnet3_int.h| 4 ++-- drive
Re: [PATCH net-next V3 3/3] tun: rx batching
On Fri, 30 Dec 2016 13:20:51 +0800 Jason Wangwrote: > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index cd8e02c..a268ed9 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -75,6 +75,10 @@ > > #include > > +static int rx_batched; > +module_param(rx_batched, int, 0444); > +MODULE_PARM_DESC(rx_batched, "Number of packets batched in rx"); > + > /* Uncomment to enable debugging */ I like the concept or rx batching. But controlling it via a module parameter is one of the worst API choices. Ethtool would be better to use because that is how other network devices control batching. If you do ethtool, you could even extend it to have an number of packets and max latency value. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 2/3] x86/hyperv: move TSC reading method to asm/mshyperv.h
Minor coding comments > diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h > index d324dce..4ff25436 100644 > --- a/arch/x86/include/asm/mshyperv.h > +++ b/arch/x86/include/asm/mshyperv.h > @@ -178,6 +178,56 @@ void hyperv_cleanup(void); > #endif > #ifdef CONFIG_HYPERV_TSCPAGE > struct ms_hyperv_tsc_page *hv_get_tsc_page(void); > +static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg) > +{ > + u64 scale, offset, current_tick, cur_tsc; > + u32 sequence; > + > + /* > + * The protocol for reading Hyper-V TSC page is specified in Hypervisor > + * Top-Level Functional Specification ver. 3.0 and above. To get the > + * reference time we must do the following: > + * - READ ReferenceTscSequence > + * A special '0' value indicates the time source is unreliable and we > + * need to use something else. The currently published specification > + * versions (up to 4.0b) contain a mistake and wrongly claim '-1' > + * instead of '0' as the special value, see commit c35b82ef0294. > + * - ReferenceTime = > + *((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset > + * - READ ReferenceTscSequence again. In case its value has changed > + * since our first reading we need to discard ReferenceTime and repeat > + * the whole sequence as the hypervisor was updating the page in > + * between. > + */ > + while (1) { > + sequence = READ_ONCE(tsc_pg->tsc_sequence); > + if (!sequence) > + break; It would be clearer to just return U64_MAX here (and not fall out) since this is only case here. Also since this failure only occurs if host clock is not available, probably should be unlikely. > + /* > + * Make sure we read sequence before we read other values from > + * TSC page. > + */ > + smp_rmb(); > + > + scale = READ_ONCE(tsc_pg->tsc_scale); > + offset = READ_ONCE(tsc_pg->tsc_offset); > + cur_tsc = rdtsc_ordered(); Since you already have smp_ barriers and rdtsc_ordered is a barrier, the compiler barriers (READ_ONCE()) shouldn't be necessary. > + > + current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset; > + > + /* > + * Make sure we read sequence after we read all other values > + * from TSC page. > + */ > + smp_rmb(); > + > + if (READ_ONCE(tsc_pg->tsc_sequence) == sequence) > + return current_tick; > + } Why not make do { } while out of this. do { ... } while (unlikely(READ_ONCE(tsc_pg->tsc_sequence) != sequence); return current_tick; Also don't need to calculate tick value until have good data. As in: static inline u32 hv_clock_sequence(const struct ms_hyperv_tsc_page *tsc_pg) { u32 sequence = return sequence; } static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg) { u64 scale, offset, cur_tsc; u32 start; /* * The protocol for reading Hyper-V TSC page is specified in Hypervisor * Top-Level Functional Specification ver. 3.0 and above. To get the * reference time we must do the following: * - READ ReferenceTscSequence * A special '0' value indicates the time source is unreliable and we * need to use something else. The currently published specification * versions (up to 4.0b) contain a mistake and wrongly claim '-1' * instead of '0' as the special value, see commit c35b82ef0294. * - ReferenceTime = *((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset * - READ ReferenceTscSequence again. In case its value has changed * since our first reading we need to discard ReferenceTime and repeat * the whole sequence as the hypervisor was updating the page in * between. */ do { start = READ_ONCE(tsc_pg->tsc_sequence); smp_rmb(); if (unlikely(!start)) return U64_MAX; scale = tsc_pg->tsc_scale; offset = tsc_pg->tsc_offset; /* * Make sure we read sequence after we read all other values * from TSC page. */ smp_rmb(); } while (unlikely(READ_ONCE(tsc_pg->tsc_sequence != start))); cur_tsc = rdtsc_ordered(); return mul_u64_u64_shr(cur_tsc, scale, 64) + offset; } ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next 4/4] mlx4: sizeof style usage
The kernel coding style is to treat sizeof as a function (ie. with parenthesis) not as an operator. Also use kcalloc and kmalloc_array Signed-off-by: Stephen Hemminger <step...@networkplumber.org> --- drivers/net/ethernet/mellanox/mlx4/alloc.c | 2 +- drivers/net/ethernet/mellanox/mlx4/cmd.c | 4 ++-- drivers/net/ethernet/mellanox/mlx4/en_resources.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_tx.c | 2 +- drivers/net/ethernet/mellanox/mlx4/eq.c| 20 +- drivers/net/ethernet/mellanox/mlx4/fw.c| 2 +- drivers/net/ethernet/mellanox/mlx4/icm.c | 2 +- drivers/net/ethernet/mellanox/mlx4/icm.h | 4 ++-- drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +- drivers/net/ethernet/mellanox/mlx4/main.c | 12 +-- drivers/net/ethernet/mellanox/mlx4/mcg.c | 12 +-- drivers/net/ethernet/mellanox/mlx4/mr.c| 10 - drivers/net/ethernet/mellanox/mlx4/qp.c| 12 +-- .../net/ethernet/mellanox/mlx4/resource_tracker.c | 24 +++--- 15 files changed, 56 insertions(+), 56 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/alloc.c b/drivers/net/ethernet/mellanox/mlx4/alloc.c index b651c1210555..6dabd983e7e0 100644 --- a/drivers/net/ethernet/mellanox/mlx4/alloc.c +++ b/drivers/net/ethernet/mellanox/mlx4/alloc.c @@ -186,7 +186,7 @@ int mlx4_bitmap_init(struct mlx4_bitmap *bitmap, u32 num, u32 mask, bitmap->effective_len = bitmap->avail; spin_lock_init(>lock); bitmap->table = kzalloc(BITS_TO_LONGS(bitmap->max) * - sizeof (long), GFP_KERNEL); + sizeof(long), GFP_KERNEL); if (!bitmap->table) return -ENOMEM; diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c index 674773b28b2e..97aed30ead21 100644 --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c @@ -2637,7 +2637,7 @@ int mlx4_cmd_use_events(struct mlx4_dev *dev) int err = 0; priv->cmd.context = kmalloc(priv->cmd.max_cmds * - sizeof (struct mlx4_cmd_context), + sizeof(struct mlx4_cmd_context), GFP_KERNEL); if (!priv->cmd.context) return -ENOMEM; @@ -2695,7 +2695,7 @@ struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct mlx4_dev *dev) { struct mlx4_cmd_mailbox *mailbox; - mailbox = kmalloc(sizeof *mailbox, GFP_KERNEL); + mailbox = kmalloc(sizeof(*mailbox), GFP_KERNEL); if (!mailbox) return ERR_PTR(-ENOMEM); diff --git a/drivers/net/ethernet/mellanox/mlx4/en_resources.c b/drivers/net/ethernet/mellanox/mlx4/en_resources.c index 86d2d42d658d..5a47f9669621 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_resources.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_resources.c @@ -44,7 +44,7 @@ void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int size, int stride, struct mlx4_en_dev *mdev = priv->mdev; struct net_device *dev = priv->dev; - memset(context, 0, sizeof *context); + memset(context, 0, sizeof(*context)); context->flags = cpu_to_be32(7 << 16 | rss << MLX4_RSS_QPC_FLAG_OFFSET); context->pd = cpu_to_be32(mdev->priv_pdn); context->mtu_msgmax = 0xff; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index bf1638044a7a..dcb8f8f84a97 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -1056,7 +1056,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv *priv, int qpn, } qp->event = mlx4_en_sqp_event; - memset(context, 0, sizeof *context); + memset(context, 0, sizeof(*context)); mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0, qpn, ring->cqn, -1, context); context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma); diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c index 73faa3d77921..bcf422efd3b8 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c @@ -643,7 +643,7 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, void *fragptr) { struct mlx4_wqe_inline_seg *inl = _desc->inl; - int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl; + int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof(*inl); unsigned int hlen = skb_headlen(skb); if (skb->len <= spc) { diff --git a/drivers/net/ethernet/mellanox/ml
[PATCH net-next 3/4] skge: add paren around sizeof arg
Signed-off-by: Stephen Hemminger <step...@networkplumber.org> --- drivers/net/ethernet/marvell/skge.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/marvell/skge.c b/drivers/net/ethernet/marvell/skge.c index 5d7d94de4e00..8a835e82256a 100644 --- a/drivers/net/ethernet/marvell/skge.c +++ b/drivers/net/ethernet/marvell/skge.c @@ -3516,7 +3516,7 @@ static const char *skge_board_name(const struct skge_hw *hw) if (skge_chips[i].id == hw->chip_id) return skge_chips[i].name; - snprintf(buf, sizeof buf, "chipid 0x%x", hw->chip_id); + snprintf(buf, sizeof(buf), "chipid 0x%x", hw->chip_id); return buf; } -- 2.11.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next 1/4] tun/tap: use paren's with sizeof
Although sizeof is an operator in C. The kernel coding style convention is to always use it like a function and add parenthesis. Signed-off-by: Stephen Hemminger <step...@networkplumber.org> --- drivers/net/tap.c | 2 +- drivers/net/tun.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/tap.c b/drivers/net/tap.c index 0d039411e64c..21b71ae947fd 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -1215,7 +1215,7 @@ int tap_queue_resize(struct tap_dev *tap) int n = tap->numqueues; int ret, i = 0; - arrays = kmalloc(sizeof *arrays * n, GFP_KERNEL); + arrays = kmalloc_array(n, sizeof(*arrays), GFP_KERNEL); if (!arrays) return -ENOMEM; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 5892284eb8d0..f5017121cd57 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -2737,7 +2737,7 @@ static int tun_queue_resize(struct tun_struct *tun) int n = tun->numqueues + tun->numdisabled; int ret, i; - arrays = kmalloc(sizeof *arrays * n, GFP_KERNEL); + arrays = kmalloc_array(n, sizeof(*arrays), GFP_KERNEL); if (!arrays) return -ENOMEM; -- 2.11.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] uapi: add SPDX identifier to vm_sockets_diag.h
New file seems to have missed the SPDX license scan and update. Signed-off-by: Stephen Hemminger <sthem...@microsoft.com> --- include/uapi/linux/vm_sockets_diag.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/vm_sockets_diag.h b/include/uapi/linux/vm_sockets_diag.h index 14cd7dc5a187..0b4dd54f3d1e 100644 --- a/include/uapi/linux/vm_sockets_diag.h +++ b/include/uapi/linux/vm_sockets_diag.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ /* AF_VSOCK sock_diag(7) interface for querying open sockets */ #ifndef _UAPI__VM_SOCKETS_DIAG_H__ -- 2.11.0 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC] virtio-net: help live migrate SR-IOV devices
On Wed, 29 Nov 2017 19:51:38 -0800 Jakub Kicinski <jakub.kicin...@netronome.com> wrote: > On Thu, 30 Nov 2017 11:29:56 +0800, Jason Wang wrote: > > On 2017年11月29日 03:27, Jesse Brandeburg wrote: > > > Hi, I'd like to get some feedback on a proposal to enhance > > > virtio-net to ease configuration of a VM and that would enable > > > live migration of passthrough network SR-IOV devices. > > > > > > Today we have SR-IOV network devices (VFs) that can be passed > > > into a VM in order to enable high performance networking direct > > > within the VM. The problem I am trying to address is that this > > > configuration is generally difficult to live-migrate. There is > > > documentation [1] indicating that some OS/Hypervisor vendors will > > > support live migration of a system with a direct assigned > > > networking device. The problem I see with these implementations > > > is that the network configuration requirements that are passed on > > > to the owner of the VM are quite complicated. You have to set up > > > bonding, you have to configure it to enslave two interfaces, > > > those interfaces (one is virtio-net, the other is SR-IOV > > > device/driver like ixgbevf) must support MAC address changes > > > requested in the VM, and on and on... > > > > > > So, on to the proposal: > > > Modify virtio-net driver to be a single VM network device that > > > enslaves an SR-IOV network device (inside the VM) with the same > > > MAC address. This would cause the virtio-net driver to appear and > > > work like a simplified bonding/team driver. The live migration > > > problem would be solved just like today's bonding solution, but > > > the VM user's networking config would be greatly simplified. > > > > > > At it's simplest, it would appear something like this in the VM. > > > > > > == > > > = vnet0 = > > > = > > > (virtio- = | > > > net)= | > > > = == > > > = = ixgbef = > > > == == > > > > > > (forgive the ASCII art) > > > > > > The fast path traffic would prefer the ixgbevf or other SR-IOV > > > device path, and fall back to virtio's transmit/receive when > > > migrating. > > > > > > Compared to today's options this proposal would > > > 1) make virtio-net more sticky, allow fast path traffic at SR-IOV > > > speeds > > > 2) simplify end user configuration in the VM (most if not all of > > > the set up to enable migration would be done in the hypervisor) > > > 3) allow live migration via a simple link down and maybe a PCI > > > hot-unplug of the SR-IOV device, with failover to the > > > virtio-net driver core > > > 4) allow vendor agnostic hardware acceleration, and live migration > > > between vendors if the VM os has driver support for all the > > > required SR-IOV devices. > > > > > > Runtime operation proposed: > > > - virtio-net driver loads, SR-IOV driver loads > > > - virtio-net finds other NICs that match it's MAC address by > > >both examining existing interfaces, and sets up a new device > > > notifier > > > - virtio-net enslaves the first NIC with the same MAC address > > > - virtio-net brings up the slave, and makes it the "preferred" > > > path > > > - virtio-net follows the behavior of an active backup bond/team > > > - virtio-net acts as the interface to the VM > > > - live migration initiates > > > - link goes down on SR-IOV, or SR-IOV device is removed > > > - failover to virtio-net as primary path > > > - migration continues to new host > > > - new host is started with virio-net as primary > > > - if no SR-IOV, virtio-net stays primary > > > - hypervisor can hot-add SR-IOV NIC, with same MAC addr as virtio > > > - virtio-net notices new NIC and starts over at enslave step above > > > > > > Future ideas (brainstorming): > > > - Optimize Fast east-west by having special rules to direct > > > east-west traffic through virtio-net traffic path > > > > > > Thanks for reading! > > > Jesse > > > > Cc netdev. > > > > Interesting, and this method is actually used by netvsc now: > > > > commit 0c195567a8f6e82ea5535cd9f1d54a1626dd233e > > Author: stephen hemminger <step...@networkplumber.org> > > Da
Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
On Tue, 19 Dec 2017 20:07:01 +0200 "Michael S. Tsirkin" <m...@redhat.com> wrote: > On Tue, Dec 19, 2017 at 09:55:48AM -0800, Stephen Hemminger wrote: > > On Tue, 19 Dec 2017 09:41:39 -0800 > > "Samudrala, Sridhar" <sridhar.samudr...@intel.com> wrote: > > > > > On 12/19/2017 7:47 AM, Michael S. Tsirkin wrote: > > > > I'll need to look at this more, in particular the feature > > > > bit is missing here. For now one question: > > > > > > > > On Mon, Dec 18, 2017 at 04:40:36PM -0800, Sridhar Samudrala wrote: > > > >> @@ -56,6 +58,8 @@ module_param(napi_tx, bool, 0644); > > > >>*/ > > > >> DECLARE_EWMA(pkt_len, 0, 64) > > > >> > > > >> +#define VF_TAKEOVER_INT (HZ / 10) > > > >> + > > > >> #define VIRTNET_DRIVER_VERSION "1.0.0" > > > >> > > > >> static const unsigned long guest_offloads[] = { > > > > Why is this delay necessary? And why by 100ms? > > > > > > This is based on netvsc implementation and here is the commit that > > > added this delay. Not sure if this needs to be 100ms. > > > > > > commit 6123c66854c174e4982f98195100c1d990f9e5e6 > > > Author: stephen hemminger <step...@networkplumber.org> > > > Date: Wed Aug 9 17:46:03 2017 -0700 > > > > > > netvsc: delay setup of VF device > > > > > > When VF device is discovered, delay bring it automatically up in > > > order to allow userspace to some simple changes (like renaming). > > > > > > > > > > > > > could be 10ms, just enough to let udev do its renaming > > Isn't there a way not to depend on udev completing its thing within a given > timeframe? Not that I know. the path is quite indirect. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
On Tue, 19 Dec 2017 14:37:50 -0800 "Samudrala, Sridhar" <sridhar.samudr...@intel.com> wrote: > On 12/19/2017 11:46 AM, Stephen Hemminger wrote: > > On Tue, 19 Dec 2017 11:42:33 -0800 > > "Samudrala, Sridhar" <sridhar.samudr...@intel.com> wrote: > > > >> On 12/19/2017 10:41 AM, Stephen Hemminger wrote: > >>> On Tue, 19 Dec 2017 13:21:17 -0500 (EST) > >>> David Miller <da...@davemloft.net> wrote: > >>> > >>>> From: Stephen Hemminger <step...@networkplumber.org> > >>>> Date: Tue, 19 Dec 2017 09:55:48 -0800 > >>>> > >>>>> could be 10ms, just enough to let udev do its renaming > >>>> Please, move to some kind of notification or event based handling of > >>>> this problem. > >>>> > >>>> No delay is safe, what if userspace gets swapped out or whatever > >>>> else might make userspace stall unexpectedly? > >>>> > >>> The plan is to remove the delay and do the naming in the kernel. > >>> This was suggested by Lennart since udev is only doing naming policy > >>> because kernel names were not repeatable. > >>> > >>> This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly. > >>> > >>> Patch is pending. > >> Do we really need to delay the setup until the name is changed? > >> Can't we call dev_set_mtu() and dev_open() until dev_change_name() is done? > >> > >> Thanks > >> Sridhar > > You can call dev_set_mtu, but when dev_open is done the device name > > can not be changed by userspace. > I did a quick test to remove the delay and also the dev_open() call and > i don't see > any issues with virtio taking over the VF datapath. > Only the netdev_info() messages may show old device name. > > Any specific scenario where we need to explicitly call VF's dev_open() > in the VF setup process? > I tried i40evf driver loaded after virtio_net AND virtio_net loading > after i40evf. > > Thanks > Sridhar It happens with hotplug. It is possible on Hyper-V to hotplug SR-IOV on and off while guest is running. If SR-IOV is disabled in host then the VF device is removed (hotplug) and the inverse. If the master device is up then the VF device should be brought up by the master device. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
On Tue, 19 Dec 2017 13:21:17 -0500 (EST) David Miller <da...@davemloft.net> wrote: > From: Stephen Hemminger <step...@networkplumber.org> > Date: Tue, 19 Dec 2017 09:55:48 -0800 > > > could be 10ms, just enough to let udev do its renaming > > Please, move to some kind of notification or event based handling of > this problem. > > No delay is safe, what if userspace gets swapped out or whatever > else might make userspace stall unexpectedly? > The plan is to remove the delay and do the naming in the kernel. This was suggested by Lennart since udev is only doing naming policy because kernel names were not repeatable. This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly. Patch is pending. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
On Tue, 19 Dec 2017 11:42:33 -0800 "Samudrala, Sridhar" <sridhar.samudr...@intel.com> wrote: > On 12/19/2017 10:41 AM, Stephen Hemminger wrote: > > On Tue, 19 Dec 2017 13:21:17 -0500 (EST) > > David Miller <da...@davemloft.net> wrote: > > > >> From: Stephen Hemminger <step...@networkplumber.org> > >> Date: Tue, 19 Dec 2017 09:55:48 -0800 > >> > >>> could be 10ms, just enough to let udev do its renaming > >> Please, move to some kind of notification or event based handling of > >> this problem. > >> > >> No delay is safe, what if userspace gets swapped out or whatever > >> else might make userspace stall unexpectedly? > >> > > The plan is to remove the delay and do the naming in the kernel. > > This was suggested by Lennart since udev is only doing naming policy > > because kernel names were not repeatable. > > > > This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly. > > > > Patch is pending. > Do we really need to delay the setup until the name is changed? > Can't we call dev_set_mtu() and dev_open() until dev_change_name() is done? > > Thanks > Sridhar You can call dev_set_mtu, but when dev_open is done the device name can not be changed by userspace. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC] virtio-net: help live migrate SR-IOV devices
On Tue, 5 Dec 2017 14:29:28 -0800 Jakub Kicinskiwrote: > On Tue, 5 Dec 2017 11:59:17 +0200, achiad shochat wrote: > > I second Jacob - having a netdev of one device driver enslave a netdev > > of another device driver is an awkward a-symmetric model. > > Regardless of whether they share the same backend device. > > Only I am not sure the Linux Bond is the right choice. > > e.g one may well want to use the virtio device also when the > > pass-through device is available, e.g for multicasts, east-west > > traffic, etc. > > I'm not sure the Linux Bond fits that functionality. > > And, as I hear in this thread, it is hard to make it work out of the > > box. > > So I think the right thing would be to write a new dedicated module > > for this purpose. > > > > > > This part I can sort of agree with. What if we were to look at > > > providing a way to somehow advertise that the two devices were meant > > > to be boded for virtualization purposes? For now lets call it a > > > "virt-bond". Basically we could look at providing a means for virtio > > > and VF drivers to advertise that they want this sort of bond. Then it > > > would just be a matter of providing some sort of side channel to > > > indicate where you want things like multicast/broadcast/east-west > > > traffic to go. > > > > I like this approach. > > +1 on a separate driver, just enslaving devices to virtio may break > existing setups. If people are bonding from user space today, if they > update their kernel it may surprise them how things get auto-mangled. > > Is what Alex is suggesting a separate PV device that says "I would > like to be a bond of those two interfaces"? That would make the HV > intent explicit and kernel decisions more understandable. So far, in my experience it still works. As long as the kernel slaving happens first, it will work. The attempt to bond an already slaved device will fail and no scripts seem to check the error return. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC] virtio-net: help live migrate SR-IOV devices
On Sun, 3 Dec 2017 11:14:37 +0200 achiad shochatwrote: > On 3 December 2017 at 07:05, Michael S. Tsirkin wrote: > > On Fri, Dec 01, 2017 at 12:08:59PM -0800, Shannon Nelson wrote: > >> On 11/30/2017 6:11 AM, Michael S. Tsirkin wrote: > >> > On Thu, Nov 30, 2017 at 10:08:45AM +0200, achiad shochat wrote: > >> > > Re. problem #2: > >> > > Indeed the best way to address it seems to be to enslave the VF driver > >> > > netdev under a persistent anchor netdev. > >> > > And it's indeed desired to allow (but not enforce) PV netdev and VF > >> > > netdev to work in conjunction. > >> > > And it's indeed desired that this enslavement logic work out-of-the > >> > > box. > >> > > But in case of PV+VF some configurable policies must be in place (and > >> > > they'd better be generic rather than differ per PV technology). > >> > > For example - based on which characteristics should the PV+VF coupling > >> > > be done? netvsc uses MAC address, but that might not always be the > >> > > desire. > >> > > >> > It's a policy but not guest userspace policy. > >> > > >> > The hypervisor certainly knows. > >> > > >> > Are you concerned that someone might want to create two devices with the > >> > same MAC for an unrelated reason? If so, hypervisor could easily set a > >> > flag in the virtio device to say "this is a backup, use MAC to find > >> > another device". > >> > >> This is something I was going to suggest: a flag or other configuration on > >> the virtio device to help control how this new feature is used. I can > >> imagine this might be useful to control from either the hypervisor side or > >> the VM side. > >> > >> The hypervisor might want to (1) disable it (force it off), (2) enable it > >> for VM choice, or (3) force it on for the VM. In case (2), the VM might be > >> able to chose whether it wants to make use of the feature, or stick with > >> the > >> bonding solution. > >> > >> Either way, the kernel is making a feature available, and the user (VM or > >> hypervisor) is able to control it by selecting the feature based on the > >> policy desired. > >> > >> sln > > > > I'm not sure what's the feature that is available here. > > > > I saw this as a flag that says "this device shares backend with another > > network device which can be found using MAC, and that backend should be > > preferred". kernel then forces configuration which uses that other > > backend - as long as it exists. > > > > However, please Cc virtio-dev mailing list if we are doing this since > > this is a spec extension. > > > > -- > > MST > > > Can someone please explain why assume a virtio device is there at all?? > I specified a case where there isn't any. > > I second Jacob - having a netdev of one device driver enslave a netdev > of another device driver is an awkward a-symmetric model. > Regardless of whether they share the same backend device. > Only I am not sure the Linux Bond is the right choice. > e.g one may well want to use the virtio device also when the > pass-through device is available, e.g for multicasts, east-west > traffic, etc. > I'm not sure the Linux Bond fits that functionality. > And, as I hear in this thread, it is hard to make it work out of the box. > So I think the right thing would be to write a new dedicated module > for this purpose. > > Re policy - > Indeed the HV can request a policy from the guest but that's not a > claim for the virtio device enslaving the pass-through device. > Any policy can be queried by the upper enslaving device. > > Bottom line - I do not see a single reason to have the virtio netdev > (nor netvsc or any other PV netdev) enslave another netdev by itself. > If we'd do it right with netvsc from the beginning we wouldn't need > this discussion at all... There are several issues with transparent migration. The first is that the SR-IOV device needs to be shut off for earlier in the migration process. Next, the SR-IOV device in the migrated go guest environment maybe different. It might not exist at all, it might be at a different PCI address, or it could even be a different vendor/speed/model. Keeping a virtual network device around allows persisting the connectivity, during the process. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v10 2/4] net: Introduce generic failover module
On Mon, 7 May 2018 15:10:44 -0700 Sridhar Samudralawrote: > +static struct net_device *net_failover_get_bymac(u8 *mac, > + struct net_failover_ops **ops) > +{ > + struct net_device *failover_dev; > + struct net_failover *failover; > + > + spin_lock(_failover_lock); > + list_for_each_entry(failover, _failover_list, list) { > + failover_dev = rtnl_dereference(failover->failover_dev); > + if (ether_addr_equal(failover_dev->perm_addr, mac)) { > + *ops = rtnl_dereference(failover->ops); > + spin_unlock(_failover_lock); > + return failover_dev; > + } > + } > + spin_unlock(_failover_lock); > + return NULL; > +} This is broken if non-ethernet devices such as Infiniband are present. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v10 2/4] net: Introduce generic failover module
On Mon, 7 May 2018 15:10:44 -0700 Sridhar Samudralawrote: > This provides a generic interface for paravirtual drivers to listen > for netdev register/unregister/link change events from pci ethernet > devices with the same MAC and takeover their datapath. The notifier and > event handling code is based on the existing netvsc implementation. > > It exposes 2 sets of interfaces to the paravirtual drivers. > 1. For paravirtual drivers like virtio_net that use 3 netdev model, the >the failover module provides interfaces to create/destroy additional >master netdev and all the slave events are managed internally. > net_failover_create() > net_failover_destroy() >A failover netdev is created that acts a master device and controls 2 >slave devices. The original virtio_net netdev is registered as 'standby' >netdev and a passthru/vf device with the same MAC gets registered as >'primary' netdev. Both 'standby' and 'failover' netdevs are associated >with the same 'pci' device. The user accesses the network interface via >'failover' netdev. The 'failover' netdev chooses 'primary' netdev as >default for transmits when it is available with link up and running. > 2. For existing netvsc driver that uses 2 netdev model, no master netdev >is created. The paravirtual driver registers each instance of netvsc >as a 'failover' netdev along with a set of ops to manage the slave >events. There is no 'standby' netdev in this model. A passthru/vf device >with the same MAC gets registered as 'primary' netdev. > net_failover_register() > net_failover_unregister() > > Signed-off-by: Sridhar Samudrala You are conflating the net_failover device (3 device model) with the generic network failover infrastructure into one file. There should be two seperate files net/core/failover.c and drivers/net/failover.c which splits the work into two parts (and acts a check for the api). ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v10 2/4] net: Introduce generic failover module
On Mon, 7 May 2018 15:10:44 -0700 Sridhar Samudralawrote: > + if (netif_running(failover_dev)) { > + err = dev_open(slave_dev); > + if (err && (err != -EBUSY)) { > + netdev_err(failover_dev, "Opening slave %s failed > err:%d\n", > +slave_dev->name, err); > + goto err_dev_open; > + } > + } > + > + netif_addr_lock_bh(failover_dev); > + dev_uc_sync_multiple(slave_dev, failover_dev); > + dev_uc_sync_multiple(slave_dev, failover_dev); > + netif_addr_unlock_bh(failover_dev); > + The order of these is backwards, you want to sync addresses before bringing up. Also, doing it this way does not allow udev/systemd the chance to rename VF devices. The complexity of this whole failover mechanism does not make life easier, more reliable, or safer for netvsc. I though that was the whole reason for having common code. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework
On Thu, 24 May 2018 09:55:14 -0700 Sridhar Samudralawrote: > --- a/drivers/net/hyperv/Kconfig > +++ b/drivers/net/hyperv/Kconfig > @@ -2,5 +2,6 @@ config HYPERV_NET > tristate "Microsoft Hyper-V virtual network driver" > depends on HYPERV > select UCS2_STRING > + select FAILOVER When I take a working kernel config, add the patches then do make oldconfig It is not autoselecting FAILOVER, it prompts me for it. This means if user says no then a non-working netvsc device is made. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v12 1/5] net: Introduce generic failover module
On Thu, 24 May 2018 09:55:13 -0700 Sridhar Samudralawrote: > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index 03ed492c4e14..0f4ba52b641d 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -1421,6 +1421,8 @@ struct net_device_ops { > * entity (i.e. the master device for bridged veth) > * @IFF_MACSEC: device is a MACsec device > * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook > + * @IFF_FAILOVER: device is a failover master device > + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device > */ > enum netdev_priv_flags { > IFF_802_1Q_VLAN = 1<<0, > @@ -1450,6 +1452,8 @@ enum netdev_priv_flags { > IFF_PHONY_HEADROOM = 1<<24, > IFF_MACSEC = 1<<25, > IFF_NO_RX_HANDLER = 1<<26, > + IFF_FAILOVER= 1<<27, > + IFF_FAILOVER_SLAVE = 1<<28, > }; Why is FAILOVER any different than other master/slave relationships. I don't think you need to take up precious netdev flag bits for this. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v12 1/5] net: Introduce generic failover module
On Thu, 24 May 2018 09:55:13 -0700 Sridhar Samudralawrote: > + spin_lock(_lock); Since register is not in fast path, this should be a mutex? > +int failover_slave_unregister(struct net_device *slave_dev) > +{ > + struct net_device *failover_dev; > + struct failover_ops *fops; > + > + if (!netif_is_failover_slave(slave_dev)) > + goto done; > + > + ASSERT_RTNL(); > + > + failover_dev = failover_get_bymac(slave_dev->perm_addr, ); > + if (!failover_dev) > + goto done; Since the slave device must have a master device set already, why not use that instead of searching by MAC address on unregister or link change. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework
On Fri, 25 May 2018 16:11:47 -0700 "Samudrala, Sridhar" <sridhar.samudr...@intel.com> wrote: > On 5/25/2018 3:34 PM, Stephen Hemminger wrote: > > On Thu, 24 May 2018 09:55:14 -0700 > > Sridhar Samudrala <sridhar.samudr...@intel.com> wrote: > > > >> --- a/drivers/net/hyperv/Kconfig > >> +++ b/drivers/net/hyperv/Kconfig > >> @@ -2,5 +2,6 @@ config HYPERV_NET > >>tristate "Microsoft Hyper-V virtual network driver" > >>depends on HYPERV > >>select UCS2_STRING > >> + select FAILOVER > > When I take a working kernel config, add the patches then do > > make oldconfig > > > > It is not autoselecting FAILOVER, it prompts me for it. This means > > if user says no then a non-working netvsc device is made. > > I see > Generic failover module (FAILOVER) [M/y/?] (NEW) > > So the user is given an option to either build as a Module or part of the > kernel. 'n' is not an option. With most libraries there is no prompt at all. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework
On Thu, 24 May 2018 09:55:14 -0700 Sridhar Samudrala wrote: > Use the registration/notification framework supported by the generic > failover infrastructure. > > Signed-off-by: Sridhar Samudrala Why was this merged? It was never signed off by any of the netvsc maintainers, and there were still issues unresolved. There are also namespaces issues I am fixing and this breaks them. Will start my patch set with a revert for this. Sorry ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v12 1/5] net: Introduce generic failover module
On Fri, 25 May 2018 16:06:58 -0700 "Samudrala, Sridhar" wrote: > On 5/25/2018 3:38 PM, Stephen Hemminger wrote: > > On Thu, 24 May 2018 09:55:13 -0700 > > Sridhar Samudrala wrote: > > > >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > >> index 03ed492c4e14..0f4ba52b641d 100644 > >> --- a/include/linux/netdevice.h > >> +++ b/include/linux/netdevice.h > >> @@ -1421,6 +1421,8 @@ struct net_device_ops { > >>* entity (i.e. the master device for bridged veth) > >>* @IFF_MACSEC: device is a MACsec device > >>* @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook > >> + * @IFF_FAILOVER: device is a failover master device > >> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device > >>*/ > >> enum netdev_priv_flags { > >>IFF_802_1Q_VLAN = 1<<0, > >> @@ -1450,6 +1452,8 @@ enum netdev_priv_flags { > >>IFF_PHONY_HEADROOM = 1<<24, > >>IFF_MACSEC = 1<<25, > >>IFF_NO_RX_HANDLER = 1<<26, > >> + IFF_FAILOVER= 1<<27, > >> + IFF_FAILOVER_SLAVE = 1<<28, > >> }; > > Why is FAILOVER any different than other master/slave relationships. > > I don't think you need to take up precious netdev flag bits for this. > > These are netdev priv flags. > Jiri says that IFF_MASTER/IFF_SLAVE are bonding specific flags and cannot be > used > with other failover mechanisms. Team also doesn't use this flags and it has > its own > priv_flags. > This change breaks userspace. We already have worked with partners to ignore devices marked as IFF_SLAVE, and IFF_SLAVE is visible to user space API's. NAK ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v12 1/5] net: Introduce generic failover module
On Fri, 25 May 2018 16:06:58 -0700 "Samudrala, Sridhar" wrote: > On 5/25/2018 3:38 PM, Stephen Hemminger wrote: > > On Thu, 24 May 2018 09:55:13 -0700 > > Sridhar Samudrala wrote: > > > >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > >> index 03ed492c4e14..0f4ba52b641d 100644 > >> --- a/include/linux/netdevice.h > >> +++ b/include/linux/netdevice.h > >> @@ -1421,6 +1421,8 @@ struct net_device_ops { > >>* entity (i.e. the master device for bridged veth) > >>* @IFF_MACSEC: device is a MACsec device > >>* @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook > >> + * @IFF_FAILOVER: device is a failover master device > >> + * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device > >>*/ > >> enum netdev_priv_flags { > >>IFF_802_1Q_VLAN = 1<<0, > >> @@ -1450,6 +1452,8 @@ enum netdev_priv_flags { > >>IFF_PHONY_HEADROOM = 1<<24, > >>IFF_MACSEC = 1<<25, > >>IFF_NO_RX_HANDLER = 1<<26, > >> + IFF_FAILOVER= 1<<27, > >> + IFF_FAILOVER_SLAVE = 1<<28, > >> }; > > Why is FAILOVER any different than other master/slave relationships. > > I don't think you need to take up precious netdev flag bits for this. > > These are netdev priv flags. > Jiri says that IFF_MASTER/IFF_SLAVE are bonding specific flags and cannot be > used > with other failover mechanisms. Team also doesn't use this flags and it has > its own > priv_flags. > They are already used by bonding and team. I don't see why this can't reuse them. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v12 2/5] netvsc: refactor notifier/event handling code to use the failover framework
On Wed, 30 May 2018 20:03:11 -0700 "Samudrala, Sridhar" wrote: > On 5/30/2018 7:06 PM, Stephen Hemminger wrote: > > On Thu, 24 May 2018 09:55:14 -0700 > > Sridhar Samudrala wrote: > > > >> Use the registration/notification framework supported by the generic > >> failover infrastructure. > >> > >> Signed-off-by: Sridhar Samudrala > > Why was this merged? It was never signed off by any of the netvsc > > maintainers, > > and there were still issues unresolved. > > > > There are also namespaces issues I am fixing and this breaks them. > > Will start my patch set with a revert for this. Sorry > > I would appreciate if you can make the fixes on top of this patch series. I > tried hard > to make sure that netvsc functionality and behavior doesn't change. > > It is possible that there could be some bugs introduced, but they can be > fixed. > Looks like Wei already found a bug and submitted a fix for that. > Ok, but several of these may clash with what you want for virtio. Like: - VF should be moved to namespace of virt device - VF should be associated based on message from host with serial # not registration notifier and MAC address. - control operations should use master device reference rather than searching based on MAC. As you can see these are structural changes. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next v8 4/4] netvsc: refactor notifier/event handling code to use the failover framework
On Thu, 26 Apr 2018 05:30:05 +0300 "Michael S. Tsirkin" <m...@redhat.com> wrote: > On Wed, Apr 25, 2018 at 05:08:37PM -0700, Stephen Hemminger wrote: > > On Wed, 25 Apr 2018 16:59:28 -0700 > > Sridhar Samudrala <sridhar.samudr...@intel.com> wrote: > > > > > Use the registration/notification framework supported by the generic > > > failover infrastructure. > > > > > > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com> > > > > NAK unless you prove this works on legacy distributions and with DPDK 18.05 > > without modification. > > It looks like it should work. What kind of proof are you looking for? > I tried this with working Ubuntu 17 on WS2016. It boots if the failover driver is configured in (as module). But if the configuration has: $ grep FAILOVER .config # CONFIG_NET_FAILOVER is not set CONFIG_MAY_USE_NET_FAILOVER=y The netvsc driver fails on boot with: [0.826447] hv_vmbus: registering driver hv_netvsc [0.829616] scsi 0:0:0:0: Direct-Access Msft Virtual Disk 1.0 PQ: 0 ANSI: 5 [0.836291] input: Microsoft Vmbus HID-compliant Mouse as /devices/0006:045E:0621.0001/input/input1 [0.839139] hid-generic 0006:045E:0621.0001: input: HID v0.01 Mouse [Microsoft Vmbus HID-compliant Mouse] on [0.964897] hv_vmbus: probe failed for device 849a776e-8120-4e4a-9a36-7e3d95ac75b3 (-95) [0.968039] hv_netvsc: probe of 849a776e-8120-4e4a-9a36-7e3d95ac75b3 failed with error -95 [1.112877] hv_vmbus: probe failed for device 53557f8e-057d-425b-9265-01c0fd7e273e (-95) [1.116064] hv_netvsc: probe of 53557f8e-057d-425b-9265-01c0fd7e273e failed with error -95 The system has two virtual networks. eth0 is on vswitch for management. eth1 is on vswitch with SR-IOV for performance tests. You probably need to just put the failover part in net/core and select it. It is trivial to get an evaluation version of Windows Server 2016 and setup a Linux VM. Please try it. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available
On Fri, 26 Jan 2018 18:30:03 -0800 Jakub Kicinskiwrote: > On Fri, 26 Jan 2018 15:30:35 -0800, Samudrala, Sridhar wrote: > > On 1/26/2018 2:47 PM, Jakub Kicinski wrote: > > > On Sat, 27 Jan 2018 00:14:20 +0200, Michael S. Tsirkin wrote: > > >> On Fri, Jan 26, 2018 at 01:46:42PM -0800, Siwei Liu wrote: > > and the VM is not expected to do any tuning/optimizations on the VF > > driver > > directly, > > i think the current patch that follows the netvsc model of 2 > > netdevs(virtio > > and vf) should > > work fine. > > >>> OK. For your use case that's fine. But that's too specific scenario > > >>> with lots of restrictions IMHO, perhaps very few users will benefit > > >>> from it, I'm not sure. If you're unwilling to move towards it, we'd > > >>> take this one and come back with a generic solution that is able to > > >>> address general use cases for VF/PT live migration . > > >> I think that's a fine approach. Scratch your own itch! I imagine a very > > >> generic virtio-switchdev providing host routing info to guests could > > >> address lots of usecases. A driver could bind to that one and enslave > > >> arbitrary other devices. Sounds reasonable. > > >> > > >> But given the fundamental idea of a failover was floated at least as > > >> early as 2013, and made 0 progress since precisely because it kept > > >> trying to address more and more features, and given netvsc is already > > >> using the basic solution with some success, I'm not inclined to block > > >> this specific effort waiting for the generic one. > > > I think there is an agreement that the extra netdev will be useful for > > > more advanced use cases, and is generally preferable. What is the > > > argument for not doing that from the start? If it was made I must have > > > missed it. Is it just unwillingness to write the extra 300 lines of > > > code? Sounds like a pretty weak argument when adding kernel ABI is at > > > stake... > > > > I am still not clear on the need for the extra netdev created by > > virtio_net. The only advantage i can see is that the stats can be > > broken between VF and virtio datapaths compared to the aggregrated > > stats on virtio netdev as seen with the 2 netdev approach. > > Maybe you're not convinced but multiple arguments were made. > > > With 2 netdev model, any VM image that has a working network > > configuration will transparently get VF based acceleration without > > any changes. > > Nothing happens transparently. Things may happen automatically. The > VF netdev doesn't disappear with netvsc. The PV netdev transforms into > something it did not use to be. And configures and reports some > information from the PV (e.g. speed) but PV doesn't pass traffic any > longer. > > > 3 netdev model breaks this configuration starting with the creation > > and naming of the 2 devices to udev needing to be aware of master and > > slave virtio-net devices. > > I don't understand this comment. There is one virtio-net device and > one "virtio-bond" netdev. And user space has to be aware of the special > automatic arrangement anyway, because it can't touch the VF. It > doesn't make any difference whether it ignores the VF or PV and VF. > It simply can't touch the slaves, no matter how many there are. > > > Also, from a user experience point of view, loading a virtio-net with > > BACKUP feature enabled will now show 2 virtio-net netdevs. > > One virtio-net and one virtio-bond, which represents what's happening. > > > For live migration with advanced usecases that Siwei is suggesting, i > > think we need a new driver with a new device type that can track the > > VF specific feature settings even when the VF driver is unloaded. I see no added value of the 3 netdev model, there is no need for a bond device. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
On Thu, 22 Feb 2018 13:30:12 -0800 Alexander Duyckwrote: > > Again, I undertand your motivation. Yet I don't like your solution. > > But if the decision is made to do this in-driver bonding. I would like > > to see it baing done some generic way: > > 1) share the same "in-driver bonding core" code with netvsc > >put to net/core. > > 2) the "in-driver bonding core" will strictly limit the functionality, > >like active-backup mode only, one vf, one backup, vf netdev type > >check (so noone could enslave a tap or anything else) > > If user would need something more, he should employ team/bond. Sharing would be good, but netvsc world would really like to only have one visible network device. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
(pruned to reduce thread) On Wed, 21 Feb 2018 16:17:19 -0800 Alexander Duyckwrote: > >>> FWIW two solutions that immediately come to mind is to export "backup" > >>> as phys_port_name of the backup virtio link and/or assign a name to the > >>> master like you are doing already. I think team uses team%d and bond > >>> uses bond%d, soft naming of master devices seems quite natural in this > >>> case. > >> > >> I figured I had overlooked something like that.. Thanks for pointing > >> this out. Okay so I think the phys_port_name approach might resolve > >> the original issue. If I am reading things correctly what we end up > >> with is the master showing up as "ens1" for example and the backup > >> showing up as "ens1nbackup". Am I understanding that right? > >> > >> The problem with the team/bond%d approach is that it creates a new > >> netdevice and so it would require guest configuration changes. > >> > >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio > >>> link is quite neat. > >> > >> I agree. For non-"backup" virio_net devices would it be okay for us to > >> just return -EOPNOTSUPP? I assume it would be and that way the legacy > >> behavior could be maintained although the function still exists. > >> > - When the 'active' netdev is unplugged OR not present on a destination > system after live migration, the user will see 2 virtio_net netdevs. > >>> > >>> That's necessary and expected, all configuration applies to the master > >>> so master must exist. > >> > >> With the naming issue resolved this is the only item left outstanding. > >> This becomes a matter of form vs function. > >> > >> The main complaint about the "3 netdev" solution is a bit confusing to > >> have the 2 netdevs present if the VF isn't there. The idea is that > >> having the extra "master" netdev there if there isn't really a bond is > >> a bit ugly. > > > > Is it this uglier in terms of user experience rather than > > functionality? I don't want it dynamically changed between 2-netdev > > and 3-netdev depending on the presence of VF. That gets back to my > > original question and suggestion earlier: why not just hide the lower > > netdevs from udev renaming and such? Which important observability > > benefits users may get if exposing the lower netdevs? > > > > Thanks, > > -Siwei > > The only real advantage to a 2 netdev solution is that it looks like > the netvsc solution, however it doesn't behave like it since there are > some features like XDP that may not function correctly if they are > left enabled in the virtio_net interface. > > As far as functionality the advantage of not hiding the lower devices > is that they are free to be managed. The problem with pushing all of > the configuration into the upper device is that you are limited to the > intersection of the features of the lower devices. This can be > limiting for some setups as some VFs support things like more queues, > or better interrupt moderation options than others so trying to make > everything work with one config would be ugly. > Let's not make XDP the blocker for doing the best solution from the end user point of view. XDP is just yet another offload thing which needs to be handled. The current backup device solution used in netvsc doesn't handle the full range of offload options (things like flow direction, DCB, etc); no one but the HW vendors seems to care. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
On Mon, 22 Jan 2018 15:27:40 -0800 "Samudrala, Sridhar"wrote: > On 1/22/2018 1:31 PM, Michael S. Tsirkin wrote: > > On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote: > >> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin > >> wrote: > >>> On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote: > > On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote: > > On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote: > >> On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala > >> wrote: > >>> This feature bit can be used by hypervisor to indicate virtio_net > >>> device to > >>> act as a backup for another device with the same MAC address. > >>> > >>> Signed-off-by: Sridhar Samudrala > >>> --- > >>>drivers/net/virtio_net.c| 2 +- > >>>include/uapi/linux/virtio_net.h | 3 +++ > >>>2 files changed, 4 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > >>> index 12dfc5fee58e..f149a160a8c5 100644 > >>> --- a/drivers/net/virtio_net.c > >>> +++ b/drivers/net/virtio_net.c > >>> @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = { > >>> VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \ > >>> VIRTIO_NET_F_CTRL_MAC_ADDR, \ > >>> VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \ > >>> - VIRTIO_NET_F_SPEED_DUPLEX > >>> + VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP > >>> > >>>static unsigned int features[] = { > >>> VIRTNET_FEATURES, > >>> diff --git a/include/uapi/linux/virtio_net.h > >>> b/include/uapi/linux/virtio_net.h > >>> index 5de6ed37695b..c7c35fd1a5ed 100644 > >>> --- a/include/uapi/linux/virtio_net.h > >>> +++ b/include/uapi/linux/virtio_net.h > >>> @@ -57,6 +57,9 @@ > >>>* Steering */ > >>>#define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ > >>> > >>> +#define VIRTIO_NET_F_BACKUP 62/* Act as backup for another > >>> device > >>> +* with the same MAC. > >>> +*/ > >>>#define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and > >>> duplex */ > >>> > >>>#ifndef VIRTIO_NET_NO_LEGACY > >> I'm not a huge fan of the name "backup" since that implies that the > >> Virtio interface is only used if the VF is not present, and there are > >> multiple instances such as dealing with east/west or > >> broadcast/multicast traffic where it may be desirable to use the > >> para-virtual interface rather then deal with PCI overhead/bottleneck > >> to send the packet. > > Right now hypervisors mostly expect that yes, only one at a time is > > used. E.g. if you try to do multicast sending packets on both VF and > > virtio then you will end up with two copies of each packet. > I think we want to use only 1 interface to send out any packet. In case > of > broadcast/multicasts it would be an optimization to send them via virtio > and > this patch series adds that optimization. > >>> Right that's what I think we should rather avoid for now. > >>> > >>> It's *not* an optimization if there's a single VM on this host, > >>> or if a specific multicast group does not have any VMs on same > >>> host. > >> Agreed. In my mind this is something that is controlled by the > >> pass-thru interface once it is enslaved. > > It would be pretty tricky to control through the PT > > interface since a PT interface pretends to be a physical > > device, which has no concept of VMs. > > > >>> I'd rather we just sent everything out on the PT if that's > >>> there. The reason we have virtio in the picture is just so > >>> we can migrate without downtime. > >> I wasn't saying we do that in all cases. That would be something that > >> would have to be decided by the pass-thru interface. Ideally the > >> virtio would provide just enough information to get itself into the > >> bond and I see this being the mechanism for it to do so. From there > >> the complexity mostly lies in the pass-thru interface to configure the > >> correct transmit modes if for example you have multiple pass-thru > >> interfaces or a more complex traffic setup due to things like > >> SwitchDev. > >> > >> In my mind we go the bonding route and there are few use cases for all > >> of this. First is the backup case that is being addressed here. That > >> becomes your basic "copy netvsc" approach for this which would be > >> default. It is how we would handle basic pass-thru back-up paths. If > >> the host decides to send multicast/broadcast traffic from the host up > >> through
Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework
On Tue, 10 Apr 2018 11:59:50 -0700 Sridhar Samudralawrote: > Use the registration/notification framework supported by the generic > bypass infrastructure. > > Signed-off-by: Sridhar Samudrala > --- Thanks for doing this. Your current version has couple show stopper issues. First, the slave device is instantly taking over the slave. This doesn't allow udev/systemd to do its device rename of the slave device. Netvsc uses a delayed work to workaround this. Secondly, the select queue needs to call queue selection in VF. The bonding/teaming logic doesn't work well for UDP flows. Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF") fixed this performance problem. Lastly, more indirection is bad in current climate. I am not completely adverse to this but it needs to be fast, simple and completely transparent. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization