Re: [PATCH v2 3/3] sched/preempt: fix cond_resched_lock() and cond_resched_softirq()
On Wed, 2015-07-15 at 12:52 +0300, Konstantin Khlebnikov wrote:
> These functions check should_resched() before unlocking the
> spinlock / re-enabling bh: preempt_count is always non-zero there, so
> should_resched() always returns false. cond_resched_lock() worked
> iff spin_needbreak is set.

Interesting, this definitely used to work (linux-3.11)

Any idea of which commit broke things ?

--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
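The bug report above can be modeled in userspace. This is a hypothetical sketch (plain ints standing in for the kernel's preempt counter, not the real kernel functions): because the check runs while the lock is still held, the preempt count is still raised and the resched test can never fire.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the reported bug: should_resched() requires
 * preempt_count == 0, but the buggy cond_resched_lock() called it
 * before unlocking, i.e. while preempt_count was still raised.
 */
static int preempt_count;
static bool need_resched = true;

static bool should_resched(void)
{
    return preempt_count == 0 && need_resched;
}

static bool buggy_cond_resched_lock(void)
{
    preempt_count++;              /* spin_lock() raised preempt_count */
    bool r = should_resched();    /* checked before unlock: always false */
    preempt_count--;              /* spin_unlock() */
    return r;
}
```

Even with need_resched pending, the buggy ordering returns false; the fix is to drop the lock (lowering the count) before testing.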
Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt
On Sat, 2014-10-11 at 15:16 +0800, Jason Wang wrote:
> We used to free transmitted packets in ndo_start_xmit() to get better
> performance. One side effect is that skb_orphan() needs to be called
> in ndo_start_xmit(), which makes sk_wmem_alloc inaccurate. For the TCP
> protocol, this means several optimizations such as TCP Small Queues
> and auto corking could not work well, which can lead to low throughput
> for small packet streams.
>
> Thanks to the urgent descriptor support, this patch tries to solve
> this issue by enabling the tx interrupt selectively for stream
> packets. This means we don't need to orphan TCP stream packets in
> ndo_start_xmit(); instead we enable the tx interrupt for those
> packets. After we get a tx interrupt, a tx napi is scheduled to free
> those packets. With this method, sk_wmem_alloc of a TCP socket is more
> accurate than before, which lets TCP batch more through TSQ and auto
> corking.
>
> Signed-off-by: Jason Wang <jasow...@redhat.com>
> ---
>  drivers/net/virtio_net.c | 164 ++++++++++++++++++++++++++++---------
>  1 file changed, 128 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 5810841..b450fc4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -72,6 +72,8 @@ struct send_queue {
>
>  	/* Name of the send queue: output.$index */
>  	char name[40];
> +
> +	struct napi_struct napi;
>  };
>
>  /* Internal representation of a receive virtqueue */
> @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  	return p;
>  }
>
> +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
> +{
> +	struct sk_buff *skb;
> +	unsigned int len;
> +	struct virtnet_info *vi = sq->vq->vdev->priv;
> +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> +	int sent = 0;
> +
> +	while (sent < budget &&
> +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		pr_debug("Sent skb %p\n", skb);
> +
> +		u64_stats_update_begin(&stats->tx_syncp);
> +		stats->tx_bytes += skb->len;
> +		stats->tx_packets++;
> +		u64_stats_update_end(&stats->tx_syncp);
> +
> +		dev_kfree_skb_any(skb);
> +		sent++;
> +	}

You could accumulate skb->len in a totlen var, and perform a single

	u64_stats_update_begin(&stats->tx_syncp);
	stats->tx_bytes += totlen;
	stats->tx_packets += sent;
	u64_stats_update_end(&stats->tx_syncp);

after the loop.

> +	return sent;
> +}

...

> +static bool virtnet_skb_needs_intr(struct sk_buff *skb)
> +{
> +	union {
> +		unsigned char *network;
> +		struct iphdr *ipv4;
> +		struct ipv6hdr *ipv6;
> +	} hdr;
> +	struct tcphdr *th = tcp_hdr(skb);
> +	u16 payload_len;
> +
> +	hdr.network = skb_network_header(skb);
> +
> +	/* Only IPv4/IPv6 with TCP is supported */

Oh well, yet another packet flow dissector :)

If most packets were caught by your implementation, you could use it
for the fast path and fall back to skb_flow_dissect() for encapsulated
traffic :

	struct flow_keys keys;

	if (!skb_flow_dissect(skb, &keys))
		return false;
	if (keys.ip_proto != IPPROTO_TCP)
		return false;

then check __skb_get_poff() how to get th, and check if there is some
payload...

> +	if ((skb->protocol == htons(ETH_P_IP)) &&
> +	    hdr.ipv4->protocol == IPPROTO_TCP) {
> +		payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 -
> +			      th->doff * 4;
> +	} else if ((skb->protocol == htons(ETH_P_IPV6) ||
> +		    hdr.ipv6->nexthdr == IPPROTO_TCP)) {
> +		payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4;
> +	} else {
> +		return false;
> +	}
> +
> +	/* We don't want to delay packet with PUSH bit and pure ACK packet */
> +	if (!th->psh && payload_len)
> +		return true;
> +
> +	return false;
> +}
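The payload-length arithmetic in the patch can be checked in isolation. A minimal userspace sketch (hypothetical helper, not the driver code; lengths taken in host byte order for simplicity): the TCP payload is the IP total length minus the IP header (ihl * 4) and the TCP header (doff * 4), so a pure ACK yields 0 and would not request an interrupt.

```c
#include <assert.h>
#include <stdint.h>

/* TCP payload length for an IPv4 packet, from the header fields used
 * in virtnet_skb_needs_intr(): total length minus IP header length
 * (ihl in 32-bit words) minus TCP header length (doff in 32-bit words).
 */
static uint16_t tcp_payload_len_ipv4(uint16_t tot_len, uint8_t ihl,
                                     uint8_t doff)
{
    return tot_len - ihl * 4 - doff * 4;
}
```

A 40-byte packet with 20-byte IP and TCP headers (ihl = 5, doff = 5) is a pure ACK with payload 0; adding 100 bytes of data gives payload 100.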
Re: [PULL 2/2] vhost: replace rcu with mutex
On Tue, 2014-06-03 at 14:48 +0200, Paolo Bonzini wrote:
> On 02/06/2014 23:58, Eric Dumazet wrote:
>>> This looks dubious
>>>
>>> What about using kfree_rcu() instead ?
>>
>> It would lead to unbound allocation from userspace.
>> Look at how we did this in commit
>> c3059477fce2d956a0bb3e04357324780c5d8eeb
>>
>>> translate_desc() still uses rcu_read_lock(), it's not clear if the
>>> mutex is really held.
>
> Yes, vhost_get_vq_desc must be called with the vq mutex held. The
> rcu_read_lock/unlock in translate_desc is unnecessary.

Yep, this is what I pointed out: this is not only unnecessary, but
confusing and might be incorrectly copy/pasted in the future.

This patch is a partial one and leaves confusion.

Some places use the proper :

	mp = rcu_dereference_protected(dev->memory,
				       lockdep_is_held(&dev->mutex));

others use the now incorrect :

	rcu_read_lock();
	mp = rcu_dereference(dev->memory);
	...
Re: [PULL 2/2] vhost: replace rcu with mutex
On Tue, 2014-06-03 at 00:30 +0300, Michael S. Tsirkin wrote:
> All memory accesses are done under some VQ mutex. So lock/unlock all
> VQs is a faster equivalent of synchronize_rcu() for memory access
> changes. Some guests cause a lot of these changes, so it's helpful
> to make them faster.
>
> Reported-by: Gonglei (Arei) <arei.gong...@huawei.com>
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> ---
>  drivers/vhost/vhost.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 78987e4..1c05e60 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -593,6 +593,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  {
>  	struct vhost_memory mem, *newmem, *oldmem;
>  	unsigned long size = offsetof(struct vhost_memory, regions);
> +	int i;
>
>  	if (copy_from_user(&mem, m, size))
>  		return -EFAULT;
> @@ -619,7 +620,14 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
>  	oldmem = rcu_dereference_protected(d->memory,
>  					   lockdep_is_held(&d->mutex));
>  	rcu_assign_pointer(d->memory, newmem);
> -	synchronize_rcu();
> +
> +	/* All memory accesses are done under some VQ mutex.
> +	 * So below is a faster equivalent of synchronize_rcu()
> +	 */
> +	for (i = 0; i < d->nvqs; ++i) {
> +		mutex_lock(&d->vqs[i]->mutex);
> +		mutex_unlock(&d->vqs[i]->mutex);
> +	}
> +
>  	kfree(oldmem);
>  	return 0;
>  }

This looks dubious

What about using kfree_rcu() instead ?

translate_desc() still uses rcu_read_lock(), it's not clear if the
mutex is really held.
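The trick in the patch can be sketched as a userspace analogue (hypothetical, pthread mutexes standing in for the per-virtqueue mutexes): if every reader only dereferences the shared pointer while holding one of the queue mutexes, then taking and dropping each mutex once after the pointer swap guarantees no reader can still be using the old pointer, a cheap stand-in for synchronize_rcu().

```c
#include <pthread.h>

/* After swapping the shared pointer, pass through every per-queue
 * mutex once. Any reader that could still see the old pointer holds
 * one of these mutexes, so once we have acquired and released each
 * one, all such readers are done and the old memory can be freed.
 */
static void wait_for_vq_readers(pthread_mutex_t *vq_mutex, int nvqs)
{
    for (int i = 0; i < nvqs; i++) {
        pthread_mutex_lock(&vq_mutex[i]);
        pthread_mutex_unlock(&vq_mutex[i]);
    }
}
```

The cost is nvqs uncontended lock/unlock pairs instead of a full grace period, which is the speedup the changelog claims.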
Re: Moving frags and SKBTX_DEV_ZEROCOPY skbs
On Wed, 2014-05-14 at 14:40 +0100, Zoltan Kiss wrote:
> Hi,
>
> Recently I've investigated issues around SKBTX_DEV_ZEROCOPY skbs
> where the frags list was modified. I came across this function
> skb_shift(), which moves frags between skbs. And there are a lot more
> of this kind, skb_split() or skb_try_coalesce(), for example. It
> could be a dangerous thing if a frag is referenced from an skb which
> doesn't have the original destructor_arg, and to avoid that
> skb_orphan_frags() should be called. Although these functions are
> probably not touched in the usual usecases, I think it would be
> useful to review core skb functions proactively and add an
> skb_orphan_frags() everywhere the frags could be referenced from
> other places. Any opinion about this?

For skb_shift(), it is currently used from the tcp stack only, where
this SKBTX_DEV_ZEROCOPY thing is not used, so I do not think there is
a bug for the moment.

I already gave a patch for skb_try_coalesce() :

For this one we do not want the skb_orphan_frags() overhead. It's
simply better in this case to abort.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1b62343f5837..85995a14aafc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3838,7 +3839,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 		return true;
 	}
 
-	if (skb_has_frag_list(to) || skb_has_frag_list(from))
+	if (skb_has_frag_list(to) ||
+	    skb_has_frag_list(from) ||
+	    (skb_shinfo(to)->tx_flags & SKBTX_DEV_ZEROCOPY) ||
+	    (skb_shinfo(from)->tx_flags & SKBTX_DEV_ZEROCOPY))
 		return false;
 
 	if (skb_headlen(from) != 0) {
Re: Moving frags and SKBTX_DEV_ZEROCOPY skbs
On Wed, 2014-05-14 at 13:42 -0400, David Miller wrote:
> From: Eric Dumazet <eric.duma...@gmail.com>
> Date: Wed, 14 May 2014 07:23:52 -0700
>
>> On Wed, 2014-05-14 at 14:40 +0100, Zoltan Kiss wrote:
>>> Hi,
>>>
>>> Recently I've investigated issues around SKBTX_DEV_ZEROCOPY skbs
>>> where the frags list was modified. I came across this function
>>> skb_shift(), which moves frags between skbs. And there are a lot
>>> more of this kind, skb_split() or skb_try_coalesce(), for example.
>>> It could be a dangerous thing if a frag is referenced from an skb
>>> which doesn't have the original destructor_arg, and to avoid that
>>> skb_orphan_frags() should be called. Although these functions are
>>> probably not touched in the usual usecases, I think it would be
>>> useful to review core skb functions proactively and add an
>>> skb_orphan_frags() everywhere the frags could be referenced from
>>> other places. Any opinion about this?
>>
>> For skb_shift(), it is currently used from the tcp stack only, where
>> this SKBTX_DEV_ZEROCOPY thing is not used, so I do not think there
>> is a bug for the moment.
>>
>> I already gave a patch for skb_try_coalesce() :
>>
>> For this one we do not want the skb_orphan_frags() overhead. It's
>> simply better in this case to abort.
>
> Eric can you please submit this formally? It is the second time I've
> seen it posted as RFC :-)

Sure, I was kind of waiting to make sure it was needed.

It looks like it is ;)
Re: Moving frags and SKBTX_DEV_ZEROCOPY skbs
On Wed, 2014-05-14 at 20:41 +0100, Zoltan Kiss wrote:
> On 14/05/14 15:23, Eric Dumazet wrote:
>> On Wed, 2014-05-14 at 14:40 +0100, Zoltan Kiss wrote:
>>> Recently I've investigated issues around SKBTX_DEV_ZEROCOPY skbs
>>> where the frags list was modified. I came across this function
>>> skb_shift(), which moves frags between skbs. And there are a lot
>>> more of this kind, skb_split() or skb_try_coalesce(), for example.
>>> It could be a dangerous thing if a frag is referenced from an skb
>>> which doesn't have the original destructor_arg, and to avoid that
>>> skb_orphan_frags() should be called. Although these functions are
>>> probably not touched in the usual usecases, I think it would be
>>> useful to review core skb functions proactively and add an
>>> skb_orphan_frags() everywhere the frags could be referenced from
>>> other places. Any opinion about this?
>>
>> For skb_shift(), it is currently used from the tcp stack only, where
>> this SKBTX_DEV_ZEROCOPY thing is not used, so I do not think there
>> is a bug for the moment.
>
> It is called from tcp_input.c, which suggests it can be called on
> incoming TCP packets.

Nope.

We split outgoing packets, stored in the socket write queue. These
packets are locally generated by tcp_sendmsg() and tcp_sendpage(); no
way we use SKBTX_DEV_ZEROCOPY yet.

This split happens when we receive an ACK, that's why it is in
tcp_input.c
Re: [PATCH] tun: use netif_receive_skb instead of netif_rx_ni
On Tue, 2014-02-11 at 22:25 +0800, Qin Chuanyu wrote:
> We could xmit directly instead of going through softirq, gaining
> improved throughput and latency.
>
> Test model: VM -> Host -> Host, doing transmit only, with the vhost
> thread and nic interrupt bound to cpu1. netperf does the throughput
> test and qperf does the latency test.
> Host OS: suse11sp3, Guest OS: suse11sp3
>
> latency result(us):
> packet_len   64    256   512   1460
> old(UDP)     44    47    48    66
> new(UDP)     38    41    42    66
> old(TCP)     52    55    70    117
> new(TCP)     45    48    61    114
>
> throughput result(Gbit/s):
> packet_len   64     512    1024   1460
> old(UDP)     0.42   2.02   3.75   4.68
> new(UDP)     0.45   2.14   3.77   5.06
>
> For TCP, due to the latency, the client couldn't send packets big
> enough to benefit from TSO of the nic, so the results show it sends
> more packets per second but gets lower throughput.
>
> Eric mentioned that it would have a problem with cgroups, but a patch
> for that had been sent by Herbert Xu:
> patch_id f845172531fb7410c7fb7780b1a6e51ee6df7d52
>
> Signed-off-by: Chuanyu Qin <qinchua...@huawei.com>
> ---
>  drivers/net/tun.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 44c4db8..90b4e58 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1184,7 +1184,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>  	skb_probe_transport_header(skb, 0);
>
>  	rxhash = skb_get_hash(skb);
> -	netif_rx_ni(skb);
> +	rcu_read_lock_bh();
> +	netif_receive_skb(skb);
> +	rcu_read_unlock_bh();
>
>  	tun->dev->stats.rx_packets++;
>  	tun->dev->stats.rx_bytes += len;

I already said this patch is not good :

rcu_read_lock_bh() makes no sense here. What is really needed is
local_bh_disable();

Herbert's patch ( http://patchwork.ozlabs.org/patch/52963/ ) had a
much cleaner form.

Just use it, CC him, credit him, please ?
Re: [PATCH] tun: use netif_receive_skb instead of netif_rx_ni
On Wed, 2014-02-12 at 13:28 +0800, Jason Wang wrote:
> A question: without NAPI weight, could this starve other net devices?

Not really, as net devices are serviced by the softirq handler.
Re: [PATCH] tun: use netif_receive_skb instead of netif_rx_ni
On Wed, 2014-02-12 at 13:50 +0800, Jason Wang wrote:
> On 02/12/2014 01:47 PM, Eric Dumazet wrote:
>> On Wed, 2014-02-12 at 13:28 +0800, Jason Wang wrote:
>>> A question: without NAPI weight, could this starve other net
>>> devices?
>>
>> Not really, as net devices are serviced by softirq handler.
>
> Yes, then the issue is tun could be starved by other net devices.

How does this patch change anything about this 'problem' ?

netif_rx_ni() can only be called if your process is not preempted by
other high prio tasks/softirqs.

If this process is scheduled on a cpu, then disabling bh to process
_one_ packet won't fundamentally change the dynamics of the system.
Re: 8% performance improved by change tap interact with kernel stack
On Tue, 2014-01-28 at 16:14 +0800, Qin Chuanyu wrote:
> According to the perf test result, I found that there is a 5%-8% cpu
> cost on softirq caused by netif_rx_ni() called in tun_get_user(), so
> I changed the function so that the skb is transmitted more quickly:
>
> from
> 	tun_get_user -> netif_rx_ni(skb);
> to
> 	tun_get_user -> rcu_read_lock_bh();
> 	                netif_receive_skb(skb);
> 	                rcu_read_unlock_bh();

No idea why you use rcu here ?

> The test result is as below:
> CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> NIC: intel 82599
> Host OS/Guest OS: suse11sp3
> Qemu-1.6
> netperf udp 512 (VM tx)
> test model: VM -> host -> host
>
> modified before : 2.00Gbps 461146pps
> modified after  : 2.16Gbps 498782pps
>
> 8% performance gained from this change. Is there any problem with
> this patch ?

http://patchwork.ozlabs.org/patch/52963/
Re: TCP small packets throughput and multiqueue virtio-net
On Fri, 2013-03-08 at 14:24 +0800, Jason Wang wrote:
> Hello all:
>
> I met an issue when testing multiqueue virtio-net. When testing guest
> small packet stream sending performance with netperf, I found a
> regression with multiqueue. When I run 2 sessions of TCP_STREAM tests
> with 1024 bytes from guest to local host, I get the following results:
>
> 1q result: 3457.64
> 2q result: 7781.45
>
> Statistics show that, compared with one queue, multiqueue tends to
> send many more but smaller packets. Tcpdump shows a single queue has
> a much higher probability of producing a 64K gso packet compared to
> multiqueue. More but smaller packets cause more vmexits and
> interrupts, which lead to a degradation in throughput. The problem
> only exists for small packet sending; when I test with larger sizes,
> multiqueue gradually outperforms single queue. And multiqueue also
> outperforms in both TCP_RR and pktgen tests (even with small
> packets). The problem disappears when I turn off both gso and tso.

This makes little sense to me : a TCP_RR workload (assuming one byte
payload) cannot use GSO or TSO anyway. Same for pktgen, as it uses
UDP.

> I'm not sure whether it's a bug or expected, since anyway we get an
> improvement in latency. And if it's a bug, I suspect it is related to
> the TCP GSO batching algorithm, which tends to batch less in this
> situation. (Jiri Pirko suspected a defect of the virtio-net driver,
> but I didn't find any obvious clue on this).
>
> After some experiments, I found it may be related to
> tcp_tso_should_defer(); after 1) changing tcp_tso_win_divisor to 1
> and 2) the following changes, the throughput was almost the same (but
> still a little worse) as single queue:
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index fd0cea1..dedd2a6 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1777,10 +1777,12 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
>
>  	limit = min(send_win, cong_win);
>
> +#if 0
>  	/* If a full-sized TSO skb can be sent, do it. */
>  	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
>  			   sk->sk_gso_max_segs * tp->mss_cache))
>  		goto send_now;
> +#endif
>
>  	/* Middle in queue won't get any more data, full sendable already? */
>  	if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
>
> Git history shows this check was added for both westwood and fasttcp.
> I'm not familiar with tcp, but it looks like we can easily hit this
> check when multiqueue is enabled for virtio-net. Maybe I am wrong,
> but I wonder whether we can still do some batching here. Any
> comments or thoughts are welcome.

Well, the point is : if your app does write(1024) bytes, that's
probably because it wants small packets from the very beginning.
(See the TCP PUSH flag ?)

If the transport is slow, the TCP stack will automatically collapse
several writes into single skbs (assuming TSO or GSO is on), and
you'll see big GSO packets with tcpdump [1]. So TCP will help you to
get less overhead in this case.

But if the transport is fast, you'll see small packets, and that's
good for latency.

So my opinion is : it's exactly behaving as expected. If you want
bigger packets, either :

- Make the application do big write()
- Slow the vmexit ;)

[1] GSO fools tcpdump : actual packets sent to the wire are not 'big
packets', but they hit dev_hard_start_xmit() as GSO packets.
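The send-now threshold discussed above can be sketched standalone. This is a hypothetical userspace model of just that one check in tcp_tso_should_defer() (plain integers instead of the kernel's sock/tcp_sock fields): TCP sends immediately when the window allows a full-sized TSO skb, and may otherwise defer to batch more data into one GSO packet.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the "#if 0"-ed check: send now if the usable window
 * (min of send window and congestion window) can carry a full-sized
 * TSO skb, i.e. min(gso_max_size, gso_max_segs * mss).
 */
static bool tso_send_now(unsigned int send_win, unsigned int cong_win,
                         unsigned int gso_max_size,
                         unsigned int gso_max_segs,
                         unsigned int mss_cache)
{
    unsigned int limit = send_win < cong_win ? send_win : cong_win;
    unsigned int segs_bytes = gso_max_segs * mss_cache;
    unsigned int full = gso_max_size < segs_bytes ? gso_max_size
                                                  : segs_bytes;

    return limit >= full;   /* full-sized TSO skb fits: do not defer */
}
```

With a fast virtio transport the window drains quickly, this check fires often, and small writes go out immediately instead of being batched, which matches the behavior Jason observed.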
Re: more interrupts (lower performance) in bare-metal compared with running VM
On Fri, 2012-07-27 at 22:09 -0500, sheng qiu wrote:
> Hi all,
>
> I am comparing network throughput performance in the bare-metal case
> with that of a VM with an assigned device (assigned NIC). I have two
> physical machines, each with a 10Gbit NIC; one is used as the remote
> server (runs netserver) and the other as the tested target (runs
> netperf with different send message sizes, TCP_STREAM test). The
> remote NIC is connected directly to the tested NIC, both 10Gbit.
> For the bare-metal case I enable 1 cpu core; for the VM I also
> configure 1 vcpu (memory is sufficient for both cases). I ran netperf
> for 120 seconds and got the following results:
>
>                   send message   interrupts   throughput (mbit/s)
> bare-metal        256            10696290     1114.84
>                   512            10106786     1391.92
>                   1024           10071032     1508.09
>                   2048           4560857      3434.65
>                   4096           3292200      4762.26
>                   8192           3169801      4733.89
>                   16384          2780529      4892.6

Are these interrupt counts taken on the receiver ?

> VM(assigned NIC)  256            3817904      2249.35
>                   512            3599007      4342.81
>                   1024           3005601      4134.69
>                   2048           2952122      4484
>                   4096           2682874      4566.34
>                   8192           2786719      4734.39
>                   16384          2603835      4540.47
>
> As shown, the interrupt count for the bare-metal case is much higher
> than the VM case for some message sizes. We also see the throughput
> in those situations is lower than the VM case. It is strange that
> bare-metal has lower performance than the VM case. Does anyone have
> comments on this? I am very confused.

Well, I think you answered your own question.

High interrupt rates are not good for throughput. They might be good
for latencies.

Using a VM adds delays, and several frames might be delivered per
interrupt. Using bare metal is faster, and only one frame is delivered
by the NIC per interrupt.

Try TCP_RR instead of TCP_STREAM for example.

What NIC is it exactly ? It seems it has no coalescing or LRO
strategy.

ethtool -k eth0
ethtool -c eth0

What kernel version was used, because 4892 Mbits is not line rate.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
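The frames-per-interrupt point can be checked roughly from the 256-byte rows of the table above. This is a back-of-envelope sketch (payload bytes only, headers ignored, 120-second runs as stated): the VM services far more frames per interrupt than bare metal, which is exactly the batching effect described.

```c
#include <assert.h>

/* Rough estimate: frames sent = throughput / message size over the
 * run, divided by the observed interrupt count. Ignores protocol
 * headers, so treat the result as an order-of-magnitude figure.
 */
static double frames_per_interrupt(double mbit_per_s, double msg_bytes,
                                   double seconds, double interrupts)
{
    double frames = mbit_per_s * 1e6 / 8.0 / msg_bytes * seconds;
    return frames / interrupts;
}
```

Plugging in the 256-byte rows gives roughly 6 frames per interrupt for bare metal versus roughly 35 for the VM, consistent with the explanation that the VM's added delay lets more frames accumulate per interrupt.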
Re: [PATCH RFC] tun: experimental zero copy tx support
On Mon, 2012-05-14 at 20:04 +0300, Michael S. Tsirkin wrote:
> On Mon, May 14, 2012 at 09:54:46AM -0700, Stephen Hemminger wrote:
>> On Sun, 13 May 2012 18:52:06 +0300
>> Michael S. Tsirkin <m...@redhat.com> wrote:
>>
>>> +	/* Userspace may produce vectors with count greater than
>>> +	 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
>>> +	 * to let the rest of data to be fit in the frags.
>>> +	 */
>>
>> Rather than complex partial code, just go through the slow path for
>> requests with too many frags (or for really small requests).
>> Creating mixed skbs seems too easy to get wrong.
>
> I don't object in principle, but macvtap has the same code, so it
> seems better to stay consistent.

If I remember well, the code in macvtap was buggy and still is.

Jason Wang's fixes are not yet in, so maybe wait a bit, so that we
don't add a pile of new bugs ?
Re: [PATCH RFC] tun: experimental zero copy tx support
On Mon, 2012-05-14 at 22:14 +0300, Michael S. Tsirkin wrote:
> They seem to be in net-next, or did I miss something?

Yes, you re-introduce in this patch some bugs already fixed in
macvtap.
Re: [PATCH 1/3] net: use this_cpu_xxx replace percpu_xxx funcs
On Thursday, 20 October 2011 at 15:32 +0800, Alex Shi wrote:
> percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace
> them for further code clean up.
>
> And in a preempt-safe scenario, __this_cpu_xxx funcs have a bit
> better performance, since __this_cpu_xxx has no redundant
> preempt_disable().
>
> Signed-off-by: Alex Shi <alex@intel.com>
> ---
>  net/netfilter/xt_TEE.c |   12 ++--
>  net/socket.c           |    4 ++--
>  2 files changed, 8 insertions(+), 8 deletions(-)

Acked-by: Eric Dumazet <eric.duma...@gmail.com>

Thanks !
Re: [PATCH] Code clean up for percpu_xxx() functions
On Thursday, 20 October 2011 at 10:44 +0800, Alex Shi wrote:
> Acked-by: Christoph Lameter <c...@gentwo.org>
>
> Thanks, Christoph! And resend for code style problem correction.

Sorry, this huge patch brings too many potential bugs.

I ask you for a separate patch for the networking part, because I
don't want to Ack all the other parts.

> diff --git a/net/socket.c b/net/socket.c
> index ffe92ca..2845d38 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -479,7 +479,7 @@ static struct socket *sock_alloc(void)
>
>  	inode->i_uid = current_fsuid();
>  	inode->i_gid = current_fsgid();
>
> -	percpu_add(sockets_in_use, 1);
> +	__this_cpu_add(sockets_in_use, 1);

No, we are in process context, you need :

	this_cpu_add(sockets_in_use, 1);

>  	return sock;
>  }
>
> @@ -522,7 +522,7 @@ void sock_release(struct socket *sock)
>  	if (rcu_dereference_protected(sock->wq, 1)->fasync_list)
>  		printk(KERN_ERR "sock_release: fasync list not empty!\n");
>
> -	percpu_sub(sockets_in_use, 1);
> +	__this_cpu_sub(sockets_in_use, 1);

this_cpu_sub();

>  	if (!sock->file) {
>  		iput(SOCK_INODE(sock));
>  		return;
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Monday, 22 August 2011 at 09:36 +0300, Avi Kivity wrote:
> On 08/20/2011 04:16 PM, Brad Campbell wrote:
>> Author: Alexander Duyck <alexander.h.du...@intel.com>
>> Date:   Thu Jul 1 13:28:27 2010 +0000
>>
>>     x86: Drop CONFIG_MCORE2 check around setting of NET_IP_ALIGN
>>
>>     This patch removes the CONFIG_MCORE2 check from around
>>     NET_IP_ALIGN. It is based on a suggestion from Andi Kleen. The
>>     assumption is that there are not any x86 cores where unaligned
>>     access is really slow, and this change would allow for a
>>     performance improvement to still exist on configurations that
>>     are not necessarily optimized for Core 2.
>>
>>     Cc: Andi Kleen <a...@linux.intel.com>
>>     Cc: Thomas Gleixner <t...@linutronix.de>
>>     Cc: Ingo Molnar <mi...@redhat.com>
>>     Cc: H. Peter Anvin <h...@zytor.com>
>>     Cc: x...@kernel.org
>>     Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
>>     Signed-off-by: Jeff Kirsher <jeffrey.t.kirs...@intel.com>
>>     Acked-by: H. Peter Anvin <h...@zytor.com>
>>     Signed-off-by: David S. Miller <da...@davemloft.net>
>>
>> :040000 040000 5a15867789080a2f67a74b17c4422f85b7a9fb4a b98769348bd765731ca3ff03b33764257e23226c M	arch
>>
>> I can confirm this bug exists in the 3.0 kernel, however I'm unable
>> to reproduce it on todays git. So anyone using netfilter, kvm and
>> bridge on kernels between 2.6.36-rc1 and 3.0 may hit this bug, but
>> it looks like it is fixed in the current 3.1-rc kernels.
>
> Thanks for this effort.

I don't think this patch is buggy in itself, it merely exposed another
bug which was fixed later on.

Some piece of hardware has a 2-byte offset requirement, and a driver
incorrectly assumed NET_IP_ALIGN was 2 on x86.

Brad, could you post your config (lsmod, dmesg) again ?

tg3.c code for example uses a private value, not related to
NET_IP_ALIGN :

#define TG3_RAW_IP_ALIGN 2
Re: [net-next RFC PATCH 4/7] tuntap: multiqueue support
On Friday, 12 August 2011 at 09:55 +0800, Jason Wang wrote:
> +	rxq = skb_get_rxhash(skb);
> +	if (rxq) {
> +		tfile = rcu_dereference(tun->tfiles[rxq % numqueues]);
> +		if (tfile)
> +			goto out;
> +	}

You can avoid an expensive divide with the following trick :

	u32 idx = ((u64)rxq * numqueues) >> 32;

> -static struct tun_struct *tun_get(struct file *file)
> +static void tun_detach_all(struct net_device *dev)
>  {
> -	return __tun_get(file->private_data);
> +	struct tun_struct *tun = netdev_priv(dev);
> +	struct tun_file *tfile, *tfile_list[MAX_TAP_QUEUES];
> +	int i, j = 0;
> +
> +	spin_lock(&tun_lock);
> +
> +	for (i = 0; i < MAX_TAP_QUEUES && tun->numqueues; i++) {
> +		tfile = rcu_dereference_protected(tun->tfiles[i],
> +						  lockdep_is_held(&tun_lock));
> +		if (tfile) {
> +			wake_up_all(&tfile->wq.wait);
> +			tfile_list[i++] = tfile;

typo here, you want

			tfile_list[j++] = tfile;

> +			rcu_assign_pointer(tun->tfiles[i], NULL);
> +			rcu_assign_pointer(tfile->tun, NULL);
> +			--tun->numqueues;
> +		}
> +	}
> +	BUG_ON(tun->numqueues != 0);
> +	spin_unlock(&tun_lock);
> +
> +	synchronize_rcu();
> +	for (--j; j >= 0; j--)
> +		sock_put(tfile_list[j]->sk);
>  }

Could you take a look at net/packet/af_packet.c, to check how David
did the whole fanout thing ?

__fanout_unlink()

Trick is to not leave NULL entries in the tun->tfiles[] array.

It makes things easier in hot path.
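The multiply-and-shift trick suggested above can be verified standalone. A minimal sketch: instead of `rxq % numqueues`, scale the full 32-bit hash range down to `[0, numqueues)` with one 64-bit multiply and a shift, avoiding a divide on the hot path. The result is always less than `numqueues`.

```c
#include <assert.h>
#include <stdint.h>

/* Map a 32-bit hash to [0, numqueues) without a modulo:
 * (rxq / 2^32) is a fraction in [0, 1), so multiplying by numqueues
 * and truncating yields a queue index, at the cost of one multiply.
 */
static uint32_t pick_queue(uint32_t rxq, uint32_t numqueues)
{
    return (uint32_t)(((uint64_t)rxq * numqueues) >> 32);
}
```

Note the distribution differs from `%`: this variant partitions the hash range into `numqueues` contiguous bands rather than interleaving remainders, which is fine for a uniform hash.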
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Thursday, 9 June 2011 at 01:02 +0800, Brad Campbell wrote:
> On 08/06/11 11:59, Eric Dumazet wrote:
>> Well, a bisection definitely should help, but needs a lot of time in
>> your case.
>
> Yes. Compile, test, crash, walk out to the other building to press
> reset, lather, rinse, repeat. I need a reset button on the end of a
> 50M wire, or a hardware watchdog!
>
> Actually it's not so bad. If I turn off slub debugging the kernel
> panics and reboots itself.
>
> This.. :
>
> [    2.913034] netconsole: remote ethernet address 00:16:cb:a7:dd:d1
> [    2.913066] netconsole: device eth0 not up yet, forcing it
> [    3.660062] Refined TSC clocksource calibration: 3213.422 MHz.
> [    3.660118] Switching to clocksource tsc
> [   63.200273] r8169 0000:03:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-1.fw (-2)
> [   63.223513] r8169 0000:03:00.0: eth0: link down
> [   63.223556] r8169 0000:03:00.0: eth0: link down
>
> ..is slowing down reboots considerably. 3.0-rc does _not_ like some
> timing hardware in my machine. Having said that, at least it does not
> randomly panic on SCSI like 2.6.39 does.
>
> Ok, I've ruled out TCPMSS. Found out where it was being set and
> neutered it. I've replicated it with only the single DNAT rule.
>
>> Could you try the following patch, because this is the 'usual
>> suspect' I had yesterday :
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 46cbd28..9f548f9 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>  		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
>>  	}
>>
>> +#if 0
>>  	if (fastpath &&
>>  	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
>>  		memmove(skb->head + size, skb_shinfo(skb),
>> @@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>>  		off = nhead;
>>  		goto adjust_others;
>>  	}
>> -
>> +#endif
>>  	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
>>  	if (!data)
>>  		goto nodata;
>
> Nope.. that's not it. <sigh>
>
> That might have changed the characteristic of the fault slightly, but
> unfortunately I got caught with a couple of fsck's, so I only got to
> test it 3 times tonight. It's unfortunate that this is a production
> system, so I can only take it down between about 9pm and 1am. That
> would normally be pretty productive, except that an fsck of a 14TB
> ext4 can take 30 minutes if it panics at the wrong time.
>
> I'm out of time tonight, but I'll have a crack at some bisection
> tomorrow night. Now I just have to go back far enough that it works,
> and be near enough not to have to futz around with /proc, /sys or
> drivers.
>
> I really, really, really appreciate you guys helping me with this. It
> has been driving me absolutely bonkers. If I'm ever in the same town
> as any of you, dinner and drinks are on me.

Hmm, I wonder if kmemcheck could help you, but it's slow as hell, so
not appropriate for production :(
Re: KVM induced panic on 2.6.38[2367] 2.6.39
On Tuesday, 7 June 2011 at 21:27 +0800, Brad Campbell wrote:
> On 07/06/11 04:22, Eric Dumazet wrote:
>> Could you please try latest linux-2.6 tree ?
>>
>> We fixed many networking bugs that could explain your crash.
>
> No good I'm afraid.
>
> [  543.040056] =============================================================================
> [  543.040136] BUG ip_dst_cache: Padding overwritten. 0xffff8803e4217ffe-0xffff8803e4217fff
> [  543.040194] -----------------------------------------------------------------------------

That's pretty strange : these are the last two bytes of a page, set to
0x0000 (a 16 bit value).

There is no way a dst field could actually sit on this location (it's
padding), since a dst is a bit less than 256 bytes (0xe8), and each
entry is aligned on a 64-byte address.

grep dst /proc/slabinfo
ip_dst_cache   32823  62944   256   32   2 : tunables  0  0  0 : slabdata  1967  1967  0

sizeof(struct rtable) = 0xe8

> [  543.040298] INFO: Slab 0xffffea000d9e74d0 objects=25 used=25 fp=0x(null) flags=0x80004081
> [  543.040364] Pid: 4576, comm: kworker/1:2 Not tainted 3.0.0-rc2 #1
> [  543.040415] Call Trace:
> [  543.040472]  [<ffffffff810b9c1d>] ? slab_err+0xad/0xd0
> [  543.040528]  [<ffffffff8102e034>] ? check_preempt_wakeup+0xa4/0x160
> [  543.040595]  [<ffffffff810ba206>] ? slab_pad_check+0x126/0x170
> [  543.040650]  [<ffffffff8133045b>] ? dst_destroy+0x8b/0x110
> [  543.040701]  [<ffffffff810ba29a>] ? check_slab+0x4a/0xc0
> [  543.040753]  [<ffffffff810baf2d>] ? free_debug_processing+0x2d/0x250
> [  543.040808]  [<ffffffff810bb27b>] ? __slab_free+0x12b/0x140
> [  543.040862]  [<ffffffff810bbe99>] ? kmem_cache_free+0x99/0xa0
> [  543.040915]  [<ffffffff8133045b>] ? dst_destroy+0x8b/0x110
> [  543.040967]  [<ffffffff813307f6>] ? dst_gc_task+0x196/0x1f0
> [  543.041021]  [<ffffffff8104e954>] ? queue_delayed_work_on+0x154/0x160
> [  543.041081]  [<ffffffff813066fe>] ? do_dbs_timer+0x20e/0x3d0
> [  543.041133]  [<ffffffff81330660>] ? dst_alloc+0x180/0x180
> [  543.041187]  [<ffffffff8104f28b>] ? process_one_work+0xfb/0x3b0
> [  543.041242]  [<ffffffff8104f964>] ? worker_thread+0x144/0x3d0
> [  543.041296]  [<ffffffff8102cc10>] ? __wake_up_common+0x50/0x80
> [  543.041678]  [<ffffffff8104f820>] ? rescuer_thread+0x2e0/0x2e0
> [  543.041729]  [<ffffffff8104f820>] ? rescuer_thread+0x2e0/0x2e0
> [  543.041782]  [<ffffffff81053436>] ? kthread+0x96/0xa0
> [  543.041835]  [<ffffffff813e1d14>] ? kernel_thread_helper+0x4/0x10
> [  543.041890]  [<ffffffff810533a0>] ? kthread_worker_fn+0x120/0x120
> [  543.041944]  [<ffffffff813e1d10>] ? gs_change+0xb/0xb
> [  543.041993] Padding 0xffff8803e4217f40: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.042718] Padding 0xffff8803e4217f50: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.043433] Padding 0xffff8803e4217f60: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.044155] Padding 0xffff8803e4217f70: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.044866] Padding 0xffff8803e4217f80: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.045590] Padding 0xffff8803e4217f90: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.046311] Padding 0xffff8803e4217fa0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.047034] Padding 0xffff8803e4217fb0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.047755] Padding 0xffff8803e4217fc0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.048474] Padding 0xffff8803e4217fd0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.049203] Padding 0xffff8803e4217fe0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
> [  543.049909] Padding 0xffff8803e4217ff0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 00 00  ZZZZZZZZZZZZZZ..
> [  543.050021] FIX ip_dst_cache: Restoring 0xffff8803e4217f40-0xffff8803e4217fff=0x5a
>
> Dropped -mm, Hugh and Andrea from CC as this does not appear to be mm
> or ksm related.
>
> I'll pare down the firewall and see if I can make it break easier
> with a smaller test set.

Hmm, not sure now :(

Could you reproduce another bug please ?
Re: KVM induced panic on 2.6.38[2367] 2.6.39
Le mardi 07 juin 2011 à 17:35 +0200, Patrick McHardy a écrit : The main suspects would be NAT and TCPMSS. Did you also try whether the crash occurs with only one of these rules? I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access the address the way I was doing it, so that's a no-go for me. That's really weird since you're apparently not using any bridge netfilter features. It shouldn't have any effect besides changing at which point ip_tables is invoked. How are your network devices configured (specifically any bridges)? Something in the kernel does:

	u16 *ptr = addr;	/* addr given by kmalloc() */
	ptr[-1] = 0;

Could be an off-by-one error in a memmove()/memcpy() or loop... I can't see a network issue here. I checked arch/x86/lib/memmove_64.S and it seems fine.
Re: KVM induced panic on 2.6.38[2367] 2.6.39
Le mercredi 08 juin 2011 à 08:18 +0800, Brad Campbell a écrit : On 08/06/11 06:57, Patrick McHardy wrote: On 07.06.2011 20:31, Eric Dumazet wrote: Le mardi 07 juin 2011 à 17:35 +0200, Patrick McHardy a écrit : The main suspects would be NAT and TCPMSS. Did you also try whether the crash occurs with only one of these rules? I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access the address the way I was doing it, so that's a no-go for me. That's really weird since you're apparently not using any bridge netfilter features. It shouldn't have any effect besides changing at which point ip_tables is invoked. How are your network devices configured (specifically any bridges)? Something in the kernel does u16 *ptr = addr (given by kmalloc()), ptr[-1] = 0; Could be an off-by-one error in a memmove()/memcpy() or loop... I can't see a network issue here. So far me neither, but netfilter appears to trigger the bug. Would it help if I tried some older kernels? This issue only surfaced for me recently as I only installed the VM's in question about 12 weeks ago and have only just started really using them in anger. I could try reproducing it on progressively older kernels to see if I can find one that works and then bisect from there. Well, a bisection definitely should help, but needs a lot of time in your case.
Could you try the following patch, because this is the 'usual suspect' I had yesterday :

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 46cbd28..9f548f9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
 	}
 
+#if 0
 	if (fastpath &&
 	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
 		memmove(skb->head + size, skb_shinfo(skb),
@@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 		off = nhead;
 		goto adjust_others;
 	}
-
+#endif
 	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
 	if (!data)
 		goto nodata;
Re: KVM induced panic on 2.6.38[2367] 2.6.39
Le dimanche 05 juin 2011 à 21:45 +0800, Brad Campbell a écrit : On 05/06/11 16:14, Avi Kivity wrote: On 06/03/2011 04:38 PM, Brad Campbell wrote: Is there anyone who can point me at the appropriate cage to rattle? I know it appears to be a netfilter issue, but I don't seem to be able to get a message to the list (and I am subscribed to it and have been getting mail for months) and I'm not sure who to pester. The other alternative is I just stop doing that and wait for it to bite someone else. The mailing list might be set not to send your own mails back to you. Check the list archive. Yep, I did that first.. Given the response to previous issues along the same line, it looks a bit like I just remember not to actually use the system in the way that triggers the bug and be happy that 99% of the time the kernel does not panic, but have that lovely feeling in the back of the skull that says any time now, and without obvious reason the whole machine might just come crashing down.. I guess it's still better than running Xen or Windows.. Could you please try latest linux-2.6 tree ? We fixed many networking bugs that could explain your crash. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM induced panic on 2.6.38[2367] 2.6.39
Le lundi 06 juin 2011 à 22:10 +0200, Bart De Schuymer a écrit : Hi Brad, This has probably nothing to do with ebtables, so please rmmod in case it's loaded. A few questions I didn't directly see an answer to in the threads I scanned... I'm assuming you actually use the bridging firewall functionality. So, what iptables modules do you use? Can you reduce your iptables rules to a core that triggers the bug? Or does it get triggered even with an empty set of firewall rules? Are you using a stock .35 kernel or is it patched? Is this something I can trigger on a poor guy's laptop or does it require specialized hardware (I'm catching up on qemu/kvm...)? Keep netdev, as this most probably is a networking bug. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V6 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb
Le jeudi 26 mai 2011 à 12:36 -0700, Shirley Ma a écrit : This patch adds userspace buffer support to the skb shared info. A new struct skb_ubuf_info is needed to maintain the userspace buffers argument and index; a callback is used to notify userspace to release the buffers once the lower device has done the DMA (the last reference to that skb has gone). If any userspace app still references these userspace buffers, the buffers will be copied into the kernel. This way we prevent userspace apps from holding these userspace buffers for too long. Signed-off-by: Shirley Ma x...@us.ibm.com
---
 include/linux/skbuff.h |   26 +++
 net/core/skbuff.c      |   80 ++-
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d0ae90a..025de5c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -189,6 +189,18 @@ enum {
 	SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
 };
 
+/*
+ * The callback notifies userspace to release buffers when skb DMA is done in
+ * lower device, the skb last reference should be 0 when calling this.
+ * The desc is used to track userspace buffer index.
+ */
+struct skb_ubuf_info {
+	/* support buffers allocation from userspace */
+	void		(*callback)(struct sk_buff *);
+	void		*arg;
+	size_t		desc;
+};

That's 24 bytes on each skb... desc for example is not used in this patch (yes, it's used later in patch 4/4). But still... could you instead use one pointer only in skb ?

+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -211,6 +223,10 @@ struct skb_shared_info {
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void		*destructor_arg;
+
+	/* DMA mapping from/to userspace buffers */
+	struct skb_ubuf_info ubuf;
+
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
@@ -2261,5 +2277,15 @@ static inline void skb_checksum_none_assert(struct sk_buff *skb)
 }
 
 bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
+
+/*
+ * skb_ubuf - is the buffer from userspace
+ * @skb: buffer to check
+ */
+static inline int skb_ubuf(const struct sk_buff *skb)
+{
+	return (skb_shinfo(skb)->ubuf.callback != NULL);

	return skb_shinfo(skb)->ubuf.callback != NULL;

+}
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_SKBUFF_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7ebeed0..890447c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	shinfo = skb_shinfo(skb);
 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
+	shinfo->ubuf.callback = NULL;
+	shinfo->ubuf.arg = NULL;

If you put the ubuf ptr before dataref, no need to add this (the memset() clears all of shared_info up to dataref).

 	kmemcheck_annotate_variable(shinfo->destructor_arg);
 
 	if (fclone) {
@@ -328,6 +330,14 @@ static void skb_release_data(struct sk_buff *skb)
 				put_page(skb_shinfo(skb)->frags[i].page);
 		}
 
+		/*
+		 * if skb buf is from userspace, we need to notify the caller
+		 * the lower device DMA has done;
+		 */
+		if (skb_ubuf(skb)) {
+			skb_shinfo(skb)->ubuf.callback(skb);
+			skb_shinfo(skb)->ubuf.callback = NULL;
+		}
+
 		if (skb_has_frag_list(skb))
 			skb_drop_fraglist(skb);
@@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
 	if (irqs_disabled())
 		return false;
 
+	if (skb_ubuf(skb))
+		return false;
+
 	if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
 		return false;
 
@@ -572,6 +585,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
 	atomic_set(&n->users, 1);
 
 	atomic_inc(&(skb_shinfo(skb)->dataref));
+	skb_shinfo(skb)->ubuf.callback = NULL;
 	skb->cloned = 1;
 
 	return n;
@@ -595,6 +609,48 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+/* skb frags copy userspace buffers to kernel */
+static int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	int i;
+	int num_frags = skb_shinfo(skb)->nr_frags;
+	struct page *page, *head = NULL;
+
+	for (i = 0; i < num_frags; i++) {
+		u8 *vaddr;
+		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+
+		page = alloc_page(GFP_ATOMIC);
+		if (!page) {
+			while
Re: [PATCH V6 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb
Le jeudi 26 mai 2011 à 13:24 -0700, Shirley Ma a écrit : I could reduce the callback pointer by moving it to *arg, but not desc; it indicates which buffer's DMA hasn't completed yet in *arg. I guess you don't need to use the skb itself to hold all your state ? I understand it's convenient for you, but I believe it's worth the pain to use only one pointer to a (small) object where you put all your stuff. Some machines alloc/free millions of skbs per second. If/when most skb uses are for userspace zero-copy buffers we can embed your small object in the skb itself ;)
[PATCH] posix-timers: RCU conversion
Ben Nagy reported a scalability problem with KVM/QEMU that hit very hard a single spinlock (idr_lock) in posix-timers code, on his 48 core machine. Even on a 16 cpu machine (2x4x2), a single test can show 98% of cpu time used in ticket_spin_lock, from lock_timer().

Ref: http://www.spinics.net/lists/kvm/msg51526.html

Switching to RCU is quite easy, IDR being already RCU ready. idr_lock should be locked only for an insert/delete, not a lookup.

Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime():

Before:
real	1m18.669s
user	0m1.346s
sys	1m17.180s

After:
real	0m3.296s
user	0m1.366s
sys	0m1.926s

Reported-by: Ben Nagy b...@iagu.net
Signed-off-by: Eric Dumazet eric.duma...@gmail.com
Cc: Avi Kivity a...@redhat.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: John Stultz johns...@us.ibm.com
Cc: Richard Cochran richard.coch...@omicron.at
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
---
 include/linux/posix-timers.h |    1 +
 kernel/posix-timers.c        |   25 ++++++++++++-----------
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index d51243a..5dc27ca 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -81,6 +81,7 @@ struct k_itimer {
 			unsigned long expires;
 		} mmtimer;
 	} it;
+	struct rcu_head rcu;
 };
 
 struct k_clock {
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 4c01249..acb9be9 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -491,6 +491,13 @@ static struct k_itimer * alloc_posix_timer(void)
 	return tmr;
 }
 
+static void k_itimer_rcu_free(struct rcu_head *head)
+{
+	struct k_itimer *tmr = container_of(head, struct k_itimer, rcu);
+
+	kmem_cache_free(posix_timers_cache, tmr);
+}
+
 #define IT_ID_SET	1
 #define IT_ID_NOT_SET	0
 static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
@@ -503,7 +510,7 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	kmem_cache_free(posix_timers_cache, tmr);
+	call_rcu(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static struct k_clock *clockid_to_kclock(const clockid_t id)
@@ -631,22 +638,18 @@ out:
 static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 {
 	struct k_itimer *timr;
-	/*
-	 * Watch out here. We do a irqsave on the idr_lock and pass the
-	 * flags part over to the timer lock. Must not let interrupts in
-	 * while we are moving the lock.
-	 */
-	spin_lock_irqsave(&idr_lock, *flags);
+
+	rcu_read_lock();
 	timr = idr_find(&posix_timers_id, (int)timer_id);
 	if (timr) {
-		spin_lock(&timr->it_lock);
+		spin_lock_irqsave(&timr->it_lock, *flags);
 		if (timr->it_signal == current->signal) {
-			spin_unlock(&idr_lock);
+			rcu_read_unlock();
 			return timr;
 		}
-		spin_unlock(&timr->it_lock);
+		spin_unlock_irqrestore(&timr->it_lock, *flags);
 	}
-	spin_unlock_irqrestore(&idr_lock, *flags);
+	rcu_read_unlock();
 
 	return NULL;
 }
Re: KVM lock contention on 48 core AMD machine
Le lundi 21 mars 2011 à 22:01 +0545, Ben Nagy a écrit : On Mon, Mar 21, 2011 at 7:38 PM, Avi Kivitya...@redhat.com wrote: In the future, please post the binary perf.dat. Hi Avi, How do I do that? 'make nconfig' and go to the kernel hacking section. Imprecise question sorry, I meant how do I get the perf.dat not how do I disable the debugging. On the non-debug kernel: Linux eax 2.6.38-7-server #36-Ubuntu SMP Fri Mar 18 23:36:13 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux 150512.00 67.8% __ticket_spin_lock [kernel.kallsyms] 11126.00 5.0% memcmp_pages [kernel.kallsyms] 9563.00 4.3% native_safe_halt [kernel.kallsyms] 8965.00 4.0% svm_vcpu_run /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm-amd.ko 6489.00 2.9% tg_load_down [kernel.kallsyms] 4676.00 2.1% kvm_get_cs_db_l_bits /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm.ko 1931.00 0.9% load_balance_fair [kernel.kallsyms] 1917.00 0.9% ktime_get [kernel.kallsyms] 1624.00 0.7% walk_tg_tree.clone.129 [kernel.kallsyms] 1542.00 0.7% find_busiest_group [kernel.kallsyms] 1326.00 0.6% find_next_bit [kernel.kallsyms] 673.00 0.3% lock_hrtimer_base.clone.25 [kernel.kallsyms] 624.00 0.3% copy_user_generic_string[kernel.kallsyms] top now says: top - 00:11:35 up 22 min, 4 users, load average: 0.11, 6.15, 7.78 Tasks: 491 total, 3 running, 488 sleeping, 0 stopped, 0 zombie Cpu(s): 0.9%us, 15.4%sy, 0.0%ni, 83.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 99068660k total, 70831760k used, 28236900k free,10036k buffers Swap: 2438140k total, 2173652k used, 264488k free, 3396144k cached With average 'resting cpu' per idle guest 8%, 96 guests running. Is this as good as I am going to get? It seems like I can't really debug that lock contention without blowing stuff up because of the load of the lock debugging... Don't know if I mentioned this before, but those guests are each pinned to a cpu (cpu guestnum%48) with taskset. 
Cheers, It seems you hit idr_lock contention (used in kernel/posix-timers.c)
Re: KVM lock contention on 48 core AMD machine
Le lundi 21 mars 2011 à 19:02 +0200, Avi Kivity a écrit : Any ideas on how to fix it? We could pre-allocate IDs and batch them in per-cpu caches, but it seems like a lot of work. Hmm, I don't know what syscalls kvm does, but even a timer_gettime() has to take this idr_lock. Sounds crazy, and should be fixed in kernel code, not kvm ;)
[RFC] posix-timers: RCU conversion
Le lundi 21 mars 2011 à 23:57 +0545, Ben Nagy a écrit : On Mon, Mar 21, 2011 at 10:57 PM, Eric Dumazet eric.duma...@gmail.com wrote: Le lundi 21 mars 2011 à 19:02 +0200, Avi Kivity a écrit : Any ideas on how to fix it? We could pre-allocate IDs and batch them in per-cpu caches, but it seems like a lot of work. Hmm, I dont know what syscalls kvm do, but even a timer_gettime() has to lock this idr_lock. Sounds crazy, and should be fixed in kernel code, not kvm ;) Hi, I'll need to work out a way I can make the perf.data available (~100M), but here's the summary http://paste.ubuntu.com/583425/ And here's the summary of the summary # Overhead Command Shared Object Symbol # ... .. # 71.86% kvm [kernel.kallsyms] [k] __ticket_spin_lock | --- __ticket_spin_lock | |--54.19%-- default_spin_lock_flags | _raw_spin_lock_irqsave | | | --54.14%-- __lock_timer | | | |--31.92%-- sys_timer_gettime | | system_call_fastpath | | | --22.22%-- sys_timer_settime | system_call_fastpath | |--15.66%-- _raw_spin_lock [...] Which I guess seems to match what Eric just said. I'll post a link to the full data tomorrow. Many thanks for the help so far. If it's a kernel scaling limit then I guess we just wait until the kernel gets better. :S I'll check it out with a linux guest tomorrow anyway. Here is a quick dirty patch to lower idr_lock use. You could try it eventually ;) Thanks [RFC] posix-timers: RCU conversion Ben Nagy reported a scalability problem with KVM/QEMU that hit very hard a single spinlock (idr_lock) in posix-timers code. Even on a 16 cpu machine (2x4x2), a single test can show 98% of cpu time used in ticket_spin_lock Switching to RCU should be quite easy, IDR being already RCU ready. idr_lock should be locked only before an insert/delete, not a find. 
Reported-by: Ben Nagy b...@iagu.net
Signed-off-by: Eric Dumazet eric.duma...@gmail.com
Cc: Avi Kivity a...@redhat.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: John Stultz johns...@us.ibm.com
Cc: Richard Cochran richard.coch...@omicron.at
---
 include/linux/posix-timers.h |    1 +
 kernel/posix-timers.c        |   26 ++++++++++++++------------
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index d51243a..5dc27ca 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -81,6 +81,7 @@ struct k_itimer {
 			unsigned long expires;
 		} mmtimer;
 	} it;
+	struct rcu_head rcu;
 };
 
 struct k_clock {
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 4c01249..e2a823a 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -491,6 +491,11 @@ static struct k_itimer * alloc_posix_timer(void)
 	return tmr;
 }
 
+static void k_itimer_rcu_free(struct rcu_head *head)
+{
+	kmem_cache_free(posix_timers_cache, container_of(head, struct k_itimer, rcu));
+}
+
 #define IT_ID_SET	1
 #define IT_ID_NOT_SET	0
 static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
@@ -499,11 +504,12 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 		unsigned long flags;
 		spin_lock_irqsave(&idr_lock, flags);
 		idr_remove(&posix_timers_id, tmr->it_id);
+		tmr->it_id = -1;
 		spin_unlock_irqrestore(&idr_lock, flags);
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	kmem_cache_free(posix_timers_cache, tmr);
+	call_rcu(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static struct k_clock *clockid_to_kclock(const clockid_t id)
@@ -631,22 +637,18 @@ out:
 static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 {
 	struct k_itimer *timr;
-	/*
-	 * Watch out here. We do a irqsave on the idr_lock and pass the
-	 * flags part over to the timer lock. Must not let interrupts in
-	 * while we are moving the lock.
-	 */
-	spin_lock_irqsave(&idr_lock, *flags);
+
+	rcu_read_lock();
 	timr = idr_find(&posix_timers_id, (int)timer_id);
 	if (timr) {
-		spin_lock(&timr->it_lock);
-		if (timr->it_signal == current->signal
Re: [PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()
Le dimanche 13 mars 2011 à 17:06 +0200, Michael S. Tsirkin a écrit : On Mon, Jan 17, 2011 at 04:11:17PM +0800, Jason Wang wrote: We can use lock_sock_fast() instead of lock_sock() in order to get speedup in peek_head_len(). Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/net.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c32a2e4..50b622a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -211,12 +211,12 @@ static int peek_head_len(struct sock *sk)
 {
 	struct sk_buff *head;
 	int len = 0;
+	bool slow = lock_sock_fast(sk);
 
-	lock_sock(sk);
 	head = skb_peek(&sk->sk_receive_queue);
 	if (head)
 		len = head->len;
-	release_sock(sk);
+	unlock_sock_fast(sk, slow);
 	return len;
 }

Wanted to apply this, but looking at the code I think the lock_sock here is wrong. What we really need is to handle the case where the skb is pulled from the receive queue after skb_peek. However this is not the right lock to use for that; sk_receive_queue.lock is. So I expect the following is the right way to handle this. Comments?

Signed-off-by: Michael S. Tsirkin m...@redhat.com

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 0329c41..5720301 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -213,12 +213,13 @@ static int peek_head_len(struct sock *sk)
 {
 	struct sk_buff *head;
 	int len = 0;
+	unsigned long flags;
 
-	lock_sock(sk);
+	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
-	if (head)
+	if (likely(head))
 		len = head->len;
-	release_sock(sk);
+	spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
 	return len;
 }

You may be right; the only way to be sure is to check the other side. If it uses skb_queue_tail(), then yes, your patch is fine. If the other side did not lock the socket, then your patch is a bug fix.
Re: [PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()
Le dimanche 13 mars 2011 à 18:19 +0200, Michael S. Tsirkin a écrit : Other side is in drivers/net/tun.c and net/packet/af_packet.c At least wrt tun it seems clear socket is not locked. Yes (assuming you refer to tun_net_xmit()) Besides queue, dequeue seems to be done without socket locked. It seems this code (assuming you speak of drivers/vhost/net.c ?) has some races indeed.
Re: [PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()
Le dimanche 13 mars 2011 à 18:43 +0200, Michael S. Tsirkin a écrit : On Sun, Mar 13, 2011 at 05:32:07PM +0100, Eric Dumazet wrote: Le dimanche 13 mars 2011 à 18:19 +0200, Michael S. Tsirkin a écrit : Other side is in drivers/net/tun.c and net/packet/af_packet.c At least wrt tun it seems clear socket is not locked. Yes (assuming you refer to tun_net_xmit()) Besides queue, dequeue seems to be done without socket locked. It seems this code (assuming you speak of drivers/vhost/net.c ?) has some races indeed. Hmm. Any more besides the one fixed here? If writers and readers don't share a common lock, how can they reliably synchronize state? For example, the check at line 420 seems unsafe or useless:

	skb_queue_empty(&sock->sk->sk_receive_queue)
Re: Possible netfilter-related memory corruption in 2.6.37
Le vendredi 18 février 2011 à 19:37 +0100, Patrick McHardy a écrit : Am 14.02.2011 17:52, schrieb Patrick McHardy: Am 14.02.2011 17:48, schrieb Eric Dumazet: I am not sure, but I guess nf_reinject() needs a fix too ;) I agree. That one looks uglier though, I guess we'll have to iterate through all hooks to note the previous one. How about this? Unfortunately I don't think we can avoid iterating through all hooks without violating RCU rules.

 	/* Continue traversal iff userspace said ok... */
 	if (verdict == NF_REPEAT) {
-		elem = elem->prev;
-		verdict = NF_ACCEPT;
+		prev = NULL;
+		list_for_each_entry_rcu(i, &nf_hooks[entry->pf][entry->hook],
+					list) {
+			if (&i->list == elem)
+				break;
+			prev = i;

Hmm... what happens if elem was the first elem in the list ? We exit with prev == NULL -- NF_DROP ? I must be missing something...

+		}
+
+		if (prev == NULL ||
+		    &i->list == &nf_hooks[entry->pf][entry->hook])
+			verdict = NF_DROP;
+		else {
+			elem = &prev->list;
+			verdict = NF_ACCEPT;
+		}
 	}
Re: Possible netfilter-related memory corruption in 2.6.37
Le lundi 14 février 2011 à 16:58 +0200, Avi Kivity a écrit : We see severe memory corruption in kvm while used in conjunction with bridge/netfilter. Enabling slab debugging points the finger at a netfilter chain invoked from the bridge code. Can someone take a look? https://bugzilla.kernel.org/show_bug.cgi?id=27052 CC netdev. Does a revert of commit ca44ac386181ba7 help a bit ? (net: don't reallocate skb->head unless the current one hasn't the needed extra size or is shared)
Re: Possible netfilter-related memory corruption in 2.6.37
Le lundi 14 février 2011 à 16:18 +0100, Jan Engelhardt a écrit : On Monday 2011-02-14 16:11, Eric Dumazet wrote: Le lundi 14 février 2011 à 16:58 +0200, Avi Kivity a écrit : We see severe memory corruption in kvm while used in conjunction with bridge/netfilter. Enabling slab debugging points the finger at a netfilter chain invoked from the bridge code. Can someone take a look? https://bugzilla.kernel.org/show_bug.cgi?id=27052 Maybe looks familiar to https://lkml.org/lkml/2011/2/3/147 Are you sure Jan ? IMHO it looks like in your case a NULL ->hook() is called, from nf_iterate(). BTW, list_for_each_continue_rcu() really should be converted to list_for_each_entry_continue_rcu(). This is a bit ugly :

	list_for_each_continue_rcu(*i, head) {
		struct nf_hook_ops *elem = (struct nf_hook_ops *)*i;

Also, I wonder if RCU rules are respected in nf_iterate(). For example this line is really suspicious :

	*i = (*i)->prev;
Re: Possible netfilter-related memory corruption in 2.6.37
Le lundi 14 février 2011 à 17:24 +0100, Patrick McHardy a écrit : Am 14.02.2011 16:50, schrieb Eric Dumazet: Le lundi 14 février 2011 à 16:18 +0100, Jan Engelhardt a écrit : On Monday 2011-02-14 16:11, Eric Dumazet wrote: Le lundi 14 février 2011 à 16:58 +0200, Avi Kivity a écrit : We see severe memory corruption in kvm while used in conjunction with bridge/netfilter. Enabling slab debugging points the finger at a netfilter chain invoked from the bridge code. Can someone take a look? https://bugzilla.kernel.org/show_bug.cgi?id=27052 Maybe looks familiar to https://lkml.org/lkml/2011/2/3/147 Are you sure Jan ? IMHO it looks like in your case a NULL ->hook() is called, from nf_iterate(). BTW, list_for_each_continue_rcu() really should be converted to list_for_each_entry_continue_rcu(). This is a bit ugly : list_for_each_continue_rcu(*i, head) { struct nf_hook_ops *elem = (struct nf_hook_ops *)*i; Also, I wonder if RCU rules are respected in nf_iterate(). For example this line is really suspicious : *i = (*i)->prev; Yeah, that definitely looks wrong. How about this instead? This patch seems fine to me, thanks ! Acked-by: Eric Dumazet eric.duma...@gmail.com
Re: Possible netfilter-related memory corruption in 2.6.37
Le lundi 14 février 2011 à 17:37 +0100, Patrick McHardy a écrit : Am 14.02.2011 17:29, schrieb Eric Dumazet: Le lundi 14 février 2011 à 17:24 +0100, Patrick McHardy a écrit : Also, I wonder if RCU rules are respected in nf_iterate(). For example this line is really suspicious : *i = (*i)->prev; Yeah, that definitely looks wrong. How about this instead? This patch seems fine to me, thanks ! Acked-by: Eric Dumazet eric.duma...@gmail.com Thanks Eric, I've queued the patch for 2.6.38. I am not sure, but I guess nf_reinject() needs a fix too ;)
Re: [PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()
Le lundi 17 janvier 2011 à 16:11 +0800, Jason Wang a écrit : We can use lock_sock_fast() instead of lock_sock() in order to get speedup in peek_head_len(). Signed-off-by: Jason Wang jasow...@redhat.com
---
 drivers/vhost/net.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c32a2e4..50b622a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -211,12 +211,12 @@ static int peek_head_len(struct sock *sk)
 {
 	struct sk_buff *head;
 	int len = 0;
+	bool slow = lock_sock_fast(sk);
 
-	lock_sock(sk);
 	head = skb_peek(&sk->sk_receive_queue);
 	if (head)
 		len = head->len;
-	release_sock(sk);
+	unlock_sock_fast(sk, slow);
 	return len;
 }

Acked-by: Eric Dumazet eric.duma...@gmail.com
Re: Flow Control and Port Mirroring Revisited
Le jeudi 06 janvier 2011 à 18:33 +0900, Simon Horman a écrit :

Hi, Back in October I reported that I noticed a problem whereby flow control breaks down when openvswitch is configured to mirror a port [1]. I have (finally) looked into this further and the problem appears to relate to cloning of skbs, as Jesse Gross originally suspected.

More specifically, in do_execute_actions [2] the first n-1 times that an skb needs to be transmitted it is cloned first, and the final time the original skb is used. In the case that there is only one action, which is the normal case, the original skb will be used. But in the case of mirroring the cloning comes into effect. And in my case the cloned skb seems to go to the (slow) eth1 interface while the original skb goes to the (fast) dummy0 interface that I set up to be a mirror. The result is that dummy0 paces the flow, and it's a cracking pace at that.

As an experiment I hacked do_execute_actions() to use the original skb for the first action instead of the last one. In my case the result was that eth1 paces the flow, and things work reasonably nicely. Well, sort of. Things work well for non-GSO skbs but extremely poorly for GSO skbs, where only 3 (yes 3, not 3%) end up at the remote host running netserv. I'm unsure why, but I digress.

It seems to me that my hack illustrates the point that the flow ends up being paced by one interface. However I think that what would be desirable is that the flow is paced by the slowest link. Unfortunately I'm unsure how to achieve that.

Hi Simon !

Pacing is done because the skb is attached to a socket, and a socket has a limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum of the truesize of all skbs in flight. When you enter something that:

1) Gets a clone of the skb, queues the clone to device X
2) Queues the original skb to device Y

then the socket sndbuf is not affected at all by device X's queue. It is the speed of device Y that matters.

You want to get servo control on both X and Y. You could try to:

1) Get a clone of the skb. Attach it to the socket too (so that the socket gets feedback on the final orphaning of the clone) with skb_set_owner_w(). Queue the clone to device X.

Unfortunately, stacked skb->destructor() makes this possible only for a known destructor (aka sock_wfree()).

One idea that I had was to skb_get() the original skb each time it is cloned - that is easy enough. But unfortunately it seems to me that approach would require some sort of callback mechanism in kfree_skb() so that the cloned skbs can kfree_skb() the original skb. Ideas would be greatly appreciated.

[1] http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html
[2] http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=datapath/actions.c;h=5e16143ca402f7da0ee8fc18ee5eb16c3b7598e6;hb=HEAD
Re: Flow Control and Port Mirroring Revisited
Le jeudi 06 janvier 2011 à 21:44 +0900, Simon Horman a écrit : Hi Eric ! Thanks for the advice. I had thought about the socket buffer but at some point it slipped my mind. In any case the following patch seems to implement the change that I had in mind. However my discussions with Michael Tsirkin elsewhere in this thread are beginning to make me think that perhaps this change isn't the best solution.

diff --git a/datapath/actions.c b/datapath/actions.c
index 5e16143..505f13f 100644
--- a/datapath/actions.c
+++ b/datapath/actions.c
@@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, &rem)) {
 		if (prev_port != -1) {
-			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
+			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+			if (nskb) {
+				if (skb->sk)
+					skb_set_owner_w(nskb, skb->sk);
+				do_output(dp, nskb, prev_port);
+			}
 			prev_port = -1;
 		}

I got a rather nasty panic without the if (skb->sk); I guess some skbs don't have a socket.

Indeed, some packets are not linked to a socket (ARP packets for example). Sorry, I should have mentioned it :)
Re: Certain Mac Address doesn't work in KVM virtual NICs.
Le mardi 21 décembre 2010 à 13:53 +0800, Yueyu Lin a écrit : Hello, everyone. I just encountered a very strange issue in Ubuntu 10.10 server edition x86 64 bits. I just start up my kvm like: kvm -net nic,macaddr=B3:C0:64:AF:73:28,model=virtio -net tap,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,ifname=tap2 -enable-kvm -drive file=cassandra.raw,format=raw,aio=native -vnc :2 This MAC address doesn't work at all. Not working means the virtual machine just thinks eth0 is not plugged in. After I changed the MAC address, it works as I expected. Are there any restrictions such that I can't use certain MAC addresses?

Sure. The first byte must be even. B3 is odd, so it's not valid: the low-order bit of the first byte is the marker for 'multicast'.
Re: [RFC PATCH V2 5/5] Add TX zero copy in macvtap
Le vendredi 10 décembre 2010 à 02:13 -0800, Shirley Ma a écrit :

+	while (len) {
+		f = &skb_shinfo(skb)->frags[i];
+		f->page = page[i];
+		f->page_offset = base & ~PAGE_MASK;
+		f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+		skb->data_len += f->size;
+		skb->len += f->size;
+		skb->truesize += f->size;
+		skb_shinfo(skb)->nr_frags++;
+		/* increase sk_wmem_alloc */
+		atomic_add(f->size, &skb->sk->sk_wmem_alloc);
+		base += f->size;
+		len -= f->size;
+		i++;
+	}

You could make one atomic_add() outside of the loop, and factorize many things...

	atomic_add(len, &skb->sk->sk_wmem_alloc);
	skb->data_len += len;
	skb->len += len;
	skb->truesize += len;
	while (len) {
		...
	}
Re: [RFC PATCH V2 5/5] Add TX zero copy in macvtap
Le vendredi 10 décembre 2010 à 08:25 -0800, Shirley Ma a écrit : On Fri, 2010-12-10 at 11:27 +0100, Eric Dumazet wrote: You could make one atomic_add() outside of the loop, and factorize many things... atomic_add(len, &skb->sk->sk_wmem_alloc); skb->data_len += len; skb->len += len; skb->truesize += len; while (len) { ... } Yep, thanks, will update it!

Also take a look at the skb_fill_page_desc() helper, and maybe skb_add_rx_frag() too. The atomic op should be factorized for sure, but the other adds might be done by helpers to keep the code short.
Re: 2.6.37-rc2 after KVM shutdown - unregister_netdevice: waiting for vmtst01eth0 to become free. Usage count = 1
Le jeudi 18 novembre 2010 à 07:28 +0100, Nikola Ciprich a écrit : Yep, this is a known problem, thanks ! The fix is here: http://patchwork.ozlabs.org/patch/71354/ Thanks Eric, this indeed fixes the problem.. I noticed the fix didn't make it into 2.6.37-rc2-git3 though, maybe it just got omitted? Anyways, thanks for the help! n.

It's in David Miller's net-2.6 tree (all pending network patches for the current linux-2.6 version), so it'll be included the next time David pushes his tree to Linus, don't worry ;)
Re: 2.6.37-rc2 after KVM shutdown - unregister_netdevice: waiting for vmtst01eth0 to become free. Usage count = 1
Le mardi 16 novembre 2010 à 09:41 +0100, Nikola Ciprich a écrit : Hello, I just tried running KVM on a 2.6.37-rc2 host, and when trying to shut down the guest, it got into uninterruptible state (D) and I'm getting lots of the following messages: [ 1269.392009] unregister_netdevice: waiting for vmtst01eth0 to become free. Usage count = 1 So I just want to report it.. n.

Yep, this is a known problem, thanks ! The fix is here: http://patchwork.ozlabs.org/patch/71354/
Re: Re:[PATCH v14 06/17] Use callback to deal with skb_release_data() specially.
Le lundi 08 novembre 2010 à 16:03 +0800, xiaohui@intel.com a écrit : From: Xin Xiaohui xiaohui@intel.com

Hmm, I suggest you read the comment two lines above. If destructor_arg is now cleared each time we allocate a new skb, then please move it before dataref in the shinfo structure, so that the following memset() does the job efficiently... Something like:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e6ba898..2dca504 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -195,6 +195,9 @@ struct skb_shared_info {
 	__be32		ip6_frag_id;
 	__u8		tx_flags;
 	struct sk_buff	*frag_list;
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void		*destructor_arg;
 	struct skb_shared_hwtstamps hwtstamps;
 
 	/*
@@ -202,9 +205,6 @@ struct skb_shared_info {
 	 */
 	atomic_t	dataref;
 
-	/* Intermediate layers must ensure that destructor_arg
-	 * remains valid until skb destructor */
-	void		*destructor_arg;
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };

Will that affect the cache line?

What do you mean ?

Or, we can move the line that clears destructor_arg to the end of __alloc_skb(). It looks like the following; which one do you prefer? Thanks Xiaohui

---
 net/core/skbuff.c | 8 
 1 file changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..df852f2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -224,6 +224,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+	shinfo->destructor_arg = NULL;
 out:
 	return skb;
 nodata:

I don't understand why you want to do this. This adds an instruction, makes the code bigger, and brings no obvious gain on the memory-transaction side. If integrated in the existing memset(), the cost is an extra iteration to perform the clear of this field.
Re: [PATCH v14 06/17] Use callback to deal with skb_release_data() specially.
Le jeudi 04 novembre 2010 à 17:05 +0800, xiaohui@intel.com a écrit : From: Xin Xiaohui xiaohui@intel.com If the buffer is external, then use the callback to destruct buffers.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c | 8 
 1 file changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..5e6d69c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -210,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
+	shinfo->destructor_arg = NULL;

Hmm, I suggest you read the comment two lines above. If destructor_arg is now cleared each time we allocate a new skb, then please move it before dataref in the shinfo structure, so that the following memset() does the job efficiently...

 	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
@@ -343,6 +344,13 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		if (skb->dev && dev_is_mpassthru(skb->dev)) {
+			struct skb_ext_page *ext_page =
+				skb_shinfo(skb)->destructor_arg;
+			if (ext_page && ext_page->dtor)
+				ext_page->dtor(ext_page);
+		}
+
 		kfree(skb->head);
 	}
 }
Re: [PATCH v14 06/17] Use callback to deal with skb_release_data() specially.
Le jeudi 04 novembre 2010 à 10:04 +0100, Eric Dumazet a écrit : Hmm, I suggest you read the comment two lines above. If destructor_arg is now cleared each time we allocate a new skb, then, please move it before dataref in the shinfo structure, so that the following memset() does the job efficiently... Something like:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e6ba898..2dca504 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -195,6 +195,9 @@ struct skb_shared_info {
 	__be32		ip6_frag_id;
 	__u8		tx_flags;
 	struct sk_buff	*frag_list;
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void		*destructor_arg;
 	struct skb_shared_hwtstamps hwtstamps;
 
 	/*
@@ -202,9 +205,6 @@ struct skb_shared_info {
 	 */
 	atomic_t	dataref;
 
-	/* Intermediate layers must ensure that destructor_arg
-	 * remains valid until skb destructor */
-	void		*destructor_arg;
 	/* must be last field, see pskb_expand_head() */
 	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.
Le lundi 11 octobre 2010 à 08:27 -0700, David Miller a écrit : From: Xin, Xiaohui xiaohui@intel.com Date: Mon, 11 Oct 2010 15:06:05 +0800 That's to avoid the new cache miss caused by using destructor_arg in the data path, like skb_release_data(). That's based on the comment from Eric Dumazet on the v7 patches. Thanks for the explanation.

Anyway, frags[] must be the last field of struct skb_shared_info since commit fed66381 (net: pskb_expand_head() optimization). It seems Xin worked on a quite old tree.
Re: [PATCH v12 10/17] Add a hook to intercept external buffers from NIC driver.
Le jeudi 30 septembre 2010 à 22:04 +0800, xiaohui@intel.com a écrit : From: Xin Xiaohui xiaohui@intel.com The hook is called in netif_receive_skb().

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/dev.c | 35 +++
 1 file changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c11e32c..83172b8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2517,6 +2517,37 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mp_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev))
+		return skb;
+	mp_port = skb->dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
@@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* To intercept mediate passthru(zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
 	if (!skb)
 		goto out;

This does not apply to current net-next-2.6. We now have dev->rx_handler (currently for bridge or macvlan):

commit ab95bfe01f9872459c8678572ccadbf646badad0
Author: Jiri Pirko jpi...@redhat.com
Date: Tue Jun 1 21:52:08 2010

    net: replace hooks in __netif_receive_skb V5

    What this patch does is it removes two receive frame hooks (for bridge and
    for macvlan) from __netif_receive_skb. These are replaced with a single
    hook for both. It only supports one hook per device because it makes no
    sense to do bridging and macvlan on the same device.

    Then a network driver (of virtual netdev like macvlan or bridge) can
    register an rx_handler for the needed net device.

Signed-off-by: Jiri Pirko jpi...@redhat.com
Signed-off-by: Stephen Hemminger shemmin...@vyatta.com
Signed-off-by: David S. Miller da...@davemloft.net
Re: [RFC PATCH v7 03/19] Export 2 func for device to assign/deassign new strucure
Le samedi 05 juin 2010 à 18:14 +0800, xiaohui@intel.com a écrit : From: Xin Xiaohui xiaohui@intel.com

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 include/linux/netdevice.h | 3 +++
 net/core/dev.c            | 28 
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bae725c..efb575a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1592,6 +1592,9 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_attach(struct net_device *dev,
+				 struct mpassthru_port *port);
+extern void netdev_mp_port_detach(struct net_device *dev);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index f769098..ecbb6b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2469,6 +2469,34 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
+/* Export two functions to assign/de-assign mp_port pointer
+ * to a net device.
+ */
+
+int netdev_mp_port_attach(struct net_device *dev,
+			  struct mpassthru_port *port)
+{
+	/* locked by mp_mutex */
+	if (rcu_dereference(dev->mp_port))
+		return -EBUSY;
+

Please... this is bogus... Try with the following config settings:

CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_REPEATEDLY=y

+	rcu_assign_pointer(dev->mp_port, port);
+
+	return 0;
+}
+EXPORT_SYMBOL(netdev_mp_port_attach);
+
+void netdev_mp_port_detach(struct net_device *dev)
+{
+	/* locked by mp_mutex */
+	if (!rcu_dereference(dev->mp_port))
+		return;

same problem here

+
+	rcu_assign_pointer(dev->mp_port, NULL);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(netdev_mp_port_detach);
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
Re: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer.
Le samedi 05 juin 2010 à 18:14 +0800, xiaohui@intel.com a écrit : From: Xin Xiaohui xiaohui@intel.com

 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+	/* Record the external buffer info in this field. It's not so good,
+	 * but we cannot find another place easily.
+	 */
+	shinfo->destructor_arg = ext_page;
+

Yes, this is a big problem: it's basically using a cache line that was not touched before.
Re: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() specially.
Le samedi 05 juin 2010 à 18:14 +0800, xiaohui@intel.com a écrit : From: Xin Xiaohui xiaohui@intel.com If the buffer is external, then use the callback to destruct buffers.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzhao81...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---
 net/core/skbuff.c | 11 +++
 1 file changed, 11 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37587f0..418457c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 
 static void skb_release_data(struct sk_buff *skb)
 {
+	/* check if the skb has external buffers, we have use destructor_arg
+	 * here to indicate
+	 */
+	struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
+

Oh well. This is v7 of your series, and nobody complained yet? This is a new cache miss on a _critical_ path.

 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
@@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
 		if (skb_has_frags(skb))
 			skb_drop_fraglist(skb);
 
+		/* if the skb has external buffers, use destructor here,
+		 * since after that skb->head will be kfree, in case skb->head
+		 * from external buffer cannot use kfree to destroy.
+		 */

Why not defer the access to skb_shinfo(skb)->destructor_arg to here?

+		if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
+			ext_page->dtor(ext_page);
 		kfree(skb->head);
 	}
 }

	if (dev_is_mpassthru(skb->dev)) {
		struct skb_external_page *ext_page =
			skb_shinfo(skb)->destructor_arg;
		if (ext_page && ext_page->dtor)
			ext_page->dtor(ext_page);
	}

destructor_arg should be moved before frags[] if you really want to use it.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bf243fc..b136d90 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -202,10 +202,11 @@ struct skb_shared_info {
 	 */
 	atomic_t	dataref;
 
-	skb_frag_t	frags[MAX_SKB_FRAGS];
 	/* Intermediate layers must ensure that destructor_arg
 	 * remains valid until skb destructor */
 	void		*destructor_arg;
+
+	skb_frag_t	frags[MAX_SKB_FRAGS];
 };
 
 /* We divide dataref into two halves. The higher 16 bits hold references
Re: [PATCH] use unfair spinlock when running on hypervisor.
Le mardi 01 juin 2010 à 19:52 +0300, Avi Kivity a écrit : What I'd like to see eventually is a short-term-unfair, long-term-fair spinlock. Might make sense for bare metal as well. But it won't be easy to write.

This thread rings a bell here :) Yes, ticket spinlocks are sometimes slower, especially in workloads where a spinlock needs to be taken several times to handle one unit of work, with many cpus competing.

We currently have kind of a similar problem in the network stack, and we have a patch to speed up the xmit path by an order of magnitude, letting one cpu (the consumer cpu) get unfair access to the (ticket) spinlock. (It can compete with no more than one other cpu.) Boost from ~50.000 to ~600.000 pps on a dual quad core machine (E5450 @3.00GHz) on a particular workload (many cpus want to xmit their packets). (patch: http://patchwork.ozlabs.org/patch/53163/)

It could be possible to write such a generic beast, with a cascade of regular ticket spinlocks? One ticket spinlock at the first stage (only if some conditions are met, aka slow path), then a 'primary' spinlock at the second stage.

// generic implementation
// (x86 could use 16bit fields for users_in & users_out)
struct cascade_lock {
	atomic_t	users_in;
	int		users_out;
	spinlock_t	primlock;
	spinlock_t	slowpathlock; // could be outside of this structure, shared by many 'cascade_locks'
};

/*
 * In kvm case, you might call hypervisor when slowpathlock is about to be taken ?
 * When a cascade lock is unlocked, and relocked right after, this cpu has unfair
 * priority and could get the lock before cpus blocked in slowpathlock (especially if
 * an hypervisor call was done)
 *
 * In network xmit path, the dequeue thread would use highprio_user=true mode
 * In network xmit path, the 'contended' enqueueing thread would set a negative threshold,
 * to force a 'lowprio_user' mode.
 */
void cascade_lock(struct cascade_lock *l, bool highprio_user, int threshold)
{
	bool slowpath = false;

	atomic_inc(&l->users_in); // no real need for atomic_inc_return()
	if (atomic_read(&l->users_in) - l->users_out > threshold && !highprio_user) {
		spin_lock(&l->slowpathlock);
		slowpath = true;
	}
	spin_lock(&l->primlock);
	if (slowpath)
		spin_unlock(&l->slowpathlock);
}

void cascade_unlock(struct cascade_lock *l)
{
	l->users_out++;
	spin_unlock(&l->primlock);
}
Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.
Le mercredi 10 février 2010 à 19:48 +0800, Xin Xiaohui a écrit : Add a device to utilize the vhost-net backend driver for copy-less data transfer between guest FE and host NIC. It pins the guest user space to the host memory and provides proto_ops such as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Signed-off-by: Jeff Dike jd...@c2.user-mode-linux.org

+static int page_ctor_attach(struct mp_struct *mp)
+{
+	int rc;
+	struct page_ctor *ctor;
+	struct net_device *dev = mp->dev;
+
+	rcu_read_lock();
+	if (rcu_dereference(mp->ctor)) {
+		rcu_read_unlock();
+		return -EBUSY;
+	}
+	rcu_read_unlock();

Strange read locking here, for an obvious writer role. What do you really want to do? If writers are serialized by mp_mutex, you don't need this rcu_read_lock()/rcu_read_unlock() stuff.

+
+	ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
+	if (!ctor)
+		return -ENOMEM;
+	rc = netdev_page_ctor_prep(dev, &ctor->ctor);
+	if (rc)
+		goto fail;
+
+	ctor->cache = kmem_cache_create("skb_page_info",
+					sizeof(struct page_info), 0,
+					SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);

SLAB_PANIC here means: crash the whole system in case of error. This is not what you want in a driver.

+
+	if (!ctor->cache)
+		goto cache_fail;
+
+	INIT_LIST_HEAD(&ctor->readq);
+	spin_lock_init(&ctor->read_lock);
+
+	ctor->w_len = 0;
+	ctor->r_len = 0;
+
+	dev_hold(dev);
+	ctor->dev = dev;
+	ctor->ctor.ctor = page_ctor;
+	ctor->ctor.sock = &mp->socket;
+	atomic_set(&ctor->refcnt, 1);
+
+	rc = netdev_page_ctor_attach(dev, &ctor->ctor);
+	if (rc)
+		goto fail;
+
+	/* locked by mp_mutex */
+	rcu_assign_pointer(mp->ctor, ctor);
+
+	/* XXX: Need we do set_offload here ?
+	 */
+
+	return 0;
+
+fail:
+	kmem_cache_destroy(ctor->cache);
+cache_fail:
+	kfree(ctor);
+	dev_put(dev);
+
+	return rc;
+}
+
+
+static inline void get_page_ctor(struct page_ctor *ctor)
+{
+	atomic_inc(&ctor->refcnt);
+}
+
+static inline void put_page_ctor(struct page_ctor *ctor)
+{
+	if (atomic_dec_and_test(&ctor->refcnt))
+		kfree(ctor);

Are you sure an RCU grace period is not needed before freeing?

+
+static int page_ctor_detach(struct mp_struct *mp)
+{
+	struct page_ctor *ctor;
+	struct page_info *info;
+	int i;
+
+	rcu_read_lock();
+	ctor = rcu_dereference(mp->ctor);
+	rcu_read_unlock();

Strange locking again here

+
+	if (!ctor)
+		return -ENODEV;
+
+	while ((info = info_dequeue(ctor))) {
+		for (i = 0; i < info->pnum; i++)
+			if (info->pages[i])
+				put_page(info->pages[i]);
+		kmem_cache_free(ctor->cache, info);
+	}
+	kmem_cache_destroy(ctor->cache);
+	netdev_page_ctor_detach(ctor->dev);
+	dev_put(ctor->dev);
+
+	/* locked by mp_mutex */
+	rcu_assign_pointer(mp->ctor, NULL);
+	synchronize_rcu();
+
+	put_page_ctor(ctor);
+
+	return 0;
+}
+
+/* For small user space buffers transmit, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
+					       int total)
+{
+	struct page_info *info = kmem_cache_alloc(ctor->cache, GFP_KERNEL);

kmem_cache_zalloc() ?

+
+	if (!info)
+		return NULL;
+	memset(info, 0, sizeof(struct page_info));
+	memset(info->pages, 0, sizeof(info->pages));

redundant memset(): the whole structure was already cleared one line above

+
+	info->header = 0;

already cleared

+	info->total = total;
+	info->skb = NULL;

already cleared

+	info->user.dtor = page_dtor;
+	info->ctor = ctor;
+	info->flags = INFO_WRITE;
+	info->pnum = 0;

already cleared

+	return info;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the user space address.
+ */
+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+			struct iovec *iov, int count, struct frag *frags,
+			int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base;
+	struct page_info *info = kmem_cache_alloc(ctor->cache, GFP_KERNEL);

kmem_cache_zalloc() ?

+
+	if (!info)
+		return NULL;
+	memset(info, 0, sizeof(struct page_info));

kmem_cache_zalloc() ?

+	memset(info->pages, 0, sizeof(info->pages));

already cleared

+
+	down_read(&current->mm->mmap_sem);
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+
Re: Networking-related crash?
Le 09/12/2009 16:11, Avi Kivity a écrit : On 12/09/2009 03:46 PM, Adam Huffman wrote: I've been seeing lots of crashes on a new Dell Precision T7500, running the KVM in Fedora 12. Finally managed to capture an Oops, which is shown below (hand-transcribed): BUG: unable to handle kernel paging request at 00200200 IP: [8139aab7] destroy_conntrack+0x82/0x11f PGD 332d0e067 PUD 33453c067 PMD 0 Oops: 0002 [#1] SMP last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map CPU 4 Modules linked in: tun bridge stp llc sunrpc ipt_MASQUERADE iptable_nat nf_nat ipt_LOG xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6 _tables ipv6 dm_multipath kvm_intel kvm uinput snd_hda_codec_analog nouveau snd_hda_intel snd_hda_codec ttm drm_kms_helper snd_hwdep snd_seq drm sn d_seq_device snd_pcm firewire_ohci i2c_i801 snd_timer ppdev firewire_core snd i2c_algo_bit iTCO_wdt crc_itu_t parport_pc i2c_core soundcore parport iTCO_vendor_support tg3 snd_page_alloc shpchp dcdbas wmi mptsas mptscsih mptbase scsi_transport_sas megaraid_sas [last_unloaded: speedstep_lib] Pid: 1759, comm: qemu-kvm Not tainted 2.6.31.6-162.fc12.x86_64 #1 Precision WorkStation T7500 RIP: 0010:[8139aab7] [8139aab7] destroy_conntrack+0x82/0x11f RSP: 0018:c9803bf0 EFLAGS: 00010202 RAX: 8001 RBX: 816fb1a0 RCX: 752f RDX: 00200200 RSI: 0011 RDI: 816fb1a0 RBP: c9803c00 R08: 880336699438 R09: 00aaa5e0 R10: 0002f54189d5 R11: 0001 R12: 819a92e0 R13: a029adcc R14: R15: 880632866c38 FS: 7fdd34b17710() GS:c980() knlGS: CS: 0010 DS: 002B ES: 002B CR0: 80050033 CR2: 00200200 CR3: 0003349c CR4: 26e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process qemu-kvm (pid: 1759, threadinfo 88062e9e8000, task 880634945e00) Stack: 880632866c00 880634640c30 c9803c10 813989c2 0 c9803c30 81374092 c9803c30 880632866c00 0 c9803c50 81373dd3 0002 880632866c00 Call Trace: IRQ [813989c2] nf_conntrack_destroy+0x1b/0x1d [81374092] skb_release_head_state+0x95/0xd7 [81373dd3] __kfree_skb+0x16/0x81 [81373ed7] kfree_skb+0x6a/0x72 
[a029adcc] ip6_mc_input+0x220/0x230 [ipv6]
[a029a3d1] ip6_rcv_finish+0x27/0x2b [ipv6]
[a029a763] ipv6_rcv+0x38e/0x3e5 [ipv6]
[8137bd91] netif_receive_skb+0x402/0x427
[8137bf1b] napi_skb_finish+0x29/0x3d
[8137c37a] napi_gro_receive+0x2f/0x34
[a0084fad] tg3_poll+0x6c6/0x8c3 [tg3]
[8137c4b0] net_rx_action+0xaf/0x1c9
[81379cfe] ? list_add_tail+0x15/0x17
[81057614] __do_softirq+0xdd/0x1ad
[81026936] ? apic_write+0x16/0x18
[81012eac] call_softirq+0x1c/0x30
[810143fb] do_softirq+0x47/0x8d
[81057326] irq_exit+0x44/0x86
[8141ecd5] do_IRQ+0xa5/0xbc
[810126d3] ret_from_intr+0x0/0x11 EOI
[a02437bb] ? kvm_arch_vcpu_ioctl_run+0x84b/0xb34 [kvm]
[a02437aa] ? kvm_arch_vcpu_ioctl_run+0x83a/0xb34 [kvm]
[a02395e3] ? kvm_vcpu_ioctl+0xfd/0x556 [kvm]
[81108adc] ? vfs_ioctl+0x22/0x87
[81109038] ? do_vfs_ioctl+0x47b/0x4c1
[811090d4] ? sys_ioctl+0x56/0x79
[81012093] ? stub_clone+0x13/0x20
[81011cf2] ? system_call_fastpath+0x16/0x1b
Code: c7 00 a6 9a 81 e8 23 04 08 00 48 89 df e8 68 29 00 00 f6 43 78 08 75 24 48 8b 53 10 48 85 d2 75 04 0f 0b eb fe 48 8b 43 08 a8 01 48 89 02 75 04 48 89 50 08 48 c7 43 10 00 02 20 00 65 8b 14 25
RIP [8139aab7] destroy_conntrack+0x82/0x11f
RSP c9803bf0
CR2: 00200200

Looks unrelated to kvm - a softirq happened to trigger during a kvm ioctl. Fault looks like list poison. Copying netdev.

Crash in:

	48 8b 43 08	mov 0x8(%rbx),%rax
	a8 01		test $0x1,%al
	48 89 02	mov %rax,(%rdx)		<- HERE, RDX=0x200200 (LIST_POISON2)
	75 04		jne 1f
	48 89 50 08	mov %rdx,0x8(%rax)
    1:	48 c7 43 10 00 02 20	movq $0x200200,0x10(%rbx)

	if (!nf_ct_is_confirmed(ct)) {
		BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
		hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);	/* HERE */
	}
	NF_CT_STAT_INC(net, delete);
Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net
Shirley Ma wrote:
> This patch is generated against the 2.6 git tree. I didn't break up this patch since it has one functionality. Please review it.
>
> Thanks
> Shirley
>
> Signed-off-by: Shirley Ma x...@us.ibm.com
> --
> +void virtio_free_pages(void *buf)
> +{
> +	struct page *page = (struct page *)buf;
> +
> +	while (page) {
> +		__free_pages(page, 0);
> +		page = (struct page *)page->private;
> +	}
> +}
> +

Interesting use after free :)
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
Paul E. McKenney wrote:
> (Sorry, but, as always, I could not resist!)

Yes :) Thanks Paul for this masterpiece of a diplomatic Acked-by ;)
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote:
> +static void handle_tx(struct vhost_net *net)
> +{
> +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> +	unsigned head, out, in, s;
> +	struct msghdr msg = {
> +		.msg_name = NULL,
> +		.msg_namelen = 0,
> +		.msg_control = NULL,
> +		.msg_controllen = 0,
> +		.msg_iov = vq->iov,
> +		.msg_flags = MSG_DONTWAIT,
> +	};
> +	size_t len, total_len = 0;
> +	int err, wmem;
> +	size_t hdr_size;
> +	struct socket *sock = rcu_dereference(vq->private_data);
> +	if (!sock)
> +		return;
> +
> +	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
> +	if (wmem >= sock->sk->sk_sndbuf)
> +		return;
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	vhost_no_notify(vq);
> +

Using rcu_dereference() and mutex_lock() at the same time seems wrong; I suspect that your use of RCU is not correct.

1) rcu_dereference() should be done inside an rcu_read_lock() section, and we are not allowed to sleep in such a section. (Quoting Documentation/RCU/whatisRCU.txt: "It is illegal to block while in an RCU read-side critical section.")

2) mutex_lock() can sleep (ie block).
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
Gregory Haskins wrote:
> Gregory Haskins wrote:
>> Eric Dumazet wrote:
>>> Michael S. Tsirkin wrote:
>>>> [handle_tx() patch quoted above, snipped]
>>>
>>> Using rcu_dereference() and mutex_lock() at the same time seems wrong; I suspect that your use of RCU is not correct.
>>>
>>> 1) rcu_dereference() should be done inside an rcu_read_lock() section, and we are not allowed to sleep in such a section. (Quoting Documentation/RCU/whatisRCU.txt: "It is illegal to block while in an RCU read-side critical section.")
>>>
>>> 2) mutex_lock() can sleep (ie block).
>>
>> Michael, I warned you that this needed better documentation ;)
>>
>> Eric, I think I flagged this once before, but Michael convinced me that it was indeed ok, if perhaps a bit unconventional. I will try to find the thread.
>>
>> Kind Regards,
>> -Greg
>
> Here it is: http://lkml.org/lkml/2009/8/12/173

Yes, this doesn't convince me at all, and could set a precedent for wrong RCU use. People wanting to use RCU grep the kernel sources to find how to use RCU correctly.

Michael, please use existing locking/barrier mechanisms, and don't pretend to use RCU. Some automatic tools might barf later.

For example, we could add a debugging facility to check that rcu_dereference() is used in an appropriate context, i.e. one that conflicts with the existing mutex_lock() debugging facility.
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote:
> Paul, you acked this previously. Should I add your Acked-by line so people calm down? If you would rather I replace rcu_dereference/rcu_assign_pointer with rmb/wmb, I can do this. Or maybe patch Documentation to explain this RCU usage?

So you believe I am over-reacting to this dubious use of RCU?

The RCU documentation is already very complex; we don't need to add yet another subtle use, which would make it less readable.

It seems you use the 'RCU api' in drivers/vhost/net.c as convenient macros:

#define rcu_dereference(p) ({ \
		typeof(p) _p1 = ACCESS_ONCE(p); \
		smp_read_barrier_depends(); \
		(_p1); \
	})

#define rcu_assign_pointer(p, v) \
	({ \
		if (!__builtin_constant_p(v) || \
		    ((v) != NULL)) \
			smp_wmb(); \
		(p) = (v); \
	})

There are plenty of regular uses of smp_wmb() in the kernel that are not related to Read Copy Update; there is nothing wrong with using barriers with appropriate comments. (And you already use mb(), wmb(), rmb(), smp_wmb() in your patch.)

BTW, there is at least one locking bug in vhost_net_set_features().

Apparently, mutex_unlock() doesn't trigger a fault if the mutex is not locked by the current thread... even with DEBUG_MUTEXES / DEBUG_LOCK_ALLOC.

static void vhost_net_set_features(struct vhost_net *n, u64 features)
{
	size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
		sizeof(struct virtio_net_hdr) : 0;
	int i;

!	mutex_unlock(&n->dev.mutex);

	n->dev.acked_features = features;
	smp_wmb();
	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
		mutex_lock(&n->vqs[i].mutex);
		n->vqs[i].hdr_size = hdr_size;
		mutex_unlock(&n->vqs[i].mutex);
	}
	mutex_unlock(&n->dev.mutex);
	vhost_net_flush(n);
}