Re: [PATCH v2 3/3] sched/preempt: fix cond_resched_lock() and cond_resched_softirq()

2015-07-15 Thread Eric Dumazet
On Wed, 2015-07-15 at 12:52 +0300, Konstantin Khlebnikov wrote:
 These functions check should_resched() before unlocking spinlock/bh-enable:
 preempt_count always non-zero => should_resched() always returns false.
 cond_resched_lock() worked iff spin_needbreak is set.

Interesting, this definitely used to work (linux-3.11)

Any idea of which commit broke things ?
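To illustrate the quoted report, here is a minimal sketch of the broken ordering (illustrative only, not the actual kernel source; helper names mirror the kernel's):

	/* Sketch: on a preemptible kernel, the held spinlock keeps
	 * preempt_count elevated, so should_resched() can never return
	 * true here; only the spin_needbreak() contention check can fire. */
	static int cond_resched_lock_sketch(spinlock_t *lock)
	{
		int resched = should_resched();	/* always false: lock held */

		if (spin_needbreak(lock) || resched) {
			spin_unlock(lock);
			if (resched)
				schedule();	/* unreachable in practice */
			else
				cpu_relax();
			spin_lock(lock);
			return 1;
		}
		return 0;
	}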





Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt

2014-10-11 Thread Eric Dumazet
On Sat, 2014-10-11 at 15:16 +0800, Jason Wang wrote:
 In the past we freed transmitted packets in ndo_start_xmit() to get better
 performance. One side effect is that skb_orphan() needs to be called in
 ndo_start_xmit(), which makes sk_wmem_alloc inaccurate. For TCP, this
 means several optimizations, such as TCP small queues and auto-corking,
 could not work well, which can lead to extra-low throughput for
 small-packet streams.
 
 Thanks to the urgent descriptor support, this patch tries to solve this
 issue by enabling the tx interrupt selectively for stream packets. This
 means we don't need to orphan TCP stream packets in ndo_start_xmit(), but
 instead enable the tx interrupt for those packets. After we get a tx
 interrupt, a tx napi is scheduled to free those packets.
 
 With this method, sk_wmem_alloc of a TCP socket is more accurate than
 before, which lets TCP batch more through TSQ and auto-corking.
 
 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/net/virtio_net.c | 164 +++++++++++++++++++++++++++++++++++---------
  1 file changed, 128 insertions(+), 36 deletions(-)
 
 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 5810841..b450fc4 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -72,6 +72,8 @@ struct send_queue {
  
   /* Name of the send queue: output.$index */
   char name[40];
 +
 + struct napi_struct napi;
  };
  
  /* Internal representation of a receive virtqueue */
 @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
   return p;
  }
  
 +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
 +{
 + struct sk_buff *skb;
 + unsigned int len;
 + struct virtnet_info *vi = sq->vq->vdev->priv;
 + struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
 + int sent = 0;
 +
 + while (sent < budget &&
 +        (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
 + pr_debug("Sent skb %p\n", skb);
 +
 + u64_stats_update_begin(&stats->tx_syncp);
 + stats->tx_bytes += skb->len;
 + stats->tx_packets++;
 + u64_stats_update_end(&stats->tx_syncp);
 +
 + dev_kfree_skb_any(skb);
 + sent++;
 + }
 +
 +

You could accumulate skb->len in a totlen var, and perform a single

u64_stats_update_begin(&stats->tx_syncp);
stats->tx_bytes += totlen;
stats->tx_packets += sent;
u64_stats_update_end(&stats->tx_syncp);

after the loop.
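A sketch of the reworked loop along those lines (same function as in the quoted patch, with the per-packet seqcount round-trips hoisted out; illustrative, not the submitted code):

	static int free_old_xmit_skbs(struct send_queue *sq, int budget)
	{
		struct sk_buff *skb;
		unsigned int len;
		struct virtnet_info *vi = sq->vq->vdev->priv;
		struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
		unsigned int totlen = 0;
		int sent = 0;

		while (sent < budget &&
		       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
			totlen += skb->len;	/* accumulate, sync once below */
			dev_kfree_skb_any(skb);
			sent++;
		}

		/* one update_begin/update_end for the whole batch */
		u64_stats_update_begin(&stats->tx_syncp);
		stats->tx_bytes += totlen;
		stats->tx_packets += sent;
		u64_stats_update_end(&stats->tx_syncp);

		return sent;
	}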


 + return sent;
 +}
 +

...

 +
 +static bool virtnet_skb_needs_intr(struct sk_buff *skb)
 +{
 + union {
 + unsigned char *network;
 + struct iphdr *ipv4;
 + struct ipv6hdr *ipv6;
 + } hdr;
 + struct tcphdr *th = tcp_hdr(skb);
 + u16 payload_len;
 +
 + hdr.network = skb_network_header(skb);
 +
 + /* Only IPv4/IPv6 with TCP is supported */

Oh well, yet another packet flow dissector :)

If most packets were caught by your implementation, you could use it
for the fast path and fall back to skb_flow_dissect() for encapsulated
traffic.

struct flow_keys keys;

if (!skb_flow_dissect(skb, &keys))
    return false;

if (keys.ip_proto != IPPROTO_TCP)
    return false;

then check __skb_get_poff() to see how to get th, and check if there is some
payload...
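Sketched out, the fallback could look roughly like this (untested; assumes the 3.x-era skb_flow_dissect()/flow_keys layout, and uses skb_header_pointer() plus keys.thoff in place of the __skb_get_poff() route):

	static bool virtnet_skb_needs_intr(struct sk_buff *skb)
	{
		struct flow_keys keys;
		const struct tcphdr *th;
		struct tcphdr _th;

		/* one dissection handles IPv4/IPv6 and encapsulation */
		if (!skb_flow_dissect(skb, &keys))
			return false;

		if (keys.ip_proto != IPPROTO_TCP)
			return false;

		th = skb_header_pointer(skb, keys.thoff, sizeof(_th), &_th);
		if (!th)
			return false;

		/* delay neither PSH-flagged data nor pure ACKs */
		return !th->psh && skb->len > keys.thoff + th->doff * 4;
	}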


 + if ((skb->protocol == htons(ETH_P_IP)) &&
 +     hdr.ipv4->protocol == IPPROTO_TCP) {
 + payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 -
 +   th->doff * 4;
 + } else if ((skb->protocol == htons(ETH_P_IPV6) ||
 +    hdr.ipv6->nexthdr == IPPROTO_TCP)) {
 + payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4;
 + } else {
 + return false;
 + }
 +
 + /* We don't want to delay a packet with the PUSH bit, nor a pure ACK packet */
 + if (!th->psh && payload_len)
 + return true;
 +
 + return false;
  }






Re: [PULL 2/2] vhost: replace rcu with mutex

2014-06-03 Thread Eric Dumazet
On Tue, 2014-06-03 at 14:48 +0200, Paolo Bonzini wrote:
 On 02/06/2014 23:58, Eric Dumazet wrote:
  This looks dubious
 
  What about using kfree_rcu() instead ?
 
 It would lead to unbound allocation from userspace.

Look at how we did this in commit
c3059477fce2d956a0bb3e04357324780c5d8eeb

 
 translate_desc() still uses rcu_read_lock(), it's not clear if the mutex
  is really held.
 
 Yes, vhost_get_vq_desc must be called with the vq mutex held.
 
 The rcu_read_lock/unlock in translate_desc is unnecessary.

Yep, this is what I pointed out. This is not only unnecessary, but
confusing, and might be incorrectly copy/pasted in the future.

This patch is a partial one and leaves confusion.

Some places use the proper

mp = rcu_dereference_protected(dev->memory,
 lockdep_is_held(&dev->mutex));

others use the now incorrect :

rcu_read_lock();
mp = rcu_dereference(dev->memory);
...





Re: [PULL 2/2] vhost: replace rcu with mutex

2014-06-02 Thread Eric Dumazet
On Tue, 2014-06-03 at 00:30 +0300, Michael S. Tsirkin wrote:
 All memory accesses are done under some VQ mutex.
 So lock/unlock all VQs is a faster equivalent of synchronize_rcu()
 for memory access changes.
 Some guests cause a lot of these changes, so it's helpful
 to make them faster.
 
 Reported-by: Gonglei (Arei) arei.gong...@huawei.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  drivers/vhost/vhost.c | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 78987e4..1c05e60 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -593,6 +593,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
  {
   struct vhost_memory mem, *newmem, *oldmem;
   unsigned long size = offsetof(struct vhost_memory, regions);
 + int i;
  
   if (copy_from_user(&mem, m, size))
   return -EFAULT;
 @@ -619,7 +620,14 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
   oldmem = rcu_dereference_protected(d->memory,
  lockdep_is_held(&d->mutex));
   rcu_assign_pointer(d->memory, newmem);
 - synchronize_rcu();
 +
 + /* All memory accesses are done under some VQ mutex.
 +  * So below is a faster equivalent of synchronize_rcu()
 +  */
 + for (i = 0; i < d->nvqs; ++i) {
 + mutex_lock(&d->vqs[i]->mutex);
 + mutex_unlock(&d->vqs[i]->mutex);
 + }
   kfree(oldmem);
   return 0;
  }

This looks dubious

What about using kfree_rcu() instead ?

translate_desc() still uses rcu_read_lock(), it's not clear if the mutex
is really held.
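For reference, the kfree_rcu() shape being floated would look roughly like this (a sketch only; it assumes an rcu_head could be added to struct vhost_memory, and Paolo's objection in the reply above is precisely that this lets userspace queue an unbounded number of freed regions):

	struct vhost_memory {
		struct rcu_head rcu;	/* hypothetical new field */
		__u32 nregions;
		__u32 padding;
		struct vhost_memory_region regions[];
	};

	/* in vhost_set_memory(), replacing the synchronize/free pair: */
	rcu_assign_pointer(d->memory, newmem);
	kfree_rcu(oldmem, rcu);		/* freed after a grace period */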





Re: Moving frags and SKBTX_DEV_ZEROCOPY skbs

2014-05-14 Thread Eric Dumazet
On Wed, 2014-05-14 at 14:40 +0100, Zoltan Kiss wrote:
 Hi,
 
 Recently I've investigated issues around SKBTX_DEV_ZEROCOPY skbs where 
 the frags list were modified. I came across this function skb_shift(), 
 which moves frags between skbs. And there are a lot more of such kind, 
 skb_split or skb_try_coalesce, for example.
 It could be a dangerous thing if a frag is referenced from an skb which 
 doesn't have the original destructor_arg, and to avoid that 
 skb_orphan_frags should be called. Although probably these functions are 
 not normally touched in usual usecases, I think it would be useful to 
 review core skb functions proactively and add an skb_orphan_frags 
 everywhere where the frags could be referenced from other places.
 Any opinion about this?


For skb_shift(), it is currently used from tcp stack only, where
this SKBTX_DEV_ZEROCOPY thing is not used, so I do not think there is a
bug for the moment.

I already gave a patch for skb_try_coalesce(): for this one we do not
want the skb_orphan_frags() overhead. It's simply better in this case to
abort.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1b62343f5837..85995a14aafc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3838,7 +3839,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
return true;
}
 
-   if (skb_has_frag_list(to) || skb_has_frag_list(from))
+   if (skb_has_frag_list(to) ||
+   skb_has_frag_list(from) ||
+   (skb_shinfo(to)->tx_flags & SKBTX_DEV_ZEROCOPY) ||
+   (skb_shinfo(from)->tx_flags & SKBTX_DEV_ZEROCOPY))
return false;
 
if (skb_headlen(from) != 0) {





Re: Moving frags and SKBTX_DEV_ZEROCOPY skbs

2014-05-14 Thread Eric Dumazet
On Wed, 2014-05-14 at 13:42 -0400, David Miller wrote:
 From: Eric Dumazet eric.duma...@gmail.com
 Date: Wed, 14 May 2014 07:23:52 -0700
 
  On Wed, 2014-05-14 at 14:40 +0100, Zoltan Kiss wrote:
  Hi,
  
  Recently I've investigated issues around SKBTX_DEV_ZEROCOPY skbs where 
  the frags list were modified. I came across this function skb_shift(), 
  which moves frags between skbs. And there are a lot more of such kind, 
  skb_split or skb_try_coalesce, for example.
  It could be a dangerous thing if a frag is referenced from an skb which 
  doesn't have the original destructor_arg, and to avoid that 
  skb_orphan_frags should be called. Although probably these functions are 
  not normally touched in usual usecases, I think it would be useful to 
  review core skb functions proactively and add an skb_orphan_frags 
  everywhere where the frags could be referenced from other places.
  Any opinion about this?
  
  
  For skb_shift(), it is currently used from tcp stack only, where
  this SKBTX_DEV_ZEROCOPY thing is not used, so I do not think there is a
  bug for the moment.
  
  I already gave a patch for skb_try_coalesce(): for this one we do not
  want the skb_orphan_frags() overhead. It's simply better in this case to
  abort.
 
 Eric, can you please submit this formally? It is the second time I've seen
 it posted as RFC :-)

Sure, I was kind of waiting to make sure it was needed.

It looks like it is ;)




Re: Moving frags and SKBTX_DEV_ZEROCOPY skbs

2014-05-14 Thread Eric Dumazet
On Wed, 2014-05-14 at 20:41 +0100, Zoltan Kiss wrote:
 On 14/05/14 15:23, Eric Dumazet wrote:
  On Wed, 2014-05-14 at 14:40 +0100, Zoltan Kiss wrote:
  Hi,
 
  Recently I've investigated issues around SKBTX_DEV_ZEROCOPY skbs where
  the frags list were modified. I came across this function skb_shift(),
  which moves frags between skbs. And there are a lot more of such kind,
  skb_split or skb_try_coalesce, for example.
  It could be a dangerous thing if a frag is referenced from an skb which
  doesn't have the original destructor_arg, and to avoid that
  skb_orphan_frags should be called. Although probably these functions are
  not normally touched in usual usecases, I think it would be useful to
  review core skb functions proactively and add an skb_orphan_frags
  everywhere where the frags could be referenced from other places.
  Any opinion about this?
 
 
  For skb_shift(), it is currently used from tcp stack only, where
  this SKBTX_DEV_ZEROCOPY thing is not used, so I do not think there is a
  bug for the moment.
 It is called from tcp_input.c, which suggests it can be called on 
 incoming TCP packets.

Nope. 

We split outgoing packets, stored in the socket write queue.
These packets are locally generated by tcp_sendmsg() and tcp_sendpage(),
no way we use SKBTX_DEV_ZEROCOPY yet.

This split happens when we receive an ACK, that's why it is in
tcp_input.c





Re: [PATCH] tun: use netif_receive_skb instead of netif_rx_ni

2014-02-11 Thread Eric Dumazet
On Tue, 2014-02-11 at 22:25 +0800, Qin Chuanyu wrote:
 We could transmit directly instead of going through softirq, gaining
 improved throughput and latency.
 Test model: VM-Host-Host, transmit only, with the vhost thread and NIC
 interrupt bound to cpu1. netperf does the throughput test and qperf does
 the latency test.
 Host OS: suse11sp3, Guest OS: suse11sp3
 
 latency result(us):
 packet_len 64 256 512 1460
 old(UDP)   44  47  48   66
 new(UDP)   38  41  42   66
 
 old(TCP)   52  55  70  117
 new(TCP)   45  48  61  114
 
 throughput result(Gbit/s):
 packet_len   64   512   1024   1460
 old(UDP)   0.42  2.02   3.75   4.68
 new(UDP)   0.45  2.14   3.77   5.06
 
 For TCP, due to the latency, the client couldn't send packets big enough
 to benefit from the NIC's TSO, so the results show more packets sent per
 second but lower throughput.
 
 Eric mentioned that this would have a problem with cgroups, but a patch
 for that had been sent by Herbert Xu.
 patch_id f845172531fb7410c7fb7780b1a6e51ee6df7d52
 
 Signed-off-by: Chuanyu Qin qinchua...@huawei.com
 ---
   drivers/net/tun.c |4 +++-
   1 files changed, 3 insertions(+), 1 deletions(-)
 
 diff --git a/drivers/net/tun.c b/drivers/net/tun.c
 index 44c4db8..90b4e58 100644
 --- a/drivers/net/tun.c
 +++ b/drivers/net/tun.c
 @@ -1184,7 +1184,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
   skb_probe_transport_header(skb, 0);
 
   rxhash = skb_get_hash(skb);
 - netif_rx_ni(skb);
 + rcu_read_lock_bh();
 + netif_receive_skb(skb);
 + rcu_read_unlock_bh();
 
   tun->dev->stats.rx_packets++;
   tun->dev->stats.rx_bytes += len;

I already said this patch is not good:

rcu_read_lock_bh() makes no sense here.

What is really needed is local_bh_disable();

Herbert patch ( http://patchwork.ozlabs.org/patch/52963/ ) had a much
cleaner form.

Just use it, CC him, credit him, please ?
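For comparison, the shape Eric is pointing at (modeled on Herbert's linked patch; a sketch, not the exact submitted hunk):

	rxhash = skb_get_hash(skb);
	/* run the stack synchronously with softirqs disabled around the
	 * call, rather than taking the rcu_read_lock_bh() detour */
	local_bh_disable();
	netif_receive_skb(skb);
	local_bh_enable();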





Re: [PATCH] tun: use netif_receive_skb instead of netif_rx_ni

2014-02-11 Thread Eric Dumazet
On Wed, 2014-02-12 at 13:28 +0800, Jason Wang wrote:

 A question: without NAPI weight, could this starve other net devices?

Not really, as net devices are serviced by softirq handler.




Re: [PATCH] tun: use netif_receive_skb instead of netif_rx_ni

2014-02-11 Thread Eric Dumazet
On Wed, 2014-02-12 at 13:50 +0800, Jason Wang wrote:
 On 02/12/2014 01:47 PM, Eric Dumazet wrote:
  On Wed, 2014-02-12 at 13:28 +0800, Jason Wang wrote:
 
  A question: without NAPI weight, could this starve other net devices?
  Not really, as net devices are serviced by softirq handler.
 
 
 
 Yes, then the issue is tun could be starved by other net devices.

How does this patch change anything about this 'problem'?

netif_rx_ni() can only be called if your process is not preempted by
other high prio tasks/softirqs.

If this process is scheduled on a cpu, then disabling bh to process
_one_ packet won't fundamentally change the dynamics of the system.





Re: 8% performance improved by change tap interact with kernel stack

2014-01-28 Thread Eric Dumazet
On Tue, 2014-01-28 at 16:14 +0800, Qin Chuanyu wrote:
 According to the perf test results, I found a 5%-8% cpu cost in
 softirq due to netif_rx_ni() being called in tun_get_user.
 
 So I changed the call, which causes the skb to be transmitted more
 quickly, from
   tun_get_user ->
       netif_rx_ni(skb);
 to
   tun_get_user ->
       rcu_read_lock_bh();
       netif_receive_skb(skb);
       rcu_read_unlock_bh();

No idea why you use RCU here?

 
 The test result is as below:
   CPU: Intel(R) Xeon(R) CPU E5620  @ 2.40GHz
   NIC: intel 82599
   Host OS/Guest OS:suse11sp3
   Qemu-1.6
   netperf udp 512(VM tx)
   test model: VM-host-host
 
   modified before : 2.00Gbps 461146pps
   modified after  : 2.16Gbps 498782pps
 
 An 8% performance gain came from this change.
 Is there any problem with this patch?

http://patchwork.ozlabs.org/patch/52963/






Re: TCP small packets throughput and multiqueue virtio-net

2013-03-08 Thread Eric Dumazet
On Fri, 2013-03-08 at 14:24 +0800, Jason Wang wrote:
 Hello all:
 
 I met an issue when testing multiqueue virtio-net. When testing guest
 small-packet stream sending performance with netperf, I found a
 regression with multiqueue. When I run 2 sessions of the TCP_STREAM test
 with 1024-byte writes from guest to local host, I get the following result:
 
 1q result: 3457.64
 2q result: 7781.45
 
 Statistics show that, compared with one queue, multiqueue tends to send
 many more but smaller packets. Tcpdump shows a single queue has a much
 higher probability of producing a 64K GSO packet compared to multiqueue.
 More but smaller packets cause more vmexits and interrupts, which
 lead to a degradation in throughput.
 
 The problem only exists for small-packet sending. When I test with
 larger sizes, multiqueue gradually outperforms single queue. And
 multiqueue also outperforms in both TCP_RR and pktgen tests (even with
 small packets). The problem disappears when I turn off both gso and tso.
 

This makes little sense to me : TCP_RR workload (assuming one byte
payload) cannot use GSO or TSO anyway. Same for pktgen as it uses UDP.

 I'm not sure whether it's a bug or expected, since anyway we get an
 improvement in latency. And if it's a bug, I suspect it is related to
 the TCP GSO batching algorithm, which tends to batch less in this
 situation (Jiri Pirko suspects a defect in the virtio-net driver, but I
 didn't find any obvious clue for this). After some experiments, I find
 it may be related to tcp_tso_should_defer(); after
 1) changing the tcp_tso_win_divisor to 1 and
 2) the following changes,
 the throughput was almost the same (but still a little worse) as single
 queue:
 
 diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
 index fd0cea1..dedd2a6 100644
 --- a/net/ipv4/tcp_output.c
 +++ b/net/ipv4/tcp_output.c
 @@ -1777,10 +1777,12 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
  
 limit = min(send_win, cong_win);
  
 +#if 0
 /* If a full-sized TSO skb can be sent, do it. */
 if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
    sk->sk_gso_max_segs * tp->mss_cache))
 goto send_now;
 +#endif
  
 /* Middle in queue won't get any more data, full sendable already? */
 if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
 
 Git history shows this check was added for both westwood and fasttcp.
 I'm not familiar with TCP, but it looks like we can easily hit this check
 when multiqueue is enabled for virtio-net. Maybe I am wrong, but I
 wonder whether we can still do some batching here.
 
 Any comments, thoughts are welcomed.
 

Well, the point is: if your app does write() of 1024 bytes, that's probably
because it wants small packets from the very beginning. (See the TCP
PUSH flag?)

If the transport is slow, the TCP stack will automatically collapse several
writes into single skbs (assuming TSO or GSO is on), and you'll see big
GSO packets with tcpdump [1]. So TCP will help you get less overhead
in this case.

But if the transport is fast, you'll see small packets, and that's good for
latency.

So my opinion is: it's behaving exactly as expected.

If you want bigger packets either :
- Make the application doing big write()
- Slow the vmexit ;)

[1] GSO fools tcpdump: actual packets sent to the wire are not 'big
packets', but they hit dev_hard_start_xmit() as GSO packets.





Re: more interrupts (lower performance) in bare-metal compared with running VM

2012-07-27 Thread Eric Dumazet
On Fri, 2012-07-27 at 22:09 -0500, sheng qiu wrote:
 Hi all,
 
 I am comparing network throughput performance in the bare-metal case
 with that of running a VM with an assigned device (assigned NIC). I have
 two physical machines (each has a 10Gbit NIC); one is used as the remote
 server (runs netserver) and the other as the target under test (runs
 netperf with different send message sizes, TCP_STREAM test). The remote
 NIC is connected directly to the tested NIC; both are 10Gbit.
 For the bare-metal case I enable 1 cpu core; for the VM I also configure
 1 vcpu (the memory is sufficient in both cases). I ran netperf for 120
 seconds and got the following results:
 
 send message       interrupts    throughput (mbit/s)
 bare-metal   256     10696290    1114.84
              512     10106786    1391.92
             1024     10071032    1508.09
             2048      4560857    3434.65
             4096      3292200    4762.26
             8192      3169801    4733.89
            16384      2780529    4892.6
 

Are these interrupt counts taken on the receiver ?

 VM (assigned NIC)   256      3817904    2249.35
                     512      3599007    4342.81
                    1024      3005601    4134.69
                    2048      2952122    4484
                    4096      2682874    4566.34
                    8192      2786719    4734.39
                   16384      2603835    4540.47
 
 As shown, the interrupts for the bare-metal case are much higher than the
 VM case for some message sizes. We also see that the throughput in those
 situations is lower than the VM case. It's strange that bare metal has
 lower performance than the VM case. Does anyone have comments on this?
 I am very confused.

Well, I think you answered your own question. High interrupt rates
are not good for throughput. They might be good for latencies.

Using a VM adds delays and several frames might be delivered per
interrupt.

Using bare metal is faster and only one frame is delivered by the NIC per
interrupt.

Try TCP_RR instead of TCP_STREAM for example.

What NIC is it exactly ? It seems it has no coalescing or LRO strategy.

ethtool -k eth0
ethtool -c eth0

What kernel version was used? 4892 Mbit/s is not line rate.




Re: [PATCH RFC] tun: experimental zero copy tx support

2012-05-14 Thread Eric Dumazet
On Mon, 2012-05-14 at 20:04 +0300, Michael S. Tsirkin wrote:
 On Mon, May 14, 2012 at 09:54:46AM -0700, Stephen Hemminger wrote:
  On Sun, 13 May 2012 18:52:06 +0300
  Michael S. Tsirkin m...@redhat.com wrote:
  
   + /* Userspace may produce vectors with count greater than
   +  * MAX_SKB_FRAGS, so we need to linearize parts of the skb
   +  * to let the rest of data to be fit in the frags.
   +  */
  Rather than complex partial code, just go through slow path for
  requests with too many frags (or for really small requests).
  Creating mixed skb's seems too easy to get wrong.
 
 I don't object in principle, but macvtap has the same code,
 so it seems better to stay consistent.
 

If I remember correctly, the code in macvtap was buggy and still is.

Jason Wang's fixes are not yet in, so maybe wait a bit, so that we don't
add a pile of new bugs?






Re: [PATCH RFC] tun: experimental zero copy tx support

2012-05-14 Thread Eric Dumazet
On Mon, 2012-05-14 at 22:14 +0300, Michael S. Tsirkin wrote:

 They seem to be in net-next, or did I miss something?

Yes, you re-introduce in this patch some bugs already fixed in macvtap





Re: [PATCH 1/3] net: use this_cpu_xxx replace percpu_xxx funcs

2011-10-20 Thread Eric Dumazet
On Thursday, 20 October 2011 at 15:32 +0800, Alex Shi wrote:
 The percpu_xxx funcs are duplicates of the this_cpu_xxx funcs, so replace
 them for further code cleanup.
 
 And in a preempt-safe scenario, the __this_cpu_xxx funcs have slightly
 better performance, since __this_cpu_xxx has no redundant preempt_disable().
 
 Signed-off-by: Alex Shi alex@intel.com
 ---
  net/netfilter/xt_TEE.c |   12 ++--
  net/socket.c   |4 ++--
  2 files changed, 8 insertions(+), 8 deletions(-)

Acked-by: Eric Dumazet eric.duma...@gmail.com

Thanks !




Re: [PATCH] Code clean up for percpu_xxx() functions

2011-10-19 Thread Eric Dumazet
On Thursday, 20 October 2011 at 10:44 +0800, Alex Shi wrote:
  Acked-by: Christoph Lameter c...@gentwo.org
 
 Thanks, Christoph! 
 and resend for code style problem correction. 
 

Sorry, this huge patch brings too many potential bugs.

I am asking you for a separate patch for the networking part, because I
don't want to Ack all the other parts.

 diff --git a/net/socket.c b/net/socket.c
 index ffe92ca..2845d38 100644
 --- a/net/socket.c
 +++ b/net/socket.c
 @@ -479,7 +479,7 @@ static struct socket *sock_alloc(void)
   inode->i_uid = current_fsuid();
   inode->i_gid = current_fsgid();
  
 - percpu_add(sockets_in_use, 1);
 + __this_cpu_add(sockets_in_use, 1);

No, we are in process context, you need

this_cpu_add(sockets_in_use, 1);

   return sock;
  }
  
 @@ -522,7 +522,7 @@ void sock_release(struct socket *sock)
   if (rcu_dereference_protected(sock->wq, 1)->fasync_list)
   printk(KERN_ERR "sock_release: fasync list not empty!\n");
  
 - percpu_sub(sockets_in_use, 1);
 + __this_cpu_sub(sockets_in_use, 1);

this_cpu_sub();

   if (!sock->file) {
   iput(SOCK_INODE(sock));
   return;
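A short sketch of the distinction being enforced above (illustrative only):

	static DEFINE_PER_CPU(int, sockets_in_use);

	static void from_process_context(void)
	{
		/* this_cpu_add() stays correct even if this task can be
		 * preempted and migrated: the read-modify-write happens as
		 * one protected per-cpu operation. */
		this_cpu_add(sockets_in_use, 1);
	}

	static void with_preemption_already_disabled(void)
	{
		/* __this_cpu_add() drops that protection; it is only valid
		 * when the caller guarantees no preemption/migration. */
		__this_cpu_add(sockets_in_use, 1);
	}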





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-08-22 Thread Eric Dumazet
On Monday, 22 August 2011 at 09:36 +0300, Avi Kivity wrote:
 On 08/20/2011 04:16 PM, Brad Campbell wrote:
  Author: Alexander Duyck alexander.h.du...@intel.com
  Date:   Thu Jul 1 13:28:27 2010 +
 
  x86: Drop CONFIG_MCORE2 check around setting of NET_IP_ALIGN
 
  This patch removes the CONFIG_MCORE2 check from around 
  NET_IP_ALIGN.  It is
  based on a suggestion from Andi Kleen.  The assumption is that 
  there are
  not any x86 cores where unaligned access is really slow, and this 
  change
  would allow for a performance improvement to still exist on 
  configurations
  that are not necessarily optimized for Core 2.
 
  Cc: Andi Kleen a...@linux.intel.com
  Cc: Thomas Gleixner t...@linutronix.de
  Cc: Ingo Molnar mi...@redhat.com
  Cc: H. Peter Anvin h...@zytor.com
  Cc: x...@kernel.org
  Signed-off-by: Alexander Duyck alexander.h.du...@intel.com
  Signed-off-by: Jeff Kirsher jeffrey.t.kirs...@intel.com
  Acked-by: H. Peter Anvin h...@zytor.com
  Signed-off-by: David S. Miller da...@davemloft.net
 
  :040000 040000 5a15867789080a2f67a74b17c4422f85b7a9fb4a b98769348bd765731ca3ff03b33764257e23226c M	arch
 
  I can confirm this bug exists in the 3.0 kernel, however I'm unable to 
  reproduce it on todays git.
 
  So anyone using netfilter, kvm and bridge on kernels between 
  2.6.36-rc1 and 3.0 may hit this bug, but it looks like it is fixed in 
  the current 3.1-rc kernels.
 
 
 Thanks for this effort.  I don't think this patch is buggy in itself, it 
 merely exposed another bug which was fixed later on.
 

Some piece of hardware has a 2-byte offset requirement, and driver
incorrectly assumed NET_IP_ALIGN was 2 on x86.

Brad, could you post your config (lsmod, dmesg) again ?

tg3.c code for example uses a private value, not related to NET_IP_ALIGN

#define TG3_RAW_IP_ALIGN 2





Re: [net-next RFC PATCH 4/7] tuntap: multiqueue support

2011-08-12 Thread Eric Dumazet
On Friday, 12 August 2011 at 09:55 +0800, Jason Wang wrote:

+   rxq = skb_get_rxhash(skb);
+   if (rxq) {
+   tfile = rcu_dereference(tun->tfiles[rxq % numqueues]);
+   if (tfile)
+   goto out;
+   }

You can avoid an expensive divide with following trick :

u32 idx = ((u64)rxq * numqueues) >> 32;
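To see why this works, a standalone userspace check (illustrative only): for any 32-bit hash h, ((u64)h * n) >> 32 lands in [0, n) and distributes roughly evenly, trading the division for a multiply and a shift.

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint32_t numqueues = 6;

		for (uint32_t i = 0; i < 8; i++) {
			uint32_t hash = i * 0x9e3779b9u; /* sample spread hashes */
			uint32_t idx = (uint32_t)(((uint64_t)hash * numqueues) >> 32);

			printf("hash %#010x -> queue %u\n", hash, idx);
		}
		return 0;
	}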



 -static struct tun_struct *tun_get(struct file *file)
 +static void tun_detach_all(struct net_device *dev)
  {
 - return __tun_get(file->private_data);
 + struct tun_struct *tun = netdev_priv(dev);
 + struct tun_file *tfile, *tfile_list[MAX_TAP_QUEUES];
 + int i, j = 0;
 +
 + spin_lock(&tun_lock);
 +
 + for (i = 0; i < MAX_TAP_QUEUES && tun->numqueues; i++) {
 + tfile = rcu_dereference_protected(tun->tfiles[i],
 + lockdep_is_held(&tun_lock));
 + if (tfile) {
 + wake_up_all(&tfile->wq.wait);
 + tfile_list[i++] = tfile;

typo here, you want tfile_list[j++] = tfile;

 + rcu_assign_pointer(tun->tfiles[i], NULL);
 + rcu_assign_pointer(tfile->tun, NULL);
 + --tun->numqueues;
 + }
 + }
 + BUG_ON(tun->numqueues != 0);
 + spin_unlock(&tun_lock);
 +
 + synchronize_rcu();
 + for (--j; j >= 0; j--)
 + sock_put(tfile_list[j]->sk);
  }
  

Could you take a look at net/packet/af_packet.c, to check how David did
the whole fanout thing ?

__fanout_unlink() 

Trick is to not leave NULL entries in the tun->tfiles[] array.

It makes things easier in hot path.
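A sketch of the af_packet-style compaction Eric means (hypothetical helper, simplified; caller holds tun_lock): on detach, move the last slot into the hole so tfiles[0..numqueues-1] stays dense and the hot path never probes NULL entries.

	static void tun_compact_slot(struct tun_struct *tun, int index)
	{
		struct tun_file *last_tfile;
		int last = --tun->numqueues;

		last_tfile = rcu_dereference_protected(tun->tfiles[last],
						lockdep_is_held(&tun_lock));
		/* fill the vacated slot with the last active tfile */
		rcu_assign_pointer(tun->tfiles[index], last_tfile);
		rcu_assign_pointer(tun->tfiles[last], NULL);
	}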




Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-08 Thread Eric Dumazet
On Thursday, 9 June 2011 at 01:02 +0800, Brad Campbell wrote:
 On 08/06/11 11:59, Eric Dumazet wrote:
 
  Well, a bisection definitely should help, but needs a lot of time in
  your case.
 
 Yes. compile, test, crash, walk out to the other building to press 
 reset, lather, rinse, repeat.
 
 I need a reset button on the end of a 50M wire, or a hardware watchdog!
 
 Actually it's not so bad. If I turn off slub debugging the kernel panics 
 and reboots itself.
 
 This.. :
 [2.913034] netconsole: remote ethernet address 00:16:cb:a7:dd:d1
 [2.913066] netconsole: device eth0 not up yet, forcing it
 [3.660062] Refined TSC clocksource calibration: 3213.422 MHz.
 [3.660118] Switching to clocksource tsc
 [   63.200273] r8169 0000:03:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-1.fw (-2)
 [   63.223513] r8169 :03:00.0: eth0: link down
 [   63.223556] r8169 :03:00.0: eth0: link down
 
 ..is slowing down reboots considerably. 3.0-rc does _not_ like some 
 timing hardware in my machine. Having said that, at least it does not 
 randomly panic on SCSI like 2.6.39 does.
 
 Ok, I've ruled out TCPMSS. Found out where it was being set and neutered 
 it. I've replicated it with only the single DNAT rule.
 
 
  Could you try following patch, because this is the 'usual suspect' I had
  yesterday :
 
  diff --git a/net/core/skbuff.c b/net/core/skbuff.c
  index 46cbd28..9f548f9 100644
  --- a/net/core/skbuff.c
  +++ b/net/core/skbuff.c
  @@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
  fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
  }
 
  +#if 0
  if (fastpath &&
  size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
  memmove(skb->head + size, skb_shinfo(skb),
  @@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
  off = nhead;
  goto adjust_others;
  }
  -
  +#endif
  data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
  if (!data)
  goto nodata;
 
 
 
 
 Nope.. that's not it. sigh That might have changed the characteristic 
 of the fault slightly, but unfortunately I got caught with a couple of 
 fsck's, so I only got to test it 3 times tonight.
 
 It's unfortunate that this is a production system, so I can only take it 
 down between about 9pm and 1am. That would normally be pretty 
 productive, except that an fsck of a 14TB ext4 can take 30 minutes if it 
 panics at the wrong time.
 
 I'm out of time tonight, but I'll have a crack at some bisection 
 tomorrow night. Now I just have to go back far enough that it works, and 
 be near enough not to have to futz around with /proc /sys or drivers.
 
 I really, really, really appreciate you guys helping me with this. It 
 has been driving me absolutely bonkers. If I'm ever in the same town as 
 any of you, dinner and drinks are on me.

Hmm, I wonder if kmemcheck could help you, but it's slow as hell, so not
appropriate for production :(





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Eric Dumazet
On Tuesday, 7 June 2011 at 21:27 +0800, Brad Campbell wrote:
 On 07/06/11 04:22, Eric Dumazet wrote:
 
  Could you please try latest linux-2.6 tree ?
 
  We fixed many networking bugs that could explain your crash.
 
 
 
 
 
 No good I'm afraid.
 
 [  543.040056] 
 =
 [  543.040136] BUG ip_dst_cache: Padding overwritten. 
 0x8803e4217ffe-0x8803e4217fff
 [  543.040194] 

That's pretty strange: these are the last two bytes of a page, set to
0x0000 (a 16-bit value).

There is no way a dst field could actually sit at this location (it's
padding), since a dst is a bit less than 256 bytes (0xe8), and each
entry is aligned on a 64-byte address.

grep dst /proc/slabinfo 

ip_dst_cache   32823  62944    256   32    2 : tunables    0    0    0 : slabdata   1967   1967      0

sizeof(struct rtable)=0xe8


 -
 [  543.040198]
 [  543.040298] INFO: Slab 0xea000d9e74d0 objects=25 used=25 fp=0x          (null) flags=0x80004081
 [  543.040364] Pid: 4576, comm: kworker/1:2 Not tainted 3.0.0-rc2 #1
 [  543.040415] Call Trace:
 [  543.040472]  [810b9c1d] ? slab_err+0xad/0xd0
 [  543.040528]  [8102e034] ? check_preempt_wakeup+0xa4/0x160
 [  543.040595]  [810ba206] ? slab_pad_check+0x126/0x170
 [  543.040650]  [8133045b] ? dst_destroy+0x8b/0x110
 [  543.040701]  [810ba29a] ? check_slab+0x4a/0xc0
 [  543.040753]  [810baf2d] ? free_debug_processing+0x2d/0x250
 [  543.040808]  [810bb27b] ? __slab_free+0x12b/0x140
 [  543.040862]  [810bbe99] ? kmem_cache_free+0x99/0xa0
 [  543.040915]  [8133045b] ? dst_destroy+0x8b/0x110
 [  543.040967]  [813307f6] ? dst_gc_task+0x196/0x1f0
 [  543.041021]  [8104e954] ? queue_delayed_work_on+0x154/0x160
 [  543.041081]  [813066fe] ? do_dbs_timer+0x20e/0x3d0
 [  543.041133]  [81330660] ? dst_alloc+0x180/0x180
 [  543.041187]  [8104f28b] ? process_one_work+0xfb/0x3b0
 [  543.041242]  [8104f964] ? worker_thread+0x144/0x3d0
 [  543.041296]  [8102cc10] ? __wake_up_common+0x50/0x80
 [  543.041678]  [8104f820] ? rescuer_thread+0x2e0/0x2e0
 [  543.041729]  [8104f820] ? rescuer_thread+0x2e0/0x2e0
 [  543.041782]  [81053436] ? kthread+0x96/0xa0
 [  543.041835]  [813e1d14] ? kernel_thread_helper+0x4/0x10
 [  543.041890]  [810533a0] ? kthread_worker_fn+0x120/0x120
 [  543.041944]  [813e1d10] ? gs_change+0xb/0xb
 [  543.041993]  Padding 0x8803e4217f40:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.042718]  Padding 0x8803e4217f50:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.043433]  Padding 0x8803e4217f60:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.044155]  Padding 0x8803e4217f70:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.044866]  Padding 0x8803e4217f80:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.045590]  Padding 0x8803e4217f90:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.046311]  Padding 0x8803e4217fa0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.047034]  Padding 0x8803e4217fb0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.047755]  Padding 0x8803e4217fc0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.048474]  Padding 0x8803e4217fd0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.049203]  Padding 0x8803e4217fe0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 5a 5a 
 [  543.049909]  Padding 0x8803e4217ff0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 
 5a 5a 5a 5a 5a 00 00 ZZ..
 [  543.050021] FIX ip_dst_cache: Restoring 
 0x8803e4217f40-0x8803e4217fff=0x5a
 [  543.050021]
 
 Dropped -mm, Hugh and Andrea from CC as this does not appear to be mm or 
 ksm related.
 
 I'll pare down the firewall and see if I can make it break easier with a 
 smaller test set.

Hmm, not sure now :(

Could you reproduce another bug please ?





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Eric Dumazet
On Tuesday, 7 June 2011 at 17:35 +0200, Patrick McHardy wrote:

 The main suspects would be NAT and TCPMSS. Did you also try whether
 the crash occurs with only one of these these rules?
 
  I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
  the address the way I was doing it, so that's a no-go for me.
 
 That's really weird since you're apparently not using any bridge
 netfilter features. It shouldn't have any effect besides changing
 at which point ip_tables is invoked. How are your network devices
 configured (specifically any bridges)?

Something in the kernel does 

u16 *ptr = addr (given by kmalloc())

ptr[-1] = 0;

Could be an off-by-one error in a memmove()/memcpy() or a loop...

I can't see a network issue here.

I checked arch/x86/lib/memmove_64.S and it seems fine.





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-07 Thread Eric Dumazet
On Wednesday, 8 June 2011 at 08:18 +0800, Brad Campbell wrote:
 On 08/06/11 06:57, Patrick McHardy wrote:
  On 07.06.2011 20:31, Eric Dumazet wrote:
  On Tuesday, 7 June 2011 at 17:35 +0200, Patrick McHardy wrote:
 
  The main suspects would be NAT and TCPMSS. Did you also try whether
  the crash occurs with only one of these these rules?
 
  I've just compiled out CONFIG_BRIDGE_NETFILTER and can no longer access
  the address the way I was doing it, so that's a no-go for me.
 
  That's really weird since you're apparently not using any bridge
  netfilter features. It shouldn't have any effect besides changing
  at which point ip_tables is invoked. How are your network devices
  configured (specifically any bridges)?
 
  Something in the kernel does
 
  u16 *ptr = addr (given by kmalloc())
 
  ptr[-1] = 0;
 
  Could be an off-by-one error in a memmove()/memcpy() or a loop...
 
  I cant see a network issue here.
 
  So far me neither, but netfilter appears to trigger the bug.
 
 Would it help if I tried some older kernels? This issue only surfaced 
 for me recently as I only installed the VM's in question about 12 weeks 
 ago and have only just started really using them in anger. I could try 
 reproducing it on progressively older kernels to see if I can find one 
 that works and then bisect from there.

Well, a bisection definitely should help, but needs a lot of time in
your case.

Could you try following patch, because this is the 'usual suspect' I had
yesterday :

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 46cbd28..9f548f9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
}
 
+#if 0
if (fastpath &&
size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
memmove(skb->head + size, skb_shinfo(skb),
@@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
off = nhead;
goto adjust_others;
}
-
+#endif
data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
if (!data)
goto nodata;




Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-06 Thread Eric Dumazet
On Sunday, 5 June 2011 at 21:45 +0800, Brad Campbell wrote:
 On 05/06/11 16:14, Avi Kivity wrote:
  On 06/03/2011 04:38 PM, Brad Campbell wrote:
 
  Is there anyone who can point me at the appropriate cage to rattle? I
  know it appears to be a netfilter issue, but I don't seem to be able
  to get a message to the list (and I am subscribed to it and have been
  getting mail for months) and I'm not sure who to pester. The other
  alternative is I just stop doing that and wait for it to bite
  someone else.
 
  The mailing list might be set not to send your own mails back to you.
  Check the list archive.
 
 Yep, I did that first..
 
 Given the response to previous issues along the same line, it looks a 
 bit like I just remember not to actually use the system in the way that 
 triggers the bug and be happy that 99% of the time the kernel does not 
 panic, but have that lovely feeling in the back of the skull that says 
 any time now, and without obvious reason the whole machine might just 
 come crashing down..
 
 I guess it's still better than running Xen or Windows..

Could you please try latest linux-2.6 tree ?

We fixed many networking bugs that could explain your crash.





Re: KVM induced panic on 2.6.38[2367] 2.6.39

2011-06-06 Thread Eric Dumazet
On Monday, 6 June 2011 at 22:10 +0200, Bart De Schuymer wrote:
 Hi Brad,
 
 This has probably nothing to do with ebtables, so please rmmod in case 
 it's loaded.
 A few questions I didn't directly see an answer to in the threads I 
 scanned...
 I'm assuming you actually use the bridging firewall functionality. So, 
 what iptables modules do you use? Can you reduce your iptables rules to 
 a core that triggers the bug?
 Or does it get triggered even with an empty set of firewall rules?
 Are you using a stock .35 kernel or is it patched?
 Is this something I can trigger on a poor guy's laptop or does it 
 require specialized hardware (I'm catching up on qemu/kvm...)?
 

Keep netdev, as this most probably is a networking bug.





Re: [PATCH V6 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb

2011-05-26 Thread Eric Dumazet
On Thursday, 26 May 2011 at 12:36 -0700, Shirley Ma wrote:
 This patch adds userspace buffer support in the skb shared info. A new
 struct skb_ubuf_info is needed to maintain the userspace buffers
 argument and index; a callback is used to notify userspace to release
 the buffers once the lower device has finished DMA (the last reference
 to that skb has gone).
 
 If any userspace app references these userspace buffers, then they will
 be copied into the kernel. This way we can prevent userspace apps from
 holding these userspace buffers too long.
 
 Signed-off-by: Shirley Ma x...@us.ibm.com
 ---
 
  include/linux/skbuff.h |   26 +++
  net/core/skbuff.c  |   80 ++-
  2 files changed, 104 insertions(+), 2 deletions(-)
 
 diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
 index d0ae90a..025de5c 100644
 --- a/include/linux/skbuff.h
 +++ b/include/linux/skbuff.h
 @@ -189,6 +189,18 @@ enum {
  SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
  };
  
 +/*
 + * The callback notifies userspace to release buffers when skb DMA is done in
 + * lower device, the skb last reference should be 0 when calling this.
 + * The desc is used to track userspace buffer index.
 + */
 +struct skb_ubuf_info {
 + /* support buffers allocation from userspace */
 + void(*callback)(struct sk_buff *);
 + void*arg;
 + size_t  desc;
 +};

That's 24 bytes on each skb... desc, for example, is not used in this
patch (yes, it's used later in patch 4/4).

But still... could you instead use only one pointer in the skb?
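In sketch form, the single-pointer layout being asked for (hypothetical names; roughly the direction the code later took, with the state in a small separately allocated object):

	/* one small object per zero-copy skb, allocated by the producer */
	struct ubuf_info {
		void (*callback)(struct ubuf_info *); /* release notification */
		void *ctx;                            /* producer private state */
		unsigned long desc;                   /* userspace buffer index */
	};

	/* skb_shared_info then carries only a pointer (8 bytes), which can
	 * also overlay destructor_arg: */
	struct ubuf_info *ubuf;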



 +
  /* This data is invariant across clones and lives at
   * the end of the header data, ie. at skb-end.
   */
 @@ -211,6 +223,10 @@ struct skb_shared_info {
   /* Intermediate layers must ensure that destructor_arg
* remains valid until skb destructor */
   void *  destructor_arg;
 +
 + /* DMA mapping from/to userspace buffers */
 + struct skb_ubuf_info ubuf;
 +
   /* must be last field, see pskb_expand_head() */
   skb_frag_t  frags[MAX_SKB_FRAGS];
  };
 @@ -2261,5 +2277,15 @@ static inline void skb_checksum_none_assert(struct sk_buff *skb)
  }
  
  bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
 +
 +/*
 + *   skb_ubuf - is the buffer from userspace
 + *   @skb: buffer to check
 + */
 +static inline int skb_ubuf(const struct sk_buff *skb)
 +{
 + return (skb_shinfo(skb)->ubuf.callback != NULL);

return skb_shinfo(skb)->ubuf.callback != NULL;

 +}
 +
  #endif   /* __KERNEL__ */
  #endif   /* _LINUX_SKBUFF_H */
 diff --git a/net/core/skbuff.c b/net/core/skbuff.c
 index 7ebeed0..890447c 100644
 --- a/net/core/skbuff.c
 +++ b/net/core/skbuff.c
 @@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
   shinfo = skb_shinfo(skb);
   memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
   atomic_set(shinfo-dataref, 1);
 + shinfo->ubuf.callback = NULL;
 + shinfo->ubuf.arg = NULL;

if you put the ubuf ptr before dataref, no need to add this (the memset()
clears all of shared_info up to dataref)

  kmemcheck_annotate_variable(shinfo->destructor_arg);
  
   if (fclone) {
 @@ -328,6 +330,14 @@ static void skb_release_data(struct sk_buff *skb)
  put_page(skb_shinfo(skb)->frags[i].page);
   }
  
 + /*
 +  * if skb buf is from userspace, we need to notify the caller
 +  * the lower device DMA has done;
 +  */
 + if (skb_ubuf(skb)) {
 + skb_shinfo(skb)->ubuf.callback(skb);
 + skb_shinfo(skb)->ubuf.callback = NULL;
 + }
   if (skb_has_frag_list(skb))
   skb_drop_fraglist(skb);
  
 @@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
   if (irqs_disabled())
   return false;
  
 + if (skb_ubuf(skb))
 + return false;
 +
  if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
   return false;
  
 @@ -572,6 +585,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
   atomic_set(n-users, 1);
  
  atomic_inc(&(skb_shinfo(skb)->dataref));
 + skb_shinfo(skb)->ubuf.callback = NULL;
  skb->cloned = 1;
  
   return n;
 @@ -595,6 +609,48 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
  }
  EXPORT_SYMBOL_GPL(skb_morph);
  
 +/* skb frags copy userspace buffers to kernel */
 +static int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 +{
 + int i;
 + int num_frags = skb_shinfo(skb)->nr_frags;
 + struct page *page, *head = NULL;
 +
 + for (i = 0; i < num_frags; i++) {
 + u8 *vaddr;
 + skb_frag_t *f = &skb_shinfo(skb)->frags[i];
 +
 + page = alloc_page(GFP_ATOMIC);
 + if (!page) {
 + while 

Re: [PATCH V6 2/4 net-next] skbuff: Add userspace zero-copy buffers in skb

2011-05-26 Thread Eric Dumazet
On Thursday, 26 May 2011 at 13:24 -0700, Shirley Ma wrote:

 I could reduce the callback pointer by moving it into *arg, but not desc;
 desc indicates which buffer's DMA hasn't been done yet.


I guess you don't need to use the skb itself to hold all your state?

I understand it's convenient for you, but I believe it's worth the pain to
use only one pointer to a (small) object where you put all your stuff.

Some machines alloc/free millions of skbs per second.

If/when most skb uses are for userspace zero-copy buffers, we can embed
your small object in the skb itself ;)




[PATCH] posix-timers: RCU conversion

2011-03-22 Thread Eric Dumazet
Ben Nagy reported a scalability problem with KVM/QEMU that hit very hard
a single spinlock (idr_lock) in posix-timers code, on its 48 core
machine.

Even on a 16 cpu machine (2x4x2), a single test can show 98% of cpu time
used in ticket_spin_lock, from lock_timer

Ref: http://www.spinics.net/lists/kvm/msg51526.html

Switching to RCU is quite easy, IDR being already RCU ready.

idr_lock should be locked only for an insert/delete, not a lookup.

Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime().

Before :

real1m18.669s
user0m1.346s
sys 1m17.180s

After :

real0m3.296s
user0m1.366s
sys 0m1.926s


Reported-by: Ben Nagy b...@iagu.net
Signed-off-by: Eric Dumazet eric.duma...@gmail.com
Cc: Avi Kivity a...@redhat.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: John Stultz johns...@us.ibm.com
Cc: Richard Cochran richard.coch...@omicron.at
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
---
 include/linux/posix-timers.h |1 +
 kernel/posix-timers.c|   25 ++---
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index d51243a..5dc27ca 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -81,6 +81,7 @@ struct k_itimer {
unsigned long expires;
} mmtimer;
} it;
+   struct rcu_head rcu;
 };
 
 struct k_clock {
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 4c01249..acb9be9 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -491,6 +491,13 @@ static struct k_itimer * alloc_posix_timer(void)
return tmr;
 }
 
+static void k_itimer_rcu_free(struct rcu_head *head)
+{
+   struct k_itimer *tmr = container_of(head, struct k_itimer, rcu);
+
+   kmem_cache_free(posix_timers_cache, tmr);
+}
+
 #define IT_ID_SET  1
 #define IT_ID_NOT_SET  0
 static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
@@ -503,7 +510,7 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
}
put_pid(tmr->it_pid);
sigqueue_free(tmr->sigq);
-   kmem_cache_free(posix_timers_cache, tmr);
+   call_rcu(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static struct k_clock *clockid_to_kclock(const clockid_t id)
@@ -631,22 +638,18 @@ out:
 static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 {
struct k_itimer *timr;
-   /*
-* Watch out here.  We do a irqsave on the idr_lock and pass the
-* flags part over to the timer lock.  Must not let interrupts in
-* while we are moving the lock.
-*/
-   spin_lock_irqsave(&idr_lock, *flags);
+
+   rcu_read_lock();
timr = idr_find(&posix_timers_id, (int)timer_id);
if (timr) {
-   spin_lock(&timr->it_lock);
+   spin_lock_irqsave(&timr->it_lock, *flags);
if (timr->it_signal == current->signal) {
-   spin_unlock(&idr_lock);
+   rcu_read_unlock();
return timr;
}
-   spin_unlock(&timr->it_lock);
+   spin_unlock_irqrestore(&timr->it_lock, *flags);
}
-   spin_unlock_irqrestore(&idr_lock, *flags);
+   rcu_read_unlock();
 
return NULL;
 }




Re: KVM lock contention on 48 core AMD machine

2011-03-21 Thread Eric Dumazet
On Monday, 21 March 2011 at 22:01 +0545, Ben Nagy wrote:
  On Mon, Mar 21, 2011 at 7:38 PM, Avi Kivity a...@redhat.com wrote:
In the future, please post the binary perf.dat.
 
  Hi Avi,
 
  How do I do that?
 
  'make nconfig' and go to the kernel hacking section.
 
 Imprecise question sorry, I meant how do I get the perf.dat not how do
 I disable the debugging.
 
 On the non-debug kernel: Linux eax 2.6.38-7-server #36-Ubuntu SMP Fri
 Mar 18 23:36:13 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
 
 150512.00 67.8% __ticket_spin_lock          [kernel.kallsyms]
  11126.00  5.0% memcmp_pages                [kernel.kallsyms]
   9563.00  4.3% native_safe_halt            [kernel.kallsyms]
   8965.00  4.0% svm_vcpu_run                /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm-amd.ko
   6489.00  2.9% tg_load_down                [kernel.kallsyms]
   4676.00  2.1% kvm_get_cs_db_l_bits        /lib/modules/2.6.38-7-server/kernel/arch/x86/kvm/kvm.ko
   1931.00  0.9% load_balance_fair           [kernel.kallsyms]
   1917.00  0.9% ktime_get                   [kernel.kallsyms]
   1624.00  0.7% walk_tg_tree.clone.129      [kernel.kallsyms]
   1542.00  0.7% find_busiest_group          [kernel.kallsyms]
   1326.00  0.6% find_next_bit               [kernel.kallsyms]
    673.00  0.3% lock_hrtimer_base.clone.25  [kernel.kallsyms]
    624.00  0.3% copy_user_generic_string    [kernel.kallsyms]
 
 top now says:
 
 top - 00:11:35 up 22 min,  4 users,  load average: 0.11, 6.15, 7.78
 Tasks: 491 total,   3 running, 488 sleeping,   0 stopped,   0 zombie
 Cpu(s):  0.9%us, 15.4%sy,  0.0%ni, 83.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
 Mem:  99068660k total, 70831760k used, 28236900k free,10036k buffers
 Swap:  2438140k total,  2173652k used,   264488k free,  3396144k cached
 
 With average 'resting cpu' per idle guest 8%, 96 guests running.
 
 Is this as good as I am going to get? It seems like I can't really
 debug that lock contention without blowing stuff up because of the
 load of the lock debugging...
 
 Don't know if I mentioned this before, but those guests are each
 pinned to a cpu (cpu guestnum%48) with taskset.
 
 Cheers,

It seems you hit idr_lock contention (used in kernel/posix-timers.c)





Re: KVM lock contention on 48 core AMD machine

2011-03-21 Thread Eric Dumazet
On Monday, 21 March 2011 at 19:02 +0200, Avi Kivity wrote:

 Any ideas on how to fix it?  We could pre-allocate IDs and batch them in 
 per-cpu caches, but it seems like a lot of work.
 

Hmm, I don't know what syscalls KVM does, but even a timer_gettime() has to
lock this idr_lock.

Sounds crazy; this should be fixed in kernel code, not KVM ;)





[RFC] posix-timers: RCU conversion

2011-03-21 Thread Eric Dumazet
On Monday, 21 March 2011 at 23:57 +0545, Ben Nagy wrote:
 On Mon, Mar 21, 2011 at 10:57 PM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Monday, 21 March 2011 at 19:02 +0200, Avi Kivity wrote:
 
  Any ideas on how to fix it?  We could pre-allocate IDs and batch them in
  per-cpu caches, but it seems like a lot of work.
 
 
  Hmm, I dont know what syscalls kvm do, but even a timer_gettime() has to
  lock this idr_lock.
 
  Sounds crazy, and should be fixed in kernel code, not kvm ;)
 
 Hi,
 
 I'll need to work out a way I can make the perf.data available
 (~100M), but here's the summary
 
 http://paste.ubuntu.com/583425/
 
 And here's the summary of the summary
 
 # Overhead  Command Shared Object
 Symbol
 #   ...  
 ..
 #
 71.86%  kvm  [kernel.kallsyms] [k] __ticket_spin_lock
 |
 --- __ticket_spin_lock
|
|--54.19%-- default_spin_lock_flags
|  _raw_spin_lock_irqsave
|  |
|   --54.14%-- __lock_timer
| |
| |--31.92%-- sys_timer_gettime
| |  
 system_call_fastpath
| |
|  --22.22%-- sys_timer_settime
|
 system_call_fastpath
|
|--15.66%-- _raw_spin_lock
 [...]
 
 Which I guess seems to match what Eric just said.
 
 I'll post a link to the full data tomorrow. Many thanks for the help
 so far. If it's a kernel scaling limit then I guess we just wait until
 the kernel gets better. :S I'll check it out with a linux guest
 tomorrow anyway.

Here is a quick & dirty patch to lower idr_lock use. You could try it
eventually ;)

Thanks

[RFC] posix-timers: RCU conversion

Ben Nagy reported a scalability problem with KVM/QEMU that hit very hard
a single spinlock (idr_lock) in posix-timers code.

Even on a 16 cpu machine (2x4x2), a single test can show 98% of cpu time
used in ticket_spin_lock

Switching to RCU should be quite easy, IDR being already RCU ready.

idr_lock should be locked only before an insert/delete, not a find.

Reported-by: Ben Nagy b...@iagu.net
Signed-off-by: Eric Dumazet eric.duma...@gmail.com
Cc: Avi Kivity a...@redhat.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: John Stultz johns...@us.ibm.com
Cc: Richard Cochran richard.coch...@omicron.at
---
 include/linux/posix-timers.h |1 +
 kernel/posix-timers.c|   26 ++
 2 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index d51243a..5dc27ca 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -81,6 +81,7 @@ struct k_itimer {
unsigned long expires;
} mmtimer;
} it;
+   struct rcu_head rcu;
 };
 
 struct k_clock {
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index 4c01249..e2a823a 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -491,6 +491,11 @@ static struct k_itimer * alloc_posix_timer(void)
return tmr;
 }
 
+static void k_itimer_rcu_free(struct rcu_head *head)
+{
+	kmem_cache_free(posix_timers_cache, container_of(head, struct k_itimer, rcu));
+}
+
 #define IT_ID_SET  1
 #define IT_ID_NOT_SET  0
 static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
@@ -499,11 +504,12 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 		unsigned long flags;
 		spin_lock_irqsave(&idr_lock, flags);
 		idr_remove(&posix_timers_id, tmr->it_id);
+		tmr->it_id = -1;
 		spin_unlock_irqrestore(&idr_lock, flags);
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	kmem_cache_free(posix_timers_cache, tmr);
+	call_rcu(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static struct k_clock *clockid_to_kclock(const clockid_t id)
@@ -631,22 +637,18 @@ out:
 static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
 {
 	struct k_itimer *timr;
-	/*
-	 * Watch out here.  We do a irqsave on the idr_lock and pass the
-	 * flags part over to the timer lock.  Must not let interrupts in
-	 * while we are moving the lock.
-	 */
-	spin_lock_irqsave(&idr_lock, *flags);
+
+	rcu_read_lock();
 	timr = idr_find(&posix_timers_id, (int)timer_id);
 	if (timr) {
-		spin_lock(&timr->it_lock);
-		if (timr->it_signal == current->signal
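
(For context, the completed read side of this conversion ends up looking
roughly like the following -- a sketch of the final function, not the
verbatim hunk:)

static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags)
{
	struct k_itimer *timr;

	rcu_read_lock();
	timr = idr_find(&posix_timers_id, (int)timer_id);
	if (timr) {
		spin_lock_irqsave(&timr->it_lock, *flags);
		/* revalidate under the timer lock: the timer may have been
		 * freed and reused while we were inside the RCU section */
		if (timr->it_signal == current->signal) {
			rcu_read_unlock();
			return timr;
		}
		spin_unlock_irqrestore(&timr->it_lock, *flags);
	}
	rcu_read_unlock();

	return NULL;
}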

Re: [PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()

2011-03-13 Thread Eric Dumazet
Le dimanche 13 mars 2011 à 17:06 +0200, Michael S. Tsirkin a écrit :
 On Mon, Jan 17, 2011 at 04:11:17PM +0800, Jason Wang wrote:
  We can use lock_sock_fast() instead of lock_sock() in order to get
  speedup in peek_head_len().
  
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---
   drivers/vhost/net.c |4 ++--
   1 files changed, 2 insertions(+), 2 deletions(-)
  
  diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
  index c32a2e4..50b622a 100644
  --- a/drivers/vhost/net.c
  +++ b/drivers/vhost/net.c
  @@ -211,12 +211,12 @@ static int peek_head_len(struct sock *sk)
   {
  struct sk_buff *head;
  int len = 0;
  +   bool slow = lock_sock_fast(sk);
   
  -   lock_sock(sk);
 	head = skb_peek(&sk->sk_receive_queue);
 	if (head)
 		len = head->len;
  -   release_sock(sk);
  +   unlock_sock_fast(sk, slow);
  return len;
   }
   
 
 Wanted to apply this, but looking at the code I think the lock_sock here
 is wrong. What we really need is to handle the case where the skb is
 pulled from the receive queue after skb_peek.  However this is not the
 right lock to use for that, sk_receive_queue.lock is.
 So I expect the following is the right way to handle this.
 Comments?
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 0329c41..5720301 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -213,12 +213,13 @@ static int peek_head_len(struct sock *sk)
  {
   struct sk_buff *head;
   int len = 0;
 + unsigned long flags;
  
 - lock_sock(sk);
 +	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
 -	if (head)
 +	if (likely(head))
 		len = head->len;
 -	release_sock(sk);
 +	spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
   return len;
  }
  

You may be right; the only way to be sure is to check the other side.

If it uses skb_queue_tail(), then yes, your patch is fine.

If the other side does not lock the socket, then your patch is a bug fix.
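
For reference, skb_queue_tail() takes the queue spinlock itself, which is
what would make the sk_receive_queue.lock version safe against such
producers (this is the stock helper from net/core/skbuff.c, shown as a
sketch):

void skb_queue_tail(struct sk_buff_head *list, struct sk_buff *newsk)
{
	unsigned long flags;

	spin_lock_irqsave(&list->lock, flags);
	__skb_queue_tail(list, newsk);
	spin_unlock_irqrestore(&list->lock, flags);
}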





Re: [PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()

2011-03-13 Thread Eric Dumazet
Le dimanche 13 mars 2011 à 18:19 +0200, Michael S. Tsirkin a écrit :

 Other side is in drivers/net/tun.c and net/packet/af_packet.c
 At least wrt tun it seems clear socket is not locked.

Yes (assuming you refer to tun_net_xmit())

 Besides queue, dequeue seems to be done without socket locked.
 

It seems this code (assuming you speak of drivers/vhost/net.c ?) has
some races indeed.





Re: [PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()

2011-03-13 Thread Eric Dumazet
Le dimanche 13 mars 2011 à 18:43 +0200, Michael S. Tsirkin a écrit :
 On Sun, Mar 13, 2011 at 05:32:07PM +0100, Eric Dumazet wrote:
  Le dimanche 13 mars 2011 à 18:19 +0200, Michael S. Tsirkin a écrit :
  
   Other side is in drivers/net/tun.c and net/packet/af_packet.c
   At least wrt tun it seems clear socket is not locked.
  
  Yes (assuming you refer to tun_net_xmit())
  
   Besides queue, dequeue seems to be done without socket locked.
   
  
  It seems this code (assuming you speak of drivers/vhost/net.c ?) has
  some races indeed.
  
 
 Hmm. Any more besides the one fixed here?
 

If writers and readers don't share a common lock, how can they reliably
synchronize state?

For example, the check at line 420 seems unsafe or useless.

skb_queue_empty(&sock->sk->sk_receive_queue)
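
If that check really matters, it would have to run under the same queue
lock the producers take; something like this sketch:

unsigned long flags;
int empty;

spin_lock_irqsave(&sock->sk->sk_receive_queue.lock, flags);
empty = skb_queue_empty(&sock->sk->sk_receive_queue);
spin_unlock_irqrestore(&sock->sk->sk_receive_queue.lock, flags);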





Re: Possible netfilter-related memory corruption in 2.6.37

2011-02-18 Thread Eric Dumazet
Le vendredi 18 février 2011 à 19:37 +0100, Patrick McHardy a écrit :
 Am 14.02.2011 17:52, schrieb Patrick McHardy:
  Am 14.02.2011 17:48, schrieb Eric Dumazet:
  I am not sure, but I guess nf_reinject() needs a fix too ;)
  
  I agree. That one looks uglier though, I guess we'll have to
  iterate through all hooks to note the previous one.
 
 How about this? Unfortunately I don't think we can avoid
 iterating through all hooks without violating RCU rules.
 
 

   /* Continue traversal iff userspace said ok... */
if (verdict == NF_REPEAT) {
-		elem = elem->prev;
-		verdict = NF_ACCEPT;
+		prev = NULL;
+		list_for_each_entry_rcu(i, &nf_hooks[entry->pf][entry->hook],
+					list) {
+			if (&i->list == elem)
+				break;
+			prev = i;


Hmm... what happens if elem was the first elem in the list ?

We exit with prev == NULL -> NF_DROP ?

I must be missing something...

+		}
+
+		if (prev == NULL ||
+		    &i->list == &nf_hooks[entry->pf][entry->hook])
+			verdict = NF_DROP;
+		else {
+			elem = &prev->list;
+			verdict = NF_ACCEPT;
+   }
}





Re: Possible netfilter-related memory corruption in 2.6.37

2011-02-14 Thread Eric Dumazet
Le lundi 14 février 2011 à 16:58 +0200, Avi Kivity a écrit :
 We see severe memory corruption in kvm while used in conjunction with 
 bridge/netfilter.  Enabling slab debugging points the finger at a 
 netfilter chain invoked from the bridge code.
 
 Can someone take a look?
 
 https://bugzilla.kernel.org/show_bug.cgi?id=27052
 

CC netdev

Does a revert of commit ca44ac386181ba7 help a bit ?

(net: don't reallocate skb-head unless the current one hasn't the
needed extra size or is shared)







Re: Possible netfilter-related memory corruption in 2.6.37

2011-02-14 Thread Eric Dumazet
Le lundi 14 février 2011 à 16:18 +0100, Jan Engelhardt a écrit :
 On Monday 2011-02-14 16:11, Eric Dumazet wrote:
 
 Le lundi 14 février 2011 à 16:58 +0200, Avi Kivity a écrit :
  We see severe memory corruption in kvm while used in conjunction with 
  bridge/netfilter.  Enabling slab debugging points the finger at a 
  netfilter chain invoked from the bridge code.
  
  Can someone take a look?
  
  https://bugzilla.kernel.org/show_bug.cgi?id=27052
 
 Maybe looks familiar to https://lkml.org/lkml/2011/2/3/147

Are you sure Jan ?

IMHO it looks like in your case a NULL ->hook() is called, from
nf_iterate()

BTW, list_for_each_continue_rcu() really should be converted to 
list_for_each_entry_continue_rcu()

This is a bit ugly :

list_for_each_continue_rcu(*i, head) {
struct nf_hook_ops *elem = (struct nf_hook_ops *)*i;
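
A minimal sketch of the suggested conversion (hook argument list
abbreviated; the typed variant removes the cast entirely):

struct nf_hook_ops *elem = list_entry(*i, struct nf_hook_ops, list);

list_for_each_entry_continue_rcu(elem, head, list) {
	verdict = elem->hook(hook, skb, indev, outdev, okfn);
	if (verdict != NF_ACCEPT)
		break;
}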

Also, I wonder if RCU rules are respected in nf_iterate().
For example this line is really suspicious :

*i = (*i)->prev;





Re: Possible netfilter-related memory corruption in 2.6.37

2011-02-14 Thread Eric Dumazet
Le lundi 14 février 2011 à 17:24 +0100, Patrick McHardy a écrit :
 Am 14.02.2011 16:50, schrieb Eric Dumazet:
  Le lundi 14 février 2011 à 16:18 +0100, Jan Engelhardt a écrit :
  On Monday 2011-02-14 16:11, Eric Dumazet wrote:
 
  Le lundi 14 février 2011 à 16:58 +0200, Avi Kivity a écrit :
  We see severe memory corruption in kvm while used in conjunction with 
  bridge/netfilter.  Enabling slab debugging points the finger at a 
  netfilter chain invoked from the bridge code.
 
  Can someone take a look?
 
  https://bugzilla.kernel.org/show_bug.cgi?id=27052
 
  Maybe looks familiar to https://lkml.org/lkml/2011/2/3/147
  
  Are you sure Jan ?
  
  IMHO it looks like in your case, a NULL -hook() is called, from
  nf_iterate()
  
  BTW, list_for_each_continue_rcu() really should be converted to 
  list_for_each_entry_continue_rcu()
  
  This is a bit ugly :
  
  list_for_each_continue_rcu(*i, head) {
  struct nf_hook_ops *elem = (struct nf_hook_ops *)*i;
  
  Also, I wonder if RCU rules are respected in nf_iterate().
  For example this line is really suspicious :
  
  *i = (*i)-prev;
 
 Yeah, that definitely looks wrong. How about this instead?
 

This patch seems fine to me, thanks !

Acked-by: Eric Dumazet eric.duma...@gmail.com





Re: Possible netfilter-related memory corruption in 2.6.37

2011-02-14 Thread Eric Dumazet
Le lundi 14 février 2011 à 17:37 +0100, Patrick McHardy a écrit :
 Am 14.02.2011 17:29, schrieb Eric Dumazet:
  Le lundi 14 février 2011 à 17:24 +0100, Patrick McHardy a écrit :
  Also, I wonder if RCU rules are respected in nf_iterate().
  For example this line is really suspicious :
 
  *i = (*i)-prev;
 
  Yeah, that definitely looks wrong. How about this instead?
 
  
  This patch seems fine to me, thanks !
  
  Acked-by: Eric Dumazet eric.duma...@gmail.com
 
 THanks Eric, I've queued the patch for 2.6.38.

I am not sure, but I guess nf_reinject() needs a fix too ;)





Re: [PATCH 3/3] vhost-net: use lock_sock_fast() in peek_head_len()

2011-01-17 Thread Eric Dumazet
Le lundi 17 janvier 2011 à 16:11 +0800, Jason Wang a écrit :
 We can use lock_sock_fast() instead of lock_sock() in order to get
 speedup in peek_head_len().
 
 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/vhost/net.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index c32a2e4..50b622a 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -211,12 +211,12 @@ static int peek_head_len(struct sock *sk)
  {
   struct sk_buff *head;
   int len = 0;
 + bool slow = lock_sock_fast(sk);
  
 - lock_sock(sk);
 	head = skb_peek(&sk->sk_receive_queue);
 	if (head)
 		len = head->len;
 - release_sock(sk);
 + unlock_sock_fast(sk, slow);
   return len;
  }
  
Acked-by: Eric Dumazet eric.duma...@gmail.com




Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Eric Dumazet
Le jeudi 06 janvier 2011 à 18:33 +0900, Simon Horman a écrit :
 Hi,
 
 Back in October I reported that I noticed a problem whereby flow control
 breaks down when openvswitch is configured to mirror a port[1].
 
 I have (finally) looked into this further and the problem appears to relate
 to cloning of skbs, as Jesse Gross originally suspected.
 
 More specifically, in do_execute_actions[2] the first n-1 times that an skb
 needs to be transmitted it is cloned first and the final time the original
 skb is used.
 
 In the case that there is only one action, which is the normal case, then
 the original skb will be used. But in the case of mirroring the cloning
 comes into effect. And in my case the cloned skb seems to go to the (slow)
 eth1 interface while the original skb goes to the (fast) dummy0 interface
 that I set up to be a mirror. The result is that dummy0 paces the flow,
 and it's a cracking pace at that.
 
 As an experiment I hacked do_execute_actions() to use the original skb
 for the first action instead of the last one.  In my case the result was
 that eth1 paces the flow, and things work reasonably nicely.
 
 Well, sort of. Things work well for non-GSO skbs but extremely poorly for
 GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
 netserv. I'm unsure why, but I digress.
 
 It seems to me that my hack illustrates the point that the flow ends up
 being paced by one interface. However I think that what would be
 desirable is that the flow is paced by the slowest link. Unfortunately
 I'm unsure how to achieve that.
 

Hi Simon !

pacing is done because the skb is attached to a socket, and a socket has a
limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum
of the truesize of all skbs in flight.

When you enter something that :

1) Get a clone of the skb, queue the clone to device X
2) queue the original skb to device Y

Then: the socket sndbuf is not affected at all by device X's queue.
It is the speed of device Y that matters.

You want to get servo control on both X and Y

You could try to

1) Get a clone of the skb
   Attach it to the socket too (so that the socket gets feedback of the
final orphaning of the clone) with skb_set_owner_w()
   queue the clone to device X

Unfortunately, stacking skb->destructor() makes this possible only for a
known destructor (aka sock_wfree())

 One idea that I had was to skb_get() the original skb each time it is
 cloned - that is easy enough. But unfortunately it seems to me that
 approach would require some sort of callback mechanism in kfree_skb() so
 that the cloned skbs can kfree_skb() the original skb.
 
 Ideas would be greatly appreciated.
 
 [1] 
 http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html
 [2] 
 http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=datapath/actions.c;h=5e16143ca402f7da0ee8fc18ee5eb16c3b7598e6;hb=HEAD
 --




Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Eric Dumazet
Le jeudi 06 janvier 2011 à 21:44 +0900, Simon Horman a écrit :

 Hi Eric !
 
 Thanks for the advice. I had thought about the socket buffer but at some
 point it slipped my mind.
 
 In any case the following patch seems to implement the change that I had in
 mind. However my discussions with Michael Tsirkin elsewhere in this thread are
 beginning to make me think that perhaps this change isn't the
 best solution.
 
 diff --git a/datapath/actions.c b/datapath/actions.c
 index 5e16143..505f13f 100644
 --- a/datapath/actions.c
 +++ b/datapath/actions.c
 @@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, 
 struct sk_buff *skb,
  
   for (a = actions, rem = actions_len; rem  0; a = nla_next(a, rem)) {
   if (prev_port != -1) {
 - do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
 + struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 + if (nskb) {
 +			if (skb->sk)
 +				skb_set_owner_w(nskb, skb->sk);
 + do_output(dp, nskb, prev_port);
 + }
   prev_port = -1;
   }
 
 I got a rather nasty panic without the if (skb->sk);
 I guess some skbs don't have a socket.

Indeed, some packets are not linked to a socket.

(ARP packets for example)

Sorry, I should have mentioned it :)




Re: Certain Mac Address doesn't work in KVM virtual NICs.

2010-12-20 Thread Eric Dumazet
Le mardi 21 décembre 2010 à 13:53 +0800, Yueyu Lin a écrit :
 Hello, everyone
   I just encountered a very strange issue in Ubuntu 10.10 server
 edition X86 64bits.  I just start up my kvm like :
 kvm -net nic,macaddr=B3:C0:64:AF:73:28,model=virtio -net
 tap,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,ifname=tap2
 -enable-kvm -drive file=cassandra.raw,format=raw,aio=native -vnc :2
   This Mac address doesn't work at all. Not working means the
 virtual machine just thinks eth0 is not plugged in. After I
 changed the MAC address, it works as I expected.
   Are there any restrictions on which MAC addresses I can use?


Sure

The first byte must be even.

B3 is odd, so it's not a valid unicast address.

The low-order bit of the first byte is the 'multicast' marker.
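
A quick way to check, mirroring is_multicast_ether_addr() from
linux/etherdevice.h (sketch):

/* the I/G bit: least significant bit of the first octet */
static inline int is_multicast_ether_addr(const u8 *addr)
{
	return addr[0] & 0x01;
}

/* 0xB3 & 0x01 == 1, so B3:C0:64:AF:73:28 is treated as multicast */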




Re: [RFC PATCH V2 5/5] Add TX zero copy in macvtap

2010-12-10 Thread Eric Dumazet
Le vendredi 10 décembre 2010 à 02:13 -0800, Shirley Ma a écrit :

 +	while (len) {
 +		f = &skb_shinfo(skb)->frags[i];
 +		f->page = page[i];
 +		f->page_offset = base & ~PAGE_MASK;
 +		f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
 +		skb->data_len += f->size;
 +		skb->len += f->size;
 +		skb->truesize += f->size;
 +		skb_shinfo(skb)->nr_frags++;
 +		/* increase sk_wmem_alloc */
 +		atomic_add(f->size, &skb->sk->sk_wmem_alloc);
 +		base += f->size;
 +		len -= f->size;
 +		i++;
 +	}

You could make one atomic_add() outside of the loop, and factorize many
things...

atomic_add(len, &skb->sk->sk_wmem_alloc);
skb->data_len += len;
skb->len += len;
skb->truesize += len;
while (len) {
...
}
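
Fleshed out with the names used in the patch, that gives roughly (sketch,
untested):

atomic_add(len, &skb->sk->sk_wmem_alloc);
skb->data_len += len;
skb->len += len;
skb->truesize += len;
while (len) {
	f = &skb_shinfo(skb)->frags[i];
	f->page = page[i];
	f->page_offset = base & ~PAGE_MASK;
	f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
	skb_shinfo(skb)->nr_frags++;
	base += f->size;
	len -= f->size;
	i++;
}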




Re: [RFC PATCH V2 5/5] Add TX zero copy in macvtap

2010-12-10 Thread Eric Dumazet
Le vendredi 10 décembre 2010 à 08:25 -0800, Shirley Ma a écrit :
 On Fri, 2010-12-10 at 11:27 +0100, Eric Dumazet wrote:
  You could make one atomic_add() outside of the loop, and factorize
  many
  things...
  
  atomic_add(len, &skb->sk->sk_wmem_alloc);
  skb->data_len += len;
  skb->len += len;
  skb->truesize += len;
  while (len) {
  ...
  } 
 
 Yep, thanks, will update it!

Also take a look at skb_fill_page_desc() helper, and maybe
skb_add_rx_frag() too.

The atomic op should be factorized for sure, but other adds might be
done by helpers to keep code short.
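
For instance, skb_fill_page_desc() already does the per-frag bookkeeping,
so the loop could shrink to something like this sketch (total being the
byte count saved before the loop):

int total = len;

while (len) {
	int off = base & ~PAGE_MASK;
	int size = min_t(int, len, PAGE_SIZE - off);

	skb_fill_page_desc(skb, i, page[i], off, size);
	base += size;
	len -= size;
	i++;
}
skb->data_len += total;
skb->len += total;
skb->truesize += total;
atomic_add(total, &skb->sk->sk_wmem_alloc);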





Re: 2.6.37-rc2 after KVM shutdown - unregister_netdevice: waiting for vmtst01eth0 to become free. Usage count = 1

2010-11-18 Thread Eric Dumazet
Le jeudi 18 novembre 2010 à 07:28 +0100, Nikola Ciprich a écrit :
  Yep, this is a known problem, thanks !
  
  fix is there : 
  
  http://patchwork.ozlabs.org/patch/71354/
 Thanks Eric, this indeed fixes the problem..
 I noticed the fix didn't make it to 2.6.37-rc2-git3 though,
 maybe it just got omited?
 anyways, thanks for help!
 n.

It's in David Miller's net-2.6 tree (all pending network patches for the
current linux-2.6 version), so it'll be included the next time David pushes
his tree to Linus; don't worry ;)





Re: 2.6.37-rc2 after KVM shutdown - unregister_netdevice: waiting for vmtst01eth0 to become free. Usage count = 1

2010-11-16 Thread Eric Dumazet
Le mardi 16 novembre 2010 à 09:41 +0100, Nikola Ciprich a écrit :
 Hello,
 I just tried running KVM on a 2.6.37-rc2 host, and when trying to shut down a
 guest, it got into uninterruptible state (D) and I'm getting lots of the
 following messages:
 [ 1269.392009] unregister_netdevice: waiting for vmtst01eth0 to become free. 
 Usage count = 1
 So I just want to report..
 n.
 
 

Yep, this is a known problem, thanks !

fix is there : 

http://patchwork.ozlabs.org/patch/71354/





Re: Re:[PATCH v14 06/17] Use callback to deal with skb_release_data() specially.

2010-11-08 Thread Eric Dumazet
Le lundi 08 novembre 2010 à 16:03 +0800, xiaohui@intel.com a écrit :
 From: Xin Xiaohui xiaohui@intel.com
 
  Hmm, I suggest you read the comment two lines above.
 
  If destructor_arg is now cleared each time we allocate a new skb, then,
  please move it before dataref in shinfo structure, so that the following
  memset() does the job efficiently...
 
 
 Something like :
 
 diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
 index e6ba898..2dca504 100644
 --- a/include/linux/skbuff.h
 +++ b/include/linux/skbuff.h
 @@ -195,6 +195,9 @@ struct skb_shared_info {
  __be32  ip6_frag_id;
  __u8tx_flags;
  struct sk_buff  *frag_list;
 +/* Intermediate layers must ensure that destructor_arg
 + * remains valid until skb destructor */
 +void*destructor_arg;
  struct skb_shared_hwtstamps hwtstamps;
 
  /*
 @@ -202,9 +205,6 @@ struct skb_shared_info {
   */
  atomic_tdataref;
 
 -/* Intermediate layers must ensure that destructor_arg
 - * remains valid until skb destructor */
 -void *  destructor_arg;
  /* must be last field, see pskb_expand_head() */
  skb_frag_t  frags[MAX_SKB_FRAGS];
  };
 
 
 
 Will that affect the cache line?

What do you mean ?

 Or, we can move the line to clear destructor_arg to the end of __alloc_skb().
 It looks like as the following, which one do you prefer?
 
 Thanks
 Xiaohui
 
 ---
  net/core/skbuff.c |8 
  1 files changed, 8 insertions(+), 0 deletions(-)
 
 diff --git a/net/core/skbuff.c b/net/core/skbuff.c
 index c83b421..df852f2 100644
 --- a/net/core/skbuff.c
 +++ b/net/core/skbuff.c
 @@ -224,6 +224,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
 +	shinfo->destructor_arg = NULL;
  out:
   return skb;
  nodata:

I don't understand why you want to do this.

This adds an instruction and makes the code bigger, with no obvious gain
on the memory-transaction side.

If integrated in the existing memset(), the cost is one extra iteration to
clear this field.




Re: [PATCH v14 06/17] Use callback to deal with skb_release_data() specially.

2010-11-04 Thread Eric Dumazet
Le jeudi 04 novembre 2010 à 17:05 +0800, xiaohui@intel.com a écrit :
 From: Xin Xiaohui xiaohui@intel.com
 
 If buffer is external, then use the callback to destruct
 buffers.
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com
 ---
  net/core/skbuff.c |8 
  1 files changed, 8 insertions(+), 0 deletions(-)
 
 diff --git a/net/core/skbuff.c b/net/core/skbuff.c
 index c83b421..5e6d69c 100644
 --- a/net/core/skbuff.c
 +++ b/net/core/skbuff.c
 @@ -210,6 +210,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
 +	shinfo->destructor_arg = NULL;

Hmm, I suggest you read the comment two lines above.

If destructor_arg is now cleared each time we allocate a new skb, then,
please move it before dataref in shinfo structure, so that the following
memset() does the job efficiently...

   memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
 	atomic_set(&shinfo->dataref, 1);
  
 @@ -343,6 +344,13 @@ static void skb_release_data(struct sk_buff *skb)
   if (skb_has_frags(skb))
   skb_drop_fraglist(skb);
  
 +	if (skb->dev && dev_is_mpassthru(skb->dev)) {
 +		struct skb_ext_page *ext_page =
 +			skb_shinfo(skb)->destructor_arg;
 +		if (ext_page && ext_page->dtor)
 +			ext_page->dtor(ext_page);
 +	}
 +
 	kfree(skb->head);
   }
  }




Re: [PATCH v14 06/17] Use callback to deal with skb_release_data() specially.

2010-11-04 Thread Eric Dumazet
Le jeudi 04 novembre 2010 à 10:04 +0100, Eric Dumazet a écrit :

 Hmm, I suggest you read the comment two lines above.
 
 If destructor_arg is now cleared each time we allocate a new skb, then,
 please move it before dataref in shinfo structure, so that the following
 memset() does the job efficiently...


Something like :

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e6ba898..2dca504 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -195,6 +195,9 @@ struct skb_shared_info {
__be32  ip6_frag_id;
__u8tx_flags;
struct sk_buff  *frag_list;
+   /* Intermediate layers must ensure that destructor_arg
+* remains valid until skb destructor */
+   void*destructor_arg;
struct skb_shared_hwtstamps hwtstamps;
 
/*
@@ -202,9 +205,6 @@ struct skb_shared_info {
 */
atomic_tdataref;
 
-   /* Intermediate layers must ensure that destructor_arg
-* remains valid until skb destructor */
-   void *  destructor_arg;
/* must be last field, see pskb_expand_head() */
skb_frag_t  frags[MAX_SKB_FRAGS];
 };





Re: [PATCH v12 06/17] Use callback to deal with skb_release_data() specially.

2010-10-11 Thread Eric Dumazet
Le lundi 11 octobre 2010 à 08:27 -0700, David Miller a écrit :
 From: Xin, Xiaohui xiaohui@intel.com
 Date: Mon, 11 Oct 2010 15:06:05 +0800
 
  That's to avoid the new cache miss caused by using destructor_arg in data 
  path
  like skb_release_data().
  That's based on the comment from Eric Dumazet on v7 patches.
 
 Thanks for the explanation.

Anyway, frags[] must be the last field of struct skb_shared_info
since commit fed66381 (net: pskb_expand_head() optimization)

It seems Xin worked on a quite old tree.





Re: [PATCH v12 10/17] Add a hook to intercept external buffers from NIC driver.

2010-09-30 Thread Eric Dumazet
Le jeudi 30 septembre 2010 à 22:04 +0800, xiaohui@intel.com a
écrit :
 From: Xin Xiaohui xiaohui@intel.com
 
 The hook is called in netif_receive_skb().
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com
 ---
  net/core/dev.c |   35 +++
  1 files changed, 35 insertions(+), 0 deletions(-)
 
 diff --git a/net/core/dev.c b/net/core/dev.c
 index c11e32c..83172b8 100644
 --- a/net/core/dev.c
 +++ b/net/core/dev.c
 @@ -2517,6 +2517,37 @@ err:
  EXPORT_SYMBOL(netdev_mp_port_prep);
  #endif
  
 +#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
 +/* Add a hook to intercept mediate passthru(zero-copy) packets,
 + * and insert it to the socket queue owned by mp_port specially.
 + */
 +static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
 +struct packet_type **pt_prev,
 +int *ret,
 +struct net_device *orig_dev)
 +{
 + struct mp_port *mp_port = NULL;
 + struct sock *sk = NULL;
 +
 +	if (!dev_is_mpassthru(skb->dev))
 +		return skb;
 +	mp_port = skb->dev->mp_port;
 +
 +	if (*pt_prev) {
 +		*ret = deliver_skb(skb, *pt_prev, orig_dev);
 +		*pt_prev = NULL;
 +	}
 +
 +	sk = mp_port->sock->sk;
 +	skb_queue_tail(&sk->sk_receive_queue, skb);
 +	sk->sk_state_change(sk);
 +
 + return NULL;
 +}
 +#else
 +#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
 +#endif
 +
  /**
   *   netif_receive_skb - process receive buffer from network
   *   @skb: buffer to process
 @@ -2598,6 +2629,10 @@ int netif_receive_skb(struct sk_buff *skb)
  ncls:
  #endif
  
 + /* To intercept mediate passthru(zero-copy) packets here */
 +	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
 +	if (!skb)
 +		goto out;
 	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
   if (!skb)
   goto out;

This does not apply to current net-next-2.6

We now have dev->rx_handler (currently for bridge or macvlan)


commit ab95bfe01f9872459c8678572ccadbf646badad0
Author: Jiri Pirko jpi...@redhat.com
Date:   Tue Jun 1 21:52:08 2010 +

net: replace hooks in __netif_receive_skb V5

What this patch does is it removes two receive frame hooks (for bridge and for
macvlan) from __netif_receive_skb. These are replaced with a single
hook for both. It only supports one hook per device because it makes no
hook for both. It only supports one hook per device because it makes no
sense to do bridging and macvlan on the same device.

Then a network driver (of virtual netdev like macvlan or bridge) can 
register
an rx_handler for needed net device.

Signed-off-by: Jiri Pirko jpi...@redhat.com
Signed-off-by: Stephen Hemminger shemmin...@vyatta.com
Signed-off-by: David S. Miller da...@davemloft.net
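
Against net-next this would mean dropping the explicit hook and registering
a handler instead, roughly (sketch against the rx_handler API introduced by
the commit above; at that point the handler returned the skb to continue
processing, or NULL once consumed, and registration took just (dev,
handler)):

static struct sk_buff *mp_handle_frame(struct sk_buff *skb)
{
	struct sock *sk = skb->dev->mp_port->sock->sk;

	skb_queue_tail(&sk->sk_receive_queue, skb);
	sk->sk_state_change(sk);
	return NULL;	/* consumed: __netif_receive_skb() is done */
}

/* at attach time, under rtnl_lock(): */
err = netdev_rx_handler_register(dev, mp_handle_frame);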





Re: [RFC PATCH v7 03/19] Export 2 func for device to assign/deassign new strucure

2010-06-05 Thread Eric Dumazet
Le samedi 05 juin 2010 à 18:14 +0800, xiaohui@intel.com a écrit :
 From: Xin Xiaohui xiaohui@intel.com
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com
 ---
  include/linux/netdevice.h |3 +++
  net/core/dev.c|   28 
  2 files changed, 31 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
 index bae725c..efb575a 100644
 --- a/include/linux/netdevice.h
 +++ b/include/linux/netdevice.h
 extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
  extern struct sk_buff *  napi_frags_skb(struct napi_struct *napi);
  extern gro_result_t  napi_gro_frags(struct napi_struct *napi);
 +extern int netdev_mp_port_attach(struct net_device *dev,
 +  struct mpassthru_port *port);
 +extern void netdev_mp_port_detach(struct net_device *dev);
  
  static inline void napi_free_frags(struct napi_struct *napi)
  {
 diff --git a/net/core/dev.c b/net/core/dev.c
 index f769098..ecbb6b1 100644
 --- a/net/core/dev.c
 +++ b/net/core/dev.c
 @@ -2469,6 +2469,34 @@ void netif_nit_deliver(struct sk_buff *skb)
   rcu_read_unlock();
  }
  
 +/* Export two functions to assign/de-assign mp_port pointer
 + * to a net device.
 + */
 +
 +int netdev_mp_port_attach(struct net_device *dev,
 + struct mpassthru_port *port)
 +{
 + /* locked by mp_mutex */
 +	if (rcu_dereference(dev->mp_port))
 + return -EBUSY;
 +

Please... this is bogus...

Try with following config settings :

CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_REPEATEDLY=y




 +	rcu_assign_pointer(dev->mp_port, port);
 +
 + return 0;
 +}
 +EXPORT_SYMBOL(netdev_mp_port_attach);
 +
 +void netdev_mp_port_detach(struct net_device *dev)
 +{
 + /* locked by mp_mutex */
 +	if (!rcu_dereference(dev->mp_port))
 + return;

same problem here

 +
 +	rcu_assign_pointer(dev->mp_port, NULL);
 + synchronize_rcu();
 +}
 +EXPORT_SYMBOL(netdev_mp_port_detach);
 +
  /**
   *   netif_receive_skb - process receive buffer from network
   *   @skb: buffer to process
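
(What the writer side presumably wants to express is an update serialized
by mp_mutex, not an RCU read. In later kernels that is spelled
rcu_dereference_protected(); a sketch, assuming mp_mutex is the serializing
lock:)

int netdev_mp_port_attach(struct net_device *dev,
			  struct mpassthru_port *port)
{
	/* update path, serialized by mp_mutex: no RCU read section needed */
	if (rcu_dereference_protected(dev->mp_port,
				      lockdep_is_held(&mp_mutex)))
		return -EBUSY;

	rcu_assign_pointer(dev->mp_port, port);
	return 0;
}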




Re: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer.

2010-06-05 Thread Eric Dumazet
Le samedi 05 juin 2010 à 18:14 +0800, xiaohui@intel.com a écrit :
 From: Xin Xiaohui xiaohui@intel.com
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
 +	/* Record the external buffer info in this field. It's not so good,
 +	 * but we cannot find another place easily.
 +	 */
 +	shinfo->destructor_arg = ext_page;
 +


Yes, this is a big problem: it's basically touching a cache line that was not
touched before.




Re: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() specially.

2010-06-05 Thread Eric Dumazet
Le samedi 05 juin 2010 à 18:14 +0800, xiaohui@intel.com a écrit :
 From: Xin Xiaohui xiaohui@intel.com
 
 If buffer is external, then use the callback to destruct
 buffers.
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzhao81...@gmail.com
 Reviewed-by: Jeff Dike jd...@linux.intel.com
 ---
  net/core/skbuff.c |   11 +++
  1 files changed, 11 insertions(+), 0 deletions(-)
 
 diff --git a/net/core/skbuff.c b/net/core/skbuff.c
 index 37587f0..418457c 100644
 --- a/net/core/skbuff.c
 +++ b/net/core/skbuff.c
 @@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)
  
  static void skb_release_data(struct sk_buff *skb)
  {
 +	/* check if the skb has external buffers; we use destructor_arg
 +	 * here to indicate that
 +	 */
 +	struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
 +

Oh well. This is v7 of your series, and nobody complained yet ?

This is a new cache miss on a _critical_ path.


 	if (!skb->cloned ||
 	    !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 			       &skb_shinfo(skb)->dataref)) {
 @@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
   if (skb_has_frags(skb))
   skb_drop_fraglist(skb);
  
 +	/* if the skb has external buffers, run the destructor here,
 +	 * since skb->head will be kfreed right after, and a head that
 +	 * came from an external buffer cannot be destroyed with kfree.
 +	 */

Why not defer the access to skb_shinfo(skb)->destructor_arg to this point?

 +	if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
 +		ext_page->dtor(ext_page);
 	kfree(skb->head);
   }
  }

if (dev_is_mpassthru(skb->dev)) {
	struct skb_external_page *ext_page =
		skb_shinfo(skb)->destructor_arg;
	if (ext_page && ext_page->dtor)
		ext_page->dtor(ext_page);
}

destructor_arg should be moved before frags[] if you really want to use it.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bf243fc..b136d90 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -202,10 +202,11 @@ struct skb_shared_info {
 */
atomic_tdataref;
 
-   skb_frag_t  frags[MAX_SKB_FRAGS];
/* Intermediate layers must ensure that destructor_arg
 * remains valid until skb destructor */
void *  destructor_arg;
+
+   skb_frag_t  frags[MAX_SKB_FRAGS];
 };
 
 /* We divide dataref into two halves.  The higher 16 bits hold references





Re: [PATCH] use unfair spinlock when running on hypervisor.

2010-06-01 Thread Eric Dumazet
Le mardi 01 juin 2010 à 19:52 +0300, Avi Kivity a écrit :

 What I'd like to see eventually is a short-term-unfair, long-term-fair 
 spinlock.  Might make sense for bare metal as well.  But it won't be 
 easy to write.
 

This thread rings a bell here :)

Yes, ticket spinlocks are sometimes slower, especially in workloads where
a spinlock needs to be taken several times to handle one unit of work,
with many cpus competing.

We currently have a similar problem in the network stack, and we
have a patch to speed up the xmit path by an order of magnitude, letting one
cpu (the consumer cpu) get unfair access to the (ticket) spinlock.

Boost from ~50.000 to ~600.000 pps on a dual quad core machine (E5450
@3.00GHz) on a particular workload (many cpus want to xmit their
packets)

( patch : http://patchwork.ozlabs.org/patch/53163/ )


It could be possible to write such a generic beast, with a cascade of
regular ticket spinlocks ?

One ticket spinlock at the first stage (taken only if some conditions are met,
aka the slow path), then a 'primary' spinlock at the second stage.


// generic implementation
// (x86 could use 16bit fields for users_in & users_out)
struct cascade_lock {
atomic_tusers_in;
int users_out;
spinlock_t  primlock;
	spinlock_t	slowpathlock; // could be outside of this structure, shared by many 'cascade_locks'
};

/*
 * In the kvm case, you might call the hypervisor when slowpathlock is about
 * to be taken ?
 * When a cascade lock is unlocked and relocked right after, this cpu has
 * unfair priority and could get the lock before cpus blocked in
 * slowpathlock (especially if a hypervisor call was done).
 *
 * In the network xmit path, the dequeue thread would use highprio_user=true
 * mode.
 * In the network xmit path, the 'contended' enqueueing thread would set a
 * negative threshold, to force a 'lowprio_user' mode.
 */
void cascade_lock(struct cascade_lock *l, bool highprio_user, int threshold)
{
bool slowpath = false;

	atomic_inc(&l->users_in); // no real need for atomic_inc_return()
	if (atomic_read(&l->users_in) - l->users_out > threshold &&
	    !highprio_user) {
		spin_lock(&l->slowpathlock);
		slowpath = true;
	}
	spin_lock(&l->primlock);
	if (slowpath)
		spin_unlock(&l->slowpathlock);
}

void cascade_unlock(struct cascade_lock *l)
{
	l->users_out++;
	spin_unlock(&l->primlock);
}





Re: [PATCH 1/3] A device for zero-copy based on KVM virtio-net.

2010-02-10 Thread Eric Dumazet
Le mercredi 10 février 2010 à 19:48 +0800, Xin Xiaohui a écrit :
 Add a device to utilize the vhost-net backend driver for
 copy-less data transfer between guest FE and host NIC.
 It pins the guest user space to the host memory and
 provides proto_ops as sendmsg/recvmsg to vhost-net.
 
 Signed-off-by: Xin Xiaohui xiaohui@intel.com
 Signed-off-by: Zhao Yu yzha...@gmail.com
 Sigend-off-by: Jeff Dike jd...@c2.user-mode-linux.org


 +static int page_ctor_attach(struct mp_struct *mp)
 +{
 + int rc;
 + struct page_ctor *ctor;
 + struct net_device *dev = mp-dev;
 +
 +	rcu_read_lock();
 +	if (rcu_dereference(mp->ctor)) {
 +		rcu_read_unlock();
 +		return -EBUSY;
 +	}
 +	rcu_read_unlock();

Strange read locking here, for an obvious writer role.
What do you really want to do ?
If writers are serialized by mp_mutex, you don't need this
rcu_read_lock()/rcu_read_unlock() stuff.

 +
 +
 +	ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
 +	if (!ctor)
 +		return -ENOMEM;
 +	rc = netdev_page_ctor_prep(dev, &ctor->ctor);
 +	if (rc)
 +		goto fail;
 +
 +	ctor->cache = kmem_cache_create("skb_page_info",
 +					sizeof(struct page_info), 0,
 +					SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);


SLAB_PANIC here means : crash whole system in case of error.
This is not what you want in a driver.

 +
 + if (!ctor-cache)
 + goto cache_fail;
 +
 +	INIT_LIST_HEAD(&ctor->readq);
 +	spin_lock_init(&ctor->read_lock);
 +
 +	ctor->w_len = 0;
 +	ctor->r_len = 0;
 +
 +	dev_hold(dev);
 +	ctor->dev = dev;
 +	ctor->ctor.ctor = page_ctor;
 +	ctor->ctor.sock = &mp->socket;
 +	atomic_set(&ctor->refcnt, 1);
 +
 +	rc = netdev_page_ctor_attach(dev, &ctor->ctor);
 +	if (rc)
 +		goto fail;
 +
 +	/* locked by mp_mutex */
 +	rcu_assign_pointer(mp->ctor, ctor);
 +
 + /* XXX:Need we do set_offload here ? */
 +
 + return 0;
 +
 +fail:
 +	kmem_cache_destroy(ctor->cache);
 +cache_fail:
 + kfree(ctor);
 + dev_put(dev);
 +
 + return rc;
 +}
 +
 +
 +static inline void get_page_ctor(struct page_ctor *ctor)
 +{
 +	atomic_inc(&ctor->refcnt);
 +}
 +
 +static inline void put_page_ctor(struct page_ctor *ctor)
 +{
 +	if (atomic_dec_and_test(&ctor->refcnt))
 +		kfree(ctor);

Are you sure an RCU grace period is not needed before freeing ?


 +
 +static int page_ctor_detach(struct mp_struct *mp)
 +{
 + struct page_ctor *ctor;
 + struct page_info *info;
 + int i;
 +
 + rcu_read_lock();
 +	ctor = rcu_dereference(mp->ctor);
 + rcu_read_unlock();

Strange locking again here


 +
 + if (!ctor)
 + return -ENODEV;
 +
 +	while ((info = info_dequeue(ctor))) {
 +		for (i = 0; i < info->pnum; i++)
 +			if (info->pages[i])
 +				put_page(info->pages[i]);
 +		kmem_cache_free(ctor->cache, info);
 +	}
 +	kmem_cache_destroy(ctor->cache);
 +	netdev_page_ctor_detach(ctor->dev);
 +	dev_put(ctor->dev);
 +
 +	/* locked by mp_mutex */
 +	rcu_assign_pointer(mp->ctor, NULL);
 +	synchronize_rcu();
 +
 +	put_page_ctor(ctor);
 +
 + return 0;
 +}
 +
 +/* For transmits of small user-space buffers, we don't need to call
 + * get_user_pages().
 + */
 +static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
 + int total)
 +{
 +	struct page_info *info = kmem_cache_alloc(ctor->cache, GFP_KERNEL);
kmem_cache_zalloc() ?
 +
 + if (!info)
 + return NULL;
 + memset(info, 0, sizeof(struct page_info));
 +	memset(info->pages, 0, sizeof(info->pages));

redundant memset() whole structure already cleared one line above

 +
 +	info->header = 0;
already cleared
 +	info->total = total;
 +	info->skb = NULL;
already cleared 
 
 +	info->user.dtor = page_dtor;
 +	info->ctor = ctor;
 +	info->flags = INFO_WRITE;
 +	info->pnum = 0;
already cleared 
 
 + return info;
 +}
 +
 +/* The main function to transform the guest user space address
 + * to host kernel address via get_user_pages(). Thus the hardware
 + * can do DMA directly to the user space address.
 + */
 +static struct page_info *alloc_page_info(struct page_ctor *ctor,
 + struct iovec *iov, int count, struct frag *frags,
 + int npages, int total)
 +{
 + int rc;
 + int i, j, n = 0;
 + int len;
 + unsigned long base;
 +	struct page_info *info = kmem_cache_alloc(ctor->cache, GFP_KERNEL);
kmem_cache_zalloc() ? 
 
 +
 + if (!info)
 + return NULL;
 +	memset(info, 0, sizeof(struct page_info));
kmem_cache_zalloc() ?
 +	memset(info->pages, 0, sizeof(info->pages));
already cleared 
 
 +
 +	down_read(&current->mm->mmap_sem);
 +	for (i = j = 0; i < count; i++) {
 +		base = (unsigned long)iov[i].iov_base;
 +		len = iov[i].iov_len;
 +
 +		if (!len)
 +  
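
To summarize the allocation comments above: kmem_cache_zalloc() returns
zeroed memory, so the open-coded memset()s and the redundant field stores
collapse to something like this sketch:

struct page_info *info = kmem_cache_zalloc(ctor->cache, GFP_KERNEL);

if (!info)
	return NULL;
/* every field is already zero; only set what differs */
info->total = total;
info->user.dtor = page_dtor;
info->ctor = ctor;
info->flags = INFO_WRITE;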

Re: Networking-related crash?

2009-12-09 Thread Eric Dumazet
Le 09/12/2009 16:11, Avi Kivity a écrit :
 On 12/09/2009 03:46 PM, Adam Huffman wrote:
 I've been seeing lots of crashes on a new Dell Precision T7500,
 running the KVM in Fedora 12.  Finally managed to capture an Oops,
 which is shown below (hand-transcribed):

 BUG: unable to handle kernel paging request at 00200200
 IP: [8139aab7] destroy_conntrack+0x82/0x11f
 PGD 332d0e067 PUD 33453c067 PMD 0
 Oops: 0002 [#1] SMP
 last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
 CPU 4
 Modules linked in: tun bridge stp llc sunrpc ipt_MASQUERADE
 iptable_nat nf_nat ipt_LOG xt_physdev ip6t_REJECT nf_conntrack_ipv6
 ip6table_filter ip6
 _tables ipv6 dm_multipath kvm_intel kvm uinput snd_hda_codec_analog
 nouveau snd_hda_intel snd_hda_codec ttm drm_kms_helper snd_hwdep
 snd_seq drm sn
 d_seq_device snd_pcm firewire_ohci i2c_i801 snd_timer ppdev
 firewire_core snd i2c_algo_bit iTCO_wdt crc_itu_t parport_pc i2c_core
 soundcore parport
   iTCO_vendor_support tg3 snd_page_alloc shpchp dcdbas wmi mptsas
 mptscsih mptbase scsi_transport_sas megaraid_sas [last_unloaded:
 speedstep_lib]
 Pid: 1759, comm: qemu-kvm Not tainted 2.6.31.6-162.fc12.x86_64 #1
 Precision WorkStation T7500
 RIP: 0010:[8139aab7]  [8139aab7]
 destroy_conntrack+0x82/0x11f
 RSP: 0018:c9803bf0  EFLAGS: 00010202
 RAX: 8001 RBX: 816fb1a0 RCX: 752f
 RDX: 00200200 RSI: 0011 RDI: 816fb1a0
 RBP: c9803c00 R08: 880336699438 R09: 00aaa5e0
 R10: 0002f54189d5 R11: 0001 R12: 819a92e0
 R13: a029adcc R14:  R15: 880632866c38
 FS:  7fdd34b17710() GS:c980()
 knlGS:
 CS:  0010 DS: 002B ES: 002B CR0: 80050033
 CR2: 00200200 CR3: 0003349c CR4: 26e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process qemu-kvm (pid: 1759, threadinfo 88062e9e8000, task
 880634945e00)
 Stack:
   880632866c00 880634640c30 c9803c10 813989c2
 0  c9803c30 81374092 c9803c30 880632866c00
 0  c9803c50 81373dd3 0002 880632866c00
 Call Trace:
   IRQ
   [813989c2] nf_conntrack_destroy+0x1b/0x1d
   [81374092] skb_release_head_state+0x95/0xd7
   [81373dd3] __kfree_skb+0x16/0x81
   [81373ed7] kfree_skb+0x6a/0x72
   [a029adcc] ip6_mc_input+0x220/0x230 [ipv6]
   [a029a3d1] ip6_rcv_finish+0x27/0x2b [ipv6]
   [a029a763] ipv6_rcv+0x38e/0x3e5 [ipv6]
   [8137bd91] netif_receive_skb+0x402/0x427
   [8137bf1b] napi_skb_finish+0x29/0x3d
   [8137c37a] napi_gro_receive+0x2f/0x34
   [a0084fad] tg3_poll+0x6c6/0x8c3 [tg3]
   [8137c4b0] net_rx_action+0xaf/0x1c9
   [81379cfe] ? list_add_tail+0x15/0x17
   [81057614] __do_softirq+0xdd/0x1ad
   [81026936] ? apic_write+0x16/0x18
   [81012eac] call_softirq+0x1c/0x30
   [810143fb] do_softirq+0x47/0x8d
   [81057326] irq_exit+0x44/0x86
   [8141ecd5] do_IRQ+0xa5/0xbc
   [810126d3] ret_from_intr+0x0/0x11
   EOI
   [a02437bb] ? kvm_arch_vcpu_ioctl_run+0x84b/0xb34 [kvm]
   [a02437aa] ? kvm_arch_vcpu_ioctl_run+0x83a/0xb34 [kvm]
   [a02395e3] ? kvm_vcpu_ioctl+0xfd/0x556 [kvm]
   [81108adc] ? vfs_ioctl+0x22/0x87
   [81109038] ? do_vfs_ioctl+0x47b/0x4c1
   [811090d4] ? sys_ioctl+0x56/0x79
   [81012093] ? stub_clone+0x13/0x20
   [81011cf2] ? system_call_fastpath+0x16/0x1b
 Code: c7 00 a6 9a 81 e8 23 04 08 00 48 89 df e8 68 29 00 00 f6 43 78
 08 75 24 48 8b 53 10 48 85 d2 75 04 0f 0b eb fe 48 8b 43 08 a8 01 <48>
 89 02 75 04 48 89 50 08 48 c7 43 10 00 02 20 00 65 8b 14 25
 RIP  [8139aab7] destroy_conntrack+0x82/0x11f
   RSP <c9803bf0>
 CR2: 00200200

 
 Looks unrelated to kvm - softirq happened to trigger during a kvm
 ioctl.  Fault looks like list poison.  Copying netdev.
 

crash in :
48 8b 43 08             mov    0x8(%rbx),%rax
a8 01                   test   $0x1,%al
48 89 02                mov    %rax,(%rdx)        <== HERE  RDX=0x200200 (LIST_POISON2)
75 04                   jne    1f
48 89 50 08             mov    %rdx,0x8(%rax)
1: 48 c7 43 10 00 02 20 00     movq   $0x200200,0x10(%rbx)

if (!nf_ct_is_confirmed(ct)) {
	BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);  <== HERE
}
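
The LIST_POISON2 in RDX fits a double deletion: hlist_nulls deletion leaves
pprev poisoned, so unhashing the same tuplehash twice writes through
0x200200. Sketch of the stock helpers from include/linux/list_nulls.h:

static inline void __hlist_nulls_del(struct hlist_nulls_node *n)
{
	struct hlist_nulls_node *next = n->next;
	struct hlist_nulls_node **pprev = n->pprev;

	*pprev = next;			/* faults here if pprev == LIST_POISON2 */
	if (!is_a_nulls(next))
		next->pprev = pprev;
}

static inline void hlist_nulls_del(struct hlist_nulls_node *n)
{
	__hlist_nulls_del(n);
	n->pprev = LIST_POISON2;	/* poison left by the first deletion */
}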
NF_CT_STAT_INC(net, delete); 


Re: [PATCH 1/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net

2009-11-19 Thread Eric Dumazet
Shirley Ma a écrit :
 This patch is generated against the 2.6 git tree. I didn't break up this
 patch since it implements a single piece of functionality. Please review it.
 
 Thanks
 Shirley
 
 Signed-off-by: Shirley Ma x...@us.ibm.com
 --
  
 +void virtio_free_pages(void *buf)
 +{
 + struct page *page = (struct page *)buf;
 +
 +	while (page) {
 +		__free_pages(page, 0);
 +		page = (struct page *)page->private;
 + }
 +}
 +

Interesting use after free :)
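
(page->private is read after the page has been freed.) A fixed version
would save the link first; sketch:

void virtio_free_pages(void *buf)
{
	struct page *page = buf;
	struct page *next;

	while (page) {
		next = (struct page *)page->private; /* read the link before freeing */
		__free_pages(page, 0);
		page = next;
	}
}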


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-04 Thread Eric Dumazet
Paul E. McKenney a écrit :
 
 (Sorry, but, as always, I could not resist!)

Yes :)

Thanks Paul for this masterpiece of diplomatic Acked-by ;)





Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-03 Thread Eric Dumazet
Michael S. Tsirkin a écrit :
 +static void handle_tx(struct vhost_net *net)
 +{
 +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 +	unsigned head, out, in, s;
 +	struct msghdr msg = {
 +		.msg_name = NULL,
 +		.msg_namelen = 0,
 +		.msg_control = NULL,
 +		.msg_controllen = 0,
 +		.msg_iov = vq->iov,
 +		.msg_flags = MSG_DONTWAIT,
 +	};
 +	size_t len, total_len = 0;
 +	int err, wmem;
 +	size_t hdr_size;
 +	struct socket *sock = rcu_dereference(vq->private_data);
 +	if (!sock)
 +		return;
 +
 +	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 +	if (wmem >= sock->sk->sk_sndbuf)
 +		return;
 +
 +	use_mm(net->dev.mm);
 +	mutex_lock(&vq->mutex);
 +	vhost_no_notify(vq);
 +

Using rcu_dereference() and mutex_lock() at the same time seems wrong; I
suspect that your use of RCU is not correct.

1) rcu_dereference() should be done inside an rcu_read_lock() section, and
   we are not allowed to sleep in such a section.
   (Quoting Documentation/RCU/whatisRCU.txt :
 It is illegal to block while in an RCU read-side critical section, )

2) mutex_lock() can sleep (ie block)
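
The conventional shape of a read-side section, for comparison (sketch; only
non-blocking work inside, and a reference must be taken before leaving the
section if the object is used afterwards):

struct socket *sock;

rcu_read_lock();
sock = rcu_dereference(vq->private_data);
if (sock)
	sock_hold(sock->sk);	/* pin it before leaving the section */
rcu_read_unlock();

/* blocking work (mutex_lock(), sendmsg(), ...) only after the unlock */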



Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-03 Thread Eric Dumazet
Gregory Haskins a écrit :
 Gregory Haskins wrote:
 Eric Dumazet wrote:
 Michael S. Tsirkin a écrit :
 +static void handle_tx(struct vhost_net *net)
 +{
 +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 +	unsigned head, out, in, s;
 +	struct msghdr msg = {
 +		.msg_name = NULL,
 +		.msg_namelen = 0,
 +		.msg_control = NULL,
 +		.msg_controllen = 0,
 +		.msg_iov = vq->iov,
 +		.msg_flags = MSG_DONTWAIT,
 +	};
 +	size_t len, total_len = 0;
 +	int err, wmem;
 +	size_t hdr_size;
 +	struct socket *sock = rcu_dereference(vq->private_data);
 +	if (!sock)
 +		return;
 +
 +	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 +	if (wmem >= sock->sk->sk_sndbuf)
 +		return;
 +
 +	use_mm(net->dev.mm);
 +	mutex_lock(&vq->mutex);
 +  vhost_no_notify(vq);
 +
 Using rcu_dereference() and mutex_lock() at the same time seems wrong; I
 suspect that your use of RCU is not correct.

 1) rcu_dereference() should be done inside an rcu_read_lock() section, and
we are not allowed to sleep in such a section.
(Quoting Documentation/RCU/whatisRCU.txt :
  It is illegal to block while in an RCU read-side critical section, )

 2) mutex_lock() can sleep (ie block)


 Michael,
   I warned you that this needed better documentation ;)

 Eric,
   I think I flagged this once before, but Michael convinced me that it
 was indeed ok, if perhaps a bit unconventional.  I will try to
 find the thread.

 Kind Regards,
 -Greg

 
 Here it is:
 
 http://lkml.org/lkml/2009/8/12/173
 

Yes, this doesn't convince me at all, and could set a precedent for wrong RCU
use.
People wanting to use RCU grep the kernel sources to find out how to use RCU
correctly.

Michael, please use existing locking/barrier mechanisms, and do not pretend to
use RCU.

Some automatic tools might barf later.

For example, we could add a debugging facility to check that rcu_dereference() 
is used
in an appropriate context, ie conflict with existing mutex_lock() debugging 
facility.




Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-03 Thread Eric Dumazet
Michael S. Tsirkin a écrit :
 
 Paul, you acked this previously. Should I add your Acked-by line so
 people calm down?  If you would rather I replace
 rcu_dereference/rcu_assign_pointer with rmb/wmb, I can do this.
 Or maybe patch Documentation to explain this RCU usage?
 

So you believe I am over-reacting to this dubious use of RCU ?

RCU documentation is already very complex; we don't need to add yet another
subtle use and make it less readable.

It seems you use 'RCU api' in drivers/vhost/net.c as convenient macros :

#define rcu_dereference(p) ({ \
typeof(p) _p1 = ACCESS_ONCE(p); \
smp_read_barrier_depends(); \
(_p1); \
})

#define rcu_assign_pointer(p, v) \
({ \
if (!__builtin_constant_p(v) || \
((v) != NULL)) \
smp_wmb(); \
(p) = (v); \
})


There are plenty of regular uses of smp_wmb() in the kernel, not related to
Read-Copy-Update; there is nothing wrong with using barriers with appropriate
comments.

(And you already use mb(), wmb(), rmb(), smp_wmb() in your patch)


BTW there is at least one locking bug in vhost_net_set_features()

Apparently, mutex_unlock() doesn't trigger a fault if the mutex is not locked
by the current thread... even with DEBUG_MUTEXES / DEBUG_LOCK_ALLOC


static void vhost_net_set_features(struct vhost_net *n, u64 features)
{
	size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
		sizeof(struct virtio_net_hdr) : 0;
	int i;
!	mutex_unlock(&n->dev.mutex);
	n->dev.acked_features = features;
	smp_wmb();
	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
		mutex_lock(&n->vqs[i].mutex);
		n->vqs[i].hdr_size = hdr_size;
		mutex_unlock(&n->vqs[i].mutex);
	}
	mutex_unlock(&n->dev.mutex);
	vhost_net_flush(n);
}
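
i.e. the first unlock was presumably meant to be a lock; the fixed shape
would be (sketch, only the changed line shown in context):

static void vhost_net_set_features(struct vhost_net *n, u64 features)
{
	...
	mutex_lock(&n->dev.mutex);	/* was: mutex_unlock() */
	n->dev.acked_features = features;
	smp_wmb();
	...
	mutex_unlock(&n->dev.mutex);
	vhost_net_flush(n);
}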