Re: IPv4 BUG: held lock freed!

2012-08-19 Thread Eric Dumazet
On Sun, 2012-08-19 at 22:15 +0800, Lin Ming wrote:

> Will it still have a problem if the code gets here without sock_hold(sk)?

Not sure of what you mean.

At the time tcp_write_timer() runs, we own one reference on the socket.
(this reference was taken in sk_reset_timer())

On old kernels, if we found the socket locked by the user, we used to
rearm the timer for a 50ms delay (and thus did sock_hold() again)

Another way to avoid the bug would be to make sure sk_reset_timer()
increases the refcount _before_ setting the timer, but that adds one atomic
operation in the fast path...

diff --git a/net/core/sock.c b/net/core/sock.c
index 8f67ced..d1745b7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2114,8 +2114,9 @@ EXPORT_SYMBOL(sk_send_sigurg);
 void sk_reset_timer(struct sock *sk, struct timer_list* timer,
		    unsigned long expires)
 {
-	if (!mod_timer(timer, expires))
-		sock_hold(sk);
+	sock_hold(sk);
+	if (mod_timer(timer, expires))
+		__sock_put(sk);
 }
 EXPORT_SYMBOL(sk_reset_timer);
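
For illustration, here is my reading of the window the reordering closes (a
sketch of the reasoning, not code from the tree):

/*
 * Old code:
 *	if (!mod_timer(timer, expires))	// timer was not pending: arm it
 *		sock_hold(sk);		// take the ref the handler will put
 *
 * Between mod_timer() arming the timer and sock_hold() running, the timer
 * can fire on another cpu. Its handler then does sock_put() for a reference
 * that has not been taken yet, the refcount can hit zero, the socket is
 * freed, and the late sock_hold() touches freed memory.
 *
 * The new code takes the reference first and drops it with __sock_put()
 * only when mod_timer() reports the timer was already pending (a reference
 * is already held for it), so the handler always owns a valid reference.
 */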
 




Re: IPv4 BUG: held lock freed!

2012-08-19 Thread Eric Dumazet
On Sun, 2012-08-19 at 23:05 +0800, Lin Ming wrote:
> On Sun, Aug 19, 2012 at 10:45 PM, Eric Dumazet  wrote:
> > On Sun, 2012-08-19 at 22:15 +0800, Lin Ming wrote:
> >
> >> Will it still have a problem if the code gets here without sock_hold(sk)?
> >
> > Not sure of what you mean.
> 
> See my comments in the function.
> Is that a potential problem?
> 

No problem.

It has always been like that. That's the whole point of having a refcount
in the first place.

The last sock_put(sk) should free the socket.
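
For reference, the put side of the pattern looks essentially like this (a
simplified sketch from memory, not a quote of a particular tree):

/* sock_put() drops one reference and frees the socket once the count
 * reaches zero; every sock_hold() must be balanced by such a put.
 */
static inline void sock_put(struct sock *sk)
{
	if (atomic_dec_and_test(&sk->sk_refcnt))
		sk_free(sk);
}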





Re: regression with poll(2)

2012-08-20 Thread Eric Dumazet
On Sun, 2012-08-19 at 11:49 -0700, Sage Weil wrote:
> I've bisected and identified this commit:
> 
> netvm: propagate page->pfmemalloc to skb
> 
> The skb->pfmemalloc flag gets set to true iff during the slab allocation
> of data in __alloc_skb that the PFMEMALLOC reserves were used.  If the
> packet is fragmented, it is possible that pages will be allocated from the
> PFMEMALLOC reserve without propagating this information to the skb.  This
> patch propagates page->pfmemalloc from pages allocated for fragments to
> the skb.
> 
> Signed-off-by: Mel Gorman 
> Acked-by: David S. Miller 
> Cc: Neil Brown 
> Cc: Peter Zijlstra 
>     Cc: Mike Christie 
> Cc: Eric B Munson 
> Cc: Eric Dumazet 
> Cc: Sebastian Andrzej Siewior 
> Cc: Mel Gorman 
> Cc: Christoph Lameter 
> Signed-off-by: Andrew Morton 
> Signed-off-by: Linus Torvalds 
> 
> I've retested several times and confirmed that this change leads to the 
> breakage, and also confirmed that reverting it on top of -rc1 also fixes 
> the problem.
> 
> I've also added some additional instrumentation to my code and confirmed 
> that the process is blocking on poll(2) while netstat is reporting 
> data available on the socket.
> 
> What can I do to help track this down?
> 
> Thanks!
> sage
> 
> 
> On Wed, 15 Aug 2012, Sage Weil wrote:
> 
> > I'm experiencing a stall with Ceph daemons communicating over TCP that 
> > occurs reliably with 3.6-rc1 (and linus/master) but not 3.5.  The basic 
> > situation is:
> > 
> >  - the socket is two processes communicating over TCP on the same host, 
> > e.g. 
> > 
> > tcp0 2164849 10.214.132.38:6801  10.214.132.38:51729 
> > ESTABLISHED
> > 
> >  - one end writes a bunch of data in
> >  - the other end consumes data, but at some point stalls.
> >  - reads are nonblocking, e.g.
> > 
> >   int got = ::recv( sd, buf, len, MSG_DONTWAIT );
> > 
> >  and between those calls we wait with
> > 
> >   struct pollfd pfd;
> >   short evmask;
> >   pfd.fd = sd;
> >   pfd.events = POLLIN;
> > #if defined(__linux__)
> >   pfd.events |= POLLRDHUP;
> > #endif
> > 
> >   if (poll(&pfd, 1, msgr->timeout) <= 0)
> > return -1;
> > 
> >  - in my case the timeout is ~15 minutes.  at that point it errors out, 
> > and the daemons reconnect and continue for a while until hitting this 
> > again.
> > 
> >  - at the time of the stall, the reading process is blocked on that 
> > poll(2) call.  There are a bunch of threads stuck on poll(2), some of them 
> > stuck and some not, but they all have stacks like
> > 
> > [] poll_schedule_timeout+0x49/0x70
> > [] do_sys_poll+0x35f/0x4c0
> > [] sys_poll+0x6b/0x100
> > [] system_call_fastpath+0x16/0x1b
> > 
> >  - you'll note that the netstat output shows data queued:
> > 
> > tcp0 1163264 10.214.132.36:6807  10.214.132.36:41738 
> > ESTABLISHED
> > tcp0 1622016 10.214.132.36:41738 10.214.132.36:6807  
> > ESTABLISHED
> > 

In this netstat output, we can see some data in the output queues, but no
data in the receive queues, so poll() is behaving correctly.

Some TCP frames are not properly delivered, even after a retransmit.

(To see useful stats/counters: ss -emoi dst 10.214.132.36)

For loopback transmits, skbs are taken from the output queue, cloned and
fed to the local stack.

If they have the pfmemalloc bit set, they won't be delivered to normal
sockets, but dropped.

tcp_sendmsg() seems to be able to queue skbs with pfmemalloc set to
true, and this makes no sense to me.
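
(For context, the drop happens at socket receive time with a check of roughly
this shape -- a sketch, not necessarily the exact code or location;
skb_pfmemalloc() and SOCK_MEMALLOC come from the same netvm series:)

	/* A pfmemalloc skb may only be consumed by sockets that are
	 * themselves allowed to use the memory reserves; for ordinary
	 * sockets it is dropped.
	 */
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
		return -ENOMEM;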





Re: regression with poll(2)

2012-08-20 Thread Eric Dumazet
On Mon, 2012-08-20 at 10:04 +0100, Mel Gorman wrote:

> Can the following patch be tested please? It is reported to fix an fio
> regression that may be similar to what you are experiencing but has not
> been picked up yet.
> 
> -

This seems to help here.

Boot your machine with "mem=768M" or a bit less depending on your setup,
and try a netperf.

-> before patch :

# netperf
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
localhost.localdomain () port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    14.00        6.05

-> after patch :

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    18509.73




Re: How does ext2 implement sparse files?

2008-02-01 Thread Eric Dumazet

Shuduo Sang wrote:

On Feb 1, 2008 2:14 AM, Andi Kleen <[EMAIL PROTECTED]> wrote:
  

Lars Noschinski <[EMAIL PROTECTED]> writes:



For a university project, we had to write a toy filesystem (ext2-like),
for which I would like to implement sparse file support. For this, I
dug through the ext2 source code, but I could not find the point
where ext2 detects holes.

As far as I can see from fs/buffer.c, a hole is a buffer_head which is
not mapped, but uptodate. But I cannot find a relevant source line
where ext2 makes use of this information.
  

It does not explicitly detect holes; holey data is just never written,
so no space for it is allocated.




Does anybody know how to make a hole in a large file which already has
real content, from a user space application?
In my project I need this to delete a piece of content from
an existing large file efficiently.
Thanks.

  
Some OSes use fcntl() F_FREESP/F_FREESP64 to free already-allocated
space in files (i.e. make holes, if supported by the underlying fs).


AFAIK, Linux can generically do this only at the end of a file
(ftruncate()), not at arbitrary offsets.


XFS has special support for FREESP (it comes from IRIX), implemented as
an ioctl().


Check for XFS_IOC_FREESP and XFS_IOC_UNRESVSP in fs/xfs/xfs_fs.h
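
(For illustration, a minimal user-space sketch of punching a hole with that
ioctl; it assumes an XFS file descriptor and that struct xfs_flock64 and
XFS_IOC_UNRESVSP64 are available from the installed XFS headers:)

#include <sys/types.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs.h>		/* struct xfs_flock64, XFS_IOC_UNRESVSP64 */

/* free the already-allocated blocks backing [offset, offset + len) */
static int punch_hole(int fd, off_t offset, off_t len)
{
	struct xfs_flock64 fl = {
		.l_whence = SEEK_SET,
		.l_start  = offset,
		.l_len    = len,
	};

	return ioctl(fd, XFS_IOC_UNRESVSP64, &fl);
}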







Re: SLUB: Support for statistics to help analyze allocator behavior

2008-02-04 Thread Eric Dumazet

Pekka J Enberg wrote:

Hi Christoph,

On Mon, 4 Feb 2008, Christoph Lameter wrote:

The statistics provided here allow the monitoring of allocator behavior
at the cost of some (minimal) loss of performance. Counters are placed in
SLUB's per cpu data structure that is already written to by other code.


Looks good but I am wondering if we want to make the statistics per-CPU so 
that we can see the kmalloc/kfree ping-pong of, for example, hackbench 
better?


AFAIK Christoph's patch already has per-cpu statistics :)


+#define STAT_ATTR(si, text)					\
+static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
+{								\
+	unsigned long sum  = 0;					\
+	int cpu;						\
+								\
+	for_each_online_cpu(cpu)				\
+		sum += get_cpu_slab(s, cpu)->stat[si];		\
+	return sprintf(buf, "%lu\n", sum);			\
+}								\



Re: SLUB: Support for statistics to help analyze allocator behavior

2008-02-05 Thread Eric Dumazet
On Tue, 5 Feb 2008 10:08:00 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> On Tue, 5 Feb 2008, Pekka J Enberg wrote:
> 
> > Heh, sure, but it's not exported to userspace which is required for 
> > slabinfo to display the statistics.
> 
> Well we could do the same as for numa stats. Output the global count and 
> then add
> 
> c=count
> 

Yes, or the reverse, to avoid two loops and possible sum errors (the sum of
the c=count values differing from the global count).

Since text##_show is going to be too big, you could use one function instead
of several?

(And char *buf is PAGE_SIZE, so you should add a limit?)

Note I used for_each_possible_cpu() here instead of the 'online' variant,
otherwise stats might be corrupted when a cpu goes offline.

static ssize_t text_show(struct kmem_cache *s, char *buf, unsigned int si)
{
	unsigned long val, sum = 0;
	int cpu;
	size_t off = 0;
	size_t buflen = PAGE_SIZE;

	for_each_possible_cpu(cpu) {
		val = get_cpu_slab(s, cpu)->stat[si];
#ifdef CONFIG_SMP
		if (val)
			off += snprintf(buf + off, buflen - off, "c%d=%lu ",
					cpu, val);
#endif
		sum += val;
	}
	off += snprintf(buf + off, buflen - off, "%lu\n", sum);
	return off;
}




Re: [PATCH 2/4] x86 mmiotrace: fix relay-buffer-full flag for SMP

2008-02-05 Thread Eric Dumazet

Pekka Paalanen wrote:

Relay has per-cpu buffers, but mmiotrace was using only a single flag
for detecting buffer full/not-full transitions. The new code makes
this per-cpu and actually counts missed events.

Signed-off-by: Pekka Paalanen <[EMAIL PROTECTED]>
---
 arch/x86/kernel/mmiotrace/mmio-mod.c |   26 --
 1 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/mmiotrace/mmio-mod.c 
b/arch/x86/kernel/mmiotrace/mmio-mod.c
index 82ae920..f492b65 100644
--- a/arch/x86/kernel/mmiotrace/mmio-mod.c
+++ b/arch/x86/kernel/mmiotrace/mmio-mod.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include  /* for ISA_START_ADDRESS */
+#include 
 
 #include "kmmio.h"

 #include "pf_in.h"
@@ -47,9 +48,13 @@ struct trap_reason {
int active_traces;
 };
 
+/* Accessed per-cpu. */

 static struct trap_reason pf_reason[NR_CPUS];
 static struct mm_io_header_rw cpu_trace[NR_CPUS];
 
+/* Access to this is not per-cpu. */

+static atomic_t dropped[NR_CPUS];
+


Please don't introduce new NR_CPUS-sized arrays, since people are working hard
to remove them from the kernel.


You could probably use a per_cpu variable?

Thank you



Re: [patch] x86: add code to dump the (kernel) page tables for visual inspection

2008-02-05 Thread Eric Dumazet

Arjan van de Ven wrote:

Subject: x86: add code to dump the (kernel) page tables for visual inspection 
by kernel developers
From: Arjan van de Ven <[EMAIL PROTECTED]>

This patch adds code to the kernel to have an (optional)
/proc/kernel_page_tables debug file that basically dumps the kernel
pagetables; this allows us kernel developers to verify that nothing fishy is
going on and that the various mappings are set up correctly. This was quite
useful in finding various change_page_attr() bugs, and is very likely to be
useful in the future as well.



Seems nice, Arjan, but could we also add NUMA information?



Re: [PATCH 2/4] x86 mmiotrace: fix relay-buffer-full flag for SMP

2008-02-05 Thread Eric Dumazet

Pekka Paalanen wrote:

On Tue, 05 Feb 2008 21:44:07 +0100
Eric Dumazet <[EMAIL PROTECTED]> wrote:


Pekka Paalanen wrote:

diff --git a/arch/x86/kernel/mmiotrace/mmio-mod.c 
b/arch/x86/kernel/mmiotrace/mmio-mod.c
index 82ae920..f492b65 100644
--- a/arch/x86/kernel/mmiotrace/mmio-mod.c
+++ b/arch/x86/kernel/mmiotrace/mmio-mod.c
@@ -47,9 +48,13 @@ struct trap_reason {
int active_traces;
 };
 
+/* Accessed per-cpu. */

 static struct trap_reason pf_reason[NR_CPUS];
 static struct mm_io_header_rw cpu_trace[NR_CPUS];
 
+/* Access to this is not per-cpu. */

+static atomic_t dropped[NR_CPUS];
+
Please don't introduce new NR_CPUS-sized arrays, since people are working hard
to remove them from the kernel.


You could probably use a per_cpu variable?


Yes, it would probably be more appropriate to use DEFINE_PER_CPU()
for 'pf_reason' and 'cpu_trace', but I wasn't sure since the examples
of DEFINE_PER_CPU I saw always had integers or pointers, not
whole structs. Is it okay for whole structs?


Yes, you can use a structure; check for example:

net/ipv4/route.c:static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);



'dropped' on the other hand is not accessed in per-cpu style, any cpu
may access any element. DEFINE_PER_CPU is not valid here, is it?


It is valid; you can use the per_cpu() accessor to reach a particular cpu's
data.


Check net/ipv4/route.c for an example:

return &per_cpu(rt_cache_stat, cpu);
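
(A minimal sketch of both points, reusing the names from the patch under
discussion; 'pf_reason' is the array being replaced:)

static DEFINE_PER_CPU(struct trap_reason, pf_reason);	/* whole struct is fine */

	/* per-cpu style access from the owning cpu (preemption disabled) */
	struct trap_reason *mine = &__get_cpu_var(pf_reason);

	/* and any cpu can still reach another cpu's copy */
	struct trap_reason *other = &per_cpu(pf_reason, cpu);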



Re: SLUB: statistics improvements

2008-02-06 Thread Eric Dumazet

Christoph Lameter wrote:

SLUB: statistics improvements

- Fix indentation in unfreeze_slab

- FREE_SLAB/ALLOC_SLAB counters were slightly misplaced and counted
  even if the slab was kept because we were below the minimum of
  partial slabs.

- Export per cpu statistics to user space (follow the numa convention
  but change the n character to c; no slabinfo support for display yet)

F.e.

[EMAIL PROTECTED]:/sys/kernel/slab/kmalloc-8$ cat alloc_fastpath
9968 c0=4854 c1=1050 c2=468 c3=190 c4=116 c5=1779 c6=185 c7=1326


nice :)




+static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si)
+{
+   unsigned long sum  = 0;
+   int cpu;
+   int len;
+   int *data = kmalloc(nr_cpu_ids * sizeof(int), GFP_KERNEL);
+
+   if (!data)
+   return -ENOMEM;
+
+   for_each_online_cpu(cpu) {
+   int x = get_cpu_slab(s, cpu)->stat[si];


unsigned int x = ...


+
+   data[cpu] = x;
+   sum += x;


or else x will sign-extend here on 64-bit arches?


+   }
+
+   len = sprintf(buf, "%lu", sum);
+
+   for_each_online_cpu(cpu) {
+   if (data[cpu] && len < PAGE_SIZE - 20)
+   len += sprintf(buf + len, " c%d=%u", cpu, data[cpu]);
+   }
+   kfree(data);
+   return len + sprintf(buf + len, "\n");
+}
+





[PATCH] Avoid divides in BITS_TO_LONGS

2008-02-06 Thread Eric Dumazet

BITS_PER_LONG is a signed value (32 or 64).

DIV_ROUND_UP(nr, BITS_PER_LONG) performs signed arithmetic if "nr" is signed
too.

Converting BITS_TO_LONGS(nr) to DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
makes sure the compiler can emit a right shift instead of an expensive integer
divide, even if "nr" is a signed value.


Applying this patch saves 141 bytes on x86 when CONFIG_CC_OPTIMIZE_FOR_SIZE=y
and speeds up bitmap operations.

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index 69c1edb..be5c27c 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -6,8 +6,8 @@
 #define BIT(nr)(1UL << (nr))
 #define BIT_MASK(nr)   (1UL << ((nr) % BITS_PER_LONG))
 #define BIT_WORD(nr)   ((nr) / BITS_PER_LONG)
-#define BITS_TO_LONGS(nr)  DIV_ROUND_UP(nr, BITS_PER_LONG)
 #define BITS_PER_BYTE  8
+#define BITS_TO_LONGS(nr)  DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
 #endif
 
 /*
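
(A small illustration of the point, outside the kernel tree; the function
names are made up:)

/* Dividing a possibly-negative signed value by a signed constant cannot be
 * lowered to a plain shift (negative dividends need a fixup), while
 * sizeof(long) is unsigned, so the dividend is converted to unsigned and
 * the division becomes a simple logical shift.
 */
int to_longs_signed(int nr)
{
	return (nr + 63) / 64;			/* idiv, or shift plus fixup */
}

unsigned long to_longs_unsigned(int nr)
{
	return (nr + 63) / (8 * sizeof(long));	/* plain right shift on 64-bit */
}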


Re: [PATCH] Add IPv6 support to TCP SYN cookies

2008-02-07 Thread Eric Dumazet

Evgeniy Polyakov wrote:

On Wed, Feb 06, 2008 at 10:30:24AM -0800, Glenn Griffin ([EMAIL PROTECTED]) 
wrote:
  

+static u32 cookie_hash(struct in6_addr *saddr, struct in6_addr *daddr,
+  __be16 sport, __be16 dport, u32 count, int c)
+{
+   __u32 tmp[16 + 5 + SHA_WORKSPACE_WORDS];


This huge buffer should not be allocated on stack.
  

I can replace it with a kmalloc, but for my benefit what's the practical
size we try and limit the stack to?  It seemed at first glance to me
that 404 bytes plus the arguments, etc. was not such a large buffer for
a non-recursive function.  Plus the alternative with a kmalloc requires



Well, maybe for the connection establishment path it is not, but it is
absolutely the case in the sending and sometimes receiving paths for 4k
stacks. The main problem is that bugs which happen because of stack
overflow are so obscure that it is virtually impossible to detect
where the overflow happened. 'Debug stack overflow' somehow does not help to
detect it.

Usually there is about 1-1.5 kb of free stack for each process, so this
change will cut one third of the free stack; taking into account that
something can store ipv6 addresses on the stack too, this can end up badly.

  

propagating the possible error status back up to tcp_ipv6.c in the event
we are unable to allocate enough memory, so it can simply drop the
connection.  Not an impossible task by any means but it does
significantly complicate things and I would like to know it's worth the
effort.  Also would it be worth it to provide a supplemental patch for
the ipv4 implementation as it allocates the same buffer?



One can reorganize syncookie support to work with request hash tables
too, so that we could allocate per hash-bucket space and use it as a
scratchpad for cookies.

  

Or maybe use percpu storage for that...

I am not sure if cookie_hash() is always called with preemption disabled.
(If not, we have to use get_cpu_var()/put_cpu_var())

[NET] IPV4: lower stack usage in cookie_hash() function

400 bytes allocated on the stack might be a little bit too much. Using a
per_cpu var is more friendly.


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>


diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index f470fe4..177da14 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -35,10 +35,12 @@ module_init(init_syncookies);
 #define COOKIEBITS 24  /* Upper bits store count */
 #define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)
 
+static DEFINE_PER_CPU(__u32, cookie_scratch)[16 + 5 + SHA_WORKSPACE_WORDS];
+
 static u32 cookie_hash(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport,
   u32 count, int c)
 {
-   __u32 tmp[16 + 5 + SHA_WORKSPACE_WORDS];
+   __u32 *tmp = __get_cpu_var(cookie_scratch);
 
memcpy(tmp + 3, syncookie_secret[c], sizeof(syncookie_secret[c]));
tmp[0] = (__force u32)saddr;
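
(If it turned out cookie_hash() could run with preemption enabled, the access
would need to look something like this instead -- a sketch, not part of the
patch:)

	__u32 hash, *tmp = get_cpu_var(cookie_scratch);	/* disables preemption */

	/* ... fill tmp[] and run sha_transform() as before ... */
	hash = tmp[17];

	put_cpu_var(cookie_scratch);			/* re-enables preemption */
	return hash;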


Re: Bug? Kernels 2.6.2x drops TCP packets over wireless (independent of card used)

2008-02-07 Thread Eric Dumazet

Marcin Koziej wrote:

hmm, I think the site is broken (193.219.28.140), and it is not the card or
the driver that is wrong. If it were, it would also be reproducible with
other sites...

/* I also use madwifi-0.9.3.3, but I think it is not a driver problem */



Unfortunately, this is not the case :(  This happens to all TCP connections,
inside and outside the LAN, also with the telnet session to the router. I also
tried to manipulate the MTU, but without any positive effect.
I also tried to change things like net.ipv4.tcp_congestion_control -- which I
figured out might affect TCP traffic, but also didn't get any results.
I'm afraid this can have something to do with IRQs, because the PCMCIA cards (my
Atheros wireless card is one) are visible only with the irqpoll kernel option.

Of course, as I mentioned, everything works fine with kernel 2.6.19; with the 
same servers etc.

  
Very strange, as the tcpdump you gave shows that the remote peer only
sent "220-\r\n"


This was ACKed, and then nothing but a timeout. We can conclude the remote
peer is *very* slow, or a firewall is blocking traffic after 6 bytes sent :)


Could you give a tcpdump for the same destination, on 2.6.19 this time?







Re: Bug? Kernels 2.6.2x drops TCP packets over wireless (independentof card used)

2008-02-07 Thread Eric Dumazet
ps 12Mbps 18Mbps 24Mbps
36Mbps 48Mbps 54Mbps
wifi0: turboG rates: 6Mbps 12Mbps 18Mbps 24Mbps 36Mbps 48Mbps 54Mbps
wifi0: H/W encryption support: WEP AES AES_CCM TKIP
wifi0: mac 5.9 phy 4.3 radio 4.6
wifi0: Use hw queue 1 for WME_AC_BE traffic
wifi0: Use hw queue 0 for WME_AC_BK traffic
wifi0: Use hw queue 2 for WME_AC_VI traffic
wifi0: Use hw queue 3 for WME_AC_VO traffic
wifi0: Use hw queue 8 for CAB traffic
wifi0: Use hw queue 9 for beacons
wifi0: Atheros 5212: mem=0x3400, irq=11

On the same irq are:
ehci_hcd:usb1
eth0
uhci_hcd:usb4
wifi%d
yenta



Eric Dumazet <[EMAIL PROTECTED]> -
Very strange, as the tcpdump you gave shows that the remote peer only 
sent "220-\r\n"


This was ACKed, and then nothing but timeout. We can conclude remote 
peer is *very* slow or a firewall is blocking trafic after 6 bytes sent :)


Could you give a tcpdump for the same destination, on 2.6.19 this time ?


Here goes, this happens after running:
ftp sunsite.icm.edu.pl
The ftp session is successful



tcpdump -i ath0 tcp and port 21 -X -s 300 -vv
tcpdump: listening on ath0, link-type EN10MB (Ethernet), capture size 300 bytes
18:09:48.435612 IP (tos 0x0, ttl  64, id 6423, offset 0, flags [DF], proto: TCP (6), length: 60) sabayonx86.local.41649 > sunsite2.icm.edu.pl.ftp: S, cksum 0xe759 (correct), 


1749015234:1749015234(0) win 5840 0,nop,wscale 2>


wscale 2 here (instead of 5 in your non-working case),
so the effective window is 5840 << 2 = 23360


0x:  4500 003c 1917 4000 4006 8184 c0a8 0111  E..<[EMAIL 
PROTECTED]@...
0x0010:  c1db 1c8c a2b1 0015 683f dac2    h?..
0x0020:  a002 16d0 e759  0204 05b4 0402 080a  .Y..
0x0030:  001c c3d5   0103 0302
18:09:48.449755 IP (tos 0x0, ttl  60, id 0, offset 0, flags [DF], proto: TCP (6), 
length: 60) sunsite2.icm.edu.pl.ftp > sabayonx86.local.41649: S, cksum 0xc97a 
(correct), 1528170639:1528170639(0) ack 1749015235 win 5792 
0x:  4500 003c  4000 3c06 9e9b c1db 1c8c  E..<[EMAIL 
PROTECTED]<...
0x0010:  c0a8 0111 0015 a2b1 5b16 088f 683f dac3  [...h?..
0x0020:  a012 16a0 c97a  0204 05b4 0402 080a  .z..
0x0030:  1ff7 9a5c 001c c3d5 0103 0307...\
18:09:48.449827 IP (tos 0x0, ttl  64, id 6424, offset 0, flags [DF], proto: TCP (6), 
length: 52) sabayonx86.local.41649 > sunsite2.icm.edu.pl.ftp: ., cksum 0x0925 
(correct), 1:1(0) ack 1 win 1460 
0x:  4500 0034 1918 4000 4006 818b c0a8 0111  [EMAIL 
PROTECTED]@...
0x0010:  c1db 1c8c a2b1 0015 683f dac3 5b16 0890  h?..[...
0x0020:  8010 05b4 0925  0101 080a 001c c3e3  .%..
0x0030:  1ff7 9a5c...\
18:09:48.485234 IP (tos 0x10, ttl  60, id 37583, offset 0, flags [DF], proto: TCP (6), 
length: 58) sunsite2.icm.edu.pl.ftp > sabayonx86.local.41649: P, cksum 0x9f2a 
(correct), 1:7(6) ack 1 win 46 
0x:  4510 003a 92cf 4000 3c06 0bbe c1db 1c8c  E..:[EMAIL 
PROTECTED]<...
0x0010:  c0a8 0111 0015 a2b1 5b16 0890 683f dac3  [...h?..
0x0020:  8018 002e 9f2a  0101 080a 1ff7 9a65  .*.e
0x0030:  001c c3e3 3232 302d 0d0a 220-..
18:09:48.485337 IP (tos 0x10, ttl  64, id 6425, offset 0, flags [DF], proto: TCP (6), length: 52) sabayonx86.local.41649 > sunsite2.icm.edu.pl.ftp: ., cksum 0x08f2 (correct), 


1:1(0) ack 7 win 1460 

this win=1460 is correctly interpreted by the remote peer as 1460 << 2 = 5840 (wscale=2)


0x:  4510 0034 1919 4000 4006 817a c0a8 0111  [EMAIL 
PROTECTED]@..z
0x0010:  c1db 1c8c a2b1 0015 683f dac3 5b16 0896  h?..[...
0x0020:  8010 05b4 08f2  0101 080a 001c c407  
0x0030:  1ff7 9a65...e


Typical window scaling problem here... (well, for the previous traces, with a
wscale of 5; with wscale 2 it seems to work). You probably have a
buggy router or something...


http://lwn.net/Articles/92727/

Try:

# echo 0 >/proc/sys/net/ipv4/tcp_window_scaling

and retry the connection to this ftp server.

You could alternatively play with /proc/sys/net/ipv4/tcp_rmem

# echo "4096 8192 5" >/proc/sys/net/ipv4/tcp_rmem


18:09:48.500421 IP (tos 0x10, ttl  60, id 37584, offset 0, flags [DF], proto: TCP (6), 
length: 191) sunsite2.icm.edu.pl.ftp > sabayonx86.local.41649: P, cksum 0xc74e 
(correct), 7:146(139) ack 1 win 46 
0x:  4510 00bf 92d0 4000 3c06 0b38 c1db 1c8c  [EMAIL 
PROTECTED]<..8
0x0010:  c0a8 0111 0015 a2b1 5b16 0896 683f dac3  [...h?..
0x0020:  8018 002e c74e  0101 080a 1ff7 9a68  .N.h
0x0030:  001c c407 3232 302d 2020 2020 2020 2020  220-
0x0040:  2020 2020 4654 5020 6e61 2053 756e 534

Re: [git pull] more SLUB updates for 2.6.25

2008-02-07 Thread Eric Dumazet

Nick Piggin wrote:

On Friday 08 February 2008 13:13, Christoph Lameter wrote:

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/christoph/vm.git slub-linus

(includes the cmpxchg_local fastpath since the cmpxchg_local work
by Mathieu is in now, and the non-atomic unlock by Nick. Verified that
this is not doing any harm after some other patches had been removed.)


Ah, good. I think it is always a good thing to be able to remove atomics.
They place quite a bit of burden on the CPU, especially x86 where it also
has implicit memory ordering semantics (although x86 can speculatively
get around much of the problem, it's obviously worse than no restriction)

Even if perhaps some cache coherency or timing quirk makes the non-atomic
version slower (all else being equal), then I'd still say that the non
atomic version should be preferred.



What about IRQ masking then?

Many CPUs pay a high cost for the cli/sti pair...

And the SLAB/SLUB allocators, even if only used from process context, want to
disable/re-enable interrupts...


I understand kmalloc() wants generic pools, but dedicated pools could avoid
this cli/sti.
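
(To make the comparison concrete, the two fastpath shapes look roughly like
this -- simplified from slub.c, not the exact code:)

static void *alloc_fastpath_irqoff(struct kmem_cache_cpu *c)
{
	unsigned long flags;
	void **object;

	local_irq_save(flags);			/* cli: costly on many CPUs */
	object = c->freelist;
	c->freelist = object[c->offset];
	local_irq_restore(flags);		/* sti */
	return object;
}

static void *alloc_fastpath_cmpxchg(struct kmem_cache_cpu *c)
{
	void **object;

	do {					/* no cli/sti at all */
		object = c->freelist;
	} while (cmpxchg_local(&c->freelist, object,
			       object[c->offset]) != object);
	return object;
}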




Re: [git pull] more SLUB updates for 2.6.25

2008-02-09 Thread Eric Dumazet

Christoph Lameter wrote:

On Fri, 8 Feb 2008, Eric Dumazet wrote:


And SLAB/SLUB allocators, even if only used from process context, want to
disable/re-enable interrupts...


Not any more. The new fastpath does allow avoiding interrupt 
enable/disable and we will be hopefully able to increase the scope of that 
over time.





Oh, I missed this new SLUB_FASTPATH stuff (not yet in net-2.6), thanks
Christoph!



Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Eric Dumazet
On Mon, 18 Feb 2008 16:12:38 +0800
"Zhang, Yanmin" <[EMAIL PROTECTED]> wrote:

> On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:
> > From: Eric Dumazet <[EMAIL PROTECTED]>
> > Date: Fri, 15 Feb 2008 15:21:48 +0100
> > 
> > > On linux-2.6.25-rc1 x86_64 :
> > > 
> > > offsetof(struct dst_entry, lastuse)=0xb0
> > > offsetof(struct dst_entry, __refcnt)=0xb8
> > > offsetof(struct dst_entry, __use)=0xbc
> > > offsetof(struct dst_entry, next)=0xc0
> > > 
> > > So it should be optimal... I dont know why tbench prefers __refcnt being 
> > > on 0xc0, since in this case lastuse will be on a different cache line...
> > > 
> > > Each incoming IP packet will need to change lastuse, __refcnt and __use, 
> > > so keeping them in the same cache line is a win.
> > > 
> > > I suspect then that even this patch could help tbench, since it avoids 
> > > writing lastuse...
> > 
> > I think your suspicions are right, and even moreso
> > it helps to keep __refcnt out of the same cache line
> > as input/output/ops which are read-almost-entirely :-
> I think you are right. The issue is these three variables sharing the same 
> cache line
> with input/output/ops.
> 
> > )
> > 
> > I haven't done an exhaustive analysis, but it seems that
> > the write traffic to lastuse and __refcnt are about the
> > same.  However if we find that __refcnt gets hit more
> > than lastuse in this workload, it explains the regression.
> I also think __refcnt is the key. I did a new testing by adding 2 unsigned 
> long
> padding before lastuse, so the 3 members are moved to the next cache line. The 
> performance is
> recovered.
> 
> How about below patch? Almost all performance is recovered with the new patch.
> 
> Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>
> 
> ---
> 
> --- linux-2.6.25-rc1/include/net/dst.h2008-02-21 14:33:43.0 
> +0800
> +++ linux-2.6.25-rc1_work/include/net/dst.h   2008-02-21 14:36:22.0 
> +0800
> @@ -52,11 +52,10 @@ struct dst_entry
>   unsigned short  header_len; /* more space at head required 
> */
>   unsigned short  trailer_len;/* space to reserve at tail */
>  
> - u32 metrics[RTAX_MAX];
> - struct dst_entry*path;
> -
> - unsigned long   rate_last;  /* rate limiting for ICMP */
>   unsigned intrate_tokens;
> + unsigned long   rate_last;  /* rate limiting for ICMP */
> +
> + struct dst_entry*path;
>  
>  #ifdef CONFIG_NET_CLS_ROUTE
>   __u32   tclassid;
> @@ -70,10 +69,12 @@ struct dst_entry
>   int (*output)(struct sk_buff*);
>  
>   struct  dst_ops *ops;
> - 
> - unsigned long   lastuse;
> +
> + u32 metrics[RTAX_MAX];
> +
>   atomic_t__refcnt;   /* client references*/
>   int __use;
> + unsigned long   lastuse;
>   union {
>   struct dst_entry *next;
>   struct rtable*rt_next;
> 
> 

Well, after this patch, we grow dst_entry by 8 bytes :

sizeof(struct dst_entry)=0xd0
offsetof(struct dst_entry, input)=0x68
offsetof(struct dst_entry, output)=0x70
offsetof(struct dst_entry, __refcnt)=0xb4
offsetof(struct dst_entry, lastuse)=0xc0
offsetof(struct dst_entry, __use)=0xb8
sizeof(struct rtable)=0x140


So we dirty two cache lines instead of one, unless your cpu has 128-byte
cache lines? (With 64-byte lines, __refcnt at 0xb4 and lastuse at 0xc0 fall
on two different lines.)

I am quite surprised that my patch to not change lastuse if it is already set
to jiffies changes nothing...

If you have some time, could you also test this (unrelated) patch?

We can avoid dirtying a cache line of the loopback device all the time.

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index f2a6e71..0a4186a 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -150,7 +150,10 @@ static int loopback_xmit(struct sk_buff *skb, struct 
net_device *dev)
return 0;
}
 #endif
-   dev->last_rx = jiffies;
+#ifdef CONFIG_SMP
+   if (dev->last_rx != jiffies)
+#endif
+   dev->last_rx = jiffies;
 
/* it's OK to use per_cpu_ptr() because BHs are off */
pcpu_lstats = netdev_priv(dev);



Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Eric Dumazet

Zhang, Yanmin wrote:
On Mon, 2008-02-18 at 12:33 -0500, [EMAIL PROTECTED] wrote: 

On Mon, 18 Feb 2008 16:12:38 +0800, "Zhang, Yanmin" said:


I also think __refcnt is the key. I did a new testing by adding 2 unsigned long
padding before lastuse, so the 3 members are moved to the next cache line. The 
performance is
recovered.

How about below patch? Almost all performance is recovered with the new patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

Could you add a comment someplace that says "refcnt wants to be on a different
cache line from input/output/ops or performance tanks badly", to warn some
future kernel hacker who starts adding new fields to the structure?

Ok. Below is the new patch.

1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y, so that
sizeof(dst_entry)=200 no matter whether CONFIG_NET_CLS_ROUTE=y/n. I tested many
patches on my 16-core tigerton by moving tclassid to different places. It looks
like tclassid could also have an impact on performance.
If tclassid is moved before metrics, or simply not moved, the performance isn't
good. So I moved it behind metrics.

2) Add comments before __refcnt.

If CONFIG_NET_CLS_ROUTE=y, the result with the patch below is about 18% better
than without it.

If CONFIG_NET_CLS_ROUTE=n, the result with the patch below is about 30% better
than without it.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 
+0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required 
*/
unsigned short  trailer_len;/* space to reserve at tail */
 
-	u32			metrics[RTAX_MAX];

-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE

-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
 	struct neighbour	*neighbour;

struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
 	struct  dst_ops	*ops;

-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;





I prefer this patch, but unfortunately your perf numbers are for 64-bit kernels.

Could you please test now with a 32-bit one?

Thank you


Re: tbench regression in 2.6.25-rc1

2008-02-18 Thread Eric Dumazet

Zhang, Yanmin wrote:

On Mon, 2008-02-18 at 11:11 +0100, Eric Dumazet wrote:

On Mon, 18 Feb 2008 16:12:38 +0800
"Zhang, Yanmin" <[EMAIL PROTECTED]> wrote:


On Fri, 2008-02-15 at 15:22 -0800, David Miller wrote:

From: Eric Dumazet <[EMAIL PROTECTED]>
Date: Fri, 15 Feb 2008 15:21:48 +0100


On linux-2.6.25-rc1 x86_64 :

offsetof(struct dst_entry, lastuse)=0xb0
offsetof(struct dst_entry, __refcnt)=0xb8
offsetof(struct dst_entry, __use)=0xbc
offsetof(struct dst_entry, next)=0xc0

So it should be optimal... I dont know why tbench prefers __refcnt being 
on 0xc0, since in this case lastuse will be on a different cache line...


Each incoming IP packet will need to change lastuse, __refcnt and __use, 
so keeping them in the same cache line is a win.


I suspect then that even this patch could help tbench, since it avoids 
writing lastuse...

I think your suspicions are right, and even moreso
it helps to keep __refcnt out of the same cache line
as input/output/ops which are read-almost-entirely :-

I think you are right. The issue is these three variables sharing the same 
cache line
with input/output/ops.


)

I haven't done an exhaustive analysis, but it seems that
the write traffic to lastuse and __refcnt are about the
same.  However if we find that __refcnt gets hit more
than lastuse in this workload, it explains the regression.

I also think __refcnt is the key. I did a new testing by adding 2 unsigned long
padding before lastuse, so the 3 members are moved to the next cache line. The 
performance is
recovered.

How about below patch? Almost all performance is recovered with the new patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-21 14:36:22.0 
+0800
@@ -52,11 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required 
*/
unsigned short  trailer_len;/* space to reserve at tail */
 
-	u32			metrics[RTAX_MAX];

-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
+
+   struct dst_entry*path;
 
 #ifdef CONFIG_NET_CLS_ROUTE

__u32   tclassid;
@@ -70,10 +69,12 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
 	struct  dst_ops	*ops;

-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;



Well, after this patch, we grow dst_entry by 8 bytes :

With my .config, it doesn't grow, perhaps because I don't enable
CONFIG_NET_CLS_ROUTE. I will move tclassid under ops.


sizeof(struct dst_entry)=0xd0
offsetof(struct dst_entry, input)=0x68
offsetof(struct dst_entry, output)=0x70
offsetof(struct dst_entry, __refcnt)=0xb4
offsetof(struct dst_entry, lastuse)=0xc0
offsetof(struct dst_entry, __use)=0xb8
sizeof(struct rtable)=0x140


So we dirty two cache lines instead of one, unless your cpu have 128 bytes 
cache lines ?

I am quite suprised that my patch to not change lastuse if already set to 
jiffies changes nothing...

If you have some time, could you also test this (unrelated) patch ?

We can avoid dirty all the time a cache line of loopback device.

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index f2a6e71..0a4186a 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -150,7 +150,10 @@ static int loopback_xmit(struct sk_buff *skb, struct 
net_device *dev)
return 0;
}
 #endif
-   dev->last_rx = jiffies;
+#ifdef CONFIG_SMP
+   if (dev->last_rx != jiffies)
+#endif
+   dev->last_rx = jiffies;
 
/* it's OK to use per_cpu_ptr() because BHs are off */

pcpu_lstats = netdev_priv(dev);


Although I didn't test it, I don't think it's OK. The key is that __refcnt
shares the same cache line with ops/input/output.



Note it was unrelated to struct dst, but about dirtying one cache line of the
'loopback' netdevice.


I tested it, and the tbench result was better with this patch: 890 MB/s instead
of 870 MB/s on a dual-socket dual-core machine.



I was curious about the potential gain on your 16-core (4x4) machine.


Re: Linux 2.6.25-rc2

2008-02-19 Thread Eric Dumazet
On Tue, 19 Feb 2008 09:02:30 -0500
Mathieu Desnoyers <[EMAIL PROTECTED]> wrote:

> * Pekka Enberg ([EMAIL PROTECTED]) wrote:
> > On Feb 19, 2008 8:54 AM, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> > > > > [ 5282.056415] [ cut here ]
> > > > > [ 5282.059757] kernel BUG at lib/list_debug.c:33!
> > > > > [ 5282.062055] invalid opcode:  [1] SMP
> > > > > [ 5282.062055] CPU 3
> > > >
> > > > hm. Your crashes do seem to span multiple subsystems, but it always
> > > > seems to be around the SLUB code. Could you try the patch below? The
> > > > SLUB code has a new optimization and i'm not 100% sure about it. [the
> > > > hack below switches the SLUB optimization off by disabling the CPU
> > > > feature it relies on.]
> > > >
> > > > Ingo
> > > >
> > > > ->
> > > >  arch/x86/Kconfig |4 
> > > >  1 file changed, 4 deletions(-)
> > > >
> > > > Index: linux/arch/x86/Kconfig
> > > > ===
> > > > --- linux.orig/arch/x86/Kconfig
> > > > +++ linux/arch/x86/Kconfig
> > > > @@ -59,10 +59,6 @@ config HAVE_LATENCYTOP_SUPPORT
> > > >  config SEMAPHORE_SLEEPERS
> > > > def_bool y
> > > >
> > > > -config FAST_CMPXCHG_LOCAL
> > > > -   bool
> > > > -   default y
> > > > -
> > > >  config MMU
> > > > def_bool y
> > > >
> > >
> > > $ grep FAST_CMPXCHG_LOCAL */.config
> > > linux-2.6.24-rc2-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
> > > linux-2.6.24-rc3-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
> > > linux-2.6.24-rc3-mm2/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
> > > linux-2.6.24-rc6-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
> > > linux-2.6.24-rc8-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
> > > linux-2.6.25-rc1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
> > > linux-2.6.25-rc2-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
> > > linux-2.6.25-rc2/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
> > >
> > > -rc2-mm1 still worked for me.
> > >
> > > Did you mean the new SLUB_FASTPATH?
> > > $ grep "define SLUB_FASTPATH" */mm/slub.c
> > > linux-2.6.25-rc1/mm/slub.c:#define SLUB_FASTPATH
> > > linux-2.6.25-rc2-mm1/mm/slub.c:#define SLUB_FASTPATH
> > > linux-2.6.25-rc2/mm/slub.c:#define SLUB_FASTPATH
> > >
> > > The 2.6.24-rc3+ mm-kernels did crash for me, but don't seem to contain 
> > > this...
> > >
> > > On the other hand:
> > > From the crash in 2.6.25-rc2-mm1:
> > > [59987.116182] RIP  [] kmem_cache_alloc_node+0x6d/0xa0
> > >
> > > (gdb) list *0x8029f83d
> > > 0x8029f83d is in kmem_cache_alloc_node (mm/slub.c:1646).
> > > 1641if (unlikely(is_end(object) || !node_match(c, 
> > > node))) {
> > > 1642object = __slab_alloc(s, gfpflags,
> > > node, addr, c);
> > > 1643break;
> > > 1644}
> > > 1645stat(c, ALLOC_FASTPATH);
> > > 1646} while (cmpxchg_local(&c->freelist, object, 
> > > object[c->offset])
> > > 1647
> > >  != object);
> > > 1648#else
> > > 1649unsigned long flags;
> > > 1650
> > >
> > > That code is part for SLUB_FASTPATH.
> > >
> > > I'm willing to test the patch, but don't know how fast I can find the
> > > time to do it, so my answer if your patch helps might be delayed until
> > > the weekend.
> > 
> > Mathieu, Christoph is on vacation and I'm not at all that familiar
> > with this cmpxchg_local() optimization, so if you could take a peek at
> > this bug report to see if you can spot something obviously wrong with
> > it, I would much appreciate that.
> 
> Sure,
> 
> Initial thoughts :
> 
> I'd like to get the complete config causing this bug. I suspect either :
> 
> - A race between the lockless algo and an IRQ in a driver allocating
>   memory.
> - stat(c, ALLOC_FASTPATH); seems to be using a var++, therefore
>   indicating it is not reentrant if IRQs are disabled. Since those are
>   only stats, I guess it's ok, but still weird.
> - CPU hotplug problem. 
>   http://bugzilla.kernel.org/attachment.cgi?id=14877&action=view shows
>   last sysfs file:
>   /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
>   -- is this linked to a cpu up/down event ?
> 
> Since this shows mostly with network card drivers, I think the most
> plausible cause would be an IRQ nesting over kmem_cache_alloc_node and
> calling it.
> 
> Will dig further...

I wonder how SLUB_FASTPATH is supposed to work, since it is affected by
a classical ABA problem of lockless algorithms.

cmpxchg_local(&c->freelist, object, object[c->offset]) can succeed
even though an interrupt came in (on this cpu), several allocations were done,
and one free was performed at the end of this interrupt, so 'object'
was recycled.

c->freelist can then contain the previous value (object), but
object[c->offset] was changed by the IRQ.

We then put an already-allocated object back on the freelist.
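
(A timeline of the scenario, as I understand it -- illustration only:)

/*
 *	object = c->freelist;			// reads A
 *	next   = object[c->offset];		// reads A->next == B
 *	---- IRQ on this cpu ----
 *	kmem_cache_alloc() -> A			// freelist becomes B
 *	kmem_cache_alloc() -> B			// freelist becomes C
 *	kmem_cache_free(A)			// freelist is A again (ABA)
 *	---- IRQ returns ----
 *	cmpxchg_local(&c->freelist, A, next)	// succeeds: freelist == A,
 *						// freelist now points to B,
 *						// which is still allocated
 */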


Re: [RFC: 2.6.25 patch] ipv4/fib_hash.c: fix NULL dereference

2008-02-19 Thread Eric Dumazet

Adrian Bunk wrote:
Unless I miss a guaranteed relation between "f" and
"new_fa->fa_info", this patch is required to fix a NULL dereference
introduced by commit a6501e080c318f8d4467679d17807f42b3a33cd5 and
spotted by the Coverity checker.


Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

---

 net/ipv4/fib_hash.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

--- linux-2.6/net/ipv4/fib_hash.c.old   2008-02-19 23:23:14.0 +0200
+++ linux-2.6/net/ipv4/fib_hash.c   2008-02-19 23:38:28.0 +0200
@@ -367,17 +367,18 @@ static struct fib_node *fib_find_node(st
}
 
 	return NULL;

 }
 
 static int fn_hash_insert(struct fib_table *tb, struct fib_config *cfg)

 {
struct fn_hash *table = (struct fn_hash *) tb->tb_data;
-   struct fib_node *new_f, *f;
+   struct fib_node *new_f = NULL;
+   struct fib_node *f;
struct fib_alias *fa, *new_fa;
struct fn_zone *fz;
struct fib_info *fi;
u8 tos = cfg->fc_tos;
__be32 key;
int err;
 
 	if (cfg->fc_dst_len > 32)

@@ -491,33 +492,32 @@ static int fn_hash_insert(struct fib_tab
}
 
 	err = -ENOENT;

if (!(cfg->fc_nlflags & NLM_F_CREATE))
goto out;
 
 	err = -ENOBUFS;
 
-	new_f = NULL;

if (!f) {
new_f = kmem_cache_zalloc(fn_hash_kmem, GFP_KERNEL);
if (new_f == NULL)
goto out;
 
 		INIT_HLIST_NODE(&new_f->fn_hash);

INIT_LIST_HEAD(&new_f->fn_alias);
new_f->fn_key = key;
f = new_f;
}
 
 	new_fa = &f->fn_embedded_alias;

if (new_fa->fa_info != NULL) {
new_fa = kmem_cache_alloc(fn_alias_kmem, GFP_KERNEL);
if (new_fa == NULL)
-   goto out_free_new_f;
+   goto out;
}
new_fa->fa_info = fi;
new_fa->fa_tos = tos;
new_fa->fa_type = cfg->fc_type;
new_fa->fa_scope = cfg->fc_scope;
new_fa->fa_state = 0;
 
 	/*

@@ -535,19 +535,19 @@ static int fn_hash_insert(struct fib_tab
if (new_f)
fz->fz_nent++;
rt_cache_flush(-1);
 
 	rtmsg_fib(RTM_NEWROUTE, key, new_fa, cfg->fc_dst_len, tb->tb_id,

  &cfg->fc_nlinfo, 0);
return 0;
 
-out_free_new_f:

-   kmem_cache_free(fn_hash_kmem, new_f);
 out:
+   if (new_f)
+   kmem_cache_free(fn_hash_kmem, new_f);
fib_release_info(fi);
return err;
 }
 
 
 static int fn_hash_delete(struct fib_table *tb, struct fib_config *cfg)

 {
struct fn_hash *table = (struct fn_hash*)tb->tb_data;



Hmm, you are right: kmem_cache_free() doesn't accept a NULL object the way
kfree() does.
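
(In other words, as of this thread -- 'cachep' is just a placeholder name:)

	kfree(NULL);			/* explicitly allowed, a no-op */
	kmem_cache_free(cachep, NULL);	/* not allowed: NULL may be dereferenced */

Hence the "if (new_f)" test before kmem_cache_free() in the patch above.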





Re: tbench regression in 2.6.25-rc1

2008-02-19 Thread Eric Dumazet

Zhang, Yanmin wrote:

On Tue, 2008-02-19 at 08:40 +0100, Eric Dumazet wrote:

Zhang, Yanmin wrote:
On Mon, 2008-02-18 at 12:33 -0500, [EMAIL PROTECTED] wrote: 

On Mon, 18 Feb 2008 16:12:38 +0800, "Zhang, Yanmin" said:


I also think __refcnt is the key. I did a new testing by adding 2 unsigned long
padding before lastuse, so the 3 members are moved to the next cache line. The 
performance is
recovered.

How about below patch? Almost all performance is recovered with the new patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

Could you add a comment someplace that says "refcnt wants to be on a different
cache line from input/output/ops or performance tanks badly", to warn some
future kernel hacker who starts adding new fields to the structure?

Ok. Below is the new patch.

1) Move tclassid under ops in case CONFIG_NET_CLS_ROUTE=y. So 
sizeof(dst_entry)=200
no matter if CONFIG_NET_CLS_ROUTE=y/n. I tested many patches on my 16-core 
tigerton by
moving tclassid to different place. It looks like tclassid could also have 
impact on
performance.
If moving tclassid before metrics, or just don't move tclassid, the performance 
isn't
good. So I move it behind metrics.

2) Add comments before __refcnt.

If CONFIG_NET_CLS_ROUTE=y, the result with below patch is about 18% better than
the one without the patch.

If CONFIG_NET_CLS_ROUTE=n, the result with below patch is about 30% better than
the one without the patch.

Signed-off-by: Zhang Yanmin <[EMAIL PROTECTED]>

---

--- linux-2.6.25-rc1/include/net/dst.h  2008-02-21 14:33:43.0 +0800
+++ linux-2.6.25-rc1_work/include/net/dst.h 2008-02-22 12:52:19.0 
+0800
@@ -52,15 +52,10 @@ struct dst_entry
unsigned short  header_len; /* more space at head required 
*/
unsigned short  trailer_len;/* space to reserve at tail */
 
-	u32			metrics[RTAX_MAX];

-   struct dst_entry*path;
-
-   unsigned long   rate_last;  /* rate limiting for ICMP */
unsigned intrate_tokens;
+   unsigned long   rate_last;  /* rate limiting for ICMP */
 
-#ifdef CONFIG_NET_CLS_ROUTE

-   __u32   tclassid;
-#endif
+   struct dst_entry*path;
 
 	struct neighbour	*neighbour;

struct hh_cache *hh;
@@ -70,10 +65,20 @@ struct dst_entry
int (*output)(struct sk_buff*);
 
 	struct  dst_ops	*ops;

-   
-   unsigned long   lastuse;
+
+   u32 metrics[RTAX_MAX];
+
+#ifdef CONFIG_NET_CLS_ROUTE
+   __u32   tclassid;
+#endif
+
+   /*
+* __refcnt wants to be on a different cache line from
+* input/output/ops or performance tanks badly
+*/
atomic_t__refcnt;   /* client references*/
int __use;
+   unsigned long   lastuse;
union {
struct dst_entry *next;
struct rtable*rt_next;




I prefer this patch, but unfortunatly your perf numbers are for 64 bits kernels.

Could you please test now with 32 bits one ?

I tested it with 32-bit 2.6.25-rc1 on an 8-core stoakley. The result shows
almost no difference between the plain kernel and the patched kernel.

New update: On the 8-core stoakley, the regression becomes 2~3% with kernel
2.6.25-rc2. On tigerton, the regression is still 30% with 2.6.25-rc2. On Tulsa
(8 cores + hyperthreading), the regression is still 4% with 2.6.25-rc2.

With my patch, on tigerton, almost all of the regression disappears. On tulsa,
only about 2% of the regression disappears.

So this issue is triggered with multiple cpus. Perhaps the process scheduler is
another factor causing the issue, but it's very hard to change the scheduler.



Thanks very much Yanmin. I think we can apply your patch as is, if no
regression was found for 32-bit.




Eric,

I tested your new patch in the loopback_xmit function. It shows no improvement,
but it doesn't introduce new issues either. As you tested it on a dual-core
machine and got an improvement, how about merging your patch with mine?


No, thank you, that was an experiment and is not related to your findings on
dst_entry.


I am currently working on a 'distributed refcount' infrastructure, to be able
to spread the high pressure we currently have on some refcounts (struct
dst_entry, struct net_device, and many more...) across several nodes (for NUMA
machines) or several cache lines (normal SMP machines).


Instead of NR_CPUS allocations, the goal is to restrict the number of 32-bit
entities used to store one refcount to a small value like 4, 8 or 16,
even if NR_CPUS=4096 or so.


atomic_inc(&p->refcnt)  ->  distref_inc(&p->refcnt)

void distref_inc(struct distref *p)
{
	atomic_inc(myptr[p->offset]);
}


Re: [PATCH 1/2] x86_64: Fold pda into per cpu area v3

2008-02-20 Thread Eric Dumazet
\n");
>  
> +#ifdef CONFIG_SMP
> + _cpu_pda = (void *)_cpu_pda_init;
>   for (i = 0; i < NR_CPUS; i++)
>   cpu_pda(i) = &boot_cpu_pda[i];
> +#endif
> +
> + /* setup percpu segment offset for cpu 0 */
> + cpu_pda(0)->data_offset = (unsigned long)__per_cpu_load;
>  
>   pda_init(0);
>   copy_bootdata(__va(real_mode_data));
> @@ -128,3 +141,31 @@ void __init x86_64_start_kernel(char * r
>  
>   start_kernel();
>  }
> +
> +#ifdef   CONFIG_SMP
> +/*
> + * Remove initial boot_cpu_pda array and cpu_pda pointer table.
> + *
> + * This depends on setup_per_cpu_areas relocating the pda to the beginning
> + * of the per_cpu area so that (_cpu_pda[i] != &boot_cpu_pda[i]).  If it
> + * is equal then the new pda has not been setup for this cpu, and the pda
> + * table will have a NULL address for this cpu.
> + */
> +void __init x86_64_cleanup_pda(void)
> +{
> + int i;
> +
> + _cpu_pda = alloc_bootmem_low(nr_cpu_ids * sizeof(void *));

Here we allocate an array of [nr_cpu_ids] slots

> +
> + if (!_cpu_pda)
> + panic("Cannot allocate cpu pda table\n");
> +
> + /* cpu_pda() now points to allocated cpu_pda_table */
> +
> + for (i = 0; i < NR_CPUS; i++)

But in this loop we want to read/write on [NR_CPUS] slots of this array

> + if (_cpu_pda_init[i] == &boot_cpu_pda[i])
> + cpu_pda(i) = NULL;
> + else
> + cpu_pda(i) = _cpu_pda_init[i];
> +}
> +#endif

You might want to apply this patch.

I also wonder if _cpu_pda should be set only at the very end of
x86_64_cleanup_pda(), after array initialization, or maybe the other
cpus are not yet running? (Sorry, I cannot boot test this patch at the moment.)

[PATCH] x86_64: x86_64_cleanup_pda() should use nr_cpu_ids instead of NR_CPUS

We allocate an array of nr_cpu_ids pointers, so we should respect its bounds.

Delay change of _cpu_pda after array initialization.

Also take into account that alloc_bootmem_low() :
- calls panic() if not enough memory
- already clears allocated memory

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 3942e6a..21532eb 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -154,18 +154,16 @@ void __init x86_64_start_kernel(char * real_mode_data)
 void __init x86_64_cleanup_pda(void)
 {
int i;
+   struct x8664_pda **new_cpu_pda;
 
-   _cpu_pda = alloc_bootmem_low(nr_cpu_ids * sizeof(void *));
+   new_cpu_pda = alloc_bootmem_low(nr_cpu_ids * sizeof(void *));
 
-   if (!_cpu_pda)
-   panic("Cannot allocate cpu pda table\n");
 
+   for (i = 0; i < nr_cpu_ids; i++)
+   if (_cpu_pda_init[i] != &boot_cpu_pda[i])
+   new_cpu_pda[i] = _cpu_pda_init[i];
+   mb();
+   _cpu_pda = new_cpu_pda;
/* cpu_pda() now points to allocated cpu_pda_table */
-
-   for (i = 0; i < NR_CPUS; i++)
-   if (_cpu_pda_init[i] == &boot_cpu_pda[i])
-   cpu_pda(i) = NULL;
-   else
-   cpu_pda(i) = _cpu_pda_init[i];
 }
 #endif


Re: [RFC PATCH 1/8] [NET]: uninline skb_put, de-bloats a lot

2008-02-20 Thread Eric Dumazet
On Wed, 20 Feb 2008 15:47:11 +0200
"Ilpo Järvinen" <[EMAIL PROTECTED]> wrote:

> ~500 files changed
> ...
> kernel/uninlined.c:
>   skb_put   | +104
>  1 function changed, 104 bytes added, diff: +104
> 
> vmlinux.o:
>  869 functions changed, 198 bytes added, 111003 bytes removed, diff: -110805
> 
> This change is INCOMPLETE, I think that the call to current_text_addr()
> should be rethought but I don't have a clue how to do that.

You want to use __builtin_return_address(0)
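
(The reason: once skb_put() is out of line, current_text_addr() would always
report an address inside skb_put() itself, while __builtin_return_address(0)
reports the caller, which is what the panic message needs. A sketch of the
uninlined function with that change -- not the posted patch:)

unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
{
	unsigned char *tmp = skb_tail_pointer(skb);

	SKB_LINEAR_ASSERT(skb);
	skb->tail += len;
	skb->len  += len;
	if (unlikely(skb->tail > skb->end))
		skb_over_panic(skb, len, __builtin_return_address(0));
	return tmp;
}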

> 
> Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
> ---
>  include/linux/skbuff.h |   20 +---
>  net/core/skbuff.c  |   21 +
>  2 files changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 412672a..5925435 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -896,25 +896,7 @@ static inline unsigned char *__skb_put(struct sk_buff 
> *skb, unsigned int len)
>   return tmp;
>  }
>  
> -/**
> - *   skb_put - add data to a buffer
> - *   @skb: buffer to use
> - *   @len: amount of data to add
> - *
> - *   This function extends the used data area of the buffer. If this would
> - *   exceed the total buffer size the kernel will panic. A pointer to the
> - *   first byte of the extra data is returned.
> - */
> -static inline unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
> -{
> - unsigned char *tmp = skb_tail_pointer(skb);
> - SKB_LINEAR_ASSERT(skb);
> - skb->tail += len;
> - skb->len  += len;
> - if (unlikely(skb->tail > skb->end))
> - skb_over_panic(skb, len, current_text_addr());
> - return tmp;
> -}
> +extern unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
>  
>  static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int 
> len)
>  {
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 4e35422..661d439 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -857,6 +857,27 @@ free_skb:
>   return err;
>  }
>  
> +/**
> + *   skb_put - add data to a buffer
> + *   @skb: buffer to use
> + *   @len: amount of data to add
> + *
> + *   This function extends the used data area of the buffer. If this would
> + *   exceed the total buffer size the kernel will panic. A pointer to the
> + *   first byte of the extra data is returned.
> + */
> +unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
> +{
> + unsigned char *tmp = skb_tail_pointer(skb);
> + SKB_LINEAR_ASSERT(skb);
> + skb->tail += len;
> + skb->len  += len;
> + if (unlikely(skb->tail > skb->end))
> + skb_over_panic(skb, len, current_text_addr());
> + return tmp;
> +}
> +EXPORT_SYMBOL(skb_put);
> +
>  /* Trims skb to length len. It can change skb pointers.
>   */
>  
> -- 
> 1.5.2.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] alloc_percpu() fails to allocate percpu data

2008-02-21 Thread Eric Dumazet
Some oprofile results obtained while using tbench on a 2x2 cpu machine 
were very surprising.


For example, the loopback_xmit() function was using a high number of cpu
cycles to perform the statistic updates, supposed to be real cheap since
they use percpu data

   pcpu_lstats = netdev_priv(dev);
   lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
   lb_stats->packets++;  /* HERE : serious contention */
   lb_stats->bytes += skb->len;


struct pcpu_lstats is a small structure containing two longs. It appears
that on my 32bit platform, alloc_percpu(8) allocates a single cache line,
instead of giving each cpu a separate cache line.
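
(Illustration added for clarity, not part of the original report: the
struct in question is tiny, so without any rounding several CPUs'
"private" blocks can share one cache line, and every counter update
bounces that line between CPUs — exactly what the roundup in the patch
below avoids.)

struct pcpu_lstats {		/* 8 bytes on 32bit, 16 bytes on 64bit */
	unsigned long packets;
	unsigned long bytes;
};
/* e.g. with 64-byte lines, up to 8 such 8-byte objects — and thus up
 * to 8 different CPUs — may end up sharing a single cache line. */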

Using the following patch gave me an impressive boost in various benchmarks
(6 % in tbench).

(all percpu_counters hit this bug too)

A long term fix (ie >= 2.6.26) would be to let each CPU allocate its own
block of memory, so that we don't need to round up sizes to L1_CACHE_BYTES,
or merge the SGI stuff of course...


Note : SLUB vs SLAB is important here to *show* the improvement, since
they don't have the same minimum allocation sizes (8 bytes vs 32 bytes).
This could very well explain regressions some guys reported when they
switched to SLUB.


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

mm/allocpercpu.c |   15 ++-
1 files changed, 14 insertions(+), 1 deletion(-)


diff --git a/mm/allocpercpu.c b/mm/allocpercpu.c
index 7e58322..b0012e2 100644
--- a/mm/allocpercpu.c
+++ b/mm/allocpercpu.c
@@ -6,6 +6,10 @@
 #include 
 #include 
 
+#ifndef cache_line_size
+#define cache_line_size()  L1_CACHE_BYTES
+#endif
+
 /**
  * percpu_depopulate - depopulate per-cpu data for given cpu
  * @__pdata: per-cpu data to depopulate
@@ -52,6 +56,11 @@ void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, 
int cpu)
struct percpu_data *pdata = __percpu_disguise(__pdata);
int node = cpu_to_node(cpu);
 
+   /*
+* We should make sure each CPU gets private memory.
+*/
+   size = roundup(size, cache_line_size());
+
BUG_ON(pdata->ptrs[cpu]);
if (node_online(node))
pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
@@ -98,7 +107,11 @@ EXPORT_SYMBOL_GPL(__percpu_populate_mask);
  */
 void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
 {
-   void *pdata = kzalloc(nr_cpu_ids * sizeof(void *), gfp);
+   /*
+* We allocate whole cache lines to avoid false sharing
+*/
+   size_t sz = roundup(nr_cpu_ids * sizeof(void *), cache_line_size());
+   void *pdata = kzalloc(sz, gfp);
void *__pdata = __percpu_disguise(pdata);
 
if (unlikely(!pdata))


Re: more interrupts (lower performance) in bare-metal compared with running VM

2012-07-27 Thread Eric Dumazet
On Fri, 2012-07-27 at 22:09 -0500, sheng qiu wrote:
> Hi all,
> 
> i am comparing network throughput performance under bare-metal case
> with that running VM with assigned-device (assigned NIC). i have two
> physical machines (each has a 10Gbit NIC), one is used as remote
> server (run netserver) and the other is used as the target tested one
> (run netperf with different send message size, TCP_STREAM test). the
> remote NIC is connected directly with the tested NIC, both are 10Gbit.
> fore bare-metal case, i enable 1 cpu core, for VM i also configure 1
> vcpu (the memory is sufficient for both bare-metal and VM case).  i
> run netperf for 120 seconds and got the following results:
> 
>            send message   interrupts   throughput (mbit/s)
> bare-metal       256       10696290        1114.84
>                  512       10106786        1391.92
>                 1024       10071032        1508.09
>                 2048        4560857        3434.65
>                 4096        3292200        4762.26
>                 8192        3169801        4733.89
>                16384        2780529        4892.6
> 

Are these interrupt counts taken on the receiver ?

> VM(assigned NIC)  256        3817904        2249.35
>                   512        3599007        4342.81
>                  1024        3005601        4134.69
>                  2048        2952122        4484
>                  4096        2682874        4566.34
>                  8192        2786719        4734.39
>                 16384        2603835        4540.47
> 
> as shown, the interrupts for bare-metal case is much more than the VM
> case for some message size. we also see the throughput for those
> situations is lower than VM case. it's strange that the bare-metal has
> lower performance than the VM case. Does anyone have comments on this?
> i am very confused.

Well, I think you answered your own question. High interrupt rates
are not good for throughput. They might be good for latencies.

Using a VM adds delays and several frames might be delivered per
interrupt.

Using bare metal is faster and only one frame is delivered by NIC per
interrupt.

Try TCP_RR instead of TCP_STREAM for example.

What NIC is it exactly ? It seems it has no coalescing or LRO strategy.

ethtool -k eth0
ethtool -c eth0

What kernel version was used, because 4892 Mbit/s is not line rate.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] Introduce percpu rw semaphores

2012-07-28 Thread Eric Dumazet
On Sat, 2012-07-28 at 12:41 -0400, Mikulas Patocka wrote:
> Introduce percpu rw semaphores
> 
> When many CPUs are locking a rw semaphore for read concurrently, cache
> line bouncing occurs. When a CPU acquires rw semaphore for read, the
> CPU writes to the cache line holding the semaphore. Consequently, the
> cache line is being moved between CPUs and this slows down semaphore
> acquisition.
> 
> This patch introduces new percpu rw semaphores. They are functionally
> identical to existing rw semaphores, but locking the percpu rw semaphore
> for read is faster and locking for write is slower.
> 
> The percpu rw semaphore is implemented as a percpu array of rw
> semaphores, each semaphore for one CPU. When some thread needs to lock
> the semaphore for read, only semaphore on the current CPU is locked for
> read. When some thread needs to lock the semaphore for write, semaphores
> for all CPUs are locked for write. This avoids cache line bouncing.
> 
> Note that the thread that is locking percpu rw semaphore may be
> rescheduled, it doesn't cause bug, but cache line bouncing occurs in
> this case.
> 
> Signed-off-by: Mikulas Patocka 

I am curious to see how this performs with 4096 cpus ?

Really you shouldn't use a rwlock in a path if this might hurt performance.

RCU is probably a better answer.

(bdev->bd_block_size should be read exactly once)
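
(A sketch of that "read exactly once" idea — my illustration, not
Eric's code: take one snapshot and use the local variable for the
whole operation instead of re-reading the field.)

	unsigned int bsize = ACCESS_ONCE(bdev->bd_block_size);

	/* use 'bsize' consistently below, never dereference bdev again
	 * for the block size during this operation */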



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores

2012-07-29 Thread Eric Dumazet
On Sun, 2012-07-29 at 01:13 -0400, Mikulas Patocka wrote:

> Each cpu should have its own rw semaphore in its cache, so I don't see a 
> problem there.
> 
> When you change block size, all 4096 rw semaphores are locked for write, 
> but changing block size is not a performance sensitive operation.
> 
> > Really you shouldnt use rwlock in a path if this might hurt performance.
> > 
> > RCU is probably a better answer.
> 
> RCU is meaningless here. RCU allows lockless dereference of a pointer. 
> Here the problem is not pointer dereference, the problem is that integer 
> bd_block_size may change.

So add a pointer if you need to. That's the point.

> 
> > (bdev->bd_block_size should be read exactly once )
> 
> Rewrite all direct and non-direct io code so that it reads block size just 
> once ...


You introduced percpu rw semaphores; that's only an incentive for people to
use that infrastructure elsewhere.

And it's a big hammer :

sizeof(struct rw_semaphore)=0x70 

You can probably design something needing no more than 4 bytes per cpu,
and this thing could use non-locked operations as a bonus.

like the following ...

struct percpu_rw_semaphore {
/* percpu_sem_down_read() use the following in fast path */
unsigned int __percpu *active_counters;

unsigned int __percpu *counters;
struct rw_semaphore sem; /* used in slow path and by writers */
};

static inline int percpu_sem_init(struct percpu_rw_semaphore *p)
{
p->counters = alloc_percpu(unsigned int);
if (!p->counters)
return -ENOMEM;
init_rwsem(&p->sem);
p->active_counters = p->counters;
return 0;
}


static inline bool percpu_sem_down_read(struct percpu_rw_semaphore *p)
{
unsigned int __percpu *counters = ACCESS_ONCE(p->active_counters);

if (counters) {
this_cpu_inc(*counters);
return true;
}
down_read(&p->sem);
return false;
}

static inline void percpu_sem_up_read(struct percpu_rw_semaphore *p, bool 
fastpath)
{
if (fastpath)
this_cpu_dec(*p->counters);
else
up_read(&p->sem);
}

static inline unsigned int percpu_count(unsigned int *counters)
{
unsigned int total = 0;
int cpu;

for_each_possible_cpu(cpu)
total += *per_cpu_ptr(counters, cpu);

return total;
}

static inline void percpu_sem_down_write(struct percpu_rw_semaphore *p)
{
down_write(&p->sem);
p->active_counters = NULL;

while (percpu_count(p->counters))
schedule();
}

static inline void percpu_sem_up_write(struct percpu_rw_semaphore *p)
{
p->active_counters = p->counters;
up_write(&p->sem);
}




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores

2012-07-29 Thread Eric Dumazet
On Sun, 2012-07-29 at 12:10 +0200, Eric Dumazet wrote:

> You can probably design something needing no more than 4 bytes per cpu,
> and this thing could use non locked operations as bonus.
> 
> like the following ...

Coming back from my bike ride, here is a more polished version with
proper synchronization/barriers.

struct percpu_rw_semaphore {
/* percpu_sem_down_read() use the following in fast path */
unsigned int __percpu *active_counters;

unsigned int __percpu *counters;
struct rw_semaphore sem; /* used in slow path and by writers */
};

static inline int percpu_sem_init(struct percpu_rw_semaphore *p)
{
p->counters = alloc_percpu(unsigned int);
if (!p->counters)
return -ENOMEM;
init_rwsem(&p->sem);
rcu_assign_pointer(p->active_counters, p->counters);
return 0;
}


static inline bool percpu_sem_down_read(struct percpu_rw_semaphore *p)
{
unsigned int __percpu *counters;

rcu_read_lock();
counters = rcu_dereference(p->active_counters);
if (counters) {
this_cpu_inc(*counters);
smp_wmb(); /* paired with smp_rmb() in percpu_count() */
rcu_read_unlock();
return true;
}
rcu_read_unlock();
down_read(&p->sem);
return false;
}

static inline void percpu_sem_up_read(struct percpu_rw_semaphore *p, bool 
fastpath)
{
if (fastpath)
this_cpu_dec(*p->counters);
else
up_read(&p->sem);
}

static inline unsigned int percpu_count(unsigned int __percpu *counters)
{
unsigned int total = 0;
int cpu;

for_each_possible_cpu(cpu)
total += *per_cpu_ptr(counters, cpu);

return total;
}

static inline void percpu_sem_down_write(struct percpu_rw_semaphore *p)
{
down_write(&p->sem);
p->active_counters = NULL;
synchronize_rcu();
smp_rmb(); /* paired with smp_wmb() in percpu_sem_down_read() */

while (percpu_count(p->counters))
schedule();
}

static inline void percpu_sem_up_write(struct percpu_rw_semaphore *p)
{
rcu_assign_pointer(p->active_counters, p->counters);
up_write(&p->sem);
}
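
(Usage sketch added for illustration; 'blocksize_sem' is a hypothetical
instance, initialized once with percpu_sem_init(), not something taken
from the patch set.)

static struct percpu_rw_semaphore blocksize_sem;

static void reader_side(void)
{
	bool fast = percpu_sem_down_read(&blocksize_sem);

	/* read-side section: the protected value cannot change here */

	percpu_sem_up_read(&blocksize_sem, fast);
}

static void writer_side(void)
{
	percpu_sem_down_write(&blocksize_sem);

	/* all readers drained, safe to change the protected value */

	percpu_sem_up_write(&blocksize_sem);
}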


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Huge performance degradation for UDP between 2.4.17 and 2.6

2012-08-02 Thread Eric Dumazet
On Thu, 2012-08-02 at 14:27 +0200, leroy christophe wrote:
> Hi
> 
> I'm having a big issue with UDP. Using a powerpc board (MPC860).
> 
> With our board running kernel 2.4.17, I'm able to send 16 voice 
> packets (UDP, 96 bytes per packet) in 11 seconds.
> With the same board running either Kernel 2.6.35.14 or Kernel 3.4.7, I 
> need 55 seconds to send the same amount of packets.
> 
> 
> Is there anything to tune in order to get same output rate as with 
> Kernel 2.4 ?

kernel size is probably too big for your old / slow cpu.

Maybe you added too many features on your 3.4.7 kernel. (netfilter ?
SLUB debugging ...)

It's hard to say; 2.4.17 had fewer features and was faster.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/4] hashtable: introduce a small and naive hashtable

2012-08-02 Thread Eric Dumazet
On Thu, 2012-08-02 at 10:32 -0700, Linus Torvalds wrote:
> On Thu, Aug 2, 2012 at 9:40 AM, Eric W. Biederman  
> wrote:
> >
> > For a trivial hash table I don't know if the abstraction is worth it.
> > For a hash table that starts off small and grows as big as you need it
> > the incent to use a hash table abstraction seems a lot stronger.
> 
> I'm not sure growing hash tables are worth it.
> 
> In the dcache layer, we have an allocated-at-boot-time sizing thing,
> and I have been playing around with a patch that makes the hash table
> statically sized (and pretty small). And it actually speeds things up!

By the way, has anybody tried to tweak vmalloc() (or
alloc_large_system_hash()) to use HugePages for those large hash
tables ?



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/7] netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup()

2012-08-03 Thread Eric Dumazet
On Fri, 2012-07-27 at 23:37 +0800, Cong Wang wrote:
> slave_enable_netpoll() and __netpoll_setup() may be called
> with read_lock() held, so should use GFP_ATOMIC to allocate
> memory.
> 
> Cc: "David S. Miller" 
> Reported-by: Dan Carpenter 
> Signed-off-by: Cong Wang 
> ---
>  drivers/net/bonding/bond_main.c |2 +-
>  net/core/netpoll.c  |2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 6fae5f3..ab773d4 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -1235,7 +1235,7 @@ static inline int slave_enable_netpoll(struct slave 
> *slave)
>   struct netpoll *np;
>   int err = 0;
>  
> - np = kzalloc(sizeof(*np), GFP_KERNEL);
> + np = kzalloc(sizeof(*np), GFP_ATOMIC);
>   err = -ENOMEM;
>   if (!np)
>   goto out;
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index b4c90e4..c78a966 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -734,7 +734,7 @@ int __netpoll_setup(struct netpoll *np, struct net_device 
> *ndev)
>   }
>  
>   if (!ndev->npinfo) {
> - npinfo = kmalloc(sizeof(*npinfo), GFP_KERNEL);
> + npinfo = kmalloc(sizeof(*npinfo), GFP_ATOMIC);
>   if (!npinfo) {
>   err = -ENOMEM;
>   goto out;

Yes this works, but maybe you instead could pass/add a gfp_t flags
argument to __netpoll_setup() ?

Management tasks should allow GFP_KERNEL allocations, which have a lower
risk of failure.

It's sad that bonding uses a rwlock here instead of a mutex



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/7] netpoll: use GFP_ATOMIC in slave_enable_netpoll() and __netpoll_setup()

2012-08-03 Thread Eric Dumazet
On Fri, 2012-08-03 at 17:34 +0800, Cong Wang wrote:
> On Fri, 2012-08-03 at 11:17 +0200, Eric Dumazet wrote:
> > On Fri, 2012-07-27 at 23:37 +0800, Cong Wang wrote:
> > > slave_enable_netpoll() and __netpoll_setup() may be called
> > > with read_lock() held, so should use GFP_ATOMIC to allocate
> > > memory.
> > > 
> > > Cc: "David S. Miller" 
> > > Reported-by: Dan Carpenter 
> > > Signed-off-by: Cong Wang 
> > > ---
> > >  drivers/net/bonding/bond_main.c |2 +-
> > >  net/core/netpoll.c  |2 +-
> > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/net/bonding/bond_main.c 
> > > b/drivers/net/bonding/bond_main.c
> > > index 6fae5f3..ab773d4 100644
> > > --- a/drivers/net/bonding/bond_main.c
> > > +++ b/drivers/net/bonding/bond_main.c
> > > @@ -1235,7 +1235,7 @@ static inline int slave_enable_netpoll(struct slave 
> > > *slave)
> > >   struct netpoll *np;
> > >   int err = 0;
> > >  
> > > - np = kzalloc(sizeof(*np), GFP_KERNEL);
> > > + np = kzalloc(sizeof(*np), GFP_ATOMIC);
> > >   err = -ENOMEM;
> > >   if (!np)
> > >   goto out;
> > > diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> > > index b4c90e4..c78a966 100644
> > > --- a/net/core/netpoll.c
> > > +++ b/net/core/netpoll.c
> > > @@ -734,7 +734,7 @@ int __netpoll_setup(struct netpoll *np, struct 
> > > net_device *ndev)
> > >   }
> > >  
> > >   if (!ndev->npinfo) {
> > > - npinfo = kmalloc(sizeof(*npinfo), GFP_KERNEL);
> > > + npinfo = kmalloc(sizeof(*npinfo), GFP_ATOMIC);
> > >   if (!npinfo) {
> > >   err = -ENOMEM;
> > >   goto out;
> > 
> > Yes this works, but maybe you instead could pass/add a gfp_t flags
> > argument to __netpoll_setup() ?
> > 
> > Management tasks should allow GFP_KERNEL allocations to have less
> > failure risks.
> > 
> > Its sad bonding uses the rwlock here instead of a mutex
> > 
> 
> Yup, that is a good idea. I will update this patch.
> 
> Thanks!
> 

I did this, just take it ;)

 drivers/net/bonding/bond_main.c |6 +++---
 drivers/net/team/team.c |   14 +++---
 include/linux/netdevice.h   |2 +-
 include/linux/netpoll.h |2 +-
 net/8021q/vlan_dev.c|6 +++---
 net/bridge/br_device.c  |   10 +-
 net/bridge/br_if.c  |2 +-
 net/bridge/br_private.h |4 ++--
 net/core/netpoll.c  |8 
 9 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 6fae5f3..ccff590 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1235,12 +1235,12 @@ static inline int slave_enable_netpoll(struct slave 
*slave)
struct netpoll *np;
int err = 0;
 
-   np = kzalloc(sizeof(*np), GFP_KERNEL);
+   np = kzalloc(sizeof(*np), GFP_ATOMIC);
err = -ENOMEM;
if (!np)
goto out;
 
-   err = __netpoll_setup(np, slave->dev);
+   err = __netpoll_setup(np, slave->dev, GFP_ATOMIC);
if (err) {
kfree(np);
goto out;
@@ -1292,7 +1292,7 @@ static void bond_netpoll_cleanup(struct net_device 
*bond_dev)
read_unlock(&bond->lock);
 }
 
-static int bond_netpoll_setup(struct net_device *dev, struct netpoll_info *ni)
+static int bond_netpoll_setup(struct net_device *dev, struct netpoll_info *ni, 
gfp_t flags)
 {
struct bonding *bond = netdev_priv(dev);
struct slave *slave;
diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 87707ab..3177d6b 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -795,16 +795,16 @@ static void team_port_leave(struct team *team, struct 
team_port *port)
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
-static int team_port_enable_netpoll(struct team *team, struct team_port *port)
+static int team_port_enable_netpoll(struct team *team, struct team_port *port, 
gfp_t flags)
 {
struct netpoll *np;
int err;
 
-   np = kzalloc(sizeof(*np), GFP_KERNEL);
+   np = kzalloc(sizeof(*np), flags);
if (!np)
return -ENOMEM;
 
-   err = __netpoll_setup(np, port->dev);
+   err = __netpoll_setup(np, port->dev, flags);
if (err) {
kfree(np);
return err;
@@ -833,7 +833,7 @@ static struct netpoll_info *team_netpoll_info(struct team 
*team)
 }
 
 #else
-static int team_port_enable_netpoll(struct team *team, s

Re: [RFC v2 1/7] hashtable: introduce a small and naive hashtable

2012-08-03 Thread Eric Dumazet
On Fri, 2012-08-03 at 16:23 +0200, Sasha Levin wrote:
> This hashtable implementation is using hlist buckets to provide a simple
> hashtable to prevent it from getting reimplemented all over the kernel.
> 

> +static void hash_add(struct hash_table *ht, struct hlist_node *node, long 
> key)
> +{
> + hlist_add_head(node,
> + &ht->buckets[hash_long((unsigned long)key, HASH_BITS(ht))]);
> +}
> +

Why is key a long, cast later to "unsigned long" ?

hash_long() is expensive on 64bit arches, and not really needed
if key is a u32 from the beginning (I am referring to your patches 6 &
7 using jhash()).

Maybe you could use a macro, so that we can automatically select
hash_32() if key is a u32, and hash_long() for other types.
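
(One possible shape for such a macro — my sketch, not something from
the thread; it keys off sizeof() so a u32 key goes through hash_32()
and wider keys through hash_long().)

#include <linux/hash.h>

#define hash_key(key, bits)					\
	(sizeof(key) <= sizeof(u32) ?				\
		hash_32((u32)(key), (bits)) :			\
		hash_long((unsigned long)(key), (bits)))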



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v2 7/7] net,9p: use new hashtable implementation

2012-08-03 Thread Eric Dumazet
On Fri, 2012-08-03 at 16:23 +0200, Sasha Levin wrote:
> Switch 9p error table to use the new hashtable implementation. This reduces 
> the amount of
> generic unrelated code in 9p.
> 
> Signed-off-by: Sasha Levin 
> ---
>  net/9p/error.c |   17 -
>  1 files changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/net/9p/error.c b/net/9p/error.c
> index 2ab2de7..f1037db 100644
> --- a/net/9p/error.c
> +++ b/net/9p/error.c
> @@ -34,7 +34,7 @@
>  #include 
>  #include 
>  #include 
> -
> +#include 
>  /**
>   * struct errormap - map string errors from Plan 9 to Linux numeric ids
>   * @name: string sent over 9P
> @@ -50,8 +50,8 @@ struct errormap {
>   struct hlist_node list;
>  };
>  
> -#define ERRHASHSZ 32
> -static struct hlist_head hash_errmap[ERRHASHSZ];


> +#define ERRHASHSZ 5

This name is confusing, it should mention SHIFT or BITS maybe...


> +DEFINE_STATIC_HASHTABLE(hash_errmap, ERRHASHSZ);
>  
>  /* FixMe - reduce to a reasonable size */
>  static struct errormap errmap[] = {
> @@ -196,15 +196,14 @@ int p9_error_init(void)
>   int bucket;

remove "int bucket" and use :

u32 hash;

>  
>   /* initialize hash table */
> - for (bucket = 0; bucket < ERRHASHSZ; bucket++)
> - INIT_HLIST_HEAD(&hash_errmap[bucket]);
> + hash_init(&hash_errmap, ERRHASHSZ);

Why is hash_init() even needed ?

If hash is "DEFINE_STATIC_HASHTABLE(...)", it's already ready for use !

>  
>   /* load initial error map into hash table */
>   for (c = errmap; c->name != NULL; c++) {
>   c->namelen = strlen(c->name);
> - bucket = jhash(c->name, c->namelen, 0) % ERRHASHSZ;
> + bucket = jhash(c->name, c->namelen, 0);

bucket is a wrong name here, it's more like "key" or "hash"

>   INIT_HLIST_NODE(&c->list);
> - hlist_add_head(&c->list, &hash_errmap[bucket]);
> + hash_add(&hash_errmap, &c->list, bucket);
>   }
>  
>   return 1;
> @@ -228,8 +227,8 @@ int p9_errstr2errno(char *errstr, int len)
>   errno = 0;
>   p = NULL;
>   c = NULL;
> - bucket = jhash(errstr, len, 0) % ERRHASHSZ;
> - hlist_for_each_entry(c, p, &hash_errmap[bucket], list) {
> + bucket = jhash(errstr, len, 0);

hash = jhash(errstr, len, 0);

> + hash_for_each_possible(&hash_errmap, p, c, list, bucket) {
>   if (c->namelen == len && !memcmp(c->name, errstr, len)) {
>   errno = c->val;
>   break;


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst cache overflow

2007-08-14 Thread Eric Dumazet
On Tue, 14 Aug 2007 18:06:46 +0200
Tobias Diedrich <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> I suspect I'm seeing a slow dst cache leakage on one of my servers.
> The server in question (oni) regularly needs to be rebooted, because
> it loses network connectivity. However, netconsole and syslog shows that the
> machine is still running and the kernel complains about "dst cache
> overflow".
> 
> I have since installed a monitoring script, which stores the output of
> both "ip route ls cache | fgrep cache | wc -l" and the 'entries' value
> of /proc/net/stat/rt_cache (as suggested in 
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg02107.html)
> and produces a nice rrd graph:
> 
> http://uguu.de/~ranma/route-month-oni.png
> So entries is growing more or less constantly, while the number of
> active routes (not visible on the graph due to being too small) is
> relatively constant.
> 
> Comparing this to another host running the exact same kernel:
> http://uguu.de/~ranma/route-month-ari.png
> Here cached_routes and entries barely differ at all.
> 
> The funny thing is, both hosts are running the exact same kernel
> and use more or less the same iptables rules.
> 
> So I'm not sure what would cause the dst cache to leak only on host
> oni?
> 

Could you send the result of these commands on oni and ari ?

ip route ls
grep . /proc/sys/net/ipv4/route/*

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [net/ipv4]: fib_seq_show function adjustment to get a more sensable output of /proc/net/route

2007-10-22 Thread Eric Dumazet

Denis Cheng wrote:

the temporary bf[127] char array is redundant, and the specified width 127 makes
the output of /proc/net/route include many trailing spaces;
since most terminals' columns are fewer than 127, this made every fib entry occupy
two lines,

after applying this patch, the output of /proc/net/route is more sensible, like
this:

Iface   Destination Gateway Flags   RefCnt  Use Metric  Mask
MTU Window  IRTT
eth00001A8C000010   0   0   
00FF0   0   0
lo  007F00010   0   0   
00FF0   0   0
eth00101A8C000030   0   0   
0   0   0

Signed-off-by: Denis Cheng <[EMAIL PROTECTED]>


Hum... did you test your patch with many routes declared ? (more than 32 on 
i386/x86_64)


127 is not a random value, but chosen as a power of two minus 1.
PAGE_SIZE is guaranteed to be a multiple of 128 (127 chars + line_feed) on all
arches.


So each read() on /proc/net/route delivers PAGE_SIZE/128 lines.

With your patch, some lines might be truncated (one every 32 on i386)
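
(Worked example, not from the mail: with 4096-byte pages each entry is
padded to 127 chars + '\n' = 128 bytes, so one page holds exactly
4096 / 128 = 32 complete lines; that is why truncation would show up
once every 32 entries on i386. The fixed-width line is emitted roughly
like this:)

	seq_printf(seq, "%-127s\n", bf);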


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] CFS : Use NSEC_PER_MSEC and NSEC_PER_SEC in kernel/sched.c and kernel/sysctl.c

2007-10-30 Thread Eric Dumazet
1) The hardcoded 1000000000 value is used five times in places where NSEC_PER_SEC
might be more readable.


2) A conversion from nsec to msec uses the hardcoded 1000000 value, which is a
candidate for NSEC_PER_MSEC.


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

diff --git a/kernel/sched.c b/kernel/sched.c
index 3f6bd11..57c539d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,7 +75,7 @@
  */
 unsigned long long __attribute__((weak)) sched_clock(void)
 {
-   return (unsigned long long)jiffies * (1000000000 / HZ);
+   return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
 }
 
 /*
@@ -99,8 +99,8 @@ unsigned long long __attribute__((weak)) sched_clock(void)
 /*
  * Some helpers for converting nanosecond timing to jiffy resolution
  */
-#define NS_TO_JIFFIES(TIME) ((unsigned long)(TIME) / (1000000000 / HZ))
-#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
+#define NS_TO_JIFFIES(TIME) ((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
+#define JIFFIES_TO_NS(TIME) ((TIME) * (NSEC_PER_SEC / HZ))
 
 #define NICE_0_LOAD    SCHED_LOAD_SCALE
 #define NICE_0_SHIFT   SCHED_LOAD_SHIFT
@@ -7228,7 +7228,7 @@ static u64 cpu_usage_read(struct cgroup *cgrp, struct 
cftype *cft)
spin_unlock_irqrestore(&cpu_rq(i)->lock, flags);
}
/* Convert from ns to ms */
-   do_div(res, 1000000);
+   do_div(res, NSEC_PER_MSEC);
 
return res;
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3b4efbe..6547f9a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -226,9 +226,9 @@ static struct ctl_table root_table[] = {
 
 #ifdef CONFIG_SCHED_DEBUG
 static unsigned long min_sched_granularity_ns = 100000;  /* 100 usecs */
-static unsigned long max_sched_granularity_ns = 1000000000;  /* 1 second */
+static unsigned long max_sched_granularity_ns = NSEC_PER_SEC;  /* 1 second */
 static unsigned long min_wakeup_granularity_ns;  /* 0 usecs */
-static unsigned long max_wakeup_granularity_ns = 1000000000;   /* 1 second */
+static unsigned long max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
 #endif
 
 static struct ctl_table kern_table[] = {


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-10-31 Thread Eric Dumazet

Christoph Lameter wrote:

This patch increases the speed of the SLUB fastpath by
improving the per cpu allocator and makes it usable for SLUB.

Currently allocpercpu manages arrays of pointer to per cpu objects.
This means that is has to allocate the arrays and then populate them
as needed with objects. Although these objects are called per cpu
objects they cannot be handled in the same way as per cpu objects
by adding the per cpu offset of the respective cpu.

The patch here changes that. We create a small memory pool in the
percpu area and allocate from there if alloc per cpu is called.
As a result we do not need the per cpu pointer arrays for each
object. This reduces memory usage and also the cache foot print
of allocpercpu users. Also the per cpu objects for a single processor
are tightly packed next to each other decreasing cache footprint
even further and making it possible to access multiple objects
in the same cacheline.

SLUB has the same mechanism implemented. After fixing up the
alloccpu stuff we throw the SLUB method out and use the new
allocpercpu handling. Then we optimize allocpercpu addressing
by adding a new function

this_cpu_ptr()

that allows the determination of the per cpu pointer for the
current processor in an more efficient way on many platforms.

This increases the speed of SLUB (and likely other kernel subsystems
that benefit from the allocpercpu enhancements):


          SLAB   SLUB   SLUB+   SLUB-o   SLUB-a
    8       96     86      45       44       38    3 *
   16       84     92      49       48       43    2 *
   32       84    106      61       59       53    +++
   64      102    129      82       88       75    ++
  128      147    226     188      181      176    -
  256      200    248     207      285      204    =
  512      300    301     260      209      250    +
 1024      416    440     398      264      391    ++
 2048      720    542     530      390      511    +++
 4096     1254    342     342      336      376    3 *

alloc/free test
  SLAB      SLUB   SLUB+   SLUB-o   SLUB-a
  137-146   151    68-72   68-74    56-58    3 *

Note: The per cpu optimizations are only half way there because of the screwed
up way that x86_64 handles its cpu area, which causes additional cycles to be
spent by retrieving a pointer from memory and adding it to the address.
The i386 code is much less cycle intensive being able to get to per cpu
data using a segment prefix and if we can get that to work on x86_64
then we may be able to get the cycle count for the fastpath down to 20-30
cycles.



This really sounds good Christoph, not only for SLUB, so I guess the 32k limit is
not enough, because many things would use per_cpu if only per_cpu were reasonably
fast (ie not so many dereferences).


I think this question already came up in the past and Linus already answered it,
but I'll ask it again. What about VM games with modern cpus (64 bit arches) ?


Say we reserve on x86_64 a really huge (2^32 bytes) area, and change the VM layout
so that each cpu maps its own per_cpu area in this area, so that the local
per_cpu data sits at the same virtual address on each cpu. Then we don't need a
segment prefix nor the addition of a 'per_cpu offset'. No need to write special asm
functions to read/write/increment per_cpu data, and gcc could use normal
rules for optimizations.


We would only need to add the "per_cpu offset" to get data for a given cpu.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/7] allocpercpu: Make it a true per cpu allocator by allocating from a per cpu array

2007-10-31 Thread Eric Dumazet

Christoph Lameter wrote:

+
+enum unit_type { FREE, END, USED };
+
+static u8 cpu_alloc_map[UNITS_PER_CPU] = { 1, };


You mean END here instead of 1 :)



+/*
+ * Allocate an object of a certain size
+ *
+ * Returns a per cpu pointer that must not be directly used.
+ */
+static void *cpu_alloc(unsigned long size)
+{


We might need to give an alignment constraint here. Some per_cpu users would
like to get a 64 byte zone, sitting in one cache line and not two :)



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 3/7] Allocpercpu: Do __percpu_disguise() only if CONFIG_DEBUG_VM is set

2007-10-31 Thread Eric Dumazet

Christoph Lameter wrote:

Disguising costs a few cycles in the hot paths. So switch it off if
we are not debugging.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/percpu.h |4 
 1 file changed, 4 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===
--- linux-2.6.orig/include/linux/percpu.h   2007-10-31 16:40:14.892121256 
-0700
+++ linux-2.6/include/linux/percpu.h2007-10-31 16:41:00.907621059 -0700
@@ -33,7 +33,11 @@
 
 #ifdef CONFIG_SMP
 
+#ifdef CONFIG_DEBUG_VM

 #define __percpu_disguise(pdata) ((void *)~(unsigned long)(pdata))
+#else
+#define __percpu_disguide(pdata) ((void *)(pdata))
+#endif


Yes, good idea, but a little typo here :)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Eric Dumazet

Christoph Lameter wrote:

On Thu, 1 Nov 2007, David Miller wrote:


From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 15:15:39 -0700 (PDT)

After boot is complete we allow the reduction of the size of the per cpu
areas. Let's say we only need 128k per cpu. Then the remaining pages will
be returned to the page allocator.

You don't know how much you will need.  I exhausted the limit on
sparc64 very late in the boot process when the last few userland
services were starting up.


Well you would be able to specify how much will remain. If not it will 
just keep the 2M reserve around.



And if I subsequently bring up 100,000 IP tunnels, it will exhaust the
per-cpu allocation area.


Each tunnel needs 4 bytes per cpu?


well, if we move last_rx to a percpu var, we need  8 bytes of percpu space per 
net_device :)





You have to make it fully dynamic, there is no way around it.


Na. Some reasonable upper limit needs to be set. If we set that to say
32 Megabytes and do the virtual mapping then we can just populate the first
2M and only allocate the remainder if we need it. Then we need to rely on
Mel's defrag stuff to defrag memory if we need it.


If a 2MB page is not available, could we revert to using 4KB pages (like the
vmalloc stuff), paying an extra runtime overhead of course ?



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: TCP_DEFER_ACCEPT issues

2007-11-01 Thread Eric Dumazet

Felix von Leitner wrote:

I am trying to use TCP_DEFER_ACCEPT in my web server.

There are some operational problems.  First of all: timeout handling.  I
would like to be able to set a timeout in seconds (or better:
milliseconds) for how long the socket is allowed to sit there without
data coming in.  For high load situations, I have been enforcing
timeouts in the range of 15 seconds, otherwise someone can DoS the
server by opening a lot of connections and tying up data structures.

It is still possible, of course, to tie up kernel memory this way, by
not reacting to the FIN or RST packets and running into a timeout there,
too, but that is partially tunable via sysctl.

According to tcp(7) the int argument to TCP_DEFER_ACCEPT is in seconds.
In the kernel code, it's converted to TCP timeout units.  When I ran my
server, and connected without sending any data, nothing happened.  No
timeout.  Minutes later, the connection was still there.  Even worse:
when I killed (!) the server process (thus closing the server socket),
the client did not get a reset.  Only when I type something in the
telnet, I get a reset.  This appears to be very broken.

My suggestion:

  1. make the argument to the setsockopt be in seconds, or milliseconds.
  2. if the server socket is closed, reset all pending connections.

Comments?



I agree TCP_DEFER_ACCEPT is not worth it at the current time, if you take into 
account the bad guys, or very slow networks.


1) Setting a timeout in a millisecond range (< 1000) is not very good because 
some clients may need much more time to send your server the data (very long 
distance). So a second granularity is OK.


2) After the timeout has elapsed, the server tcp stack has no socket associated with
your client attempt. So closing the server listening socket won't be able to
send a RST. I agree a RST *should* be sent by the server once the timeout is
triggered.
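
(For reference — my sketch, not part of the original mail — this is how
a server arms the option for the 20 second case traced below; per
tcp(7) the setsockopt() argument is in seconds.)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int lfd = socket(AF_INET, SOCK_STREAM, 0);
	int defer_secs = 20;

	if (lfd < 0)
		return 1;
	if (setsockopt(lfd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
		       &defer_secs, sizeof(defer_secs)) < 0)
		perror("setsockopt(TCP_DEFER_ACCEPT)");
	/* bind(), listen(), accept() as usual */
	return 0;
}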


A typical tcpdump of what is happening for a tcp_defer_accept timeout of 20 
seconds is :


[1]08:52:47.480291 IP client.60930 > server.http: S 2498995442:2498995442(0) 
win 5840 
[2]08:52:47.480302 IP server.http > client.60930: S 1173302644:1173302644(0) 
ack 2498995443 win 5840 

[3]08:52:47.481669 IP client.60930 > server.http: . ack 1 win 5840

[4]08:52:50.757543 IP server.http > client.60930: S 1173302644:1173302644(0) 
ack 2498995443 win 5840 

[5]08:52:50.758953 IP client.60930 > server.http: . ack 1 win 5840

[6]08:52:56.760611 IP server.http > client.60930: S 1173302644:1173302644(0) 
ack 2498995443 win 5840 

[7]08:52:56.761886 IP client.60930 > server.http: . ack 1 win 5840

[8]08:53:08.771254 IP server.http > client.60930: S 1173302644:1173302644(0) 
ack 2498995443 win 5840 

[9]08:53:08.772514 IP client.60930 > server.http: . ack 1 win 5840

[10]08:53:32.782488 IP server.http > client.60930: S 1173302644:1173302644(0) 
ack 2498995443 win 5840 

[11]08:53:32.783754 IP client.60930 > server.http: . ack 1 win 5840



[12]08:59:30.509097 IP client.60930 > server.http: P 1:3(2) ack 1 win 5840
[13]08:59:30.509125 IP server.http > client.60930: R 1173302645:1173302645(0) 
win 0



So TCP_DEFER_ACCEPT might send way more packets than needed. Packets 4,6,8,10
(and their corresponding acks 5,7,9,11) seem unnecessary, since (1,2,3) had
engaged a normal TCP session (three way handshake).


We should only wait for the data coming from the client to be able to pass the
new socket to the listening application.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] net: cgroup: fix access the unallocated memory in netprio cgroup

2012-07-10 Thread Eric Dumazet
On Tue, 2012-07-10 at 16:53 +0800, Gao feng wrote:
> > Hi Gao
> > 
> > Is it still needed to call update_netdev_tables() from write_priomap() ?
> > 
> 
> Yes, I think it's needed, because read_priomap will show all of the net
> devices.
> 
> But we may add a netdev after creating a netprio cgroup, so the newly added
> netdev's priomap will not be allocated. If we don't call update_netdev_tables
> in write_priomap, we may access this unallocated memory.
> 

I realize my question was not clear.

If we write in write_priomap() a field of a single netdevice,
why should we allocate memory for all netdevices on the machine ?

So the question was : Do we really need to call
update_netdev_tables(alldevs), instead of extend_netdev_table(dev)



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] net: cgroup: fix access the unallocated memory in netprio cgroup

2012-07-10 Thread Eric Dumazet
On Tue, 2012-07-10 at 18:44 +0800, Gao feng wrote:
> there are some out of bound accesses in netprio cgroup.

> - update_netdev_tables();
> + ret = extend_netdev_table(dev, max_len);
> + if (ret < 0)
> + goto out_free_devname;
> +
>   ret = 0;
>   rcu_read_lock();
>   map = rcu_dereference(dev->priomap);

It's unfortunately adding a bug.

extend_netdev_table() is protected by RTNL.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] net: cgroup: fix access the unallocated memory in netprio cgroup

2012-07-10 Thread Eric Dumazet
On Tue, 2012-07-10 at 13:05 +0200, Eric Dumazet wrote:
> On Tue, 2012-07-10 at 18:44 +0800, Gao feng wrote:
> > there are some out of bound accesses in netprio cgroup.
> 
> > -   update_netdev_tables();
> > +   ret = extend_netdev_table(dev, max_len);
> > +   if (ret < 0)
> > +   goto out_free_devname;
> > +
> > ret = 0;
> > rcu_read_lock();
> > map = rcu_dereference(dev->priomap);
> 
> Its unfortunately adding a bug.
> 
> extend_netdev_table() is protected by RTNL.

Please test your next patch using :

CONFIG_LOCKDEP=y
CONFIG_PROVE_RCU=y

Because rtnl_dereference() should shout if you don't hold RTNL


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3] net: cgroup: fix access the unallocated memory in netprio cgroup

2012-07-11 Thread Eric Dumazet
On Wed, 2012-07-11 at 16:30 +0800, Gao feng wrote:
> there are some out of bound accesses in netprio cgroup.
> 
> now before accessing the dev->priomap.priomap array,we only check
> if the dev->priomap exist.and because we don't want to see
> additional bound checkings in fast path, so we should make sure
> that dev->priomap is null or array size of dev->priomap.priomap
> is equal to max_prioidx + 1;
> 
> and it's not needed to call extend_netdev_tabel in write_priomap,
> we can only allocate the net device's priomap which we change through
> net_prio.ifpriomap.
> 
> this patch add a return value for update_netdev_tables & extend_netdev_table,
> so when new_priomap is allocated failed,write_priomap will stop to access
> the priomap,and return -ENOMEM back to the userspace to tell the user
> what happend.
> 
> Change From v2:
> 1. protect extend_netdev_table by RTNL.
> 2. when extend_netdev_table failed,call dev_put to reduce device's refcount.
> 
> Signed-off-by: Gao feng 
> Cc: Neil Horman 
> Cc: Eric Dumazet 
> ---
>  

Acked-by: Eric Dumazet 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Protocol handler using dev_add_pack

2012-07-11 Thread Eric Dumazet
On Wed, 2012-07-11 at 10:38 -0300, Jerry Yu wrote:
> I am working on a kernel module to monitor all TCP packets. I created a 
> protocol
> handler with protocol code ETH_P_ALL to handle all incoming and outgoing
> TCP packets. The code worked fine on 2.6.14 kernel, but in current 3.2.0-26
> kernel, I am no longer able to get the TCP payload for outgoing packets.
> The data in TCP payload section of skb->data are mainly 0x00. I am still
> able to get incoming TCP packets' payloads though.
> 
> Is there any change in 3.2.0 or other early kernel versions that will cause
> this issue?

Maybe you make wrong assumptions in your code.

skb->data doesn't exactly contain the tcp payload of locally generated TCP
packets, unless you disabled scatter-gather on the NIC :

ethtool -K eth0 sg off
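
(If the module must see the payload even with scatter-gather left on,
one option — my suggestion, not from the reply — is to copy the bytes
out with skb_copy_bits(), which walks both the linear part and the
paged fragments; this assumes it runs inside the packet handler where
'skb' is available.)

	char buf[128];
	int len = min_t(int, skb->len, (int)sizeof(buf));

	if (skb_copy_bits(skb, 0, buf, len) == 0) {
		/* buf now holds the first len bytes, frags included */
	}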



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] bnx2: update bnx2-mips-09 firmware to bnx2-mips-09-6.2.1b

2012-07-13 Thread Eric Dumazet
On Fri, 2012-07-13 at 14:04 +0100, Chris Webb wrote:
> Commit c2c20ef43d00 "bnx2: Update driver to use new mips firmware"
> updated the bnx2 driver to use bnx2-mips-09-6.2.1b in place of
> bnx2-mips-09-6.2.1a, but didn't replace the copy of bnx2-mips-09-6.2.1a
> in firmware/bnx2/ with the new version.
> 
> This means that the bnx2 driver stopped working altogether for users who
> use CONFIG_FIRMWARE_IN_KERNEL to compile firmware together with drivers
> into their kernel, rather than having a runtime firmware loader.
> 
> Cc: 
> Signed-off-by: Chris Webb 
> ---

Have you read firmware/README.AddingFirmware ?



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PROBLEM: Silent data corruption when using sendfile()

2012-07-14 Thread Eric Dumazet
On Sat, 2012-07-14 at 16:04 +0800, Hillf Danton wrote:
> On Sat, Jul 14, 2012 at 1:18 AM, Johannes Truschnigg
>  wrote:
> > Hello good people of linux-kernel.
> >
> > I've been bothered by silent data corruption from my personal fileserver - 
> > no
> > matter the Layer 7 protocol used, huge transfers sporadically ended up 
> > damaged
> > in-flight. I used Samba/CIFS, NFS(v4, via TCP), Apache httpd 2.2, thttpd,
> > python and netcat to verify this.
> >
> > I think I managed to track down the culprit: as soon as I disable sendfile()
> > for all programs that support such a configuration (netcat, afaik, won't 
> > ever
> > use sendfile() to transmit data over a socket, so the problem was never
> > reproducible there in the first place), everything reverts to perfect and
> > proper working condition.
> >
> > I've been experiencing this problem with vanilla kernel releases from the 
> > 3.3
> > up until 3.4.0 series. I do not know if it also occurs with earlier 
> > releases,
> > but I can verify if that is useful. I set up the environment for a minimal
> > kind of testcase (a large ISO image file available from the server's local
> > filesystem, as well as from a mounted NFS export - once via lo, and once via
> > br0/eth0), and proceeded to do the following:
> >
> > i=0; for i in {1..100}
> > do
> >   echo "pass $i:"; sync; echo 3 > /proc/sys/vm/drop_caches
> >   cmp -b /mnt/nfs-test/lo/tmp/X15-65741.iso /srv/files/pub/tmp/X15-65741.iso
> > done
> >
> > I then rotated the source of the data, and tested the network-mount against
> > the loopback-mount, as well as the network-mount against the local 
> > filesystem.
> >
> > Computing the file's md5sum in a loop whilst dropping caches after each
> > iteration by reading it directly from its location in the filesystem 
> > produces
> > the very same hash every time - I therefore think it's safe to assume the
> > corruption is introduced when traversing the networking stack. The hash also
> > does not change if I repeadetly compute the md5sum of the file as 
> > transferred
> > by, e. g., Apache httpd or smbd with sendfile explicitly disabled.
> >
> > Please take a look at the attachment to see the actual output of the above
> > script. It does not matter if I do an actual transfer over the network from 
> > my
> > server to one of its clients (I verified the problem with two different 
> > client
> > machines, one even running Windows), or if the server is both source and
> > destination of the transfer - as long as sendfile is involed, some of the 
> > data
> > will always become garbled sooner or later. That also leads me to believe 
> > that
> > my internetworking devices (my switch in particular) is working just fine;
> > testing bulky transfers from one host to another confirms this insofar as 
> > thus
> > all data makes it through unscathed.
> >
> > As soon as I switch off sendfile-support (in, e. g. Samba or Apache httpd), 
> > I
> > can run a series of thousands and more transfers, and not experience any
> > corruption at all. Whenever the data gets fubared, there is no hint at
> > anything fishy going on in the debug ringbuffer - curruption takes place in
> > total silence.
> >
> > The system in question has an Intel Pro/1000 PCI-e NIC for doing the 
> > networked
> > file transfers, and is backed by a md RAID5-Array with LVM2 on top. The 4GB 
> > of
> > system memory (ECC-enabled UDIMM) are operating in S4ECD4ED mode as reported
> > by EDAC, and there are no reported errors. The CPU I have installed is an 
> > AMD
> > Athlon II X2 245e on an ASUS M4A88TD-M/USB3 Motherboard. It's running Gentoo
> > for amd64. The box can run prime96 in torture mode and linpack just fine for
> > days - I'm therefore assuming the hardware to be working correctly.
> >
> > I have attached my kernel's config (from 3.4.0, as that's the image that I
> > have running right now) attached for sake of completeness, as well as some
> > information for you to see how I tested, and what these tests actually
> > produced. If you need any other information to help track this down, please
> > let me know.
> >
> > If you decide to answer please keep me CC'd, as I'm not subscribed to this
> > list.
> >
> > Just in case the numerous attachments get scrubbed/removed, I've also 
> > uploaded
> > them to http://johannes.truschnigg.info/tmp/sendfile_data_corruption/
> >
> > Thanks for reading, and have a nice weekend everyone :)
> >
> 
> Is the above corruption related to the one below?
> 
> 
> On Tue, Jul 3, 2012 at 8:02 AM, Willy Tarreau  wrote:
> >
> > In fact it has been true zero copy in 2.6.25 until we faced a large
> > amount of data corruption and the zero copy was disabled in 2.6.25.X.
> > Since then it remained that way until you brought your patches to
> > re-instantiate it.

Might be, or not (could be a NIC bug)

Please, Johannes, could you try the latest kernel tree ?



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vge

Re: resurrecting tcphealth

2012-07-14 Thread Eric Dumazet
On Sat, 2012-07-14 at 09:56 +0200, Piotr Sawuk wrote:
> On Sa, 14.07.2012, 03:31, valdis.kletni...@vt.edu wrote:
> > On Fri, 13 Jul 2012 16:55:44 -0700, Stephen Hemminger said:
> >
> >> >+ /* Course retransmit inefficiency- this packet has been 
> >> >received
> >> twice. */
> >> >+ tp->dup_pkts_recv++;
> >> I don't understand that comment, could you use a better sentence please?
> >
> > I think what was intended was:
> >
> > /* Curse you, retransmit inefficiency! This packet has been received at
> least twice */
> >
> 
> LOL, no. I think "course retransmit" is short for "course-grained timeout
> caused retransmit" but I can't be sure since I'm not the author of these
> lines. I'll replace that comment with the non-shorthand version though.
> however, I think the real comment here should be:
> 
> /*A perceived shortcoming of the standard TCP implementation: A
> TCP receiver can get duplicate packets from the sender because it cannot
> acknowledge packets that arrive out of order. These duplicates would happen
> when the sender mistakenly thinks some packets have been lost by the network
> because it does not receive acks for them but in reality they were
> successfully received out of order. Since the receiver has no way of letting
> the sender know about the receipt of these packets, they could potentially
> be re-sent and re-received at the receiver. Not only do duplicate packets
> waste precious Internet bandwidth but they hurt performance because the
> sender mistakenly detects congestion from packet losses. The SACK TCP
> extension specically addresses this issue. A large number of duplicate
> packets received would indicate a signicant benet to the wide adoption of
> SACK. The duplicatepacketsreceived metric is computed at the
> receiver and counts these packets on a per-connection basis.*/
> 
> as copied from his thesis at [1]. also in the thesis he writes:
> 
> In our limited experiment, the results indicated no duplicate packets were
> received on any connection in the 18 hour run. This leads us to several
> conclusions. Since duplicate ACKs were seen on many connections we know that
> some packets were lost or reordered, but unACKed reordered packets never
> caused /course grained timeouts/ on our connections. Only these timeouts
> will cause duplicate packets to be received since less severe out-of-order
> conditions will be resolved with fast retransmits. The lack of course
> timeouts
> may be due to the quality of UCSD's ActiveWeb network or the paucity of
> large gaps between received packet groups. It should be noted that Linux 2.2
> implements fast retransmits for up to two packet gaps, thus reducing the
> need for course grained timeouts due to the lack of SACK.
> 
> [1] https://sacerdoti.org/tcphealth/tcphealth-paper.pdf

Not sure how pertinent this paper still is today, in 2012.

I would prefer you add global counters, instead of per-tcp counters that
most applications won't use at all.

Example of a more useful patch : add a counter of packets queued in the Out
Of Order queue (in tcp_data_queue_ofo())

"netstat -s" will display the total count, without any changes in
userland tools/applications.
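
(A sketch of what such a counter could look like — my illustration, not
an actual patch; it assumes a new LINUX_MIB_TCPOFOQUEUE entry is added
to the usual snmp tables so "netstat -s" picks it up automatically.)

	/* in tcp_data_queue_ofo(), right before the skb is queued */
	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOQUEUE);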



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PROBLEM: Silent data corruption when using sendfile()

2012-07-14 Thread Eric Dumazet
On Sat, 2012-07-14 at 12:13 +0200, Johannes Truschnigg wrote:
> On Sat, Jul 14, 2012 at 10:31:36AM +0200, Willy Tarreau wrote:
> > > Please Johannes could you try latest kernel tree ?
> > 
> > It would be useful, especially given the amount of changes you performed
> > in this area in latest version, it could be very possible that this new
> > bug got fixed as a side effect !
> 
> I upgraded to 3.4.4 (identical config as the 3.4.0 build I've been running)
> and what can I say - the problem really seems to have disappeared. I performed
> about 3700 iterations of my previous tests over the night, and the data always
> turned out to be OK, not a single byte turned out kaput!
> 
> I wish I would have tested that earlier, and spared you the noise... well,
> maybe someone who runs into a similar problem in the future will have this
> discovery save her/him some time and headaches and make her/him just upgrade
> kernels :)
> 
> Thanks a lot for your polite and quick responses!
> 

Nice to hear. Now we should make sure we have all the needed fixes for prior
stable kernels as well!

Still trying to understand the issue, since I thought I only did
optimizations, not bug fixes. So maybe the real bug is still there, but its
probability of occurrence was lowered enough not to hit your workload.
 
Hmmm...



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PROBLEM: Silent data corruption when using sendfile()

2012-07-14 Thread Eric Dumazet
On Sat, 2012-07-14 at 12:44 +0200, Willy Tarreau wrote:
> On Sat, Jul 14, 2012 at 12:33:24PM +0200, Eric Dumazet wrote:
> > On Sat, 2012-07-14 at 12:13 +0200, Johannes Truschnigg wrote:
> > > On Sat, Jul 14, 2012 at 10:31:36AM +0200, Willy Tarreau wrote:
> > > > > Please Johannes could you try latest kernel tree ?
> > > > 
> > > > It would be useful, especially given the amount of changes you performed
> > > > in this area in latest version, it could be very possible that this new
> > > > bug got fixed as a side effect !
> > > 
> > > I upgraded to 3.4.4 (identical config as the 3.4.0 build I've been 
> > > running)
> > > and what can I say - the problem really seems to have disappeared. I 
> > > performed
> > > about 3700 iterations of my previos tests over the night, and the data 
> > > always
> > > turned out to be OK, not a single byte turned out kaput!
> > > 
> > > I wish I would have tested that earlier, and spared you the noise... well,
> > > maybe someone who runs into a similar problem in the future will have this
> > > discovery save her/him some time and headaches and make her/him just 
> > > upgrade
> > > kernels :)
> > > 
> > > Thanks a lot for your polite and quick responses!
> > > 
> > 
> > Nice to hear. Now we should make sure we have all needed fixes for prior
> > stable kernels as well !
> > 
> > Still trying to understand the issue, since I thought I only did
> > optimizations, not bug fixes. So maybe real bug is still there but its
> > probability of occurrence lowered enough to not hit your workload.
> 
> Please note that Johannes tested 3.4.4 while your changes are in 3.5-rc.
> 
> I'm wondering whether this patch merged into 3.4.2 one has an impact on
> sendfile :
> 
>   commit b642cb6a143da812f188307c2661c0357776a9d0
>   Author: Konstantin Khlebnikov 
>   Date:   Tue Jun 5 21:36:33 2012 +0400
> 
> radix-tree: fix contiguous iterator
> 
> commit fffaee365fded09f9ebf2db19066065fa54323c3 upstream.
> 
> This patch fixes bug in macro radix_tree_for_each_contig().
> 
> If radix_tree_next_slot() sees NULL in next slot it returns NULL, but 
> following
> radix_tree_next_chunk() switches iterating into next chunk. As result 
> iterating
> becomes non-contiguous and breaks vfs "splice" and all its users.
> 
> Willy
> 


Hmmm, this is supposed to fix a bug introduced in 3.4, no ?

So 3.3 kernel should work well ?



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PROBLEM: Silent data corruption when using sendfile()

2012-07-14 Thread Eric Dumazet
On Sat, 2012-07-14 at 22:08 +0800, Hillf Danton wrote:
> On Sat, Jul 14, 2012 at 4:20 PM, Eric Dumazet  wrote:
> >
> > Might be, or not (could be a NIC bug)
> >
> Dunno why sendfile sits in the layer of NIC and
> how they interact.

sendfile() relies heavily on TSO capabilities; a buggy NIC could
corrupt frame content on some obscure occasions.

We had some known cases on IPv6 for example.
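
A quick way to test that hypothesis is to turn segmentation offload off on
the suspect interface and re-run the workload (eth0 is only an example
interface name here) :

# disable TSO (and GSO) temporarily, then retry the sendfile() transfer
ethtool -K eth0 tso off gso off

If the corruption disappears with offload disabled, the NIC or its driver is
the prime suspect.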


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: resurrecting tcphealth

2012-07-15 Thread Eric Dumazet
On Sun, 2012-07-15 at 01:43 +0200, Piotr Sawuk wrote:

> oh, and again I recommend the really short although outdated thesis
> 
> [1] https://sacerdoti.org/tcphealth/tcphealth-paper.pdf

A thesis saying SACK is not useful is highly suspect.

Instead of finding out why it did not behave well and fixing the bugs, it
just says "SACK addition to TCP is not critical".

Really ?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: resurrecting tcphealth

2012-07-15 Thread Eric Dumazet
On Sun, 2012-07-15 at 11:17 +0200, Piotr Sawuk wrote:
> On So, 15.07.2012, 09:16, Eric Dumazet wrote:
> > On Sun, 2012-07-15 at 01:43 +0200, Piotr Sawuk wrote:
> >
> >> oh, and again I recommend the really short although outdated thesis
> >>
> >> [1] https://sacerdoti.org/tcphealth/tcphealth-paper.pdf
> >
> > A thesis saying SACK are not useful is highly suspect.
> >
> > Instead of finding why they behave not so good and fix the bugs, just
> > say "SACK addition to TCP is not critical"
> the actual quotation is "We also found that the number of unnecessary
> duplicate packets were quite small potentially indicating that the SACK
> addition to TCP is not critical."
> >
> > Really ?
> 
> no, not really. he he actually said that SACK has been made mostly obsolete
> by "Linux 2.2 implements fast retransmits for up to two packet gaps, thus
> reducing the need for course grained timeouts due to the lack of SACK." and
> he was a bit more careful and admitted that further tests with tcphealth are
> needed to check if SACK really makes that big a difference. he admitted "It
> could be that SACK's advantage lies in other areas such as very large
> downloads or when using slow and unreliable network links." all these things
> could be checked again nowadays, with larger files available and wlan-users
> and higher traffic -- just find something without SACK...

There are hundreds of papers about TCP behavior. Many are very good.

This one seems far from the best of them, and is based on measurements
done in 2001 (!!!), on a single computer (!!!), connected to a
particular ISP (!!!), using a wireless pcmcia network card. (!!!)

At that time, almost no clients were using SACK, because Windows 98/XP
don't negotiate SACK by default (you need to tweak the registry).

It's like saying ECN is useless : if ECN users are under 1 % of the total
number of users, the network is still under pressure and ECN benefits cannot
materialize because of the misbehavior of other flows.

With an RTT of 100 ms, SACK is clearly a win for long transfers.

A single drop shall retransmit a single packet instead of ~500 packets.
Only fools could deny this fact. Studying duplicate ACKs to detect
retransmits is clearly wrong.

Really, don't recommend this paper, it contains a lot of false
statements.

One example : "we discovered some surprising things as the high
percentage of lost or reordered packets from supposedly highly reliable
and fast services as Akamai networks". 

I can't believe such nonsense can be written, and recommended.

So you can add more counters to the TCP stack, because having them is good to
better understand TCP behavior and what can be done to improve it, but
don't base this work on the dubious 'tcphealth'.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] cgroup: fix panic in netprio_cgroup

2012-07-08 Thread Eric Dumazet
On Thu, 2012-07-05 at 17:28 +0800, Gao feng wrote:
> we set max_prioidx to the first zero bit index of prioidx_map in
> function get_prioidx.
> 
> So when we delete the low index netprio cgroup and adding a new
> netprio cgroup again,the max_prioidx will be set to the low index.
> 
> when we set the high index cgroup's net_prio.ifpriomap,the function
> write_priomap will call update_netdev_tables to alloc memory which
> size is sizeof(struct netprio_map) + sizeof(u32) * (max_prioidx + 1),
> so the size of array that map->priomap point to is max_prioidx +1,
> which is low than what we actually need.
> 
> fix this by adding check in get_prioidx,only set max_prioidx when
> max_prioidx low than the new prioidx.
> 
> Signed-off-by: Gao feng 
> ---
>  net/core/netprio_cgroup.c |3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> index 5b8aa2f..aa907ed 100644
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -49,8 +49,9 @@ static int get_prioidx(u32 *prio)
>   return -ENOSPC;
>   }
>   set_bit(prioidx, prioidx_map);
> + if (atomic_read(&max_prioidx) < prioidx)
> + atomic_set(&max_prioidx, prioidx);
>   spin_unlock_irqrestore(&prioidx_map_lock, flags);
> - atomic_set(&max_prioidx, prioidx);
>   *prio = prioidx;
>   return 0;
>  }

This patch seems fine to me.

Acked-by: Eric Dumazet 

Neil, looking at this file, I believe something is wrong.

dev->priomap is allocated by extend_netdev_table() called from
update_netdev_tables(). And this is only called if write_priomap() is
called.

But if write_priomap() is not called, it seems we can have out-of-bounds
accesses in cgrp_destroy() and read_priomap().

What do you think of the following patch ?

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 5b8aa2f..80150d2 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -141,7 +141,7 @@ static void cgrp_destroy(struct cgroup *cgrp)
rtnl_lock();
for_each_netdev(&init_net, dev) {
map = rtnl_dereference(dev->priomap);
-   if (map)
+   if (map && cs->prioidx < map->priomap_len)
map->priomap[cs->prioidx] = 0;
}
rtnl_unlock();
@@ -165,7 +165,7 @@ static int read_priomap(struct cgroup *cont, struct cftype *cft,
rcu_read_lock();
for_each_netdev_rcu(&init_net, dev) {
map = rcu_dereference(dev->priomap);
-   priority = map ? map->priomap[prioidx] : 0;
+   priority = (map && prioidx < map->priomap_len) ? map->priomap[prioidx] : 0;
cb->fill(cb, dev->name, priority);
}
rcu_read_unlock();


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] net: cgroup: fix out of bounds accesses

2012-07-09 Thread Eric Dumazet
From: Eric Dumazet 

dev->priomap is allocated by extend_netdev_table() called from
update_netdev_tables().
And this is only called if write_priomap() is called.

But if write_priomap() is not called, it seems we can have out of bounds
accesses in cgrp_destroy(), read_priomap() & skb_update_prio()

With help from Gao Feng

Signed-off-by: Eric Dumazet 
Cc: Neil Horman 
Cc: Gao feng 
---
net/core/dev.c            |    8 ++--
net/core/netprio_cgroup.c |    4 ++--
2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 84f01ba..0f28a9e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2444,8 +2444,12 @@ static void skb_update_prio(struct sk_buff *skb)
 {
struct netprio_map *map = rcu_dereference_bh(skb->dev->priomap);
 
-   if ((!skb->priority) && (skb->sk) && map)
-   skb->priority = map->priomap[skb->sk->sk_cgrp_prioidx];
+   if (!skb->priority && skb->sk && map) {
+   unsigned int prioidx = skb->sk->sk_cgrp_prioidx;
+
+   if (prioidx < map->priomap_len)
+   skb->priority = map->priomap[prioidx];
+   }
 }
 #else
 #define skb_update_prio(skb)
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index aa907ed..3e953ea 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -142,7 +142,7 @@ static void cgrp_destroy(struct cgroup *cgrp)
rtnl_lock();
for_each_netdev(&init_net, dev) {
map = rtnl_dereference(dev->priomap);
-   if (map)
+   if (map && cs->prioidx < map->priomap_len)
map->priomap[cs->prioidx] = 0;
}
rtnl_unlock();
@@ -166,7 +166,7 @@ static int read_priomap(struct cgroup *cont, struct cftype *cft,
rcu_read_lock();
for_each_netdev_rcu(&init_net, dev) {
map = rcu_dereference(dev->priomap);
-   priority = map ? map->priomap[prioidx] : 0;
+   priority = (map && prioidx < map->priomap_len) ? map->priomap[prioidx] : 0;
cb->fill(cb, dev->name, priority);
}
rcu_read_unlock();


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 82571EB: Detected Hardware Unit Hang

2012-07-09 Thread Eric Dumazet
On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
> Hi list,
> 
> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when doing
> scp test. this issue is easy do reproduced on SUN FIRE X2270 M2, just copy
> a big file (>500M) from another server will hit it at once. 
> 
> Would you please help on this?
> 

It's a known problem.

But apparently the Intel guys are not very responsive; they have a different
patch than the following one :

http://permalink.gmane.org/gmane.linux.network/232669


We just have to wait until they push their alternative patch, eventually.

In the meantime, you can use Hiroaki SHIMODA's patch, it works.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] net: cgroup: fix out of bounds accesses

2012-07-09 Thread Eric Dumazet
On Mon, 2012-07-09 at 07:01 -0400, Neil Horman wrote:

> Thank you for doing this Eric, Gao.  Just to be sure (I asked in the previous
> thread), would it be better to avoid the length check in skb_update_prio, and
> instead update the netdev tables to be long enough in cgrp_create and in
> netprio_device_event on device registration?

Yes, probably, and it is even needed because extend_netdev_table() can
actually fail to expand the table if kzalloc() returns NULL.

The current code just ignores this allocation failure, so we can also crash
in write_priomap().



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] net: cgroup: fix out of bounds accesses

2012-07-09 Thread Eric Dumazet
On Mon, 2012-07-09 at 08:13 -0400, Neil Horman wrote:
> On Mon, Jul 09, 2012 at 01:50:52PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-07-09 at 07:01 -0400, Neil Horman wrote:
> > 
> > > Thank you for doing this Eric, Gao.  Just to be sure (I asked in the 
> > > previous
> > > thread), would it be better to avoid the length check in skb_update_prio, 
> > > and
> > > instead update the netdev tables to be long enough in cgrp_create and in
> > > netprio_device_event on device registration?
> > 
> > Yes probably, and it is even needed because extend_netdev_table() can
> > acutally fail to expand the table if kzalloc() returned NULL.
> > 
> > Current code just ignores this allocation failure so we also can crash
> > in write_priomap()
> > 
> ACK, can you follow up with a patch please?

Gao was working on this allocation problem (he privately sent me a v1 of
his patch), so I think we can wait for Gao to submit a v2 combining all the
work/ideas in a single patch.

(i.e. make sure we don't need additional bounds checking in the fast path)



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] net: cgroup: fix access the unallocated memory in netprio cgroup

2012-07-09 Thread Eric Dumazet
On Tue, 2012-07-10 at 10:31 +0800, Gao feng wrote:
> there are some out of bound accesses in netprio cgroup.
> when creating a new netprio cgroup,we only set a prioidx for
> the new cgroup,without allocate memory for dev->priomap.
> 
> because we don't want to see additional bound checkings in
> fast path, so I think the best way is to allocate memory when we
> creating a new netprio cgroup.
> 
> and because netdev can be created or registered after cgroup being
> created, so extend_netdev_table is also needed in write_priomap.
> 
> this patch add a return value for update_netdev_tables & extend_netdev_table,
> so when new_priomap is allocated failed,write_priomap will stop to access
> the priomap,and return -ENOMEM back to the userspace to tell the user
> what happend.
> 
> Signed-off-by: Gao feng 
> Cc: Neil Horman 
> Cc: Eric Dumazet 
> ---

>  static void cgrp_destroy(struct cgroup *cgrp)
> @@ -221,7 +233,10 @@ static int write_priomap(struct cgroup *cgrp, struct 
> cftype *cft,
>   if (!dev)
>   goto out_free_devname;
>  
> - update_netdev_tables();
> + ret = update_netdev_tables();
> + if (ret < 0)
> + goto out_free_devname;
> +
>   ret = 0;
>   rcu_read_lock();
>   map = rcu_dereference(dev->priomap);

Hi Gao

Is it still needed to call update_netdev_tables() from write_priomap() ?



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Something very strange on x86_64 2.6.X kernels

2005-01-21 Thread Eric Dumazet
Petr Vandrovec wrote:
Maybe I already missed answer, but try patch below.  It is definitely bad
to mark syscall page as global one...
Hi Petr
If I follow you, any 64-bit program is corrupted as soon as one 32-bit
program using sysenter starts ?

Thank you for the patch, I will try it as soon as possible.
I tried your tpg program and had the same behavior you describe.
I confirm that avoiding the 0xE000 - 0x1 VM ranges is also
OK, the program never crashes...

Eric
When you build program below, once as 64bit and once as 32bit, 32bit one
should print 464C457F and 64bit one should die with SIGSEGV.  But when
you run both in parallel, 64bit one sometime gets SIGSEGV as it should,
sometime it gets 464C457F. (actually results below are from SMP system;
I believe that on UP you'll get reproducible 464C457F on UP system...)
vana:~/64bit-test# ./tpg32
Memory at e000 is 464C457F
vana:~/64bit-test# ./tpg
Segmentation fault
vana:~/64bit-test# ./tpg32 & ./tpg
[1] 8450
Memory at e000 is 464C457F
Memory at e000 is 464C457F
[1]+  Exit 31 ./tpg32
vana:~/64bit-test# ./tpg32 & ./tpg
[1] 8454
Memory at e000 is 464C457F
[1]+  Exit 31 ./tpg32
Segmentation fault
vana:~/64bit-test# ./tpg32 & ./tpg
[1] 8456
Memory at e000 is 464C457F
Memory at e000 is 464C457F
[1]+  Exit 31 ./tpg32
vana:~/64bit-test# ./tpg32 & ./tpg
[1] 8458
Memory at e000 is 464C457F
Memory at e000 is 464C457F
[1]+  Exit 31 ./tpg32
vana:~/64bit-test#
void main(void) {
int acc;
int i;
for (i = 0; i < 1; i++) ;
acc = *(volatile unsigned long*)(0xe000);
printf("Memory at e000 is %08X\n", acc);
}
Petr
diff -urdN linux/arch/x86_64/ia32/syscall32.c 
linux/arch/x86_64/ia32/syscall32.c
--- linux/arch/x86_64/ia32/syscall32.c  2005-01-17 12:29:05.0 +
+++ linux/arch/x86_64/ia32/syscall32.c  2005-01-21 16:15:04.0 +
@@ -55,7 +55,7 @@
if (pte_none(*pte)) {
set_pte(pte,
mk_pte(virt_to_page(syscall32_page),
-  PAGE_KERNEL_VSYSCALL));
+  PAGE_KERNEL_VSYSCALL32));
}
/* Flush only the local CPU. Other CPUs taking a fault
   will just end up here again
diff -urdN linux/include/asm-x86_64/pgtable.h linux/include/asm-x86_64/pgtable.h
--- linux/include/asm-x86_64/pgtable.h  2005-01-17 12:29:11.0 +
+++ linux/include/asm-x86_64/pgtable.h  2005-01-21 16:14:44.0 +
@@ -182,6 +182,7 @@
 #define PAGE_KERNEL_EXEC MAKE_GLOBAL(__PAGE_KERNEL_EXEC)
 #define PAGE_KERNEL_RO MAKE_GLOBAL(__PAGE_KERNEL_RO)
 #define PAGE_KERNEL_NOCACHE MAKE_GLOBAL(__PAGE_KERNEL_NOCACHE)
+#define PAGE_KERNEL_VSYSCALL32 __pgprot(__PAGE_KERNEL_VSYSCALL)
 #define PAGE_KERNEL_VSYSCALL MAKE_GLOBAL(__PAGE_KERNEL_VSYSCALL)
 #define PAGE_KERNEL_LARGE MAKE_GLOBAL(__PAGE_KERNEL_LARGE)
 #define PAGE_KERNEL_VSYSCALL_NOCACHE 
MAKE_GLOBAL(__PAGE_KERNEL_VSYSCALL_NOCACHE)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Time to change NR_OPEN value

2005-01-31 Thread Eric Dumazet
Time has come to change the NR_OPEN value: some production servers hit the
not-so-'ridiculously high value' of 1024*1024 file descriptors per process.

AFAIK it is safe to raise this value, because alloc_fd_array() uses
vmalloc() for large arrays, and vmalloc() returns NULL if a too-large
allocation is attempted (or in case of memory shortage).
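
For reference, the allocation helper of that era looked roughly like this
(a paraphrased sketch, not an exact quote of fs/file.c) :

/* Rough sketch: small fd arrays come from kmalloc(), larger ones fall back
 * to vmalloc(), which simply returns NULL when the request is too large or
 * memory is tight -- hence raising NR_OPEN stays safe.
 */
void *alloc_fd_array(int num)
{
	int size = num * sizeof(struct file *);

	if (size <= PAGE_SIZE)
		return kmalloc(size, GFP_KERNEL);
	return vmalloc(size);
}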

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
diff -Nru /tmp/fs.h include/linux/fs.h
--- linux.orig/include/linux/fs.h   2005-01-31 15:28:01.926685144 +0100
+++ linux/include/linux/fs.h  2005-01-31 15:29:37.047224624 +0100
@@ -32,7 +32,8 @@
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
  * the file limit at runtime and only root can increase the per-process
  * nr_file rlimit, so it's safe to set up a ridiculously high absolute
- * upper limit on files-per-process.
+ * upper limit on files-per-process. Actual limit depends on vmalloc()
+ * constraints.
  *
  * Some programs (notably those using select()) may have to be
  * recompiled to take full advantage of the new limits..
@@ -40,7 +41,7 @@
 /* Fixed constants first: */
 #undef NR_OPEN
-#define NR_OPEN (1024*1024)	/* Absolute upper limit on fd num */
+#define NR_OPEN (16*1024*1024) /* Absolute upper limit on fd num */
 #define INR_OPEN 1024  /* Initial setting for nfile rlimits */
 #define BLOCK_SIZE_BITS 10
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Something very strange on x86_64 2.6.X kernels

2005-01-20 Thread Eric Dumazet
Hi Andi
I have very strange coredumps happening on a big 64-bit program.
Some background :
- This program is multi-threaded
- Machine is a dual Opteron 248 machine, 12GB ram.
- Kernel 2.6.6 (tried 2.6.10 too, same problems)
- The program uses hugetlb pages.
- The program uses prefetchnta
- The program uses about 8GB of ram.
After numerous different core dumps of this program, and gdb debugging,
I found :

Every time, the crash occurs when one thread is using some ram located at
virtual address 0xe6xx

When examining the core image, the data saved on this page seems correct
(i.e. contains coherent user data). But one register (%rbx) is usually
corrupted and contains a small value (like 0x3c)

The last instruction using this register is :
prefetchnta 0x18(,%rbx,4)
Examining the linux sources, I found that 0xe000 is 'special' (ia32
vsyscall) and 0xe600 is around the sigreturn subsection of this special area.

Is it possible some VM trick just kicks in and corrupts my true 64-bit
program ?

Thank you
Eric Dumazet
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Something very strange on x86_64 2.6.X kernels

2005-01-20 Thread Eric Dumazet
Andrew Morton wrote:
Eric Dumazet <[EMAIL PROTECTED]> wrote:

Every time the crash occurs when one thread is using some ram located at 
virtual address 0xe6xx

What does "using" mean?  Is the program executing from that location?
No, the program text is located between 0x0010 and 0x001c6000  (no 
shared libs)

0xe6xx is READ|WRITE data, mapped on Hugetlb fs
extract from /proc/pid/maps
ff40-10040 rw-s 8200 00:0b 12960938 
   /huge/file

Interesting.  IIRC, opterons will very occasionally (and incorrectly) take
a fault when performing a prefetch against a dud pointer.  The kernel will
fix that up.  At a guess, I'd say tha the fixup code isn't doing the right
thing when the faulting EIP is in the vsyscall page.
Maybe, but I want to say that in this case, the address 'prefetched' is 
valid (ie mapped read/write by the program, on a huge page too)

Thanks
Eric Dumazet
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Add prefetch switch stack hook in scheduler function

2005-07-29 Thread Eric Dumazet

Ingo Molnar a écrit :



unroll prefetch_range() loops manually.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>

 include/linux/prefetch.h |   31 +--
 1 files changed, 29 insertions(+), 2 deletions(-)

Index: linux/include/linux/prefetch.h
===
--- linux.orig/include/linux/prefetch.h
+++ linux/include/linux/prefetch.h
@@ -58,11 +58,38 @@ static inline void prefetchw(const void 
 static inline void prefetch_range(void *addr, size_t len)

 {
 #ifdef ARCH_HAS_PREFETCH
-   char *cp;
+   char *cp = addr;
char *end = addr + len;
 
-	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)

+   /*
+* Unroll agressively:
+*/
+   if (len <= PREFETCH_STRIDE)
prefetch(cp);
+   else if (len <= 2*PREFETCH_STRIDE) {
+   prefetch(cp);
+   prefetch(cp + PREFETCH_STRIDE);
+   }
+   else if (len <= 3*PREFETCH_STRIDE) {
+   prefetch(cp);
+   prefetch(cp + PREFETCH_STRIDE);
+   prefetch(cp + 2*PREFETCH_STRIDE);
+   }
+   else if (len <= 4*PREFETCH_STRIDE) {
+   prefetch(cp);
+   prefetch(cp + PREFETCH_STRIDE);
+   prefetch(cp + 2*PREFETCH_STRIDE);
+   prefetch(cp + 3*PREFETCH_STRIDE);
+   }
+   else if (len <= 5*PREFETCH_STRIDE) {
+   prefetch(cp);
+   prefetch(cp + PREFETCH_STRIDE);
+   prefetch(cp + 2*PREFETCH_STRIDE);
+   prefetch(cp + 3*PREFETCH_STRIDE);
+   prefetch(cp + 4*PREFETCH_STRIDE);
+   } else
+   for (; cp < end; cp += PREFETCH_STRIDE)
+   prefetch(cp);
 #endif
 }
 
-


Please test that len is a constant, or else the inlined code is too large for
the non-constant case.

Thank you
static inline void prefetch_range(void *addr, size_t len)
{
	char *cp = addr;
	char *end = addr + len;

	if (__builtin_constant_p(len) && (len <= 5*PREFETCH_STRIDE)) {
		if (len <= PREFETCH_STRIDE)
			prefetch(cp);
		else if (len <= 2*PREFETCH_STRIDE) {
			prefetch(cp);
			prefetch(cp + PREFETCH_STRIDE);
		} else if (len <= 3*PREFETCH_STRIDE) {
			prefetch(cp);
			prefetch(cp + PREFETCH_STRIDE);
			prefetch(cp + 2*PREFETCH_STRIDE);
		} else if (len <= 4*PREFETCH_STRIDE) {
			prefetch(cp);
			prefetch(cp + PREFETCH_STRIDE);
			prefetch(cp + 2*PREFETCH_STRIDE);
			prefetch(cp + 3*PREFETCH_STRIDE);
		} else if (len <= 5*PREFETCH_STRIDE) {
			prefetch(cp);
			prefetch(cp + PREFETCH_STRIDE);
			prefetch(cp + 2*PREFETCH_STRIDE);
			prefetch(cp + 3*PREFETCH_STRIDE);
			prefetch(cp + 4*PREFETCH_STRIDE);
		}
	} else {
		for (; cp < end; cp += PREFETCH_STRIDE)
			prefetch(cp);
	}
}


[PATCH] mm/slab.c : prefetchw the start of new allocated objects

2005-07-29 Thread Eric Dumazet

[MM] slab.c : prefetchw the start of new allocated objects

Most objects returned by __cache_alloc() will be written by the caller
(not all callers want to write the whole object, but at least its beginning).
prefetchw() tells a modern CPU to think about the future writes, i.e. start
some memory transactions in advance.

Some CPUs lack a prefetchw() and currently do nothing, so I ask this question:
shouldn't prefetchw() do at least a prefetch() ? A read hint is better
than nothing.
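
A minimal sketch of what such a fallback could look like in the generic
header (illustrative only; this is the suggestion, not what the kernel
currently does) :

/* If the architecture has no write prefetch, at least issue a read
 * prefetch instead of expanding to nothing.
 */
#ifndef ARCH_HAS_PREFETCHW
#define prefetchw(x) prefetch(x)
#endif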


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

diff -Nru linux-2.6.13-rc4/mm/slab.c linux-2.6.13-rc4-ed/mm/slab.c
--- linux-2.6.13-rc4/mm/slab.c  2005-07-29 00:44:44.0 +0200
+++ linux-2.6.13-rc4-ed/mm/slab.c   2005-07-29 10:48:45.0 +0200
@@ -2166,6 +2166,7 @@
}
local_irq_restore(save_flags);
objp = cache_alloc_debugcheck_after(cachep, flags, objp, __builtin_return_address(0));
+   prefetchw(objp);
return objp;
 }
 


[PATCH] MM, NUMA : sys_set_mempolicy() doesnt check if mode < 0

2005-08-01 Thread Eric Dumazet

MM, NUMA : sys_set_mempolicy() doesn't check if mode < 0

A kernel BUG() is triggered by a call to set_mempolicy() with a negative
first argument.
This is because the mode is declared as an int, and the validity check
doesn't check < 0 values.
Alternatively, mode could be declared as unsigned int or unsigned long.

Thank you
Eric
-
Test program for x86_64:
-
#include 
#include 
#include 
#include 

#define __NR_set_mempolicy  238
#define __sys_set_mempolicy(mode, nmask, maxnode) _syscall3(int, set_mempolicy, 
int, mode, unsigned long *, nmask, unsigned long, maxnode)
static __sys_set_mempolicy(mode, nmask, maxnode)

unsigned long nodes = 3;

int main()
{
int ret = set_mempolicy(-6, &nodes, 2);
printf("result=%d errno=%d\n", ret, errno);
return 0;
}


Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

--- linux-2.6.13-rc4/mm/mempolicy.c 2005-07-29 00:44:44.0 +0200
+++ linux-2.6.13-rc4-ed/mm/mempolicy.c  2005-08-01 23:52:43.0 +0200
@@ -443,7 +443,7 @@
struct mempolicy *new;
DECLARE_BITMAP(nodes, MAX_NUMNODES);
 
-   if (mode > MPOL_MAX)
+   if ((unsigned int)mode > MPOL_MAX)
return -EINVAL;
err = get_nodes(nodes, nmask, maxnode, mode);
if (err)


[RFC] : SLAB : Could we have a process context only versions of kmem_cache_alloc(), and kmem_cache_free()

2005-08-04 Thread Eric Dumazet

Hi

The cost of local_irq_save(flags)/local_irq_restore(flags) in the slab
functions is very high:
popf, cli and pushf do stress modern processors.

Maybe we could provide special functions for caches that are known to be used
only from process context ?


These functions would use local_irq_save(flags)/local_irq_restore(flags) only
when needed (cache_alloc_refill() or cache_flusharray()).

Something like :

void *kmem_cache_alloc_noirq(kmem_cache_t *cachep, unsigned int __nocast flags)
{
unsigned long save_flags;
void* objp;
struct array_cache *ac;

cache_alloc_debugcheck_before(cachep, flags);
check_irq_on();
preempt_disable();
ac = ac_data(cachep);
if (likely(ac->avail)) {
STATS_INC_ALLOCHIT(cachep);
ac->touched = 1;
objp = ac_entry(ac)[--ac->avail];
} else {
STATS_INC_ALLOCMISS(cachep);
local_irq_save(save_flags);
objp = cache_alloc_refill(cachep, flags);
local_irq_restore(save_flags);
}
preempt_enable();
objp = cache_alloc_debugcheck_after(cachep, flags, objp, 
__builtin_return_address(0));
prefetchw(objp);
return objp;
}


void kmem_cache_free_noirq(kmem_cache_t *cachep, void *objp)
{
struct array_cache *ac;

check_irq_on();
preempt_disable();
ac  = ac_data(cachep);

objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));

if (likely(ac->avail < ac->limit)) {
STATS_INC_FREEHIT(cachep);
} else {
unsigned long flags;
STATS_INC_FREEMISS(cachep);
local_irq_save(flags);
cache_flusharray(cachep, ac);
local_irq_restore(flags);
}
ac_entry(ac)[ac->avail++] = objp;
preempt_enable();
}
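
Intended usage would look like this (hypothetical caller; my_obj and
my_obj_cachep are example names, for a cache that is only ever touched from
process context) :

/* Hypothetical usage of the proposed variants; the cache must never be
 * used from interrupt context, otherwise the regular functions are needed.
 */
struct my_obj *obj = kmem_cache_alloc_noirq(my_obj_cachep, GFP_KERNEL);
if (obj) {
	/* ... work in process context ... */
	kmem_cache_free_noirq(my_obj_cachep, obj);
}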

Thank you

Eric Dumazet

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86_64 : prefetchw() can fall back to prefetch() if !3DNOW

2005-07-28 Thread Eric Dumazet

[PATCH] x86_64 : prefetchw() can fall back to prefetch() if !3DNOW

If the CPU lacks the 3DNOW feature, we can use a normal prefetcht0 instruction
instead of a NOP5.
"prefetchw (%rxx)" and "prefetcht0 (%rxx)" have the same length, ranging from 3
to 5 bytes depending on the register. So this patch even helps AMD64,
shortening the length of the code.

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>
--- linux-2.6.13-rc3/include/asm-x86_64/processor.h 2005-07-13 
06:46:46.0 +0200
+++ linux-2.6.13-rc3-ed/include/asm-x86_64/processor.h  2005-07-28 
18:47:39.0 +0200
@@ -398,7 +398,7 @@
 #define ARCH_HAS_PREFETCHW 1
 static inline void prefetchw(void *x) 
 { 
-   alternative_input(ASM_NOP5,
+   alternative_input("prefetcht0 (%1)",
  "prefetchw (%1)",
  X86_FEATURE_3DNOW,
  "r" (x));


[PATCH] random : prefetch the whole pool, not 1/4 of it

2005-07-28 Thread Eric Dumazet

Hi Matt

Could you check this patch and apply it ?

Thank you

Eric

[RANDOM] : prefetch the whole pool, not 1/4 of it,
   (pool contains u32 words, not bytes)

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

--- linux-2.6.13-rc3/drivers/char/random.c  2005-07-13 06:46:46.0 
+0200
+++ linux-2.6.13-rc3-ed/drivers/char/random.c   2005-07-29 00:11:24.0 
+0200
@@ -469,7 +469,7 @@
next_w = *in++;
 
spin_lock_irqsave(&r->lock, flags);
-   prefetch_range(r->pool, wordmask);
+   prefetch_range(r->pool, wordmask*4);
input_rotate = r->input_rotate;
add_ptr = r->add_ptr;
 


[PATCH] eventpoll : Suppress a short lived lock from struct file

2005-07-11 Thread Eric Dumazet

Hi Davide

I found in my tests that there is no need to have an f_ep_lock spinlock
attached to each struct file, using 8 bytes on 64-bit platforms. The
lock is held for a very short time period and can be global, with almost
no change in performance for applications using epoll, and a gain for
all others.

Thank you
Eric Dumazet

[PATCH] eventpoll : Suppress a short lived lock from struct file

Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]>

--- linux-2.6.12/fs/eventpoll.c 2005-06-17 21:48:29.0 +0200
+++ linux-2.6.12-ed/fs/eventpoll.c  2005-07-11 08:56:07.0 +0200
@@ -179,6 +179,8 @@
spinlock_t lock;
 };
 
+static DEFINE_SPINLOCK(f_ep_lock);
+
 /*
  * This structure is stored inside the "private_data" member of the file
  * structure and rapresent the main data sructure for the eventpoll
@@ -426,7 +428,6 @@
 {
 
INIT_LIST_HEAD(&file->f_ep_links);
-   spin_lock_init(&file->f_ep_lock);
 }
 
 
@@ -967,9 +968,9 @@
goto eexit_2;
 
/* Add the current item to the list of active epoll hook for this file */
-   spin_lock(&tfile->f_ep_lock);
+   spin_lock(&f_ep_lock);
list_add_tail(&epi->fllink, &tfile->f_ep_links);
-   spin_unlock(&tfile->f_ep_lock);
+   spin_unlock(&f_ep_lock);
 
/* We have to drop the new item inside our item list to keep track of it */
write_lock_irqsave(&ep->lock, flags);
@@ -1160,7 +1161,6 @@
 {
int error;
unsigned long flags;
-   struct file *file = epi->ffd.file;
 
/*
 * Removes poll wait queue hooks. We _have_ to do this without holding
@@ -1173,10 +1173,10 @@
ep_unregister_pollwait(ep, epi);
 
/* Remove the current item from the list of epoll hooks */
-   spin_lock(&file->f_ep_lock);
+   spin_lock(&f_ep_lock);
if (EP_IS_LINKED(&epi->fllink))
EP_LIST_DEL(&epi->fllink);
-   spin_unlock(&file->f_ep_lock);
+   spin_unlock(&f_ep_lock);
 
/* We need to acquire the write IRQ lock before calling ep_unlink() */
write_lock_irqsave(&ep->lock, flags);
--- linux-2.6.12/include/linux/fs.h 2005-06-17 21:48:29.0 +0200
+++ linux-2.6.12-ed/include/linux/fs.h  2005-07-11 08:58:02.0 +0200
@@ -597,7 +597,6 @@
 #ifdef CONFIG_EPOLL
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct list_headf_ep_links;
-   spinlock_t  f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
struct address_space*f_mapping;
 };


Re: [PATCH] eventpoll : Suppress a short lived lock from struct file

2005-07-11 Thread Eric Dumazet

Peter Zijlstra a écrit :

On Mon, 2005-07-11 at 09:18 +0200, Eric Dumazet wrote:

Have you tested the impact of this change on big SMP/NUMA machines?
I hate to see an Altrix crashing to its knees :-)



I tested on a small NUMA machine (2 nodes), with an epoll-enabled application
that does around 100 epoll_ctl calls per second.

Of course, one may write a special benchmark on a BIG SMP/NUMA machine that
defeats this patch, using thousands of epoll_ctl calls per second, but a
normal (well written ?) epoll application doesn't constantly add/remove
epoll fds.

Should we waste 8 bytes per 'struct file' for a very unlikely micro benchmark ?

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] eventpoll : Suppress a short lived lock from struct file

2005-07-11 Thread Eric Dumazet

Davide Libenzi a écrit :

Eric, I can't really say I like this one. Not at least after extensive 
tests run on top of it.


fair enough :)

You are asking to add a bottleneck to save 8 
bytes on an entity that taken alone in more than 120 bytes. Consider 
that when you have a "struct file" allocated, the cost on the system is 
not only the struct itself, but all the allocations associated with it. 
For example, if you consider that a case where you might feel a "struct 
file" pressure is when you have hundreds of thousands of network 
connections, the 8 bytes saved compared to all the buffers associated 
with those sockets boils down to basically nothing.


Well, the filp_cachep slab is created with SLAB_HWCACHE_ALIGN, enforcing an
alignment of 64 bytes or even 128 bytes.

So it can be useful to let the size of struct file go from 0x84 to 0x80,
because we can gain 64 or 128 bytes per file (0x80 bytes really allocated
instead of 0xc0 or even 0x100 on Pentium 4).


In my case, I use other patches outside the scope of eventpoll (like declaring
f_security only #ifdef CONFIG_SECURITY_SELINUX), and really gain 128 bytes of
low memory per file. It reduces cache pressure for a given workload, and
reduces lowmem pressure.


Before :

# grep filp /proc/slabinfo
filp   66633  66750256   151 : tunables  120   608 : 
slabdata   4450   4450 60


After :

# grep filp /proc/slabinfo
filp   82712  82987128   311 : tunables  120   608 : 
slabdata   2677   2677 20


It may appear to you as a penalty, but at least for me it is a noticeable gain.

Another candidate for "file struct" size reduction is the big struct
file_ra_state that is included in all files, even sockets that don't use
it, but that's a different story :)


Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[BUG?] x86_64 : Can not read /dev/kmem ?

2005-03-14 Thread Eric Dumazet
Hi Andi
I tried to read /dev/kmem on x86_64 (linux-2.6.11) and had no success.
read() or pread() returns EINVAL.
I tried mmap() too : mmap() calls succeed, but as soon as the user process
dereferences the memory, we get :

tinfo: Corrupted page table at address 2aabf800
PGD 8a983067 PUD c7e5a067 PMD 91588067 PTE 8048a025
Bad pagetable: 000d [1] SMP
CPU 0
Modules linked in: ipt_REJECT
Pid: 10892, comm: tinfo Not tainted 2.6.11
RIP: 0033:[<00100562>] [<00100562>]
RSP: 002b:7790  EFLAGS: 00010217
RAX: 2aabf000 RBX: 2abbe000 RCX: 2ac8fc0c
RDX: 0001 RSI: 1000 RDI: 
RBP: 77f8 R08: 0003 R09: 8048a000
R10: 0001 R11: 0206 R12: 001005b0
R13: 0001 R14: 2adfdfe8 R15: 00100530
FS:  2abcb970() GS:804866c0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 2aabf800 CR3: 90368000 CR4: 06e0
Process tinfo (pid: 10892, threadinfo 8100901b, task 
8100c7d976c0)

RIP [<00100562>] RSP <7ffff790>

Thank you
Eric Dumazet

# cat tinfo.c
#define _XOPEN_SOURCE 500
#include 
#include 
#include 
struct tcp_hashinfo {
struct tcp_ehash_bucket *__tcp_ehash;
struct tcp_bind_hashbucket *__tcp_bhash;
int __tcp_bhash_size;
int __tcp_ehash_size;
} tcp_hashinfo;
#define TCPINFO_ADDR 0x8048a000 /* tcp_hashinfo */
int main()
{
int fd = open("/dev/kmem", O_RDONLY) ;
if (pread(fd, &tcp_hashinfo, sizeof(tcp_hashinfo), TCPINFO_ADDR) == -1) {
lseek(fd, TCPINFO_ADDR, 0) ;
if (read(fd, &tcp_hashinfo, sizeof(tcp_hashinfo)) == -1) {
perror("Can not read /dev/kmem ?") ;
return 1 ;
}
}
printf("ehash=%p esize=%d bhash=%p bsize=%d\n",
tcp_hashinfo.__tcp_ehash,
tcp_hashinfo.__tcp_ehash_size,
tcp_hashinfo.__tcp_bhash,
tcp_hashinfo.__tcp_bhash_size) ;
return 0 ;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


x86: spin_unlock(), spin_unlock_irq() & others are out of line ?

2005-03-15 Thread Eric Dumazet
Hi all
I noticed that in current linux kernel versions (2.6.11), some basic 
functions are out of line (not inlined)

Example of a call to spin_unlock(&somelock)
c01069fa:   b8 e8 7b 35 c0  mov$0xc0357be8,%eax
c01069ff:   e8 3c e4 1f 00  call   c0304e40 <_spin_unlock>
c0304e40 <_spin_unlock>:
c0304e40:   c6 00 01movb   $0x1,(%eax)
c0304e43:   c3  ret
Same problem for _write_unlock(), _read_unlock(), _spin_unlock_irq(), ...
That seems odd, and I fail to see the reason for that. (It's OK for 
complex functions, but not for very short ones...)

Is it a regression, or is it needed ?
configuration :
- SMP
- Processor family (Pentium-4/Celeron(P4-based)/Pentium-4 M/Xeon)
- No "Generic x86 support"
Thank you
Eric Dumazet
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Bogus buffer length check in linux-2.6.11 read()

2005-03-16 Thread Eric Dumazet
linux-os wrote:

I don't know how much more precise I could have been. I show the
code that will cause the observed condition. I explain that this
condition is new, that it doesn't correspond to the previous
behavior.
Never before was some buffer checked for length before some data
was written to it. The EFAULT is supposed to occur IFF a write
attempt occurs outside the caller's accessible address space.
This used to be done by hardware during the write to user-space.
This had zero impact upon performance. Now there is some
software added that adds CPU cycles, subtracts performance,
and cannot possibly do anything useful.
Also, the code was written to show the problem. The code
is not designed to be an example of good coding practice.
The actual problem observed with the new kernel was
when some legacy code used gets() instead of fgets().
The call returned immediately with an EFAULT because
the 'C' runtime library put some value that the kernel
didn't 'like' (4096 bytes) in the subsequent read.

If you use a buggy program that had a hidden bug now exposed because
of different kernel checks, don't complain, and use your brain.

If you do
$ export VAR1=" A very very very very long chain just to be sure my 
environnement (which is placed at the top of the stack at exec() time) 
will use at least 4 Kb  : then my litle buggy program will run if I 
type few chars but destroy my stack if I type a long string or if I 
use : cat longfile | ./xxx : So I wont complain again on lkml that I 
am so lazy. Oh what could I type now, I'm tired, maybe I can copy 
this string to others variables. Yes... sure"
$ export VAR2=$VAR1
$ export VAR3=$VAR1
$ export VAR4=$VAR1
$ export VAR5=$VAR1
Then check your env size is large enough
$ env|wc -c
   4508
$ ./xxx
./xxx 2>/dev/null

Apparently the kernel thinks 4096 is a good length!
So what ? Your program works well now, on a typical linux-2.6.11
machine. Ready to buffer-overflow again.

Maybe you can pay me $1000 :)
Eric Dumazet
This is code for which there are no sources available
and it is required to be used, cannot be replaced,
cannot be thrown away and costs about US$ 10,000
from a company that is no longer in business.
Somebody's arbitrary and capricious addition of spook
code destroyed an application's functionality.
Cheers,
Dick Johnson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 32Bit vs 64Bit

2005-03-16 Thread Eric Dumazet
regatta wrote:
Hi everyone,
I have a question about the 64Bit mode in AMD 64bit
My question is if I run a 32Bit application in Optreon AMD 64Bit with
Linux 64Bit does this give my any benefit ? I mean running 32Bit
application in 64Bit machine with 64 Linux is it better that running
it in 32Bit or doesn't make any different at all ?
Thanks
Hi
Running a 32-bit application on an x86_64 kernel gives more virtual
address space : 4GB of user memory, instead of 3GB on a standard 32-bit kernel.

If your application uses a lot of in-kernel resources (like tcp
sockets and network buffers), it also won't be constrained by the
pressure a 32-bit kernel has on lowmem (typically 896 MB of lowmem).

If your machine has less than 2GB, running a 64-bit kernel is not a
win, because all kernel data uses more ram (pointers are 64 bits
instead of 32 bits).

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst cache overflow

2005-03-22 Thread Eric Dumazet
[EMAIL PROTECTED] a écrit :
I see on 2.6.10/2.6.11.3

Hello
Could you give us the results of these commands :
# grep . /proc/sys/net/ipv4/route/*
# cat /proc/net/stat/rt_cache
Eric Dumazet
Quoting Phil Oester <[EMAIL PROTECTED]>:
On Tue, Mar 22, 2005 at 10:39:43AM +0200, [EMAIL PROTECTED] 
wrote:

computer's main job is to be router on small LAN with 10 users and  some
services like qmail, apache, proftpd, shoutcast, squid, and ices on 
slack
10.1. Iptables and tc are used to limit  bandwiwdth and the two 
bandwidthd
 daemons are running on eth0 interface and all the time the cpu is 
used at
about 0.4% and additional 12% by ices  when encoding mp3 on demand, and
the proccess ksoftirqd/0 randomally starts to use 100% of 0 cpu in 
normal
situation and one time when the ksoftirqd/0 became crazy i noticed dst
cache overflow messages in syslog but there are more of thies lines in
logs  about 5 times in 10 days period

There was a problem fixed in the handling of fragments which caused dst
cache overflow in the 2.6.11-rc series.  Are you still seeing dst cache
overflow on 2.6.11?
Phil


This message was sent using IMP, the Internet Messaging Program.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dst cache overflow

2005-03-22 Thread Eric Dumazet
[EMAIL PROTECTED] a écrit :

grep . /proc/sys/net/ipv4/route/*
/proc/sys/net/ipv4/route/error_burst:5000
/proc/sys/net/ipv4/route/error_cost:1000
grep: /proc/sys/net/ipv4/route/flush: Invalid argument
/proc/sys/net/ipv4/route/gc_elasticity:8
/proc/sys/net/ipv4/route/gc_interval:60
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:4096
/proc/sys/net/ipv4/route/gc_timeout:300
/proc/sys/net/ipv4/route/max_delay:10
/proc/sys/net/ipv4/route/max_size:65536
/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_delay:2
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600
/proc/sys/net/ipv4/route/redirect_load:20
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:20480
/proc/sys/net/ipv4/route/secret_interval:600
cat /proc/net/stat/rt_cache
entries  in_hit in_slow_tot in_no_route in_brd in_martian_dst 
in_martian_src out_hit out_slow_tot out_slow_mc  gc_total gc_ignored 
gc_goal_miss
gc_dst_overflow in_hlist_search out_hlist_search
00b9  02e05549 01fa47b9   00016e03 0022 00251b22 
00083e65 fe7e 0008 00f15fc6 00f064e8 ebe8 eb57 08703a77
87cf
00b9         
0001105a 27ed 0002 018f 0171 000e 0009 
3217
OK, route cache settings are hard to tune.
Try these settings :
echo 1 >/proc/sys/net/ipv4/route/gc_interval
echo 2 >/proc/sys/net/ipv4/route/gc_elasticity
echo 150 >/proc/sys/net/ipv4/route/gc_timeout
echo 8192 >/proc/sys/net/ipv4/route/gc_thresh
You might want to boot with "rhash_entries=8192" added to your kernel boot
string (append="rhash_entries=8192" in lilo.conf).
(If you change rhash_entries, change also /proc/sys/net/ipv4/route/gc_thresh 
accordingly)
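
To make the same settings persistent across reboots, the equivalent sysctl
keys can go in /etc/sysctl.conf (same values as the echo commands above) :

net.ipv4.route.gc_interval = 1
net.ipv4.route.gc_elasticity = 2
net.ipv4.route.gc_timeout = 150
net.ipv4.route.gc_thresh = 8192
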
You might also try this patch.
Eric Dumazet
--- linux-2.6.11/net/ipv4/route.c   2005-03-17 11:19:45.0 +0100
+++ linux-2.6.11-ed/net/ipv4/route.c2005-03-21 12:01:23.0 +0100
@@ -54,6 +54,8 @@
  * Marc Boucher:   routing by fwmark
  * Robert Olsson   :   Added rt_cache statistics
  * Arnaldo C. Melo :   Convert proc stuff to seq_file
+ *  Eric Dumazet:   hashed spinlocks and rt_check_expire() fixes.
+ * :   bugfix in rt_cpu_seq_show()
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
@@ -107,12 +109,13 @@
 #define IP_MAX_MTU 0xFFF0
 
 #define RT_GC_TIMEOUT (300*HZ)
+#define RT_GC_INTERVAL (RT_GC_TIMEOUT/10) /* rt_check_expire() scans 1/10 of the table each round */
 
 static int ip_rt_min_delay = 2 * HZ;
 static int ip_rt_max_delay = 10 * HZ;
 static int ip_rt_max_size;
 static int ip_rt_gc_timeout= RT_GC_TIMEOUT;
-static int ip_rt_gc_interval   = 60 * HZ;
+static int ip_rt_gc_interval   = RT_GC_INTERVAL;
 static int ip_rt_gc_min_interval   = HZ / 2;
 static int ip_rt_redirect_number   = 9;
 static int ip_rt_redirect_load = HZ / 50;
@@ -124,6 +127,7 @@
 static int ip_rt_min_pmtu  = 512 + 20 + 20;
 static int ip_rt_min_advmss= 256;
 static int ip_rt_secret_interval   = 10 * 60 * HZ;
+static int ip_rt_debug ;
 static unsigned long rt_deadline;
 
 #define RTprint(a...)  printk(KERN_DEBUG a)
@@ -197,8 +201,24 @@
 
 struct rt_hash_bucket {
struct rtable   *chain;
-   spinlock_t  lock;
-} __attribute__((__aligned__(8)));
+};
+
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+/*
+ * Instead of using one spinlock for each rt_hash_bucket, we use a table of fixed size spinlocks
+ */
+# define RT_HASH_LOCK_SZ 256
+   static spinlock_t   rt_hash_lock[RT_HASH_LOCK_SZ];
+# define rt_hash_lock_addr(slot) &rt_hash_lock[slot & (RT_HASH_LOCK_SZ - 1)]
+# define rt_hash_lock_init()   { \
+   int i; \
+   for (i = 0; i < RT_HASH_LOCK_SZ; i++) \
+   spin_lock_init(&rt_hash_lock[i]); \
+   }
+#else
+# define rt_hash_lock_addr(slot) NULL
+# define rt_hash_lock_init()
+#endif
 
 static struct rt_hash_bucket   *rt_hash_table;
 static unsignedrt_hash_mask;
@@ -393,7 +413,7 @@
struct rt_cache_stat *st = v;
 
if (v == SEQ_START_TOKEN) {
-   seq_printf(seq, "entries  in_hit in_slow_tot in_no_route in_brd in_martian_dst in_martian_src  out_hit out_slow_tot out_slow_mc  gc_total gc_ignored gc_goal_miss gc_dst_overflow in_hlist_search out_hlist_search\n");
+   seq_printf(seq, "entries  in_hit in_slow_tot in_slow_mc in_no_route in_brd in_martian_dst in_martian_src  out_hit out_slow_tot out_slow_mc  gc_total gc_ignored gc_goal_miss gc_dst_overflow i

Re: question about sockfd_lookup( )

2005-02-28 Thread Eric Dumazet
Hi
Try adding sockfd_put(sock) ;
MingJie Chang wrote:
Dear all,
I want to get socket information by the sockfd while accetping,
so I write a module to test sockfd_lookup(),
but I got some problems when I test it.
I hope someone can help me...
Thank you
following text is my code and error message
===
=== code ===
int my_socketcall(int call,unsigned long *args)  
{
   int ret,err;
   struct socket * sock;

   ret = run_org_socket_call(call,args);   //original sys_sockcall()
   
   if(call==SYS_ACCEPT&&ret>=0) 
   {
  sock=sockfd_lookup(ret,&err);
  printk("lookup done\n");
if (sock) sockfd_put(sock) ;
   }
   return ret;
}
Eric Dumazet
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Performance Stats: Kernel patch

2007-04-11 Thread Eric Dumazet

Maxim Uvarov a écrit :

Eric Dumazet wrote:

 >Please check kernel/sys.c:k_getrusage() to see how getrusage() has to 
sum *lot* of individual fields to get precise process numbers (even 
counting stats for dead threads)




Thanks for helping me and for this link. But it is not enough clear for 
me what do you mean at this time.  Inside of patch I am using 2 default 
counters
task_struct->nivcsw and task_struct->nvcsw. And also one new syscall 
counter. And there is only one way to increment this counter, it is from 
entry.S.


If you are speaking about locks,  in my point of view, they are not 
needed in this code. Because increment syscall counter is atomic for X86 
(just one assembly instruction) and in case with PPC (3 instructions) 
there 1) nothing wrong will not happen in any case 2) only own thread 
can increase it's syscall counter. So here should be not any race 
conditions.


I was not speaking about locks.



I've tested this patch on x86,x86_64,and ppc_32. And I should work now 
with ppc_64 (I didn't check).

And  also updated description.

Best regards,
Maxim Uvarov.




Patch adds Process Performance Statistics.
It make available to the user the following 
thread performance statistics:

   * Involuntary Context Switches (task_struct->nivcsw)
   * Voluntary Context Switches (task_struct->nvcsw)
   * Number of system calls (added new counter 
 thread_info->sysc_cnt)


Statistics information is available from
/proc/PID/status
   
This data is useful for detecting hyperactivity 
patterns between processes.




What I meant is : you falsely speak of 'PROCESS performance statistics'.

Your implementation only cares about threads, not processes.
There is a slight difference, one that getrusage() does handle.

So if you do "cat /proc/PID/status", you'll get counters not for the PROCESS,
only for the main thread of the process.


If you want an analogy, imagine a "ps aux" that doesn't show the cpu time of
all threads of a process, but only the cpu time of the main thread. Quite
meaningless, isn't it ?


So either :

1) You change all your description to mention 'thread' instead of 'process'.

2) You change your implementation to match your claim.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] make MADV_FREE lazily free memory

2007-04-11 Thread Eric Dumazet

Rik van Riel a écrit :

Make it possible for applications to have the kernel free memory
lazily.  This reduces a repeated free/malloc cycle from freeing
pages and allocating them, to just marking them freeable.  If the
application wants to reuse them before the kernel needs the memory,
not even a page fault will happen.



Hi Rik

I don't understand this last sentence. If not even a page fault happens, how
does the kernel know that the page was eventually reused by the application,
and should not be freed in case of memory pressure ?


ptr = mmap(some space);
madvise(ptr, length, MADV_FREE);
/* kernel may free the pages */
sleep(10);

/* what must the application do now before reusing the space ? */
memset(ptr, data, 1);
/* kernel should not free ptr[0..1] now */

Thank you
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] make MADV_FREE lazily free memory

2007-04-11 Thread Eric Dumazet

Rik van Riel a écrit :

Eric Dumazet wrote:

Rik van Riel a écrit :

Make it possible for applications to have the kernel free memory
lazily.  This reduces a repeated free/malloc cycle from freeing
pages and allocating them, to just marking them freeable.  If the
application wants to reuse them before the kernel needs the memory,
not even a page fault will happen.


I don't understand this last sentence. If not even a page fault 
happens, how does the kernel know that the page was eventually reused by 
the application, and should not be freed in case of memory pressure ?


Before maybe freeing the page, the kernel checks the referenced
and dirty bits of the page table entries mapping that page.


ptr = mmap(some space);
madvise(ptr, length, MADV_FREE);
/* kernel may free the pages */


All this call does is:
- clear the accessed and dirty bits
- move the page to the far end of the inactive list,
  where it will be the first to be reclaimed


sleep(10);

/* what the application must do now before reusing space ? */
memset(ptr, data, 1);
/* kernel should not free ptr[0..1] now */


Two things can happen here.

If this program used the pages before the kernel needed
them, the program will be reusing its old pages.


Ah OK, this is because the accessed/dirty bits are set by the hardware and not 
by a page fault. Is that true for all architectures ?




If the kernel got there first, you will get page faults
and the kernel will fill in the memory with new pages.


perfect



Both of these alternatives are transparent to userspace.



Thanks a lot for these clarifications. This will fly :)
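
To make the pattern concrete, here is a minimal user-space sketch of the
cycle discussed above, assuming a kernel that implements MADV_FREE as
described (the MADV_FREE value below is the one mainline eventually used;
the patch may use a different number):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8		/* assumed value, see note above */
#endif

int main(void)
{
	size_t len = 1 << 20;
	char *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (ptr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(ptr, 0xaa, len);		/* pages are now dirty */
	madvise(ptr, len, MADV_FREE);	/* kernel may reclaim them lazily */

	/*
	 * Simply writing again is enough: either the old pages are still
	 * there (their dirty bits get set again, so reclaim skips them),
	 * or a fault brings in fresh pages.  No extra syscall is needed.
	 */
	memset(ptr, 0xbb, len);

	munmap(ptr, len);
	return 0;
}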


Re: [PATCH 0/4] i386 - pte update optimizations

2007-04-12 Thread Eric Dumazet

Zachary Amsden wrote:


Yes.  Even then, last time I clocked instructions, xchg was still slower 
than read / write, although I could be misremembering.  And it's not 
totally clear that they will always be in cached state, however, and for 
SMP, we still want to drop the implicit lock in cases where the 
processor might not know they are cached exclusive, but we know there 
are no other racing users.  And there are plenty of old processors out 
there to still make it worthwhile.




Is there one processor that benefits from this patch then ?

I couldn't get a win on my test machines, maybe they are not old enough ;)

umask() doesn't need xchg() atomic semantics. If several threads use umask() 
concurrently, the results are not guaranteed anyway.
--- linux-2.6.21-rc6/kernel/sys.c
+++ linux-2.6.21-rc6-ed/kernel/sys.c
@@ -2138,8 +2138,10 @@ asmlinkage long sys_getrusage(int who, s
 
 asmlinkage long sys_umask(int mask)
 {
-   mask = xchg(&current->fs->umask, mask & S_IRWXUGO);
-   return mask;
+   struct fs_struct *fs = current->fs;
+   int old = fs->umask;
+   fs->umask = mask & S_IRWXUGO;
+   return old;
 }
 
 asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,


Re: [patch] generic rwsems

2007-04-13 Thread Eric Dumazet
On Fri, 13 Apr 2007 14:31:52 +0100
David Howells <[EMAIL PROTECTED]> wrote:


> Break the counter down like this:
> 
>   0x00000000  - not locked; queue empty
>   0x40000000  - locked by writer; queue empty
>   0xc0000000  - locked by writer; queue occupied
>   0x0nnnnnnn  - n readers; queue empty
>   0x8nnnnnnn  - n readers; queue occupied

If space considerations are that important, we could then reserve one bit for 
the 'wait_lock spinlock'

0x20000000 : one cpu gained control of 'wait_list'

This would save 4 bytes on 32 bit platforms.

64 bit platforms could have a limit of 2^60 threads, instead of the way too 
small 2^28 one ;)

(we lose the debug version of the spinlock, of course)
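
Spelled out, the layout with that extra guard bit could look something like
this (hypothetical names, 32 bit case only):

#define RWSEM_QUEUED		0x80000000UL	/* wait queue occupied */
#define RWSEM_WRITE_LOCKED	0x40000000UL	/* held by a writer */
#define RWSEM_LIST_GUARD	0x20000000UL	/* one cpu owns wait_list */
#define RWSEM_READER_MASK	0x0fffffffUL	/* number of active readers */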

Another possibility to save space would be to move wait_lock/wait_list outside 
of rw_semaphore, in a hashed global array.
This would save 12/16 bytes per rw_semaphore (inode structs are probably the 
most demanding)


Re: Oddness with reading /proc/net/tcp

2007-04-13 Thread Eric Dumazet

Witold Krecicki wrote:
Reading data from /proc/net/tcp gets slower as the read progresses; tested on 
a system with >200k active connections.




Yes, this is a known problem.

This is an O(N^2) algorithm.

Use ss from the iproute package to get better performance... (less than 15 
seconds for 200k active connections)






Re: [KJ][PATCH 03/04]use set_current_state in fs

2007-04-14 Thread Eric Dumazet

Milind Arun Choudhary wrote:

use set_current_state(TASK_*) instead of current->state = TASK_*, in fs/nfs

Signed-off-by: Milind Arun Choudhary <[EMAIL PROTECTED]>


---
 idmap.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/idmap.c b/fs/nfs/idmap.c
index 9d4a6b2..054ca15 100644
--- a/fs/nfs/idmap.c
+++ b/fs/nfs/idmap.c
@@ -272,7 +272,7 @@ nfs_idmap_id(struct idmap *idmap, struct idmap_hashtable *h,
set_current_state(TASK_UNINTERRUPTIBLE);
mutex_unlock(&idmap->idmap_im_lock);
schedule();
-   current->state = TASK_RUNNING;
+   set_current_state(TASK_RUNNING);
remove_wait_queue(&idmap->idmap_wq, &wq);
mutex_lock(&idmap->idmap_im_lock);


Probably a dumb question, so please forgive me.

Why are you forcing a memory barrier here (and also in your other patches)?

Isn't a __set_current_state(TASK_RUNNING); appropriate ?
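
For reference, the difference between the two helpers is (roughly) just a
memory barrier; schematically, with 'condition' as a placeholder:

/*
 * Schematic only, not the exact kernel text:
 *   __set_current_state(state): current->state = (state);
 *   set_current_state(state):   same store, followed by a memory barrier.
 *
 * The barrier matters in the classic sleep pattern, where the state
 * change must be visible before the wakeup condition is re-checked:
 */
	set_current_state(TASK_UNINTERRUPTIBLE);	/* store + barrier */
	if (!condition)
		schedule();
	__set_current_state(TASK_RUNNING);		/* plain store is enough */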




Re: Memory Allocation

2007-04-17 Thread Eric Dumazet

Brian D. McGrew wrote:

Good evening gents!

I need some help in allocating memory and understanding how the system
allocates memory with physical versus virtual page tables.  Please
consider the following snippet of code.  Please, no wisecracks about bad
code; it was written in 30 seconds in haste :-)

#include <iostream>

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

const static u_long kMaxSize = (2048 * 2048 * 256);

void *msg(void *ptr);
static u_long threads_done  = 0;

int
main(int argc, char *argv[])
{
 pthread_t thread1;
 pthread_t thread2;

 char *message1 = "Thread 1";
 char *message2 = "Thread 2";

 int iret1;
 int iret2;

 iret1 = pthread_create(&thread1, NULL, msg, (void *) message1);
 iret2 = pthread_create(&thread2, NULL, msg, (void *) message2);

//pthread_join(thread1, NULL);
//pthread_join(thread2, NULL); 


while (threads_done < 2) {
std::cout << "Threads complete: " << threads_done << std::endl;
sleep(3);
}

exit(0);
}

void *
msg(void *ptr)
{
char *message = (char *) ptr;

//
// Equal to 1 bank per thread of 256 each 4MP image buffers.  2GB.
//
char *buffer = new char[kMaxSize];

u_long max = kMaxSize;

//
// Init each buffer to 'something'.
//
for (u_long inx = 0; inx < max; inx++) {
if (inx % 10240 == 0) {
std::cout << message << ": Index: " << inx << std::endl;
}

buffer[inx] = inx;
}

delete[] buffer;  // allocated with new[], so use delete[], not free()
threads_done++;

return NULL;
}

My test machine is a Dell Precision 490 with dual 5140 processors and
3GB of RAM.  If I reduced kMaxSize to (2048 * 2048 * 236) is works.
However, I need to allocate an array of char that is (2048 * 2048 * 256)
and maybe even as large at (2048 * 2048 * 512).

Obviously I have enough physical memory in the box to do this.  However,
I suspect that I'm running out of page table entries.  Please, correct
me if I'm wrong; but if I allocate (2048 * 2048 * 236) it works.  When I
increment to 256 or 512 it fails, and it is my suspicion that I just
don't have enough room in kernel memory to allocate this much memory in
user space.


Because of a piece of 3rd party hardware, I'm forced to run the kernel
in the 4GB memory model.  What I need to be able to do is allocate an
array of char (2048 * 2048 * (up to 512)) in user space *** AND *** I
need the addresses that I get back to be contiguous, that's just the way
my 3rd party hardware works.

I'm inclined to believe that this in not specifically a Linux problem
but maybe an architecture problem???  But maybe there is some kind of
work around in the kernel for it???  I'd find it hard to believe that
I'm the first one that ever needed to use this much memory.

I ran this same code on two difference Macs.  One of them a Powerbook G4
with 4GB of RAM and it was successful.  The other was a Macbook Pro with
4GB of RAM and it failed.  Both running OS 10.4.9.  And of course it
runs just lovely on my Sun workstation with Solaris.  Thus, I'm thinking
it's an Intel/X86 issue!

How the heck do I get past this problem in Linux on the X86 platform???

Thanks,


Hi Brian

Add this line at the begining of your msg() function :

char cmd[128];
sprintf(cmd, "cat /proc/%d/maps", getpid());
system(cmd);

You'll see :

08048000-08049000 r-xp  08:07 23 /tmp/test1
08049000-0804a000 rw-p  08:07 23 /tmp/test1
0804a000-0806b000 rw-p 0804a000 00:00 0
4000-40015000 r-xp  08:02 31309  /lib/ld-2.3.6.so
40015000-40017000 rw-p 00014000 08:02 31309  /lib/ld-2.3.6.so
40017000-40019000 rw-p 40017000 00:00 0
4001d000-4002b000 r-xp  08:02 31349  /lib/tls/libpthread-2.3.6.so
4002b000-4002d000 rw-p d000 08:02 31349  /lib/tls/libpthread-2.3.6.so
4002d000-4002f000 rw-p 4002d000 00:00 0
4002f000-40109000 r-xp  08:05 128152 /usr/lib/libstdc++.so.6.0.8
40109000-4010c000 r--p 000d9000 08:05 128152 /usr/lib/libstdc++.so.6.0.8
4010c000-4010e000 rw-p 000dc000 08:05 128152 /usr/lib/libstdc++.so.6.0.8
4010e000-40114000 rw-p 4010e000 00:00 0
40114000-40137000 r-xp  08:02 31339  /lib/tls/libm-2.3.6.so
40137000-40139000 rw-p 00022000 08:02 31339  /lib/tls/libm-2.3.6.so
40139000-40143000 r-xp  08:02 31871  /lib/libgcc_s.so.1
40143000-40144000 rw-p 9000 08:02 31871  /lib/libgcc_s.so.1
40144000-4026c000 r-xp  08:02 31335  /lib/tls/libc-2.3.6.so
4026c000-40271000 r--p 00127000 08:02 31335  /lib/tls/libc-2.3.6.so
40271000-40273000 rw-p 0012c000 08:02 31335  /lib/tls/libc-2.3.6.so
40273000-40278000 rw-p 40273000 00:00 0
40278000-40279000 ---p 40278000 00:00 0
40279000-40a78000 rw-p 40279000 00:00 0
b000-c000 rw-p b000 00:00 0
e000-f000 ---p  00:00 0
Thread 1: Index: 0
08048000-08049000 r-xp  08:07 23 /tmp/test1
08049000-0804a000 rw-p  08:07 23 /tmp/test1
0804a000-0806b000 rw-p 0804a000 00:00 0
4000-40015000 r-xp  08:02 31309 

Re: [PATCH] Show slab memory usage on OOM and SysRq-M

2007-04-17 Thread Eric Dumazet
On Tue, 17 Apr 2007 16:22:48 +0300
"Pekka Enberg" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> On 4/17/07, Pavel Emelianov <[EMAIL PROTECTED]> wrote:
> > +static unsigned long get_cache_size(struct kmem_cache *cachep)
> > +{
> > +   unsigned long slabs;
> > +   struct kmem_list3 *l3;
> > +   struct list_head *lh;
> > +   int node;
> > +
> > +   slabs = 0;
> > +
> > +   for_each_online_node (node) {
> > +   l3 = cachep->nodelists[node];
> > +   if (l3 == NULL)
> > +   continue;
> > +
> > +   spin_lock(&l3->list_lock);
> > +   list_for_each (lh, &l3->slabs_full)
> > +   slabs++;
> > +   list_for_each (lh, &l3->slabs_partial)
> > +   slabs++;
> > +   list_for_each (lh, &l3->slabs_free)
> > +   slabs++;
> > +   spin_unlock(&l3->list_lock);
> > +   }
> > +
> > +   return slabs * ((PAGE_SIZE << cachep->gfporder) +
> > +   (OFF_SLAB(cachep) ? cachep->slabp_cache->buffer_size : 0));
> > +}
> 
> Considering you're doing this at out_of_memory() time, wouldn't it
> make more sense to add a ->nr_pages to struct kmem_cache and do the
> tracking in kmem_getpages/kmem_freepages?
> 

To avoid a deadlock ? yes...

This nr_pages should be in struct kmem_list3, not in struct kmem_cache, or else 
you defeat NUMA optimizations by touching a field of kmem_cache at 
kmem_getpages()/kmem_freepages() time.

   for_each_online_node (node) {
   l3 = cachep->nodelists[node];
   if (l3)
   slabs += l3->nr_pages; /* dont lock l3->list_lock */
   }
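
The accounting side could then be as simple as this (sketch only; hook
points and field name assumed, not a tested patch):

/* in kmem_getpages(cachep, flags, nodeid), once the pages are obtained: */
	cachep->nodelists[nodeid]->nr_pages += 1UL << cachep->gfporder;

/* in kmem_freepages(cachep, addr), with node = page_to_nid(virt_to_page(addr)): */
	cachep->nodelists[node]->nr_pages -= 1UL << cachep->gfporder;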


Re: [patch] slab: resize the alien caches too

2007-04-17 Thread Eric Dumazet

Siddha, Suresh B wrote:

Christoph,

While going through the slab code, I observed that alien caches are
not getting resized, when user changes the slab tunables. Appended patch
tries to fix this. Please review and let me know if I missed anything.

thanks,
suresh
---

Resize the alien caches too based on the slab tunables.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/mm/slab.c b/mm/slab.c
index 4cbac24..e0dd9af 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3823,6 +3823,7 @@ static int alloc_kmemlist(struct kmem_cache *cachep)
l3 = cachep->nodelists[node];
if (l3) {
struct array_cache *shared = l3->shared;
+   struct array_cache **alien = l3->alien;
 
 			spin_lock_irq(&l3->list_lock);
 


Christoph already rejected the patch, but I have one further comment : 
l3->alien should be fetched after spin_lock_irq() of course.


It's true that the :
if (limit > 1) limit = 12;
from alloc_alien_cache() is quite disturbing and could be cleaned up.




Re: [PATCH] Show slab memory usage on OOM and SysRq-M

2007-04-18 Thread Eric Dumazet
On Wed, 18 Apr 2007 09:17:19 +0300 (EEST)
Pekka J Enberg <[EMAIL PROTECTED]> wrote:

> On Tue, 17 Apr 2007, Eric Dumazet wrote:
> > This nr_pages should be in struct kmem_list3, not in struct kmem_cache, 
> > or else you defeat NUMA optimizations if touching a field in kmem_cache 
> > at kmem_getpages()/kmem_freepages() time.
> 
> We already touch ->flags, ->gfpflags, and ->gfporder in kmem_getpages(). 
> Sorry for my ignorance, but how is this different?
> 

Those fields are only read. That's OK, because several CPUs can share them 
without any problem.

But modifying one field in kmem_cache would invalidate that cache line for all 
CPUs, which would then have to reload it later.

This is what we call "false sharing", or cache line ping-pong.
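
A contrived illustration of the difference (not code from mm/slab.c):

/*
 * A frequently written counter sharing a cache line with read-mostly
 * fields makes every writer invalidate that line on all other CPUs.
 */
struct stats_bad {
	unsigned long flags;		/* read-mostly */
	unsigned long hot_counter;	/* written by every CPU: ping-pong */
};

struct stats_better {
	unsigned long flags;		/* read-mostly */
	unsigned long hot_counter ____cacheline_aligned_in_smp;
};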




Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU

2007-04-20 Thread Eric Dumazet
On Fri, 20 Apr 2007 09:58:52 +0100
David Howells <[EMAIL PROTECTED]> wrote:
> 
> Because kmalloc() may be able to get us a smaller chunk of memory.  Actually,
> calling __get_free_pages() might be a better, and then release the excess
> pages.

Interesting, that rings a bell here.

I wonder why we don't use this in alloc_large_system_hash().

(if __get_free_pages(GFP_ATOMIC, order) is used instead of alloc_bootmem() or 
__vmalloc())

We currently lose 1/4 of the space on the tcp hash table, for example, because 
sizeof(inet_ehash_bucket) is not a power of 2.

Is it really possible to allocate an order-10 page, then release part of it 
(say an order-8 subpage) ?
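
For what it's worth, a sketch of how it could be done, assuming split_page()
may be used this way (hypothetical helper, not existing kernel code):

/*
 * Allocate a high-order block, split it into order-0 pages, then give
 * back the pages of the tail we do not need.
 */
static void *alloc_hash_rounded(unsigned long bytes, unsigned int order)
{
	unsigned long addr = __get_free_pages(GFP_KERNEL, order);
	unsigned long keep = (bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
	unsigned long i;

	if (!addr)
		return NULL;

	split_page(virt_to_page(addr), order);	/* make each page order-0 */
	for (i = keep; i < (1UL << order); i++)
		free_page(addr + (i << PAGE_SHIFT));

	return (void *)addr;
}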



Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Eric Dumazet

Rik van Riel wrote:

Andrew Morton wrote:

On Fri, 20 Apr 2007 17:38:06 -0400
Rik van Riel <[EMAIL PROTECTED]> wrote:


Andrew Morton wrote:


I've also merged Nick's "mm: madvise avoid exclusive mmap_sem".

- Nick's patch also will help this problem.  It could be that your patch
  no longer offers a 2x speedup when combined with Nick's patch.

  It could well be that the combination of the two is even better, but it
  would be nice to firm that up a bit.

I'll test that.


Thanks.


Well, good news.

It turns out that Nick's patch does not improve peak
performance much, but it does prevent the decline when
running with 16 threads on my quad core CPU!

We _definitely_ want both patches, there's a huge benefit
in having them both.

Here are the transactions/seconds for each combination:

threads   vanilla   new glibc   madv_free kernel   madv_free + mmap_sem

1             610         609                596                    545


545 tps versus 610 tps for one thread ? It seems quite bad, no ?

Could you please find an explanation for this ?


2            1032        1136               1196                   1200
4            1070        1128               2014                   2024
8            1000        1088               1665                   2087
16            779        1073               1310                   1999




Thank you


Re: SMP performance degradation with sysbench

2007-03-12 Thread Eric Dumazet

Anton Blanchard wrote:
 
Hi Nick,



Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
at this, send me a mail (eg. especially with the sched_setscheduler issue,
you might be able to do something better).


I took a look at this today and figured I'd document it:

http://ozlabs.org/~anton/linux/sysbench/

Bottom line: it looks like issues in the glibc malloc library, replacing
it with the google malloc library fixes the negative scaling:

# apt-get install libgoogle-perftools0
# LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld


Hi Anton, thanks for the report.
glibc certainly has many scalability problems.

One of the known problems is its (ab)use of mmap() to allocate one (yes: one!) 
page every time you fopen() a file, and then a munmap() at fclose() time.



mmap()/munmap() should be avoided as hell in multithreaded programs.
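
One user-space workaround, when the per-fopen() buffer really hurts, is to
hand stdio a caller-supplied buffer so no allocation is needed at all
(plain C, nothing glibc-specific assumed):

#include <stdio.h>

int main(void)
{
	static char buf[BUFSIZ];
	FILE *f = fopen("/etc/hostname", "r");

	if (!f)
		return 1;
	/* must be done before any I/O on the stream */
	setvbuf(f, buf, _IOFBF, sizeof(buf));
	/* ... read ... */
	fclose(f);
	return 0;
}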


Re: SMP performance degradation with sysbench

2007-03-13 Thread Eric Dumazet
On Tuesday 13 March 2007 12:12, Nick Piggin wrote:
>
> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that glibc
> can do here?

I cooked a patch some time ago to speed up threaded apps and got no feedback.

http://lkml.org/lkml/2006/8/9/26

Maybe we have to wait for 32-core CPUs before thinking about cache line 
bouncing...



Re: SMP performance degradation with sysbench

2007-03-13 Thread Eric Dumazet
On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote:

> My wild guess is that they're allocating memory after taking
> futexes. If they do, something like this will happen:
>
>  taskAtaskB   taskC
>  user lock
>   mmap_sem lock
>  mmap sem -> schedule
>   user lock -> schedule
>
> If taskB wouldn't be there triggering more random trashing over the
> mmap_sem, the lock holder wouldn't wait and task C wouldn't wait too.
>
> I suspect the real fix is not to allocate memory or to run other
> expensive syscalls that can block inside the futex critical sections...

glibc malloc uses arenas, and trylock() only. It should not block, because if 
an arena is already locked the thread automatically chooses another arena, and 
may create a new one if necessary.

But yes, mmap_sem contention is a big problem, because it's also taken by 
futex code (unfortunately)
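
Schematically, the arena selection works like this (generic illustration
using plain pthreads, not actual glibc source):

#include <pthread.h>
#include <stddef.h>

#define NARENAS 8

struct arena {
	pthread_mutex_t lock;
	/* free lists, etc. */
};

static struct arena arenas[NARENAS] = {
	[0 ... NARENAS - 1] = { .lock = PTHREAD_MUTEX_INITIALIZER },
};

/* returns a locked arena; the caller unlocks it after allocating */
static struct arena *pick_arena(unsigned int preferred)
{
	unsigned int i;

	for (i = 0; i < NARENAS; i++) {
		struct arena *a = &arenas[(preferred + i) % NARENAS];

		if (pthread_mutex_trylock(&a->lock) == 0)
			return a;
	}
	/* all arenas busy: block on the preferred one rather than spin */
	pthread_mutex_lock(&arenas[preferred % NARENAS].lock);
	return &arenas[preferred % NARENAS];
}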



Re: SMP performance degradation with sysbench

2007-03-13 Thread Eric Dumazet

Nish Aravamudan wrote:

On 3/12/07, Anton Blanchard <[EMAIL PROTECTED]> wrote:


Hi Nick,

> Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
> at this, send me a mail (eg. especially with the sched_setscheduler issue,
> you might be able to do something better).

I took a look at this today and figured I'd document it:

http://ozlabs.org/~anton/linux/sysbench/

Bottom line: it looks like issues in the glibc malloc library, replacing
it with the google malloc library fixes the negative scaling:

# apt-get install libgoogle-perftools0
# LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld


Quick datapoint, still collecting data and trying to verify it's
always the case: on my 8-way Xeon, I'm actually seeing *much* worse
performance with libtcmalloc.so compared to mainline. Am generating
graphs and such still, but maybe someone else with x86_64 hardware
could try the google PRELOAD and see if it helps/hurts (to rule out
tester stupidity)?


I wish I had an 8-way test platform :)

Anyway, could you post some oprofile results ?



<    1   2   3   4   5   6   7   8   9   10   >