Re: Memory Allocation

2007-04-17 Thread Eric Dumazet
Brian D. McGrew wrote: Good evening gents! I need some help in allocating memory and understanding how the system allocates memory with physical versus virtual page tables. Please consider the following snippet of code. Please, no wisecracks about bad code; it was written in 30 seconds in

Re: [PATCH] Show slab memory usage on OOM and SysRq-M

2007-04-17 Thread Eric Dumazet
On Tue, 17 Apr 2007 16:22:48 +0300 Pekka Enberg [EMAIL PROTECTED] wrote: Hi, On 4/17/07, Pavel Emelianov [EMAIL PROTECTED] wrote: +static unsigned long get_cache_size(struct kmem_cache *cachep) +{ + unsigned long slabs; + struct kmem_list3 *l3; + struct list_head

Re: [patch] slab: resize the alien caches too

2007-04-17 Thread Eric Dumazet
Siddha, Suresh B wrote: Christoph, While going through the slab code, I observed that alien caches are not getting resized, when user changes the slab tunables. Appended patch tries to fix this. Please review and let me know if I missed anything. thanks, suresh --- Resize the alien caches

Re: [PATCH] Show slab memory usage on OOM and SysRq-M

2007-04-18 Thread Eric Dumazet
On Wed, 18 Apr 2007 09:17:19 +0300 (EEST) Pekka J Enberg [EMAIL PROTECTED] wrote: On Tue, 17 Apr 2007, Eric Dumazet wrote: This nr_pages should be in struct kmem_list3, not in struct kmem_cache, or else you defeat NUMA optimizations if touching a field in kmem_cache at kmem_getpages

Re: [PATCH] CONFIG_PACKET_MMAP should depend on MMU

2007-04-20 Thread Eric Dumazet
On Fri, 20 Apr 2007 09:58:52 +0100 David Howells [EMAIL PROTECTED] wrote: Because kmalloc() may be able to get us a smaller chunk of memory. Actually, calling __get_free_pages() might be a better, and then release the excess pages. Interesting, that rings a bell here. I wonder why we dont

Re: [PATCH] lazy freeing of memory through MADV_FREE

2007-04-20 Thread Eric Dumazet
Rik van Riel wrote: Andrew Morton wrote: On Fri, 20 Apr 2007 17:38:06 -0400 Rik van Riel [EMAIL PROTECTED] wrote: Andrew Morton wrote: I've also merged Nick's mm: madvise avoid exclusive mmap_sem. - Nick's patch also will help this problem. It could be that your patch no longer

Re: Re[2]: sendfile to nonblocking socket

2007-04-24 Thread Eric Dumazet
On Tue, 24 Apr 2007 14:33:48 +0400 Alex Vorona [EMAIL PROTECTED] wrote: Hello David, Tuesday, April 24, 2007, 1:19:49 PM, you wrote: sendfile function is not just a more efficient version of a read followed by a write. It reads from one fd and writes to another at the same time.

Re: [08/17] Define functions for page cache handling

2007-04-24 Thread Eric Dumazet
[EMAIL PROTECTED] wrote: --- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-24 11:31:49.0 -0700 +++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-24 11:37:21.0 -0700 @@ -435,6 +435,11 @@ struct address_space { struct inode *host; /* owner:

Re: [PATCH]:Replacing current->state with set_current_state in kernel/signal.c

2007-04-25 Thread Eric Dumazet
On Wed, 25 Apr 2007 12:08:58 +0530 Shani Moideen [EMAIL PROTECTED] wrote: Hi, Replacing current->state with set_current_state in kernel/signal.c @@ -2596,7 +2596,7 @@ sys_signal(int sig, __sighandler_t handler) asmlinkage long sys_pause(void) { - current->state =

Re: Big reserved mappings on x86_64

2007-04-25 Thread Eric Dumazet
On Wed, 25 Apr 2007 04:49:31 -0400 Jakub Jelinek [EMAIL PROTECTED] wrote: On Wed, Apr 25, 2007 at 10:42:20AM +0200, Jan Engelhardt wrote: I actually took a look at `pmap $$`, which reveals that a lot of shared libraries map 2044K or 2048K unreadable-unwritable-private mappings...for

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

2007-03-08 Thread Eric Dumazet
consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet [EMAIL PROTECTED] Acked-by: Christoph Hellwig [EMAIL PROTECTED] Acked-by: Linus Torvalds [EMAIL PROTECTED] Documentation/filesystems/Locking

[PATCH] VFS : Delay the dentry name generation on sockets and pipes.

2007-03-08 Thread Eric Dumazet
-by: Eric Dumazet [EMAIL PROTECTED] Acked-by: Christoph Hellwig [EMAIL PROTECTED] Acked-by: Linus Torvalds [EMAIL PROTECTED]  Documentation/filesystems/Locking |    2 ++  Documentation/filesystems/vfs.txt |   12 +++-  fs/dcache.c                       |   10 ++  fs/pipe.c

[PATCH] procfs : reorder struct pid_dentry to save space on 64bit archs, and constify them

2007-03-08 Thread Eric Dumazet
-off-by: Eric Dumazet [EMAIL PROTECTED] fs/proc/base.c | 53 +++ 1 file changed, 27 insertions(+), 26 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 01f7769..d71135b 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -90,8 +90,8

Re: sys_write() racy for multi-threaded append?

2007-03-08 Thread Eric Dumazet
Michael K. Edwards wrote: from sys_write(): file = fget_light(fd, &fput_needed); if (file) { loff_t pos = file_pos_read(file); ret = vfs_write(file, buf, count, &pos); file_pos_write(file, pos); fput_light(file,

Re: sys_write() racy for multi-threaded append?

2007-03-08 Thread Eric Dumazet
Michael K. Edwards wrote: On 3/8/07, Eric Dumazet [EMAIL PROTECTED] wrote: Nothing in the manuals says that write() on same fd should be non racy : In particular file pos might be undefined. There is a reason pwrite() exists. Kernel doesn't have to enforce thread safety as standard
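
Editor's sketch of the pwrite() point made above: each writer passes its own offset, so the shared file position is neither read nor updated. The helper name write_at() is hypothetical and not from the thread; short writes and EINTR are retried, which is an assumption about what a careful caller would want.

```c
#define _XOPEN_SOURCE 700   /* expose pwrite() under strict -std modes */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes at an explicit file offset.
 * pwrite() leaves the fd's shared file position alone, so concurrent
 * callers on the same fd do not race on it.
 * Returns 0 on success, -1 on error (errno is set by pwrite()).
 */
int write_at(int fd, const void *buf, size_t len, off_t offset)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        if (n == 0)
            return -1;          /* no progress: give up rather than spin */
        p += n;
        len -= (size_t)n;
        offset += n;
    }
    return 0;
}
```

Threads that append to a log would still need to agree on who owns which offset range (or open the file with O_APPEND); this sketch only removes the dependency on the shared position.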

Re: sys_write() racy for multi-threaded append?

2007-03-08 Thread Eric Dumazet
Michael K. Edwards wrote: On 3/8/07, Eric Dumazet [EMAIL PROTECTED] wrote: Absolutely not. We don't want to slow down kernel 'just in case a fool might want to do crazy things' Actually, I think it would make the kernel (negligibly) faster to bump f_pos before the vfs_write() call. Unless

Re: block_til_ready

2007-03-08 Thread Eric Dumazet
Mockern wrote: Hi, What is the simplest implementation of block_til_ready for tty driver? Thanks, Andy Welcome Andy Since your messages always make me wonder if you are some kind of robot, able to post one one-line message to lkml every day, I have one suggestion : Try next time to

Re: [PATCH 2/7] revoke: add f_light flag for struct file

2007-03-09 Thread Eric Dumazet
On Friday 09 March 2007 09:14, Pekka J Enberg wrote: From: Pekka Enberg [EMAIL PROTECTED] This adds a f_light flag to struct file to indicate that the file was looked up with fget_light(). Needed by revoke to ensure we don't close a file pointer while someone is using it without actually

[PATCH, take2] VFS : Delay the dentry name generation on sockets and pipes.

2007-03-09 Thread Eric Dumazet
instead of 3.450 s Signed-off-by: Eric Dumazet [EMAIL PROTECTED] Acked-by: Christoph Hellwig [EMAIL PROTECTED] Acked-by: Linus Torvalds [EMAIL PROTECTED] Documentation/filesystems/Locking |2 ++ Documentation/filesystems/vfs.txt | 26 +- fs/dcache.c

Re: [PATCH 2/7] revoke: add f_light flag for struct file

2007-03-09 Thread Eric Dumazet
On Friday 09 March 2007 11:43, Pekka Enberg wrote: On 3/9/07, Eric Dumazet [EMAIL PROTECTED] wrote: Cannot we use a flag in 'struct files_struct', set to one when the task is mono-thread (at task creation in fact), and set to 0 when it creates a new thread (or when someone remotely access

Re: sys_write() racy for multi-threaded append?

2007-03-09 Thread Eric Dumazet
On Friday 09 March 2007 13:19, Michael K. Edwards wrote: On 3/8/07, Benjamin LaHaise [EMAIL PROTECTED] wrote: Any number of things can cause a short write to occur, and rewinding the file position after the fact is just as bad. A sane app has to either serialise the writes itself or use a

Re: [PATCH 2/7] revoke: add f_light flag for struct file

2007-03-09 Thread Eric Dumazet
On Friday 09 March 2007 17:11, Benjamin LaHaise wrote: On Fri, Mar 09, 2007 at 12:13:35PM +0100, Eric Dumazet wrote: Then just drop the fget_light() 'optimisation' and always take a reference (atomic on f_count) regardless of single-thread or not. Instead of dirtying f_light, just do

[PATCH, take3] VFS : Delay the dentry name generation on sockets and pipes.

2007-03-09 Thread Eric Dumazet
for pipes : No more sprintf() at pipe creation. This is delayed up to the moment someone does an access to /proc/pid/fd/... A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a *nice* speedup on my Pentium(M) 1.6 Ghz : 3.090 s instead of 3.450 s Signed-off-by: Eric Dumazet

[PATCH] getrusage() : Fill ru_inblock and ru_oublock fields if possible

2007-03-12 Thread Eric Dumazet
0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+3000outputs (0major+299minor)pagefaults 0swaps Signed-off-by: Eric Dumazet [EMAIL PROTECTED] --- linux-2.6.21-rc3/kernel/sys.c 2007-03-12 17:30:34.0 +0100 +++ linux-2.6.21-rc3-ed/kernel/sys.c2007-03

Re: SMP performance degradation with sysbench

2007-03-12 Thread Eric Dumazet
Anton Blanchard wrote: Hi Nick, Anyway, I'll keep experimenting. If anyone from MySQL wants to help look at this, send me a mail (eg. especially with the sched_setscheduler issue, you might be able to do something better). I took a look at this today and figured I'd document it:

Re: SMP performance degradation with sysbench

2007-03-13 Thread Eric Dumazet
On Tuesday 13 March 2007 12:12, Nick Piggin wrote: I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose glibc allocator. But I wonder if there are other improvements that glibc can do here? I cooked a patch some time ago to speedup threaded apps and got no feedback.

Re: SMP performance degradation with sysbench

2007-03-13 Thread Eric Dumazet
On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote: My wild guess is that they're allocating memory after taking futexes. If they do, something like this will happen: taskAtaskB taskC user lock mmap_sem lock mmap sem - schedule

Re: SMP performance degradation with sysbench

2007-03-13 Thread Eric Dumazet
Nish Aravamudan wrote: On 3/12/07, Anton Blanchard [EMAIL PROTECTED] wrote: Hi Nick, Anyway, I'll keep experimenting. If anyone from MySQL wants to help look at this, send me a mail (eg. especially with the sched_setscheduler issue, you might be able to do something better). I took

[PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements

2007-03-15 Thread Eric Dumazet
() call 187 cycles per umask() call 182 cycles per ni_syscall() call Thank you for reading this mail [PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic [PATCH 2/3] FUTEX : introduce private hashtables [PATCH 3/3] FUTEX : NUMA friendly global hashtable Signed-off-by: Eric Dumazet [EMAIL

[PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic

2007-03-15 Thread Eric Dumazet
of this changes (new kernel and updated libc) Signed-off-by: Eric Dumazet [EMAIL PROTECTED] --- include/linux/futex.h | 12 + kernel/futex.c| 273 +--- 2 files changed, 188 insertions(+), 97 deletions(-) --- linux-2.6.21-rc3/kernel/futex.c 2007-03-13

[PATCH 2/3] FUTEX : introduce private hashtables

2007-03-15 Thread Eric Dumazet
of the private hashtable should be 768 bytes on 32bit arches, 1536 bytes on 64bit arches. Private hashtable is freed() when process exits. Signed-off-by: Eric Dumazet [EMAIL PROTECTED] --- include/linux/futex.h |4 + include/linux/sched.h |7 ++ kernel/fork.c |1 kernel/futex.c

[PATCH 3/3] FUTEX : NUMA friendly global hashtable

2007-03-15 Thread Eric Dumazet
should have a temporary effect, as most futexes are expected to be stored in process private tables. We probably can drop it in five years :) Signed-off-by: Eric Dumazet [EMAIL PROTECTED] --- kernel/futex.c | 26 +++--- 1 file changed, 23 insertions(+), 3 deletions

Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements

2007-03-16 Thread Eric Dumazet
On Friday 16 March 2007 09:05, Peter Zijlstra wrote: On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote: Hi I'm pleased to present these patches which improve linux futex performance and scalability, on both UP, SMP and NUMA configs. I had this idea last year but I

Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements

2007-03-16 Thread Eric Dumazet
On Friday 16 March 2007 11:10, Peter Zijlstra wrote: http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch Oh thanks But if it has to walk the vmas (and take mmap_sem), you already lose the PRIVATE benefit. It doesn't take mmap_sem, I am aware of the

Re: [RFC] kernel/pid.c pid allocation wierdness

2007-03-16 Thread Eric Dumazet
On Friday 16 March 2007 11:57, Pavel Emelianov wrote: Oleg Nesterov wrote: On 03/14, Eric W. Biederman wrote: Pavel Emelianov [EMAIL PROTECTED] writes: Hi. I'm looking at how alloc_pid() works and can't understand one (simple/stupid) thing. It first kmem_cache_alloc()-s a struct

Re: + getrusage-fill-ru_inblock-and-ru_oublock-fields-if-possible.patch added to -mm tree

2007-03-16 Thread Eric Dumazet
On Friday 16 March 2007 18:10, Oleg Nesterov wrote: Eric Dumazet wrote: @@ -2021,6 +2022,8 @@ static void k_getrusage(struct task_stru r->ru_nivcsw = p->signal->cnivcsw; r->ru_minflt = p->signal->cmin_flt; r->ru_majflt = p->signal

Re: + getrusage-fill-ru_inblock-and-ru_oublock-fields-if-possible.patch added to -mm tree

2007-03-16 Thread Eric Dumazet
On Friday 16 March 2007 18:23, Eric Dumazet wrote: Very good point, you found a bug in k_getrusage(). I just followed the existing logic, but it seems this logic is bad. So not only ru_inblock/ru_oublock are multiplied by 3 : others fields as well are wrong. Also the definition

Re: [PATCH 2.6.21 review I] [21/25] x86_64: a memcpy that tries to reduce cache pressure

2007-02-13 Thread Eric Dumazet
Andi Kleen wrote: From: Bryan O'Sullivan [EMAIL PROTECTED] This copy routine is memcpy-compatible, but on some architectures will use cache-bypassing loads to avoid bringing the source data into the cache. One case where this is useful is when a device issues a DMA to a memory region, and

Re: [patch 05/11] syslets: core code

2007-02-15 Thread Eric Dumazet
On Thursday 15 February 2007 19:46, bert hubert wrote: Both 1 and 2 are currently limiting factors when I enter the 100kqps domain of name serving. This doesn't mean the rest of my code is as tight as it could be, but I spend a significant portion of time in the kernel even at moderate

[PATCH, take2] getrusage() : Fill ru_inblock and ru_oublock fields if possible

2007-03-19 Thread Eric Dumazet
4:21.42elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k 878112inputs+22448outputs (2major+1148minor)pagefaults 0swaps # ls -s --block-size=512 /var/lib/slocate/slocate.db 22472 /var/lib/slocate/slocate.db Signed-off-by: Eric Dumazet [EMAIL PROTECTED] --- include/linux/sched.h

Re: [PATCH, take2] getrusage() : Fill ru_inblock and ru_oublock fields if possible

2007-03-19 Thread Eric Dumazet
On Monday 19 March 2007 11:53, Oleg Nesterov wrote: On 03/19, Eric Dumazet wrote: +static inline unsigned long task_io_get_inblock(const struct task_struct *p) +{ + return p->ioac.read_bytes >> 9; +} [...snip...] @@ -2021,6 +2022,8 @@ static void k_getrusage(struct task_stru
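
Editor's sketch of the shift in the quoted helper: shifting a byte count right by 9 divides it by 512, which is the block unit that the ru_inblock and ru_oublock figures in these patches are reported in. The function name bytes_to_blocks() is hypothetical and stands in for the kernel helper; it is a user-space illustration only.

```c
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for the quoted helper:
 * convert a byte count to whole 512-byte blocks.
 * 2^9 = 512, so ">> 9" is an integer division by 512;
 * a trailing partial block is truncated, not rounded up.
 */
static inline uint64_t bytes_to_blocks(uint64_t bytes)
{
    return bytes >> 9;
}
```

For instance, 11505664 bytes is exactly 22472 blocks, the unit used by the `ls -s --block-size=512` check quoted in the take2 posting.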

[PATCH, take3] getrusage() : Fill ru_inblock and ru_oublock fields if possible

2007-03-19 Thread Eric Dumazet
+0avgdata 0maxresident)k 0inputs+1000outputs (0major+235minor)pagefaults 0swaps # /usr/bin/time updatedb 1.58user 6.20system 4:26.06elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k 881088inputs+22464outputs (2major+1163minor)pagefaults 0swaps Signed-off-by: Eric Dumazet [EMAIL PROTECTED] --- include

Re: [PATCH, take3] getrusage() : Fill ru_inblock and ru_oublock fields if possible

2007-03-19 Thread Eric Dumazet
On Mon, 19 Mar 2007 17:37:23 +0300 Oleg Nesterov [EMAIL PROTECTED] wrote: (offtopic) Well..., it *is* ontopic I'm afraid... We are reading u64 read_bytes/write_bytes which could be updated asynchronously. /proc/pid/io does the same. Of course, I don't blame this patch, just a stupid

[PATCH] x86_64 : Suppress __jiffies

2007-03-19 Thread Eric Dumazet
included in vsyscall page. This patch prepares the introduction of time_data. Signed-off-by: Eric Dumazet [EMAIL PROTECTED] diff --git a/arch/x86_64/kernel/vsyscall.c b/arch/x86_64/kernel/vsyscall.c index b43c698..ac4d865 100644 --- a/arch/x86_64/kernel/vsyscall.c +++ b/arch/x86_64/kernel/vsyscall.c

Re: [patch 6/13] signal/timer/event fds v7 - timerfd core ...

2007-03-19 Thread Eric Dumazet
Davide Libenzi wrote: +struct timerfd_ctx { + struct hrtimer tmr; + ktime_t tintv; + spinlock_t lock; + wait_queue_head_t wqh; + unsigned long ticks; +}; +static struct kmem_cache *timerfd_ctx_cachep; + timerfd_ctx_cachep =

Re: [PATCH] slab: deal with NULL pointers passed to kmem_cache_free

2007-03-20 Thread Eric Dumazet
Pekka Enberg wrote: On 3/19/07, Andrew Morton [EMAIL PROTECTED] wrote: This is a super-hot path. Super-hot exactly where? Don't be silly Pekka ... We have plenty of oprofile results if you don't trust Andrew. CPU: AMD64 processors, speed 1992.52 MHz (estimated) Counted CPU_CLK_UNHALTED

Re: [PATCH] slab: deal with NULL pointers passed to kmem_cache_free

2007-03-20 Thread Eric Dumazet
Pekka J Enberg wrote: Thanks for the profile. I still wonder where exactly those super-hot call-sites are... In this case, it's a typical network server Each time a packet is sent to or received from network, network stack has to allocate/free a skb

[PATCH] time : SMP friendly alignment of struct clocksource

2007-03-20 Thread Eric Dumazet
. Signed-off-by: Eric Dumazet [EMAIL PROTECTED] --- linux-2.6.21-rc4-mm1/include/linux/clocksource.h +++ linux-2.6.21-rc4-mm1-ed/include/linux/clocksource.h @@ -49,25 +49,35 @@ struct clocksource; * @flags: flags describing special properties * @vread: vsyscall based read

[RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Eric Dumazet
Hi I noticed on a small x86_64 NUMA setup (2 nodes) that cache_free_alien() is very expensive. This is because of a cache miss on struct slab. At the time an object is freed (call to kmem_cache_free() for example), the underlying 'struct slab' is not anymore cache-hot. struct slab *slabp =

Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Eric Dumazet
Andi Kleen wrote: Is it possible virt_to_slab(objp)->nodeid being different from pfn_to_nid(objp) ? It is possible the page allocator falls back to another node than requested. We would need to check that this never occurs. The only way to ensure that would be to set a strict mempolicy.

Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-20 Thread Eric Dumazet
Christoph Lameter wrote: On Tue, 20 Mar 2007, Eric Dumazet wrote: I understand we want to do special things (fallback and such tricks) at allocation time, but I believe that we can just trust the real nid of memory at free time. Sorry no. The node at allocation time determines which node

[PATCH] SLAB : Use num_possible_cpus() in enable_cpucache()

2007-03-20 Thread Eric Dumazet
arrays', to save some memory. Signed-off-by: Eric Dumazet [EMAIL PROTECTED] diff --git a/mm/slab.c b/mm/slab.c index 57f7aa4..a69d0a5 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3975,10 +3975,8 @@ static int enable_cpucache(struct kmem_c * to a larger limit. Thus disabled by default

Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;

2007-03-21 Thread Eric Dumazet
Christoph Lameter wrote: On Wed, 21 Mar 2007, Eric Dumazet wrote: The fast path is to put the pointer, into the cpu array cache. This object might be given back some cycles later, because of a kmem_cache_alloc() : No need to access the two cache lines (struct page, struct slab) If you do

[PATCH] SLAB : Dont allocate empty shared caches

2007-03-21 Thread Eric Dumazet
We can avoid allocating empty shared caches and avoid unnecessary check of cache->limit. We save some memory. We avoid bringing into CPU cache unnecessary cache lines. All accesses to l3->shared are already checking NULL pointers so this patch is safe. Signed-off-by: Eric Dumazet [EMAIL

[RFC, PATCH] SLAB : [NUMA] keep nodeid in struct page instead of struct slab

2007-03-21 Thread Eric Dumazet
*' pointer, but also the nodeid in the low order bits. This also reduces sizeof(struct slab) by 8 bytes on 64bits arches. This reduces sizeof(struct slab) on all platforms (UP, or SMP) Signed-off-by: Eric Dumazet [EMAIL PROTECTED] diff --git a/mm/slab.c b/mm/slab.c index abf46ae..d2f7299 100644
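
Editor's sketch of the general technique named in this posting, keeping a small node id in the low-order bits of a pointer. This is not the patch's code: the names, the 6-bit width (enough for 64 node ids) and the requirement that the pointer be aligned to at least 64 bytes are all assumptions made for illustration.

```c
#include <stdint.h>

#define NODE_BITS 6u    /* assumption: node ids fit in 6 bits (0..63) */
#define NODE_MASK ((uintptr_t)((1u << NODE_BITS) - 1))

/*
 * Pack a node id into the low bits of a pointer. The pointer must be
 * aligned to at least (1 << NODE_BITS) bytes so those bits are zero.
 */
static inline uintptr_t pack_ptr_nid(void *ptr, unsigned int nid)
{
    return (uintptr_t)ptr | ((uintptr_t)nid & NODE_MASK);
}

/* Recover the original pointer by clearing the tag bits. */
static inline void *unpack_ptr(uintptr_t v)
{
    return (void *)(v & ~NODE_MASK);
}

/* Recover the node id from the tag bits. */
static inline unsigned int unpack_nid(uintptr_t v)
{
    return (unsigned int)(v & NODE_MASK);
}
```

The appeal of the trick is that the node id travels with a word the free path already loads, so no extra cache line has to be touched to find it.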

Re: [RFC, PATCH] SLAB : [NUMA] keep nodeid in struct page instead of struct slab

2007-03-21 Thread Eric Dumazet
On Wed, 21 Mar 2007 07:41:46 -0700 (PDT) Christoph Lameter [EMAIL PROTECTED] wrote: On Wed, 21 Mar 2007, Eric Dumazet wrote: In order to avoid a cache miss in kmem_cache_free() on NUMA and reduce hot path length, we could exploit the following common facts. 1) MAX_NUMNODES = 64

Re: [RFC, PATCH] SLAB : [NUMA] keep nodeid in struct page instead of struct slab

2007-03-21 Thread Eric Dumazet
On Wed, 21 Mar 2007 17:42:22 +0200 (EET) Pekka J Enberg [EMAIL PROTECTED] wrote: On Wed, 21 Mar 2007, Christoph Lameter wrote: None of that please. The page flags already contain a node number that is accessible via page_to_nid(). Just make sure that we never put a slab onto the wrong

Re: [RFC, PATCH] SLAB : [NUMA] keep nodeid in struct page instead of struct slab

2007-03-21 Thread Eric Dumazet
On Wed, 21 Mar 2007 17:47:13 +0200 Pekka Enberg [EMAIL PROTECTED] wrote: On 3/21/07, Eric Dumazet [EMAIL PROTECTED] wrote: Last time I checked 'struct page', there was no nodeid in it. Hmm, page_to_nid() in include/linux/mm.h doesn't agree with you: #ifdef NODE_NOT_IN_PAGE_FLAGS extern

Re: [RFC, PATCH] SLAB : [NUMA] keep nodeid in struct page instead of struct slab

2007-03-21 Thread Eric Dumazet
On Wed, 21 Mar 2007 08:44:53 -0700 (PDT) Christoph Lameter [EMAIL PROTECTED] wrote: You wanted to exploit that MAX_NUMNODES = 64? Maybe my english is not perfect. I promise I improve it eventually. My patch was for unlucky guys that have MAX_NUMNODES = 64. For others, the preprocessor gave the

[RFC] : Is /proc/kcore still usefull and/or maintained ?

2007-03-21 Thread Eric Dumazet
Hi all On i386 , 2.6.20 / 2.6.21-rc4 : # gdb vmlinux /proc/kcore error # file /proc/kcore error Apparently we can not llseek() anymore on this file (returns -EINVAL) On x86_64 2.6.20 it's working # file /proc/kcore /proc/kcore: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style

Re: [RFC] : Is /proc/kcore still usefull and/or maintained ?

2007-03-21 Thread Eric Dumazet
I stand corrected : This is a new bug The /proc/kcore problem appears with linux-2.6.21-rc4-mm1 fd = open("/proc/kcore", 0); llseek(fd, ...) returns an -EINVAL error Quick code inspection (before going to sleep...) shows that proc_reg_llseek() (file fs/proc/inode.c) is doing something like :

Re: [RFC] : Is /proc/kcore still usefull and/or maintained ?

2007-03-21 Thread Eric Dumazet
On Thu, 22 Mar 2007 02:04:50 +0200 Maxim [EMAIL PROTECTED] wrote: Hi, Yes, you are right, you have different problem that I had But why do you need llseek ? I don't personally, but tools do need llseek. Why not to mmap it ? It is natural thing to do with files

Problem with fix-rmmod-read-write-races-in-proc-entries.patch in 2.6.21-rc4-mm1

2007-03-22 Thread Eric Dumazet
Hi Alexey, It seems you are fix-rmmod-read-write-races-in-proc-entries.patch author ? /proc/kcore is no longer seekable (or mappable) Also, do we really need to proxy via proc_reg_file_ops files that are not provided by a module ? I think not. Could you please add in proc_get_inode() a check

Re: max_loop limit

2007-03-22 Thread Eric Dumazet
On Thu, 22 Mar 2007 12:37:54 +0100 Tomas M [EMAIL PROTECTED] wrote: The question is not "Why do we need more than 255 loops?". The question should be "Why do we need the hardcoded 255-limit in kernel while there is no reason for it at all?" My patch simply removes the hardcoded limitation.

Re: max_loop limit

2007-03-22 Thread Eric Dumazet
On Thu, 22 Mar 2007 14:42:31 +0100 Jens Axboe [EMAIL PROTECTED] wrote: This time, you would be limited to 16384 loop devices on x86_64, 32768 on i386 :) But this still wastes memory, why not just allocate each loop device dynamically when it is set up? The current approach is crap, it

Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-22 Thread Eric Dumazet
Siddha, Suresh B wrote: Christoph, While we are at this topic, recently I had reports that cache_free_alien() is costly on non NUMA platforms too (similar to the cache miss issues that Eric was referring to on NUMA) and the appended patch seems to fix it for non NUMA at least. Appended patch

[PATCH] slab: NUMA kmem_cache diet

2007-03-22 Thread Eric Dumazet
reduce the gfporder of cache_cache Signed-off-by: Eric Dumazet [EMAIL PROTECTED] diff --git a/mm/slab.c b/mm/slab.c index abf46ae..b187618 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -389,7 +389,6 @@ struct kmem_cache { unsigned int buffer_size; u32 reciprocal_buffer_size; /* 3

Re: non-NUMA cache_free_alien() (was Re: [RFC] SLAB : NUMA cache_free_alien() very expensive because of virt_to_slab(objp); nodeid = slabp->nodeid;)

2007-03-22 Thread Eric Dumazet
Siddha, Suresh B wrote: On Thu, Mar 22, 2007 at 11:12:39PM +0100, Eric Dumazet wrote: Siddha, Suresh B wrote: + if (num_online_nodes() == 1) + use_alien_caches = 0; + Unfortunately this part is wrong. oops. You should check num_possible_nodes(), or nr_node_ids

Re: [PATCH] slab: NUMA kmem_cache diet

2007-03-23 Thread Eric Dumazet
Pekka J Enberg wrote: (Please inline patches to the mail, makes it easier to review.) On Thu, 22 Mar 2007, Eric Dumazet wrote: Some NUMA machines have a big MAX_NUMNODES (possibly 1024), but fewer possible nodes. This patch dynamically sizes the 'struct kmem_cache' to allocate only needed
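
Editor's sketch of the sizing idea described above, in plain user-space C rather than the patch's code: a structure ends in a per-node pointer array and is allocated with only as many slots as there are possible nodes, instead of a compile-time maximum. All names here (struct cache, cache_size, cache_alloc) are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

struct node_list;                   /* opaque per-node state (placeholder) */

struct cache {
    unsigned int buffer_size;
    unsigned int nr_nodes;          /* number of slots actually allocated */
    struct node_list *nodelists[];  /* flexible array member, sized at runtime */
};

/* Bytes needed for a cache with `nr_nodes` per-node slots. */
static size_t cache_size(unsigned int nr_nodes)
{
    return offsetof(struct cache, nodelists) +
           (size_t)nr_nodes * sizeof(struct node_list *);
}

/* Allocate a zeroed cache with exactly `nr_nodes` slots; NULL on failure. */
struct cache *cache_alloc(unsigned int nr_nodes)
{
    struct cache *c = calloc(1, cache_size(nr_nodes));

    if (c)
        c->nr_nodes = nr_nodes;
    return c;
}
```

With 8-byte pointers, a 2-node machine pays for 16 bytes of slots here, where a fixed array of 1024 entries would cost 8192 bytes per cache.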

Re: [rfc][patch] queued spinlocks (i386)

2007-03-23 Thread Eric Dumazet
On Fri, 23 Mar 2007 09:59:11 +0100 Nick Piggin [EMAIL PROTECTED] wrote: Implement queued spinlocks for i386. This shouldn't increase the size of the spinlock structure, while still able to handle 2^16 CPUs. Not completely implemented with assembly yet, to make the algorithm a bit clearer.

Re: [rfc][patch] queued spinlocks (i386)

2007-03-23 Thread Eric Dumazet
On Fri, 23 Mar 2007 11:32:44 +0100 Nick Piggin [EMAIL PROTECTED] wrote: On Fri, Mar 23, 2007 at 11:04:18AM +0100, Ingo Molnar wrote: * Nick Piggin [EMAIL PROTECTED] wrote: Implement queued spinlocks for i386. [...] isnt this patented by MS? (which might not worry you SuSE/Novell

Re: [RFC] NUMA : could we introduce virt_to_nid() ?

2007-03-23 Thread Eric Dumazet
On Fri, 23 Mar 2007 14:48:24 +0200 Pekka Enberg [EMAIL PROTECTED] wrote: On 3/23/07, Eric Dumazet [EMAIL PROTECTED] wrote: Checking Christoph quicklist implementation, I found the same cache miss in free() than SLAB has. /* common implementation * int virt_to_nid(const void *addr

Re: [patch] [bugfix] loop.c

2007-03-23 Thread Eric Dumazet
On Fri, 23 Mar 2007 15:04:54 +0100 Tomas M [EMAIL PROTECTED] wrote: I posted this yesterday but it seems people didn't understand the real goal of my patch. So I will explain once more again: This is a bugfix for loop.c block driver, as it currently allocates more memory than it needs,

Re: [patch] [bugfix] loop.c

2007-03-23 Thread Eric Dumazet
On Fri, 23 Mar 2007 14:36:05 + Al Viro [EMAIL PROTECTED] wrote: On Fri, Mar 23, 2007 at 03:19:56PM +0100, Eric Dumazet wrote: I cooked the following patch (untested), feel free to test it. Please, get the cleanup into saner shape. This is too ugly. out_mem: while (nba

Re: [patch] [bugfix] loop.c

2007-03-23 Thread Eric Dumazet
On Fri, 23 Mar 2007 15:25:23 +0100 (CET) Jiri Kosina [EMAIL PROTECTED] wrote: On Fri, 23 Mar 2007, Eric Dumazet wrote: - if (max_loop < 1 || max_loop > 256) { - printk(KERN_WARNING loop: invalid max_loop (must be between -1 and 256), using

Re: [RFC] NUMA : could we introduce virt_to_nid() ?

2007-03-23 Thread Eric Dumazet
On Fri, 23 Mar 2007 07:50:28 -0700 (PDT) Christoph Lameter [EMAIL PROTECTED] wrote: On Fri, 23 Mar 2007, Eric Dumazet wrote: Checking Christoph quicklist implementation, I found the same cache miss in free() than SLAB has. /* common implementation * int virt_to_nid(const void *addr

Re: [patch 1/2] Ignore stolen time in the softlockup watchdog

2007-03-27 Thread Eric Dumazet
Jeremy Fitzhardinge wrote: +static DEFINE_PER_CPU(unsigned long long, touch_timestamp); ... void touch_softlockup_watchdog(void) { - __raw_get_cpu_var(touch_timestamp) = jiffies; + __raw_get_cpu_var(touch_timestamp) = sched_clock(); } Not very clear if this is safe on

Re: [patch 1/2] Ignore stolen time in the softlockup watchdog

2007-03-27 Thread Eric Dumazet
On Tue, 27 Mar 2007 00:12:53 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] wrote: Eric Dumazet wrote: Jeremy Fitzhardinge wrote: +static DEFINE_PER_CPU(unsigned long long, touch_timestamp); ... void touch_softlockup_watchdog(void) { -__raw_get_cpu_var(touch_timestamp

Re: IPv6: Connection reset/timeout under heavy load

2007-03-27 Thread Eric Dumazet
On Tue, 27 Mar 2007 11:45:46 +0200 Agoston Horvath [EMAIL PROTECTED] wrote: Hello, I'm trying to add ipv6 support to the RIPE whois-server. I'm going with the dual-stack address family independent solution (/proc/sys/net/ipv6/bindv6only is set to 0). I've written a small piece of code

Re: [PATCH] FUTEX : new PRIVATE futexes

2007-04-05 Thread Eric Dumazet
Nick Piggin wrote: Hi Eric, Thanks for doing this... It's looking good, I just have some minor comments: Hi Nick, thanks for reviewing. Eric Dumazet wrote: */ -int get_futex_key(void __user *uaddr, union futex_key *key) +int get_futex_key(void __user *uaddr, union futex_key *key

Re: [PATCH 2.6.21-rc6] [netfilter] early_drop imrovement

2007-04-06 Thread Eric Dumazet
On Fri, 06 Apr 2007 12:00:29 +0400 Vasily Averin [EMAIL PROTECTED] wrote: When the number of conntracks is reached ip_conntrack_max limit, early_drop() is called and tries to free one of already used conntracks in one of the hash buckets. If it does not find any conntracks that may be freed,

[PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-07 Thread Eric Dumazet
ni_syscall() call Signed-off-by: Eric Dumazet [EMAIL PROTECTED] --- include/linux/futex.h | 46 + kernel/futex.c| 340 ++-- 2 files changed, 268 insertions(+), 118 deletions(-) --- linux-2.6.21-rc5-mm4/include/linux/futex.h +++ linux-2.6.21-rc5-mm4-ed

Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-07 Thread Eric Dumazet
On Sat, 07 Apr 2007 19:30:14 +1000 Nick Piggin [EMAIL PROTECTED] wrote: Eric Dumazet wrote: - Current mm code has a problem with 64bit futexes, as spotted by Nick : get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. So it is possible

Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-07 Thread Eric Dumazet
Jakub Jelinek wrote: On Sat, Apr 07, 2007 at 10:43:39AM +0200, Eric Dumazet wrote: get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. So it is possible a 64bit futex spans two pages of memory... That would be a user bug. 32-bit futexes have to be 32
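
Editor's sketch of the arithmetic behind the exchange above: an address that passes a 4-byte alignment check can still place an 8-byte object across a page boundary, whereas an 8-byte-aligned one cannot. The helper names and the 4096-byte page size are assumptions for illustration; this is not get_futex_key() itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u     /* assumption: 4 KiB pages */

/* True if `addr` is a multiple of `size` (size must be a power of two). */
static bool naturally_aligned(uintptr_t addr, size_t size)
{
    return (addr & (size - 1)) == 0;
}

/* True if an object of `size` bytes at `addr` touches more than one page. */
static bool spans_two_pages(uintptr_t addr, size_t size)
{
    return (addr / PAGE_SIZE_ASSUMED) !=
           ((addr + size - 1) / PAGE_SIZE_ASSUMED);
}
```

Offset 4092 is the case in point: it is 4-byte aligned, so a check against sizeof(u32) accepts it, yet an 8-byte object placed there occupies bytes 4092 to 4099 and crosses into the next page.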

Re: [PATCH nf-2.6.22] [netfilter] early_drop imrovement

2007-04-07 Thread Eric Dumazet
Vasily Averin wrote: When the number of conntracks is reached nf_conntrack_max limit, early_drop() is called and tries to free one of already used conntracks in one of the hash buckets. If it does not find any conntracks that may be freed, it leads to transmission errors. However it is not

Re: Performance Stats: Kernel patch

2007-04-10 Thread Eric Dumazet
On Mon, 09 Apr 2007 18:22:22 +0400 Maxim Uvarov [EMAIL PROTECTED] wrote: --- linux-2.6.21-rc5.orig/arch/x86_64/kernel/entry.S +++ linux-2.6.21-rc5/arch/x86_64/kernel/entry.S @@ -236,6 +236,11 @@ ENTRY(system_call) movq %r10,%rcx call *sys_call_table(,%rax,8) # XXX:rip

Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-10 Thread Eric Dumazet
On Sat, 7 Apr 2007 15:15:56 -0700 Andrew Morton [EMAIL PROTECTED] wrote: On Sat, 7 Apr 2007 10:43:39 +0200 Eric Dumazet [EMAIL PROTECTED] wrote: get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. So it is possible a 64bit futex spans two pages

Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-11 Thread Eric Dumazet
On Wed, 11 Apr 2007 17:22:57 +1000 Nick Piggin [EMAIL PROTECTED] wrote: Eric Dumazet wrote: On Sat, 07 Apr 2007 19:30:14 +1000 Nick Piggin [EMAIL PROTECTED] wrote: Eric Dumazet wrote: - Current mm code has a problem with 64bit futexes, as spotted by Nick : get_futex_key

[PATCH, take5] FUTEX : new PRIVATE futexes

2007-04-11 Thread Eric Dumazet
(FUTEX_WAIT) call (using one futex) 345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes) 345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex) For reference : 187 cycles per getppid() call 188 cycles per umask() call 183 cycles per ni_syscall() call Signed-off-by: Eric Dumazet [EMAIL

Re: [PATCH, take4] FUTEX : new PRIVATE futexes

2007-04-11 Thread Eric Dumazet
On Wed, 11 Apr 2007 19:23:26 +1000 Nick Piggin [EMAIL PROTECTED] wrote: But... that isn't there in mainline. Why is it in -mm? At any rate, that makes it a no brainer to change. Seems to be related to lguest. Ask Rusty :) As this external thing certainly is not doing the check itself,

Re: Performance Stats: Kernel patch

2007-04-11 Thread Eric Dumazet
On Wed, 11 Apr 2007 15:59:16 +0400 Maxim Uvarov [EMAIL PROTECTED] wrote: Thanks Eric, It's really better. I have done changes. Do you have any others objections now? All is in attached perf_stat.patch. Hi Maxim I know *nothing* about powerpc assembly, but I think there is a problem : Index:

Re: Performance Stats: Kernel patch

2007-04-11 Thread Eric Dumazet
On Wed, 11 Apr 2007 15:59:16 +0400 Maxim Uvarov [EMAIL PROTECTED] wrote: Patch adds Process Performance Statistics. It make available to the user the following new per-process (thread) performance statistics: * Involuntary Context Switches * Voluntary Context Switches * Number of

Re: Performance Stats: Kernel patch

2007-04-11 Thread Eric Dumazet
Maxim Uvarov a écrit : Eric Dumazet wrote: Please check kernel/sys.c:k_getrusage() to see how getrusage() has to sum *lot* of individual fields to get precise process numbers (even counting stats for dead threads) Thanks for helping me and for this link. But it is not enough clear

Re: [PATCH] make MADV_FREE lazily free memory

2007-04-11 Thread Eric Dumazet
Rik van Riel a écrit : Make it possible for applications to have the kernel free memory lazily. This reduces a repeated free/malloc cycle from freeing pages and allocating them, to just marking them freeable. If the application wants to reuse them before the kernel needs the memory, not even a

Re: [PATCH] make MADV_FREE lazily free memory

2007-04-11 Thread Eric Dumazet
Rik van Riel a écrit : Eric Dumazet wrote: Rik van Riel a écrit : Make it possible for applications to have the kernel free memory lazily. This reduces a repeated free/malloc cycle from freeing pages and allocating them, to just marking them freeable. If the application wants to reuse them

Re: [PATCH 0/4] i386 - pte update optimizations

2007-04-13 Thread Eric Dumazet
Zachary Amsden a écrit : Yes. Even then, last time I clocked instructions, xchg was still slower than read / write, although I could be misremembering. And it's not totally clear that they will always be in cached state, however, and for SMP, we still want to drop the implicit lock in

Re: [patch] generic rwsems

2007-04-13 Thread Eric Dumazet
On Fri, 13 Apr 2007 14:31:52 +0100 David Howells [EMAIL PROTECTED] wrote: Break the counter down like this: 0x - not locked; queue empty 0x4000 - locked by writer; queue empty 0xc000 - locked by writer; queue occupied 0x0nnn

Re: Oddness with reading /proc/net/tcp

2007-04-13 Thread Eric Dumazet
Witold Krecicki a écrit : Reading data from /proc/net/tcp is slower with progress of reading data, tested on system with 200k active connections. Yes, this is a known problem. This is O(N^2) algo. Use ss from iproute package to get better performance... (less than 15 seconds for 200k

Re: [KJ][PATCH 03/04]use set_current_state in fs

2007-04-14 Thread Eric Dumazet
Milind Arun Choudhary a écrit : use set_current_state(TASK_*) instead of current->state = TASK_*, in fs/nfs Signed-off-by: Milind Arun Choudhary [EMAIL PROTECTED] --- idmap.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/nfs/idmap.c b/fs/nfs/idmap.c index

Re: all syscalls initially taking 4usec on a P4? Re: nonblocking UDPv4 recvfrom() taking 4usec @ 3GHz?

2007-02-20 Thread Eric Dumazet
On Tuesday 20 February 2007 17:27, bert hubert wrote: On Tue, Feb 20, 2007 at 11:50:13AM +0100, Andi Kleen wrote: P4s are pretty slow at taking locks (or rather doing atomical operations) and there are several of them in this path. You could try it with a UP kernel. Actually hotunplugging
