Re: [PATCH] SLUB use cmpxchg_local

2007-09-04 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Measurements on IA64 slub w/per cpu vs slub w/per cpu/cmpxchg_local 
> emulation. Results are not good:
> 

Hi Christoph,

I tried to come up with a patch set implementing the basics of a new
critical section: local_enter(flags) and local_exit(flags).
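
The actual implementation is in the two posts that follow; purely as an
illustration of the intent (my assumption, not the posted patches), a generic
fallback could simply map the new critical section onto interrupt disabling,
leaving room for architectures to provide something cheaper. The
ARCH_HAS_LOCAL_SECTION guard below is made up for the example:

/*
 * Hypothetical generic fallback, for illustration only; the real
 * local_enter()/local_exit() are defined in the patches posted next.
 * An arch where interrupt masking is expensive (like ia64) would
 * override these with something cheaper.
 */
#ifndef ARCH_HAS_LOCAL_SECTION
#define local_enter(flags)	local_irq_save(flags)
#define local_exit(flags)	local_irq_restore(flags)
#endif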

Can you try those on ia64 and tell me if the results are better?

See the next two posts...

Mathieu
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] SLUB use cmpxchg_local

2007-08-28 Thread Peter Zijlstra

On Tue, 2007-08-28 at 12:36 -0700, Christoph Lameter wrote:
> On Tue, 28 Aug 2007, Peter Zijlstra wrote:
> 
> > On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote:
> > > H. One wild idea would be to use a priority futex for the slab lock? 
> > > That would make the slow paths interrupt safe without requiring interrupt 
> > > disable? Does a futex fit into the page struct?
> > 
> > Very much puzzled at what you propose. in-kernel we use rt_mutex (has
> > PI) or mutex, futexes are user-space. (on -rt spinlock_t == mutex ==
> > rt_mutex)
> > 
> > Neither disable interrupts since they are sleeping locks.
> > 
> > That said, on -rt we do not need to disable interrupts in the allocators
> > because its a bug to call an allocator from raw irq context.
> 
> Right so if a prioriuty futex 

futex stands for Fast Userspace muTEX; please let's call it an rt_mutex.

> would have been taken from a process 
> context and then an interrupt thread (or so no idea about RT) is scheduled 
> then the interrupt thread could switch to the process context and complete 
> the work there before doing the "interrupt" work. So disabling interrupts 
> is no longer necessary.

-rt runs all of the irq handler in thread (process) context; the hard
irq handler just does something akin to a wakeup.

These irq threads typically run at fifo/50 or something like that.

[ note that this allows a form of irq prioritization even if the hardware
  doesn't support it. ]
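
For readers unfamiliar with the -rt scheme, a minimal illustration (not the
actual -rt code; struct my_dev and its handler_thread field are made up for
the example) of a hard irq handler that only wakes the handler thread:

#include <linux/interrupt.h>
#include <linux/sched.h>

struct my_dev {				/* hypothetical driver data */
	struct task_struct *handler_thread;
};

static irqreturn_t hard_irq(int irq, void *dev_id)
{
	struct my_dev *dev = dev_id;

	disable_irq_nosync(irq);		/* keep the line quiet until the thread runs */
	wake_up_process(dev->handler_thread);	/* real work runs in process context, e.g. fifo/50 */
	return IRQ_HANDLED;
}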


Re: [PATCH] SLUB use cmpxchg_local

2007-08-28 Thread Christoph Lameter
On Tue, 28 Aug 2007, Mathieu Desnoyers wrote:

> Ok, I just had a look at ia64 instruction set, and I fear that cmpxchg
> must always come with the acquire or release semantic. Is there any
> cmpxchg equivalent on ia64 that would be acquire and release semantic
> free ? This implicit memory ordering in the instruction seems to be
> responsible for the slowdown.

No. There is no cmpxchg used in the patches that I tested. The slowdown 
seems to come from the need to serialize at barriers. Adding an interrupt
enable/disable in the middle of the hot path creates another serialization 
point.

> If such primitive does not exist, then we should think about an irq
> disable fallback for this local atomic operation. However, I would
> prefer to let the cmpxchg_local primitive be bound to the "slow"
> cmpxchg_acq and create something like _cmpxchg_local that would be
> interrupt-safe, but not reentrant wrt NMIs.

Ummm... That is what I did. See the included patch that you quoted. The 
measurements show that such a fallback does not preserve the performance 
on IA64.


Re: [PATCH] SLUB use cmpxchg_local

2007-08-28 Thread Christoph Lameter
On Tue, 28 Aug 2007, Peter Zijlstra wrote:

> On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote:
> > H. One wild idea would be to use a priority futex for the slab lock? 
> > That would make the slow paths interrupt safe without requiring interrupt 
> > disable? Does a futex fit into the page struct?
> 
> Very much puzzled at what you propose. in-kernel we use rt_mutex (has
> PI) or mutex, futexes are user-space. (on -rt spinlock_t == mutex ==
> rt_mutex)
> 
> Neither disable interrupts since they are sleeping locks.
> 
> That said, on -rt we do not need to disable interrupts in the allocators
> because its a bug to call an allocator from raw irq context.

Right, so if a priority futex had been taken from a process 
context and an interrupt thread (or so, no idea about RT) were then scheduled, 
the interrupt thread could switch to the process context and complete 
the work there before doing the "interrupt" work. So disabling interrupts 
would no longer be necessary.


Re: [PATCH] SLUB use cmpxchg_local

2007-08-28 Thread Mathieu Desnoyers
Ok, I just had a look at the ia64 instruction set, and I fear that cmpxchg
must always come with acquire or release semantics. Is there any
cmpxchg equivalent on ia64 that is free of acquire and release
semantics? The implicit memory ordering in the instruction seems to be
responsible for the slowdown.

If such a primitive does not exist, then we should think about an irq
disable fallback for this local atomic operation. However, I would
prefer to let the cmpxchg_local primitive be bound to the "slow"
cmpxchg_acq and create something like _cmpxchg_local that would be
interrupt-safe, but not reentrant wrt NMIs.

This way, cmpxchg_local users could choose either the fast flavor
(_cmpxchg_local: not necessarily atomic wrt NMIs) or the fully reentrant
flavor (cmpxchg_local) available on the architecture. If you can think of
a better name, please tell me... it could also be: fast version (mostly
used): cmpxchg_local(); slow, fully reentrant version:
cmpxchg_local_nmi().
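
For ia64, the second naming could look roughly like the sketch below (an
assumption based on the paragraph above, not a tested implementation): the
fast flavour is essentially the irq-disable emulation quoted below, the
reentrant flavour reuses the existing acquire-semantics cmpxchg_acq:

/* Fast flavour (mostly used): interrupt-safe, not NMI-safe. */
static inline void *cmpxchg_local(void **p, void *old, void *new)
{
	unsigned long flags;
	void *before;

	local_irq_save(flags);
	before = *p;
	if (likely(before == old))
		*p = new;
	local_irq_restore(flags);
	return before;
}

/* Fully reentrant flavour: fall back to the acquire-semantics cmpxchg. */
#define cmpxchg_local_nmi(ptr, o, n)	cmpxchg_acq((ptr), (o), (n))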

Mathieu

* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Measurements on IA64 slub w/per cpu vs slub w/per cpu/cmpxchg_local 
> emulation. Results are not good:
> 
> slub/per cpu
> 1 times kmalloc(8)/kfree -> 105 cycles
> 1 times kmalloc(16)/kfree -> 104 cycles
> 1 times kmalloc(32)/kfree -> 105 cycles
> 1 times kmalloc(64)/kfree -> 104 cycles
> 1 times kmalloc(128)/kfree -> 104 cycles
> 1 times kmalloc(256)/kfree -> 115 cycles
> 1 times kmalloc(512)/kfree -> 116 cycles
> 1 times kmalloc(1024)/kfree -> 115 cycles
> 1 times kmalloc(2048)/kfree -> 115 cycles
> 1 times kmalloc(4096)/kfree -> 115 cycles
> 1 times kmalloc(8192)/kfree -> 117 cycles
> 1 times kmalloc(16384)/kfree -> 439 cycles
> 1 times kmalloc(32768)/kfree -> 800 cycles
> 
> 
> slub/per cpu + cmpxchg_local emulation
> 1 times kmalloc(8)/kfree -> 143 cycles
> 1 times kmalloc(16)/kfree -> 143 cycles
> 1 times kmalloc(32)/kfree -> 143 cycles
> 1 times kmalloc(64)/kfree -> 143 cycles
> 1 times kmalloc(128)/kfree -> 143 cycles
> 1 times kmalloc(256)/kfree -> 154 cycles
> 1 times kmalloc(512)/kfree -> 154 cycles
> 1 times kmalloc(1024)/kfree -> 154 cycles
> 1 times kmalloc(2048)/kfree -> 154 cycles
> 1 times kmalloc(4096)/kfree -> 155 cycles
> 1 times kmalloc(8192)/kfree -> 155 cycles
> 1 times kmalloc(16384)/kfree -> 440 cycles
> 1 times kmalloc(32768)/kfree -> 819 cycles
> 1 times kmalloc(65536)/kfree -> 902 cycles
> 
> 
> Parallel allocs:
> 
> Kmalloc N*alloc N*free(16): 0=102/136 1=97/136 2=99/140 3=98/140 4=100/138 
> 5=99/139 6=100/139 7=101/141 Average=99/139
> 
> cmpxchg_local emulation
> Kmalloc N*alloc N*free(16): 0=116/147 1=116/145 2=115/151 3=115/147 
> 4=115/149 5=117/147 6=116/148 7=116/146 Average=116/147
> 
> Patch used:
> 
> Index: linux-2.6/include/asm-ia64/atomic.h
> ===
> --- linux-2.6.orig/include/asm-ia64/atomic.h  2007-08-27 16:42:02.0 
> -0700
> +++ linux-2.6/include/asm-ia64/atomic.h   2007-08-27 17:50:24.0 
> -0700
> @@ -223,4 +223,17 @@ atomic64_add_negative (__s64 i, atomic64
>  #define smp_mb__after_atomic_inc()   barrier()
>  
>  #include <asm-generic/atomic.h>
> +
> +static inline void *cmpxchg_local(void **p, void *old, void *new)
> +{
> + unsigned long flags;
> + void *before;
> +
> + local_irq_save(flags);
> + before = *p;
> + if (likely(before == old))
> + *p = new;
> + local_irq_restore(flags);
> + return before;
> +}
>  #endif /* _ASM_IA64_ATOMIC_H */
> 
> kmem_cache_alloc before
> 
> 8900 <kmem_cache_alloc>:
> 8900:   01 28 31 0e 80 05   [MII]   alloc r37=ar.pfs,12,7,0
> 8906:   40 02 00 62 00 00   mov r36=b0
> 890c:   00 00 04 00 nop.i 0x0;;
> 8910:   0b 18 01 00 25 04   [MMI]   mov r35=psr;;
> 8916:   00 00 04 0e 00 00   rsm 0x4000
> 891c:   00 00 04 00 nop.i 0x0;;
> 8920:   08 50 90 1b 19 21   [MMI]   adds r10=3300,r13
> 8926:   70 02 80 00 42 40   mov r39=r32
> 892c:   05 00 c4 00 mov r42=b0
> 8930:   09 40 01 42 00 21   [MMI]   mov r40=r33
> 8936:   00 00 00 02 00 20   nop.m 0x0
> 893c:   f5 e7 ff 9f mov r41=-1;;
> 8940:   0b 48 00 14 10 10   [MMI]   ld4 r9=[r10];;
> 8946:   00 00 00 02 00 00   nop.m 0x0
> 894c:   01 48 58 00 sxt4 r8=r9;;
> 8950:   0b 18 20 40 12 20   [MMI]   shladd r3=r8,3,r32;;
> 8956:   20 80 0f 82 48 00   addl r2=8432,r3
> 895c:   00 00 04 00 nop.i 0x0;;
> 8960:   0a 00 01 04 18 10   [MMI]   ld8 r32=[r2];;
> 8966:   e0 a0 80 00 42 60   adds r14=20,r32
> 896c:  

Re: [PATCH] SLUB use cmpxchg_local

2007-08-28 Thread Peter Zijlstra
On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote:
> H. One wild idea would be to use a priority futex for the slab lock? 
> That would make the slow paths interrupt safe without requiring interrupt 
> disable? Does a futex fit into the page struct?

Very much puzzled at what you propose. In-kernel we use rt_mutex (which has
PI) or mutex; futexes are user-space. (On -rt, spinlock_t == mutex ==
rt_mutex.)

Neither disables interrupts, since they are sleeping locks.

That said, on -rt we do not need to disable interrupts in the allocators
because it's a bug to call an allocator from raw irq context.


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
Measurements on IA64 slub w/per cpu vs slub w/per cpu/cmpxchg_local 
emulation. Results are not good:

slub/per cpu
1 times kmalloc(8)/kfree -> 105 cycles
1 times kmalloc(16)/kfree -> 104 cycles
1 times kmalloc(32)/kfree -> 105 cycles
1 times kmalloc(64)/kfree -> 104 cycles
1 times kmalloc(128)/kfree -> 104 cycles
1 times kmalloc(256)/kfree -> 115 cycles
1 times kmalloc(512)/kfree -> 116 cycles
1 times kmalloc(1024)/kfree -> 115 cycles
1 times kmalloc(2048)/kfree -> 115 cycles
1 times kmalloc(4096)/kfree -> 115 cycles
1 times kmalloc(8192)/kfree -> 117 cycles
1 times kmalloc(16384)/kfree -> 439 cycles
1 times kmalloc(32768)/kfree -> 800 cycles


slub/per cpu + cmpxchg_local emulation
1 times kmalloc(8)/kfree -> 143 cycles
1 times kmalloc(16)/kfree -> 143 cycles
1 times kmalloc(32)/kfree -> 143 cycles
1 times kmalloc(64)/kfree -> 143 cycles
1 times kmalloc(128)/kfree -> 143 cycles
1 times kmalloc(256)/kfree -> 154 cycles
1 times kmalloc(512)/kfree -> 154 cycles
1 times kmalloc(1024)/kfree -> 154 cycles
1 times kmalloc(2048)/kfree -> 154 cycles
1 times kmalloc(4096)/kfree -> 155 cycles
1 times kmalloc(8192)/kfree -> 155 cycles
1 times kmalloc(16384)/kfree -> 440 cycles
1 times kmalloc(32768)/kfree -> 819 cycles
1 times kmalloc(65536)/kfree -> 902 cycles


Parallel allocs:

Kmalloc N*alloc N*free(16): 0=102/136 1=97/136 2=99/140 3=98/140 4=100/138 
5=99/139 6=100/139 7=101/141 Average=99/139

cmpxchg_local emulation
Kmalloc N*alloc N*free(16): 0=116/147 1=116/145 2=115/151 3=115/147 
4=115/149 5=117/147 6=116/148 7=116/146 Average=116/147

Patch used:

Index: linux-2.6/include/asm-ia64/atomic.h
===
--- linux-2.6.orig/include/asm-ia64/atomic.h2007-08-27 16:42:02.0 
-0700
+++ linux-2.6/include/asm-ia64/atomic.h 2007-08-27 17:50:24.0 -0700
@@ -223,4 +223,17 @@ atomic64_add_negative (__s64 i, atomic64
 #define smp_mb__after_atomic_inc() barrier()
 
 #include <asm-generic/atomic.h>
+
+static inline void *cmpxchg_local(void **p, void *old, void *new)
+{
+   unsigned long flags;
+   void *before;
+
+   local_irq_save(flags);
+   before = *p;
+   if (likely(before == old))
+   *p = new;
+   local_irq_restore(flags);
+   return before;
+}
 #endif /* _ASM_IA64_ATOMIC_H */

kmem_cache_alloc before

8900 <kmem_cache_alloc>:
8900:   01 28 31 0e 80 05   [MII]   alloc r37=ar.pfs,12,7,0
8906:   40 02 00 62 00 00   mov r36=b0
890c:   00 00 04 00 nop.i 0x0;;
8910:   0b 18 01 00 25 04   [MMI]   mov r35=psr;;
8916:   00 00 04 0e 00 00   rsm 0x4000
891c:   00 00 04 00 nop.i 0x0;;
8920:   08 50 90 1b 19 21   [MMI]   adds r10=3300,r13
8926:   70 02 80 00 42 40   mov r39=r32
892c:   05 00 c4 00 mov r42=b0
8930:   09 40 01 42 00 21   [MMI]   mov r40=r33
8936:   00 00 00 02 00 20   nop.m 0x0
893c:   f5 e7 ff 9f mov r41=-1;;
8940:   0b 48 00 14 10 10   [MMI]   ld4 r9=[r10];;
8946:   00 00 00 02 00 00   nop.m 0x0
894c:   01 48 58 00 sxt4 r8=r9;;
8950:   0b 18 20 40 12 20   [MMI]   shladd r3=r8,3,r32;;
8956:   20 80 0f 82 48 00   addl r2=8432,r3
895c:   00 00 04 00 nop.i 0x0;;
8960:   0a 00 01 04 18 10   [MMI]   ld8 r32=[r2];;
8966:   e0 a0 80 00 42 60   adds r14=20,r32
896c:   05 00 01 84 mov r43=r32
8970:   0b 10 01 40 18 10   [MMI]   ld8 r34=[r32];;
8976:   70 00 88 0c 72 00   cmp.eq p7,p6=0,r34
897c:   00 00 04 00 nop.i 0x0;;
8980:   cb 70 00 1c 10 90   [MMI] (p06) ld4 r14=[r14];;
8986:   e1 70 88 24 40 00 (p06) shladd r14=r14,3,r34
898c:   00 00 04 00 nop.i 0x0;;
8990:   c2 70 00 1c 18 10   [MII] (p06) ld8 r14=[r14]
8996:   00 00 00 02 00 00   nop.i 0x0;;
899c:   00 00 04 00 nop.i 0x0
89a0:   d8 00 38 40 98 11   [MMB] (p06) st8 [r32]=r14
89a6:   00 00 00 02 00 03   nop.m 0x0
89ac:   30 00 00 40   (p06) br.cond.sptk.few 89d0 

89b0:   11 00 00 00 01 00   [MIB]   nop.m 0x0
89b6:   00 00 00 02 00 00   nop.i 0x0
89bc:   18 d8 ff 58 br.call.sptk.many b0=61c0 
<__slab_alloc>;;
89c0:   08 10 01 10 00 21   [MMI]   mov r34=r8
89c6:   00 00 00 02 00 00   nop.m 0x0
89cc:  

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:

> Hrm, I just want to certify one thing: A lot of code paths seems to go
> to the slow path without requiring cmpxchg_local to execute at all. So
> is the slow path more likely to be triggered by the (!object),
> (!node_match) tests or by these same tests done in the redo after the
> initial cmpxchg_local ?

The slow path is more likely to be triggered by settings in the per cpu 
structure. The cmpxchg failure is comparatively rare. So the worst case
is getting worse, but the average use of interrupt enable/disable may not 
change much. We need some measurements to confirm that. I can try to 
run the emulation on IA64 and see what the result will be.


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> 
> > > The slow path would require disable preemption and two interrupt disables.
> > If the slow path have to call new_slab, then yes. But it seems that not
> > every slow path must call it, so for the other slow paths, only one
> > interrupt disable would be required.
> 
> If we include new_slab then we get to 3 times:
> 
> 1. In the cmpxchg_local emulation that fails
> 
> 2. For the slow path
> 
> 3. When calling the page allocator.
> 

Hrm, I just want to verify one thing: a lot of code paths seem to go
to the slow path without requiring cmpxchg_local to execute at all. So
is the slow path more likely to be triggered by the (!object) and
(!node_match) tests, or by these same tests done in the redo after the
initial cmpxchg_local?

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
Hmmm. One wild idea would be to use a priority futex for the slab lock? 
That would make the slow paths interrupt safe without requiring interrupt 
disable? Does a futex fit into the page struct?


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:

> > The slow path would require disable preemption and two interrupt disables.
> If the slow path have to call new_slab, then yes. But it seems that not
> every slow path must call it, so for the other slow paths, only one
> interrupt disable would be required.

If we include new_slab then we get to three interrupt disables:

1. In the cmpxchg_local emulation that fails

2. For the slow path

3. When calling the page allocator.


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> 
> > > a clean solution source code wise. It also minimizes the interrupt 
> > > holdoff 
> > > for the non-cmpxchg_local arches. However, it means that we will have to 
> > > disable interrupts twice for the slow path. If that is too expensive then 
> > > we need a different solution.
> > > 
> > 
> > cmpxchg_local is not used on the slow path... ?
> 
> Right.
> 
> > Did you meant:
> > 
> > it means that we will have to disable preemption _and_ interrupts on the
> > fast path for non-cmpxchg_local arches ?
> 
> We would have to disable preemption and interrupts once on the fast path. 
> The interrupt holdoff would just be a couple of instructions.
> 

Right.

> The slow path would require disable preemption and two interrupt disables.
> 

If the slow path has to call new_slab, then yes. But it seems that not
every slow path must call it, so for the other slow paths only one
interrupt disable would be required.

> Question is if this makes sense performance wise. If not then we may have 
> to look at more complicated schemes.
> 

Yep, such as the arch_have_cmpxchg() macro that I proposed, but it
really hurts my eyes... :(

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:

> > a clean solution source code wise. It also minimizes the interrupt holdoff 
> > for the non-cmpxchg_local arches. However, it means that we will have to 
> > disable interrupts twice for the slow path. If that is too expensive then 
> > we need a different solution.
> > 
> 
> cmpxchg_local is not used on the slow path... ?

Right.

> Did you meant:
> 
> it means that we will have to disable preemption _and_ interrupts on the
> fast path for non-cmpxchg_local arches ?

We would have to disable preemption and interrupts once on the fast path. 
The interrupt holdoff would just be a couple of instructions.

The slow path would require a preemption disable and two interrupt disables.

The question is whether this makes sense performance-wise. If not then we may 
have to look at more complicated schemes.


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> I think the simplest solution may be to leave slub as done in the patch 
> that we developed last week. The arch must provide a cmpxchg_local that is 
> performance wise the fastest possible. On x86 this is going to be the 
> cmpxchg_local on others where cmpxchg is slower than interrupt 
> disable/enable this is going to be the emulation that does
> 
> interrupt disable
> 
> cmpchg simulation
> 
> interrupt enable
> 
> 
> If we can establish that this is not a performance regression then we have 
> a clean solution source code wise. It also minimizes the interrupt holdoff 
> for the non-cmpxchg_local arches. However, it means that we will have to 
> disable interrupts twice for the slow path. If that is too expensive then 
> we need a different solution.
> 

cmpxchg_local is not used on the slow path...?

Did you mean:

it means that we will have to disable preemption _and_ interrupts on the
fast path for non-cmpxchg_local arches?

Or am I maybe thinking about the wrong code snippet there?

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
I think the simplest solution may be to leave slub as done in the patch 
that we developed last week. The arch must provide a cmpxchg_local that is 
performance-wise the fastest possible. On x86 this is going to be 
cmpxchg_local; on others, where cmpxchg is slower than interrupt 
disable/enable, it is going to be the emulation that does

interrupt disable

cmpxchg simulation

interrupt enable


If we can establish that this is not a performance regression then we have 
a clean solution source code wise. It also minimizes the interrupt holdoff 
for the non-cmpxchg_local arches. However, it means that we will have to 
disable interrupts twice for the slow path. If that is too expensive then 
we need a different solution.
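
For reference, a rough sketch (my reconstruction, not the actual patch;
node_match(), get_cpu_slab() and __slab_alloc() appear in the __slab_alloc
excerpt quoted in a later message, while the c->offset free-pointer indexing
is an assumption about the freelist layout) of how the allocation fast path
would use cmpxchg_local() against the per-cpu freelist:

static void *slab_alloc_sketch(struct kmem_cache *s, gfp_t gfpflags,
			       int node, void *addr)
{
	struct kmem_cache_cpu *c;
	void **object;

	preempt_disable();
	c = get_cpu_slab(s, smp_processor_id());
redo:
	object = c->freelist;
	if (unlikely(!object || !node_match(c, node))) {
		object = __slab_alloc(s, gfpflags, node, addr, c);
	} else if (cmpxchg_local((void **)&c->freelist, object,
				 object[c->offset]) != object) {
		/* An interrupt touched the freelist; try again. */
		goto redo;
	}
	preempt_enable();
	return object;
}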


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> 
> > * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > > On Mon, 27 Aug 2007, Peter Zijlstra wrote:
> > > 
> > > > So, if the fast path can be done with a preempt off, it might be doable
> > > > to suffer the slow path with a per cpu lock like that.
> > > 
> > > Sadly the cmpxchg_local requires local per cpu data access. Isnt there 
> > > some way to make this less expensive on RT? Acessing cpu local memory is 
> > > really good for performance on NUMA since the data is optimally placed 
> > > and 
> > > one can avoid/reduce locking if the process stays tied to the processor.
> > > 
> > 
> > On the slow path, in slab_new, we already have to reenable interrupts
> > because we can sleep. If we make sure that whenever we return to an irq
> > disable code path we take the current per-cpu data structure again, can
> > we make the preempt-disable/irq-disabled code paths O(1) ?
> 
> Not sure exactly what you are getting at?
> This would mean running __alloc_pages tied to one processor even though 
> waiting is possible?
> 

Not exactly. What I propose is:

- Running slab_alloc and slab_free fast paths in preempt_disable
  context, using cmpxchg_local.
- Running slab_alloc and slab_free slow paths with irqs disabled.
- Running __alloc_pages in preemptible context, not tied to any CPU.

In this scheme, calling __alloc_pages from slab_alloc would reenable
interrupts and potentially migrate us to a different CPU. We would
therefore have to get our per-cpu data structure again once we get
back into irq-disabled code, because we may be running on a different
CPU. This is actually what the __slab_alloc slow path does:


new_slab:
new = get_partial(s, gfpflags, node);
if (new) {
c->page = new;
goto load_freelist;
}

new = new_slab(s, gfpflags, node);

  > within new_slab, we can reenable interrupts for the
__slab_alloc call.

if (new) {
c = get_cpu_slab(s, smp_processor_id());
if (c->page) {
/*
 * Someone else populated the cpu_slab while we
 * enabled interrupts, or we have gotten scheduled
 * on another cpu. The page may not be on the
 * requested node even if __GFP_THISNODE was
 * specified. So we need to recheck.
 */
if (node_match(c, node)) {
/*
 * Current cpuslab is acceptable and we
 * want the current one since its cache hot
 */
discard_slab(s, new);
slab_lock(c->page);
goto load_freelist;
}
/* New slab does not fit our expectations */
flush_slab(s, c);
}
slab_lock(new);
SetSlabFrozen(new);
c->page = new;
goto load_freelist;

So the idea would be to split the code into O(1)
preempt-disable/irq-disable sections, and to reenable interrupts and
re-read the current per-cpu data structure when re-entering
irq-disabled code.
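
A sketch of that idea for the allocation of a new slab (my illustration,
assuming a kmem_cache with an order field; the re-fetch of the per-cpu
structure afterwards is exactly the get_cpu_slab()/recheck sequence shown
above):

static struct page *allocate_slab_sketch(struct kmem_cache *s,
					 gfp_t flags, int node)
{
	struct page *page;

	/*
	 * Sketch only: leave the O(1) irq-off section for the sleeping
	 * page allocation, then return to the caller with irqs off again;
	 * the caller must re-read its per-cpu data since we may have
	 * migrated to another CPU in between.
	 */
	local_irq_enable();
	page = alloc_pages_node(node, flags, s->order);	/* may sleep */
	local_irq_disable();

	return page;
}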

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:

> * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > On Mon, 27 Aug 2007, Peter Zijlstra wrote:
> > 
> > > So, if the fast path can be done with a preempt off, it might be doable
> > > to suffer the slow path with a per cpu lock like that.
> > 
> > Sadly the cmpxchg_local requires local per cpu data access. Isnt there 
> > some way to make this less expensive on RT? Acessing cpu local memory is 
> > really good for performance on NUMA since the data is optimally placed and 
> > one can avoid/reduce locking if the process stays tied to the processor.
> > 
> 
> On the slow path, in slab_new, we already have to reenable interrupts
> because we can sleep. If we make sure that whenever we return to an irq
> disable code path we take the current per-cpu data structure again, can
> we make the preempt-disable/irq-disabled code paths O(1) ?

Not sure exactly what you are getting at?
This would mean running __alloc_pages tied to one processor even though 
waiting is possible?



Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Peter Zijlstra wrote:
> 
> > So, if the fast path can be done with a preempt off, it might be doable
> > to suffer the slow path with a per cpu lock like that.
> 
> Sadly the cmpxchg_local requires local per cpu data access. Isnt there 
> some way to make this less expensive on RT? Acessing cpu local memory is 
> really good for performance on NUMA since the data is optimally placed and 
> one can avoid/reduce locking if the process stays tied to the processor.
> 

On the slow path, in new_slab, we already have to reenable interrupts
because we can sleep. If we make sure that whenever we return to an
irq-disabled code path we take the current per-cpu data structure again, can
we make the preempt-disable/irq-disabled code paths O(1)?

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Peter Zijlstra wrote:

> So, if the fast path can be done with a preempt off, it might be doable
> to suffer the slow path with a per cpu lock like that.

Sadly the cmpxchg_local requires local per-cpu data access. Isn't there 
some way to make this less expensive on RT? Accessing cpu-local memory is 
really good for performance on NUMA since the data is optimally placed and 
one can avoid/reduce locking if the process stays tied to the processor.



Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Peter Zijlstra
On Tue, 2007-08-21 at 16:14 -0700, Christoph Lameter wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> 
> > - Changed smp_rmb() for barrier(). We are not interested in read order
> >   across cpus, what we want is to be ordered wrt local interrupts only.
> >   barrier() is much cheaper than a rmb().
> 
> But this means a preempt disable is required. RT users do not want that.
> Without preemption the processor can be moved after c has been determined.
> That is why the smp_rmb() is there.

Likewise for disabling interrupts, we don't like that either. So
anything that requires cpu-pinning is preferably not done.

That said, we can suffer a preempt-off section if it's O(1) and only a
few hundred cycles.

The trouble with all this percpu data in slub is that it also requires
pinning to the cpu in much of the slow path. The alternative is what we've
been doing so far with slab: a lock per cpu, and just grab one of those
locks and stick to the data belonging to that lock, regardless of
whether we get migrated.

slab-rt has these locks for all allocations and they are a massive
bottleneck for quite a few workloads; getting a fast-path allocation
without using these would be most welcome.

So, if the fast path can be done with preemption off, it might be doable
to suffer the slow path with a per-cpu lock like that.
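
As an illustration of that last point, a hypothetical per-cpu-lock slow path
in the slab-rt style (names and layout made up for the example): pick the
lock of whatever cpu we start on and keep using that cpu's data even if we
get migrated while holding it:

struct kmem_cache_cpu_rt {			/* hypothetical layout */
	spinlock_t	lock;			/* protects this cpu's freelist */
	void		**freelist;
};

static void *slow_alloc_with_percpu_lock(struct kmem_cache_cpu_rt *cpus)
{
	struct kmem_cache_cpu_rt *c;
	void **object;

	/* Pick whatever cpu we start on and stick to its data, even if
	 * the (sleeping, on -rt) lock lets us migrate while we hold it. */
	c = &cpus[raw_smp_processor_id()];
	spin_lock(&c->lock);
	object = c->freelist;
	if (object)
		c->freelist = (void **)*object;	/* pop the first free object */
	spin_unlock(&c->lock);
	return object;
}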


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Peter Zijlstra
On Tue, 2007-08-21 at 16:14 -0700, Christoph Lameter wrote:
 On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
 
  - Changed smp_rmb() for barrier(). We are not interested in read order
across cpus, what we want is to be ordered wrt local interrupts only.
barrier() is much cheaper than a rmb().
 
 But this means a preempt disable is required. RT users do not want that.
 Without preemption the processor can be moved after c has been determined.
 That is why the smp_rmb() is there.

Likewise for disabling interrupts, we don't like that either. So
anything that requires cpu-pinning is preferably not done.

That said, we can suffer a preempt-off section if its O(1) and only a
few hundred cycles.

The trouble with all this percpu data in slub is that it also requires
pinning to the cpu in much of the slow path, either that or what we've
been doing so far with slab, a lock per cpu, and just grab one of those
locks and stick to the data belonging to that lock, regardless of
whether we get migrated.

slab-rt has these locks for all allocations and they are a massive
bottleneck for quite a few workloads, getting a fast path allocation
without using these would be most welcome.

So, if the fast path can be done with a preempt off, it might be doable
to suffer the slow path with a per cpu lock like that.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Peter Zijlstra wrote:

 So, if the fast path can be done with a preempt off, it might be doable
 to suffer the slow path with a per cpu lock like that.

Sadly the cmpxchg_local requires local per cpu data access. Isnt there 
some way to make this less expensive on RT? Acessing cpu local memory is 
really good for performance on NUMA since the data is optimally placed and 
one can avoid/reduce locking if the process stays tied to the processor.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
 On Mon, 27 Aug 2007, Peter Zijlstra wrote:
 
  So, if the fast path can be done with a preempt off, it might be doable
  to suffer the slow path with a per cpu lock like that.
 
 Sadly the cmpxchg_local requires local per cpu data access. Isnt there 
 some way to make this less expensive on RT? Acessing cpu local memory is 
 really good for performance on NUMA since the data is optimally placed and 
 one can avoid/reduce locking if the process stays tied to the processor.
 

On the slow path, in slab_new, we already have to reenable interrupts
because we can sleep. If we make sure that whenever we return to an irq
disable code path we take the current per-cpu data structure again, can
we make the preempt-disable/irq-disabled code paths O(1) ?

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:

 * Christoph Lameter ([EMAIL PROTECTED]) wrote:
  On Mon, 27 Aug 2007, Peter Zijlstra wrote:
  
   So, if the fast path can be done with a preempt off, it might be doable
   to suffer the slow path with a per cpu lock like that.
  
  Sadly the cmpxchg_local requires local per cpu data access. Isnt there 
  some way to make this less expensive on RT? Acessing cpu local memory is 
  really good for performance on NUMA since the data is optimally placed and 
  one can avoid/reduce locking if the process stays tied to the processor.
  
 
 On the slow path, in slab_new, we already have to reenable interrupts
 because we can sleep. If we make sure that whenever we return to an irq
 disable code path we take the current per-cpu data structure again, can
 we make the preempt-disable/irq-disabled code paths O(1) ?

Not sure exactly what you are getting at?
This would mean running __alloc_pages tied to one processor even though 
waiting is possible?


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
 On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
 
  * Christoph Lameter ([EMAIL PROTECTED]) wrote:
   On Mon, 27 Aug 2007, Peter Zijlstra wrote:
   
So, if the fast path can be done with a preempt off, it might be doable
to suffer the slow path with a per cpu lock like that.
   
   Sadly the cmpxchg_local requires local per cpu data access. Isnt there 
   some way to make this less expensive on RT? Acessing cpu local memory is 
   really good for performance on NUMA since the data is optimally placed 
   and 
   one can avoid/reduce locking if the process stays tied to the processor.
   
  
  On the slow path, in slab_new, we already have to reenable interrupts
  because we can sleep. If we make sure that whenever we return to an irq
  disable code path we take the current per-cpu data structure again, can
  we make the preempt-disable/irq-disabled code paths O(1) ?
 
 Not sure exactly what you are getting at?
 This would mean running __alloc_pages tied to one processor even though 
 waiting is possible?
 

Not exactly. What I propose is:

- Running slab_alloc and slab_free fast paths in preempt_disable
  context, using cmpxchg_local.
- Running slab_alloc and slab_free slow paths with irqs disabled.
- Running __alloc_pages in preemptible context, not tied to any CPU.

In this scheme, calling __alloc_pages from slab_alloc would reenable
interrupts and potentially migrate us to a different CPU. We would
therefore have to get once again our per-cpu data structure once we get
back into irq disabled code, because we may be running on a different
CPU. This is actually what the __slab_alloc slow path does:


new_slab:
new = get_partial(s, gfpflags, node);
if (new) {
c-page = new;
goto load_freelist;
}

new = new_slab(s, gfpflags, node);

   within new_slab, we can reenable interrupts for the
__slab_alloc call.

if (new) {
c = get_cpu_slab(s, smp_processor_id());
if (c-page) {
/*
 * Someone else populated the cpu_slab while we
 * enabled interrupts, or we have gotten scheduled
 * on another cpu. The page may not be on the
 * requested node even if __GFP_THISNODE was
 * specified. So we need to recheck.
 */
if (node_match(c, node)) {
/*
 * Current cpuslab is acceptable and we
 * want the current one since its cache hot
 */
discard_slab(s, new);
slab_lock(c-page);
goto load_freelist;
}
/* New slab does not fit our expectations */
flush_slab(s, c);
}
slab_lock(new);
SetSlabFrozen(new);
c-page = new;
goto load_freelist;

So the idea would be to split the code in O(1)
preempt_disable/irq_disable sections and to enable interrupt and check
for current per-cpu data structure when re-entering in irq disabled
code.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
I think the simplest solution may be to leave slub as done in the patch 
that we developed last week. The arch must provide a cmpxchg_local that is 
performance wise the fastest possible. On x86 this is going to be the 
cmpxchg_local on others where cmpxchg is slower than interrupt 
disable/enable this is going to be the emulation that does

interrupt disable

cmpchg simulation

interrupt enable


If we can establish that this is not a performance regression then we have 
a clean solution source code wise. It also minimizes the interrupt holdoff 
for the non-cmpxchg_local arches. However, it means that we will have to 
disable interrupts twice for the slow path. If that is too expensive then 
we need a different solution.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
 I think the simplest solution may be to leave slub as done in the patch 
 that we developed last week. The arch must provide a cmpxchg_local that is 
 performance wise the fastest possible. On x86 this is going to be the 
 cmpxchg_local on others where cmpxchg is slower than interrupt 
 disable/enable this is going to be the emulation that does
 
 interrupt disable
 
 cmpchg simulation
 
 interrupt enable
 
 
 If we can establish that this is not a performance regression then we have 
 a clean solution source code wise. It also minimizes the interrupt holdoff 
 for the non-cmpxchg_local arches. However, it means that we will have to 
 disable interrupts twice for the slow path. If that is too expensive then 
 we need a different solution.
 

cmpxchg_local is not used on the slow path... ?

Did you meant:

it means that we will have to disable preemption _and_ interrupts on the
fast path for non-cmpxchg_local arches ?

Or maybe am I thinking about the wrong code snippet there ?

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:

  a clean solution source code wise. It also minimizes the interrupt holdoff 
  for the non-cmpxchg_local arches. However, it means that we will have to 
  disable interrupts twice for the slow path. If that is too expensive then 
  we need a different solution.
  
 
 cmpxchg_local is not used on the slow path... ?

Right.

 Did you meant:
 
 it means that we will have to disable preemption _and_ interrupts on the
 fast path for non-cmpxchg_local arches ?

We would have to disable preemption and interrupts once on the fast path. 
The interrupt holdoff would just be a couple of instructions.

The slow path would require disable preemption and two interrupt disables.

Question is if this makes sense performance wise. If not then we may have 
to look at more complicated schemes.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
 On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
 
   a clean solution source code wise. It also minimizes the interrupt 
   holdoff 
   for the non-cmpxchg_local arches. However, it means that we will have to 
   disable interrupts twice for the slow path. If that is too expensive then 
   we need a different solution.
   
  
  cmpxchg_local is not used on the slow path... ?
 
 Right.
 
  Did you meant:
  
  it means that we will have to disable preemption _and_ interrupts on the
  fast path for non-cmpxchg_local arches ?
 
 We would have to disable preemption and interrupts once on the fast path. 
 The interrupt holdoff would just be a couple of instructions.
 

Right.

 The slow path would require disable preemption and two interrupt disables.
 

If the slow path have to call new_slab, then yes. But it seems that not
every slow path must call it, so for the other slow paths, only one
interrupt disable would be required.

 Question is if this makes sense performance wise. If not then we may have 
 to look at more complicated schemes.
 

Yep, such as the arch_have_cmpxchg() macro that I proposed, but it
really hurts my eyes... :(

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:

  The slow path would require disable preemption and two interrupt disables.
 If the slow path have to call new_slab, then yes. But it seems that not
 every slow path must call it, so for the other slow paths, only one
 interrupt disable would be required.

If we include new_slab then we get to 3 times:

1. In the cmpxchg_local emulation that fails

2. For the slow path

3. When calling the page allocator.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
H. One wild idea would be to use a priority futex for the slab lock? 
That would make the slow paths interrupt safe without requiring interrupt 
disable? Does a futex fit into the page struct?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
 On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
 
   The slow path would require disable preemption and two interrupt disables.
  If the slow path have to call new_slab, then yes. But it seems that not
  every slow path must call it, so for the other slow paths, only one
  interrupt disable would be required.
 
 If we include new_slab then we get to 3 times:
 
 1. In the cmpxchg_local emulation that fails
 
 2. For the slow path
 
 3. When calling the page allocator.
 

Hrm, I just want to certify one thing: A lot of code paths seems to go
to the slow path without requiring cmpxchg_local to execute at all. So
is the slow path more likely to be triggered by the (!object),
(!node_match) tests or by these same tests done in the redo after the
initial cmpxchg_local ?

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:

 Hrm, I just want to certify one thing: A lot of code paths seems to go
 to the slow path without requiring cmpxchg_local to execute at all. So
 is the slow path more likely to be triggered by the (!object),
 (!node_match) tests or by these same tests done in the redo after the
 initial cmpxchg_local ?

The slow path is more likely to be triggered by settings in the per cpu 
structure. The cmpxchg failure is comparatively rare. So the worst case
is getting worse but the average use of interrupt enable/disable may not 
change much. Need to have some measurements to confirm that. I can try to 
run the emulation on IA64 and see what the result will be.
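
To make that concrete, the checks on the fast path of the cmpxchg_local
version are roughly the following (a sketch, following the patch posted
earlier in the thread):

	c = get_cpu_slab(s, get_cpu());
redo:
	object = c->freelist;
	if (unlikely(!object || !node_match(c, node)))
		/* the usual reason for entering the slow path */
		return __slab_alloc(s, gfpflags, node, addr, c);

	if (unlikely(cmpxchg_local(&c->freelist, object,
			object[c->offset]) != object))
		goto redo;	/* the rare case: lost a race with an interrupt */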

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
Measurements on IA64 slub w/per cpu vs slub w/per cpu/cmpxchg_local 
emulation. Results are not good:

slub/per cpu
1 times kmalloc(8)/kfree -> 105 cycles
1 times kmalloc(16)/kfree -> 104 cycles
1 times kmalloc(32)/kfree -> 105 cycles
1 times kmalloc(64)/kfree -> 104 cycles
1 times kmalloc(128)/kfree -> 104 cycles
1 times kmalloc(256)/kfree -> 115 cycles
1 times kmalloc(512)/kfree -> 116 cycles
1 times kmalloc(1024)/kfree -> 115 cycles
1 times kmalloc(2048)/kfree -> 115 cycles
1 times kmalloc(4096)/kfree -> 115 cycles
1 times kmalloc(8192)/kfree -> 117 cycles
1 times kmalloc(16384)/kfree -> 439 cycles
1 times kmalloc(32768)/kfree -> 800 cycles


slub/per cpu + cmpxchg_local emulation
1 times kmalloc(8)/kfree -> 143 cycles
1 times kmalloc(16)/kfree -> 143 cycles
1 times kmalloc(32)/kfree -> 143 cycles
1 times kmalloc(64)/kfree -> 143 cycles
1 times kmalloc(128)/kfree -> 143 cycles
1 times kmalloc(256)/kfree -> 154 cycles
1 times kmalloc(512)/kfree -> 154 cycles
1 times kmalloc(1024)/kfree -> 154 cycles
1 times kmalloc(2048)/kfree -> 154 cycles
1 times kmalloc(4096)/kfree -> 155 cycles
1 times kmalloc(8192)/kfree -> 155 cycles
1 times kmalloc(16384)/kfree -> 440 cycles
1 times kmalloc(32768)/kfree -> 819 cycles
1 times kmalloc(65536)/kfree -> 902 cycles


Parallel allocs:

Kmalloc N*alloc N*free(16): 0=102/136 1=97/136 2=99/140 3=98/140 4=100/138 
5=99/139 6=100/139 7=101/141 Average=99/139

cmpxchg_local emulation
Kmalloc N*alloc N*free(16): 0=116/147 1=116/145 2=115/151 3=115/147 
4=115/149 5=117/147 6=116/148 7=116/146 Average=116/147

Patch used:

Index: linux-2.6/include/asm-ia64/atomic.h
===
--- linux-2.6.orig/include/asm-ia64/atomic.h2007-08-27 16:42:02.0 
-0700
+++ linux-2.6/include/asm-ia64/atomic.h 2007-08-27 17:50:24.0 -0700
@@ -223,4 +223,17 @@ atomic64_add_negative (__s64 i, atomic64
 #define smp_mb__after_atomic_inc() barrier()
 
 #include <asm-generic/atomic.h>
+
+static inline void *cmpxchg_local(void **p, void *old, void *new)
+{
+   unsigned long flags;
+   void *before;
+
+   local_irq_save(flags);
+   before = *p;
+   if (likely(before == old))
+   *p = new;
+   local_irq_restore(flags);
+   return before;
+}
 #endif /* _ASM_IA64_ATOMIC_H */

kmem_cache_alloc before

8900 kmem_cache_alloc:
8900:   01 28 31 0e 80 05   [MII]   alloc r37=ar.pfs,12,7,0
8906:   40 02 00 62 00 00   mov r36=b0
890c:   00 00 04 00 nop.i 0x0;;
8910:   0b 18 01 00 25 04   [MMI]   mov r35=psr;;
8916:   00 00 04 0e 00 00   rsm 0x4000
891c:   00 00 04 00 nop.i 0x0;;
8920:   08 50 90 1b 19 21   [MMI]   adds r10=3300,r13
8926:   70 02 80 00 42 40   mov r39=r32
892c:   05 00 c4 00 mov r42=b0
8930:   09 40 01 42 00 21   [MMI]   mov r40=r33
8936:   00 00 00 02 00 20   nop.m 0x0
893c:   f5 e7 ff 9f mov r41=-1;;
8940:   0b 48 00 14 10 10   [MMI]   ld4 r9=[r10];;
8946:   00 00 00 02 00 00   nop.m 0x0
894c:   01 48 58 00 sxt4 r8=r9;;
8950:   0b 18 20 40 12 20   [MMI]   shladd r3=r8,3,r32;;
8956:   20 80 0f 82 48 00   addl r2=8432,r3
895c:   00 00 04 00 nop.i 0x0;;
8960:   0a 00 01 04 18 10   [MMI]   ld8 r32=[r2];;
8966:   e0 a0 80 00 42 60   adds r14=20,r32
896c:   05 00 01 84 mov r43=r32
8970:   0b 10 01 40 18 10   [MMI]   ld8 r34=[r32];;
8976:   70 00 88 0c 72 00   cmp.eq p7,p6=0,r34
897c:   00 00 04 00 nop.i 0x0;;
8980:   cb 70 00 1c 10 90   [MMI] (p06) ld4 r14=[r14];;
8986:   e1 70 88 24 40 00 (p06) shladd r14=r14,3,r34
898c:   00 00 04 00 nop.i 0x0;;
8990:   c2 70 00 1c 18 10   [MII] (p06) ld8 r14=[r14]
8996:   00 00 00 02 00 00   nop.i 0x0;;
899c:   00 00 04 00 nop.i 0x0
89a0:   d8 00 38 40 98 11   [MMB] (p06) st8 [r32]=r14
89a6:   00 00 00 02 00 03   nop.m 0x0
89ac:   30 00 00 40   (p06) br.cond.sptk.few 89d0 
kmem_cache_alloc+0xd0
89b0:   11 00 00 00 01 00   [MIB]   nop.m 0x0
89b6:   00 00 00 02 00 00   nop.i 0x0
89bc:   18 d8 ff 58 br.call.sptk.many b0=61c0 
__slab_alloc;;
89c0:   08 10 01 10 00 21   [MMI]   mov r34=r8
89c6:   00 00 00 02 00 00

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Christoph Lameter
Ok so we need this.


Fix up preempt checks.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 mm/slub.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-08-22 13:33:40.0 -0700
+++ linux-2.6/mm/slub.c 2007-08-22 13:35:31.0 -0700
@@ -1469,6 +1469,7 @@ load_freelist:
 out:
slab_unlock(c->page);
local_irq_restore(flags);
+   preempt_check_resched();
if (unlikely((gfpflags & __GFP_ZERO)))
memset(object, 0, c->objsize);
return object;
@@ -1512,6 +1513,7 @@ new_slab:
goto load_freelist;
}
local_irq_restore(flags);
+   preempt_check_resched();
return NULL;
 debug:
object = c->page->freelist;
@@ -1592,8 +1594,8 @@ static void __slab_free(struct kmem_cach
void **object = (void *)x;
unsigned long flags;
 
+   put_cpu();
local_irq_save(flags);
-   put_cpu_no_resched();
slab_lock(page);
 
if (unlikely(SlabDebug(page)))

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Wed, 22 Aug 2007, Mathieu Desnoyers wrote:
> 
> > * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > >  void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> > > @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach
> > >  {
> > >   void *prior;
> > >   void **object = (void *)x;
> > > + unsigned long flags;
> > >  
> > > + local_irq_save(flags);
> > > + put_cpu_no_resched();
> > 
> > Those two lines may skip a preempt_check.
> 
> Yes we cannot execute something else here.
>  
> > Could we change them to this instead ?
> >   
> >   put_cpu();
> >   local_irq_save(flags);
> 
> Then the thread could be preempted and rescheduled on a different cpu 
> between put_cpu and local_irq_save() which means that we lose the
> state information of the kmem_cache_cpu structure.
> 

Maybe am I misunderstanding something, but kmem_cache_cpu does not seem
to be passed to __slab_free() at all, nor any data referenced by it. So
why do we care about being preempted there ?

> > Otherwise, it would be good to call
> > 
> >   preempt_check_resched();
> > 
> >   After each local_irq_restore() in this function.
> 
> We could do that but maybe the frequency of these checks would be too 
> high? When should the resched checks be used?

Since we are only doing this on the slow path, it does not hurt.
preempt_check_resched() is embedded in preempt_enable() and has a very
low impact (simple thread flag check in the standard case).
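
For reference, preempt_enable() currently expands to roughly the
following (paraphrased from include/linux/preempt.h, so the details may
differ slightly from the tree):

#define preempt_enable() \
do { \
	preempt_enable_no_resched(); \
	barrier(); \
	preempt_check_resched(); \
} while (0)

#define preempt_check_resched() \
do { \
	if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
		preempt_schedule(); \
} while (0)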

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Christoph Lameter
On Wed, 22 Aug 2007, Mathieu Desnoyers wrote:

> > Then the thread could be preempted and rescheduled on a different cpu 
> > between put_cpu and local_irq_save() which means that we lose the
> > state information of the kmem_cache_cpu structure.
> > 
> 
> Maybe am I misunderstanding something, but kmem_cache_cpu does not seem
> to be passed to __slab_free() at all, nor any data referenced by it. So
> why do we care about being preempted there ?

Right it is only useful for __slab_alloc. I just changed them both to look 
the same. We could do it that way in __slab_free() to avoid the later 
preempt_check_resched().

> > We could do that but maybe the frequency of these checks would be too 
> > high? When should the resched checks be used?
> 
> Since we are only doing this on the slow path, it does not hurt.
> preempt_check_resched() is embedded in preempt_enable() and has a very
> low impact (simple thread flag check in the standard case).

Ok then lets add it.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Christoph Lameter
On Wed, 22 Aug 2007, Mathieu Desnoyers wrote:

> * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> >  void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> > @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach
> >  {
> > void *prior;
> > void **object = (void *)x;
> > +   unsigned long flags;
> >  
> > +   local_irq_save(flags);
> > +   put_cpu_no_resched();
> 
> Those two lines may skip a preempt_check.

Yes we cannot execute something else here.
 
> Could we change them to this instead ?
>   
>   put_cpu();
>   local_irq_save(flags);

Then the thread could be preempted and rescheduled on a different cpu 
between put_cpu and local_irq_save() which means that we lose the
state information of the kmem_cache_cpu structure.

> Otherwise, it would be good to call
> 
>   preempt_check_resched();
> 
>   After each local_irq_restore() in this function.

We could do that but maybe the frequency of these checks would be too 
high? When should the resched checks be used?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
>  void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach
>  {
>   void *prior;
>   void **object = (void *)x;
> + unsigned long flags;
>  
> + local_irq_save(flags);
> + put_cpu_no_resched();

Those two lines may skip a preempt_check.

Could we change them to this instead ?
  
  put_cpu();
  local_irq_save(flags);

Otherwise, it would be good to call

  preempt_check_resched();

  After each local_irq_restore() in this function.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Christoph Lameter
Here is the current cmpxchg_local version that I used for testing.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 include/linux/slub_def.h |   10 +++---
 mm/slub.c|   74 ---
 2 files changed, 56 insertions(+), 28 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-08-21 22:34:30.0 -0700
+++ linux-2.6/mm/slub.c 2007-08-22 02:07:26.0 -0700
@@ -1442,13 +1442,18 @@ static void *__slab_alloc(struct kmem_ca
 {
void **object;
struct page *new;
+   unsigned long flags;
 
+   local_irq_save(flags);
+   put_cpu_no_resched();
if (!c->page)
+   /* Slab was flushed */
goto new_slab;
 
slab_lock(c->page);
if (unlikely(!node_match(c, node)))
goto another_slab;
+
 load_freelist:
object = c->page->freelist;
if (unlikely(!object))
@@ -1457,11 +1462,15 @@ load_freelist:
goto debug;
 
object = c->page->freelist;
-   c->freelist = object[c->offset];
c->page->inuse = s->objects;
c->page->freelist = NULL;
c->node = page_to_nid(c->page);
+   c->freelist = object[c->offset];
+out:
slab_unlock(c->page);
+   local_irq_restore(flags);
+   if (unlikely((gfpflags & __GFP_ZERO)))
+   memset(object, 0, c->objsize);
return object;
 
 another_slab:
@@ -1502,6 +1511,7 @@ new_slab:
c->page = new;
goto load_freelist;
}
+   local_irq_restore(flags);
return NULL;
 debug:
object = c->page->freelist;
@@ -1511,8 +1521,7 @@ debug:
c->page->inuse++;
c->page->freelist = object[c->offset];
c->node = -1;
-   slab_unlock(c->page);
-   return object;
+   goto out;
 }
 
 /*
@@ -1529,25 +1538,29 @@ static void __always_inline *slab_alloc(
gfp_t gfpflags, int node, void *addr)
 {
void **object;
-   unsigned long flags;
struct kmem_cache_cpu *c;
 
-   local_irq_save(flags);
-   c = get_cpu_slab(s, smp_processor_id());
-   if (unlikely(!c->freelist || !node_match(c, node)))
+   c = get_cpu_slab(s, get_cpu());
+redo:
+   object = c->freelist;
+   if (unlikely(!object))
+   goto slow;
 
-   object = __slab_alloc(s, gfpflags, node, addr, c);
+   if (unlikely(!node_match(c, node)))
+   goto slow;
 
-   else {
-   object = c->freelist;
-   c->freelist = object[c->offset];
-   }
-   local_irq_restore(flags);
+   if (unlikely(cmpxchg_local(&c->freelist, object,
+   object[c->offset]) != object))
+   goto redo;
 
-   if (unlikely((gfpflags & __GFP_ZERO) && object))
+   put_cpu();
+   if (unlikely((gfpflags & __GFP_ZERO)))
memset(object, 0, c->objsize);
 
return object;
+slow:
+   return __slab_alloc(s, gfpflags, node, addr, c);
+
 }
 
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
@@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach
 {
void *prior;
void **object = (void *)x;
+   unsigned long flags;
 
+   local_irq_save(flags);
+   put_cpu_no_resched();
slab_lock(page);
 
if (unlikely(SlabDebug(page)))
@@ -1603,6 +1619,7 @@ checks_ok:
 
 out_unlock:
slab_unlock(page);
+   local_irq_restore(flags);
return;
 
 slab_empty:
@@ -1613,6 +1630,7 @@ slab_empty:
remove_partial(s, page);
 
slab_unlock(page);
+   local_irq_restore(flags);
discard_slab(s, page);
return;
 
@@ -1637,19 +1655,29 @@ static void __always_inline slab_free(st
struct page *page, void *x, void *addr)
 {
void **object = (void *)x;
-   unsigned long flags;
+   void **freelist;
struct kmem_cache_cpu *c;
 
-   local_irq_save(flags);
debug_check_no_locks_freed(object, s->objsize);
-   c = get_cpu_slab(s, smp_processor_id());
-   if (likely(page == c->page && c->node >= 0)) {
-   object[c->offset] = c->freelist;
-   c->freelist = object;
-   } else
-   __slab_free(s, page, x, addr, c->offset);
 
-   local_irq_restore(flags);
+   c = get_cpu_slab(s, get_cpu());
+   if (unlikely(c->node < 0))
+   goto slow;
+redo:
+   freelist = c->freelist;
+   barrier();  /* If interrupt changes c->page -> cmpxchg failure */
+   if (unlikely(page != c->page))
+   goto slow;
+
+   object[c->offset] = freelist;
+   if (unlikely(cmpxchg_local(&c->freelist, freelist, object)
+   != freelist))
+   goto redo;
+
+   put_cpu();
+   return;
+slow:
+   __slab_free(s, page, x, addr, c->offset);
 }

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Christoph Lameter
I can confirm Mathieu's measurement now:

Athlon64:

regular NUMA/discontig

1. Kmalloc: Repeatedly allocate then free test
1 times kmalloc(8) -> 79 cycles kfree -> 92 cycles
1 times kmalloc(16) -> 79 cycles kfree -> 93 cycles
1 times kmalloc(32) -> 88 cycles kfree -> 95 cycles
1 times kmalloc(64) -> 124 cycles kfree -> 132 cycles
1 times kmalloc(128) -> 157 cycles kfree -> 247 cycles
1 times kmalloc(256) -> 200 cycles kfree -> 257 cycles
1 times kmalloc(512) -> 250 cycles kfree -> 277 cycles
1 times kmalloc(1024) -> 337 cycles kfree -> 314 cycles
1 times kmalloc(2048) -> 365 cycles kfree -> 330 cycles
1 times kmalloc(4096) -> 352 cycles kfree -> 240 cycles
1 times kmalloc(8192) -> 456 cycles kfree -> 340 cycles
1 times kmalloc(16384) -> 646 cycles kfree -> 471 cycles
2. Kmalloc: alloc/free test
1 times kmalloc(8)/kfree -> 124 cycles
1 times kmalloc(16)/kfree -> 124 cycles
1 times kmalloc(32)/kfree -> 124 cycles
1 times kmalloc(64)/kfree -> 124 cycles
1 times kmalloc(128)/kfree -> 124 cycles
1 times kmalloc(256)/kfree -> 132 cycles
1 times kmalloc(512)/kfree -> 132 cycles
1 times kmalloc(1024)/kfree -> 132 cycles
1 times kmalloc(2048)/kfree -> 132 cycles
1 times kmalloc(4096)/kfree -> 319 cycles
1 times kmalloc(8192)/kfree -> 486 cycles
1 times kmalloc(16384)/kfree -> 539 cycles

cmpxchg_local NUMA/discontig

1. Kmalloc: Repeatedly allocate then free test
1 times kmalloc(8) -> 55 cycles kfree -> 90 cycles
1 times kmalloc(16) -> 55 cycles kfree -> 92 cycles
1 times kmalloc(32) -> 70 cycles kfree -> 91 cycles
1 times kmalloc(64) -> 100 cycles kfree -> 141 cycles
1 times kmalloc(128) -> 128 cycles kfree -> 233 cycles
1 times kmalloc(256) -> 172 cycles kfree -> 251 cycles
1 times kmalloc(512) -> 225 cycles kfree -> 275 cycles
1 times kmalloc(1024) -> 325 cycles kfree -> 311 cycles
1 times kmalloc(2048) -> 346 cycles kfree -> 330 cycles
1 times kmalloc(4096) -> 351 cycles kfree -> 238 cycles
1 times kmalloc(8192) -> 450 cycles kfree -> 342 cycles
1 times kmalloc(16384) -> 630 cycles kfree -> 546 cycles
2. Kmalloc: alloc/free test
1 times kmalloc(8)/kfree -> 81 cycles
1 times kmalloc(16)/kfree -> 81 cycles
1 times kmalloc(32)/kfree -> 81 cycles
1 times kmalloc(64)/kfree -> 81 cycles
1 times kmalloc(128)/kfree -> 81 cycles
1 times kmalloc(256)/kfree -> 91 cycles
1 times kmalloc(512)/kfree -> 90 cycles
1 times kmalloc(1024)/kfree -> 91 cycles
1 times kmalloc(2048)/kfree -> 90 cycles
1 times kmalloc(4096)/kfree -> 318 cycles
1 times kmalloc(8192)/kfree -> 483 cycles
1 times kmalloc(16384)/kfree -> 536 cycles

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Mathieu Desnoyers
Measurements on a AMD64 2.0 GHz dual-core

In this test, we seem to remove 10 cycles from the kmalloc fast path.
On small allocations, it gives a 14% performance increase. kfree fast
path also seems to have a 10 cycles improvement.

1. Kmalloc: Repeatedly allocate then free test

* cmpxchg_local slub
kmalloc(8) = 63 cycles  kfree = 126 cycles
kmalloc(16) = 66 cycles kfree = 129 cycles
kmalloc(32) = 76 cycles kfree = 138 cycles
kmalloc(64) = 100 cycles    kfree = 288 cycles
kmalloc(128) = 128 cycles   kfree = 309 cycles
kmalloc(256) = 170 cycles   kfree = 315 cycles
kmalloc(512) = 221 cycles   kfree = 357 cycles
kmalloc(1024) = 324 cycles  kfree = 393 cycles
kmalloc(2048) = 354 cycles  kfree = 440 cycles
kmalloc(4096) = 394 cycles  kfree = 330 cycles
kmalloc(8192) = 523 cycles  kfree = 481 cycles
kmalloc(16384) = 643 cycles kfree = 649 cycles

* Base
kmalloc(8) = 74 cycles  kfree = 113 cycles
kmalloc(16) = 76 cycles kfree = 116 cycles
kmalloc(32) = 85 cycles kfree = 133 cycles
kmalloc(64) = 111 cycles    kfree = 279 cycles
kmalloc(128) = 138 cycles   kfree = 294 cycles
kmalloc(256) = 181 cycles   kfree = 304 cycles
kmalloc(512) = 237 cycles   kfree = 327 cycles
kmalloc(1024) = 340 cycles  kfree = 379 cycles
kmalloc(2048) = 378 cycles  kfree = 433 cycles
kmalloc(4096) = 399 cycles  kfree = 329 cycles
kmalloc(8192) = 528 cycles  kfree = 624 cycles
kmalloc(16384) = 651 cycles kfree = 737 cycles

2. Kmalloc: alloc/free test

* cmpxchg_local slub
kmalloc(8)/kfree = 96 cycles
kmalloc(16)/kfree = 97 cycles
kmalloc(32)/kfree = 97 cycles
kmalloc(64)/kfree = 97 cycles
kmalloc(128)/kfree = 97 cycles
kmalloc(256)/kfree = 105 cycles
kmalloc(512)/kfree = 108 cycles
kmalloc(1024)/kfree = 105 cycles
kmalloc(2048)/kfree = 107 cycles
kmalloc(4096)/kfree = 390 cycles
kmalloc(8192)/kfree = 626 cycles
kmalloc(16384)/kfree = 662 cycles

* Base
kmalloc(8)/kfree = 116 cycles
kmalloc(16)/kfree = 116 cycles
kmalloc(32)/kfree = 116 cycles
kmalloc(64)/kfree = 116 cycles
kmalloc(128)/kfree = 116 cycles
kmalloc(256)/kfree = 126 cycles
kmalloc(512)/kfree = 126 cycles
kmalloc(1024)/kfree = 126 cycles
kmalloc(2048)/kfree = 126 cycles
kmalloc(4096)/kfree = 384 cycles
kmalloc(8192)/kfree = 749 cycles
kmalloc(16384)/kfree = 786 cycles

Mathieu


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Andi Kleen
On Wed, Aug 22, 2007 at 09:45:33AM -0400, Mathieu Desnoyers wrote:
> Measurements on a AMD64 2.0 GHz dual-core
> 
> In this test, we seem to remove 10 cycles from the kmalloc fast path.
> On small allocations, it gives a 14% performance increase. kfree fast
> path also seems to have a 10 cycles improvement.

Looks good. Anything that makes kmalloc faster is good

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Andi Kleen
On Tue, Aug 21, 2007 at 06:06:19PM -0700, Christoph Lameter wrote:
> Ok. Measurements vs. simple cmpxchg on a Intel(R) Pentium(R) 4 CPU 3.20GHz 

Note the P4 is an extreme case in that "unusual" instructions are
quite slow (basically anything that falls out of the trace cache). Core2 
tends to be much more benign and generally acts more like a K8 in latencies.

There are millions and millions of P4s around of course and we
shouldn't disregard them, but they're not the future and not
highest priority.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> 
> > As I am going back through the initial cmpxchg_local implementation, it
> > seems like it was executing __slab_alloc() with preemption disabled,
> > which is wrong. new_slab() is not designed for that.
> 
> The version I send you did not use preemption.
> 
> We need to make a decision if we want to go without preemption and cmpxchg 
> or with preemption and cmpxchg_local.
> 

I don't expect any performance improvements with cmpxchg() over irq
disable/restore. I think we'll have to use cmpxchg_local

Also, we may argue that locked cmpxchg will have more scalability impact
than cmpxchg_local. Actually, I expect the LOCK prefix to have a bigger
scalability impact than the irq save/restore pair.
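
Concretely, the only difference between the two primitives is the lock
prefix on the same instruction; a minimal i386 sketch (the helper names
here are made up for illustration, they are not the kernel's):

static inline int cmpxchg_int(volatile int *ptr, int old, int new)
{
	int prev;

	/* locked variant: asserts a bus/cache lock, visible to all cpus */
	asm volatile("lock; cmpxchgl %2,%1"
		     : "=a" (prev), "+m" (*ptr)
		     : "r" (new), "0" (old)
		     : "memory");
	return prev;
}

static inline int cmpxchg_local_int(volatile int *ptr, int old, int new)
{
	int prev;

	/* same instruction without the lock prefix: only safe against
	 * code running on the local cpu (e.g. interrupts), but cheaper */
	asm volatile("cmpxchgl %2,%1"
		     : "=a" (prev), "+m" (*ptr)
		     : "r" (new), "0" (old)
		     : "memory");
	return prev;
}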

> If we really want to do this then the implementation of all of these 
> components need to result in competitive performance on all platforms.
> 

The minor issue I see here is on architectures where we have to simulate
cmpxchg_local with irq save/restore. Depending on how we implement the
code, it may result in two irq save/restore pairs instead of one, which
could make the code slower. However, if we are clever enough in our
low-level primitive usage, I think we could make the code use
cmpxchg_local when available and fall back on only _one_ irq disabled
section surrounding the whole code for other architectures.
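
A sketch of what I have in mind (the __HAVE_ARCH_CMPXCHG_LOCAL test is
hypothetical here, standing in for whatever arch_have_cmpxchg()-style
selection we end up with):

#ifdef __HAVE_ARCH_CMPXCHG_LOCAL
redo:
	object = c->freelist;
	if (unlikely(!object || !node_match(c, node)))
		goto slow;
	if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object)
		goto redo;
#else
	/* one single irq-disabled section around the whole fast path */
	local_irq_save(flags);
	object = c->freelist;
	if (unlikely(!object || !node_match(c, node))) {
		local_irq_restore(flags);
		goto slow;
	}
	c->freelist = object[c->offset];
	local_irq_restore(flags);
#endif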

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Ok. Measurements vs. simple cmpxchg on a Intel(R) Pentium(R) 4 CPU 3.20GHz 
> (hyperthreading enabled). Test run with your module show only minor 
> performance improvements and lots of regressions. So we must have 
> cmpxchg_local to see any improvements? Some kind of a recent optimization 
> of cmpxchg performance that we do not see on older cpus?
> 

I did not expect the cmpxchg with LOCK prefix to be faster than irq
save/restore. You will need to run these tests using cmpxchg_local to
see an improvement.

Mathieu

> 
> Code of kmem_cache_alloc (to show you that there are no debug options on):
> 
> Dump of assembler code for function kmem_cache_alloc:
> 0x4015cfa9 :push   %ebp
> 0x4015cfaa :mov%esp,%ebp
> 0x4015cfac :push   %edi
> 0x4015cfad :push   %esi
> 0x4015cfae :push   %ebx
> 0x4015cfaf :sub$0x10,%esp
> 0x4015cfb2 :mov%eax,%esi
> 0x4015cfb4 :   mov%edx,0xffe8(%ebp)
> 0x4015cfb7 :   mov0x4(%ebp),%eax
> 0x4015cfba :   mov%eax,0xfff0(%ebp)
> 0x4015cfbd :   mov%fs:0x404af008,%eax
> 0x4015cfc3 :   mov0x90(%esi,%eax,4),%edi
> 0x4015cfca :   mov(%edi),%ecx
> 0x4015cfcc :   test   %ecx,%ecx
> 0x4015cfce :   je 0x4015d00a 
> 
> 0x4015cfd0 :   mov0xc(%edi),%eax
> 0x4015cfd3 :   mov(%ecx,%eax,4),%eax
> 0x4015cfd6 :   mov%eax,%edx
> 0x4015cfd8 :   mov%ecx,%eax
> 0x4015cfda :   lock cmpxchg %edx,(%edi)
> 0x4015cfde :   mov%eax,%ebx
> 0x4015cfe0 :   cmp%ecx,%eax
> 0x4015cfe2 :   jne0x4015cfbd 
> 
> 0x4015cfe4 :   cmpw   $0x0,0xffe8(%ebp)
> 0x4015cfe9 :   jns0x4015d006 
> 
> 0x4015cfeb :   mov0x10(%edi),%edx
> 0x4015cfee :   xor%eax,%eax
> 0x4015cff0 :   mov%edx,%ecx
> 0x4015cff2 :   shr$0x2,%ecx
> 0x4015cff5 :   mov%ebx,%edi
> 
> Base
> 
> 1. Kmalloc: Repeatedly allocate then free test
> 1 times kmalloc(8) -> 332 cycles kfree -> 422 cycles
> 1 times kmalloc(16) -> 218 cycles kfree -> 360 cycles
> 1 times kmalloc(32) -> 214 cycles kfree -> 368 cycles
> 1 times kmalloc(64) -> 244 cycles kfree -> 390 cycles
> 1 times kmalloc(128) -> 320 cycles kfree -> 417 cycles
> 1 times kmalloc(256) -> 438 cycles kfree -> 550 cycles
> 1 times kmalloc(512) -> 527 cycles kfree -> 626 cycles
> 1 times kmalloc(1024) -> 678 cycles kfree -> 775 cycles
> 1 times kmalloc(2048) -> 748 cycles kfree -> 822 cycles
> 1 times kmalloc(4096) -> 641 cycles kfree -> 650 cycles
> 1 times kmalloc(8192) -> 741 cycles kfree -> 817 cycles
> 1 times kmalloc(16384) -> 872 cycles kfree -> 927 cycles
> 2. Kmalloc: alloc/free test
> 1 times kmalloc(8)/kfree -> 332 cycles
> 1 times kmalloc(16)/kfree -> 327 cycles
> 1 times kmalloc(32)/kfree -> 323 cycles
> 1 times kmalloc(64)/kfree -> 320 cycles
> 1 times kmalloc(128)/kfree -> 320 cycles
> 1 times kmalloc(256)/kfree -> 333 cycles
> 1 times kmalloc(512)/kfree -> 332 cycles
> 1 times kmalloc(1024)/kfree -> 330 cycles
> 1 times kmalloc(2048)/kfree -> 334 cycles
> 1 times kmalloc(4096)/kfree -> 674 cycles
> 1 times kmalloc(8192)/kfree -> 1155 cycles
> 1 times kmalloc(16384)/kfree -> 1226 cycles
> 
> Slub cmpxchg.
> 
> 1. Kmalloc: Repeatedly allocate then free test
> 1 times kmalloc(8) -> 296 cycles kfree -> 515 cycles
> 1 times kmalloc(16) -> 193 cycles kfree -> 412 cycles
> 1 times kmalloc(32) -> 188 cycles kfree -> 422 cycles
> 1 times kmalloc(64) -> 222 cycles kfree -> 441 cycles
> 1 times kmalloc(128) -> 292 cycles kfree -> 476 cycles
> 1 times kmalloc(256) -> 414 cycles kfree -> 589 cycles
> 1 times kmalloc(512) -> 513 cycles kfree -> 673 cycles
> 1 times kmalloc(1024) -> 694 cycles kfree -> 825 cycles
> 1 times kmalloc(2048) -> 739 cycles kfree -> 878 cycles
> 1 times kmalloc(4096) -> 636 cycles kfree -> 653 cycles
> 1 times kmalloc(8192) -> 715 cycles kfree -> 799 cycles
> 1 times kmalloc(16384) -> 855 cycles kfree -> 927 cycles
> 2. Kmalloc: alloc/free test
> 1 times kmalloc(8)/kfree -> 354 cycles
> 1 times kmalloc(16)/kfree -> 336 cycles
> 1 times kmalloc(32)/kfree -> 335 cycles
> 1 times kmalloc(64)/kfree -> 337 cycles
> 1 times kmalloc(128)/kfree -> 337 cycles
> 1 times kmalloc(256)/kfree -> 355 cycles
> 1 times kmalloc(512)/kfree -> 354 cycles
> 1 times kmalloc(1024)/kfree -> 337 cycles
> 1 times kmalloc(2048)/kfree -> 339 cycles
> 1 times kmalloc(4096)/kfree -> 674 cycles
> 1 times kmalloc(8192)/kfree -> 1128 cycles
> 1 times kmalloc(16384)/kfree -> 1240 cycles
> 
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the 

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
Ok. Measurements vs. simple cmpxchg on a Intel(R) Pentium(R) 4 CPU 3.20GHz 
(hyperthreading enabled). Test run with your module show only minor 
performance improvements and lots of regressions. So we must have 
cmpxchg_local to see any improvements? Some kind of a recent optimization 
of cmpxchg performance that we do not see on older cpus?


Code of kmem_cache_alloc (to show you that there are no debug options on):

Dump of assembler code for function kmem_cache_alloc:
0x4015cfa9 :push   %ebp
0x4015cfaa :mov%esp,%ebp
0x4015cfac :push   %edi
0x4015cfad :push   %esi
0x4015cfae :push   %ebx
0x4015cfaf :sub$0x10,%esp
0x4015cfb2 :mov%eax,%esi
0x4015cfb4 :   mov%edx,0xffe8(%ebp)
0x4015cfb7 :   mov0x4(%ebp),%eax
0x4015cfba :   mov%eax,0xfff0(%ebp)
0x4015cfbd :   mov%fs:0x404af008,%eax
0x4015cfc3 :   mov0x90(%esi,%eax,4),%edi
0x4015cfca :   mov(%edi),%ecx
0x4015cfcc :   test   %ecx,%ecx
0x4015cfce :   je 0x4015d00a 
0x4015cfd0 :   mov0xc(%edi),%eax
0x4015cfd3 :   mov(%ecx,%eax,4),%eax
0x4015cfd6 :   mov%eax,%edx
0x4015cfd8 :   mov%ecx,%eax
0x4015cfda :   lock cmpxchg %edx,(%edi)
0x4015cfde :   mov%eax,%ebx
0x4015cfe0 :   cmp%ecx,%eax
0x4015cfe2 :   jne0x4015cfbd 
0x4015cfe4 :   cmpw   $0x0,0xffe8(%ebp)
0x4015cfe9 :   jns0x4015d006 
0x4015cfeb :   mov0x10(%edi),%edx
0x4015cfee :   xor%eax,%eax
0x4015cff0 :   mov%edx,%ecx
0x4015cff2 :   shr$0x2,%ecx
0x4015cff5 :   mov%ebx,%edi

Base

1. Kmalloc: Repeatedly allocate then free test
1 times kmalloc(8) -> 332 cycles kfree -> 422 cycles
1 times kmalloc(16) -> 218 cycles kfree -> 360 cycles
1 times kmalloc(32) -> 214 cycles kfree -> 368 cycles
1 times kmalloc(64) -> 244 cycles kfree -> 390 cycles
1 times kmalloc(128) -> 320 cycles kfree -> 417 cycles
1 times kmalloc(256) -> 438 cycles kfree -> 550 cycles
1 times kmalloc(512) -> 527 cycles kfree -> 626 cycles
1 times kmalloc(1024) -> 678 cycles kfree -> 775 cycles
1 times kmalloc(2048) -> 748 cycles kfree -> 822 cycles
1 times kmalloc(4096) -> 641 cycles kfree -> 650 cycles
1 times kmalloc(8192) -> 741 cycles kfree -> 817 cycles
1 times kmalloc(16384) -> 872 cycles kfree -> 927 cycles
2. Kmalloc: alloc/free test
1 times kmalloc(8)/kfree -> 332 cycles
1 times kmalloc(16)/kfree -> 327 cycles
1 times kmalloc(32)/kfree -> 323 cycles
1 times kmalloc(64)/kfree -> 320 cycles
1 times kmalloc(128)/kfree -> 320 cycles
1 times kmalloc(256)/kfree -> 333 cycles
1 times kmalloc(512)/kfree -> 332 cycles
1 times kmalloc(1024)/kfree -> 330 cycles
1 times kmalloc(2048)/kfree -> 334 cycles
1 times kmalloc(4096)/kfree -> 674 cycles
1 times kmalloc(8192)/kfree -> 1155 cycles
1 times kmalloc(16384)/kfree -> 1226 cycles

Slub cmpxchg.

1. Kmalloc: Repeatedly allocate then free test
1 times kmalloc(8) -> 296 cycles kfree -> 515 cycles
1 times kmalloc(16) -> 193 cycles kfree -> 412 cycles
1 times kmalloc(32) -> 188 cycles kfree -> 422 cycles
1 times kmalloc(64) -> 222 cycles kfree -> 441 cycles
1 times kmalloc(128) -> 292 cycles kfree -> 476 cycles
1 times kmalloc(256) -> 414 cycles kfree -> 589 cycles
1 times kmalloc(512) -> 513 cycles kfree -> 673 cycles
1 times kmalloc(1024) -> 694 cycles kfree -> 825 cycles
1 times kmalloc(2048) -> 739 cycles kfree -> 878 cycles
1 times kmalloc(4096) -> 636 cycles kfree -> 653 cycles
1 times kmalloc(8192) -> 715 cycles kfree -> 799 cycles
1 times kmalloc(16384) -> 855 cycles kfree -> 927 cycles
2. Kmalloc: alloc/free test
1 times kmalloc(8)/kfree -> 354 cycles
1 times kmalloc(16)/kfree -> 336 cycles
1 times kmalloc(32)/kfree -> 335 cycles
1 times kmalloc(64)/kfree -> 337 cycles
1 times kmalloc(128)/kfree -> 337 cycles
1 times kmalloc(256)/kfree -> 355 cycles
1 times kmalloc(512)/kfree -> 354 cycles
1 times kmalloc(1024)/kfree -> 337 cycles
1 times kmalloc(2048)/kfree -> 339 cycles
1 times kmalloc(4096)/kfree -> 674 cycles
1 times kmalloc(8192)/kfree -> 1128 cycles
1 times kmalloc(16384)/kfree -> 1240 cycles


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Andi Kleen ([EMAIL PROTECTED]) wrote:
> Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
> > 
> > The measurements I get (in cycles):
> >  enable interrupts (STI)   disable interrupts (CLI)   local 
> > CMPXCHG
> > IA32 (P4)11282 26
> > x86_64 AMD64 125   102 19
> 
> What exactly did you benchmark here? On K8 CLI/STI are only supposed
> to be a few cycles. pushf/popf might be more expensive, but not that much.
> 

Hi Andi,

I benchmarked cmpxchg_local vs local_irq_save/local_irq_restore.
Details, and code, follow.

* cpuinfo:

processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 15
model   : 35
model name  : AMD Athlon(tm)64 X2 Dual Core Processor  3800+
stepping: 2
cpu MHz : 2009.204
cache size  : 512 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 1
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 
3dnow pni lahf_lm cmp_legacy
bogomips: 4023.38
TLB size: 1024 4K pages
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor   : 1
vendor_id   : AuthenticAMD
cpu family  : 15
model   : 35
model name  : AMD Athlon(tm)64 X2 Dual Core Processor  3800+
stepping: 2
cpu MHz : 2009.204
cache size  : 512 KB
physical id : 0
siblings: 2
core id : 1
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 1
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 
3dnow pni lahf_lm cmp_legacy
bogomips: 4018.49
TLB size: 1024 4K pages
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp


* Test ran:


/* test-cmpxchg-nolock.c
 *
 * Compare local cmpxchg with irq disable / enable.
 */


#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define NR_LOOPS 2

int test_val;

static void do_test_cmpxchg(void)
{
int ret;
long flags;
unsigned int i;
cycles_t time1, time2, time;
long rem;

local_irq_save(flags);
preempt_disable();
time1 = get_cycles();
for (i = 0; i < NR_LOOPS; i++) {
ret = cmpxchg_local(&test_val, 0, 0);
}
time2 = get_cycles();
local_irq_restore(flags);
preempt_enable();
time = time2 - time1;

printk(KERN_ALERT "test results: time for non locked cmpxchg\n");
printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
printk(KERN_ALERT "total time: %llu\n", time);
time = div_long_long_rem(time, NR_LOOPS, &rem);
printk(KERN_ALERT "-> non locked cmpxchg takes %llu cycles\n", time);
printk(KERN_ALERT "test end\n");
}

/*
 * This test will have a higher standard deviation due to incoming interrupts.
 */
static void do_test_enable_int(void)
{
long flags;
unsigned int i;
cycles_t time1, time2, time;
long rem;

local_irq_save(flags);
preempt_disable();
time1 = get_cycles();
for (i = 0; i < NR_LOOPS; i++) {
local_irq_restore(flags);
}
time2 = get_cycles();
local_irq_restore(flags);
preempt_enable();
time = time2 - time1;

printk(KERN_ALERT "test results: time for enabling interrupts (STI)\n");
printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
printk(KERN_ALERT "total time: %llu\n", time);
time = div_long_long_rem(time, NR_LOOPS, &rem);
printk(KERN_ALERT "-> enabling interrupts (STI) takes %llu cycles\n",
time);
printk(KERN_ALERT "test end\n");
}

static void do_test_disable_int(void)
{
unsigned long flags, flags2;
unsigned int i;
cycles_t time1, time2, time;
long rem;

local_irq_save(flags);
preempt_disable();
time1 = get_cycles();
for ( i = 0; i < NR_LOOPS; i++) {
local_irq_save(flags2);
}
time2 = get_cycles();
local_irq_restore(flags);
preempt_enable();
time = time2 - time1;

printk(KERN_ALERT "test results: time for disabling interrupts 
(CLI)\n");
printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
printk(KERN_ALERT "total time: %llu\n", time);
time = div_long_long_rem(time, NR_LOOPS, &rem);
printk(KERN_ALERT "-> disabling interrupts (CLI) takes %llu cycles\n",
time);
 

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> As I am going back through the initial cmpxchg_local implementation, it
> seems like it was executing __slab_alloc() with preemption disabled,
> which is wrong. new_slab() is not designed for that.

The version I sent you did not use preemption.

We need to make a decision if we want to go without preemption and cmpxchg 
or with preemption and cmpxchg_local.

If we really want to do this then the implementation of all of these 
components need to result in competitive performance on all platforms.
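
For readers following along, here is a minimal sketch contrasting the two
candidate fast paths under discussion. The per-CPU structure and the
freelist handling mirror the patches quoted later in the thread, but the
helper names and the slow-path stub are placeholders, not the actual SLUB
code.

	/* Variant A: no preemption disable, full (LOCK-prefixed on SMP) cmpxchg. */
	static void *fastpath_alloc_cmpxchg(struct kmem_cache *s, gfp_t gfpflags)
	{
		struct kmem_cache_cpu *c;
		void **object;

		do {
			c = get_cpu_slab(s, raw_smp_processor_id());
			object = c->freelist;
			if (unlikely(!object))
				return slow_path_alloc(s, gfpflags, c);	/* placeholder */
			/* We may have migrated to another CPU: needs the LOCK prefix. */
		} while (cmpxchg(&c->freelist, object, object[c->offset]) != object);

		return object;
	}

	/* Variant B: preemption disabled, cheaper cmpxchg_local (no LOCK prefix). */
	static void *fastpath_alloc_cmpxchg_local(struct kmem_cache *s, gfp_t gfpflags)
	{
		struct kmem_cache_cpu *c;
		void **object;

		preempt_disable();
		c = get_cpu_slab(s, smp_processor_id());
		do {
			object = c->freelist;
			if (unlikely(!object)) {
				object = slow_path_alloc(s, gfpflags, c);	/* placeholder */
				break;
			}
			/* Only interrupts on this CPU can race with us here. */
		} while (cmpxchg_local(&c->freelist, object, object[c->offset]) != object);
		preempt_enable();

		return object;
	}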



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Andi Kleen
Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
> 
> The measurements I get (in cycles):
>               enable interrupts (STI)   disable interrupts (CLI)   local CMPXCHG
> IA32 (P4)              112                         82                   26
> x86_64 AMD64           125                        102                   19

What exactly did you benchmark here? On K8 CLI/STI are only supposed
to be a few cycles. pushf/popf might be more expensive, but not that much.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> 
> > - Rounding error.. you seem to round at 0.1ms, but I keep the values in
> >   cycles. The times that you get (1.1ms) seems strangely higher than
> >   mine, which are under 1000 cycles on a 3GHz system (less than 333ns).
> >   I guess there is both a ms - ns error there and/or not enough
> >   precision in your numbers.
> 
> Nope the rounding for output is depending on the amount. Rounds to one 
> digit after whatever unit we figured out is best to display.
> 
> And multiplications (cyc2ns) do not result in rounding errors.
> 

Ok, I see now that the 1.1ms was for the 10000 iterations, which makes
it about 230 ns/iteration for the 10000 times kmalloc(8) = 2.3ms test.

As I am going back through the initial cmpxchg_local implementation, it
seems like it was executing __slab_alloc() with preemption disabled,
which is wrong. new_slab() is not designed for that.
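
A minimal sketch of how the slow path could drop preemption around the
sleeping allocation, assuming the preempt_disable() taken in the fast
path is still held on entry; apart from new_slab(), the names here are
placeholders rather than the real __slab_alloc() code.

	static void *slow_alloc_sketch(struct kmem_cache *s, gfp_t gfpflags, int node)
	{
		struct page *page;

		preempt_enable();		/* new_slab() may sleep in the page allocator */
		page = new_slab(s, gfpflags, node);
		preempt_disable();		/* back to a stable view of this CPU */

		if (!page)
			return NULL;

		/*
		 * We may have migrated while preemption was enabled, so the
		 * per-CPU slab pointer must be re-read before installing the
		 * freshly allocated page.
		 */
		return install_new_cpu_slab(s, page);	/* placeholder */
	}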

I'll try to run my tests on AMD64.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> - Rounding error.. you seem to round at 0.1ms, but I keep the values in
>   cycles. The times that you get (1.1ms) seems strangely higher than
>   mine, which are under 1000 cycles on a 3GHz system (less than 333ns).
>   I guess there is both a ms - ns error there and/or not enough
>   precision in your numbers.

Nope the rounding for output is depending on the amount. Rounds to one 
digit after whatever unit we figured out is best to display.

And multiplications (cyc2ns) do not result in rounding errors.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> 
> > Are you running a UP or SMP kernel ? If you run a UP kernel, the
> > cmpxchg_local and cmpxchg are identical.
> 
> UP.
> 
> > Oh, and if you run your tests at boot time, the alternatives code may
> > have removed the lock prefix, therefore making cmpxchg and cmpxchg_local
> > exactly the same.
> 
> Tests were run at boot time.
> 
> That still does not explain kmalloc not showing improvements.
> 

Hrm, weird.. because it should. Here are the numbers I posted
previously:


The measurements I get (in cycles):
              enable interrupts (STI)   disable interrupts (CLI)   local CMPXCHG
IA32 (P4)              112                         82                   26
x86_64 AMD64           125                        102                   19

So both AMD64 and IA32 should be improved.

So why those improvements are not shown in your test ? A few possible
causes:

- Do you have any CONFIG_DEBUG_* options activated ? smp_processor_id()
  may end up being more expensive in these cases.
- Rounding error.. you seem to round at 0.1ms, but I keep the values in
  cycles. The times that you get (1.1ms) seems strangely higher than
  mine, which are under 1000 cycles on a 3GHz system (less than 333ns).
  I guess there is both a ms - ns error there and/or not enough
  precision in your numbers.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> Are you running a UP or SMP kernel ? If you run a UP kernel, the
> cmpxchg_local and cmpxchg are identical.

UP.

> Oh, and if you run your tests at boot time, the alternatives code may
> have removed the lock prefix, therefore making cmpxchg and cmpxchg_local
> exactly the same.

Tests were run at boot time.

That still does not explain kmalloc not showing improvements.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> 
> > Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as
> > shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles
> > for the kmalloc/kfree pair (test 2).
> 
> Hmmm... I wonder if the AMD processors simply do the same in either
> version.

Not supposed to. I remember having posted numbers that show a
difference.

Are you running a UP or SMP kernel ? If you run a UP kernel, the
cmpxchg_local and cmpxchg are identical.

Oh, and if you run your tests at boot time, the alternatives code may
have removed the lock prefix, therefore making cmpxchg and cmpxchg_local
exactly the same.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> > 
> > > - Changed smp_rmb() for barrier(). We are not interested in read order
> > >   across cpus, what we want is to be ordered wrt local interrupts only.
> > >   barrier() is much cheaper than a rmb().
> > 
> > But this means a preempt disable is required. RT users do not want that.
> > Without preemption the processor can be moved after c has been determined.
> > That is why the smp_rmb() is there.
> 
> preemption is required if we want to use cmpxchg_local anyway.
> 
> We may have to find a way to use preemption while being able to give an
> upper bound on the preempt disabled execution time. I think I got a way
> to do this yesterday.. I'll dig in my patches.
> 

Yeah, I remember having done so : moving the preempt disable nearer to
the cmpxchg, checking if the cpuid has changed between the
raw_smp_processor_id() read and the preempt_disable done later, redo if
it is different. It makes the slow path faster, but makes the fast path
more complex, therefore I finally dropped the patch. And we talk about
~10 cycles for the slow path here, I doubt it's worth the complexity
added to the fast path.
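
For concreteness, a sketch of the dropped approach described above, with
the late preempt_disable() and the CPU-id recheck; the function and helper
names are illustrative only, not the patch that was dropped.

	static void *alloc_fastpath_late_preempt(struct kmem_cache *s)
	{
		struct kmem_cache_cpu *c;
		void **object;
		int cpu;

	redo:
		cpu = raw_smp_processor_id();	/* preemption still enabled here */
		c = get_cpu_slab(s, cpu);
		object = c->freelist;
		if (unlikely(!object))
			return NULL;		/* fall back to the slow path */

		preempt_disable();
		if (unlikely(cpu != smp_processor_id())) {
			/* Migrated since the id was read: c may be stale, retry. */
			preempt_enable();
			goto redo;
		}
		if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object) {
			/* An interrupt on this CPU touched the freelist, retry. */
			preempt_enable();
			goto redo;
		}
		preempt_enable();
		return object;
	}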

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as
> shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles
> for the kmalloc/kfree pair (test 2).

Hmmm... I wonder if the AMD processors simply do the same in either
version.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> kmalloc(8)/kfree = 112 cycles
> kmalloc(16)/kfree = 103 cycles
> kmalloc(32)/kfree = 103 cycles
> kmalloc(64)/kfree = 103 cycles
> kmalloc(128)/kfree = 112 cycles
> kmalloc(256)/kfree = 111 cycles
> kmalloc(512)/kfree = 111 cycles
> kmalloc(1024)/kfree = 111 cycles
> kmalloc(2048)/kfree = 121 cycles

Looks good. This improves handling for short lived objects about 
threefold.

> kmalloc(4096)/kfree = 650 cycles
> kmalloc(8192)/kfree = 1042 cycles
> kmalloc(16384)/kfree = 1149 cycles

Hmmm... The page allocator is really bad here

Could we use the cmpxchg_local approach for the per cpu queues in the 
page_allocator? May have an even greater influence on overall system 
performance than the SLUB changes.
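
Purely as a thought experiment, a sketch of what a cmpxchg_local pop from a
per-CPU page queue might look like. The real struct per_cpu_pages keeps a
doubly linked list plus a count, which a single cmpxchg cannot update, so
this assumes the hot-page cache were reorganized as a singly linked stack
with the next pointer stashed in the otherwise unused page->lru.next slot
of a free page; struct pcp_stack is hypothetical.

	struct pcp_stack {			/* hypothetical replacement for per_cpu_pages */
		struct page *head;
	};

	static struct page *pcp_pop_sketch(struct pcp_stack *pcp)
	{
		struct page *page, *next;

		preempt_disable();
		do {
			page = pcp->head;
			if (!page) {
				preempt_enable();
				return NULL;	/* refill from the buddy allocator */
			}
			next = (struct page *)page->lru.next;
		} while (cmpxchg_local(&pcp->head, page, next) != page);
		preempt_enable();

		return page;
	}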

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> 
> > SLUB Use cmpxchg() everywhere.
> > 
> > It applies to "SLUB: Single atomic instruction alloc/free using
> > cmpxchg".
> 
> > +++ slab/mm/slub.c  2007-08-20 18:42:28.0 -0400
> > @@ -1682,7 +1682,7 @@ redo:
> >  
> > object[c->offset] = freelist;
> >  
> > -   if (unlikely(cmpxchg_local(&c->freelist, freelist, object) != freelist))
> > +   if (unlikely(cmpxchg(&c->freelist, freelist, object) != freelist))
> > goto redo;
> > return;
> >  slow:
> 
> Ok so regular cmpxchg, no cmpxchg_local. cmpxchg_local does not bring 
> anything more? My measurements did not show any difference. I measured on 
> Athlon64. What processor is being used?
> 

This patch only cleans up the tree before proposing my cmpxchg_local
changes. There was an inconsistent use of cmpxchg/cmpxchg_local there.

Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as
shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles
for the kmalloc/kfree pair (test 2).

Pros :
- we can use barrier() instead of rmb()
- cmpxchg_local is faster

Con :
- we must disable preemption

I use a 3GHz Pentium 4 for my tests.

Results (compared to cmpxchg_local numbers) :

SLUB Performance testing

1. Kmalloc: Repeatedly allocate then free test
(kfree here is slow path)

* cmpxchg
kmalloc(8) = 271 cycles kfree = 645 cycles
kmalloc(16) = 158 cycles  kfree = 428 cycles
kmalloc(32) = 153 cycles  kfree = 446 cycles
kmalloc(64) = 178 cycles  kfree = 459 cycles
kmalloc(128) = 247 cycles kfree = 481 cycles
kmalloc(256) = 363 cycles kfree = 605 cycles
kmalloc(512) = 449 cycles kfree = 677 cycles
kmalloc(1024) = 626 cycles  kfree = 810 cycles
kmalloc(2048) = 681 cycles  kfree = 869 cycles
kmalloc(4096) = 471 cycles  kfree = 575 cycles
kmalloc(8192) = 666 cycles  kfree = 747 cycles
kmalloc(16384) = 736 cycles kfree = 853 cycles

* cmpxchg_local
kmalloc(8) = 83 cycles  kfree = 363 cycles
kmalloc(16) = 85 cycles kfree = 372 cycles
kmalloc(32) = 92 cycles kfree = 377 cycles
kmalloc(64) = 115 cycleskfree = 397 cycles
kmalloc(128) = 179 cycles   kfree = 438 cycles
kmalloc(256) = 314 cycles   kfree = 564 cycles
kmalloc(512) = 398 cycles   kfree = 615 cycles
kmalloc(1024) = 573 cycles  kfree = 745 cycles
kmalloc(2048) = 629 cycles  kfree = 816 cycles
kmalloc(4096) = 473 cycles  kfree = 548 cycles
kmalloc(8192) = 659 cycles  kfree = 745 cycles
kmalloc(16384) = 724 cycles kfree = 843 cycles


2. Kmalloc: alloc/free test

*cmpxchg
kmalloc(8)/kfree = 321 cycles
kmalloc(16)/kfree = 308 cycles
kmalloc(32)/kfree = 311 cycles
kmalloc(64)/kfree = 310 cycles
kmalloc(128)/kfree = 306 cycles
kmalloc(256)/kfree = 325 cycles
kmalloc(512)/kfree = 324 cycles
kmalloc(1024)/kfree = 322 cycles
kmalloc(2048)/kfree = 309 cycles
kmalloc(4096)/kfree = 678 cycles
kmalloc(8192)/kfree = 1027 cycles
kmalloc(16384)/kfree = 1204 cycles

* cmpxchg_local
kmalloc(8)/kfree = 112 cycles
kmalloc(16)/kfree = 103 cycles
kmalloc(32)/kfree = 103 cycles
kmalloc(64)/kfree = 103 cycles
kmalloc(128)/kfree = 112 cycles
kmalloc(256)/kfree = 111 cycles
kmalloc(512)/kfree = 111 cycles
kmalloc(1024)/kfree = 111 cycles
kmalloc(2048)/kfree = 121 cycles
kmalloc(4096)/kfree = 650 cycles
kmalloc(8192)/kfree = 1042 cycles
kmalloc(16384)/kfree = 1149 cycles

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> * cmpxchg_local Slub test
> kmalloc(8) = 83 cycleskfree = 363 cycles
> kmalloc(16) = 85 cycles   kfree = 372 cycles
> kmalloc(32) = 92 cycles   kfree = 377 cycles
> kmalloc(64) = 115 cycleskfree = 397 cycles
> kmalloc(128) = 179 cycles   kfree = 438 cycles

So for consecutive allocs of small slabs up to 128 bytes this effectively 
doubles the speed of kmalloc.

> kmalloc(256) = 314 cycles   kfree = 564 cycles
> kmalloc(512) = 398 cycles   kfree = 615 cycles
> kmalloc(1024) = 573 cycleskfree = 745 cycles

Less of a benefit.

> kmalloc(2048) = 629 cycleskfree = 816 cycles

Almost as before.

> kmalloc(4096) = 473 cycleskfree = 548 cycles
> kmalloc(8192) = 659 cycleskfree = 745 cycles
> kmalloc(16384) = 724 cycles   kfree = 843 cycles

Page allocator pass through measurements.
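
For context, "pass through" refers to large kmalloc() sizes bypassing the
slab fast path and going straight to the page allocator, so the
cmpxchg_local change cannot help them. A rough sketch of the idea; the
threshold and the slab helper name are illustrative, not the actual SLUB
code.

	static __always_inline void *kmalloc_sketch(size_t size, gfp_t flags)
	{
		if (size > PAGE_SIZE / 2)	/* illustrative threshold */
			return (void *)__get_free_pages(flags, get_order(size));

		return slab_fastpath_alloc(size, flags);	/* placeholder */
	}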
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> 
> > - Changed smp_rmb() for barrier(). We are not interested in read order
> >   across cpus, what we want is to be ordered wrt local interrupts only.
> >   barrier() is much cheaper than a rmb().
> 
> But this means a preempt disable is required. RT users do not want that.
> Without preemption the processor can be moved after c has been determined.
> That is why the smp_rmb() is there.

preemption is required if we want to use cmpxchg_local anyway.

We may have to find a way to use preemption while being able to give an
upper bound on the preempt disabled execution time. I think I got a way
to do this yesterday.. I'll dig in my patches.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
Reformatting...

* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> Hi Christoph,
> 
> If you are interested in the raw numbers:
> 
> The (very basic) test module follows. Make sure you change get_cycles()
> for get_cycles_sync() if you plan to run this on x86_64.
> 
> (tests taken on a 3GHz Pentium 4)
> 

(Note: test 1 uses the kfree slow path, as figured out by
instrumentation)

SLUB Performance testing

1. Kmalloc: Repeatedly allocate then free test

* slub HEAD, test 1
kmalloc(8) = 201 cycles kfree = 351 cycles
kmalloc(16) = 198 cycles  kfree = 359 cycles
kmalloc(32) = 200 cycles  kfree = 381 cycles
kmalloc(64) = 224 cycles  kfree = 394 cycles
kmalloc(128) = 285 cycles kfree = 424 cycles
kmalloc(256) = 411 cycles kfree = 546 cycles
kmalloc(512) = 480 cycles kfree = 619 cycles
kmalloc(1024) = 623 cycles  kfree = 750 cycles
kmalloc(2048) = 686 cycles  kfree = 811 cycles
kmalloc(4096) = 482 cycles  kfree = 538 cycles
kmalloc(8192) = 680 cycles  kfree = 734 cycles
kmalloc(16384) = 713 cycles kfree = 843 cycles

* Slub HEAD, test 2
kmalloc(8) = 190 cycles kfree = 351 cycles
kmalloc(16) = 195 cycles  kfree = 360 cycles
kmalloc(32) = 201 cycles  kfree = 370 cycles
kmalloc(64) = 245 cycles  kfree = 389 cycles
kmalloc(128) = 283 cycles kfree = 413 cycles
kmalloc(256) = 409 cycles kfree = 547 cycles
kmalloc(512) = 476 cycles kfree = 616 cycles
kmalloc(1024) = 628 cycles  kfree = 753 cycles
kmalloc(2048) = 684 cycles  kfree = 811 cycles
kmalloc(4096) = 480 cycles  kfree = 539 cycles
kmalloc(8192) = 661 cycles  kfree = 746 cycles
kmalloc(16384) = 741 cycles kfree = 856 cycles

* cmpxchg_local Slub test
kmalloc(8) = 83 cycles  kfree = 363 cycles
kmalloc(16) = 85 cycles kfree = 372 cycles
kmalloc(32) = 92 cycles kfree = 377 cycles
kmalloc(64) = 115 cycles  kfree = 397 cycles
kmalloc(128) = 179 cycles kfree = 438 cycles
kmalloc(256) = 314 cycles kfree = 564 cycles
kmalloc(512) = 398 cycles kfree = 615 cycles
kmalloc(1024) = 573 cycles  kfree = 745 cycles
kmalloc(2048) = 629 cycles  kfree = 816 cycles
kmalloc(4096) = 473 cycles  kfree = 548 cycles
kmalloc(8192) = 659 cycles  kfree = 745 cycles
kmalloc(16384) = 724 cycles kfree = 843 cycles



2. Kmalloc: alloc/free test

* slub HEAD, test 1
kmalloc(8)/kfree = 322 cycles
kmalloc(16)/kfree = 318 cycles
kmalloc(32)/kfree = 318 cycles
kmalloc(64)/kfree = 325 cycles
kmalloc(128)/kfree = 318 cycles
kmalloc(256)/kfree = 328 cycles
kmalloc(512)/kfree = 328 cycles
kmalloc(1024)/kfree = 328 cycles
kmalloc(2048)/kfree = 328 cycles
kmalloc(4096)/kfree = 678 cycles
kmalloc(8192)/kfree = 1013 cycles
kmalloc(16384)/kfree = 1157 cycles

* Slub HEAD, test 2
kmalloc(8)/kfree = 323 cycles
kmalloc(16)/kfree = 318 cycles
kmalloc(32)/kfree = 318 cycles
kmalloc(64)/kfree = 318 cycles
kmalloc(128)/kfree = 318 cycles
kmalloc(256)/kfree = 328 cycles
kmalloc(512)/kfree = 328 cycles
kmalloc(1024)/kfree = 328 cycles
kmalloc(2048)/kfree = 328 cycles
kmalloc(4096)/kfree = 648 cycles
kmalloc(8192)/kfree = 1009 cycles
kmalloc(16384)/kfree = 1105 cycles

* cmpxchg_local Slub test
kmalloc(8)/kfree = 112 cycles
kmalloc(16)/kfree = 103 cycles
kmalloc(32)/kfree = 103 cycles
kmalloc(64)/kfree = 103 cycles
kmalloc(128)/kfree = 112 cycles
kmalloc(256)/kfree = 111 cycles
kmalloc(512)/kfree = 111 cycles
kmalloc(1024)/kfree = 111 cycles
kmalloc(2048)/kfree = 121 cycles
kmalloc(4096)/kfree = 650 cycles
kmalloc(8192)/kfree = 1042 cycles
kmalloc(16384)/kfree = 1149 cycles

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> SLUB Use cmpxchg() everywhere.
> 
> It applies to "SLUB: Single atomic instruction alloc/free using
> cmpxchg".

> +++ slab/mm/slub.c2007-08-20 18:42:28.0 -0400
> @@ -1682,7 +1682,7 @@ redo:
>  
>   object[c->offset] = freelist;
>  
> - if (unlikely(cmpxchg_local(&c->freelist, freelist, object) != freelist))
> + if (unlikely(cmpxchg(&c->freelist, freelist, object) != freelist))
>   goto redo;
>   return;
>  slow:

Ok so regular cmpxchg, no cmpxchg_local. cmpxchg_local does not bring 
anything more? My measurements did not show any difference. I measured on 
Athlon64. What processor is being used?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> - Changed smp_rmb() for barrier(). We are not interested in read order
>   across cpus, what we want is to be ordered wrt local interrupts only.
>   barrier() is much cheaper than a rmb().

But this means a preempt disable is required. RT users do not want that.
Without preemption the processor can be moved after c has been determined.
That is why the smp_rmb() is there.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> 
> > - Fixed an erroneous test in slab_free() (logic was flipped from the 
> >   original code when testing for slow path. It explains the wrong 
> >   numbers you have with big free).
> 
> If you look at the numbers that I posted earlier then you will see that 
> even the measurements without free were not up to par.
> 

I seem to get a clear performance improvement in the kmalloc fast path.

> > It applies on top of the 
> > "SLUB Use cmpxchg() everywhere" patch.
> 
> Which one is that?
> 

This one:


SLUB Use cmpxchg() everywhere.

It applies to "SLUB: Single atomic instruction alloc/free using
cmpxchg".

Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
---
 mm/slub.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: slab/mm/slub.c
===
--- slab.orig/mm/slub.c 2007-08-20 18:42:16.0 -0400
+++ slab/mm/slub.c  2007-08-20 18:42:28.0 -0400
@@ -1682,7 +1682,7 @@ redo:
 
object[c->offset] = freelist;
 
-   if (unlikely(cmpxchg_local(&c->freelist, freelist, object) != freelist))
+   if (unlikely(cmpxchg(&c->freelist, freelist, object) != freelist))
goto redo;
return;
 slow:

> >  | slab.git HEAD slub (min-max)|  cmpxchg_local slub
> > kmalloc(8)   | 190 - 201   | 83
> > kfree(8) | 351 - 351   |363
> > kmalloc(64)  | 224 - 245   |115
> > kfree(64)| 389 - 394   |397
> > kmalloc(16384)|713 - 741   |724
> > kfree(16384) | 843 - 856   |843
> > 
> > Therefore, there seems to be a repeatable gain on the kmalloc fast path
> > (more than twice faster). No significant performance hit for the kfree
> > case, but no gain neither, same for large kmalloc, as expected.
> 
> There is a consistent loss on slab_free it seems. The 16k numbers are 
> irrelevant since we do not use slab_alloc/slab_free due to the direct pass 
> through patch but call the page allocator directly. That also explains 
> that there is no loss there.
> 

Yes. slab_free in these tests falls mostly into __slab_free() slow path
(I instrumented the number of slow and fast path to get this). The small
performance hit (~10 cycles) can be explained by the added
preempt_disable()/preempt_enable().

> The kmalloc numbers look encouraging. I will check to see if I can 
> reproduce it once I sort out the patches.

Ok.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> Therefore, in the test where we have separate passes for slub allocation
> and free, we hit mostly the slow path. Any particular reason for that ?

Maybe on SMP you are scheduled to run on a different processor? Note that 
I ran my tests at early boot where such effects do not occur.
 
> Note that the alloc/free test (test 2) seems to hit the fast path as
> expected.

It is much more likely in that case that the execution thread stays on one 
processor.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> If you are interested in the raw numbers:
> 
> The (very basic) test module follows. Make sure you change get_cycles()
> for get_cycles_sync() if you plan to run this on x86_64.

Which test is which? Would you be able to format this in a way that we can 
easily read it?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

> - Fixed an erroneous test in slab_free() (logic was flipped from the 
>   original code when testing for slow path. It explains the wrong 
>   numbers you have with big free).

If you look at the numbers that I posted earlier then you will see that 
even the measurements without free were not up to par.

> It applies on top of the 
> "SLUB Use cmpxchg() everywhere" patch.

Which one is that?

>  | slab.git HEAD slub (min-max)|  cmpxchg_local slub
> kmalloc(8)   | 190 - 201   | 83
> kfree(8) | 351 - 351   |363
> kmalloc(64)  | 224 - 245   |115
> kfree(64)| 389 - 394   |397
> kmalloc(16384)|713 - 741   |724
> kfree(16384) | 843 - 856   |843
> 
> Therefore, there seems to be a repeatable gain on the kmalloc fast path
> (more than twice faster). No significant performance hit for the kfree
> case, but no gain neither, same for large kmalloc, as expected.

There is a consistent loss on slab_free it seems. The 16k numbers are 
irrelevant since we do not use slab_alloc/slab_free due to the direct pass 
through patch but call the page allocator directly. That also explains 
that there is no loss there.

The kmalloc numbers look encouraging. I will check to see if I can 
reproduce it once I sort out the patches.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> Ok, I played with your patch a bit, and the results are quite
> interesting:
> 
...
> Summary:
> 
> (tests repeated 10000 times on a 3GHz Pentium 4)
> (kernel DEBUG menuconfig options are turned off)
> results are in cycles per iteration
> I did 2 runs of the slab.git HEAD to have an idea of errors associated
> to the measurements:
> 
>  | slab.git HEAD slub (min-max)|  cmpxchg_local slub
> kmalloc(8)   | 190 - 201   | 83
> kfree(8) | 351 - 351   |363
> kmalloc(64)  | 224 - 245   |115
> kfree(64)| 389 - 394   |397
> kmalloc(16384)|713 - 741   |724
> kfree(16384) | 843 - 856   |843
> 
> Therefore, there seems to be a repeatable gain on the kmalloc fast path
> (more than twice faster). No significant performance hit for the kfree
> case, but no gain neither, same for large kmalloc, as expected.
> 

Having no performance improvement for kfree seems a little weird, since
we are moving from irq disable to cmpxchg_local in the fast path. A
possible explanation would be that we are always hitting the slow path.
I did a simple test, counting the number of fast vs slow paths with my
cmpxchg_local slub version:

(initial state before the test)
[  386.359364] Fast slub free: 654982
[  386.369507] Slow slub free: 392195
[  386.379660] SLUB Performance testing
[  386.390361] 
[  386.401020] 1. Kmalloc: Repeatedly allocate then free test
...
(after test 1)
[  387.366002] Fast slub free: 657338 diff (2356)
[  387.376158] Slow slub free: 482162 diff (89967)

[  387.386294] 2. Kmalloc: alloc/free test
...
(after test 2)
[  387.897816] Fast slub free: 748968 (diff 91630)
[  387.907947] Slow slub free: 482584 diff (422)

Therefore, in the test where we have separate passes for slub allocation
and free, we hit mostly the slow path. Any particular reason for that ?

Note that the alloc/free test (test 2) seems to hit the fast path as
expected.
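
For reference, the fast/slow counts above can be gathered with something as
simple as the sketch below; the counter names and the dump helper are
illustrative, not the instrumentation actually used.

	/* Global counters, bumped from the free paths. */
	static atomic_t slub_free_fast = ATOMIC_INIT(0);
	static atomic_t slub_free_slow = ATOMIC_INIT(0);

	/*
	 * In slab_free(): atomic_inc(&slub_free_fast) on the cmpxchg_local
	 * path, atomic_inc(&slub_free_slow) just before calling __slab_free().
	 */

	static void dump_slub_free_counters(void)
	{
		printk(KERN_ALERT "Fast slub free: %d\n", atomic_read(&slub_free_fast));
		printk(KERN_ALERT "Slow slub free: %d\n", atomic_read(&slub_free_slow));
	}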

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
Hi Christoph,

If you are interested in the raw numbers:

The (very basic) test module follows. Make sure you change get_cycles()
for get_cycles_sync() if you plan to run this on x86_64.

(tests taken on a 3GHz Pentium 4)

* slub HEAD, test 1

[   99.774699] SLUB Performance testing
[   99.785431] 
[   99.796099] 1. Kmalloc: Repeatedly allocate then free test
[   99.813159] 1 times kmalloc(8) = 
[   99.824072] number of loops: 1
[   99.834207] total time: 2019990
[   99.843562] -> 201 cycles
[   99.852535] 1 times kfree = 
[   99.862167] number of loops: 1
[   99.872310] total time: 3519982
[   99.881669] -> 351 cycles
[   99.890128] 1 times kmalloc(16) = 
[   99.901298] number of loops: 1
[   99.911433] total time: 1986503
[   99.920786] -> 198 cycles
[   99.929784] 1 times kfree = 
[   99.939397] number of loops: 1
[   99.949532] total time: 3596775
[   99.958885] -> 359 cycles
[   99.967352] 1 times kmalloc(32) = 
[   99.978522] number of loops: 1
[   99.988657] total time: 2009003
[   99.998098] -> 200 cycles
[  100.007171] 1 times kfree = 
[  100.016786] number of loops: 1
[  100.026919] total time: 3814418
[  100.036296] -> 381 cycles
[  100.044844] 1 times kmalloc(64) = 
[  100.056016] number of loops: 1
[  100.066150] total time: 2242620
[  100.075504] -> 224 cycles
[  100.084619] 1 times kfree = 
[  100.094234] number of loops: 1
[  100.104369] total time: 3941348
[  100.113722] -> 394 cycles
[  100.122475] 1 times kmalloc(128) = 
[  100.133914] number of loops: 1
[  100.144049] total time: 2857560
[  100.153485] -> 285 cycles
[  100.162705] 1 times kfree = 
[  100.172332] number of loops: 1
[  100.182468] total time: 4241543
[  100.191821] -> 424 cycles
[  100.200996] 1 times kmalloc(256) = 
[  100.212437] number of loops: 1
[  100.222571] total time: 4119900
[  100.231949] -> 411 cycles
[  100.241570] 1 times kfree = 
[  100.251218] number of loops: 1
[  100.261353] total time: 5462655
[  100.270705] -> 546 cycles
[  100.280105] 1 times kmalloc(512) = 
[  100.291548] number of loops: 1
[  100.301683] total time: 4802820
[  100.311037] -> 480 cycles
[  100.320899] 1 times kfree = 
[  100.330518] number of loops: 1
[  100.340661] total time: 6191827
[  100.350040] -> 619 cycles
[  100.359917] 1 times kmalloc(1024) = 
[  100.371633] number of loops: 1
[  100.381767] total time: 6235890
[  100.391120] -> 623 cycles
[  100.401419] 1 times kfree = 
[  100.411034] number of loops: 1
[  100.421170] total time: 7504095
[  100.430523] -> 750 cycles
[  100.440608] 1 times kmalloc(2048) = 
[  100.452300] number of loops: 1
[  100.462433] total time: 6863955
[  100.471786] -> 686 cycles
[  100.482287] 1 times kfree = 
[  100.491922] number of loops: 1
[  100.502065] total time: 8110590
[  100.511419] -> 811 cycles
[  100.520824] 1 times kmalloc(4096) = 
[  100.532537] number of loops: 1
[  100.542670] total time: 4824007
[  100.552023] -> 482 cycles
[  100.561618] 1 times kfree = 
[  100.571255] number of loops: 1
[  100.581390] total time: 5387670
[  100.590768] -> 538 cycles
[  100.600835] 1 times kmalloc(8192) = 
[  100.612549] number of loops: 1
[  100.622684] total time: 6808680
[  100.632037] -> 680 cycles
[  100.642285] 1 times kfree = 
[  100.651898] number of loops: 1
[  100.662031] total time: 7349797
[  100.671385] -> 734 cycles
[  100.681563] 1 times kmalloc(16384) = 
[  100.693523] number of loops: 1
[  100.703658] total time: 7133790
[  100.713036] -> 713 cycles
[  100.723654] 1 times kfree = 
[  100.733299] number of loops: 1
[  100.743434] total time: 8431725
[  100.752788] -> 843 cycles
[  100.760588] 2. Kmalloc: alloc/free test
[  100.773091] 1 times kmalloc(8)/kfree = 
[  100.785558] number of loops: 1
[  100.795694] total time: 3223072
[  100.805046] -> 322 cycles
[  100.813904] 1 times kmalloc(16)/kfree = 
[  100.826629] number of loops: 1
[  100.836763] total time: 3181702
[  100.846116] -> 318 cycles
[  100.854975] 1 times kmalloc(32)/kfree = 
[  100.867725] number of loops: 1
[  100.877860] total time: 3183517
[  100.887296] -> 318 cycles
[  100.896179] 1 times kmalloc(64)/kfree = 
[  100.908905] number of loops: 1
[  100.919039] total time: 3253335
[  100.928418] -> 325 cycles
[  100.937277] 1 times kmalloc(128)/kfree = 
[  100.950272] number of loops: 1
[  100.960407] total time: 3181478
[  100.969760] -> 318 cycles
[  100.978652] 1 times kmalloc(256)/kfree = 
[  100.991662] number of loops: 1
[  101.001796] total time: 3282810
[  101.011149] -> 328 cycles
[  101.020042] 1 times kmalloc(512)/kfree = 
[  101.033025] number of loops: 1
[  101.043161] total time: 3286725
[  101.052515] -> 328 cycles
[  101.061409] 1 times kmalloc(1024)/kfree = 
[  101.074652] number of loops: 1
[  101.084787] total time: 3281677
[  101.094141] -> 328 cycles

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
Hi Christoph,

If you are interested in the raw numbers:

The (very basic) test module follows. Make sure you change get_cycles()
for get_cycles_sync() if you plan to run this on x86_64.

(tests taken on a 3GHz Pentium 4)

* slub HEAD, test 1

[   99.774699] SLUB Performance testing
[   99.785431] 
[   99.796099] 1. Kmalloc: Repeatedly allocate then free test
[   99.813159] 1 times kmalloc(8) = 
[   99.824072] number of loops: 1
[   99.834207] total time: 2019990
[   99.843562] - 201 cycles
[   99.852535] 1 times kfree = 
[   99.862167] number of loops: 1
[   99.872310] total time: 3519982
[   99.881669] - 351 cycles
[   99.890128] 1 times kmalloc(16) = 
[   99.901298] number of loops: 1
[   99.911433] total time: 1986503
[   99.920786] - 198 cycles
[   99.929784] 1 times kfree = 
[   99.939397] number of loops: 1
[   99.949532] total time: 3596775
[   99.958885] - 359 cycles
[   99.967352] 1 times kmalloc(32) = 
[   99.978522] number of loops: 1
[   99.988657] total time: 2009003
[   99.998098] - 200 cycles
[  100.007171] 1 times kfree = 
[  100.016786] number of loops: 1
[  100.026919] total time: 3814418
[  100.036296] - 381 cycles
[  100.044844] 1 times kmalloc(64) = 
[  100.056016] number of loops: 1
[  100.066150] total time: 2242620
[  100.075504] - 224 cycles
[  100.084619] 1 times kfree = 
[  100.094234] number of loops: 1
[  100.104369] total time: 3941348
[  100.113722] - 394 cycles
[  100.122475] 1 times kmalloc(128) = 
[  100.133914] number of loops: 1
[  100.144049] total time: 2857560
[  100.153485] - 285 cycles
[  100.162705] 1 times kfree = 
[  100.172332] number of loops: 1
[  100.182468] total time: 4241543
[  100.191821] - 424 cycles
[  100.200996] 1 times kmalloc(256) = 
[  100.212437] number of loops: 1
[  100.222571] total time: 4119900
[  100.231949] - 411 cycles
[  100.241570] 1 times kfree = 
[  100.251218] number of loops: 1
[  100.261353] total time: 5462655
[  100.270705] - 546 cycles
[  100.280105] 1 times kmalloc(512) = 
[  100.291548] number of loops: 1
[  100.301683] total time: 4802820
[  100.311037] - 480 cycles
[  100.320899] 1 times kfree = 
[  100.330518] number of loops: 1
[  100.340661] total time: 6191827
[  100.350040] - 619 cycles
[  100.359917] 1 times kmalloc(1024) = 
[  100.371633] number of loops: 1
[  100.381767] total time: 6235890
[  100.391120] - 623 cycles
[  100.401419] 1 times kfree = 
[  100.411034] number of loops: 1
[  100.421170] total time: 7504095
[  100.430523] - 750 cycles
[  100.440608] 1 times kmalloc(2048) = 
[  100.452300] number of loops: 1
[  100.462433] total time: 6863955
[  100.471786] - 686 cycles
[  100.482287] 1 times kfree = 
[  100.491922] number of loops: 1
[  100.502065] total time: 8110590
[  100.511419] - 811 cycles
[  100.520824] 1 times kmalloc(4096) = 
[  100.532537] number of loops: 1
[  100.542670] total time: 4824007
[  100.552023] - 482 cycles
[  100.561618] 1 times kfree = 
[  100.571255] number of loops: 1
[  100.581390] total time: 5387670
[  100.590768] - 538 cycles
[  100.600835] 1 times kmalloc(8192) = 
[  100.612549] number of loops: 1
[  100.622684] total time: 6808680
[  100.632037] - 680 cycles
[  100.642285] 1 times kfree = 
[  100.651898] number of loops: 1
[  100.662031] total time: 7349797
[  100.671385] - 734 cycles
[  100.681563] 1 times kmalloc(16384) = 
[  100.693523] number of loops: 1
[  100.703658] total time: 7133790
[  100.713036] - 713 cycles
[  100.723654] 1 times kfree = 
[  100.733299] number of loops: 1
[  100.743434] total time: 8431725
[  100.752788] - 843 cycles
[  100.760588] 2. Kmalloc: alloc/free test
[  100.773091] 1 times kmalloc(8)/kfree = 
[  100.785558] number of loops: 1
[  100.795694] total time: 3223072
[  100.805046] - 322 cycles
[  100.813904] 1 times kmalloc(16)/kfree = 
[  100.826629] number of loops: 1
[  100.836763] total time: 3181702
[  100.846116] - 318 cycles
[  100.854975] 1 times kmalloc(32)/kfree = 
[  100.867725] number of loops: 1
[  100.877860] total time: 3183517
[  100.887296] - 318 cycles
[  100.896179] 1 times kmalloc(64)/kfree = 
[  100.908905] number of loops: 1
[  100.919039] total time: 3253335
[  100.928418] - 325 cycles
[  100.937277] 1 times kmalloc(128)/kfree = 
[  100.950272] number of loops: 1
[  100.960407] total time: 3181478
[  100.969760] - 318 cycles
[  100.978652] 1 times kmalloc(256)/kfree = 
[  100.991662] number of loops: 1
[  101.001796] total time: 3282810
[  101.011149] - 328 cycles
[  101.020042] 1 times kmalloc(512)/kfree = 
[  101.033025] number of loops: 1
[  101.043161] total time: 3286725
[  101.052515] - 328 cycles
[  101.061409] 1 times kmalloc(1024)/kfree = 
[  101.074652] number of loops: 1
[  101.084787] total time: 3281677
[  101.094141] - 328 cycles

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
 Ok, I played with your patch a bit, and the results are quite
 interesting:
 
...
 Summary:
 
 (tests repeated 1 times on a 3GHz Pentium 4)
 (kernel DEBUG menuconfig options are turned off)
 results are in cycles per iteration
 I did 2 runs of the slab.git HEAD to have an idea of errors associated
 to the measurements:
 
  | slab.git HEAD slub (min-max)|  cmpxchg_local slub
 kmalloc(8)   | 190 - 201   | 83
 kfree(8) | 351 - 351   |363
 kmalloc(64)  | 224 - 245   |115
 kfree(64)| 389 - 394   |397
 kmalloc(16384)|713 - 741   |724
 kfree(16384) | 843 - 856   |843
 
 Therefore, there seems to be a repeatable gain on the kmalloc fast path
 (more than twice faster). No significant performance hit for the kfree
 case, but no gain neither, same for large kmalloc, as expected.
 

Having no performance improvement for kfree seems a little weird, since
we are moving from irq disable to cmpxchg_local in the fast path. A
possible explanation would be that we are always hitting the slow path.
I did a simple test, counting the number of fast vs slow paths with my
cmpxchg_local slub version:

(initial state before the test)
[  386.359364] Fast slub free: 654982
[  386.369507] Slow slub free: 392195
[  386.379660] SLUB Performance testing
[  386.390361] 
[  386.401020] 1. Kmalloc: Repeatedly allocate then free test
...
(after test 1)
[  387.366002] Fast slub free: 657338 diff (2356)
[  387.376158] Slow slub free: 482162 diff (89967)

[  387.386294] 2. Kmalloc: alloc/free test
...
(after test 2)
[  387.897816] Fast slub free: 748968 (diff 91630)
[  387.907947] Slow slub free: 482584 diff (422)

Therefore, in the test where we have separate passes for slub allocation
and free, we hit mostly the slow path. Any particular reason for that ?

Note that the alloc/free test (test 2) seems to hit the fast path as
expected.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

 - Fixed an erroneous test in slab_free() (logic was flipped from the 
   original code when testing for slow path. It explains the wrong 
   numbers you have with big free).

If you look at the numbers that I posted earlier then you will see that 
even the measurements without free were not up to par.

 It applies on top of the 
 SLUB Use cmpxchg() everywhere patch.

Which one is that?

  | slab.git HEAD slub (min-max)|  cmpxchg_local slub
 kmalloc(8)   | 190 - 201   | 83
 kfree(8) | 351 - 351   |363
 kmalloc(64)  | 224 - 245   |115
 kfree(64)| 389 - 394   |397
 kmalloc(16384)|713 - 741   |724
 kfree(16384) | 843 - 856   |843
 
 Therefore, there seems to be a repeatable gain on the kmalloc fast path
 (more than twice faster). No significant performance hit for the kfree
 case, but no gain neither, same for large kmalloc, as expected.

There is a consistent loss on slab_free it seems. The 16k numbers are 
irrelevant since we do not use slab_alloc/slab_free due to the direct pass 
through patch but call the page allocator directly. That also explains 
that there is no loss there.

The kmalloc numbers look encouraging. I will check to see if I can 
reproduce it once I sort out the patches.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

 If you are interested in the raw numbers:
 
 The (very basic) test module follows. Make sure you change get_cycles()
 for get_cycles_sync() if you plan to run this on x86_64.

Which test is which? Would you be able to format this in a way that we can 
easily read it?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

 Therefore, in the test where we have separate passes for slub allocation
 and free, we hit mostly the slow path. Any particular reason for that ?

Maybe on SMP you are schedule to run on a different processor? Note that 
I ran my tests at early boot where such effects do not occur.
 
 Note that the alloc/free test (test 2) seems to hit the fast path as
 expected.

It is much more likely in that case that the execution thread stays on one 
processor.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
 On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
 
  - Fixed an erroneous test in slab_free() (logic was flipped from the 
original code when testing for slow path. It explains the wrong 
numbers you have with big free).
 
 If you look at the numbers that I posted earlier then you will see that 
 even the measurements without free were not up to par.
 

I seem to get a clear performance improvement in the kmalloc fast path.

  It applies on top of the 
  SLUB Use cmpxchg() everywhere patch.
 
 Which one is that?
 

This one:


SLUB Use cmpxchg() everywhere.

It applies to SLUB: Single atomic instruction alloc/free using
cmpxchg.

Signed-off-by: Mathieu Desnoyers [EMAIL PROTECTED]
---
 mm/slub.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: slab/mm/slub.c
===
--- slab.orig/mm/slub.c 2007-08-20 18:42:16.0 -0400
+++ slab/mm/slub.c  2007-08-20 18:42:28.0 -0400
@@ -1682,7 +1682,7 @@ redo:
 
object[c-offset] = freelist;
 
-   if (unlikely(cmpxchg_local(c-freelist, freelist, object) != freelist))
+   if (unlikely(cmpxchg(c-freelist, freelist, object) != freelist))
goto redo;
return;
 slow:

   | slab.git HEAD slub (min-max)|  cmpxchg_local slub
  kmalloc(8)   | 190 - 201   | 83
  kfree(8) | 351 - 351   |363
  kmalloc(64)  | 224 - 245   |115
  kfree(64)| 389 - 394   |397
  kmalloc(16384)|713 - 741   |724
  kfree(16384) | 843 - 856   |843
  
  Therefore, there seems to be a repeatable gain on the kmalloc fast path
  (more than twice faster). No significant performance hit for the kfree
  case, but no gain neither, same for large kmalloc, as expected.
 
 There is a consistent loss on slab_free it seems. The 16k numbers are 
 irrelevant since we do not use slab_alloc/slab_free due to the direct pass 
 through patch but call the page allocator directly. That also explains 
 that there is no loss there.
 

Yes. slab_free in these tests falls mostly into __slab_free() slow path
(I instrumented the number of slow and fast path to get this). The small
performance hit (~10 cycles) can be explained by the added
preempt_disable()/preempt_enable().

 The kmalloc numbers look encouraging. I will check to see if I can 
 reproduce it once I sort out the patches.

Ok.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

 - Changed smp_rmb() for barrier(). We are not interested in read order
   across cpus, what we want is to be ordered wrt local interrupts only.
   barrier() is much cheaper than a rmb().

But this means a preempt disable is required. RT users do not want that.
Without preemption the processor can be moved after c has been determined.
That is why the smp_rmb() is there.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

 SLUB Use cmpxchg() everywhere.
 
 It applies to SLUB: Single atomic instruction alloc/free using
 cmpxchg.

 +++ slab/mm/slub.c2007-08-20 18:42:28.0 -0400
 @@ -1682,7 +1682,7 @@ redo:
  
   object[c-offset] = freelist;
  
 - if (unlikely(cmpxchg_local(c-freelist, freelist, object) != freelist))
 + if (unlikely(cmpxchg(c-freelist, freelist, object) != freelist))
   goto redo;
   return;
  slow:

Ok so regular cmpxchg, no cmpxchg_local. cmpxchg_local does not bring 
anything more? My measurements did not show any difference. I measured on 
Athlon64. What processor is being used?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
Reformatting...

* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
 Hi Christoph,
 
 If you are interested in the raw numbers:
 
 The (very basic) test module follows. Make sure you change get_cycles()
 for get_cycles_sync() if you plan to run this on x86_64.
 
 (tests taken on a 3GHz Pentium 4)
 

(Note: test 1 uses the kfree slow path, as figured out by
instrumentation)

SLUB Performance testing

1. Kmalloc: Repeatedly allocate then free test

* slub HEAD, test 1
kmalloc(8) = 201 cycles kfree = 351 cycles
kmalloc(16) = 198 cycles  kfree = 359 cycles
kmalloc(32) = 200 cycles  kfree = 381 cycles
kmalloc(64) = 224 cycles  kfree = 394 cycles
kmalloc(128) = 285 cycles kfree = 424 cycles
kmalloc(256) = 411 cycles kfree = 546 cycles
kmalloc(512) = 480 cycles kfree = 619 cycles
kmalloc(1024) = 623 cycles  kfree = 750 cycles
kmalloc(2048) = 686 cycles  kfree = 811 cycles
kmalloc(4096) = 482 cycles  kfree = 538 cycles
kmalloc(8192) = 680 cycles  kfree = 734 cycles
kmalloc(16384) = 713 cycles kfree = 843 cycles

* Slub HEAD, test 2
kmalloc(8) = 190 cycles kfree = 351 cycles
kmalloc(16) = 195 cycles  kfree = 360 cycles
kmalloc(32) = 201 cycles  kfree = 370 cycles
kmalloc(64) = 245 cycles  kfree = 389 cycles
kmalloc(128) = 283 cycles kfree = 413 cycles
kmalloc(256) = 409 cycles kfree = 547 cycles
kmalloc(512) = 476 cycles kfree = 616 cycles
kmalloc(1024) = 628 cycles  kfree = 753 cycles
kmalloc(2048) = 684 cycles  kfree = 811 cycles
kmalloc(4096) = 480 cycles  kfree = 539 cycles
kmalloc(8192) = 661 cycles  kfree = 746 cycles
kmalloc(16384) = 741 cycles kfree = 856 cycles

* cmpxchg_local Slub test
kmalloc(8) = 83 cycles  kfree = 363 cycles
kmalloc(16) = 85 cycles kfree = 372 cycles
kmalloc(32) = 92 cycles kfree = 377 cycles
kmalloc(64) = 115 cycles  kfree = 397 cycles
kmalloc(128) = 179 cycles kfree = 438 cycles
kmalloc(256) = 314 cycles kfree = 564 cycles
kmalloc(512) = 398 cycles kfree = 615 cycles
kmalloc(1024) = 573 cycles  kfree = 745 cycles
kmalloc(2048) = 629 cycles  kfree = 816 cycles
kmalloc(4096) = 473 cycles  kfree = 548 cycles
kmalloc(8192) = 659 cycles  kfree = 745 cycles
kmalloc(16384) = 724 cycles kfree = 843 cycles



2. Kmalloc: alloc/free test

* slub HEAD, test 1
kmalloc(8)/kfree = 322 cycles
kmalloc(16)/kfree = 318 cycles
kmalloc(32)/kfree = 318 cycles
kmalloc(64)/kfree = 325 cycles
kmalloc(128)/kfree = 318 cycles
kmalloc(256)/kfree = 328 cycles
kmalloc(512)/kfree = 328 cycles
kmalloc(1024)/kfree = 328 cycles
kmalloc(2048)/kfree = 328 cycles
kmalloc(4096)/kfree = 678 cycles
kmalloc(8192)/kfree = 1013 cycles
kmalloc(16384)/kfree = 1157 cycles

* Slub HEAD, test 2
kmalloc(8)/kfree = 323 cycles
kmalloc(16)/kfree = 318 cycles
kmalloc(32)/kfree = 318 cycles
kmalloc(64)/kfree = 318 cycles
kmalloc(128)/kfree = 318 cycles
kmalloc(256)/kfree = 328 cycles
kmalloc(512)/kfree = 328 cycles
kmalloc(1024)/kfree = 328 cycles
kmalloc(2048)/kfree = 328 cycles
kmalloc(4096)/kfree = 648 cycles
kmalloc(8192)/kfree = 1009 cycles
kmalloc(16384)/kfree = 1105 cycles

* cmpxchg_local Slub test
kmalloc(8)/kfree = 112 cycles
kmalloc(16)/kfree = 103 cycles
kmalloc(32)/kfree = 103 cycles
kmalloc(64)/kfree = 103 cycles
kmalloc(128)/kfree = 112 cycles
kmalloc(256)/kfree = 111 cycles
kmalloc(512)/kfree = 111 cycles
kmalloc(1024)/kfree = 111 cycles
kmalloc(2048)/kfree = 121 cycles
kmalloc(4096)/kfree = 650 cycles
kmalloc(8192)/kfree = 1042 cycles
kmalloc(16384)/kfree = 1149 cycles

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
 On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
 
  - Changed smp_rmb() for barrier(). We are not interested in read order
across cpus, what we want is to be ordered wrt local interrupts only.
barrier() is much cheaper than a rmb().
 
 But this means a preempt disable is required. RT users do not want that.
 Without preemption the processor can be moved after c has been determined.
 That is why the smp_rmb() is there.

preemption is required if we want to use cmpxchg_local anyway.

We may have to find a way to use preemption while being able to give an
upper bound on the preempt disabled execution time. I think I got a way
to do this yesterday.. I'll dig in my patches.

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

 * cmpxchg_local Slub test
 kmalloc(8) = 83 cycleskfree = 363 cycles
 kmalloc(16) = 85 cycles   kfree = 372 cycles
 kmalloc(32) = 92 cycles   kfree = 377 cycles
 kmalloc(64) = 115 cycleskfree = 397 cycles
 kmalloc(128) = 179 cycles   kfree = 438 cycles

So for consecutive allocs of small slabs up to 128 bytes this effectively 
doubles the speed of kmalloc.

 kmalloc(256) = 314 cycles   kfree = 564 cycles
 kmalloc(512) = 398 cycles   kfree = 615 cycles
 kmalloc(1024) = 573 cycleskfree = 745 cycles

Less of a benefit.

 kmalloc(2048) = 629 cycleskfree = 816 cycles

Allmost as before.

 kmalloc(4096) = 473 cycleskfree = 548 cycles
 kmalloc(8192) = 659 cycleskfree = 745 cycles
 kmalloc(16384) = 724 cycles   kfree = 843 cycles

Page allocator pass through measurements.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

 kmalloc(8)/kfree = 112 cycles
 kmalloc(16)/kfree = 103 cycles
 kmalloc(32)/kfree = 103 cycles
 kmalloc(64)/kfree = 103 cycles
 kmalloc(128)/kfree = 112 cycles
 kmalloc(256)/kfree = 111 cycles
 kmalloc(512)/kfree = 111 cycles
 kmalloc(1024)/kfree = 111 cycles
 kmalloc(2048)/kfree = 121 cycles

Looks good. This improves handling for short lived objects about 
threefold.

 kmalloc(4096)/kfree = 650 cycles
 kmalloc(8192)/kfree = 1042 cycles
 kmalloc(16384)/kfree = 1149 cycles

Hmmm... The page allocator is really bad here

Could we use the cmpxchg_local approach for the per cpu queues in the 
page_allocator? May have an even greater influence on overall system 
performance than the SLUB changes.
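
For illustration only, a cmpxchg_local() pop from such a per-cpu page
queue could look roughly like this, assuming the hot list were turned
into a single-linked LIFO that reuses page->lru.next as raw storage for
the next-page pointer (the real per_cpu_pages uses a doubly-linked
list_head, so none of this is existing code):

struct pcp_lifo {
	struct page *head;
};

static struct page *pcp_pop(struct pcp_lifo *pcp)
{
	struct page *page, *next;

	preempt_disable();
	do {
		page = pcp->head;
		if (!page) {
			preempt_enable();
			return NULL;		/* refill from the buddy lists */
		}
		next = (struct page *)page->lru.next;	/* raw next-page storage */
	} while (cmpxchg_local(&pcp->head, page, next) != page);
	preempt_enable();
	return page;
}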

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
 On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
 
  SLUB Use cmpxchg() everywhere.
  
  It applies to SLUB: Single atomic instruction alloc/free using
  cmpxchg.
 
  +++ slab/mm/slub.c  2007-08-20 18:42:28.0 -0400
  @@ -1682,7 +1682,7 @@ redo:
   
   object[c->offset] = freelist;
   
   -   if (unlikely(cmpxchg_local(&c->freelist, freelist, object) != freelist))
   +   if (unlikely(cmpxchg(&c->freelist, freelist, object) != freelist))
  goto redo;
  return;
   slow:
 
 Ok so regular cmpxchg, no cmpxchg_local. cmpxchg_local does not bring 
 anything more? My measurements did not show any difference. I measured on 
 Athlon64. What processor is being used?
 

This patch only cleans up the tree before proposing my cmpxchg_local
changes. There was an inconsistent use of cmpxchg/cmpxchg_local there.

Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as
shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles
for the kmalloc/kfree pair (test 2).

Pros :
- we can use barrier() instead of rmb()
- cmpxchg_local is faster

Con :
- we must disable preemption

I use a 3GHz Pentium 4 for my tests.
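
As a sketch of the kind of loop that could produce numbers like the
kmalloc/kfree pairs of test 2 below (the actual test module is not
quoted in this thread, so the function name and loop count here are
made up):

static void __init time_kmalloc_kfree(size_t size, unsigned int loops)
{
	cycles_t start, total = 0;
	unsigned int i;

	for (i = 0; i < loops; i++) {
		void *p;

		start = get_cycles();
		p = kmalloc(size, GFP_KERNEL);
		kfree(p);
		total += get_cycles() - start;
	}
	printk(KERN_INFO "kmalloc(%zu)/kfree = %llu cycles\n",
	       size, (unsigned long long)(total / loops));
}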

Results (compared to cmpxchg_local numbers) :

SLUB Performance testing

1. Kmalloc: Repeatedly allocate then free test
(kfree here is slow path)

* cmpxchg
kmalloc(8) = 271 cycles kfree = 645 cycles
kmalloc(16) = 158 cycles  kfree = 428 cycles
kmalloc(32) = 153 cycles  kfree = 446 cycles
kmalloc(64) = 178 cycles  kfree = 459 cycles
kmalloc(128) = 247 cycles kfree = 481 cycles
kmalloc(256) = 363 cycles kfree = 605 cycles
kmalloc(512) = 449 cycles kfree = 677 cycles
kmalloc(1024) = 626 cycles  kfree = 810 cycles
kmalloc(2048) = 681 cycles  kfree = 869 cycles
kmalloc(4096) = 471 cycles  kfree = 575 cycles
kmalloc(8192) = 666 cycles  kfree = 747 cycles
kmalloc(16384) = 736 cycles kfree = 853 cycles

* cmpxchg_local
kmalloc(8) = 83 cycles  kfree = 363 cycles
kmalloc(16) = 85 cycles kfree = 372 cycles
kmalloc(32) = 92 cycles kfree = 377 cycles
kmalloc(64) = 115 cycles   kfree = 397 cycles
kmalloc(128) = 179 cycles   kfree = 438 cycles
kmalloc(256) = 314 cycles   kfree = 564 cycles
kmalloc(512) = 398 cycles   kfree = 615 cycles
kmalloc(1024) = 573 cycles  kfree = 745 cycles
kmalloc(2048) = 629 cycles  kfree = 816 cycles
kmalloc(4096) = 473 cycles  kfree = 548 cycles
kmalloc(8192) = 659 cycles  kfree = 745 cycles
kmalloc(16384) = 724 cycles kfree = 843 cycles


2. Kmalloc: alloc/free test

* cmpxchg
kmalloc(8)/kfree = 321 cycles
kmalloc(16)/kfree = 308 cycles
kmalloc(32)/kfree = 311 cycles
kmalloc(64)/kfree = 310 cycles
kmalloc(128)/kfree = 306 cycles
kmalloc(256)/kfree = 325 cycles
kmalloc(512)/kfree = 324 cycles
kmalloc(1024)/kfree = 322 cycles
kmalloc(2048)/kfree = 309 cycles
kmalloc(4096)/kfree = 678 cycles
kmalloc(8192)/kfree = 1027 cycles
kmalloc(16384)/kfree = 1204 cycles

* cmpxchg_local
kmalloc(8)/kfree = 112 cycles
kmalloc(16)/kfree = 103 cycles
kmalloc(32)/kfree = 103 cycles
kmalloc(64)/kfree = 103 cycles
kmalloc(128)/kfree = 112 cycles
kmalloc(256)/kfree = 111 cycles
kmalloc(512)/kfree = 111 cycles
kmalloc(1024)/kfree = 111 cycles
kmalloc(2048)/kfree = 121 cycles
kmalloc(4096)/kfree = 650 cycles
kmalloc(8192)/kfree = 1042 cycles
kmalloc(16384)/kfree = 1149 cycles

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:

 Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as
 shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles
 for the kmalloc/kfree pair (test 2).

Hmmm.. I wonder if the AMD processors simply do the same in either
version.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

