Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-15 Thread Ravikiran G Thirumalai
On Sat, Jan 13, 2007 at 01:20:23PM -0800, Andrew Morton wrote:
> 
> Seeing the code helps.

But there was a subtle problem with the hold time instrumentation here.
The code assumed that a critical section exiting through spin_unlock_irq
was entered with spin_lock_irq, but that is not always the case, and the
hold time instrumentation goes wrong when it isn't (as in
shrink_inactive_list).
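To make the mismatch concrete, here is a simplified sketch (not the actual
mm/vmscan.c code) of the shape of shrink_inactive_list() that trips up the
instrumentation:

/*
 * The lock is re-acquired with plain spin_lock() while irqs are already
 * off, but released with spin_unlock_irq(), so the raw_lock.htsc stamp
 * taken in _spin_lock_irq() is stale by the time _spin_unlock_irq()
 * computes the hold time.
 */
static void mismatched_lock_pattern(struct zone *zone)
{
        int more_to_scan = 1;   /* stand-in for the real scan condition */

        spin_lock_irq(&zone->lru_lock);         /* stamps raw_lock.htsc */
        do {
                /* ... isolate a batch of pages ... */
                spin_unlock_irq(&zone->lru_lock);       /* hold time measured here */

                /* ... reclaim the batch without the lock ... */

                local_irq_disable();
                spin_lock(&zone->lru_lock);     /* no htsc stamp on this path */
                /* ... put back unfreed pages ... */
                more_to_scan = 0;
        } while (more_to_scan);
        spin_unlock_irq(&zone->lru_lock);       /* computes hold time from a stale stamp */
}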

> 
> >  The
> > instrumentation goes like this:
> > 
> > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > {
> > unsigned long long t1,t2;
> > local_irq_disable();
> > t1 = get_cycles_sync();
> > preempt_disable();
> > spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > _raw_spin_lock(lock);
> > t2 = get_cycles_sync();
> > lock->raw_lock.htsc = t2;
> > if (lock->spin_time < (t2 - t1))
> > lock->spin_time = t2 - t1;
> > }
> > ...
> > 
> > void __lockfunc _spin_unlock_irq(spinlock_t *lock)
> > {
> > unsigned long long t1;
> > spin_release(&lock->dep_map, 1, _RET_IP_);
> > t1 = get_cycles_sync();
> > if (lock->cs_time < (t1 -  lock->raw_lock.htsc))
> > lock->cs_time = t1 -  lock->raw_lock.htsc;
> > _raw_spin_unlock(lock);
> > local_irq_enable();
> > preempt_enable();
> > }
> > 
...
> 
> OK, now we need to do a dump_stack() each time we discover a new max hold
> time.  That might be a bit tricky: the printk code does spinlocking too so
> things could go recursively deadlocky.  Maybe make spin_unlock_irq() return
> the hold time then do:

What I found after fixing the above is that the hold time is not bad --
249461 cycles (roughly 96 microseconds) on the 2.6 GHz Opteron with PowerNow!
disabled in the BIOS.  The spin time is still on the order of seconds.

Hence this looks like a hardware fairness issue.

Attaching the instrumentation patch with this email for reference.


Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock.h 
2007-01-14 22:36:46.694248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h  2007-01-15 
15:40:36.554248000 -0800
@@ -6,6 +6,18 @@
 #include <asm/page.h>
 #include <asm/processor.h>
 
+/* Like get_cycles, but make sure the CPU is synchronized. */
+static inline unsigned long long get_cycles_sync2(void)
+{
+   unsigned long long ret;
+   unsigned eax;
+   /* Don't do an additional sync on CPUs where we know
+  RDTSC is already synchronous. */
+   alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
+ "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
+   rdtscll(ret);
+   return ret;
+}
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -34,6 +46,7 @@ static inline void __raw_spin_lock(raw_s
"jle 3b\n\t"
"jmp 1b\n"
"2:\t" : "=m" (lock->slock) : : "memory");
+   lock->htsc = get_cycles_sync2();
 }
 
 /*
@@ -62,6 +75,7 @@ static inline void __raw_spin_lock_flags
"jmp 4b\n"
"5:\n\t"
: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+   lock->htsc = get_cycles_sync2();
 }
 #endif
 
@@ -74,11 +88,16 @@ static inline int __raw_spin_trylock(raw
:"=q" (oldval), "=m" (lock->slock)
:"0" (0) : "memory");
 
+   if (oldval)
+   lock->htsc = get_cycles_sync2();
return oldval > 0;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
+   unsigned long long t = get_cycles_sync2();
+   if (lock->hold_time <  t - lock->htsc)
+   lock->hold_time = t - lock->htsc;
asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
 }
 
Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock_types.h   
2007-01-14 22:36:46.714248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h
2007-01-15 14:23:37.204248000 -0800
@@ -7,9 +7,11 @@
 
 typedef struct {
unsigned int slock;
+   unsigned long long hold_time;
+   unsigned long long htsc;
 } raw_spinlock_t;
 
-#define __RAW_SPIN_LOCK_UNLOCKED   { 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED   { 1,0,0 }
 
 typedef struct {
unsigned int lock;
Index: linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h
===
--- linux-2.6.20-rc4.spin_instru.orig/include/linux/spinlock.h  2007-01-14 
22:36:48.464248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h   2007-01-14 
22:41:30.964248000 -0800
@@ -231,8 +231,8 @@ do {
\
 # define spin_unlock(lock) __raw_spin_unlock(&(lock)->raw_lock)
 # define read_unlock(lock) __raw_read_unlock(&(lock)->raw_lock)
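(Not part of the patch above -- a purely illustrative helper showing how the
recorded maxima could be dumped for a given lock.  The field names follow the
instrumentation in this thread; how the numbers are actually read out is not
shown in the patch, so treat this as a sketch only.)

static void report_lock_times(spinlock_t *lock, const char *name)
{
        printk(KERN_INFO "%s: max spin %llu, max hold %llu, max cs %llu cycles\n",
               name, lock->spin_time, lock->raw_lock.hold_time, lock->cs_time);
}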

Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-13 Thread Andrew Morton
> On Sat, 13 Jan 2007 11:53:34 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> 
> wrote:
> On Sat, Jan 13, 2007 at 12:00:17AM -0800, Andrew Morton wrote:
> > > On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[EMAIL 
> > > PROTECTED]> wrote:
> > > > >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > > > >{
> > > > >local_irq_disable();
> > > > >> rdtsc(t1);
> > > > >preempt_disable();
> > > > >spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > > > >_raw_spin_lock(lock);
> > > > >> rdtsc(t2);
> > > > >if (lock->spin_time < (t2 - t1))
> > > > >lock->spin_time = t2 - t1;
> > > > >}
> > > > >
> > > > >On some runs, we found that the zone->lru_lock spun for 33 seconds or 
> > > > >more
> > > > >while the maximal CS time was 3 seconds or so.
> > > > 
> > > > What is the "CS time"?
> > > 
> > > Critical Section :).  This is the maximal time interval I measured  from 
> > > t2 above to the time point we release the spin lock.  This is the hold 
> > > time I guess.
> > 
> > By no means.  The theory here is that CPUA is taking and releasing the
> > lock at high frequency, but CPUB never manages to get in and take it.  In
> > which case the maximum-acquisition-time is much larger than the
> > maximum-hold-time.
> > 
> > I'd suggest that you use a similar trick to measure the maximum hold time:
> > start the timer after we got the lock, stop it just before we release the
> > lock (assuming that the additional rdtsc delay doesn't "fix" things, of
> > course...)
> 
> Well, that is exactly what I described above  as CS time.

Seeing the code helps.

>  The
> instrumentation goes like this:
> 
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
> unsigned long long t1,t2;
> local_irq_disable();
> t1 = get_cycles_sync();
> preempt_disable();
> spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> _raw_spin_lock(lock);
> t2 = get_cycles_sync();
> lock->raw_lock.htsc = t2;
> if (lock->spin_time < (t2 - t1))
> lock->spin_time = t2 - t1;
> }
> ...
> 
> void __lockfunc _spin_unlock_irq(spinlock_t *lock)
> {
> unsigned long long t1;
> spin_release(&lock->dep_map, 1, _RET_IP_);
> t1 = get_cycles_sync();
> if (lock->cs_time < (t1 -  lock->raw_lock.htsc))
> lock->cs_time = t1 -  lock->raw_lock.htsc;
> _raw_spin_unlock(lock);
> local_irq_enable();
> preempt_enable();
> }
> 
> Am I missing something?  Is this not what you just described? (The
> synchronizing rdtsc might not be really required at all locations, but I 
> doubt if it would contribute a significant fraction to 33s  or even 
> the 3s hold time on a 2.6 GHZ opteron).

OK, now we need to do a dump_stack() each time we discover a new max hold
time.  That might be a bit tricky: the printk code does spinlocking too so
things could go recursively deadlocky.  Maybe make spin_unlock_irq() return
the hold time then do:

void lru_spin_unlock_irq(struct zone *zone)
{
long this_time;

this_time = spin_unlock_irq(&zone->lru_lock);
if (this_time > zone->max_time) {
zone->max_time = this_time;
dump_stack();
}
}

or similar.
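For concreteness, a sketch of the returning-unlock counterpart described
above -- not actual kernel code; the raw_lock.htsc and cs_time fields follow
the instrumentation quoted earlier in this thread:

/*
 * Sketch only: an _spin_unlock_irq() variant that returns the hold time so
 * that a caller like lru_spin_unlock_irq() above can dump_stack() outside
 * the lock.
 */
unsigned long long __lockfunc _spin_unlock_irq(spinlock_t *lock)
{
        unsigned long long now, held;

        spin_release(&lock->dep_map, 1, _RET_IP_);
        now = get_cycles_sync();
        held = now - lock->raw_lock.htsc;
        if (lock->cs_time < held)
                lock->cs_time = held;
        _raw_spin_unlock(lock);
        local_irq_enable();
        preempt_enable();
        return held;
}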





Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-13 Thread Ravikiran G Thirumalai
On Sat, Jan 13, 2007 at 12:00:17AM -0800, Andrew Morton wrote:
> > On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[EMAIL 
> > PROTECTED]> wrote:
> > > >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > > >{
> > > >local_irq_disable();
> > > >> rdtsc(t1);
> > > >preempt_disable();
> > > >spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > > >_raw_spin_lock(lock);
> > > >> rdtsc(t2);
> > > >if (lock->spin_time < (t2 - t1))
> > > >lock->spin_time = t2 - t1;
> > > >}
> > > >
> > > >On some runs, we found that the zone->lru_lock spun for 33 seconds or 
> > > >more
> > > >while the maximal CS time was 3 seconds or so.
> > > 
> > > What is the "CS time"?
> > 
> > Critical Section :).  This is the maximal time interval I measured  from 
> > t2 above to the time point we release the spin lock.  This is the hold 
> > time I guess.
> 
> By no means.  The theory here is that CPUA is taking and releasing the
> lock at high frequency, but CPUB never manages to get in and take it.  In
> which case the maximum-acquisition-time is much larger than the
> maximum-hold-time.
> 
> I'd suggest that you use a similar trick to measure the maximum hold time:
> start the timer after we got the lock, stop it just before we release the
> lock (assuming that the additional rdtsc delay doesn't "fix" things, of
> course...)

Well, that is exactly what I described above  as CS time.  The
instrumentation goes like this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
unsigned long long t1,t2;
local_irq_disable();
t1 = get_cycles_sync();
preempt_disable();
> spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_spin_lock(lock);
t2 = get_cycles_sync();
lock->raw_lock.htsc = t2;
if (lock->spin_time < (t2 - t1))
lock->spin_time = t2 - t1;
}
...

void __lockfunc _spin_unlock_irq(spinlock_t *lock)
{
> unsigned long long t1;
> spin_release(&lock->dep_map, 1, _RET_IP_);
t1 = get_cycles_sync();
if (lock->cs_time < (t1 -  lock->raw_lock.htsc))
lock->cs_time = t1 -  lock->raw_lock.htsc;
_raw_spin_unlock(lock);
local_irq_enable();
preempt_enable();
}

Am I missing something?  Is this not what you just described? (The
synchronizing rdtsc might not really be required at all locations, but I
doubt it would contribute a significant fraction of the 33s spin time, or
even of the 3s hold time, on a 2.6 GHz Opteron.)


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-13 Thread Andrew Morton
> On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> 
> wrote:
> On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> > Ravikiran G Thirumalai wrote:
> > >Hi,
> > >We noticed high interrupt hold off times while running some memory 
> > >intensive
> > >tests on a Sun x4600 8 socket 16 core x86_64 box.  We noticed softlockups,
> > 
> > [...]
> > 
> > >We did not use any lock debugging options and used plain old rdtsc to
> > >measure cycles.  (We disable cpu freq scaling in the BIOS). All we did was
> > >this:
> > >
> > >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > >{
> > >local_irq_disable();
> > >> rdtsc(t1);
> > >preempt_disable();
> > >spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > >_raw_spin_lock(lock);
> > >> rdtsc(t2);
> > >if (lock->spin_time < (t2 - t1))
> > >lock->spin_time = t2 - t1;
> > >}
> > >
> > >On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> > >while the maximal CS time was 3 seconds or so.
> > 
> > What is the "CS time"?
> 
> Critical Section :).  This is the maximal time interval I measured  from 
> t2 above to the time point we release the spin lock.  This is the hold 
> time I guess.

By no means.  The theory here is that CPUA is taking and releasing the
lock at high frequency, but CPUB never manages to get in and take it.  In
which case the maximum-acquisition-time is much larger than the
maximum-hold-time.

I'd suggest that you use a similar trick to measure the maximum hold time:
start the timer after we got the lock, stop it just before we release the
lock (assuming that the additional rdtsc delay doesn't "fix" things, of
course...)


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Nick Piggin

Ravikiran G Thirumalai wrote:
> On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> > What is the "CS time"?
> 
> Critical Section :).  This is the maximal time interval I measured from
> t2 above to the time point we release the spin lock.  This is the hold
> time I guess.
> 
> > It would be interesting to know how long the maximal lru_lock *hold* time is,
> > which could give us a better indication of whether it is a hardware problem.
> > 
> > For example, if the maximum hold time is 10ms, then it might indicate a
> > hardware fairness problem.
> 
> The maximal hold time was about 3s.


Well then it doesn't seem very surprising that this could cause a 30s wait
time for one CPU in a 16 core system, regardless of fairness.

I guess most of the contention, and the lock hold times are coming from
vmscan? Do you know exactly which critical sections are the culprits?

--
SUSE Labs, Novell Inc.




Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Ravikiran G Thirumalai
On Fri, Jan 12, 2007 at 05:11:16PM -0800, Andrew Morton wrote:
> On Fri, 12 Jan 2007 17:00:39 -0800
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> 
> > But is
> > lru_lock an issue is another question.
> 
> I doubt it, although there might be changes we can make in there to
> work around it.
> 
> 

I tested with PAGEVEC_SIZE defined to 62 and 126 -- no difference.  I still
notice the atrociously high spin times.

Thanks,
Kiran


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Ravikiran G Thirumalai
On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> Ravikiran G Thirumalai wrote:
> >Hi,
> >We noticed high interrupt hold off times while running some memory 
> >intensive
> >tests on a Sun x4600 8 socket 16 core x86_64 box.  We noticed softlockups,
> 
> [...]
> 
> >We did not use any lock debugging options and used plain old rdtsc to
> >measure cycles.  (We disable cpu freq scaling in the BIOS). All we did was
> >this:
> >
> >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> >{
> >local_irq_disable();
> >> rdtsc(t1);
> >preempt_disable();
> >spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> >_raw_spin_lock(lock);
> >> rdtsc(t2);
> >if (lock->spin_time < (t2 - t1))
> >lock->spin_time = t2 - t1;
> >}
> >
> >On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> >while the maximal CS time was 3 seconds or so.
> 
> What is the "CS time"?

Critical Section :).  This is the maximal time interval I measured  from 
t2 above to the time point we release the spin lock.  This is the hold 
time I guess.

> 
> It would be interesting to know how long the maximal lru_lock *hold* time 
> is,
> which could give us a better indication of whether it is a hardware problem.
> 
> For example, if the maximum hold time is 10ms, then it might indicate a
> hardware fairness problem.

The maximal hold time was about 3s.


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Nick Piggin

Ravikiran G Thirumalai wrote:
> Hi,
> We noticed high interrupt hold off times while running some memory intensive
> tests on a Sun x4600 8 socket 16 core x86_64 box.  We noticed softlockups,
> 
> [...]
> 
> We did not use any lock debugging options and used plain old rdtsc to
> measure cycles.  (We disable cpu freq scaling in the BIOS). All we did was
> this:
> 
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
> local_irq_disable();
> > rdtsc(t1);
> preempt_disable();
> spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> _raw_spin_lock(lock);
> > rdtsc(t2);
> if (lock->spin_time < (t2 - t1))
> lock->spin_time = t2 - t1;
> }
> 
> On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> while the maximal CS time was 3 seconds or so.


What is the "CS time"?

It would be interesting to know how long the maximal lru_lock *hold* time is,
which could give us a better indication of whether it is a hardware problem.

For example, if the maximum hold time is 10ms, then it might indicate a
hardware fairness problem.

--
SUSE Labs, Novell Inc.




Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Andrew Morton
On Fri, 12 Jan 2007 17:00:39 -0800
Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:

> But is
> lru_lock an issue is another question.

I doubt it, although there might be changes we can make in there to
work around it.




Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Ravikiran G Thirumalai
On Fri, Jan 12, 2007 at 01:45:43PM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
> 
> Moreover mostatomic operations are to remote memory which is also 
> increasing the problem by making the atomic ops take longer. Typically 
> mature NUMA system have implemented hardware provisions that can deal with 
> such high degrees of contention. If this is simply a SMP system that was
> turned into a NUMA box then this is a new hardware scenario for the 
> engineers.

This box uses HyperTransport (HT), as all AMD systems do, but it is one of
the 8 socket systems.

I ran the same test on a 2 node Tyan AMD box, and did not notice the
atrocious spin times. It would be interesting to see how a 4 socket HT box
would fare. Unfortunately, I do not have access to one. If someone has access
to such a box, I can provide the test case and instrumentation patches.

It could very well be a hardware limitation in this case, which is all the
more reason to enable interrupts while spinning on spin locks.  But whether
lru_lock itself is the issue is another question.
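(For reference, a minimal sketch of the "spin with interrupts enabled" idea
referred to above -- this is not the patch from the lkml.org link, and the
helper name is made up; it only illustrates keeping interrupts off solely
while the lock is actually held:)

static void spin_lock_irq_friendly(spinlock_t *lock)
{
        for (;;) {
                local_irq_disable();
                if (spin_trylock(lock))
                        return;                 /* acquired with irqs off */
                local_irq_enable();             /* let interrupts in while we wait */
                while (spin_is_locked(lock))
                        cpu_relax();
        }
}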

Thanks,
Kiran
 


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Christoph Lameter
On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:

> > Does the system scale the right way if you stay within the bounds of node 
> > memory? I.e. allocate 1.5GB from each process?
> 
> Yes. We see problems only when we oversubscribe memory.

Ok in that case we can have more than 2 processors trying to acquire the
same zone lock.  If they have all exhausted their node local memory and are
all going off node, then all processors may be hitting the last node that
has some memory left, which will cause a very high degree of contention.

Moreover, most atomic operations are to remote memory, which also
aggravates the problem by making the atomic ops take longer.  Typically,
mature NUMA systems have implemented hardware provisions that can deal with
such high degrees of contention.  If this is simply an SMP system that was
turned into a NUMA box, then this is a new hardware scenario for the
engineers.



Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Ravikiran G Thirumalai
On Fri, Jan 12, 2007 at 11:46:22AM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
> 
> > The test was simple, we have 16 processes, each allocating 3.5G of memory
> > and and touching each and every page and returning.  Each of the process is
> > bound to a node (socket), with the local node being the preferred node for
> > allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node).  Each
> > socket has 4G of physical memory and there are two cores on each socket. On
> > start of the test, the machine becomes unresponsive after sometime and
> > prints out softlockup and OOM messages.  We then found out the cause
> > for softlockups being the excessive spin times on zone_lru lock.  The fact
> > that spin_lock_irq disables interrupts while spinning made matters very bad.
> > We instrumented the spin_lock_irq code and found that the spin time on the
> > lru locks was in the order of a few seconds (tens of seconds at times) and
> > the hold time was comparatively lesser.
> 
> So the issue is two processes contenting on the zone lock for one node? 
> You are overallocating the 4G node with two processes attempting to 
> allocate 7.5GB? So we go off node for 3.5G of the allocation?

Yes.

> 
> Does the system scale the right way if you stay within the bounds of node 
> memory? I.e. allocate 1.5GB from each process?

Yes. We see problems only when we oversubscribe memory.

> 
> Have you tried increasing the size of the per cpu caches in 
> /proc/sys/vm/percpu_pagelist_fraction?

No not yet. I can give it a try.

> 
> > While the softlockups and the like went away by enabling interrupts during
> > spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> > Andi thought maybe this is exposing a problem with zone->lru_locks and 
> > hence warrants a discussion on lkml, hence this post.  Are there any 
> > plans/patches/ideas to address the spin time under such extreme conditions?
> 
> Could this be a hardware problem? Some issue with atomic ops in the 
> Sun hardware?

I think that is unlikely -- because when we do not oversubscribe memory,
the tests complete quickly without softlockups and the like.  Peter has
also noticed this (presumably on different hardware).  I would think this
could also be a lock fairness issue (cpus of the same node getting the
lock and starving out other nodes) under extreme contention.


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Andrew Morton
On Fri, 12 Jan 2007 11:46:22 -0800 (PST)
Christoph Lameter <[EMAIL PROTECTED]> wrote:

> > While the softlockups and the like went away by enabling interrupts during
> > spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> > Andi thought maybe this is exposing a problem with zone->lru_locks and 
> > hence warrants a discussion on lkml, hence this post.  Are there any 
> > plans/patches/ideas to address the spin time under such extreme conditions?
> 
> Could this be a hardware problem? Some issue with atomic ops in the 
> Sun hardware?

I'd assume so.  We don't hold lru_lock for 33 seconds ;)

Probably similar symptoms are demonstrable using other locks, if a
suitable workload is chosen.

Increasing PAGEVEC_SIZE might help.  But we do allocate those things
on the stack.
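(For context -- from memory of the 2.6.20-era include/linux/pagevec.h, so
treat the exact values as an assumption -- a pagevec is a small on-stack
batch of page pointers:

#define PAGEVEC_SIZE    14

struct pagevec {
        unsigned long nr;
        unsigned long cold;
        struct page *pages[PAGEVEC_SIZE];
};

Since pagevecs are declared on the stack in the vmscan/swap paths, bumping
PAGEVEC_SIZE to something like 126 adds roughly 1KB of page pointers per
pagevec on a 64-bit kernel, which is why simply growing it is unattractive.)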


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Christoph Lameter
On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:

> The test was simple, we have 16 processes, each allocating 3.5G of memory
> and and touching each and every page and returning.  Each of the process is
> bound to a node (socket), with the local node being the preferred node for
> allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node).  Each
> socket has 4G of physical memory and there are two cores on each socket. On
> start of the test, the machine becomes unresponsive after sometime and
> prints out softlockup and OOM messages.  We then found out the cause
> for softlockups being the excessive spin times on zone_lru lock.  The fact
> that spin_lock_irq disables interrupts while spinning made matters very bad.
> We instrumented the spin_lock_irq code and found that the spin time on the
> lru locks was in the order of a few seconds (tens of seconds at times) and
> the hold time was comparatively lesser.

So the issue is two processes contending on the zone lock for one node? 
You are overallocating the 4G node with two processes attempting to 
allocate 7.5GB? So we go off node for 3.5G of the allocation?

Does the system scale the right way if you stay within the bounds of node 
memory? I.e. allocate 1.5GB from each process?

Have you tried increasing the size of the per cpu caches in 
/proc/sys/vm/percpu_pagelist_fraction?

> While the softlockups and the like went away by enabling interrupts during
> spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> Andi thought maybe this is exposing a problem with zone->lru_locks and 
> hence warrants a discussion on lkml, hence this post.  Are there any 
> plans/patches/ideas to address the spin time under such extreme conditions?

Could this be a hardware problem? Some issue with atomic ops in the 
Sun hardware?


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Peter Zijlstra
On Fri, 2007-01-12 at 08:01 -0800, Ravikiran G Thirumalai wrote:
> Hi,
> We noticed high interrupt hold off times while running some memory intensive
> tests on a Sun x4600 8 socket 16 core x86_64 box.  We noticed softlockups,
> lost ticks and even wall time drifting (which is probably a bug in the
> x86_64 timer subsystem). 
> 
> The test was simple, we have 16 processes, each allocating 3.5G of memory
> and and touching each and every page and returning.  Each of the process is
> bound to a node (socket), with the local node being the preferred node for
> allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node).  Each
> socket has 4G of physical memory and there are two cores on each socket. On
> start of the test, the machine becomes unresponsive after sometime and
> prints out softlockup and OOM messages.  We then found out the cause
> for softlockups being the excessive spin times on zone_lru lock.  The fact
> that spin_lock_irq disables interrupts while spinning made matters very bad.
> We instrumented the spin_lock_irq code and found that the spin time on the
> lru locks was in the order of a few seconds (tens of seconds at times) and
> the hold time was comparatively lesser.
> 
> We did not use any lock debugging options and used plain old rdtsc to
> measure cycles.  (We disable cpu freq scaling in the BIOS). All we did was
> this:
> 
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
> local_irq_disable();
> > rdtsc(t1);
> preempt_disable();
> spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> _raw_spin_lock(lock);
> > rdtsc(t2);
> if (lock->spin_time < (t2 - t1))
> lock->spin_time = t2 - t1;
> }
> 
> On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> while the maximal CS time was 3 seconds or so.
> 
> While the softlockups and the like went away by enabling interrupts during
> spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> Andi thought maybe this is exposing a problem with zone->lru_locks and 
> hence warrants a discussion on lkml, hence this post.  Are there any 
> plans/patches/ideas to address the spin time under such extreme conditions?
> 
> I will be happy to provide any additional information (config/dmesg/test
> case if needed.

I have been tinkering with this because -rt shows similar issues.
Find here the patch so far; it works on UP, but it still went *boom*
the last time I tried it on an actual SMP box.

So take this patch only as an indication of the direction I'm working
in.

One concern I have with the taken approach is cacheline bouncing.
Perhaps I should retain some form of per-cpu data structure.

---
Subject: mm: streamline zone->lock acquisition on lru_cache_add

By buffering the lru pages on a per cpu basis the flush of that buffer
is prone to bounce around zones. Furthermore release_pages can also acquire
the zone->lock.

Streamline all this by replacing the per cpu buffer with a per zone
lockless buffer.  Once the buffer is filled, flush it and perform
all needed operations under one lock acquisition.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/mmzone.h |   12 +++
 mm/internal.h  |2 
 mm/page_alloc.c|   21 ++
 mm/swap.c  |  169 +
 4 files changed, 149 insertions(+), 55 deletions(-)

Index: linux-2.6-rt/include/linux/mmzone.h
===
--- linux-2.6-rt.orig/include/linux/mmzone.h2007-01-11 16:27:08.0 
+0100
+++ linux-2.6-rt/include/linux/mmzone.h 2007-01-11 16:32:08.0 +0100
@@ -153,6 +153,17 @@ enum zone_type {
 #define ZONES_SHIFT 2
 #endif
 
+/*
+ * must be power of 2 to avoid wrap around artifacts
+ */
+#define PAGEBUF_SIZE   32
+
+struct pagebuf {
+   atomic_t head;
+   atomic_t tail;
+   struct page *pages[PAGEBUF_SIZE];
+};
+
 struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long   free_pages;
@@ -188,6 +199,7 @@ struct zone {
 #endif
struct free_areafree_area[MAX_ORDER];
 
+   struct pagebuf  pagebuf;
 
ZONE_PADDING(_pad1_)
 
Index: linux-2.6-rt/mm/swap.c
===
--- linux-2.6-rt.orig/mm/swap.c 2007-01-11 16:27:08.0 +0100
+++ linux-2.6-rt/mm/swap.c  2007-01-11 16:36:34.0 +0100
@@ -31,6 +31,8 @@
 #include <linux/notifier.h>
 #include <linux/init.h>
 
+#include "internal.h"
+
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
@@ -170,49 +172,131 @@ void fastcall mark_page_accessed(struct 
 
 EXPORT_SYMBOL(mark_page_accessed);
 
+static int __pagebuf_add(struct zone *zone, struct page *page)
+{
+   BUG_ON(page_zone(page) != zone);
+
+   switch (page_count(page)) {
+   case 0:
+   BUG();
+
+   case 1:
+   /*
+ 

High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Ravikiran G Thirumalai
Hi,
We noticed high interrupt hold off times while running some memory intensive
tests on a Sun x4600 8 socket 16 core x86_64 box.  We noticed softlockups,
lost ticks and even wall time drifting (which is probably a bug in the
x86_64 timer subsystem). 

The test was simple: we have 16 processes, each allocating 3.5G of memory,
touching each and every page, and returning.  Each of the processes is
bound to a node (socket), with the local node being the preferred node for
allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node).  Each
socket has 4G of physical memory and there are two cores on each socket.  On
start of the test, the machine becomes unresponsive after some time and
prints out softlockup and OOM messages.  We then found the cause of the
softlockups to be excessive spin times on the zone lru_lock.  The fact
that spin_lock_irq disables interrupts while spinning made matters very bad.
We instrumented the spin_lock_irq code and found that the spin time on the
lru locks was on the order of a few seconds (tens of seconds at times), and
the hold time was comparatively lower.

We did not use any lock debugging options and used plain old rdtsc to
measure cycles.  (We disable cpu freq scaling in the BIOS). All we did was
this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
local_irq_disable();
> rdtsc(t1);
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_spin_lock(lock);
> rdtsc(t2);
if (lock->spin_time < (t2 - t1))
lock->spin_time = t2 - t1;
}

On some runs, we found that the zone->lru_lock spun for 33 seconds or more
while the maximal CS time was 3 seconds or so.

While the softlockups and the like went away when we enabled interrupts
during spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
Andi thought this may be exposing a problem with zone->lru_locks and that it
warrants a discussion on lkml; hence this post.  Are there any
plans/patches/ideas to address the spin time under such extreme conditions?

I will be happy to provide any additional information (config/dmesg/test
case) if needed.

Thanks,
Kiran
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


High lock spin time for zone-lru_lock under extreme conditions

2007-01-12 Thread Ravikiran G Thirumalai
Hi,
We noticed high interrupt hold off times while running some memory intensive
tests on a Sun x4600 8 socket 16 core x86_64 box.  We noticed softlockups,
lost ticks and even wall time drifting (which is probably a bug in the
x86_64 timer subsystem). 

The test was simple, we have 16 processes, each allocating 3.5G of memory
and and touching each and every page and returning.  Each of the process is
bound to a node (socket), with the local node being the preferred node for
allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node).  Each
socket has 4G of physical memory and there are two cores on each socket. On
start of the test, the machine becomes unresponsive after sometime and
prints out softlockup and OOM messages.  We then found out the cause
for softlockups being the excessive spin times on zone_lru lock.  The fact
that spin_lock_irq disables interrupts while spinning made matters very bad.
We instrumented the spin_lock_irq code and found that the spin time on the
lru locks was in the order of a few seconds (tens of seconds at times) and
the hold time was comparatively lesser.

We did not use any lock debugging options and used plain old rdtsc to
measure cycles.  (We disable cpu freq scaling in the BIOS). All we did was
this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
local_irq_disable();
 rdtsc(t1);
preempt_disable();
spin_acquire(lock-dep_map, 0, 0, _RET_IP_);
_raw_spin_lock(lock);
 rdtsc(t2);
if (lock-spin_time  (t2 - t1))
lock-spin_time = t2 - t1;
}

On some runs, we found that the zone-lru_lock spun for 33 seconds or more
while the maximal CS time was 3 seconds or so.

While the softlockups and the like went away by enabling interrupts during
spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
Andi thought maybe this is exposing a problem with zone-lru_locks and 
hence warrants a discussion on lkml, hence this post.  Are there any 
plans/patches/ideas to address the spin time under such extreme conditions?

I will be happy to provide any additional information (config/dmesg/test
case if needed.

Thanks,
Kiran
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: High lock spin time for zone-lru_lock under extreme conditions

2007-01-12 Thread Peter Zijlstra
On Fri, 2007-01-12 at 08:01 -0800, Ravikiran G Thirumalai wrote:
 Hi,
 We noticed high interrupt hold off times while running some memory intensive
 tests on a Sun x4600 8 socket 16 core x86_64 box.  We noticed softlockups,
 lost ticks and even wall time drifting (which is probably a bug in the
 x86_64 timer subsystem). 
 
 The test was simple, we have 16 processes, each allocating 3.5G of memory
 and and touching each and every page and returning.  Each of the process is
 bound to a node (socket), with the local node being the preferred node for
 allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node).  Each
 socket has 4G of physical memory and there are two cores on each socket. On
 start of the test, the machine becomes unresponsive after sometime and
 prints out softlockup and OOM messages.  We then found out the cause
 for softlockups being the excessive spin times on zone_lru lock.  The fact
 that spin_lock_irq disables interrupts while spinning made matters very bad.
 We instrumented the spin_lock_irq code and found that the spin time on the
 lru locks was in the order of a few seconds (tens of seconds at times) and
 the hold time was comparatively lesser.
 
 We did not use any lock debugging options and used plain old rdtsc to
 measure cycles.  (We disable cpu freq scaling in the BIOS). All we did was
 this:
 
 void __lockfunc _spin_lock_irq(spinlock_t *lock)
 {
 local_irq_disable();
  rdtsc(t1);
 preempt_disable();
 spin_acquire(lock-dep_map, 0, 0, _RET_IP_);
 _raw_spin_lock(lock);
  rdtsc(t2);
 if (lock-spin_time  (t2 - t1))
 lock-spin_time = t2 - t1;
 }
 
 On some runs, we found that the zone-lru_lock spun for 33 seconds or more
 while the maximal CS time was 3 seconds or so.
 
 While the softlockups and the like went away by enabling interrupts during
 spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
 Andi thought maybe this is exposing a problem with zone-lru_locks and 
 hence warrants a discussion on lkml, hence this post.  Are there any 
 plans/patches/ideas to address the spin time under such extreme conditions?
 
 I will be happy to provide any additional information (config/dmesg/test
 case if needed.

I have been tinkering with this because -rt shows similar issues.
Find there the patch so far, it works on UP, but it still went *boom*
the last time I tried an actual SMP box.

So take this patch only as an indication of the direction I'm working
in.

One concern I have with the taken approach is cacheline bouncing.
Perhaps I should retain some form of per-cpu data structure.

---
Subject: mm: streamline zone-lock acquisition on lru_cache_add

By buffering the lru pages on a per cpu basis the flush of that buffer
is prone to bounce around zones. Furthermore release_pages can also acquire
the zone-lock.

Streeamline all this by replacing the per cpu buffer with a per zone
lockless buffer. Once the buffer is filled flush it and perform
all needed operation under one lock acquisition.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 include/linux/mmzone.h |   12 +++
 mm/internal.h  |2 
 mm/page_alloc.c|   21 ++
 mm/swap.c  |  169 +
 4 files changed, 149 insertions(+), 55 deletions(-)

Index: linux-2.6-rt/include/linux/mmzone.h
===
--- linux-2.6-rt.orig/include/linux/mmzone.h2007-01-11 16:27:08.0 
+0100
+++ linux-2.6-rt/include/linux/mmzone.h 2007-01-11 16:32:08.0 +0100
@@ -153,6 +153,17 @@ enum zone_type {
 #define ZONES_SHIFT 2
 #endif
 
+/*
+ * must be power of 2 to avoid wrap around artifacts
+ */
+#define PAGEBUF_SIZE   32
+
+struct pagebuf {
+   atomic_t head;
+   atomic_t tail;
+   struct page *pages[PAGEBUF_SIZE];
+};
+
 struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long   free_pages;
@@ -188,6 +199,7 @@ struct zone {
 #endif
struct free_areafree_area[MAX_ORDER];
 
+   struct pagebuf  pagebuf;
 
ZONE_PADDING(_pad1_)
 
Index: linux-2.6-rt/mm/swap.c
===
--- linux-2.6-rt.orig/mm/swap.c 2007-01-11 16:27:08.0 +0100
+++ linux-2.6-rt/mm/swap.c  2007-01-11 16:36:34.0 +0100
@@ -31,6 +31,8 @@
 #include linux/notifier.h
 #include linux/init.h
 
+#include internal.h
+
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
@@ -170,49 +172,131 @@ void fastcall mark_page_accessed(struct 
 
 EXPORT_SYMBOL(mark_page_accessed);
 
+static int __pagebuf_add(struct zone *zone, struct page *page)
+{
+   BUG_ON(page_zone(page) != zone);
+
+   switch (page_count(page)) {
+   case 0:
+   BUG();
+
+   case 1:
+   /*
+* we're the 

Re: High lock spin time for zone-lru_lock under extreme conditions

2007-01-12 Thread Christoph Lameter
On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:

 The test was simple, we have 16 processes, each allocating 3.5G of memory
 and and touching each and every page and returning.  Each of the process is
 bound to a node (socket), with the local node being the preferred node for
 allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node).  Each
 socket has 4G of physical memory and there are two cores on each socket. On
 start of the test, the machine becomes unresponsive after sometime and
 prints out softlockup and OOM messages.  We then found out the cause
 for softlockups being the excessive spin times on zone_lru lock.  The fact
 that spin_lock_irq disables interrupts while spinning made matters very bad.
 We instrumented the spin_lock_irq code and found that the spin time on the
 lru locks was in the order of a few seconds (tens of seconds at times) and
 the hold time was comparatively lesser.

So the issue is two processes contenting on the zone lock for one node? 
You are overallocating the 4G node with two processes attempting to 
allocate 7.5GB? So we go off node for 3.5G of the allocation?

Does the system scale the right way if you stay within the bounds of node 
memory? I.e. allocate 1.5GB from each process?

Have you tried increasing the size of the per cpu caches in 
/proc/sys/vm/percpu_pagelist_fraction?

 While the softlockups and the like went away by enabling interrupts during
 spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
 Andi thought maybe this is exposing a problem with zone-lru_locks and 
 hence warrants a discussion on lkml, hence this post.  Are there any 
 plans/patches/ideas to address the spin time under such extreme conditions?

Could this be a hardware problem? Some issue with atomic ops in the 
Sun hardware?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: High lock spin time for zone-lru_lock under extreme conditions

2007-01-12 Thread Andrew Morton
On Fri, 12 Jan 2007 11:46:22 -0800 (PST)
Christoph Lameter [EMAIL PROTECTED] wrote:

  While the softlockups and the like went away by enabling interrupts during
  spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
  Andi thought maybe this is exposing a problem with zone-lru_locks and 
  hence warrants a discussion on lkml, hence this post.  Are there any 
  plans/patches/ideas to address the spin time under such extreme conditions?
 
 Could this be a hardware problem? Some issue with atomic ops in the 
 Sun hardware?

I'd assume so.  We don't hold lru_lock for 33 seconds ;)

Probably similar symptoms are demonstrable using other locks, if a
suitable workload is chosen.

Increasing PAGEVEC_SIZE might help.  But we do allocate those things
on the stack.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: High lock spin time for zone-lru_lock under extreme conditions

2007-01-12 Thread Ravikiran G Thirumalai
On Fri, Jan 12, 2007 at 11:46:22AM -0800, Christoph Lameter wrote:
 On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
 
  The test was simple, we have 16 processes, each allocating 3.5G of memory
  and and touching each and every page and returning.  Each of the process is
  bound to a node (socket), with the local node being the preferred node for
  allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node).  Each
  socket has 4G of physical memory and there are two cores on each socket. On
  start of the test, the machine becomes unresponsive after sometime and
  prints out softlockup and OOM messages.  We then found out the cause
  for softlockups being the excessive spin times on zone_lru lock.  The fact
  that spin_lock_irq disables interrupts while spinning made matters very bad.
  We instrumented the spin_lock_irq code and found that the spin time on the
  lru locks was in the order of a few seconds (tens of seconds at times) and
  the hold time was comparatively lesser.
 
 So the issue is two processes contending on the zone lock for one node?
 You are overallocating the 4G node with two processes attempting to 
 allocate 7.5GB? So we go off node for 3.5G of the allocation?

Yes.

 
 Does the system scale the right way if you stay within the bounds of node 
 memory? I.e. allocate 1.5GB from each process?

Yes. We see problems only when we oversubscribe memory.

 
 Have you tried increasing the size of the per cpu caches in 
 /proc/sys/vm/percpu_pagelist_fraction?

No, not yet. I can give it a try.

 
  While the softlockups and the like went away by enabling interrupts during
  spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
  Andi thought maybe this is exposing a problem with zone->lru_locks and 
  hence warrants a discussion on lkml, hence this post.  Are there any 
  plans/patches/ideas to address the spin time under such extreme conditions?
 
 Could this be a hardware problem? Some issue with atomic ops in the 
 Sun hardware?

I think that is unlikely -- because when we do not oversubscribe
memory, the tests complete quickly without softlockups and the like.  Peter 
has also noticed this (presumably on different hardware).  I would think
this could also be a lock unfairness case (cpus of the same node getting the 
lock and starving out other nodes) under extreme contention.
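
A minimal illustration of the unfairness point -- not from this thread, and
a userspace toy rather than the kernel's lock -- is a ticket lock, which
grants the lock strictly in FIFO order, so CPUs sharing a node cannot keep
reacquiring it and starve remote waiters:

#include <stdatomic.h>

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);	/* take a ticket */
	while (atomic_load(&l->owner) != me)
		;	/* spin until served: strictly FIFO, regardless of NUMA topology */
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);		/* serve the next waiter */
}

(If memory serves, x86 spinlocks did move to a ticket scheme along these
lines a few releases later, in 2.6.25.)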


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Christoph Lameter
On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:

  Does the system scale the right way if you stay within the bounds of node 
  memory? I.e. allocate 1.5GB from each process?
 
 Yes. We see problems only when we oversubscribe memory.

Ok, in that case we can have more than 2 processors trying to acquire the 
same zone lock. If they have all exhausted their node local memory and are 
all going off node, then all processors may be hitting the last node that 
has some memory left, which will cause a very high degree of contention.

Moreover, most atomic operations are to remote memory, which also 
increases the problem by making the atomic ops take longer. Typically, 
mature NUMA systems have implemented hardware provisions that can deal with 
such high degrees of contention. If this is simply an SMP system that was
turned into a NUMA box, then this is a new hardware scenario for the 
engineers.
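
(To put numbers from the test on this: 16 processes each touching 3.5GB is
56GB of demand against 8 nodes x 4GB = 32GB of physical memory, so once the
local nodes are drained every allocator falls back to whichever node still
has free pages, and they all serialize on that one zone's locks.)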



Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Ravikiran G Thirumalai
On Fri, Jan 12, 2007 at 01:45:43PM -0800, Christoph Lameter wrote:
 On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
 
 Moreover, most atomic operations are to remote memory, which also 
 increases the problem by making the atomic ops take longer. Typically, 
 mature NUMA systems have implemented hardware provisions that can deal with 
 such high degrees of contention. If this is simply an SMP system that was
 turned into a NUMA box, then this is a new hardware scenario for the 
 engineers.

This is using HT as all AMD systems do, but this is one of the 8
socket systems.  

I ran the same test on a 2 node Tyan AMD box, and did not notice the
atrocious spin times. It would be interesting to see how a 4 socket HT box
would fare. Unfortunately, I do not have access to one. If someone has access
to such a box, I can provide the test case and instrumentation patches.

It could very well be a hardware limitation in this case, which is all the
more reason to enable interrupts while spinning on spin locks.  But whether
lru_lock is an issue is another question.

Thanks,
Kiran
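
A minimal sketch of the "enable interrupts while spinning" idea referred to
above -- not the actual patch from the lkml link, just an illustration using
the generic spinlock API:

static void demo_spin_lock_irq(spinlock_t *lock)
{
	for (;;) {
		local_irq_disable();
		if (spin_trylock(lock))
			return;		/* got it: irqs off, pairs with spin_unlock_irq() */
		local_irq_enable();	/* take interrupts while we wait our turn */
		while (spin_is_locked(lock))
			cpu_relax();
	}
}

Long waits then no longer hold off the timer interrupt, so the softlockup
reports go away even when the lock itself is still badly contended.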
 


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Andrew Morton
On Fri, 12 Jan 2007 17:00:39 -0800
Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:

 But whether
 lru_lock is an issue is another question.

I doubt it, although there might be changes we can make in there to
work around it.

mentions PAGEVEC_SIZE again


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Nick Piggin

Ravikiran G Thirumalai wrote:

 Hi,
 We noticed high interrupt hold off times while running some memory intensive
 tests on a Sun x4600 8 socket 16 core x86_64 box.  We noticed softlockups,
 
 [...]
 
 We did not use any lock debugging options and used plain old rdtsc to
 measure cycles.  (We disable cpu freq scaling in the BIOS). All we did was
 this:
 
 void __lockfunc _spin_lock_irq(spinlock_t *lock)
 {
 	local_irq_disable();
 	rdtsc(t1);
 	preempt_disable();
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_spin_lock(lock);
 	rdtsc(t2);
 	if (lock->spin_time < (t2 - t1))
 		lock->spin_time = t2 - t1;
 }
 
 On some runs, we found that the zone->lru_lock spun for 33 seconds or more
 while the maximal CS time was 3 seconds or so.


What is the CS time?

It would be interesting to know how long the maximal lru_lock *hold* time is,
which could give us a better indication of whether it is a hardware problem.

For example, if the maximum hold time is 10ms, then it might indicate a
hardware fairness problem.

--
SUSE Labs, Novell Inc.


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Ravikiran G Thirumalai
On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
 Ravikiran G Thirumalai wrote:
 Hi,
 We noticed high interrupt hold off times while running some memory 
 intensive
 tests on a Sun x4600 8 socket 16 core x86_64 box.  We noticed softlockups,
 
 [...]
 
 We did not use any lock debugging options and used plain old rdtsc to
 measure cycles.  (We disable cpu freq scaling in the BIOS). All we did was
 this:
 
 void __lockfunc _spin_lock_irq(spinlock_t *lock)
 {
 	local_irq_disable();
 	rdtsc(t1);
 	preempt_disable();
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
 	_raw_spin_lock(lock);
 	rdtsc(t2);
 	if (lock->spin_time < (t2 - t1))
 		lock->spin_time = t2 - t1;
 }
 
 On some runs, we found that the zone->lru_lock spun for 33 seconds or more
 while the maximal CS time was 3 seconds or so.
 
 What is the CS time?

Critical Section :).  This is the maximal time interval I measured  from 
t2 above to the time point we release the spin lock.  This is the hold 
time I guess.

 
 It would be interesting to know how long the maximal lru_lock *hold* time is,
 which could give us a better indication of whether it is a hardware problem.
 
 For example, if the maximum hold time is 10ms, then it might indicate a
 hardware fairness problem.

The maximal hold time was about 3s.


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Ravikiran G Thirumalai
On Fri, Jan 12, 2007 at 05:11:16PM -0800, Andrew Morton wrote:
 On Fri, 12 Jan 2007 17:00:39 -0800
 Ravikiran G Thirumalai [EMAIL PROTECTED] wrote:
 
  But is
  lru_lock an issue is another question.
 
 I doubt it, although there might be changes we can make in there to
 work around it.
 
 mentions PAGEVEC_SIZE again

I tested with PAGEVEC_SIZE defined to 62 and 126 -- no difference.  I still
notice the atrociously high spin times.

Thanks,
Kiran


Re: High lock spin time for zone->lru_lock under extreme conditions

2007-01-12 Thread Nick Piggin

Ravikiran G Thirumalai wrote:
 On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
 
  What is the CS time?
 
 Critical Section :).  This is the maximal time interval I measured from
 t2 above to the time point we release the spin lock.  This is the hold
 time I guess.
 
  It would be interesting to know how long the maximal lru_lock *hold* time is,
  which could give us a better indication of whether it is a hardware problem.
 
  For example, if the maximum hold time is 10ms, then it might indicate a
  hardware fairness problem.
 
 The maximal hold time was about 3s.


Well then it doesn't seem very surprising that this could cause a 30s wait
time for one CPU in a 16 core system, regardless of fairness.

I guess most of the contention, and the lock hold times are coming from
vmscan? Do you know exactly which critical sections are the culprits?

--
SUSE Labs, Novell Inc.
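
On the question of which critical sections are the culprits: one way to find
out would be to extend the _spin_lock_irq() instrumentation quoted earlier
in the thread so that it also records call sites.  The spin_time,
worst_waiter and owner_ip fields below are assumed additions to spinlock_t
for this experiment, not mainline code:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	unsigned long long t1, t2;

	local_irq_disable();
	rdtscll(t1);
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
	rdtscll(t2);
	if (lock->spin_time < (t2 - t1)) {
		lock->spin_time = t2 - t1;
		lock->worst_waiter = _RET_IP_;	/* call site that spun longest */
	}
	lock->owner_ip = _RET_IP_;		/* call site currently holding the lock */
}

Dumping lock->owner_ip (e.g. from the softlockup or sysrq path) then points
at the critical section that was being held while everyone else spun.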