Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 01:20:23PM -0800, Andrew Morton wrote:
> Seeing the code helps.

But there was a subtle problem with the hold time instrumentation here.
The code assumed that a critical section exiting through spin_unlock_irq()
entered the critical section with spin_lock_irq(), but that is not always
the case, and the hold time instrumentation goes bad when that happens
(as in shrink_inactive_list).

> > The instrumentation goes like this:
> >
> > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > {
> > 	unsigned long long t1, t2;
> > 	local_irq_disable();
> > 	t1 = get_cycles_sync();
> > 	preempt_disable();
> > 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > 	_raw_spin_lock(lock);
> > 	t2 = get_cycles_sync();
> > 	lock->raw_lock.htsc = t2;
> > 	if (lock->spin_time < (t2 - t1))
> > 		lock->spin_time = t2 - t1;
> > }
> > ...
> >
> > void __lockfunc _spin_unlock_irq(spinlock_t *lock)
> > {
> > 	unsigned long long t1;
> > 	spin_release(&lock->dep_map, 1, _RET_IP_);
> > 	t1 = get_cycles_sync();
> > 	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
> > 		lock->cs_time = t1 - lock->raw_lock.htsc;
> > 	_raw_spin_unlock(lock);
> > 	local_irq_enable();
> > 	preempt_enable();
> > }
> > ...
>
> OK, now we need to do a dump_stack() each time we discover a new max hold
> time. That might be a bit tricky: the printk code does spinlocking too so
> things could go recursively deadlocky. Maybe make spin_unlock_irq() return
> the hold time then do:

What I found after fixing the above is that the hold time is not bad --
249461 cycles on the 2.6 GHz Opteron with powernow disabled in the BIOS.
The spin time is still on the order of seconds. Hence this looks like a
hardware fairness issue.

Attaching the instrumentation patch with this email FR.
Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h
===================================================================
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock.h	2007-01-14 22:36:46.694248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h	2007-01-15 15:40:36.554248000 -0800
@@ -6,6 +6,18 @@
 #include <asm/page.h>
 #include <asm/processor.h>
 
+/* Like get_cycles, but make sure the CPU is synchronized. */
+static inline unsigned long long get_cycles_sync2(void)
+{
+	unsigned long long ret;
+	unsigned eax;
+	/* Don't do an additional sync on CPUs where we know
+	   RDTSC is already synchronous. */
+	alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
+		       "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
+	rdtscll(ret);
+	return ret;
+}
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -34,6 +46,7 @@ static inline void __raw_spin_lock(raw_s
 		"jle 3b\n\t"
 		"jmp 1b\n"
 		"2:\t" : "=m" (lock->slock) : : "memory");
+	lock->htsc = get_cycles_sync2();
 }
 
 /*
@@ -62,6 +75,7 @@ static inline void __raw_spin_lock_flags
 		"jmp 4b\n"
 		"5:\n\t"
 		: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+	lock->htsc = get_cycles_sync2();
 }
 #endif
 
@@ -74,11 +88,16 @@ static inline int __raw_spin_trylock(raw
 		:"=q" (oldval), "=m" (lock->slock)
 		:"0" (0) : "memory");
 
+	if (oldval)
+		lock->htsc = get_cycles_sync2();
 	return oldval > 0;
 }
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
+	unsigned long long t = get_cycles_sync2();
+	if (lock->hold_time < t - lock->htsc)
+		lock->hold_time = t - lock->htsc;
 	asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
 }
 
Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h
===================================================================
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock_types.h	2007-01-14 22:36:46.714248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h	2007-01-15 14:23:37.204248000 -0800
@@ -7,9 +7,11 @@
 
 typedef struct {
 	unsigned int slock;
+	unsigned long long hold_time;
+	unsigned long long htsc;
 } raw_spinlock_t;
 
-#define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED	{ 1,0,0 }
 
 typedef struct {
 	unsigned int lock;
 
Index: linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h
===================================================================
--- linux-2.6.20-rc4.spin_instru.orig/include/linux/spinlock.h	2007-01-14 22:36:48.464248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h	2007-01-14 22:41:30.964248000 -0800
@@ -231,8 +231,8 @@ do { \
 # define spin_unlock(lock)	__raw_spin_unlock(&(lock)->raw_lock)
 # define read_unlock(lock)	__raw_read_unlock(&(lock)->raw_lock)
 # define
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, 13 Jan 2007 11:53:34 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:

> On Sat, Jan 13, 2007 at 12:00:17AM -0800, Andrew Morton wrote:
> > > On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > >
> > > > > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > > > > {
> > > > > 	local_irq_disable();
> > > > > 	rdtsc(t1);
> > > > > 	preempt_disable();
> > > > > 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > > > > 	_raw_spin_lock(lock);
> > > > > 	rdtsc(t2);
> > > > > 	if (lock->spin_time < (t2 - t1))
> > > > > 		lock->spin_time = t2 - t1;
> > > > > }
> > > > >
> > > > > On some runs, we found that the zone->lru_lock spun for 33 seconds or
> > > > > more while the maximal CS time was 3 seconds or so.
> > > >
> > > > What is the "CS time"?
> > >
> > > Critical Section :). This is the maximal time interval I measured from
> > > t2 above to the time point we release the spin lock. This is the hold
> > > time I guess.
> >
> > By no means. The theory here is that CPUA is taking and releasing the
> > lock at high frequency, but CPUB never manages to get in and take it. In
> > which case the maximum-acquisition-time is much larger than the
> > maximum-hold-time.
> >
> > I'd suggest that you use a similar trick to measure the maximum hold time:
> > start the timer after we got the lock, stop it just before we release the
> > lock (assuming that the additional rdtsc delay doesn't "fix" things, of
> > course...)
>
> Well, that is exactly what I described above as CS time.

Seeing the code helps.

> The instrumentation goes like this:
>
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
> 	unsigned long long t1, t2;
> 	local_irq_disable();
> 	t1 = get_cycles_sync();
> 	preempt_disable();
> 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> 	_raw_spin_lock(lock);
> 	t2 = get_cycles_sync();
> 	lock->raw_lock.htsc = t2;
> 	if (lock->spin_time < (t2 - t1))
> 		lock->spin_time = t2 - t1;
> }
> ...
> void __lockfunc _spin_unlock_irq(spinlock_t *lock)
> {
> 	unsigned long long t1;
> 	spin_release(&lock->dep_map, 1, _RET_IP_);
> 	t1 = get_cycles_sync();
> 	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
> 		lock->cs_time = t1 - lock->raw_lock.htsc;
> 	_raw_spin_unlock(lock);
> 	local_irq_enable();
> 	preempt_enable();
> }
>
> Am I missing something? Is this not what you just described? (The
> synchronizing rdtsc might not really be required at all locations, but I
> doubt it would contribute a significant fraction to the 33s or even
> the 3s hold time on a 2.6 GHz Opteron.)

OK, now we need to do a dump_stack() each time we discover a new max hold
time. That might be a bit tricky: the printk code does spinlocking too, so
things could go recursively deadlocky. Maybe make spin_unlock_irq() return
the hold time, then do:

void lru_spin_unlock_irq(struct zone *zone)
{
	long this_time;

	this_time = spin_unlock_irq(&zone->lru_lock);
	if (this_time > zone->max_time) {
		zone->max_time = this_time;
		dump_stack();
	}
}

or similar.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 12:00:17AM -0800, Andrew Morton wrote:
> > On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> >
> > > > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > > > {
> > > > 	local_irq_disable();
> > > > 	rdtsc(t1);
> > > > 	preempt_disable();
> > > > 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > > > 	_raw_spin_lock(lock);
> > > > 	rdtsc(t2);
> > > > 	if (lock->spin_time < (t2 - t1))
> > > > 		lock->spin_time = t2 - t1;
> > > > }
> > > >
> > > > On some runs, we found that the zone->lru_lock spun for 33 seconds or
> > > > more while the maximal CS time was 3 seconds or so.
> > >
> > > What is the "CS time"?
> >
> > Critical Section :). This is the maximal time interval I measured from
> > t2 above to the time point we release the spin lock. This is the hold
> > time I guess.
>
> By no means. The theory here is that CPUA is taking and releasing the
> lock at high frequency, but CPUB never manages to get in and take it. In
> which case the maximum-acquisition-time is much larger than the
> maximum-hold-time.
>
> I'd suggest that you use a similar trick to measure the maximum hold time:
> start the timer after we got the lock, stop it just before we release the
> lock (assuming that the additional rdtsc delay doesn't "fix" things, of
> course...)

Well, that is exactly what I described above as CS time. The
instrumentation goes like this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	unsigned long long t1, t2;
	local_irq_disable();
	t1 = get_cycles_sync();
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
	t2 = get_cycles_sync();
	lock->raw_lock.htsc = t2;
	if (lock->spin_time < (t2 - t1))
		lock->spin_time = t2 - t1;
}
...
void __lockfunc _spin_unlock_irq(spinlock_t *lock)
{
	unsigned long long t1;
	spin_release(&lock->dep_map, 1, _RET_IP_);
	t1 = get_cycles_sync();
	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
		lock->cs_time = t1 - lock->raw_lock.htsc;
	_raw_spin_unlock(lock);
	local_irq_enable();
	preempt_enable();
}

Am I missing something? Is this not what you just described? (The
synchronizing rdtsc might not really be required at all locations, but I
doubt it would contribute a significant fraction to the 33s or even the
3s hold time on a 2.6 GHz Opteron.)
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:

> On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> > Ravikiran G Thirumalai wrote:
> > > Hi,
> > > We noticed high interrupt hold off times while running some memory
> > > intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed
> > > softlockups,
> >
> > [...]
> >
> > > We did not use any lock debugging options and used plain old rdtsc to
> > > measure cycles. (We disable cpu freq scaling in the BIOS). All we did
> > > was this:
> > >
> > > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > > {
> > > 	local_irq_disable();
> > > 	rdtsc(t1);
> > > 	preempt_disable();
> > > 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > > 	_raw_spin_lock(lock);
> > > 	rdtsc(t2);
> > > 	if (lock->spin_time < (t2 - t1))
> > > 		lock->spin_time = t2 - t1;
> > > }
> > >
> > > On some runs, we found that the zone->lru_lock spun for 33 seconds or
> > > more while the maximal CS time was 3 seconds or so.
> >
> > What is the "CS time"?
>
> Critical Section :). This is the maximal time interval I measured from
> t2 above to the time point we release the spin lock. This is the hold
> time I guess.

By no means. The theory here is that CPUA is taking and releasing the
lock at high frequency, but CPUB never manages to get in and take it. In
which case the maximum-acquisition-time is much larger than the
maximum-hold-time.

I'd suggest that you use a similar trick to measure the maximum hold time:
start the timer after we got the lock, stop it just before we release the
lock (assuming that the additional rdtsc delay doesn't "fix" things, of
course...)
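[Editorial note: Andrew's distinction between maximum acquisition time and maximum hold time can be made concrete with a toy deterministic model. The struct and function names below are invented for the sketch; it is purely illustrative, not kernel code.]

```c
#include <assert.h>

/* Toy model of the starvation scenario: CPU A takes and releases the
 * lock `rounds` times in a row, always winning the race, while CPU B
 * has been waiting since t=0 and only acquires the lock at the end.
 * The maximum *hold* time stays at one critical-section length; the
 * maximum *acquisition* time grows with the number of rounds. */
struct lock_stats {
	unsigned long long max_hold;	/* longest single hold     */
	unsigned long long max_wait;	/* longest wait to acquire */
};

static void simulate_unfair(struct lock_stats *s,
			    unsigned long long hold_len, unsigned rounds)
{
	unsigned long long now = 0;
	const unsigned long long b_wait_start = 0; /* B waits from t=0 */

	for (unsigned i = 0; i < rounds; i++) {
		now += hold_len;		/* A holds, then releases */
		if (s->max_hold < hold_len)
			s->max_hold = hold_len;
	}
	/* B finally gets the lock: it waited through all of A's holds. */
	if (s->max_wait < now - b_wait_start)
		s->max_wait = now - b_wait_start;
}
```

With a 3-unit hold and eleven consecutive wins for CPU A, the model reports a max hold of 3 and a max wait of 33 -- the same shape as the 3s hold / 33s spin numbers reported in this thread, with no single long hold anywhere.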
Re: High lock spin time for zone->lru_lock under extreme conditions
Ravikiran G Thirumalai wrote:
> On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> > What is the "CS time"?
>
> Critical Section :). This is the maximal time interval I measured from
> t2 above to the time point we release the spin lock. This is the hold
> time I guess.
>
> > It would be interesting to know how long the maximal lru_lock *hold*
> > time is, which could give us a better indication of whether it is a
> > hardware problem.
> >
> > For example, if the maximum hold time is 10ms, then it might indicate
> > a hardware fairness problem.
>
> The maximal hold time was about 3s.

Well then it doesn't seem very surprising that this could cause a 30s wait
time for one CPU in a 16 core system, regardless of fairness.

I guess most of the contention, and the lock hold times, are coming from
vmscan? Do you know exactly which critical sections are the culprits?

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 05:11:16PM -0800, Andrew Morton wrote:
> On Fri, 12 Jan 2007 17:00:39 -0800
> Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
>
> > But whether lru_lock is an issue is another question.
>
> I doubt it, although there might be changes we can make in there to
> work around it.

I tested with PAGEVEC_SIZE defined to 62 and 126 -- no difference. I
still notice the atrociously high spin times.

Thanks,
Kiran
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> Ravikiran G Thirumalai wrote:
> > Hi,
> > We noticed high interrupt hold off times while running some memory
> > intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed
> > softlockups,
>
> [...]
>
> > We did not use any lock debugging options and used plain old rdtsc to
> > measure cycles. (We disable cpu freq scaling in the BIOS). All we did
> > was this:
> >
> > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > {
> > 	local_irq_disable();
> > 	rdtsc(t1);
> > 	preempt_disable();
> > 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > 	_raw_spin_lock(lock);
> > 	rdtsc(t2);
> > 	if (lock->spin_time < (t2 - t1))
> > 		lock->spin_time = t2 - t1;
> > }
> >
> > On some runs, we found that the zone->lru_lock spun for 33 seconds or
> > more while the maximal CS time was 3 seconds or so.
>
> What is the "CS time"?

Critical Section :). This is the maximal time interval I measured from
t2 above to the time point we release the spin lock. This is the hold
time I guess.

> It would be interesting to know how long the maximal lru_lock *hold* time
> is, which could give us a better indication of whether it is a hardware
> problem.
>
> For example, if the maximum hold time is 10ms, then it might indicate a
> hardware fairness problem.

The maximal hold time was about 3s.
Re: High lock spin time for zone->lru_lock under extreme conditions
Ravikiran G Thirumalai wrote:
> Hi,
> We noticed high interrupt hold off times while running some memory
> intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed
> softlockups,

[...]

> We did not use any lock debugging options and used plain old rdtsc to
> measure cycles. (We disable cpu freq scaling in the BIOS). All we did
> was this:
>
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
> 	local_irq_disable();
> 	rdtsc(t1);
> 	preempt_disable();
> 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> 	_raw_spin_lock(lock);
> 	rdtsc(t2);
> 	if (lock->spin_time < (t2 - t1))
> 		lock->spin_time = t2 - t1;
> }
>
> On some runs, we found that the zone->lru_lock spun for 33 seconds or
> more while the maximal CS time was 3 seconds or so.

What is the "CS time"?

It would be interesting to know how long the maximal lru_lock *hold* time
is, which could give us a better indication of whether it is a hardware
problem.

For example, if the maximum hold time is 10ms, then it might indicate a
hardware fairness problem.

--
SUSE Labs, Novell Inc.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, 12 Jan 2007 17:00:39 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:

> But whether lru_lock is an issue is another question.

I doubt it, although there might be changes we can make in there to
work around it.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 01:45:43PM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
>
> Moreover, most atomic operations are to remote memory, which is also
> increasing the problem by making the atomic ops take longer. Typically,
> mature NUMA systems have implemented hardware provisions that can deal
> with such high degrees of contention. If this is simply an SMP system
> that was turned into a NUMA box, then this is a new hardware scenario
> for the engineers.

This is using HT (HyperTransport) as all AMD systems do, but this is one
of the 8 socket systems. I ran the same test on a 2 node Tyan AMD box and
did not notice the atrocious spin times. It would be interesting to see
how a 4 socket HT box would fare. Unfortunately, I do not have access to
one. If someone has access to such a box, I can provide the test case and
instrumentation patches.

It could very well be a hardware limitation in this case, which means all
the more reason to enable interrupts while spinning on spin locks. But
whether lru_lock is an issue is another question.

Thanks,
Kiran
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:

> > Does the system scale the right way if you stay within the bounds of
> > node memory? I.e. allocate 1.5GB from each process?
>
> Yes. We see problems only when we oversubscribe memory.

Ok, in that case we can have more than 2 processors trying to acquire the
same zone lock. If they have all exhausted their node-local memory and are
all going off node, then all processors may be hitting the last node that
has some memory left, which will cause a very high degree of contention.
Moreover, most atomic operations are to remote memory, which is also
increasing the problem by making the atomic ops take longer. Typically,
mature NUMA systems have implemented hardware provisions that can deal
with such high degrees of contention. If this is simply an SMP system that
was turned into a NUMA box, then this is a new hardware scenario for the
engineers.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 11:46:22AM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
>
> > The test was simple: we have 16 processes, each allocating 3.5G of
> > memory and touching each and every page and returning. Each process is
> > bound to a node (socket), with the local node being the preferred node
> > for allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node).
> > Each socket has 4G of physical memory and there are two cores on each
> > socket. On start of the test, the machine becomes unresponsive after
> > some time and prints out softlockup and OOM messages. We then found the
> > cause of the softlockups to be the excessive spin times on the zone
> > lru_lock. The fact that spin_lock_irq disables interrupts while spinning
> > made matters very bad. We instrumented the spin_lock_irq code and found
> > that the spin time on the lru locks was on the order of a few seconds
> > (tens of seconds at times) and the hold time was comparatively lesser.
>
> So the issue is two processes contending on the zone lock for one node?
> You are overallocating the 4G node with two processes attempting to
> allocate 7.5GB? So we go off node for 3.5G of the allocation?

Yes.

> Does the system scale the right way if you stay within the bounds of node
> memory? I.e. allocate 1.5GB from each process?

Yes. We see problems only when we oversubscribe memory.

> Have you tried increasing the size of the per cpu caches in
> /proc/sys/vm/percpu_pagelist_fraction?

No, not yet. I can give it a try.

> > While the softlockups and the like went away by enabling interrupts
> > during spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29,
> > Andi thought maybe this is exposing a problem with zone->lru_locks and
> > hence warrants a discussion on lkml, hence this post. Are there any
> > plans/patches/ideas to address the spin time under such extreme
> > conditions?
>
> Could this be a hardware problem? Some issue with atomic ops in the
> Sun hardware?

I think that is unlikely -- because when we do not oversubscribe memory,
the tests complete quickly without softlockups and the like. Peter has
also noticed this (presumably on different hardware). I would think this
could also be a locking unfairness case (CPUs of the same node getting the
lock and starving out other nodes) under extreme contention.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, 12 Jan 2007 11:46:22 -0800 (PST) Christoph Lameter <[EMAIL PROTECTED]> wrote:
> > While the softlockups and the like went away by enabling interrupts
> > during spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> > Andi thought maybe this is exposing a problem with zone->lru_locks and
> > hence warrants a discussion on lkml, hence this post. Are there any
> > plans/patches/ideas to address the spin time under such extreme
> > conditions?
>
> Could this be a hardware problem? Some issue with atomic ops in the
> Sun hardware?

I'd assume so. We don't hold lru_lock for 33 seconds ;)

Probably similar symptoms are demonstrable using other locks, if a
suitable workload is chosen.

Increasing PAGEVEC_SIZE might help. But we do allocate those things on
the stack.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
> The test was simple, we have 16 processes, each allocating 3.5G of
> memory and touching each and every page and returning. Each process is
> bound to a node (socket), with the local node being the preferred node
> for allocation (numactl --cpubind=$node ./numa-membomb
> --preferred=$node). Each socket has 4G of physical memory and there are
> two cores on each socket. On start of the test, the machine becomes
> unresponsive after some time and prints out softlockup and OOM messages.
> We then found the cause of the softlockups to be the excessive spin
> times on the zone lru_lock. The fact that spin_lock_irq disables
> interrupts while spinning made matters very bad. We instrumented the
> spin_lock_irq code and found that the spin time on the lru locks was on
> the order of a few seconds (tens of seconds at times) and the hold time
> was comparatively less.

So the issue is two processes contending on the zone lock for one node?
You are overallocating the 4G node with two processes attempting to
allocate 7.5GB? So we go off node for 3.5G of the allocation?

Does the system scale the right way if you stay within the bounds of node
memory? I.e. allocate 1.5GB from each process?

Have you tried increasing the size of the per cpu caches in
/proc/sys/vm/percpu_pagelist_fraction?

> While the softlockups and the like went away by enabling interrupts
> during spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> Andi thought maybe this is exposing a problem with zone->lru_locks and
> hence warrants a discussion on lkml, hence this post. Are there any
> plans/patches/ideas to address the spin time under such extreme
> conditions?

Could this be a hardware problem? Some issue with atomic ops in the
Sun hardware?
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, 2007-01-12 at 08:01 -0800, Ravikiran G Thirumalai wrote:
> Hi,
> We noticed high interrupt hold off times while running some memory
> intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed
> softlockups, lost ticks and even wall time drifting (which is probably a
> bug in the x86_64 timer subsystem).
>
> The test was simple, we have 16 processes, each allocating 3.5G of
> memory and touching each and every page and returning. Each process is
> bound to a node (socket), with the local node being the preferred node
> for allocation (numactl --cpubind=$node ./numa-membomb
> --preferred=$node). Each socket has 4G of physical memory and there are
> two cores on each socket. On start of the test, the machine becomes
> unresponsive after some time and prints out softlockup and OOM messages.
> We then found the cause of the softlockups to be the excessive spin
> times on the zone lru_lock. The fact that spin_lock_irq disables
> interrupts while spinning made matters very bad. We instrumented the
> spin_lock_irq code and found that the spin time on the lru locks was on
> the order of a few seconds (tens of seconds at times) and the hold time
> was comparatively less.
>
> We did not use any lock debugging options and used plain old rdtsc to
> measure cycles. (We disable cpu freq scaling in the BIOS). All we did
> was this:
>
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
> 	local_irq_disable();
>
> 	rdtsc(t1);
> 	preempt_disable();
> 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> 	_raw_spin_lock(lock);
>
> 	rdtsc(t2);
> 	if (lock->spin_time < (t2 - t1))
> 		lock->spin_time = t2 - t1;
> }
>
> On some runs, we found that the zone->lru_lock spun for 33 seconds or
> more while the maximal CS time was 3 seconds or so.
>
> While the softlockups and the like went away by enabling interrupts
> during spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> Andi thought maybe this is exposing a problem with zone->lru_locks and
> hence warrants a discussion on lkml, hence this post. Are there any
> plans/patches/ideas to address the spin time under such extreme
> conditions?
>
> I will be happy to provide any additional information (config/dmesg/test
> case) if needed.

I have been tinkering with this because -rt shows similar issues. Find
below the patch so far; it works on UP, but it still went *boom* the last
time I tried an actual SMP box. So take this patch only as an indication
of the direction I'm working in.

One concern I have with the taken approach is cacheline bouncing. Perhaps
I should retain some form of per-cpu data structure.

---
Subject: mm: streamline zone->lock acquisition on lru_cache_add

By buffering the lru pages on a per cpu basis, the flush of that buffer
is prone to bounce around zones. Furthermore, release_pages can also
acquire the zone->lock. Streamline all this by replacing the per cpu
buffer with a per zone lockless buffer. Once the buffer is filled, flush
it and perform all needed operations under one lock acquisition.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/linux/mmzone.h |   12 +++
 mm/internal.h          |    2 
 mm/page_alloc.c        |   21 ++
 mm/swap.c              |  169 +
 4 files changed, 149 insertions(+), 55 deletions(-)

Index: linux-2.6-rt/include/linux/mmzone.h
===================================================================
--- linux-2.6-rt.orig/include/linux/mmzone.h	2007-01-11 16:27:08.0 +0100
+++ linux-2.6-rt/include/linux/mmzone.h	2007-01-11 16:32:08.0 +0100
@@ -153,6 +153,17 @@ enum zone_type {
 #define ZONES_SHIFT 2
 #endif
 
+/*
+ * must be power of 2 to avoid wrap around artifacts
+ */
+#define PAGEBUF_SIZE 32
+
+struct pagebuf {
+	atomic_t head;
+	atomic_t tail;
+	struct page *pages[PAGEBUF_SIZE];
+};
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 	unsigned long free_pages;
@@ -188,6 +199,7 @@ struct zone {
 #endif
 	struct free_area	free_area[MAX_ORDER];
 
+	struct pagebuf pagebuf;
 
 	ZONE_PADDING(_pad1_)

Index: linux-2.6-rt/mm/swap.c
===================================================================
--- linux-2.6-rt.orig/mm/swap.c	2007-01-11 16:27:08.0 +0100
+++ linux-2.6-rt/mm/swap.c	2007-01-11 16:36:34.0 +0100
@@ -31,6 +31,8 @@
 #include <linux/notifier.h>
 #include <linux/init.h>
 
+#include "internal.h"
+
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
@@ -170,49 +172,131 @@ void fastcall mark_page_accessed(struct
 EXPORT_SYMBOL(mark_page_accessed);
 
+static int __pagebuf_add(struct zone *zone, struct page *page)
+{
+	BUG_ON(page_zone(page) != zone);
+
+	switch (page_count(page)) {
+	case 0:
+		BUG();
+
+	case 1:
+		/*
High lock spin time for zone->lru_lock under extreme conditions
Hi,
We noticed high interrupt hold off times while running some memory
intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed
softlockups, lost ticks and even wall time drifting (which is probably a
bug in the x86_64 timer subsystem).

The test was simple, we have 16 processes, each allocating 3.5G of memory
and touching each and every page and returning. Each process is bound to
a node (socket), with the local node being the preferred node for
allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node).
Each socket has 4G of physical memory and there are two cores on each
socket. On start of the test, the machine becomes unresponsive after some
time and prints out softlockup and OOM messages. We then found the cause
of the softlockups to be the excessive spin times on the zone lru_lock.
The fact that spin_lock_irq disables interrupts while spinning made
matters very bad. We instrumented the spin_lock_irq code and found that
the spin time on the lru locks was on the order of a few seconds (tens of
seconds at times) and the hold time was comparatively less.

We did not use any lock debugging options and used plain old rdtsc to
measure cycles. (We disable cpu freq scaling in the BIOS). All we did was
this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	local_irq_disable();

	rdtsc(t1);
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);

	rdtsc(t2);
	if (lock->spin_time < (t2 - t1))
		lock->spin_time = t2 - t1;
}

On some runs, we found that the zone->lru_lock spun for 33 seconds or
more while the maximal CS time was 3 seconds or so.

While the softlockups and the like went away by enabling interrupts
during spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
Andi thought maybe this is exposing a problem with zone->lru_locks and
hence warrants a discussion on lkml, hence this post. Are there any
plans/patches/ideas to address the spin time under such extreme
conditions?

I will be happy to provide any additional information (config/dmesg/test
case) if needed.

Thanks,
Kiran
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
> > Does the system scale the right way if you stay within the bounds of
> > node memory? I.e. allocate 1.5GB from each process?
>
> Yes. We see problems only when we oversubscribe memory.

OK, in that case we can have more than 2 processors trying to acquire the
same zone lock. If they have all exhausted their node local memory and
are all going off node, then all processors may be hitting the last node
that has some memory left, which will cause a very high degree of
contention. Moreover, most atomic operations are to remote memory, which
also increases the problem by making the atomic ops take longer.

Typically, mature NUMA systems have implemented hardware provisions that
can deal with such high degrees of contention. If this is simply an SMP
system that was turned into a NUMA box, then this is a new hardware
scenario for the engineers.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 01:45:43PM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
>
> Moreover, most atomic operations are to remote memory, which also
> increases the problem by making the atomic ops take longer. Typically,
> mature NUMA systems have implemented hardware provisions that can deal
> with such high degrees of contention. If this is simply an SMP system
> that was turned into a NUMA box, then this is a new hardware scenario
> for the engineers.

This is using HT as all AMD systems do, but this is one of the 8 socket
systems. I ran the same test on a 2 node Tyan AMD box and did not notice
the atrocious spin times. It would be interesting to see how a 4 socket
HT box would fare. Unfortunately, I do not have access to one. If someone
has access to such a box, I can provide the test case and instrumentation
patches.

It could very well be a hardware limitation in this case, which means all
the more reason to enable interrupts while spinning on spin locks. But
whether lru_lock is an issue is another question.

Thanks,
Kiran
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, 12 Jan 2007 17:00:39 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> But whether lru_lock is an issue is another question.

I doubt it, although there might be changes we can make in there to work
around it. <mentions PAGEVEC_SIZE again>
Re: High lock spin time for zone->lru_lock under extreme conditions
Ravikiran G Thirumalai wrote:
> Hi,
> We noticed high interrupt hold off times while running some memory
> intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed
> softlockups,

[...]

> We did not use any lock debugging options and used plain old rdtsc to
> measure cycles. (We disable cpu freq scaling in the BIOS). All we did
> was this:
>
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
> 	local_irq_disable();
>
> 	rdtsc(t1);
> 	preempt_disable();
> 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> 	_raw_spin_lock(lock);
>
> 	rdtsc(t2);
> 	if (lock->spin_time < (t2 - t1))
> 		lock->spin_time = t2 - t1;
> }
>
> On some runs, we found that the zone->lru_lock spun for 33 seconds or
> more while the maximal CS time was 3 seconds or so.

What is the CS time? It would be interesting to know how long the maximal
lru_lock *hold* time is, which could give us a better indication of
whether it is a hardware problem. For example, if the maximum hold time
is 10ms, then it might indicate a hardware fairness problem.

--
SUSE Labs, Novell Inc.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> Ravikiran G Thirumalai wrote:
> > Hi,
> > We noticed high interrupt hold off times while running some memory
> > intensive tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed
> > softlockups,
>
> [...]
>
> > We did not use any lock debugging options and used plain old rdtsc to
> > measure cycles. (We disable cpu freq scaling in the BIOS). All we did
> > was this:
> >
> > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > {
> > 	local_irq_disable();
> >
> > 	rdtsc(t1);
> > 	preempt_disable();
> > 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > 	_raw_spin_lock(lock);
> >
> > 	rdtsc(t2);
> > 	if (lock->spin_time < (t2 - t1))
> > 		lock->spin_time = t2 - t1;
> > }
> >
> > On some runs, we found that the zone->lru_lock spun for 33 seconds or
> > more while the maximal CS time was 3 seconds or so.
>
> What is the CS time?

Critical Section :). This is the maximal time interval I measured from t2
above to the time point we release the spin lock. This is the hold time,
I guess.

> It would be interesting to know how long the maximal lru_lock *hold*
> time is, which could give us a better indication of whether it is a
> hardware problem. For example, if the maximum hold time is 10ms, then it
> might indicate a hardware fairness problem.

The maximal hold time was about 3s.
Re: High lock spin time for zone->lru_lock under extreme conditions
On Fri, Jan 12, 2007 at 05:11:16PM -0800, Andrew Morton wrote:
> On Fri, 12 Jan 2007 17:00:39 -0800 Ravikiran G Thirumalai <[EMAIL PROTECTED]> wrote:
> > But whether lru_lock is an issue is another question.
>
> I doubt it, although there might be changes we can make in there to work
> around it. <mentions PAGEVEC_SIZE again>

I tested with PAGEVEC_SIZE defined to 62 and 126 -- no difference. I
still notice the atrociously high spin times.

Thanks,
Kiran
Re: High lock spin time for zone->lru_lock under extreme conditions
Ravikiran G Thirumalai wrote:
> On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> > What is the CS time?
>
> Critical Section :). This is the maximal time interval I measured from
> t2 above to the time point we release the spin lock. This is the hold
> time, I guess.
>
> > It would be interesting to know how long the maximal lru_lock *hold*
> > time is, which could give us a better indication of whether it is a
> > hardware problem. For example, if the maximum hold time is 10ms, then
> > it might indicate a hardware fairness problem.
>
> The maximal hold time was about 3s.

Well then it doesn't seem very surprising that this could cause a 30s
wait time for one CPU in a 16 core system, regardless of fairness.

I guess most of the contention, and the lock hold times, are coming from
vmscan? Do you know exactly which critical sections are the culprits?

--
SUSE Labs, Novell Inc.