Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Thursday 03 January 2008 19:55, Ingo Molnar wrote:
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> > > Have you done anything more with allowing > 256 CPUS in this
> > > spinlock patch? We've been testing with 1k cpus and to verify with
> > > -mm kernel, we need to "unpatch" these spinlock changes.
> >
> > Hi Mike,
> >
> > Actually I had it in my mind that 64 bit used single-byte locking like
> > i386, so I didn't think I'd caused a regression there.
> >
> > I'll take a look at fixing that up now.
>
> thanks - this is a serious showstopper for the ticket spinlock patch.
>
> ( which has otherwise been performing very well in x86.git so far - it
>   has passed a few thousand bootup tests on 64-bit and 32-bit as well,
>   so we are close to it being in a mergeable state. Would be a pity to
>   lose it due to the 256 cpus limit. )

OK, this is what my test harness code looks like for > 256 CPUs
(basically the same as the in-kernel code, but some names etc. are
slightly different). It passes my basic tests, and performance doesn't
seem to have suffered.

I was going to suggest making the <= 256 vs > 256 cases config options,
but maybe we don't need to unless some CPUs are slow at shifts /
rotates? I don't know... After I get comments, I will come up with an
incremental patch against the kernel...

It will be interesting to know whether ticket locks help big SGI
systems.
static inline void xlock(lock_t *lock)
{
	lock_t inc = 0x00010000;
	lock_t tmp;

	__asm__ __volatile__ (
		"lock ; xaddl %0, %1\n"
		"movzwl %w0, %2\n\t"
		"shrl $16, %0\n\t"
		"1:\t"
		"cmpl %0, %2\n\t"
		"je 2f\n\t"
		"rep ; nop\n\t"
		"movzwl %1, %2\n\t"
		/* don't need lfence here, because loads are in-order */
		"jmp 1b\n"
		"2:"
		: "+Q" (inc), "+m" (*lock), "=r" (tmp)
		:
		: "memory", "cc");
}

static inline int xtrylock(lock_t *lock)
{
	lock_t tmp;
	lock_t new;

	asm volatile(
		"movl %2,%0\n\t"
		"movl %0,%1\n\t"
		"roll $16, %0\n\t"
		"cmpl %0,%1\n\t"
		"jne 1f\n\t"
		"addl $0x00010000, %1\n\t"
		"lock ; cmpxchgl %1,%2\n\t"
		"1:"
		"sete %b1\n\t"
		"movzbl %b1,%0\n\t"
		: "=a" (tmp), "=r" (new), "+m" (*lock)
		:
		: "memory", "cc");
	return tmp;
}

static inline void xunlock(lock_t *lock)
{
	__asm__ __volatile__(
		"incw %0"
		: "+m" (*lock)
		:
		: "memory", "cc");
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
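For readers without x86 asm at hand, the same head/tail ticket scheme can be sketched in portable C11 atomics (a hypothetical sketch, not the in-kernel code): one 16-bit counter hands out tickets like the xaddl, the other says which ticket is being served and is bumped on unlock like the incw. Using two separate 16-bit counters keeps the wraparound behaviour of the asm without the carry into the high half that a plain 32-bit add would cause.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical portable sketch of the ticket scheme above.
 * "next" hands out tickets (the xaddl on the high half);
 * "owner" is "now serving" (bumped by the incw in xunlock).
 * Both counters wrap naturally at 16 bits. */
typedef struct {
	_Atomic uint16_t next;   /* next ticket to hand out */
	_Atomic uint16_t owner;  /* ticket currently holding the lock */
} ticket_lock_t;

static void tlock(ticket_lock_t *lock)
{
	/* take a ticket: fetch-and-add returns the old value */
	uint16_t my_ticket = atomic_fetch_add_explicit(&lock->next, 1,
						       memory_order_relaxed);
	/* spin until "now serving" reaches our ticket */
	while (atomic_load_explicit(&lock->owner,
				    memory_order_acquire) != my_ticket)
		;
}

static void tunlock(ticket_lock_t *lock)
{
	/* equivalent of incw: advance "now serving" to the next waiter */
	atomic_fetch_add_explicit(&lock->owner, 1, memory_order_release);
}
```

Because each acquisition gets a strictly increasing ticket, waiters are served in FIFO order, which is the fairness property the whole patch is after.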
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
* Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Have you done anything more with allowing > 256 CPUS in this
> > spinlock patch? We've been testing with 1k cpus and to verify with
> > -mm kernel, we need to "unpatch" these spinlock changes.
>
> Hi Mike,
>
> Actually I had it in my mind that 64 bit used single-byte locking like
> i386, so I didn't think I'd caused a regression there.
>
> I'll take a look at fixing that up now.

thanks - this is a serious showstopper for the ticket spinlock patch.

( which has otherwise been performing very well in x86.git so far - it
  has passed a few thousand bootup tests on 64-bit and 32-bit as well,
  so we are close to it being in a mergeable state. Would be a pity to
  lose it due to the 256 cpus limit. )

	Ingo
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Thursday 03 January 2008 10:35, Mike Travis wrote:
> Hi Nick,
>
> Have you done anything more with allowing > 256 CPUS in this spinlock
> patch? We've been testing with 1k cpus and to verify with -mm kernel,
> we need to "unpatch" these spinlock changes.
>
> Thanks,
> Mike

Hi Mike,

Actually I had it in my mind that 64 bit used single-byte locking like
i386, so I didn't think I'd caused a regression there.

I'll take a look at fixing that up now.

Thanks,
Nick
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
Hi Nick,

Have you done anything more with allowing > 256 CPUS in this spinlock
patch? We've been testing with 1k cpus and to verify with -mm kernel,
we need to "unpatch" these spinlock changes.

Thanks,
Mike

Nick Piggin wrote:
> On Thursday 20 December 2007 18:04, Christoph Lameter wrote:
>>> The only reason the x86 ticket locks have the 256 CPU limit is that
>>> if they go any bigger, we can't use the partial registers so would
>>> have to have a few more instructions.
>>
>> x86_64 is going up to 4k or 16k cpus soon for our new hardware.
>>
>>> A 32 bit spinlock would allow 64K cpus (ticket lock has 2 counters,
>>> each would be 16 bits). And it would actually shrink the spinlock in
>>> the case of preempt kernels too (because it would no longer have the
>>> lockbreak field).
>>>
>>> And yes, I'll go out on a limb and say that 64k CPUs ought to be
>>> enough for anyone ;)
>>
>> I think those things need a timeframe applied to it. That's likely
>> going to be true for the next 3 years (optimistic assessment ;-)).
>
> Yeah, that was tongue in cheek ;)
>
>> Could you go to 32bit spinlock by default?
>
> On x86, the size of the ticket locks is 32 bit, simply because I didn't
> want to risk possible alignment bugs (a subsequent patch cuts it down to
> 16 bits, but this is a much smaller win than 64->32 in general because
> of natural alignment of types).
>
> Note that the ticket locks still support twice the number as the old
> spinlocks, so I'm not causing a regression here... but yes, increasing
> the size further will require an extra instruction or two.
>
>> How about NUMA awareness for the spinlocks? Larger backoff periods for
>> off node lock contentions please.
>
> ticket locks can naturally tell you how many waiters there are, and how
> many waiters are in front of you, so it is really nice for doing backoff
> (eg. you can adapt the backoff *very* nicely depending on how many are in
> front of you, and how quickly you are moving toward the front).
>
> Also, since I got rid of the ->break_lock field, you could use that space
> perhaps to add a cpu # of the lock holder for even more backoff context
> (if you find that helps).
>
> Anyway, I didn't do any of that because it obviously needs someone with
> real hardware in order to tune it properly.
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Thursday 20 December 2007 18:04, Christoph Lameter wrote:
> > The only reason the x86 ticket locks have the 256 CPU limit is that
> > if they go any bigger, we can't use the partial registers so would
> > have to have a few more instructions.
>
> x86_64 is going up to 4k or 16k cpus soon for our new hardware.
>
> > A 32 bit spinlock would allow 64K cpus (ticket lock has 2 counters,
> > each would be 16 bits). And it would actually shrink the spinlock in
> > the case of preempt kernels too (because it would no longer have the
> > lockbreak field).
> >
> > And yes, I'll go out on a limb and say that 64k CPUs ought to be
> > enough for anyone ;)
>
> I think those things need a timeframe applied to it. That's likely
> going to be true for the next 3 years (optimistic assessment ;-)).

Yeah, that was tongue in cheek ;)

> Could you go to 32bit spinlock by default?

On x86, the size of the ticket locks is 32 bit, simply because I didn't
want to risk possible alignment bugs (a subsequent patch cuts it down to
16 bits, but this is a much smaller win than 64->32 in general because
of natural alignment of types).

Note that the ticket locks still support twice the number as the old
spinlocks, so I'm not causing a regression here... but yes, increasing
the size further will require an extra instruction or two.

> How about NUMA awareness for the spinlocks? Larger backoff periods for
> off node lock contentions please.

ticket locks can naturally tell you how many waiters there are, and how
many waiters are in front of you, so it is really nice for doing backoff
(eg. you can adapt the backoff *very* nicely depending on how many are in
front of you, and how quickly you are moving toward the front).

Also, since I got rid of the ->break_lock field, you could use that space
perhaps to add a cpu # of the lock holder for even more backoff context
(if you find that helps).

Anyway, I didn't do any of that because it obviously needs someone with
real hardware in order to tune it properly.
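A rough sketch of the backoff Nick describes, assuming the 32-bit layout with the owner ticket in the low half and the next ticket in the high half (function names and the scale factor are made up for illustration; the real constants would need tuning on hardware, as he says):

```c
#include <stdint.h>

/* Given a snapshot of a 32-bit ticket lock word (low 16 bits = owner,
 * high 16 bits = next ticket) and our own ticket, count how many
 * waiters are ahead of us in the queue. */
static inline unsigned waiters_ahead(uint32_t lockval, uint16_t my_ticket)
{
	uint16_t owner = (uint16_t)lockval;

	/* 16-bit subtraction handles counter wraparound correctly */
	return (uint16_t)(my_ticket - owner);
}

/* Hypothetical proportional backoff: spin-wait longer the further back
 * in the queue we are. An off-node waiter could scale this up further,
 * which is the NUMA-awareness Christoph asks about. */
static inline unsigned backoff_units(uint32_t lockval, uint16_t my_ticket)
{
	return waiters_ahead(lockval, my_ticket) * 50u; /* 50 is made up */
}
```

The point is that the queue position comes straight out of the two counters; an ordinary test-and-set lock has no such information to base backoff on.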
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
> The only reason the x86 ticket locks have the 256 CPU limit is that
> if they go any bigger, we can't use the partial registers so would
> have to have a few more instructions.

x86_64 is going up to 4k or 16k cpus soon for our new hardware.

> A 32 bit spinlock would allow 64K cpus (ticket lock has 2 counters,
> each would be 16 bits). And it would actually shrink the spinlock in
> the case of preempt kernels too (because it would no longer have the
> lockbreak field).
>
> And yes, I'll go out on a limb and say that 64k CPUs ought to be
> enough for anyone ;)

I think those things need a timeframe applied to it. That's likely
going to be true for the next 3 years (optimistic assessment ;-)).

Could you go to 32bit spinlock by default?

How about NUMA awareness for the spinlocks? Larger backoff periods for
off node lock contentions please.
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Thursday 20 December 2007 06:28, Peter Zijlstra wrote:
> On Wed, 2007-12-19 at 11:53 -0500, Lee Schermerhorn wrote:
> > On Wed, 2007-12-19 at 11:31 -0500, Rik van Riel wrote:
> > > On Wed, 19 Dec 2007 10:52:09 -0500
> > > Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
> > > > I keep these patches up to date for testing. I don't have conclusive
> > > > evidence whether they alleviate or exacerbate the problem nor by how
> > > > much.
> > >
> > > When the queued locking from Ingo's x86 tree hits mainline,
> > > I suspect that spinlocks may end up behaving a lot nicer.
> >
> > That would be worth testing with our problematic workloads...
> >
> > > Should I drop the rwlock patches from my tree for now and
> > > focus on just the page reclaim stuff?
> >
> > That's fine with me. They're out there if anyone is interested. I'll
> > keep them up to date in my tree [and hope they don't conflict with split
> > lru and noreclaim patches too much] for occasional testing.
>
> Of course, someone would need to implement ticket locks for ia64 -
> preferably without the 256 cpu limit.

Yep. Wouldn't be hard at all -- ia64 has a "fetchadd" with acquire
semantics.

The only reason the x86 ticket locks have the 256 CPU limit is that
if they go any bigger, we can't use the partial registers so would
have to have a few more instructions.

> Nick, growing spinlock_t to 64 bits would yield space for 64k cpus,
> right? I'm guessing that would be enough for a while, even for SGI.

A 32 bit spinlock would allow 64K cpus (ticket lock has 2 counters,
each would be 16 bits). And it would actually shrink the spinlock in
the case of preempt kernels too (because it would no longer have the
lockbreak field).
And yes, I'll go out on a limb and say that 64k CPUs ought to be
enough for anyone ;)
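The arithmetic behind those limits is easy to check: each of the two ticket counters must be able to give every simultaneously spinning CPU a distinct ticket, so an n-bit counter supports 2^n CPUs (helper name hypothetical):

```c
#include <stdint.h>

/* CPU capacity of a ticket lock whose two counters are each
 * counter_bits wide: one outstanding ticket per CPU at most, so
 * 2^counter_bits distinct tickets = that many CPUs supported.
 * 8-bit counters give the current 256 limit; 16-bit gives 64K. */
static inline uint32_t ticket_lock_max_cpus(unsigned counter_bits)
{
	return (uint32_t)1 << counter_bits;
}
```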
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Wed, 2007-12-19 at 11:53 -0500, Lee Schermerhorn wrote:
> On Wed, 2007-12-19 at 11:31 -0500, Rik van Riel wrote:
> > On Wed, 19 Dec 2007 10:52:09 -0500
> > Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
> >
> > > I keep these patches up to date for testing. I don't have conclusive
> > > evidence whether they alleviate or exacerbate the problem nor by how
> > > much.
> >
> > When the queued locking from Ingo's x86 tree hits mainline,
> > I suspect that spinlocks may end up behaving a lot nicer.
>
> That would be worth testing with our problematic workloads...
>
> > Should I drop the rwlock patches from my tree for now and
> > focus on just the page reclaim stuff?
>
> That's fine with me. They're out there if anyone is interested. I'll
> keep them up to date in my tree [and hope they don't conflict with split
> lru and noreclaim patches too much] for occasional testing.

Of course, someone would need to implement ticket locks for ia64 -
preferably without the 256 cpu limit.

Nick, growing spinlock_t to 64 bits would yield space for 64k cpus,
right? I'm guessing that would be enough for a while, even for SGI.
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Wed, 2007-12-19 at 11:31 -0500, Rik van Riel wrote:
> On Wed, 19 Dec 2007 10:52:09 -0500
> Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
>
> > I keep these patches up to date for testing. I don't have conclusive
> > evidence whether they alleviate or exacerbate the problem nor by how
> > much.
>
> When the queued locking from Ingo's x86 tree hits mainline,
> I suspect that spinlocks may end up behaving a lot nicer.

That would be worth testing with our problematic workloads...

> Should I drop the rwlock patches from my tree for now and
> focus on just the page reclaim stuff?

That's fine with me. They're out there if anyone is interested. I'll
keep them up to date in my tree [and hope they don't conflict with split
lru and noreclaim patches too much] for occasional testing.

Lee
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Wed, 19 Dec 2007 10:52:09 -0500
Lee Schermerhorn <[EMAIL PROTECTED]> wrote:

> I keep these patches up to date for testing. I don't have conclusive
> evidence whether they alleviate or exacerbate the problem nor by how
> much.

When the queued locking from Ingo's x86 tree hits mainline,
I suspect that spinlocks may end up behaving a lot nicer.

Should I drop the rwlock patches from my tree for now and
focus on just the page reclaim stuff?

--
All Rights Reversed
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Wed, 2007-12-19 at 11:48 +1100, Nick Piggin wrote:
> On Wednesday 19 December 2007 08:15, Rik van Riel wrote:
> > I have seen soft cpu lockups in page_referenced_file() due to
> > contention on i_mmap_lock() for different pages. Making the
> > i_mmap_lock a reader/writer lock should increase parallelism
> > in vmscan for file backed pages mapped into many address spaces.
> >
> > Read lock the i_mmap_lock for all usage except:
> >
> > 1) mmap/munmap: linking vma into i_mmap prio_tree or removing
> > 2) unmap_mapping_range: protecting vm_truncate_count
> >
> > rmap: try_to_unmap_file() required new cond_resched_rwlock().
> > To reduce code duplication, I recast cond_resched_lock() as a
> > [static inline] wrapper around reworked cond_sched_lock() =>
> > __cond_resched_lock(void *lock, int type).
> > New cond_resched_rwlock() implemented as another wrapper.
>
> Reader/writer locks really suck in terms of fairness and starvation,
> especially when the read-side is common and frequent. (also, single
> threaded performance of the read-side is worse).
>
> I know Lee saw some big latencies on the anon_vma list lock when
> running (IIRC) a large benchmark... but are there more realistic
> situations where this is a problem?

Yes, we see the stall on the anon_vma lock most frequently running the
AIM benchmark with several tens of thousands of processes--all forked
from the same parent. If we push the system into reclaim, all cpus end
up spinning on the lock in one of the anon_vma's shared by all the
tasks. Quite easy to reproduce. I have also seen this running stress
tests to force reclaim under Dave Anderson's "usex" exerciser--e.g.,
testing the split LRU and noreclaim patches--even with the
reader-writer lock patch.

I've seen the lockups on the i_mmap_lock running Oracle workloads on
our large servers. This is running an OLTP workload with only a
thousand or so "clients" all running the same application image.
Again, when the system attempts to reclaim we end up spinning on the
i_mmap_lock of one of the files [possibly the shared global shmem
segment] shared by all the applications. I also see it with the usex
stress load--also, with and without this patch.

I think this is a more probable scenario--thousands of processes
sharing a single file, such as libc.so--than thousands of processes
all descended from a single ancestor w/o exec'ing.

I keep these patches up to date for testing. I don't have conclusive
evidence whether they alleviate or exacerbate the problem nor by how
much.

Lee
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Wed, 2007-12-19 at 11:48 +1100, Nick Piggin wrote: On Wednesday 19 December 2007 08:15, Rik van Riel wrote: I have seen soft cpu lockups in page_referenced_file() due to contention on i_mmap_lock() for different pages. Making the i_mmap_lock a reader/writer lock should increase parallelism in vmscan for file back pages mapped into many address spaces. Read lock the i_mmap_lock for all usage except: 1) mmap/munmap: linking vma into i_mmap prio_tree or removing 2) unmap_mapping_range: protecting vm_truncate_count rmap: try_to_unmap_file() required new cond_resched_rwlock(). To reduce code duplication, I recast cond_resched_lock() as a [static inline] wrapper around reworked cond_sched_lock() = __cond_resched_lock(void *lock, int type). New cond_resched_rwlock() implemented as another wrapper. Reader/writer locks really suck in terms of fairness and starvation, especially when the read-side is common and frequent. (also, single threaded performance of the read-side is worse). I know Lee saw some big latencies on the anon_vma list lock when running (IIRC) a large benchmark... but are there more realistic situations where this is a problem? Yes, we see the stall on the anon_vma lock most frequently running the AIM benchmark with several tens of thousands of processes--all forked from the same parent. If we push the system into reclaim, all cpus end up spinning on the lock in one of the anon_vma's shared by all the tasks. Quite easy to reproduce. I have also seen this running stress tests to force reclaim under Dave Anderson's usex exerciser--e.g., testing the split LRU and noreclaim patches--even with the reader-writer lock patch. I've seen the lockups on the i_mmap_lock running Oracle workloads on our large servers. This is running an OLTP workload with only a thousand or so clients all running the same application image. 
Again, when the system attempts to reclaim we end up spinning on the i_mmap_lock of one of the files [possibly the shared global shmem segment] shared by all the applications. I also see it with the usex stress load--also, with and without this patch. I think this is a more probably scenario--thousands of processes sharing a single file, such as libc.so--than thousands of processes all descended from a single ancestor w/o exec'ing. I keep these patches up to date for testing. I don't have conclusive evidence whether they alleviate or exacerbate the problem nor by how much. Lee -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Wed, 19 Dec 2007 10:52:09 -0500 Lee Schermerhorn [EMAIL PROTECTED] wrote: I keep these patches up to date for testing. I don't have conclusive evidence whether they alleviate or exacerbate the problem nor by how much. When the queued locking from Ingo's x86 tree hits mainline, I suspect that spinlocks may end up behaving a lot nicer. Should I drop the rwlock patches from my tree for now and focus on just the page reclaim stuff? -- All Rights Reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Wed, 2007-12-19 at 11:31 -0500, Rik van Riel wrote: On Wed, 19 Dec 2007 10:52:09 -0500 Lee Schermerhorn [EMAIL PROTECTED] wrote: I keep these patches up to date for testing. I don't have conclusive evidence whether they alleviate or exacerbate the problem nor by how much. When the queued locking from Ingo's x86 tree hits mainline, I suspect that spinlocks may end up behaving a lot nicer. That would be worth testing with our problematic workloads... Should I drop the rwlock patches from my tree for now and focus on just the page reclaim stuff? That's fine with me. They're out there is anyone is interested. I'll keep them up to date in my tree [and hope they don't conflict with split lru and noreclaim patches too much] for occasional testing. Lee -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Wed, 2007-12-19 at 11:53 -0500, Lee Schermerhorn wrote: On Wed, 2007-12-19 at 11:31 -0500, Rik van Riel wrote: On Wed, 19 Dec 2007 10:52:09 -0500 Lee Schermerhorn [EMAIL PROTECTED] wrote: I keep these patches up to date for testing. I don't have conclusive evidence whether they alleviate or exacerbate the problem nor by how much. When the queued locking from Ingo's x86 tree hits mainline, I suspect that spinlocks may end up behaving a lot nicer. That would be worth testing with our problematic workloads... Should I drop the rwlock patches from my tree for now and focus on just the page reclaim stuff? That's fine with me. They're out there is anyone is interested. I'll keep them up to date in my tree [and hope they don't conflict with split lru and noreclaim patches too much] for occasional testing. Of course, someone would need to implement ticket locks for ia64 - preferably without the 256 cpu limit. Nick, growing spinlock_t to 64 bits would yield space for 64k cpus, right? I'm guessing that would be enough for a while, even for SGI. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Thursday 20 December 2007 06:28, Peter Zijlstra wrote: On Wed, 2007-12-19 at 11:53 -0500, Lee Schermerhorn wrote: On Wed, 2007-12-19 at 11:31 -0500, Rik van Riel wrote: On Wed, 19 Dec 2007 10:52:09 -0500 Lee Schermerhorn [EMAIL PROTECTED] wrote: I keep these patches up to date for testing. I don't have conclusive evidence whether they alleviate or exacerbate the problem nor by how much. When the queued locking from Ingo's x86 tree hits mainline, I suspect that spinlocks may end up behaving a lot nicer. That would be worth testing with our problematic workloads... Should I drop the rwlock patches from my tree for now and focus on just the page reclaim stuff? That's fine with me. They're out there is anyone is interested. I'll keep them up to date in my tree [and hope they don't conflict with split lru and noreclaim patches too much] for occasional testing. Of course, someone would need to implement ticket locks for ia64 - preferably without the 256 cpu limit. Yep. Wouldn't be hard at all -- ia64 has a fetchadd with acquire semantics. The only reason the x86 ticket locks have the 256 CPu limit is that if they go any bigger, we can't use the partial registers so would have to have a few more instructions. Nick, growing spinlock_t to 64 bits would yield space for 64k cpus right? I'm guessing that would be enough for a while, even for SGI. A 32 bit spinlock would allow 64K cpus (ticket lock has 2 counters, each would be 16 bits). And it would actually shrink the spinlock in the case of preempt kernels too (because it would no longer have the lockbreak field). And yes, I'll go out on a limb and say that 64k CPUs ought to be enough for anyone ;) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
> The only reason the x86 ticket locks have the 256 CPU limit is that
> if they go any bigger, we can't use the partial registers, so we
> would need a few more instructions.

x86_64 is going up to 4k or 16k cpus soon for our new hardware.

> A 32 bit spinlock would allow 64K cpus (the ticket lock has 2
> counters, each of which would be 16 bits).  And it would actually
> shrink the spinlock in the case of preempt kernels too (because it
> would no longer have the lockbreak field).
>
> And yes, I'll go out on a limb and say that 64k CPUs ought to be
> enough for anyone ;)

I think those things need a timeframe applied to them.  That's likely
going to be true for the next 3 years (optimistic assessment ;-)).

Could you go to a 32-bit spinlock by default?  How about NUMA awareness
for the spinlocks?  Larger backoff periods for off-node lock
contention, please.
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
Hi

> > rmap:  try_to_unmap_file() required new cond_resched_rwlock().
> > To reduce code duplication, I recast cond_resched_lock() as a
> > [static inline] wrapper around the reworked cond_resched_lock() =>
> > __cond_resched_lock(void *lock, int type).
> > New cond_resched_rwlock() implemented as another wrapper.
>
> Reader/writer locks really suck in terms of fairness and starvation,
> especially when the read-side is common and frequent.  (Also, the
> single-threaded performance of the read-side is worse.)

Agreed.  rwlocks get bad performance in some cases, especially on
many-cpu machines: if many cpus grab the read lock on and off, then at
least one cpu is always holding the read lock, and a cpu waiting for
the write lock never gets it.  Therefore rwlocks are often a
performance weakness under stress.

I want to know the testcase for this patch so I can run it.  Do you
have one?

/kosaki
Re: [patch 02/20] make the inode i_mmap_lock a reader/writer lock
On Wednesday 19 December 2007 08:15, Rik van Riel wrote:
> I have seen soft cpu lockups in page_referenced_file() due to
> contention on i_mmap_lock() for different pages.  Making the
> i_mmap_lock a reader/writer lock should increase parallelism
> in vmscan for file-backed pages mapped into many address spaces.
>
> Read lock the i_mmap_lock for all usage except:
>
> 1) mmap/munmap:  linking vma into i_mmap prio_tree or removing
> 2) unmap_mapping_range:  protecting vm_truncate_count
>
> rmap:  try_to_unmap_file() required new cond_resched_rwlock().
> To reduce code duplication, I recast cond_resched_lock() as a
> [static inline] wrapper around the reworked cond_resched_lock() =>
> __cond_resched_lock(void *lock, int type).
> New cond_resched_rwlock() implemented as another wrapper.

Reader/writer locks really suck in terms of fairness and starvation,
especially when the read-side is common and frequent.  (Also, the
single-threaded performance of the read-side is worse.)

I know Lee saw some big latencies on the anon_vma list lock when
running (IIRC) a large benchmark... but are there more realistic
situations where this is a problem?
[patch 02/20] make the inode i_mmap_lock a reader/writer lock
I have seen soft cpu lockups in page_referenced_file() due to
contention on i_mmap_lock() for different pages.  Making the
i_mmap_lock a reader/writer lock should increase parallelism
in vmscan for file-backed pages mapped into many address spaces.

Read lock the i_mmap_lock for all usage except:

1) mmap/munmap:  linking vma into i_mmap prio_tree or removing
2) unmap_mapping_range:  protecting vm_truncate_count

rmap:  try_to_unmap_file() required new cond_resched_rwlock().
To reduce code duplication, I recast cond_resched_lock() as a
[static inline] wrapper around the reworked cond_resched_lock() =>
__cond_resched_lock(void *lock, int type).
New cond_resched_rwlock() implemented as another wrapper.

Signed-off-by: Lee Schermerhorn <[EMAIL PROTECTED]>
Signed-off-by: Rik van Riel <[EMAIL PROTECTED]>

Index: linux-2.6.24-rc5-mm1/include/linux/fs.h
===================================================================
--- linux-2.6.24-rc5-mm1.orig/include/linux/fs.h
+++ linux-2.6.24-rc5-mm1/include/linux/fs.h
@@ -501,7 +501,7 @@ struct address_space {
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	rwlock_t		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
Index: linux-2.6.24-rc5-mm1/include/linux/mm.h
===================================================================
--- linux-2.6.24-rc5-mm1.orig/include/linux/mm.h
+++ linux-2.6.24-rc5-mm1/include/linux/mm.h
@@ -707,7 +707,7 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t	last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	rwlock_t *i_mmap_lock;			/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
Index: linux-2.6.24-rc5-mm1/fs/inode.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/fs/inode.c
+++ linux-2.6.24-rc5-mm1/fs/inode.c
@@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+	rwlock_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
Index: linux-2.6.24-rc5-mm1/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/fs/hugetlbfs/inode.c
+++ linux-2.6.24-rc5-mm1/fs/hugetlbfs/inode.c
@@ -420,6 +420,9 @@ static void hugetlbfs_drop_inode(struct
 		hugetlbfs_forget_inode(inode);
 }
 
+/*
+ * LOCKING:  __unmap_hugepage_range() requires write lock on i_mmap_lock
+ */
 static inline void
 hugetlb_vmtruncate_list(struct prio_tree_root *root, pgoff_t pgoff)
 {
@@ -454,10 +457,10 @@ static int hugetlb_vmtruncate(struct ino
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	write_lock(&mapping->i_mmap_lock);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	write_unlock(&mapping->i_mmap_lock);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
Index: linux-2.6.24-rc5-mm1/kernel/fork.c
===================================================================
--- linux-2.6.24-rc5-mm1.orig/kernel/fork.c
+++ linux-2.6.24-rc5-mm1/kernel/fork.c
@@ -272,12 +272,12 @@ static int dup_mmap(struct mm_struct *mm
 			atomic_dec(&inode->i_writecount);
 
 			/* insert tmp into the share list, just after mpnt */
-			spin_lock(&file->f_mapping->i_mmap_lock);
+			write_lock(&file->f_mapping->i_mmap_lock);
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
 			flush_dcache_mmap_lock(file->f_mapping);
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(file->f_mapping);
-			spin_unlock(&file->f_mapping->i_mmap_lock);
+			write_unlock(&file->f_mapping->i_mmap_lock);
 		}
Index: linux-2.6.24-rc5-mm1/mm/filemap_xip.c