Re: [patch 1/4] x86: FIFO ticket spinlocks
On Fri, Nov 02, 2007 at 04:33:32PM +0100, Ingo Molnar wrote:
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > Anyway, if this can make its way to the x86 tree, I think it will get
> > pulled into -mm (?) and get some exposure...
>
> ok, we can certainly try it there.

Anything particular I have to do to get it into the x86 tree? And
presumably that tree is going to be picked up by Andrew at some point?
(-mm exposure is the main objective here; the only reason I didn't send
it to Andrew directly is in the interests of doing the right thing.)

> Your code is really nifty.

Thanks. One of the CPU engineers at Intel asked to use it as their
reference ticket lock code sequence, which was pretty cool :)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
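As background for readers outside the thread, the FIFO ticket lock idea can be sketched in portable C11. This is an illustrative userspace analogue, not the x86 assembly from the patch; the struct and function names are invented here:

```c
#include <stdatomic.h>

/* Userspace sketch of a FIFO ticket lock: lock() is a single atomic
 * fetch-add (xadd on x86); unlock() needs no locked RMW operation at
 * all, since only the current holder ever advances "owner". */
struct ticket_lock {
	atomic_uint next;   /* next ticket to hand out */
	atomic_uint owner;  /* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
						    memory_order_relaxed);
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		; /* spin; waiters are granted the lock strictly in FIFO order */
}

static void ticket_unlock(struct ticket_lock *l)
{
	/* Only the holder writes owner, so a plain release store suffices. */
	atomic_store_explicit(&l->owner,
			      atomic_load_explicit(&l->owner,
						   memory_order_relaxed) + 1,
			      memory_order_release);
}
```

The fairness property is visible in the structure: tickets are handed out in arrival order, and owner advances one ticket at a time, so no CPU can jump the queue.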
Re: [RFC/PATCH] Optimize zone allocator synchronization
On Wednesday 07 November 2007 17:19, Andrew Morton wrote:
> On Tue, 06 Nov 2007 05:08:07 -0500 Chris Snook <[EMAIL PROTECTED]> wrote:
> > Don Porter wrote:
> > > From: Donald E. Porter <[EMAIL PROTECTED]>
> > >
> > > In the bulk page allocation/free routines in mm/page_alloc.c, the zone
> > > lock is held across all iterations. For certain parallel workloads, I
> > > have found that releasing and reacquiring the lock for each iteration
> > > yields better performance, especially at higher CPU counts. For
> > > instance, kernel compilation is sped up by 5% on an 8 CPU test
> > > machine. In most cases, there is no significant effect on performance
> > > (although the effect tends to be slightly positive). This seems quite
> > > reasonable for the very small scope of the change.
> > >
> > > My intuition is that this patch prevents smaller requests from waiting
> > > on larger ones. While grabbing and releasing the lock within the loop
> > > adds a few instructions, it can lower the latency for a particular
> > > thread's allocation, which is often on the thread's critical path.
> > > Lowering the average latency for allocation can increase system
> > > throughput.
> > >
> > > More detailed information, including data from the tests I ran to
> > > validate this change, is available at
> > > http://www.cs.utexas.edu/~porterde/kernel-patch.html .
> > >
> > > Thanks in advance for your consideration and feedback.

I did see this initial post, and didn't quite know what to make of it.
I'll admit it is slightly unexpected :) Always good to research ideas
against common convention, though.

I don't know whether your reasoning is correct, though: unless there is
a significant number of higher-order allocations (which there should
not be, AFAIKS), all allocators will go through the per-CPU lists,
which batch the same number of objects on and off, so there is no such
thing as smaller or larger requests. And there are a number of
regressions as well in your tests.
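The locking change under discussion can be sketched in userspace C. A pthread mutex stands in for zone->lock, and free_one() for __free_one_page(); all names here are invented stand-ins, not the kernel's code:

```c
#include <pthread.h>

/* Illustrative sketch of the two bulk-free strategies being compared. */
static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
static int freed;  /* pages returned to the free area */

/* Stand-in for __free_one_page(); must be called with the lock held. */
static void free_one(void)
{
	freed++;
}

/* Upstream behaviour: one critical section spans the whole batch. */
static void bulk_free_lock_held(int n)
{
	pthread_mutex_lock(&zone_lock);
	for (int i = 0; i < n; i++)
		free_one();
	pthread_mutex_unlock(&zone_lock);
}

/* The patch's behaviour: drop and retake the lock per iteration,
 * trading extra locked operations for shorter lock hold times. */
static void bulk_free_lock_per_page(int n)
{
	for (int i = 0; i < n; i++) {
		pthread_mutex_lock(&zone_lock);
		free_one();
		pthread_mutex_unlock(&zone_lock);
	}
}
```

The trade-off Nick and Andrew debate below is exactly this: the second form lowers worst-case wait time for other CPUs, but each iteration now pays for a locked operation and potential cacheline transfer.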
It would be nice to get some more detailed profile numbers (preferably
with an upstream kernel) to try to work out what is going faster.

It's funny, Dave Miller and I were just talking about the possible
reappearance of zone->lock contention with massively multi-core and
multi-threaded CPUs. I think the right way to fix this in the long run,
if it turns into a real problem, is something like having a lock per
MAX_ORDER block, and having CPUs prefer to allocate from different
blocks. Anti-frag makes this pretty interesting to implement, but it
will be possible.

> > That's an interesting insight. My intuition is that Nick Piggin's
> > recently-posted ticket spinlocks patches[1] will reduce the need for
> > this patch, though it may be useful to have both. Can you benchmark
> > again with only ticket spinlocks, and with ticket spinlocks + this
> > patch? You'll probably want to use 2.6.24-rc1 as your baseline, due
> > to the x86 architecture merge.
>
> The patch as-is would hurt low cpu-count workloads, and
> single-threaded workloads: it is simply taking that lock a lot more
> times. This will be particularly noticeable on things like older P4
> machines which have peculiarly expensive locked operations.

It's not even restricted to P4s -- another big cost is going to be the
cacheline pingpong. Actually it might be worth trying another test run
with zone->lock put into its own cacheline (as it stands, when the lock
gets contended, spinners will just sit there pushing useful fields out
of the holder's cache -- ticket locks will do better here, but they
still write to the lock once, then sit there loading it).
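The cacheline-isolation idea mentioned above can be sketched with C11 alignment. This is an illustrative struct, not the kernel's struct zone; the field names and the 64-byte line size are assumptions:

```c
#include <stdalign.h>
#include <stdatomic.h>

/* Sketch of "give the lock its own cacheline": with the lock aligned
 * to a 64-byte line, contended spinners hammer only that line instead
 * of evicting the hot fields that the lock holder is working on. */
struct zone_sketch {
	unsigned long free_pages;     /* hot fields the holder touches */
	unsigned long pages_scanned;
	alignas(64) atomic_flag lock; /* lock isolated on its own line */
};
```

Without the alignas, the lock would share a line with free_pages and pages_scanned, and every spinning reader would keep pulling that line away from the CPU that actually holds the lock.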
> A test to run would be, on ext2:
>
>   time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)
>
> (might need to increase /proc/sys/vm/dirty* to avoid any writeback)
>
> I wonder if we can do something like:
>
>   if (lock_is_contended(lock)) {
>           spin_unlock(lock);
>           spin_lock(lock);        /* To the back of the queue */
>   }
>
> (in conjunction with the ticket locks) so that we only do the
> expensive buslocked operation when we actually have a need to do so.
>
> (The above should be wrapped in some new spinlock interface function
> which is probably a no-op on architectures which cannot implement it
> usefully)

We have the need_lockbreak stuff. Of course, that's often pretty
useless with regular spinlocks (when you consider that my tests show
that a single CPU can be allowed to retake the same lock several
million times in a row despite contention)...

Anyway, yeah, we could do that. But I think we do actually want to
batch up allocations on a given CPU in the multithreaded case as well,
rather than interleave them. There are some benefits to avoiding
cacheline bouncing.
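Andrew's proposed interface maps naturally onto a ticket lock, where the number of waiters is directly observable from the two counters. A hypothetical userspace sketch (all names invented; not the kernel API):

```c
#include <stdatomic.h>

/* Minimal C11 ticket lock with a contention query. */
struct tlock {
	atomic_uint next;   /* next ticket to hand out */
	atomic_uint owner;  /* ticket currently being served */
};

static void tlock_lock(struct tlock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);
	while (atomic_load(&l->owner) != me)
		; /* spin in FIFO order */
}

static void tlock_unlock(struct tlock *l)
{
	atomic_store(&l->owner, atomic_load(&l->owner) + 1);
}

static int tlock_is_contended(struct tlock *l)
{
	/* next - owner is the holder plus queued waiters, so a value
	 * greater than 1 (observed while holding) means someone waits. */
	return atomic_load(&l->next) - atomic_load(&l->owner) > 1;
}

/* The wrapper Andrew sketches: a no-op unless someone is waiting. */
static void tlock_yield_if_contended(struct tlock *l)
{
	if (tlock_is_contended(l)) {
		tlock_unlock(l);
		tlock_lock(l); /* rejoin at the back of the ticket queue */
	}
}
```

On an unqueued lock, the whole thing costs one branch and two loads; the buslocked xadd is only paid when a waiter actually exists, which is the point of the proposal.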
Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)
On Tuesday 06 November 2007 04:42, Frank van Maarseveen wrote:
> For quite some time I'm seeing occasional lockups spread over 50
> different machines I'm maintaining. Symptom: a page allocation failure
> with order:1, GFP_ATOMIC, while there is plenty of memory, as it seems
> (lots of free pages, almost no swap used), followed by a lockup
> (everything dead). I've collected all (12) crash cases which occurred
> in the last 10 weeks on 50 machines total (i.e. 1 crash every 41 weeks
> on average). The kernel messages are summarized to show the
> interesting part (IMO) they have in common. Over the years this has
> become crash cause #1 for stable kernels for me (fglrx doesn't
> count ;).
>
> One note: I suspect that reporting a GFP_ATOMIC allocation failure in
> a network driver via that same driver (netconsole) may not be the
> smartest thing to do, and this could be responsible for the lockup
> itself. However, the initial page allocation failure remains and I'm
> not sure how to address that problem.

It isn't unexpected. If an atomic allocation doesn't have enough
memory, it kicks off kswapd to start freeing memory for it. However,
it cannot wait for memory to become free (it's GFP_ATOMIC), so it has
to return failure. GFP_ATOMIC allocation paths are designed so that
the kernel can recover from this situation, and a subsequent
allocation will have free memory.

Probably in production kernels we should default to only reporting
this when page reclaim is not making any progress.

> I still think the issue is memory fragmentation, but if so, it looks
> a bit extreme to me: one system with 2GB of RAM crashed after a day,
> merely running a couple of TCP server programs. All systems have
> either 1 or 2GB RAM and at least 1GB of (merely unused) swap.

You can reduce the chances of it happening by increasing
/proc/sys/vm/min_free_kbytes.
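The failure path Nick describes can be modelled with a toy allocator. Everything here (pool size, function names, the kswapd stand-in) is invented for illustration; only the control flow mirrors the behaviour described above:

```c
/* Toy model of a GFP_ATOMIC-style allocation: it may never sleep, so
 * on exhaustion it only wakes background reclaim and reports failure,
 * trusting the caller to cope (e.g. a driver drops a packet). */
#define POOL_PAGES 4

static int pool_free = POOL_PAGES; /* pages on the free list */
static int kswapd_woken;           /* models waking kswapd */

/* Returns 0 on success, -1 on failure; never blocks. */
static int atomic_alloc_page(void)
{
	if (pool_free > 0) {
		pool_free--;
		return 0;
	}
	kswapd_woken = 1; /* kick reclaim for *future* allocations */
	return -1;        /* cannot wait: report failure immediately */
}

/* Background reclaim runs later and refills the free list,
 * so a subsequent atomic allocation succeeds again. */
static void kswapd_run(void)
{
	pool_free = POOL_PAGES;
	kswapd_woken = 0;
}
```

Raising min_free_kbytes corresponds, in this model, to enlarging POOL_PAGES: reclaim is kicked earlier and the atomic path runs dry less often.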
Re: Oom-killer error.
On Tuesday 06 November 2007 19:34, Jim van Wel wrote:
> Hi there,
>
> I have a strange problem with like 10-15 servers right now.
> We have here all HP DL380 G5 servers with kernel 2.6.22.6. The system
> all works normally. But after an uptime of like 15 to 25 days, we get
> these messages, and the server just crashes.

It is trying to allocate lowmem, but you have none left (and none to
speak of can be reclaimed). I'd guess you have a kernel memory leak.

Can you start by posting the output of /proc/slabinfo and /proc/meminfo
after the machine has been up for 10-20 days (eg. close to OOM)? And
preferably also post another set after just a day or so of uptime.

Are you using the SLAB or SLUB allocator (check .config)?

The kernel crashes (rather than recovers) because the leak has used up
all its working memory, and killing processes does not release it (by
the looks).

Thanks for the report,
Nick
Re: [patch 1/4] x86: FIFO ticket spinlocks
On Fri, Nov 02, 2007 at 08:56:46PM -0400, Chuck Ebbert wrote:
> On 11/02/2007 07:01 PM, Nick Piggin wrote:
> >
> > In the contended multi-threaded tight loop, the xchg lock is slower
> > than inc lock but still beats the fair xadd lock, but that's only
> > because it is just as unfair if not more so on this hardware
> > (runtime difference of up to about 10%)
>
> I meant xchg for unlock, not lock.

That is for unlock. 2x the number of atomic operations ~= 2x the cost.
Re: [patch 1/4] x86: FIFO ticket spinlocks
On Fri, Nov 02, 2007 at 09:51:27AM -0700, Linus Torvalds wrote:
>
> On Fri, 2 Nov 2007, Chuck Ebbert wrote:
> >
> > There's also a very easy way to get better fairness with our current
> > spinlocks: use xchg to release the lock instead of mov.
>
> That does nothing at all.
>
> Yes, it slows the unlock down, which in turn on some machines will
> make it easier for another core/socket to get it, but it's purely
> about the slowdown, nothing else.

Yeah, it's not such a good idea... it slows down the single threaded
case like crazy. On my dual core Core 2:

_Single thread_
inc-lock in cache takes 21.94ns
xadd-lock in cache takes 22.64ns
xchg-lock in cache takes 35.21ns

inc-lock out of cache takes 140.73ns
xadd-lock out of cache takes 141.15ns
xchg-lock out of cache takes 155.13ns

In the contended multi-threaded tight loop, the xchg lock is slower
than inc lock but still beats the fair xadd lock, but that's only
because it is just as unfair if not more so on this hardware (runtime
difference of up to about 10%)
Re: [patch 1/4] x86: FIFO ticket spinlocks
On Fri, Nov 02, 2007 at 10:05:37AM -0400, Rik van Riel wrote:
> On Fri, 2 Nov 2007 07:42:20 +0100
> Nick Piggin <[EMAIL PROTECTED]> wrote:
>
> > On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
> > >
> > > On Thu, 1 Nov 2007, Rik van Riel wrote:
> > > >
> > > > Larry Woodman managed to wedge the VM into a state where, on his
> > > > 4x dual core system, only 2 cores (on the same CPU) could get the
> > > > zone->lru_lock overnight. The other 6 cores on the system were
> > > > just spinning, without being able to get the lock.
> >
> > That's quite incredible, considering that the CPUs actually _taking_
> > the locks also drop the locks and do quite a bit of work before
> > taking them again (ie. they take them to pull pages off the LRU, but
> > then do a reasonable amount of work to remove each one from
> > pagecache before refilling from the LRU).
> >
> > Possibly actually that is a *more* difficult case for the HW to
> > handle: once the CPU actually goes away and operates on other
> > cachelines, it may get a little more difficult to detect that it is
> > causing starvation issues.
>
> In case of the zone->lru_lock, grabbing the spinlock does
> not mean that the process is letting go of the cacheline.
>
> On the contrary, once the spinlock has been grabbed, the
> real cacheline prodding begins.

I didn't say that, though. Obviously the hardware can't do anything
about starvation until a lock is released.
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Thu, Nov 01, 2007 at 10:02:04PM -0700, David Miller wrote:
> From: David Miller <[EMAIL PROTECTED]>
> Date: Wed, 31 Oct 2007 00:44:25 -0700 (PDT)
>
> > From: Nick Piggin <[EMAIL PROTECTED]>
> > Date: Wed, 31 Oct 2007 08:41:06 +0100
> >
> > > You could possibly even do a generic "best effort" kind of thing
> > > with regular IPIs, that will timeout and continue if some CPUs
> > > don't handle them, and should be pretty easy to get working with
> > > existing smp_call_function stuff. Not exactly clean, but it would
> > > be better than nothing.
> >
> > Without a doubt.
>
> Putting my code where my mouth is, here is an example implementation
> of a special SysRQ 'g' "dump regs globally" debugging tool for
> sparc64.
>
> The only thing that has to happen is the SysRQ trigger. So if you can
> either SysRQ-'g' at the console or "echo 'g' > /proc/sysrq-trigger"
> you can get the registers from the cpus in the system.
>
> The only case the remote cpu registers would not be capturable would
> be if they were stuck looping in the trap entry, trap exit, or low
> level TLB handler code.
>
> This means that even if some cpu is stuck in a spinlock loop with
> interrupts disabled, you'd see it with this thing. The way it works
> is that cross cpu vectored interrupts are disabled independently of
> the processor interrupt level on sparc64.

That's really sweet. I'd better put my code where my mouth is as
well... this patch seems to be about the best we can do as a generic
fallback without getting a dedicated vector (which has to be done in
arch code anyway). Maybe it can be improved, I don't know... I also
don't know if I'm doing exactly the right thing with pt_regs, but it
seems to work.

One thing I'm doing is just dumping a single CPU, on a rotational
basis. Don't know whether that's better or worse, but just an idea.
Now I don't know what's the best way for an arch to override this with
their own enhanced handler (an ARCH_HAVE ifdef?)
Index: linux-2.6/arch/x86/kernel/smp_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smp_64.c
+++ linux-2.6/arch/x86/kernel/smp_64.c
@@ -504,8 +504,10 @@ asmlinkage void smp_reschedule_interrupt
 	add_pda(irq_resched_count, 1);
 }
 
-asmlinkage void smp_call_function_interrupt(void)
+asmlinkage void smp_call_function_interrupt(struct pt_regs *regs)
 {
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
 	void (*func) (void *info) = call_data->func;
 	void *info = call_data->info;
 	int wait = call_data->wait;
@@ -515,7 +517,7 @@ asmlinkage void smp_call_function_interr
 	 * Notify initiating CPU that I've grabbed the data and am
 	 * about to execute the function
 	 */
-	mb();
+	smp_rmb();
 	atomic_inc(&call_data->started);
 	/*
 	 * At this point the info structure may be out of scope unless wait==1
@@ -526,8 +528,10 @@ asmlinkage void smp_call_function_interr
 	add_pda(irq_call_count, 1);
 	irq_exit();
 	if (wait) {
-		mb();
+		smp_mb();
 		atomic_inc(&call_data->finished);
 	}
+
+	set_irq_regs(old_regs);
 }
Index: linux-2.6/drivers/char/sysrq.c
===================================================================
--- linux-2.6.orig/drivers/char/sysrq.c
+++ linux-2.6/drivers/char/sysrq.c
@@ -196,12 +196,72 @@ static struct sysrq_key_op sysrq_showloc
 #define sysrq_showlocks_op (*(struct sysrq_key_op *)0)
 #endif
 
-static void sysrq_handle_showregs(int key, struct tty_struct *tty)
+static void show_registers(struct pt_regs *regs)
 {
-	struct pt_regs *regs = get_irq_regs();
 	if (regs)
 		show_regs(regs);
 }
+
+static void show_cpu_registers(void)
+{
+	show_registers(get_irq_regs());
+}
+
+static void show_global_regs(void *arg)
+{
+	show_cpu_registers();
+}
+
+static DEFINE_SPINLOCK(global_rotate_lock);
+static cpumask_t global_rotate_cpu;
+
+static void sysrq_handle_global_showregs_work_fn(struct work_struct *work)
+{
+	int cpu, next, thiscpu;
+
+	spin_lock(&global_rotate_lock);
+	if (cpus_weight(global_rotate_cpu) == 0)
+		cpu = smp_processor_id();
+	else
+		cpu = first_cpu(global_rotate_cpu);
+
+	next = next_cpu(cpu, cpu_online_map);
+	if (next == NR_CPUS)
+		next = first_cpu(cpu_online_map);
+
+	cpu_clear(cpu, global_rotate_cpu);
+	cpu_set(next, global_rotate_cpu);
+	BUG_ON(cpus_weight(global_rotate_cpu) != 1);
+	spin_unlock(&global_rotate_lock);
+
+	thiscpu = get_cpu();
+	if (cpu == thiscpu)
+		show_registers(task_pt_regs(current));
+	else {
+		if (smp_call_function_single(cpu, show_global_regs, NULL, 1, 1)
+				== -1) {
+			printk("Could not interrogate CPU registers\n");
+			return;
+		}
+	}
+	put_cpu();
+}
+static DECLARE_WORK(global_showregs_work
Re: [patch 1/4] x86: FIFO ticket spinlocks
On Fri, Nov 02, 2007 at 09:51:27AM -0700, Linus Torvalds wrote: On Fri, 2 Nov 2007, Chuck Ebbert wrote: There's also a very easy way to get better fairness with our current spinlocks: use xchg to release the lock instead of mov. That does nothing at all. Yes, it slows the unlock down, which in turn on some machines will make it easier for another core/socket to get it, but it's purely about the slowdown, nothing else. Yeah, it's not such a good idea... it slows down the single threaded case like crazy. On my dual core core2:

_Single thread_
inc-lock in cache takes 21.94ns
xadd-lock in cache takes 22.64ns
xchg-lock in cache takes 35.21ns
inc-lock out of cache takes 140.73ns
xadd-lock out of cache takes 141.15ns
xchg-lock out of cache takes 155.13ns

In the contended multi-threaded tight loop, the xchg lock is slower than inc lock but still beats the fair xadd lock, but that's only because it is just as unfair if not more so on this hardware (runtime difference of up to about 10%)
Re: [patch 1/4] x86: FIFO ticket spinlocks
On Fri, Nov 02, 2007 at 08:56:46PM -0400, Chuck Ebbert wrote: On 11/02/2007 07:01 PM, Nick Piggin wrote: In the contended multi-threaded tight loop, the xchg lock is slower than inc lock but still beats the fair xadd lock, but that's only because it is just as unfair if not more so on this hardware (runtime difference of up to about 10%) I meant xchg for unlock, not lock. That is for unlock. 2x the number of atomic operations ~= 2x the cost.
Re: [patch 1/4] x86: FIFO ticket spinlocks
On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote: > > > On Thu, 1 Nov 2007, Rik van Riel wrote: > > > > Larry Woodman managed to wedge the VM into a state where, on his > > 4x dual core system, only 2 cores (on the same CPU) could get the > > zone->lru_lock overnight. The other 6 cores on the system were > > just spinning, without being able to get the lock. That's quite incredible, considering that the CPUs actually _taking_ the locks also drop the locks and do quite a bit of work before taking them again (ie. they take them to pull pages off the LRU, but then do a reasonable amount of work to remove each one from pagecache before refilling from the LRU). Possibly actually that is a *more* difficult case for the HW to handle: once the CPU actually goes away and operates on other cachelines, it may get a little more difficult to detect that it is causing starvation issues. > .. and this is almost always the result of a locking *bug*, not unfairness > per se. IOW, unfairness just ends up showing the bug in the first place. I'd almost agree, but there are always going to be corner cases where we get multiple contentions on a spinlock -- the fact that a lock is needed at all obviously suggests that it can be contended. The LRU locking could be improved, but you could have eg. scheduler runqueue lock starvation if the planets lined up just right, and it is a little more difficult to improve on runqueue locking. Anyway, I also think this is partially a hardware issue, and as multiple cores, threads, and sockets get more common, I hope it will improve (it affects Intel CPUs as well as AMD). So it is possible to have an option to switch between locks if the hardware is fairer, but I want to get as much exposure with this locking as possible for now, to see if there are any funny performance corner cases exposed (which quite possibly will turn out to be caused by suboptimal locking itself). 
Anyway, if this can make its way to the x86 tree, I think it will get pulled into -mm (?) and get some exposure...
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Thu, Nov 01, 2007 at 06:17:42PM -0700, Linus Torvalds wrote: > > > On Fri, 2 Nov 2007, Nick Piggin wrote: > > > > But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb > > uses this for inserting breakpoints (but fortunately, a COW page in a > > MAP_PRIVATE mapping is a much more natural thing for the VM). > > Yes, I phrased that badly. I meant that I'd be happier if we got rid of > VM_MAYSHARE entirely, and just used VM_SHARED. I thought we already made > them always be the same (and any VM_MAYSHARE use is historical). Oh yeah, I think it would probably be clearer to use VM_SHARED == MAP_SHARED, and test the write permission explicitly. Though there could be something I missed that makes it not as easy as it sounds... probably something best left for Hugh ;)
Re: [patch 1/4] x86: FIFO ticket spinlocks
On Thu, Nov 01, 2007 at 04:01:45PM -0400, Chuck Ebbert wrote: > On 11/01/2007 10:03 AM, Nick Piggin wrote: > > [edited to show the resulting code] > > > + __asm__ __volatile__ ( > > + LOCK_PREFIX "xaddw %w0, %1\n" > > + "1:\t" > > + "cmpb %h0, %b0\n\t" > > + "je 2f\n\t" > > + "rep ; nop\n\t" > > + "movb %1, %b0\n\t" > > + /* don't need lfence here, because loads are in-order */ > > "jmp 1b\n" > > + "2:" > > + :"+Q" (inc), "+m" (lock->slock) > > + : > > + :"memory", "cc"); > > } > > If you really thought you might get long queues, you could figure out > how far back you are and use that to determine how long to wait before > testing the lock again. That cmpb could become a subb without adding > overhead to the fast path -- that would give you the queue length (or > its complement anyway.) Indeed. You can use this as a really nice input into a backoff algorithm (eg. if you're next in line, don't back off, or at least don't go into exponential backoff; if you've got people in front of you, start throttling harder). I think I'll leave that to SGI if they come up with a big x86 SSI ;)
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Thu, Nov 01, 2007 at 09:08:45AM -0700, Linus Torvalds wrote: > > > On Thu, 1 Nov 2007, Nick Piggin wrote: > > > > Untested patch follows > > Ok, this looks ok. > > Except I would remove the VM_MAYSHARE bit from the test. But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb uses this for inserting breakpoints (but fortunately, a COW page in a MAP_PRIVATE mapping is a much more natural thing for the VM). > That whole bit should go, in fact. > > We used to make it something different: iirc, a read-only SHARED mapping > was downgraded to a non-shared mapping, because we wanted to avoid some of > the costs we used to have with the VM implementation (actually, I think it > was various filesystems that don't like shared mappings because they don't > have a per-page writeback). But we left the VM_MAYSHARE bit on, to get > /proc/<pid>/mmap things right. > > Or something like that. I forget the details. But I *think* we don't > actually need this any more. > > But basically, the "right" way to test for shared mappings is historically > to just test the VM_MAYSHARE bit - but not *both* bits. Because VM_SHARE > may have been artificially cleared. I think you're right -- VM_MAYSHARE is basically testing for MAP_SHARED. I just don't know exactly what you're proposing here. Thanks, Nick
Re: [patch 1/4] spinlock: lockbreak cleanup
On Thu, Nov 01, 2007 at 04:46:36PM +0100, Ingo Molnar wrote: > > * Lee Schermerhorn <[EMAIL PROTECTED]> wrote: > > > > I guess it was done to make the "template" hacks easier. I don't > > > really find that in good taste, especially for important core > > > infrastructure. Anyway. > > > > Actually, what I had/have is a cond_resched_rwlock() that I needed to > > convert the i_mmap_lock() to rw for testing reclaim scalability. > > [I've seen a large system running an Oracle OLTP load hang spitting > > "cpu soft lockup" messages with all cpus spinning on a i_mmap_lock > > spin lock.] One of the i_mmap_lock paths uses cond_resched_lock() for > > spin locks. To do a straight forward conversion [and maybe that isn't > > the right approach], I created the cond_resched_rwlock() function by > > generalizing the cond_resched_lock() code and creating both spin and rw > > lock wrappers. I took advantage of the fact that, currently, > > need_lockbreak() is a macro and that both spin and rw locks have/had > > the break_lock member. Typesafe functions would probably be > > preferable, if we want to keep break_lock for rw spin locks. > > > > Here's the most recent posting: > > > > http://marc.info/?l=linux-mm&m=118980356306014&w=4 > > > > See the changes to sched.[ch]. Should apply to 23-mm1 with offsets > > and minor fixup in fs/inode.c. > > yep. I'm too in favor of keeping the need-lockbreak mechanism and its > type-insensitive data structure. We've got way too many locking > primitives and keeping them all sorted is nontrivial already. I think a large contributor to that is being a bit clever with indirections and cute code (eg. like this template stuff), rather than having two types of spinlocks instead of one.
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Thu, Nov 01, 2007 at 08:14:47AM -0700, Linus Torvalds wrote: > > > On Thu, 1 Nov 2007, Nick Piggin wrote: > > > On Wed, Oct 31, 2007 at 04:08:21PM -0700, Linus Torvalds wrote: > > > > > > We made much bigger changes to ptrace support when we disallowed writing > > > to read-only shared memory areas (we used to do the magic per-page COW > > > thing). > > > > Really? No, we still do that magic COW thing which creates anonymous > > pages in MAP_SHARED vmas, don't we? > > No, we don't. I'm pretty sure. It didn't work with the VM cleanups, since > the MAP_SHARED vma's won't be on the anonymous list any more, and cannot > be swapped out. > > So now, if you try to write to a read-only shared page through ptrace, > you'll get "Unable to access". No, it COWs it (the file is RW). I believe do_wp_page will still attach an anon_vma to the vma, which will make the pte discoverable via rmap. > Of course, I didn't really look closely, so maybe I just don't remember > things right.. > > > > access_vm_pages() (things like core-dumping comes to mind - although I > > > think we don't dump pure file mappings at all, do we?) it would certainly > > > be good to run any such tests on the current -git tree... > > > > We do for MAP_SHARED|MAP_ANONYMOUS, by the looks. > > Well, as we should. There's no way for a debugger to get those pages back. > So that all looks sane. > > > - vm_flags |= VM_SHARED | VM_MAYSHARE; > > - if (!(file->f_mode & FMODE_WRITE)) > > - vm_flags &= ~(VM_MAYWRITE | VM_SHARED); > > + vm_flags |= VM_MAYSHARE; > > + if (file->f_mode & FMODE_WRITE) > > + vm_flags |= VM_SHARED; > > + if (!(vm_flags & VM_WRITE)) > > + vm_flags &= ~VM_MAYWRITE; > > This looks totally bogus. What was the intent of this patch? > > The VM_MAYWRITE flag is *not* supposed to track the VM_WRITE flag: that > would defeat the whole purpose of it! The whole point of that flag is to > say whether mprotect() could turn it into a VM_WRITE mapping, and it > depends on the file mode, not VM_WRITE! 
Yeah of course that won't work, stupid... The intent is to stop get_user_pages from proceeding with a write fault (and subsequent COW) to readonly shared mappings, when force is set. I think it can be done simply via get_user_pages(), which is what I should have done to begin with. Untested patch follows --- Index: linux-2.6/mm/memory.c === --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -1031,7 +1031,9 @@ int get_user_pages(struct task_struct *t } if (!vma || (vma->vm_flags & (VM_IO | VM_PFNMAP)) - || !(vm_flags & vma->vm_flags)) + || !(vm_flags & vma->vm_flags) + || (write && ((vma->vm_flags & + (VM_SHARED|VM_MAYSHARE)) == VM_MAYSHARE))) return i ? : -EFAULT; if (is_vm_hugetlb_page(vma)) { @@ -1563,13 +1565,11 @@ static int do_wp_page(struct mm_struct * reuse = can_share_swap_page(old_page); unlock_page(old_page); } - } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED))) { + } else if (unlikely((vma->vm_flags & VM_SHARED))) { /* -* Only catch write-faults on shared writable pages, -* read-only shared pages can get COWed by -* get_user_pages(.write=1, .force=1). +* Only catch write-faults on shared writable pages. */ + BUG_ON(!(vma->vm_flags & VM_WRITE)); if (vma->vm_ops && vma->vm_ops->page_mkwrite) { /* * Notify the address space that the page is about to
Re: [patch 1/4] spinlock: lockbreak cleanup
On Thu, Nov 01, 2007 at 03:06:05PM +0100, Peter Zijlstra wrote: > On Thu, 2007-11-01 at 15:02 +0100, Nick Piggin wrote: > > > Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to > > decouple it from the spinlock implementation, and make it typesafe (rwlocks > > do not have any need_lockbreak sites -- why do they even get bloated up > > with that break_lock then?). > > IIRC Lee has a few patches floating about that do introduce lockbreak > stuff for rwlocks. Well that would be a good reason to introduce a break_lock for them, but previously not so much... we have rwlocks in some slightly space critical structures (vmas, inodes, etc). I guess it was done to make the "template" hacks easier. I don't really find that in good taste, especially for important core infrastructure. Anyway.
[patch 3/4] x86: spinlock.h merge prep
Prepare for merging 32 and 64 bit spinlocks, by making them identical (except for the OOSTORE thing). raw_read_lock and raw_write_lock get a relaxed register constraint, and 64-bit has a few "=m" constraints changed to "+m". I hope these things actually make the code better. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> --- Index: linux-2.6/include/asm-x86/spinlock_32.h === --- linux-2.6.orig/include/asm-x86/spinlock_32.h +++ linux-2.6/include/asm-x86/spinlock_32.h @@ -98,14 +98,15 @@ static inline int __raw_spin_trylock(raw return ret; } -#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE) +#if defined(CONFIG_X86_32) && \ + (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)) /* * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock * (PPro errata 66, 92) */ -#define UNLOCK_LOCK_PREFIX LOCK_PREFIX +# define UNLOCK_LOCK_PREFIX LOCK_PREFIX #else -#define UNLOCK_LOCK_PREFIX +# define UNLOCK_LOCK_PREFIX #endif static inline void __raw_spin_unlock(raw_spinlock_t *lock) @@ -135,49 +136,42 @@ static inline void __raw_spin_unlock_wai * * On x86, we implement read-write locks as a 32-bit counter * with the high bit (sign) being the "contended" bit. - * - * The inline assembly is non-obvious. Think about it. - * - * Changed to use the same technique as rw semaphores. See - * semaphore.h for details. -ben - * - * the helpers are in arch/i386/kernel/semaphore.c */ /** * read_can_lock - would read_trylock() succeed? * @lock: the rwlock in question. */ -static inline int __raw_read_can_lock(raw_rwlock_t *x) +static inline int __raw_read_can_lock(raw_rwlock_t *lock) { - return (int)(x)->lock > 0; + return (int)(lock)->lock > 0; } /** * write_can_lock - would write_trylock() succeed? * @lock: the rwlock in question. 
*/ -static inline int __raw_write_can_lock(raw_rwlock_t *x) +static inline int __raw_write_can_lock(raw_rwlock_t *lock) { - return (x)->lock == RW_LOCK_BIAS; + return (lock)->lock == RW_LOCK_BIAS; } static inline void __raw_read_lock(raw_rwlock_t *rw) { - asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t" + asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t" "jns 1f\n" "call __read_lock_failed\n\t" "1:\n" -::"a" (rw) : "memory"); +::"r" (rw) : "memory"); } static inline void __raw_write_lock(raw_rwlock_t *rw) { - asm volatile(LOCK_PREFIX " subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" + asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t" "jz 1f\n" "call __write_lock_failed\n\t" "1:\n" -::"a" (rw) : "memory"); +::"r" (rw), "i" (RW_LOCK_BIAS) : "memory"); } static inline int __raw_read_trylock(raw_rwlock_t *lock) @@ -206,8 +200,8 @@ static inline void __raw_read_unlock(raw static inline void __raw_write_unlock(raw_rwlock_t *rw) { - asm volatile(LOCK_PREFIX "addl $" RW_LOCK_BIAS_STR ", %0" -: "+m" (rw->lock) : : "memory"); + asm volatile(LOCK_PREFIX "addl %1,%0" + : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory"); } #define _raw_spin_relax(lock) cpu_relax() Index: linux-2.6/include/asm-x86/spinlock_64.h === --- linux-2.6.orig/include/asm-x86/spinlock_64.h +++ linux-2.6/include/asm-x86/spinlock_64.h @@ -5,6 +5,7 @@ #include #include #include +#include /* * Your basic SMP spinlocks, allowing only a single CPU anywhere @@ -126,11 +127,19 @@ static inline void __raw_spin_unlock_wai * with the high bit (sign) being the "contended" bit. */ +/** + * read_can_lock - would read_trylock() succeed? + * @lock: the rwlock in question. + */ static inline int __raw_read_can_lock(raw_rwlock_t *lock) { return (int)(lock)->lock > 0; } +/** + * write_can_lock - would write_trylock() succeed? + * @lock: the rwlock in question. 
+ */ static inline int __raw_write_can_lock(raw_rwlock_t *lock) { return (lock)->lock == RW_LOCK_BIAS; @@ -140,18 +149,18 @@ static inline void __raw_read_lock(raw_r { asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t" "jns 1f\n" -"call __read_lock_failed\n" +"call __read_lock_failed\n\t" "1:\n" -::"D"
[patch 4/4] x86: spinlock.h merge
Merge spinlock_32.h and spinlock_64.h into spinlock.h. Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> --- Index: linux-2.6/include/asm-x86/spinlock.h === --- linux-2.6.orig/include/asm-x86/spinlock.h +++ linux-2.6/include/asm-x86/spinlock.h @@ -1,5 +1,211 @@ -#ifdef CONFIG_X86_32 -# include "spinlock_32.h" +#ifndef __ASM_SPINLOCK_H +#define __ASM_SPINLOCK_H + +#include +#include +#include +#include +#include + +/* + * Your basic SMP spinlocks, allowing only a single CPU anywhere + * + * Simple spin lock operations. There are two variants, one clears IRQ's + * on the local processor, one does not. + * + * These are fair FIFO ticket locks, which are currently limited to 256 + * CPUs. + * + * (the type definitions are in asm/spinlock_types.h) + */ + +#if (NR_CPUS > 256) +#error spinlock supports a maximum of 256 CPUs +#endif + +static inline int __raw_spin_is_locked(raw_spinlock_t *lock) +{ + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) != (tmp & 0xff)); +} + +static inline int __raw_spin_is_contended(raw_spinlock_t *lock) +{ + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1; +} + +static inline void __raw_spin_lock(raw_spinlock_t *lock) +{ + short inc = 0x0100; + + /* +* Ticket locks are conceptually two bytes, one indicating the current +* head of the queue, and the other indicating the current tail. The +* lock is acquired by atomically noting the tail and incrementing it +* by one (thus adding ourself to the queue and noting our position), +* then waiting until the head becomes equal to the initial value +* of the tail. +* +* This uses a 16-bit xadd to increment the tail and also load the +* position of the head, which takes care of memory ordering issues +* and should be optimal for the uncontended case. Note the tail must +* be in the high byte, otherwise the 16-bit wide increment of the low +* byte would carry up and contaminate the high byte. 
+*/ + + __asm__ __volatile__ ( + LOCK_PREFIX "xaddw %w0, %1\n" + "1:\t" + "cmpb %h0, %b0\n\t" + "je 2f\n\t" + "rep ; nop\n\t" + "movb %1, %b0\n\t" + /* don't need lfence here, because loads are in-order */ + "jmp 1b\n" + "2:" + :"+Q" (inc), "+m" (lock->slock) + : + :"memory", "cc"); +} + +#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) + +static inline int __raw_spin_trylock(raw_spinlock_t *lock) +{ + short prev; + short new; + int ret = 0; + + asm volatile( + "movw %2,%w0\n\t" + "cmpb %h0,%b0\n\t" + "jne 1f\n\t" + "movw %w0,%w1\n\t" + "incb %h1\n\t" + "lock ; cmpxchgw %w1,%2\n\t" + "decb %h1\n\t" + "cmpw %w0,%w1\n\t" + "jne 1f\n\t" + "movl $1,%3\n\t" + "1:" + :"=a" (prev), "=Q" (new), "+m" (lock->slock), "+m" (ret) + : + : "memory", "cc"); + + return ret; +} + +#if defined(CONFIG_X86_32) && \ + (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)) +/* + * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock + * (PPro errata 66, 92) + */ +# define UNLOCK_LOCK_PREFIX LOCK_PREFIX #else -# include "spinlock_64.h" +# define UNLOCK_LOCK_PREFIX #endif + +static inline void __raw_spin_unlock(raw_spinlock_t *lock) +{ + __asm__ __volatile__( + UNLOCK_LOCK_PREFIX "incb %0" + :"+m" (lock->slock) + : + :"memory", "cc"); +} + +static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock) +{ + while (__raw_spin_is_locked(lock)) + cpu_relax(); +} + +/* + * Read-write spinlocks, allowing multiple readers + * but only one writer. + * + * NOTE! it is quite common to have readers in interrupts + * but no interrupt writers. For those circumstances we + * can "mix" irq-safe locks - any writer needs to get a + * irq-safe write-lock, but readers can get non-irqsafe + * read-locks. + * + * On x86, we implement read-write locks as a 32-bit counter + * with the high bit (sign) being the "contended" bit. + */ + +/**
[patch 1/4] x86: FIFO ticket spinlocks
Introduce ticket lock spinlocks for x86 which are FIFO. The implementation is described in the comments. The straight-line lock/unlock instruction sequence is slightly slower than the dec based locks on modern x86 CPUs, however the difference is quite small on Core2 and Opteron when working out of cache, and becomes almost insignificant even on P4 when the lock misses cache. trylock is more significantly slower, but they are relatively rare. On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticeable, with a userspace test having a difference of up to 2x runtime per thread, and some threads are starved or "unfairly" granted the lock up to 1 000 000 (!) times. After this patch, all threads appear to finish at exactly the same time. The memory ordering of the lock does conform to x86 standards, and the implementation has been reviewed by Intel and AMD engineers. The algorithm also tells us how many CPUs are contending the lock, so lockbreak becomes trivial and we no longer have to waste 4 bytes per spinlock for it. After this, we can no longer spin on any locks with preempt enabled and cannot reenable interrupts when spinning on an irq safe lock, because at that point we have already taken a ticket and we would deadlock if the same CPU tries to take the lock again. These are questionable anyway: if the lock happens to be called under a preempt or interrupt disabled section, then it will just have the same latency problems. The real fix is to keep critical sections short, and ensure locks are reasonably fair (which this patch does). Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> --- Index: linux-2.6/include/asm-x86/spinlock_64.h === --- linux-2.6.orig/include/asm-x86/spinlock_64.h +++ linux-2.6/include/asm-x86/spinlock_64.h @@ -12,74 +12,98 @@ * Simple spin lock operations. There are two variants, one clears IRQ's * on the local processor, one does not. * - * We make no fairness assumptions. They have a cost. 
+ * These are fair FIFO ticket locks, which are currently limited to 256 + * CPUs. * * (the type definitions are in asm/spinlock_types.h) */ +#if (NR_CPUS > 256) +#error spinlock supports a maximum of 256 CPUs +#endif + static inline int __raw_spin_is_locked(raw_spinlock_t *lock) { - return *(volatile signed int *)(&(lock)->slock) <= 0; + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) != (tmp & 0xff)); } -static inline void __raw_spin_lock(raw_spinlock_t *lock) +static inline int __raw_spin_is_contended(raw_spinlock_t *lock) { - asm volatile( - "\n1:\t" - LOCK_PREFIX " ; decl %0\n\t" - "jns 2f\n" - "3:\n" - "rep;nop\n\t" - "cmpl $0,%0\n\t" - "jle 3b\n\t" - "jmp 1b\n" - "2:\t" : "=m" (lock->slock) : : "memory"); + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1; } -/* - * Same as __raw_spin_lock, but reenable interrupts during spinning. - */ -#ifndef CONFIG_PROVE_LOCKING -static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags) +static inline void __raw_spin_lock(raw_spinlock_t *lock) { - asm volatile( - "\n1:\t" - LOCK_PREFIX " ; decl %0\n\t" - "jns 5f\n" - "testl $0x200, %1\n\t" /* interrupts were disabled? */ - "jz 4f\n\t" - "sti\n" - "3:\t" - "rep;nop\n\t" - "cmpl $0, %0\n\t" - "jle 3b\n\t" - "cli\n\t" + short inc = 0x0100; + + /* +* Ticket locks are conceptually two bytes, one indicating the current +* head of the queue, and the other indicating the current tail. The +* lock is acquired by atomically noting the tail and incrementing it +* by one (thus adding ourself to the queue and noting our position), +* then waiting until the head becomes equal to the initial value +* of the tail. +* +* This uses a 16-bit xadd to increment the tail and also load the +* position of the head, which takes care of memory ordering issues +* and should be optimal for the uncontended case. 
Note the tail must +* be in the high byte, otherwise the 16-bit wide increment of the low +* byte would carry up and contaminate the high byte. +*/ + + __asm__ __volatile__ ( + LOCK_PREFIX "xaddw %w0, %1\n" + "1:\t"
[patch 1/4] spinlock: lockbreak cleanup
The break_lock data structure and code for spinlocks is quite nasty. Not only does it double the size of a spinlock but it changes locking to a potentially less optimal trylock. Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a __raw_spin_is_contended that uses the lock data itself to determine whether there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is not set. Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to decouple it from the spinlock implementation, and make it typesafe (rwlocks do not have any need_lockbreak sites -- why do they even get bloated up with that break_lock then?). Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> --- Index: linux-2.6/include/linux/sched.h === --- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -1854,23 +1854,16 @@ extern int cond_resched_softirq(void); /* * Does a critical section need to be broken due to another - * task waiting?: + * task waiting?: (technically does not depend on CONFIG_PREEMPT, + * but a general need for low latency) */ -#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) -# define need_lockbreak(lock) ((lock)->break_lock) -#else -# define need_lockbreak(lock) 0 -#endif - -/* - * Does a critical section need to be broken due to another - * task waiting or preemption being signalled: - */ -static inline int lock_need_resched(spinlock_t *lock) +static inline int spin_needbreak(spinlock_t *lock) { - if (need_lockbreak(lock) || need_resched()) - return 1; +#ifdef CONFIG_PREEMPT + return spin_is_contended(lock); +#else return 0; +#endif } /* Index: linux-2.6/include/linux/spinlock.h === --- linux-2.6.orig/include/linux/spinlock.h +++ linux-2.6/include/linux/spinlock.h @@ -120,6 +120,12 @@ do { \ #define spin_is_locked(lock) __raw_spin_is_locked(&(lock)->raw_lock) +#ifdef CONFIG_GENERIC_LOCKBREAK +#define spin_is_contended(lock) ((lock)->break_lock) +#else +#define spin_is_contended(lock) __raw_spin_is_contended(&(lock)->raw_lock) 
+#endif + /** * spin_unlock_wait - wait until the spinlock gets unlocked * @lock: the spinlock in question. Index: linux-2.6/fs/jbd/checkpoint.c === --- linux-2.6.orig/fs/jbd/checkpoint.c +++ linux-2.6/fs/jbd/checkpoint.c @@ -347,7 +347,8 @@ restart: break; } retry = __process_buffer(journal, jh, bhs, &batch_count); - if (!retry && lock_need_resched(&journal->j_list_lock)){ + if (!retry && (need_resched() || + spin_needbreak(&journal->j_list_lock))) { spin_unlock(&journal->j_list_lock); retry = 1; break; Index: linux-2.6/fs/jbd/commit.c === --- linux-2.6.orig/fs/jbd/commit.c +++ linux-2.6/fs/jbd/commit.c @@ -265,7 +265,7 @@ write_out_data: put_bh(bh); } - if (lock_need_resched(&journal->j_list_lock)) { + if (need_resched() || spin_needbreak(&journal->j_list_lock)) { spin_unlock(&journal->j_list_lock); goto write_out_data; } Index: linux-2.6/fs/jbd2/checkpoint.c === --- linux-2.6.orig/fs/jbd2/checkpoint.c +++ linux-2.6/fs/jbd2/checkpoint.c @@ -347,7 +347,8 @@ restart: break; } retry = __process_buffer(journal, jh, bhs, &batch_count); - if (!retry && lock_need_resched(&journal->j_list_lock)){ + if (!retry && (need_resched() || + spin_needbreak(&journal->j_list_lock))) { spin_unlock(&journal->j_list_lock); retry = 1; break; Index: linux-2.6/fs/jbd2/commit.c === --- linux-2.6.orig/fs/jbd2/commit.c +++ linux-2.6/fs/jbd2/commit.c @@ -265,7 +265,7 @@ write_out_data: put_bh(bh); } - if (lock_need_resched(&journal->j_list_lock)) { + if (need_resched() || spin_needbreak(&journal->j_list_lock)) { spin_unlock(&journal->j_list_lock); goto write_out_data; } Index: linux-2.6/include/linux/spinlock_up.h === --- linux-2.6.orig/include/linux/spinlock_up.
[patch 0/4] ticket spinlocks for x86
Hi, I'd like to propose these patches for the x86 tree for a bit more exposure and testing. Or at least get some discussion going again. Just for fun I also had a shot at merging the headers, as they become a lot more similar after this with the removal of the paravirt crud. Nick
Re: 2.6.24-rc1 eat my photo SD card :-(
On Thursday 01 November 2007 22:56, Romano Giannetti wrote: > Hi, > > I have a very possible regression to signal. This morning 2.6.24-rc1 > eat and destroyed my SD card. I have a toshiba laptop with a card slot > and I have used it with 2.6.23-rcX and 2.6.23 without problems so far. > This morning I put the card in, nothing happened, removed it. When I put > it in again the filesystem in it was completely scr***ed up. > > I have a flight waiting now, so I have put all the dmesgs and syslogs > over there: > > http://www.dea.icai.upcomillas.es/romano/linux/info/2624-rc1-mmc/ > > Sunday I'll be back to help debug it. Thanks for the report. Is it a FAT filesystem? Is it reproducible?
Re: dealing with barriers (was Re: [PATCH] firewire: fw-core: enforce write order when updating fw_device.generation)
On Thursday 01 November 2007 20:51, Stefan Richter wrote: > Nick Piggin wrote: > > On Thursday 01 November 2007 12:49, Stefan Richter wrote: > >> fw_device.node_id and fw_device.generation are accessed without mutexes. > >> We have to ensure that all readers will get to see node_id updates > >> before generation updates. > > > > Hi, a few points: > > Thanks, this is appreciated. > > > - can change it to use spinlocks instead? This would be most > > preferable. > > It could be done, but I don't find it preferable for this purpose. > Although the struct members in question are basically part of fw-core's > API (kernel-internal API), there won't be a lot of places which access > these members. > > (Besides, lockless algorithms are essential to FireWire driver code > everywhere where CPU and controller interact. So IMO these lockless > accesses don't constitute a whole new level of complexity in these > drivers.) Fair enough. > > - if not, you need comments. > > So far I tended to disagree every time when checkpatch.pl bugged me > about barriers without comments. But maybe that was because I had a > wrong idea of what kind of comment should go there. There would > certainly be no value in a comment which effectively says "trust me, I > know we need a barrier here". However, thinking about it again, it > occurs to me that I should add the following comments: > > 1.) At respective struct type definitions: A comment on how struct > members are to be accessed and why. That's something that can be really helpful I think, yes. > 2.) At the occurrences of rmb() and wmb(): A comment on which > accesses the particular barrier divides. This is in order to get > the barriers properly updated (i.e. moved or removed) when > surrounding code is reordered, APIs reworked, or whatever. Yes, this is the main thing to comment at the actual site of the barirer. 
Although it isn't always 100% clear with locks either (because you might do some non-critical operatoins inside a lock section), it just tends to be clearer what it is being locked, and what other paths are being locked against. So describing what accesses are being ordered is good. Also a brief description of how the lockless read-side works, if it's not obvious (or you've already described that in your #1). > Of course this division of the Why and the What only applies to accesses > like those in the patch --- i.e. API-like data types which aren't > abstracted by accessor functions --- while there may be other > requirements on comments for other usage types of barriers. > > Or am I still wrong in my judgment of how barriers should be documented? I think what you've said sounds good. I think that even memory barriers within some subsystem can use commenting, even if only a line or two. > > - you also seem to be missing rmb()s now. I see a couple in the > > firewire directory, but nothing that seems to be ordering loads > > of these particular fields. > > That's right. Of course the wmb()s don't make sense if the reader side > isn't fixed up accordingly. I did this for all places which I spotted > yesterday in the patches > "firewire: fw-core: react on bus resets while the config ROM is being > fetched", http://lkml.org/lkml/2007/10/31/464 (Hmm, this has an unclear > changelog.) > "firewire: fw-sbp2: enforce read order of device generation and node > ID", http://lkml.org/lkml/2007/10/31/465 Ah, right, didn't see them. Thanks, that looks more useful now ;) > I didn't fold all these patches into one because these two other patches > include also other changes to the respective read sides. I should have > pointed this out in this first patch. Sure. I might personally still separate out the memory ordering fix and put all the rmb and wmb in a single patch, but that's no big deal. 
> Also, it could be that I have overlooked one more reader last night; I > will reexamine it. > > > - use smp_*mb() if you are just ordering regular cacheable RAM > > accesses. > > Hmm, right. Looks like the smp_ variants are indeed the right thing to > do here. OK. > > Also, diffstat is a bit wrong... maybe you posted the wrong version? > > Oops, this happened when I worked on what became the "fw-core: react on > bus resets..." patch. So the diffstat is wrong but... Ah, that's no problem. I manage to screw up patches in much worse ways than this, don't worry ;) Thanks, Nick - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 1/4] spinlock: lockbreak cleanup
The break_lock data structure and code for spinlocks is quite nasty. Not only does it double the size of a spinlock but it changes locking to a potentially less optimal trylock. Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a __raw_spin_is_contended that uses the lock data itself to determine whether there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is not set. Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to decouple it from the spinlock implementation, and make it typesafe (rwlocks do not have any need_lockbreak sites -- why do they even get bloated up with that break_lock then?). Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/include/linux/sched.h === --- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -1854,23 +1854,16 @@ extern int cond_resched_softirq(void); /* * Does a critical section need to be broken due to another - * task waiting?: + * task waiting?: (technically does not depend on CONFIG_PREEMPT, + * but a general need for low latency) */ -#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) -# define need_lockbreak(lock) ((lock)->break_lock) -#else -# define need_lockbreak(lock) 0 -#endif - -/* - * Does a critical section need to be broken due to another - * task waiting or preemption being signalled: - */ -static inline int lock_need_resched(spinlock_t *lock) +static inline int spin_needbreak(spinlock_t *lock) { - if (need_lockbreak(lock) || need_resched()) - return 1; +#ifdef CONFIG_PREEMPT + return spin_is_contended(lock); +#else return 0; +#endif } /* Index: linux-2.6/include/linux/spinlock.h === --- linux-2.6.orig/include/linux/spinlock.h +++ linux-2.6/include/linux/spinlock.h @@ -120,6 +120,12 @@ do { \ #define spin_is_locked(lock) __raw_spin_is_locked(&(lock)->raw_lock) +#ifdef CONFIG_GENERIC_LOCKBREAK +#define spin_is_contended(lock) ((lock)->break_lock) +#else +#define spin_is_contended(lock) __raw_spin_is_contended(&(lock)->raw_lock) +#endif +
/** * spin_unlock_wait - wait until the spinlock gets unlocked * @lock: the spinlock in question. Index: linux-2.6/fs/jbd/checkpoint.c === --- linux-2.6.orig/fs/jbd/checkpoint.c +++ linux-2.6/fs/jbd/checkpoint.c @@ -347,7 +347,8 @@ restart: break; } retry = __process_buffer(journal, jh, bhs, &batch_count); - if (!retry && lock_need_resched(&journal->j_list_lock)) { + if (!retry && (need_resched() || + spin_needbreak(&journal->j_list_lock))) { spin_unlock(&journal->j_list_lock); retry = 1; break; Index: linux-2.6/fs/jbd/commit.c === --- linux-2.6.orig/fs/jbd/commit.c +++ linux-2.6/fs/jbd/commit.c @@ -265,7 +265,7 @@ write_out_data: put_bh(bh); } - if (lock_need_resched(&journal->j_list_lock)) { + if (need_resched() || spin_needbreak(&journal->j_list_lock)) { spin_unlock(&journal->j_list_lock); goto write_out_data; } Index: linux-2.6/fs/jbd2/checkpoint.c === --- linux-2.6.orig/fs/jbd2/checkpoint.c +++ linux-2.6/fs/jbd2/checkpoint.c @@ -347,7 +347,8 @@ restart: break; } retry = __process_buffer(journal, jh, bhs, &batch_count); - if (!retry && lock_need_resched(&journal->j_list_lock)) { + if (!retry && (need_resched() || + spin_needbreak(&journal->j_list_lock))) { spin_unlock(&journal->j_list_lock); retry = 1; break; Index: linux-2.6/fs/jbd2/commit.c === --- linux-2.6.orig/fs/jbd2/commit.c +++ linux-2.6/fs/jbd2/commit.c @@ -265,7 +265,7 @@ write_out_data: put_bh(bh); } - if (lock_need_resched(&journal->j_list_lock)) { + if (need_resched() || spin_needbreak(&journal->j_list_lock)) { spin_unlock(&journal->j_list_lock); goto write_out_data; } Index: linux-2.6/include/linux/spinlock_up.h === --- linux-2.6.orig/include/linux/spinlock_up.h +++ linux-2.6/include/linux
[patch 1/4] x86: FIFO ticket spinlocks
Introduce ticket lock spinlocks for x86 which are FIFO. The implementation is described in the comments. The straight-line lock/unlock instruction sequence is slightly slower than the dec based locks on modern x86 CPUs, however the difference is quite small on Core2 and Opteron when working out of cache, and becomes almost insignificant even on P4 when the lock misses cache. trylock is more significantly slower, but they are relatively rare. On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticeable, with a userspace test having a difference of up to 2x runtime per thread, and some threads are starved or unfairly granted the lock up to 1 000 000 (!) times. After this patch, all threads appear to finish at exactly the same time. The memory ordering of the lock does conform to x86 standards, and the implementation has been reviewed by Intel and AMD engineers. The algorithm also tells us how many CPUs are contending the lock, so lockbreak becomes trivial and we no longer have to waste 4 bytes per spinlock for it. After this, we can no longer spin on any locks with preempt enabled and cannot reenable interrupts when spinning on an irq safe lock, because at that point we have already taken a ticket and we would deadlock if the same CPU tries to take the lock again. These are questionable anyway: if the lock happens to be called under a preempt or interrupt disabled section, then it will just have the same latency problems. The real fix is to keep critical sections short, and ensure locks are reasonably fair (which this patch does). Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/include/asm-x86/spinlock_64.h === --- linux-2.6.orig/include/asm-x86/spinlock_64.h +++ linux-2.6/include/asm-x86/spinlock_64.h @@ -12,74 +12,98 @@ * Simple spin lock operations. There are two variants, one clears IRQ's * on the local processor, one does not. * - * We make no fairness assumptions. They have a cost.
+ * These are fair FIFO ticket locks, which are currently limited to 256 + * CPUs. * * (the type definitions are in asm/spinlock_types.h) */ +#if (NR_CPUS > 256) +#error spinlock supports a maximum of 256 CPUs +#endif + static inline int __raw_spin_is_locked(raw_spinlock_t *lock) { - return *(volatile signed int *)(&(lock)->slock) <= 0; + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) != (tmp & 0xff)); } -static inline void __raw_spin_lock(raw_spinlock_t *lock) +static inline int __raw_spin_is_contended(raw_spinlock_t *lock) { - asm volatile( - "\n1:\t" - LOCK_PREFIX " ; decl %0\n\t" - "jns 2f\n" - "3:\n" - "rep;nop\n\t" - "cmpl $0,%0\n\t" - "jle 3b\n\t" - "jmp 1b\n" - "2:\t" : "=m" (lock->slock) : : "memory"); + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1; } -/* - * Same as __raw_spin_lock, but reenable interrupts during spinning. - */ -#ifndef CONFIG_PROVE_LOCKING -static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long flags) +static inline void __raw_spin_lock(raw_spinlock_t *lock) { - asm volatile( - "\n1:\t" - LOCK_PREFIX " ; decl %0\n\t" - "jns 5f\n" - "testl $0x200, %1\n\t" /* interrupts were disabled? */ - "jz 4f\n\t" - "sti\n" - "3:\t" - "rep;nop\n\t" - "cmpl $0, %0\n\t" - "jle 3b\n\t" - "cli\n\t" + short inc = 0x0100; + + /* + * Ticket locks are conceptually two bytes, one indicating the current + * head of the queue, and the other indicating the current tail. The + * lock is acquired by atomically noting the tail and incrementing it + * by one (thus adding ourself to the queue and noting our position), + * then waiting until the head becomes equal to the initial value + * of the tail. + * + * This uses a 16-bit xadd to increment the tail and also load the + * position of the head, which takes care of memory ordering issues + * and should be optimal for the uncontended case.
Note the tail must + * be in the high byte, otherwise the 16-bit wide increment of the low + * byte would carry up and contaminate the high byte. + */ + + __asm__ __volatile__ ( + LOCK_PREFIX "xaddw %w0, %1\n" + "1:\t" + "cmpb %h0, %b0\n\t" + "je 2f\n\t" + "rep ; nop\n\t" + "movb %1, %b0\n\t" + /* don't need lfence here, because loads are in-order */ + "jmp 1b\n" - "4:\t" - "rep;nop\n\t" - "cmpl $0, %0\n\t" - "jg 1b\n\t" - "jmp 4b\n" - "5:\n\t"
[patch 3/4] x86: spinlock.h merge prep
Prepare for merging 32 and 64 bit spinlocks, by making them identical (except for the OOSTORE thing). raw_read_lock and raw_write_lock get a relaxed register constraint, and 64-bit has a few "=m" constraints changed to "+m". I hope these things actually make the code better. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/include/asm-x86/spinlock_32.h === --- linux-2.6.orig/include/asm-x86/spinlock_32.h +++ linux-2.6/include/asm-x86/spinlock_32.h @@ -98,14 +98,15 @@ static inline int __raw_spin_trylock(raw return ret; } -#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE) +#if defined(CONFIG_X86_32) && \ + (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)) /* * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock * (PPro errata 66, 92) */ -#define UNLOCK_LOCK_PREFIX LOCK_PREFIX +# define UNLOCK_LOCK_PREFIX LOCK_PREFIX #else -#define UNLOCK_LOCK_PREFIX +# define UNLOCK_LOCK_PREFIX #endif static inline void __raw_spin_unlock(raw_spinlock_t *lock) @@ -135,49 +136,42 @@ static inline void __raw_spin_unlock_wai * * On x86, we implement read-write locks as a 32-bit counter * with the high bit (sign) being the contended bit. - * - * The inline assembly is non-obvious. Think about it. - * - * Changed to use the same technique as rw semaphores. See - * semaphore.h for details. -ben - * - * the helpers are in arch/i386/kernel/semaphore.c */ /** * read_can_lock - would read_trylock() succeed? * @lock: the rwlock in question. */ -static inline int __raw_read_can_lock(raw_rwlock_t *x) +static inline int __raw_read_can_lock(raw_rwlock_t *lock) { - return (int)(x)->lock > 0; + return (int)(lock)->lock > 0; } /** * write_can_lock - would write_trylock() succeed? * @lock: the rwlock in question.
*/ -static inline int __raw_write_can_lock(raw_rwlock_t *x) +static inline int __raw_write_can_lock(raw_rwlock_t *lock) { - return (x)->lock == RW_LOCK_BIAS; + return (lock)->lock == RW_LOCK_BIAS; } static inline void __raw_read_lock(raw_rwlock_t *rw) { - asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t" + asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t" "jns 1f\n" "call __read_lock_failed\n\t" "1:\n" - ::"a" (rw) : "memory"); + ::"r" (rw) : "memory"); } static inline void __raw_write_lock(raw_rwlock_t *rw) { - asm volatile(LOCK_PREFIX " subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" + asm volatile(LOCK_PREFIX " subl %1,(%0)\n\t" "jz 1f\n" "call __write_lock_failed\n\t" "1:\n" - ::"a" (rw) : "memory"); + ::"r" (rw), "i" (RW_LOCK_BIAS) : "memory"); } static inline int __raw_read_trylock(raw_rwlock_t *lock) @@ -206,8 +200,8 @@ static inline void __raw_read_unlock(raw static inline void __raw_write_unlock(raw_rwlock_t *rw) { - asm volatile(LOCK_PREFIX " addl $" RW_LOCK_BIAS_STR ", %0" - : "+m" (rw->lock) : : "memory"); + asm volatile(LOCK_PREFIX " addl %1,%0" + : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory"); } #define _raw_spin_relax(lock) cpu_relax() Index: linux-2.6/include/asm-x86/spinlock_64.h === --- linux-2.6.orig/include/asm-x86/spinlock_64.h +++ linux-2.6/include/asm-x86/spinlock_64.h @@ -5,6 +5,7 @@ #include <asm/rwlock.h> #include <asm/page.h> #include <asm/processor.h> +#include <linux/compiler.h> /* * Your basic SMP spinlocks, allowing only a single CPU anywhere @@ -126,11 +127,19 @@ static inline void __raw_spin_unlock_wai * with the high bit (sign) being the contended bit. */ +/** + * read_can_lock - would read_trylock() succeed? + * @lock: the rwlock in question. + */ static inline int __raw_read_can_lock(raw_rwlock_t *lock) { return (int)(lock)->lock > 0; } +/** + * write_can_lock - would write_trylock() succeed? + * @lock: the rwlock in question.
+ */ static inline int __raw_write_can_lock(raw_rwlock_t *lock) { return (lock)->lock == RW_LOCK_BIAS; @@ -140,18 +149,18 @@ static inline void __raw_read_lock(raw_r { asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t" "jns 1f\n" - "call __read_lock_failed\n" + "call __read_lock_failed\n\t" "1:\n" - ::"D" (rw), "i" (RW_LOCK_BIAS) : "memory"); + ::"r" (rw) : "memory"); } static inline void __raw_write_lock(raw_rwlock_t *rw) { asm volatile(LOCK_PREFIX " subl %1,(%0)\n\t" "jz 1f\n" - "\tcall __write_lock_failed\n\t" + "call __write_lock_failed\n\t" "1:\n" - ::"D" (rw), "i" (RW_LOCK_BIAS) : "memory"); + ::"r" (rw), "i" (RW_LOCK_BIAS) : "memory"
[patch 4/4] x86: spinlock.h merge
Merge spinlock_32.h and spinlock_64.h into spinlock.h. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/include/asm-x86/spinlock.h === --- linux-2.6.orig/include/asm-x86/spinlock.h +++ linux-2.6/include/asm-x86/spinlock.h @@ -1,5 +1,211 @@ -#ifdef CONFIG_X86_32 -# include "spinlock_32.h" +#ifndef __ASM_SPINLOCK_H +#define __ASM_SPINLOCK_H + +#include <asm/atomic.h> +#include <asm/rwlock.h> +#include <asm/page.h> +#include <asm/processor.h> +#include <linux/compiler.h> + +/* + * Your basic SMP spinlocks, allowing only a single CPU anywhere + * + * Simple spin lock operations. There are two variants, one clears IRQ's + * on the local processor, one does not. + * + * These are fair FIFO ticket locks, which are currently limited to 256 + * CPUs. + * + * (the type definitions are in asm/spinlock_types.h) + */ + +#if (NR_CPUS > 256) +#error spinlock supports a maximum of 256 CPUs +#endif + +static inline int __raw_spin_is_locked(raw_spinlock_t *lock) +{ + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) != (tmp & 0xff)); +} + +static inline int __raw_spin_is_contended(raw_spinlock_t *lock) +{ + int tmp = *(volatile signed int *)(&(lock)->slock); + + return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1; +} + +static inline void __raw_spin_lock(raw_spinlock_t *lock) +{ + short inc = 0x0100; + + /* + * Ticket locks are conceptually two bytes, one indicating the current + * head of the queue, and the other indicating the current tail. The + * lock is acquired by atomically noting the tail and incrementing it + * by one (thus adding ourself to the queue and noting our position), + * then waiting until the head becomes equal to the initial value + * of the tail. + * + * This uses a 16-bit xadd to increment the tail and also load the + * position of the head, which takes care of memory ordering issues + * and should be optimal for the uncontended case.
Note the tail must + * be in the high byte, otherwise the 16-bit wide increment of the low + * byte would carry up and contaminate the high byte. + */ + + __asm__ __volatile__ ( + LOCK_PREFIX "xaddw %w0, %1\n" + "1:\t" + "cmpb %h0, %b0\n\t" + "je 2f\n\t" + "rep ; nop\n\t" + "movb %1, %b0\n\t" + /* don't need lfence here, because loads are in-order */ + "jmp 1b\n" + "2:" + :"+Q" (inc), "+m" (lock->slock) + : + :"memory", "cc"); +} + +#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock) + +static inline int __raw_spin_trylock(raw_spinlock_t *lock) +{ + short prev; + short new; + int ret = 0; + + asm volatile( + "movw %2,%w0\n\t" + "cmpb %h0,%b0\n\t" + "jne 1f\n\t" + "movw %w0,%w1\n\t" + "incb %h1\n\t" + "lock ; cmpxchgw %w1,%2\n\t" + "decb %h1\n\t" + "cmpw %w0,%w1\n\t" + "jne 1f\n\t" + "movl $1,%3\n\t" + "1:" + :"=a" (prev), "=Q" (new), "+m" (lock->slock), "+m" (ret) + : + : "memory", "cc"); + + return ret; +} + +#if defined(CONFIG_X86_32) && \ + (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)) +/* + * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock + * (PPro errata 66, 92) + */ +# define UNLOCK_LOCK_PREFIX LOCK_PREFIX #else -# include "spinlock_64.h" +# define UNLOCK_LOCK_PREFIX #endif + +static inline void __raw_spin_unlock(raw_spinlock_t *lock) +{ + __asm__ __volatile__( + UNLOCK_LOCK_PREFIX "incb %0" + :"+m" (lock->slock) + : + :"memory", "cc"); +} + +static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock) +{ + while (__raw_spin_is_locked(lock)) + cpu_relax(); +} + +/* + * Read-write spinlocks, allowing multiple readers + * but only one writer. + * + * NOTE! it is quite common to have readers in interrupts + * but no interrupt writers. For those circumstances we + * can "mix" irq-safe locks - any writer needs to get a + * irq-safe write-lock, but readers can get non-irqsafe + * read-locks. + * + * On x86, we implement read-write locks as a 32-bit counter + * with the high bit (sign) being the contended bit. + */ + +/** + * read_can_lock - would read_trylock() succeed?
+ * @lock: the rwlock in question. + */ +static inline int __raw_read_can_lock(raw_rwlock_t *lock) +{ + return (int)(lock)->lock > 0; +} + +/** + * write_can_lock - would write_trylock() succeed? + * @lock: the rwlock in question. + */ +static inline int __raw_write_can_lock(raw_rwlock_t *lock) +{ + return (lock)->lock == RW_LOCK_BIAS; +} + +static inline void __raw_read_lock(raw_rwlock_t *rw) +{ + asm
Re: [patch 1/4] spinlock: lockbreak cleanup
On Thu, Nov 01, 2007 at 03:06:05PM +0100, Peter Zijlstra wrote: > On Thu, 2007-11-01 at 15:02 +0100, Nick Piggin wrote: > > Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to decouple it from the spinlock implementation, and make it typesafe (rwlocks do not have any need_lockbreak sites -- why do they even get bloated up with that break_lock then?). > IIRC Lee has a few patches floating about that do introduce lockbreak stuff for rwlocks. Well that would be a good reason to introduce a break_lock for them, but previously not so much... we have rwlocks in some slightly space critical structures (vmas, inodes, etc). I guess it was done to make the template hacks easier. I don't really find that in good taste, especially for important core infrastructure. Anyway. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Thu, Nov 01, 2007 at 08:14:47AM -0700, Linus Torvalds wrote: > On Thu, 1 Nov 2007, Nick Piggin wrote: > > On Wed, Oct 31, 2007 at 04:08:21PM -0700, Linus Torvalds wrote: > > > We made much bigger changes to ptrace support when we disallowed writing to read-only shared memory areas (we used to do the magic per-page COW thing). > > Really? No, we still do that magic COW thing which creates anonymous pages in MAP_SHARED vmas, don't we? > No, we don't. I'm pretty sure. It didn't work with the VM cleanups, since the MAP_SHARED vma's won't be on the anonymous list any more, and cannot be swapped out. So now, if you try to write to a read-only shared page through ptrace, you'll get "Unable to access". No, it COWs it (the file is RW). I believe do_wp_page will still attach an anon_vma to the vma, which will make the pte discoverable via rmap. > Of course, I didn't really look closely, so maybe I just don't remember things right.. access_vm_pages() (things like core-dumping comes to mind - although I think we don't dump pure file mappings at all, do we?) it would certainly be good to run any such tests on the current -git tree... We do for MAP_SHARED|MAP_ANONYMOUS, by the looks. Well, as we should. There's no way for a debugger to get those pages back. So that all looks sane. > - vm_flags |= VM_SHARED | VM_MAYSHARE; - if (!(file->f_mode & FMODE_WRITE)) - vm_flags &= ~(VM_MAYWRITE | VM_SHARED); + vm_flags |= VM_MAYSHARE; + if (file->f_mode & FMODE_WRITE) + vm_flags |= VM_SHARED; + if (!(vm_flags & VM_WRITE)) + vm_flags &= ~VM_MAYWRITE; > This looks totally bogus. What was the intent of this patch? The VM_MAYWRITE flag is *not* supposed to track the VM_WRITE flag: that would defeat the whole purpose of it! The whole point of that flag is to say whether mprotect() could turn it into a VM_WRITE mapping, and it depends on the file mode, not VM_WRITE! Yeah of course that won't work, stupid...
The intent is to stop get_user_pages from proceeding with a write fault (and subsequent COW) to readonly shared mappings, when force is set. I think it can be done simply via get_user_pages(), which is what I should have done to begin with. Untested patch follows --- Index: linux-2.6/mm/memory.c === --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -1031,7 +1031,9 @@ int get_user_pages(struct task_struct *t } if (!vma || (vma->vm_flags & (VM_IO | VM_PFNMAP)) - || !(vm_flags & vma->vm_flags)) + || !(vm_flags & vma->vm_flags) + || (write && ((vma->vm_flags + & (VM_SHARED|VM_MAYSHARE)) == VM_MAYSHARE))) return i ? : -EFAULT; if (is_vm_hugetlb_page(vma)) { @@ -1563,13 +1565,11 @@ static int do_wp_page(struct mm_struct * reuse = can_share_swap_page(old_page); unlock_page(old_page); } - } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED))) { + } else if (unlikely((vma->vm_flags & VM_SHARED))) { /* - * Only catch write-faults on shared writable pages, - * read-only shared pages can get COWed by - * get_user_pages(.write=1, .force=1). + * Only catch write-faults on shared writable pages. */ + BUG_ON(!(vma->vm_flags & VM_WRITE)); if (vma->vm_ops && vma->vm_ops->page_mkwrite) { /* * Notify the address space that the page is about to - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/4] spinlock: lockbreak cleanup
On Thu, Nov 01, 2007 at 04:46:36PM +0100, Ingo Molnar wrote: > * Lee Schermerhorn <[EMAIL PROTECTED]> wrote: > > > I guess it was done to make the template hacks easier. I don't really find that in good taste, especially for important core infrastructure. Anyway. > > Actually, what I had/have is a cond_resched_rwlock() that I needed to convert the i_mmap_lock() to rw for testing reclaim scalability. [I've seen a large system running an Oracle OLTP load hang spitting cpu soft lockup messages with all cpus spinning on a i_mmap_lock spin lock.] One of the i_mmap_lock paths uses cond_resched_lock() for spin locks. To do a straight forward conversion [and maybe that isn't the right approach], I created the cond_resched_rwlock() function by generalizing the cond_resched_lock() code and creating both spin and rw lock wrappers. I took advantage of the fact that, currently, need_lockbreak() is a macro and that both spin and rw locks have/had the break_lock member. Typesafe functions would probably be preferable, if we want to keep break_lock for rw spin locks. Here's the most recent posting: http://marc.info/?l=linux-mm&m=118980356306014&w=4 See the changes to sched.[ch]. Should apply to 23-mm1 with offsets and minor fixup in fs/inode.c. > yep. I'm too in favor of keeping the need-lockbreak mechanism and its type-insensitive data structure. We've got way too many locking primitives and keeping them all sorted is nontrivial already. I think a large contributor to that is being a bit clever with indirections and cute code (eg. like this template stuff), rather than having two types of spinlocks instead of one. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 1/4] x86: FIFO ticket spinlocks
On Thu, Nov 01, 2007 at 04:01:45PM -0400, Chuck Ebbert wrote:
> On 11/01/2007 10:03 AM, Nick Piggin wrote:
>
> [edited to show the resulting code]
>
> +	__asm__ __volatile__ (
> +		LOCK_PREFIX "xaddw %w0, %1\n"
> +		"1:\t"
> +		"cmpb %h0, %b0\n\t"
> +		"je 2f\n\t"
> +		"rep ; nop\n\t"
> +		"movb %1, %b0\n\t"
> +		/* don't need lfence here, because loads are in-order */
> +		"jmp 1b\n"
> +		"2:"
> +		: "+Q" (inc), "+m" (lock->slock)
> +		:
> +		: "memory", "cc");
> }
>
> If you really thought you might get long queues, you could figure out
> how far back you are and use that to determine how long to wait before
> testing the lock again. That cmpb could become a subb without adding
> overhead to the fast path -- that would give you the queue length (or
> its complement anyway.)

Indeed. You can use this as a really nice input into a backoff
algorithm (eg. if you're next in line, don't back off, or at least
don't go into exponential backoff; if you've got people in front of
you, start throttling harder). I think I'll leave that to SGI if they
come up with a big x86 SSI ;)
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Thu, Nov 01, 2007 at 09:08:45AM -0700, Linus Torvalds wrote:
> On Thu, 1 Nov 2007, Nick Piggin wrote:
> >
> > Untested patch follows
>
> Ok, this looks ok. Except I would remove the VM_MAYSHARE bit from the
> test.

But we do want to allow forced COW faults for MAP_PRIVATE mappings.
gdb uses this for inserting breakpoints (but fortunately, a COW page
in a MAP_PRIVATE mapping is a much more natural thing for the VM).

> That whole bit should go, in fact. We used to make it something
> different: iirc, a read-only SHARED mapping was downgraded to a
> non-shared mapping, because we wanted to avoid some of the costs we
> used to have with the VM implementation (actually, I think it was
> various filesystems that don't like shared mappings because they
> don't have a per-page writeback). But we left the VM_MAYSHARE bit on,
> to get /proc/pid/mmap things right. Or something like that. I forget
> the details. But I *think* we don't actually need this any more.
>
> But basically, the right way to test for shared mappings is
> historically to just test the VM_MAYSHARE bit - but not *both* bits.
> Because VM_SHARED may have been artificially cleared.

I think you're right -- VM_MAYSHARE is basically testing for
MAP_SHARED. I just don't know exactly what you're proposing here.

Thanks,
Nick
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Thu, Nov 01, 2007 at 06:17:42PM -0700, Linus Torvalds wrote:
> On Fri, 2 Nov 2007, Nick Piggin wrote:
> >
> > But we do want to allow forced COW faults for MAP_PRIVATE mappings.
> > gdb uses this for inserting breakpoints (but fortunately, a COW page
> > in a MAP_PRIVATE mapping is a much more natural thing for the VM).
>
> Yes, I phrased that badly. I meant that I'd be happier if we got rid
> of VM_MAYSHARE entirely, and just used VM_SHARED. I thought we already
> made them always be the same (and any VM_MAYSHARE use is historical).

Oh yeah, I think it would probably be clearer to use VM_SHARED ==
MAP_SHARED, and test the write permission explicitly. Though there
could be something I missed that makes it not as easy as it sounds...
probably something best left for Hugh ;)
Re: [patch 1/4] x86: FIFO ticket spinlocks
On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
> On Thu, 1 Nov 2007, Rik van Riel wrote:
> > Larry Woodman managed to wedge the VM into a state where, on his 4x
> > dual core system, only 2 cores (on the same CPU) could get the
> > zone->lru_lock overnight. The other 6 cores on the system were just
> > spinning, without being able to get the lock.

That's quite incredible, considering that the CPUs actually _taking_
the locks also drop the locks and do quite a bit of work before taking
them again (ie. they take them to pull pages off the LRU, but then do
a reasonable amount of work to remove each one from pagecache before
refilling from the LRU). Possibly actually that is a *more* difficult
case for the HW to handle: once the CPU actually goes away and operates
on other cachelines, it may get a little more difficult to detect that
it is causing starvation issues.

> .. and this is almost always the result of a locking *bug*, not
> unfairness per se. IOW, unfairness just ends up showing the bug in
> the first place.

I'd almost agree, but there are always going to be corner cases where
we get multiple contentions on a spinlock -- the fact that a lock is
needed at all obviously suggests that it can be contended. The LRU
locking could be improved, but you could have eg. scheduler runqueue
lock starvation if the planets lined up just right, and it is a little
more difficult to improve on runqueue locking.

Anyway, I also think this is partially a hardware issue, and as
multiple cores, threads, and sockets get more common, I hope it will
improve (it affects Intel CPUs as well as AMD). So it is possible to
have an option to switch between locks if the hardware is fairer, but
I want to get as much exposure with this locking as possible for now,
to see if there is any funny performance corner cases exposed (which
quite possibly will turn out to be caused by suboptimal locking
itself).

Anyway, if this can make its way to the x86 tree, I think it will get
pulled into -mm (?) and get some exposure...
Re: filp usage when cpu busy
On Thursday 01 November 2007 12:56, bc Wong (chimwong) wrote: > Hi, > > With 2.6.16 x86_64 on a 4 core machine, I noticed > that the filp usage (according to /proc/slabinfo) > shoots up and keeps on increasing sharply when one > of the CPUs is (1) locked up, or (2) very busy > doing a lot of printk()'s with KERN_EMERG. > > In the case of (1), it's permanent until it runs > out of memory eventually. For (2), it's temporary; > filp count comes back down when the printk()'s are > done. > > I can't think of any relationship between a busy/ > locked-up CPU and filp count. The system is still > functional. New short-lived processes kept being > created, but the overall number of processes is > stable. > > Does anyone know why filp count would go up like > that? Yeah, it's probably because filp structures are freed by RCU, and if you have a locked up CPU then it can't go through a quiescent state so RCU stops freeing your filps. If you add some cond_resched()s to your code, you should find that RCU will force a reschedule and things will work (actually, for 2.6.16, I'm not sure if RCU had the code to force a reschedule... it's force_quiescent_state() in kernel/rcupdate.c upstream). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] firewire: fw-core: enforce write order when updating fw_device.generation
On Thursday 01 November 2007 12:49, Stefan Richter wrote:
> fw_device.node_id and fw_device.generation are accessed without
> mutexes. We have to ensure that all readers will get to see node_id
> updates before generation updates.

Hi, a few points:

- can change it to use spinlocks instead? This would be most
  preferable.
- if not, you need comments.
- you also seem to be missing rmb()s now. I see a couple in the
  firewire directory, but nothing that seems to be ordering loads of
  these particular fields.
- use smp_*mb() if you are just ordering regular cacheable RAM
  accesses.

Also, diffstat is a bit wrong... maybe you posted the wrong version?

> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>
> ---
>  drivers/firewire/fw-device.c   |  6 ++
>  drivers/firewire/fw-topology.c |  1 +
>  2 files changed, 7 insertions(+)
>
> Index: linux/drivers/firewire/fw-device.c
> ===================================================================
> --- linux.orig/drivers/firewire/fw-device.c
> +++ linux/drivers/firewire/fw-device.c
> @@ -808,6 +813,7 @@ void fw_node_event(struct fw_card *card,
>
>  		device = node->data;
>  		device->node_id = node->node_id;
> +		wmb();
>  		device->generation = card->generation;
>  		if (atomic_read(&device->state) == FW_DEVICE_RUNNING) {
>  			PREPARE_DELAYED_WORK(&device->work, fw_device_update);
> Index: linux/drivers/firewire/fw-topology.c
> ===================================================================
> --- linux.orig/drivers/firewire/fw-topology.c
> +++ linux/drivers/firewire/fw-topology.c
> @@ -518,6 +518,7 @@ fw_core_handle_bus_reset(struct fw_card
>  	card->bm_retries = 0;
>
>  	card->node_id = node_id;
> +	wmb();
>  	card->generation = generation;
>  	card->reset_jiffies = jiffies;
>  	schedule_delayed_work(&card->work, 0);
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Wed, Oct 31, 2007 at 04:08:21PM -0700, Linus Torvalds wrote: > > > On Wed, 31 Oct 2007, Nick Piggin wrote: > > > > No that would be great. Fingers crossed it won't cause any problems. > > I actually doubt it will cause problems. > > We made much bigger changes to ptrace support when we disallowed writing > to read-only shared memory areas (we used to do the magic per-page COW > thing). Really? No, we still do that magic COW thing which creates anonymous pages in MAP_SHARED vmas, don't we? But I think that is pretty arbitrary as well. What the "force" argument to get_user_pages on a MAP_SHARED vma effectively does is punt the permission check from the vma permissions to the file permissions. If you've opened the file WR but mmap()ed PROT_READ, it goes ahead and COWs, whereas if you've opened the file RDONLY it -EFAULTs. This seems wrong, because the act of COWing seems to be exactly done in order to *absolve* us of file permissions (if we're COWing, why should it matter if the file happened to be readonly or not). Could we just try to disallow COW for MAP_SHARED mappings? Seeing as we're doing the break-ptrace thing this release already ;) All the stuff I know of that requires forced COW (ie. gdb for breakpoints), does so on MAP_PRIVATE mappings. I know this is one of Hugh's bugbears. I don't know whether he's thought out how to remove it, but I wonder if the appended patch is roughly in the right direction? > But if people have test-cases for ptrace and/or other magical users of > access_vm_pages() (things like core-dumping comes to mind - although I > think we don't dump pure file mappings at all, do we?) it would certainly > be good to run any such tests on the current -git tree... We do for MAP_SHARED|MAP_ANONYMOUS, by the looks. 
--- Index: linux-2.6/mm/memory.c === --- linux-2.6.orig/mm/memory.c +++ linux-2.6/mm/memory.c @@ -1563,13 +1563,11 @@ static int do_wp_page(struct mm_struct * reuse = can_share_swap_page(old_page); unlock_page(old_page); } - } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == - (VM_WRITE|VM_SHARED))) { + } else if (unlikely((vma->vm_flags & VM_SHARED))) { /* -* Only catch write-faults on shared writable pages, -* read-only shared pages can get COWed by -* get_user_pages(.write=1, .force=1). +* Only catch write-faults on shared writable pages. */ + BUG_ON(!(vma->vm_flags & VM_WRITE)); if (vma->vm_ops && vma->vm_ops->page_mkwrite) { /* * Notify the address space that the page is about to Index: linux-2.6/mm/mmap.c === --- linux-2.6.orig/mm/mmap.c +++ linux-2.6/mm/mmap.c @@ -980,9 +980,11 @@ unsigned long do_mmap_pgoff(struct file if (locks_verify_locked(inode)) return -EAGAIN; - vm_flags |= VM_SHARED | VM_MAYSHARE; - if (!(file->f_mode & FMODE_WRITE)) - vm_flags &= ~(VM_MAYWRITE | VM_SHARED); + vm_flags |= VM_MAYSHARE; + if (file->f_mode & FMODE_WRITE) + vm_flags |= VM_SHARED; + if (!(vm_flags & VM_WRITE)) + vm_flags &= ~VM_MAYWRITE; /* fall through */ case MAP_PRIVATE: @@ -998,6 +1000,7 @@ unsigned long do_mmap_pgoff(struct file if (!file->f_op || !file->f_op->mmap) return -ENODEV; + break; default: @@ -1007,6 +1010,8 @@ unsigned long do_mmap_pgoff(struct file switch (flags & MAP_TYPE) { case MAP_SHARED: vm_flags |= VM_SHARED | VM_MAYSHARE; + if (!(vm_flags & VM_WRITE)) + vm_flags &= ~VM_MAYWRITE; break; case MAP_PRIVATE: /* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Wed, Oct 31, 2007 at 08:59:41AM -0700, Linus Torvalds wrote: > > > On Wed, 31 Oct 2007, Nick Piggin wrote: > > > > Well the patch is right, in the context of the regression I introduced > > (and so it should probably go into 2.6.23). > > Yeah, it probably is fine for -stable. > > And if mine (which actually changes behaviour, in that it makes ptrace get > an access error) causes regressions, I guess we'll have to use that > compatible-with-old-behaviour one for 2.6.24 too. > > But I just rebooted and tested - the cleaned-up patch does seem to work > fine, and I get "Cannot access memory at address " rather than any > reported problem. > > So I think I'll commit my version asap, and see if anybody reports that > they have a situation where they use ptrace() and expect zero back from a > shared mapping past the end.. And if there are issues, we can switch back > to the old broken behaviour with your patch, No that would be great. Fingers crossed it won't cause any problems. Thanks, all. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Wed, Oct 31, 2007 at 08:11:10AM -0700, Linus Torvalds wrote: > > > On Wed, 31 Oct 2007, Nick Piggin wrote: > > > > However I actually don't really like how this all works. I don't like that > > filemap.c should have to know about ptrace, or exactly what ptrace wants > > here. > > It shouldn't. It should just fail when it fails. Then, handle_mm_fault() > should return an error code, which should cause get_user_pages() to return > an error code. Which should make ptrace just stop. > > So I think your patch is wrong. mm/filemap.c should *not* care about who > does the fault. I think the proper patch is something untested like the > appended... Well the patch is right, in the context of the regression I introduced (and so it should probably go into 2.6.23). I'd love to get rid of that outside data content crap if possible in 2.6.24. I think you're the one who has the best feeling for the ptrace cases we have in the VM, so I trust you ;) > > It's a bit hairy to force insert page into pagecache and pte into pagetables > > here, given the races. > > It's also wrong. They shouldn't be in the page cache, since that can cause > problems with truncate etc. Maybe it doesn't any more, but it's reasonable > to believe that a page outside of i_size should not exist. No I believe it could still be a problem (and at least, it is quite fragile). > > In access_process_vm, can't we just zero fill in the case of a sigbus? > > Linus? > > That will also avoid changing applicatoin behaviour due to a gdb read... > > access_process_vm() should just return how many bytes it could fill (which > means a truncated copy - very including zero bytes - for an error), and > the caller should decide what the right thing to do is. > > But who knows, maybe I missed something. I hope it works. > Duane? Does this fix things for you? 
> > Linus > > --- > mm/filemap.c | 13 ++--- > 1 files changed, 2 insertions(+), 11 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 9940895..188cf5f 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1300,7 +1300,7 @@ int filemap_fault(struct vm_area_struct *vma, struct > vm_fault *vmf) > > size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; > if (vmf->pgoff >= size) > - goto outside_data_content; > + return VM_FAULT_SIGBUS; > > /* If we don't want any read-ahead, don't bother */ > if (VM_RandomReadHint(vma)) > @@ -1377,7 +1377,7 @@ retry_find: > if (unlikely(vmf->pgoff >= size)) { > unlock_page(page); > page_cache_release(page); > - goto outside_data_content; > + return VM_FAULT_SIGBUS; > } > > /* > @@ -1388,15 +1388,6 @@ retry_find: > vmf->page = page; > return ret | VM_FAULT_LOCKED; > > -outside_data_content: > - /* > - * An external ptracer can access pages that normally aren't > - * accessible.. > - */ > - if (vma->vm_mm == current->mm) > - return VM_FAULT_SIGBUS; > - > - /* Fall through to the non-read-ahead case */ > no_cached_page: > /* >* We're only likely to ever get here if MADV_RANDOM is in - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
On Wednesday 31 October 2007 23:17, Peter Zijlstra wrote: > On Wed, 2007-10-31 at 21:46 +1100, Nick Piggin wrote: > > And I'd prevent these ones from doing so. > > > > Without keeping track of "reserve" pages, which doesn't feel > > too clean. > > The problem with that is that once a slab was allocated with the right > allocation context, anybody can get objects from these slabs. [snip] I understand that. > So we either reserve a page per object, which for 32 byte objects is a > large waste, or we stop anybody who doesn't have the right permissions > from obtaining objects. I took the latter approach. What I'm saying is that the slab allocator slowpath should always just check watermarks against the current task. Instead of this ->reserve stuff. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote: > On Wed, 2007-10-31 at 14:51 +1100, Nick Piggin wrote: > > On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: > > > Allow PF_MEMALLOC to be set in softirq context. When running softirqs > > > from a borrowed context save current->flags, ksoftirqd will have its > > > own task_struct. > > > > What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess > > that some networking thing must be picking it up in a subsequent patch, > > but I'm too lazy to look!)... Again, can you have more of a rationale in > > your patch headers, or ref the patch that uses it... thanks > > Right, I knew I was forgetting something in these changelogs. > > The network stack does quite a bit of packet processing from softirq > context. Once you start swapping over network, some of the packets want > to be processed under PF_MEMALLOC. Hmm... what about processing from interrupt context? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote: > On Wed, 2007-10-31 at 14:37 +1100, Nick Piggin wrote: > > On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: > > > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation > > > contexts that are entitled to it. > > > > > > Care is taken to only touch the SLUB slow path. > > > > > > This is done to ensure reserve pages don't leak out and get consumed. > > > > I think this is generally a good idea (to prevent slab allocators > > from stealing reserve). However I naively think the implementation > > is a bit overengineered and thus has a few holes. > > > > Humour me, what was the problem with failing the slab allocation > > (actually, not fail but just call into the page allocator to do > > correct waiting / reclaim) in the slowpath if the process fails the > > watermark checks? > > Ah, we actually need slabs below the watermarks. Right, I'd still allow those guys to allocate slabs. Provided they have the right allocation context, right? > Its just that once I > allocated those slabs using __GFP_MEMALLOC/PF_MEMALLOC I don't want > allocation contexts that do not have rights to those pages to walk off > with objects. And I'd prevent these ones from doing so. Without keeping track of "reserve" pages, which doesn't feel too clean. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Swap delay accounting, include lock_page() delays
On Wednesday 31 October 2007 18:41, Nick Piggin wrote: > On Wednesday 31 October 2007 18:52, Balbir Singh wrote: > > Reported-by: Nick Piggin <[EMAIL PROTECTED]> > > > > The delay incurred in lock_page() should also be accounted in swap delay > > accounting > > > > Signed-off-by: Balbir Singh <[EMAIL PROTECTED]> > > Ah right, I forgot to resend this one, sorry. Thanks for remembering. Although, I think I had a bit more detail in the changelog which I think should be kept. Basically, swap delay accounting seems quite broken as of now, because what it is counting is the time required to allocate a new page and submit the IO, but not actually the time to perform the IO at all (which I'd expect will be significant, although possibly in some workloads the actual page allocation will dominate). > > > --- > > > > mm/memory.c |2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff -puN mm/swapfile.c~fix-delay-accounting-swap-accounting > > mm/swapfile.c diff -puN mm/memory.c~fix-delay-accounting-swap-accounting > > mm/memory.c --- > > linux-2.6-latest/mm/memory.c~fix-delay-accounting-swap-accounting 2007-10 > >-3 1 12:58:05.0 +0530 +++ > > linux-2.6-latest-balbir/mm/memory.c 2007-10-31 13:02:50.0 +0530 > > @@ -2084,9 +2084,9 @@ static int do_swap_page(struct mm_struct > > count_vm_event(PGMAJFAULT); > > } > > > > - delayacct_clear_flag(DELAYACCT_PF_SWAPIN); > > mark_page_accessed(page); > > lock_page(page); > > + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); > > > > /* > > * Back out if somebody else already faulted in this pte. > > _ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Swap delay accounting, include lock_page() delays
On Wednesday 31 October 2007 18:52, Balbir Singh wrote: > Reported-by: Nick Piggin <[EMAIL PROTECTED]> > > The delay incurred in lock_page() should also be accounted in swap delay > accounting > > Signed-off-by: Balbir Singh <[EMAIL PROTECTED]> Ah right, I forgot to resend this one, sorry. Thanks for remembering. > --- > > mm/memory.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff -puN mm/swapfile.c~fix-delay-accounting-swap-accounting mm/swapfile.c > diff -puN mm/memory.c~fix-delay-accounting-swap-accounting mm/memory.c > --- > linux-2.6-latest/mm/memory.c~fix-delay-accounting-swap-accounting > 2007-10-3 >1 12:58:05.0 +0530 +++ > linux-2.6-latest-balbir/mm/memory.c 2007-10-31 13:02:50.0 +0530 @@ > -2084,9 +2084,9 @@ static int do_swap_page(struct mm_struct > count_vm_event(PGMAJFAULT); > } > > - delayacct_clear_flag(DELAYACCT_PF_SWAPIN); > mark_page_accessed(page); > lock_page(page); > + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); > > /* >* Back out if somebody else already faulted in this pte. > _ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Tue, Oct 30, 2007 at 11:56:00PM -0700, David Miller wrote: > From: Nick Piggin <[EMAIL PROTECTED]> > Date: Wed, 31 Oct 2007 07:42:21 +0100 > > > Sysrq+T fails to show the stack trace of a running task. Presumably this > > is to avoid a garbled stack, however it can often be useful, and besides > > there is no guarantee that the task won't start running in the middle of > > show_stack(). If there are any correctness issues, then the architecture > > would have to take further steps to ensure the task is not running. > > > > Signed-off-by: Nick Piggin <[EMAIL PROTECTED]> > > This is useful. > > Even more useful would be a show_regs() on the cpu where running tasks > are running. If not a full show_regs() at least a program counter. Yes, that's something I've needed as well. And I doubt that we're very unusual among kernel developers in this regard ;) > That's usually what you're trying to debug and we provide nearly no > way to handle: some task is stuck in a loop in kernel mode and you > need to know exactly where that is. > > This is pretty easy to do on sparc64. In fact I can capture remote > cpu registers even when that CPU's interrupts are disabled. I suppose > other arches could do a NMI'ish register capture like this as well. > > I have a few bug reports that I can't make more progress on because I > currently can't ask users to do something to fetch the registers on > the seemingly hung processor. This is why I'm harping on this so > much :-) > > Anyways, my core suggestion is to add a hook here so platforms can > do the remote register fetch if they want. You could possibly even do a generic "best effort" kind of thing with regular IPIs, that will timeout and continue if some CPUs don't handle them, and should be pretty easy to get working with existing smp_call_function stuff. Not exactly clean, but it would be better than nothing. 
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Tue, Oct 30, 2007 at 11:56:00PM -0700, David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Wed, 31 Oct 2007 07:42:21 +0100 Sysrq+T fails to show the stack trace of a running task. Presumably this is to avoid a garbled stack, however it can often be useful, and besides there is no guarantee that the task won't start running in the middle of show_stack(). If there are any correctness issues, then the archietcture would have to take further steps to ensure the task is not running. Signed-off-by: Nick Piggin [EMAIL PROTECTED] This is useful. Even more useful would be a show_regs() on the cpu where running tasks are running. If not a full show_regs() at least a program counter. Yes, that's something I've needed as well. And I doubt that we're very unusual among kernel developers in this regard ;) That's usually what you're trying to debug and we provide nearly no way to handle: some task is stuck in a loop in kernel mode and you need to know exactly where that is. This is pretty easy to do on sparc64. In fact I can capture remote cpu registers even when that CPU's interrupts are disabled. I suppose other arches could do a NMI'ish register capture like this as well. I have a few bug reports that I can't make more progress on because I currently can't ask users to do something to fetch the registers on the seemingly hung processor. This is why I'm harping on this so much :-) Anyways, my core suggestion is to add a hook here so platforms can do the remote register fetch if they want. You could possibly even do a generic best effort kind of thing with regular IPIs, that will timeout and continue if some CPUs don't handle them, and should be pretty easy to get working with existing smp_call_ function stuff. Not exactly clean, but it would be better than nothing. 
- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Swap delay accounting, include lock_page() delays
On Wednesday 31 October 2007 18:52, Balbir Singh wrote: Reported-by: Nick Piggin [EMAIL PROTECTED] The delay incurred in lock_page() should also be accounted in swap delay accounting Signed-off-by: Balbir Singh [EMAIL PROTECTED] Ah right, I forgot to resend this one, sorry. Thanks for remembering. --- mm/memory.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN mm/swapfile.c~fix-delay-accounting-swap-accounting mm/swapfile.c diff -puN mm/memory.c~fix-delay-accounting-swap-accounting mm/memory.c --- linux-2.6-latest/mm/memory.c~fix-delay-accounting-swap-accounting 2007-10-3 1 12:58:05.0 +0530 +++ linux-2.6-latest-balbir/mm/memory.c 2007-10-31 13:02:50.0 +0530 @@ -2084,9 +2084,9 @@ static int do_swap_page(struct mm_struct count_vm_event(PGMAJFAULT); } - delayacct_clear_flag(DELAYACCT_PF_SWAPIN); mark_page_accessed(page); lock_page(page); + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); /* * Back out if somebody else already faulted in this pte. _ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Swap delay accounting, include lock_page() delays
On Wednesday 31 October 2007 18:41, Nick Piggin wrote: On Wednesday 31 October 2007 18:52, Balbir Singh wrote: Reported-by: Nick Piggin [EMAIL PROTECTED] The delay incurred in lock_page() should also be accounted in swap delay accounting Signed-off-by: Balbir Singh [EMAIL PROTECTED] Ah right, I forgot to resend this one, sorry. Thanks for remembering. Although, I think I had a bit more detail in the changelog which I think should be kept. Basically, swap delay accounting seems quite broken as of now, because what it is counting is the time required to allocate a new page and submit the IO, but not actually the time to perform the IO at all (which I'd expect will be significant, although possibly in some workloads the actual page allocation will dominate). --- mm/memory.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff -puN mm/swapfile.c~fix-delay-accounting-swap-accounting mm/swapfile.c diff -puN mm/memory.c~fix-delay-accounting-swap-accounting mm/memory.c --- linux-2.6-latest/mm/memory.c~fix-delay-accounting-swap-accounting 2007-10 -3 1 12:58:05.0 +0530 +++ linux-2.6-latest-balbir/mm/memory.c 2007-10-31 13:02:50.0 +0530 @@ -2084,9 +2084,9 @@ static int do_swap_page(struct mm_struct count_vm_event(PGMAJFAULT); } - delayacct_clear_flag(DELAYACCT_PF_SWAPIN); mark_page_accessed(page); lock_page(page); + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); /* * Back out if somebody else already faulted in this pte. _ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote:
> On Wed, 2007-10-31 at 14:51 +1100, Nick Piggin wrote:
> > On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > > Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
> > > a borrowed context save current->flags, ksoftirqd will have its own
> > > task_struct.
> >
> > What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess
> > that some networking thing must be picking it up in a subsequent
> > patch, but I'm too lazy to look!)...
> >
> > Again, can you have more of a rationale in your patch headers, or ref
> > the patch that uses it... thanks
>
> Right, I knew I was forgetting something in these changelogs.
>
> The network stack does quite a bit of packet processing from softirq
> context. Once you start swapping over network, some of the packets
> want to be processed under PF_MEMALLOC.

Hmm... what about processing from interrupt context?
Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote: On Wed, 2007-10-31 at 14:37 +1100, Nick Piggin wrote: On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation contexts that are entitled to it. Care is taken to only touch the SLUB slow path. This is done to ensure reserve pages don't leak out and get consumed. I think this is generally a good idea (to prevent slab allocators from stealing reserve). However I naively think the implementation is a bit overengineered and thus has a few holes. Humour me, what was the problem with failing the slab allocation (actually, not fail but just call into the page allocator to do correct waiting / reclaim) in the slowpath if the process fails the watermark checks? Ah, we actually need slabs below the watermarks. Right, I'd still allow those guys to allocate slabs. Provided they have the right allocation context, right? Its just that once I allocated those slabs using __GFP_MEMALLOC/PF_MEMALLOC I don't want allocation contexts that do not have rights to those pages to walk off with objects. And I'd prevent these ones from doing so. Without keeping track of reserve pages, which doesn't feel too clean. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
On Wednesday 31 October 2007 23:17, Peter Zijlstra wrote:
> On Wed, 2007-10-31 at 21:46 +1100, Nick Piggin wrote:
> > And I'd prevent these ones from doing so. Without keeping track of
> > reserve pages, which doesn't feel too clean.
>
> The problem with that is that once a slab was allocated with the right
> allocation context, anybody can get objects from these slabs.

[snip]

> I understand that. So we either reserve a page per object, which for
> 32 byte objects is a large waste, or we stop anybody who doesn't have
> the right permissions from obtaining objects. I took the latter
> approach.

What I'm saying is that the slab allocator slowpath should always just
check watermarks against the current task. Instead of this ->reserve
stuff.
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Wed, Oct 31, 2007 at 08:11:10AM -0700, Linus Torvalds wrote: On Wed, 31 Oct 2007, Nick Piggin wrote: However I actually don't really like how this all works. I don't like that filemap.c should have to know about ptrace, or exactly what ptrace wants here. It shouldn't. It should just fail when it fails. Then, handle_mm_fault() should return an error code, which should cause get_user_pages() to return an error code. Which should make ptrace just stop. So I think your patch is wrong. mm/filemap.c should *not* care about who does the fault. I think the proper patch is something untested like the appended... Well the patch is right, in the context of the regression I introduced (and so it should probably go into 2.6.23). I'd love to get rid of that outside data content crap if possible in 2.6.24. I think you're the one who has the best feeling for the ptrace cases we have in the VM, so I trust you ;) It's a bit hairy to force insert page into pagecache and pte into pagetables here, given the races. It's also wrong. They shouldn't be in the page cache, since that can cause problems with truncate etc. Maybe it doesn't any more, but it's reasonable to believe that a page outside of i_size should not exist. No I believe it could still be a problem (and at least, it is quite fragile). In access_process_vm, can't we just zero fill in the case of a sigbus? Linus? That will also avoid changing applicatoin behaviour due to a gdb read... access_process_vm() should just return how many bytes it could fill (which means a truncated copy - very including zero bytes - for an error), and the caller should decide what the right thing to do is. But who knows, maybe I missed something. I hope it works. Duane? Does this fix things for you? 
Linus

---
 mm/filemap.c | 13 ++---
 1 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9940895..188cf5f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1300,7 +1300,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (vmf->pgoff >= size)
-		goto outside_data_content;
+		return VM_FAULT_SIGBUS;
 
 	/* If we don't want any read-ahead, don't bother */
 	if (VM_RandomReadHint(vma))
@@ -1377,7 +1377,7 @@ retry_find:
 	if (unlikely(vmf->pgoff >= size)) {
 		unlock_page(page);
 		page_cache_release(page);
-		goto outside_data_content;
+		return VM_FAULT_SIGBUS;
 	}
 
 	/*
@@ -1388,15 +1388,6 @@ retry_find:
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
 
-outside_data_content:
-	/*
-	 * An external ptracer can access pages that normally aren't
-	 * accessible..
-	 */
-	if (vma->vm_mm == current->mm)
-		return VM_FAULT_SIGBUS;
-
-	/* Fall through to the non-read-ahead case */
 no_cached_page:
 	/*
 	 * We're only likely to ever get here if MADV_RANDOM is in
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Wed, Oct 31, 2007 at 08:59:41AM -0700, Linus Torvalds wrote: On Wed, 31 Oct 2007, Nick Piggin wrote: Well the patch is right, in the context of the regression I introduced (and so it should probably go into 2.6.23). Yeah, it probably is fine for -stable. And if mine (which actually changes behaviour, in that it makes ptrace get an access error) causes regressions, I guess we'll have to use that compatible-with-old-behaviour one for 2.6.24 too. But I just rebooted and tested - the cleaned-up patch does seem to work fine, and I get Cannot access memory at address xyz rather than any reported problem. So I think I'll commit my version asap, and see if anybody reports that they have a situation where they use ptrace() and expect zero back from a shared mapping past the end.. And if there are issues, we can switch back to the old broken behaviour with your patch, No that would be great. Fingers crossed it won't cause any problems. Thanks, all. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Wed, Oct 31, 2007 at 04:08:21PM -0700, Linus Torvalds wrote: On Wed, 31 Oct 2007, Nick Piggin wrote: No that would be great. Fingers crossed it won't cause any problems. I actually doubt it will cause problems. We made much bigger changes to ptrace support when we disallowed writing to read-only shared memory areas (we used to do the magic per-page COW thing). Really? No, we still do that magic COW thing which creates anonymous pages in MAP_SHARED vmas, don't we? But I think that is pretty arbitrary as well. What the force argument to get_user_pages on a MAP_SHARED vma effectively does is punt the permission check from the vma permissions to the file permissions. If you've opened the file WR but mmap()ed PROT_READ, it goes ahead and COWs, wheras if you've opened the file RDONLY it -EFAULTs. This seems strange wrong, because the act of COWing seems to be exactly done in order to *absolve* us of file permissions (if we're COWing, why should it matter if the file happened to be readonly or not). Could we just try disallow COW for MAP_SHARED mappings? Seeing as we're doing the break-ptrace thing this release already ;) All the stuff I know of that requires forced COW (ie. gdb for breakpoints), does so on MAP_PRIVATE mappings. I know this is one of Hugh's bugbears. I don't know whether he's thought out how to remove it, but I wonder if the appended patch is roughly in the right direction? But if people have test-cases for ptrace and/or other magical users of access_vm_pages() (things like core-dumping comes to mind - although I think we don't dump pure file mappings at all, do we?) it would certainly be good to run any such tests on the current -git tree... We do for MAP_SHARED|MAP_ANONYMOUS, by the looks. 
---
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1563,13 +1563,11 @@ static int do_wp_page(struct mm_struct *
 			reuse = can_share_swap_page(old_page);
 			unlock_page(old_page);
 		}
-	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-					(VM_WRITE|VM_SHARED))) {
+	} else if (unlikely((vma->vm_flags & VM_SHARED))) {
 		/*
-		 * Only catch write-faults on shared writable pages,
-		 * read-only shared pages can get COWed by
-		 * get_user_pages(.write=1, .force=1).
+		 * Only catch write-faults on shared writable pages.
 		 */
+		BUG_ON(!(vma->vm_flags & VM_WRITE));
 		if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
 			/*
 			 * Notify the address space that the page is about to
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -980,9 +980,11 @@ unsigned long do_mmap_pgoff(struct file
 			if (locks_verify_locked(inode))
 				return -EAGAIN;
 
-			vm_flags |= VM_SHARED | VM_MAYSHARE;
-			if (!(file->f_mode & FMODE_WRITE))
-				vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
+			vm_flags |= VM_MAYSHARE;
+			if (file->f_mode & FMODE_WRITE)
+				vm_flags |= VM_SHARED;
+			if (!(vm_flags & VM_WRITE))
+				vm_flags &= ~VM_MAYWRITE;
 
 			/* fall through */
 		case MAP_PRIVATE:
@@ -998,6 +1000,7 @@ unsigned long do_mmap_pgoff(struct file
 
 			if (!file->f_op || !file->f_op->mmap)
 				return -ENODEV;
+
 			break;
 
 		default:
@@ -1007,6 +1010,8 @@ unsigned long do_mmap_pgoff(struct file
 		switch (flags & MAP_TYPE) {
 		case MAP_SHARED:
 			vm_flags |= VM_SHARED | VM_MAYSHARE;
+			if (!(vm_flags & VM_WRITE))
+				vm_flags &= ~VM_MAYWRITE;
 			break;
 		case MAP_PRIVATE:
 			/*
Re: [PATCH] firewire: fw-core: enforce write order when updating fw_device.generation
On Thursday 01 November 2007 12:49, Stefan Richter wrote:
> fw_device.node_id and fw_device.generation are accessed without mutexes.
> We have to ensure that all readers will get to see node_id updates
> before generation updates.

Hi, a few points:

- can you change it to use spinlocks instead? This would be most
  preferable.
- if not, you need comments.
- you also seem to be missing rmb()s now. I see a couple in the firewire
  directory, but nothing that seems to be ordering loads of these
  particular fields.
- use smp_*mb() if you are just ordering regular cacheable RAM accesses.

Also, the diffstat is a bit wrong... maybe you posted the wrong version?

> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>
> ---
>  drivers/firewire/fw-device.c   | 6 ++
>  drivers/firewire/fw-topology.c | 1 +
>  2 files changed, 7 insertions(+)
>
> Index: linux/drivers/firewire/fw-device.c
> ===================================================================
> --- linux.orig/drivers/firewire/fw-device.c
> +++ linux/drivers/firewire/fw-device.c
> @@ -808,6 +813,7 @@ void fw_node_event(struct fw_card *card,
>  		device = node->data;
>  		device->node_id = node->node_id;
> +		wmb();
>  		device->generation = card->generation;
>  		if (atomic_read(&device->state) == FW_DEVICE_RUNNING) {
>  			PREPARE_DELAYED_WORK(&device->work, fw_device_update);
> Index: linux/drivers/firewire/fw-topology.c
> ===================================================================
> --- linux.orig/drivers/firewire/fw-topology.c
> +++ linux/drivers/firewire/fw-topology.c
> @@ -518,6 +518,7 @@ fw_core_handle_bus_reset(struct fw_card
>  	card->bm_retries = 0;
>  	card->node_id = node_id;
> +	wmb();
>  	card->generation = generation;
>  	card->reset_jiffies = jiffies;
>  	schedule_delayed_work(&card->work, 0);
Re: filp usage when cpu busy
On Thursday 01 November 2007 12:56, bc Wong (chimwong) wrote: Hi, With 2.6.16 x86_64 on a 4 core machine, I noticed that the filp usage (according to /proc/slabinfo) shoots up and keeps on increasing sharply when one of the CPUs is (1) locked up, or (2) very busy doing a lot of printk()'s with KERN_EMERG. In the case of (1), it's permanent until it runs out of memory eventually. For (2), it's temporary; filp count comes back down when the printk()'s are done. I can't think of any relationship between a busy/ locked-up CPU and filp count. The system is still functional. New short-lived processes kept being created, but the overall number of processes is stable. Does anyone know why filp count would go up like that? Yeah, it's probably because filp structures are freed by RCU, and if you have a locked up CPU then it can't go through a quiescent state so RCU stops freeing your filps. If you add some cond_resched()s to your code, you should find that RCU will force a reschedule and things will work (actually, for 2.6.16, I'm not sure if RCU had the code to force a reschedule... it's force_quiescent_state() in kernel/rcupdate.c upstream). - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Wed, Oct 31, 2007 at 12:45:35AM +0000, Duane Griffin wrote:
> Accessing a memory mapped region past the last page containing a valid
> file mapping produces a SIGBUS fault (as it should). Running a program
> that does this under gdb, then accessing the invalid memory from gdb,
> causes it to start consuming 100% CPU and become unkillable. Once in
> that state, SysRq-T doesn't show a stack trace for gdb, although it is
> shown as running and stack traces are dumped for other tasks.

BTW. this has come up for me before, and I have found it useful on a
number of occasions to print the stack of running tasks when they are
looping in the kernel... Any reason we can't do this?

--
Sysrq+T fails to show the stack trace of a running task. Presumably this
is to avoid a garbled stack, however it can often be useful, and besides
there is no guarantee that the task won't start running in the middle of
show_stack(). If there are any correctness issues, then the architecture
would have to take further steps to ensure the task is not running.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2007-10-31 06:53:22.000000000 +1100
+++ linux-2.6/kernel/sched.c	2007-10-31 06:56:02.000000000 +1100
@@ -4900,8 +4900,7 @@
 	printk(KERN_CONT "%5lu %5d %6d\n", free,
 		task_pid_nr(p), task_pid_nr(p->parent));
 
-	if (state != TASK_RUNNING)
-		show_stack(p, NULL);
+	show_stack(p, NULL);
 }
 
 void show_state_filter(unsigned long state_filter)
Re: [PATCH 00/33] Swap over NFS -v14
On Wednesday 31 October 2007 15:37, David Miller wrote: > From: Nick Piggin <[EMAIL PROTECTED]> > Date: Wed, 31 Oct 2007 14:26:32 +1100 > > > Is it really worth all the added complexity of making swap > > over NFS files work, given that you could use a network block > > device instead? > > Don't be misled. Swapping over NFS is just a scarecrow for the > seemingly real impetus behind these changes which is network storage > stuff like iSCSI. Oh, I'm OK with the network reserves stuff (not the actual patch, which I'm not really qualified to review, but at least the idea of it...). And also I'm not as such against the idea of swap over network. However, specifically the change to make swapfiles work through the filesystem layer (ATM it goes straight to the block layer, modulo some initialisation stuff which uses block filesystem- specific calls). I mean, I assume that anybody trying to swap over network *today* has to be using a network block device anyway, so the idea of just being able to transparently improve that case seems better than adding new complexities for seemingly not much gain. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: > Change ALLOC_NO_WATERMARK page allocation such that the reserves are system > wide - which they are per setup_per_zone_pages_min(), when we scrape the > barrel, do it properly. > IIRC it's actually not too uncommon to have allocations coming here via page reclaim. It's not exactly clear that you want to break mempolicies at this point. > Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]> > --- > mm/page_alloc.c |6 ++ > 1 file changed, 6 insertions(+) > > Index: linux-2.6/mm/page_alloc.c > === > --- linux-2.6.orig/mm/page_alloc.c > +++ linux-2.6/mm/page_alloc.c > @@ -1638,6 +1638,12 @@ restart: > rebalance: > if (alloc_flags & ALLOC_NO_WATERMARKS) { > nofail_alloc: > + /* > + * break out of mempolicy boundaries > + */ > + zonelist = NODE_DATA(numa_node_id())->node_zonelists + > + gfp_zone(gfp_mask); > + > /* go through the zonelist yet again, ignoring mins */ > page = get_page_from_freelist(gfp_mask, order, zonelist, > ALLOC_NO_WATERMARKS); > > -- > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: > Allow PF_MEMALLOC to be set in softirq context. When running softirqs from > a borrowed context save current->flags, ksoftirqd will have its own > task_struct. What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess that some networking thing must be picking it up in a subsequent patch, but I'm too lazy to look!)... Again, can you have more of a rationale in your patch headers, or ref the patch that uses it... thanks - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 05/33] mm: kmem_estimate_pages()
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: > Provide a method to get the upper bound on the pages needed to allocate > a given number of objects from a given kmem_cache. > Fair enough, but just to make it a bit easier, can you provide a little reason of why in this patch (or reference the patch number where you use it, or put it together with the patch where you use it, etc.). Thanks, - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: > Allow the mempool to use the memalloc reserves when all else fails and > the allocation context would otherwise allow it. I don't see what this is for. The whole point of when I fixed this to *not* use the memalloc reserves is because processes that were otherwise allowed to use those reserves, were. They should not. > Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]> > --- > mm/mempool.c | 12 +++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > Index: linux-2.6/mm/mempool.c > === > --- linux-2.6.orig/mm/mempool.c > +++ linux-2.6/mm/mempool.c > @@ -14,6 +14,7 @@ > #include > #include > #include > +#include "internal.h" > > static void add_element(mempool_t *pool, void *element) > { > @@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf > void *element; > unsigned long flags; > wait_queue_t wait; > - gfp_t gfp_temp; > + gfp_t gfp_temp, gfp_orig = gfp_mask; > > might_sleep_if(gfp_mask & __GFP_WAIT); > > @@ -228,6 +229,15 @@ repeat_alloc: > } > spin_unlock_irqrestore(>lock, flags); > > + /* if we really had right to the emergency reserves try those */ > + if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) { > + if (gfp_temp & __GFP_NOMEMALLOC) { > + gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN); > + goto repeat_alloc; > + } else > + gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN; > + } > + > /* We must not sleep in the GFP_ATOMIC case */ > if (!(gfp_mask & __GFP_WAIT)) > return NULL; > > -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation > contexts that are entitled to it. > > Care is taken to only touch the SLUB slow path. > > This is done to ensure reserve pages don't leak out and get consumed. I think this is generally a good idea (to prevent slab allocators from stealing reserve). However I naively think the implementation is a bit overengineered and thus has a few holes. Humour me, what was the problem with failing the slab allocation (actually, not fail but just call into the page allocator to do correct waiting / reclaim) in the slowpath if the process fails the watermark checks? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/33] Swap over NFS -v14
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote: > Hi, > > Another posting of the full swap over NFS series. Hi, Is it really worth all the added complexity of making swap over NFS files work, given that you could use a network block device instead? Also, have you ensured that page_file_index, page_file_mapping and page_offset are only ever used on anonymous pages when the page is locked? (otherwise PageSwapCache could change) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Wed, Oct 31, 2007 at 12:45:35AM +0000, Duane Griffin wrote:
> Accessing a memory mapped region past the last page containing a valid
> file mapping produces a SIGBUS fault (as it should). Running a program
> that does this under gdb, then accessing the invalid memory from gdb,
> causes it to start consuming 100% CPU and become unkillable. Once in
> that state, SysRq-T doesn't show a stack trace for gdb, although it is
> shown as running and stack traces are dumped for other tasks.
>
> 2.6.22 does not have this bug (gdb just prints '\0' as the contents,
> although arguably that is also a bug, and it should instead report the
> SIGBUS).
>
> Bisection indicates the problem was introduced by:
>
> 54cb8821de07f2ffcd28c380ce9b93d5784b40d7
> "mm: merge populate and nopage into fault (fixes nonlinear)"
>
> The following program demonstrates the issue:
>
> #include <errno.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <sys/stat.h>
>
> int main(int argc, char *argv[])
> {
>     int fd;
>     struct stat buf;
>
>     if (argc != 2) {
>         fprintf(stderr, "usage: %s <filename>\n", argv[0]);
>         exit(1);
>     }
>
>     fd = open(argv[1], O_RDONLY);
>     fstat(fd, &buf);
>     int count = buf.st_size + sysconf(_SC_PAGE_SIZE);
>     char *file = (char *) mmap(NULL, count, PROT_READ, MAP_PRIVATE, fd, 0);
>     if (file == MAP_FAILED) {
>         fprintf(stderr, "mmap failed: %s\n", strerror(errno));
>     } else {
>         char ch;
>         fprintf(stderr, "using offset %d\n", (count - 1));
>         ch = file[count - 1];
>         munmap(file, count);
>     }
>     close(fd);
>     return 0;
> }
>
> To reproduce the bug, run it under gdb, go up a couple of frames to
> the main function, then access invalid memory, for e.g. with: "print
> file[4095]", or whatever offset was reported.

Well that's probably the best bug report I've ever had, thanks Duane!
The issue is a silly thinko -- I didn't pay enough attention to the ptrace rules in filemap_fault :( partly I think that's because I don't understand them and am scared of them ;) The following minimal patch fixes it here, and should probably be applied to 2.6.23 stable as well as 2.6.24. If you could verify it, that would be much appreciated. However I actually don't really like how this all works. I don't like that filemap.c should have to know about ptrace, or exactly what ptrace wants here. It's a bit hairy to force insert page into pagecache and pte into pagetables here, given the races. In access_process_vm, can't we just zero fill in the case of a sigbus? Linus? That will also avoid changing applicatoin behaviour due to a gdb read... Thanks, Nick -- Duane Griffin noticed a 2.6.23 regression that will cause gdb to hang when it tries to access the memory of another process beyond i_size. This is because the solution to the fault vs invalidate race requires that we recheck i_size after the page lock is taken. However in that case, I had not accounted for the fact that ptracers are granted an exception to this rule. Cc: Duane Griffin <[EMAIL PROTECTED]> Cc: [EMAIL PROTECTED] Signed-off-by: Nick Piggin <[EMAIL PROTECTED] --- Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1374,7 +1374,7 @@ retry_find: /* Must recheck i_size under page lock */ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; - if (unlikely(vmf->pgoff >= size)) { + if (unlikely(vmf->pgoff >= size) && vma->vm_mm == current->mm) { unlock_page(page); page_cache_release(page); goto outside_data_content; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 performance regression
On Tuesday 30 October 2007 18:54, Lorenzo Allegrucci wrote: > Hi, sorry if this is a faq but reading > http://people.freebsd.org/~kris/scaling/7.0%20Preview.pdf (slides 17, > 18) > looks like 2.6.23 is having a performance regression on MySQL and > PostgreSQL benchmarks. Has anyone investigated these? I think Ingo was looking into them. I'm just in the middle of building a new setup to investigate some performance issues too, so I hope to look at this. Apparently it did go unnoticed during CFS testing, unfortunately. Although now it has been brought to light, I think 2.6.24 is supposed to be better (although it now gets worse on other things). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 09/10] SLUB: Do our own locking via slab_lock and slab_unlock.
On Wednesday 31 October 2007 05:32, Christoph Lameter wrote: > On Tue, 30 Oct 2007, Nick Piggin wrote: > > Is this actually a speedup on any architecture to roll your own locking > > rather than using bit spinlock? > > It avoids one load from memory when allocating and the release is simply > writing the page->flags back. Less instructions. OK, but it probably isn't a measurable speedup, even on microbenchmarks, right? And many architectures have to have more barriers around cmpxchg than they do around a test_and_set_bit_lock, so it may even be slower on some. > > I am not exactly convinced that smp_wmb() is a good idea to have in your > > unlock, rather than the normally required smp_mb() that every other open > > coded lock in the kernel is using today. If you comment every code path > > where a load leaking out of the critical section would not be a problem, > > then OK it may be correct, but I still don't think it is worth the > > maintenance overhead. > > I thought you agreed that release semantics only require a write barrier? Not in general. > The issue here is that other processors see the updates before the > updates to page-flags. > > A load leaking out of a critical section would require that the result of > the load is not used to update other information before the slab_unlock > and that the source of the load is not overwritten in the critical > section. That does not happen in sluib. That may be the case, but I don't think there is enough performance justification to add a hack like this. ia64 for example is going to do an mf for smp_wmb so I doubt it is a clear win. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.23 performance regression
On Tuesday 30 October 2007 18:54, Lorenzo Allegrucci wrote:
> Hi, sorry if this is a faq but reading
> http://people.freebsd.org/~kris/scaling/7.0%20Preview.pdf
> (slides 17, 18) it looks like 2.6.23 is having a performance regression
> on MySQL and PostgreSQL benchmarks. Has anyone investigated these?

I think Ingo was looking into them. I'm just in the middle of building a
new setup to investigate some performance issues too, so I hope to look
at this. Apparently it did go unnoticed during CFS testing,
unfortunately. Although now it has been brought to light, I think 2.6.24
is supposed to be better (although it now gets worse on other things).
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Wed, Oct 31, 2007 at 12:45:35AM +, Duane Griffin wrote:
> Accessing a memory mapped region past the last page containing a valid
> file mapping produces a SIGBUS fault (as it should). Running a program
> that does this under gdb, then accessing the invalid memory from gdb,
> causes it to start consuming 100% CPU and become unkillable. Once in
> that state, SysRq-T doesn't show a stack trace for gdb, although it is
> shown as running and stack traces are dumped for other tasks.
>
> 2.6.22 does not have this bug (gdb just prints '\0' as the contents,
> although arguably that is also a bug, and it should instead report the
> SIGBUS). Bisection indicates the problem was introduced by:
> 54cb8821de07f2ffcd28c380ce9b93d5784b40d7
> mm: merge populate and nopage into fault (fixes nonlinear)
>
> The following program demonstrates the issue:
>
> #include <errno.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <sys/stat.h>
>
> int main(int argc, char *argv[])
> {
>         int fd;
>         struct stat buf;
>
>         if (argc != 2) {
>                 fprintf(stderr, "usage: %s filename\n", argv[0]);
>                 exit(1);
>         }
>
>         fd = open(argv[1], O_RDONLY);
>         fstat(fd, &buf);
>         int count = buf.st_size + sysconf(_SC_PAGE_SIZE);
>         char *file = (char *) mmap(NULL, count, PROT_READ, MAP_PRIVATE, fd, 0);
>         if (file == MAP_FAILED) {
>                 fprintf(stderr, "mmap failed: %s\n", strerror(errno));
>         } else {
>                 char ch;
>                 fprintf(stderr, "using offset %d\n", (count - 1));
>                 ch = file[count - 1];
>                 munmap(file, count);
>         }
>         close(fd);
>         return 0;
> }
>
> To reproduce the bug, run it under gdb, go up a couple of frames to the
> main function, then access invalid memory, e.g. with:
> print file[4095], or whatever offset was reported.

Well that's probably the best bug report I've ever had, thanks Duane!
The issue is a silly thinko -- I didn't pay enough attention to the
ptrace rules in filemap_fault :( partly I think that's because I don't
understand them and am scared of them ;)

The following minimal patch fixes it here, and should probably be
applied to 2.6.23 stable as well as 2.6.24. If you could verify it, that
would be much appreciated.

However I actually don't really like how this all works. I don't like
that filemap.c should have to know about ptrace, or exactly what ptrace
wants here. It's a bit hairy to force insert page into pagecache and pte
into pagetables here, given the races. In access_process_vm, can't we
just zero-fill in the case of a sigbus? Linus? That will also avoid
changing application behaviour due to a gdb read...

Thanks,
Nick

--
Duane Griffin noticed a 2.6.23 regression that will cause gdb to hang
when it tries to access the memory of another process beyond i_size.
This is because the solution to the fault vs invalidate race requires
that we recheck i_size after the page lock is taken. However in that
case, I had not accounted for the fact that ptracers are granted an
exception to this rule.

Cc: Duane Griffin [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1374,7 +1374,7 @@ retry_find:
 	/* Must recheck i_size under page lock */
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-	if (unlikely(vmf->pgoff >= size)) {
+	if (unlikely(vmf->pgoff >= size) && vma->vm_mm == current->mm) {
 		unlock_page(page);
 		page_cache_release(page);
 		goto outside_data_content;
Re: [PATCH 00/33] Swap over NFS -v14
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Hi,
>
> Another posting of the full swap over NFS series.

Hi,

Is it really worth all the added complexity of making swap over NFS
files work, given that you could use a network block device instead?

Also, have you ensured that page_file_index, page_file_mapping and
page_offset are only ever used on anonymous pages when the page is
locked? (otherwise PageSwapCache could change)
Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> contexts that are entitled to it.
>
> Care is taken to only touch the SLUB slow path.
>
> This is done to ensure reserve pages don't leak out and get consumed.

I think this is generally a good idea (to prevent slab allocators from
stealing reserve). However I naively think the implementation is a bit
overengineered and thus has a few holes.

Humour me, what was the problem with failing the slab allocation
(actually, not fail but just call into the page allocator to do correct
waiting / reclaim) in the slowpath if the process fails the watermark
checks?
Re: [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Allow the mempool to use the memalloc reserves when all else fails and
> the allocation context would otherwise allow it.

I don't see what this is for. The whole point of when I fixed this to
*not* use the memalloc reserves is because processes that were otherwise
allowed to use those reserves, were. They should not.

> Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
> ---
>  mm/mempool.c |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/mm/mempool.c
> ===================================================================
> --- linux-2.6.orig/mm/mempool.c
> +++ linux-2.6/mm/mempool.c
> @@ -14,6 +14,7 @@
>  #include <linux/mempool.h>
>  #include <linux/blkdev.h>
>  #include <linux/writeback.h>
> +#include "internal.h"
>
>  static void add_element(mempool_t *pool, void *element)
>  {
> @@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf
>  	void *element;
>  	unsigned long flags;
>  	wait_queue_t wait;
> -	gfp_t gfp_temp;
> +	gfp_t gfp_temp, gfp_orig = gfp_mask;
>
>  	might_sleep_if(gfp_mask & __GFP_WAIT);
>
> @@ -228,6 +229,15 @@ repeat_alloc:
>  	}
>  	spin_unlock_irqrestore(&pool->lock, flags);
>
> +	/* if we really had right to the emergency reserves try those */
> +	if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) {
> +		if (gfp_temp & __GFP_NOMEMALLOC) {
> +			gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
> +			goto repeat_alloc;
> +		} else
> +			gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
> +	}
> +
>  	/* We must not sleep in the GFP_ATOMIC case */
>  	if (!(gfp_mask & __GFP_WAIT))
>  		return NULL;
Re: [PATCH 05/33] mm: kmem_estimate_pages()
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Provide a method to get the upper bound on the pages needed to allocate
> a given number of objects from a given kmem_cache.

Fair enough, but just to make it a bit easier, can you provide a little
reason of why in this patch (or reference the patch number where you use
it, or put it together with the patch where you use it, etc.).

Thanks,
Re: [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Change ALLOC_NO_WATERMARK page allocation such that the reserves are
> system wide - which they are per setup_per_zone_pages_min(), when we
> scrape the barrel, do it properly.

IIRC it's actually not too uncommon to have allocations coming here via
page reclaim. It's not exactly clear that you want to break mempolicies
at this point.

> Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
> ---
>  mm/page_alloc.c |    6 ++++++
>  1 file changed, 6 insertions(+)
>
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -1638,6 +1638,12 @@ restart:
>  rebalance:
>  	if (alloc_flags & ALLOC_NO_WATERMARKS) {
>  nofail_alloc:
> +		/*
> +		 * break out of mempolicy boundaries
> +		 */
> +		zonelist = NODE_DATA(numa_node_id())->node_zonelists +
> +				gfp_zone(gfp_mask);
> +
>  		/* go through the zonelist yet again, ignoring mins */
>  		page = get_page_from_freelist(gfp_mask, order, zonelist,
>  				ALLOC_NO_WATERMARKS);
Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Allow PF_MEMALLOC to be set in softirq context. When running softirqs
> from a borrowed context save current->flags, ksoftirqd will have its
> own task_struct.

What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess that
some networking thing must be picking it up in a subsequent patch, but
I'm too lazy to look!)...

Again, can you have more of a rationale in your patch headers, or ref
the patch that uses it... thanks
Re: [PATCH 00/33] Swap over NFS -v14
On Wednesday 31 October 2007 15:37, David Miller wrote:
> From: Nick Piggin [EMAIL PROTECTED]
> Date: Wed, 31 Oct 2007 14:26:32 +1100
>
> > Is it really worth all the added complexity of making swap over NFS
> > files work, given that you could use a network block device instead?
>
> Don't be misled. Swapping over NFS is just a scarecrow for the
> seemingly real impetus behind these changes which is network storage
> stuff like iSCSI.

Oh, I'm OK with the network reserves stuff (not the actual patch, which
I'm not really qualified to review, but at least the idea of it...). And
also I'm not as such against the idea of swap over network.

However, specifically the change to make swapfiles work through the
filesystem layer (ATM it goes straight to the block layer, modulo some
initialisation stuff which uses block filesystem-specific calls).

I mean, I assume that anybody trying to swap over network *today* has to
be using a network block device anyway, so the idea of just being able
to transparently improve that case seems better than adding new
complexities for seemingly not much gain.
Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning
On Wed, Oct 31, 2007 at 12:45:35AM +, Duane Griffin wrote:
> Accessing a memory mapped region past the last page containing a valid
> file mapping produces a SIGBUS fault (as it should). Running a program
> that does this under gdb, then accessing the invalid memory from gdb,
> causes it to start consuming 100% CPU and become unkillable. Once in
> that state, SysRq-T doesn't show a stack trace for gdb, although it is
> shown as running and stack traces are dumped for other tasks.

BTW. this has come up for me before, and I have found it useful on a
number of occasions to print the stack of running tasks when they are
looping in the kernel... Any reason we can't do this?

--
Sysrq+T fails to show the stack trace of a running task. Presumably this
is to avoid a garbled stack, however it can often be useful, and besides
there is no guarantee that the task won't start running in the middle of
show_stack(). If there are any correctness issues, then the architecture
would have to take further steps to ensure the task is not running.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2007-10-31 06:53:22.0 +1100
+++ linux-2.6/kernel/sched.c	2007-10-31 06:56:02.0 +1100
@@ -4900,8 +4900,7 @@
 	printk(KERN_CONT "%5lu %5d %6d\n", free,
 		task_pid_nr(p), task_pid_nr(p->parent));
 
-	if (state != TASK_RUNNING)
-		show_stack(p, NULL);
+	show_stack(p, NULL);
 }
 
 void show_state_filter(unsigned long state_filter)
Re: [patch 09/10] SLUB: Do our own locking via slab_lock and slab_unlock.
On Sunday 28 October 2007 14:32, Christoph Lameter wrote:
> Too many troubles with the bitlocks and we really do not need
> to do any bitops. Bitops do not effectively retrieve the old
> value which we want. So use a cmpxchg instead on the arches
> that allow it.
>
> Instead of modifying the page->flags with fake atomic operations
> we pass the page state as a parameter to functions. No function uses
> the slab state if the page lock is held. Function must wait until the
> lock is cleared. Thus we can defer the update of page->flags until slab
> processing is complete. The "unlock" operation is then simply updating
> page->flags.

Is this actually a speedup on any architecture to roll your own locking
rather than using bit spinlock?

I am not exactly convinced that smp_wmb() is a good idea to have in your
unlock, rather than the normally required smp_mb() that every other open
coded lock in the kernel is using today. If you comment every code path
where a load leaking out of the critical section would not be a problem,
then OK it may be correct, but I still don't think it is worth the
maintenance overhead.
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
>
> ---
>  mm/slub.c |  324 +++++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 187 insertions(+), 137 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c	2007-10-27 07:56:03.0 -0700
> +++ linux-2.6/mm/slub.c	2007-10-27 07:56:52.0 -0700
> @@ -101,6 +101,7 @@
>   */
>
>  #define FROZEN (1 << PG_active)
> +#define LOCKED (1 << PG_locked)
>
>  #ifdef CONFIG_SLUB_DEBUG
>  #define SLABDEBUG (1 << PG_error)
> @@ -108,36 +109,6 @@
>  #define SLABDEBUG 0
>  #endif
>
> -static inline int SlabFrozen(struct page *page)
> -{
> -	return page->flags & FROZEN;
> -}
> -
> -static inline void SetSlabFrozen(struct page *page)
> -{
> -	page->flags |= FROZEN;
> -}
> -
> -static inline void ClearSlabFrozen(struct page *page)
> -{
> -	page->flags &= ~FROZEN;
> -}
> -
> -static inline int SlabDebug(struct page *page)
> -{
> -	return page->flags & SLABDEBUG;
> -}
> -
> -static inline void SetSlabDebug(struct page *page)
> -{
> -	page->flags |= SLABDEBUG;
> -}
> -
> -static inline void ClearSlabDebug(struct page *page)
> -{
> -	page->flags &= ~SLABDEBUG;
> -}
> -
>  /*
>   * Issues still to be resolved:
>   *
> @@ -818,11 +789,12 @@ static void trace(struct kmem_cache *s,
>  /*
>   * Tracking of fully allocated slabs for debugging purposes.
>   */
> -static void add_full(struct kmem_cache *s, struct page *page)
> +static void add_full(struct kmem_cache *s, struct page *page,
> +					unsigned long state)
>  {
>  	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
>
> -	if (!SlabDebug(page) || !(s->flags & SLAB_STORE_USER))
> +	if (!(state & SLABDEBUG) || !(s->flags & SLAB_STORE_USER))
>  		return;
>  	spin_lock(&n->list_lock);
>  	list_add(&page->lru, &n->full);
> @@ -894,7 +866,7 @@ bad:
>  }
>
>  static noinline int free_debug_processing(struct kmem_cache *s,
> -		struct page *page, void *object, void *addr)
> +		struct page *page, void *object, void *addr, unsigned long state)
>  {
>  	if (!check_slab(s, page))
>  		goto fail;
> @@ -930,7 +902,7 @@ static noinline int free_debug_processin
>  	}
>
>  	/* Special debug activities for freeing objects */
> -	if (!SlabFrozen(page) && page->freelist == page->end)
> +	if (!(state & FROZEN) && page->freelist == page->end)
>  		remove_full(s, page);
>  	if (s->flags & SLAB_STORE_USER)
>  		set_track(s, object, TRACK_FREE, addr);
> @@ -1047,7 +1019,8 @@ static inline int slab_pad_check(struct
>  			{ return 1; }
>  static inline int check_object(struct kmem_cache *s, struct page *page,
>  			void *object, int active) { return 1; }
> -static inline void add_full(struct kmem_cache *s, struct page *page) {}
> +static inline void add_full(struct kmem_cache *s, struct page *page,
> +					unsigned long state) {}
>  static inline unsigned long kmem_cache_flags(unsigned long objsize,
>  	unsigned long flags, const char *name,
>  	void (*ctor)(struct kmem_cache *, void *))
> @@ -1105,6 +1078,7 @@ static noinline struct page *new_slab(st
>  	void *start;
>  	void *last;
>  	void *p;
> +	unsigned long state;
>
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
>
> @@ -1117,11 +1091,12 @@ static noinline struct page *new_slab(st
>  	if (n)
>  		atomic_long_inc(&n->nr_slabs);
>  	page->slab = s;
> -	page->flags |= 1 << PG_slab;
> +	state = 1 << PG_slab;
>  	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
>  			SLAB_STORE_USER | SLAB_TRACE))
> -		SetSlabDebug(page);
> +		state |= SLABDEBUG;
>
Re: [PATCH resend2] rd: fix data corruption on memory pressure
On Tuesday 30 October 2007 01:17, Christian Borntraeger wrote:
> Nick, Eric, Andrew,
>
> we now have passed rc1. That means that Eric's or Nick's rd rewrite is
> no longer an option for 2.6.24. If I followed the last thread correctly,
> all alternative patches have one of the following issues:
> - too big for post rc1

Yeah, I don't think the rewrites were ever intended for 2.6.24 anyway...

> - break reiserfs and maybe others
> - call into vfs while being unrelated
>
> So this is a resend of my patch, which is in my opinion the simplest fix
> for the data corruption problem.
> The patch was tested by our test department, thanks to Oliver Paukstadt
> and Thorsten Diehl.
>
> ---
>
> Subject: [PATCH] rd: fix data corruption on memory pressure
> From: Christian Borntraeger <[EMAIL PROTECTED]>
>
> We have seen ramdisk based install systems where some pages of mapped
> libraries and programs were suddenly zeroed under memory pressure. This
> should not happen, as the ramdisk avoids freeing its pages by keeping
> them dirty all the time.
>
> It turns out that there is a case where the VM makes a ramdisk page
> clean without telling the ramdisk driver.
> On memory pressure shrink_zone runs and it starts to run
> shrink_active_list. There is a check for buffer_heads_over_limit, and if
> true, pagevec_strip is called. pagevec_strip calls try_to_release_page.
> If the mapping has no releasepage callback, try_to_free_buffers is
> called. try_to_free_buffers now has special logic for some file systems
> to make a dirty page clean if all buffers are clean. That's what
> happened in our test case.
>
> The simplest solution is to provide a noop releasepage callback for the
> ramdisk driver. This avoids try_to_free_buffers for ramdisk pages.

I think this is the least intrusive change that is least likely to break
rd, or any other kernel code, that we've seen. It really should go in
2.6.24, IMO.

Acked-by: Nick Piggin <[EMAIL PROTECTED]>

> Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>
> ---
>  drivers/block/rd.c |   13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> Index: linux-2.6/drivers/block/rd.c
> ===================================================================
> --- linux-2.6.orig/drivers/block/rd.c
> +++ linux-2.6/drivers/block/rd.c
> @@ -189,6 +189,18 @@ static int ramdisk_set_page_dirty(struct
>  	return 0;
>  }
>
> +/*
> + * releasepage is called by pagevec_strip/try_to_release_page if
> + * buffers_heads_over_limit is true. Without a releasepage function
> + * try_to_free_buffers is called instead. That can unset the dirty
> + * bit of our ram disk pages, which will be eventually freed, even
> + * if the page is still in use.
> + */
> +static int ramdisk_releasepage(struct page *page, gfp_t dummy)
> +{
> +	return 0;
> +}
> +
>  static const struct address_space_operations ramdisk_aops = {
>  	.readpage	= ramdisk_readpage,
>  	.prepare_write	= ramdisk_prepare_write,
> @@ -196,6 +208,7 @@ static const struct address_space_operat
>  	.writepage	= ramdisk_writepage,
>  	.set_page_dirty	= ramdisk_set_page_dirty,
>  	.writepages	= ramdisk_writepages,
> +	.releasepage	= ramdisk_releasepage,
>  };
>
>  static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
Re: vm_ops.page_mkwrite() fails with vmalloc on 2.6.23
On Monday 29 October 2007 23:35, Peter Zijlstra wrote:
> On Mon, 2007-10-29 at 11:11 +0100, Peter Zijlstra wrote:
> > On Mon, 2007-10-29 at 01:17 -0700, Jaya Kumar wrote:
> > > On 10/29/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > > > On Mon, 22 Oct 2007 16:40:57 +0200 Stefani Seibold <[EMAIL PROTECTED]> wrote:
> > > > > The problem originally occurs with the fb_defio driver
> > > > > (driver/video/fb_defio.c). This driver uses the
> > > > > vm_ops.page_mkwrite() handler for tracking the modified pages,
> > > > > which will be handled in an extra thread, to perform the IO and
> > > > > clean and write protect all pages with page_clean().
> > >
> > > Hi,
> > >
> > > An aside, I just tested that deferred IO works fine on
> > > 2.6.22.10/pxa255.
> > >
> > > I understood from the thread that PeterZ is looking into page_mkclean
> > > changes which I guess went into 2.6.23. I'm also happy to help in any
> > > way if the way we're doing fb_defio needs to change.
> >
> > Yeah, it's the truncate race stuff introduced by Nick in
> > d0217ac04ca6591841e5665f518e38064f4e65bd
> >
> > I'm a bit at a loss on how to go about fixing this. One ugly idea I had
> > was to check page->mapping before going into page_mkwrite() and when
> > that is null, don't bother with the truncate check.
>
> Something like this

I think it's a fine minimal patch. Maybe add a comment to say exactly
what we're doing here (pagecache generally just uses !mapping to test for
truncate). Otherwise,

Acked-by: Nick Piggin <[EMAIL PROTECTED]>, thanks!

> ---
>  mm/memory.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -2300,6 +2300,8 @@ static int __do_fault(struct mm_struct *
>  		 * to become writable
>  		 */
>  		if (vma->vm_ops->page_mkwrite) {
> +			struct address_space *mapping = page->mapping;
> +
>  			unlock_page(page);
>  			if (vma->vm_ops->page_mkwrite(vma, page) < 0) {
>  				ret = VM_FAULT_SIGBUS;
> @@ -2314,7 +2316,7 @@ static int __do_fault(struct mm_struct *
>  			 * reworking page_mkwrite locking API, which
>  			 * is better done later.
>  			 */
> -			if (!page->mapping) {
> +			if (mapping != page->mapping) {
>  				ret = 0;
>  				anon = 1; /* no anon but release vmf.page */
>  				goto out;
Re: [patch 09/10] SLUB: Do our own locking via slab_lock and slab_unlock.
On Sunday 28 October 2007 14:32, Christoph Lameter wrote: Too many troubles with the bitlocks and we really do not need to do any bitops. Bitops do not effectively retrieve the old value which we want. So use a cmpxchg instead on the arches that allow it. Instead of modifying the page-flags with fake atomic operations we pass the page state as a parameter to functions. No function uses the slab state if the page lock is held. Function must wait until the lock is cleared. Thus we can defer the update of page-flags until slab processing is complete. The unlock operation is then simply updating page-flags. Is this actually a speedup on any architecture to roll your own locking rather than using bit spinlock? I am not exactly convinced that smp_wmb() is a good idea to have in your unlock, rather than the normally required smp_mb() that every other open coded lock in the kernel is using today. If you comment every code path where a load leaking out of the critical section would not be a problem, then OK it may be correct, but I still don't think it is worth the maintenance overhead. 
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/slub.c | 324 +++---
 1 file changed, 187 insertions(+), 137 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c	2007-10-27 07:56:03.0 -0700
+++ linux-2.6/mm/slub.c	2007-10-27 07:56:52.0 -0700
@@ -101,6 +101,7 @@
  */
 #define FROZEN (1 << PG_active)
+#define LOCKED (1 << PG_locked)

 #ifdef CONFIG_SLUB_DEBUG
 #define SLABDEBUG (1 << PG_error)
@@ -108,36 +109,6 @@
 #define SLABDEBUG 0
 #endif

-static inline int SlabFrozen(struct page *page)
-{
-	return page->flags & FROZEN;
-}
-
-static inline void SetSlabFrozen(struct page *page)
-{
-	page->flags |= FROZEN;
-}
-
-static inline void ClearSlabFrozen(struct page *page)
-{
-	page->flags &= ~FROZEN;
-}
-
-static inline int SlabDebug(struct page *page)
-{
-	return page->flags & SLABDEBUG;
-}
-
-static inline void SetSlabDebug(struct page *page)
-{
-	page->flags |= SLABDEBUG;
-}
-
-static inline void ClearSlabDebug(struct page *page)
-{
-	page->flags &= ~SLABDEBUG;
-}
-
 /*
  * Issues still to be resolved:
  *
@@ -818,11 +789,12 @@ static void trace(struct kmem_cache *s,
 /*
  * Tracking of fully allocated slabs for debugging purposes.
  */
-static void add_full(struct kmem_cache *s, struct page *page)
+static void add_full(struct kmem_cache *s, struct page *page,
+						unsigned long state)
 {
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));

-	if (!SlabDebug(page) || !(s->flags & SLAB_STORE_USER))
+	if (!(state & SLABDEBUG) || !(s->flags & SLAB_STORE_USER))
 		return;
 	spin_lock(&n->list_lock);
 	list_add(&page->lru, &n->full);
@@ -894,7 +866,7 @@ bad:
 }

 static noinline int free_debug_processing(struct kmem_cache *s,
-		struct page *page, void *object, void *addr)
+	struct page *page, void *object, void *addr, unsigned long state)
 {
 	if (!check_slab(s, page))
 		goto fail;
@@ -930,7 +902,7 @@ static noinline int free_debug_processin
 	}

 	/* Special debug activities for freeing objects */
-	if (!SlabFrozen(page) && page->freelist == page->end)
+	if (!(state & FROZEN) && page->freelist == page->end)
 		remove_full(s, page);
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
@@ -1047,7 +1019,8 @@ static inline int slab_pad_check(struct
 			{ return 1; }
 static inline int check_object(struct kmem_cache *s, struct page *page,
 			void *object, int active) { return 1; }
-static inline void add_full(struct kmem_cache *s, struct page *page) {}
+static inline void add_full(struct kmem_cache *s, struct page *page,
+						unsigned long state) {}
 static inline unsigned long kmem_cache_flags(unsigned long objsize,
 	unsigned long flags, const char *name,
 	void (*ctor)(struct kmem_cache *, void *))
@@ -1105,6 +1078,7 @@ static noinline struct page *new_slab(st
 	void *start;
 	void *last;
 	void *p;
+	unsigned long state;

 	BUG_ON(flags & GFP_SLAB_BUG_MASK);

@@ -1117,11 +1091,12 @@ static noinline struct page *new_slab(st
 	if (n)
 		atomic_long_inc(&n->nr_slabs);
 	page->slab = s;
-	page->flags |= 1 << PG_slab;
+	state = 1 << PG_slab;
 	if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
 			SLAB_STORE_USER | SLAB_TRACE))
-		SetSlabDebug(page);
+		state |= SLABDEBUG;
+	page->flags |= state;

 	start = page_address(page);
 	page->end = start + 1;
@@ -1147,13 +1122,13 @@ static void __free_slab(struct kmem_cach
 {
 	int
Re: [2.6.23-rt3] NMI watchdog trace of deadlock
On Saturday 27 October 2007 15:21, Mike Galbraith wrote:
> Greetings,
>
> For quite a while now, RT kernels have been locking up on me
> occasionally while my back is turned. Yesterday, the little bugger
> finally pounced while my serial console box was up and waiting.
>
> [10138.162953] WARNING: at arch/i386/kernel/smp.c:581 native_smp_call_function_mask()
> [10138.170583] [<c01051da>] show_trace_log_lvl+0x1a/0x30
> [10138.175796] [<c0105de3>] show_trace+0x12/0x14
> [10138.180291] [<c0105dfb>] dump_stack+0x16/0x18
> [10138.184769] [<c011609f>] native_smp_call_function_mask+0x138/0x13d
> [10138.191117] [<c0117606>] smp_call_function+0x1e/0x24
> [10138.196210] [<c012f85c>] on_each_cpu+0x25/0x50
> [10138.200807] [<c0115c74>] flush_tlb_all+0x1e/0x20
> [10138.205553] [<c016caaf>] kmap_high+0x1b6/0x417
> [10138.210118] [<c011ec88>] kmap+0x4d/0x4f
> [10138.214102] [<c026a9d8>] ntfs_end_buffer_async_read+0x228/0x2f9
> [10138.220163] [<c01a0e9e>] end_bio_bh_io_sync+0x26/0x3f
> [10138.225352] [<c01a2b09>] bio_endio+0x42/0x6d
> [10138.229769] [<c02c2a08>] __end_that_request_first+0x115/0x4ac
> [10138.235682] [<c02c2da7>] end_that_request_chunk+0x8/0xa
> [10138.241052] [<c0365943>] ide_end_request+0x55/0x10a
> [10138.246058] [<c036dae3>] ide_dma_intr+0x6f/0xac
> [10138.250727] [<c0366d83>] ide_intr+0x93/0x1e0
> [10138.255125] [<c015afb4>] handle_IRQ_event+0x5c/0xc9

Looks like ntfs is kmap()ing from interrupt context. It should be using
kmap_atomic instead, I think. But the ntfs code I have seems to do just
that...
Re: [interesting] smattering of possible memory ordering bugs
On Friday 26 October 2007 13:35, Benjamin Herrenschmidt wrote:

[acks]

Thanks for those...

> > Index: linux-2.6/include/asm-powerpc/mmu_context.h
> > ===
> > --- linux-2.6.orig/include/asm-powerpc/mmu_context.h
> > +++ linux-2.6/include/asm-powerpc/mmu_context.h
> > @@ -129,7 +129,7 @@ static inline void get_mmu_context(struc
> >  		steal_context();
> >  #endif
> >  	ctx = next_mmu_context;
> > -	while (test_and_set_bit(ctx, context_map)) {
> > +	while (test_and_set_bit_lock(ctx, context_map)) {
> >  		ctx = find_next_zero_bit(context_map, LAST_CONTEXT+1, ctx);
> >  		if (ctx > LAST_CONTEXT)
> >  			ctx = 0;
> > @@ -158,7 +158,7 @@ static inline void destroy_context(struc
> >  {
> >  	preempt_disable();
> >  	if (mm->context.id != NO_CONTEXT) {
> > -		clear_bit(mm->context.id, context_map);
> > +		clear_bit_unlock(mm->context.id, context_map);
> >  		mm->context.id = NO_CONTEXT;
> >  #ifdef FEW_CONTEXTS
> >  		atomic_inc(_free_contexts);
>
> I don't think the previous code was wrong... it's not a locked section
> and we don't care about ordering previous stores. It's an allocation, it
> should be fine. In general, bitmap allocators should be allright.

Well, if it is just allocating an arbitrary _number_ out of a bitmap and
nothing else (eg. like the pid allocator), then you don't need barriers.

> Ignore the FEW_CONTEXTS stuff for now :-) At this point, it's UP only
> and will be replaced sooner or later.

OK. Then I agree, provided you're doing the correct synchronisation or
flushing etc. when destroying a context (which presumably you are). I'll
drop those bits then.

Thanks,
Nick
[interesting] smattering of possible memory ordering bugs
Hi,

Just out of interest, I did a grep for files containing test_and_set_bit
as well as clear_bit (excluding obvious ones like include/asm-*/bitops.h).
Quite a few interesting things. There is a lot of stuff in drivers/* that
could be suspect, WRT memory barriers, including lots I didn't touch. Lot
of work. But forget that...

Some surprises with more obvious bugs, which I have added some cc's for.

powerpc seems to have an mmu context allocation bug (and possible
improvement in hpte locking with the new bitops).

floppy is using a crazy open-coded bit mutex; convert to regular mutex.

vt has something strange too.

fs-writeback could be made a little safer by properly closing the
"critical" section (the existing sections aren't really critical, but you
never know what the future holds ;))

jfs seems to be missing an obvious smp_mb__before_clear_bit (and a less
obvious smp_mb__after_clear_bit)

xfs seems to be missing smp_mb__before

Lots of code that allocates things (eg. msi interrupts) is suspicious.
I'm not exactly sure if these allocation bitmaps are actually used to
protect data structures or not...

I'll have to get round to preparing proper patches for these things if I
don't hear nacks...
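The distinction being drawn above -- a bitmap that merely hands out numbers needs no barriers, while a bitmap whose bits guard other state needs lock/unlock semantics -- can be sketched with C11 atomics. This is a hedged userspace analogue of the test_and_set_bit_lock / clear_bit_unlock pattern; the names and the 32-entry map are invented for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace sketch of a GPIO-style "in use" bitmap acting as a lock. */
#define N_GPIO 32

static _Atomic unsigned int gpio_in_use;

static bool gpio_request(unsigned gpio)
{
	unsigned int mask;

	if (gpio >= N_GPIO)
		return false;
	mask = 1u << gpio;
	/*
	 * Acquire: if the bit protects the GPIO's state (not just a
	 * number), later accesses must not be reordered before we own
	 * the bit.  Returns true only if the bit was previously clear.
	 */
	return !(atomic_fetch_or_explicit(&gpio_in_use, mask,
					  memory_order_acquire) & mask);
}

static void gpio_free(unsigned gpio)
{
	if (gpio >= N_GPIO)
		return;
	/*
	 * Release: our final accesses to the GPIO must be visible before
	 * the bit clears and another caller can claim it.
	 */
	atomic_fetch_and_explicit(&gpio_in_use, ~(1u << gpio),
				  memory_order_release);
}
```

If the bitmap only allocates identifiers (the pid-allocator case mentioned in the follow-up), plain unordered bitops would be fine; the acquire/release variants matter exactly when, as in the driver cases listed above, the bit stands in for ownership of other data.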
---
 arch/arm/mach-davinci/gpio.c      | 4 ++--
 arch/arm/mach-imx/generic.c       | 4 ++--
 arch/arm/mach-iop13xx/msi.c       | 4 ++--
 arch/arm/mach-ns9xxx/gpio.c       | 4 ++--
 arch/avr32/mach-at32ap/pio.c      | 1 +
 arch/powerpc/mm/hash_native_64.c  | 5 ++---
 arch/ppc/xmon/start.c             | 20 +---
 block/ll_rw_blk.c                 | 20 +---
 drivers/block/cciss.c             | 4 ++--
 drivers/block/cpqarray.c          | 4 ++--
 drivers/block/floppy.c            | 36
 drivers/char/vt.c                 | 4 ++--
 drivers/net/ibm_newemac/mal.c     | 5 ++---
 drivers/net/s2io.c                | 8
 drivers/usb/misc/phidgetservo.c   | 6 +++---
 fs/fs-writeback.c                 | 4 ++--
 fs/jfs/jfs_metapage.c             | 5 +++--
 fs/xfs/xfs_mount.c                | 4 ++--
 include/asm-powerpc/mmu_context.h | 4 ++--
 include/asm-ppc/mmu_context.h     | 4 ++--
 include/linux/aio.h               | 5 +-
 include/linux/interrupt.h         | 3 ++-
 include/linux/netdevice.h         | 5 ++---
 net/bluetooth/cmtp/core.c         | 4 ++--
 net/core/dev.c                    | 5 ++---
 26 files changed, 75 insertions(+), 98 deletions(-)

Index: linux-2.6/arch/arm/mach-davinci/gpio.c
===
--- linux-2.6.orig/arch/arm/mach-davinci/gpio.c
+++ linux-2.6/arch/arm/mach-davinci/gpio.c
@@ -34,7 +34,7 @@ int gpio_request(unsigned gpio, const ch
 	if (gpio >= DAVINCI_N_GPIO)
 		return -EINVAL;

-	if (test_and_set_bit(gpio, gpio_in_use))
+	if (test_and_set_bit_lock(gpio, gpio_in_use))
 		return -EBUSY;

 	return 0;
@@ -46,7 +46,7 @@ void gpio_free(unsigned gpio)
 	if (gpio >= DAVINCI_N_GPIO)
 		return;

-	clear_bit(gpio, gpio_in_use);
+	clear_bit_unlock(gpio, gpio_in_use);
 }
 EXPORT_SYMBOL(gpio_free);

Index: linux-2.6/arch/arm/mach-imx/generic.c
===
--- linux-2.6.orig/arch/arm/mach-imx/generic.c
+++ linux-2.6/arch/arm/mach-imx/generic.c
@@ -107,7 +107,7 @@ int imx_gpio_request(unsigned gpio, cons
 		return -EINVAL;
 	}

-	if(test_and_set_bit(gpio, imx_gpio_alloc_map)) {
+	if(test_and_set_bit_lock(gpio, imx_gpio_alloc_map)) {
 		printk(KERN_ERR "imx_gpio: GPIO %d already used. Allocation for \"%s\" failed\n",
 			gpio, label ? label : "?");
 		return -EBUSY;
@@ -123,7 +123,7 @@ void imx_gpio_free(unsigned gpio)
 	if(gpio >= (GPIO_PORT_MAX + 1) * 32)
 		return;

-	clear_bit(gpio, imx_gpio_alloc_map);
+	clear_bit_unlock(gpio, imx_gpio_alloc_map);
 }
 EXPORT_SYMBOL(imx_gpio_free);

Index: linux-2.6/arch/arm/mach-iop13xx/msi.c
===
--- linux-2.6.orig/arch/arm/mach-iop13xx/msi.c
+++ linux-2.6/arch/arm/mach-iop13xx/msi.c
@@ -135,7 +135,7 @@ again:
 	if (irq > NR_IRQS)
 		return -ENOSPC;
 	/* test_and_set_bit operates on 32-bits at a time */
-	if (test_and_set_bit(pos, msi_irq_in_use))
+	if (test_and_set_bit_lock(pos, msi_irq_in_use))
 		goto again;

 	dynamic_irq_init(irq);
@@ -149,7 +149,7 @@ void destroy_irq(unsigned int irq)

 	dynamic_irq_cleanup(irq);

-	clear_bit(pos, msi_irq_in_use);
+	clear_bit_unlock(pos, msi_irq_in_use);
 }

 void arch_teardown_msi_irq(unsigned int irq)

Index: linux-2.6/arch/arm/mach-ns9xxx/gpio.c
===
--- linux-2.6.orig/arch/arm/mach-ns9xxx/gpio.c
+++ linux-2.6/arch/arm/mach-ns9xxx/gpio.c
@@ -85,7 +85,7 @@ static inline void __iomem *ns9xxx_gpio_
 int gpio_request(unsigned gpio, const char *label)
 {
 	if (likely(ns9xxx_valid_gpio(gpio)))
-		return test_and_set_bit(gpio, gpiores) ? -EBUSY : 0;
Re: Is gcc thread-unsafe?
On Friday 26 October 2007 09:55, Andi Kleen wrote:
> > But we don't actually know what it is, and it could change with
> > different architectures or versions of gcc. I think the sanest thing
> > is for gcc to help us out here, seeing as there is this very well
> > defined requirement that we want.
> >
> > If you really still want the optimisation to occur,
>
> I don't think it makes sense for memory. It may make sense for
> registers.
>
> But given that my kernel doesn't seem to contain any instances
> at all it's probably not too useful for it anyways.
>
> So I wouldn't have a problem disabling it, but it would
> be better if they fixed their heuristics to only apply it when
> it makes sense.

That's what I mean -- disabling it for memory. I mean, I am just talking
about the conditional => unconditional store to a shared variable
optimisation. So for example, adc, sbb, and cmov are all still fine when
they are used for the right things. I don't want to turn them off because
they really are quite useful.

As far as it being theoretical... I hope it is. But we should nip this in
the "bud" (gcc 3.x does this too, sigh) before it causes problems for us
(and any and all other threaded programs and libraries out there). And by
that I mean ask them for a compiler option rather than start adding
volatile.
Re: Is gcc thread-unsafe?
On Friday 26 October 2007 09:09, Andi Kleen wrote:
> On Friday 26 October 2007 00:49:42 Nick Piggin wrote:
> > Marking volatile I think is out of the question. To start with,
> > volatile creates really poor code (and most of the time we actually
> > do want the code in critical sections to be as tight as possible).
>
> Poor code is better than broken code I would say. Besides
> the cross CPU synchronization paths are likely dominated by
> cache misses anyways; it's unlikely they're actually limited
> by the core CPU. So it'll probably not matter all that much
> if the code is poor or not.

But we have to mark all memory accesses inside the critical section as
volatile, even after the CPU synchronisation, and often in the common case
where there is no contention or cacheline bouncing. Sure, real code is
dominated by cache misses anyway, etc. etc. However volatile prevents a
lot of real optimisations that we'd actually want to do, and increases
icache size etc.

> > But also because I don't think these bugs are just going to be
> > found easily.
> >
> > > It might be useful to come up with some kind of assembler pattern
> > > matcher to check if any such code is generated for the kernel
> > > and try it with different compiler versions.
> >
> > Hard to know how to do it. If you can, then it would be interesting.
>
> I checked my kernel for "adc" at least (for the trylock/++ pattern)
> and couldn't find any (in fact all of them were in
> data the compiler thought to be code). That was not a allyesconfig
> build, so I missed some code. sbb as well.
>
> For cmov it's at first harder because they're much more frequent
> and cmov to memory is a multiple instruction pattern (the instruction
> does only support memory source operands). Luckily gcc
> doesn't know the if (x) mem = a; => ptr = cmov(x, &mem, &dummy);
> *ptr = a; transformation trick, so I don't think there are actually
> any conditional stores on current x86.
>
> It might be a problem on other architectures which support true
> conditional stores though (like IA64 or ARM)

It might just depend on the instruction costs that gcc knows about. For
example, if ld/st is expensive, they might hoist them out of loops etc.
You don't even need to have one of these predicate or pseudo-predicate
instructions.

> Also I'm not sure if gcc doesn't know any other tricks like the
> conditional add using carry, although I cannot think of any related
> to stores from the top of my hat.
>
> Anyways, if it's only conditional add, if we ever catch such a case
> it could be also handled with inline assembly similar to local_inc()

But we don't actually know what it is, and it could change with different
architectures or versions of gcc. I think the sanest thing is for gcc to
help us out here, seeing as there is this very well defined requirement
that we want.

If you really still want the optimisation to occur, I don't think it is
too much to use a local variable for the accumulator (eg. in the simple
example case).
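The transformation being worried about, and the local-accumulator workaround suggested at the end, can be written out in plain C. This is an illustrative sketch -- hand-written source-level renderings of what such a compiler transformation would amount to, not actual gcc output:

```c
/* Shared variable that another thread may be updating concurrently. */
static int shared;

static void conditional_store_as_written(int x)
{
	if (x)
		shared = 1;	/* no store at all when x == 0 */
}

/*
 * Source-level rendering of the problematic optimisation: the store
 * becomes unconditional.  Single-threaded the result is identical, but
 * when x == 0 this now writes shared back, potentially stomping on a
 * concurrent update by another thread.
 */
static void conditional_store_transformed(int x)
{
	int tmp = shared;	/* speculative load */
	if (x)
		tmp = 1;
	shared = tmp;		/* unconditional store */
}

/*
 * The workaround suggested above: accumulate into a local, so the
 * compiler can do whatever it likes with the local, and only one
 * deliberate store ever touches shared memory.
 */
static void accumulate_locally(const int *v, int n)
{
	int local = 0;
	int i;

	for (i = 0; i < n; i++)
		if (v[i] > 0)
			local += v[i];
	shared = local;		/* single intended store */
}
```

Both conditional-store variants compute the same final value in a single-threaded run; the difference is solely whether a store is issued when the condition is false, which is exactly what makes the transformed form unsafe for lock-free shared data.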