Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-07 Thread Nick Piggin
On Fri, Nov 02, 2007 at 04:33:32PM +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > Anyway, if this can make its way to the x86 tree, I think it will get 
> > pulled into -mm (?) and get some exposure...
> 
> ok, we can certainly try it there.

Is there anything in particular I have to do to get it into the x86 tree? And
presumably that tree is going to be picked up by Andrew at some
point? (-mm exposure is the main objective here; the only reason I
didn't send it to Andrew directly is in the interests of doing the
right thing).


> Your code is really nifty.

Thanks. One of the CPU engineers at Intel asked to use it as their reference
ticket lock code sequence, which was pretty cool :)


Re: [RFC/PATCH] Optimize zone allocator synchronization

2007-11-06 Thread Nick Piggin
On Wednesday 07 November 2007 17:19, Andrew Morton wrote:
> > On Tue, 06 Nov 2007 05:08:07 -0500 Chris Snook <[EMAIL PROTECTED]> wrote:
> >
> > Don Porter wrote:
> > > From: Donald E. Porter <[EMAIL PROTECTED]>
> > >
> > > In the bulk page allocation/free routines in mm/page_alloc.c, the zone
> > > lock is held across all iterations.  For certain parallel workloads, I
> > > have found that releasing and reacquiring the lock for each iteration
> > > yields better performance, especially at higher CPU counts.  For
> > > instance, kernel compilation is sped up by 5% on an 8 CPU test
> > > machine.  In most cases, there is no significant effect on performance
> > > (although the effect tends to be slightly positive).  This seems quite
> > > reasonable for the very small scope of the change.
> > >
> > > My intuition is that this patch prevents smaller requests from waiting
> > > on larger ones.  While grabbing and releasing the lock within the loop
> > > adds a few instructions, it can lower the latency for a particular
> > > thread's allocation which is often on the thread's critical path.
> > > Lowering the average latency for allocation can increase system
> > > throughput.
> > >
> > > More detailed information, including data from the tests I ran to
> > > validate this change are available at
> > > http://www.cs.utexas.edu/~porterde/kernel-patch.html .
> > >
> > > Thanks in advance for your consideration and feedback.

I did see this initial post, and didn't quite know what to make of it.
I'll admit it is slightly unexpected :) Always good to research ideas
against common convention, though.

I don't know whether your reasoning is correct, though: unless there is
a significant number of higher-order allocations (which there should
not be, AFAIKS), all allocations go through the per-CPU lists, which
batch the same number of objects on and off, so there is no such thing
as smaller or larger requests.
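
To make the batching point concrete, here is a minimal userspace sketch
(pthread-based; all names are illustrative, not mm/page_alloc.c's) of a
per-CPU cache that always refills in the same fixed batch under the shared
lock, so the size of an individual request never changes how the lock is used:

#include <pthread.h>
#include <stddef.h>

#define BATCH 16        /* every refill moves the same number of objects */

struct shared_pool {
        pthread_mutex_t lock;
        void *items[4096];
        size_t count;
};

struct percpu_cache {
        void *items[BATCH];
        size_t count;
};

/* The shared lock is only ever taken for one BATCH worth of objects,
 * no matter how small or large the caller's request is. */
static void *cache_alloc(struct shared_pool *pool, struct percpu_cache *pc)
{
        if (pc->count == 0) {
                pthread_mutex_lock(&pool->lock);
                while (pc->count < BATCH && pool->count > 0)
                        pc->items[pc->count++] = pool->items[--pool->count];
                pthread_mutex_unlock(&pool->lock);
                if (pc->count == 0)
                        return NULL;    /* pool exhausted */
        }
        return pc->items[--pc->count];
}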

And there are a number of regressions as well in your tests. It would be
nice to get some more detailed profile numbers (preferably with an
upstream kernel) to try to work out what is going faster.

It's funny, Dave Miller and I were just talking about the possible
reappearance of zone->lock contention with massively multi core and
multi threaded CPUs. I think the right way to fix this in the long run
if it turns into a real problem, is something like having a lock per
MAX_ORDER block, and having CPUs prefer to allocate from different
blocks. Anti-frag makes this pretty interesting to implement, but it
will be possible.
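
A rough userspace sketch of that idea, with made-up names (nothing like this
exists in the allocator today): an array of locks, one per MAX_ORDER-sized
block, with each CPU starting at a different block and only falling back to
the others under contention:

#include <pthread.h>

#define NR_BLOCKS 64    /* one lock per MAX_ORDER-sized block of pages */

static pthread_mutex_t block_lock[NR_BLOCKS] = {
        [0 ... NR_BLOCKS - 1] = PTHREAD_MUTEX_INITIALIZER  /* GCC extension */
};

static void alloc_from_preferred_block(int cpu)
{
        int start = cpu % NR_BLOCKS;    /* CPUs prefer different blocks */
        int i;

        for (i = 0; i < NR_BLOCKS; i++) {
                int b = (start + i) % NR_BLOCKS;

                if (pthread_mutex_trylock(&block_lock[b]) == 0) {
                        /* ... allocate pages from block b here ... */
                        pthread_mutex_unlock(&block_lock[b]);
                        return;
                }
        }
        /* everything busy: wait on the preferred block rather than spin */
        pthread_mutex_lock(&block_lock[start]);
        /* ... allocate from block 'start' ... */
        pthread_mutex_unlock(&block_lock[start]);
}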


> > That's an interesting insight.  My intuition is that Nick Piggin's
> > recently-posted ticket spinlocks patches[1] will reduce the need for this
> > patch, though it may be useful to have both.  Can you benchmark again
> > with only ticket spinlocks, and with ticket spinlocks + this patch? 
> > You'll probably want to use 2.6.24-rc1 as your baseline, due to the x86
> > architecture merge.
>
> The patch as-is would hurt low cpu-count workloads, and single-threaded
> workloads: it is simply taking that lock a lot more times.  This will be
> particularly noticeable on things like older P4 machines which have
> peculiarly expensive locked operations.

It's not even restricted to P4s -- another big cost is going to be the
cacheline pingpong. Actually it might be worth trying another test run
with zone->lock put into its own cacheline (as it stands, when the lock
gets contended, spinners will just sit there pushing useful fields out
of the holder's memory -- ticket locks will do better here, but they
still write to the lock once, then sit there loading it).
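
For reference, the kind of layout meant by "zone->lock in its own cacheline"
(a hypothetical userspace struct, not the actual struct zone): spinners can
only dirty the lock's line and never the fields the lock holder is working on.

#include <pthread.h>

#define CACHELINE 64

struct zone_like {
        /* the lock gets a whole cacheline to itself ... */
        pthread_spinlock_t lock __attribute__((aligned(CACHELINE)));
        char pad[CACHELINE - sizeof(pthread_spinlock_t)];

        /* ... so the fields the holder actually manipulates live in
         * separate lines that waiters never write to */
        unsigned long free_pages;
        unsigned long nr_active;
        unsigned long nr_inactive;
} __attribute__((aligned(CACHELINE)));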


> A test to run would be, on ext2:
>
>   time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)
>
> (might need to increase /proc/sys/vm/dirty* to avoid any writeback)
>
>
> I wonder if we can do something like:
>
>   if (lock_is_contended(lock)) {
>   spin_unlock(lock);
>   spin_lock(lock);/* To the back of the queue */
>   }
>
> (in conjunction with the ticket locks) so that we only do the expensive
> buslocked operation when we actually have a need to do so.
>
> (The above should be wrapped in some new spinlock interface function which
> is probably a no-op on architectures which cannot implement it usefully)

We have the need_lockbreak stuff. Of course, that's often pretty useless
with regular spinlocks (when you consider that my tests show that a single
CPU can be allowed to retake the same lock several million times in a row
despite contention)...

Anyway, yeah we could do that. But I think we do actually want to batch
up allocations on a given CPU in the multithreaded case as well, rather
than interleave them. There are some benefits in avoiding cacheline bouncing.
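
For what it's worth, here's a sketch of the helper Andrew is describing,
written in terms of the spin_is_contended() test mentioned elsewhere in this
series (the helper name is made up; it is not an existing kernel interface,
and this assumes linux/spinlock.h):

/* Hypothetical helper: drop to the back of the ticket queue only when
 * somebody is actually waiting, so the common case stays a plain
 * critical section with no extra atomics. */
static inline void spin_lock_yield(spinlock_t *lock)
{
        if (spin_is_contended(lock)) {
                spin_unlock(lock);
                cpu_relax();            /* give a waiter a chance to run */
                spin_lock(lock);        /* rejoin at the back of the queue */
        }
}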

Re: VM/networking crash cause #1: page allocation failure (order:1, GFP_ATOMIC)

2007-11-06 Thread Nick Piggin
On Tuesday 06 November 2007 04:42, Frank van Maarseveen wrote:
> For quite some time I've been seeing occasional lockups spread over 50 different
> machines I'm maintaining. Symptom: a page allocation failure with order:1,
> GFP_ATOMIC, while there is plenty of memory, as it seems (lots of free
> pages, almost no swap used) followed by a lockup (everything dead). I've
> collected all (12) crash cases which occurred the last 10 weeks on 50
> machines total (i.e. 1 crash every 41 weeks on average). The kernel
> messages are summarized to show the interesting part (IMO) they have
> in common. Over the years this has become the crash cause #1 for stable
> kernels for me (fglrx doesn't count ;).
>
> One note: I suspect that reporting a GFP_ATOMIC allocation failure in a
> network driver via that same driver (netconsole) may not be the smartest
> thing to do and this could be responsible for the lockup itself. However,
> the initial page allocation failure remains and I'm not sure how to
> address that problem.

It isn't unexpected. If an atomic allocation doesn't have enough memory,
it kicks off kswapd to start freeing memory for it. However, it cannot
wait for memory to become free (it's GFP_ATOMIC), so it has to return
failure. GFP_ATOMIC allocation paths are designed so that the kernel can
recover from this situation, and a subsequent allocation will have free
memory.

Probably in production kernels we should default to only reporting this
when page reclaim is not making any progress.
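
As an illustration of the recovery path (a hypothetical driver fragment, not
taken from any particular driver; rx_refill_one and frame_len are invented
names): an atomic allocation failure is simply dropped or deferred, and a
later attempt succeeds once kswapd has made progress.

#include <linux/skbuff.h>
#include <linux/kernel.h>

static int rx_refill_one(unsigned int frame_len)
{
        struct sk_buff *skb = alloc_skb(frame_len, GFP_ATOMIC);

        if (unlikely(!skb)) {
                /* GFP_ATOMIC cannot wait for reclaim; kswapd has already
                 * been kicked, so just defer and try again later. */
                if (printk_ratelimit())
                        printk(KERN_DEBUG "rx refill deferred: atomic allocation failed\n");
                return -ENOMEM;
        }
        /* ... hand skb to the hardware ring here ... */
        kfree_skb(skb);         /* placeholder so the sketch stands alone */
        return 0;
}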


> I still think the issue is memory fragmentation but if so, it looks
> a bit extreme to me: One system with 2GB of ram crashed after a day,
> merely running a couple of TCP server programs. All systems have either
> 1 or 2GB ram and at least 1G of (merely unused) swap.

You can reduce the chances of it happening by increasing
/proc/sys/vm/min_free_kbytes.



Re: Oom-killer error.

2007-11-06 Thread Nick Piggin
On Tuesday 06 November 2007 19:34, Jim van Wel wrote:
> Hi there,
>
> I have a strange problem with about 10-15 servers right now.
> They are all HP DL380-G5 servers with kernel 2.6.22.6. The systems all
> work normally, but after an uptime of about 15 to 25 days we get these
> messages and the servers just crash.
>

It is trying to allocate lowmem, but you have none left (and none
to speak of can be reclaimed).

I'd guess you have a kernel memory leak. Can you start by posting
the output of /proc/slabinfo and /proc/meminfo after the machine has
been up for 10-20 days (eg. close to OOM)? And preferably also post
another set after just a day or so of uptime.

Are you using the SLAB or SLUB allocator (check .config)?

The kernel crashes (rather than recovering) because, by the looks of it,
the leak has used up all its working memory and killing processes does
not release it.

Thanks for the report,
Nick


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Nick Piggin
On Fri, Nov 02, 2007 at 08:56:46PM -0400, Chuck Ebbert wrote:
> On 11/02/2007 07:01 PM, Nick Piggin wrote:
> > 
> > In the contended multi-threaded tight loop, the xchg lock is slower than inc
> > lock but still beats the fair xadd lock, but that's only because it is
> > just as unfair if not more so on this hardware (runtime difference of up to
> > about 10%)
> > 
> 
> I meant xchg for unlock, not lock.

That is for unlock. 2x the number of atomic operations ~= 2x the cost.
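
The difference being priced here can be sketched in userspace with C11
atomics (a rough illustration of the two unlock styles, not the kernel's
inline asm): a plain store needs no locked instruction, while an xchg-style
unlock adds a second atomic operation to every lock/unlock pair.

#include <stdatomic.h>

static atomic_int lock_word;

/* mov-style unlock: a plain release store, no locked instruction on x86 */
static inline void unlock_store(void)
{
        atomic_store_explicit(&lock_word, 1, memory_order_release);
}

/* xchg-style unlock: compiles to an (implicitly locked) xchg on x86, so
 * the lock/unlock pair now contains two atomic operations instead of one */
static inline void unlock_xchg(void)
{
        (void)atomic_exchange_explicit(&lock_word, 1, memory_order_release);
}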



Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Nick Piggin
On Fri, Nov 02, 2007 at 09:51:27AM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Nov 2007, Chuck Ebbert wrote:
> > 
> > There's also a very easy way to get better fairness with our current 
> > spinlocks:
> > use xchg to release the lock instead of mov.
> 
> That does nothing at all.
> 
> Yes, it slows the unlock down, which in turn on some machines will make it 
> easier for another core/socket to get it, but it's purely about the 
> slowdown, nothing else. 

Yeah, it's not such a good idea... it slows down the single threaded case
like crazy. On my dual core core2:

_Single thread_
inc-lock in cache takes 21.94ns
xadd-lock in cache takes 22.64ns
xchg-lock in cache takes 35.21ns

inc-lock out of cache takes 140.73ns
xadd-lock out of cache takes 141.15ns
xchg-lock out of cache takes 155.13ns


In the contended multi-threaded tight loop, the xchg lock is slower than the inc
lock but still beats the fair xadd lock; that's only because it is
just as unfair, if not more so, on this hardware (runtime difference of up to
about 10%).


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-02 Thread Nick Piggin
On Fri, Nov 02, 2007 at 10:05:37AM -0400, Rik van Riel wrote:
> On Fri, 2 Nov 2007 07:42:20 +0100
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
> > > 
> > > 
> > > On Thu, 1 Nov 2007, Rik van Riel wrote:
> > > > 
> > > > Larry Woodman managed to wedge the VM into a state where, on his
> > > > 4x dual core system, only 2 cores (on the same CPU) could get the
> > > > zone->lru_lock overnight.  The other 6 cores on the system were
> > > > just spinning, without being able to get the lock.
> > 
> > That's quite incredible, considering that the CPUs actually _taking_
> > the locks also drop the locks and do quite a bit of work before taking
> > them again (ie. they take them to pull pages off the LRU, but then
> > do a reasonable amount of work to remove each one from pagecache
> > before refilling from the LRU).
> > 
> > Possibly actually that is a *more* difficult case for the HW to
> > handle: once the CPU actually goes away and operates on other
> > cachelines, it may get a little more difficult to detect that it is
> > causing starvation issues.
> 
> In case of the zone->lru_lock, grabbing the spinlock does
> not mean that the process is letting go of the cacheline.
> 
> On the contrary, once the spinlock has been grabbed, the
> real cacheline prodding begins.

I didn't say that, though. Obviously the hardware can't do anything
about starvation until a lock is released.


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-02 Thread Nick Piggin
On Thu, Nov 01, 2007 at 10:02:04PM -0700, David Miller wrote:
> From: David Miller <[EMAIL PROTECTED]>
> Date: Wed, 31 Oct 2007 00:44:25 -0700 (PDT)
> 
> > From: Nick Piggin <[EMAIL PROTECTED]>
> > Date: Wed, 31 Oct 2007 08:41:06 +0100
> > 
> > > You could possibly even do a generic "best effort" kind of thing with
> > > regular IPIs, that will timeout and continue if some CPUs don't handle
> > > them, and should be pretty easy to get working with existing smp_call_
> > > function stuff. Not exactly clean, but it would be better than nothing.
> > 
> > Without a doubt.
> 
> Putting my code where my mouth is, here is an example implementation
> of a special SysRQ "g" "dump regs globally" debugging tool for
> sparc64.
> 
> The only thing that has to happen is the SysRQ trigger.  So if you can
> either SysRQ-'g' at the console or "echo 'g' >/proc/sysrq-trigger" you
> can get the registers from the cpus in the system.
> 
> The only case the remote cpu registers would not be capturable would
> be if they were stuck looping in the trap entry, trap exit, or low
> level TLB handler code.
> 
> This means that even if some cpu is stuck in a spinlock loop with
> interrupts disabled, you'd see it with this thing.  The way it works
> is that cross cpu vectored interrupts are disabled independently of
> the processor interrupt level on sparc64.

That's really sweet. I'd better put my code where my mouth is as well...
this patch seems to be about the best we can do as a generic fallback
without getting a dedicated vector (which has to be done in arch
code anyway).

Maybe it can be improved, I don't know... I also don't know if I'm doing
exactly the right thing with pt_regs, but it seems to work.

One thing I'm doing is just dumping a single CPU, on a rotational basis. 
Don't know whether that's better or worse, but just an idea.

Now I don't know what's the best way for an arch to override this with
their own enhanced handler (an ARCH_HAVE ifdef?)

Index: linux-2.6/arch/x86/kernel/smp_64.c
===
--- linux-2.6.orig/arch/x86/kernel/smp_64.c
+++ linux-2.6/arch/x86/kernel/smp_64.c
@@ -504,8 +504,10 @@ asmlinkage void smp_reschedule_interrupt
add_pda(irq_resched_count, 1);
 }
 
-asmlinkage void smp_call_function_interrupt(void)
+asmlinkage void smp_call_function_interrupt(struct pt_regs *regs)
 {
+   struct pt_regs *old_regs = set_irq_regs(regs);
+
void (*func) (void *info) = call_data->func;
void *info = call_data->info;
int wait = call_data->wait;
@@ -515,7 +517,7 @@ asmlinkage void smp_call_function_interr
 * Notify initiating CPU that I've grabbed the data and am
 * about to execute the function
 */
-   mb();
+   smp_rmb();
atomic_inc(&call_data->started);
/*
 * At this point the info structure may be out of scope unless wait==1
@@ -526,8 +528,10 @@ asmlinkage void smp_call_function_interr
add_pda(irq_call_count, 1);
irq_exit();
if (wait) {
-   mb();
+   smp_mb();
atomic_inc(&call_data->finished);
}
+
+   set_irq_regs(old_regs);
 }
 
Index: linux-2.6/drivers/char/sysrq.c
===
--- linux-2.6.orig/drivers/char/sysrq.c
+++ linux-2.6/drivers/char/sysrq.c
@@ -196,12 +196,72 @@ static struct sysrq_key_op sysrq_showloc
 #define sysrq_showlocks_op (*(struct sysrq_key_op *)0)
 #endif
 
-static void sysrq_handle_showregs(int key, struct tty_struct *tty)
+static void show_registers(struct pt_regs *regs)
 {
-   struct pt_regs *regs = get_irq_regs();
if (regs)
show_regs(regs);
 }
+
+static void show_cpu_registers(void)
+{
+   show_registers(get_irq_regs());
+}
+
+static void show_global_regs(void *arg)
+{
+   show_cpu_registers();
+}
+
+static DEFINE_SPINLOCK(global_rotate_lock);
+static cpumask_t global_rotate_cpu;
+
+static void sysrq_handle_global_showregs_work_fn(struct work_struct *work)
+{
+   int cpu, next, thiscpu;
+
+   spin_lock(&global_rotate_lock);
+   if (cpus_weight(global_rotate_cpu) == 0)
+   cpu = smp_processor_id();
+   else
+   cpu = first_cpu(global_rotate_cpu);
+
+   next = next_cpu(cpu, cpu_online_map);
+   if (next == NR_CPUS)
+   next = first_cpu(cpu_online_map);
+   cpu_clear(cpu, global_rotate_cpu);
+   cpu_set(next, global_rotate_cpu);
+   BUG_ON(cpus_weight(global_rotate_cpu) != 1);
+   spin_unlock(&global_rotate_lock);
+
+   thiscpu = get_cpu();
+   if (cpu == thiscpu)
+   show_registers(task_pt_regs(current));
+   else {
+   if (smp_call_function_single(cpu, show_global_regs, NULL, 1, 1)
+   == -1) {
+   printk("Could not interrogate CPU registers\n");
+   return;
+   }
+   }
+   put_cpu();
+}
+static DECLARE_WORK(global_showregs_work


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 1 Nov 2007, Rik van Riel wrote:
> > 
> > Larry Woodman managed to wedge the VM into a state where, on his
> > 4x dual core system, only 2 cores (on the same CPU) could get the
> > zone->lru_lock overnight.  The other 6 cores on the system were
> > just spinning, without being able to get the lock.

That's quite incredible, considering that the CPUs actually _taking_
the locks also drop the locks and do quite a bit of work before taking
them again (ie. they take them to pull pages off the LRU, but then
do a reasonable amount of work to remove each one from pagecache before
refilling from the LRU).

Possibly actually that is a *more* difficult case for the HW to handle:
once the CPU actually goes away and operates on other cachelines, it 
may get a little more difficult to detect that it is causing starvation
issues.


> .. and this is almost always the result of a locking *bug*, not unfairness 
> per se. IOW, unfairness just ends up showing the bug in the first place.

I'd almost agree, but there are always going to be corner cases where
we get multiple contentions on a spinlock -- the fact that a lock is
needed at all obviously suggests that it can be contended. The LRU locking
could be improved, but you could have eg. scheduler runqueue lock starvation
if the planets lined up just right, and it is a little more difficult to
improve on runqueue locking.

Anyway, I also think this is partially a hardware issue, and as multiple
cores, threads, and sockets get more common, I hope it will improve (it
affects Intel CPUs as well as AMD). So it is possible to have an option
to switch between locks if the hardware is fairer, but I want to get
as much exposure with this locking as possible for now, to see if there
is any funny performance corner cases exposed (which quite possibly will
turn out to be caused by suboptimal locking itself).

Anyway, if this can make its way to the x86 tree, I think it will get
pulled into -mm (?) and get some exposure...



Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 06:17:42PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Nov 2007, Nick Piggin wrote:
> > 
> > But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb
> > uses this for inserting breakpoints (but fortunately, a COW page in a
> > MAP_PRIVATE mapping is a much more natural thing for the VM).
> 
> Yes, I phrased that badly. I meant that I'd be happier if we got rid of 
> VM_MAYSHARE entirely, and just used VM_SHARED. I thought we already made 
> them always be the same (and any VM_MAYSHARE use is historical).

Oh yeah, I think it would probably be clearer to use VM_SHARED == MAP_SHARED,
and test the write permission explicitly. Though there could be something
I missed that makes it not as easy as it sounds... probably something best
left for Hugh ;)



Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 04:01:45PM -0400, Chuck Ebbert wrote:
> On 11/01/2007 10:03 AM, Nick Piggin wrote:
> 
> [edited to show the resulting code]
> 
> > +   __asm__ __volatile__ (
> > +   LOCK_PREFIX "xaddw %w0, %1\n"
> > +   "1:\t"
> > +   "cmpb %h0, %b0\n\t"
> > +   "je 2f\n\t"
> > +   "rep ; nop\n\t"
> > +   "movb %1, %b0\n\t"
> > +   /* don't need lfence here, because loads are in-order */
> > "jmp 1b\n"
> > +   "2:"
> > +   :"+Q" (inc), "+m" (lock->slock)
> > +   :
> > +   :"memory", "cc");
> >  }
> 
> If you really thought you might get long queues, you could figure out
> how far back you are and use that to determine how long to wait before
> testing the lock again. That cmpb could become a subb without adding
> overhead to the fast path -- that would give you the queue length (or
> its complement anyway.)

Indeed. You can use this as a really nice input into a backoff
algorithm (eg. if you're next in line, don't back off, or at least
don't go into exponential backoff; if you've got people in front
of you, start throttling harder).

I think I'll leave that to SGI if they come up with a big x86 SSI ;)
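
A rough userspace sketch of that kind of queue-aware backoff, using GCC
builtins rather than the patch's inline asm (the names and the tuning are
made up): the xadd result tells each waiter exactly how many CPUs are ahead
of it, and the pause time scales with that distance.

#include <stdint.h>
#include <emmintrin.h>          /* _mm_pause() */

struct ticket_lock {
        volatile uint16_t slock;        /* low byte: head, high byte: tail */
};

static void ticket_lock_backoff(struct ticket_lock *lock)
{
        uint16_t old = __sync_fetch_and_add(&lock->slock, (uint16_t)0x0100);
        uint8_t me = old >> 8;          /* our ticket: the previous tail */

        for (;;) {
                uint8_t head = lock->slock & 0xff;
                uint8_t ahead = (uint8_t)(me - head);
                int i;

                if (ahead == 0)
                        break;          /* head reached our ticket: lock owned */
                /* the further back in the queue, the longer we back off */
                for (i = 0; i < ahead; i++)
                        _mm_pause();
        }
}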


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 09:08:45AM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 1 Nov 2007, Nick Piggin wrote:
> > 
> > Untested patch follows
> 
> Ok, this looks ok.
> 
> Except I would remove the VM_MAYSHARE bit from the test.

But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb
uses this for inserting breakpoints (but fortunately, a COW page in a
MAP_PRIVATE mapping is a much more natural thing for the VM).


> That whole bit should go, in fact.
> 
> We used to make it something different: iirc, a read-only SHARED mapping 
> was downgraded to a non-shared mapping, because we wanted to avoid some of 
> the costs we used to have with the VM implementation (actually, I think it 
> was various filesystems that don't like shared mappings because they don't 
> have a per-page writeback). But we left the VM_MAYSHARE bit on, to get 
> /proc//mmap things right.
> 
> Or something like that. I forget the details. But I *think* we don't 
> actually need this any more.
> 
> But basically, the "right" way to test for shared mappings is historically 
> to just test the VM_MAYSHARE bit - but not *both* bits. Because VM_SHARE 
> may have been artificially cleared.
 
I think you're right -- VM_MAYSHARE is basically testing for MAP_SHARED.
I just don't know exactly what you're proposing here.

Thanks,
Nick


Re: [patch 1/4] spinlock: lockbreak cleanup

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 04:46:36PM +0100, Ingo Molnar wrote:
> 
> * Lee Schermerhorn <[EMAIL PROTECTED]> wrote:
> 
> > > I guess it was done to make the "template" hacks easier. I don't 
> > > really find that in good taste, especially for important core 
> > > infrastructure. Anyway.
> > 
> > Actually, what I had/have is a cond_resched_rwlock() that I needed to 
> > convert the i_mmap_lock() to rw for testing reclaim scalability.  
> > [I've seen a large system running an Oracle OLTP load hang spitting 
> > "cpu soft lockup" messages with all cpus spinning on a i_mmap_lock 
> > spin lock.] One of the i_mmap_lock paths uses cond_resched_lock() for 
> > spin locks. To do a straight forward conversion [and maybe that isn't 
> > the right approach], I created the cond_resched_rwlock() function by 
> > generalizing the cond_sched_lock() code and creating both spin and rw 
> > lock wrappers. I took advantage of the fact that, currently, 
> > need_lockbreak() is a macro and that both spin and rw locks have/had 
> > the break_lock member. Typesafe functions would probably be 
> > preferrable, if we want to keep break_lock for rw spin locks.
> > 
> > Here's the most recent posting:
> > 
> > http://marc.info/?l=linux-mm=118980356306014=4
> > 
> > See the changes to sched.[ch].  Should apply to 23-mm1 with offsets 
> > and minor fixup in fs/inode.c.
> 
> yep, I too am in favor of keeping the need-lockbreak mechanism and its 
> type-insensitive data structure. We've got way too many locking 
> primitives and keeping them all sorted is nontrivial already.

I think a large contributor to that is being a bit clever with indirections
and cute code (eg. like this template stuff), rather than having two types of
spinlocks instead of one.




Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 08:14:47AM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 1 Nov 2007, Nick Piggin wrote:
> 
> > On Wed, Oct 31, 2007 at 04:08:21PM -0700, Linus Torvalds wrote:
> > > 
> > > We made much bigger changes to ptrace support when we disallowed writing 
> > > to read-only shared memory areas (we used to do the magic per-page COW 
> > > thing).
> > 
> > Really? No, we still do that magic COW thing which creates anonymous
> > pages in MAP_SHARED vmas, don't we?
> 
> No, we don't. I'm pretty sure. It didn't work with the VM cleanups, since 
> the MAP_SHARED vma's won't be on the anonymous list any more, and cannot 
> be swapped out.
> 
> So now, if you try to write to a read-only shared page through ptrace, 
> you'll get "Unable to access".
 
No, it COWs it (the file is RW).

I believe do_wp_page will still attach an anon_vma to the vma, which
will make the pte discoverable via rmap.


> Of course, I didn't really look closely, so maybe I just don't remember 
> things right..
> 
> > > access_vm_pages() (things like core-dumping comes to mind - although I 
> > > think we don't dump pure file mappings at all, do we?) it would certainly 
> > > be good to run any such tests on the current -git tree...
> > 
> > We do for MAP_SHARED|MAP_ANONYMOUS, by the looks.
> 
> Well, as we should. There's no way for a debugger to get those pages back. 
> So that all looks sane.
> 
> > -   vm_flags |= VM_SHARED | VM_MAYSHARE;
> > -   if (!(file->f_mode & FMODE_WRITE))
> > -   vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
> > +   vm_flags |= VM_MAYSHARE;
> > +   if (file->f_mode & FMODE_WRITE)
> > +   vm_flags |= VM_SHARED;
> > +   if (!(vm_flags & VM_WRITE))
> > +   vm_flags &= ~VM_MAYWRITE;
> 
> This looks totally bogus. What was the intent of this patch?
> 
> The VM_MAYWRITE flag is *not* supposed to track the VM_WRITE flag: that 
> would defeat the whole purpose of it! The whole point of that flag is to 
> say whether mprotect() could turn it into a VM_WRITE mapping, and it 
> depends on the file mode, not VM_WRITE!

Yeah of course that won't work, stupid...

The intent is to stop get_user_pages from proceeding with a write fault (and
subsequent COW) to readonly shared mappings, when force is set. I think it
can be done simply via get_user_pages(), which is what I should have done
to begin with.

Untested patch follows
---
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1031,7 +1031,9 @@ int get_user_pages(struct task_struct *t
}
 
if (!vma || (vma->vm_flags & (VM_IO | VM_PFNMAP))
-   || !(vm_flags & vma->vm_flags))
+   || !(vm_flags & vma->vm_flags)
+   || (write && ((vma->vm_flags &
+ (VM_SHARED|VM_MAYSHARE)) == VM_MAYSHARE)))
return i ? : -EFAULT;
 
if (is_vm_hugetlb_page(vma)) {
@@ -1563,13 +1565,11 @@ static int do_wp_page(struct mm_struct *
reuse = can_share_swap_page(old_page);
unlock_page(old_page);
}
-   } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-   (VM_WRITE|VM_SHARED))) {
+   } else if (unlikely((vma->vm_flags & VM_SHARED))) {
/*
-* Only catch write-faults on shared writable pages,
-* read-only shared pages can get COWed by
-* get_user_pages(.write=1, .force=1).
+* Only catch write-faults on shared writable pages.
 */
+   BUG_ON(!(vma->vm_flags & VM_WRITE));
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
/*
 * Notify the address space that the page is about to


Re: [patch 1/4] spinlock: lockbreak cleanup

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 03:06:05PM +0100, Peter Zijlstra wrote:
> On Thu, 2007-11-01 at 15:02 +0100, Nick Piggin wrote:
> 
> > Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
> > decouple it from the spinlock implementation, and make it typesafe (rwlocks
> > do not have any need_lockbreak sites -- why do they even get bloated up
> > with that break_lock then?).
> 
> IIRC Lee has a few patches floating about that do introduce lockbreak
> stuff for rwlocks.

Well that would be a good reason to introduce a break_lock for them,
but previously not so much... we have rwlocks in some slightly space
critical structures (vmas, inodes, etc).

I guess it was done to make the "template" hacks easier. I don't really
find that in good taste, especially for important core infrastructure.
Anyway.


[patch 3/4] x86: spinlock.h merge prep

2007-11-01 Thread Nick Piggin

Prepare for merging 32 and 64 bit spinlocks, by making them identical
(except for the OOSTORE thing). raw_read_lock and raw_write_lock get a
relaxed register constraint, and 64-bit has a few "=m" constraints changed
to "+m". I hope these things actually make the code better.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/include/asm-x86/spinlock_32.h
===
--- linux-2.6.orig/include/asm-x86/spinlock_32.h
+++ linux-2.6/include/asm-x86/spinlock_32.h
@@ -98,14 +98,15 @@ static inline int __raw_spin_trylock(raw
return ret;
 }
 
-#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)
+#if defined(CONFIG_X86_32) && \
+   (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
 /*
  * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
  * (PPro errata 66, 92)
  */
-#define UNLOCK_LOCK_PREFIX LOCK_PREFIX
+# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
 #else
-#define UNLOCK_LOCK_PREFIX
+# define UNLOCK_LOCK_PREFIX
 #endif
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
@@ -135,49 +136,42 @@ static inline void __raw_spin_unlock_wai
  *
  * On x86, we implement read-write locks as a 32-bit counter
  * with the high bit (sign) being the "contended" bit.
- *
- * The inline assembly is non-obvious. Think about it.
- *
- * Changed to use the same technique as rw semaphores.  See
- * semaphore.h for details.  -ben
- *
- * the helpers are in arch/i386/kernel/semaphore.c
  */
 
 /**
  * read_can_lock - would read_trylock() succeed?
  * @lock: the rwlock in question.
  */
-static inline int __raw_read_can_lock(raw_rwlock_t *x)
+static inline int __raw_read_can_lock(raw_rwlock_t *lock)
 {
-   return (int)(x)->lock > 0;
+   return (int)(lock)->lock > 0;
 }
 
 /**
  * write_can_lock - would write_trylock() succeed?
  * @lock: the rwlock in question.
  */
-static inline int __raw_write_can_lock(raw_rwlock_t *x)
+static inline int __raw_write_can_lock(raw_rwlock_t *lock)
 {
-   return (x)->lock == RW_LOCK_BIAS;
+   return (lock)->lock == RW_LOCK_BIAS;
 }
 
 static inline void __raw_read_lock(raw_rwlock_t *rw)
 {
-   asm volatile(LOCK_PREFIX " subl $1,(%0)\n\t"
+   asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t"
 "jns 1f\n"
 "call __read_lock_failed\n\t"
 "1:\n"
-::"a" (rw) : "memory");
+::"r" (rw) : "memory");
 }
 
 static inline void __raw_write_lock(raw_rwlock_t *rw)
 {
-   asm volatile(LOCK_PREFIX " subl $" RW_LOCK_BIAS_STR ",(%0)\n\t"
+   asm volatile(LOCK_PREFIX "subl %1,(%0)\n\t"
 "jz 1f\n"
 "call __write_lock_failed\n\t"
 "1:\n"
-::"a" (rw) : "memory");
+::"r" (rw), "i" (RW_LOCK_BIAS) : "memory");
 }
 
 static inline int __raw_read_trylock(raw_rwlock_t *lock)
@@ -206,8 +200,8 @@ static inline void __raw_read_unlock(raw
 
 static inline void __raw_write_unlock(raw_rwlock_t *rw)
 {
-   asm volatile(LOCK_PREFIX "addl $" RW_LOCK_BIAS_STR ", %0"
-: "+m" (rw->lock) : : "memory");
+   asm volatile(LOCK_PREFIX "addl %1,%0"
+   : "+m" (rw->lock) : "i" (RW_LOCK_BIAS) : "memory");
 }
 
 #define _raw_spin_relax(lock)  cpu_relax()
Index: linux-2.6/include/asm-x86/spinlock_64.h
===
--- linux-2.6.orig/include/asm-x86/spinlock_64.h
+++ linux-2.6/include/asm-x86/spinlock_64.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
@@ -126,11 +127,19 @@ static inline void __raw_spin_unlock_wai
  * with the high bit (sign) being the "contended" bit.
  */
 
+/**
+ * read_can_lock - would read_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
 static inline int __raw_read_can_lock(raw_rwlock_t *lock)
 {
return (int)(lock)->lock > 0;
 }
 
+/**
+ * write_can_lock - would write_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
 static inline int __raw_write_can_lock(raw_rwlock_t *lock)
 {
return (lock)->lock == RW_LOCK_BIAS;
@@ -140,18 +149,18 @@ static inline void __raw_read_lock(raw_r
 {
asm volatile(LOCK_PREFIX "subl $1,(%0)\n\t"
 "jns 1f\n"
-"call __read_lock_failed\n"
+"call __read_lock_failed\n\t"
 "1:\n"
-::"D&quo

[patch 4/4] x86: spinlock.h merge

2007-11-01 Thread Nick Piggin

Merge spinlock_32.h and spinlock_64.h into spinlock.h.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/include/asm-x86/spinlock.h
===
--- linux-2.6.orig/include/asm-x86/spinlock.h
+++ linux-2.6/include/asm-x86/spinlock.h
@@ -1,5 +1,211 @@
-#ifdef CONFIG_X86_32
-# include "spinlock_32.h"
+#ifndef __ASM_SPINLOCK_H
+#define __ASM_SPINLOCK_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Your basic SMP spinlocks, allowing only a single CPU anywhere
+ *
+ * Simple spin lock operations.  There are two variants, one clears IRQ's
+ * on the local processor, one does not.
+ *
+ * These are fair FIFO ticket locks, which are currently limited to 256
+ * CPUs.
+ *
+ * (the type definitions are in asm/spinlock_types.h)
+ */
+
+#if (NR_CPUS > 256)
+#error spinlock supports a maximum of 256 CPUs
+#endif
+
+static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
+{
+   int tmp = *(volatile signed int *)(&(lock)->slock);
+
+   return (((tmp >> 8) & 0xff) != (tmp & 0xff));
+}
+
+static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
+{
+   int tmp = *(volatile signed int *)(&(lock)->slock);
+
+   return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
+}
+
+static inline void __raw_spin_lock(raw_spinlock_t *lock)
+{
+   short inc = 0x0100;
+
+   /*
+* Ticket locks are conceptually two bytes, one indicating the current
+* head of the queue, and the other indicating the current tail. The
+* lock is acquired by atomically noting the tail and incrementing it
+* by one (thus adding ourself to the queue and noting our position),
+* then waiting until the head becomes equal to the initial value
+* of the tail.
+*
+* This uses a 16-bit xadd to increment the tail and also load the
+* position of the head, which takes care of memory ordering issues
+* and should be optimal for the uncontended case. Note the tail must
+* be in the high byte, otherwise the 16-bit wide increment of the low
+* byte would carry up and contaminate the high byte.
+*/
+
+   __asm__ __volatile__ (
+   LOCK_PREFIX "xaddw %w0, %1\n"
+   "1:\t"
+   "cmpb %h0, %b0\n\t"
+   "je 2f\n\t"
+   "rep ; nop\n\t"
+   "movb %1, %b0\n\t"
+   /* don't need lfence here, because loads are in-order */
+   "jmp 1b\n"
+   "2:"
+   :"+Q" (inc), "+m" (lock->slock)
+   :
+   :"memory", "cc");
+}
+
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
+
+static inline int __raw_spin_trylock(raw_spinlock_t *lock)
+{
+   short prev;
+   short new;
+   int ret = 0;
+
+   asm volatile(
+   "movw %2,%w0\n\t"
+   "cmpb %h0,%b0\n\t"
+   "jne 1f\n\t"
+   "movw %w0,%w1\n\t"
+   "incb %h1\n\t"
+   "lock ; cmpxchgw %w1,%2\n\t"
+   "decb %h1\n\t"
+   "cmpw %w0,%w1\n\t"
+   "jne 1f\n\t"
+   "movl $1,%3\n\t"
+   "1:"
+   :"=a" (prev), "=Q" (new), "+m" (lock->slock), "+m" (ret)
+   :
+   : "memory", "cc");
+
+   return ret;
+}
+
+#if defined(CONFIG_X86_32) && \
+   (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
+/*
+ * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
+ * (PPro errata 66, 92)
+ */
+# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
 #else
-# include "spinlock_64.h"
+# define UNLOCK_LOCK_PREFIX
 #endif
+
+static inline void __raw_spin_unlock(raw_spinlock_t *lock)
+{
+   __asm__ __volatile__(
+   UNLOCK_LOCK_PREFIX "incb %0"
+   :"+m" (lock->slock)
+   :
+   :"memory", "cc");
+}
+
+static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
+{
+   while (__raw_spin_is_locked(lock))
+   cpu_relax();
+}
+
+/*
+ * Read-write spinlocks, allowing multiple readers
+ * but only one writer.
+ *
+ * NOTE! it is quite common to have readers in interrupts
+ * but no interrupt writers. For those circumstances we
+ * can "mix" irq-safe locks - any writer needs to get a
+ * irq-safe write-lock, but readers can get non-irqsafe
+ * read-locks.
+ *
+ * On x86, we implement read-write locks as a 32-bit counter
+ * with the high bit (sign) being the "contended" bit.
+ */
+
+/**

[patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Nick Piggin

Introduce ticket lock spinlocks for x86 which are FIFO. The implementation
is described in the comments. The straight-line lock/unlock instruction
sequence is slightly slower than the dec based locks on modern x86 CPUs,
however the difference is quite small on Core2 and Opteron when working out of
cache, and becomes almost insignificant even on P4 when the lock misses cache.
trylock is more significantly slower, but they are relatively rare.

On an 8 core (2 socket) Opteron, spinlock unfairness is extremely noticeable,
with a userspace test having a difference of up to 2x runtime per thread, and
some threads are starved or "unfairly" granted the lock up to 1 000 000 (!)
times. After this patch, all threads appear to finish at exactly the same
time.

The memory ordering of the lock does conform to x86 standards, and the
implementation has been reviewed by Intel and AMD engineers.

The algorithm also tells us how many CPUs are contending the lock, so
lockbreak becomes trivial and we no longer have to waste 4 bytes per
spinlock for it.

After this, we can no longer spin on any locks with preempt enabled
and cannot reenable interrupts when spinning on an irq safe lock, because
at that point we have already taken a ticket and we would deadlock if
the same CPU tries to take the lock again.  These are questionable anyway:
if the lock happens to be called under a preempt or interrupt disabled section,
then it will just have the same latency problems. The real fix is to keep
critical sections short, and ensure locks are reasonably fair (which this
patch does).
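
To make the algorithm concrete, here is a rough user-space model of the same
idea (my illustration only, using GCC's __sync builtins on x86, not the kernel
code below):

struct ticketlock {
	volatile unsigned short head;	/* ticket currently being served */
	volatile unsigned short tail;	/* next ticket to hand out */
};

static void ticket_lock(struct ticketlock *lock)
{
	/* take a ticket: atomically bump the tail, remember the old value */
	unsigned short me = __sync_fetch_and_add(&lock->tail, 1);

	/* wait until the owner hands the lock to our ticket */
	while (lock->head != me)
		__asm__ __volatile__("rep; nop" ::: "memory");	/* pause */
}

static void ticket_unlock(struct ticketlock *lock)
{
	__asm__ __volatile__("" ::: "memory");	/* compiler barrier */
	lock->head++;			/* serve the next ticket */
}

static int ticket_is_contended(struct ticketlock *lock)
{
	/* more than one ticket outstanding means somebody is waiting */
	return (unsigned short)(lock->tail - lock->head) > 1;
}

The patch below does the equivalent with a single locked xaddw, which bumps
the tail byte and returns the old head/tail pair in one 16-bit operation.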

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
Index: linux-2.6/include/asm-x86/spinlock_64.h
===
--- linux-2.6.orig/include/asm-x86/spinlock_64.h
+++ linux-2.6/include/asm-x86/spinlock_64.h
@@ -12,74 +12,98 @@
  * Simple spin lock operations.  There are two variants, one clears IRQ's
  * on the local processor, one does not.
  *
- * We make no fairness assumptions. They have a cost.
+ * These are fair FIFO ticket locks, which are currently limited to 256
+ * CPUs.
  *
  * (the type definitions are in asm/spinlock_types.h)
  */
 
+#if (NR_CPUS > 256)
+#error spinlock supports a maximum of 256 CPUs
+#endif
+
 static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
 {
-   return *(volatile signed int *)(&(lock)->slock) <= 0;
+   int tmp = *(volatile signed int *)(&(lock)->slock);
+
+   return (((tmp >> 8) & 0xff) != (tmp & 0xff));
 }
 
-static inline void __raw_spin_lock(raw_spinlock_t *lock)
+static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
 {
-   asm volatile(
-   "\n1:\t"
-   LOCK_PREFIX " ; decl %0\n\t"
-   "jns 2f\n"
-   "3:\n"
-   "rep;nop\n\t"
-   "cmpl $0,%0\n\t"
-   "jle 3b\n\t"
-   "jmp 1b\n"
-   "2:\t" : "=m" (lock->slock) : : "memory");
+   int tmp = *(volatile signed int *)(&(lock)->slock);
+
+   return (((tmp >> 8) & 0xff) - (tmp & 0xff)) > 1;
 }
 
-/*
- * Same as __raw_spin_lock, but reenable interrupts during spinning.
- */
-#ifndef CONFIG_PROVE_LOCKING
-static inline void __raw_spin_lock_flags(raw_spinlock_t *lock, unsigned long 
flags)
+static inline void __raw_spin_lock(raw_spinlock_t *lock)
 {
-   asm volatile(
-   "\n1:\t"
-   LOCK_PREFIX " ; decl %0\n\t"
-   "jns 5f\n"
-   "testl $0x200, %1\n\t"  /* interrupts were disabled? */
-   "jz 4f\n\t"
-   "sti\n"
-   "3:\t"
-   "rep;nop\n\t"
-   "cmpl $0, %0\n\t"
-   "jle 3b\n\t"
-   "cli\n\t"
+   short inc = 0x0100;
+
+   /*
+* Ticket locks are conceptually two bytes, one indicating the current
+* head of the queue, and the other indicating the current tail. The
+* lock is acquired by atomically noting the tail and incrementing it
+* by one (thus adding ourself to the queue and noting our position),
+* then waiting until the head becomes equal to the initial value
+* of the tail.
+*
+* This uses a 16-bit xadd to increment the tail and also load the
+* position of the head, which takes care of memory ordering issues
+* and should be optimal for the uncontended case. Note the tail must
+* be in the high byte, otherwise the 16-bit wide increment of the low
+* byte would carry up and contaminate the high byte.
+*/
+
+   __asm__ __volatile__ (
+   LOCK_PREFIX "xaddw %w0, %1\n"
+   "1:\t"

[patch 1/4] spinlock: lockbreak cleanup

2007-11-01 Thread Nick Piggin

The break_lock data structure and code for spinlocks is quite nasty.
Not only does it double the size of a spinlock but it changes locking to
a potentially less optimal trylock.

Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
__raw_spin_is_contended that uses the lock data itself to determine whether
there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
not set.

Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
decouple it from the spinlock implementation, and make it typesafe (rwlocks
do not have any need_lockbreak sites -- why do they even get bloated up
with that break_lock then?).
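
Schematically, the call sites converted below end up looking like this
(sketch only; the queue and its helpers are invented):

	/* long scan that voluntarily drops a contended lock */
	spin_lock(&q->lock);
	while ((entry = next_entry(q)) != NULL) {
		process_entry(entry);
		if (need_resched() || spin_needbreak(&q->lock)) {
			spin_unlock(&q->lock);
			cond_resched();
			spin_lock(&q->lock);
		}
	}
	spin_unlock(&q->lock);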

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1854,23 +1854,16 @@ extern int cond_resched_softirq(void);
 
 /*
  * Does a critical section need to be broken due to another
- * task waiting?:
+ * task waiting?: (technically does not depend on CONFIG_PREEMPT,
+ * but a general need for low latency)
  */
-#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP)
-# define need_lockbreak(lock) ((lock)->break_lock)
-#else
-# define need_lockbreak(lock) 0
-#endif
-
-/*
- * Does a critical section need to be broken due to another
- * task waiting or preemption being signalled:
- */
-static inline int lock_need_resched(spinlock_t *lock)
+static inline int spin_needbreak(spinlock_t *lock)
 {
-   if (need_lockbreak(lock) || need_resched())
-   return 1;
+#ifdef CONFIG_PREEMPT
+   return spin_is_contended(lock);
+#else
return 0;
+#endif
 }
 
 /*
Index: linux-2.6/include/linux/spinlock.h
===
--- linux-2.6.orig/include/linux/spinlock.h
+++ linux-2.6/include/linux/spinlock.h
@@ -120,6 +120,12 @@ do {   
\
 
 #define spin_is_locked(lock)   __raw_spin_is_locked(&(lock)->raw_lock)
 
+#ifdef CONFIG_GENERIC_LOCKBREAK
+#define spin_is_contended(lock) ((lock)->break_lock)
+#else
+#define spin_is_contended(lock)
__raw_spin_is_contended(&(lock)->raw_lock)
+#endif
+
 /**
  * spin_unlock_wait - wait until the spinlock gets unlocked
  * @lock: the spinlock in question.
Index: linux-2.6/fs/jbd/checkpoint.c
===
--- linux-2.6.orig/fs/jbd/checkpoint.c
+++ linux-2.6/fs/jbd/checkpoint.c
@@ -347,7 +347,8 @@ restart:
break;
}
retry = __process_buffer(journal, jh, bhs, &batch_count);
-   if (!retry && lock_need_resched(&journal->j_list_lock)){
+   if (!retry && (need_resched() ||
+   spin_needbreak(&journal->j_list_lock))) {
spin_unlock(&journal->j_list_lock);
retry = 1;
break;
Index: linux-2.6/fs/jbd/commit.c
===
--- linux-2.6.orig/fs/jbd/commit.c
+++ linux-2.6/fs/jbd/commit.c
@@ -265,7 +265,7 @@ write_out_data:
put_bh(bh);
}
 
-   if (lock_need_resched(&journal->j_list_lock)) {
+   if (need_resched() || spin_needbreak(&journal->j_list_lock)) {
spin_unlock(&journal->j_list_lock);
goto write_out_data;
}
Index: linux-2.6/fs/jbd2/checkpoint.c
===
--- linux-2.6.orig/fs/jbd2/checkpoint.c
+++ linux-2.6/fs/jbd2/checkpoint.c
@@ -347,7 +347,8 @@ restart:
break;
}
retry = __process_buffer(journal, jh, bhs, &batch_count);
-   if (!retry && lock_need_resched(&journal->j_list_lock)){
+   if (!retry && (need_resched() ||
+   spin_needbreak(&journal->j_list_lock))) {
spin_unlock(&journal->j_list_lock);
retry = 1;
break;
Index: linux-2.6/fs/jbd2/commit.c
===
--- linux-2.6.orig/fs/jbd2/commit.c
+++ linux-2.6/fs/jbd2/commit.c
@@ -265,7 +265,7 @@ write_out_data:
put_bh(bh);
}
 
-   if (lock_need_resched(&journal->j_list_lock)) {
+   if (need_resched() || spin_needbreak(&journal->j_list_lock)) {
spin_unlock(&journal->j_list_lock);
goto write_out_data;
}
Index: linux-2.6/include/linux/spinlock_up.h
===
--- linux-2.6.orig/include/linux/spinlock_up.

[patch 0/4] ticket spinlocks for x86

2007-11-01 Thread Nick Piggin
Hi,

I'd like to propose these patches for the x86 tree for a bit more exposure
and testing. Or at least get some discussion going again.

Just for fun I also had a shot at merging the headers, as they become a
lot more similar after this with the removal of the paravirt crud.


Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc1 eat my photo SD card :-(

2007-11-01 Thread Nick Piggin
On Thursday 01 November 2007 22:56, Romano Giannetti wrote:
> Hi,
>
>   I have a very possible regression to signal. This morning 2.6.24-rc1
> eat and destroyed my SD card. I have a toshiba laptop with a card slot
> and I have used it with 2.6.23-rcX and 2.6.23 without problems so far.
> This morning I put the card in, nothing happened, removed it. When I put
> it in again the filesystem in it was completely scr***ed up.
>
> I have a flight waiting now, so I have put all the dmesgs and syslogs
> over there:
>
> http://www.dea.icai.upcomillas.es/romano/linux/info/2624-rc1-mmc/
>
> Sunday I'll be back to help debug it.

Thanks for the report. Is it a FAT filesystem? Is it reproducible?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dealing with barriers (was Re: [PATCH] firewire: fw-core: enforce write order when updating fw_device.generation)

2007-11-01 Thread Nick Piggin
On Thursday 01 November 2007 20:51, Stefan Richter wrote:
> Nick Piggin wrote:
> > On Thursday 01 November 2007 12:49, Stefan Richter wrote:
> >> fw_device.node_id and fw_device.generation are accessed without mutexes.
> >> We have to ensure that all readers will get to see node_id updates
> >> before generation updates.
> >
> > Hi, a few points:
>
> Thanks, this is appreciated.
>
> > - can change it to use spinlocks instead? This would be most
> >   preferable.
>
> It could be done, but I don't find it preferable for this purpose.
> Although the struct members in question are basically part of fw-core's
> API (kernel-internal API), there won't be a lot of places which access
> these members.
>
> (Besides, lockless algorithms are essential to FireWire driver code
> everywhere where CPU and controller interact.  So IMO these lockless
> accesses don't constitute a whole new level of complexity in these
> drivers.)

Fair enough.


> > - if not, you need comments.
>
> So far I tended to disagree every time when checkpatch.pl bugged me
> about barriers without comments.  But maybe that was because I had a
> wrong idea of what kind of comment should go there.  There would
> certainly be no value in a comment which effectively says "trust me, I
> know we need a barrier here".  However, thinking about it again, it
> occurs to me that I should add the following comments:
>
>   1.) At respective struct type definitions:  A comment on how struct
>   members are to be accessed and why.

That's something that can be really helpful I think, yes.


>   2.) At the occurrences of rmb() and wmb():  A comment on which
>   accesses the particular barrier divides.  This is in order to get
>   the barriers properly updated (i.e. moved or removed) when
>   surrounding code is reordered, APIs reworked, or whatever.

Yes, this is the main thing to comment at the actual site of the barrier.
Although it isn't always 100% clear with locks either (because you might
do some non-critical operations inside a lock section), it just tends to
be clearer what is being locked, and what other paths are being locked
against.

So describing what accesses are being ordered is good. Also a brief
description of how the lockless read-side works, if it's not obvious
(or you've already described that in your #1).
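
As a concrete shape (my sketch, not the actual firewire code), that kind of
comment pairs each barrier with its counterpart and names the accesses it
separates:

	/* writer (bus reset): publish node_id before generation */
	device->node_id = new_node_id;
	smp_wmb();	/* pairs with smp_rmb() in the reader below */
	device->generation = new_generation;

	/* reader: sample generation first, then node_id */
	generation = device->generation;
	smp_rmb();	/* don't let the node_id load pass the generation load */
	node_id = device->node_id;
	/* if device->generation changed meanwhile, the pair is stale: retry */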


> Of course this division of the Why and the What only applies to accesses
> like those in the patch --- i.e. API-like data types which aren't
> abstracted by accessor functions --- while there may be other
> requirements on comments for other usage types of barriers.
>
> Or am I still wrong in my judgment of how barriers should be documented?

I think what you've said sounds good. I think that even memory barriers
within some subsystem can use commenting, even if only a line or two.


> > - you also seem to be missing rmb()s now. I see a couple in the
> >   firewire directory, but nothing that seems to be ordering loads
> >   of these particular fields.
>
> That's right.  Of course the wmb()s don't make sense if the reader side
> isn't fixed up accordingly.  I did this for all places which I spotted
> yesterday in the patches
> "firewire: fw-core: react on bus resets while the config ROM is being
> fetched", http://lkml.org/lkml/2007/10/31/464  (Hmm, this has an unclear
> changelog.)
> "firewire: fw-sbp2: enforce read order of device generation and node
> ID", http://lkml.org/lkml/2007/10/31/465

Ah, right, didn't see them. Thanks, that looks more useful now ;)


> I didn't fold all these patches into one because these two other patches
> include also other changes to the respective read sides.  I should have
> pointed this out in this first patch.

Sure. I might personally still separate out the memory ordering fix
and put all the rmb and wmb in a single patch, but that's no big
deal.


> Also, it could be that I have overlooked one more reader last night; I
> will reexamine it.
>
> > - use smp_*mb() if you are just ordering regular cacheable RAM
> >   accesses.
>
> Hmm, right.  Looks like the smp_ variants are indeed the right thing to
> do here.

OK.


> > Also, diffstat is a bit wrong... maybe you posted the wrong version?
>
> Oops, this happened when I worked on what became the "fw-core: react on
> bus resets..." patch.  So the diffstat is wrong but...

Ah, that's no problem. I manage to screw up patches in much worse
ways than this, don't worry ;)

Thanks,
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch 3/4] x86: spinlock.h merge prep

2007-11-01 Thread Nick Piggin

Prepare for merging 32 and 64 bit spinlocks, by making them identical
(except for the OOSTORE thing). raw_read_lock and raw_write_lock get a
relaxed register constraint, and 64-bit has a few =m constraints changed
to +m. I hope these things actually make the code better.
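
(For illustration, not from the patch: "+m" tells the compiler the asm reads
the old value as well as storing a new one, which is what a read-modify-write
instruction really does, whereas "=m" claims the old value is dead.)

static inline void example_inc(int *v)
{
	/* read-modify-write: the "+m" constraint is the honest one here */
	asm volatile("lock; incl %0" : "+m" (*v) : : "memory");
}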

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/include/asm-x86/spinlock_32.h
===
--- linux-2.6.orig/include/asm-x86/spinlock_32.h
+++ linux-2.6/include/asm-x86/spinlock_32.h
@@ -98,14 +98,15 @@ static inline int __raw_spin_trylock(raw
return ret;
 }
 
-#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)
+#if defined(CONFIG_X86_32)  \
+   (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
 /*
  * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
  * (PPro errata 66, 92)
  */
-#define UNLOCK_LOCK_PREFIX LOCK_PREFIX
+# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
 #else
-#define UNLOCK_LOCK_PREFIX
+# define UNLOCK_LOCK_PREFIX
 #endif
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
@@ -135,49 +136,42 @@ static inline void __raw_spin_unlock_wai
  *
  * On x86, we implement read-write locks as a 32-bit counter
  * with the high bit (sign) being the contended bit.
- *
- * The inline assembly is non-obvious. Think about it.
- *
- * Changed to use the same technique as rw semaphores.  See
- * semaphore.h for details.  -ben
- *
- * the helpers are in arch/i386/kernel/semaphore.c
  */
 
 /**
  * read_can_lock - would read_trylock() succeed?
  * @lock: the rwlock in question.
  */
-static inline int __raw_read_can_lock(raw_rwlock_t *x)
+static inline int __raw_read_can_lock(raw_rwlock_t *lock)
 {
-   return (int)(x)-lock  0;
+   return (int)(lock)-lock  0;
 }
 
 /**
  * write_can_lock - would write_trylock() succeed?
  * @lock: the rwlock in question.
  */
-static inline int __raw_write_can_lock(raw_rwlock_t *x)
+static inline int __raw_write_can_lock(raw_rwlock_t *lock)
 {
-   return (x)-lock == RW_LOCK_BIAS;
+   return (lock)-lock == RW_LOCK_BIAS;
 }
 
 static inline void __raw_read_lock(raw_rwlock_t *rw)
 {
-   asm volatile(LOCK_PREFIX  subl $1,(%0)\n\t
+   asm volatile(LOCK_PREFIX subl $1,(%0)\n\t
 jns 1f\n
 call __read_lock_failed\n\t
 1:\n
-::a (rw) : memory);
+::r (rw) : memory);
 }
 
 static inline void __raw_write_lock(raw_rwlock_t *rw)
 {
-   asm volatile(LOCK_PREFIX  subl $ RW_LOCK_BIAS_STR ,(%0)\n\t
+   asm volatile(LOCK_PREFIX subl %1,(%0)\n\t
 jz 1f\n
 call __write_lock_failed\n\t
 1:\n
-::a (rw) : memory);
+::r (rw), i (RW_LOCK_BIAS) : memory);
 }
 
 static inline int __raw_read_trylock(raw_rwlock_t *lock)
@@ -206,8 +200,8 @@ static inline void __raw_read_unlock(raw
 
 static inline void __raw_write_unlock(raw_rwlock_t *rw)
 {
-   asm volatile(LOCK_PREFIX addl $ RW_LOCK_BIAS_STR , %0
-: +m (rw-lock) : : memory);
+   asm volatile(LOCK_PREFIX addl %1,%0
+   : +m (rw-lock) : i (RW_LOCK_BIAS) : memory);
 }
 
 #define _raw_spin_relax(lock)  cpu_relax()
Index: linux-2.6/include/asm-x86/spinlock_64.h
===
--- linux-2.6.orig/include/asm-x86/spinlock_64.h
+++ linux-2.6/include/asm-x86/spinlock_64.h
@@ -5,6 +5,7 @@
 #include asm/rwlock.h
 #include asm/page.h
 #include asm/processor.h
+#include linux/compiler.h
 
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
@@ -126,11 +127,19 @@ static inline void __raw_spin_unlock_wai
  * with the high bit (sign) being the contended bit.
  */
 
+/**
+ * read_can_lock - would read_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
 static inline int __raw_read_can_lock(raw_rwlock_t *lock)
 {
return (int)(lock)-lock  0;
 }
 
+/**
+ * write_can_lock - would write_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
 static inline int __raw_write_can_lock(raw_rwlock_t *lock)
 {
return (lock)-lock == RW_LOCK_BIAS;
@@ -140,18 +149,18 @@ static inline void __raw_read_lock(raw_r
 {
asm volatile(LOCK_PREFIX subl $1,(%0)\n\t
 jns 1f\n
-call __read_lock_failed\n
+call __read_lock_failed\n\t
 1:\n
-::D (rw), i (RW_LOCK_BIAS) : memory);
+::r (rw) : memory);
 }
 
 static inline void __raw_write_lock(raw_rwlock_t *rw)
 {
asm volatile(LOCK_PREFIX subl %1,(%0)\n\t
 jz 1f\n
-\tcall __write_lock_failed\n\t
+call __write_lock_failed\n\t
 1:\n
-::D (rw), i (RW_LOCK_BIAS) : memory);
+::r (rw), i (RW_LOCK_BIAS) : memory

[patch 4/4] x86: spinlock.h merge

2007-11-01 Thread Nick Piggin

Merge spinlock_32.h and spinlock_64.h into spinlock.h.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/include/asm-x86/spinlock.h
===
--- linux-2.6.orig/include/asm-x86/spinlock.h
+++ linux-2.6/include/asm-x86/spinlock.h
@@ -1,5 +1,211 @@
-#ifdef CONFIG_X86_32
-# include spinlock_32.h
+#ifndef __ASM_SPINLOCK_H
+#define __ASM_SPINLOCK_H
+
+#include asm/atomic.h
+#include asm/rwlock.h
+#include asm/page.h
+#include asm/processor.h
+#include linux/compiler.h
+
+/*
+ * Your basic SMP spinlocks, allowing only a single CPU anywhere
+ *
+ * Simple spin lock operations.  There are two variants, one clears IRQ's
+ * on the local processor, one does not.
+ *
+ * These are fair FIFO ticket locks, which are currently limited to 256
+ * CPUs.
+ *
+ * (the type definitions are in asm/spinlock_types.h)
+ */
+
+#if (NR_CPUS  256)
+#error spinlock supports a maximum of 256 CPUs
+#endif
+
+static inline int __raw_spin_is_locked(raw_spinlock_t *lock)
+{
+   int tmp = *(volatile signed int *)((lock)-slock);
+
+   return (((tmp  8)  0xff) != (tmp  0xff));
+}
+
+static inline int __raw_spin_is_contended(raw_spinlock_t *lock)
+{
+   int tmp = *(volatile signed int *)((lock)-slock);
+
+   return (((tmp  8)  0xff) - (tmp  0xff))  1;
+}
+
+static inline void __raw_spin_lock(raw_spinlock_t *lock)
+{
+   short inc = 0x0100;
+
+   /*
+* Ticket locks are conceptually two bytes, one indicating the current
+* head of the queue, and the other indicating the current tail. The
+* lock is acquired by atomically noting the tail and incrementing it
+* by one (thus adding ourself to the queue and noting our position),
+* then waiting until the head becomes equal to the the initial value
+* of the tail.
+*
+* This uses a 16-bit xadd to increment the tail and also load the
+* position of the head, which takes care of memory ordering issues
+* and should be optimal for the uncontended case. Note the tail must
+* be in the high byte, otherwise the 16-bit wide increment of the low
+* byte would carry up and contaminate the high byte.
+*/
+
+   __asm__ __volatile__ (
+   LOCK_PREFIX xaddw %w0, %1\n
+   1:\t
+   cmpb %h0, %b0\n\t
+   je 2f\n\t
+   rep ; nop\n\t
+   movb %1, %b0\n\t
+   /* don't need lfence here, because loads are in-order */
+   jmp 1b\n
+   2:
+   :+Q (inc), +m (lock-slock)
+   :
+   :memory, cc);
+}
+
+#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)
+
+static inline int __raw_spin_trylock(raw_spinlock_t *lock)
+{
+   short prev;
+   short new;
+   int ret = 0;
+
+   asm volatile(
+   movw %2,%w0\n\t
+   cmpb %h0,%b0\n\t
+   jne 1f\n\t
+   movw %w0,%w1\n\t
+   incb %h1\n\t
+   lock ; cmpxchgw %w1,%2\n\t
+   decb %h1\n\t
+   cmpw %w0,%w1\n\t
+   jne 1f\n\t
+   movl $1,%3\n\t
+   1:
+   :=a (prev), =Q (new), +m (lock-slock), +m (ret)
+   :
+   : memory, cc);
+
+   return ret;
+}
+
+#if defined(CONFIG_X86_32)  \
+   (defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
+/*
+ * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock
+ * (PPro errata 66, 92)
+ */
+# define UNLOCK_LOCK_PREFIX LOCK_PREFIX
 #else
-# include spinlock_64.h
+# define UNLOCK_LOCK_PREFIX
 #endif
+
+static inline void __raw_spin_unlock(raw_spinlock_t *lock)
+{
+   __asm__ __volatile__(
+   UNLOCK_LOCK_PREFIX incb %0
+   :+m (lock-slock)
+   :
+   :memory, cc);
+}
+
+static inline void __raw_spin_unlock_wait(raw_spinlock_t *lock)
+{
+   while (__raw_spin_is_locked(lock))
+   cpu_relax();
+}
+
+/*
+ * Read-write spinlocks, allowing multiple readers
+ * but only one writer.
+ *
+ * NOTE! it is quite common to have readers in interrupts
+ * but no interrupt writers. For those circumstances we
+ * can mix irq-safe locks - any writer needs to get a
+ * irq-safe write-lock, but readers can get non-irqsafe
+ * read-locks.
+ *
+ * On x86, we implement read-write locks as a 32-bit counter
+ * with the high bit (sign) being the contended bit.
+ */
+
+/**
+ * read_can_lock - would read_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
+static inline int __raw_read_can_lock(raw_rwlock_t *lock)
+{
+   return (int)(lock)-lock  0;
+}
+
+/**
+ * write_can_lock - would write_trylock() succeed?
+ * @lock: the rwlock in question.
+ */
+static inline int __raw_write_can_lock(raw_rwlock_t *lock)
+{
+   return (lock)-lock == RW_LOCK_BIAS;
+}
+
+static inline void __raw_read_lock(raw_rwlock_t *rw)
+{
+   asm

Re: [patch 1/4] spinlock: lockbreak cleanup

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 03:06:05PM +0100, Peter Zijlstra wrote:
 On Thu, 2007-11-01 at 15:02 +0100, Nick Piggin wrote:
 
  Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
  decouple it from the spinlock implementation, and make it typesafe (rwlocks
  do not have any need_lockbreak sites -- why do they even get bloated up
  with that break_lock then?).
 
 IIRC Lee has a few patches floating about that do introduce lockbreak
 stuff for rwlocks.

Well that would be a good reason to introduce a break_lock for them,
but previously not so much... we have rwlocks in some slightly space
critical structures (vmas, inodes, etc).

I guess it was done to make the template hacks easier. I don't really
find that in good taste, especially for important core infrastructure.
Anyway.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 08:14:47AM -0700, Linus Torvalds wrote:
 
 
 On Thu, 1 Nov 2007, Nick Piggin wrote:
 
  On Wed, Oct 31, 2007 at 04:08:21PM -0700, Linus Torvalds wrote:
   
   We made much bigger changes to ptrace support when we disallowed writing 
   to read-only shared memory areas (we used to do the magic per-page COW 
   thing).
  
  Really? No, we still do that magic COW thing which creates anonymous
  pages in MAP_SHARED vmas, don't we?
 
 No, we don't. I'm pretty sure. It didn't work with the VM cleanups, since 
 the MAP_SHARED vma's won't be on the anonymous list any more, and cannot 
 be swapped out.
 
 So now, if you try to write to a read-only shared page through ptrace, 
 you'll get Unable to access.
 
No, it COWs it (the file is RW).

I believe do_wp_page will still attach an anon_vma to the vma, which
will make the pte discoverable via rmap.


 Of course, I didn't really look closely, so maybe I just don't remember 
 things right..
 
   access_vm_pages() (things like core-dumping comes to mind - although I 
   think we don't dump pure file mappings at all, do we?) it would certainly 
   be good to run any such tests on the current -git tree...
  
  We do for MAP_SHARED|MAP_ANONYMOUS, by the looks.
 
 Well, as we should. There's no way for a debugger to get those pages back. 
 So that all looks sane.
 
  -   vm_flags |= VM_SHARED | VM_MAYSHARE;
  -   if (!(file-f_mode  FMODE_WRITE))
  -   vm_flags = ~(VM_MAYWRITE | VM_SHARED);
  +   vm_flags |= VM_MAYSHARE;
  +   if (file-f_mode  FMODE_WRITE)
  +   vm_flags |= VM_SHARED;
  +   if (!(vm_flags  VM_WRITE))
  +   vm_flags = ~VM_MAYWRITE;
 
 This looks totally bogus. What was the intent of this patch?
 
 The VM_MAYWRITE flag is *not* supposed to track the VM_WRITE flag: that 
 would defeat the whole purpose of it! The whole point of that flag is to 
 say whether mprotect() could turn it into a VM_WRITE mapping, and it 
 depends on the file mode, not VM_WRITE!

Yeah of course that won't work, stupid...

The intent is to stop get_user_pages from proceeding with a write fault (and
subsequent COW) to readonly shared mappings, when force is set. I think it
can be done simply via get_user_pages(), which is what I should have done
to begin with.

Untested patch follows
---
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1031,7 +1031,9 @@ int get_user_pages(struct task_struct *t
}
 
if (!vma || (vma->vm_flags & (VM_IO | VM_PFNMAP))
-   || !(vm_flags & vma->vm_flags))
+   || !(vm_flags & vma->vm_flags)
+   || (write && ((vma->vm_flags &
+ (VM_SHARED|VM_MAYSHARE)) == VM_MAYSHARE)))
return i ? : -EFAULT;
 
if (is_vm_hugetlb_page(vma)) {
@@ -1563,13 +1565,11 @@ static int do_wp_page(struct mm_struct *
reuse = can_share_swap_page(old_page);
unlock_page(old_page);
}
-   } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-   (VM_WRITE|VM_SHARED))) {
+   } else if (unlikely((vma->vm_flags & VM_SHARED))) {
/*
-* Only catch write-faults on shared writable pages,
-* read-only shared pages can get COWed by
-* get_user_pages(.write=1, .force=1).
+* Only catch write-faults on shared writable pages.
 */
+   BUG_ON(!(vma->vm_flags & VM_WRITE));
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
/*
 * Notify the address space that the page is about to
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/4] spinlock: lockbreak cleanup

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 04:46:36PM +0100, Ingo Molnar wrote:
 
 * Lee Schermerhorn [EMAIL PROTECTED] wrote:
 
   I guess it was done to make the template hacks eaiser. I don't 
   really find that in good taste, especially for important core 
   infrastructure. Anyway.
  
  Actually, what I had/have is a cond_resched_rwlock() that I needed to 
  convert the i_mmap_lock() to rw for testing reclaim scalability.  
  [I've seen a large system running an Oracle OLTP load hang spitting 
  cpu soft lockup messages with all cpus spinning on a i_mmap_lock 
  spin lock.] One of the i_mmap_lock paths uses cond_resched_lock() for 
  spin locks. To do a straight forward conversion [and maybe that isn't 
  the right approach], I created the cond_resched_rwlock() function by 
  generalizing the cond_sched_lock() code and creating both spin and rw 
  lock wrappers. I took advantage of the fact that, currently, 
  need_lockbreak() is a macro and that both spin and rw locks have/had 
  the break_lock member. Typesafe functions would probably be 
  preferrable, if we want to keep break_lock for rw spin locks.
  
  Here's the most recent posting:
  
  http://marc.info/?l=linux-mmm=118980356306014w=4
  
  See the changes to sched.[ch].  Should apply to 23-mm1 with offsets 
  and minor fixup in fs/inode.c.
 
 yep. I'm too in favor of keeping the need-lockbreak mechanism and its 
 type-insensitive data structure. We've got way too many locking 
 primitives and keeping them all sorted is nontrivial already.

I think a large contributor to that is being a bit clever with indirections
and cute code (eg. like this template stuff), rather than having two types of
spinlocks instead of one.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 04:01:45PM -0400, Chuck Ebbert wrote:
 On 11/01/2007 10:03 AM, Nick Piggin wrote:
 
 [edited to show the resulting code]
 
  +   __asm__ __volatile__ (
  +   LOCK_PREFIX xaddw %w0, %1\n
  +   1:\t
  +   cmpb %h0, %b0\n\t
  +   je 2f\n\t
  +   rep ; nop\n\t
  +   movb %1, %b0\n\t
  +   /* don't need lfence here, because loads are in-order */
  jmp 1b\n
  +   2:
  +   :+Q (inc), +m (lock-slock)
  +   :
  +   :memory, cc);
   }
 
 If you really thought you might get long queues, you could figure out
 how far back you are and use that to determine how long to wait before
 testing the lock again. That cmpb could become a subb without adding
 overhead to the fast path -- that would give you the queue length (or
 its complement anyway.)

Indeed. You can use this as a really nice input into a backoff
algorithm (eg. if you're next in line, don't back off, or at least
don't go into exponential backoff; if you've got people in front
of you, start throttling harder).
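
Roughly like this, as a user-space style sketch (the scale factor and the
helper are invented):

static void wait_my_turn(unsigned short me, volatile unsigned short *head)
{
	unsigned short dist;

	while ((dist = (unsigned short)(me - *head)) != 0) {
		/* the further back in the queue, the longer we wait
		 * before touching the contended cacheline again */
		unsigned int spins = (unsigned int)dist * 64;

		while (spins--)
			__asm__ __volatile__("rep; nop" ::: "memory");
	}
}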

I think I'll leave that to SGI if they come up with a big x86 SSI ;)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 09:08:45AM -0700, Linus Torvalds wrote:
 
 
 On Thu, 1 Nov 2007, Nick Piggin wrote:
  
  Untested patch follows
 
 Ok, this looks ok.
 
 Except I would remove the VM_MAYSHARE bit from the test.

But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb
uses this for inserting breakpoints (but fortunately, a COW page in a
MAP_PRIVATE mapping is a much more natural thing for the VM).


 That whole bit should go, in fact.
 
 We used to make it something different: iirc, a read-only SHARED mapping 
 was downgraded to a non-shared mapping, because we wanted to avoid some of 
 the costs we used to have with the VM implementation (actually, I think it 
 was various filesystems that don't like shared mappings because they don't 
 have a per-page writeback). But we left the VM_MAYSHARE bit on, to get 
 /proc/pid/mmap things right.
 
 Or something like that. I forget the details. But I *think* we don't 
 actually need this any more.
 
 But basically, the right way to test for shared mappings is historically 
 to just test the VM_MAYSHARE bit - but not *both* bits. Because VM_SHARE 
 may have been artificially cleared.
 
I think you're right -- VM_MAYSHARE is basically testing for MAP_SHARED.
I just don't know exactly what you're proposing here.

Thanks,
Nick
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 06:17:42PM -0700, Linus Torvalds wrote:
 
 
 On Fri, 2 Nov 2007, Nick Piggin wrote:
  
  But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb
  uses this for inserting breakpoints (but fortunately, a COW page in a
  MAP_PRIVATE mapping is a much more natural thing for the VM).
 
 Yes, I phrased that badly. I meant that I'd be happier if we got rid of 
 VM_MAYSHARE entirely, and just used VM_SHARED. I thought we already made 
 them always be the same (and any VM_MAYSHARE use is historical).

Oh yeah, I think it would probably be clearer to use VM_SHARED == MAP_SHARED,
and test the write permission explicitly. Though there could be something
I missed that makes it not as easy as it sounds... probably something best
left for Hugh ;)

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
 
 
 On Thu, 1 Nov 2007, Rik van Riel wrote:
  
  Larry Woodman managed to wedge the VM into a state where, on his
  4x dual core system, only 2 cores (on the same CPU) could get the
  zone-lru_lock overnight.  The other 6 cores on the system were
  just spinning, without being able to get the lock.

That's quite incredible, considering that the CPUs actually _taking_
the locks also drop the locks and do quite a bit of work before taking
them again (ie. they take them to pull pages off the LRU, but then
do a reasonable amount of work to remove each one from pagecache before
refilling from the LRU).

Possibly actually that is a *more* difficult case for the HW to handle:
once the CPU actually goes away and operates on other cachelines, it 
may get a little more difficult to detect that it is causing starvation
issues.


 .. and this is almost always the result of a locking *bug*, not unfairness 
 per se. IOW, unfairness just ends up showing the bug in the first place.

I'd almost agree, but there are always going to be corner cases where
we get multiple contentions on a spinlock -- the fact that a lock is
needed at all obviously suggests that it can be contended. The LRU locking
could be improved, but you could have eg. scheduler runqueue lock starvation
if the planets lined up just right, and it is a little more difficult to
improve on runqueue locking.

Anyway, I also think this is partially a hardware issue, and as muliple
cores, threads, and sockets get more common, I hope it will improve (it
affects Intel CPUs as well as AMD). So it is possible to have an option
to switch between locks if the hardware is fairer, but I want to get
as much exposure with this locking as possible for now, to see if there
are any funny performance corner cases exposed (which quite possibly will
turn out to be caused by suboptimal locking itself).

Anyway, if this can make its way to the x86 tree, I think it will get
pulled into -mm (?) and get some exposure...

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: filp usage when cpu busy

2007-10-31 Thread Nick Piggin
On Thursday 01 November 2007 12:56, bc Wong (chimwong) wrote:
> Hi,
>
> With 2.6.16 x86_64 on a 4 core machine, I noticed
> that the filp usage (according to /proc/slabinfo)
> shoots up and keeps on increasing sharply when one
> of the CPUs is (1) locked up, or (2) very busy
> doing a lot of printk()'s with KERN_EMERG.
>
> In the case of (1), it's permanent until it runs
> out of memory eventually. For (2), it's temporary;
> filp count comes back down when the printk()'s are
> done.
>
> I can't think of any relationship between a busy/
> locked-up CPU and filp count. The system is still
> functional. New short-lived processes kept being
> created, but the overall number of processes is
> stable.
>
> Does anyone know why filp count would go up like
> that?

Yeah, it's probably because filp structures are freed by
RCU, and if you have a locked up CPU then it can't go
through a quiescent state so RCU stops freeing your filps.

If you add some cond_resched()s to your code, you should
find that RCU will force a reschedule and things will work
(actually, for 2.6.16, I'm not sure if RCU had the code to
force a reschedule... it's force_quiescent_state() in
kernel/rcupdate.c upstream).
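
For background: objects handed to call_rcu() are only freed after every CPU
has passed a quiescent state, so a CPU that spins in the kernel without ever
scheduling pins all of them. A simplified sketch of the shape (not the exact
fs/file_table.c code):

static void obj_free_rcu(struct rcu_head *head)
{
	struct obj *o = container_of(head, struct obj, rcu);

	kmem_cache_free(obj_cachep, o);
}

static void obj_put(struct obj *o)
{
	if (atomic_dec_and_test(&o->count))
		/* deferred: runs only after a full grace period, which a
		 * CPU that never schedules or returns to userspace blocks */
		call_rcu(&o->rcu, obj_free_rcu);
}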
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] firewire: fw-core: enforce write order when updating fw_device.generation

2007-10-31 Thread Nick Piggin
On Thursday 01 November 2007 12:49, Stefan Richter wrote:
> fw_device.node_id and fw_device.generation are accessed without mutexes.
> We have to ensure that all readers will get to see node_id updates
> before generation updates.
>

Hi, a few points:

- can change it to use spinlocks instead? This would be most
  preferable.

- if not, you need comments.

- you also seem to be missing rmb()s now. I see a couple in the
  firewire directory, but nothing that seems to be ordering loads
  of these particular fields.

- use smp_*mb() if you are just ordering regular cacheable RAM
  accesses.

Also, diffstat is a bit wrong... maybe you posted the wrong version?

> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>
> ---
>  drivers/firewire/fw-device.c   |6 ++
>  drivers/firewire/fw-topology.c |1 +
>  2 files changed, 7 insertions(+)
>
> Index: linux/drivers/firewire/fw-device.c
> ===
> --- linux.orig/drivers/firewire/fw-device.c
> +++ linux/drivers/firewire/fw-device.c
> @@ -808,6 +813,7 @@ void fw_node_event(struct fw_card *card,
>
>   device = node->data;
>   device->node_id = node->node_id;
> + wmb();
>   device->generation = card->generation;
>   if (atomic_read(&device->state) == FW_DEVICE_RUNNING) {
>   PREPARE_DELAYED_WORK(&device->work, fw_device_update);
> Index: linux/drivers/firewire/fw-topology.c
> ===
> --- linux.orig/drivers/firewire/fw-topology.c
> +++ linux/drivers/firewire/fw-topology.c
> @@ -518,6 +518,7 @@ fw_core_handle_bus_reset(struct fw_card
>   card->bm_retries = 0;
>
>   card->node_id = node_id;
> + wmb();
>   card->generation = generation;
>   card->reset_jiffies = jiffies;
>   schedule_delayed_work(&card->work, 0);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-10-31 Thread Nick Piggin
On Wed, Oct 31, 2007 at 04:08:21PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 31 Oct 2007, Nick Piggin wrote:
> > 
> > No that would be great. Fingers crossed it won't cause any problems.
> 
> I actually doubt it will cause problems.
> 
> We made much bigger changes to ptrace support when we disallowed writing 
> to read-only shared memory areas (we used to do the magic per-page COW 
> thing).

Really? No, we still do that magic COW thing which creates anonymous
pages in MAP_SHARED vmas, don't we?

But I think that is pretty arbitrary as well. What the "force" argument to
get_user_pages on a MAP_SHARED vma effectively does is punt the permission
check from the vma permissions to the file permissions. If you've opened the
file WR but mmap()ed PROT_READ, it goes ahead and COWs, wheras if you've
opened the file RDONLY it -EFAULTs.

This seems strange and wrong, because the act of COWing seems to be exactly
done in order to *absolve* us of file permissions (if we're COWing, why
should it matter if the file happened to be readonly or not).

Could we just try to disallow COW for MAP_SHARED mappings?
Seeing as we're doing the break-ptrace thing this release already ;)
All the stuff I know of that requires forced COW (ie. gdb for breakpoints),
does so on MAP_PRIVATE mappings.
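
(For context, that's the path a debugger hits when it plants a breakpoint in a
read-only, MAP_PRIVATE text mapping of a stopped tracee -- roughly this,
error handling omitted:)

#include <sys/ptrace.h>
#include <sys/types.h>

/* sketch: patch an int3 (0xcc) into the tracee's text at addr */
static long insert_breakpoint(pid_t pid, unsigned long addr)
{
	long word = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);
	long patched = (word & ~0xffL) | 0xcc;

	/* the kernel services this write via get_user_pages(.write=1,
	 * .force=1), which COWs the page in the MAP_PRIVATE mapping */
	return ptrace(PTRACE_POKETEXT, pid, (void *)addr, (void *)patched);
}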

I know this is one of Hugh's bugbears. I don't know whether he's thought
out how to remove it, but I wonder if the appended patch is roughly in
the right direction?

> But if people have test-cases for ptrace and/or other magical users of 
> access_vm_pages() (things like core-dumping comes to mind - although I 
> think we don't dump pure file mappings at all, do we?) it would certainly 
> be good to run any such tests on the current -git tree...

We do for MAP_SHARED|MAP_ANONYMOUS, by the looks.

---
Index: linux-2.6/mm/memory.c
===
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1563,13 +1563,11 @@ static int do_wp_page(struct mm_struct *
reuse = can_share_swap_page(old_page);
unlock_page(old_page);
}
-   } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
-   (VM_WRITE|VM_SHARED))) {
+   } else if (unlikely((vma->vm_flags & VM_SHARED))) {
/*
-* Only catch write-faults on shared writable pages,
-* read-only shared pages can get COWed by
-* get_user_pages(.write=1, .force=1).
+* Only catch write-faults on shared writable pages.
 */
+   BUG_ON(!(vma->vm_flags & VM_WRITE));
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
/*
 * Notify the address space that the page is about to
Index: linux-2.6/mm/mmap.c
===
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -980,9 +980,11 @@ unsigned long do_mmap_pgoff(struct file 
if (locks_verify_locked(inode))
return -EAGAIN;
 
-   vm_flags |= VM_SHARED | VM_MAYSHARE;
-   if (!(file->f_mode & FMODE_WRITE))
-   vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
+   vm_flags |= VM_MAYSHARE;
+   if (file->f_mode & FMODE_WRITE)
+   vm_flags |= VM_SHARED;
+   if (!(vm_flags & VM_WRITE))
+   vm_flags &= ~VM_MAYWRITE;
 
/* fall through */
case MAP_PRIVATE:
@@ -998,6 +1000,7 @@ unsigned long do_mmap_pgoff(struct file 
 
if (!file->f_op || !file->f_op->mmap)
return -ENODEV;
+
break;
 
default:
@@ -1007,6 +1010,8 @@ unsigned long do_mmap_pgoff(struct file 
switch (flags & MAP_TYPE) {
case MAP_SHARED:
vm_flags |= VM_SHARED | VM_MAYSHARE;
+   if (!(vm_flags & VM_WRITE))
+   vm_flags &= ~VM_MAYWRITE;
break;
case MAP_PRIVATE:
/*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-10-31 Thread Nick Piggin
On Wed, Oct 31, 2007 at 08:59:41AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 31 Oct 2007, Nick Piggin wrote:
> > 
> > Well the patch is right, in the context of the regression I introduced
> > (and so it should probably go into 2.6.23).
> 
> Yeah, it probably is fine for -stable.
> 
> And if mine (which actually changes behaviour, in that it makes ptrace get 
> an access error) causes regressions, I guess we'll have to use that 
> compatible-with-old-behaviour one for 2.6.24 too.
> 
> But I just rebooted and tested - the cleaned-up patch does seem to work 
> fine, and I get "Cannot access memory at address " rather than any 
> reported problem.
> 
> So I think I'll commit my version asap, and see if anybody reports that 
> they have a situation where they use ptrace() and expect zero back from a 
> shared mapping past the end.. And if there are issues, we can switch back 
> to the old broken behaviour with your patch,

No that would be great. Fingers crossed it won't cause any problems.

Thanks, all.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-10-31 Thread Nick Piggin
On Wed, Oct 31, 2007 at 08:11:10AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 31 Oct 2007, Nick Piggin wrote:
> > 
> > However I actually don't really like how this all works. I don't like that
> > filemap.c should have to know about ptrace, or exactly what ptrace wants 
> > here.
> 
> It shouldn't. It should just fail when it fails. Then, handle_mm_fault() 
> should return an error code, which should cause get_user_pages() to return 
> an error code. Which should make ptrace just stop.
> 
> So I think your patch is wrong. mm/filemap.c should *not* care about who 
> does the fault.  I think the proper patch is something untested like the 
> appended...

Well the patch is right, in the context of the regression I introduced
(and so it should probably go into 2.6.23).

I'd love to get rid of that outside data content crap if possible in
2.6.24. I think you're the one who has the best feeling for the ptrace
cases we have in the VM, so I trust you ;)


> > It's a bit hairy to force insert page into pagecache and pte into pagetables
> > here, given the races.
> 
> It's also wrong. They shouldn't be in the page cache, since that can cause 
> problems with truncate etc. Maybe it doesn't any more, but it's reasonable 
> to believe that a page outside of i_size should not exist.

No I believe it could still be a problem (and at least, it is quite
fragile).

 
> > In access_process_vm, can't we just zero fill in the case of a sigbus? 
> > Linus?
> > That will also avoid changing application behaviour due to a gdb read...
> 
> access_process_vm() should just return how many bytes it could fill (which 
> means a truncated copy - very including zero bytes - for an error), and 
> the caller should decide what the right thing to do is.
> 
> But who knows, maybe I missed something.

I hope it works.
 

> Duane? Does this fix things for you?
> 
>   Linus
> 
> ---
>  mm/filemap.c |   13 ++---
>  1 files changed, 2 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9940895..188cf5f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1300,7 +1300,7 @@ int filemap_fault(struct vm_area_struct *vma, struct 
> vm_fault *vmf)
>  
>   size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>   if (vmf->pgoff >= size)
> - goto outside_data_content;
> + return VM_FAULT_SIGBUS;
>  
>   /* If we don't want any read-ahead, don't bother */
>   if (VM_RandomReadHint(vma))
> @@ -1377,7 +1377,7 @@ retry_find:
>   if (unlikely(vmf->pgoff >= size)) {
>   unlock_page(page);
>   page_cache_release(page);
> - goto outside_data_content;
> + return VM_FAULT_SIGBUS;
>   }
>  
>   /*
> @@ -1388,15 +1388,6 @@ retry_find:
>   vmf->page = page;
>   return ret | VM_FAULT_LOCKED;
>  
> -outside_data_content:
> - /*
> -  * An external ptracer can access pages that normally aren't
> -  * accessible..
> -  */
> - if (vma->vm_mm == current->mm)
> - return VM_FAULT_SIGBUS;
> -
> - /* Fall through to the non-read-ahead case */
>  no_cached_page:
>   /*
>* We're only likely to ever get here if MADV_RANDOM is in
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages

2007-10-31 Thread Nick Piggin
On Wednesday 31 October 2007 23:17, Peter Zijlstra wrote:
> On Wed, 2007-10-31 at 21:46 +1100, Nick Piggin wrote:

> > And I'd prevent these ones from doing so.
> >
> > Without keeping track of "reserve" pages, which doesn't feel
> > too clean.
>
> The problem with that is that once a slab was allocated with the right
> allocation context, anybody can get objects from these slabs.

[snip]

I understand that.


> So we either reserve a page per object, which for 32 byte objects is a
> large waste, or we stop anybody who doesn't have the right permissions
> from obtaining objects. I took the latter approach.

What I'm saying is that the slab allocator slowpath should always
just check watermarks against the current task, instead of using this
->reserve stuff.
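
Roughly this in the slow path (sketch only; current_passes_watermarks()
is a made-up helper, new_slab() being slub's existing one):

        if (!current_passes_watermarks(gfpflags)) {
                /* don't serve an object from a slab that may have been
                 * allocated from the reserves; call into the page
                 * allocator so this context does the normal watermark
                 * wait / reclaim instead */
                page = new_slab(s, gfpflags, node);
        }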
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context

2007-10-31 Thread Nick Piggin
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote:
> On Wed, 2007-10-31 at 14:51 +1100, Nick Piggin wrote:
> > On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > > Allow PF_MEMALLOC to be set in softirq context. When running softirqs
> > > from a borrowed context save current->flags, ksoftirqd will have its
> > > own task_struct.
> >
> > What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess
> > that some networking thing must be picking it up in a subsequent patch,
> > but I'm too lazy to look!)... Again, can you have more of a rationale in
> > your patch headers, or ref the patch that uses it... thanks
>
> Right, I knew I was forgetting something in these changelogs.
>
> The network stack does quite a bit of packet processing from softirq
> context. Once you start swapping over network, some of the packets want
> to be processed under PF_MEMALLOC.

Hmm... what about processing from interrupt context?
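
For reference, my reading of the borrowed-context handling is roughly
this (sketch from memory, not the actual patch):

        unsigned long pflags = current->flags;

        __do_softirq();                 /* may set PF_MEMALLOC internally */

        current->flags = pflags;        /* don't leak it into the borrowed task */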
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages

2007-10-31 Thread Nick Piggin
On Wednesday 31 October 2007 21:42, Peter Zijlstra wrote:
> On Wed, 2007-10-31 at 14:37 +1100, Nick Piggin wrote:
> > On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> > > contexts that are entitled to it.
> > >
> > > Care is taken to only touch the SLUB slow path.
> > >
> > > This is done to ensure reserve pages don't leak out and get consumed.
> >
> > I think this is generally a good idea (to prevent slab allocators
> > from stealing reserve). However I naively think the implementation
> > is a bit overengineered and thus has a few holes.
> >
> > Humour me, what was the problem with failing the slab allocation
> > (actually, not fail but just call into the page allocator to do
> > correct waiting  / reclaim) in the slowpath if the process fails the
> > watermark checks?
>
> Ah, we actually need slabs below the watermarks.

Right, I'd still allow those guys to allocate slabs. Provided they
have the right allocation context, right?


> Its just that once I 
> allocated those slabs using __GFP_MEMALLOC/PF_MEMALLOC I don't want
> allocation contexts that do not have rights to those pages to walk off
> with objects.

And I'd prevent these ones from doing so.

Without keeping track of "reserve" pages, which doesn't feel
too clean.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Swap delay accounting, include lock_page() delays

2007-10-31 Thread Nick Piggin
On Wednesday 31 October 2007 18:41, Nick Piggin wrote:
> On Wednesday 31 October 2007 18:52, Balbir Singh wrote:
> > Reported-by: Nick Piggin <[EMAIL PROTECTED]>
> >
> > The delay incurred in lock_page() should also be accounted in swap delay
> > accounting
> >
> > Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
>
> Ah right, I forgot to resend this one, sorry. Thanks for remembering.

Although I think I had a bit more detail in the changelog, which
should be kept.

Basically, swap delay accounting seems quite broken as of now,
because what it is counting is the time required to allocate a new
page and submit the IO, but not actually the time to perform the IO
at all (which I'd expect will be significant, although possibly in
some workloads the actual page allocation will dominate).

>
> > ---
> >
> >  mm/memory.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff -puN mm/swapfile.c~fix-delay-accounting-swap-accounting
> > mm/swapfile.c diff -puN mm/memory.c~fix-delay-accounting-swap-accounting
> > mm/memory.c ---
> > linux-2.6-latest/mm/memory.c~fix-delay-accounting-swap-accounting   2007-10
> >-3 1 12:58:05.0 +0530 +++
> > linux-2.6-latest-balbir/mm/memory.c 2007-10-31 13:02:50.0 +0530
> > @@ -2084,9 +2084,9 @@ static int do_swap_page(struct mm_struct
> > count_vm_event(PGMAJFAULT);
> > }
> >
> > -   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> > mark_page_accessed(page);
> > lock_page(page);
> > +   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> >
> > /*
> >  * Back out if somebody else already faulted in this pte.
> > _

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Swap delay accounting, include lock_page() delays

2007-10-31 Thread Nick Piggin
On Wednesday 31 October 2007 18:52, Balbir Singh wrote:
> Reported-by: Nick Piggin <[EMAIL PROTECTED]>
>
> The delay incurred in lock_page() should also be accounted in swap delay
> accounting
>
> Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>

Ah right, I forgot to resend this one, sorry. Thanks for remembering.


> ---
>
>  mm/memory.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff -puN mm/swapfile.c~fix-delay-accounting-swap-accounting mm/swapfile.c
> diff -puN mm/memory.c~fix-delay-accounting-swap-accounting mm/memory.c
> ---
> linux-2.6-latest/mm/memory.c~fix-delay-accounting-swap-accounting 
> 2007-10-3
>1 12:58:05.0 +0530 +++
> linux-2.6-latest-balbir/mm/memory.c   2007-10-31 13:02:50.0 +0530 @@
> -2084,9 +2084,9 @@ static int do_swap_page(struct mm_struct
>   count_vm_event(PGMAJFAULT);
>   }
>
> - delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>   mark_page_accessed(page);
>   lock_page(page);
> + delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>
>   /*
>* Back out if somebody else already faulted in this pte.
> _
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-10-31 Thread Nick Piggin
On Tue, Oct 30, 2007 at 11:56:00PM -0700, David Miller wrote:
> From: Nick Piggin <[EMAIL PROTECTED]>
> Date: Wed, 31 Oct 2007 07:42:21 +0100
> 
> > Sysrq+T fails to show the stack trace of a running task. Presumably this
> > is to avoid a garbled stack, however it can often be useful, and besides
> > there is no guarantee that the task won't start running in the middle of
> > show_stack(). If there are any correctness issues, then the architecture
> > would have to take further steps to ensure the task is not running.
> > 
> > Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
> 
> This is useful.
> 
> Even more useful would be a show_regs() on the cpu where running tasks
> are running.  If not a full show_regs() at least a program counter.

Yes, that's something I've needed as well. And I doubt that we're
very unusual among kernel developers in this regard ;)
 

> That's usually what you're trying to debug and we provide nearly no
> way to handle: some task is stuck in a loop in kernel mode and you
> need to know exactly where that is.
> 
> This is pretty easy to do on sparc64.  In fact I can capture remote
> cpu registers even when that CPU's interrupts are disabled.  I suppose
> other arches could do a NMI'ish register capture like this as well.
> 
> I have a few bug reports that I can't make more progress on because I
> currently can't ask users to do something to fetch the registers on
> the seemingly hung processor.  This is why I'm harping on this so
> much :-)
> 
> Anyways, my core suggestion is to add a hook here so platforms can
> do the remote register fetch if they want.

You could possibly even do a generic "best effort" kind of thing with
regular IPIs, that will time out and continue if some CPUs don't handle
them, and should be pretty easy to get working with the existing
smp_call_function stuff. Not exactly clean, but it would be better than nothing.
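
Untested sketch of the kind of thing I mean, using only existing
interfaces (get_irq_regs() needs asm/irq_regs.h):

        static void show_regs_ipi(void *unused)
        {
                /* dump whatever this CPU was doing when the IPI arrived */
                struct pt_regs *regs = get_irq_regs();

                if (regs)
                        show_regs(regs);
        }

        /* best effort: don't wait, CPUs with interrupts off just won't answer */
        smp_call_function(show_regs_ipi, NULL, 0, 0);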

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: filp usage when cpu busy

2007-10-31 Thread Nick Piggin
On Thursday 01 November 2007 12:56, bc Wong (chimwong) wrote:
> Hi,
>
> With 2.6.16 x86_64 on a 4 core machine, I noticed
> that the filp usage (according to /proc/slabinfo)
> shoots up and keeps on increasing sharply when one
> of the CPUs is (1) locked up, or (2) very busy
> doing a lot of printk()'s with KERN_EMERG.
>
> In the case of (1), it's permanent until it runs
> out of memory eventually. For (2), it's temporary;
> filp count comes back down when the printk()'s are
> done.
>
> I can't think of any relationship between a busy/
> locked-up CPU and filp count. The system is still
> functional. New short-lived processes kept being
> created, but the overall number of processes is
> stable.
>
> Does anyone know why filp count would go up like
> that?

Yeah, it's probably because filp structures are freed by
RCU, and if you have a locked up CPU then it can't go
through a quiescent state so RCU stops freeing your filps.

If you add some cond_resched()s to your code, you should
find that RCU will force a reschedule and things will work
(actually, for 2.6.16, I'm not sure if RCU had the code to
force a reschedule... it's force_quiescent_state() in
kernel/rcupdate.c upstream).
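
i.e. something like this in your busy loop (sketch; do_chunk_of_work()
stands for whatever your code is doing):

        while (!done) {
                do_chunk_of_work();
                /* gives RCU a quiescent state on this CPU, so the callbacks
                 * that actually free the filps get a chance to run */
                cond_resched();
        }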
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-10-30 Thread Nick Piggin
On Wed, Oct 31, 2007 at 12:45:35AM +, Duane Griffin wrote:
> Accessing a memory mapped region past the last page containing a valid
> file mapping produces a SIGBUS fault (as it should). Running a program
> that does this under gdb, then accessing the invalid memory from gdb,
> causes it to start consuming 100% CPU and become unkillable. Once in
> that state, SysRq-T doesn't show a stack trace for gdb, although it is
> shown as running and stack traces are dumped for other tasks.


BTW. this has come up for me before, and I have found it useful on
a number of occasions to print the stack of running tasks when they
are looping in the kernel...

Any reason we can't do this?

--
Sysrq+T fails to show the stack trace of a running task. Presumably this
is to avoid a garbled stack, however it can often be useful, and besides
there is no guarantee that the task won't start running in the middle of
show_stack(). If there are any correctness issues, then the architecture
would have to take further steps to ensure the task is not running.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2007-10-31 06:53:22.0 +1100
+++ linux-2.6/kernel/sched.c2007-10-31 06:56:02.0 +1100
@@ -4900,8 +4900,7 @@
printk(KERN_CONT "%5lu %5d %6d\n", free,
task_pid_nr(p), task_pid_nr(p->parent));
 
-   if (state != TASK_RUNNING)
-   show_stack(p, NULL);
+   show_stack(p, NULL);
 }
 
 void show_state_filter(unsigned long state_filter)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/33] Swap over NFS -v14

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 15:37, David Miller wrote:
> From: Nick Piggin <[EMAIL PROTECTED]>
> Date: Wed, 31 Oct 2007 14:26:32 +1100
>
> > Is it really worth all the added complexity of making swap
> > over NFS files work, given that you could use a network block
> > device instead?
>
> Don't be misled.  Swapping over NFS is just a scarecrow for the
> seemingly real impetus behind these changes which is network storage
> stuff like iSCSI.

Oh, I'm OK with the network reserves stuff (not the actual patch,
which I'm not really qualified to review, but at least the idea
of it...).

And also I'm not as such against the idea of swap over network.

However, what I'm questioning is specifically the change to make
swapfiles work through the filesystem layer (ATM it goes straight to
the block layer, modulo some initialisation stuff which uses
block-filesystem-specific calls).

I mean, I assume that anybody trying to swap over network *today*
has to be using a network block device anyway, so the idea of
just being able to transparently improve that case seems better
than adding new complexities for seemingly not much gain.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Change ALLOC_NO_WATERMARK page allocation such that the reserves are system
> wide - which they are per setup_per_zone_pages_min(), when we scrape the
> barrel, do it properly.
>

IIRC it's actually not too uncommon to have allocations coming here via
page reclaim. It's not exactly clear that you want to break mempolicies
at this point.


> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> ---
>  mm/page_alloc.c |6 ++
>  1 file changed, 6 insertions(+)
>
> Index: linux-2.6/mm/page_alloc.c
> ===
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -1638,6 +1638,12 @@ restart:
>  rebalance:
>   if (alloc_flags & ALLOC_NO_WATERMARKS) {
>  nofail_alloc:
> + /*
> +  * break out of mempolicy boundaries
> +  */
> + zonelist = NODE_DATA(numa_node_id())->node_zonelists +
> + gfp_zone(gfp_mask);
> +
>   /* go through the zonelist yet again, ignoring mins */
>   page = get_page_from_freelist(gfp_mask, order, zonelist,
>   ALLOC_NO_WATERMARKS);
>
> --
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
> a borrowed context save current->flags, ksoftirqd will have its own
> task_struct.


What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess
that some networking thing must be picking it up in a subsequent patch,
but I'm too lazy to look!)... Again, can you have more of a rationale in
your patch headers, or ref the patch that uses it... thanks

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 05/33] mm: kmem_estimate_pages()

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Provide a method to get the upper bound on the pages needed to allocate
> a given number of objects from a given kmem_cache.
>

Fair enough, but just to make it a bit easier, can you provide a
little reasoning about why in this patch (or reference the patch number
where you use it, or put it together with the patch where you use
it, etc.).

Thanks,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Allow the mempool to use the memalloc reserves when all else fails and
> the allocation context would otherwise allow it.

I don't see what this is for. The whole point of my earlier fix to make
mempool *not* use the memalloc reserves was that processes which were
otherwise allowed to use those reserves were dipping into them here.
They should not.



> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
> ---
>  mm/mempool.c |   12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/mm/mempool.c
> ===
> --- linux-2.6.orig/mm/mempool.c
> +++ linux-2.6/mm/mempool.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include "internal.h"
>
>  static void add_element(mempool_t *pool, void *element)
>  {
> @@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf
>   void *element;
>   unsigned long flags;
>   wait_queue_t wait;
> - gfp_t gfp_temp;
> + gfp_t gfp_temp, gfp_orig = gfp_mask;
>
>   might_sleep_if(gfp_mask & __GFP_WAIT);
>
> @@ -228,6 +229,15 @@ repeat_alloc:
>   }
>   spin_unlock_irqrestore(&pool->lock, flags);
>
> + /* if we really had right to the emergency reserves try those */
> + if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) {
> + if (gfp_temp & __GFP_NOMEMALLOC) {
> + gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
> + goto repeat_alloc;
> + } else
> + gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
> + }
> +
>   /* We must not sleep in the GFP_ATOMIC case */
>   if (!(gfp_mask & __GFP_WAIT))
>   return NULL;
>
> --
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> contexts that are entitled to it.
>
> Care is taken to only touch the SLUB slow path.
>
> This is done to ensure reserve pages don't leak out and get consumed.

I think this is generally a good idea (to prevent slab allocators
from stealing reserve). However I naively think the implementation
is a bit overengineered and thus has a few holes.

Humour me, what was the problem with failing the slab allocation
(actually, not fail but just call into the page allocator to do
correct waiting  / reclaim) in the slowpath if the process fails the
watermark checks?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/33] Swap over NFS -v14

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Hi,
>
> Another posting of the full swap over NFS series.

Hi,

Is it really worth all the added complexity of making swap
over NFS files work, given that you could use a network block
device instead?

Also, have you ensured that page_file_index, page_file_mapping
and page_offset are only ever used on anonymous pages when the
page is locked? (otherwise PageSwapCache could change)
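
i.e. I'd expect every such use to look roughly like this (sketch;
page_file_mapping() being one of the helpers the series adds):

        lock_page(page);
        if (PageSwapCache(page))
                mapping = page_file_mapping(page);
        /* ... use mapping while the page is still locked ... */
        unlock_page(page);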
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-10-30 Thread Nick Piggin
On Wed, Oct 31, 2007 at 12:45:35AM +, Duane Griffin wrote:
> Accessing a memory mapped region past the last page containing a valid
> file mapping produces a SIGBUS fault (as it should). Running a program
> that does this under gdb, then accessing the invalid memory from gdb,
> causes it to start consuming 100% CPU and become unkillable. Once in
> that state, SysRq-T doesn't show a stack trace for gdb, although it is
> shown as running and stack traces are dumped for other tasks.
> 
> 2.6.22 does not have this bug (gdb just prints '\0' as the contents,
> although arguably that is also a bug, and it should instead report the
> SIGBUS).
> 
> Bisection indicates the problem was introduced by:
> 
> 54cb8821de07f2ffcd28c380ce9b93d5784b40d7
> "mm: merge populate and nopage into fault (fixes nonlinear)"
> 
> The following program demonstrates the issue:
> 
> #include <errno.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/mman.h>
> #include <sys/stat.h>
> 
> int main(int argc, char *argv[])
> {
> int fd;
> struct stat buf;
> 
> if (argc != 2) {
> fprintf(stderr, "usage: %s <filename>\n", argv[0]);
> exit(1);
> }
> 
> fd = open(argv[1], O_RDONLY);
> fstat(fd, &buf);
> int count = buf.st_size + sysconf(_SC_PAGE_SIZE);
> char *file = (char *) mmap(NULL, count, PROT_READ, MAP_PRIVATE, fd, 
> 0);
> if (file == MAP_FAILED) {
> fprintf(stderr, "mmap failed: %s\n", strerror(errno));
> } else {
> char ch;
> fprintf(stderr, "using offset %d\n", (count - 1));
> ch = file[count - 1];
> munmap(file, count);
> }
> close(fd);
> return 0;
> }
> 
> To reproduce the bug, run it under gdb, go up a couple of frames to
> the main function, then access invalid memory, for e.g. with: "print
> file[4095]", or whatever offset was reported.


Well that's probably the best bug report I've ever had, thanks Duane!

The issue is a silly thinko -- I didn't pay enough attention to the ptrace
rules in filemap_fault :( partly I think that's because I don't understand
them and am scared of them ;)

The following minimal patch fixes it here, and should probably be applied to
2.6.23 stable as well as 2.6.24. If you could verify it, that would be much
appreciated.

However I actually don't really like how this all works. I don't like that
filemap.c should have to know about ptrace, or exactly what ptrace wants here.
It's a bit hairy to force insert page into pagecache and pte into pagetables
here, given the races.

In access_process_vm, can't we just zero fill in the case of a sigbus? Linus?
That will also avoid changing application behaviour due to a gdb read...

Thanks,
Nick

--
Duane Griffin noticed a 2.6.23 regression that will cause gdb to hang when
it tries to access the memory of another process beyond i_size.

This is because the solution to the fault vs invalidate race requires that
we recheck i_size after the page lock is taken. However in that case, I
had not accounted for the fact that ptracers are granted an exception to this
rule.

Cc: Duane Griffin <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1374,7 +1374,7 @@ retry_find:
 
/* Must recheck i_size under page lock */
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-   if (unlikely(vmf->pgoff >= size)) {
+   if (unlikely(vmf->pgoff >= size) && vma->vm_mm == current->mm) {
unlock_page(page);
page_cache_release(page);
goto outside_data_content;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 performance regression

2007-10-30 Thread Nick Piggin
On Tuesday 30 October 2007 18:54, Lorenzo Allegrucci wrote:
> Hi, sorry if this is a faq but reading
> http://people.freebsd.org/~kris/scaling/7.0%20Preview.pdf (slides 17,
> 18)
> looks like 2.6.23 is having a performance regression on MySQL and
> PostgreSQL benchmarks.  Has anyone investigated these?

I think Ingo was looking into them. I'm just in the middle of
building a new setup to investigate some performance issues too,
so I hope to look at this.

Apparently it did go unnoticed during CFS testing, unfortunately.
Although now it has been brought to light, I think 2.6.24 is
supposed to be better (although it now gets worse on other things).
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 09/10] SLUB: Do our own locking via slab_lock and slab_unlock.

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 05:32, Christoph Lameter wrote:
> On Tue, 30 Oct 2007, Nick Piggin wrote:
> > Is this actually a speedup on any architecture to roll your own locking
> > rather than using bit spinlock?
>
> It avoids one load from memory when allocating and the release is simply
> writing the page->flags back. Less instructions.

OK, but it probably isn't a measurable speedup, even on microbenchmarks,
right? And many architectures have to have more barriers around cmpxchg
than they do around a test_and_set_bit_lock, so it may even be slower
on some.


> > I am not exactly convinced that smp_wmb() is a good idea to have in your
> > unlock, rather than the normally required smp_mb() that every other open
> > coded lock in the kernel is using today. If you comment every code path
> > where a load leaking out of the critical section would not be a problem,
> > then OK it may be correct, but I still don't think it is worth the
> > maintenance overhead.
>
> I thought you agreed that release semantics only require a write barrier?

Not in general.
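
For example (illustrative only, not the actual slub code):

        slab_lock(page);
        val = page->inuse;                      /* load inside the lock */
        smp_wmb();
        __clear_bit(PG_locked, &page->flags);   /* "unlock" */

The smp_wmb() only orders this CPU's stores against the unlock store, so
the load of page->inuse can effectively be satisfied after the unlock has
become visible, i.e. after another CPU has taken the lock and modified
the field. A full smp_mb() (or a real release operation) is what forbids
that.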


> The issue here is that other processors see the updates before the
> updates to page-flags.
>
> A load leaking out of a critical section would require that the result of
> the load is not used to update other information before the slab_unlock
> and that the source of the load is not overwritten in the critical
> section. That does not happen in slub.

That may be the case, but I don't think there is enough performance
justification to add a hack like this. ia64 for example is going to
do an mf for smp_wmb so I doubt it is a clear win.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 09/10] SLUB: Do our own locking via slab_lock and slab_unlock.

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 05:32, Christoph Lameter wrote:
 On Tue, 30 Oct 2007, Nick Piggin wrote:
  Is this actually a speedup on any architecture to roll your own locking
  rather than using bit spinlock?

 It avoids one load from memory when allocating and the release is simply
 writing the page-flags back. Less instructions.

OK, but it probably isn't a measurable speedup, even on microbenchmarks,
right? And many architectures have to have more barriers around cmpxchg
than they do around a test_and_set_bit_lock, so it may even be slower
on some.


  I am not exactly convinced that smp_wmb() is a good idea to have in your
  unlock, rather than the normally required smp_mb() that every other open
  coded lock in the kernel is using today. If you comment every code path
  where a load leaking out of the critical section would not be a problem,
  then OK it may be correct, but I still don't think it is worth the
  maintenance overhead.

 I thought you agreed that release semantics only require a write barrier?

Not in general.


 The issue here is that other processors see the updates before the
 updates to page-flags.

 A load leaking out of a critical section would require that the result of
 the load is not used to update other information before the slab_unlock
 and that the source of the load is not overwritten in the critical
 section. That does not happen in sluib.

That may be the case, but I don't think there is enough performance
justification to add a hack like this. ia64 for example is going to
do an mf for smp_wmb so I doubt it is a clear win.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 performance regression

2007-10-30 Thread Nick Piggin
On Tuesday 30 October 2007 18:54, Lorenzo Allegrucci wrote:
 Hi, sorry if this is a faq but reading
 http://people.freebsd.org/~kris/scaling/7.0%20Preview.pdf (slides 17,
 18)
 looks like 2.6.23 is having a performance regression on MySQL and
 PostgreSQL benchmarks.  Has anyone investigated these?

I think Ingo was looking into them. I'm just in the middle of
building a new setup to investigate some performance issues too,
so I hope to look at this.

Apparently it did go unnoticed during CFS testing, unfortunately.
Although now it has been brought to light, I think 2.6.24 is
supposed to be better (although it now gets worse on other things).
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-10-30 Thread Nick Piggin
On Wed, Oct 31, 2007 at 12:45:35AM +, Duane Griffin wrote:
 Accessing a memory mapped region past the last page containing a valid
 file mapping produces a SIGBUS fault (as it should). Running a program
 that does this under gdb, then accessing the invalid memory from gdb,
 causes it to start consuming 100% CPU and become unkillable. Once in
 that state, SysRq-T doesn't show a stack trace for gdb, although it is
 shown as running and stack traces are dumped for other tasks.
 
 2.6.22 does not have this bug (gdb just prints '\0' as the contents,
 although arguably that is also a bug, and it should instead report the
 SIGBUS).
 
 Bisection indicates the problem was introduced by:
 
 54cb8821de07f2ffcd28c380ce9b93d5784b40d7
 mm: merge populate and nopage into fault (fixes nonlinear)
 
 The following program demonstrates the issue:
 
 #include <errno.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <sys/stat.h>
 
 int main(int argc, char *argv[])
 {
         int fd;
         struct stat buf;
 
         if (argc != 2) {
                 fprintf(stderr, "usage: %s filename\n", argv[0]);
                 exit(1);
         }
 
         fd = open(argv[1], O_RDONLY);
         fstat(fd, &buf);
         int count = buf.st_size + sysconf(_SC_PAGE_SIZE);
         char *file = (char *) mmap(NULL, count, PROT_READ, MAP_PRIVATE, fd, 
                 0);
         if (file == MAP_FAILED) {
                 fprintf(stderr, "mmap failed: %s\n", strerror(errno));
         } else {
                 char ch;
                 fprintf(stderr, "using offset %d\n", (count - 1));
                 ch = file[count - 1];
                 munmap(file, count);
         }
         close(fd);
         return 0;
 }
 
 To reproduce the bug, run it under gdb, go up a couple of frames to
 the main function, then access invalid memory, for e.g. with: print
 file[4095], or whatever offset was reported.


Well that's probably the best bug report I've ever had, thanks Duane!

The issue is a silly thinko -- I didn't pay enough attention to the ptrace
rules in filemap_fault :( partly I think that's because I don't understand
them and am scared of them ;)

The following minimal patch fixes it here, and should probably be applied to
2.6.23 stable as well as 2.6.24. If you could verify it, that would be much
appreciated.

However I actually don't really like how this all works. I don't like that
filemap.c should have to know about ptrace, or exactly what ptrace wants here.
It's a bit hairy to force-insert a page into pagecache and a pte into the
pagetables here, given the races.

In access_process_vm, can't we just zero fill in the case of a sigbus? Linus?
That will also avoid changing application behaviour due to a gdb read...

Thanks,
Nick

--
Duane Griffin noticed a 2.6.23 regression that will cause gdb to hang when
it tries to access the memory of another process beyond i_size.

This is because the solution to the fault vs invalidate race requires that
we recheck i_size after the page lock is taken. However in that case, I
had not accounted for the fact that ptracers are granted an exception to this
rule.

Cc: Duane Griffin [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1374,7 +1374,7 @@ retry_find:
 
/* Must recheck i_size under page lock */
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-   if (unlikely(vmf->pgoff >= size)) {
+   if (unlikely(vmf->pgoff >= size) && vma->vm_mm == current->mm) {
unlock_page(page);
page_cache_release(page);
goto outside_data_content;
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/33] Swap over NFS -v14

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Hi,

 Another posting of the full swap over NFS series.

Hi,

Is it really worth all the added complexity of making swap
over NFS files work, given that you could use a network block
device instead?

Also, have you ensured that page_file_index, page_file_mapping
and page_offset are only ever used on anonymous pages when the
page is locked? (otherwise PageSwapCache could change)
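For illustration, the kind of helper I mean looks something like this
(hypothetical sketch; page_swap_file_mapping() is a made-up name):

        static inline struct address_space *page_file_mapping(struct page *page)
        {
                /* if the anon page is not locked, PageSwapCache() can change
                 * under us and the wrong mapping would be returned */
                if (unlikely(PageSwapCache(page)))
                        return page_swap_file_mapping(page);
                return page->mapping;
        }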
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 03/33] mm: slub: add knowledge of reserve pages

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
 contexts that are entitled to it.

 Care is taken to only touch the SLUB slow path.

 This is done to ensure reserve pages don't leak out and get consumed.

I think this is generally a good idea (to prevent slab allocators
from stealing reserve). However I naively think the implementation
is a bit overengineered and thus has a few holes.

Humour me: what was the problem with failing the slab allocation
(actually, not failing, but just calling into the page allocator to do
the correct waiting / reclaim) in the slowpath if the process fails the
watermark checks?
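Roughly this, in other words (sketch only; slab_is_from_reserve() is a
made-up name):

        static struct page *check_reserve_slab(struct page *page, gfp_t gfpflags)
        {
                /* the slab came from the emergency reserves but this caller
                 * is not entitled to them: return NULL so the slow path goes
                 * back to the page allocator, which does the usual watermark
                 * checks, waiting and reclaim */
                if (page && slab_is_from_reserve(page) &&
                    !(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
                        return NULL;
                return page;
        }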
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 04/33] mm: allow mempool to fall back to memalloc reserves

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Allow the mempool to use the memalloc reserves when all else fails and
 the allocation context would otherwise allow it.

I don't see what this is for. The whole point of my earlier fix, which
made mempool *not* use the memalloc reserves, is that processes which
were otherwise allowed to use those reserves were dipping into them via
the mempool. They should not.
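For reference, this is roughly what mempool_alloc() already does with the
mask (from memory, may not be letter-perfect), and it is deliberate:

        /* The mempool itself is the emergency pool, so the backing
         * allocation must not dip into the page allocator's reserves,
         * must not retry, and must not warn -- we fall back to the
         * pool elements instead. */
        gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
        gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages */
        gfp_mask |= __GFP_NOWARN;       /* failures are OK, we have the pool */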



 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 ---
  mm/mempool.c |   12 +++-
  1 file changed, 11 insertions(+), 1 deletion(-)

 Index: linux-2.6/mm/mempool.c
 ===
 --- linux-2.6.orig/mm/mempool.c
 +++ linux-2.6/mm/mempool.c
 @@ -14,6 +14,7 @@
  #include linux/mempool.h
  #include linux/blkdev.h
  #include linux/writeback.h
 +#include internal.h

  static void add_element(mempool_t *pool, void *element)
  {
 @@ -204,7 +205,7 @@ void * mempool_alloc(mempool_t *pool, gf
   void *element;
   unsigned long flags;
   wait_queue_t wait;
 - gfp_t gfp_temp;
 + gfp_t gfp_temp, gfp_orig = gfp_mask;

   might_sleep_if(gfp_mask & __GFP_WAIT);
 
  @@ -228,6 +229,15 @@ repeat_alloc:
    }
    spin_unlock_irqrestore(&pool->lock, flags);
 
  + /* if we really had right to the emergency reserves try those */
  + if (gfp_to_alloc_flags(gfp_orig) & ALLOC_NO_WATERMARKS) {
  + if (gfp_temp & __GFP_NOMEMALLOC) {
  + gfp_temp &= ~(__GFP_NOMEMALLOC|__GFP_NOWARN);
  + goto repeat_alloc;
  + } else
  + gfp_temp |= __GFP_NOMEMALLOC|__GFP_NOWARN;
  + }
  +
    /* We must not sleep in the GFP_ATOMIC case */
    if (!(gfp_mask & __GFP_WAIT))
    return NULL;

 --
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 05/33] mm: kmem_estimate_pages()

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Provide a method to get the upper bound on the pages needed to allocate
 a given number of objects from a given kmem_cache.


Fair enough, but just to make it a bit easier, can you provide a
little reason of why in this patch (or reference the patch number
where you use it, or put it together with the patch where you use
it, etc.).

Thanks,
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/33] mm: system wide ALLOC_NO_WATERMARK

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Change ALLOC_NO_WATERMARK page allocation such that the reserves are system
 wide - which they are per setup_per_zone_pages_min(), when we scrape the
 barrel, do it properly.


IIRC it's actually not too uncommon to have allocations coming here via
page reclaim. It's not exactly clear that you want to break mempolicies
at this point.


 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 ---
  mm/page_alloc.c |6 ++
  1 file changed, 6 insertions(+)

 Index: linux-2.6/mm/page_alloc.c
 ===
 --- linux-2.6.orig/mm/page_alloc.c
 +++ linux-2.6/mm/page_alloc.c
 @@ -1638,6 +1638,12 @@ restart:
  rebalance:
   if (alloc_flags & ALLOC_NO_WATERMARKS) {
  nofail_alloc:
  + /*
  +  * break out of mempolicy boundaries
  +  */
  + zonelist = NODE_DATA(numa_node_id())->node_zonelists +
  + gfp_zone(gfp_mask);
 +
   /* go through the zonelist yet again, ignoring mins */
   page = get_page_from_freelist(gfp_mask, order, zonelist,
   ALLOC_NO_WATERMARKS);

 --

 -
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 06/33] mm: allow PF_MEMALLOC from softirq context

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
 Allow PF_MEMALLOC to be set in softirq context. When running softirqs from
 a borrowed context save current->flags, ksoftirqd will have its own
 task_struct.


What's this for? Why would ksoftirqd pick up PF_MEMALLOC? (I guess
that some networking thing must be picking it up in a subsequent patch,
but I'm too lazy to look!)... Again, can you have more of a rationale in
your patch headers, or ref the patch that uses it... thanks

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/33] Swap over NFS -v14

2007-10-30 Thread Nick Piggin
On Wednesday 31 October 2007 15:37, David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]
 Date: Wed, 31 Oct 2007 14:26:32 +1100

  Is it really worth all the added complexity of making swap
  over NFS files work, given that you could use a network block
  device instead?

 Don't be misled.  Swapping over NFS is just a scarecrow for the
 seemingly real impetus behind these changes which is network storage
 stuff like iSCSI.

Oh, I'm OK with the network reserves stuff (not the actual patch,
which I'm not really qualified to review, but at least the idea
of it...).

And also I'm not as such against the idea of swap over network.

However, what I'm questioning specifically is the change to make
swapfiles work through the filesystem layer (at the moment it goes
straight to the block layer, modulo some initialisation stuff which uses
block filesystem-specific calls).

I mean, I assume that anybody trying to swap over network *today*
has to be using a network block device anyway, so the idea of
just being able to transparently improve that case seems better
than adding new complexities for seemingly not much gain.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-10-30 Thread Nick Piggin
On Wed, Oct 31, 2007 at 12:45:35AM +, Duane Griffin wrote:
 Accessing a memory mapped region past the last page containing a valid
 file mapping produces a SIGBUS fault (as it should). Running a program
 that does this under gdb, then accessing the invalid memory from gdb,
 causes it to start consuming 100% CPU and become unkillable. Once in
 that state, SysRq-T doesn't show a stack trace for gdb, although it is
 shown as running and stack traces are dumped for other tasks.


BTW. this has come up for me before, and I have found it useful on
a number of occasions to print the stack of running tasks when they
are looping in the kernel...

Any reason we can't do this?

--
Sysrq+T fails to show the stack trace of a running task. Presumably this
is to avoid a garbled stack; however, it can often be useful, and besides,
there is no guarantee that the task won't start running in the middle of
show_stack(). If there are any correctness issues, then the architecture
would have to take further steps to ensure the task is not running.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Index: linux-2.6/kernel/sched.c
===
--- linux-2.6.orig/kernel/sched.c   2007-10-31 06:53:22.0 +1100
+++ linux-2.6/kernel/sched.c2007-10-31 06:56:02.0 +1100
@@ -4900,8 +4900,7 @@
printk(KERN_CONT "%5lu %5d %6d\n", free,
task_pid_nr(p), task_pid_nr(p->parent));
 
-   if (state != TASK_RUNNING)
-   show_stack(p, NULL);
+   show_stack(p, NULL);
 }
 
 void show_state_filter(unsigned long state_filter)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 09/10] SLUB: Do our own locking via slab_lock and slab_unlock.

2007-10-29 Thread Nick Piggin
On Sunday 28 October 2007 14:32, Christoph Lameter wrote:
> Too many troubles with the bitlocks and we really do not need
> to do any bitops. Bitops do not effectively retrieve the old
> value which we want. So use a cmpxchg instead on the arches
> that allow it.
>
> Instead of modifying the page->flags with fake atomic operations
> we pass the page state as a parameter to functions. No function uses
> the slab state if the page lock is held. Function must wait until the
> lock is cleared. Thus we can defer the update of page->flags until slab
> processing is complete. The "unlock" operation is then simply updating
> page->flags.

Is this actually a speedup on any architecture to roll your own locking
rather than using bit spinlock?

I am not exactly convinced that smp_wmb() is a good idea to have in your
unlock, rather than the normally required smp_mb() that every other open
coded lock in the kernel is using today. If you comment every code path
where a load leaking out of the critical section would not be a problem,
then OK it may be correct, but I still don't think it is worth the
maintenance overhead.
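For comparison, the bit-spinlock form this replaces is (roughly, from
memory) just:

        static __always_inline void slab_lock(struct page *page)
        {
                bit_spin_lock(PG_locked, &page->flags);
        }

        static __always_inline void slab_unlock(struct page *page)
        {
                bit_spin_unlock(PG_locked, &page->flags);
        }

which gives acquire semantics on lock and a full-barrier release on
unlock, with no open-coded barriers to audit.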

>
> Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
>
>
> ---
>  mm/slub.c |  324
> +++--- 1 file
> changed, 187 insertions(+), 137 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===
> --- linux-2.6.orig/mm/slub.c  2007-10-27 07:56:03.0 -0700
> +++ linux-2.6/mm/slub.c   2007-10-27 07:56:52.0 -0700
> @@ -101,6 +101,7 @@
>   */
>
>  #define FROZEN (1 << PG_active)
> +#define LOCKED (1 << PG_locked)
>
>  #ifdef CONFIG_SLUB_DEBUG
>  #define SLABDEBUG (1 << PG_error)
> @@ -108,36 +109,6 @@
>  #define SLABDEBUG 0
>  #endif
>
> -static inline int SlabFrozen(struct page *page)
> -{
> - return page->flags & FROZEN;
> -}
> -
> -static inline void SetSlabFrozen(struct page *page)
> -{
> - page->flags |= FROZEN;
> -}
> -
> -static inline void ClearSlabFrozen(struct page *page)
> -{
> - page->flags &= ~FROZEN;
> -}
> -
> -static inline int SlabDebug(struct page *page)
> -{
> - return page->flags & SLABDEBUG;
> -}
> -
> -static inline void SetSlabDebug(struct page *page)
> -{
> - page->flags |= SLABDEBUG;
> -}
> -
> -static inline void ClearSlabDebug(struct page *page)
> -{
> - page->flags &= ~SLABDEBUG;
> -}
> -
>  /*
>   * Issues still to be resolved:
>   *
> @@ -818,11 +789,12 @@ static void trace(struct kmem_cache *s,
>  /*
>   * Tracking of fully allocated slabs for debugging purposes.
>   */
> -static void add_full(struct kmem_cache *s, struct page *page)
> +static void add_full(struct kmem_cache *s, struct page *page,
> + unsigned long state)
>  {
>   struct kmem_cache_node *n = get_node(s, page_to_nid(page));
>
> - if (!SlabDebug(page) || !(s->flags & SLAB_STORE_USER))
> + if (!(state & SLABDEBUG) || !(s->flags & SLAB_STORE_USER))
>   return;
>   spin_lock(>list_lock);
>   list_add(>lru, >full);
> @@ -894,7 +866,7 @@ bad:
>  }
>
>  static noinline int free_debug_processing(struct kmem_cache *s,
> - struct page *page, void *object, void *addr)
> + struct page *page, void *object, void *addr, unsigned long state)
>  {
>   if (!check_slab(s, page))
>   goto fail;
> @@ -930,7 +902,7 @@ static noinline int free_debug_processin
>   }
>
>   /* Special debug activities for freeing objects */
> - if (!SlabFrozen(page) && page->freelist == page->end)
> + if (!(state & FROZEN) && page->freelist == page->end)
>   remove_full(s, page);
>   if (s->flags & SLAB_STORE_USER)
>   set_track(s, object, TRACK_FREE, addr);
> @@ -1047,7 +1019,8 @@ static inline int slab_pad_check(struct
>   { return 1; }
>  static inline int check_object(struct kmem_cache *s, struct page *page,
>   void *object, int active) { return 1; }
> -static inline void add_full(struct kmem_cache *s, struct page *page) {}
> +static inline void add_full(struct kmem_cache *s, struct page *page,
> + unsigned long state) {}
>  static inline unsigned long kmem_cache_flags(unsigned long objsize,
>   unsigned long flags, const char *name,
>   void (*ctor)(struct kmem_cache *, void *))
> @@ -1105,6 +1078,7 @@ static noinline struct page *new_slab(st
>   void *start;
>   void *last;
>   void *p;
> + unsigned long state;
>
>   BUG_ON(flags & GFP_SLAB_BUG_MASK);
>
> @@ -1117,11 +1091,12 @@ static noinline struct page *new_slab(st
>   if (n)
>   atomic_long_inc(>nr_slabs);
>   page->slab = s;
> - page->flags |= 1 << PG_slab;
> + state = 1 << PG_slab;
>   if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON |
>   SLAB_STORE_USER | SLAB_TRACE))
> - SetSlabDebug(page);
> + state |= SLABDEBUG;
>
> 

Re: [PATCH resend2] rd: fix data corruption on memory pressure

2007-10-29 Thread Nick Piggin
On Tuesday 30 October 2007 01:17, Christian Borntraeger wrote:
> Nick, Eric, Andrew,
>
> we now have passed rc1. That means that Erics or Nicks rd rewrite is no
> longer an option for 2.6.24. If I followed the last thread correctly all
> alternative patches have one of the following issue
> - too big for post rc1

Yeah, I don't think the rewrites were ever intended for 2.6.24
anyway...


> - break reiserfs and maybe others
> - call into vfs while being unrelated
>
> So this is a resend of my patch, which is in my opinion the simplest fix
> for the data corruption problem.
> The patch was tested by our test department, thanks to Oliver Paukstadt and
> Thorsten Diehl.
>
> ---
>
> Subject: [PATCH] rd: fix data corruption on memory pressure
> From: Christian Borntraeger <[EMAIL PROTECTED]>
>
> We have seen ramdisk based install systems, where some pages of mapped
> libraries and programs were suddendly zeroed under memory pressure. This
> should not happen, as the ramdisk avoids freeing its pages by keeping them
> dirty all the time.
>
> It turns out that there is a case, where the VM makes a ramdisk page clean,
> without telling the ramdisk driver.
> On memory pressure shrink_zone runs and it starts to run
> shrink_active_list. There is a check for buffer_heads_over_limit, and if
> true, pagevec_strip is called. pagevec_strip calls try_to_release_page. If
> the mapping has no releasepage callback, try_to_free_buffers is called.
> try_to_free_buffers has now a special logic for some file systems to make a
> dirty page clean, if all buffers are clean. Thats what happened in our test
> case.
>
> The simplest solution is to provide a noop-releasepage callback for the
> ramdisk driver. This avoids try_to_free_buffers for ramdisk pages.

Of the fixes we've seen, I think this is the least intrusive change and
the one least likely to break rd or any other kernel code. It really
should go in 2.6.24, IMO.

Acked-by: Nick Piggin <[EMAIL PROTECTED]>

> Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>
> ---
>  drivers/block/rd.c |   13 +
>  1 file changed, 13 insertions(+)
>
> Index: linux-2.6/drivers/block/rd.c
> ===
> --- linux-2.6.orig/drivers/block/rd.c
> +++ linux-2.6/drivers/block/rd.c
> @@ -189,6 +189,18 @@ static int ramdisk_set_page_dirty(struct
>   return 0;
>  }
>
> +/*
> + * releasepage is called by pagevec_strip/try_to_release_page if
> + * buffers_heads_over_limit is true. Without a releasepage function
> + * try_to_free_buffers is called instead. That can unset the dirty
> + * bit of our ram disk pages, which will be eventually freed, even
> + * if the page is still in use.
> + */
> +static int ramdisk_releasepage(struct page *page, gfp_t dummy)
> +{
> + return 0;
> +}
> +
>  static const struct address_space_operations ramdisk_aops = {
>   .readpage   = ramdisk_readpage,
>   .prepare_write  = ramdisk_prepare_write,
> @@ -196,6 +208,7 @@ static const struct address_space_operat
>   .writepage  = ramdisk_writepage,
>   .set_page_dirty = ramdisk_set_page_dirty,
>   .writepages = ramdisk_writepages,
> + .releasepage= ramdisk_releasepage,
>  };
>
>  static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t
> sector,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: vm_ops.page_mkwrite() fails with vmalloc on 2.6.23

2007-10-29 Thread Nick Piggin
On Monday 29 October 2007 23:35, Peter Zijlstra wrote:
> On Mon, 2007-10-29 at 11:11 +0100, Peter Zijlstra wrote:
> > On Mon, 2007-10-29 at 01:17 -0700, Jaya Kumar wrote:
> > > On 10/29/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> > > > On Mon, 22 Oct 2007 16:40:57 +0200 Stefani Seibold 
<[EMAIL PROTECTED]> wrote:
> > > > > The problem original occurs with the fb_defio driver
> > > > > (driver/video/fb_defio.c). This driver use the
> > > > > vm_ops.page_mkwrite() handler for tracking the modified pages,
> > > > > which will be in an extra thread handled, to perform the IO and
> > > > > clean and write protect all pages with page_clean().
> > >
> > > Hi,
> > >
> > > An aside, I just tested that deferred IO works fine on
> > > 2.6.22.10/pxa255.
> > >
> > > I understood from the thread that PeterZ is looking into page_mkclean
> > > changes which I guess went into 2.6.23. I'm also happy to help in any
> > > way if the way we're doing fb_defio needs to change.
> >
> > Yeah, its the truncate race stuff introduced by Nick in
> >   d0217ac04ca6591841e5665f518e38064f4e65bd
> >
> > I'm a bit at a loss on how to go around fixing this. One ugly idea I had
> > was to check page->mapping before going into page_mkwrite() and when
> > that is null, don't bother with the truncate check.
>
> Something like this


I think it's a fine minimal patch. Maybe add a comment to say exactly
what we're doing here (pagecache generally just uses !mapping to test
for truncate).
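Something along these lines is all I mean (wording illustrative only):

        /*
         * page_mkwrite() can drop the page lock, and the page may not be
         * a pagecache page at all (e.g. fb_defio), in which case ->mapping
         * is NULL both before and after the callback.  Only treat this as
         * a truncate race if the mapping actually changed while the page
         * was unlocked.
         */
        if (mapping != page->mapping) {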

Otherwise, Acked-by: Nick Piggin <[EMAIL PROTECTED]>, thanks!

> ---
>  mm/memory.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/mm/memory.c
> ===
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -2300,6 +2300,8 @@ static int __do_fault(struct mm_struct *
>* to become writable
>*/
>   if (vma->vm_ops->page_mkwrite) {
> + struct address_space *mapping = page->mapping;
> +
>   unlock_page(page);
>   if (vma->vm_ops->page_mkwrite(vma, page) < 0) {
>   ret = VM_FAULT_SIGBUS;
> @@ -2314,7 +2316,7 @@ static int __do_fault(struct mm_struct *
>* reworking page_mkwrite locking API, which
>* is better done later.
>*/
> - if (!page->mapping) {
> + if (mapping != page->mapping) {
>   ret = 0;
>   anon = 1; /* no anon but release 
> vmf.page */
>   goto out;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [2.6.23-rt3] NMI watchdog trace of deadlock

2007-10-27 Thread Nick Piggin
On Saturday 27 October 2007 15:21, Mike Galbraith wrote:
> Greetings,
>
> For quite a while now, RT kernels have been locking up on me
> occasionally while my back is turned.  Yesterday, the little bugger
> finally pounced while my serial console box was up and waiting.
>
> [10138.162953] WARNING: at arch/i386/kernel/smp.c:581
> native_smp_call_function_mask() [10138.170583]  [<c01051da>]
> show_trace_log_lvl+0x1a/0x30
> [10138.175796]  [<c0105de3>] show_trace+0x12/0x14
> [10138.180291]  [<c0105dfb>] dump_stack+0x16/0x18
> [10138.184769]  [<c011609f>] native_smp_call_function_mask+0x138/0x13d
> [10138.191117]  [<c0117606>] smp_call_function+0x1e/0x24
> [10138.196210]  [<c012f85c>] on_each_cpu+0x25/0x50
> [10138.200807]  [<c0115c74>] flush_tlb_all+0x1e/0x20
> [10138.205553]  [<c016caaf>] kmap_high+0x1b6/0x417
> [10138.210118]  [<c011ec88>] kmap+0x4d/0x4f
> [10138.214102]  [<c026a9d8>] ntfs_end_buffer_async_read+0x228/0x2f9
> [10138.220163]  [<c01a0e9e>] end_bio_bh_io_sync+0x26/0x3f
> [10138.225352]  [<c01a2b09>] bio_endio+0x42/0x6d
> [10138.229769]  [<c02c2a08>] __end_that_request_first+0x115/0x4ac
> [10138.235682]  [<c02c2da7>] end_that_request_chunk+0x8/0xa
> [10138.241052]  [<c0365943>] ide_end_request+0x55/0x10a
> [10138.246058]  [<c036dae3>] ide_dma_intr+0x6f/0xac
> [10138.250727]  [<c0366d83>] ide_intr+0x93/0x1e0
> [10138.255125]  [<c015afb4>] handle_IRQ_event+0x5c/0xc9

Looks like ntfs is kmap()ing from interrupt context. Should
be using kmap_atomic instead, I think.

But the ntfs code I have seems to do just that...
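For reference, the interrupt-safe pattern would look something like this
(illustrative sketch, not the actual ntfs_end_buffer_async_read() code):

        #include <linux/highmem.h>

        static void zero_partial_page(struct page *page, unsigned int ofs,
                                        unsigned int len)
        {
                char *kaddr;

                /* kmap() can sleep and may trigger a global TLB flush (as in
                 * the trace above); kmap_atomic() is safe from bio completion
                 * / interrupt context */
                kaddr = kmap_atomic(page, KM_BIO_SRC_IRQ);
                memset(kaddr + ofs, 0, len);
                flush_dcache_page(page);
                kunmap_atomic(kaddr, KM_BIO_SRC_IRQ);
        }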
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [interesting] smattering of possible memory ordering bugs

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 13:35, Benjamin Herrenschmidt wrote:

[acks]

Thanks for those...

> > Index: linux-2.6/include/asm-powerpc/mmu_context.h
> > ===
> > --- linux-2.6.orig/include/asm-powerpc/mmu_context.h
> > +++ linux-2.6/include/asm-powerpc/mmu_context.h
> > @@ -129,7 +129,7 @@ static inline void get_mmu_context(struc
> > steal_context();
> >  #endif
> > ctx = next_mmu_context;
> > -   while (test_and_set_bit(ctx, context_map)) {
> > +   while (test_and_set_bit_lock(ctx, context_map)) {
> > ctx = find_next_zero_bit(context_map, LAST_CONTEXT+1,
> > ctx); if (ctx > LAST_CONTEXT)
> > ctx = 0;
> > @@ -158,7 +158,7 @@ static inline void destroy_context(struc
> >  {
> > preempt_disable();
> > if (mm->context.id != NO_CONTEXT) {
> > -   clear_bit(mm->context.id, context_map);
> > +   clear_bit_unlock(mm->context.id, context_map);
> > mm->context.id = NO_CONTEXT;
> >  #ifdef FEW_CONTEXTS
> > atomic_inc(_free_contexts);
>
> I don't think the previous code was wrong... it's not a locked section
> and we don't care about ordering previous stores. It's an allocation, it
> should be fine. In general, bitmap allocators should be allright.

Well if it is just allocating an arbitrary _number_ out of a bitmap
and nothing else (eg. like the pid allocator), then you don't need
barriers.
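That is, a pure number allocator like the sketch below needs no extra
ordering, because the bit itself is the only shared state (illustrative
only):

        static int alloc_ctx_number(unsigned long *map, int max)
        {
                int nr;

                do {
                        nr = find_next_zero_bit(map, max, 0);
                        if (nr >= max)
                                return -1;
                        /* atomic, but no acquire needed: nothing else is
                         * published under this bit */
                } while (test_and_set_bit(nr, map));

                return nr;
        }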


> Ignore the FEW_CONTEXTS stuff for now :-) At this point, it's UP only
> and will be replaced sooner or later.

OK. Then I agree, provided you're doing the correct synchronisation
or flushing etc. when destroying a context (which presumably you are).

I'll drop those bits then.

Thanks,
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[interesting] smattering of possible memory ordering bugs

2007-10-25 Thread Nick Piggin
Hi,

Just out of interest, I did a grep for files containing test_and_set_bit
as well as clear_bit (excluding obvious ones like include/asm-*/bitops.h).

Quite a few interesting things. There is a lot of stuff in drivers/* that
could be suspect, WRT memory barriers, including lots I didn't touch. Lot
of work. But forget that...

Some surprises with more obvious bugs, which I have added some cc's for.
powerpc seems to have an mmu context allocation bug, (and possible
improvement in hpte locking with the new bitops).

floppy is using a crazy open coded bit mutex, convert to regular mutex.

vt has something strange too.

fs-writeback could be made a little safer by properly closing the "critical"
section (the existing sections aren't really critical, but you never know
what the future holds ;))

jfs seems to be missing an obvious smp_mb__before_clear_bit (and a less
obvious smp_mb__after_clear_bit)

xfs seems to be missing smp_mb__before

Lots of code that allocates things (eg. msi interrupts) is suspicious. I'm
not exactly sure if these allocation bitmaps are actually used to protect
data structures or not...

I'll have to get round to preparing proper patches for these things if I
don't hear nacks...
---
 arch/arm/mach-davinci/gpio.c  |4 ++--
 arch/arm/mach-imx/generic.c   |4 ++--
 arch/arm/mach-iop13xx/msi.c   |4 ++--
 arch/arm/mach-ns9xxx/gpio.c   |4 ++--
 arch/avr32/mach-at32ap/pio.c  |1 +
 arch/powerpc/mm/hash_native_64.c  |5 ++---
 arch/ppc/xmon/start.c |   20 +---
 block/ll_rw_blk.c |   20 +---
 drivers/block/cciss.c |4 ++--
 drivers/block/cpqarray.c  |4 ++--
 drivers/block/floppy.c|   36 
 drivers/char/vt.c |4 ++--
 drivers/net/ibm_newemac/mal.c |5 ++---
 drivers/net/s2io.c|8 
 drivers/usb/misc/phidgetservo.c   |6 +++---
 fs/fs-writeback.c |4 ++--
 fs/jfs/jfs_metapage.c |5 +++--
 fs/xfs/xfs_mount.c|4 ++--
 include/asm-powerpc/mmu_context.h |4 ++--
 include/asm-ppc/mmu_context.h |4 ++--
 include/linux/aio.h   |5 -
 include/linux/interrupt.h |3 ++-
 include/linux/netdevice.h |5 ++---
 net/bluetooth/cmtp/core.c |4 ++--
 net/core/dev.c|5 ++---
 26 files changed, 75 insertions(+), 98 deletions(-)

Index: linux-2.6/arch/arm/mach-davinci/gpio.c
===
--- linux-2.6.orig/arch/arm/mach-davinci/gpio.c
+++ linux-2.6/arch/arm/mach-davinci/gpio.c
@@ -34,7 +34,7 @@ int gpio_request(unsigned gpio, const ch
 	if (gpio >= DAVINCI_N_GPIO)
 		return -EINVAL;
 
-	if (test_and_set_bit(gpio, gpio_in_use))
+	if (test_and_set_bit_lock(gpio, gpio_in_use))
 		return -EBUSY;
 
 	return 0;
@@ -46,7 +46,7 @@ void gpio_free(unsigned gpio)
 	if (gpio >= DAVINCI_N_GPIO)
 		return;
 
-	clear_bit(gpio, gpio_in_use);
+	clear_bit_unlock(gpio, gpio_in_use);
 }
 EXPORT_SYMBOL(gpio_free);
 
Index: linux-2.6/arch/arm/mach-imx/generic.c
===
--- linux-2.6.orig/arch/arm/mach-imx/generic.c
+++ linux-2.6/arch/arm/mach-imx/generic.c
@@ -107,7 +107,7 @@ int imx_gpio_request(unsigned gpio, cons
 		return -EINVAL;
 	}
 
-	if(test_and_set_bit(gpio, imx_gpio_alloc_map)) {
+	if(test_and_set_bit_lock(gpio, imx_gpio_alloc_map)) {
 		printk(KERN_ERR "imx_gpio: GPIO %d already used. Allocation for \"%s\" failed\n",
 			gpio, label ? label : "?");
 		return -EBUSY;
@@ -123,7 +123,7 @@ void imx_gpio_free(unsigned gpio)
 	if(gpio >= (GPIO_PORT_MAX + 1) * 32)
 		return;
 
-	clear_bit(gpio, imx_gpio_alloc_map);
+	clear_bit_unlock(gpio, imx_gpio_alloc_map);
 }
 
 EXPORT_SYMBOL(imx_gpio_free);
Index: linux-2.6/arch/arm/mach-iop13xx/msi.c
===
--- linux-2.6.orig/arch/arm/mach-iop13xx/msi.c
+++ linux-2.6/arch/arm/mach-iop13xx/msi.c
@@ -135,7 +135,7 @@ again:
 	if (irq > NR_IRQS)
 		return -ENOSPC;
 	/* test_and_set_bit operates on 32-bits at a time */
-	if (test_and_set_bit(pos, msi_irq_in_use))
+	if (test_and_set_bit_lock(pos, msi_irq_in_use))
 		goto again;
 
 	dynamic_irq_init(irq);
@@ -149,7 +149,7 @@ void destroy_irq(unsigned int irq)
 
 	dynamic_irq_cleanup(irq);
 
-	clear_bit(pos, msi_irq_in_use);
+	clear_bit_unlock(pos, msi_irq_in_use);
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
Index: linux-2.6/arch/arm/mach-ns9xxx/gpio.c
===
--- linux-2.6.orig/arch/arm/mach-ns9xxx/gpio.c
+++ linux-2.6/arch/arm/mach-ns9xxx/gpio.c
@@ -85,7 +85,7 @@ static inline void __iomem *ns9xxx_gpio_
 int gpio_request(unsigned gpio, const char *label)
 {
 	if (likely(ns9xxx_valid_gpio(gpio)))
-		return test_and_set_bit(gpio, gpiores) ? -EBUSY : 0;

Re: Is gcc thread-unsafe?

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 09:55, Andi Kleen wrote:
> > But we don't actually know what it is, and it could change with
> > different architectures or versions of gcc. I think the sanest thing
> > is for gcc to help us out here, seeing as there is this very well
> > defined requirement that we want.
> >
> > If you really still want the optimisation to occur,
>
> I don't think it makes sense for memory. It may may make sense for
> registers.
>
> But given that my kernel doesn't seem to contain any instances
> at all it's probably not too useful for it anyways.
>
> So I wouldn't have a problem disabling it, but it would
> be better if they fixed their heuristics to only apply it when
> it makes sense.

That's what I mean -- disabling it for memory. I mean, I am just
talking about the conditional => unconditional store to a shared
variable optimisation.

So for example, adc, sbb, and cmov are all still fine when they
are used for the right things. I don't want to turn them off
because they really are quite useful.

As far as it being theoretical... I hope it is. But we should
nip this in the "bud" (gcc 3.x does this too, sigh) before it
causes problems for us (and any and all other threaded programs
and libraries out there). And by that I mean ask them for a
compiler option rather than start adding volatile.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Is gcc thread-unsafe?

2007-10-25 Thread Nick Piggin
On Friday 26 October 2007 09:09, Andi Kleen wrote:
> On Friday 26 October 2007 00:49:42 Nick Piggin wrote:

> > Marking volatile I think is out of the question. To start with,
> > volatile creates really poor code (and most of the time we actually
> > do want the code in critical sections to be as tight as possible).
>
> Poor code is better than broken code I would say. Besides
> the cross CPU synchronization paths are likely dominated by
> cache misses anyways; it's unlikely they're actually limited
> by the core CPU. So it'll probably not matter all that much
> if the code is poor or not.

But we have to mark all memory accesses inside the critical section
as volatile, even after the CPU synchronisation, and often in the
common case where there is no contention or cacheline bouncing.

Sure, real code is dominated by cache misses anyway, etc etc.
However volatile prevents a lot of real optimisations that we'd
actually want to do, and increases icache size etc.


> > But also because I don't think these bugs are just going to be
> > found easily.
> >
> > > It might be useful to come up with some kind of assembler pattern
> > > matcher to check if any such code is generated for the kernel
> > > and try it with different compiler versions.
> >
> > Hard to know how to do it. If you can, then it would be interesting.
>
> I checked my kernel for "adc" at least (for the trylock/++ pattern)
> and couldn't find any (in fact all of them were in
> data the compiler thought to be code). That was not a allyesconfig build,
> so i missed some code.

sbb as well.


> For cmov it's at first harder because they're much more frequent
> and cmov to memory is a multiple instruction pattern (the instruction
> does only support memory source operands). Luckily gcc
> doesn't know the if (x) mem = a; => ptr = cmov(x, , ); *ptr = a;
> transformation trick so I don't think there are actually any conditional
> stores on current x86.
>
> It might be a problem on other architectures which support true
> conditional stores though (like IA64 or ARM)

It might just depend on the instruction costs that gcc knows about.
For example, if ld/st is expensive, they might hoist them out of
loops etc. You don't even need to have one of these predicate or
pseudo predicate instructions.


> Also I'm not sure if gcc doesn't know any other tricks like the
> conditional add using carry, although I cannot think of any related
> to stores from the top of my hat.
>
> Anyways, if it's only conditional add if we ever catch such a case
> it could be also handled with inline assembly similar to local_inc()

But we don't actually know what it is, and it could change with
different architectures or versions of gcc. I think the sanest thing
is for gcc to help us out here, seeing as there is this very well
defined requirement that we want.

If you really still want the optimisation to occur, I don't think it
is too much to use a local variable for the accumulator (eg. in the
simple example case).
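To spell out the pattern I'm worried about (illustrative example, not
from the kernel):

        int shared_counter;             /* protected by a lock elsewhere */

        void maybe_count(int cond)
        {
                /* the problematic transformation turns this into an
                 * unconditional load/modify/store of shared_counter
                 * (e.g. via adc or cmov), introducing a store the
                 * source never had */
                if (cond)
                        shared_counter++;
        }

        void count_many(const int *conds, int n)
        {
                int i, local = 0;

                /* using a local accumulator gives the compiler nothing
                 * shared to speculate stores into */
                for (i = 0; i < n; i++)
                        if (conds[i])
                                local++;
                shared_counter += local;
        }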
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


<    1   2   3   4   5   6   7   8   9   10   >