RE: [PATCH] x86/ioapic: Fix NULL pointer dereference on CPU hotplug after disabling irqs

2012-07-25 Thread Siddha, Suresh B
Tomoki wrote:
> In current Linux, the percpu variable `vector_irq' is not always cleared when
> a CPU is offlined. If the cpu that has the disabled irqs in vector_irq is
> hotplugged again, __setup_vector_irq() hits an invalid irq vector and may
> crash.
>
> Commit f6175f5bfb4c partially fixes this, but it was not enough in
> environments with an IOMMU IRQ remapper.

So, this patch essentially makes the commit f6175f5bfb4c unnecessary, right?

Can you revert that too as part of this new proposed patch?

thanks,
suresh
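
To make the failure mode easier to picture, here is a minimal user-space sketch of the idea (illustrative only -- the array layout and the VECTOR_UNDEFINED sentinel are assumptions, not the kernel's exact code): each CPU keeps a vector-to-irq table, and unless the offline path resets it, a later re-online walks stale entries.

#include <stdio.h>

#define NR_VECTORS       256
#define VECTOR_UNDEFINED (-1)

/* Toy stand-in for one CPU's per-cpu vector_irq[] table. */
static int vector_irq[NR_VECTORS];

/* What the offline path should do: invalidate every stale vector->irq entry. */
static void clear_vector_irq(void)
{
	for (int v = 0; v < NR_VECTORS; v++)
		vector_irq[v] = VECTOR_UNDEFINED;
}

/* What the re-online path does: walk the table and rebuild routing;
 * a stale entry here is what leads to the crash described above. */
static void setup_vector_irq(void)
{
	for (int v = 0; v < NR_VECTORS; v++) {
		if (vector_irq[v] == VECTOR_UNDEFINED)
			continue;	/* nothing mapped on this vector */
		printf("vector 0x%02x still routed to irq %d\n", v, vector_irq[v]);
	}
}

int main(void)
{
	vector_irq[0x41] = 19;	/* irq 19 was routed here, then disabled */
	clear_vector_irq();	/* reset on offline */
	setup_vector_irq();	/* safe on re-online: no stale entry left */
	return 0;
}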


Re: [patch 2/2] x86,fpu: lazy allocation of FPU area

2008-02-24 Thread Siddha, Suresh B
On Sun, Feb 24, 2008 at 01:20:05PM +0100, Andi Kleen wrote:
> On Sat, Feb 23, 2008 at 06:34:39PM -0800, Suresh Siddha wrote:
> > Only allocate the FPU area when the application actually uses FPU, i.e., in 
> > the
> > first lazy FPU trap. This could save memory for non-fpu using apps.
> 
> Did you measure this making any difference? 
> 
> At least at some point glibc always touched the FPU once to initialize
> it.

A simple kernel boot showed only 20 or so tasks using the FPU, out of the
200 or so tasks running.

thanks,
suresh
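
As a rough illustration of the numbers above (a sketch with made-up names, not the i387 code): the FPU save area becomes a pointer that stays NULL until a task's first FPU trap, so the ~180 tasks that never touch the FPU never pay for the allocation.

#include <stdlib.h>
#include <string.h>

struct fpu_state { unsigned char fxsave[512]; };

struct task {
	struct fpu_state *fpu;	/* NULL until the task first uses the FPU */
};

/* Called from the first device-not-available (#NM) trap of a task;
 * later traps simply reuse the existing allocation. */
static int lazy_fpu_alloc(struct task *t)
{
	if (t->fpu)
		return 0;
	t->fpu = malloc(sizeof(*t->fpu));
	if (!t->fpu)
		return -1;
	memset(t->fpu, 0, sizeof(*t->fpu));
	return 0;
}

int main(void)
{
	struct task t = { 0 };
	return lazy_fpu_alloc(&t);	/* first FPU use: allocate on demand */
}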


Re: [patch 1/2] x86,fpu: split FPU state from task struct

2008-02-24 Thread Siddha, Suresh B
On Sun, Feb 24, 2008 at 08:22:02AM +0100, Ingo Molnar wrote:
> 
> * Suresh Siddha <[EMAIL PROTECTED]> wrote:
> 
> > Split the FPU save area from the task struct. This allows easy 
> > migration of FPU context, and it's generally cleaner. It also allows 
> > the following two optimizations:
> > 
> > 1) only allocate when the application actually uses FPU, so in the 
> > first lazy FPU trap. This could save memory for non-fpu using apps. 
> > Next patch does this lazy allocation.
> > 
> > 2) allocate the right size for the actual cpu rather than 512 bytes 
> > always. Patches enabling xsave/xrstor support (coming shortly) will 
> > take advantage of this.
> 
> i like the concept. Please clean up the issues found by Christoph and 
> please also base it against x86.git#testing [this is clear 2.6.26 
> material and there are already some changes in this area]:
> 
>http://people.redhat.com/mingo/x86.git/README

Sure. Will do that in a day.

thanks,
suresh


Re: [patch 1/2] x86,fpu: split FPU state from task struct

2008-02-24 Thread Siddha, Suresh B
On Sun, Feb 24, 2008 at 08:27:30AM +0100, Roger While wrote:
> 
> On Sat, Feb 23, 2008 at 06:34:38PM -0800, Suresh Siddha wrote:
> > Split the FPU save area from the task struct. This allows easy migration
> > of FPU context, and it's generally cleaner. It also allows the following
> > two optimizations:
> >
> > 1) only allocate when the application actually uses FPU, so in the first
> > lazy FPU trap. This could save memory for non-fpu using apps. Next patch
> > does this lazy allocation.
> >
> > 2) allocate the right size for the actual cpu rather than 512 bytes 
> always.
> > Patches enabling xsave/xrstor support (coming shortly) will take advantage
> > of this.
> 
> > if (next_p->fpu_counter>5)
> > -   prefetch(&next->i387.fxsave);
> > +   prefetch(FXSAVE(next_p));
> 
> Shouldn't that be  prefetch(FXSAVE(next));  ?

No. The FXSAVE macro takes 'next_p', which is the task_struct.

thanks,
suresh
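
To make the answer concrete, here is a hypothetical sketch of such an accessor (the macro name and field layout are assumptions for illustration only): it is handed the task_struct pointer and digs down to the fxsave area itself, which is why FXSAVE(next_p) rather than FXSAVE(next) is the right spelling.

#include <stdio.h>

struct i387_state    { unsigned char fxsave[512]; };
struct thread_struct { struct i387_state i387; };
struct task_struct   { struct thread_struct thread; };

/* Hypothetical accessor: takes the task_struct, not the thread_struct. */
#define FXSAVE(tsk) (&(tsk)->thread.i387.fxsave)

int main(void)
{
	struct task_struct next_p;
	struct thread_struct *next = &next_p.thread;

	/* old code: prefetch(&next->i387.fxsave);  -- thread_struct based */
	/* new code: prefetch(FXSAVE(next_p));      -- task_struct based   */
	printf("%p %p\n", (void *)&next->i387.fxsave, (void *)FXSAVE(&next_p));
	return 0;
}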


Re: Unable to continue testing of 2.6.25

2008-02-19 Thread Siddha, Suresh B
On Mon, Feb 18, 2008 at 11:18:48AM -0800, Roland Dreier wrote:
>  > > AFAIK mapping PCI memory WB is not allowed, so WC is really our only
>  > > choice.
> 
>  > afaik that depends on the BAR being prefetchable or not.
> 
> In my case the BAR is prefetchable.

Even if the BAR is prefetchable, on some platforms mapping MMIO space
as WB can cause bad results such as machine checks.

>  > (and by your argument, ioremap_cached() would not be useful, and since
>  > that was, until 2.6.25-rc1, the default behavior for ioremap(), would
>  > have caused massive problems)
> 
> I'm not sure what ioremap_cached() would really do in my case, since
> the MTRRs for PCI memory are set to UC, so without monkeying with MTRR
> contents (which can't really be done safely) the only choices we have
> are leaving the mapping as UC or using PAT to get WC.

thanks,
suresh


Re: What's the status of x2APIC support in Linux kernel?

2008-02-14 Thread Siddha, Suresh B
On Thu, Feb 14, 2008 at 12:06:37PM -0700, Bjorn Helgaas wrote:
> 
> Are you adding x2apic support for both x86 and ia64, or only for x86?

The x2apic extension is specific to x86; ia64 already has an advanced lsapic,
doesn't it?

thanks,
suresh


Re: [PATCH] [7/8] CPA: Don't flush caches on CPUs that support self-snoop

2008-02-11 Thread Siddha, Suresh B
On Mon, Feb 11, 2008 at 06:58:46PM +0100, Andi Kleen wrote:
> On Monday 11 February 2008 18:36:06 Siddha, Suresh B wrote:
> > On Mon, Feb 11, 2008 at 04:27:23PM +0100, Andi Kleen wrote:
> > > 
> > > > > That is exactly the situation in pageattr.c. You're saying the manual
> > > > > is wrong here?
> > > > 
> > > > I'm saying that we are not following step 2 (marking the pages not 
> > > > present) 
> > > 
> > > Yes that's true. It's one of the design problems of the intent API that
> > > makes fixing this hard unfortunately.
> > 
> > BTW, making pages not present is required only while changing the attribute
> > from WB to WC or WC to WB. I think this step is for avoiding attribute
> > aliasing for speculative accesses. For UC, we don't have speculative
> > accesses and as such we probably don't need it.
> 
> Ok that would imply that my patch is ok for all current in tree users at least
> (none of them ever set WC currently, only UC). It might not be safe for
> the out of tree WC users though.
> 
> So it should be ok to apply according to standard policy :)

There is at least one issue though. For an I/O device which is not capable of
snooping the caches, if the driver model assumes that ioremap_nocache() will
flush the cpu caches (before initiating DMA), then the lack of a cache flush
in cpa() might cause some breakages.

If there are any such usages, we should fix the driver models with a new API.

> 
> > So WC support through PAT should take care of it.
> 
> You mean the PAT patch should do something about it? Yes probably, but what?

marking the pages not present etc..

> > > (intent API assumes that the caller doesn't fully own the page to change,
> > > and if you can't control it 100% it is not possible to unmap it temporarily)
> > 
> > True. Perhaps for WC changes we can assume the ownership?
> 
> Not assuming ownership was the main reason why the intent change was
> originally submitted. I personally never quite understood the rationale for that,
> but the submitter and the maintainers apparently thought it a valuable
> thing to have, otherwise the patch would not have been written and included. 
> 
> You'll need to ask them if that has changed now.
> 
> At least if that requirement was dropped it would be possible to remove
> quite a lot of code from pageattr.c

I don't think we can drop that requirement for UC at least. For WC, probably yes.

thanks,
suresh
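
For context, the "step 2" being debated above is part of the documented sequence for changing a page's memory type. A hedged, pseudocode-style sketch of that sequence (stub helpers, not the pageattr.c implementation), under the assumptions stated in this thread:

enum mem_type { WB, WC, UC };

/* Stub helpers standing in for the real page-table and flush primitives. */
static void set_pte_not_present(unsigned long addr)		{ (void)addr; }
static void set_pte_type(unsigned long addr, enum mem_type t)	{ (void)addr; (void)t; }
static void flush_tlb_one(unsigned long addr)			{ (void)addr; }
static void flush_cache_one(unsigned long addr)			{ (void)addr; }

static void change_page_attr_sketch(unsigned long addr,
				    enum mem_type from, enum mem_type to)
{
	/*
	 * Step 2 of the documented procedure: make the page not present so
	 * speculative accesses cannot create an aliased cached/WC line while
	 * the attribute is in flux.  Per this thread, that matters for
	 * WB<->WC transitions; UC changes are not speculated into, so the
	 * step can arguably be skipped for them.
	 */
	if (from == WC || to == WC)
		set_pte_not_present(addr);

	flush_tlb_one(addr);	/* no stale translations anywhere */
	flush_cache_one(addr);	/* no stale cache lines either; the patch under
				 * discussion argues self-snooping CPUs make
				 * this flush unnecessary */

	set_pte_type(addr, to);	/* install the new attribute */
	flush_tlb_one(addr);
}

int main(void)
{
	change_page_attr_sketch(0x100000UL, WB, WC);
	return 0;
}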


Re: [PATCH] [7/8] CPA: Don't flush caches on CPUs that support self-snoop

2008-02-11 Thread Siddha, Suresh B
On Mon, Feb 11, 2008 at 04:27:23PM +0100, Andi Kleen wrote:
> 
> > > That is exactly the situation in pageattr.c. You're saying the manual
> > > is wrong here?
> > 
> > I'm saying that we are not following step 2 (marking the pages not present) 
> 
> Yes that's true. It's one of the design problems of the intent API that makes
> fixing this hard unfortunately.

BTW, making pages not present is required only while changing the attribute
from WB to WC or WC to WB. I think this step is for avoiding attribute aliasing
for speculative accesses. For UC, we don't have speculative accesses and as such
we probably don't need it.

So WC support through PAT should take care of it.

> (intent API assumes that the caller doesn't fully own the page to change, and
> if you can't control it 100% it is not possible to unmap it temporarily) 

True. Perhaps for WC changes we can assume the ownership?

thanks,
suresh


Re: issue with patch "x86: no CPA on iounmap"

2008-02-07 Thread Siddha, Suresh B
On Tue, Feb 05, 2008 at 08:05:35AM +0100, Ingo Molnar wrote:
> there are many GART drivers, and the method used depends on the GART 
> driver. The following GART drivers still use ioremap in one way or 
> another:

There is another issue I see in the recent pageattr changes, again in the GART
driver context.

GART drivers use map_page_into_agp() and unmap_page_from_agp() during
runtime (and not just during load and unload of the driver module). In the
recent pageattr changes, we seem to have dropped the concept of reverting
to the large page (for the kernel identity mapping) while changing the attribute
back to WB. In this GART driver context, over time, kernel identity mappings
might end up backed by small pages if we don't revert to the large page again
during set_memory_wb(), which gets called from unmap_page_from_agp() for a RAM
page. We thus lose the advantage of large page mappings for the kernel identity
mapping.

thanks,
suresh
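
A rough sketch of the missing "revert to large page" step described above (toy bookkeeping with assumed names, not the pageattr.c code): once every 4K entry under a split 2M kernel mapping is back to the default WB attribute, the mapping could be merged into a large page again.

#include <stdbool.h>
#include <stdio.h>

#define PTRS_PER_PMD 512	/* 4K pages per 2M large-page mapping on x86 */

enum attr { WB, UC };

/* Toy model of one split 2M identity mapping: per-4K-page attributes. */
static enum attr pte_attr[PTRS_PER_PMD];

/* After set_memory_wb() (e.g. from unmap_page_from_agp()), check whether
 * the whole 2M range is uniformly WB again and could be re-merged. */
static bool can_revert_to_large_page(void)
{
	for (int i = 0; i < PTRS_PER_PMD; i++)
		if (pte_attr[i] != WB)
			return false;
	return true;
}

int main(void)
{
	pte_attr[7] = UC;	/* GART mapped one page UC: 2M mapping stays split */
	printf("mergeable: %d\n", can_revert_to_large_page());
	pte_attr[7] = WB;	/* attribute reverted: merging is possible again */
	printf("mergeable: %d\n", can_revert_to_large_page());
	return 0;
}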


issue with patch "x86: no CPA on iounmap"

2008-02-04 Thread Siddha, Suresh B
This is wrt to x86 git commit f56d005d30342a45d8af2b75e82200f09600
"x86: no CPA on iounmap"

This can cause a performance issue. When a GART driver unmaps a RAM page
which was mapped as UC, this commit will still retain the UC attribute
on the kernel identity mapping. This can cause mysterious performance issues
if this freed page gets used by the kernel later.

For now we should change the attribute during iounmap, and in the future the
PAT infrastructure will have the necessary hooks to avoid the aliasing issues.

thanks,
suresh
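
A toy sketch of the aliasing/performance problem described above (assumed names, not the kernel code): the identity-mapping attribute is switched to UC when the GART maps the RAM page, and unless iounmap (or a later PAT hook) reverts it, whoever reuses the freed page runs uncached.

#include <stdio.h>

enum attr { WB, UC };

/* Toy model: attribute of the kernel identity (direct) mapping, per page. */
static enum attr direct_map_attr[16];

static void gart_map_page_uc(int pfn)		{ direct_map_attr[pfn] = UC; }

/* The commit under discussion: iounmap() no longer touches the attribute. */
static void iounmap_without_cpa(int pfn)	{ (void)pfn; }

/* What this mail argues for, until PAT grows the proper hooks: revert to WB
 * on unmap so a later user of the freed page is not silently uncached. */
static void iounmap_with_cpa(int pfn)		{ direct_map_attr[pfn] = WB; }

int main(void)
{
	int pfn = 3;

	gart_map_page_uc(pfn);
	iounmap_without_cpa(pfn);
	printf("page reused by kernel with attr %s (mysteriously slow if UC)\n",
	       direct_map_attr[pfn] == UC ? "UC" : "WB");

	gart_map_page_uc(pfn);
	iounmap_with_cpa(pfn);
	printf("page reused by kernel with attr %s\n",
	       direct_map_attr[pfn] == UC ? "UC" : "WB");
	return 0;
}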


Re: [rfc] direct IO submission and completion scalability issues

2008-02-04 Thread Siddha, Suresh B
On Sun, Feb 03, 2008 at 10:52:52AM +0100, Nick Piggin wrote:
> Hi guys,
> 
> Just had another way we might do this. Migrate the completions out to
> the submitting CPUs rather than migrate submission into the completing
> CPU.

Hi Nick, this was the first experiment I tried on a quad-core, four-package
SMP platform, and it didn't show much improvement in my prototype (my
prototype was migrating the softirq to the kblockd context of the
submitting CPU).

In the OLTP workload, quite a bit of activity happens below the block layer,
and by the time we come to the softirq, some damage is already done in
slab, scsi cmds, timers etc. Last year's OLS paper
(http://ols.108.redhat.com/2007/Reprints/gough-Reprint.pdf)
shows the different cache lines that are contended in the kernel for the
OLTP workload.

Softirq migration should at least reduce the cacheline contention that
happens in the sched and AIO layers. I didn't spend much time on why my softirq
migration patch didn't help much (as I was after the bigger goal of migrating
IO submission to the completion CPU at that time). If this solution has
fewer side effects and is easily acceptable, then we can analyze the softirq
migration patch further and find out its potential.

While there is some potential with the softirq migration, the full potential
can be exploited by doing the IO submission and completion on the same CPU.

thanks,
suresh


Re: What's the status of x2APIC support in Linux kernel?

2008-02-01 Thread Siddha, Suresh B
On Fri, Feb 01, 2008 at 01:17:14PM +0100, Andi Kleen wrote:
> "Jike Song" <[EMAIL PROTECTED]> writes:
> 
> > On 2/1/08, Rijndael Cosque <[EMAIL PROTECTED]> wrote:
> >> Hi all,
> >>
> >> I found the x2APIC spec via 
> >> http://www.intel.com/products/processor/manuals/.
> >>
> >> Looks at present there is no x2APIC support in Linux kernel 2.6.24?
> >>
> >> Is there any experimental patch available for Linux kernel? -- I
> >> googled "x2APIC Linux";  looks no patch for now?
> 
> There are right now no CPUs shipping which implement x2apic.

That's correct. None of the released CPUs support it.

I am working on a patch enabling these future x2apic extensions and the plan 
is to post this kernel patch shortly.

thanks,
suresh


Re: [PATCH x86/mm] x86: i387 fpregs_set convert_to_fxsr

2008-01-25 Thread Siddha, Suresh B
On Thu, Jan 24, 2008 at 05:59:33PM -0800, Roland McGrath wrote:
> + if (pos > 0 || count < sizeof(env))
> +		convert_from_fxsr(&env, target);

Well, does the generic regset code enforce the need for this now? Can we
disallow the usage cases where the user passes a smaller target buffer
size or requests data from in between? We were disallowing such usage
scenarios before, weren't we?

thanks,
suresh


[patch] x86, i387: use convert_to_fxsr() in fpregs_set()

2008-01-24 Thread Siddha, Suresh B
Roland, I just happened to notice this bug. Can you please ack the bug fix,
which needs to go to the x86 mm tree.

thanks.
---
[patch] x86, i387: use convert_to_fxsr() in fpregs_set()

This fixes the bug introduced recently during the revamp of the code.
fpregs_set() needs to use convert_to_fxsr() rather than copying into the
fxsave struct directly.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 7e354a3..93a1706 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -327,6 +327,7 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
 		   const void *kbuf, const void __user *ubuf)
 {
 	int ret;
+	struct user_i387_ia32_struct env;
 
 	if (!HAVE_HWFP)
 		return fpregs_soft_set(target, regset, pos, count, kbuf, ubuf);
@@ -339,13 +340,9 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
 				  &target->thread.i387.fsave, 0, -1);
 
 	ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
-				 &target->thread.i387.fxsave, 0, -1);
-
-	/*
-	 * mxcsr reserved bits must be masked to zero for security reasons.
-	 */
-	target->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask;
+				 &env, 0, -1);
 
+	convert_to_fxsr(target, &env);
 	return ret;
 }
 


Re: 2.6.24-rc8-mm1

2008-01-17 Thread Siddha, Suresh B
On Thu, Jan 17, 2008 at 03:04:03PM -0800, Balbir Singh wrote:
> I think I found the root cause of the problem and a fix for it.
> The fix works for me.
> 

Thanks Balbir. But the appended fix is cleaner and more appropriate. Can you
please check if it works.
---

From Balbir Singh:
> With the introduction of reserve_mattr() and free_mattr(), the ioremap*
> routines started exploiting it. The recent 2.6.24-rc8-mm1 kernel has a
> peculiar problem wherein certain devices disappear. In my case, for example:
>
> e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI
> e100: Copyright(c) 1999-2006 Intel Corporation
> ACPI: PCI Interrupt :04:08.0[A] -> GSI 20 (level, low) -> IRQ 20
> modprobe:2584 conflicting cache attribute 5000-50001000
> uncached<->default
> e100: :04:08.0: e100_probe: Cannot map device registers, aborting.
> ACPI: PCI interrupt for device :04:08.0 disabled
>
> On further analysis, it was discovered that quirk_e100_interrupt() calls
> ioremap(), which reserves memory attributes for the e100 card, but
> iounmap()
> does not free it.

Fix iounmap() to call free_mattr() unconditionally.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
---

diff --git a/arch/x86/mm/ioremap_32.c b/arch/x86/mm/ioremap_32.c
index ae9c8b3..4d5bea8 100644
--- a/arch/x86/mm/ioremap_32.c
+++ b/arch/x86/mm/ioremap_32.c
@@ -201,12 +201,11 @@ void iounmap(volatile void __iomem *addr)
return;
}
 
+   free_mattr(p->phys_addr, p->phys_addr + get_vm_area_size(p),
+  p->flags>>20);
/* Reset the direct mapping. Can block */
-   if (p->flags >> 20) {
-   free_mattr(p->phys_addr, p->phys_addr + get_vm_area_size(p),
-  p->flags>>20);
+   if (p->flags >> 20)
ioremap_change_attr(p->phys_addr, get_vm_area_size(p), 0);
-   }
 
/* Finally remove it */
o = remove_vm_area((void *)addr);
diff --git a/arch/x86/mm/ioremap_64.c b/arch/x86/mm/ioremap_64.c
index 022b645..c766327 100644
--- a/arch/x86/mm/ioremap_64.c
+++ b/arch/x86/mm/ioremap_64.c
@@ -183,12 +183,11 @@ void iounmap(volatile void __iomem *addr)
return;
}
 
+   free_mattr(p->phys_addr, p->phys_addr + get_vm_area_size(p),
+  p->flags>>20);
/* Reset the direct mapping. Can block */
-   if (p->flags >> 20) {
-   free_mattr(p->phys_addr, p->phys_addr + get_vm_area_size(p),
-  p->flags>>20);
+   if (p->flags >> 20)
ioremap_change_attr(p->phys_addr, get_vm_area_size(p), 0);
-   }
 
/* Finally remove it */
o = remove_vm_area((void *)addr);


Re: [patch 0/4] x86: PAT followup - Incremental changes and bug fixes

2008-01-17 Thread Siddha, Suresh B
On Thu, Jan 17, 2008 at 10:13:08PM +0100, Ingo Molnar wrote:
> but in general we must be robust enough in this case and just degrade 
> any overlapping page to UC (and emit a warning perhaps) - instead of 
> failing the ioremap and thus failing the driver (and the bootup).

But then, this will cause an attribute conflict. The old one was specifying
WB in PAT (ioremap with no flags) and the new ioremap specifies UC.

As Linus mentioned, the main problem is to figure out the correct attribute
for ioremap(), which doesn't specify the actual attribute to be used.

One mechanism to fix the issue generically (somewhat, at least) is to use
MTRRs and figure out the default MTRR attribute for that physical address
and use it for ioremap().

> no, there should be no such need. There can be "mapping leaks", in that 
> the mapped object is not unmapped. There's detection code in today's 
> x86.git that should report something like this if it occurs:
> 
>   Debug warning: early ioremap leak of 1 areas detected.
>   please boot with early_ioremap_debug and report the dmesg.
>   [ cut here ]
>   WARNING: at arch/x86/mm/ioremap_32.c:346 ()
> 
> but i have not seen this message in your boot log. Could you boot with 
> early_ioremap_debug and send us the dmesg - i'm curious which ACPI 
> tables are actively mapped while those devices are initialized.

In this scenario, ACPI is using ioremap(), leaving some dangling references.
Venki is looking to fix this code. Getting the attribute from the MTRR
for ioremap with no flags might solve some of these issues as well. Will look
into this.

thanks,
suresh
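
A hedged sketch of the mechanism suggested above (toy MTRR table and values, not a real kernel interface): when ioremap() is called without an explicit attribute, look up what the MTRRs already say about that physical range and reuse that, rather than guessing.

#include <stdio.h>

enum mtrr_type { MTRR_UC, MTRR_WC, MTRR_WB };

struct mtrr_range {
	unsigned long long base, size;
	enum mtrr_type type;
};

/* Toy variable-range MTRRs plus a default type (assumed values). */
static const struct mtrr_range mtrrs[] = {
	{ 0x00000000ULL, 0x80000000ULL, MTRR_WB },	/* RAM */
	{ 0xe0000000ULL, 0x10000000ULL, MTRR_UC },	/* MMIO */
};
static const enum mtrr_type mtrr_def_type = MTRR_UC;

static enum mtrr_type mtrr_type_of(unsigned long long start, unsigned long long end)
{
	for (unsigned int i = 0; i < sizeof(mtrrs) / sizeof(mtrrs[0]); i++)
		if (start >= mtrrs[i].base && end <= mtrrs[i].base + mtrrs[i].size)
			return mtrrs[i].type;
	return mtrr_def_type;
}

int main(void)
{
	/* ioremap() with no flags could default to whatever the MTRR says. */
	printf("attr for 0xe0001000..0xe0002000 -> %d (0 == UC)\n",
	       mtrr_type_of(0xe0001000ULL, 0xe0002000ULL));
	return 0;
}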


Re: [patch 02/11] PAT x86: Map only usable memory in x86_64 identity map and kernel text

2008-01-15 Thread Siddha, Suresh B
On Tue, Jan 15, 2008 at 11:17:58PM +0100, Ingo Molnar wrote:
> 
> * Siddha, Suresh B <[EMAIL PROTECTED]> wrote:
> > Time to resurrect Jesse's old patches 
> > i386-trim-memory-not-covered-by-wb-mtrrs.patch(which was in -mm 
> > sometime back)
> 
> just to make sure i understood the attribute priorities right: we cannot 
> just mark it WB in the PAT and expect it to be write-back - the UC of 
> the MTRR will control?

Unfortunately PAT is not always the overriding winner. It all depends
on the individual attributes. For WB in PAT, the MTRR always takes
precedence.

> 
> > >(NOTE: UC- would be fine and 
> > >overridable by PAT, hence it's not a conflict we should detect.)
> > 
UC- can't be specified by MTRRs.
> 
> hm, only by PATs? Not even by the default MTRR?

No.

> > >  - mmio area marked cacheable in the MTRR (results in broken 
> > >  system)
> > 
> > PAT can help specify the UC/WC attribute here.
> 
> ok. So it seems we dont even need all that many special cases, a "dont 
> write MTRRs" and "use PATs everywhere" rule would just do the right 
> thing all across?

Yes. The main thing required is along the lines of Jesse's patch.
If the MTRR default type is not WB, then we need to check if any of the RAM
is not covered by the MTRR range registers and trim the RAM accordingly.

thanks,
suresh
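
A small sketch of the precedence rule stated above, covering only the cases asserted in this thread (the full PAT/MTRR combination table has more rows): a WB entry in the PAT defers to the MTRR type, while UC or WC requested through the PAT wins.

#include <stdio.h>

enum type { UC, WC, WB };

/* Partial effective-type rule, per this thread:
 * - PAT says WB        -> the MTRR type controls (a UC MTRR keeps RAM slow)
 * - PAT says UC or WC  -> the PAT choice wins over the MTRR. */
static enum type effective_type(enum type pat, enum type mtrr)
{
	if (pat == WB)
		return mtrr;
	return pat;
}

int main(void)
{
	printf("PAT=WB, MTRR=UC -> %d (0 == UC: PAT cannot upgrade it)\n",
	       effective_type(WB, UC));
	printf("PAT=WC, MTRR=WB -> %d (1 == WC: PAT specifies it)\n",
	       effective_type(WC, WB));
	return 0;
}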


Re: [patch 02/11] PAT x86: Map only usable memory in x86_64 identity map and kernel text

2008-01-14 Thread Siddha, Suresh B
On Mon, Jan 14, 2008 at 05:43:24PM +0100, Ingo Molnar wrote:
> 
> * Pallipadi, Venkatesh <[EMAIL PROTECTED]> wrote:
> 
> > Also, relying on MTRR, is like giving more importance to BIOS writer 
> > than required :-). I think the best way to deal with MTRR is just to 
> > not touch it. Leave it as it is and do not try to assume that they are 
> > correct, as frequently they will not be.
> 
> i'd suggest the following strategy on PAT-capable CPUs:
> 
>  - do not try to write MTRRs. Ever.
> 
>  - _read_ the current MTRR settings (including the default MTRR) and 
>check them against the e820 map. I can see two basic types of 
>mismatches:
> 
>  - RAM area marked fine in e820 but marked UC by MTRR: this 
>currently results in a slow system.

Time to resurrect Jesse's old patch
i386-trim-memory-not-covered-by-wb-mtrrs.patch (which was in -mm some time back).

>(NOTE: UC- would be fine and 
>overridable by PAT, hence it's not a conflict we should detect.)

UC- can't be specified by MTRRs.

>  - mmio area marked cacheable in the MTRR (results in broken system)

PAT can help specify the UC/WC attribute here.

thanks,
suresh
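
A sketch of what the referenced trim patch does conceptually (toy e820 and MTRR values, assumed for illustration): any e820 RAM that is not covered by a WB MTRR would otherwise run uncached, so it is clipped from the usable range instead.

#include <stdio.h>

struct range { unsigned long long start, end; };

/* Toy inputs: e820 RAM ranges and the highest address covered by WB MTRRs. */
static const struct range e820_ram[] = {
	{ 0x0ULL,      0x9f000ULL    },
	{ 0x100000ULL, 0x80000000ULL },	/* 1MB .. 2GB */
};
static const unsigned long long wb_mtrr_end = 0x7ff00000ULL;	/* short of 2GB */

static void trim_ram_to_wb_coverage(void)
{
	for (unsigned int i = 0; i < sizeof(e820_ram) / sizeof(e820_ram[0]); i++)
		if (e820_ram[i].end > wb_mtrr_end)
			printf("trim RAM %llx-%llx back to %llx (would be UC)\n",
			       e820_ram[i].start, e820_ram[i].end, wb_mtrr_end);
}

int main(void)
{
	trim_ram_to_wb_coverage();
	return 0;
}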


Re: Analysis of sched_mc_power_savings

2008-01-08 Thread Siddha, Suresh B
On Tue, Jan 08, 2008 at 11:08:15PM +0530, Vaidyanathan Srinivasan wrote:
> Hi,
> 
> The following experiments were conducted on a two socket dual core
> intel processor based machine in order to understand the impact of
> sched_mc_power_savings scheduler heuristics.

Thanks for these experiments. sched_mc_power_savings is mainly for
reasonably long-running (so that it can be caught by the periodic load
balancer), light loads (when the number of tasks < the number of logical
cpus). This tunable will minimize the number of packages carrying load,
enabling the other, completely idle packages to go to the deepest available
P and C states.

> 
> The scheduler heuristics for multi core system
> /sys/devices/system/cpu/sched_mc_power_savings should ideally extend
> the cpu tickless idle time at least on a few CPUs in an SMP machine.

Not really. Ideally sched_mc_power_savings redistributes the load (in
the scenarios mentioned above) to different logical cpus in the system, so
as to minimize the number of busy packages in the system.

> Experiment 1:
> -
...
> Observations with sched_mc_power_savings=1:
> 
> * No major impact of sched_mc_power_savings on CPU0 and CPU1
> * Significant idle time improvement on CPU2
> * However, significant idle time reduction on CPU3

In your setup, are CPU0 and 3 the core siblings? If so, this is probably
expected. Previously there was some load distribution happening between
packages. With sched_mc_power_savings, load is now getting distributed
between cores in the same package. But on my systems, typically CPU0 and 2
are core siblings. I will let you confirm your system topology before we
come to conclusions.

> Experiment 2:
> -
...
> 
> Observations with sched_mc_power_savings=1:
> 
> * No major impact of sched_mc_power_savings on CPU0 and CPU1
> * Good idle time improvement on CPU2 and CPU3

I would have expected similar results to experiment 1. The CPU3 data
seems almost the same with and without sched_mc_power_savings. At least the
data is not as significantly different as in other cases like CPU2, for example.

> Please review the experiment and comment on how the effectiveness of
> sched_mc_power_savings can be analysed.  

While we should see either no change or a simple redistribution (to minimize
busy packages) in your experiments, for evaluating sched_mc_power_savings
we can also use a lightly loaded system (kernel compilation or specjbb with
2 threads on a DP with dual-core, for example) and see how the load is
distributed with and without MC power savings.

Similarly, it will be interesting to see how this data varies with and without
tickless.

For some loads which can't be caught by the periodic load balancer, we may
not see any difference. But at least we should not see any scheduling
anomalies.

thanks,
suresh


Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

2007-12-14 Thread Siddha, Suresh B
On Fri, Dec 14, 2007 at 01:10:39PM -0800, Siddha, Suresh B wrote:
> On Thu, Dec 13, 2007 at 09:23:26PM -0700, Eric W. Biederman wrote:
> > [EMAIL PROTECTED] (Eric W. Biederman) writes:
> > Ok.  My analysis here was wrong.  Currently pgprot_noncached and
> > ioremap_nocache are out of sync.  With ioremap_nocache only specifying
> > _PAGE_PCD and pgprot_noncached specifying _PAGE_PCD | _PAGE_PWT.
> > 
> > So I don't have a clue how someone could reprogram the mtrrs currently
> > and expect things to work.
> > 
> > ...
> > 
> > If we bother to ask ioremap for memory that is not cached, the last
> > thing in the world we want is the MTRRs upgrading that to write combining.
> > So ioremap_nocache has been slightly buggy for ages.  ioremap_nocache
> > and PAGE_KERNEL_NOCACHE should get _PAGE_PWT added to their
> > definitions.
> > 
> > Could we please get a cleanup patch at the beginning of this patchset
> > or that comes before it that fixes ioremap_nocache on x86?
> > 
> > That will make us a lot more git-bisect safe.
> 
> Ok. I will send a separate patch  fixing ioremap_nocache on x86.

Appended the patch. x86 folks, please consider for x86 mm git tree. Thanks.

---
[patch] x86: Set strong uncacheable where UC is really desired

Also use _PAGE_PWT for all the mappings which need an uncached mapping. Instead of
the existing PAT2, which is UC- (and can be overridden by MTRRs), we now use PAT3,
which is strong uncacheable.

This makes it consistent with pgprot_noncached()

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/arch/x86/mm/ioremap_32.c b/arch/x86/mm/ioremap_32.c
index 0b27831..ef0f6a4 100644
--- a/arch/x86/mm/ioremap_32.c
+++ b/arch/x86/mm/ioremap_32.c
@@ -119,7 +119,7 @@ EXPORT_SYMBOL(__ioremap);
 void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
unsigned long last_addr;
-   void __iomem *p = __ioremap(phys_addr, size, _PAGE_PCD);
+   void __iomem *p = __ioremap(phys_addr, size, _PAGE_PCD | _PAGE_PWT);
if (!p) 
return p; 
 
diff --git a/arch/x86/mm/ioremap_64.c b/arch/x86/mm/ioremap_64.c
index 6cac90a..8be3062 100644
--- a/arch/x86/mm/ioremap_64.c
+++ b/arch/x86/mm/ioremap_64.c
@@ -158,7 +158,7 @@ EXPORT_SYMBOL(__ioremap);
 
 void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
-   return __ioremap(phys_addr, size, _PAGE_PCD);
+   return __ioremap(phys_addr, size, _PAGE_PCD | _PAGE_PWT);
 }
 EXPORT_SYMBOL(ioremap_nocache);
 
diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
index ed3e70d..b1215e1 100644
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -156,7 +156,7 @@ void paging_init(void);
 extern unsigned long long __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
 #define __PAGE_KERNEL_RO   (__PAGE_KERNEL & ~_PAGE_RW)
 #define __PAGE_KERNEL_RX   (__PAGE_KERNEL_EXEC & ~_PAGE_RW)
-#define __PAGE_KERNEL_NOCACHE  (__PAGE_KERNEL | _PAGE_PCD)
+#define __PAGE_KERNEL_NOCACHE  (__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT)
 #define __PAGE_KERNEL_LARGE(__PAGE_KERNEL | _PAGE_PSE)
 #define __PAGE_KERNEL_LARGE_EXEC   (__PAGE_KERNEL_EXEC | _PAGE_PSE)
 
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
index 9b0ff47..4e4dcc4 100644
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -185,13 +185,13 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, unsigned long
 #define __PAGE_KERNEL_EXEC \
(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED)
 #define __PAGE_KERNEL_NOCACHE \
-   (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_PCD | _PAGE_ACCESSED | _PAGE_NX)
+   (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_PCD | _PAGE_PWT | _PAGE_ACCESSED | _PAGE_NX)
 #define __PAGE_KERNEL_RO \
(_PAGE_PRESENT | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_NX)
 #define __PAGE_KERNEL_VSYSCALL \
(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
 #define __PAGE_KERNEL_VSYSCALL_NOCACHE \
-   (_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED | _PAGE_PCD)
+   (_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED | _PAGE_PCD | _PAGE_PWT)
 #define __PAGE_KERNEL_LARGE \
(__PAGE_KERNEL | _PAGE_PSE)
 #define __PAGE_KERNEL_LARGE_EXEC \
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 00/12] PAT 64b: PAT support for X86_64

2007-12-14 Thread Siddha, Suresh B
On Fri, Dec 14, 2007 at 12:28:25AM +, Dave Airlie wrote:
> Yes, the main use for GPUs is to have RAM pages mapped WC, and placed into 
> a GART on the GPU side, currently for Intel IGD we are okay as the CPU can 
> access the GPU GART aperture, but other chips exist where this isn't 
> possible, I think poulsbo and possible some of the AMD IGPs..

OK. So how is it working today on these platforms with no PAT support?
Do open source drivers use UC or WB on these platforms? As this RAM is not
contiguous, one can't use MTRRs to set WC, right?

Well, if WC is needed for RAM, then we have to address it too.

thanks,
suresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 06/12] PAT 64b: Add ioremap_wc support

2007-12-14 Thread Siddha, Suresh B
On Thu, Dec 13, 2007 at 09:48:36PM -0700, Eric W. Biederman wrote:
> Roland Dreier <[EMAIL PROTECTED]> writes:
> 
> >  > > Also I didn't see anything like pgprot_wc() in the patchset (although
> >
> >  > pgprot_writcombined.
> >
> > Oh I see it now (pgprot_writecombine() actually).
> >
> > However the same comment as before applies: there needs to be a
> > fallback to pgprot_noncached() for all other architectures so that
> > drivers can actually use it in a sane way.
> 
> Sounds reasonable.

We will fix it in next rev.

> It should be the conflict checking that is the actual bottleneck.
> The rest is just gravy.

Yes. We are looking for comments on our proposal to track the
reserved/non-reserved regions somewhat differently.
This is the critical issue which has been holding off PAT for years now...

<snip from the other mails>

Change x86_64 identity map to only map non-reserved memory. This helps
to handle UC/WC mappings of reserved regions in a much simpler manner
(we don't have to do cpa any more, and as such need not keep track of the actual
reference counts. We still track all the usages to keep the mappings
consistent. We just avoid the headache of splitting mattr regions for
managing ref counts for every individual usage of the reserved area).

For now, we don't track RAM pages using the memattr infrastructure. This is
because the memattr infrastructure is not enough, i.e., while the page is being
tracked using the memattr infrastructure, the page can potentially get
freed (a bug that we need to catch, to avoid attribute aliasing).
For example, a driver does ioremap_uc and an application maps the
same page using /dev/mem. When the driver does iounmap and frees the page,
the /dev/mem mapping is still live and we run into an aliasing issue.

Can we use the existing page struct to keep track of the attribute
and usage? /dev/mem mappings can then increment the page ref count and not
allow the page to be freed while the /dev/mem mappings are active. And allow
/dev/mem to map only those pages which are marked reserved (which the driver
does before doing iomap).

Or, when a WB mapping through /dev/mem is active, don't allow any driver
to map the page as UC. Can we do this tracking for RAM pages through
struct page? Or are there any issues we should keep in mind?

thanks,
suresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

2007-12-14 Thread Siddha, Suresh B
On Thu, Dec 13, 2007 at 09:23:26PM -0700, Eric W. Biederman wrote:
> [EMAIL PROTECTED] (Eric W. Biederman) writes:
> Ok.  My analysis here was wrong.  Currently pgprot_noncached and
> ioremap_nocache are out of sync.  With ioremap_nocache only specifying
> _PAGE_PCD and pgprot_noncached specifying _PAGE_PCD | _PAGE_PWT.
> 
> So I don't have a clue how someone could reprogram the mtrrs currently
> and expect things to work.
> 
> ...
> 
> If we bother to ask ioremap for memory that is not cached, the last
> thing in the world we want is the MTRRs upgrading that to write combining.
> So ioremap_nocache has been slightly buggy for ages.  ioremap_nocache
> and PAGE_KERNEL_NOCACHE should get _PAGE_PWT added to their
> definitions.
> 
> Could we please get a cleanup patch at the beginning of this patchset
> or that comes before it that fixes ioremap_nocache on x86?
> 
> That will make us a lot more git-bisect safe.

Ok. I will send a separate patch  fixing ioremap_nocache on x86.

thanks,
suresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

2007-12-14 Thread Siddha, Suresh B
On Thu, Dec 13, 2007 at 08:48:45PM -0700, Eric W. Biederman wrote:
> > +   pat = PAT(0,WB) | PAT(1,WT) | PAT(2,UC_MINUS) | PAT(3,WC) |
> > + PAT(4,WB) | PAT(5,WT) | PAT(6,UC_MINUS) | PAT(7,WC);
> 
> I strongly object to this configuration.
> 
> The caching modes of interest are:
> PAT_WB write-back or a close as the MTRRs will allow
>used for WC today.
> PAT_UC completely uncachable not overridable by MTRRs 
>and what we use today for pgprot_noncached
> PAT_WC what isn't available for current use.
>
> We should use:
> > +   pat = PAT(0,WB) | PAT(1,WT) | PAT(2,WC) | PAT(3,UC) |
> > + PAT(4,WB) | PAT(5,WT) | PAT(6,WC) | PAT(7,UC);
> 
> Changing the UC- which currently allows write-combining if the MTRRs specify 
> it,
> to WC.  This grandfathers in all of our current usage and changes the one
> PAT type that could today and in legacy mode specify WC to really specify WC.

That seems reasonable. But looking at the mainline kernel, ioremap_nocache()
actually uses UC_MINUS. I wonder why it is not using UC (like
pgprot_noncached). I think it is OK to change ioremap_nocache() to use UC.
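
For reference, which of the eight PAT MSR entries a mapping ends up using is
selected by the PTE's PAT/PCD/PWT bits. A minimal sketch of that index
calculation (the constant and function names below are illustrative; only the
bit positions and the PAT layout quoted above are assumed):

/* index = (PAT << 2) | (PCD << 1) | PWT, per the x86 4K-PTE format */
enum { PTE_BIT_PWT = 3, PTE_BIT_PCD = 4, PTE_BIT_PAT = 7 };

static unsigned int pat_entry_index(unsigned long pte)
{
        unsigned int pwt = (pte >> PTE_BIT_PWT) & 1;
        unsigned int pcd = (pte >> PTE_BIT_PCD) & 1;
        unsigned int pat = (pte >> PTE_BIT_PAT) & 1;

        return (pat << 2) | (pcd << 1) | pwt;
}

So _PAGE_PCD alone selects entry 2 (UC- in the layout quoted above), while
_PAGE_PCD | _PAGE_PWT selects entry 3 (strong UC).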

thanks,
suresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: What was the problem with quicklists and x86-64?

2007-12-13 Thread Siddha, Suresh B
On Thu, Dec 13, 2007 at 11:47:29AM -0800, Christoph Lameter wrote:
> On Wed, 12 Dec 2007, Jeremy Fitzhardinge wrote:
> 
> > I'm looking at unifying the various pgalloc+pgd_lists mechanisms between
> > 32-bit (PAE and non-PAE) and 64-bit, so I'm trying to understand why
> > these differences exist in the first place.
> > 
> > Change da8f153e51290e7438ba7da66234a864e5d3e1c1 reverted the use of
> > quicklists for allocating pagetables, because of concerns about ordering
> > with respect to tlb flushes.
> 
> These issues only exist with NUMA because of the freeing of off node pages 
> before the TLB flush was done. There was a discussion about this issue and 
> my fix of simply not freeing the offnode pages early was ignored. Instead 
> the x86_64 implementation (which has no problems that I know of) was 

The NUMA bug might not be the only problem. I think there are more issues,
as Linus noticed.

<snip>
Oh, and I see what's wrong: you not only switched "free_page()" to 
"quicklist_free()", you *also* switched "tlb_remove_page()" to 
"quicklist_free()".
</snip>

The above comment is in reference to the below portion of code:

-#define __pte_free_tlb(tlb,pte) tlb_remove_page((tlb),(pte))
+#define __pte_free_tlb(tlb,pte) quicklist_free_page(QUICK_PT, NULL,(pte))

tlb_remove_page() was marking tlb->need_flush, which is later used
by tlb_flush_mmu(). With quicklist_free_page() we lose all that.

Now, in a corner-case scenario with a big munmap() which calls unmap_region():
if it so happens that the region getting unmapped just has page
tables set up but with all PTEs set to NULL, unmap_region() may
potentially free the page table pages
[ tlb_finish_mmu() calls check_pgt_cache(), which trims the quicklist ]
without flushing the TLBs
[ tlb_finish_mmu() calls tlb_flush_mmu(), but it will not do
   much as need_flush is not set ].

Similarly, Linus brought up preemption issues associated with quicklist usage.
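
To spell out the ordering that the tlb_remove_page()/tlb_flush_mmu() pair
guarantees and that the quicklist path skips, here is a minimal sketch; the
types and helper names are illustrative stand-ins, not the real mmu_gather code:

struct page;                              /* opaque here */
void flush_tlb_sketch(void);              /* stand-in for the real TLB flush */
void free_page_sketch(struct page *page); /* stand-in for the real free      */

struct mmu_gather_sketch {
        int need_flush;
        int nr;
        struct page *pages[64];
};

static void pte_free_tlb_sketch(struct mmu_gather_sketch *tlb, struct page *pte_page)
{
        tlb->need_flush = 1;                       /* what tlb_remove_page() records   */
        if (tlb->nr < 64)                          /* a real batch would flush if full */
                tlb->pages[tlb->nr++] = pte_page;  /* defer the actual free            */
}

static void tlb_flush_mmu_sketch(struct mmu_gather_sketch *tlb)
{
        if (!tlb->need_flush)
                return;                            /* quicklist_free_page() never set it */
        flush_tlb_sketch();                        /* 1) invalidate stale translations   */
        while (tlb->nr)                            /* 2) only now free the pages         */
                free_page_sketch(tlb->pages[--tlb->nr]);
        tlb->need_flush = 0;
}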

thanks,
suresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: What was the problem with quicklists and x86-64?

2007-12-13 Thread Siddha, Suresh B
On Wed, Dec 12, 2007 at 11:14:41AM -0800, Jeremy Fitzhardinge wrote:
> I'm looking at unifying the various pgalloc+pgd_lists mechanisms between
> 32-bit (PAE and non-PAE) and 64-bit, so I'm trying to understand why
> these differences exist in the first place.
> 
> Change da8f153e51290e7438ba7da66234a864e5d3e1c1 reverted the use of
> quicklists for allocating pagetables, because of concerns about ordering
> with respect to tlb flushes.

The main thing to note is that while unmapping a linear address space, we should not
free the underlying pages or page table pages (for all hierarchies) till we flush
the active TLB caches on all the CPUs.

Violation of this can potentially cause SW failures and hard to debug issues
like
http://www.ussg.iu.edu/hypermail/linux/kernel/0205.2/1254.html

For more info, you can refer 
http://developer.intel.com/design/processor/applnots/317080.pdf

> Some questions about this:
> 
>1. What's the difference between quicklists and normal page
>   allocation with respect to tlb flushing?

Linus and Christoph went in detail about this.

http://kerneltrap.org/mailarchive/linux-kernel/2007/9/21/271638

There were some preemption and other concerns that Linus mentioned;
he was referring to integrating quicklists and mmu_gather operations
rather than keeping them separate.

>2. Why doesn't this also affect i386's use of quicklists?
>3. Is there some way to resolve this tlb interaction so that
>   quicklists can be used?
>4. Failing that, what's the cost of reverting i386's use of
>   quicklists too?

I have to look at the i386 code, but I don't think it was using quicklists
as extensively as the previous x86_64 code did. Will take a look at it.

thanks,
suresh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] x86, ptrace: support for branch trace store(BTS)

2007-11-13 Thread Siddha, Suresh B
Support for Intel's last branch recording to ptrace. This gives debuggers
access to this hardware feature and allows them to show an execution trace
of the debugged application.

Last branch recording (see section 18.5 in the Intel 64 and IA-32
Architectures Software Developer's Manual) allows taking an execution
trace of the running application without instrumentation. When a branch
is executed, the hardware logs the source and destination address in a
cyclic buffer given to it by the OS.

This can be a great debugging aid. It shows you how exactly you got
where you currently are without requiring you to do lots of single
stepping and rerunning.

This patch manages the various buffers, configures the trace
hardware, disentangles the trace, and provides a user interface via
ptrace. The high-level design is as follows:
- there is one optional trace buffer per thread_struct
- upon a context switch, the trace hardware is reconfigured to either
  disable tracing or to use the appropriate buffer for the new task.
  - tracing induces ~20% overhead as branch records are sent out on
the bus. 
  - the hardware collects trace per processor. To disentangle the
traces for different tasks, we use separate buffers and reconfigure
the trace hardware.
- the internal buffer interpretation as well as the corresponding
  operations are selected at run-time by hardware detection
  - different processors use different branch record formats

Opens:
- warnings due to code sharing between i386 and x86_64
- support for more processors (to come)
- ptrace command numbers (just picked some numbers randomly)

Signed-off-by: Markus Metzger <[EMAIL PROTECTED]>
Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

Index: linux-2.6/arch/x86/kernel/process_32.c
===
--- linux-2.6.orig/arch/x86/kernel/process_32.c 2007-11-07 09:08:32.%N +0100
+++ linux-2.6/arch/x86/kernel/process_32.c  2007-11-07 09:09:36.%N +0100
@@ -735,6 +735,18 @@
__switch_to_xtra(prev_p, next_p, tss);
 
/*
+* Last branch recording reconfiguration of trace hardware and
+* disentangling of trace data per task.
+*/
+   if (unlikely(task_thread_info(prev_p)->flags & _TIF_BTS_TRACE ||
+task_thread_info(prev_p)->flags & _TIF_BTS_TRACE_TS))
+   ptrace_bts_task_departs(prev_p);
+
+   if (unlikely(task_thread_info(next_p)->flags & _TIF_BTS_TRACE ||
+task_thread_info(next_p)->flags & _TIF_BTS_TRACE_TS))
+   ptrace_bts_task_arrives(next_p);
+
+   /*
 * Leave lazy mode, flushing any hypercalls made here.
 * This must be done before restoring TLS segments so
 * the GDT and LDT are properly updated, and must be
Index: linux-2.6/arch/x86/kernel/ptrace_32.c
===
--- linux-2.6.orig/arch/x86/kernel/ptrace_32.c  2007-11-07 09:08:32.%N +0100
+++ linux-2.6/arch/x86/kernel/ptrace_32.c   2007-11-07 09:09:36.%N +0100
@@ -24,6 +24,7 @@
 #include <asm/debugreg.h>
 #include <asm/ldt.h>
 #include <asm/desc.h>
+#include <asm/ptrace_bts.h>
 
 /*
  * does not yet catch signals sent when the child dies.
@@ -274,6 +275,7 @@
 { 
clear_singlestep(child);
clear_tsk_thread_flag(child, TIF_SYSCALL_EMU);
+   ptrace_bts_task_detached(child);
 }
 
 /*
@@ -610,6 +612,37 @@
(struct user_desc __user *) data);
break;
 
+   case PTRACE_BTS_INIT:
+   ret = ptrace_bts_init();
+   break;
+
+   case PTRACE_BTS_ALLOCATE_BUFFER:
+   ret = ptrace_bts_allocate_bts(child, data);
+   break;
+
+   case PTRACE_BTS_GET_BUFFER_SIZE:
+   ret = ptrace_bts_get_buffer_size(child);
+   break;
+
+   case PTRACE_BTS_GET_INDEX:
+   ret = ptrace_bts_get_index(child);
+   break;
+
+   case PTRACE_BTS_READ_RECORD:
+   ret = ptrace_bts_read_record
+   (child, data,
+(struct ptrace_bts_record __user *) addr);
+   break;
+
+   case PTRACE_BTS_CONFIG:
+   ptrace_bts_config(child, data);
+   ret = 0;
+   break;
+
+   case PTRACE_BTS_STATUS:
+   ret = ptrace_bts_status(child);
+   break;
+
default:
ret = ptrace_request(child, request, addr, data);
break;
Index: linux-2.6/include/asm-x86/ptrace-abi.h
===
--- linux-2.6.orig/include/asm-x86/ptrace-abi.h 2007-11-07 09:08:32.%N +0100
+++ linux-2.6/include/asm-x86/ptrace-abi.h  2007-11-07 09:09:36.%N +0100
@@ -78,4 +78,54 @@
 # define PTRACE_SYSEMU_SINGLESTEP 32
 #endif
 
+
+/* Initialize bts tracing. Needs to be called once.
+   Return the type of tracing available */
+#define PTRACE_BTS_INIT 40
+
+/* No trace hardware available or trace hardware 


[patch] x86: fix taking DNA during 64bit sigreturn

2007-11-11 Thread Siddha, Suresh B
restore sigcontext is taking a DNA exception while restoring the FP context from
the user stack during sigreturn. The appended patch fixes it by doing clts()
if the app doesn't touch the FPU during signal handler execution. This
stops the DNA exception from being generated during the fxrstor in sigreturn.

This improves 64-bit lat_sig numbers by ~30% on my core2 platform.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/arch/x86/kernel/i387_64.c b/arch/x86/kernel/i387_64.c
index 56c1f11..bfaff28 100644
--- a/arch/x86/kernel/i387_64.c
+++ b/arch/x86/kernel/i387_64.c
@@ -92,13 +92,14 @@ int save_i387(struct _fpstate __user *buf)
if (task_thread_info(tsk)->status & TS_USEDFPU) {
err = save_i387_checking((struct i387_fxsave_struct __user *)buf);
if (err) return err;
+   task_thread_info(tsk)->status &= ~TS_USEDFPU;
stts();
-   } else {
-   if (__copy_to_user(buf, &tsk->thread.i387.fxsave, 
+   } else {
+   if (__copy_to_user(buf, &tsk->thread.i387.fxsave,
   sizeof(struct i387_fxsave_struct)))
return -1;
-   } 
-   return 1;
+   }
+   return 1;
 }
 
 /*
diff --git a/include/asm-x86/i387_64.h b/include/asm-x86/i387_64.h
index 0217b74..3a4ffba 100644
--- a/include/asm-x86/i387_64.h
+++ b/include/asm-x86/i387_64.h
@@ -203,6 +203,11 @@ static inline void save_init_fpu(struct task_struct *tsk)
  */
 static inline int restore_i387(struct _fpstate __user *buf)
 {
+   set_used_math();
+   if (!(task_thread_info(current)->status & TS_USEDFPU)) {
+   clts();
+   task_thread_info(current)->status |= TS_USEDFPU;
+   }
return restore_fpu_checking((__force struct i387_fxsave_struct *)buf);
 }
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [stable] 2.6.23 boot failures on x86-64.

2007-10-29 Thread Siddha, Suresh B
On Mon, Oct 29, 2007 at 11:37:40AM -0700, Linus Torvalds wrote:
> 
> 
> On Mon, 29 Oct 2007, Greg KH wrote:
> > 
> > I'll be glad to revert it in -stable, if it's also reverted in Linus's
> > tree first :)
> 
> We've had some changes since 2.6.23, and afaik, the 
> "alloc_bootmem_high_node()" code is alreadt effectively dead there. It's 
> only called if CONFIG_SPARSEMEM_VMEMMAP is *not* enabled, and I *think* we 
> enable it by force on x86-64 these days.

If so, we (Nanhai and myself) will take a look at the VMEMMAP changes and see if
the bug that commit 2e1c49db4c640b35df13889b86b9d62215ade4b6 tries to fix is
still open in the latest git.

But I can't explain how 2e1c49db4c640b35df13889b86b9d62215ade4b6 can be
the root cause of Dave's issue in 2.6.23.

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: fix improper load balance across sched domain

2007-10-16 Thread Siddha, Suresh B
On Tue, Oct 16, 2007 at 12:07:06PM -0700, Ken Chen wrote:
> We recently discovered a nasty performance bug in the kernel CPU load
> balancer where we were hit by 50% performance regression.
> 
> When tasks are assigned to a subset of CPUs that span across
> sched_domains (either ccNUMA node or the new multi-core domain) via
> cpu affinity, kernel fails to perform proper load balance at
> these domains, due to several logic in find_busiest_group() miss
> identified busiest sched group within a given domain. This leads to
> inadequate load balance and causes 50% performance hit.
> 
> To give you a concrete example, on a dual-core, 2 socket numa system,
> there are 4 logical cpu, organized as:

Oops, this issue can easily happen when cores are not sharing caches. I
think this is what's happening on your setup, right?

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] x86_64, vsyscall: fix the oops crash with __pa_vsymbol()

2007-10-09 Thread Siddha, Suresh B
The appended patch fixes an oops while changing the vsyscall sysctl.
I am sure no one tested this code before it was integrated into mainline :(

BTW, using ioremap() in vsyscall_sysctl_change() to get the virtual
address of a kernel symbol sounds like overkill. I wonder if we
can define a simple __va_vsymbol() which directly returns the
kernel direct mapping. The comment in the code which says gcc has trouble
with __va(__pa()) sounds bogus to me; __pa() on a vsyscall address will
not work anyhow :(

Also, the whole nop-out-the-syscall-in-the-vsyscall-page infrastructure
(vsyscall_sysctl_change()) was added to make some attacks difficult,
and yet I don't see this nop-out being done by default. Does this area
require more cleanups?

thanks,
suresh
---

Fix an oops with __pa_vsymbol(). VSYSCALL_FIRST_PAGE is a fixmap index.
We want the starting virtual address of the vsyscall page and not the index.

Reported-by: Yanmin Zhang <[EMAIL PROTECTED]>
Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/arch/x86_64/kernel/vsyscall.c b/arch/x86_64/kernel/vsyscall.c
index 06c3494..9dfbad5 100644
--- a/arch/x86_64/kernel/vsyscall.c
+++ b/arch/x86_64/kernel/vsyscall.c
@@ -50,7 +50,7 @@
({unsigned long v;  \
extern char __vsyscall_0;   \
  asm("" : "=r" (v) : "0" (x)); \
- ((v - VSYSCALL_FIRST_PAGE) + __pa_symbol(&__vsyscall_0)); })
+ ((v - VSYSCALL_START) + __pa_symbol(&__vsyscall_0)); })
 
 /*
  * vsyscall_gtod_data contains data that is :
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: SLUB performance regression vs SLAB

2007-10-04 Thread Siddha, Suresh B
On Thu, Oct 04, 2007 at 12:05:35PM -0700, Christoph Lameter wrote:
> > > Was the page allocator pass through patchset 
> > > separately applied as I requested?
> > 
> > I don't believe so.  Suresh?
> 
> If it was a git pull then the pass through was included and never taken 
> out.

It was a git pull from the performance branch that you pointed out earlier
http://git.kernel.org/?p=linux/kernel/git/christoph/slab.git;a=log;h=performance

and the config is based on the EL5 config with just SLUB turned on.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: MTRR initialization

2007-09-21 Thread Siddha, Suresh B
On Fri, Sep 14, 2007 at 09:33:30AM -0700, Howard Chu wrote:
> Hi, was wondering if anyone else has been tripped up by this... I've got
> 4GB of
> RAM in my Asus A8V Deluxe and memory hole mapping enabled in the BIOS. By
> default, my system boots up with these MTRR settings:
> 
> reg00: base=0x (   0MB), size=4096MB: write-back, count=1
> reg01: base=0x1 (4096MB), size=1024MB: write-back, count=1
> reg02: base=0xc000 (3072MB), size=1024MB: uncachable, count=1
> reg03: base=0xc000 (3072MB), size= 256MB: write-combining, count=1
> 
> The X server and various other programs try to add a mapping for my video
> card's buffer, at 0xd000, size=256MB, type=write-combining, and this
> always
> fails with a type mismatch error (old type is write-back). Apparently it's
> conflicting with mapping register 0. I can't just disable the existing
> settings
> and re-add them; the system hangs soon after disabling reg01.
> 
> I guess the kernel must be getting the initial setup from the BIOS. I've
> hacked
> around this in mtrr/generic.c by explicitly changing the MTRR state in
> get_mtrr_state to split the first mapping into two; one at base 0 size
> 2048M
> and one at base 2048M size 1024M. So now I have this, which is pretty much
> what
> I wanted:
> 
> reg00: base=0x (   0MB), size=2048MB: write-back, count=1
> reg01: base=0x8000 (2048MB), size=1024MB: write-back, count=1
> reg02: base=0x1 (4096MB), size=1024MB: write-back, count=1
> reg03: base=0xc000 (3072MB), size=1024MB: uncachable, count=1
> reg04: base=0xc000 (3072MB), size= 256MB: write-combining, count=1
> reg05: base=0xd000 (3328MB), size= 256MB: write-combining, count=1

BTW, having overlapping WC and UC regions makes the end result UC. So in this
case, you may not be getting the desired performance.
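
For reference, the overlap rules for variable-range MTRRs boil down to: UC wins
over everything, WB combined with WT degrades to WT, and other combinations are
undefined. A small sketch (the type values follow the architectural MTRR
encodings; the function name is made up for illustration):

enum mtrr_type { MTRR_UC = 0, MTRR_WC = 1, MTRR_WT = 4, MTRR_WP = 5, MTRR_WB = 6 };

static int effective_overlap_type(enum mtrr_type a, enum mtrr_type b)
{
        if (a == b)
                return a;
        if (a == MTRR_UC || b == MTRR_UC)
                return MTRR_UC;            /* e.g. the WC + UC case above */
        if ((a == MTRR_WB && b == MTRR_WT) || (a == MTRR_WT && b == MTRR_WB))
                return MTRR_WT;
        return -1;                         /* undefined combination */
}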

> 
> So the question is - was there an easier/correct way to do this?
> 
> It might have been nice if the MTRR ioctls allowed the register number to
> be
> specified on the Set commands, though I'm not sure that would have helped
> in
> this case.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


x86_64: potential critical issue with quicklists and page table pages

2007-09-21 Thread Siddha, Suresh B
git commit 34feb2c83beb3bdf13535a36770f7e50b47ef299 started using quicklists
for freeing page table pages and removed the usage of tlb_remove_page()

And looking at quicklist_free() and quicklist_free_page(), on a NUMA platform,
this can potentially free the page before the corresponding TLB caches
are flushed.

Essentially the quicklist free routines are doing something like:

__quicklist_free()
        ...
        if (unlikely(nid != numa_node_id())) {
                __free_page(page);
                ...
        }




Now this can potentially cause a problem if a CPU in some other node starts
using this page while the corresponding TLB entries are still alive
on the original CPU, which is still freeing the page table pages.

This violates the guideline documented in
http://developer.intel.com/design/processor/applnots/317080.pdf

This potentially can cause SW failures and hard to debug issues like
http://www.ussg.iu.edu/hypermail/linux/kernel/0205.2/1254.html

Can we revert this commit for 2.6.23 and look at this code post 2.6.23?

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


x86_64: potential critical issue with quicklists and page table pages

2007-09-21 Thread Siddha, Suresh B
git commit 34feb2c83beb3bdf13535a36770f7e50b47ef299 started using quicklists
for freeing page table pages and removed the usage of tlb_remove_page()

And looking at quicklist_free() and quicklist_free_page(), on a NUMA platform,
this can potentially free the page before the corresponding TLB caches
are flushed.

Essentially quicklist free routines are doing something like
__quicklist_free()
...
if (unlikely(nid != numa_node_id())) {
__free_page(page);
...
}




Now this will potentially cause a problem, if a cpu in someother node starts
using this page, while the corresponding TLB entries are still alive
in the original cpu which is still freeing the page table pages.

This violates the guideline documented in
http://developer.intel.com/design/processor/applnots/317080.pdf

This potentially can cause SW failures and hard to debug issues like
http://www.ussg.iu.edu/hypermail/linux/kernel/0205.2/1254.html

Can we revert this commit for 2.6.23 and look at this code post 2.6.23?

thanks,
suresh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: MTRR initialization

2007-09-21 Thread Siddha, Suresh B
On Fri, Sep 14, 2007 at 09:33:30AM -0700, Howard Chu wrote:
 Hi, was wondering if anyone else has been tripped up by this... I've got
 4GB of
 RAM in my Asus A8V Deluxe and memory hole mapping enabled in the BIOS. By
 default, my system boots up with these MTRR settings:
 
 reg00: base=0x (   0MB), size=4096MB: write-back, count=1
 reg01: base=0x1 (4096MB), size=1024MB: write-back, count=1
 reg02: base=0xc000 (3072MB), size=1024MB: uncachable, count=1
 reg03: base=0xc000 (3072MB), size= 256MB: write-combining, count=1
 
 The X server and various other programs try to add a mapping for my video
 card's buffer, at 0xd000, size=256MB, type=write-combining, and this
 always
 fails with a type mismatch error (old type is write-back). Apparently it's
 conflicting with mapping register 0. I can't just disable the existing
 settings
 and re-add them; the system hangs soon after disabling reg01.
 
 I guess the kernel must be getting the initial setup from the BIOS. I've
 hacked
 around this in mtrr/generic.c by explicitly changing the MTRR state in
 get_mtrr_state to split the first mapping into two; one at base 0 size
 2048M
 and one at base 2048M size 1024M. So now I have this, which is pretty much
 what
 I wanted:
 
 reg00: base=0x (   0MB), size=2048MB: write-back, count=1
 reg01: base=0x8000 (2048MB), size=1024MB: write-back, count=1
 reg02: base=0x1 (4096MB), size=1024MB: write-back, count=1
 reg03: base=0xc000 (3072MB), size=1024MB: uncachable, count=1
 reg04: base=0xc000 (3072MB), size= 256MB: write-combining, count=1
 reg05: base=0xd000 (3328MB), size= 256MB: write-combining, count=1

BTW, having overlapping WC and UC regions makes the end result UC. So in this
case, you may not be getting the desired performance.

 
 So the question is - was there an easier/correct way to do this?
 
 It might have been nice if the MTRR ioctls allowed the register number to
 be
 specified on the Set commands, though I'm not sure that would have helped
 in
 this case.
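
For illustration, a minimal userspace sketch of adding a write-combining
range through the MTRR ioctl interface mentioned above; the base/size values
are made up and would need to match the actual frame buffer.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <asm/mtrr.h>           /* struct mtrr_sentry, MTRRIOC_ADD_ENTRY */

int main(void)
{
        struct mtrr_sentry s = {
                .base = 0xd0000000UL,           /* illustrative video memory base */
                .size = 0x10000000,             /* 256MB */
                .type = MTRR_TYPE_WRCOMB,
        };
        int fd = open("/proc/mtrr", O_WRONLY);

        if (fd < 0 || ioctl(fd, MTRRIOC_ADD_ENTRY, &s) < 0) {
                perror("mtrr add");             /* reports the type-mismatch failure described above */
                return 1;
        }
        close(fd);
        return 0;
}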
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [git] CFS-devel, group scheduler, fixes

2007-09-19 Thread Siddha, Suresh B
On Tue, Sep 18, 2007 at 11:03:59PM -0700, Tong Li wrote:
> This patch attempts to improve CFS's SMP global fairness based on the new 
> virtual time design.
> 
> Removed vruntime adjustment in set_task_cpu() as it skews global fairness.
> 
> Modified small_imbalance logic in find_busiest_group(). If there's small 
> imbalance, move tasks from busiest to local sched_group only if the local 
> group contains a CPU whose min_vruntime is the maximum among all CPUs in 
> the same sched_domain. This prevents any CPU from advancing too far ahead 
> in virtual time and avoids tasks thrashing between two CPUs without 
> utilizing other CPUs in the system. For example, for 10 tasks on 8 CPUs, 
> since the load is not evenly divisible by the number of CPUs, we want the 
> extra load to have a fair use of every CPU in the system.
> 
> Tested with a microbenchmark running 10 nice-0 tasks on 8 CPUs. Each task 

Just as an experiment, can you run 82 tasks on 8 CPUs. Current
imbalance_pct logic will not detect and fix the global fairness issue
even with this patch.
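
For illustration, a rough sketch of such an experiment: fork N pure cpu hogs
for a fixed time and compare how much cpu each one actually received (for
example by sampling /proc/<pid>/stat while they run). The task and time
counts are just placeholders.

/* Build: gcc -O2 hogs.c -o hogs ; run: ./hogs 82 60 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
        int n = argc > 1 ? atoi(argv[1]) : 82;          /* e.g. 82 tasks on 8 cpus */
        int secs = argc > 2 ? atoi(argv[2]) : 60;
        int i;

        for (i = 0; i < n; i++) {
                if (fork() == 0) {
                        alarm(secs);                    /* SIGALRM terminates the hog */
                        for (;;)
                                ;                       /* burn cpu */
                }
        }
        for (i = 0; i < n; i++)
                wait(NULL);                             /* collect the hogs */
        return 0;
}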

>   if (*imbalance < busiest_load_per_task) {
> - unsigned long tmp, pwr_now, pwr_move;
> - unsigned int imbn;
> -
>  small_imbalance:
> - pwr_move = pwr_now = 0;
> - imbn = 2;
> - if (this_nr_running) {
> - this_load_per_task /= this_nr_running;
> - if (busiest_load_per_task > this_load_per_task)
> - imbn = 1;
> - } else
> - this_load_per_task = SCHED_LOAD_SCALE;
> -
> - if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
> - busiest_load_per_task * imbn) {
> - *imbalance = busiest_load_per_task;
> - return busiest;
> - }

This patch removes quite a few lines and all this logic is not for
fairness :( Especially the above portion handles some of the HT/MC
optimizations.

> -
> - /*
> -  * OK, we don't have enough imbalance to justify moving 
> tasks,
> -  * however we may be able to increase total CPU power used by
> -  * moving them.
> + /* 
> +  * When there's small imbalance, move tasks only if this
> +  * sched_group contains a CPU whose min_vruntime is the 
> +  * maximum among all CPUs in the same domain.
>*/
> -
> - pwr_now += busiest->__cpu_power *
> - min(busiest_load_per_task, max_load);
> - pwr_now += this->__cpu_power *
> - min(this_load_per_task, this_load);
> - pwr_now /= SCHED_LOAD_SCALE;
> -
> - /* Amount of load we'd subtract */
> - tmp = sg_div_cpu_power(busiest,
> - busiest_load_per_task * SCHED_LOAD_SCALE);
> - if (max_load > tmp)
> - pwr_move += busiest->__cpu_power *
> - min(busiest_load_per_task, max_load - tmp);
> -
> - /* Amount of load we'd add */
> - if (max_load * busiest->__cpu_power <
> - busiest_load_per_task * SCHED_LOAD_SCALE)
> - tmp = sg_div_cpu_power(this,
> - max_load * busiest->__cpu_power);
> - else
> - tmp = sg_div_cpu_power(this,
> - busiest_load_per_task * SCHED_LOAD_SCALE);
> - pwr_move += this->__cpu_power *
> - min(this_load_per_task, this_load + tmp);
> - pwr_move /= SCHED_LOAD_SCALE;
> -
> - /* Move if we gain throughput */
> - if (pwr_move > pwr_now)
> + if (max_vruntime_group == this)
>   *imbalance = busiest_load_per_task;
> + else
> + *imbalance = 0;

Not sure how this all interacts when some of the cpu's are idle. I have to
look more closely.

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: tbench regression - Why process scheduler has impact on tbench and why small per-cpu slab (SLUB) cache creates the scenario?

2007-09-18 Thread Siddha, Suresh B
On Fri, Sep 14, 2007 at 12:51:34PM -0700, Christoph Lameter wrote:
> On Fri, 14 Sep 2007, Siddha, Suresh B wrote:
> > We are trying to get the latest data with 2.6.23-rc4-mm1 with and without
> > slub. Is this good enough?
>
> Good enough. If you are concerned about the page allocator pass through
> then you may want to test the page allocator pass through patchset
> separately. The fastpath of the page allocator is currently not
> competitive if you always free and allocate a single page. If contiguous
> pages are allocated then the pass through is superior.

We are having all sorts of stability issues with -mm kernels, let alone
perf testing :(

For now, we are trying to do slab Vs slub comparisons for the mainline kernels.
Let's see how that goes.

Meanwhile, any chance that you can point us at relevant recent patches/fixes
that are in -mm and perhaps that can be applied to mainline kernel?

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Fix BIOS-e820 end address

2007-09-14 Thread Siddha, Suresh B
On Fri, Sep 14, 2007 at 02:00:02PM -0700, Jeremy Fitzhardinge wrote:
> Keshavamurthy, Anil S wrote:
> > Subject: [patch] Fix BIOS-e820 end address
> >
> > --snip of boot message--
> > BIOS-provided physical RAM map:
> >  BIOS-e820:  - 000a (usable)
> >  BIOS-e820: 000f - 0010 (reserved)
> >  BIOS-e820: 0010 - 7fe8cc00 (usable)
> > end snip---
> >
> > As you see from above the address 0010 is both
> > shown as reserved and usable which is confusing.
> >   
> 
> I think this is consistent with many other kernel interfaces (such as
> /proc/X/maps) where the end address is taken to be exclusive: [0xf,
> 0x10).

Andrew, please disregard this patch. As Jeremy and Jan pointed out, this
will cause more confusion. Thanks. 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: tbench regression - Why process scheduler has impact on tbench and why small per-cpu slab (SLUB) cache creates the scenario?

2007-09-14 Thread Siddha, Suresh B
Christoph,

On Thu, Sep 13, 2007 at 11:03:53AM -0700, Christoph Lameter wrote:
> On Wed, 12 Sep 2007, Siddha, Suresh B wrote:
> 
> > Christoph, Not sure if you are referring to me or not here. But our
> > tests(atleast on with the database workloads) approx 1.5 months or so back
> > showed that on ia64 slub was on par with slab and on x86_64, slub was 9% 
> > down.
> > And after changing the slub min order and max order, slub perf on x86_64 is
> > down approx 3.5% or so compared to slab.
> 
> No, I was referring to another talk that I had at the OLS with Corey 
> Gough. I keep getting confusing information from Intel. Last I heard was 

Please don't go with informal talks and discussions. Please demand the numbers
and make decisions, conclusions based on those numbers. AFAIK, we haven't
posted confusing numbers so far.

> that IA64 had a regression and x86_64 was fine (but they were not allowed 
> to tell me details). Would you please straighten out your story and give 
> me details?

The numbers I posted in the previous e-mail are the only story we have so far.

> AFAIK the two of us discussed some issues related to object handover 
> between processors that cause cache line bouncing and I sent you a 
> patchset for testing but I did not get any feedback. The patches that were 

Sorry, these systems are huge and limited in availability. We are raising the
priority with the performance team to do the latest slub patch testing.

> discussed are now in mm.
>
> > While I don't rule out large sized allocations like PAGE_SIZE, I am mostly
> > certain that the critical allocations in this workload are not PAGE_SIZE
> > based.  Mostly they are in the range less than 300-500 bytes or so.
> > 
> > Any changes in the recent slub which takes the pressure away from the page
> > allocator especially for smaller page sized architectures? If so, we can
> > redo some of the experiments. Looking at this thread, it doesn't sound like?
> 
> Its too late for 2.6.23. But we can certainly do things for .24. Could you 
> please test the patches queued up in Andrew's tree? In particular the page 
> allocator pass through and the per cpu structures optimizations?

We are trying to get the latest data with 2.6.23-rc4-mm1 with and without
slub. Is this good enough?

> 
> There is more work out of tree to optimize the fastpath that is mostly 
> driven by Mathieu Desnoyers. I hope to get that into mm in the next weeks 
> but I do not think that it is going to be available before .25.
> 
> The work of Matheiu also has implications for the page allocator. We may 
> be able to significantly speed up the fastpath there as well.

Ok. At least till all the regressions are addressed and all these patches are
well tested, we shouldn't do away with slab from mainline anytime soon.

Other than us, who else are you banking on for analysing slub? Do
you have any numbers that you can share, which show where slub
is good or bad...

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: tbench regression - Why process scheduler has impact on tbench and why small per-cpu slab (SLUB) cache creates the scenario?

2007-09-13 Thread Siddha, Suresh B
On Tue, Sep 11, 2007 at 01:19:30PM -0700, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> 
> > The impression I got at vm meeting was that SLUB was good to go :(
> 
> Its not? I have had Intel test this thoroughly and they assured me that it 
> is up to SLAB.

Christoph, not sure if you are referring to me or not here. But our
tests (at least with the database workloads) approx 1.5 months or so back
showed that on ia64 slub was on par with slab and on x86_64, slub was 9% down.
And after changing the slub min order and max order, slub perf on x86_64 is
down approx 3.5% or so compared to slab.

While I don't rule out large sized allocations like PAGE_SIZE, I am mostly
certain that the critical allocations in this workload are not PAGE_SIZE
based.  Mostly they are in the range less than 300-500 bytes or so.

Any changes in the recent slub which takes the pressure away from the page
allocator especially for smaller page sized architectures? If so, we can
redo some of the experiments. Looking at this thread, it doesn't sound like?

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: fix broken smt/mc optimizations with CFS

2007-09-04 Thread Siddha, Suresh B
On Tue, Sep 04, 2007 at 07:35:21PM -0400, Chuck Ebbert wrote:
> On 08/28/2007 06:27 PM, Siddha, Suresh B wrote:
> > Try to fix MC/HT scheduler optimization breakage again, with out breaking
> > the FUZZ logic.
> > 
> > First fix the check
> > if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task)
> > with this
> > if (*imbalance < busiest_load_per_task)
> > 
> > As the current check is always false for nice 0 tasks (as 
> > SCHED_LOAD_SCALE_FUZZ
> > is same as busiest_load_per_task for nice 0 tasks).
> > 
> > With the above change, imbalance was getting reset to 0 in the corner case
> > condition, making the FUZZ logic fail. Fix it by not corrupting the
> > imbalance and change the imbalance, only when it finds that the
> > HT/MC optimization is needed.
> > 
> > Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
> > ---
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 9fe473a..03e5e8d 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -2511,7 +2511,7 @@ group_next:
> >  * a think about bumping its value to force at least one task to be
> >  * moved
> >  */
> > -   if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task) {
> > +   if (*imbalance < busiest_load_per_task) {
> > unsigned long tmp, pwr_now, pwr_move;
> > unsigned int imbn;
> >  
> > @@ -2563,10 +2563,8 @@ small_imbalance:
> > pwr_move /= SCHED_LOAD_SCALE;
> >  
> > /* Move if we gain throughput */
> > -   if (pwr_move <= pwr_now)
> > -   goto out_balanced;
> > -
> > -   *imbalance = busiest_load_per_task;
> > +   if (pwr_move > pwr_now)
> > +   *imbalance = busiest_load_per_task;
> > }
> >  
> > return busiest;
> 
> Seems this didn't get merged? Latest git as of today still has the code
> as it was before this patch.

This is a must-fix for .23 and Ingo previously mentioned that he will push it
for .23

Ingo?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] i386, apic: fix 4 bit apicid assumption of mach-default

2007-08-28 Thread Siddha, Suresh B
Andi/Andrew,

Can you pick this up for your trees and if there are no issues, can you please
push it to mainline before .23 gets released.

We have seen a boot failure with fewer cpu sockets populated on a MP platform.
Similar problem can happen on a fully populated system, if # of cpus <= 8
and any of the apic id's is > 16
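
For illustration, a tiny userspace demonstration of why a 4 bit mask breaks
once an apic id goes above 15:

#include <stdio.h>

int main(void)
{
        unsigned long x = 0x11UL << 24;         /* xAPIC id 17 in bits 31:24 of APIC_ID */

        printf("4-bit mask: %lu\n", (x >> 24) & 0xF);   /* prints 1: id 17 truncated */
        printf("8-bit mask: %lu\n", (x >> 24) & 0xFF);  /* prints 17: full xAPIC id */
        return 0;
}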

thanks,
suresh
---
Fix get_apic_id() in mach-default, so that it uses 8 bits in the xAPIC case
and 4 bits in the legacy APIC case.

This fixes the i386 kernel assumption that apic id is less than 16 for xAPIC
platforms with 8 cpus or less and makes the kernel boot on such platforms.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/include/asm-i386/mach-default/mach_apicdef.h 
b/include/asm-i386/mach-default/mach_apicdef.h
index 7bcb350..ae98413 100644
--- a/include/asm-i386/mach-default/mach_apicdef.h
+++ b/include/asm-i386/mach-default/mach_apicdef.h
@@ -1,11 +1,17 @@
 #ifndef __ASM_MACH_APICDEF_H
 #define __ASM_MACH_APICDEF_H
 
+#include <asm/apic.h>
+
 #define APIC_ID_MASK (0xF<<24)
 
 static inline unsigned get_apic_id(unsigned long x) 
 { 
-   return (((x)>>24)&0xF);
+   unsigned int ver = GET_APIC_VERSION(apic_read(APIC_LVR));
+   if (APIC_XAPIC(ver))
+   return (((x)>>24)&0xFF);
+   else
+   return (((x)>>24)&0xF);
 } 
 
 #define GET_APIC_ID(x)  get_apic_id(x)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: fix broken smt/mc optimizations with CFS

2007-08-28 Thread Siddha, Suresh B
On Mon, Aug 27, 2007 at 12:31:03PM -0700, Siddha, Suresh B wrote:
> Essentially I observed that nice 0 tasks still endup on two cores of same
> package, with out getting spread out to two different packages. This behavior
> is same with out this fix and this fix doesn't help in any way.

Ingo, Appended patch seems to fix the issue and as far as I can test, seems ok
to me.

This is a quick fix for .23. Peter Williams and myself plan to look at
code cleanups in this area (HT/MC optimizations) post .23

BTW, with this fix, do you want to retain the current FUZZ value?

thanks,
suresh
--

Try to fix MC/HT scheduler optimization breakage again, without breaking
the FUZZ logic.

First fix the check
if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task)
with this
if (*imbalance < busiest_load_per_task)

As the current check is always false for nice 0 tasks (as SCHED_LOAD_SCALE_FUZZ
is same as busiest_load_per_task for nice 0 tasks).
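
For illustration, the arithmetic with the then-current nice-0 values plugged
in (SCHED_LOAD_SCALE_FUZZ = 1024, busiest_load_per_task = 1024), checked by a
tiny standalone program:

#include <assert.h>

int main(void)
{
        unsigned long fuzz = 1024;                      /* SCHED_LOAD_SCALE_FUZZ */
        unsigned long busiest_load_per_task = 1024;     /* one nice-0 task */
        unsigned long imbalance;

        /* old check: imbalance + 1024 < 1024 is false for any imbalance >= 0 */
        for (imbalance = 0; imbalance <= 4096; imbalance++)
                assert(!(imbalance + fuzz < busiest_load_per_task));
        return 0;
}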

With the above change, imbalance was getting reset to 0 in the corner case
condition, making the FUZZ logic fail. Fix it by not corrupting the
imbalance and change the imbalance, only when it finds that the
HT/MC optimization is needed.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 9fe473a..03e5e8d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2511,7 +2511,7 @@ group_next:
 * a think about bumping its value to force at least one task to be
 * moved
 */
-   if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task) {
+   if (*imbalance < busiest_load_per_task) {
unsigned long tmp, pwr_now, pwr_move;
unsigned int imbn;
 
@@ -2563,10 +2563,8 @@ small_imbalance:
pwr_move /= SCHED_LOAD_SCALE;
 
/* Move if we gain throughput */
-   if (pwr_move <= pwr_now)
-   goto out_balanced;
-
-   *imbalance = busiest_load_per_task;
+   if (pwr_move > pwr_now)
+   *imbalance = busiest_load_per_task;
}
 
return busiest;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: fix broken smt/mc optimizations with CFS

2007-08-27 Thread Siddha, Suresh B
On Mon, Aug 27, 2007 at 09:23:24PM +0200, Ingo Molnar wrote:
> 
> * Siddha, Suresh B <[EMAIL PROTECTED]> wrote:
> 
> > > - if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task/2) {
> > > + if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task) {
> > 
> > Ingo, this is still broken. This condition is always false for nice-0 
> > tasks..
> 
> yes - negative reniced tasks are not spread out via this - and positive 
> reniced tasks are spread out more easily.

Or the opposite?

Essentially I observed that nice 0 tasks still end up on two cores of the same
package, without getting spread out to two different packages. This behavior
is the same without this fix and this fix doesn't help in any way.

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] sched: fix broken smt/mc optimizations with CFS

2007-08-27 Thread Siddha, Suresh B
On Thu, Aug 23, 2007 at 02:13:41PM +0200, Ingo Molnar wrote:
> 
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > [...] So how about the patch below instead?
> 
> the right patch attached.
> 
> >
> Subject: sched: fix broken SMT/MC optimizations
> From: "Siddha, Suresh B" <[EMAIL PROTECTED]>
> 
> On a four package system with HT - HT load balancing optimizations were
> broken.  For example, if two tasks end up running on two logical threads
> of one of the packages, scheduler is not able to pull one of the tasks
> to a completely idle package.
> 
> In this scenario, for nice-0 tasks, imbalance calculated by scheduler
> will be 512 and find_busiest_queue() will return 0 (as each cpu's load
> is 1024 > imbalance and has only one task running).
> 
> Similarly MC scheduler optimizations also get fixed with this patch.
> 
> [ [EMAIL PROTECTED]: restored fair balancing by increasing the fuzz and
>  adding it back to the power decision, without the /2
>  factor. ]
> 
> Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
> ---
> 
>  include/linux/sched.h |2 +-
>  kernel/sched.c|2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> Index: linux/include/linux/sched.h
> ===
> --- linux.orig/include/linux/sched.h
> +++ linux/include/linux/sched.h
> @@ -681,7 +681,7 @@ enum cpu_idle_type {
>  #define SCHED_LOAD_SHIFT 10
>  #define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
>  
> -#define SCHED_LOAD_SCALE_FUZZ(SCHED_LOAD_SCALE >> 1)
> +#define SCHED_LOAD_SCALE_FUZZSCHED_LOAD_SCALE
>  
>  #ifdef CONFIG_SMP
>  #define SD_LOAD_BALANCE  1   /* Do load balancing on this 
> domain. */
> Index: linux/kernel/sched.c
> ===
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -2517,7 +2517,7 @@ group_next:
>* a think about bumping its value to force at least one task to be
>* moved
>*/
> - if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task/2) {
> + if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task) {

Ingo, this is still broken. This condition is always false for nice-0 tasks..

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] x86: Reduce Memory Usage and Inter-Node message traffic (v2)

2007-08-24 Thread Siddha, Suresh B
On Fri, Aug 24, 2007 at 03:26:54PM -0700, [EMAIL PROTECTED] wrote:
> Previous Intro:

Thanks for doing this.

> In x86_64 and i386 architectures most arrays that are sized
> using NR_CPUS lay in local memory on node 0.  Not only will most
> (99%?) of the systems not use all the slots in these arrays,
> particularly when NR_CPUS is increased to accommodate future
> very high cpu count systems, but a number of cache lines are
> passed unnecessarily on the system bus when these arrays are
> referenced by cpus on other nodes.

Can we move cpuinfo_x86 also to the per cpu area? Though critical run
time code doesn't access this area, it would be nice to move cpuinfo_x86
into the per cpu area as well.
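
For illustration, a rough sketch of what such a conversion could look like
with the existing per cpu primitives; the accessor names are only
placeholders and every cpu_data user would need updating:

/* sketch only: one cpuinfo_x86 per cpu instead of a NR_CPUS-sized array */
DEFINE_PER_CPU(struct cpuinfo_x86, cpu_info);

#define cpu_data(cpu)           per_cpu(cpu_info, (cpu))
#define current_cpu_data        __get_cpu_var(cpu_info)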

Perhaps the current cpuinfo_x86 layout might cause confusion and make people
add arch specific per cpu elements into cpuinfo_x86 (thinking that it uses the
per cpu area).

Wonder if this confusion is the reason for git commit f3fa8ebc

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/6] x86: fix cpu_to_node references (v2)

2007-08-24 Thread Siddha, Suresh B
On Fri, Aug 24, 2007 at 03:26:55PM -0700, [EMAIL PROTECTED] wrote:
> Fix four instances where cpu_to_node is referenced
> by array instead of via the cpu_to_node macro.  This
> is preparation to moving it to the per_cpu data area.
> 
...

>  unsigned long __init numa_free_all_bootmem(void) 
> --- a/arch/x86_64/mm/srat.c
> +++ b/arch/x86_64/mm/srat.c
> @@ -431,9 +431,9 @@
>   setup_node_bootmem(i, nodes[i].start, nodes[i].end);
>  
>   for (i = 0; i < NR_CPUS; i++) {
> - if (cpu_to_node[i] == NUMA_NO_NODE)
> + if (cpu_to_node(i) == NUMA_NO_NODE)
>   continue;
> - if (!node_isset(cpu_to_node[i], node_possible_map))
> + if (!node_isset(cpu_to_node(i), node_possible_map))
>   numa_set_node(i, NUMA_NO_NODE);
>   }
>   numa_init_array();

During this particular routine's execution, per cpu areas are not yet set up. In
the future, when we make cpu_to_node(i) use the per cpu area, this code will break.

And actually setup_per_cpu_areas() uses cpu_to_node(). So...
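
For illustration, one possible way around the ordering problem is an early
boot-time map that the accessor falls back to until the per cpu areas exist;
all names below are made up:

/* illustrative only */
static int early_cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
static int per_cpu_areas_ready;         /* set at the end of setup_per_cpu_areas() */

#define cpu_to_node(cpu)                                                \
        (per_cpu_areas_ready ? per_cpu(x86_cpu_to_node, (cpu))          \
                             : early_cpu_to_node_map[(cpu)])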

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: lmbench ctxsw regression with CFS

2007-08-16 Thread Siddha, Suresh B
On Tue, Aug 14, 2007 at 05:23:00AM +0200, Nick Piggin wrote:
> On Mon, Aug 13, 2007 at 08:00:38PM -0700, Andrew Morton wrote:
> > Put it this way: if a 50% slowdown in context switch times yields a 5%
> > improvement in, say, balancing decisions then it's probably a net win.
> > 
> > Guys, repeat after me: "context switch is not a fast path".  Take that
> > benchmark and set fire to it.
> 
> It definitely can be. For workloads that are inherently asynchronous, high
> speed networking or disk IO (ie. with event generation significantly outside
> the control of the kernel or app), then it can be. Sure, you may just be
> switching between the main working thread and idle thread, but in that case a
> slowdown in the scheduler will be _more_ pronounced because you don't have to
> do as much work to actually switch contexts.
> 
> If there was a performance tradeoff involved, then we could think about it,
> and you might be right. But this is just a case of "write code to do direct
> calls or do indirect calls".
> 
> Ken Chen's last ia64 database benchmark I could find says schedule takes
> 6.5% of the clock cycles, the second highest consumer. Considering the
> lengths he was going to shave cycles off other paths, I'd call schedule()
> a fastpath. Would be really interesting to rerun that benchmark with CFS.
> Is anyone at Intel still doing those tests?

Yes. schedule() still is in the top 2-3 consumers of kernel time for that
workload. We did some tests when CFS was in initial days (I think V2 or so)
and it didn't show any regression.

We have plans to run that workload with 2.6.23-rc kernels, but other things
were taking priority so far...
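
For reference, a minimal self-contained version of the pipe ping-pong that
lat_ctx style context switch numbers are built on:

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

int main(void)
{
        int p1[2], p2[2], i, n = 100000;
        char c = 0;
        struct timeval t0, t1;

        if (pipe(p1) || pipe(p2))
                return 1;
        if (fork() == 0) {                      /* child: echo every byte back */
                close(p1[1]);
                close(p2[0]);
                for (;;) {
                        if (read(p1[0], &c, 1) != 1)
                                _exit(0);       /* parent exited, pipe closed */
                        write(p2[1], &c, 1);
                }
        }
        close(p1[0]);
        close(p2[1]);
        gettimeofday(&t0, NULL);
        for (i = 0; i < n; i++) {               /* each iteration is ~2 context switches */
                write(p1[1], &c, 1);
                read(p2[0], &c, 1);
        }
        gettimeofday(&t1, NULL);
        printf("%.2f usec per round trip\n",
               ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec)) / n);
        return 0;
}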

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] sched: skip updating rq's next_balance under null SD

2007-08-15 Thread Siddha, Suresh B
I was playing with sched_smt_power_savings/sched_mc_power_savings and found
out that while the scheduler domains are reconstructed when sysfs settings
change, rebalance_domains() can get triggered with a null domain on other cpus,
which sets next_balance to jiffies + 60*HZ, resulting in no idle/busy
balancing for 60 seconds.

Fix this.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 45e17b8..74565c0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3020,6 +3020,7 @@ static inline void rebalance_domains(int cpu, enum 
cpu_idle_type idle)
struct sched_domain *sd;
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
+   int update_next_balance = 0;
 
for_each_domain(cpu, sd) {
if (!(sd->flags & SD_LOAD_BALANCE))
@@ -3056,8 +3057,10 @@ static inline void rebalance_domains(int cpu, enum 
cpu_idle_type idle)
if (sd->flags & SD_SERIALIZE)
spin_unlock(&balancing);
 out:
-   if (time_after(next_balance, sd->last_balance + interval))
+   if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
+   update_next_balance = 1;
+   }
 
/*
 * Stop the load balance at this level. There is another
@@ -3067,7 +3070,14 @@ out:
if (!balance)
break;
}
-   rq->next_balance = next_balance;
+
+   /*
+* next_balance will be updated only when there is a need.
+* When the cpu is attached to null domain for ex, it will not be
+* updated.
+*/
+   if (likely(update_next_balance))
+   rq->next_balance = next_balance;
 }
 
 /*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] sched: fix broken smt/mc optimizations with CFS

2007-08-15 Thread Siddha, Suresh B
Ingo, let me know if there are any side effects of this change. Thanks.
---

On a four package system with HT - HT load balancing optimizations
were broken. For example, if two tasks end up running on two logical
threads of one of the packages, scheduler is not able to pull one of
the tasks to a completely idle package.

In this scenario, for nice-0 tasks, imbalance calculated by scheduler
will be 512 and find_busiest_queue() will return 0 (as each cpu's load
is 1024 > imbalance and has only one task running).

Similarly MC scheduler optimizations also get fixed with this patch.

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 45e17b8..c5ac710 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2494,7 +2494,7 @@ group_next:
 * a think about bumping its value to force at least one task to be
 * moved
 */
-   if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task/2) {
+   if (*imbalance < busiest_load_per_task) {
unsigned long tmp, pwr_now, pwr_move;
unsigned int imbn;
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] sched: skip updating rq's next_balance under null SD

2007-08-15 Thread Siddha, Suresh B
Was playing with sched_smt_power_savings/sched_mc_power_savings and found
out that while the scheduler domains are reconstructed when sysfs settings
change, rebalance_domains() can get triggered with null domain on other cpus,
which is setting next_balance to jiffies + 60*HZ. Resulting in no idle/busy
balancing for 60 seconds.

Fix this.

Signed-off-by: Suresh Siddha [EMAIL PROTECTED]
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 45e17b8..74565c0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3020,6 +3020,7 @@ static inline void rebalance_domains(int cpu, enum 
cpu_idle_type idle)
struct sched_domain *sd;
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
+   int update_next_balance = 0;
 
for_each_domain(cpu, sd) {
if (!(sd-flags  SD_LOAD_BALANCE))
@@ -3056,8 +3057,10 @@ static inline void rebalance_domains(int cpu, enum 
cpu_idle_type idle)
if (sd-flags  SD_SERIALIZE)
spin_unlock(balancing);
 out:
-   if (time_after(next_balance, sd-last_balance + interval))
+   if (time_after(next_balance, sd-last_balance + interval)) {
next_balance = sd-last_balance + interval;
+   update_next_balance = 1;
+   }
 
/*
 * Stop the load balance at this level. There is another
@@ -3067,7 +3070,14 @@ out:
if (!balance)
break;
}
-   rq-next_balance = next_balance;
+
+   /*
+* next_balance will be updated only when there is a need.
+* When the cpu is attached to null domain for ex, it will not be
+* updated.
+*/
+   if (likely(update_next_balance))
+   rq-next_balance = next_balance;
 }
 
 /*
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] sched: fix broken smt/mc optimizations with CFS

2007-08-15 Thread Siddha, Suresh B
Ingo, let me know if there any side effects of this change. Thanks.
---

On a four package system with HT - HT load balancing optimizations
were broken. For example, if two tasks end up running on two logical
threads of one of the packages, scheduler is not able to pull one of
the tasks to a completely idle package.

In this scenario, for nice-0 tasks, imbalance calculated by scheduler
will be 512 and find_busiest_queue() will return 0 (as each cpu's load
is 1024  imbalance and has only one task running).

Similarly MC scheduler optimizations also get fixed with this patch.

Signed-off-by: Suresh Siddha [EMAIL PROTECTED]
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 45e17b8..c5ac710 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2494,7 +2494,7 @@ group_next:
 * a think about bumping its value to force at least one task to be
 * moved
 */
-   if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task/2) {
+   if (*imbalance < busiest_load_per_task) {
unsigned long tmp, pwr_now, pwr_move;
unsigned int imbn;
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] slab: revert "slab: fix alien cache handling"

2007-08-14 Thread Siddha, Suresh B
On Mon, Aug 13, 2007 at 01:58:48PM -0700, Christoph Lameter wrote:
> On Mon, 13 Aug 2007, Siddha, Suresh B wrote:
> 
> > Can we revert git commit 3cdc0ed0cea50ea08dd146c1bbc82b1bcc2e1b80 ?
> 
> Only if you find another way to fix the bug that is addressed there.

Does the appended version fix both of the issues? The name "alien" in
cache_free_alien() is confusing, as the function does two things: free
through the alien caches, or free directly on the remote node.

thanks,
suresh
---

Skip calling cache_free_alien() when the platform is not NUMA capable.
This avoids the cache misses that happen while accessing slabp (which
is a per-page memory reference) to get the nodeid. Instead, use a global
variable to skip the call; that variable is most likely to be present in
the cache.

This gives a 0.8% performance boost with the database OLTP workload
on a quad-core SMP platform, and that is by no means a small number :)
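
The pattern itself, separated from the slab internals, is sketched below
(names here are illustrative, not the kernel's): deciding whether an object
is remote requires touching cold per-page metadata on every free, whereas a
single flag set once at boot lets the hot path skip that read entirely on
non-NUMA systems.

#include <stddef.h>

/* stand-ins for the real structures; illustrative only */
struct page_meta { int nodeid; };

static int numa_platform = 1;          /* cleared once at init on UMA boxes */
static int this_node;

/* the expensive test: reads per-page metadata, a likely cache miss */
static int object_is_remote(struct page_meta *meta)
{
        return meta->nodeid != this_node;
}

static void remote_free(void *obj) { (void)obj; /* alien-cache path */ }
static void local_free(void *obj)  { (void)obj; /* fast local path  */ }

static void free_hot_path(void *obj, struct page_meta *meta)
{
        /* the global flag is almost certainly in cache, so UMA systems
         * never pay for the per-page read */
        if (numa_platform && object_is_remote(meta)) {
                remote_free(obj);
                return;
        }
        local_free(obj);
}

int main(void)
{
        struct page_meta meta = { .nodeid = 0 };

        numa_platform = 0;             /* pretend init saw a single node */
        free_hot_path(NULL, &meta);    /* takes the fast path, no meta read */
        return 0;
}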

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---

diff --git a/mm/slab.c b/mm/slab.c
index a684778..6f6abef 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -883,6 +883,7 @@ static void __slab_error(const char *function, struct kmem_cache *cachep,
   */
 
 static int use_alien_caches __read_mostly = 1;
+static int numa_platform __read_mostly = 1;
 static int __init noaliencache_setup(char *s)
 {
use_alien_caches = 0;
@@ -1399,8 +1400,10 @@ void __init kmem_cache_init(void)
int order;
int node;
 
-   if (num_possible_nodes() == 1)
+   if (num_possible_nodes() == 1) {
use_alien_caches = 0;
+   numa_platform = 0;
+   }
 
for (i = 0; i < NUM_INIT_LISTS; i++) {
kmem_list3_init(&initkmem_list3[i]);
@@ -3558,7 +3561,14 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
check_irq_off();
objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
-   if (cache_free_alien(cachep, objp))
+   /*
+* Skip calling cache_free_alien() when the platform is not numa.
+* This will avoid cache misses that happen while accessing slabp (which
+* is per page memory  reference) to get nodeid. Instead use a global
+* variable to skip the call, which is mostly likely to be present in
+* the cache.
+*/
+   if (numa_platform && cache_free_alien(cachep, objp))
return;
 
if (likely(ac->avail < ac->limit)) {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

