Re: [rfc 08/45] cpu alloc: x86 support

2007-11-26 Thread John Richard Moser



Andi Kleen wrote:
> On Tuesday 20 November 2007 04:50, Christoph Lameter wrote:
>> On Tue, 20 Nov 2007, Andi Kleen wrote:
> 
> You could in theory move the modules, but then you would need to implement
> a full PIC dynamic linker for them first and also increase runtime overhead
> for them because they would need to use a GOT/PLT.


On x86-64?  The GOT/PLT should stay in cache due to temporal locality. 
The x86-64 instruction set itself handles GOT-relative addressing rather 
well; what's a 1% loss on x86 is like 0.01% on x86-64, so I'm thinking 
100 times better?


I think I got this by compiling the nbyte benchmark with `-fpic -pie` versus 
a fixed-position build, on both 32-bit (which made about a 1% difference) 
and 64-bit (which made a 0.01% difference).  It was a long time ago.
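
A hedged sketch of that kind of comparison (not the original benchmark; the
flags, loop and variable names below are only illustrative):

/*
 * Build the same file twice and compare timings, e.g.
 *   gcc -O2 -o bench-fixed bench.c
 *   gcc -O2 -fpic -pie -o bench-pic bench.c
 * The volatile global keeps the data access inside the loop, so the PIC
 * build may pay for GOT/%rip-relative addressing on every iteration.
 */
#include <stdio.h>
#include <time.h>

volatile long counter;

int main(void)
{
	clock_t start = clock();
	long i;

	for (i = 0; i < 100000000L; i++)
		counter += i;

	printf("%ld ticks, counter=%ld\n", (long)(clock() - start), counter);
	return 0;
}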


Still, yeah I know.  Complexity.

(You could also just leave these non-PIC and take text relocations, 
rewriting the text at load time, depending on how you feel about that.)

--
Bring back the Firefox plushy!
http://digg.com/linux_unix/Is_the_Firefox_plush_gone_for_good
https://bugzilla.mozilla.org/show_bug.cgi?id=322367

Re: [rfc 08/45] cpu alloc: x86 support

2007-11-21 Thread Christoph Lameter
On Wed, 21 Nov 2007, Andi Kleen wrote:

> The whole mapping for all CPUs cannot fit into 2GB of course, but the 
> reference 
> linker managed range can.

Ok so you favor the solution where we subtract smp_processor_id() << 
shift?

> > The offset relative to %gs cannot be used if you have a loop and are 
> > calculating the addresses for all instances. That is what we are talking 
> > about. The CPU_xxx operations that are using the %gs register are fine and 
> > are not affected by the changes we are discussing.
> 
> Sure it can -- you just get the base address from a global array
> and then add the offset

Ok so generalize the data_offset for that case? I noted that other arches 
and i386 have a similar solution there. I fiddled around some more and 
found that the overhead that the subtraction introduces is equivalent to 
loading an 8 byte constant of the base.

Keeping the usage of data_offset can avoid the shift and the add for the 
__get_cpu_var case that needs CPU_PTR( ..., smp_processor_id()) because 
the load from data_offset avoids the shifting and adding of 
smp_processor_id().

For the loops this is not useful since the compiler can move the 
loading of the base pointer outside of the loop (if CPU_PTR needs to load 
an 8 byte constant pointer).

With loading the 8 byte base the loops actually become:

sum = 0
ptr = CPU_AREA_BASE
while ptr < CPU_AREA_BASE + (NR_CPUS << shift) {
        sum += *ptr
        ptr += 1 << shift
}

So I think we need to go with the implementation where CPU_PTR(var, cpu) 
is

CPU_AREA_BASE + (cpu << shift) + var_offset

The CPU_AREA_BASE will be loaded into a register. The var_offset usually 
ends up being an offset in a mov instruction.
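
A hedged C sketch of that constant-base form (CPU_AREA_BASE, CPU_AREA_SHIFT,
NR_CPUS and my_counter below are placeholders, not the patchset's actual
definitions; the base value is an example only):

#define CPU_AREA_BASE	0xffffffff84000000UL	/* example constant, link-time known */
#define CPU_AREA_SHIFT	20			/* example: 1MB per cpu */
#define NR_CPUS		4096

/* my_counter stands for a variable the linker placed in the 0-based per cpu
 * section, so &my_counter is really just its offset. */
extern unsigned long my_counter;

#define CPU_PTR(var, cpu) \
	((typeof(&(var)))(CPU_AREA_BASE + \
			  ((unsigned long)(cpu) << CPU_AREA_SHIFT) + \
			  (unsigned long)&(var)))

static unsigned long sum_counters(void)
{
	unsigned long sum = 0;
	int cpu;

	/* CPU_AREA_BASE + &my_counter folds into one 8 byte constant that the
	 * compiler loads once and keeps in a register across the loop. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += *CPU_PTR(my_counter, cpu);
	return sum;
}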

> > 
> > > Then the reference data would be initdata and eventually freed.
> > > That is similar to how the current per cpu data works.
> > 
> > Yes that is also how the current patchset works. I just do not understand 
> > what you want changed.
> 
> Anyways i think your current scheme cannot work (too much VM, placed at the 
> wrong 
> place; some wrong assumptions).

The constant pointer solution fixes that. No need to despair.



Re: [rfc 08/45] cpu alloc: x86 support

2007-11-21 Thread Andi Kleen

> > All you need is a 2MB area (16MB is too large if you really
> > want 16k CPUs someday) somewhere in the -2GB or probably better
> > in +2GB. Then the linker puts stuff in there and you use
> > the offsets for referencing relative to %gs.
> 
> 2MB * 16k = 32GB. Even with 4k cpus we will have 2M * 4k = 8GB both do
> not fit in the 2GB area.

I was referring here to the 16MB/CPU you proposed originally which will not fit 
into _any_  kernel area for 16k CPUs. 

The whole mapping for all CPUs cannot fit into 2GB of course, but the reference 
linker managed range can.

> The offset relative to %gs cannot be used if you have a loop and are 
> calculating the addresses for all instances. That is what we are talking 
> about. The CPU_xxx operations that are using the %gs register are fine and 
> are not affected by the changes we are discussing.

Sure it can -- you just get the base address from a global array
and then add the offset

> 
> > Then the reference data would be initdata and eventually freed.
> > That is similar to how the current per cpu data works.
> 
> Yes that is also how the current patchset works. I just do not understand 
> what you want changed.

Anyway, I think your current scheme cannot work (too much VM, placed at the 
wrong place; some wrong assumptions).

But since I seem unable to communicate this to you I'll stop commenting
and let you find it out the hard way. Have fun.

-Andi


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Christoph Lameter
On Wed, 21 Nov 2007, Andi Kleen wrote:

> On Wednesday 21 November 2007 02:16:11 Christoph Lameter wrote:
> > But one can subtract too... 
> 
> The linker cannot subtract (unless you add a new relocation type) 

The compiler knows and emits assembly to compensate.

> All you need is a 2MB area (16MB is too large if you really
> want 16k CPUs someday) somewhere in the -2GB or probably better
> in +2GB. Then the linker puts stuff in there and you use
> the offsets for referencing relative to %gs.

2MB * 16k = 32GB. Even with 4k cpus we would have 2MB * 4k = 8GB; neither
fits in the 2GB area.

The offset relative to %gs cannot be used if you have a loop and are 
calculating the addresses for all instances. That is what we are talking 
about. The CPU_xxx operations that are using the %gs register are fine and 
are not affected by the changes we are discussing.

> Then for all CPUs (including CPU #0) you put the real mapping
> somewhere else, copy the reference data there (which also doesn't need
> to be on the offset the linker assigned, just on a constant offset
> from it somewhere in the normal kernel data) and off you go.

Real mapping? We have constant offsets after this patchset. I do not get 
what you are planning here.

> Then the reference data would be initdata and eventually freed.
> That is similar to how the current per cpu data works.

Yes that is also how the current patchset works. I just do not understand 
what you want changed.



Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Andi Kleen
On Wednesday 21 November 2007 02:16:11 Christoph Lameter wrote:
> But one can subtract too... 

The linker cannot subtract (unless you add a new relocation type) 

> Hmmm... So the cpu area 0 could be put at 
> the beginning of the 2GB kernel area and then grow downwards from 
> 0x8000. The cost in terms of code is one subtract
> instruction for each per_cpu() or CPU_PTR()
> 
> The next thing downward from 0x8000 is the vmemmap at 
> 0xe200, so ~32TB. If we leave 16TB for the vmemmap
> (a 16TB vmemmap would be able to map 2^(44 - 6 + 12) = 2^50 bytes, 
> more than currently supported by the processors)
> 
> then the remaining 16TB could be used to map 1GB per cpu for a 16k config. 
> That is wildly overdoing it. Guess we could just do it with 1M anyways. 
> Just to be safe we could do 128M. 128M x 16k = 2TB?
> 
> Would such a configuration be okay?

I'm not sure I really understand your problem.

All you need is a 2MB area (16MB is too large if you really
want 16k CPUs someday) somewhere in the -2GB or probably better
in +2GB. Then the linker puts stuff in there and you use
the offsets for referencing relative to %gs.

But %gs can be located wherever you want in the end,
at a completely different address than you told the linker.
All you're interested in is the offsets anyway.

Then for all CPUs (including CPU #0) you put the real mapping
somewhere else, copy the reference data there (which also doesn't need
to be on the offset the linker assigned, just on a constant offset
from it somewhere in the normal kernel data) and off you go.
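
A minimal sketch of that arrangement (illustrative names only, e.g. the
__per_cpu_start/__per_cpu_end section markers and setup_cpu_area(); this is
not the patchset's actual code):

#include <linux/init.h>
#include <linux/string.h>
#include <linux/types.h>

extern char __per_cpu_start[], __per_cpu_end[];

static void __init setup_cpu_area(int cpu, void *working_copy)
{
	size_t size = __per_cpu_end - __per_cpu_start;

	/* Seed this cpu's working area from the linker-built reference data;
	 * the reference copy can later be discarded as initdata. */
	memcpy(working_copy, __per_cpu_start, size);

	/* The %gs base for this cpu would then be pointed at working_copy so
	 * that the linker-assigned offsets resolve into it. */
}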

Then the reference data would be initdata and eventually freed.
That is similar to how the current per cpu data works.

-Andi


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Christoph Lameter
But one can subtract too... Hmmm... So the cpu area 0 could be put at
the beginning of the 2GB kernel area and then grow downwards from 
0x8000. The cost in terms of code is one subtract
instruction for each per_cpu() or CPU_PTR()

The next thing downward from 0x8000 is the vmemmap at 
0xe200, so ~32TB. If we leave 16TB for the vmemmap
(a 16TB vmemmap would be able to map 2^(44 - 6 + 12) = 2^50 bytes, 
more than currently supported by the processors)

then the remaining 16TB could be used to map 1GB per cpu for a 16k config. 
That is wildly overdoing it. Guess we could just do it with 1M anyways. 
Just to be safe we could do 128M. 128M x 16k = 2TB?

Would such a configuration be okay?
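
A quick check of the arithmetic above (assuming a 64 byte struct page and
4KB pages):

#include <stdio.h>

int main(void)
{
	unsigned long long vmemmap = 1ULL << 44;	/* 16TB of vmemmap      */
	unsigned long long pages = vmemmap / (1 << 6);	/* 64 byte struct page  */
	unsigned long long bytes = pages * (1 << 12);	/* 4KB per page         */

	printf("maps 2^%d bytes\n", __builtin_ctzll(bytes));	/* prints 50 */
	return 0;
}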

 



Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, Christoph Lameter wrote:

> 32bit sign extension for what? Absolute data references? The addressing 
> that I have seen was IP relative. Thus I thought that the kernel could be 
> moved lower.

Argh. This all depends on a special gcc option used to compile the 
kernel (the kernel code model), and that option limits the kernel to the 
upper 2GB. So I guess for CPU_PTR we need to explicitly load the address as 
a constant, use that as a base and then add the offset and the shifted 
smp_processor_id() in the instruction. The CPU_INC/DEC stuff using gs is 
not affected.




Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, H. Peter Anvin wrote:

> But you wouldn't actually *use* this address space.  It's just for the linker
> to know what address to tag the references with; it gets relocated by gs_base
> down into proper kernel space.  The linker can stash the initialized reference
> copy at any address (LMA) which can be different from what it will be used at
> (VMA); that is not an issue.

That is already provided by this patchset. The cpu area starts at absolute 
0.
 
> To use %rip references, though, which are more efficient, you probably want to
> use offsets that are just below .text (at -2 GB); presumably
> -2 GB-[max size of percpu section].  Again, however, no CPU actually needs to
> have its data stashed in that particular location; it's just an offset.

Right. That is what we are discussing in another thread.



Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, Andi Kleen wrote:

> 
> > 
> > Right so I could move the kernel to
> > 
> > #define __PAGE_OFFSET _AC(0x8100, UL)
> > #define __START_KERNEL_map _AC(0xfff8, UL)
> 
> That is -31GB unless I'm miscounting. But it needs to be >= -2GB
> (31bits) 

The __START_KERNEL_map needs to cover the 2GB that the kernel needs for 
modules and the cpu area 0. The remaining 28GB can be out of range.

> Right now it is at -2GB + 2MB,  because it is loaded at physical +2MB
> so it's convenient to identity map there. In theory you could avoid that
> with some effort, but that would only buy you 2MB and would also
> break some early code and earlyprintk I believe.

My proposal is -32GB + 2MB. It keeps the arrangement.

> > > You could in theory move the modules, but then you would need to implement
> > > a full PIC dynamic linker for them  first and also increase runtime 
> > > overhead
> > > for them because they would need to use a GOT/PLT.
> > 
> > Why is it not possible to move the kernel lower while keeping bit 31 1?
> 
> The kernel model relies on 32bit sign extension. This means bits [31;63] have
> to be all 1

32bit sign extension for what? Absolute data references? The addressing 
that I have seen was IP relative. Thus I thought that the kernel could be 
moved lower.

> > > I suspect all of this  would cause far more overhead all over the kernel 
> > > than 
> > > you could ever save with the per cpu data in your fast paths.
> > 
> > Moving the kernel down a bit seems to be trivial without any of the weird 
> > solutions.
> 
> Another one I came up in the previous mail would be to do the linker reference
> variable allocation in [0;2GB] positive space; but do all real references only
> %gs relative. And keep the real data copy on some other address. That would
> be a similar trick to the old style x86-64 vsyscalls. It gets fairly
> messy in the linker map file though.

The linker references are to per cpu data already in the 0-MAX_CPU_AREA 
range after this patchset. The problem is the relocation of the 
references when the linker calculates the address of a per cpu variable.


F.e. The pda is placed at absolute address 0

In order to access the pda we have to do either



1. A CPU_PTR(offsetof pda, cpu) which is

offsetof pda (so 0) + cpu_area + (cpu << area_shift)

The compiler currently folds

offsetof pda + cpu_area

and then adds the shift later.

or

2. We use a CPU_xx op with a segment register.

Then we have no need to add cpu area because it is included in the gs 
offset.


So only case 1 is affected which is not used if cpu ops can be used.
There are still critical path uses of that operation so I would really 
like to avoid the explicit calculation.

If we went the proposed route then the folding of the address by the linker 
would have to be replaced by an explicit calculation.
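
A hedged sketch of the two access forms above (illustrative names only,
reusing the CPU_AREA_BASE/CPU_AREA_SHIFT placeholders from the earlier
sketch; these are not the patchset's macros):

/* Case 1: explicit calculation, works for any cpu. var_off is the
 * linker-assigned offset of the per cpu variable (its 0-based "address"). */
static inline void *cpu_ptr(unsigned long var_off, int cpu)
{
	return (void *)(CPU_AREA_BASE +
			((unsigned long)cpu << CPU_AREA_SHIFT) + var_off);
}

/* Case 2: %gs relative access for the current cpu only; the segment base
 * already contains this cpu's area, so no base or shift is added. */
static inline unsigned long this_cpu_read_ulong(unsigned long var_off)
{
	unsigned long val;

	asm("movq %%gs:(%1), %0" : "=r" (val) : "r" (var_off));
	return val;
}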



Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread H. Peter Anvin

Christoph Lameter wrote:
> On Tue, 20 Nov 2007, Andi Kleen wrote:
> 
>>> This limitation shouldn't apply to the percpu area, since gs_base can be 
>>> pointed anywhere in the address space -- in effect we're always indirect.
>> 
>> The initial reference copy of the percpu area has to be addressed by
>> the linker.
> 
> Right that is important for the percpu references that can be folded by 
> the linker in order to avoid address calculations.
> 
>> Hmm, in theory since it is not actually used by itself I suppose you could 
>> move it into positive space.
> 
> But the positive space is reserved for a process's memory.



But you wouldn't actually *use* this address space.  It's just for the 
linker to know what address to tag the references with; it gets 
relocated by gs_base down into proper kernel space.  The linker can 
stash the initialized reference copy at any address (LMA) which can be 
different from what it will be used at (VMA); that is not an issue.


To use %rip references, though, which are more efficient, you probably 
want to use offsets that are just below .text (at -2 GB); presumably
-2 GB-[max size of percpu section].  Again, however, no CPU actually 
needs to have its data stashed in that particular location; it's just an 
offset.


-hpa


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread H. Peter Anvin

Andi Kleen wrote:
>> This limitation shouldn't apply to the percpu area, since gs_base can be 
>> pointed anywhere in the address space -- in effect we're always indirect.
> 
> The initial reference copy of the percpu area has to be addressed by
> the linker.
> 
> Hmm, in theory since it is not actually used by itself I suppose you could 
> move it into positive space.




Positive space for absolute references, or just below -2 GB for %rip 
references; either should work.


-hpa


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Andi Kleen

> 
> Right so I could move the kernel to
> 
> #define __PAGE_OFFSET _AC(0x8100, UL)
> #define __START_KERNEL_map _AC(0xfff8, UL)

That is -31GB unless I'm miscounting. But it needs to be >= -2GB
(31bits) 

Right now it is at -2GB + 2MB,  because it is loaded at physical +2MB
so it's convenient to identity map there. In theory you could avoid that
with some effort, but that would only buy you 2MB and would also
break some early code and earlyprintk I believe.

> > You could in theory move the modules, but then you would need to implement
> > a full PIC dynamic linker for them  first and also increase runtime overhead
> > for them because they would need to use a GOT/PLT.
> 
> Why is it not possible to move the kernel lower while keeping bit 31 1?

The kernel model relies on 32bit sign extension. This means bits [31;63] have
to be all 1
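
A small illustration of that sign-extension constraint (example value only):
kernel-model code uses 32-bit signed displacements, so only addresses whose
upper bits sign-extend to all ones are reachable.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int32_t disp = (int32_t)0x80100000;		/* 32-bit value with bit 31 set  */
	uint64_t addr = (uint64_t)(int64_t)disp;	/* sign-extended as the CPU does */

	/* Prints 0xffffffff80100000: bits [32;63] become copies of bit 31. */
	printf("0x%016llx\n", (unsigned long long)addr);
	return 0;
}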
 
> > I suspect all of this  would cause far more overhead all over the kernel 
> > than 
> > you could ever save with the per cpu data in your fast paths.
> 
> Moving the kernel down a bit seems to be trivial without any of the weird 
> solutions.

Another one I came up with in the previous mail would be to do the linker reference
variable allocation in [0;2GB] positive space; but do all real references only
%gs relative. And keep the real data copy on some other address. That would
be a similar trick to the old style x86-64 vsyscalls. It gets fairly
messy in the linker map file though.

-Andi


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, Andi Kleen wrote:

> 
> > This limitation shouldn't apply to the percpu area, since gs_base can be 
> > pointed anywhere in the address space -- in effect we're always indirect.
> 
> The initial reference copy of the percpu area has to be addressed by
> the linker.

Right that is important for the percpu references that can be folded by 
the linker in order to avoid address calculations.

> Hmm, in theory since it is not actually used by itself I suppose you could 
> move it into positive space.

But the positive space is reserved for a process's memory.



Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Andi Kleen

> This limitation shouldn't apply to the percpu area, since gs_base can be 
> pointed anywhere in the address space -- in effect we're always indirect.

The initial reference copy of the percpu area has to be addressed by
the linker.

Hmm, in theory since it is not actually used by itself I suppose you could 
move it into positive space.

-Andi


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread H. Peter Anvin

Andi Kleen wrote:
> On Tuesday 20 November 2007 04:50, Christoph Lameter wrote:
>> On Tue, 20 Nov 2007, Andi Kleen wrote:
>>> I might be pointing out the obvious, but on x86-64 there is definitely
>>> not 256TB of VM available for this.
>> 
>> Well maybe in the future.
> 
> That would either require more than 4 levels or larger pages
> in page tables.
> 
>> One of the issues that I ran into is that I had to place the cpu area
>> in between to make the offsets link right.
> 
> Above -2GB, otherwise you cannot address them



This limitation shouldn't apply to the percpu area, since gs_base can be 
pointed anywhere in the address space -- in effect we're always indirect.


Obviously the offsets *within* the percpu area have to be in range (±2 GB 
per cpu for absolute offsets, slightly smaller for %rip-based addressing 
-- obviously judicious use of an offset for gs_base is essential in the 
latter case).


Thus you want the percpu areas below -2 GB where they don't interfere 
with modules or any other precious address space.
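
A hedged sketch of that per-cpu setup (the area names are placeholders from
the earlier sketch; wrmsrl/MSR_GS_BASE are the usual kernel interfaces, but
this is not the patchset's actual code):

#include <linux/init.h>
#include <asm/msr.h>

/* Point this cpu's %gs base at its own area, wherever it actually lives;
 * linker-assigned offsets then resolve relative to that base. */
static void __init point_gs_at_cpu_area(int cpu)
{
	unsigned long base = CPU_AREA_BASE +
			     ((unsigned long)cpu << CPU_AREA_SHIFT);

	wrmsrl(MSR_GS_BASE, base);
}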


-hpa


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Christoph Lameter
On Tue, 20 Nov 2007, Andi Kleen wrote:

> > So I think we have a 2GB area right?
> 
> For everything that needs the -31bit offsets; that is everything linked

Of course.

> > 1GB kernel
> > 1GB - 1x per cpu area (128M?) modules?
> > cpu area 0
> >  2GB limit
> > cpu area 1
> > cpu area 2
> > 
> >
> > For that we would need to move the kernel down a bit. Can we do that?
> 
> The kernel model requires kernel and modules and everything else
> linked be in negative -31bit space. That is how the kernel code model is 
> defined.

Right so I could move the kernel to

#define __PAGE_OFFSET _AC(0x8100, UL)
#define __START_KERNEL_map _AC(0xfff8, UL)
#define KERNEL_TEXT_START _AC(0xfff8, UL) 30 bits = 1GB for kernel text
#define MODULES_VADDR _AC(0xfff88000, UL) 30 bits = 1GB for modules
#define MODULES_END   _AC(0xfff8f000, UL)
#define CPU_AREA_BASE _AC(0xfff8f000, UL) 31 bits 256MB for cpu area 0
#define CPU_AREA_BASE1 _AC(0xfff9, UL) More cpu areas for higher numbered processors
#define CPU_AREA_END  _AC(0x, UL)

> You could in theory move the modules, but then you would need to implement
> a full PIC dynamic linker for them  first and also increase runtime overhead
> for them because they would need to use a GOT/PLT.

Why is it not possible to move the kernel lower while keeping bit 31 1?

> I suspect all of this  would cause far more overhead all over the kernel than 
> you could ever save with the per cpu data in your fast paths.

Moving the kernel down a bit seems to be trivial without any of the weird 
solutions.



Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Andi Kleen
On Tuesday 20 November 2007 04:50, Christoph Lameter wrote:
> On Tue, 20 Nov 2007, Andi Kleen wrote:
> > I might be pointing out the obvious, but on x86-64 there is definitely
> > not 256TB of VM available for this.
>
> Well maybe in the future.

That would either require more than 4 levels or larger pages
in page tables.

> One of the issues that I ran into is that I had to place the cpu area
> in between to make the offsets link right.

Above -2GB, otherwise you cannot address them

If you can move all the other CPUs somewhere else it might work.

But even then 16MB/cpu max is unrealistic. Perhaps 1M/CPU 
max -- then 16k CPUs would be 128GB, which could still fit into the existing
vmalloc area.

>
> However, it would be best if the cpuarea came *after* the modules area. We
> only need linking that covers the per cpu area of processor 0.
>
> So I think we have a 2GB area right?

For everything that needs the -31bit offsets; that is everything linked

> 1GB kernel
> 1GB - 1x per cpu area (128M?) modules?
> cpu area 0
>  2GB limit
> cpu area 1
> cpu area 2
> 
>
> For that we would need to move the kernel down a bit. Can we do that?

The kernel model requires the kernel, modules and everything else
linked to be in negative -31bit space. That is how the kernel code model is 
defined.

You could in theory move the modules, but then you would need to implement
a full PIC dynamic linker for them  first and also increase runtime overhead
for them because they would need to use a GOT/PLT.

Or you could switch the kernel over to the large model, which is very costly
and has toolchain problems.

Or use the UML trick and run the kernel PIC but again that causes
overhead.

I suspect all of this  would cause far more overhead all over the kernel than 
you could ever save with the per cpu data in your fast paths.

-Andi


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-20 Thread Andi Kleen

> > Yeah yea but the latencies are minimal making the NUMA logic too
> > expensive for most loads ... If you put a NUMA kernel onto those then
> > performance drops (I think someone measures 15-30%?)
>
> Small socket count systems are going to increasingly be NUMA in future.
> If CONFIG_NUMA hurts performance by that much on those systems, then the
> kernel is broken IMO.

Not sure where that number came from.

In my tests some time ago NUMA overhead on SMP was minimal.

This was admittedly with old 2.4 kernels. There have been some doubts about
some of the newer NUMA features added, in particular about the NUMA slab;
I don't think there was much trouble with anything else -- in fact the trouble
was that it apparently sometimes made systems with a moderate NUMA factor 
slower too. But I assume SLUB will address this anyways.

-Andi


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-19 Thread Nick Piggin
On Tuesday 20 November 2007 13:02, Christoph Lameter wrote:
> On Mon, 19 Nov 2007, H. Peter Anvin wrote:
> > You're making the assumption here that NUMA = large number of CPUs. This
> > assumption is flat-out wrong.
>
> Well maybe. Usually one gets to NUMA because the hardware gets too big to
> be handled the UMA way.
>
> > On x86-64, most two-socket systems are still NUMA, and I would expect
> > that most distro kernels probably compile in NUMA.  However,
> > burning megabytes of memory on a two-socket dual-core system when we're
> > talking about tens of kilobytes used would be more than a wee bit insane.
>
> Yeah yea but the latencies are minimal making the NUMA logic too expensive
> for most loads ... If you put a NUMA kernel onto those then performance
> drops (I think someone measures 15-30%?)

Small socket count systems are going to increasingly be NUMA in future.
If CONFIG_NUMA hurts performance by that much on those systems, then the
kernel is broken IMO.


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-19 Thread Christoph Lameter
On Tue, 20 Nov 2007, Andi Kleen wrote:

> I might be pointing out the obvious, but on x86-64 there is definitely not 
> 256TB of VM available for this.

Well maybe in the future.

One of the issues that I ran into is that I had to place the cpu area
in between to make the offsets link right.

However, it would be best if the cpu area came *after* the modules area. We 
only need linking that covers the per cpu area of processor 0.

So I think we have a 2GB area right?

1GB kernel
1GB - 1x per cpu area (128M?) modules?
cpu area 0
 2GB limit
cpu area 1 
cpu area 2


For that we would need to move the kernel down a bit. Can we do that?



Re: [rfc 08/45] cpu alloc: x86 support

2007-11-19 Thread Nick Piggin
On Tuesday 20 November 2007 13:02, Christoph Lameter wrote:
> On Mon, 19 Nov 2007, H. Peter Anvin wrote:
> > You're making the assumption here that NUMA = large number of CPUs. This
> > assumption is flat-out wrong.
>
> Well maybe. Usually one gets to NUMA because the hardware gets too big to
> be handled the UMA way.

Not the way things are going with multicore and multithread, though
(that is, the hardware can be one socket and still have many cpus).

The chip might have several memory controllers on it, but they could
well be connected to the caches with a crossbar, so it needn't be
NUMA at all. Future scalability work shouldn't rely on many cores
~= many nodes, IMO.


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-19 Thread Andi Kleen

> 4k cpu configurations with 1k nodes:
>
>   4096 * 16MB = 64TB of virtual space.
>
> Maximum theoretical configuration 16384 processors 1k nodes:
>
>   16384 * 16MB = 256TB of virtual space.
>
> Both fit within the established limits established.

I might be pointing out the obvious, but on x86-64 there is definitely not 
256TB of VM available for this.

Not even 64TB, as long as you want to have any other mappings in the kernel
(the kernel half of the address space is 128TB, but it is split in half for
the direct mapping, leaving roughly 64TB -- which the 4096-cpu case above
would already consume on its own).

BTW if you allocate any VM you should also update
Documentation/x86_64/mm.txt, which describes the mapping.

> Index: linux-2.6/include/asm-x86/pgtable_64.h
> ===
> --- linux-2.6.orig/include/asm-x86/pgtable_64.h	2007-11-19 15:45:07.638390147 -0800
> +++ linux-2.6/include/asm-x86/pgtable_64.h	2007-11-19 15:55:53.165640248 -0800
> @@ -138,6 +138,7 @@ static inline pte_t ptep_get_and_clear_f
>  #define VMALLOC_START	_AC(0xc200, UL)
>  #define VMALLOC_END  _AC(0xe1ff, UL)
>  #define VMEMMAP_START _AC(0xe200, UL)
> +#define CPU_AREA_BASE _AC(0x8400, UL)

That's slightly less than 1GB before you bump into the maximum.
But you'll bump into the module mapping even before that.

For 16MB/CPU and the full 1GB that's ~123 CPUs if my calculations are correct.
Even for a non-Altix that's quite tight.

I suppose 16MB/CPU is too large.

-Andi


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-19 Thread H. Peter Anvin

Christoph Lameter wrote:

> On Mon, 19 Nov 2007, H. Peter Anvin wrote:
>
>> You're making the assumption here that NUMA = large number of CPUs. This
>> assumption is flat-out wrong.
>
> Well maybe. Usually one gets to NUMA because the hardware gets too big to
> be handled the UMA way.
>
>> On x86-64, most two-socket systems are still NUMA, and I would expect that
>> most distro kernels probably compile in NUMA.  However,
>> burning megabytes of memory on a two-socket dual-core system when we're
>> talking about tens of kilobytes used would be more than a wee bit insane.
>
> Yeah yeah, but the latencies are minimal, making the NUMA logic too expensive
> for most loads ... If you put a NUMA kernel onto those then performance
> drops (I think someone measured 15-30%?)




How do you handle this memory, in the first place?  Do you allocate the 
whole 2 MB for the particular CPU, or do you reclaim the upper part of 
the large page?  (I haven't dug far enough into the source to tell.)


-hpa


Re: [rfc 08/45] cpu alloc: x86 support

2007-11-19 Thread Christoph Lameter
On Mon, 19 Nov 2007, H. Peter Anvin wrote:

> You're making the assumption here that NUMA = large number of CPUs. This
> assumption is flat-out wrong.

Well maybe. Usually one gets to NUMA because the hardware gets too big to 
be handled the UMA way.

> On x86-64, most two-socket systems are still NUMA, and I would expect that
> most distro kernels probably compile in NUMA.  However,
> burning megabytes of memory on a two-socket dual-core system when we're
> talking about tens of kilobytes used would be more than a wee bit insane.

Yeah yeah, but the latencies are minimal, making the NUMA logic too expensive 
for most loads ... If you put a NUMA kernel onto those then performance 
drops (I think someone measured 15-30%?)



Re: [rfc 08/45] cpu alloc: x86 support

2007-11-19 Thread H. Peter Anvin

Christoph Lameter wrote:


> For the UP and SMP case map the area using 4k ptes. Typical use of per cpu
> data is around 16k for UP and SMP configurations. It goes up to 45k when the
> per cpu area is managed by cpu_alloc (see special x86_64 patchset).
> Allocating in 2M segments would be overkill.
>
> For NUMA map the area using 2M PMDs. A large NUMA system may use
> lots of cpu data for the page allocator data alone. We typically
> have large amounts of memory around at those sizes. Using a 2M page size
> reduces TLB pressure for that case.
>
> Some numbers for envisioned maximum configurations of NUMA systems:
>
> 4k cpu configurations with 1k nodes:
>
>     4096 * 16MB = 64TB of virtual space.
>
> Maximum theoretical configuration 16384 processors 1k nodes:
>
>     16384 * 16MB = 256TB of virtual space.
>
> Both fit within the established limits.



You're making the assumption here that NUMA = large number of CPUs. 
This assumption is flat-out wrong.


On x86-64, most two-socket systems are still NUMA, and I would expect 
that most distro kernels probably compile in NUMA.  However,
burning megabytes of memory on a two-socket dual-core system when we're 
talking about tens of kilobytes used would be more than a wee bit insane.


I do like the concept, overall, but the above distinction needs to be fixed.

-hpa


[rfc 08/45] cpu alloc: x86 support

2007-11-19 Thread clameter
64 bit:

Set up a cpu area that allows the use of up to 16MB for each processor.

Cpu memory use can grow a bit. F.e. if we assume that a pageset
occupies 64 bytes of memory and we have 3 zones in each of 1024 nodes
then we need 3 * 1k * 16k = 50 million pagesets, or 3072 pagesets per
processor. This results in a total of 3.2 GB of pageset structs.
Each cpu needs around 200k of cpu storage for the page allocator alone.
So it is worth it to use a 2M huge mapping here.
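
(Illustrative only -- a quick sanity check of those figures, assuming the
64-byte pageset size, 3 zones, 1k nodes and 16k cpus stated above:)

#include <stdio.h>

int main(void)
{
	unsigned long zones = 3, nodes = 1024, cpus = 16384, pageset = 64;
	unsigned long per_cpu = zones * nodes;		/* 3072 pagesets per cpu */
	unsigned long total = per_cpu * cpus;		/* ~50 million pagesets */

	printf("pagesets per cpu: %lu (~%lu KB per cpu)\n",
	       per_cpu, per_cpu * pageset / 1024);	/* ~192 KB, "around 200k" */
	printf("pagesets total:   %lu (~%.1f GB)\n",
	       total, total * pageset / 1e9);		/* ~3.2 GB */
	return 0;
}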

For the UP and SMP case map the area using 4k ptes. Typical use of per cpu
data is around 16k for UP and SMP configurations. It goes up to 45k when the
per cpu area is managed by cpu_alloc (see special x86_64 patchset).
Allocating in 2M segments would be overkill.

For NUMA map the area using 2M PMDs. A large NUMA system may use
lots of cpu data for the page allocator data alone. We typically
have large amounts of memory around at those sizes. Using a 2M page size
reduces TLB pressure for that case.

Some numbers for envisioned maximum configurations of NUMA systems:

4k cpu configurations with 1k nodes:

4096 * 16MB = 64TB of virtual space.

Maximum theoretical configuration 16384 processors 1k nodes:

16384 * 16MB = 256TB of virtual space.

Both fit within the established limits.

32 bit:

Set up a 256 kB area for the cpu areas below the FIXADDR area.

The use of the cpu alloc area is pretty minimal on i386. An 8p system
with no extras uses only ~8kb. So 256kb should be plenty. A configuration
that supports up to 8 processors takes up 2MB of the scarce
virtual address space.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
 arch/x86/Kconfig                 |   13 +
 arch/x86/kernel/vmlinux_32.lds.S |    1 +
 arch/x86/kernel/vmlinux_64.lds.S |    3 +++
 arch/x86/mm/init_32.c            |    3 +++
 arch/x86/mm/init_64.c            |   38 ++
 include/asm-x86/pgtable_32.h     |    7 +--
 include/asm-x86/pgtable_64.h     |    1 +
 7 files changed, 64 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===
--- linux-2.6.orig/arch/x86/mm/init_64.c	2007-11-19 15:45:07.602390533 -0800
+++ linux-2.6/arch/x86/mm/init_64.c 2007-11-19 15:55:53.165640248 -0800
@@ -781,3 +781,41 @@ int __meminit vmemmap_populate(struct pa
return 0;
 }
 #endif
+
+#ifdef CONFIG_NUMA
+int __meminit cpu_area_populate(void *start, unsigned long size,
+   gfp_t flags, int node)
+{
+   unsigned long addr = (unsigned long)start;
+   unsigned long end = addr + size;
+   unsigned long next;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+
+   for (; addr < end; addr = next) {
+   next = pmd_addr_end(addr, end);
+
+   pgd = cpu_area_pgd_populate(addr, flags, node);
+   if (!pgd)
+   return -ENOMEM;
+   pud = cpu_area_pud_populate(pgd, addr, flags, node);
+   if (!pud)
+   return -ENOMEM;
+
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd)) {
+   pte_t entry;
+   void *p = cpu_area_alloc_block(PMD_SIZE, flags, node);
+   if (!p)
+   return -ENOMEM;
+
+   entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
+   mk_pte_huge(entry);
+   set_pmd(pmd, __pmd(pte_val(entry)));
+   }
+   }
+
+   return 0;
+}
+#endif
Index: linux-2.6/include/asm-x86/pgtable_64.h
===
--- linux-2.6.orig/include/asm-x86/pgtable_64.h	2007-11-19 15:45:07.638390147 -0800
+++ linux-2.6/include/asm-x86/pgtable_64.h	2007-11-19 15:55:53.165640248 -0800
@@ -138,6 +138,7 @@ static inline pte_t ptep_get_and_clear_f
 #define VMALLOC_START	_AC(0xc200, UL)
 #define VMALLOC_END	_AC(0xe1ff, UL)
 #define VMEMMAP_START	_AC(0xe200, UL)
+#define CPU_AREA_BASE	_AC(0x8400, UL)
 #define MODULES_VADDR	_AC(0x8800, UL)
 #define MODULES_END	_AC(0xfff0, UL)
 #define MODULES_LEN	(MODULES_END - MODULES_VADDR)
Index: linux-2.6/arch/x86/Kconfig
===
--- linux-2.6.orig/arch/x86/Kconfig 2007-11-19 15:54:10.509139813 -0800
+++ linux-2.6/arch/x86/Kconfig  2007-11-19 15:55:53.165640248 -0800
@@ -159,6 +159,19 @@ config X86_TRAMPOLINE
 
 config KTIME_SCALAR
def_bool X86_32
+
+config CPU_AREA_VIRTUAL
+   bool
+   default y
+
+config CPU_AREA_ORDER
+   int
+   default "6"
+
+config CPU_AREA_ALLOC_ORDER
+   int
+   default "0"
+
 source "init/Kconfig"
 
 menu "Processor type and features"
Index: linux-2.6/arch/x86/mm/init_32.c
