Re: [rfc 08/45] cpu alloc: x86 support
Andi Kleen wrote:
> On Tuesday 20 November 2007 04:50, Christoph Lameter wrote:
> > On Tue, 20 Nov 2007, Andi Kleen wrote:
>
> You could in theory move the modules, but then you would need to
> implement a full PIC dynamic linker for them first and also increase
> runtime overhead for them because they would need to use a GOT/PLT.

On x86-64? The GOT/PLT should stay in cache due to temporal locality. The
x86-64 instruction set handles GOT-relative addressing rather well; what is
a 1% loss on x86 is more like a 0.01% loss on x86-64 -- so, 100 times
better? I think I got this by compiling the nbyte benchmark `-fpic -pie`
versus fixed position, on 32-bit (which made about a 1% difference) and on
64-bit (which made about a 0.01% difference). It was a long time ago.

Still, yeah, I know. Complexity.

(You have the ability to textrel these things too, and just rewrite
non-PIC, depending on how you feel about that)

--
Bring back the Firefox plushy!
http://digg.com/linux_unix/Is_the_Firefox_plush_gone_for_good
https://bugzilla.mozilla.org/show_bug.cgi?id=322367
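[A rough sketch of that kind of measurement, for illustration only: the
file name and loop body are made up; the gcc flags are the interesting
part.]

	/* bench.c -- build both ways and time each:
	 *	gcc -O2 bench.c -o bench-fixed
	 *	gcc -O2 -fpic -pie bench.c -o bench-pic
	 * With -fpic (shared library semantics) accesses to an exported
	 * global typically go through the GOT, e.g.
	 *	movq counter@GOTPCREL(%rip), %rax
	 * while a fixed position build can use the address directly.
	 */
	long counter;

	int main(void)
	{
		long i;

		for (i = 0; i < 100000000L; i++)
			counter += i;		/* the global access is the point */
		return (int)(counter & 1);	/* defeat dead code elimination */
	}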
Re: [rfc 08/45] cpu alloc: x86 support
On Wed, 21 Nov 2007, Andi Kleen wrote:

> The whole mapping for all CPUs cannot fit into 2GB of course, but the
> reference linker managed range can.

Ok so you favor the solution where we subtract smp_processor_id() << shift?

> > The offset relative to %gs cannot be used if you have a loop and are
> > calculating the addresses for all instances. That is what we are
> > talking about. The CPU_xxx operations that are using the %gs register
> > are fine and are not affected by the changes we are discussing.
>
> Sure it can -- you just get the base address from a global array
> and then add the offset

Ok so generalize the data_offset for that case? I noted that other arches
and i386 have a similar solution there.

I fiddled around some more and found that the overhead that the subtraction
introduces is equivalent to loading an 8 byte constant of the base.

Keeping the usage of data_offset can avoid the shift and the add for the
__get_cpu_var case that needs CPU_PTR(..., smp_processor_id()), because the
load from data_offset avoids the shifting and adding of smp_processor_id().

For the loops this is not useful since the compiler can move the loading of
the base pointer outside of the loop (if CPU_PTR needs to load an 8 byte
constant pointer). With loading the 8 byte base the loops actually become:

	sum = 0;
	ptr = CPU_AREA_BASE;
	while (ptr < CPU_AREA_BASE + (NR_CPUS << shift)) {
		sum += *ptr;
		ptr += 1 << shift;
	}

So I think we need to go with the implementation where CPU_PTR(var, cpu) is

	CPU_AREA_BASE + (cpu << shift) + var_offset

The CPU_AREA_BASE will be loaded into a register. The var_offset usually
ends up being an offset in a mov instruction.

> > > Then the reference data would be initdata and eventually freed.
> > > That is similar to how the current per cpu data works.
> >
> > Yes that is also how the current patchset works. I just do not
> > understand what you want changed.
>
> Anyways I think your current scheme cannot work (too much VM, placed at
> the wrong place; some wrong assumptions).

The constant pointer solution fixes that. No need to despair.
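[A self-contained sketch of the arithmetic above; CPU_AREA_BASE,
CPU_AREA_SHIFT, cpu_ptr() and the explicit offset parameter are
illustrative assumptions, not the patchset's actual interface.]

	#define CPU_AREA_BASE	0xffff840000000000UL	/* assumed value */
	#define CPU_AREA_SHIFT	20			/* assumed: 1MB per cpu */

	/* CPU_PTR(var, cpu): base + (cpu << shift) + the variable's offset */
	static inline void *cpu_ptr(unsigned long var_offset, int cpu)
	{
		return (void *)(CPU_AREA_BASE +
				((unsigned long)cpu << CPU_AREA_SHIFT) +
				var_offset);
	}

	/* The loop case: the 8 byte base constant is loaded once, kept in
	 * a register, and each iteration just adds the per cpu stride. */
	static long sum_over_cpus(unsigned long var_offset, int nr_cpus)
	{
		long sum = 0;
		int cpu;

		for (cpu = 0; cpu < nr_cpus; cpu++)
			sum += *(long *)cpu_ptr(var_offset, cpu);
		return sum;
	}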
Re: [rfc 08/45] cpu alloc: x86 support
> > All you need is a 2MB area (16MB is too large if you really
> > want 16k CPUs someday) somewhere in the -2GB or probably better
> > in +2GB. Then the linker puts stuff in there and you use
> > the offsets for referencing relative to %gs.
>
> 2MB * 16k = 32GB. Even with 4k cpus we will have 2MB * 4k = 8GB; both do
> not fit in the 2GB area.

I was referring here to the 16MB/CPU you proposed originally, which will
not fit into _any_ kernel area for 16k CPUs. The whole mapping for all CPUs
cannot fit into 2GB of course, but the reference linker managed range can.

> The offset relative to %gs cannot be used if you have a loop and are
> calculating the addresses for all instances. That is what we are talking
> about. The CPU_xxx operations that are using the %gs register are fine
> and are not affected by the changes we are discussing.

Sure it can -- you just get the base address from a global array
and then add the offset.

> > Then the reference data would be initdata and eventually freed.
> > That is similar to how the current per cpu data works.
>
> Yes that is also how the current patchset works. I just do not understand
> what you want changed.

Anyways I think your current scheme cannot work (too much VM, placed at the
wrong place; some wrong assumptions). But since I seem unable to
communicate this to you I'll stop commenting and let you find it out the
hard way.

Have fun.

-Andi
Re: [rfc 08/45] cpu alloc: x86 support
On Wed, 21 Nov 2007, Andi Kleen wrote:

> On Wednesday 21 November 2007 02:16:11 Christoph Lameter wrote:
> > But one can subtract too...
>
> The linker cannot subtract (unless you add new relocation types)

The compiler knows and emits assembly to compensate.

> All you need is a 2MB area (16MB is too large if you really
> want 16k CPUs someday) somewhere in the -2GB or probably better
> in +2GB. Then the linker puts stuff in there and you use
> the offsets for referencing relative to %gs.

2MB * 16k = 32GB. Even with 4k cpus we will have 2MB * 4k = 8GB; both do
not fit in the 2GB area.

The offset relative to %gs cannot be used if you have a loop and are
calculating the addresses for all instances. That is what we are talking
about. The CPU_xxx operations that are using the %gs register are fine and
are not affected by the changes we are discussing.

> Then for all CPUs (including CPU #0) you put the real mapping
> somewhere else, copy the reference data there (which also doesn't need
> to be on the offset the linker assigned, just on a constant offset
> from it somewhere in the normal kernel data) and off you go.

Real mapping? We have constant offsets after this patchset. I do not get
what you are planning here.

> Then the reference data would be initdata and eventually freed.
> That is similar to how the current per cpu data works.

Yes that is also how the current patchset works. I just do not understand
what you want changed.
Re: [rfc 08/45] cpu alloc: x86 support
On Wednesday 21 November 2007 02:16:11 Christoph Lameter wrote:
> But one can subtract too...

The linker cannot subtract (unless you add new relocation types)

> Hmmm... So the cpu area 0 could be put at the beginning of the 2GB kernel
> area and then grow downwards from 0xffffffff80000000. The cost in terms
> of code is one subtract instruction for each per_cpu() or CPU_PTR()
>
> The next thing downward from 0xffffffff80000000 is the vmemmap at
> 0xffffe20000000000, so ~32TB. If we leave 16TB for the vmemmap (a 16TB
> vmemmap would be able to map 2^(44 - 6 + 12) = 2^50 bytes, more than
> currently supported by the processors)
>
> then the remaining 16TB could be used to map 1GB per cpu for a 16k
> config. That is wildly overdoing it. Guess we could just do it with 1M
> anyways. Just to be safe we could do 128M. 128M x 16k = 2TB?
>
> Would such a configuration be okay?

I'm not sure I really understand your problem.

All you need is a 2MB area (16MB is too large if you really want 16k CPUs
someday) somewhere in the -2GB or probably better in +2GB. Then the linker
puts stuff in there and you use the offsets for referencing relative to
%gs. But %gs can be located wherever you want in the end, at a completely
different address than you told the linker. All you're interested in were
offsets anyways.

Then for all CPUs (including CPU #0) you put the real mapping somewhere
else, copy the reference data there (which also doesn't need to be on the
offset the linker assigned, just on a constant offset from it somewhere in
the normal kernel data) and off you go.

Then the reference data would be initdata and eventually freed. That is
similar to how the current per cpu data works.

-Andi
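[A minimal sketch of the boot-time step Andi describes, with an assumed
setup_cpu_area() helper; wrmsrl()/MSR_GS_BASE and the
__per_cpu_start/__per_cpu_end linker symbols are existing kernel
facilities.]

	#include <linux/string.h>	/* memcpy */
	#include <asm/msr.h>		/* wrmsrl, MSR_GS_BASE */

	extern char __per_cpu_start[], __per_cpu_end[];	/* linker assigned */

	/* Copy the reference data into this cpu's real area and point %gs
	 * at it; the linker-assigned addresses only ever serve as offsets
	 * from whatever base %gs ends up holding. */
	static void setup_cpu_area(void *real_area)
	{
		memcpy(real_area, __per_cpu_start,
		       __per_cpu_end - __per_cpu_start);
		wrmsrl(MSR_GS_BASE, (unsigned long)real_area);
	}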
Re: [rfc 08/45] cpu alloc: x86 support
But one can subtract too... Hmmm... So the cpu area 0 could be put at the
beginning of the 2GB kernel area and then grow downwards from
0xffffffff80000000. The cost in terms of code is one subtract instruction
for each per_cpu() or CPU_PTR()

The next thing downward from 0xffffffff80000000 is the vmemmap at
0xffffe20000000000, so ~32TB. If we leave 16TB for the vmemmap (a 16TB
vmemmap, at 2^6 bytes of struct page per 2^12-byte page, would be able to
map 2^(44 - 6 + 12) = 2^50 bytes, more than currently supported by the
processors)

then the remaining 16TB could be used to map 1GB per cpu for a 16k config.
That is wildly overdoing it. Guess we could just do it with 1M anyways.
Just to be safe we could do 128M. 128M x 16k = 2TB?

Would such a configuration be okay?
Re: [rfc 08/45] cpu alloc: x86 support
On Tue, 20 Nov 2007, Christoph Lameter wrote:

> 32bit sign extension for what? Absolute data references? The addressing
> that I have seen was IP relative. Thus I thought that the kernel could be
> moved lower.

Argh. This is all depending on a special gcc option (-mcmodel=kernel) to
compile the kernel, and that option limits the kernel to the upper 2GB.

So I guess for CPU_PTR we need to explicitly load the address as a
constant, use that as a base and then add the offset and the shifted
smp_id to it in the instruction. The CPU_INC/DEC stuff using gs is not
affected.
Re: [rfc 08/45] cpu alloc: x86 support
On Tue, 20 Nov 2007, H. Peter Anvin wrote:

> But you wouldn't actually *use* this address space. It's just for the
> linker to know what address to tag the references with; it gets relocated
> by gs_base down into proper kernel space. The linker can stash the
> initialized reference copy at any address (LMA) which can be different
> from what it will be used at (VMA); that is not an issue.

That is already provided by this patchset. The cpu area starts at
absolute 0.

> To use %rip references, though, which are more efficient, you probably
> want to use offsets that are just below .text (at -2 GB); presumably
> -2 GB-[max size of percpu section]. Again, however, no CPU actually needs
> to have its data stashed in that particular location; it's just an
> offset.

Right. That is what we are discussing in another thread.
Re: [rfc 08/45] cpu alloc: x86 support
On Tue, 20 Nov 2007, Andi Kleen wrote:

> > Right so I could move the kernel to
> >
> > #define __PAGE_OFFSET      _AC(0xffff810000000000, UL)
> > #define __START_KERNEL_map _AC(0xfffffff800000000, UL)
>
> That is -31GB unless I'm miscounting. But it needs to be >= -2GB (31bits)

The __START_KERNEL_map needs to cover the 2GB that the kernel needs for
modules and the cpu area 0. The remaining 28GB can be out of range.

> Right now it is at -2GB + 2MB, because it is loaded at physical +2MB
> so it's convenient to identity map there. In theory you could avoid that
> with some effort, but that would only buy you 2MB and would also
> break some early code and earlyprintk I believe.

My proposal is -32GB + 2MB. It keeps the arrangement.

> > > You could in theory move the modules, but then you would need to
> > > implement a full PIC dynamic linker for them first and also increase
> > > runtime overhead for them because they would need to use a GOT/PLT.
> >
> > Why is it not possible to move the kernel lower while keeping bit 31 1?
>
> The kernel model relies on 32bit sign extension. This means bits [31;63]
> have to be all 1

32bit sign extension for what? Absolute data references? The addressing
that I have seen was IP relative. Thus I thought that the kernel could be
moved lower.

> > > I suspect all of this would cause far more overhead all over the
> > > kernel than you could ever save with the per cpu data in your fast
> > > paths.
> >
> > Moving the kernel down a bit seems to be trivial without any of the
> > weird solutions.
>
> Another one I came up with in the previous mail would be to do the linker
> reference variable allocation in [0;2GB] positive space; but do all real
> references only %gs relative. And keep the real data copy on some other
> address. That would be a similar trick to the old style x86-64 vsyscalls.
> It gets fairly messy in the linker map file though.

The linker references are to per cpu data already in the 0-MAX_CPU_AREA
range after this patchset. The problem is the relocation of the references
when the linker calculates the address of a per cpu variable.

F.e. the pda is placed at absolute address 0. In order to access the pda
we have to do either

1. A CPU_PTR(offsetof pda, cpu), which is offsetof pda (so 0) + cpu_area +
   (cpu << area_shift). The compiler currently folds offsetof pda +
   cpu_area and then adds the shift later.

or

2. We use a CPU_xx op with a segment register. Then we have no need to add
   cpu_area because it is included in the gs offset.

So only case 1 is affected, which is not used if cpu ops can be used. There
are still critical path uses of that operation, so I would really like to
avoid the explicit calculation. If we would go the proposed route then the
folding of the address by the linker would have to be replaced by an
explicit calculation.
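[To make the two cases concrete, a sketch using the same assumed
CPU_AREA_BASE/CPU_AREA_SHIFT names as the earlier sketch; these are not the
patchset's actual macros.]

	/* Case 1: explicit computation; the linker can fold the variable
	 * offset and the base into one constant, but the cpu term still
	 * costs a shift and an add. */
	static inline void *cpu_ptr_case1(unsigned long var_offset, int cpu)
	{
		return (void *)(CPU_AREA_BASE +
				((unsigned long)cpu << CPU_AREA_SHIFT) +
				var_offset);
	}

	/* Case 2: the per cpu base travels in %gs, so the instruction
	 * carries only the variable's offset. */
	static inline void cpu_inc_case2(unsigned long var_offset)
	{
		asm volatile("incq %%gs:(%0)" : : "r" (var_offset) : "memory");
	}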
Re: [rfc 08/45] cpu alloc: x86 support
Christoph Lameter wrote:
> On Tue, 20 Nov 2007, Andi Kleen wrote:
> > > This limitation shouldn't apply to the percpu area, since gs_base can
> > > be pointed anywhere in the address space -- in effect we're always
> > > indirect.
> >
> > The initial reference copy of the percpu area has to be addressed by
> > the linker.
>
> Right, that is important for the percpu references that can be folded by
> the linker in order to avoid address calculations.
>
> > Hmm, in theory since it is not actually used by itself I suppose you
> > could move it into positive space.
>
> But the positive space is reserved for a process's memory.

But you wouldn't actually *use* this address space. It's just for the
linker to know what address to tag the references with; it gets relocated
by gs_base down into proper kernel space. The linker can stash the
initialized reference copy at any address (LMA) which can be different from
what it will be used at (VMA); that is not an issue.

To use %rip references, though, which are more efficient, you probably want
to use offsets that are just below .text (at -2 GB); presumably
-2 GB-[max size of percpu section]. Again, however, no CPU actually needs
to have its data stashed in that particular location; it's just an offset.

-hpa
Re: [rfc 08/45] cpu alloc: x86 support
Andi Kleen wrote:
> > This limitation shouldn't apply to the percpu area, since gs_base can
> > be pointed anywhere in the address space -- in effect we're always
> > indirect.
>
> The initial reference copy of the percpu area has to be addressed by the
> linker.
>
> Hmm, in theory since it is not actually used by itself I suppose you
> could move it into positive space.

Positive space for absolute references, or just below -2 GB for %rip
references; either should work.

-hpa
Re: [rfc 08/45] cpu alloc: x86 support
> > Right so I could move the kernel to
> >
> > #define __PAGE_OFFSET      _AC(0xffff810000000000, UL)
> > #define __START_KERNEL_map _AC(0xfffffff800000000, UL)

That is -31GB unless I'm miscounting. But it needs to be >= -2GB (31bits)

Right now it is at -2GB + 2MB, because it is loaded at physical +2MB so
it's convenient to identity map there. In theory you could avoid that with
some effort, but that would only buy you 2MB and would also break some
early code and earlyprintk I believe.

> > You could in theory move the modules, but then you would need to
> > implement a full PIC dynamic linker for them first and also increase
> > runtime overhead for them because they would need to use a GOT/PLT.
>
> Why is it not possible to move the kernel lower while keeping bit 31 1?

The kernel model relies on 32bit sign extension. This means bits [31;63]
have to be all 1.

> > I suspect all of this would cause far more overhead all over the kernel
> > than you could ever save with the per cpu data in your fast paths.
>
> Moving the kernel down a bit seems to be trivial without any of the weird
> solutions.

Another one I came up with in the previous mail would be to do the linker
reference variable allocation in [0;2GB] positive space; but do all real
references only %gs relative. And keep the real data copy on some other
address. That would be a similar trick to the old style x86-64 vsyscalls.
It gets fairly messy in the linker map file though.

-Andi
Re: [rfc 08/45] cpu alloc: x86 support
On Tue, 20 Nov 2007, Andi Kleen wrote:

> > This limitation shouldn't apply to the percpu area, since gs_base can
> > be pointed anywhere in the address space -- in effect we're always
> > indirect.
>
> The initial reference copy of the percpu area has to be addressed by
> the linker.

Right, that is important for the percpu references that can be folded by
the linker in order to avoid address calculations.

> Hmm, in theory since it is not actually used by itself I suppose you
> could move it into positive space.

But the positive space is reserved for a process's memory.
Re: [rfc 08/45] cpu alloc: x86 support
> This limitation shouldn't apply to the percpu area, since gs_base can be
> pointed anywhere in the address space -- in effect we're always indirect.

The initial reference copy of the percpu area has to be addressed by
the linker.

Hmm, in theory since it is not actually used by itself I suppose you could
move it into positive space.

-Andi
Re: [rfc 08/45] cpu alloc: x86 support
Andi Kleen wrote:
> On Tuesday 20 November 2007 04:50, Christoph Lameter wrote:
> > On Tue, 20 Nov 2007, Andi Kleen wrote:
> > > I might be pointing out the obvious, but on x86-64 there is
> > > definitely not 256TB of VM available for this.
> >
> > Well maybe in the future.
>
> That would either require more than 4 levels or larger pages in page
> tables.
>
> > One of the issues that I ran into is that I had to place the cpu area
> > in between to make the offsets link right.
>
> Above -2GB, otherwise you cannot address them

This limitation shouldn't apply to the percpu area, since gs_base can be
pointed anywhere in the address space -- in effect we're always indirect.

Obviously the offsets *within* the percpu area have to be in range (±2 GB
per cpu for absolute offsets, slightly smaller for %rip-based addressing --
obviously judicious use of an offset for gs_base is essential in the latter
case). Thus you want the percpu areas below -2 GB where they don't
interfere with modules or any other precious address space.

-hpa
Re: [rfc 08/45] cpu alloc: x86 support
On Tue, 20 Nov 2007, Andi Kleen wrote:

> > So I think we have a 2GB area right?
>
> For everything that needs the -31bit offsets; that is everything linked

Of course.

> > 	1GB kernel
> > 	1GB - 1x per cpu area (128M?) modules?
> > 	cpu area 0
> > 	--- 2GB limit ---
> > 	cpu area 1
> > 	cpu area 2
> >
> > For that we would need to move the kernel down a bit. Can we do that?
>
> The kernel model requires kernel and modules and everything else
> linked be in negative -31bit space. That is how the kernel code model is
> defined.

Right so I could move the kernel to

#define __PAGE_OFFSET      _AC(0xffff810000000000, UL)
#define __START_KERNEL_map _AC(0xfffffff800000000, UL)
#define KERNEL_TEXT_START  _AC(0xfffffff800000000, UL)
					/* 30 bits = 1GB for kernel text */
#define MODULES_VADDR      _AC(0xfffffff880000000, UL)
					/* 30 bits = 1GB for modules */
#define MODULES_END        _AC(0xfffffff8f0000000, UL)
#define CPU_AREA_BASE      _AC(0xfffffff8f0000000, UL)
					/* 31 bits, 256MB for cpu area 0 */
#define CPU_AREA_BASE1     _AC(0xfffffff900000000, UL)
			/* More cpu areas for higher numbered processors */
#define CPU_AREA_END       _AC(0xffffffffffffffff, UL)

> You could in theory move the modules, but then you would need to
> implement a full PIC dynamic linker for them first and also increase
> runtime overhead for them because they would need to use a GOT/PLT.

Why is it not possible to move the kernel lower while keeping bit 31 1?

> I suspect all of this would cause far more overhead all over the kernel
> than you could ever save with the per cpu data in your fast paths.

Moving the kernel down a bit seems to be trivial without any of the weird
solutions.
Re: [rfc 08/45] cpu alloc: x86 support
On Tuesday 20 November 2007 04:50, Christoph Lameter wrote:
> On Tue, 20 Nov 2007, Andi Kleen wrote:
> > I might be pointing out the obvious, but on x86-64 there is definitely
> > not 256TB of VM available for this.
>
> Well maybe in the future.

That would either require more than 4 levels or larger pages in page
tables.

> One of the issues that I ran into is that I had to place the cpu area
> in between to make the offsets link right.

Above -2GB, otherwise you cannot address them.

If you can move all the other CPUs somewhere else it might work. But even
then 16MB/cpu max is unrealistic. Perhaps 1M/CPU max -- then 16k CPUs would
be 128GB which could still fit into the existing vmalloc area.

> However, it would be best if the cpu area came *after* the modules area.
> We only need linking that covers the per cpu area of processor 0.
>
> So I think we have a 2GB area right?

For everything that needs the -31bit offsets; that is everything linked.

> 	1GB kernel
> 	1GB - 1x per cpu area (128M?) modules?
> 	cpu area 0
> 	--- 2GB limit ---
> 	cpu area 1
> 	cpu area 2
>
> For that we would need to move the kernel down a bit. Can we do that?

The kernel model requires kernel and modules and everything else linked be
in negative -31bit space. That is how the kernel code model is defined.

You could in theory move the modules, but then you would need to implement
a full PIC dynamic linker for them first and also increase runtime overhead
for them because they would need to use a GOT/PLT.

Or you could switch the kernel over to the large model, which is very
costly and has toolchain problems. Or use the UML trick and run the kernel
PIC, but again that causes overhead.

I suspect all of this would cause far more overhead all over the kernel
than you could ever save with the per cpu data in your fast paths.

-Andi
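[For reference, a sketch of what the code model constraint looks like in
practice; the source file is made up, while -mcmodel=kernel and
-mcmodel=large are real gcc options.]

	/* foo.c */
	extern long g;
	long get(void) { return g; }

	/*
	 * gcc -O2 -mcmodel=kernel -S foo.c
	 *	movq	g(%rip), %rax	# symbol must sit in the
	 *				# sign-extended upper 2GB
	 *
	 * gcc -O2 -mcmodel=large -S foo.c
	 *	movabsq	$g, %rax	# full 64 bit absolute address:
	 *	movq	(%rax), %rax	# bigger, slower code everywhere
	 */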
Re: [rfc 08/45] cpu alloc: x86 support
> > Yeah yeah, but the latencies are minimal, making the NUMA logic too
> > expensive for most loads ... If you put a NUMA kernel onto those then
> > performance drops (I think someone measured 15-30%?)
>
> Small socket count systems are going to increasingly be NUMA in future.
> If CONFIG_NUMA hurts performance by that much on those systems, then the
> kernel is broken IMO.

Not sure where that number came from. In my tests some time ago NUMA
overhead on SMP was minimal. This was admittedly with old 2.4 kernels.

There have been some doubts about some of the newer NUMA features added; in
particular about the NUMA slab. I don't think there was much trouble with
anything else -- in fact the trouble was that it apparently sometimes made
moderate-NUMA-factor systems slower too. But I assume SLUB will address
this anyways.

-Andi
Re: [rfc 08/45] cpu alloc: x86 support
On Tuesday 20 November 2007 13:02, Christoph Lameter wrote:
> On Mon, 19 Nov 2007, H. Peter Anvin wrote:
> > You're making the assumption here that NUMA = large number of CPUs.
> > This assumption is flat-out wrong.
>
> Well maybe. Usually one gets to NUMA because the hardware gets too big to
> be handled the UMA way.
>
> > On x86-64, most two-socket systems are still NUMA, and I would expect
> > that most distro kernels probably compile in NUMA. However, burning
> > megabytes of memory on a two-socket dual-core system when we're talking
> > about tens of kilobytes used would be more than a wee bit insane.
>
> Yeah yeah, but the latencies are minimal, making the NUMA logic too
> expensive for most loads ... If you put a NUMA kernel onto those then
> performance drops (I think someone measured 15-30%?)

Small socket count systems are going to increasingly be NUMA in future. If
CONFIG_NUMA hurts performance by that much on those systems, then the
kernel is broken IMO.
Re: [rfc 08/45] cpu alloc: x86 support
On Tue, 20 Nov 2007, Andi Kleen wrote:

> I might be pointing out the obvious, but on x86-64 there is definitely
> not 256TB of VM available for this.

Well maybe in the future. One of the issues that I ran into is that I had
to place the cpu area in between to make the offsets link right. However,
it would be best if the cpu area came *after* the modules area. We only
need linking that covers the per cpu area of processor 0.

So I think we have a 2GB area right?

	1GB kernel
	1GB - 1x per cpu area (128M?) modules?
	cpu area 0
	--- 2GB limit ---
	cpu area 1
	cpu area 2

For that we would need to move the kernel down a bit. Can we do that?
Re: [rfc 08/45] cpu alloc: x86 support
On Tuesday 20 November 2007 13:02, Christoph Lameter wrote:
> On Mon, 19 Nov 2007, H. Peter Anvin wrote:
> > You're making the assumption here that NUMA = large number of CPUs.
> > This assumption is flat-out wrong.
>
> Well maybe. Usually one gets to NUMA because the hardware gets too big to
> be handled the UMA way.

Not the way things are going with multicore and multithread, though (that
is, the hardware can be one socket and still have many cpus). The chip
might have several memory controllers on it, but they could well be
connected to the caches with a crossbar, so it needn't be NUMA at all.

Future scalability work shouldn't rely on many cores ~= many nodes, IMO.
Re: [rfc 08/45] cpu alloc: x86 support
> 4k cpu configurations with 1k nodes:
>
>	4096 * 16MB = 64TB of virtual space.
>
> Maximum theoretical configuration 16384 processors 1k nodes:
>
>	16384 * 16MB = 256TB of virtual space.
>
> Both fit within the established limits.

I might be pointing out the obvious, but on x86-64 there is definitely not
256TB of VM available for this. Not even 64TB, as long as you want to have
any other mappings in the kernel (total kernel memory is 128TB, but it is
split in half for the direct mapping).

BTW if you allocate any VM you should also update
Documentation/x86_64/mm.txt which describes the mapping.

> Index: linux-2.6/include/asm-x86/pgtable_64.h
> ===================================================================
> --- linux-2.6.orig/include/asm-x86/pgtable_64.h	2007-11-19 15:45:07.638390147 -0800
> +++ linux-2.6/include/asm-x86/pgtable_64.h	2007-11-19 15:55:53.165640248 -0800
> @@ -138,6 +138,7 @@ static inline pte_t ptep_get_and_clear_f
>  #define VMALLOC_START _AC(0xffffc20000000000, UL)
>  #define VMALLOC_END   _AC(0xffffe1ffffffffff, UL)
>  #define VMEMMAP_START _AC(0xffffe20000000000, UL)
> +#define CPU_AREA_BASE _AC(0x8400, UL)

That's slightly less than 1GB before you bump into the maximum. But you'll
bump into the module mapping even before that. For 16MB/CPU and the full
1GB that's ~123 CPUs if my calculations are correct. Even for a not-Altix
that's quite tight. I suppose 16MB/CPU is too large.

-Andi
Re: [rfc 08/45] cpu alloc: x86 support
Christoph Lameter wrote:
> On Mon, 19 Nov 2007, H. Peter Anvin wrote:
> > You're making the assumption here that NUMA = large number of CPUs. This
> > assumption is flat-out wrong.
>
> Well maybe. Usually one gets to NUMA because the hardware gets too big to
> be handled the UMA way.
>
> > On x86-64, most two-socket systems are still NUMA, and I would expect
> > that most distro kernels probably compile in NUMA. However,
> > burning megabytes of memory on a two-socket dual-core system when we're
> > talking about tens of kilobytes used would be more than a wee bit insane.
>
> Yeah, yeah, but the latencies are minimal, making the NUMA logic too
> expensive for most loads ... If you put a NUMA kernel onto those then
> performance drops (I think someone measured 15-30%?)

How do you handle this memory in the first place? Do you allocate the whole
2 MB for the particular CPU, or do you reclaim the upper part of the large
page? (I haven't dug far enough into the source to tell.)

	-hpa
Re: [rfc 08/45] cpu alloc: x86 support
On Mon, 19 Nov 2007, H. Peter Anvin wrote:

> You're making the assumption here that NUMA = large number of CPUs. This
> assumption is flat-out wrong.

Well maybe. Usually one gets to NUMA because the hardware gets too big to
be handled the UMA way.

> On x86-64, most two-socket systems are still NUMA, and I would expect that
> most distro kernels probably compile in NUMA. However,
> burning megabytes of memory on a two-socket dual-core system when we're
> talking about tens of kilobytes used would be more than a wee bit insane.

Yeah, yeah, but the latencies are minimal, making the NUMA logic too
expensive for most loads ... If you put a NUMA kernel onto those then
performance drops (I think someone measured 15-30%?)
Re: [rfc 08/45] cpu alloc: x86 support
Christoph Lameter wrote:
> For the UP and SMP case map the area using 4k ptes. Typical use of per cpu
> data is around 16k for UP and SMP configurations. It goes up to 45k when
> the per cpu area is managed by cpu_alloc (see the special x86_64 patchset).
> Allocating in 2M segments would be overkill.
>
> For NUMA map the area using 2M PMDs. A large NUMA system may use lots of
> cpu data for the page allocator data alone. We typically have large
> amounts of memory around on systems of that size. Using a 2M page size
> reduces TLB pressure for that case.
>
> Some numbers for envisioned maximum configurations of NUMA systems:
>
> 4k cpu configurations with 1k nodes:
>
> 	4096 * 16MB = 64TB of virtual space.
>
> Maximum theoretical configuration 16384 processors 1k nodes:
>
> 	16384 * 16MB = 256TB of virtual space.
>
> Both fit within the established limits.

You're making the assumption here that NUMA = large number of CPUs. This
assumption is flat-out wrong.

On x86-64, most two-socket systems are still NUMA, and I would expect that
most distro kernels probably compile in NUMA. However, burning megabytes of
memory on a two-socket dual-core system when we're talking about tens of
kilobytes used would be more than a wee bit insane.

I do like the concept, overall, but the above distinction needs to be
fixed.

	-hpa
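To put a rough number on that concern: under the patch's NUMA path each
populated PMD allocates a full 2MB block, so a sketch of the floor cost on
a small box, assuming one populated 2M page per cpu (the minimum the 2M
mapping can hand out), looks like this:

/* Floor cost of 2M PMD mappings on a small NUMA box: a two-socket,
 * dual-core system (4 cpus) with ~16k of per cpu data actually used.
 * Assumes one populated 2MB page per cpu, the minimum under the
 * patch's NUMA mapping; the usage figure is from the description. */
#include <stdio.h>

int main(void)
{
	unsigned long cpus = 4;
	unsigned long used_per_cpu = 16UL << 10;	/* ~16k really used */
	unsigned long mapped_per_cpu = 2UL << 20;	/* one 2M PMD page  */

	printf("committed: %lu MB, used: %lu KB\n",
	       cpus * mapped_per_cpu >> 20,		/* 8 MB  */
	       cpus * used_per_cpu >> 10);		/* 64 KB */
	return 0;
}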
[rfc 08/45] cpu alloc: x86 support
64 bit:

Set up a cpu area that allows the use of up to 16MB for each processor.

Cpu memory use can grow a bit. F.e. if we assume that a pageset occupies 64
bytes of memory and we have 3 zones in each of 1024 nodes then we need
3 * 1k * 16k = 50 million pagesets, or 3072 pagesets per processor. This
results in a total of 3.2 GB of pagesets. Each cpu needs around 200k of cpu
storage for the page allocator alone. So it's worth it to use a 2M huge
mapping here.

For the UP and SMP case map the area using 4k ptes. Typical use of per cpu
data is around 16k for UP and SMP configurations. It goes up to 45k when
the per cpu area is managed by cpu_alloc (see the special x86_64 patchset).
Allocating in 2M segments would be overkill.

For NUMA map the area using 2M PMDs. A large NUMA system may use lots of
cpu data for the page allocator data alone. We typically have large amounts
of memory around on systems of that size. Using a 2M page size reduces TLB
pressure for that case.

Some numbers for envisioned maximum configurations of NUMA systems:

4k cpu configurations with 1k nodes:

	4096 * 16MB = 64TB of virtual space.

Maximum theoretical configuration 16384 processors 1k nodes:

	16384 * 16MB = 256TB of virtual space.

Both fit within the established limits.

32 bit:

Set up a 256 kB area for the cpu areas below the FIXADDR area. The use of
the cpu alloc area is pretty minimal on i386. An 8p system with no extras
uses only ~8kb. So 256kb should be plenty. A configuration that supports up
to 8 processors takes up 2MB of the scarce virtual address space.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

---
 arch/x86/Kconfig                 |   13 +++++++++++++
 arch/x86/kernel/vmlinux_32.lds.S |    1 +
 arch/x86/kernel/vmlinux_64.lds.S |    3 +++
 arch/x86/mm/init_32.c            |    3 +++
 arch/x86/mm/init_64.c            |   38 ++++++++++++++++++++++++++++++++++++++
 include/asm-x86/pgtable_32.h     |    7 +++++--
 include/asm-x86/pgtable_64.h     |    1 +
 7 files changed, 64 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c	2007-11-19 15:45:07.602390533 -0800
+++ linux-2.6/arch/x86/mm/init_64.c	2007-11-19 15:55:53.165640248 -0800
@@ -781,3 +781,41 @@ int __meminit vmemmap_populate(struct pa
 	return 0;
 }
 #endif
+
+#ifdef CONFIG_NUMA
+int __meminit cpu_area_populate(void *start, unsigned long size,
+				gfp_t flags, int node)
+{
+	unsigned long addr = (unsigned long)start;
+	unsigned long end = addr + size;
+	unsigned long next;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	for (; addr < end; addr = next) {
+		next = pmd_addr_end(addr, end);
+
+		pgd = cpu_area_pgd_populate(addr, flags, node);
+		if (!pgd)
+			return -ENOMEM;
+		pud = cpu_area_pud_populate(pgd, addr, flags, node);
+		if (!pud)
+			return -ENOMEM;
+
+		pmd = pmd_offset(pud, addr);
+		if (pmd_none(*pmd)) {
+			pte_t entry;
+			void *p = cpu_area_alloc_block(PMD_SIZE, flags, node);
+			if (!p)
+				return -ENOMEM;
+
+			entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
+			mk_pte_huge(entry);
+			set_pmd(pmd, __pmd(pte_val(entry)));
+		}
+	}
+
+	return 0;
+}
+#endif

Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h	2007-11-19 15:45:07.638390147 -0800
+++ linux-2.6/include/asm-x86/pgtable_64.h	2007-11-19 15:55:53.165640248 -0800
@@ -138,6 +138,7 @@ static inline pte_t ptep_get_and_clear_f
 #define VMALLOC_START	_AC(0xffffc20000000000, UL)
 #define VMALLOC_END	_AC(0xffffe1ffffffffff, UL)
 #define VMEMMAP_START	_AC(0xffffe20000000000, UL)
+#define CPU_AREA_BASE	_AC(0xffffffff84000000, UL)
 #define MODULES_VADDR	_AC(0xffffffff88000000, UL)
 #define MODULES_END	_AC(0xfffffffffff00000, UL)
 #define MODULES_LEN	(MODULES_END - MODULES_VADDR)

Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig	2007-11-19 15:54:10.509139813 -0800
+++ linux-2.6/arch/x86/Kconfig	2007-11-19 15:55:53.165640248 -0800
@@ -159,6 +159,19 @@ config X86_TRAMPOLINE
 config KTIME_SCALAR
 	def_bool X86_32
 
+config CPU_AREA_VIRTUAL
+	bool
+	default y
+
+config CPU_AREA_ORDER
+	int
+	default "6"
+
+config CPU_AREA_ALLOC_ORDER
+	int
+	default "0"
+
 source "init/Kconfig"
 
 menu "Processor type and features"

Index: linux-2.6/arch/x86/mm/init_32.c
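For illustration, the per cpu addressing that a fixed virtual area like
this enables reduces to constant-base arithmetic. A minimal sketch,
assuming the 16MB-per-processor stride from the description; cpu_area_ptr
and CPU_AREA_SHIFT are invented names for this example, not from the patch:

/* Sketch only: how a per cpu pointer falls out of the fixed virtual
 * area. CPU_AREA_BASE is the constant the patch adds to pgtable_64.h
 * (reconstructed here); the 16MB stride gives a shift of 24. */
#ifndef CPU_AREA_BASE
#define CPU_AREA_BASE	0xffffffff84000000UL
#endif
#define CPU_AREA_SHIFT	24			/* 16MB per processor */

static inline void *cpu_area_ptr(unsigned long var_offset, int cpu)
{
	return (void *)(CPU_AREA_BASE +
			((unsigned long)cpu << CPU_AREA_SHIFT) + var_offset);
}

The base is a compile-time constant, so the compiler can fold it into a
single mov with an offset, and in a loop over cpus only the shift-and-add
varies per iteration.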