Re: [PATCH 1/8] fix bootmem reservation on uninitialized node
Paul Mackerras wrote:
> Dave Hansen writes:
>> This patch ensures that we do not touch bootmem for any node which has not been initialized.
>>
>> Signed-off-by: Dave Hansen [EMAIL PROTECTED]
>
> So, should I be sending this to Linus for 2.6.28?
>
> I notice you have added a dbg() call. For a 2.6.28 patch I'd somewhat prefer not to have that in unless necessary.
>
> Jon, does this patch fix the problem on your machine with 16G pages?

It worked on a machine with one page, I am awaiting access to another with more pages.

> Paul.

Jon
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] Fix boot freeze on machine with empty memory node
Dave Hansen wrote:
> I got a bug report about a distro kernel not booting on a particular machine. It would freeze during boot:
>
> ...
> Could not find start_pfn for node 1
> [boot]0015 Setup Done
> Built 2 zonelists in Node order, mobility grouping on. Total pages: 123783
> Policy zone: DMA
> Kernel command line:
> [boot]0020 XICS Init
> [boot]0021 XICS Done
> PID hash table entries: 4096 (order: 12, 32768 bytes)
> clocksource: timebase mult[7d] shift[22] registered
> Console: colour dummy device 80x25
> console handover: boot [udbg0] - real [hvc0]
> Dentry cache hash table entries: 1048576 (order: 7, 8388608 bytes)
> Inode-cache hash table entries: 524288 (order: 6, 4194304 bytes)
> freeing bootmem node 0
>
> I've reproduced this on 2.6.27.7. I'm pretty sure it is caused by this patch:
>
> http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8f64e1f2d1e09267ac926e15090fd505c1c0cbcb
>
> The problem is that Jon took a loop which was (in pseudocode):
>
>	for_each_node(nid)
>		NODE_DATA(nid) = careful_alloc(nid);
>		setup_bootmem(nid);
>		reserve_node_bootmem(nid);
>
> and broke it up into:
>
>	for_each_node(nid)
>		NODE_DATA(nid) = careful_alloc(nid);
>		setup_bootmem(nid);
>	for_each_node(nid)
>		reserve_node_bootmem(nid);
>
> The issue comes in when the 'careful_alloc()' is called on a node with no memory. It falls back to using bootmem from a previously-initialized node. But, bootmem has not yet been reserved when Jon's patch is applied. It gives back bogus memory (0xc000) and pukes later in boot.
>
> The following patch collapses the loop back together. It also breaks the mark_reserved_regions_for_nid() code out into a function and adds some comments. I think a huge part of introducing this bug is because the for loop was too long and hard to read.
>
> The actual bug fix here is the:
>
> +		if (end_pfn <= node->node_start_pfn ||
> +		    start_pfn >= node_end_pfn)
> +			continue;
>
> Signed-off-by: Dave Hansen [EMAIL PROTECTED]
>
> diff -ru linux-2.6.27.7.orig/arch/powerpc//mm/numa.c linux-2.6.27.7/arch/powerpc//mm/numa.c
> --- linux-2.6.27.7.orig/arch/powerpc//mm/numa.c	2008-11-20 17:02:37.0 -0600
> +++ linux-2.6.27.7/arch/powerpc//mm/numa.c	2008-11-24 15:53:35.0 -0600
> @@ -822,6 +822,67 @@
>  	.priority = 1 /* Must run before sched domains notifier. */
>  };
>
> +static void mark_reserved_regions_for_nid(int nid)
> +{
> +	struct pglist_data *node = NODE_DATA(nid);
> +	int i;
> +
> +	for (i = 0; i < lmb.reserved.cnt; i++) {
> +		unsigned long physbase = lmb.reserved.region[i].base;
> +		unsigned long size = lmb.reserved.region[i].size;
> +		unsigned long start_pfn = physbase >> PAGE_SHIFT;
> +		unsigned long end_pfn = ((physbase + size) >> PAGE_SHIFT);
> +		struct node_active_region node_ar;
> +		unsigned long node_end_pfn = node->node_start_pfn +
> +					     node->node_spanned_pages;
> +
> +		/*
> +		 * Check to make sure that this lmb.reserved area is
> +		 * within the bounds of the node that we care about.
> +		 * Checking the nid of the start and end points is not
> +		 * sufficient because the reserved area could span the
> +		 * entire node.
> +		 */
> +		if (end_pfn <= node->node_start_pfn ||
> +		    start_pfn >= node_end_pfn)
> +			continue;
> +
> +		get_node_active_region(start_pfn, &node_ar);
> +		while (start_pfn < end_pfn &&
> +			node_ar.start_pfn < node_ar.end_pfn) {
> +			unsigned long reserve_size = size;
> +			/*
> +			 * if reserved region extends past active region
> +			 * then trim size to active region
> +			 */
> +			if (end_pfn > node_ar.end_pfn)
> +				reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
> +					- (start_pfn << PAGE_SHIFT);
> +			dbg("reserve_bootmem %lx %lx nid=%d\n", physbase,
> +				reserve_size, node_ar.nid);
> +			reserve_bootmem_node(NODE_DATA(node_ar.nid), physbase,
> +						reserve_size, BOOTMEM_DEFAULT);
> +			/*
> +			 * if reserved region is contained in the active region
> +			 * then done.
> +			 */
> +			if (end_pfn <= node_ar.end_pfn)
> +				break;
> +
> +			/*
> +			 * reserved region extends past the active region
> +			 * get next active region that contains this
> +			 * reserved region
> +			 */
> +
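The bug-fix condition quoted above is a test on two half-open pfn ranges. A minimal standalone sketch of that test follows; the helper name `reserved_overlaps_node` and its parameters are hypothetical and do not appear in the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of the patch): returns true when the
 * half-open pfn range [start_pfn, end_pfn) shares at least one page
 * with the node's range [node_start_pfn, node_end_pfn).
 */
static bool reserved_overlaps_node(unsigned long start_pfn,
				   unsigned long end_pfn,
				   unsigned long node_start_pfn,
				   unsigned long node_end_pfn)
{
	/* Same comparison as the patch's "continue" condition, negated. */
	if (end_pfn <= node_start_pfn || start_pfn >= node_end_pfn)
		return false;
	return true;
}
```

A reserved range that starts before the node and ends after it still passes this test, which is the case the patch comment says a nid check on the two end points would miss.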
[PATCH 1/1 v2] powerpc: hugetlb pgtable cache access cleanup
It was suggested by Andrew that using a macro that made an array look like a function call made it harder to understand the code. Cleaned up use of macro. We now reference the pgtable_cache array directly instead of using a macro.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
Cc: Nick Piggin [EMAIL PROTECTED]
Cc: Paul Mackerras [EMAIL PROTECTED]
Cc: Benjamin Herrenschmidt [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Acked-by: David Gibson [EMAIL PROTECTED]
---

 arch/powerpc/mm/hugetlbpage.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff -puN arch/powerpc/mm/hugetlbpage.c~powerpc-hugetlb-pgtable-cache-access-cleanup arch/powerpc/mm/hugetlbpage.c
--- a/arch/powerpc/mm/hugetlbpage.c~powerpc-hugetlb-pgtable-cache-access-cleanup
+++ a/arch/powerpc/mm/hugetlbpage.c
@@ -53,8 +53,7 @@ unsigned int mmu_huge_psizes[MMU_PAGE_CO

 /* Subtract one from array size because we don't need a cache for 4K since
  * is not a huge page size */
-#define huge_pgtable_cache(psize)	(pgtable_cache[HUGEPTE_CACHE_NUM \
-							+ psize-1])
+#define HUGE_PGTABLE_INDEX(psize)	(HUGEPTE_CACHE_NUM + psize - 1)
 #define HUGEPTE_CACHE_NAME(psize)	(huge_pgtable_cache_name[psize])

 static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
@@ -113,7 +112,7 @@ static inline pte_t *hugepte_offset(huge
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned int psize)
 {
-	pte_t *new = kmem_cache_zalloc(huge_pgtable_cache(psize),
+	pte_t *new = kmem_cache_zalloc(pgtable_cache[HUGE_PGTABLE_INDEX(psize)],
 				      GFP_KERNEL|__GFP_REPEAT);

 	if (! new)
@@ -121,7 +120,7 @@ static int __hugepte_alloc(struct mm_str

 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(huge_pgtable_cache(psize), new);
+		kmem_cache_free(pgtable_cache[HUGE_PGTABLE_INDEX(psize)], new);
 	else
 		hpdp->pd = (unsigned long)new | HUGEPD_OK;
 	spin_unlock(&mm->page_table_lock);
@@ -760,13 +759,14 @@ static int __init hugetlbpage_init(void)

 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			huge_pgtable_cache(psize) = kmem_cache_create(
-						HUGEPTE_CACHE_NAME(psize),
-						HUGEPTE_TABLE_SIZE(psize),
-						HUGEPTE_TABLE_SIZE(psize),
-						0,
-						NULL);
-			if (!huge_pgtable_cache(psize))
+			pgtable_cache[HUGE_PGTABLE_INDEX(psize)] =
+				kmem_cache_create(
+					HUGEPTE_CACHE_NAME(psize),
+					HUGEPTE_TABLE_SIZE(psize),
+					HUGEPTE_TABLE_SIZE(psize),
+					0,
+					NULL);
+			if (!pgtable_cache[HUGE_PGTABLE_INDEX(psize)])
 				panic("hugetlbpage_init(): could not create %s"\
 				      "\n", HUGEPTE_CACHE_NAME(psize));
 		}
_
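The readability point of the cleanup can be seen in a small userspace sketch: the macro now only computes an index, and every access shows that `pgtable_cache` is an array. The constant values, the `const char *` element type, and the two accessor functions are assumptions made for illustration; the kernel uses its own definitions and `struct kmem_cache` pointers.

```c
#include <stddef.h>

/* Assumed values, for illustration only. */
#define HUGEPTE_CACHE_NUM	2
#define MMU_PAGE_COUNT		6

/* Stand-in for the kernel's array of cache pointers. */
static const char *pgtable_cache[HUGEPTE_CACHE_NUM + MMU_PAGE_COUNT - 1];

/* Index macro in the style of the patch: yields a plain array index. */
#define HUGE_PGTABLE_INDEX(psize)	(HUGEPTE_CACHE_NUM + (psize) - 1)

/* Hypothetical accessors: the array subscript is visible at each use. */
static void set_huge_cache(unsigned int psize, const char *cache)
{
	pgtable_cache[HUGE_PGTABLE_INDEX(psize)] = cache;
}

static const char *get_huge_cache(unsigned int psize)
{
	return pgtable_cache[HUGE_PGTABLE_INDEX(psize)];
}
```

The sketch parenthesizes `psize` inside the macro, which the patch does not; that is a defensive habit for macros taking expressions rather than a correction to the patch.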
16G related patches for stable kernel 2.6.27
Please consider the following patches for the 2.6.27 stable tree. The first two allow a powerpc machine with more than 2 NUMA nodes to boot when 16G pages are enabled. The third one allows a powerpc machine to boot if using 16G pages and the mem= boot param.

thanks,
Jon

powerpc: Reserve in bootmem lmb reserved regions that cross NUMA nodes
commit 8f64e1f2d1e09267ac926e15090fd505c1c0cbcb

powerpc/numa: Make memory reserve code more robust
commit e81703724a966120ace6504c993bda9e084cbf3e

powerpc: Don't use a 16G page if beyond mem= limits
commit 4792adbac9eb41cea77a45ab76258ea10d411173
[PATCH] powerpc: Don't use a 16G page if beyond mem= limits
If mem= is used on the boot command line to limit memory then the memory block where a 16G page resides may not be available.

Thanks to Michael Ellerman for finding the problem.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

 arch/powerpc/mm/hash_utils_64.c |    6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 5c64af1..8d5b475 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -382,8 +382,10 @@ static int __init htab_dt_scan_hugepage_blocks(unsigned long node,
 	printk(KERN_INFO "Huge page(16GB) memory: "
 			"addr = 0x%lX size = 0x%lX pages = %d\n",
 			phys_addr, block_size, expected_pages);
-	lmb_reserve(phys_addr, block_size * expected_pages);
-	add_gpage(phys_addr, block_size, expected_pages);
+	if (phys_addr + (16 * GB) <= lmb_end_of_DRAM()) {
+		lmb_reserve(phys_addr, block_size * expected_pages);
+		add_gpage(phys_addr, block_size, expected_pages);
+	}
 	return 0;
 }
 #endif /* CONFIG_HUGETLB_PAGE */
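The guard added by this patch can be sketched as a standalone predicate. The function name `gpage_block_usable` and the `end_of_dram` parameter are hypothetical; `end_of_dram` stands in for whatever lmb_end_of_DRAM() returns once mem= has been applied.

```c
#include <stdbool.h>
#include <stdint.h>

#define GB (1024ULL * 1024 * 1024)

/*
 * Hypothetical stand-in for the check in the patch: the block is used
 * only if the first 16G page, starting at phys_addr, ends at or below
 * the end of usable DRAM.
 */
static bool gpage_block_usable(uint64_t phys_addr, uint64_t end_of_dram)
{
	return phys_addr + (16 * GB) <= end_of_dram;
}
```

Like the patch, the sketch compares only phys_addr plus one 16G page against the limit, not the full block_size * expected_pages span.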
Re: [PATCH v3] powerpc: properly reserve in bootmem the lmb reserved regions that cross NUMA nodes
Benjamin Herrenschmidt wrote:
> On Thu, 2008-10-09 at 15:18 -0500, Jon Tollefson wrote:
>> If there are multiple reserved memory blocks via lmb_reserve() that are contiguous addresses and on different NUMA nodes we are losing track of which address ranges to reserve in bootmem on which node. I discovered this when I recently got to try 16GB huge pages on a system with more than 2 nodes.
>
> I'm going to apply it, however, could you double check something for me ? A cursory glance of the new version makes me wonder, what if the first call to get_node_active_region() ends up with the work_fn never hitting the if () case ? I think in that case, node_ar->end_pfn never gets initialized right ? Can that happen in practice ? I suspect that isn't the case but better safe than sorry...

I have tested this on a few machines and it hasn't been a problem. But I don't see anything in lmb_reserve() that would prevent reserving a block that was outside of valid memory. So to be safe I have attached a patch that checks for an empty active range.

I also noticed that the size to reserve for subsequent nodes for a reserve that spans nodes wasn't taking into account the amount reserved on previous nodes so the patch addresses that too. If you would prefer this be a separate patch let me know.

> If there's indeed a potential problem, please send a fixup patch.
>
> Cheers,
> Ben.

Adjust amount to reserve based on previous nodes for reserves spanning multiple nodes. Check if the node active range is empty before attempting to pass the reserve to bootmem. In practice the range shouldn't be empty, but to be sure we check.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

 arch/powerpc/mm/numa.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 6cf5c71..195bfcd 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -116,6 +116,7 @@ static int __init get_active_region_work_fn(unsigned long start_pfn,

 /*
  * get_node_active_region - Return active region containing start_pfn
+ * Active range returned is empty if none found.
  * @start_pfn: The page to return the region for.
  * @node_ar: Returned set to the active region containing start_pfn
  */
@@ -126,6 +127,7 @@ static void __init get_node_active_region(unsigned long start_pfn,

 	node_ar->nid = nid;
 	node_ar->start_pfn = start_pfn;
+	node_ar->end_pfn = start_pfn;
 	work_with_active_regions(nid, get_active_region_work_fn, node_ar);
 }

@@ -933,18 +935,20 @@ void __init do_init_bootmem(void)
 		struct node_active_region node_ar;

 		get_node_active_region(start_pfn, &node_ar);
-		while (start_pfn < end_pfn) {
+		while (start_pfn < end_pfn &&
+			node_ar.start_pfn < node_ar.end_pfn) {
+			unsigned long reserve_size = size;
 			/*
 			 * if reserved region extends past active region
 			 * then trim size to active region
 			 */
 			if (end_pfn > node_ar.end_pfn)
-				size = (node_ar.end_pfn << PAGE_SHIFT)
+				reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
 					- (start_pfn << PAGE_SHIFT);
-			dbg("reserve_bootmem %lx %lx nid=%d\n", physbase, size,
-				node_ar.nid);
+			dbg("reserve_bootmem %lx %lx nid=%d\n", physbase,
+				reserve_size, node_ar.nid);
 			reserve_bootmem_node(NODE_DATA(node_ar.nid), physbase,
-						size, BOOTMEM_DEFAULT);
+						reserve_size, BOOTMEM_DEFAULT);
 			/*
 			 * if reserved region is contained in the active region
 			 * then done.
@@ -959,6 +963,7 @@ void __init do_init_bootmem(void)
 			 */
 			start_pfn = node_ar.end_pfn;
 			physbase = start_pfn << PAGE_SHIFT;
+			size = size - reserve_size;
 			get_node_active_region(start_pfn, &node_ar);
 		}

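The two behaviours this fixup describes (trimming each piece to its own active region, and stopping when the lookup returns an empty range) can be modelled in userspace. This is a sketch under stated assumptions, not the kernel code: `struct active_region`, `struct reservation`, `find_region` and `split_reservation` are hypothetical names, and sizes are counted in pages rather than bytes.

```c
#include <stddef.h>

struct active_region { int nid; unsigned long start_pfn, end_pfn; };
struct reservation   { int nid; unsigned long start_pfn, nr_pages; };

/*
 * Hypothetical lookup: returns the region containing pfn, or an empty
 * range (start_pfn == end_pfn) when no region contains it.
 */
static struct active_region find_region(const struct active_region *map,
					size_t n, unsigned long pfn)
{
	struct active_region empty = { -1, pfn, pfn };
	size_t i;

	for (i = 0; i < n; i++)
		if (map[i].start_pfn <= pfn && pfn < map[i].end_pfn)
			return map[i];
	return empty;
}

/*
 * Split the reserved range [start_pfn, end_pfn) into one piece per
 * active region. Returns the number of pieces written to out[].
 */
static size_t split_reservation(const struct active_region *map, size_t n,
				unsigned long start_pfn, unsigned long end_pfn,
				struct reservation *out, size_t max_out)
{
	struct active_region ar = find_region(map, n, start_pfn);
	size_t count = 0;

	/* Stop on an empty active range, as the fixup patch does. */
	while (start_pfn < end_pfn && ar.start_pfn < ar.end_pfn &&
	       count < max_out) {
		/* Trim this piece to the end of the current region. */
		unsigned long piece_end =
			end_pfn < ar.end_pfn ? end_pfn : ar.end_pfn;

		out[count].nid = ar.nid;
		out[count].start_pfn = start_pfn;
		out[count].nr_pages = piece_end - start_pfn;
		count++;

		if (end_pfn <= ar.end_pfn)
			break;

		/* Continue with the region that follows this one. */
		start_pfn = ar.end_pfn;
		ar = find_region(map, n, start_pfn);
	}
	return count;
}
```

With three 100-page regions on nodes 0, 1 and 2, a reservation covering pfns 50 to 250 yields pieces of 50, 100 and 50 pages. Because each piece is computed from its own start, the sketch does not need the separate "size = size - reserve_size" bookkeeping that the byte-based kernel loop requires.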
[PATCH v3] powerpc: properly reserve in bootmem the lmb reserved regions that cross NUMA nodes
If there are multiple reserved memory blocks via lmb_reserve() that are contiguous addresses and on different NUMA nodes we are losing track of which address ranges to reserve in bootmem on which node. I discovered this when I recently got to try 16GB huge pages on a system with more than 2 nodes.

When scanning the device tree in early boot we call lmb_reserve() with the addresses of the 16G pages that we find so that the memory doesn't get used for something else. For example the addresses for the pages could be 40, 44, 48, 4C, etc - 8 pages, one on each of eight nodes. In the lmb after all the pages have been reserved it will look something like the following:

lmb_dump_all:
    memory.cnt = 0x2
    memory.size = 0x3e8000
    memory.region[0x0].base = 0x0
                      .size = 0x1e8000
    memory.region[0x1].base = 0x40
                      .size = 0x20
    reserved.cnt = 0x5
    reserved.size = 0x3e8000
    reserved.region[0x0].base = 0x0
                        .size = 0x7b5000
    reserved.region[0x1].base = 0x2a0
                        .size = 0x78c000
    reserved.region[0x2].base = 0x328c000
                        .size = 0x43000
    reserved.region[0x3].base = 0xf4e8000
                        .size = 0xb18000
    reserved.region[0x4].base = 0x40
                        .size = 0x20

The reserved.region[0x4] contains the 16G pages. In arch/powerpc/mm/numa.c: do_init_bootmem() we loop through each of the node numbers looking for the reserved regions that belong to the particular node. It is not able to identify region 0x4 as being a part of each of the 8 nodes. It is assuming that a reserved region is only on a single node.

This patch takes out the reserved region loop from inside the loop that goes over each node. It looks up the active region containing the start of the reserved region. If it extends past that active region then it adjusts the size and gets the next active region containing it.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

Changes:
v2:
 -style changes as suggested by Adam Litke
v3:
 -moved helper function to powerpc code since it is the only user at present
 -made end_pfn consistently exclusive
 -other minor code cleanups

Please consider for 2.6.28.

 numa.c | 108 -
 1 file changed, 80 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d9a1813..72447f1 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -89,6 +89,46 @@ static int __cpuinit fake_numa_create_new_node(unsigned long end_pfn,
 	return 0;
 }

+/*
+ * get_active_region_work_fn - A helper function for get_node_active_region
+ *	Returns datax set to the start_pfn and end_pfn if they contain
+ *	the initial value of datax->start_pfn between them
+ * @start_pfn: start page(inclusive) of region to check
+ * @end_pfn: end page(exclusive) of region to check
+ * @datax: comes in with ->start_pfn set to value to search for and
+ *	goes out with active range if it contains it
+ * Returns 1 if search value is in range else 0
+ */
+static int __init get_active_region_work_fn(unsigned long start_pfn,
+					unsigned long end_pfn, void *datax)
+{
+	struct node_active_region *data;
+	data = (struct node_active_region *)datax;
+
+	if (start_pfn <= data->start_pfn && end_pfn > data->start_pfn) {
+		data->start_pfn = start_pfn;
+		data->end_pfn = end_pfn;
+		return 1;
+	}
+	return 0;
+
+}
+
+/*
+ * get_node_active_region - Return active region containing start_pfn
+ * @start_pfn: The page to return the region for.
+ * @node_ar: Returned set to the active region containing start_pfn
+ */
+static void __init get_node_active_region(unsigned long start_pfn,
+		       struct node_active_region *node_ar)
+{
+	int nid = early_pfn_to_nid(start_pfn);
+
+	node_ar->nid = nid;
+	node_ar->start_pfn = start_pfn;
+	work_with_active_regions(nid, get_active_region_work_fn, node_ar);
+}
+
 static void __cpuinit map_cpu_to_node(int cpu, int node)
 {
 	numa_cpu_lookup_table[cpu] = node;
@@ -837,38 +877,50 @@ void __init do_init_bootmem(void)
 				  start_pfn, end_pfn);

 		free_bootmem_with_active_regions(nid, end_pfn);
+	}

-		/* Mark reserved regions on this node */
-		for (i = 0; i < lmb.reserved.cnt; i++) {
-			unsigned long physbase = lmb.reserved.region[i].base;
-			unsigned long size = lmb.reserved.region[i].size;
-			unsigned long start_paddr = start_pfn << PAGE_SHIFT
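The helper pair in this patch uses a callback plus an in/out struct: the caller stores the pfn to search for in the struct, and the callback overwrites it with the matching range. A minimal userspace model of that pattern follows. `struct pfn_range`, `for_each_range` and `lookup_range` are hypothetical names; `for_each_range` is an assumed stand-in for work_with_active_regions() that stops at the first callback returning nonzero.

```c
#include <stddef.h>

struct pfn_range { unsigned long start_pfn, end_pfn; };

typedef int (*range_fn)(unsigned long start_pfn, unsigned long end_pfn,
			void *data);

/* Assumed iterator: calls fn on each range, stops on a nonzero return. */
static void for_each_range(const struct pfn_range *ranges, size_t n,
			   range_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (fn(ranges[i].start_pfn, ranges[i].end_pfn, data))
			break;
}

/*
 * Callback in the style of get_active_region_work_fn(): data comes in
 * with ->start_pfn set to the pfn to search for and goes out holding
 * the range that contains it. end_pfn is exclusive.
 */
static int contains_work_fn(unsigned long start_pfn, unsigned long end_pfn,
			    void *datax)
{
	struct pfn_range *data = datax;

	if (start_pfn <= data->start_pfn && end_pfn > data->start_pfn) {
		data->start_pfn = start_pfn;
		data->end_pfn = end_pfn;
		return 1;
	}
	return 0;
}

static struct pfn_range lookup_range(const struct pfn_range *ranges,
				     size_t n, unsigned long pfn)
{
	/* Sketch choice: start from an empty range so a miss is visible. */
	struct pfn_range result = { pfn, pfn };

	for_each_range(ranges, n, contains_work_fn, &result);
	return result;
}
```

Because end_pfn is exclusive, a pfn equal to one range's end_pfn belongs to the next range, which matches the "made end_pfn consistently exclusive" note in the v3 changes.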
Re: [PATCH] properly reserve in bootmem the lmb reserved regions that cross numa nodes
Kumar Gala wrote:
> Out of interest how do you guys represent NUMA regions of memory in the device tree?
>
> - k

Looking at the source code in numa.c I see at the start of do_init_bootmem() that parse_numa_properties() is called. It appears to be looking at memory nodes and getting the node id from it. It gets an associativity property for the memory node and indexes that array with a 'min_common_depth' value to get the node id.

This node id is then used to set up the active ranges in the early_node_map[].

Is this what you are asking about? There are others, I am sure, who know more about it than I do, though.

Jon
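The lookup described in the reply, indexing the associativity array with min_common_depth, could look roughly like the sketch below. The array layout is an assumption made for illustration (cell 0 holding the number of entries that follow); the function name `nid_from_associativity` is hypothetical, and byte order handling for real device-tree cells is left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed layout, for illustration only: assoc[0] is the count of
 * entries that follow, and the entry at index min_common_depth is the
 * node id. Returns -1 when the array is too short for that depth.
 */
static int nid_from_associativity(const uint32_t *assoc, size_t ncells,
				  unsigned int min_common_depth)
{
	if (ncells == 0)
		return -1;
	if (assoc[0] < min_common_depth || min_common_depth >= ncells)
		return -1;
	return (int)assoc[min_common_depth];
}
```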
Re: [PATCH] properly reserve in bootmem the lmb reserved regions that cross numa nodes
Kumar Gala wrote:
> On Oct 6, 2008, at 10:42 AM, Jon Tollefson wrote:
>> Kumar Gala wrote:
>>> Out of interest how do you guys represent NUMA regions of memory in the device tree?
>>>
>>> - k
>> Looking at the source code in numa.c I see at the start of do_init_bootmem() that parse_numa_properties() is called. It appears to be looking at memory nodes and getting the node id from it. It gets an associativity property for the memory node and indexes that array with a 'min_common_depth' value to get the node id.
>>
>> This node id is then used to set up the active ranges in the early_node_map[].
>>
>> Is this what you are asking about? There are others, I am sure, who know more about it than I do, though.
>
> I was wondering if this was documented anywhere (like in sPAPR)?
>
> - k

I see some information on it in section C.6.6.

Jon
[PATCH v2] properly reserve in bootmem the lmb reserved regions that cross NUMA nodes
If there are multiple reserved memory blocks via lmb_reserve() that are contiguous addresses and on different NUMA nodes we are losing track of which address ranges to reserve in bootmem on which node. I discovered this when I only recently got to try 16GB huge pages on a system with more than 2 nodes.

When scanning the device tree in early boot we call lmb_reserve() with the addresses of the 16G pages that we find so that the memory doesn't get used for something else. For example the addresses for the pages could be 40, 44, 48, 4C, etc - 8 pages, one on each of eight nodes. In the lmb after all the pages have been reserved it will look something like the following:

lmb_dump_all:
    memory.cnt = 0x2
    memory.size = 0x3e8000
    memory.region[0x0].base = 0x0
                      .size = 0x1e8000
    memory.region[0x1].base = 0x40
                      .size = 0x20
    reserved.cnt = 0x5
    reserved.size = 0x3e8000
    reserved.region[0x0].base = 0x0
                        .size = 0x7b5000
    reserved.region[0x1].base = 0x2a0
                        .size = 0x78c000
    reserved.region[0x2].base = 0x328c000
                        .size = 0x43000
    reserved.region[0x3].base = 0xf4e8000
                        .size = 0xb18000
    reserved.region[0x4].base = 0x40
                        .size = 0x20

The reserved.region[0x4] contains the 16G pages. In arch/powerpc/mm/numa.c: do_init_bootmem() we loop through each of the node numbers looking for the reserved regions that belong to the particular node. It is not able to identify region 0x4 as being a part of each of the 8 nodes. It is assuming that a reserved region is only on a single node.

This patch takes out the reserved region loop from inside the loop that goes over each node. It looks up the active region containing the start of the reserved region. If it extends past that active region then it adjusts the size and gets the next active region containing it.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

Changes:
 -style changes as suggested by Adam Litke

Please consider for 2.6.28.

 arch/powerpc/mm/numa.c | 63 -
 include/linux/mm.h     |  2 +
 mm/page_alloc.c        | 19 ++
 3 files changed, 57 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d9a1813..9a3b0c9 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -837,36 +837,45 @@ void __init do_init_bootmem(void)
 				  start_pfn, end_pfn);

 		free_bootmem_with_active_regions(nid, end_pfn);
+	}

-		/* Mark reserved regions on this node */
-		for (i = 0; i < lmb.reserved.cnt; i++) {
-			unsigned long physbase = lmb.reserved.region[i].base;
-			unsigned long size = lmb.reserved.region[i].size;
-			unsigned long start_paddr = start_pfn << PAGE_SHIFT;
-			unsigned long end_paddr = end_pfn << PAGE_SHIFT;
-
-			if (early_pfn_to_nid(physbase >> PAGE_SHIFT) != nid &&
-			    early_pfn_to_nid((physbase+size-1) >> PAGE_SHIFT) != nid)
-				continue;
-
-			if (physbase < end_paddr &&
-			    (physbase+size) > start_paddr) {
-				/* overlaps */
-				if (physbase < start_paddr) {
-					size -= start_paddr - physbase;
-					physbase = start_paddr;
-				}
-
-				if (size > end_paddr - physbase)
-					size = end_paddr - physbase;
-
-				dbg("reserve_bootmem %lx %lx\n", physbase,
-				    size);
-				reserve_bootmem_node(NODE_DATA(nid), physbase,
-						     size, BOOTMEM_DEFAULT);
-			}
+	/* Mark reserved regions */
+	for (i = 0; i < lmb.reserved.cnt; i++) {
+		unsigned long physbase = lmb.reserved.region[i].base;
+		unsigned long size = lmb.reserved.region[i].size;
+		unsigned long start_pfn = physbase >> PAGE_SHIFT;
+		unsigned long end_pfn = ((physbase + size - 1) >> PAGE_SHIFT);
+		struct node_active_region *node_ar;
+
+		node_ar = get_node_active_region(start_pfn);
+		while (start_pfn < end_pfn && node_ar != NULL) {
+			/*
+			 * if reserved region extends past active region
+			 * then trim size to active region
Re: [PATCH] properly reserve in bootmem the lmb reserved regions that cross numa nodes
Adam Litke wrote:
> This seems like the right approach to me. I have pointed out a few stylistic issues below.

Thanks. I'll make those changes. I assume by __mminit you meant __meminit

Jon

> On Tue, 2008-09-30 at 09:53 -0500, Jon Tollefson wrote:
> <snip>
>> +	/* Mark reserved regions */
>> +	for (i = 0; i < lmb.reserved.cnt; i++) {
>> +		unsigned long physbase = lmb.reserved.region[i].base;
>> +		unsigned long size = lmb.reserved.region[i].size;
>> +		unsigned long start_pfn = physbase >> PAGE_SHIFT;
>> +		unsigned long end_pfn = ((physbase+size-1) >> PAGE_SHIFT);
>
> CodingStyle dictates that this should be:
> unsigned long end_pfn = ((physbase + size - 1) >> PAGE_SHIFT);
>
> <snip>
>> +/**
>> + * get_node_active_region - Return active region containing start_pfn
>> + * @start_pfn The page to return the region for.
>> + *
>> + * It will return NULL if active region is not found.
>> + */
>> +struct node_active_region *get_node_active_region(
>> +						unsigned long start_pfn)
>
> Bad style. I think the convention would be to write it like this:
>
> struct node_active_region *
> get_node_active_region(unsigned long start_pfn)
>
>> +{
>> +	int i;
>> +	for (i = 0; i < nr_nodemap_entries; i++) {
>> +		unsigned long node_start_pfn = early_node_map[i].start_pfn;
>> +		unsigned long node_end_pfn = early_node_map[i].end_pfn;
>> +
>> +		if (node_start_pfn <= start_pfn && node_end_pfn > start_pfn)
>> +			return &early_node_map[i];
>> +	}
>> +	return NULL;
>> +}
>
> Since this is using the early_node_map[], should we mark the function __mminit?
[PATCH] properly reserve in bootmem the lmb reserved regions that cross numa nodes
If there are multiple reserved memory blocks via lmb_reserve() that are contiguous addresses and on different NUMA nodes we are losing track of which address ranges to reserve in bootmem on which node. I discovered this when I only recently got to try 16GB huge pages on a system with more than 2 nodes.

When scanning the device tree in early boot we call lmb_reserve() with the addresses of the 16G pages that we find so that the memory doesn't get used for something else. For example the addresses for the pages could be 40, 44, 48, 4C, etc - 8 pages, one on each of eight nodes. In the lmb after all the pages have been reserved it will look something like the following:

lmb_dump_all:
    memory.cnt = 0x2
    memory.size = 0x3e8000
    memory.region[0x0].base = 0x0
                      .size = 0x1e8000
    memory.region[0x1].base = 0x40
                      .size = 0x20
    reserved.cnt = 0x5
    reserved.size = 0x3e8000
    reserved.region[0x0].base = 0x0
                        .size = 0x7b5000
    reserved.region[0x1].base = 0x2a0
                        .size = 0x78c000
    reserved.region[0x2].base = 0x328c000
                        .size = 0x43000
    reserved.region[0x3].base = 0xf4e8000
                        .size = 0xb18000
    reserved.region[0x4].base = 0x40
                        .size = 0x20

The reserved.region[0x4] contains the 16G pages. In arch/powerpc/mm/numa.c: do_init_bootmem() we loop through each of the node numbers looking for the reserved regions that belong to the particular node. It is not able to identify region 0x4 as being a part of each of the 8 nodes. It is assuming that a reserved region is only on a single node.

This patch takes out the reserved region loop from inside the loop that goes over each node. It looks up the active region containing the start of the reserved region. If it extends past that active region then it adjusts the size and gets the next active region containing it.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

 arch/powerpc/mm/numa.c | 63 -
 include/linux/mm.h     |  2 +
 mm/page_alloc.c        | 19 ++
 3 files changed, 57 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d9a1813..07b8726 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -837,36 +837,45 @@ void __init do_init_bootmem(void)
 				  start_pfn, end_pfn);

 		free_bootmem_with_active_regions(nid, end_pfn);
+	}

-		/* Mark reserved regions on this node */
-		for (i = 0; i < lmb.reserved.cnt; i++) {
-			unsigned long physbase = lmb.reserved.region[i].base;
-			unsigned long size = lmb.reserved.region[i].size;
-			unsigned long start_paddr = start_pfn << PAGE_SHIFT;
-			unsigned long end_paddr = end_pfn << PAGE_SHIFT;
-
-			if (early_pfn_to_nid(physbase >> PAGE_SHIFT) != nid &&
-			    early_pfn_to_nid((physbase+size-1) >> PAGE_SHIFT) != nid)
-				continue;
-
-			if (physbase < end_paddr &&
-			    (physbase+size) > start_paddr) {
-				/* overlaps */
-				if (physbase < start_paddr) {
-					size -= start_paddr - physbase;
-					physbase = start_paddr;
-				}
-
-				if (size > end_paddr - physbase)
-					size = end_paddr - physbase;
-
-				dbg("reserve_bootmem %lx %lx\n", physbase,
-				    size);
-				reserve_bootmem_node(NODE_DATA(nid), physbase,
-						     size, BOOTMEM_DEFAULT);
-			}
+	/* Mark reserved regions */
+	for (i = 0; i < lmb.reserved.cnt; i++) {
+		unsigned long physbase = lmb.reserved.region[i].base;
+		unsigned long size = lmb.reserved.region[i].size;
+		unsigned long start_pfn = physbase >> PAGE_SHIFT;
+		unsigned long end_pfn = ((physbase+size-1) >> PAGE_SHIFT);
+		struct node_active_region *node_ar;
+
+		node_ar = get_node_active_region(start_pfn);
+		while (start_pfn < end_pfn && node_ar != NULL) {
+			/*
+			 * if reserved region extends past active region
+			 * then trim size to active region
+			 */
+			if (end_pfn >= node_ar->end_pfn
Re: [Libhugetlbfs-devel] Buglet in 16G page handling
Jon Tollefson wrote:
> David Gibson wrote:
>> On Tue, Sep 02, 2008 at 12:12:27PM -0500, Jon Tollefson wrote:
>>> David Gibson wrote:
>>>> When BenH and I were looking at the new code for handling 16G pages, we noticed a small bug. It doesn't actually break anything user visible, but it's certainly not the way things are supposed to be. The 16G patches didn't update the huge_pte_offset() and huge_pte_alloc() functions, which means that the hugepte tables for 16G pages will be allocated much further down the page table tree than they should be - allocating several levels of page table with a single entry in them along the way.
>>>>
>>>> The patch below is supposed to fix this, cleaning up the existing handling of 64k vs 16M pages while it's at it. However, it needs some testing.
>>>>
>>>> I've checked that it doesn't break existing 16M support, either with 4k or 64k base pages. I haven't figured out how to test with 64k pages yet, at least until the multisize support goes into libhugetlbfs. For 16G pages, I just don't have access to a machine with enough memory to test. Jon, presumably you must have found such a machine when you did the 16G page support in the first place. Do you still have access, and can you test this patch?
>>>
>>> I do have access to a machine to test it. I applied the patch to -rc4 and used a pseries_defconfig. I boot with default_hugepagesz=16G... in order to test huge page sizes other than 16M at this point.
>>>
>>> Running the libhugetlbfs test suite it gets as far as "Readback (64): PASS" before it hits the following program check.
>>
>> Ah, yes, oops, forgot to fix up the pagetable freeing path in line with the other changes. Try the revised version below.
>
> I have run through the tests twice now with this new patch using a 4k base page size (and 16G huge page size) and there are no program checks or spin lock issues. So looking good.
>
> I will run it next a couple of times with 64K base pages.

I have run through the libhugetest suite 3 times each now with both combinations (4k and 64K base page) and have not seen the spin lock problem or any other problems.

Acked-by: Jon Tollefson [EMAIL PROTECTED]

Jon

>> Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
>> ===================================================================
>> --- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-09-02 11:50:12.0 +1000
>> +++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2008-09-03 10:10:54.0 +1000
>> @@ -128,29 +128,37 @@ static int __hugepte_alloc(struct mm_str
>>  	return 0;
>>  }
>>
>> -/* Base page size affects how we walk hugetlb page tables */
>> -#ifdef CONFIG_PPC_64K_PAGES
>> -#define hpmd_offset(pud, addr, h)	pmd_offset(pud, addr)
>> -#define hpmd_alloc(mm, pud, addr, h)	pmd_alloc(mm, pud, addr)
>> -#else
>> -static inline
>> -pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
>> +
>> +static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate *hstate)
>> +{
>> +	if (huge_page_shift(hstate) < PUD_SHIFT)
>> +		return pud_offset(pgd, addr);
>> +	else
>> +		return (pud_t *) pgd;
>> +}
>> +static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long addr,
>> +			 struct hstate *hstate)
>>  {
>> -	if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
>> +	if (huge_page_shift(hstate) < PUD_SHIFT)
>> +		return pud_alloc(mm, pgd, addr);
>> +	else
>> +		return (pud_t *) pgd;
>> +}
>> +static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
>> +{
>> +	if (huge_page_shift(hstate) < PMD_SHIFT)
>>  		return pmd_offset(pud, addr);
>>  	else
>>  		return (pmd_t *) pud;
>>  }
>> -static inline
>> -pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
>> -		  struct hstate *hstate)
>> +static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
>> +			 struct hstate *hstate)
>>  {
>> -	if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
>> +	if (huge_page_shift(hstate) < PMD_SHIFT)
>>  		return pmd_alloc(mm, pud, addr);
>>  	else
>>  		return (pmd_t *) pud;
>>  }
>> -#endif
>>
>>  /* Build list of addresses of gigantic pages. This function is used in early
>>   * boot before the buddy or bootmem allocator is setup.
>> @@ -204,7 +212,7 @@ pte_t *huge_pte_offset(struct mm_struct
>>
>>  	pg = pgd_offset(mm, addr);
>>  	if (!pgd_none(*pg)) {
>> -		pu = pud_offset(pg, addr);
>> +		pu = hpud_offset(pg, addr, hstate);
>>  		if (!pud_none(*pu)) {
>>  			pm = hpmd_offset(pu, addr, hstate);
>>  			if (!pmd_none(*pm))
>> @@ -233,7 +241,7 @@ pte_t *huge_pte_alloc(struct mm_struct *
>>  	addr &= hstate->mask;
>>
>>  	pg = pgd_offset(mm, addr);
>> -	pu = pud_alloc(mm, pg, addr);
>> +	pu = hpud_alloc(mm, pg, addr, hstate);
>>
>>  	if (pu) {
>>  		pm = hpmd_alloc(mm, pu, addr, hstate);
>> @@ -316,13 +324,7
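The effect of the hpud_*/hpmd_* helpers in the quoted patch is that the walk stops descending once the huge page covers a whole entry at the current level. A minimal sketch of that decision follows. The shift values and the names `PMD_SHIFT_EXAMPLE`, `PUD_SHIFT_EXAMPLE` and `hugepte_parent_level` are made up for illustration; real PMD_SHIFT and PUD_SHIFT values depend on the kernel configuration.

```c
/* Assumed shifts, for illustration only. */
#define PMD_SHIFT_EXAMPLE	21
#define PUD_SHIFT_EXAMPLE	30

enum table_level { LEVEL_PGD, LEVEL_PUD, LEVEL_PMD };

/*
 * Hypothetical helper: which level ends up holding the hugepte pointer
 * for a given huge page shift, following the same comparisons as
 * hpud_offset() and hpmd_offset() in the patch.
 */
static enum table_level hugepte_parent_level(unsigned int huge_page_shift)
{
	/* hpud_*: descend to the pud only if the page is smaller. */
	if (huge_page_shift >= PUD_SHIFT_EXAMPLE)
		return LEVEL_PGD;
	/* hpmd_*: descend to the pmd only if the page is smaller. */
	if (huge_page_shift >= PMD_SHIFT_EXAMPLE)
		return LEVEL_PUD;
	return LEVEL_PMD;
}
```

With these example shifts, a 16G page (shift 34) attaches at the top level instead of going through intermediate tables that each hold a single entry, which is the problem described at the start of the thread.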
Re: [Libhugetlbfs-devel] Buglet in 16G page handling
Benjamin Herrenschmidt wrote:

On Tue, 2008-09-02 at 17:16 -0500, Jon Tollefson wrote:

Benjamin Herrenschmidt wrote:

Actually, Jon has been hitting an occasional pagetable lock related problem. The last theory was that it might be some sort of race but it's vaguely possible that this is the issue. Jon?

All hugetlbfs ops should be covered by the big PTL except walking... Can we have more info about the problem?

Cheers, Ben.

I hit this when running the complete libhugetlbfs test suite (make check) with base page at 4K and default huge page size at 16G. It is on the last test (shm-getraw) when it hits it. Just running that test alone has not caused it for me - only when I have run all the tests and it gets to this one. Also it doesn't happen every time. I have tried to reproduce as well with a 64K base page but haven't seen it happen there.

I don't see anything huge pages related in the backtraces, which is interesting...

Can you get us access to a machine with enough RAM to test the 16G pages?

Ben.

You can use the machine I have been using. I'll send you a note with the details on it after I test David's patch today.

Jon

<snip>

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [Libhugetlbfs-devel] Buglet in 16G page handling
David Gibson wrote:

On Tue, Sep 02, 2008 at 12:12:27PM -0500, Jon Tollefson wrote:

David Gibson wrote:

When BenH and I were looking at the new code for handling 16G pages, we noticed a small bug. It doesn't actually break anything user visible, but it's certainly not the way things are supposed to be. The 16G patches didn't update the huge_pte_offset() and huge_pte_alloc() functions, which means that the hugepte tables for 16G pages will be allocated much further down the page table tree than they should be - allocating several levels of page table with a single entry in them along the way.

The patch below is supposed to fix this, cleaning up the existing handling of 64k vs 16M pages while it's at it. However, it needs some testing.

I've checked that it doesn't break existing 16M support, either with 4k or 64k base pages. I haven't figured out how to test with 64k pages yet, at least until the multisize support goes into libhugetlbfs. For 16G pages, I just don't have access to a machine with enough memory to test. Jon, presumably you must have found such a machine when you did the 16G page support in the first place. Do you still have access, and can you test this patch?

I do have access to a machine to test it. I applied the patch to -rc4 and used a pseries_defconfig. I boot with default_hugepagesz=16G... in order to test huge page sizes other than 16M at this point.

Running the libhugetlbfs test suite it gets as far as Readback (64): PASS before it hits the following program check.

Ah, yes, oops, forgot to fix up the pagetable freeing path in line with the other changes. Try the revised version below.

I have run through the tests twice now with this new patch using a 4k base page size (and 16G huge page size) and there are no program checks or spin lock issues. So looking good. I will run it next a couple of times with 64K base pages.
Jon

Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-09-02 11:50:12.0 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2008-09-03 10:10:54.0 +1000
@@ -128,29 +128,37 @@ static int __hugepte_alloc(struct mm_str
 	return 0;
 }
 
-/* Base page size affects how we walk hugetlb page tables */
-#ifdef CONFIG_PPC_64K_PAGES
-#define hpmd_offset(pud, addr, h)	pmd_offset(pud, addr)
-#define hpmd_alloc(mm, pud, addr, h)	pmd_alloc(mm, pud, addr)
-#else
-static inline
-pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
+
+static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate *hstate)
+{
+	if (huge_page_shift(hstate) < PUD_SHIFT)
+		return pud_offset(pgd, addr);
+	else
+		return (pud_t *) pgd;
+}
+static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long addr,
+			 struct hstate *hstate)
 {
-	if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
+	if (huge_page_shift(hstate) < PUD_SHIFT)
+		return pud_alloc(mm, pgd, addr);
+	else
+		return (pud_t *) pgd;
+}
+static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
+{
+	if (huge_page_shift(hstate) < PMD_SHIFT)
 		return pmd_offset(pud, addr);
 	else
 		return (pmd_t *) pud;
 }
-static inline
-pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
-		  struct hstate *hstate)
+static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
+			 struct hstate *hstate)
 {
-	if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
+	if (huge_page_shift(hstate) < PMD_SHIFT)
 		return pmd_alloc(mm, pud, addr);
 	else
 		return (pmd_t *) pud;
 }
-#endif
 
 /* Build list of addresses of gigantic pages.  This function is used in early
  * boot before the buddy or bootmem allocator is setup.
@@ -204,7 +212,7 @@ pte_t *huge_pte_offset(struct mm_struct
 
 	pg = pgd_offset(mm, addr);
 	if (!pgd_none(*pg)) {
-		pu = pud_offset(pg, addr);
+		pu = hpud_offset(pg, addr, hstate);
 		if (!pud_none(*pu)) {
 			pm = hpmd_offset(pu, addr, hstate);
 			if (!pmd_none(*pm))
@@ -233,7 +241,7 @@ pte_t *huge_pte_alloc(struct mm_struct *
 	addr &= hstate->mask;
 
 	pg = pgd_offset(mm, addr);
-	pu = pud_alloc(mm, pg, addr);
+	pu = hpud_alloc(mm, pg, addr, hstate);
 
 	if (pu) {
 		pm = hpmd_alloc(mm, pu, addr, hstate);
@@ -316,13 +324,7 @@ static void hugetlb_free_pud_range(struc
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-#ifdef CONFIG_PPC_64K_PAGES
-		if (pud_none_or_clear_bad(pud))
-			continue
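
For readers following the thread, the level-selection idea behind hpud_offset() and hpmd_offset() can be sketched in user space. This is only an illustration under stated assumptions, not the kernel code: the shift constants (SKETCH_PMD_SHIFT, SKETCH_PUD_SHIFT) are assumed example values for a 4K base page layout, and the enum and the function name hugepd_level_for() are hypothetical. It shows which page table level ends up holding the hugepd pointer for a given huge page shift.

```c
#include <assert.h>

/* Assumed example values only; the real PMD_SHIFT and PUD_SHIFT come
 * from the kernel's page table headers and depend on the base page size. */
#define SKETCH_PMD_SHIFT 21
#define SKETCH_PUD_SHIFT 28

/* Hypothetical names for the level whose entry holds the hugepd pointer. */
enum sketch_level { LEVEL_PGD, LEVEL_PUD, LEVEL_PMD };

/* The walk descends one more level only while the huge page is smaller
 * than the range covered by one entry at the current level. */
static enum sketch_level hugepd_level_for(unsigned int huge_page_shift)
{
	assert(huge_page_shift < 64);
	if (huge_page_shift >= SKETCH_PUD_SHIFT)
		return LEVEL_PGD;	/* hpud_offset() hands back the pgd entry */
	if (huge_page_shift >= SKETCH_PMD_SHIFT)
		return LEVEL_PUD;	/* hpmd_offset() hands back the pud entry */
	return LEVEL_PMD;		/* full walk down to the pmd entry */
}
```

With these assumed shifts, a 16G page (shift 34) stops at the top level, a 16M page (shift 24) stops one level down, and a 64K page (shift 16) takes the full walk, which is the behaviour the description of the fix asks for.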
Re: Buglet in 16G page handling
David Gibson wrote: When BenH and I were looking at the new code for handling 16G pages, we noticed a small bug. It doesn't actually break anything user visible, but it's certainly not the way things are supposed to be. The 16G patches didn't update the huge_pte_offset() and huge_pte_alloc() functions, which means that the hugepte tables for 16G pages will be allocated much further down the page table tree than they should be - allocating several levels of page table with a single entry in them along the way. The patch below is supposed to fix this, cleaning up the existing handling of 64k vs 16M pages while its at it. However, it needs some testing. I've checked that it doesn't break existing 16M support, either with 4k or 64k base pages. I haven't figured out how to test with 64k pages yet, at least until the multisize support goes into libhugetlbfs. For 16G pages, I just don't have access to a machine with enough memory to test. Jon, presumably you must have found such a machine when you did the 16G page support in the first place. Do you still have access, and can you test this patch? I do have access to a machine to test it. I applied the patch to -rc4 and used a pseries_defconfig. I boot with default_hugepagesz=16G... in order to test huge page sizes other then 16M at this point. Running the libhugetlbfs test suite it gets as far as Readback (64): PASS before it hits the following program check. kernel BUG at arch/powerpc/mm/hugetlbpage.c:98! cpu 0x0: Vector: 700 (Program Check) at [c002843db580] pc: c0035ff4: .free_hugepte_range+0x2c/0x7c lr: c0036af0: .hugetlb_free_pgd_range+0x2c0/0x398 sp: c002843db800 msr: 80029032 current = 0xc0028417a2a0 paca= 0xc08d4300 pid = 3334, comm = readback kernel BUG at arch/powerpc/mm/hugetlbpage.c:98! enter ? 
for help [c002843db880] c0036af0 .hugetlb_free_pgd_range+0x2c0/0x398 [c002843db980] c00da224 .free_pgtables+0x98/0x140 [c002843dba40] c00dc4d8 .exit_mmap+0x13c/0x22c [c002843dbb00] c005b218 .mmput+0x78/0x148 [c002843dbba0] c0060528 .exit_mm+0x164/0x18c [c002843dbc50] c0062718 .do_exit+0x2e8/0x858 [c002843dbd10] c0062d24 .do_group_exit+0x9c/0xd0 [c002843dbdb0] c0062d74 .sys_exit_group+0x1c/0x30 [c002843dbe30] c00086d4 syscall_exit+0x0/0x40 --- Exception: c00 (System Call) at 00802db7a530 SP (fa6e290) is in userspace Line 98 appears to be this BUG_ON static inline pte_t *hugepd_page(hugepd_t hpd) { BUG_ON(!(hpd.pd HUGEPD_OK)); Jon Index: working-2.6/arch/powerpc/mm/hugetlbpage.c === --- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c2008-09-02 13:39:52.0 +1000 +++ working-2.6/arch/powerpc/mm/hugetlbpage.c 2008-09-02 14:08:56.0 +1000 @@ -128,29 +128,37 @@ static int __hugepte_alloc(struct mm_str return 0; } -/* Base page size affects how we walk hugetlb page tables */ -#ifdef CONFIG_PPC_64K_PAGES -#define hpmd_offset(pud, addr, h)pmd_offset(pud, addr) -#define hpmd_alloc(mm, pud, addr, h) pmd_alloc(mm, pud, addr) -#else -static inline -pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate) + +static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate *hstate) +{ + if (huge_page_shift(hstate) PUD_SHIFT) + return pud_offset(pgd, addr); + else + return (pud_t *) pgd; +} +static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long addr, + struct hstate *hstate) { - if (huge_page_shift(hstate) == PAGE_SHIFT_64K) + if (huge_page_shift(hstate) PUD_SHIFT) + return pud_alloc(mm, pgd, addr); + else + return (pud_t *) pgd; +} +static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate) +{ + if (huge_page_shift(hstate) PMD_SHIFT) return pmd_offset(pud, addr); else return (pmd_t *) pud; } -static inline -pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr, - struct hstate *hstate) +static pmd_t 
*hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr, + struct hstate *hstate) { - if (huge_page_shift(hstate) == PAGE_SHIFT_64K) + if (huge_page_shift(hstate) PMD_SHIFT) return pmd_alloc(mm, pud, addr); else return (pmd_t *) pud; } -#endif /* Build list of addresses of gigantic pages. This function is used in early * boot before the buddy or bootmem allocator is setup. @@ -204,7 +212,7 @@ pte_t *huge_pte_offset(struct mm_struct pg = pgd_offset(mm, addr); if (!pgd_none(*pg)) { - pu = pud_offset(pg, addr); + pu = hpud_offset(pg, addr, hstate);
Re: [Libhugetlbfs-devel] Buglet in 16G page handling
Benjamin Herrenschmidt wrote: Actually, Jon has been hitting an occasional pagetable lock related problem. The last theory was that it might be some sort of race but it's vaguely possible that this is the issue. Jon? All hugetlbfs ops should be covered by the big PTL except walking... Can we have more info about the problem ? Cheers, Ben. I hit this when running the complete libhugetlbfs test suite (make check) with base page at 4K and default huge page size at 16G. It is on the last test (shm-getraw) when it hits it. Just running that test alone has not caused it for me - only when I have run all the tests and it gets to this one. Also it doesn't happen every time. I have tried to reproduce as well with a 64K base page but haven't seen it happen there. BUG: spinlock bad magic on CPU#2, shm-getraw/10359 lock: fde6e158, .magic: , .owner: none/-1, .owner_cpu: 0 Call Trace: [c00285d9b420] [c00110b0] .show_stack+0x78/0x190 (unreliable) [c00285d9b4d0] [c00111e8] .dump_stack+0x20/0x34 [c00285d9b550] [c0295d94] .spin_bug+0xb8/0xe0 [c00285d9b5f0] [c02962d8] ._raw_spin_lock+0x4c/0x1a0 [c00285d9b690] [c0510c60] ._spin_lock+0x5c/0x7c [c00285d9b720] [c00d809c] .handle_mm_fault+0x2f0/0x9ac [c00285d9b810] [c0513688] .do_page_fault+0x444/0x62c [c00285d9b950] [c0005230] handle_page_fault+0x20/0x5c --- Exception: 301 at .__clear_user+0x38/0x7c LR = .read_zero+0xb0/0x1a8 [c00285d9bc40] [c02e19e0] .read_zero+0x80/0x1a8 (unreliable) [c00285d9bcf0] [c0102c00] .vfs_read+0xe0/0x1c8 [c00285d9bd90] [c010332c] .sys_read+0x54/0x98 [c00285d9be30] [c00086d4] syscall_exit+0x0/0x40 BUG: spinlock lockup on CPU#2, shm-getraw/10359, fde6e158 Call Trace: [c00285d9b4c0] [c00110b0] .show_stack+0x78/0x190 (unreliable) [c00285d9b570] [c00111e8] .dump_stack+0x20/0x34 [c00285d9b5f0] [c02963ec] ._raw_spin_lock+0x160/0x1a0 [c00285d9b690] [c0510c60] ._spin_lock+0x5c/0x7c [c00285d9b720] [c00d809c] .handle_mm_fault+0x2f0/0x9ac [c00285d9b810] [c0513688] .do_page_fault+0x444/0x62c [c00285d9b950] [c0005230] 
handle_page_fault+0x20/0x5c --- Exception: 301 at .__clear_user+0x38/0x7c LR = .read_zero+0xb0/0x1a8 [c00285d9bc40] [c02e19e0] .read_zero+0x80/0x1a8 (unreliable) [c00285d9bcf0] [c0102c00] .vfs_read+0xe0/0x1c8 [c00285d9bd90] [c010332c] .sys_read+0x54/0x98 [c00285d9be30] [c00086d4] syscall_exit+0x0/0x40 BUG: soft lockup - CPU#2 stuck for 61s! [shm-getraw:10359] Modules linked in: autofs4 binfmt_misc dm_mirror dm_log dm_multipath parport ibmvscsic uhci_hcd ohci_hcd ehci_hcd irq event stamp: 1423661 hardirqs last enabled at (1423661): [c008d954] .trace_hardirqs_on+0x1c/0x30 hardirqs last disabled at (1423660): [c008af60] .trace_hardirqs_off+0x1c/0x30 softirqs last enabled at (1422710): [c0064f6c] .__do_softirq+0x19c/0x1c4 softirqs last disabled at (1422705): [c002943c] .call_do_softirq+0x14/0x24 NIP: c002569c LR: c02963ac CTR: 80f7cdec REGS: c00285d9b330 TRAP: 0901 Not tainted (2.6.27-rc4-pseries) MSR: 80009032 EE,ME,IR,DR CR: 88000284 XER: 0002 TASK = c00285f18000[10359] 'shm-getraw' THREAD: c00285d98000 CPU: 2 GPR00: 8002 c00285d9b5b0 c08924e0 0001 GPR04: c00285f18000 0070 0002 GPR08: 0003c3c66e8adf66 0002 0010 GPR12: 000b4cbd c08d4700 NIP [c002569c] .__delay+0x10/0x38 LR [c02963ac] ._raw_spin_lock+0x120/0x1a0 Call Trace: [c00285d9b5b0] [c00285d9b690] 0xc00285d9b690 (unreliable) [c00285d9b5f0] [c0296378] ._raw_spin_lock+0xec/0x1a0 [c00285d9b690] [c0510c60] ._spin_lock+0x5c/0x7c [c00285d9b720] [c00d809c] .handle_mm_fault+0x2f0/0x9ac [c00285d9b810] [c0513688] .do_page_fault+0x444/0x62c [c00285d9b950] [c0005230] handle_page_fault+0x20/0x5c --- Exception: 301 at .__clear_user+0x38/0x7c LR = .read_zero+0xb0/0x1a8 [c00285d9bc40] [c02e19e0] .read_zero+0x80/0x1a8 (unreliable) [c00285d9bcf0] [c0102c00] .vfs_read+0xe0/0x1c8 [c00285d9bd90] [c010332c] .sys_read+0x54/0x98 [c00285d9be30] [c00086d4] syscall_exit+0x0/0x40 Instruction dump: eb41ffd0 eb61ffd8 eb81ffe0 7c0803a6 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 fbe1fff8 f821ffc1 7c3f0b78 7d2c42e6 4808 7c210b78 7c0c42e6 7c090050 
[root]# addr2line c00d809c -e /boot/vmlinux.rc4-pseries /root/src/linux-2.6-rc4/mm/memory.c:2381 [root]# addr2line c0513688 -e /boot/vmlinux.rc4-pseries /root/src/linux-2.6-rc4/arch/powerpc/mm/fault.c:313 [root]# addr2line c010332c -e /boot/vmlinux.rc4-pseries
link failure: file truncated
Just tried to build the latest version from Linus' tree and I am getting a link error. I am building with the pseries_defconfig.

...
  LD      drivers/built-in.o
  LD      vmlinux.o
  MODPOST vmlinux.o
WARNING: modpost: Found 6 section mismatch(es).
To see full details build your kernel with:
'make CONFIG_DEBUG_SECTION_MISMATCH=y'
  GEN     .version
  CHK     include/linux/compile.h
  UPD     include/linux/compile.h
  CC      init/version.o
  LD      init/built-in.o
  LD      .tmp_vmlinux1
ld: final link failed: File truncated
make: *** [.tmp_vmlinux1] Error 1

~/src/linus/linux-2.6> cat /etc/SuSE-release
SUSE LINUX Enterprise Server 9 (ppc)
VERSION = 9
PATCHLEVEL = 3

~/src/linus/linux-2.6> ld --version
GNU ld version 2.15.90.0.1.1 20040303 (SuSE Linux)
Copyright 2002 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License. This program has absolutely no warranty.

~/src/linus/linux-2.6> gcc --version
gcc (GCC) 3.3.3 (SuSE Linux)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Jon
Re: [PATCH] powerpc: Fix compile error with binutils 2.15
Segher Boessenkool wrote:

My previous patch to fix compilation with binutils-2.17 causes a "file truncated" build error from ld with binutils 2.15 (and possibly older), and a warning with 2.16 and 2.17. This fixes it.

Signed-off-by: Segher Boessenkool [EMAIL PROTECTED]
---
 arch/powerpc/kernel/vmlinux.lds.S |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index a914411..4a8ce62 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -85,7 +85,7 @@ SECTIONS
 
 	/* The dummy segment contents for the bug workaround mentioned above
 	   near PHDRS.  */
-	.dummy : {
+	.dummy : AT(ADDR(.dummy) - LOAD_OFFSET) {
 		LONG(0xf177)
 	} :kernel :dummy
 

This fixed the file truncated error for me. Also the kernel booted fine.

Jon

~/src/linus/linux-2.6> make vmlinux
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  UPD     include/linux/utsrelease.h
  CALL    scripts/checksyscalls.sh
<stdin>:1397:2: warning: #warning syscall signalfd4 not implemented
<stdin>:1401:2: warning: #warning syscall eventfd2 not implemented
<stdin>:1405:2: warning: #warning syscall epoll_create1 not implemented
<stdin>:1409:2: warning: #warning syscall dup3 not implemented
<stdin>:1413:2: warning: #warning syscall pipe2 not implemented
<stdin>:1417:2: warning: #warning syscall inotify_init1 not implemented
  CHK     include/linux/compile.h
  CC      init/version.o
  LD      init/built-in.o
  CALL    arch/powerpc/kernel/systbl_chk.sh
  CALL    arch/powerpc/kernel/prom_init_check.sh
  LDS     arch/powerpc/kernel/vmlinux.lds
  CC      kernel/module.o
  CC      kernel/kexec.o
  LD      kernel/built-in.o
  LD      vmlinux.o
  MODPOST vmlinux.o
WARNING: modpost: Found 6 section mismatch(es).
To see full details build your kernel with:
'make CONFIG_DEBUG_SECTION_MISMATCH=y'
  GEN     .version
  CHK     include/linux/compile.h
  UPD     include/linux/compile.h
  CC      init/version.o
  LD      init/built-in.o
  LD      .tmp_vmlinux1
  KSYM    .tmp_kallsyms1.S
  AS      .tmp_kallsyms1.o
  LD      .tmp_vmlinux2
  KSYM    .tmp_kallsyms2.S
  AS      .tmp_kallsyms2.o
  LD      vmlinux
  SYSMAP  System.map
  SYSMAP  .tmp_System.map
Re: gigantic pages patches
David Gibson wrote:

On Fri, Jul 11, 2008 at 05:45:15PM +1000, Stephen Rothwell wrote:

Hi all,

Could people take one last look at these patches and if there are no issues, please send Ack-bys to Andrew who will push them to Linus for 2.6.27.

[PATCH 1/6 v2] allow arch specific function for allocating gigantic pages
http://patchwork.ozlabs.org/linuxppc/patch?id=18437

[PATCH 2/6 v2] powerpc: function for allocating gigantic pages
http://patchwork.ozlabs.org/linuxppc/patch?id=18438

[PATCH 3/6 v2] powerpc: scan device tree and save gigantic page locations
http://patchwork.ozlabs.org/linuxppc/patch?id=18439

[PATCH 4/6 v2] powerpc: define page support for 16G pages
http://patchwork.ozlabs.org/linuxppc/patch?id=18440

[PATCH 5/6 v2] check for overflow
http://patchwork.ozlabs.org/linuxppc/patch?id=18441

[PATCH 6/6] powerpc: support multiple huge page sizes
http://patchwork.ozlabs.org/linuxppc/patch?id=18442

Sorry, I should have looked at these properly when they went past in May, but obviously I missed them. They mostly look ok.

I'm a bit confused on 2/6 though - it seems the new powerpc alloc_bootmem_huge_page() function is specific to the 16G gigantic pages. But can't that function also get called for the normal 16M hugepages depending on how the hugepage pool is initialized? Or am I missing something (wouldn't surprise me given my brain's sluggishness today)?

The alloc_bootmem_huge_page() function is only called for pages >= MAX_ORDER. The 16M pages are always allocated within the generic hugetlbfs code with alloc_pages_node().

Jon
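
The split Jon describes can be sketched as a small user-space check. This is a hedged illustration, not the generic hugetlb code: the helper names huge_page_order() and needs_boot_time_alloc() are hypothetical, and the MAX_ORDER value of 13 used in the note below is an assumption chosen only to make the arithmetic concrete.

```c
#include <assert.h>
#include <stdbool.h>

/* Order of a huge page: log2(huge page size / base page size). */
static unsigned int huge_page_order(unsigned int huge_shift,
				    unsigned int page_shift)
{
	assert(huge_shift >= page_shift);
	return huge_shift - page_shift;
}

/* Pages whose order reaches max_order cannot come from the buddy
 * allocator, so they take the boot-time (arch-overridable) path. */
static bool needs_boot_time_alloc(unsigned int order, unsigned int max_order)
{
	return order >= max_order;
}
```

Assuming a 4K base page (shift 12) and a MAX_ORDER of 13, a 16M page has order 12 and stays on the normal alloc_pages_node() path, while a 16G page has order 22 and has to be handed over at boot.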
Re: [PATCH 6/6] powerpc: support multiple huge page sizes
Nick Piggin wrote:

On Tue, May 13, 2008 at 12:25:27PM -0500, Jon Tollefson wrote:

Instead of using the variable mmu_huge_psize to keep track of the huge page size we use an array of MMU_PAGE_* values. For each supported huge page size we need to know the hugepte_shift value and have a pgtable_cache. The hstate or an mmu_huge_psizes index is passed to functions so that they know which huge page size they should use.

The hugepage sizes 16M and 64K are set up (if available on the hardware) so that they don't have to be set on the boot cmd line in order to use them. The number of 16G pages has to be specified at boot-time though (e.g. hugepagesz=16G hugepages=5).

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

@@ -150,17 +191,25 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 	pud_t *pu;
 	pmd_t *pm;
 
-	BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
+	unsigned int psize;
+	unsigned int shift;
+	unsigned long sz;
+	struct hstate *hstate;
+	psize = get_slice_psize(mm, addr);
+	shift = mmu_psize_to_shift(psize);
+	sz = ((1UL) << shift);
+	hstate = size_to_hstate(sz);
 
-	addr &= HPAGE_MASK;
+	addr &= hstate->mask;
 
 	pg = pgd_offset(mm, addr);
 	if (!pgd_none(*pg)) {
 		pu = pud_offset(pg, addr);
 		if (!pud_none(*pu)) {
-			pm = hpmd_offset(pu, addr);
+			pm = hpmd_offset(pu, addr, hstate);
 			if (!pmd_none(*pm))
-				return hugepte_offset((hugepd_t *)pm, addr);
+				return hugepte_offset((hugepd_t *)pm, addr,
+						      hstate);
 		}
 	}

Hi Jon,

I just noticed in a few places like this, you might be doing more work than really needed to get the HPAGE_MASK.

I would love to be able to simplify it.

For a first-pass conversion, this is the right way to go (just manually replace hugepage constants with hstate-> equivalents). However in this case if you already know the page size, you should be able to work out the shift from there, I think? That way you can avoid the size_to_hstate call completely.

Something like the following?

+	addr &= ~(sz - 1);

Is that faster than just pulling it out of hstate? I still need to locate hstate, but I guess if the mask is calculated this way the locate could be pushed further into the function so that it isn't done if it isn't always needed.

Anyway, just something to consider.

Thanks,
Nick

Thank you for looking at the code.

Jon

@@ -173,16 +222,20 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	pud_t *pu;
 	pmd_t *pm;
 	hugepd_t *hpdp = NULL;
+	struct hstate *hstate;
+	unsigned int psize;
+	hstate = size_to_hstate(sz);
 
-	BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
+	psize = get_slice_psize(mm, addr);
+	BUG_ON(!mmu_huge_psizes[psize]);
 
-	addr &= HPAGE_MASK;
+	addr &= hstate->mask;
 
 	pg = pgd_offset(mm, addr);
 	pu = pud_alloc(mm, pg, addr);
 
 	if (pu) {
-		pm = hpmd_alloc(mm, pu, addr);
+		pm = hpmd_alloc(mm, pu, addr, hstate);
 		if (pm)
 			hpdp = (hugepd_t *)pm;
 	}
@@ -190,10 +243,10 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	if (! hpdp)
 		return NULL;
 
-	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr))
+	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, psize))
 		return NULL;
 
-	return hugepte_offset(hpdp, addr);
+	return hugepte_offset(hpdp, addr, hstate);
 }
 
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
@@ -201,19 +254,22 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 	return 0;
 }
 
-static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp)
+static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp,
+			       unsigned int psize)
 {
 	pte_t *hugepte = hugepd_page(*hpdp);
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte, HUGEPTE_CACHE_NUM,
+	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
+						 HUGEPTE_CACHE_NUM+psize-1,
 						 PGF_CACHENUM_MASK));
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				   unsigned long addr, unsigned long end,
-				   unsigned long floor, unsigned long ceiling)
+				   unsigned long floor, unsigned long ceiling,
+				   unsigned int psize)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -225,7 +281,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t
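
The two ways of getting the mask that Nick and Jon discuss can be compared in a small user-space sketch. It is an illustration under assumptions rather than the kernel code: the function names are hypothetical, 64-bit types from stdint.h stand in for the kernel's unsigned long, and the second helper simply restates the formula that huge_add_hstate() uses to fill in the hstate mask.

```c
#include <assert.h>
#include <stdint.h>

/* Mask worked out directly from the page size, as in addr &= ~(sz - 1). */
static uint64_t mask_from_shift(unsigned int shift)
{
	assert(shift < 64);
	uint64_t sz = (uint64_t)1 << shift;
	return ~(sz - 1);
}

/* Mask as stored in an hstate: ~((1ULL << (order + PAGE_SHIFT)) - 1). */
static uint64_t mask_from_order(unsigned int order, unsigned int page_shift)
{
	assert(order + page_shift < 64);
	return ~(((uint64_t)1 << (order + page_shift)) - 1);
}

/* Round an address down to the start of its huge page. */
static uint64_t align_to_huge_page(uint64_t addr, unsigned int shift)
{
	return addr & mask_from_shift(shift);
}
```

Both helpers give the same value whenever shift equals order plus the base page shift, so the choice between them is about whether the hstate lookup can be deferred, not about the result.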
[PATCH 1/6 v2] allow arch specific function for allocating gigantic pages
Allow alloc_bm_huge_page() to be overridden by architectures that can't always use bootmem. This requires huge_boot_pages to be available for use by this function. The 16G pages on ppc64 have to be reserved prior to boot-time. The locations of these pages are indicated in the device tree.

A BUG_ON in huge_add_hstate is commented out in order to allow 64K huge pages to continue to work on power.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

 include/linux/hugetlb.h |   10 ++
 mm/hugetlb.c            |   15 ++-
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c47ca7..b550ec7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -35,6 +35,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
+extern struct list_head huge_boot_pages;
 
 /* arch callbacks */
 
@@ -205,6 +206,14 @@ struct hstate {
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 };
 
+struct huge_bm_page {
+	struct list_head list;
+	struct hstate *hstate;
+};
+
+/* arch callback */
+int alloc_bm_huge_page(struct hstate *h);
+
 void __init huge_add_hstate(unsigned order);
 struct hstate *size_to_hstate(unsigned long size);
 
@@ -256,6 +265,7 @@ extern unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
 
 #else
 struct hstate {};
+#define alloc_bm_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_vma(v) NULL
 #define hstate_inode(i) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5273f6c..efb5805 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -27,6 +27,7 @@ unsigned long max_huge_pages[HUGE_MAX_HSTATE];
 unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
+struct list_head huge_boot_pages;
 
 static int max_hstate = 0;
 
@@ -533,14 +534,8 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma,
 	return page;
 }
 
-static __initdata LIST_HEAD(huge_boot_pages);
-
-struct huge_bm_page {
-	struct list_head list;
-	struct hstate *hstate;
-};
-
-static int __init alloc_bm_huge_page(struct hstate *h)
+/* Can be overriden by architectures */
+__attribute__((weak)) int alloc_bm_huge_page(struct hstate *h)
 {
 	struct huge_bm_page *m;
 	int nr_nodes = nodes_weight(node_online_map);
@@ -583,6 +578,8 @@ static void __init hugetlb_init_hstate(struct hstate *h)
 	unsigned long i;
 
 	/* Don't reinitialize lists if they have been already init'ed */
+	if (!huge_boot_pages.next)
+		INIT_LIST_HEAD(&huge_boot_pages);
 	if (!h->hugepage_freelists[0].next) {
 		for (i = 0; i < MAX_NUMNODES; ++i)
 			INIT_LIST_HEAD(&h->hugepage_freelists[i]);
@@ -664,7 +661,7 @@ void __init huge_add_hstate(unsigned order)
 		return;
 	}
 	BUG_ON(max_hstate >= HUGE_MAX_HSTATE);
-	BUG_ON(order < HPAGE_SHIFT - PAGE_SHIFT);
+/*	BUG_ON(order < HPAGE_SHIFT - PAGE_SHIFT);*/
 	h = &hstates[max_hstate++];
 	h->order = order;
 	h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);
[PATCH 2/6 v2] powerpc: function for allocating gigantic pages
The 16G page locations have been saved during early boot in an array. The alloc_bm_huge_page() function adds a page from here to the huge_boot_pages list.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

 arch/powerpc/mm/hugetlbpage.c |   22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26f212f..383b3b2 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -29,6 +29,12 @@
 
 #define NUM_LOW_AREAS	(0x100000000UL >> SID_SHIFT)
 #define NUM_HIGH_AREAS	(PGTABLE_RANGE >> HTLB_AREA_SHIFT)
+#define MAX_NUMBER_GPAGES	1024
+
+/* Tracks the 16G pages after the device tree is scanned and before the
+ * huge_boot_pages list is ready.  */
+static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
+static unsigned nr_gpages;
 
 unsigned int hugepte_shift;
 #define PTRS_PER_HUGEPTE	(1 << hugepte_shift)
@@ -104,6 +110,22 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr)
 }
 #endif
 
+/* Moves the gigantic page addresses from the temporary list to the
+ * huge_boot_pages list.
+ */
+int alloc_bm_huge_page(struct hstate *h)
+{
+	struct huge_bm_page *m;
+	if (nr_gpages == 0)
+		return 0;
+	m = phys_to_virt(gpage_freearray[--nr_gpages]);
+	gpage_freearray[nr_gpages] = 0;
+	list_add(&m->list, &huge_boot_pages);
+	m->hstate = h;
+	return 1;
+}
+
+
 /* Modelled after find_linux_pte() */
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
[PATCH 3/6 v2] powerpc: scan device tree and save gigantic page locations
The 16G huge pages have to be reserved in the HMC prior to boot. The location of the pages are placed in the device tree. This patch adds code to scan the device tree during very early boot and save these page locations until hugetlbfs is ready for them. Signed-off-by: Jon Tollefson [EMAIL PROTECTED] --- arch/powerpc/mm/hash_utils_64.c | 44 ++- arch/powerpc/mm/hugetlbpage.c| 16 ++ include/asm-powerpc/mmu-hash64.h |2 + 3 files changed, 61 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index a83dfa3..133d6e2 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -67,6 +67,7 @@ #define KB (1024) #define MB (1024*KB) +#define GB (1024L*MB) /* * Note: pte -- Linux PTE @@ -302,6 +303,44 @@ static int __init htab_dt_scan_page_sizes(unsigned long node, return 0; } +/* Scan for 16G memory blocks that have been set aside for huge pages + * and reserve those blocks for 16G huge pages. + */ +static int __init htab_dt_scan_hugepage_blocks(unsigned long node, + const char *uname, int depth, + void *data) { + char *type = of_get_flat_dt_prop(node, device_type, NULL); + unsigned long *addr_prop; + u32 *page_count_prop; + unsigned int expected_pages; + long unsigned int phys_addr; + long unsigned int block_size; + + /* We are scanning memory nodes only */ + if (type == NULL || strcmp(type, memory) != 0) + return 0; + + /* This property is the log base 2 of the number of virtual pages that +* will represent this memory block. 
*/ + page_count_prop = of_get_flat_dt_prop(node, ibm,expected#pages, NULL); + if (page_count_prop == NULL) + return 0; + expected_pages = (1 page_count_prop[0]); + addr_prop = of_get_flat_dt_prop(node, reg, NULL); + if (addr_prop == NULL) + return 0; + phys_addr = addr_prop[0]; + block_size = addr_prop[1]; + if (block_size != (16 * GB)) + return 0; + printk(KERN_INFO Huge page(16GB) memory: + addr = 0x%lX size = 0x%lX pages = %d\n, + phys_addr, block_size, expected_pages); + lmb_reserve(phys_addr, block_size * expected_pages); + add_gpage(phys_addr, block_size, expected_pages); + return 0; +} + static void __init htab_init_page_sizes(void) { int rc; @@ -370,7 +409,10 @@ static void __init htab_init_page_sizes(void) mmu_psize_defs[mmu_io_psize].shift); #ifdef CONFIG_HUGETLB_PAGE - /* Init large page size. Currently, we pick 16M or 1M depending + /* Reserve 16G huge page memory sections for huge pages */ + of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL); + +/* Init large page size. Currently, we pick 16M or 1M depending * on what is available */ if (mmu_psize_defs[MMU_PAGE_16M].shift) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 383b3b2..a27b80c 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -110,6 +110,22 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr) } #endif +/* Build list of addresses of gigantic pages. This function is used in early + * boot before the buddy or bootmem allocator is setup. + */ +void add_gpage(unsigned long addr, unsigned long page_size, + unsigned long number_of_pages) +{ + if (!addr) + return; + while (number_of_pages 0) { + gpage_freearray[nr_gpages] = addr; + nr_gpages++; + number_of_pages--; + addr += page_size; + } +} + /* Moves the gigantic page addresses from the temporary list to the * huge_boot_pages list. 
*/ diff --git a/include/asm-powerpc/mmu-hash64.h b/include/asm-powerpc/mmu-hash64.h index 2864fa3..db1276a 100644 --- a/include/asm-powerpc/mmu-hash64.h +++ b/include/asm-powerpc/mmu-hash64.h @@ -279,6 +279,8 @@ extern int htab_bolt_mapping(unsigned long vstart, unsigned long vend, unsigned long pstart, unsigned long mode, int psize, int ssize); extern void set_huge_psize(int psize); +extern void add_gpage(unsigned long addr, unsigned long page_size, + unsigned long number_of_pages); extern void demote_segment_4k(struct mm_struct *mm, unsigned long addr); extern void htab_initialize(void); ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
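For illustration, the bookkeeping done by add_gpage() can be sketched in user-space C. This is only a sketch under assumptions: the bounds check against MAX_NUMBER_GPAGES and the helper names expected_pages_from_prop() and add_gpage_checked() are hypothetical and not part of the posted patch, which stores addresses without a limit check.

```c
#include <stdint.h>

#define MAX_NUMBER_GPAGES 1024	/* same array size as in the series */

static uint64_t gpage_freearray[MAX_NUMBER_GPAGES];
static unsigned int nr_gpages;

/* The device tree property holds log2 of the page count.
 * Hypothetical helper; returns 0 for values that cannot be shifted. */
static unsigned int expected_pages_from_prop(uint32_t log2_pages)
{
	return log2_pages < 32 ? 1u << log2_pages : 0;
}

/* Record number_of_pages consecutive pages starting at addr and
 * return how many were stored. The array bounds check is an addition
 * made for this sketch. */
static unsigned int add_gpage_checked(uint64_t addr, uint64_t page_size,
				      unsigned int number_of_pages)
{
	unsigned int stored = 0;

	if (!addr)
		return 0;
	while (number_of_pages > 0 && nr_gpages < MAX_NUMBER_GPAGES) {
		gpage_freearray[nr_gpages++] = addr;
		addr += page_size;
		number_of_pages--;
		stored++;
	}
	return stored;
}
```

A block at physical address 16G with a property value of 1 would then record two entries, one at 16G and one at 32G.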
[PATCH 4/6 v2] powerpc: define page support for 16G pages
The huge page size is defined for 16G pages. If a hugepagesz of 16G is specified at boot-time then it becomes the huge page size instead of the default 16M. The change in pgtable-64K.h is to the macro pte_iterate_hashed_subpages to make the increment to va (the 1 being shifted) be a long so that it is not shifted to 0. Otherwise it would create an infinite loop when the shift value is for a 16G page (when base page size is 64K). Signed-off-by: Jon Tollefson [EMAIL PROTECTED] --- arch/powerpc/mm/hugetlbpage.c | 62 ++ include/asm-powerpc/pgtable-64k.h |2 - 2 files changed, 45 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index a27b80c..063ec36 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -24,8 +24,9 @@ #include asm/cputable.h #include asm/spu.h -#define HPAGE_SHIFT_64K16 -#define HPAGE_SHIFT_16M24 +#define PAGE_SHIFT_64K 16 +#define PAGE_SHIFT_16M 24 +#define PAGE_SHIFT_16G 34 #define NUM_LOW_AREAS (0x1UL SID_SHIFT) #define NUM_HIGH_AREAS (PGTABLE_RANGE HTLB_AREA_SHIFT) @@ -95,7 +96,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp, static inline pmd_t *hpmd_offset(pud_t *pud, unsigned long addr) { - if (HPAGE_SHIFT == HPAGE_SHIFT_64K) + if (HPAGE_SHIFT == PAGE_SHIFT_64K) return pmd_offset(pud, addr); else return (pmd_t *) pud; @@ -103,7 +104,7 @@ pmd_t *hpmd_offset(pud_t *pud, unsigned long addr) static inline pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr) { - if (HPAGE_SHIFT == HPAGE_SHIFT_64K) + if (HPAGE_SHIFT == PAGE_SHIFT_64K) return pmd_alloc(mm, pud, addr); else return (pmd_t *) pud; @@ -260,7 +261,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd, continue; hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling); #else - if (HPAGE_SHIFT == HPAGE_SHIFT_64K) { + if (HPAGE_SHIFT == PAGE_SHIFT_64K) { if (pud_none_or_clear_bad(pud)) continue; hugetlb_free_pmd_range(tlb, pud, addr, next, floor, 
ceiling); @@ -591,20 +592,40 @@ void set_huge_psize(int psize) { /* Check that it is a page size supported by the hardware and * that it fits within pagetable limits. */ - if (mmu_psize_defs[psize].shift mmu_psize_defs[psize].shift SID_SHIFT + if (mmu_psize_defs[psize].shift + mmu_psize_defs[psize].shift SID_SHIFT_1T (mmu_psize_defs[psize].shift MIN_HUGEPTE_SHIFT || - mmu_psize_defs[psize].shift == HPAGE_SHIFT_64K)) { +mmu_psize_defs[psize].shift == PAGE_SHIFT_64K || +mmu_psize_defs[psize].shift == PAGE_SHIFT_16G)) { + /* Return if huge page size is the same as the +* base page size. */ + if (mmu_psize_defs[psize].shift == PAGE_SHIFT) + return; + HPAGE_SHIFT = mmu_psize_defs[psize].shift; mmu_huge_psize = psize; -#ifdef CONFIG_PPC_64K_PAGES - hugepte_shift = (PMD_SHIFT-HPAGE_SHIFT); -#else - if (HPAGE_SHIFT == HPAGE_SHIFT_64K) - hugepte_shift = (PMD_SHIFT-HPAGE_SHIFT); - else - hugepte_shift = (PUD_SHIFT-HPAGE_SHIFT); -#endif + switch (HPAGE_SHIFT) { + case PAGE_SHIFT_64K: + /* We only allow 64k hpages with 4k base page, +* which was checked above, and always put them +* at the PMD */ + hugepte_shift = PMD_SHIFT; + break; + case PAGE_SHIFT_16M: + /* 16M pages can be at two different levels +* of pagestables based on base page size */ + if (PAGE_SHIFT == PAGE_SHIFT_64K) + hugepte_shift = PMD_SHIFT; + else /* 4k base page */ + hugepte_shift = PUD_SHIFT; + break; + case PAGE_SHIFT_16G: + /* 16G pages are always at PGD level */ + hugepte_shift = PGDIR_SHIFT; + break; + } + hugepte_shift -= HPAGE_SHIFT; } else HPAGE_SHIFT = 0; } @@ -620,17 +641,22 @@ static int __init hugepage_setup_sz(char *str) shift = __ffs(size); switch (shift) { #ifndef CONFIG_PPC_64K_PAGES - case HPAGE_SHIFT_64K: + case PAGE_SHIFT_64K: mmu_psize = MMU_PAGE_64K; break; #endif - case HPAGE_SHIFT_16M: + case PAGE_SHIFT_16M: mmu_psize = MMU_PAGE_16M
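The infinite loop mentioned in the description of this patch can be modelled in user-space C. This is a sketch, not the kernel macro: stride32() stands in for an increment computed in 32 bits, stride64() for the long version, and count_steps() is a hypothetical loop with a step limit so that the zero-stride case can be observed safely.

```c
#include <stdint.h>

/* Stride as it would look if the "1" being shifted were only 32 bits
 * wide. Computed through a 64-bit shift and then truncated, to avoid
 * undefined behaviour in this sketch. shift must be below 64. */
static uint32_t stride32(unsigned int shift)
{
	return (uint32_t)(UINT64_C(1) << shift);
}

/* Stride with a 64-bit "1", as after the fix. shift must be below 64. */
static uint64_t stride64(unsigned int shift)
{
	return UINT64_C(1) << shift;
}

/* Walk from 0 to end in steps of stride. Returns the number of steps,
 * or -1 if max_steps is reached first (a zero stride never ends). */
static long count_steps(uint64_t end, uint64_t stride, long max_steps)
{
	uint64_t va = 0;
	long steps = 0;

	while (va < end) {
		if (steps == max_steps)
			return -1;
		va += stride;
		steps++;
	}
	return steps;
}
```

With a shift of 34 the truncated stride is 0, so the walk makes no progress, while the 64-bit stride covers a 32G range in two steps.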
[PATCH 5/6 v2] check for overflow
Adds a check for an overflow in the filesystem size, so that if someone calls statfs() on a 16G hugetlbfs from a 32-bit binary it reports EOVERFLOW instead of a size of 0.

Are there other places that need a similar check? I tried a similar check in put_compat_statfs64() too, but it didn't seem to generate an EOVERFLOW in my test case.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

 fs/compat.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 2ce4456..6eb6aad 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -196,8 +196,8 @@ static int put_compat_statfs(struct compat_statfs __user *ubuf, struct kstatfs *
 {
 	
 	if (sizeof ubuf->f_blocks == 4) {
-		if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail) &
-		    0xffffffff00000000ULL)
+		if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
+		     kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
 			return -EOVERFLOW;
 		/* f_files and f_ffree may be -1; it's okay
 		 * to stuff that into 32 bits */
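As a rough user-space illustration of the check, the following sketch ORs the size fields together and tests the upper 32 bits. The struct kstatfs_sketch and the function check_fits_32bit() are made-up names for this example; only the idea of including f_bsize and f_frsize in the test comes from the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the size fields of struct kstatfs. */
struct kstatfs_sketch {
	uint64_t f_blocks;
	uint64_t f_bfree;
	uint64_t f_bavail;
	uint64_t f_bsize;
	uint64_t f_frsize;
};

/* Returns -EOVERFLOW if any field needs more than 32 bits, else 0. */
static int check_fits_32bit(const struct kstatfs_sketch *k)
{
	uint64_t all = k->f_blocks | k->f_bfree | k->f_bavail |
		       k->f_bsize | k->f_frsize;

	if (all & UINT64_C(0xffffffff00000000))
		return -EOVERFLOW;
	return 0;
}
```

A 16G block size (1 << 34) does not fit in 32 bits, so a filesystem with that f_bsize is rejected even when the block counts themselves are small.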
[PATCH 6/6] powerpc: support multiple huge page sizes
Instead of using the variable mmu_huge_psize to keep track of the huge page size we use an array of MMU_PAGE_* values. For each supported huge page size we need to know the hugepte_shift value and have a pgtable_cache. The hstate or an mmu_huge_psizes index is passed to functions so that they know which huge page size they should use. The hugepage sizes 16M and 64K are setup(if available on the hardware) so that they don't have to be set on the boot cmd line in order to use them. The number of 16G pages have to be specified at boot-time though (e.g. hugepagesz=16G hugepages=5). Signed-off-by: Jon Tollefson [EMAIL PROTECTED] --- arch/powerpc/mm/hash_utils_64.c |9 - arch/powerpc/mm/hugetlbpage.c| 267 +-- arch/powerpc/mm/init_64.c|8 - arch/powerpc/mm/tlb_64.c |2 include/asm-powerpc/mmu-hash64.h |4 include/asm-powerpc/page_64.h|1 include/asm-powerpc/pgalloc-64.h |4 7 files changed, 187 insertions(+), 108 deletions(-) --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -99,7 +99,6 @@ int mmu_kernel_ssize = MMU_SEGSIZE_256M; int mmu_highuser_ssize = MMU_SEGSIZE_256M; u16 mmu_slb_size = 64; #ifdef CONFIG_HUGETLB_PAGE -int mmu_huge_psize = MMU_PAGE_16M; unsigned int HPAGE_SHIFT; #endif #ifdef CONFIG_PPC_64K_PAGES @@ -412,15 +411,15 @@ static void __init htab_init_page_sizes(void) /* Reserve 16G huge page memory sections for huge pages */ of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL); -/* Init large page size. Currently, we pick 16M or 1M depending +/* Set default large page size. 
Currently, we pick 16M or 1M depending * on what is available */ if (mmu_psize_defs[MMU_PAGE_16M].shift) - set_huge_psize(MMU_PAGE_16M); + HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift; /* With 4k/4level pagetables, we can't (for now) cope with a * huge page size PMD_SIZE */ else if (mmu_psize_defs[MMU_PAGE_1M].shift) - set_huge_psize(MMU_PAGE_1M); + HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift; #endif /* CONFIG_HUGETLB_PAGE */ } @@ -819,7 +818,7 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap) #ifdef CONFIG_HUGETLB_PAGE /* Handle hugepage regions */ - if (HPAGE_SHIFT psize == mmu_huge_psize) { + if (HPAGE_SHIFT mmu_huge_psizes[psize]) { DBG_LOW( - huge page !\n); return hash_huge_page(mm, access, ea, vsid, local, trap); } diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 063ec36..61ce875 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -37,15 +37,30 @@ static unsigned long gpage_freearray[MAX_NUMBER_GPAGES]; static unsigned nr_gpages; -unsigned int hugepte_shift; -#define PTRS_PER_HUGEPTE (1 hugepte_shift) -#define HUGEPTE_TABLE_SIZE (sizeof(pte_t) hugepte_shift) +/* Array of valid huge page sizes - non-zero value(hugepte_shift) is + * stored for the huge page sizes that are valid. 
+ */ +unsigned int mmu_huge_psizes[MMU_PAGE_COUNT]; + +#define hugepte_shift mmu_huge_psizes +#define PTRS_PER_HUGEPTE(psize)(1 hugepte_shift[psize]) +#define HUGEPTE_TABLE_SIZE(psize) (sizeof(pte_t) hugepte_shift[psize]) + +#define HUGEPD_SHIFT(psize)(mmu_psize_to_shift(psize) \ + + hugepte_shift[psize]) +#define HUGEPD_SIZE(psize) (1UL HUGEPD_SHIFT(psize)) +#define HUGEPD_MASK(psize) (~(HUGEPD_SIZE(psize)-1)) -#define HUGEPD_SHIFT (HPAGE_SHIFT + hugepte_shift) -#define HUGEPD_SIZE(1UL HUGEPD_SHIFT) -#define HUGEPD_MASK(~(HUGEPD_SIZE-1)) +/* Subtract one from array size because we don't need a cache for 4K since + * is not a huge page size */ +#define huge_pgtable_cache(psize) (pgtable_cache[HUGEPTE_CACHE_NUM \ + + psize-1]) +#define HUGEPTE_CACHE_NAME(psize) (huge_pgtable_cache_name[psize]) -#define huge_pgtable_cache (pgtable_cache[HUGEPTE_CACHE_NUM]) +static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = { + unused_4K, hugepte_cache_64K, unused_64K_AP, + hugepte_cache_1M, hugepte_cache_16M, hugepte_cache_16G +}; /* Flag to mark huge PD pointers. This means pmd_bad() and pud_bad() * will choke on pointers to hugepte tables, which is handy for @@ -56,24 +71,49 @@ typedef struct { unsigned long pd; } hugepd_t; #define hugepd_none(hpd)((hpd).pd == 0) +static inline int shift_to_mmu_psize(unsigned int shift) +{ + switch (shift) { +#ifndef CONFIG_PPC_64K_PAGES + case PAGE_SHIFT_64K: + return MMU_PAGE_64K; +#endif + case PAGE_SHIFT_16M: + return MMU_PAGE_16M; + case PAGE_SHIFT_16G: + return MMU_PAGE_16G
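To picture what gets stored per huge page size, here is a small sketch of the hugepte_shift calculation, following the switch in set_huge_psize() from patch 4/6 v2. The struct level_shifts and the function hugepte_shift_for() are hypothetical, and the page table level shifts are passed in as parameters because their concrete values depend on the kernel configuration.

```c
/* Page shifts named in the series. */
#define PAGE_SHIFT_64K 16
#define PAGE_SHIFT_16M 24
#define PAGE_SHIFT_16G 34

/* Page table level shifts for one configuration. Any concrete values
 * supplied by a caller are assumptions, not taken from the patch. */
struct level_shifts {
	unsigned int base_page_shift;
	unsigned int pmd_shift;
	unsigned int pud_shift;
	unsigned int pgdir_shift;
};

/* Returns the hugepte_shift for a huge page shift, or 0 if the size
 * is not usable as a huge page size in this sketch. */
static unsigned int hugepte_shift_for(const struct level_shifts *l,
				      unsigned int hpage_shift)
{
	unsigned int level;

	if (hpage_shift == l->base_page_shift)
		return 0;	/* same as the base page size */

	switch (hpage_shift) {
	case PAGE_SHIFT_64K:
		level = l->pmd_shift;	/* always at the PMD */
		break;
	case PAGE_SHIFT_16M:
		level = (l->base_page_shift == PAGE_SHIFT_64K) ?
			l->pmd_shift : l->pud_shift;
		break;
	case PAGE_SHIFT_16G:
		level = l->pgdir_shift;	/* always at the PGD */
		break;
	default:
		return 0;
	}
	return level - hpage_shift;
}
```

Assuming a 4K base page layout with PMD, PUD and PGD shifts of 21, 28 and 35, this gives 5, 4 and 1 index bits for 64K, 16M and 16G pages respectively.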
[PATCH 0/6] 16G and multi size hugetlb page support on powerpc
This patch set builds on Nick Piggin's patches for multi-size and giant hugetlb page support of April 22. It adds support for 16G huge pages on ppc64 and for using multiple huge page sizes at the same time on ppc64, allowing 64K, 16M, and 16G huge pages on a POWER5+ or later machine.

New in this version are numerous bug fixes and cleanups, but the biggest change is the support for multiple huge page sizes on power.

patch 1: changes to generic hugetlb to enable 16G pages on power
patch 2: powerpc: adds function for allocating 16G pages
patch 3: powerpc: sets up 16G page locations found in device tree
patch 4: powerpc: page definition support for 16G pages
patch 5: check for overflow when user space is 32bit
patch 6: powerpc: multiple huge page size support

Jon
[PATCH 0/4] 16G huge page support for powerpc
This patch set builds on Andi Kleen's patches for GB pages for hugetlb posted on March 16th. This set adds support for 16G huge pages on ppc64. Supporting multiple huge page sizes on ppc64 as defined in Andi's patches is not a part of this set; that will be included in a future patch.

The first patch here adds an arch callback since the 16G pages are not allocated from bootmem. The 16G pages have to be reserved prior to boot-time. The locations of these pages are indicated in the device tree.

Support for 16G pages requires a POWER5+ or later machine and a little bit of memory.

Jon
[PATCH 1/4] allow arch specific function for allocating gigantic pages
Allow alloc_bm_huge_page() to be overridden by architectures that can't always use bootmem. This requires huge_boot_pages to be available for use by this function. Also huge_page_size() and other functions need to use a long so that they can handle the 16G page size. Signed-off-by: Jon Tollefson [EMAIL PROTECTED] --- include/linux/hugetlb.h | 10 +- mm/hugetlb.c| 21 + 2 files changed, 18 insertions(+), 13 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index a8de3c1..35a41be 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -35,6 +35,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed); extern unsigned long hugepages_treat_as_movable; extern const unsigned long hugetlb_zero, hugetlb_infinity; extern int sysctl_hugetlb_shm_group; +extern struct list_head huge_boot_pages; /* arch callbacks */ @@ -219,9 +220,15 @@ struct hstate { unsigned int surplus_huge_pages_node[MAX_NUMNODES]; unsigned long parsed_hugepages; }; +struct huge_bm_page { + struct list_head list; + struct hstate *hstate; +}; void __init huge_add_hstate(unsigned order); struct hstate *huge_lookup_hstate(unsigned long pagesize); +/* arch callback */ +int alloc_bm_huge_page(struct hstate *h); #ifndef HUGE_MAX_HSTATE #define HUGE_MAX_HSTATE 1 @@ -248,7 +255,7 @@ static inline struct hstate *hstate_inode(struct inode *i) return HUGETLBFS_I(i)-hstate; } -static inline unsigned huge_page_size(struct hstate *h) +static inline unsigned long huge_page_size(struct hstate *h) { return PAGE_SIZE h-order; } @@ -273,6 +280,7 @@ extern unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE]; #else struct hstate {}; +#define alloc_bm_huge_page(h) NULL #define hstate_file(f) NULL #define hstate_vma(v) NULL #define hstate_inode(i) NULL diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c28b8b6..a0017b0 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -27,6 +27,7 @@ unsigned long max_huge_pages[HUGE_MAX_HSTATE]; unsigned long 
sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE]; static gfp_t htlb_alloc_mask = GFP_HIGHUSER; unsigned long hugepages_treat_as_movable; +struct list_head huge_boot_pages; static int max_hstate = 1; @@ -43,7 +44,8 @@ struct hstate *parsed_hstate __initdata = global_hstate; */ static DEFINE_SPINLOCK(hugetlb_lock); -static void clear_huge_page(struct page *page, unsigned long addr, unsigned sz) +static void clear_huge_page(struct page *page, unsigned long addr, + unsigned long sz) { int i; @@ -521,14 +523,8 @@ static __init char *memfmt(char *buf, unsigned long n) return buf; } -static __initdata LIST_HEAD(huge_boot_pages); - -struct huge_bm_page { - struct list_head list; - struct hstate *hstate; -}; - -static int __init alloc_bm_huge_page(struct hstate *h) +/* Can be overriden by architectures */ +__attribute__((weak)) int alloc_bm_huge_page(struct hstate *h) { struct huge_bm_page *m; m = __alloc_bootmem_node_nopanic(NODE_DATA(h-hugetlb_next_nid), @@ -614,6 +610,7 @@ static int __init hugetlb_init(void) { if (HPAGE_SHIFT == 0) return 0; + INIT_LIST_HEAD(huge_boot_pages); return hugetlb_init_hstate(global_hstate); } module_init(hugetlb_init); @@ -866,7 +863,7 @@ int hugetlb_report_meminfo(char *buf) n += dump_field(buf + n, offsetof(struct hstate, surplus_huge_pages)); n += sprintf(buf + n, Hugepagesize: ); for_each_hstate (h) - n += sprintf(buf + n, %5u, huge_page_size(h) / 1024); + n += sprintf(buf + n, %5lu, huge_page_size(h) / 1024); n += sprintf(buf + n, kB\n); return n; } @@ -947,7 +944,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, unsigned long addr; int cow; struct hstate *h = hstate_vma(vma); - unsigned sz = huge_page_size(h); + unsigned long sz = huge_page_size(h); cow = (vma-vm_flags (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; @@ -992,7 +989,7 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, struct page *page; struct page *tmp; struct hstate *h = hstate_vma(vma); - unsigned sz = 
huge_page_size(h); + unsigned long sz = huge_page_size(h); /* * A page gathering list, protected by per file i_mmap_lock. The ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 2/4] powerpc: function for allocating gigantic pages
The 16G page locations have been saved during early boot in an array. The alloc_bm_huge_page() function adds a page from here to the huge_boot_pages list.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

 hugetlbpage.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 94625db..31d977b 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -29,6 +29,10 @@
 
 #define NUM_LOW_AREAS	(0x100000000UL >> SID_SHIFT)
 #define NUM_HIGH_AREAS	(PGTABLE_RANGE >> HTLB_AREA_SHIFT)
+#define MAX_NUMBER_GPAGES	1024
+
+static void *gpage_freearray[MAX_NUMBER_GPAGES];
+static unsigned nr_gpages;
 
 unsigned int hugepte_shift;
 #define PTRS_PER_HUGEPTE	(1 << hugepte_shift)
@@ -104,6 +108,21 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr)
 }
 #endif
 
+/* Put 16G page address into temporary huge page list because the mem_map
+ * is not up yet.
+ */
+int alloc_bm_huge_page(struct hstate *h)
+{
+	struct huge_bm_page *m;
+	if (nr_gpages == 0)
+		return 0;
+	m = gpage_freearray[--nr_gpages];
+	list_add(&m->list, &huge_boot_pages);
+	m->hstate = h;
+	return 1;
+}
+
+
 /* Modelled after find_linux_pte() */
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
[PATCH 3/4] powerpc: scan device tree and save gigantic page locations
The 16G huge pages have to be reserved in the HMC prior to boot. The location of the pages are placed in the device tree. During very early boot these locations are saved for use by hugetlbfs. Signed-off-by: Jon Tollefson [EMAIL PROTECTED] --- arch/powerpc/mm/hash_utils_64.c | 41 ++- arch/powerpc/mm/hugetlbpage.c| 17 include/asm-powerpc/mmu-hash64.h |2 + 3 files changed, 59 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index a83dfa3..d3f7d92 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -67,6 +67,7 @@ #define KB (1024) #define MB (1024*KB) +#define GB (1024L*MB) /* * Note: pte -- Linux PTE @@ -302,6 +303,41 @@ static int __init htab_dt_scan_page_sizes(unsigned long node, return 0; } +/* Scan for 16G memory blocks that have been set aside for huge pages + * and reserve those blocks for 16G huge pages. + */ +static int __init htab_dt_scan_hugepage_blocks(unsigned long node, + const char *uname, int depth, + void *data) { + char *type = of_get_flat_dt_prop(node, device_type, NULL); + unsigned long *lprop; + u32 *prop; + + /* We are scanning memory nodes only */ + if (type == NULL || strcmp(type, memory) != 0) + return 0; + + /* This property is the log base 2 of the number of virtual pages that +* will represent this memory block. 
*/ + prop = of_get_flat_dt_prop(node, ibm,expected#pages, NULL); + if (prop == NULL) + return 0; + unsigned int expected_pages = (1 prop[0]); + lprop = of_get_flat_dt_prop(node, reg, NULL); + if (lprop == NULL) + return 0; + long unsigned int phys_addr = lprop[0]; + long unsigned int block_size = lprop[1]; + if (block_size != (16 * GB)) + return 0; + printk(KERN_INFO Reserving huge page memory + addr = 0x%lX size = 0x%lX pages = %d\n, + phys_addr, block_size, expected_pages); + lmb_reserve(phys_addr, block_size * expected_pages); + add_gpage(phys_addr, block_size, expected_pages); + return 0; +} + static void __init htab_init_page_sizes(void) { int rc; @@ -370,7 +406,10 @@ static void __init htab_init_page_sizes(void) mmu_psize_defs[mmu_io_psize].shift); #ifdef CONFIG_HUGETLB_PAGE - /* Init large page size. Currently, we pick 16M or 1M depending + /* Reserve 16G huge page memory sections for huge pages */ + of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL); + +/* Init large page size. Currently, we pick 16M or 1M depending * on what is available */ if (mmu_psize_defs[MMU_PAGE_16M].shift) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 31d977b..44d3d55 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -108,6 +108,23 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr) } #endif +/* Build list of addresses of gigantic pages. This function is used in early + * boot before the buddy allocator is setup. + */ +void add_gpage(unsigned long addr, unsigned long page_size, + unsigned long number_of_pages) +{ + if (addr) { + while (number_of_pages 0) { + gpage_freearray[nr_gpages] = __va(addr); + nr_gpages++; + number_of_pages--; + addr += page_size; + } + } +} + + /* Put 16G page address into temporary huge page list because the mem_map * is not up yet. 
*/ diff --git a/include/asm-powerpc/mmu-hash64.h b/include/asm-powerpc/mmu-hash64.h index 2864fa3..db1276a 100644 --- a/include/asm-powerpc/mmu-hash64.h +++ b/include/asm-powerpc/mmu-hash64.h @@ -279,6 +279,8 @@ extern int htab_bolt_mapping(unsigned long vstart, unsigned long vend, unsigned long pstart, unsigned long mode, int psize, int ssize); extern void set_huge_psize(int psize); +extern void add_gpage(unsigned long addr, unsigned long page_size, + unsigned long number_of_pages); extern void demote_segment_4k(struct mm_struct *mm, unsigned long addr); extern void htab_initialize(void); ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 4/4] powerpc: define page support for 16G pages
The huge page size is setup for 16G pages if that size is specified at boot-time. The support for multiple huge page sizes is not being utilized yet. That will be in a future patch. Signed-off-by: Jon Tollefson [EMAIL PROTECTED] --- hugetlbpage.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 44d3d55..b6a02b7 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -26,6 +26,7 @@ #define HPAGE_SHIFT_64K16 #define HPAGE_SHIFT_16M24 +#define HPAGE_SHIFT_16G34 #define NUM_LOW_AREAS (0x1UL SID_SHIFT) #define NUM_HIGH_AREAS (PGTABLE_RANGE HTLB_AREA_SHIFT) @@ -589,9 +590,11 @@ void set_huge_psize(int psize) { /* Check that it is a page size supported by the hardware and * that it fits within pagetable limits. */ - if (mmu_psize_defs[psize].shift mmu_psize_defs[psize].shift SID_SHIFT + if (mmu_psize_defs[psize].shift + mmu_psize_defs[psize].shift SID_SHIFT_1T (mmu_psize_defs[psize].shift MIN_HUGEPTE_SHIFT || - mmu_psize_defs[psize].shift == HPAGE_SHIFT_64K)) { +mmu_psize_defs[psize].shift == HPAGE_SHIFT_64K || +mmu_psize_defs[psize].shift == HPAGE_SHIFT_16G)) { HPAGE_SHIFT = mmu_psize_defs[psize].shift; mmu_huge_psize = psize; #ifdef CONFIG_PPC_64K_PAGES @@ -599,6 +602,8 @@ void set_huge_psize(int psize) #else if (HPAGE_SHIFT == HPAGE_SHIFT_64K) hugepte_shift = (PMD_SHIFT-HPAGE_SHIFT); + else if (HPAGE_SHIFT == HPAGE_SHIFT_16G) + hugepte_shift = (PGDIR_SHIFT-HPAGE_SHIFT); else hugepte_shift = (PUD_SHIFT-HPAGE_SHIFT); #endif @@ -625,6 +630,9 @@ static int __init hugepage_setup_sz(char *str) case HPAGE_SHIFT_16M: mmu_psize = MMU_PAGE_16M; break; + case HPAGE_SHIFT_16G: + mmu_psize = MMU_PAGE_16G; + break; } if (mmu_psize =0 mmu_psize_defs[mmu_psize].shift) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
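The mapping from a hugepagesz= string to a page shift can be sketched in user-space C as follows. This is an approximation under assumptions: parse_size() is a hypothetical stand-in for the kernel's size parsing, size_to_shift() finds the lowest set bit in a portable loop, and shift_to_name() only returns the MMU_PAGE_* names as strings for display.

```c
#include <stdlib.h>

/* Parse strings such as "64K", "16M" or "16G" into a byte count.
 * Returns 0 for input that is not a number with an optional suffix. */
static unsigned long long parse_size(const char *s)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		end++;
		break;
	default:
		break;
	}
	return *end == '\0' ? v : 0;
}

/* Index of the lowest set bit, or -1 for a size of 0. */
static int size_to_shift(unsigned long long size)
{
	int shift = 0;

	if (size == 0)
		return -1;
	while (!(size & 1)) {
		size >>= 1;
		shift++;
	}
	return shift;
}

/* Name of the page size constant for a shift, or NULL if unsupported. */
static const char *shift_to_name(int shift)
{
	switch (shift) {
	case 16: return "MMU_PAGE_64K";
	case 24: return "MMU_PAGE_16M";
	case 34: return "MMU_PAGE_16G";
	default: return NULL;
	}
}
```

Passing "16G" through these helpers gives a shift of 34, which selects the 16G case that this patch adds to the switch.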
Re: [PATCH v3] powerpc: add hugepagesz boot-time parameter
Arnd Bergmann wrote:
> We started discussing this in v1, but the discussion got sidetracked:
> Is there a technical reason why you don't also allow 1M pages, which
> may be useful in certain scenarios?

No, it was mostly a matter of the time I have had and the machines easily available to me for testing. I don't know of a technical reason that would prevent supporting 1M huge pages, but I would want the tests in the libhugetlbfs suite to pass, etc.

> On the Cell/B.E. platforms (IBM/Mercury blades, Toshiba Celleb, PS3),
> the second large page size is an option that can be set in a HID SPR
> to either 64KB or 1MB. Unfortunately, we can't do these two
> simultaneously, but the firmware can change the default and put it
> into the device tree, or you could have the kernel override the
> firmware settings.
>
> Going a lot further, do you have plans for a fully dynamic hugepage
> size, e.g. using a mount option for hugetlbfs? I can see that as
> rather useful, but at the same time it's probably much more
> complicated than the boot time option.

Eventually we will want to support dynamic huge page sizes. This is already being looked into. In the meantime we can have some flexibility with a boot-time parameter though.

> Arnd

Jon
[PATCH v3] powerpc: add hugepagesz boot-time parameter
Paul, please include this in 2.6.25 if there are no objections. This patch adds the hugepagesz boot-time parameter for ppc64. It lets one pick the size for huge pages. The choices available are 64K and 16M when the base page size is 4k. It defaults to 16M (previously the only only choice) if nothing or an invalid choice is specified. Tested 64K huge pages successfully with the libhugetlbfs 1.2. Changes from v2: Moved functions from header file into hugetlbpage.c where they are used. Signed-off-by: Jon Tollefson [EMAIL PROTECTED] --- diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 33121d6..2fc1fb8 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -685,6 +685,7 @@ and is between 256 and 4096 characters. It is defined in the file See Documentation/isdn/README.HiSax. hugepages= [HW,X86-32,IA-64] Maximal number of HugeTLB pages. + hugepagesz= [HW,IA-64,PPC] The size of the HugeTLB pages. i8042.direct[HW] Put keyboard port into non-translated mode i8042.dumbkbd [HW] Pretend that controller can only read data from diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index cbbd8b0..9326a69 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -369,18 +369,11 @@ static void __init htab_init_page_sizes(void) * on what is available */ if (mmu_psize_defs[MMU_PAGE_16M].shift) - mmu_huge_psize = MMU_PAGE_16M; + set_huge_psize(MMU_PAGE_16M); /* With 4k/4level pagetables, we can't (for now) cope with a * huge page size PMD_SIZE */ else if (mmu_psize_defs[MMU_PAGE_1M].shift) - mmu_huge_psize = MMU_PAGE_1M; - - /* Calculate HPAGE_SHIFT and sanity check it */ - if (mmu_psize_defs[mmu_huge_psize].shift MIN_HUGEPTE_SHIFT - mmu_psize_defs[mmu_huge_psize].shift SID_SHIFT) - HPAGE_SHIFT = mmu_psize_defs[mmu_huge_psize].shift; - else - HPAGE_SHIFT = 0; /* No huge pages dude ! 
*/ + set_huge_psize(MMU_PAGE_1M); #endif /* CONFIG_HUGETLB_PAGE */ } diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 71efb38..a02266d 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -24,18 +24,17 @@ #include asm/cputable.h #include asm/spu.h +#define HPAGE_SHIFT_64K16 +#define HPAGE_SHIFT_16M24 + #define NUM_LOW_AREAS (0x1UL SID_SHIFT) #define NUM_HIGH_AREAS (PGTABLE_RANGE HTLB_AREA_SHIFT) -#ifdef CONFIG_PPC_64K_PAGES -#define HUGEPTE_INDEX_SIZE (PMD_SHIFT-HPAGE_SHIFT) -#else -#define HUGEPTE_INDEX_SIZE (PUD_SHIFT-HPAGE_SHIFT) -#endif -#define PTRS_PER_HUGEPTE (1 HUGEPTE_INDEX_SIZE) -#define HUGEPTE_TABLE_SIZE (sizeof(pte_t) HUGEPTE_INDEX_SIZE) +unsigned int hugepte_shift; +#define PTRS_PER_HUGEPTE (1 hugepte_shift) +#define HUGEPTE_TABLE_SIZE (sizeof(pte_t) hugepte_shift) -#define HUGEPD_SHIFT (HPAGE_SHIFT + HUGEPTE_INDEX_SIZE) +#define HUGEPD_SHIFT (HPAGE_SHIFT + hugepte_shift) #define HUGEPD_SIZE(1UL HUGEPD_SHIFT) #define HUGEPD_MASK(~(HUGEPD_SIZE-1)) @@ -82,11 +81,35 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp, return 0; } +/* Base page size affects how we walk hugetlb page tables */ +#ifdef CONFIG_PPC_64K_PAGES +#define hpmd_offset(pud, addr) pmd_offset(pud, addr) +#define hpmd_alloc(mm, pud, addr) pmd_alloc(mm, pud, addr) +#else +static inline +pmd_t *hpmd_offset(pud_t *pud, unsigned long addr) +{ + if (HPAGE_SHIFT == HPAGE_SHIFT_64K) + return pmd_offset(pud, addr); + else + return (pmd_t *) pud; +} +static inline +pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr) +{ + if (HPAGE_SHIFT == HPAGE_SHIFT_64K) + return pmd_alloc(mm, pud, addr); + else + return (pmd_t *) pud; +} +#endif + /* Modelled after find_linux_pte() */ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) { pgd_t *pg; pud_t *pu; + pmd_t *pm; BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize); @@ -96,14 +119,9 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long 
addr) if (!pgd_none(*pg)) { pu = pud_offset(pg, addr); if (!pud_none(*pu)) { -#ifdef CONFIG_PPC_64K_PAGES - pmd_t *pm; - pm = pmd_offset(pu, addr); + pm = hpmd_offset(pu, addr); if (!pmd_none(*pm)) return hugepte_offset((hugepd_t *)pm, addr); -#else - return hugepte_offset((hugepd_t *)pu, addr); -#endif } } @@ -114,6 +132,7 @@ pte_t
[PATCH v2] powerpc: add hugepagesz boot-time parameter
Paul, please include this in 2.6.25 if there are no objections. This patch adds the hugepagesz boot-time parameter for ppc64. It lets one pick the size for huge pages. The choices available are 64K and 16M when the base page size is 4k. It defaults to 16M (previously the only only choice) if nothing or an invalid choice is specified. Tested 64K huge pages successfully with the libhugetlbfs 1.2. Changes from v1: disallow 64K huge pages when base page size is 64K since we can't distinguish between base and huge pages when doing a hash_page() collapsed pmd_offset and pmd_alloc to inline calls to simplify the main code removed printing of the huge page size in mm/hugetlb.c since this information is already available in /proc/meminfo and leaves the remaining changes all powerpc specific Signed-off-by: Jon Tollefson [EMAIL PROTECTED] --- diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 33121d6..2fc1fb8 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -685,6 +685,7 @@ and is between 256 and 4096 characters. It is defined in the file See Documentation/isdn/README.HiSax. hugepages= [HW,X86-32,IA-64] Maximal number of HugeTLB pages. + hugepagesz= [HW,IA-64,PPC] The size of the HugeTLB pages. 
 	i8042.direct	[HW] Put keyboard port into non-translated mode
 	i8042.dumbkbd	[HW] Pretend that controller can only read data from
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index cbbd8b0..9326a69 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -369,18 +369,11 @@ static void __init htab_init_page_sizes(void)
 	 * on what is available
 	 */
 	if (mmu_psize_defs[MMU_PAGE_16M].shift)
-		mmu_huge_psize = MMU_PAGE_16M;
+		set_huge_psize(MMU_PAGE_16M);
 	/* With 4k/4level pagetables, we can't (for now) cope with a
 	 * huge page size < PMD_SIZE */
 	else if (mmu_psize_defs[MMU_PAGE_1M].shift)
-		mmu_huge_psize = MMU_PAGE_1M;
-
-	/* Calculate HPAGE_SHIFT and sanity check it */
-	if (mmu_psize_defs[mmu_huge_psize].shift > MIN_HUGEPTE_SHIFT &&
-	    mmu_psize_defs[mmu_huge_psize].shift < SID_SHIFT)
-		HPAGE_SHIFT = mmu_psize_defs[mmu_huge_psize].shift;
-	else
-		HPAGE_SHIFT = 0; /* No huge pages dude ! */
+		set_huge_psize(MMU_PAGE_1M);
 #endif /* CONFIG_HUGETLB_PAGE */
 }
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 71efb38..3099e48 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -24,18 +24,17 @@
 #include <asm/cputable.h>
 #include <asm/spu.h>
 
+#define HPAGE_SHIFT_64K	16
+#define HPAGE_SHIFT_16M	24
+
 #define NUM_LOW_AREAS	(0x1UL SID_SHIFT)
 #define NUM_HIGH_AREAS	(PGTABLE_RANGE >> HTLB_AREA_SHIFT)
 
-#ifdef CONFIG_PPC_64K_PAGES
-#define HUGEPTE_INDEX_SIZE	(PMD_SHIFT-HPAGE_SHIFT)
-#else
-#define HUGEPTE_INDEX_SIZE	(PUD_SHIFT-HPAGE_SHIFT)
-#endif
-#define PTRS_PER_HUGEPTE	(1 << HUGEPTE_INDEX_SIZE)
-#define HUGEPTE_TABLE_SIZE	(sizeof(pte_t) << HUGEPTE_INDEX_SIZE)
+unsigned int hugepte_shift;
+#define PTRS_PER_HUGEPTE	(1 << hugepte_shift)
+#define HUGEPTE_TABLE_SIZE	(sizeof(pte_t) << hugepte_shift)
 
-#define HUGEPD_SHIFT		(HPAGE_SHIFT + HUGEPTE_INDEX_SIZE)
+#define HUGEPD_SHIFT		(HPAGE_SHIFT + hugepte_shift)
 #define HUGEPD_SIZE		(1UL << HUGEPD_SHIFT)
 #define HUGEPD_MASK		(~(HUGEPD_SIZE-1))
 
@@ -82,11 +81,31 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 	return 0;
 }
 
+#ifndef CONFIG_PPC_64K_PAGES
+static inline
+pmd_t *hpmd_offset(pud_t *pud, unsigned long addr)
+{
+	if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+		return pmd_offset(pud, addr);
+	else
+		return (pmd_t *) pud;
+}
+static inline
+pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr)
+{
+	if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+		return pmd_alloc(mm, pud, addr);
+	else
+		return (pmd_t *) pud;
+}
+#endif
+
 /* Modelled after find_linux_pte() */
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pg;
 	pud_t *pu;
+	pmd_t *pm;
 
 	BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
 
@@ -96,14 +115,9 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 	if (!pgd_none(*pg)) {
 		pu = pud_offset(pg, addr);
 		if (!pud_none(*pu)) {
-#ifdef CONFIG_PPC_64K_PAGES
-			pmd_t *pm;
-			pm = pmd_offset(pu, addr);
+			pm = hpmd_offset(pu, addr);
 			if (!pmd_none(*pm))
 				return hugepte_offset((hugepd_t *)pm, addr);
-#else
Re: [PATCH 2/2] powerpc: make 64K huge pages more reliable
David Gibson wrote:
> On Tue, Nov 27, 2007 at 11:03:16PM -0600, Jon Tollefson wrote:
>> This patch adds reliability to the 64K huge page option by making use
>> of the PMD for 64K huge pages when base pages are 4k. So instead of a
>> 12 bit pte it would be 7 bit pmd and a 5 bit pte. The pgd and pud
>> offsets would continue as 9 bits and 7 bits respectively. This will
>> allow the pgtable to fit in one base page. This patch would have to
>> be applied after part 1.
>
> Hrm.. shouldn't we just ban 64K hugepages on a 64K base page size
> setup? There's not a whole lot of point to it, after all...

Banning the base and huge page size from being the same size feels like
an artificial barrier. It is probably not the most massively useful
combination, but it shouldn't hurt performance.

Jon
Re: [PATCH] Use 1TB segments
Paul Mackerras wrote:
> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c

A couple of hunks fail in this file when applying to the current tree.

...

> diff --git a/include/asm-powerpc/mmu-hash64.h b/include/asm-powerpc/mmu-hash64.h
> index 695962f..053f86b 100644
> --- a/include/asm-powerpc/mmu-hash64.h
> +++ b/include/asm-powerpc/mmu-hash64.h
> @@ -47,6 +47,8 @@ extern char initial_stab[];
>  /* Bits in the SLB VSID word */
>  #define SLB_VSID_SHIFT		12
> +#define SLB_VSID_SHIFT_1T	24
> +#define SLB_VSID_SSIZE_SHIFT	62
>  #define SLB_VSID_B		ASM_CONST(0xc000)
>  #define SLB_VSID_B_256M	ASM_CONST(0x)
>  #define SLB_VSID_B_1T		ASM_CONST(0x4000)
> @@ -66,6 +68,7 @@ extern char initial_stab[];
>  #define SLB_VSID_USER		(SLB_VSID_KP|SLB_VSID_KS|SLB_VSID_C)
>
>  #define SLBIE_C		(0x0800)
> +#define SLBIE_SSIZE_SHIFT	25
>
>  /*
>   * Hash table
> @@ -77,7 +80,7 @@ extern char initial_stab[];
>  #define HPTE_V_AVPN_SHIFT	7
>  #define HPTE_V_AVPN		ASM_CONST(0x3f80)
>  #define HPTE_V_AVPN_VAL(x)	(((x) & HPTE_V_AVPN) >> HPTE_V_AVPN_SHIFT)
> -#define HPTE_V_COMPARE(x,y)	(!(((x) ^ (y)) & HPTE_V_AVPN))
> +#define HPTE_V_COMPARE(x,y)	(!(((x) ^ (y)) & 0xff80))
>  #define HPTE_V_BOLTED		ASM_CONST(0x0010)
>  #define HPTE_V_LOCK		ASM_CONST(0x0008)
>  #define HPTE_V_LARGE		ASM_CONST(0x0004)
> @@ -164,16 +167,25 @@ struct mmu_psize_def
>  #define MMU_SEGSIZE_256M	0
>  #define MMU_SEGSIZE_1T		1
>
> +/*
> + * Supported segment sizes
> + */
> +#define MMU_SEGSIZE_256M	0
> +#define MMU_SEGSIZE_1T		1

It looks like this is repeating the definitions just above it.

Jon