Re: [PATCH 1/8] fix bootmem reservation on uninitialized node

2008-12-10 Thread Jon Tollefson

Paul Mackerras wrote:

Dave Hansen writes:

  

This patch ensures that we do not touch bootmem for any node which
has not been initialized.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]



So, should I be sending this to Linus for 2.6.28?

I notice you have added a dbg() call.  For a 2.6.28 patch I'd somewhat
prefer not to have that in unless necessary.

Jon, does this patch fix the problem on your machine with 16G pages?
  
It worked on a machine with one page; I am awaiting access to another 
with more pages.



Paul.
  

Jon



Re: [PATCH] Fix boot freeze on machine with empty memory node

2008-12-04 Thread Jon Tollefson

Dave Hansen wrote:

I got a bug report about a distro kernel not booting on a particular
machine.  It would freeze during boot:

  

...
Could not find start_pfn for node 1
[boot]0015 Setup Done
Built 2 zonelists in Node order, mobility grouping on.  Total pages: 123783
Policy zone: DMA
Kernel command line:
[boot]0020 XICS Init
[boot]0021 XICS Done
PID hash table entries: 4096 (order: 12, 32768 bytes)
clocksource: timebase mult[7d] shift[22] registered
Console: colour dummy device 80x25
console handover: boot [udbg0] -> real [hvc0]
Dentry cache hash table entries: 1048576 (order: 7, 8388608 bytes)
Inode-cache hash table entries: 524288 (order: 6, 4194304 bytes)
freeing bootmem node 0



I've reproduced this on 2.6.27.7.  I'm pretty sure it is caused by this
patch:

http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8f64e1f2d1e09267ac926e15090fd505c1c0cbcb

The problem is that Jon took a loop which was (in pseudocode):

for_each_node(nid)
NODE_DATA(nid) = careful_alloc(nid);
setup_bootmem(nid);
reserve_node_bootmem(nid);

and broke it up into:

for_each_node(nid)
NODE_DATA(nid) = careful_alloc(nid);
setup_bootmem(nid);
for_each_node(nid)
reserve_node_bootmem(nid);

The issue comes in when the 'careful_alloc()' is called on a node with
no memory.  It falls back to using bootmem from a previously-initialized
node.  But, bootmem has not yet been reserved when Jon's patch is
applied.  It gives back bogus memory (0xc000) and pukes
later in boot.

The following patch collapses the loop back together.  It also breaks
the mark_reserved_regions_for_nid() code out into a function and adds
some comments.  I think a huge part of what introduced this bug is that
the for loop was too long and hard to read.
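
For reference, the restored ordering is roughly (same pseudocode as above,
with the new helper; a sketch of the intent, not the literal patch):

for_each_node(nid)
        NODE_DATA(nid) = careful_alloc(nid);
        setup_bootmem(nid);
        mark_reserved_regions_for_nid(nid); /* reserve before the next
                                               node's careful_alloc() can
                                               hand out this bootmem */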

The actual bug fix here is the:

+   if (end_pfn <= node->node_start_pfn ||
+   start_pfn >= node_end_pfn)
+   continue;

Signed-off-by: Dave Hansen [EMAIL PROTECTED]

diff -ru linux-2.6.27.7.orig/arch/powerpc//mm/numa.c 
linux-2.6.27.7/arch/powerpc//mm/numa.c
--- linux-2.6.27.7.orig/arch/powerpc//mm/numa.c 2008-11-20 17:02:37.0 
-0600
+++ linux-2.6.27.7/arch/powerpc//mm/numa.c  2008-11-24 15:53:35.0 
-0600
@@ -822,6 +822,67 @@
.priority = 1 /* Must run before sched domains notifier. */
 };

+static void mark_reserved_regions_for_nid(int nid)
+{
+   struct pglist_data *node = NODE_DATA(nid);
+   int i;
+
+   for (i = 0; i < lmb.reserved.cnt; i++) {
+   unsigned long physbase = lmb.reserved.region[i].base;
+   unsigned long size = lmb.reserved.region[i].size;
+   unsigned long start_pfn = physbase >> PAGE_SHIFT;
+   unsigned long end_pfn = ((physbase + size) >> PAGE_SHIFT);
+   struct node_active_region node_ar;
+   unsigned long node_end_pfn = node->node_start_pfn +
+node->node_spanned_pages;
+
+   /*
+* Check to make sure that this lmb.reserved area is
+* within the bounds of the node that we care about.
+* Checking the nid of the start and end points is not
+* sufficient because the reserved area could span the
+* entire node.
+*/
+   if (end_pfn <= node->node_start_pfn ||
+   start_pfn >= node_end_pfn)
+   continue;
+
+   get_node_active_region(start_pfn, node_ar);
+   while (start_pfn < end_pfn &&
+   node_ar.start_pfn < node_ar.end_pfn) {
+   unsigned long reserve_size = size;
+   /*
+* if reserved region extends past active region
+* then trim size to active region
+*/
+   if (end_pfn > node_ar.end_pfn)
+   reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
+   - (start_pfn << PAGE_SHIFT);
+   dbg("reserve_bootmem %lx %lx nid=%d\n", physbase,
+   reserve_size, node_ar.nid);
+   reserve_bootmem_node(NODE_DATA(node_ar.nid), physbase,
+   reserve_size, BOOTMEM_DEFAULT);
+   /*
+* if reserved region is contained in the active region
+* then done.
+*/
+   if (end_pfn <= node_ar.end_pfn)
+   break;
+
+   /*
+* reserved region extends past the active region
+*   get next active region that contains this
+*   reserved region
+*/
+  

[PATCH 1/1 v2] powerpc: hugetlb pgtable cache access cleanup

2008-10-30 Thread Jon Tollefson
It was suggested by Andrew that using a macro that made an array
look like a function call made it harder to understand the code.
Cleaned up use of macro.  We now reference the pgtable_cache array
directly instead of using a macro.  


Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
Cc: Nick Piggin [EMAIL PROTECTED]
Cc: Paul Mackerras [EMAIL PROTECTED]
Cc: Benjamin Herrenschmidt [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Acked-by: David Gibson [EMAIL PROTECTED]
---

 arch/powerpc/mm/hugetlbpage.c |   22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff -puN 
arch/powerpc/mm/hugetlbpage.c~powerpc-hugetlb-pgtable-cache-access-cleanup 
arch/powerpc/mm/hugetlbpage.c
--- a/arch/powerpc/mm/hugetlbpage.c~powerpc-hugetlb-pgtable-cache-access-cleanup
+++ a/arch/powerpc/mm/hugetlbpage.c
@@ -53,8 +53,7 @@ unsigned int mmu_huge_psizes[MMU_PAGE_CO

 /* Subtract one from array size because we don't need a cache for 4K since
  * is not a huge page size */
-#define huge_pgtable_cache(psize)  (pgtable_cache[HUGEPTE_CACHE_NUM \
-   + psize-1])
+#define HUGE_PGTABLE_INDEX(psize)  (HUGEPTE_CACHE_NUM + psize - 1)
 #define HUGEPTE_CACHE_NAME(psize)  (huge_pgtable_cache_name[psize])

 static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
@@ -113,7 +112,7 @@ static inline pte_t *hugepte_offset(huge
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
   unsigned long address, unsigned int psize)
 {
-   pte_t *new = kmem_cache_zalloc(huge_pgtable_cache(psize),
+   pte_t *new = kmem_cache_zalloc(pgtable_cache[HUGE_PGTABLE_INDEX(psize)],
  GFP_KERNEL|__GFP_REPEAT);

if (! new)
@@ -121,7 +120,7 @@ static int __hugepte_alloc(struct mm_str

	spin_lock(&mm->page_table_lock);
	if (!hugepd_none(*hpdp))
-   kmem_cache_free(huge_pgtable_cache(psize), new);
+   kmem_cache_free(pgtable_cache[HUGE_PGTABLE_INDEX(psize)], new);
	else
	hpdp->pd = (unsigned long)new | HUGEPD_OK;
	spin_unlock(&mm->page_table_lock);
@@ -760,13 +759,14 @@ static int __init hugetlbpage_init(void)

	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
if (mmu_huge_psizes[psize]) {
-   huge_pgtable_cache(psize) = kmem_cache_create(
-   HUGEPTE_CACHE_NAME(psize),
-   HUGEPTE_TABLE_SIZE(psize),
-   HUGEPTE_TABLE_SIZE(psize),
-   0,
-   NULL);
-   if (!huge_pgtable_cache(psize))
+   pgtable_cache[HUGE_PGTABLE_INDEX(psize)] =
+   kmem_cache_create(
+   HUGEPTE_CACHE_NAME(psize),
+   HUGEPTE_TABLE_SIZE(psize),
+   HUGEPTE_TABLE_SIZE(psize),
+   0,
+   NULL);
+   if (!pgtable_cache[HUGE_PGTABLE_INDEX(psize)])
panic("hugetlbpage_init(): could not create %s"\
  "\n", HUGEPTE_CACHE_NAME(psize));
}
_




16G related patches for stable kernel 2.6.27

2008-10-29 Thread Jon Tollefson
Please consider the following patches for the 2.6.27 stable tree.

The first two allow a powerpc machine with more than 2 NUMA nodes
to boot when 16G pages are enabled.  The third one allows a powerpc
machine to boot if using 16G pages and the mem= boot param.

thanks,
Jon


powerpc: Reserve in bootmem lmb reserved regions that cross NUMA nodes
commit 8f64e1f2d1e09267ac926e15090fd505c1c0cbcb
powerpc/numa: Make memory reserve code more robust
commit e81703724a966120ace6504c993bda9e084cbf3e
powerpc: Don't use a 16G page if beyond mem= limits
commit 4792adbac9eb41cea77a45ab76258ea10d411173





[PATCH] powerpc: Don't use a 16G page if beyond mem= limits

2008-10-21 Thread Jon Tollefson

If mem= is used on the boot command line to limit memory then the memory block 
where a 16G page resides may not be available.

Thanks to Michael Ellerman for finding the problem.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

arch/powerpc/mm/hash_utils_64.c |6 --
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 5c64af1..8d5b475 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -382,8 +382,10 @@ static int __init htab_dt_scan_hugepage_blocks(unsigned 
long node,
printk(KERN_INFO "Huge page(16GB) memory: "
"addr = 0x%lX size = 0x%lX pages = %d\n",
phys_addr, block_size, expected_pages);
-   lmb_reserve(phys_addr, block_size * expected_pages);
-   add_gpage(phys_addr, block_size, expected_pages);
+   if (phys_addr + (16 * GB) <= lmb_end_of_DRAM()) {
+   lmb_reserve(phys_addr, block_size * expected_pages);
+   add_gpage(phys_addr, block_size, expected_pages);
+   }
return 0;
}
#endif /* CONFIG_HUGETLB_PAGE */





Re: [PATCH v3] powerpc: properly reserve in bootmem the lmb reserved regions that cross NUMA nodes

2008-10-16 Thread Jon Tollefson

Benjamin Herrenschmidt wrote:

On Thu, 2008-10-09 at 15:18 -0500, Jon Tollefson wrote:
  

If there are multiple reserved memory blocks via lmb_reserve() that are
contiguous addresses and on different NUMA nodes we are losing track of which 
address ranges to reserve in bootmem on which node.  I discovered this 
when I recently got to try 16GB huge pages on a system with more than 2 nodes.



I'm going to apply it, however, could you double check something for
me ? A cursory glance at the new version makes me wonder: what if the
first call to get_node_active_region() ends up with the work_fn never
hitting the if () case ? I think in that case, node_ar->end_pfn never
gets initialized, right ? Can that happen in practice ? I suspect that
isn't the case but better safe than sorry...
  
I have tested this on a few machines and it hasn't been a problem.  But 
I don't see anything in lmb_reserve() that would prevent reserving a 
block that was outside of valid memory.  So to be safe I have attached a 
patch that checks for an empty active range.


I also noticed that the size to reserve for subsequent nodes, for a 
reserve that spans nodes, wasn't taking into account the amount reserved 
on previous nodes, so the patch addresses that too.  If you would prefer 
this be a separate patch let me know.



If there's indeed a potential problem, please send a fixup patch.

Cheers,
Ben.
  

Adjust amount to reserve based on previous nodes for reserves spanning
multiple nodes. Check if the node active range is empty before attempting
to pass the reserve to bootmem.  In practice the range shouldn't be empty,
but to be sure we check.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---


arch/powerpc/mm/numa.c |   15 ++-
1 file changed, 10 insertions(+), 5 deletions(-)


diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 6cf5c71..195bfcd 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -116,6 +116,7 @@ static int __init get_active_region_work_fn(unsigned long 
start_pfn,

/*
 * get_node_active_region - Return active region containing start_pfn
+ * Active range returned is empty if none found.
 * @start_pfn: The page to return the region for.
 * @node_ar: Returned set to the active region containing start_pfn
 */
@@ -126,6 +127,7 @@ static void __init get_node_active_region(unsigned long 
start_pfn,

	node_ar->nid = nid;
	node_ar->start_pfn = start_pfn;
+   node_ar->end_pfn = start_pfn;
work_with_active_regions(nid, get_active_region_work_fn, node_ar);
}

@@ -933,18 +935,20 @@ void __init do_init_bootmem(void)
struct node_active_region node_ar;

get_node_active_region(start_pfn, node_ar);
-   while (start_pfn < end_pfn) {
+   while (start_pfn < end_pfn &&
+   node_ar.start_pfn < node_ar.end_pfn) {
+   unsigned long reserve_size = size;
/*
 * if reserved region extends past active region
 * then trim size to active region
 */
	if (end_pfn > node_ar.end_pfn)
-   size = (node_ar.end_pfn << PAGE_SHIFT)
+   reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
	- (start_pfn << PAGE_SHIFT);
-   dbg("reserve_bootmem %lx %lx nid=%d\n", physbase, size,
-   node_ar.nid);
+   dbg("reserve_bootmem %lx %lx nid=%d\n", physbase,
+   reserve_size, node_ar.nid);
reserve_bootmem_node(NODE_DATA(node_ar.nid), physbase,
-   size, BOOTMEM_DEFAULT);
+   reserve_size, BOOTMEM_DEFAULT);
/*
 * if reserved region is contained in the active region
 * then done.
@@ -959,6 +963,7 @@ void __init do_init_bootmem(void)
 */
start_pfn = node_ar.end_pfn;
	physbase = start_pfn << PAGE_SHIFT;
+   size = size - reserve_size;
get_node_active_region(start_pfn, node_ar);
}







[PATCH v3] powerpc: properly reserve in bootmem the lmb reserved regions that cross NUMA nodes

2008-10-09 Thread Jon Tollefson
If there are multiple reserved memory blocks via lmb_reserve() that are
contiguous addresses and on different NUMA nodes we are losing track of which 
address ranges to reserve in bootmem on which node.  I discovered this 
when I recently got to try 16GB huge pages on a system with more than 2 nodes.

When scanning the device tree in early boot we call lmb_reserve() with 
the addresses of the 16G pages that we find so that the memory doesn't 
get used for something else.  For example the addresses for the pages 
could be 40, 44, 48, 4C, etc - 8 pages, 
one on each of eight nodes.  In the lmb after all the pages have been 
reserved it will look something like the following:

lmb_dump_all:
memory.cnt= 0x2
memory.size   = 0x3e8000
memory.region[0x0].base   = 0x0
  .size = 0x1e8000
memory.region[0x1].base   = 0x40
  .size = 0x20
reserved.cnt  = 0x5
reserved.size = 0x3e8000
reserved.region[0x0].base   = 0x0
  .size = 0x7b5000
reserved.region[0x1].base   = 0x2a0
  .size = 0x78c000
reserved.region[0x2].base   = 0x328c000
  .size = 0x43000
reserved.region[0x3].base   = 0xf4e8000
  .size = 0xb18000
reserved.region[0x4].base   = 0x40
  .size = 0x20


The reserved.region[0x4] contains the 16G pages.  In 
arch/powerpc/mm/numa.c: do_init_bootmem() we loop through each of the 
node numbers looking for the reserved regions that belong to the 
particular node.  It is not able to identify region 0x4 as being a part 
of each of the 8 nodes.  It is assuming that a reserved region is only
on a single node.

This patch takes out the reserved region loop from inside
the loop that goes over each node.  It looks up the active region containing
the start of the reserved region.  If it extends past that active region then
it adjusts the size and gets the next active region containing it.
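
A trimmed sketch of that walk, based on the loop the patch adds below
(variable names as in the patch; node lookup and debug output omitted):

	get_node_active_region(start_pfn, &node_ar);
	while (start_pfn < end_pfn) {
		/* trim the reservation to the current active region */
		if (end_pfn > node_ar.end_pfn)
			size = (node_ar.end_pfn << PAGE_SHIFT)
				- (start_pfn << PAGE_SHIFT);
		reserve_bootmem_node(NODE_DATA(node_ar.nid), physbase,
				     size, BOOTMEM_DEFAULT);
		if (end_pfn <= node_ar.end_pfn)
			break;	/* fully contained in this active region */
		/* otherwise step into the next active region */
		start_pfn = node_ar.end_pfn;
		physbase = start_pfn << PAGE_SHIFT;
		get_node_active_region(start_pfn, &node_ar);
	}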

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

Changes:
v2:
-style changes as suggested by Adam Litke
v3:
-moved helper function to powerpc code since it is the only user at 
present
-made end_pfn consistently exclusive
-other minor code cleanups

Please consider for 2.6.28.

 numa.c |  108 -
 1 file changed, 80 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d9a1813..72447f1 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -89,6 +89,46 @@ static int __cpuinit fake_numa_create_new_node(unsigned long 
end_pfn,
return 0;
 }
 
+/*
+ * get_active_region_work_fn - A helper function for get_node_active_region
+ * Returns datax set to the start_pfn and end_pfn if they contain
+ * the initial value of datax-start_pfn between them
+ * @start_pfn: start page(inclusive) of region to check
+ * @end_pfn: end page(exclusive) of region to check
+ * @datax: comes in with ->start_pfn set to value to search for and
+ * goes out with active range if it contains it
+ * Returns 1 if search value is in range else 0
+ */
+static int __init get_active_region_work_fn(unsigned long start_pfn,
+   unsigned long end_pfn, void *datax)
+{
+   struct node_active_region *data;
+   data = (struct node_active_region *)datax;
+
+   if (start_pfn <= data->start_pfn && end_pfn > data->start_pfn) {
+   data->start_pfn = start_pfn;
+   data->end_pfn = end_pfn;
+   return 1;
+   }
+   return 0;
+
+}
+
+/*
+ * get_node_active_region - Return active region containing start_pfn
+ * @start_pfn: The page to return the region for.
+ * @node_ar: Returned set to the active region containing start_pfn
+ */
+static void __init get_node_active_region(unsigned long start_pfn,
+  struct node_active_region *node_ar)
+{
+   int nid = early_pfn_to_nid(start_pfn);
+
+   node_ar->nid = nid;
+   node_ar->start_pfn = start_pfn;
+   work_with_active_regions(nid, get_active_region_work_fn, node_ar);
+}
+
 static void __cpuinit map_cpu_to_node(int cpu, int node)
 {
numa_cpu_lookup_table[cpu] = node;
@@ -837,38 +877,50 @@ void __init do_init_bootmem(void)
  start_pfn, end_pfn);
 
free_bootmem_with_active_regions(nid, end_pfn);
+   }
 
-   /* Mark reserved regions on this node */
-   for (i = 0; i < lmb.reserved.cnt; i++) {
-   unsigned long physbase = lmb.reserved.region[i].base;
-   unsigned long size = lmb.reserved.region[i].size;
-   unsigned long start_paddr = start_pfn << PAGE_SHIFT

Re: [PATCH] properly reserve in bootmem the lmb reserved regions that cross numa nodes

2008-10-06 Thread Jon Tollefson
Kumar Gala wrote:
 Out of interest, how do you guys represent NUMA regions of memory in
 the device tree?

 - k
Looking at the source code in numa.c I see at the start of
do_init_bootmem() that parse_numa_properties() is called.  It appears to
be looking at memory nodes and getting the node id from it.  It gets an
associativity property for the memory node and indexes that array with a
'min_common_depth' value to get the node id.

This node id is then used to setup the active ranges in the
early_node_map[].

Is this what you are asking about?  There are others, I am sure, who know
more about it than I do, though.

Jon



Re: [PATCH] properly reserve in bootmem the lmb reserved regions that cross numa nodes

2008-10-06 Thread Jon Tollefson
Kumar Gala wrote:

 On Oct 6, 2008, at 10:42 AM, Jon Tollefson wrote:

 Kumar Gala wrote:
 Out of interest, how do you guys represent NUMA regions of memory in
 the device tree?

 - k
 Looking at the source code in numa.c I see at the start of
 do_init_bootmem() that parse_numa_properties() is called.  It appears to
 be looking at memory nodes and getting the node id from it.  It gets an
 associativity property for the memory node and indexes that array with a
 'min_common_depth' value to get the node id.

 This node id is then used to setup the active ranges in the
 early_node_map[].

 Is this what you are asking about?  There are others, I am sure, who know
 more about it than I do, though.

 I was wondering if this was documented anywhere (like in sPAPR)?

 - k
I see some information on it in section C.6.6.

Jon



[PATCH v2] properly reserve in bootmem the lmb reserved regions that cross NUMA nodes

2008-10-06 Thread Jon Tollefson
If there are multiple reserved memory blocks via lmb_reserve() that are
contiguous addresses and on different NUMA nodes we are losing track of which 
address ranges to reserve in bootmem on which node.  I discovered this 
when I only recently got to try 16GB huge pages on a system with more 
than 2 nodes.

When scanning the device tree in early boot we call lmb_reserve() with 
the addresses of the 16G pages that we find so that the memory doesn't 
get used for something else.  For example the addresses for the pages 
could be 40, 44, 48, 4C, etc - 8 pages, 
one on each of eight nodes.  In the lmb after all the pages have been 
reserved it will look something like the following:

lmb_dump_all:
memory.cnt= 0x2
memory.size   = 0x3e8000
memory.region[0x0].base   = 0x0
  .size = 0x1e8000
memory.region[0x1].base   = 0x40
  .size = 0x20
reserved.cnt  = 0x5
reserved.size = 0x3e8000
reserved.region[0x0].base   = 0x0
  .size = 0x7b5000
reserved.region[0x1].base   = 0x2a0
  .size = 0x78c000
reserved.region[0x2].base   = 0x328c000
  .size = 0x43000
reserved.region[0x3].base   = 0xf4e8000
  .size = 0xb18000
reserved.region[0x4].base   = 0x40
  .size = 0x20


The reserved.region[0x4] contains the 16G pages.  In 
arch/powerpc/mm/numa.c: do_init_bootmem() we loop through each of the 
node numbers looking for the reserved regions that belong to the 
particular node.  It is not able to identify region 0x4 as being a part 
of each of the 8 nodes.  It is assuming that a reserved region is only
on a single node.

This patch takes out the reserved region loop from inside
the loop that goes over each node.  It looks up the active region containing
the start of the reserved region.  If it extends past that active region then
it adjusts the size and gets the next active region containing it.


Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

Changes:
-style changes as suggested by Adam Litke


Please consider for 2.6.28.


 arch/powerpc/mm/numa.c |   63 -
 include/linux/mm.h |2 +
 mm/page_alloc.c|   19 ++
 3 files changed, 57 insertions(+), 27 deletions(-)


diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d9a1813..9a3b0c9 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -837,36 +837,45 @@ void __init do_init_bootmem(void)
  start_pfn, end_pfn);
 
free_bootmem_with_active_regions(nid, end_pfn);
+   }
 
-   /* Mark reserved regions on this node */
-   for (i = 0; i < lmb.reserved.cnt; i++) {
-   unsigned long physbase = lmb.reserved.region[i].base;
-   unsigned long size = lmb.reserved.region[i].size;
-   unsigned long start_paddr = start_pfn << PAGE_SHIFT;
-   unsigned long end_paddr = end_pfn << PAGE_SHIFT;
-
-   if (early_pfn_to_nid(physbase >> PAGE_SHIFT) != nid &&
-   early_pfn_to_nid((physbase+size-1) >> PAGE_SHIFT) 
!= nid)
-   continue;
-
-   if (physbase < end_paddr &&
-   (physbase+size) > start_paddr) {
-   /* overlaps */
-   if (physbase < start_paddr) {
-   size -= start_paddr - physbase;
-   physbase = start_paddr;
-   }
-
-   if (size > end_paddr - physbase)
-   size = end_paddr - physbase;
-
-   dbg("reserve_bootmem %lx %lx\n", physbase,
-   size);
-   reserve_bootmem_node(NODE_DATA(nid), physbase,
-size, BOOTMEM_DEFAULT);
-   }
+   /* Mark reserved regions */
+   for (i = 0; i < lmb.reserved.cnt; i++) {
+   unsigned long physbase = lmb.reserved.region[i].base;
+   unsigned long size = lmb.reserved.region[i].size;
+   unsigned long start_pfn = physbase >> PAGE_SHIFT;
+   unsigned long end_pfn = ((physbase + size - 1) >> PAGE_SHIFT);
+   struct node_active_region *node_ar;
+
+   node_ar = get_node_active_region(start_pfn);
+   while (start_pfn < end_pfn && node_ar != NULL) {
+   /*
+* if reserved region extends past active region
+* then trim size to active region

Re: [PATCH] properly reserve in bootmem the lmb reserved regions that cross numa nodes

2008-10-02 Thread Jon Tollefson
Adam Litke wrote:
 This seems like the right approach to me.  I have pointed out a few
 stylistic issues below.
   
Thanks.  I'll make those changes.  I assume by __mminit you meant __meminit

Jon

 On Tue, 2008-09-30 at 09:53 -0500, Jon Tollefson wrote:
 snip
   
 +/* Mark reserved regions */
  +for (i = 0; i < lmb.reserved.cnt; i++) {
 +unsigned long physbase = lmb.reserved.region[i].base;
 +unsigned long size = lmb.reserved.region[i].size;
  +unsigned long start_pfn = physbase >> PAGE_SHIFT;
  +unsigned long end_pfn = ((physbase+size-1) >> PAGE_SHIFT);
 

 CodingStyle dictates that this should be:
  unsigned long end_pfn = ((physbase + size - 1) >> PAGE_SHIFT);

 snip

   
 +/**
 + * get_node_active_region - Return active region containing start_pfn
 + * @start_pfn The page to return the region for.
 + *
 + * It will return NULL if active region is not found.
 + */
 +struct node_active_region *get_node_active_region(
 +unsigned long start_pfn)
 

 Bad style.  I think the convention would be to write it like this:

 struct node_active_region *
 get_node_active_region(unsigned long start_pfn)

   
 +{
 +int i;
  +for (i = 0; i < nr_nodemap_entries; i++) {
 +unsigned long node_start_pfn = early_node_map[i].start_pfn;
 +unsigned long node_end_pfn = early_node_map[i].end_pfn;
 +
  +if (node_start_pfn <= start_pfn && node_end_pfn > start_pfn)
  +return &early_node_map[i];
 +}
 +return NULL;
 +}
 

 Since this is using the early_node_map[], should we mark the function
 __mminit?  

   



[PATCH] properly reserve in bootmem the lmb reserved regions that cross numa nodes

2008-09-30 Thread Jon Tollefson
If there are multiple reserved memory blocks via lmb_reserve() that are 
contiguous addresses and on different numa nodes we are losing track of which 
address ranges to reserve in bootmem on which node.  I discovered this 
when I only recently got to try 16GB huge pages on a system with more 
than 2 nodes.

When scanning the device tree in early boot we call lmb_reserve() with 
the addresses of the 16G pages that we find so that the memory doesn't 
get used for something else.  For example the addresses for the pages 
could be 40, 44, 48, 4C, etc - 8 pages, 
one on each of eight nodes.  In the lmb after all the pages have been 
reserved it will look something like the following:

lmb_dump_all:
memory.cnt= 0x2
memory.size   = 0x3e8000
memory.region[0x0].base   = 0x0
  .size = 0x1e8000
memory.region[0x1].base   = 0x40
  .size = 0x20
reserved.cnt  = 0x5
reserved.size = 0x3e8000
reserved.region[0x0].base   = 0x0
  .size = 0x7b5000
reserved.region[0x1].base   = 0x2a0
  .size = 0x78c000
reserved.region[0x2].base   = 0x328c000
  .size = 0x43000
reserved.region[0x3].base   = 0xf4e8000
  .size = 0xb18000
reserved.region[0x4].base   = 0x40
  .size = 0x20


The reserved.region[0x4] contains the 16G pages.  In 
arch/powerpc/mm/numa.c: do_init_bootmem() we loop through each of the 
node numbers looking for the reserved regions that belong to the 
particular node.  It is not able to identify region 0x4 as being a part 
of each of the 8 nodes.  It is assuming that a reserved region is only
on a single node.

This patch takes out the reserved region loop from inside
the loop that goes over each node.  It looks up the active region containing
the start of the reserved region.  If it extends past that active region then
it adjusts the size and gets the next active region containing it.


Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---


 arch/powerpc/mm/numa.c |   63 -
 include/linux/mm.h |2 +
 mm/page_alloc.c|   19 ++
 3 files changed, 57 insertions(+), 27 deletions(-)


diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d9a1813..07b8726 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -837,36 +837,45 @@ void __init do_init_bootmem(void)
  start_pfn, end_pfn);

free_bootmem_with_active_regions(nid, end_pfn);
+   }

-   /* Mark reserved regions on this node */
-   for (i = 0; i < lmb.reserved.cnt; i++) {
-   unsigned long physbase = lmb.reserved.region[i].base;
-   unsigned long size = lmb.reserved.region[i].size;
-   unsigned long start_paddr = start_pfn << PAGE_SHIFT;
-   unsigned long end_paddr = end_pfn << PAGE_SHIFT;
-
-   if (early_pfn_to_nid(physbase >> PAGE_SHIFT) != nid &&
-   early_pfn_to_nid((physbase+size-1) >> PAGE_SHIFT) 
!= nid)
-   continue;
-
-   if (physbase < end_paddr &&
-   (physbase+size) > start_paddr) {
-   /* overlaps */
-   if (physbase < start_paddr) {
-   size -= start_paddr - physbase;
-   physbase = start_paddr;
-   }
-
-   if (size > end_paddr - physbase)
-   size = end_paddr - physbase;
-
-   dbg("reserve_bootmem %lx %lx\n", physbase,
-   size);
-   reserve_bootmem_node(NODE_DATA(nid), physbase,
-size, BOOTMEM_DEFAULT);
-   }
+   /* Mark reserved regions */
+   for (i = 0; i  lmb.reserved.cnt; i++) {
+   unsigned long physbase = lmb.reserved.region[i].base;
+   unsigned long size = lmb.reserved.region[i].size;
+   unsigned long start_pfn = physbase >> PAGE_SHIFT;
+   unsigned long end_pfn = ((physbase+size-1) >> PAGE_SHIFT);
+   struct node_active_region *node_ar;
+
+   node_ar = get_node_active_region(start_pfn);
+   while (start_pfn < end_pfn && node_ar != NULL) {
+   /*
+* if reserved region extends past active region
+* then trim size to active region
+*/
+   if (end_pfn <= node_ar->end_pfn

Re: [Libhugetlbfs-devel] Buglet in 16G page handling

2008-09-04 Thread Jon Tollefson
Jon Tollefson wrote:
 David Gibson wrote:
   
 On Tue, Sep 02, 2008 at 12:12:27PM -0500, Jon Tollefson wrote:
   
 
 David Gibson wrote:
 
   
 When BenH and I were looking at the new code for handling 16G pages,
 we noticed a small bug.  It doesn't actually break anything user
 visible, but it's certainly not the way things are supposed to be.
 The 16G patches didn't update the huge_pte_offset() and
 huge_pte_alloc() functions, which means that the hugepte tables for
 16G pages will be allocated much further down the page table tree than
 they should be - allocating several levels of page table with a single
 entry in them along the way.

 The patch below is supposed to fix this, cleaning up the existing
 handling of 64k vs 16M pages while its at it.  However, it needs some
 testing.

 I've checked that it doesn't break existing 16M support, either with
 4k or 64k base pages.  I haven't figured out how to test with 64k
 pages yet, at least until the multisize support goes into
 libhugetlbfs.  For 16G pages, I just don't have access to a machine
 with enough memory to test.  Jon, presumably you must have found such
 a machine when you did the 16G page support in the first place.  Do
 you still have access, and can you test this patch?
   
   
 
 I do have access to a machine to test it.  I applied the patch to -rc4
 and used a pseries_defconfig.  I boot with
default_hugepagesz=16G... in order to test huge page sizes other than
 16M at this point.

 Running the libhugetlbfs test suite it gets as far as   Readback (64):  
 PASS
 before it hits the following program check.
 
   
 Ah, yes, oops, forgot to fix up the pagetable freeing path in line
 with the other changes.  Try the revised version below.
   
 
 I have run through the tests twice now with this new patch using a 4k
 base page size(and 16G huge page size) and there are no program checks
 or spin lock issues.  So looking good.

 I will run it next a couple of times with 64K base pages.
   
I have run through the libhugetest suite 3 times each now with both
combinations(4k and 64K base page) and have not seen the spin lock
problem or any other problems.

Acked-by: Jon Tollefson [EMAIL PROTECTED]


 Jon




   
 Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
 ===
 --- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c   2008-09-02 
 11:50:12.0 +1000
 +++ working-2.6/arch/powerpc/mm/hugetlbpage.c2008-09-03 
 10:10:54.0 +1000
 @@ -128,29 +128,37 @@ static int __hugepte_alloc(struct mm_str
  return 0;
  }

 -/* Base page size affects how we walk hugetlb page tables */
 -#ifdef CONFIG_PPC_64K_PAGES
 -#define hpmd_offset(pud, addr, h)   pmd_offset(pud, addr)
 -#define hpmd_alloc(mm, pud, addr, h)pmd_alloc(mm, pud, addr)
 -#else
 -static inline
 -pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
 +
 +static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate 
 *hstate)
 +{
  +if (huge_page_shift(hstate) < PUD_SHIFT)
 +return pud_offset(pgd, addr);
 +else
 +return (pud_t *) pgd;
 +}
 +static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long 
 addr,
 + struct hstate *hstate)
  {
 -if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
  +if (huge_page_shift(hstate) < PUD_SHIFT)
 +return pud_alloc(mm, pgd, addr);
 +else
 +return (pud_t *) pgd;
 +}
 +static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate 
 *hstate)
 +{
  +if (huge_page_shift(hstate) < PMD_SHIFT)
  return pmd_offset(pud, addr);
  else
  return (pmd_t *) pud;
  }
 -static inline
 -pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
 -  struct hstate *hstate)
 +static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long 
 addr,
 + struct hstate *hstate)
  {
 -if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
  +if (huge_page_shift(hstate) < PMD_SHIFT)
  return pmd_alloc(mm, pud, addr);
  else
  return (pmd_t *) pud;
  }
 -#endif

  /* Build list of addresses of gigantic pages.  This function is used in 
 early
   * boot before the buddy or bootmem allocator is setup.
 @@ -204,7 +212,7 @@ pte_t *huge_pte_offset(struct mm_struct 

  pg = pgd_offset(mm, addr);
  if (!pgd_none(*pg)) {
 -pu = pud_offset(pg, addr);
 +pu = hpud_offset(pg, addr, hstate);
  if (!pud_none(*pu)) {
  pm = hpmd_offset(pu, addr, hstate);
  if (!pmd_none(*pm))
 @@ -233,7 +241,7 @@ pte_t *huge_pte_alloc(struct mm_struct *
  addr = hstate-mask;

  pg = pgd_offset(mm, addr);
 -pu = pud_alloc(mm, pg, addr);
 +pu = hpud_alloc(mm, pg, addr, hstate);

  if (pu) {
  pm = hpmd_alloc(mm, pu, addr, hstate);
 @@ -316,13 +324,7

Re: [Libhugetlbfs-devel] Buglet in 16G page handling

2008-09-03 Thread Jon Tollefson
Benjamin Herrenschmidt wrote:
 On Tue, 2008-09-02 at 17:16 -0500, Jon Tollefson wrote:
   
 Benjamin Herrenschmidt wrote:
 
 Actually, Jon has been hitting an occasional pagetable lock related
 problem. The last theory was that it might be some sort of race but it's
 vaguely possible that this is the issue. Jon?
 
 
 All hugetlbfs ops should be covered by the big PTL except walking... Can
 we have more info about the problem ?

 Cheers,
 Ben.
   
   
 I hit this when running the complete libhugetlbfs test suite (make
 check) with base page at 4K and default huge page size at 16G.  It is on
 the last test (shm-getraw) when it hits it.  Just running that test
 alone has not caused it for me - only when I have run all the tests and
 it gets to this one.  Also it doesn't happen every time.  I have tried
 to reproduce as well with a 64K base page but haven't seen it happen there.
 

 I don't see anything huge pages related in the backtraces which is
 interesting ...

 Can you get us access to a machine with enough RAM to test the 16G
 pages ?

 Ben.

   
You can use the machine I have been using.  I'll send you a note with
the details on it after I test David's patch today.

Jon

snip



Re: [Libhugetlbfs-devel] Buglet in 16G page handling

2008-09-03 Thread Jon Tollefson
David Gibson wrote:
 On Tue, Sep 02, 2008 at 12:12:27PM -0500, Jon Tollefson wrote:
   
 David Gibson wrote:
 
 When BenH and I were looking at the new code for handling 16G pages,
 we noticed a small bug.  It doesn't actually break anything user
 visible, but it's certainly not the way things are supposed to be.
 The 16G patches didn't update the huge_pte_offset() and
 huge_pte_alloc() functions, which means that the hugepte tables for
 16G pages will be allocated much further down the page table tree than
 they should be - allocating several levels of page table with a single
 entry in them along the way.

 The patch below is supposed to fix this, cleaning up the existing
 handling of 64k vs 16M pages while its at it.  However, it needs some
 testing.

 I've checked that it doesn't break existing 16M support, either with
 4k or 64k base pages.  I haven't figured out how to test with 64k
 pages yet, at least until the multisize support goes into
 libhugetlbfs.  For 16G pages, I just don't have access to a machine
 with enough memory to test.  Jon, presumably you must have found such
 a machine when you did the 16G page support in the first place.  Do
 you still have access, and can you test this patch?
   
   
 I do have access to a machine to test it.  I applied the patch to -rc4
 and used a pseries_defconfig.  I boot with
 default_hugepagesz=16G... in order to test huge page sizes other than
 16M at this point.

 Running the libhugetlbfs test suite it gets as far as   Readback (64):  
 PASS
 before it hits the following program check.
 

 Ah, yes, oops, forgot to fix up the pagetable freeing path in line
 with the other changes.  Try the revised version below.
   
I have run through the tests twice now with this new patch using a 4k
base page size(and 16G huge page size) and there are no program checks
or spin lock issues.  So looking good.

I will run it next a couple of times with 64K base pages.

Jon




 Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
 ===
 --- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c2008-09-02 
 11:50:12.0 +1000
 +++ working-2.6/arch/powerpc/mm/hugetlbpage.c 2008-09-03 10:10:54.0 
 +1000
 @@ -128,29 +128,37 @@ static int __hugepte_alloc(struct mm_str
   return 0;
  }

 -/* Base page size affects how we walk hugetlb page tables */
 -#ifdef CONFIG_PPC_64K_PAGES
 -#define hpmd_offset(pud, addr, h)pmd_offset(pud, addr)
 -#define hpmd_alloc(mm, pud, addr, h) pmd_alloc(mm, pud, addr)
 -#else
 -static inline
 -pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
 +
 +static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate 
 *hstate)
 +{
  + if (huge_page_shift(hstate) < PUD_SHIFT)
 + return pud_offset(pgd, addr);
 + else
 + return (pud_t *) pgd;
 +}
 +static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long 
 addr,
 +  struct hstate *hstate)
  {
 - if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
  + if (huge_page_shift(hstate) < PUD_SHIFT)
 + return pud_alloc(mm, pgd, addr);
 + else
 + return (pud_t *) pgd;
 +}
 +static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate 
 *hstate)
 +{
  + if (huge_page_shift(hstate) < PMD_SHIFT)
   return pmd_offset(pud, addr);
   else
   return (pmd_t *) pud;
  }
 -static inline
 -pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
 -   struct hstate *hstate)
 +static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long 
 addr,
 +  struct hstate *hstate)
  {
 - if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
  + if (huge_page_shift(hstate) < PMD_SHIFT)
   return pmd_alloc(mm, pud, addr);
   else
   return (pmd_t *) pud;
  }
 -#endif

  /* Build list of addresses of gigantic pages.  This function is used in early
   * boot before the buddy or bootmem allocator is setup.
 @@ -204,7 +212,7 @@ pte_t *huge_pte_offset(struct mm_struct 

   pg = pgd_offset(mm, addr);
   if (!pgd_none(*pg)) {
 - pu = pud_offset(pg, addr);
 + pu = hpud_offset(pg, addr, hstate);
   if (!pud_none(*pu)) {
   pm = hpmd_offset(pu, addr, hstate);
   if (!pmd_none(*pm))
 @@ -233,7 +241,7 @@ pte_t *huge_pte_alloc(struct mm_struct *
   addr = hstate-mask;

   pg = pgd_offset(mm, addr);
 - pu = pud_alloc(mm, pg, addr);
 + pu = hpud_alloc(mm, pg, addr, hstate);

   if (pu) {
   pm = hpmd_alloc(mm, pu, addr, hstate);
 @@ -316,13 +324,7 @@ static void hugetlb_free_pud_range(struc
   pud = pud_offset(pgd, addr);
   do {
   next = pud_addr_end(addr, end);
 -#ifdef CONFIG_PPC_64K_PAGES
 - if (pud_none_or_clear_bad(pud))
 - continue

Re: Buglet in 16G page handling

2008-09-02 Thread Jon Tollefson
David Gibson wrote:
 When BenH and I were looking at the new code for handling 16G pages,
 we noticed a small bug.  It doesn't actually break anything user
 visible, but it's certainly not the way things are supposed to be.
 The 16G patches didn't update the huge_pte_offset() and
 huge_pte_alloc() functions, which means that the hugepte tables for
 16G pages will be allocated much further down the page table tree than
 they should be - allocating several levels of page table with a single
 entry in them along the way.

 The patch below is supposed to fix this, cleaning up the existing
 handling of 64k vs 16M pages while its at it.  However, it needs some
 testing.

 I've checked that it doesn't break existing 16M support, either with
 4k or 64k base pages.  I haven't figured out how to test with 64k
 pages yet, at least until the multisize support goes into
 libhugetlbfs.  For 16G pages, I just don't have access to a machine
 with enough memory to test.  Jon, presumably you must have found such
 a machine when you did the 16G page support in the first place.  Do
 you still have access, and can you test this patch?
   
I do have access to a machine to test it.  I applied the patch to -rc4
and used a pseries_defconfig.  I boot with
default_hugepagesz=16G... in order to test huge page sizes other than
16M at this point.

Running the libhugetlbfs test suite it gets as far as   Readback (64):  
PASS
before it hits the following program check.

kernel BUG at arch/powerpc/mm/hugetlbpage.c:98!
cpu 0x0: Vector: 700 (Program Check) at [c002843db580]
pc: c0035ff4: .free_hugepte_range+0x2c/0x7c
lr: c0036af0: .hugetlb_free_pgd_range+0x2c0/0x398
sp: c002843db800
   msr: 80029032
  current = 0xc0028417a2a0
  paca= 0xc08d4300
pid   = 3334, comm = readback
kernel BUG at arch/powerpc/mm/hugetlbpage.c:98!
enter ? for help
[c002843db880] c0036af0 .hugetlb_free_pgd_range+0x2c0/0x398
[c002843db980] c00da224 .free_pgtables+0x98/0x140
[c002843dba40] c00dc4d8 .exit_mmap+0x13c/0x22c
[c002843dbb00] c005b218 .mmput+0x78/0x148
[c002843dbba0] c0060528 .exit_mm+0x164/0x18c
[c002843dbc50] c0062718 .do_exit+0x2e8/0x858
[c002843dbd10] c0062d24 .do_group_exit+0x9c/0xd0
[c002843dbdb0] c0062d74 .sys_exit_group+0x1c/0x30
[c002843dbe30] c00086d4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 00802db7a530
SP (fa6e290) is in userspace


Line 98 appears to be this BUG_ON

static inline pte_t *hugepd_page(hugepd_t hpd)
{
BUG_ON(!(hpd.pd & HUGEPD_OK));


Jon

 Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
 ===
 --- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c2008-09-02 
 13:39:52.0 +1000
 +++ working-2.6/arch/powerpc/mm/hugetlbpage.c 2008-09-02 14:08:56.0 
 +1000
 @@ -128,29 +128,37 @@ static int __hugepte_alloc(struct mm_str
   return 0;
  }

 -/* Base page size affects how we walk hugetlb page tables */
 -#ifdef CONFIG_PPC_64K_PAGES
 -#define hpmd_offset(pud, addr, h)pmd_offset(pud, addr)
 -#define hpmd_alloc(mm, pud, addr, h) pmd_alloc(mm, pud, addr)
 -#else
 -static inline
 -pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
 +
 +static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate 
 *hstate)
 +{
  + if (huge_page_shift(hstate) < PUD_SHIFT)
 + return pud_offset(pgd, addr);
 + else
 + return (pud_t *) pgd;
 +}
 +static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long 
 addr,
 +  struct hstate *hstate)
  {
 - if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
  + if (huge_page_shift(hstate) < PUD_SHIFT)
 + return pud_alloc(mm, pgd, addr);
 + else
 + return (pud_t *) pgd;
 +}
 +static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate 
 *hstate)
 +{
  + if (huge_page_shift(hstate) < PMD_SHIFT)
   return pmd_offset(pud, addr);
   else
   return (pmd_t *) pud;
  }
 -static inline
 -pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
 -   struct hstate *hstate)
 +static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long 
 addr,
 +  struct hstate *hstate)
  {
 - if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
  + if (huge_page_shift(hstate) < PMD_SHIFT)
   return pmd_alloc(mm, pud, addr);
   else
   return (pmd_t *) pud;
  }
 -#endif

  /* Build list of addresses of gigantic pages.  This function is used in early
   * boot before the buddy or bootmem allocator is setup.
 @@ -204,7 +212,7 @@ pte_t *huge_pte_offset(struct mm_struct 

   pg = pgd_offset(mm, addr);
   if (!pgd_none(*pg)) {
 - pu = pud_offset(pg, addr);
 + pu = hpud_offset(pg, addr, hstate);

Re: [Libhugetlbfs-devel] Buglet in 16G page handling

2008-09-02 Thread Jon Tollefson
Benjamin Herrenschmidt wrote:
 Actually, Jon has been hitting an occasional pagetable lock related
 problem. The last theory was that it might be some sort of race but it's
 vaguely possible that this is the issue. Jon?
 

 All hugetlbfs ops should be covered by the big PTL except walking... Can
 we have more info about the problem ?

 Cheers,
 Ben.
   

I hit this when running the complete libhugetlbfs test suite (make
check) with base page at 4K and default huge page size at 16G.  It is on
the last test (shm-getraw) when it hits it.  Just running that test
alone has not caused it for me - only when I have run all the tests and
it gets to this one.  Also it doesn't happen every time.  I have tried
to reproduce as well with a 64K base page but haven't seen it happen there.

BUG: spinlock bad magic on CPU#2, shm-getraw/10359
 lock: fde6e158, .magic: , .owner: none/-1, .owner_cpu: 0
Call Trace:
[c00285d9b420] [c00110b0] .show_stack+0x78/0x190 (unreliable)
[c00285d9b4d0] [c00111e8] .dump_stack+0x20/0x34
[c00285d9b550] [c0295d94] .spin_bug+0xb8/0xe0
[c00285d9b5f0] [c02962d8] ._raw_spin_lock+0x4c/0x1a0
[c00285d9b690] [c0510c60] ._spin_lock+0x5c/0x7c
[c00285d9b720] [c00d809c] .handle_mm_fault+0x2f0/0x9ac
[c00285d9b810] [c0513688] .do_page_fault+0x444/0x62c
[c00285d9b950] [c0005230] handle_page_fault+0x20/0x5c
--- Exception: 301 at .__clear_user+0x38/0x7c
LR = .read_zero+0xb0/0x1a8
[c00285d9bc40] [c02e19e0] .read_zero+0x80/0x1a8 (unreliable)
[c00285d9bcf0] [c0102c00] .vfs_read+0xe0/0x1c8
[c00285d9bd90] [c010332c] .sys_read+0x54/0x98
[c00285d9be30] [c00086d4] syscall_exit+0x0/0x40
BUG: spinlock lockup on CPU#2, shm-getraw/10359, fde6e158
Call Trace:
[c00285d9b4c0] [c00110b0] .show_stack+0x78/0x190 (unreliable)
[c00285d9b570] [c00111e8] .dump_stack+0x20/0x34
[c00285d9b5f0] [c02963ec] ._raw_spin_lock+0x160/0x1a0
[c00285d9b690] [c0510c60] ._spin_lock+0x5c/0x7c
[c00285d9b720] [c00d809c] .handle_mm_fault+0x2f0/0x9ac
[c00285d9b810] [c0513688] .do_page_fault+0x444/0x62c
[c00285d9b950] [c0005230] handle_page_fault+0x20/0x5c
--- Exception: 301 at .__clear_user+0x38/0x7c
LR = .read_zero+0xb0/0x1a8
[c00285d9bc40] [c02e19e0] .read_zero+0x80/0x1a8 (unreliable)
[c00285d9bcf0] [c0102c00] .vfs_read+0xe0/0x1c8
[c00285d9bd90] [c010332c] .sys_read+0x54/0x98
[c00285d9be30] [c00086d4] syscall_exit+0x0/0x40
BUG: soft lockup - CPU#2 stuck for 61s! [shm-getraw:10359]
Modules linked in: autofs4 binfmt_misc dm_mirror dm_log dm_multipath parport 
ibmvscsic uhci_hcd ohci_hcd ehci_hcd
irq event stamp: 1423661
hardirqs last  enabled at (1423661): [c008d954] 
.trace_hardirqs_on+0x1c/0x30
hardirqs last disabled at (1423660): [c008af60] 
.trace_hardirqs_off+0x1c/0x30
softirqs last  enabled at (1422710): [c0064f6c] 
.__do_softirq+0x19c/0x1c4
softirqs last disabled at (1422705): [c002943c] 
.call_do_softirq+0x14/0x24
NIP: c002569c LR: c02963ac CTR: 80f7cdec
REGS: c00285d9b330 TRAP: 0901   Not tainted  (2.6.27-rc4-pseries)
MSR: 80009032 EE,ME,IR,DR  CR: 88000284  XER: 0002
TASK = c00285f18000[10359] 'shm-getraw' THREAD: c00285d98000 CPU: 2
GPR00: 8002 c00285d9b5b0 c08924e0 0001 
GPR04: c00285f18000 0070  0002 
GPR08:  0003c3c66e8adf66 0002 0010 
GPR12: 000b4cbd c08d4700 
NIP [c002569c] .__delay+0x10/0x38
LR [c02963ac] ._raw_spin_lock+0x120/0x1a0
Call Trace:
[c00285d9b5b0] [c00285d9b690] 0xc00285d9b690 (unreliable)
[c00285d9b5f0] [c0296378] ._raw_spin_lock+0xec/0x1a0
[c00285d9b690] [c0510c60] ._spin_lock+0x5c/0x7c
[c00285d9b720] [c00d809c] .handle_mm_fault+0x2f0/0x9ac
[c00285d9b810] [c0513688] .do_page_fault+0x444/0x62c
[c00285d9b950] [c0005230] handle_page_fault+0x20/0x5c
--- Exception: 301 at .__clear_user+0x38/0x7c
LR = .read_zero+0xb0/0x1a8
[c00285d9bc40] [c02e19e0] .read_zero+0x80/0x1a8 (unreliable)
[c00285d9bcf0] [c0102c00] .vfs_read+0xe0/0x1c8
[c00285d9bd90] [c010332c] .sys_read+0x54/0x98
[c00285d9be30] [c00086d4] syscall_exit+0x0/0x40
Instruction dump:
eb41ffd0 eb61ffd8 eb81ffe0 7c0803a6 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 
fbe1fff8 f821ffc1 7c3f0b78 7d2c42e6 4808 7c210b78 7c0c42e6 7c090050 


[root]# addr2line c00d809c -e /boot/vmlinux.rc4-pseries 
/root/src/linux-2.6-rc4/mm/memory.c:2381
[root]# addr2line c0513688 -e /boot/vmlinux.rc4-pseries 
/root/src/linux-2.6-rc4/arch/powerpc/mm/fault.c:313
[root]# addr2line c010332c -e /boot/vmlinux.rc4-pseries 

link failure: file truncated

2008-07-24 Thread Jon Tollefson
Just tried to build the latest version from Linus' tree and I am getting 
a link error.


building with the pseries_defconfig

...
 LD  drivers/built-in.o
 LD  vmlinux.o
 MODPOST vmlinux.o
WARNING: modpost: Found 6 section mismatch(es).
To see full details build your kernel with:
'make CONFIG_DEBUG_SECTION_MISMATCH=y'
 GEN .version
 CHK include/linux/compile.h
 UPD include/linux/compile.h
 CC  init/version.o
 LD  init/built-in.o
 LD  .tmp_vmlinux1
ld: final link failed: File truncated
make: *** [.tmp_vmlinux1] Error 1


~/src/linus/linux-2.6> cat /etc/SuSE-release 
SUSE LINUX Enterprise Server 9 (ppc)

VERSION = 9
PATCHLEVEL = 3
~/src/linus/linux-2.6> ld --version
GNU ld version 2.15.90.0.1.1 20040303 (SuSE Linux)
Copyright 2002 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License.  This program has absolutely no warranty.
~/src/linus/linux-2.6> gcc --version
gcc (GCC) 3.3.3 (SuSE Linux)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.




Jon



Re: [PATCH] powerpc: Fix compile error with binutils 2.15

2008-07-24 Thread Jon Tollefson

Segher Boessenkool wrote:

My previous patch to fix compilation with binutils-2.17 causes
a file truncated build error from ld with binutils 2.15 (and
possibly older), and a warning with 2.16 and 2.17.

This fixes it.

Signed-off-by: Segher Boessenkool [EMAIL PROTECTED]
---
 arch/powerpc/kernel/vmlinux.lds.S |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index a914411..4a8ce62 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -85,7 +85,7 @@ SECTIONS

/* The dummy segment contents for the bug workaround mentioned above
   near PHDRS.  */
-   .dummy : {
+   .dummy : AT(ADDR(.dummy) - LOAD_OFFSET) {
LONG(0xf177)
} :kernel :dummy
  


This fixed the file truncated error for me.  Also the kernel booted fine.

Jon

~/src/linus/linux-2.6> make vmlinux
 CHK include/linux/version.h
 CHK include/linux/utsrelease.h
 UPD include/linux/utsrelease.h
 CALLscripts/checksyscalls.sh
<stdin>:1397:2: warning: #warning syscall signalfd4 not implemented
<stdin>:1401:2: warning: #warning syscall eventfd2 not implemented
<stdin>:1405:2: warning: #warning syscall epoll_create1 not implemented
<stdin>:1409:2: warning: #warning syscall dup3 not implemented
<stdin>:1413:2: warning: #warning syscall pipe2 not implemented
<stdin>:1417:2: warning: #warning syscall inotify_init1 not implemented
 CHK include/linux/compile.h
 CC  init/version.o
 LD  init/built-in.o
 CALLarch/powerpc/kernel/systbl_chk.sh
 CALLarch/powerpc/kernel/prom_init_check.sh
 LDS arch/powerpc/kernel/vmlinux.lds
 CC  kernel/module.o
 CC  kernel/kexec.o
 LD  kernel/built-in.o
 LD  vmlinux.o
 MODPOST vmlinux.o
WARNING: modpost: Found 6 section mismatch(es).
To see full details build your kernel with:
'make CONFIG_DEBUG_SECTION_MISMATCH=y'
 GEN .version
 CHK include/linux/compile.h
 UPD include/linux/compile.h
 CC  init/version.o
 LD  init/built-in.o
 LD  .tmp_vmlinux1
 KSYM.tmp_kallsyms1.S
 AS  .tmp_kallsyms1.o
 LD  .tmp_vmlinux2
 KSYM.tmp_kallsyms2.S
 AS  .tmp_kallsyms2.o
 LD  vmlinux
 SYSMAP  System.map
 SYSMAP  .tmp_System.map





Re: gigantic pages patches

2008-07-22 Thread Jon Tollefson
David Gibson wrote:
 On Fri, Jul 11, 2008 at 05:45:15PM +1000, Stephen Rothwell wrote:
   
 Hi all,

 Could people take one last look at these patches and if there are no
 issues, please send Ack-bys to Andrew who will push them to Linus for
 2.6.27.

 [PATCH 1/6 v2] allow arch specific function for allocating gigantic pages
 http://patchwork.ozlabs.org/linuxppc/patch?id=18437
 Patch: [PATCH 2/6 v2] powerpc: function for allocating gigantic pages
 http://patchwork.ozlabs.org/linuxppc/patch?id=18438
 Patch: [PATCH 3/6 v2] powerpc: scan device tree and save gigantic page 
 locations
 http://patchwork.ozlabs.org/linuxppc/patch?id=18439
 Patch: [PATCH 4/6 v2] powerpc: define page support for 16G pages
 http://patchwork.ozlabs.org/linuxppc/patch?id=18440
 Patch: [PATCH 5/6 v2] check for overflow
 http://patchwork.ozlabs.org/linuxppc/patch?id=18441
 Patch: [PATCH 6/6] powerpc: support multiple huge page sizes
 http://patchwork.ozlabs.org/linuxppc/patch?id=18442
 

 Sorry, I should have looked at these properly when they went past in
 May, but obviously I missed them.

 They mostly look ok.  I'm a bit confused on 2/6 though - it seems the
 new powerpc alloc_bootmem_huge_page() function is specific to the 16G
 gigantic pages.  But can't that function also get called for the
 normal 16M hugepages depending on how the hugepage pool is
 initialized.

 Or am I missing something (wouldn't surprise me given my brain's
 sluggishness today)?
   
The alloc_bootmem_huge_page() function is only called for pages >=
MAX_ORDER.  The 16M pages are always allocated within the generic
hugetlbfs code with alloc_pages_node().
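
In other words, the generic pool setup dispatches on the page order,
roughly like this (a sketch based on the description above, not the exact
mm/hugetlb.c code; gfp_mask stands in for whatever flags the real
allocator passes):

	if (h->order >= MAX_ORDER)
		/* gigantic (e.g. 16G): arch-provided early allocation hook */
		alloc_bootmem_huge_page(h);
	else
		/* 16M and smaller: normal buddy allocation */
		alloc_pages_node(nid, gfp_mask, h->order);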

Jon




Re: [PATCH 6/6] powerpc: support multiple huge page sizes

2008-06-24 Thread Jon Tollefson
Nick Piggin wrote:
 On Tue, May 13, 2008 at 12:25:27PM -0500, Jon Tollefson wrote:
   
 Instead of using the variable mmu_huge_psize to keep track of the huge
 page size we use an array of MMU_PAGE_* values.  For each supported
 huge page size we need to know the hugepte_shift value and have a
 pgtable_cache.  The hstate or an mmu_huge_psizes index is passed to
 functions so that they know which huge page size they should use.

 The hugepage sizes 16M and 64K are setup(if available on the
 hardware) so that they don't have to be set on the boot cmd line in
 order to use them.  The number of 16G pages have to be specified at
 boot-time though (e.g. hugepagesz=16G hugepages=5).


 Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
 ---

 @@ -150,17 +191,25 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned 
 long addr)
  pud_t *pu;
  pmd_t *pm;

 -BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
 +unsigned int psize;
 +unsigned int shift;
 +unsigned long sz;
 +struct hstate *hstate;
 +psize = get_slice_psize(mm, addr);
 +shift = mmu_psize_to_shift(psize);
  +sz = ((1UL) << shift);
 +hstate = size_to_hstate(sz);

  -addr &= HPAGE_MASK;
  +addr &= hstate->mask;

  pg = pgd_offset(mm, addr);
  if (!pgd_none(*pg)) {
  pu = pud_offset(pg, addr);
  if (!pud_none(*pu)) {
 -pm = hpmd_offset(pu, addr);
 +pm = hpmd_offset(pu, addr, hstate);
  if (!pmd_none(*pm))
 -return hugepte_offset((hugepd_t *)pm, addr);
 +return hugepte_offset((hugepd_t *)pm, addr,
 +  hstate);
  }
  }
 

 Hi Jon,

 I just noticed in a few places like this, you might be doing more work
 than really needed to get the HPAGE_MASK.
   
I would love to be able to simplify it.
 For a first-pass conversion, this is the right way to go (just manually
 replace hugepage constants with hstate- equivalents). However in this
 case if you already know the page size, you should be able to work out
 the shift from there, I think? That way you can avoid the size_to_hstate
 call completely.
   
Something like the following?

+   addr &= ~(sz - 1);

Is that faster than just pulling it out of hstate?
I still need to locate the hstate, but I guess if the mask is calculated
this way the lookup could be pushed further into the function so that it
isn't done unless it is actually needed.
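
Spelled out against the quoted huge_pte_offset() code above, the suggestion
amounts to something like this (just a sketch; the hstate lookup could then
move down to where it is actually used):

	psize = get_slice_psize(mm, addr);
	shift = mmu_psize_to_shift(psize);
	sz    = 1UL << shift;
	addr &= ~(sz - 1);	/* same result as addr &= hstate->mask */
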
 Anyway, just something to consider.

 Thanks,
 Nick
   
Thank you for looking at the code.

Jon

 @@ -173,16 +222,20 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned 
 long addr, unsigned long sz
  pud_t *pu;
  pmd_t *pm;
  hugepd_t *hpdp = NULL;
 +struct hstate *hstate;
 +unsigned int psize;
 +hstate = size_to_hstate(sz);

 -BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
 +psize = get_slice_psize(mm, addr);
 +BUG_ON(!mmu_huge_psizes[psize]);

 -addr &= HPAGE_MASK;
 +addr &= hstate->mask;

  pg = pgd_offset(mm, addr);
  pu = pud_alloc(mm, pg, addr);

  if (pu) {
 -pm = hpmd_alloc(mm, pu, addr);
 +pm = hpmd_alloc(mm, pu, addr, hstate);
  if (pm)
  hpdp = (hugepd_t *)pm;
  }
 @@ -190,10 +243,10 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned 
 long addr, unsigned long sz
  if (! hpdp)
  return NULL;

 -if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr))
 +if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, psize))
  return NULL;

 -return hugepte_offset(hpdp, addr);
 +return hugepte_offset(hpdp, addr, hstate);
 }

 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 @@ -201,19 +254,22 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned 
 long *addr, pte_t *ptep)
  return 0;
 }

 -static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp)
 +static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp,
 +   unsigned int psize)
 {
  pte_t *hugepte = hugepd_page(*hpdp);

  hpdp->pd = 0;
  tlb->need_flush = 1;
 -pgtable_free_tlb(tlb, pgtable_free_cache(hugepte, HUGEPTE_CACHE_NUM,
 +pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
 + HUGEPTE_CACHE_NUM+psize-1,
   PGF_CACHENUM_MASK));
 }

 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 unsigned long addr, unsigned long end,
 -   unsigned long floor, unsigned long ceiling)
 +   unsigned long floor, unsigned long ceiling,
 +   unsigned int psize)
 {
  pmd_t *pmd;
  unsigned long next;
 @@ -225,7 +281,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather 
 *tlb, pud_t

[PATCH 1/6 v2] allow arch specific function for allocating gigantic pages

2008-05-13 Thread Jon Tollefson

Allow alloc_bm_huge_page() to be overridden by architectures that can't
always use bootmem. This requires huge_boot_pages to be available for
use by this function. The 16G pages on ppc64 have to be reserved prior
to boot-time. The locations of these pages are indicated in the device
tree.
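
(For illustration: a self-contained sketch of the weak-symbol override
pattern the patch relies on.  The prototype is simplified here; the real
callback takes a struct hstate *.)

#include <stdio.h>

/* Generic default.  An architecture that links in a strong definition of
 * the same symbol (e.g. a ppc64 16G-page version) replaces it at link
 * time; otherwise this weak definition is used. */
__attribute__((weak)) int alloc_bm_huge_page(void)
{
        printf("generic bootmem allocation path\n");
        return 0;
}

int main(void)
{
        return alloc_bm_huge_page();
}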


A BUG_ON in huge_add_hstate is commented out in order to allow 64K huge
pages to continue to work on power.



Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

include/linux/hugetlb.h |   10 ++
mm/hugetlb.c|   15 ++-
2 files changed, 16 insertions(+), 9 deletions(-)


diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8c47ca7..b550ec7 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -35,6 +35,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long 
offset, long freed);
extern unsigned long hugepages_treat_as_movable;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
extern int sysctl_hugetlb_shm_group;
+extern struct list_head huge_boot_pages;

/* arch callbacks */

@@ -205,6 +206,14 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
};

+struct huge_bm_page {
+   struct list_head list;
+   struct hstate *hstate;
+};
+
+/* arch callback */
+int alloc_bm_huge_page(struct hstate *h);
+
void __init huge_add_hstate(unsigned order);
struct hstate *size_to_hstate(unsigned long size);

@@ -256,6 +265,7 @@ extern unsigned long 
sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];

#else
struct hstate {};
+#define alloc_bm_huge_page(h) NULL
#define hstate_file(f) NULL
#define hstate_vma(v) NULL
#define hstate_inode(i) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5273f6c..efb5805 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -27,6 +27,7 @@ unsigned long max_huge_pages[HUGE_MAX_HSTATE];
unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
+struct list_head huge_boot_pages;

static int max_hstate = 0;

@@ -533,14 +534,8 @@ static struct page *alloc_huge_page(struct vm_area_struct 
*vma,
return page;
}

-static __initdata LIST_HEAD(huge_boot_pages);
-
-struct huge_bm_page {
-   struct list_head list;
-   struct hstate *hstate;
-};
-
-static int __init alloc_bm_huge_page(struct hstate *h)
+/* Can be overriden by architectures */
+__attribute__((weak)) int alloc_bm_huge_page(struct hstate *h)
{
struct huge_bm_page *m;
int nr_nodes = nodes_weight(node_online_map);
@@ -583,6 +578,8 @@ static void __init hugetlb_init_hstate(struct hstate *h)
unsigned long i;

/* Don't reinitialize lists if they have been already init'ed */
+   if (!huge_boot_pages.next)
+   INIT_LIST_HEAD(&huge_boot_pages);
 if (!h->hugepage_freelists[0].next) {
 for (i = 0; i < MAX_NUMNODES; ++i)
 INIT_LIST_HEAD(&h->hugepage_freelists[i]);
@@ -664,7 +661,7 @@ void __init huge_add_hstate(unsigned order)
return;
}
BUG_ON(max_hstate = HUGE_MAX_HSTATE);
-   BUG_ON(order  HPAGE_SHIFT - PAGE_SHIFT);
+/* BUG_ON(order  HPAGE_SHIFT - PAGE_SHIFT);*/
 h = &hstates[max_hstate++];
 h->order = order;
 h->mask = ~((1ULL << (order + PAGE_SHIFT)) - 1);






[PATCH 2/6 v2] powerpc: function for allocating gigantic pages

2008-05-13 Thread Jon Tollefson

The 16G page locations have been saved during early boot in an array.
The alloc_bm_huge_page() function adds a page from here to the
huge_boot_pages list.


Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

arch/powerpc/mm/hugetlbpage.c |   22 ++
1 file changed, 22 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26f212f..383b3b2 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -29,6 +29,12 @@

#define NUM_LOW_AREAS   (0x1UL  SID_SHIFT)
#define NUM_HIGH_AREAS  (PGTABLE_RANGE  HTLB_AREA_SHIFT)
+#define MAX_NUMBER_GPAGES  1024
+
+/* Tracks the 16G pages after the device tree is scanned and before the
+ *  huge_boot_pages list is ready.  */
+static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
+static unsigned nr_gpages;

unsigned int hugepte_shift;
#define PTRS_PER_HUGEPTE   (1 << hugepte_shift)
@@ -104,6 +110,22 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, 
unsigned long addr)
}
#endif

+/* Moves the gigantic page addresses from the temporary list to the
+  * huge_boot_pages list.
+ */
+int alloc_bm_huge_page(struct hstate *h)
+{
+   struct huge_bm_page *m;
+   if (nr_gpages == 0)
+   return 0;
+   m = phys_to_virt(gpage_freearray[--nr_gpages]);
+   gpage_freearray[nr_gpages] = 0;
+   list_add(&m->list, &huge_boot_pages);
+   m->hstate = h;
+   return 1;
+}
+
+
/* Modelled after find_linux_pte() */
pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
{






[PATCH 3/6 v2] powerpc: scan device tree and save gigantic page locations

2008-05-13 Thread Jon Tollefson

The 16G huge pages have to be reserved in the HMC prior to boot. The
locations of the pages are placed in the device tree.  This patch adds
code to scan the device tree during very early boot and save these page
locations until hugetlbfs is ready for them.
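
(A rough illustration of the arithmetic the scan performs, with a made-up
property value; the real values come from the ibm,expected#pages and reg
properties read in the patch below.)

#include <stdio.h>

int main(void)
{
        unsigned int prop = 2;                            /* hypothetical: log2 of the page count */
        unsigned long long expected_pages = 1ULL << prop; /* 4 pages */
        unsigned long long block_size = 16ULL << 30;      /* 16 GiB */

        printf("reserving %llu bytes for %llu gigantic pages\n",
               block_size * expected_pages, expected_pages);
        return 0;
}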



Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

arch/powerpc/mm/hash_utils_64.c  |   44 ++-
arch/powerpc/mm/hugetlbpage.c|   16 ++
include/asm-powerpc/mmu-hash64.h |2 +
3 files changed, 61 insertions(+), 1 deletion(-)



diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index a83dfa3..133d6e2 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -67,6 +67,7 @@

#define KB (1024)
#define MB (1024*KB)
+#define GB (1024L*MB)

/*
 * Note:  pte   -- Linux PTE
@@ -302,6 +303,44 @@ static int __init htab_dt_scan_page_sizes(unsigned long 
node,
return 0;
}

+/* Scan for 16G memory blocks that have been set aside for huge pages
+ * and reserve those blocks for 16G huge pages.
+ */
+static int __init htab_dt_scan_hugepage_blocks(unsigned long node,
+   const char *uname, int depth,
+   void *data) {
+   char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+   unsigned long *addr_prop;
+   u32 *page_count_prop;
+   unsigned int expected_pages;
+   long unsigned int phys_addr;
+   long unsigned int block_size;
+
+   /* We are scanning memory nodes only */
+   if (type == NULL || strcmp(type, "memory") != 0)
+   return 0;
+
+   /* This property is the log base 2 of the number of virtual pages that
+* will represent this memory block. */
+   page_count_prop = of_get_flat_dt_prop(node, "ibm,expected#pages", NULL);
+   if (page_count_prop == NULL)
+   return 0;
+   expected_pages = (1 << page_count_prop[0]);
+   addr_prop = of_get_flat_dt_prop(node, "reg", NULL);
+   if (addr_prop == NULL)
+   return 0;
+   phys_addr = addr_prop[0];
+   block_size = addr_prop[1];
+   if (block_size != (16 * GB))
+   return 0;
+   printk(KERN_INFO "Huge page(16GB) memory: "
+   "addr = 0x%lX size = 0x%lX pages = %d\n",
+   phys_addr, block_size, expected_pages);
+   lmb_reserve(phys_addr, block_size * expected_pages);
+   add_gpage(phys_addr, block_size, expected_pages);
+   return 0;
+}
+
static void __init htab_init_page_sizes(void)
{
int rc;
@@ -370,7 +409,10 @@ static void __init htab_init_page_sizes(void)
   mmu_psize_defs[mmu_io_psize].shift);

#ifdef CONFIG_HUGETLB_PAGE
-   /* Init large page size. Currently, we pick 16M or 1M depending
+   /* Reserve 16G huge page memory sections for huge pages */
+   of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
+
+/* Init large page size. Currently, we pick 16M or 1M depending
 * on what is available
 */
if (mmu_psize_defs[MMU_PAGE_16M].shift)
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 383b3b2..a27b80c 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -110,6 +110,22 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, 
unsigned long addr)
}
#endif

+/* Build list of addresses of gigantic pages.  This function is used in early
+ * boot before the buddy or bootmem allocator is setup.
+ */
+void add_gpage(unsigned long addr, unsigned long page_size,
+   unsigned long number_of_pages)
+{
+   if (!addr)
+   return;
+   while (number_of_pages > 0) {
+   gpage_freearray[nr_gpages] = addr;
+   nr_gpages++;
+   number_of_pages--;
+   addr += page_size;
+   }
+}
+
/* Moves the gigantic page addresses from the temporary list to the
  * huge_boot_pages list.
 */
diff --git a/include/asm-powerpc/mmu-hash64.h b/include/asm-powerpc/mmu-hash64.h
index 2864fa3..db1276a 100644
--- a/include/asm-powerpc/mmu-hash64.h
+++ b/include/asm-powerpc/mmu-hash64.h
@@ -279,6 +279,8 @@ extern int htab_bolt_mapping(unsigned long vstart, unsigned 
long vend,
 unsigned long pstart, unsigned long mode,
 int psize, int ssize);
extern void set_huge_psize(int psize);
+extern void add_gpage(unsigned long addr, unsigned long page_size,
+ unsigned long number_of_pages);
extern void demote_segment_4k(struct mm_struct *mm, unsigned long addr);

extern void htab_initialize(void);




[PATCH 4/6 v2] powerpc: define page support for 16G pages

2008-05-13 Thread Jon Tollefson

The huge page size is defined for 16G pages.  If a hugepagesz of 16G is
specified at boot-time then it becomes the huge page size instead of
the default 16M.

The change in pgtable-64K.h is to the macro
pte_iterate_hashed_subpages to make the increment to va (the 1
being shifted) be a long so that it is not shifted to 0.  Otherwise it
would create an infinite loop when the shift value is for a 16G page
(when base page size is 64K).
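
(A small standalone example of why the literal has to be a long: with a
plain int 1 the shift count 34 exceeds the width of int, so the computed
increment is not the intended 2^34 and the loop never advances.  Values
here are just for illustration.)

#include <assert.h>
#include <stdio.h>

int main(void)
{
        unsigned int shift = 34;                  /* shift for a 16G page */
        unsigned long long step = 1ULL << shift;  /* well-defined 64-bit shift */

        assert(step == 0x400000000ULL);
        printf("va increment = %#llx\n", step);
        return 0;
}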



Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

arch/powerpc/mm/hugetlbpage.c |   62 ++
include/asm-powerpc/pgtable-64k.h |2 -
2 files changed, 45 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a27b80c..063ec36 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -24,8 +24,9 @@
#include <asm/cputable.h>
#include <asm/spu.h>

-#define HPAGE_SHIFT_64K    16
-#define HPAGE_SHIFT_16M    24
+#define PAGE_SHIFT_64K 16
+#define PAGE_SHIFT_16M 24
+#define PAGE_SHIFT_16G 34

#define NUM_LOW_AREAS   (0x1UL  SID_SHIFT)
#define NUM_HIGH_AREAS  (PGTABLE_RANGE  HTLB_AREA_SHIFT)
@@ -95,7 +96,7 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t 
*hpdp,
static inline
pmd_t *hpmd_offset(pud_t *pud, unsigned long addr)
{
-   if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+   if (HPAGE_SHIFT == PAGE_SHIFT_64K)
return pmd_offset(pud, addr);
else
return (pmd_t *) pud;
@@ -103,7 +104,7 @@ pmd_t *hpmd_offset(pud_t *pud, unsigned long addr)
static inline
pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr)
{
-   if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+   if (HPAGE_SHIFT == PAGE_SHIFT_64K)
return pmd_alloc(mm, pud, addr);
else
return (pmd_t *) pud;
@@ -260,7 +261,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, 
pgd_t *pgd,
continue;
hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling);
#else
-   if (HPAGE_SHIFT == HPAGE_SHIFT_64K) {
+   if (HPAGE_SHIFT == PAGE_SHIFT_64K) {
if (pud_none_or_clear_bad(pud))
continue;
hugetlb_free_pmd_range(tlb, pud, addr, next, floor, 
ceiling);
@@ -591,20 +592,40 @@ void set_huge_psize(int psize)
{
/* Check that it is a page size supported by the hardware and
 * that it fits within pagetable limits. */
-   if (mmu_psize_defs[psize].shift && mmu_psize_defs[psize].shift < SID_SHIFT &&
+   if (mmu_psize_defs[psize].shift &&
+   mmu_psize_defs[psize].shift < SID_SHIFT_1T &&
 (mmu_psize_defs[psize].shift > MIN_HUGEPTE_SHIFT ||
-   mmu_psize_defs[psize].shift == HPAGE_SHIFT_64K)) {
+mmu_psize_defs[psize].shift == PAGE_SHIFT_64K ||
+mmu_psize_defs[psize].shift == PAGE_SHIFT_16G)) {
+   /* Return if huge page size is the same as the
+* base page size. */
+   if (mmu_psize_defs[psize].shift == PAGE_SHIFT)
+   return;
+
HPAGE_SHIFT = mmu_psize_defs[psize].shift;
mmu_huge_psize = psize;
-#ifdef CONFIG_PPC_64K_PAGES
-   hugepte_shift = (PMD_SHIFT-HPAGE_SHIFT);
-#else
-   if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
-   hugepte_shift = (PMD_SHIFT-HPAGE_SHIFT);
-   else
-   hugepte_shift = (PUD_SHIFT-HPAGE_SHIFT);
-#endif

+   switch (HPAGE_SHIFT) {
+   case PAGE_SHIFT_64K:
+   /* We only allow 64k hpages with 4k base page,
+* which was checked above, and always put them
+* at the PMD */
+   hugepte_shift = PMD_SHIFT;
+   break;
+   case PAGE_SHIFT_16M:
+   /* 16M pages can be at two different levels
+* of pagestables based on base page size */
+   if (PAGE_SHIFT == PAGE_SHIFT_64K)
+   hugepte_shift = PMD_SHIFT;
+   else /* 4k base page */
+   hugepte_shift = PUD_SHIFT;
+   break;
+   case PAGE_SHIFT_16G:
+   /* 16G pages are always at PGD level */
+   hugepte_shift = PGDIR_SHIFT;
+   break;
+   }
+   hugepte_shift -= HPAGE_SHIFT;
} else
HPAGE_SHIFT = 0;
}
@@ -620,17 +641,22 @@ static int __init hugepage_setup_sz(char *str)
shift = __ffs(size);
switch (shift) {
#ifndef CONFIG_PPC_64K_PAGES
-   case HPAGE_SHIFT_64K:
+   case PAGE_SHIFT_64K:
mmu_psize = MMU_PAGE_64K;
break;
#endif
-   case HPAGE_SHIFT_16M:
+   case PAGE_SHIFT_16M:
mmu_psize = MMU_PAGE_16M

[PATCH 5/6 v2] check for overflow

2008-05-13 Thread Jon Tollefson

Adds a check for an overflow in the filesystem size so that if someone
calls statfs() on a 16G hugetlbfs from a 32-bit binary, it will report
back EOVERFLOW instead of a size of 0.

Are there other places that need a similar check?  I had tried a similar
check in put_compat_statfs64 too but it didn't seem to generate an
EOVERFLOW in my test case.
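
(A minimal userspace sketch of the check: a 16G block size has bits set
above bit 31, so it cannot be represented in the 32-bit compat statfs
fields.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t f_bsize   = 1ULL << 34;            /* 16 GiB block size */
        uint64_t high_bits = 0xffffffff00000000ULL; /* anything here overflows a u32 */

        if (f_bsize & high_bits)
                printf("would return -EOVERFLOW to a 32-bit statfs() caller\n");
        return 0;
}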


Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

fs/compat.c |4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)


diff --git a/fs/compat.c b/fs/compat.c
index 2ce4456..6eb6aad 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -196,8 +196,8 @@ static int put_compat_statfs(struct compat_statfs __user 
*ubuf, struct kstatfs *
{

 if (sizeof ubuf->f_blocks == 4) {
-   if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail) &
-   0xffffffff00000000ULL)
+   if ((kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
+kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
return -EOVERFLOW;
/* f_files and f_ffree may be -1; it's okay
 * to stuff that into 32 bits */




[PATCH 6/6] powerpc: support multiple huge page sizes

2008-05-13 Thread Jon Tollefson

Instead of using the variable mmu_huge_psize to keep track of the huge
page size we use an array of MMU_PAGE_* values.  For each supported
huge page size we need to know the hugepte_shift value and have a
pgtable_cache.  The hstate or an mmu_huge_psizes index is passed to
functions so that they know which huge page size they should use.

The hugepage sizes 16M and 64K are set up (if available on the
hardware) so that they don't have to be set on the boot cmd line in
order to use them.  The number of 16G pages has to be specified at
boot-time though (e.g. hugepagesz=16G hugepages=5).


Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

arch/powerpc/mm/hash_utils_64.c  |9 -
arch/powerpc/mm/hugetlbpage.c|  267 +--
arch/powerpc/mm/init_64.c|8 -
arch/powerpc/mm/tlb_64.c |2 
include/asm-powerpc/mmu-hash64.h |4 
include/asm-powerpc/page_64.h|1 
include/asm-powerpc/pgalloc-64.h |4 
7 files changed, 187 insertions(+), 108 deletions(-)



--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -99,7 +99,6 @@ int mmu_kernel_ssize = MMU_SEGSIZE_256M;
int mmu_highuser_ssize = MMU_SEGSIZE_256M;
u16 mmu_slb_size = 64;
#ifdef CONFIG_HUGETLB_PAGE
-int mmu_huge_psize = MMU_PAGE_16M;
unsigned int HPAGE_SHIFT;
#endif
#ifdef CONFIG_PPC_64K_PAGES
@@ -412,15 +411,15 @@ static void __init htab_init_page_sizes(void)
/* Reserve 16G huge page memory sections for huge pages */
of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);

-/* Init large page size. Currently, we pick 16M or 1M depending
+/* Set default large page size. Currently, we pick 16M or 1M depending
 * on what is available
 */
if (mmu_psize_defs[MMU_PAGE_16M].shift)
-   set_huge_psize(MMU_PAGE_16M);
+   HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
/* With 4k/4level pagetables, we can't (for now) cope with a
 * huge page size  PMD_SIZE */
else if (mmu_psize_defs[MMU_PAGE_1M].shift)
-   set_huge_psize(MMU_PAGE_1M);
+   HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
#endif /* CONFIG_HUGETLB_PAGE */
}

@@ -819,7 +818,7 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap)

#ifdef CONFIG_HUGETLB_PAGE
/* Handle hugepage regions */
-   if (HPAGE_SHIFT && psize == mmu_huge_psize) {
+   if (HPAGE_SHIFT && mmu_huge_psizes[psize]) {
 DBG_LOW(" -> huge page !\n");
return hash_huge_page(mm, access, ea, vsid, local, trap);
}
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 063ec36..61ce875 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -37,15 +37,30 @@
static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
static unsigned nr_gpages;

-unsigned int hugepte_shift;
-#define PTRS_PER_HUGEPTE   (1 << hugepte_shift)
-#define HUGEPTE_TABLE_SIZE (sizeof(pte_t) << hugepte_shift)
+/* Array of valid huge page sizes - non-zero value(hugepte_shift) is
+ * stored for the huge page sizes that are valid.
+ */
+unsigned int mmu_huge_psizes[MMU_PAGE_COUNT];
+
+#define hugepte_shift  mmu_huge_psizes
+#define PTRS_PER_HUGEPTE(psize)    (1 << hugepte_shift[psize])
+#define HUGEPTE_TABLE_SIZE(psize)  (sizeof(pte_t) << hugepte_shift[psize])
+
+#define HUGEPD_SHIFT(psize)    (mmu_psize_to_shift(psize) \
+   + hugepte_shift[psize])
+#define HUGEPD_SIZE(psize) (1UL << HUGEPD_SHIFT(psize))
+#define HUGEPD_MASK(psize) (~(HUGEPD_SIZE(psize)-1))

-#define HUGEPD_SHIFT   (HPAGE_SHIFT + hugepte_shift)
-#define HUGEPD_SIZE    (1UL << HUGEPD_SHIFT)
-#define HUGEPD_MASK    (~(HUGEPD_SIZE-1))
+/* Subtract one from array size because we don't need a cache for 4K since
+ * is not a huge page size */
+#define huge_pgtable_cache(psize)  (pgtable_cache[HUGEPTE_CACHE_NUM \
+   + psize-1])
+#define HUGEPTE_CACHE_NAME(psize)  (huge_pgtable_cache_name[psize])

-#define huge_pgtable_cache (pgtable_cache[HUGEPTE_CACHE_NUM])
+static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
+   "unused_4K", "hugepte_cache_64K", "unused_64K_AP",
+   "hugepte_cache_1M", "hugepte_cache_16M", "hugepte_cache_16G"
+};

/* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
 * will choke on pointers to hugepte tables, which is handy for
@@ -56,24 +71,49 @@ typedef struct { unsigned long pd; } hugepd_t;

#define hugepd_none(hpd)((hpd).pd == 0)

+static inline int shift_to_mmu_psize(unsigned int shift)
+{
+   switch (shift) {
+#ifndef CONFIG_PPC_64K_PAGES
+   case PAGE_SHIFT_64K:
+   return MMU_PAGE_64K;
+#endif
+   case PAGE_SHIFT_16M:
+   return MMU_PAGE_16M;
+   case PAGE_SHIFT_16G:
+   return MMU_PAGE_16G

[PATCH 0/6] 16G and multi size hugetlb page support on powerpc

2008-05-13 Thread Jon Tollefson
This patch set builds on Nick Piggin's patches for multi size and giant 
hugetlb page support of April 22.  The following set of patches adds 
support for 16G huge pages on ppc64 and support for multiple huge page 
sizes at the same time on ppc64.  Thus allowing 64K, 16M, and 16G huge 
pages given a POWER5+ or later machine.


New to this version of my patch set are numerous bug fixes and cleanups, but
the biggest change is the support for multiple huge page sizes on power.


patch 1: changes to generic hugetlb to enable 16G pages on power
patch 2: powerpc: adds function for allocating 16G pages
patch 3: powerpc: sets up 16G page locations found in device tree
patch 4: powerpc: page definition support for 16G pages
patch 5: check for overflow when user space is 32bit
patch 6: powerpc: multiple huge page size support

Jon




[PATCH 0/4] 16G huge page support for powerpc

2008-03-26 Thread Jon Tollefson

This patch set builds on Andi Kleen's patches for GB pages for hugetlb
posted on March 16th.  This set adds support for 16G huge pages on
ppc64.  Supporting multiple huge page sizes on ppc64 as defined in
Andi's patches is not a part of this set; that will be included in a
future patch.

The first patch here adds an arch callback since the 16G pages are not
allocated from bootmem.  The 16G pages have to be reserved prior to
boot-time.  The location of these pages are indicated in the device tree.

Support for 16G pages requires a POWER5+ or later machine and a little
bit of memory.

Jon



[PATCH 1/4] allow arch specific function for allocating gigantic pages

2008-03-26 Thread Jon Tollefson
Allow alloc_bm_huge_page() to be overridden by architectures that can't
always use bootmem.  This requires huge_boot_pages to be available for use
by this function.  Also huge_page_size() and other functions need to use a
long so that they can handle the 16G page size.


Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

 include/linux/hugetlb.h |   10 +-
 mm/hugetlb.c|   21 +
 2 files changed, 18 insertions(+), 13 deletions(-)


diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a8de3c1..35a41be 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -35,6 +35,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long 
offset, long freed);
 extern unsigned long hugepages_treat_as_movable;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;
+extern struct list_head huge_boot_pages;
 
 /* arch callbacks */
 
@@ -219,9 +220,15 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
unsigned long parsed_hugepages;
 };
+struct huge_bm_page {
+   struct list_head list;
+   struct hstate *hstate;
+};
 
 void __init huge_add_hstate(unsigned order);
 struct hstate *huge_lookup_hstate(unsigned long pagesize);
+/* arch callback */
+int alloc_bm_huge_page(struct hstate *h);
 
 #ifndef HUGE_MAX_HSTATE
 #define HUGE_MAX_HSTATE 1
@@ -248,7 +255,7 @@ static inline struct hstate *hstate_inode(struct inode *i)
return HUGETLBFS_I(i)-hstate;
 }
 
-static inline unsigned huge_page_size(struct hstate *h)
+static inline unsigned long huge_page_size(struct hstate *h)
 {
 return PAGE_SIZE << h->order;
 }
@@ -273,6 +280,7 @@ extern unsigned long 
sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
 
 #else
 struct hstate {};
+#define alloc_bm_huge_page(h) NULL
 #define hstate_file(f) NULL
 #define hstate_vma(v) NULL
 #define hstate_inode(i) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c28b8b6..a0017b0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -27,6 +27,7 @@ unsigned long max_huge_pages[HUGE_MAX_HSTATE];
 unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
+struct list_head huge_boot_pages;
 
 static int max_hstate = 1;
 
@@ -43,7 +44,8 @@ struct hstate *parsed_hstate __initdata = global_hstate;
  */
 static DEFINE_SPINLOCK(hugetlb_lock);
 
-static void clear_huge_page(struct page *page, unsigned long addr, unsigned sz)
+static void clear_huge_page(struct page *page, unsigned long addr,
+   unsigned long sz)
 {
int i;
 
@@ -521,14 +523,8 @@ static __init char *memfmt(char *buf, unsigned long n)
return buf;
 }
 
-static __initdata LIST_HEAD(huge_boot_pages);
-
-struct huge_bm_page {
-   struct list_head list;
-   struct hstate *hstate;
-};
-
-static int __init alloc_bm_huge_page(struct hstate *h)
+/* Can be overriden by architectures */
+__attribute__((weak)) int alloc_bm_huge_page(struct hstate *h)
 {
struct huge_bm_page *m;
 m = __alloc_bootmem_node_nopanic(NODE_DATA(h->hugetlb_next_nid),
@@ -614,6 +610,7 @@ static int __init hugetlb_init(void)
 {
if (HPAGE_SHIFT == 0)
return 0;
+   INIT_LIST_HEAD(&huge_boot_pages);
return hugetlb_init_hstate(global_hstate);
 }
 module_init(hugetlb_init);
@@ -866,7 +863,7 @@ int hugetlb_report_meminfo(char *buf)
n += dump_field(buf + n, offsetof(struct hstate, surplus_huge_pages));
 n += sprintf(buf + n, "Hugepagesize:   ");
 for_each_hstate (h)
-   n += sprintf(buf + n, " %5u", huge_page_size(h) / 1024);
+   n += sprintf(buf + n, " %5lu", huge_page_size(h) / 1024);
 n += sprintf(buf + n, " kB\n");
return n;
 }
@@ -947,7 +944,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct 
mm_struct *src,
unsigned long addr;
int cow;
struct hstate *h = hstate_vma(vma);
-   unsigned sz = huge_page_size(h);
+   unsigned long sz = huge_page_size(h);
 
 cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
 
@@ -992,7 +989,7 @@ void __unmap_hugepage_range(struct vm_area_struct *vma, 
unsigned long start,
struct page *page;
struct page *tmp;
struct hstate *h = hstate_vma(vma);
-   unsigned sz = huge_page_size(h);
+   unsigned long sz = huge_page_size(h);
 
/*
 * A page gathering list, protected by per file i_mmap_lock. The






[PATCH 2/4] powerpc: function for allocating gigantic pages

2008-03-26 Thread Jon Tollefson
The 16G page locations have been saved during early boot in an array.  The
alloc_bm_huge_page() function adds a page from here to the huge_boot_pages list.


Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---


 hugetlbpage.c |   19 +++
 1 file changed, 19 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 94625db..31d977b 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -29,6 +29,10 @@
 
 #define NUM_LOW_AREAS  (0x1UL  SID_SHIFT)
 #define NUM_HIGH_AREAS (PGTABLE_RANGE  HTLB_AREA_SHIFT)
+#define MAX_NUMBER_GPAGES  1024
+
+static void *gpage_freearray[MAX_NUMBER_GPAGES];
+static unsigned nr_gpages;
 
 unsigned int hugepte_shift;
 #define PTRS_PER_HUGEPTE   (1 << hugepte_shift)
@@ -104,6 +108,21 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, 
unsigned long addr)
 }
 #endif
 
+/* Put 16G page address into temporary huge page list because the mem_map
+ * is not up yet.
+ */
+int alloc_bm_huge_page(struct hstate *h)
+{
+   struct huge_bm_page *m;
+   if (nr_gpages == 0)
+   return 0;
+   m = gpage_freearray[--nr_gpages];
+   list_add(&m->list, &huge_boot_pages);
+   m->hstate = h;
+   return 1;
+}
+
+
 /* Modelled after find_linux_pte() */
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {





[PATCH 3/4] powerpc: scan device tree and save gigantic page locations

2008-03-26 Thread Jon Tollefson
The 16G huge pages have to be reserved in the HMC prior to boot.  The
locations of the pages are placed in the device tree.  During very early
boot these locations are saved for use by hugetlbfs.

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

 arch/powerpc/mm/hash_utils_64.c  |   41 ++-
 arch/powerpc/mm/hugetlbpage.c|   17 
 include/asm-powerpc/mmu-hash64.h |2 +
 3 files changed, 59 insertions(+), 1 deletion(-)


diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index a83dfa3..d3f7d92 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -67,6 +67,7 @@
 
 #define KB (1024)
 #define MB (1024*KB)
+#define GB (1024L*MB)
 
 /*
  * Note:  pte   -- Linux PTE
@@ -302,6 +303,41 @@ static int __init htab_dt_scan_page_sizes(unsigned long 
node,
return 0;
 }
 
+/* Scan for 16G memory blocks that have been set aside for huge pages
+ * and reserve those blocks for 16G huge pages.
+ */
+static int __init htab_dt_scan_hugepage_blocks(unsigned long node,
+   const char *uname, int depth,
+   void *data) {
+   char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+   unsigned long *lprop;
+   u32 *prop;
+
+   /* We are scanning memory nodes only */
+   if (type == NULL || strcmp(type, "memory") != 0)
+   return 0;
+
+   /* This property is the log base 2 of the number of virtual pages that
+* will represent this memory block. */
+   prop = of_get_flat_dt_prop(node, "ibm,expected#pages", NULL);
+   if (prop == NULL)
+   return 0;
+   unsigned int expected_pages = (1 << prop[0]);
+   lprop = of_get_flat_dt_prop(node, "reg", NULL);
+   if (lprop == NULL)
+   return 0;
+   long unsigned int phys_addr = lprop[0];
+   long unsigned int block_size = lprop[1];
+   if (block_size != (16 * GB))
+   return 0;
+   printk(KERN_INFO "Reserving huge page memory "
+   "addr = 0x%lX size = 0x%lX pages = %d\n",
+   phys_addr, block_size, expected_pages);
+   lmb_reserve(phys_addr, block_size * expected_pages);
+   add_gpage(phys_addr, block_size, expected_pages);
+   return 0;
+}
+
 static void __init htab_init_page_sizes(void)
 {
int rc;
@@ -370,7 +406,10 @@ static void __init htab_init_page_sizes(void)
   mmu_psize_defs[mmu_io_psize].shift);
 
 #ifdef CONFIG_HUGETLB_PAGE
-   /* Init large page size. Currently, we pick 16M or 1M depending
+   /* Reserve 16G huge page memory sections for huge pages */
+   of_scan_flat_dt(htab_dt_scan_hugepage_blocks, NULL);
+
+/* Init large page size. Currently, we pick 16M or 1M depending
 * on what is available
 */
if (mmu_psize_defs[MMU_PAGE_16M].shift)
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 31d977b..44d3d55 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -108,6 +108,23 @@ pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, 
unsigned long addr)
 }
 #endif
 
+/* Build list of addresses of gigantic pages.  This function is used in early
+ * boot before the buddy allocator is setup.
+ */
+void add_gpage(unsigned long addr, unsigned long page_size,
+   unsigned long number_of_pages)
+{
+   if (addr) {
+   while (number_of_pages > 0) {
+   gpage_freearray[nr_gpages] = __va(addr);
+   nr_gpages++;
+   number_of_pages--;
+   addr += page_size;
+   }
+   }
+}
+
+
 /* Put 16G page address into temporary huge page list because the mem_map
  * is not up yet.
  */
diff --git a/include/asm-powerpc/mmu-hash64.h b/include/asm-powerpc/mmu-hash64.h
index 2864fa3..db1276a 100644
--- a/include/asm-powerpc/mmu-hash64.h
+++ b/include/asm-powerpc/mmu-hash64.h
@@ -279,6 +279,8 @@ extern int htab_bolt_mapping(unsigned long vstart, unsigned 
long vend,
 unsigned long pstart, unsigned long mode,
 int psize, int ssize);
 extern void set_huge_psize(int psize);
+extern void add_gpage(unsigned long addr, unsigned long page_size,
+ unsigned long number_of_pages);
 extern void demote_segment_4k(struct mm_struct *mm, unsigned long addr);
 
 extern void htab_initialize(void);






[PATCH 4/4] powerpc: define page support for 16G pages

2008-03-26 Thread Jon Tollefson
The huge page size is set up for 16G pages if that size is specified at
boot-time.  The support for multiple huge page sizes is not being utilized
yet.  That will be in a future patch.


Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

 hugetlbpage.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)


diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 44d3d55..b6a02b7 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -26,6 +26,7 @@
 
 #define HPAGE_SHIFT_64K    16
 #define HPAGE_SHIFT_16M    24
+#define HPAGE_SHIFT_16G    34
 
 #define NUM_LOW_AREAS  (0x1UL  SID_SHIFT)
 #define NUM_HIGH_AREAS (PGTABLE_RANGE  HTLB_AREA_SHIFT)
@@ -589,9 +590,11 @@ void set_huge_psize(int psize)
 {
/* Check that it is a page size supported by the hardware and
 * that it fits within pagetable limits. */
-   if (mmu_psize_defs[psize].shift && mmu_psize_defs[psize].shift < SID_SHIFT &&
+   if (mmu_psize_defs[psize].shift &&
+   mmu_psize_defs[psize].shift < SID_SHIFT_1T &&
 (mmu_psize_defs[psize].shift > MIN_HUGEPTE_SHIFT ||
-   mmu_psize_defs[psize].shift == HPAGE_SHIFT_64K)) {
+mmu_psize_defs[psize].shift == HPAGE_SHIFT_64K ||
+mmu_psize_defs[psize].shift == HPAGE_SHIFT_16G)) {
HPAGE_SHIFT = mmu_psize_defs[psize].shift;
mmu_huge_psize = psize;
 #ifdef CONFIG_PPC_64K_PAGES
@@ -599,6 +602,8 @@ void set_huge_psize(int psize)
 #else
if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
hugepte_shift = (PMD_SHIFT-HPAGE_SHIFT);
+   else if (HPAGE_SHIFT == HPAGE_SHIFT_16G)
+   hugepte_shift = (PGDIR_SHIFT-HPAGE_SHIFT);
else
hugepte_shift = (PUD_SHIFT-HPAGE_SHIFT);
 #endif
@@ -625,6 +630,9 @@ static int __init hugepage_setup_sz(char *str)
case HPAGE_SHIFT_16M:
mmu_psize = MMU_PAGE_16M;
break;
+   case HPAGE_SHIFT_16G:
+   mmu_psize = MMU_PAGE_16G;
+   break;
}
 
 if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift)





Re: [PATCh v3] powerpc: add hugepagesz boot-time parameter

2008-01-04 Thread Jon Tollefson
Arnd Bergmann wrote:
 We started discussing this in v1, but the discussion got sidetracked:
 Is there a technical reason why you don't also allow 1M pages, which
 may be useful in certain scenarios?
   
No, it was mostly a matter of the time I have had and machines easily
available to me for testing.  I don't know of a technical reason that
would prevent supporting 1M huge pages, but would want the tests in the
libhugetlbfs suite to pass, etc.
 On the Cell/B.E. platforms (IBM/Mercury blades, Toshiba Celleb, PS3), the
 second large page size is an option that can be set in a HID SPR
 to either 64KB or 1MB. Unfortunately, we can't do these two simultaneously,
 but the firmware can change the default and put it into the device tree,
 or you could have the kernel override the firmware settings.

 Going a lot further, do you have plans for a fully dynamic hugepage size,
 e.g. using a mount option for hugetlbfs? I can see that as rather useful,
 but at the same time it's probably much more complicated than the boot time
 option.
   
Eventually we will want to support dynamic huge page sizes.  This is
already being looked into.  In the meantime we can have some flexibility
with a boot-time parameter though.

   Arnd 
   
Jon



[PATCh v3] powerpc: add hugepagesz boot-time parameter

2008-01-03 Thread Jon Tollefson
Paul, please include this in 2.6.25 if there are no objections.

This patch adds the hugepagesz boot-time parameter for ppc64.  It lets
one pick the size for huge pages. The choices available are 64K and 16M
when the base page size is 4K. It defaults to 16M (previously the only
choice) if nothing or an invalid choice is specified.
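
For example (the page count here is a made-up value), a 4K-base kernel
could be booted with:

    hugepagesz=64K hugepages=128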

Tested 64K huge pages successfully with the libhugetlbfs 1.2.

Changes from v2:
Moved functions from header file into hugetlbpage.c where they are used.


Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 33121d6..2fc1fb8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -685,6 +685,7 @@ and is between 256 and 4096 characters. It is defined in 
the file
See Documentation/isdn/README.HiSax.
 
hugepages=  [HW,X86-32,IA-64] Maximal number of HugeTLB pages.
+   hugepagesz= [HW,IA-64,PPC] The size of the HugeTLB pages.
 
i8042.direct[HW] Put keyboard port into non-translated mode
i8042.dumbkbd   [HW] Pretend that controller can only read data from
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index cbbd8b0..9326a69 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -369,18 +369,11 @@ static void __init htab_init_page_sizes(void)
 * on what is available
 */
if (mmu_psize_defs[MMU_PAGE_16M].shift)
-   mmu_huge_psize = MMU_PAGE_16M;
+   set_huge_psize(MMU_PAGE_16M);
/* With 4k/4level pagetables, we can't (for now) cope with a
 * huge page size  PMD_SIZE */
else if (mmu_psize_defs[MMU_PAGE_1M].shift)
-   mmu_huge_psize = MMU_PAGE_1M;
-
-   /* Calculate HPAGE_SHIFT and sanity check it */
-   if (mmu_psize_defs[mmu_huge_psize].shift > MIN_HUGEPTE_SHIFT &&
-   mmu_psize_defs[mmu_huge_psize].shift < SID_SHIFT)
-   HPAGE_SHIFT = mmu_psize_defs[mmu_huge_psize].shift;
-   else
-   HPAGE_SHIFT = 0; /* No huge pages dude ! */
+   set_huge_psize(MMU_PAGE_1M);
 #endif /* CONFIG_HUGETLB_PAGE */
 }
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 71efb38..a02266d 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -24,18 +24,17 @@
 #include <asm/cputable.h>
 #include <asm/spu.h>
 
+#define HPAGE_SHIFT_64K    16
+#define HPAGE_SHIFT_16M    24
+
 #define NUM_LOW_AREAS  (0x1UL  SID_SHIFT)
 #define NUM_HIGH_AREAS (PGTABLE_RANGE  HTLB_AREA_SHIFT)
 
-#ifdef CONFIG_PPC_64K_PAGES
-#define HUGEPTE_INDEX_SIZE (PMD_SHIFT-HPAGE_SHIFT)
-#else
-#define HUGEPTE_INDEX_SIZE (PUD_SHIFT-HPAGE_SHIFT)
-#endif
-#define PTRS_PER_HUGEPTE   (1 << HUGEPTE_INDEX_SIZE)
-#define HUGEPTE_TABLE_SIZE (sizeof(pte_t) << HUGEPTE_INDEX_SIZE)
+unsigned int hugepte_shift;
+#define PTRS_PER_HUGEPTE   (1 << hugepte_shift)
+#define HUGEPTE_TABLE_SIZE (sizeof(pte_t) << hugepte_shift)
 
-#define HUGEPD_SHIFT   (HPAGE_SHIFT + HUGEPTE_INDEX_SIZE)
+#define HUGEPD_SHIFT   (HPAGE_SHIFT + hugepte_shift)
 #define HUGEPD_SIZE    (1UL << HUGEPD_SHIFT)
 #define HUGEPD_MASK    (~(HUGEPD_SIZE-1))
 
@@ -82,11 +81,35 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t 
*hpdp,
return 0;
 }
 
+/* Base page size affects how we walk hugetlb page tables */
+#ifdef CONFIG_PPC_64K_PAGES
+#define hpmd_offset(pud, addr) pmd_offset(pud, addr)
+#define hpmd_alloc(mm, pud, addr)  pmd_alloc(mm, pud, addr)
+#else
+static inline
+pmd_t *hpmd_offset(pud_t *pud, unsigned long addr)
+{
+   if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+   return pmd_offset(pud, addr);
+   else
+   return (pmd_t *) pud;
+}
+static inline
+pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr)
+{
+   if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+   return pmd_alloc(mm, pud, addr);
+   else
+   return (pmd_t *) pud;
+}
+#endif
+
 /* Modelled after find_linux_pte() */
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
pgd_t *pg;
pud_t *pu;
+   pmd_t *pm;
 
BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
 
@@ -96,14 +119,9 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long 
addr)
if (!pgd_none(*pg)) {
pu = pud_offset(pg, addr);
if (!pud_none(*pu)) {
-#ifdef CONFIG_PPC_64K_PAGES
-   pmd_t *pm;
-   pm = pmd_offset(pu, addr);
+   pm = hpmd_offset(pu, addr);
if (!pmd_none(*pm))
return hugepte_offset((hugepd_t *)pm, addr);
-#else
-   return hugepte_offset((hugepd_t *)pu, addr);
-#endif
}
}
 
@@ -114,6 +132,7 @@ pte_t

[PATCH v2] powerpc: add hugepagesz boot-time parameter

2007-12-19 Thread Jon Tollefson
Paul, please include this in 2.6.25 if there are no objections.

This patch adds the hugepagesz boot-time parameter for ppc64.  It lets
one pick the size for huge pages. The choices available are 64K and 16M
when the base page size is 4K. It defaults to 16M (previously the only
choice) if nothing or an invalid choice is specified.

Tested 64K huge pages successfully with the libhugetlbfs 1.2.

Changes from v1:
- disallow 64K huge pages when the base page size is 64K, since we can't
  distinguish between base and huge pages when doing a hash_page()
- collapsed pmd_offset and pmd_alloc to inline calls to simplify the
  main code
- removed printing of the huge page size in mm/hugetlb.c since this
  information is already available in /proc/meminfo; this leaves the
  remaining changes all powerpc specific

Signed-off-by: Jon Tollefson [EMAIL PROTECTED]
---

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 33121d6..2fc1fb8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -685,6 +685,7 @@ and is between 256 and 4096 characters. It is defined in 
the file
See Documentation/isdn/README.HiSax.
 
hugepages=  [HW,X86-32,IA-64] Maximal number of HugeTLB pages.
+   hugepagesz= [HW,IA-64,PPC] The size of the HugeTLB pages.
 
i8042.direct[HW] Put keyboard port into non-translated mode
i8042.dumbkbd   [HW] Pretend that controller can only read data from
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index cbbd8b0..9326a69 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -369,18 +369,11 @@ static void __init htab_init_page_sizes(void)
 * on what is available
 */
if (mmu_psize_defs[MMU_PAGE_16M].shift)
-   mmu_huge_psize = MMU_PAGE_16M;
+   set_huge_psize(MMU_PAGE_16M);
/* With 4k/4level pagetables, we can't (for now) cope with a
 * huge page size  PMD_SIZE */
else if (mmu_psize_defs[MMU_PAGE_1M].shift)
-   mmu_huge_psize = MMU_PAGE_1M;
-
-   /* Calculate HPAGE_SHIFT and sanity check it */
-   if (mmu_psize_defs[mmu_huge_psize].shift > MIN_HUGEPTE_SHIFT &&
-   mmu_psize_defs[mmu_huge_psize].shift < SID_SHIFT)
-   HPAGE_SHIFT = mmu_psize_defs[mmu_huge_psize].shift;
-   else
-   HPAGE_SHIFT = 0; /* No huge pages dude ! */
+   set_huge_psize(MMU_PAGE_1M);
 #endif /* CONFIG_HUGETLB_PAGE */
 }
 
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 71efb38..3099e48 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -24,18 +24,17 @@
 #include <asm/cputable.h>
 #include <asm/spu.h>
 
+#define HPAGE_SHIFT_64K    16
+#define HPAGE_SHIFT_16M    24
+
 #define NUM_LOW_AREAS  (0x1UL  SID_SHIFT)
 #define NUM_HIGH_AREAS (PGTABLE_RANGE  HTLB_AREA_SHIFT)
 
-#ifdef CONFIG_PPC_64K_PAGES
-#define HUGEPTE_INDEX_SIZE (PMD_SHIFT-HPAGE_SHIFT)
-#else
-#define HUGEPTE_INDEX_SIZE (PUD_SHIFT-HPAGE_SHIFT)
-#endif
-#define PTRS_PER_HUGEPTE   (1 << HUGEPTE_INDEX_SIZE)
-#define HUGEPTE_TABLE_SIZE (sizeof(pte_t) << HUGEPTE_INDEX_SIZE)
+unsigned int hugepte_shift;
+#define PTRS_PER_HUGEPTE   (1 << hugepte_shift)
+#define HUGEPTE_TABLE_SIZE (sizeof(pte_t) << hugepte_shift)
 
-#define HUGEPD_SHIFT   (HPAGE_SHIFT + HUGEPTE_INDEX_SIZE)
+#define HUGEPD_SHIFT   (HPAGE_SHIFT + hugepte_shift)
 #define HUGEPD_SIZE    (1UL << HUGEPD_SHIFT)
 #define HUGEPD_MASK    (~(HUGEPD_SIZE-1))
 
@@ -82,11 +81,31 @@ static int __hugepte_alloc(struct mm_struct *mm, hugepd_t 
*hpdp,
return 0;
 }
 
+#ifndef CONFIG_PPC_64K_PAGES
+static inline
+pmd_t *hpmd_offset(pud_t *pud, unsigned long addr)
+{
+   if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+   return pmd_offset(pud, addr);
+   else
+   return (pmd_t *) pud;
+}
+static inline
+pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr)
+{
+   if (HPAGE_SHIFT == HPAGE_SHIFT_64K)
+   return pmd_alloc(mm, pud, addr);
+   else
+   return (pmd_t *) pud;
+}
+#endif
+
 /* Modelled after find_linux_pte() */
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
pgd_t *pg;
pud_t *pu;
+   pmd_t *pm;
 
BUG_ON(get_slice_psize(mm, addr) != mmu_huge_psize);
 
@@ -96,14 +115,9 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long 
addr)
if (!pgd_none(*pg)) {
pu = pud_offset(pg, addr);
if (!pud_none(*pu)) {
-#ifdef CONFIG_PPC_64K_PAGES
-   pmd_t *pm;
-   pm = pmd_offset(pu, addr);
+   pm = hpmd_offset(pu, addr);
if (!pmd_none(*pm))
return hugepte_offset((hugepd_t *)pm, addr);
-#else

Re: [PATCH 2/2] powerpc: make 64K huge pages more reliable

2007-12-03 Thread Jon Tollefson
David Gibson wrote:
 On Tue, Nov 27, 2007 at 11:03:16PM -0600, Jon Tollefson wrote:
   
 This patch adds reliability to the 64K huge page option by making use of 
 the PMD for 64K huge pages when base pages are 4k.  So instead of a 12 
 bit pte it would be 7 bit pmd and a 5 bit pte. The pgd and pud offsets 
 would continue as 9 bits and 7 bits respectively.  This will allow the 
 pgtable to fit in one base page.  This patch would have to be applied 
 after part 1.
 

 Hrm.. shouldn't we just ban 64K hugepages on a 64K base page size
 setup?  There's not a whole lot of point to it, after all...
   

Banning the base and huge page size from being the same size feels like
an artificial barrier.  It is probably not the most massively useful
combination, but it shouldn't hurt performance. 

Jon



Re: [PATCH] Use 1TB segments

2007-08-06 Thread Jon Tollefson
Paul Mackerras wrote:
 diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
   
A couple of hunks fail in this file when applying to the current tree.

...
 diff --git a/include/asm-powerpc/mmu-hash64.h 
 b/include/asm-powerpc/mmu-hash64.h
 index 695962f..053f86b 100644
 --- a/include/asm-powerpc/mmu-hash64.h
 +++ b/include/asm-powerpc/mmu-hash64.h
 @@ -47,6 +47,8 @@ extern char initial_stab[];

  /* Bits in the SLB VSID word */
  #define SLB_VSID_SHIFT   12
 +#define SLB_VSID_SHIFT_1T24
 +#define SLB_VSID_SSIZE_SHIFT 62
  #define SLB_VSID_B   ASM_CONST(0xc000)
  #define SLB_VSID_B_256M  ASM_CONST(0x)
  #define SLB_VSID_B_1TASM_CONST(0x4000)
 @@ -66,6 +68,7 @@ extern char initial_stab[];
  #define SLB_VSID_USER(SLB_VSID_KP|SLB_VSID_KS|SLB_VSID_C)

  #define SLBIE_C  (0x0800)
 +#define SLBIE_SSIZE_SHIFT25

  /*
   * Hash table
 @@ -77,7 +80,7 @@ extern char initial_stab[];
  #define HPTE_V_AVPN_SHIFT7
  #define HPTE_V_AVPN  ASM_CONST(0x3f80)
  #define HPTE_V_AVPN_VAL(x)   (((x)  HPTE_V_AVPN)  HPTE_V_AVPN_SHIFT)
 -#define HPTE_V_COMPARE(x,y)  (!(((x) ^ (y))  HPTE_V_AVPN))
 +#define HPTE_V_COMPARE(x,y)  (!(((x) ^ (y))  0xff80))
  #define HPTE_V_BOLTEDASM_CONST(0x0010)
  #define HPTE_V_LOCK  ASM_CONST(0x0008)
  #define HPTE_V_LARGE ASM_CONST(0x0004)
 @@ -164,16 +167,25 @@ struct mmu_psize_def
  #define MMU_SEGSIZE_256M 0
  #define MMU_SEGSIZE_1T   1

 +/*
 + * Supported segment sizes
 + */
 +#define MMU_SEGSIZE_256M 0
 +#define MMU_SEGSIZE_1T   1
   
It looks like this is repeating the definitions just above it.


Jon

