Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-13 Thread Christoph Lameter
On Wed, 11 Jun 2014, David Rientjes wrote:

> > > Yes, but the question is: why?
> >
> > zones need to be aligned so that the huge pages order and other page
> > orders allocated from the page allocator are at their "natural alignment".
> > Otherwise huge pages cannot be mapped properly and various I/O devices
> > may encounter issues if they rely on the natural alignment.
> >
>
> Any reason not to align to HUGETLB_PAGE_ORDER on x86 instead of
> ZONE_ALIGN?

If MAX_ORDER equals the hugetlb order, then there is no issue.

However, if there are devices that require larger-order pages (I don't know
whether such devices exist), then there may be an issue. The SGI UV DMA engine,
graphics, or some other device?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-11 Thread David Rientjes
On Wed, 11 Jun 2014, Christoph Lameter wrote:

> > > The zone should not cross the 8M boundary?
> >
> > Yes, but the question is: why?
> 
> zones need to be aligned so that the huge pages order and other page
> orders allocated from the page allocator are at their "natural alignment".
> Otherwise huge pages cannot be mapped properly and various I/O devices
> may encounter issues if they rely on the natural alignment.
> 

Any reason not to align to HUGETLB_PAGE_ORDER on x86 instead of 
ZONE_ALIGN?


Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-11 Thread Christoph Lameter
On Wed, 11 Jun 2014, Luiz Capitulino wrote:

> > The zone should not cross the 8M boundary?
>
> Yes, but the question is: why?

Zones need to be aligned so that huge pages and the other page orders
allocated from the page allocator are at their "natural alignment".
Otherwise huge pages cannot be mapped properly, and various I/O devices
may encounter issues if they rely on the natural alignment.
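
As a rough illustration of what "natural alignment" means here, the
following user-space sketch (not kernel code) checks whether a block of
2^order pages starting at a given page frame number is naturally aligned.
The helper names are hypothetical, and 4 KiB base pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages, as on x86. */
#define PAGE_SHIFT 12

/*
 * A block of 2^order pages is naturally aligned when its first page
 * frame number (pfn) is a multiple of 2^order.
 */
static bool pfn_naturally_aligned(uint64_t pfn, unsigned int order)
{
    assert(order < 64);
    return (pfn & ((UINT64_C(1) << order) - 1)) == 0;
}

/*
 * Largest order, capped at max_order, for which a block starting at
 * the given physical address would be naturally aligned.
 */
static unsigned int largest_aligned_order(uint64_t addr, unsigned int max_order)
{
    uint64_t pfn = addr >> PAGE_SHIFT;
    unsigned int order = 0;

    while (order < max_order && pfn_naturally_aligned(pfn, order + 1))
        order++;
    return order;
}
```

With the node 1 start discussed in this thread (1G+2M, i.e. 0x40200000),
an order-9 (2M) block is naturally aligned but an order-10 (4M) block is
not, so largest_aligned_order() returns 9 for that address.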

> My current thinking, after discussing this with David, is to just page
> align the memory range. This should fix the hyperv-triggered bug in 2.6.32
> and seems to be the right thing for upstream too.

You need to make sure that the page orders can be allocated at their
proper boundaries.



Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-11 Thread Luiz Capitulino

Yinghai, sorry for my late reply.

On Mon, 9 Jun 2014 15:13:41 -0700
Yinghai Lu  wrote:

> On Mon, Jun 9, 2014 at 12:03 PM, Luiz Capitulino  
> wrote:
> > On Sun, 8 Jun 2014 18:29:11 -0700
> > Yinghai Lu  wrote:
> >
> >> On Sun, Jun 8, 2014 at 3:14 PM, Luiz Capitulino  
> >> wrote:
> > [0.00] e820: BIOS-provided physical RAM map:
> > [0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
> > [0.00] BIOS-e820: [mem 0x0009fc00-0x0009] 
> > reserved
> > [0.00] BIOS-e820: [mem 0x000e-0x000f] 
> > reserved
> > [0.00] BIOS-e820: [mem 0x0010-0x3ffe] usable
> > [0.00] BIOS-e820: [mem 0x3fff-0x3fffefff] ACPI 
> > data
> > [0.00] BIOS-e820: [mem 0x3000-0x3fff] ACPI 
> > NVS
> > [0.00] BIOS-e820: [mem 0x4020-0x801f] usable
> ...
> > [0.00] SRAT: PXM 0 -> APIC 0x00 -> Node 0
> > [0.00] SRAT: PXM 0 -> APIC 0x01 -> Node 0
> > [0.00] SRAT: PXM 1 -> APIC 0x02 -> Node 1
> > [0.00] SRAT: PXM 1 -> APIC 0x03 -> Node 1
> > [0.00] SRAT: Node 0 PXM 0 [mem 0x-0x3fff]
> > [0.00] SRAT: Node 1 PXM 1 [mem 0x4020-0x801f]
> > [0.00] Initmem setup node 0 [mem 0x-0x3fff]
> > [0.00]   NODE_DATA [mem 0x3ffec000-0x3ffe]
> > [0.00] Initmem setup node 1 [mem 0x4080-0x801f]
> > [0.00]   NODE_DATA [mem 0x801fb000-0x801fefff]
> 
> so node1 start is aligned to 8M from 2M
> 
> node0: [0, 1G)
> node1: [1G+2M, 2G+2M)
> 
> The zone should not cross the 8M boundary?

Yes, but the question is: why?

> In the case should we trim the memblock for numa to be 8M alignment ?

My current thinking, after discussing this with David, is to just page
align the memory range. This should fix the hyperv-triggered bug in 2.6.32
and seems to be the right thing for upstream too.


Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-10 Thread Luiz Capitulino
On Tue, 10 Jun 2014 15:10:01 -0700 (PDT)
David Rientjes  wrote:

> On Mon, 9 Jun 2014, Luiz Capitulino wrote:
> 
> > > > > > diff --git a/arch/x86/include/asm/numa.h 
> > > > > > b/arch/x86/include/asm/numa.h
> > > > > > index 4064aca..01b493e 100644
> > > > > > --- a/arch/x86/include/asm/numa.h
> > > > > > +++ b/arch/x86/include/asm/numa.h
> > > > > > @@ -9,7 +9,6 @@
> > > > > >  #ifdef CONFIG_NUMA
> > > > > >  
> > > > > > >  #define NR_NODE_MEMBLKS  (MAX_NUMNODES*2)
> > > > > > -#define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
> > > > > >  
> > > > > >  /*
> > > > > >   * Too small node sizes may confuse the VM badly. Usually they
> > > > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > > > > index 1d045f9..69f6362 100644
> > > > > > --- a/arch/x86/mm/numa.c
> > > > > > +++ b/arch/x86/mm/numa.c
> > > > > > @@ -200,8 +200,6 @@ static void __init setup_node_data(int nid, u64 
> > > > > > start, u64 end)
> > > > > > if (end && (end - start) < NODE_MIN_SIZE)
> > > > > > return;
> > > > > >  
> > > > > > -   start = roundup(start, ZONE_ALIGN);
> > > > > > -
> > > > > > printk(KERN_INFO "Initmem setup node %d [mem 
> > > > > > %#010Lx-%#010Lx]\n",
> > > > > >nid, start, end - 1);
> > > > > >  
> > > > > 
> > > > > What ensures this start address is page aligned from the BIOS?
> > > > 
> > > > To which start address do you refer to?
> > > 
> > > The start address displayed in the dmesg is not page aligned anymore with 
> > > your change, correct?  
> > 
> > I have to check that but I don't expect this to happen because my
> > understanding of the code is that what's rounded up here is just discarded
> > in free_area_init_node(). Am I wrong?
> > 
> 
> NODE_DATA(nid)->node_start_pfn needs to be accurate if 
> node_set_online(nid).  Since there is no guarantee about page alignment 
> from the ACPI spec, removing the roundup() entirely could cause the 
> address shift >> PAGE_SHIFT to be off by one.  I, like you, do not see the 
> need for the ZONE_ALIGN above, but I think we agree that it should be 
> replaced with PAGE_SIZE instead.

Agreed. I'm just not completely sure setup_node_data() is the best place
for it. Shouldn't we do it in acpi_numa_memory_affinity_init(), which is
where the ranges are read from the SRAT table?
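
Wherever the alignment ends up, page-aligning a range amounts to rounding
the start up and the end down to page boundaries. A minimal user-space
sketch follows; struct mem_range and page_align_range() are hypothetical
names, not existing kernel interfaces, and 4 KiB pages are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Hypothetical container for a physical range [start, end). */
struct mem_range {
    uint64_t start;
    uint64_t end;   /* exclusive */
};

/*
 * Shrink the range so that it covers whole pages only: round the start
 * up and the end down. Returns false, leaving the range untouched, if
 * no full page remains. Overflow near UINT64_MAX is not handled.
 */
static bool page_align_range(struct mem_range *r)
{
    uint64_t start = (r->start + PAGE_SIZE - 1) & PAGE_MASK;
    uint64_t end = r->end & PAGE_MASK;

    if (start >= end)
        return false;

    r->start = start;
    r->end = end;
    return true;
}
```

A range that is already page aligned passes through unchanged, so doing
this once where the ranges are read would be harmless for well-behaved
firmware tables.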


Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-10 Thread David Rientjes
On Mon, 9 Jun 2014, Luiz Capitulino wrote:

> > > > > diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> > > > > index 4064aca..01b493e 100644
> > > > > --- a/arch/x86/include/asm/numa.h
> > > > > +++ b/arch/x86/include/asm/numa.h
> > > > > @@ -9,7 +9,6 @@
> > > > >  #ifdef CONFIG_NUMA
> > > > >  
> > > > >  #define NR_NODE_MEMBLKS  (MAX_NUMNODES*2)
> > > > > -#define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
> > > > >  
> > > > >  /*
> > > > >   * Too small node sizes may confuse the VM badly. Usually they
> > > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > > > index 1d045f9..69f6362 100644
> > > > > --- a/arch/x86/mm/numa.c
> > > > > +++ b/arch/x86/mm/numa.c
> > > > > @@ -200,8 +200,6 @@ static void __init setup_node_data(int nid, u64 
> > > > > start, u64 end)
> > > > >   if (end && (end - start) < NODE_MIN_SIZE)
> > > > >   return;
> > > > >  
> > > > > - start = roundup(start, ZONE_ALIGN);
> > > > > -
> > > > >   printk(KERN_INFO "Initmem setup node %d [mem 
> > > > > %#010Lx-%#010Lx]\n",
> > > > >  nid, start, end - 1);
> > > > >  
> > > > 
> > > > What ensures this start address is page aligned from the BIOS?
> > > 
> > > To which start address do you refer to?
> > 
> > The start address displayed in the dmesg is not page aligned anymore with 
> > your change, correct?  
> 
> I have to check that but I don't expect this to happen because my
> understanding of the code is that what's rounded up here is just discarded
> in free_area_init_node(). Am I wrong?
> 

NODE_DATA(nid)->node_start_pfn needs to be accurate if 
node_set_online(nid).  Since there is no guarantee about page alignment 
from the ACPI spec, removing the roundup() entirely could cause the 
address shift >> PAGE_SHIFT to be off by one.  I, like you, do not see the 
need for the ZONE_ALIGN above, but I think we agree that it should be 
replaced with PAGE_SIZE instead.
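
To make the off-by-one concrete, here is a small user-space sketch of the
two ways an address can be turned into a page frame number. pfn_down() and
pfn_up() are local stand-ins modelled on the kernel's PFN_DOWN() and
PFN_UP() macros, 4 KiB pages are assumed, and the unaligned address in the
note below is a made-up value, not one taken from any SRAT.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/* pfn of the page that contains addr: a plain shift rounds down. */
static uint64_t pfn_down(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* pfn of the first page that starts at or after addr: rounds up. */
static uint64_t pfn_up(uint64_t addr)
{
    return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
```

For a page-aligned start such as 0x40200000 both helpers give 0x40200.
For a hypothetical unaligned start of 0x40200800 the plain shift still
gives 0x40200, a page that the range only partly covers, while rounding
up gives 0x40201.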


Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-09 Thread Luiz Capitulino
On Mon, 9 Jun 2014 14:57:16 -0700 (PDT)
David Rientjes  wrote:

> On Mon, 9 Jun 2014, Luiz Capitulino wrote:
> 
> > > > diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> > > > index 4064aca..01b493e 100644
> > > > --- a/arch/x86/include/asm/numa.h
> > > > +++ b/arch/x86/include/asm/numa.h
> > > > @@ -9,7 +9,6 @@
> > > >  #ifdef CONFIG_NUMA
> > > >  
> > > > >  #define NR_NODE_MEMBLKS  (MAX_NUMNODES*2)
> > > > -#define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
> > > >  
> > > >  /*
> > > >   * Too small node sizes may confuse the VM badly. Usually they
> > > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > > index 1d045f9..69f6362 100644
> > > > --- a/arch/x86/mm/numa.c
> > > > +++ b/arch/x86/mm/numa.c
> > > > @@ -200,8 +200,6 @@ static void __init setup_node_data(int nid, u64 
> > > > start, u64 end)
> > > > if (end && (end - start) < NODE_MIN_SIZE)
> > > > return;
> > > >  
> > > > -   start = roundup(start, ZONE_ALIGN);
> > > > -
> > > > printk(KERN_INFO "Initmem setup node %d [mem 
> > > > %#010Lx-%#010Lx]\n",
> > > >nid, start, end - 1);
> > > >  
> > > 
> > > What ensures this start address is page aligned from the BIOS?
> > 
> > To which start address do you refer to?
> 
> The start address displayed in the dmesg is not page aligned anymore with 
> your change, correct?  

I have to check that but I don't expect this to happen because my
understanding of the code is that what's rounded up here is just discarded
in free_area_init_node(). Am I wrong?

> acpi_parse_memory_affinity() does no 
> transformations on the table, the base address is coming strictly from the 
> SRAT and there is no page alignment requirement in the ACPI specification.  
> NODE_DATA(nid)->node_start_pfn will be correct because it does the shift 
> for you, but it still seems you want to at least align to PAGE_SIZE here. 

I do agree we need to align to PAGE_SIZE, but I'm not sure where we should
do it.


Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-09 Thread Yinghai Lu
On Mon, Jun 9, 2014 at 12:03 PM, Luiz Capitulino  wrote:
> On Sun, 8 Jun 2014 18:29:11 -0700
> Yinghai Lu  wrote:
>
>> On Sun, Jun 8, 2014 at 3:14 PM, Luiz Capitulino  
>> wrote:
> [0.00] e820: BIOS-provided physical RAM map:
> [0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
> [0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
> [0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
> [0.00] BIOS-e820: [mem 0x0010-0x3ffe] usable
> [0.00] BIOS-e820: [mem 0x3fff-0x3fffefff] ACPI 
> data
> [0.00] BIOS-e820: [mem 0x3000-0x3fff] ACPI NVS
> [0.00] BIOS-e820: [mem 0x4020-0x801f] usable
...
> [0.00] SRAT: PXM 0 -> APIC 0x00 -> Node 0
> [0.00] SRAT: PXM 0 -> APIC 0x01 -> Node 0
> [0.00] SRAT: PXM 1 -> APIC 0x02 -> Node 1
> [0.00] SRAT: PXM 1 -> APIC 0x03 -> Node 1
> [0.00] SRAT: Node 0 PXM 0 [mem 0x-0x3fff]
> [0.00] SRAT: Node 1 PXM 1 [mem 0x4020-0x801f]
> [0.00] Initmem setup node 0 [mem 0x-0x3fff]
> [0.00]   NODE_DATA [mem 0x3ffec000-0x3ffe]
> [0.00] Initmem setup node 1 [mem 0x4080-0x801f]
> [0.00]   NODE_DATA [mem 0x801fb000-0x801fefff]

So the node1 start is rounded up from 1G+2M to the next 8M boundary.

node0: [0, 1G)
node1: [1G+2M, 2G+2M)
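
The arithmetic behind that rounding can be sketched in user space as
follows. It assumes 4 KiB pages and a MAX_ORDER of 11, which makes
ZONE_ALIGN 8M as discussed here; round_up_to() is a local helper that
mimics the kernel's roundup().

```c
#include <stdint.h>

/* Assumptions: 4 KiB pages and MAX_ORDER of 11. */
#define PAGE_SHIFT 12
#define MAX_ORDER  11

/* Same expression as the macro removed by the patch: 1 << 23 = 8 MiB. */
#define ZONE_ALIGN (UINT64_C(1) << (MAX_ORDER + PAGE_SHIFT))

/* Round x up to the next multiple of align (same idea as roundup()). */
static uint64_t round_up_to(uint64_t x, uint64_t align)
{
    return ((x + align - 1) / align) * align;
}
```

With these values, round_up_to(0x40200000, ZONE_ALIGN) yields 0x40800000,
so the first 6M of node1's SRAT range falls below the rounded start.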

The zone should not cross the 8M boundary?

In that case, should we trim the memblock for NUMA to be 8M aligned?

Thanks

Yinghai


Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-09 Thread David Rientjes
On Mon, 9 Jun 2014, Luiz Capitulino wrote:

> > > diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> > > index 4064aca..01b493e 100644
> > > --- a/arch/x86/include/asm/numa.h
> > > +++ b/arch/x86/include/asm/numa.h
> > > @@ -9,7 +9,6 @@
> > >  #ifdef CONFIG_NUMA
> > >  
> > >  #define NR_NODE_MEMBLKS  (MAX_NUMNODES*2)
> > > -#define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
> > >  
> > >  /*
> > >   * Too small node sizes may confuse the VM badly. Usually they
> > > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > > index 1d045f9..69f6362 100644
> > > --- a/arch/x86/mm/numa.c
> > > +++ b/arch/x86/mm/numa.c
> > > @@ -200,8 +200,6 @@ static void __init setup_node_data(int nid, u64 
> > > start, u64 end)
> > >   if (end && (end - start) < NODE_MIN_SIZE)
> > >   return;
> > >  
> > > - start = roundup(start, ZONE_ALIGN);
> > > -
> > >   printk(KERN_INFO "Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
> > >  nid, start, end - 1);
> > >  
> > 
> > What ensures this start address is page aligned from the BIOS?
> 
> To which start address do you refer to?

The start address displayed in the dmesg is not page aligned anymore with 
your change, correct?  acpi_parse_memory_affinity() does no 
transformations on the table, the base address is coming strictly from the 
SRAT and there is no page alignment requirement in the ACPI specification.  
NODE_DATA(nid)->node_start_pfn will be correct because it does the shift 
for you, but it still seems you want to at least align to PAGE_SIZE here. 


Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-09 Thread Luiz Capitulino
On Sun, 8 Jun 2014 18:29:11 -0700
Yinghai Lu  wrote:

> On Sun, Jun 8, 2014 at 3:14 PM, Luiz Capitulino  
> wrote:
> > In short, I believe this is just dead code for the upstream kernel but this
> > causes a bug for 2.6.32 based kernels.
> >
> > The setup_node_data() function is used to initialize NODE_DATA() for a node.
> > It gets a node id and a memory range. The start address for the memory range
> > is rounded up to ZONE_ALIGN and then it's used to initialize
> > NODE_DATA(nid)->node_start_pfn.
> > The 2.6.32 kernel did use the rounded up range start to register a node's
> > memory range with the bootmem interface by calling init_bootmem_node().
> > A few steps later during bootmem initialization, the 2.6.32 kernel calls
> > free_bootmem_with_active_regions() to initialize the bootmem bitmap. This
> > function goes through all memory ranges read from the SRAT table and tries
> > to mark them as usable for bootmem usage. However, before marking a range
> > as usable, mark_bootmem_node() asserts if the memory range start address
> > (as read from the SRAT table) is less than the value registered with
> > init_bootmem_node(). The assertion will trigger whenever the memory range
> > start address is rounded up, as it will always be greater than what is
> > reported in the SRAT table. This is true when the 2.6.32 kernel runs as a
> > HyperV guest on Windows Server 2012. Dropping ZONE_ALIGN solves the
> > problem there.
> 
> What is e820 memmap and srat from HyperV guest?

I think the dmesg below provides this? Let me know otherwise.

> Can you post bootlog first 200 lines?

[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 3.15.0-rc6+ 
(r...@amd-6168-8-1.englab.nay.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 
4.4.7-3) (GCC) ) #113 SMP Thu May 29 16:28:41 CST 2014
[0.00] Command line: ro root=/dev/mapper/vg_dhcp66106105-lv_root 
rd_NO_LUKS  KEYBOARDTYPE=pc KEYTABLE=us LANG=en_US.UTF-8 rd_NO_MD 
rd_LVM_LV=vg_dhcp66106105/lv_swap SYSFONT=latarcyrheb-sun16 crashkernel=auto 
rd_LVM_LV=vg_dhcp66106105/lv_root rd_NO_DM rhgb quiet KEYBOARDTYPE=pc 
KEYTABLE=us rd_NO_DM console=ttyS0,115200
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x3ffe] usable
[0.00] BIOS-e820: [mem 0x3fff-0x3fffefff] ACPI data
[0.00] BIOS-e820: [mem 0x3000-0x3fff] ACPI NVS
[0.00] BIOS-e820: [mem 0x4020-0x801f] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.3 present.
[0.00] DMI: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 
090006  05/23/2012
[0.00] Hypervisor detected: Microsoft HyperV
[0.00] HyperV: features 0xe7f, hints 0x2c
[0.00] HyperV: LAPIC Timer Frequency: 0x30d40
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] No AGP bridge found
[0.00] e820: last_pfn = 0x80200 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-D uncachable
[0.00]   E-F write-back
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 000 mask 3FF write-back
[0.00]   1 disabled
[0.00]   2 disabled
[0.00]   3 disabled
[0.00]   4 disabled
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[0.00] found SMP MP-table at [mem 0x000ff780-0x000ff78f] mapped at 
[880ff780]
[0.00] Scanning 1 areas for low memory corruption
[0.00] Base memory trampoline at [88099000] 99000 size 24576
[0.00] init_memory_mapping: [mem 0x-0x000f]
[0.00]  [mem 0x-0x000f] page 4k
[0.00] BRK [0x020eb000, 0x020ebfff] PGTABLE
[0.00] BRK [0x020ec000, 0x020ecfff] PGTABLE
[0.00] BRK [0x020ed000, 0x020edfff] PGTABLE
[0.00] init_memory_mapping: [mem 0x8000-0x801f]
[0.00]  [mem 0x8000-0x801f] page 2M
[0.00] BRK [0x020ee000, 0x020eefff] PGTABLE
[0.00] init_memory_mapping: [mem 0x7c00-0x7fff]
[0.00]  [mem 0x7c00-0x7fff] page 2M
[0.00] BRK [0x020ef000, 0x020e] PGTABLE
[0.00] init_memory_mapping: [mem 0x0010-0x3ffe]
[0.00]  [mem 0x0010-0x001f] page 4k
[0.

Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-09 Thread Luiz Capitulino
On Sun, 8 Jun 2014 15:25:50 -0700 (PDT)
David Rientjes  wrote:

> On Sun, 8 Jun 2014, Luiz Capitulino wrote:
> 
> > diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> > index 4064aca..01b493e 100644
> > --- a/arch/x86/include/asm/numa.h
> > +++ b/arch/x86/include/asm/numa.h
> > @@ -9,7 +9,6 @@
> >  #ifdef CONFIG_NUMA
> >  
> > >  #define NR_NODE_MEMBLKS  (MAX_NUMNODES*2)
> > -#define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
> >  
> >  /*
> >   * Too small node sizes may confuse the VM badly. Usually they
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 1d045f9..69f6362 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -200,8 +200,6 @@ static void __init setup_node_data(int nid, u64 start, 
> > u64 end)
> > if (end && (end - start) < NODE_MIN_SIZE)
> > return;
> >  
> > -   start = roundup(start, ZONE_ALIGN);
> > -
> > printk(KERN_INFO "Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
> >nid, start, end - 1);
> >  
> 
> What ensures this start address is page aligned from the BIOS?

Which start address are you referring to? The start address passed to
setup_node_data() comes from memblks registered when the SRAT table is parsed.
Those memblks get some transformations between the parsing of the SRAT table
and this point. I haven't checked them in detail to see if they are aligned
at some point. But no alignment is enforced in the code that adds the memblks
read from the SRAT table, which is acpi_numa_memory_affinity_init().


Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-08 Thread Yinghai Lu
On Sun, Jun 8, 2014 at 3:14 PM, Luiz Capitulino  wrote:
> In short, I believe this is just dead code for the upstream kernel but this
> causes a bug for 2.6.32 based kernels.
>
> The setup_node_data() function is used to initialize NODE_DATA() for a node.
> It gets a node id and a memory range. The start address for the memory range
> is rounded up to ZONE_ALIGN and then it's used to initialize
> NODE_DATA(nid)->node_start_pfn.
> The 2.6.32 kernel did use the rounded up range start to register a node's
> memory range with the bootmem interface by calling init_bootmem_node().
> A few steps later during bootmem initialization, the 2.6.32 kernel calls
> free_bootmem_with_active_regions() to initialize the bootmem bitmap. This
> function goes through all memory ranges read from the SRAT table and tries
> to mark them as usable for bootmem usage. However, before marking a range
> as usable, mark_bootmem_node() asserts if the memory range start address
> (as read from the SRAT table) is less than the value registered with
> init_bootmem_node(). The assertion will trigger whenever the memory range
> start address is rounded up, as it will always be greater than what is
> reported in the SRAT table. This is true when the 2.6.32 kernel runs as a
> HyperV guest on Windows Server 2012. Dropping ZONE_ALIGN solves the
> problem there.
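
The failure described above can be modelled in a few lines of user-space
C. This is only a simplified model under stated assumptions: struct
node_model and both functions are hypothetical stand-ins rather than the
2.6.32 bootmem code, and 4 KiB pages with a MAX_ORDER of 11 are assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions: 4 KiB pages and MAX_ORDER of 11, so ZONE_ALIGN is 8 MiB. */
#define PAGE_SHIFT 12
#define MAX_ORDER  11
#define ZONE_ALIGN (UINT64_C(1) << (MAX_ORDER + PAGE_SHIFT))

/* Hypothetical stand-in for the per-node data registered with bootmem. */
struct node_model {
    uint64_t node_min_pfn;
};

static uint64_t round_up_to(uint64_t x, uint64_t align)
{
    return ((x + align - 1) / align) * align;
}

/* Models registering the node with the rounded-up start address. */
static void register_node(struct node_model *node, uint64_t srat_start)
{
    node->node_min_pfn = round_up_to(srat_start, ZONE_ALIGN) >> PAGE_SHIFT;
}

/*
 * Models the check made before a range is marked usable: returns true
 * when the start read from the SRAT lies below the registered minimum,
 * which is the condition the assertion trips on.
 */
static bool assertion_would_trigger(const struct node_model *node,
                                    uint64_t srat_start)
{
    return (srat_start >> PAGE_SHIFT) < node->node_min_pfn;
}
```

In this model a start of 0x40200000 (1G+2M) is registered as pfn 0x40800
and the check fires, while a start that is already 8M aligned, such as
0x40000000, passes.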

What are the e820 memmap and SRAT from the HyperV guest?

Can you post the first 200 lines of the boot log?

Thanks

Yinghai


Re: [PATCH] x86: numa: drop ZONE_ALIGN

2014-06-08 Thread David Rientjes
On Sun, 8 Jun 2014, Luiz Capitulino wrote:

> diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
> index 4064aca..01b493e 100644
> --- a/arch/x86/include/asm/numa.h
> +++ b/arch/x86/include/asm/numa.h
> @@ -9,7 +9,6 @@
>  #ifdef CONFIG_NUMA
>  
>  #define NR_NODE_MEMBLKS  (MAX_NUMNODES*2)
> -#define ZONE_ALIGN (1UL << (MAX_ORDER+PAGE_SHIFT))
>  
>  /*
>   * Too small node sizes may confuse the VM badly. Usually they
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1d045f9..69f6362 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -200,8 +200,6 @@ static void __init setup_node_data(int nid, u64 start, 
> u64 end)
>   if (end && (end - start) < NODE_MIN_SIZE)
>   return;
>  
> - start = roundup(start, ZONE_ALIGN);
> -
>   printk(KERN_INFO "Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
>  nid, start, end - 1);
>  

What ensures this start address is page aligned from the BIOS?