Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-26 Thread Yinghai Lu
On Wed, Aug 26, 2015 at 1:49 PM, Andrew Morton wrote:
> On Tue, 25 Aug 2015 22:42:05 -0700 Yinghai Lu  wrote:
> I don't know what that means.  We have multiple patches under at least
> two different Subject:s.  Please be very careful and very specific when
> identifying patches.  Otherwise mistakes will be made.
>
>
> I presently have three patches:
>
> mm-check-if-section-present-during-memory-block-unregistering.patch
> mm-check-if-section-present-during-memory-block-unregistering-v2.patch
> mm-check-if-section-present-during-memory-block-unregistering-v2-fix.patch
>
> When these are consolidated together, this is the result:

Please drop all three, and apply v3 directly from

https://patchwork.kernel.org/patch/7080111/

We should not touch the unregister path, as unregister_memory_section()
already checks whether the section is present.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-26 Thread Andrew Morton
On Tue, 25 Aug 2015 22:42:05 -0700 Yinghai Lu  wrote:

> On Tue, Aug 25, 2015 at 9:17 PM, Ingo Molnar  wrote:
> > NAK due to lack of cleanliness: the two loops look almost identical - this
> > sure can be factored out...
> 
> Please check complete version at
> 
> https://patchwork.kernel.org/patch/7074341/

That doesn't do what Ingo suggested: "can be factored out...".

Please review this?

--- a/drivers/base/node.c~mm-check-if-section-present-during-memory-block-unregistering-v2-fix
+++ a/drivers/base/node.c
@@ -375,6 +375,22 @@ static int __init_refok get_nid_for_pfn(
 	return pfn_to_nid(pfn);
 }
 
+/*
+ * A memory block can have several absent sections.  A helper function for
+ * skipping over these holes.
+ *
+ * If an absent section is detected, skip_absent_section() will advance *pfn
+ * to the final page in that section and will return true.
+ */
+static bool skip_absent_section(unsigned long *pfn)
+{
+	if (present_section_nr(pfn_to_section_nr(*pfn)))
+		return false;
+
+	*pfn = round_down(*pfn + PAGES_PER_SECTION, PAGES_PER_SECTION) - 1;
+	return true;
+}
+
 /* register memory section under specified node if it spans that node */
 int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
 {
@@ -390,18 +406,10 @@ int register_mem_sect_under_node(struct
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int page_nid, scn_nr;
+		int page_nid;
 
-		/*
-		 * memory block could have several absent sections from start.
-		 * skip pfn range from absent section
-		 */
-		scn_nr = pfn_to_section_nr(pfn);
-		if (!present_section_nr(scn_nr)) {
-			pfn = round_down(pfn + PAGES_PER_SECTION,
-					 PAGES_PER_SECTION) - 1;
+		if (skip_absent_section(&pfn))
 			continue;
-		}
 
 		page_nid = get_nid_for_pfn(pfn);
 		if (page_nid < 0)
@@ -441,18 +449,10 @@ int unregister_mem_sect_under_nodes(stru
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int nid, scn_nr;
+		int nid;
 
-		/*
-		 * memory block could have several absent sections from start.
-		 * skip pfn range from absent section
-		 */
-		scn_nr = pfn_to_section_nr(pfn);
-		if (!present_section_nr(scn_nr)) {
-			pfn = round_down(pfn + PAGES_PER_SECTION,
-					 PAGES_PER_SECTION) - 1;
+		if (skip_absent_section(&pfn))
 			continue;
-		}
 
 		nid = get_nid_for_pfn(pfn);
 		if (nid < 0)
_
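
For readers who want to see the skip arithmetic in isolation, here is a minimal user-space sketch of the helper. It is not kernel code: the presence map, NR_SECTIONS, and count_visited() are hypothetical stand-ins for the sparsemem data, and PAGES_PER_SECTION assumes the 128M section / 4k page figures quoted elsewhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for illustration: 128M sections of 4k pages. */
#define PAGES_PER_SECTION 32768UL
#define NR_SECTIONS 4UL

/* Hypothetical presence map: the first three sections are absent. */
static const bool section_present[NR_SECTIONS] = { false, false, false, true };

static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn / PAGES_PER_SECTION;
}

/* Stand-in for the kernel's present_section_nr(); out-of-range is absent. */
static bool present_section_nr(unsigned long nr)
{
	return nr < NR_SECTIONS && section_present[nr];
}

/*
 * Same contract as the helper in the patch: on an absent section, move
 * *pfn to the last page of that section and return true.
 */
static bool skip_absent_section(unsigned long *pfn)
{
	if (present_section_nr(pfn_to_section_nr(*pfn)))
		return false;

	/* open-coded round_down(*pfn + PAGES_PER_SECTION, PAGES_PER_SECTION) - 1 */
	*pfn = (*pfn + PAGES_PER_SECTION) / PAGES_PER_SECTION * PAGES_PER_SECTION - 1;
	return true;
}

/* Walk [start_pfn, end_pfn] as the two loops do; count pages and skips. */
static unsigned long count_visited(unsigned long start_pfn,
				   unsigned long end_pfn,
				   unsigned long *skips)
{
	unsigned long pfn, visited = 0;

	assert(start_pfn <= end_pfn);
	*skips = 0;
	for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
		if (skip_absent_section(&pfn)) {
			(*skips)++;
			continue;	/* the loop's pfn++ moves to the next section */
		}
		visited++;
	}
	return visited;
}
```

With the hypothetical map above, walking pfns 0 through 4 * PAGES_PER_SECTION - 1 takes three skips and visits only the 32768 pages of the one present section. Passing the address of the loop variable is what lets the helper advance the caller's pfn.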


> Andrew,
> Ingo NAKed raw version of this patch, so you may need to remove it
> from -mm tree.

I don't know what that means.  We have multiple patches under at least
two different Subject:s.  Please be very careful and very specific when
identifying patches.  Otherwise mistakes will be made.


I presently have three patches:

mm-check-if-section-present-during-memory-block-unregistering.patch
mm-check-if-section-present-during-memory-block-unregistering-v2.patch
mm-check-if-section-present-during-memory-block-unregistering-v2-fix.patch

When these are consolidated together, this is the result:


From: Yinghai Lu 
Subject: mm: check if section present during memory block (un)registering

Tony Luck found that on his setup a memory block size of 512M causes a
crash during boot.

 BUG: unable to handle kernel paging request at ea007420
 IP: [] get_nid_for_pfn+0x17/0x40
 PGD 128ffcb067 PUD 128ffc9067 PMD 0
 Oops:  [#1] SMP
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-rc8 #1
...
 Call Trace:
  [] ? register_mem_sect_under_node+0x66/0xe0
  [] register_one_node+0x17b/0x240
  [] ? pci_iommu_alloc+0x6e/0x6e
  [] topology_init+0x3c/0x95
  [] do_one_initcall+0xcd/0x1f0

The system has non-contiguous RAM address ranges:
 BIOS-e820: [mem 0x0013-0x001c] usable
 BIOS-e820: [mem 0x001d7000-0x001ec7ffefff] usable
 BIOS-e820: [mem 0x001f-0x002b] usable
 BIOS-e820: [mem 0x002c1800-0x002d6fffefff] usable
 BIOS-e820: [mem 0x002e-0x0039] usable

So some memory blocks begin with sections that are not present.
For example:
memory block : [0x2c1800, 0x2c2000) 512M
first three sections are not present.

The current register_mem_sect_under_node() assumes the first section is
present, but the memory block's section number range
[start_section_nr, end_section_nr] can include sections that are not present.

For arch that support vmemmap, we 

Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-25 Thread Yinghai Lu
On Tue, Aug 25, 2015 at 9:17 PM, Ingo Molnar  wrote:
> NAK due to lack of cleanliness: the two loops look almost identical - this
> sure can be factored out...

Please check complete version at

https://patchwork.kernel.org/patch/7074341/

Andrew,
Ingo NAKed raw version of this patch, so you may need to remove it
from -mm tree.

Thanks

Yinghai


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-25 Thread Ingo Molnar

* Yinghai Lu  wrote:

> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -390,8 +390,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
>  	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
>  	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> -		int page_nid;
> +		int page_nid, scn_nr;
>  
> +		scn_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(scn_nr)) {
> +			pfn = round_down(pfn + PAGES_PER_SECTION,
> +					 PAGES_PER_SECTION) - 1;
> +			continue;
> +		}
>  		page_nid = get_nid_for_pfn(pfn);
>  		if (page_nid < 0)
>  			continue;
> @@ -426,10 +432,18 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>  		return -ENOMEM;
>  	nodes_clear(*unlinked_nodes);
>  
> -	sect_start_pfn = section_nr_to_pfn(phys_index);
> -	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
> +	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
> +	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
> +	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> -		int nid;
> +		int nid, scn_nr;
> +
> +		scn_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(scn_nr)) {
> +			pfn = round_down(pfn + PAGES_PER_SECTION,
> +					 PAGES_PER_SECTION) - 1;
> +			continue;
> +		}

NAK due to lack of cleanliness: the two loops look almost identical - this sure 
can be factored out...

Thanks,

Ingo


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-25 Thread Tony Luck
On Tue, Aug 25, 2015 at 12:01 PM, Yinghai Lu  wrote:
>> It does ... but this (attached) is simpler.  Your patch and mine both
>> allow the system to boot ...
>
> The version that fixes this with the present-section check may save a couple
> of thousand calls to get_nid_for_pfn(). section size / page_size = 128M/4k = 32k

Actually saves about 1.2 million calls. Your patch wins :-)

Reported-and-tested-by: Tony Luck 

-Tony
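
The two estimates above can be checked with a little arithmetic. The sketch below assumes the 128M section size and 4k page size quoted in the thread; pages_per_section() and calls_saved() are hypothetical helper names, not kernel functions. The 1.2 million figure is Tony's measurement; the number of absent sections it would correspond to is only an inference from that figure.

```c
#include <assert.h>

/* Sizes quoted in the thread: 128M sections, 4k pages. */
#define SECTION_BYTES (128UL << 20)
#define PAGE_BYTES    (4UL << 10)

/* Pages, and therefore get_nid_for_pfn() calls, per section. */
static unsigned long pages_per_section(void)
{
	return SECTION_BYTES / PAGE_BYTES;
}

/*
 * Each absent section skipped by the present-section check avoids one
 * get_nid_for_pfn() call per page in that section.
 */
static unsigned long calls_saved(unsigned long absent_sections)
{
	assert(absent_sections <= 1000000UL);	/* keep the product small */
	return absent_sections * pages_per_section();
}
```

One section is 32768 pages, so skipping roughly 37 absent sections would account for about 1.2 million avoided calls (37 * 32768 = 1212416).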


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-25 Thread Yinghai Lu
On Tue, Aug 25, 2015 at 10:03 AM, Tony Luck  wrote:
> On Mon, Aug 24, 2015 at 4:59 PM, Yinghai Lu  wrote:
>> attached should fix the problem:
>
> It does ... but this (attached) is simpler.  Your patch and mine both
> allow the system to boot ...

The version that fixes this with the present-section check may save a couple
of thousand calls to get_nid_for_pfn(). section size / page_size = 128M/4k = 32k

> but it is not happy. See all the chatter from systemd in the attached dmesg.

Is that because you have "debug ignore_loglevel"?

>
> x86 doesn't allow me to set CONFIG_HOLES_IN_ZONE ... but now I'm
> worried about all the other places use pfn_valid_within()
>
> Still trying to get an answer from the BIOS folks on whether these
> holes are normal when setting up mirrored areas of memory.

The problem only happens when the memory block size is 512M and the section
size is 128M.
When both are 128M, the system works. So the current kernel should only have
a problem when a hole is larger than 128M, which leaves some sections not
present.

Thanks

Yinghai
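
A rough sketch of why the block/section size mismatch matters, assuming 128M sections. The helper names and the hole layout are hypothetical; the point is only that a 512M block spans four sections, so a large enough hole can leave some of them absent while others in the same block are present.

```c
#include <assert.h>
#include <stdbool.h>

#define SECTION_BYTES (128ULL << 20)	/* section size from the thread */

/* Number of sections covered by one memory block. */
static unsigned long long sections_per_block(unsigned long long block_bytes)
{
	return block_bytes / SECTION_BYTES;
}

/* True if the section starting at 'start' lies entirely inside the hole. */
static bool section_in_hole(unsigned long long start,
			    unsigned long long hole_start,
			    unsigned long long hole_end)
{
	return start >= hole_start && start + SECTION_BYTES <= hole_end;
}

/* Count the sections of one block that are fully covered by the hole. */
static unsigned long long absent_sections_in_block(unsigned long long block_start,
						   unsigned long long block_bytes,
						   unsigned long long hole_start,
						   unsigned long long hole_end)
{
	unsigned long long i, absent = 0;

	assert(block_bytes % SECTION_BYTES == 0);
	for (i = 0; i < sections_per_block(block_bytes); i++)
		if (section_in_hole(block_start + i * SECTION_BYTES,
				    hole_start, hole_end))
			absent++;
	return absent;
}
```

With a hypothetical hole covering the first 384M of a 512M block, three of the four sections are absent and one is present, which is the mixed case the loop has to handle. With 128M blocks, each block is a single section and is either wholly present or wholly absent.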


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Yinghai Lu
On Mon, Aug 24, 2015 at 4:41 PM, Yinghai Lu  wrote:
> On Mon, Aug 24, 2015 at 3:39 PM, Tony Luck  wrote:
>> On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu  wrote:
>>
>>> Can you boot with "debug ignore_loglevel" so we can see following print out
>>> for vmemmap?
>>
>> See attached. There are a few extra messages from my own debug printk()
>> calls. It seems that we successfully deal with node 0 from topology_init()
>> but die walking node 1. I see that the NODE_DATA limits for memory
>> on node 1 were from 1d7 to 3a0. But when we get into
>> register_mem_sect_under_node() we have rounded the start pfn down to
>> 1d0 ... and we panic processing that range (which is in a hole in e820).
>>
>> We seem to die here:
>>
>> for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>> int page_nid;
>>
>> page_nid = get_nid_for_pfn(pfn);
>
> oh, no.
> register_mem_sect_under_node() is assuming:
> first section in the block is present and first page in that section is present.

The attached patch should fix the problem:
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 31df474d..cc910ad 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -390,8 +390,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int page_nid;
+		int page_nid, scn_nr;
 
+		scn_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(scn_nr)) {
+			pfn = round_down(pfn + PAGES_PER_SECTION,
+	 PAGES_PER_SECTION) - 1;
+			continue;
+		}
 		page_nid = get_nid_for_pfn(pfn);
 		if (page_nid < 0)
 			continue;
@@ -426,10 +432,18 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 		return -ENOMEM;
 	nodes_clear(*unlinked_nodes);
 
-	sect_start_pfn = section_nr_to_pfn(phys_index);
-	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
+	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int nid;
+		int nid, scn_nr;
+
+		scn_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(scn_nr)) {
+			pfn = round_down(pfn + PAGES_PER_SECTION,
+	 PAGES_PER_SECTION) - 1;
+			continue;
+		}
 
 		nid = get_nid_for_pfn(pfn);
 		if (nid < 0)


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Yinghai Lu
On Mon, Aug 24, 2015 at 3:39 PM, Tony Luck  wrote:
> On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu  wrote:
>
>> Can you boot with "debug ignore_loglevel" so we can see following print out
>> for vmemmap?
>
> See attached. There are a few extra messages from my own debug printk()
> calls. It seems that we successfully deal with node 0 from topology_init()
> but die walking node 1. I see that the NODE_DATA limits for memory
> on node 1 were from 1d7 to 3a0. But when we get into
> register_mem_sect_under_node() we have rounded the start pfn down to
> 1d0 ... and we panic processing that range (which is in a hole in e820).
>
> We seem to die here:
>
> for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> int page_nid;
>
> page_nid = get_nid_for_pfn(pfn);

oh, no.
register_mem_sect_under_node() is assuming:
first section in the block is present and first page in that section is present.


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Tony Luck
On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu  wrote:

> Can you boot with "debug ignore_loglevel" so we can see following print out
> for vmemmap?

See attached. There are a few extra messages from my own debug printk()
calls. It seems that we successfully deal with node 0 from topology_init()
but die walking node 1. I see that the NODE_DATA limits for memory
on node 1 were from 1d7 to 3a0. But when we get into
register_mem_sect_under_node() we have rounded the start pfn down to
1d0 ... and we panic processing that range (which is in a hole in e820).

We seem to die here:

for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
int page_nid;

page_nid = get_nid_for_pfn(pfn);

-Tony


dmesg2
Description: Binary data


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Yinghai Lu
On Mon, Aug 24, 2015 at 1:41 PM, Tony Luck  wrote:
> On Mon, Aug 24, 2015 at 10:46 AM, Yinghai Lu  wrote:
>> Then, what does the E820 look like?
>
> See attached serial console log of the latest crash

Can you boot with "debug ignore_loglevel" so we can see following print out
for vmemmap?

[0.352486]  [ea00-ea0001ff] PMD -> [88007de0-88007fdf] on node 0
[0.358758]  [ea000400-ea0005ff] PMD -> [88017d60-88017f5f] on node 1


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Tony Luck
On Mon, Aug 24, 2015 at 10:46 AM, Yinghai Lu  wrote:
> Then, what does the E820 look like?

See attached serial console log of the latest crash

-Tony


dmesg
Description: Binary data


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Yinghai Lu
On Fri, Aug 21, 2015 at 4:54 PM, Tony Luck  wrote:
> On Fri, Aug 21, 2015 at 1:50 PM, Yinghai Lu  wrote:
>
> Still stuff going on that I don't understand here. I increased the amount of
> mirrored memory in this machine which moved max_pfn to 0x756
> and probe_memory_block_size() picked 512MB as the memory_block_size,
> which seemed plausible.
>
> But my kernel still crashed during boot with this value. :-(
> Forcing the block size to 128M made the system boot.
>
> Maybe all the holes in the e820 map matter too (specifically the
> alignment of the holes)?

Then, what does the E820 look like?

Yinghai


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Yinghai Lu
On Fri, Aug 21, 2015 at 4:54 PM, Tony Luck tony.l...@gmail.com wrote:
 On Fri, Aug 21, 2015 at 1:50 PM, Yinghai Lu ying...@kernel.org wrote:

 Still stuff going on that I don't understand here. I increased the amount of
 mirrored memory in this machine which moved max_pfn to 0x756
 and probe_memory_block_size() picked 512MB as the memory_block_size,
 which seemed plausible.

 But my kernel still crashed during boot with this value. :-(
 Forcing the block size to 128M made the system boot.

 Maybe all the holes in the e820 map matter too (specifically the
 alignment of the holes)?

Then, what does the E820 look like?

Yinghai
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Yinghai Lu
On Mon, Aug 24, 2015 at 1:41 PM, Tony Luck tony.l...@gmail.com wrote:
 On Mon, Aug 24, 2015 at 10:46 AM, Yinghai Lu ying...@kernel.org wrote:
 Then, what does the E820 look like?

 See attached serial console log of the latest crash

Can you boot with debug ignore_loglevel so we can see following print out
for vmemmap?

[0.352486]  [ea00-ea0001ff] PMD -
[88007de0-88007fdf] on node 0
[0.358758]  [ea000400-ea0005ff] PMD -
[88017d60-88017f5f] on node 1
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Tony Luck
On Mon, Aug 24, 2015 at 10:46 AM, Yinghai Lu ying...@kernel.org wrote:
 Then, what does the E820 look like?

See attached serial console log of the latest crash

-Tony


dmesg
Description: Binary data


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Yinghai Lu
On Mon, Aug 24, 2015 at 3:39 PM, Tony Luck tony.l...@gmail.com wrote:
 On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu ying...@kernel.org wrote:

 Can you boot with debug ignore_loglevel so we can see following print out
 for vmemmap?

 See attached. There are a few extra messages from my own debug printk()
 calls. It seems that we successfully deal with node 0 from topology_init()
 but die walking node 1. I see that the NODE_DATA limits for memory
 on node 1 were from 1d7 to 3a0. But when we get into
 register_mem_sect_under_node() we have rounded the start pfn down to
 1d0 ... and we panic processing that range (which is in a hole in e820).

 We seem to die here:

 for (pfn = sect_start_pfn; pfn = sect_end_pfn; pfn++) {
 int page_nid;

 page_nid = get_nid_for_pfn(pfn);

oh, no.
register_mem_sect_under_node() is assuming:
first section in the block is present and first page in that section is present.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Tony Luck
On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu ying...@kernel.org wrote:

 Can you boot with debug ignore_loglevel so we can see following print out
 for vmemmap?

See attached. There are a few extra messages from my own debug printk()
calls. It seems that we successfully deal with node 0 from topology_init()
but die walking node 1. I see that the NODE_DATA limits for memory
on node 1 were from 1d7 to 3a0. But when we get into
register_mem_sect_under_node() we have rounded the start pfn down to
1d0 ... and we panic processing that range (which is in a hole in e820).

We seem to die here:

for (pfn = sect_start_pfn; pfn = sect_end_pfn; pfn++) {
int page_nid;

page_nid = get_nid_for_pfn(pfn);

-Tony


dmesg2
Description: Binary data


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-24 Thread Yinghai Lu
On Mon, Aug 24, 2015 at 4:41 PM, Yinghai Lu <ying...@kernel.org> wrote:
> On Mon, Aug 24, 2015 at 3:39 PM, Tony Luck <tony.l...@gmail.com> wrote:
>> On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu <ying...@kernel.org> wrote:
>>>
>>> Can you boot with "debug ignore_loglevel" so we can see following print out
>>> for vmemmap?
>>
>> See attached. There are a few extra messages from my own debug printk()
>> calls. It seems that we successfully deal with node 0 from topology_init()
>> but die walking node 1. I see that the NODE_DATA limits for memory
>> on node 1 were from 1d7 to 3a0. But when we get into
>> register_mem_sect_under_node() we have rounded the start pfn down to
>> 1d0 ... and we panic processing that range (which is in a hole in e820).
>>
>> We seem to die here:
>>
>>	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>>		int page_nid;
>>
>>		page_nid = get_nid_for_pfn(pfn);
>
> oh, no.
> register_mem_sect_under_node() is assuming:
> first section in the block is present and first page in that section is
> present.

attached should fix the problem:
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 31df474d..cc910ad 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -390,8 +390,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int page_nid;
+		int page_nid, scn_nr;
 
+		scn_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(scn_nr)) {
+			pfn = round_down(pfn + PAGES_PER_SECTION,
+	 PAGES_PER_SECTION) - 1;
+			continue;
+		}
 		page_nid = get_nid_for_pfn(pfn);
 		if (page_nid < 0)
 			continue;
@@ -426,10 +432,18 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 		return -ENOMEM;
 	nodes_clear(*unlinked_nodes);
 
-	sect_start_pfn = section_nr_to_pfn(phys_index);
-	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
+	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int nid;
+		int nid, scn_nr;
+
+		scn_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(scn_nr)) {
+			pfn = round_down(pfn + PAGES_PER_SECTION,
+	 PAGES_PER_SECTION) - 1;
+			continue;
+		}
 
 		nid = get_nid_for_pfn(pfn);
 		if (nid < 0)
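
The skip logic in the patch above can be tried out in a small userspace model. This is a sketch under stated assumptions, not the kernel code: `section_present` and `count_visited_pfns` are hypothetical stand-ins, the presence table is a plain array, and PAGES_PER_SECTION is set to a tiny value so the walk is easy to follow by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_SECTION 8UL	/* toy value for illustration only */

/* Hypothetical stand-in for the kernel's section-presence lookup. */
static bool section_present(const bool *present, size_t nr_sections,
			    unsigned long section_nr)
{
	return section_nr < nr_sections && present[section_nr];
}

/*
 * Walk pfns in [start_pfn, end_pfn] and count those that lie in present
 * sections. On a pfn in an absent section, jump to the last pfn of that
 * section so the loop increment lands on the first pfn of the next one.
 */
static unsigned long count_visited_pfns(const bool *present, size_t nr_sections,
					unsigned long start_pfn,
					unsigned long end_pfn)
{
	unsigned long pfn, visited = 0;

	/* the rounding below assumes a power-of-two section size */
	assert(!(PAGES_PER_SECTION & (PAGES_PER_SECTION - 1)));

	for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
		unsigned long section_nr = pfn / PAGES_PER_SECTION;

		if (!section_present(present, nr_sections, section_nr)) {
			/* same arithmetic as round_down(pfn + PAGES_PER_SECTION,
			 * PAGES_PER_SECTION) - 1 in the patch */
			pfn = ((pfn + PAGES_PER_SECTION) &
			       ~(PAGES_PER_SECTION - 1)) - 1;
			continue;
		}
		visited++;	/* the kernel would look up the node id here */
	}
	return visited;
}
```

With four sections of which only the second and fourth are present, a walk over pfns 0 to 31 visits 16 pages and never touches a pfn in an absent section, which is the property the patch relies on to avoid the fault.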


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-21 Thread Tony Luck
On Fri, Aug 21, 2015 at 1:50 PM, Yinghai Lu  wrote:
>> It seems that many systems with large amounts of memory
>> will have a nicely aligned max_pfn ... so they will get
>> the 2GB block size.  If they don't have a well aligned
>> max_pfn, then they need to use a smaller size to avoid
>> the crash I saw.
>
> Good to me.

Still stuff going on that I don't understand here. I increased the amount of
mirrored memory in this machine which moved max_pfn to 0x756
and probe_memory_block_size() picked 512MB as the memory_block_size,
which seemed plausible.

But my kernel still crashed during boot with this value. :-(
Forcing the block size to 128M made the system boot.

Maybe all the holes in the e820 map matter too (specifically the
alignment of the holes)?

-Tony


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-21 Thread Yinghai Lu
On Fri, Aug 21, 2015 at 1:27 PM, Luck, Tony  wrote:
> On Fri, Aug 21, 2015 at 11:38:13AM -0700, Yinghai Lu wrote:
>> That commit could be reverted.
>> According to
>> https://lkml.org/lkml/2014/11/10/123
>
> Do we really need to force the MIN_MEMORY_BLOCK_SIZE on small
> systems?

That was introduced in commit 982792c7 ("x86, mm: probe memory block
size for generic x86 64bit").
That patch makes boot faster by creating fewer entries
in /sys/devices/system/memory/.
On systems with less than 64G of RAM, there will not be too many entries
even with MIN_MEMORY_BLOCK_SIZE.

>
> What about this patch - which just uses max_pfn to choose
> the block size.
>
> It seems that many systems with large amounts of memory
> will have a nicely aligned max_pfn ... so they will get
> the 2GB block size.  If they don't have a well aligned
> max_pfn, then they need to use a smaller size to avoid
> the crash I saw.

Good to me.

> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3fba623e3ba5..e14e90fd1cf8 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1195,15 +1195,6 @@ static unsigned long probe_memory_block_size(void)
>  	/* start from 2g */
>  	unsigned long bz = 1UL<<31;
>
> -	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
> -		pr_info("Using 2GB memory block size for large-memory system\n");
> -		return 2UL * 1024 * 1024 * 1024;
> -	}
> -
> -	/* less than 64g installed */
> -	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
> -		return MIN_MEMORY_BLOCK_SIZE;
> -
>  	/* get the tail size */
>  	while (bz > MIN_MEMORY_BLOCK_SIZE) {
>  		if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-21 Thread Luck, Tony
On Fri, Aug 21, 2015 at 11:38:13AM -0700, Yinghai Lu wrote:
> That commit could be reverted.
> According to
> https://lkml.org/lkml/2014/11/10/123

Do we really need to force the MIN_MEMORY_BLOCK_SIZE on small
systems?

What about this patch - which just uses max_pfn to choose
the block size.

It seems that many systems with large amounts of memory
will have a nicely aligned max_pfn ... so they will get
the 2GB block size.  If they don't have a well aligned
max_pfn, then they need to use a smaller size to avoid
the crash I saw.
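
A minimal userspace sketch of the probing idea described above, assuming 4 KiB pages and a 128 MiB minimum block size. `probe_block_size_sketch` is a hypothetical name, and the loop body after the alignment test (break, then halve) is an assumption about how the remaining probe proceeds, since the patch context only shows the start of the loop.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12				/* assumption: 4 KiB pages */
#define MIN_MEMORY_BLOCK_SIZE (1ULL << 27)	/* assumption: 128 MiB */

/*
 * Start from 2 GiB and halve the candidate block size until the end of
 * memory (max_pfn converted to bytes) is a multiple of it, or until the
 * minimum block size is reached.
 */
static uint64_t probe_block_size_sketch(uint64_t max_pfn)
{
	uint64_t bz = 1ULL << 31;	/* start from 2g */
	uint64_t end;

	assert(max_pfn <= (UINT64_MAX >> PAGE_SHIFT));	/* shift must not overflow */
	end = max_pfn << PAGE_SHIFT;

	while (bz > MIN_MEMORY_BLOCK_SIZE) {
		if (!(end & (bz - 1)))	/* end of memory is bz-aligned */
			break;
		bz >>= 1;
	}
	return bz;
}
```

A max_pfn that is 2 GiB aligned keeps the 2 GiB block size, one that is only 256 MiB aligned gets 256 MiB, and anything less aligned falls back to the minimum. Note that this only looks at the end of memory, not at holes in the middle of the map.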

-Tony


diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3fba623e3ba5..e14e90fd1cf8 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1195,15 +1195,6 @@ static unsigned long probe_memory_block_size(void)
 	/* start from 2g */
 	unsigned long bz = 1UL<<31;
 
-	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
-		pr_info("Using 2GB memory block size for large-memory system\n");
-		return 2UL * 1024 * 1024 * 1024;
-	}
-
-	/* less than 64g installed */
-	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
-		return MIN_MEMORY_BLOCK_SIZE;
-
 	/* get the tail size */
 	while (bz > MIN_MEMORY_BLOCK_SIZE) {
 		if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-21 Thread Yinghai Lu
On Fri, Aug 21, 2015 at 11:19 AM, Luck, Tony  wrote:
> On Tue, Nov 04, 2014 at 04:29:44PM +0800, Daniel J Blueman wrote:
>> On large-memory x86-64 systems of 64GB or more with memory hot-plug
>> enabled, use a 2GB memory block size. Eg with 64GB memory, this reduces
>> the number of directories in /sys/devices/system/memory from 512 to 32,
>> making it more manageable, and reducing the creation time accordingly.
>>
>> This caveat is that the memory can't be offlined (for hotplug or otherwise)
>> with finer 128MB granularity, but this is unimportant due to the high
>> memory densities generally used with such large-memory systems, where
>> eg a single DIMM is the order of 16GB.
>
> git bisect points to this commit as the cause of a panic on my
> machine:
>
> [4.518415] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
> [4.525882] PCI: MMCONFIG for domain  [bus 00-ff] at [mem 0x8000-0x8fff] (base 0x8000)
> [4.536280] PCI: MMCONFIG at [mem 0x8000-0x8fff] reserved in E820
> [4.544344] PCI: Using configuration type 1 for base access
> [4.550778] BUG: unable to handle kernel paging request at ea007820
> [4.558572] IP: [] register_mem_sect_under_
...
> so the older code will look at max_pfn and set memory block size:
>
> [3.021752] memory block size : 256MB
>
> I think the problem is more connected to the strange max_pfn rather
> than the holes ... but will defer to wiser heads.
>
> If the problem is with max_pfn ... I don't think it is a safe assumption
> that systems with >64GB memory will have 2GB aligned max_pfn.

That commit could be reverted.
According to
https://lkml.org/lkml/2014/11/10/123

I had attached patch for my test setups for a while.

Yinghai
Subject: [PATCH] x86, mm: put memory block size probing back

commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems")
lets systems with more than 64GiB of RAM use 2G as the memory block
size without probing.

We found one system with a memory map like:
[0x-0x6000)
[0x1-0x20a000)

We should use 0x2000 in this case, so we can no longer assume that
systems with big memory have a 2G-aligned tail.

So revert it to put probing back.

Fixes: bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems")
Signed-off-by: Yinghai Lu <ying...@kernel.org>

---
 arch/x86/mm/init_64.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -52,6 +52,7 @@
 #include <asm/numa.h>
 #include <asm/cacheflush.h>
 #include <asm/init.h>
+#include <asm/uv/uv.h>
 #include <asm/setup.h>
 
 #include "mm_internal.h"
@@ -1204,10 +1205,12 @@ static unsigned long probe_memory_block_
 	/* start from 2g */
 	unsigned long bz = 1UL<<31;
 
-	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
-		pr_info("Using 2GB memory block size for large-memory system\n");
+#ifdef CONFIG_X86_UV
+	if (is_uv_system()) {
+		printk(KERN_INFO "UV: memory block size 2GB\n");
 		return 2UL * 1024 * 1024 * 1024;
 	}
+#endif
 
 	/* less than 64g installed */
 	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))


Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2015-08-21 Thread Luck, Tony
On Tue, Nov 04, 2014 at 04:29:44PM +0800, Daniel J Blueman wrote:
> On large-memory x86-64 systems of 64GB or more with memory hot-plug
> enabled, use a 2GB memory block size. Eg with 64GB memory, this reduces
> the number of directories in /sys/devices/system/memory from 512 to 32,
> making it more manageable, and reducing the creation time accordingly.
> 
> This caveat is that the memory can't be offlined (for hotplug or otherwise)
> with finer 128MB granularity, but this is unimportant due to the high
> memory densities generally used with such large-memory systems, where
> eg a single DIMM is the order of 16GB. 

git bisect points to this commit as the cause of a panic on my
machine:

[4.518415] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[4.525882] PCI: MMCONFIG for domain  [bus 00-ff] at [mem 0x8000-0x8fff] (base 0x8000)
[4.536280] PCI: MMCONFIG at [mem 0x8000-0x8fff] reserved in E820
[4.544344] PCI: Using configuration type 1 for base access
[4.550778] BUG: unable to handle kernel paging request at ea007820
[4.558572] IP: [<ffffffff8142ab0d>] register_mem_sect_under_node+0x6d/0xe0
[4.566366] PGD 1dfffcc067 PUD 1dfffca067 PMD 0
[4.571554] Oops:  [#1] SMP
[4.575181] Modules linked in:
[4.578604] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.18.0-rc2+ #17
[4.585800] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRBDXSD1.86B.0326.D03.1508171454 08/17/2015
[4.597347] task: 883b8496 ti: 881d7ea14000 task.ti: 881d7ea14000
[4.605705] RIP: 0010:[<ffffffff8142ab0d>]  [<ffffffff8142ab0d>] register_mem_sect_under_node+0x6d/0xe0
[4.616205] RSP: :881d7ea17d68  EFLAGS: 00010206
[4.622135] RAX: ea007820 RBX: 0001 RCX: 01e0
[4.630102] RDX: 7800 RSI: 0001 RDI: 881d7ccb6400
[4.638069] RBP: 881d7ea17d78 R08: 01e7 R09: 03c0
[4.646035] R10: 813043a0 R11: ea0169efa600 R12: 0001
[4.654003] R13: 0001 R14: 881d7ccb6400 R15: 
[4.661972] FS:  () GS:881d8b40() 
knlGS:
[4.670996] CS:  0010 DS:  ES:  CR0: 80050033
[4.677411] CR2: ea007820 CR3: 019a CR4: 003407f0
[4.685381] Stack:
[4.687627]  01e7 0001 881d7ea17dc8 
8142af0a
[4.695926]  881d7ea17de8 03c0 881d0018 
0002
[4.704225]  0400  81b101c5 

[4.712524] Call Trace:
[4.715261]  [<ffffffff8142af0a>] register_one_node+0x18a/0x2b0
[4.721871]  [<ffffffff81b101c5>] ? pci_iommu_alloc+0x6e/0x6e
[4.728287]  [<ffffffff81b10201>] topology_init+0x3c/0x95
[4.734321]  [<ffffffff81002144>] do_one_initcall+0xd4/0x210
[4.740645]  [<ffffffff8109b515>] ? parse_args+0x245/0x480
[4.746774]  [<ffffffff810bddc8>] ? __wake_up+0x48/0x60
[4.752611]  [<ffffffff81b062f9>] kernel_init_freeable+0x19d/0x23c
[4.759511]  [<ffffffff81b059e3>] ? initcall_blacklist+0xb6/0xb6
[4.766226]  [<ffffffff816580d0>] ? rest_init+0x80/0x80
[4.772059]  [<ffffffff816580de>] kernel_init+0xe/0xf0
[4.777803]  [<ffffffff8167057c>] ret_from_fork+0x7c/0xb0
[4.783831]  [<ffffffff816580d0>] ? rest_init+0x80/0x80
[4.789655] Code: 39 c1 77 59 48 c1 e2 15 48 b8 00 00 00 00 00 ea ff ff 48 
8d 44 02 20 eb 12 0f 1f 44 00 00 48 83 c1 01 48 83 c0 40 49 39 c8 72 5b <48> 83 
38 00 74 ed 48 8b 50 e0 48 c1 ea 36 39 d6 75 e1 48 8b 04
[4.811356] RIP  [<ffffffff8142ab0d>] register_mem_sect_under_node+0x6d/0xe0
[4.819238]  RSP <ffff881d7ea17d68>
[4.823132] CR2: ea007820
[4.826836] ---[ end trace 10b7bb944b11529f ]---
[4.831989] Kernel panic - not syncing: Fatal exception
[4.837866] ---[ end Kernel panic - not syncing: Fatal exception

reverting the commit indeed makes the problem go away.

Now the root problem for me is that I have an insane BIOS
that handed me an e820 table that is full of holes (for entries
above 4GB) ... and ends with an entry that is only 256M aligned:


[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0008dfff] usable
[0.00] BIOS-e820: [mem 0x0008e000-0x0008] reserved
[0.00] BIOS-e820: [mem 0x0009-0x0009] usable
[0.00] BIOS-e820: [mem 0x000a-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x5cc0afff] usable
[0.00] BIOS-e820: [mem 0x5cc0b000-0x5e108fff] reserved
[0.00] BIOS-e820: [mem 0x5e109000-0x6035cfff] ACPI NVS
[0.00] BIOS-e820: [mem 0x6035d000-0x604fcfff] ACPI data
[0.00] BIOS-e820: [mem 0x604fd000-0x7baf] usable
[0.00] BIOS-e820: [mem 0x7bb0-0x8fff] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00118fffefff] usable
[0.00] BIOS-e820: [mem 0x0012-0x001d] 


[PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems

2014-11-04 Thread Daniel J Blueman
On large-memory x86-64 systems of 64GB or more with memory hot-plug
enabled, use a 2GB memory block size. Eg with 64GB memory, this reduces
the number of directories in /sys/devices/system/memory from 512 to 32,
making it more manageable, and reducing the creation time accordingly.

The caveat is that the memory can't be offlined (for hotplug or otherwise)
with finer 128MB granularity, but this is unimportant due to the high
memory densities generally used with such large-memory systems, where
e.g. a single DIMM is on the order of 16GB.

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
---
 init_64.c |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index df1a992..9622ab2 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -52,7 +52,6 @@
 #include <asm/numa.h>
 #include <asm/cacheflush.h>
 #include <asm/init.h>
-#include <asm/uv/uv.h>
 #include <asm/setup.h>
 
 #include "mm_internal.h"
@@ -1234,12 +1233,10 @@ static unsigned long probe_memory_block_size(void)
 	/* start from 2g */
 	unsigned long bz = 1UL<<31;
 
-#ifdef CONFIG_X86_UV
-	if (is_uv_system()) {
-		printk(KERN_INFO "UV: memory block size 2GB\n");
+	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
+		pr_info("Using 2GB memory block size for large-memory system\n");
 		return 2UL * 1024 * 1024 * 1024;
 	}
-#endif
 
 	/* less than 64g installed */
 	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
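
The directory counts quoted in the patch description (512 versus 32 for 64GB) follow from simple division. A small sketch, assuming contiguous memory with no holes; `nr_memory_blocks` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of memory block directories needed to cover
 * total_bytes of contiguous memory with blocks of block_bytes each.
 */
static uint64_t nr_memory_blocks(uint64_t total_bytes, uint64_t block_bytes)
{
	assert(block_bytes != 0);
	return (total_bytes + block_bytes - 1) / block_bytes;	/* round up */
}
```

For 64 GiB this gives 512 blocks at 128 MiB and 32 blocks at 2 GiB, matching the numbers in the description. A real machine with holes in its memory map may end up with a different count.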

