Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Wed, Aug 26, 2015 at 1:49 PM, Andrew Morton wrote:
> On Tue, 25 Aug 2015 22:42:05 -0700 Yinghai Lu wrote:
>
> I don't know what that means. We have multiple patches under at least
> two different Subject:s. Please be very careful and very specific when
> identifying patches. Otherwise mistakes will be made.
>
> I presently have three patches:
>
> mm-check-if-section-present-during-memory-block-unregistering.patch
> mm-check-if-section-present-during-memory-block-unregistering-v2.patch
> mm-check-if-section-present-during-memory-block-unregistering-v2-fix.patch
>
> When these are consolidated together, this is the result:

Please drop all three, and apply v3 directly from
https://patchwork.kernel.org/patch/7080111/

We should not touch the unregister path, as unregister_memory_section()
already checks whether the section is present.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Tue, 25 Aug 2015 22:42:05 -0700 Yinghai Lu wrote:
> On Tue, Aug 25, 2015 at 9:17 PM, Ingo Molnar wrote:
> > NAK due to lack of cleanliness: the two loops look almost identical - this sure
> > can be factored out...
>
> Please check complete version at
> https://patchwork.kernel.org/patch/7074341/

That doesn't do what Ingo suggested: "can be factored out...".

Please review this?

--- a/drivers/base/node.c~mm-check-if-section-present-during-memory-block-unregistering-v2-fix
+++ a/drivers/base/node.c
@@ -375,6 +375,22 @@ static int __init_refok get_nid_for_pfn(
 	return pfn_to_nid(pfn);
 }
 
+/*
+ * A memory block can have several absent sections. A helper function for
+ * skipping over these holes.
+ *
+ * If an absent section is detected, skip_absent_section() will advance *pfn
+ * to the final page in that section and will return true.
+ */
+static bool skip_absent_section(unsigned long *pfn)
+{
+	if (present_section_nr(pfn_to_section_nr(*pfn)))
+		return false;
+
+	*pfn = round_down(*pfn + PAGES_PER_SECTION, PAGES_PER_SECTION) - 1;
+	return true;
+}
+
 /* register memory section under specified node if it spans that node */
 int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
 {
@@ -390,18 +406,10 @@ int register_mem_sect_under_node(struct
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int page_nid, scn_nr;
+		int page_nid;
 
-		/*
-		 * memory block could have several absent sections from start.
-		 * skip pfn range from absent section
-		 */
-		scn_nr = pfn_to_section_nr(pfn);
-		if (!present_section_nr(scn_nr)) {
-			pfn = round_down(pfn + PAGES_PER_SECTION,
-					 PAGES_PER_SECTION) - 1;
+		if (skip_absent_section(&pfn))
 			continue;
-		}
 
 		page_nid = get_nid_for_pfn(pfn);
 		if (page_nid < 0)
@@ -441,18 +449,10 @@ int unregister_mem_sect_under_nodes(stru
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int nid, scn_nr;
+		int nid;
 
-		/*
-		 * memory block could have several absent sections from start.
-		 * skip pfn range from absent section
-		 */
-		scn_nr = pfn_to_section_nr(pfn);
-		if (!present_section_nr(scn_nr)) {
-			pfn = round_down(pfn + PAGES_PER_SECTION,
-					 PAGES_PER_SECTION) - 1;
+		if (skip_absent_section(&pfn))
 			continue;
-		}
 
 		nid = get_nid_for_pfn(pfn);
 		if (nid < 0)
_

> Andrew,
> Ingo NAKed raw version of this patch, so you may need to remove it
> from -mm tree.

I don't know what that means. We have multiple patches under at least
two different Subject:s. Please be very careful and very specific when
identifying patches. Otherwise mistakes will be made.

I presently have three patches:

mm-check-if-section-present-during-memory-block-unregistering.patch
mm-check-if-section-present-during-memory-block-unregistering-v2.patch
mm-check-if-section-present-during-memory-block-unregistering-v2-fix.patch

When these are consolidated together, this is the result:

From: Yinghai Lu
Subject: mm: check if section present during memory block (un)registering

Tony Luck found on his setup, if memory block size 512M will cause crash
during booting.

 BUG: unable to handle kernel paging request at ea007420
 IP: [] get_nid_for_pfn+0x17/0x40
 PGD 128ffcb067 PUD 128ffc9067 PMD 0
 Oops: [#1] SMP
 Modules linked in:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-rc8 #1
 ...
 Call Trace:
  [] ? register_mem_sect_under_node+0x66/0xe0
  [] register_one_node+0x17b/0x240
  [] ? pci_iommu_alloc+0x6e/0x6e
  [] topology_init+0x3c/0x95
  [] do_one_initcall+0xcd/0x1f0

The system has non continuous RAM address:
 BIOS-e820: [mem 0x0013-0x001c] usable
 BIOS-e820: [mem 0x001d7000-0x001ec7ffefff] usable
 BIOS-e820: [mem 0x001f-0x002b] usable
 BIOS-e820: [mem 0x002c1800-0x002d6fffefff] usable
 BIOS-e820: [mem 0x002e-0x0039] usable

So there are start sections in memory block not present.
For example:
memory block : [0x2c1800, 0x2c2000) 512M
first three sections are not present.

Current register_mem_sect_under_node() assume first section is present,
but memory block section number range [start_section_nr, end_section_nr]
would include not present section.

For arch that support vmemmap, we
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Tue, Aug 25, 2015 at 9:17 PM, Ingo Molnar wrote:
> NAK due to lack of cleanliness: the two loops look almost identical - this sure
> can be factored out...

Please check complete version at
https://patchwork.kernel.org/patch/7074341/

Andrew,
Ingo NAKed raw version of this patch, so you may need to remove it
from -mm tree.

Thanks

Yinghai
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
* Yinghai Lu wrote:

> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -390,8 +390,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
>  	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
>  	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> -		int page_nid;
> +		int page_nid, scn_nr;
>
> +		scn_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(scn_nr)) {
> +			pfn = round_down(pfn + PAGES_PER_SECTION,
> +					 PAGES_PER_SECTION) - 1;
> +			continue;
> +		}
>  		page_nid = get_nid_for_pfn(pfn);
>  		if (page_nid < 0)
>  			continue;
> @@ -426,10 +432,18 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>  		return -ENOMEM;
>  	nodes_clear(*unlinked_nodes);
>
> -	sect_start_pfn = section_nr_to_pfn(phys_index);
> -	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
> +	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
> +	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
> +	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> -		int nid;
> +		int nid, scn_nr;
> +
> +		scn_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(scn_nr)) {
> +			pfn = round_down(pfn + PAGES_PER_SECTION,
> +					 PAGES_PER_SECTION) - 1;
> +			continue;
> +		}

NAK due to lack of cleanliness: the two loops look almost identical - this sure
can be factored out...

Thanks,

	Ingo
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Tue, Aug 25, 2015 at 12:01 PM, Yinghai Lu wrote:
>> It does ... but this (attached) is simpler. Your patch and mine both
>> allow the system to boot ...
>
> The version that fix with section_nr present checking may save couple thousands
> calling to get_nid_for_pfn(). section size / page_size = 128M/4k = 32k

Actually saves about 1.2 million calls. Your patch wins :-)

Reported-and-tested-by: Tony Luck

-Tony
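The figures in this exchange can be checked with a small user-space sketch. It assumes the 128M section size and 4k page size quoted above; the helper names and the count of absent sections are hypothetical, since the thread does not say how many sections were absent on Tony's machine.

```c
#include <assert.h>
#include <stdint.h>

/* Pages in one sparsemem section: section size / page size. */
static uint64_t pages_per_section(uint64_t section_bytes, uint64_t page_bytes)
{
	assert(page_bytes != 0 && section_bytes % page_bytes == 0);
	return section_bytes / page_bytes;
}

/*
 * Approximate number of get_nid_for_pfn() calls avoided when each absent
 * section is skipped in one step instead of being walked pfn by pfn.
 * absent_sections is a hypothetical input, not a value from the thread.
 */
static uint64_t calls_saved(uint64_t absent_sections,
			    uint64_t section_bytes, uint64_t page_bytes)
{
	return absent_sections * pages_per_section(section_bytes, page_bytes);
}
```

With 128M sections and 4k pages, each skipped section avoids 32768 calls, so a saving of about 1.2 million calls would correspond to roughly 37 absent sections. That count is inferred from the arithmetic, not stated in the thread.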
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Tue, Aug 25, 2015 at 10:03 AM, Tony Luck wrote:
> On Mon, Aug 24, 2015 at 4:59 PM, Yinghai Lu wrote:
>> attached should fix the problem:
>
> It does ... but this (attached) is simpler. Your patch and mine both
> allow the system to boot ...

The version that fixes it by checking whether the section is present may
save a couple of thousand calls to get_nid_for_pfn().
section size / page_size = 128M/4k = 32k

> but it is not happy. See all the chatter from systemd in the attached dmesg.

Is that because you have "debug ignore_loglevel"?

>
> x86 doesn't allow me to set CONFIG_HOLES_IN_ZONE ... but now I'm
> worried about all the other places use pfn_valid_within()
>
> Still trying to get an answer from the BIOS folks on whether these
> holes are normal when setting up mirrored areas of memory.

The problem only happens when the memory block size is 512M and the
section size is 128M. When both are 128M, the system works.

So the current kernel should only have a problem when a hole is larger
than 128M and leaves some sections not present.

Thanks

Yinghai
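A rough user-space model of why the block size matters here. It assumes that a memory block exists whenever at least one of its sections is present, which matches the example in this thread of a 512M block whose first three sections are absent. The function name and the presence array are hypothetical and are not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count memory blocks that contain at least one present section (so the
 * block exists) and at least one absent section (so a walk over the
 * whole block would step into a hole).
 *
 * present[] is indexed by section number; sections_per_block is the
 * block size divided by the section size, e.g. 512M / 128M = 4.
 */
static size_t blocks_with_holes(const bool *present, size_t nr_sections,
				size_t sections_per_block)
{
	size_t count = 0;

	assert(sections_per_block != 0);
	for (size_t start = 0; start < nr_sections; start += sections_per_block) {
		bool any_present = false, any_absent = false;

		for (size_t i = start;
		     i < start + sections_per_block && i < nr_sections; i++) {
			if (present[i])
				any_present = true;
			else
				any_absent = true;
		}
		if (any_present && any_absent)
			count++;
	}
	return count;
}
```

With one section per block (block size equal to section size) the function always returns 0, because a single section cannot be both present and absent. With four sections per block, any hole that removes whole sections from a partly populated block is counted.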
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Mon, Aug 24, 2015 at 4:41 PM, Yinghai Lu wrote:
> On Mon, Aug 24, 2015 at 3:39 PM, Tony Luck wrote:
>> On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu wrote:
>>
>>> Can you boot with "debug ignore_loglevel" so we can see following print out
>>> for vmemmap?
>>
>> See attached. There are a few extra messages from my own debug printk()
>> calls. It seems that we successfully deal with node 0 from topology_init()
>> but die walking node 1. I see that the NODE_DATA limits for memory
>> on node 1 were from 1d7 to 3a0. But when we get into
>> register_mem_sect_under_node() we have rounded the start pfn down to
>> 1d0 ... and we panic processing that range (which is in a hole in e820).
>>
>> We seem to die here:
>>
>>         for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>>                 int page_nid;
>>
>>                 page_nid = get_nid_for_pfn(pfn);
>
> oh, no.
> register_mem_sect_under_node() is assuming:
> first section in the block is present and first page in that section is
> present.

attached should fix the problem:

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 31df474d..cc910ad 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -390,8 +390,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int page_nid;
+		int page_nid, scn_nr;
 
+		scn_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(scn_nr)) {
+			pfn = round_down(pfn + PAGES_PER_SECTION,
+					 PAGES_PER_SECTION) - 1;
+			continue;
+		}
 		page_nid = get_nid_for_pfn(pfn);
 		if (page_nid < 0)
 			continue;
@@ -426,10 +432,18 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 		return -ENOMEM;
 	nodes_clear(*unlinked_nodes);
 
-	sect_start_pfn = section_nr_to_pfn(phys_index);
-	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
+	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int nid;
+		int nid, scn_nr;
+
+		scn_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(scn_nr)) {
+			pfn = round_down(pfn + PAGES_PER_SECTION,
+					 PAGES_PER_SECTION) - 1;
+			continue;
+		}
 
 		nid = get_nid_for_pfn(pfn);
 		if (nid < 0)
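The skip logic in the patch above can be modelled in user space to see why the `- 1` is needed. This is only a sketch: the toy PAGES_PER_SECTION value, the presence array, and the function names are hypothetical stand-ins for the kernel's section bookkeeping, and the counter stands in for the calls to get_nid_for_pfn().

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_SECTION 8UL	/* toy value; must be a power of two */

/* Round x down to a multiple of y, where y is a power of two. */
static unsigned long round_down_pow2(unsigned long x, unsigned long y)
{
	return x & ~(y - 1);
}

/*
 * Walk the pfns in [start_pfn, end_pfn], skipping absent sections.
 * present[] is indexed by section number. Returns the number of pfns
 * that were not skipped, i.e. those that would reach get_nid_for_pfn().
 */
static unsigned long walk_present_pfns(const bool *present,
				       unsigned long start_pfn,
				       unsigned long end_pfn)
{
	unsigned long pfn, visited = 0;

	assert(start_pfn <= end_pfn);
	for (pfn = start_pfn; pfn <= end_pfn; pfn++) {
		if (!present[pfn / PAGES_PER_SECTION]) {
			/*
			 * Jump to the last pfn of this section; the loop's
			 * pfn++ then lands on the first pfn of the next one.
			 */
			pfn = round_down_pow2(pfn + PAGES_PER_SECTION,
					      PAGES_PER_SECTION) - 1;
			continue;
		}
		visited++;
	}
	return visited;
}
```

Without the `- 1`, the loop increment would step past the first pfn of the section that follows each hole. The rounding also handles a start pfn that falls in the middle of an absent section.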
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Mon, Aug 24, 2015 at 3:39 PM, Tony Luck wrote:
> On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu wrote:
>
>> Can you boot with "debug ignore_loglevel" so we can see following print out
>> for vmemmap?
>
> See attached. There are a few extra messages from my own debug printk()
> calls. It seems that we successfully deal with node 0 from topology_init()
> but die walking node 1. I see that the NODE_DATA limits for memory
> on node 1 were from 1d7 to 3a0. But when we get into
> register_mem_sect_under_node() we have rounded the start pfn down to
> 1d0 ... and we panic processing that range (which is in a hole in e820).
>
> We seem to die here:
>
>         for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>                 int page_nid;
>
>                 page_nid = get_nid_for_pfn(pfn);

oh, no.
register_mem_sect_under_node() is assuming:
first section in the block is present and first page in that section is present.
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu wrote:

> Can you boot with "debug ignore_loglevel" so we can see following print out
> for vmemmap?

See attached. There are a few extra messages from my own debug printk()
calls. It seems that we successfully deal with node 0 from topology_init()
but die walking node 1. I see that the NODE_DATA limits for memory
on node 1 were from 1d7 to 3a0. But when we get into
register_mem_sect_under_node() we have rounded the start pfn down to
1d0 ... and we panic processing that range (which is in a hole in e820).

We seem to die here:

        for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
                int page_nid;

                page_nid = get_nid_for_pfn(pfn);

-Tony

Attachment: dmesg2 (binary data)
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Mon, Aug 24, 2015 at 1:41 PM, Tony Luck wrote:
> On Mon, Aug 24, 2015 at 10:46 AM, Yinghai Lu wrote:
>> Then, what does the E820 look like?
>
> See attached serial console log of the latest crash

Can you boot with "debug ignore_loglevel" so we can see following print out
for vmemmap?

[0.352486] [ea00-ea0001ff] PMD -> [88007de0-88007fdf] on node 0
[0.358758] [ea000400-ea0005ff] PMD -> [88017d60-88017f5f] on node 1
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Mon, Aug 24, 2015 at 10:46 AM, Yinghai Lu wrote:
> Then, what does the E820 look like?

See attached serial console log of the latest crash

-Tony

Attachment: dmesg (binary data)
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Fri, Aug 21, 2015 at 4:54 PM, Tony Luck wrote:
> On Fri, Aug 21, 2015 at 1:50 PM, Yinghai Lu wrote:
>
> Still stuff going on that I don't understand here. I increased the amount of
> mirrored memory in this machine which moved max_pfn to 0x756
> and probe_memory_block_size() picked 512MB as the memory_block_size,
> which seemed plausible.
>
> But my kernel still crashed during boot with this value. :-(
> Forcing the block size to 128M made the system boot.
>
> Maybe all the holes in the e820 map matter too (specifically the
> alignment of the holes)?

Then, what does the E820 look like?

Yinghai
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Fri, Aug 21, 2015 at 4:54 PM, Tony Luck tony.l...@gmail.com wrote: On Fri, Aug 21, 2015 at 1:50 PM, Yinghai Lu ying...@kernel.org wrote: Still stuff going on that I don't understand here. I increased the amount of mirrored memory in this machine which moved max_pfn to 0x756 and probe_memory_block_size() picked 512MB as the memory_block_size, which seemed plausible. But my kernel still crashed during boot with this value. :-( Forcing the block size to 128M made the system boot. Maybe all the holes in the e820 map matter too (specifically the alignment of the holes)? Then, what does the E820 look like? Yinghai -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Mon, Aug 24, 2015 at 1:41 PM, Tony Luck tony.l...@gmail.com wrote: On Mon, Aug 24, 2015 at 10:46 AM, Yinghai Lu ying...@kernel.org wrote: Then, what does the E820 look like? See attached serial console log of the latest crash Can you boot with debug ignore_loglevel so we can see following print out for vmemmap? [0.352486] [ea00-ea0001ff] PMD - [88007de0-88007fdf] on node 0 [0.358758] [ea000400-ea0005ff] PMD - [88017d60-88017f5f] on node 1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Mon, Aug 24, 2015 at 10:46 AM, Yinghai Lu ying...@kernel.org wrote: Then, what does the E820 look like? See attached serial console log of the latest crash -Tony dmesg Description: Binary data
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Mon, Aug 24, 2015 at 3:39 PM, Tony Luck tony.l...@gmail.com wrote: On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu ying...@kernel.org wrote: Can you boot with debug ignore_loglevel so we can see following print out for vmemmap? See attached. There are a few extra messages from my own debug printk() calls. It seems that we successfully deal with node 0 from topology_init() but die walking node 1. I see that the NODE_DATA limits for memory on node 1 were from 1d7 to 3a0. But when we get into register_mem_sect_under_node() we have rounded the start pfn down to 1d0 ... and we panic processing that range (which is in a hole in e820). We seem to die here: for (pfn = sect_start_pfn; pfn = sect_end_pfn; pfn++) { int page_nid; page_nid = get_nid_for_pfn(pfn); oh, no. register_mem_sect_under_node() is assuming: first section in the block is present and first page in that section is present. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu ying...@kernel.org wrote: Can you boot with debug ignore_loglevel so we can see following print out for vmemmap? See attached. There are a few extra messages from my own debug printk() calls. It seems that we successfully deal with node 0 from topology_init() but die walking node 1. I see that the NODE_DATA limits for memory on node 1 were from 1d7 to 3a0. But when we get into register_mem_sect_under_node() we have rounded the start pfn down to 1d0 ... and we panic processing that range (which is in a hole in e820). We seem to die here: for (pfn = sect_start_pfn; pfn = sect_end_pfn; pfn++) { int page_nid; page_nid = get_nid_for_pfn(pfn); -Tony dmesg2 Description: Binary data
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Mon, Aug 24, 2015 at 4:41 PM, Yinghai Lu ying...@kernel.org wrote:
> On Mon, Aug 24, 2015 at 3:39 PM, Tony Luck tony.l...@gmail.com wrote:
>> On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu ying...@kernel.org wrote:
>>> Can you boot with debug ignore_loglevel so we can see following print out
>>> for vmemmap?
>>
>> See attached. There are a few extra messages from my own debug printk()
>> calls. It seems that we successfully deal with node 0 from topology_init()
>> but die walking node 1. I see that the NODE_DATA limits for memory on
>> node 1 were from 1d7 to 3a0. But when we get into
>> register_mem_sect_under_node() we have rounded the start pfn down to
>> 1d0 ... and we panic processing that range (which is in a hole in e820).
>>
>> We seem to die here:
>>
>> 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>> 		int page_nid;
>>
>> 		page_nid = get_nid_for_pfn(pfn);
>
> oh, no. register_mem_sect_under_node() is assuming:
> first section in the block is present and first page in that section is
> present.
>
> attached should fix the problem:
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 31df474d..cc910ad 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -390,8 +390,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
>  	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
>  	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> -		int page_nid;
> +		int page_nid, scn_nr;
>  
> +		scn_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(scn_nr)) {
> +			pfn = round_down(pfn + PAGES_PER_SECTION,
> +					 PAGES_PER_SECTION) - 1;
> +			continue;
> +		}
>  		page_nid = get_nid_for_pfn(pfn);
>  		if (page_nid < 0)
>  			continue;
> @@ -426,10 +432,18 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>  		return -ENOMEM;
>  	nodes_clear(*unlinked_nodes);
>  
> -	sect_start_pfn = section_nr_to_pfn(phys_index);
> -	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
> +	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
> +	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
> +	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> -		int nid;
> +		int nid, scn_nr;
> +
> +		scn_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(scn_nr)) {
> +			pfn = round_down(pfn + PAGES_PER_SECTION,
> +					 PAGES_PER_SECTION) - 1;
> +			continue;
> +		}
>  		nid = get_nid_for_pfn(pfn);
>  		if (nid < 0)
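For illustration only, the skip added by the patch above can be tried out in userspace. This is a minimal sketch, not the kernel code: the 4 KiB page and 128 MiB section sizes are assumed x86-64 values, `section_present[]` is a hypothetical stand-in for present_section_nr(), and `last_pfn_in_section()` and `count_visited_pfns()` are names invented here. The point it shows is that the pfn is moved to the last page of the absent section, so the loop's own pfn++ lands on the first page of the next section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed x86-64 values: 4 KiB pages, 128 MiB sections. */
#define PAGE_SHIFT		12
#define SECTION_SIZE_BITS	27
#define PAGES_PER_SECTION	(1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))

/* Gives the same result as the kernel's round_down() for a power-of-two y. */
#define round_down(x, y)	((x) & ~((y) - 1))

/* Hypothetical stand-in for present_section_nr(): sections 1 and 4 exist. */
static const bool section_present[] = { false, true, false, false, true };
#define NR_SECTIONS	(sizeof(section_present) / sizeof(section_present[0]))

static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> (SECTION_SIZE_BITS - PAGE_SHIFT);
}

/* Last pfn of the section that contains pfn (the expression from the patch). */
static unsigned long last_pfn_in_section(unsigned long pfn)
{
	return round_down(pfn + PAGES_PER_SECTION, PAGES_PER_SECTION) - 1;
}

/* Count the pfns in [start, end] that the loop would actually inspect. */
static unsigned long count_visited_pfns(unsigned long start, unsigned long end)
{
	unsigned long pfn, visited = 0;

	assert(pfn_to_section_nr(end) < NR_SECTIONS);
	for (pfn = start; pfn <= end; pfn++) {
		if (!section_present[pfn_to_section_nr(pfn)]) {
			/* Jump to the section's last pfn; pfn++ enters the next one. */
			pfn = last_pfn_in_section(pfn);
			continue;
		}
		visited++;
	}
	return visited;
}
```

Calling count_visited_pfns(0, 5 * PAGES_PER_SECTION - 1) walks five sections but only touches the pages of the two present ones; each absent section costs a single loop iteration rather than one per page.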
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Fri, Aug 21, 2015 at 1:50 PM, Yinghai Lu wrote:
>> It seems that many systems with large amounts of memory
>> will have a nicely aligned max_pfn ... so they will get
>> the 2GB block size. If they don't have a well aligned
>> max_pfn, then they need to use a smaller size to avoid
>> the crash I saw.
>
> Good to me.

Still stuff going on that I don't understand here. I increased the amount of mirrored memory in this machine which moved max_pfn to 0x756 and probe_memory_block_size() picked 512MB as the memory_block_size, which seemed plausible.

But my kernel still crashed during boot with this value. :-(

Forcing the block size to 128M made the system boot. Maybe all the holes in the e820 map matter too (specifically the alignment of the holes)?

-Tony
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Fri, Aug 21, 2015 at 1:27 PM, Luck, Tony wrote:
> On Fri, Aug 21, 2015 at 11:38:13AM -0700, Yinghai Lu wrote:
>> That commit could be reverted.
>> According to
>> https://lkml.org/lkml/2014/11/10/123
>
> Do we really need to force the MIN_MEMORY_BLOCK_SIZE on small
> systems?

That was introduced in commit 982792c7 ("x86, mm: probe memory block size for generic x86 64bit"). That patch makes boot faster by creating fewer entries in /sys/devices/system/memory/.

On a system with less than 64G of RAM, there will not be too many entries even with MIN_MEMORY_BLOCK_SIZE.

> What about this patch - which just uses max_pfn to choose
> the block size.
>
> It seems that many systems with large amounts of memory
> will have a nicely aligned max_pfn ... so they will get
> the 2GB block size. If they don't have a well aligned
> max_pfn, then they need to use a smaller size to avoid
> the crash I saw.

Good to me.

> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3fba623e3ba5..e14e90fd1cf8 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1195,15 +1195,6 @@ static unsigned long probe_memory_block_size(void)
>  	/* start from 2g */
>  	unsigned long bz = 1UL<<31;
>  
> -	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
> -		pr_info("Using 2GB memory block size for large-memory system\n");
> -		return 2UL * 1024 * 1024 * 1024;
> -	}
> -
> -	/* less than 64g installed */
> -	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
> -		return MIN_MEMORY_BLOCK_SIZE;
> -
>  	/* get the tail size */
>  	while (bz > MIN_MEMORY_BLOCK_SIZE) {
>  		if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Fri, Aug 21, 2015 at 11:38:13AM -0700, Yinghai Lu wrote:
> That commit could be reverted.
> According to
> https://lkml.org/lkml/2014/11/10/123

Do we really need to force the MIN_MEMORY_BLOCK_SIZE on small systems?

What about this patch - which just uses max_pfn to choose the block size.

It seems that many systems with large amounts of memory will have a nicely aligned max_pfn ... so they will get the 2GB block size. If they don't have a well aligned max_pfn, then they need to use a smaller size to avoid the crash I saw.

-Tony

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3fba623e3ba5..e14e90fd1cf8 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1195,15 +1195,6 @@ static unsigned long probe_memory_block_size(void)
 	/* start from 2g */
 	unsigned long bz = 1UL<<31;
 
-	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
-		pr_info("Using 2GB memory block size for large-memory system\n");
-		return 2UL * 1024 * 1024 * 1024;
-	}
-
-	/* less than 64g installed */
-	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
-		return MIN_MEMORY_BLOCK_SIZE;
-
 	/* get the tail size */
 	while (bz > MIN_MEMORY_BLOCK_SIZE) {
 		if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))
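A rough userspace sketch of what the remaining "get the tail size" loop computes may help here. Only the first two lines of that loop are visible in the diff context; the `break` and the halving of `bz` are my assumption about the rest of it, and `probe_block_size()` is a name made up for this example. The 4 KiB page size and 128 MiB minimum block size are assumed x86-64 values, and the max_pfn inputs in the usage note are invented, not taken from any machine in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 values. */
#define PAGE_SHIFT		12
#define MIN_MEMORY_BLOCK_SIZE	(UINT64_C(1) << 27)	/* 128 MiB */

/*
 * Pick the largest power-of-two block size, from 2 GiB down to the
 * minimum, that evenly divides the end of memory (max_pfn << PAGE_SHIFT).
 */
static uint64_t probe_block_size(uint64_t max_pfn)
{
	uint64_t bz = UINT64_C(1) << 31;	/* start from 2g */

	assert(max_pfn != 0);
	while (bz > MIN_MEMORY_BLOCK_SIZE) {
		/* Stop at the first size the end of memory is aligned to. */
		if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))
			break;
		bz >>= 1;	/* assumed: try the next smaller power of two */
	}
	return bz;
}
```

With these assumptions, a max_pfn of 0x2000000 (memory ending at 128 GiB) yields 2 GiB, while a made-up max_pfn of 0x3a10000 (an end address that is only 256 MiB aligned) yields 256 MiB. Note that the sketch looks only at the end of memory; it says nothing about how holes inside the map are aligned.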
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Fri, Aug 21, 2015 at 11:19 AM, Luck, Tony wrote:
> On Tue, Nov 04, 2014 at 04:29:44PM +0800, Daniel J Blueman wrote:
>> On large-memory x86-64 systems of 64GB or more with memory hot-plug
>> enabled, use a 2GB memory block size. Eg with 64GB memory, this reduces
>> the number of directories in /sys/devices/system/memory from 512 to 32,
>> making it more manageable, and reducing the creation time accordingly.
>>
>> This caveat is that the memory can't be offlined (for hotplug or otherwise)
>> with finer 128MB granularity, but this is unimportant due to the high
>> memory densities generally used with such large-memory systems, where
>> eg a single DIMM is the order of 16GB.
>
> git bisect points to this commit as the cause of a panic on my
> machine:
>
> [4.518415] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
> [4.525882] PCI: MMCONFIG for domain [bus 00-ff] at [mem 0x8000-0x8fff] (base 0x8000)
> [4.536280] PCI: MMCONFIG at [mem 0x8000-0x8fff] reserved in E820
> [4.544344] PCI: Using configuration type 1 for base access
> [4.550778] BUG: unable to handle kernel paging request at ea007820
> [4.558572] IP: [] register_mem_sect_under_
...
> so the older code will look at max_pfn and set memory block size:
>
> [3.021752] memory block size : 256MB
>
> I think the problem is more connected to the strange max_pfn rather
> than the holes ... but will defer to wiser heads.
>
> If the problem is with max_pfn ... I don't think it is a safe assumption
> that systems with >64GB memory will have 2GB aligned max_pfn.

That commit could be reverted.
According to
https://lkml.org/lkml/2014/11/10/123

I have carried the attached patch on my test setups for a while.

Yinghai

Subject: [PATCH] x86, mm: put memory block size probing back

commit bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems") lets systems with more than 64GiB of RAM use 2G as the memory block size without probing.

Found one system with a memory map like:
 [0x-0x6000)
 [0x1-0x20a000)
We should use 0x2000 in this case.

So we can no longer assume that a system with big memory has a 2G-aligned tail. Revert it to put the probing back.

Fixes: bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64 systems")
Signed-off-by: Yinghai Lu
---
 arch/x86/mm/init_64.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -52,6 +52,7 @@
 #include <asm/numa.h>
 #include <asm/cacheflush.h>
 #include <asm/init.h>
+#include <asm/uv/uv.h>
 #include <asm/setup.h>
 
 #include "mm_internal.h"
@@ -1204,10 +1205,12 @@ static unsigned long probe_memory_block_
 	/* start from 2g */
 	unsigned long bz = 1UL<<31;
 
-	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
-		pr_info("Using 2GB memory block size for large-memory system\n");
+#ifdef CONFIG_X86_UV
+	if (is_uv_system()) {
+		printk(KERN_INFO "UV: memory block size 2GB\n");
 		return 2UL * 1024 * 1024 * 1024;
 	}
+#endif
 
 	/* less than 64g installed */
 	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
Re: [PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On Tue, Nov 04, 2014 at 04:29:44PM +0800, Daniel J Blueman wrote:
> On large-memory x86-64 systems of 64GB or more with memory hot-plug
> enabled, use a 2GB memory block size. Eg with 64GB memory, this reduces
> the number of directories in /sys/devices/system/memory from 512 to 32,
> making it more manageable, and reducing the creation time accordingly.
>
> This caveat is that the memory can't be offlined (for hotplug or otherwise)
> with finer 128MB granularity, but this is unimportant due to the high
> memory densities generally used with such large-memory systems, where
> eg a single DIMM is the order of 16GB.

git bisect points to this commit as the cause of a panic on my machine:

[4.518415] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[4.525882] PCI: MMCONFIG for domain [bus 00-ff] at [mem 0x8000-0x8fff] (base 0x8000)
[4.536280] PCI: MMCONFIG at [mem 0x8000-0x8fff] reserved in E820
[4.544344] PCI: Using configuration type 1 for base access
[4.550778] BUG: unable to handle kernel paging request at ea007820
[4.558572] IP: [8142ab0d] register_mem_sect_under_node+0x6d/0xe0
[4.566366] PGD 1dfffcc067 PUD 1dfffca067 PMD 0
[4.571554] Oops: [#1] SMP
[4.575181] Modules linked in:
[4.578604] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.18.0-rc2+ #17
[4.585800] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRBDXSD1.86B.0326.D03.1508171454 08/17/2015
[4.597347] task: 883b8496 ti: 881d7ea14000 task.ti: 881d7ea14000
[4.605705] RIP: 0010:[8142ab0d] [8142ab0d] register_mem_sect_under_node+0x6d/0xe0
[4.616205] RSP: :881d7ea17d68 EFLAGS: 00010206
[4.622135] RAX: ea007820 RBX: 0001 RCX: 01e0
[4.630102] RDX: 7800 RSI: 0001 RDI: 881d7ccb6400
[4.638069] RBP: 881d7ea17d78 R08: 01e7 R09: 03c0
[4.646035] R10: 813043a0 R11: ea0169efa600 R12: 0001
[4.654003] R13: 0001 R14: 881d7ccb6400 R15:
[4.661972] FS: () GS:881d8b40() knlGS:
[4.670996] CS: 0010 DS: ES: CR0: 80050033
[4.677411] CR2: ea007820 CR3: 019a CR4: 003407f0
[4.685381] Stack:
[4.687627] 01e7 0001 881d7ea17dc8 8142af0a
[4.695926] 881d7ea17de8 03c0 881d0018 0002
[4.704225] 0400 81b101c5
[4.712524] Call Trace:
[4.715261] [8142af0a] register_one_node+0x18a/0x2b0
[4.721871] [81b101c5] ? pci_iommu_alloc+0x6e/0x6e
[4.728287] [81b10201] topology_init+0x3c/0x95
[4.734321] [81002144] do_one_initcall+0xd4/0x210
[4.740645] [8109b515] ? parse_args+0x245/0x480
[4.746774] [810bddc8] ? __wake_up+0x48/0x60
[4.752611] [81b062f9] kernel_init_freeable+0x19d/0x23c
[4.759511] [81b059e3] ? initcall_blacklist+0xb6/0xb6
[4.766226] [816580d0] ? rest_init+0x80/0x80
[4.772059] [816580de] kernel_init+0xe/0xf0
[4.777803] [8167057c] ret_from_fork+0x7c/0xb0
[4.783831] [816580d0] ? rest_init+0x80/0x80
[4.789655] Code: 39 c1 77 59 48 c1 e2 15 48 b8 00 00 00 00 00 ea ff ff 48 8d 44 02 20 eb 12 0f 1f 44 00 00 48 83 c1 01 48 83 c0 40 49 39 c8 72 5b <48> 83 38 00 74 ed 48 8b 50 e0 48 c1 ea 36 39 d6 75 e1 48 8b 04
[4.811356] RIP [8142ab0d] register_mem_sect_under_node+0x6d/0xe0
[4.819238] RSP 881d7ea17d68
[4.823132] CR2: ea007820
[4.826836] ---[ end trace 10b7bb944b11529f ]---
[4.831989] Kernel panic - not syncing: Fatal exception
[4.837866] ---[ end Kernel panic - not syncing: Fatal exception

Reverting the commit indeed makes the problem go away.

Now the root problem for me is that I have an insane BIOS that handed me an e820 table that is full of holes (for entries above 4GB) ... and ends with an entry that is only 256M aligned:

[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0008dfff] usable
[0.00] BIOS-e820: [mem 0x0008e000-0x0008] reserved
[0.00] BIOS-e820: [mem 0x0009-0x0009] usable
[0.00] BIOS-e820: [mem 0x000a-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x5cc0afff] usable
[0.00] BIOS-e820: [mem 0x5cc0b000-0x5e108fff] reserved
[0.00] BIOS-e820: [mem 0x5e109000-0x6035cfff] ACPI NVS
[0.00] BIOS-e820: [mem 0x6035d000-0x604fcfff] ACPI data
[0.00] BIOS-e820: [mem 0x604fd000-0x7baf] usable
[0.00] BIOS-e820: [mem 0x7bb0-0x8fff] reserved
[0.00] BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00118fffefff] usable
[0.00] BIOS-e820: [mem 0x0012-0x001d]
[PATCH v4 4/4] Use 2GB memory block size on large-memory x86-64 systems
On large-memory x86-64 systems of 64GB or more with memory hot-plug enabled, use a 2GB memory block size. Eg with 64GB memory, this reduces the number of directories in /sys/devices/system/memory from 512 to 32, making it more manageable, and reducing the creation time accordingly.

The caveat is that the memory can't be offlined (for hotplug or otherwise) with finer 128MB granularity, but this is unimportant due to the high memory densities generally used with such large-memory systems, where eg a single DIMM is of the order of 16GB.

Signed-off-by: Daniel J Blueman
---
 init_64.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index df1a992..9622ab2 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -52,7 +52,6 @@
 #include <asm/numa.h>
 #include <asm/cacheflush.h>
 #include <asm/init.h>
-#include <asm/uv/uv.h>
 #include <asm/setup.h>
 
 #include "mm_internal.h"
@@ -1234,12 +1233,10 @@ static unsigned long probe_memory_block_size(void)
 	/* start from 2g */
 	unsigned long bz = 1UL<<31;
 
-#ifdef CONFIG_X86_UV
-	if (is_uv_system()) {
-		printk(KERN_INFO "UV: memory block size 2GB\n");
+	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
+		pr_info("Using 2GB memory block size for large-memory system\n");
 		return 2UL * 1024 * 1024 * 1024;
 	}
-#endif
 
 	/* less than 64g installed */
 	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
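The "512 to 32" figure and the 64GB threshold in the patch can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: memory is treated as one contiguous range with no holes, pages are 4 KiB, and `nr_memory_blocks()` and `is_large_memory()` are helper names invented for this example rather than kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12	/* assumed 4 KiB pages */

/* Number of memory blocks needed to cover total_bytes, rounding up. */
static uint64_t nr_memory_blocks(uint64_t total_bytes, uint64_t block_size)
{
	assert(block_size != 0);
	return (total_bytes + block_size - 1) / block_size;
}

/* The patch's threshold: 64 GiB expressed as a count of pages. */
static int is_large_memory(uint64_t totalram_pages)
{
	return totalram_pages >= (UINT64_C(64) << (30 - PAGE_SHIFT));
}
```

For 64 GiB, nr_memory_blocks() gives 512 blocks at 128 MiB and 32 blocks at 2 GiB, matching the changelog. The threshold `64ULL << (30 - PAGE_SHIFT)` is 64 GiB divided by the page size, so it compares a page count against a page count.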