Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
On 07/16/2015 05:20 AM, Tejun Heo wrote:
> On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
> ...
> > -		/* and there's no empty block */
> > -		if (bi->start >= bi->end)
> > +		/* and there's no empty or non-exist block */
> > +		if (bi->start >= bi->end ||
> > +		    memblock_overlaps_region(&memblock.memory,
> > +			bi->start, bi->end - bi->start) == -1)
>
> Ugh can you please change memblock_overlaps_region() to return
> bool instead?

Well, I think memblock_overlaps_region() is designed to return the index
of the region overlapping with the given region. Of course for now, it is
only called by memblock_is_region_reserved().

Will post a patch to do this.

Thanks.

> Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
On 07/16/2015 05:20 AM, Tejun Heo wrote:
> On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
> ...
> > -		/* and there's no empty block */
> > -		if (bi->start >= bi->end)
> > +		/* and there's no empty or non-exist block */
> > +		if (bi->start >= bi->end ||
> > +		    memblock_overlaps_region(&memblock.memory,
> > +			bi->start, bi->end - bi->start) == -1)
>
> Ugh can you please change memblock_overlaps_region() to return
> bool instead?

Well, I think memblock_overlaps_region() is designed to return the index
of the region overlapping with the given region. Maybe it had some users
before. Of course for now, it is only called by
memblock_is_region_reserved().

It is OK to change the return value of memblock_overlaps_region() to bool.
But any caller of memblock_is_region_reserved() should also be changed.

I think it is OK to leave it there.

Thanks.

> Thanks.
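For readers following the return-type discussion above, here is a minimal user-space sketch of the two shapes being compared: an overlap check that returns the index of the first overlapping region (with -1 meaning "no overlap", as in the patch), and a thin wrapper that returns bool, as requested in review. This is not the kernel's implementation. `struct region`, `ranges_overlap()`, `overlaps_region_index()` and `overlaps_region()` are hypothetical names invented for illustration, and ranges are assumed to be half-open `[base, base + size)`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real memblock structs. */
typedef uint64_t phys_addr_t;

struct region {
	phys_addr_t base;
	phys_addr_t size;
};

/* True if the half-open ranges [b1, b1 + s1) and [b2, b2 + s2) intersect. */
static bool ranges_overlap(phys_addr_t b1, phys_addr_t s1,
			   phys_addr_t b2, phys_addr_t s2)
{
	return b1 < b2 + s2 && b2 < b1 + s1;
}

/* Index-returning shape: index of the first overlapping region, or -1. */
static long overlaps_region_index(const struct region *regs, size_t cnt,
				  phys_addr_t base, phys_addr_t size)
{
	for (size_t i = 0; i < cnt; i++)
		if (ranges_overlap(base, size, regs[i].base, regs[i].size))
			return (long)i;
	return -1;
}

/* Bool-returning shape: callers no longer compare against -1. */
static bool overlaps_region(const struct region *regs, size_t cnt,
			    phys_addr_t base, phys_addr_t size)
{
	return overlaps_region_index(regs, cnt, base, size) != -1;
}
```

With the bool form, a caller writes `!overlaps_region(...)` instead of `... == -1`, which reads more naturally when no caller uses the index itself.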
Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
...
> -		/* and there's no empty block */
> -		if (bi->start >= bi->end)
> +		/* and there's no empty or non-exist block */
> +		if (bi->start >= bi->end ||
> +		    memblock_overlaps_region(&memblock.memory,
> +			bi->start, bi->end - bi->start) == -1)

Ugh can you please change memblock_overlaps_region() to return bool
instead?

Thanks.

--
tejun
Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
On 07/07/2015 12:42 AM, Yasuaki Ishimatsu wrote:
> On Fri, 3 Jul 2015 09:26:05 +0800
> Tang Chen <tangc...@cn.fujitsu.com> wrote:
>
> > On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
> >> Hi Tang,
> >>
> >>> On my box, if I run lscpu, the output looks like this:
> >>>
> >>> NUMA node0 CPU(s): 0-14,128-142
> >>> NUMA node1 CPU(s): 15-29,143-157
> >>> NUMA node2 CPU(s):
> >>> NUMA node3 CPU(s):
> >>> NUMA node4 CPU(s): 62-76,190-204
> >>> NUMA node5 CPU(s): 78-92,206-220
> >>>
> >>> Nodes 2 and 3 do not exist, but they are online.
> >> According to your description of the patch, nodes 4 and 5 are mistakenly
> >
> > Not node 4 and 5, it is node 2 and 3 which are mistakenly set online.
>
> Please add the results of lscpu before/after applying the patch into
> description of your patch. Feel free to add my
> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com>

Thanks for reviewing. Will update the patch soon.

Thanks.

> Thanks,
> Yasuaki Ishimatsu
>
> >> set to online. Why does lscpu show the above result?
> >
> > Well, actually not only lscpu gives the strange result, under
> > /sys/devices/system/node, interfaces for node 2 and 3 are also created.
> >
> > I haven't read lscpu code, so I'm not sure how lscpu handles nodes. But
> > obviously, node 2 and 3 are set online, which is incorrect.
> >
> > For now, I only found that in numa_cleanup_meminfo(), memory above
> > max_pfn is removed, but holes between nodes are not removed.
> >
> > I think libraries are not able to handle this problem since nodes are
> > set online in kernel. Seeing from user space, there is no hole.
> >
> > Thanks.
> >
> >> Thanks,
> >> Yasuaki Ishimatsu
> >>
> >> On Wed, 1 Jul 2015 15:55:30 +0800
> >> Tang Chen <tangc...@cn.fujitsu.com> wrote:
> >>
> >>> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
> >>>> On 2015/7/1 11:16, Tang Chen wrote:
> >>>>
> >>>>> When parsing SRAT, all memory ranges are added into numa_meminfo.
> >>>>> In numa_init(), before entering numa_cleanup_meminfo(), all possible
> >>>>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
> >>>>> all ranges over max_pfn or empty.
> >>>>>
> >>>>> But, this only works if the nodes are continuous. Let's have a look
> >>>>> at the following example:
> >>>>>
> >>>>> We have an SRAT like this:
> >>>>> SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
> >>>>> SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
> >>>>> SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
> >>>>> SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
> >>>>> SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
> >>>>> SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
> >>>>> SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
> >>>>> SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
> >>>>> SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
> >>>>>
> >>>>> On boot, only node 0,1,2,3 exist.
> >>>>>
> >>>>> And the numa_meminfo will look like this:
> >>>>> numa_meminfo.nr_blks = 9
> >>>>> 1. on node 0: [0, 6000]
> >>>>> 2. on node 0: [1, 200]
> >>>>> 3. on node 1: [200, 400]
> >>>>> 4. on node 4: [400, 600]
> >>>>> 5. on node 5: [600, 800]
> >>>>> 6. on node 2: [800, a00]
> >>>>> 7. on node 3: [a00, a08]
> >>>>> 8. on node 6: [c00, a08]
> >>>>> 9. on node 7: [e00, a08]
> >>>>>
> >>>>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
> >>>>> the end address is over max_pfn, which is a08. But 4 and 5
> >>>>> are not removed because their end addresses are less than max_pfn.
> >>>>> But in fact, node 4 and 5 don't exist.
> >>>>>
> >>>>> In a word, numa_cleanup_meminfo() is not able to handle holes between
> >>>>> nodes.
> >>>>>
> >>>>> Since memory ranges in node 4 and 5 are in numa_meminfo, in
> >>>>> numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
> >>>>>
> >>>>> In this patch, we use memblock_overlaps_region() to check if ranges in
> >>>>> numa_meminfo overlap with ranges in memory_block. Since memory_block
> >>>>> contains all available memory at boot time, if they overlap, it means
> >>>>> the ranges exist. If not, then remove them from numa_meminfo.
> >>>>>
> >>>> Hi Tang Chen,
> >>>>
> >>>> What's the impact of this problem?
> >>>>
> >>>> Command "numactl --hard" will show an empty node(no cpu and no memory,
> >>>> but pgdat is created), right?
> >>>
> >>> On my box, if I run lscpu, the output looks like this:
> >>>
> >>> NUMA node0 CPU(s): 0-14,128-142
> >>> NUMA node1 CPU(s): 15-29,143-157
> >>> NUMA node2 CPU(s):
> >>> NUMA node3 CPU(s):
> >>> NUMA node4 CPU(s): 62-76,190-204
> >>> NUMA node5 CPU(s): 78-92,206-220
> >>>
> >>> Nodes 2 and 3 do not exist, but they are online.
> >>>
> >>> Thanks.
> >>>
> >>>> Thanks,
> >>>> Xishi Qiu
> >>>>
> >>>>> Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
> >>>>> ---
> >>>>>  arch/x86/mm/numa.c       | 6 --
> >>>>>  include/linux/memblock.h | 2 ++
> >>>>>  mm/memblock.c            | 2 +-
> >>>>>  3 files changed, 7 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> >>>>> index 4053bb5..0c55cc5 100644
> >>>>> --- a/arch/x86/mm/numa.c
> >>>>> +++ b/arch/x86/mm/numa.c
> >>>>> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
> >>>>>  		bi->start = max(bi->start, low);
> >>>>>  		bi->end = min(bi->end, high);
> >>>>>
> >>>>> -		/* and there's no empty block */
> >>>>> -		if (bi->start >= bi->end)
> >>>>> +		/* and there's no empty or non-exist block */
> >>>>> +		if (bi->start >= bi->end ||
> >>>>> +		    memblock_overlaps_region(&memblock.memory,
> >>>>> +			bi->start, bi->end - bi->start) == -1)
> >>>>>  			numa_remove_memblk_from(i--, mi);
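The cleanup rule described in the quoted patch can be sketched in user space as follows. This is only an illustration under stated assumptions, not the kernel code: `struct blk`, `range_present()` and `cleanup_blocks()` are hypothetical names, ranges are half-open `[start, end)`, and the addresses in the usage note are made-up round numbers, since the SRAT values in the thread are truncated. The sketch clamps each block to the upper limit, then drops blocks that are empty or that overlap no memory actually present at boot, which is the extra check the patch adds.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block type: a half-open range [start, end) owned by node nid. */
struct blk {
	uint64_t start;
	uint64_t end;
	int nid;
};

/* True if [start, end) overlaps any range of memory present at boot. */
static bool range_present(const struct blk *mem, size_t nmem,
			  uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < nmem; i++)
		if (start < mem[i].end && mem[i].start < end)
			return true;
	return false;
}

/*
 * Clamp every block to [0, high), then keep only blocks that are non-empty
 * and backed by present memory. Compacts bi[] in place and returns the
 * number of blocks kept.
 */
static size_t cleanup_blocks(struct blk *bi, size_t nr, uint64_t high,
			     const struct blk *mem, size_t nmem)
{
	size_t kept = 0;

	for (size_t i = 0; i < nr; i++) {
		struct blk b = bi[i];

		if (b.end > high)
			b.end = high;
		if (b.start >= b.end)
			continue;	/* empty after clamping */
		if (!range_present(mem, nmem, b.start, b.end))
			continue;	/* falls in a hole between nodes */
		bi[kept++] = b;
	}
	return kept;
}
```

As a usage example with assumed values: take eight blocks of width 2 for nodes 0, 1, 4, 5, 2, 3, 6, 7 laid out back to back from 0 to 16, present memory `[0, 4)` and `[8, 12)`, and `high = 12`. The blocks for nodes 6 and 7 become empty after clamping, and the blocks for nodes 4 and 5 are dropped by the presence check, leaving nodes 0, 1, 2 and 3. Without the presence check, nodes 4 and 5 would survive, which mirrors the hole problem described in the patch.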
Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
On Fri, 3 Jul 2015 09:26:05 +0800
Tang Chen <tangc...@cn.fujitsu.com> wrote:

>
> On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
> > Hi Tang,
> >
> >> On my box, if I run lscpu, the output looks like this:
> >>
> >> NUMA node0 CPU(s): 0-14,128-142
> >> NUMA node1 CPU(s): 15-29,143-157
> >> NUMA node2 CPU(s):
> >> NUMA node3 CPU(s):
> >> NUMA node4 CPU(s): 62-76,190-204
> >> NUMA node5 CPU(s): 78-92,206-220
> >>
> >> Nodes 2 and 3 do not exist, but they are online.
> > According to your description of the patch, nodes 4 and 5 are mistakenly
>
> Not node 4 and 5, it is node 2 and 3 which are mistakenly set online.

Please add the results of lscpu before/after applying the patch into
description of your patch. Feel free to add my
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com>

Thanks,
Yasuaki Ishimatsu

> > set to online. Why does lscpu show the above result?
>
> Well, actually not only lscpu gives the strange result, under
> /sys/devices/system/node, interfaces for node 2 and 3 are also created.
>
> I haven't read lscpu code, so I'm not sure how lscpu handles nodes. But
> obviously, node 2 and 3 are set online, which is incorrect.
>
> For now, I only found that in numa_cleanup_meminfo(), memory above
> max_pfn is removed, but holes between nodes are not removed.
>
> I think libraries are not able to handle this problem since nodes are
> set online in kernel. Seeing from user space, there is no hole.
>
> Thanks.
>
> >
> > Thanks,
> > Yasuaki Ishimatsu
> >
> > On Wed, 1 Jul 2015 15:55:30 +0800
> > Tang Chen <tangc...@cn.fujitsu.com> wrote:
> >
> >> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
> >>> On 2015/7/1 11:16, Tang Chen wrote:
> >>>
> >>>> When parsing SRAT, all memory ranges are added into numa_meminfo.
> >>>> In numa_init(), before entering numa_cleanup_meminfo(), all possible
> >>>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
> >>>> all ranges over max_pfn or empty.
> >>>>
> >>>> But, this only works if the nodes are continuous. Let's have a look
> >>>> at the following example:
> >>>>
> >>>> We have an SRAT like this:
> >>>> SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
> >>>> SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
> >>>> SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
> >>>> SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
> >>>> SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
> >>>> SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
> >>>> SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
> >>>> SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
> >>>> SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
> >>>>
> >>>> On boot, only node 0,1,2,3 exist.
> >>>>
> >>>> And the numa_meminfo will look like this:
> >>>> numa_meminfo.nr_blks = 9
> >>>> 1. on node 0: [0, 6000]
> >>>> 2. on node 0: [1, 200]
> >>>> 3. on node 1: [200, 400]
> >>>> 4. on node 4: [400, 600]
> >>>> 5. on node 5: [600, 800]
> >>>> 6. on node 2: [800, a00]
> >>>> 7. on node 3: [a00, a08]
> >>>> 8. on node 6: [c00, a08]
> >>>> 9. on node 7: [e00, a08]
> >>>>
> >>>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
> >>>> the end address is over max_pfn, which is a08. But 4 and 5
> >>>> are not removed because their end addresses are less than max_pfn.
> >>>> But in fact, node 4 and 5 don't exist.
> >>>>
> >>>> In a word, numa_cleanup_meminfo() is not able to handle holes between
> >>>> nodes.
> >>>>
> >>>> Since memory ranges in node 4 and 5 are in numa_meminfo, in
> >>>> numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
> >>>>
> >>>> In this patch, we use memblock_overlaps_region() to check if ranges in
> >>>> numa_meminfo overlap with ranges in memory_block. Since memory_block
> >>>> contains all available memory at boot time, if they overlap, it means
> >>>> the ranges exist. If not, then remove them from numa_meminfo.
> >>>>
> >>> Hi Tang Chen,
> >>>
> >>> What's the impact of this problem?
> >>>
> >>> Command "numactl --hard" will show an empty node(no cpu and no memory,
> >>> but pgdat is created), right?
> >>
> >> On my box, if I run lscpu, the output looks like this:
> >>
> >> NUMA node0 CPU(s): 0-14,128-142
> >> NUMA node1 CPU(s): 15-29,143-157
> >> NUMA node2 CPU(s):
> >> NUMA node3 CPU(s):
> >> NUMA node4 CPU(s): 62-76,190-204
> >> NUMA node5 CPU(s): 78-92,206-220
> >>
> >> Nodes 2 and 3 do not exist, but they are online.
> >>
> >> Thanks.
> >>
> >>> Thanks,
> >>> Xishi Qiu
> >>>
> >>>> Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
> >>>> ---
> >>>>  arch/x86/mm/numa.c       | 6 --
> >>>>  include/linux/memblock.h | 2 ++
> >>>>  mm/memblock.c            | 2 +-
> >>>>  3 files changed, 7 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> >>>> index 4053bb5..0c55cc5 100644
> >>>> --- a/arch/x86/mm/numa.c
Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
> Hi Tang,
>
> > On my box, if I run lscpu, the output looks like this:
> >
> > NUMA node0 CPU(s): 0-14,128-142
> > NUMA node1 CPU(s): 15-29,143-157
> > NUMA node2 CPU(s):
> > NUMA node3 CPU(s):
> > NUMA node4 CPU(s): 62-76,190-204
> > NUMA node5 CPU(s): 78-92,206-220
> >
> > Nodes 2 and 3 do not exist, but they are online.
>
> According to your description of the patch, nodes 4 and 5 are mistakenly

Not node 4 and 5, it is node 2 and 3 which are mistakenly set online.

> set to online. Why does lscpu show the above result?

Well, actually not only lscpu gives the strange result, under
/sys/devices/system/node, interfaces for node 2 and 3 are also created.

I haven't read lscpu code, so I'm not sure how lscpu handles nodes. But
obviously, node 2 and 3 are set online, which is incorrect.

For now, I only found that in numa_cleanup_meminfo(), memory above
max_pfn is removed, but holes between nodes are not removed.

I think libraries are not able to handle this problem since nodes are
set online in kernel. Seeing from user space, there is no hole.

Thanks.

> Thanks,
> Yasuaki Ishimatsu
>
> On Wed, 1 Jul 2015 15:55:30 +0800
> Tang Chen <tangc...@cn.fujitsu.com> wrote:
>
> > On 07/01/2015 02:25 PM, Xishi Qiu wrote:
> >> On 2015/7/1 11:16, Tang Chen wrote:
> >>
> >>> When parsing SRAT, all memory ranges are added into numa_meminfo.
> >>> In numa_init(), before entering numa_cleanup_meminfo(), all possible
> >>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
> >>> all ranges over max_pfn or empty.
> >>>
> >>> But, this only works if the nodes are continuous. Let's have a look
> >>> at the following example:
> >>>
> >>> We have an SRAT like this:
> >>> SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
> >>> SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
> >>> SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
> >>> SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
> >>> SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
> >>> SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
> >>> SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
> >>> SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
> >>> SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
> >>>
> >>> On boot, only node 0,1,2,3 exist.
> >>>
> >>> And the numa_meminfo will look like this:
> >>> numa_meminfo.nr_blks = 9
> >>> 1. on node 0: [0, 6000]
> >>> 2. on node 0: [1, 200]
> >>> 3. on node 1: [200, 400]
> >>> 4. on node 4: [400, 600]
> >>> 5. on node 5: [600, 800]
> >>> 6. on node 2: [800, a00]
> >>> 7. on node 3: [a00, a08]
> >>> 8. on node 6: [c00, a08]
> >>> 9. on node 7: [e00, a08]
> >>>
> >>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
> >>> the end address is over max_pfn, which is a08. But 4 and 5
> >>> are not removed because their end addresses are less than max_pfn.
> >>> But in fact, node 4 and 5 don't exist.
> >>>
> >>> In a word, numa_cleanup_meminfo() is not able to handle holes between
> >>> nodes.
> >>>
> >>> Since memory ranges in node 4 and 5 are in numa_meminfo, in
> >>> numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
> >>>
> >>> In this patch, we use memblock_overlaps_region() to check if ranges in
> >>> numa_meminfo overlap with ranges in memory_block. Since memory_block
> >>> contains all available memory at boot time, if they overlap, it means
> >>> the ranges exist. If not, then remove them from numa_meminfo.
> >>>
> >> Hi Tang Chen,
> >>
> >> What's the impact of this problem?
> >>
> >> Command "numactl --hard" will show an empty node(no cpu and no memory,
> >> but pgdat is created), right?
> >
> > On my box, if I run lscpu, the output looks like this:
> >
> > NUMA node0 CPU(s): 0-14,128-142
> > NUMA node1 CPU(s): 15-29,143-157
> > NUMA node2 CPU(s):
> > NUMA node3 CPU(s):
> > NUMA node4 CPU(s): 62-76,190-204
> > NUMA node5 CPU(s): 78-92,206-220
> >
> > Nodes 2 and 3 do not exist, but they are online.
> >
> > Thanks.
> >
> >> Thanks,
> >> Xishi Qiu
> >>
> >>> Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
> >>> ---
> >>>  arch/x86/mm/numa.c       | 6 --
> >>>  include/linux/memblock.h | 2 ++
> >>>  mm/memblock.c            | 2 +-
> >>>  3 files changed, 7 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> >>> index 4053bb5..0c55cc5 100644
> >>> --- a/arch/x86/mm/numa.c
> >>> +++ b/arch/x86/mm/numa.c
> >>> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
> >>>  		bi->start = max(bi->start, low);
> >>>  		bi->end = min(bi->end, high);
> >>>
> >>> -		/* and there's no empty block */
> >>> -		if (bi->start >= bi->end)
> >>> +		/* and there's no empty or non-exist block */
> >>> +		if (bi->start >= bi->end ||
> >>> +		    memblock_overlaps_region(&memblock.memory,
> >>> +			bi->start, bi->end - bi->start) == -1)
> >>>  			numa_remove_memblk_from(i--, mi);
> >>>  	}
> >>>
> >>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >>> index 0215ffd..3bf6cc1 100644
> >>> --- a/include/linux/memblock.h
> >>> +++ b/include/linux/memblock.h
> >>> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
> >>>  int memblock_free(phys_addr_t base, phys_addr_t size);
> >>>  int memblock_reserve(phys_addr_t
Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
Hi Tang,

> On my box, if I run lscpu, the output looks like this:
>
> NUMA node0 CPU(s): 0-14,128-142
> NUMA node1 CPU(s): 15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s): 62-76,190-204
> NUMA node5 CPU(s): 78-92,206-220
>
> Nodes 2 and 3 do not exist, but they are online.

According to your description of the patch, nodes 4 and 5 are mistakenly
set to online. Why does lscpu show the above result?

Thanks,
Yasuaki Ishimatsu

On Wed, 1 Jul 2015 15:55:30 +0800
Tang Chen <tangc...@cn.fujitsu.com> wrote:

>
> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
> > On 2015/7/1 11:16, Tang Chen wrote:
> >
> >> When parsing SRAT, all memory ranges are added into numa_meminfo.
> >> In numa_init(), before entering numa_cleanup_meminfo(), all possible
> >> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
> >> all ranges over max_pfn or empty.
> >>
> >> But, this only works if the nodes are continuous. Let's have a look
> >> at the following example:
> >>
> >> We have an SRAT like this:
> >> SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
> >> SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
> >> SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
> >> SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
> >> SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
> >> SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
> >> SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
> >> SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
> >> SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
> >>
> >> On boot, only node 0,1,2,3 exist.
> >>
> >> And the numa_meminfo will look like this:
> >> numa_meminfo.nr_blks = 9
> >> 1. on node 0: [0, 6000]
> >> 2. on node 0: [1, 200]
> >> 3. on node 1: [200, 400]
> >> 4. on node 4: [400, 600]
> >> 5. on node 5: [600, 800]
> >> 6. on node 2: [800, a00]
> >> 7. on node 3: [a00, a08]
> >> 8. on node 6: [c00, a08]
> >> 9. on node 7: [e00, a08]
> >>
> >> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
> >> the end address is over max_pfn, which is a08. But 4 and 5
> >> are not removed because their end addresses are less than max_pfn.
> >> But in fact, node 4 and 5 don't exist.
> >>
> >> In a word, numa_cleanup_meminfo() is not able to handle holes between
> >> nodes.
> >>
> >> Since memory ranges in node 4 and 5 are in numa_meminfo, in
> >> numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
> >>
> >> In this patch, we use memblock_overlaps_region() to check if ranges in
> >> numa_meminfo overlap with ranges in memory_block. Since memory_block
> >> contains all available memory at boot time, if they overlap, it means
> >> the ranges exist. If not, then remove them from numa_meminfo.
> >>
> > Hi Tang Chen,
> >
> > What's the impact of this problem?
> >
> > Command "numactl --hard" will show an empty node(no cpu and no memory,
> > but pgdat is created), right?
>
> On my box, if I run lscpu, the output looks like this:
>
> NUMA node0 CPU(s): 0-14,128-142
> NUMA node1 CPU(s): 15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s): 62-76,190-204
> NUMA node5 CPU(s): 78-92,206-220
>
> Nodes 2 and 3 do not exist, but they are online.
>
> Thanks.
>
> >
> > Thanks,
> > Xishi Qiu
> >
> >> Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
> >> ---
> >>  arch/x86/mm/numa.c       | 6 --
> >>  include/linux/memblock.h | 2 ++
> >>  mm/memblock.c            | 2 +-
> >>  3 files changed, 7 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> >> index 4053bb5..0c55cc5 100644
> >> --- a/arch/x86/mm/numa.c
> >> +++ b/arch/x86/mm/numa.c
> >> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
> >>  		bi->start = max(bi->start, low);
> >>  		bi->end = min(bi->end, high);
> >>
> >> -		/* and there's no empty block */
> >> -		if (bi->start >= bi->end)
> >> +		/* and there's no empty or non-exist block */
> >> +		if (bi->start >= bi->end ||
> >> +		    memblock_overlaps_region(&memblock.memory,
> >> +			bi->start, bi->end - bi->start) == -1)
> >>  			numa_remove_memblk_from(i--, mi);
> >>  	}
> >>
> >> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >> index 0215ffd..3bf6cc1 100644
> >> --- a/include/linux/memblock.h
> >> +++ b/include/linux/memblock.h
> >> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
> >>  int memblock_free(phys_addr_t base, phys_addr_t size);
> >>  int memblock_reserve(phys_addr_t base, phys_addr_t size);
> >>  void memblock_trim_memory(phys_addr_t align);
> >> +long memblock_overlaps_region(struct memblock_type *type,
> >> +			      phys_addr_t base, phys_addr_t size);
> >>  int
Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
> Hi Tang,
>
>> On my box, if I run lscpu, the output looks like this:
>>
>> NUMA node0 CPU(s): 0-14,128-142
>> NUMA node1 CPU(s): 15-29,143-157
>> NUMA node2 CPU(s):
>> NUMA node3 CPU(s):
>> NUMA node4 CPU(s): 62-76,190-204
>> NUMA node5 CPU(s): 78-92,206-220
>>
>> Nodes 2 and 3 do not exist, but they are online.
>
> According to your description of the patch, node 4 and 5 are mistakenly
> set to online. Why does lscpu show the above result?

Not node 4 and 5, it is node 2 and 3 which are mistakenly set online.

Well, actually it is not only lscpu that gives the strange result. Under
/sys/devices/system/node, interfaces for node 2 and 3 are also created.

I haven't read the lscpu code, so I'm not sure how lscpu handles nodes.
But obviously, node 2 and 3 are set online, which is incorrect.

For now, I only found that in numa_cleanup_meminfo(), memory above
max_pfn is removed, but holes between nodes are not removed.

I think libraries are not able to handle this problem since nodes are set
online in the kernel. Seen from user space, there is no hole.

Thanks.

> Thanks,
> Yasuaki Ishimatsu
>
> On Wed, 1 Jul 2015 15:55:30 +0800
> Tang Chen <tangc...@cn.fujitsu.com> wrote:
>
>> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
>>> On 2015/7/1 11:16, Tang Chen wrote:
>>>
>>>> When parsing SRAT, all memory ranges are added into numa_meminfo.
>>>> In numa_init(), before entering numa_cleanup_meminfo(), all possible
>>>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
>>>> all ranges over max_pfn or empty.
>>>>
>>>> But, this only works if the nodes are continuous.
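The stale nodes mentioned in this reply can also be looked for without lscpu, by reading the node masks in sysfs. The sketch below is only an illustration under assumptions: parse_node_list() and read_node_mask() are hypothetical helper names, node ids are assumed to be below 64, and the has_memory file is assumed to be present (it may be missing on some kernel configurations). It parses lists such as "0-5" or "0-1,4-5" into a bitmask.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a kernel-style id list such as "0-1,4-5" into a bitmask.
 * Hypothetical helper for illustration; ids must be below 64.
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_node_list(const char *s, unsigned long long *mask)
{
	*mask = 0;
	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			return -1;	/* no digits where a number was expected */
		if (*end == '-') {
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 0 || hi < lo || hi >= 64)
			return -1;
		for (long n = lo; n <= hi; n++)
			*mask |= 1ULL << n;
		if (*end == ',')
			end++;
		s = end;
	}
	return 0;
}

/* Read the first line of a sysfs file and parse it; -1 if unreadable. */
static int read_node_mask(const char *path, unsigned long long *mask)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_node_list(buf, mask);
}
```

Calling read_node_mask() on /sys/devices/system/node/online and on /sys/devices/system/node/has_memory and computing online & ~has_memory gives the nodes that are online without memory. A memoryless node can be legitimate, so this is a hint for where to look, not proof of the problem discussed here.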
Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
On 2015/7/1 15:55, Tang Chen wrote:
>
> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
>> On 2015/7/1 11:16, Tang Chen wrote:
>>
>>> When parsing SRAT, all memory ranges are added into numa_meminfo.
>>> In numa_init(), before entering numa_cleanup_meminfo(), all possible
>>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
>>> all ranges over max_pfn or empty.
>>>
>>> But, this only works if the nodes are continuous. Let's have a look
>>> at the following example:
>>>
>>> We have an SRAT like this:
>>> SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
>>> SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
>>> SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
>>> SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
>>> SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
>>> SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
>>> SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
>>> SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
>>> SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
>>>
>>> On boot, only node 0,1,2,3 exist.
>>>
>>> And the numa_meminfo will look like this:
>>> numa_meminfo.nr_blks = 9
>>> 1. on node 0: [0, 6000]
>>> 2. on node 0: [1, 200]
>>> 3. on node 1: [200, 400]
>>> 4. on node 4: [400, 600]
>>> 5. on node 5: [600, 800]
>>> 6. on node 2: [800, a00]
>>> 7. on node 3: [a00, a08]
>>> 8. on node 6: [c00, a08]
>>> 9. on node 7: [e00, a08]
>>>
>>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
>>> the end address is over max_pfn, which is a08. But 4 and 5
>>> are not removed because their end addresses are less than max_pfn.
>>> But in fact, node 4 and 5 don't exist.
>>>
>>> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
>>>
>>> Since memory ranges in node 4 and 5 are in numa_meminfo, in
>>> numa_register_memblks(),
>>> node 4 and 5 will be mistakenly set to online.
>>>
>>> In this patch, we use memblock_overlaps_region() to check if ranges in
>>> numa_meminfo overlap with ranges in memblock.memory. Since memblock.memory
>>> contains
>>> all available memory at boot time, if they overlap, it means the ranges
>>> exist. If not, then remove them from numa_meminfo.
>>>
>> Hi Tang Chen,
>>
>> What's the impact of this problem?
>>
>> Command "numactl --hard" will show an empty node(no cpu and no memory,
>> but pgdat is created), right?
>
> On my box, if I run lscpu, the output looks like this:
>
> NUMA node0 CPU(s): 0-14,128-142
> NUMA node1 CPU(s): 15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s): 62-76,190-204
> NUMA node5 CPU(s): 78-92,206-220
>
> Nodes 2 and 3 do not exist, but they are online.
>

Yes, because srat -> numa_meminfo -> alloc pgdat.

Thanks,
Xishi Qiu
Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
On 07/01/2015 02:25 PM, Xishi Qiu wrote:
> On 2015/7/1 11:16, Tang Chen wrote:
>
>> When parsing SRAT, all memory ranges are added into numa_meminfo.
>> In numa_init(), before entering numa_cleanup_meminfo(), all possible
>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
>> all ranges over max_pfn or empty.
>>
>> But, this only works if the nodes are continuous. Let's have a look
>> at the following example:
>>
>> We have an SRAT like this:
>> SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
>> SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
>> SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
>> SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
>> SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
>> SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
>> SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
>> SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
>> SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
>>
>> On boot, only node 0,1,2,3 exist.
>>
>> And the numa_meminfo will look like this:
>> numa_meminfo.nr_blks = 9
>> 1. on node 0: [0, 6000]
>> 2. on node 0: [1, 200]
>> 3. on node 1: [200, 400]
>> 4. on node 4: [400, 600]
>> 5. on node 5: [600, 800]
>> 6. on node 2: [800, a00]
>> 7. on node 3: [a00, a08]
>> 8. on node 6: [c00, a08]
>> 9. on node 7: [e00, a08]
>>
>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
>> the end address is over max_pfn, which is a08. But 4 and 5
>> are not removed because their end addresses are less than max_pfn.
>> But in fact, node 4 and 5 don't exist.
>>
>> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
>>
>> Since memory ranges in node 4 and 5 are in numa_meminfo, in
>> numa_register_memblks(),
>> node 4 and 5 will be mistakenly set to online.
>>
>> In this patch, we use memblock_overlaps_region() to check if ranges in
>> numa_meminfo overlap with ranges in memblock.memory. Since memblock.memory
>> contains all available memory at boot time, if they overlap, it means
>> the ranges exist. If not, then remove them from numa_meminfo.
>>
> Hi Tang Chen,
>
> What's the impact of this problem?
>
> Command "numactl --hard" will show an empty node(no cpu and no memory,
> but pgdat is created), right?

On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220

Nodes 2 and 3 do not exist, but they are online.

Thanks.

> Thanks,
> Xishi Qiu
>
>> Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
>> ---
>>  arch/x86/mm/numa.c       | 6 ++++--
>>  include/linux/memblock.h | 2 ++
>>  mm/memblock.c            | 2 +-
>>  3 files changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>> index 4053bb5..0c55cc5 100644
>> --- a/arch/x86/mm/numa.c
>> +++ b/arch/x86/mm/numa.c
>> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
>>  		bi->start = max(bi->start, low);
>>  		bi->end = min(bi->end, high);
>>  
>> -		/* and there's no empty block */
>> -		if (bi->start >= bi->end)
>> +		/* and there's no empty or non-exist block */
>> +		if (bi->start >= bi->end ||
>> +		    memblock_overlaps_region(&memblock.memory,
>> +			bi->start, bi->end - bi->start) == -1)
>>  			numa_remove_memblk_from(i--, mi);
>>  	}
>>  
>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> index 0215ffd..3bf6cc1 100644
>> --- a/include/linux/memblock.h
>> +++ b/include/linux/memblock.h
>> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
>>  int memblock_free(phys_addr_t base, phys_addr_t size);
>>  int memblock_reserve(phys_addr_t base, phys_addr_t size);
>>  void memblock_trim_memory(phys_addr_t align);
>> +long memblock_overlaps_region(struct memblock_type *type,
>> +			      phys_addr_t base, phys_addr_t size);
>>  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>>  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>>  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 1b444c7..55b5f9f 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
>>  	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
>>  }
>>  
>> -static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
>> +long __init_memblock memblock_overlaps_region(struct memblock_type *type,
>>  					phys_addr_t base, phys_addr_t size)
>>  {
>>  	unsigned long i;
Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
On 2015/7/1 11:16, Tang Chen wrote:
> When parsing SRAT, all memory ranges are added into numa_meminfo.
> In numa_init(), before entering numa_cleanup_meminfo(), all possible
> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
> all ranges over max_pfn or empty.
>
> But, this only works if the nodes are continuous. Let's have a look
> at the following example:
>
> We have an SRAT like this:
> SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
> SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
> SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
> SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
> SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
> SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
> SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
> SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
> SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
>
> On boot, only node 0,1,2,3 exist.
>
> And the numa_meminfo will look like this:
> numa_meminfo.nr_blks = 9
> 1. on node 0: [0, 6000]
> 2. on node 0: [1, 200]
> 3. on node 1: [200, 400]
> 4. on node 4: [400, 600]
> 5. on node 5: [600, 800]
> 6. on node 2: [800, a00]
> 7. on node 3: [a00, a08]
> 8. on node 6: [c00, a08]
> 9. on node 7: [e00, a08]
>
> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
> the end address is over max_pfn, which is a08. But 4 and 5
> are not removed because their end addresses are less than max_pfn.
> But in fact, node 4 and 5 don't exist.
>
> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
>
> Since memory ranges in node 4 and 5 are in numa_meminfo, in
> numa_register_memblks(),
> node 4 and 5 will be mistakenly set to online.
>
> In this patch, we use memblock_overlaps_region() to check if ranges in
> numa_meminfo overlap with ranges in memblock.memory. Since memblock.memory
> contains all available memory at boot time, if they overlap, it means
> the ranges exist. If not, then remove them from numa_meminfo.
>

Hi Tang Chen,

What's the impact of this problem?

Command "numactl --hard" will show an empty node(no cpu and no memory,
but pgdat is created), right?

Thanks,
Xishi Qiu

> Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
> ---
>  arch/x86/mm/numa.c       | 6 ++++--
>  include/linux/memblock.h | 2 ++
>  mm/memblock.c            | 2 +-
>  3 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 4053bb5..0c55cc5 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
>  		bi->start = max(bi->start, low);
>  		bi->end = min(bi->end, high);
>  
> -		/* and there's no empty block */
> -		if (bi->start >= bi->end)
> +		/* and there's no empty or non-exist block */
> +		if (bi->start >= bi->end ||
> +		    memblock_overlaps_region(&memblock.memory,
> +			bi->start, bi->end - bi->start) == -1)
>  			numa_remove_memblk_from(i--, mi);
>  	}
>  
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 0215ffd..3bf6cc1 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
>  int memblock_free(phys_addr_t base, phys_addr_t size);
>  int memblock_reserve(phys_addr_t base, phys_addr_t size);
>  void memblock_trim_memory(phys_addr_t align);
> +long memblock_overlaps_region(struct memblock_type *type,
> +			      phys_addr_t base, phys_addr_t size);
>  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 1b444c7..55b5f9f 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
>  	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
>  }
>  
> -static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
> +long __init_memblock memblock_overlaps_region(struct memblock_type *type,
>  					phys_addr_t base, phys_addr_t size)
>  {
>  	unsigned long i;
[PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.
When parsing SRAT, all memory ranges are added into numa_meminfo. In numa_init(), before entering numa_cleanup_meminfo(), all possible memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes all ranges over max_pfn or empty. But, this only works if the nodes are continuous. Let's have a look at the following example: We have an SRAT like this: SRAT: Node 0 PXM 0 [mem 0x-0x5fff] SRAT: Node 0 PXM 0 [mem 0x1-0x1ff] SRAT: Node 1 PXM 1 [mem 0x200-0x3ff] SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug On boot, only node 0,1,2,3 exist. And the numa_meminfo will look like this: numa_meminfo.nr_blks = 9 1. on node 0: [0, 6000] 2. on node 0: [1, 200] 3. on node 1: [200, 400] 4. on node 4: [400, 600] 5. on node 5: [600, 800] 6. on node 2: [800, a00] 7. on node 3: [a00, a08] 8. on node 6: [c00, a08] 9. on node 7: [e00, a08] And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because the end address is over max_pfn, which is a08. But 4 and 5 are not removed because their end addresses are less then max_pfn. But in fact, node 4 and 5 don't exist. In a word, numa_cleanup_meminfo() is not able to handle holes between nodes. Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(), node 4 and 5 will be mistakenly set to online. In this patch, we use memblock_overlaps_region() to check if ranges in numa_meminfo overlap with ranges in memory_block. Since memory_block contains all available memory at boot time, if they overlap, it means the ranges exist. If not, then remove them from numa_meminfo. 
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
---
 arch/x86/mm/numa.c       | 6 ++++--
 include/linux/memblock.h | 2 ++
 mm/memblock.c            | 2 +-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..0c55cc5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 		bi->start = max(bi->start, low);
 		bi->end = min(bi->end, high);
 
-		/* and there's no empty block */
-		if (bi->start >= bi->end)
+		/* and there's no empty or non-exist block */
+		if (bi->start >= bi->end ||
+		    memblock_overlaps_region(&memblock.memory,
+			bi->start, bi->end - bi->start) == -1)
 			numa_remove_memblk_from(i--, mi);
 	}
 
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 0215ffd..3bf6cc1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
+long memblock_overlaps_region(struct memblock_type *type,
+			      phys_addr_t base, phys_addr_t size);
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 1b444c7..55b5f9f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
 	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
 }
 
-static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
+long __init_memblock memblock_overlaps_region(struct memblock_type *type,
 					phys_addr_t base, phys_addr_t size)
 {
 	unsigned long i;
-- 
1.8.4.2
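For readers who want to try the idea outside the kernel, the check that the
patch adds to numa_cleanup_meminfo() can be sketched roughly as below. This
is only a user-space illustration under stated assumptions, not the kernel
code: struct range, struct meminfo, MAX_BLKS, ranges_overlap(),
overlaps_present(), remove_blk() and cleanup_meminfo() are hypothetical
stand-ins for struct numa_memblk, struct numa_meminfo,
memblock_addrs_overlap(), memblock_overlaps_region(),
numa_remove_memblk_from() and numa_cleanup_meminfo(). The merging of
neighbouring blocks is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct numa_memblk: a half-open range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
	int nid;
};

#define MAX_BLKS 16	/* arbitrary size chosen for this sketch */

/* Hypothetical stand-in for struct numa_meminfo. */
struct meminfo {
	size_t nr_blks;
	struct range blk[MAX_BLKS];
};

/* Same comparison as memblock_addrs_overlap() in the patch context. */
static bool ranges_overlap(uint64_t base1, uint64_t size1,
			   uint64_t base2, uint64_t size2)
{
	return base1 < base2 + size2 && base2 < base1 + size1;
}

/*
 * Index of the first present range that overlaps [base, base + size),
 * or -1 if there is none. The -1 convention is the one the patch tests for.
 */
static long overlaps_present(const struct range *present, size_t n,
			     uint64_t base, uint64_t size)
{
	for (size_t i = 0; i < n; i++)
		if (ranges_overlap(base, size, present[i].start,
				   present[i].end - present[i].start))
			return (long)i;
	return -1;
}

/* Drop block idx and close the gap. */
static void remove_blk(struct meminfo *mi, size_t idx)
{
	assert(idx < mi->nr_blks);
	mi->nr_blks--;
	for (size_t i = idx; i < mi->nr_blks; i++)
		mi->blk[i] = mi->blk[i + 1];
}

/*
 * Clamp every block to [low, high), then drop blocks that are empty or
 * that do not overlap any memory present at boot (the "node hole" case).
 */
static void cleanup_meminfo(struct meminfo *mi, const struct range *present,
			    size_t n, uint64_t low, uint64_t high)
{
	size_t i = 0;

	while (i < mi->nr_blks) {
		struct range *bi = &mi->blk[i];

		bi->start = bi->start > low ? bi->start : low;
		bi->end = bi->end < high ? bi->end : high;

		if (bi->start >= bi->end ||
		    overlaps_present(present, n, bi->start,
				     bi->end - bi->start) == -1) {
			remove_blk(mi, i);	/* re-check the same index */
			continue;
		}
		i++;
	}
}
```

With made-up addresses (the SRAT values quoted in the changelog are
truncated in this archive, so they are not reused here), a table holding
blocks for nodes 0, 1, 4, 2 and 6, where node 4 is an absent hotplug node
below the upper limit and node 6 lies above it, should come out of
cleanup_meminfo() with only the blocks for nodes 0, 1 and 2.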