Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-16 Thread Tang Chen


On 07/16/2015 05:20 AM, Tejun Heo wrote:

On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
...

-   /* and there's no empty block */
-   if (bi->start >= bi->end)
+   /* and there's no empty or non-exist block */
+   if (bi->start >= bi->end ||
+   memblock_overlaps_region(&memblock.memory,
+   bi->start, bi->end - bi->start) == -1)

Ugh can you please change memblock_overlaps_region() to return
bool instead?


Well, I think memblock_overlaps_region() is designed to return
the index of the region overlapping with the given region.

Of course for now, it is only called by memblock_is_region_reserved().

Will post a patch to do this.
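
A minimal sketch of what that conversion could look like, assuming the
mm/memblock.c loop of the time (a reconstruction, not code from this
thread):

bool __init_memblock memblock_overlaps_region(struct memblock_type *type,
                                              phys_addr_t base, phys_addr_t size)
{
        unsigned long i;

        /* Scan the regions of this type; stop at the first overlap. */
        for (i = 0; i < type->cnt; i++)
                if (memblock_addrs_overlap(base, size,
                                           type->regions[i].base,
                                           type->regions[i].size))
                        break;

        /* true iff some region overlaps [base, base + size) */
        return i < type->cnt;
}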

Thanks.



Thanks.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-15 Thread Tang Chen


On 07/16/2015 05:20 AM, Tejun Heo wrote:

On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
...

-   /* and there's no empty block */
-   if (bi->start >= bi->end)
+   /* and there's no empty or non-exist block */
+   if (bi->start >= bi->end ||
+   memblock_overlaps_region(&memblock.memory,
+   bi->start, bi->end - bi->start) == -1)

Ugh can you please change memblock_overlaps_region() to return
bool instead?


Well, I think memblock_overlaps_region() is designed to return
the index of the region overlapping with the given region.
Maybe it had some users before.

Of course for now, it is only called by memblock_is_region_reserved().

It is OK to change the return value of memblock_overlaps_region() to bool.
But any caller of memblock_is_region_reserved() should also be changed.

I think it is OK to leave it there.
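
For reference, a sketch of that sole in-tree caller as it stood in
mm/memblock.c at the time (reconstructed, not quoted from this thread):

int __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
{
        /* A negative return means "no overlap", so this wrapper already
         * collapses the region index into a yes/no answer. */
        return memblock_overlaps_region(&memblock.reserved, base, size) >= 0;
}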

Thanks.



Thanks.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-15 Thread Tejun Heo
On Wed, Jul 01, 2015 at 11:16:54AM +0800, Tang Chen wrote:
...
> - /* and there's no empty block */
> - if (bi->start >= bi->end)
> + /* and there's no empty or non-exist block */
> + if (bi->start >= bi->end ||
> + memblock_overlaps_region(&memblock.memory,
> + bi->start, bi->end - bi->start) == -1)

Ugh can you please change memblock_overlaps_region() to return
bool instead?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-07 Thread Tang Chen


On 07/07/2015 12:42 AM, Yasuaki Ishimatsu wrote:

On Fri, 3 Jul 2015 09:26:05 +0800
Tang Chen  wrote:


On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:

Hi Tang,


On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220

Node 2 and 3 do not exist, but they are online.

According to your description of the patch, node 4 and 5 are mistakenly

Not node 4 and 5, it is node 2 and 3 which are mistakenly set online.

Please add the results of lscpu before/after applying the patch into the
description of your patch.

Feel free to add my
Reviewed-by: Yasuaki Ishimatsu 


Thanks for reviewing. Will update the patch soon.

Thanks.



Thanks,
Yasuaki Ishimatsu


set to online. Why does lscpu show the above result?

Well, actually it is not only lscpu that gives the strange result; under
/sys/devices/system/node, interfaces for node 2 and 3 are also created.

I haven't read the lscpu code, so I'm not sure how lscpu handles nodes.
But obviously, node 2 and 3 are set online, which is incorrect.

For now, I only found that in numa_cleanup_meminfo(), memory above
max_pfn is removed, but holes between nodes are not removed.

I think libraries are not able to handle this problem since nodes are
set online in the kernel. Seen from user space, there is no hole.

Thanks.


Thanks,
Yasuaki Ishimatsu

On Wed, 1 Jul 2015 15:55:30 +0800
Tang Chen  wrote:


On 07/01/2015 02:25 PM, Xishi Qiu wrote:

On 2015/7/1 11:16, Tang Chen wrote:


When parsing SRAT, all memory ranges are added into numa_meminfo.
In numa_init(), before entering numa_cleanup_meminfo(), all possible
memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
all ranges over max_pfn or empty.

But, this only works if the nodes are contiguous. Let's have a look
at the following example:

We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug

On boot, only node 0,1,2,3 exist.

And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 6000]
2. on node 0: [1, 200]
3. on node 1: [200, 400]
4. on node 4: [400, 600]
5. on node 5: [600, 800]
6. on node 2: [800, a00]
7. on node 3: [a00, a08]
8. on node 6: [c00, a08]
9. on node 7: [e00, a08]

And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
the end address is over max_pfn, which is a08. But 4 and 5
are not removed because their end addresses are less than max_pfn.
But in fact, node 4 and 5 don't exist.

In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.

Since memory ranges in node 4 and 5 are in numa_meminfo, in 
numa_register_memblks(),
node 4 and 5 will be mistakenly set to online.

In this patch, we use memblock_overlaps_region() to check if ranges in
numa_meminfo overlap with ranges in memory_block. Since memory_block contains
all available memory at boot time, if they overlap, it means the ranges
exist. If not, then remove them from numa_meminfo.


Hi Tang Chen,

What's the impact of this problem?

Command "numactl --hard" will show an empty node(no cpu and no memory,
but pgdat is created), right?

On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220

Node 2 and 3 do not exist, but they are online.

Thanks.


Thanks,
Xishi Qiu


Signed-off-by: Tang Chen 
---
 arch/x86/mm/numa.c       | 6 ++++--
 include/linux/memblock.h | 2 ++
 mm/memblock.c            | 2 +-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..0c55cc5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
bi->start = max(bi->start, low);
bi->end = min(bi->end, high);

-		/* and there's no empty block */
-		if (bi->start >= bi->end)
+		/* and there's no empty or non-exist block */
+		if (bi->start >= bi->end ||
+		    memblock_overlaps_region(&memblock.memory,
+				bi->start, bi->end - bi->start) == -1)
 			numa_remove_memblk_from(i--, mi);

Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-06 Thread Yasuaki Ishimatsu

On Fri, 3 Jul 2015 09:26:05 +0800
Tang Chen  wrote:

> 
> On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:
> > Hi Tang,
> >
> >> On my box, if I run lscpu, the output looks like this:
> >>
> >> NUMA node0 CPU(s): 0-14,128-142
> >> NUMA node1 CPU(s): 15-29,143-157
> >> NUMA node2 CPU(s):
> >> NUMA node3 CPU(s):
> >> NUMA node4 CPU(s): 62-76,190-204
> >> NUMA node5 CPU(s): 78-92,206-220
> >>
> >> Node 2 and 3 do not exist, but they are online.
> > According to your description of the patch, node 4 and 5 are mistakenly
> 
> Not node 4 and 5, it is node 2 and 3 which are mistakenly set online.

Please add the results of lscpu before/after applying the patch into the
description of your patch.

Feel free to add my 
Reviewed-by: Yasuaki Ishimatsu 

Thanks,
Yasuaki Ishimatsu

> > set to online. Why does lscpu show the above result?
> 
> Well, actually it is not only lscpu that gives the strange result; under
> /sys/devices/system/node, interfaces for node 2 and 3 are also created.
> 
> I haven't read the lscpu code, so I'm not sure how lscpu handles nodes.
> But obviously, node 2 and 3 are set online, which is incorrect.
> 
> For now, I only found that in numa_cleanup_meminfo(), memory above
> max_pfn is removed, but holes between nodes are not removed.
> 
> I think libraries are not able to handle this problem since nodes are
> set online in the kernel. Seen from user space, there is no hole.
> 
> Thanks.
> 
> >
> > Thanks,
> > Yasuaki Ishimatsu
> >
> > On Wed, 1 Jul 2015 15:55:30 +0800
> > Tang Chen  wrote:
> >
> >> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
> >>> On 2015/7/1 11:16, Tang Chen wrote:
> >>>
>  When parsing SRAT, all memory ranges are added into numa_meminfo.
>  In numa_init(), before entering numa_cleanup_meminfo(), all possible
>  memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
>  all ranges over max_pfn or empty.
> 
>  But, this only works if the nodes are contiguous. Let's have a look
>  at the following example:
> 
>  We have an SRAT like this:
>  SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
>  SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
>  SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
>  SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
>  SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
>  SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
>  SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
>  SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
>  SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
> 
>  On boot, only node 0,1,2,3 exist.
> 
>  And the numa_meminfo will look like this:
>  numa_meminfo.nr_blks = 9
>  1. on node 0: [0, 6000]
>  2. on node 0: [1, 200]
>  3. on node 1: [200, 400]
>  4. on node 4: [400, 600]
>  5. on node 5: [600, 800]
>  6. on node 2: [800, a00]
>  7. on node 3: [a00, a08]
>  8. on node 6: [c00, a08]
>  9. on node 7: [e00, a08]
> 
>  And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
>  the end address is over max_pfn, which is a08. But 4 and 5
>  are not removed because their end addresses are less than max_pfn.
>  But in fact, node 4 and 5 don't exist.
> 
>  In a word, numa_cleanup_meminfo() is not able to handle holes between 
>  nodes.
> 
>  Since memory ranges in node 4 and 5 are in numa_meminfo, in 
>  numa_register_memblks(),
>  node 4 and 5 will be mistakenly set to online.
> 
>  In this patch, we use memblock_overlaps_region() to check if ranges in
>  numa_meminfo overlap with ranges in memory_block. Since memory_block 
>  contains
>  all available memory at boot time, if they overlap, it means the ranges
>  exist. If not, then remove them from numa_meminfo.
> 
> >>> Hi Tang Chen,
> >>>
> >>> What's the impact of this problem?
> >>>
> >>> Command "numactl --hard" will show an empty node (no cpu and no memory,
> >>> but pgdat is created), right?
> >> On my box, if I run lscpu, the output looks like this:
> >>
> >> NUMA node0 CPU(s): 0-14,128-142
> >> NUMA node1 CPU(s): 15-29,143-157
> >> NUMA node2 CPU(s):
> >> NUMA node3 CPU(s):
> >> NUMA node4 CPU(s): 62-76,190-204
> >> NUMA node5 CPU(s): 78-92,206-220
> >>
> >> Node 2 and 3 do not exist, but they are online.
> >>
> >> Thanks.
> >>
> >>> Thanks,
> >>> Xishi Qiu
> >>>
>  Signed-off-by: Tang Chen 
>  ---
> arch/x86/mm/numa.c       | 6 ++++--
> include/linux/memblock.h | 2 ++
> mm/memblock.c            | 2 +-
> 3 files changed, 7 insertions(+), 3 deletions(-)
> 
>  diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>  index 4053bb5..0c55cc5 100644
>  --- a/arch/x86/mm/numa.c
> 

Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-02 Thread Tang Chen


On 07/02/2015 11:02 PM, Yasuaki Ishimatsu wrote:

Hi Tang,


On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220

Node 2 and 3 do not exist, but they are online.

According to your description of the patch, node 4 and 5 are mistakenly


Not node 4 and 5, it is node 2 and 3 which are mistakenly set online.

set to online. Why does lscpu show the above result?


Well, actually it is not only lscpu that gives the strange result; under
/sys/devices/system/node, interfaces for node 2 and 3 are also created.

I haven't read the lscpu code, so I'm not sure how lscpu handles nodes.
But obviously, node 2 and 3 are set online, which is incorrect.

For now, I only found that in numa_cleanup_meminfo(), memory above
max_pfn is removed, but holes between nodes are not removed.

I think libraries are not able to handle this problem since nodes are
set online in the kernel. Seen from user space, there is no hole.
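
A rough sketch of the onlining path being described, paraphrasing
numa_register_memblks() in arch/x86/mm/numa.c of that period (abbreviated;
the NODE_MIN_SIZE check and error handling are elided):

        /* Final registration loop: any node that still owns a numa_meminfo
         * block at this point gets a pgdat and is set online, whether or
         * not the backing memory actually exists. */
        for_each_node_mask(nid, node_possible_map) {
                u64 start = PFN_PHYS(max_pfn);
                u64 end = 0;

                for (i = 0; i < mi->nr_blks; i++) {
                        if (nid != mi->blk[i].nid)
                                continue;
                        start = min(mi->blk[i].start, start);
                        end = max(mi->blk[i].end, end);
                }

                if (start >= end)
                        continue;

                alloc_node_data(nid);   /* allocates pgdat; ends with node_set_online(nid) */
        }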

Thanks.



Thanks,
Yasuaki Ishimatsu

On Wed, 1 Jul 2015 15:55:30 +0800
Tang Chen  wrote:


On 07/01/2015 02:25 PM, Xishi Qiu wrote:

On 2015/7/1 11:16, Tang Chen wrote:


When parsing SRAT, all memory ranges are added into numa_meminfo.
In numa_init(), before entering numa_cleanup_meminfo(), all possible
memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
all ranges over max_pfn or empty.

But, this only works if the nodes are contiguous. Let's have a look
at the following example:

We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug

On boot, only node 0,1,2,3 exist.

And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 6000]
2. on node 0: [1, 200]
3. on node 1: [200, 400]
4. on node 4: [400, 600]
5. on node 5: [600, 800]
6. on node 2: [800, a00]
7. on node 3: [a00, a08]
8. on node 6: [c00, a08]
9. on node 7: [e00, a08]

And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
the end address is over max_pfn, which is a08. But 4 and 5
are not removed because their end addresses are less than max_pfn.
But in fact, node 4 and 5 don't exist.

In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.

Since memory ranges in node 4 and 5 are in numa_meminfo, in 
numa_register_memblks(),
node 4 and 5 will be mistakenly set to online.

In this patch, we use memblock_overlaps_region() to check if ranges in
numa_meminfo overlap with ranges in memory_block. Since memory_block contains
all available memory at boot time, if they overlap, it means the ranges
exist. If not, then remove them from numa_meminfo.


Hi Tang Chen,

What's the impact of this problem?

Command "numactl --hard" will show an empty node(no cpu and no memory,
but pgdat is created), right?

On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220

Node 2 and 3 do not exist, but they are online.

Thanks.


Thanks,
Xishi Qiu


Signed-off-by: Tang Chen 
---
   arch/x86/mm/numa.c       | 6 ++++--
   include/linux/memblock.h | 2 ++
   mm/memblock.c            | 2 +-
   3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..0c55cc5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
bi->start = max(bi->start, low);
bi->end = min(bi->end, high);
   
-		/* and there's no empty block */
-		if (bi->start >= bi->end)
+		/* and there's no empty or non-exist block */
+		if (bi->start >= bi->end ||
+		    memblock_overlaps_region(&memblock.memory,
+				bi->start, bi->end - bi->start) == -1)
numa_remove_memblk_from(i--, mi);
}
   
diff --git a/include/linux/memblock.h b/include/linux/memblock.h

index 0215ffd..3bf6cc1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
   int memblock_free(phys_addr_t base, phys_addr_t size);
   int memblock_reserve(phys_addr_t 

Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-02 Thread Yasuaki Ishimatsu
Hi Tang,

> On my box, if I run lscpu, the output looks like this:
> 
> NUMA node0 CPU(s): 0-14,128-142
> NUMA node1 CPU(s): 15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s): 62-76,190-204
> NUMA node5 CPU(s): 78-92,206-220
> 
> Node 2 and 3 do not exist, but they are online.

According to your description of the patch, node 4 and 5 are mistakenly
set to online. Why does lscpu show the above result?

Thanks,
Yasuaki Ishimatsu

On Wed, 1 Jul 2015 15:55:30 +0800
Tang Chen  wrote:

> 
> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
> > On 2015/7/1 11:16, Tang Chen wrote:
> >
> >> When parsing SRAT, all memory ranges are added into numa_meminfo.
> >> In numa_init(), before entering numa_cleanup_meminfo(), all possible
> >> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
> >> all ranges over max_pfn or empty.
> >>
> >> But, this only works if the nodes are contiguous. Let's have a look
> >> at the following example:
> >>
> >> We have an SRAT like this:
> >> SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
> >> SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
> >> SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
> >> SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
> >> SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
> >> SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
> >> SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
> >> SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
> >> SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
> >>
> >> On boot, only node 0,1,2,3 exist.
> >>
> >> And the numa_meminfo will look like this:
> >> numa_meminfo.nr_blks = 9
> >> 1. on node 0: [0, 6000]
> >> 2. on node 0: [1, 200]
> >> 3. on node 1: [200, 400]
> >> 4. on node 4: [400, 600]
> >> 5. on node 5: [600, 800]
> >> 6. on node 2: [800, a00]
> >> 7. on node 3: [a00, a08]
> >> 8. on node 6: [c00, a08]
> >> 9. on node 7: [e00, a08]
> >>
> >> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
> >> the end address is over max_pfn, which is a08. But 4 and 5
> >> are not removed because their end addresses are less than max_pfn.
> >> But in fact, node 4 and 5 don't exist.
> >>
> >> In a word, numa_cleanup_meminfo() is not able to handle holes between 
> >> nodes.
> >>
> >> Since memory ranges in node 4 and 5 are in numa_meminfo, in 
> >> numa_register_memblks(),
> >> node 4 and 5 will be mistakenly set to online.
> >>
> >> In this patch, we use memblock_overlaps_region() to check if ranges in
> >> numa_meminfo overlap with ranges in memory_block. Since memory_block 
> >> contains
> >> all available memory at boot time, if they overlap, it means the ranges
> >> exist. If not, then remove them from numa_meminfo.
> >>
> > Hi Tang Chen,
> >
> > What's the impact of this problem?
> >
> > Command "numactl --hard" will show an empty node (no cpu and no memory,
> > but pgdat is created), right?
> 
> On my box, if I run lscpu, the output looks like this:
> 
> NUMA node0 CPU(s): 0-14,128-142
> NUMA node1 CPU(s): 15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s): 62-76,190-204
> NUMA node5 CPU(s): 78-92,206-220
> 
> Node 2 and 3 do not exist, but they are online.
> 
> Thanks.
> 
> >
> > Thanks,
> > Xishi Qiu
> >
> >> Signed-off-by: Tang Chen 
> >> ---
> >>   arch/x86/mm/numa.c       | 6 ++++--
> >>   include/linux/memblock.h | 2 ++
> >>   mm/memblock.c            | 2 +-
> >>   3 files changed, 7 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> >> index 4053bb5..0c55cc5 100644
> >> --- a/arch/x86/mm/numa.c
> >> +++ b/arch/x86/mm/numa.c
> >> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
> >>bi->start = max(bi->start, low);
> >>bi->end = min(bi->end, high);
> >>   
> >> -  /* and there's no empty block */
> >> -  if (bi->start >= bi->end)
> >> +  /* and there's no empty or non-exist block */
> >> +  if (bi->start >= bi->end ||
> >> +  memblock_overlaps_region(&memblock.memory,
> >> +  bi->start, bi->end - bi->start) == -1)
> >>numa_remove_memblk_from(i--, mi);
> >>}
> >>   
> >> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> >> index 0215ffd..3bf6cc1 100644
> >> --- a/include/linux/memblock.h
> >> +++ b/include/linux/memblock.h
> >> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
> >>   int memblock_free(phys_addr_t base, phys_addr_t size);
> >>   int memblock_reserve(phys_addr_t base, phys_addr_t size);
> >>   void memblock_trim_memory(phys_addr_t align);
> >> +long memblock_overlaps_region(struct memblock_type *type,
> >> +phys_addr_t base, phys_addr_t size);
> >>   int 

Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-01 Thread Xishi Qiu
On 2015/7/1 15:55, Tang Chen wrote:

> 
> On 07/01/2015 02:25 PM, Xishi Qiu wrote:
>> On 2015/7/1 11:16, Tang Chen wrote:
>>
>>> When parsing SRAT, all memory ranges are added into numa_meminfo.
>>> In numa_init(), before entering numa_cleanup_meminfo(), all possible
>>> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
>>> all ranges over max_pfn or empty.
>>>
>>> But, this only works if the nodes are contiguous. Let's have a look
>>> at the following example:
>>>
>>> We have an SRAT like this:
>>> SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
>>> SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
>>> SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
>>> SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
>>> SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
>>> SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
>>> SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
>>> SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
>>> SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
>>>
>>> On boot, only node 0,1,2,3 exist.
>>>
>>> And the numa_meminfo will look like this:
>>> numa_meminfo.nr_blks = 9
>>> 1. on node 0: [0, 6000]
>>> 2. on node 0: [1, 200]
>>> 3. on node 1: [200, 400]
>>> 4. on node 4: [400, 600]
>>> 5. on node 5: [600, 800]
>>> 6. on node 2: [800, a00]
>>> 7. on node 3: [a00, a08]
>>> 8. on node 6: [c00, a08]
>>> 9. on node 7: [e00, a08]
>>>
>>> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
>>> the end address is over max_pfn, which is a08. But 4 and 5
>>> are not removed because their end addresses are less than max_pfn.
>>> But in fact, node 4 and 5 don't exist.
>>>
>>> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
>>>
>>> Since memory ranges in node 4 and 5 are in numa_meminfo, in 
>>> numa_register_memblks(),
>>> node 4 and 5 will be mistakenly set to online.
>>>
>>> In this patch, we use memblock_overlaps_region() to check if ranges in
>>> numa_meminfo overlap with ranges in memory_block. Since memory_block 
>>> contains
>>> all available memory at boot time, if they overlap, it means the ranges
>>> exist. If not, then remove them from numa_meminfo.
>>>
>> Hi Tang Chen,
>>
>> What's the impact of this problem?
>>
>> Command "numactl --hard" will show an empty node (no cpu and no memory,
>> but pgdat is created), right?
> 
> On my box, if I run lscpu, the output looks like this:
> 
> NUMA node0 CPU(s): 0-14,128-142
> NUMA node1 CPU(s): 15-29,143-157
> NUMA node2 CPU(s):
> NUMA node3 CPU(s):
> NUMA node4 CPU(s): 62-76,190-204
> NUMA node5 CPU(s): 78-92,206-220
> 
> Node 2 and 3 do not exist, but they are online.
> 

Yes, because the flow is SRAT -> numa_meminfo -> alloc pgdat.
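
Spelled out as the rough call sequence in arch/x86/mm/numa.c (an
abbreviated sketch, reconstructed from the sources of the time rather
than quoted from the patch):

        static int __init numa_init(int (*init_func)(void))
        {
                int ret;

                ret = init_func();      /* e.g. x86_acpi_numa_init(): SRAT -> numa_meminfo */
                if (ret < 0)
                        return ret;

                ret = numa_cleanup_meminfo(&numa_meminfo);      /* trims only > max_pfn and empty blocks */
                if (ret < 0)
                        return ret;

                return numa_register_memblks(&numa_meminfo);    /* allocates pgdat, sets nodes online */
        }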


Thanks,
Xishi Qiu


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-01 Thread Tang Chen


On 07/01/2015 02:25 PM, Xishi Qiu wrote:

On 2015/7/1 11:16, Tang Chen wrote:


When parsing SRAT, all memory ranges are added into numa_meminfo.
In numa_init(), before entering numa_cleanup_meminfo(), all possible
memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
all ranges over max_pfn or empty.

But, this only works if the nodes are contiguous. Let's have a look
at the following example:

We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug

On boot, only node 0,1,2,3 exist.

And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 6000]
2. on node 0: [1, 200]
3. on node 1: [200, 400]
4. on node 4: [400, 600]
5. on node 5: [600, 800]
6. on node 2: [800, a00]
7. on node 3: [a00, a08]
8. on node 6: [c00, a08]
9. on node 7: [e00, a08]

And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
the end address is over max_pfn, which is a08. But 4 and 5
are not removed because their end addresses are less than max_pfn.
But in fact, node 4 and 5 don't exist.

In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.

Since memory ranges in node 4 and 5 are in numa_meminfo, in 
numa_register_memblks(),
node 4 and 5 will be mistakenly set to online.

In this patch, we use memblock_overlaps_region() to check if ranges in
numa_meminfo overlap with ranges in memory_block. Since memory_block contains
all available memory at boot time, if they overlap, it means the ranges
exist. If not, then remove them from numa_meminfo.


Hi Tang Chen,

What's the impact of this problem?

Command "numactl --hard" will show an empty node(no cpu and no memory,
but pgdat is created), right?


On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220

Node 2 and 3 do not exist, but they are online.

Thanks.



Thanks,
Xishi Qiu


Signed-off-by: Tang Chen 
---
  arch/x86/mm/numa.c       | 6 ++++--
  include/linux/memblock.h | 2 ++
  mm/memblock.c            | 2 +-
  3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..0c55cc5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
bi->start = max(bi->start, low);
bi->end = min(bi->end, high);
  
-		/* and there's no empty block */
-		if (bi->start >= bi->end)
+		/* and there's no empty or non-exist block */
+		if (bi->start >= bi->end ||
+		    memblock_overlaps_region(&memblock.memory,
+				bi->start, bi->end - bi->start) == -1)
numa_remove_memblk_from(i--, mi);
}
  
diff --git a/include/linux/memblock.h b/include/linux/memblock.h

index 0215ffd..3bf6cc1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
  int memblock_free(phys_addr_t base, phys_addr_t size);
  int memblock_reserve(phys_addr_t base, phys_addr_t size);
  void memblock_trim_memory(phys_addr_t align);
+long memblock_overlaps_region(struct memblock_type *type,
+ phys_addr_t base, phys_addr_t size);
  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 1b444c7..55b5f9f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
  }
  
-static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
+long __init_memblock memblock_overlaps_region(struct memblock_type *type,
phys_addr_t base, phys_addr_t size)
  {
unsigned long i;






--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-01 Thread Xishi Qiu
On 2015/7/1 11:16, Tang Chen wrote:

> When parsing SRAT, all memory ranges are added into numa_meminfo.
> In numa_init(), before entering numa_cleanup_meminfo(), all possible
> memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
> all ranges over max_pfn or empty.
> 
> But, this only works if the nodes are contiguous. Let's have a look
> at the following example:
> 
> We have an SRAT like this:
> SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
> SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
> SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
> SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
> SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
> SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
> SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
> SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
> SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
> 
> On boot, only node 0,1,2,3 exist.
> 
> And the numa_meminfo will look like this:
> numa_meminfo.nr_blks = 9
> 1. on node 0: [0, 6000]
> 2. on node 0: [1, 200]
> 3. on node 1: [200, 400]
> 4. on node 4: [400, 600]
> 5. on node 5: [600, 800]
> 6. on node 2: [800, a00]
> 7. on node 3: [a00, a08]
> 8. on node 6: [c00, a08]
> 9. on node 7: [e00, a08]
> 
> And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
> the end address is over max_pfn, which is a08. But 4 and 5
> are not removed because their end addresses are less than max_pfn.
> But in fact, node 4 and 5 don't exist.
> 
> In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.
> 
> Since memory ranges in node 4 and 5 are in numa_meminfo, in 
> numa_register_memblks(),
> node 4 and 5 will be mistakenly set to online.
> 
> In this patch, we use memblock_overlaps_region() to check if ranges in
> numa_meminfo overlap with ranges in memory_block. Since memory_block contains
> all available memory at boot time, if they overlap, it means the ranges
> exist. If not, then remove them from numa_meminfo.
> 

Hi Tang Chen,

What's the impact of this problem?

Command "numactl --hard" will show an empty node(no cpu and no memory,
but pgdat is created), right?

Thanks,
Xishi Qiu

> Signed-off-by: Tang Chen 
> ---
>  arch/x86/mm/numa.c       | 6 ++++--
>  include/linux/memblock.h | 2 ++
>  mm/memblock.c            | 2 +-
>  3 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 4053bb5..0c55cc5 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
>   bi->start = max(bi->start, low);
>   bi->end = min(bi->end, high);
>  
> - /* and there's no empty block */
> - if (bi->start >= bi->end)
> + /* and there's no empty or non-exist block */
> + if (bi->start >= bi->end ||
> + memblock_overlaps_region(&memblock.memory,
> + bi->start, bi->end - bi->start) == -1)
>   numa_remove_memblk_from(i--, mi);
>   }
>  
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 0215ffd..3bf6cc1 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
>  int memblock_free(phys_addr_t base, phys_addr_t size);
>  int memblock_reserve(phys_addr_t base, phys_addr_t size);
>  void memblock_trim_memory(phys_addr_t align);
> +long memblock_overlaps_region(struct memblock_type *type,
> +   phys_addr_t base, phys_addr_t size);
>  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
>  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 1b444c7..55b5f9f 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
>   return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
>  }
>  
> -static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
> +long __init_memblock memblock_overlaps_region(struct memblock_type *type,
>   phys_addr_t base, phys_addr_t size)
>  {
>   unsigned long i;
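
For context, the remainder of memblock_overlaps_region() as it stood in
mm/memblock.c at the time was roughly the following (a reconstructed
sketch, not part of the quoted patch):

        for (i = 0; i < type->cnt; i++) {
                phys_addr_t rgnbase = type->regions[i].base;
                phys_addr_t rgnsize = type->regions[i].size;
                if (memblock_addrs_overlap(base, size, rgnbase, rgnsize))
                        break;
        }

        /* index of the overlapping region, or -1 if none */
        return (i < type->cnt) ? i : -1;
}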



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-01 Thread Xishi Qiu
On 2015/7/1 15:55, Tang Chen wrote:

 
 On 07/01/2015 02:25 PM, Xishi Qiu wrote:
 On 2015/7/1 11:16, Tang Chen wrote:

 When parsing SRAT, all memory ranges are added into numa_meminfo.
 In numa_init(), before entering numa_cleanup_meminfo(), all possible
 memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
 all ranges over max_pfn or empty.

 But, this only works if the nodes are continuous. Let's have a look
 at the following example:

 We have an SRAT like this:
 SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
 SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
 SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
 SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
 SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
 SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
 SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
 SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
 SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug

 On boot, only node 0,1,2,3 exist.

 And the numa_meminfo will look like this:
 numa_meminfo.nr_blks = 9
 1. on node 0: [0, 6000]
 2. on node 0: [1, 200]
 3. on node 1: [200, 400]
 4. on node 4: [400, 600]
 5. on node 5: [600, 800]
 6. on node 2: [800, a00]
 7. on node 3: [a00, a08]
 8. on node 6: [c00, a08]
 9. on node 7: [e00, a08]

 And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
 the end address is over max_pfn, which is a08. But 4 and 5
 are not removed because their end addresses are less then max_pfn.
 But in fact, node 4 and 5 don't exist.

 In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.

 Since memory ranges in node 4 and 5 are in numa_meminfo, in 
 numa_register_memblks(),
 node 4 and 5 will be mistakenly set to online.

 In this patch, we use memblock_overlaps_region() to check if ranges in
 numa_meminfo overlap with ranges in memory_block. Since memory_block
 contains all available memory at boot time, if they overlap, it means
 the ranges exist. If not, then remove them from numa_meminfo.

 Hi Tang Chen,

 What's the impact of this problem?

 The command numactl --hardware will show an empty node (no cpu and no memory,
 but pgdat is created), right?
 
 On my box, if I run lscpu, the output looks like this:
 
 NUMA node0 CPU(s): 0-14,128-142
 NUMA node1 CPU(s): 15-29,143-157
 NUMA node2 CPU(s):
 NUMA node3 CPU(s):
 NUMA node4 CPU(s): 62-76,190-204
 NUMA node5 CPU(s): 78-92,206-220
 
 Nodes 2 and 3 do not exist, but they are online.
 

Yes, because of the chain srat -> numa_meminfo -> alloc pgdat.


Thanks,
Xishi Qiu
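
(For anyone who wants to check a box for the same symptom, a small hedged sketch follows; it only assumes the standard sysfs file /sys/devices/system/node/online on a NUMA-enabled kernel and is not part of the patch under discussion.)

#include <stdio.h>

/*
 * Print the kernel's online-node mask as exposed by sysfs.
 * Assumed path: /sys/devices/system/node/online (standard on NUMA kernels).
 */
int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/node/online", "r");

	if (!f) {
		perror("open /sys/devices/system/node/online");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("online nodes: %s", buf);
	fclose(f);
	return 0;
}

(On an affected machine the printed mask still includes the empty nodes -- here, 2 and 3 -- matching the lscpu output quoted above.)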


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-01 Thread Tang Chen


On 07/01/2015 02:25 PM, Xishi Qiu wrote:

On 2015/7/1 11:16, Tang Chen wrote:


When parsing SRAT, all memory ranges are added into numa_meminfo.
In numa_init(), before entering numa_cleanup_meminfo(), all possible
memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
all ranges over max_pfn or empty.

But this only works if the nodes are contiguous. Let's have a look
at the following example:

We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug

On boot, only node 0,1,2,3 exist.

And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 6000]
2. on node 0: [1, 200]
3. on node 1: [200, 400]
4. on node 4: [400, 600]
5. on node 5: [600, 800]
6. on node 2: [800, a00]
7. on node 3: [a00, a08]
8. on node 6: [c00, a08]
9. on node 7: [e00, a08]

And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
the end address is over max_pfn, which is a08. But 4 and 5
are not removed because their end addresses are less than max_pfn.
But in fact, node 4 and 5 don't exist.

In short, numa_cleanup_meminfo() is not able to handle holes between nodes.

Since memory ranges in node 4 and 5 are in numa_meminfo, in
numa_register_memblks(), node 4 and 5 will be mistakenly set to online.

In this patch, we use memblock_overlaps_region() to check if ranges in
numa_meminfo overlap with ranges in memory_block. Since memory_block contains
all available memory at boot time, if they overlap, it means the ranges
exist. If not, then remove them from numa_meminfo.


Hi Tang Chen,

What's the impact of this problem?

The command numactl --hardware will show an empty node (no cpu and no memory,
but pgdat is created), right?


On my box, if I run lscpu, the output looks like this:

NUMA node0 CPU(s): 0-14,128-142
NUMA node1 CPU(s): 15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s): 62-76,190-204
NUMA node5 CPU(s): 78-92,206-220

Nodes 2 and 3 do not exist, but they are online.

Thanks.



Thanks,
Xishi Qiu


Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
---
  arch/x86/mm/numa.c       | 6 ++++--
  include/linux/memblock.h | 2 ++
  mm/memblock.c            | 2 +-
  3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..0c55cc5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 		bi->start = max(bi->start, low);
 		bi->end = min(bi->end, high);
 
-		/* and there's no empty block */
-		if (bi->start >= bi->end)
+		/* and there's no empty or non-exist block */
+		if (bi->start >= bi->end ||
+		    memblock_overlaps_region(&memblock.memory,
+					     bi->start, bi->end - bi->start) == -1)
 			numa_remove_memblk_from(i--, mi);
 	}
  
diff --git a/include/linux/memblock.h b/include/linux/memblock.h

index 0215ffd..3bf6cc1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
  int memblock_free(phys_addr_t base, phys_addr_t size);
  int memblock_reserve(phys_addr_t base, phys_addr_t size);
  void memblock_trim_memory(phys_addr_t align);
+long memblock_overlaps_region(struct memblock_type *type,
+ phys_addr_t base, phys_addr_t size);
  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 1b444c7..55b5f9f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
 	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
 }
 
-static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
+long __init_memblock memblock_overlaps_region(struct memblock_type *type,
 					phys_addr_t base, phys_addr_t size)
  {
unsigned long i;






--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-07-01 Thread Xishi Qiu
On 2015/7/1 11:16, Tang Chen wrote:

 When parsing SRAT, all memory ranges are added into numa_meminfo.
 In numa_init(), before entering numa_cleanup_meminfo(), all possible
 memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
 all ranges over max_pfn or empty.
 
 But this only works if the nodes are contiguous. Let's have a look
 at the following example:
 
 We have an SRAT like this:
 SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
 SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
 SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
 SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
 SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
 SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
 SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
 SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
 SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug
 
 On boot, only node 0,1,2,3 exist.
 
 And the numa_meminfo will look like this:
 numa_meminfo.nr_blks = 9
 1. on node 0: [0, 6000]
 2. on node 0: [1, 200]
 3. on node 1: [200, 400]
 4. on node 4: [400, 600]
 5. on node 5: [600, 800]
 6. on node 2: [800, a00]
 7. on node 3: [a00, a08]
 8. on node 6: [c00, a08]
 9. on node 7: [e00, a08]
 
 And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
 the end address is over max_pfn, which is a08. But 4 and 5
 are not removed because their end addresses are less than max_pfn.
 But in fact, node 4 and 5 don't exist.
 
 In short, numa_cleanup_meminfo() is not able to handle holes between nodes.
 
 Since memory ranges in node 4 and 5 are in numa_meminfo, in
 numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
 
 In this patch, we use memblock_overlaps_region() to check if ranges in
 numa_meminfo overlap with ranges in memory_block. Since memory_block contains
 all available memory at boot time, if they overlap, it means the ranges
 exist. If not, then remove them from numa_meminfo.
 

Hi Tang Chen,

What's the impact of this problem?

The command numactl --hardware will show an empty node (no cpu and no memory,
but pgdat is created), right?

Thanks,
Xishi Qiu

 Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
 ---
  arch/x86/mm/numa.c       | 6 ++++--
  include/linux/memblock.h | 2 ++
  mm/memblock.c            | 2 +-
  3 files changed, 7 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
 index 4053bb5..0c55cc5 100644
 --- a/arch/x86/mm/numa.c
 +++ b/arch/x86/mm/numa.c
 @@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
 	bi->start = max(bi->start, low);
 	bi->end = min(bi->end, high);
 
 -	/* and there's no empty block */
 -	if (bi->start >= bi->end)
 +	/* and there's no empty or non-exist block */
 +	if (bi->start >= bi->end ||
 +	    memblock_overlaps_region(&memblock.memory,
 +				     bi->start, bi->end - bi->start) == -1)
   numa_remove_memblk_from(i--, mi);
   }
  
 diff --git a/include/linux/memblock.h b/include/linux/memblock.h
 index 0215ffd..3bf6cc1 100644
 --- a/include/linux/memblock.h
 +++ b/include/linux/memblock.h
 @@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
  int memblock_free(phys_addr_t base, phys_addr_t size);
  int memblock_reserve(phys_addr_t base, phys_addr_t size);
  void memblock_trim_memory(phys_addr_t align);
 +long memblock_overlaps_region(struct memblock_type *type,
 +   phys_addr_t base, phys_addr_t size);
  int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
  int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
 diff --git a/mm/memblock.c b/mm/memblock.c
 index 1b444c7..55b5f9f 100644
 --- a/mm/memblock.c
 +++ b/mm/memblock.c
 @@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
 	return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
 }
 
 -static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
 +long __init_memblock memblock_overlaps_region(struct memblock_type *type,
   phys_addr_t base, phys_addr_t size)
  {
   unsigned long i;



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1] mem-hotplug: Handle node hole when initializing numa_meminfo.

2015-06-30 Thread Tang Chen
When parsing SRAT, all memory ranges are added into numa_meminfo.
In numa_init(), before entering numa_cleanup_meminfo(), all possible
memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes
all ranges over max_pfn or empty.

But this only works if the nodes are contiguous. Let's have a look
at the following example:

We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x-0x5fff]
SRAT: Node 0 PXM 0 [mem 0x1-0x1ff]
SRAT: Node 1 PXM 1 [mem 0x200-0x3ff]
SRAT: Node 4 PXM 2 [mem 0x400-0x5ff] hotplug
SRAT: Node 5 PXM 3 [mem 0x600-0x7ff] hotplug
SRAT: Node 2 PXM 4 [mem 0x800-0x9ff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa00-0xbff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc00-0xdff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe00-0xfff] hotplug

On boot, only node 0,1,2,3 exist.

And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 6000]
2. on node 0: [1, 200]
3. on node 1: [200, 400]
4. on node 4: [400, 600]
5. on node 5: [600, 800]
6. on node 2: [800, a00]
7. on node 3: [a00, a08]
8. on node 6: [c00, a08]
9. on node 7: [e00, a08]

And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because
the end address is over max_pfn, which is a08. But 4 and 5
are not removed because their end addresses are less than max_pfn.
But in fact, node 4 and 5 don't exist.

In short, numa_cleanup_meminfo() is not able to handle holes between nodes.

Since memory ranges in node 4 and 5 are in numa_meminfo, in
numa_register_memblks(), node 4 and 5 will be mistakenly set to online.

In this patch, we use memblock_overlaps_region() to check if ranges in
numa_meminfo overlap with ranges in memory_block. Since memory_block contains
all available memory at boot time, if they overlap, it means the ranges
exist. If not, then remove them from numa_meminfo.

Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
---
 arch/x86/mm/numa.c       | 6 ++++--
 include/linux/memblock.h | 2 ++
 mm/memblock.c            | 2 +-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..0c55cc5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -246,8 +246,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi)
bi->start = max(bi->start, low);
bi->end = min(bi->end, high);
 
-   /* and there's no empty block */
-   if (bi->start >= bi->end)
+   /* and there's no empty or non-exist block */
+   if (bi->start >= bi->end ||
+	    memblock_overlaps_region(&memblock.memory,
+				     bi->start, bi->end - bi->start) == -1)
numa_remove_memblk_from(i--, mi);
}
 
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 0215ffd..3bf6cc1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -77,6 +77,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
 void memblock_trim_memory(phys_addr_t align);
+long memblock_overlaps_region(struct memblock_type *type,
+ phys_addr_t base, phys_addr_t size);
 int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
diff --git a/mm/memblock.c b/mm/memblock.c
index 1b444c7..55b5f9f 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -91,7 +91,7 @@ static unsigned long __init_memblock memblock_addrs_overlap(phys_addr_t base1, p
return ((base1 < (base2 + size2)) && (base2 < (base1 + size1)));
 }
 
-static long __init_memblock memblock_overlaps_region(struct memblock_type *type,
+long __init_memblock memblock_overlaps_region(struct memblock_type *type,
phys_addr_t base, phys_addr_t size)
 {
unsigned long i;
-- 
1.8.4.2
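
(To make the changelog example concrete, here is a self-contained sketch that replays the cleanup over scaled-down stand-ins for the nine blocks above -- the archived addresses are truncated -- once with only the old empty/over-max_pfn test and once with the added overlap test. It models the kernel helpers rather than reusing them; all values are invented for illustration.)

#include <stdio.h>

typedef unsigned long long u64;

struct blk { int nid; u64 start, end; };

/* Boot-present memory (nodes 0,1,2,3), scaled stand-ins for the real ranges. */
static const struct blk memory[] = {
	{ 0, 0x000, 0x200 }, { 1, 0x200, 0x400 },
	{ 2, 0x800, 0xa00 }, { 3, 0xa00, 0xa80 },
};

static int overlaps_memory(u64 start, u64 end)
{
	for (unsigned i = 0; i < sizeof(memory) / sizeof(memory[0]); i++)
		if (start < memory[i].end && memory[i].start < end)
			return 1;
	return 0;
}

int main(void)
{
	const u64 max_addr = 0xa80;	/* plays the role of max_pfn */
	struct blk mi[] = {		/* numa_meminfo after SRAT parsing,
					   with the two node-0 blocks merged */
		{ 0, 0x000, 0x200 }, { 1, 0x200, 0x400 },
		{ 4, 0x400, 0x600 }, { 5, 0x600, 0x800 },	/* holes */
		{ 2, 0x800, 0xa00 }, { 3, 0xa00, 0xc00 },
		{ 6, 0xc00, 0xe00 }, { 7, 0xe00, 0x1000 },
	};

	for (unsigned i = 0; i < sizeof(mi) / sizeof(mi[0]); i++) {
		u64 start = mi[i].start;
		u64 end = mi[i].end < max_addr ? mi[i].end : max_addr;
		int old_keep = start < end;	/* old empty/over-max_pfn test */
		int new_keep = old_keep && overlaps_memory(start, end);

		printf("node %d: old=%s new=%s\n", mi[i].nid,
		       old_keep ? "keep" : "drop", new_keep ? "keep" : "drop");
	}
	return 0;
}

(With the old test alone, the blocks on nodes 6 and 7 are dropped but the holes on nodes 4 and 5 survive; with the overlap test they are dropped as well, so only nodes 0-3 remain -- exactly the behavior the changelog describes.)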

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

