Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-11-03 Thread Kamezawa Hiroyuki

On 2015/10/31 4:42, Luck, Tony wrote:

>> If each memory controller has the same distance/latency, you (your firmware)
>> don't need to allocate reliable memory per each memory controller.
>> If distance is problem, another node should be allocated.
>>
>> ...is the behavior(splitting zone) really required ?
>
> It's useful from a memory bandwidth perspective to have allocations
> spread across both memory controllers. Keeping a whole bunch of
> Xeon cores fed needs all the bandwidth you can get.

Hmm. But physical address layout is not related to dual memory controller.
I think reliable range can be contiguous by firmware...

-Kame


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-30 Thread Luck, Tony
> If each memory controller has the same distance/latency, you (your firmware) 
> don't need
> to allocate reliable memory per each memory controller.
> If distance is problem, another node should be allocated.
>
> ...is the behavior(splitting zone) really required ?

It's useful from a memory bandwidth perspective to have allocations
spread across both memory controllers. Keeping a whole bunch of
Xeon cores fed needs all the bandwidth you can get.

Socket0 is also a problem.  We want to mirror <4GB addresses because
there is a bunch of critical stuff there (entire kernel text+data). But we
can currently only mirror one block per memory controller, so we end up
with just 2GB mirrored (the 2GB-4GB range is MMIO).  This isn't enough
for even a small machine (I have 128GB on node0 ... but that is really the
bare minimum configuration ... 2GB is only enough to cover the "struct
page" allocations for node0).  I really have to allocate some more mirror
from the other memory controller.

-Tony
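
The 2GB figure above can be sanity-checked with a little arithmetic. The following is a minimal sketch, assuming 4 KiB pages and a 64-byte "struct page" (common on x86-64, but neither value is stated in the thread); the helper name struct_page_bytes is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of "struct page" metadata needed to describe
 * mem_bytes of RAM. page_size and struct_page_size are assumptions supplied
 * by the caller, not values taken from the thread.
 */
static uint64_t struct_page_bytes(uint64_t mem_bytes, uint64_t page_size,
                                  uint64_t struct_page_size)
{
    assert(page_size != 0);
    /* One struct page is needed per physical page frame. */
    return (mem_bytes / page_size) * struct_page_size;
}
```

Under those assumptions, struct_page_bytes(128ULL << 30, 4096, 64) comes to 2 GiB, which is consistent with the remark that 2GB of mirror is only enough for the "struct page" allocations of a 128GB node0.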



Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-30 Thread Kamezawa Hiroyuki
On 2015/10/23 10:44, Luck, Tony wrote:
> First part of each memory controller. I have two memory controllers on each 
> node
> 

If each memory controller has the same distance/latency, you (your firmware) 
don't need
to allocate reliable memory per each memory controller.
If distance is problem, another node should be allocated.

...is the behavior(splitting zone) really required ?

Thanks,
-Kame



Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-22 Thread Xishi Qiu
On 2015/10/15 21:32, Taku Izumi wrote:

> Xeon E7 v3 based systems supports Address Range Mirroring
> and UEFI BIOS complied with UEFI spec 2.5 can notify which
> ranges are reliable (mirrored) via EFI memory map.
> Now Linux kernel utilize its information and allocates
> boot time memory from reliable region.
> 
> My requirement is:
>   - allocate kernel memory from reliable region
>   - allocate user memory from non-reliable region
> 
> In order to meet my requirement, ZONE_MOVABLE is useful.
> By arranging non-reliable range into ZONE_MOVABLE,
> reliable memory is only used for kernel allocations.
> 
> This patch extends existing "kernelcore" option and
> introduces kernelcore=reliable option. By specifying
> "reliable" instead of specifying the amount of memory,
> non-reliable region will be arranged into ZONE_MOVABLE.
> 
> Earlier discussion is at:
>  https://lkml.org/lkml/2015/10/9/24
> 
> For example, suppose 2-nodes system with the following
>  memory range:
>   node 0 [mem 0x1000-0x00109fff]
>   node 1 [mem 0x0010a000-0x00209fff]
> 
> and the following ranges are marked as reliable (*):
>   [0x-0x0001]
>   [0x0001-0x00018000]
>   [0x0010a000-0x00112000]
> 
> If you specify kernelcore=reliable, Movable zones are
> arranged like the following:
>   Movable zone start for each node
> Node 0: 0x00018000
> Node 1: 0x00112000
> 
> (*) I specified the following instead of using UEFI BIOS
> complied with UEFI spec 2.5,
> efi_fake_mem=4G@0:0x1,2G@0x10a000:0x1,2G@4G:0x1
> efi_fake_mem is found at:
>  git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git
>  tags/efi-next
> 
> Signed-off-by: Taku Izumi 
> ---
>  Documentation/kernel-parameters.txt |  9 -
>  mm/page_alloc.c | 26 ++
>  2 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index cd5312f..b2c8c13 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1663,7 +1663,8 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>  
>   keepinitrd  [HW,ARM]
>  
> - kernelcore=nn[KMG]  [KNL,X86,IA-64,PPC] This parameter
> + kernelcore= Format: nn[KMG] | "reliable"
> + [KNL,X86,IA-64,PPC] This parameter
>   specifies the amount of memory usable by the kernel
>   for non-movable allocations.  The requested amount is
>   spread evenly throughout all nodes in the system. The
> @@ -1679,6 +1680,12 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   use the HighMem zone if it exists, and the Normal
>   zone if it does not.
>  
> + Instead of specifying the amount of memory (nn[KMS]),
> + you can specify "reliable" option. In case "reliable"
> + option is specified, reliable memory is used for
> + non-movable allocations and remaining memory is used
> + for Movable pages.
> +
>   kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
>   Format: [,poll interval]
>   The controller # is the number of the ehci usb debug
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index beda417..d0b3ac9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -221,6 +221,7 @@ static unsigned long __meminitdata 
> arch_zone_highest_possible_pfn[MAX_NR_ZONES];
>  static unsigned long __initdata required_kernelcore;
>  static unsigned long __initdata required_movablecore;
>  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> +static bool reliable_kernelcore __initdata;
>  
>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>  int movable_zone;
> @@ -5618,6 +5619,25 @@ static void __init 
> find_zone_movable_pfns_for_nodes(void)
>   }
>  
>   /*
> +  * If kernelcore=reliable is specified, ignore movablecore option
> +  */
> + if (reliable_kernelcore) {
> + for_each_memblock(memory, r) {
> + if (memblock_is_mirror(r))
> + continue;
> +
> + nid = r->nid;
> +
> + usable_startpfn = PFN_DOWN(r->base);
> + zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
> + min(usable_startpfn, zone_movable_pfn[nid]) :
> + usable_startpfn;
> + }
> +
> + goto out2;

Hi Taku,

If user set 0-1G is mirrored memory, 1-2G is normal memory, and 2-4G is hole.
Then the movable zone will start at 2G?

Thanks,
Xishi Qiu

> + }
> +
> + /*
>* If 
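
Xishi's question can be explored by tracing the quoted loop in a small userspace model. This is only a sketch: struct model_region, model_movable_start and the 4 KiB page shift are illustrative assumptions, not kernel definitions, and any adjustment the kernel makes after the loop is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_NODES  4
#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Simplified stand-in for a memblock region; not the kernel's struct. */
struct model_region {
    uint64_t base;   /* physical start address */
    uint64_t size;   /* length in bytes */
    int nid;         /* NUMA node id */
    int mirrored;    /* non-zero if the range is reliable (mirrored) */
};

/*
 * Same rule as the quoted loop: for each node, the movable zone starts at
 * the lowest base of any non-mirrored region. Holes are simply absent from
 * the region list, so they cannot influence the result. 0 means "unset".
 */
static void model_movable_start(const struct model_region *r, size_t n,
                                uint64_t start_pfn[MODEL_MAX_NODES])
{
    for (int i = 0; i < MODEL_MAX_NODES; i++)
        start_pfn[i] = 0;

    for (size_t i = 0; i < n; i++) {
        if (r[i].mirrored)
            continue;
        assert(r[i].nid >= 0 && r[i].nid < MODEL_MAX_NODES);

        uint64_t pfn = r[i].base >> MODEL_PAGE_SHIFT;
        uint64_t *slot = &start_pfn[r[i].nid];

        *slot = (*slot == 0 || pfn < *slot) ? pfn : *slot;
    }
}
```

Feeding the model the layout from the question (0-1G mirrored, 1-2G not mirrored, 2-4G a hole and therefore not listed) gives a start PFN for node 0 that corresponds to 1G, the base of the first non-mirrored region, rather than 2G.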

Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-22 Thread Luck, Tony
First part of each memory controller. I have two memory controllers on each node

Sent from my iPhone

> On Oct 22, 2015, at 18:01, Izumi, Taku <izumi.t...@jp.fujitsu.com> wrote:
> 
> Dear Tony,
> 
>> -Original Message-
>> From: Luck, Tony [mailto:tony.l...@intel.com]
>> Sent: Friday, October 23, 2015 8:27 AM
>> To: Kamezawa, Hiroyuki/亀澤 寛之; Izumi, Taku/泉 拓; linux-kernel@vger.kernel.org; 
>> linux...@kvack.org
>> Cc: qiuxi...@huawei.com; m...@csn.ul.ie; a...@linux-foundation.org; Hansen, 
>> Dave; m...@codeblueprint.co.uk
>> Subject: RE: [PATCH] mm: Introduce kernelcore=reliable option
>> 
>>> I think /proc/zoneinfo can show detailed numbers per zone. Do we need some 
>>> for meminfo ?
>> 
>> I wrote a little script (attached) to summarize /proc/zoneinfo ... on my 
>> system it says
>> 
>> $ zoneinfo
>> Node      Normal     Movable         DMA       DMA32
>>    0        0.00   103020.07        8.94     1554.46
>>    1     9284.54    89870.43
>>    2     9626.33    94050.09
>>    3     9602.82    93650.04
>> 
>> Not sure why I have zero Normal memory free on node0.  The sum of all those
>> free counts is 410667.72 MB ... which is close enough to the boot time 
>> message
>> showing the amount of mirror/total memory:
>> 
>> [0.00] efi: Memory: 80979/420096M mirrored memory
>> 
>> but a fair amount of the 80G of mirrored memory seems to have been miscounted
>> as Movable instead of Normal. Perhaps this is because I have two blocks of 
>> mirrored
>> memory on each node and the movable zone code doesn't expect that?
> 
> You were saying that OS view of memory of node is something like the
> following ?
>
>    Node X:  |MM--MM|
>       (legend) M: mirrored  -: not mirrored
> 
> If so, is this a real Box's configuration?
> Sorry, I haven't got a real Address Range Mirror capable boxes yet ...
> I thought mirroring range is concatenated at the first part of each node.
> 
> Sincerely,
> Taku Izumi
> 


RE: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-22 Thread Izumi, Taku
 Dear Tony,

> -Original Message-
> From: Luck, Tony [mailto:tony.l...@intel.com]
> Sent: Friday, October 23, 2015 8:27 AM
> To: Kamezawa, Hiroyuki/亀澤 寛之; Izumi, Taku/泉 拓; linux-kernel@vger.kernel.org; 
> linux...@kvack.org
> Cc: qiuxi...@huawei.com; m...@csn.ul.ie; a...@linux-foundation.org; Hansen, 
> Dave; m...@codeblueprint.co.uk
> Subject: RE: [PATCH] mm: Introduce kernelcore=reliable option
> 
> > I think /proc/zoneinfo can show detailed numbers per zone. Do we need some 
> > for meminfo ?
> 
> I wrote a little script (attached) to summarize /proc/zoneinfo ... on my 
> system it says
> 
> $ zoneinfo
> Node      Normal     Movable         DMA       DMA32
>    0        0.00   103020.07        8.94     1554.46
>    1     9284.54    89870.43
>    2     9626.33    94050.09
>    3     9602.82    93650.04
> 
> Not sure why I have zero Normal memory free on node0.  The sum of all those
> free counts is 410667.72 MB ... which is close enough to the boot time message
> showing the amount of mirror/total memory:
> 
> [0.00] efi: Memory: 80979/420096M mirrored memory
> 
> but a fair amount of the 80G of mirrored memory seems to have been miscounted
> as Movable instead of Normal. Perhaps this is because I have two blocks of 
> mirrored
> memory on each node and the movable zone code doesn't expect that?

 You were saying that OS view of memory of node is something like the following ?

    Node X:  |MM--MM|
       (legend) M: mirrored  -: not mirrored

 If so, is this a real Box's configuration?
 Sorry, I haven't got a real Address Range Mirror capable boxes yet ...
 I thought mirroring range is concatenated at the first part of each node.

 Sincerely,
 Taku Izumi
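
The |MM--MM| layout matters because a single per-node boundary cannot keep both mirrored blocks out of the Movable zone. A minimal sketch of that accounting follows; struct layout_range and mirrored_bytes_in_movable are hypothetical names, and any sizes passed in are illustrative rather than taken from a real machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative description of one physical range within a node. */
struct layout_range {
    uint64_t base;   /* physical start address */
    uint64_t size;   /* length in bytes */
    int mirrored;    /* non-zero if the range is mirrored */
};

/*
 * Bytes of mirrored memory lying at or above movable_start, i.e. mirrored
 * memory that a single boundary per node would place in the Movable zone.
 */
static uint64_t mirrored_bytes_in_movable(const struct layout_range *r,
                                          size_t n, uint64_t movable_start)
{
    uint64_t total = 0;

    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t end = r[i].base + r[i].size;

        if (!r[i].mirrored || end <= movable_start)
            continue;

        /* Count only the part of the range above the boundary. */
        uint64_t lo = r[i].base > movable_start ? r[i].base : movable_start;
        total += end - lo;
    }
    return total;
}
```

For a node laid out as mirrored, non-mirrored, mirrored, with the boundary at the start of the non-mirrored block, the whole second mirrored block is returned as sitting in Movable. That would be consistent with the miscounting Tony describes, although the thread does not confirm it as the cause.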



RE: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-22 Thread Luck, Tony
> I think /proc/zoneinfo can show detailed numbers per zone. Do we need some
> for meminfo ?

I wrote a little script (attached) to summarize /proc/zoneinfo ... on my system
it says

$ zoneinfo
Node      Normal     Movable         DMA       DMA32
   0        0.00   103020.07        8.94     1554.46
   1     9284.54    89870.43
   2     9626.33    94050.09
   3     9602.82    93650.04

Not sure why I have zero Normal memory free on node0.  The sum of all those
free counts is 410667.72 MB ... which is close enough to the boot time message
showing the amount of mirror/total memory:

[0.00] efi: Memory: 80979/420096M mirrored memory

but a fair amount of the 80G of mirrored memory seems to have been miscounted
as Movable instead of Normal. Perhaps this is because I have two blocks of
mirrored memory on each node and the movable zone code doesn't expect that?

-Tony 





zoneinfo
Description: zoneinfo
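
The attached script itself is not preserved in this archive. As a rough idea of what such a summary involves, here is a minimal sketch in C. It is not Tony's script. It assumes the /proc/zoneinfo layout in which a "Node N, zone NAME" line opens each zone and a later "pages free COUNT" line gives its free page count; all zi_* names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ZI_MAX_ZONES 64

struct zi_zone {
    int node;                  /* NUMA node id */
    char name[16];             /* zone name, e.g. "Normal" */
    unsigned long free_pages;  /* value of the "pages free" line */
};

struct zi_summary {
    struct zi_zone zones[ZI_MAX_ZONES];
    int count;     /* number of zones recorded */
    int current;   /* index of the zone being parsed, or -1 */
};

void zi_init(struct zi_summary *s)
{
    assert(s != NULL);
    s->count = 0;
    s->current = -1;
}

/* Feed one line of zoneinfo text into the summary. */
void zi_feed_line(struct zi_summary *s, const char *line)
{
    int node;
    char name[16];
    unsigned long pages;

    assert(s != NULL && line != NULL);
    if (sscanf(line, "Node %d, zone %15s", &node, name) == 2) {
        /* New zone header; drop it silently if the table is full. */
        s->current = -1;
        if (s->count < ZI_MAX_ZONES) {
            s->current = s->count++;
            s->zones[s->current].node = node;
            strcpy(s->zones[s->current].name, name);
            s->zones[s->current].free_pages = 0;
        }
    } else if (s->current >= 0 &&
               sscanf(line, " pages free %lu", &pages) == 1) {
        s->zones[s->current].free_pages = pages;
    }
}

/* Read a whole stream (for example an fopen()ed /proc/zoneinfo). */
int zi_read(struct zi_summary *s, FILE *in)
{
    char line[256];

    zi_init(s);
    while (fgets(line, sizeof(line), in))
        zi_feed_line(s, line);
    return s->count;
}

/* Convert a page count to MB for a given page size in bytes. */
double zi_pages_to_mb(unsigned long pages, unsigned long page_size)
{
    return (double)pages * (double)page_size / (1024.0 * 1024.0);
}
```

A caller would open /proc/zoneinfo, pass the stream to zi_read(), and print zi_pages_to_mb() for each recorded zone, grouped by node, to get a table like the one above. The page size should come from the running system rather than being hard-coded.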


Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-22 Thread Kamezawa Hiroyuki

On 2015/10/22 3:17, Luck, Tony wrote:

> +	if (reliable_kernelcore) {
> +		for_each_memblock(memory, r) {
> +			if (memblock_is_mirror(r))
> +				continue;
>
> Should we have a safety check here that there is some mirrored memory?  If you
> give the kernelcore=reliable option on a machine which doesn't have any mirror
> configured, then we'll mark all memory as removable.

You're right.

> What happens then?  Do kernel allocations fail?  Or do they fall back to using
> removable memory?

Maybe the kernel cannot boot because NORMAL zone is empty.

> Is there a /proc or /sys file that shows the current counts for the removable
> zone?  I just tried this patch with a high percentage of memory marked as
> mirror ... but I'd like to see how much is actually being used to tune things
> a bit.

I think /proc/zoneinfo can show detailed numbers per zone. Do we need some for
meminfo ?

Thanks,
-Kame




RE: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-21 Thread Luck, Tony
> +	if (reliable_kernelcore) {
> +		for_each_memblock(memory, r) {
> +			if (memblock_is_mirror(r))
> +				continue;

Should we have a safety check here that there is some mirrored memory?  If you
give the kernelcore=reliable option on a machine which doesn't have any mirror
configured, then we'll mark all memory as removable.  What happens then?  Do
kernel allocations fail?  Or do they fall back to using removable memory?

Is there a /proc or /sys file that shows the current counts for the removable
zone?  I just tried this patch with a high percentage of memory marked as
mirror ... but I'd like to see how much is actually being used to tune things
a bit.

-Tony
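
One possible shape for such a safety check, written as a userspace sketch rather than kernel code: honour the "reliable" request only if at least one range is actually mirrored, and otherwise keep the default arrangement. The thread does not say how the check was eventually implemented; struct mem_range, enum core_policy and choose_policy are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative memory range; not a kernel structure. */
struct mem_range {
    unsigned long long base;   /* physical start address */
    unsigned long long size;   /* length in bytes */
    int mirrored;              /* non-zero if the range is mirrored */
};

enum core_policy {
    POLICY_DEFAULT,    /* behave as if kernelcore=reliable was not given */
    POLICY_RELIABLE    /* place non-mirrored memory in the Movable zone */
};

/*
 * Return POLICY_RELIABLE only when it was requested and at least one range
 * is mirrored. With no mirrored memory, every range would otherwise end up
 * movable, leaving nothing for non-movable kernel allocations.
 */
static enum core_policy choose_policy(const struct mem_range *r, size_t n,
                                      int reliable_requested)
{
    assert(r != NULL || n == 0);
    if (!reliable_requested)
        return POLICY_DEFAULT;

    for (size_t i = 0; i < n; i++) {
        if (r[i].mirrored)
            return POLICY_RELIABLE;
    }
    return POLICY_DEFAULT;
}
```

Falling back quietly is only one choice; printing a warning at boot when the option is ignored would make the situation easier to diagnose.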


Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-19 Thread Xishi Qiu
On 2015/10/20 8:34, Izumi, Taku wrote:

>  Hi Xishi,
> 
>> On 2015/10/15 21:32, Taku Izumi wrote:
>>
>>> Xeon E7 v3 based systems supports Address Range Mirroring
>>> and UEFI BIOS complied with UEFI spec 2.5 can notify which
>>> ranges are reliable (mirrored) via EFI memory map.
>>> Now Linux kernel utilize its information and allocates
>>> boot time memory from reliable region.
>>>
>>> My requirement is:
>>>   - allocate kernel memory from reliable region
>>>   - allocate user memory from non-reliable region
>>>
>>> In order to meet my requirement, ZONE_MOVABLE is useful.
>>> By arranging non-reliable range into ZONE_MOVABLE,
>>> reliable memory is only used for kernel allocations.
>>>
>>> This patch extends existing "kernelcore" option and
>>> introduces kernelcore=reliable option. By specifying
>>> "reliable" instead of specifying the amount of memory,
>>> non-reliable region will be arranged into ZONE_MOVABLE.
>>>
>>> Earlier discussion is at:
>>>  https://lkml.org/lkml/2015/10/9/24
>>>
>>
>> Hi Taku,
>>
>> If user don't want to waste a lot of memory, and he only set
>> a few memory to mirrored memory, then the kernelcore is very
>> small, right? That means OS will have a very small normal zone
>> and a very large movable zone.
> 
>  Right.
> 
>> Kernel allocation could only use the unmovable zone. As the
>> normal zone is very small, the kernel allocation maybe OOM,
>> right?
> 
>  Right.
> 
>> Do you mean that we will reuse the movable zone in short-term
>> solution and create a new zone(mirrored zone) in future?
> 
>  If there is that kind of requirements, I don't oppose 
>  creating a new zone.
> 

As far as I know, some apps (e.g. databases) may only be able to use
the normal zone.

Thanks,
Xishi Qiu

>  Sincerely,
>  Taku Izumi
> 
> .
> 





RE: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-19 Thread Izumi, Taku
 Hi Xishi,

> On 2015/10/15 21:32, Taku Izumi wrote:
> 
> > Xeon E7 v3 based systems support Address Range Mirroring,
> > and a UEFI BIOS compliant with UEFI spec 2.5 can report which
> > ranges are reliable (mirrored) via the EFI memory map.
> > The Linux kernel now uses this information and allocates
> > boot-time memory from the reliable region.
> >
> > My requirements are:
> >   - allocate kernel memory from the reliable region
> >   - allocate user memory from the non-reliable region
> >
> > ZONE_MOVABLE is useful for meeting these requirements.
> > By arranging the non-reliable range into ZONE_MOVABLE,
> > reliable memory is used only for kernel allocations.
> >
> > This patch extends the existing "kernelcore" option and
> > introduces the kernelcore=reliable option. By specifying
> > "reliable" instead of an amount of memory, the
> > non-reliable region will be arranged into ZONE_MOVABLE.
> >
> > Earlier discussion is at:
> >  https://lkml.org/lkml/2015/10/9/24
> >
> 
> Hi Taku,
> 
> If a user doesn't want to waste a lot of memory and sets only
> a small amount of memory as mirrored, then the kernelcore is very
> small, right? That means the OS will have a very small normal zone
> and a very large movable zone.

 Right.

> Kernel allocations can only use the unmovable zone. As the
> normal zone is very small, kernel allocations may run out of
> memory (OOM), right?

 Right.

> Do you mean that we will reuse the movable zone as a short-term
> solution and create a new zone (mirrored zone) in the future?

 If there is that kind of requirement, I don't oppose
 creating a new zone.

 Sincerely,
 Taku Izumi


Re: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-18 Thread Xishi Qiu
On 2015/10/15 21:32, Taku Izumi wrote:

> Xeon E7 v3 based systems support Address Range Mirroring,
> and a UEFI BIOS compliant with UEFI spec 2.5 can report which
> ranges are reliable (mirrored) via the EFI memory map.
> The Linux kernel now uses this information and allocates
> boot-time memory from the reliable region.
> 
> My requirements are:
>   - allocate kernel memory from the reliable region
>   - allocate user memory from the non-reliable region
> 
> ZONE_MOVABLE is useful for meeting these requirements.
> By arranging the non-reliable range into ZONE_MOVABLE,
> reliable memory is used only for kernel allocations.
> 
> This patch extends the existing "kernelcore" option and
> introduces the kernelcore=reliable option. By specifying
> "reliable" instead of an amount of memory, the
> non-reliable region will be arranged into ZONE_MOVABLE.
> 
> Earlier discussion is at:
>  https://lkml.org/lkml/2015/10/9/24
> 

Hi Taku,

If a user doesn't want to waste a lot of memory and sets only
a small amount of memory as mirrored, then the kernelcore is very
small, right? That means the OS will have a very small normal zone
and a very large movable zone.

Kernel allocations can only use the unmovable zone. As the
normal zone is very small, kernel allocations may run out of
memory (OOM), right?

Do you mean that we will reuse the movable zone as a short-term
solution and create a new zone (mirrored zone) in the future?

Thanks,
Xishi Qiu



[PATCH] mm: Introduce kernelcore=reliable option

2015-10-14 Thread Taku Izumi
Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS compliant with UEFI spec 2.5 can report which
ranges are reliable (mirrored) via the EFI memory map.
The Linux kernel now uses this information and allocates
boot-time memory from the reliable region.

My requirements are:
  - allocate kernel memory from the reliable region
  - allocate user memory from the non-reliable region

ZONE_MOVABLE is useful for meeting these requirements.
By arranging the non-reliable range into ZONE_MOVABLE,
reliable memory is used only for kernel allocations.

This patch extends the existing "kernelcore" option and
introduces the kernelcore=reliable option. By specifying
"reliable" instead of an amount of memory, the
non-reliable region will be arranged into ZONE_MOVABLE.

Earlier discussion is at:
 https://lkml.org/lkml/2015/10/9/24

For example, suppose a 2-node system with the following
memory ranges:
  node 0 [mem 0x0000000000001000-0x000000109fffffff]
  node 1 [mem 0x00000010a0000000-0x000000209fffffff]

and the following ranges are marked as reliable (*):
  [0x0000000000000000-0x0000000100000000]
  [0x0000000100000000-0x0000000180000000]
  [0x00000010a0000000-0x0000001120000000]

If you specify kernelcore=reliable, the Movable zones are
arranged as follows:
  Movable zone start for each node
    Node 0: 0x0000000180000000
    Node 1: 0x0000001120000000

(*) I specified the following instead of using a UEFI BIOS
compliant with UEFI spec 2.5:
 efi_fake_mem=4G@0:0x10000,2G@0x10a0000000:0x10000,2G@4G:0x10000
efi_fake_mem is found at:
 git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git
 tags/efi-next

Signed-off-by: Taku Izumi 
---
 Documentation/kernel-parameters.txt |  9 ++++++++-
 mm/page_alloc.c                     | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index cd5312f..b2c8c13 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1663,7 +1663,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 
 	keepinitrd	[HW,ARM]
 
-	kernelcore=nn[KMG]	[KNL,X86,IA-64,PPC] This parameter
+	kernelcore=	Format: nn[KMG] | "reliable"
+			[KNL,X86,IA-64,PPC] This parameter
 			specifies the amount of memory usable by the kernel
 			for non-movable allocations.  The requested amount is
 			spread evenly throughout all nodes in the system. The
@@ -1679,6 +1680,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			use the HighMem zone if it exists, and the Normal
 			zone if it does not.
 
+			Instead of specifying the amount of memory (nn[KMG]),
+			you can specify the "reliable" option. If "reliable"
+			is specified, reliable memory is used for
+			non-movable allocations and remaining memory is used
+			for Movable pages.
+
 	kgdbdbgp=	[KGDB,HW] kgdb over EHCI usb debug port.
 			Format: [,poll interval]
 			The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index beda417..d0b3ac9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -221,6 +221,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool reliable_kernelcore __initdata;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -5618,6 +5619,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
 	}
 
 	/*
+	 * If kernelcore=reliable is specified, ignore movablecore option
+	 */
+	if (reliable_kernelcore) {
+		for_each_memblock(memory, r) {
+			if (memblock_is_mirror(r))
+				continue;
+
+			nid = r->nid;
+
+			usable_startpfn = PFN_DOWN(r->base);
+			zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+				min(usable_startpfn, zone_movable_pfn[nid]) :
+				usable_startpfn;
+		}
+
+		goto out2;
+	}
+
+	/*
 	 * If movablecore=nn[KMG] was specified, calculate what size of
 	 * kernelcore that corresponds so that memory usable for
 	 * any allocation type is evenly spread. If both kernelcore
@@ -5873,6 +5893,12 @@ static int __init cmdline_parse_core(char *p, unsigned long *core)
  */
 static int __init cmdline_parse_kernelcore(char *p)
 {
+   /* parse 
