Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations

2015-05-19 Thread Xishi Qiu
On 2015/5/19 12:48, Tony Luck wrote:

> On Mon, May 18, 2015 at 8:01 PM, Xishi Qiu  wrote:
>> In part 2, does it mean that memory allocated by the kernel should come from
>> mirrored memory?
> 
> Yes. I want to use mirrored memory for all (or as many as
> possible) kernel allocations.
> 
>> I have heard of this feature (address range mirroring) before, and I changed
>> some code to test it (implementing memory allocations in specific physical areas).
>>
>> In my opinion, adding a new zone (ZONE_MIRROR) to hold the mirrored memory is
>> not a good idea. If there are XX discontiguous mirrored areas in one NUMA node,
>> there would have to be XX ZONE_MIRROR zones in one pgdat, which is impossible,
>> right?
> 
> With current h/w implementations XX is at most 2, and is possibly only 1
> on most nodes.  But we shouldn't depend on that.
> 
>> I think adding a new migrate type (MIGRATE_MIRROR) would be better; the
>> following output is from my modified kernel.
> 
> This sounds interesting.
> 
>> [root@localhost ~]# cat /proc/pagetypeinfo
>> Page block order: 9
>> Pages per block:  512
>>
>> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> ...
>> Node    0, zone      DMA, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
> ...
>> Node    0, zone    DMA32, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
> 
> I see all zero counts here ... which is fine.  I expect that systems
> will mirror all memory below 4GB ... but we should probably
> ignore the attribute for this range because we want to make

Hi Tony,

I think 0-4G will be entirely mirrored, so I change nothing for that range and
just ignore the mirror flag. (e.g. on a 4-socket machine where every socket has
32G of memory: node0: 0-4G and 4-8G mirrored, node1: 32-36G mirrored,
node2: 64-68G mirrored, node3: 96-100G mirrored)

> sure that the memory is still available for users that depend
> on getting memory that legacy devices can access. On systems
> that support address range mirror the <4GB area is <2% of even
> a small system (128GB seems to be the minimum rational configuration
> for a 4 socket machine ... you end up with that much if you populate
> every channel with just one 4GB DIMM). On a big system (in the TB
> range) <4GB area is a trivial rounding error.
> 
>> I also added a new flag (GFP_MIRROR), so the mirrored memory can be used from
>> both kernel space and user space. If there is no mirrored memory left, we will
>> fall back to the other migrate types.
> 
> But I *think* I want all kernel and no users to allocate mirror
> memory.  I'd like to not have to touch every place that allocates
> memory to add/clear this flag.
> 

If we only want the kernel to use mirrored memory, it is much easier.
I have some patches, but they are a little ugly and cover both user space
and kernel space.
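
The core of the kernel-space side is small; a simplified sketch (not the
actual patch -- it assumes a new __GFP_MIRROR bit next to the MIGRATE_MIRROR
free list, with the normal __rmqueue() fallback lists covering the case
where no mirrored pages are left):

	/*
	 * Simplified sketch only.  __GFP_MIRROR and MIGRATE_MIRROR are the
	 * names used in this thread, not mainline interfaces.
	 */
	static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
	{
		/* Mirrored requests are steered to their own free list. */
		if (gfp_flags & __GFP_MIRROR)
			return MIGRATE_MIRROR;

		/* Everything else keeps the existing grouping by mobility. */
		return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
			((gfp_flags & __GFP_RECLAIMABLE) != 0);
	}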

>> 1) kernel-space(pcp, page buddy, slab/slub ...):
>> -> use mirrored memory(e.g. /proc/sys/vm/mirrorable)
>> -> __alloc_pages_nodemask()
>> -> gfpflags_to_migratetype()
>> -> use MIGRATE_MIRROR list
> 
> I think you are telling me that we can do this, but I don't understand
> how the code would look.
> 
>> 2) user-space(syscall, madvise, mmap ...):
>> -> add VM_MIRROR flag in the vma
>> -> add GFP_MIRROR when page fault in the vma
>> -> __alloc_pages_nodemask()
>> -> use MIGRATE_MIRROR list
> 
> If we do let users have access to mirrored memory, then
> madvise/mmap seem a plausible way to allow it.  Not sure
> what access privileges are appropriate to allow it. I expect
> mirrored memory to be in short supply (the whole point of

I think allocations from some key processes (e.g. a database) are as important
as the kernel's, and in most cases the MCE handler just kills them on a memory
failure, so letting user space access mirrored memory may be a good way to
solve the problem.
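
Roughly what the user-space side could look like (a sketch only: VM_MIRROR
and GFP_MIRROR are the flags proposed above, while MADV_MIRROR and the two
helpers are made-up names):

	/*
	 * Sketch only: MADV_MIRROR, madvise_mirror() and fault_gfp_mask() are
	 * hypothetical; VM_MIRROR/__GFP_MIRROR follow the naming in this thread.
	 */
	static long madvise_mirror(struct vm_area_struct *vma)
	{
		/* Tag the vma; pages faulted into it will request mirrored memory. */
		vma->vm_flags |= VM_MIRROR;
		return 0;
	}

	static gfp_t fault_gfp_mask(struct vm_area_struct *vma)
	{
		gfp_t gfp = GFP_HIGHUSER_MOVABLE;

		/* Falls back to other migrate types when no mirror is left. */
		if (vma->vm_flags & VM_MIRROR)
			gfp |= __GFP_MIRROR;
		return gfp;
	}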

> address range mirror is to make do with a minimal amount
> of mirrored memory ... if you expect to want/have lots of
> mirrored memory, then just take the 50% hit in capacity
> and mirror everything and ignore all the s/w complexity).
> 
> Are your patches ready to be shared?

I'll rewrite and send them soon.

Thanks,
Xishi Qiu

> 
> -Tony



Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations

2015-05-18 Thread Tony Luck
On Mon, May 18, 2015 at 8:01 PM, Xishi Qiu  wrote:
> In part 2, does it mean that memory allocated by the kernel should come from
> mirrored memory?

Yes. I want to use mirrored memory for all (or as many as
possible) kernel allocations.

> I have heard of this feature (address range mirroring) before, and I changed
> some code to test it (implementing memory allocations in specific physical areas).
>
> In my opinion, adding a new zone (ZONE_MIRROR) to hold the mirrored memory is
> not a good idea. If there are XX discontiguous mirrored areas in one NUMA node,
> there would have to be XX ZONE_MIRROR zones in one pgdat, which is impossible,
> right?

With current h/w implementations XX is at most 2, and is possibly only 1
on most nodes.  But we shouldn't depend on that.

> I think adding a new migrate type (MIGRATE_MIRROR) would be better; the
> following output is from my modified kernel.

This sounds interesting.

> [root@localhost ~]# cat /proc/pagetypeinfo
> Page block order: 9
> Pages per block:  512
>
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
...
> Node    0, zone      DMA, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
...
> Node    0, zone    DMA32, type       Mirror      0      0      0      0      0      0      0      0      0      0      0

I see all zero counts here ... which is fine.  I expect that systems
will mirror all memory below 4GB ... but we should probably
ignore the attribute for this range because we want to make
sure that the memory is still available for users that depend
on getting memory that legacy devices can access. On systems
that support address range mirror the <4GB area is <2% of even
a small system (128GB seems to be the minimum rational configuration
for a 4 socket machine ... you end up with that much if you populate
every channel with just one 4GB DIMM). On a big system (in the TB
range) <4GB area is a trivial rounding error.

> I also added a new flag (GFP_MIRROR), so the mirrored memory can be used from
> both kernel space and user space. If there is no mirrored memory left, we will
> fall back to the other migrate types.

But I *think* I want all kernel and no users to allocate mirror
memory.  I'd like to not have to touch every place that allocates
memory to add/clear this flag.

> 1) kernel-space(pcp, page buddy, slab/slub ...):
> -> use mirrored memory(e.g. /proc/sys/vm/mirrorable)
> -> __alloc_pages_nodemask()
> -> gfpflags_to_migratetype()
> -> use MIGRATE_MIRROR list

I think you are telling me that we can do this, but I don't understand
how the code would look.

> 2) user-space(syscall, madvise, mmap ...):
> -> add VM_MIRROR flag in the vma
> -> add GFP_MIRROR when page fault in the vma
> -> __alloc_pages_nodemask()
> -> use MIGRATE_MIRROR list

If we do let users have access to mirrored memory, then
madvise/mmap seem a plausible way to allow it.  Not sure
what access privileges are appropriate to allow it. I expect
mirrored memory to be in short supply (the whole point of
address range mirror is to make do with a minimal amount
of mirrored memory ... if you expect to want/have lots of
mirrored memory, then just take the 50% hit in capacity
and mirror everything and ignore all the s/w complexity).

Are your patches ready to be shared?

-Tony


Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations

2015-05-18 Thread Xishi Qiu
On 2015/5/9 0:44, Tony Luck wrote:

> Some high end Intel Xeon systems report uncorrectable memory errors
> as a recoverable machine check. Linux has included code for some time
> to process these and just signal the affected processes (or even
> recover completely if the error was in a read only page that can be
> replaced by reading from disk).
> 
> But we have no recovery path for errors encountered during kernel
> code execution. Except for some very specific cases we are unlikely
> to ever be able to recover.
> 
> Enter memory mirroring. Actually the 3rd generation of memory mirroring.
> 
> Gen1: All memory is mirrored
>   Pro: No s/w enabling - h/w just gets good data from other side of the 
> mirror
>   Con: Halves effective memory capacity available to OS/applications
> Gen2: Partial memory mirror - just mirror memory behind some memory 
> controllers
>   Pro: Keep more of the capacity
>   Con: Nightmare to enable. Have to choose between allocating from
>mirrored memory for safety vs. NUMA local memory for performance
> Gen3: Address range partial memory mirror - some mirror on each memory 
> controller
>   Pro: Can tune the amount of mirror and keep NUMA performance
>   Con: I have to write memory management code to implement
> 
> The current plan is just to use mirrored memory for kernel allocations. This
> has been broken into two phases:
> 1) This patch series - find the mirrored memory, use it for boot time 
> allocations
> 2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused
>mirrored memory from mm/memblock.c and only give it out to select kernel
>allocations (this is still being scoped because page_alloc.c is scary).
> 

Hi Tony,

In part 2, does it mean that memory allocated by the kernel should come from
mirrored memory?

I have heard of this feature (address range mirroring) before, and I changed
some code to test it (implementing memory allocations in specific physical areas).

In my opinion, adding a new zone (ZONE_MIRROR) to hold the mirrored memory is
not a good idea. If there are XX discontiguous mirrored areas in one NUMA node,
there would have to be XX ZONE_MIRROR zones in one pgdat, which is impossible,
right?

I think adding a new migrate type (MIGRATE_MIRROR) would be better; the
following output is from my modified kernel.

[root@localhost ~]# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      3
Node    0, zone      DMA, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      1      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable     14      7      6      1      3      0      1      0      0      0      0
Node    0, zone    DMA32, type  Reclaimable     15      2      2      1      1      2      1      1      0      0      0
Node    0, zone    DMA32, type      Movable      3     24     52     58     31      2      1      1      1   3231
Node    0, zone    DMA32, type       Mirror      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable     80     12      6      7      3      1     67     58     23     11      0
Node    0, zone   Normal, type  Reclaimable      6      6      8     11      5      3      0      1      0      0      0
Node    0, zone   Normal, type      Movable   6198618675363     13      4      3      0      2   4074
Node    0, zone   Normal, type       Mirror      0      0      0      0      0      0      0      0      0      0   1024
Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0


Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations

2015-05-08 Thread Tony Luck
On Fri, May 8, 2015 at 1:49 PM, Andrew Morton  wrote:
> What I mean is: allow userspace to consume ZONE_MIRROR memory because
> we can snatch it back if it is needed for kernel memory.

For suitable interpretations of "snatch it back" ... if there is none
free in a GFP_NOWAIT request, then we are doomed.  But we
could maintain some high/low watermarks to arrange the snatching
when mirrored memory is getting low, rather than all the way out.
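
To make the watermark idea concrete, something along these lines (every
name here is hypothetical, just a sketch of the "phase three" direction):

	/*
	 * Hypothetical sketch: when free mirrored pages drop below a low mark,
	 * start migrating user pages out of mirrored pageblocks until we are
	 * back above the high mark, so a later GFP_NOWAIT kernel allocation
	 * never finds the mirror list completely empty.
	 */
	static void check_mirror_watermarks(struct zone *zone)
	{
		unsigned long free = zone_free_mirror_pages(zone);	/* made up */

		if (free < zone->mirror_low_wmark)
			wake_mirror_reclaim(zone,
					    zone->mirror_high_wmark - free);
	}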

It's worth a look - but perhaps at phase three. It would make life
a bit easier for people to get the right amount of mirror. If they
guess too high they are still wasting some memory, because
every mirrored page consumes two pages of DIMM capacity. But without this
sort of trick all the extra mirrored memory would be totally wasted.

-Tony


Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations

2015-05-08 Thread Andrew Morton
On Fri, 8 May 2015 13:38:52 -0700 Tony Luck  wrote:

> > Will surplus ZONE_MIRROR memory be available for regular old movable
> > allocations?
> ZONE_MIRROR and ZONE_MOVABLE are pretty much opposites. We
> only want kernel allocations in mirror memory, and we can't allow any
> kernel allocations in movable (cause they'll pin it).

What I mean is: allow userspace to consume ZONE_MIRROR memory because
we can snatch it back if it is needed for kernel memory.



Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations

2015-05-08 Thread Tony Luck
On Fri, May 8, 2015 at 1:03 PM, Andrew Morton  wrote:
> Looks good to me.  What happens to these patches while ZONE_MIRROR is
> being worked on?

I think these patches can go into the kernel now while I figure
out the next phase - there is some value in just this part. We'll
have all memory <4GB mirrored to cover the kernel code/data.
Adding the boot time allocations mostly means the page structures
(in terms of total amount of memory).
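
Roughly, that boot-time piece tags the mirrored ranges reported by the
firmware in memblock and tries them first for early allocations, falling
back rather than failing the boot. A sketch, with placeholder helper names
(this is not the actual patch):

	/* Sketch only -- both allocation helpers below are placeholders. */
	phys_addr_t __init memblock_alloc_prefer_mirror(phys_addr_t size,
							phys_addr_t align,
							int nid)
	{
		phys_addr_t addr;

		/* First try ranges the firmware marked as mirrored. */
		addr = alloc_from_mirror_ranges(size, align, nid);
		if (addr)
			return addr;

		/* Out of mirror: warn once, fall back to ordinary memory. */
		pr_warn_once("mirrored memory exhausted for boot allocations\n");
		return alloc_from_any_ranges(size, align, nid);
	}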

> I'm wondering about phase II.  What does "select kernel allocations"
> mean?  I assume we can't say "all kernel allocations" because that can
> sometimes be "almost all memory".  How are you planning on implementing
> this?  A new __GFP_foo flag, then sprinkle that into selected sites?

Some of that is TBD - there are some clear places where we have bounded
amounts of memory that we'd like to pull into the mirror area. E.g. loadable
modules - on a specific machine an administrator can easily see which modules
are loaded, tally up the sizes, and then adjust the amount of mirrored memory.
I don't think we necessarily need to get to 100% ... if we can avoid 9/10
errors crashing the machine - that moves the reliability needle enough to
make a difference. Phase 2 may turn into phase 2a, 2b, 2c etc. as we
pick on certain areas.

Oh - there will be some sysfs or debugfs stats too - so people can
check that they have the right amount of mirror memory under application
load. Too little and they'll be at risk because kernel allocations will
fall back to non-mirrored. Too much, and they are wasting memory.
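
Even something very small would do here -- e.g. a debugfs file exposing a
couple of counters (a sketch only; the two counter helpers are hypothetical):

	#include <linux/debugfs.h>
	#include <linux/fs.h>
	#include <linux/init.h>
	#include <linux/seq_file.h>

	/* Sketch only: free_mirror_pages()/mirror_fallback_count() are made up. */
	static int mirror_stats_show(struct seq_file *m, void *v)
	{
		seq_printf(m, "free_mirror_pages:      %lu\n", free_mirror_pages());
		seq_printf(m, "mirror_fallback_allocs: %lu\n", mirror_fallback_count());
		return 0;
	}

	static int mirror_stats_open(struct inode *inode, struct file *file)
	{
		return single_open(file, mirror_stats_show, NULL);
	}

	static const struct file_operations mirror_stats_fops = {
		.open		= mirror_stats_open,
		.read		= seq_read,
		.llseek		= seq_lseek,
		.release	= single_release,
	};

	static int __init mirror_stats_init(void)
	{
		debugfs_create_file("mirror_stats", 0444, NULL, NULL,
				    &mirror_stats_fops);
		return 0;
	}
	late_initcall(mirror_stats_init);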

> Will surplus ZONE_MIRROR memory be available for regular old movable
> allocations?
ZONE_MIRROR and ZONE_MOVABLE are pretty much opposites. We
only want kernel allocations in mirror memory, and we can't allow any
kernel allocations in movable (cause they'll pin it).

> I suggest you run the design ideas by Mel before getting into
> implementation.
Good idea - when I have something fit to be seen, I'll share
with Mel.

-Tony


Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations

2015-05-08 Thread Andrew Morton
On Fri, 8 May 2015 09:44:21 -0700 Tony Luck  wrote:

> Some high end Intel Xeon systems report uncorrectable memory errors
> as a recoverable machine check. Linux has included code for some time
> to process these and just signal the affected processes (or even
> recover completely if the error was in a read only page that can be
> replaced by reading from disk).
> 
> But we have no recovery path for errors encountered during kernel
> code execution. Except for some very specific cases we are unlikely
> to ever be able to recover.
> 
> Enter memory mirroring. Actually the 3rd generation of memory mirroring.
> 
> Gen1: All memory is mirrored
>   Pro: No s/w enabling - h/w just gets good data from other side of the 
> mirror
>   Con: Halves effective memory capacity available to OS/applications
> Gen2: Partial memory mirror - just mirror memory behind some memory 
> controllers
>   Pro: Keep more of the capacity
>   Con: Nightmare to enable. Have to choose between allocating from
>mirrored memory for safety vs. NUMA local memory for performance
> Gen3: Address range partial memory mirror - some mirror on each memory 
> controller
>   Pro: Can tune the amount of mirror and keep NUMA performance
>   Con: I have to write memory management code to implement
> 
> The current plan is just to use mirrored memory for kernel allocations. This
> has been broken into two phases:
> 1) This patch series - find the mirrored memory, use it for boot time 
> allocations
> 2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused
>mirrored memory from mm/memblock.c and only give it out to select kernel
>allocations (this is still being scoped because page_alloc.c is scary).

Looks good to me.  What happens to these patches while ZONE_MIRROR is
being worked on?


I'm wondering about phase II.  What does "select kernel allocations"
mean?  I assume we can't say "all kernel allocations" because that can
sometimes be "almost all memory".  How are you planning on implementing
this?  A new __GFP_foo flag, then sprinkle that into selected sites?

Will surplus ZONE_MIRROR memory be available for regular old movable
allocations?

I suggest you run the design ideas by Mel before getting into
implementation.