On 2015/5/9 0:44, Tony Luck wrote:

> Some high end Intel Xeon systems report uncorrectable memory errors
> as a recoverable machine check. Linux has included code for some time
> to process these and just signal the affected processes (or even
> recover completely if the error was in a read only page that can be
> replaced by reading from disk).
> 
> But we have no recovery path for errors encountered during kernel
> code execution. Except for some very specific cases were are unlikely
> to ever be able to recover.
> 
> Enter memory mirroring. Actually 3rd generation of memory mirroing.
> 
> Gen1: All memory is mirrored
>       Pro: No s/w enabling - h/w just gets good data from other side of the 
> mirror
>       Con: Halves effective memory capacity available to OS/applications
> Gen2: Partial memory mirror - just mirror memory begind some memory 
> controllers
>       Pro: Keep more of the capacity
>       Con: Nightmare to enable. Have to choose between allocating from
>            mirrored memory for safety vs. NUMA local memory for performance
> Gen3: Address range partial memory mirror - some mirror on each memory 
> controller
>       Pro: Can tune the amount of mirror and keep NUMA performance
>       Con: I have to write memory management code to implement
> 
> The current plan is just to use mirrored memory for kernel allocations. This
> has been broken into two phases:
> 1) This patch series - find the mirrored memory, use it for boot time 
> allocations
> 2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused
>    mirrored memory from mm/memblock.c and only give it out to select kernel
>    allocations (this is still being scoped because page_alloc.c is scary).
> 

Hi Tony,

In part2, does it means the memory allocated from kernel should use mirrored 
memory?

I have heard of this feature(address range mirroring) before, and I changed some
code to test it(implement memory allocations in specific physical areas).

In my opinion, add a new zone(ZONE_MIRROR) to fill the mirrored memory is not a 
good
idea. If there are XX discontiguous mirrored areas in one numa node, there 
should be
XX ZONE_MIRROR zones in one pgdat, it is impossible, right?

I think add a new migrate type(MIGRATE_MIRROR) will be better, the following 
print
is from my changed kernel. 

[root@localhost ~]# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4  
    5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      1      1      0      2  
    1      1      0      1      0      0
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0  
    0      0      0      0      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0  
    0      0      0      0      0      3
Node    0, zone      DMA, type       Mirror      0      0      0      0      0  
    0      0      0      0      0      0
Node    0, zone      DMA, type      Reserve      0      0      0      0      0  
    0      0      0      0      1      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0  
    0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0  
    0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable     14      7      6      1      3  
    0      1      0      0      0      0
Node    0, zone    DMA32, type  Reclaimable     15      2      2      1      1  
    2      1      1      0      0      0
Node    0, zone    DMA32, type      Movable      3     24     52     58     31  
    2      1      1      1      3    231
Node    0, zone    DMA32, type       Mirror      0      0      0      0      0  
    0      0      0      0      0      0
Node    0, zone    DMA32, type      Reserve      0      0      0      0      0  
    0      0      0      0      0      1
Node    0, zone    DMA32, type          CMA      0      0      0      0      0  
    0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0  
    0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable     80     12      6      7      3  
    1     67     58     23     11      0
Node    0, zone   Normal, type  Reclaimable      6      6      8     11      5  
    3      0      1      0      0      0
Node    0, zone   Normal, type      Movable      6    198    618    675    363  
   13      4      3      0      2   4074
Node    0, zone   Normal, type       Mirror      0      0      0      0      0  
    0      0      0      0      0   1024
Node    0, zone   Normal, type      Reserve      0      0      0      0      0  
    0      0      0      0      0      1
Node    0, zone   Normal, type          CMA      0      0      0      0      0  
    0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0  
    0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable       Mirror      
Reserve          CMA      Isolate
Node 0, zone      DMA            1            0            6            0       
     1            0            0
Node 0, zone    DMA32            8           32          975            0       
     1            0            0
Node 0, zone   Normal          216          334        12760         2048       
     2            0            0
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4  
    5      6      7      8      9     10
Node    1, zone   Normal, type    Unmovable     18      2     19      3     21  
   28     13      0      1      1      0
Node    1, zone   Normal, type  Reclaimable      0      1      1      1      0  
    0      1      0      0      1      0
Node    1, zone   Normal, type      Movable      6     13      9      3      0  
    4      5      0      1      0   6970
Node    1, zone   Normal, type       Mirror      0      0      0      0      0  
    0      0      0      0      0   1024
Node    1, zone   Normal, type      Reserve      0      0      0      0      0  
    0      0      0      0      0      1
Node    1, zone   Normal, type          CMA      0      0      0      0      0  
    0      0      0      0      0      0
Node    1, zone   Normal, type      Isolate      0      0      0      0      0  
    0      0      0      0      0      0

Number of blocks type     Unmovable  Reclaimable      Movable       Mirror      
Reserve          CMA      Isolate
Node 1, zone   Normal          112            4        14218         2048       
     2            0            0


Also I add a new flag(GFP_MIRROR), then we can use the mirrored form both
kernel-space and user-space. If there is no mirrored memory, we will allocate
other types memory.

1) kernel-space(pcp, page buddy, slab/slub ...):
        -> use mirrored memory(e.g. /proc/sys/vm/mirrorable)
                -> __alloc_pages_nodemask()
                        ->gfpflags_to_migratetype()
                                -> use MIGRATE_MIRROR list
2) user-space(syscall, madvise, mmap ...):
        -> add VM_MIRROR flag in the vma
                -> add GFP_MIRROR when page fault in the vma
                        -> __alloc_pages_nodemask()
                                -> use MIGRATE_MIRROR list

Thanks,
Xishi Qiu

> Tony Luck (3):
>   mm/memblock: Add extra "flags" to memblock to allow selection of
>     memory based on attribute
>   mm/memblock: Allocate boot time data structures from mirrored memory
>   x86, mirror: x86 enabling - find mirrored memory ranges
> 
>  arch/s390/kernel/crash_dump.c |   5 +-
>  arch/sparc/mm/init_64.c       |   6 ++-
>  arch/x86/kernel/check.c       |   3 +-
>  arch/x86/kernel/e820.c        |   3 +-
>  arch/x86/kernel/setup.c       |   3 ++
>  arch/x86/mm/init_32.c         |   2 +-
>  arch/x86/platform/efi/efi.c   |  21 ++++++++
>  include/linux/efi.h           |   3 ++
>  include/linux/memblock.h      |  49 +++++++++++------
>  mm/cma.c                      |   6 ++-
>  mm/memblock.c                 | 123 
> +++++++++++++++++++++++++++++++++---------
>  mm/memtest.c                  |   3 +-
>  mm/nobootmem.c                |  14 ++++-
>  13 files changed, 188 insertions(+), 53 deletions(-)
> 



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to