Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
On 2015/5/19 12:48, Tony Luck wrote:
> On Mon, May 18, 2015 at 8:01 PM, Xishi Qiu wrote:
>> In part2, does it mean the memory allocated from the kernel should use
>> mirrored memory?
>
> Yes. I want to use mirrored memory for all (or as many as
> possible) kernel allocations.
>
>> I have heard of this feature (address range mirroring) before, and I changed
>> some code to test it (implement memory allocations in specific physical areas).
>>
>> In my opinion, adding a new zone (ZONE_MIRROR) to hold the mirrored memory is
>> not a good idea. If there are XX discontiguous mirrored areas in one NUMA
>> node, there would have to be XX ZONE_MIRROR zones in one pgdat, which is
>> impossible, right?
>
> With current h/w implementations XX is at most 2, and is possibly only 1
> on most nodes. But we shouldn't depend on that.
>
>> I think adding a new migrate type (MIGRATE_MIRROR) would be better; the
>> following print is from my changed kernel.
>
> This sounds interesting.
>
>> [root@localhost ~]# cat /proc/pagetypeinfo
>> Page block order: 9
>> Pages per block:  512
>>
>> Free pages count per migrate type at order   0  1  2  3  4  5  6  7  8  9  10
> ...
>> Node 0, zone   DMA, type  Mirror   0  0  0  0  0  0  0  0  0  0  0
> ...
>> Node 0, zone DMA32, type  Mirror   0  0  0  0  0  0  0  0  0  0  0
>
> I see all zero counts here ... which is fine. I expect that systems
> will mirror all memory below 4GB ... but we should probably
> ignore the attribute for this range because we want to make

Hi Tony,

I think 0-4G will be all mirrored, so I change nothing, just ignore the
mirror flag. (E.g. on a 4-socket machine where every socket has 32G of
memory: node0: 0-4G and 4-8G mirrored, node1: 32-36G mirrored, node2:
64-68G mirrored, node3: 96-100G mirrored.)

> sure that the memory is still available for users that depend
> on getting memory that legacy devices can access. On systems
> that support address range mirror the <4GB area is <2% of even
> a small system (128GB seems to be the minimum rational configuration
> for a 4 socket machine ... you end up with that much if you populate
> every channel with just one 4GB DIMM). On a big system (in the TB
> range) the <4GB area is a trivial rounding error.
>
>> Also I added a new flag (GFP_MIRROR), so we can use the mirrored memory from
>> both kernel-space and user-space. If there is no mirrored memory, we fall
>> back to allocating other types of memory.
>
> But I *think* I want all kernel and no users to allocate mirror
> memory. I'd like to not have to touch every place that allocates
> memory to add/clear this flag.

If we only want the kernel to use the mirrored memory, it is much easier.
I have some patches, but they are a little ugly and implement both user
and kernel.

>> 1) kernel-space (pcp, page buddy, slab/slub ...):
>>    -> use mirrored memory (e.g. /proc/sys/vm/mirrorable)
>>    -> __alloc_pages_nodemask()
>>    -> gfpflags_to_migratetype()
>>    -> use MIGRATE_MIRROR list
>
> I think you are telling me that we can do this, but I don't understand
> how the code would look.
>
>> 2) user-space (syscall, madvise, mmap ...):
>>    -> add VM_MIRROR flag in the vma
>>    -> add GFP_MIRROR when page fault in the vma
>>    -> __alloc_pages_nodemask()
>>    -> use MIGRATE_MIRROR list
>
> If we do let users have access to mirrored memory, then
> madvise/mmap seem a plausible way to allow it. Not sure
> what access privileges are appropriate to allow it. I expect
> mirrored memory to be in short supply (the whole point of
> address range mirror is to make do with a minimal amount
> of mirrored memory ... if you expect to want/have lots of
> mirrored memory, then just take the 50% hit in capacity
> and mirror everything and ignore all the s/w complexity).

I think allocations from some key processes (e.g. a database) are as
important as the kernel, and in most cases MCE just kills them on a
memory failure, so letting users access the mirrored memory may be a
good way to solve the problem.

> Are your patches ready to be shared?

I'll rewrite and send them soon.

Thanks,
Xishi Qiu

> -Tony

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
On Mon, May 18, 2015 at 8:01 PM, Xishi Qiu wrote:
> In part2, does it mean the memory allocated from the kernel should use
> mirrored memory?

Yes. I want to use mirrored memory for all (or as many as
possible) kernel allocations.

> I have heard of this feature (address range mirroring) before, and I changed
> some code to test it (implement memory allocations in specific physical areas).
>
> In my opinion, adding a new zone (ZONE_MIRROR) to hold the mirrored memory is
> not a good idea. If there are XX discontiguous mirrored areas in one NUMA
> node, there would have to be XX ZONE_MIRROR zones in one pgdat, which is
> impossible, right?

With current h/w implementations XX is at most 2, and is possibly only 1
on most nodes. But we shouldn't depend on that.

> I think adding a new migrate type (MIGRATE_MIRROR) would be better; the
> following print is from my changed kernel.

This sounds interesting.

> [root@localhost ~]# cat /proc/pagetypeinfo
> Page block order: 9
> Pages per block:  512
>
> Free pages count per migrate type at order   0  1  2  3  4  5  6  7  8  9  10
...
> Node 0, zone   DMA, type  Mirror   0  0  0  0  0  0  0  0  0  0  0
...
> Node 0, zone DMA32, type  Mirror   0  0  0  0  0  0  0  0  0  0  0

I see all zero counts here ... which is fine. I expect that systems
will mirror all memory below 4GB ... but we should probably
ignore the attribute for this range because we want to make
sure that the memory is still available for users that depend
on getting memory that legacy devices can access. On systems
that support address range mirror the <4GB area is <2% of even
a small system (128GB seems to be the minimum rational configuration
for a 4 socket machine ... you end up with that much if you populate
every channel with just one 4GB DIMM). On a big system (in the TB
range) the <4GB area is a trivial rounding error.

> Also I added a new flag (GFP_MIRROR), so we can use the mirrored memory from
> both kernel-space and user-space. If there is no mirrored memory, we fall
> back to allocating other types of memory.

But I *think* I want all kernel and no users to allocate mirror
memory. I'd like to not have to touch every place that allocates
memory to add/clear this flag.

> 1) kernel-space (pcp, page buddy, slab/slub ...):
>    -> use mirrored memory (e.g. /proc/sys/vm/mirrorable)
>    -> __alloc_pages_nodemask()
>    -> gfpflags_to_migratetype()
>    -> use MIGRATE_MIRROR list

I think you are telling me that we can do this, but I don't understand
how the code would look.

> 2) user-space (syscall, madvise, mmap ...):
>    -> add VM_MIRROR flag in the vma
>    -> add GFP_MIRROR when page fault in the vma
>    -> __alloc_pages_nodemask()
>    -> use MIGRATE_MIRROR list

If we do let users have access to mirrored memory, then
madvise/mmap seem a plausible way to allow it. Not sure
what access privileges are appropriate to allow it. I expect
mirrored memory to be in short supply (the whole point of
address range mirror is to make do with a minimal amount
of mirrored memory ... if you expect to want/have lots of
mirrored memory, then just take the 50% hit in capacity
and mirror everything and ignore all the s/w complexity).

Are your patches ready to be shared?

-Tony
Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
On 2015/5/9 0:44, Tony Luck wrote:
> Some high end Intel Xeon systems report uncorrectable memory errors
> as a recoverable machine check. Linux has included code for some time
> to process these and just signal the affected processes (or even
> recover completely if the error was in a read only page that can be
> replaced by reading from disk).
>
> But we have no recovery path for errors encountered during kernel
> code execution. Except for some very specific cases we are unlikely
> to ever be able to recover.
>
> Enter memory mirroring. Actually the 3rd generation of memory mirroring.
>
> Gen1: All memory is mirrored
>       Pro: No s/w enabling - h/w just gets good data from the other side
>            of the mirror
>       Con: Halves effective memory capacity available to OS/applications
> Gen2: Partial memory mirror - just mirror memory behind some memory
>       controllers
>       Pro: Keep more of the capacity
>       Con: Nightmare to enable. Have to choose between allocating from
>            mirrored memory for safety vs. NUMA local memory for performance
> Gen3: Address range partial memory mirror - some mirror on each memory
>       controller
>       Pro: Can tune the amount of mirror and keep NUMA performance
>       Con: I have to write memory management code to implement
>
> The current plan is just to use mirrored memory for kernel allocations. This
> has been broken into two phases:
> 1) This patch series - find the mirrored memory, use it for boot time
>    allocations
> 2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused
>    mirrored memory from mm/memblock.c and only give it out to select kernel
>    allocations (this is still being scoped because page_alloc.c is scary).

Hi Tony,

In part2, does it mean the memory allocated from the kernel should use
mirrored memory?

I have heard of this feature (address range mirroring) before, and I changed
some code to test it (implement memory allocations in specific physical areas).

In my opinion, adding a new zone (ZONE_MIRROR) to hold the mirrored memory is
not a good idea. If there are XX discontiguous mirrored areas in one NUMA
node, there would have to be XX ZONE_MIRROR zones in one pgdat, which is
impossible, right?

I think adding a new migrate type (MIGRATE_MIRROR) would be better; the
following print is from my changed kernel.

[root@localhost ~]# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order     0   1   2   3   4   5   6   7   8   9  10
Node 0, zone    DMA, type    Unmovable         1   1   1   0   2   1   1   0   1   0   0
Node 0, zone    DMA, type  Reclaimable         0   0   0   0   0   0   0   0   0   0   0
Node 0, zone    DMA, type      Movable         0   0   0   0   0   0   0   0   0   0   3
Node 0, zone    DMA, type       Mirror         0   0   0   0   0   0   0   0   0   0   0
Node 0, zone    DMA, type      Reserve         0   0   0   0   0   0   0   0   0   1   0
Node 0, zone    DMA, type          CMA         0   0   0   0   0   0   0   0   0   0   0
Node 0, zone    DMA, type      Isolate         0   0   0   0   0   0   0   0   0   0   0
Node 0, zone  DMA32, type    Unmovable        14   7   6   1   3   0   1   0   0   0   0
Node 0, zone  DMA32, type  Reclaimable        15   2   2   1   1   2   1   1   0   0   0
Node 0, zone  DMA32, type      Movable         3  24  52  58  31   2   1   1   1  3231
Node 0, zone  DMA32, type       Mirror         0   0   0   0   0   0   0   0   0   0   0
Node 0, zone  DMA32, type      Reserve         0   0   0   0   0   0   0   0   0   0   1
Node 0, zone  DMA32, type          CMA         0   0   0   0   0   0   0   0   0   0   0
Node 0, zone  DMA32, type      Isolate         0   0   0   0   0   0   0   0   0   0   0
Node 0, zone Normal, type    Unmovable        80  12   6   7   3   1  67  58  23  11   0
Node 0, zone Normal, type  Reclaimable         6   6   8  11   5   3   0   1   0   0   0
Node 0, zone Normal, type      Movable        6198618675363  13   4   3   0   2  4074
Node 0, zone Normal, type       Mirror         0   0   0   0   0   0   0   0   0   0  1024
Node 0, zone Normal, type      Reserve         0   0   0   0   0   0   0   0   0   0   1
Node 0, zone Normal, type          CMA         0   0   0   0   0   0   0   0   0   0   0
Node 0, zone Normal, type      Isolate         0
Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
On Fri, May 8, 2015 at 1:49 PM, Andrew Morton wrote:
> What I mean is: allow userspace to consume ZONE_MIRROR memory because
> we can snatch it back if it is needed for kernel memory.

For suitable interpretations of "snatch it back" ... if there is none
free in a GFP_NOWAIT request, then we are doomed. But we could maintain
some high/low watermarks to arrange the snatching when mirrored memory
is getting low, rather than all the way out.

It's worth a look - but perhaps at phase three. It would make life a bit
easier for people to get the right amount of mirror. If they guess too
high they are still wasting some memory because every mirrored page has
two pages in DIMM. But without this sort of trick all the extra mirrored
memory would be totally wasted.

-Tony
Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
On Fri, 8 May 2015 13:38:52 -0700 Tony Luck wrote:
> > Will surplus ZONE_MIRROR memory be available for regular old movable
> > allocations?
> ZONE_MIRROR and ZONE_MOVABLE are pretty much opposites. We
> only want kernel allocations in mirror memory, and we can't allow any
> kernel allocations in movable (cause they'll pin it).

What I mean is: allow userspace to consume ZONE_MIRROR memory because
we can snatch it back if it is needed for kernel memory.
Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
On Fri, May 8, 2015 at 1:03 PM, Andrew Morton wrote:
> Looks good to me. What happens to these patches while ZONE_MIRROR is
> being worked on?

I think these patches can go into the kernel now while I figure out the
next phase - there is some value in just this part. We'll have all
memory <4GB mirrored to cover the kernel code/data. Adding the boot time
allocations mostly means the page structures (in terms of total amount
of memory).

> I'm wondering about phase II. What does "select kernel allocations"
> mean? I assume we can't say "all kernel allocations" because that can
> sometimes be "almost all memory". How are you planning on implementing
> this? A new __GFP_foo flag, then sprinkle that into selected sites?

Some of that is TBD - there are some clear places where we have bounded
amounts of memory that we'd like to pull into the mirror area. E.g.
loadable modules - on a specific machine an administrator can easily see
which modules are loaded, tally up the sizes, and then adjust the amount
of mirrored memory. I don't think we necessarily need to get to 100% ...
if we can avoid 9/10 errors crashing the machine - that moves the
reliability needle enough to make a difference. Phase 2 may turn into
phase 2a, 2b, 2c etc. as we pick on certain areas.

Oh - there'll be some sysfs or debugfs stats too - so people can check
that they have the right amount of mirror memory under application load.
Too little and they'll be at risk because kernel allocations will fall
back to non-mirrored. Too much, and they are wasting memory.

> Will surplus ZONE_MIRROR memory be available for regular old movable
> allocations?

ZONE_MIRROR and ZONE_MOVABLE are pretty much opposites. We only want
kernel allocations in mirror memory, and we can't allow any kernel
allocations in movable (cause they'll pin it).

> I suggest you run the design ideas by Mel before getting into
> implementation.

Good idea - when I have something fit to be seen, I'll share with Mel.

-Tony
Re: [PATCHv2 0/3] Find mirrored memory, use for boot time allocations
On Fri, 8 May 2015 09:44:21 -0700 Tony Luck wrote:
> Some high end Intel Xeon systems report uncorrectable memory errors
> as a recoverable machine check. Linux has included code for some time
> to process these and just signal the affected processes (or even
> recover completely if the error was in a read only page that can be
> replaced by reading from disk).
>
> But we have no recovery path for errors encountered during kernel
> code execution. Except for some very specific cases we are unlikely
> to ever be able to recover.
>
> Enter memory mirroring. Actually the 3rd generation of memory mirroring.
>
> Gen1: All memory is mirrored
>       Pro: No s/w enabling - h/w just gets good data from the other side
>            of the mirror
>       Con: Halves effective memory capacity available to OS/applications
> Gen2: Partial memory mirror - just mirror memory behind some memory
>       controllers
>       Pro: Keep more of the capacity
>       Con: Nightmare to enable. Have to choose between allocating from
>            mirrored memory for safety vs. NUMA local memory for performance
> Gen3: Address range partial memory mirror - some mirror on each memory
>       controller
>       Pro: Can tune the amount of mirror and keep NUMA performance
>       Con: I have to write memory management code to implement
>
> The current plan is just to use mirrored memory for kernel allocations. This
> has been broken into two phases:
> 1) This patch series - find the mirrored memory, use it for boot time
>    allocations
> 2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the unused
>    mirrored memory from mm/memblock.c and only give it out to select kernel
>    allocations (this is still being scoped because page_alloc.c is scary).

Looks good to me. What happens to these patches while ZONE_MIRROR is
being worked on?

I'm wondering about phase II. What does "select kernel allocations"
mean? I assume we can't say "all kernel allocations" because that can
sometimes be "almost all memory". How are you planning on implementing
this? A new __GFP_foo flag, then sprinkle that into selected sites?

Will surplus ZONE_MIRROR memory be available for regular old movable
allocations?

I suggest you run the design ideas by Mel before getting into
implementation.