[PATCH 1/4][V2] powerpc: add support for linux,usable-memory properties for drconf memory
Scan for linux,usable-memory properties in case of dynamic reconfiguration
memory. Support for kexec/kdump.

Signed-off-by: Chandru Siddalingappa [EMAIL PROTECTED]
---

Patch applies on linux-next tree (patch-v2.6.26-rc9-next-20080711.gz)

 arch/powerpc/kernel/prom.c |   40 +++--
 arch/powerpc/mm/numa.c     |   48 ---
 2 files changed, 65 insertions(+), 23 deletions(-)

diff -Naurp linux-2.6.26-rc9-orig/arch/powerpc/kernel/prom.c linux-2.6.26-rc9/arch/powerpc/kernel/prom.c
--- linux-2.6.26-rc9-orig/arch/powerpc/kernel/prom.c	2008-07-11 14:44:55.0 +0530
+++ linux-2.6.26-rc9/arch/powerpc/kernel/prom.c	2008-07-11 14:58:26.0 +0530
@@ -888,9 +888,10 @@ static u64 __init dt_mem_next_cell(int s
  */
 static int __init early_init_dt_scan_drconf_memory(unsigned long node)
 {
-	cell_t *dm, *ls;
-	unsigned long l, n, flags;
+	cell_t *dm, *ls, *usm;
+	unsigned long l, n, flags, ranges;
 	u64 base, size, lmb_size;
+	char buf[32];
 
 	ls = (cell_t *)of_get_flat_dt_prop(node, "ibm,lmb-size", &l);
 	if (ls == NULL || l < dt_root_size_cells * sizeof(cell_t))
@@ -914,14 +915,37 @@ static int __init early_init_dt_scan_drc
 		   or if the block is not assigned to this partition (0x8) */
 		if ((flags & 0x80) || !(flags & 0x8))
 			continue;
-		size = lmb_size;
-		if (iommu_is_off) {
+		if (iommu_is_off)
 			if (base >= 0x80000000ul)
 				continue;
-			if ((base + size) > 0x80000000ul)
-				size = 0x80000000ul - base;
-		}
-		lmb_add(base, size);
+		size = lmb_size;
+
+		/*
+		 * Append 'n' to 'linux,usable-memory' to get special
+		 * properties passed in by tools like kexec-tools. Relevant
+		 * only if this is a kexec/kdump kernel.
+		 */
+		sprintf(buf, "linux,usable-memory%d", (int)n);
+		usm = of_get_flat_dt_prop(node, buf, &l);
+		ranges = 1;
+		if (usm != NULL)
+			ranges = (l >> 2)/(dt_root_addr_cells
+					+ dt_root_size_cells);
+		do {
+			if (usm != NULL) {
+				base = dt_mem_next_cell(dt_root_addr_cells,
+							 &usm);
+				size = dt_mem_next_cell(dt_root_size_cells,
+							 &usm);
+				if (size == 0)
+					break;
+			}
+			if (iommu_is_off)
+				if ((base + size) > 0x80000000ul)
+					size = 0x80000000ul - base;
+
+			lmb_add(base, size);
+		} while (--ranges);
 	}
 	lmb_dump_all();
 	return 0;
diff -Naurp linux-2.6.26-rc9-orig/arch/powerpc/mm/numa.c linux-2.6.26-rc9/arch/powerpc/mm/numa.c
--- linux-2.6.26-rc9-orig/arch/powerpc/mm/numa.c	2008-07-11 14:44:55.0 +0530
+++ linux-2.6.26-rc9/arch/powerpc/mm/numa.c	2008-07-11 15:01:56.0 +0530
@@ -493,11 +493,13 @@ static unsigned long __init numa_enforce
  */
 static void __init parse_drconf_memory(struct device_node *memory)
 {
-	const u32 *dm;
-	unsigned int n, rc;
-	unsigned long lmb_size, size;
+	const u32 *dm, *usm;
+	unsigned int n, rc, len, ranges;
+	unsigned long lmb_size, size, sz;
 	int nid;
 	struct assoc_arrays aa;
+	char buf[32];
+	u64 base;
 
 	n = of_get_drconf_memory(memory, &dm);
 	if (!n)
@@ -524,19 +526,35 @@ static void __init parse_drconf_memory(s
 
 		nid = of_drconf_to_nid_single(&drmem, &aa);
 
-		fake_numa_create_new_node(
-				((drmem.base_addr + lmb_size) >> PAGE_SHIFT),
-					   &nid);
-
-		node_set_online(nid);
-
-		size = numa_enforce_memory_limit(drmem.base_addr, lmb_size);
-		if (!size)
-			continue;
+		/*
+		 * Append 'n' to 'linux,usable-memory' to get special
+		 * properties passed in by tools like kexec-tools. Relevant
+		 * only if this is a kexec/kdump kernel.
+		 */
+		sprintf(buf, "linux,usable-memory%d", (int)n);
+		usm = of_get_property(memory, buf, &len);
+		ranges = 1;
+		if (usm != NULL)
+			ranges = (len >> 2) /
+				(n_mem_addr_cells + n_mem_size_cells);
+
+		base = drmem.base_addr;
+		size = lmb_size
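The range arithmetic in the prom.c hunk can be tried outside the kernel. The following is only a user-space sketch, not code from the patch: `struct usable_range`, `next_cell()`, `parse_usable_memory()` and `clamp_below_2g()` are hypothetical names, `next_cell()` is a simplified stand-in for dt_mem_next_cell(), and cells are assumed to be in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One usable (base, size) range decoded from the property. */
struct usable_range {
	uint64_t base;
	uint64_t size;
};

/*
 * Read `cells` 32-bit cells as one value and advance the cursor.
 * Hypothetical stand-in for the kernel's dt_mem_next_cell(); cells are
 * assumed to be in host byte order here.
 */
static uint64_t next_cell(unsigned int cells, const uint32_t **cursor)
{
	uint64_t value = 0;

	while (cells--)
		value = (value << 32) | *(*cursor)++;
	return value;
}

/*
 * Decode up to `max` ranges from a property that is `len` bytes long.
 * Returns the number of ranges stored. Decoding stops at the first
 * zero-sized range, as the do/while loop in the patch does.
 */
static size_t parse_usable_memory(const uint32_t *prop, size_t len,
				  unsigned int addr_cells,
				  unsigned int size_cells,
				  struct usable_range *out, size_t max)
{
	size_t ranges, i;

	assert(addr_cells + size_cells > 0);

	/* len >> 2 is the number of 32-bit cells in the property. */
	ranges = (len >> 2) / (addr_cells + size_cells);

	for (i = 0; i < ranges && i < max; i++) {
		out[i].base = next_cell(addr_cells, &prop);
		out[i].size = next_cell(size_cells, &prop);
		if (out[i].size == 0)
			break;
	}
	return i;
}

/*
 * Trim a range so that it ends at or below the 2 GB boundary, mirroring
 * the iommu_is_off check. The caller must ensure base is below 2 GB.
 */
static uint64_t clamp_below_2g(uint64_t base, uint64_t size)
{
	const uint64_t limit = 0x80000000ull;

	if (base + size > limit)
		size = limit - base;
	return size;
}
```

With two address cells and two size cells, a 32-byte property holds (32 >> 2) / 4 = 2 ranges, which is the same division the patch performs on `l` and `len`.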
Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump
On Fri, Jul 11, 2008 at 12:21:31PM -0700, Andrew Morton wrote:
> On Tue, 8 Jul 2008 10:50:51 -0400 Vivek Goyal [EMAIL PROTECTED] wrote:
>
> > On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
> > > This patch provides an enhancement to kexec/kdump. It implements
> > > the following features:
> > >
> > > - Backup/restore memory used by the original kernel before/after
> > >   kexec.
> > >
> > > - Save/restore CPU state before/after kexec.
> >
> > Hi Huang,
> >
> > In general this patch set looks good enough to live in -mm and get
> > some testing going. To me, adding capability to return back to
> > original kernel looks like a logical extension to kexec
> > functionality.
>
> Exciting ;) It's much less code than I expected.
>
> I don't think I understand the feature any more. Once upon a time we
> thought that this might become a new and better (or at least
> better-code-sharing) way of doing suspend-to-disk. How far are we
> from that?

Hi Andrew,

We can use this patchset for hibernation, but whether it can be a better
way of doing things than what we already have, I don't know. Last time I
raised this question, the power people had various views. In the end,
Pavel wanted this patchset to be in. Pavel can tell more here...

To me this patchset looks interesting for a couple of reasons.

- It looks like an interesting feature where one can have a separate
  kernel in memory and switch between the kernels on the fly. It can be
  modified to have more than one kernel in memory at a time.

- So far kexec was one-directional. One could only kexec to a new kernel,
  and the old kernel was gone. This patchset makes kexec functionality
  kind of bidirectional, which looks like a logical extension and can
  lead to interesting use cases in future.

Huang also talks of using this feature for snapshotting the kernel and
invoking some BIOS code in protected mode. I am not very sure how exactly
they are planning to use it. Huang, do you have more details on this?

> What are the prospects of supporting other architectures?

I think it should be doable on other architectures as well where kexec is
supported. Can't think of a reason why it can't be. Huang, what do you
think?

> Who maintains kexec-tools, and are they OK with merging up the
> corresponding changes?

I think Eric still has the ownership of kexec-tools. But it has been long
since kexec-tools was updated. Now Simon Horman is maintaining a separate
tree, kexec-tools-testing, and all the active development is taking place
there.

Huang has not exactly posted kexec-tools patches but has given a link to
them, and nobody has objected so far. I am CCing Simon Horman, in case he
sees any issues.

Thanks
Vivek

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump
On Fri 2008-07-11 12:21:31, Andrew Morton wrote:
> On Tue, 8 Jul 2008 10:50:51 -0400 Vivek Goyal [EMAIL PROTECTED] wrote:
>
> > On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
> > > This patch provides an enhancement to kexec/kdump. It implements
> > > the following features:
> > >
> > > - Backup/restore memory used by the original kernel before/after
> > >   kexec.
> > >
> > > - Save/restore CPU state before/after kexec.
> >
> > Hi Huang,
> >
> > In general this patch set looks good enough to live in -mm and get
> > some testing going. To me, adding capability to return back to
> > original kernel looks like a logical extension to kexec
> > functionality.
>
> Exciting ;) It's much less code than I expected.
>
> I don't think I understand the feature any more. Once upon a time we
> thought that this might become a new and better (or at least
> better-code-sharing) way of doing suspend-to-disk. How far are we
> from that?

Well, it will be tricky to get kjump-hibernation right with respect to
ACPI, but we should be fairly close to basic hibernation working with
this. It has the major advantage of not needing the refrigerator (and a
few disadvantages -- like doing an additional boot during suspend).

But the main reason I'd like kjump to be in is different -- it should be
useful for stuff like dump but continue running, etc...

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: AMD Family 10H machine check on vmcore read
On Fri, Jul 11, 2008 at 1:50 PM, Vivek Goyal [EMAIL PROTECTED] wrote:
> On Wed, Jul 09, 2008 at 11:32:40AM -0600, Bob Montgomery wrote:
> > On Tue, 2008-07-08 at 13:28 +0000, Vivek Goyal wrote:
> > > On Mon, Jul 07, 2008 at 05:08:06PM -0600, Bob Montgomery wrote:
> > > > We maintain a 2.6.18 derived kernel. When testing kdump on a
> > > > new AMD Family 10h (16) processor, once in the kdump kernel, a
> > > > read from either /proc/vmcore or /dev/oldmem that corresponds
> > > > to the area of memory identified in the original (crashing)
> > > > kernel by these boot messages:
> > > >
> > > > On a Family 15 AMD64 processor running this kernel and kdump
> > > > kernel, I can read the areas identified as being in the
> > > > aperture from the kdump kernel and get values, but on the new
> > > > processor, reads from the kdump kernel that are within that
> > > > address range result in the machine check:
> > > >
> > > > HARDWARE ERROR
> > > > CPU 0: Machine Check Exception:4 Bank 4: be010005001b
> > > > TSC 141bd974323de ADDR 1c00 MISC e00c0ffe0100
> > >
> > > Hi Bob,
> > >
> > > I am not sure what's happening here. Because in /proc/iomem the
> > > GART reserved area is reported as System RAM, the kdump kernel
> > > will try to read this area and save it. Now I am not sure what is
> > > so special about this area that mapping it and reading it in the
> > > second kernel would cause an MCE.
> > >
> > > CCing it to LKML, hoping people knowing GART will be able to
> > > provide some input.
> > >
> > > > But I don't see this fix upstream in the kernel. So I'm
> > > > wondering if some other patch protects other kdump kernels from
> > > > this problem. In particular, a recent patch that informed the
> > > > e820 map about the gart aperture to prevent a normal kernel and
> > > > a kexec kernel from putting it at different addresses. It
> > > > didn't mention machine checks from kdump kernels, but I wonder
> > > > if it would have prevented access to that memory area by having
> > > > it be excluded from the /proc/vmcore list of areas??
> > >
> > > Can you provide a link to the patch above? If /proc/iomem does
> > > not report the GART area as System RAM then it will be excluded
> > > from the dump. (IIUC, IOMMU tables are in the GART area and
> > > ideally one should be capturing it to find out what the IOMMU
> > > tables looked like at the time of crash).
> >
> > The patch that I thought might be related is:
> >
> > x86: disable the GART early, 64-bit
> > author    Yinghai Lu [EMAIL PROTECTED]
> >           Wed, 30 Jan 2008 12:33:09 +0000 (13:33 +0100)
> > committer Ingo Molnar [EMAIL PROTECTED]
> >           Wed, 30 Jan 2008 12:33:09 +0000 (13:33 +0100)
> > commit    aaf230424204864e2833dcc1da23e2cb0b9f39cd
> > tree      a42042f5135aa63a780964bd053ae174211ab62f
> >
> > I thought it might be relevant because of this included comment:
> >
> >     hm, i'm wondering, instead of modifying the GART, why dont we
> >     simply _detect_ whatever GART settings we have inherited, and
> >     propagate that into our e820 maps? I.e. if there's
> >     inconsistency, then punch that out from the memory maps and
> >     just dont use that memory.
> >
> > But this patch doesn't mention machine checks as the symptom that
> > initiated the patch. And my reason for looking was because I didn't
> > think I could be the first person to try reading /proc/vmcore on a
> > Family 10h processor. So I wondered why it hadn't been seen by some
> > other tester, and thought some other patch might have fixed it a
> > different way on newer kernels than mine.
>
> Hi Bob,
>
> So it looks like this patch will mark the aperture region as non-RAM,
> and kdump will not try to dump that memory and will not run into the
> MCE. Have you tried the kernel with this patch? Does it work for you?
>
> At this point I don't know why accessing the aperture region of the
> first kernel causes an MCE. Maybe Andi or Yinghai will know, CCing
> them.
>
> So backporting Yinghai's patch to your kernel should help here.

Two kernels could have different positions for a GART aperture allocated
from low RAM, so we need to stop it in the shutdown path of the first
kernel or the startup path of the second kernel. Otherwise the second
kernel will try to use the address of the first kernel's GART, and later,
if the GART is disabled for the new GART, the setting will be lost.

with this patch could use MCE error and HT sync flood and warm reset

YH
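The comment quoted in the thread proposes to "punch that out from the memory maps". As a rough illustration of that idea only, and not the e820 code from the commit above, the sketch below removes an aperture-sized hole from one RAM range. `struct mem_range` and `punch_hole()` are hypothetical names, and any addresses used with it are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open physical address range [start, end). */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Remove `hole` from `ram` and store what is left in `out`.
 * Returns the number of pieces written: 0 if the hole covers the whole
 * range, 1 if it trims one side or does not overlap, 2 if it splits it.
 */
static int punch_hole(struct mem_range ram, struct mem_range hole,
		      struct mem_range out[2])
{
	int n = 0;

	assert(ram.start <= ram.end && hole.start <= hole.end);

	/* No overlap: the RAM range survives unchanged. */
	if (hole.end <= ram.start || hole.start >= ram.end) {
		out[n++] = ram;
		return n;
	}

	/* Piece below the hole, if any. */
	if (hole.start > ram.start) {
		out[n].start = ram.start;
		out[n].end = hole.start;
		n++;
	}

	/* Piece above the hole, if any. */
	if (hole.end < ram.end) {
		out[n].start = hole.end;
		out[n].end = ram.end;
		n++;
	}
	return n;
}
```

A range that is no longer listed as System RAM would, per Vivek's remark about /proc/iomem, not be exported through /proc/vmcore, so the kdump kernel would not read it.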
Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump
On Fri, 11 Jul 2008, Eric W. Biederman wrote:

> I just realized with a little care the block layer does have support
> for this, or something very close.
>
> You setup a software raid mirror with one disk device. The physical
> device can come in and out while the filesystems depend on the real
> device.

Do you mean the filesystems depend on the logical RAID device?

What's to prevent userspace from accessing the physical device directly?

What this amounts to, in the end, is having a way to distinguish the set
of I/O requests coming from the hibernation code (reading or writing the
memory image) from the set of all other I/O requests. The driver or the
block layer has to be set up to allow the first set through while
blocking the second set. (And don't forget about the complications caused
by error-recovery I/O during the hibernation activity!)

Forcing the second set of requests to filter through an extra software
layer is a clumsy way of accomplishing this. There ought to be a better
approach.

Alan Stern
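The distinction Alan describes, letting hibernation I/O through while holding back everything else, can be modelled in a few lines. This is a toy user-space model under stated assumptions, not a block-layer interface: `struct io_gate`, `gate_submit()` and `gate_thaw()` are hypothetical names, the held queue has a fixed size, and error-recovery I/O is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Who issued a request: ordinary users or the hibernation code. */
enum io_origin { IO_NORMAL, IO_HIBERNATION };

struct io_request {
	int id;
	enum io_origin origin;
};

#define GATE_MAX_HELD 16

/* Gate state: while frozen, ordinary requests are parked in held[]. */
struct io_gate {
	bool frozen;
	struct io_request held[GATE_MAX_HELD];
	size_t nheld;
};

enum gate_result { GATE_PASS, GATE_HELD, GATE_FULL };

/*
 * Decide what happens to one request. Hibernation I/O always passes;
 * other I/O passes only while the gate is not frozen.
 */
static enum gate_result gate_submit(struct io_gate *g, struct io_request rq)
{
	if (!g->frozen || rq.origin == IO_HIBERNATION)
		return GATE_PASS;
	if (g->nheld == GATE_MAX_HELD)
		return GATE_FULL;
	g->held[g->nheld++] = rq;
	return GATE_HELD;
}

/*
 * Unfreeze the gate and hand back the parked requests in submission
 * order. `out` must have room for every held request.
 */
static size_t gate_thaw(struct io_gate *g, struct io_request *out, size_t max)
{
	size_t i, n = g->nheld;

	assert(max >= n);
	for (i = 0; i < n; i++)
		out[i] = g->held[i];
	g->nheld = 0;
	g->frozen = false;
	return n;
}
```

The model sidesteps the hard part raised in the mail, which is how requests get tagged with their origin in the first place and what to do when userspace bypasses the gate.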
Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump
Alan Stern [EMAIL PROTECTED] writes:

> On Fri, 11 Jul 2008, Eric W. Biederman wrote:
>
> > I just realized with a little care the block layer does have
> > support for this, or something very close.
> >
> > You setup a software raid mirror with one disk device. The physical
> > device can come in and out while the filesystems depend on the real
> > device.
>
> Do you mean the filesystems depend on the logical RAID device?

Oh yes. Thinko.

> What's to prevent userspace from accessing the physical device
> directly?

Nothing.

> What this amounts to, in the end, is having a way to distinguish the
> set of I/O requests coming from the hibernation code (reading or
> writing the memory image) from the set of all other I/O requests. The
> driver or the block layer has to be set up to allow the first set
> through while blocking the second set. (And don't forget about the
> complications caused by error-recovery I/O during the hibernation
> activity!)

I guess this problem exists but it is not at all the problem I was
thinking of.

> Forcing the second set of requests to filter through an extra
> software layer is a clumsy way of accomplishing this. There ought to
> be a better approach.

The point was something different. The reason we cannot store the state
of the system with the hardware devices logically hot unplugged (and thus
reuse all of the find device hotplug methods) is that things like the
filesystem layer don't know how to cope with their block devices going
away and coming back. That is the problem inserting a virtual software
device in the middle can solve.

If that works, should there be a better way? Certainly, but to prove it
out, starting with a block device wrapper is a trivial way to go.

Eric