[PATCH 1/4][V2] powerpc : add support for linux, usable-memory properties for drconf memory

2008-07-11 Thread Chandru
Scan for linux,usable-memory properties in case of dynamic reconfiguration 
memory. Support for kexec/kdump. 

Signed-off-by: Chandru Siddalingappa [EMAIL PROTECTED]
---

Patch applies on linux-next tree (patch-v2.6.26-rc9-next-20080711.gz)
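
For readers new to the property layout, here is a minimal user-space sketch
of the range decoding described above. It is an illustration under stated
assumptions, not the kernel code: cell_t is modelled as a host-order 32-bit
value, and the names next_cells(), parse_usable_memory() and struct
mem_range are hypothetical. It shows how the byte length of a property is
turned into a count of (base, size) pairs, and how decoding stops at the
first zero-sized range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one 32-bit device-tree cell (host byte order). */
typedef uint32_t cell_t;

/* Hypothetical output record: one usable region. */
struct mem_range {
	uint64_t base;
	uint64_t size;
};

/* Combine 'cells' consecutive 32-bit cells into one value, advancing *cur. */
static uint64_t next_cells(int cells, const cell_t **cur)
{
	uint64_t v = 0;

	while (cells-- > 0)
		v = (v << 32) | *(*cur)++;
	return v;
}

/*
 * Decode a property of 'len' bytes into at most 'max' (base, size) pairs.
 * Decoding stops at the first zero-sized range. Returns the number of
 * pairs written to 'out'.
 */
static size_t parse_usable_memory(const cell_t *prop, size_t len,
				  int addr_cells, int size_cells,
				  struct mem_range *out, size_t max)
{
	size_t ranges, i, n = 0;

	assert(addr_cells > 0 && size_cells > 0);

	/* len is in bytes; each cell is 4 bytes, hence the shift by 2. */
	ranges = (len >> 2) / (size_t)(addr_cells + size_cells);

	for (i = 0; i < ranges && n < max; i++) {
		uint64_t base = next_cells(addr_cells, &prop);
		uint64_t size = next_cells(size_cells, &prop);

		if (size == 0)
			break;
		out[n].base = base;
		out[n].size = size;
		n++;
	}
	return n;
}
```

With two address cells and two size cells, a 48-byte property holds three
pairs; if the second pair has size 0, only the first is returned.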

 arch/powerpc/kernel/prom.c |   40 +++--
 arch/powerpc/mm/numa.c |   48 ---
 2 files changed, 65 insertions(+), 23 deletions(-)

diff -Naurp linux-2.6.26-rc9-orig/arch/powerpc/kernel/prom.c linux-2.6.26-rc9/arch/powerpc/kernel/prom.c
--- linux-2.6.26-rc9-orig/arch/powerpc/kernel/prom.c	2008-07-11 14:44:55.0 +0530
+++ linux-2.6.26-rc9/arch/powerpc/kernel/prom.c	2008-07-11 14:58:26.0 +0530
@@ -888,9 +888,10 @@ static u64 __init dt_mem_next_cell(int s
  */
 static int __init early_init_dt_scan_drconf_memory(unsigned long node)
 {
-	cell_t *dm, *ls;
-	unsigned long l, n, flags;
+	cell_t *dm, *ls, *usm;
+	unsigned long l, n, flags, ranges;
 	u64 base, size, lmb_size;
+	char buf[32];
 
 	ls = (cell_t *)of_get_flat_dt_prop(node, "ibm,lmb-size", &l);
 	if (ls == NULL || l < dt_root_size_cells * sizeof(cell_t))
@@ -914,14 +915,37 @@ static int __init early_init_dt_scan_drc
 		   or if the block is not assigned to this partition (0x8) */
 		if ((flags & 0x80) || !(flags & 0x8))
 			continue;
-		size = lmb_size;
-		if (iommu_is_off) {
+		if (iommu_is_off)
 			if (base >= 0x80000000ul)
 				continue;
-			if ((base + size) > 0x80000000ul)
-				size = 0x80000000ul - base;
-		}
-		lmb_add(base, size);
+		size = lmb_size;
+
+		/*
+		 * Append 'n' to 'linux,usable-memory' to get special
+		 * properties passed in by tools like kexec-tools. Relevant
+		 * only if this is a kexec/kdump kernel.
+		 */
+		sprintf(buf, "linux,usable-memory%d", (int)n);
+		usm = of_get_flat_dt_prop(node, buf, &l);
+		ranges = 1;
+		if (usm != NULL)
+			ranges = (l >> 2)/(dt_root_addr_cells
+					+ dt_root_size_cells);
+		do {
+			if (usm != NULL) {
+				base = dt_mem_next_cell(dt_root_addr_cells,
+							 &usm);
+				size = dt_mem_next_cell(dt_root_size_cells,
+							 &usm);
+				if (size == 0)
+					break;
+			}
+			if (iommu_is_off)
+				if ((base + size) > 0x80000000ul)
+					size = 0x80000000ul - base;
+
+			lmb_add(base, size);
+		} while (--ranges);
 	}
 	lmb_dump_all();
 	return 0;
diff -Naurp linux-2.6.26-rc9-orig/arch/powerpc/mm/numa.c linux-2.6.26-rc9/arch/powerpc/mm/numa.c
--- linux-2.6.26-rc9-orig/arch/powerpc/mm/numa.c	2008-07-11 14:44:55.0 +0530
+++ linux-2.6.26-rc9/arch/powerpc/mm/numa.c	2008-07-11 15:01:56.0 +0530
@@ -493,11 +493,13 @@ static unsigned long __init numa_enforce
  */
 static void __init parse_drconf_memory(struct device_node *memory)
 {
-	const u32 *dm;
-	unsigned int n, rc;
-	unsigned long lmb_size, size;
+	const u32 *dm, *usm;
+	unsigned int n, rc, len, ranges;
+	unsigned long lmb_size, size, sz;
 	int nid;
 	struct assoc_arrays aa;
+	char buf[32];
+	u64 base;
 
 	n = of_get_drconf_memory(memory, &dm);
 	if (!n)
@@ -524,19 +526,35 @@ static void __init parse_drconf_memory(s
 
 		nid = of_drconf_to_nid_single(&drmem, &aa);
 
-		fake_numa_create_new_node(
-				((drmem.base_addr + lmb_size) >> PAGE_SHIFT),
-					   &nid);
-
-		node_set_online(nid);
-
-		size = numa_enforce_memory_limit(drmem.base_addr, lmb_size);
-		if (!size)
-			continue;
+		/*
+		 * Append 'n' to 'linux,usable-memory' to get special
+		 * properties passed in by tools like kexec-tools. Relevant
+		 * only if this is a kexec/kdump kernel.
+		 */
+		sprintf(buf, "linux,usable-memory%d", (int)n);
+		usm = of_get_property(memory, buf, &len);
+		ranges = 1;
+		if (usm != NULL)
+			ranges = (len >> 2) /
+				 (n_mem_addr_cells + n_mem_size_cells);
+
+		base = drmem.base_addr;
+		size = lmb_size

Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

2008-07-11 Thread Vivek Goyal
On Fri, Jul 11, 2008 at 12:21:31PM -0700, Andrew Morton wrote:
> On Tue, 8 Jul 2008 10:50:51 -0400 Vivek Goyal [EMAIL PROTECTED] wrote:
>
> > On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
> > > This patch provides an enhancement to kexec/kdump. It implements
> > > the following features:
> > >
> > > - Backup/restore memory used by the original kernel before/after
> > >   kexec.
> > >
> > > - Save/restore CPU state before/after kexec.
> > >
> >
> > Hi Huang,
> >
> > In general this patch set looks good enough to live in -mm and
> > get some testing going.
> >
> > To me, adding capability to return back to original kernel looks
> > like a logical extension to kexec functionality.
>
> Exciting ;)  It's much less code than I expected.
>
> I don't think I understand the feature any more.  Once upon a time we
> thought that this might become a new and better (or at least
> better-code-sharing) way of doing suspend-to-disk.  How far are we from
> that?
>

Hi Andrew,

We can use this patchset for hibernation, but whether it can be a better
way of doing things than what we already have, I don't know. Last time I
raised this question, the power management people had various views. In
the end, Pavel wanted this patchset to be in. Pavel can tell more here...

To me this patchset looks interesting for a couple of reasons.

- It looks like an interesting feature where one can have a separate kernel
  in memory and switch between the kernels on the fly. It can
  be modified to have more than one kernel in memory at a time.

- So far kexec was unidirectional. One could only kexec to a new kernel, and
  the old kernel was gone. Now this patchset makes kexec functionality kind
  of bidirectional, which looks like a logical extension and can lead
  to interesting use cases in the future.

Huang also talks of using this feature for snapshotting the kernel and
invoking some BIOS code in protected mode. I am not sure how exactly
they are planning to use it. Huang, do you have more details on this?

> What are the prospects of supporting other architectures?
>

I think it should be doable on other architectures as well where kexec
is supported. Can't think of a reason why it can't be. Huang, what do
you think?

> Who maintains kexec-tools, and are they OK with merging up the
> corresponding changes?
>

I think Eric still has the ownership of kexec-tools, but it has been a
long time since kexec-tools was updated. Simon Horman is now maintaining
a separate tree, kexec-tools-testing, and all the active development
is taking place there.

Huang has not exactly posted kexec-tools patches but has given a link
to them, and nobody has objected so far. I am CCing
Simon Horman, in case he sees any issues.

Thanks
Vivek

___
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec


Re: [PATCH -mm 1/2] kexec jump -v12: kexec jump

2008-07-11 Thread Pavel Machek
On Fri 2008-07-11 12:21:31, Andrew Morton wrote:
> On Tue, 8 Jul 2008 10:50:51 -0400 Vivek Goyal [EMAIL PROTECTED] wrote:
>
> > On Mon, Jul 07, 2008 at 11:25:22AM +0800, Huang Ying wrote:
> > > This patch provides an enhancement to kexec/kdump. It implements
> > > the following features:
> > >
> > > - Backup/restore memory used by the original kernel before/after
> > >   kexec.
> > >
> > > - Save/restore CPU state before/after kexec.
> > >
> >
> > Hi Huang,
> >
> > In general this patch set looks good enough to live in -mm and
> > get some testing going.
> >
> > To me, adding capability to return back to original kernel looks
> > like a logical extension to kexec functionality.
>
> Exciting ;)  It's much less code than I expected.
>
> I don't think I understand the feature any more.  Once upon a time we
> thought that this might become a new and better (or at least
> better-code-sharing) way of doing suspend-to-disk.  How far are we from
> that?

Well, it will be tricky to get kjump-hibernation right with respect to
ACPI, but we should be fairly close to basic hibernation working with
this. It has the major advantage of not needing the refrigerator (and a
few disadvantages -- like doing an additional boot during suspend).

But the main reason I'd like kjump to be in is different -- it should be
useful for stuff like dump-but-continue-running, etc...
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



Re: AMD Family 10H machine check on vmcore read

2008-07-11 Thread Yinghai Lu
On Fri, Jul 11, 2008 at 1:50 PM, Vivek Goyal [EMAIL PROTECTED] wrote:
> On Wed, Jul 09, 2008 at 11:32:40AM -0600, Bob Montgomery wrote:
> > On Tue, 2008-07-08 at 13:28 +0000, Vivek Goyal wrote:
> > > On Mon, Jul 07, 2008 at 05:08:06PM -0600, Bob Montgomery wrote:
> > > > We maintain a 2.6.18 derived kernel.
> > > > When testing kdump on a new AMD Family 10h (16) processor, once in the
> > > > kdump kernel, a read from either /proc/vmcore or /dev/oldmem that
> > > > corresponds to the area of memory identified in the original (crashing)
> > > > kernel by these boot messages:
> > > >
> > > >
> > > > On a Family 15 AMD64 processor running this kernel and kdump kernel, I
> > > > can read the areas identified as being in the aperture from the kdump
> > > > kernel and get values, but on the new processor, reads from the kdump
> > > > kernel that are within that address range result in the machine check:
> > > >
> > > > HARDWARE ERROR
> > > > CPU 0: Machine Check Exception:4 Bank 4: be010005001b
> > > > TSC 141bd974323de ADDR 1c00 MISC e00c0ffe0100
> > > >
> > >
> > > Hi Bob,
> > >
> > > I am not sure what's happening here. Because in /proc/iomem, GART reserved
> > > area is reported as System RAM, kdump kernel will try to read this area
> > > and save it. Now I am not sure, what is so special about this area that
> > > mapping it and reading it in second kernel would cause a MCE.
> > >
> > > CCing it to LKML, hoping people knowing GART will be able to provide some
> > > input.
> > >
> > > > But I don't see this fix upstream in the kernel.  So I'm wondering if
> > > > some other patch protects other kdump kernels from this problem.  In
> > > > particular, a recent patch that informed the e820 map about the gart
> > > > aperture to prevent a normal kernel and a kexec kernel from putting it
> > > > at different addresses.  It didn't mention machine checks from kdump
> > > > kernels, but I wonder if it would have prevented access to that memory
> > > > area by having it be excluded from the /proc/vmcore list of areas??
> > >
> > > Can you provide a link to the patch above? If /proc/iomem, does not report
> > > GART area as system ram then it will be excluded from the dump. (IIUC,
> > > IOMMU tables are in GART area and ideally one should be capturing it to
> > > find out how IOMMU tables looked like at the time of crash).
> >
> > The patch that I thought might be related is:
> >  x86: disable the GART early, 64-bit
> >
> > author     Yinghai Lu [EMAIL PROTECTED]
> >            Wed, 30 Jan 2008 12:33:09 +0000 (13:33 +0100)
> > committer  Ingo Molnar [EMAIL PROTECTED]
> >            Wed, 30 Jan 2008 12:33:09 +0000 (13:33 +0100)
> > commit     aaf230424204864e2833dcc1da23e2cb0b9f39cd
> > tree       a42042f5135aa63a780964bd053ae174211ab62f
> >
> > I thought it might be relevant because of this included comment:
> >
> >  hm, i'm wondering, instead of modifying the GART, why dont we simply
> >  _detect_ whatever GART settings we have inherited, and propagate that
> >  into our e820 maps? I.e. if there's inconsistency, then punch that out
> >  from the memory maps and just dont use that memory.
> >
> > But this patch doesn't mention machine checks as the symptom that
> > initiated the patch.
> >
> > And my reason for looking was because I didn't think I could be the
> > first person to try reading /proc/vmcore on a Family 10h processor.  So
> > I wondered why it hadn't been seen by some other tester, and thought
> > some other patch might have fixed it a different way on newer kernels
> > than mine.
> >
>
> Hi Bob,
>
> So it looks like this patch will mark aperture region as non RAM and kdump
> will not try to dump that memory and will not run into MCE. Have you tried
> the kernel with this patch? Does it work for you?
>
> At this point I don't know, why accessing the aperture region of first
> kernel causes MCE. May be Andi or Yinghai will know, CCing them.
>
> So backporting the Yinghai's patch to your kernel should help here.


The two kernels could place the GART aperture, which is allocated from
low RAM, at different positions.

So the GART needs to be stopped in the shutdown path of the first kernel
or in the startup path of the second kernel.

Otherwise the second kernel will try to use the addresses covered by the
first kernel's GART, and later, if that GART is disabled in favor of the
new GART, the setting will be lost.

with this patch could use MCE error and HT sync flood and warm reset

YH



Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump

2008-07-11 Thread Alan Stern
On Fri, 11 Jul 2008, Eric W. Biederman wrote:

> I just realized with a little care the block layer does have support for this,
> or something very close.
>
> You setup a software raid mirror with one disk device.  The physical
> device can come in and out while the filesystems depend on the real device.

Do you mean the filesystems depend on the logical RAID device?  

What's to prevent userspace from accessing the physical device 
directly?

What this amounts to, in the end, is having a way to distinguish the
set of I/O requests coming from the hibernation code (reading or
writing the memory image) from the set of all other I/O requests.  The
driver or the block layer has to be set up to allow the first set
through while blocking the second set.  (And don't forget about the 
complications caused by error-recovery I/O during the hibernation 
activity!)

Forcing the second set of requests to filter through an extra software 
layer is a clumsy way of accomplishing this.  There ought to be a 
better approach.
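
As a rough illustration of the distinction described above, the gating
decision could be sketched like this. It is a hypothetical user-space
model, not an existing block-layer interface: the io_origin tag, struct
io_gate and io_gate_check() are names invented for the example, and
error-recovery I/O is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tag recording who issued a request. */
enum io_origin {
	IO_NORMAL,		/* filesystems, userspace, everything else */
	IO_HIBERNATION		/* reading or writing the memory image */
};

/* Hypothetical request: only the fields the gate needs. */
struct io_request {
	enum io_origin origin;
	unsigned long sector;
};

/* Hypothetical per-device gate state. */
struct io_gate {
	bool hibernating;	/* true while the image is being handled */
};

enum io_verdict {
	IO_PASS,		/* hand the request to the driver now */
	IO_HOLD			/* keep it back until hibernation is over */
};

/*
 * Decide whether a request may reach the device. Outside hibernation
 * everything passes; during hibernation only image I/O does.
 */
static enum io_verdict io_gate_check(const struct io_gate *gate,
				     const struct io_request *rq)
{
	assert(gate != NULL && rq != NULL);

	if (!gate->hibernating)
		return IO_PASS;
	return rq->origin == IO_HIBERNATION ? IO_PASS : IO_HOLD;
}
```

The sketch only captures the classification step; where the tag would be
set and where held requests would wait are left open.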

Alan Stern




Re: [linux-pm] [PATCH -mm 1/2] kexec jump -v12: kexec jump

2008-07-11 Thread Eric W. Biederman
Alan Stern [EMAIL PROTECTED] writes:

> On Fri, 11 Jul 2008, Eric W. Biederman wrote:
>
> > I just realized with a little care the block layer does have support for
> > this, or something very close.
> >
> > You setup a software raid mirror with one disk device.  The physical
> > device can come in and out while the filesystems depend on the real device.
>
> Do you mean the filesystems depend on the logical RAID device?

Oh yes. Thinko.

> What's to prevent userspace from accessing the physical device
> directly?

Nothing.

> What this amounts to, in the end, is having a way to distinguish the
> set of I/O requests coming from the hibernation code (reading or
> writing the memory image) from the set of all other I/O requests.  The
> driver or the block layer has to be set up to allow the first set
> through while blocking the second set.  (And don't forget about the
> complications caused by error-recovery I/O during the hibernation
> activity!)

I guess this problem exists but it is not at all the problem I was
thinking of.

> Forcing the second set of requests to filter through an extra software
> layer is a clumsy way of accomplishing this.  There ought to be a
> better approach.

The point was something different.  The reason we cannot store the
state of the system with the hardware devices logically hot unplugged
(and thus reuse all of the fine device hotplug methods) is that
things like the filesystem layer don't know how to cope with their
block devices going away and coming back.

That is the problem inserting a virtual software device in the middle
can solve.  If that works, should there be a better way?  Certainly, but
to prove it out, starting with a block device wrapper is a trivial way to
go.
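
To make the wrapper idea concrete, here is a toy user-space model, under
the assumption that requests arriving while the physical device is absent
are simply parked and replayed when it returns. Nothing here is a real
block-layer or md interface; struct phys_dev, struct wrap_dev and the
wrap_*() helpers are hypothetical names for the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define WRAP_MAX 8	/* arbitrary queue depth for the sketch */

/* Hypothetical physical device: it just records the sectors it served. */
struct phys_dev {
	long served[WRAP_MAX * 2];
	size_t nserved;
};

/* Hypothetical wrapper: stays visible while 'phys' comes and goes. */
struct wrap_dev {
	struct phys_dev *phys;	/* NULL while the device is unplugged */
	long pending[WRAP_MAX];	/* requests parked during the absence */
	size_t npending;
};

/* Serve one request on the physical device; -1 if its log is full. */
static int phys_submit(struct phys_dev *p, long sector)
{
	if (p->nserved == sizeof(p->served) / sizeof(p->served[0]))
		return -1;
	p->served[p->nserved++] = sector;
	return 0;
}

/* Forward if the device is present, otherwise park the request. */
static int wrap_submit(struct wrap_dev *w, long sector)
{
	assert(w != NULL);

	if (w->phys)
		return phys_submit(w->phys, sector);
	if (w->npending == WRAP_MAX)
		return -1;
	w->pending[w->npending++] = sector;
	return 0;
}

/* Logical hot-unplug: users of the wrapper do not notice. */
static void wrap_unplug(struct wrap_dev *w)
{
	w->phys = NULL;
}

/* Hot-plug: replay parked requests in their original order. */
static int wrap_plug(struct wrap_dev *w, struct phys_dev *p)
{
	size_t i, j;

	w->phys = p;
	for (i = 0; i < w->npending; i++)
		if (phys_submit(p, w->pending[i]) != 0)
			break;
	/* Keep whatever could not be replayed at the front of the queue. */
	for (j = 0; i + j < w->npending; j++)
		w->pending[j] = w->pending[i + j];
	w->npending = j;
	return j == 0 ? 0 : -1;
}
```

A filesystem sitting on the wrapper would keep calling wrap_submit()
across an unplug/plug cycle and only ever see the wrapper.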

Eric
