Re: 2.6.18-mm2 boot failure on x86-64

2006-10-16 Thread Vivek Goyal
On Mon, Oct 09, 2006 at 10:53:58AM +0100, Mel Gorman wrote:
 On Fri, 6 Oct 2006, Vivek Goyal wrote:
 
 On Fri, Oct 06, 2006 at 01:03:50PM -0500, Steve Fox wrote:
 On Fri, 2006-10-06 at 18:11 +0100, Mel Gorman wrote:
 On (06/10/06 11:36), Vivek Goyal didst pronounce:
 Where is bss placed in physical memory? I guess bss_start and bss_stop
 from System.map will tell us. That will confirm that above memset step is
 stomping over bss. Then we have to just find that somewhere probably
 we allocated wrong physical memory area for bootmem allocator map.
 
 
 BSS is at 0x643000 - 0x777BC4
 init_bootmem wipes from 0x777000 - 0x8F7000
 
 So the BSS bytes from 0x777000 - 0x777BC4 (which looks very suspiciously
 like a page alignment of addr & PAGE_MASK) get set to 0xFF. One possible
 fix is below. It adds a check in bad_addr() to see if the BSS section is
 about to be used for bootmap. It Seems To Work For Me (tm) and illustrates
 the source of the problem even if it's not the 100% correct fix.
 
 I was able to boot the machine with Mel's patch applied on top of
 -git22.
 
 
 Please have a look at the attached patch. Does it make sense?
 
 
 It makes some sense. As you state, it wastes memory but that is better 
 than breaking.
 
 Steve, can you please give this patch a try if it fixes the problem?
 
 
 I boottested the patch on the same machine as Steve was using and it 
 completed successfully.


Hi Andrew,

Can you please have a look at the attached patch and include it in -mm?
This fixes the issue for Steve. It also figures in Adrian Bunk's list
of known regressions.

Subject: oops in xfrm_register_mode
References : http://lkml.org/lkml/2006/10/4/170
Submitter  : Steve Fox [EMAIL PROTECTED]
Handled-By : Vivek Goyal [EMAIL PROTECTED]
Status : patch available



o Currently some code pieces assume that the address returned by
  find_e820_area() is page aligned. But it looks like find_e820_area() had
  no such intention, and hence one might end up stomping over some of the
  data. One such case is the bootmem allocator initialization code, which
  stomped over bss.

o This patch modifies find_e820_area() to return a page-aligned address.
  This might be a little wasteful of memory, but at the same time it is
  probably easier to handle page-aligned memory.

Signed-off-by: Vivek Goyal [EMAIL PROTECTED]
---

 arch/x86_64/kernel/e820.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff -puN arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area arch/x86_64/kernel/e820.c
--- linux-2.6.19-rc1-1M/arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area	2006-10-06 15:28:13.0 -0400
+++ linux-2.6.19-rc1-1M-root/arch/x86_64/kernel/e820.c	2006-10-06 15:44:45.0 -0400
@@ -54,13 +54,13 @@ static inline int bad_addr(unsigned long
 
 	/* various gunk below that needed for SMP startup */
 	if (addr < 0x8000) {
-		*addrp = 0x8000;
+		*addrp = PAGE_ALIGN(0x8000);
 		return 1;
 	}
 
 	/* direct mapping tables of the kernel */
 	if (last >= table_start<<PAGE_SHIFT && addr < table_end<<PAGE_SHIFT) {
-		*addrp = table_end << PAGE_SHIFT;
+		*addrp = PAGE_ALIGN(table_end << PAGE_SHIFT);
 		return 1;
 	}
 
@@ -68,18 +68,18 @@ static inline int bad_addr(unsigned long
 #ifdef CONFIG_BLK_DEV_INITRD
 	if (LOADER_TYPE && INITRD_START && last >= INITRD_START &&
 	    addr < INITRD_START+INITRD_SIZE) {
-		*addrp = INITRD_START + INITRD_SIZE;
+		*addrp = PAGE_ALIGN(INITRD_START + INITRD_SIZE);
 		return 1;
 	}
 #endif
 	/* kernel code */
-	if (last >= __pa_symbol(&_text) && last < __pa_symbol(&_end)) {
-		*addrp = __pa_symbol(&_end);
+	if (last >= __pa_symbol(&_text) && addr < __pa_symbol(&_end)) {
+		*addrp = PAGE_ALIGN(__pa_symbol(&_end));
 		return 1;
 	}
 
 	if (last >= ebda_addr && addr < ebda_addr + ebda_size) {
-		*addrp = ebda_addr + ebda_size;
+		*addrp = PAGE_ALIGN(ebda_addr + ebda_size);
 		return 1;
 	}
 
@@ -152,7 +152,7 @@ unsigned long __init find_e820_area(unsi
 			continue;
 		while (bad_addr(&addr, size) && addr+size <= ei->addr+ei->size)
 			;
-		last = addr + size;
+		last = PAGE_ALIGN(addr) + size;
 		if (last > ei->addr + ei->size)
 			continue;
 		if (last > end)
_
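
For readers following the "kernel code" hunk: the change from "last <" to
"addr <" turns the test into an ordinary interval-overlap check. The sketch
below is only an illustration of that difference, not kernel code. The
function names and the example addresses (an image starting at 0x200000, a
request starting at 0x700000) are assumptions made up for the example; only
the 0x777BC4 end address comes from the thread.

```c
#include <assert.h>

/* Pre-patch style test: fires only when the END of the request
 * [addr, last) lies inside the image [text, end). */
int hits_kernel_old(unsigned long addr, unsigned long last,
		    unsigned long text, unsigned long end)
{
	(void)addr;	/* the start of the request is not looked at */
	return last >= text && last < end;
}

/* Patched style test: fires whenever the request overlaps the image. */
int hits_kernel_new(unsigned long addr, unsigned long last,
		    unsigned long text, unsigned long end)
{
	return last >= text && addr < end;
}

/* Hypothetical request that starts inside the image and ends past it. */
void check_overlap_example(void)
{
	unsigned long text = 0x200000, end = 0x777BC4;	/* assumed bounds */
	unsigned long addr = 0x700000, last = addr + 0x180000;

	assert(!hits_kernel_old(addr, last, text, end));  /* overlap missed */
	assert(hits_kernel_new(addr, last, text, end));   /* overlap caught */
}
```

With the old condition, a request that begins inside the image but ends
beyond _end is accepted; with the new one it is rejected and *addrp is moved
past the image.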
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.18-mm2 boot failure on x86-64

2006-10-06 Thread Vivek Goyal
On Fri, Oct 06, 2006 at 03:33:12PM +0100, Mel Gorman wrote:
  Linux version 2.6.18-git22 ([EMAIL PROTECTED]) (gcc version 4.1.0 (SUSE 
  Linux)) #2 SMP Thu Oct 5 19:05:36 PDT 2006
  Command line: root=/dev/sda1 vga=791  
  ip=9.47.67.239:9.47.67.50:9.47.67.1:255.255.255.0 resume=/dev/sdb1 showopts 
  earlyprintk=serial,ttyS0,57600 console=tty0 console=ttyS0,57600 
  autobench_args: root=/dev/sda1 ABAT:1160100417
  BIOS-provided physical RAM map:
   BIOS-e820:  - 0009ac00 (usable)
   BIOS-e820: 0009ac00 - 000a (reserved)
   BIOS-e820: 000e - 0010 (reserved)
   BIOS-e820: 0010 - bff764c0 (usable)
   BIOS-e820: bff764c0 - bff98880 (ACPI data)
   BIOS-e820: bff98880 - c000 (reserved)
   BIOS-e820: fec0 - 0001 (reserved)
   BIOS-e820: 0001 - 000c (usable)
 
 I continued what Steve was doing this morning to see could this be
 pinned down. After placing 'CHECK;' in a few places as suggested by
 Andi's check, the problem code was identified as that following in
 mm/bootmem.c#init_bootmem_core()
 
 mapsize = get_mapsize(bdata);
 memset(bdata->node_bootmem_map, 0xff, mapsize);
 
 That explains the value in the array at least. A few more printfs around
 this point printed out the following in the boot log
 
 init_bootmem_core(0, 1909, 0, 12582912)
 init_bootmem_core: Calling memset(0x81775000, 1572864)
 AAGH: afinfo corrupted at mm/bootmem.c:121
 
 where;
 
 1909 == mapstart
 0 == start
 12582912 == end
 1572864 == mapsize
 
 mapstart, start and end being the parameters being passed to
 init_bootmem_core(). This means we are calling memset for the physical
 range 0x775000 - 0x8F5000 which is in a usable range according to the
 BIOS-e820 map it appears.
 

Hi Mel,

Where is bss placed in physical memory? I guess bss_start and bss_stop
from System.map will tell us. That will confirm that above memset step is
stomping over bss. Then we have to just find that somewhere probably
we allocated wrong physical memory area for bootmem allocator map.

Thanks
Vivek



Re: 2.6.18-mm2 boot failure on x86-64

2006-10-06 Thread Vivek Goyal
On Fri, Oct 06, 2006 at 06:11:05PM +0100, Mel Gorman wrote:
 On (06/10/06 11:36), Vivek Goyal didst pronounce:
  On Fri, Oct 06, 2006 at 03:33:12PM +0100, Mel Gorman wrote:
Linux version 2.6.18-git22 ([EMAIL PROTECTED]) (gcc version 4.1.0 (SUSE 
Linux)) #2 SMP Thu Oct 5 19:05:36 PDT 2006
Command line: root=/dev/sda1 vga=791  
ip=9.47.67.239:9.47.67.50:9.47.67.1:255.255.255.0 resume=/dev/sdb1 
showopts earlyprintk=serial,ttyS0,57600 console=tty0 
console=ttyS0,57600 autobench_args: root=/dev/sda1 ABAT:1160100417
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009ac00 (usable)
 BIOS-e820: 0009ac00 - 000a (reserved)
 BIOS-e820: 000e - 0010 (reserved)
 BIOS-e820: 0010 - bff764c0 (usable)
 BIOS-e820: bff764c0 - bff98880 (ACPI data)
 BIOS-e820: bff98880 - c000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
 BIOS-e820: 0001 - 000c (usable)
   
   I continued what Steve was doing this morning to see could this be
   pinned down. After placing 'CHECK;' in a few places as suggested by
   Andi's check, the problem code was identified as that following in
   mm/bootmem.c#init_bootmem_core()
   
   mapsize = get_mapsize(bdata);
   memset(bdata->node_bootmem_map, 0xff, mapsize);
   
   That explains the value in the array at least. A few more printfs around
   this point printed out the following in the boot log
   
   init_bootmem_core(0, 1909, 0, 12582912)
   init_bootmem_core: Calling memset(0x81775000, 1572864)
   AAGH: afinfo corrupted at mm/bootmem.c:121
   
   where;
   
   1909 == mapstart
   0 == start
   12582912 == end
   1572864 == mapsize
   
   mapstart, start and end being the parameters being passed to
   init_bootmem_core(). This means we are calling memset for the physical
   range 0x775000 - 0x8F5000 which is in a usable range according to the
   BIOS-e820 map it appears.
   
  
  Hi Mel,
  
 
 Hi.
 
  Where is bss placed in physical memory? I guess bss_start and bss_stop
  from System.map will tell us. That will confirm that above memset step is
  stomping over bss. Then we have to just find that somewhere probably
  we allocated wrong physical memory area for bootmem allocator map.
  
 
 BSS is at 0x643000 - 0x777BC4
 init_bootmem wipes from 0x777000 - 0x8F7000
 
 So the BSS bytes from 0x777000 - 0x777BC4 (which looks very suspiciously
 like a page alignment of addr & PAGE_MASK) get set to 0xFF. One possible
 fix is below. It adds a check in bad_addr() to see if the BSS section is
 about to be used for bootmap. It Seems To Work For Me (tm) and illustrates
 the source of the problem even if it's not the 100% correct fix.
 
 diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-git22-clean/arch/x86_64/kernel/e820.c linux-2.6.18-git22-bss_relocate_fix/arch/x86_64/kernel/e820.c
 --- linux-2.6.18-git22-clean/arch/x86_64/kernel/e820.c	2006-10-05 20:42:07.0 +0100
 +++ linux-2.6.18-git22-bss_relocate_fix/arch/x86_64/kernel/e820.c	2006-10-06 17:39:51.0 +0100
 @@ -51,6 +51,7 @@ extern struct resource code_resource, da
  static inline int bad_addr(unsigned long *addrp, unsigned long size)
  {
  	unsigned long addr = *addrp, last = addr + size;
 +	unsigned long bss_start, bss_end;
  
  	/* various gunk below that needed for SMP startup */
  	if (addr < 0x8000) {
 @@ -77,6 +78,14 @@ static inline int bad_addr(unsigned long
  		*addrp = __pa_symbol(&_end);
  		return 1;
  	}
 +
 +	/* bss section */
 +	bss_start = __pa_symbol(__bss_start);
 +	bss_end = PAGE_ALIGN(__pa_symbol(__bss_stop));
 +	if (addr >= bss_start && addr < bss_end) {
 +		*addrp = bss_end;
 +		return 1;
 +	}
  

Surprising; the kernel code check just before this should have taken care
of it.

	/* kernel code */
	if (last >= __pa_symbol(&_text) && last < __pa_symbol(&_end)) {
		*addrp = __pa_symbol(&_end);
		return 1;
	}

Maybe it can be changed to

	if (last >= __pa_symbol(&_text) &&
	    last < PAGE_ALIGN(__pa_symbol(&_end))) {

But all this seems to be a stopgap fix. The real puzzle is still exactly
where it slipped through, and it should be fixed there.

Maybe some more printks will help us.

Thanks
Vivek


Re: 2.6.18-mm2 boot failure on x86-64

2006-10-06 Thread Vivek Goyal
On Fri, Oct 06, 2006 at 06:11:05PM +0100, Mel Gorman wrote:
 On (06/10/06 11:36), Vivek Goyal didst pronounce:
  On Fri, Oct 06, 2006 at 03:33:12PM +0100, Mel Gorman wrote:
Linux version 2.6.18-git22 ([EMAIL PROTECTED]) (gcc version 4.1.0 (SUSE 
Linux)) #2 SMP Thu Oct 5 19:05:36 PDT 2006
Command line: root=/dev/sda1 vga=791  
ip=9.47.67.239:9.47.67.50:9.47.67.1:255.255.255.0 resume=/dev/sdb1 
showopts earlyprintk=serial,ttyS0,57600 console=tty0 
console=ttyS0,57600 autobench_args: root=/dev/sda1 ABAT:1160100417
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009ac00 (usable)
 BIOS-e820: 0009ac00 - 000a (reserved)
 BIOS-e820: 000e - 0010 (reserved)
 BIOS-e820: 0010 - bff764c0 (usable)
 BIOS-e820: bff764c0 - bff98880 (ACPI data)
 BIOS-e820: bff98880 - c000 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
 BIOS-e820: 0001 - 000c (usable)
   
   I continued what Steve was doing this morning to see could this be
   pinned down. After placing 'CHECK;' in a few places as suggested by
   Andi's check, the problem code was identified as that following in
   mm/bootmem.c#init_bootmem_core()
   
   mapsize = get_mapsize(bdata);
   memset(bdata->node_bootmem_map, 0xff, mapsize);
   
   That explains the value in the array at least. A few more printfs around
   this point printed out the following in the boot log
   
   init_bootmem_core(0, 1909, 0, 12582912)
   init_bootmem_core: Calling memset(0x81775000, 1572864)
   AAGH: afinfo corrupted at mm/bootmem.c:121
   
   where;
   
   1909 == mapstart
   0 == start
   12582912 == end
   1572864 == mapsize
   
   mapstart, start and end being the parameters being passed to
   init_bootmem_core(). This means we are calling memset for the physical
   range 0x775000 - 0x8F5000 which is in a usable range according to the
   BIOS-e820 map it appears.
   
  
  Hi Mel,
  
 
 Hi.
 
  Where is bss placed in physical memory? I guess bss_start and bss_stop
  from System.map will tell us. That will confirm that above memset step is
  stomping over bss. Then we have to just find that somewhere probably
  we allocated wrong physical memory area for bootmem allocator map.
  
 
 BSS is at 0x643000 - 0x777BC4
 init_bootmem wipes from 0x777000 - 0x8F7000
 
 So the BSS bytes from 0x777000 - 0x777BC4 (which looks very suspiciously
 like a page alignment of addr & PAGE_MASK) get set to 0xFF. One possible
 fix is below. It adds a check in bad_addr() to see if the BSS section is
 about to be used for bootmap. It Seems To Work For Me (tm) and illustrates
 the source of the problem even if it's not the 100% correct fix.
 

Ok, it looks like the code is assuming that the memory area returned by
find_e820_area() is page aligned. I found two such instances, and that's
what is leading to the problem.

	bootmap_size = init_bootmem_node(NODE_DATA(nodeid),
					 bootmap_start >> PAGE_SHIFT,
					 start_pfn, end_pfn);

Here bootmap_start is not page aligned, and I guess it currently
contains the value 0x777BC4 (just beyond _end). But the moment I do
bootmap_start >> PAGE_SHIFT, I start stomping over bss.

The case is similar here.

	bootmap = find_e820_area(0, end_pfn<<PAGE_SHIFT, bootmap_size);
	if (bootmap == -1L)
		panic("Cannot find bootmem map of size %ld\n",bootmap_size);
	bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);

So maybe we should return a page-aligned address from find_e820_area().
Maybe we can change bad_addr() to set *addrp to the next page-aligned
boundary for every check?

*addrp = PAGE_ALIGN(__pa_symbol(_end));
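
To make the arithmetic concrete, here is a small user-space sketch using the
0x777BC4 value from this thread. It assumes 4 KiB pages and defines its own
PAGE_* macros in the usual way; rounded_down_start() is a hypothetical helper
that mimics what a caller effectively does when it converts the address to a
page frame number and back.

```c
#include <assert.h>
#include <stdio.h>

#define PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)

/* Where the bootmem map really starts once the caller shifts the
 * address down to a page frame number: the low 12 bits are lost. */
unsigned long rounded_down_start(unsigned long addr)
{
	return (addr >> PAGE_SHIFT) << PAGE_SHIFT;
}

void show_alignment(void)
{
	unsigned long bss_end = 0x777BC4;	/* end of bss, from the thread */
	unsigned long found = bss_end;		/* unaligned address handed back */

	/* Rounding down lands inside bss, so the memset hits live data. */
	assert(rounded_down_start(found) < bss_end);
	/* Rounding up first keeps the map clear of bss. */
	assert(rounded_down_start(PAGE_ALIGN(found)) >= bss_end);

	printf("unaligned: map at 0x%lx, aligned: map at 0x%lx\n",
	       rounded_down_start(found), rounded_down_start(PAGE_ALIGN(found)));
}
```

Without alignment the map starts at 0x777000 and covers the last 0xBC4 bytes
of bss; with PAGE_ALIGN() it starts at 0x778000, at the cost of the unused
remainder of that page.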

Thanks
Vivek


Re: 2.6.18-mm2 boot failure on x86-64

2006-10-06 Thread Vivek Goyal
On Fri, Oct 06, 2006 at 01:03:50PM -0500, Steve Fox wrote:
 On Fri, 2006-10-06 at 18:11 +0100, Mel Gorman wrote:
  On (06/10/06 11:36), Vivek Goyal didst pronounce:
   Where is bss placed in physical memory? I guess bss_start and bss_stop
   from System.map will tell us. That will confirm that above memset step is
   stomping over bss. Then we have to just find that somewhere probably
   we allocated wrong physical memory area for bootmem allocator map.
   
  
  BSS is at 0x643000 - 0x777BC4
  init_bootmem wipes from 0x777000 - 0x8F7000
  
  So the BSS bytes from 0x777000 - 0x777BC4 (which looks very suspiciously
  like a page alignment of addr & PAGE_MASK) get set to 0xFF. One possible
  fix is below. It adds a check in bad_addr() to see if the BSS section is
  about to be used for bootmap. It Seems To Work For Me (tm) and illustrates
  the source of the problem even if it's not the 100% correct fix.
 
 I was able to boot the machine with Mel's patch applied on top of
 -git22.


Please have a look at the attached patch. Does it make sense?

Steve, can you please give this patch a try if it fixes the problem?

Thanks
Vivek




o Currently some code pieces assume that the address returned by
  find_e820_area() is page aligned. But it looks like find_e820_area() had
  no such intention, and hence one might end up stomping over some of the
  data. One such case is the bootmem allocator initialization code, which
  stomped over bss.

o This patch modifies find_e820_area() to return a page-aligned address.
  This might be a little wasteful of memory, but at the same time it is
  probably easier to handle page-aligned memory.

Signed-off-by: Vivek Goyal [EMAIL PROTECTED]
---

 arch/x86_64/kernel/e820.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff -puN arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area arch/x86_64/kernel/e820.c
--- linux-2.6.19-rc1-1M/arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area	2006-10-06 15:28:13.0 -0400
+++ linux-2.6.19-rc1-1M-root/arch/x86_64/kernel/e820.c	2006-10-06 15:44:45.0 -0400
@@ -54,13 +54,13 @@ static inline int bad_addr(unsigned long
 
 	/* various gunk below that needed for SMP startup */
 	if (addr < 0x8000) {
-		*addrp = 0x8000;
+		*addrp = PAGE_ALIGN(0x8000);
 		return 1;
 	}
 
 	/* direct mapping tables of the kernel */
 	if (last >= table_start<<PAGE_SHIFT && addr < table_end<<PAGE_SHIFT) {
-		*addrp = table_end << PAGE_SHIFT;
+		*addrp = PAGE_ALIGN(table_end << PAGE_SHIFT);
 		return 1;
 	}
 
@@ -68,18 +68,18 @@ static inline int bad_addr(unsigned long
 #ifdef CONFIG_BLK_DEV_INITRD
 	if (LOADER_TYPE && INITRD_START && last >= INITRD_START &&
 	    addr < INITRD_START+INITRD_SIZE) {
-		*addrp = INITRD_START + INITRD_SIZE;
+		*addrp = PAGE_ALIGN(INITRD_START + INITRD_SIZE);
 		return 1;
 	}
 #endif
 	/* kernel code */
-	if (last >= __pa_symbol(&_text) && last < __pa_symbol(&_end)) {
-		*addrp = __pa_symbol(&_end);
+	if (last >= __pa_symbol(&_text) && addr < __pa_symbol(&_end)) {
+		*addrp = PAGE_ALIGN(__pa_symbol(&_end));
 		return 1;
 	}
 
 	if (last >= ebda_addr && addr < ebda_addr + ebda_size) {
-		*addrp = ebda_addr + ebda_size;
+		*addrp = PAGE_ALIGN(ebda_addr + ebda_size);
 		return 1;
 	}
 
@@ -152,7 +152,7 @@ unsigned long __init find_e820_area(unsi
 			continue;
 		while (bad_addr(&addr, size) && addr+size <= ei->addr+ei->size)
 			;
-		last = addr + size;
+		last = PAGE_ALIGN(addr) + size;
 		if (last > ei->addr + ei->size)
 			continue;
 		if (last > end)
_


Re: 2.6.18-mm2 boot failure on x86-64

2006-10-05 Thread Vivek Goyal
On Thu, Oct 05, 2006 at 08:27:02PM +0200, Andi Kleen wrote:
 On Thursday 05 October 2006 19:57, Steve Fox wrote:
  On Thu, 2006-10-05 at 17:40 +0200, Andi Kleen wrote:
  
   Please don't snip the Code: line. It is fairly important.
  
  Sorry about that. The remote console I was using appears to overwrite
  some text after I force the reboot. Here's a clean one.
  
  global 
 
 Ok that definitely shouldn't be in there.
 
 I guess we need to track when it gets corrupted. Can you send the full
 boot log with this patch applied?
 

I just recalled one more observation from when Keith last reported the
problem: if I moved .bss before .data_nosave instead of leaving it at
the end, Keith's problem disappeared.

Thanks
Vivek


Re: 2.6.18-mm2 boot failure on x86-64

2006-10-04 Thread Vivek Goyal
On Wed, Oct 04, 2006 at 08:45:40AM -0700, Andrew Morton wrote:
 On Wed, 04 Oct 2006 08:42:28 -0500
 Steve Fox [EMAIL PROTECTED] wrote:
 
  On Thu, 2006-09-28 at 14:01 -0700, Andrew Morton wrote:
   On Thu, 28 Sep 2006 17:50:31 + (UTC)
   Steve Fox [EMAIL PROTECTED] wrote:
   
On Thu, 28 Sep 2006 01:46:23 -0700, Andrew Morton wrote:

 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18/2.6.18-mm2/

Panic on boot. This machine booted 2.6.18-mm1 fine. em64t machine.

TCP bic registered
TCP westwood registered
TCP htcp registered
NET: Registered protocol family 1
NET: Registered protocol family 17
Unable to handle kernel paging request at  RIP: 
 [8047ef93] packet_notifier+0x163/0x1a0
PGD 203027 PUD 2b031067 PMD 0 
Oops:  [1] SMP 
last sysfs file: 
CPU 0 
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-mm2-autokern1 #1
RIP: 0010:[8047ef93]  [8047ef93] 
packet_notifier+0x163/0x1a0
RSP: :810bffcbde90  EFLAGS: 00010286
RAX:  RBX: 810bff4a1000 RCX: 
RDX: 810bff4a1000 RSI: 0005 RDI: 8055f5e0
RBP:  R08: 7616 R09: 000e
R10: 0006 R11: 803373f0 R12: 
R13: 0005 R14: 810bff4a1000 R15: 
FS:  () GS:805d8000() 
knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2:  CR3: 00201000 CR4: 06e0
Process swapper (pid: 1, threadinfo 810bffcbc000, task 
810bffcbb510)
Stack:  810bff4a1000 8055f4c0  
810bffcbdef0
  8042736e  
  8061c68d 806260f0 80207182
Call Trace:
 [8042736e] register_netdevice_notifier+0x3e/0x70
 [8061c68d] packet_init+0x2d/0x53
 [80207182] init+0x162/0x330
 [8020a9d8] child_rip+0xa/0x12
 [8033c2a2] acpi_ds_init_one_object+0x0/0x82
 [80207020] init+0x0/0x330
 [8020a9ce] child_rip+0x0/0x12


Code: 48 8b 45 00 0f 18 08 49 83 fd 02 4c 8d 65 f8 0f 84 f8 fe ff 
RIP  [8047ef93] packet_notifier+0x163/0x1a0
 RSP 810bffcbde90
CR2: 
 0Kernel panic - not syncing: Attempted to kill init!

   
   I'm really struggling to work out what went wrong there.  Comparing your
   miserable 20 bytes of code to my object code makes me think that this:
   
 struct packet_sock *po = pkt_sk(sk);
   
   returned -1, perhaps in %ebp.  But it's all very crude.
   
   Perhaps you could compile that kernel with CONFIG_DEBUG_INFO, rerun it 
   (the
   addresses might change) then have a poke around with `gdb vmlinux' (or
   maybe just addr2line) to work out where it's really oopsing?
   
   I don't see much which has changed in that area recently.
  
  Sorry for the delay. I was finally able to perform a bisect on this. It
  turns out the patch that causes this is
  x86_64-mm-re-positioning-the-bss-segment.patch, which seems like a
  strange candidate, but sure enough I can boot to login: right up until
  that patch is applied.
 
 hm, that patch was merged into mainline September 29.  Does mainline work?
 

I thought the above patch was dropped because Keith ran into some boot issues
on one of the machines. There seems to be nothing wrong with the patch as
such, but it might have triggered some existing bug. I looked into the issue
at that time, but nothing was conclusive.

So it looks like this patch has come back. I am not sure how.

Thanks
Vivek


Re: 2.6.18-mm2 boot failure on x86-64

2006-10-04 Thread Vivek Goyal
On Wed, Oct 04, 2006 at 05:06:59PM -0700, Andrew Morton wrote:
 On Wed, 04 Oct 2006 11:41:59 -0500
 Steve Fox [EMAIL PROTECTED] wrote:
 
  On Wed, 2006-10-04 at 08:45 -0700, Andrew Morton wrote:
   On Wed, 04 Oct 2006 08:42:28 -0500
   Steve Fox [EMAIL PROTECTED] wrote:
Sorry for the delay. I was finally able to perform a bisect on this. It
turns out the patch that causes this is
x86_64-mm-re-positioning-the-bss-segment.patch, which seems like a
strange candidate, but sure enough I can boot to login: right up until
that patch is applied.
   
   hm, that patch was merged into mainline September 29.  Does mainline work?
  
  -git21 also fails with this same error.
  
 
 OK, thanks.  And we know that
 x86_64-mm-re-positioning-the-bss-segment.patch triggered this failure.  And
 that patch is non-buggy, and the xfrm code is probably non-buggy.  So we don't
 know squat, and we're going to need to debug this crash.
 
 Well.  There is one trick we could use: apply
 x86_64-mm-re-positioning-the-bss-segment.patch to 2.6.18 base and see if it
 crashes.  If it doesn't, then we can theorise that the bug is some buggy
 post 2.6.18 patch which is being exposed by

I think it would most likely crash on 2.6.18. Keith Mannthey had reported
a different crash on 2.6.18-rc4-mm2 when this patch was first introduced.
Following is the link to the thread.

http://marc.theaimsgroup.com/?l=linux-kernel&m=115629369729911&w=2

Following is the backtrace he had reported.

 Unable to handle kernel NULL pointer dereference at 0007
 RIP:
  [803d45b0] __unix_insert_socket+0x49/0x5a
 PGD 115c934067 PUD 115c935067 PMD 0
 Oops: 0002 [1] SMP
 last sysfs file:
 CPU 14
 Modules linked in:
 Pid: 1, comm: init Not tainted 2.6.18-rc4-mm2-smp #3
 RIP: 0010:[803d45b0]  [803d45b0]
 __unix_insert_socket+0x49/0x5a
 RSP: 0018:810460605eb8  EFLAGS: 00010286
 RAX:  RBX: 81115c171c80 RCX: 
 RDX: 81115c171c88 RSI: 81115c171c80 RDI: 806656e0
 RBP: 806656e0 R08: 81115c069200 R09: 8110700b4000
 R10:  R11: 0002 R12: 81115c170d00
 R13: 0001 R14: 0001 R15: 
 FS:  2b793a4fd6d0() GS:81115c910e40()
 knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 0007 CR3: 00115c92d000 CR4: 06e0
 Process init (pid: 1, threadinfo 810460604000, task
 81115cb10040)
 Stack:  00010001  81115c171c80
 803d58e9
  8045bb30 000180298f61 80498080 0001
  81115c170d00 803d595d 0004 80376061
 Call Trace:
  [803d58e9] unix_create1+0xf3/0x107
  [803d595d] unix_create+0x60/0x6b
  [80376061] __sock_create+0x12f/0x227
  [80376429] sys_socket+0xf/0x37
  [8020968e] system_call+0x7e/0x83


 Code: 48 89 50 08 48 89 55 00 48 89 6a 08 41 58 5b 5d c3 c7 47 08
 RIP  [803d45b0] __unix_insert_socket+0x49/0x5a
  RSP 810460605eb8
 CR2: 0007
  0Kernel panic - not syncing: Attempted to kill init!

Thanks
Vivek