Re: 2.6.18-mm2 boot failure on x86-64
On Mon, Oct 16, 2006 at 04:58:14PM -0700, Andrew Morton wrote:
> On Mon, 16 Oct 2006 14:16:13 -0400 Vivek Goyal [EMAIL PROTECTED] wrote:
>
> > Can you please have a look at the attached patch
>
> Looks like a fine patch to me, although it could benefit from a comment
> explaining why all those PAGE_ALIGN()s are in there.
>
> > and include it in -mm.
>
> Does it fix a patch in -mm or is it needed in mainline?

The bug in my list was reported to be present in mainline [1].

cu
Adrian

[1] http://lkml.org/lkml/2006/10/4/394

--
"Is there not promise of rain?" Ling Tan asked suddenly out of the darkness.
There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.18-mm2 boot failure on x86-64
On Tue, 17 Oct 2006, Adrian Bunk wrote:
> On Mon, Oct 16, 2006 at 04:58:14PM -0700, Andrew Morton wrote:
> > On Mon, 16 Oct 2006 14:16:13 -0400 Vivek Goyal [EMAIL PROTECTED] wrote:
> >
> > > Can you please have a look at the attached patch
> >
> > Looks like a fine patch to me, although it could benefit from a comment
> > explaining why all those PAGE_ALIGN()s are in there.
> >
> > > and include it in -mm.
> >
> > Does it fix a patch in -mm or is it needed in mainline?
>
> The bug in my list was reported to be present in mainline [1].

Confirmed. This bug is present in 2.6.19-rc2.

--
Mel Gorman
Part-time Phd Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
Re: 2.6.18-mm2 boot failure on x86-64
On Mon, Oct 09, 2006 at 10:53:58AM +0100, Mel Gorman wrote:
> On Fri, 6 Oct 2006, Vivek Goyal wrote:
> > On Fri, Oct 06, 2006 at 01:03:50PM -0500, Steve Fox wrote:
> > > On Fri, 2006-10-06 at 18:11 +0100, Mel Gorman wrote:
> > > > On (06/10/06 11:36), Vivek Goyal didst pronounce:
> > > > > Where is bss placed in physical memory? I guess bss_start and
> > > > > bss_stop from System.map will tell us. That will confirm that
> > > > > above memset step is stomping over bss. Then we have to just find
> > > > > that somewhere probably we allocated wrong physical memory area
> > > > > for bootmem allocator map.
> > > >
> > > > BSS is at 0x643000 - 0x777BC4
> > > > init_bootmem wipes from 0x777000 - 0x8F7000
> > > >
> > > > So the BSS bytes from 0x777000 - 0x777BC4 (which looks very
> > > > suspiciously like a page alignment of addr & PAGE_MASK) gets set
> > > > to 0xFF.
> > > >
> > > > One possible fix is below. It adds a check in bad_addr() to see if
> > > > the BSS section is about to be used for bootmap. It Seems To Work
> > > > For Me (tm) and illustrates the source of the problem even if it's
> > > > not the 100% correct fix.
> > >
> > > I was able to boot the machine with Mel's patch applied on top of
> > > -git22.
> >
> > Please have a look at the attached patch. Does it make some sense.
>
> It makes some sense. As you state, it wastes memory but that is better
> than breaking.
>
> > Steve, can you please give this patch a try if it fixes the problem?
>
> I boottested the patch on the same machine as Steve was using and it
> completed successfully.

Hi Andrew,

Can you please have a look at the attached patch and include it in -mm.
This fixes the issue for Steve. It also figures in the list of Adrian Bunk
of known regressions.

Subject    : oops in xfrm_register_mode
References : http://lkml.org/lkml/2006/10/4/170
Submitter  : Steve Fox [EMAIL PROTECTED]
Handled-By : Vivek Goyal [EMAIL PROTECTED]
Status     : patch available

o Currently some code pieces assume that address returned by find_e820_area()
  are page aligned. But looks like find_e820_area() had no such intention and
  hence one might end up stomping over some of the data. One such case is
  bootmem allocator initialization code stomped over bss.

o This patch modified find_e820_area() to return page aligned address. This
  might be little wasteful of memory but at the same time probably it is
  easier to handle page aligned memory.

Signed-off-by: Vivek Goyal [EMAIL PROTECTED]
---

 arch/x86_64/kernel/e820.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff -puN arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area arch/x86_64/kernel/e820.c
--- linux-2.6.19-rc1-1M/arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area 2006-10-06 15:28:13.0 -0400
+++ linux-2.6.19-rc1-1M-root/arch/x86_64/kernel/e820.c 2006-10-06 15:44:45.0 -0400
@@ -54,13 +54,13 @@ static inline int bad_addr(unsigned long
 
         /* various gunk below that needed for SMP startup */
         if (addr < 0x8000) {
-                *addrp = 0x8000;
+                *addrp = PAGE_ALIGN(0x8000);
                 return 1;
         }
 
         /* direct mapping tables of the kernel */
         if (last >= table_start<<PAGE_SHIFT && addr < table_end<<PAGE_SHIFT) {
-                *addrp = table_end << PAGE_SHIFT;
+                *addrp = PAGE_ALIGN(table_end << PAGE_SHIFT);
                 return 1;
         }
 
@@ -68,18 +68,18 @@ static inline int bad_addr(unsigned long
 #ifdef CONFIG_BLK_DEV_INITRD
         if (LOADER_TYPE && INITRD_START && last >= INITRD_START &&
             addr < INITRD_START+INITRD_SIZE) {
-                *addrp = INITRD_START + INITRD_SIZE;
+                *addrp = PAGE_ALIGN(INITRD_START + INITRD_SIZE);
                 return 1;
         }
 #endif
         /* kernel code */
-        if (last >= __pa_symbol(&_text) && last < __pa_symbol(&_end)) {
-                *addrp = __pa_symbol(&_end);
+        if (last >= __pa_symbol(&_text) && addr < __pa_symbol(&_end)) {
+                *addrp = PAGE_ALIGN(__pa_symbol(&_end));
                 return 1;
         }
 
         if (last >= ebda_addr && addr < ebda_addr + ebda_size) {
-                *addrp = ebda_addr + ebda_size;
+                *addrp = PAGE_ALIGN(ebda_addr + ebda_size);
                 return 1;
         }
 
@@ -152,7 +152,7 @@ unsigned long __init find_e820_area(unsi
                         continue;
                 while (bad_addr(&addr, size) && addr+size <= ei->addr+ei->size)
                         ;
-                last = addr + size;
+                last = PAGE_ALIGN(addr) + size;
                 if (last > ei->addr + ei->size)
                         continue;
                 if (last > end)
_
Re: 2.6.18-mm2 boot failure on x86-64
On Mon, 16 Oct 2006 14:16:13 -0400 Vivek Goyal [EMAIL PROTECTED] wrote:

> Can you please have a look at the attached patch

Looks like a fine patch to me, although it could benefit from a comment
explaining why all those PAGE_ALIGN()s are in there.

> and include it in -mm.

Does it fix a patch in -mm or is it needed in mainline?
Re: 2.6.18-mm2 boot failure on x86-64
On Fri, 6 Oct 2006, Vivek Goyal wrote:
> On Fri, Oct 06, 2006 at 01:03:50PM -0500, Steve Fox wrote:
> > On Fri, 2006-10-06 at 18:11 +0100, Mel Gorman wrote:
> > > On (06/10/06 11:36), Vivek Goyal didst pronounce:
> > > > Where is bss placed in physical memory? I guess bss_start and
> > > > bss_stop from System.map will tell us. That will confirm that above
> > > > memset step is stomping over bss. Then we have to just find that
> > > > somewhere probably we allocated wrong physical memory area for
> > > > bootmem allocator map.
> > >
> > > BSS is at 0x643000 - 0x777BC4
> > > init_bootmem wipes from 0x777000 - 0x8F7000
> > >
> > > So the BSS bytes from 0x777000 - 0x777BC4 (which looks very
> > > suspiciously like a page alignment of addr & PAGE_MASK) gets set to
> > > 0xFF.
> > >
> > > One possible fix is below. It adds a check in bad_addr() to see if
> > > the BSS section is about to be used for bootmap. It Seems To Work For
> > > Me (tm) and illustrates the source of the problem even if it's not
> > > the 100% correct fix.
> >
> > I was able to boot the machine with Mel's patch applied on top of
> > -git22.
>
> Please have a look at the attached patch. Does it make some sense.

It makes some sense. As you state, it wastes memory but that is better than
breaking.

> Steve, can you please give this patch a try if it fixes the problem?

I boottested the patch on the same machine as Steve was using and it
completed successfully.

> Thanks
> Vivek
>
> o Currently some code pieces assume that address returned by
>   find_e820_area() are page aligned. But looks like find_e820_area() had
>   no such intention and hence one might end up stomping over some of the
>   data. One such case is bootmem allocator initialization code stomped
>   over bss.
>
> o This patch modified find_e820_area() to return page aligned address.
>   This might be little wasteful of memory but at the same time probably it
>   is easier to handle page aligned memory.
>
> Signed-off-by: Vivek Goyal [EMAIL PROTECTED]
> ---
>
>  arch/x86_64/kernel/e820.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff -puN arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area arch/x86_64/kernel/e820.c
> --- linux-2.6.19-rc1-1M/arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area 2006-10-06 15:28:13.0 -0400
> +++ linux-2.6.19-rc1-1M-root/arch/x86_64/kernel/e820.c 2006-10-06 15:44:45.0 -0400
> @@ -54,13 +54,13 @@ static inline int bad_addr(unsigned long
>
>          /* various gunk below that needed for SMP startup */
>          if (addr < 0x8000) {
> -                *addrp = 0x8000;
> +                *addrp = PAGE_ALIGN(0x8000);
>                  return 1;
>          }
>
>          /* direct mapping tables of the kernel */
>          if (last >= table_start<<PAGE_SHIFT && addr < table_end<<PAGE_SHIFT) {
> -                *addrp = table_end << PAGE_SHIFT;
> +                *addrp = PAGE_ALIGN(table_end << PAGE_SHIFT);
>                  return 1;
>          }
>
> @@ -68,18 +68,18 @@ static inline int bad_addr(unsigned long
>  #ifdef CONFIG_BLK_DEV_INITRD
>          if (LOADER_TYPE && INITRD_START && last >= INITRD_START &&
>              addr < INITRD_START+INITRD_SIZE) {
> -                *addrp = INITRD_START + INITRD_SIZE;
> +                *addrp = PAGE_ALIGN(INITRD_START + INITRD_SIZE);
>                  return 1;
>          }
>  #endif
>          /* kernel code */
> -        if (last >= __pa_symbol(&_text) && last < __pa_symbol(&_end)) {
> -                *addrp = __pa_symbol(&_end);
> +        if (last >= __pa_symbol(&_text) && addr < __pa_symbol(&_end)) {
> +                *addrp = PAGE_ALIGN(__pa_symbol(&_end));
>                  return 1;
>          }
>
>          if (last >= ebda_addr && addr < ebda_addr + ebda_size) {
> -                *addrp = ebda_addr + ebda_size;
> +                *addrp = PAGE_ALIGN(ebda_addr + ebda_size);
>                  return 1;
>          }
>
> @@ -152,7 +152,7 @@ unsigned long __init find_e820_area(unsi
>                          continue;
>                  while (bad_addr(&addr, size) && addr+size <= ei->addr+ei->size)
>                          ;
> -                last = addr + size;
> +                last = PAGE_ALIGN(addr) + size;
>                  if (last > ei->addr + ei->size)
>                          continue;
>                  if (last > end)
> _

--
Mel Gorman
Part-time Phd Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
Re: 2.6.18-mm2 boot failure on x86-64
On Fri, Oct 06, 2006 at 03:33:12PM +0100, Mel Gorman wrote:
> Linux version 2.6.18-git22 ([EMAIL PROTECTED]) (gcc version 4.1.0 (SUSE Linux)) #2 SMP Thu Oct 5 19:05:36 PDT 2006
> Command line: root=/dev/sda1 vga=791 ip=9.47.67.239:9.47.67.50:9.47.67.1:255.255.255.0 resume=/dev/sdb1 showopts earlyprintk=serial,ttyS0,57600 console=tty0 console=ttyS0,57600 autobench_args: root=/dev/sda1 ABAT:1160100417
> BIOS-provided physical RAM map:
> BIOS-e820: - 0009ac00 (usable)
> BIOS-e820: 0009ac00 - 000a (reserved)
> BIOS-e820: 000e - 0010 (reserved)
> BIOS-e820: 0010 - bff764c0 (usable)
> BIOS-e820: bff764c0 - bff98880 (ACPI data)
> BIOS-e820: bff98880 - c000 (reserved)
> BIOS-e820: fec0 - 0001 (reserved)
> BIOS-e820: 0001 - 000c (usable)
>
> I continued what Steve was doing this morning to see could this be pinned
> down. After placing 'CHECK;' in a few places as suggested by Andi's check,
> the problem code was identified as that following in
> mm/bootmem.c#init_bootmem_core()
>
>         mapsize = get_mapsize(bdata);
>         memset(bdata->node_bootmem_map, 0xff, mapsize);
>
> That explains the value in the array at least. A few more printfs around
> this point printed out the following in the boot log
>
> init_bootmem_core(0, 1909, 0, 12582912)
> init_bootmem_core: Calling memset(0x81775000, 1572864)
> AAGH: afinfo corrupted at mm/bootmem.c:121
>
> where;
>
> 1909 == mapstart
> 0 == start
> 12582912 == end
> 1572864 == mapsize
>
> mapstart, start and end being the parameters being passed to
> init_bootmem_core(). This means we are calling memset for the physical
> range 0x775000 - 0x8F5000 which is in a usable range according to the
> BIOS-e820 map it appears.

Hi Mel,

Where is bss placed in physical memory? I guess bss_start and bss_stop
from System.map will tell us. That will confirm that above memset step is
stomping over bss. Then we have to just find that somewhere probably we
allocated wrong physical memory area for bootmem allocator map.

Thanks
Vivek
Re: 2.6.18-mm2 boot failure on x86-64
On Fri, Oct 06, 2006 at 06:11:05PM +0100, Mel Gorman wrote: On (06/10/06 11:36), Vivek Goyal didst pronounce: On Fri, Oct 06, 2006 at 03:33:12PM +0100, Mel Gorman wrote: Linux version 2.6.18-git22 ([EMAIL PROTECTED]) (gcc version 4.1.0 (SUSE Linux)) #2 SMP Thu Oct 5 19:05:36 PDT 2006 Command line: root=/dev/sda1 vga=791 ip=9.47.67.239:9.47.67.50:9.47.67.1:255.255.255.0 resume=/dev/sdb1 showopts earlyprintk=serial,ttyS0,57600 console=tty0 console=ttyS0,57600 autobench_args: root=/dev/sda1 ABAT:1160100417 BIOS-provided physical RAM map: BIOS-e820: - 0009ac00 (usable) BIOS-e820: 0009ac00 - 000a (reserved) BIOS-e820: 000e - 0010 (reserved) BIOS-e820: 0010 - bff764c0 (usable) BIOS-e820: bff764c0 - bff98880 (ACPI data) BIOS-e820: bff98880 - c000 (reserved) BIOS-e820: fec0 - 0001 (reserved) BIOS-e820: 0001 - 000c (usable) I continued what Steve was doing this morning to see could this be pinned down. After placing 'CHECK;' in a few places as suggested by Andi's check, the problem code was identified as that following in mm/bootmem.c#init_bootmem_core() mapsize = get_mapsize(bdata); memset(bdata-node_bootmem_map, 0xff, mapsize); That explains the value in the array at least. A few more printfs around this point printed out the following in the boot log init_bootmem_core(0, 1909, 0, 12582912) init_bootmem_core: Calling memset(0x81775000, 1572864) AAGH: afinfo corrupted at mm/bootmem.c:121 where; 1909 == mapstart 0 == start 12582912 == end 1572864 == mapsize mapstart, start and end being the parameters being passed to init_bootmem_core(). This means we are calling memset for the physical range 0x775000 - 0x8F5000 which is in a usable range according to the BIOS-e820 map it appears. Hi Mel, Hi. Where is bss placed in physical memory? I guess bss_start and bss_stop from System.map will tell us. That will confirm that above memset step is stomping over bss. Then we have to just find that somewhere probably we allocated wrong physical memory area for bootmem allocator map. 
> BSS is at 0x643000 - 0x777BC4
> init_bootmem wipes from 0x777000 - 0x8F7000
>
> So the BSS bytes from 0x777000 - 0x777BC4 (which looks very suspiciously
> like a page alignment of addr & PAGE_MASK) gets set to 0xFF.
>
> One possible fix is below. It adds a check in bad_addr() to see if the BSS
> section is about to be used for bootmap. It Seems To Work For Me (tm) and
> illustrates the source of the problem even if it's not the 100% correct
> fix.
>
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-git22-clean/arch/x86_64/kernel/e820.c linux-2.6.18-git22-bss_relocate_fix/arch/x86_64/kernel/e820.c
> --- linux-2.6.18-git22-clean/arch/x86_64/kernel/e820.c 2006-10-05 20:42:07.0 +0100
> +++ linux-2.6.18-git22-bss_relocate_fix/arch/x86_64/kernel/e820.c 2006-10-06 17:39:51.0 +0100
> @@ -51,6 +51,7 @@ extern struct resource code_resource, da
>  static inline int bad_addr(unsigned long *addrp, unsigned long size)
>  {
>          unsigned long addr = *addrp, last = addr + size;
> +        unsigned long bss_start, bss_end;
>
>          /* various gunk below that needed for SMP startup */
>          if (addr < 0x8000) {
> @@ -77,6 +78,14 @@ static inline int bad_addr(unsigned long
>                  *addrp = __pa_symbol(&_end);
>                  return 1;
>          }
> +
> +        /* bss section */
> +        bss_start = __pa_symbol(__bss_start);
> +        bss_end = PAGE_ALIGN(__pa_symbol(__bss_stop));
> +        if (addr >= bss_start && addr < bss_end) {
> +                *addrp = bss_end;
> +                return 1;
> +        }

Surprising, the kernel code check just before this should have taken care
of it.

        /* kernel code */
        if (last >= __pa_symbol(&_text) && last < __pa_symbol(&_end)) {
                *addrp = __pa_symbol(&_end);
                return 1;
        }

May be it can be changed to

        if (last >= __pa_symbol(&_text) && last < PAGE_ALIGN(__pa_symbol(&_end))) {

But all this seem to be a stopgap fix. Still the real puzzle is exactly
where did it slip out and should be fixed there. May be some more printks
will help us.

Thanks
Vivek
Re: 2.6.18-mm2 boot failure on x86-64
On Fri, Oct 06, 2006 at 06:11:05PM +0100, Mel Gorman wrote: On (06/10/06 11:36), Vivek Goyal didst pronounce: On Fri, Oct 06, 2006 at 03:33:12PM +0100, Mel Gorman wrote: Linux version 2.6.18-git22 ([EMAIL PROTECTED]) (gcc version 4.1.0 (SUSE Linux)) #2 SMP Thu Oct 5 19:05:36 PDT 2006 Command line: root=/dev/sda1 vga=791 ip=9.47.67.239:9.47.67.50:9.47.67.1:255.255.255.0 resume=/dev/sdb1 showopts earlyprintk=serial,ttyS0,57600 console=tty0 console=ttyS0,57600 autobench_args: root=/dev/sda1 ABAT:1160100417 BIOS-provided physical RAM map: BIOS-e820: - 0009ac00 (usable) BIOS-e820: 0009ac00 - 000a (reserved) BIOS-e820: 000e - 0010 (reserved) BIOS-e820: 0010 - bff764c0 (usable) BIOS-e820: bff764c0 - bff98880 (ACPI data) BIOS-e820: bff98880 - c000 (reserved) BIOS-e820: fec0 - 0001 (reserved) BIOS-e820: 0001 - 000c (usable) I continued what Steve was doing this morning to see could this be pinned down. After placing 'CHECK;' in a few places as suggested by Andi's check, the problem code was identified as that following in mm/bootmem.c#init_bootmem_core() mapsize = get_mapsize(bdata); memset(bdata-node_bootmem_map, 0xff, mapsize); That explains the value in the array at least. A few more printfs around this point printed out the following in the boot log init_bootmem_core(0, 1909, 0, 12582912) init_bootmem_core: Calling memset(0x81775000, 1572864) AAGH: afinfo corrupted at mm/bootmem.c:121 where; 1909 == mapstart 0 == start 12582912 == end 1572864 == mapsize mapstart, start and end being the parameters being passed to init_bootmem_core(). This means we are calling memset for the physical range 0x775000 - 0x8F5000 which is in a usable range according to the BIOS-e820 map it appears. Hi Mel, Hi. Where is bss placed in physical memory? I guess bss_start and bss_stop from System.map will tell us. That will confirm that above memset step is stomping over bss. Then we have to just find that somewhere probably we allocated wrong physical memory area for bootmem allocator map. 
> BSS is at 0x643000 - 0x777BC4
> init_bootmem wipes from 0x777000 - 0x8F7000
>
> So the BSS bytes from 0x777000 - 0x777BC4 (which looks very suspiciously
> like a page alignment of addr & PAGE_MASK) gets set to 0xFF.
>
> One possible fix is below. It adds a check in bad_addr() to see if the BSS
> section is about to be used for bootmap. It Seems To Work For Me (tm) and
> illustrates the source of the problem even if it's not the 100% correct
> fix.

Ok, it looks like that code is assuming that memory area returned by
find_e820_area() is page aligned. I found two such instances and that's
what is leading to problem.

        bootmap_size = init_bootmem_node(NODE_DATA(nodeid),
                                         bootmap_start >> PAGE_SHIFT,
                                         start_pfn, end_pfn);

Here bootmap_start is not page aligned and I guess currently should contain
the value 0x777BC4 (just beyond _end). But the moment I do
bootmap_start >> PAGE_SHIFT, I start stomping bss.

Similar is the case here.

        bootmap = find_e820_area(0, end_pfn<<PAGE_SHIFT, bootmap_size);
        if (bootmap == -1L)
                panic("Cannot find bootmem map of size %ld\n", bootmap_size);
        bootmap_size = init_bootmem(bootmap >> PAGE_SHIFT, end_pfn);

So may be we should return a page aligned address from find_e820_area().
May be we can change bad_addr() to set *addrp to next page aligned boundary
for every check?

        *addrp = PAGE_ALIGN(__pa_symbol(&_end));

Thanks
Vivek
Re: 2.6.18-mm2 boot failure on x86-64
On Fri, 2006-10-06 at 18:11 +0100, Mel Gorman wrote:
> On (06/10/06 11:36), Vivek Goyal didst pronounce:
> > Where is bss placed in physical memory? I guess bss_start and bss_stop
> > from System.map will tell us. That will confirm that above memset step is
> > stomping over bss. Then we have to just find that somewhere probably we
> > allocated wrong physical memory area for bootmem allocator map.
>
> BSS is at 0x643000 - 0x777BC4
> init_bootmem wipes from 0x777000 - 0x8F7000
>
> So the BSS bytes from 0x777000 - 0x777BC4 (which looks very suspiciously
> like a page alignment of addr & PAGE_MASK) gets set to 0xFF.
>
> One possible fix is below. It adds a check in bad_addr() to see if the BSS
> section is about to be used for bootmap. It Seems To Work For Me (tm) and
> illustrates the source of the problem even if it's not the 100% correct
> fix.

I was able to boot the machine with Mel's patch applied on top of -git22.

--
Steve Fox
IBM Linux Technology Center
Re: 2.6.18-mm2 boot failure on x86-64
On Fri, Oct 06, 2006 at 01:03:50PM -0500, Steve Fox wrote:
> On Fri, 2006-10-06 at 18:11 +0100, Mel Gorman wrote:
> > On (06/10/06 11:36), Vivek Goyal didst pronounce:
> > > Where is bss placed in physical memory? I guess bss_start and
> > > bss_stop from System.map will tell us. That will confirm that above
> > > memset step is stomping over bss. Then we have to just find that
> > > somewhere probably we allocated wrong physical memory area for
> > > bootmem allocator map.
> >
> > BSS is at 0x643000 - 0x777BC4
> > init_bootmem wipes from 0x777000 - 0x8F7000
> >
> > So the BSS bytes from 0x777000 - 0x777BC4 (which looks very
> > suspiciously like a page alignment of addr & PAGE_MASK) gets set to
> > 0xFF.
> >
> > One possible fix is below. It adds a check in bad_addr() to see if the
> > BSS section is about to be used for bootmap. It Seems To Work For Me
> > (tm) and illustrates the source of the problem even if it's not the
> > 100% correct fix.
>
> I was able to boot the machine with Mel's patch applied on top of -git22.

Please have a look at the attached patch. Does it make some sense.

Steve, can you please give this patch a try if it fixes the problem?

Thanks
Vivek

o Currently some code pieces assume that address returned by find_e820_area()
  are page aligned. But looks like find_e820_area() had no such intention and
  hence one might end up stomping over some of the data. One such case is
  bootmem allocator initialization code stomped over bss.

o This patch modified find_e820_area() to return page aligned address. This
  might be little wasteful of memory but at the same time probably it is
  easier to handle page aligned memory.

Signed-off-by: Vivek Goyal [EMAIL PROTECTED]
---

 arch/x86_64/kernel/e820.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff -puN arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area arch/x86_64/kernel/e820.c
--- linux-2.6.19-rc1-1M/arch/x86_64/kernel/e820.c~x86_64-return-page-aligned-phy-addr-from-find-e820-area 2006-10-06 15:28:13.0 -0400
+++ linux-2.6.19-rc1-1M-root/arch/x86_64/kernel/e820.c 2006-10-06 15:44:45.0 -0400
@@ -54,13 +54,13 @@ static inline int bad_addr(unsigned long
 
         /* various gunk below that needed for SMP startup */
         if (addr < 0x8000) {
-                *addrp = 0x8000;
+                *addrp = PAGE_ALIGN(0x8000);
                 return 1;
         }
 
         /* direct mapping tables of the kernel */
         if (last >= table_start<<PAGE_SHIFT && addr < table_end<<PAGE_SHIFT) {
-                *addrp = table_end << PAGE_SHIFT;
+                *addrp = PAGE_ALIGN(table_end << PAGE_SHIFT);
                 return 1;
         }
 
@@ -68,18 +68,18 @@ static inline int bad_addr(unsigned long
 #ifdef CONFIG_BLK_DEV_INITRD
         if (LOADER_TYPE && INITRD_START && last >= INITRD_START &&
             addr < INITRD_START+INITRD_SIZE) {
-                *addrp = INITRD_START + INITRD_SIZE;
+                *addrp = PAGE_ALIGN(INITRD_START + INITRD_SIZE);
                 return 1;
         }
 #endif
         /* kernel code */
-        if (last >= __pa_symbol(&_text) && last < __pa_symbol(&_end)) {
-                *addrp = __pa_symbol(&_end);
+        if (last >= __pa_symbol(&_text) && addr < __pa_symbol(&_end)) {
+                *addrp = PAGE_ALIGN(__pa_symbol(&_end));
                 return 1;
         }
 
         if (last >= ebda_addr && addr < ebda_addr + ebda_size) {
-                *addrp = ebda_addr + ebda_size;
+                *addrp = PAGE_ALIGN(ebda_addr + ebda_size);
                 return 1;
         }
 
@@ -152,7 +152,7 @@ unsigned long __init find_e820_area(unsi
                         continue;
                 while (bad_addr(&addr, size) && addr+size <= ei->addr+ei->size)
                         ;
-                last = addr + size;
+                last = PAGE_ALIGN(addr) + size;
                 if (last > ei->addr + ei->size)
                         continue;
                 if (last > end)
_
Re: 2.6.18-mm2 boot failure on x86-64
On Wed, 2006-10-04 at 18:08 -0700, Martin Bligh wrote:
> Andi Kleen wrote:
> > > I think most likely it would crash on 2.6.18. Keith mannthey had
> > > reported a different crash on 2.6.18-rc4-mm2 when this patch was
> > > introduced first time. Following is the link to the thread.
> >
> > Then maybe trying 2.6.17 + the patch and then bisect between that and
> > -rc4?
>
> I think it's fixed already in -git22, or at least it is for the IBM box
> reporting to test.kernel.org. You might want to try that one ...

-git22 also panics for me.

--
Steve Fox
IBM Linux Technology Center
Re: 2.6.18-mm2 boot failure on x86-64
On Thu, 2006-10-05 at 09:53 -0500, Steve Fox wrote:
> On Wed, 2006-10-04 at 18:08 -0700, Martin Bligh wrote:
> > Andi Kleen wrote:
> > > > I think most likely it would crash on 2.6.18. Keith mannthey had
> > > > reported a different crash on 2.6.18-rc4-mm2 when this patch was
> > > > introduced first time. Following is the link to the thread.
> > >
> > > Then maybe trying 2.6.17 + the patch and then bisect between that and
> > > -rc4?
> >
> > I think it's fixed already in -git22, or at least it is for the IBM box
> > reporting to test.kernel.org. You might want to try that one ...
>
> -git22 also panics for me.

Steve,

Can you post the latest panic stack again (with CONFIG_DEBUG_KERNEL)?
Last time I couldn't match your instruction dump to any code segment in
the routine. And also, can you post your .config file.

I have an amd64 and em64t machine and both work fine...

Thanks,
Badari
Re: 2.6.18-mm2 boot failure on x86-64
On Thu, 2006-10-05 at 08:12 -0700, Badari Pulavarty wrote: Can you post the latest panic stack again (with CONFIG_DEBUG_KERNEL) ? CONFIG_DEBUG_KERNEL should be on Last time I couldn't match your instruction dump to any code segment in the routine. And also, can you post your .config file. I have an amd64 and em64t machine and both work fine... Unable to handle kernel NULL pointer dereference at 0827 RIP: [804705e6] xfrm_register_mode+0x36/0x60 PGD 0 Oops: [1] SMP CPU 0 Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.18-git22 #1 RIP: 0010:[804705e6] [804705e6] xfrm_register_mode+0x36/0x60 RSP: :810bffcbded0 EFLAGS: 00010286 RAX: 081f RBX: 805588a0 RCX: RDX: RSI: 0002 RDI: 80559550 RBP: ffef R08: 3f924371 R09: R10: 810bffcbdcb0 R11: 0154 R12: R13: 810bffcbdef0 R14: R15: FS: () GS:805d2000() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: 0827 CR3: 00201000 CR4: 06e0 Process swapper (pid: 1, threadinfo 810bffcbc000, task 810bffcbb4e0) Stack: 8061fb48 80207182 0009 The base config file I'm using is at http://flooterbu.net/kernel/elm3b239-2.6.17.config -- Steve Fox IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.18-mm2 boot failure on x86-64
On Thu, 2006-10-05 at 17:40 +0200, Andi Kleen wrote: Please don't snip the Code: line. It is fairly important. Sorry about that. The remote console I was using appears to overwrite some text after I force the reboot. Here's a clean one. global Unable to handle kernel NULL pointer dereference at 0827 RIP: [80470766] xfrm_register_mode+0x36/0x60 PGD 0 Oops: [1] SMP CPU 0 Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.18-git22 #3 RIP: 0010:[80470766] [80470766] xfrm_register_mode+0x36/0x60 RSP: :810bffcbded0 EFLAGS: 00010286 RAX: 081f RBX: 805588a0 RCX: RDX: RSI: 0046 RDI: 80559550 RBP: ffef R08: 7a02 R09: 000e R10: 0006 R11: 80334660 R12: R13: 810bffcbdef0 R14: R15: FS: () GS:805d2000() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: 0827 CR3: 00201000 CR4: 06e0 Process swapper (pid: 1, threadinfo 810bffcbc000, task 810bffcbb4e0) Stack: 8061fb48 80207182 0009 Call Trace: [80207182] init+0x162/0x330 [8020a9a8] child_rip+0xa/0x12 [803394c2] acpi_ds_init_one_object+0x0/0x82 [80207020] init+0x0/0x330 [8020a99e] child_rip+0x0/0x12 Code: 48 83 78 08 00 75 06 48 89 58 08 31 ed 48 89 d7 e8 65 fd ff RIP [80470766] xfrm_register_mode+0x36/0x60 RSP 810bffcbded0 CR2: 0827 0Kernel panic - not syncing: Aiee, killing interrupt handler! My guess is that something is wrong with the global variable it is accessing. Can you post the output of grep -5 xfrm_policy_afinfo ? 
elm3b239:/boot # grep -5 xfrm_policy_afinfo System.map-2.6.18-git22 805594c0 d xfrm4_state_afinfo 80559500 D xfrm_cfg_mutex 80559530 d xfrm_dev_notifier 80559548 d xfrm_policy_lock 8055954c d xfrm_policy_gc_lock 80559550 d xfrm_policy_afinfo_lock 80559560 d xfrm_hash_work 805595c0 d hash_resize_mutex 80559600 D sysctl_xfrm_aevent_etime 80559604 D sysctl_xfrm_aevent_rseqth 80559610 D km_waitq -- 8075bfd8 b idiagnl 8075bfe0 B xfrm_policy_count 8075bff8 b xfrm_policy_gc_list 8075c000 b dummy.28400 8075c038 b idx_generator.27450 8075c040 b xfrm_policy_afinfo 8075c140 b xfrm_policy_gc_work 8075c1a0 b xfrm_policy_inexact 8075c1e0 B xfrm_nl 8075c1e8 b xfrm_state_gc_list 8075c1f0 b acqseq.27386 And please add a printk(global %p\n, xfrm_policy_afinfo[family]); at the beginning of net/xfrm/xfrm_poliy.c:xfrm_policy_lock_afinfo and post the output. Included above. -- Steve Fox IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.18-mm2 boot failure on x86-64
On Thursday 05 October 2006 19:57, Steve Fox wrote: On Thu, 2006-10-05 at 17:40 +0200, Andi Kleen wrote: Please don't snip the Code: line. It is fairly important. Sorry about that. The remote console I was using appears to overwrite some text after I force the reboot. Here's a clean one. global Ok that definitely shouldn't be in there. I guess we need to track when it gets corrupted. Can you send the full boot log with this patch applied? -Andi Index: linux-2.6.19-rc1-hack/init/main.c === --- linux-2.6.19-rc1-hack.orig/init/main.c +++ linux-2.6.19-rc1-hack/init/main.c @@ -75,6 +75,9 @@ static int init(void *); +extern void bugcheck(char *, int); +#define CHECK bugcheck(__FILE__, __LINE__) + extern void init_IRQ(void); extern void fork_init(unsigned long); extern void mca_init(void); @@ -480,6 +483,8 @@ asmlinkage void __init start_kernel(void char * command_line; extern struct kernel_param __start___param[], __stop___param[]; + CHECK; + smp_setup_processor_id(); /* @@ -502,7 +507,9 @@ asmlinkage void __init start_kernel(void page_address_init(); printk(KERN_NOTICE); printk(linux_banner); + CHECK; setup_arch(command_line); + CHECK; setup_per_cpu_areas(); smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ @@ -517,6 +524,7 @@ asmlinkage void __init start_kernel(void * fragile until we cpu_idle() for the first time. 
*/ preempt_disable(); + CHECK; build_all_zonelists(); page_alloc_init(); printk(KERN_NOTICE Kernel command line: %s\n, saved_command_line); @@ -525,6 +533,7 @@ asmlinkage void __init start_kernel(void __stop___param - __start___param, unknown_bootoption); sort_main_extable(); + CHECK; trap_init(); rcu_init(); init_IRQ(); @@ -533,8 +542,10 @@ asmlinkage void __init start_kernel(void hrtimers_init(); softirq_init(); timekeeping_init(); + CHECK; time_init(); profile_init(); + CHECK; if (!irqs_disabled()) printk(start_kernel(): bug: interrupts were enabled early\n); early_boot_irqs_on(); @@ -568,7 +579,9 @@ asmlinkage void __init start_kernel(void #endif vfs_caches_init_early(); cpuset_init_early(); + CHECK; mem_init(); + CHECK; kmem_cache_init(); setup_per_cpu_pageset(); numa_policy_init(); @@ -577,6 +590,7 @@ asmlinkage void __init start_kernel(void calibrate_delay(); pidmap_init(); pgtable_cache_init(); + CHECK; prio_tree_init(); anon_vma_init(); #ifdef CONFIG_X86 @@ -586,12 +600,14 @@ asmlinkage void __init start_kernel(void fork_init(num_physpages); proc_caches_init(); buffer_init(); + CHECK; unnamed_dev_init(); key_init(); security_init(); vfs_caches_init(num_physpages); radix_tree_init(); signals_init(); + CHECK; /* rootfs populating might need page-writeback */ page_writeback_init(); #ifdef CONFIG_PROC_FS @@ -599,6 +615,7 @@ asmlinkage void __init start_kernel(void #endif cpuset_init(); taskstats_init_early(); + CHECK; delayacct_init(); check_bugs(); @@ -609,7 +626,7 @@ asmlinkage void __init start_kernel(void rest_init(); } -static int __initdata initcall_debug; +static int __initdata initcall_debug = 1; static int __init initcall_debug_setup(char *str) { @@ -639,7 +656,11 @@ static void __init do_initcalls(void) printk(\n); } + CHECK; + result = (*call)(); + + CHECK; if (result result != -ENODEV initcall_debug) { sprintf(msgbuf, error code %d, result); @@ -725,21 +746,32 @@ static int init(void * unused) smp_prepare_cpus(max_cpus); + CHECK; + 
do_pre_smp_initcalls(); smp_init(); + + CHECK; + sched_init_smp(); cpuset_init_smp(); + CHECK; + /* * Do this before initcalls, because some drivers want to access * firmware files. */ populate_rootfs(); + CHECK; + do_basic_setup(); + CHECK; + /* * check if there is an early userspace init. If yes, let it do all * the work Index: linux-2.6.19-rc1-hack/net/xfrm/xfrm_policy.c === --- linux-2.6.19-rc1-hack.orig/net/xfrm/xfrm_policy.c +++ linux-2.6.19-rc1-hack/net/xfrm/xfrm_policy.c @@ -39,6 +39,16 @@ EXPORT_SYMBOL(xfrm_policy_count); static DEFINE_RWLOCK(xfrm_policy_afinfo_lock); static struct
Re: 2.6.18-mm2 boot failure on x86-64
On Thu, 2006-10-05 at 20:27 +0200, Andi Kleen wrote: I guess we need to track when it gets corrupted. Can you send the full boot log with this patch applied? Here she blows! root (hd0,0) Filesystem type is reiserfs, partition type 0x83 kernel /boot/vmlinuz-autobench root=/dev/sda1 vga=791 ip=9.47.67.239:9.47.67.5 0:9.47.67.1:255.255.255.0 resume=/dev/sdb1 showopts console=tty0 console=ttyS0, 57600 autobench_args: root=/dev/sda1 ABAT:1160073474 [Linux-bzImage, setup=0x1400, size=0x1dd755] initrd /boot/initrd-autobench.img [Linux-initrd @ 0x37ceb000, 0x304c57 bytes] Linux version 2.6.18-git22 ([EMAIL PROTECTED]) (gcc version 4.1.0 (SUSE Linux)) #4 SMP Thu Oct 5 11:36:21 PDT 2006 Command line: root=/dev/sda1 vga=791 ip=9.47.67.239:9.47.67.50:9.47.67.1:255.255.255.0 resume=/dev/sdb1 showopts console=tty0 console=ttyS0,57600 autobench_args: root=/dev/sda1 ABAT:1160073474 BIOS-provided physical RAM map: BIOS-e820: - 0009ac00 (usable) BIOS-e820: 0009ac00 - 000a (reserved) BIOS-e820: 000e - 0010 (reserved) BIOS-e820: 0010 - bff764c0 (usable) BIOS-e820: bff764c0 - bff98880 (ACPI data) BIOS-e820: bff98880 - c000 (reserved) BIOS-e820: fec0 - 0001 (reserved) BIOS-e820: 0001 - 000c (usable) end_pfn_map = 12582912 DMI 2.3 present. 
Zone PFN ranges: DMA 0 - 4096 DMA324096 - 1048576 Normal1048576 - 12582912 early_node_map[3] active PFN ranges 0:0 - 154 0: 256 - 786294 0: 1048576 - 12582912 ACPI: PM-Timer IO Port: 0x9c ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) Processor #0 (Bootup-CPU) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled) Processor #1 ACPI: LAPIC (acpi_id[0x02] lapic_id[0x06] enabled) Processor #6 ACPI: LAPIC (acpi_id[0x03] lapic_id[0x07] enabled) Processor #7 ACPI: LAPIC (acpi_id[0x04] lapic_id[0x10] enabled) Processor #16 ACPI: LAPIC (acpi_id[0x05] lapic_id[0x11] enabled) Processor #17 ACPI: LAPIC (acpi_id[0x06] lapic_id[0x16] enabled) Processor #22 ACPI: LAPIC (acpi_id[0x07] lapic_id[0x17] enabled) Processor #23 ACPI: LAPIC (acpi_id[0x10] lapic_id[0x20] enabled) Processor #32 ACPI: LAPIC (acpi_id[0x11] lapic_id[0x21] enabled) Processor #33 ACPI: LAPIC (acpi_id[0x12] lapic_id[0x26] enabled) Processor #38 ACPI: LAPIC (acpi_id[0x13] lapic_id[0x27] enabled) Processor #39 ACPI: LAPIC (acpi_id[0x14] lapic_id[0x30] enabled) Processor #48 ACPI: LAPIC (acpi_id[0x15] lapic_id[0x31] enabled) Processor #49 ACPI: LAPIC (acpi_id[0x16] lapic_id[0x36] enabled) Processor #54 ACPI: LAPIC (acpi_id[0x17] lapic_id[0x37] enabled) Processor #55 ACPI: LAPIC (acpi_id[0x20] lapic_id[0x40] enabled) Processor #64 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x21] lapic_id[0x41] enabled) Processor #65 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x22] lapic_id[0x46] enabled) Processor #70 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x23] lapic_id[0x47] enabled) Processor #71 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x24] lapic_id[0x50] enabled) Processor #80 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x25] lapic_id[0x51] enabled) Processor #81 WARNING: NR_CPUS limit of 16 reached. Processor ignored. 
ACPI: LAPIC (acpi_id[0x26] lapic_id[0x56] enabled) Processor #86 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x27] lapic_id[0x57] enabled) Processor #87 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x30] lapic_id[0x60] enabled) Processor #96 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x31] lapic_id[0x61] enabled) Processor #97 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x32] lapic_id[0x66] enabled) Processor #102 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x33] lapic_id[0x67] enabled) Processor #103 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x34] lapic_id[0x70] enabled) Processor #112 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x35] lapic_id[0x71] enabled) Processor #113 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x36] lapic_id[0x76] enabled) Processor #118 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC (acpi_id[0x37] lapic_id[0x77] enabled) Processor #119 WARNING: NR_CPUS limit of 16 reached. Processor ignored. ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x03] dfl dfl lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x04] dfl dfl lint[0x1]) ACPI: LAPIC_NMI (acpi_id[0x05] dfl dfl lint[0x1])
Re: 2.6.18-mm2 boot failure on x86-64
On Thu, Oct 05, 2006 at 08:27:02PM +0200, Andi Kleen wrote:
> On Thursday 05 October 2006 19:57, Steve Fox wrote:
> > On Thu, 2006-10-05 at 17:40 +0200, Andi Kleen wrote:
> > > Please don't snip the Code: line. It is fairly important.
> >
> > Sorry about that. The remote console I was using appears to overwrite
> > some text after I force the reboot. Here's a clean one.
> >
> > global
>
> Ok that definitely shouldn't be in there. I guess we need to track when
> it gets corrupted. Can you send the full boot log with this patch applied?

Just recalled one more observation about the problem when Keith had reported it last. If I just move .bss before .data_nosave instead of it being at the end, Keith's problem had disappeared.

Thanks
Vivek
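One way the position of .bss could matter is page rounding. Figures reported elsewhere in this thread put the BSS at 0x643000 - 0x777BC4 and the bootmem bitmap wipe at 0x777000 - 0x8F7000. The userspace sketch below only illustrates that arithmetic, assuming 4 KiB pages. PAGE_SIZE, PAGE_MASK and PAGE_ALIGN are redefined locally in the style of the kernel macros, and page_round_down() and overlap_bytes() are hypothetical helpers, not kernel functions.

```c
#include <stdint.h>

/* Local stand-ins modelled on the kernel macros; 4 KiB pages are assumed. */
#define PAGE_SIZE       UINT64_C(4096)
#define PAGE_MASK       (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(a)   (((a) + PAGE_SIZE - 1) & PAGE_MASK)

/* Hypothetical helper: round an address down to the start of its page. */
static inline uint64_t page_round_down(uint64_t addr)
{
    return addr & PAGE_MASK;
}

/* Hypothetical helper: number of bytes shared by [a0, a1) and [b0, b1). */
static inline uint64_t overlap_bytes(uint64_t a0, uint64_t a1,
                                     uint64_t b0, uint64_t b1)
{
    uint64_t lo = a0 > b0 ? a0 : b0;   /* later of the two starts */
    uint64_t hi = a1 < b1 ? a1 : b1;   /* earlier of the two ends */
    return hi > lo ? hi - lo : 0;
}
```

With the figures above, page_round_down(0x777BC4) is 0x777000, so a region starting there shares 0xBC4 (3012) bytes with the tail of the BSS. Starting at PAGE_ALIGN(0x777BC4), which is 0x778000, leaves no overlap at the cost of the unused remainder of that page.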
Re: 2.6.18-mm2 boot failure on x86-64
On Thursday 05 October 2006 20:51, Steve Fox wrote: On Thu, 2006-10-05 at 20:27 +0200, Andi Kleen wrote: I guess we need to track when it gets corrupted. Can you send the full boot log with this patch applied? Here she blows! Can you please try it again with this patch to narrow it down further? -Andi Index: linux-2.6.19-rc1-hack/init/main.c === --- linux-2.6.19-rc1-hack.orig/init/main.c +++ linux-2.6.19-rc1-hack/init/main.c @@ -75,6 +75,9 @@ static int init(void *); +extern void bugcheck(char *, int); +#define CHECK bugcheck(__FILE__, __LINE__) + extern void init_IRQ(void); extern void fork_init(unsigned long); extern void mca_init(void); @@ -480,6 +483,8 @@ asmlinkage void __init start_kernel(void char * command_line; extern struct kernel_param __start___param[], __stop___param[]; + CHECK; + smp_setup_processor_id(); /* @@ -502,7 +507,9 @@ asmlinkage void __init start_kernel(void page_address_init(); printk(KERN_NOTICE); printk(linux_banner); + CHECK; setup_arch(command_line); + CHECK; setup_per_cpu_areas(); smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */ @@ -517,6 +524,7 @@ asmlinkage void __init start_kernel(void * fragile until we cpu_idle() for the first time. 
*/ preempt_disable(); + CHECK; build_all_zonelists(); page_alloc_init(); printk(KERN_NOTICE Kernel command line: %s\n, saved_command_line); @@ -525,6 +533,7 @@ asmlinkage void __init start_kernel(void __stop___param - __start___param, unknown_bootoption); sort_main_extable(); + CHECK; trap_init(); rcu_init(); init_IRQ(); @@ -533,8 +542,10 @@ asmlinkage void __init start_kernel(void hrtimers_init(); softirq_init(); timekeeping_init(); + CHECK; time_init(); profile_init(); + CHECK; if (!irqs_disabled()) printk(start_kernel(): bug: interrupts were enabled early\n); early_boot_irqs_on(); @@ -568,7 +579,9 @@ asmlinkage void __init start_kernel(void #endif vfs_caches_init_early(); cpuset_init_early(); + CHECK; mem_init(); + CHECK; kmem_cache_init(); setup_per_cpu_pageset(); numa_policy_init(); @@ -577,6 +590,7 @@ asmlinkage void __init start_kernel(void calibrate_delay(); pidmap_init(); pgtable_cache_init(); + CHECK; prio_tree_init(); anon_vma_init(); #ifdef CONFIG_X86 @@ -586,12 +600,14 @@ asmlinkage void __init start_kernel(void fork_init(num_physpages); proc_caches_init(); buffer_init(); + CHECK; unnamed_dev_init(); key_init(); security_init(); vfs_caches_init(num_physpages); radix_tree_init(); signals_init(); + CHECK; /* rootfs populating might need page-writeback */ page_writeback_init(); #ifdef CONFIG_PROC_FS @@ -599,6 +615,7 @@ asmlinkage void __init start_kernel(void #endif cpuset_init(); taskstats_init_early(); + CHECK; delayacct_init(); check_bugs(); @@ -609,7 +626,7 @@ asmlinkage void __init start_kernel(void rest_init(); } -static int __initdata initcall_debug; +static int __initdata initcall_debug = 1; static int __init initcall_debug_setup(char *str) { @@ -639,7 +656,11 @@ static void __init do_initcalls(void) printk(\n); } + CHECK; + result = (*call)(); + + CHECK; if (result result != -ENODEV initcall_debug) { sprintf(msgbuf, error code %d, result); @@ -725,21 +746,32 @@ static int init(void * unused) smp_prepare_cpus(max_cpus); + CHECK; + 
do_pre_smp_initcalls(); smp_init(); + + CHECK; + sched_init_smp(); cpuset_init_smp(); + CHECK; + /* * Do this before initcalls, because some drivers want to access * firmware files. */ populate_rootfs(); + CHECK; + do_basic_setup(); + CHECK; + /* * check if there is an early userspace init. If yes, let it do all * the work Index: linux-2.6.19-rc1-hack/net/xfrm/xfrm_policy.c === --- linux-2.6.19-rc1-hack.orig/net/xfrm/xfrm_policy.c +++ linux-2.6.19-rc1-hack/net/xfrm/xfrm_policy.c @@ -39,6 +39,16 @@ EXPORT_SYMBOL(xfrm_policy_count); static DEFINE_RWLOCK(xfrm_policy_afinfo_lock); static struct xfrm_policy_afinfo *xfrm_policy_afinfo[NPROTO]; +void bugcheck(char *where, int line) +{ + int i; + for (i = 0; i NPROTO; i++) + if
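The body of bugcheck() is cut off in the archived patch, so the following is only a userspace sketch of the checkpoint technique the patch appears to use, not Andi's actual function. It assumes the watched table is expected to be all-NULL at every checkpoint; NSLOTS, watched[] and the int return value are illustrative choices made here.

```c
#include <stdio.h>
#include <stddef.h>

#define NSLOTS 32   /* illustrative size, standing in for NPROTO */

/* Table that should still be all-NULL early on, like a zeroed .bss array. */
static void *watched[NSLOTS];

/*
 * Scan the table and report the call site if any slot is non-NULL.
 * Returns the index of the first bad slot, or -1 if the table is intact.
 * (The patch in the thread declares bugcheck() as returning void.)
 */
static int bugcheck(const char *where, int line)
{
    for (int i = 0; i < NSLOTS; i++) {
        if (watched[i] != NULL) {
            fprintf(stderr, "table corrupted at %s:%d (slot %d = %p)\n",
                    where, line, i, watched[i]);
            return i;
        }
    }
    return -1;
}

/* Drop CHECK between initialisation steps to bracket the corrupting call. */
#define CHECK bugcheck(__FILE__, __LINE__)
```

Placing CHECK before and after each suspect call narrows the corruption to the first step after which the check starts reporting, which is how a message such as "afinfo corrupted at init/main.c:512" points at one call site.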
Re: 2.6.18-mm2 boot failure on x86-64
On Thursday 05 October 2006 20:52, Vivek Goyal wrote:
> On Thu, Oct 05, 2006 at 08:27:02PM +0200, Andi Kleen wrote:
> > On Thursday 05 October 2006 19:57, Steve Fox wrote:
> > > On Thu, 2006-10-05 at 17:40 +0200, Andi Kleen wrote:
> > > > Please don't snip the Code: line. It is fairly important.
> > >
> > > Sorry about that. The remote console I was using appears to overwrite
> > > some text after I force the reboot. Here's a clean one.
> > >
> > > global
> >
> > Ok that definitely shouldn't be in there. I guess we need to track when
> > it gets corrupted. Can you send the full boot log with this patch applied?
>
> Just recalled one more observation about the problem when keith had
> reported it last. If I just move .bss before .data_nosave instead of it
> being at the end, keith's problem had disappeared.

Yes, that could well be that it's something in the new bootmap management.

Steve's box failed at

Using ACPI (MADT) for SMP configuration information
Nosave address range: 0009a000 - 0009b000
Nosave address range: 0009b000 - 000a
Nosave address range: 000a - 000e
Nosave address range: 000e - 0010
Nosave address range: bff76000 - bff77000
Nosave address range: bff77000 - bff98000
Nosave address range: bff98000 - bff99000
Nosave address range: bff99000 - c000
Nosave address range: c000 - fec0
Nosave address range: fec0 - 0001
Allocating PCI resources starting at c400 (gap: c000:3ec0)
afinfo corrupted at init/main.c:512

which is directly after that code does lots of stuff.

Mel might want to take a look (and perhaps also cut down a little on the ugly printks ...)

BTW I found one of my test systems too now which does a lot of:

I'm about to leave for vacation so i won't have time to track it down any time soon. But here is it for reference.

-Andi

Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 800
Bad page state in process 'swapper'
page:810003ee5480 flags:0x mapping: mapcount:1 count:0
Trying to fix it up, but a reboot is needed
Backtrace:

Call Trace:
 [8020ac84] show_trace+0x34/0x47
 [8020aca9] dump_stack+0x12/0x17
 [802586a7] bad_page+0x57/0x81
 [80258791] __free_pages_ok+0x64/0x247
 [807cca72] free_all_bootmem_core+0xcc/0x1a9
 [807ca08b] numa_free_all_bootmem+0x3b/0x77
 [807c915e] mem_init+0x44/0x186
 [807bc5f0] start_kernel+0x17b/0x207
 [807bc168] _sinittext+0x168/0x16c
Bad page state in process 'swapper'
page:810003ee54b8 flags:0x mapping: mapcount:1 count:0
Trying to fix it up, but a reboot is needed
Backtrace:

Call Trace:
 [8020ac84] show_trace+0x34/0x47
 [8020aca9] dump_stack+0x12/0x17
 [802586a7] bad_page+0x57/0x81
 [80258791] __free_pages_ok+0x64/0x247
 [807cca72] free_all_bootmem_core+0xcc/0x1a9
 [807ca08b] numa_free_all_bootmem+0x3b/0x77
 [807c915e] mem_init+0x44/0x186
 [807bc5f0] start_kernel+0x17b/0x207
 [807bc168] _sinittext+0x168/0x16c

... lots more of those ...
Re: 2.6.18-mm2 boot failure on x86-64
On Thu, 2006-10-05 at 21:08 +0200, Andi Kleen wrote:
> Mel might want to take a look (and perhaps also cut down a little
> on the ugly printks ...)

I tested a patch from Mel which backs out the arch independent zone sizing and got the same results (to my inexperienced eye). I've sent him the boot log to verify they really are the same as without this back-out.

--
Steve Fox
IBM Linux Technology Center
Re: 2.6.18-mm2 boot failure on x86-64
On Thu, 5 Oct 2006, Andi Kleen wrote: On Thursday 05 October 2006 20:52, Vivek Goyal wrote: On Thu, Oct 05, 2006 at 08:27:02PM +0200, Andi Kleen wrote: On Thursday 05 October 2006 19:57, Steve Fox wrote: On Thu, 2006-10-05 at 17:40 +0200, Andi Kleen wrote: Please don't snip the Code: line. It is fairly important. Sorry about that. The remote console I was using appears to overwrite some text after I force the reboot. Here's a clean one. global Ok that definitely shouldn't be in there. I guess we need to track when it gets corrupted. Can you send the full boot log with this patch applied? Just recalled one more observation about the problem when keith had reported it last. If I just move .bss before .data_nosave instead of it being at the end, keith's problem had disappeared. Yes, that could well be that it's something in the new bootmap management. Steve's box failed at Using ACPI (MADT) for SMP configuration information Nosave address range: 0009a000 - 0009b000 Nosave address range: 0009b000 - 000a Nosave address range: 000a - 000e Nosave address range: 000e - 0010 Nosave address range: bff76000 - bff77000 Nosave address range: bff77000 - bff98000 Nosave address range: bff98000 - bff99000 Nosave address range: bff99000 - c000 Nosave address range: c000 - fec0 Nosave address range: fec0 - 0001 Allocating PCI resources starting at c400 (gap: c000:3ec0) afinfo corrupted at init/main.c:512 which is directly after that code does lots of stuff. Mel might want to take a look (and perhaps also cut down a little on the ugly printks ...) Steve tested a patch with arch-independent zone-sizing backed out for x86_64 and things looked ok but that is no guarantee it is not a contributary factor. The Nosave address range: printks are related to a suspend problem that was reported end of June I believe. I'll pick this up in the morning because I should have access to the same machine Steve does and see what I can come up with. 
BTW I found one of my test systems too now which does a lot of: I'm about to leave for vacation so i won't have time to track it down any time soon. But here is it for reference. hmm, rather than bugging you with patches now, I'll see what I can find with the x86_64 machines I have access to and see can I reproduce it. -Andi Please enable the IOMMU option in the BIOS setup This costs you 64 MB of RAM Mapping aperture over 65536 KB of RAM @ 800 Bad page state in process 'swapper' page:810003ee5480 flags:0x mapping: mapcount:1 count:0 Trying to fix it up, but a reboot is needed Backtrace: Call Trace: [8020ac84] show_trace+0x34/0x47 [8020aca9] dump_stack+0x12/0x17 [802586a7] bad_page+0x57/0x81 [80258791] __free_pages_ok+0x64/0x247 [807cca72] free_all_bootmem_core+0xcc/0x1a9 [807ca08b] numa_free_all_bootmem+0x3b/0x77 [807c915e] mem_init+0x44/0x186 [807bc5f0] start_kernel+0x17b/0x207 [807bc168] _sinittext+0x168/0x16c Bad page state in process 'swapper' page:810003ee54b8 flags:0x mapping: mapcount:1 count:0 Trying to fix it up, but a reboot is needed Backtrace: Call Trace: [8020ac84] show_trace+0x34/0x47 [8020aca9] dump_stack+0x12/0x17 [802586a7] bad_page+0x57/0x81 [80258791] __free_pages_ok+0x64/0x247 [807cca72] free_all_bootmem_core+0xcc/0x1a9 [807ca08b] numa_free_all_bootmem+0x3b/0x77 [807c915e] mem_init+0x44/0x186 [807bc5f0] start_kernel+0x17b/0x207 [807bc168] _sinittext+0x168/0x16c ... lots more of those ... -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.18-mm2 boot failure on x86-64
On Thu, 2006-10-05 at 21:05 +0200, Andi Kleen wrote:
> Can you please try it again with this patch to narrow it down further?

Unfortunately this is as far as it got before it hung.

root (hd0,0)
 Filesystem type is reiserfs, partition type 0x83
kernel /boot/vmlinuz-autobench root=/dev/sda1 vga=791 ip=9.47.67.239:9.47.67.50:9.47.67.1:255.255.255.0 resume=/dev/sdb1 showopts console=tty0 console=ttyS0,57600 autobench_args: root=/dev/sda1 ABAT:1160080320
   [Linux-bzImage, setup=0x1400, size=0x1dd871]
initrd /boot/initrd-autobench.img
   [Linux-initrd @ 0x37ceb000, 0x304c57 bytes]

--
Steve Fox
IBM Linux Technology Center
Re: 2.6.18-mm2 boot failure on x86-64
On Thursday 05 October 2006 22:42, Steve Fox wrote:
> On Thu, 2006-10-05 at 21:05 +0200, Andi Kleen wrote:
> > Can you please try it again with this patch to narrow it down further?
>
> Unfortunately this is as far as it got before it hung.

Boot with earlyprintk=serial,ttyS0,57600 (or change the panic in the check function back to a printk)

-Andi
Re: 2.6.18-mm2 boot failure on x86-64
> hmm, rather than bugging you with patches now, I'll see what I can find
> with the x86_64 machines I have access to and see can I reproduce it.

I started the bisect, should finish soon.

-Andi
Re: 2.6.18-mm2 boot failure on x86-64 II
On Thursday 05 October 2006 22:51, Andi Kleen wrote:
> > hmm, rather than bugging you with patches now, I'll see what I can find
> > with the x86_64 machines I have access to and see can I reproduce it.
>
> I started the bisect, should finish soon.

It ended at

diff-tree d5cdb67236dba94496de052c9f9f431e1fc658f4 (from 0dad3510ee82bcf8a380b81a2184a664a911ef9c)
Author: Satoru Takeuchi [EMAIL PROTECTED]
Date: Tue Sep 12 10:19:00 2006 -0700

    acpiphp: disable bridges

    Currently acpiphp calls pci_enable_device() against all hot-added bridges,
    but acpiphp does not call pci_disable_device() against them in hot-remove.
    So ioapic hot-remove would fail. This patch fixes this issue.

Not sure that is it really, it is possible i made a mistake during bisect (the symptoms changed from bad page to just networking doesn't work somewhere at 4cfee88ad30acc47f02b8b7ba3db8556262dce1e)

I don't have time to rerun unfortunately for some time. Anyone else looking would be useful.

-Andi
Re: 2.6.18-mm2 boot failure on x86-64 II
On Fri, 2006-10-06 at 01:14 +0200, Andi Kleen wrote:
> On Thursday 05 October 2006 22:51, Andi Kleen wrote:
> > > hmm, rather than bugging you with patches now, I'll see what I can find
> > > with the x86_64 machines I have access to and see can I reproduce it.
> >
> > I started the bisect, should finish soon.
>
> It ended at
>
> diff-tree d5cdb67236dba94496de052c9f9f431e1fc658f4 (from 0dad3510ee82bcf8a380b81a2184a664a911ef9c)
> Author: Satoru Takeuchi [EMAIL PROTECTED]
> Date: Tue Sep 12 10:19:00 2006 -0700
>
>     acpiphp: disable bridges
>
>     Currently acpiphp calls pci_enable_device() against all hot-added bridges,
>     but acpiphp does not call pci_disable_device() against them in hot-remove.
>     So ioapic hot-remove would fail. This patch fixes this issue.
>
> Not sure that is it really, it is possible i made a mistake during bisect
> (the symptoms changed from bad page to just networking doesn't work
> somewhere at 4cfee88ad30acc47f02b8b7ba3db8556262dce1e)
>
> I don't have time to rerun unfortunately for some time. Anyone else
> looking would be useful.

As of yet I haven't been able to recreate the hang. I am running similar HW to Steve.

Thanks,
Keith
Re: 2.6.18-mm2 boot failure on x86-64 II
> As of yet I haven't been able to recreate the hang. I am running
> similar HW to Steve.

That was on a 4 core Opteron with Tyan board (S2881) and AMD-8111 chipset.

-Andi
Re: 2.6.18-mm2 boot failure on x86-64 II
On Fri, 2006-10-06 at 01:35 +0200, Andi Kleen wrote:
> > As of yet I haven't been able to recreate the hang. I am running
> > similar HW to Steve.

I ran into this with -mm3

Memory: 24150368k/26738688k available (1933k kernel code, 490260k reserved, 978k data, 308k init)
[ cut here ]
kernel BUG in init_list at mm/slab.c:1334!
invalid opcode: [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.18-mm3-smp #1
RIP: 0010:[8027f8fa] [8027f8fa] init_list+0x1d/0xfd
RSP: 0018:80577f48 EFLAGS: 00010212
RAX: 0040 RBX: 0001 RCX:
RDX: 0001 RSI: 805ba848 RDI: 810460700040
RBP: 0001 R08: 0001 R09: 0003
R10: R11: 805bc268 R12: 810460700040
R13: 805ba848 R14: R15:
FS: () GS:804d8000() knlGS:
CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: CR3: 00201000 CR4: 06a0
Process swapper (pid: 0, threadinfo 80576000, task 80455840)
Stack: 0001 0001 805ba848 80593aa8 02c0 00010001 0008ef00 0008c000
Call Trace:
 [80593aa8] kmem_cache_init+0x344/0x406
 [805805ef] start_kernel+0x180/0x21b
 [8058016a] _sinittext+0x16a/0x16e

Code: 0f 0b 48 8b 3d 15 ab 1e 00 be d0 00 00 00 e8 c0 f5 ff ff 48
RIP [8027f8fa] init_list+0x1d/0xfd
 RSP 80577f48
0Kernel panic - not syncing: Attempted to kill the idle task!

I am going to revert the patch and see if it works. I ran -git22 just fine.

Thanks,
Keith
Re: 2.6.18-mm2 boot failure on x86-64 II
keith mannthey wrote: On Fri, 2006-10-06 at 01:35 +0200, Andi Kleen wrote: As of yet I haven't been able to recreate the hang. I am running similar HW to Steve. I ran into this with -mm3 Memory: 24150368k/26738688k available (1933k kernel code, 490260k reserved, 978k data, 308k init) [ cut here ] kernel BUG in init_list at mm/slab.c:1334! invalid opcode: [1] SMP last sysfs file: CPU 0 Modules linked in: Pid: 0, comm: swapper Not tainted 2.6.18-mm3-smp #1 RIP: 0010:[8027f8fa] [8027f8fa] init_list+0x1d/0xfd RSP: 0018:80577f48 EFLAGS: 00010212 RAX: 0040 RBX: 0001 RCX: RDX: 0001 RSI: 805ba848 RDI: 810460700040 RBP: 0001 R08: 0001 R09: 0003 R10: R11: 805bc268 R12: 810460700040 R13: 805ba848 R14: R15: FS: () GS:804d8000() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: CR3: 00201000 CR4: 06a0 Process swapper (pid: 0, threadinfo 80576000, task 80455840) Stack: 0001 0001 805ba848 80593aa8 02c0 00010001 0008ef00 0008c000 Call Trace: [80593aa8] kmem_cache_init+0x344/0x406 [805805ef] start_kernel+0x180/0x21b [8058016a] _sinittext+0x16a/0x16e Code: 0f 0b 48 8b 3d 15 ab 1e 00 be d0 00 00 00 e8 c0 f5 ff ff 48 RIP [8027f8fa] init_list+0x1d/0xfd RSP 80577f48 0Kernel panic - not syncing: Attempted to kill the idle task! I am going to revert the patch and see if it works. I ran -git22 just fine. Thanks, Keith Keith, I fixed this already. Can you look for it on lkml (look for 2.6.18-mm3 in the subject line). one typo in mm/slab.c Thanks, Badari - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.18-mm2 boot failure on x86-64 II
On Thu, 05 Oct 2006 17:02:54 -0700 Badari Pulavarty [EMAIL PROTECTED] wrote:
> > Code: 0f 0b 48 8b 3d 15 ab 1e 00 be d0 00 00 00 e8 c0 f5 ff ff 48
> > RIP [8027f8fa] init_list+0x1d/0xfd
> >  RSP 80577f48
> > 0Kernel panic - not syncing: Attempted to kill the idle task!
> >
> > I am going to revert the patch and see if it works. I ran -git22 just fine.
> >
> > Thanks,
> > Keith
>
> Keith, I fixed this already. Can you look for it on lkml (look for
> 2.6.18-mm3 in the subject line). one typo in mm/slab.c

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18/2.6.18-mm3/hot-fixes
Re: 2.6.18-mm2 boot failure on x86-64
On Thu, 2006-09-28 at 14:01 -0700, Andrew Morton wrote: On Thu, 28 Sep 2006 17:50:31 + (UTC) Steve Fox [EMAIL PROTECTED] wrote: On Thu, 28 Sep 2006 01:46:23 -0700, Andrew Morton wrote: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18/2.6.18-mm2/ Panic on boot. This machine booted 2.6.18-mm1 fine. em64t machine. TCP bic registered TCP westwood registered TCP htcp registered NET: Registered protocol family 1 NET: Registered protocol family 17 Unable to handle kernel paging request at RIP: [8047ef93] packet_notifier+0x163/0x1a0 PGD 203027 PUD 2b031067 PMD 0 Oops: [1] SMP last sysfs file: CPU 0 Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.18-mm2-autokern1 #1 RIP: 0010:[8047ef93] [8047ef93] packet_notifier+0x163/0x1a0 RSP: :810bffcbde90 EFLAGS: 00010286 RAX: RBX: 810bff4a1000 RCX: RDX: 810bff4a1000 RSI: 0005 RDI: 8055f5e0 RBP: R08: 7616 R09: 000e R10: 0006 R11: 803373f0 R12: R13: 0005 R14: 810bff4a1000 R15: FS: () GS:805d8000() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: CR3: 00201000 CR4: 06e0 Process swapper (pid: 1, threadinfo 810bffcbc000, task 810bffcbb510) Stack: 810bff4a1000 8055f4c0 810bffcbdef0 8042736e 8061c68d 806260f0 80207182 Call Trace: [8042736e] register_netdevice_notifier+0x3e/0x70 [8061c68d] packet_init+0x2d/0x53 [80207182] init+0x162/0x330 [8020a9d8] child_rip+0xa/0x12 [8033c2a2] acpi_ds_init_one_object+0x0/0x82 [80207020] init+0x0/0x330 [8020a9ce] child_rip+0x0/0x12 Code: 48 8b 45 00 0f 18 08 49 83 fd 02 4c 8d 65 f8 0f 84 f8 fe ff RIP [8047ef93] packet_notifier+0x163/0x1a0 RSP 810bffcbde90 CR2: 0Kernel panic - not syncing: Attempted to kill init! I'm really struggling to work out what went wrong there. Comparing your miserable 20 bytes of code to my object code makes me think that this: struct packet_sock *po = pkt_sk(sk); returned -1, perhaps in %ebp. But it's all very crude. 
Perhaps you could compile that kernel with CONFIG_DEBUG_INFO, rerun it (the addresses might change) then have a poke around with `gdb vmlinux' (or maybe just addr2line) to work out where it's really oopsing? I don't see much which has changed in that area recently. Sorry for the delay. I was finally able to perform a bisect on this. It turns out the patch that causes this is x86_64-mm-re-positioning-the-bss-segment.patch, which seems like a strange candidate, but sure enough I can boot to login: right up until that patch is applied. P.S. I had to comment usb-hubc-build-fix.patch out of the series file because it would not apply cleanly and caused quilt (0.45) to simply abort its 'push' operation. -- Steve Fox IBM Linux Technology Center - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
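When following the gdb/addr2line suggestion, the useful input from the oops is the "symbol+offset/size" token on the RIP line. The sketch below splits such a token in userspace C and adds the offset to a symbol start address. struct oops_sym, parse_oops_sym() and fault_address() are hypothetical names introduced here, and taking the symbol start from the rebuilt kernel's System.map is an assumption about the workflow, not something stated in the thread.

```c
#include <stdio.h>

/* Hypothetical container for one "symbol+0xoff/0xsize" token from an oops. */
struct oops_sym {
    char name[64];
    unsigned long offset;   /* bytes into the function */
    unsigned long size;     /* total size of the function */
};

/* Parse e.g. "packet_notifier+0x163/0x1a0"; returns 0 on success, -1 otherwise. */
static int parse_oops_sym(const char *tok, struct oops_sym *out)
{
    if (sscanf(tok, "%63[^+]+%lx/%lx",
               out->name, &out->offset, &out->size) != 3)
        return -1;
    /* An offset at or past the function size would not be a valid token. */
    return out->offset < out->size ? 0 : -1;
}

/* Address to look up: symbol start (e.g. from System.map) plus the offset. */
static unsigned long fault_address(unsigned long sym_start,
                                   const struct oops_sym *s)
{
    return sym_start + s->offset;
}
```

Because the addresses might change after rebuilding with CONFIG_DEBUG_INFO, recomputing the address from the new symbol start is safer than reusing the RIP value from the old oops. The result can then be passed to addr2line -e vmlinux or inspected in gdb.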
Re: 2.6.18-mm2 boot failure on x86-64
On Wed, 04 Oct 2006 08:42:28 -0500 Steve Fox [EMAIL PROTECTED] wrote: On Thu, 2006-09-28 at 14:01 -0700, Andrew Morton wrote: On Thu, 28 Sep 2006 17:50:31 + (UTC) Steve Fox [EMAIL PROTECTED] wrote: On Thu, 28 Sep 2006 01:46:23 -0700, Andrew Morton wrote: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18/2.6.18-mm2/ Panic on boot. This machine booted 2.6.18-mm1 fine. em64t machine. TCP bic registered TCP westwood registered TCP htcp registered NET: Registered protocol family 1 NET: Registered protocol family 17 Unable to handle kernel paging request at RIP: [8047ef93] packet_notifier+0x163/0x1a0 PGD 203027 PUD 2b031067 PMD 0 Oops: [1] SMP last sysfs file: CPU 0 Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.18-mm2-autokern1 #1 RIP: 0010:[8047ef93] [8047ef93] packet_notifier+0x163/0x1a0 RSP: :810bffcbde90 EFLAGS: 00010286 RAX: RBX: 810bff4a1000 RCX: RDX: 810bff4a1000 RSI: 0005 RDI: 8055f5e0 RBP: R08: 7616 R09: 000e R10: 0006 R11: 803373f0 R12: R13: 0005 R14: 810bff4a1000 R15: FS: () GS:805d8000() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: CR3: 00201000 CR4: 06e0 Process swapper (pid: 1, threadinfo 810bffcbc000, task 810bffcbb510) Stack: 810bff4a1000 8055f4c0 810bffcbdef0 8042736e 8061c68d 806260f0 80207182 Call Trace: [8042736e] register_netdevice_notifier+0x3e/0x70 [8061c68d] packet_init+0x2d/0x53 [80207182] init+0x162/0x330 [8020a9d8] child_rip+0xa/0x12 [8033c2a2] acpi_ds_init_one_object+0x0/0x82 [80207020] init+0x0/0x330 [8020a9ce] child_rip+0x0/0x12 Code: 48 8b 45 00 0f 18 08 49 83 fd 02 4c 8d 65 f8 0f 84 f8 fe ff RIP [8047ef93] packet_notifier+0x163/0x1a0 RSP 810bffcbde90 CR2: 0Kernel panic - not syncing: Attempted to kill init! I'm really struggling to work out what went wrong there. Comparing your miserable 20 bytes of code to my object code makes me think that this: struct packet_sock *po = pkt_sk(sk); returned -1, perhaps in %ebp. But it's all very crude. 
Perhaps you could compile that kernel with CONFIG_DEBUG_INFO, rerun it (the addresses might change) then have a poke around with `gdb vmlinux' (or maybe just addr2line) to work out where it's really oopsing? I don't see much which has changed in that area recently. Sorry for the delay. I was finally able to perform a bisect on this. It turns out the patch that causes this is x86_64-mm-re-positioning-the-bss-segment.patch, which seems like a strange candidate, but sure enough I can boot to login: right up until that patch is applied. hm, that patch was merged into mainline September 29. Does mainline work? P.S. I had to comment usb-hubc-build-fix.patch out of the series file because it would not apply cleanly and caused quilt (0.45) to simply abort its 'push' operation. Sorry about that. If mainline _does_ work then perhaps it's an interaction between that patch and something else in the -mm2 lineup (and at that point in the bisection, it'll be one of the git trees or something else in the x86_64 tree). Could be that the problem remains in -mm3. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.18-mm2 boot failure on x86-64
On Wednesday 04 October 2006 17:45, Andrew Morton wrote:
On Wed, 04 Oct 2006 08:42:28 -0500 Steve Fox [EMAIL PROTECTED] wrote:
On Thu, 2006-09-28 at 14:01 -0700, Andrew Morton wrote:
On Thu, 28 Sep 2006 17:50:31 +0000 (UTC) Steve Fox [EMAIL PROTECTED] wrote:
On Thu, 28 Sep 2006 01:46:23 -0700, Andrew Morton wrote:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18/2.6.18-mm2/

Panic on boot. This machine booted 2.6.18-mm1 fine. em64t machine.

TCP bic registered
TCP westwood registered
TCP htcp registered
NET: Registered protocol family 1
NET: Registered protocol family 17
Unable to handle kernel paging request at RIP: [8047ef93] packet_notifier+0x163/0x1a0
PGD 203027 PUD 2b031067 PMD 0
Oops: [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-mm2-autokern1 #1
RIP: 0010:[8047ef93] [8047ef93] packet_notifier+0x163/0x1a0
RSP: :810bffcbde90 EFLAGS: 00010286
RAX: RBX: 810bff4a1000 RCX:
RDX: 810bff4a1000 RSI: 0005 RDI: 8055f5e0
RBP: R08: 7616 R09: 000e
R10: 0006 R11: 803373f0 R12:
R13: 0005 R14: 810bff4a1000 R15:
FS: () GS:805d8000() knlGS:
CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: CR3: 00201000 CR4: 06e0
Process swapper (pid: 1, threadinfo 810bffcbc000, task 810bffcbb510)
Stack: 810bff4a1000 8055f4c0 810bffcbdef0 8042736e 8061c68d 806260f0 80207182
Call Trace:
[8042736e] register_netdevice_notifier+0x3e/0x70
[8061c68d] packet_init+0x2d/0x53
[80207182] init+0x162/0x330
[8020a9d8] child_rip+0xa/0x12
[8033c2a2] acpi_ds_init_one_object+0x0/0x82
[80207020] init+0x0/0x330
[8020a9ce] child_rip+0x0/0x12
Code: 48 8b 45 00 0f 18 08 49 83 fd 02 4c 8d 65 f8 0f 84 f8 fe ff
RIP [8047ef93] packet_notifier+0x163/0x1a0 RSP 810bffcbde90
CR2:
<0>Kernel panic - not syncing: Attempted to kill init!

I'm really struggling to work out what went wrong there. Comparing your miserable 20 bytes of code to my object code makes me think that this:

struct packet_sock *po = pkt_sk(sk);

returned -1, perhaps in %ebp. But it's all very crude.

Perhaps you could compile that kernel with CONFIG_DEBUG_INFO, rerun it (the addresses might change) then have a poke around with `gdb vmlinux' (or maybe just addr2line) to work out where it's really oopsing?

I don't see much which has changed in that area recently.

Sorry for the delay. I was finally able to perform a bisect on this. It turns out the patch that causes this is x86_64-mm-re-positioning-the-bss-segment.patch, which seems like a strange candidate, but sure enough I can boot to login: right up until that patch is applied.

hm, that patch was merged into mainline September 29. Does mainline work?

Yes we had this earlier already. But without this patch it doesn't compile for some people. So it was readded.

And nobody knows why the reposition-bss patch actually breaks things :/ In theory the reposition is ok, so it must be some marginal code somewhere else that just ends up failing over.

-Andi
Re: 2.6.18-mm2 boot failure on x86-64
On Wed, Oct 04, 2006 at 08:45:40AM -0700, Andrew Morton wrote:
On Wed, 04 Oct 2006 08:42:28 -0500 Steve Fox [EMAIL PROTECTED] wrote:
On Thu, 2006-09-28 at 14:01 -0700, Andrew Morton wrote:
On Thu, 28 Sep 2006 17:50:31 +0000 (UTC) Steve Fox [EMAIL PROTECTED] wrote:
On Thu, 28 Sep 2006 01:46:23 -0700, Andrew Morton wrote:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18/2.6.18-mm2/

Panic on boot. This machine booted 2.6.18-mm1 fine. em64t machine.

TCP bic registered
TCP westwood registered
TCP htcp registered
NET: Registered protocol family 1
NET: Registered protocol family 17
Unable to handle kernel paging request at RIP: [8047ef93] packet_notifier+0x163/0x1a0
PGD 203027 PUD 2b031067 PMD 0
Oops: [1] SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-mm2-autokern1 #1
RIP: 0010:[8047ef93] [8047ef93] packet_notifier+0x163/0x1a0
RSP: :810bffcbde90 EFLAGS: 00010286
RAX: RBX: 810bff4a1000 RCX:
RDX: 810bff4a1000 RSI: 0005 RDI: 8055f5e0
RBP: R08: 7616 R09: 000e
R10: 0006 R11: 803373f0 R12:
R13: 0005 R14: 810bff4a1000 R15:
FS: () GS:805d8000() knlGS:
CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: CR3: 00201000 CR4: 06e0
Process swapper (pid: 1, threadinfo 810bffcbc000, task 810bffcbb510)
Stack: 810bff4a1000 8055f4c0 810bffcbdef0 8042736e 8061c68d 806260f0 80207182
Call Trace:
[8042736e] register_netdevice_notifier+0x3e/0x70
[8061c68d] packet_init+0x2d/0x53
[80207182] init+0x162/0x330
[8020a9d8] child_rip+0xa/0x12
[8033c2a2] acpi_ds_init_one_object+0x0/0x82
[80207020] init+0x0/0x330
[8020a9ce] child_rip+0x0/0x12
Code: 48 8b 45 00 0f 18 08 49 83 fd 02 4c 8d 65 f8 0f 84 f8 fe ff
RIP [8047ef93] packet_notifier+0x163/0x1a0 RSP 810bffcbde90
CR2:
<0>Kernel panic - not syncing: Attempted to kill init!

I'm really struggling to work out what went wrong there. Comparing your miserable 20 bytes of code to my object code makes me think that this:

struct packet_sock *po = pkt_sk(sk);

returned -1, perhaps in %ebp. But it's all very crude.

Perhaps you could compile that kernel with CONFIG_DEBUG_INFO, rerun it (the addresses might change) then have a poke around with `gdb vmlinux' (or maybe just addr2line) to work out where it's really oopsing?

I don't see much which has changed in that area recently.

Sorry for the delay. I was finally able to perform a bisect on this. It turns out the patch that causes this is x86_64-mm-re-positioning-the-bss-segment.patch, which seems like a strange candidate, but sure enough I can boot to login: right up until that patch is applied.

hm, that patch was merged into mainline September 29. Does mainline work?

I thought the above patch was dropped because Keith ran into some boot issues on one of the machines. There seems to be nothing wrong with the patch as such, but it might have triggered some existing bug. At that point in time I looked into the issue but nothing was conclusive.

So it looks like this patch has come back. I am not sure how.

Thanks
Vivek
Re: 2.6.18-mm2 boot failure on x86-64
On Wed, 2006-10-04 at 08:45 -0700, Andrew Morton wrote:
On Wed, 04 Oct 2006 08:42:28 -0500 Steve Fox [EMAIL PROTECTED] wrote:

Sorry for the delay. I was finally able to perform a bisect on this. It turns out the patch that causes this is x86_64-mm-re-positioning-the-bss-segment.patch, which seems like a strange candidate, but sure enough I can boot to login: right up until that patch is applied.

hm, that patch was merged into mainline September 29. Does mainline work?

-git21 also fails with this same error.

--
Steve Fox
IBM Linux Technology Center
Re: 2.6.18-mm2 boot failure on x86-64
On Wed, 04 Oct 2006 11:41:59 -0500 Steve Fox [EMAIL PROTECTED] wrote:
On Wed, 2006-10-04 at 08:45 -0700, Andrew Morton wrote:
On Wed, 04 Oct 2006 08:42:28 -0500 Steve Fox [EMAIL PROTECTED] wrote:

Sorry for the delay. I was finally able to perform a bisect on this. It turns out the patch that causes this is x86_64-mm-re-positioning-the-bss-segment.patch, which seems like a strange candidate, but sure enough I can boot to login: right up until that patch is applied.

hm, that patch was merged into mainline September 29. Does mainline work?

-git21 also fails with this same error.

OK, thanks. And we know that x86_64-mm-re-positioning-the-bss-segment.patch triggered this failure. And that patch is non-buggy, and the xfrm code is probably non-buggy. So we don't know squat, and we're going to need to debug this crash.

Well. There is one trick we could use: apply x86_64-mm-re-positioning-the-bss-segment.patch to 2.6.18 base and see if it crashes. If it doesn't, then we can theorise that the bug is some buggy post 2.6.18 patch which is being exposed by x86_64-mm-re-positioning-the-bss-segment.patch.

A technique I've used before for identifying the buggy patch is to do a git-bisect, but apply x86_64-mm-re-positioning-the-bss-segment.patch by hand at each bisection step. It's pretty straightforward as long as the patch roughly applies at each step.

Or we could debug it. Can you send the .config? Let's see if it happens with my toolchain+machine first.

Thanks.
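The bisect-with-a-hand-applied-patch technique described above can be modelled as a plain binary search. What follows is an illustrative Python sketch, not a git wrapper: the commit names and the crashes_with_patch callback are hypothetical, and it assumes the oldest commit boots with the trigger patch applied, the newest one crashes, and every commit after the first bad one also crashes.

```python
def bisect_with_patch(commits, crashes_with_patch):
    """Find the first commit that crashes once the trigger patch is applied.

    commits            -- history ordered oldest to newest
    crashes_with_patch -- callback standing in for "apply the patch by hand,
                          build, boot, and report whether it crashed"
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes_with_patch(commits[mid]):
            hi = mid  # the buggy change is at mid or earlier
        else:
            lo = mid  # the buggy change is after mid
    return commits[hi]


# Hypothetical history: the latent bug enters at "c6" and only shows up
# when the trigger patch is applied on top.
history = ["c%d" % i for i in range(10)]
first_bad = bisect_with_patch(history, lambda c: int(c[1:]) >= 6)
print(first_bad)
```

The point of applying the patch at every step is that the patch itself is held constant, so the search isolates the other change it interacts with.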
Re: 2.6.18-mm2 boot failure on x86-64
On Wed, Oct 04, 2006 at 05:06:59PM -0700, Andrew Morton wrote:
On Wed, 04 Oct 2006 11:41:59 -0500 Steve Fox [EMAIL PROTECTED] wrote:
On Wed, 2006-10-04 at 08:45 -0700, Andrew Morton wrote:
On Wed, 04 Oct 2006 08:42:28 -0500 Steve Fox [EMAIL PROTECTED] wrote:

Sorry for the delay. I was finally able to perform a bisect on this. It turns out the patch that causes this is x86_64-mm-re-positioning-the-bss-segment.patch, which seems like a strange candidate, but sure enough I can boot to login: right up until that patch is applied.

hm, that patch was merged into mainline September 29. Does mainline work?

-git21 also fails with this same error.

OK, thanks. And we know that x86_64-mm-re-positioning-the-bss-segment.patch triggered this failure. And that patch is non-buggy, and the xfrm code is probably non-buggy. So we don't know squat, and we're going to need to debug this crash.

Well. There is one trick we could use: apply x86_64-mm-re-positioning-the-bss-segment.patch to 2.6.18 base and see if it crashes. If it doesn't, then we can theorise that the bug is some buggy post 2.6.18 patch which is being exposed by

I think most likely it would crash on 2.6.18. Keith Mannthey had reported a different crash on 2.6.18-rc4-mm2 when this patch was introduced the first time. Following is the link to the thread.

http://marc.theaimsgroup.com/?l=linux-kernel&m=115629369729911&w=2

Following is the backtrace he had reported.

Unable to handle kernel NULL pointer dereference at 0007 RIP: [803d45b0] __unix_insert_socket+0x49/0x5a
PGD 115c934067 PUD 115c935067 PMD 0
Oops: 0002 [1] SMP
last sysfs file:
CPU 14
Modules linked in:
Pid: 1, comm: init Not tainted 2.6.18-rc4-mm2-smp #3
RIP: 0010:[803d45b0] [803d45b0] __unix_insert_socket+0x49/0x5a
RSP: 0018:810460605eb8 EFLAGS: 00010286
RAX: RBX: 81115c171c80 RCX:
RDX: 81115c171c88 RSI: 81115c171c80 RDI: 806656e0
RBP: 806656e0 R08: 81115c069200 R09: 8110700b4000
R10: R11: 0002 R12: 81115c170d00
R13: 0001 R14: 0001 R15:
FS: 2b793a4fd6d0() GS:81115c910e40() knlGS:
CS: 0010 DS: ES: CR0: 8005003b
CR2: 0007 CR3: 00115c92d000 CR4: 06e0
Process init (pid: 1, threadinfo 810460604000, task 81115cb10040)
Stack: 00010001 81115c171c80 803d58e9 8045bb30 000180298f61 80498080 0001 81115c170d00 803d595d 0004 80376061
Call Trace:
[803d58e9] unix_create1+0xf3/0x107
[803d595d] unix_create+0x60/0x6b
[80376061] __sock_create+0x12f/0x227
[80376429] sys_socket+0xf/0x37
[8020968e] system_call+0x7e/0x83
Code: 48 89 50 08 48 89 55 00 48 89 6a 08 41 58 5b 5d c3 c7 47 08
RIP [803d45b0] __unix_insert_socket+0x49/0x5a RSP 810460605eb8
CR2: 0007
<0>Kernel panic - not syncing: Attempted to kill init!

Thanks
Vivek
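One possible reading of the numbers in this backtrace, offered as an inference and not something stated in the thread: the first Code: bytes, 48 89 50 08, decode as a 64-bit store of %rdx to the address 8 bytes past %rax (mov %rdx,0x8(%rax)), and the reported fault address is 0007. Assuming the Code: dump starts at the faulting instruction and that the RAX value was simply lost from this archived copy, the base pointer can be recovered by subtracting the displacement modulo 2^64. A small Python sketch of that arithmetic, with helper names made up for illustration:

```python
MASK64 = (1 << 64) - 1  # x86-64 address arithmetic wraps modulo 2**64


def base_from_fault(fault_addr, displacement):
    """Recover the base register from fault address = base + displacement."""
    return (fault_addr - displacement) & MASK64


def as_signed64(value):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    return value - (1 << 64) if value & (1 << 63) else value


# Fault address 0007 from the oops, displacement 0x8 from the decoded store.
base = base_from_fault(0x7, 0x8)
print(hex(base), as_signed64(base))
```

Under those assumptions the base pointer would be all-ones, that is -1, which is the same value guessed for the packet_notifier oops earlier in the thread.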
Re: 2.6.18-mm2 boot failure on x86-64
I think most likely it would crash on 2.6.18. Keith Mannthey had reported a different crash on 2.6.18-rc4-mm2 when this patch was introduced the first time. Following is the link to the thread.

Then maybe trying 2.6.17 + the patch and then bisect between that and -rc4?

-Andi
Re: 2.6.18-mm2 boot failure on x86-64
On 10/4/06, Martin Bligh [EMAIL PROTECTED] wrote:
Andi Kleen wrote:

I think most likely it would crash on 2.6.18. Keith Mannthey had reported a different crash on 2.6.18-rc4-mm2 when this patch was introduced the first time. Following is the link to the thread.

Then maybe trying 2.6.17 + the patch and then bisect between that and -rc4?

I think it's fixed already in -git22, or at least it is for the IBM box reporting to test.kernel.org. You might want to try that one ...

Fixed or hidden... hard to say at this point. I think it could be a weird interaction between patches and/or config options. I will see tomorrow if I can recreate it again.

Thanks,
Keith