Bug#603229: Further information
Hi !

> The error message about 'domain->cpu_power' does not refer to power
> management, but to the scheduler's estimation of the processing power of
> each group of processor threads.
>
> The scheduler is trying to group the processor threads by:
>
> - NUMA node (NODE; sharing a connection to RAM)
> - Package (CPU; sharing some caches)
> - Core (MC; sharing execution units)

So let's start here: on this machine NUMA node and package are identical: CPU0 / CPU1 are one group and CPU2 / CPU3 are the other. As with all Socket 940 Opterons, the cores are logically complete CPUs, i.e. they do not share execution units.

> so that it can make good decisions about where a task should run when it
> is ready to do so.
>
> > But whereas 2.6.32-5 afterwards crashes with a divide error,
> > 2.6.30-2 starts up normally:
> [...]
> > I suppose that it is the divide error in [0.852154], we have to deal
> > with.
> [...]
>
> The division by zero appears to be a result of getting bad information
> from the firmware about the groups of processors.

Well, technically a division error is always the result of bad data fed to that division. What I meant was that this is the point from which to backtrace the error. Though the BIOS of the w2100z is known for some problems, the CPUs are reported correctly by the BIOS, and it is the latest version (R01-B5-S1).

> I realise that this
> same bad information did not previously result in a crash, but I (and
> the upstream developers) need to know what that information is before we
> can understand how this can be avoided.

Are there any means to gather more information? Tell me and I shall do it.

Tilo
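One possible way to collect the grouping information asked for above, on a kernel that still boots, is to read the CPU topology files that Linux exposes under sysfs. The following is only a sketch: the thread itself does not mention sysfs, the function name `read_cpu_topology` is made up for illustration, and whether every listed attribute is present on this particular machine and kernel is an assumption.

```python
import os

def read_cpu_topology(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {attribute: value}} read from <sysfs_root>/cpuN/topology/."""
    # Attributes of interest; any that are missing are simply left out.
    wanted = ("physical_package_id", "core_id",
              "core_siblings_list", "thread_siblings_list")
    topology = {}
    for entry in sorted(os.listdir(sysfs_root)):
        # Skip cpufreq, cpuidle and other entries that are not cpuN directories.
        if not (entry.startswith("cpu") and entry[3:].isdigit()):
            continue
        topo_dir = os.path.join(sysfs_root, entry, "topology")
        if not os.path.isdir(topo_dir):
            continue
        info = {}
        for name in wanted:
            path = os.path.join(topo_dir, name)
            if os.path.exists(path):
                with open(path) as fh:
                    info[name] = fh.read().strip()
        topology[entry] = info
    return topology

if __name__ == "__main__":
    # Only attempt the real read where sysfs is actually available.
    if os.path.isdir("/sys/devices/system/cpu"):
        for cpu, info in read_cpu_topology().items():
            print(cpu, info)
```

Running this on the working 2.6.30-2 or 2.6.31-2 kernel and attaching the output would show which CPUs the kernel believes share a package, which can then be compared with the CPU0/CPU1 and CPU2/CPU3 pairing described above.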
Bug#603229: Further information
On Tue, 2010-11-23 at 13:17 +0100, Frede Feuerstein wrote:
> Hi !
>
> > This shows something about what's going wrong. Could you please try
> > adding 'debug' to the kernel parameters? That will show some more
> > context for these errors.
>
> I booted 2.6.32-5 with the debug option on, and for comparison did the
> same with 2.6.30-2.
>
> The errors concerning the power management itself are also showing up in
> 2.6.30-2.

The error message about 'domain->cpu_power' does not refer to power management, but to the scheduler's estimation of the processing power of each group of processor threads.

The scheduler is trying to group the processor threads by:

- NUMA node (NODE; sharing a connection to RAM)
- Package (CPU; sharing some caches)
- Core (MC; sharing execution units)

so that it can make good decisions about where a task should run when it is ready to do so.

> But whereas 2.6.32-5 afterwards crashes with a divide error,
> 2.6.30-2 starts up normally:
[...]
> I suppose that it is the divide error in [0.852154], we have to deal
> with.
[...]

The division by zero appears to be a result of getting bad information from the firmware about the groups of processors. I realise that this same bad information did not previously result in a crash, but I (and the upstream developers) need to know what that information is before we can understand how this can be avoided.

Ben.

--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
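To illustrate how an unset group power could turn into a divide error, here is a deliberately simplified sketch. It is not the kernel's code, and the thread does not show where the faulting division lives; the names `SCHED_LOAD_SCALE`, `group_avg_load` and `safe_group_avg_load` and the scale value 1024 are assumptions made for this example. The idea is only that a load figure normalised by a group's power has nothing sensible to divide by when that power is zero.

```python
SCHED_LOAD_SCALE = 1024  # assumed scale factor, chosen for this illustration only

def group_avg_load(group_load, cpu_power):
    """Normalise a group's load by its estimated processing power."""
    # Raises ZeroDivisionError when cpu_power was never set (is zero).
    return group_load * SCHED_LOAD_SCALE // cpu_power

def safe_group_avg_load(group_load, cpu_power):
    """Same computation, but report an unset power instead of dividing by it."""
    if cpu_power == 0:
        return None
    return group_avg_load(group_load, cpu_power)

if __name__ == "__main__":
    print(group_avg_load(2048, 2048))    # a group with a power of 2048
    print(safe_group_avg_load(2048, 0))  # a group whose power was never set
```

Whether a guard of this kind is the right fix, or whether the group construction itself has to be corrected, is exactly what the requested information should help decide.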
Bug#603229: Further information
Hi !

> This shows something about what's going wrong. Could you please try
> adding 'debug' to the kernel parameters? That will show some more
> context for these errors.

I booted 2.6.32-5 with the debug option on, and for comparison did the same with 2.6.30-2.

The errors concerning the power management itself are also showing up in 2.6.30-2. But whereas 2.6.32-5 afterwards crashes with a divide error, 2.6.30-2 starts up normally:

Here is the corresponding section of the 2.6.30-2 boot log (if you would like the complete log, just tell me):

[0.692003] ERROR: groups don't span domain->span
[0.696004] CPU3 attaching sched-domain:
[0.73] domain 0: span 2-3 level MC
[0.708003] groups: 3 2
[0.720003] domain 1: span 1-3 level CPU
[0.728003] groups: 2-3 1
[0.744003] domain 2: span 0-3 level NODE
[0.752003] groups: 1-3 (__cpu_power = 2048)
[0.766180] ERROR: domain->cpu_power not set
[0.768003]
[0.772003] ERROR: groups don't span domain->span
[0.776172] net_namespace: 1936 bytes
[0.780049] Booting paravirtualized kernel on bare hardware
[0.784149] regulator: core version 0.5
[0.788114] NET: Registered protocol family 16
[0.792037] node 0 link 2: io port [0, 2fff]
[0.796004] node 0 link 0: io port [3000, 3fff]
[0.84] TOM: 8000 aka 2048M
[0.804004] node 0 link 2: mmio [d000, d04f]
=
And here is the complete log of 2.6.32-5 crashing:
=
bash-3.00$ tip hardwire
connected
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 2.6.32-5-amd64 (Debian 2.6.32-27) (m...@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Sat Oct 30 14:18:21 UTC 2010
[0.00] Command line: root=LABEL=S_rt ro debug console=ttyS0
[0.00] KERNEL supported cpus:
[0.00] Intel GenuineIntel
[0.00] AMD AuthenticAMD
[0.00] Centaur CentaurHauls
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: - 0009d400 (usable)
[0.00] BIOS-e820: 0009d400 - 000a (reserved)
[0.00] BIOS-e820: 000ce000 - 0010 (reserved)
[0.00] BIOS-e820: 0010 - 7ff6 (usable)
[0.00] BIOS-e820: 7ff6 - 7ff72000 (ACPI data)
[0.00] BIOS-e820: 7ff72000 - 7ff8 (ACPI NVS)
[0.00] BIOS-e820: 7ff8 - 8000 (reserved)
[0.00] BIOS-e820: fec0 - fec00400 (reserved)
[0.00] BIOS-e820: fee0 - fee01000 (reserved)
[0.00] BIOS-e820: fff8 - 0001 (reserved)
[0.00] DMI present.
[0.00] last_pfn = 0x7ff60 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00] 0-9 write-back
[0.00] A-B uncachable
[0.00] C-D1FFF write-protect
[0.00] D2000-E7FFF uncachable
[0.00] E8000-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00] 0 base 00 mask FF8000 write-back
[0.00] 1 disabled
[0.00] 2 disabled
[0.00] 3 disabled
[0.00] 4 disabled
[0.00] 5 disabled
[0.00] 6 disabled
[0.00] 7 disabled
[0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[0.00] initial memory mapped : 0 - 2000
[0.00] init_memory_mapping: -7ff6
[0.00] 00 - 007fe0 page 2M
[0.00] 007fe0 - 007ff6 page 4k
[0.00] kernel direct mapping tables up to 7ff6 @ 8000-c000
[0.00] RAMDISK: 375e7000 - 37fef333
[0.00] ACPI: RSDP 000f7200 00024 (v02 PTLTD )
[0.00] ACPI: XSDT 7ff6d424 00044 (v01 PTLTD ? XSDT 0604 LTP )
[0.00] ACPI: FACP 7ff71a2a 000F4 (v03 SUNSUNmetro 0604 PTEC 000F4240)
[0.00] ACPI: DSDT 7ff6d468 0454E (v01SUNK85AE 0604 MSFT 010D)
[0.00] ACPI: FACS 7ff72fc0 00040
[0.00] ACPI: SRAT 7ff71b1e 000C8 (v01 AMDHAMMER 0604 AMD 0001)
[0.00] ACPI: APIC 7ff71be6 000AA (v01 PTLTD ? APIC 0604 LTP )
[0.00] ACPI: SSDT 7ff71c90 00370 (v01 PTLTD POWERNOW 0604 LTP 0001)
[0.00] ACPI: Local APIC address 0xfee0
[0.00] SRAT: PXM 0 -> APIC 0 -> Node 0
[0.00] SRAT: PXM 1 -> APIC 1 -> Node 1
[0.00] SRAT: Node 0 PXM 0 0-a
[0.00] SRAT: Node 0 PXM 0 10-4000
[0.00] SRAT: Node 1 PXM 1 4000-8
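The "groups don't span domain->span" message in the 2.6.30-2 excerpt above can be reproduced by hand from the logged values for CPU3. The sketch below is a simplified consistency check written for this purpose, not the kernel's own debug routine; the function names `parse_cpulist` and `check_domains` are made up, and attaching the superset check to the outer domain is a simplification.

```python
def parse_cpulist(text):
    """Turn a CPU list such as '2-3 1' or '0-3' into a set of CPU numbers."""
    cpus = set()
    for part in text.replace(",", " ").split():
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def check_domains(domains):
    """Check (level, span, groups) tuples ordered from innermost to outermost."""
    errors = []
    child_span = None
    for level, span_text, groups_text in domains:
        span = parse_cpulist(span_text)
        groups = parse_cpulist(groups_text)
        # The union of all groups should cover exactly the domain's span.
        if groups != span:
            errors.append(f"{level}: groups don't span domain->span")
        # Each outer domain should contain the domain nested inside it.
        if child_span is not None and not child_span <= span:
            errors.append(f"{level}: parent span is not a superset of domain->span")
        child_span = span
    return errors

# Values copied from the "CPU3 attaching sched-domain" lines of the 2.6.30-2 log.
cpu3_domains = [
    ("MC", "2-3", "3 2"),
    ("CPU", "1-3", "2-3 1"),
    ("NODE", "0-3", "1-3"),
]
print(check_domains(cpu3_domains))
# prints ["NODE: groups don't span domain->span"]
```

With these inputs only the NODE level is flagged: its span is 0-3, but its single logged group covers 1-3, so CPU0 is missing from the groups.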
Bug#603229: Further information
On Mon, 2010-11-22 at 19:08 +0100, Tilo Hacke wrote:
> Hi !
>
> I just have tried the last 2.6.31-2 an it is working flawlessly.
>
> Further i have set up a serial connection to my SB1500 and so got an
> protocol of the boot process and the crash:
[...]
> [0.536565] ERROR: domain->cpu_power not set
> [0.540002]
> [0.544002] ERROR: groups don't span domain->span
> [0.548011] ERROR: parent span is not a superset of domain->span
> [0.552016] ERROR: domain->cpu_power not set
> [0.556002]
> [0.560002] ERROR: groups don't span domain->span
> [0.564024] ERROR: domain->cpu_power not set
> [0.568005]
> [0.572002] ERROR: groups don't span domain->span
> [0.576023] ERROR: domain->cpu_power not set
> [0.580002]
> [0.584002] ERROR: groups don't span domain->span
[...]

This shows something about what's going wrong. Could you please try adding 'debug' to the kernel parameters? That will show some more context for these errors.

Ben.

--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
Bug#603229: Further information
Hi !

I have just tried the latest 2.6.31-2 and it is working flawlessly.

Furthermore, I have set up a serial connection to my SB1500 and so got a log of the boot process and the crash:

===
bash-3.00$ tip hardwire
connected
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 2.6.32-5-amd64 (Debian 2.6.32-27) (m...@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Sat Oct 30 14:18:21 UTC 2010
[0.00] Command line: root=LABEL=S_rt ro console=ttyS0
[0.00] KERNEL supported cpus:
[0.00] Intel GenuineIntel
[0.00] AMD AuthenticAMD
[0.00] Centaur CentaurHauls
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: - 0009d400 (usable)
[0.00] BIOS-e820: 0009d400 - 000a (reserved)
[0.00] BIOS-e820: 000ce000 - 0010 (reserved)
[0.00] BIOS-e820: 0010 - 7ff6 (usable)
[0.00] BIOS-e820: 7ff6 - 7ff72000 (ACPI data)
[0.00] BIOS-e820: 7ff72000 - 7ff8 (ACPI NVS)
[0.00] BIOS-e820: 7ff8 - 8000 (reserved)
[0.00] BIOS-e820: fec0 - fec00400 (reserved)
[0.00] BIOS-e820: fee0 - fee01000 (reserved)
[0.00] BIOS-e820: fff8 - 0001 (reserved)
[0.00] DMI present.
[0.00] last_pfn = 0x7ff60 max_arch_pfn = 0x4
[0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[0.00] init_memory_mapping: -7ff6
[0.00] RAMDISK: 375e7000 - 37fef333
[0.00] ACPI: RSDP 000f7200 00024 (v02 PTLTD )
[0.00] ACPI: XSDT 7ff6d424 00044 (v01 PTLTD ? XSDT 0604 LTP )
[0.00] ACPI: FACP 7ff71a2a 000F4 (v03 SUNSUNmetro 0604 PTEC 000F4240)
[0.00] ACPI: DSDT 7ff6d468 0454E (v01SUNK85AE 0604 MSFT 010D)
[0.00] ACPI: FACS 7ff72fc0 00040
[0.00] ACPI: SRAT 7ff71b1e 000C8 (v01 AMDHAMMER 0604 AMD 0001)
[0.00] ACPI: APIC 7ff71be6 000AA (v01 PTLTD ? APIC 0604 LTP )
[0.00] ACPI: SSDT 7ff71c90 00370 (v01 PTLTD POWERNOW 0604 LTP 0001)
[0.00] SRAT: PXM 0 -> APIC 0 -> Node 0
[0.00] SRAT: PXM 1 -> APIC 1 -> Node 1
[0.00] SRAT: Node 0 PXM 0 0-a
[0.00] SRAT: Node 0 PXM 0 10-4000
[0.00] SRAT: Node 1 PXM 1 4000-8000
[0.00] Bootmem setup node 0 -4000
[0.00] NODE_DATA [b040 - 0001303f]
[0.00] bootmap [00014000 - 0001bfff] pages 8
[0.00] (8 early reservations) ==> bootmem [00 - 004000]
[0.00] #0 [00 - 001000] BIOS data page ==> [00 - 001000]
[0.00] #1 [006000 - 008000] TRAMPOLINE ==> [006000 - 008000]
[0.00] #2 [000100 - 0001688414] TEXT DATA BSS ==> [000100 - 0001688414]
[0.00] #3 [00375e7000 - 0037fef333] RAMDISK ==> [00375e7000 - 0037fef333]
[0.00] #4 [09d400 - 10] BIOS reserved ==> [09d400 - 10]
[0.00] #5 [0001689000 - 00016890c8] BRK ==> [0001689000 - 00016890c8]
[0.00] #6 [008000 - 00a000] PGTABLE ==> [008000 - 00a000]
[0.00] #7 [00a000 - 00b040] MEMNODEMAP ==> [00a000 - 00b040]
[0.00] Bootmem setup node 1 4000-7ff6
[0.00] NODE_DATA [4000 - 40007fff]
[0.00] bootmap [40008000 - 4000ffef] pages 8
[0.00] (8 early reservations) ==> bootmem [004000 - 007ff6]
[0.00] #0 [00 - 001000] BIOS data page
[0.00] #1 [006000 - 008000] TRAMPOLINE
[0.00] #2 [000100 - 0001688414] TEXT DATA BSS
[0.00] #3 [00375e7000 - 0037fef333] RAMDISK
[0.00] #4 [09d400 - 10] BIOS reserved
[0.00] #5 [0001689000 - 00016890c8] BRK
[0.00] #6 [008000 - 00a000] PGTABLE
[0.00] #7 [00a000 - 00b040] MEMNODEMAP
[0.00] found SMP MP-table at [880f7250] f7250
[0.00] Zone PFN ranges:
[0.00] DMA 0x -> 0x1000
[0.00] DMA32 0x1000 -> 0x0010
[0.00] Normal 0x0010 -> 0x0010
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[3] active PFN ranges
[0.00] 0: 0x -> 0x009d
[0.00] 0: 0x0100 -> 0x0004
[0.00] 1: