Bug#603229: Scheduler grouping failure; division by zero in select_task_rq_fair

2010-12-14 Thread Frede Feuerstein
Hello!

> We definitely want to robustify scheduler init code to not crash and to (if 
> possible) print a warning about the borkage.

I just tested the latest 2.6.32-5 update and it still crashes in the same
manner.

Best regards
Tilo Hacke




-- 
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/1292321241.4511.2.ca...@localhost




2010-11-28 Thread Ben Hutchings
On Sun, 2010-11-28 at 06:00 +0100, Frede Feuerstein wrote:
[...]
> > The division by zero appears to be a result of getting bad information
> > from the firmware about the groups of processors.
> 
> Well, technically a division error is always the result of bad data fed to
> that division. What I meant was that this is the point from which to
> backtrace the error.
> Though the BIOS of the w2100z is known for some problems, the CPUs are
> reported correctly by the BIOS, and it is the latest version (R01-B5-S1).
> 
> >   I realise that this
> > same bad information did not previously result in a crash, but I (and
> > the upstream developers) need to know what that information is before we
> > can understand how this can be avoided.
> 
> Are there any means to gather more information? Tell me and I shall do
> it.

I think this is now enough information.

Ingo, Peter, the output from scheduler domain/group setup was:

[0.536554] CPU0 attaching sched-domain:
[0.540004]  domain 0: span 0-1 level MC
[0.548002]   groups: 0 1
[0.560003]   domain 1: span 0-3 level NODE
[0.568002]    groups:
[0.574179] ERROR: domain->cpu_power not set
[0.576002]
[0.580002] ERROR: groups don't span domain->span
[0.584004] CPU1 attaching sched-domain:
[0.588007]  domain 0: span 0-1 level MC
[0.596002]   groups: 1 0 (cpu_power = 1023)
[0.612002] ERROR: parent span is not a superset of domain->span
[0.616003]   domain 1: span 1-3 level CPU
[0.624002]    groups: 1 (cpu_power = 2048) 2-3 (cpu_power = 2048)
[0.644003]    domain 2: span 0-3 level NODE
[0.652004]     groups: 1-3 (cpu_power = 4096)
[0.668002] ERROR: domain->cpu_power not set
[0.672002]
[0.676002] ERROR: groups don't span domain->span
[0.680004] CPU2 attaching sched-domain:
[0.684003]  domain 0: span 2-3 level MC
[0.692003]   groups: 2 3
[0.704003]   domain 1: span 1-3 level CPU
[0.712003]    groups: 2-3 (cpu_power = 2048) 1 (cpu_power = 2048)
[0.736003]    domain 2: span 0-3 level NODE
[0.744003]     groups: 1-3 (cpu_power = 4096)
[0.760003] ERROR: domain->cpu_power not set
[0.764003]
[0.768003] ERROR: groups don't span domain->span
[0.772004] CPU3 attaching sched-domain:
[0.776003]  domain 0: span 2-3 level MC
[0.784003]   groups: 3 2
[0.794183]   domain 1: span 1-3 level CPU
[0.83]    groups: 2-3 (cpu_power = 2048) 1 (cpu_power = 2048)
[0.822183]    domain 2: span 0-3 level NODE
[0.828003]     groups: 1-3 (cpu_power = 4096)
[0.842180] ERROR: domain->cpu_power not set
[0.844003]
[0.848003] ERROR: groups don't span domain->span

and the oops is:

[0.852154] divide error:  [#1] SMP
[0.856002] last sysfs file:
[0.856002] CPU 1
[0.856002] Modules linked in:
[0.856002] Pid: 2, comm: kthreadd Not tainted 2.6.32-5-amd64 #1 W1100z/2100z
[0.856002] RIP: 0010:[]  [] select_task_rq_fair+0x665/0x800
[0.856002] RSP: 0018:88003fdb7c90  EFLAGS: 00010046
[0.856002] RAX:  RBX:  RCX: 
[0.856002] RDX:  RSI: 0200 RDI: 0200
[0.856002] RBP: 88004120fd50 R08:  R09: 88007f98f0b0
[0.856002] R10:  R11: 000252d0 R12: 88007f98f060
[0.856002] R13: 88007f98f070 R14:  R15: 00015780
[0.856002] FS:  () GS:88004120() knlGS:
[0.856002] CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
[0.856002] CR2:  CR3: 01001000 CR4: 06e0
[0.856002] DR0:  DR1:  DR2: 
[0.856002] DR3:  DR6: 0ff0 DR7: 0400
[0.856002] Process kthreadd (pid: 2, threadinfo 88003fdb6000, task 88003fdc8710)
[0.856002] Stack:
[0.856002]  00015780 00015780 00015780 00015780
[0.856002] <0> 00015780 00015788 00015788 8146c260
[0.856002] <0> 0008 88007f9b 880041215780 81317f88
[0.856002] Call Trace:
[0.856002]  [] ? copy_process+0x1007/0x115f
[0.856002]  [] ? select_task_rq+0xb/0x3e
[0.856002]  [] ? wake_up_new_task+0x35/0xf6
[0.856002]  [] ? do_fork+0x254/0x31e
[0.856002]  [] ? pick_next_task_fair+0xca/0xd6
[0.856002]  [] ? finish_task_switch+0x3a/0xaf
[0.856002]  [] ? kernel_thread+0x82/0xe0
[0.856002]  [] ? kthread+0x0/0x81
[0.856002]  [] ? child_rip+0x0/0x20
[0.856002]  [] ? kthreadd+0xb1/0xec
[0.856002]  [] ? early_idt_handler+0x0/0x71
[0.856002]  [] ? child_rip+0xa/0x20
[0.856002]  [] ? early_idt_handler+0x0/0x71
[0.856002]  [] ? do_set_mempolicy+0x128/0x13a
[0.856002]  [] ? kthreadd+0x0/0xec
[0.856002]  [] ? child_rip+0x0/0x20
[0.856002] Code: 00 02 00 00 4c 89 ef 48 63 d2 e8 0f c6 14 00 3b 05 
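
Editor's illustration: the failure mode can be modelled outside the kernel. The sketch below is a Python model, not the kernel source. It assumes that the fair scheduler normalizes a group's load by that group's cpu_power, and that the nominal per-CPU power is 1024; the helper names `scaled_group_load` and `try_scaled_group_load` are hypothetical. Under those assumptions, a group whose cpu_power was never set (as the ERROR lines in the log report) makes the division fault.

```python
# Assumed nominal power of a single CPU (hypothetical stand-in for the kernel's scale).
SCHED_LOAD_SCALE = 1024


def scaled_group_load(raw_load, cpu_power):
    """Normalize a group's load by its cpu_power (hypothetical helper)."""
    # Integer division; a cpu_power of 0 raises ZeroDivisionError here,
    # which is the Python analogue of the kernel's "divide error".
    return (raw_load * SCHED_LOAD_SCALE) // cpu_power


def try_scaled_group_load(raw_load, cpu_power):
    """Return the scaled load, or None if the division would fault."""
    try:
        return scaled_group_load(raw_load, cpu_power)
    except ZeroDivisionError:
        return None


if __name__ == "__main__":
    # A group with cpu_power = 2048, as printed for the CPU-level groups.
    print(try_scaled_group_load(512, 2048))
    # A group whose cpu_power was never set.
    print(try_scaled_group_load(512, 0))
```

The load value 512 is arbitrary; only the zero divisor matters for the crash.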


2010-11-29 Thread Peter Zijlstra
On Sun, 2010-11-28 at 20:14 +0000, Ben Hutchings wrote:

> [0.856002] Pid: 2, comm: kthreadd Not tainted 2.6.32-5-amd64 #1 W1100z/2100z

What's in that kernel? Is that simply the latest .32-stable?

> [0.536554] CPU0 attaching sched-domain:
> [0.540004]  domain 0: span 0-1 level MC
> [0.548002]   groups: 0 1
> [0.560003]   domain 1: span 0-3 level NODE
> [0.568002]    groups:
> [0.574179] ERROR: domain->cpu_power not set
> [0.576002]
> [0.580002] ERROR: groups don't span domain->span
> [0.584004] CPU1 attaching sched-domain:
> [0.588007]  domain 0: span 0-1 level MC
> [0.596002]   groups: 1 0 (cpu_power = 1023)
> [0.612002] ERROR: parent span is not a superset of domain->span
> [0.616003]   domain 1: span 1-3 level CPU
> [0.624002]    groups: 1 (cpu_power = 2048) 2-3 (cpu_power = 2048)
> [0.644003]    domain 2: span 0-3 level NODE
> [0.652004]     groups: 1-3 (cpu_power = 4096)
> [0.668002] ERROR: domain->cpu_power not set
> [0.672002]
> [0.676002] ERROR: groups don't span domain->span
> [0.680004] CPU2 attaching sched-domain:
> [0.684003]  domain 0: span 2-3 level MC
> [0.692003]   groups: 2 3
> [0.704003]   domain 1: span 1-3 level CPU
> [0.712003]    groups: 2-3 (cpu_power = 2048) 1 (cpu_power = 2048)
> [0.736003]    domain 2: span 0-3 level NODE
> [0.744003]     groups: 1-3 (cpu_power = 4096)
> [0.760003] ERROR: domain->cpu_power not set
> [0.764003]
> [0.768003] ERROR: groups don't span domain->span
> [0.772004] CPU3 attaching sched-domain:
> [0.776003]  domain 0: span 2-3 level MC
> [0.784003]   groups: 3 2
> [0.794183]   domain 1: span 1-3 level CPU
> [0.83]    groups: 2-3 (cpu_power = 2048) 1 (cpu_power = 2048)
> [0.822183]    domain 2: span 0-3 level NODE
> [0.828003]     groups: 1-3 (cpu_power = 4096)
> [0.842180] ERROR: domain->cpu_power not set
> [0.844003]
> [0.848003] ERROR: groups don't span domain->span

Hrm, that smells like the architecture topology setup is wrecked; it looks
like the NUMA setup is bonkers.

I really know very little about that code. I think Tejun recently poked
at it in an attempt to merge 32-bit and 64-bit x86.

Also, it's very weird that CPU0 only gets 2 domains, while the other
3 CPUs get 3 domains.

Ingo, who knows this stuff?
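
Editor's illustration: the two structural complaints in the quoted log ("groups don't span domain->span" and "parent span is not a superset of domain->span") can be restated as set checks. The following is a minimal Python sketch, not the kernel's own debug code; `parse_span` and `check_domains` are hypothetical names, and the CPU0 and CPU1 tables are transcribed by hand from the log above.

```python
def parse_span(text):
    """Turn a span such as "0-3" or "1" into a set of CPU numbers."""
    cpus = set()
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def check_domains(domains):
    """Check one CPU's domains, ordered from lowest level to highest.

    Each entry is (span, [group spans]). Returns a list of problem strings.
    """
    problems = []
    child_span = set()
    for level, (span_text, group_texts) in enumerate(domains):
        span = parse_span(span_text)
        covered = set()
        for group in group_texts:
            covered |= parse_span(group)
        # The groups of a domain should together cover exactly its span.
        if covered != span:
            problems.append(f"domain {level}: groups don't span domain span")
        # Each domain should contain every CPU of the domain below it.
        if not child_span <= span:
            problems.append(f"domain {level}: span is not a superset of child span")
        child_span = span
    return problems


# Transcribed from the CPU0 section: the NODE domain lists no groups at all.
CPU0 = [("0-1", ["0", "1"]), ("0-3", [])]
# Transcribed from the CPU1 section of the log.
CPU1 = [("0-1", ["1", "0"]), ("1-3", ["1", "2-3"]), ("0-3", ["1-3"])]

if __name__ == "__main__":
    for name, doms in (("CPU0", CPU0), ("CPU1", CPU1)):
        for problem in check_domains(doms):
            print(name, problem)
```

On this reading, CPU1's level-1 domain (span 1-3) drops CPU0 even though the domain below it spans 0-1, and every NODE domain has groups that omit CPU0.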







2010-11-29 Thread Frede Feuerstein
On Mon, 2010-11-29 at 12:50 +0100, Peter Zijlstra wrote:
> On Sun, 2010-11-28 at 20:14 +0000, Ben Hutchings wrote:
> 
> > [0.856002] Pid: 2, comm: kthreadd Not tainted 2.6.32-5-amd64 #1 W1100z/2100z
> 
> What's in that kernel? Is that simply the latest .32-stable?

It was the latest stable .32 contained in the official "squeeze".
A new one arrived today. I shall try it tomorrow or late tonight,
since I cannot reset the machine right now.

Additional remark:
When I said that the cores contained in each chip do not share any
resources, I forgot about the clock source:
the clock is set globally for each whole chip, i.e. CPU0 and CPU1
always run on the same clock, as do CPU2 and CPU3.

Best regards

Tilo Hacke









2010-11-29 Thread Ben Hutchings
On Mon, Nov 29, 2010 at 12:50:25PM +0100, Peter Zijlstra wrote:
> On Sun, 2010-11-28 at 20:14 +0000, Ben Hutchings wrote:
> 
> > [0.856002] Pid: 2, comm: kthreadd Not tainted 2.6.32-5-amd64 #1 W1100z/2100z
> 
> What's in that kernel? Is that simply the latest .32-stable?

No, we have quite a few backported driver features and some bug fixes
that aren't in stable yet.  No scheduler or topology changes except
reverting 669c55e9f99b90e46eaa0f98a67ec53d46dc969a for ABI reasons
(which I guess we don't actually need to do).

> > [...]
> 
> Hrm that smells like the architecture topology setup is wrecked, looks
> like the NUMA setup is bonkers.
[...]

Right, that's what I thought.  The question is whether the topology setup
code should fix this up or whether the scheduler init should (as it
appears to have done before 2.6.32).

Ben.

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
  - Albert Camus







2010-11-29 Thread Ingo Molnar

* Ben Hutchings  wrote:

> On Mon, Nov 29, 2010 at 12:50:25PM +0100, Peter Zijlstra wrote:
> > On Sun, 2010-11-28 at 20:14 +0000, Ben Hutchings wrote:
> > 
> > > [0.856002] Pid: 2, comm: kthreadd Not tainted 2.6.32-5-amd64 #1 W1100z/2100z
> > 
> > What's in that kernel? Is that simply the latest .32-stable?
> 
> No, we have quite a few backported driver features and some bug fixes
> that aren't in stable yet.  No scheduler or topology changes except
> reverting 669c55e9f99b90e46eaa0f98a67ec53d46dc969a for ABI reasons
> (which I guess we don't actually need to do).
> 
> > > [...]
> > 
> > Hrm that smells like the architecture topology setup is wrecked, looks
> > like the NUMA setup is bonkers.
> [...]
> 
> Right, that's what I thought.  The question is whether the topology setup
> code should fix this up or whether the scheduler init should (as it
> appears to have done before 2.6.32).

We definitely want to robustify scheduler init code to not crash and to (if 
possible) print a warning about the borkage.
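
Editor's illustration: one way to picture "not crash and print a warning" is a validation pass run once at init time, before any load calculation divides by cpu_power. This is a rough Python sketch of that idea under stated assumptions, not a proposed kernel patch: `sanitize_group_power` is a hypothetical name, groups are modelled as a plain mapping from a label to its cpu_power, and falling back to a nominal power of 1024 is an arbitrary choice made for the example.

```python
import logging

log = logging.getLogger("sched_init_sketch")

# Assumed nominal power of a single CPU, used here as the fallback value.
SCHED_LOAD_SCALE = 1024


def sanitize_group_power(groups):
    """Return a copy of groups with unset cpu_power replaced by a default.

    groups maps a group label such as "1-3" to its cpu_power. A value of
    zero (or less) is treated as "not set": it is reported and replaced,
    so later divisions by cpu_power cannot fault.
    """
    fixed = {}
    for label, power in groups.items():
        if power <= 0:
            log.warning(
                "group %s: cpu_power not set, defaulting to %d",
                label,
                SCHED_LOAD_SCALE,
            )
            power = SCHED_LOAD_SCALE
        fixed[label] = power
    return fixed


if __name__ == "__main__":
    # One healthy group and one group whose power was never initialised.
    print(sanitize_group_power({"1-3": 4096, "0": 0}))
```

Repairing the value keeps the machine booting, while the warning preserves the evidence that the topology information was inconsistent.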

Thanks,

Ingo


