Re: RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)

2015-09-07 Thread Sujit K M
On Mon, Sep 7, 2015 at 11:00 AM, Chinmay V S  wrote:
> Hello everyone,
>
> TL;DR: In Linux RT scheduler, how can rt_nr_running be non-zero AND
> active-bitmap NOT have any valid bit set?
>
> Details:
> Recently I encountered the following BUG() within the realtime
> scheduler (sched_rt.c) on 3.1.10 kernel.
> [101640.492840] kernel BUG at kernel/sched_rt.c:1126!
>
> This turns out to be
> 1126 BUG_ON(idx >= MAX_RT_PRIO);

The reason for the stack trace is given below.
http://www.spinics.net/lists/newbies/msg08889.html





-- 
-- Sujit K M

blog(http://kmsujit.blogspot.com/)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)

2015-09-07 Thread Sujit K M
On Mon, Sep 7, 2015 at 12:28 PM, Chinmay V S  wrote:
> Thanks for your quick response Mike.
>
>> Try without the proprietary modules. You may also want to audit futex
>> fixes if you can't use a maintained stable tree.  3.2 has a bunch that
>> 3.1 does not.
>
> I see that futex.c has 17 patches in 3.2.y that are missing in my tree.
> http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/kernel/futex.c?h=linux-3.2.y

If in doubt please use the mainline kernel and try and reproduce.


Re: RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)

2015-09-07 Thread Mike Galbraith
On Mon, 2015-09-07 at 12:28 +0530, Chinmay V S wrote:

> To catch the "culprit" in the middle of busting the scheduler's
> internal data structures, what would be the recommended debug
> mechanisms (or config options) that I can try?

I'd configure kdump, let it explode, and examine runqueues in the crash
dump first.

-Mike



Re: RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)

2015-09-07 Thread Chinmay V S
Thanks for your quick response Mike.

> Try without the proprietary modules. You may also want to audit futex
> fixes if you can't use a maintained stable tree.  3.2 has a bunch that
> 3.1 does not.

I see that futex.c has 17 patches in 3.2.y that are missing in my tree.
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/kernel/futex.c?h=linux-3.2.y

Will apply these patches and kick-off a run today.
It takes up to 2 days to reproduce this RT-scheduler BUG().

Also, in one of the earlier runs to reproduce, I had enabled
CONFIG_CC_STACKPROTECTOR
CONFIG_STRICT_DEVMEM
but there weren't any additional logs indicating illegal writes to memory.
Kernel OOPS was similar to the one in the original email in this thread.

To catch the "culprit" in the middle of busting the scheduler's
internal data structures, what would be the recommended debug
mechanisms (or config options) that I can try?

regards
CVS


Re: RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)

2015-09-07 Thread Mike Galbraith
On Mon, 2015-09-07 at 11:00 +0530, Chinmay V S wrote:

> So how could rt_nr_running be non-zero AND active-bitmap NOT have any
> valid bit set?

It can't without being busted.

> Also including the kernel OOPS below.
> Do you see any tell-tale signs in the register-dump/backtrace that can
> point me in the right direction?

Try without the proprietary modules.  You may also want to audit futex
fixes if you can't use a maintained stable tree.  3.2 has a bunch that
3.1 does not.

-Mike




RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)

2015-09-06 Thread Chinmay V S
Hello everyone,

TL;DR: In Linux RT scheduler, how can rt_nr_running be non-zero AND
active-bitmap NOT have any valid bit set?

Details:
Recently I encountered the following BUG() within the realtime
scheduler (sched_rt.c) on 3.1.10 kernel.
[101640.492840] kernel BUG at kernel/sched_rt.c:1126!

This turns out to be
1126 BUG_ON(idx >= MAX_RT_PRIO);

within the function pick_next_rt_entity() as shown here:
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/kernel/sched_rt.c?h=linux-3.1.y#n1115

What this means is that the scheduler failed to find a valid bit
within the bitmap containing a prioritised list of active tasks.
However, before attempting to parse the bitmap, there is a check for a
non-zero rt_nr_running
(i.e. parsing the bitmap should find at least 1 bit for an active,
runnable RT task).

So how could rt_nr_running be non-zero AND active-bitmap NOT have any
valid bit set?
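
For readers following along, the invariant in question can be sketched as a
small user-space model in C. This is an illustration only, not the kernel
code: struct model_rt_rq and the model_* helpers are hypothetical names, and
the layout is simplified (per-priority counters stand in for the kernel's
per-priority lists). MAX_RT_PRIO is 100 as in the kernel, and the bitmap
carries one extra delimiter bit at index MAX_RT_PRIO, so a search on an empty
array returns MAX_RT_PRIO.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_RT_PRIO 100  /* same value as the kernel constant */
#define WORD_BITS (8 * sizeof(unsigned long))
/* Room for priorities 0..99 plus one delimiter bit at index MAX_RT_PRIO. */
#define BITMAP_WORDS ((MAX_RT_PRIO + 1 + WORD_BITS - 1) / WORD_BITS)

/* Simplified, hypothetical stand-in for the RT runqueue state. */
struct model_rt_rq {
    unsigned long bitmap[BITMAP_WORDS]; /* bit p set => priority p has tasks */
    unsigned int queued[MAX_RT_PRIO];   /* number of tasks at each priority */
    unsigned int rt_nr_running;         /* total runnable RT tasks */
};

void model_init(struct model_rt_rq *rq)
{
    memset(rq, 0, sizeof(*rq));
    /* Delimiter bit: guarantees the bit search always terminates. */
    rq->bitmap[MAX_RT_PRIO / WORD_BITS] |= 1UL << (MAX_RT_PRIO % WORD_BITS);
}

void model_enqueue(struct model_rt_rq *rq, unsigned int prio)
{
    assert(prio < MAX_RT_PRIO);
    rq->queued[prio]++;
    rq->bitmap[prio / WORD_BITS] |= 1UL << (prio % WORD_BITS);
    rq->rt_nr_running++;
}

void model_dequeue(struct model_rt_rq *rq, unsigned int prio)
{
    assert(prio < MAX_RT_PRIO && rq->queued[prio] > 0);
    /* Clear the bit only when the last task at this priority leaves. */
    if (--rq->queued[prio] == 0)
        rq->bitmap[prio / WORD_BITS] &= ~(1UL << (prio % WORD_BITS));
    rq->rt_nr_running--;
}

/* Index of the lowest set bit; MAX_RT_PRIO means "only the delimiter". */
unsigned int model_first_bit(const struct model_rt_rq *rq)
{
    for (unsigned int i = 0; i <= MAX_RT_PRIO; i++)
        if (rq->bitmap[i / WORD_BITS] & (1UL << (i % WORD_BITS)))
            return i;
    return MAX_RT_PRIO + 1; /* even the delimiter bit is gone */
}

/* True when the state matches the reported BUG_ON condition. */
bool model_would_bug(const struct model_rt_rq *rq)
{
    return rq->rt_nr_running != 0 && model_first_bit(rq) >= MAX_RT_PRIO;
}
```

In this model, as long as every change goes through model_enqueue() and
model_dequeue(), the counter and the bitmap move together and
model_would_bug() stays false. It only becomes true if one of the two fields
is changed behind the helpers' back (for example rt_nr_running set to 1 on an
empty queue), which is why a stray write or a missed update looks like the
most plausible explanation here.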

The issue is observed on
- a quad-core Cortex A9 SMP embedded system.
- running a userspace app with ~25 RT threads (FIFO and RR)
- typical ubuntu-core rootfs

This issue consistently reproduces within 24-48 hours of running the
system continuously.

Searching the net/lkml I could not find this issue reported before,
though there are a few memory corruption bugs in the scheduler.
I have already backported the patches that fix known memory corruption
issues from upstream kernel versions and still encounter the above
BUG().

Is anyone aware of this issue?

Also including the kernel OOPS below.
Do you see any tell-tale signs in the register-dump/backtrace that can
point me in the right direction?

[101640.488133] ------------[ cut here ]------------
[101640.492840] kernel BUG at kernel/sched_rt.c:1126!
[101640.497621] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP
[101640.504742] Modules linked in: misc_arz(P) audio_sta3_2(P)
audio_sta3_1(P) audio_sta3(P) lamp_tlc8116_2(P) lamp_tlc8116_1(P)
lamp_tlc8116(P) i2c_master_pcu9669(P) tegra_gpio_helper(P)
outport_timer(P) intTimer(P) nvidia(P)
[101640.524618] CPU: 0    Tainted: P            (3.1.10 #1)
[101640.530015] PC is at pick_next_task_rt+0x138/0x140
[101640.534888] LR is at __schedule+0x63c/0x858
[101640.539150] pc : []lr : []psr: 200f0093
[101640.539154] sp : e2d7fcc8  ip : e2d7fce8  fp : e2d7fce4
[101640.550785] r10: c05a4e60  r9 : e21f6d4c  r8 : c05c8ab0
[101640.556084] r7 : e2d7e000  r6 : 0001  r5 : c186fe60  r4 : c043f928
[101640.562685] r3 : c186ff60  r2 : 0064  r1 : fff0  r0 : c186fe60
[101640.569288] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM
Segment user
[101640.576582] Control: 10c5387d  Table: a335804a  DAC: 0015
[101640.582400]
[101640.582403] PC: 0xc0042880:
[101640.586839] 2880  eaeb e5932008 e352 0a15 e2621000
e0012002 e16f2f12 e262205f
[101640.595143] 28a0  eae3 e30034b8 e2406f5a e18320d5 e356
e1c625f8 0ac9 e2854d11
[101640.603449] 28c0  e2800070 e2844008 e1a01004 eb06b771 e5953448
e1a6 e0534004 13a04001
[101640.611753] 28e0  e58544d4 e89da878 e593200c e2621000 e0012002
e16f2f12 e262207f eacc
[101640.620057] 2900  e7f001f2 e7f001f2 e1a0c00d e92ddff0 e24cb004
e24dd01c e2914038 e1a08002
[101640.628361] 2920  0a74 e5913054 e353 1a68 e30ba720
e3a09e4b e34ca05e e50ba038
[101640.636665] 2940  e2083005 e5945124 e3530001 0a04 e1c423d0
e1c501d8 e0922000 e0a33001
[101640.644969] 2960  e1c423f0 e1a5 ebffee88 e3a01000 e1a5
ebfff0b7 e5943000 e3a01000
[101640.653276]
[101640.653278] LR: 0xc0435110:
[101640.657713] 5110  e1932f9f e2822001 e1831f92 e331 1afa
ea5d e1a8 eb000947
[101640.666017] 5130  e51b004c eb0009bc e1a8 eb0009ba eade
e1a8 e1a01009 e3a02001
[101640.674323] 5150  ebf032e2 eadf e59f3248 e1a08007 e51ba070
e1a07006 e1a06005 e1a05004
[101640.682627] 5170  e1a04003 ea02 e5944000 e354 0a86
e5943018 e1a5 e12fff33
[101640.690931] 5190  e350 0af7 e1a04005 e50ba070 e1a05006
e1a0a000 e1a06007 e1a07008
[101640.699235] 51b0  eafffeb5 e1a4 eb0008fc ea48 e3000518
e3071ff4 e18420d0 e3a00e4b
[101640.707539] 51d0  e34c105c e591104c e14b26fc e18420d0 e1a1
e3a01000 e14b23fc e3083080
[101640.715845] 51f0  e34c305a e50b304c e593c000 e14b26dc e1530001
0152 e14b03dc e3a03e51
[101640.724152]
[101640.724154] SP: 0xe2d7fc48:
[101640.728589] fc48   e21f6aa0 2c5bae94  e2d7fc74
e2d7fc68 c0042904 200f0093
[101640.736893] fc68  c000e394  e2d7fce4 e2d7fc80 c000e0c8
c00081a0 c186fe60 fff0
[101640.745197] fc88  0064 c186ff60 c043f928 c186fe60 0001
e2d7e000 c05c8ab0 e21f6d4c
[101640.753501] fca8  c05a4e60 e2d7fce4 e2d7fce8 e2d7fcc8 c0435190
c0042900 200f0093 
[101640.761807] fcc8  c00427c8 c043f928 c186fe60 e21f6aa0 e2d7fd64
e2d7fce8 c0435190 c00427d4
[101640.770111] fce8  e39de688 0031 0001 c05a4e60 00062904
 e2d7fd54 e2d7fd10
[101640.778415] fd08  c00404f0 c003e308 c05a4e60 c05a4e60 c05a8080
5c71 c05a40c4 c05a4e60
[101640.786721] fd28  
