Re: RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)
On Mon, Sep 7, 2015 at 11:00 AM, Chinmay V S wrote:
> Hello everyone,
>
> TL;DR: In Linux RT scheduler, how can rt_nr_running be non-zero AND
> active-bitmap NOT have any valid bit set?
>
> Details:
> Recently i encountered the following BUG() within the realtime
> scheduler (sched_rt.c) on 3.1.10 kernel.
> [101640.492840] kernel BUG at kernel/sched_rt.c:1126!
>
> This turns out to be
> 1126 BUG_ON(idx >= MAX_RT_PRIO);

The reason for the stack trace is given below.
http://www.spinics.net/lists/newbies/msg08889.html

--
Sujit K M
blog(http://kmsujit.blogspot.com/)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)
On Mon, Sep 7, 2015 at 12:28 PM, Chinmay V S wrote:
> Thanks for your quick response Mike.
>
>> Try without the proprietary modules. You may also want to audit futex
>> fixes if you can't use a maintained stable tree. 3.2 has a bunch that
>> 3.1 does not.
>
> I see that futex.c has 17 patches in 3.2.y that are missing in my tree.
> http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/kernel/futex.c?h=linux-3.2.y

If in doubt, please use the mainline kernel and try to reproduce.
Re: RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)
On Mon, 2015-09-07 at 12:28 +0530, Chinmay V S wrote:
> To catch the "culprit" in the middle of busting the scheduler's
> internal data structures, what would be the recommended debug
> mechanisms (or config options) that i can try?

I'd configure kdump, let it explode, and examine runqueues in the crash
dump first.

-Mike
Re: RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)
Thanks for your quick response Mike.

> Try without the proprietary modules. You may also want to audit futex
> fixes if you can't use a maintained stable tree. 3.2 has a bunch that
> 3.1 does not.

I see that futex.c has 17 patches in 3.2.y that are missing in my tree.
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/kernel/futex.c?h=linux-3.2.y

I will apply these patches and kick off a run today.
It takes up to 2 days to reproduce this RT-scheduler BUG().

Also, in one of the earlier runs to reproduce, I had enabled
CONFIG_CC_STACKPROTECTOR
CONFIG_STRICT_DEVMEM
but there weren't any additional logs indicating illegal writes to memory.
The kernel OOPS was similar to the one in the original email in this thread.

To catch the "culprit" in the middle of busting the scheduler's
internal data structures, what would be the recommended debug
mechanisms (or config options) that I can try?

regards
CVS
Re: RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)
On Mon, 2015-09-07 at 11:00 +0530, Chinmay V S wrote:
> So how could rt_nr_running be non-zero AND active-bitmap NOT have any
> valid bit set?

It can't without being busted.

> Also including the kernel OOPS below.
> Do you see any tell-tale signs in the register-dump/backtrace that can
> point me in the right direction?

Try without the proprietary modules. You may also want to audit futex
fixes if you can't use a maintained stable tree. 3.2 has a bunch that
3.1 does not.

-Mike
RT Scheduler - BUG_ON (idx >= MAX_RT_PRIO)
Hello everyone,

TL;DR: In the Linux RT scheduler, how can rt_nr_running be non-zero AND
the active bitmap NOT have any valid bit set?

Details:
Recently I encountered the following BUG() within the realtime
scheduler (sched_rt.c) on a 3.1.10 kernel.
[101640.492840] kernel BUG at kernel/sched_rt.c:1126!

This turns out to be
1126 BUG_ON(idx >= MAX_RT_PRIO);
within the function pick_next_rt_entity() as shown here:
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/kernel/sched_rt.c?h=linux-3.1.y#n1115

What this means is that the scheduler failed to find a valid bit within
the bitmap containing a prioritised list of active tasks. However,
before attempting to parse the bitmap, there is a check for a non-zero
rt_nr_running (i.e. parsing the bitmap should find at least 1 bit for
the active running RT task).

So how could rt_nr_running be non-zero AND the active bitmap NOT have
any valid bit set?

The issue is observed on
- a quad-core Cortex A9 SMP embedded system.
- running a userspace app with ~25 RT threads (FIFO and RR)
- typical ubuntu-core rootfs

This issue consistently reproduces within 24-48 hours of continuously
running the system. Searching the net/lkml I could not find this issue
reported before, though there are a few memory corruption bugs in the
scheduler. I have already backported the patches that fix known memory
corruption issues from upstream kernel versions and still encounter the
above BUG().

Is anyone aware of this issue?

Also including the kernel OOPS below.
Do you see any tell-tale signs in the register-dump/backtrace that can
point me in the right direction?

[101640.488133] [ cut here ]
[101640.492840] kernel BUG at kernel/sched_rt.c:1126!
[101640.497621] Internal error: Oops - undefined instruction: 0 [#1] PREEMPT SMP
[101640.504742] Modules linked in: misc_arz(P) audio_sta3_2(P) audio_sta3_1(P) audio_sta3(P) lamp_tlc8116_2(P) lamp_tlc8116_1(P) lamp_tlc8116(P) i2c_master_pcu9669(P) tegra_gpio_helper(P) outport_timer(P) intTimer(P) nvidia(P)
[101640.524618] CPU: 0 Tainted: P (3.1.10 #1)
[101640.530015] PC is at pick_next_task_rt+0x138/0x140
[101640.534888] LR is at __schedule+0x63c/0x858
[101640.539150] pc : [] lr : [] psr: 200f0093
[101640.539154] sp : e2d7fcc8 ip : e2d7fce8 fp : e2d7fce4
[101640.550785] r10: c05a4e60 r9 : e21f6d4c r8 : c05c8ab0
[101640.556084] r7 : e2d7e000 r6 : 0001 r5 : c186fe60 r4 : c043f928
[101640.562685] r3 : c186ff60 r2 : 0064 r1 : fff0 r0 : c186fe60
[101640.569288] Flags: nzCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
[101640.576582] Control: 10c5387d Table: a335804a DAC: 0015
[101640.582400]
[101640.582403] PC: 0xc0042880:
[101640.586839] 2880 eaeb e5932008 e352 0a15 e2621000 e0012002 e16f2f12 e262205f
[101640.595143] 28a0 eae3 e30034b8 e2406f5a e18320d5 e356 e1c625f8 0ac9 e2854d11
[101640.603449] 28c0 e2800070 e2844008 e1a01004 eb06b771 e5953448 e1a6 e0534004 13a04001
[101640.611753] 28e0 e58544d4 e89da878 e593200c e2621000 e0012002 e16f2f12 e262207f eacc
[101640.620057] 2900 e7f001f2 e7f001f2 e1a0c00d e92ddff0 e24cb004 e24dd01c e2914038 e1a08002
[101640.628361] 2920 0a74 e5913054 e353 1a68 e30ba720 e3a09e4b e34ca05e e50ba038
[101640.636665] 2940 e2083005 e5945124 e3530001 0a04 e1c423d0 e1c501d8 e0922000 e0a33001
[101640.644969] 2960 e1c423f0 e1a5 ebffee88 e3a01000 e1a5 ebfff0b7 e5943000 e3a01000
[101640.653276]
[101640.653278] LR: 0xc0435110:
[101640.657713] 5110 e1932f9f e2822001 e1831f92 e331 1afa ea5d e1a8 eb000947
[101640.666017] 5130 e51b004c eb0009bc e1a8 eb0009ba eade e1a8 e1a01009 e3a02001
[101640.674323] 5150 ebf032e2 eadf e59f3248 e1a08007 e51ba070 e1a07006 e1a06005 e1a05004
[101640.682627] 5170 e1a04003 ea02 e5944000 e354 0a86 e5943018 e1a5 e12fff33
[101640.690931] 5190 e350 0af7 e1a04005 e50ba070 e1a05006 e1a0a000 e1a06007 e1a07008
[101640.699235] 51b0 eafffeb5 e1a4 eb0008fc ea48 e3000518 e3071ff4 e18420d0 e3a00e4b
[101640.707539] 51d0 e34c105c e591104c e14b26fc e18420d0 e1a1 e3a01000 e14b23fc e3083080
[101640.715845] 51f0 e34c305a e50b304c e593c000 e14b26dc e1530001 0152 e14b03dc e3a03e51
[101640.724152]
[101640.724154] SP: 0xe2d7fc48:
[101640.728589] fc48 e21f6aa0 2c5bae94 e2d7fc74 e2d7fc68 c0042904 200f0093
[101640.736893] fc68 c000e394 e2d7fce4 e2d7fc80 c000e0c8 c00081a0 c186fe60 fff0
[101640.745197] fc88 0064 c186ff60 c043f928 c186fe60 0001 e2d7e000 c05c8ab0 e21f6d4c
[101640.753501] fca8 c05a4e60 e2d7fce4 e2d7fce8 e2d7fcc8 c0435190 c0042900 200f0093
[101640.761807] fcc8 c00427c8 c043f928 c186fe60 e21f6aa0 e2d7fd64 e2d7fce8 c0435190 c00427d4
[101640.770111] fce8 e39de688 0031 0001 c05a4e60 00062904 e2d7fd54 e2d7fd10
[101640.778415] fd08 c00404f0 c003e308 c05a4e60 c05a4e60 c05a8080 5c71 c05a40c4 c05a4e60
[101640.786721] fd28