Your message dated Mon, 24 Jun 2013 19:27:57 +0200 with message-id <[email protected]> and subject line Closing has caused the Debian Bug report #623275, regarding linux-2.6: [x86] Null pointer dereference in hrtick_start_fair to be marked as done.
This means that you claim that the problem has been dealt with. If this is not the case it is now your responsibility to reopen the Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this message is talking about, this may indicate a serious mail system misconfiguration somewhere. Please contact [email protected] immediately.)

-- 
623275: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=623275
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: linux-2.6
Version: 2.6.26-*

This is the same bug that was reported in PR 538332. That bug was archived, so I am submitting new changes here. Sorry for not replying on the other bug earlier. I totally forgot about this bug report until I received multiple reports of hitting this bug more often while running reboot loop tests on the Debian (5.x) kernel, and started looking at this closely.

First of all, this is the panic message that we see:

<1>[ 1.890083] BUG: unable to handle kernel NULL pointer dereference at 00000000
<1>[ 1.890083] IP: [<c0119118>] hrtick_start_fair+0x63/0x12c
<4>[ 1.890083] *pde = 00000000
<0>[ 1.890083] Oops: 0000 [#1] SMP
<4>[ 1.890083] Modules linked in:
<4>[ 1.890083]
<4>[ 1.890083] Pid: 11, comm: khelper Not tainted (2.6.26-2-686 #1)
<4>[ 1.890083] EIP: 0060:[<c0119118>] EFLAGS: 00010046 CPU: 0
<4>[ 1.890083] EIP is at hrtick_start_fair+0x63/0x12c
<4>[ 1.890083] EAX: 00000000 EBX: c1413ffc ECX: 00000001 EDX: 00000001
<4>[ 1.890083] ESI: df47d900 EDI: c1413fc0 EBP: df4bd228 ESP: df499f20
<4>[ 1.890083] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
<0>[ 1.890083] Process khelper (pid: 11, ti=df498000 task=df48e8c0 task.ti=df498000)
<0>[ 1.890083] Stack: df4bd200 df4bd200 c02c4d40 df47d900 c1413fc0 00000001 c0118966 c1413fc0
<0>[ 1.890083] df47d900 00000001 c011898c df47d900 c1413fc0 c011b6fa 00000003 00000002
<0>[ 1.890083] df47feb0 df47fed4 00000001 00000001 c0118511 00000000 00000003 df47fedc
<0>[ 1.890083] Call Trace:
<0>[ 1.890083] [<c0118966>] enqueue_task+0x52/0x5d
<0>[ 1.890083] [<c011898c>] activate_task+0x1b/0x26
<0>[ 1.890083] [<c011b6fa>] try_to_wake_up+0xaf/0xf1
<0>[ 1.890083] [<c0118511>] __wake_up_common+0x2e/0x58
<0>[ 1.890083] [<c011a686>] complete+0x28/0x36
<0>[ 1.890083] [<c012ebfa>] __call_usermodehelper+0x0/0x4b
<0>[ 1.890083] [<c012f0ae>] run_workqueue+0x74/0xf2
<0>[ 1.890083] [<c012f789>] worker_thread+0x0/0xbd
<0>[ 1.890083] [<c012f83c>] worker_thread+0xb3/0xbd
<0>[ 1.890083] [<c0131a44>] autoremove_wake_function+0x0/0x2d
<0>[ 1.890083] [<c0131983>] kthread+0x38/0x5d
<0>[ 1.890083] [<c013194b>] kthread+0x0/0x5d
<0>[ 1.890083] [<c01044f7>] kernel_thread_helper+0x7/0x10
<0>[ 1.890083] =======================
<0>[ 1.890083] Code: 00 b8 51 09 31 c0 e8 17 95 00 00 f6 05 40 45 37 c0 40 0f 84 d5 00 00 00 f6 87 28 04 00 00 04 0f 85 c8 00 00 00 8b 87 4c 04 00 00 <8b> 00 83 78 7c 00 0f 84 b6 00 00 00 83 7b 08 01 0f 86 ac 00 00
<0>[ 1.890083] EIP: [<c0119118>] hrtick_start_fair+0x63/0x12c SS:ESP 0068:df499f20
<4>[ 1.890083] ---[ end trace a7919e7f17c0a725 ]---

I think there may be a race between hrtimer_start and hrtick_start_fair which can cause this.

The timer->base for the rq (run queue) of every possible CPU is set up in __hrtimer_init, and at this time every CPU's rq.timer->base points to cpu0's hrtimer_bases. When cpu1 starts running, the first time hrtimer_start gets called on this CPU it will try to change the base to the local CPU's base, because, as seen in __hrtimer_init, it is still pointing to cpu0's base. The switch is done in switch_hrtimer_base.

While this is happening on cpu1, if cpu0 tries to access the timer base of cpu1's runqueue (rq.timer->base) without calling lock_timer_base, it may see the NULL value. As seen in the stack trace, this may happen when cpu0 is trying to wake up a task which is on cpu1's runqueue, and it may then see rq.timer->base as NULL.

    static inline struct hrtimer_clock_base *
    switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base)
    {
            ....
            /* See the comment in lock_timer_base() */
            timer->base = NULL;   <<==== after this, cpu0 might see the base as NULL for cpu1's runqueue.
            spin_unlock(&base->cpu_base->lock);
            spin_lock(&new_base->cpu_base->lock);
            timer->base = new_base;
            ....
    }

I didn't dig into when the race was introduced, but it seems to exist on mainline 2.6.26 too. Looking at recent kernels, the hrtimer code has been revamped quite a bit here and the race doesn't exist on those versions.

Can you please take a look at the analysis and let me know if you have any comments.

Thanks,
Alok
--- End Message ---
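
The race described in the message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: `struct clock_base`, `struct timer`, `peek_base_unsafe`, `lock_base` and `switch_base` are hypothetical names invented for illustration, a pthread mutex stands in for the kernel spinlock, and a C11 atomic pointer stands in for timer->base. It shows why a lockless reader can observe NULL during the base switch, and how a retry loop in the style of the lock_timer_base comment the message refers to avoids that.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-CPU hrtimer clock base. */
struct clock_base {
    pthread_mutex_t lock;   /* models base->cpu_base->lock */
    int cpu;                /* which CPU owns this base */
};

/* Hypothetical stand-in for the timer embedded in a runqueue. */
struct timer {
    _Atomic(struct clock_base *) base;  /* models timer->base */
};

/* Unsafe read: models dereferencing timer->base without taking the
 * lock. It returns NULL if it runs in the middle of switch_base(). */
static struct clock_base *peek_base_unsafe(struct timer *t)
{
    return atomic_load(&t->base);
}

/* Safe read: retry until a non-NULL base has been locked and is still
 * the timer's current base. Returns with that base's lock held. */
static struct clock_base *lock_base(struct timer *t)
{
    for (;;) {
        struct clock_base *b = atomic_load(&t->base);
        if (b != NULL) {
            pthread_mutex_lock(&b->lock);
            if (b == atomic_load(&t->base))
                return b;               /* base did not change under us */
            pthread_mutex_unlock(&b->lock);
        }
        sched_yield();                  /* migration in progress, retry */
    }
}

/* Migration: the caller holds old->lock. Returns with new_base->lock
 * held. Between the two stores, the base is visibly NULL. */
static struct clock_base *switch_base(struct timer *t,
                                      struct clock_base *old,
                                      struct clock_base *new_base)
{
    assert(old != new_base);
    atomic_store(&t->base, (struct clock_base *)NULL);  /* the window opens */
    pthread_mutex_unlock(&old->lock);
    pthread_mutex_lock(&new_base->lock);
    atomic_store(&t->base, new_base);                   /* the window closes */
    return new_base;
}
```

In this model, a thread that calls `peek_base_unsafe()` and dereferences the result corresponds to the faulting read in hrtick_start_fair, while a thread that goes through `lock_base()` never sees the intermediate NULL.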
--- Begin Message ---
Hi,

your bug has been filed against the "linux-2.6" source package and was filed for a kernel older than the recently released Debian 7.x / Wheezy with a severity less than important.

We don't have the resources to reproduce the complete backlog of all older kernel bugs, so we're closing this bug for now. If you can reproduce the bug with Debian Wheezy or a more recent kernel from testing or unstable, please reopen the bug by sending a mail to [email protected] with the following three commands included in the mail:

reopen BUGNUMBER
reassign BUGNUMBER src:linux
thanks

Cheers,
Moritz
--- End Message ---

