Hmm. No obvious ideas come to mind, but I'm adding more people to the cc.

Clearly the wait_event_interruptible_timeout() in the RCU grace-period
thread causes this, but I'm not seeing why shutdown would trigger it.

The code disassembles to

   0: 85 db                 test   %ebx,%ebx
   2: 79 0c                 jns    0x10
   4: 81 e6 ff 00 00 00     and    $0xff,%esi
   a: 8d 44 f0 30           lea    0x30(%eax,%esi,8),%eax
   e: eb 0a                 jmp    0x1a
  10: c1 e9 1a             shr    $0x1a,%ecx
  13: 8d 84 c8 30 0e 00 00 lea    0xe30(%eax,%ecx,8),%eax
  1a: 8b 48 04             mov    0x4(%eax),%ecx
  1d: 89 50 04             mov    %edx,0x4(%eax)
  20: 89 02                 mov    %eax,(%edx)
  22: 89 4a 04             mov    %ecx,0x4(%edx)
  25:* 89 11                 mov    %edx,(%ecx) <-- trapping instruction
  27: 5b                   pop    %ebx
  28: 5e                   pop    %esi
  29: 5d                   pop    %ebp
  2a: c3                   ret

so the oops is in the final

        list_add_tail(&timer->entry, vec);

where "%ecx" is "vec->prev" (f8c551f4). That looks like it might be a
perfectly valid pointer, but clearly it isn't. It's about 115M below the
top of the 32-bit virtual address space, which I think might put it in
the vmalloc area.
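
To make the mapping from the disassembly to the source concrete, here is
a minimal userspace sketch of the same store sequence. It is not the
kernel code itself: the struct is a stand-in for the kernel's
struct list_head, and the names init_list_head() and
list_add_tail_sketch() are made up for illustration. The offsets in the
comments assume 32-bit x86, as in the oops.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next;  /* offset 0 on 32-bit x86 */
    struct list_head *prev;  /* offset 4 on 32-bit x86 */
};

/* An empty list points at itself in both directions. */
static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/*
 * Same order of loads and stores as the disassembly:
 * head is in %eax, new_entry is in %edx, the old tail ends up in %ecx.
 */
static void list_add_tail_sketch(struct list_head *new_entry,
                                 struct list_head *head)
{
    struct list_head *prev = head->prev; /* 1a: mov 0x4(%eax),%ecx */

    head->prev = new_entry;              /* 1d: mov %edx,0x4(%eax) */
    new_entry->next = head;              /* 20: mov %eax,(%edx)    */
    new_entry->prev = prev;              /* 22: mov %ecx,0x4(%edx) */
    prev->next = new_entry;              /* 25: mov %edx,(%ecx)    */
}
```

The last store is the only one that writes through the old tail pointer,
which is why a stale "vec->prev" only blows up at offset 25 and not
earlier.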

So I'm *guessing* that something did a vfree() on some data structure
that contained active timers - and then later on the RCU thread ended
up being the next thing that tried to add a timer after the
now-non-existing one.
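
As a rough illustration of that guess (and only a guess), here is a
userspace sketch of the teardown order that avoids it. Everything here
is hypothetical: "struct widget" stands in for whatever structure embeds
the timer, malloc()/free() stand in for vmalloc()/vfree(), and the list
unlink stands in for what del_timer_sync() would do in the kernel.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add_tail_sketch(struct list_head *new_entry,
                                 struct list_head *head)
{
    struct list_head *prev = head->prev;

    head->prev = new_entry;
    new_entry->next = head;
    new_entry->prev = prev;
    prev->next = new_entry;
}

/* Unlink an entry and leave it pointing at itself. */
static void list_del_init_sketch(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    init_list_head(entry);
}

/* Hypothetical object that embeds a pending timer's list node. */
struct widget {
    int payload;
    struct list_head timer_entry;
};

/* Allocate a widget and queue its "timer" on the given vector. */
static struct widget *widget_create(struct list_head *vec)
{
    struct widget *w = malloc(sizeof(*w));

    if (!w)
        return NULL;
    w->payload = 0;
    init_list_head(&w->timer_entry);
    list_add_tail_sketch(&w->timer_entry, vec);
    return w;
}

/*
 * Unlink first, free second. Skipping the unlink would leave the
 * vector's prev pointer aimed at freed memory, and the next
 * list_add_tail on that vector would write through it.
 */
static void widget_destroy(struct widget *w)
{
    list_del_init_sketch(&w->timer_entry);
    free(w);
}
```

If the unlink in widget_destroy() were missing, the next caller to add
an entry to the same vector would be the one to fault, not the code that
did the free. That would match an unrelated thread like the RCU one
taking the hit.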

And your other oopses do seem to have a similar pattern, even if their
actual oops is elsewhere. They oops in run_timer_softirq, also taking
a page fault in the 0xf9...... range, so it might well be a vmalloc
address there too.

But I sure as hell can't start to guess what that would be.

I'm wondering if CONFIG_DEBUG_OBJECTS (and then
CONFIG_DEBUG_OBJECTS_FREE=y and CONFIG_DEBUG_OBJECTS_TIMERS=y) might
help catch this...
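
For reference, a .config fragment along those lines might look like
this. The CONFIG_DEBUG_KERNEL line is an addition here, on the
assumption that the debug-objects options are only selectable with it
enabled; check the result against your own tree's Kconfig.

```
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_TIMERS=y
```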

                Linus

On Mon, Oct 14, 2013 at 4:07 AM, Knut Petersen
<knut_peter...@t-online.de> wrote:
>
> It's the third time in four months that I have to report a kernel Oops during
> shutdown.
> All of these Oopses seem somehow related to the timer subsystem, but they
> are not easily reproducible. As all this happens on two different machines,
> it's unlikely that this mess is related to bad hardware.
>
> I clearly would appreciate any idea how to track this down.
>
> For the last two reports see:
>
> http://www.gossamer-threads.com/lists/linux/kernel/1782575?#1782575
>
> http://www.gossamer-threads.com/lists/linux/kernel/1744892?#1744892
>
> This time the kernel oopsed after systemd reported that target shutdown
> had been reached - see attached pdf for the full trace. To make it easier
> to find this problem a shortened call trace:
>
>
> Call Trace:
>     internal_add_timer
>     schedule_timeout
>     ? call_timer_fn
>     rcu_gp_kthread
>     __init_waitqueue_head
>     ? rcu_gp_fqs
>     kthread
>     ret_from_kernel_thread
>     ? __init_kthread_worker
>
> EIP: __internal_add_timer
>
> Hardware: AOpen i915GMm-hfs mobo with a Pentium-M Dothan and 2GB of RAM.
> Distribution: openSuSE 12.3
> Kernel: local 3.12.0-rc4-00127-g45877c4 is kernel 9d05746 with my
> "Enforce 1 as lower limit for perf_event_max_sample_rate"
> patch applied.
>
> cu,
>  knut