Re: Unexplained Kernel Panic / Hung Task

David Sommerseth Wed, 04 Dec 2013 15:37:25 -0800

On 04/12/13 14:21, ~Stack~ wrote:
> Greetings,
>
> I have a test system I use for testing deployments and when I am not
> using it, it runs Boinc. It is a Scientific Linux 6.4 fully updated box.
> Recently (last ~3 weeks) I have started getting the same kernel panic.
> Sometimes it will be multiple times in a single day and other times it
> will be days before the next one (it just had a 5 day uptime). But the
> kernel panic looks pretty much the same. It is a complaint about a hung
> task plus information about the ext4 file system. I have run the
> smartmon tool against both drives (2 drives set up in a hardware RAID
> mirror) and both drives check out fine. I ran a fsck against the /
> partition and everything looked fine (on this test box there is only /
> and swap partitions). I even took out a drive at a time and had the same
> crashes (though this could be an indicator that both drives are bad). I
> am wondering if my RAID card is going bad.
>
> When the crash happens I still have the SSH prompt, however, I can only
> do basic things like navigating directories and sometimes reading files.
> Writing to a file seems to hang, using tab-autocomplete will frequently
> hang, running most programs (even `init 6` or `top`) will hang.
>
> It crashed again last night, and I am kind of stumped. I would greatly
> appreciate others' thoughts and input on what the problem might be.
>
> Thanks!
> ~Stack~
>
> Dec  4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked
> for more than 120 seconds.
> Dec  4 02:25:09 testbox kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec  4 02:25:09 testbox kernel: jbd2/cciss!c0 D 0000000000000000     0
>   273      2 0x00000000
> Dec  4 02:25:09 testbox kernel: ffff8802142cfb30 0000000000000046
> ffff8802138b5800 0000000000001000
> Dec  4 02:25:09 testbox kernel: ffff8802142cfaa0 ffffffff81012c59
> ffff8802142cfae0 ffffffff810a2431
> Dec  4 02:25:09 testbox kernel: ffff880214157058 ffff8802142cffd8
> 000000000000fb88 ffff880214157058
> Dec  4 02:25:09 testbox kernel: Call Trace:
> Dec  4 02:25:09 testbox kernel: [<ffffffff81012c59>] ? read_tsc+0x9/0x20

This looks like some locking issue to me, triggered by something around
the TSC timer.

This is either a buggy driver (most likely the cciss driver) or related
firmware (read the complete boot log carefully and look for firmware
warnings). Or it's a really unstable TSC clock source. Try switching from
TSC to HPET (or, in the worst case, acpi_pm). See this KB for some related
info:
<https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html>
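
To check which clock source is in use before changing anything, something
like the sketch below should work. It assumes the usual sysfs location
(/sys/devices/system/clocksource/clocksource0); the function names are my
own and only there for illustration. It only reads by default, and the
actual switch needs root.

```python
#!/usr/bin/env python3
"""Inspect the kernel clock source via sysfs (read-only unless asked to switch)."""
import os

# Usual sysfs location on Linux; kept as a parameter so it can be overridden.
SYSFS_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksources(sysfs_dir=SYSFS_DIR):
    """Return (current, available) as (str, list of str)."""
    with open(os.path.join(sysfs_dir, "current_clocksource")) as f:
        current = f.read().strip()
    with open(os.path.join(sysfs_dir, "available_clocksource")) as f:
        available = f.read().split()
    return current, available


def pick_alternative(current, available, preferred=("hpet", "acpi_pm")):
    """Return the first preferred source that is available and not already
    in use, or None if there is no such source."""
    for name in preferred:
        if name in available and name != current:
            return name
    return None


def switch_clocksource(name, sysfs_dir=SYSFS_DIR):
    """Write the new clock source name; needs root on a real system."""
    with open(os.path.join(sysfs_dir, "current_clocksource"), "w") as f:
        f.write(name)


if __name__ == "__main__":
    if not os.path.isdir(SYSFS_DIR):
        print("no clocksource sysfs directory found")
    else:
        current, available = read_clocksources()
        print("current:", current, "available:", " ".join(available))
        print("suggested alternative:", pick_alternative(current, available))
```

A switch made this way does not survive a reboot. To make it permanent,
the clocksource=hpet kernel boot parameter can be added to the kernel line
in the boot loader configuration.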

But my hunch tells me it's a driver-related issue, with some bad locking.
There seem to be several filesystem operations happening on two or more
CPU cores in a certain order, which seems to trigger a deadlock.



--
kind regards,

David Sommerseth
