On 04/12/13 14:21, ~Stack~ wrote:
> Greetings,
>
> I have a test system I use for testing deployments and when I am not
> using it, it runs Boinc. It is a Scientific Linux 6.4 fully updated box.
> Recently (last ~3 weeks) I have started getting the same kernel panic.
> Sometimes it will be multiple times in a single day and other times it
> will be days before the next one (it just had a 5 day uptime). But the
> kernel panic looks pretty much the same. It is a complaint about a hung
> task plus information about the ext4 file system. I have run the
> smartmon tool against both drives (2 drives setup in a hardware RAID
> mirror) and both drives checkout fine. I ran a fsck against the /
> partition and everything looked fine (on this text box there is only /
> and swap partitions). I even took out a drive at a time and had the same
> crashes (though this could be an indicator that both drives are bad). I
> am wondering if my RAID card is going bad.
>
> When the crash happens I still have the SSH prompt, however, I can only
> do basic things like navigating directories and sometimes reading files.
> Writing to a file seems to hang, using tab-autocomplete will frequently
> hang, running most programs (even `init 6` or `top`) will hang.
>
> It crashed again last night, and I am kind of stumped. I would greatly
> appreciate others thoughts and input on what the problem might be.
>
> Thanks!
> ~Stack~
>
> Dec 4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked for more than 120 seconds.
> Dec 4 02:25:09 testbox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 4 02:25:09 testbox kernel: jbd2/cciss!c0 D 0000000000000000 0 273 2 0x00000000
> Dec 4 02:25:09 testbox kernel: ffff8802142cfb30 0000000000000046 ffff8802138b5800 0000000000001000
> Dec 4 02:25:09 testbox kernel: ffff8802142cfaa0 ffffffff81012c59 ffff8802142cfae0 ffffffff810a2431
> Dec 4 02:25:09 testbox kernel: ffff880214157058 ffff8802142cffd8 000000000000fb88 ffff880214157058
> Dec 4 02:25:09 testbox kernel: Call Trace:
> Dec 4 02:25:09 testbox kernel: [<ffffffff81012c59>] ? read_tsc+0x9/0x20
This looks like some locking issue to me, triggered by something around the TSC timer.
This is either a buggy driver (most likely the cciss driver) or related firmware (read the complete boot log carefully and look for firmware warnings). Or it's a really unstable TSC clock source. Try switching from TSC to HPET (or, in the worst case, acpi_pm). See this KB for some related info: <https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html>
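To check which clock source the box is using and what it could switch to, something like the following may help. It is only a minimal sketch: it assumes the usual sysfs location for the clocksource controls, and the helper names (`read_clocksources`, `pick_fallback`) are made up for illustration. It only prints suggestions and changes nothing.

```python
from pathlib import Path

# Usual sysfs location of the clocksource controls on Linux (assumed present).
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources as reported by the kernel."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def pick_fallback(available, preferred=("hpet", "acpi_pm")):
    """Return the first preferred non-TSC source the kernel offers, or None."""
    for name in preferred:
        if name in available:
            return name
    return None


def main():
    try:
        current, available = read_clocksources()
    except OSError as exc:
        print(f"could not read clocksource info: {exc}")
        return
    print(f"current:   {current}")
    print(f"available: {' '.join(available)}")
    fallback = pick_fallback(available)
    if current == "tsc" and fallback:
        target = CLOCKSOURCE_DIR / "current_clocksource"
        # Writing the name to current_clocksource as root switches at runtime.
        print(f"switch now: echo {fallback} > {target}")
        # The clocksource= kernel parameter makes the choice survive a reboot.
        print(f"persist:    add clocksource={fallback} to the kernel command line")


if __name__ == "__main__":
    main()
```

If the hangs stop after running on HPET for a while, that points at the TSC; if they continue, the driver or firmware is the more likely culprit.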
But my hunch tells me it's a driver-related issue, with some bad locking. There seem to be several filesystem operations happening on two or more CPU cores in a certain order, which seems to trigger a deadlock.
-- kind regards, David Sommerseth