Unexplained Kernel Panic / Hung Task
Greetings,

I have a test system I use for testing deployments, and when I am not using it, it runs BOINC. It is a fully updated Scientific Linux 6.4 box.

Recently (last ~3 weeks) I have started getting the same kernel panic. Sometimes it happens multiple times in a single day, and other times it is days before the next one (it just had a 5-day uptime). But the kernel panic looks pretty much the same each time: a complaint about a hung task plus information about the ext4 file system.

I have run the smartmon tool against both drives (2 drives set up in a hardware RAID mirror) and both drives check out fine. I ran a fsck against the / partition and everything looked fine (on this test box there are only / and swap partitions). I even took out one drive at a time and had the same crashes (though this could be an indicator that both drives are bad). I am wondering if my RAID card is going bad.

When the crash happens I still have the SSH prompt; however, I can only do basic things like navigating directories and sometimes reading files. Writing to a file seems to hang, using tab-autocomplete will frequently hang, and running most programs (even `init 6` or `top`) will hang.

It crashed again last night, and I am kind of stumped. I would greatly appreciate others' thoughts and input on what the problem might be.

Thanks!
~Stack~

Dec 4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked for more than 120 seconds.
Dec 4 02:25:09 testbox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 4 02:25:09 testbox kernel: jbd2/cciss!c0 D 0 273 2 0x
Dec 4 02:25:09 testbox kernel: 8802142cfb30 0046 8802138b5800 1000
Dec 4 02:25:09 testbox kernel: 8802142cfaa0 81012c59 8802142cfae0 810a2431
Dec 4 02:25:09 testbox kernel: 880214157058 8802142cffd8 fb88 880214157058
Dec 4 02:25:09 testbox kernel: Call Trace:
Dec 4 02:25:09 testbox kernel: [81012c59] ? read_tsc+0x9/0x20
Dec 4 02:25:09 testbox kernel: [810a2431] ? ktime_get_ts+0xb1/0xf0
Dec 4 02:25:09 testbox kernel: [810a2431] ? ktime_get_ts+0xb1/0xf0
Dec 4 02:25:09 testbox kernel: [81119e10] ? sync_page+0x0/0x50
Dec 4 02:25:09 testbox kernel: [8150e953] io_schedule+0x73/0xc0
Dec 4 02:25:09 testbox kernel: [81119e4d] sync_page+0x3d/0x50
Dec 4 02:25:09 testbox kernel: [8150f30f] __wait_on_bit+0x5f/0x90
Dec 4 02:25:09 testbox kernel: [8111a083] wait_on_page_bit+0x73/0x80
Dec 4 02:25:09 testbox kernel: [81096de0] ? wake_bit_function+0x0/0x50
Dec 4 02:25:09 testbox kernel: [8112f115] ? pagevec_lookup_tag+0x25/0x40
Dec 4 02:25:09 testbox kernel: [8111a4ab] wait_on_page_writeback_range+0xfb/0x190
Dec 4 02:25:09 testbox kernel: [8125d42d] ? submit_bio+0x8d/0x120
Dec 4 02:25:09 testbox kernel: [8111a56f] filemap_fdatawait+0x2f/0x40
Dec 4 02:25:09 testbox kernel: [a004de59] jbd2_journal_commit_transaction+0x7e9/0x1500 [jbd2]
Dec 4 02:25:09 testbox kernel: [8100975d] ? __switch_to+0x13d/0x320
Dec 4 02:25:09 testbox kernel: [81081b5b] ? try_to_del_timer_sync+0x7b/0xe0
Dec 4 02:25:09 testbox kernel: [a0054148] kjournald2+0xb8/0x220 [jbd2]
Dec 4 02:25:09 testbox kernel: [81096da0] ? autoremove_wake_function+0x0/0x40
Dec 4 02:25:09 testbox kernel: [a0054090] ? kjournald2+0x0/0x220 [jbd2]
Dec 4 02:25:09 testbox kernel: [81096a36] kthread+0x96/0xa0
Dec 4 02:25:09 testbox kernel: [8100c0ca] child_rip+0xa/0x20
Dec 4 02:25:09 testbox kernel: [810969a0] ? kthread+0x0/0xa0
Dec 4 02:25:09 testbox kernel: [8100c0c0] ? child_rip+0x0/0x20
Dec 4 02:25:09 testbox kernel: INFO: task master:1058 blocked for more than 120 seconds.
Dec 4 02:25:09 testbox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 4 02:25:09 testbox kernel: master D 0 1058 1 0x0080
Dec 4 02:25:09 testbox kernel: 88021535d948 0082 88021535d8d8 81065c75
Dec 4 02:25:09 testbox kernel: 880028216700 88021396b578 880214336ad8 880028216700
Dec 4 02:25:09 testbox kernel: 88021396baf8 88021535dfd8 fb88 88021396baf8
Dec 4 02:25:09 testbox kernel: Call Trace:
Dec 4 02:25:09 testbox kernel: [81065c75] ? enqueue_entity+0x125/0x410
Dec 4 02:25:09 testbox kernel: [810a2431] ? ktime_get_ts+0xb1/0xf0
Dec 4 02:25:09 testbox kernel: [811b62b0] ? sync_buffer+0x0/0x50
Dec 4 02:25:09 testbox kernel: [8150e953] io_schedule+0x73/0xc0
Dec 4 02:25:09 testbox kernel: [811b62f0] sync_buffer+0x40/0x50
Dec 4 02:25:09 testbox kernel: [8150f1ba] __wait_on_bit_lock+0x5a/0xc0
Dec 4 02:25:09 testbox kernel: [811b62b0] ? sync_buffer+0x0/0x50
Dec 4 02:25:09 testbox kernel: [8150f298]
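
Since the crashes recur over days, it may help to tally which tasks the kernel reports as hung across the whole syslog rather than reading one trace at a time. The following is only a minimal sketch: the function name `summarize_hung_tasks` is made up for illustration, and the regular expression assumes the "INFO: task NAME:PID blocked for more than N seconds" wording seen in the log above.

```python
import re
from collections import Counter

# Matches the kernel's hung-task notice as it appears in the log above, e.g.
# "INFO: task jbd2/cciss!c0d0:273 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def summarize_hung_tasks(log_text):
    """Count hung-task notices per (task name, pid) found in log_text."""
    counts = Counter()
    for match in HUNG_TASK.finditer(log_text):
        counts[(match.group("name"), int(match.group("pid")))] += 1
    return counts


# Two notices copied from the log in this thread.
sample = (
    "Dec 4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 "
    "blocked for more than 120 seconds.\n"
    "Dec 4 02:25:09 testbox kernel: INFO: task master:1058 "
    "blocked for more than 120 seconds.\n"
)

for (name, pid), n in sorted(summarize_hung_tasks(sample).items()):
    print(f"{name} (pid {pid}): {n} notice(s)")
```

If the journal thread for the RAID volume (jbd2/cciss!c0d0 here) shows up in every incident, that would be consistent with I/O to the controller stalling rather than with one misbehaving application.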
Re: Unexplained Kernel Panic / Hung Task
Yup, that's a hardware problem. It may be bad firmware on the controller; I would check the firmware version first and see if there is a patch. I've seen this kind of thing with Dell OEMed RAID controllers enough over the years that it's almost always the first thing I try.

-- Sent from my HP Pre3

On Dec 4, 2013 8:21, ~Stack~ i.am.st...@gmail.com wrote:
[...] I have run the smartmon tool against both drives (2 drives setup in a hardware RAID mirror) and both drives checkout fine. [...] I am wondering if my RAID card is going bad. [...]
Re: Unexplained Kernel Panic / Hung Task
On 04/12/13 14:21, ~Stack~ wrote:
[...]
Dec 4 02:25:09 testbox kernel: INFO: task jbd2/cciss!c0d0:273 blocked for more than 120 seconds.
Dec 4 02:25:09 testbox kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 4 02:25:09 testbox kernel: jbd2/cciss!c0 D 0 273 2 0x
Dec 4 02:25:09 testbox kernel: 8802142cfb30 0046 8802138b5800 1000
Dec 4 02:25:09 testbox kernel: 8802142cfaa0 81012c59 8802142cfae0 810a2431
Dec 4 02:25:09 testbox kernel: 880214157058 8802142cffd8 fb88 880214157058
Dec 4 02:25:09 testbox kernel: Call Trace:
Dec 4 02:25:09 testbox kernel: [81012c59] ? read_tsc+0x9/0x20

This looks like some locking issue to me, triggered by something around the TSC timer. This is either a buggy driver (most likely the cciss driver) or related firmware (read the complete boot log carefully and look for firmware warnings). Or it's a really unstable TSC clock source. Try switching from TSC to HPET (or, in the very worst case, acpi_pm). See this KB for some related info: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html

But my hunch tells me it's a driver-related issue, with some bad locking. There seem to be several filesystem operations happening on two or more CPU cores in a certain order, which seems to trigger a deadlock.

--
kind regards,
David Sommerseth
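
Before switching clock sources as suggested above, it is worth checking what the kernel is currently using and what else it offers. The sketch below reads the standard sysfs files `current_clocksource` and `available_clocksource`; the helper names `read_clocksources` and `pick_fallback` are hypothetical, and the HPET-then-acpi_pm preference simply mirrors the order given in the message above.

```python
from pathlib import Path

# Standard sysfs location of the kernel's clocksource controls.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources as reported under base."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def pick_fallback(current, available, preferred=("hpet", "acpi_pm")):
    """Suggest a replacement for TSC, or None if TSC is not in use
    or none of the preferred sources is available."""
    if current != "tsc":
        return None
    for name in preferred:
        if name in available:
            return name
    return None


if __name__ == "__main__":
    try:
        current, available = read_clocksources()
    except OSError as exc:
        # Not running on Linux, or sysfs is not mounted.
        print(f"could not read clocksource info: {exc}")
    else:
        print(f"current: {current}; available: {', '.join(available)}")
        print(f"suggested fallback: {pick_fallback(current, available)}")
```

The switch itself can be made at runtime by writing the chosen name into `current_clocksource` as root, or persistently by booting with the `clocksource=hpet` kernel parameter. If the hangs continue on HPET, the TSC theory becomes less likely and the controller firmware or driver remains the main suspect.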
Re: Unexplained Kernel Panic / Hung Task
Well, I tend to discount the driver idea because of another problem he has involving multiple machines that I think are identical. Also, any problems I've ever had with the cciss driver were usually firmware related, and an update or rollback usually corrects them. Besides, based on what I've heard this is low-budget equipment, and ProLiants aren't cheap. If I had to guess, we are talking about Dells.

-- Sent from my HP Pre3

On Dec 4, 2013 18:36, David Sommerseth sl+us...@lists.topphemmelig.net wrote:
[...] But my hunch tells me it's a driver related issue, with some bad locking. [...]
Re: Unexplained Kernel Panic / Hung Task
If not, down-rev it to the same version as the one that works. It isn't hard to do with their utilities, because those of us who work in mission-critical environments have hammered it into their heads that it's an absolute requirement.

-- Sent from my HP Pre3

On Dec 4, 2013 19:12, ~Stack~ i.am.st...@gmail.com wrote:

On 12/04/2013 05:51 PM, Paul Robert Marino wrote:
Well I tend to discount the driver idea because of an other problem he has involving multiple what I think are identical machines. Also any problems I've ever had with the ccsis driver were usually firmware related an a update or roll back usually corrects them. Besides the based on what I've heard this is low budget equipment and ProLiants aren't cheap. If I had to guess we are talking about Dells.

You are right, in that I am experiencing two different issues and the vast majority of my test lab is older cast-away parts. The difference is that the two issues are on very different systems. The DHCP problem is on a bunch of similar generic Dells. This particular problem is on an HP ProLiant DL360 G4, whose twin (same hardware specs and, thanks to Puppet, dang-near identical in terms of software) so far has not displayed this problem. Because the twin isn't having this problem and the problem only started ~3 weeks ago, I thought for the last few weeks that it was a disk drive problem.

I am looking up the firmware versions for this box now. I am not hopeful that I will find newer firmware for a system this old, though. Still, totally worth the try! :-)

Thanks!
~Stack~
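
Comparing the controller firmware on the failing box against its healthy twin can be scripted. The sketch below is an assumption-laden illustration: it presumes the cciss driver's text summary (normally found at /proc/driver/cciss/cciss0) contains a "Firmware Version:" line, the helper names are hypothetical, and the two sample summaries with their version numbers are invented purely to exercise the code, not taken from either machine.

```python
import re

# Assumed line format in the cciss driver's /proc summary.
FIRMWARE_LINE = re.compile(r"^Firmware Version:\s*(\S+)", re.MULTILINE)


def firmware_version(proc_text):
    """Extract the firmware version string, or None if the line is absent."""
    match = FIRMWARE_LINE.search(proc_text)
    return match.group(1) if match else None


def compare_firmware(text_a, text_b):
    """Return 'same', 'different', or 'unknown' for two controller summaries."""
    a, b = firmware_version(text_a), firmware_version(text_b)
    if a is None or b is None:
        return "unknown"
    return "same" if a == b else "different"


# Hypothetical summaries; the version numbers are made up for illustration.
failing_box = "cciss0: example controller\nFirmware Version: 1.00\nIRQ: 16\n"
healthy_twin = "cciss0: example controller\nFirmware Version: 1.10\nIRQ: 16\n"

print(compare_firmware(failing_box, healthy_twin))
```

In practice the two inputs would be the contents of the summary file read on each machine. A "different" result would support matching the failing box to the twin's firmware level, as suggested above; a "same" result would point back toward the hardware itself.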